Paper Title

Training a Vision Transformer from scratch in less than 24 hours with 1 GPU

Paper Authors

Saghar Irandoust, Thibaut Durand, Yunduz Rakhmangulova, Wenjie Zi, Hossein Hajimirsadeghi

Paper Abstract

Transformers have become central to recent advances in computer vision. However, training a vision Transformer (ViT) model from scratch can be resource-intensive and time-consuming. In this paper, we explore approaches to reduce the training costs of ViT models. We introduce several algorithmic improvements that enable training a ViT model from scratch with limited hardware (1 GPU) and time (24 hours) resources. First, we propose an efficient approach to add locality to the ViT architecture. Second, we develop a new image-size curriculum learning strategy, which reduces the number of patches extracted from each image at the beginning of training. Finally, we propose a new variant of the popular ImageNet1k benchmark that adds hardware and time constraints. We evaluate our contributions on this benchmark and show that they significantly improve performance given the proposed training budget. We will share the code at https://github.com/BorealisAI/efficient-vit-training.
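
The abstract does not spell out either mechanism, so the two sketches below are illustrations under stated assumptions rather than the authors' implementations.

For locality, one common way to inject it into a ViT is a depthwise convolution applied over the 2D patch grid inside each block (as in LocalViT-style designs). The `LocalityFFN` module below is a hypothetical PyTorch sketch of that idea; the paper's actual mechanism may differ.

```python
import torch
import torch.nn as nn

class LocalityFFN(nn.Module):
    """Feed-forward block with a 3x3 depthwise convolution between its two
    linear layers. The convolution mixes each patch token with its spatial
    neighbors, adding locality to an otherwise global Transformer block.
    Illustrative only; the class token is omitted for simplicity."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.dwconv = nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3,
                                padding=1, groups=hidden_dim)  # depthwise
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (batch, num_patches, dim), with h * w == num_patches
        x = self.act(self.fc1(x))
        b, n, c = x.shape
        x = x.transpose(1, 2).reshape(b, c, h, w)  # tokens -> 2D patch grid
        x = self.act(self.dwconv(x))               # mix neighboring patches
        x = x.reshape(b, c, n).transpose(1, 2)     # back to token sequence
        return self.fc2(x)

# Smoke test on a 14x14 grid of patch tokens with embedding dim 192.
tokens = torch.randn(2, 196, 192)
print(LocalityFFN(192, 384)(tokens, 14, 14).shape)  # torch.Size([2, 196, 192])
```

For the image-size curriculum, the key observation is that with a fixed patch size P, an H x W image yields (H / P) * (W / P) patches, so training on smaller images early on means fewer tokens and cheaper self-attention (whose cost grows quadratically with token count). The schedule below is a minimal sketch; `PATCH_SIZE`, `MIN_SIZE`, `MAX_SIZE`, and the linear ramp are assumptions for illustration, not the paper's settings.

```python
import torchvision.transforms as T

PATCH_SIZE = 16                  # assumed ViT patch size
MIN_SIZE, MAX_SIZE = 128, 224    # assumed start and end resolutions

def image_size_for_epoch(epoch: int, total_epochs: int) -> int:
    """Linearly ramp the training resolution, snapped to a multiple of
    the patch size so every image tiles into whole patches."""
    frac = min(epoch / max(total_epochs - 1, 1), 1.0)
    size = MIN_SIZE + frac * (MAX_SIZE - MIN_SIZE)
    return round(size / PATCH_SIZE) * PATCH_SIZE

def make_train_transform(epoch: int, total_epochs: int) -> T.Compose:
    size = image_size_for_epoch(epoch, total_epochs)
    return T.Compose([
        T.RandomResizedCrop(size),
        T.RandomHorizontalFlip(),
        T.ToTensor(),
    ])

# At epoch 0 of 90 the model sees 128x128 images (64 patches per image);
# by the final epoch it sees 224x224 images (196 patches per image).
for epoch in (0, 45, 89):
    s = image_size_for_epoch(epoch, 90)
    print(epoch, s, (s // PATCH_SIZE) ** 2)
```

Note that changing the resolution mid-training also requires resizing the ViT's positional embeddings (typically by interpolation) whenever the patch-grid shape changes; the sketch omits that step.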
