Paper Title


Semi-supervised Vision Transformers at Scale

Paper Authors

Zhaowei Cai, Avinash Ravichandran, Paolo Favaro, Manchen Wang, Davide Modolo, Rahul Bhotika, Zhuowen Tu, Stefano Soatto

Paper Abstract


We study semi-supervised learning (SSL) for vision transformers (ViT), an under-explored topic despite the wide adoption of the ViT architectures to different tasks. To tackle this problem, we propose a new SSL pipeline, consisting of first un/self-supervised pre-training, followed by supervised fine-tuning, and finally semi-supervised fine-tuning. At the semi-supervised fine-tuning stage, we adopt an exponential moving average (EMA)-Teacher framework instead of the popular FixMatch, since the former is more stable and delivers higher accuracy for semi-supervised vision transformers. In addition, we propose a probabilistic pseudo mixup mechanism to interpolate unlabeled samples and their pseudo labels for improved regularization, which is important for training ViTs with weak inductive bias. Our proposed method, dubbed Semi-ViT, achieves comparable or better performance than the CNN counterparts in the semi-supervised classification setting. Semi-ViT also enjoys the scalability benefits of ViTs that can be readily scaled up to large-size models with increasing accuracies. For example, Semi-ViT-Huge achieves an impressive 80% top-1 accuracy on ImageNet using only 1% labels, which is comparable with Inception-v4 using 100% ImageNet labels.
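To make the semi-supervised fine-tuning stage described above more concrete, here is a minimal sketch of an EMA-Teacher update combined with a confidence-weighted pseudo mixup step. This is an illustrative approximation, not the paper's exact formulation: the function names, the confidence threshold `conf_threshold`, the Beta parameter `beta_alpha`, and the EMA momentum are hypothetical choices, and the precise probabilistic weighting used in Semi-ViT may differ.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, momentum=0.9998):
    # Teacher weights track an exponential moving average of the student weights.
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)

def pseudo_mixup_step(student, teacher, unlabeled_weak, unlabeled_strong,
                      conf_threshold=0.7, beta_alpha=0.8):
    # 1) Teacher produces pseudo labels on weakly augmented views.
    with torch.no_grad():
        probs = F.softmax(teacher(unlabeled_weak), dim=-1)
        conf, pseudo_labels = probs.max(dim=-1)
        mask = (conf >= conf_threshold).float()  # keep only confident samples

    # 2) Mixup on unlabeled data: interpolate strongly augmented samples
    #    and their pseudo labels; carry the confidence mask along.
    lam = torch.distributions.Beta(beta_alpha, beta_alpha).sample().item()
    perm = torch.randperm(unlabeled_strong.size(0))
    mixed_x = lam * unlabeled_strong + (1.0 - lam) * unlabeled_strong[perm]
    onehot = F.one_hot(pseudo_labels, probs.size(-1)).float()
    mixed_y = lam * onehot + (1.0 - lam) * onehot[perm]
    mixed_mask = lam * mask + (1.0 - lam) * mask[perm]

    # 3) Student is trained on mixed samples against mixed (soft) pseudo labels,
    #    down-weighting low-confidence pairs.
    logits = student(mixed_x)
    loss = (F.cross_entropy(logits, mixed_y, reduction="none") * mixed_mask).mean()
    return loss
```

In this kind of setup, the teacher is updated with `ema_update` after each student optimizer step; a momentum close to 1 keeps the teacher's pseudo labels slowly varying, which is consistent with the abstract's claim that the EMA-Teacher framework is more stable than FixMatch for ViTs, while mixing unlabeled samples adds the regularization that ViTs, with their weak inductive bias, benefit from.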
