Title
Spatial Entropy as an Inductive Bias for Vision Transformers
Authors
Abstract
Recent work on Vision Transformers (VTs) showed that introducing a local inductive bias in the VT architecture helps reduce the number of samples necessary for training. However, the architecture modifications lead to a loss of generality of the Transformer backbone, partially contradicting the push towards the development of uniform architectures, shared, e.g., by both the Computer Vision and the Natural Language Processing areas. In this work, we propose a different and complementary direction, in which a local bias is introduced using an auxiliary self-supervised task, performed jointly with standard supervised training. Specifically, we exploit the observation that the attention maps of VTs, when trained with self-supervision, can contain a semantic segmentation structure which does not spontaneously emerge when training is supervised. Thus, we explicitly encourage the emergence of this spatial clustering as a form of training regularization. In more detail, we exploit the assumption that, in a given image, objects usually correspond to a few connected regions, and we propose a spatial formulation of the information entropy to quantify this object-based inductive bias. By minimizing the proposed spatial entropy, we include an additional self-supervised signal during training. Using extensive experiments, we show that the proposed regularization leads to equivalent or better results than other VT proposals which include a local bias by changing the basic Transformer architecture, and it can drastically boost the VT final accuracy when using small-medium training sets. The code is available at https://github.com/helia95/SAR.
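To illustrate the idea of an entropy-based regularizer over attention maps, here is a minimal sketch. It computes the plain Shannon entropy of a spatial attention map normalized to a probability distribution over locations; note this is only an illustrative stand-in, since the paper's actual spatial entropy additionally accounts for the connectedness of attended regions, and the function name and shapes below are assumptions, not the authors' API.

```python
import numpy as np

def attention_entropy(attn_map, eps=1e-12):
    """Shannon entropy of a spatial attention map (illustrative sketch).

    attn_map: 2D array of non-negative attention weights, shape (H, W).
    The map is normalized into a probability distribution over spatial
    locations; lower entropy means attention is concentrated on fewer
    locations. Minimizing such a term (added to the supervised loss)
    would push attention towards compact clusters.
    """
    p = attn_map / (attn_map.sum() + eps)
    # 0 * log(eps) = 0, so zero-weight locations contribute nothing.
    return float(-(p * np.log(p + eps)).sum())

# A uniform map has maximal entropy log(H*W); a peaked map has near-zero entropy.
uniform = np.ones((4, 4))
peaked = np.zeros((4, 4))
peaked[0, 0] = 1.0
```

In a training loop, a term like `lambda_reg * attention_entropy(...)` would be added to the classification loss; the weighting and the choice of which heads/layers to regularize are hyperparameters not specified here.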