Paper Title
Sequence and Circle: Exploring the Relationship Between Patches
Authors
Abstract
The vision transformer (ViT) has achieved state-of-the-art results in various vision tasks. It uses a learnable position embedding (PE) to encode the location of each image patch. However, it remains unclear whether this learnable PE is really necessary and what benefit it provides. This paper explores two alternative ways of encoding patch locations that exploit prior knowledge of the patches' spatial arrangement: the sequence relationship embedding (SRE) and the circle relationship embedding (CRE). SRE treats all patches as an ordered sequence in which adjacent patches are separated by the same interval. CRE treats the central patch as the center of a circle and measures the distance of each remaining patch from that center according to the four-neighborhood principle, so that patches are grouped onto multiple concentric circles of different radii. We implement both relationship embeddings in three classic ViTs and evaluate them on four popular datasets. Experiments show that SRE and CRE can replace PE, reducing the number of random learnable parameters while achieving the same performance, and that combining SRE or CRE with PE performs better than using PE alone.
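As a rough illustration of the two priors described in the abstract, the sketch below computes one scalar per patch on a patch grid. It is a minimal reading of the abstract, not the authors' implementation. The function names `sequence_relation` and `circle_relation`, the raster ordering, the normalization to [0, 1], and the choice of `grid // 2` as the central patch are all assumptions. How these scalars are mapped into the embedding dimension is not stated in the abstract and is left out.

```python
import numpy as np


def sequence_relation(grid_h, grid_w):
    """Assumed SRE prior: patches in raster order with equal spacing.

    Returns a (grid_h, grid_w) array whose flattened values are evenly
    spaced on [0, 1] (the normalization is an assumption).
    """
    n = grid_h * grid_w
    return np.linspace(0.0, 1.0, n).reshape(grid_h, grid_w)


def circle_relation(grid_h, grid_w):
    """Assumed CRE prior: four-neighborhood distance from the central patch.

    The four-neighborhood distance is taken to be the Manhattan distance,
    so patches sharing a value lie on the same "circle". The central
    patch is assumed to sit at (grid_h // 2, grid_w // 2).
    """
    rows, cols = np.indices((grid_h, grid_w))
    return np.abs(rows - grid_h // 2) + np.abs(cols - grid_w // 2)


if __name__ == "__main__":
    # 3x3 grid: the center has radius 0, edge neighbors 1, corners 2.
    print(circle_relation(3, 3))
    # [[2 1 2]
    #  [1 0 1]
    #  [2 1 2]]
```

Under these assumptions neither function introduces learnable parameters, which is consistent with the abstract's claim that SRE and CRE reduce the random learnable parameters relative to a learned PE.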