Paper Title

Vision Transformers: From Semantic Segmentation to Dense Prediction

Paper Authors

Li Zhang, Jiachen Lu, Sixiao Zheng, Xinxuan Zhao, Xiatian Zhu, Yanwei Fu, Tao Xiang, Jianfeng Feng, Philip H. S. Torr

Paper Abstract

The emergence of vision transformers (ViTs) in image classification has shifted the methodologies for visual representation learning. In particular, ViTs learn visual representations with a full receptive field per layer across all the image patches, in contrast to the gradually increasing receptive fields of CNNs across layers and other alternatives (e.g., large kernels and atrous convolution). In this work, for the first time we explore the global-context learning potential of ViTs for dense visual prediction (e.g., semantic segmentation). Our motivation is that, by learning global context at full receptive field layer by layer, ViTs may capture stronger long-range dependency information, which is critical for dense prediction tasks. We first demonstrate that, by encoding an image as a sequence of patches, a vanilla ViT without local convolution and resolution reduction can yield stronger visual representations for semantic segmentation. For example, our model, termed SEgmentation TRansformer (SETR), excels on ADE20K (50.28% mIoU, first place on the test leaderboard on the day of submission) and performs competitively on Cityscapes. However, the basic ViT architecture falls short in broader dense prediction applications, such as object detection and instance segmentation, due to its lack of a pyramidal structure, high computational demand, and insufficient local context. To tackle general dense visual prediction tasks in a cost-effective manner, we further formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global attention across windows in a pyramidal architecture. Extensive experiments show that our methods achieve appealing performance on a variety of dense prediction tasks (e.g., object detection, instance segmentation, and semantic segmentation) as well as image classification.
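To make the core SETR idea concrete, the sketch below is a minimal, hypothetical PyTorch illustration, not the authors' released implementation: an image is split into non-overlapping patches, the resulting token sequence is processed by a plain Transformer encoder at constant resolution (so every layer has a full receptive field over all patches), and the tokens are then reshaped and upsampled into a dense per-pixel prediction. All names and hyperparameters here (ToyPatchTransformerSegmenter, dim, depth, etc.) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ToyPatchTransformerSegmenter(nn.Module):
    """Hypothetical minimal sketch of the SETR idea: treat an image as a
    sequence of patches, run a plain Transformer encoder at constant
    resolution, then reshape tokens back to a 2D map for per-pixel labels.
    (Not the paper's actual architecture or training setup.)"""
    def __init__(self, image_size=512, patch_size=16, dim=256, depth=4,
                 num_heads=8, num_classes=150):
        super().__init__()
        self.patch_size = patch_size
        grid = image_size // patch_size                # tokens per side
        num_patches = grid * grid
        # Linear projection of flattened patches; a strided conv is just a
        # convenient way to implement the patch embedding.
        self.to_tokens = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.classifier = nn.Conv2d(dim, num_classes, kernel_size=1)

    def forward(self, x):                              # x: (B, 3, H, W)
        tokens = self.to_tokens(x)                     # (B, dim, H/16, W/16)
        b, d, h, w = tokens.shape
        # Full receptive field at every layer: each token attends to all others.
        tokens = tokens.flatten(2).transpose(1, 2)     # (B, N, dim)
        tokens = self.encoder(tokens + self.pos_embed)
        feat = tokens.transpose(1, 2).reshape(b, d, h, w)
        logits = self.classifier(feat)                 # coarse per-patch logits
        # Upsample back to the input resolution for dense per-pixel prediction.
        return nn.functional.interpolate(
            logits, scale_factor=self.patch_size,
            mode="bilinear", align_corners=False)

# Example: a 512x512 image yields a 150-class score map at full resolution.
# model = ToyPatchTransformerSegmenter()
# out = model(torch.randn(1, 3, 512, 512))   # -> (1, 150, 512, 512)
```

Note that, unlike a CNN backbone, no layer in this sketch reduces spatial resolution or restricts attention locally; the HLG variant described in the abstract would instead restrict attention to windows and add cross-window global attention in a pyramidal (multi-resolution) layout to cut the computational cost.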
