Paper title
Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers
Paper authors
Paper abstract
Most recent semantic segmentation methods adopt a fully convolutional network (FCN) with an encoder-decoder architecture. The encoder progressively reduces the spatial resolution and learns more abstract, semantic visual concepts with larger receptive fields. Since context modeling is critical for segmentation, the latest efforts have focused on increasing the receptive field, through either dilated (atrous) convolutions or inserted attention modules. However, the encoder-decoder based FCN architecture remains unchanged. In this paper, we aim to provide an alternative perspective by treating semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer (i.e., without convolution and resolution reduction) to encode an image as a sequence of patches. With the global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR). Extensive experiments show that SETR achieves a new state of the art on ADE20K (50.28% mIoU) and Pascal Context (55.83% mIoU), and competitive results on Cityscapes. In particular, we achieve the first position on the highly competitive ADE20K test server leaderboard on the day of submission.
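To make the phrase "encode an image as a sequence of patches" concrete, the following is a minimal sketch in plain Python, not the authors' implementation. The function name image_to_patch_sequence, the row-major patch order, and the toy patch size are assumptions chosen for illustration; the abstract does not specify them. The sketch also stops before any learned projection, positional information, or transformer layers.

```python
def image_to_patch_sequence(image, patch_size):
    """Split an H x W x C image (nested lists) into a row-major
    sequence of flattened, non-overlapping square patches.

    Hypothetical helper for illustration only.
    """
    height = len(image)
    width = len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")

    sequence = []
    # Walk the grid of patches from top-left to bottom-right.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    # Append this pixel's channel values to the flat patch vector.
                    patch.extend(image[row][col])
            sequence.append(patch)
    return sequence


# Toy 4 x 4 image with 3 channels; each pixel stores its own (row, col, 0).
image = [[[r, c, 0] for c in range(4)] for r in range(4)]
tokens = image_to_patch_sequence(image, patch_size=2)
print(len(tokens), len(tokens[0]))
```

With a 4 x 4 image and 2 x 2 patches, the sequence has 4 elements of 2 * 2 * 3 = 12 values each. The sequence length equals the number of patches, so it stays fixed through the encoder, which is consistent with the abstract's description of an encoder without resolution reduction.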