Paper Title
Pyramid Fusion Transformer for Semantic Segmentation
Paper Authors
Paper Abstract
The recently proposed MaskFormer gives a refreshed perspective on the task of semantic segmentation: it shifts from the popular pixel-level classification paradigm to a mask-level classification method. In essence, it generates paired probabilities and masks corresponding to category segments and combines them during inference to produce the segmentation maps. In our study, we find that a per-mask classification decoder on top of a single-scale feature is not effective enough to extract reliable probabilities or masks. To mine rich semantic information across the feature pyramid, we propose a transformer-based Pyramid Fusion Transformer (PFT) for per-mask semantic segmentation with multi-scale features. The proposed transformer decoder performs cross-attention between the learnable queries and each spatial feature from the feature pyramid in parallel, and uses cross-scale inter-query attention to exchange complementary information. We achieve competitive performance on three widely used semantic segmentation datasets. In particular, on the ADE20K validation set, our result with a Swin-B backbone surpasses that of MaskFormer with a much larger Swin-L backbone in both single-scale and multi-scale inference, achieving 54.1 mIoU and 55.7 mIoU respectively. Using a Swin-L backbone, we achieve 56.1 mIoU single-scale and 57.4 mIoU multi-scale, obtaining state-of-the-art performance on the dataset. Extensive experiments on three widely used semantic segmentation datasets verify the effectiveness of our proposed method.
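The attention pattern described in the abstract can be sketched as follows. This is a minimal, illustrative toy example, not the authors' implementation: it uses single-head scaled dot-product attention with no learned projections, and the helper names (`attention`, `per_scale_queries`, `fused`) are our own. It shows the two stages the abstract names: learnable queries cross-attending to each feature-pyramid scale in parallel, then the per-scale query copies exchanging information via cross-scale inter-query attention.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(queries, keys, values):
    # Scaled dot-product attention: each query vector attends over
    # the key/value pairs and returns a weighted sum of values.
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

# Toy setup: 2 learnable queries and a 3-level feature pyramid whose
# levels have different numbers of spatial tokens but a shared channel dim.
queries = [[1.0, 0.0], [0.0, 1.0]]
pyramid = [
    [[0.5, 0.2], [0.1, 0.9]],                           # coarse level
    [[0.3, 0.3], [0.7, 0.1], [0.2, 0.8]],               # mid level
    [[0.9, 0.1], [0.4, 0.6], [0.0, 1.0], [0.5, 0.5]],   # fine level
]

# Stage 1 -- parallel cross-attention: the same queries attend to the
# spatial features of every pyramid scale independently.
per_scale_queries = [attention(queries, feats, feats) for feats in pyramid]

# Stage 2 -- cross-scale inter-query attention: for each query slot, the
# copies produced at different scales attend to one another, exchanging
# complementary information; here we average the mixed copies to fuse them.
fused = []
for slot in range(len(queries)):
    copies = [per_scale_queries[lvl][slot] for lvl in range(len(pyramid))]
    mixed = attention(copies, copies, copies)
    fused.append([sum(col) / len(mixed) for col in zip(*mixed)])

print(len(fused), len(fused[0]))  # one fused vector per query slot
```

In the actual model the fused queries would then be paired with per-query class probabilities and mask predictions; this sketch only illustrates the two attention stages the abstract describes.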