Paper Title


A Unified View of Masked Image Modeling

Paper Authors

Zhiliang Peng, Li Dong, Hangbo Bao, Qixiang Ye, Furu Wei

Paper Abstract


Masked image modeling has demonstrated great potential to eliminate the label-hungry problem of training large-scale vision Transformers, achieving impressive performance on various downstream tasks. In this work, we propose a unified view of masked image modeling after revisiting existing methods. Under this unified view, we introduce a simple yet effective method, termed MaskDistill, which reconstructs normalized semantic features from teacher models at the masked positions, conditioning on corrupted input images. Experimental results on image classification and semantic segmentation show that MaskDistill achieves performance comparable or superior to state-of-the-art methods. Using the huge vision Transformer and pretraining for 300 epochs, MaskDistill obtains 88.3% fine-tuning top-1 accuracy on ImageNet-1k (224 size) and 58.8% semantic segmentation mIoU on ADE20k (512 size). The code and pretrained models will be available at https://aka.ms/unimim.
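The abstract's core objective — regressing the teacher's normalized semantic features, with the loss computed only at masked patch positions — can be sketched roughly as below. This is a minimal illustration only: the function names, the per-patch mean-squared-error loss, and the layer-norm target are assumptions for the sketch, not the paper's exact formulation.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize features along the channel (last) dimension,
    # as in the "normalized semantic features" target.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def mask_distill_loss(student_pred, teacher_feat, mask):
    """Distillation loss computed only at masked positions.

    student_pred : (B, N, C) student predictions for all N patches
    teacher_feat : (B, N, C) teacher features for the uncorrupted image
    mask         : (B, N)   1.0 where a patch was masked, else 0.0
    """
    target = layer_norm(teacher_feat)            # normalized teacher target
    per_patch = ((student_pred - target) ** 2).mean(axis=-1)  # (B, N)
    # Average the per-patch error over masked positions only.
    return float((per_patch * mask).sum() / mask.sum())

# Toy usage: 2 images, 4 patches, 8-dim features, half the patches masked.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(2, 4, 8))
student = rng.normal(size=(2, 4, 8))
mask = np.array([[1., 1., 0., 0.], [0., 1., 1., 0.]])
loss = mask_distill_loss(student, teacher, mask)
```

A perfect student that exactly reproduces the normalized teacher features would drive this loss to zero at the masked positions; visible positions contribute nothing, mirroring the masked-prediction setup described in the abstract.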
