Paper Title
BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers
Paper Authors
Paper Abstract
Masked image modeling (MIM) has demonstrated impressive results in self-supervised representation learning by recovering corrupted image patches. However, most existing studies operate on low-level image pixels, which hinders the exploitation of high-level semantics for representation models. In this work, we propose to use a semantic-rich visual tokenizer as the reconstruction target for masked prediction, providing a systematic way to promote MIM from pixel-level to semantic-level. Specifically, we propose vector-quantized knowledge distillation to train the tokenizer, which discretizes a continuous semantic space to compact codes. We then pretrain vision Transformers by predicting the original visual tokens for the masked image patches. Furthermore, we introduce a patch aggregation strategy which associates discrete image patches to enhance global semantic representation. Experiments on image classification and semantic segmentation show that BEiT v2 outperforms all compared MIM methods. On ImageNet-1K (224 size), the base-size BEiT v2 achieves 85.5% top-1 accuracy for fine-tuning and 80.1% top-1 accuracy for linear probing. The large-size BEiT v2 obtains 87.3% top-1 accuracy for ImageNet-1K (224 size) fine-tuning, and 56.7% mIoU on ADE20K for semantic segmentation. The code and pretrained models are available at https://aka.ms/beitv2.
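The vector-quantized tokenizer described in the abstract can be pictured as a nearest-neighbour lookup over a learned codebook: each continuous patch feature is replaced by the index of its closest code, and those indices serve as the prediction targets for masked patches. The following is a minimal sketch of that lookup, not the authors' released implementation; the class name `VectorQuantizer`, the l2-normalized cosine matching, the straight-through gradient trick, and the codebook size are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VectorQuantizer(nn.Module):
    """Nearest-neighbour quantizer: maps each continuous patch feature
    to its closest codebook entry (cosine distance) and returns both the
    discrete token indices and the quantized features."""

    def __init__(self, num_codes=8192, code_dim=32):
        super().__init__()
        # Learned codebook; 8192 codes of dimension 32 is one plausible setting.
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, z):
        # z: (batch, num_patches, code_dim) continuous encoder outputs
        z_norm = F.normalize(z, dim=-1)                       # unit-length features
        codes_norm = F.normalize(self.codebook.weight, dim=-1)
        sim = z_norm @ codes_norm.t()                         # (B, N, num_codes) cosine similarity
        indices = sim.argmax(dim=-1)                          # discrete visual tokens
        z_q = codes_norm[indices]                             # quantized features
        # Straight-through estimator so gradients still reach the encoder.
        z_q = z_norm + (z_q - z_norm).detach()
        return indices, z_q
```

In the paper's framing, the quantized features would additionally be trained to reconstruct a teacher model's semantic features (the knowledge-distillation part), and the resulting token indices become the semantic-level targets that the vision Transformer predicts for masked patches during pretraining.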