Paper Title
MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features
Paper Authors
Paper Abstract
MobileViT (MobileViTv1) combines convolutional neural networks (CNNs) and vision transformers (ViTs) to create light-weight models for mobile vision tasks. Though the main MobileViTv1-block helps achieve competitive state-of-the-art results, the fusion block inside the MobileViTv1-block creates scaling challenges and poses a complex learning task. We propose simple and effective changes to the fusion block to create the MobileViTv3-block, which addresses the scaling challenges and simplifies the learning task. The MobileViTv3-XXS, XS, and S models built with our proposed MobileViTv3-block outperform MobileViTv1 on the ImageNet-1K, ADE20K, COCO, and PascalVOC2012 datasets. On ImageNet-1K, MobileViTv3-XXS and MobileViTv3-XS surpass MobileViTv1-XXS and MobileViTv1-XS by 2% and 1.9%, respectively. The recently published MobileViTv2 architecture removes the fusion block and uses linear-complexity transformers to perform better than MobileViTv1. We add our proposed fusion block to MobileViTv2 to create the MobileViTv3-0.5, 0.75, and 1.0 models. These new models achieve better accuracy than MobileViTv2 on the ImageNet-1K, ADE20K, COCO, and PascalVOC2012 datasets. MobileViTv3-0.5 and MobileViTv3-0.75 outperform MobileViTv2-0.5 and MobileViTv2-0.75 by 2.1% and 1.0%, respectively, on the ImageNet-1K dataset. For the segmentation task, MobileViTv3-1.0 achieves 2.07% and 1.1% better mIoU than MobileViTv2-1.0 on the ADE20K and PascalVOC2012 datasets, respectively. Our code and trained models are available at: https://github.com/micronDLA/MobileViTv3
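Illustrative Fusion-Block Sketch

The abstract describes the key change only at a high level: the fusion block is simplified so that local, global, and input features are combined more directly and cheaply. The PyTorch sketch below shows one plausible shape of such a fusion; the 1x1 projection, the residual input connection, and all names (SimpleFusionBlock, local_feat, global_feat) are illustrative assumptions inferred from the abstract, not the authors' verbatim implementation (see the linked repository for that).

import torch
import torch.nn as nn

class SimpleFusionBlock(nn.Module):
    """Hypothetical simplified fusion of local, global, and input features."""

    def __init__(self, channels: int):
        super().__init__()
        # A 1x1 convolution keeps the fusion cheap and independent of
        # spatial resolution, which is one way to ease model scaling.
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
        )

    def forward(self, x_input, local_feat, global_feat):
        # Concatenate local and global features along the channel axis,
        # project back to `channels`, then add the block input as a
        # residual so the input features also participate in the fusion.
        fused = self.fuse(torch.cat([local_feat, global_feat], dim=1))
        return fused + x_input

# Usage: fuse 64-channel feature maps at 32x32 resolution.
block = SimpleFusionBlock(channels=64)
x = torch.randn(1, 64, 32, 32)
out = block(x, torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
print(out.shape)  # torch.Size([1, 64, 32, 32])

Using a pointwise (1x1) convolution rather than a larger spatial kernel, and letting the input re-enter through a residual sum, reflects the abstract's stated goals of simpler fusion and an easier learning task, but the exact layer ordering in the released models may differ.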