Paper Title
MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features
Paper Authors
Paper Abstract
MobileViT (MobileViTv1) combines convolutional neural networks (CNNs) and vision transformers (ViTs) to create light-weight models for mobile vision tasks. Though the main MobileViTv1-block helps achieve competitive state-of-the-art results, the fusion block inside the MobileViTv1-block creates scaling challenges and poses a complex learning task. We propose simple and effective changes to the fusion block to create the MobileViTv3-block, which addresses the scaling challenges and simplifies the learning task. The MobileViTv3-XXS, XS, and S models built with our proposed MobileViTv3-block outperform MobileViTv1 on the ImageNet-1K, ADE20K, COCO, and PascalVOC2012 datasets. On ImageNet-1K, MobileViTv3-XXS and MobileViTv3-XS surpass MobileViTv1-XXS and MobileViTv1-XS by 2% and 1.9%, respectively. The recently published MobileViTv2 architecture removes the fusion block and uses linear-complexity transformers to perform better than MobileViTv1. We add our proposed fusion block to MobileViTv2 to create the MobileViTv3-0.5, 0.75, and 1.0 models. These new models achieve better accuracy than MobileViTv2 on the ImageNet-1K, ADE20K, COCO, and PascalVOC2012 datasets. MobileViTv3-0.5 and MobileViTv3-0.75 outperform MobileViTv2-0.5 and MobileViTv2-0.75 by 2.1% and 1.0%, respectively, on the ImageNet-1K dataset. For the segmentation task, MobileViTv3-1.0 achieves 2.07% and 1.1% better mIoU than MobileViTv2-1.0 on the ADE20K and PascalVOC2012 datasets, respectively. Our code and trained models are available at: https://github.com/micronDLA/MobileViTv3
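Illustrative Fusion-Block Sketch

The abstract describes the key change only at a high level: the fusion block is simplified so that local, global, and input features are combined more directly and cheaply. The PyTorch sketch below shows one plausible shape of such a fusion; the 1x1 projection, the residual input connection, and all names (SimpleFusionBlock, local_feat, global_feat) are illustrative assumptions inferred from the abstract, not the authors' verbatim implementation (see the linked repository for that).

import torch
import torch.nn as nn

class SimpleFusionBlock(nn.Module):
    """Hypothetical simplified fusion of local, global, and input features."""

    def __init__(self, channels: int):
        super().__init__()
        # A 1x1 convolution keeps the fusion cheap and independent of
        # spatial resolution, which is one way to ease model scaling.
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
        )

    def forward(self, x_input, local_feat, global_feat):
        # Concatenate local and global features along the channel axis,
        # project back to `channels`, then add the block input as a
        # residual so the input features also participate in the fusion.
        fused = self.fuse(torch.cat([local_feat, global_feat], dim=1))
        return fused + x_input

# Usage: fuse 64-channel feature maps at 32x32 resolution.
block = SimpleFusionBlock(channels=64)
x = torch.randn(1, 64, 32, 32)
out = block(x, torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
print(out.shape)  # torch.Size([1, 64, 32, 32])

Using a pointwise (1x1) convolution rather than a larger spatial kernel, and letting the input re-enter through a residual sum, reflects the abstract's stated goals of simpler fusion and an easier learning task, but the exact layer ordering in the released models may differ.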