Paper Title
ViTOL: Vision Transformer for Weakly Supervised Object Localization
Paper Authors
Paper Abstract
Weakly supervised object localization (WSOL) aims at predicting object locations in an image using only image-level category labels. Common challenges that image classification models encounter when localizing objects are: (a) they tend to look at the most discriminative features in an image, which confines the localization map to a very small region; (b) the localization maps are class-agnostic, and the models highlight objects of multiple classes in the same image; and (c) the localization performance is affected by background noise. To alleviate the above challenges, we introduce the following simple changes through our proposed method, ViTOL. We leverage the self-attention of a vision transformer and introduce a patch-based attention dropout layer (p-ADL) to increase the coverage of the localization map, together with a gradient attention rollout mechanism to generate class-dependent attention maps. We conduct extensive quantitative, qualitative, and ablation experiments on the ImageNet-1K and CUB datasets. We achieve state-of-the-art MaxBoxAcc-V2 localization scores of 70.47% and 73.17% on the two datasets, respectively. Code is available at https://github.com/Saurav-31/ViTOL
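To make the two mechanisms named in the abstract concrete, here is a minimal PyTorch sketch of gradient-weighted attention rollout. It assumes the per-layer attention matrices and their gradients with respect to the target class score have already been captured (e.g., via forward/backward hooks); the function name, tensor shapes, and the residual weighting are illustrative assumptions, not the authors' exact implementation (see the linked repository for that).

```python
import torch

def gradient_attention_rollout(attentions, gradients, residual_ratio=0.5):
    """Sketch of gradient-weighted attention rollout (illustrative).

    attentions: list of per-layer attention tensors, each of shape
        (num_heads, num_tokens, num_tokens).
    gradients:  list of matching gradients of the target class score
        w.r.t. each attention tensor.
    Returns a (num_patches,) relevance map for the [CLS] token.
    """
    num_tokens = attentions[0].shape[-1]
    rollout = torch.eye(num_tokens)
    for attn, grad in zip(attentions, gradients):
        # Weight each attention head by its gradient and keep only
        # positive contributions, making the map class-dependent.
        weighted = (grad * attn).clamp(min=0).mean(dim=0)
        # Account for the residual connection, then renormalize rows.
        weighted = residual_ratio * weighted \
            + (1 - residual_ratio) * torch.eye(num_tokens)
        weighted = weighted / weighted.sum(dim=-1, keepdim=True)
        # Chain per-layer maps to propagate attention across layers.
        rollout = weighted @ rollout
    # The [CLS] row (excluding its self-entry) scores each image patch.
    return rollout[0, 1:]
```

Similarly, a patch-based attention dropout layer might look as follows. The per-patch scoring, threshold factor, and random choice between a drop mask and an importance map are assumptions modeled on the general attention dropout layer idea, not the paper's precise p-ADL design.

```python
import torch

def patch_adl(patch_tokens, drop_rate=0.75, drop_prob=0.5):
    """Sketch of a patch-based attention dropout layer (illustrative).

    patch_tokens: (batch, num_patches, dim) patch embeddings.
    """
    # Per-patch attention score: mean activation over the embedding dim.
    attn = patch_tokens.mean(dim=-1, keepdim=True)            # (B, N, 1)
    # Drop mask zeroes out the most activated (most discriminative)
    # patches, forcing attention onto the rest of the object.
    threshold = drop_rate * attn.amax(dim=1, keepdim=True)
    drop_mask = (attn < threshold).float()
    # Importance map softly re-weights patches instead.
    importance = torch.sigmoid(attn)
    # Randomly pick one of the two maps per forward pass during training.
    mask = drop_mask if torch.rand(1).item() < drop_prob else importance
    return patch_tokens * mask
```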