Paper Title

Open Vocabulary Multi-Label Classification with Dual-Modal Decoder on Aligned Visual-Textual Features

Authors

Shichao Xu, Yikang Li, Jenhao Hsiao, Chiuman Ho, Zhu Qi

Abstract

In computer vision, multi-label recognition is an important task with many real-world applications, but classifying previously unseen labels remains a significant challenge. In this paper, we propose a novel algorithm, Aligned Dual moDality ClaSsifier (ADDS), which includes a Dual-Modal decoder (DM-decoder) with alignment between visual and textual features, for open-vocabulary multi-label classification tasks. We then design a simple yet effective method called Pyramid-Forwarding to enhance performance on high-resolution inputs. Moreover, Selective Language Supervision is applied to further improve model performance. Extensive experiments conducted on several standard benchmarks, NUS-WIDE, ImageNet-1k, ImageNet-21k, and MS-COCO, demonstrate that our approach significantly outperforms previous methods and provides state-of-the-art performance for open-vocabulary multi-label classification, conventional multi-label classification, and an extreme case called single-to-multi label classification, where models trained on single-label datasets (ImageNet-1k, ImageNet-21k) are tested on multi-label ones (MS-COCO and NUS-WIDE).
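The abstract only names the DM-decoder at a high level. As a rough illustration of how alignment between per-label text embeddings and visual tokens might be wired up, here is a minimal PyTorch sketch; it is not the authors' implementation, and the cross-attention layout, feature dimensions, and scoring head are all assumptions for illustration only.

```python
# Hypothetical sketch (not the ADDS release): a dual-modal decoder that lets
# label-text embeddings (queries) cross-attend over visual tokens (keys/values),
# then scores each label against its fused representation.
import torch
import torch.nn as nn

class DualModalDecoderSketch(nn.Module):
    def __init__(self, dim=512, num_heads=8, num_layers=2):  # sizes are assumptions
        super().__init__()
        layer = nn.TransformerDecoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.logit_scale = nn.Parameter(torch.tensor(1.0))

    def forward(self, text_emb, visual_tokens):
        # text_emb:      (B, num_labels, dim)  e.g. text features for each label prompt
        # visual_tokens: (B, num_patches, dim) e.g. patch features from a visual encoder
        fused = self.decoder(tgt=text_emb, memory=visual_tokens)
        # One logit per (possibly unseen) label via similarity with its text embedding.
        logits = self.logit_scale * (fused * text_emb).sum(dim=-1)
        return logits  # (B, num_labels)

# Usage with random tensors standing in for encoder outputs:
model = DualModalDecoderSketch()
logits = model(torch.randn(4, 80, 512), torch.randn(4, 196, 512))
print(logits.shape)  # torch.Size([4, 80])
```

Because the label set enters only through text embeddings, new labels can be scored at test time by encoding their names, which is the property an open-vocabulary classifier needs.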
