Paper Title
Image Search with Text Feedback by Additive Attention Compositional Learning
Paper Authors
Paper Abstract
Effective image retrieval with text feedback stands to impact a range of real-world applications, such as e-commerce. Given a source image and text feedback that describes the desired modifications to that image, the goal is to retrieve the target images that resemble the source yet satisfy the given modifications by composing a multi-modal (image-text) query. We propose a novel solution to this problem, Additive Attention Compositional Learning (AACL), that uses a multi-modal transformer-based architecture and effectively models the image-text contexts. Specifically, we propose a novel image-text composition module based on additive attention that can be seamlessly plugged into deep neural networks. We also introduce a new challenging benchmark derived from the Shopping100k dataset. AACL is evaluated on three large-scale datasets (FashionIQ, Fashion200k, and Shopping100k), each with strong baselines. Extensive experiments show that AACL achieves new state-of-the-art results on all three datasets.
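The core idea of the composition module — scoring image and text tokens jointly with additive (Bahdanau-style) attention and pooling them into a single multi-modal query embedding — can be illustrated with a minimal NumPy sketch. Note this is an illustrative assumption, not the paper's actual AACL architecture: the weight matrices here are randomly initialised stand-ins for parameters that would be learned end-to-end, and the token shapes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def additive_attention_compose(image_tokens, text_tokens, d_hidden=32):
    """Fuse image and text token features into one query embedding
    via additive attention: score = v^T tanh(W x).

    Hypothetical sketch only; real weights would be trained.
    """
    # Stack both modalities into a single token sequence (n, d).
    tokens = np.concatenate([image_tokens, text_tokens], axis=0)
    d = tokens.shape[1]
    W = rng.standard_normal((d, d_hidden)) * 0.1  # projection (stand-in)
    v = rng.standard_normal(d_hidden) * 0.1       # scoring vector (stand-in)

    scores = np.tanh(tokens @ W) @ v              # (n,) additive-attention scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax over all tokens

    return weights @ tokens                       # (d,) composed query embedding

image_tokens = rng.standard_normal((49, 64))  # e.g. a flattened 7x7 feature map
text_tokens = rng.standard_normal((8, 64))    # e.g. 8 word embeddings
query = additive_attention_compose(image_tokens, text_tokens)
print(query.shape)  # (64,)
```

The composed `query` would then be compared against target-image embeddings (e.g. by cosine similarity) to rank retrieval candidates.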