Paper Title

Probabilistic Compositional Embeddings for Multimodal Image Retrieval

Paper Authors

Andrei Neculai, Yanbei Chen, Zeynep Akata

Paper Abstract

Existing works in image retrieval often consider retrieving images with one or two query inputs, which do not generalize to multiple queries. In this work, we investigate a more challenging scenario for composing multiple multimodal queries in image retrieval. Given an arbitrary number of query images and (or) texts, our goal is to retrieve target images containing the semantic concepts specified in multiple multimodal queries. To learn an informative embedding that can flexibly encode the semantics of various queries, we propose a novel multimodal probabilistic composer (MPC). Specifically, we model input images and texts as probabilistic embeddings, which can be further composed by a probabilistic composition rule to facilitate image retrieval with multiple multimodal queries. We propose a new benchmark based on the MS-COCO dataset and evaluate our model on various setups that compose multiple images and (or) text queries for multimodal image retrieval. Without bells and whistles, we show that our probabilistic model formulation significantly outperforms existing related methods on multimodal image retrieval while generalizing well to query with different amounts of inputs given in arbitrary visual and (or) textual modalities. Code is available here: https://github.com/andreineculai/MPC.
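To make the abstract's idea concrete, below is a minimal sketch of how probabilistic embeddings can be composed. It assumes each query (image or text) is encoded as a diagonal Gaussian (a mean and a log-variance vector) and uses a product-of-Gaussians composition rule, a common choice for combining probabilistic embeddings; the exact composer used by MPC may differ, so refer to the paper and the linked repository for the authors' formulation. The function name and tensor shapes here are illustrative, not taken from the MPC codebase.

```python
import torch

def compose_gaussians(mus, logvars):
    """Compose several diagonal-Gaussian query embeddings into one.

    Product-of-Gaussians rule: the composed precision is the sum of
    the input precisions, and the composed mean is the precision-
    weighted average of the input means. This is a standard composition
    rule for probabilistic embeddings, used here for illustration only.

    mus, logvars: tensors of shape (num_queries, dim)
    returns: composed (mu, logvar), each of shape (dim,)
    """
    precisions = torch.exp(-logvars)            # 1 / sigma^2 per query
    composed_precision = precisions.sum(dim=0)  # precisions add under the product
    composed_mu = (precisions * mus).sum(dim=0) / composed_precision
    composed_logvar = -torch.log(composed_precision)
    return composed_mu, composed_logvar

# Hypothetical usage: compose two image queries and one text query,
# each already encoded as a 128-d diagonal Gaussian by some encoder.
mus = torch.randn(3, 128)
logvars = torch.zeros(3, 128)
mu_q, logvar_q = compose_gaussians(mus, logvars)
print(mu_q.shape, logvar_q.shape)  # torch.Size([128]) torch.Size([128])
```

Note that this rule handles an arbitrary number of queries uniformly, which matches the abstract's goal of generalizing beyond one or two query inputs: composing more queries simply adds more precision terms, and more confident (lower-variance) queries contribute more to the composed mean.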
