Learning the Best Pooling Strategy for Visual Semantic Embedding

Paper Title

Learning the Best Pooling Strategy for Visual Semantic Embedding

Authors

Jiacheng Chen, Hexiang Hu, Hao Wu, Yuning Jiang, Changhu Wang

Abstract

Visual Semantic Embedding (VSE) is a dominant approach for vision-language retrieval, which aims at learning a deep embedding space such that visual data are embedded close to their semantic text labels or descriptions. Recent VSE models use complex methods to better contextualize and aggregate multi-modal features into holistic embeddings. However, we discover that surprisingly simple (but carefully selected) global pooling functions (e.g., max pooling) outperform those complex models, across different feature extractors. Despite its simplicity and effectiveness, seeking the best pooling function for different data modality and feature extractor is costly and tedious, especially when the size of features varies (e.g., text, video). Therefore, we propose a Generalized Pooling Operator (GPO), which learns to automatically adapt itself to the best pooling strategy for different features, requiring no manual tuning while staying effective and efficient. We extend the VSE model using this proposed GPO and denote it as VSE$\infty$. Without bells and whistles, VSE$\infty$ outperforms previous VSE methods significantly on image-text retrieval benchmarks across popular feature extractors. With a simple adaptation, variants of VSE$\infty$ further demonstrate its strength by achieving the new state of the art on two video-text retrieval datasets. Comprehensive experiments and visualizations confirm that GPO always discovers the best pooling strategy and can be a plug-and-play feature aggregation module for standard VSE models. Code and pre-trained models are available at https://vse-infty.github.io.
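The abstract describes a pooling operator that can adapt to the best pooling strategy, with max pooling named as one simple choice. The sketch below is one possible way to picture that idea, not the authors' implementation: it assumes pooling can be written as a weighted sum over each feature dimension's values sorted in descending order, so that different weight vectors recover max pooling or mean pooling. The function names (`sorted_weighted_pool`, `max_weights`, `mean_weights`) and the sorted-weights parameterization are illustrative assumptions; in a learned operator the weights would be produced by a trained module rather than fixed by hand.

```python
from typing import List, Sequence


def sorted_weighted_pool(
    features: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Pool N feature vectors (each of dimension D) into one D-dim vector.

    Hypothetical parameterization: for every dimension, sort the N values
    in descending order and take a weighted sum with `weights`.
    """
    n = len(features)
    if n == 0:
        raise ValueError("need at least one feature vector")
    if len(weights) != n:
        raise ValueError("need exactly one weight per sorted position")
    dim = len(features[0])
    pooled = []
    for d in range(dim):
        # Values of dimension d across all N vectors, largest first.
        column = sorted((vec[d] for vec in features), reverse=True)
        pooled.append(sum(w * v for w, v in zip(weights, column)))
    return pooled


def max_weights(n: int) -> List[float]:
    """All weight on the largest value: equivalent to max pooling."""
    return [1.0] + [0.0] * (n - 1)


def mean_weights(n: int) -> List[float]:
    """Uniform weights: equivalent to mean pooling."""
    return [1.0 / n] * n


# Three local features (e.g. image regions or words), two dimensions each.
features = [[0.2, 0.9], [0.8, 0.1], [0.5, 0.4]]
max_pooled = sorted_weighted_pool(features, max_weights(len(features)))
mean_pooled = sorted_weighted_pool(features, mean_weights(len(features)))
print(max_pooled)   # [0.8, 0.9]
print(mean_pooled)
```

Because the weights are indexed by sorted position, a set of a different size needs a weight vector of a different length. That is one reason a fixed hand-picked pooling rule is awkward for variable-length inputs such as text or video.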
