Paper Title

Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning

Paper Authors

Chia-Wen Kuo, Zsolt Kira

Abstract

Significant progress has been made on visual captioning, largely relying on pre-trained features and later fixed object detectors that serve as rich inputs to auto-regressive models. A key limitation of such methods, however, is that the output of the model is conditioned only on the object detector's outputs. The assumption that such outputs can represent all necessary information is unrealistic, especially when the detector is transferred across datasets. In this work, we reason about the graphical model induced by this assumption, and propose to add an auxiliary input to represent missing information such as object relationships. We specifically propose to mine attributes and relationships from the Visual Genome dataset and condition the captioning model on them. Crucially, we propose (and show to be important) the use of a multi-modal pre-trained model (CLIP) to retrieve such contextual descriptions. Further, object detector models are frozen and do not have sufficient richness to allow the captioning model to properly ground them. As a result, we propose to condition both the detector and description outputs on the image, and show qualitatively and quantitatively that this can improve grounding. We validate our method on image captioning, perform thorough analyses of each component and importance of the pre-trained multi-modal model, and demonstrate significant improvements over the current state of the art, specifically +7.5% in CIDEr and +1.3% in BLEU-4 metrics.
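As a rough illustration of the CLIP-based retrieval step described in the abstract, the sketch below ranks candidate textual descriptions (standing in for attribute/relationship phrases mined from Visual Genome) against an image by CLIP image-text similarity. This is a minimal sketch using the Hugging Face `transformers` CLIP wrapper; the checkpoint name, the `retrieve_descriptions` helper, and the toy candidate pool are illustrative assumptions, not the authors' released code.

```python
# Hedged sketch: retrieve contextual text descriptions for an image with CLIP.
# The candidate phrases below are placeholders for attribute/relationship
# descriptions mined from Visual Genome; in the paper these retrieved phrases
# are fed as auxiliary input to the captioning model.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def retrieve_descriptions(image_path, candidates, top_k=5):
    """Return the top-k candidate phrases ranked by CLIP image-text similarity."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=candidates, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image has shape (1, num_candidates): one score per phrase.
    scores = outputs.logits_per_image.squeeze(0)
    top = scores.topk(min(top_k, len(candidates)))
    return [(candidates[i], scores[i].item()) for i in top.indices]

# Toy candidate pool of mined phrases (hypothetical examples).
candidates = ["a dog sitting on a couch",
              "a red frisbee in the grass",
              "a person riding a horse"]
# print(retrieve_descriptions("example.jpg", candidates, top_k=2))
```

The retrieved phrases would then be concatenated with the detector features as additional conditioning context for the auto-regressive captioning model, which is the role the auxiliary input plays in the method described above.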
