Paper Title

Efficient Modeling of Future Context for Image Captioning

Authors

Zhengcong Fei, Junshi Huang, Xiaoming Wei, Xiaolin Wei

Abstract

Existing approaches to image captioning usually generate sentences word by word from left to right, conditioned only on local context: the given image and the previously generated words. Many studies aim to exploit global information during decoding, e.g., through iterative refinement, but how to incorporate future context effectively and efficiently remains under-explored. To address this issue, inspired by the observation that Non-Autoregressive Image Captioning (NAIC) can leverage two-sided relations via a modified mask operation, we aim to graft this advance onto the conventional Autoregressive Image Captioning (AIC) model while preserving inference efficiency, with no extra time cost. Specifically, the AIC and NAIC models are first trained jointly with a shared visual encoder, forcing the encoder to contain sufficient and valid future context; the AIC model is then encouraged to capture the causal dynamics of cross-layer word interchange from the NAIC model on its unconfident words, following a teacher-student paradigm optimized with a distribution-calibration training objective. Empirical evidence demonstrates that our approach clearly surpasses state-of-the-art baselines in both automatic metrics and human evaluation on the MS COCO benchmark. The source code is available at: https://github.com/feizc/Future-Caption.
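
To make the two-stage recipe in the abstract concrete, here is a minimal PyTorch sketch. It is not the paper's implementation (see the linked repository for that): the module names (`Captioner`, `visual_encoder`), the sizes and hyperparameters (`VOCAB`, `HIDDEN`, the confidence threshold `tau`), and especially the reduction of the NAIC-to-AIC cross-layer interchange to a single output-distribution KL term on unconfident words are all simplifying assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, HIDDEN, PAD = 10000, 512, 0  # assumed vocabulary size, width, pad id

class Captioner(nn.Module):
    """One Transformer decoder; causal=True gives the AIC model, False the NAIC one."""
    def __init__(self, causal: bool):
        super().__init__()
        self.causal = causal
        self.embed = nn.Embedding(VOCAB, HIDDEN, padding_idx=PAD)
        layer = nn.TransformerDecoderLayer(HIDDEN, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.head = nn.Linear(HIDDEN, VOCAB)

    def forward(self, tokens, visual):  # tokens: (B, T), visual: (B, N, HIDDEN)
        mask = None
        if self.causal:  # left-to-right mask for AIC; NAIC attends to both sides
            T = tokens.size(1)
            mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.decoder(self.embed(tokens), visual, tgt_mask=mask)
        return self.head(h)  # (B, T, VOCAB) logits

visual_encoder = nn.Linear(2048, HIDDEN)  # stand-in for the shared visual encoder
aic, naic = Captioner(causal=True), Captioner(causal=False)

def stage1_loss(feats, inp, tgt, masked_inp):
    """Stage 1: train both captioners jointly on top of the shared visual
    encoder, so its features must carry future context for the NAIC branch."""
    v = visual_encoder(feats)
    l_aic = F.cross_entropy(aic(inp, v).transpose(1, 2), tgt, ignore_index=PAD)
    l_naic = F.cross_entropy(naic(masked_inp, v).transpose(1, 2), tgt, ignore_index=PAD)
    return l_aic + l_naic

def stage2_loss(feats, inp, tgt, masked_inp, tau=0.5):
    """Stage 2 (simplified): where the AIC student is unconfident, pull its
    predictive distribution toward the frozen NAIC teacher's via a KL term."""
    v = visual_encoder(feats)
    s_logits = aic(inp, v)
    with torch.no_grad():
        t_prob = naic(masked_inp, v).softmax(-1)
    unconfident = s_logits.softmax(-1).max(-1).values < tau  # (B, T) bool
    unconfident &= tgt != PAD
    kl = F.kl_div(s_logits.log_softmax(-1), t_prob, reduction="none").sum(-1)
    calib = (kl * unconfident).sum() / unconfident.sum().clamp(min=1)
    ce = F.cross_entropy(s_logits.transpose(1, 2), tgt, ignore_index=PAD)
    return ce + calib

# Toy usage with random data; the BOS id 1 and 30% masking rate are arbitrary.
B, T, N = 2, 12, 36
feats = torch.randn(B, N, 2048)
tgt = torch.randint(2, VOCAB, (B, T))
inp = torch.cat([torch.full((B, 1), 1), tgt[:, :-1]], dim=1)  # shifted right
masked_inp = tgt.masked_fill(torch.rand(B, T) < 0.3, PAD)     # crude NAIC masking
stage1_loss(feats, inp, tgt, masked_inp).backward()
```

In this simplification, the KL term applies only at positions where the student's top probability falls below `tau`, mirroring the abstract's focus on unconfident words; the paper's actual distribution-calibration objective and cross-layer interchange mechanism are richer than this single-term sketch.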
