Paper Title

Uncertainty-Aware Image Captioning

Authors

Zhengcong Fei, Mingyuan Fan, Li Zhu, Junshi Huang, Xiaoming Wei, Xiaolin Wei

Abstract

It is well believed that the higher the uncertainty in a word of a caption, the more inter-correlated context information is required to determine it. However, current image captioning methods usually generate all words of a sentence sequentially and treat them equally. In this paper, we propose an uncertainty-aware image captioning framework, which parallelly and iteratively inserts discontinuous candidate words between existing words, from easy to difficult, until convergence. We hypothesize that high-uncertainty words in a sentence need more prior information to be decided correctly and should therefore be produced at a later stage. The resulting non-autoregressive hierarchy makes the caption generation explainable and intuitive. Specifically, we utilize an image-conditioned bag-of-words model to measure word uncertainty and apply a dynamic programming algorithm to construct the training pairs. During inference, we devise an uncertainty-adaptive parallel beam search technique that yields an empirically logarithmic time complexity. Extensive experiments on the MS COCO benchmark reveal that our approach outperforms the strong baseline and related methods in both captioning quality and decoding speed.
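To make the decoding idea in the abstract concrete, below is a minimal, self-contained sketch, not the authors' implementation. A toy `propose` function stands in for the paper's image-conditioned model that scores candidate words with uncertainties; the loop then inserts low-uncertainty words between existing tokens first and relaxes the threshold so harder words are filled in later, once more context is available. All names here (`TOY_PROPOSALS`, `propose`, `insertion_decode`) and the threshold schedule are illustrative assumptions.

```python
"""Sketch of uncertainty-ordered, insertion-based caption decoding (toy example)."""

from typing import Dict, List, Optional, Tuple

# Hypothetical oracle: for a gap identified by its (left, right) neighboring
# tokens, return a candidate word and its uncertainty in [0, 1]. In the real
# framework this would come from an image-conditioned captioning model.
TOY_PROPOSALS: Dict[Tuple[str, str], Tuple[str, float]] = {
    ("<s>", "</s>"): ("dog", 0.10),   # easiest word is committed first
    ("<s>", "dog"): ("a", 0.30),
    ("dog", "</s>"): ("grass", 0.20),
    ("dog", "grass"): ("on", 0.55),
    ("on", "grass"): ("the", 0.60),
}


def propose(left: str, right: str) -> Optional[Tuple[str, float]]:
    """Return (word, uncertainty) for the gap between `left` and `right`, or None."""
    return TOY_PROPOSALS.get((left, right))


def insertion_decode(threshold_schedule: List[float]) -> List[str]:
    """Iteratively insert words whose uncertainty is below the current threshold.

    The threshold grows across iterations, so easy (low-uncertainty) words are
    placed early and hard (high-uncertainty) words are decided at a later stage.
    """
    tokens = ["<s>", "</s>"]
    for threshold in threshold_schedule:
        inserted = True
        while inserted:
            inserted = False
            new_tokens = [tokens[0]]
            # Scan every gap between adjacent tokens (conceptually in parallel).
            for left, right in zip(tokens, tokens[1:]):
                cand = propose(left, right)
                if cand is not None and cand[1] <= threshold:
                    new_tokens.append(cand[0])
                    inserted = True
                new_tokens.append(right)
            tokens = new_tokens
    return tokens


if __name__ == "__main__":
    # Relax the uncertainty threshold from easy to hard across iterations.
    print(insertion_decode([0.2, 0.4, 0.7]))
    # -> ['<s>', 'a', 'dog', 'on', 'the', 'grass', '</s>']
```

Since every gap can accept an insertion in the same round, the caption length can roughly double per iteration, which is consistent with the empirically logarithmic number of decoding steps reported in the abstract.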
