Title
Towards Unique and Informative Captioning of Images
Authors
Abstract
Despite considerable progress, state-of-the-art image captioning models produce generic captions, leaving out important image details. Furthermore, these systems may even misrepresent the image in order to produce a simpler caption consisting of common concepts. In this paper, we first analyze both modern captioning systems and evaluation metrics through empirical experiments to quantify these phenomena. We find that modern captioning systems return higher likelihoods for incorrect distractor sentences compared to ground-truth captions, and that evaluation metrics like SPICE can be 'topped' using simple captioning systems relying on object detectors. Inspired by these observations, we design a new metric (SPICE-U) by introducing a notion of uniqueness over the concepts generated in a caption. We show that SPICE-U is better correlated with human judgements compared to SPICE, and effectively captures notions of diversity and descriptiveness. Finally, we also demonstrate a general technique to improve any existing captioning model -- by using mutual information as a re-ranking objective during decoding. Empirically, this results in more unique and informative captions, and improves three different state-of-the-art models on SPICE-U as well as average score over existing metrics.
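The mutual-information re-ranking idea mentioned above can be sketched in a few lines: instead of picking the candidate caption with the highest conditional likelihood, one scores each candidate by (an approximation of) pointwise mutual information, log p(caption | image) - λ·log p(caption), which penalizes captions that are generic under a language-model prior. This is a minimal illustrative sketch, not the paper's implementation; the function name, the λ weight, and all probabilities below are hypothetical toy values.

```python
import math

def pmi_rerank(candidates, lam=1.0):
    """Re-rank candidate captions by an approximate pointwise mutual
    information score: log p(c | image) - lam * log p(c).

    `candidates` is a list of (caption, log_p_given_image, log_p_prior)
    tuples, e.g. produced by a beam search plus a language-model prior.
    Returns the list sorted by score, best first.
    """
    scored = [(cap, lp_img - lam * lp_prior)
              for cap, lp_img, lp_prior in candidates]
    scored.sort(key=lambda t: t[1], reverse=True)
    return scored

# Toy beam: the generic caption has a slightly higher conditional
# likelihood but also a much higher language-model prior, so PMI
# re-ranking prefers the more specific caption.
beam = [
    ("a man on a street", math.log(0.20), math.log(0.10)),           # generic
    ("a man repairing a red bicycle", math.log(0.15), math.log(0.01)),  # specific
]
ranked = pmi_rerank(beam)
print(ranked[0][0])  # the specific caption wins under PMI
```

Under pure likelihood decoding the generic caption would be chosen (0.20 > 0.15); subtracting the prior flips the ranking, which is the behavior the abstract attributes to mutual-information re-ranking.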