Paper Title
Deconfounded Image Captioning: A Causal Retrospect
Paper Authors
Abstract
Dataset bias in vision-language tasks is becoming one of the main problems that hinder the progress of our community. Existing solutions lack a principled analysis of why modern image captioners easily collapse into dataset bias. In this paper, we present a novel perspective, Deconfounded Image Captioning (DIC), to answer this question; we then revisit modern neural image captioners and finally propose a DIC framework, DICv1.0, to alleviate the negative effects of dataset bias. DIC is based on causal inference, whose two principles, the backdoor and front-door adjustments, help us review previous studies and design new effective models. In particular, we show that DICv1.0 can strengthen two prevailing captioning models and achieves a single-model 131.1 CIDEr-D on the Karpathy split and 128.4 c40 CIDEr-D on the online split of the challenging MS COCO dataset. Interestingly, DICv1.0 is a natural derivation from our causal retrospect, which opens promising directions for image captioning.
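For readers unfamiliar with the two adjustments named in the abstract, their standard textbook forms are sketched below. The symbols are assumptions chosen for illustration and are not taken from the paper: $X$ is the input (for example, image features), $Y$ the output (for example, the caption), $Z$ an observed confounder, and $M$ a mediator lying on the path from $X$ to $Y$. How DICv1.0 instantiates these variables is not stated in the abstract.

```latex
% Backdoor adjustment: Z is an observed set of variables that
% blocks every backdoor path from X to Y.
P\bigl(Y \mid \mathrm{do}(X = x)\bigr)
  = \sum_{z} P(Y \mid X = x, Z = z)\, P(Z = z)

% Front-door adjustment: the confounder is unobserved, but a
% mediator M intercepts all directed paths from X to Y.
P\bigl(Y \mid \mathrm{do}(X = x)\bigr)
  = \sum_{m} P(M = m \mid X = x)
    \sum_{x'} P(Y \mid M = m, X = x')\, P(X = x')
```

In both cases the interventional distribution on the left is expressed purely in terms of observational quantities on the right, which is what allows the bias from a confounder to be reduced without running an actual intervention.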