Paper Title
Generating Image Descriptions via Sequential Cross-Modal Alignment Guided by Human Gaze
Paper Authors
Paper Abstract
When speakers describe an image, they tend to look at objects before mentioning them. In this paper, we investigate such sequential cross-modal alignment by modelling the image description generation process computationally. We take as our starting point a state-of-the-art image captioning system and develop several model variants that exploit information from human gaze patterns recorded during language production. In particular, we propose the first approach to image description generation where visual processing is modelled *sequentially*. Our experiments and analyses confirm that better descriptions can be obtained by exploiting gaze-driven attention and shed light on human cognitive processes by comparing different ways of aligning the gaze modality with language production. We find that processing gaze data sequentially leads to descriptions that are better aligned to those produced by speakers, more diverse, and more natural, particularly when gaze is encoded with a dedicated recurrent component.