Paper Title

Understanding Attention for Vision-and-Language Tasks

Paper Authors

Feiqi Cao, Soyeon Caren Han, Siqu Long, Changwei Xu, Josiah Poon

Paper Abstract

The attention mechanism has been used as an important component across Vision-and-Language (VL) tasks to bridge the semantic gap between visual and textual features. While attention has been widely used in VL tasks, the capability of different attention alignment calculations in bridging the semantic gap between visual and textual clues has not been examined. In this research, we conduct a comprehensive analysis of the role of attention alignment by looking into attention score calculation methods and checking how they actually represent the significance of visual regions and textual tokens for the global assessment. We also analyse the conditions under which an attention score calculation mechanism is more (or less) interpretable, and which may impact model performance on three different VL tasks: visual question answering, text-to-image generation, and text-and-image matching (both sentence and image retrieval). Our analysis is the first of its kind and provides useful insights into the importance of each attention alignment score calculation when applied at the training phase of VL tasks, which is commonly ignored in attention-based cross-modal models and/or pretrained models. Our code is available at: https://github.com/adlnlp/Attention_VL
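To make "attention alignment score calculation" concrete, below is a minimal NumPy sketch of alignment functions commonly compared in the attention literature (dot product, scaled dot product, general/bilinear, additive/concat). The shapes, weight names, and the particular set of variants are illustrative assumptions; this is not code from the Attention_VL repository, which may study a different or larger set of scoring functions.

```python
# Illustrative sketch of common attention alignment score functions
# between textual tokens and visual regions (assumed shapes, not the
# paper's implementation).
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dot_score(q, k):
    # Dot-product alignment. q: (n_text, d), k: (n_regions, d)
    return q @ k.T                                   # (n_text, n_regions)

def scaled_dot_score(q, k):
    # Dot product scaled by sqrt(d), as in Transformer-style attention.
    return (q @ k.T) / np.sqrt(q.shape[-1])

def general_score(q, k, W):
    # General / bilinear alignment: q W k^T, with W: (d, d).
    return q @ W @ k.T

def additive_score(q, k, W1, W2, v):
    # Additive (concat) alignment: v^T tanh(q W1 + k W2).
    # W1, W2: (d, h), v: (h,)
    h = np.tanh((q @ W1)[:, None, :] + (k @ W2)[None, :, :])
    return h @ v                                     # (n_text, n_regions)

# Example usage: attend over image regions for each text token.
rng = np.random.default_rng(0)
d, h, n_text, n_regions = 64, 32, 5, 36
text = rng.standard_normal((n_text, d))
regions = rng.standard_normal((n_regions, d))

scores = scaled_dot_score(text, regions)   # swap in any variant above
attn = softmax(scores, axis=-1)            # per-token weights over regions
attended = attn @ regions                  # attended visual context, (n_text, d)
```

Only the scoring function changes between variants; the softmax normalisation and the weighted sum over visual regions stay the same, which is what makes the choice of alignment calculation an isolated factor to compare across VL tasks.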
