Paper Title
DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention
Paper Authors
Paper Abstract
Vision-and-language (V-L) tasks require the system to understand both visual content and natural language, so learning fine-grained joint representations of vision and language (a.k.a. V-L representations) is of paramount importance. Recently, various pre-trained V-L models have been proposed to learn V-L representations and have achieved improved results on many tasks. However, the mainstream models process both vision and language inputs with the same set of attention matrices. As a result, the generated V-L representations are entangled in one common latent space. To tackle this problem, we propose DiMBERT (short for Disentangled Multimodal-Attention BERT), a novel framework that applies separate attention spaces to vision and language, so that the representations of the two modalities can be explicitly disentangled. To enhance the correlation between vision and language in the disentangled spaces, we introduce visual concepts to DiMBERT, which represent visual information in textual format. In this manner, visual concepts help to bridge the gap between the two modalities. We pre-train DiMBERT on a large collection of image-sentence pairs with two tasks: bidirectional language modeling and sequence-to-sequence language modeling. After pre-training, DiMBERT is further fine-tuned for downstream tasks. Experiments show that DiMBERT sets new state-of-the-art performance on three tasks (over four datasets), including both generation tasks (image captioning and visual storytelling) and classification tasks (referring expressions). The proposed DiM (short for Disentangled Multimodal-Attention) module can be easily incorporated into existing pre-trained V-L models to boost their performance, with up to a 5% improvement on a representative task. Finally, we conduct a systematic analysis and demonstrate the effectiveness of the DiM module and the introduced visual concepts.
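To make the idea of separate attention spaces per modality more concrete, below is a minimal PyTorch sketch of a disentangled multimodal attention layer. It is an illustration under our own assumptions, not the paper's actual DiM implementation: the class and parameter names (`DisentangledMultimodalAttention`, `kv_lang`, `kv_vis`) are hypothetical, and the real module may partition queries, keys, and values differently.

```python
import torch
import torch.nn as nn


class DisentangledMultimodalAttention(nn.Module):
    """Illustrative sketch (not the paper's code): language and vision tokens
    are projected with separate key/value matrices so each modality keeps its
    own attention space, while queries attend over the concatenated sequence."""

    def __init__(self, hidden_dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_dim // num_heads
        self.q_proj = nn.Linear(hidden_dim, hidden_dim)
        # Separate key/value projections per modality (assumed design choice).
        self.kv_lang = nn.Linear(hidden_dim, 2 * hidden_dim)
        self.kv_vis = nn.Linear(hidden_dim, 2 * hidden_dim)
        self.out_proj = nn.Linear(hidden_dim, hidden_dim)

    def _split_heads(self, t: torch.Tensor) -> torch.Tensor:
        # (B, L, D) -> (B, H, L, D/H)
        b, l, _ = t.shape
        return t.view(b, l, self.num_heads, self.head_dim).transpose(1, 2)

    def forward(self, lang: torch.Tensor, vis: torch.Tensor) -> torch.Tensor:
        # lang: (B, L_t, D) text/visual-concept tokens; vis: (B, L_v, D) region features.
        x = torch.cat([lang, vis], dim=1)               # queries come from both modalities
        q = self._split_heads(self.q_proj(x))
        k_l, v_l = self.kv_lang(lang).chunk(2, dim=-1)  # language attention space
        k_v, v_v = self.kv_vis(vis).chunk(2, dim=-1)    # vision attention space
        k = torch.cat([self._split_heads(k_l), self._split_heads(k_v)], dim=2)
        v = torch.cat([self._split_heads(v_l), self._split_heads(v_v)], dim=2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).flatten(2)     # merge heads back to (B, L, D)
        return self.out_proj(out)


# Usage example with random inputs (batch of 2, 16 text tokens, 36 regions).
layer = DisentangledMultimodalAttention()
fused = layer(torch.randn(2, 16, 768), torch.randn(2, 36, 768))
print(fused.shape)  # torch.Size([2, 52, 768])
```

The key design point this sketch tries to capture is that the two modalities no longer share one set of key/value projection matrices, so their representations live in separate attention spaces while cross-modal interaction still happens through the joint attention over the concatenated sequence.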