Paper Title

Unsupervised Vision-Language Parsing: Seamlessly Bridging Visual Scene Graphs with Language Structures via Dependency Relationships

Paper Authors

Chao Lou, Wenjuan Han, Yuhuan Lin, Zilong Zheng

Paper Abstract

Understanding realistic visual scene images together with language descriptions is a fundamental task towards generic visual understanding. Previous works have shown compelling, comprehensive results by building hierarchical structures for visual scenes (e.g., scene graphs) and natural language (e.g., dependency trees) individually. However, how to construct a joint vision-language (VL) structure has barely been investigated. We introduce a new, more challenging but worthwhile task that targets inducing such a joint VL structure in an unsupervised manner. Our goal is to seamlessly bridge visual scene graphs and linguistic dependency trees. Due to the lack of VL structural data, we start by building a new dataset, VLParse. Rather than using labor-intensive labeling from scratch, we propose an automatic alignment procedure to produce coarse structures, followed by human refinement to produce high-quality ones. Moreover, we benchmark our dataset by proposing a contrastive learning (CL)-based framework, VLGAE, short for Vision-Language Graph Autoencoder. Our model obtains superior performance on two derived tasks, i.e., language grammar induction and VL phrase grounding. Ablations show the effectiveness of both visual cues and dependency relationships for fine-grained VL structure construction.
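The abstract mentions that VLGAE is trained with a contrastive learning objective over aligned visual and language representations. For readers unfamiliar with this family of objectives, the following is a minimal, illustrative sketch of an InfoNCE-style contrastive loss between visual-node and word embeddings. All names, dimensions, and the in-batch pairing scheme are assumptions for illustration; this is not the authors' actual implementation.

```python
# Minimal sketch of an InfoNCE-style contrastive objective for aligning
# visual scene-graph node embeddings with word embeddings. Hypothetical
# shapes and pairing; not the VLGAE implementation from the paper.
import torch
import torch.nn.functional as F

def info_nce_loss(visual_emb: torch.Tensor,
                  text_emb: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over matched (visual, text) pairs.

    visual_emb: (N, D) visual node embeddings (assumed shape).
    text_emb:   (N, D) embeddings of the aligned words (assumed shape).
    Row i of each tensor is treated as a positive pair; the other
    rows in the batch serve as in-batch negatives.
    """
    v = F.normalize(visual_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                    # (N, N) similarities
    targets = torch.arange(v.size(0), device=v.device)
    # Contrast in both directions: visual -> text and text -> visual.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

if __name__ == "__main__":
    vis = torch.randn(8, 256)   # 8 visual nodes, 256-d (stand-in values)
    txt = torch.randn(8, 256)   # 8 aligned word embeddings
    print(info_nce_loss(vis, txt).item())
```

In such setups, the temperature and the choice of negatives (in-batch vs. mined) are the main design knobs; the symmetric two-direction loss is a common convention in vision-language alignment work.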
