Paper Title

Learning to Answer Questions in Dynamic Audio-Visual Scenarios

Authors

Guangyao Li, Yake Wei, Yapeng Tian, Chenliang Xu, Ji-Rong Wen, Di Hu

Abstract

In this paper, we focus on the Audio-Visual Question Answering (AVQA) task, which aims to answer questions regarding different visual objects, sounds, and their associations in videos. The problem requires comprehensive multimodal understanding and spatio-temporal reasoning over audio-visual scenes. To benchmark this task and facilitate our study, we introduce a large-scale MUSIC-AVQA dataset, which contains more than 45K question-answer pairs covering 33 different question templates spanning different modalities and question types. We develop several baselines and introduce a spatio-temporal grounded audio-visual network for the AVQA problem. Our results demonstrate that AVQA benefits from multisensory perception, and our model outperforms recent A-, V-, and AVQA approaches. We believe the dataset we have built has the potential to serve as a testbed for evaluating and promoting progress in audio-visual scene understanding and spatio-temporal reasoning. Code and dataset: http://gewu-lab.github.io/MUSIC-AVQA/
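
The abstract names a spatio-temporal grounded audio-visual network but gives no implementation detail. Below is a minimal, hypothetical sketch of how such a model could be wired together in PyTorch: sound-guided spatial attention over visual regions, question-guided temporal attention over both streams, and a classifier over a fixed answer vocabulary. This is not the authors' implementation; the class name `AVQASketch`, the feature dimensions, and the answer-vocabulary size are all illustrative assumptions.

```python
# A minimal, self-contained sketch of a spatio-temporally grounded
# audio-visual QA model in the spirit of the abstract. NOT the authors'
# implementation: module structure, dims, and num_answers are assumptions.
import torch
import torch.nn as nn

class AVQASketch(nn.Module):
    def __init__(self, dim=512, num_answers=42):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        # Spatial grounding: audio attends over visual regions per frame.
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # Temporal grounding: the question attends over time steps.
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, num_answers)
        )

    def forward(self, audio, visual, question):
        # audio:    (B, T, D)    per-frame audio features
        # visual:   (B, T, R, D) per-frame region features (R regions)
        # question: (B, D)       pooled question embedding
        B, T, R, D = visual.shape
        # Spatial grounding: for each frame, use the audio feature as the
        # query over that frame's regions, yielding sound-aware visual features.
        a = audio.reshape(B * T, 1, D)
        v = visual.reshape(B * T, R, D)
        v_grounded, _ = self.spatial_attn(a, v, v)            # (B*T, 1, D)
        v_grounded = v_grounded.reshape(B, T, D)
        # Temporal grounding: the question queries both streams over time.
        q = self.q_proj(question).unsqueeze(1)                # (B, 1, D)
        a_t, _ = self.temporal_attn(q, audio, audio)          # (B, 1, D)
        v_t, _ = self.temporal_attn(q, v_grounded, v_grounded)
        fused = torch.cat([a_t, v_t, q], dim=-1).squeeze(1)   # (B, 3D)
        return self.classifier(fused)                         # answer logits

if __name__ == "__main__":
    model = AVQASketch()
    logits = model(torch.randn(2, 10, 512),      # audio
                   torch.randn(2, 10, 36, 512),  # visual regions
                   torch.randn(2, 512))          # question
    print(logits.shape)  # torch.Size([2, 42])
```

Sharing one temporal-attention module across the audio and visual streams is a simplification made here to keep the sketch short; the paper itself should be consulted for the actual grounding modules and training setup.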
