Paper Title

Telling Left from Right: Learning Spatial Correspondence of Sight and Sound

Paper Authors

Karren Yang, Bryan Russell, Justin Salamon

Abstract

Self-supervised audio-visual learning aims to capture useful representations of video by leveraging correspondences between visual and audio inputs. Existing approaches have focused primarily on matching semantic information between the sensory streams. We propose a novel self-supervised task to leverage an orthogonal principle: matching spatial information in the audio stream to the positions of sound sources in the visual stream. Our approach is simple yet effective. We train a model to determine whether the left and right audio channels have been flipped, forcing it to reason about spatial localization across the visual and audio streams. To train and evaluate our method, we introduce a large-scale video dataset, YouTube-ASMR-300K, with spatial audio comprising over 900 hours of footage. We demonstrate that understanding spatial correspondence enables models to perform better on three audio-visual tasks, achieving quantitative gains over supervised and self-supervised baselines that do not leverage spatial audio cues. We also show how to extend our self-supervised approach to 360 degree videos with ambisonic audio.
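As a rough illustration of the pretext task described in the abstract (a minimal sketch, not the authors' released code), the snippet below randomly swaps the left and right channels of a stereo clip, labels the example by whether the swap occurred, and trains a toy fusion head to predict the flip from paired video and audio inputs. All module names, feature dimensions, and tensor shapes here are assumptions for illustration only.

```python
# Sketch of the left/right channel-flip pretext task (assumed shapes and modules,
# not the paper's actual architecture).
import torch
import torch.nn as nn


def make_flip_example(stereo_audio: torch.Tensor):
    """Randomly swap the two audio channels; return (audio, label).

    stereo_audio: tensor of shape (2, num_samples) -- left/right channels.
    label: 1 if the channels were flipped, else 0.
    """
    flip = torch.rand(()) < 0.5
    if flip:
        stereo_audio = stereo_audio.flip(0)  # swap the channel dimension
    return stereo_audio, flip.long()


class FlipClassifier(nn.Module):
    """Toy audio-visual fusion head; the real model uses deep video/audio encoders."""

    def __init__(self, video_dim=512, audio_dim=512):
        super().__init__()
        # Flatten raw stereo waveform and project it to an audio embedding.
        self.audio_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(audio_dim), nn.ReLU())
        # Binary head: predict whether the channels were flipped.
        self.head = nn.Linear(video_dim + audio_dim, 2)

    def forward(self, video_feat, stereo_audio):
        a = self.audio_encoder(stereo_audio)
        return self.head(torch.cat([video_feat, a], dim=-1))


# Usage: one training step on dummy data.
model = FlipClassifier()
video_feat = torch.randn(4, 512)     # placeholder for precomputed video features
audio = torch.randn(4, 2, 16000)     # 1 s of stereo audio at 16 kHz (assumed)
pairs = [make_flip_example(a) for a in audio]
audio_in = torch.stack([p[0] for p in pairs])
labels = torch.stack([p[1] for p in pairs])
loss = nn.functional.cross_entropy(model(video_feat, audio_in), labels)
loss.backward()
```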
