Paper Title


Do We Need Sound for Sound Source Localization?

Paper Authors

Takashi Oya, Shohei Iwase, Ryota Natsume, Takahiro Itazuri, Shugo Yamaguchi, Shigeo Morishima

Paper Abstract


During the performance of sound source localization which uses both visual and aural information, it presently remains unclear how much either image or sound modalities contribute to the result, i.e. do we need both image and sound for sound source localization? To address this question, we develop an unsupervised learning system that solves sound source localization by decomposing this task into two steps: (i) "potential sound source localization", a step that localizes possible sound sources using only visual information, and (ii) "object selection", a step that identifies which objects are actually sounding using aural information. Our overall system achieves state-of-the-art performance in sound source localization, and more importantly, we find that despite the constraint on available information, the results of (i) achieve similar performance. From this observation and further experiments, we show that visual information is dominant in "sound" source localization when evaluated with the currently adopted benchmark dataset. Moreover, we show that the majority of sound-producing objects within the samples in this dataset can be inherently identified using only visual information, and thus that the dataset is inadequate to evaluate a system's capability to leverage aural information. As an alternative, we present an evaluation protocol that enforces both visual and aural information to be leveraged, and verify this property through several experiments.
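The two-step decomposition described in the abstract can be illustrated with a toy sketch. This is a hypothetical simplification for intuition only, not the authors' actual model: `potential_source_localization` scores candidate regions from visual features alone (step (i)), and `object_selection` then gates those scores by audio-visual agreement (step (ii)). The prototype vector, feature dimensions, and agreement test are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def potential_source_localization(image_feat):
    """Step (i): score each candidate region using visual information only.

    Toy scoring: project region features onto a fixed 'soundable object'
    prototype (a stand-in for whatever the visual branch would learn).
    """
    dim = image_feat.shape[1]
    proto = np.ones(dim) / np.sqrt(dim)
    return image_feat @ proto  # one potential-source score per region

def object_selection(scores, image_feat, audio_feat):
    """Step (ii): keep only regions whose features agree with the audio.

    Toy agreement: dot product between region features and a clip-level
    audio embedding; regions with non-positive agreement are suppressed.
    """
    match = image_feat @ audio_feat
    return scores * (match > 0)

# 5 candidate regions with 8-dim features, plus one audio embedding.
regions = rng.normal(size=(5, 8))
audio = rng.normal(size=8)

vis_only = potential_source_localization(regions)   # step (i) output
av = object_selection(vis_only, regions, audio)     # step (ii) output
print(vis_only.shape, av.shape)
```

Note that step (i) never sees the audio, which is what lets the paper compare its output against the full audio-visual pipeline and measure how much the sound modality actually contributes.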
