Paper Title
How to Listen? Rethinking Visual Sound Localization
Paper Authors
Paper Abstract
Localizing visual sounds consists of locating the position of sound-emitting objects within an image. It is a growing research area with potential applications in monitoring natural and urban environments, such as wildlife migration and urban traffic. Previous works are usually evaluated on datasets in which most images contain a single dominant visible object, and the proposed models typically require dedicated localization modules during training or specialized sampling strategies, yet it remains unclear what role these design choices play in how well these methods adapt to more challenging scenarios. In this work, we analyze various model choices for visual sound localization and discuss how their individual components affect model performance, namely the encoders' architecture, the loss function, and the localization strategy. Furthermore, we study the interplay between these design decisions, model performance, and the data by examining evaluation datasets that span different difficulties and characteristics, and we discuss the implications of these decisions in the context of real-world applications. Our code and model weights are open-sourced and made available for further applications.