A Comprehensive Survey on Video Saliency Detection with Auditory Information: the Audio-visual Consistency Perceptual is the Key!

Paper Title

A Comprehensive Survey on Video Saliency Detection with Auditory Information: the Audio-visual Consistency Perceptual is the Key!

Authors

Chenglizhao Chen, Mengke Song, Wenfeng Song, Li Guo, Muwei Jian

Abstract

Video saliency detection (VSD) aims to quickly locate the most attractive objects, things, or patterns in a given video clip. Existing VSD work has relied mainly on the visual system and paid less attention to audio, even though the auditory system is the most vital complement to the visual system. Audio-visual saliency detection (AVSD), one of the most representative research topics for mimicking human perceptual mechanisms, is currently in its infancy, and no existing survey has covered it, especially from the perspective of saliency detection. The ultimate goal of this paper is therefore to provide an extensive review that bridges the gap between audio-visual fusion and saliency detection. As another highlight of this review, we provide deep insight into the key factors that could directly determine the performance of deep AVSD models, and we claim that the audio-visual consistency degree (AVC), a long-overlooked issue, can directly influence how effectively audio benefits its visual counterpart in saliency detection. Moreover, to make the AVC issue more practical and valuable for future researchers, we have newly equipped almost all existing publicly available AVSD datasets with additional frame-wise AVC labels. Based on these upgraded datasets, we have conducted extensive quantitative evaluations to ground our claim about the importance of AVC in the AVSD task. In short, both our ideas and new datasets serve as a convenient platform with preliminaries and guidelines, all of which have strong potential to help future work further advance state-of-the-art (SOTA) performance.
