Paper Title

Audio-visual scene classification via contrastive event-object alignment and semantic-based fusion

Paper Authors

Hou, Yuanbo; Kang, Bo; Botteldooren, Dick

Paper Abstract

Previous works on scene classification are mainly based on audio or visual signals, while humans perceive environmental scenes through multiple senses. Recent studies on audio-visual scene classification (AVSC) separately fine-tune large-scale audio and image pre-trained models on the target dataset, then either fuse the intermediate representations of the audio model and the visual model, or fuse the coarse-grained decisions of both models at the clip level. Such methods ignore the detailed audio events and visual objects in audio-visual scenes (AVS), while humans often identify different scenes through the audio events and visual objects within them and the congruence between them. To exploit the fine-grained information of audio events and visual objects in AVS, and to coordinate the implicit relationship between audio events and visual objects, this paper proposes a multi-branch model equipped with contrastive event-object alignment (CEOA) and semantic-based fusion (SF) for AVSC. CEOA aims to align the learned embeddings of audio events and visual objects by comparing the differences between audio-visual event-object pairs. Then, visual objects associated with certain audio events, and vice versa, are accentuated by cross-attention and undergo SF for semantic-level fusion. Experiments show that: 1) the proposed AVSC model equipped with CEOA and SF outperforms audio-only and visual-only models, i.e., the audio-visual results are better than the results from a single modality; 2) CEOA aligns the embeddings of audio events and related visual objects at a fine-grained level, and SF effectively integrates the two; 3) compared with other large-scale integrated systems, the proposed model shows competitive performance, even without using additional datasets and data augmentation tricks.
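To make the two components concrete, below is a minimal PyTorch sketch of how CEOA and SF could be realized. The InfoNCE-style contrastive loss, the cross-attention layout, and all module names, dimensions, and hyperparameters are illustrative assumptions; the abstract does not specify the paper's exact loss formulation or architecture.

```python
# A minimal sketch of the two components described in the abstract.
# All names, dimensions, and the InfoNCE formulation are assumptions
# for illustration, not the paper's exact method.
import torch
import torch.nn as nn
import torch.nn.functional as F


def contrastive_event_object_alignment(audio_emb, visual_emb, temperature=0.07):
    """Hypothetical CEOA loss: symmetric InfoNCE over paired
    audio-event / visual-object embeddings of shape (batch, dim).
    Embeddings from the same clip are positives; all other pairs
    in the batch serve as negatives."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = a @ v.t() / temperature                     # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=a.device)   # diagonal = matched pairs
    # Symmetric loss: audio-to-visual and visual-to-audio directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))


class SemanticFusion(nn.Module):
    """Hypothetical SF block: cross-attention accentuates visual objects
    related to audio events (and vice versa), then the attended token
    sequences are pooled and fused for scene classification."""

    def __init__(self, dim=256, n_heads=4, n_scenes=10):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, n_scenes)

    def forward(self, audio_tokens, visual_tokens):
        # audio_tokens:  (batch, n_events, dim)  audio-event embeddings
        # visual_tokens: (batch, n_objects, dim) visual-object embeddings
        v_att, _ = self.a2v(visual_tokens, audio_tokens, audio_tokens)  # objects attend to events
        a_att, _ = self.v2a(audio_tokens, visual_tokens, visual_tokens)  # events attend to objects
        fused = torch.cat([a_att.mean(dim=1), v_att.mean(dim=1)], dim=-1)
        return self.classifier(fused)


# Example usage with random token embeddings:
audio = torch.randn(8, 12, 256)    # 8 clips, 12 audio-event tokens each
visual = torch.randn(8, 20, 256)   # 8 clips, 20 visual-object tokens each
sf = SemanticFusion()
scene_logits = sf(audio, visual)   # (8, 10) scene logits
align_loss = contrastive_event_object_alignment(audio.mean(1), visual.mean(1))
```

In this sketch the alignment loss and the classification loss would be optimized jointly, so that CEOA shapes the event/object embedding spaces while SF learns to fuse them; whether the paper trains the two objectives jointly or in stages is not stated in the abstract.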
