Paper Title
A Joint Cross-Attention Model for Audio-Visual Fusion in Dimensional Emotion Recognition
Paper Authors
Paper Abstract
Multimodal emotion recognition has recently gained much attention since it can leverage diverse and complementary relationships over multiple modalities (e.g., audio, visual, biosignals, etc.), and can provide some robustness to noisy modalities. Most state-of-the-art methods for audio-visual (A-V) fusion rely on recurrent networks or conventional attention mechanisms that do not effectively leverage the complementary nature of A-V modalities. In this paper, we focus on dimensional emotion recognition based on the fusion of facial and vocal modalities extracted from videos. Specifically, we propose a joint cross-attention model that relies on the complementary relationships to extract the salient features across A-V modalities, allowing for accurate prediction of continuous values of valence and arousal. The proposed fusion model efficiently leverages the inter-modal relationships, while reducing the heterogeneity between the features. In particular, it computes the cross-attention weights based on correlation between the combined feature representation and individual modalities. By deploying the combined A-V feature representation into the cross-attention module, the performance of our fusion module improves significantly over the vanilla cross-attention module. Experimental results on validation-set videos from the AffWild2 dataset indicate that our proposed A-V fusion model provides a cost-effective solution that can outperform state-of-the-art approaches. The code is available on GitHub: https://github.com/praveena2j/JointCrossAttentional-AV-Fusion.
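To make the fusion mechanism described in the abstract more concrete, below is a minimal sketch of joint cross-attention fusion: the audio and visual feature sequences are concatenated into a joint representation, which is correlated with each individual modality through learned projections, and the resulting attention weights re-weight each modality before the fused features are passed to valence/arousal regression. The class name, feature dimensions, projection layers, and element-wise re-weighting are illustrative assumptions rather than the authors' exact configuration; the official implementation is available at the GitHub link above.

```python
import torch
import torch.nn as nn


class JointCrossAttentionFusion(nn.Module):
    """Illustrative sketch of joint cross-attention for A-V fusion.

    The joint (concatenated) A-V representation is correlated with each
    individual modality to produce attention weights that emphasize the
    salient positions of that modality. Layer choices and dimensions are
    assumptions for illustration, not the authors' exact settings.
    """

    def __init__(self, d_audio: int = 512, d_visual: int = 512):
        super().__init__()
        d_joint = d_audio + d_visual
        # Learnable projections correlating the joint representation with each modality.
        self.w_ja = nn.Linear(d_joint, d_audio, bias=False)
        self.w_jv = nn.Linear(d_joint, d_visual, bias=False)

    def forward(self, x_a: torch.Tensor, x_v: torch.Tensor) -> torch.Tensor:
        # x_a: (batch, seq_len, d_audio), x_v: (batch, seq_len, d_visual)
        joint = torch.cat([x_a, x_v], dim=-1)      # combined A-V representation

        # Correlation of the joint representation with each modality.
        corr_a = torch.tanh(self.w_ja(joint))      # (batch, seq_len, d_audio)
        corr_v = torch.tanh(self.w_jv(joint))      # (batch, seq_len, d_visual)

        # Attention weights over the temporal dimension.
        att_a = torch.softmax(corr_a, dim=1)
        att_v = torch.softmax(corr_v, dim=1)

        # Re-weight each modality and fuse for downstream valence/arousal regression.
        x_a_att = att_a * x_a
        x_v_att = att_v * x_v
        return torch.cat([x_a_att, x_v_att], dim=-1)


if __name__ == "__main__":
    fusion = JointCrossAttentionFusion()
    a = torch.randn(2, 16, 512)   # e.g., per-clip vocal (audio) features
    v = torch.randn(2, 16, 512)   # e.g., per-clip facial (visual) features
    print(fusion(a, v).shape)     # torch.Size([2, 16, 1024])
```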