Paper Title


Multi-modal Attention for Speech Emotion Recognition

Authors

Zexu Pan, Zhaojie Luo, Jichen Yang, Haizhou Li

Abstract


Emotion represents an essential aspect of human speech and is manifested in speech prosody. Speech, visual, and textual cues are complementary in human communication. In this paper, we study a hybrid fusion method, referred to as the multi-modal attention network (MMAN), to make use of visual and textual cues in speech emotion recognition. We propose a novel multi-modal attention mechanism, cLSTM-MMA, which facilitates attention across the three modalities and selectively fuses their information. cLSTM-MMA is combined with other uni-modal sub-networks in a late fusion. The experiments show that speech emotion recognition benefits significantly from visual and textual cues, and that the proposed cLSTM-MMA alone is as competitive as other fusion methods in terms of accuracy, while having a much more compact network structure. The proposed hybrid network MMAN achieves state-of-the-art performance on the IEMOCAP database for emotion recognition.
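The abstract describes attention across three modalities (speech, visual, text) followed by fusion of the attended representations. The sketch below is a minimal, hypothetical illustration of that idea using scaled dot-product attention and concatenation-based fusion; the actual cLSTM-MMA architecture, feature dimensions, and fusion weights are not specified in the abstract, so all names and sizes here are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(query, keys, values):
    # Scaled dot-product attention: one modality's feature vector
    # attends over the stack of all modality features.
    d = query.shape[-1]
    scores = query @ keys.T / np.sqrt(d)   # (1, n_modalities)
    weights = softmax(scores, axis=-1)
    return weights @ values                # (1, d)

# Toy per-modality feature vectors (hypothetical dimension d=8;
# real systems would use LSTM/CNN encoder outputs per modality).
rng = np.random.default_rng(0)
d = 8
speech, visual, text = (rng.standard_normal((1, d)) for _ in range(3))

# Each modality queries the stack of all three, so attention can
# selectively weight complementary cues from the other modalities.
stack = np.vstack([speech, visual, text])  # (3, d)
attended = [cross_modal_attention(m, stack, stack)
            for m in (speech, visual, text)]

# Fusion step: concatenate the attended representations into a single
# vector that a downstream emotion classifier would consume.
fused = np.concatenate(attended, axis=-1)  # (1, 3 * d)
print(fused.shape)  # → (1, 24)
```

The key design point this toy mirrors is that cross-modal attention lets each modality's representation be re-weighted by information from the others before fusion, rather than simply concatenating raw uni-modal features.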
