Paper Title
Two-stage dimensional emotion recognition by fusing predictions of acoustic and text networks using SVM
Paper Authors
Abstract
Automatic speech emotion recognition (SER) by a computer is a critical component of more natural human-machine interaction. As in human-human interaction, the capability to perceive emotion correctly is essential for taking further steps in a particular situation. One issue in SER is whether it is necessary to combine acoustic features with other data, such as facial expressions, text, and motion capture. This research proposes combining acoustic and text information by applying a late-fusion approach consisting of two steps. First, acoustic and text features are trained separately in deep learning systems. Second, the prediction results from the deep learning systems are fed into a support vector machine (SVM) to predict the final regression score. Furthermore, the task in this research is dimensional emotion modeling, because it enables a deeper analysis of affective states. Experimental results show that this two-stage late-fusion approach obtains higher performance than any one-stage processing, with a linear correlation between one-stage and two-stage processing results. This late-fusion approach improves on previous early-fusion results as measured by the concordance correlation coefficient (CCC) score.
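The second stage described above can be sketched as follows: the per-utterance predictions from the two first-stage networks are concatenated and fed to an SVM regressor, and the fused output is scored with the concordance correlation coefficient. This is a minimal illustration, not the paper's implementation; the synthetic data, the three-dimensional (valence, arousal, dominance) layout, the RBF kernel, and the `ccc` helper are all assumptions for demonstration.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.multioutput import MultiOutputRegressor

def ccc(x, y):
    """Concordance correlation coefficient between two 1-D score vectors."""
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    return 2 * cov / (x.var() + y.var() + (mx - my) ** 2)

# Hypothetical stand-ins for the first-stage deep-network outputs:
# each row holds one utterance's predicted (valence, arousal, dominance).
rng = np.random.default_rng(0)
n = 200
y_true = rng.uniform(-1, 1, size=(n, 3))               # dimensional labels
acoustic_pred = y_true + rng.normal(0, 0.3, (n, 3))    # noisy acoustic network
text_pred = y_true + rng.normal(0, 0.3, (n, 3))        # noisy text network

# Stage two: concatenate both networks' predictions and fit an SVM regressor
# (wrapped in MultiOutputRegressor since SVR is single-output).
X = np.hstack([acoustic_pred, text_pred])
fusion = MultiOutputRegressor(SVR(kernel="rbf", C=1.0))
fusion.fit(X, y_true)
final_scores = fusion.predict(X)

valence_ccc = ccc(final_scores[:, 0], y_true[:, 0])
print(final_scores.shape, round(valence_ccc, 3))
```

In practice the SVM would be fit on held-out first-stage predictions rather than the training fold, and the CCC would be reported per emotion dimension on a test set.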