论文标题
有效的端到端变压器,具有渐进的三模式关注,以获得多模式情绪识别
An Efficient End-to-End Transformer with Progressive Tri-modal Attention for Multi-modal Emotion Recognition
论文作者
论文摘要
关于多模式情绪识别的最新著作转向端到端模型,该模型可以与两阶段管道相比提取目标任务监督的特定任务特征。但是,以前的方法仅对文本和声学和视觉方式之间的特征相互作用进行建模,而忽略了捕获声学和视觉方式之间的特征相互作用。在本文中,我们提出了多模式的端到端变压器(ME2ET),该变压器可以有效地模拟低级和高级层次的文本,声学和视觉方式之间的三模式特征相互作用。在低水平上,我们提出了进行性三模式的关注,可以通过采用两次通行策略来对三模式特征相互作用进行建模,并可以进一步利用这种相互作用,以通过降低输入令牌长度来大大降低计算和记忆复杂性。在高水平上,我们引入了三模式特征融合层,以明确汇总三种方式的语义表示。 CMU-MOSEI和IEMOCAP数据集的实验结果表明,ME2ET实现了最新性能。进一步的深入分析证明了拟议的渐进三模式关注的有效性,效率和解释性,这可以帮助我们的模型实现更好的性能,同时显着降低计算和记忆成本。我们的代码将公开可用。
Recent works on multi-modal emotion recognition move towards end-to-end models, which can extract the task-specific features supervised by the target task compared with the two-phase pipeline. However, previous methods only model the feature interactions between the textual and either acoustic and visual modalities, ignoring capturing the feature interactions between the acoustic and visual modalities. In this paper, we propose the multi-modal end-to-end transformer (ME2ET), which can effectively model the tri-modal features interaction among the textual, acoustic, and visual modalities at the low-level and high-level. At the low-level, we propose the progressive tri-modal attention, which can model the tri-modal feature interactions by adopting a two-pass strategy and can further leverage such interactions to significantly reduce the computation and memory complexity through reducing the input token length. At the high-level, we introduce the tri-modal feature fusion layer to explicitly aggregate the semantic representations of three modalities. The experimental results on the CMU-MOSEI and IEMOCAP datasets show that ME2ET achieves the state-of-the-art performance. The further in-depth analysis demonstrates the effectiveness, efficiency, and interpretability of the proposed progressive tri-modal attention, which can help our model to achieve better performance while significantly reducing the computation and memory cost. Our code will be publicly available.