Paper Title
CALM: Contrastive Aligned Audio-Language Multirate and Multimodal Representations
Paper Authors
Paper Abstract
Deriving multimodal representations of audio and lexical inputs is a central problem in Natural Language Understanding (NLU). In this paper, we present Contrastive Aligned Audio-Language Multirate and Multimodal Representations (CALM), an approach for learning multimodal representations using contrastive and multirate information inherent in audio and lexical inputs. The proposed model aligns acoustic and lexical information in the input embedding space of a pretrained language-only contextual embedding model. By aligning audio representations to pretrained language representations and utilizing contrastive information between acoustic inputs, CALM is able to bootstrap audio embeddings competitive with existing audio representation models in only a few hours of training time. Operationally, audio spectrograms are processed as linearized patches through a Spectral Transformer (SpecTran), which is trained with a Contrastive Audio-Language Pretraining objective to align audio and language from similar queries. Subsequently, the derived acoustic and lexical token representations are input into a multimodal transformer to incorporate utterance-level context and derive the proposed CALM representations. We show that these pretrained embeddings can subsequently be used in multimodal supervised tasks and demonstrate the benefits of the proposed pretraining steps in terms of the alignment of the two embedding spaces and the multirate nature of the pretraining. Our system shows 10-25% improvement over existing emotion recognition systems, including state-of-the-art three-modality systems, under various evaluation objectives.
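To make the described pretraining step concrete, below is a minimal sketch (not the authors' code) of the contrastive audio-language alignment the abstract outlines: spectrograms are cut into linearized patches, encoded by a toy SpecTran-style transformer, and aligned to embeddings from a pretrained language model with a symmetric contrastive (InfoNCE) loss, where audio and text from the same query form the positive pair. All class names, dimensions, and hyperparameters are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpecTran(nn.Module):
    """Toy spectral transformer: linearize spectrogram patches, then self-attend."""
    def __init__(self, patch_size=16, n_mels=64, dim=256, depth=4, heads=4):
        super().__init__()
        self.patch_size = patch_size
        self.proj = nn.Linear(patch_size * n_mels, dim)          # linearized-patch embedding
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))          # summary token for the utterance

    def forward(self, spec):                                     # spec: (B, n_mels, T)
        B, n_mels, T = spec.shape
        T = (T // self.patch_size) * self.patch_size             # drop ragged tail frames
        patches = spec[:, :, :T].unfold(2, self.patch_size, self.patch_size)   # (B, n_mels, P, patch)
        patches = patches.permute(0, 2, 1, 3).reshape(B, -1, n_mels * self.patch_size)
        x = torch.cat([self.cls.expand(B, -1, -1), self.proj(patches)], dim=1)
        return self.encoder(x)[:, 0]                             # CLS output as the audio embedding

def contrastive_audio_language_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: audio/text pairs from the same query are positives."""
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.t() / temperature                             # (B, B) similarity matrix
    labels = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

# Usage: align SpecTran outputs to (stand-in) pretrained language embeddings.
spectran = SpecTran()
audio_emb = spectran(torch.randn(8, 64, 512))                    # batch of 8 spectrograms
text_emb = torch.randn(8, 256)                                   # placeholder for pretrained LM embeddings
loss = contrastive_audio_language_loss(audio_emb, text_emb)
loss.backward()

In the paper's pipeline, the aligned audio embeddings and the lexical token embeddings would then be fed jointly into a multimodal transformer to add utterance-level context; that stage is omitted from this sketch.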