Paper Title

Learning Contextually Fused Audio-visual Representations for Audio-visual Speech Recognition

Paper Authors

Zi-Qiang Zhang, Jie Zhang, Jian-Shu Zhang, Ming-Hui Wu, Xin Fang, Li-Rong Dai

Paper Abstract

With the advances in self-supervised learning for the audio and visual modalities, it has become possible to learn robust audio-visual speech representations. This is beneficial for improving audio-visual speech recognition (AVSR) performance, as the multi-modal inputs in principle carry richer information. In this paper, building on existing self-supervised representation learning methods for the audio modality, we propose an audio-visual representation learning approach. The proposed approach exploits both the complementarity of the audio and visual modalities and long-term context dependency, using a transformer-based fusion module and a flexible masking strategy. After pre-training, the model can extract the fused representations required by AVSR. Without loss of generality, it can also be applied to single-modal tasks, e.g. audio-only or visual-only speech recognition, by simply masking out one modality in the fusion module. The proposed pre-trained model is evaluated on speech recognition and lipreading tasks using one or both modalities, where its superiority is demonstrated.
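
To make the described design more concrete, below is a minimal PyTorch sketch of a transformer-based fusion module that accepts time-aligned audio and visual feature sequences and can mask out one modality, mirroring the idea of using the same fused model for AVSR, audio-only ASR, or lipreading. The module name, parameter names, and dimensions are hypothetical illustrations, not taken from the paper or its released code.

```python
# Minimal sketch: transformer-based audio-visual fusion with modality masking.
# Assumes pre-extracted frame-level features; names and sizes are hypothetical.
import torch
import torch.nn as nn


class AVFusionSketch(nn.Module):
    def __init__(self, dim=256, num_layers=4, num_heads=4):
        super().__init__()
        # Learnable type embeddings distinguish the two modality streams.
        self.audio_type = nn.Parameter(torch.zeros(1, 1, dim))
        self.visual_type = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, audio_feat, visual_feat, mask_modality=None):
        """audio_feat, visual_feat: (batch, frames, dim), assumed time-aligned.
        mask_modality: None, 'audio', or 'visual'; zeroing one stream lets the
        same fused model serve audio-only ASR or visual-only lipreading."""
        if mask_modality == "audio":
            audio_feat = torch.zeros_like(audio_feat)
        elif mask_modality == "visual":
            visual_feat = torch.zeros_like(visual_feat)
        # Concatenate the streams along time so self-attention can model
        # cross-modal complementarity and long-term context jointly.
        tokens = torch.cat(
            [audio_feat + self.audio_type, visual_feat + self.visual_type],
            dim=1)
        # Return fused representations for a downstream recognition head.
        return self.fusion(tokens)


if __name__ == "__main__":
    model = AVFusionSketch()
    a = torch.randn(2, 100, 256)   # e.g. audio frame features
    v = torch.randn(2, 100, 256)   # e.g. lip-region video frame features
    print(model(a, v).shape)                          # audio-visual input
    print(model(a, v, mask_modality="visual").shape)  # audio-only input
```

The masking branch is the key point of the sketch: because one modality can simply be zeroed out at the fusion input, a single pre-trained model can be fine-tuned or evaluated on either single-modal task without architectural changes.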
