Streaming Audio-Visual Speech Recognition with Alignment Regularization

Paper Title

Streaming Audio-Visual Speech Recognition with Alignment Regularization

Authors

Pingchuan Ma, Niko Moritz, Stavros Petridis, Christian Fuegen, Maja Pantic

Abstract

In this work, we propose a streaming AV-ASR system based on a hybrid connectionist temporal classification (CTC)/attention neural network architecture. The audio and the visual encoder neural networks are both based on the conformer architecture, which is made streamable using chunk-wise self-attention (CSA) and causal convolution. Streaming recognition with a decoder neural network is realized by using the triggered attention technique, which performs time-synchronous decoding with joint CTC/attention scoring. Additionally, we propose a novel alignment regularization technique that promotes synchronization of the audio and visual encoder, which in turn results in better word error rates (WERs) at all SNR levels for streaming and offline AV-ASR models. The proposed AV-ASR model achieves WERs of 2.0% and 2.6% on the Lip Reading Sentences 3 (LRS3) dataset in an offline and online setup, respectively, which both present state-of-the-art results when no external training data are used.
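
As a rough illustration of how chunk-wise self-attention can make an encoder streamable, the sketch below builds a boolean attention mask in which each frame attends only to frames in its own chunk and in earlier chunks, never to future chunks. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `chunk_attention_mask` and the parameters `chunk_size` and `left_chunks` are hypothetical, and the abstract does not specify the chunk size or how much left context the model uses.

```python
def chunk_attention_mask(num_frames, chunk_size, left_chunks=-1):
    """Build a chunk-wise attention mask (illustrative sketch only).

    mask[i][j] is True when query frame i may attend to key frame j.
    Frames are grouped into consecutive chunks of `chunk_size` frames.
    A frame sees its own chunk plus up to `left_chunks` previous chunks
    (all previous chunks when `left_chunks` is negative). It never sees
    a later chunk, which is what bounds the look-ahead for streaming.
    """
    if num_frames < 0 or chunk_size <= 0:
        raise ValueError("num_frames must be >= 0 and chunk_size must be > 0")

    mask = []
    for i in range(num_frames):
        query_chunk = i // chunk_size
        # Oldest chunk that this query frame is still allowed to see.
        if left_chunks < 0:
            first_chunk = 0
        else:
            first_chunk = max(0, query_chunk - left_chunks)
        row = [first_chunk <= (j // chunk_size) <= query_chunk
               for j in range(num_frames)]
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Four frames, chunks of two frames, unlimited left context.
    for row in chunk_attention_mask(4, 2):
        print("".join("x" if allowed else "." for allowed in row))
```

In such a scheme the look-ahead of any frame is bounded by the end of its own chunk, so latency is governed by the chunk size rather than by the utterance length.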
