Paper Title
A Noise-Robust Self-supervised Pre-training Model Based Speech Representation Learning for Automatic Speech Recognition
Paper Authors
Paper Abstract
Wav2vec2.0 is a popular self-supervised pre-training framework for learning speech representations in the context of automatic speech recognition (ASR). It has been shown that wav2vec2.0 is fairly robust to domain shift, while its noise robustness remains unclear. In this work, we therefore first analyze the noise robustness of wav2vec2.0 through experiments. We observe that wav2vec2.0 pre-trained on noisy data can learn good representations and thus improve ASR performance on the noisy test set, but this comes at the cost of a performance degradation on the clean test set. To avoid this issue, we propose an enhanced wav2vec2.0 model. Specifically, the noisy speech and its corresponding clean version are fed into the same feature encoder, where the clean speech provides the training targets for the model. Experimental results reveal that the proposed method not only improves ASR performance on the noisy test set beyond the original wav2vec2.0, but also keeps the performance decrease on the clean test set small. In addition, the effectiveness of the proposed method is demonstrated under different types of noise conditions.
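To make the core idea of the abstract concrete, below is a minimal PyTorch sketch of the shared-encoder setup it describes. This is my own simplified illustration, not the authors' implementation: the encoder, context network, and loss are heavily reduced stand-ins (a plain cosine regression loss replaces wav2vec2.0's masked contrastive objective with quantized targets). The point it shows is that the noisy and clean waveforms pass through the same feature encoder, with the clean branch supplying training targets for the noisy branch.

```python
# Minimal sketch, assuming paired (noisy, clean) waveforms and a reduced
# wav2vec2.0-style architecture; all module sizes are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoiseRobustPretrainSketch(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        # Shared convolutional feature encoder, applied to BOTH noisy and clean audio.
        self.feature_encoder = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=10, stride=5), nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=3, stride=2), nn.GELU(),
        )
        # Transformer context network, applied to the noisy branch only.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.context_network = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, noisy_wav: torch.Tensor, clean_wav: torch.Tensor) -> torch.Tensor:
        # Both versions go through the SAME encoder (shared weights).
        z_noisy = self.feature_encoder(noisy_wav.unsqueeze(1)).transpose(1, 2)  # (B, T', D)
        with torch.no_grad():  # clean branch only supplies targets, no gradient
            z_clean = self.feature_encoder(clean_wav.unsqueeze(1)).transpose(1, 2)
        c = self.context_network(z_noisy)
        # Cosine distance between noisy-branch context and clean-branch targets,
        # a simplified stand-in for the quantized contrastive objective.
        return 1.0 - F.cosine_similarity(c, z_clean, dim=-1).mean()

# Usage sketch: both inputs are (batch, samples) raw waveforms of equal length.
if __name__ == "__main__":
    noisy = torch.randn(2, 16000)
    clean = torch.randn(2, 16000)
    model = NoiseRobustPretrainSketch()
    loss = model(noisy, clean)
    loss.backward()
```

Because the targets are computed from the clean branch without gradient, the model is pushed to produce noise-invariant representations from noisy input, which is the intuition behind the reported gains on noisy test sets with little degradation on clean speech.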