Paper Title
u-HuBERT: Unified Mixed-Modal Speech Pretraining And Zero-Shot Transfer to Unlabeled Modality
Paper Authors
Paper Abstract
While audio-visual speech models can yield superior performance and robustness compared to audio-only models, their development and adoption are hindered by the lack of labeled and unlabeled audio-visual data and the cost of deploying one model per modality. In this paper, we present u-HuBERT, a self-supervised pre-training framework that can leverage both multimodal and unimodal speech with a unified masked cluster prediction objective. By utilizing modality dropout during pre-training, we demonstrate that a single fine-tuned model can achieve performance on par with or better than state-of-the-art modality-specific models. Moreover, our model fine-tuned only on audio can perform well with audio-visual and visual speech input, achieving zero-shot modality generalization for multiple speech processing tasks. In particular, our single model yields 1.2%/1.4%/27.2% speech recognition word error rate on LRS3 with audio-visual/audio/visual input. Code and models are available at https://github.com/facebookresearch/av_hubert.
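To make the modality dropout idea mentioned in the abstract concrete, the sketch below randomly suppresses one input stream before fusing the audio and video features, so the shared encoder sees audio-only, video-only, and audio-visual inputs during pre-training. The function name, the dropout probabilities, and the concatenation-based fusion are illustrative assumptions for this sketch, not the paper's exact implementation; the linked repository contains the authors' code.

```python
import torch


def modality_dropout(audio_feats, video_feats, p_drop_audio=0.25, p_drop_video=0.25):
    """Randomly zero out one modality stream before fusion.

    At most one stream is dropped per call, so the fused representation
    corresponds to audio-only, video-only, or audio-visual input.
    Probabilities are placeholders, not the paper's values.
    """
    if torch.rand(1).item() < p_drop_audio:
        audio_feats = torch.zeros_like(audio_feats)
    elif torch.rand(1).item() < p_drop_video:
        video_feats = torch.zeros_like(video_feats)
    # Fuse by concatenating along the feature dimension (one common choice).
    return torch.cat([audio_feats, video_feats], dim=-1)


# Example: batch of 2 utterances, 50 frames, 256-dim features per modality.
audio = torch.randn(2, 50, 256)
video = torch.randn(2, 50, 256)
fused = modality_dropout(audio, video)
print(fused.shape)  # torch.Size([2, 50, 512])
```

Because the fused features have the same shape regardless of which stream was dropped, the same encoder and masked cluster prediction head can be trained on all three input conditions, which is what enables the zero-shot transfer to an unlabeled modality described above.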