Paper Title

SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data

Paper Authors

Ziqiang Zhang, Sanyuan Chen, Long Zhou, Yu Wu, Shuo Ren, Shujie Liu, Zhuoyuan Yao, Xun Gong, Lirong Dai, Jinyu Li, Furu Wei

Paper Abstract

How to boost speech pre-training with textual data is an unsolved problem due to the fact that speech and text are very different modalities with distinct characteristics. In this paper, we propose a cross-modal Speech and Language Model (SpeechLM) to explicitly align speech and text pre-training with a pre-defined unified discrete representation. Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities, including phoneme-unit and hidden-unit tokenizers, which can be trained using a small amount of paired speech-text data. Based on the trained tokenizers, we convert the unlabeled speech and text data into tokens of phoneme units or hidden units. The pre-training objective is designed to unify the speech and the text into the same discrete semantic space with a unified Transformer network. We evaluate SpeechLM on various spoken language processing tasks including speech recognition, speech translation, and universal representation evaluation framework SUPERB, demonstrating significant improvements on content-related tasks. Code and models are available at https://aka.ms/SpeechLM.
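The abstract's central idea is that two tokenizers (phoneme-unit and hidden-unit) map unlabeled speech and text into one shared discrete vocabulary, so a single Transformer can consume either modality. The following is a minimal, purely illustrative sketch of that shared-discrete-space idea; the unit inventory, lookup tables, and function names below are invented for illustration and are not the paper's actual tokenizers.

```python
# Toy illustration of SpeechLM's shared discrete semantic space:
# both modalities are tokenized into the SAME unit-ID vocabulary,
# so one embedding table / Transformer can serve both streams.
# Everything here is a hypothetical stand-in, not the paper's code.

SHARED_UNITS = ["AH", "B", "K", "S", "T", "<pad>"]  # invented unit inventory
UNIT_TO_ID = {u: i for i, u in enumerate(SHARED_UNITS)}

def text_tokenizer(text):
    """Stand-in for the phoneme-unit tokenizer: grapheme -> unit lookup."""
    table = {"a": "AH", "b": "B", "c": "K", "s": "S", "t": "T"}
    return [UNIT_TO_ID[table[ch]] for ch in text if ch in table]

def speech_tokenizer(frames):
    """Stand-in for the hidden-unit tokenizer: assign each acoustic
    frame to its nearest codebook entry (a toy k-means-style codebook)."""
    codebook = {0: "AH", 1: "B", 2: "K", 3: "S", 4: "T"}
    return [UNIT_TO_ID[codebook[round(f) % 5]] for f in frames]

# Text "cat" and a matching speech frame sequence land on identical IDs,
# i.e. the two modalities meet in one discrete space.
text_ids = text_tokenizer("cat")
speech_ids = speech_tokenizer([2.1, 0.2, 4.0])
print(text_ids, speech_ids)  # → [2, 0, 4] [2, 0, 4]
```

Once both streams are sequences over the same vocabulary, the pre-training objective described in the abstract can train a single unified Transformer on either one.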
