Paper Title

T5lephone: Bridging Speech and Text Self-supervised Models for Spoken Language Understanding via Phoneme level T5

Authors

Chan-Jan Hsu, Ho-Lam Chung, Hung-yi Lee, Yu Tsao

Abstract

In spoken language understanding (SLU), a natural solution is to concatenate a pre-trained speech model (e.g. HuBERT) with a pretrained language model (PLM, e.g. T5). Most previous works use pretrained language models with subword-based tokenization. However, the granularity of the input units affects the alignment between speech model outputs and language model inputs, and PLMs with character-based tokenization are underexplored. In this work, we conduct extensive studies on how PLMs with different tokenization strategies affect spoken language understanding tasks, including spoken question answering (SQA) and speech translation (ST). We further extend the idea to create T5lephone (pronounced as "telephone"), a variant of T5 that is pretrained on phonemicized text. We initialize T5lephone from existing PLMs so that it can be pretrained with relatively lightweight computational resources. We reach state-of-the-art performance on NMSQA, and the T5lephone model outperforms T5 with other types of units on end-to-end SQA and ST.
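The abstract describes pretraining a T5 variant on phonemicized text so that its input units align more closely with speech model outputs. The minimal sketch below only illustrates that general idea: converting graphemes to phonemes and feeding the result to a character/byte-level T5. The g2p_en phonemizer and the google/byt5-small checkpoint are stand-in assumptions for illustration, not the authors' released T5lephone pipeline, and the untuned model's output is not meaningful for SQA or ST.

```python
# Minimal sketch (not the authors' code): phonemicize text and run it through
# a character/byte-level T5, as a rough analogue of the T5lephone input format.
from g2p_en import G2p
from transformers import AutoTokenizer, T5ForConditionalGeneration

g2p = G2p()  # CMUdict-based grapheme-to-phoneme converter

def phonemicize(text: str) -> str:
    # G2p returns a list of ARPAbet phonemes plus word-separator tokens;
    # drop the separators and join the phonemes with spaces.
    return " ".join(p for p in g2p(text) if p.strip())

# Assumed checkpoint: a byte-level T5 as a proxy for a phoneme-level vocabulary.
tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")

phoneme_text = phonemicize("what is the answer to the question")
inputs = tokenizer(phoneme_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(phoneme_text)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```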
