Paper Title

Towards Semi-Supervised Semantics Understanding from Speech

Paper Authors

Cheng-I Lai, Jin Cao, Sravan Bodapati, Shang-Wen Li

Abstract

Much recent work on Spoken Language Understanding (SLU) falls short in at least one of three ways: models were trained on oracle text input and neglected the Automatic Speech Recognition (ASR) outputs, models were trained to predict only intents without the slot values, or models were trained on a large amount of in-house data. We proposed a clean and general framework to learn semantics directly from speech with semi-supervision from transcribed speech to address these issues. Our framework is built upon pretrained end-to-end (E2E) ASR and self-supervised language models, such as BERT, and fine-tuned on a limited amount of target SLU data. In parallel, we identified two inadequate settings under which SLU models have been tested: noise-robustness and E2E semantics evaluation. We tested the proposed framework under realistic environmental noise and with a new metric, the slot edit F1 score, on two public SLU corpora. Experiments show that our SLU framework with speech as input can perform on par with those with oracle text as input in semantics understanding, even when environmental noise is present and only a limited amount of labeled semantics data is available.
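As a rough illustration of the slot-level scoring the abstract refers to, the Python sketch below computes an F1 over predicted (slot type, value) pairs, assuming a prediction counts as correct only when both the type and the value match a reference pair. The function name slots_edit_f1 and the pair-based representation are assumptions made here for illustration; they are a minimal sketch, not necessarily the authors' exact definition of the slot edit F1 score.

```python
from collections import Counter

def slots_edit_f1(references, hypotheses):
    """Illustrative slot-level F1 (not the paper's exact metric).

    Each utterance's slots are treated as a multiset of
    (slot_type, slot_value) pairs; a predicted pair is counted as
    correct only when both the type and the value match a reference pair.
    """
    correct, n_hyp, n_ref = 0, 0, 0
    for ref_slots, hyp_slots in zip(references, hypotheses):
        ref_counts = Counter(ref_slots)
        hyp_counts = Counter(hyp_slots)
        # pairs present in both reference and hypothesis (multiset intersection)
        correct += sum((ref_counts & hyp_counts).values())
        n_hyp += sum(hyp_counts.values())
        n_ref += sum(ref_counts.values())
    precision = correct / n_hyp if n_hyp else 0.0
    recall = correct / n_ref if n_ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy usage: one utterance with one correct slot and one value error.
refs = [[("city", "boston"), ("date", "tomorrow")]]
hyps = [[("city", "boston"), ("date", "today")]]
print(f"slot F1 = {slots_edit_f1(refs, hyps):.2f}")  # 0.50
```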
