Paper Title

Direction-Aware Joint Adaptation of Neural Speech Enhancement and Recognition in Real Multiparty Conversational Environments

Paper Authors

Yicheng Du, Aditya Arie Nugraha, Kouhei Sekiguchi, Yoshiaki Bando, Mathieu Fontaine, Kazuyoshi Yoshii

Paper Abstract

This paper describes noisy speech recognition for an augmented reality headset that helps verbal communication within real multiparty conversational environments. A major approach that has actively been studied in simulated environments is to sequentially perform speech enhancement and automatic speech recognition (ASR) based on deep neural networks (DNNs) trained in a supervised manner. In our task, however, such a pretrained system fails to work due to the mismatch between the training and test conditions and the head movements of the user. To enhance only the utterances of a target speaker, we use beamforming based on a DNN-based speech mask estimator that can adaptively extract the speech components corresponding to a particular head-relative direction. We propose a semi-supervised adaptation method that jointly updates the mask estimator and the ASR model at run-time, using clean speech signals with ground-truth transcriptions and noisy speech signals with highly confident estimated transcriptions. Comparative experiments using a state-of-the-art distant speech recognition system show that the proposed method significantly improves the ASR performance.
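To make the mask-based beamforming step concrete, here is a minimal NumPy sketch. It assumes a standard Souden-style MVDR beamformer driven by a DNN-estimated time-frequency mask; the function name `mvdr_from_mask`, the array shapes, and the toy data are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: mask-driven MVDR beamforming (Souden formulation).
# Assumed interface, not the paper's code.
import numpy as np

def mvdr_from_mask(stft, mask, ref_ch=0, eps=1e-8):
    """stft: (F, T, C) complex mixture STFT; mask: (F, T) speech mask in [0, 1].
    Returns the beamformed single-channel STFT, shape (F, T)."""
    n_freq, n_frames, n_ch = stft.shape
    out = np.zeros((n_freq, n_frames), dtype=stft.dtype)
    for f in range(n_freq):
        X, m = stft[f], mask[f]                       # (T, C), (T,)
        # Mask-weighted spatial covariance matrices of speech and noise.
        Rs = np.einsum('t,ti,tj->ij', m, X, X.conj()) / (m.sum() + eps)
        Rn = np.einsum('t,ti,tj->ij', 1.0 - m, X, X.conj()) / ((1.0 - m).sum() + eps)
        Rn += eps * np.eye(n_ch)                      # diagonal loading
        # Souden MVDR: w = Rn^{-1} Rs u / tr(Rn^{-1} Rs), u = reference channel.
        G = np.linalg.solve(Rn, Rs)
        w = G[:, ref_ch] / (np.trace(G) + eps)
        out[f] = X @ w.conj()                         # y_t = w^H x_t
    return out

# Toy usage with random stand-ins for a real mixture and a DNN mask:
rng = np.random.default_rng(0)
X = rng.standard_normal((257, 50, 4)) + 1j * rng.standard_normal((257, 50, 4))
M = rng.uniform(size=(257, 50))
Y = mvdr_from_mask(X, M)    # (257, 50) enhanced STFT
```

Note that the mask only weights the per-frequency spatial covariance estimates; it is the spatial statistics, not the mask itself, that form the final filter, which is what lets a direction-aware mask steer the beamformer toward the target speaker.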
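The run-time semi-supervised adaptation can be sketched the same way. The toy PyTorch loop below is a hypothetical stand-in, not the authors' system: `TinyMaskEstimator`, `TinyASR`, the frame-level NLL loss, and the average best-token posterior used as a confidence score are all simplifying assumptions. It only illustrates the idea stated in the abstract: both networks are updated jointly on clean speech with ground-truth transcriptions and on noisy speech whose estimated transcription is sufficiently confident.

```python
# Hypothetical sketch of the joint run-time adaptation loop.
import torch
import torch.nn as nn

class TinyMaskEstimator(nn.Module):
    """Stand-in for the DNN speech mask estimator."""
    def __init__(self, n_freq=257):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_freq, n_freq), nn.Sigmoid())
    def forward(self, spec):                  # spec: (T, F) magnitude features
        return self.net(spec)                 # (T, F) mask in [0, 1]

class TinyASR(nn.Module):
    """Stand-in for the ASR model (frame classifier instead of a real decoder)."""
    def __init__(self, n_freq=257, n_tokens=30):
        super().__init__()
        self.net = nn.Linear(n_freq, n_tokens)
    def forward(self, feats):                 # feats: (T, F)
        return self.net(feats).log_softmax(-1)  # (T, n_tokens) log-probs

def adapt_step(mask_net, asr, optim, clean, clean_tokens, noisy, conf_thr=0.9):
    """One joint update on a clean utterance (ground-truth labels) and a
    noisy utterance (pseudo-labels kept only if the ASR is confident)."""
    nll = nn.NLLLoss()
    # Supervised term: clean speech with ground-truth transcription.
    loss = nll(asr(clean), clean_tokens)
    # Enhance the noisy speech (single-channel masking as a proxy for the
    # beamforming front end) and decode it.
    enhanced = mask_net(noisy) * noisy
    logp = asr(enhanced)
    conf = logp.max(-1).values.exp().mean()   # avg best-token posterior
    if conf.item() > conf_thr:
        pseudo = logp.argmax(-1)              # estimated transcription
        loss = loss + nll(logp, pseudo)       # gradients reach both networks
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item(), conf.item()

# Toy usage with random stand-ins for real features and transcriptions:
mask_net, asr = TinyMaskEstimator(), TinyASR()
optim = torch.optim.Adam(list(mask_net.parameters()) + list(asr.parameters()), lr=1e-4)
clean, noisy = torch.rand(100, 257), torch.rand(100, 257)
tokens = torch.randint(0, 30, (100,))
adapt_step(mask_net, asr, optim, clean, tokens, noisy)
```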
