Identify Speakers in Cocktail Parties with End-to-End Attention

Paper Title

Identify Speakers in Cocktail Parties with End-to-End Attention

Authors

Junzhe Zhu, Mark Hasegawa-Johnson, Leda Sari

Abstract

In scenarios where multiple speakers talk at the same time, it is important to be able to identify the talkers accurately. This paper presents an end-to-end system that integrates speech source extraction and speaker identification, and proposes a new way to jointly optimize these two parts by max-pooling the speaker predictions along the channel dimension. Residual attention permits us to learn spectrogram masks that are optimized for the purpose of speaker identification, while residual forward connections permit dilated convolution with a sufficiently large context window to guarantee correct streaming across syllable boundaries. End-to-end training results in a system that recognizes one speaker in a two-speaker broadcast speech mixture with 99.9% accuracy and both speakers with 93.9% accuracy, and that recognizes all speakers in three-speaker scenarios with 81.2% accuracy.
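
The abstract describes max-pooling the speaker predictions along the channel dimension so that extraction and identification can be optimized jointly. The following is a minimal sketch of that pooling step only, not the authors' implementation. It assumes each extracted source channel yields one score per enrolled speaker; the array shapes, the function names `pool_speaker_predictions` and `identify_speakers`, the top-k decision rule, and the example scores are all hypothetical.

```python
import numpy as np


def pool_speaker_predictions(channel_scores: np.ndarray) -> np.ndarray:
    """Max-pool per-channel speaker scores into one score per speaker.

    channel_scores has shape (num_channels, num_speakers); row c holds the
    speaker scores predicted from the c-th extracted source channel
    (an assumed layout, not taken from the paper).
    """
    if channel_scores.ndim != 2:
        raise ValueError("expected shape (num_channels, num_speakers)")
    # A speaker counts as present if any channel assigns them a high score,
    # so the result does not depend on which channel captured which speaker.
    return channel_scores.max(axis=0)


def identify_speakers(channel_scores: np.ndarray, top_k: int) -> list:
    """Return the indices of the top_k speakers after channel pooling."""
    pooled = pool_speaker_predictions(channel_scores)
    order = np.argsort(-pooled, kind="stable")  # highest score first
    return sorted(order[:top_k].tolist())


# Hypothetical scores: 2 extracted channels, 4 enrolled speakers.
scores = np.array([
    [0.1, 2.5, -1.0, 0.3],   # channel 0 mostly matches speaker 1
    [1.9, -0.2, 0.0, 0.4],   # channel 1 mostly matches speaker 0
])
pooled = pool_speaker_predictions(scores)
predicted = identify_speakers(scores, top_k=2)
```

Because the maximum is taken over channels, swapping the rows of `scores` leaves `pooled` unchanged. That permutation invariance is one plausible reason a single pooled prediction can be compared against the set of speakers in the mixture without first matching channels to speakers.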

Paper Title