Paper Title

Multi-Task Network for Noise-Robust Keyword Spotting and Speaker Verification using CTC-based Soft VAD and Global Query Attention

Paper Authors

Myunghun Jung, Youngmoon Jung, Jahyun Goo, Hoirin Kim

Paper Abstract

Keyword spotting (KWS) and speaker verification (SV) have been studied independently, although it is known that the acoustic and speaker domains are complementary. In this paper, we propose a multi-task network that performs KWS and SV simultaneously to fully utilize the interrelated domain information. By introducing the novel techniques of connectionist temporal classification (CTC)-based soft voice activity detection (VAD) and global query attention, the multi-task network tightly combines sub-networks aimed at improving performance in challenging conditions such as noisy environments, open-vocabulary KWS, and short-duration SV. Frame-level acoustic and speaker information is integrated with phonetically originated weights to form a word-level global representation, which is then used to aggregate the feature vectors into discriminative embeddings. Our proposed approach shows relative improvements of 4.06% and 26.71% in equal error rate (EER) over the baselines for the two tasks. We also present a visualization example and results of ablation experiments.
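To make the aggregation idea more concrete, the following is a minimal PyTorch sketch of attention-based pooling in which frame-level features are weighted both by a learnable global query and by soft-VAD speech probabilities derived from CTC posteriors (one minus the blank probability). The layer sizes, the module name SoftVadAttentivePooling, and the exact way the two weights are combined are illustrative assumptions, not the paper's specification.

```python
# Minimal sketch (assumptions, not the authors' exact architecture):
# frame-level features are scored by a global query, reweighted by a
# CTC-derived soft-VAD speech probability, and pooled into one embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftVadAttentivePooling(nn.Module):
    def __init__(self, feat_dim: int, att_dim: int = 128):
        super().__init__()
        # Learnable global query used to score each frame (assumed form).
        self.query = nn.Parameter(torch.randn(att_dim))
        self.key_proj = nn.Linear(feat_dim, att_dim)

    def forward(self, feats: torch.Tensor, ctc_log_probs: torch.Tensor,
                blank_id: int = 0) -> torch.Tensor:
        # feats: (B, T, D) frame-level features
        # ctc_log_probs: (B, T, V) CTC output log-probabilities
        # Soft VAD: P(speech at frame t) = 1 - P(blank at frame t).
        speech_prob = 1.0 - ctc_log_probs[..., blank_id].exp()      # (B, T)

        # Global-query attention scores over frames.
        keys = torch.tanh(self.key_proj(feats))                     # (B, T, A)
        scores = keys @ self.query                                  # (B, T)

        # Combine attention with soft-VAD weights and renormalize.
        weights = F.softmax(scores, dim=-1) * speech_prob
        weights = weights / weights.sum(dim=-1, keepdim=True).clamp_min(1e-8)

        # Weighted aggregation into a single word-level embedding.
        return (weights.unsqueeze(-1) * feats).sum(dim=1)           # (B, D)


# Toy usage with random tensors.
if __name__ == "__main__":
    pool = SoftVadAttentivePooling(feat_dim=64)
    feats = torch.randn(2, 50, 64)
    ctc_log_probs = torch.randn(2, 50, 30).log_softmax(dim=-1)
    print(pool(feats, ctc_log_probs).shape)  # torch.Size([2, 64])
```

The sketch only illustrates how phonetically derived weights can shape the pooling of frame-level vectors into a single embedding; in the paper, the CTC-based soft VAD and global query attention are integrated within the multi-task network itself.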
