Paper Title

Multi-Task Network for Noise-Robust Keyword Spotting and Speaker Verification using CTC-based Soft VAD and Global Query Attention

Paper Authors

Myunghun Jung, Youngmoon Jung, Jahyun Goo, Hoirin Kim

Paper Abstract

Keyword spotting (KWS) and speaker verification (SV) have been studied independently, although it is known that the acoustic and speaker domains are complementary. In this paper, we propose a multi-task network that performs KWS and SV simultaneously to fully utilize the interrelated domain information. By introducing the novel techniques of connectionist temporal classification (CTC)-based soft voice activity detection (VAD) and global query attention, the multi-task network tightly combines sub-networks aimed at improving performance in challenging conditions such as noisy environments, open-vocabulary KWS, and short-duration SV. Frame-level acoustic and speaker information is integrated with phonetically originated weights to form a word-level global representation, which is then used to aggregate the feature vectors into discriminative embeddings. Our proposed approach shows relative improvements of 4.06% and 26.71% in equal error rate (EER) over the baselines for the two tasks. We also present a visualization example and results of ablation experiments.
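To make the aggregation idea more concrete, the following is a minimal PyTorch sketch of attention-based pooling in which frame-level features are weighted both by a learnable global query and by soft-VAD speech probabilities derived from CTC posteriors (one minus the blank probability). The layer sizes, the module name SoftVadAttentivePooling, and the exact way the two weights are combined are illustrative assumptions, not the paper's specification.

```python
# Minimal sketch (assumptions, not the authors' exact architecture):
# frame-level features are scored by a global query, reweighted by a
# CTC-derived soft-VAD speech probability, and pooled into one embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftVadAttentivePooling(nn.Module):
    def __init__(self, feat_dim: int, att_dim: int = 128):
        super().__init__()
        # Learnable global query used to score each frame (assumed form).
        self.query = nn.Parameter(torch.randn(att_dim))
        self.key_proj = nn.Linear(feat_dim, att_dim)

    def forward(self, feats: torch.Tensor, ctc_log_probs: torch.Tensor,
                blank_id: int = 0) -> torch.Tensor:
        # feats: (B, T, D) frame-level features
        # ctc_log_probs: (B, T, V) CTC output log-probabilities
        # Soft VAD: P(speech at frame t) = 1 - P(blank at frame t).
        speech_prob = 1.0 - ctc_log_probs[..., blank_id].exp()      # (B, T)

        # Global-query attention scores over frames.
        keys = torch.tanh(self.key_proj(feats))                     # (B, T, A)
        scores = keys @ self.query                                  # (B, T)

        # Combine attention with soft-VAD weights and renormalize.
        weights = F.softmax(scores, dim=-1) * speech_prob
        weights = weights / weights.sum(dim=-1, keepdim=True).clamp_min(1e-8)

        # Weighted aggregation into a single word-level embedding.
        return (weights.unsqueeze(-1) * feats).sum(dim=1)           # (B, D)


# Toy usage with random tensors.
if __name__ == "__main__":
    pool = SoftVadAttentivePooling(feat_dim=64)
    feats = torch.randn(2, 50, 64)
    ctc_log_probs = torch.randn(2, 50, 30).log_softmax(dim=-1)
    print(pool(feats, ctc_log_probs).shape)  # torch.Size([2, 64])
```

The sketch only illustrates how phonetically derived weights can shape the pooling of frame-level vectors into a single embedding; in the paper, the CTC-based soft VAD and global query attention are integrated within the multi-task network itself.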
