Paper title
Multi-Task Network for Noise-Robust Keyword Spotting and Speaker Verification using CTC-based Soft VAD and Global Query Attention
Authors
Abstract
Keyword spotting (KWS) and speaker verification (SV) have been studied independently, although the acoustic and speaker domains are known to be complementary. In this paper, we propose a multi-task network that performs KWS and SV simultaneously to fully utilize the interrelated domain information. The multi-task network tightly combines sub-networks aimed at improving performance in challenging conditions such as noisy environments, open-vocabulary KWS, and short-duration SV, by introducing two novel techniques: connectionist temporal classification (CTC)-based soft voice activity detection (VAD) and global query attention. Frame-level acoustic and speaker information is integrated with phonetically originated weights to form a word-level global representation, which is then used to aggregate feature vectors into discriminative embeddings. Our proposed approach shows 4.06% and 26.71% relative improvements in equal error rate (EER) over the baselines on the two tasks. We also present a visualization example and the results of ablation experiments.
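The pooling idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: the soft VAD weight is taken as one minus the CTC blank posterior, the global query is a weighted mean of concatenated acoustic and speaker frames, and the aggregation is plain scaled dot-product attention. The function names `soft_vad_weights`, `global_query`, and `attentive_pool` are hypothetical and are not taken from the paper.

```python
import math


def soft_vad_weights(ctc_posteriors, blank_index=0):
    """Per-frame speech weight, assumed here to be 1 - P(blank)."""
    return [1.0 - frame[blank_index] for frame in ctc_posteriors]


def global_query(acoustic, speaker, vad_weights):
    """Weighted mean over frames of concatenated acoustic and speaker features."""
    total = sum(vad_weights) or 1.0  # avoid dividing by zero on all-blank input
    dim = len(acoustic[0]) + len(speaker[0])
    query = [0.0] * dim
    for a, s, w in zip(acoustic, speaker, vad_weights):
        for i, value in enumerate(list(a) + list(s)):
            query[i] += (w / total) * value
    return query


def attentive_pool(frames, keys, query):
    """Softmax attention over frames; scores come from the dot product keys . query."""
    scale = math.sqrt(len(query))
    scores = [sum(k * q for k, q in zip(key, query)) / scale for key in keys]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    alpha = [e / norm for e in exps]
    dim = len(frames[0])
    pooled = [sum(a * f[i] for a, f in zip(alpha, frames)) for i in range(dim)]
    return pooled, alpha


# Toy input: 3 frames, 2 CTC classes (index 0 is the assumed blank symbol).
ctc_post = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]
acoustic = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
speaker = [[0.5], [1.0], [1.5]]

weights = soft_vad_weights(ctc_post)
query = global_query(acoustic, speaker, weights)
keys = [a + s for a, s in zip(acoustic, speaker)]
embedding, alpha = attentive_pool(speaker, keys, query)
```

In this toy setup the first frame is mostly blank, so it contributes little to the query, and the pooled embedding leans toward the frames that the CTC posteriors mark as speech. A trained system would use learned projections for the keys and query rather than the raw features used here.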