Paper Title
Best of Both Worlds: Multi-task Audio-Visual Automatic Speech Recognition and Active Speaker Detection
Paper Authors
Paper Abstract
Under noisy conditions, automatic speech recognition (ASR) can greatly benefit from the addition of visual signals coming from a video of the speaker's face. However, when multiple candidate speakers are visible, this traditionally requires solving a separate problem, namely active speaker detection (ASD), which entails selecting at each moment in time which of the visible faces corresponds to the audio. Recent work has shown that both problems can be solved simultaneously by employing an attention mechanism over the competing video tracks of the speakers' faces, at the cost of some accuracy on active speaker detection. This work closes that gap in active speaker detection accuracy by presenting a single model that can be jointly trained with a multi-task loss. By combining the two tasks during training, we reduce the ASD classification error by approximately 25%, while simultaneously improving ASR performance compared to a multi-person baseline trained exclusively for ASR.
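The core idea in the abstract — attending over competing face tracks so that the attention weights double as an ASD posterior, then training with a joint loss — can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's actual architecture; the dot-product scoring, the cross-entropy form of the ASD term, and the weighting factor `lam` are all assumptions introduced here for clarity.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend_over_tracks(audio_feat, face_feats):
    """Soft-select among candidate face tracks using the audio as query.

    audio_feat: (D,) audio feature for the current frame
    face_feats: (S, D) one visual feature per candidate face track
    Returns the attention-pooled visual feature and the attention
    weights, which can be read directly as ASD probabilities.
    """
    scores = face_feats @ audio_feat        # (S,) audio-visual match scores
    weights = softmax(scores)               # (S,) sums to 1 over tracks
    visual_feat = weights @ face_feats      # (D,) weighted visual feature
    return visual_feat, weights

def multitask_loss(asr_loss, asd_weights, active_idx, lam=0.5):
    """Joint objective: ASR loss plus a supervised ASD term.

    The ASD term is a cross-entropy that pushes the attention weight
    of the ground-truth active speaker (active_idx) toward 1.
    lam is a hypothetical task-balancing hyperparameter.
    """
    asd_loss = -np.log(asd_weights[active_idx] + 1e-9)
    return asr_loss + lam * asd_loss
```

With this framing, the ASR branch consumes `visual_feat` alongside the audio, while the same `weights` are supervised for ASD — which is why a single model can serve both tasks without a separate detection stage.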