Title
Cross-Attention is all you need: Real-Time Streaming Transformers for Personalised Speech Enhancement
Authors
Abstract
Personalised speech enhancement (PSE), which extracts only the speech of a target user and removes everything else from a recorded audio clip, can potentially improve users' experiences of audio AI modules deployed in the wild. To support a large variety of downstream audio tasks, such as real-time ASR and audio-call enhancement, a PSE solution should operate in a streaming mode, i.e., input audio cleaning should happen in real-time with a small latency and real-time factor. Personalisation is typically achieved by extracting a target speaker's voice profile from an enrolment audio, in the form of a static embedding vector, and then using it to condition the output of a PSE model. However, a fixed target speaker embedding may not be optimal under all conditions. In this work, we present a streaming Transformer-based PSE model and propose a novel cross-attention approach that gives adaptive target speaker representations. We present extensive experiments and show that our proposed cross-attention approach outperforms competitive baselines consistently, even when our model is only approximately half the size.
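To make the contrast with a static speaker embedding concrete, the following is a minimal, hypothetical sketch of the cross-attention idea described above: instead of conditioning every frame on one fixed embedding (e.g., the mean of the enrolment features), each incoming mixture frame attends over frame-level enrolment features to obtain its own adaptive speaker representation. The function and variable names are illustrative, not the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_speaker_embedding(mixture_frames, enrol_frames, d_k):
    """Cross-attention sketch (illustrative, not the paper's exact model).

    mixture_frames: (T, d) features of the noisy streaming input (queries)
    enrol_frames:   (E, d) frame-level features of the enrolment clip
                    (keys and values)
    Returns one speaker vector per input frame, shape (T, d).
    """
    scores = mixture_frames @ enrol_frames.T / np.sqrt(d_k)  # (T, E)
    weights = softmax(scores, axis=-1)                       # rows sum to 1
    return weights @ enrol_frames                            # (T, d)

def static_speaker_embedding(enrol_frames, num_frames):
    # Baseline: one fixed embedding (mean pooling) repeated for all frames.
    return np.tile(enrol_frames.mean(axis=0), (num_frames, 1))
```

Each output row of `adaptive_speaker_embedding` is a convex combination of the enrolment frames, so the conditioning signal can shift per frame with the acoustic context, whereas the static baseline repeats the same vector for every frame.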