Paper Title
An Adapter based Multi-label Pre-training for Speech Separation and Enhancement
Paper Authors
Paper Abstract
In recent years, self-supervised learning (SSL) has achieved tremendous success in various speech tasks due to its power to extract representations from massive unlabeled data. However, compared with tasks such as speech recognition (ASR), the improvements from SSL representations in speech separation (SS) and enhancement (SE) are considerably smaller. Based on HuBERT, this work investigates improving the SSL model for SS and SE. We first update HuBERT's masked speech prediction (MSP) objective by integrating separation and denoising terms, resulting in a multiple-pseudo-label pre-training scheme, which significantly improves HuBERT's performance on SS and SE but degrades its performance on ASR. To maintain its performance gain on ASR, we further propose an adapter-based architecture for HuBERT's Transformer encoder, where only a few parameters of each layer are adjusted for the multiple-pseudo-label MSP while the other parameters remain frozen as in default HuBERT. Experimental results show that our proposed adapter-based multiple-pseudo-label HuBERT yields consistent and significant performance improvements on SE, SS, and ASR tasks, with a faster pre-training speed and only a marginal increase in parameters.
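To make the two ideas in the abstract concrete, below is a minimal PyTorch-style sketch, not the authors' implementation: (1) a small bottleneck adapter inserted into each frozen Transformer layer, so only a few parameters per layer are trained, and (2) multiple prediction heads so the masked-speech-prediction loss can be computed against several pseudo-label streams (e.g. separated and denoised targets). The class names (BottleneckAdapter, AdaptedTransformerLayer, MultiLabelMSPHead), the bottleneck size, and the loss averaging are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BottleneckAdapter(nn.Module):
    """Down-project -> nonlinearity -> up-project, with a residual connection.

    Only these few parameters are trained; the surrounding layer stays frozen.
    """

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        # Zero-init the up-projection so the adapted layer initially equals
        # the pre-trained layer (a common adapter design choice, assumed here).
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(F.gelu(self.down(x)))


class AdaptedTransformerLayer(nn.Module):
    """Wraps a pre-trained (frozen) Transformer layer with a trainable adapter."""

    def __init__(self, frozen_layer: nn.Module, dim: int, bottleneck: int = 64):
        super().__init__()
        self.layer = frozen_layer
        for p in self.layer.parameters():
            p.requires_grad = False  # keep the default HuBERT weights fixed
        self.adapter = BottleneckAdapter(dim, bottleneck)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Assumes the wrapped layer maps (B, T, dim) -> (B, T, dim).
        return self.adapter(self.layer(x))


class MultiLabelMSPHead(nn.Module):
    """One projection per pseudo-label stream for masked speech prediction."""

    def __init__(self, dim: int, num_clusters: int, num_streams: int = 2):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(dim, num_clusters) for _ in range(num_streams)]
        )

    def loss(self, hidden, targets, mask):
        # hidden: (B, T, dim) encoder outputs
        # targets: list of (B, T) long tensors of cluster ids, one per stream
        # mask: (B, T) bool tensor marking the masked frames
        total = hidden.new_zeros(())
        for head, tgt in zip(self.heads, targets):
            logits = head(hidden)[mask]            # predictions at masked positions only
            total = total + F.cross_entropy(logits, tgt[mask])
        return total / len(self.heads)
```

In this sketch the multiple-pseudo-label objective is simply an average of per-stream cross-entropy terms over masked frames, and the zero-initialized up-projection keeps the adapted encoder identical to the frozen HuBERT at the start of adaptation; both are assumptions about how such a scheme could be wired up, not the paper's exact formulation.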