Paper title
A comparison of handcrafted, parameterized, and learnable features for speech separation
Paper authors
Paper abstract
The design of acoustic features is important for speech separation. Acoustic features can be roughly categorized into three classes: handcrafted, parameterized, and learnable. Among them, learnable features, which are trained jointly with the separation network in an end-to-end fashion, have become a new trend in modern speech separation research, e.g. the convolutional time-domain audio separation network (Conv-TasNet), while handcrafted and parameterized features have also been shown to be competitive in very recent studies. However, a systematic comparison across the three kinds of acoustic features has not yet been conducted. In this paper, we compare them within the Conv-TasNet framework by setting its encoder and decoder to different acoustic features. We also generalize the handcrafted multi-phase gammatone filterbank (MPGTF) to a new parameterized multi-phase gammatone filterbank (ParaMPGTF). Experimental results on the WSJ0-2mix corpus show that (i) if the decoder is learnable, then setting the encoder to STFT, MPGTF, ParaMPGTF, or learnable features leads to similar performance; and (ii) when the pseudo-inverse transforms of STFT, MPGTF, and ParaMPGTF are used as the decoders, the proposed ParaMPGTF performs better than the two handcrafted features.
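To make the "fixed encoder with pseudo-inverse decoder" setting in the abstract concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's implementation. The encoder is a real-valued, STFT-like filterbank built from DFT cosine and sine kernels, and the decoder is its Moore-Penrose pseudo-inverse. The function names, the window length of 16, the hop of 8, and the random test signal are all hypothetical choices. The gammatone filterbanks (MPGTF, ParaMPGTF) and the separation network are not modeled here; in a real system the network would apply masks between the encoder and the decoder.

```python
import numpy as np

def real_dft_filterbank(win_len):
    """Hypothetical STFT-like encoder: rows are cosine/sine DFT kernels."""
    n = np.arange(win_len)
    k_cos = np.arange(win_len // 2 + 1)        # includes DC (and Nyquist if even)
    k_sin = np.arange(1, (win_len + 1) // 2)   # skips the all-zero sine rows
    cos_rows = np.cos(2 * np.pi * np.outer(k_cos, n) / win_len)
    sin_rows = np.sin(2 * np.pi * np.outer(k_sin, n) / win_len)
    return np.vstack([cos_rows, sin_rows])     # shape (win_len, win_len)

def frame(signal, win_len, hop):
    """Slice a 1-D signal into overlapping columns of length win_len."""
    n_frames = 1 + (len(signal) - win_len) // hop
    idx = np.arange(win_len)[:, None] + hop * np.arange(n_frames)[None, :]
    return signal[idx]                         # shape (win_len, n_frames)

def overlap_add(frames, hop):
    """Average overlapping frames back into a 1-D signal."""
    win_len, n_frames = frames.shape
    out = np.zeros(win_len + hop * (n_frames - 1))
    count = np.zeros_like(out)
    for t in range(n_frames):
        out[t * hop:t * hop + win_len] += frames[:, t]
        count[t * hop:t * hop + win_len] += 1.0
    return out / count

win_len, hop = 16, 8                           # assumed values, for illustration only
rng = np.random.default_rng(0)
x = rng.standard_normal(160)                   # stand-in for a speech mixture

U = real_dft_filterbank(win_len)               # fixed (handcrafted) encoder
V = np.linalg.pinv(U)                          # pseudo-inverse decoder

W = U @ frame(x, win_len, hop)                 # encoded representation
# A separation network would estimate per-speaker masks on W here;
# this sketch passes W through unchanged (identity mask).
x_hat = overlap_add(V @ W, hop)

print(np.allclose(x_hat, x))
```

Because this filterbank has full column rank, decoding with the pseudo-inverse recovers the framed signal when no mask is applied. A learnable decoder would instead replace `V` with a trainable matrix of the same shape.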