Paper Title
An Empirical Analysis on the Vulnerabilities of End-to-End Speech Segregation Models
Paper Authors
Paper Abstract
End-to-end learning models have demonstrated a remarkable capability in performing speech segregation. Despite their wide scope of real-world applications, little is known about the mechanisms they employ to group and consequently segregate individual speakers. Knowing that harmonicity is a critical cue for these networks to group sources, in this work we perform a thorough investigation of ConvTasnet and DPT-Net to analyze how they perform a harmonic analysis of the input mixture. We perform ablation studies in which we apply low-pass, high-pass, and band-stop filters of varying pass-bands to empirically analyze the harmonics most critical for segregation. We also investigate how these networks decide which output channel to assign to an estimated source by introducing discontinuities in synthetic mixtures. We find that end-to-end networks are highly unstable and perform poorly when confronted with deformations that are imperceptible to humans. Replacing the encoder in these networks with a spectrogram leads to lower overall performance but much higher stability. This work helps us understand what information these networks rely on for speech segregation, and exposes two sources of generalization errors. It also pinpoints the encoder as the part of the network responsible for these errors, allowing for a redesign with expert knowledge or transfer learning.
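The filtering ablation described in the abstract can be illustrated with a toy sketch. This is not the authors' code. It assumes a synthetic harmonic complex with 1/k partial amplitudes and ideal (brick-wall) filters, for which filtering reduces to keeping or dropping individual partials. The function names `harmonic_partials`, `ideal_filter`, and `synthesize`, and all parameter values, are hypothetical.

```python
import math


def harmonic_partials(f0_hz, n_harmonics):
    """Return (frequency_hz, amplitude) pairs for a harmonic complex.

    Assumption: amplitudes fall off as 1/k; the paper does not specify this.
    """
    return [(k * f0_hz, 1.0 / k) for k in range(1, n_harmonics + 1)]


def ideal_filter(partials, kind, cutoffs_hz):
    """Apply an ideal (brick-wall) filter to a list of partials.

    kind == "lowpass":  keep partials with f <= cutoff
    kind == "highpass": keep partials with f >= cutoff
    kind == "bandstop": drop partials with lo <= f <= hi
    """
    if kind == "lowpass":
        (cutoff,) = cutoffs_hz
        return [(f, a) for f, a in partials if f <= cutoff]
    if kind == "highpass":
        (cutoff,) = cutoffs_hz
        return [(f, a) for f, a in partials if f >= cutoff]
    if kind == "bandstop":
        lo, hi = cutoffs_hz
        return [(f, a) for f, a in partials if not (lo <= f <= hi)]
    raise ValueError(f"unknown filter kind: {kind!r}")


def synthesize(partials, sample_rate_hz, n_samples):
    """Render the partials as a list of waveform samples (sum of sines)."""
    return [
        sum(a * math.sin(2.0 * math.pi * f * n / sample_rate_hz) for f, a in partials)
        for n in range(n_samples)
    ]


# Hypothetical example: a 200 Hz voice-like source with 10 harmonics.
source = harmonic_partials(200.0, 10)
ablated = ideal_filter(source, "bandstop", (500.0, 1100.0))
waveform = synthesize(ablated, sample_rate_hz=8000, n_samples=80)
```

In an actual ablation, each filtered waveform would be mixed with a second source and passed to the separation network, and the separation quality would be compared across cutoff choices. Real recordings would need a proper digital filter design instead of this partial-dropping shortcut.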