Single microphone speaker extraction using unified time-frequency Siamese-Unet

Title

Single microphone speaker extraction using unified time-frequency Siamese-Unet

Authors

Eisenberg, Aviad; Gannot, Sharon; Chazan, Shlomo E.

Abstract

In this paper, we present a unified time-frequency method for speaker extraction in clean and noisy conditions. Given a mixed signal and a reference signal, the common approaches for extracting the desired speaker are applied either in the time domain or in the frequency domain. In our approach, we propose a Siamese-Unet architecture that uses both representations. The Siamese encoders are applied in the frequency domain to infer the embeddings of the noisy and reference spectra, respectively. The concatenated representations are then fed into the decoder to estimate the real and imaginary components of the desired speaker, which are then inverse-transformed to the time domain. The model is trained with the Scale-Invariant Signal-to-Distortion Ratio (SI-SDR) loss to exploit the time-domain information. The time-domain loss is also regularized with a frequency-domain loss to preserve the speech patterns. Experimental results demonstrate that the unified approach is not only very easy to train, but also provides superior results compared with state-of-the-art (SOTA) Blind Source Separation (BSS) methods, as well as a commonly used speaker extraction approach.
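
As an illustration of the time-domain objective named in the abstract, the following is a minimal NumPy sketch of SI-SDR under its commonly used definition (project the estimate onto the target, then compare the energy of the projection with the energy of the residual). The function names `si_sdr` and `si_sdr_loss`, the mean removal, and the `eps` stabilizer are assumptions made for this example. The abstract does not specify the authors' implementation or the form of the frequency-domain regularizer, so neither is reproduced here.

```python
import numpy as np


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB for two 1-D signals of equal length.

    Assumed convention: both signals are made zero-mean first, and `eps`
    guards against division by zero. These are illustrative choices.
    """
    estimate = np.asarray(estimate, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)

    # Remove the DC offset from both signals.
    estimate = estimate - estimate.mean()
    target = target - target.mean()

    # Orthogonal projection of the estimate onto the target.
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = alpha * target

    # Everything in the estimate that the scaled target does not explain.
    residual = estimate - projection

    ratio = (np.dot(projection, projection) + eps) / (np.dot(residual, residual) + eps)
    return 10.0 * np.log10(ratio)


def si_sdr_loss(estimate, target):
    """Negative SI-SDR, so that minimizing the loss maximizes SI-SDR."""
    return -si_sdr(estimate, target)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.standard_normal(16000)
    estimate = target + 0.1 * rng.standard_normal(16000)
    print("SI-SDR (dB):", si_sdr(estimate, target))
    # Rescaling the estimate leaves the score essentially unchanged.
    print("SI-SDR of 3x estimate (dB):", si_sdr(3.0 * estimate, target))
```

Because the estimate is projected onto the target before the ratio is taken, multiplying the estimate by any nonzero gain leaves the score essentially unchanged, which is the "scale-invariant" property the metric is named for.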
