Paper Title
UX-NET: Filter-and-Process-based Improved U-Net for Real-time Time-domain Audio Separation
Paper Authors
Paper Abstract
This study presents UX-Net, a time-domain audio separation network (TasNet) based on a modified U-Net architecture. The proposed UX-Net works in real time and handles either single- or multi-microphone input. Inspired by the filter-and-process behavior of human hearing, the proposed system introduces novel mixer and separation modules, which yield cost- and memory-efficient modeling of speech sources. The mixer module combines the encoded input in a latent feature space and outputs a desired number of output streams. Then, in the separation module, a modified U-Net (UX) block is applied. The UX block first filters the encoded input at various resolutions, then aggregates the filtered information and applies recurrent processing to estimate masks of the separated sources. The letter 'X' in UX-Net is a placeholder for the type of recurrent layer employed in the UX block. Empirical findings on the WSJ0-2mix benchmark dataset show that one of the UX-Net configurations outperforms the state-of-the-art Conv-TasNet system by 0.85 dB SI-SNR while using only 16% of the model parameters and 58% fewer computations, and while maintaining low latency.
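
The abstract reports its gain in SI-SNR (scale-invariant signal-to-noise ratio) without defining the metric. As a reading aid, the following is a minimal sketch of the commonly used SI-SNR definition in plain Python. The function name `si_snr`, the `eps` stabilizer, and the list-based signal representation are assumptions made for illustration; this is not the authors' evaluation code, and the abstract does not state which evaluation script was used.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length 1-D signals.

    Illustrative helper (hypothetical name); follows the common definition
    based on projecting the zero-mean estimate onto the zero-mean target.
    """
    if len(estimate) != len(target):
        raise ValueError("signals must have the same length")
    n = len(target)

    # Remove the mean of each signal so a DC offset does not affect the score.
    est_mean = sum(estimate) / n
    tgt_mean = sum(target) / n
    est = [x - est_mean for x in estimate]
    tgt = [x - tgt_mean for x in target]

    # Project the estimate onto the target: the part of the estimate that is
    # a scaled copy of the target.
    dot = sum(e * t for e, t in zip(est, tgt))
    tgt_energy = sum(t * t for t in tgt) + eps
    scale = dot / tgt_energy
    s_target = [scale * t for t in tgt]

    # Everything left over after the projection is treated as noise.
    e_noise = [e - s for e, s in zip(est, s_target)]

    signal_energy = sum(s * s for s in s_target)
    noise_energy = sum(v * v for v in e_noise)
    return 10.0 * math.log10((signal_energy + eps) / (noise_energy + eps))
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, which is why the metric suits separation systems whose output gain is arbitrary. Under this definition, a 0.85 dB difference between two systems corresponds to a ratio of roughly 1.2 in signal-to-noise energy.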