Speaker-conditioning Single-channel Target Speaker Extraction using Conformer-based Architectures

Paper title

Speaker-conditioning Single-channel Target Speaker Extraction using Conformer-based Architectures

Authors

Ragini Sinha, Marvin Tammen, Christian Rollwage, Simon Doclo

Abstract

Target speaker extraction aims at extracting the target speaker from a mixture of multiple speakers by exploiting auxiliary information about the target speaker. In this paper, we consider a complete time-domain target speaker extraction system consisting of a speaker embedder network and a speaker separator network, which are jointly trained in an end-to-end learning process. We propose two different architectures for the speaker separator network, both based on the convolution-augmented transformer (conformer). The first architecture uses stacks of conformer and external feed-forward blocks (Conformer-FFN), while the second architecture uses stacks of temporal convolutional network (TCN) and conformer blocks (TCN-Conformer). Experimental results for 2-speaker mixtures, 3-speaker mixtures, and noisy mixtures of 2 speakers show that, among the proposed separator networks, the TCN-Conformer significantly improves the target speaker extraction performance compared to the Conformer-FFN and a TCN-based baseline system.
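
For readers unfamiliar with the conformer block named in the abstract, the following is a minimal PyTorch sketch of a generic conformer block: two half-step feed-forward modules around a self-attention module and a convolution module. It is an illustration only, not the authors' separator network. The class names, the feature dimension, the number of heads, and the kernel size are all assumptions, and the relative positional encoding and dropout of the original conformer design are left out for brevity.

```python
import torch
from torch import nn


class FeedForward(nn.Module):
    """Pre-norm feed-forward module (expansion factor is an assumption)."""

    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, dim * expansion),
            nn.SiLU(),
            nn.Linear(dim * expansion, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class ConvModule(nn.Module):
    """Pointwise conv + GLU, depthwise conv, batch norm, pointwise conv."""

    def __init__(self, dim: int, kernel_size: int = 15):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.pointwise_in = nn.Conv1d(dim, 2 * dim, kernel_size=1)
        self.glu = nn.GLU(dim=1)  # halves the channel dimension back to dim
        self.depthwise = nn.Conv1d(
            dim, dim, kernel_size, padding=kernel_size // 2, groups=dim
        )
        self.bn = nn.BatchNorm1d(dim)
        self.act = nn.SiLU()
        self.pointwise_out = nn.Conv1d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim); Conv1d expects (batch, dim, time)
        y = self.norm(x).transpose(1, 2)
        y = self.glu(self.pointwise_in(y))
        y = self.act(self.bn(self.depthwise(y)))
        return self.pointwise_out(y).transpose(1, 2)


class ConformerBlock(nn.Module):
    """Generic conformer block; all hyperparameters here are hypothetical."""

    def __init__(self, dim: int = 64, heads: int = 4, kernel_size: int = 15):
        super().__init__()
        self.ffn1 = FeedForward(dim)
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = ConvModule(dim, kernel_size)
        self.ffn2 = FeedForward(dim)
        self.out_norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + 0.5 * self.ffn1(x)  # first half-step feed-forward
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]  # self-attention
        x = x + self.conv(x)  # convolution module
        x = x + 0.5 * self.ffn2(x)  # second half-step feed-forward
        return self.out_norm(x)
```

The block maps a tensor of shape (batch, time, dim) to a tensor of the same shape, so it can in principle be stacked or interleaved with other shape-preserving blocks such as TCN blocks. How the paper combines them, and how the speaker embedding conditions the separator, is not specified in the abstract and is not modeled here.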
