Paper Title
Towards Listening to 10 People Simultaneously: An Efficient Permutation Invariant Training of Audio Source Separation Using Sinkhorn's Algorithm
Paper Authors
Paper Abstract
In neural network-based monaural speech separation, it has recently become common to evaluate the loss with the permutation invariant training (PIT) criterion. However, the ordinary PIT requires trying all $N!$ permutations between the $N$ ground truths and the $N$ estimates. Since this factorial complexity explodes rapidly as $N$ increases, PIT-based training is feasible only when the number of source signals is small, such as $N = 2$ or $3$. To overcome this limitation, this paper proposes SinkPIT, a novel variant of the PIT loss that is much more efficient than the ordinary PIT loss when $N$ is large. SinkPIT is based on Sinkhorn's matrix balancing algorithm, which efficiently finds a doubly stochastic matrix that approximates the best permutation in a differentiable manner. The author conducted an experiment in which a neural network model was trained to decompose a single-channel mixture into 10 sources using SinkPIT, and obtained promising results.
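To make the idea concrete, the sketch below shows one way Sinkhorn balancing can turn an $N \times N$ pairwise loss matrix into a differentiable surrogate for the best-permutation loss. This is only an illustrative sketch, not the paper's reference implementation: the framework (PyTorch), the function name `sinkhorn_pit_loss`, and the hyperparameters `beta` and `n_iters` are assumptions made for this example.

```python
import torch

def sinkhorn_pit_loss(pairwise_loss: torch.Tensor,
                      n_iters: int = 10,
                      beta: float = 10.0) -> torch.Tensor:
    """Differentiable approximation of the PIT loss via Sinkhorn balancing.

    pairwise_loss: (N, N) matrix where entry (i, j) is the loss between
    ground-truth source i and estimated source j.
    """
    # Gibbs kernel in the log domain: smaller pairwise loss -> larger weight.
    log_q = -beta * pairwise_loss
    # Alternate row and column normalization; this drives exp(log_q)
    # toward a doubly stochastic matrix (Sinkhorn's matrix balancing).
    for _ in range(n_iters):
        log_q = log_q - torch.logsumexp(log_q, dim=1, keepdim=True)  # rows sum to 1
        log_q = log_q - torch.logsumexp(log_q, dim=0, keepdim=True)  # cols sum to 1
    q = log_q.exp()
    # Soft-assignment loss: inner product of the balanced matrix with the
    # pairwise losses, averaged over the N sources.
    return (q * pairwise_loss).sum() / pairwise_loss.shape[0]

if __name__ == "__main__":
    n = 10
    # Hypothetical pairwise losses, e.g. negative SI-SDR between pairs.
    losses = torch.rand(n, n, requires_grad=True)
    loss = sinkhorn_pit_loss(losses)
    loss.backward()  # gradients flow through the Sinkhorn iterations
```

Each Sinkhorn iteration costs $O(N^2)$, so the overall cost grows polynomially in $N$, in contrast to the $N!$ enumeration required by exact PIT.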