Paper Title
Towards Listening to 10 People Simultaneously: An Efficient Permutation Invariant Training of Audio Source Separation Using Sinkhorn's Algorithm
Paper Authors
Paper Abstract
In neural network-based monaural speech separation, it has recently become common to evaluate the loss with the permutation invariant training (PIT) criterion. However, the ordinary PIT requires trying all $N!$ permutations between the $N$ ground truths and the $N$ estimates. Since this factorial complexity explodes rapidly as $N$ increases, PIT-based training is feasible only when the number of source signals is small, such as $N = 2$ or $3$. To overcome this limitation, this paper proposes SinkPIT, a novel variant of the PIT loss that is much more efficient than the ordinary PIT loss when $N$ is large. SinkPIT is based on Sinkhorn's matrix balancing algorithm, which efficiently finds a doubly stochastic matrix that approximates the best permutation in a differentiable manner. The author conducted an experiment in which a neural network model was trained to decompose a single-channel mixture into 10 sources using SinkPIT, and obtained promising results.
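To make the idea concrete, the sketch below shows one way Sinkhorn balancing can turn an $N \times N$ pairwise loss matrix into a differentiable surrogate for the best-permutation loss. This is only an illustrative sketch, not the paper's reference implementation: the framework (PyTorch), the function name `sinkhorn_pit_loss`, and the hyperparameters `beta` and `n_iters` are assumptions made for this example.

```python
import torch

def sinkhorn_pit_loss(pairwise_loss: torch.Tensor,
                      n_iters: int = 10,
                      beta: float = 10.0) -> torch.Tensor:
    """Differentiable approximation of the PIT loss via Sinkhorn balancing.

    pairwise_loss: (N, N) matrix where entry (i, j) is the loss between
    ground-truth source i and estimated source j.
    """
    # Gibbs kernel in the log domain: smaller pairwise loss -> larger weight.
    log_q = -beta * pairwise_loss
    # Alternate row and column normalization; this drives exp(log_q)
    # toward a doubly stochastic matrix (Sinkhorn's matrix balancing).
    for _ in range(n_iters):
        log_q = log_q - torch.logsumexp(log_q, dim=1, keepdim=True)  # rows sum to 1
        log_q = log_q - torch.logsumexp(log_q, dim=0, keepdim=True)  # cols sum to 1
    q = log_q.exp()
    # Soft-assignment loss: inner product of the balanced matrix with the
    # pairwise losses, averaged over the N sources.
    return (q * pairwise_loss).sum() / pairwise_loss.shape[0]

if __name__ == "__main__":
    n = 10
    # Hypothetical pairwise losses, e.g. negative SI-SDR between pairs.
    losses = torch.rand(n, n, requires_grad=True)
    loss = sinkhorn_pit_loss(losses)
    loss.backward()  # gradients flow through the Sinkhorn iterations
```

Each Sinkhorn iteration costs $O(N^2)$, so the overall cost grows polynomially in $N$, in contrast to the $N!$ enumeration required by exact PIT.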