Paper title
Learning Deep Direct-Path Relative Transfer Function for Binaural Sound Source Localization
Paper authors
Paper abstract
The direct-path relative transfer function (DP-RTF) is the ratio between the direct-path acoustic transfer functions of two microphone channels. Although the DP-RTF fully encodes the spatial cues of a sound source and serves as a reliable localization feature, it is often estimated erroneously in the presence of noise and reverberation. This paper proposes to learn the DP-RTF with deep neural networks for robust binaural sound source localization. A DP-RTF learning network is designed to regress the binaural sensor signals to a real-valued representation of the DP-RTF. It consists of a branched convolutional neural network module that separately extracts the inter-channel magnitude and phase patterns, and a convolutional recurrent neural network module for joint feature learning. To better exploit the speech spectra in aid of DP-RTF estimation, a monaural speech enhancement network is used to recover the direct-path spectrograms from the noisy ones. The enhanced spectrograms are stacked onto the noisy spectrograms to form the input of the DP-RTF learning network. We train a single DP-RTF learning network using many different binaural arrays so that DP-RTF learning generalizes across arrays. This avoids time-consuming training data collection and network retraining for a new array, which is very useful in practical applications. Experimental results on both simulated and real-world data show the effectiveness of the proposed method for direction of arrival (DOA) estimation in noisy and reverberant environments, and good generalization to unseen binaural arrays.
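To make the definition of the DP-RTF concrete, the following minimal sketch computes it for an idealized two-microphone setup. It is an illustration under stated assumptions, not the paper's method: it assumes free-field propagation (no head or torso effects, so the direct path is a pure delay with 1/distance attenuation), a speed of sound of 343 m/s, and a real-valued representation formed by concatenating real and imaginary parts. The function names, distances, and frequency grid are hypothetical choices made for this example.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s (assumed value for this sketch)


def direct_path_atf(freqs_hz, distance_m):
    """Free-field direct-path transfer function (assumed model):
    1/distance attenuation combined with a pure propagation delay."""
    delay_s = distance_m / SPEED_OF_SOUND
    return np.exp(-2j * np.pi * freqs_hz * delay_s) / distance_m


def dp_rtf(freqs_hz, dist_ref_m, dist_other_m):
    """DP-RTF: ratio of the direct-path transfer function of one
    channel to that of the reference channel."""
    return direct_path_atf(freqs_hz, dist_other_m) / direct_path_atf(freqs_hz, dist_ref_m)


def real_valued(rtf):
    """One possible real-valued representation (an assumption here):
    real parts followed by imaginary parts."""
    return np.concatenate([rtf.real, rtf.imag])


# Hypothetical geometry: source 2.00 m from the reference microphone
# and 2.10 m from the other microphone.
freqs = np.linspace(0.0, 8000.0, 257)
rtf = dp_rtf(freqs, dist_ref_m=2.00, dist_other_m=2.10)
features = real_valued(rtf)
```

In this idealized model the magnitude of the DP-RTF carries the inter-channel level difference and its phase carries the inter-channel time difference, which is why the ratio can serve as a localization feature. With real binaural recordings the transfer functions also depend on the head and ears, and noise and reverberation corrupt the estimate, which is the problem the learned approach addresses.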