Paper Title
An Audio-Visual Speech Separation Model Inspired by Cortico-Thalamo-Cortical Circuits
Paper Authors
Paper Abstract
Audio-visual approaches involving visual inputs have laid the foundation for recent progress in speech separation. However, the optimization of the concurrent usage of auditory and visual inputs is still an active research area. Inspired by the cortico-thalamo-cortical circuit, in which the sensory processing mechanisms of different modalities modulate one another via the non-lemniscal sensory thalamus, we propose a novel cortico-thalamo-cortical neural network (CTCNet) for audio-visual speech separation (AVSS). First, the CTCNet learns hierarchical auditory and visual representations in a bottom-up manner in separate auditory and visual subnetworks, mimicking the functions of the auditory and visual cortical areas. Then, inspired by the large number of connections between cortical regions and the thalamus, the model fuses the auditory and visual information in a thalamic subnetwork through top-down connections. Finally, the model transmits this fused information back to the auditory and visual subnetworks, and the above process is repeated several times. The results of experiments on three speech separation benchmark datasets show that CTCNet remarkably outperforms existing AVSS methods with considerably fewer parameters. These results suggest that mimicking the anatomical connectome of the mammalian brain has great potential for advancing the development of deep neural networks. The project repository is available at https://github.com/JusperLee/CTCNet.
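The abstract describes a repeated cortico-thalamo-cortical cycle: bottom-up passes through separate auditory and visual subnetworks, fusion in a thalamic subnetwork, and top-down feedback of the fused features to both modalities. The following is a minimal PyTorch sketch of that control flow only, based solely on the abstract; all module names (AuditorySubnet, VisualSubnet, ThalamicFusion, CTCFusionCycle), layer choices, feature dimensions, and the number of cycles are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a CTCNet-style cortico-thalamo-cortical fusion cycle.
# All names, layer choices, and shapes are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AuditorySubnet(nn.Module):
    """Bottom-up auditory encoder (stand-in for the auditory cortical pathway)."""
    def __init__(self, dim=64):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.ReLU(),
        )

    def forward(self, x):
        return self.layers(x)


class VisualSubnet(nn.Module):
    """Bottom-up visual encoder (stand-in for the visual cortical pathway)."""
    def __init__(self, dim=64):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.ReLU(),
        )

    def forward(self, x):
        return self.layers(x)


class ThalamicFusion(nn.Module):
    """Fuses auditory and visual features and projects feedback to each modality."""
    def __init__(self, dim=64):
        super().__init__()
        self.fuse = nn.Conv1d(2 * dim, dim, kernel_size=1)
        self.to_audio = nn.Conv1d(dim, dim, kernel_size=1)
        self.to_visual = nn.Conv1d(dim, dim, kernel_size=1)

    def forward(self, a, v):
        # Align the video frame rate to the audio frame rate before fusing.
        v_up = F.interpolate(v, size=a.shape[-1], mode="nearest")
        fused = self.fuse(torch.cat([a, v_up], dim=1))
        # Top-down feedback signals for the two modality-specific subnetworks.
        fb_a = self.to_audio(fused)
        fb_v = F.interpolate(self.to_visual(fused), size=v.shape[-1], mode="nearest")
        return fb_a, fb_v


class CTCFusionCycle(nn.Module):
    """Repeats the bottom-up / fuse / feed-back loop `n_cycles` times."""
    def __init__(self, dim=64, n_cycles=3):
        super().__init__()
        self.audio_net = AuditorySubnet(dim)
        self.visual_net = VisualSubnet(dim)
        self.thalamus = ThalamicFusion(dim)
        self.n_cycles = n_cycles

    def forward(self, audio_feat, visual_feat):
        a, v = audio_feat, visual_feat
        for _ in range(self.n_cycles):
            a = self.audio_net(a)               # bottom-up auditory pass
            v = self.visual_net(v)              # bottom-up visual pass
            fb_a, fb_v = self.thalamus(a, v)    # fusion in the thalamic subnetwork
            a = a + fb_a                        # feed fused information back (residual)
            v = v + fb_v
        return a  # fused auditory features, e.g. for downstream mask estimation


if __name__ == "__main__":
    model = CTCFusionCycle(dim=64, n_cycles=3)
    audio = torch.randn(2, 64, 400)    # (batch, channels, audio frames)
    visual = torch.randn(2, 64, 100)   # (batch, channels, video frames)
    print(model(audio, visual).shape)  # torch.Size([2, 64, 400])
```

The residual feedback additions and the nearest-neighbor rate alignment are design guesses chosen to keep the sketch self-contained; the actual CTCNet fusion and separation head are documented in the project repository linked above.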