Paper Title

Audio-visual Speech Separation with Adversarially Disentangled Visual Representation

Paper Authors

Peng Zhang, Jiaming Xu, Jing Shi, Yunzhe Hao, Bo Xu

Paper Abstract

Speech separation aims to separate individual voices from an audio mixture of multiple simultaneous talkers. Although audio-only approaches achieve satisfactory performance, they build on strategies that handle predefined conditions, which limits their application in complex auditory scenes. Towards the cocktail party problem, we propose a novel audio-visual speech separation model. In our model, we use a face detector to detect the number of speakers in the scene and use visual information to avoid the permutation problem. To improve the model's generalization to unknown speakers, we explicitly extract speech-related visual features from the visual inputs with an adversarial disentanglement method, and use these features to assist speech separation. In addition, a time-domain approach is adopted, which avoids the phase reconstruction problem present in time-frequency domain models. To compare our model's performance with other models, we create two benchmark datasets of 2-speaker mixtures from the GRID and TCD-TIMIT audio-visual datasets. Through a series of experiments, our proposed model is shown to outperform the state-of-the-art audio-only model and three audio-visual models.
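To make the two technical ingredients named in the abstract more concrete, below is a minimal PyTorch sketch, not the authors' implementation. It assumes adversarial disentanglement is realized with a gradient reversal layer plus an identity classifier (one common construction; the paper's exact adversarial scheme may differ), and it pairs the visual frontend with the scale-invariant SNR (SI-SNR) objective typically used to train time-domain separation models. All module and function names (`GradReverse`, `DisentangledVisualFrontend`, `si_snr`) are hypothetical.

```python
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity map in the forward pass; flips and scales gradients on backward."""

    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        # The reversed gradient pushes the encoder away from identity information.
        return -ctx.lamb * grad_out, None


class DisentangledVisualFrontend(nn.Module):
    """Maps per-frame face embeddings to speech-related visual features.

    An identity classifier sits behind a gradient reversal layer, so
    minimizing its loss trains the classifier normally while driving the
    encoder to remove speaker-identity cues from the features.
    """

    def __init__(self, emb_dim=512, feat_dim=256, num_identities=100, lamb=1.0):
        super().__init__()
        self.lamb = lamb
        self.encoder = nn.Sequential(
            nn.Linear(emb_dim, feat_dim),
            nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )
        self.id_classifier = nn.Linear(feat_dim, num_identities)

    def forward(self, face_emb):               # face_emb: (batch, frames, emb_dim)
        feat = self.encoder(face_emb)          # speech-related visual features
        rev = GradReverse.apply(feat, self.lamb)
        id_logits = self.id_classifier(rev.mean(dim=1))  # pool over frames
        return feat, id_logits


def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB; the usual time-domain separation objective."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference, then measure the residual.
    proj = (est * ref).sum(-1, keepdim=True) * ref / (ref.pow(2).sum(-1, keepdim=True) + eps)
    noise = est - proj
    return 10 * torch.log10(proj.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)


# Joint objective sketch: maximize SI-SNR of the separated waveforms while the
# adversary's cross-entropy trains the identity classifier through the
# reversal layer. The separation network itself is omitted here.
if __name__ == "__main__":
    frontend = DisentangledVisualFrontend()
    faces = torch.randn(4, 75, 512)            # 4 clips, 75 frames of face embeddings
    feat, id_logits = frontend(faces)
    id_loss = nn.functional.cross_entropy(id_logits, torch.randint(0, 100, (4,)))
    est, ref = torch.randn(4, 16000), torch.randn(4, 16000)  # stand-in waveforms
    loss = -si_snr(est, ref).mean() + id_loss
    loss.backward()
```

The reversal layer makes the encoder and the identity classifier adversaries within a single backward pass: the classifier learns to predict speaker identity, while the reversed gradient strips identity cues from the features, leaving them to track speech-related lip motion. Training in the time domain with SI-SNR sidesteps phase reconstruction entirely, since the model predicts waveforms directly.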
