Title
Vocoder-Based Speech Synthesis from Silent Videos
Authors
Abstract
Both acoustic and visual information influence human perception of speech. For this reason, the absence of audio in a video sequence leaves speech intelligibility extremely low for untrained lip readers. In this paper, we present a way to synthesise speech from the silent video of a talker using deep learning. The system learns a mapping function from raw video frames to acoustic features and reconstructs the speech with a vocoder synthesis algorithm. To improve speech reconstruction performance, our model is also trained to predict text information in a multi-task learning fashion, and it can simultaneously reconstruct and recognise speech in real time. The results in terms of estimated speech quality and intelligibility show the effectiveness of our method, which improves over existing video-to-speech approaches.
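The abstract describes a shared encoder over raw video frames feeding two heads: one predicting vocoder acoustic features for speech reconstruction, and one predicting text for the auxiliary recognition task. A minimal shape-level sketch of that multi-task layout is below; all dimensions, the random linear "encoder", and the head names are illustrative assumptions, not the paper's actual architecture or parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): 75 video frames of 48x48
# grayscale mouth crops, a 256-dim shared latent, 32 vocoder acoustic
# features per frame, and a 28-symbol character inventory.
T, H, W = 75, 48, 48
FEAT, ACOUSTIC, CHARS = 256, 32, 28

def encode(frames, w_enc):
    """Map each raw video frame to a shared latent vector. A stand-in for
    the paper's learned video encoder: here just one random linear layer
    with a tanh nonlinearity."""
    return np.tanh(frames.reshape(T, H * W) @ w_enc)

def multitask_heads(latents, w_ac, w_txt):
    """Two heads on the shared latents, mirroring the multi-task setup:
    per-frame acoustic features (to drive a vocoder synthesiser) and
    per-frame character logits (for the text-prediction auxiliary task)."""
    acoustic = latents @ w_ac       # (T, ACOUSTIC): vocoder parameters
    char_logits = latents @ w_txt   # (T, CHARS): text-recognition logits
    return acoustic, char_logits

frames = rng.standard_normal((T, H, W))          # stand-in silent video
w_enc = rng.standard_normal((H * W, FEAT)) * 0.01
w_ac = rng.standard_normal((FEAT, ACOUSTIC)) * 0.1
w_txt = rng.standard_normal((FEAT, CHARS)) * 0.1

latents = encode(frames, w_enc)
acoustic, char_logits = multitask_heads(latents, w_ac, w_txt)
print(acoustic.shape, char_logits.shape)  # (75, 32) (75, 28)
```

Because both heads read the same latents, gradients from the text loss would shape the encoder's representation alongside the acoustic loss, which is the mechanism the multi-task training relies on.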