Audio Input Generates Continuous Frames to Synthesize Facial Video Using Generative Adversarial Networks

Paper title

Audio Input Generates Continuous Frames to Synthesize Facial Video Using Generative Adversarial Networks

Author

Zhang, Hanhaodi

Abstract

This paper presents a simple method for generating talking-face video from audio: given a piece of audio, we generate a video of the target face speaking it. We propose a Generative Adversarial Network (GAN) conditioned on short segments of speech audio, and use convolutional Gated Recurrent Units (GRU) in both the generator and the discriminator. The model is trained on short audio segments paired with the video frames that span the same duration. To prepare the training data, we cut the audio into segments and extract the face from the corresponding frames. We design a simple encoder and compare the frames generated by the GAN with and without the GRU. The GRU is used to make the generated frames temporally coherent, and the results show that short audio segments can produce relatively realistic output.
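The abstract describes cutting the audio and pairing each segment with the frames of the same duration, but gives no parameters. The following is a minimal sketch of that pairing step only, not the author's implementation. The function name `cut_audio_per_clip`, the sample rate, the frame rate, and the clip length are all hypothetical values chosen for illustration.

```python
def cut_audio_per_clip(samples, sample_rate, fps, frames_per_clip):
    """Pair each run of `frames_per_clip` video frames with its audio samples.

    All parameters are illustrative assumptions; the abstract does not
    state the sample rate, frame rate, or clip length that were used.
    """
    clips = []
    clip_index = 0
    while True:
        first_frame = clip_index * frames_per_clip
        last_frame = first_frame + frames_per_clip
        # Integer arithmetic on frame boundaries avoids cumulative drift
        # when sample_rate is not an exact multiple of fps.
        start = (first_frame * sample_rate) // fps
        end = (last_frame * sample_rate) // fps
        if end > len(samples):
            break  # drop a trailing partial clip
        clips.append({
            "frames": range(first_frame, last_frame),
            "audio": samples[start:end],
        })
        clip_index += 1
    return clips


# Example: one second of placeholder audio at an assumed 16 kHz, 25 fps video,
# 5 frames per clip.
audio = list(range(16000))
clips = cut_audio_per_clip(audio, sample_rate=16000, fps=25, frames_per_clip=5)
```

Each returned entry holds the frame indices of one clip and the audio samples covering the same time span, which is the kind of (audio, frames) pair a conditional generator could be trained on.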
