Paper Title
Memories are One-to-Many Mapping Alleviators in Talking Face Generation
Paper Authors
Paper Abstract
Talking face generation aims to generate photo-realistic video portraits of a target person driven by input audio. Because the mapping from input audio to output video is one-to-many (e.g., one speech content may correspond to multiple feasible visual appearances), learning a deterministic mapping as in previous works introduces ambiguity during training and thus yields inferior visual results. Although this one-to-many mapping can be partially alleviated by a two-stage framework (i.e., an audio-to-expression model followed by a neural-rendering model), it remains insufficient because the prediction is produced without enough information (e.g., emotions, wrinkles, etc.). In this paper, we propose MemFace, which complements the missing information with an implicit memory and an explicit memory that correspond to the two stages respectively. More specifically, the implicit memory is employed in the audio-to-expression model to capture high-level semantics in the audio-expression shared space, while the explicit memory is employed in the neural-rendering model to help synthesize pixel-level details. Our experimental results show that the proposed MemFace consistently and significantly surpasses state-of-the-art results across multiple scenarios.
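
To make the memory mechanism concrete, the following is a minimal PyTorch sketch of how an implicit key-value memory could be queried by audio features. The module name ImplicitMemory, the slot count, and the feature dimension are illustrative assumptions based only on the abstract, not the authors' released implementation; the explicit memory in the rendering stage would follow an analogous key-value lookup over pixel-level features.

import math

import torch
import torch.nn as nn


class ImplicitMemory(nn.Module):
    """Hypothetical learnable key-value memory queried by audio features.

    Sketch only: keys live in the audio-expression shared space, values
    hold retrieved high-level semantics that complement the audio input.
    """

    def __init__(self, num_slots: int = 1000, dim: int = 256):
        super().__init__()
        # Trainable memory slots shared across the whole dataset.
        self.keys = nn.Parameter(torch.randn(num_slots, dim))
        self.values = nn.Parameter(torch.randn(num_slots, dim))
        self.scale = 1.0 / math.sqrt(dim)

    def forward(self, audio_feat: torch.Tensor) -> torch.Tensor:
        # audio_feat: (batch, seq, dim) encoded audio features.
        # Soft attention over memory slots: similar audio queries retrieve
        # similar semantics, easing the one-to-many ambiguity of a purely
        # deterministic audio-to-expression mapping.
        attn = torch.softmax(audio_feat @ self.keys.t() * self.scale, dim=-1)  # (B, S, num_slots)
        retrieved = attn @ self.values                                         # (B, S, dim)
        # Complement the original features with the retrieved memory.
        return audio_feat + retrieved


if __name__ == "__main__":
    mem = ImplicitMemory(num_slots=1000, dim=256)
    audio = torch.randn(2, 50, 256)  # 2 clips, 50 frames, 256-d features
    print(mem(audio).shape)          # torch.Size([2, 50, 256])

Because the memory slots are trained jointly and shared across samples, similar audio inputs retrieve consistent expression semantics; this is one plausible way the retrieval step supplies the missing information (e.g., emotion cues) that a deterministic mapping alone cannot recover.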