Paper Title
Visual Speech-Aware Perceptual 3D Facial Expression Reconstruction from Videos
Paper Authors
Paper Abstract
The recent state of the art in monocular 3D face reconstruction from image data has made impressive advancements, thanks to the advent of Deep Learning. However, it has mostly focused on input coming from a single RGB image, overlooking the following important factors: a) Nowadays, the vast majority of facial image data of interest does not originate from single images but rather from videos, which contain rich dynamic information. b) Furthermore, these videos typically capture individuals in some form of verbal communication (public talks, teleconferences, audiovisual human-computer interaction, interviews, monologues/dialogues in movies, etc.). When existing 3D face reconstruction methods are applied to such videos, the artifacts in the reconstructed shape and motion of the mouth area are often severe, since they do not match well with the speech audio. To overcome the aforementioned limitations, we present the first method for visual speech-aware perceptual reconstruction of 3D mouth expressions. We do this by proposing a "lipread" loss, which guides the fitting process so that the perception elicited from the 3D reconstructed talking head resembles that of the original video footage. We demonstrate that, interestingly, the lipread loss is better suited for 3D reconstruction of mouth movements than traditional landmark losses, and even direct 3D supervision. Furthermore, the devised method does not rely on any text transcriptions or corresponding audio, rendering it ideal for training on unlabeled datasets. We verify the efficiency of our method through exhaustive objective evaluations on three large-scale datasets, as well as subjective evaluation with two web-based user studies.
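To make the idea of a "lipread" loss concrete, here is a minimal PyTorch-style sketch, not the paper's exact implementation: it assumes a frozen, pretrained lipreading network is available as a visual-speech feature extractor, and the LipReadingFeatures module, the input shapes, and the cosine-distance comparison below are illustrative assumptions standing in for that network.

```python
# Minimal sketch of a lipread-style perceptual loss, assuming a frozen,
# pretrained lipreading network serves as a feature extractor.
# LipReadingFeatures is a hypothetical stand-in; the actual method relies on a
# pretrained lipreading model, which is not reproduced here.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LipReadingFeatures(nn.Module):
    """Hypothetical stand-in for a frozen lipreading backbone.

    Maps a sequence of grayscale mouth crops (B, T, 1, H, W) to a per-clip
    feature vector that encodes visual speech content.
    """

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d((1, 4, 4)),
            nn.Flatten(),
            nn.Linear(32 * 4 * 4, feat_dim),
        )
        for p in self.parameters():  # frozen: the loss only guides the fitting
            p.requires_grad = False

    def forward(self, mouth_crops: torch.Tensor) -> torch.Tensor:
        # (B, T, 1, H, W) -> (B, 1, T, H, W) as expected by Conv3d
        x = mouth_crops.permute(0, 2, 1, 3, 4)
        return self.encoder(x)


def lipread_loss(lipreader: nn.Module,
                 rendered_mouths: torch.Tensor,
                 original_mouths: torch.Tensor) -> torch.Tensor:
    """Perceptual loss: the rendered talking head should elicit visual-speech
    features similar to those of the original footage."""
    feat_rendered = lipreader(rendered_mouths)
    with torch.no_grad():  # target features come from the real video
        feat_original = lipreader(original_mouths)
    # Cosine distance between clip features is one reasonable choice;
    # an L1/L2 feature distance would express the same idea.
    return (1.0 - F.cosine_similarity(feat_rendered, feat_original, dim=-1)).mean()


if __name__ == "__main__":
    lipreader = LipReadingFeatures()
    # Dummy clips: batch of 2, 25 frames of 88x88 grayscale mouth crops.
    rendered = torch.rand(2, 25, 1, 88, 88, requires_grad=True)
    original = torch.rand(2, 25, 1, 88, 88)
    loss = lipread_loss(lipreader, rendered, original)
    loss.backward()  # gradients flow back into the rendered frames
    print(float(loss))
```

In this sketch the gradients reach only the rendered mouth crops (and, in a full pipeline, the 3D expression parameters behind them), while the lipreading features of the original video act as a fixed perceptual target, which mirrors the abstract's description of guiding the fitting process without text transcriptions or audio.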