Paper Title
Generating Holistic 3D Human Motion from Speech
Paper Authors
Paper Abstract
This work addresses the problem of generating 3D holistic body motions from human speech. Given a speech recording, we synthesize sequences of 3D body poses, hand gestures, and facial expressions that are realistic and diverse. To achieve this, we first build a high-quality dataset of 3D holistic body meshes with synchronous speech. We then define a novel speech-to-motion generation framework in which the face, body, and hands are modeled separately. The separated modeling stems from the fact that face articulation strongly correlates with human speech, while body poses and hand gestures are less correlated. Specifically, we employ an autoencoder for face motions, and a compositional vector-quantized variational autoencoder (VQ-VAE) for the body and hand motions. The compositional VQ-VAE is key to generating diverse results. Additionally, we propose a cross-conditional autoregressive model that generates body poses and hand gestures, leading to coherent and realistic motions. Extensive experiments and user studies demonstrate that our proposed approach achieves state-of-the-art performance both qualitatively and quantitatively. Our novel dataset and code will be released for research purposes at https://talkshow.is.tue.mpg.de.
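Since the abstract compresses the whole pipeline into a few sentences, a small sketch may help make the separated design concrete. The PyTorch snippet below is a minimal illustration written from the abstract alone, not the authors' released code: the module names, the feature dimensions (63-D body and 90-D hand pose vectors, as in SMPL-X), the codebook sizes, and the GRU-based autoregressive head are all assumptions. It shows (a) a compositional VQ-VAE with one codebook each for body and hands, and (b) a cross-conditional autoregressive model that predicts the next body and hand codes from both past code streams plus a per-frame speech feature.

```python
# Illustrative sketch only; all names, shapes, and sizes are assumptions.
import torch
import torch.nn as nn


class Quantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through gradient."""

    def __init__(self, num_codes: int, dim: int):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                             # z: (B, T, dim)
        flat = z.reshape(-1, z.size(-1))              # (B*T, dim)
        idx = torch.cdist(flat, self.codebook.weight).argmin(-1)
        idx = idx.view(z.shape[:-1])                  # (B, T) code indices
        z_q = self.codebook(idx)
        z_q = z + (z_q - z).detach()                  # straight-through estimator
        return z_q, idx


class CompositionalVQVAE(nn.Module):
    """Separate encoders and codebooks for body and hands; the decoder
    recombines the two discrete streams, so N codes per part give N*N
    body-hand combinations per frame."""

    def __init__(self, body_dim=63, hand_dim=90, latent=256, num_codes=512):
        super().__init__()
        self.body_enc = nn.Linear(body_dim, latent)
        self.hand_enc = nn.Linear(hand_dim, latent)
        self.body_vq = Quantizer(num_codes, latent)
        self.hand_vq = Quantizer(num_codes, latent)
        self.dec = nn.Linear(2 * latent, body_dim + hand_dim)

    def forward(self, body, hands):
        zb, body_idx = self.body_vq(self.body_enc(body))
        zh, hand_idx = self.hand_vq(self.hand_enc(hands))
        recon = self.dec(torch.cat([zb, zh], dim=-1))
        return recon, body_idx, hand_idx


class CrossConditionalAR(nn.Module):
    """Predicts the next body code AND the next hand code from the history
    of both code streams plus a speech feature, keeping the two parts
    temporally coherent with each other and with the audio."""

    def __init__(self, num_codes=512, latent=256, audio_dim=128):
        super().__init__()
        self.body_emb = nn.Embedding(num_codes, latent)
        self.hand_emb = nn.Embedding(num_codes, latent)
        self.gru = nn.GRU(2 * latent + audio_dim, latent, batch_first=True)
        self.body_head = nn.Linear(latent, num_codes)
        self.hand_head = nn.Linear(latent, num_codes)

    def forward(self, body_idx, hand_idx, audio):     # audio: (B, T, audio_dim)
        x = torch.cat([self.body_emb(body_idx),
                       self.hand_emb(hand_idx), audio], dim=-1)
        h, _ = self.gru(x)
        return self.body_head(h), self.hand_head(h)   # logits over next codes


if __name__ == "__main__":
    body, hands = torch.randn(2, 30, 63), torch.randn(2, 30, 90)
    recon, bi, hi = CompositionalVQVAE()(body, hands)
    body_logits, hand_logits = CrossConditionalAR()(bi, hi, torch.randn(2, 30, 128))
    print(recon.shape, body_logits.shape, hand_logits.shape)
```

This toy version omits the training losses (reconstruction, codebook/commitment, and cross-entropy over next codes), the face autoencoder, and the audio encoder; its only point is how splitting the codebooks yields compositional diversity while cross-conditioning ties the body and hand streams back together.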