Paper Title

Listen, Denoise, Action! Audio-Driven Motion Synthesis with Diffusion Models

Authors

Simon Alexanderson, Rajmund Nagy, Jonas Beskow, Gustav Eje Henter

Abstract

Diffusion models have experienced a surge of interest as highly expressive yet efficiently trainable probabilistic models. We show that these models are an excellent fit for synthesising human motion that co-occurs with audio, e.g., dancing and co-speech gesticulation, since motion is complex and highly ambiguous given audio, calling for a probabilistic description. Specifically, we adapt the DiffWave architecture to model 3D pose sequences, putting Conformers in place of dilated convolutions for improved modelling power. We also demonstrate control over motion style, using classifier-free guidance to adjust the strength of the stylistic expression. Experiments on gesture and dance generation confirm that the proposed method achieves top-of-the-line motion quality, with distinctive styles whose expression can be made more or less pronounced. We also synthesise path-driven locomotion using the same model architecture. Finally, we generalise the guidance procedure to obtain product-of-expert ensembles of diffusion models and demonstrate how these may be used for, e.g., style interpolation, a contribution we believe is of independent interest. See https://www.speech.kth.se/research/listen-denoise-action/ for video examples, data, and code.
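The abstract's two guidance ideas can be sketched in a few lines. Classifier-free guidance blends an unconditional and a style-conditioned noise prediction, with a scalar that dials the stylistic expression up or down; the product-of-experts generalisation combines several style-conditioned predictions, which allows e.g. interpolation between two styles. The sketch below is illustrative only (function names, list-based tensors, and the exact weighting scheme are assumptions, not the authors' implementation):

```python
def cfg_denoise(eps_uncond, eps_cond, gamma):
    """Classifier-free guidance (illustrative): extrapolate from the
    unconditional noise prediction towards the style-conditioned one.
    gamma = 0 ignores the style, gamma = 1 recovers the conditional
    model, and gamma > 1 makes the stylistic expression more pronounced."""
    return [u + gamma * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def poe_denoise(eps_uncond, eps_styles, gammas):
    """Product-of-experts guidance (sketch of the generalisation):
    accumulate weighted offsets from several style experts' predictions.
    Choosing gammas such as [0.5, 0.5] interpolates between two styles."""
    out = list(eps_uncond)
    for eps_s, g in zip(eps_styles, gammas):
        out = [o + g * (s - u) for o, s, u in zip(out, eps_s, eps_uncond)]
    return out
```

With a single expert and weight `gamma`, `poe_denoise` reduces to `cfg_denoise`, which is what makes style interpolation a natural extension of the guidance procedure.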
