Paper Title

Let's Play Music: Audio-driven Performance Video Generation

Paper Authors

Hao Zhu, Yi Li, Feixia Zhu, Aihua Zheng, Ran He

Paper Abstract

We propose a new task named Audio-driven Performance Video Generation (APVG), which aims to synthesize a video of a person playing a certain instrument, guided by a given music audio clip. Generating high-dimensional, temporally consistent videos from the low-dimensional audio modality is a challenging task. In this paper, we propose a multi-stage framework for this new task that generates realistic, synchronized performance videos from given music. First, we provide both global appearance and local spatial information by generating coarse videos and keypoints of the body and hands from the given music, respectively. Then, we propose to transform the generated keypoints into heatmaps via a differentiable space transformer, since heatmaps offer more spatial information but are harder to generate directly from audio. Finally, we propose a Structured Temporal UNet (STU) to extract both intra-frame structured information and inter-frame temporal consistency, obtained via a graph-based structure module and a CNN-GRU-based high-level temporal module, respectively, for final video generation. Comprehensive experiments validate the effectiveness of our proposed framework.
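
The keypoint-to-heatmap step the abstract describes is commonly realized by rendering each predicted (x, y) coordinate as a 2D Gaussian, which keeps the transform differentiable so gradients flow back to the keypoint coordinates. Below is a minimal sketch of that idea in PyTorch; it is illustrative only, not the authors' implementation, and the names `render_heatmaps` and `sigma` are assumptions.

```python
import torch

def render_heatmaps(keypoints: torch.Tensor, height: int, width: int,
                    sigma: float = 2.0) -> torch.Tensor:
    """keypoints: (B, K, 2) pixel coordinates -> heatmaps: (B, K, H, W)."""
    b, k, _ = keypoints.shape
    ys = torch.arange(height, dtype=keypoints.dtype, device=keypoints.device)
    xs = torch.arange(width, dtype=keypoints.dtype, device=keypoints.device)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")  # (H, W) pixel grid
    # Squared distance from every pixel to every keypoint, via broadcasting.
    dx = grid_x.view(1, 1, height, width) - keypoints[..., 0].view(b, k, 1, 1)
    dy = grid_y.view(1, 1, height, width) - keypoints[..., 1].view(b, k, 1, 1)
    # Gaussian bump centered on each keypoint; fully differentiable w.r.t. (x, y).
    return torch.exp(-(dx ** 2 + dy ** 2) / (2.0 * sigma ** 2))

# Usage: 22 hypothetical body/hand keypoints rendered at 64x64.
kps = (torch.rand(4, 22, 2) * 64).requires_grad_()
hm = render_heatmaps(kps, 64, 64)   # (4, 22, 64, 64)
hm.sum().backward()                 # gradients reach the keypoint coordinates
print(kps.grad.shape)               # torch.Size([4, 22, 2])
```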
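For the "CNN-GRU based high-level temporal module", one plausible reading is a per-frame CNN encoder whose features are aggregated across time by a GRU to enforce inter-frame consistency. The sketch below shows that generic pattern under those assumptions; the class name and layer sizes are hypothetical and do not come from the paper.

```python
import torch
import torch.nn as nn

class CnnGruTemporal(nn.Module):
    """Per-frame CNN features aggregated over time by a GRU (illustrative)."""

    def __init__(self, in_ch: int = 3, feat: int = 128, hidden: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(  # small per-frame CNN encoder
            nn.Conv2d(in_ch, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, feat, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),   # -> (B*T, feat, 1, 1)
        )
        self.gru = nn.GRU(feat, hidden, batch_first=True)  # temporal aggregation

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        """frames: (B, T, C, H, W) -> per-frame temporal features: (B, T, hidden)."""
        b, t, c, h, w = frames.shape
        f = self.encoder(frames.view(b * t, c, h, w)).view(b, t, -1)
        out, _ = self.gru(f)
        return out

# Usage: a batch of 4 clips, 16 frames each, at 64x64 resolution.
module = CnnGruTemporal()
feats = module(torch.rand(4, 16, 3, 64, 64))  # (4, 16, 256)
```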
