Paper Title

Disentangling Content and Motion for Text-Based Neural Video Manipulation

Authors

Levent Karacan, Tolga Kerimoğlu, İsmail İnan, Tolga Birdal, Erkut Erdem, Aykut Erdem

Abstract

Giving machines the ability to imagine possible new objects or scenes from linguistic descriptions and produce their realistic renderings is arguably one of the most challenging problems in computer vision. Recent advances in deep generative models have led to new approaches that give promising results towards this goal. In this paper, we introduce a new method called DiCoMoGAN for manipulating videos with natural language, aiming to perform local and semantic edits on a video clip to alter the appearances of an object of interest. Our GAN architecture allows for better utilization of multiple observations by disentangling content and motion to enable controllable semantic edits. To this end, we introduce two tightly coupled networks: (i) a representation network for constructing a concise understanding of motion dynamics and temporally invariant content, and (ii) a translation network that exploits the extracted latent content representation to actuate the manipulation according to the target description. Our qualitative and quantitative evaluations demonstrate that DiCoMoGAN significantly outperforms existing frame-based methods, producing temporally coherent and semantically more meaningful results.
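To make the two-network design described in the abstract more concrete, below is a minimal, hypothetical PyTorch sketch of that split: a representation network factorizes a clip into a temporally invariant content code and per-frame motion codes, and a translation network edits the content code conditioned on a text embedding. This is an illustration under assumed choices, not the authors' implementation; all module names, dimensions, and the pooling/GRU details are hypothetical.

```python
import torch
import torch.nn as nn

class RepresentationNet(nn.Module):
    # Hypothetical representation network: splits a clip into content + motion.
    def __init__(self, frame_dim=512, content_dim=128, motion_dim=32):
        super().__init__()
        self.frame_enc = nn.Linear(frame_dim, 256)        # per-frame features
        self.content_head = nn.Linear(256, content_dim)   # temporally invariant content
        self.motion_rnn = nn.GRU(256, motion_dim, batch_first=True)  # motion dynamics

    def forward(self, frames):                   # frames: (B, T, frame_dim)
        h = torch.relu(self.frame_enc(frames))   # (B, T, 256)
        content = self.content_head(h.mean(dim=1))  # pool over time -> (B, content_dim)
        motion, _ = self.motion_rnn(h)              # per-frame motion codes (B, T, motion_dim)
        return content, motion

class TranslationNet(nn.Module):
    # Hypothetical translation network: edits the content code toward a text description.
    def __init__(self, content_dim=128, text_dim=300):
        super().__init__()
        self.edit = nn.Sequential(
            nn.Linear(content_dim + text_dim, 256), nn.ReLU(),
            nn.Linear(256, content_dim))

    def forward(self, content, text_emb):
        return self.edit(torch.cat([content, text_emb], dim=-1))

# Usage: the edited content code would be recombined with the untouched motion codes
# by a decoder (omitted here), which is what keeps the edit local and temporally coherent.
frames = torch.randn(2, 8, 512)     # dummy clip: batch of 2, 8 frames of 512-d features
text_emb = torch.randn(2, 300)      # dummy target-description embedding
rep, trans = RepresentationNet(), TranslationNet()
content, motion = rep(frames)
edited_content = trans(content, text_emb)
```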
