MAGVIT: Masked Generative Video Transformer

Paper Title

MAGVIT: Masked Generative Video Transformer

Authors

Yu, Lijun; Cheng, Yong; Sohn, Kihyuk; Lezama, José; Zhang, Han; Chang, Huiwen; Hauptmann, Alexander G.; Yang, Ming-Hsuan; Hao, Yuan; Essa, Irfan; Jiang, Lu

Abstract

We introduce the MAsked Generative VIdeo Transformer, MAGVIT, to tackle various video synthesis tasks with a single model. We introduce a 3D tokenizer to quantize a video into spatial-temporal visual tokens and propose an embedding method for masked video token modeling to facilitate multi-task learning. We conduct extensive experiments to demonstrate the quality, efficiency, and flexibility of MAGVIT. Our experiments show that (i) MAGVIT performs favorably against state-of-the-art approaches and establishes the best-published FVD on three video generation benchmarks, including the challenging Kinetics-600. (ii) MAGVIT outperforms existing methods in inference time by two orders of magnitude against diffusion models and by 60x against autoregressive models. (iii) A single MAGVIT model supports ten diverse generation tasks and generalizes across videos from different visual domains. The source code and trained models will be released to the public at https://magvit.cs.cmu.edu.
