“MAGVIT: Masked Generative Video Transformer”, Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G. Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, Lu Jiang (2022-12-10):

We introduce the MAsked Generative VIdeo Transformer (MAGVIT) to tackle various video synthesis tasks with a single model.

We introduce a 3D tokenizer to quantize a video into spatial-temporal visual tokens and propose an embedding method for masked video token modeling [using BERT] to facilitate multi-task learning.
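Masked token modeling of this kind is typically decoded non-autoregressively: generation starts from a fully masked token sequence, and at each step the model predicts all masked positions, keeps its most confident predictions, and re-masks the rest according to a decaying schedule. The sketch below illustrates that loop in NumPy under stated assumptions: the cosine schedule and greedy confidence-based revealing follow the MaskGIT-style procedure, while `predict` is a hypothetical stand-in for the trained bidirectional transformer operating on 3D-tokenizer codes.

```python
import numpy as np

MASK = -1  # sentinel id for a masked token (hypothetical; real models use a vocab entry)

def cosine_schedule(t: float) -> float:
    """Fraction of tokens left masked at decoding progress t in [0, 1]."""
    return float(np.cos(0.5 * np.pi * t))

def iterative_decode(num_tokens, num_steps, predict):
    """Non-autoregressive decoding: start fully masked, then at each step
    reveal the highest-confidence predictions and re-mask the rest
    according to the cosine schedule."""
    tokens = np.full(num_tokens, MASK, dtype=np.int64)
    for step in range(num_steps):
        logits = predict(tokens)                       # shape (num_tokens, vocab)
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        candidates = probs.argmax(-1)                  # greedy fill-in for simplicity
        conf = probs.max(-1)
        conf[tokens != MASK] = np.inf                  # never re-mask revealed tokens
        new_tokens = np.where(tokens == MASK, candidates, tokens)
        # Number of tokens that should remain masked after this step.
        n_mask = int(np.floor(cosine_schedule((step + 1) / num_steps) * num_tokens))
        if n_mask > 0:
            new_tokens[np.argsort(conf)[:n_mask]] = MASK
        tokens = new_tokens
    return tokens
```

Because every position is predicted in parallel at each step, the whole video is produced in a small, fixed number of forward passes rather than one pass per token, which is where the large inference-time gap over autoregressive decoding comes from.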

We conduct extensive experiments to demonstrate the quality, efficiency, and flexibility of MAGVIT. Our experiments show that (1) MAGVIT performs favorably against state-of-the-art approaches and establishes the best published FVD on 3 video generation benchmarks, including the challenging Kinetics-600; (2) MAGVIT is faster at inference than existing methods, by two orders of magnitude over diffusion models and by 60× over autoregressive models; and (3) a single MAGVIT model supports 10 diverse generation tasks and generalizes across videos from different visual domains.

The source code and trained models will be released to the public at https://magvit.cs.cmu.edu/ [Github].