“MAGVIT: Masked Generative Video Transformer”, 2022-12-10:
We introduce the MAsked Generative VIdeo Transformer (MAGVIT) to tackle various video synthesis tasks with a single model.
We introduce a 3D tokenizer to quantize a video into spatial-temporal visual tokens and propose an embedding method for masked video token modeling [using BERT] to facilitate multi-task learning.
We conduct extensive experiments to demonstrate the quality, efficiency, and flexibility of MAGVIT. Our experiments show that (1) MAGVIT performs favorably against state-of-the-art approaches and establishes the best published FVD on 3 video generation benchmarks, including the challenging Kinetics-600. (2) MAGVIT's inference is two orders of magnitude faster than diffusion models and 60× faster than autoregressive models. (3) A single MAGVIT model supports 10 diverse generation tasks and generalizes across videos from different visual domains.
The source code and trained models will be released to the public at https://magvit.cs.cmu.edu/ [Github].
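To make the idea of quantizing a video into spatial-temporal tokens and then masking them more concrete, here is a toy sketch. It is not the MAGVIT tokenizer: the abstract describes a learned 3D tokenizer, whereas this example assumes a fixed random codebook and plain nearest-neighbor lookup over raw pixel patches. The names `tokenize_video` and `mask_tokens`, the patch size, the codebook size, and the mask ratio are all hypothetical choices made for illustration.

```python
import numpy as np


def tokenize_video(video, codebook, patch=(2, 4, 4)):
    """Map a (T, H, W, C) video to a grid of codebook indices.

    Toy stand-in for a learned 3D tokenizer: each non-overlapping
    spatio-temporal patch is assigned its nearest codebook vector.
    """
    T, H, W, C = video.shape
    pt, ph, pw = patch
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    # Cut the video into patches and flatten each patch into one vector.
    x = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    x = x.reshape(T // pt, H // ph, W // pw, -1)
    # Squared distance from every patch to every codebook vector.
    dist = ((x[..., None, :] - codebook) ** 2).sum(axis=-1)
    return dist.argmin(axis=-1)


def mask_tokens(tokens, mask_ratio, mask_id, rng):
    """Replace a random subset of tokens with a special mask id."""
    flat = tokens.reshape(-1).copy()
    n_mask = int(round(mask_ratio * flat.size))
    idx = rng.choice(flat.size, size=n_mask, replace=False)
    flat[idx] = mask_id
    return flat.reshape(tokens.shape), idx


rng = np.random.default_rng(0)
video = rng.random((4, 8, 8, 3))           # 4 frames of 8x8 RGB
codebook = rng.random((16, 2 * 4 * 4 * 3))  # 16 hypothetical code vectors
tokens = tokenize_video(video, codebook)    # token grid of shape (2, 2, 2)
masked, masked_idx = mask_tokens(tokens, 0.5, mask_id=16, rng=rng)
```

In a masked token modeling setup, a bidirectional transformer would be trained to predict the original ids at the masked positions from the visible ones; that model is omitted here.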