“CogVideo: Large-Scale Pretraining for Text-To-Video Generation via Transformers”, Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, Jie Tang, 2022-05-29:

[demo; code; checkpoint] Large-scale pretrained transformers have created milestones in text (GPT-3) and text-to-image (DALL·E and CogView) generation. Their application to video generation still faces many challenges: the potentially huge computation cost makes training from scratch unaffordable, and the scarcity and weak relevance of text-video datasets hinder the model from understanding complex movement semantics.

In this work, we present CogVideo, a 9B-parameter transformer trained by inheriting a pretrained text-to-image model, CogView2. We also propose a multi-frame-rate hierarchical training strategy to better align text and video clips.

As (probably) the first open-source large-scale pretrained text-to-video model, CogVideo outperforms all publicly available models by a large margin in machine and human evaluations.