“CogView2: Faster and Better Text-To-Image Generation via Hierarchical Transformers”, Ming Ding, Wendi Zheng, Wenyi Hong, Jie Tang (2022-04-28)⁠:

[CogView1; demo? checkpoints?] The development of transformer-based text-to-image models is impeded by their slow generation and the complexity of high-resolution images.

In this work, we put forward a solution based on hierarchical transformers and local parallel auto-regressive generation. We pretrain a 6B-parameter transformer with a simple and flexible self-supervised task, Cross-modal general language model (CogLM) [cf. MAE], and finetune it for fast super-resolution.
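The two-stage idea described above can be illustrated with a toy sketch: a slow token-by-token autoregressive pass at low resolution, followed by refinement of independent local windows at high resolution, which can run in parallel. All names, grid sizes, and the stand-in samplers below are illustrative assumptions, not the paper's actual model or dimensions.

```python
import numpy as np

def generate_image_tokens(vocab=8192, low=20, high=60, window=15, seed=0):
    """Toy sketch of hierarchical generation (assumed shapes, not the
    paper's real configuration): autoregressive sampling at low resolution,
    then local parallel refinement at high resolution."""
    rng = np.random.default_rng(seed)

    # Stage 1: left-to-right autoregressive sampling of low-res tokens.
    # A real model conditions on the text and all previously sampled tokens;
    # here a stand-in sampler just draws from the vocabulary.
    low_tokens = np.empty((low, low), dtype=np.int64)
    for i in range(low):
        for j in range(low):
            low_tokens[i, j] = rng.integers(vocab)  # stand-in for model(text, prefix)

    # Stage 2: nearest-neighbour upsample of the token grid to high resolution...
    scale = high // low
    high_tokens = np.kron(low_tokens, np.ones((scale, scale), dtype=np.int64))

    # ...then re-sample tokens within each local window. The windows are
    # mutually independent, so a real implementation decodes them in parallel
    # rather than one token at a time.
    for wi in range(0, high, window):
        for wj in range(0, high, window):
            block = high_tokens[wi:wi + window, wj:wj + window]
            # stand-in for one parallel refinement pass over the window
            high_tokens[wi:wi + window, wj:wj + window] = (
                block + rng.integers(0, 2, size=block.shape)) % vocab

    return high_tokens
```

The point of the sketch is the cost structure: only the small low-resolution grid pays the quadratic token-by-token autoregressive price, while the large high-resolution grid is handled in independent windows.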

The new text-to-image system, CogView2, shows very competitive generation compared to the concurrent state-of-the-art DALL·E 2 [looks noticeably worse, IMO], and naturally supports interactive text-guided editing of images.

Figure 1: Text-to-image samples from CogView2, which supports both Chinese & English. The actual input text is in Chinese, translated into English here for better understanding. Code and a demo website will be released on GitHub.