The development of transformer-based text-to-image models is impeded by their slow generation and the complexity of high-resolution images.
In this work, we put forward a solution based on hierarchical transformers and local parallel autoregressive generation. We pretrain a 6B-parameter transformer with a simple and flexible self-supervised task, the Cross-Modal General Language Model (CogLM), and finetune it for fast super-resolution.
The new text-to-image system, CogView2, shows very competitive generation compared to the concurrent state-of-the-art DALL-E-2, and naturally supports interactive text-guided editing of images.
Figure 1: Text-to-image samples from CogView2, which supports both Chinese and English. The actual input text is in Chinese, translated into English here for better understanding. Code and a demo website will be released on GitHub.