“CogView: Mastering Text-to-Image Generation via Transformers”, Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, Jie Tang (2021-05-26)⁠:

[Affiliations: Tsinghua University, DAMO Academy, Alibaba Group, BAAI; CogView2] Text-to-image generation in the general domain has long been an open problem, requiring both a powerful generative model and cross-modal understanding.

We propose CogView, a 4-billion-parameter Transformer with a VQ-VAE tokenizer, to advance this problem. We also demonstrate finetuning strategies for various downstream tasks (e.g. style learning, super-resolution, text-image ranking, and fashion design) and methods to stabilize pretraining (e.g. eliminating NaN losses).
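The core input construction—text tokens followed by discrete image tokens produced by nearest-neighbor lookup in a VQ-VAE codebook, concatenated into one sequence for autoregressive modeling—can be sketched as below. The codebook size, patch shapes, and separator token here are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def quantize_image(patches, codebook):
    """Map each patch embedding to its nearest codebook entry
    (the VQ-VAE discrete tokenization step, greatly simplified)."""
    # patches: (n_patches, d); codebook: (vocab_size, d)
    dists = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)  # (n_patches,) integer image tokens

def build_sequence(text_tokens, image_tokens, boi_token):
    """Concatenate text tokens, a begin-of-image separator, and image
    tokens into the single sequence the Transformer models left-to-right."""
    return list(text_tokens) + [boi_token] + list(image_tokens)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8192, 64))   # hypothetical 8192-entry codebook
patches = codebook[[3, 100, 42]] + 0.01  # patches lying near known entries
img_tokens = quantize_image(patches, codebook)
seq = build_sequence([11, 12, 13], img_tokens, boi_token=99999)
```

Under this framing, captioning finetuning (Section 3.3) is just emitting the same tokens with the image segment placed before the text segment.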

CogView (zero-shot) achieves a new state-of-the-art FID on blurred MS COCO, outperforming previous GAN-based models and the recent similar work DALL·E. [cf.: “M6: A Chinese Multimodal Pretrainer”, Lin et al 2021; screenshots; future model release homepage; WuDaoMM dataset (also for Wenlan); live demo.]

Figure 1: Samples from CogView. The text in the first line is either from MS COCO (outside the training set) or user queries on our demo website. The images in the second line are finetuned results for different styles or super-resolution. The actual input text is in Chinese, translated into English here for better understanding. More samples for captions from MS COCO are included in Appendix E.


3.3 Image Captioning and Self-reranking: Finetuning CogView for image captioning is straightforward: exchange the order of text and image tokens in the input sequences [and then finetune the model on this flipped training corpus, i.e. ‘analysis by synthesis’]. Since the model has already learnt the corresponding relationships between text and images, reversing the generation is not hard. We did not evaluate the performance because (1) there is no authoritative Chinese image captioning benchmark, and (2) image captioning is not the focus of this work. The main purpose of finetuning such a model is self-reranking. We propose the Caption Score (CapS) to evaluate the correspondence between images and text; this method can be seen as an adaptation of inverse prompting [53] for text-to-image generation. Finally, the images with the highest CapS are chosen. [cf. self-distillation, CLIP ranking]
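The self-reranking loop can be sketched as follows: score each candidate image by the likelihood the caption-finetuned model assigns to the original text conditioned on that image's tokens, then keep the best-scoring candidates. Treating CapS as a length-normalized average of per-token log-probabilities is an assumption here; the exact normalization is a detail of the paper:

```python
def caption_score(text_token_logprobs):
    """Caption Score (CapS), sketched as the average log-likelihood the
    caption-finetuned model assigns to the ground-truth text tokens,
    conditioned on a candidate image's tokens (normalization assumed)."""
    return sum(text_token_logprobs) / len(text_token_logprobs)

def self_rerank(candidate_ids, score_fn):
    """Sort generated candidates by CapS, best first."""
    return sorted(candidate_ids, key=score_fn, reverse=True)

# Hypothetical per-text-token log-probs for three candidate images,
# as would be produced by a captioning-finetuned CogView:
cands = {
    "img_a": [-0.2, -0.3],  # caption fits well
    "img_b": [-2.0, -1.5],  # caption fits poorly
    "img_c": [-0.5, -0.4],
}
ranked = self_rerank(list(cands), lambda k: caption_score(cands[k]))
# ranked[0] is the image under which the caption is most likely ("img_a")
```

In practice one would generate many samples per prompt (60 in Figure 6), score each with a forward pass of the captioning model, and return the top few; no external ranker such as CLIP is needed.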

Figure 6: 60 images generated for “A man in red shirt is playing video games” (a caption selected at random from MS COCO), displayed in order of Caption Score. Most bad cases are ranked in the last places. The diversity also eases the concern that CogView might be overfitting to similar images in the training set.