“ERNIE-ViLG: Unified Generative Pre-Training for Bidirectional Vision-Language Generation”, Han Zhang, Weichong Yin, Yewei Fang, Lanxin Li, Boqiang Duan, Zhihua Wu, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang, 2021-12-31:

[homepage; anime samples; demo, Colab] Conventional methods for image-text generation mainly tackle the naturally bidirectional generation tasks separately, focusing on designing task-specific frameworks to improve the quality and fidelity of the generated samples. Recently, Vision-Language Pre-training models have greatly improved the performance of image-to-text generation, but large-scale pre-training models for the text-to-image synthesis task remain under-developed.

In this paper, we propose ERNIE-ViLG, a unified generative pre-training framework for bidirectional image-text generation with a transformer model. Based on image quantization models, we formulate both image generation and text generation as autoregressive generative tasks conditioned on the text/image input [cf. CogView, GLIDE, RuDOLPH, L-Verse, Huang et al 2021]. This bidirectional image-text generative modeling eases semantic alignment across vision and language. For the text-to-image generation process, we further propose an end-to-end training method to jointly learn the visual sequence generator and the image reconstructor.
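The unified formulation can be illustrated with a toy sketch (assumed details, not ERNIE-ViLG's actual code): text tokens and discrete image codes from a VQ image quantizer share one vocabulary, and both directions reduce to next-token prediction over a concatenated sequence, so a single transformer serves text-to-image and image-to-text. The vocabulary sizes and separator tokens below are hypothetical.

```python
# Toy vocabularies (assumptions): text tokens and VQ image codes share one
# index space via an offset, as in unified autoregressive image-text models.
TEXT_VOCAB = 1000                    # assumed toy text vocabulary size
IMAGE_CODEBOOK = 8192                # typical VQ codebook size (assumption)
BOI = TEXT_VOCAB + IMAGE_CODEBOOK    # begin-of-image separator (hypothetical)
BOT = BOI + 1                        # begin-of-text separator (hypothetical)

def image_token(code: int) -> int:
    """Shift a VQ image code into the shared vocabulary."""
    return TEXT_VOCAB + code

def text_to_image_sequence(text_ids, image_codes):
    """Condition on text; image codes are generated autoregressively."""
    return text_ids + [BOI] + [image_token(c) for c in image_codes]

def image_to_text_sequence(image_codes, text_ids):
    """Condition on image codes; text is generated autoregressively."""
    return [image_token(c) for c in image_codes] + [BOT] + text_ids

# In both directions the loss is ordinary next-token cross-entropy,
# masked so it is only computed on the target segment after the separator.
seq = text_to_image_sequence([5, 17, 42], [7, 7, 300])
print(seq)  # [5, 17, 42, 9192, 1007, 1007, 1300]
```

The shared vocabulary is what lets one set of transformer weights model both directions; only the ordering of the two segments changes.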

To explore the landscape of large-scale pre-training for bidirectional text-image generation, we train a 10-billion-parameter ERNIE-ViLG model on a large-scale dataset of 145 million (Chinese) image-text pairs. The model achieves state-of-the-art performance on both text-to-image and image-to-text tasks, obtaining an FID of 7.9 on MS COCO for text-to-image synthesis and the best results on COCO-CN and AIC-ICC for image captioning.



Figure 4: Example images generated by ERNIE-ViLG in a zero-shot setting with open-domain texts. Figures 4(a)–4(b) show generated images of simple objects; Figures 4(c)–4(e) show generated images of complex scenes with multiple objects; Figure 4(f) shows an example of imagining a non-existent object.
Figure 5: Images in different styles generated by ERNIE-ViLG. “None” indicates that no prompt about image style was added.
Figure 6: Generated images given Chinese ancient poetry.


Qualitative Results: ERNIE-ViLG has acquired the capability to generate various scenes, from basic objects to complex combinations of objects. Some examples are shown in Figure 4. As these examples show, ERNIE-ViLG can not only draw the entities mentioned in a given text description, but also compose them with the background in a reasonable way. Surprisingly, we also find two special skills that ERNIE-ViLG develops. First, ERNIE-ViLG can generate images in different styles simply by adding text prompts, without the fine-tuning that CogView requires (Figure 5). Second, our model can generate realistic images from Chinese ancient poetry, showing a promising understanding of brief, abstract descriptions: concrete concepts in the poetry are well-organized, and the artistic mood is well-rendered (Figure 6).


We compare our end-to-end training method with a two-stage pipeline baseline, as shown in Table 7. For the two-stage pipeline, we train a text-to-image generator and use the decoder of the dVAE directly as the reconstructor. ‘Two-stage G (R)’ refers to the separately trained generator (reconstructor), and ‘end-to-end G (R)’ refers to the end-to-end trained generator (reconstructor). Our end-to-end method achieves a substantial FID improvement of 1.5 over the two-stage pipeline. We find that combining the end-to-end trained generator (‘end-to-end G’) with the dVAE decoder (‘two-stage R’) also brings an FID improvement of 0.9 over the two-stage pipeline, but still falls behind the fully end-to-end method. This indicates that our proposed end-to-end method improves both the generator (two-stage G & two-stage R vs end-to-end G & two-stage R) and the reconstructor (end-to-end G & two-stage R vs end-to-end G & end-to-end R).

We also feed the visual sequences of real images discretized by the dVAE (‘gold image sequences’) to the two reconstructors for comparison. Experimental results (the last two rows of Table 7) show that the end-to-end trained reconstructor also has an advantage when reconstructing from the discrete representations of real images.

We expect end-to-end training to be even more effective on the 10-billion-parameter ERNIE-ViLG, since the discrete image representations generated by a more capable generator are much closer to the real distribution, and the hidden embeddings of a larger model provide more useful features for the reconstructor. Due to the training instability of both GANs and large-scale generative models, we have not yet applied end-to-end training to our 10-billion-parameter model based on VQGAN. We will address this instability in future work and improve the 10-billion-parameter ERNIE-ViLG through end-to-end training.
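The core difficulty end-to-end training addresses can be sketched numerically (an assumed mechanism, not the paper's exact one): a hard argmax lookup of generated token logits into the codebook is non-differentiable, so the two-stage pipeline cannot pass reconstruction gradients back to the generator, whereas a soft, probability-weighted mixture of codebook vectors is differentiable and lets the reconstructor and generator train jointly. All sizes below are toy assumptions.

```python
# Sketch of the two-stage vs end-to-end hand-off between the visual sequence
# generator and the image reconstructor (assumed mechanism, toy sizes).
import numpy as np

rng = np.random.default_rng(0)
CODEBOOK_SIZE, CODE_DIM, NUM_TOKENS = 16, 8, 4   # toy sizes (assumptions)

codebook = rng.normal(size=(CODEBOOK_SIZE, CODE_DIM))   # VQ codebook vectors
logits = rng.normal(size=(NUM_TOKENS, CODEBOOK_SIZE))   # generator's per-token logits

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Two-stage pipeline: hard argmax lookup into the codebook.
# The argmax step blocks gradients, so the reconstructor cannot train the generator.
hard_codes = codebook[logits.argmax(axis=-1)]

# End-to-end variant: probability-weighted codebook mixture. This is a smooth
# function of the logits, so pixel-level reconstruction loss can backpropagate
# through the reconstructor into the generator.
soft_codes = softmax(logits) @ codebook

assert hard_codes.shape == soft_codes.shape == (NUM_TOKENS, CODE_DIM)
```

As the logits sharpen during training, the soft mixture approaches the hard lookup, so the reconstructor sees inputs increasingly close to what it would receive at inference time.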