Conventional methods for image-text generation mainly tackle the two naturally bidirectional tasks separately, focusing on designing task-specific frameworks to improve the quality and fidelity of the generated samples. Recently, vision-language pre-training models have greatly improved the performance of image-to-text generation, but large-scale pre-training models for text-to-image synthesis remain under-developed.
In this paper, we propose ERNIE-ViLG, a unified generative pre-training framework for bidirectional image-text generation with a transformer model. Based on image quantization models, we formulate both image generation and text generation as autoregressive generative tasks conditioned on the text/image input [cf. CogView, GLIDE, RuDOLPH, L-Verse, Huang et al. 2021]. This bidirectional generative modeling eases semantic alignment across vision and language. For the text-to-image generation process, we further propose an end-to-end training method to jointly learn the visual sequence generator and the image reconstructor.
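As a minimal sketch of this formulation, both directions can be cast as plain left-to-right sequence generation over a concatenation of text tokens and quantized image codes. The special-token ids and example token values below are illustrative placeholders, not taken from the paper:

```python
# Toy sketch: both generation directions become left-to-right sequence
# modeling. BOS/SEP ids and the token values are hypothetical.
BOS, SEP = 0, 1

def text_to_image_sequence(text_tokens, image_tokens):
    """Condition on text; the model autoregressively predicts image codes."""
    return [BOS] + text_tokens + [SEP] + image_tokens

def image_to_text_sequence(image_tokens, text_tokens):
    """Condition on quantized image codes; the model predicts text tokens."""
    return [BOS] + image_tokens + [SEP] + text_tokens

# Example: 3 text tokens conditioning 4 quantized image codes
seq = text_to_image_sequence([101, 102, 103], [501, 502, 503, 504])
print(seq)  # [0, 101, 102, 103, 1, 501, 502, 503, 504]
```

At training time, a single transformer can be optimized on both sequence layouts, which is what allows one set of parameters to serve both generation directions.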
To explore the landscape of large-scale pre-training for bidirectional text-image generation, we train a 10-billion-parameter ERNIE-ViLG model on a large-scale dataset of 145 million (Chinese) image-text pairs. The model achieves state-of-the-art performance on both text-to-image and image-to-text tasks, obtaining an FID of 7.9 on MS COCO for text-to-image synthesis and the best results on COCO-CN and AIC-ICC for image captioning.
The sources of our dataset are listed as follows:
Chinese Webpages: We crawl 800 million raw Chinese alt-text descriptions paired with images from various Chinese webpages, conduct several filtering steps, and harvest 70 million text-image pairs in total. The filtering rules mainly include: (1) Text length: the number of words in the alt-text must be less than 15. (2) Text content: the alt-text must contain at least one noun and no special characters. (3) Image-text similarity: the similarity score between the alt-text and the image (calculated by an in-house text-image matching model with a score range of 0.0–1.0) must be greater than 0.5.
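The three rules above could be sketched as a single filter predicate. This is only an illustration: the noun detector is reduced to a caller-supplied noun set, the special-character list is a guess, and the similarity score stands in for the in-house matching model, none of which are specified in the paper:

```python
import re

def passes_filters(alt_text, similarity_score, nouns=None):
    """Illustrative version of the three filtering rules (placeholders for
    the real noun tagger and the in-house text-image matching model)."""
    words = alt_text.split()
    # Rule 1: fewer than 15 words
    if len(words) >= 15:
        return False
    # Rule 2: at least one noun, and no special characters
    nouns = nouns if nouns is not None else set()
    if not any(w in nouns for w in words):
        return False
    if re.search(r"[<>@#$%^*{}|\\]", alt_text):  # hypothetical character list
        return False
    # Rule 3: image-text similarity above 0.5 (score in 0.0-1.0)
    return similarity_score > 0.5

passes_filters("a red cat on grass", 0.8, nouns={"cat", "grass"})  # kept
passes_filters("a red cat on grass", 0.3, nouns={"cat"})           # dropped
```

A real pipeline over Chinese alt-text would segment words with a tokenizer rather than whitespace, but the rule structure is the same.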
Image Search Engine: We collect roughly 60 million query texts and corresponding user-clicked images from our internal image search engine. There is often a strong correlation between the query and user-clicked images.
Public Image-Text Datasets: We collect a total of 15 million text-image pairs from two public datasets, CC and CC-12M. The captions in these datasets are translated into Chinese through the Baidu Translate API.
Figure 4: Example images generated by ERNIE-ViLG in the zero-shot setting with open-domain texts. Figures 4(a)–4(b) show generated images of simple objects. Figures 4(c)–4(e) exhibit generated images of complex scenes with multiple objects. An example of creating an image of a non-existent object is displayed in Figure 4(f).
Figure 5: Images of different styles generated by ERNIE-ViLG. "None" indicates not adding any prompts about image style.
Figure 6: Generated images given Chinese ancient poetry.
Qualitative Results: ERNIE-ViLG has acquired the capability to generate various scenes, from basic objects to complex combinations of objects. Some examples are shown in Figure 4. As the examples show, ERNIE-ViLG can not only draw the entities mentioned in a given text description, but also combine them with the background in a reasonable way. Surprisingly, we also find two special skills that ERNIE-ViLG develops. First, ERNIE-ViLG can generate images of different styles by simply adding text prompts, without the fine-tuning that CogView requires (Figure 5). Second, our model can generate realistic images given Chinese ancient poetry, which demonstrates a promising understanding of brief and abstract descriptions: real concepts in the poetry are well organized, and the artistic conception is well rendered (Figure 6).
We compare the results of our end-to-end training method with the two-stage pipeline baseline, as shown in Table 7. For the two-stage pipeline, we train a text-to-image generator and use the decoder of the dVAE directly as the reconstructor. "Two-stage G (R)" refers to the separately trained generator (reconstructor), and "end-to-end G (R)" refers to the end-to-end trained generator (reconstructor). Our end-to-end method achieves a substantial FID improvement of 1.5 over the two-stage pipeline. We find that combining the end-to-end trained generator ("end-to-end G") with the dVAE decoder ("two-stage R") also brings an FID improvement of 0.9 over the two-stage pipeline, but still falls behind the fully end-to-end method. This indicates that our proposed end-to-end method improves both the generator (two-stage G & two-stage R vs. end-to-end G & two-stage R) and the reconstructor (end-to-end G & two-stage R vs. end-to-end G & end-to-end R).
We also feed the visual sequences of real images discretized by the dVAE ("gold image sequences") to the two reconstructors for comparison. Experimental results (the last two lines in Table 7) show that the end-to-end trained reconstructor also holds an advantage in reconstructing from real discrete image representations.
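The key structural difference between the two schemes can be sketched as what each reconstructor consumes. Everything below is a toy stand-in (not the ERNIE-ViLG implementation): the point is that the two-stage decoder sees only discrete token ids, while the end-to-end reconstructor sees continuous hidden embeddings through which gradients can flow back into the generator:

```python
# Conceptual sketch of the two training schemes' information flow.
# All functions and values are illustrative placeholders.

def toy_generate(text):
    """Stand-in generator: yields discrete codes and continuous hiddens."""
    codes = [ord(c) % 8192 for c in text]               # quantized image token ids
    hiddens = [[(ord(c) % 100) / 100.0] for c in text]  # per-token hidden embeddings
    return codes, hiddens

def two_stage_reconstruct(codes):
    # The frozen dVAE decoder consumes only discrete token ids, so the
    # reconstruction loss cannot back-propagate into the generator.
    return ["patch_%d" % c for c in codes]

def end_to_end_reconstruct(hiddens):
    # The jointly trained reconstructor consumes continuous hidden
    # embeddings, letting reconstruction gradients update the generator too.
    return ["patch_%.2f" % h[0] for h in hiddens]

codes, hiddens = toy_generate("ab")
```

In a real implementation the reconstructor would be a neural decoder and the joint loss would combine token prediction and pixel reconstruction; the sketch only highlights the discrete-vs-continuous interface that makes end-to-end training possible.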
We expect end-to-end training to be even more effective on ERNIE-ViLG with 10 billion parameters, since the discrete image representations generated by a more capable generator are much closer to the real distribution, and the hidden embeddings of a larger model provide more useful features for the reconstructor. Due to the training instability of both the GAN and the large-scale generative model, we have not yet applied end-to-end training to our 10-billion-parameter model based on VQGAN. We will address this instability issue in future work and improve the 10-billion-parameter ERNIE-ViLG through end-to-end training.