Bibliography (17):

https://wenxin.baidu.com/ernie-vilg
https://x.com/jaguring1/status/1564369413922381824
https://huggingface.co/spaces/PaddlePaddle/ERNIE-ViLG
https://colab.research.google.com/drive/1z1Sy7HXWPY8R295tNA-UrFYLfnBe0okl
CogView: Mastering Text-to-Image Generation via Transformers
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
RUDOLPH: One Hyper-Tasking Transformer Can Be Creative As DALL-E and GPT-3 and Smart As CLIP
L-Verse: Bidirectional Generation Between Image and Text
Unifying Multimodal Transformer for Bi-directional Image and Text Generation
‘end-to-end’ directory
Microsoft COCO: Common Objects in Context
Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
https://arxiv.org/pdf/2112.15283#page=13&org=baidu
VQ-GAN: Taming Transformers for High-Resolution Image Synthesis

Wikipedia Bibliography:

1. Fréchet inception distance
2. Generative adversarial network