Bibliography:

  1. https://wenxin.baidu.com/ernie-vilg

  2. https://x.com/jaguring1/status/1564369413922381824

  3. https://huggingface.co/spaces/PaddlePaddle/ERNIE-ViLG

  4. https://colab.research.google.com/drive/1z1Sy7HXWPY8R295tNA-UrFYLfnBe0okl

  5. CogView: Mastering Text-to-Image Generation via Transformers

  6. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

  7. RUDOLPH: One Hyper-Tasking Transformer Can Be Creative As DALL-E and GPT-3 and Smart As CLIP

  8. L-Verse: Bidirectional Generation Between Image and Text

  9. Unifying Multimodal Transformer for Bi-directional Image and Text Generation

  10. ‘end-to-end’ directory

  11. Microsoft COCO: Common Objects in Context

  12. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning

  13. Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts

  14. https://arxiv.org/pdf/2112.15283#page=13&org=baidu

  15. VQ-GAN: Taming Transformers for High-Resolution Image Synthesis