Bibliography (16):

  1. CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers

  2. Attention Is All You Need

  3. Generating Diverse High-Fidelity Images with VQ-VAE-2

  4. Microsoft COCO: Common Objects in Context

  5. DALL·E 1: Creating Images from Text

  6. M6: A Chinese Multimodal Pretrainer

  7. https://x.com/ak92501/status/1398079706180763649

  8. CogView code repository: the repo for the NeurIPS 2021 paper "CogView: Mastering Text-to-Image Generation via Transformers"

  9. WuDaoMM: A large-scale Multi-Modal Dataset for Pre-training models

  10. WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training

  11. CogView: Text-to-Image Generation

  12. Controllable Generation from Pre-trained Language Models via Inverse Prompting

  13. Self-distillation: Born Again Neural Networks

  14. CLIP: Connecting Text and Images: We’re introducing a neural network called CLIP which efficiently learns visual concepts from natural language supervision. CLIP can be applied to any visual classification benchmark by simply providing the names of the visual categories to be recognized, similar to the ‘zero-shot’ capabilities of GPT-2 and GPT-3