Bibliography (26):

  1. Zero-Shot Text-to-Image Generation

  2. borisdayma/dalle-mini: DALL·E Mini (GitHub repository)

  3. DALL·E Mini report (Weights & Biases): https://wandb.ai/dalle-mini/dalle-mini/reports/DALL-E-mini--Vmlldzo4NjIxODA

  4. CogView: Mastering Text-to-Image Generation via Transformers

  5. China’s GPT-3? BAAI Introduces Superscale Intelligence Model ‘Wu Dao 1.0’: The Beijing Academy of Artificial Intelligence (BAAI) releases Wu Dao 1.0, China’s first large-scale pretraining model.

  6. DALL·E 2 (OpenAI): https://openai.com/dall-e-2

  7. GPT-3: Language Models are Few-Shot Learners

  8. Image GPT (iGPT): We find that, just as a large transformer model trained on language can generate coherent text, the same exact model trained on pixel sequences can generate coherent image completions and samples.

  9. Generating Diverse High-Fidelity Images with VQ-VAE-2

  10. BPEs: Neural Machine Translation of Rare Words with Subword Units

  11. GPT-3 Creative Fiction § BPEs

  12. VQ-VAE: Neural Discrete Representation Learning

  13. Auto-Encoding Variational Bayes

  14. Stochastic Backpropagation and Approximate Inference in Deep Generative Models

  15. Categorical Reparameterization with Gumbel-Softmax

  16. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables

  17. The Unusual Effectiveness of Averaging in GAN Training

  18. CLIP: Learning Transferable Visual Models From Natural Language Supervision

  19. CLIPScore: A Reference-free Evaluation Metric for Image Captioning

  20. CogView paper, page 8: https://arxiv.org/pdf/2105.13290.pdf#page=8

  21. CLIP: Connecting Text and Images: We’re introducing a neural network called CLIP which efficiently learns visual concepts from natural language supervision. CLIP can be applied to any visual classification benchmark by simply providing the names of the visual categories to be recognized, similar to the ‘zero-shot’ capabilities of GPT-2 and GPT-3.