Bibliography (15):

https://x.com/dilipkay/status/1610091360203476993
Paella: Fast Text-Conditional Discrete Denoising on Vector-Quantized Latent Spaces
MaskGIT: Masked Generative Image Transformer
https://www.youtube.com/watch?v=2AsoWS2t484
Attention Is All You Need
Imagen: Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
https://openai.com/dall-e-2
https://parti.research.google/
High-Resolution Image Synthesis with Latent Diffusion Models
Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning
Microsoft COCO: Common Objects in Context
CLIP: Connecting Text and Images
https://muse-model.github.io/
Wikipedia Bibliography:
1. Autoregressive model
2. Fréchet inception distance