CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers
Attention Is All You Need
Generating Diverse High-Fidelity Images with VQ-VAE-2
Microsoft COCO: Common Objects in Context
DALL·E: Creating Images from Text. We’ve trained a neural network called DALL·E that creates images from text captions for a wide range of concepts expressible in natural language.
M6: A Chinese Multimodal Pretrainer
The official repository for the NeurIPS 2021 paper “CogView: Mastering Text-to-Image Generation via Transformers”.
WuDaoMM: A large-scale Multi-Modal Dataset for Pre-training models
WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training
CogView: Text-to-Image Generation
CogView: Mastering Text-to-Image Generation via Transformers
Controllable Generation from Pre-trained Language Models via Inverse Prompting
Self-distillation: Born-Again Neural Networks
CLIP: Connecting Text and Images. We’re introducing a neural network called CLIP, which efficiently learns visual concepts from natural language supervision. CLIP can be applied to any visual classification benchmark by simply providing the names of the visual categories to be recognized, similar to the ‘zero-shot’ capabilities of GPT-2 and GPT-3.