Bibliography:

  1. ‘DALL·E’ tag

  2. Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

  3. JPEG-LM: LLMs as Image Generators with Canonical Codec Representations

  4. MAR: Autoregressive Image Generation without Vector Quantization

  5. STAR: Scale-wise Text-to-image generation via Auto-Regressive representations

  6. Chameleon: Mixed-Modal Early-Fusion Foundation Models

  7. Visual Autoregressive Modeling (VAR): Scalable Image Generation via Next-Scale Prediction

  8. IconShop: Text-Guided Vector Icon Synthesis with Autoregressive Transformers

  9. Rejuvenating image-GPT as Strong Visual Representation Learners

  10. Image Captioners Are Scalable Vision Learners Too

  11. Artificial intelligence and art: Identifying the esthetic judgment factors that distinguish human & machine-generated artwork

  12. VALL-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

  13. Retrieval-Augmented Multimodal Language Modeling

  14. Draft-and-Revise: Effective Image Generation with Contextual RQ-Transformer

  15. CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

  16. CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers

  17. MaskGIT: Masked Generative Image Transformer

  18. CM3: A Causal Masked Multimodal Model of the Internet

  19. ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation

  20. Emojich—zero-shot emoji generation using Russian language: a technical report

  21. LAFITE: Towards Language-Free Training for Text-to-Image Generation

  22. NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion

  23. L-Verse: Bidirectional Generation Between Image and Text

  24. Telling Creative Stories Using Generative Visual Aids

  25. Unifying Multimodal Transformer for Bi-directional Image and Text Generation

  26. Illiterate DALL·E Learns to Compose

  27. What Users Want? WARHOL: A Generative Model for Recommendation

  28. ImaginE: An Imagination-Based Automatic Evaluation Metric for Natural Language Generation

  29. Chinese AI lab challenges Google, OpenAI with a model of 1.75 trillion parameters

  30. M6-UFC: Unifying Multi-Modal Controls for Conditional Image Synthesis

  31. CogView: Mastering Text-to-Image Generation via Transformers

  32. GODIVA: Generating Open-DomaIn Videos from nAtural Descriptions

  33. VideoGPT: Video Generation using VQ-VAE and Transformers

  34. China’s GPT-3? BAAI Introduces Superscale Intelligence Model ‘Wu Dao 1.0’: The Beijing Academy of Artificial Intelligence (BAAI) releases Wu Dao 1.0, China’s first large-scale pretraining model.

  35. Paint by Word

  36. Generating Images with Sparse Representations

  37. M6: A Chinese Multimodal Pretrainer

  38. DALL·E 1: Creating Images from Text: We’ve trained a neural network called DALL·E that creates images from text captions for a wide range of concepts expressible in natural language

  39. Taming Transformers for High-Resolution Image Synthesis

  40. Text-to-Image Generation Grounded by Fine-Grained User Attention

  41. X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers

  42. iGPT: Generative Pretraining from Pixels

  43. Image GPT (iGPT): We find that, just as a large transformer model trained on language can generate coherent text, the same exact model trained on pixel sequences can generate coherent image completions and samples

  44. The messy, secretive reality behind OpenAI’s bid to save the world: The AI moonshot was founded in the spirit of transparency. This is the inside story of how competitive pressure eroded that idealism

  45. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning

  46. Image Transformer

  47. VQ-VAE: Neural Discrete Representation Learning

  48. Categorical Reparameterization with Gumbel-Softmax

  49. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables

  50. Borisdayma/dalle-Mini: DALL·E-Mini

  51. Kingnobro/IconShop: (SIGGRAPH Asia 2023) Code of "IconShop: Text-Guided Vector Icon Synthesis With Autoregressive Transformers"

  52. How Sber Built RuDALL-E: Interview With Sergei Markov

  53. The Little Red Boat Story (Make-A-Scene): Our Own Model Was Used to Generate All the Images in the Story, by Providing a Text and Simple Sketch Input

  54. https://wandb.ai/dalle-mini/dalle-mini/reports/DALL-E-mini--Vmlldzo4NjIxODA