[Paper: “DALL·E 1: Zero-Shot Text-to-Image Generation”, Ramesh et al 2021. Re-implementation: DALL·E 1 Mini (writeup). cf. CogView, Wu Dao. Availability through OA API still planned as of 2021-09-05; see DALL·E 2.] DALL·E 1 is a 12-billion parameter version of GPT-3 trained to generate images from text descriptions, using a dataset of text-image pairs. We’ve found that it has a diverse set of capabilities, including creating anthropomorphized versions of animals and objects, combining unrelated concepts in plausible ways, rendering text, and applying transformations to existing images.
GPT-3 showed that language can be used to instruct a large neural network to perform a variety of text generation tasks. iGPT showed that the same type of neural network can also be used to generate images with high fidelity. [iGPT is another answer to the question of “how do we do images autoregressively, but not at the exorbitant cost of generating pixels 1 by 1?”; iGPT uses ‘super-pixels’ & very small images, while DALL·E 1 uses VAE ‘tokens’ corresponding roughly to small squares, so the token sequence is relatively short, and the VAE does the actual decoding to raw pixels.] We extend these findings to show that manipulating visual concepts through language is now within reach.
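The cost argument in the annotation above can be made concrete with the numbers from the paper: generating a 256×256 image pixel-by-pixel would require tens of thousands of autoregressive steps, while the discrete-VAE grid cuts the image side of the transformer’s sequence by a factor of 64.

```python
# Sequence-length arithmetic behind the design (values from the paper):
# a 256x256 image has 65,536 pixel positions (ignoring color channels),
# but the discrete VAE compresses it to a 32x32 grid of tokens, so the
# transformer models a far shorter combined text+image sequence.
pixels = 256 * 256             # raw pixel positions per image
image_tokens = 32 * 32         # discrete VAE latent codes per image
text_tokens = 256              # max BPE-encoded caption tokens
total_sequence = text_tokens + image_tokens

print(pixels)                  # 65536
print(image_tokens)            # 1024
print(total_sequence)          # 1280
print(pixels // image_tokens)  # 64
```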
[3 DALL·E 1 prompts: “an armchair in the shape of an avocado…” · “a store front that has the word ‘openai’ written on it…” · “the exact same cat on the top as a sketch on the bottom”]
DALL·E 1’s vocabulary has tokens for both text and image concepts. Specifically, each image caption is represented using a maximum of 256 BPE-encoded tokens with a vocabulary size of 16,384 [DALL·E 2 continued BPE use, and with public access it became clear that it had inherited the usual BPE pathologies], and the image is represented using 1,024 tokens with a vocabulary size of 8,192. The images are preprocessed to 256×256 resolution during training. Similar to VQ-VAE [14, 15], each image is compressed to a 32×32 grid of discrete latent codes using a discrete VAE that we pretrained using a continuous relaxation [12, 13]. We found that training using the relaxation obviates the need for an explicit codebook, EMA loss, or tricks like dead-code revival, and can scale up to large vocabulary sizes.
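The “continuous relaxation” here is the Gumbel-softmax trick: instead of a hard, non-differentiable argmax over the 8,192-entry codebook, the encoder’s logits are turned into a temperature-controlled softmax that approaches a one-hot sample as the temperature is annealed down (the paper anneals it toward 1/16). A minimal NumPy sketch, where the random logits are purely illustrative:

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Continuously relaxed sample over a discrete codebook.

    Adding Gumbel(0, 1) noise to the logits and taking a softmax at
    temperature tau gives a differentiable, reparameterized sample;
    as tau shrinks, the output concentrates toward a one-hot argmax.
    """
    rng = np.random.default_rng() if rng is None else rng
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + gumbel) / tau
    y = y - y.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)

# One image position choosing among an 8,192-entry codebook
# (vocabulary size from the paper; the logits are random stand-ins):
rng = np.random.default_rng(0)
logits = rng.normal(size=8192)
soft = gumbel_softmax(logits, tau=1.0, rng=rng)     # smooth mixture
hard = gumbel_softmax(logits, tau=1 / 16, rng=rng)  # nearly one-hot
```

Because the relaxed sample is just a softmax over all codebook entries, no separate codebook lookup, EMA update, or dead-code revival is needed, matching the paper’s observation.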
…Capabilities: We find that DALL·E 1 is able to create plausible images for a great variety of sentences that explore the compositional structure of language. We illustrate this using a series of interactive visuals in the next section. The samples shown for each caption in the visuals are obtained by taking the top 32 of 512 samples after reranking with CLIP [cf. CLIPScore, CogView’s caption scores], but we do not use any manual cherry-picking, aside from the thumbnails and standalone images that appear outside the visuals.
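The reranking step is generic best-of-N selection: draw N candidates, score every caption-image pair with a joint model, keep the top k. A sketch, where `generate` and `score` are hypothetical stand-ins for the DALL·E 1 sampler and a CLIP-style scoring function (the 512 and 32 are the numbers from the post):

```python
import random

def rerank(caption, generate, score, n_samples=512, top_k=32):
    """Draw n_samples candidate images for a caption and keep the
    top_k as ranked by a caption-image scoring function.

    `generate` and `score` are stand-ins for the DALL-E sampler and a
    CLIP-style joint scorer; any callables with these shapes work.
    """
    candidates = [generate(caption) for _ in range(n_samples)]
    ranked = sorted(candidates, key=lambda img: score(caption, img),
                    reverse=True)
    return ranked[:top_k]

# Toy usage: "images" are random floats, scored by their own value.
random.seed(0)
picks = rerank("a capybara at sunrise",
               generate=lambda cap: random.random(),
               score=lambda cap, img: img)
```

Note that reranking only filters samples after the fact; it sharpens the displayed results without changing the underlying model.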
Controlling attributes: We test DALL·E 1’s ability to modify several of an object’s attributes, as well as the number of times that it appears.
Drawing multiple objects
Visualizing perspective and three-dimensionality
Visualizing internal and external structure
Inferring contextual details
We find that DALL·E 1 is able to render the same scene in a variety of different styles, and can adapt the lighting, shadows, and environment based on the time of day or season: “a…of a capybara sitting in a field at sunrise”
…With varying degrees of reliability, DALL·E 1 provides access to a subset of the capabilities of a 3D rendering engine via natural language. It can independently control the attributes of a small number of objects, and to a limited extent, how many there are, and how they are arranged with respect to one another. It can also control the location and angle from which a scene is rendered, and can generate known objects in compliance with precise specifications of angle and lighting conditions.
Zero-shot visual reasoning: GPT-3 can be instructed to perform many kinds of tasks solely from a description and a cue to generate the answer supplied in its prompt, without any additional training. For example, when prompted with the phrase “here is the sentence ‘a person walking his dog in the park’ translated into French:”, GPT-3 answers “un homme qui promène son chien dans le parc.” This capability is called zero-shot reasoning. We find that DALL·E 1 extends this capability to the visual domain, and is able to perform several kinds of image-to-image translation tasks when prompted in the right way. [cf. CLIP.]
We did not anticipate that this capability would emerge, and made no modifications to the neural network or training procedure to encourage it. Motivated by these results, we measure DALL·E 1’s aptitude for analogical reasoning problems by testing it on Raven’s Progressive Matrices, a visual IQ test that saw widespread use in the 20th century. Rather than treating the IQ test as a multiple-choice problem as originally intended, we ask DALL·E 1 to complete the bottom-right corner of each image using argmax sampling, and consider its completion to be correct if it is a close visual match to the original. DALL·E 1 is often able to solve matrices that involve continuing simple patterns or basic geometric reasoning, such as those in sets B and C. It is sometimes able to solve matrices that involve recognizing permutations and applying boolean operations, such as those in set D. The instances in set E tend to be the most difficult, and DALL·E 1 gets almost none of them correct. For each of the sets, we measure DALL·E 1’s performance on both the original images, and the images with the colors inverted. The inversion of colors should pose no additional difficulty for a human, yet does generally impair DALL·E 1’s performance, suggesting its capabilities may be brittle in unexpected ways.
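“Argmax sampling” here means greedy decoding: at every step the single most probable next image token is taken, rather than a stochastic sample, giving a deterministic completion of the matrix. A minimal sketch, where `next_token_logits_fn` is a hypothetical stand-in for the model:

```python
import numpy as np

def argmax_decode(next_token_logits_fn, prefix, n_new_tokens):
    """Greedy ('argmax') decoding: at each step append the single most
    probable next token instead of sampling, so the completion of the
    given prefix is deterministic."""
    tokens = list(prefix)
    for _ in range(n_new_tokens):
        logits = next_token_logits_fn(tokens)
        tokens.append(int(np.argmax(logits)))
    return tokens

# Toy stand-in model over a 4-token vocabulary: it always prefers the
# token (last + 1) % 4, so greedy decoding cycles deterministically.
def toy_logits(tokens):
    out = np.zeros(4)
    out[(tokens[-1] + 1) % 4] = 1.0
    return out

print(argmax_decode(toy_logits, [0], 5))  # [0, 1, 2, 3, 0, 1]
```

Determinism matters for the evaluation described above: with greedy decoding each matrix gets exactly one completion to compare against the original, rather than a distribution of samples.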