“DALL·E 2: Hierarchical Text-Conditional Image Generation With CLIP Latents § 7. Limitations and Risks”, Aditya A. Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen; 2022-04-13:

…Although conditioning image generation on CLIP embeddings improves diversity, this choice does come with certain limitations. In particular, unCLIP is worse at binding attributes to objects than a corresponding GLIDE model. In Figure 14, we find that unCLIP struggles more than GLIDE with a prompt where it must bind 2 separate objects (cubes) to 2 separate attributes (colors). We hypothesize that this occurs because the CLIP embedding itself does not explicitly bind attributes to objects, and find that reconstructions from the decoder often mix up attributes and objects, as shown in Figure 15.
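The binding failure can be made concrete with a toy model (this is an illustration of the hypothesis, not CLIP itself): if a caption embedding behaves like an order-insensitive sum of word vectors, then two captions that differ only in which attribute is attached to which object map to the same embedding, so the decoder has no signal for the correct binding. The vocabulary and hash-based word vectors below are hypothetical stand-ins for learned embeddings.

```python
import hashlib

def word_vec(word, dim=8):
    # Deterministic integer-valued stand-in for a learned word embedding
    # (integers keep the sums below exact, so equality is well-defined).
    h = hashlib.sha256(word.encode()).digest()
    return [b - 128 for b in h[:dim]]

def bag_embed(caption):
    # Order-insensitive sum of word vectors: word order, and hence which
    # attribute goes with which object, is discarded entirely.
    vecs = [word_vec(w) for w in caption.lower().split()]
    return [sum(col) for col in zip(*vecs)]

a = bag_embed("a red cube on top of a blue cube")
b = bag_embed("a blue cube on top of a red cube")
print(a == b)  # True: the two bindings are indistinguishable
```

Real CLIP embeddings are not literally a bag of words, but to the extent they under-weight word order, the same ambiguity arises for the decoder.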

Figure 15: Reconstructions from the decoder for difficult binding problems. We find that the reconstructions mix up objects and attributes. In the first 2 examples, the model mixes up the color of 2 objects. In the rightmost example, the model does not reliably reconstruct the relative size of 2 objects.

A similar and likely related issue is that unCLIP struggles to produce coherent text, as illustrated in Figure 16; it is possible that the CLIP embedding does not precisely encode the spelling of rendered text. This issue is likely made worse because the BPE encoding we use obscures the spelling of the words in a caption from the model, so the model needs to have independently seen each token written out in the training images in order to learn to render it.
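To see why BPE obscures spelling, consider a simplified greedy longest-prefix segmenter standing in for BPE (the vocabulary here is invented for illustration, not the tokenizer unCLIP actually uses): the model receives opaque subword tokens, not characters, so the glyph sequence "l-e-a-r-n" is only learnable from training images in which that token's text happens to appear rendered.

```python
def toy_bpe(word, vocab):
    # Greedy longest-prefix segmentation, a simplified stand-in for BPE.
    # The model downstream sees only the resulting token identities,
    # never the individual characters inside each token.
    pieces = []
    while word:
        for end in range(len(word), 0, -1):
            if word[:end] in vocab or end == 1:
                pieces.append(word[:end])
                word = word[end:]
                break
    return pieces

# Hypothetical subword vocabulary for illustration.
vocab = {"deep", "learn", "ing", "sign", "says"}
print(toy_bpe("learning", vocab))  # ['learn', 'ing']
print(toy_bpe("deep", vocab))      # ['deep']
```

Because "learning" arrives as the two token IDs for "learn" and "ing", correctly rendering it requires having seen each of those tokens written out somewhere in the training images; the spelling is not recoverable from the token identities alone.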

Figure 16: Samples from unCLIP for the prompt, “A sign that says deep learning.”

[My comments on severity of DALL·E 2 limitations]