Scaling Image Generation Will Work

Swimmer963 highlights DALL·E 2 struggling with anime, realistic faces, text in images, multiple characters/objects arranged in complex ways, and editing. (Of course, many of these are still good by the standards of just months ago, and the glass is definitely more than half full.) itsnotatumor asks:

How many of these “cannot dos” will be solved by throwing more compute and training data at the problem? Anyone know if we’ve started hitting diminishing returns with this stuff yet?

In general, we have not topped out on pretty much any scaling curve. Whether it’s language modeling, image generation, DRL, or what-have-you, AFAIK not a single modality can be truly said to have been ‘solved’ with the scaling curve broken. Either the scaling curve is flat, or we’re still far away. (There are some sound-related ones which seem to be close, but nothing all that important.) The only diffusion-model scaling law I know of is an older one which bends a little but probably reflects poor hyperparameters, and no one has tried e.g. Chinchilla scaling laws on them yet.

So yes, we definitely can just make all the compute-budgets 10× larger without wasting it.
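To make "not topped out" concrete, here is a minimal sketch of what an unbroken scaling curve looks like, assuming loss follows a pure power law in compute, `loss = a * compute ** (-b)`. The constants and data points are synthetic and chosen only for illustration; they are not measurements from any real model, and the helper names are hypothetical.

```python
import math

def fit_power_law(compute, loss):
    """Least-squares fit of log(loss) = log(a) - b*log(compute); returns (a, b)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    return math.exp(intercept), -slope

def predict(a, b, compute):
    """Loss predicted by the fitted power law at a given compute budget."""
    return a * compute ** (-b)

# Synthetic points generated from an assumed law (a=10, b=0.1); NOT real data.
TRUE_A, TRUE_B = 10.0, 0.1
compute = [10.0 ** k for k in range(1, 7)]
loss = [TRUE_A * c ** (-TRUE_B) for c in compute]

a, b = fit_power_law(compute, loss)

# Under a pure power law, every 10x of compute multiplies loss by the same
# factor 10**(-b), no matter how far along the curve you already are.
gain_per_10x = predict(a, b, 1e7) / predict(a, b, 1e6)
```

With these assumed constants, each 10× of compute multiplies the loss by about 0.79. A curve that had genuinely topped out would instead show the fitted exponent shrinking toward zero at the high-compute end, so the same 10× would buy less and less.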

To go through the specific issues (caveat: we don’t know if Make-A-Scene solves any of these because no one can use it; and I have not read the CogView2 paper—on skimming, CogView2 looks like it’d avoid most of the DALL·E 2 pathologies, but looks like it’s noticeably lower-quality in addition to lower-resolution):

  • anime & realistic faces are purely self-imposed problems by OA.

    DALL·E 2 will do them fine just as soon as OA wants it to, and other models by other orgs do just fine on those domains. So no real problem there.

  • text in images: this is an odd one.

    This is especially odd because it destroys the commercial application of any image with text in it (because it’s garbage—who’d pay for these?), and if you go back to DALL·E 1, one of the demos was it putting text into images like onto generated teapots or storefronts.

    It was imperfect, but DALL·E 2 is way worse at it, it looks like. I mean, DALL·E 1 would’ve at least spelled ‘Avengers’ correctly.

    Nostalgebraist has also shown you can get excellent text generation with a specialized small model, and people using CompVis (also much smaller than DALL·E 2) get good text results. So text in images is not intrinsically hard; this is a DALL·E 2-specific problem, whatever it is.

    Why? As Nostalgebraist discusses at length in his earlier post, the unCLIP approach to using GLIDE to create the DALL·E 2 system seems to have a lot of weird drawbacks and tradeoffs. Just as CLIP’s contrastive view of the world (rather than discriminative or generative) leads to strange artifacts like images tessellating a pattern, unCLIP seems to cripple DALL·E 2 in some ways, such as worsened compositionality. I don’t really get the unCLIP approach, so I’m not completely sure why it’d screw up text. The paper speculates that

    it is possible that the CLIP embedding does not precisely encode spelling information of rendered text.

    This issue is likely made worse because the BPE encoding we use obscures the spelling of the words in a caption from the model, so the model needs to have independently seen each token written out in the training images in order to learn to render it.

    D—N you BPEs! Is there nothing you won’t blight‽

    It may also be partially a dataset issue: OA’s licensing of commercial datasets may have greatly under-emphasized images which have text in them, which tends to be more of a dirty Internet or downstream user thing to have. If it’s unCLIP, raw GLIDE should be able to do text much better. If it’s the training data, it probably won’t be much different.

    If it’s the training data, it’ll be easy to fix if OA wants to fix it (like anime/faces); OA can find text-heavy datasets, or simply synthesize the necessary data by splatting random text in random fonts on top of random images & training etc. If it’s unCLIP, it can be hacked around by letting the users bypass unCLIP to use raw GLIDE, which as far as I know, they have no ability to do at the moment. (Seems like a very reasonable option to offer, if only for other purposes like research.) A longer-term solution would be to figure out a better unCLIP which avoids these contrastive pathologies, and a still longer-term solution would be to simply scale up enough that you no longer need this weird unCLIP thing to get diverse but high-quality samples, because the base models are just good enough.

    So this might be relatively easy to fix, or it might have an obvious fix which won’t be implemented for a long time.

  • complex scenes: this one is easy—unCLIP is screwing things up.

    The problem with these samples generally doesn’t seem to be that the objects are rendered badly by GLIDE or the upscalers; the problem seems to be that the objects are just organized wrong because the DALL·E 2 system as a whole didn’t understand the text input. That is, CLIP gave GLIDE the wrong blueprint, and that is irreversible. And we know that GLIDE can do these things better because the paper shows us how much better it does on one pair of prompts (no extensive or quantitative evaluation, however):

    In Figure 14, we find that unCLIP struggles more than GLIDE with a prompt where it must bind two separate objects (cubes) to two separate attributes (colors). We hypothesize that this occurs because the CLIP embedding itself does not explicitly bind attributes to objects, and find that reconstructions from the decoder often mix up attributes and objects, as shown in Figure 15.

    And it’s pretty obvious that it almost has to screw up like this if you want to take the approach of a contrastively-learned fixed-size embedding (Nostalgebraist again): a fixed-size embedding is going to struggle if you want to stack on arbitrarily many details, especially without any recurrency or iteration (like DALL·E 1 had in being a Transformer on text inputs + VAE-token outputs). And a contrastive model like CLIP isn’t going to do relationships or scenes as well as it does other things because it just doesn’t encounter all that many pairs of images where the objects are all the same but their relationship is different as specified by the text caption, which is the sort of data which would force it to learn how “the red box is on top of the blue box” looks different from “the blue box is on top of the red box”.

    Like before, just offering GLIDE as an option would fix a lot of the problems here. unCLIP screws up your complex scene? Do it in GLIDE. GLIDE is hard to guide or lower-quality? Maybe seed it in GLIDE and then jazz it up in the full DALL·E 2?

    Longer-term, a better text encoder would go a long way to resolving all sorts of problems. Just existing text models would be enough, no need for hypothetical new architectures. People are accusing DALL·E 2 of lacking good causal understanding or not being able to solve problems of various sorts; fine, but CLIP is a very bad way to understand language, being a very small text encoder (base model: 0.06b) trained contrastively from scratch on short image captions rather than initialized from a real autoregressive language model. (Remember, OA started the CLIP research with autoregressive generation, as in Figure 2 of the CLIP paper; it just found that more expensive, not worse, and switched to CLIP.) A real language model, like Chinchilla-70b, would do much better when fused to an image model, like in Flamingo.
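The attribute-binding failure in the complex-scenes item can be illustrated with a deliberately extreme toy: an embedding that just sums per-word vectors into one fixed-size vector. This is not how CLIP works internally (its text encoder is a Transformer); `word_vector` and `embed` are hypothetical stand-ins, chosen only to show how a pooled fixed-size summary can lose which attribute goes with which object.

```python
import hashlib

DIM = 16  # fixed embedding size; arbitrary for this toy

def word_vector(word):
    """Deterministic pseudo-random vector of odd integers in [-255, 255].

    Hypothetical stand-in for a learned word embedding.
    """
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return [2 * byte - 255 for byte in digest[:DIM]]

def embed(caption):
    """Order-invariant pooling: sum the word vectors into one fixed-size vector."""
    total = [0] * DIM
    for word in caption.lower().split():
        for i, value in enumerate(word_vector(word)):
            total[i] += value
    return total

a = embed("the red box is on top of the blue box")
b = embed("the blue box is on top of the red box")  # same words, swapped roles
c = embed("the red box is next to the blue box")    # different words

# a == b: the two scenes are indistinguishable to anything reading the embedding.
# a != c: the embedding still separates captions that use different words.
swapped_scenes_collide = a == b
different_words_differ = a != c
```

A real contrastive encoder is not literally a bag of words, but if its training data rarely contains caption pairs that differ only in relationships, it is under little pressure to do much better than this toy on exactly those pairs.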

So, these DALL·E 2 problems all look soluble to me by pursuing just known techniques. They stem from either deliberate choices, removing the user’s ability to choose a different tradeoff, or lack of simple-minded scaling.
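As an example of how simple some of these known techniques are, the synthetic-data idea from the text-in-images item (splatting random text on top of random images) can be sketched in a few lines. This is a toy under loud assumptions: the 3×5 bitmap glyphs, the noise backgrounds, and all function names are hypothetical, and a real pipeline would render many real fonts onto real photos with an imaging library.

```python
import random

# Hypothetical 3x5 bitmap "font" covering a tiny alphabet.
GLYPHS = {
    "A": ["010", "101", "111", "101", "101"],
    "C": ["011", "100", "100", "100", "011"],
    "E": ["111", "100", "110", "100", "111"],
    "T": ["111", "010", "010", "010", "010"],
}
GLYPH_W, GLYPH_H = 3, 5

def random_image(rng, width, height):
    """Stand-in for a random background photo: uniform noise in 0..255."""
    return [[rng.randrange(256) for _ in range(width)] for _ in range(height)]

def splat_text(image, text, left, top, ink=255):
    """Overwrite pixels with the glyphs of `text`, one blank column between letters."""
    for index, char in enumerate(text):
        x0 = left + index * (GLYPH_W + 1)
        for dy, row in enumerate(GLYPHS[char]):
            for dx, bit in enumerate(row):
                if bit == "1":
                    image[top + dy][x0 + dx] = ink
    return image

def make_example(rng, width=32, height=16, max_len=5):
    """Return (image, caption, (left, top)) with random text at a random position."""
    length = rng.randint(1, max_len)
    text = "".join(rng.choice(sorted(GLYPHS)) for _ in range(length))
    text_w = len(text) * (GLYPH_W + 1) - 1
    left = rng.randint(0, width - text_w)
    top = rng.randint(0, height - GLYPH_H)
    image = splat_text(random_image(rng, width, height), text, left, top)
    caption = f'an image with the text "{text}"'
    return image, caption, (left, top)

rng = random.Random(0)  # seeded so the sketch is reproducible
image, caption, position = make_example(rng)
```

Because the caption is generated alongside the pixels, every training pair comes with exactly correct spelling supervision, which is the property scraped or licensed data may lack.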

  • 2023 edit: between IMAGEN, Parti, DALL·E 3, and the miracle-of-spelling paper, I think that my claims that text in images is simply a matter of scale, and that tokenization screws up text in images, are now fairly consensus in DL as of late 2023.

    Anime is also now routine, showing that whatever stopped DALL·E 2 from generating anime, it was nothing intrinsic to anime or image generation.

  • 2026 edit: as of mid-2026, multiple image generators, especially the seemingly autoregressive Google Nano Banana Pro and the OA ChatGPT Images-2.0, can do large amounts of high-accuracy text, and the severe problems with text inside images are now largely forgotten.
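The tokenization point (BPEs hiding spelling from the model) can be illustrated with a toy subword tokenizer. This is a sketch under assumptions: the vocabulary is invented for the example, and greedy longest-match is a simplification of how real BPE applies its learned merge rules.

```python
# Hypothetical subword vocabulary; real BPE vocabularies are learned from data
# and contain tens of thousands of entries.
VOCAB = {
    token: index
    for index, token in enumerate(
        ["a", "e", "g", "n", "r", "s", "v", "ge", "aven", "gers", "reven"]
    )
}

def tokenize(word):
    """Greedy longest-match segmentation into token IDs (simplified BPE stand-in)."""
    ids, pos = [], 0
    while pos < len(word):
        for end in range(len(word), pos, -1):
            piece = word[pos:end]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                pos = end
                break
        else:
            raise ValueError(f"no token covers {word[pos]!r}")
    return ids

avengers = tokenize("avengers")  # pieces: "aven" + "gers"
revenge = tokenize("revenge")    # pieces: "reven" + "ge"

# The two words share the five-letter run "venge", yet share no token IDs,
# so nothing in the model's input reveals that their spellings overlap.
shared_ids = set(avengers) & set(revenge)
```

The model only ever receives the integer IDs. That token 8 is spelled a-v-e-n is something it would have to learn separately for each token, for example by seeing it rendered in training images, which matches the paper's speculation quoted earlier.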