[BPEs strike again; cf. DALL·E 2, Itzhak & Levin 2021] Current image generation models struggle to reliably produce well-formed visual text. In this paper, we investigate a key contributing factor: popular text-to-image models lack character-level input features, making it much harder to predict a word’s visual makeup as a series of glyphs.
To quantify this effect, we conduct a series of experiments comparing character-aware [ByT5] and character-blind [T5, PaLM] text encoders. In the text-only domain, we find that character-aware models provide large gains on a novel spelling task (WikiSpell).
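To make the task concrete, a WikiSpell-style example maps a word to its letter-by-letter spelling. The sketch below is illustrative only; the exact separator and prompt formatting used for the benchmark targets are assumptions here, not the paper's specification.

```python
# Illustrative sketch of a WikiSpell-style (input, target) pair.
# The space separator in the target is an assumed formatting choice.
def make_example(word: str) -> tuple[str, str]:
    """Map a word to (input, target), where the target spells the
    word out character by character."""
    return word, " ".join(word)

inp, tgt = make_example("tokenization")
print(inp)  # tokenization
print(tgt)  # t o k e n i z a t i o n
```

A character-blind model must recover the target purely from what it has memorized about the input token's distribution, whereas a character-aware model can read the answer off its input directly.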
Applying these findings to the visual domain, we train a suite of image generation models [Imagen], and show that character-aware variants outperform their character-blind counterparts across a range of novel text rendering tasks (our DrawText benchmark).
Our models set a new state of the art in visual spelling, with 30+ point accuracy gains over competitors on rare words, despite training on far fewer examples.
Figure 1: Top: Image generation models lacking character-level input features often misspell words.
Bottom: Using a character-aware text encoder substantially improves the accuracy of rendered text.
Prompts are: “A vintage postage stamp with the message: _______”, with messages: (1) California: All Dreams Welcome, (2) Canada: For Glowing Hearts, (3) Colorado: It’s Our Nature, (4) St. Louis: All Within Reach.
…In §3 we find that, with sufficient scale, character-blind models can achieve near-perfect spelling accuracy. We dub this phenomenon the spelling miracle, to emphasize the difficulty of inferring a token’s spelling from its distribution alone. At the same time, we observe that character-blind text encoders of the sizes used in practice for image generation lack core spelling knowledge.
With this in mind, it is unsurprising that today’s image generation models struggle to translate input tokens into glyph sequences. These models’ text encoders are all character-blind, with Stable Diffusion, DALL·E, DALL·E-2, Imagen, Parti and eDiff-I all adopting BPE tokenizers (Rombach et al., 2021; Ramesh et al., 2021, 2022; Saharia et al., 2022; Yu et al., 2022; Balaji et al., 2022).
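The contrast between the two input representations can be shown in a few lines. This is a toy sketch, not any of these models' actual tokenizers: the subword split is a hypothetical BPE-style segmentation chosen for illustration, while the byte sequence is what a ByT5-style byte-level encoder genuinely receives (UTF-8 bytes, one ID per byte).

```python
word = "exquisite"

# Hypothetical BPE-style segmentation: the word arrives as a few opaque
# subword units, so the model never directly observes its letters.
subwords = ["ex", "quis", "ite"]  # one plausible split, for illustration

# ByT5-style character-aware input: raw UTF-8 bytes expose the exact
# glyph sequence to the encoder.
byte_ids = list(word.encode("utf-8"))

print(subwords)   # ['ex', 'quis', 'ite']
print(byte_ids)   # [101, 120, 113, 117, 105, 115, 105, 116, 101]
```

Under the subword view, knowing that "quis" contains the letters q-u-i-s is a fact the model must memorize from co-occurrence statistics; under the byte view, it is simply present in the input.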
…Secondly, we find that for character-blind models, scale is a key factor in spelling ability. Both T5 and mT5 improve with scale, but even at XXL size they are not particularly strong (e.g., T5-XXL’s accuracy on common English words is only 66%). Only when character-blind models reach PaLM’s scale do we start to see near-perfect spelling ability: PaLM 540B achieves >99% accuracy across all frequency buckets in English, despite seeing only 20 examples in its prompt (as opposed to the 1,000 fine-tuning examples shown to T5). However, its performance is lower on other languages.
Table 1: WikiSpell exact-match accuracy results for English. T5 models range from Base (250M parameters) to XXL (11B), while ByT5 models range from Base (300M) to XXL (13B).
Our experiments on ByT5 show that character-aware models have far greater spelling ability. ByT5’s performance at Base and Large sizes lags only slightly behind XL and XXL (remaining at least in the mid-90% range), and a word’s frequency has little effect on ByT5’s ability to spell it. These results far exceed those of (m)T5, are comparable to the English performance of PaLM, which has over 100× more parameters, and exceed PaLM’s performance on other languages. This indicates that the ByT5 encoder retains substantially more character-level information, and in a form that can be retrieved from its frozen parameters as needed for the decoding task.
Figure 4: Accuracy of 10 image generation models on DrawText Spelling. Character-aware models (ByT5 and Concat) outperform others regardless of size, and particularly on rare words. Imagen-AR shows the benefit of avoiding cropping, but still underperforms character-aware models, despite training 6.6× longer.
Figure 12: Non-cherrypicked samples from our T5-XXL (top) and ByT5-XXL (bottom) models. The character-aware ByT5 model reliably spells the target word correctly, with only minor issues around letter shapes or letter merging. Over 100 samples, we found the character-blind T5 model never produced the target spelling. Prompt: ‘The word “exquisite” written in modern calligraphy.’