bpe blues

GPT-2’s tokenizer is … kinda weird.

Sure, it’s defined in a perfectly clear and relatively elegant way: it’s a byte-pair encoding on UTF-8 bytes.  Unlike many NLP tokenizers, it doesn’t have special custom handling for any particular feature of text, like uppercase/lowercase, whitespace, or common English morphology.  It takes a completely generic approach that would work for anything, not just English text or even text.

But, this genericness and simplicity comes at a price: its behavior when applied specifically to English text – which is its only intended application – can leave something to be desired.  After all, there’s a reason why people usually do all those text-specific or language-specific customizations.

To better understand the tokenizer, I made a little script that lets me type in text and then shows me the resulting tokens.  The concrete examples below come from this script.  I have it print out each individual token in both a readable text form and as its index in the vocabulary, like 

(' hello', 23748)
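(If you want to play along at home, here’s a minimal sketch of that kind of script.  I’m assuming the Hugging Face transformers GPT-2 tokenizer here; any faithful GPT-2 BPE implementation should print the same tokens, and this isn’t necessarily the exact code I used.)

# minimal sketch: read a line, print each token as (readable text, vocab index)
from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")

while True:
    text = input("text> ")
    ids = tok.encode(text)
    print([(tok.decode([i]), i) for i in ids])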

Anyway.  How is GPT-2’s tokenizer weird?  Let me count the ways:

1: No special handling for text that differs only in the case (upper vs. lower) of the letters

This means that the model won’t automatically generalize what it knows about the word “hello” to the version “Hello” that occurs at the start of a sentence, or the version “HELLO” that occurs in text that’s all-caps for whatever reason (titles, yelling…)

Thus, GPT-2’s vocabulary contains the English language (or a large subset of it) not once but in several copies: there’s the lowercase version of each word, the capitalized version, the uppercase version, possibly even the GaMzEe-CaSeD version or other rarer variants.

From the model’s perspective, these are totally different universes, disjoint subsets of the vocab that follow their own internal rules.
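(You can check this directly with the sketch tokenizer from above; the particular variants below are just illustrative.)

# case variants of the "same" word come out as different ids (and usually different splits)
from transformers import GPT2TokenizerFast
tok = GPT2TokenizerFast.from_pretrained("gpt2")

for variant in [" hello", " Hello", " HELLO", " hElLo"]:
    print(repr(variant), tok.encode(variant))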

For example, choosing the first word of a sentence in normally-formatted text is not just choosing a word like any other: it’s choosing a Capitalized Word™, and Capitalized Words™ are their own universe.  Insofar as the model understands that the word “Insofar” with which I began this sentence means the exact same thing as the word “insofar” I just used inside it, it understands this by figuring out that these two “seemingly unrelated” things are “secretly” the same.  And it must do that for every single word, separately.

INSOFAR as the model understands the first word of this sentence, capitalized as the first sentences of chapters sometimes are … well, yeah.

(I suspect this is why GPT-2 writes so incoherently whenever it decides to write in all caps.  All-caps text is relatively rare, and to the model it’s a whole different language it’s had to pick up from a few scattered examples.)

While it seems clearly suboptimal, this one isn’t that weird – for better or for worse people commonly do this in NLP.  On the other hand…

2: Spaces glom onto the words after them

BPE tries to be efficient, so it doesn’t waste token slots on spaces if it doesn’t have to.  A word is almost always preceded by a space, so instead of representing “ Example text” as four tokens (space, “Example,” space, “text”), it represents it as two:

[(' Example', 17934), (' text', 2420)]

So far, seems innocuous, right? But what if you’re feeding a prompt into GPT-2? Unless you’re hip to this particular issue, you’ll probably type in something like

“Example text”

which becomes

[('Example', 16281), (' text', 2420)]

Compare this to the one above. Yes – instead of token #17934, with the preceding space, I’ve unwittingly fed in token #16281, without a preceding space.

Previously, we saw there was a “separate copy of English” for each capitalization style. But really, each of those copies is not one but two: the version with preceding space and the one without. And unless you type a space before your prompt, your first word will be in the “no preceding space” language.

The “capitalized word with no preceding space” language is an interesting case. Where does it appear, besides user prompts? IIUC, the most common cause for it is newlines. “\nExample text” tokenizes to

[('\n', 198), ('Example', 16281), (' text', 2420)]
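(Side by side, with the same sketch tokenizer as before:)

# same two words, three leading contexts, three different tokenizations
from transformers import GPT2TokenizerFast
tok = GPT2TokenizerFast.from_pretrained("gpt2")

for text in [" Example text", "Example text", "\nExample text"]:
    ids = tok.encode(text)
    print(repr(text), [(tok.decode([i]), i) for i in ids])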

So your prompt, if it lacks an initial space, looks like the start of a paragraph. Well… that seems fine, actually.

But putting prompts aside, consider what this means: not only does the model learn “words at the starts of sentences” as a separate language, it learns “words at the starts of paragraphs” as another separate language! (Perhaps this has something to do with the tendency of samples to veer off topic around paragraph breaks? IDK. Might also be interesting in connection w/ Gwern’s poetry project.)

3: Words split differently depending on surrounding whitespace

The thing I just said isn’t exactly true. It’s worse than that.

So far I’ve talked like one token = one word. Frequently, that’s true, but BPE can split words in the middle too. This automatically handles some morphology stuff, e.g. “ Rob’s” becomes

[(' Rob', 3851), ("'s", 338)]

and can break down unfamiliar words/“words” into atomic or “molecular” components, e.g. the keysmash “hgsdfahsf” becomes

[('h', 71), ('gs', 14542), ('df', 7568), ('ah', 993), ('sf', 28202)]

But this interacts weirdly, if predictably, with the “separate languages” thing belabored above. There’s no constraint making any one of our copies of English split up in the same way as the other ones, and generally they don’t.

For example, how many tokens is “farther”? Well, if you’re in the “lowercase with preceding space” universe, it’s one:

[(' farther', 18485)]

If you’re in the caps-and-preceding-space universe (sentence start within paragraph), though:

[(' F', 376), ('art', 433), ('her', 372)]

Likewise in the caps-without-preceding-space universe (paragraph start), except of course we begin with “F” rather than the totally distinct “ F”:

[('F', 37), ('art', 433), ('her', 372)]

And if for some weird reason you’re lowercase but don’t have a preceding space:

[('f', 69), ('art', 433), ('her', 372)]
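(Here are all four universes at once, again with the sketch tokenizer from above:)

# "farther" in each of the four case/space universes
from transformers import GPT2TokenizerFast
tok = GPT2TokenizerFast.from_pretrained("gpt2")

for text in [" farther", " Farther", "Farther", "farther"]:
    ids = tok.encode(text)
    print(repr(text), [(tok.decode([i]), i) for i in ids])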

From the perspective of text generation, this is super weird. Remember, we generate one token at a time. So when we start a sentence with the word “farther,” we don’t just pick out that word, the way we would inside a sentence. Instead we decide:

  • the sentence should start with " F". N.B. this is not the same as deciding “the sentence begins with the letter F”, as it rules out things like (' For', 1114), which is common enough to be a single token
  • the sentence should continue with "art"
  • the sentence should continue with "her" (as opposed to being a sentence about farting, or something – until this choice, we could have been writing that sentence!)

I do wonder how much better GPT-2 text generation could be, and how much less gigantic the model could be, if this stuff were a little friendlier.
