
bpe blues

Since the SSC post has got me talking about GPT-3 arithmetic again, I might as well talk about how GPT-2/3’s weird tokenizer interacts with arithmetic.

(GPT-3 keeps the same style of tokenizer as GPT-2, although I’m not clear on whether its chunking was recomputed over the new text corpus.  Even if it was, I’d expect its simple statistical model to converge long before reaching the scale of these big corpora, so there should be few qualitative differences.

Also, I’ll just write “GPT” below to mean the general case.)

—-

For details on the weirdness of the tokenizer, see this post.  Briefly:

- When text is converted into GPT input, characters get chunked together into word-like or morpheme-like pieces of varying length.

- The procedure that breaks text into these chunks uses a dumb/simple statistical method to group characters together if they occur together often enough in real text.  This procedure was done once, before GPT training, and is fixed in stone.

This is its “raw sense data”: to it, text simply is these chunks.  It can’t see down to the characters inside the chunks, so any patterns obscured by the chunking must be memorized as arbitrary facts.  The underlying abstract patterns are literally invisible to GPT.

- The procedure in fact obscures some patterns, to a glaring extent. For example, different ways of capitalizing a word (“hello” vs. “Hello” vs. “HELLO”) show up as completely different “raw sense items,” as different from GPT’s perspective as words in three different languages.

Every generalization from one version to another has to be learned anew: the discovery that “hello” = “Hello” doesn’t help it figure out that “great” = “Great”, and so on (see the sketch after this list).
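
If you want to poke at this yourself, here’s a minimal sketch, assuming the tiktoken library and its “gpt2” encoding (the HuggingFace GPT-2 tokenizer should show the same chunks):

```python
# Sketch: look at GPT-2's BPE chunking directly.
# Assumes the tiktoken library is installed.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

# Different capitalizations of the "same" word come out as unrelated chunk IDs.
for word in [" hello", " Hello", " HELLO"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(repr(word), "->", ids, pieces)
```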

—-

So, how does this apply to numerals?

Let’s look at how GPT sees numbers from 0 to 9999.  (I prepend each numeral with a space because that’s what it will usually see in practice.)

Specifically, let’s count how many tokens (a.k.a. chunks) it makes out of each numeral.  We can imagine a spectrum here, ranging from “every numeral is a single chunk” to “every N-digit numeral is decomposed into its N digits.”
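
Here’s a rough sketch of that counting experiment, again assuming tiktoken’s “gpt2” encoding:

```python
# Sketch: count how many BPE chunks each numeral " 0" .. " 9999" becomes,
# grouped by how many digits it has.  Assumes the tiktoken library.
from collections import Counter

import tiktoken

enc = tiktoken.get_encoding("gpt2")

counts = {d: Counter() for d in (1, 2, 3, 4)}
for n in range(10000):
    n_chunks = len(enc.encode(f" {n}"))   # leading space, as in real text
    counts[len(str(n))][n_chunks] += 1

for digits in sorted(counts):
    c = counts[digits]
    total = sum(c.values())
    breakdown = ", ".join(f"{k} chunk(s): {100 * v / total:.1f}%"
                          for k, v in sorted(c.items()))
    print(f"{digits}-digit numerals -> {breakdown}")
```

Here’s what that kind of count turns up: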

- Each one- and two-digit numeral is a single chunk.  For example, “ 4” happens to be chunk #604 in the arbitrary internal enumeration, and “ 79” happens to be chunk #9225.  So far, so good: this is the “every numeral is a single chunk” approach.

- Among three-digit numbers, 45% are one chunk, and 55% are two chunks.  Huh, that’s weird.  Is there a pattern?

Not that I can see.  The first numeral with two chunks is 362: GPT sees it as “ 3” followed by “62.”  Then we’re back to one chunk until 381 and 382, and … I tried to describe this verbally, but it’s easier to just show it:

[image: a stretch of 3-digit numerals starting at 362, showing how each one is chunked]

Two-chunk numerals become steadily more common as we go up.  Here’s the same kind of data, 100 numerals later:

[image: the same kind of data, about 100 numerals later]

Here we can also see variability in how 3 digits get split into 2 chunks.  Usually you get a pattern like 485 = “ 48” + “5”, but sometimes it’s like 495 = “ 4” + “95.”

Once most numerals are two chunks, there’s kind of a pattern in the 1-chunk holdouts.  Multiples of 100 are 1-chunk for a while, and multiples of 10 are more often 1-chunk.

The first multiple of 100 relegated to two chunks is poor old 2200 (“ 2” + “200”).  For some reason 2400, 2500, and 2600 get to be 1-chunk, but from there on, multiples of 100 are 2-chunk unless they’re also multiples of 1000.  The way that multiples of 100 get gradually 2-chunked repeats some of the trends we saw above with multiples of 1:

[image: multiples of 100, showing how each one is chunked]

Check it out: 2500 is the single four-digit chunk “ 2500”.  3500 is “ 35” followed by “00”.  And 4500 is “ 4” followed by “500”.
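
If you want to see those boundaries directly, the easy trick is to decode each chunk ID back into its piece of text (same assumed tiktoken setup as above):

```python
# Sketch: print the actual chunk pieces for a few of the numerals above.
# Assumes the tiktoken library.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

for n in (485, 495, 2500, 3500, 4500):
    pieces = [enc.decode([i]) for i in enc.encode(f" {n}")]
    print(n, "->", pieces)
```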

As we head further into the 4-digit numerals, we eventually start seeing 3-chunk ones.   The first 3-chunk numeral is (place your bets…) 4761 = “ 4” + “76” + “1” (did you guess it?).  The next is 4791, then 4861, 4862, 4863, then 4881, and so on in another inscrutable integer sequence.

Unlike 2-chunking, though, 3-chunking is consistent about where to split.  It’s always first digit + middle two + last digit.  This holds across the whole range from 4761, the first 4-digit / 3-chunk number, to 9984, the last 4-digit / 3-chunk number.  Among 4-digit numbers overall, 2.0% are 1 chunk, 95.7% are 2 chunks, and 2.4% are 3 chunks.
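
One more sketch, for hunting down the 3-chunk numerals and checking where they split (same assumed tiktoken setup; the specific numerals and percentages above are what I found, not claims about this exact snippet’s output):

```python
# Sketch: find the 4-digit numerals that come out as 3 chunks,
# and check where the splits fall.  Assumes the tiktoken library.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

three_chunk = []
for n in range(1000, 10000):
    pieces = [enc.decode([i]) for i in enc.encode(f" {n}")]
    if len(pieces) == 3:
        three_chunk.append((n, pieces))

print(len(three_chunk), "three-chunk numerals; the first few:", three_chunk[:3])

# Is every split (space + first digit) + (middle two digits) + (last digit)?
print(all(len(a) == 2 and len(b) == 2 and len(c) == 1
          for _, (a, b, c) in three_chunk))
```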

… got that?

—-

What does this mean?  It definitely makes GPT arithmetic look harder to me.  I would have a hard time figuring out this bizarre numeral system myself!

On the other hand, I also thought this sort of problem looked horribly limiting for words, and GPT has done rather famously well in that domain, so … maybe it doesn’t matter, somehow?  But I don’t understand how.

In any case, improving upon BPE would be the first thing on my list if I were able to train a GPT from scratch and wanted to improve its performance.  Even if it didn’t help, that itself would be surprising and fascinating!
