
bpe blues

Since the SSC post has got me talking about GPT-3 arithmetic again, I might as well talk about how GPT-2/3’s weird tokenizer interacts with arithmetic.

(GPT-3 keeps the same style of tokenizer as GPT-2, although I’m not clear on whether its chunking was recomputed over the new text corpus.  Even if it was, I’d expect its simple statistical model to converge long before reaching the scale of these big corpora, so there should be few qualitative differences.

Also, I’ll just write “GPT” below to mean the general case.)

—-

For details on the weirdness of the tokenizer, see this post.  Briefly:

- When text is converted into GPT input, characters get chunked together into word-like or morpheme-like pieces of varying length.

- The procedure that breaks text into these chunks uses a dumb/simple statistical method to group characters together if they occur together often enough in real text.  This procedure was done once, before GPT training, and is fixed in stone.

This is its “raw sense data”: to it, text simply is these chunks.  It can’t see down to the characters inside the chunks, so any patterns obscured by the chunking must be memorized as arbitrary facts.  The underlying abstract patterns are literally invisible to GPT.

- The procedure in fact obscures some patterns, to a glaring extent. For example, different ways of capitalizing a word (“hello” vs. “Hello” vs. “HELLO”) show up as completely different “raw sense items,” as different from GPT’s perspective as words in three different languages.

Every generalization from one version to another has to be learned anew: the discovery that “hello” = “Hello” doesn’t help it figure out that “great” = “Great”, and so on (see the sketch after this list).
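
If you want to poke at this yourself, here’s a minimal sketch, assuming the tiktoken library and its “gpt2” encoding (the HuggingFace GPT-2 tokenizer should show the same chunks):

```python
# Sketch: look at GPT-2's BPE chunking directly.
# Assumes the tiktoken library is installed.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

# Different capitalizations of the "same" word come out as unrelated chunk IDs.
for word in [" hello", " Hello", " HELLO"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(repr(word), "->", ids, pieces)
```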

—-

So, how does this apply to numerals?

Let’s look at how GPT sees numbers from 0 to 9999.  (I prepend each numeral with a space because that’s what it will usually see in practice.)

Specifically, let’s count how many tokens (a.k.a. chunks) it makes out of each numeral.  We can imagine a spectrum here, ranging from “every numeral is a single chunk” to “every N-digit numeral is decomposed into its N digits.”
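
Here’s a rough sketch of that counting experiment, again assuming tiktoken’s “gpt2” encoding:

```python
# Sketch: count how many BPE chunks each numeral " 0" .. " 9999" becomes,
# grouped by how many digits it has.  Assumes the tiktoken library.
from collections import Counter

import tiktoken

enc = tiktoken.get_encoding("gpt2")

counts = {d: Counter() for d in (1, 2, 3, 4)}
for n in range(10000):
    n_chunks = len(enc.encode(f" {n}"))   # leading space, as in real text
    counts[len(str(n))][n_chunks] += 1

for digits in sorted(counts):
    c = counts[digits]
    total = sum(c.values())
    breakdown = ", ".join(f"{k} chunk(s): {100 * v / total:.1f}%"
                          for k, v in sorted(c.items()))
    print(f"{digits}-digit numerals -> {breakdown}")
```

Here’s what that kind of count turns up: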

- Each one- and two-digit numeral is a single chunk.  For example, “ 4” happens to be chunk #604 in the arbitrary internal enumeration, and “ 79” happens to be chunk #9225.  So far, so good: this is the “every numeral is a single chunk” approach.

- Among three-digit numbers, 45% are one chunk, and 55% are two chunks.  Huh, that’s weird.  Is there a pattern?

Not that I can see.  The first numeral with two chunks is 362: GPT sees it as “ 3” followed by “62.”  Then we’re back to one chunk until 381 and 382, and … I tried to describe this verbally, but it’s easier to just show it:

[image: a stretch of 3-digit numerals starting at 362, showing how each one is chunked]

Two-chunk numerals become steadily more common as we go up.  Here’s the same kind of data, 100 numerals later:

[image: the same kind of data, about 100 numerals later]

Here we can also see variability in how 3 digits get split into 2 chunks.  Usually you get a pattern like 485 = “ 48” + “5”, but sometimes it’s like 495 = “ 4” + “95.”

Once most numerals are two chunks, there’s kind of a pattern in the 1-chunk holdouts.  Multiples of 100 are 1-chunk for a while, and multiples of 10 are more often 1-chunk.

The first multiple of 100 relegated to two chunks is poor old 2200 (“ 2” + “200”).  For some reason 2400, 2500, and 2600 get to be 1-chunk, but from there on, multiples of 100 are 2-chunk unless they’re also multiples of 1000.  The way that multiples of 100 get gradually 2-chunked repeats some of the trends we saw above with multiples of 1:

[image: multiples of 100, showing how each one is chunked]

Check it out: 2500 is the single four-digit chunk “ 2500”.  3500 is “ 35” followed by “00”.  And 4500 is “ 4” followed by “500”.
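
If you want to see those boundaries directly, the easy trick is to decode each chunk ID back into its piece of text (same assumed tiktoken setup as above):

```python
# Sketch: print the actual chunk pieces for a few of the numerals above.
# Assumes the tiktoken library.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

for n in (485, 495, 2500, 3500, 4500):
    pieces = [enc.decode([i]) for i in enc.encode(f" {n}")]
    print(n, "->", pieces)
```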

As we head further into the 4-digit numerals, we eventually start seeing 3-chunk ones.   The first 3-chunk numeral is (place your bets…) 4761 = “ 4” + “76” + “1” (did you guess it?).  The next is 4791, then 4861, 4862, 4863, then 4881, and so on in another inscrutable integer sequence.

Unlike 2-chunking, though, 3-chunking is consistent about where to split.  It’s always first digit + middle two + last digit.  This holds across the whole range from 4761, the first 4-digit / 3-chunk number, to 9984, the last 4-digit / 3-chunk number.  Among 4-digit numbers overall, 2.0% are 1 chunk, 95.7% are 2 chunks, and 2.4% are 3 chunks.
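
One more sketch, for hunting down the 3-chunk numerals and checking where they split (same assumed tiktoken setup; the specific numerals and percentages above are what I found, not claims about this exact snippet’s output):

```python
# Sketch: find the 4-digit numerals that come out as 3 chunks,
# and check where the splits fall.  Assumes the tiktoken library.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

three_chunk = []
for n in range(1000, 10000):
    pieces = [enc.decode([i]) for i in enc.encode(f" {n}")]
    if len(pieces) == 3:
        three_chunk.append((n, pieces))

print(len(three_chunk), "three-chunk numerals; the first few:", three_chunk[:3])

# Is every split (space + first digit) + (middle two digits) + (last digit)?
print(all(len(a) == 2 and len(b) == 2 and len(c) == 1
          for _, (a, b, c) in three_chunk))
```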

… got that?

—-

What does this mean?  It definitely makes GPT arithmetic look harder to me.  I would have a hard time figuring out this bizarre numeral system myself!

On the other hand, I also thought this sort of problem looked horribly limiting for words, and GPT has done rather famously well in that domain, so … maybe it doesn’t matter, somehow?  But I don’t understand how.

In any case, improving upon BPE would be the first thing on my list if I were able to train a GPT from scratch and wanted to improve its performance.  Even if it didn’t help, that itself would be surprising and fascinating!
