I had a closer look at BERT tokens and noticed that "artstation" tokenizes to arts + tation whilst "crungus" splits into cr + ung + us. So, what is trending on "artsungus"?

Aug 9, 2022 · 8:45 AM UTC

Ah, max_length 77 looks like a familiar number. 😉
Replying to @quasimondo
Can you break this down more? What does “artsungus” break down to, arts + ung + us? And what is “crungus”? Is this just an exercise in absurdity? I feel like I’m missing something: wouldn’t “arts ung us” produce the same result as “artsungus”?
Ah - did you miss the crungus explorations by @bruces? I am not a tokenization expert, but AFAIK there is a difference between tokens that represent an entire word and those that represent parts of words. So there should be different embeddings for "artsungus" and "arts ung us".
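That whole-word vs. word-piece distinction can be sketched with a toy greedy longest-match tokenizer in the WordPiece style. The vocabulary below is made up for illustration (it is not BERT's real vocabulary), but it shows why "artsungus" and "arts ung us" land on different token IDs:

```python
# Toy greedy longest-match tokenizer, WordPiece-style.
# VOCAB is illustrative only, not BERT's real vocabulary.
VOCAB = {"arts", "tation", "cr", "ung", "us",
         "##tation", "##ung", "##us"}

def wordpiece(word):
    """Greedily split one word into the longest matching vocab pieces;
    word-internal pieces carry the '##' continuation prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in VOCAB:
                pieces.append(piece)
                break
            end -= 1
        else:
            return ["[UNK]"]  # no vocab piece matched
        start = end
    return pieces

def tokenize(text):
    out = []
    for word in text.split():
        out.extend(wordpiece(word))
    return out

print(tokenize("artsungus"))    # ['arts', '##ung', '##us']
print(tokenize("arts ung us"))  # ['arts', 'ung', 'us']
```

Because "##ung" and "ung" are distinct vocabulary entries, the two spellings map to different embeddings even though the letters are the same.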
Replying to @quasimondo
I thought DALLE-mini used BART's byte-level BPE tokenizer (derived from GPT-2), while BERT uses WordPiece tokenization – or is this a typo? Paging @borisdayma for confirmation.
The most famous word overrides the weight of every other word in the text; this is indeed a serious bias toward "famous" words. Does "Legolas" ring a bell?
Replying to @quasimondo
Also works with the (cr)uggxor tokens. You can get a lot of weirdness out of -ungus, -uggxor, -runcus, and some others. Repeating terms also weirds things out in fun ways.
Replying to @quasimondo
@bruces you might wanna have a look at this