- See Also
- Links
- “Positional Description Matters for Transformers Arithmetic”, Shen et al 2023
- “ChipNeMo: Domain-Adapted LLMs for Chip Design”, Liu et al 2023
- “XVal: A Continuous Number Encoding for Large Language Models”, Golkar et al 2023
- “Embers of Autoregression: Understanding Large Language Models Through the Problem They Are Trained to Solve”, McCoy et al 2023
- “Subwords As Skills: Tokenization for Sparse-Reward Reinforcement Learning”, Yunis et al 2023
- “In-context Autoencoder for Context Compression in a Large Language Model”, Ge et al 2023
- “Teaching Arithmetic to Small Transformers”, Lee et al 2023
- “ChatGPT Is Fun, but It Is Not Funny! Humor Is Still Challenging Large Language Models”, Jentzsch & Kersting 2023
- “Bytes Are All You Need: Transformers Operating Directly On File Bytes”, Horton et al 2023
- “MEGABYTE: Predicting Million-byte Sequences With Multiscale Transformers”, Yu et al 2023
- “Evaluating Transformer Language Models on Arithmetic Operations Using Number Decomposition”, Muffo et al 2023
- “What’s AGI, and Why Are AI Experts Skeptical? ChatGPT and Other Bots Have Revived Conversations on Artificial General Intelligence. Scientists Say Algorithms Won’t Surpass You Any Time Soon”, Rogers 2023
- “BloombergGPT: A Large Language Model for Finance”, Wu et al 2023
- “How Well Do Large Language Models Perform in Arithmetic Tasks?”, Yuan et al 2023
- “Language Is Not All You Need: Aligning Perception With Language Models (Kosmos-1)”, Huang et al 2023
- “LLaMa-1: Open and Efficient Foundation Language Models”, Touvron et al 2023
- “XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models”, Liang et al 2023
- “NPM: Nonparametric Masked Language Modeling”, Min et al 2022
- “Fast Inference from Transformers via Speculative Decoding”, Leviathan et al 2022
- “Efficient Transformers With Dynamic Token Pooling”, Nawrot et al 2022
- “Massively Multilingual ASR on 70 Languages: Tokenization, Architecture, and Generalization Capabilities”, Tjandra et al 2022
- “LMentry: A Language Model Benchmark of Elementary Language Tasks”, Efrat et al 2022
- “n-gram Is Back: Residual Learning of Neural Text Generation With n-gram Language Model”, Li et al 2022
- “Help Me Write a Poem: Instruction Tuning As a Vehicle for Collaborative Poetry Writing (CoPoet)”, Chakrabarty et al 2022
- “Most Language Models Can Be Poets Too: An AI Writing Assistant and Constrained Text Generation Studio”, Roush et al 2022
- “Small Character Models Match Large Word Models for Autocomplete Under Memory Constraints”, Jawahar et al 2022
- “AudioLM: a Language Modeling Approach to Audio Generation”, Borsos et al 2022
- “PIXEL: Language Modelling With Pixels”, Rust et al 2022
- “N-Grammer: Augmenting Transformers With Latent n-grams”, Roy et al 2022
- “Forecasting Future World Events With Neural Networks”, Zou et al 2022
- “SymphonyNet: Symphony Generation With Permutation Invariant Language Model”, Liu et al 2022
- “DALL·E 2: Hierarchical Text-Conditional Image Generation With CLIP Latents § 7. Limitations and Risks”, Ramesh et al 2022 (page 16 org openai)
- “ByT5 Model for Massively Multilingual Grapheme-to-phoneme Conversion”, Zhu et al 2022
- “Make-A-Scene: Scene-Based Text-to-Image Generation With Human Priors”, Gafni et al 2022
- “Pretraining without Wordpieces: Learning Over a Vocabulary of Millions of Words”, Feng et al 2022
- “Prompt Waywardness: The Curious Case of Discretized Interpretation of Continuous Prompts”, Khashabi et al 2021
- “OCR-free Document Understanding Transformer”, Kim et al 2021
- “What Changes Can Large-scale Language Models Bring? Intensive Study on HyperCLOVA: Billions-scale Korean Generative Pretrained Transformers”, Kim et al 2021
- “Models In a Spelling Bee: Language Models Implicitly Learn the Character Composition of Tokens”, Itzhak & Levy 2021
- “Perceiver IO: A General Architecture for Structured Inputs & Outputs”, Jaegle et al 2021
- “Charformer: Fast Character Transformers via Gradient-based Subword Tokenization”, Tay et al 2021
- “ByT5: Towards a Token-free Future With Pre-trained Byte-to-byte Models”, Xue et al 2021
- “Robust Open-Vocabulary Translation from Visual Text Representations”, Salesky et al 2021
- “GPT-3 vs Water Cooler Trivia Participants: A Human vs Robot Showdown”, Waldoch 2021
- “CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation”, Clark et al 2021
- “There Once Was a Really Bad Poet, It Was Automated but You Didn’t Know It”, Wang et al 2021
- “Perceiver: General Perception With Iterative Attention”, Jaegle et al 2021
- “Investigating the Limitations of the Transformers With Simple Arithmetic Tasks”, Nogueira et al 2021
- “Fast WordPiece Tokenization”, Song et al 2020
- “Towards End-to-End In-Image Neural Machine Translation”, Mansimov et al 2020
- “CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters”, Boukkouri et al 2020
- “GPT-3 Nonfiction”, Gwern 2020
- “GPT-3 Creative Fiction”, Gwern 2020
- “Unigram LM: Byte Pair Encoding Is Suboptimal for Language Model Pretraining”, Bostrom & Durrett 2020
- “Generative Language Modeling for Automated Theorem Proving § Experiments”, Polu & Sutskever 2020 (page 11 org openai)
- “OTEANN: Estimating the Transparency of Orthographies With an Artificial Neural Network”, Marjou 2019
- “GPT-2 Folk Music”, Branwen & Presser 2019
- “BPE-Dropout: Simple and Effective Subword Regularization”, Provilkov et al 2019
- “BERTRAM: Improved Word Embeddings Have Big Impact on Contextualized Model Performance”, Schick & Schütze 2019
- “Do NLP Models Know Numbers? Probing Numeracy in Embeddings”, Wallace et al 2019
- “Generating Text With Recurrent Neural Networks”, Sutskever et al 2011
- “SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing”, Kudo & Richardson 2018
- “Character-Level Language Modeling With Deeper Self-Attention”, Al-Rfou et al 2018
- “Deep-speare: A Joint Neural Model of Poetic Language, Meter and Rhyme”, Lau et al 2018
- “GPT-1: Improving Language Understanding by Generative Pre-Training § Model Specifications”, Radford et al 2018 (page 5)
- “One Big Net For Everything”, Schmidhuber 2018
- “DeepTingle”, Khalifa et al 2017
- “Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation”, Wu et al 2016
- “Multiplicative LSTM for Sequence Modelling”, Krause et al 2016
- “BPEs: Neural Machine Translation of Rare Words With Subword Units”, Sennrich et al 2015
- “Scaling Language Models: Methods, Analysis & Insights from Training Gopher § Table A40: Conversations Can Create the Illusion of Creativity”
- “Commas vs Integers”
- “The Bouba/Kiki Effect And Sound Symbolism In CLIP”
- “BPE Blues”
- “BPE Blues+”
- NineOfNein
- Sort By Magic
- Wikipedia
- Miscellaneous
- Link Bibliography
See Also
Links
“Positional Description Matters for Transformers Arithmetic”, Shen et al 2023
“ChipNeMo: Domain-Adapted LLMs for Chip Design”, Liu et al 2023
“XVal: A Continuous Number Encoding for Large Language Models”, Golkar et al 2023
“Embers of Autoregression: Understanding Large Language Models Through the Problem They Are Trained to Solve”, McCoy et al 2023
“Subwords As Skills: Tokenization for Sparse-Reward Reinforcement Learning”, Yunis et al 2023
“In-context Autoencoder for Context Compression in a Large Language Model”, Ge et al 2023
“Teaching Arithmetic to Small Transformers”, Lee et al 2023
“ChatGPT Is Fun, but It Is Not Funny! Humor Is Still Challenging Large Language Models”, Jentzsch & Kersting 2023
“Bytes Are All You Need: Transformers Operating Directly On File Bytes”, Horton et al 2023
“MEGABYTE: Predicting Million-byte Sequences With Multiscale Transformers”, Yu et al 2023
“Evaluating Transformer Language Models on Arithmetic Operations Using Number Decomposition”, Muffo et al 2023
“What’s AGI, and Why Are AI Experts Skeptical? ChatGPT and Other Bots Have Revived Conversations on Artificial General Intelligence. Scientists Say Algorithms Won’t Surpass You Any Time Soon”, Rogers 2023
“BloombergGPT: A Large Language Model for Finance”, Wu et al 2023
“How Well Do Large Language Models Perform in Arithmetic Tasks?”, Yuan et al 2023
“Language Is Not All You Need: Aligning Perception With Language Models (Kosmos-1)”, Huang et al 2023
“LLaMa-1: Open and Efficient Foundation Language Models”, Touvron et al 2023
“XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models”, Liang et al 2023
“NPM: Nonparametric Masked Language Modeling”, Min et al 2022
“Fast Inference from Transformers via Speculative Decoding”, Leviathan et al 2022
“Efficient Transformers With Dynamic Token Pooling”, Nawrot et al 2022
“Massively Multilingual ASR on 70 Languages: Tokenization, Architecture, and Generalization Capabilities”, Tjandra et al 2022
“LMentry: A Language Model Benchmark of Elementary Language Tasks”, Efrat et al 2022
“n-gram Is Back: Residual Learning of Neural Text Generation With n-gram Language Model”, Li et al 2022
“Help Me Write a Poem: Instruction Tuning As a Vehicle for Collaborative Poetry Writing (CoPoet)”, Chakrabarty et al 2022
“Most Language Models Can Be Poets Too: An AI Writing Assistant and Constrained Text Generation Studio”, Roush et al 2022
“Small Character Models Match Large Word Models for Autocomplete Under Memory Constraints”, Jawahar et al 2022
“AudioLM: a Language Modeling Approach to Audio Generation”, Borsos et al 2022
“PIXEL: Language Modelling With Pixels”, Rust et al 2022
“N-Grammer: Augmenting Transformers With Latent n-grams”, Roy et al 2022
“Forecasting Future World Events With Neural Networks”, Zou et al 2022
“SymphonyNet: Symphony Generation With Permutation Invariant Language Model”, Liu et al 2022
“DALL·E 2: Hierarchical Text-Conditional Image Generation With CLIP Latents § 7. Limitations and Risks”, Ramesh et al 2022 (page 16 org openai)
“ByT5 Model for Massively Multilingual Grapheme-to-phoneme Conversion”, Zhu et al 2022
“Make-A-Scene: Scene-Based Text-to-Image Generation With Human Priors”, Gafni et al 2022
“Pretraining without Wordpieces: Learning Over a Vocabulary of Millions of Words”, Feng et al 2022
“Prompt Waywardness: The Curious Case of Discretized Interpretation of Continuous Prompts”, Khashabi et al 2021
“OCR-free Document Understanding Transformer”, Kim et al 2021
“What Changes Can Large-scale Language Models Bring? Intensive Study on HyperCLOVA: Billions-scale Korean Generative Pretrained Transformers”, Kim et al 2021
“Models In a Spelling Bee: Language Models Implicitly Learn the Character Composition of Tokens”, Itzhak & Levy 2021
“Perceiver IO: A General Architecture for Structured Inputs & Outputs”, Jaegle et al 2021
“Charformer: Fast Character Transformers via Gradient-based Subword Tokenization”, Tay et al 2021
“ByT5: Towards a Token-free Future With Pre-trained Byte-to-byte Models”, Xue et al 2021
“Robust Open-Vocabulary Translation from Visual Text Representations”, Salesky et al 2021
“GPT-3 vs Water Cooler Trivia Participants: A Human vs Robot Showdown”, Waldoch 2021
“CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation”, Clark et al 2021
“There Once Was a Really Bad Poet, It Was Automated but You Didn’t Know It”, Wang et al 2021
“Perceiver: General Perception With Iterative Attention”, Jaegle et al 2021
“Investigating the Limitations of the Transformers With Simple Arithmetic Tasks”, Nogueira et al 2021
“Fast WordPiece Tokenization”, Song et al 2020
“Towards End-to-End In-Image Neural Machine Translation”, Mansimov et al 2020
“CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters”, Boukkouri et al 2020
“GPT-3 Nonfiction”, Gwern 2020
“GPT-3 Creative Fiction”, Gwern 2020
“Unigram LM: Byte Pair Encoding Is Suboptimal for Language Model Pretraining”, Bostrom & Durrett 2020
“Generative Language Modeling for Automated Theorem Proving § Experiments”, Polu & Sutskever 2020 (page 11 org openai)
“OTEANN: Estimating the Transparency of Orthographies With an Artificial Neural Network”, Marjou 2019
“GPT-2 Folk Music”, Branwen & Presser 2019
“BPE-Dropout: Simple and Effective Subword Regularization”, Provilkov et al 2019
“BERTRAM: Improved Word Embeddings Have Big Impact on Contextualized Model Performance”, Schick & Schütze 2019
“Do NLP Models Know Numbers? Probing Numeracy in Embeddings”, Wallace et al 2019
“Generating Text With Recurrent Neural Networks”, Sutskever et al 2011
“SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing”, Kudo & Richardson 2018
“Character-Level Language Modeling With Deeper Self-Attention”, Al-Rfou et al 2018
“Deep-speare: A Joint Neural Model of Poetic Language, Meter and Rhyme”, Lau et al 2018
“GPT-1: Improving Language Understanding by Generative Pre-Training § Model Specifications”, Radford et al 2018 (page 5)
“One Big Net For Everything”, Schmidhuber 2018
“DeepTingle”, Khalifa et al 2017
“Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation”, Wu et al 2016
“Multiplicative LSTM for Sequence Modelling”, Krause et al 2016
“BPEs: Neural Machine Translation of Rare Words With Subword Units”, Sennrich et al 2015
“Scaling Language Models: Methods, Analysis & Insights from Training Gopher § Table A40: Conversations Can Create the Illusion of Creativity”
“Commas vs Integers”
“The Bouba/Kiki Effect And Sound Symbolism In CLIP”
“BPE Blues”
“BPE Blues+”
NineOfNein
Sort By Magic
Annotations sorted by machine learning into inferred 'tags'. This provides an alternative way to browse: instead of by date order, one can browse in topic order. The 'sorted' list has been automatically clustered into multiple sections & auto-labeled for easier browsing.
Beginning with the newest annotation, it uses the embedding of each annotation to attempt to create a list of nearest-neighbor annotations, creating a progression of topics. For more details, see the link.
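As a rough illustration of that nearest-neighbor ordering, here is a minimal sketch (an assumption for illustration, not the site’s actual implementation): it presumes each annotation already has a normalized embedding vector, and the hypothetical `nearest_neighbor_order` helper greedily chains each annotation to its most similar unvisited neighbor, starting from the newest one.

```python
# Hypothetical sketch of embedding-based "sort by magic" ordering; assumes the
# annotation embeddings are precomputed and L2-normalized (not the site's real code).
import numpy as np

def nearest_neighbor_order(embeddings: np.ndarray) -> list[int]:
    """Greedy chain: start at index 0 (the newest annotation), then repeatedly
    hop to the most similar annotation not yet visited."""
    n = embeddings.shape[0]
    unvisited = set(range(1, n))
    order = [0]
    while unvisited:
        current = embeddings[order[-1]]
        # For normalized vectors, cosine similarity reduces to a dot product.
        best = max(unvisited, key=lambda i: float(current @ embeddings[i]))
        order.append(best)
        unvisited.remove(best)
    return order

# Toy usage with 5 fake 4-dimensional annotation embeddings.
rng = np.random.default_rng(0)
vecs = rng.normal(size=(5, 4))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
print(nearest_neighbor_order(vecs))  # a permutation of 0-4, beginning with 0
```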
numeracy
general-architecture
llm-applications
transformers
Wikipedia
Miscellaneous
- /doc/ai/nn/tokenization/2023-lee-figure20-naivebpetokenizationbadlydamagesgpt2arithmetictraining.png
- https://blog.research.google/2021/12/a-fast-wordpiece-tokenization-system.html
- https://blog.scottlogic.com/2021/08/31/a-primer-on-the-openai-api-1.html
- https://denyslinkov.medium.com/why-is-gpt-3-15-77x-more-expensive-for-certain-languages-2b19a4adc4bc
- https://gist.github.com/moyix/ca4091f16f0b5011bfa8f3f97f705a0d
- https://github.com/alasdairforsythe/tokenmonster/blob/main/benchmark/pretrain.md
- https://twitter.com/MichaelTrazzi/status/1635743595989970945
- https://twitter.com/arankomatsuzaki/status/1619548480795734016
- https://twitter.com/tomgoldsteincs/status/1601113497592795136
- https://twitter.com/tomgoldsteincs/status/1601113501803552768
- https://twitter.com/tomgoldsteincs/status/1601113505998204928
- https://www.lesswrong.com/posts/8viQEp8KBg2QSW4Yc/solidgoldmagikarp-iii-glitch-token-archaeology
- https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation
- https://www.lesswrong.com/posts/jkY6QdCfAXHJk3kea/the-petertodd-phenomenon
- https://www.reddit.com/r/ChatGPT/comments/12xai7j/spamming_the_word_stop_2300_times_or_probably_any/
- https://www.reddit.com/r/mlscaling/comments/146rgq2/chatgpt_is_running_quantized/jnst1t8/
Link Bibliography
- https://arxiv.org/abs/2307.03381: “Teaching Arithmetic to Small Transformers”, Nayoung Lee, Kartik Sreenivasan, Jason D. Lee, Kangwook Lee, Dimitris Papailiopoulos
- https://arxiv.org/abs/2306.00238#apple: “Bytes Are All You Need: Transformers Operating Directly On File Bytes”, Maxwell Horton, Sachin Mehta, Ali Farhadi, Mohammad Rastegari
- https://www.wired.com/story/what-is-artificial-general-intelligence-agi-explained/: “What’s AGI, and Why Are AI Experts Skeptical? ChatGPT and Other Bots Have Revived Conversations on Artificial General Intelligence. Scientists Say Algorithms Won’t Surpass You Any Time Soon”, Reece Rogers
- https://arxiv.org/abs/2304.02015#alibaba: “How Well Do Large Language Models Perform in Arithmetic Tasks?”, Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang
- https://arxiv.org/abs/2212.01349#facebook: “NPM: Nonparametric Masked Language Modeling”, Sewon Min, Weijia Shi, Mike Lewis, Xilun Chen, Wen-tau Yih, Hannaneh Hajishirzi, Luke Zettlemoyer
- https://arxiv.org/abs/2210.13669: “Help Me Write a Poem: Instruction Tuning As a Vehicle for Collaborative Poetry Writing (CoPoet)”, Tuhin Chakrabarty, Vishakh Padmakumar, He He
- https://aclanthology.org/2022.cai-1.2.pdf: “Most Language Models Can Be Poets Too: An AI Writing Assistant and Constrained Text Generation Studio”, Allen Roush, Sanjay Basu, Akshay Moorthy, Dmitry Dubovoy
- https://arxiv.org/abs/2207.06991: “PIXEL: Language Modelling With Pixels”, Phillip Rust, Jonas F. Lotz, Emanuele Bugliarello, Elizabeth Salesky, Miryam de Lhoneux, Desmond Elliott
- https://arxiv.org/abs/2206.15474: “Forecasting Future World Events With Neural Networks”
- https://arxiv.org/pdf/2204.06125.pdf#page=16&org=openai: “DALL·E 2: Hierarchical Text-Conditional Image Generation With CLIP Latents § 7. Limitations and Risks”, Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen
- https://arxiv.org/abs/2204.03067: “ByT5 Model for Massively Multilingual Grapheme-to-phoneme Conversion”, Jian Zhu, Cong Zhang, David Jurgens
- https://arxiv.org/abs/2203.13131#facebook: “Make-A-Scene: Scene-Based Text-to-Image Generation With Human Priors”, Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, Yaniv Taigman
- https://arxiv.org/abs/2108.11193: “Models In a Spelling Bee: Language Models Implicitly Learn the Character Composition of Tokens”, Itay Itzhak, Omer Levy
- https://arxiv.org/abs/2107.14795#deepmind: “Perceiver IO: A General Architecture for Structured Inputs & Outputs”
- https://arxiv.org/abs/2106.12672#google: “Charformer: Fast Character Transformers via Gradient-based Subword Tokenization”
- https://arxiv.org/abs/2105.13626#google: “ByT5: Towards a Token-free Future With Pre-trained Byte-to-byte Models”, Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel
- https://arxiv.org/abs/2103.03206#deepmind: “Perceiver: General Perception With Iterative Attention”, Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, Joao Carreira
- https://arxiv.org/abs/2102.13019: “Investigating the Limitations of the Transformers With Simple Arithmetic Tasks”, Rodrigo Nogueira, Zhiying Jiang, Jimmy Lin
- https://arxiv.org/abs/2012.15524#google: “Fast WordPiece Tokenization”, Xinying Song, Alex Salcianu, Yang Song, Dave Dopson, Denny Zhou
- gpt-3-nonfiction: “GPT-3 Nonfiction”, Gwern
- gpt-3: “GPT-3 Creative Fiction”, Gwern
- gpt-2-music: “GPT-2 Folk Music”, Gwern Branwen, Shawn Presser
- https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf#page=5: “GPT-1: Improving Language Understanding by Generative Pre-Training § Model Specifications”, Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever