‘LM tokenization’ directory

See Also

Gwern

“GPT-3 Creative Fiction”, Gwern 2020

GPT-3 Creative Fiction

“GPT-3 Nonfiction”, Gwern 2020

GPT-3 Nonfiction

“GPT-2 Folk Music”, Gwern & Presser 2019

GPT-2 Folk Music

Links

“I’d Show All My Online Friends, but I Now Worry They Wouldn’t Get It”, Gwern 2026

I’d show all my online friends, but I now worry they wouldn’t get it

“Which Programming Languages Are Most Token-Efficient?”, Alderson 2026

Which programming languages are most token-efficient?

“No More Retokenization Drift: Returning Token IDs via the OpenAI Compatible API Matters in Agent RL”, Team 2025

No More Retokenization Drift: Returning Token IDs via the OpenAI Compatible API Matters in Agent RL

“When Models Manipulate Manifolds: The Geometry of a Counting Task”, Gurnee et al 2025

When Models Manipulate Manifolds: The Geometry of a Counting Task

“The Dark Arts of Tokenization Or: How I Learned to Start Worrying and Love LLMs’ Undecoded Outputs”, Lovre 2025

The Dark Arts of Tokenization or: How I learned to start worrying and love LLMs’ undecoded outputs

“Let’s Build the GPT Tokenizer: A Complete Guide to Tokenization in LLMs; A Text and Code Version of Karpathy’s Famous Tokenizer Video”, Karpathy & Turgutlu 2025

Let’s Build the GPT Tokenizer: A Complete Guide to Tokenization in LLMs; A text and code version of Karpathy’s famous tokenizer video

View HTML:

https://www.fast.ai/posts/2025-10-16-karpathy-tokenizers.html

“Shorter Tokens Are More Likely [In LLM Sampling]”, Long 2025

Shorter Tokens Are More Likely [in LLM sampling]

“H-Nets: Dynamic Chunking for End-To-End Hierarchical Sequence Modeling”, Hwang et al 2025

H-Nets: Dynamic Chunking for End-to-End Hierarchical Sequence Modeling

“Potemkin Understanding in Large Language Models”, Mancoridis et al 2025

Potemkin Understanding in Large Language Models

“Finding Palindromes With Language Models”, Nichol 2025

Finding Palindromes with Language Models

“The Bitter Lesson Is Coming for Tokenization: a World of LLMs without Tokenization Is Desirable and Increasingly Possible”, Perić 2025

The Bitter Lesson is coming for Tokenization: a world of LLMs without tokenization is desirable and increasingly possible

“Scaling Laws for Gradient Descent and Sign Descent for Linear Bigram Models under Zipf’s Law”, Kunstner & Bach 2025

Scaling Laws for Gradient Descent and Sign Descent for Linear Bigram Models under Zipf’s Law

“[The Letter ‘G’ in ‘Strawberry’]”, Breadd007 2025

[the letter ‘g’ in ‘strawberry’]

“On the Empirical Distribution of Numbers: At Last, Data-Driven Numerology [Parsing The Pile]”, osmarks 2025

On the empirical distribution of numbers: At last, data-driven numerology [parsing The Pile]

“Why Does Claude Speak Byzantine Music Notation?”, Finke 2025

Why does Claude Speak Byzantine Music Notation?

“SuperBPE: Space Travel for Language Models”, Liu et al 2025

SuperBPE: Space Travel for Language Models

“SuperBPE: Space Travel for Language Models [Homepage]”, Liu et al 2025

SuperBPE: Space Travel for Language Models [homepage]

View HTML:

/doc/www/superbpe.github.io/0dac9c89c3ad513b051eace223554c1e2bb95ee4.html

“ByteCraft: Generating Video Games and Animations through Bytes”

ByteCraft: Generating video games and animations through bytes

View HTML:

https://emygervais.github.io/2025/03/15/bytecraft.html

“Inner Thinking Transformer (ITT): Leveraging Dynamic Depth Scaling to Foster Adaptive Internal Thinking”, Chen et al 2025

Inner Thinking Transformer (ITT): Leveraging Dynamic Depth Scaling to Foster Adaptive Internal Thinking

“Do Large Language Model Benchmarks Test Reliability?”, Vendrow et al 2025

Do Large Language Model Benchmarks Test Reliability?

“Language Models Use Trigonometry to Do Addition”, Kantamneni 2025

Language Models Use Trigonometry to Do Addition

“Scaling Embedding Layers in Language Models”, Yu et al 2025

Scaling Embedding Layers in Language Models

“Decoding-Based Regression”, Song & Bahri 2025

Decoding-based Regression

“Over-Tokenized Transformer: Vocabulary Is Generally Worth Scaling”, Huang et al 2025

Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling

“Anomalous Tokens in DeepSeek-V3 & Deep-Seek-R1”, Henry 2025

Anomalous Tokens in DeepSeek-V3 & Deep-Seek-r1

“EvaByte: Efficient Byte-Level Language Models at Scale: Introducing EvaByte, an Efficient and Strong Byte-Level Language Model”, Zheng et al 2025

EvaByte: Efficient Byte-level Language Models at Scale: Introducing EvaByte, an efficient and strong byte-level language model

View HTML:

/doc/www/hkunlp.github.io/0ee1ab2d100f11caa6808098ac2c61531e0b5db0.html

“Llama Goes off the Rails If You Ask It for 5 Odd Numbers That Don’t Have the Letter ‘E’ in Them”, Applemoi 2025

Llama goes off the rails if you ask it for 5 odd numbers that don’t have the letter ‘E’ in them

View HTML:

/doc/www/old.reddit.com/c8b61b73240b58391713012de92ced749b384518.html

“H-Nets—The Past”

H-Nets—the Past

View External Link:

https://goombalab.github.io/blog/2025/hnet-past/

“Tokenization Is NP-Complete”, Whittington et al 2024

Tokenization is NP-Complete

“Byte Latent Transformer (BLT): Patches Scale Better Than Tokens”, Pagnoni et al 2024

Byte Latent Transformer (BLT): Patches Scale Better Than Tokens

“Clio: Privacy-Preserving Insights into Real-World AI Use”, Anthropic 2024

Clio: Privacy-preserving insights into real-world AI use

“Training Large Language Models to Reason in a Continuous Latent Space”, Hao et al 2024

Training Large Language Models to Reason in a Continuous Latent Space

“WHy DoNt YoU JUsT USe ThE LLaMa ToKeNiZeR??”

wHy DoNt YoU jUsT uSe ThE lLaMa ToKeNiZeR??

“The Structure of the Token Space for Large Language Models”, Robinson et al 2024

The structure of the token space for large language models

“When a Language Model Is Optimized for Reasoning, Does It Still Show Embers of Autoregression? An Analysis of OpenAI o1”, McCoy et al 2024

When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1

“MaskBit: Embedding-Free Image Generation via Bit Tokens”, Weber et al 2024

MaskBit: Embedding-free Image Generation via Bit Tokens

“A New Class of Glitch Tokens: BPE Sub-Token Artifacts”

A New Class of Glitch Tokens: BPE Sub-token Artifacts

View HTML:

/doc/www/www.greaterwrong.com/4f4d1a5bc35e58a6ddcb29890185ec949e7945e2.html

“JPEG-LM: LLMs As Image Generators With Canonical Codec Representations”, Han et al 2024

JPEG-LM: LLMs as Image Generators with Canonical Codec Representations

“CARTE: toward Table Foundation Models”, Varoquaux 2024

CARTE: toward table foundation models

“Scaling Laws With Vocabulary: Larger Models Deserve Larger Vocabularies”, Tao et al 2024

Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies

“Token Erasure As a Footprint of Implicit Vocabulary Items in LLMs”, Feucht et al 2024

Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs

“Sonnet or Not, Bot? Poetry Evaluation for Large Models and Datasets”, Walsh et al 2024

Sonnet or Not, Bot? Poetry Evaluation for Large Models and Datasets

“Glyph-ByT5-v2: A Strong Aesthetic Baseline for Accurate Multilingual Visual Text Rendering”, Liu et al 2024

Glyph-ByT5-v2: A Strong Aesthetic Baseline for Accurate Multilingual Visual Text Rendering

“Transformers Can Do Arithmetic With the Right Embeddings”, McLeish et al 2024

Transformers Can Do Arithmetic with the Right Embeddings

“From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step”, Deng et al 2024

From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step

“Zero-Shot Tokenizer Transfer”, Minixhofer et al 2024

Zero-Shot Tokenizer Transfer

“Special Characters Attack: Toward Scalable Training Data Extraction From Large Language Models”, Bai et al 2024

Special Characters Attack: Toward Scalable Training Data Extraction From Large Language Models

“Fishing for Magikarp: Automatically Detecting Under-Trained Tokens in Large Language Models”, Land & Bartolo 2024

Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models

“SpaceByte: Towards Deleting Tokenization from Large Language Modeling”, Slagle 2024

SpaceByte: Towards Deleting Tokenization from Large Language Modeling

“Evaluating Subword Tokenization: Alien Subword Composition and OOV Generalization Challenge”, Batsuren et al 2024

Evaluating Subword Tokenization: Alien Subword Composition and OOV Generalization Challenge

“Why Do Small Language Models Underperform? Studying Language Model Saturation via the Softmax Bottleneck”, Godey et al 2024

Why do small language models underperform? Studying Language Model Saturation via the Softmax Bottleneck

“Training LLMs over Neurally Compressed Text”, Lester et al 2024

Training LLMs over Neurally Compressed Text

“Mechanistic Design and Scaling of Hybrid Architectures”, Poli et al 2024

Mechanistic Design and Scaling of Hybrid Architectures

“Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text Rendering”, Liu et al 2024

Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text Rendering

“Greed Is All You Need: An Evaluation of Tokenizer Inference Methods”, Uzan et al 2024

Greed is All You Need: An Evaluation of Tokenizer Inference Methods

“Beyond Language Models (bGPT): Byte Models Are Digital World Simulators”, Wu et al 2024

Beyond Language Models (bGPT): Byte Models are Digital World Simulators

“Tokenization Is More Than Compression”, Schmidt et al 2024

Tokenization Is More Than Compression

“CARTE: Pretraining and Transfer for Tabular Learning”, Kim et al 2024

CARTE: Pretraining and Transfer for Tabular Learning

“Tokenization Counts: the Impact of Tokenization on Arithmetic in Frontier LLMs”, Singh & Strouse 2024

Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs

“Tasks That Language Models Don’t Learn”, Lee & Lim 2024

Tasks That Language Models Don’t Learn

“Getting the Most out of Your Tokenizer for Pre-Training and Domain Adaptation”, Dagan et al 2024

Getting the most out of your tokenizer for pre-training and domain adaptation

“MambaByte: Token-Free Selective State Space Model”, Wang et al 2024

MambaByte: Token-free Selective State Space Model

“A Long-Context Language Model for the Generation of Bacteriophage Genomes”, Shao 2023

A long-context language model for the generation of bacteriophage genomes

“Diff History for Neural Language Agents”, Piterbarg et al 2023

diff History for Neural Language Agents

“TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering”, Chen et al 2023

TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering

“Positional Description Matters for Transformers Arithmetic”, Shen et al 2023

Positional Description Matters for Transformers Arithmetic

“Strings from the Library of Babel: Random Sampling As a Strong Baseline for Prompt Optimisation”, Lu et al 2023

Strings from the Library of Babel: Random Sampling as a Strong Baseline for Prompt Optimisation

“AnyText: Multilingual Visual Text Generation And Editing”, Tuo et al 2023

AnyText: Multilingual Visual Text Generation And Editing

“EELBERT: Tiny Models through Dynamic Embeddings”, Cohn et al 2023

EELBERT: Tiny Models through Dynamic Embeddings

“ChipNeMo: Domain-Adapted LLMs for Chip Design”, Liu et al 2023

ChipNeMo: Domain-Adapted LLMs for Chip Design

“Learn Your Tokens: Word-Pooled Tokenization for Language Modeling”, Thawani et al 2023

Learn Your Tokens: Word-Pooled Tokenization for Language Modeling

“Tokenizer Choice For LLM Training: Negligible or Crucial?”, Ali et al 2023

Tokenizer Choice For LLM Training: Negligible or Crucial?

“xVal: A Continuous Number Encoding for Large Language Models”, Golkar et al 2023

xVal: A Continuous Number Encoding for Large Language Models

“Think Before You Speak: Training Language Models With Pause Tokens”, Goyal et al 2023

Think before you speak: Training Language Models With Pause Tokens

“Embers of Autoregression: Understanding Large Language Models Through the Problem They Are Trained to Solve”, McCoy et al 2023

Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve

“Subwords As Skills: Tokenization for Sparse-Reward Reinforcement Learning”, Yunis et al 2023

Subwords as Skills: Tokenization for Sparse-Reward Reinforcement Learning

“PASTA: Pretrained Action-State Transformer Agents”, Boige et al 2023

PASTA: Pretrained Action-State Transformer Agents

“GPT-2’s Positional Embedding Matrix Is a Helix”, Yedidia 2023

GPT-2’s positional embedding matrix is a helix

View HTML:

/doc/www/www.greaterwrong.com/25139de537db618aca2c172022da7ba02621c9d2.html

“Sampling at Negative Temperature”, Kauffman 2023

Sampling at negative temperature

View HTML:

/doc/www/cavendishlabs.org/49ce1a21ece7e1ee5ba6e2a7ce696c3194fd2fbe.html

“In-Context Autoencoder for Context Compression in a Large Language Model”, Ge et al 2023

In-context Autoencoder for Context Compression in a Large Language Model

“Teaching Arithmetic to Small Transformers”, Lee et al 2023

Teaching Arithmetic to Small Transformers

“Length Generalization in Arithmetic Transformers”, Jelassi et al 2023

Length Generalization in Arithmetic Transformers

“ChatGPT Is Fun, but It Is Not Funny! Humor Is Still Challenging Large Language Models”, Jentzsch & Kersting 2023

ChatGPT is fun, but it is not funny! Humor is still challenging Large Language Models

“Bytes Are All You Need: Transformers Operating Directly On File Bytes”, Horton et al 2023

Bytes Are All You Need: Transformers Operating Directly On File Bytes

“FERMAT: An Alternative to Accuracy for Numerical Reasoning”, Sivakumar & Moosavi 2023

FERMAT: An Alternative to Accuracy for Numerical Reasoning

“MEGABYTE: Predicting Million-Byte Sequences With Multiscale Transformers”, Yu et al 2023

MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers

“Evaluating Transformer Language Models on Arithmetic Operations Using Number Decomposition”, Muffo et al 2023

Evaluating Transformer Language Models on Arithmetic Operations Using Number Decomposition

“What’s AGI, and Why Are AI Experts Skeptical? ChatGPT and Other Bots Have Revived Conversations on Artificial General Intelligence. Scientists Say Algorithms Won’t Surpass You Any Time Soon”, Rogers 2023

What’s AGI, and Why Are AI Experts Skeptical? ChatGPT and other bots have revived conversations on artificial general intelligence. Scientists say algorithms won’t surpass you any time soon

“BloombergGPT: A Large Language Model for Finance”, Wu et al 2023

BloombergGPT: A Large Language Model for Finance

“How Well Do Large Language Models Perform in Arithmetic Tasks?”, Yuan et al 2023

How well do Large Language Models perform in Arithmetic tasks?

“LLaMa-1: Open and Efficient Foundation Language Models”, Touvron et al 2023

LLaMa-1: Open and Efficient Foundation Language Models

“Language Is Not All You Need: Aligning Perception With Language Models (Kosmos-1)”, Huang et al 2023

Language Is Not All You Need: Aligning Perception with Language Models (Kosmos-1)

“XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models”, Liang et al 2023

XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models

“Language Models Are Better Than Humans at Next-Token Prediction”, Shlegeris et al 2022

Language models are better than humans at next-token prediction

“Character-Aware Models Improve Visual Text Rendering”, Liu et al 2022

Character-Aware Models Improve Visual Text Rendering

“Whisper: Robust Speech Recognition via Large-Scale Weak Supervision”, Radford et al 2022

Whisper: Robust Speech Recognition via Large-Scale Weak Supervision

“NPM: Nonparametric Masked Language Modeling”, Min et al 2022

NPM: Nonparametric Masked Language Modeling

“Fast Inference from Transformers via Speculative Decoding”, Leviathan et al 2022

Fast Inference from Transformers via Speculative Decoding

“Efficient Transformers With Dynamic Token Pooling”, Nawrot et al 2022

Efficient Transformers with Dynamic Token Pooling

“Massively Multilingual ASR on 70 Languages: Tokenization, Architecture, and Generalization Capabilities”, Tjandra et al 2022

Massively Multilingual ASR on 70 Languages: Tokenization, Architecture, and Generalization Capabilities

“LMentry: A Language Model Benchmark of Elementary Language Tasks”, Efrat et al 2022

LMentry: A Language Model Benchmark of Elementary Language Tasks

“n-Gram Is Back: Residual Learning of Neural Text Generation With n-Gram Language Model”, Li et al 2022

n-gram Is Back: Residual Learning of Neural Text Generation with n-gram Language Model

“Help Me Write a Poem: Instruction Tuning As a Vehicle for Collaborative Poetry Writing (CoPoet)”, Chakrabarty et al 2022

Help me write a poem: Instruction Tuning as a Vehicle for Collaborative Poetry Writing (CoPoet)

“DALL·E 2 Is Seeing Double: Flaws in Word-To-Concept Mapping in Text2Image Models”, Rassin et al 2022

DALL·E 2 is Seeing Double: Flaws in Word-to-Concept Mapping in Text2Image Models

“Incorporating Context into Subword Vocabularies”, Yehezkel & Pinter 2022

Incorporating Context into Subword Vocabularies

“Most Language Models Can Be Poets Too: An AI Writing Assistant and Constrained Text Generation Studio”, Roush et al 2022

Most Language Models can be Poets too: An AI Writing Assistant and Constrained Text Generation Studio

“Small Character Models Match Large Word Models for Autocomplete Under Memory Constraints”, Jawahar et al 2022

Small Character Models Match Large Word Models for Autocomplete Under Memory Constraints

“AudioLM: a Language Modeling Approach to Audio Generation”, Borsos et al 2022

AudioLM: a Language Modeling Approach to Audio Generation

“PIXEL: Language Modeling With Pixels”, Rust et al 2022

PIXEL: Language Modeling with Pixels

“N-Grammer: Augmenting Transformers With Latent n-Grams”, Roy et al 2022

N-Grammer: Augmenting Transformers with latent n-grams

“Forecasting Future World Events With Neural Networks”, Zou et al 2022

Forecasting Future World Events with Neural Networks

“SymphonyNet: Symphony Generation With Permutation Invariant Language Model”, Liu et al 2022

SymphonyNet: Symphony Generation with Permutation Invariant Language Model

“FLOTA: An Embarrassingly Simple Method to Mitigate Und-es-ira-ble Properties of Pretrained Language Model Tokenizers”, Hofmann et al 2022

FLOTA: An Embarrassingly Simple Method to Mitigate Und-es-ira-ble Properties of Pretrained Language Model Tokenizers

“DALL·E 2: Hierarchical Text-Conditional Image Generation With CLIP Latents § 7. Limitations and Risks”, Ramesh et al 2022 (page 16, OpenAI)

DALL·E 2: Hierarchical Text-Conditional Image Generation with CLIP Latents § 7. Limitations and Risks

“ByT5 Model for Massively Multilingual Grapheme-To-Phoneme Conversion”, Zhu et al 2022

ByT5 model for massively multilingual grapheme-to-phoneme conversion

“Make-A-Scene: Scene-Based Text-To-Image Generation With Human Priors”, Gafni et al 2022

Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors

“Pretraining without Wordpieces: Learning Over a Vocabulary of Millions of Words”, Feng et al 2022

Pretraining without Wordpieces: Learning Over a Vocabulary of Millions of Words

“A Modest Spelling Reform to Increase Autologicity, Symmetry, and Readability”

A modest spelling reform to increase autologicity, symmetry, and readability

“Between Words and Characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP”, Mielke et al 2021

Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP

“PROMPT WAYWARDNESS: The Curious Case of Discretized Interpretation of Continuous Prompts”, Khashabi et al 2021

PROMPT WAYWARDNESS: The Curious Case of Discretized Interpretation of Continuous Prompts

“OCR-Free Document Understanding Transformer”, Kim et al 2021

OCR-free Document Understanding Transformer

“What Changes Can Large-Scale Language Models Bring? Intensive Study on HyperCLOVA: Billions-Scale Korean Generative Pretrained Transformers”, Kim et al 2021

What Changes Can Large-scale Language Models Bring? Intensive Study on HyperCLOVA: Billions-scale Korean Generative Pretrained Transformers

“Models In a Spelling Bee: Language Models Implicitly Learn the Character Composition of Tokens”, Itzhak & Levy 2021

Models In a Spelling Bee: Language Models Implicitly Learn the Character Composition of Tokens

“Perceiver IO: A General Architecture for Structured Inputs & Outputs”, Jaegle et al 2021

Perceiver IO: A General Architecture for Structured Inputs & Outputs

“Charformer: Fast Character Transformers via Gradient-Based Subword Tokenization”, Tay et al 2021

Charformer: Fast Character Transformers via Gradient-based Subword Tokenization

“ByT5: Towards a Token-Free Future With Pre-Trained Byte-To-Byte Models”, Xue et al 2021

ByT5: Towards a token-free future with pre-trained byte-to-byte models

“Robust Open-Vocabulary Translation from Visual Text Representations”, Salesky et al 2021

Robust Open-Vocabulary Translation from Visual Text Representations

“GPT-3 vs Water Cooler Trivia Participants: A Human vs Robot Showdown”, Waldoch 2021

GPT-3 vs Water Cooler Trivia participants: A Human vs Robot Showdown

“CANINE: Pre-Training an Efficient Tokenization-Free Encoder for Language Representation”, Clark et al 2021

CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation

“There Once Was a Really Bad Poet, It Was Automated but You Didn’t Know It”, Wang et al 2021

There Once Was a Really Bad Poet, It Was Automated but You Didn’t Know It

“Perceiver: General Perception With Iterative Attention”, Jaegle et al 2021

Perceiver: General Perception with Iterative Attention

“Investigating the Limitations of the Transformers With Simple Arithmetic Tasks”, Nogueira et al 2021

Investigating the Limitations of the Transformers with Simple Arithmetic Tasks

“Superbizarre Is Not Superb: Derivational Morphology Improves BERT’s Interpretation of Complex Words”, Hofmann et al 2021

Superbizarre Is Not Superb: Derivational Morphology Improves BERT’s Interpretation of Complex Words

“Fast WordPiece Tokenization”, Song et al 2020

Fast WordPiece Tokenization

“CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters”, Boukkouri et al 2020

CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters

“Towards End-To-End In-Image Neural Machine Translation”, Mansimov et al 2020

Towards End-to-End In-Image Neural Machine Translation

“AI Text Tokenization”, Gwern 2020

AI Text Tokenization

“Unigram LM: Byte Pair Encoding Is Suboptimal for Language Model Pretraining”, Bostrom & Durrett 2020

Unigram LM: Byte Pair Encoding is Suboptimal for Language Model Pretraining

“Generative Language Modeling for Automated Theorem Proving § Experiments”, Polu & Sutskever 2020 (page 11, OpenAI)

Generative Language Modeling for Automated Theorem Proving § Experiments

“OTEANN: Estimating the Transparency of Orthographies With an Artificial Neural Network”, Marjou 2019

OTEANN: Estimating the Transparency of Orthographies with an Artificial Neural Network

“BPE-Dropout: Simple and Effective Subword Regularization”, Provilkov et al 2019

BPE-Dropout: Simple and Effective Subword Regularization

“BERTRAM: Improved Word Embeddings Have Big Impact on Contextualized Model Performance”, Schick & Schütze 2019

BERTRAM: Improved Word Embeddings Have Big Impact on Contextualized Model Performance

“Do NLP Models Know Numbers? Probing Numeracy in Embeddings”, Wallace et al 2019

Do NLP Models Know Numbers? Probing Numeracy in Embeddings

“Generating Text With Recurrent Neural Networks”, Sutskever et al 2011

Generating Text with Recurrent Neural Networks

“SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing”, Kudo & Richardson 2018

SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

“Character-Level Language Modeling With Deeper Self-Attention”, Al-Rfou et al 2018

Character-Level Language Modeling with Deeper Self-Attention

“Deep-Speare: A Joint Neural Model of Poetic Language, Meter and Rhyme”, Lau et al 2018

Deep-speare: A Joint Neural Model of Poetic Language, Meter and Rhyme

“GPT-1: Improving Language Understanding by Generative Pre-Training § Model Specifications”, Radford et al 2018 (page 5)

GPT-1: Improving Language Understanding by Generative Pre-Training § Model specifications

“One Big Net For Everything”, Schmidhuber 2018

One Big Net For Everything

“Breaking the Softmax Bottleneck: A High-Rank RNN Language Model”, Yang et al 2017

Breaking the Softmax Bottleneck: A High-Rank RNN Language Model

“DeepTingle”, Khalifa et al 2017

“Multiplicative LSTM for Sequence Modeling”, Krause et al 2016

Multiplicative LSTM for sequence modeling

“Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation”, Wu et al 2016

Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

“BPEs: Neural Machine Translation of Rare Words With Subword Units”, Sennrich et al 2015

BPEs: Neural Machine Translation of Rare Words with Subword Units

“Scaling Language Models: Methods, Analysis & Insights from Training Gopher § Table A40: Conversations Can Create the Illusion of Creativity”

Scaling Language Models: Methods, Analysis & Insights from Training Gopher § Table A40: Conversations can create the illusion of creativity

View PDF:

/doc/www/arxiv.org/55416474191d68307e7d48b4c4a372b8a43882dc.pdf#page=119&org=deepmind

“Beyond Language Models: Byte Models Are Digital World Simulators [Homepage]”, Wu et al 2026

Beyond Language Models: Byte Models are Digital World Simulators [homepage]

“Commas vs Integers”, Brokman 2026

Commas vs Integers

View HTML:

/doc/www/gist.github.com/14c5d56fb03d1780213417606778ffed377de212.html

“AIGText/Glyph-ByT5: [ECCV2024] This Is an Official Inference Code”, Liu et al 2026

AIGText/Glyph-ByT5: [ECCV2024] This is an official inference code

“bgpt: Beyond Language Models: Byte Models Are Digital World Simulators”, Wu et al 2026

bgpt: Beyond Language Models: Byte Models are Digital World Simulators

“`bgpt` at Main”, Wu et al 2026

bgpt at main

“FineWeb: Decanting the Web for the Finest Text Data at Scale”

FineWeb: decanting the web for the finest text data at scale

“The Bouba/Kiki Effect And Sound Symbolism In CLIP”

The Bouba/Kiki Effect And Sound Symbolism In CLIP

View HTML:

/doc/www/near.blog/7b754d1adedff79bde90b78a60b89f20f46cc3fd.html

“BPE Blues”, Nostalgebraist 2026

“BPE Blues+”, Nostalgebraist 2026

View HTML:

/doc/www/nostalgebraist.tumblr.com/c8cfad2256eba912a8dfa42db9ed33ee917e4775.html

“It’s Owl in the Numbers: Token Entanglement in Subliminal Learning”

It’s Owl in the Numbers: Token Entanglement in Subliminal Learning

View HTML (21MB):

/doc/www/owls.baulab.info/73d15d13c522548da36e0057ad1a23809113f8fd.html

“The Art of Prompt Design: Prompt Boundaries and Token Healing”

The Art of Prompt Design: Prompt Boundaries and Token Healing

“Monitor: An AI-Driven Observability Interface”

Monitor: An AI-Driven Observability Interface

“A Poem Is All You Need: Jailbreaking ChatGPT, Meta & More”

A Poem Is All You Need: Jailbreaking ChatGPT, Meta & More

NineOfNein

Tokens are definitely shorter than English, but the performance even worse. Getting it to explain its thinking, it clearly can’t tell at all which sentences/words sound the same, which is odd, since homonyms tend to have the same letters in Russian…On the other hand strength of the model definitely not as good outside of English.

Sort By Magic

Annotations sorted by machine learning into inferred 'tags'. This provides an alternative way to browse: instead of by date order, one can browse in topic order. The 'sorted' list has been automatically clustered into multiple sections & auto-labeled for easier browsing.

Beginning with the newest annotation, it uses the embedding of each annotation to attempt to create a list of nearest-neighbor annotations, creating a progression of topics. For more details, see the link.
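The ordering described above can be sketched as a greedy nearest-neighbor chain. This is only an illustration under stated assumptions, not the site's actual implementation: the cosine-similarity measure, the function names `cosine` and `sort_by_similarity`, and the toy 2-dimensional embeddings are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def sort_by_similarity(embeddings, start=0):
    """Greedy chain: begin at `start` (assumed to be the newest annotation),
    then repeatedly hop to the most similar annotation not yet visited."""
    if not embeddings:
        return []
    order = [start]
    remaining = set(range(len(embeddings))) - {start}
    while remaining:
        current = embeddings[order[-1]]
        # sorted() makes tie-breaking deterministic (lowest index wins).
        nxt = max(sorted(remaining), key=lambda i: cosine(current, embeddings[i]))
        order.append(nxt)
        remaining.remove(nxt)
    return order


# Toy embeddings (hypothetical): items 0 and 2 share a topic, as do 1 and 3.
toy = [
    [1.0, 0.0],  # 0: newest annotation
    [0.0, 1.0],  # 1
    [0.9, 0.1],  # 2
    [0.1, 0.9],  # 3
]
order = sort_by_similarity(toy, start=0)
print(order)
```

A chain like this keeps similar items adjacent, which is what lets a subsequent clustering step cut the list into labeled sections; how the real system clusters and labels is not specified here.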

`token-complexity network-optimization deep-learning net-integration token-challenge network-theory`

`creative-ai`

`poetry-arithmetic`

Miscellaneous

Bibliography

https://arxiv.org/abs/2507.07955: “H-Nets: Dynamic Chunking for End-To-End Hierarchical Sequence Modeling”, Sukjun Hwang, Brandon Wang, Albert Gu

link-bibliography
https://arxiv.org/abs/2503.13423: “SuperBPE: Space Travel for Language Models”, Alisa Liu, Jonathan Hayase, Valentin Hofmann, Sewoong Oh, Noah A. Smith, Yejin Choi

link-bibliography
https://arxiv.org/abs/2410.08993: “The Structure of the Token Space for Large Language Models”, Michael Robinson, Sourya Dey, Shauna Sweet

link-bibliography
https://arxiv.org/abs/2409.16211#bytedance: “MaskBit: Embedding-Free Image Generation via Bit Tokens”, Mark Weber, Lijun Yu, Qihang Yu, Xueqing Deng, Xiaohui Shen, Daniel Cremers, Liang-Chieh Chen

link-bibliography
https://arxiv.org/abs/2406.20086: “Token Erasure As a Footprint of Implicit Vocabulary Items in LLMs”, Sheridan Feucht, David Atkinson, Byron Wallace, David Bau

link-bibliography
https://arxiv.org/abs/2406.18906: “Sonnet or Not, Bot? Poetry Evaluation for Large Models and Datasets”, Melanie Walsh, Anna Preus, Maria Antoniak

link-bibliography
https://arxiv.org/abs/2406.10208#microsoft: “Glyph-ByT5-V2: A Strong Aesthetic Baseline for Accurate Multilingual Visual Text Rendering”, Zeyu Liu, Weicong Liang, Yiming Zhao, Bohan Chen, Lin Liang, Lijuan Wang, Ji Li, Yuhui Yuan

link-bibliography
https://arxiv.org/abs/2405.14838: “From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step”, Yuntian Deng, Yejin Choi, Stuart Shieber

link-bibliography
https://arxiv.org/abs/2405.07883: “Zero-Shot Tokenizer Transfer”, Benjamin Minixhofer, Edoardo Maria Ponti, Ivan Vulić

link-bibliography
https://arxiv.org/abs/2404.13292: “Evaluating Subword Tokenization: Alien Subword Composition and OOV Generalization Challenge”, Khuyagbaatar Batsuren, Ekaterina Vylomova, Verna Dankers, Tsetsuukhei Delgerbaatar, Omri Uzan, Yuval Pinter, Gábor Bella

https://arxiv.org/abs/2403.17844: “Mechanistic Design and Scaling of Hybrid Architectures”, Michael Poli, Armin W. Thomas, Eric Nguyen, Pragaash Ponnusamy, Björn Deiseroth, Kristian Kersting, Taiji Suzuki, Brian Hie, Stefano Ermon, Christopher Ré, Ce Zhang, Stefano Massaroli

https://arxiv.org/abs/2403.09622#microsoft: “Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text Rendering”, Zeyu Liu, Weicong Liang, Zhanhao Liang, Chong Luo, Ji Li, Gao Huang, Yuhui Yuan

https://arxiv.org/abs/2402.19155: “Beyond Language Models (BGPT): Byte Models Are Digital World Simulators”, Shangda Wu, Xu Tan, Zili Wang, Rui Wang, Xiaobing Li, Maosong Sun

https://arxiv.org/abs/2402.14903: “Tokenization Counts: the Impact of Tokenization on Arithmetic in Frontier LLMs”, Aaditya K. Singh, D. J. Strouse

https://arxiv.org/abs/2402.11349: “Tasks That Language Models Don’t Learn”, Bruce W. Lee, JaeHyuk Lim

https://arxiv.org/abs/2311.16465: “TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering”, Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, Furu Wei

https://arxiv.org/abs/2310.02226: “Think Before You Speak: Training Language Models With Pause Tokens”, Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, Vaishnavh Nagarajan

https://arxiv.org/abs/2307.03381: “Teaching Arithmetic to Small Transformers”, Nayoung Lee, Kartik Sreenivasan, Jason D. Lee, Kangwook Lee, Dimitris Papailiopoulos

https://arxiv.org/abs/2306.00238#apple: “Bytes Are All You Need: Transformers Operating Directly On File Bytes”, Maxwell Horton, Sachin Mehta, Ali Farhadi, Mohammad Rastegari

https://www.wired.com/story/what-is-artificial-general-intelligence-agi-explained/: “What’s AGI, and Why Are AI Experts Skeptical? ChatGPT and Other Bots Have Revived Conversations on Artificial General Intelligence. Scientists Say Algorithms Won’t Surpass You Any Time Soon”, Reece Rogers

https://arxiv.org/abs/2304.02015#alibaba: “How Well Do Large Language Models Perform in Arithmetic Tasks?”, Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang

https://arxiv.org/abs/2212.10562#google: “Character-Aware Models Improve Visual Text Rendering”, Rosanne Liu, Dan Garrette, Chitwan Saharia, William Chan, Adam Roberts, Sharan Narang, Irina Blok, R. J. Mical, Mohammad Norouzi, Noah Constant

https://arxiv.org/abs/2212.04356#openai: “Whisper: Robust Speech Recognition via Large-Scale Weak Supervision”, Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever

https://arxiv.org/abs/2212.01349#facebook: “NPM: Nonparametric Masked Language Modeling”, Sewon Min, Weijia Shi, Mike Lewis, Xilun Chen, Wen-tau Yih, Hannaneh Hajishirzi, Luke Zettlemoyer

https://arxiv.org/abs/2210.13669: “Help Me Write a Poem: Instruction Tuning As a Vehicle for Collaborative Poetry Writing (CoPoet)”, Tuhin Chakrabarty, Vishakh Padmakumar, He He

https://aclanthology.org/2022.cai-1.2.pdf: “Most Language Models Can Be Poets Too: An AI Writing Assistant and Constrained Text Generation Studio”, Allen Roush, Sanjay Basu, Akshay Moorthy, Dmitry Dubovoy

https://arxiv.org/abs/2207.06991: “PIXEL: Language Modeling With Pixels”, Phillip Rust, Jonas F. Lotz, Emanuele Bugliarello, Elizabeth Salesky, Miryam de Lhoneux, Desmond Elliott

https://arxiv.org/abs/2206.15474: “Forecasting Future World Events With Neural Networks”, Andy Zou, Tristan Xiao, Ryan Jia, Joe Kwon, Mantas Mazeika, Richard Li, Dawn Song, Jacob Steinhardt, Owain Evans, Dan Hendrycks

https://aclanthology.org/2022.acl-short.43.pdf: “FLOTA: An Embarrassingly Simple Method to Mitigate Und-Es-Ira-Ble Properties of Pretrained Language Model Tokenizers”, Valentin Hofmann, Hinrich Schütze, Janet Pierrehumbert

https://arxiv.org/pdf/2204.06125#page=16&org=openai: “DALL·E 2: Hierarchical Text-Conditional Image Generation With CLIP Latents § 7. Limitations and Risks”, Aditya A. Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen

https://arxiv.org/abs/2204.03067: “ByT5 Model for Massively Multilingual Grapheme-To-Phoneme Conversion”, Jian Zhu, Cong Zhang, David Jurgens

https://arxiv.org/abs/2203.13131#facebook: “Make-A-Scene: Scene-Based Text-To-Image Generation With Human Priors”, Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, Yaniv Taigman

https://arxiv.org/abs/2108.11193: “Models In a Spelling Bee: Language Models Implicitly Learn the Character Composition of Tokens”, Itay Itzhak, Omer Levy

https://arxiv.org/abs/2107.14795#deepmind: “Perceiver IO: A General Architecture for Structured Inputs & Outputs”, Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira

https://arxiv.org/abs/2106.12672#google: “Charformer: Fast Character Transformers via Gradient-Based Subword Tokenization”, Yi Tay, Vinh Q. Tran, Sebastian Ruder, Jai Gupta, Hyung Won Chung, Dara Bahri, Zhen Qin, Simon Baumgartner, Cong Yu, Donald Metzler

https://arxiv.org/abs/2105.13626#google: “ByT5: Towards a Token-Free Future With Pre-Trained Byte-To-Byte Models”, Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel

https://arxiv.org/abs/2103.03206#deepmind: “Perceiver: General Perception With Iterative Attention”, Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, João Carreira

https://arxiv.org/abs/2012.15524#google: “Fast WordPiece Tokenization”, Xinying Song, Alex Salcianu, Yang Song, Dave Dopson, Denny Zhou

https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf#page=5: “GPT-1: Improving Language Understanding by Generative Pre-Training § Model Specifications”, Alec Radford, Karthik Rajagopal Narasimhan, Tim Salimans, Ilya Sutskever

