The structure of the token space for large language models
When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1
JPEG-LM: LLMs as Image Generators with Canonical Codec Representations
Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs
Sonnet or Not, Bot? Poetry Evaluation for Large Models and Datasets
From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step
Special Characters Attack: Toward Scalable Training Data Extraction From Large Language Models
Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models
SpaceByte: Towards Deleting Tokenization from Large Language Modeling
Evaluating Subword Tokenization: Alien Subword Composition and OOV Generalization Challenge
Why do small language models underperform? Studying Language Model Saturation via the Softmax Bottleneck
Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs
Getting the most out of your tokenizer for pre-training and domain adaptation
A long-context language model for the generation of bacteriophage genomes
TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering
Positional Description Matters for Transformers Arithmetic
Learn Your Tokens: Word-Pooled Tokenization for Language Modeling
xVal: A Continuous Number Encoding for Large Language Models
Think before you speak: Training Language Models With Pause Tokens
Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve
Subwords as Skills: Tokenization for Sparse-Reward Reinforcement Learning
In-context Autoencoder for Context Compression in a Large Language Model
ChatGPT is fun, but it is not funny! Humor is still challenging Large Language Models
Bytes Are All You Need: Transformers Operating Directly On File Bytes
FERMAT: An Alternative to Accuracy for Numerical Reasoning
MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers
Evaluating Transformer Language Models on Arithmetic Operations Using Number Decomposition
What’s AGI, and Why Are AI Experts Skeptical? ChatGPT and other bots have revived conversations on artificial general intelligence. Scientists say algorithms won’t surpass you any time soon
How well do Large Language Models perform in Arithmetic tasks?
Language Is Not All You Need: Aligning Perception with Language Models (Kosmos-1)
XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models
Language models are better than humans at next-token prediction
Massively Multilingual ASR on 70 Languages: Tokenization, Architecture, and Generalization Capabilities
LMentry: A Language Model Benchmark of Elementary Language Tasks
n-gram Is Back: Residual Learning of Neural Text Generation with n-gram Language Model
Help me write a poem: Instruction Tuning as a Vehicle for Collaborative Poetry Writing (CoPoet)
DALL·E 2 is Seeing Double: Flaws in Word-to-Concept Mapping in Text2Image Models
Most Language Models can be Poets too: An AI Writing Assistant and Constrained Text Generation Studio
Small Character Models Match Large Word Models for Autocomplete Under Memory Constraints
SymphonyNet: Symphony Generation with Permutation Invariant Language Model
FLOTA: An Embarrassingly Simple Method to Mitigate Und-es-ira-ble Properties of Pretrained Language Model Tokenizers
DALL·E 2: Hierarchical Text-Conditional Image Generation with CLIP Latents § 7. Limitations and Risks
ByT5 model for massively multilingual grapheme-to-phoneme conversion
Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors
Pretraining without Wordpieces: Learning Over a Vocabulary of Millions of Words
Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP
Prompt Waywardness: The Curious Case of Discretized Interpretation of Continuous Prompts
What Changes Can Large-scale Language Models Bring? Intensive Study on HyperCLOVA: Billions-scale Korean Generative Pretrained Transformers
Models In a Spelling Bee: Language Models Implicitly Learn the Character Composition of Tokens
Perceiver IO: A General Architecture for Structured Inputs & Outputs
Charformer: Fast Character Transformers via Gradient-based Subword Tokenization
ByT5: Towards a token-free future with pre-trained byte-to-byte models
Robust Open-Vocabulary Translation from Visual Text Representations
GPT-3 vs Water Cooler Trivia participants: A Human vs Robot Showdown
CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation
There Once Was a Really Bad Poet, It Was Automated but You Didn’t Know It
Investigating the Limitations of the Transformers with Simple Arithmetic Tasks
Superbizarre Is Not Superb: Derivational Morphology Improves BERT’s Interpretation of Complex Words
CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters
Unigram LM: Byte Pair Encoding is Suboptimal for Language Model Pretraining
Generative Language Modeling for Automated Theorem Proving § Experiments
OTEANN: Estimating the Transparency of Orthographies with an Artificial Neural Network
BERTRAM: Improved Word Embeddings Have Big Impact on Contextualized Model Performance
Do NLP Models Know Numbers? Probing Numeracy in Embeddings
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
Character-Level Language Modeling with Deeper Self-Attention
Deep-speare: A Joint Neural Model of Poetic Language, Meter and Rhyme
GPT-1: Improving Language Understanding by Generative Pre-Training § Model specifications
Breaking the Softmax Bottleneck: A High-Rank RNN Language Model
Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
BPEs: Neural Machine Translation of Rare Words with Subword Units
Scaling Language Models: Methods, Analysis & Insights from Training Gopher § Table A40: Conversations Can Create the Illusion of Creativity
FineWeb: Decanting the Web for the Finest Text Data at Scale
The Art of Prompt Design: Prompt Boundaries and Token Healing
"Tokens are definitely shorter than English, but the performance even worse. Getting it to explain its thinking, it clearly can't tell at all which sentences/words sound the same, which is odd, since homonyms tend to have the same letters in Russian... On the other hand, strength of the model definitely not as good outside of English."
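Many of the entries above (Sennrich et al.'s subword-units paper, the Unigram-LM comparison, SentencePiece) trace back to the BPE merge loop. As a reference point, a minimal sketch adapted from the pseudocode in the Sennrich et al. paper, using a toy vocabulary with a `</w>` end-of-word marker:

```python
import re
from collections import Counter

def get_stats(vocab):
    """Count frequencies of adjacent symbol pairs across the vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Replace every occurrence of the pair with a single merged symbol."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: words as space-separated characters, weighted by frequency.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

for _ in range(10):  # learn 10 merge operations
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    print(best)
```

Each iteration greedily merges the most frequent adjacent pair; the learned merge list is the tokenizer, which is why rare strings (foreign scripts, digits, typos) end up chunked arbitrarily.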
2024-01-10, Gwern: screenshot of GPT-4 using IPA software to try to understand a "tomato" pun
Lee 2023, Figure 20: naive BPE tokenization badly damages GPT-2 arithmetic training
Rust 2022, Figure 1: PIXEL architecture for tokenizing text as raw pixels (denoising MAE pretraining)
Liu 2021, Figure 1: character-aware vs. BPE-blinded image generation of text inside an image, demonstrating that character-aware models generate text well
Liu 2021, Figure 12: random samples for writing the word "exquisite" using ByT5 vs. T5, showing ByT5 usually right
Liu 2021, Figure 4: accuracy of 10 image-generation models on drawing text shows ByT5 best
Liu 2021, Table 1: spelling test for ByT5 vs. T5 vs. PaLM shows ByT5 spells much better
Marjou 2019, Figure 3: scatterplot of the mean phonemic transparency scores by reading and writing
Marjou 2019, Table 3: phonemic transparency scores estimated by the OTEANN GPT neural net
Lee, Figure 15: performance of a small transformer trained to do 3-digit subtraction, 2-digit multiplication, and 4-digit-precision sine and square root
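The two arithmetic figures above (naive BPE tokenization damaging GPT-2 arithmetic training; the small-transformer arithmetic benchmark) reflect how BPE chunks integers arbitrarily. A quick way to see this directly, assuming the `tiktoken` package ("r50k_base" is the GPT-2/GPT-3 encoding):

```python
import tiktoken

# Print the BPE pieces of consecutive integers: some are a single token,
# while their immediate neighbors split into two or more pieces, so the
# model never sees a consistent digit-level representation of numbers.
enc = tiktoken.get_encoding("r50k_base")
for n in range(370, 390):
    pieces = [enc.decode([t]) for t in enc.encode(str(n))]
    print(n, pieces)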
https://blog.scottlogic.com/2021/08/31/a-primer-on-the-openai-api-1.html
https://denyslinkov.medium.com/why-is-gpt-3-15-77x-more-expensive-for-certain-languages-2b19a4adc4bc
https://gist.github.com/moyix/ca4091f16f0b5011bfa8f3f97f705a0d
https://github.com/alasdairforsythe/tokenmonster/blob/main/benchmark/pretrain.md
https://research.google/blog/a-fast-wordpiece-tokenization-system/
https://www.beren.io/2023-02-04-Integer-tokenization-is-insane/
https://www.lesswrong.com/posts/8viQEp8KBg2QSW4Yc/solidgoldmagikarp-iii-glitch-token-archaeology
https://www.lesswrong.com/posts/CNPvESPru3XNqsw7A/what-s-up-with-all-the-non-mormons-weirdly-specific
https://www.lesswrong.com/posts/ChtGdxk9mwZ2Rxogt/smartyheadercode-anomalous-tokens-for-gpt3-5-and-gpt-4-1
https://www.lesswrong.com/posts/GyaDCzsyQgc48j8t3/linear-encoding-of-character-level-information-in-gpt-j
https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation
https://www.lesswrong.com/posts/c6uTNm5erRrmyJvvD/mapping-the-semantic-void-strange-goings-on-in-gpt-embedding
https://www.lesswrong.com/posts/dFbfCLZA4pejckeKc/a-mechanistic-explanation-for-solidgoldmagikarp-like-tokens
https://www.lesswrong.com/posts/jkY6QdCfAXHJk3kea/the-petertodd-phenomenon
https://www.lesswrong.com/posts/kmWrwtGE9B9hpbgRT/a-search-for-more-chatgpt-gpt-3-5-gpt-4-unspeakable-glitch
https://www.lesswrong.com/posts/iGuwZTHWb6DFY3sKB/fact-finding-attempting-to-reverse-engineer-factual-recall
https://www.reddit.com/r/ChatGPT/comments/129krsc/what_happened_here_this_is_the_kind_of_censorship/jeqjir3/
https://www.reddit.com/r/ChatGPT/comments/12xai7j/spamming_the_word_stop_2300_times_or_probably_any/
https://www.reddit.com/r/mlscaling/comments/146rgq2/chatgpt_is_running_quantized/jnst1t8/
https://www.technologyreview.com/2024/05/22/1092763/openais-gpt4o-chinese-ai-data/
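Several of the LessWrong links above chronicle "glitch tokens" like SolidGoldMagikarp, and the Magikarp paper earlier in this list automates their detection. A minimal sketch of one simple indicator (embedding rows that sit unusually close to the mean embedding because they were rarely updated during training), assuming the `transformers` and `torch` packages and the public GPT-2 checkpoint:

```python
import torch
from transformers import GPT2Model, GPT2TokenizerFast

model = GPT2Model.from_pretrained("gpt2")
tok = GPT2TokenizerFast.from_pretrained("gpt2")

emb = model.wte.weight.detach()              # (vocab_size, d_model) token embeddings
dist = (emb - emb.mean(dim=0)).norm(dim=1)   # each row's distance from the mean embedding

# The 20 most central tokens: candidates for under-trained/glitch tokens.
for idx in torch.argsort(dist)[:20]:
    print(int(idx), repr(tok.decode([int(idx)])), float(dist[idx]))
```

This is only a heuristic screen; the Magikarp paper combines indicators like this with behavioral prompting tests to confirm which candidates the model actually cannot repeat or spell.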