Bibliography:

  1. AI Text Tokenization

  2. ‘neural net’ tag

  3. ‘CLIP’ tag

  4. ‘masked autoencoder’ tag

  5. ‘language’ tag

  6. GPT-3 Creative Fiction

  7. GPT-3 Nonfiction

  8. AI Text Tokenization

  9. Clio: Privacy-Preserving Insights into Real-World AI Use

  10. The structure of the token space for large language models

  11. When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1

  12. MaskBit: Embedding-free Image Generation via Bit Tokens

  13. A New Class of Glitch Tokens: BPE Sub-Token Artifacts

  14. 4f4d1a5bc35e58a6ddcb29890185ec949e7945e2.html

  15. JPEG-LM: LLMs as Image Generators with Canonical Codec Representations

  16. Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs

  17. Sonnet or Not, Bot? Poetry Evaluation for Large Models and Datasets

  18. From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step

  19. Zero-Shot Tokenizer Transfer

  20. Special Characters Attack: Toward Scalable Training Data Extraction From Large Language Models

  21. Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models

  22. SpaceByte: Towards Deleting Tokenization from Large Language Modeling

  23. Evaluating Subword Tokenization: Alien Subword Composition and OOV Generalization Challenge

  24. Why do small language models underperform? Studying Language Model Saturation via the Softmax Bottleneck

  25. Training LLMs over Neurally Compressed Text

  26. Mechanistic Design and Scaling of Hybrid Architectures

  27. Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs

  28. Tasks That Language Models Don’t Learn

  29. Getting the most out of your tokenizer for pre-training and domain adaptation

  30. MambaByte: Token-free Selective State Space Model

  31. A long-context language model for the generation of bacteriophage genomes

  32. diff History for Neural Language Agents

  33. TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering

  34. Positional Description Matters for Transformers Arithmetic

  35. AnyText: Multilingual Visual Text Generation And Editing

  36. EELBERT: Tiny Models through Dynamic Embeddings

  37. ChipNeMo: Domain-Adapted LLMs for Chip Design

  38. Learn Your Tokens: Word-Pooled Tokenization for Language Modeling

  39. Tokenizer Choice For LLM Training: Negligible or Crucial?

  40. xVal: A Continuous Number Encoding for Large Language Models

  41. Think before you speak: Training Language Models With Pause Tokens

  42. Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve

  43. Subwords as Skills: Tokenization for Sparse-Reward Reinforcement Learning

  44. PASTA: Pretrained Action-State Transformer Agents

  45. In-context Autoencoder for Context Compression in a Large Language Model

  46. Teaching Arithmetic to Small Transformers

  47. Length Generalization in Arithmetic Transformers

  48. ChatGPT is fun, but it is not funny! Humor is still challenging Large Language Models

  49. Bytes Are All You Need: Transformers Operating Directly On File Bytes

  50. FERMAT: An Alternative to Accuracy for Numerical Reasoning

  51. MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers

  52. Evaluating Transformer Language Models on Arithmetic Operations Using Number Decomposition

  53. What’s AGI, and Why Are AI Experts Skeptical? ChatGPT and other bots have revived conversations on artificial general intelligence. Scientists say algorithms won’t surpass you any time soon

  54. BloombergGPT: A Large Language Model for Finance

  55. How well do Large Language Models perform in Arithmetic tasks?

  56. LLaMa-1: Open and Efficient Foundation Language Models

  57. Language Is Not All You Need: Aligning Perception with Language Models (Kosmos-1)

  58. XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models

  59. Language models are better than humans at next-token prediction

  60. Character-Aware Models Improve Visual Text Rendering

  61. NPM: Nonparametric Masked Language Modeling

  62. Fast Inference from Transformers via Speculative Decoding

  63. Efficient Transformers with Dynamic Token Pooling

  64. Massively Multilingual ASR on 70 Languages: Tokenization, Architecture, and Generalization Capabilities

  65. LMentry: A Language Model Benchmark of Elementary Language Tasks

  66. n-gram Is Back: Residual Learning of Neural Text Generation with n-gram Language Model

  67. Help me write a poem: Instruction Tuning as a Vehicle for Collaborative Poetry Writing (CoPoet)

  68. DALL·E 2 is Seeing Double: Flaws in Word-to-Concept Mapping in Text2Image Models

  69. Most Language Models can be Poets too: An AI Writing Assistant and Constrained Text Generation Studio

  70. Small Character Models Match Large Word Models for Autocomplete Under Memory Constraints

  71. AudioLM: a Language Modeling Approach to Audio Generation

  72. PIXEL: Language Modeling with Pixels

  73. N-Grammer: Augmenting Transformers with latent n-grams

  74. Forecasting Future World Events with Neural Networks

  75. SymphonyNet: Symphony Generation with Permutation Invariant Language Model

  76. FLOTA: An Embarrassingly Simple Method to Mitigate Und-es-ira-ble Properties of Pretrained Language Model Tokenizers

  77. DALL·E 2: Hierarchical Text-Conditional Image Generation with CLIP Latents § 7. Limitations and Risks

  78. ByT5 model for massively multilingual grapheme-to-phoneme conversion

  79. Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors

  80. Pretraining without Wordpieces: Learning Over a Vocabulary of Millions of Words

  81. Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP

  82. PROMPT WAYWARDNESS: The Curious Case of Discretized Interpretation of Continuous Prompts

  83. OCR-free Document Understanding Transformer

  84. What Changes Can Large-scale Language Models Bring? Intensive Study on HyperCLOVA: Billions-scale Korean Generative Pretrained Transformers

  85. Models In a Spelling Bee: Language Models Implicitly Learn the Character Composition of Tokens

  86. Perceiver IO: A General Architecture for Structured Inputs & Outputs

  87. Charformer: Fast Character Transformers via Gradient-based Subword Tokenization

  88. ByT5: Towards a token-free future with pre-trained byte-to-byte models

  89. Robust Open-Vocabulary Translation from Visual Text Representations

  90. GPT-3 vs Water Cooler Trivia participants: A Human vs Robot Showdown

  91. CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation

  92. There Once Was a Really Bad Poet, It Was Automated but You Didn’t Know It

  93. Perceiver: General Perception with Iterative Attention

  94. Investigating the Limitations of the Transformers with Simple Arithmetic Tasks

  95. Superbizarre Is Not Superb: Derivational Morphology Improves BERT’s Interpretation of Complex Words

  96. Fast WordPiece Tokenization

  97. CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters

  98. Towards End-to-End In-Image Neural Machine Translation

  99. Unigram LM: Byte Pair Encoding is Suboptimal for Language Model Pretraining

  100. Generative Language Modeling for Automated Theorem Proving § Experiments

  101. OTEANN: Estimating the Transparency of Orthographies with an Artificial Neural Network

  102. GPT-2 Folk Music

  103. BPE-Dropout: Simple and Effective Subword Regularization

  104. BERTRAM: Improved Word Embeddings Have Big Impact on Contextualized Model Performance

  105. Do NLP Models Know Numbers? Probing Numeracy in Embeddings

  106. Generating Text with Recurrent Neural Networks

  107. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

  108. Character-Level Language Modeling with Deeper Self-Attention

  109. Deep-speare: A Joint Neural Model of Poetic Language, Meter and Rhyme

  110. GPT-1: Improving Language Understanding by Generative Pre-Training § Model specifications

  111. One Big Net For Everything

  112. Breaking the Softmax Bottleneck: A High-Rank RNN Language Model

  113. DeepTingle

  114. Multiplicative LSTM for sequence modeling

  115. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

  116. BPEs: Neural Machine Translation of Rare Words with Subword Units

  117. Scaling Language Models: Methods, Analysis & Insights from Training Gopher § Table A40: Conversations Can Create the Illusion of Creativity

  118. 55416474191d68307e7d48b4c4a372b8a43882dc.pdf#page=119&org=deepmind

  119. Commas vs Integers

  120. 14c5d56fb03d1780213417606778ffed377de212.html

  121. FineWeb: Decanting the Web for the Finest Text Data at Scale

  122. The Bouba/Kiki Effect And Sound Symbolism In CLIP

  123. 7b754d1adedff79bde90b78a60b89f20f46cc3fd.html

  124. BPE Blues

  125. BPE Blues+

  126. c8cfad2256eba912a8dfa42db9ed33ee917e4775.html

  127. The Art of Prompt Design: Prompt Boundaries and Token Healing

  128. Monitor: An AI-Driven Observability Interface

  129. A Poem Is All You Need: Jailbreaking ChatGPT, Meta & More

  130. Tokens Are Definitely Shorter Than English, but the Performance Even Worse. Getting It to Explain Its Thinking, It Clearly Can’t Tell at All Which Sentences/words Sound the Same, Which Is Odd, Since Homonyms Tend to Have the Same Letters in Russian...On the Other Hand Strength of the Model Definitely Not As Good outside of English.

  131. design#future-tag-features

  132. 2024-01-10-gwern-gpt4-usingipasoftwaretotrytounderstandatomatopun.png

  133. 2023-lee-figure20-naivebpetokenizationbadlydamagesgpt2arithmetictraining.png

  134. 2022-rust-figure1-pixelarchitecturefortokenizingtextasrawpixelsdenoisingmaepretraining.png

  135. 2021-liu-figure1-characterawarevsbpeblindedimagegenerationoftextinsideanimagedemonstratingthatcharacterawaremodelsgeneratetextwell.png

  136. 2021-liu-figure12-randomsamplesforwritingthewordexquisiteusingbyt5vst5showingbyt5usuallyright.jpg

  137. 2021-liu-figure4-accuracyof10imagegenerationmodelsondrawingtextshowsbyt5best.png

  138. 2021-liu-table1-spellingtestforbyt5vst5vspalmshowsbyt5spellsmuchbetter.png

  139. 2019-marjou-figure3-scatterplotofthemeanphonemictransparencyscoresbyreadingandwriting.png

  140. 2019-marjou-table3-phonemictransparencyscoresestimatedbyoteanngptneuralnet.png

  141. lee-figure15-performanceofsmalltransformertrainedtodo3digitsubtraction2digitmultiplication4digitprecisionsinesquareroot.jpg

  142. https://aclanthology.org/2021.emnlp-main.563.pdf

  143. https://aclanthology.org/D18-1092/

  144. https://amistrongeryet.substack.com/p/can-ai-do-my-job

  145. https://amistrongeryet.substack.com/p/gpt-4-capabilities

  146. https://blog.scottlogic.com/2021/08/31/a-primer-on-the-openai-api-1.html

  147. https://demian.ferrei.ro/blog/chatgpt-sucks-at-pangrams

  148. 1ccfb8e3ba4928af8143d7ecc5dbe7641d16676e.html

  149. https://denyslinkov.medium.com/why-is-gpt-3-15-77x-more-expensive-for-certain-languages-2b19a4adc4bc

  150. 14f723f59f0b5219090c3543148ed99f00a19424.html

  151. https://gist.github.com/moyix/ca4091f16f0b5011bfa8f3f97f705a0d

  152. 88d1b639658f8e19204084800a0baa1107d53292.html

  153. https://github.com/alasdairforsythe/tokenmonster/blob/main/benchmark/pretrain.md

  154. https://github.com/castorini/transformers-arithmetic

  155. https://github.com/google-research/byt5

  156. https://github.com/javirandor/anthropic-tokenizer

  157. 62f345d63d17c0ce55e192bfc3081f798835400e.html

  158. https://github.com/nostalgebraist/improved-diffusion

  159. https://github.com/openai/tiktoken

  160. https://github.com/skeskinen/hf-tokenizer-testing

  161. https://huggingface.co/learn/nlp-course/chapter6/5

  162. https://huggingface.co/learn/nlp-course/chapter6/6

  163. https://huggingface.co/learn/nlp-course/chapter6/7

  164. https://ndingwall.github.io/blog/tokenization

  165. 17152880e016990cb4309ab52b72cd1b86e49e66.html

  166. https://news.ycombinator.com/item?id=39557213

  167. https://paperswithcode.com/method/wordpiece

  168. https://passaglia.jp/gpt-japanese/

  169. 533f2a512ed805696771288c2e27a4f0dd3fb83e.html

  170. https://research.google/blog/a-fast-wordpiece-tokenization-system/

  171. https://spacy.io/

  172. 24a8ed78ea136569eb8d75ac78bf0ff867b68f06.html

  173. https://www.ai21.com/blog/human-or-not-results

  174. 720e6f1aad476d5f5fe5b0ee8d7dcd44c5e83901.html

  175. https://www.beren.io/2023-02-04-Integer-tokenization-is-insane/

  176. d92983a05488ab725f46cc0db7f89d01143251d8.html

  177. https://www.lesswrong.com/posts/8viQEp8KBg2QSW4Yc/solidgoldmagikarp-iii-glitch-token-archaeology

  178. https://www.lesswrong.com/posts/CNPvESPru3XNqsw7A/what-s-up-with-all-the-non-mormons-weirdly-specific

  179. https://www.lesswrong.com/posts/ChtGdxk9mwZ2Rxogt/smartyheadercode-anomalous-tokens-for-gpt3-5-and-gpt-4-1

  180. https://www.lesswrong.com/posts/GyaDCzsyQgc48j8t3/linear-encoding-of-character-level-information-in-gpt-j

  181. https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation

  182. https://www.lesswrong.com/posts/c6uTNm5erRrmyJvvD/mapping-the-semantic-void-strange-goings-on-in-gpt-embedding

  183. https://www.lesswrong.com/posts/dFbfCLZA4pejckeKc/a-mechanistic-explanation-for-solidgoldmagikarp-like-tokens

  184. https://www.lesswrong.com/posts/jkY6QdCfAXHJk3kea/the-petertodd-phenomenon

  185. https://www.lesswrong.com/posts/kmWrwtGE9B9hpbgRT/a-search-for-more-chatgpt-gpt-3-5-gpt-4-unspeakable-glitch

  186. https://www.lesswrong.com/posts/iGuwZTHWb6DFY3sKB/fact-finding-attempting-to-reverse-engineer-factual-recall

  187. https://www.merriam-webster.com/games/twofer-goofer

  188. https://www.reddit.com/r/ChatGPT/comments/129krsc/what_happened_here_this_is_the_kind_of_censorship/jeqjir3/

  189. 59624141afe8728ba1a435e5569fc59e92f38f7b.html

  190. https://www.reddit.com/r/ChatGPT/comments/12xai7j/spamming_the_word_stop_2300_times_or_probably_any/

  191. https://www.reddit.com/r/mlscaling/comments/146rgq2/chatgpt_is_running_quantized/jnst1t8/

  192. https://www.technologyreview.com/2024/05/22/1092763/openais-gpt4o-chinese-ai-data/

  193. https://www.youtube.com/watch?v=rT6wVLEDC_w

  194. https://x.com/DahnJahn/status/1669000659192930304

  195. https://x.com/MichaelTrazzi/status/1635743595989970945

  196. https://x.com/RyanRadia/status/1718619602106659239

  197. https://x.com/Sheikheddy/status/1765445782713385340

  198. https://x.com/arankomatsuzaki/status/1619548480795734016

  199. https://x.com/colin_fraser/status/1635350490484719618

  200. https://x.com/colin_fraser/status/1635360285187018752

  201. https://x.com/colin_fraser/status/1635450606013014016

  202. https://x.com/goodside/status/1666598586346352641

  203. https://x.com/goodside/status/1753192905844592989

  204. https://x.com/goodside/status/1829651283982373143

  205. https://x.com/goodside/status/1836666268767633639

  206. https://x.com/marktenenholtz/status/1787893010753015841

  207. https://x.com/repligate/status/1620949459902529537

  208. https://x.com/retvitr/status/1728934882146242701

  209. https://x.com/rogerkmoore/status/1601937387550031874

  210. https://x.com/suchenzang/status/1697862650053660721

  211. https://x.com/suchenzang/status/1702126326369636631

  212. https://x.com/suchenzang/status/1790171161512587424

  213. https://x.com/thebasepoint/status/1710050231780257882

  214. https://x.com/tianle_cai/status/1790109646205890723

  215. https://x.com/tomgoldsteincs/status/1601113497592795136

  216. https://x.com/tomgoldsteincs/status/1601113501803552768

  217. https://x.com/tomgoldsteincs/status/1601113505998204928

  218. https://x.com/zswitten/status/1390045960663797764

  219. The structure of the token space for large language models

  220. https://arxiv.org/abs/2410.08993

  221. MaskBit: Embedding-free Image Generation via Bit Tokens

  222. https://arxiv.org/abs/2409.16211#bytedance

  223. Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs

  224. https://arxiv.org/abs/2406.20086

  225. Sonnet or Not, Bot? Poetry Evaluation for Large Models and Datasets

  226. https://arxiv.org/abs/2406.18906

  227. From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step

  228. https://arxiv.org/abs/2405.14838

  229. Zero-Shot Tokenizer Transfer

  230. https://arxiv.org/abs/2405.07883

  231. Evaluating Subword Tokenization: Alien Subword Composition and OOV Generalization Challenge

  232. https://arxiv.org/abs/2404.13292

  233. Mechanistic Design and Scaling of Hybrid Architectures

  234. Stefano Ermon

  235. https://arxiv.org/abs/2403.17844

  236. Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs

  237. https://arxiv.org/abs/2402.14903

  238. Tasks That Language Models Don’t Learn

  239. https://arxiv.org/abs/2402.11349

  240. TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering

  241. Furu Wei

  242. https://arxiv.org/abs/2311.16465

  243. Think before you speak: Training Language Models With Pause Tokens

  244. Sanjiv Kumar

  245. https://arxiv.org/abs/2310.02226

  246. Teaching Arithmetic to Small Transformers

  247. https://arxiv.org/abs/2307.03381

  248. Bytes Are All You Need: Transformers Operating Directly On File Bytes

  249. https://arxiv.org/abs/2306.00238#apple

  250. What’s AGI, and Why Are AI Experts Skeptical? ChatGPT and other bots have revived conversations on artificial general intelligence. Scientists say algorithms won’t surpass you any time soon

  251. https://www.wired.com/story/what-is-artificial-general-intelligence-agi-explained/

  252. How well do Large Language Models perform in Arithmetic tasks?

  253. https://arxiv.org/abs/2304.02015#alibaba

  254. Character-Aware Models Improve Visual Text Rendering

  255. William Chan

  256. https://arxiv.org/abs/2212.10562#google

  257. NPM: Nonparametric Masked Language Modeling

  258. Mike Lewis

  259. Hannaneh Hajishirzi—University of Washington

  260. Luke Zettlemoyer

  261. https://arxiv.org/abs/2212.01349#facebook

  262. Help me write a poem: Instruction Tuning as a Vehicle for Collaborative Poetry Writing (CoPoet)

  263. https://arxiv.org/abs/2210.13669

  264. Most Language Models can be Poets too: An AI Writing Assistant and Constrained Text Generation Studio

  265. https://aclanthology.org/2022.cai-1.2.pdf

  266. PIXEL: Language Modeling with Pixels

  267. https://arxiv.org/abs/2207.06991

  268. Forecasting Future World Events with Neural Networks

  269. Andy Zou

  270. Mantas Mazeika

  271. Jacob Steinhardt

  272. Owain Evans, AI Alignment Researcher

  273. https://people.eecs.berkeley.edu/~hendrycks/

  274. https://arxiv.org/abs/2206.15474

  275. FLOTA: An Embarrassingly Simple Method to Mitigate Und-es-ira-ble Properties of Pretrained Language Model Tokenizers

  276. https://aclanthology.org/2022.acl-short.43.pdf

  277. DALL·E 2: Hierarchical Text-Conditional Image Generation with CLIP Latents § 7. Limitations and Risks

  278. Aditya A. Ramesh

  279. Speaker Details: EmTech MIT 2023

  280. https://arxiv.org/pdf/2204.06125#page=16&org=openai

  281. ByT5 model for massively multilingual grapheme-to-phoneme conversion

  282. https://arxiv.org/abs/2204.03067

  283. Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors

  284. https://arxiv.org/abs/2203.13131#facebook

  285. Models In a Spelling Bee: Language Models Implicitly Learn the Character Composition of Tokens

  286. Omer Levy

  287. https://arxiv.org/abs/2108.11193

  288. Perceiver IO: A General Architecture for Structured Inputs & Outputs

  289. https://arxiv.org/abs/2107.14795#deepmind

  290. Charformer: Fast Character Transformers via Gradient-based Subword Tokenization

  291. Yi Tay

  292. https://arxiv.org/abs/2106.12672#google

  293. ByT5: Towards a token-free future with pre-trained byte-to-byte models

  294. Colin Raffel

  295. https://arxiv.org/abs/2105.13626#google

  296. Perceiver: General Perception with Iterative Attention

  297. https://arxiv.org/abs/2103.03206#deepmind

  298. Fast WordPiece Tokenization

  299. https://arxiv.org/abs/2012.15524#google

  300. GPT-2 Folk Music

  301. Gwern.net Homepage

  302. https://x.com/theshawwn

  303. /gpt-2-music

  304. GPT-1: Improving Language Understanding by Generative Pre-Training § Model specifications

  305. Alec Radford

  306. Tim Salimans

  307. https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf#page=5