# Reinforcement Learning for Grade-Schoolers
I would like you to write a tutorial explaining reinforcement learning to grade-schoolers (titled "Reinforcement Learning for Grade-Schoolers") who are trying to understand contemporary deep learning like ChatGPT. It should cover off-policy vs on-policy, model-based vs model-free, value-based vs policy-based. It should explain evolution strategies, REINFORCE, PPO/GRPO, RLHF, AlphaZero, and tabular Q-learning, at a minimum.
A list of RL concepts to help plan a list of topics is attached below.
An example of the kind of explanation I am looking for:
"A brain is made of neurons. A neuron is connected to other neurons by many wires, which send electric shocks. If there are enough shocks into a neuron, a neuron will send out a shock too. How does a neuron know what is enough? We can make them learn, by trial and error. One way is evolution. This is a model-free, on-policy learning algorithm. We copy a brain many times, and we wiggle some of the neurons. Then we have each brain play a game. They win or lose. We keep all the brains that won, and none that lost. Then we make more copies, and wiggle their neurons too, and play again. We do this until they learn how to win. We could do it another way. In REINFORCE, which is also model-free on-policy, a single brain plays many games. In each game the brain wins, all the neurons that fired get strengthened and lower their 'enough'; in each game the brain loses, all the neurons that fired get weakened and so raise their 'enough'. The neuron might have done nothing, or it might have done a good thing in a game the brain lost, or it might have done a bad thing in a game the brain won. It doesn't matter. We weaken or strengthen them anyway. This may sound bad, but on average it works. We can try to make it less bad, in a way called "actor-critic". We can have a second brain (critic) which we make watch the first brain (actor) play a game. We train the critic brain to play a different game: predict if the actor brain will win or lose. It wins or loses if its prediction is right or wrong. Once it can do that, we change REINFORCE: now we wait for games where the actor brain won but the critic brain predicted it would lose; then we strengthen the neurons in the actor brain that fired. And when it was the other way around, we weakened the neurons in the actor brain. This rewards neurons which did 'better than expected' and punishes neurons which did 'worse than expected'."
The exact writing style for the final explanations is a controlled English dialect format called "Grow-Speech", which requires all multi-syllabic words to be defined before use; a detailed specification is attached below.
Aim for concrete, kid-friendly images, and an Upgoer Five XKCD-esque dry humor.
I have appended previous drafts by GPT-5.5, Claude, Gemini, and DeepSeek to help. Please combine all the best parts. (The Claude one is probably the best so far.)
To write this tutorial, please write out ideas for topics to cover and how to explain them simply and concretely.
Write out a rigorous academic explanation, then a normal explanation, then an ELI5 explanation.
Critique and revise each explanation.
Then write a Table of Contents that puts them in a logical order; revise the Table of Contents for the best pedagogical sequence.
Then write out a full version going topic by topic, with each of the 3 explanations per topic.
Then, take the ELI5 explanations, and turn them into Grow-Speech one by one in order (accumulating definitions on the way).
Check the Grow-Speech scansion and correctness.
Then critique and revise the Grow-Speech version.
Finally, print out a clean complete tutorial "Reinforcement Learning for Grade-Schoolers" essay in Markdown Grow-Speech.
A list of RL concepts to help plan a list of topics:
> _Reinforcement Learning: An Introduction_, Richard S. Sutton & Andrew G. Barto 2018: Table of Contents: I Tabular Solution Methods / 2 Multi-armed Bandits / 2.1 A k-armed Bandit Problem / 2.2 Action-value Methods / 2.3 The 10-armed Testbed / 2.4 Incremental Implementation / 2.5 Tracking a Non-stationary Problem / 2.6 Optimistic Initial Values / 2.7 Upper-Confidence-Bound Action Selection / 2.8 Gradient Bandit Algorithms / 2.9 Associative Search (Contextual Bandits) / 2.10 Summary / 3 Finite Markov Decision Processes / 3.1 The Agent--Environment Interface / 3.2 Goals and Rewards / 3.3 Returns and Episodes / 3.4 Unified Notation for Episodic and Continuing Tasks / 3.5 Policies and Value Functions / 3.6 Optimal Policies and Optimal Value Functions / 3.7 Optimality and Approximation / 3.8 Summary / 4 Dynamic Programming / 4.1 Policy Evaluation (Prediction) / 4.2 Policy Improvement / 4.3 Policy Iteration / 4.4 Value Iteration / 4.5 Asynchronous Dynamic Programming / 4.6 Generalized Policy Iteration / 4.7 Efficiency of Dynamic Programming / 4.8 Summary / 5 Monte Carlo Methods / 5.1 Monte Carlo Prediction / 5.2 Monte Carlo Estimation of Action Values / 5.3 Monte Carlo Control / 5.4 Monte Carlo Control without Exploring Starts / 5.5 Off-policy Prediction via Importance Sampling / 5.6 Incremental Implementation / 5.7 Off-policy Monte Carlo Control / 5.8 *Discounting-aware Importance Sampling / 5.9 *Per-decision Importance Sampling / 5.10 Summary / 6 Temporal-difference Learning / 6.1 TD Prediction / 6.2 Advantages of TD Prediction Methods / 6.3 Optimality of TD(0) / 6.4 Sarsa: On-policy TD Control / 6.5 Q-learning: Off-policy TD Control / 6.6 Expected Sarsa / 6.7 Maximization Bias and Double Learning / 6.8 Games, Afterstates, and Other Special Cases / 6.9 Summary / 7 n-step Bootstrapping / 7.1 n-step TD Prediction / 7.2 n-step Sarsa / 7.3 n-step Off-policy Learning / 7.4 *Per-decision Methods with Control Variates / 7.5 Off-policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm / 7.6 *A Unifying Algorithm: n-step Q(σ) / 7.7 Summary / 8 Planning and Learning with Tabular Methods / 8.1 Models and Planning / 8.2 Dyna: Integrated Planning, Acting, and Learning / 8.3 When the Model Is Wrong / 8.4 Prioritized Sweeping / 8.5 Expected versus Sample Updates / 8.6 Trajectory Sampling / 8.7 Real-time Dynamic Programming / 8.8 Planning at Decision Time / 8.9 Heuristic Search / 8.10 Rollout Algorithms / 8.11 Monte Carlo Tree Search / 8.12 Summary of the Chapter / 8.13 Summary of Part I: Dimensions / II Approximate Solution Methods / 9 On-policy Prediction with Approximation / 9.1 Value-function Approximation / 9.2 The Prediction Objective (VE) / 9.3 Stochastic-gradient and Semi-gradient Methods / 9.4 Linear Methods / 9.5 Feature Construction for Linear Methods / 9.5.1 Polynomials / 9.5.2 Fourier Basis / 9.5.3 Coarse Coding / 9.5.4 Tile Coding / 9.5.5 Radial Basis Functions / 9.6 Selecting Step-Size Parameters Manually / 9.7 Nonlinear Function Approximation: Artificial Neural Networks / 9.8 Least-Squares TD / 9.9 Memory-based Function Approximation / 9.10 Kernel-based Function Approximation / 9.11 Looking Deeper at On-policy Learning: Interest and Emphasis / 9.12 Summary / 10 On-policy Control with Approximation / 10.1 Episodic Semi-gradient Control / 10.2 Semi-gradient n-step Sarsa / 10.3 Average Reward: A New Problem Setting for Continuing Tasks / 10.4 Deprecating the Discounted Setting / 10.5 Differential Semi-gradient n-step Sarsa / 10.6 Summary / 11 *Off-policy Methods with
Approximation / 11.1 Semi-gradient Methods / 11.2 Examples of Off-policy Divergence / 11.3 The Deadly Triad / 11.4 Linear Value-function Geometry / 11.5 Gradient Descent in the Bellman Error / 11.6 The Bellman Error is Not Learnable / 11.7 Gradient-TD Methods / 11.8 Emphatic-TD Methods / 11.9 Reducing Variance / 11.10 Summary / 12 Eligibility Traces / 12.1 The λ-return / 12.2 TD(λ) / 12.3 n-step Truncated λ-return Methods / 12.4 Redoing Updates: Online λ-return Algorithm / 12.5 True Online TD(λ) / 12.6 *Dutch Traces in Monte Carlo Learning / 12.7 Sarsa(λ) / 12.8 Variable λ and γ / 12.9 Off-policy Traces with Control Variates / 12.10 Watkins’s Q(λ) to Tree-Backup(λ) / 12.11 Stable Off-policy Methods with Traces / 12.12 Implementation Issues / 12.13 Conclusions / 13 Policy Gradient Methods / 13.1 Policy Approximation and its Advantages / 13.2 The Policy Gradient Theorem / 13.3 REINFORCE: Monte Carlo Policy Gradient / 13.4 REINFORCE with Baseline / 13.5 Actor--Critic Methods / 13.6 Policy Gradient for Continuing Problems / 13.7 Policy Parameterization for Continuous Actions / 13.8 Summary / III Looking Deeper / 14 Psychology / 14.1 Prediction and Control / 14.2 Classical Conditioning / 14.2.1 Blocking and Higher-order Conditioning / 14.2.2 The Rescorla--Wagner Model / 14.2.3 The TD Model / 14.2.4 TD Model Simulations / 14.3 Instrumental Conditioning / 14.4 Delayed Reinforcement / 14.5 Cognitive Maps / 14.6 Habitual and Goal-directed Behavior / 14.7 Summary / 15 Neuroscience / 15.1 Neuroscience Basics / 15.2 Reward Signals, Reinforcement Signals, Values, and Prediction Errors / 15.3 The Reward Prediction Error Hypothesis / 15.4 Dopamine / 15.5 Experimental Support for the Reward Prediction Error Hypothesis / 15.6 TD Error/Dopamine Correspondence / 15.7 Neural Actor--Critic / 15.8 Actor and Critic Learning Rules / 15.9 Hedonistic Neurons / 15.10 Collective Reinforcement Learning / 15.11 Model-based Methods in the Brain / 15.12 Addiction / 15.13 Summary / 16 Applications and Case Studies / 16.1 TD-Gammon / 16.2 Samuel’s Checkers Player / 16.3 Watson’s Daily-Double Wagering / 16.4 Optimizing Memory Control / 16.5 Human-level Video Game Play / 16.6 Mastering the Game of Go / 16.6.1 AlphaGo / 16.6.2 AlphaGo Zero / 16.7 Personalized Web Services / 16.8 Thermal Soaring / 17 Frontiers / 17.1 General Value Functions and Auxiliary Tasks / 17.2 Temporal Abstraction via Options / 17.3 Observations and State / 17.4 Designing Reward Signals / 17.5 Remaining Issues / 17.6 Reinforcement Learning and the Future of Artificial Intelligence
https://gwern.net/grow-speech
~~~~{.Markdown}
---
title: Grow-Speech Specification
author: Gwern, GPT-5.5 Pro, Claude-4.7-opus
description: "Formal specification of Guy L. Steele’s 1998 controlled-vocabulary English writing system, which requires all multi-syllabic core words to be explicitly defined before use. Intended for LLM use."
created: 2026-05-06
modified: 2026-05-07
status: finished
confidence: log
importance: 1
css-extension: dropcaps-kanzlei
...
In **Grow-Speech**, invented by [Guy L. Steele](!W) [in 1998](/doc/cs/algorithm/1998-steele.pdf "‘Growing a Language [Grow-Speech]’, Steele 1998"), a writer begins with only one-syllable English words; each longer core word becomes legal only after a sentence defines it using words already legal at that point.
This page names this format "Grow-Speech", and turns it from a lecture stunt into a well-defined controlled-English format suitable for LLM prompting.
The system is stricter than [Up Goer Five](https://xkcd.com/1133/ "‘Up Goer Five’, Munroe 2012") or [Basic English](!W), because the vocabulary is not a fixed external list.
It is grown left-to-right by the prose itself, so every abstraction must pay its own local cost before it can be used.
This specification resolves the practical edge cases: hyphenated terms, inflections, derivations, proper nouns, acronyms, numerals, raw quoted text, markup, and syllable-count ambiguity.
It also gives scansion markup for auditing a passage, separating live prose from inert quotes and recording each definition in order.
The result is both a writing constraint and a test of LLMs: Grow-Speech exposes whether a model can maintain a changing set of primitives across prose passages, rather than merely explain the rule in words that violate it.
# The Grow-Speech Rule
In the constrained span, each visible word must be legal at the point where it appears.
At the start, a visible word is legal if its spoken English form has one beat in the chosen pronunciation source.
A core word with two or more beats becomes legal only after it has been defined with legal words.
A definition may use the new word only as the marked headword, as in “By **define** I mean to set down what a word will mean from then on.”
The right-hand side of the definition must use only words that were legal before the definition began.
The new word becomes legal after the defining sentence ends.
# Details
- **"Word" definition**: A styled or hyphenated term may be introduced as one marked headword, as with "Grow-Speech" itself.
Otherwise, hyphenated forms are read by their spoken parts.
- Punctuation and markup syntax (eg. a class tag like `{.HTML}`) do not count as words.
- Proper nouns, acronyms, [link anchor](!W) text, code names, and foreign words receive no special pass; they count by their spoken English form or must be defined.
- **Inflection versus derivation**: A core word is the ordinary dictionary headword behind a word form.
Grow-Speech constrains core words, not every surface form.
So inflected forms are legal if their core word is legal.
Derived words and fixed compounds are new core words unless the writer has already grown them or unless the compound is merely a transparent phrase made of legal words.
A legal word also licenses its ordinary grammatical forms for any later use, including in definitions: plural, possessive, tense, aspect, person, participle, and comparison forms.
If you have defined "write", you may then use "writes", "wrote", "written", "writing" etc.; but if you have defined "nation", you have not yet defined words such as "national", "international", "nationalism".
In close cases, use a normal dictionary headword test: if the form is merely listed under the same headword, it is the same core word; if it has its own headword or technical sense, treat it as a new word.
- **Definition form**: Introduce a multi-syllable word with a sentence like "When I say **X**, I mean ..." or "Let **X** stand for ..."—note that the framing words ("when", "I", "say", "mean", "is", "let", "stand", "for") are all monosyllabic, which is why this template works.
Write the new word in bold.
The right-hand side must use only words already permitted at that point.
Once defined, the word stays available for the rest of the response.
The headword may appear as the marked word being defined.
The right-hand side of the definition must not use the headword unless it was already legal before the definition began.
The new word becomes legal only after the defining sentence ends.
The sole free use inside a definition is the marked headword itself.
- *Self-reference trap*: The word "syllable" is itself 3 syllables, so a response that tries to "state" the rule has to first define it (eg. as "beat" or "the pulse of a word").
And likewise "monosyllable", "definition", "vocabulary".
An LLM that pattern-matches to "explaining the rules" without first growing the words to do so will fail on its own first sentence.
- **Markup format**: Grow-Speech applies to all text visible to the reader inside the constrained target range.
Thus, markup punctuation is ignored. Machine-readable strings—URL targets, filenames, code-block contents—are also exempt as non-prose.
But reader-visible text counts: headings, link anchor text or image captions all obey the rule unless explicitly marked inert.
- **Proper nouns**: A proper noun may be defined like any other word: “By **Paris** I mean the town in France.”
A name made of several words is legal only if each word is legal or the full name is introduced as a marked term.
- **Quoted matter**: Quoted text may be marked as inert.
Inert quoted text need not obey Grow-Speech, but none of its words become legal, and no definitions inside it count.
The prose that introduces, frames, explains, or comments on the quote must still obey the rule.
Quotation marks alone do not make text inert.
A live prose cue such as “he said”, “she wrote”, “this raw name”, “this sign says”, or “this code” can mark the next quoted span as inert, if context makes clear that the span is outside text.
(Even if the quoted text itself defines a word, that doesn't count until the author defines it normally.)
This is based on context; for a more rigorous treatment of raw text, you can use [the Grow-Speech scansion DSL](#scansion) to mark a range as raw.
- **Syllable counting**: By spoken English pronunciation, not by spelling.
Details:
- Syllable *examples*: "through" = 1, "scythe" = 1, "strengths" = 1, "rhythm" = 2
- *Contractions* count their spoken form: "don't" = 1, "I'm" = 1, "wouldn't" = 2, "doesn't" = 2.
Don't smuggle in illegal words via contraction.
Common contraction traps: "doesn't" (2), "wouldn't" (2), "couldn't've" (3), "shouldn't've" (3).
When in doubt, expand to monosyllables: "did not", "would not", "could not have".
- *Compound and hyphenated* words sum their parts: "blackboard" = 2, "well-known" = 2.
A hyphenated form is treated as its spoken form; when in doubt, count the parts.
- *Acronyms* read as letters cost 1--3 syllables per letter ("LLM" = 3, "FBI" = 3); acronyms read as words count by pronunciation ("NASA" = 2, "RAM" = 1).
Acronyms read as individual spoken letters must count as the sum of the syllable counts of the spoken letter names.
Thus "LLM" is 3 beats ('L-L-M') and "WWW" is not 3 but 9, but 'RAM' (pronounced 'ram') is just 1.
- *Numerals*: digits and digit strings and Roman numerals are exempt, the same way URL targets and code spans are exempt.
So `1998`, `Page 47`, `v3.2.1`, `c. 1300 BC` all pass without growth.
If a number is *spelled out* in words (`nineteen ninety-eight`, `forty-seven`), the words count by their spoken form: `one/two/three/eight/ten/twelve` = 1, `seven` = 2, `eleven` = 3, etc.
Prefer digits for years, dates, page references, versions, and citations; reach for word form only when the number is a true lexical part of the sentence (`a third of the time`, `four years ago`).
Because spelled-out number words are ordinary prose words, and because they blur into idioms and fixed phrases, they are not exempt by default.
(An author who wants them for some reason, such as esthetics, can make an explicit exception here.)
- *Proper nouns* get no free pass—either they fit ("Bach", "York", "Smith", "Marx") or they must be defined.
- [Citation exception]{.smallcaps}: we exempt surnames in formal citations to avoid an explosion of biographical details (which may be quite burdensome for an author to even learn).
Thus, it is fine to say "Schmidhuber 1991" without any definition, but if one talks about "Schmidhuber", one should define him ("Schmidhuber is a Swiss AI researcher...").
- *Foreign* words count by their English pronunciation.
- *Ambiguity* favors the author: for example, words with variable syllabification ("fire", "hour", "real") count as one syllable if a current monosyllabic pronunciation exists in standard speech.
- *Checking*: If still in doubt, you may consult the [CMU Pronouncing Dictionary (`cmudict`)](https://en.wikipedia.org/wiki/CMU_Pronouncing_Dictionary).
If a word has a variant with 1 beat in cmudict, it is legal as 1-beat. If all variants have 2+ beats, it requires definition.
- **Recursive consequences**: A definition must not use the headword in its right-hand side unless the headword was already legal.
After the defining sentence ends, the new word may of course be used in later prose, including later explanations that state recursive rules.
- **Additional Formatting**: optional enhancements to a Grow-Speech include hyperlinking each use of a defined word to its definition; on Gwern.net, our popups make this nice.
To do this, wrap the original definition in a named span of a namespace prefix plus the term (eg. `[**Grow-Speech** is a...]{#defn-1-grow-speech}`{.Markdown}), and link all subsequent uses to the span anchor ID (suppressing link decoration, if that would be clutter, using `.decorate-not`{.HTML}).
The span IDs should be included in the scansion list.
- **No Dictionaries Upfront**: A Grow-Speech *must not* cheat by defining, at the beginning, all the new words. Words must be defined as you go, in a natural way.
And that is Grow-Speech.
# Process Discipline
#. Before emitting any multi-syllable word, scan the prior text to confirm it was defined.
#. If not, define it first or paraphrase it away.
When asked to write about _X_ in Grow-Speech, do not immediately try to define _X_.
First, ask: can I say this with monosyllables?
If yes, do so.
If no, can I approximate?
Only if the concept is central and recurring should you invest the syllables to define it.
The whole point of the form is to feel the cost of each abstraction, so when in doubt, paraphrase rather than define—only define a term when you'll reuse it enough to earn its weight.
(Heuristic: prefer paraphrasing for any word used <4×.)
#. You may want to write multiple drafts sequentially to clean up definitions.
- In *author mode*, the author's dialectal English one-beat pronunciation should be used.
- In *audit mode*, a word is legal only if the chosen pronunciation source gives a one-beat form, or if the word has already been defined.
To audit: Before you give the final text, check each word from left to right.
Keep a list of grown words.
When you find a word with more than one beat, either show that it was grown before this point, define it before first use, or replace it with short words.
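As a concrete sketch of this audit loop, here is a minimal Python example.
Everything in it is an assumption for illustration: the vowel-group syllable counter is a crude stand-in for a real pronunciation source such as cmudict, and it treats any `**bold**` word as a definition headword, per the definition-form convention above.
~~~{.Python}
import re

def syllables(word: str) -> int:
    """Rough beat count by vowel groups; a crude stand-in for a real
    pronunciation source such as cmudict (which should win in close calls)."""
    w = word.lower()
    n = len(re.findall(r"[aeiouy]+", w))
    if n > 1 and w.endswith("e") and not w.endswith(("le", "ee")):
        n -= 1  # drop a silent final 'e'
    return max(1, n)

def audit(text: str) -> list[str]:
    """Scan left to right, keep the list of grown words, and flag every
    multi-beat word that was not grown before its use."""
    grown: set[str] = set()
    flagged: list[str] = []
    for headword, word in re.findall(r"\*\*([\w-]+)\*\*|([A-Za-z][\w'-]*)", text):
        if headword:                 # a **bold** headword: the word is grown
            grown.add(headword.lower())
        elif syllables(word) > 1 and word.lower() not in grown:
            flagged.append(word)     # multi-beat and never defined
    return flagged

print(audit("By **syllable** I mean the beat in a word. "
            "Each syllable counts, but paragraph does not."))
# -> ['paragraph']: 'syllable' was grown by its definition; 'paragraph' was not.
~~~
A real auditor would also have to handle hyphen parts, contractions, inert quoted spans, and acronym letter names, as specified above.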
# Example
Worked opening hypothetical:
> What I do here is a word stunt.
> I call [this kind of word stunt]{#defn-grow-speech}: **Grow-Speech**.
> It was done first by a man, Guy L. Steele, in his talk
> where he said he was “Growing a Language”.
> (That is a raw quote; its words do not grow my word list.)
> By **language** I mean [the words and ways by which we speak.]{#defn-language}
> He would speak and show us how he could grow a [language](#defn-language) with new words,
> right then and there.
>
> How did Steele write [Grow-Speech](#defn-grow-speech)? Like this!
>
> Each word I write must be a word of one beat.
> By **syllable** I mean [the beat in a word.]{#defn-syllable}
> By **define** I mean [to set down what a word will mean from then on.]{#defn-define}
> The one law: I may use a word with more than one [syllable](#defn-syllable) if I [define](#defn-define) it first.
> Once I [define](#defn-define) it, it is mine to use from then on.
>
> So now I can tell you that [Grow-Speech](#defn-grow-speech) is a [language](#defn-language) which
> [defines](#defn-define) all words using one-[syllable](#defn-syllable) words or words [defined](#defn-define) in the past.
>
After that opening the writer has earned "Grow-Speech", "language", "syllable", and "define" (and, by the same trick, can earn "primitive", "abstraction", etc. one at a time).
That ratchet is the whole game.
# Scansion
To help write a proper Grow-Speech, we can add hidden markup using HTML comments to encode a micro-DSL, in the style of my [poetry metadata](/style-guide#scansion).
Comments do not have to comply with Grow-Speech nor do they count as new definitions (they are only for audit), and may include additional commentary or summary or scaffolding.
To write a properly marked up passage, I suggest an iterative process: write a first draft freely in "draft mode", then rewrite a second draft with full scansion added and corrections made, and then triple-check it with a third, final clean draft.
A Grow-Speech scansion header block should define the used format, and list any exceptions or changed rules.
Most authors would use `Grow-Speech (standard)`, but an author might choose to allow many spelled-out numbers: `Grow-Speech (exception: all spelled-out numbers legal)`.
It should list all new words in a block at front with their syllable count and (optional) link IDs and definitions, in order of definition; and likewise for "raw" text.
Finally, it should explicitly list any errors, exceptions, violations, ambiguities, or miscellaneous notes, particularly for dialectal words where an American or British reader might be surprised.
The Grow-Speech as a whole should be surrounded by comment tags, to allow switching.
Whether definitions should persist between spans is left up to the author of a document; they may want each Grow-Speech to be independent, or to accumulate.
By default, definitions do not persist across separate `BEGIN Grow-Speech`/`END Grow-Speech` spans (`PERSIST: no`); if an author wants cumulative growth across spans, mark this in the header: `PERSIST: yes`.
Inside the Grow-Speech text proper, each defined word should be marked inline as a new numbered definition at the defining use (and *only* that first use), and raw quotes with illegal words should be marked inline as a new numbered raw quote.
If it is unclear what the scope of a newly defined term or a raw quote would be, it should be surrounded with an opening and closing comment.
## Scansion Example
The [previous example](#example), marked up with this scansion:
~~~{.Markdown}
What I do here is a word stunt.
I call [this kind of word stunt]{#defn-grow-speech}:
**Grow-Speech**.
It was done first by a man, Guy L. Steele, in his talk
where he said he was “Growing a Language”.
(That is a raw quote; its words do not grow my word list.)
By **language** I mean
[the words and ways by which we speak.]{#defn-language}
He would speak and show us how he could
grow a [language](#defn-language) with new words,
right then and there.
How did Steele write [Grow-Speech](#defn-grow-speech)? Like this!
Each word I write must be a word of one beat.
By **syllable** I mean [the beat in a word.]{#defn-syllable}
By **define** I mean [to set down what
a word will mean from then on.]{#defn-define}
The one law: I may use a word with more than one [syllable](#defn-syllable)
if I [define](#defn-define) it first.
Once I [define](#defn-define) it, it is mine to use from then on.
So now I can tell you that [Grow-Speech](#defn-grow-speech) is
a [language](#defn-language) which [defines](#defn-define) all words
using one-[syllable](#defn-syllable) words or words [defined](#defn-define)
in the past.
~~~
# Appendix
## Specification in Grow-Speech
Below is the Grow-Speech spec rewritten in Grow-Speech by [GPT-5.5 Pro](https://openai.com/index/introducing-gpt-5-5/#pro "‘Introducing GPT-5.5: A new class of intelligence for real work § Pro’, OpenAI 2026"):
> By **Grow-Speech** I mean this kind of word stunt.
> By **language** I mean the words and ways by which we speak.
> By **English** I mean the tongue used in this text.
> By **legal** I mean fit to be used by the law of this text.
> By **standard** I mean used by most folk as plain speech.
> By **syllable** I mean a beat in a word.
> By **define** I mean to set down what a word will mean from then on.
> By **sentence** I mean a run of words that ends with an end mark.
> By **only** I mean no more than this.
> By **letter** I mean one mark such as 'a', 'b', or 'c'.
> By **acronym** I mean a short name made from first letters.
>
> This text sets the law for Grow-Speech.
> This form is drawn from a 1998 talk by Guy L. Steele,
> whose raw name was “Growing a Language”.
> That raw name does not grow my word list.
> Kin word stunts have raw names too: “Up Goer Five” and “Basic English”.
>
> **Rule**
>
> In a Grow-Speech span, each text word seen by the eye must be legal at the point where it stands.
> At the start, the word list is all English words with one syllable in standard speech.
> A core word with two syllables or more is not legal till it has been defined.
> Once it has been defined, it stays legal to the end of the span.
>
> To grow a long word, define it in a sentence made of old legal words.
> The new word must be shown once in bold as the head word.
> The right side of that sentence may not use the new word, save for the bold head word.
> The new word starts to be legal once the end mark of that sentence is reached.
> The head word may show up in the right side only if it was legal in the past.
>
> **Text and words**
>
> Seen prose counts.
> Tags, code, file names, path names, web links, and raw strings do not count when they are not prose.
> But text the eye must read, such as heads, link text, or art notes, counts if not marked raw.
> Bold, slant, dots, marks, and tags do not count as words.
>
> A dash word is read by its parts.
> If the whole dash term is grown as one bold head word, it may then count as that one term.
> If not, split it at the dash and count the parts.
>
> Names, acronyms, code names, link text, and words from a strange tongue get no free pass.
> They must fit by sound or be grown.
>
> **Core words**
>
> Grow-Speech tests core words.
> It does not test each print shape as a new word.
> If a core word is legal, its plain forms are legal too:
> cat, cats; write, writes, wrote, writing, written; big, bigger, biggest.
>
> But a made word with a new core is not free.
> By **nation** I mean a land with one rule or one folk.
> If nation is legal, these raw signs do not get legal for free:
> “national”, “international”, “nationalism”.
> Nor does a fixed join word get a free pass just since its parts fit.
>
> In hard cases, check a word book.
> If the book puts the form with the same head word, count it as the same core.
> If the book gives it its own head word or a craft sense, treat it as new.
>
> **How to grow a word**
>
> This raw form is safe: “By _X_ I mean...”.
> This raw form is safe too: “Let _X_ stand for...”.
> The frame words in those forms have one beat.
> Use them when you can.
> Put the new word in bold.
> The head word is the sole free word in the grow sentence.
>
> The trap is this: do not use long law words at the start.
> Grow them first.
> The word syllable has 3 beats, so this text had to grow it first.
> The same holds for these raw words: “monosyllable”, “definition”, “vocabulary”.
>
> **Raw text**
>
> Quote marks on their own do not make text raw.
> A live cue can do it: he said, she wrote, this sign says, this code.
> The sense must make plain that the next quote is not live prose.
> Raw text need not fit Grow-Speech.
> It does not grow the word list.
> A word grown in raw text does not count.
> If the bounds of raw text are not plain, put raw start and end tags round it.
>
> **Beat count**
>
> Count syllables by sound, not by spell.
> This raw set shows 4 word signs: “through” = 1, “scythe” = 1, “strengths” = 1, “rhythm” = 2.
>
> Short forms count by what you say.
> This raw set shows traps: “don't” = 1, “I'm” = 1, “wouldn't” = 2, “doesn't” = 2, “couldn't've” = 3, “shouldn't've” = 3.
> Do not sneak in long words by way of short forms.
> When in doubt, stretch them out: did not, would not, could not have.
>
> For beat count, dash words and join words sum their parts.
> This does not make a fixed join word legal just since its parts fit.
> This raw set shows two such words: “blackboard” = 2, “well-known” = 2.
>
> If an acronym is read as the names of its letters, count the beats in those names.
> If an acronym is read as a word, count its word sound.
> Thus this raw set is safe to cite: “LLM” = 3, “FBI” = 3, “NASA” = 2, “WWW” = 9, “RAM” = 1, “L-L-M” = 3.
>
> By **digit** I mean a mark for a count, such as 0 or 7.
> Digits and digit strings, such as `1998` or `47`, are raw by this law.
> Old count marks such as I, V, and X are raw too.
> Thus this raw set may pass: “1998”, “Page 47”, “v3.2.1”, “c. 1300 BC”.
> If you spell out a count, each word must fit.
> This raw set shows count words: “nineteen ninety-eight”, “forty-seven”, “one”, “two”, “three”, “eight”, “ten”, “twelve”, “seven”, “eleven”, “a third of the time”, “four years ago”.
> Use digits for years, dates, page marks, build tags, and cite tags.
> Use word form for a count only when the count word is part of the prose.
>
> Words from a strange tongue count by their English sound.
> If a word can be said in more than one way, and one way has one beat in plain standard speech, let it pass.
> This raw set shows such words: “fire”, “hour”, “real”.
> For check work, you may use a speech word list with this raw name: “cmudict”.
> If the list gives a one-beat form, let it pass.
> If all forms have more than one beat, grow the word.
>
> **Names**
>
> Names get no free pass.
> This raw set shows names that fit by sound: “Bach”, “York”, “Smith”, “Marx”.
> By **Paris** I mean the town in France.
> A name like Paris must be grown if its sound has more than one beat.
> A name with more than one part must have each part legal, or must be grown as one marked term.
> A cite tag may let a last name with a year pass, so you need not grow each name in a cite tag.
> Thus this raw cite tag may pass: “Schmidhuber 1991”.
> But if you talk of the one named, grow the name.
> This raw prose name does not pass on its own: “Schmidhuber”.
> This raw grow line would make it pass: “**Schmidhuber** is a Swiss sage.”
>
> **Loop rules**
>
> The right side of a grow sentence must not use the new word, save when that word was legal at the time the sentence starts.
> Once the grow sentence ends, the word may be used in text that talks of this rule or of rules like it.
>
> **How to write**
>
> Each time you write a long word, scan left.
> If the word has not been grown, grow it first or swap it out.
> When asked to write on _X_, do not first grow _X_.
> Ask: can I say this with short words?
> If yes, do so.
> If no, can I say a near thing?
> Grow a long word when it is key and will come back.
> The point is to feel the cost of each long word.
> Rule of thumb: skip a long word that you would use less than 4×.
> Make more than one draft if that helps.
>
> In write mode, use your own speech.
> In check mode, a word is legal only if your speech list gives a one-beat form,
> or the word has been grown.
> By **audit** I mean to check a text for breaks.
> To audit, go left to right.
> Keep a list of grown words.
> When you meet a long word, show where it was grown, grow it first, or change it.
>
> **Scan marks**
>
> By **comment** I mean a note in code that the eye does not see.
> By **scansion** I mean a way to mark how words scan.
> A scansion block can help you check the text.
> Comments do not count as words.
> They do not grow words.
>
> By **header** I mean a first code block that tells how to scan the span.
> A header should say what law is in use.
> It should list grown words, raw spans, flaws, odd calls, and notes.
> Put grown words in the same run in which they grow.
>
> Use start and end tags round the live Grow-Speech span.
> This raw tag can start it: `BEGIN Grow-Speech`.
> This raw tag can end it: `END Grow-Speech`.
>
> By **persist** I mean to keep grown words from one span to the next.
> A new span starts fresh.
> If the header says this raw sign, `PERSIST: yes`,
> grown words persist to the next span.
> If not, they do not.
>
> Mark each grown word at the first use that grows it, and mark it only once.
> Mark raw text with raw tags.
> If the bounds are not plain, use a start tag and an end tag.
>
> This is the spec.
~~~~
# Gemini-3.1-pro-preview
**Scope:** Expanding the prior 300-word Grow-Speech essay into a comprehensive ~1000-word tutorial with extensive analogies, fulfilling the "3× longer" request.
**Confidence:** Almost certain on pedagogical flow and strict 1-syllable bounds.
**Perspective:** Expanded physical translation. The abstract matrix of MDPs is grounded into a maze, treats, smacks, maps, and tapes to exhaustively illustrate each algorithm.
# Reinforcement Learning for Grade-Schoolers
Think of a toy bot in a maze.
The bot wants to find the end of the maze.
If it hits a wall, it gets a shock.
That is a smack.
If it finds the end, it gets a piece of cheese.
That is a treat.
We want the bot to find the cheese fast.
By **learning** I mean [to get smart when you try to win.]{#defn-learning}
By **reinforcement** I mean [a treat or a smack that helps you learn.]{#defn-reinforcement}
So, [reinforcement](#defn-reinforcement) [learning](#defn-learning) is how a bot gets smart through treats and smacks.
It will try to walk.
It will hit a wall.
Smack!
It will try to turn.
It will find the cheese.
Treat!
Next time, it will know to turn.
It will not hit the wall.
How does the bot know what to do?
There are two main ways.
By **value** I mean [the score of a move.]{#defn-value}
Some bots keep a big list of scores.
They look at a spot in the maze.
They think, "What is the score if I step left? What is the score if I step right?"
They pick the step with the best score.
But to look up scores can take a long time.
By **policy** I mean [a strict rule on what to do.]{#defn-policy}
A bot with a rule does not look at scores.
It just sees a wall on its left, and its rule says "Turn right."
It does not think.
It just acts.
By **model** I mean [a map in your head of how the world works.]{#defn-model}
If a bot has a map in its head, it can think of what comes next.
It can say, "If I walk two steps, I will fall in a pit. So I will jump."
It plans.
But a bot with no map does not plan.
It just looks at what is right in front of it and moves.
It takes a lot of time to learn the map of the world.
So a lot of the time, we do not use a map.
How does the bot get the facts to learn?
By **on-policy** I mean [to learn from what you do right now.]{#defn-on-policy}
If a bot is in the maze, it makes a move.
It learns from that move.
It learns from its own flaws right in the act.
But what if it wants to learn fast?
By **off-policy** I mean [to learn from old tapes of past games.]{#defn-off-policy}
The bot can sit back and watch tapes of old bots.
It can see an old bot hit a wall.
Our bot says, "I will not do that."
It learns from the tape, not from what it does right now.
Let us look at a well known way to learn.
By **Q-learning** I mean [a way to win with a look at a big list of scores.]{#defn-q-learning}
This uses a list, so it looks at the [value](#defn-value) of a move.
It does not have a [model](#defn-model) in its head.
It looks at old tapes, so it is [off-policy](#defn-off-policy).
Here is how it works.
The bot makes a huge chart.
For each spot in the maze, it writes down all the moves it can make.
Then it writes a score next to each move.
When it looks at a tape, it sees a bot step left and get a treat.
So it makes the score for "step left" go up in its chart.
When it plays, it just looks at its chart and takes the move with the top score.
We can try a wild way to learn.
By **evolution** I mean [a way to learn where we clone a brain, change a few parts, keep the brains that win, and ditch the rest.]{#defn-evolution}
We make 10 bots.
We put them all in the maze.
Some hit walls.
Some fall down.
One gets the cheese.
We ditch the nine bots that lost.
We take the one bot that won and make 10 clones of it.
We change small parts in the brains of the clones.
Just a bit of noise.
Then we race them.
We do this for a long time.
Soon, we have a bot that runs the maze fast.
It does not learn while it runs.
It just lives or dies.
There is a way to make a [policy](#defn-policy) learn while it runs.
By **REINFORCE** I mean [a rule to make the parts of a brain that win strong and the parts that lose weak.]{#defn-reinforce}
A bot plays a full game in the maze.
It takes one whole trip.
It might make 10 moves.
At the end, if it gets the cheese, we go back and tell its brain: "All 10 of those moves were good. Do them more."
If it gets a shock at the end, we tell its brain: "All 10 of those moves were bad. Do them less."
Wait.
What if move nine was bad, but it still got the cheese?
The rule does not care.
It makes move nine strong too.
Over a lot of games, the bad moves wash out.
The bot learns to win.
This rule is [on-policy](#defn-on-policy), since the bot must play its own games to learn.
We can make that rule much less dumb.
By **critic** I mean [a side brain that learns to guess if a move will win or lose.]{#defn-critic}
By **actor** I mean [the main brain that plays the game.]{#defn-actor}
The [actor](#defn-actor) plays the game.
The [critic](#defn-critic) just sits and looks at the board.
It tries to guess the score.
If the [critic](#defn-critic) guesses a move will get 10 points, and the move gets 12 points, it says to the [actor](#defn-actor): "Wow! That was more than I thought!"
Then the [actor](#defn-actor) makes that move strong.
If the move gets 8 points, the [critic](#defn-critic) says: "Boo. That was worse than I thought."
Then the [actor](#defn-actor) makes that move weak.
This helps the bot learn fast, since it does not have to wait for the end of the game to see if a move was good.
Brains can break if they learn too fast.
If you tell a brain to change a lot based on one great game, it might lose track of how to do a plain thing like walk.
By **PPO** I mean [a rule that says a brain must take small steps and not change too fast.]{#defn-ppo}
It puts a cap on how much a brain can shift in one turn.
This keeps the brain safe.
What if we want the brain to do more than just walk in a maze?
What if we want it to speak words?
By **GRPO** I mean [a rule to rank a group of moves from best to worst to learn fast.]{#defn-grpo}
When a bot wants to speak, it can say the same thing in 10 ways.
We make the bot print all 10 ways at once.
Then we sort them.
We put the best at the top and the worst at the base.
This helps the bot see just which words to choose next time.
Let us put all of this to use.
By **human** I mean [a man or gal or girl or boy.]{#defn-human}
By **feedback** I mean [when a boss says if a move was good or bad.]{#defn-feedback}
By **RLHF** I mean [a way to train a bot where a human gives feedback on its words.]{#defn-rlhf}
We want a bot to talk to us.
We want it to be nice and smart.
First, it learns to guess the next word from a ton of text on the web.
But it might say mean things.
So, a [human](#defn-human) reads what the bot says.
If the bot is rude, the [human](#defn-human) gives it a smack.
This is a bad score.
If the bot is nice and helps out, the [human](#defn-human) gives it a treat.
This is a good score.
A side [critic](#defn-critic) brain learns to act like the [human](#defn-human).
It learns to give treats and smacks.
Then the bot uses [PPO](#defn-ppo) to change its [policy](#defn-policy) with small safe steps based on those treats.
By **ChatGPT** I mean [a smart bot that talks to you and was trained with RLHF and PPO.]{#defn-chatgpt}
When you talk to it, it uses a strict rule to pick its words.
It got that rule from a long time of tries and faults, led by a [human](#defn-human) hand, kept safe by small steps.
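Here is a small code sketch of the first step of [RLHF](#defn-rlhf): a side brain learns to score words the way a [human](#defn-human) would, from picks of the kind "this one, not that one".
Code text is not prose, so it does not have to fit this form; the names in it and the toy task are my own guess, a rough sketch and not the one true code.
~~~{.Python}
import math, random

# Each toy reply is one trait number (say, how kind it sounds).
# The human, shown two replies, tends to pick the kinder one.
def human_pick(a: float, b: float) -> int:
    return 0 if a > b else 1

w = 0.0     # the side brain: reward(reply) = w * trait
lr = 0.1

for _ in range(2000):
    a, b = random.uniform(-1, 1), random.uniform(-1, 1)
    pick = human_pick(a, b)
    # Chance the side brain gives to "a beats b" (a Bradley-Terry model):
    p_a = 1.0 / (1.0 + math.exp(-(w * a - w * b)))
    target = 1.0 if pick == 0 else 0.0
    # Nudge w so the side brain's picks match the human's picks.
    w += lr * (target - p_a) * (a - b)

print(w > 0)  # True: the side brain now scores kind replies higher.
~~~
Once the side brain can score on its own, [PPO](#defn-ppo) can train the talk bot to chase that score with small, safe steps, and no [human](#defn-human) has to read each new line.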
# Claude-4.7-opus
---
title: Reinforcement Learning for Grade-Schoolers
description: A tutorial on contemporary deep RL written in Grow-Speech, with kitchen-table props.
status: finished
confidence: highly likely
importance: 5
...
# Reinforcement Learning for Grade-Schoolers
# Read this first
I want to teach you how a brain can learn to walk a maze, win at chess, or chat with a kid like you—not by being told the rules, but by trying things and getting a score back.
I will not start with a long list of new words.
A long word will come up when we need it.
The first time it does, I will mark it in bold and tell you what I mean.
Then it is mine to use as I please.
By the end you will know most of what you need to read the new work folk now do with chat brains and game brains.
By **kitchen** I mean [the room in a home where food is made and kept]{#defn-kitchen}.
Read this with a [kitchen](#defn-kitchen) in your mind.
A lot of these tricks can be built with stuff from the junk drawer: small wood boxes, stones, jars, score cards, a leash, a sketch pad, a chess board, eight cakes.
# A brain made of small parts
By **neuron** I mean [a small part of a brain that takes in shocks from wires and sends out shocks too]{#defn-neuron}.
A brain is made of [neurons](#defn-neuron).
Each [neuron](#defn-neuron) gets shocks from the rest through wires.
If those shocks add up past some high mark, the [neuron](#defn-neuron) fires and sends out a shock too.
We change how a brain thinks by changing how strong each wire is, or by changing the high mark of each [neuron](#defn-neuron).
Some wires get more strong, some less; some [neurons](#defn-neuron) fire more, some less.
That is what is meant when folk say "the brain learns."
Two key words help.
By **strengthen** I mean [to make a thing more strong]{#defn-strengthen}.
By **weaken** I mean [to make a thing more weak]{#defn-weaken}.
When the brain does well, we [strengthen](#defn-strengthen) the wires of the [neurons](#defn-neuron) that fired.
When it does ill, we [weaken](#defn-weaken) them.
That is the heart of all of this.
Once we know how to flip wires from strong to weak and back, we know how to make a brain learn.
The trick we will lean on for the rest of this read: think of each wire as a small jar full of stones.
The more stones in the jar, the more strong the wire.
[Strengthen](#defn-strengthen) = drop a stone in.
[Weaken](#defn-weaken) = take a stone out.
You could play this on your kitchen floor with a hand of dimes and a row of jam jars.
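If you like code more than jars, here is a small code sketch of one [neuron](#defn-neuron) and its jars of stones.
Code text is not prose, so it does not have to fit this form; the names in it are my own guess, a rough sketch and not the one true way to build a brain.
~~~{.Python}
class Neuron:
    """One neuron: each in-wire is a jar of stones. More stones = a
    stronger wire. If the weighed shocks pass the high mark, it fires."""

    def __init__(self, n_wires: int, high_mark: float = 2.0):
        self.stones = [1.0] * n_wires   # one jar of stones per wire
        self.high_mark = high_mark      # the neuron's "enough"

    def fires(self, shocks: list[float]) -> bool:
        total = sum(s * w for s, w in zip(self.stones, shocks))
        return total >= self.high_mark

    def strengthen(self, wire: int) -> None:
        self.stones[wire] += 1.0        # drop a stone in

    def weaken(self, wire: int) -> None:
        self.stones[wire] = max(0.0, self.stones[wire] - 1.0)  # take one out

n = Neuron(n_wires=3)
print(n.fires([1.0, 1.0, 0.0]))  # True: 1 + 1 meets the high mark of 2
n.weaken(0)
print(n.fires([1.0, 1.0, 0.0]))  # False: the first jar lost a stone
~~~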
# Learn by try and miss
A brain learns to do well at a task by try and miss.
At each step, the brain looks at where it stands, picks a move, and the world hands back a score: a high score for a good move, a low score (or none) for a bad one.
The brain wants the most score it can get, on the whole, in the long run.
Think of a dog and a treat.
You do not tell the dog the rule.
You just give it a treat when it sits.
Soon, the dog sits to get the treat.
A bot is the same: hand it a score, and it learns to do more of what got the score.
Or think of a small bot in a maze.
It steps left and hits a wall.
Bad score.
It steps right and finds a chunk of cheese.
Good score.
At first, the bot does not know the maze at all.
It has to try, get a score, and change.
There is one hard part.
The score for a move can show up long past the move that earned it.
The bot might step left, then up, then up, then right, and then find the cheese.
Which step earned the cheese?
It might be the step that brought it close, or one ten steps back that put it on the right path.
The brain has to look back and guess which past move helped.
# Try a known thing, or try a new one
The bot has a choice at each step.
It can do the move that has worked best so far.
Or it can try a new move, in case the new move is in fact more strong.
Try too much, and the bot wastes time.
Stick with the best, and the bot might be stuck with what it knows now—when a much better path lies just past one new try.
You do this too.
If you find one ice cream you like, do you get that one each time?
Or do you, now and then, try a new kind?
The bot has the same choice each step.
# State, move, score, plan
A *state* is what the brain sees right now: the spot in the maze, the board in a game, the words on a chat screen.
A *move* is what the brain does next: step left, push a pawn, write a word.
A *score* is the count the world hands back.
A *plan* is what the brain does in each kind of state.
By **policy** I mean [a plan for what move to make in each kind of case]{#defn-policy}.
Think of a [policy](#defn-policy) as a small rule book.
One page for each kind of state.
Each page says, "if you see this, do this."
You flip to the page that fits where you stand, do what it says, and move on.
Some pages are firm: "do this one move."
Some are loose: "most of the time go left, but with some chance, go right."
By **value** I mean [a guess at the score that will come if you do a move from where you stand]{#defn-value}.
Think of a [value](#defn-value) as a price tag stuck on a move: "this move is worth, in the long run, 7 treats."
At first the price tags are blank—the bot has no clue what each move is worth.
As it plays, the price tags fill in.
A high price tag = good move.
A low price tag = bad move.
# Two ways to learn
There are two big ways to teach a brain to play a task.
By **value-based** I mean [ways of learning where you score each move first and pick the best]{#defn-value-based}.
In a [value-based](#defn-value-based) way, the brain fills in the price tags first.
Then to act, it looks at all the moves it could make from where it stands, finds the move with the steepest price tag, and picks that one.
The [policy](#defn-policy) is built on top of the price tags: "in each state, pick the move with the steepest tag."
By **policy-based** I mean [ways of learning where you skip scoring moves and just learn the policy]{#defn-policy-based}.
In a [policy-based](#defn-policy-based) way, the brain skips price tags.
It just learns the rule book right off: in each state, what move?
[Value-based](#defn-value-based) ways tend to work well on small tasks where the brain can rate each move with care.
[Policy-based](#defn-policy-based) ways tend to work well on huge tasks—a chat brain that writes the next word, say—where there are far more moves than we could rate one by one.
# A row of wood boxes with stones
Now we can build our first way to learn, with no big brain at all—just a row of wood boxes on a shelf.
This is the most plain way of all.
You could in fact build this on your kitchen floor.
By **table** I mean [a flat grid of cells, set out in rows]{#defn-table}.
A [table](#defn-table) is just a way to set out facts in rows and cells.
But for our maze bot, think of the [table](#defn-table) as a row of wood boxes—one box for each spot in the maze.
In each box: a small set of stones.
One stone for each move the bot could make from that spot (left, right, up, down).
Each stone has a count chalked on it: the price tag for that move.
At the start, all the chalk reads 0.
By **Q-learning** I mean [a way of learning where you fill in a table of value guesses through play]{#defn-q-learning}.
[Q-learning](#defn-q-learning) goes like this.
The bot lands on a spot.
It walks to the box for that spot, lifts the lid, and looks at the chalked stones.
Most of the time it picks the stone with the steepest count.
Now and then it picks a stone at chance, just to try.
It does the move on the stone, gets a score, lands on a new spot.
Then it fixes the stone it just used.
Take that stone out of the old box.
Look in the new spot's box: what is the steepest count on a stone in there?
Add the score to that count.
Now blend the stone's old chalk with this new count: rub off a bit of the old, chalk on a bit of the new.
Drop the stone back in the old box.
Do this for a long stretch of play, and the chalked counts on the stones fill in to the right [values](#defn-value).
Then the bot just walks to its box, picks the steepest stone, and plays.
By **tabular** I mean [based on a table, not on a brain of neurons]{#defn-tabular}.
This is *[tabular](#defn-tabular)* [Q-learning](#defn-q-learning).
There is no brain of [neurons](#defn-neuron) here at all.
Just a row of wood boxes with stones in them.
The cells of the [table](#defn-table) are real cells with real stones.
This works well for small worlds.
For huge worlds—worlds with too much state to fit in a row of boxes, or worlds of text—the box-row would fill the whole house, and we have to use a brain of [neurons](#defn-neuron) to guess the chalk on each stone.
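Here is a small code sketch of [tabular](#defn-tabular) [Q-learning](#defn-q-learning) in a toy line maze: 5 spots in a row, with the cheese at the end.
The maze, the rates, and the names are my own guess, a rough sketch and not the one true code; code text is not prose, so it does not have to fit this form.
~~~{.Python}
import random

# A line maze with spots 0..4; the cheese sits at spot 4.
# Moves: 0 = step left, 1 = step right.
N_SPOTS, CHEESE = 5, 4
Q = [[0.0, 0.0] for _ in range(N_SPOTS)]  # the row of boxes: one count per move
alpha, gamma, epsilon = 0.5, 0.9, 0.1     # blend rate, fade rate, try-new rate

for episode in range(500):
    spot = 0
    while spot != CHEESE:
        if random.random() < epsilon:     # now and then, pick a stone at chance
            move = random.randrange(2)
        else:                             # else take the steepest count
            move = max((0, 1), key=lambda m: Q[spot][m])
        new_spot = max(0, spot - 1) if move == 0 else min(N_SPOTS - 1, spot + 1)
        score = 1.0 if new_spot == CHEESE else 0.0
        # The off-policy step: blend in the *steepest* count in the next box,
        # not the count of the move the bot would in fact pick next.
        best_next = max(Q[new_spot])
        Q[spot][move] += alpha * (score + gamma * best_next - Q[spot][move])
        spot = new_spot

print([max(range(2), key=lambda m: Q[s][m]) for s in range(N_SPOTS - 1)])
# -> [1, 1, 1, 1]: in each spot short of the cheese, "step right" wins.
~~~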
# Map or no map
By **model** I mean [a guess in the brain of how the world works, used to plan out moves in its head]{#defn-model}.
Think of a [model](#defn-model) like this.
The bot has, in its head, a small doll house twin of the maze.
And in the doll house lives a wee doll-bot.
Before doing a move for real, the bot can poke the doll-bot through the doll house and see what comes next.
"If I send the doll left, then up, where does it land?"
The doll house is the [model](#defn-model).
With a doll house in its head, the bot can plan a string of moves before it moves a foot for real.
A bot with no [model](#defn-model) just plays in the real maze.
No doll house.
No plans.
It steps, it gets a score, it learns.
By **model-free** I mean [ways of learning that do not build a model]{#defn-model-free}.
By **model-based** I mean [ways of learning that do build a model]{#defn-model-based}.
[Model-based](#defn-model-based) ways tend to learn faster from less play, since the brain can plan in its doll house and so squeeze more out of each play.
But the doll house might be wrong.
The doll house could say "step left and you find cheese," and the real maze say "step left and you find a wall."
Then the brain plans well in the doll house and crashes in the real maze.
[Model-free](#defn-model-free) is slow but safe: no wrong doll house to fool you.
[Tabular](#defn-tabular) [Q-learning](#defn-q-learning), from the last part, is [model-free](#defn-model-free).
The wood boxes hold price tags for moves, not a doll house of the maze.
# Whose moves do you learn from?
Now a fine point.
The brain is making moves, and it is learning—but is it learning from its own moves right now, or from old game tapes?
By **on-policy** I mean [you learn just from moves made by the policy you have now]{#defn-on-policy}.
By **off-policy** I mean [you learn from moves made by past or kin policies, not just the one you have now]{#defn-off-policy}.
Why does this split show up?
If you are [on-policy](#defn-on-policy), you learn just from fresh games.
Once you change your [policy](#defn-policy), the old games are no good for learning—they came from the old [policy](#defn-policy), not the new one.
You have to throw out the old plays and play more.
If you are [off-policy](#defn-off-policy), you keep a box of old game tapes on your shelf.
Some are tapes of you, last week.
Some are tapes of kin bots.
You re-watch tapes from the box and learn from them too.
This saves a lot of fresh play—but [off-policy](#defn-off-policy) is a fair bit harder to make work right.
[Tabular](#defn-tabular) [Q-learning](#defn-q-learning) is [off-policy](#defn-off-policy).
Why? When it fixes a stone, it blends in the *steepest* count from the next box—not the count of the stone the bot would in fact pick next.
That is, it learns the [value](#defn-value) of the *best* [policy](#defn-policy), not its own.
That is what makes it [off-policy](#defn-off-policy).
# Shake and pick
Now: how do we make a *brain* (not a row of wood boxes) learn?
Here is one way that does no math at all.
By **evolution** I mean [change of a kind of brain through births, shakes, wins, and deaths]{#defn-evolution}.
[Evolution](#defn-evolution) goes like this.
Take a brain.
Make 100 twins.
For each twin, give it a small zap—a few wires shift up, a few shift down, by chance.
Now race all 100 brains through the maze.
Each twin gets a run-time.
The 90 worst run-times get the chop.
Throw those brains out.
The 10 best stay.
From those 10, make 100 new twins.
Zap each one.
Race them through the maze.
Chop the 90 worst.
Keep the 10 best.
Do it once more.
And once more.
Soon, the brains run the maze fast.
It is grim if you think too hard.
But the brains do not feel a thing.
[Evolution](#defn-evolution) is [model-free](#defn-model-free) (no doll house).
[Evolution](#defn-evolution) is [policy-based](#defn-policy-based) (no price tags; the rule book brain is what gets shaken).
[Evolution](#defn-evolution) is [on-policy](#defn-on-policy) (each brain learns just from games made by its own [policy](#defn-policy)).
What is the good of [evolution](#defn-evolution)?
It is plain.
It works on huge brains.
You can race all 100 twins at the same time on lots of chips, since each twin runs on its own.
What is bad?
It wastes plays.
A twin has to run a whole maze just to get one score.
And one bad slip in a long run can make a great brain look bad.
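Here is a small code sketch of [evolution](#defn-evolution) on a toy task: shake wires, race, chop, breed.
The task and the names are my own guess, a rough sketch and not the one true code.
~~~{.Python}
import random

GOAL = [0.5, -0.2, 0.8]   # the secret "best wires" for this toy race

def race(brain: list[float]) -> float:
    """A brain's run score: high when its wires sit near the goal."""
    return -sum((w - g) ** 2 for w, g in zip(brain, GOAL))

def zap(brain: list[float]) -> list[float]:
    """A twin of the brain, with each wire shaken a bit, by chance."""
    return [w + random.gauss(0.0, 0.1) for w in brain]

pop = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(100)]
for generation in range(30):
    pop.sort(key=race, reverse=True)              # race all 100 twins
    best = pop[:10]                               # keep the 10 best...
    pop = [zap(random.choice(best)) for _ in range(100)]  # ...breed and zap

print(max(pop, key=race))  # close to GOAL by now; no gradient math was used
~~~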
# One brain, lots of games
Now for a smart way that uses real math.
By **REINFORCE** I mean [a way of learning where one brain plays lots of games and pays each neuron by the end score of the games it fired in]{#defn-reinforce}.
Set the kitchen up like this.
On the shelf, lined up in front of you, is a jar for each [neuron](#defn-neuron) in the brain.
The count of stones in a jar tells you how strong that [neuron](#defn-neuron) is.
A jar with lots of stones: a strong, fire-easy [neuron](#defn-neuron).
A jar near dry: a weak [neuron](#defn-neuron) that holds back.
[REINFORCE](#defn-reinforce) goes like this.
The brain plays a whole game.
At each step, it picks a move with some chance.
Note each move and which [neurons](#defn-neuron) fired for it.
At the end of the game, look at the score.
If the brain won, walk down the shelf and drop a stone in each jar that fired.
If it lost, walk down the shelf and pluck a stone out of each jar that fired.
Then play more games and do it once more.
This is *not* fair to the [neurons](#defn-neuron) in one game.
A jar that fired for a smart move in a lost game still loses a stone.
A jar that fired for a dumb move in a won game still gains a stone.
But over lots of games, the noise washes out.
On the whole, the [neurons](#defn-neuron) that help win pile up stones.
The ones that help lose run dry.
The brain gets good.
[REINFORCE](#defn-reinforce) is [policy-based](#defn-policy-based), [on-policy](#defn-on-policy), and [model-free](#defn-model-free).
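Here is a small code sketch of [REINFORCE](#defn-reinforce) on a toy game with two moves, where one move wins more.
The game and the names are my own guess, a rough sketch and not the one true code.
~~~{.Python}
import math, random

p_win = [0.8, 0.4]        # the world: how often each move wins
theta = [0.0, 0.0]        # the jars: one strength per move
lr = 0.1

def probs() -> list[float]:
    e = [math.exp(t) for t in theta]
    return [x / sum(e) for x in e]

for game in range(2000):
    pi = probs()
    move = 0 if random.random() < pi[0] else 1
    reward = 1.0 if random.random() < p_win[move] else -1.0  # win or lose
    # Pay every jar by the whole game's score, win or lose, even if that
    # one move was not the cause: it is noisy, but right on the mean.
    for k in range(2):
        grad_log = (1.0 if k == move else 0.0) - pi[k]
        theta[k] += lr * reward * grad_log

print(probs())  # the first move now gets most of the chance mass
~~~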
# A judge calls the play
[REINFORCE](#defn-reinforce) is loud.
"Won the game" is a noisy grade for each move.
We can get a much sharper grade.
By **actor** I mean [the brain that picks moves]{#defn-actor}.
By **critic** I mean [a brain that grades how well the actor will do]{#defn-critic}.
Set the kitchen up with two brains now.
The [actor](#defn-actor) plays the game—it has the rule book.
The [critic](#defn-critic) sits on the side like a judge on TV who calls the play before it goes down.
At each spot, the [critic](#defn-critic) shouts a guess: "from here, you'll score 5 treats."
That is a call.
The [critic](#defn-critic) is graded on how close its calls come to the truth: a wrong call costs the [critic](#defn-critic), a right call pays it.
By **advantage** I mean [how much a play beat the critic's guess]{#defn-advantage}.
Now we change [REINFORCE](#defn-reinforce).
The [actor](#defn-actor) makes a move.
The [critic](#defn-critic) called 5 for that spot.
The play in fact ends up with a score of 8.
That beat the call by 3—so the [advantage](#defn-advantage) is +3.
Drop 3 stones in the jars of the [neurons](#defn-neuron) that fired.
If the play in fact ends up with a score of 2, that is short of the call by 3—so the [advantage](#defn-advantage) is −3.
Pluck 3 stones out of the jars that fired.
So now we are not paying [neurons](#defn-neuron) for "we won."
We are paying them for "we did *better than the [critic](#defn-critic) called*."
That is a much sharper grade.
And the brain learns much faster.
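And here is the judge trick as a toy sketch, on the same made-up game. The toy game has just one spot, so the judge has just one call to learn.

```python
import math, random

# A toy sketch of the judge (critic) trick. All names and odds are made up.
logits = {"a": 0.0, "b": 0.0}        # the actor's jars, one per move
call = 0.0                           # the critic's call: "you'll score about this"

def probs():
    z = {m: math.exp(v) for m, v in logits.items()}
    s = sum(z.values())
    return {m: v / s for m, v in z.items()}

random.seed(0)
for game in range(2000):
    p = probs()
    move = random.choices(list(p), weights=list(p.values()))[0]
    win_chance = 0.8 if move == "a" else 0.2      # made-up odds per move
    score = 1.0 if random.random() < win_chance else -1.0
    advantage = score - call                      # how much the play beat the call
    for m in logits:                              # pay the jars by the advantage
        fired = 1.0 if m == move else 0.0
        logits[m] += 0.1 * advantage * (fired - p[m])
    call += 0.05 * (score - call)                 # nudge the judge toward the truth
print(probs(), round(call, 2))   # move "a" wins out; the call climbs toward its score
```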
# Small steps, safe steps
Both [REINFORCE](#defn-reinforce) and the [actor](#defn-actor)–[critic](#defn-critic) trick have a sore spot.
When you change the brain a lot in one step, you can break it.
A brain that was great might, with one big push, lurch off and lose for a long stretch—and worse, lose skills it once had, like how to walk in a straight line at all.
By **PPO** I mean [a way of learning that takes small steps so the brain does not break]{#defn-ppo}.
Think of [PPO](#defn-ppo) as a short leash.
Keep a copy of the brain you had at the start of this round.
Tie a short leash from the new brain back to the old brain.
Now, when you go to drop or pluck stones from the jars, ask first: how much more apt, or less apt, would each move grow, next to the chance the old brain gave it?
If the shift in a move's chance is small, drop or pluck the stones, fine.
If a move's chance would drift way too far in one round, the leash snaps the change back to a small one.
"Don't yank the brain too far in one step" is the law.
So the brain learns by lots of small steps, not by big leaps.
This works much better than [REINFORCE](#defn-reinforce) on big tough tasks.
[PPO](#defn-ppo) was the main way most folk did this kind of work for years.
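For those who read code, here is the feel of the leash as a toy sketch. The true PPO rule clips the push in the math, not the chance by hand; all the counts here are made up.

```python
# A toy sketch of the short leash. All counts are made up.
def leash_back(old_chance, new_chance, leash=0.2):
    """The new brain wants to give a move `new_chance`. The leash holds the
    drift from the old brain's chance to at most one fifth, up or down."""
    low = old_chance * (1 - leash)
    high = old_chance * (1 + leash)
    return min(max(new_chance, low), high)

print(leash_back(old_chance=0.3, new_chance=0.9))    # 0.36: snapped back
print(leash_back(old_chance=0.3, new_chance=0.31))   # 0.31: small drift, fine
```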
# The group is the score
[PPO](#defn-ppo) needs the [critic](#defn-critic) brain.
That is one more brain to train—one more shelf of jars to keep up.
It is a pain.
By **GRPO** I mean [a way of learning where each task is tried in a group, and the mean of the group is the base score]{#defn-grpo}.
Think of [GRPO](#defn-grpo) as eight darts at a board.
For each task, the brain throws 8 tries—8 darts.
Each dart gets a score (how close to the bull's eye).
Find the mean of the 8 scores.
That mean is the base.
Each dart that beat the base: drop stones in the jars of the [neurons](#defn-neuron) that fired for that dart.
Each dart that fell short: pluck stones out of those jars.
And you still keep the leash, the [PPO](#defn-ppo) way.
The point: no [critic](#defn-critic) brain.
The group is the [critic](#defn-critic).
The mean of the 8 darts is the call.
Less work, less stuff to train, less to break.
[GRPO](#defn-grpo) is now used a lot to train chat brains.
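Here is the dart count as a toy sketch, with made-up scores.

```python
# A toy sketch of the GRPO base: 8 darts at one task. All scores made up.
scores = [3.0, 7.0, 5.5, 2.0, 9.0, 3.5, 6.0, 4.0]   # one score per dart
base = sum(scores) / len(scores)                     # the mean of the group: 5.0
for dart, s in enumerate(scores):
    gap = s - base                                   # beat the base, or fell short
    verb = "drop stones in" if gap > 0 else "pluck stones out of"
    print(f"dart {dart}: {gap:+.1f} -> {verb} the jars that fired for it")
# (The real GRPO also shrinks each gap by how spread out the 8 scores are,
#  and keeps the short leash from PPO on top.)
```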
# A brain that plays itself
Now for two big ways folk tie all these parts in one place.
The first is for game brains.
By **self-play** I mean [play where a brain plays a twin of its own self]{#defn-self-play}.
By **AlphaZero** I mean [a game brain that used self-play, search, a policy head, and a value head]{#defn-alphazero}.
Think of a kid alone in their room with a chess board.
They play both sides of the game.
White makes a move.
Then the kid spins the board, sits in the black seat, and makes a move for black.
Spin and switch, spin and switch, all on their own.
A few years of this, and the kid is a beast at chess.
That is what [AlphaZero](#defn-alphazero) does.
It plays both sides of board games—the raw names of the games it played were "chess, shogi, and Go".
It got great at all three with no help from folk who knew the games.
It did not read books on chess.
It did not watch tapes of grand chess play.
It just learned from games it played on its own.
Here is the trick.
The rules of chess are known.
If you have a board and a move, the rules tell you the next board with no doubt.
So the rules of the game *are* a [model](#defn-model)—a true one, not a guess.
[AlphaZero](#defn-alphazero) is [model-based](#defn-model-based) at move time, since it plans ahead with the rules.
The [AlphaZero](#defn-alphazero) brain has two parts (or, two "heads," set on two tasks).
One part guesses, "from this board, which moves look best?" (a [policy](#defn-policy) head).
One part guesses, "from this board, who is more apt to win?" (a [value](#defn-value) head).
When [AlphaZero](#defn-alphazero) has to make a move, it does not just trust its [policy](#defn-policy) head.
It pulls out a sketch pad and doodles a tree of what-ifs.
"If I do this, the kin would do that, then I could do this..."
The [policy](#defn-policy) head shapes which what-ifs to draw out (the most apt ones first).
The [value](#defn-value) head grades the boards at the leaves of the tree.
[AlphaZero](#defn-alphazero) picks the move that came out best in the doodle.
Then comes [self-play](#defn-self-play).
Two twins of [AlphaZero](#defn-alphazero) play a full game of chess, each with its own sketch pad.
Each plays its best.
At the end, one wins, one loses.
Both twins now have a heap of board-and-move pairs from the game, with the score (win or loss) at the end.
Use the games to train both heads.
The [policy](#defn-policy) head learns to like the moves the doodle picked.
The [value](#defn-value) head learns to guess the score that came at the end.
A bit later, the brain is sharper.
The doodles are sharper too, since the doodles lean on the heads.
The next round of [self-play](#defn-self-play) is sharper still.
And on, and on.
A loop where the brain and the doodle pull each one up by its own bootstraps.
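For those who read code, here is the sketch pad alone as a toy, on a small stone game (not chess). The two heads and the smart search are left out; a blank stand-in guess fills in for the value head.

```python
# A toy sketch of the doodle (search with true rules), on a small game:
# take 1, 2, or 3 stones from a pile; whoever takes the last stone wins.
def next_pile(pile, move):      # the rules: a true model, not a guess
    return pile - move

def head_guess(pile):           # a stand-in for the value head (untrained: blank)
    return 0.0

def doodle(pile, depth):
    """Grade the pile for the one who moves now: +1 near a win, -1 near a loss."""
    if pile == 0:
        return -1.0             # the last stone is gone: the one to move has lost
    if depth == 0:
        return head_guess(pile) # too deep to draw more: let the head grade it
    # Try each what-if; what is bad for the twin is good for me.
    return max(-doodle(next_pile(pile, m), depth - 1)
               for m in (1, 2, 3) if m <= pile)

def pick_move(pile):
    # Pick the move whose what-if came out best in the doodle.
    moves = [m for m in (1, 2, 3) if m <= pile]
    return max(moves, key=lambda m: -doodle(next_pile(pile, m), 4))

print(pick_move(5))  # 1: leaves the twin 4 stones, a lost spot if both play well
```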
# When folk pick, the brain learns
[AlphaZero](#defn-alphazero) had it good: a clear win-or-lose score from each game.
But what if you want a brain to write a *good* note?
There is no clear count for "good note."
You can not just write code that scores a note.
A note can be true and dull.
A note can be wrong and slick.
You need a kind of grade that no plain code can give.
So we ask folk.
By **RLHF** I mean [a way of learning where folk pick which of two replies is best, and the brain learns from those picks]{#defn-rlhf}.
Think of a blind taste test.
Two cakes on a plate.
A bunch of folk take a bite of each, then point to which one tastes best.
Do this lots of times—with lots of cake pairs and lots of folk.
You end up with a thick stack of "this cake beat that cake" picks.
Now train one more brain—a score brain—to guess which of two cakes the folk would pick.
Feed it the thick stack of picks.
Soon the score brain can stand in for the folk: hand it any two cakes, and it picks just like the crowd would.
Now you have a fake taste-test brain that scores cakes for free, no folk on the line.
The chat brain bakes notes (not cakes, but notes).
The score brain rates each note.
Use [PPO](#defn-ppo) or [GRPO](#defn-grpo) to push the chat brain to write notes the score brain rates high.
The leash from [PPO](#defn-ppo) keeps the chat brain from drifting too far from the chat brain you had at the start.
There is one trap.
If you let the chat brain push too hard, it will learn to game the score brain.
It might write a note that scores high but is in truth bad.
The leash holds it back.
Don't drift too far from the chat brain you came from.
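Here is the score brain alone as a toy sketch in code. The notes, the folk's taste, and all the counts are faked for show.

```python
import math, random

# A toy sketch of training a score brain from "this beat that" picks.
# A "note" here is just two made-up counts: (how true it is, how slick it is).
# The folk's taste is faked by a rule: they like true a lot and slick a bit.
def folk_pick(a, b):
    taste = lambda n: 2.0 * n[0] + 0.5 * n[1]
    return a if taste(a) > taste(b) else b

w = [0.0, 0.0]                                   # the score brain's two wires
score = lambda note: w[0] * note[0] + w[1] * note[1]

random.seed(0)
for _ in range(5000):
    a = (random.random(), random.random())
    b = (random.random(), random.random())
    won, lost = (a, b) if folk_pick(a, b) is a else (b, a)
    # Push the score brain to rate the picked note above the one not picked.
    p = 1.0 / (1.0 + math.exp(score(lost) - score(won)))
    for i in range(2):
        w[i] += 0.1 * (1.0 - p) * (won[i] - lost[i])
print([round(x, 1) for x in w])   # the first wire grows much bigger: true beats slick
```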
# How a chat thing gets built
By **language** I mean [words and the ways folk speak or write]{#defn-language}.
By **ChatGPT** I mean [a chat brain made by a firm, that takes in your text and writes text back]{#defn-chatgpt}.
Think of [ChatGPT](#defn-chatgpt) as a kid sent through cook school in three terms.
Term one: read all the cookbooks in the world.
The kid reads, and reads, and reads.
At the end, you can hand the kid the start of any cookbook line, and the kid can guess the next word.
"Pre-heat the..."—oven.
"Add a pinch of..."—salt.
This term does not use this kind of "learn by score" at all.
The kid just learns to guess words.
At the end of term one, the kid knows a lot of [language](#defn-language)—but does not yet know how to cook a *good* meal for the task at hand.
Term two: watch a great chef cook 100 dishes start to end.
Take notes.
Try the chef's moves on your own.
The kid takes on the chef's style.
Now if you ask the kid to cook a soup, the kid cooks a soup that looks like the chef's soup.
But the kid is still not great.
Some dishes come out flat.
Some come out off.
Some come out not safe to eat.
Term three: cook for guests.
Put two plates of soup down at the same time, two takes on the same dish.
Watch which plate the guests scrape clean.
Train a score brain to guess which plate the guests would pick.
Then use [PPO](#defn-ppo) or [GRPO](#defn-grpo) to shift the kid's style to make plates the score brain rates high.
That is [RLHF](#defn-rlhf), folded back in.
That is the whole plan.
Term one: read the world's cookbooks.
Term two: watch a chef.
Term three: cook for guests, take their picks to heart.
Most of [ChatGPT](#defn-chatgpt)'s skill comes from term one—from just reading.
The score loop in term three is more like a last polish.
It shapes the raw text brain toward help, truth, care, and a good chat style.
# What you have learned
You now know:
- what a [neuron](#defn-neuron) is, and what it means to [strengthen](#defn-strengthen) or [weaken](#defn-weaken) a wire;
- the parts of a learn-by-score task: state, move, score, [policy](#defn-policy) (rule book), [value](#defn-value) (price tag);
- the split: [value-based](#defn-value-based) or [policy-based](#defn-policy-based);
- the split: [model-free](#defn-model-free) or [model-based](#defn-model-based) (doll house or no doll house);
- the split: [on-policy](#defn-on-policy) or [off-policy](#defn-off-policy) (fresh games or a box of old tapes);
- one [tabular](#defn-tabular), [model-free](#defn-model-free), [off-policy](#defn-off-policy), [value-based](#defn-value-based) way: [Q-learning](#defn-q-learning), with a row of wood boxes and stones;
- one [model-free](#defn-model-free), [on-policy](#defn-on-policy), [policy-based](#defn-policy-based) way that uses brains: [evolution](#defn-evolution), with 100 twins, a maze race, and a chop;
- a sharper way: [REINFORCE](#defn-reinforce), with a jar of stones for each [neuron](#defn-neuron);
- the [actor](#defn-actor)–[critic](#defn-critic) trick, with a TV-style judge who calls the play, and an [advantage](#defn-advantage) score;
- [PPO](#defn-ppo), with a short leash from new brain to old;
- [GRPO](#defn-grpo), with 8 darts at a board and the mean of the 8 as the call;
- [self-play](#defn-self-play) and [AlphaZero](#defn-alphazero), with a kid playing both sides of chess and a sketch pad of what-ifs;
- [RLHF](#defn-rlhf), with a blind taste test, picks from folk, and a fake taste-test brain;
- and how [ChatGPT](#defn-chatgpt) is built up of all these parts—a kid through cook school in three terms.
That is the whole field of "learn by score" in one read—with no big words save the ones we grew on the way, and most of the props you'd need on your kitchen floor.
# GPT-5.5 Pro
# Reinforcement Learning for Grade-Schoolers
## Read this first
By **reinforcement learning** I mean [a way for a brain to learn from score]{#defn-reinforcement-learning}.
This is a text on how a brain can learn to walk a maze, win a game, or chat with you.
It learns not just by being told rules, but by a loop:
try, score, change.
I will not start with a long list of new words.
A long word will come up when I need it.
The first time it does, I will mark it in bold and tell you what I mean.
Then it is mine to use as I please.
Read this with a school room in mind.
A black board is up front.
A piece of chalk is on the rail.
A row of jars, a heap of stones, a short leash, a sketch pad, a chess board, and 8 cakes sit near by.
You could build most of this on the floor.
## A brain made of small parts
By **neuron** I mean [a small part of a brain that takes in shocks from wires and sends out shocks too]{#defn-neuron}.
A brain is made of [neurons](#defn-neuron).
Each [neuron](#defn-neuron) gets shocks from the rest through wires.
If those shocks add up past a high mark, the [neuron](#defn-neuron) fires and sends out a shock too.
We change how a brain thinks by changing how strong each wire is.
We can change the high mark too.
Then one [neuron](#defn-neuron) may fire more, and one may fire less.
That is what folk mean when they say, "the brain learns."
Two key words help.
By **strengthen** I mean [to make a thing more strong]{#defn-strengthen}.
By **weaken** I mean [to make a thing more weak]{#defn-weaken}.
When the brain does well, we may [strengthen](#defn-strengthen) the wires that helped.
When it does ill, we may [weaken](#defn-weaken) them.
Once you can flip wires from strong to weak and back, you can make a brain learn.
Here is the prop for wires.
Think of each wire as a small jar full of stones.
The more stones, the more strong the wire.
To [strengthen](#defn-strengthen) the wire, drop a stone in.
To [weaken](#defn-weaken) it, take a stone out.
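If you read code, here is one toy neuron. All the counts are made up.

```python
# A toy neuron. All counts made up.
def neuron(shocks_in, wires, high_mark=1.0):
    # Each shock comes in on a wire; a strong wire passes more of it on.
    total = sum(shock * wire for shock, wire in zip(shocks_in, wires))
    return 1 if total > high_mark else 0      # past the high mark: fire

wires = [0.5, 0.9, 0.2]            # three jars of stones, read as counts
print(neuron([1, 1, 0], wires))    # total 1.4 > 1.0, so it fires: 1
wires[1] -= 0.6                    # pluck stones out: weaken the wire
print(neuron([1, 1, 0], wires))    # total 0.8 < 1.0, so it holds back: 0
```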
## Try, score, change
A brain learns to do well at a task by try and miss.
At each step, the brain looks at where it stands, picks a move, and the world hands back a score.
High score for a good move.
Low score, or no score, for a bad one.
The brain wants the most score it can get, on the whole, in the long run.
Think of a dog and a treat.
You do not tell the dog all the rules.
You just give it a treat when it sits.
Soon, the dog sits to get the treat.
A bot is the same: hand it score, and it learns to do more of what got score.
Or think of a small bot in a maze.
It steps left and hits a wall.
Bad score.
It steps right and finds cheese.
Good score.
At first, the bot knows no path.
It has to try, get score, and change.
The hard part is thanks and blame.
The score for a move can show up long past the move that earned it.
The bot might step left, then up, then up, then right, and then find cheese.
Which step earned the cheese?
The last step?
A step 10 steps back?
Each step in the chain wants its share.
But the world hands back just one score at the end.
So the brain has to spread that score back through the chain.
This is the core hard task: hand thanks and blame to the right past moves.
Each way to learn that we will see is a way to crack this.
None of them are quite right.
On the whole, they work.
## Try new things or use old ones
At each step, the bot has a choice.
It can use the move that has worked best so far.
Or it can try a new move, in case the new move is in fact more strong.
Try too much, and the bot wastes time on flops.
Use just the best move, and the bot may be stuck with what it knows now, while a much better path lies one new try past it.
You do this too.
If you find one ice cream you like, do you pick that one each time?
Or do you now and then try a new one, and risk that it is bad?
The bot has the same choice.
Most ways to learn fudge this with chance.
Most of the time, pick the best known move.
Now and then, pick a move at chance.
As the brain gets sharp, drop the chance of wild tries.
The blunt knife works.
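Here is the choice rule as a toy sketch in code. The learning part is left out; it just shows the blunt knife.

```python
import random

# A toy sketch of the choice rule alone (the learning part is left out).
moves = ["left", "right", "up", "down"]
best_known = "right"                     # pretend this is the best move so far
wild_chance = 1.0                        # start out full of wild tries

random.seed(0)
wild, safe = 0, 0
for step in range(1000):
    if random.random() < wild_chance:
        move = random.choice(moves)      # now and then: a move at chance
        wild += 1
    else:
        move = best_known                # most of the time: the best known move
        safe += 1
    wild_chance = max(0.05, wild_chance * 0.99)  # drop the wild tries with time
print(wild, safe)   # early steps are mostly wild; late steps are mostly safe
```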
## State, move, score, plan
A state is what the brain sees right now: the spot in the maze, the board in a game, the words on a chat screen.
A move is what the brain does next: step left, push a pawn, write a word.
A score is the count the world hands back.
A plan is what the brain does in each kind of state.
By **policy** I mean [a plan for what move to make in each kind of case]{#defn-policy}.
Think of a [policy](#defn-policy) as a rule book.
One page per kind of state.
Each page says, "if you see this, do that."
Some pages are firm: "do this one move."
Some are loose: "most of the time go left, but with some chance go right."
By **value** I mean [a guess at how much score will come from a state or a move]{#defn-value}.
Think of a [value](#defn-value) as a price tag.
It may sit on a state: "this place looks good."
It may sit on a move: "this move looks good from here."
At first, the price tags are blank.
The bot has no clue what each move is worth.
As it plays, the price tags fill in.
A high price tag means good.
A low price tag means bad.
One small point on price tags.
A score now is worth more than a score far down the road.
The far road might not come.
The bot might fall in a pit.
So the price tag for a move is a sum: score now, plus some score from what comes next, plus less score from what comes past that, and so on.
The more far off the score, the less it weighs.
That is why a [value](#defn-value) is a guess at score, not just the score seen now.
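Here is the price tag sum as a toy sketch in code, with made-up scores.

```python
# A toy sketch of a price tag. A score now weighs more than a score later.
def price_tag(scores, fade=0.9):
    total, weight = 0.0, 1.0
    for s in scores:            # one score per step, from near to far
        total += weight * s
        weight *= fade          # each step more far off weighs less
    return total

print(price_tag([0, 0, 0, 10]))   # 7.29: the cheese is three steps off
print(price_tag([10]))            # 10.0: cheese right now, at full price
```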
## Two ways to learn
There are two large ways to teach a brain.
By **value-based** I mean [a way that learns values first, then picks moves from them]{#defn-value-based}.
In a [value-based](#defn-value-based) way, the brain fills in price tags first.
Then to act, it looks at the moves it could make, finds the move with the best [value](#defn-value), and picks that one.
The [policy](#defn-policy) is built on top of the price tags.
By **policy-based** I mean [a way that changes the policy, not just values]{#defn-policy-based}.
In a [policy-based](#defn-policy-based) way, the brain learns or shifts the rule book right off.
It may still use [values](#defn-value) to help train.
But the thing we push on is the [policy](#defn-policy): make this move more apt, make that move less apt.
[Value-based](#defn-value-based) ways tend to work well on small tasks where the brain can rate each move with care.
[Policy-based](#defn-policy-based) ways tend to work well on huge tasks, such as a chat brain that writes the next word, where there are far more moves than we could rate one by one.
## A grid of Q
Now we can build our first way to learn, with no big brain at all: just a black board and chalk.
By **table** I mean [a grid of rows and cells]{#defn-table}.
Draw a grid on the board.
Down the left side, write each spot in the maze, one per row.
On the top, write each move the bot can make: left, right, up, down.
In each cell, write a count: the price tag for that move from that spot.
At the start, write 0 in each cell.
By **Q-learning** I mean [a way that fills a table with value guesses through play]{#defn-q-learning}.
[Q-learning](#defn-q-learning) goes like this.
The bot lands on a spot.
It looks at the row for that spot.
It scans the row, finds the cell with the best count, and walks that way.
Most of the time it picks the best cell; now and then it picks at chance, just to try.
It does the move, gets score, and lands on a new spot.
Then it fixes the cell it just used.
Look at the new spot's row.
What is the best count in that row?
Add the score to that.
Now blend the cell's old count with this new one: rub out a bit of the old, write a bit of the new in its place.
Do this for a long stretch of play, and the counts on the board fill in to the right [values](#defn-value).
Then the bot just looks at its row, picks the best cell, and plays.
By **tabular** I mean [made with a table, not with a net]{#defn-tabular}.
This is [tabular](#defn-tabular) [Q-learning](#defn-q-learning).
There is no brain of [neurons](#defn-neuron) here at all.
Just a board and chalk.
You have built a thing that learns to play a maze with what is in your school room.
Try not to brag.
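For those who read code, here is the whole loop as a toy sketch, on a small maze of five spots in a line. All the counts are made up.

```python
import random

# A toy maze: spots 0..4 in a line, cheese at spot 4. Moves: left, right.
# A small sketch of the whole Q-learning loop from this part.
Q = [[0.0, 0.0] for _ in range(5)]            # one row per spot, one cell per move

random.seed(0)
for game in range(500):
    spot = 0
    while spot != 4:
        if random.random() < 0.1:             # now and then, pick at chance
            move = random.randrange(2)
        else:                                 # else, the best cell in the row
            move = 0 if Q[spot][0] > Q[spot][1] else 1
        new_spot = max(0, spot - 1) if move == 0 else spot + 1
        score = 1.0 if new_spot == 4 else 0.0   # cheese!
        best_next = max(Q[new_spot])            # best count in the new row
        # Rub out a bit of the old, write a bit of the new in its place.
        Q[spot][move] = 0.9 * Q[spot][move] + 0.1 * (score + 0.9 * best_next)
        spot = new_spot

print([round(max(row), 2) for row in Q])  # the counts climb as you near the cheese
```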
## When the board is too big
[Tabular](#defn-tabular) [Q-learning](#defn-q-learning) works well for small worlds.
For huge worlds, the board blows up.
Think of the world a chess brain sees.
A board has 64 squares.
Each square holds one of 13 things: 6 kinds of white piece, 6 of black, or a blank.
The count of board states is vast: more than grains of sand on Earth, more than stars in the sky.
The black board would have to be wide past your school, your town, and the state you live in.
A chat brain has it worse: there is one state per chunk of text it might see, and there is no end to chunks of text.
So we ditch the board.
In its place: a brain of [neurons](#defn-neuron).
The brain takes in the state, such as the board or the words, and spits out the counts as if it had seen the right row.
It is a brain that guesses what the chalk would have said, had we drawn that row.
By **deep learning** I mean [a way to train a large net made of neurons]{#defn-deep-learning}.
This is the deep part of [deep learning](#defn-deep-learning):
trade the board for a brain of [neurons](#defn-neuron).
The rule of [Q-learning](#defn-q-learning) stays much the same.
The brain stands in for the chalk.
It can make a fair guess for a state it has not seen, since kin states tend to have kin [values](#defn-value).
A black board could not do that: chalk in one row does not tell you what to write in the next.
The same swap works for the rule book.
A small rule book fits on a board.
A chat brain's rule book has a chance for each next word, and the word set is far too large to draw.
So the rule book can be a brain of [neurons](#defn-neuron) too.
It takes in the state and spits out a chance for each move.
The rest of this text works if you have a board or a brain of [neurons](#defn-neuron).
The prop shifts; the loop stays.
## Map or no map
By **model** I mean [a map in the brain of how the world works]{#defn-model}.
Think of a [model](#defn-model) like this.
The bot has, in its head, a small doll house of the maze.
In the doll house lives a small doll bot.
The real bot can poke the doll bot through the doll house and see what comes next.
"If I send the doll left, then up, where does it land?"
The doll house is the [model](#defn-model).
With it, the bot can plan a string of moves, then move a foot for real.
A bot with no [model](#defn-model) just plays in the real maze.
No doll house.
No plan in its head.
It steps, gets score, and learns.
By **model-free** I mean [a way that learns with no model]{#defn-model-free}.
By **model-based** I mean [a way that learns with a model]{#defn-model-based}.
[Model-based](#defn-model-based) ways can learn from less play, since the brain can plan in its doll house and squeeze more out of each run.
But the doll house might be wrong.
The doll house could say, "step left and you find cheese," while the real maze says, "step left and you hit a wall."
Then the brain plans well in the doll house and fails in the real maze.
[Model-free](#defn-model-free) is slow but safe: no wrong doll house to fool you.
[Tabular](#defn-tabular) [Q-learning](#defn-q-learning) is [model-free](#defn-model-free).
The chalk on the board holds price tags for moves, not a doll house of the maze.
## Whose play counts?
Now a fine point.
The brain is making moves, and it is learning.
But is it learning from its own moves right now, or from old game tapes?
By **on-policy** I mean [a way that learns from the policy it has now]{#defn-on-policy}.
By **off-policy** I mean [a way that can learn from old play or a past policy]{#defn-off-policy}.
If you are [on-policy](#defn-on-policy), you learn from fresh games.
Once you change your [policy](#defn-policy), the old games may not fit.
They came from the old [policy](#defn-policy), not the new one.
You have to play more.
If you are [off-policy](#defn-off-policy), you keep a stack of old game tapes on your shelf.
Some are tapes of you last week.
Some are tapes of kin bots.
You re-watch tapes from the stack and learn from them too.
This saves fresh play.
But [off-policy](#defn-off-policy) is a fair bit harder to make work right.
Old play can point the new [policy](#defn-policy) at the wrong thing.
[Tabular](#defn-tabular) [Q-learning](#defn-q-learning) is [off-policy](#defn-off-policy).
Why?
When it fixes a cell, it blends in the best count from the next row, not the count of the cell the bot would in fact pick next.
It learns the [value](#defn-value) of the best [policy](#defn-policy), not just the [policy](#defn-policy) it used to get that play.
That is what makes it [off-policy](#defn-off-policy).
We now have 3 splits to track:
price tag or rule book;
doll house or no doll house;
fresh games or old tapes.
Each way to learn can be tagged on all 3.
## Shake and pick
How do we make a brain, not a board, learn?
Here is one way that uses no math.
By **evolution** I mean [change of a kind through births, shakes, wins, and deaths]{#defn-evolution}.
By **strategy** I mean [a plan for how to win]{#defn-strategy}.
By **evolution strategy** I mean [a strategy that trains a brain by small shakes and scores]{#defn-evolution-strategy}.
An [evolution strategy](#defn-evolution-strategy) goes like this.
Take a brain.
Make 100 twins.
For each twin, give it a small shake: a few wires shift up, a few shift down, by chance.
Now race all 100 brains through the maze.
Each twin gets a run time or a score.
Drop the 90 worst.
Keep the 10 best.
From those 10, make 100 new twins.
Shake each one.
Race them through the maze.
Drop the 90 worst.
Keep the 10 best.
Do it once more.
And once more.
Soon, the brains run the maze fast.
This is grim if you think too hard.
But the brains do not feel a thing.
An [evolution strategy](#defn-evolution-strategy) is [model-free](#defn-model-free): no doll house.
It is [policy-based](#defn-policy-based): the rule book brain is what gets a shake.
It is [on-policy](#defn-on-policy) in the plain sense: each score comes from the brain just tried.
What is the good of an [evolution strategy](#defn-evolution-strategy)?
It is plain.
It works on huge brains.
You can race all 100 twins at the same time on lots of chips, since each twin runs on its own.
What is bad?
It wastes play.
A twin has to run a whole maze just to get one score.
One bad slip in a long run can make a great brain look bad.
Worst of all, it is blind to thanks and blame: it pays no heed to which moves in the run helped or hurt.
The score is a big lump dropped on the brain as a whole.
The next ways are more sharp.
## One brain, lots of runs
By **REINFORCE** I mean [a rule that makes moves in high score runs more apt and moves in low score runs less apt]{#defn-reinforce}.
Set a shelf in front of you.
On the shelf is a jar for each wire in the brain.
The count of stones in a jar tells you how strong that wire is.
A jar with lots of stones: a strong wire.
A jar near dry: a weak wire.
[REINFORCE](#defn-reinforce) goes like this.
One brain plays a whole run.
At each step, it picks a move with some chance.
Note each move and which wires helped lead to it.
At the end of the run, look at the score.
If the score is high, walk down the shelf and drop stones in jars for the wires that helped make the moves.
If the score is low, walk down the shelf and pluck stones out of those jars.
Then play more runs and do it once more.
This is not fair to the wires in one run.
A wire that helped make a smart move in a lost run still loses a stone.
A wire that helped make a dumb move in a won run still gains a stone.
The jar will not know why.
The jar need not know.
With lots of runs, the noise tends to wash out.
On the whole, wires that help win pile up stones.
Wires that help lose run dry.
The brain gets good.
[REINFORCE](#defn-reinforce) is [policy-based](#defn-policy-based), [on-policy](#defn-on-policy), and [model-free](#defn-model-free).
It is more sharp than an [evolution strategy](#defn-evolution-strategy), since it changes the parts that helped make the moves.
But it is still loud.
A whole run gives one blunt grade to lots of small moves.
## A judge on the side
[REINFORCE](#defn-reinforce) is loud.
"High score" is a loud grade for each move.
We can get a more sharp grade.
By **actor** I mean [the brain that picks moves]{#defn-actor}.
By **critic** I mean [a side brain that tries to guess the value]{#defn-critic}.
Set up two brains now.
The [actor](#defn-actor) plays the game.
It has the rule book.
The [critic](#defn-critic) sits on the side and calls the play at the start.
At each spot, the [critic](#defn-critic) shouts a guess:
"from here, you will score 5 treats."
The [critic](#defn-critic) is graded by how close its calls come to the truth.
A wrong call costs it.
A right call pays it.
By **advantage** I mean [how much a run beat the critic's guess]{#defn-advantage}.
Now we change [REINFORCE](#defn-reinforce).
The [actor](#defn-actor) makes a move.
The [critic](#defn-critic) called 5 for that spot.
The run in fact ends with a score of 8.
That beat the call by 3, so the [advantage](#defn-advantage) is +3.
Drop stones in the jars that helped make that move.
If the run ends with a score of 2, that is short of the call by 3, so the [advantage](#defn-advantage) is -3.
Pluck stones out of those jars.
Now we are not paying wires for "we won."
We are paying them for "we did better than the [critic](#defn-critic) thought."
That is a more sharp grade, and the brain learns much more fast.
The [critic](#defn-critic) is just a [value](#defn-value) brain: a brain of [neurons](#defn-neuron) that guesses the price tag for a state.
So this trick blends [policy-based](#defn-policy-based) and [value-based](#defn-value-based) in one.
The [actor](#defn-actor) is the rule book.
The [critic](#defn-critic) is the price tags.
## Small steps
[REINFORCE](#defn-reinforce) and the [actor](#defn-actor)-[critic](#defn-critic) trick have a sore spot.
When you change the brain a lot in one step, you can break it.
A brain that was great might, with one big push, lurch off and lose for a long stretch.
Worse, it may lose skills it once had, such as how to walk in a straight line.
By **PPO** I mean [a rule that clips the push so a new policy stays near an old one]{#defn-ppo}.
Think of [PPO](#defn-ppo) as a short leash, but tie the leash to move chance, not to each jar.
Keep a note of how apt the old [policy](#defn-policy) was to pick each move it in fact picked.
Train the new [policy](#defn-policy) on those same moves.
If the new [policy](#defn-policy) tries to make a good move far more apt, [PPO](#defn-ppo) clips the push.
If it tries to make a bad move far less apt, [PPO](#defn-ppo) clips that push too.
The law is not "no wire may move far."
The law is "do not make the picked move far too much more or less apt in one step."
So the brain learns by lots of small steps, not by big leaps.
Good moves get more chance, but not too much at once.
Bad moves get less chance, but not too much at once.
This helps keep the brain from a wreck of what it knew.
[PPO](#defn-ppo) is [policy-based](#defn-policy-based) with a [critic](#defn-critic) to cut noise.
It is [model-free](#defn-model-free).
It is close to [on-policy](#defn-on-policy), since it wants fresh runs from the old [policy](#defn-policy) for each round.
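Here is the clip rule as a toy sketch in code, with made-up counts. It shows the graded gain [PPO](#defn-ppo) pushes up: once a move's chance drifts past the leash, the gain goes flat, and more drift earns nothing.

```python
# A toy sketch of the PPO clip on one logged move. All counts made up.
def ppo_gain(p_new, p_old, advantage, leash=0.2):
    """The graded gain PPO pushes up. Once the move's chance drifts
    past the leash, the gain goes flat: more drift earns nothing."""
    ratio = p_new / p_old                           # drift in the move's chance
    clipped = min(max(ratio, 1 - leash), 1 + leash)
    return min(ratio * advantage, clipped * advantage)

# A good move: past ratio 1.2 the gain stops growing, so the push stops.
print(ppo_gain(0.36, 0.3, +1.0), ppo_gain(0.9, 0.3, +1.0))    # 1.2 1.2
# A bad move: past ratio 0.8 the gain stops falling, so the push-down stops.
print(ppo_gain(0.24, 0.3, -1.0), ppo_gain(0.06, 0.3, -1.0))   # -0.8 -0.8
```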
## The group sets the base
[PPO](#defn-ppo) needs the [critic](#defn-critic) brain.
That is one more brain to train.
It can be hard to make a [critic](#defn-critic) for a chat task.
You ask it, "how good will this half-written note be once it is done?"
The [critic](#defn-critic) shrugs.
By **reply** I mean [text sent back to a prompt]{#defn-reply}.
By **GRPO** I mean [a rule that scores a group of replies and uses the group mean as the base]{#defn-grpo}.
Think of [GRPO](#defn-grpo) as 8 darts at a board.
For each task, the brain throws 8 tries: 8 [replies](#defn-reply), or 8 runs, or 8 ways to solve the same thing.
Each one gets a score.
Find the mean of the 8 scores.
That mean is the base.
Each try that beats the base gets pushed up.
Each try that falls short gets pushed down.
And you still keep the short leash, the [PPO](#defn-ppo) way.
The point: no [critic](#defn-critic) brain.
The group is the [critic](#defn-critic).
The mean of the group is the call.
Less work, less stuff to train, less to break.
[GRPO](#defn-grpo) is a good fit when the score is sharp.
For math, did the work come out right: yes or no?
For code, did the test pass: yes or no?
For chat, a score brain can grade each [reply](#defn-reply), then [GRPO](#defn-grpo) can push up the [replies](#defn-reply) that beat the group.
## A game brain with a map
Now for a way that ties lots of parts in one place.
This one is for game brains.
By **self-play** I mean [play where a brain plays a twin of its own self]{#defn-self-play}.
By **AlphaZero** I mean [a game brain that used self-play, search, policy, value, and a model]{#defn-alphazero}.
Think of a kid in a room with a chess board.
The kid plays both sides.
White makes a move.
Then the kid spins the board, sits in the black seat, and makes a move for black.
Spin and switch, spin and switch, all on their own.
A few years of this, and the kid is a beast at chess.
That is the feel of [AlphaZero](#defn-alphazero).
It plays both sides of board games.
The raw names of the games were "chess, shogi, and Go".
It got great at all 3 with no help from folk who knew the games.
It did not read books on chess.
It did not watch tapes of grand chess play.
It just learned from games it played on its own.
Here is the trick.
The rules of chess are known.
If you have a board and a move, the rules tell you the next board with no doubt.
So the rules of the game are a [model](#defn-model): a true one, not a guess.
[AlphaZero](#defn-alphazero) is [model-based](#defn-model-based) at move time, since it plans with the rules.
The [AlphaZero](#defn-alphazero) brain has two heads.
One head gives a [policy](#defn-policy): from this board, which moves look best?
One head gives a [value](#defn-value): from this board, who is more apt to win?
When [AlphaZero](#defn-alphazero) has to move, it does not just trust its [policy](#defn-policy) head.
It pulls out a sketch pad and draws a tree of what-ifs.
"If I do this, the twin would do that, then I could do this..."
The [policy](#defn-policy) head shapes which what-ifs to draw first.
The [value](#defn-value) head grades the boards at the leaves of the tree.
[AlphaZero](#defn-alphazero) picks the move that came out best in the sketch.
Then comes [self-play](#defn-self-play).
Two twins of [AlphaZero](#defn-alphazero) play a full game, each with its own sketch pad.
At the end, one wins and one loses.
The game gives a heap of board-and-move pairs, with a win or loss at the end.
Use the game to train both heads.
The [policy](#defn-policy) head learns to like the moves the sketch picked.
The [value](#defn-value) head learns to guess the score that came at the end.
A bit down the road, the brain is more sharp.
The sketch is more sharp too, since the sketch leans on the heads.
The next round of [self-play](#defn-self-play) is more sharp still.
Each helps the next get strong.
No folk rank its moves.
It learns from win and loss.
It is [model-based](#defn-model-based) at move time.
It is both [policy-based](#defn-policy-based) and [value-based](#defn-value-based), since it learns both a [policy](#defn-policy) head and a [value](#defn-value) head.
It is close to [on-policy](#defn-on-policy), since it learns from fresh games made by its own self.
## When folk pick
[AlphaZero](#defn-alphazero) had it good: a clear win-or-loss score from each game.
But what if you want a brain to write a good note?
There is no clear count for "good note."
A note can be true and dull.
A note can be wrong and slick.
A note can be sharp on the math but a pain to read.
You need a kind of grade that plain code can not give.
So we ask folk.
By **human** I mean [one of us, not a bot]{#defn-human}.
By **feedback** I mean [a mark or word from a human that says good or bad]{#defn-feedback}.
By **RLHF** I mean [a way to train a bot with human feedback]{#defn-rlhf}.
Think of a blind taste test.
Two cakes on a plate.
A bunch of folk take a bite of each, then point to which one tastes best.
Do this lots of times, with lots of cake pairs and lots of folk.
You end up with a thick stack of picks: this cake beat that cake.
Now switch from cakes to chat.
A [human](#defn-human) sees two [replies](#defn-reply) to the same prompt and picks the one they like more.
Do this lots of times.
Now we have a stack of picks: this [reply](#defn-reply) beat that [reply](#defn-reply).
By **reward model** I mean [a model that gives a score to a reply]{#defn-reward-model}.
Train a [reward model](#defn-reward-model) on the stack of picks.
Its job is to guess which [reply](#defn-reply) folk would pick.
Soon the [reward model](#defn-reward-model) can stand in for the folk.
The folk go home.
The score brain stays.
The chat brain writes [replies](#defn-reply).
The [reward model](#defn-reward-model) rates each [reply](#defn-reply).
Use [PPO](#defn-ppo) or [GRPO](#defn-grpo) to push the chat brain to write [replies](#defn-reply) the [reward model](#defn-reward-model) rates high.
The [PPO](#defn-ppo) leash keeps the chat brain from drifting too far from the chat brain you had at the start.
There is one trap.
If you let the chat brain push too hard, it can learn to game the [reward model](#defn-reward-model).
It might write a [reply](#defn-reply) that scores high but is in truth bad: full of fluff that the [reward model](#defn-reward-model) falls for.
The leash holds it back.
Do well by the score brain, but do not drift too far from the chat brain you came from.
[RLHF](#defn-rlhf) does not make the bot feel joy or pain.
Score is just a count used to change wires.
## How a chat thing gets built
By **language** I mean [words and ways folk speak or write]{#defn-language}.
By **ChatGPT** I mean [a chat bot that reads your words and writes words back]{#defn-chatgpt}.
Think of [ChatGPT](#defn-chatgpt) as a kid sent through cook school in 3 terms.
Term 1: read all the cook books in the world.
The kid reads, and reads, and reads.
At the end, you can hand the kid the start of a cook book line, and the kid can guess the next word.
This term does not use [reinforcement learning](#defn-reinforcement-learning).
It is [deep learning](#defn-deep-learning) on text.
The brain learns to guess words.
At the end of term 1, the kid knows a lot of [language](#defn-language), but does not yet know how to make a good meal for the task at hand.
Cook books teach words.
They do not teach taste.
Term 2: watch a great chef cook 100 dishes from start to end.
Take notes.
Try the chef's moves on your own.
The kid takes on the chef's style.
Now if you ask the kid to cook soup, the kid cooks soup that looks like the chef's soup.
But the kid is still not great.
Some food comes out flat.
Some comes out off.
Some is not safe to eat.
Term 3: cook for guests.
Put two plates of soup down at the same time: two takes on the same dish.
Watch which plate the guests scrape clean.
Train a [reward model](#defn-reward-model) to guess which plate the guests would pick.
Then use [PPO](#defn-ppo) or [GRPO](#defn-grpo) to shift the kid's style toward plates the [reward model](#defn-reward-model) rates high.
That is [RLHF](#defn-rlhf), folded back in.
That is the plan.
Term 1: read the world's cook books.
Term 2: watch a chef.
Term 3: cook for guests, and take their picks to heart.
Most of [ChatGPT](#defn-chatgpt)'s skill comes from term 1, from just text.
The score loop in term 3 is more like a last tune.
It shapes the raw text brain toward help, truth, care, and a good chat style.
## What to keep
You now know the core parts:
state, move, score, [policy](#defn-policy), and [value](#defn-value).
You know the core hard task:
which past move gets thanks or blame?
You know 3 splits:
[Value-based](#defn-value-based) or [policy-based](#defn-policy-based): price tags or rule book.
[Model-free](#defn-model-free) or [model-based](#defn-model-based): no doll house or doll house.
[On-policy](#defn-on-policy) or [off-policy](#defn-off-policy): fresh games or old tapes.
Here is the map:
| way | rule book or tags | map? | fresh or tapes |
|---|---|---|---|
| [Q-learning](#defn-q-learning) | tags | no | tapes |
| [evolution strategy](#defn-evolution-strategy) | rule book | no | fresh |
| [REINFORCE](#defn-reinforce) | rule book | no | fresh |
| [actor](#defn-actor)-[critic](#defn-critic) | both | no | fresh |
| [PPO](#defn-ppo) | both | no | fresh |
| [GRPO](#defn-grpo) | rule book | no | fresh |
| [AlphaZero](#defn-alphazero) | both | yes | fresh |
| [RLHF](#defn-rlhf) | both | no | fresh |
[Q-learning](#defn-q-learning) fills a [table](#defn-table) of [values](#defn-value).
An [evolution strategy](#defn-evolution-strategy) shakes whole brains and keeps the best.
[REINFORCE](#defn-reinforce) makes moves in high score runs more apt.
The [actor](#defn-actor)-[critic](#defn-critic) trick pays for [advantage](#defn-advantage), not raw score.
[PPO](#defn-ppo) clips the push, so the new [policy](#defn-policy) stays near the old one.
[GRPO](#defn-grpo) uses a group mean in place of a [critic](#defn-critic).
[AlphaZero](#defn-alphazero) uses [self-play](#defn-self-play), search, a [policy](#defn-policy), a [value](#defn-value), and a [model](#defn-model).
[RLHF](#defn-rlhf) uses [human](#defn-human) [feedback](#defn-feedback) to train a [reward model](#defn-reward-model), then trains the chat [policy](#defn-policy).
That is [reinforcement learning](#defn-reinforcement-learning):
try, score, change, and try once more.
# DeepSeek-v4
## **Reinforcement Learning** for Grade-Schoolers
### 1. What is a Brain and Learning?
A brain is made of small parts called **neurons** (By **neuron** I mean a tiny part of a brain that fires a zap). A neuron shoots zaps to other neurons through wires. When a neuron gets enough zaps at once, it shoots a zap too.
How does a neuron know what “enough” means? It can learn. Learning happens by trial and error. The brain gets a “good” sign when it does a good thing. It gets a “bad” sign when it does a bad thing. Then the brain changes its neurons to get more good signs and fewer bad signs.
This whole way of learning is called **reinforcement learning** (By **reinforcement learning** I mean a way for a brain to get better by getting “good” and “bad” signs). In this text I will show you many kinds of [reinforcement learning](#defn-reinforcement-learning).
### 2. Trial and Error—Evolution Style
One way to learn is to **evolve** (By **evolve** I mean to change slowly over time by keeping what works). You make many **copies** (By **copy** I mean an exact same thing made from another thing) of one brain. Then you **wiggle** (By **wiggle** I mean to change just a little bit) each copy—you change a few neurons up or down.
Then each brain plays a **game** (a play with a win or a loss). You check who wins (By **win** I mean to do well and get a “good” sign) and who loses (By **loss** I mean to do poorly and get a “bad” sign). You keep only the brains that won. You throw away the ones that lost.
Then you make more copies of the winners, wiggle them again, and play the [game](#game) again. Do this many times. The brains slowly get better. This way of learning is called **evolution strategies**. It is “model-free” (we will learn that word later) and “on-policy” (also later).
### 3. Trial and Error—REINFORCE Style
A second way is called **REINFORCE** (By **REINFORCE** I mean a rule: after a win, strengthen all neurons that fired; after a loss, weaken them). One brain plays many [games](#game). In each [game](#game) you write down which neurons fired (shot zaps).
If the brain **wins** ([win](#win)), you strengthen all those neurons—you make them need fewer zaps to fire. If the brain **loses** ([loss](#loss)), you weaken all those neurons—you make them need more zaps to fire.
This sounds bad, because a neuron might have done a bad thing in a [win](#win) or a good thing in a [loss](#loss). But on average, over many [games](#game), good neurons get strengthened more often and bad neurons get weakened more often. It works.
[REINFORCE](#reinforce) is also “model-free” and “on-policy”.
### 4. Making REINFORCE Smarter—The Critic
We can make [REINFORCE](#reinforce) less noisy by adding a second brain. Call the first brain the **actor** (By **actor** I mean the first brain that plays the game). Call the second brain the **critic** (By **critic** I mean a second brain that watches and guesses if the first brain will win).
The [critic](#critic) watches the [actor](#actor) play many [games](#game). The [critic](#critic) is trained to guess: “Will the [actor](#actor) win or lose?” It gets a “good” sign when it guesses right, a “bad” sign when it guesses wrong. After a while, the [critic](#critic) becomes a good guesser.
Now change the rule. Do not strengthen all neurons that fired in a [win](#win). Instead, strengthen them only when the [actor](#actor) **wins** and the [critic](#critic) guessed it would lose. That is **better than expected** (By **better than expected** I mean when the actor wins and the critic guessed it would lose). Also weaken the neurons only when the [actor](#actor) **loses** and the [critic](#critic) guessed it would win—that is “worse than expected”. This way the [actor](#actor) learns only from surprises. This is called **actor-critic**.
### 5. Don’t Change Too Fast—PPO / GRPO
There is a problem with [REINFORCE](#reinforce) and actor-critic. Sometimes the [actor](#actor) changes too much in one step and forgets how to play. To fix this, we add a speed limit.
**PPO** (By **PPO** I mean a rule that changes the actor only a small step at a time) stands for “Proximal Policy Optimization”. In [PPO](#ppo), after each [game](#game) you compute how much more or less likely you would like each move to become. Then you clamp that change—you say “no move may become much more or much less likely in one step”. Then you change the [actor](#actor) only that small, clamped step.
**GRPO** is a cousin of [PPO](#ppo). The “G” stands for “Group”. In GRPO there is no [critic](#critic) at all. For each task, the [actor](#actor) makes a group of tries (say 8). Each try gets a score, and each try is compared to the average score of the group: tries above the average get strengthened, tries below get weakened. This works well for training large brains.
### 6. A Different Trick—The Big Table (Q-learning)
Now forget about neurons for a moment. Instead, imagine a big **table** (By **table** I mean a chart with rows and columns). The rows are the places where your brain can be. The columns are the moves your brain can make. Each cell in the table holds a number that says “how good is this move in this place?”
This is called **Q-learning** (By **Q-learning** I mean a way to fill a big table that tells how good each move is in each place). We start with all cells set to zero. Then we play a [game](#game). At each step, we look at the current place, choose the move with the highest number (sometimes we try a random move instead—that is called “exploring”).
After we make a move and get a reward (a “good” or “bad” sign), we update the number in the table with this rule:
> new number = old number + a little bit of ( reward + best future number − old number )
That math makes the table slowly learn the true value of each move. The nice thing about [Q-learning](#q-learning) is that it can learn from watching other brains play, or from old games. That makes it **off-policy**.
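Here is that rule as a tiny bit of code. All the numbers are made up, and the real rule also fades the future number a bit before adding it:

```python
# The update rule from above, as code. All numbers made up.
old_number, reward, best_future_number = 0.0, 1.0, 0.0
a_little_bit = 0.1
new_number = old_number + a_little_bit * (reward + best_future_number - old_number)
print(new_number)   # 0.1: the cell creeps toward the true value
```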
### 7. On-policy vs Off-policy
**On-policy** (By **on-policy** I mean learning only from moves you make yourself) means the brain learns only from its own games. [REINFORCE](#reinforce), [PPO](#ppo), and actor-critic are on-policy.
**Off-policy** (By **off-policy** I mean learning also from moves you watch others do or from old moves) means the brain can learn by watching a different brain, or by remembering a [game](#game) it played long ago. [Q-learning](#q-learning) is off-policy.
Think of it this way: On-policy is like learning to cook only by tasting your own food. Off-policy is like learning by watching a chef on TV and then using those lessons in your own kitchen.
### 8. Model-based vs Model-free
**Model-free** (By **model-free** I mean just trying moves and remembering the good ones) means the brain does not build a picture of the world. It just tries things and keeps what works. [REINFORCE](#reinforce), [PPO](#ppo), and [Q-learning](#q-learning) are all model-free.
**Model-based** (By **model-based** I mean building a small world inside the brain to plan ahead) means the brain also learns a “model”—for example, “if I jump from here, I land there and get a treat”. Then the brain can plan ahead without actually jumping. This is like thinking before you act.
Some very smart brains use both: they learn a model and then use [Q-learning](#q-learning) inside that model.
### 9. Value-based vs Policy-based
**Value-based** (By **value-based** I mean learning a number for each place, then picking the move that leads to the best number) methods learn how good each place (state) is. [Q-learning](#q-learning) is value-based because it learns the number of each move in each place.
**Policy-based** (By **policy-based** I mean learning directly which move to do in each place) methods skip the numbers and learn the move directly. [REINFORCE](#reinforce) and [PPO](#ppo) are policy-based. They change the [actor](#actor) without ever asking “how good is this place?”.
You can also mix them—that is called “actor-critic” (we already did that). The [actor](#actor) is policy-based, the [critic](#critic) is value-based.
### 10. Learning from People—RLHF
Sometimes we want a brain to do what people like, not just win a [game](#game). For example, we want a brain that talks like a kind friend, not a mean one. To do this we use **RLHF** (By **RLHF** I mean using what people like or dislike as the “good” and “bad” signs). RLHF stands for “Reinforcement Learning from Human Feedback”.
First you train a brain to do something (like answer a question). Then you show the brain’s answers to people. The people click a “thumbs up” if they like the answer, “thumbs down” if they hate it. Those clicks become the “good” and “bad” signs.
Then you run [REINFORCE](#reinforce) or [PPO](#ppo) using those signs. The brain learns to give answers that people like. This is how modern talking brains like ChatGPT are tuned to be helpful.
### 11. Putting It All Together
You have now learned the main ways brains can learn by trial and error:
- **Evolution**—copy and wiggle, keep winners.
- **REINFORCE**—strengthen all fired neurons on win, weaken on loss.
- **Actor-Critic**—use a second brain to guess, then learn from surprises.
- **PPO / GRPO**—change only a small step at a time.
- **Q-learning**—fill a big table of “how good is this move in this place?”
- **On-policy** vs **Off-policy**—learn from yourself or from others?
- **Model-free** vs **Model-based**—just try, or build a world in your head?
- **Value-based** vs **Policy-based**—learn numbers for places, or learn moves directly?
- **RLHF**—learn from what people like.
All of these are [reinforcement learning](#defn-reinforcement-learning). Now you can go and teach a brain to play a [game](#game)!