Bibliography (7):
A Mathematical Framework for Transformer Circuits
Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
Explaining grokking through circuit efficiency
Wikipedia Bibliography:
Floor effect
Scaling law: https://en.wikipedia.org/wiki/Scaling_law
GPT-4: https://en.wikipedia.org/wiki/GPT-4
Maximum likelihood estimation