https://www.lesswrong.com/posts/HHSuvG2hqAnGT5Wzp/no-convincing-evidence-for-gradient-descent-in-activation#Transformers_Learn_in_Context_by_Gradient_Descent__van_Oswald_et_al__2022_
Attention Is All You Need
In-Context Learning and Induction Heads
https://github.com/google-research/self-organising-systems/tree/master/transformers_learn_icl_by_gd