Bibliography (7):

  1. Attention Is All You Need

  2. Omnigrok: Grokking Beyond Algorithmic Data

  3. Deep Double Descent: We show that the double descent phenomenon occurs in CNNs, ResNets, and transformers: performance first improves, then gets worse, and then improves again with increasing model size, data size, or training time

  4. 2024-fan-figure2-grokkingincreaseswithmlpdepth.jpg

  5. https://arxiv.org/pdf/2405.19454#page=2

  6. https://arxiv.org/pdf/2405.19454#page=7