Attention Is All You Need
Omnigrok: Grokking Beyond Algorithmic Data
Deep Double Descent: We show that the double descent phenomenon occurs in CNNs, ResNets, and transformers: performance first improves, then gets worse, and then improves again with increasing model size, data size, or training time
2024-fan-figure2-grokkingincreaseswithmlpdepth.jpg
https://arxiv.org/pdf/2405.19454#page=2
https://arxiv.org/pdf/2405.19454#page=7