-
Attention Is All You Need
-
Omnigrok: Grokking Beyond Algorithmic Data
-
Deep Double Descent: We show that the double descent phenomenon occurs in CNNs, ResNets, and transformers: performance first improves, then gets worse, and then improves again with increasing model size, data size, or training time.
-
2024-fan-figure2-grokkingincreaseswithmlpdepth.jpg
-
https://arxiv.org/pdf/2405.19454#page=2
-
https://arxiv.org/pdf/2405.19454#page=7
-