“The Slingshot Helps With Learning”, Wilson Wu, 2024-10-31:

The slingshot effect is a late-stage training anomaly found in various adaptive gradient optimization methods. In particular, slingshots are present with AdamW, the optimizer most widely used for modern transformer training.

The original slingshot paper observes that slingshots tend to occur alongside grokking, a phenomenon in which neural networks trained on algorithmic tasks generalize to the test set long after perfectly fitting the training set.

In this post, we take a closer look at slingshots and their effect on generalization in the setting of 1-hidden-layer MLPs trained on k-sparse parity, a specific algorithmic task.
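As a concrete picture of the task, here is a minimal sketch of k-sparse parity data generation (the function and variable names are illustrative assumptions, not taken from the post): each input is a random sign vector in {-1, +1}^n, and the label is the parity (product) of a fixed hidden subset of k coordinates, which the network must learn to identify.

```python
import numpy as np

def k_sparse_parity_data(n_samples, n=40, k=3, seed=0):
    """Generate a k-sparse parity dataset (illustrative sketch).

    Inputs are uniform random sign vectors in {-1, +1}^n; the label is
    the product of the entries on a fixed, hidden subset of k coordinates,
    so labels lie in {-1, +1}.
    """
    rng = np.random.default_rng(seed)
    # Hidden subset of k relevant coordinates; the learner does not know it.
    support = rng.choice(n, size=k, replace=False)
    X = rng.choice([-1.0, 1.0], size=(n_samples, n))
    y = X[:, support].prod(axis=1)
    return X, y, support

X, y, support = k_sparse_parity_data(1024)
```

A 1-hidden-layer MLP trained on such data with an adaptive optimizer is the setting in which the post studies slingshots.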

The main results are:

  1. an explanation of why slingshots occur in models trained with hinge loss, which partially transfers to models trained with cross-entropy loss; and

  2. empirical evidence that slingshots are biased towards decreasing test loss.