“The Inverse Variance–flatness Relation in Stochastic Gradient Descent Is Critical for Finding Flat Minima”, Yu Feng, Yuhai Tu (2021-03-02):

One key ingredient in deep learning is the stochastic gradient descent (SGD) algorithm, which allows neural nets to find generalizable solutions at flat minima of the high-dimensional loss function. However, it is unclear how SGD finds flat minima.

Here, by analyzing SGD-based learning dynamics together with the loss-function landscape, we discovered a robust inverse relation between weight fluctuation and loss-landscape flatness, opposite to the fluctuation-dissipation relation in physics. This inverse relationship arises because the SGD noise strength and its correlation time depend inversely on the landscape flatness.
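The measurement behind such a variance-flatness relation can be sketched on a toy problem. The following is a minimal, illustrative Python/NumPy sketch, not the paper's actual setup: it assumes a linear-regression loss with anisotropic features, runs minibatch SGD, does PCA on the weight trajectory to get the fluctuation variance along each principal direction, and defines flatness along each direction as the width of the interval where the full-batch loss stays below e times its value at the trajectory mean (a threshold choice made here for illustration):

```python
# Illustrative sketch: measure per-direction SGD weight variance vs. landscape
# flatness on a toy linear-regression problem (assumptions: anisotropic
# features, e*L0 flatness threshold -- not the paper's networks).
import numpy as np

rng = np.random.default_rng(0)

# Toy data with varying feature scales -> anisotropic loss curvature.
n, d = 500, 10
X = rng.normal(size=(n, d)) * np.linspace(0.3, 3.0, d)
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def loss(w, idx=None):
    Xi, yi = (X, y) if idx is None else (X[idx], y[idx])
    r = Xi @ w - yi
    return 0.5 * np.mean(r * r)

def grad(w, idx):
    Xi, yi = X[idx], y[idx]
    return Xi.T @ (Xi @ w - yi) / len(idx)

# Minibatch SGD; collect weight snapshots after a burn-in phase.
w = rng.normal(size=d)
eta, batch, steps, burn = 0.01, 16, 6000, 2000
snaps = []
for t in range(steps):
    idx = rng.choice(n, batch, replace=False)
    w -= eta * grad(w, idx)
    if t >= burn:
        snaps.append(w.copy())
S = np.array(snaps)
w_bar = S.mean(axis=0)

# PCA of the weight trajectory: variance sigma_i^2 along each principal direction.
C = np.cov(S.T)
sig2, V = np.linalg.eigh(C)            # ascending eigenvalues
order = np.argsort(sig2)[::-1]         # sort descending
sig2, V = sig2[order], V[:, order]

# Flatness F_i along direction v_i: width of the interval around w_bar where
# the full-batch loss stays below e * L0 (linear scan; step/limit are ad hoc).
def flatness(v, L0, lim=5.0, h=0.005):
    width = 0.0
    for sgn in (+1.0, -1.0):
        t = 0.0
        while t < lim and loss(w_bar + sgn * t * v) < np.e * L0:
            t += h
        width += t
    return width

L0 = loss(w_bar)
F = np.array([flatness(V[:, i], L0) for i in range(d)])
for i in range(d):
    print(f"dir {i}: variance={sig2[i]:.3e}  flatness={F[i]:.3f}")
```

Plotting log-variance against log-flatness across directions (as the paper does for its networks) is then a natural next step; on this convex toy problem one should not expect the deep-network exponent, only the measurement recipe.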

Essentially, SGD serves as a landscape-dependent annealing algorithm that searches for flat minima. These theoretical insights can lead to more efficient algorithms, e.g., for preventing catastrophic forgetting.

[Keywords: statistical physics, machine learning, stochastic gradient descent, loss landscape, generalization]