“The Inverse Variance–Flatness Relation in Stochastic Gradient Descent Is Critical for Finding Flat Minima”, 2021-03-02:
One key ingredient in deep learning is the stochastic gradient descent (SGD) algorithm, which allows neural nets to find generalizable solutions at flat minima of the high-dimensional loss function. However, it is unclear how SGD finds flat minima.
Here, by analyzing SGD-based learning dynamics together with the loss-function landscape, we discovered a robust inverse relation between weight fluctuation and loss landscape flatness, opposite to the fluctuation-dissipation relation in equilibrium physics. The reason for this inverse relation is that both the SGD noise strength and its correlation time depend inversely on the landscape flatness.
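In symbols (a sketch with notation of my own; the variance σ², flatness F, and generic exponent ψ are not taken verbatim from the paper's text), the contrast with the equilibrium expectation looks like:

```latex
% sigma_i^2: variance of SGD weight fluctuations along PCA direction i
% F_i:       flatness of the loss landscape along that same direction
% Equilibrium fluctuation-dissipation intuition: flatter basin => larger fluctuation.
% The inverse variance-flatness relation found for SGD is the opposite:
% flatter basin => smaller fluctuation, i.e. variance decreases with flatness,
\sigma_i^2 \propto F_i^{-\psi}, \qquad \psi > 0 .
```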
Essentially, SGD serves as a landscape-dependent annealing algorithm to search for flat minima. These theoretical insights can lead to more efficient algorithms, eg. for preventing catastrophic forgetting.
[Keywords: statistical physics, machine learning, stochastic gradient descent, loss landscape, generalization]
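A minimal sketch (not the authors' code) of how one might probe this relation numerically: train a tiny network with minibatch SGD, record the weight trajectory in the late, fluctuating phase, then compare the variance along each PCA direction of the trajectory with a width-based flatness proxy along the same direction. The 2-layer tanh net, the synthetic data, all hyperparameters, and the specific flatness definition (distance until the full-batch loss rises by a fixed amount) are illustrative assumptions, not necessarily the paper's exact choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary classification data (assumed setup, not from the paper).
N, D, H = 512, 10, 16
X = rng.normal(size=(N, D))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)

def unpack(w):
    """Split the flat parameter vector into the two weight matrices."""
    return w[:D * H].reshape(D, H), w[D * H:]

def loss_and_grad(w, idx):
    """Mean cross-entropy loss and its gradient on the minibatch `idx`."""
    W1, W2 = unpack(w)
    h = np.tanh(X[idx] @ W1)                      # (B, H) hidden activations
    p = 1.0 / (1.0 + np.exp(-(h @ W2)))           # (B,) predicted probabilities
    eps = 1e-9
    loss = -np.mean(y[idx] * np.log(p + eps) + (1 - y[idx]) * np.log(1 - p + eps))
    dlogit = (p - y[idx]) / len(idx)              # backprop by hand
    gW2 = h.T @ dlogit
    dh = np.outer(dlogit, W2) * (1.0 - h**2)
    gW1 = X[idx].T @ dh
    return loss, np.concatenate([gW1.ravel(), gW2])

w = rng.normal(scale=0.1, size=D * H + H)
eta, B = 0.1, 32

# Phase 1: descend into a minimum basin.
for _ in range(5000):
    w = w - eta * loss_and_grad(w, rng.choice(N, B, replace=False))[1]

# Phase 2: keep running SGD and record the fluctuating weights.
traj = []
for _ in range(2000):
    w = w - eta * loss_and_grad(w, rng.choice(N, B, replace=False))[1]
    traj.append(w)
traj = np.array(traj)

# PCA of the weight fluctuations around the trajectory mean.
mu = traj.mean(axis=0)
evals, evecs = np.linalg.eigh(np.cov(traj.T))     # ascending eigenvalues

def flatness(direction, w0, dL=0.1, step=0.02, max_r=5.0):
    """Width proxy: distance along +/- direction until full-batch loss rises by dL."""
    full = np.arange(N)
    L0 = loss_and_grad(w0, full)[0]
    width = 0.0
    for sign in (+1.0, -1.0):
        r = 0.0
        while r < max_r and loss_and_grad(w0 + sign * r * direction, full)[0] <= L0 + dL:
            r += step
        width += r
    return width

# The inverse relation predicts: larger flatness <=> *smaller* variance.
for k in range(1, 6):
    print(f"PC{k}: variance={evals[-k]:.3e}  flatness={flatness(evecs[:, -k], mu):.3f}")
```

Under the inverse variance–flatness relation, the printed variances should tend to shrink as the flatness proxy grows across PCA directions, the opposite of the equilibrium fluctuation-dissipation expectation that flatter directions fluctuate more.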