“Learning through Atypical “Phase Transitions” in Overparameterized Neural Networks”, Carlo Baldassi, Clarissa Lauditi, Enrico M. Malatesta, Rosalba Pacelli, Gabriele Perugini, Riccardo Zecchina (2021-10-01):

[cf. double descent] Current deep neural networks are highly overparameterized (up to billions of connection weights) and nonlinear. Yet they can fit data almost perfectly through variants of gradient descent algorithms and achieve unexpected levels of prediction accuracy without overfitting. These are formidable results that defy predictions of statistical learning and pose conceptual challenges for non-convex optimization.

In this paper, we use methods from statistical physics of disordered systems to analytically study the computational fallout of overparameterization in non-convex binary neural network models, trained on data generated from a structurally simpler but “hidden” network.
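
To make the teacher-student setup concrete, here is a minimal Python sketch with assumed details (a binary perceptron teacher and a binary committee-machine student; the paper’s exact architectures and sizes differ):

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 100, 500                         # input dimension, number of examples

# Teacher ("hidden" network): a single binary perceptron with +/-1 weights.
w_teacher = rng.choice([-1, 1], size=N)
X = rng.standard_normal((P, N))
y = np.sign(X @ w_teacher)              # labels produced by the hidden rule

# Student: a binary committee machine with K hidden units (K*N weights),
# overparameterized with respect to the teacher's N weights.
K = 21                                  # odd, so the majority vote has no ties
W_student = rng.choice([-1, 1], size=(K, N))
y_hat = np.sign(np.sign(X @ W_student.T).sum(axis=1))   # majority vote of units
print("train error of a random student:", np.mean(y_hat != y))
```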

As the number of connection weights increases, we follow the changes in the geometrical structure of the different minima of the error loss function and relate them to learning and generalization performance.

A first transition happens at the so-called interpolation point, when zero-error solutions begin to exist and perfect fitting of the data becomes possible [this is the information-theoretic interpolation threshold of the model]. This transition reflects the properties of typical solutions, which, however, lie in sharp minima and are hard to sample.
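
As a purely illustrative toy experiment (continuous weights trained by gradient descent, not the binary-weight models analyzed in the paper), an empirical analogue of the interpolation point can be located by sweeping the width of a small network on a fixed fitting task and recording the smallest width at which the training error reaches zero:

```python
import numpy as np

rng = np.random.default_rng(1)
P, N = 200, 20                          # patterns to fit, input dimension
X = rng.standard_normal((P, N))
y = rng.choice([-1.0, 1.0], size=P)     # random labels: a pure fitting task

def train_and_measure(K, steps=5000, lr=0.1):
    """Full-batch gradient descent on a hinge loss for a width-K tanh network."""
    W1 = rng.standard_normal((N, K)) / np.sqrt(N)
    W2 = rng.standard_normal(K) / np.sqrt(K)
    for _ in range(steps):
        h = np.tanh(X @ W1)                      # hidden activations, (P, K)
        out = h @ W2                             # network output, (P,)
        g_out = -y * (y * out < 1.0)             # gradient of hinge loss w.r.t. out
        g_W2 = (h.T @ g_out) / P
        g_W1 = (X.T @ (np.outer(g_out, W2) * (1.0 - h**2))) / P
        W1 -= lr * g_W1
        W2 -= lr * g_W2
    pred = np.sign(np.tanh(X @ W1) @ W2)
    return np.mean(pred != y)

for K in (2, 5, 10, 20, 50, 100):
    print(f"width {K:4d}: train error {train_and_measure(K):.3f}")
# The smallest width that reaches zero training error gives a rough empirical
# analogue of the interpolation point.
```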

After a gap, a second transition occurs, the local entropy (LE) transition, with the discontinuous appearance of a different kind of “atypical” structure: wide regions of the weight space that are particularly dense in solutions and have good generalization properties.

The two kinds of solutions coexist, with the typical ones being exponentially more numerous, but empirically we find that efficient algorithms sample the atypical, rare ones. This suggests that the atypical phase transition is the relevant one for learning.
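
One way efficient algorithms can end up in such wide regions, in the spirit of local-entropy-based approaches (a hedged sketch, not the paper’s exact procedure; `grad_fn` is a placeholder for the gradient of the training loss), is to descend a loss averaged over Gaussian weight perturbations, which penalizes sharp minima:

```python
import numpy as np

def smoothed_grad(w, grad_fn, sigma=0.05, n_samples=8, rng=None):
    """Monte-Carlo gradient of E_{d ~ N(0, sigma^2 I)}[ L(w + d) ]."""
    rng = rng or np.random.default_rng(0)
    g = np.zeros_like(w)
    for _ in range(n_samples):
        g += grad_fn(w + sigma * rng.standard_normal(w.shape))
    return g / n_samples

# Usage: replace the plain update  w -= lr * grad_fn(w)  with
#        w -= lr * smoothed_grad(w, grad_fn)
```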

The results of numerical tests with realistic networks on observables suggested by the theory are consistent with this scenario.
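
One such observable is the “local energy” around a solution: the average training error after perturbing the weights at increasing distance, which stays near zero over a wide range for flat (“atypical”) solutions and degrades almost immediately for sharp (“typical”) ones. A minimal continuous-weight sketch (the binary-weight case in the paper uses Hamming distance; `w_star` and `error_fn` are placeholders for a trained weight vector and its training-error function):

```python
import numpy as np

def local_energy_profile(w_star, error_fn, radii, n_samples=50, rng=None):
    """Mean training error of random perturbations of w_star, one value per relative radius."""
    rng = rng or np.random.default_rng(0)
    norm = np.linalg.norm(w_star)
    profile = []
    for r in radii:
        errs = []
        for _ in range(n_samples):
            d = rng.standard_normal(w_star.shape)
            d *= r * norm / np.linalg.norm(d)    # rescale to relative radius r
            errs.append(error_fn(w_star + d))
        profile.append(float(np.mean(errs)))
    return profile

# e.g. local_energy_profile(w_star, error_fn, radii=np.linspace(0.0, 0.5, 11))
```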

…[double descent] Subsequent numerical analysis of the Hessian of largely overparameterized models [30] showed that minimizers present many flat directions, and that it is not hard to find a path of zero training error connecting two solutions [31, 32]. In under-parameterized neural networks, on the other hand, the authors of [33] showed that the landscape is very rough and the dynamics is glassy. This led to the idea that the landscape of overparameterized networks, where the dynamics is no longer glassy, presents no “poor” minima at all [34].
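
Both numerical probes mentioned above can be sketched for a tiny model (hypothetical helpers; `error_fn` and `loss_fn` stand for whatever training-error and loss functions the model defines): the error along the straight segment between two independently found solutions, and the count of near-zero Hessian eigenvalues (“flat directions”) at a minimizer.

```python
import numpy as np

def error_along_path(w_a, w_b, error_fn, n_points=21):
    """Training error along the straight segment between two solutions."""
    return [error_fn((1 - t) * w_a + t * w_b) for t in np.linspace(0.0, 1.0, n_points)]

def count_flat_directions(w, loss_fn, eps=1e-4, tol=1e-3):
    """Near-zero eigenvalues of a finite-difference Hessian (O(n^2) loss calls: tiny models only)."""
    n = w.size
    H = np.empty((n, n))
    L0 = loss_fn(w)
    for i in range(n):
        ei = np.zeros(n); ei[i] = eps
        for j in range(n):
            ej = np.zeros(n); ej[j] = eps
            H[i, j] = (loss_fn(w + ei + ej) - loss_fn(w + ei)
                       - loss_fn(w + ej) + L0) / eps**2
    eigvals = np.linalg.eigvalsh((H + H.T) / 2.0)
    return int(np.sum(np.abs(eigvals) < tol))
```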

According to our analysis, this is not the case. As we anticipated in the introduction, over-parameterization has the effect of making those connected regions appear at the LE transition, not of making “poor” minima disappear completely. Over-parameterizing the network even further increases the size of the connected region; “poor” or “sharp” solutions, however, remain the most numerous ones and dominate the Gibbs measure.