“Is SGD a Bayesian Sampler? Well, Almost”, Chris Mingard, Guillermo Valle-Pérez, Joar Skalse, Ard A. Louis (2020-06-26):

Overparameterized deep neural networks (DNNs) are highly expressive and so can, in principle, generate almost any function that fits a training dataset with zero error. The vast majority of these functions will perform poorly on unseen data, and yet in practice DNNs often generalize remarkably well. This success suggests that a trained DNN must have a strong inductive bias towards functions with low generalization error.

Here we empirically investigate this inductive bias by calculating, for a range of architectures and datasets, the probability P_SGD(f|S) that an overparameterized DNN, trained with stochastic gradient descent (SGD) or one of its variants, converges on a function f consistent with a training set S. We also use Gaussian processes to estimate the Bayesian posterior probability P_B(f|S) that the DNN expresses f upon random sampling of its parameters, conditioned on S.
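To make the two quantities concrete, here is a minimal, purely illustrative sketch in PyTorch on a toy Boolean problem. It is not the paper's actual setup (which uses larger networks, datasets such as MNIST, and a Gaussian-process approximation for P_B rather than the brute-force rejection sampling shown here); all names and hyperparameters below are hypothetical. A "function" f is identified with the labelling the network assigns to a fixed set of held-out inputs.

```python
import torch
import torch.nn as nn
from collections import Counter

torch.manual_seed(0)

# Toy domain: all 7-bit Boolean inputs; a tiny training set S keeps
# brute-force rejection sampling of parameters feasible.
X_all = torch.tensor([[int(b) for b in format(i, "07b")] for i in range(128)],
                     dtype=torch.float32)
y_all = (X_all.sum(dim=1) > 3.5).float()   # an arbitrary simple target
perm = torch.randperm(128)
train_idx, test_idx = perm[:4], perm[4:]

def make_net():
    # Fresh random initialization = one sample from the parameter prior.
    return nn.Sequential(nn.Linear(7, 40), nn.ReLU(), nn.Linear(40, 1))

def f_signature(net):
    """Identify the function f by its thresholded outputs on held-out inputs."""
    with torch.no_grad():
        return tuple((net(X_all[test_idx]).squeeze(1) > 0).int().tolist())

def train_to_zero_error(net, lr=0.5, max_steps=5000):
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(max_steps):
        out = net(X_all[train_idx]).squeeze(1)
        if ((out > 0).float() == y_all[train_idx]).all():
            return True   # network now fits S exactly ("consistent with S")
        opt.zero_grad()
        loss_fn(out, y_all[train_idx]).backward()
        opt.step()
    return False

# Estimate P_SGD(f|S): the frequency with which SGD, run from independent
# random initializations, converges on each function f.
sgd_counts = Counter()
for _ in range(200):
    net = make_net()
    if train_to_zero_error(net):
        sgd_counts[f_signature(net)] += 1

# Estimate P_B(f|S) by rejection sampling: draw random parameters and keep
# only samples that already fit S exactly. This is extremely
# sample-inefficient, which is why the paper uses a Gaussian-process
# approximation instead.
bayes_counts = Counter()
for _ in range(100_000):
    net = make_net()
    with torch.no_grad():
        preds = (net(X_all[train_idx]).squeeze(1) > 0).float()
    if (preds == y_all[train_idx]).all():
        bayes_counts[f_signature(net)] += 1

print("most frequent functions under P_SGD:", sgd_counts.most_common(3))
print("most frequent functions under P_B  :", bayes_counts.most_common(3))
```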

Our main findings are that P_SGD(f|S) correlates remarkably well with P_B(f|S), and that P_B(f|S) is strongly biased towards low-error, low-complexity functions. These results imply that the strong inductive bias in the parameter-function map (which determines P_B(f|S)), rather than any special property of SGD, is the primary explanation for why DNNs generalize so well in the overparameterized regime.
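Continuing the hypothetical sketch above, the headline correlation can be checked by comparing the log-probabilities each estimator assigns to the functions both methods produced (a crude stand-in for the paper's scatter plots of P_SGD against P_B):

```python
import math

# Compare the two empirical estimates on functions seen by both methods.
common = set(sgd_counts) & set(bayes_counts)
n_sgd, n_bayes = sum(sgd_counts.values()), sum(bayes_counts.values())
pairs = [(math.log(sgd_counts[f] / n_sgd),
          math.log(bayes_counts[f] / n_bayes)) for f in common]

if len(pairs) >= 2:
    xs, ys = zip(*pairs)
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx > 0 and sy > 0:
        print("Pearson r of log P_SGD vs log P_B:", cov / (sx * sy))
```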

While our results suggest that the Bayesian posterior P_B(f|S) is the first-order determinant of P_SGD(f|S), there remain second-order differences that are sensitive to hyperparameter tuning. A function-probability picture, based on P_SGD(f|S) and/or P_B(f|S), can shed new light on how variations in architecture or in hyperparameter settings such as batch size, learning rate, and choice of optimizer affect DNN performance.