
Neural networks are fundamentally (almost) Bayesian

Stochastic Gradient Descent approximates Bayesian sampling

Introducing expressivity and inductive bias

In supervised learning, the problem is to fit a function to some training data S. One of the simplest supervised learning tasks is 1D regression (i.e. fitting a curve to some example points). As is clear from Figure 1, there are many possible functions f that fit the training data (any function that passes through the black points), but only one 'true' function f* actually generated those points. In the case of Figure 1, f* was a (slightly) perturbed sine function, and the grey points show its behaviour away from the training data (they are part of the test data, which the model is not shown).

Figure 1: DNNs are highly expressive and yet fit simple functions. When polynomial fitters are made highly expressive, they do not fit simple functions, because they have no inductive bias towards them; DNNs, by contrast, fit a simple function (the blue line) even though they are capable of fitting a more complex one (the red line). The black and grey datapoints were generated with f* (the grey datapoints are the 'test data', which the model is not shown). The red datapoints were generated with another function, to test the expressivity of the neural network. From [1].
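To make this concrete, here is a minimal sketch of the kind of experiment behind Figure 1 (my own, not the authors' code): fit both a high-degree polynomial and a small over-parameterised network to a handful of points from a perturbed sine, and compare them away from the training points. The target, polynomial degree and network size are all illustrative choices.

```python
# A minimal sketch of a Figure 1 style experiment (not the authors' code).
# The perturbed-sine target, polynomial degree and network size are illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# 'True' function f*: a slightly perturbed sine, as in Figure 1.
def f_star(x):
    return np.sin(2 * np.pi * x) + 0.2 * x

# A handful of training points (the black points) and dense test points (the grey points).
x_train = np.linspace(0, 1, 8).reshape(-1, 1)
y_train = f_star(x_train).ravel()
x_test = np.linspace(0, 1, 200).reshape(-1, 1)
y_test = f_star(x_test).ravel()

# A highly expressive polynomial fitter: it can pass through all the training points,
# but nothing biases it towards simple behaviour between them.
poly = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
poly.fit(x_train, y_train)

# An over-parameterised network: also expressive enough to fit the points,
# yet it typically ends up with a much smoother interpolant.
dnn = MLPRegressor(hidden_layer_sizes=(128, 128), activation="tanh",
                   max_iter=20_000, tol=1e-7, random_state=0)
dnn.fit(x_train, y_train)

for name, model in [("degree-15 polynomial", poly), ("small DNN", dnn)]:
    test_mse = np.mean((model.predict(x_test) - y_test) ** 2)
    print(f"{name}: test MSE = {test_mse:.4f}")
```

Plotting both fits against f* over the test points reproduces the qualitative picture in Figure 1: both models are expressive enough to interpolate the training data, but only one of them is biased towards a simple interpolant.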

Introducing further notation

Before we continue, we should develop a proper formalism. In a supervised learning setting, we have some data distribution D, consisting of inputs (in the space X) and labels (in the space Y), and some true function f*: X → Y that maps each input to its label.

Figure 2: Images in MNIST (the dataset D) are mapped to their labels by a true function f*, which is defined on all elements of D (images of the digits 0–9)
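In symbols (my notation, following the description above): the training set S is a finite sample of input–label pairs generated by f*, and generalisation is measured by how often a candidate function f disagrees with f* on fresh inputs drawn from D.

```latex
f^{*} : X \to Y,
\qquad
S = \{ (x_i, y_i) \}_{i=1}^{m}, \quad y_i = f^{*}(x_i), \; x_i \sim D,
\qquad
\epsilon(f) \;=\; \Pr_{x \sim D}\!\left[\, f(x) \neq f^{*}(x) \,\right].
```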

So what is the source of DNNs' inductive bias?

We found that a good way to answer this question was to ask another: how much of parameter space is associated with functions that generalise well? The following text explains the argument, but Figure 3 may make it clearer.

Figure 3: A visualisation of Pᵦ(f|S). The x and y axes represent parameters in the DNN. It is argued in [1,2,3] that, of the functions consistent with the training data, 'simple' functions occupy large volumes in parameter space and 'complex' functions occupy small volumes. For real-world datasets, the volumes range over many orders of magnitude. [Technically, we use a Gaussian measure over parameter space.]
Figure 4: A pictorial summary of the main argument. It is argued that, among functions consistent with the training data, those with larger Vᵦ(f|S) (a greater volume in parameter space) generalise better. It is hoped that in cases where the basins Vᵦ(f|S) vary over many orders of magnitude, Vₒₚₜ(f|S) will correlate well with Vᵦ(f|S), as shown above. Note that these quantities are written with a P (rather than a V) in the main text, as we can interpret them either as probabilities or as volumes. From [1].

Formalising the question

We must compare the following two quantities. Consider a DNN N, a training set S and a test set (functions are distinguished by the predictions they make on the test set):

  1. Pᵦ(f|S): the probability that Bayesian inference, using the prior over functions induced by randomly initialising N and conditioning on zero error on S, returns the function f.
  2. Pₒₚₜ(f|S): the probability that N, trained with a stochastic optimiser such as SGD until it reaches zero error on S, converges to the function f.
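In symbols (my notation; the paper makes the conditioning precise), conditioning the prior on the training data simply restricts it to the functions that fit S exactly and renormalises:

```latex
P_B(f) \;=\; \int \mathbf{1}\!\left[\, \mathcal{N}_{\theta} = f \,\right] d\mu(\theta),
\qquad
P_B(f \mid S) \;=\; \frac{P_B(f)\, \mathbf{1}\!\left[\, f \text{ fits } S \,\right]}
                         {\sum_{f'} P_B(f')\, \mathbf{1}\!\left[\, f' \text{ fits } S \,\right]},
```

where 𝒩_θ denotes the function implemented by the network with parameters θ, and μ is the (Gaussian) measure over parameters used at initialisation. Pₒₚₜ(f|S) has no such closed form; it is estimated by counting which functions the optimiser finds over many training runs.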

How to calculate Pₒₚₜ(f|S) and Pᵦ(f|S)

Here we give an example of the argument above, in the hope that it might aid intuition. If everything is clear though, skip ahead to the results!
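As a toy illustration of what these quantities mean operationally, here is a brute-force sketch on a tiny Boolean problem (my own; it is not the estimator used in [1], which relies on a Gaussian-process approximation for Pᵦ and many training runs for Pₒₚₜ on much larger systems). The target function, architecture and hyperparameters are illustrative, and PyTorch's default initialisation stands in for the parameter prior.

```python
# Toy sketch: estimating P_B(f|S) and P_opt(f|S) by brute force on a tiny Boolean problem.
import itertools
from collections import Counter

import torch
import torch.nn as nn

torch.manual_seed(0)

# All 2^3 = 8 Boolean inputs over 3 bits; an illustrative 'true' function f*
# (here simply the first bit). 4 inputs form S; the other 4 form the test set.
X = torch.tensor(list(itertools.product([0.0, 1.0], repeat=3)))
y = X[:, 0]
train_idx = torch.tensor([0, 3, 5, 6])
test_idx = torch.tensor([1, 2, 4, 7])

def make_net():
    # A small over-parameterised network; architecture chosen only for illustration.
    return nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, 1))

def as_function(net):
    # Identify a function by its 0/1 predictions on the test inputs.
    with torch.no_grad():
        return tuple((net(X[test_idx]).squeeze(1) > 0).int().tolist())

def fits_S(net):
    with torch.no_grad():
        preds = (net(X[train_idx]).squeeze(1) > 0).float()
    return bool((preds == y[train_idx]).all())

# --- P_B(f|S): sample parameters from the initialisation distribution and keep
# --- only the samples that reach zero error on S (i.e. condition on the data).
bayes_counts = Counter()
for _ in range(50_000):
    net = make_net()                      # fresh random initialisation ~ prior
    if fits_S(net):
        bayes_counts[as_function(net)] += 1

# --- P_opt(f|S): train with SGD to zero error on S and record the function found.
opt_counts = Counter()
for _ in range(200):
    net = make_net()
    opt = torch.optim.SGD(net.parameters(), lr=0.1)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(5_000):                # more than enough steps for 4 points
        if fits_S(net):
            break
        opt.zero_grad()
        loss_fn(net(X[train_idx]).squeeze(1), y[train_idx]).backward()
        opt.step()
    opt_counts[as_function(net)] += 1

# Compare the two empirical distributions on the functions SGD finds most often.
n_b, n_o = sum(bayes_counts.values()), sum(opt_counts.values())
for f, c in opt_counts.most_common(5):
    print(f"P_opt ~ {c / n_o:.3f}   P_B ~ {bayes_counts.get(f, 0) / max(n_b, 1):.3f}   f on test set = {f}")
```

The printed Pₒₚₜ and Pᵦ estimates can then be compared directly; this is exactly the comparison made (at much larger scale, and with better estimators) in the results below.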

Our results

Our results can be broken down into some very simple experiments. Our main type of experiment tests whether Pₒₚₜ(f|S) ≈ Pᵦ(f|S). We find that this holds over a wide range of datasets (MNIST, Fashion-MNIST, the IMDb movie review dataset and the ionosphere dataset), architectures (fully connected, convolutional, LSTM), optimisers (SGD, Adam, Adadelta, etc.), training schemes (including overtraining) and optimiser hyperparameters (e.g. batch size, learning rate).

Figure 5: Our main result. (a) compares Pᵦ(f|S) with Pₒₚₜ(f|S). (b) shows how Pᵦ(f|S) varies over many orders of magnitude: there is a clear inductive bias towards simple functions in Pᵦ(f|S). (c) shows that simpler functions generalise better. (d) shows Pᵦ(f|S) from (a). (e) and (f) use different hyperparameters from those in (a), showing that the effect is not hyperparameter-dependent. From [1].
Figure 6: More hyperparameter choices: clearly, for all of them, Pₒₚₜ(f|S) ≈ Pᵦ(f|S). However, changing optimiser hyperparameters can make small changes to Pₒₚₜ(f|S), and thus small changes to generalisation. So SGD might not be the main player, but it can definitely affect the game. From [1].
Figure 7: More datasets and architectures: ConvNets on Fashion-MNIST, an LSTM on the IMDb movie review dataset, and a small FCN on the ionosphere dataset. Clearly, for all of them, Pₒₚₜ(f|S) ≈ Pᵦ(f|S), so the effect is not limited to fully connected networks.

Conclusion

In conclusion, we have presented strong evidence that DNNs generalise well because of a strong inductive bias arising from the architecture itself, rather than primarily from the optimiser. More specifically:

  1. There is a prior over functions Pᵦ(f), induced by the random initialisation of the parameters: it is the probability that a randomly initialised DNN implements the function f. This prior is strongly biased towards simple functions.
  2. We can perform Bayesian inference with Pᵦ(f) as the prior, conditioning on the training data S, to give a posterior distribution over functions, Pᵦ(f|S). This posterior is also strongly biased towards simple functions (see Figure 5b).
  3. We then show that, upon training with a stochastic optimiser like SGD, the network finds a function f with probability Pₒₚₜ(f|S) ≈ Pᵦ(f|S). See Figure 5a.

References

[1] C. Mingard, G. Valle-Perez, J. Skalse, A. Louis. Is SGD a Bayesian Sampler? Well, almost. (2020) https://arxiv.org/abs/2006.15191

Appendix

This blog post attempts to explain the work in [1] as closely as possible, but of course, some simplifications have been made in the process. I encourage interested readers to read the paper!
