Neural networks are fundamentally (almost) Bayesian
Stochastic Gradient Descent approximates Bayesian sampling
Deep neural networks (DNNs) have been extraordinarily successful in many different situations — from image recognition and playing chess to driving cars and making medical diagnoses. However, in spite of this success, a good theoretical understanding of why they generalise (learn) so well is still lacking.
In this post, we summarise results from three papers, which provide a candidate for a theory of generalisation in DNNs [1,2,3]. For busy or knowledgeable readers, Figures 1–7, their captions and the conclusion are sufficient to understand the main argument.
We will first explain inductive biases: mechanisms that select the function an agent thinks is most likely to explain some training data. We then argue that for highly expressive learning agents such as DNNs (which are known to be able to express many possible functions), a good inductive bias is needed to guarantee generalisation.
We present experiments on real-world datasets which suggest that DNNs behave like Bayesian learning agents (with a prior over parameters). This prior favours simple functions, and this is the source of the good inductive bias in DNNs.
It is worth noting that many theories suggest that SGD is the main source of this inductive bias (e.g. that it performs some hidden regularisation; see [1] for a literature review). Our experiments are strong evidence against this claim, but they do show that tuning the hyperparameters of stochastic optimisers can have a minor effect on the trained DNNs, and may amplify the bias already present in the prior over parameters.
Introducing expressivity and inductive bias
In supervised learning, the problem is to fit a function to some training data S. One of the simplest supervised learning tasks is 1D regression (i.e. fitting a line to some example points). However, as is clear from Figure 1, there are many possible functions f that fit the training data (any function that passes through the black points), yet there is only one ‘true’ function f* that actually generated the black points. In the case of Figure 1, f* was a (slightly) perturbed sine function, and the grey points show its behaviour away from the training data (they are part of the test data).
Then, given some learning system and some datapoints, how can we decide which function f is most likely to be f*? The closer f is to f*, the better the network has generalised; this is just another way of saying that good generalisation means accurately predicting unseen data.
In Figure 1, we use two polynomial fitters to find a candidate for f*. The 5th order polyfit finds a function that is fairly close to f* (the dotted orange line), because there are 6 parameters and 5 datapoints (which almost perfectly constrains the problem), whereas the 50th order polyfit (with 51 parameters) can choose between a much larger set of functions. Algorithms that can fit many functions to S are called highly expressive. Without a mechanism to guide highly expressive algorithms towards f*, they will (with high probability) fit a function that generalises poorly, simply because they have so many functions to choose from, most of which are poor approximations to f*. This is the case for the 50th order polyfit.
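As a rough illustration of this kind of comparison (this is not the code behind Figure 1; the target function, the noise term and the number of points below are made-up placeholders), one can fit polynomials of two different orders with numpy and compare their test error:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-in for f*: a slightly perturbed sine function
def f_star(x):
    return np.sin(2 * np.pi * x) + 0.05 * np.cos(7 * np.pi * x)

x_train = np.sort(rng.uniform(0, 1, 5))   # the 'black points'
y_train = f_star(x_train)
x_test = np.linspace(0, 1, 200)           # dense grid standing in for the test data
y_test = f_star(x_test)

for order in (5, 50):
    # Least-squares polynomial fit; the 50th order fit is heavily under-constrained,
    # so numpy will warn that it is poorly conditioned
    coeffs = np.polyfit(x_train, y_train, order)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"order {order:2d}: test MSE = {test_mse:.3g}")
```

The low-order fit is tightly constrained by the data, whereas the high-order fit has far more freedom and typically lands on a function that behaves wildly away from the training points (the exact numbers depend on the sampled points).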
So, one way of controlling what f looks like is by controlling the expressivity of the model: the number of functions it can fit to the training data S. If we know that f* is much simpler than a 50th order polynomial, we can just reduce the order of the polyfit algorithm, reducing the number of incorrect functions the learning algorithm can fit. This is related to the bias-variance tradeoff, which we will argue has no place in our discussion of DNNs.
This is because DNNs are highly expressive. They usually have many more parameters than datapoints, and are capable of fitting many different functions to S. In Figure 1, we see a DNN with >10000 parameters performing the same task as the polyfit algorithms. The blue line shows the result: a simple function that, whilst not a sine curve, isn't as bad as the 50th order polyfit (which had just 51 parameters). Just to check that this particular DNN isn't too small to fit many different functions, we also trained it on a training set consisting of the black points and the red points. It had no problem fitting a function to both. This suggests that it is highly expressive, and yet generalises fairly well (chooses an f that is close to f*).
When just fitting the black points, the DNN could have fit any function (e.g. the red line), but it selected a very simple function. When we force it to pass through the black and red points, it does so successfully, implying both high expressivity and a preference towards the simplest possible functions. For more information about quantifying the simplicity of functions, see [2,3].
This example, and substantial evidence in [2,3], suggests that the DNN has an inductive bias towards ‘simple’ functions.
An inductive bias is a mechanism that causes algorithms to prefer certain functions as candidates for f*. If the inductive bias selects functions f that are close to f* then the inductive bias is good. There is insufficient inductive bias in the polyfit algorithm — hence no generalisation!
In summary, learning agents that are highly expressive need an inductive bias in order to generalise well. It is thought that an inductive bias towards simple functions is most likely to work well in the real world.
Introducing further notation
Before we continue, we should develop a proper formalism. In a supervised learning setting, we have some data distribution D that consists of inputs (in the space X) and labels (in the space Y), and some function
f*: X → Y,
defined for any input x in D. f* is the “true” function that defines how inputs should be mapped to their labels.
Below is an example (MNIST), where D contains 28x28 greyscale images of the handwritten digits 0–9, with their labels. Note that, while there is some ambiguity as to when a particular 28x28 image can be counted in D, the black square certainly isn’t. f* represents the ‘true’ labelling of the images — it is a function from pixel values in the 28x28 images to labels.
As we have already discussed, a DNN's job is to find a function f that is (sufficiently) close to f*. In a supervised learning setting, this is done by providing some subset S ⊂ D (‘training data’) and attempting to minimise the loss function on S, which measures the difference between f and f* on S. We typically test this function f on a test set E ⊂ D, to see how close it is to f*.
If the inductive bias of the DNN is good, then f will be close to f* on E.
In [3], it was shown that 2 layer Fully Connected Networks are capable of fitting any function to 10000 images in MNIST (i.e. labelling the images in any way), and despite this, they overwhelmingly find functions that generalise well.
We hypothesised in the previous section that DNNs have an inductive bias towards simple functions — if we assume that functions in the real world are simple (e.g. a simple pattern explains what makes the digit 9 a nine and not a seven) then this would explain why DNNs generalise.
So what is the source of the DNNs’ inductive bias?
We found that a good way to answer this question was to ask the following one: how many parameters are associated with functions that generalise well? The following text explains the argument, but Figure 3 might make it clearer.
We first need to define a few quantities. For some Deep Neural Network N and dataset D:
Pᵦ ( f ) is the probability that N expresses f on the dataset D, upon randomly sampling network parameters (typically from i.i.d. Gaussians)
Pᵦ ( f ) can be related to a volume in parameter space Vᵦ ( f ): if network parameters are randomly sampled from a Gaussian distribution, then Vᵦ ( f ) is a volume with Gaussian measure, and equal to Pᵦ ( f ). We use the subscript ‘B’ to denote that this is our Bayesian Prior — which is a type of inductive bias.
This gives a measure of the ‘volume’ in parameter space for different functions f. Extensive empirical evidence and analytic results in [2,3] suggest that simple functions have a correspondingly higher ‘volume’ of parameters. Given that simple functions generalise better, there will be, in effect, a greater ‘volume’ in parameter space for functions with good generalisation than for functions with bad generalisation. So, can we take advantage of this?
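To make Pᵦ ( f ) concrete, here is a minimal toy sketch (this is not the experimental setup of [2,3]; the tiny Boolean input set, the network width and the weight scales are illustrative assumptions): sample i.i.d. Gaussian parameters many times, record the function expressed on a small input set, and use the observed frequencies as an estimate of Pᵦ ( f ), i.e. the Gaussian-measure ‘volume’ of each function.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

# A tiny stand-in for D: all 2^4 binary inputs of length 4
X = np.array([[int(b) for b in f"{i:04b}"] for i in range(16)], dtype=float)

def random_network_function(X, width=32):
    """Sample i.i.d. Gaussian parameters for a small 2-layer tanh network and
    return the Boolean function it expresses on X (as a tuple of 0/1 labels)."""
    n_in = X.shape[1]
    W1 = rng.normal(0, 1 / np.sqrt(n_in), (n_in, width))
    b1 = rng.normal(0, 1, width)
    W2 = rng.normal(0, 1 / np.sqrt(width), (width, 1))
    b2 = rng.normal(0, 1, 1)
    logits = np.tanh(X @ W1 + b1) @ W2 + b2
    return tuple((logits.ravel() > 0).astype(int))

# The frequency of each function over many random draws estimates P_B(f)
n_samples = 100_000
counts = Counter(random_network_function(X) for _ in range(n_samples))
print("distinct functions found:", len(counts))
for f, c in counts.most_common(5):
    print(f, "P_B(f) ≈", c / n_samples)
```

The key observation in [2,3] is that the resulting distribution is very far from uniform: a small number of simple functions take up most of the probability mass.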
We can imagine performing Bayesian inference, taking Pᵦ ( f ) as the prior:
Pᵦ ( f | S) = P(S | f ) Pᵦ ( f ) / Pᵦ (S),
where P(S | f ) = 1 if f is consistent with S, and 0 otherwise. Pᵦ (S) is the probability of randomly initialising to a function consistent with S. This equation is perhaps better understood with reference to Figure 3: it says that the probability that a function f best describes the training data S is given by the volume of f divided by the total volume of all functions consistent with S.
Very informally, this means that if ‘more’ parameters give a function f, and f accurately models the data in the training set S, then we should assign it a higher posterior probability Pᵦ ( f | S) than another function that is accurate on S but has ‘fewer’ associated parameters. This posterior probability is our estimate of how likely it is that f is actually f*, which, looking at the above equation, is strongly dependent on our prior Pᵦ ( f ).
Then, if Pᵦ ( f ) is biased towards functions with good generalisation, Pᵦ( f | S) will be too. To emphasise our earlier point, Pᵦ ( f ) is our inductive bias — which is called a prior in the context of Bayesian inference. The subscript ‘B’ stands for ‘Bayes’.
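As a worked toy example of this update (the functions and prior values below are invented purely for illustration), the 0/1 likelihood means that the posterior is just the prior restricted to the functions consistent with S and then renormalised:

```python
# Hypothetical prior P_B(f) over a handful of functions (illustrative numbers only)
prior = {"f1": 0.70, "f2": 0.20, "f3": 0.06, "f4": 0.04}

# Which functions are consistent with the training data S, i.e. have P(S|f) = 1?
consistent_with_S = {"f1": True, "f2": False, "f3": True, "f4": False}

# P_B(S): total prior mass on functions consistent with S
P_S = sum(p for f, p in prior.items() if consistent_with_S[f])

# Posterior: prior restricted to consistent functions, renormalised
posterior = {f: (p / P_S if consistent_with_S[f] else 0.0) for f, p in prior.items()}
print(posterior)   # f1: ~0.92, f3: ~0.08, f2 and f4: 0
```

Any bias in the prior towards simple, well-generalising functions therefore carries straight through to the posterior.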
However, even if this is true, this does not guarantee good generalisation. DNNs are trained by a stochastic optimiser like SGD. While it may seem likely that functions with higher Pᵦ( f | S) are more likely to be found by SGD, this is not self-evident. Take a look at Figure 4: if the loss landscape behaves somewhat like the one displayed in the cartoon, then functions with higher ‘volumes’ will be more likely to be found, and because they generalise well, the DNN will generalise well in expectation. Arguments and empirical evidence in [1] suggest this is true (with some conditions).
Formalising the question
We must compare the following quantities. Consider a DNN N, training set S and test set E¹:
Pₒₚₜ ( f | S ) is the probability that a stochastic optimiser like SGD finds a function f on E after training to 100% accuracy on S.
Pᵦ ( f | S) is the probability that N expresses f on E, upon randomly sampling network parameters from i.i.d. Gaussians until N is 100% accurate on S. [This is an equivalent definition to that given in the previous section].
If Pₒₚₜ ( f | S ) ≈ Pᵦ ( f | S) then the reason that DNNs generalise well is that there is a greater volume of parameters (volume with Gaussian measure) that correspond to functions that generalise well. SGD is then just more likely to find functions with ‘more’ associated parameters. Again, the reader may wish to refer back to Figures 3 and 4 for cartoons of the argument.
For full and formal treatment, see [1]. Hopefully though, this section is sufficient to understand the key points.
¹Note that we use the test set E to measure differences in the functions post-training (as they will all be the same on S). In effect, we are coarse-graining the functions on data from the same distribution as S.
How to calculate Pₒₚₜ ( f | S ) and Pᵦ ( f | S)
Here we give an example of the argument above, in the hope that it might aid intuition. If everything is clear though, skip ahead to the results!
Consider the dataset MNIST — the handwritten digits from 0–9 with their labels. For our experiments, we then binarise it, so even numbers are classified as 0 and odd numbers as 1.
Let the training set S consist of 10000 randomly selected images, and the test set E consist of 100 randomly selected images.
Then, consider training a DNN N on S to 100% accuracy. Record the classification (the function) on E. This is equivalent to sampling from Pₒₚₜ ( f | S ). Do this multiple times (say around 10⁵) to build up a good approximation to Pₒₚₜ ( f | S ).
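A minimal sketch of this procedure is given below. It is not the code from [1]: the dataset sizes are heavily reduced so it runs quickly, the architecture, learning rate and step cap are illustrative choices, full-batch gradient steps are used for brevity (rather than mini-batch SGD), and only 100 runs are made rather than ~10⁵.

```python
import torch
import torch.nn as nn
from collections import Counter
from torchvision import datasets

torch.manual_seed(0)

# Binarised MNIST: even digits -> 0, odd digits -> 1
mnist = datasets.MNIST(root="data", train=True, download=True)
X = mnist.data[:1100].reshape(-1, 784).float() / 255.0
y = (mnist.targets[:1100] % 2).float()
X_S, y_S = X[:1000], y[:1000]    # training set S (reduced from 10000 in [1])
X_E, y_E = X[1000:], y[1000:]    # test set E of size 100

def train_once():
    """Train a small fully connected network to 100% accuracy on S and
    return the function it expresses on E (a tuple of 100 binary labels)."""
    net = nn.Sequential(nn.Linear(784, 200), nn.ReLU(), nn.Linear(200, 1))
    opt = torch.optim.SGD(net.parameters(), lr=0.1)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(100_000):                     # cap on full-batch gradient steps
        out = net(X_S).squeeze(1)
        if ((out > 0).float() == y_S).all():     # stop once 100% accurate on S
            break
        opt.zero_grad()
        loss_fn(out, y_S).backward()
        opt.step()
    with torch.no_grad():
        return tuple((net(X_E).squeeze(1) > 0).int().tolist())

# Each run is one sample from P_opt(f|S); repeated runs estimate the distribution
n_runs = 100                                     # [1] uses around 10^5 runs
counts = Counter(train_once() for _ in range(n_runs))
for f, c in counts.most_common(3):
    errors = sum(int(a != b) for a, b in zip(f, y_E.int().tolist()))
    print(f"P_opt(f|S) ≈ {c / n_runs:.2f}, errors on E: {errors}/100")
```

Each distinct tuple of 100 labels is one function f, and its frequency across runs is the estimate of Pₒₚₜ ( f | S ).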
To calculate Pᵦ ( f | S), one needs a different approach. It would be nice to randomly sample parameters until N is 100% accurate on S, record the classification and repeat (similar to how we calculate Pₒₚₜ ( f | S )), but unfortunately this would take an infeasible amount of time. We also cannot analytically calculate the volumes Pᵦ ( f ). Instead, we approximate this process accurately using the Gaussian process approximation. See [1] for a detailed explanation of this.
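Purely for intuition, this is what the direct (but infeasible at realistic scale) rejection-sampling version of Pᵦ ( f | S) looks like; the tiny random dataset, the simple ‘true’ labelling and the small tanh network below are placeholders, and [1] replaces this whole procedure with the Gaussian process approximation:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

# Toy stand-ins for S and E (placeholders, not MNIST): 4-dimensional inputs,
# with a deliberately simple 'true' labelling on S
X_S = rng.normal(size=(8, 4))
y_S = (X_S[:, 0] > 0).astype(int)
X_E = rng.normal(size=(5, 4))

def sample_network(n_in, width=64):
    """One random draw of i.i.d. Gaussian parameters for a small tanh network;
    returns a function giving 0/1 labels for a batch of inputs."""
    W1 = rng.normal(0, 1 / np.sqrt(n_in), (n_in, width))
    b1 = rng.normal(0, 1, width)
    W2 = rng.normal(0, 1 / np.sqrt(width), (width, 1))
    return lambda X: ((np.tanh(X @ W1 + b1) @ W2).ravel() > 0).astype(int)

# Rejection sampling: keep a draw only if it labels S perfectly (100% accuracy),
# then record the function it expresses on E
counts, accepted, n_draws = Counter(), 0, 200_000
for _ in range(n_draws):
    net = sample_network(X_S.shape[1])
    if np.array_equal(net(X_S), y_S):
        counts[tuple(net(X_E))] += 1
        accepted += 1

print("acceptance rate:", accepted / n_draws)
for f, c in counts.most_common(3):
    print(f, "P_B(f|S) ≈", c / max(accepted, 1))
```

On a training set of 10000 MNIST images the acceptance rate would be astronomically small, which is why [1] uses the Gaussian process approximation (with the EP approximation where needed) instead.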
Our results
Our results can be broken down into some very simple experiments. Our main type of experiment tests whether Pₒₚₜ ( f | S ) ≈ Pᵦ ( f | S). We find that this is true over a wide range of datasets (MNIST, Fashion-MNIST, the IMDb movie review dataset and the ionosphere dataset), architectures (Fully Connected, Convolutional, LSTM), optimisers (SGD, Adam, Adadelta etc.), training schemes (including overtraining) and optimiser hyperparameters (e.g. batch size, learning rate).
Figure 5 shows the main result, for a fully connected network on MNIST. In 5a, each datapoint represents a specific function on a test set of size 100, after training on a training set S of size 10000, either with SGD or by Bayesian inference; it shows that Pₒₚₜ ( f | S ) ≈ Pᵦ ( f | S). 5b shows that Pᵦ ( f | S) varies over many orders of magnitude, and is highest for functions with good generalisation. 5c shows that simple functions generalise better. 5d shows Pᵦ ( f | S) from 5a in a different way, and 5e,f show the same type of experiment as 5a but with different hyperparameters (a different optimiser and loss function respectively).
We also show in Figure 6 that changing hyperparameters associated with the optimiser (in this case, batch size) does affect the relationship between Pₒₚₜ ( f | S ) and Pᵦ ( f | S). We find that smaller batch sizes cause functions with high Pᵦ ( f | S) to have even higher Pₒₚₜ ( f | S ) than for larger batch sizes. However, this effect is much smaller than the effect of Pᵦ ( f | S). Note that the cross-entropy loss function used in Figure 6 means that the approximations to Pᵦ ( f | S) are less accurate (due to the EP approximation), hence the less strict agreement.
Finally, Figure 7 shows that Pₒₚₜ ( f | S ) ≈ Pᵦ ( f | S) for a range of architectures and datasets — including ConvNets and LSTMs. As above, the EP approximation is required to estimate Pᵦ ( f | S), introducing error. The greater scatter present in the figures is not strong evidence against the main claim — given that the optimisers do not find functions with low Pᵦ ( f | S) it seems sensible to assume that the main contribution to their inductive bias is Pᵦ ( f | S). Also, note that Figure 5b shows that Pᵦ ( f | S) varies over 200 orders of magnitude (and equivalent figures in [1] show similar results for all datasets in Figure 7) — it would require a very strong inductive bias in SGD to overcome this, and this is not supported in the rest of the figures.
For many more experiments with different datasets, architectures and optimisers, see [1].
Conclusion
In conclusion, we have presented strong evidence that DNNs generalise because of a strong inductive bias due to the architecture. More specifically:
- There is a prior over parameters Pᵦ ( f ), which is the probability that a DNN randomly initialises to a function f. This prior is strongly biased towards simple functions.
- We can perform Bayesian inference with Pᵦ ( f ) as our prior, to give a posterior distribution over functions, Pᵦ ( f | S). This posterior distribution is also strongly biased towards simple functions (see Figure 5b).
- We then show that upon training with a stochastic optimiser like SGD, the network finds functions with probability Pₒₚₜ ( f | S )≈ Pᵦ ( f | S). See Figure 5a.
Because Pₒₚₜ ( f | S ) ≈ Pᵦ ( f | S) on a wide range of architectures, datasets and optimisers, we conclude that the effect of the optimiser is much weaker than that of the prior over parameters Pᵦ ( f ). Thus, DNNs primarily generalise because of Pᵦ ( f ), and not because of the stochastic optimisers, which are nevertheless very good at finding functions consistent with the training set (itself a very impressive property!).
That said, changing hyperparameters associated with optimisers (e.g. batch size, learning rate, overtraining) can make small changes to the approximate equality between Pₒₚₜ ( f | S ) and Pᵦ ( f | S), and thus lead to small changes in generalisation.
It seems likely that there will be some types of DNNs, or regions in parameter-space, for which the effects of the optimisers become more significant. We hope to investigate these in future work.
Feel free to send any comments or suggestions to christopher.mingard@queens.ox.ac.uk
References
[1] C. Mingard, G. Valle-Perez, J. Skalse, A. Louis. Is SGD a Bayesian Sampler? Well, almost. (2020) https://arxiv.org/abs/2006.15191
[2] C. Mingard, J. Skalse, G. Valle-Perez, D. Martinez-Rubio, V. Mikulik, A. Louis. Neural Networks are a-priori biased towards low entropy functions. (2019) https://arxiv.org/abs/1909.11522
[3] G. Valle-Perez, C. Camargo, A. Louis. Deep learning generalizes because the parameter-function map is biased towards simple functions. (2018) https://arxiv.org/abs/1805.08522
All images in this post were created by the author, most of which can also be found in [1].
Appendix
This blog post attempts to explain the work in [1] as closely as possible, but of course, some simplifications have been made in the process. I encourage interested readers to read the paper!
The experiments shown in Figures 5, 6 and 7 have, as mentioned in the main text, been performed for a large range of real-world architectures and datasets. However, there are regimes in which they have not been performed extensively, for example the chaotic regime in tanh-activated DNNs. It is entirely possible that in some of these regimes the stochastic optimiser plays a greater role in the inductive bias, and this will be the subject of future work.
Many thanks to Joar Skalse and Guillermo Valle-Perez for assistance in the process of writing this post!