[Paper: Nakkiran et al. 2019] This double descent effect is often avoided through careful regularization. While this behavior appears to be fairly universal, we don’t yet fully understand why it happens, and view further study of this phenomenon as an important research direction.
Many classes of modern deep learning models, including CNNs, ResNets, and transformers, exhibit the previously observed double descent phenomenon when not using early stopping or regularization. The peak occurs predictably at a “critical regime”, where the models are barely able to fit the training set. As we increase the number of parameters in a neural network, the test error first decreases, then increases, and then, just as the model becomes able to fit the train set, undergoes a second descent.
There is a regime where bigger models are worse.
There is a regime where more samples hurt.
There is a regime where training longer reverses overfitting.
…In general, the peak of test error appears systematically when models are just barely able to fit the train set.
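This shape can already be seen in a small synthetic setting. Below is a toy, numpy-only sketch (our own illustration, not the paper’s setup): a random-features regression model in which the number of random features stands in for “number of parameters”. Sweeping it past the number of training samples often produces the peak-then-second-descent curve in test error; the constants and helper names are arbitrary choices made for this sketch.

```python
# Toy model-size sweep (not the paper's setup): random-features regression,
# where the number of random features plays the role of "number of parameters".
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 100, 1000, 10
w_true = rng.normal(size=d)

def make_data(n):
    X = rng.normal(size=(n, d))
    y = np.sin(X @ w_true) + 0.1 * rng.normal(size=n)   # noisy nonlinear target
    return X, y

X_tr, y_tr = make_data(n_train)
X_te, y_te = make_data(n_test)

for n_features in [10, 50, 90, 100, 110, 200, 1000]:
    W = rng.normal(size=(d, n_features))                 # fixed random first layer
    feats = lambda X: np.tanh(X @ W)                     # random features
    # lstsq gives the minimum-norm least-squares fit; in the over-parameterized
    # regime this mimics the solution gradient descent finds from zero init.
    w, *_ = np.linalg.lstsq(feats(X_tr), y_tr, rcond=None)
    tr_err = np.mean((feats(X_tr) @ w - y_tr) ** 2)
    te_err = np.mean((feats(X_te) @ w - y_te) ** 2)
    print(f"{n_features:5d} features  train MSE {tr_err:.3f}  test MSE {te_err:.3f}")
```

The test error typically peaks when the number of features is close to the number of training samples, i.e. exactly where the model is just barely able to fit the train set.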
Overfitting a degree-1000 polynomial to samples from a cubic curve works.
Our intuition is that, for models at the interpolation threshold, there is effectively only one model that fits the train data, and forcing it to fit even slightly noisy or misspecified labels will destroy its global structure. That is, there are no “good models” which both interpolate the train set and perform well on the test set. However, in the over-parameterized regime, there are many models that fit the train set and there exist such good models. Moreover, the implicit bias of stochastic gradient descent (SGD) leads it to such good models, for reasons we don’t yet understand.
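The simplest setting where this implicit bias is well understood is linear least squares: gradient descent started from zero converges to the minimum-norm solution that interpolates the data, i.e. the pseudoinverse solution. A minimal numpy check (our own sketch, not from the paper):

```python
# Over-parameterized linear least squares: gradient descent from zero init
# converges to the minimum-norm interpolating solution (the pinv solution).
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                               # more parameters than samples
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

w = np.zeros(p)                              # zero initialization
lr = 0.05
for _ in range(10_000):                      # plain full-batch gradient descent
    w -= lr * X.T @ (X @ w - y) / n

w_min_norm = np.linalg.pinv(X) @ y           # minimum-norm interpolator
print("train residual:", np.linalg.norm(X @ w - y))                      # ~ 0
print("distance to min-norm solution:", np.linalg.norm(w - w_min_norm))  # ~ 0
```

Why this particular bias also selects models that generalize well in deep networks is the part that remains unexplained.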
TL;DR: Our paper shows that double descent occurs in conventional modern deep learning settings: visual classification in the presence of label noise (CIFAR-10, CIFAR-100) and machine translation (IWSLT’14 and WMT’14). As we increase the number of parameters in a neural network, initially the test error decreases, then increases, and then, just as the model is able to fit the train set, it undergoes a second descent, again decreasing as the number of parameters increases. This behavior also extends over train epochs, where a single model undergoes double descent in test error over the course of training. Surprisingly (at least to us!), we show these phenomena can lead to a regime where “more data hurts”: training a deep network on a larger train set actually performs worse.
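To get a feel for the epoch-wise version, one can log test error after every pass over noisy training data and watch for a rise followed by a second fall. The sketch below is a toy stand-in, not the paper’s experiments; the dataset, model width, and noise level are arbitrary choices of ours.

```python
# Toy epoch-wise view (not the paper's setup): train a small MLP on data with
# 15% label noise and record test error after every pass over the data.
import warnings
import numpy as np
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

warnings.filterwarnings("ignore", category=ConvergenceWarning)  # expected with max_iter=1

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
flip = rng.random(len(y_tr)) < 0.15            # label noise on the train set only
y_tr = np.where(flip, 1 - y_tr, y_tr)

# max_iter=1 plus warm_start=True means each call to fit() runs one more epoch
# while keeping the current weights.
clf = MLPClassifier(hidden_layer_sizes=(512,), max_iter=1, warm_start=True,
                    random_state=0)
for epoch in range(1, 201):
    clf.fit(X_tr, y_tr)
    if epoch % 20 == 0:
        print(f"epoch {epoch:3d}  test error {1 - clf.score(X_te, y_te):.3f}")
```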
…It seems that the higher the degree, the worse things are, but what happens if we go even higher? It seems like a crazy idea—why would we increase the degree beyond the number of samples? But it corresponds to the practice of having many more parameters than training samples in modern deep learning. Just like in deep learning, when the degree is larger than the number of samples, there is more than one polynomial that fits the data—but we choose a specific one: the one found by running gradient descent. Here is what happens if we do this for degree 1000, fitting a polynomial using gradient descent (see this notebook):
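As a rough stand-in for that notebook (not the authors’ exact code), one can set up the degree-1000 fit in the Legendre basis on [-1, 1]; here lstsq returns the minimum-norm interpolating fit, which is what gradient descent from zero initialization converges to for this linear-in-coefficients problem. How tightly the result hugs the cubic depends on the basis and scaling, so treat this only as an illustration.

```python
# Rough stand-in for the notebook: degree-1000 polynomial fit to noisy samples
# of a cubic, written in the Legendre basis on [-1, 1]. lstsq gives the
# minimum-norm interpolating fit (what gradient descent from zero reaches).
import numpy as np
from numpy.polynomial import legendre

rng = np.random.default_rng(0)
n_samples, degree = 20, 1000

x_tr = np.sort(rng.uniform(-1, 1, n_samples))
y_tr = x_tr**3 - x_tr + 0.05 * rng.normal(size=n_samples)     # noisy cubic

Phi = legendre.legvander(x_tr, degree)                        # (n_samples, degree + 1)
coeffs, *_ = np.linalg.lstsq(Phi, y_tr, rcond=None)           # minimum-norm fit

x_grid = np.linspace(-1, 1, 500)
y_fit = legendre.legvander(x_grid, degree) @ coeffs
print("max train error:", np.abs(Phi @ coeffs - y_tr).max())  # ~ 0: interpolates
print("max gap to the true cubic:",
      np.abs(y_fit - (x_grid**3 - x_grid)).max())
```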
We still fit all the training points, but now we do so in a more controlled way that actually tracks the ground truth quite closely. We see that, despite what we learn in statistics textbooks, sometimes overfitting is not that bad, as long as you go “all in” rather than “barely overfitting” the data. That is, overfitting doesn’t hurt us if we take the number of parameters to be much larger than what is needed to just fit the training set—and in fact, as we see in deep learning, larger models are often better.