“A Universal Law of Robustness via Isoperimetry”, 2021-05-26:
Classically, data interpolation with a parameterized model class is possible as long as the number of parameters is larger than the number of equations to be satisfied. A puzzling phenomenon in deep learning is that models are trained with many more parameters than this classical theory would suggest.
We propose a theoretical explanation for this phenomenon. We prove that for a broad class of data distributions and model classes, overparameterization is necessary if one wants to interpolate the data smoothly. Namely, we show that smooth interpolation requires d times more parameters than mere interpolation, where d is the ambient data dimension. We prove this universal law of robustness for any smoothly parameterized function class with polynomial-size weights, and any covariate distribution verifying isoperimetry (i.e., satisfying a concentration-of-measure condition for Lipschitz functions).
In the case of two-layer neural networks and Gaussian covariates, this law was conjectured in prior work from 2021. We also give an interpretation of our result as an improved generalization bound for model classes consisting of smooth functions.
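The tradeoff behind this claim can be sketched as follows (schematic, up to logarithmic factors and constants; this is a paraphrase of the paper's main bound, not a verbatim quotation). Here n is the sample size, d the ambient dimension, and p the number of parameters:

```latex
% Any model in the class that fits n noisy d-dimensional samples
% below the noise level must satisfy (up to log factors):
\mathrm{Lip}(f) \;\gtrsim\; \sqrt{\frac{nd}{p}}
% Hence a smooth (O(1)-Lipschitz) interpolant needs p \gtrsim nd parameters,
% whereas mere interpolation needs only p \gtrsim n.
```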
…To put Theorem 1 in context, we compare to the empirical results presented in [MMS+18]. In the latter work, they consider the MNIST dataset, which consists of n = 6×10^4 images in dimension 28^2 = 784. They robustly trained different architectures, and reported in Figure 4 the size of the architecture versus the obtained robust test accuracy (third plot from the left). One can see a sharp transition from roughly 10% accuracy to roughly 90% accuracy at around 2×10^5 parameters (capacity scale 4 in their notation). Moreover, the robust accuracy keeps climbing with more parameters, to roughly 95% accuracy at roughly 3×10^6 parameters…In addition to [MMS+18], several recent works have experimentally studied the relationship between a neural network's scale and its achieved robustness, see e.g. [NBA+18, XY20, GQU+20]. It has been consistently reported that larger networks help tremendously for robustness, beyond what is typically seen for classical non-robust accuracy.
…With all the caveats described above, we can now look at the numbers as follows: in the [MMS+18] experiments, smooth models with accuracy below the noise level are attained with a number of parameters somewhere in the range 2×10^5–3×10^6 parameters (possibly even larger depending on the interpretation of the noise level), while the law of robustness would predict any such model must have at least nd parameters, and this latter quantity should be somewhere in the range 10^6–10^7 (corresponding to an effective dimension 15–150). While far from perfect, the law of robustness prediction is far more accurate than the classical rule of thumb # parameters ≃ # equations (which here would predict a number of parameters of the order 10^4).
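The MNIST arithmetic above can be checked with a short back-of-envelope script (a sketch using only the numbers quoted in the text; the effective-dimension range 15–150 is the text's assumption, not something measured here):

```python
# Back-of-envelope check of the law-of-robustness prediction for MNIST.
n = 6e4  # number of MNIST training images, as quoted above

for d_eff in (15, 150):  # assumed effective-dimension range from the text
    # Law of robustness: a smooth interpolant needs at least ~ n*d parameters.
    print(f"d_eff = {d_eff:3d}: predicted minimum ~ {n * d_eff:.0e} parameters")

# Classical rule of thumb (# parameters ~ # equations) would instead give ~ n = 6e4.
```

This reproduces the quoted range 10^6-ish: n·d spans 9×10^5 to 9×10^6 parameters, versus the classical prediction of ~6×10^4.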
Perhaps more interestingly, one could apply a similar reasoning to the ImageNet dataset, which consists of 1.4×10^7 images of size roughly 2×10^5. Estimating that the effective dimension is a couple of orders of magnitude smaller than this size, the law of robustness predicts that to obtain good robust models on ImageNet one would need at least 10^10–10^11 parameters. This number is larger than the size of current neural networks trained robustly for this task, which sport between 10^8–10^9 parameters. Thus, we arrive at the tantalizing possibility that robust models for ImageNet do not exist yet simply because we are a couple of orders of magnitude off in the current scale of neural networks trained for this task.
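The same arithmetic for ImageNet (a sketch; the effective-dimension range 10^3–10^4 is one reading of "a couple of orders of magnitude smaller" than the ambient size 2×10^5, chosen here as an assumption to reproduce the quoted 10^10–10^11 range):

```python
# Back-of-envelope law-of-robustness prediction for ImageNet.
n = 1.4e7  # number of ImageNet images, as quoted above

for d_eff in (1e3, 1e4):  # assumed effective-dimension range (see lead-in)
    # Predicted minimum parameter count for a smooth interpolant: ~ n*d.
    print(f"d_eff = {d_eff:.0e}: predicted minimum ~ {n * d_eff:.1e} parameters")

# Current robustly-trained ImageNet models: ~1e8-1e9 parameters,
# i.e., one to three orders of magnitude below this prediction.
```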