We may have found a solid hypothesis to explain why extreme overparametrization is so helpful in #DeepLearning, especially if one is concerned about adversarial robustness. arxiv.org/abs/2105.12806 1/7

Jun 9, 2021 · 3:15 PM UTC

With my student extraordinaire Mark Sellke @geoishard, we prove a vast generalization of our conjectured law of robustness from last summer: there is an inherent tradeoff between the number of neurons and the smoothness of the network (see *pre-solution* video). 2/7 piped.video/watch?v=pmROV1AS…
If you squint hard enough (e.g., like a physicist), our new universal law of robustness even makes concrete predictions for real data. For example, we predict that on ImageNet you need at least 100 billion parameters (i.e., GPT-3-like scale) to possibly attain good robustness guarantees. 3/7
So what does the law actually say? Classically, interpolating n points with a p-parameter function class only requires p > n. Now what if you want to interpolate *smoothly*? We show that this simple extra robustness constraint forces overparametrization! (by a factor of d, the data dimension). 4/7
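If I recall the paper's statement correctly, the law roughly says that any p-parameter model fitting n noisy d-dimensional data points below the noise level must have Lipschitz constant on the order of sqrt(nd/p), so a target Lipschitz constant forces p on the order of nd. Here is a back-of-envelope sketch of the ImageNet prediction; the particular values of n and d below are my own illustrative assumptions, not the paper's exact accounting:

```python
def min_params_for_lipschitz(n, d, lip_target=1.0):
    """Rearranging the (recalled) bound Lip(f) >~ sqrt(n*d/p):
    to achieve Lip(f) <= lip_target you need p >~ n*d / lip_target**2."""
    return n * d / lip_target**2

# Illustrative (assumed) numbers:
n = 10**7   # roughly the order of the ImageNet training set size
d = 10**4   # an *assumed* effective data dimension, far below raw pixel count
p = min_params_for_lipschitz(n, d)
print(f"p >~ {p:.0e} parameters")  # ~1e11, i.e. GPT-3-like scale
```

The point of the sketch is only the scaling: smooth interpolation costs an extra factor of d over the classical p > n count.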
This result holds for a broad class of data: 1) the covariate x should be a mixture of "isoperimetric measures" (e.g., Gaussians), but the key here is that we can allow *many* mixture components (like n/log(n)); 2) the target labels y should have some independent noise. Reasonable?! 5/7
The real surprise to me is how general the phenomenon is. Back last summer we struggled to prove it for two-layer neural nets, but in the end the law applies to *any* (smoothly parametrized) function class. The key to unlocking the problem was to adopt a probabilist's perspective! 6/7
As always, we welcome comments! While the law itself is a rather simple mathematical statement, its interpretation is ofc fairly speculative. In fact you can check out a video by @EldanRonen explaining the paper & giving us a hard time on the speculative part 🤣 7/7 piped.video/watch?v=__Kj9HeU…
Replying to @SebastienBubeck
Do you think maximum gradient norm is the correct notion of adversarial robustness in this problem?
No, I do not think so. But it is a natural and tractable mathematical proxy. Note that the law of robustness is *wrong* if you consider, say, the average norm of the gradient rather than the max (see the last paragraph of Section 1.2).
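A toy numpy illustration of why the choice of norm matters (my own example, not one from the paper): a function that is flat almost everywhere but has one narrow spike has a huge max gradient norm (so a huge Lipschitz constant) yet a tiny average gradient norm.

```python
import numpy as np

# f is flat except for a narrow spike around x = 0:
xs = np.linspace(-1.0, 1.0, 200_001)
f = np.exp(-(xs / 0.001) ** 2)  # spike of width ~0.001

grad = np.gradient(f, xs)      # finite-difference derivative
max_norm = np.abs(grad).max()  # proxy for Lip(f): hundreds
avg_norm = np.abs(grad).mean() # average gradient norm: order 1

print(max_norm, avg_norm)  # max is orders of magnitude larger than the average
```

So a lower bound on the max gradient norm says nothing about the average, which is why the law only speaks about the worst case.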
Replying to @SebastienBubeck
Very interesting! Maybe a naive question: I understand the Lipschitz constant as a measure of robustness to perturbations in the parameters, but is there a way to see it as a good measure of robustness to feature/label perturbations?
Not sure I understand: the Lipschitz constant Lip(f) measures robustness to *feature* perturbations (in the sense of perturbations of the covariate/input x). In the formal statement we also need to control J, the Lipschitz constant of the *parametrization*. Are you talking about J?
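To make the distinction between the two constants concrete, here is a toy example of my own (not from the paper): for a linear model f(x; w) = w·x, the gradient in the input x is w, so Lip(f) is ‖w‖, while the gradient in the parameters w is x, so the parametrization's Lipschitz constant J (at a fixed input) is ‖x‖.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=5)  # parameters
x = rng.normal(size=5)  # input / covariate

def f(x, w):
    return w @ x

# Lip(f) w.r.t. the *input* x: the gradient is w, so it's ||w||.
lip_f = np.linalg.norm(w)

# Lipschitz constant of the *parametrization* w -> f(x; w):
# the gradient w.r.t. w is x, so at this fixed input it's ||x||.
J = np.linalg.norm(x)

print(lip_f, J)  # two different constants controlling two different perturbations
```

Two different quantities: Lip(f) bounds how much f moves when the input is perturbed, J bounds how much f moves when the parameters are perturbed.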