We may have found a solid hypothesis to explain why extreme overparametrization is so helpful in #DeepLearning, especially if one is concerned about adversarial robustness. arxiv.org/abs/2105.12806 1/7

Jun 9, 2021 · 3:15 PM UTC

With my student extraordinaire Mark Sellke @geoishard, we prove a vast generalization of our conjectured law of robustness from last summer: there is an inherent tradeoff between the number of neurons and the smoothness of the network (see *pre-solution* video). 2/7 piped.video/watch?v=pmROV1AS…
If you squint hard enough (e.g., like a physicist), our new universal law of robustness even makes concrete predictions for real data. For example, we predict that on ImageNet you need at least 100 billion parameters (i.e., GPT-3-like scale) to possibly attain good robustness guarantees. 3/7
So what does the law actually say? Classically, interpolating n points with a p-parameter function class only requires p > n. Now what if you want to interpolate *smoothly*? We show that this simple extra robustness constraint forces overparametrization! (by a factor of d, the data dimension). 4/7
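If I recall the paper's statement correctly, the law roughly says that any p-parameter model fitting n noisy d-dimensional data points below the noise level must have Lipschitz constant on the order of sqrt(nd/p), so a target Lipschitz constant forces p on the order of nd. Here is a back-of-envelope sketch of the ImageNet prediction; the particular values of n and d below are my own illustrative assumptions, not the paper's exact accounting:

```python
def min_params_for_lipschitz(n, d, lip_target=1.0):
    """Rearranging the (recalled) bound Lip(f) >~ sqrt(n*d/p):
    to achieve Lip(f) <= lip_target you need p >~ n*d / lip_target**2."""
    return n * d / lip_target**2

# Illustrative (assumed) numbers:
n = 10**7   # roughly the order of the ImageNet training set size
d = 10**4   # an *assumed* effective data dimension, far below raw pixel count
p = min_params_for_lipschitz(n, d)
print(f"p >~ {p:.0e} parameters")  # ~1e11, i.e. GPT-3-like scale
```

The point of the sketch is only the scaling: smooth interpolation costs an extra factor of d over the classical p > n count.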
This result holds for a broad class of data: 1) the covariate x should be a mixture of "isoperimetric measures" (e.g., Gaussians), but the key here is that we can allow *many* mixture components (like n/log(n)); 2) the target labels y should have some independent noise. Reasonable?! 5/7
The real surprise to me is how general the phenomenon is. Back last summer we struggled to prove it for two-layer neural nets, but in the end the law applies to *any* (smoothly parametrized) function class. The key to unlocking the problem was to adopt a probabilist's perspective! 6/7
As always, we welcome comments! While the law itself is a rather simple mathematical statement, its interpretation is ofc fairly speculative. In fact you can check out a video by @EldanRonen explaining the paper & giving us a hard time on the speculative part 🤣 7/7 piped.video/watch?v=__Kj9HeU…
Replying to @SebastienBubeck
Do you think maximum gradient norm is the correct notion of adversarial robustness in this problem?
No, I do not think so. But it is a natural and tractable mathematical proxy. Note that the law of robustness is *wrong* if you consider, say, the average norm of the gradient rather than the max (see the last paragraph of Section 1.2).
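A toy numpy illustration of why the choice of norm matters (my own example, not one from the paper): a function that is flat almost everywhere but has one narrow spike has a huge max gradient norm (so a huge Lipschitz constant) yet a tiny average gradient norm.

```python
import numpy as np

# f is flat except for a narrow spike around x = 0:
xs = np.linspace(-1.0, 1.0, 200_001)
f = np.exp(-(xs / 0.001) ** 2)  # spike of width ~0.001

grad = np.gradient(f, xs)      # finite-difference derivative
max_norm = np.abs(grad).max()  # proxy for Lip(f): hundreds
avg_norm = np.abs(grad).mean() # average gradient norm: order 1

print(max_norm, avg_norm)  # max is orders of magnitude larger than the average
```

So a lower bound on the max gradient norm says nothing about the average, which is why the law only speaks about the worst case.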
Replying to @SebastienBubeck
Very interesting! Maybe a naive question: I understand the Lipschitz constant as a measure of robustness to perturbations in the parameters, but is there a way to see it as a good measure of robustness to feature/label perturbations?
Not sure I understand: the Lipschitz constant Lip(f) measures robustness to *feature* perturbations (in the sense of perturbations of the covariate/input x). In the formal statement we also need to control J, the Lipschitz constant of the *parametrization*. Are you talking about J?
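To make the distinction between the two constants concrete, here is a toy example of my own (not from the paper): for a linear model f(x; w) = w·x, the gradient in the input x is w, so Lip(f) is ‖w‖, while the gradient in the parameters w is x, so the parametrization's Lipschitz constant J (at a fixed input) is ‖x‖.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=5)  # parameters
x = rng.normal(size=5)  # input / covariate

def f(x, w):
    return w @ x

# Lip(f) w.r.t. the *input* x: the gradient is w, so it's ||w||.
lip_f = np.linalg.norm(w)

# Lipschitz constant of the *parametrization* w -> f(x; w):
# the gradient w.r.t. w is x, so at this fixed input it's ||x||.
J = np.linalg.norm(x)

print(lip_f, J)  # two different constants controlling two different perturbations
```

Two different quantities: Lip(f) bounds how much f moves when the input is perturbed, J bounds how much f moves when the parameters are perturbed.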