“Rip Van Winkle’s Razor, a Simple New Estimate for Adaptive Data Analysis”, Sanjeev Arora, Yi Zhang, 2021-04-07:

The meta-overfitting error (MOE) of a model is the difference between its average error on the test set and its expected error on the full distribution. (It is closely related to the false discovery rate in statistics.)

This blog post concerns our new paper, which gives meaningful upper bounds on the MOE of popular deep net architectures, whereas prior ideas from adaptive data analysis gave no nontrivial estimates. We call our estimate Rip van Winkle’s Razor, a name that combines Occam’s Razor with the fictional character who slept for 20 years.

MOE bounds and description length: The starting point of our work is the following classical concentration bound:

Folklore Theorem: With high probability over the choice of a test set of size N, the MOE of every model with description length at most k bits is 𝒪(√(k/N)).
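The theorem follows from Hoeffding’s inequality plus a union bound over the at most 2^k models describable in k bits. A minimal sketch (the function name and the confidence parameter δ are our own, not the paper’s):

```python
import math

def moe_bound(k_bits, n_test, delta=1e-6):
    """Hoeffding + union bound: at most 2**k_bits models have descriptions
    of length <= k_bits, and each model's test error is a mean of n_test
    i.i.d. {0,1} samples, so with probability >= 1 - delta EVERY such
    model satisfies |test error - true error| <= the returned value."""
    return math.sqrt((k_bits * math.log(2) + math.log(2 / delta)) / (2 * n_test))
```

For instance, `moe_bound(1032, 100_000)` evaluates to roughly 0.06, showing the 𝒪(√(k/N)) scaling of the theorem.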

At first sight this doesn’t seem to help us, because one cannot imagine modern deep nets having a short description. The most obvious description involves reporting the values of the net parameters, which requires millions or even hundreds of millions of bits, resulting in a vacuous upper bound on MOE. Another obvious description would be the computer program used to produce the model from the (publicly available) training and validation sets. However, these programs usually rely on imported libraries through many layers of encapsulation, so the effective program size is quite large as well.

Rip van Winkle’s Razor: Our new upper bound involves a more careful definition of description length: the length of the shortest description that allows a referee to reproduce a model of similar performance using the (universally available) training and validation datasets. Here the referee is imagined as Rip van Winkle: an expert who fell asleep just before the deep learning revolution, so the description may presume only knowledge available circa 2012.

Estimating MOE of ResNet-152: As an illustration, here we provide a suitable description allowing Rip van Winkle to reproduce a mainstream ImageNet model, ResNet-152, which achieves 4.49% top-5 test error.

The description consists of three types of expressions: English phrases, math equations, and directed graphs. In the paper, we describe in detail how to encode each of them into binary strings and count their lengths. The allowed vocabulary includes primitive concepts that were known before 2012, such as CONV, MaxPool, ReLU, and SGD, as well as a graph-theoretic notation/shorthand for describing net architecture. Newly introduced concepts, including Batch-Norm, Layer, and Block, are defined precisely using math, English, and the primitive concepts.

Description for reproducing ResNet-152
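As a toy illustration of this kind of bit accounting (not the paper’s exact coding scheme), a fixed-length code over a vocabulary of V allowed symbols charges ⌈log₂ V⌉ bits per token of the description:

```python
import math

# Toy accounting, NOT the paper's exact coding scheme: with a fixed-length
# code over a vocabulary of `vocab_size` allowed symbols, each token of
# the description costs ceil(log2(vocab_size)) bits.
def description_bits(tokens, vocab_size):
    bits_per_token = math.ceil(math.log2(vocab_size))
    return len(tokens) * bits_per_token

# A 4-token fragment over a hypothetical 128-symbol pre-2012 vocabulary:
description_bits(["CONV", "MaxPool", "ReLU", "SGD"], 128)  # 4 tokens * 7 bits = 28 bits
```

The paper’s actual encoding is more refined, but the principle is the same: the richer the pre-agreed vocabulary, the fewer bits each new model costs to describe.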

According to our estimate, the length of the above description is 1032 bits, which translates into an upper bound on meta-overfitting error of merely 5%! This suggests the real top-5 error of the model on the full distribution is at most 9.49%. In the paper we also provide a 980-bit description for reproducing DenseNet-264, which leads to a 5.06% upper bound on its meta-overfitting error.
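The arithmetic behind these figures can be sanity-checked against the folklore bound. The test-set size of N = 100,000 and the Hoeffding-style constants below are our assumptions, not the paper’s exact derivation, so this only reproduces the order of magnitude of the 5% figure:

```python
import math

# Rough sanity check of the claimed figures, assuming an ImageNet test
# set of N = 100,000 images; the paper's exact constants differ.
k, n = 1032, 100_000
moe = math.sqrt(k * math.log(2) / (2 * n))  # Hoeffding-style sqrt(k*ln2 / 2N)
ceiling = 0.0449 + moe                      # test top-5 error plus the MOE bound
print(f"MOE bound ~{moe:.1%}, true-error ceiling ~{ceiling:.1%}")
```

With these assumptions the bound comes out near 6%, in the same ballpark as the paper’s 5%, and the implied ceiling on true top-5 error is around 10%.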

…But the important point is that unlike existing bounds in adaptive data analysis, there is no dependence on t, the number of models previously tested on the same test set, and the bound is non-vacuous. Our estimates indicate that the issue of meta-overfitting on ImageNet is mild for these mainstream models. The reason is that despite the vast number of parameters and hyper-parameters in today’s deep nets, the information content of these models is not high given knowledge circa 2012.