“Deep Learning Meets Nonparametric Regression: Are Weight-Decayed DNNs Locally Adaptive?”, 2022-04-20:
We study the theory of neural networks (NNs) through the lens of classical nonparametric regression problems, with a focus on NNs’ ability to adaptively estimate functions with heterogeneous smoothness—a property of functions in Besov or Bounded Variation (BV) classes. Existing work on this problem requires tuning the NN architecture based on the function space and the sample size.
We consider a “Parallel NN” variant of deep ReLU networks (an ensemble of deep subnetworks) and show that standard weight decay is equivalent to promoting ℓp-sparsity (0 < p < 1) of the coefficient vector of an end-to-end learned dictionary of basis functions.
Using this equivalence, we further establish that by tuning only the weight decay, such a Parallel NN achieves an estimation error arbitrarily close to the minimax rates for both the Besov and BV classes. Notably, the rate gets exponentially closer to minimax optimal as the NN gets deeper.
Our research sheds new light on why depth matters and how NNs are more powerful than kernel methods.
…Our main contributions are:
We prove that the (standard) weight decay in training an L-layer parallel ReLU-activated neural network is equivalent to a sparse 𝓁p penalty term (where p = 2⁄L) on the linear coefficients of a learned representation.
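The intuition behind this equivalence can be checked numerically in the scalar case: if a dictionary coefficient c is realized as a product of L weights, then by the AM–GM inequality the smallest possible weight-decay penalty (sum of squared weights) for representing c is L·|c|^(2/L), i.e., an ℓ_{2/L} penalty on c up to a constant. A minimal pure-Python sketch (the function names are illustrative, not from the paper):

```python
import math
import random

def factorization_cost(ws, c):
    """Weight-decay cost (sum of squared weights) of representing the
    coefficient c as a product of L weights: the first L-1 factors are
    given, and the last factor is chosen so the product equals c."""
    last = c / math.prod(ws)
    return sum(w * w for w in ws) + last * last

random.seed(0)
c, L = 0.3, 4

# The balanced factorization |w_i| = |c|^(1/L) attains the closed form
# L * |c|^(2/L), which is an l_{2/L} penalty on c up to scale.
balanced = [abs(c) ** (1 / L)] * (L - 1)
closed_form = L * abs(c) ** (2 / L)
assert abs(factorization_cost(balanced, c) - closed_form) < 1e-12

# By AM-GM, no other (positive) factorization does better, so minimizing
# weight decay over the product parametrization yields the l_{2/L} penalty.
for _ in range(10000):
    ws = [math.exp(random.uniform(-2.0, 2.0)) for _ in range(L - 1)]
    assert factorization_cost(ws, c) >= closed_form - 1e-9

print("minimal cost:", closed_form)  # = L * |c|^(2/L)
```

As L grows, p = 2/L shrinks toward 0, so the induced penalty becomes increasingly sparsity-promoting—consistent with the claim that depth helps.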
We show that neural networks can approximate B-spline basis functions of any order without the order parameter having to be chosen manually. In other words, neural networks can adapt to functions with different orders of smoothness, and even to functions whose smoothness varies across regions of their domain.
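For the lowest-order case the connection is exact, not just approximate: a first-order B-spline (the “hat” function) is itself a one-hidden-layer ReLU network with three units. A minimal sketch (higher-order B-splines are only approximated, and require deeper networks to emulate the multiplications involved):

```python
def relu(x):
    """ReLU activation."""
    return max(0.0, x)

def hat(x):
    """First-order B-spline on knots 0, 1, 2, written exactly as a
    one-hidden-layer ReLU network with three units and weights 1, -2, 1."""
    return relu(x) - 2.0 * relu(x - 1.0) + relu(x - 2.0)

for x in (-0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5):
    print(x, hat(x))  # hat(1.0) -> 1.0; zero outside [0, 2]
```

Shifting and scaling the inputs to such units yields a full spline dictionary, which is the kind of learned basis the parallel architecture selects from via weight decay.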
We show that the estimation error of a weight-decayed parallel ReLU neural network decreases polynomially in the number of samples, up to a constant, for estimating functions with heterogeneous smoothness in both the BV and Besov classes, and that the polynomial exponent in the error rate is close to the minimax-optimal one. Notably, the method requires tuning only the weight decay parameter.
We find that deeper models achieve error rates closer to optimal. This result helps explain why deep neural networks empirically outperform shallow ones.
The above results separate NNs from all linear methods, such as kernel ridge regression. To the best of our knowledge, we are the first to demonstrate that standard techniques (weight decay and ReLU activation) suffice for DNNs to achieve near-optimal rates for estimating BV and Besov functions.
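The separation from linear methods echoes a classical gap in nonparametric regression, sketched here for a total-variation ball (these are the standard rates from the locally adaptive estimation literature, stated for illustration rather than taken from this paper):

```latex
% Minimax rate over a TV ball, attainable by locally adaptive
% (nonlinear) estimators such as wavelet thresholding:
\inf_{\hat f} \sup_{f \in \mathrm{BV}(1)}
  \mathbb{E}\,\|\hat f - f\|_n^2 \;\asymp\; n^{-2/3}

% Best achievable by any linear estimator (kernel ridge regression,
% smoothing splines, ...), which cannot adapt to local smoothness:
\inf_{\hat f \,\text{linear}} \sup_{f \in \mathrm{BV}(1)}
  \mathbb{E}\,\|\hat f - f\|_n^2 \;\asymp\; n^{-1/2}
```

The paper's claim is that a weight-decayed parallel ReLU network sits on the nonlinear side of this gap, with its exponent approaching the minimax one as depth grows.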