We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have.
Deep networks are able to sequentially map portions of each layer’s input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network’s depth.
This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks.
We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
Figure 1: Binary classification using a shallow model with 20 hidden units (solid line) and a deep model with two layers of 10 units each (dashed line). The right panel shows a close-up of the left panel. Filled markers indicate errors made by the shallow model.
In this respect, Pascanu et al. (2013) reported a theoretical result on the complexity of functions computable by deep feedforward networks with rectifier units. They showed that, in the asymptotic limit of many hidden layers, deep networks are able to separate their input space into exponentially more linear response regions than their shallow counterparts, despite using the same number of computational units.
Building on the ideas of Pascanu et al. (2013), we develop a general framework for analyzing deep models with piecewise linear activations. The intermediary layers of these models are able to map several pieces of their inputs into the same output. The layer-wise composition of the functions computed in this way re-uses low-level computations exponentially often as the number of layers increases. This key property enables deep networks to compute highly complex and structured functions. We underpin this idea by estimating the number of linear regions of functions computable by two important types of piecewise linear networks: with rectifier units and with maxout units.
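As a concrete illustration of this re-use (our sketch, not code from the paper), consider the classical tent-map construction: two rectifier units compute a tent map on [0, 1], and composing this map L times folds each layer's output back onto [0, 1], producing 2^L linear pieces from only 2L units. A shallow rectifier layer with the same 2L units on a 1-D input can produce at most 2L + 1 pieces.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def tent(x):
    # One hidden layer of two rectifier units computing the tent map on [0, 1]:
    # t(x) = 2*relu(x) - 4*relu(x - 0.5), i.e. 2x on [0, 1/2] and 2 - 2x on [1/2, 1].
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def deep_tent(x, n_layers):
    # Composing the tent map n_layers times re-uses the same two-unit
    # computation in every fold: 2**n_layers linear pieces from 2*n_layers units.
    for _ in range(n_layers):
        x = tent(x)
    return x

def count_linear_pieces(f, n_grid):
    # Count linear pieces of f on [0, 1] by detecting slope changes on a grid
    # chosen so that every breakpoint j / 2**n_layers lies exactly on the grid.
    x = np.linspace(0.0, 1.0, n_grid)
    slopes = np.round(np.diff(f(x)) / np.diff(x), 6)
    return 1 + int(np.sum(slopes[1:] != slopes[:-1]))
```

For instance, four tent layers (8 units in total) yield 16 linear pieces, whereas a single rectifier layer of 8 units on a 1-D input is limited to 9.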
Our results for the complexity of deep rectifier networks yield a substantial improvement over the previous results on rectifier networks mentioned above, showing a favorable behavior of deep over shallow networks even with a moderate number of hidden layers. Our analysis of deep rectifier and maxout networks serves as a platform to study a broad variety of related networks, such as convolutional networks.
The number of linear regions of the functions that can be computed by a given model is a measure of the model’s flexibility. An example of this is given in Figure 1, which compares the learnt decision boundary of a single-layer and a two-layer model with the same number of hidden units (see details in Appendix F). This illustrates the advantage of depth; the deep model captures the desired boundary more accurately, approximating it with a larger number of linear pieces.
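The flexibility gap can be made quantitative. By Zaslavsky's theorem, an arrangement of n hyperplanes in R^d has at most the sum of binomial coefficients C(n, j) for j = 0, …, d regions, which upper-bounds the number of linear regions of a one-hidden-layer rectifier network with n units. A minimal sketch (ours, not from the paper; the 2-D input dimension matches the planar setting of Figure 1):

```python
from math import comb

def shallow_region_bound(n_units, n_inputs):
    # Zaslavsky's theorem: n hyperplanes in R^d bound at most
    # sum_{j=0}^{d} C(n, j) regions, so a one-hidden-layer rectifier
    # network with n_units has at most this many linear regions.
    return sum(comb(n_units, j) for j in range(n_inputs + 1))

def n_parameters(layer_widths):
    # Weights plus biases of a fully connected network with the given widths.
    return sum((fan_in + 1) * fan_out
               for fan_in, fan_out in zip(layer_widths, layer_widths[1:]))
```

Under this bound, a shallow model with 20 hidden units on 2-D inputs (as in Figure 1) has at most `shallow_region_bound(20, 2) == 211` linear regions from `n_parameters([2, 20, 1]) == 81` parameters, while the counting arguments developed below show that deep models can exceed such shallow bounds.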
Figure 2: (a) Space folding of 2-D Euclidean space along the two axes.
(b) An illustration of how the top-level partitioning (on the right) is replicated to the original input space (left).
(c) Identification of regions across the layers of a deep model.
Figure 3: Space folding of 2-D space in a non-trivial way.
Note how the folding can potentially identify symmetries in the boundary that it needs to learn.
As noted earlier, deep networks are able to identify an exponential number of input neighborhoods by mapping them to a common output of some intermediary hidden layer. The computations carried out on the activations of this intermediary layer are replicated many times, once in each of the identified neighborhoods. This allows the networks to compute very complex-looking functions even when they are defined with relatively few parameters.
The number of parameters is an upper bound for the dimension of the set of functions computable by a network, and a small number of parameters means that the class of computable functions has a low dimension. The set of functions computable by a deep feedforward piecewise linear network, although low dimensional, achieves exponential complexity by re-using and composing features from layer to layer.
Each hidden layer of a deep neural network can be associated with a folding operator. Each hidden layer folds the space of activations of the previous layer. In turn, a deep neural network effectively folds its input-space recursively, starting with the first layer. The consequence of this recursive folding is that any function computed on the final folded space will apply to all the collapsed subsets identified by the map corresponding to the succession of foldings. This means that in a deep model any partitioning of the last layer’s image-space is replicated in all input-space regions which are identified by the succession of foldings. Figure 2b offers an illustration of this replication property.
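This replication property can be sketched in a few lines (our illustration, not the paper's code). A first layer computing coordinate-wise absolute values, realizable with rectifier pairs since |x| = relu(x) + relu(-x), folds the four quadrants of the plane onto the positive quadrant; any partition defined on the folded space, here a hypothetical disc-membership rule, is then automatically replicated in all four preimage regions:

```python
import numpy as np

def fold(points):
    # Folding layer: coordinate-wise absolute value, realizable with rectifier
    # pairs since |x| = relu(x) + relu(-x). All four quadrants of the plane
    # are mapped onto the positive quadrant.
    return np.abs(points)

def top_partition(points):
    # A hypothetical partition of the folded space: label 1 inside the unit
    # disc around (2, 2), label 0 elsewhere.
    centre = np.array([2.0, 2.0])
    return (np.linalg.norm(points - centre, axis=1) < 1.0).astype(int)

def deep_model(points):
    # Whatever top_partition computes on the folded space is replicated in
    # every input region identified by the fold.
    return top_partition(fold(points))
```

The four mirror images (±2, ±2) receive identical labels even though `top_partition` only ever "sees" the positive quadrant, which is exactly the replication illustrated in Figure 2b.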