Recently, multiple successful applications of deep neural networks to pattern recognition problems (Schmidhuber 2015, LeCun et al. 2015) have revived active interest in theoretical properties of such networks, in particular their expressive power. It has been argued that deep networks may be more expressive than shallow ones of comparable size (see, e.g., Delalleau and Bengio 2011, Raghu et al. 2016, Montufar et al. 2014, Bianchini and Scarselli 2014, Telgarsky 2016). In contrast to a shallow network, a deep one can be viewed as a long sequence of non-commutative transformations, which is a natural setting for high expressiveness (cf. the well-known Solovay-Kitaev theorem on fast approximation of arbitrary quantum operations by sequences of non-commutative gates; see Kitaev et al. 2002, Dawson and Nielsen 2006).
There are various ways to characterize the expressive power of networks. Delalleau and Bengio (2011) consider sum-product networks and prove for certain classes of polynomials that they are much more easily represented by deep networks than by shallow networks. Montufar et al. (2014) estimate the number of linear regions of the functions computed by networks. Bianchini and Scarselli (2014) give bounds for Betti numbers characterizing topological properties of functions represented by networks. Telgarsky (2015, 2016) provides specific examples of classification problems where deep networks are provably more efficient than shallow ones.
In the context of classification problems, a general and standard approach to characterizing expressiveness is based on the notion of the Vapnik-Chervonenkis dimension (Vapnik and Chervonenkis 1971). There exist several bounds for the VC-dimension of deep networks with piece-wise polynomial activation functions that go back to the geometric techniques of Goldberg and Jerrum (1995) and earlier results of Warren (1968); see Bartlett et al., Sakurai, and the book by Anthony and Bartlett (1999). There is a related concept, the fat-shattering dimension, for real-valued approximation problems (Kearns and Schapire 1994, Anthony and Bartlett 1999).
A very general approach to expressiveness in the context of approximation is the method of nonlinear widths of DeVore et al. (1989), which concerns the approximation of a family of functions under the assumption that the model depends continuously on the approximated function.
In this paper we examine the problem of shallow-vs-deep expressiveness from the perspective of approximation theory and general spaces of functions having derivatives up to certain order (Sobolev-type spaces). In this framework, the problem of expressiveness is very well studied in the case of shallow networks with a single hidden layer, where it is known, in particular, that to approximate a $\mathcal{C}^n$-function on a $d$-dimensional set with infinitesimal error $\epsilon$ one needs a network of size about $\epsilon^{-d/n}$, assuming a smooth activation function (see, e.g., Mhaskar 1996, Pinkus 1999 for a number of related rigorous upper and lower bounds and further qualifications of this result). Much less seems to be known about deep networks in this setting, though Mhaskar et al. (2016a, 2016b) have recently introduced functional spaces constructed using deep dependency graphs and obtained expressiveness bounds for related deep networks.
We will focus our attention on networks with the ReLU activation function $\sigma(x) = \max(0, x)$, which, despite its utter simplicity, seems to be the most popular choice in practical applications (LeCun et al. 2015). We will consider the $L^\infty$-error of approximation of functions belonging to the Sobolev spaces $\mathcal{W}^{n,\infty}([0,1]^d)$ (without any assumptions of hierarchical structure). We will often consider families of approximations, as the approximated function runs over the unit ball in $\mathcal{W}^{n,\infty}([0,1]^d)$. In such cases we will distinguish scenarios of fixed and adaptive network architectures. Our goal is to obtain lower and upper bounds on the expressiveness of deep and shallow networks in different scenarios. We measure the complexity of networks in a conventional way, by counting the number of their weights and computation units (cf. Anthony and Bartlett 1999).
In Section 2 we describe our ReLU network model and show that the ReLU function is replaceable by any other continuous piece-wise linear activation function, up to constant factors in complexity asymptotics (Proposition 1).
In Section 3 we establish several upper bounds on the complexity of approximating functions from $\mathcal{W}^{n,\infty}([0,1]^d)$ by ReLU networks, in particular showing that deep networks are quite efficient for approximating smooth functions. Specifically:
In Subsection 3.3 we show that, even with fixed-depth networks, one can further decrease the approximation complexity if the network architecture is allowed to depend on the approximated function. Specifically, we prove that one can $\epsilon$-approximate a given Lipschitz function on the segment $[0,1]$ by a depth-6 ReLU network with $O(\frac{1}{\epsilon \ln(1/\epsilon)})$ connections and activation units (Theorem 2). This upper bound is of interest since it lies below the lower bound provided by the method of nonlinear widths under the assumption of continuous model selection (see Subsection 4.1).
In Section 4 we obtain several lower bounds on the complexity of approximation by deep and shallow ReLU networks, using different approaches and assumptions.
In Subsection 4.1 we recall the general lower bound provided by the method of continuous nonlinear widths. This method assumes that the parameters of the approximation continuously depend on the approximated function, but does not assume anything about how the approximation depends on its parameters. In this setup, at least $\sim \epsilon^{-d/n}$ connections and weights are required for an $\epsilon$-approximation on the unit ball of $\mathcal{W}^{n,\infty}([0,1]^d)$ (Theorem 3). As already mentioned, for $d = n = 1$ this lower bound is above the upper bound provided by Theorem 2.
In Subsection 4.2 we consider the setup where the same network architecture is used to approximate all functions in the unit ball of $\mathcal{W}^{n,\infty}([0,1]^d)$, but the weights are not assumed to continuously depend on the function. In this case, application of existing results on the VC-dimension of deep piece-wise polynomial networks yields a $\sim \epsilon^{-d/(2n)}$ lower bound in general and a $\sim \epsilon^{-d/n}$ lower bound, up to logarithmic factors, if the network depth grows as $O(\ln(1/\epsilon))$ (Theorem 4).
In Subsection 4.4 we prove that an $\epsilon$-approximation of any nonlinear $\mathcal{C}^2$-function by a network of fixed depth $L$ requires at least $\sim \epsilon^{-1/(2(L-2))}$ computation units (Theorem 6). By comparison with Theorem 1, this shows that for sufficiently smooth functions approximation by fixed-depth ReLU networks is less efficient than by unbounded-depth networks.
In Section 5 we discuss the obtained bounds and summarize their implications, in particular comparing deep vs. shallow networks and fixed vs. adaptive architectures.
The arXiv preprint of the first version of the present work appeared almost simultaneously with the work of Liang and Srikant (2016), containing results partly overlapping with our results in Subsections 3.1, 3.2 and 4.4. Liang and Srikant consider networks equipped with both ReLU and threshold activation functions. They prove a logarithmic upper bound for the complexity of approximating the function $f(x) = x^2$, which is analogous to our Proposition 2. Then, they extend this upper bound to polynomials and smooth functions. In contrast to our treatment of generic smooth functions based on standard Sobolev spaces, they impose more complex assumptions on the function (including, in particular, how many derivatives it has) that depend on the required approximation accuracy $\epsilon$. As a consequence, they obtain strong complexity bounds rather different from our bound in Theorem 1 (in fact, our lower bound proved in Theorem 5 rules out, in general, such strong upper bounds for functions having only finitely many derivatives). Also, Liang and Srikant prove a lower bound for the complexity of approximating convex functions by shallow networks. Our version of this result, given in Subsection 4.4, is different in that we assume smoothness and nonlinearity instead of global convexity.
2 The ReLU network model
Throughout the paper, we consider feedforward neural networks with the ReLU (Rectified Linear Unit) activation function $\sigma(x) = \max(0, x)$.
The network consists of several input units, one output unit, and a number of “hidden” computation units. Each hidden unit performs an operation of the form
$$y = \sigma\Big(\sum_{k} w_k x_k + b\Big)$$
with some weights $w_k$ and a bias $b$ (adjustable parameters) depending on the unit. The output unit is also a computation unit, but without the nonlinearity, i.e., it computes $\sum_k w_k x_k + b$. The units are grouped in layers, and the inputs of a computation unit in a certain layer are outputs of some units belonging to any of the preceding layers (see Fig. 1). Note that we allow connections between units in non-neighboring layers. Occasionally, when this cannot cause confusion, we may denote the network and the function it implements by the same symbol.
The depth of the network, the number of units and the total number of weights are standard measures of network complexity (Anthony and Bartlett ). We will use these measures throughout the paper. The number of weights is, clearly, the sum of the total number of connections and the number of computation units. We identify the depth with the number of layers (in particular, the most common type of neural networks – shallow networks having a single hidden layer – are depth-3 networks according to this convention).
We finish this subsection with a proposition showing that, given our complexity measures, using the ReLU activation function is not much different from using any other continuous piece-wise linear activation function with finitely many breakpoints: one can replace one network by an equivalent one having another activation function while increasing the number of units and weights only by constant factors. This justifies our restricted attention to ReLU networks (which could otherwise be perceived as an excessively particular example of networks).
Proposition 1. Let $\sigma_0 : \mathbb{R} \to \mathbb{R}$ be any continuous piece-wise linear function with $M$ breakpoints, where $1 \le M < \infty$.
a) Let $\eta$ be a network with the activation function $\sigma_0$, having depth $L$, $W$ weights and $U$ computation units. Then there exists a ReLU network that has depth $L$, not more than $(M+1)^2 W$ weights and not more than $(M+1) U$ units, and that computes the same function as $\eta$.
b) Conversely, let $\eta$ be a ReLU network of depth $L$ with $W$ weights and $U$ computation units. Let $D$ be a bounded subset of $\mathbb{R}^d$, where $d$ is the input dimension of $\eta$. Then there exists a network with the activation function $\sigma_0$ that has depth $L$, not more than $4W$ weights and $2U$ units, and that computes the same function as $\eta$ on the set $D$.
Proof. a) Let $a_1 < a_2 < \dots < a_M$ be the breakpoints of $\sigma_0$, i.e., the points where its derivative is discontinuous: $\sigma_0'(a_m+) \ne \sigma_0'(a_m-)$. We can then express $\sigma_0$ via the ReLU function $\rho(x) = \max(0, x)$, as a linear combination
$$\sigma_0(x) = c_0\, \rho(a_1 - x) + \sum_{m=1}^{M} c_m\, \rho(x - a_m) + h \qquad (1)$$
with appropriately chosen coefficients $c_0, \dots, c_M$ and $h$. It follows that the computation performed by a single $\sigma_0$-unit,
$$\mathbf{x} \mapsto \sigma_0\Big(\sum_k w_k x_k + b\Big),$$
can be equivalently represented by a linear combination of a constant function and the computations of $M+1$ $\rho$-units
$$\mathbf{x} \mapsto \rho\Big(\pm\Big(\sum_k w_k x_k + b\Big) - t_m\Big)$$
(here $m$ is the index of a $\rho$-unit and $t_m$ is the respective shift determined by (1)). We can then replace one-by-one all the $\sigma_0$-units in the network by $\rho$-units, without changing the output of the network. Obviously, these replacements do not change the network depth. Since each hidden unit gets replaced by $M+1$ new units, the number of units in the new network is not greater than $M+1$ times their number in the original network. Note also that the number of connections in the network is multiplied, at most, by $(M+1)^2$. Indeed, each unit replacement entails replacing each of the incoming and outgoing connections of this unit by $M+1$ new connections, and each connection is replaced twice: as an incoming and as an outgoing one. These considerations imply the claimed complexity bounds for the resulting ReLU network.
b) Let $a$ be any breakpoint of $\sigma_0$, so that $\sigma_0'(a+) \ne \sigma_0'(a-)$. Let $T$ be the distance from $a$ to the nearest other breakpoint, so that $\sigma_0$ is linear on $[a - T, a]$ and on $[a, a + T]$ (if $\sigma_0$ has only one breakpoint, any $T > 0$ will do). Then, for any $r > 0$, we can express the ReLU function $\rho$ via $\sigma_0$ in the $r$-neighborhood of 0, as a linear combination of a constant function and two appropriately rescaled and shifted copies of $\sigma_0$ whose arguments stay within the linear pieces adjacent to $a$. It follows that a computation performed by a single $\rho$-unit,
$$\mathbf{x} \mapsto \rho\Big(\sum_k w_k x_k + b\Big),$$
can be equivalently represented by a linear combination of a constant function and two $\sigma_0$-units, provided the condition
$$\Big|\sum_k w_k x_k + b\Big| \le r \qquad (2)$$
holds. Since $D$ is a bounded set, we can choose $r$ at each unit of the initial network sufficiently large so as to satisfy condition (2) for all network inputs from $D$. Then, as in a), we replace each $\rho$-unit with two $\sigma_0$-units, which produces the desired $\sigma_0$-network. ∎
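To make part a) concrete, the following sketch (with a hypothetical activation $\sigma_0$ chosen purely for illustration, and one of several equivalent ReLU expansions) expresses a continuous piece-wise linear function with $M = 2$ breakpoints through a constant plus $M + 2$ ReLU computations:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

# A hypothetical continuous piece-wise linear activation with M = 2 breakpoints
# at -1 and 1: slope -0.5 left of -1, slope 0 between, slope 2 right of 1.
breaks = np.array([-1.0, 1.0])
slopes = np.array([-0.5, 0.0, 2.0])      # M + 1 = 3 linear pieces

def sigma0(x):
    # direct evaluation (the function vanishes on [-1, 1])
    return np.where(x <= -1, -0.5 * (x + 1), np.where(x >= 1, 2.0 * (x - 1), 0.0))

def sigma0_via_relu(x):
    # sigma0(x) = h + s_0 (relu(x) - relu(-x)) + sum_m (s_m - s_{m-1}) relu(x - a_m):
    # a constant plus M + 2 ReLU computations, in the spirit of the proof of a)
    def raw(t):
        return slopes[0] * (relu(t) - relu(-t)) + sum(
            ds * relu(t - a) for a, ds in zip(breaks, np.diff(slopes)))
    return raw(x) + (sigma0(0.0) - raw(0.0))

xs = np.linspace(-4.0, 4.0, 1001)
assert float(np.max(np.abs(sigma0_via_relu(xs) - sigma0(xs)))) < 1e-12
```

The constant $h$ is fixed by matching the two expressions at a single point, here $x = 0$.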
3 Upper bounds
Throughout the paper, we will be interested in approximating functions $f : [0,1]^d \to \mathbb{R}$ by ReLU networks. Given a function $f$ and its approximation $\tilde f$, by the approximation error we will always mean the uniform maximum error
$$\|f - \tilde f\|_\infty = \max_{\mathbf{x} \in [0,1]^d} |f(\mathbf{x}) - \tilde f(\mathbf{x})|.$$
3.1 Fast deep approximation of squaring and multiplication
Our first key result shows that ReLU networks with unconstrained depth can very efficiently approximate the function $f(x) = x^2$ (more efficiently than any fixed-depth network, as we will see in Section 4.4). Our construction uses the “sawtooth” function that has previously appeared in the paper of Telgarsky (2015).
Proposition 2. The function $f(x) = x^2$ on the segment $[0, 1]$ can be approximated with any error $\epsilon > 0$ by a ReLU network having the depth and the number of weights and computation units $O(\ln(1/\epsilon))$.
Proof. Consider the “tooth” (or “mirror”) function $g : [0, 1] \to [0, 1]$,
$$g(x) = \begin{cases} 2x, & x < \frac{1}{2},\\ 2(1 - x), & x \ge \frac{1}{2}, \end{cases}$$
and the iterated functions
$$g_s(x) = \underbrace{g \circ g \circ \dots \circ g}_{s}(x)$$
(see Fig. 1(a)). Our key observation now is that the function $f(x) = x^2$ can be approximated by linear combinations of the functions $g_s$. Namely, let $f_m$ be the piece-wise linear interpolation of $f$ with $2^m + 1$ uniformly distributed breakpoints $\frac{k}{2^m}$, $k = 0, \dots, 2^m$:
$$f_m\Big(\frac{k}{2^m}\Big) = \Big(\frac{k}{2^m}\Big)^2, \qquad k = 0, \dots, 2^m$$
(see Fig. 1(b)). The function $f_m$ approximates $f$ with the error $\epsilon_m = 2^{-2m-2}$. Now note that refining the interpolation from $f_{m-1}$ to $f_m$ amounts to adjusting it by a function proportional to a sawtooth function:
$$f_{m-1}(x) - f_m(x) = \frac{g_m(x)}{2^{2m}}, \qquad \text{so that} \qquad f_m(x) = x - \sum_{s=1}^{m} \frac{g_s(x)}{2^{2s}}.$$
Since $g$ can be implemented by a finite ReLU network, as $g(x) = 2\rho(x) - 4\rho\big(x - \tfrac{1}{2}\big) + 2\rho(x - 1)$, and since the construction of $f_m$ only involves linear operations and compositions of $g$, we can implement $f_m$ by a ReLU network having depth and the number of weights and computation units all being $O(m)$ (see Fig. 1(c)). Since the error of $f_m$ is $2^{-2m-2}$, this implies the claim of the proposition. ∎
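The construction above is easy to check numerically; the sketch below (an illustration of the formulas, not the network itself) builds $f_m$ from the iterates of $g$ and verifies the error bound $2^{-2m-2}$:

```python
import numpy as np

def g(x):
    # the "tooth" (mirror) function: 2x on [0, 1/2], 2(1 - x) on [1/2, 1]
    return np.where(x < 0.5, 2 * x, 2 * (1 - x))

def f_m(x, m):
    # f_m(x) = x - sum_{s=1}^m g_s(x) / 2**(2s): the piece-wise linear
    # interpolation of x**2 with 2**m + 1 uniform breakpoints
    x = np.asarray(x, dtype=float)
    y, gs = x.copy(), x.copy()
    for s in range(1, m + 1):
        gs = g(gs)              # g_s = g composed with itself s times
        y = y - gs / 4.0 ** s
    return y

x = np.linspace(0.0, 1.0, 10001)
for m in (2, 5, 8):
    err = np.max(np.abs(f_m(x, m) - x ** 2))
    assert err <= 2.0 ** (-2 * m - 2)
```

Each refinement step adds one more composition of $g$, so the depth grows linearly in $m$ while the error decays as $4^{-m}$.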
We can now use Proposition 2 to efficiently implement accurate multiplication in a ReLU network. The implementation will depend on the required accuracy and the magnitude of the multiplied quantities.
Proposition 3. Given $M > 0$ and $\epsilon \in (0, 1)$, there is a ReLU network $\widetilde\times$ with two input units that implements a function $\widetilde\times : \mathbb{R}^2 \to \mathbb{R}$ so that
a) for any inputs $x, y$, if $|x| \le M$ and $|y| \le M$, then $|\widetilde\times(x, y) - xy| \le \epsilon$;
b) if $x = 0$ or $y = 0$, then $\widetilde\times(x, y) = 0$;
c) the depth and the number of weights and computation units in $\widetilde\times$ are not greater than $c_1 \ln(1/\epsilon) + c_2$ with an absolute constant $c_1$ and a constant $c_2 = c_2(M)$.
Proof. Let $\tilde f_{sq,\delta}$ be the approximate squaring function from Proposition 2 such that $\tilde f_{sq,\delta}(0) = 0$ and $|\tilde f_{sq,\delta}(x) - x^2| \le \delta$ for $x \in [0, 1]$. Using the expansion
$$xy = \frac{(x + y)^2 - x^2 - y^2}{2} = 2M^2\Big(\Big(\frac{|x + y|}{2M}\Big)^2 - \Big(\frac{|x|}{2M}\Big)^2 - \Big(\frac{|y|}{2M}\Big)^2\Big), \qquad (3)$$
we set
$$\widetilde\times(x, y) = 2M^2\Big(\tilde f_{sq,\delta}\Big(\frac{|x + y|}{2M}\Big) - \tilde f_{sq,\delta}\Big(\frac{|x|}{2M}\Big) - \tilde f_{sq,\delta}\Big(\frac{|y|}{2M}\Big)\Big), \qquad (4)$$
where $\delta = \frac{\epsilon}{6M^2}$. Then property b) is immediate and a) follows easily using expansion (3). To conclude c), observe that the computation (4) consists of three instances of $\tilde f_{sq,\delta}$ and finitely many linear and ReLU operations (note that $|x| = \rho(x) + \rho(-x)$), so, using Proposition 2, we can implement $\widetilde\times$ by a ReLU network such that its depth and the number of computation units and weights are $O(\ln(1/\delta))$, i.e. are $O(\ln(1/\epsilon) + \ln M)$. ∎
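The polarization trick $xy = \frac{1}{2}\big((x+y)^2 - x^2 - y^2\big)$ behind Proposition 3 can be checked numerically. The sketch below (an illustration with $M = 1$; the rescaling constants are choices of this demo, not a verbatim transcription of the network) combines three copies of the sawtooth squaring approximation:

```python
import numpy as np

def g(x):
    # the "tooth" function from Proposition 2
    return np.where(x < 0.5, 2 * x, 2 * (1 - x))

def f_m(x, m):
    # approximate squaring on [0, 1] with error <= 2**(-2m-2); f_m(0) = 0 exactly
    x = np.asarray(x, dtype=float)
    y, gs = x.copy(), x.copy()
    for s in range(1, m + 1):
        gs = g(gs)
        y = y - gs / 4.0 ** s
    return y

def mult_tilde(x, y, m, M=1.0):
    # xy = ((x + y)**2 - x**2 - y**2) / 2 with each square approximated;
    # arguments are rescaled by 2M so that |x + y| / (2M) lies in [0, 1]
    sq = lambda t: (2 * M) ** 2 * f_m(np.abs(t) / (2 * M), m)
    return (sq(x + y) - sq(x) - sq(y)) / 2

rng = np.random.default_rng(0)
x, y = rng.uniform(-1, 1, 1000), rng.uniform(-1, 1, 1000)
err = np.max(np.abs(mult_tilde(x, y, m=10) - x * y))
assert err <= 6 * 2.0 ** (-22)   # error <= (3/2) * (2M)**2 * 2**(-2m-2)
assert float(mult_tilde(0.3, 0.0, m=10)) == 0.0   # property b): exact zero
```

Note that property b) holds exactly, not approximately, because $f_m(0) = 0$ and the first two squares cancel identically when one factor vanishes.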
3.2 Fast deep approximation of general smooth functions
In order to formulate our general result, Theorem 1, we consider the Sobolev spaces $\mathcal{W}^{n,\infty}([0,1]^d)$ with $n = 1, 2, \dots$ Recall that $\mathcal{W}^{n,\infty}([0,1]^d)$ is defined as the space of functions on $[0,1]^d$ lying in $L^\infty$ along with their weak derivatives up to order $n$. The norm in $\mathcal{W}^{n,\infty}([0,1]^d)$ can be defined by
$$\|f\|_{\mathcal{W}^{n,\infty}([0,1]^d)} = \max_{\mathbf{n} : |\mathbf{n}| \le n}\ \operatorname*{ess\,sup}_{\mathbf{x} \in [0,1]^d} |D^{\mathbf{n}} f(\mathbf{x})|,$$
where $\mathbf{n} = (n_1, \dots, n_d) \in \{0, 1, \dots\}^d$, $|\mathbf{n}| = n_1 + \dots + n_d$, and $D^{\mathbf{n}} f$ is the respective weak derivative. Here and in the sequel we denote vectors by boldface characters. The space $\mathcal{W}^{n,\infty}([0,1]^d)$ can be equivalently described as consisting of the functions from $\mathcal{C}^{n-1}([0,1]^d)$ such that all their derivatives of order $n - 1$ are Lipschitz continuous.
Throughout the paper, we denote by $F_{n,d}$ the unit ball in $\mathcal{W}^{n,\infty}([0,1]^d)$:
$$F_{n,d} = \{f \in \mathcal{W}^{n,\infty}([0,1]^d) : \|f\|_{\mathcal{W}^{n,\infty}([0,1]^d)} \le 1\}.$$
Also, it will now be convenient to make a distinction between networks and network architectures: we define the latter as the former with unspecified weights. We say that a network architecture is capable of expressing any function from $F_{n,d}$ with error $\epsilon$ if this can be achieved by some assignment of the weights.
Theorem 1. For any $d$, $n$ and $\epsilon \in (0, 1)$, there is a ReLU network architecture that
1. is capable of expressing any function from $F_{n,d}$ with error $\epsilon$;
2. has the depth at most $c(\ln(1/\epsilon) + 1)$ and at most $c\epsilon^{-d/n}(\ln(1/\epsilon) + 1)$ weights and computation units, with some constant $c = c(d, n)$.
Proof. The proof will consist of two steps. We start with approximating $f$ by a sum-product combination $f_1$ of local Taylor polynomials and one-dimensional piece-wise linear functions. After that, we will use the results of the previous section to approximate $f_1$ by a neural network.
Let $N$ be a positive integer. Consider a partition of unity formed by a grid of $(N+1)^d$ functions $\phi_{\mathbf{m}}$ on the domain $[0,1]^d$:
$$\sum_{\mathbf{m}} \phi_{\mathbf{m}}(\mathbf{x}) \equiv 1, \qquad \mathbf{x} \in [0,1]^d.$$
Here $\mathbf{m} = (m_1, \dots, m_d) \in \{0, 1, \dots, N\}^d$, and the function $\phi_{\mathbf{m}}$ is defined as the product
$$\phi_{\mathbf{m}}(\mathbf{x}) = \prod_{k=1}^{d} \psi\big(3N\big(x_k - \tfrac{m_k}{N}\big)\big), \qquad (5)$$
where
$$\psi(x) = \begin{cases} 1, & |x| \le 1,\\ 0, & |x| \ge 2,\\ 2 - |x|, & 1 < |x| < 2 \end{cases}$$
(see Fig. 3).
(see Fig. 3). Note that
For any $\mathbf{m} \in \{0, \dots, N\}^d$, consider the degree-$(n-1)$ Taylor polynomial for the function $f$ at $\mathbf{x} = \frac{\mathbf{m}}{N}$:
$$P_{\mathbf{m}}(\mathbf{x}) = \sum_{\mathbf{n} : |\mathbf{n}| < n} \frac{D^{\mathbf{n}} f}{\mathbf{n}!}\Big(\frac{\mathbf{m}}{N}\Big)\Big(\mathbf{x} - \frac{\mathbf{m}}{N}\Big)^{\mathbf{n}}, \qquad (8)$$
with the usual conventions $\mathbf{n}! = \prod_{k=1}^d n_k!$ and $\big(\mathbf{x} - \frac{\mathbf{m}}{N}\big)^{\mathbf{n}} = \prod_{k=1}^d \big(x_k - \frac{m_k}{N}\big)^{n_k}$. Now define an approximation to $f$ by
$$f_1 = \sum_{\mathbf{m}} \phi_{\mathbf{m}} P_{\mathbf{m}}. \qquad (9)$$
We bound the approximation error using the Taylor expansion of $f$: for any $\mathbf{x} \in [0,1]^d$,
$$|f(\mathbf{x}) - f_1(\mathbf{x})| = \Big|\sum_{\mathbf{m}} \phi_{\mathbf{m}}(\mathbf{x})\big(f(\mathbf{x}) - P_{\mathbf{m}}(\mathbf{x})\big)\Big| \le \sum_{\mathbf{m} : |x_k - \frac{m_k}{N}| < \frac{2}{3N}\,\forall k} |f(\mathbf{x}) - P_{\mathbf{m}}(\mathbf{x})| \le 2^d \max_{\mathbf{m} : |x_k - \frac{m_k}{N}| < \frac{2}{3N}\,\forall k} |f(\mathbf{x}) - P_{\mathbf{m}}(\mathbf{x})| \le \frac{2^d d^n}{n!}\Big(\frac{2}{3N}\Big)^n \max_{\mathbf{n} : |\mathbf{n}| = n}\ \operatorname*{ess\,sup}_{\mathbf{x}} |D^{\mathbf{n}} f(\mathbf{x})| \le \frac{2^d d^n}{n!}\Big(\frac{2}{3N}\Big)^n.$$
Here in the second step we used the support property (7) and the bound (6), in the third the observation that any $\mathbf{x}$ belongs to the support of at most $2^d$ functions $\phi_{\mathbf{m}}$, in the fourth a standard bound for the Taylor remainder, and in the fifth the property $\|f\|_{\mathcal{W}^{n,\infty}([0,1]^d)} \le 1$.
It follows that if we choose
$$N = \Big\lceil \Big(\frac{n!}{2^d d^n}\,\frac{\epsilon}{2}\Big)^{-1/n} \Big\rceil \qquad (10)$$
(where $\lceil \cdot \rceil$ is the ceiling function), then
$$\|f - f_1\|_\infty \le \frac{2^d d^n}{n!}\Big(\frac{2}{3N}\Big)^n \le \frac{\epsilon}{2}. \qquad (11)$$
Note that, by (8) and the bound $\|f\|_{\mathcal{W}^{n,\infty}([0,1]^d)} \le 1$, the coefficients of the polynomials $P_{\mathbf{m}}$ are uniformly bounded for all $f \in F_{n,d}$:
$$P_{\mathbf{m}}(\mathbf{x}) = \sum_{\mathbf{n} : |\mathbf{n}| < n} a_{\mathbf{m},\mathbf{n}} \Big(\mathbf{x} - \frac{\mathbf{m}}{N}\Big)^{\mathbf{n}}, \qquad |a_{\mathbf{m},\mathbf{n}}| \le 1. \qquad (12)$$
We have therefore reduced our task to the following: construct a network architecture capable of approximating with uniform error $\frac{\epsilon}{2}$ any function of the form (9), assuming that $N$ is given by (10) and the polynomials $P_{\mathbf{m}}$ are of the form (12).
The expansion
$$f_1 = \sum_{\mathbf{m}} \sum_{\mathbf{n} : |\mathbf{n}| < n} a_{\mathbf{m},\mathbf{n}}\, \phi_{\mathbf{m}}(\mathbf{x}) \Big(\mathbf{x} - \frac{\mathbf{m}}{N}\Big)^{\mathbf{n}} \qquad (13)$$
is a linear combination of not more than $d^n (N+1)^d$ terms $\phi_{\mathbf{m}}(\mathbf{x})\big(\mathbf{x} - \frac{\mathbf{m}}{N}\big)^{\mathbf{n}}$. Each of these terms is a product of at most $d + n - 1$ piece-wise linear univariate factors: $d$ functions $\psi\big(3N\big(x_k - \frac{m_k}{N}\big)\big)$ (see (5)) and at most $n - 1$ linear expressions $x_k - \frac{m_k}{N}$. We can implement an approximation of this product by a neural network with the help of Proposition 3. Specifically, let $\widetilde\times$ be the approximate multiplication from Proposition 3 for $M = 2$ and some accuracy $\delta$ to be chosen later, and consider the approximation of the product $\phi_{\mathbf{m}}(\mathbf{x})\big(\mathbf{x} - \frac{\mathbf{m}}{N}\big)^{\mathbf{n}}$ obtained by the chained application of $\widetilde\times$:
$$\tilde f_{\mathbf{m},\mathbf{n}}(\mathbf{x}) = \widetilde\times\Big(\psi\big(3N\big(x_1 - \tfrac{m_1}{N}\big)\big),\ \widetilde\times\big(\psi\big(3N\big(x_2 - \tfrac{m_2}{N}\big)\big),\ \cdots\big)\Big). \qquad (14)$$
Using the statement c) of Proposition 3, we see that $\tilde f_{\mathbf{m},\mathbf{n}}$ can be implemented by a ReLU network with the depth and the number of weights and computation units not larger than $c_1(\ln(1/\delta) + 1)$ for some constant $c_1 = c_1(d, n)$.
Now we estimate the error of this approximation. Note that we have $\big|\psi\big(3N\big(x_k - \frac{m_k}{N}\big)\big)\big| \le 1$ and $\big|x_k - \frac{m_k}{N}\big| \le 1$ for all $k$ and all $\mathbf{x} \in [0,1]^d$. By the statement a) of Proposition 3, if $|a| \le M$ and $|b| \le M$, then $|\widetilde\times(a, b)| \le |ab| + \delta$. Repeatedly applying this observation to all the approximate multiplications in (14) while assuming $\delta < 1$, we see that the arguments of all these multiplications are bounded by our $M$ (equal to 2) and the statement a) of Proposition 3 holds for each of them. We then have
$$\Big|\tilde f_{\mathbf{m},\mathbf{n}}(\mathbf{x}) - \phi_{\mathbf{m}}(\mathbf{x})\Big(\mathbf{x} - \frac{\mathbf{m}}{N}\Big)^{\mathbf{n}}\Big| \le (d + n)\delta, \qquad \mathbf{x} \in [0,1]^d. \qquad (15)$$
Moreover, by the statement b) of Proposition 3,
$$\tilde f_{\mathbf{m},\mathbf{n}}(\mathbf{x}) = 0 \quad \text{if } \mathbf{x} \notin \operatorname{supp} \phi_{\mathbf{m}}. \qquad (16)$$
Now we define the full approximation $\tilde f$ by
$$\tilde f = \sum_{\mathbf{m}} \sum_{\mathbf{n} : |\mathbf{n}| < n} a_{\mathbf{m},\mathbf{n}} \tilde f_{\mathbf{m},\mathbf{n}}. \qquad (17)$$
We estimate the approximation error of $\tilde f$: for any $\mathbf{x} \in [0,1]^d$,
$$|\tilde f(\mathbf{x}) - f_1(\mathbf{x})| = \Big|\sum_{\mathbf{m}} \sum_{\mathbf{n} : |\mathbf{n}| < n} a_{\mathbf{m},\mathbf{n}}\Big(\tilde f_{\mathbf{m},\mathbf{n}}(\mathbf{x}) - \phi_{\mathbf{m}}(\mathbf{x})\Big(\mathbf{x} - \frac{\mathbf{m}}{N}\Big)^{\mathbf{n}}\Big)\Big| \le \sum_{\mathbf{m} : \mathbf{x} \in \operatorname{supp} \phi_{\mathbf{m}}}\ \sum_{\mathbf{n} : |\mathbf{n}| < n} \Big|\tilde f_{\mathbf{m},\mathbf{n}}(\mathbf{x}) - \phi_{\mathbf{m}}(\mathbf{x})\Big(\mathbf{x} - \frac{\mathbf{m}}{N}\Big)^{\mathbf{n}}\Big| \le 2^d d^n \max_{\mathbf{m}, \mathbf{n}} \Big|\tilde f_{\mathbf{m},\mathbf{n}}(\mathbf{x}) - \phi_{\mathbf{m}}(\mathbf{x})\Big(\mathbf{x} - \frac{\mathbf{m}}{N}\Big)^{\mathbf{n}}\Big| \le 2^d d^n (d + n)\delta,$$
where in the first step we use expansion (13), in the second the identity (16), in the third the bound $|a_{\mathbf{m},\mathbf{n}}| \le 1$ and the fact that $\mathbf{x}$ belongs to the support of at most $2^d$ functions $\phi_{\mathbf{m}}$, and in the fourth the bound (15). It follows that if we choose
$$\delta = \frac{\epsilon}{2^{d+1} d^n (d + n)}, \qquad (18)$$
then $\|\tilde f - f_1\|_\infty \le \frac{\epsilon}{2}$ and hence, by (11),
$$\|\tilde f - f\|_\infty \le \|\tilde f - f_1\|_\infty + \|f_1 - f\|_\infty \le \epsilon.$$
On the other hand, note that by (17), $\tilde f$ can be implemented by a network consisting of parallel subnetworks that compute each of the $\tilde f_{\mathbf{m},\mathbf{n}}$; the final output is obtained by weighting the outputs of the subnetworks with the weights $a_{\mathbf{m},\mathbf{n}}$. The architecture of the full network does not depend on $f$; only the weights $a_{\mathbf{m},\mathbf{n}}$ do. As already shown, each of these subnetworks has not more than $c_1(\ln(1/\delta) + 1)$ layers, weights and computation units, with some constant $c_1 = c_1(d, n)$. There are not more than $d^n (N+1)^d$ such subnetworks. Therefore, the full network for $\tilde f$ has not more than $c_1(\ln(1/\delta) + 1) + 1$ layers and not more than $d^n (N+1)^d\, c_1(\ln(1/\delta) + 1)$ weights and computation units, up to the final weighted summation. With $\delta$ given by (18) and $N$ given by (10), we obtain the claimed complexity bounds. ∎
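A minimal numerical sanity check of the first step of the proof (a sketch with assumed example values $d = 1$, $n = 2$ and $f = \sin$, whose relevant derivatives are bounded by 1): the local Taylor combination $f_1$ obeys the error bound derived above:

```python
import numpy as np

def psi(t):
    # the trapezoid function from the partition of unity
    return np.clip(2.0 - np.abs(t), 0.0, 1.0)

N = 10
f, df = np.sin, np.cos        # example f with |f|, |f'|, |f''| <= 1 on [0, 1]

def f1(x):
    # f1 = sum_m phi_m * P_m with first-order Taylor polynomials (d = 1, n = 2)
    total = np.zeros_like(x)
    for m in range(N + 1):
        a = m / N
        total += psi(3 * N * (x - a)) * (f(a) + df(a) * (x - a))
    return total

x = np.linspace(0.0, 1.0, 5001)
# the proof's bound (2**d * d**n / n!) * (2 / (3N))**n with d = 1, n = 2
assert np.max(np.abs(f1(x) - f(x))) <= (2.0 / (3 * N)) ** 2
```

The products $\phi_m P_m$ are computed exactly here; in the network they would be replaced by the chained approximate multiplications of Proposition 3.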
3.3 Faster approximations using adaptive network architectures
Theorem 1 provides an upper bound for the approximation complexity in the case when the same network architecture is used to approximate all functions in . We can consider an alternative, “adaptive architecture” scenario where not only the weights, but also the architecture is adjusted to the approximated function. We expect, of course, that this would decrease the complexity of the resulting architectures, in general (at the price of needing to find the appropriate architecture). In this section we show that we can indeed obtain better upper bounds in this scenario.
For simplicity, we will only consider the case $d = n = 1$. Then $\mathcal{W}^{1,\infty}([0,1])$ is the space of Lipschitz functions on the segment $[0,1]$. The set $F_{1,1}$ consists of the functions $f$ having both $\|f\|_\infty$ and the Lipschitz constant bounded by 1. Theorem 1 provides an $O(\frac{1}{\epsilon}\ln\frac{1}{\epsilon})$ upper bound for the number of weights and computation units, but in this special case there is in fact a better bound, $O(\frac{1}{\epsilon})$, obtained simply by piece-wise linear interpolation.
Namely, given $f \in F_{1,1}$ and $\epsilon > 0$, set $T = \lceil \frac{1}{\epsilon} \rceil$ and let $\tilde f$ be the piece-wise linear interpolation of $f$ with the $T + 1$ uniformly spaced breakpoints $\frac{t}{T}$, $t = 0, \dots, T$ (i.e., $\tilde f(\frac{t}{T}) = f(\frac{t}{T})$). The function $\tilde f$ is also Lipschitz with constant 1 and hence $\|f - \tilde f\|_\infty \le \frac{1}{T} \le \epsilon$ (since for any $x$ we can find $t$ such that $|x - \frac{t}{T}| \le \frac{1}{2T}$ and then $|f(x) - \tilde f(x)| \le |f(x) - f(\frac{t}{T})| + |\tilde f(\frac{t}{T}) - \tilde f(x)| \le \frac{1}{T}$). At the same time, the function $\tilde f$ can be expressed in terms of the ReLU function $\rho$ by
$$\tilde f(x) = b + \sum_{t=0}^{T-1} c_t\, \rho\Big(x - \frac{t}{T}\Big)$$
with some coefficients $b$ and $c_0, \dots, c_{T-1}$. This expression can be viewed as a special case of the depth-3 ReLU network with $O(\frac{1}{\epsilon})$ weights and computation units.
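The interpolation argument translates directly into a tiny depth-3 ReLU “network” (a sketch; $f = \sin$ is an assumed example from the unit ball $F_{1,1}$):

```python
import numpy as np

relu = lambda t: np.maximum(t, 0.0)
f = np.sin                     # an example function from F_{1,1}

eps = 1e-2
T = int(np.ceil(1.0 / eps))    # T intervals of width 1/T
knots = np.linspace(0.0, 1.0, T + 1)
vals = f(knots)
slopes = np.diff(vals) * T     # slope of the interpolant on each interval

def f_tilde(x):
    # f_tilde(x) = b + sum_t c_t relu(x - t/T) with c_t = s_t - s_{t-1}:
    # a single hidden layer of T ReLU units
    c = np.diff(slopes, prepend=0.0)
    return vals[0] + sum(ct * relu(x - tk) for ct, tk in zip(c, knots[:-1]))

x = np.linspace(0.0, 1.0, 5001)
assert np.max(np.abs(f_tilde(x) - f(x))) <= eps
```

The coefficients $c_t$ are the successive slope changes of the interpolant, which is why exactly $T$ ReLU units suffice.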
We show now how this $O(\frac{1}{\epsilon})$ bound can be improved by using adaptive network architectures.
Theorem 2. For any $f \in F_{1,1}$ and $\epsilon \in (0, \frac{1}{2})$, there exists a depth-6 ReLU network (with architecture depending on $f$) that provides an $\epsilon$-approximation of $f$ while having not more than $\frac{c}{\epsilon \ln(1/\epsilon)}$ weights, connections and computation units. Here $c$ is an absolute constant.
We first explain the idea of the proof. We start with interpolating $f$ by a piece-wise linear function, but not on the length scale $\sim \epsilon$; instead, we do it on a coarser length scale $\frac{1}{T}$, with some $T \ll \frac{1}{\epsilon}$. We then create a “cache” of auxiliary subnetworks that we use to fill in the details and go down to the scale $\sim \epsilon$, in each of the $T$ subintervals. This allows us to reduce the amount of computations for small $\epsilon$, because the complexity of the cache is independent of the number of subintervals $T$. The assignment of cached subnetworks to the subintervals is encoded in the network architecture and depends on the function $f$. We optimize $T$ by balancing the complexity of the cache with that of the initial coarse approximation. This leads to $T \sim \frac{1}{\epsilon \ln(1/\epsilon)}$ and hence to the reduction of the total complexity of the network by a factor $\sim \ln(1/\epsilon)$ compared to the simple piece-wise linear approximation on the scale $\epsilon$. This construction is inspired by a similar argument used to prove the $O(2^n/n)$ upper bound for the complexity of Boolean circuits implementing $n$-ary functions (Shannon 1949).
The proof becomes simpler if, in addition to the ReLU function $\rho$, we are allowed to use the discontinuous threshold activation function
$$\sigma(x) = \begin{cases} 1, & x \ge 0,\\ 0, & x < 0 \end{cases}$$
in our neural network. Since $\sigma$ is discontinuous, we cannot just use Proposition 1 to replace $\sigma$-units by $\rho$-units. We will first prove the analog of the claimed result for the model including $\sigma$-units, and then we will show how to construct a purely ReLU network.
For any $f \in F_{1,1}$ and $\epsilon \in (0, \frac{1}{2})$, there exists a depth-5 network including $\rho$-units and $\sigma$-units, that provides an $\epsilon$-approximation of $f$ while having not more than $\frac{c}{\epsilon \ln(1/\epsilon)}$ weights, where $c$ is an absolute constant.
Proof. Given $f \in F_{1,1}$, we will construct an approximation $\tilde f$ to $f$ in the form
$$\tilde f = f_1 + f_2. \qquad (19)$$
Here, $f_1$ is the piece-wise linear interpolation of $f$ with the breakpoints $\{\frac{t}{T}\}_{t=0}^{T}$, for some positive integer $T$ to be chosen later. Since $f$ is Lipschitz with constant 1, $f_1$ is also Lipschitz with constant 1. We will denote by $I_t$ the intervals between the breakpoints:
$$I_t = \Big[\frac{t}{T}, \frac{t+1}{T}\Big], \qquad t = 0, \dots, T - 1.$$
We will now construct $f_2$ as an approximation to the difference
$$r = f - f_1. \qquad (20)$$
Note that $r$ vanishes at the endpoints of the intervals $I_t$:
$$r\Big(\frac{t}{T}\Big) = 0, \qquad t = 0, \dots, T, \qquad (21)$$
and is Lipschitz with constant 2:
$$|r(x_1) - r(x_2)| \le 2|x_1 - x_2|, \qquad (22)$$
since $f$ and $f_1$ are Lipschitz with constant 1.
To define $f_2$, we first construct a set $\Gamma$ of cached functions. Let $m$ be a positive integer to be chosen later. Let $\Gamma$ be the set of piece-wise linear functions $\gamma : [0, 1] \to \mathbb{R}$ with the breakpoints $\{\frac{j}{m}\}_{j=0}^{m}$ and the properties
$$\gamma(0) = 0; \qquad \gamma\Big(\frac{j}{m}\Big) - \gamma\Big(\frac{j-1}{m}\Big) \in \Big\{-\frac{2}{m}, 0, \frac{2}{m}\Big\}, \quad j = 1, \dots, m.$$
Note that the size of $\Gamma$ is not larger than $3^m$.
If $g$ is any Lipschitz function on $[0, 1]$ with constant 2 and $g(0) = 0$, then $g$ can be approximated by some $\gamma \in \Gamma$ with error not larger than $\frac{2}{m}$: namely, take the $\gamma$ obtained by choosing at each breakpoint the admissible increment that brings $\gamma(\frac{j}{m})$ closest to $g(\frac{j}{m})$ (an easy induction shows that then $|\gamma(\frac{j}{m}) - g(\frac{j}{m})| \le \frac{1}{m}$ for all $j$).
Moreover, if $r$ is defined by (20), then, using (21), (22), on each interval $I_t$ the function $r$ can be approximated with error not larger than $\frac{2}{mT}$ by a properly rescaled function from $\Gamma$. Namely, for each $t \in \{0, \dots, T-1\}$ we can define the function $g_t : [0, 1] \to \mathbb{R}$ by $g_t(y) = T\, r\big(\frac{t + y}{T}\big)$. Then $g_t$ is Lipschitz with constant 2 and $g_t(0) = 0$, so we can find $\gamma_t \in \Gamma$ such that
$$\|g_t - \gamma_t\|_\infty \le \frac{2}{m}.$$
This can be equivalently written as
$$\Big|r(x) - \frac{1}{T}\gamma_t(Tx - t)\Big| \le \frac{2}{mT}, \qquad x \in I_t.$$
Note that the obtained assignment $t \mapsto \gamma_t$ is not injective, in general ($T$ will be much larger than $|\Gamma|$).
We can then define $f_2$ on the whole $[0, 1]$ by
$$f_2(x) = \frac{1}{T}\gamma_t(Tx - t), \qquad x \in I_t,\ t = 0, \dots, T - 1.$$
This approximates $r = f - f_1$ with error $\frac{2}{mT}$ on $[0, 1]$:
$$|f_2(x) - r(x)| \le \frac{2}{mT}, \qquad x \in [0, 1],$$
and hence, by (20), the full approximation $\tilde f = f_1 + f_2$ satisfies $\|\tilde f - f\|_\infty \le \frac{2}{mT}$.
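A toy version of the caching idea can be tested numerically. The sketch below uses a simplified cache (values at the breakpoints are rounded to multiples of $1/m$, rather than built from the constrained increments used in the proof; the resulting increments are multiples of $1/m$ bounded by $3/m$, so the cache still has at most $7^m$ entries, independently of $T$):

```python
import numpy as np

m = 8                          # number of cache breakpoints per unit interval
grid = np.arange(m + 1) / m

def quantize_to_cache(g_vals):
    # Round the values at the breakpoints j/m to the nearest multiple of 1/m.
    # The quantized functions form a finite cache shared by all T subintervals.
    return np.round(g_vals * m) / m

rng = np.random.default_rng(1)
fine = np.linspace(0.0, 1.0, 4001)
for _ in range(200):
    # a random Lipschitz-2 function with g(0) = 0, sampled at the breakpoints
    incr = rng.uniform(-2.0 / m, 2.0 / m, size=m)
    g_vals = np.concatenate([[0.0], np.cumsum(incr)])
    gamma_vals = quantize_to_cache(g_vals)
    # compare the piece-wise linear interpolants of g and of its cached version
    g = np.interp(fine, grid, g_vals)
    gamma = np.interp(fine, grid, gamma_vals)
    assert np.max(np.abs(g - gamma)) <= 2.0 / m
    assert np.max(np.abs(np.diff(gamma_vals))) <= 3.0 / m + 1e-12
```

Since every subinterval reuses one of the finitely many cached shapes, making $T$ larger does not enlarge the cache, which is the source of the $\ln(1/\epsilon)$ saving.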