Expressiveness of deep neural networks with piecewise-linear (in particular, ReLU) activation functions has been a topic of much theoretical research in recent years. The topic has many aspects, with connections to combinatorics (Montufar et al., 2014; Telgarsky, 2016), topology (Bianchini and Scarselli, 2014), Vapnik-Chervonenkis dimension (Bartlett et al., 1998; Sakurai, 1999) and fat-shattering dimension (Kearns and Schapire, 1990; Anthony and Bartlett, 2009), hierarchical decompositions of functions (Mhaskar et al., 2016), information theory (Petersen and Voigtlaender, 2017), etc.
Here we adopt the perspective of classical approximation theory, in which the problem of expressiveness can be basically described as follows. Suppose that $f$ is a multivariate function, say on the cube $[0,1]^\nu$, and has some prescribed regularity properties; how efficiently can one approximate $f$ by deep neural networks? The question has been studied in several recent publications. Depth-separation results for some explicit families of functions have been obtained in Safran and Shamir (2016); Telgarsky (2016). General upper and lower bounds on approximation rates for functions characterized by their degree of smoothness have been obtained in Liang and Srikant (2016); Yarotsky (2017). Hanin and Sellke (2017); Lu et al. (2017) establish the universal approximation property and convergence rates for deep and “narrow” (fixed-width) networks. Petersen and Voigtlaender (2017) establish convergence rates for approximations of discontinuous functions. Generalization capabilities of deep ReLU networks trained on finite noisy samples are studied in Schmidt-Hieber (2017).
In the present paper we consider and largely resolve the following question: what is the optimal rate of approximation of general continuous functions by deep ReLU networks, in terms of the number $W$ of network weights and the modulus of continuity of the function? Specifically, for any $W$ we seek a network architecture with $W$ weights so that for any continuous $f:[0,1]^\nu\to\mathbb{R}$, as $W$ increases, we would achieve the best convergence in the uniform norm $\|\cdot\|_\infty$ when using these architectures to approximate $f$.
In the slightly different but closely related context of approximation on balls in Sobolev spaces $\mathcal{W}^{d,\infty}([0,1]^\nu)$ this question of optimal convergence rate has been studied in Yarotsky (2017). That paper described ReLU network architectures with $W$ weights ensuring approximation with error
$$O\big(W^{-d/\nu}\ln^{d/\nu} W\big)$$
(Theorem 1). The construction was linear in the sense that the network weights depended on the approximated function linearly. Up to the logarithmic factor, this approximation rate matches the optimal rate over all parametric models under the assumption of continuous parameter selection (DeVore et al. (1989)). It was also shown in Theorem 2 of Yarotsky (2017) that one can slightly (by a logarithmic factor) improve over this conditionally optimal rate by adjusting network architectures to the approximated function.
On the other hand, it was proved in Theorem 4 of Yarotsky (2017) that ReLU networks generally cannot provide approximation with accuracy better than $O(W^{-2d/\nu})$ – a bound with the power $\frac{2d}{\nu}$ twice as big as in the previously mentioned existence result. As was shown in the same theorem, this bound can be strengthened for shallow networks. However, without imposing depth constraints, there was a serious gap between the powers $\frac{2d}{\nu}$ and $\frac{d}{\nu}$ in the lower and upper accuracy bounds that was left open in that paper.
In the present paper we bridge this gap in the setting of continuous functions (which is slightly more general than the setting of the Sobolev space of Lipschitz functions, $\mathcal{W}^{1,\infty}([0,1]^\nu)$, i.e. the case $d=1$). Our key insight is the close connection between approximation theory and VC dimension bounds. The lower bound on the approximation accuracy in Theorem 4 of Yarotsky (2017) was derived using the upper VCdim bound from Goldberg and Jerrum (1995). More accurate upper and lower bounds involving the network depth have been given in Bartlett et al. (1998); Sakurai (1999). The recent paper Bartlett et al. (2017) establishes nearly tight lower and upper VCdim bounds: $cWL\ln(W/L)\le \mathrm{VCdim}(W,L)\le CWL\ln W$, where $\mathrm{VCdim}(W,L)$ is the largest VC dimension of a piecewise linear network with $W$ weights and $L$ layers. The key element in the proof of the lower bound is the “bit extraction technique” (Bartlett et al. (1998)) providing a way to compress significant expressiveness in a single network weight. In the present paper we adapt this technique to the approximation theory setting.
Our main result is the complete phase diagram for the parameterized family of approximation rates involving the modulus of continuity $\omega_f$ of the function $f$ and the number of weights $W$. We prove that using very deep networks one can approximate a function $f$ with error $O(\omega_f(O(W^{-2/\nu})))$, and this rate is optimal up to a logarithmic factor. In fact, the depth of the networks must necessarily grow almost linearly with $W$ to achieve this rate, in sharp contrast to shallow networks that can provide approximation with error $O(\omega_f(O(W^{-1/\nu})))$. Moreover, whereas the slower rate $O(\omega_f(O(W^{-1/\nu})))$ can be achieved using a continuous weight assignment in the network, the optimal rate $O(\omega_f(O(W^{-2/\nu})))$ necessarily requires a discontinuous weight assignment. All this allows us to regard these two kinds of approximations as being in different “phases”. In addition, we explore the intermediate rates $O(\omega_f(O(W^{-p})))$ with $p\in(\frac{1}{\nu},\frac{2}{\nu})$ and show that they are also in the discontinuous phase and require network depths $\sim W^{p\nu-1}$. We show that the optimal rate $p=\frac{2}{\nu}$ can be achieved with a deep constant-width fully-connected architecture, whereas the rates with $p\in(\frac{1}{\nu},\frac{2}{\nu})$ and depth $O(W^{p\nu-1})$ can be achieved by stacking the deep constant-width architecture with a shallow parallel architecture. Apart from the bit extraction technique, we use the idea of the two-scales expansion from Theorem 2 in Yarotsky (2017) as an essential tool in the proofs of our results.
2 The results
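We define the modulus of continuity $\omega_f$ of a function $f:[0,1]^\nu\to\mathbb{R}$ by
$$\omega_f(r)=\max\{|f(\mathbf{x})-f(\mathbf{y})| : \mathbf{x},\mathbf{y}\in[0,1]^\nu,\ |\mathbf{x}-\mathbf{y}|\le r\}, \tag{1}$$
where $|\mathbf{x}|$ is the euclidean norm of $\mathbf{x}$.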
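To make the definition concrete, the following is a minimal numerical sketch for the one-dimensional case $\nu=1$. It estimates the modulus of continuity on a uniform grid; the function name `modulus_of_continuity`, the grid size `n`, and the small tolerance added to `r` are illustrative assumptions, not part of the paper.

```python
import numpy as np

def modulus_of_continuity(f, r, n=401):
    """Grid estimate of omega_f(r) for f on [0, 1] (the case nu = 1).

    Only pairs of grid points are compared, so the result is a lower
    estimate of the true modulus of continuity.
    """
    xs = np.linspace(0.0, 1.0, n)
    vals = f(xs)
    gaps = np.abs(xs[:, None] - xs[None, :])        # |x - y| for all pairs
    diffs = np.abs(vals[:, None] - vals[None, :])   # |f(x) - f(y)| for all pairs
    # Small tolerance so that pairs at distance exactly r survive rounding.
    return float(diffs[gaps <= r + 1e-12].max())

# Example: for f(x) = sqrt(x) the modulus is sqrt(r), attained at x = 0, y = r.
omega_quarter = modulus_of_continuity(np.sqrt, 0.25)
```

For a non-Lipschitz function such as the square root, the estimate decays like $\sqrt{r}$ rather than linearly in $r$, which is the kind of behaviour the rates in this paper are expressed through.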
We approximate functions $f:[0,1]^\nu\to\mathbb{R}$ by usual feed-forward neural networks with the ReLU activation function. The network has $\nu$ input units, some hidden units, and one output unit. The hidden units are assumed to be grouped in a sequence of layers so that the input of each unit is formed by outputs of some units from previous layers. The depth $L$ of the network is the number of these hidden layers. A hidden unit computes a linear combination of its inputs followed by the activation function: $x_1,\ldots,x_s\mapsto(\sum_{k=1}^{s}w_kx_k+h)_+$, where $w_k$ and $h$ are the weights associated with this unit. The output unit acts similarly, but without the activation function: $x_1,\ldots,x_s\mapsto\sum_{k=1}^{s}w_kx_k+h$.
The network is determined by its architecture and weights. Clearly, the total number of weights, denoted by $W$, is equal to the total number of connections and computation units (not counting the input units). We don’t impose any constraints on the network architecture (see Fig. 1 for an example of a valid architecture).
Throughout the paper, we consider the input dimension $\nu$ as fixed. Accordingly, by constants we will generally mean values that may depend on $\nu$.
We are interested in relating the approximation errors to the complexity of the function $f$, measured by its modulus of continuity $\omega_f$, and to the complexity of the approximating network, measured by its total number of weights $W$. More precisely, we consider approximation rates in terms of the following procedure.
First, suppose that for each $W$ we choose in some way a network architecture $\eta_W$ with $\nu$ inputs and $W$ weights. Then, for any $f:[0,1]^\nu\to\mathbb{R}$ we construct an approximation $\tilde f_W$ to $f$ by choosing in some way the values of the weights in the architecture $\eta_W$ – in the sequel, we refer to this stage as the weight assignment. The question we ask is this: for which powers $p$ can we ensure, by the choice of the architecture and then the weights, that
$$\|f-\tilde f_W\|_\infty\le a\,\omega_f(cW^{-p}) \tag{2}$$
with some constants $a,c$ possibly depending on $\nu$ and $p$ but not on $W$ or $f$?
Clearly, if inequality (2) holds for some $p$, then it also holds for any smaller $p$. However, we expect that for smaller $p$ the inequality can be in some sense easier to satisfy. In this paper we show that there is in fact a qualitative difference between different regions of $p$’s.
Our findings are best summarized by the phase diagram shown in Fig. 2. We give an informal overview of the diagram before moving on to precise statements. The region of generally feasible rates is $p\le\frac{2}{\nu}$. This region includes two qualitatively distinct phases corresponding to $p\in(0,\frac{1}{\nu}]$ and $p\in(\frac{1}{\nu},\frac{2}{\nu}]$. At $p\in(0,\frac{1}{\nu}]$ the rate (2) can be achieved by fixed-depth networks whose weights depend linearly on the approximated function $f$. In contrast, at $p\in(\frac{1}{\nu},\frac{2}{\nu}]$ the rate can only be achieved by networks with growing depths $L\sim W^{p\nu-1}$ and whose weights depend discontinuously on the approximated function. In particular, at the rightmost feasible point $p=\frac{2}{\nu}$ the approximating architectures have $L\sim W$ and are thus necessarily extremely deep and narrow.
Figure 4: The parallel, constant-depth network architecture implementing piecewise linear interpolation and ensuring approximation rate (2) with $p=\frac{1}{\nu}$.
We now turn to precise statements. First we characterize the phase $p\in(0,\frac{1}{\nu}]$ in which the approximation can be obtained using a standard piecewise linear interpolation. In the sequel, when writing $f=O(g)$ we mean that $|f|\le cg$ with some constant $c$ that may depend on $\nu$. For brevity, we will write $\tilde f$ without the subscript $W$.
Proposition 1. There exist network architectures $\eta_W$ with $W$ weights and, for each $W$, a weight assignment linear in $f$ such that Eq. (2) is satisfied with $p=\frac{1}{\nu}$. The network architectures can be chosen as consisting of $O(W)$ parallel blocks each having the same architecture that only depends on $\nu$ (see Fig. 4). In particular, the depths of the networks depend on $\nu$ but not on $W$.
The detailed proof is given in Section 4.2; we explain now the idea. The approximating function $\tilde f$ is constructed as a linear combination of “spike” functions sitting at the knots of the regular grid in $[0,1]^\nu$, with coefficients given by the values of $f$ at these knots (see Fig. 4). For a grid of spacing $\frac{1}{N}$ with an appropriate $N$, the number of knots is $\sim N^\nu$ while the approximation error is $O(\omega_f(O(\frac{1}{N})))$. We implement each spike by a block in the network, and implement the whole approximation by summing blocks connected in parallel and weighted. Then the whole network has $O(N^\nu)$ weights and, by expressing $N$ as $\sim W^{1/\nu}$, the approximation error is $O(\omega_f(O(W^{-1/\nu})))$, i.e. we obtain the rate (2) with $p=\frac{1}{\nu}$.
We note that the weights of the resulting network either do not depend on $f$ at all or are given by $w=f(\mathbf{x})$ with some $\mathbf{x}\in[0,1]^\nu$. In particular, the weight assignment is continuous in $f$ with respect to the standard topology of $C([0,1]^\nu)$.
We turn now to the region $p>\frac{1}{\nu}$. Several properties of this region are either direct consequences or slight modifications of existing results, and it is convenient to combine them in a single theorem.
Theorem 1.
a) (Feasibility) Approximation rate (2) cannot be achieved with $p>\frac{2}{\nu}$.
b) (Inherent discontinuity) Approximation rate (2) cannot be achieved with $p>\frac{1}{\nu}$ if the weights of $\tilde f_W$ are required to depend on $f$ continuously with respect to the standard topology of $C([0,1]^\nu)$.
c) (Inherent depth) If approximation rate (2) is achieved with a particular $p\in(\frac{1}{\nu},\frac{2}{\nu}]$, then the architectures $\eta_W$ must have depths $L\ge dW^{p\nu-1}/\ln W$ with some possibly $\nu$- and $p$-dependent constant $d>0$.
Proof. The proofs of these statements have the common element of considering the approximation for functions from the unit ball $F_{\nu,1}$ in the Sobolev space $\mathcal{W}^{1,\infty}([0,1]^\nu)$ of Lipschitz functions. Namely, suppose that the approximation rate (2) holds with some $p$. Then all $f\in F_{\nu,1}$ can be approximated by architectures $\eta_W$ with accuracy
$$\epsilon_W=c_1W^{-p} \tag{3}$$
with some constant $c_1$ independent of $W$. The three statements of the theorem are then obtained as follows.
a) This statement is a consequence of Theorem 4a) of Yarotsky (2017), which is in turn a consequence of the upper bound $O(W^2)$ for the VC dimension of a ReLU network (Goldberg and Jerrum (1995)). Precisely, Theorem 4a) implies that if an architecture $\eta_W$ allows one to approximate all $f\in F_{\nu,1}$ with accuracy $\epsilon_W$, then $W\ge c_2\epsilon_W^{-\nu/2}$ with some $c_2>0$. Comparing this with Eq. (3), we get $p\le\frac{2}{\nu}$.
b) This statement is a consequence of the general bound of DeVore et al. (1989) on the efficiency of approximation of Sobolev balls with parametric models having parameters continuously depending on the approximated function. Namely, if the weights of the networks $\tilde f_W$ depend on $f\in F_{\nu,1}$ continuously, then Theorem 4.2 of DeVore et al. (1989) implies that $\epsilon_W\ge c_2W^{-1/\nu}$ with some constant $c_2>0$, which implies that $p\le\frac{1}{\nu}$.
c) This statement can be obtained by combining arguments of Theorem 4 of Yarotsky (2017) with the recently established tight upper bound for the VC dimension of ReLU networks (Bartlett et al. (2017), Theorem 6) with given depth $L$ and the number of weights $W$:
$$\mathrm{VCdim}(W,L)\le CWL\ln W, \tag{4}$$
where $C$ is a global constant.
Specifically, suppose that an architecture $\eta_W$ allows one to approximate all $f\in F_{\nu,1}$ with accuracy $\epsilon_W$. Then, by considering suitable trial functions, one shows that if we threshold the network output, the resulting networks must have VC dimension $\mathrm{VCdim}(\eta_W)\ge c_2\epsilon_W^{-\nu}$ (see Eq. (38) in Yarotsky (2017)). Hence, by Eq. (3), $\mathrm{VCdim}(\eta_W)\ge c_3W^{p\nu}$. On the other hand, the upper bound (4) implies $\mathrm{VCdim}(\eta_W)\le CWL\ln W$. We conclude that $c_3W^{p\nu}\le CWL\ln W$, i.e. $L\ge dW^{p\nu-1}/\ln W$ with some constant $d>0$. ∎
Theorem 1 suggests the existence of an approximation phase drastically different from the phase $p\in(0,\frac{1}{\nu}]$. This new phase would provide better approximation rates, up to $p=\frac{2}{\nu}$, at the cost of deeper networks and some complex, discontinuous weight assignment. The main contribution of the present paper is the proof that this phase indeed exists.
We describe some architectures that, as we will show, correspond to this phase. First we describe the architecture for $p=\frac{2}{\nu}$, i.e. for the fastest possible approximation rate. Consider the usual fully-connected architecture connecting neighboring layers and having a constant number of neurons in each layer, see Fig. 6. We refer to this constant number of neurons as the “width” of the network. Such a network of width $H$ and depth $L$ has $W=L(H^2+H)-H^2+(\nu+1)H+1$ weights in total. We will be interested in the scenario of “narrow” networks where $H$ is fixed and the network grows by increasing $L$; then $W$ grows linearly with $L$. Below we will refer to the “narrow fully-connected architecture of width $H$ having $W$ weights”: the depth $L$ is supposed in this case to be determined from the above equality; we will assume without loss of generality that the equality is satisfied with an integer $L$. We will show that these narrow architectures provide the $p=\frac{2}{\nu}$ approximation rate if the width $H$ is large enough (say, $H=2\nu+10$).
In the case $p\in(\frac{1}{\nu},\frac{2}{\nu})$ we consider another kind of architectures obtained by stacking parallel shallow architectures (akin to those of Proposition 1) with the above narrow fully-connected architectures, see Fig. 6. The first, parallelized part of these architectures consists of blocks that only depend on $\nu$ (but not on $W$ or $p$). The second, narrow fully-connected part will again have a fixed width, and we will take its depth to be $W^{p\nu-1}$. All the remaining weights then go into the first parallel subnetwork, which in particular determines the number of blocks in it. Since the blocks are parallel and their architectures do not depend on $W$, the overall depth of the network is determined by the second, deep subnetwork and is $O(W^{p\nu-1})$. On the other hand, in terms of the number of weights, for $p<\frac{2}{\nu}$ most computation is performed by the first, parallel subnetwork (the deep subnetwork has $O(W^{p\nu-1})$ weights while the parallel one has an asymptotically larger number of weights, $W-O(W^{p\nu-1})$).
Clearly, these stacked architectures can be said to “interpolate” between the purely parallel architectures for $p=\frac{1}{\nu}$ and the purely serial architectures for $p=\frac{2}{\nu}$. Note that a parallel computation can be converted into a serial one at the cost of increasing the depth of the network. For $p<\frac{2}{\nu}$, rearranging the parallel subnetwork of the stacked architecture into a serial one would destroy the $O(W^{p\nu-1})$ bound on the depth of the full network, since the parallel subnetwork has $\sim W$ weights. However, for $p=\frac{2}{\nu}$ this rearrangement does not affect the asymptotic of the depth more than by a constant factor – that’s why we don’t include the parallel subnetwork into the full network in this case.
We state now our main result as the following theorem.
Theorem 2.
a) For any $p\in(\frac{1}{\nu},\frac{2}{\nu}]$, there exist a sequence of architectures $\eta_W$ with depths $L=O(W^{p\nu-1})$ and respective weight assignments such that inequality (2) holds with this $p$.
b) For $p=\frac{2}{\nu}$, an example of such architectures is the narrow fully-connected architectures of constant width $2\nu+10$.
c) For $p\in(\frac{1}{\nu},\frac{2}{\nu})$, an example of such architectures are stacked architectures described above, with the narrow fully-connected subnetwork having width $3^\nu(2\nu+10)$ and depth $W^{p\nu-1}$.
Comparing this theorem with Theorem 1a) we see that the narrow fully-connected architectures provide the best possible approximation in the sense of Eq. (2). Moreover, for $p\in(\frac{1}{\nu},\frac{2}{\nu})$ the upper bound on the network depth in Theorem 2c) matches the lower bound in Theorem 1c) up to a logarithmic factor. This proves that for $p\in(\frac{1}{\nu},\frac{2}{\nu})$ our stacked architectures are also optimal (up to a logarithmic correction) if we additionally impose the asymptotic constraint $L=O(W^{p\nu-1})$ on the network depth.
The full proof of Theorem 2 is given in Section 5; we explain now its main idea. Given a function $f$ and some $W$, we first proceed as in Proposition 1 and construct its piecewise linear interpolation $\tilde f_1$ on the length scale $\frac{1}{N}$ with $N\sim W^{1/\nu}$. This approximation has uniform error $O(\omega_f(O(W^{-1/\nu})))$. Then, we improve this approximation by constructing an additional approximation $\tilde f_2$ for the discrepancy $f-\tilde f_1$. This second approximation lives on a smaller length scale $\frac{1}{M}$ with $M\sim W^{p}$. In contrast to $\tilde f_1$, the second approximation is inherently discrete: we consider a finite set of possible shapes of $f-\tilde f_1$ in patches of linear size $\sim\frac{1}{N}$, and in each patch we use a special single network weight to encode the shape closest to $f-\tilde f_1$. The second approximation is then fully determined by the collection of these special encoding weights found for all patches. We make the parallel subnetwork of the full network serve two purposes: in addition to computing the initial approximation $\tilde f_1(\mathbf{x})$ as in Proposition 1, the subnetwork returns the position of $\mathbf{x}$ within its patch along with the weight that encodes the second approximation $\tilde f_2$ within this patch. The remaining, deep narrow part of the network then serves to decode the second approximation within this patch from the special weight and compute the value $\tilde f_2(\mathbf{x})$. Since the second approximation lives on the smaller length scale $\frac{1}{M}$, there are $\exp(O((M/N)^\nu))$ possible approximations $\tilde f_2$ within the patch that might need to be encoded in the special weight. It then takes a narrow network of depth $L\sim(M/N)^\nu$ to reconstruct the approximation from the special weight using the bit extraction technique of Bartlett et al. (1998). As $M\sim W^{p}$ and $N\sim W^{1/\nu}$, we get $L\sim W^{p\nu-1}$. At the same time, the second approximation allows us to effectively improve the overall approximation scale from $\sim\frac{1}{N}$ down to $\sim\frac{1}{M}$, i.e. to $\sim W^{-p}$, while keeping the total number of weights in the network. This gives us the desired error bound $O(\omega_f(O(W^{-p})))$.
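The exponent bookkeeping in this outline can be written out explicitly. The following is only a sketch with constants suppressed, assuming the coarse grid has $N\sim W^{1/\nu}$ knots per side and the refined grid has $M\sim W^{p}$ knots per side; it also checks the two endpoints of the discontinuous phase.

```latex
\begin{align*}
\#\{\text{refined knots per patch}\} &\sim (M/N)^{\nu}
   \sim \bigl(W^{p}\,W^{-1/\nu}\bigr)^{\nu} = W^{p\nu-1},\\
L &= O\bigl((M/N)^{\nu}\bigr) = O\bigl(W^{p\nu-1}\bigr),\\
p=\tfrac{1}{\nu} &\;\Rightarrow\; L = O(1),
\qquad
p=\tfrac{2}{\nu} \;\Rightarrow\; L = O(W).
\end{align*}
```

The two endpoints recover the constant-depth regime of the interpolation phase and the linear-depth regime of the narrow fully-connected architecture.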
We remark that the discontinuity of the weight assignment in our construction is the result of the discreteness of the second approximation $\tilde f_2$: whereas the variable weights in the network implementing the first approximation $\tilde f_1$ are found by linearly projecting the approximated function $f$ to $\tilde f_1$ (namely, by computing $f$ at the knots $\frac{\mathbf{n}}{N}$), the variable weights for $\tilde f_2$ are found by assigning to $f$ one of the finitely many values encoding the possible approximate shapes in a patch. This operation is obviously discontinuous. While the discontinuity is present for all $p>\frac{1}{\nu}$, at smaller $p$ it is “milder” in the sense of a smaller number of assignable values.
3 Discussion
We discuss now our result in the context of general approximation theory and practical machine learning. First, a theorem of Kainen et al. (1999) shows that in the optimal approximations by neural networks the weights generally discontinuously depend on the approximated function, so the discontinuity property that we have established is not surprising. However, this theorem of Kainen et al. (1999) does not in any way quantify the accuracy gain that can be acquired by giving up the continuity of the weights. Our result does this in the practically important case of deep ReLU networks, and explicitly describes a relevant mechanism.
In general, many nonlinear approximation schemes involve some form of discontinuity, often explicit (e.g., using different expansion bases for different approximated functions (DeVore (1998))). At the same time, discontinuous selection of parameters in parametric models is often perceived as an undesirable phenomenon associated with unreliable approximation (DeVore et al. (1989); DeVore (1998)). We point out, however, that deep architectures considered in the present paper resemble some popular state-of-the-art practical networks for highly accurate image recognition – residual networks (He et al., 2016) and highway networks (Srivastava et al., 2015) that may have dozens or even hundreds of layers. While our model does not explicitly include shortcut connections as in ResNets, a very similar element is effectively present in the proof of Theorem 2 (in the form of channels reserved for passing forward the data). We expect, therefore, that our result may help better understand the properties of ResNet-like networks.
Quantized network weights have been previously considered from the information-theoretic point of view in Bölcskei et al. (2017); Petersen and Voigtlaender (2017). In the present paper we do not use quantized weights in the statement of the approximation problem, but they appear in the solution (namely, we use them to store small-scale descriptions of the approximated function). One can expect that weight quantization may play an important role in the future development of the theory of deep networks.
4 Preliminaries and proof of Proposition 1
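4.1 Preliminaries
The modulus of continuity defined by (1) is monotone nondecreasing in $r$. By the convexity of the cube $[0,1]^\nu$, for any integer $n\ge 1$ we have $\omega_f(r)\le n\,\omega_f(\frac{r}{n})$. More generally, for any $\alpha\in(0,1)$ we can write
$$\alpha\,\omega_f(r)\le\frac{1}{\lfloor 1/\alpha\rfloor}\,\omega_f(r)\le\omega_f\Big(\frac{r}{\lfloor 1/\alpha\rfloor}\Big)\le\omega_f(2\alpha r). \tag{5}$$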
The ReLU function $x_+=\max(x,0)$ allows us to implement the binary $\max$ operation as $\max(a,b)=a+(b-a)_+$ and the binary $\min$ operation as $\min(a,b)=a-(a-b)_+$. The maximum or minimum of any $n$ numbers can then be implemented by chaining $n-1$ binary $\max$’s or $\min$’s. Computation of the absolute value can be implemented by $|x|=x_++(-x)_+$.
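The identities behind these ReLU implementations are easy to check directly. The sketch below uses the standard formulas $\max(a,b)=a+(b-a)_+$, $\min(a,b)=a-(a-b)_+$ and $|x|=x_++(-x)_+$; the helper names (`relu_max`, `chained_max`, and so on) are hypothetical and chosen only for this illustration.

```python
def relu(x):
    """The ReLU activation x_+ = max(x, 0)."""
    return max(x, 0)

def relu_max(a, b):
    """Binary max from one ReLU: a + (b - a)_+."""
    return a + relu(b - a)

def relu_min(a, b):
    """Binary min from one ReLU: a - (a - b)_+."""
    return a - relu(a - b)

def relu_abs(x):
    """Absolute value from two ReLUs: x_+ + (-x)_+."""
    return relu(x) + relu(-x)

def chained_max(values):
    """Maximum of several numbers by chaining binary maxima.

    Each step corresponds to one extra layer in a single additional
    channel that keeps the running maximum.
    """
    running = values[0]
    for v in values[1:]:
        running = relu_max(running, v)
    return running

largest = chained_max([2, 7, 1, 5])
```

Chaining is sequential, so the maximum of $n$ numbers computed this way costs depth proportional to $n$ but only constant extra width.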
Without loss of generality, we can assume that hidden units in the ReLU network may not include the ReLU nonlinearity (i.e., may compute just a linear combination of the input values). Indeed, we can simulate nonlinearity-free units just by increasing the weight $h$ in the formula $x_1,\ldots,x_s\mapsto(\sum_{k=1}^{s}w_kx_k+h)_+$, so that $\sum_{k=1}^{s}w_kx_k+h$ is always nonnegative (this is possible since the network inputs are from a compact set and the network implements a continuous function), and then compensating the shift by subtracting appropriate amounts in the units receiving input from the unit in question. In particular, we can simulate in this way trivial “pass forward” (identity) units that simply deliver given values to some later layers of the network.
In the context of deep networks of width $H$ considered in Section 5, it will be occasionally convenient to think of the network as consisting of $H$ “channels”, i.e. sequences of units with one unit per layer. Channels can be used to pass forward values or to perform computations. For example, suppose that we already have $n$ channels that pass forward $n$ numbers. If we also need to compute the maximum of these numbers, then, by chaining binary $\max$’s, this can be done in a subnetwork including one additional channel that spans $n$ layers.
We denote vectors by boldface characters; the scalar components of a vector $\mathbf{x}$ are denoted $x_1,\ldots,x_\nu$.
4.2 Proof of Proposition 1
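We start by describing the piecewise linear approximation of the function $f$ on a scale $\frac{1}{N}$, where $N$ is some fixed large integer that we will later relate to the network size $W$. It will be convenient to denote this approximation by $\tilde f_1$. This approximation is constructed as an interpolation of $f$ on the grid $(\mathbb{Z}/N)^\nu$. To this end, we consider the standard triangulation $P_N$ of the space $\mathbb{R}^\nu$ into the simplexes
$$\Delta^{(N)}_{\mathbf{n},\rho}=\Big\{\mathbf{x}\in\mathbb{R}^\nu: 0\le x_{\rho(1)}-\tfrac{n_{\rho(1)}}{N}\le\ldots\le x_{\rho(\nu)}-\tfrac{n_{\rho(\nu)}}{N}\le\tfrac{1}{N}\Big\},$$
where $\mathbf{n}=(n_1,\ldots,n_\nu)\in\mathbb{Z}^\nu$ and $\rho$ is a permutation of $\nu$ elements. This triangulation can be described as resulting from dissecting the space $\mathbb{R}^\nu$ by the hyperplanes $x_k-x_s=\frac{n}{N}$ and $x_k=\frac{n}{N}$ with various $1\le k,s\le\nu$ and $n\in\mathbb{Z}$.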
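The vertices of the simplexes of the triangulation $P_N$ are the points of the grid $(\mathbb{Z}/N)^\nu$. Each such point $\frac{\mathbf{n}}{N}$ is a vertex in $(\nu+1)!$ simplexes. The union of these simplexes is the convex set $\{\mathbf{x}\in\mathbb{R}^\nu:(\max(\mathbf{x}-\frac{\mathbf{n}}{N}))_+-(\min(\mathbf{x}-\frac{\mathbf{n}}{N}))_-\le\frac{1}{N}\}$, where
$$(\max(\mathbf{x}))_+=\max(x_1,\ldots,x_\nu,0),\qquad(\min(\mathbf{x}))_-=\min(x_1,\ldots,x_\nu,0).$$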
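We define the “spike function” $\phi:\mathbb{R}^\nu\to\mathbb{R}$ as the continuous piecewise linear function such that 1) it is linear on each simplex of the partition $P_1$ (the triangulation with $N=1$), and 2) $\phi(\mathbf{n})=\mathbb{1}[\mathbf{n}=\mathbf{0}]$ for $\mathbf{n}\in\mathbb{Z}^\nu$. The spike function can be expressed in terms of linear and ReLU operations as follows. Let $S$ be the set of the $(\nu+1)!$ simplexes having $\mathbf{0}$ as a vertex. For each $\Delta\in S$ let $l_\Delta:\mathbb{R}^\nu\to\mathbb{R}$ be the affine map such that $l_\Delta(\mathbf{0})=1$ and $l_\Delta$ vanishes on the face of $\Delta$ opposite to the vertex $\mathbf{0}$. Then
$$\phi(\mathbf{x})=\big(\min_{\Delta\in S}l_\Delta(\mathbf{x})\big)_+. \tag{6}$$
Indeed, if $\mathbf{x}\notin\cup_{\Delta\in S}\Delta$ then it is easy to see that we can find some $\Delta\in S$ such that $l_\Delta(\mathbf{x}) < 0$, hence $\phi$ vanishes outside $\cup_{\Delta\in S}\Delta$, as required. On the other hand, consider the restriction of $\phi$ to $\cup_{\Delta\in S}\Delta$. Each $l_\Delta$ is nonnegative on this set. For each $\Delta_1\in S$ and $\mathbf{x}\in\cup_{\Delta\in S}\Delta$, the value $l_{\Delta_1}(\mathbf{x})$ is a convex combination of the values of $l_{\Delta_1}$ at the vertices of the simplex $\Delta_{\mathbf{x}}$ that $\mathbf{x}$ belongs to. Since $l_{\Delta_1}$ is nonnegative and $l_{\Delta_1}(\mathbf{0})=1$ for all $\Delta_1$, the minimum in (6) is attained at $\Delta$ such that $l_\Delta$ vanishes at all vertices of $\Delta_{\mathbf{x}}$ other than $\mathbf{0}$, i.e. at $\Delta=\Delta_{\mathbf{x}}$.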
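Note that each map $l_\Delta$ in (6) is either of the form $1+x_k-x_s$ or $1\pm x_k$ (different simplexes may share the same map $l_\Delta$), so the minimum actually needs to be taken only over $\nu(\nu+1)$ different $l_\Delta$:
$$\phi(\mathbf{x})=\Big(\min\big(\min_{k\ne s}(1+x_k-x_s),\ \min_k(1+x_k),\ \min_k(1-x_k)\big)\Big)_+.$$
We define now the piecewise linear interpolation $\tilde f_1$ by
$$\tilde f_1(\mathbf{x})=\sum_{\mathbf{n}\in\{0,1,\ldots,N\}^\nu}f\big(\tfrac{\mathbf{n}}{N}\big)\,\phi(N\mathbf{x}-\mathbf{n}). \tag{7}$$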
The function is linear on each simplex and agrees with at the interpolation knots . We can bound , since any partial derivative in the interior of a simplex equals , where are the two vertices of the simplex having all coordinates the same except . It follows that the modulus of continuity for can be bounded by . Moreover, by Eq.(5) we have as long as Therefore we can also write for
Consider now the discrepancy
$$f_2=f-\tilde f_1.$$
We can bound the modulus of continuity of by for Since any point is within the distance of one of the interpolation knots where vanishes, we can also write
where in the second inequality we again used Eq.(5).
We observe now that formula (7) can be represented by a parallel network shown in Fig. 4 in which the blocks compute the values $\phi(N\mathbf{x}-\mathbf{n})$ and their output connections carry the weights $f(\frac{\mathbf{n}}{N})$. Since the blocks have the same architecture only depending on $\nu$, for a network with $(N+1)^\nu$ blocks the total number of weights is $O(N^\nu)$. It follows that the number of weights will not exceed $W$ if we take $N=\lfloor cW^{1/\nu}\rfloor$ with a sufficiently small constant $c$. Then, the error bound (8) ensures the desired approximation rate (2) with $p=\frac{1}{\nu}$.
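The construction of this section can be sketched numerically. The code below assumes the explicit form of the spike function as the ReLU of the minimum of the affine maps $1+x_k-x_s$ ($k\ne s$), $1+x_k$ and $1-x_k$, and sums the spikes over the grid $\{0,\ldots,N\}^\nu$ with coefficients given by the values of the function at the knots. The names `spike` and `interpolate` and the test function are illustrative assumptions, and the code evaluates the interpolation directly rather than building a network.

```python
import itertools
import numpy as np

def spike(x):
    """Spike function on R^nu: ReLU of the minimum of nu*(nu+1) affine maps."""
    x = np.asarray(x, dtype=float)
    nu = x.shape[-1]
    terms = [1.0 + x[..., k] for k in range(nu)]
    terms += [1.0 - x[..., k] for k in range(nu)]
    terms += [1.0 + x[..., k] - x[..., s]
              for k in range(nu) for s in range(nu) if k != s]
    return np.maximum(np.minimum.reduce(terms), 0.0)

def interpolate(f, N, x):
    """Piecewise linear interpolation of f on the grid {0, ..., N}^nu / N."""
    x = np.asarray(x, dtype=float)
    nu = x.shape[-1]
    total = np.zeros(x.shape[:-1])
    for n in itertools.product(range(N + 1), repeat=nu):
        knot = np.array(n, dtype=float)
        # One "block" per knot: a spike centred at n/N, weighted by f(n/N).
        total += f(knot / N) * spike(N * x - knot)
    return total

# Example with nu = 2: the error shrinks as the grid is refined.
f = lambda p: np.sin(3.0 * p[..., 0]) + p[..., 1] ** 2
pts = np.random.default_rng(0).random((200, 2))
errors = {N: float(np.max(np.abs(interpolate(f, N, pts) - f(pts))))
          for N in (2, 4, 8)}
```

The loop over knots mirrors the parallel blocks of the network: the number of summands is $(N+1)^\nu$, so the cost grows like $N^\nu$ while the grid spacing is $\frac{1}{N}$.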
5 Proof of Theorem 2
We divide the proof into three parts. In Section 5.1 we construct the “two-scales” approximation $\tilde f$ for the given function $f$ and estimate its accuracy. In Section 5.2 we describe an efficient way to store and evaluate the refined approximation using the bit extraction technique. Finally, in Section 5.3 we describe the neural network implementations of $\tilde f$ and verify the network size constraints.
5.1 The two-scales approximation and its accuracy
We follow the outline of the proof given after the statement of Theorem 2 and start by constructing the initial interpolating approximation to using Proposition 1. We may assume without loss of generality that must be implemented by a network with not more than weights, reserving the remaining weights for the second approximation. Then is given by Eq. (7), where
with a sufficiently small constant . The error of the approximation is given by Eq. (8). We turn now to approximating the discrepancy
5.1.1 Decomposition of the discrepancy.
It is convenient to represent as a finite sum of functions with supports consisting of disjoint “patches” of linear size Precisely, let and consider the partition of unity on the cube
We then write
Each function is supported on the disjoint union of cubes with corresponding to the spikes in the expansion (10). With some abuse of terminology, we will refer to both these cubes and the respective restrictions as “patches”.
5.1.2 The second (discrete) approximation.
We will construct the full approximation in the form
where are approximations to on a smaller length scale . We set
where is a sufficiently small constant to be chosen later and is the desired power in the approximation rate. We will assume without loss of generality that is integer, since in the sequel it will be convenient to consider the grid as a subgrid in the refined grid .
We define to be piecewise linear with respect to the refined triangulation and to be given on the refined grid by
Here the discretization parameter is given by
so that, by Eq. (5.1.1), we have
5.1.3 Accuracy of the full approximation.
Let us estimate the accuracy of the full approximation . First we estimate . Consider the piecewise linear function defined similarly to , but exactly interpolating at the knots Then, by Eq. (15), , since in each simplex the difference is a convex combination of the respective values at the vertices. Also, by applying the interpolation error bound (8) with instead of and instead of . Using Eqs. (16) and (17), it follows that
5.2 Storing and decoding the refined approximation
We have reduced our task to implementing the functions subject to the network size and depth constraints. We describe now an efficient way to compute these functions using a version of the bit extraction technique.
5.2.1 Patch encoding.
Fix . Note that, like , the approximation vanishes outside of the cubes with . Fix one of these and consider those values from Eq.(15) that lie in this patch:
By the bound (17), if are neighboring points on the grid , then Moreover, since vanishes on the boundary of , we have if one of the components of equals or . Let us consider separately the first component in the multi-index and write . Denote
Since we can encode all the values by a single ternary number
where is some enumeration of the multi-indices . The values for all and will be stored as weights in the network and encode the approximation in all the patches.
5.2.2 Reconstruction of the values .
The values are, up to the added constant 1, just the digits in the ternary representation of and can be recovered from by a deep ReLU network that iteratively implements “ternary shifts”. Specifically, consider the sequence with and . Then for all . To implement these computations by a ReLU network, we need to show how to compute for all . Consider a piecewise-linear function such that
Such a function can be implemented by Observe that if , then for all the number belongs to one of the three intervals in the r.h.s. of Eq.(21) and hence . Thus, we can reconstruct the values for all by a ReLU network with layers and weights.
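The digit-by-digit decoding described here can be illustrated with a small sketch. It assumes values in $\{-1,0,1\}$ stored (shifted by 1) as ternary digits of a single number, and it uses one possible ReLU-only plateau function that equals 0, 1, 2 on the three intervals that the shifted number can fall into; the exact intervals and constants used in the paper may differ. Exact rational arithmetic via `fractions.Fraction` stands in for real-valued weights, and the names `chi`, `encode`, `decode` are hypothetical.

```python
from fractions import Fraction

def relu(x):
    return x if x > 0 else Fraction(0)

def chi(x, eps):
    """Piecewise-linear plateau built from four ReLUs.

    Equals 0 for x <= 1 - eps, 1 on [1, 2 - eps], and 2 for x >= 2.
    """
    return (relu(x - 1 + eps) - relu(x - 1)
            + relu(x - 2 + eps) - relu(x - 2)) / eps

def encode(values):
    """Pack values in {-1, 0, 1} into one number b = sum_t 3^(-t) * (B_t + 1)."""
    return sum(Fraction(v + 1, 3 ** t) for t, v in enumerate(values, start=1))

def decode(b, count):
    """Recover the values by iterated 'ternary shifts' z -> 3z - digit."""
    eps = Fraction(1, 3 ** count)   # small enough for 'count' stored digits
    z, values = b, []
    for _ in range(count):
        digit = chi(3 * z, eps)     # leading ternary digit of z
        values.append(int(digit) - 1)
        z = 3 * z - digit           # drop that digit and continue
    return values

stored = [1, -1, 0, 0, 1, -1, -1]
recovered = decode(encode(stored), len(stored))
```

Each loop iteration uses a fixed number of ReLU operations, so decoding all stored values takes a number of sequential steps proportional to the number of values, which is the source of the depth requirement.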
5.2.3 Computation of in a patch.
On the patch , the function can be expanded over the spike functions as
It is convenient to rewrite this computation in terms of the numbers and the expressions
using summation by parts in the direction :
where we used the identities