1 Introduction
In the study of parametrized nonlinear approximations, the framework of continuous nonlinear widths of DeVore et al. [1] is a general approach based on the assumption that the approximation parameters depend continuously on the approximated object. Under this assumption, the framework provides lower error bounds for approximations in several important function spaces, in particular Sobolev spaces.
In the present paper we will only consider the simplest Sobolev space, i.e., the space $W^{1,\infty}([0,1])$ of Lipschitz functions on the segment $[0,1]$. The norm in this space can be defined by $\|f\|_{W^{1,\infty}}=\max(\|f\|_\infty,\|f'\|_\infty)$, where $f'$ is the weak derivative of $f$, and $\|\cdot\|_\infty$ is the norm in $L^\infty([0,1])$. We will denote by $F$ the unit ball in $W^{1,\infty}([0,1])$. Then the following result, contained in Theorem 4.2 of [1], establishes an asymptotic lower bound on the accuracy of approximating functions from $F$ under the assumption of continuous parameter selection:
Theorem 1.
Let $W$ be a positive integer and $\Phi:\mathbb{R}^W\to C([0,1])$ be any map between the space $\mathbb{R}^W$ and the space $C([0,1])$. Suppose that there is a continuous map $\phi:F\to\mathbb{R}^W$ such that $\|f-\Phi(\phi(f))\|_\infty\le\epsilon$ for all $f\in F$. Then $\epsilon\ge cW^{-1}$ with some absolute constant $c>0$.
The bound $\epsilon\sim W^{-1}$ stated in the theorem is attained, for example, by the usual piecewise linear interpolation with uniformly distributed breakpoints. Namely, assuming $W\ge 2$, define $\phi(f)=\big(f(\tfrac{n}{W-1})\big)_{n=0}^{W-1}$ and define $\Phi\big((f(\tfrac{n}{W-1}))_{n=0}^{W-1}\big)$ to be the continuous piecewise linear function with the nodes $\big(\tfrac{n}{W-1},f(\tfrac{n}{W-1})\big)_{n=0}^{W-1}$. Then it is easy to see that $\|f-\Phi(\phi(f))\|_\infty\le\tfrac{1}{2(W-1)}$ for all $f\in F$.

The key assumption of Theorem 1, that $\phi$ is continuous, need not hold in applications, in particular for approximations with neural networks. The most common practical task in this case is to optimize the weights of the network so as to obtain the best approximation of a specific function, without any regard for other possible functions. Theoretically, Kainen et al. [2]
prove that already for networks with a single hidden layer the optimal weight selection is discontinuous in general. The goal of the present paper is to quantify the accuracy gain that discontinuous weight selection brings to a deep feedforward neural network model in comparison to the baseline bound of Theorem 1.

Specifically, we consider a common deep architecture, shown in Fig. 1, that we will refer to as standard. The network consists of one input unit, one output unit, and $L$ fully connected hidden layers, each including a constant number $H$ of units. We refer to $L$ as the depth of the network and to $H$ as its width. The layers are fully connected in the sense that each unit is connected with all units of the previous and the next layer; only neighboring layers are connected. A hidden unit computes the map
$$x_1,\ldots,x_s\mapsto\sigma\Big(\sum_{k=1}^{s}w_kx_k+h\Big),$$
where $x_1,\ldots,x_s$ are the inputs, $\sigma$ is a nonlinear activation function, and $w_1,\ldots,w_s$ and $h$ are the weights associated with this computation unit. For the first hidden layer $s=1$, otherwise $s=H$. The network output unit is assumed to compute a linear map: $x_1,\ldots,x_H\mapsto\sum_{k=1}^{H}w_kx_k+h$. The total number of weights in the network then equals $W=(L-1)H(H+1)+3H+1$.
We will take the activation function $\sigma$ to be the ReLU function (Rectified Linear Unit), which is a popular choice in applications:
$$\sigma(x)=\max(0,x).\quad (1)$$
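For concreteness, the weight count of the standard architecture can be sketched numerically; the formula below is our reading of the description above, with biases counted among the weights.

```python
# Sketch of the weight count for the standard architecture described
# above (one input, one output, L fully connected hidden layers of
# width H).  The formula is our reading of the construction, with
# biases counted among the weights.
def num_weights(H, L):
    first = H * (1 + 1)               # first hidden layer: 1 input + bias per unit
    rest = (L - 1) * H * (H + 1)      # remaining hidden layers: H inputs + bias per unit
    out = H + 1                       # linear output unit: H inputs + bias
    return first + rest + out         # equals (L-1)*H*(H+1) + 3*H + 1

assert num_weights(5, 1) == 16
assert num_weights(5, 2) == 46
```

In particular, for a fixed width $H$ the number of weights grows linearly with the depth $L$.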
With this activation function, the functions implemented by the whole network are continuous and piecewise linear; in particular, they belong to $W^{1,\infty}([0,1])$. Let $\eta_{H,L}:\mathbb{R}^W\to C([0,1])$ map the weights of the width-$H$, depth-$L$ network to the respective implemented function. Consider uniform approximation rates for the ball $F$, with or without the weight selection continuity assumption:
$$\overline{d}(H,L)=\inf_{\substack{\widetilde w:F\to\mathbb{R}^W\\ \widetilde w\text{ continuous}}}\ \sup_{f\in F}\|f-\eta_{H,L}(\widetilde w(f))\|_\infty,\qquad d(H,L)=\inf_{\widetilde w:F\to\mathbb{R}^W}\ \sup_{f\in F}\|f-\eta_{H,L}(\widetilde w(f))\|_\infty.$$
Theorem 1 implies that $\overline{d}(H,L)\ge cW^{-1}$.
We will be interested in the scenario where the width $H$ is fixed and the depth $L$ is varied. The main result of our paper is an upper bound for $d(H,L)$ that we will prove in the case $H=5$ (but the result obviously also holds for larger $H$).
Theorem 2.
$d(5,L)\le\frac{c}{L\ln L}$ with some absolute constant $c$.
Comparing this result with Theorem 1, and noting that for a fixed width the number of weights $W$ is proportional to the depth $L$, we see that without the assumption of continuous weight selection the approximation error is lower than with this assumption at least by a factor logarithmic in the size of the network:
Corollary 1.
$\frac{d(5,L)}{\overline{d}(5,L)}\le\frac{c}{\ln W}$ with some absolute constant $c$.
The remainder of the paper is the proof of Theorem 2. The proof relies on a construction of adaptive network architectures with “cached” elements from [4]. We use a somewhat different, streamlined version of this construction: whereas in [4] it was first performed for a discontinuous activation function and then extended to ReLU by some approximation, in the present paper we do it directly, without invoking auxiliary activation functions.
2 Proof of Theorem 2
It is convenient to divide the proof into four steps. In the first step we construct, for any function $f\in F$, an approximation $\widetilde f$ using cached functions. In the second step we expand the constructed approximation $\widetilde f$ in terms of the ReLU function (1) and linear operations. In the third step we implement $\widetilde f$ by a highly parallel shallow network with an $f$-dependent (“adaptive”) architecture. In the fourth step we show that this adaptive architecture can be embedded in the standard deep architecture of width 5.
Step 1: Cache-based approximation.
We first explain the idea of the construction. We start with interpolating $f$ by a piecewise linear function with uniformly distributed nodes. After that, we create a “cache” of auxiliary functions that will be used to refine the approximation in the intervals between the nodes. The key point is that one cached function can be used in many intervals, which will eventually lead to a saving in the error–complexity relation. The assignment of cached functions to the intervals will be $f$-dependent and encoded in the network architecture in Step 3.
The idea of reused subnetworks dates back to the proof of the $O(2^n/n)$ upper bound for the complexity of Boolean circuits implementing $n$-ary functions (Shannon [3]).
We start now the detailed exposition. Given $f\in F$, we will construct an approximation $\widetilde f$ to $f$ in the form
$$\widetilde f=f_1+f_2.$$
Here, $f_1$ is the piecewise linear interpolation of $f$ with the breakpoints $\{\tfrac{n}{T}\}_{n=0}^{T}$:
$$f_1\big(\tfrac{n}{T}\big)=f\big(\tfrac{n}{T}\big),\quad n=0,\ldots,T.$$
The value of $T$ will be chosen later.
Since $f$ is Lipschitz with constant 1, $f_1$ is also Lipschitz with constant 1. We denote by $I_n$ the intervals between the breakpoints:
$$I_n=\big[\tfrac{n}{T},\tfrac{n+1}{T}\big],\quad n=0,\ldots,T-1.$$
We will now construct $f_2$ as an approximation to the difference
$$g=f-f_1.\quad (2)$$
Note that $g$ vanishes at the endpoints of the intervals $I_n$:
$$g\big(\tfrac{n}{T}\big)=0,\quad n=0,\ldots,T,\quad (3)$$
and is Lipschitz with constant 2:
$$|g(x)-g(x')|\le 2|x-x'|,\quad (4)$$
since $f$ and $f_1$ are Lipschitz with constant 1.
To define $f_2$, we first construct a set $\Gamma$ of cached functions. Let $m$ be a positive integer to be chosen later. Let $\Gamma$ be the set of piecewise linear functions $\gamma:[0,1]\to\mathbb{R}$ with the breakpoints $\{\tfrac{r}{m}\}_{r=0}^{m}$ and the properties
$$\gamma(0)=\gamma(1)=0$$
and
$$\gamma\big(\tfrac{r+1}{m}\big)-\gamma\big(\tfrac{r}{m}\big)\in\big\{-\tfrac{2}{m},0,\tfrac{2}{m}\big\},\quad r=0,\ldots,m-1.$$
Note that the size of $\Gamma$ is not larger than $3^m$.
If $\widetilde g$ is any function on $[0,1]$ Lipschitz with constant 2 and such that $\widetilde g(0)=\widetilde g(1)=0$, then $\widetilde g$ can be approximated by some $\gamma\in\Gamma$ with error not larger than $\tfrac{4}{m}$: namely, take $\gamma$ to be the piecewise linear function with the values $\gamma(\tfrac{r}{m})=\tfrac{2}{m}\big\lfloor\tfrac{m}{2}\widetilde g(\tfrac{r}{m})\big\rfloor$ at the breakpoints (here $\lfloor\cdot\rfloor$ is the floor function.)
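The floor-based cache approximation can be illustrated as follows; this is a sketch under our reconstruction of $\Gamma$, and the target function is an arbitrary Lipschitz-2 choice vanishing at the endpoints.

```python
import math

# Illustration of the cache approximation step in our reconstruction of
# the set Gamma: the floor construction produces nodal values on the
# grid (2/m)*Z with increments in {-2/m, 0, 2/m}, vanishing at the
# endpoints, and within 2/m of the target at every node.  The target
# function g is an arbitrary Lipschitz-2 choice with g(0) = g(1) = 0.
def cache_approx(g, m):
    return [(2 / m) * math.floor((m / 2) * g(k / m)) for k in range(m + 1)]

g = lambda y: y * (1 - y)
m = 8
vals = cache_approx(g, m)
assert vals[0] == 0 and vals[m] == 0                                  # endpoint property
assert all(abs(vals[k + 1] - vals[k]) <= 2 / m + 1e-12 for k in range(m))  # increments
assert all(0 <= g(k / m) - vals[k] < 2 / m for k in range(m + 1))     # node error < 2/m
```

Between the nodes the error can grow by at most another $2/m$ by the Lipschitz property, which gives the stated $4/m$ bound.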
Moreover, if $g$ is defined by (2), then, using (3), (4), on each interval $I_n$ the function $g$ can be approximated with error not larger than $\tfrac{4}{mT}$ by a properly rescaled function $\gamma\in\Gamma$. Namely, for each $n$ we can define the function $\widetilde g_n:[0,1]\to\mathbb{R}$ by $\widetilde g_n(y)=Tg\big(\tfrac{n+y}{T}\big)$. Then it is Lipschitz with constant 2 and $\widetilde g_n(0)=\widetilde g_n(1)=0$, so, enumerating the elements of $\Gamma$ as $\gamma_1,\ldots,\gamma_{|\Gamma|}$, we can find $\gamma_{k(n)}\in\Gamma$ such that $\|\widetilde g_n-\gamma_{k(n)}\|_\infty\le\tfrac{4}{m}$.
This can be equivalently written as
$$\Big|g(x)-\tfrac{1}{T}\gamma_{k(n)}(Tx-n)\Big|\le\frac{4}{mT},\quad x\in I_n.$$
Note that the obtained assignment $n\mapsto k(n)$ is not injective, in general ($T$ will be larger than $|\Gamma|$).
We can then define $f_2$ on the whole interval $[0,1]$ by
$$f_2(x)=\tfrac{1}{T}\gamma_{k(n)}(Tx-n),\quad x\in I_n,\ n=0,\ldots,T-1.\quad (5)$$
Since all $\gamma\in\Gamma$ vanish at 0 and 1, the function $f_2$ is well defined and continuous. It approximates $g$ with error $\tfrac{4}{mT}$ on $[0,1]$:
$$\|g-f_2\|_\infty\le\frac{4}{mT},$$
and hence, by (2), for the full approximation $\widetilde f=f_1+f_2$ we will also have
$$\|f-\widetilde f\|_\infty\le\frac{4}{mT}.\quad (6)$$
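The two-scale construction of Step 1 can be sanity-checked numerically; the following sketch follows our reconstruction of (5)–(6), with an arbitrary test function and arbitrary parameters.

```python
import math
import numpy as np

# Numerical sketch of Step 1 in our reconstruction: interpolation f1 at
# T uniform breakpoints plus cached corrections f2 built with the floor
# construction, with overall uniform error of order 1/(m*T).  The test
# function f and the parameters T, m are arbitrary choices.
T, m = 16, 8
f = lambda x: np.sin(3 * x) / 3                 # Lipschitz constant <= 1

def gamma_vals(gt):
    # floor construction from the cache step, applied to a rescaled g
    return [(2 / m) * math.floor((m / 2) * gt(k / m)) for k in range(m + 1)]

x = np.linspace(0, 1, 20001)
nodes = np.linspace(0, 1, T + 1)
f1 = np.interp(x, nodes, f(nodes))
f2 = np.zeros_like(x)
for n in range(T):
    gt = lambda y, n=n: T * (f(n / T + y / T) - np.interp(n / T + y / T, nodes, f(nodes)))
    vals = gamma_vals(gt)
    inside = (x >= n / T) & (x <= (n + 1) / T)
    f2[inside] = np.interp(T * x[inside] - n, np.linspace(0, 1, m + 1), vals) / T

err = np.max(np.abs(f(x) - (f1 + f2)))
assert err <= 4 / (m * T) + 1e-9                # the bound (6) in our reconstruction
```

Doubling either $m$ or $T$ halves the guaranteed error, while, as we will see, the network cost of $m$ is very different from that of $T$.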
Step 2: ReLU expansion of $\widetilde f$.
We express now the constructed approximation in terms of linear and ReLU operations.
Let us first describe the expansion of $f_1$. Since $f_1$ is a continuous piecewise linear interpolation of $f$ with the breakpoints $\{\tfrac{n}{T}\}_{n=0}^{T}$, we can represent it on the segment $[0,1]$ in terms of the ReLU activation function as
$$f_1(x)=b+\sum_{n=0}^{T-1}a_n\,\sigma\big(x-\tfrac{n}{T}\big)\quad (7)$$
with some weights $a_n$ and $b$.
Now we turn to $f_2$, as given by (5). Consider the “tooth” function
$$\phi(y)=\sigma(y)-\sigma(y-1).\quad (8)$$
Note in particular that, for any $\gamma\in\Gamma$,
$$\gamma(\phi(y))=\begin{cases}\gamma(y),& y\in[0,1],\\ 0,& y\notin[0,1],\end{cases}\quad (9)$$
since $\phi(y)=y$ on $[0,1]$, $\phi(y)=0$ for $y\le 0$, $\phi(y)=1$ for $y\ge 1$, and $\gamma(0)=\gamma(1)=0$.
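The localization property (9), in our reconstruction with $\phi(y)=\sigma(y)-\sigma(y-1)$, is easy to verify numerically; the function used below is an arbitrary function vanishing at 0 and 1, not necessarily an element of $\Gamma$.

```python
import numpy as np

# Quick numerical check of the localization property (9) in our
# reconstruction, where phi(y) = sigma(y) - sigma(y - 1): for any
# function gamma vanishing at 0 and 1 (the tent below is an arbitrary
# example), gamma(phi(y)) reproduces gamma on [0,1] and vanishes
# outside.
relu = lambda z: np.maximum(z, 0.0)
phi = lambda y: relu(y) - relu(y - 1.0)
gamma = lambda u: np.minimum(u, 1.0 - u) * (u >= 0) * (u <= 1)  # gamma(0) = gamma(1) = 0

y = np.linspace(-2.0, 3.0, 5001)
vals = gamma(phi(y))
inside = (y >= 0) & (y <= 1)
assert np.allclose(vals[inside], gamma(y[inside]))   # reproduces gamma on [0, 1]
assert np.allclose(vals[~inside], 0.0)               # vanishes outside [0, 1]
```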
The function $f_2$ can be written as
$$f_2(x)=\sum_{n=0}^{T-1}\tfrac{1}{T}\gamma_{k(n)}(\phi(Tx-n)),\quad x\in[0,1],\quad (10)$$
where we have used (5) and (9): for $x\in I_n$ the $n$-th term equals $\tfrac{1}{T}\gamma_{k(n)}(Tx-n)$, while all other terms vanish.
Let us expand each $\gamma_k$ over the basis of shifted ReLU functions:
$$\gamma_k(y)=\sum_{r=1}^{m}c_{k,r}\,\sigma\big(y-\tfrac{r-1}{m}\big),\quad y\in[0,1],\quad (11)$$
with some coefficients $c_{k,r}$. There is no constant term because $\gamma_k(0)=0$. Since $\gamma_k(1)=0$, we also have
$$\sum_{r=1}^{m}c_{k,r}\big(1-\tfrac{r-1}{m}\big)=0.\quad (12)$$
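In our reconstruction of (11), the coefficients $c_{k,r}$ are just the successive slope increments of $\gamma_k$; a small sketch (with 0-based indexing $r=0,\ldots,m-1$) checks this together with the relation (12).

```python
import numpy as np

# Sketch of the shifted-ReLU expansion (11) in our reconstruction (with
# 0-based indexing r = 0,...,m-1): for a piecewise linear gamma on [0,1]
# with breakpoints k/m and gamma(0) = 0, the coefficients c_r are the
# successive slope increments.  The nodal values below are an arbitrary
# admissible example.
m = 8
relu = lambda z: np.maximum(z, 0.0)
vals = np.array([0, .25, .5, .25, .25, 0, -.25, -.25, 0])   # gamma(k/m), gamma(0)=gamma(1)=0
slopes = np.diff(vals) * m
c = np.concatenate(([slopes[0]], np.diff(slopes)))          # coefficients c_r
y = np.linspace(0, 1, 1001)
gamma = sum(c[r] * relu(y - r / m) for r in range(m))
assert np.allclose(gamma, np.interp(y, np.arange(m + 1) / m, vals))
# gamma(1) = 0 translates into the linear relation (12) on the c_r:
assert abs(sum(c[r] * (1 - r / m) for r in range(m))) < 1e-12
```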
Consider the functions $\theta_k$ defined by
$$\theta_k(y)=\sum_{r=1}^{m}c_{k,r}\,\sigma\big(\phi(y)-\tfrac{r-1}{m}\big).\quad (13)$$
Lemma 1.
$$\theta_k(y)=\gamma_k(y),\quad y\in[0,1],\quad (14)$$
$$\theta_k(y)=0,\quad y\notin(0,1),\quad (15)$$
$$f_2(x)=\sum_{n=0}^{T-1}\tfrac{1}{T}\theta_{k(n)}(Tx-n),\quad x\in[0,1],\quad (16)$$
for all $y\in\mathbb{R}$, $x\in[0,1]$ and $k=1,\ldots,|\Gamma|$.
Proof.
Statements (14) and (15) follow directly from (8), (11), (12) and (13); statement (16) then follows from (5) and (10). $\square$

Using definition (5) of $f_2$ and representation (16), we can then write, for any $x\in[0,1]$,
$$f_2(x)=\sum_{k=1}^{|\Gamma|}\ \sum_{n:\,k(n)=k}\tfrac{1}{T}\theta_k(Tx-n).$$
In order to obtain a computational gain from caching, we would like to move the summation over $n$ into the arguments of the functions $\theta_k$. However, we need to avoid double counting associated with overlapping supports of the functions $\phi(T\cdot-n)$. Therefore we first divide all $n$’s into three series indexed by $q=n\bmod 3$, and we will then move the summation over $n$ into the argument of $\theta_k$ separately for each series. Precisely, we write
$$f_2=\sum_{q=0}^{2}f_{2,q},\quad (17)$$
where
$$f_{2,q}(x)=\sum_{k=1}^{|\Gamma|}\ \sum_{\substack{n:\,k(n)=k\\ n\equiv q\,(\mathrm{mod}\,3)}}\tfrac{1}{T}\theta_k(Tx-n).\quad (18)$$
We claim now that $f_{2,q}$ can be alternatively written as
$$f_{2,q}(x)=\sum_{k=1}^{|\Gamma|}\tfrac{1}{T}\,\theta_k\Big(\sum_{\substack{n:\,k(n)=k\\ n\equiv q\,(\mathrm{mod}\,3)}}\big(\phi(Tx-n)-\phi(Tx-n-2)\big)\Big).\quad (19)$$
To check that, suppose that $x\in I_j$ for some $j$ and consider several cases, depending on the residue of $j$ modulo 3 and on the position of $j$ relative to the series $q$; in each case the claim follows from (14), (15) and the fact that, thanks to the spacing 3 within each series, the supports of the functions $\phi(T\cdot-n)-\phi(T\cdot-n-2)$, $n\equiv q$, do not overlap.
The desired ReLU expansion of $\widetilde f$ is then given by (17) and (19), where $f_1$ and $\theta_k$ are further expanded by (7), (13):
$$\widetilde f(x)=b+\sum_{n=0}^{T-1}a_n\sigma\big(x-\tfrac{n}{T}\big)+\sum_{q=0}^{2}\sum_{k=1}^{|\Gamma|}\sum_{r=1}^{m}\tfrac{c_{k,r}}{T}\,\sigma\Big(\phi\Big(\sum_{\substack{n:\,k(n)=k\\ n\equiv q\,(\mathrm{mod}\,3)}}\big(\phi(Tx-n)-\phi(Tx-n-2)\big)\Big)-\tfrac{r-1}{m}\Big).\quad (20)$$
Step 3: Network implementation with $f$-dependent architecture.
We will now express the approximation $\widetilde f$ by a neural network (see Fig. 2). The network consists of two parallel subnetworks implementing $f_1$ and $f_2$ that include three and five layers, respectively (one and three layers if counting only hidden layers). The units of the network either have the ReLU activation function or simply perform linear combinations of inputs without any activation function. We will denote individual units in the subnetworks $f_1$ and $f_2$ by symbols $u$ and $v$, respectively, with superscripts numbering the layer and subscripts identifying the unit within the layer. The subnetworks have a common input unit carrying $x$,
and their output units are merged so as to sum $f_1$ and $f_2$, producing $\widetilde f=f_1+f_2$.
Let us first describe the network for $f_1$. By (7), we can represent $f_1$ by a 3-layer ReLU network as follows:

- The first layer contains the single input unit $u^1=x$.

- The second layer contains $T$ units $u^2_n=\sigma\big(x-\tfrac{n}{T}\big)$, where $n=0,\ldots,T-1$.

- The third layer contains a single output unit $u^3=b+\sum_{n=0}^{T-1}a_nu^2_n$.
Now we describe the network for $f_2$, based on the representation (20).

- The first layer contains the single input unit $v^1=x$.

- The second layer contains units $v^2_j=\sigma(Tx-j)$, where $j=0,\ldots,T+2$. Linear combinations of these units produce the inner sums $A_{q,k}(x)=\sum_{n:\,k(n)=k,\ n\equiv q\,(\mathrm{mod}\,3)}\big(\phi(Tx-n)-\phi(Tx-n-2)\big)$ appearing in (19).
- The third layer contains units $v^3_{q,k,i}$, where $q\in\{0,1,2\}$, $k=1,\ldots,|\Gamma|$, and $i\in\{1,2\}$ corresponds to the first or second term of the expansion $\phi(y)=\sigma(y)-\sigma(y-1)$:
$$v^3_{q,k,i}=\sigma\big(A_{q,k}-i+1\big),\quad (21)$$
where $A_{q,k}(x)=\sum_{n:\,k(n)=k,\ n\equiv q\,(\mathrm{mod}\,3)}\big(\phi(Tx-n)-\phi(Tx-n-2)\big)$ is formed as a linear combination of the second-layer units, so that $v^3_{q,k,1}-v^3_{q,k,2}=\phi(A_{q,k})$.
- The fourth layer contains units $v^4_{q,k,r}$, where $q\in\{0,1,2\}$, $k=1,\ldots,|\Gamma|$ and $r=1,\ldots,m$:
$$v^4_{q,k,r}=\sigma\big(v^3_{q,k,1}-v^3_{q,k,2}-\tfrac{r-1}{m}\big)=\sigma\big(\phi(A_{q,k})-\tfrac{r-1}{m}\big).$$
- The final layer consists of the single output unit
$$v^5=\sum_{q=0}^{2}\sum_{k=1}^{|\Gamma|}\sum_{r=1}^{m}\tfrac{c_{k,r}}{T}\,v^4_{q,k,r}.\quad (22)$$
Step 4: Embedding into the standard deep architecture.
We show now that the $f$-dependent ReLU network constructed in the previous step can be realized within the standard width-5 architecture.
Note first that we may ignore unneeded connections in the standard network (simply by assigning weight 0 to them). Also, we may assume some units to act purely linearly on their inputs, i.e., to ignore the nonlinear ReLU activation. Indeed, for any bounded set $B\subset\mathbb{R}$, if $c$ is sufficiently large, then $\sigma(x+c)=x+c$ for all $x\in B$; hence we can ensure that the ReLU activation function always works in the identity regime in a given unit by adjusting the intercept term in this unit and in the units accepting its output. In particular, we can also implement in this way identity units, i.e., those having a single input and passing the signal further without any changes.
The embedding strategy is to arrange parallel subnetworks along the “depth” of the standard architecture as shown in Fig. 2(a). Note that $f_1$ and $f_2$ are computed in parallel; moreover, computation of $f_2$ is parallelized over independent subnetworks computing the cached functions $\theta_k$ applied to the aggregated arguments from (19), indexed by $k$ and $q$ (see (19) and Fig. 2). Each of these independent subnetworks gets embedded into its own batch of width-5 layers. The top row of units in the standard architecture is only used to pass the network input $x$ to each of the subnetworks, so all the top row units function in the identity mode. The bottom row is only used to accumulate the results of the subnetworks into a single linear combination, so all the bottom units function in the linear mode.
Implementation of the $f_1$ subnetwork is shown in Fig. 2(b). It requires $O(T)$ width-5 layers of the standard architecture. The original output unit of this subnetwork gets implemented by $T$ linear units that have not more than two inputs each and gradually accumulate the required linear combination.
Implementation of a $\theta_k$ subnetwork is shown in Fig. 2(c). Its computation can be divided into two stages.
In terms of the original adaptive network, in the first stage we perform parallel computations associated with the units $v^2_j$ and combine their results in the two linear units computing the sum $\sum_{n:\,k(n)=k,\ n\equiv q}\big(\phi(Tx-n)-\phi(Tx-n-2)\big)$ and the same sum shifted by 1. By (21), each of these two units accepts inputs from the units $v^2_j$ with $j$ ranging over shifts of the indices $n$ such that $k(n)=k$ and $n\equiv q\,(\mathrm{mod}\,3)$. In the standard architecture, the two original linear units get implemented by chains of linear units that occupy two reserved lines of the network. This stage spans a number of width-5 layers proportional to $|\{n:\,k(n)=k,\ n\equiv q\}|$.
In the second stage we use the outputs of the first stage to compute the values $v^3_{q,k,i}$ and $v^4_{q,k,r}$, and accumulate the respective part of the final output (22). This stage spans $O(m)$ layers of the standard architecture.
The full implementation of one $\theta_k$ subnetwork thus spans $O\big(|\{n:\,k(n)=k,\ n\equiv q\}|+m\big)$ width-5 layers. Since $\sum_{q,k}|\{n:\,k(n)=k,\ n\equiv q\}|=T$ and there are $3|\Gamma|\le 3\cdot 3^m$ such subnetworks, implementation of the whole $f_2$ subnetwork spans $O(T+3^mm)$ width-5 layers. Implementation of the whole network then spans $O(T+3^mm)$ layers.
It remains to optimize $T$ and $m$ so as to achieve the minimum approximation error, subject to the total number of layers in the standard network being bounded by $L$. Recall that the implementation spans $O(T+3^mm)$ layers and that the approximation error of $\widetilde f$ is not greater than $\tfrac{4}{mT}$ by (6). Choosing $T=\lfloor c_1L\rfloor$ and $m=\lfloor c_2\ln L\rfloor$ with suitable absolute constants $c_1,c_2>0$, and assuming $L$ sufficiently large, we satisfy the network size constraint and achieve the desired error bound $\tfrac{c}{L\ln L}$, uniformly in $f\in F$. Since, by construction, $\widetilde f=\eta_{5,L}(\widetilde w(f))$ with some weight selection function $\widetilde w:F\to\mathbb{R}^W$, this completes the proof.
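The final choice of parameters can be sanity-checked with a back-of-envelope script; the constants below are arbitrary illustrative choices reflecting our reading of the asymptotic regime.

```python
import math

# Back-of-envelope check of the final parameter choice in our reading
# of the proof: with m ~ log L and T ~ L, the cache cost m * 3**m stays
# below the depth budget while the error 4/(m*T) is O(1/(L log L)).
# The constants 0.4 and 20 are arbitrary illustrative choices.
def choose_params(L, c2=0.4):
    m = max(1, int(c2 * math.log(L)))
    T = L   # up to a constant factor absorbed into c
    return T, m

for L in (10**3, 10**4, 10**5):
    T, m = choose_params(L)
    assert m * 3**m <= L                              # cache fits into the depth budget
    assert 4.0 / (m * T) <= 20.0 / (L * math.log(L))  # error is O(1/(L log L))
```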
References
 [1] Ronald A DeVore, Ralph Howard, and Charles Micchelli. Optimal nonlinear approximation. Manuscripta mathematica, 63(4):469–478, 1989.
 [2] Paul C Kainen, Věra Kůrková, and Andrew Vogt. Approximation by neural networks is not continuous. Neurocomputing, 29(1):47–56, 1999.
 [3] Claude Shannon. The synthesis of two-terminal switching circuits. Bell Labs Technical Journal, 28(1):59–98, 1949.
 [4] Dmitry Yarotsky. Error bounds for approximations with deep ReLU networks. arXiv preprint arXiv:1610.01145v3, 2016. Submitted to Neural Networks.