# Quantified advantage of discontinuous weight selection in approximations with deep neural networks

We consider approximations of 1D Lipschitz functions by deep ReLU networks of a fixed width. We prove that without the assumption of continuous weight selection the uniform approximation error is lower than with this assumption at least by a factor logarithmic in the size of the network.


## 1 Introduction

In the study of parametrized nonlinear approximations, the framework of continuous nonlinear widths of DeVore et al.  is a general approach based on the assumption that approximation parameters depend continuously on the approximated object. Under this assumption, the framework provides lower error bounds for approximations in several important functional spaces, in particular Sobolev spaces.

In the present paper we will only consider the simplest Sobolev space $W^{1,\infty}([0,1])$, i.e. the space of Lipschitz functions on the segment $[0,1]$. The norm in this space can be defined by $\|f\|_{W^{1,\infty}([0,1])}=\max(\|f\|_\infty,\|f'\|_\infty)$, where $f'$ is the weak derivative of $f$, and $\|\cdot\|_\infty$ is the norm in $L^\infty([0,1])$. We will denote by $B$ the unit ball in $W^{1,\infty}([0,1])$. Then the following result contained in Theorem 4.2 of  establishes an asymptotic lower bound on the accuracy of approximating functions from $B$ under the assumption of continuous parameter selection:

###### Theorem 1.

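Let $\nu$ be a positive integer and $\eta:\mathbb{R}^\nu\to C([0,1])$ be any map between the space $\mathbb{R}^\nu$ and the space $C([0,1])$. Suppose that there is a continuous map $\psi:B\to\mathbb{R}^\nu$ such that $\|f-\eta(\psi(f))\|_\infty\le\epsilon$ for all $f\in B$. Then $\epsilon\ge\frac{c}{\nu}$ with some absolute constant $c$.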

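The bound $\epsilon\ge\frac{c}{\nu}$ stated in the theorem is attained, for example, by the usual piecewise-linear interpolation with uniformly distributed breakpoints. Namely, assuming $\nu\ge 2$, define $\psi(f)=\big(f(\tfrac{k}{\nu-1})\big)_{k=0}^{\nu-1}$ and define $\eta(y_0,\ldots,y_{\nu-1})$ to be the continuous piecewise-linear function with the nodes $(\tfrac{k}{\nu-1},y_k)$, $k=0,\ldots,\nu-1$. Then it is easy to see that $\|f-\eta(\psi(f))\|_\infty\le\frac{1}{\nu-1}$ for all $f\in B$.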

The key assumption of Theorem 1 that $\psi$ is continuous need not hold in applications, in particular for approximations with neural networks. The most common practical task in this case is to optimize the weights of the network so as to obtain the best approximation of a specific function, without any regard for other possible functions. Theoretically, Kainen et al. prove that already for networks with a single hidden layer the optimal weight selection is discontinuous in general. The goal of the present paper is to quantify the accuracy gain that discontinuous weight selection brings to a deep feed-forward neural network model in comparison to the baseline bound of Theorem 1.

Figure 1: The standard network architecture of depth $N=9$ and width $M=5$ (i.e., containing 9 fully-connected hidden layers, 5 neurons each).

Specifically, we consider a common deep architecture shown in Fig. 1 that we will refer to as standard. The network consists of one input unit, one output unit, and $N$ fully-connected hidden layers, each including a constant number $M$ of units. We refer to $N$ as the depth of the network and to $M$ as its width. The layers are fully connected in the sense that each unit is connected with all units of the previous and the next layer. Only neighboring layers are connected. A hidden unit computes the map

$$(x_1,\ldots,x_K)\mapsto\sigma\Big(\sum_{k=1}^{K}w_kx_k+h\Big),$$

where $x_1,\ldots,x_K$ are the inputs, $\sigma$ is a nonlinear activation function, and $(w_k)_{k=1}^{K}$ and $h$ are the weights associated with this computation unit. For the first hidden layer $K=1$, otherwise $K=M$. The network output unit is assumed to compute a linear map:

$$(x_1,\ldots,x_M)\mapsto\sum_{k=1}^{M}w_kx_k+h.$$

The total number of weights in the network then equals

$$\nu_{M,N}=M^2(N-1)+M(N+2)+1.$$
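The weight count can be double-checked by adding up the layers one at a time. The following is a minimal Python sketch; the function names `count_weights` and `nu` are hypothetical and introduced only for this illustration.

```python
def count_weights(M: int, N: int) -> int:
    """Count weights of the standard network layer by layer (width M, depth N)."""
    first_hidden = M * 1 + M              # K = 1: one input weight and one intercept per unit
    other_hidden = (N - 1) * (M * M + M)  # K = M: M input weights and one intercept per unit
    output = M + 1                        # linear output unit
    return first_hidden + other_hidden + output


def nu(M: int, N: int) -> int:
    """Closed-form expression for the total number of weights."""
    return M**2 * (N - 1) + M * (N + 2) + 1


# The two counts agree on a range of small architectures.
assert all(count_weights(M, N) == nu(M, N)
           for M in range(1, 8) for N in range(1, 12))

print(nu(5, 9))  # 256 weights for the network of Fig. 1 (N = 9, M = 5)
```

For fixed width the count grows linearly in the depth; for $M=5$ it equals $30N-14$.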

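We will take the activation function $\sigma$ to be the ReLU function (Rectified Linear Unit), which is a popular choice in applications:

$$\sigma(x)=\max(0,x).\tag{1}$$

With this activation function, the functions implemented by the whole network are continuous and piecewise linear, in particular they belong to $C([0,1])$. Let $\eta_{M,N}:\mathbb{R}^{\nu_{M,N}}\to C([0,1])$ map the weights of the width-$M$ depth-$N$ network to the respective implemented function. Consider uniform approximation rates for the ball $B$, with or without the weight selection continuity assumption:

$$\begin{aligned}
d_{\mathrm{cont}}(M,N)&=\inf_{\substack{\psi:B\to\mathbb{R}^{\nu_{M,N}}\\ \psi\text{ continuous}}}\ \sup_{f\in B}\|f-\eta_{M,N}(\psi(f))\|_\infty,\\
d_{\mathrm{all}}(M,N)&=\inf_{\psi:B\to\mathbb{R}^{\nu_{M,N}}}\ \sup_{f\in B}\|f-\eta_{M,N}(\psi(f))\|_\infty.
\end{aligned}$$

Theorem 1 implies that $d_{\mathrm{cont}}(M,N)\ge\frac{c}{\nu_{M,N}}$.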

We will be interested in the scenario where the width $M$ is fixed and the depth $N$ is varied. The main result of our paper is an upper bound for $d_{\mathrm{all}}(M,N)$ that we will prove in the case $M=5$ (but the result obviously also holds for larger $M$).

###### Theorem 2.

$$d_{\mathrm{all}}(5,N)\le\frac{c}{N\ln N},$$

with some absolute constant $c$.

Comparing this result with Theorem 1, we see that without the assumption of continuous weight selection the approximation error is lower than with this assumption at least by a factor logarithmic in the size of the network:

###### Corollary 1.

$$\frac{d_{\mathrm{all}}(5,N)}{d_{\mathrm{cont}}(5,N)}\le\frac{c}{\ln N},$$

with some absolute constant $c$.
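For the reader's convenience, here is a short worked derivation of how the corollary follows from the two theorems. The names $c_1$ and $c_2$ for the constants of Theorems 1 and 2 are introduced here only for illustration, and the argument assumes $N\ge 2$ so that $\ln N>0$. With $M=5$ the weight count is $25(N-1)+5(N+2)+1=30N-14\le 30N$, so

$$d_{\mathrm{cont}}(5,N)\ \ge\ \frac{c_1}{30N-14}\ \ge\ \frac{c_1}{30N},
\qquad
\frac{d_{\mathrm{all}}(5,N)}{d_{\mathrm{cont}}(5,N)}\ \le\ \frac{c_2}{N\ln N}\cdot\frac{30N}{c_1}\ =\ \frac{30\,c_2}{c_1}\cdot\frac{1}{\ln N}.$$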

The remainder of the paper is the proof of Theorem 2. The proof relies on a construction of adaptive network architectures with “cached” elements from . We use a somewhat different, streamlined version of this construction: whereas in  it was first performed for a discontinuous activation function and then extended to ReLU by some approximation, in the present paper we do it directly, without invoking auxiliary activation functions.

## 2 Proof of Theorem 2

It is convenient to divide the proof into four steps. In the first step we construct for any function $f\in B$ an approximation $\tilde f$ using cached functions. In the second step we expand the constructed approximation in terms of the ReLU function (1) and linear operations. In the third step we implement $\tilde f$ by a highly parallel shallow network with an $f$-dependent (“adaptive”) architecture. In the fourth step we show that this adaptive architecture can be embedded in the standard deep architecture of width 5.

### Step 1: Cache-based approximation.

We first explain the idea of the construction. We start with interpolating $f$ by a piecewise-linear function with uniformly distributed nodes. After that, we create a “cache” of auxiliary functions that will be used to refine the approximation in the intervals between the nodes. The key point is that one cached function can be used in many intervals, which will eventually lead to a saving in the error-complexity relation. The assignment of cached functions to the intervals will be $f$-dependent and encoded in the network architecture in Step 3.

The idea of reused subnetworks dates back to the proof of the upper bound for the complexity of Boolean circuits implementing $n$-ary functions.

We now start the detailed exposition. Given $f\in B$, we will construct an approximation $\tilde f$ to $f$ in the form

$$\tilde f=\tilde f_1+\tilde f_2.$$

Here, $\tilde f_1$ is the piecewise linear interpolation of $f$ with the breakpoints $\{\tfrac tT\}_{t=0}^{T}$:

$$\tilde f_1\big(\tfrac tT\big)=f\big(\tfrac tT\big),\quad t=0,1,\ldots,T.$$

The value of $T$ will be chosen later.

Since $f$ is Lipschitz with constant 1, $\tilde f_1$ is also Lipschitz with constant 1. We denote by $I_t$ the intervals between the breakpoints:

$$I_t=\big[\tfrac tT,\tfrac{t+1}T\big),\quad t=0,\ldots,T-1.$$

We will now construct $\tilde f_2$ as an approximation to the difference

$$f_2=f-\tilde f_1.\tag{2}$$

Note that $f_2$ vanishes at the endpoints of the intervals $I_t$:

$$f_2\big(\tfrac tT\big)=0,\quad t=0,\ldots,T,\tag{3}$$

and $f_2$ is Lipschitz with constant 2:

$$|f_2(x_1)-f_2(x_2)|\le 2|x_1-x_2|,\tag{4}$$

since $f$ and $\tilde f_1$ are Lipschitz with constant 1.

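To define $\tilde f_2$, we first construct a set $\Gamma$ of cached functions. Let $m$ be a positive integer to be chosen later. Let $\Gamma$ be the set of piecewise linear functions $\gamma:[0,1]\to\mathbb{R}$ with the breakpoints $\{\tfrac rm\}_{r=0}^{m}$ and the properties

$$\gamma(0)=\gamma(1)=0$$

and

$$\gamma\big(\tfrac rm\big)-\gamma\big(\tfrac{r-1}m\big)\in\big\{-\tfrac 2m,0,\tfrac 2m\big\},\quad r=1,\ldots,m.$$

Note that the size $|\Gamma|$ of $\Gamma$ is not larger than $3^m$.

If $g:[0,1]\to\mathbb{R}$ is any Lipschitz function with constant 2 and $g(0)=g(1)=0$, then $g$ can be approximated by some $\gamma\in\Gamma$ with error not larger than $\tfrac 2m$: namely, take $\gamma(\tfrac rm)=\tfrac 2m\lfloor g(\tfrac rm)/\tfrac 2m\rfloor$ (here $\lfloor\cdot\rfloor$ is the floor function).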
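The floor rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `cache_levels` and `gamma_from_levels`, the test function `g`, and the small tolerance `eps` (which guards the floor against floating-point round-off at the endpoints) are all hypothetical choices, not part of the original construction.

```python
import math


def cache_levels(g, m, eps=1e-9):
    """Integer levels k_r such that gamma(r/m) = (2/m) * k_r, via the floor rule."""
    return [math.floor(g(r / m) * m / 2 + eps) for r in range(m + 1)]


def gamma_from_levels(levels, m):
    """Piecewise-linear function on [0, 1] with nodes (r/m, 2 * levels[r] / m)."""
    def gamma(y):
        r = min(int(y * m), m - 1)   # index of the piece containing y
        s = y * m - r                # relative position inside the piece
        return (2 / m) * ((1 - s) * levels[r] + s * levels[r + 1])
    return gamma


def g(y):
    """Hypothetical test function: Lipschitz with constant 2, g(0) = g(1) = 0."""
    return 2 * y * (1 - y)


m = 8
levels = cache_levels(g, m)
gamma = gamma_from_levels(levels, m)

# Increments between neighbouring nodes, in units of 2/m: each must be -1, 0 or 1.
steps = [levels[r] - levels[r - 1] for r in range(1, m + 1)]
# Observed uniform error on a fine grid.
err = max(abs(g(k / 1000) - gamma(k / 1000)) for k in range(1001))

print(levels, steps, err)
```

Storing the integer levels rather than the node values makes it easy to see that the result belongs to $\Gamma$, and it gives a hashable key for identifying a cached function.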

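Moreover, if $f_2$ is defined by (2), then, using (3), (4), on each interval $I_t$ the function $f_2$ can be approximated with error not larger than $\tfrac{2}{Tm}$ by a properly rescaled function $\gamma\in\Gamma$. Namely, for each $t$ we can define the function $g$ by $g(y)=Tf_2(\tfrac{t+y}T)$. Then it is Lipschitz with constant 2 and $g(0)=g(1)=0$, so we can find $\gamma_t\in\Gamma$ such that

$$\sup_{y\in[0,1)}\Big|Tf_2\big(\tfrac{t+y}T\big)-\gamma_t(y)\Big|\le\frac 2m.$$

This can be equivalently written as

$$\sup_{x\in I_t}\Big|f_2(x)-\frac 1T\gamma_t(Tx-t)\Big|\le\frac{2}{Tm}.$$

Note that the obtained assignment $t\mapsto\gamma_t$ is not injective, in general ($T$ will be larger than $|\Gamma|$).

We can then define $\tilde f_2$ on the whole interval $[0,1)$ by

$$\tilde f_2(x)=\frac 1T\gamma_t(Tx-t),\quad x\in I_t,\quad t=0,\ldots,T-1.\tag{5}$$

This $\tilde f_2$ approximates $f_2$ with error $\tfrac{2}{Tm}$ on $[0,1)$:

$$\sup_{x\in[0,1)}|f_2(x)-\tilde f_2(x)|\le\frac{2}{Tm},$$

and hence, by (2), for the full approximation $\tilde f=\tilde f_1+\tilde f_2$ we will also have

$$\sup_{x\in[0,1)}|f(x)-\tilde f(x)|\le\frac{2}{Tm}.\tag{6}$$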
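Step 1 as a whole can be sketched as follows. This is a minimal Python illustration, not a definitive implementation: the function name `build_approximation`, the example function `f`, the values of `T` and `m`, and the tolerance `eps` are assumptions made for the example, and the floor rule is the one described above for choosing $\gamma_t$. On a fine grid the observed error should be of the order $\tfrac{1}{Tm}$, in line with (6) up to the constant.

```python
import math


def build_approximation(f, T, m, eps=1e-9):
    """Return (f_tilde, assignment); assignment[t] is the tuple of integer
    levels identifying the cached function gamma_t used on I_t."""
    nodes = [f(t / T) for t in range(T + 1)]

    def f1(x):
        # piecewise-linear interpolation of f at the breakpoints t/T
        t = min(int(x * T), T - 1)
        return nodes[t] + (x * T - t) * (nodes[t + 1] - nodes[t])

    assignment = []
    for t in range(T):
        # g(y) = T * f2((t + y) / T), quantised at y = r/m by the floor rule
        k = [math.floor(T * (f((t + r / m) / T) - f1((t + r / m) / T)) * m / 2 + eps)
             for r in range(m + 1)]
        k[0] = k[m] = 0          # f2 vanishes at the breakpoints, see (3)
        assignment.append(tuple(k))

    def f_tilde(x):
        t = min(int(x * T), T - 1)
        y = x * T - t
        r = min(int(y * m), m - 1)
        k = assignment[t]
        gamma = (2 / m) * (k[r] + (y * m - r) * (k[r + 1] - k[r]))
        return f1(x) + gamma / T   # formula (5) added to the interpolant

    return f_tilde, assignment


def f(x):
    """Hypothetical example from the unit ball B: |f| <= 1/12, |f'| <= 1."""
    return math.sin(12 * x) / 12


T, m = 4, 8
f_tilde, assignment = build_approximation(f, T, m)
err = max(abs(f(k / 4000) - f_tilde(k / 4000)) for k in range(4000))
print(err, len(set(assignment)), "distinct cached functions for", T, "intervals")
```

The assignment list makes the caching visible: whenever two intervals receive the same tuple of levels, they share one cached function.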

### Step 2: ReLU-expansion of $\tilde f$.

We now express the constructed approximation $\tilde f=\tilde f_1+\tilde f_2$ in terms of linear and ReLU operations.

Let us first describe the expansion of $\tilde f_1$. Since $\tilde f_1$ is a continuous piecewise-linear interpolation of $f$ with the breakpoints $\{\tfrac tT\}_{t=0}^{T}$, we can represent it on the segment $[0,1]$ in terms of the ReLU activation function as

$$\tilde f_1(x)=\sum_{t=0}^{T-1}w_t\,\sigma\big(x-\tfrac tT\big)+h,\tag{7}$$

with some weights $(w_t)_{t=0}^{T-1}$ and $h$.

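Now we turn to $\tilde f_2$, as given by (5). Consider the “tooth” function

$$\phi(x)=\begin{cases}0,&x\notin[-1,1],\\ 1-|x|,&x\in[-1,1].\end{cases}\tag{8}$$

Note in particular that for any $t$

$$\phi(Tx-t)=0\quad\text{if }x\notin I_{t-1}\cup I_t.\tag{9}$$

The function $\phi$ can be written as

$$\phi(x)=\sum_{q=-1}^{1}\alpha_q\,\sigma(x-q),\tag{10}$$

where

$$\alpha_{-1}=\alpha_1=1,\quad\alpha_0=-2.$$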
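The identity (10) is easy to confirm numerically. A minimal Python sketch (the names `tooth`, `tooth_relu` and `ALPHA` are hypothetical) compares the definition (8) with the three-term ReLU expansion on a grid:

```python
def relu(x):
    return max(0.0, x)


def tooth(x):
    """The tooth function, written directly from definition (8)."""
    return 1.0 - abs(x) if -1.0 <= x <= 1.0 else 0.0


ALPHA = {-1: 1.0, 0: -2.0, 1: 1.0}   # coefficients alpha_q of (10)


def tooth_relu(x):
    """The same function as a combination of three shifted ReLUs, see (10)."""
    return sum(a * relu(x - q) for q, a in ALPHA.items())


# Largest discrepancy on a grid covering [-3, 3].
max_dev = max(abs(tooth(k / 100) - tooth_relu(k / 100)) for k in range(-300, 301))
print(max_dev)
```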

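Let us expand each $\gamma\in\Gamma$ over the basis of shifted ReLU functions:

$$\gamma(x)=\sum_{r=0}^{m-1}c_{\gamma,r}\,\sigma\big(x-\tfrac rm\big),\quad x\in[0,1],\tag{11}$$

with some coefficients $c_{\gamma,r}$. There is no constant term because $\gamma(0)=0$. Since $\gamma(1)=0$, we also have

$$\sum_{r=0}^{m-1}c_{\gamma,r}\,\sigma\big(1-\tfrac rm\big)=0.\tag{12}$$

Consider the functions $\theta_\gamma:\mathbb{R}^2\to\mathbb{R}$ defined by

$$\theta_\gamma(a,b)=\sum_{r=0}^{m-1}c_{\gamma,r}\,\sigma\Big(\frac{m-r}{m}a-\frac rm b\Big).\tag{13}$$

###### Lemma 1.

$$\begin{aligned}
\theta_\gamma(a,0)&=0\quad\text{for all }a\ge 0, &&(14)\\
\theta_\gamma(0,b)&=0\quad\text{for all }b\ge 0, &&(15)\\
\theta_\gamma(\phi(Tx-t-1),\phi(Tx-t))&=\begin{cases}\gamma(Tx-t),&x\in I_t,\\ 0,&x\notin I_t,\end{cases} &&(16)
\end{aligned}$$

for all $x\in[0,1)$ and $t=0,\ldots,T-1$.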

###### Proof.

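Property (14) follows from (12) using positive homogeneity of $\sigma$. Property (15) follows since $\sigma(x)=0$ for $x\le 0$.

To establish (16), consider the two cases:

1. If $x\notin I_t$ then at least one of the arguments of $\theta_\gamma$ vanishes by (9), and hence the l.h.s. vanishes by (14), (15).

2. If $x\in I_t$ then $\phi(Tx-t-1)=Tx-t$ and $\phi(Tx-t)=1-(Tx-t)$, so

$$\frac{m-r}{m}\phi(Tx-t-1)-\frac rm\phi(Tx-t)=Tx-t-\frac rm$$

and hence, by (13) and (11),

$$\theta_\gamma(\phi(Tx-t-1),\phi(Tx-t))=\sum_{r=0}^{m-1}c_{\gamma,r}\,\sigma\big(Tx-t-\tfrac rm\big)=\gamma(Tx-t).$$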
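The three properties of Lemma 1 can also be checked numerically for a concrete cached function. The sketch below is only an illustration: the level vector `[0, 1, 1, 0, 0]` (so $m=4$), the values `T = 3`, `t = 1`, and the helper names are hypothetical choices. The coefficients $c_{\gamma,r}$ are obtained as successive differences of the slopes of $\gamma$, which is the standard way to write a piecewise-linear function in the form (11).

```python
def relu(x):
    return max(0.0, x)


def tooth(x):
    """The tooth function (8)."""
    return max(0.0, 1.0 - abs(x))


def relu_coeffs(levels):
    """Coefficients c_r of (11) for gamma with node values (2/m) * levels[r]."""
    m = len(levels) - 1
    slopes = [2.0 * (levels[r + 1] - levels[r]) for r in range(m)]
    return [slopes[0]] + [slopes[r] - slopes[r - 1] for r in range(1, m)]


def gamma(c, y):
    """Expansion (11), valid for y in [0, 1]."""
    m = len(c)
    return sum(c[r] * relu(y - r / m) for r in range(m))


def theta(c, a, b):
    """Definition (13)."""
    m = len(c)
    return sum(c[r] * relu((m - r) / m * a - r / m * b) for r in range(m))


c = relu_coeffs([0, 1, 1, 0, 0])   # hypothetical cached function, m = 4
T, t = 3, 1

max_dev = 0.0
for k in range(3000):
    x = k / 3000
    y = T * x - t
    expected = gamma(c, y) if 0 <= y < 1 else 0.0       # right-hand side of (16)
    got = theta(c, tooth(y - 1), tooth(y))              # left-hand side of (16)
    max_dev = max(max_dev, abs(got - expected))
    a = 2 * x                                           # a nonnegative argument
    max_dev = max(max_dev, abs(theta(c, a, 0.0)), abs(theta(c, 0.0, a)))  # (14), (15)

print(max_dev)
```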

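Using definition (5) of $\tilde f_2$ and representation (16), we can then write, for any $x\in[0,1)$,

$$\begin{aligned}
\tilde f_2(x)&=\frac 1T\sum_{t=0}^{T-1}\theta_{\gamma_t}(\phi(Tx-t-1),\phi(Tx-t))\\
&=\frac 1T\sum_{\gamma\in\Gamma}\ \sum_{t:\gamma_t=\gamma}\theta_\gamma(\phi(Tx-t-1),\phi(Tx-t)).
\end{aligned}$$

In order to obtain computational gain from caching, we would like to move summation over $t$ into the arguments of the function $\theta_\gamma$. However, we need to avoid double counting associated with overlapping supports of the functions $x\mapsto\phi(Tx-t)$. Therefore we first divide all $t$'s into three series indexed by $i\in\{0,1,2\}$, and we will then move summation over $t$ into the argument of $\theta_\gamma$ separately for each series. Precisely, we write

$$\tilde f_2(x)=\frac 1T\sum_{\gamma\in\Gamma}\sum_{i=0}^{2}\tilde f_{2,\gamma,i}(x),\tag{17}$$

where

$$\tilde f_{2,\gamma,i}(x)=\sum_{\substack{t:\gamma_t=\gamma\\ t\equiv i\ (\mathrm{mod}\ 3)}}\theta_\gamma(\phi(Tx-t-1),\phi(Tx-t)).\tag{18}$$

We claim now that $\tilde f_{2,\gamma,i}$ can be alternatively written as

$$\tilde f_{2,\gamma,i}(x)=\theta_\gamma\Big(\sum_{\substack{t:\gamma_t=\gamma\\ t\equiv i\ (\mathrm{mod}\ 3)}}\phi(Tx-t-1),\ \sum_{\substack{t:\gamma_t=\gamma\\ t\equiv i\ (\mathrm{mod}\ 3)}}\phi(Tx-t)\Big).\tag{19}$$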

To check that, suppose that $x\in I_{t_0}$ for some $t_0$ and consider several cases:

1. Let $t_0\not\equiv i\ (\mathrm{mod}\ 3)$, then both expressions (18) and (19) vanish. Indeed, at least one of the sums forming the arguments of $\theta_\gamma$ in the r.h.s. of (19) vanishes by (9), so $\theta_\gamma$ vanishes by (14), (15). Similar reasoning shows that the r.h.s. of (18) vanishes too.

2. Let $t_0\equiv i\ (\mathrm{mod}\ 3)$ and $\gamma_{t_0}\ne\gamma$. Then all terms in the sums over $t$ in (18) and (19) vanish, so both (18) and (19) vanish.

3. Let $t_0\equiv i\ (\mathrm{mod}\ 3)$ and $\gamma_{t_0}=\gamma$. Then all terms in the sums over $t$ in (18) and (19) vanish except for the term $t=t_0$, so both (18) and (19) are equal to $\theta_\gamma(\phi(Tx-t_0-1),\phi(Tx-t_0))$.
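The equality of (18) and (19), and the need for the split into three series, can be illustrated numerically. In the Python sketch below everything concrete is a hypothetical choice made for the example: the coefficients `c` of one cached function with $m=4$, the value `T = 9`, and the list `uses_gamma` marking the intervals to which this cached function is assigned. The last function moves the sum inside $\theta_\gamma$ without the mod-3 split, which is not part of the construction and is included only to show that it gives a different result.

```python
def relu(x):
    return max(0.0, x)


def tooth(x):
    return max(0.0, 1.0 - abs(x))


def theta(c, a, b):
    """Definition (13)."""
    m = len(c)
    return sum(c[r] * relu((m - r) / m * a - r / m * b) for r in range(m))


c = [2.0, -2.0, -2.0, 2.0]   # one cached gamma with m = 4; satisfies (12)
T = 9
# Hypothetical assignment: uses_gamma[t] is True when gamma_t equals this gamma.
uses_gamma = [True, True, False, True, True, True, False, True, True]


def series(i):
    return [t for t in range(T) if uses_gamma[t] and t % 3 == i]


def sum_outside(x, i):
    """Right-hand side of (18): sum of theta over the series."""
    return sum(theta(c, tooth(T * x - t - 1), tooth(T * x - t)) for t in series(i))


def sum_inside(x, i):
    """Right-hand side of (19): theta of the summed arguments."""
    S = series(i)
    return theta(c, sum(tooth(T * x - t - 1) for t in S),
                 sum(tooth(T * x - t) for t in S))


def sum_inside_no_split(x):
    """Sum moved inside theta over ALL t at once (double counting)."""
    S = [t for t in range(T) if uses_gamma[t]]
    return theta(c, sum(tooth(T * x - t - 1) for t in S),
                 sum(tooth(T * x - t) for t in S))


grid = [k / 1800 for k in range(1800)]
split_gap = max(abs(sum_outside(x, i) - sum_inside(x, i)) for x in grid for i in range(3))
naive_gap = max(abs(sum(sum_outside(x, i) for i in range(3)) - sum_inside_no_split(x))
                for x in grid)
print(split_gap, naive_gap)
```

With the split, the two expressions agree up to round-off; without it, neighbouring intervals that share the same cached function interfere through the overlapping supports of the tooth functions.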

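The desired ReLU expansion of $\tilde f_2$ is then given by (17) and (19), where $\phi$ and $\theta_\gamma$ are further expanded by (10), (13):

$$\begin{aligned}
\tilde f_2(x)&=\frac 1T\sum_{\gamma\in\Gamma}\sum_{i=0}^{2}\tilde f_{2,\gamma,i}(x)\\
&=\frac 1T\sum_{\gamma\in\Gamma}\sum_{i=0}^{2}\sum_{r=0}^{m-1}c_{\gamma,r}\,\sigma\Big(\frac{m-r}{m}\sum_{\substack{t:\gamma_t=\gamma\\ t\equiv i\ (\mathrm{mod}\ 3)}}\ \sum_{q=-1}^{1}\alpha_q\,\sigma(Tx-t-q-1)\\
&\qquad\qquad\qquad\qquad\qquad-\frac rm\sum_{\substack{t:\gamma_t=\gamma\\ t\equiv i\ (\mathrm{mod}\ 3)}}\ \sum_{q=-1}^{1}\alpha_q\,\sigma(Tx-t-q)\Big).
\end{aligned}\tag{20}$$

Figure 2: Implementation of the function $\tilde f=\tilde f_1+\tilde f_2$ by a neural network with $f$-dependent architecture. The computation units $Q^{(2)}_{t,j,q}$, $Q^{(3)}_{\gamma,i,j}$ and $Q^{(4)}_{\gamma,i,r}$ are depicted in their layers in the lexicographic order (later indices change faster). Dimensions of the respective index arrays are indicated on the right (the example shown is for $T=4$, $m=3$ and $|\Gamma|=2$). Computation of $\tilde f_2$ includes $3|\Gamma|$ parallel computations of functions $\tilde f_{2,\gamma,i}$; the corresponding subnetworks are formed by units and connections connected to $Q^{(3)}_{\gamma,i,0}$ and $Q^{(3)}_{\gamma,i,1}$ with specific $\gamma,i$.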

### Step 3: Network implementation with f-dependent architecture.

We will now express the approximation $\tilde f$ by a neural network (see Fig. 2). The network consists of two parallel subnetworks implementing $\tilde f_1$ and $\tilde f_2$ that include three and five layers, respectively (one and three layers if counting only hidden layers). The units of the network either have the ReLU activation function or simply perform linear combinations of inputs without any activation function. We will denote individual units in the subnetworks $\tilde f_1$ and $\tilde f_2$ by symbols $P$ and $Q$, respectively, with superscripts numbering the layer and subscripts identifying the unit within the layer. The subnetworks have a common input unit:

$$R=P^{(1)}=Q^{(1)}$$

and their output units are merged so as to sum $\tilde f_1$ and $\tilde f_2$:

$$Z=P^{(3)}+Q^{(5)}.$$

Let us first describe the network for $\tilde f_1$. By (7), we can represent $\tilde f_1$ by a 3-layer ReLU network as follows:

1. The first layer contains the single input unit $P^{(1)}$.

2. The second layer contains $T$ units $P^{(2)}_t=\sigma\big(P^{(1)}-\tfrac tT\big)$, where $t\in\{0,\ldots,T-1\}$.

3. The third layer contains a single output unit $P^{(3)}=\sum_{t=0}^{T-1}w_tP^{(2)}_t+h$.

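Now we describe the network for $\tilde f_2$ based on the representation (20).

1. The first layer contains the single input unit $Q^{(1)}$.

2. The second layer contains units $Q^{(2)}_{t,j,q}$, where $t\in\{0,\ldots,T-1\}$, $q\in\{-1,0,1\}$, and $j\in\{0,1\}$ corresponds to the first or second argument of the function $\theta_\gamma$:

$$Q^{(2)}_{t,j,q}=\sigma\big(TQ^{(1)}-t-q-j\big).$$

3. The third layer contains units $Q^{(3)}_{\gamma,i,j}$, where $\gamma\in\Gamma$, $i\in\{0,1,2\}$, and $j\in\{0,1\}$:

$$Q^{(3)}_{\gamma,i,j}=\sum_{\substack{t:\gamma_t=\gamma\\ t\equiv i\ (\mathrm{mod}\ 3)}}\ \sum_{q=-1}^{1}\alpha_q\,Q^{(2)}_{t,j,q}.\tag{21}$$

4. The fourth layer contains units $Q^{(4)}_{\gamma,i,r}$, where $\gamma\in\Gamma$, $i\in\{0,1,2\}$ and $r\in\{0,\ldots,m-1\}$:

$$Q^{(4)}_{\gamma,i,r}=\sigma\Big(\frac{m-r}{m}Q^{(3)}_{\gamma,i,1}-\frac rm Q^{(3)}_{\gamma,i,0}\Big).$$

5. The final layer consists of the single output unit

$$Q^{(5)}=\frac 1T\sum_{\gamma\in\Gamma}\sum_{i=0}^{2}\sum_{r=0}^{m-1}c_{\gamma,r}\,Q^{(4)}_{\gamma,i,r}.\tag{22}$$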
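The layer-by-layer description above can be turned into a small forward pass and compared with the direct formula $\tilde f=\tilde f_1+\tilde f_2$ from Step 1. The Python sketch below is an illustration under stated assumptions: the function names, the example `f`, the values `T = 6`, `m = 4` and the tolerance `eps` are hypothetical; the weights $w_t$ of (7) and the coefficients $c_{\gamma,r}$ of (11) are computed as successive slope differences; and the loop runs only over cached functions that are actually assigned, since the remaining $\gamma\in\Gamma$ contribute zero.

```python
import math


def relu(x):
    return max(0.0, x)


def step1(f, T, m, eps=1e-9):
    """Breakpoint values of f and, per interval, the integer levels of gamma_t."""
    nodes = [f(t / T) for t in range(T + 1)]
    assignment = []
    for t in range(T):
        k = []
        for r in range(m + 1):
            y = r / m
            f1 = nodes[t] + y * (nodes[t + 1] - nodes[t])
            k.append(math.floor(T * (f((t + y) / T) - f1) * m / 2 + eps))
        k[0] = k[m] = 0                      # f2 vanishes at t/T, see (3)
        assignment.append(tuple(k))
    return nodes, assignment


def direct(x, nodes, assignment, T, m):
    """f1(x) + (1/T) * gamma_t(Tx - t), evaluated directly from (5)."""
    t = min(int(x * T), T - 1)
    y = x * T - t
    r = min(int(y * m), m - 1)
    k = assignment[t]
    f1 = nodes[t] + y * (nodes[t + 1] - nodes[t])
    return f1 + (2 / m) * (k[r] + (y * m - r) * (k[r + 1] - k[r])) / T


def network(x, nodes, assignment, T, m):
    """Forward pass through the units P and Q of the adaptive network."""
    # Subnetwork P, by (7): w_t are the slope changes of the interpolant.
    slopes = [T * (nodes[t + 1] - nodes[t]) for t in range(T)]
    w = [slopes[0]] + [slopes[t] - slopes[t - 1] for t in range(1, T)]
    P2 = [relu(x - t / T) for t in range(T)]
    P3 = sum(w[t] * P2[t] for t in range(T)) + nodes[0]
    # Subnetwork Q, by (20)-(22).
    alpha = {-1: 1.0, 0: -2.0, 1: 1.0}
    Q2 = {(t, j, q): relu(T * x - t - q - j)
          for t in range(T) for j in (0, 1) for q in (-1, 0, 1)}
    Q5 = 0.0
    for k in set(assignment):                # unused gammas contribute zero
        sl = [2.0 * (k[r + 1] - k[r]) for r in range(m)]
        c = [sl[0]] + [sl[r] - sl[r - 1] for r in range(1, m)]
        for i in range(3):
            S = [t for t in range(T) if assignment[t] == k and t % 3 == i]
            Q3 = [sum(alpha[q] * Q2[t, j, q] for t in S for q in (-1, 0, 1))
                  for j in (0, 1)]
            Q4 = [relu((m - r) / m * Q3[1] - r / m * Q3[0]) for r in range(m)]
            Q5 += sum(c[r] * Q4[r] for r in range(m)) / T
    return P3 + Q5                           # merged output unit Z


def f(x):
    """Hypothetical example from the unit ball B."""
    return math.sin(12 * x) / 12


T, m = 6, 4
nodes, assignment = step1(f, T, m)
gap = max(abs(network(k / 3000, nodes, assignment, T, m)
              - direct(k / 3000, nodes, assignment, T, m)) for k in range(3000))
print(gap)
```

The two evaluations should agree up to floating-point round-off, which is a direct numerical check of the expansion (20).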

### Step 4: Embedding into the standard deep architecture.

We show now that the $f$-dependent ReLU network constructed in the previous step can be realized within the standard width-5 architecture.

Note first that we may ignore unneeded connections in the standard network (simply by assigning weight 0 to them). Also, we may assume some units to act purely linearly on their inputs, i.e., ignore the nonlinear ReLU activation. Indeed, for any bounded set $D\subset\mathbb{R}$, if $h$ is sufficiently large, then $\sigma(x+h)=x+h$ for all $x\in D$, hence we can ensure that the ReLU activation function always works in the identity regime in a given unit by adjusting the intercept term in this unit and in the units accepting its output. In particular, we can also implement in this way identity units, i.e. those having a single input and passing the signal further without any changes.

(a) Embedding overview: the subnetworks $\tilde f_1$ and $\tilde f_2$ are computed in parallel; the subnetwork $\tilde f_2$ is further divided into parallel subnetworks $\tilde f_{2,\gamma,i}$, where $\gamma\in\Gamma$, $i\in\{0,1,2\}$.
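The identity-regime trick can be illustrated with a one-line computation. In the Python sketch below the bounded set is taken to be $[-3,3]$ and the shift is $h=3$; both values, and the name `identity_unit`, are hypothetical choices for the example. The shift is added inside the ReLU unit and subtracted again, which in the network corresponds to adjusting the intercept of the unit that receives the output.

```python
def relu(x):
    return max(0.0, x)


def identity_unit(x, h):
    """ReLU unit with intercept h, followed by the compensating shift -h."""
    return relu(x + h) - h


h = 3.0                                   # large enough for inputs in [-3, 3]
inputs = [k / 8 for k in range(-24, 25)]  # grid on the bounded set [-3, 3]
max_dev = max(abs(identity_unit(x, h) - x) for x in inputs)
print(max_dev)

# Outside the bounded set the unit is no longer the identity:
print(identity_unit(-5.0, h))             # -3.0, not -5.0
```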

The embedding strategy is to arrange parallel subnetworks along the “depth” of the standard architecture as shown in Fig. 2(a). Note that $\tilde f_1$ and $\tilde f_2$ are computed in parallel; moreover, computation of $\tilde f_2$ is parallelized over $3|\Gamma|$ independent subnetworks computing $\tilde f_{2,\gamma,i}$ and indexed by $\gamma\in\Gamma$ and $i\in\{0,1,2\}$ (see (20) and Fig. 2). Each of these independent subnetworks gets embedded into its own batch of width-5 layers. The top row of units in the standard architecture is only used to pass the network input to each of the subnetworks, so all the top row units function in the identity mode. The bottom row is only used to accumulate results of the subnetworks into a single linear combination, so all the bottom units function in the linear mode.

Implementation of the subnetwork $\tilde f_1$ is shown in Fig. 2(b). It requires width-5 layers of the standard architecture. The original output unit $P^{(3)}$ of this subnetwork gets implemented by linear units that have not more than two inputs each and gradually accumulate the required linear combination.

Implementation of a subnetwork $\tilde f_{2,\gamma,i}$ is shown in Fig. 2(c). Its computation can be divided into two stages.

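In terms of the original adaptive network, in the first stage we perform parallel computations associated with the units $Q^{(2)}_{t,j,q}$ and combine their results in the two linear units $Q^{(3)}_{\gamma,i,0}$ and $Q^{(3)}_{\gamma,i,1}$. By (21), each of $Q^{(3)}_{\gamma,i,0}$ and $Q^{(3)}_{\gamma,i,1}$ accepts inputs from $3N_{\gamma,i}$ of the units $Q^{(2)}_{t,j,q}$, where

$$N_{\gamma,i}=\big|\{t\in\{0,\ldots,T-1\}\ |\ \gamma_t=\gamma,\ t\equiv i\ (\mathrm{mod}\ 3)\}\big|.$$

In the standard architecture, the two original linear units $Q^{(3)}_{\gamma,i,0}$ and $Q^{(3)}_{\gamma,i,1}$ get implemented by linear units that occupy two reserved lines of the network. This stage spans width-5 layers of the standard architecture.

In the second stage we use the outputs of $Q^{(3)}_{\gamma,i,0}$ and $Q^{(3)}_{\gamma,i,1}$ to compute the $m$ values $Q^{(4)}_{\gamma,i,r}$, $r=0,\ldots,m-1$, and accumulate the respective part of the final output (22). This stage spans layers of the standard architecture.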

The full implementation of one subnetwork $\tilde f_{2,\gamma,i}$ thus spans width-5 layers. Since $\sum_{\gamma,i}N_{\gamma,i}=T$ and there are $3|\Gamma|$ such subnetworks, implementation of the whole subnetwork $\tilde f_2$ spans width-5 layers. Implementation of the whole network $\tilde f$ then spans layers.

It remains to optimize $T$ and $m$ so as to achieve the minimum approximation error, subject to the total number of layers in the standard network being bounded by $N$. Recall that $|\Gamma|\le 3^m$ and that the approximation error of $\tilde f$ is not greater than $\tfrac{2}{Tm}$ by (6). Choosing $T$ of order $N$ and $m$ of order $\ln N$ with suitable constants, and assuming $N$ sufficiently large, we satisfy the network size constraint and achieve the desired error bound $\|f-\tilde f\|_\infty\le\frac{c}{N\ln N}$, uniformly in $f\in B$. Since, by construction, $\tilde f=\eta_{5,N}(\psi(f))$ with some weight selection function $\psi$, this completes the proof.