# A Priori Estimates of the Population Risk for Residual Networks

Optimal a priori estimates are derived for the population risk of a regularized residual network model. The key lies in the design of a new path norm, called the weighted path norm, which serves as the regularization term in the regularized model. The weighted path norm treats the skip connections and the nonlinearities differently, so that paths with more nonlinearities have larger weights. The error estimates are a priori in nature in the sense that they depend only on the target function and not on the parameters obtained in the training process. The estimates are optimal in the sense that the approximation bound scales as O(1/L) with the network depth L and the estimation error is comparable to the Monte Carlo error rate. In particular, optimal error bounds are obtained, for the first time, in terms of the depth of the network model. Comparisons are made with existing norm-based generalization error bounds.


## 1 Introduction

One of the major theoretical challenges in machine learning is to understand the generalization error of deep neural networks, especially residual networks (He et al., 2016), which have become one of the default choices for many machine learning tasks, such as those arising in computer vision. Many recent attempts have been made to derive bounds that do not explicitly depend on the number of parameters. In this regard, norm-based bounds use appropriate norms of the parameters to control the generalization error (Neyshabur et al., 2015b; Bartlett et al., 2017; Golowich et al., 2017; Barron and Klusowski, 2018). Other bounds include ones based on the idea of compressing the networks (Arora et al., 2018) or on the use of the Fisher-Rao information (Liang et al., 2017). While these generalization bounds differ in many ways, they have one thing in common: they depend on information about the final parameters obtained in the training process. Following E et al. (2018), we call them a posteriori estimates. In this paper, we derive an a priori estimate of the generalization error and the population risk for deep residual networks. Compared to the a posteriori estimates mentioned above, our bounds depend only on the target function and the network structure. In addition, our bounds scale optimally with the network depth and the size of the training data: the approximation error term scales as $O(1/L)$ with the depth $L$, while the estimation error term scales like the Monte Carlo rate $O(1/\sqrt{n})$ with the size $n$ of the training data, independently of the depth.

We should note that our interest in deriving a priori estimates also comes from the analogy with finite element methods (Ciarlet, 2002; Ainsworth and Oden, 2011). Both a priori and a posteriori error estimates are very common in the theoretical analysis of finite element methods. In fact, a priori estimates appeared there much earlier and are still more common than a posteriori estimates (Ciarlet, 2002), contrary to the situation in machine learning. For the case of two-layer neural network models, the analytical and practical advantages of a priori analysis have already been demonstrated in E et al. (2018). It was shown there that optimal error rates can be established for appropriately regularized two-layer neural network models, and that the accuracy of these models behaves in a much more robust fashion than that of the vanilla models without regularization. In any case, we believe both a priori and a posteriori estimates are useful and can shed light on the principles behind modern machine learning models. In this paper, we set out to extend the work in E et al. (2018) from shallow neural network models to deep ones, and we choose residual networks as a starting point.

To derive our a priori estimate, we design a new path norm for deep residual networks called the weighted path norm. Unlike traditional path norms, our weighted path norm puts more weight on paths going through more nonlinearities. In this way, we penalize paths with many nonlinearities and hence control the complexity of the functions represented by networks with bounded norm. Moreover, by using the weighted path norm as the regularization term, we can strike a balance between the empirical risk and the complexity of the model, and thus a balance between the approximation error and the estimation error. This allows us to prove that the minimizer of the regularized model achieves the optimal error rate in terms of the population risk.

Our contributions:

1. We propose the weighted path norm for residual networks, which gives larger weights to paths with more nonlinearities. The weighted path norm helps us better control the Rademacher complexity of the associated function space.

2. With the weighted path norm, we propose a regularized model and derive a priori estimates for the population risk, in the sense that the bounds depend only on the target function instead of the parameters obtained after training.

3. The a priori estimates are optimal in the sense that both the approximation error and the estimation error behave similarly to the Monte Carlo error rates.

The rest of the paper is organized as follows. In Section 2, we set up the problem and state our main theorem together with a proof sketch. In Section 3, we give the full proofs of the theorems. In Section 4, we compare our results with related works and put things into perspective. Conclusions are drawn in Section 5.

## 2 Setup of the problem and the main theorem

### 2.1 Setup

In this paper, we focus on the regression problem and on residual networks with ReLU activation. Assume that the target function $f^\star: X \to [0,1]$ is defined on a compact domain $X \subset [-1,1]^d$ and has finite spectral norm $\gamma(f^\star) < \infty$ (Definition 2.1 below). Let the training set be $\{(x_i, y_i)\}_{i=1}^{n}$, where the $x_i$'s are independently sampled from an underlying distribution $\pi$ and $y_i = f^\star(x_i)$.

Consider the following residual network architecture with a skip connection in each layer. (In practice, standard residual networks use skip connections every two layers; we consider a skip connection in every layer for the sake of clarity. It is easy to extend the analysis to the cases where skip connections are used across multiple layers.)

$$
\begin{aligned}
h_0 &= Vx,\\
g_l &= \sigma(W_l h_{l-1}), \qquad l = 1,\dots,L,\\
h_l &= h_{l-1} + U_l g_l, \qquad l = 1,\dots,L,\\
f(x;\theta) &= u^\top h_L.
\end{aligned}
\tag{2.1}
$$

Here the set of parameters is $\theta = \{V, W_1,\dots,W_L, U_1,\dots,U_L, u\}$ with $V \in \mathbb{R}^{D\times d}$, $W_l \in \mathbb{R}^{m\times D}$, $U_l \in \mathbb{R}^{D\times m}$ and $u \in \mathbb{R}^{D}$; $L$ is the number of layers, $m$ is the width of the residual blocks and $D$ is the width of the skip connections. The ReLU activation function $\sigma(t) = \max\{t, 0\}$ is extended to vectors in a component-wise fashion. Note that we omit the bias terms in the network by assuming that the first element of the input $x$ is always $1$.
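To make the architecture concrete, here is a minimal NumPy sketch of the forward pass (2.1). The dimensions $d$, $D$, $m$, $L$ and the random parameters are illustrative assumptions, not values from the paper.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, V, Ws, Us, u):
    # h_0 = V x
    h = V @ x
    # h_l = h_{l-1} + U_l * sigma(W_l h_{l-1}), l = 1, ..., L
    for W, U in zip(Ws, Us):
        h = h + U @ relu(W @ h)
    # f(x; theta) = u^T h_L
    return float(u @ h)

# illustrative sizes: input dim d, skip width D, block width m, depth L
rng = np.random.default_rng(0)
d, D, m, L = 3, 4, 2, 3
V = rng.standard_normal((D, d))
Ws = [rng.standard_normal((m, D)) for _ in range(L)]
Us = [rng.standard_normal((D, m)) for _ in range(L)]
u = rng.standard_normal(D)
x = np.array([1.0, 0.5, -0.2])  # first coordinate fixed to 1 to absorb the bias
y = forward(x, V, Ws, Us, u)
```

Note how each block only adds $U_l\,\sigma(W_l h_{l-1})$ to the running state, so the skip path is the identity, exactly as in (2.1).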

To simplify the proof, we will consider the truncated square loss

$$\ell(x;\theta) = \big|\min\{\max\{f(x;\theta),0\},1\} - f^\star(x)\big|^2. \tag{2.2}$$

The truncated population risk and empirical risk are then

$$\mathcal{L}(\theta) = \mathbb{E}_{x\sim\pi}\,\ell(x;\theta), \qquad \hat{\mathcal{L}}(\theta) = \frac{1}{n}\sum_{i=1}^{n}\ell(x_i;\theta). \tag{2.3}$$

In principle, we can also truncate the risk function, as is done below for the case with noise.

Now we define the spectral norm (Klusowski and Barron, 2016) for the target function and the weighted path norm for residual networks.

###### Definition 2.1 (Spectral norm).

Let $f$ be a function defined on the compact domain $X$, let $F$ be an extension of $f$ to $\mathbb{R}^d$, and let $\hat{F}$ be the Fourier transform of $F$. Define the spectral norm of $f$ as

$$\gamma(f) = \inf_{F} \int_{\mathbb{R}^d} \|\omega\|_1^2\, |\hat{F}(\omega)|\, d\omega, \tag{2.4}$$

where the infimum is taken over all possible extensions $F$.

###### Definition 2.2 (Weighted path norm).

Given a residual network $f(\cdot;\theta)$ with architecture (2.1), define the weighted path norm of $f$ as

$$\|f\|_{\mathcal{P}} = \|\theta\|_{\mathcal{P}} = \Big\|\,|u|^\top \big(I + 3|U_L||W_L|\big)\cdots\big(I + 3|U_1||W_1|\big)|V|\,\Big\|_1, \tag{2.5}$$

where $|A|$, for a vector or matrix $A$, denotes the entry-wise absolute value of $A$.

Note that our weighted path norm is a weighted sum over all paths in the neural network flowing from the input to the output, where paths that go through more nonlinearities receive larger weights. More precisely, consider a path $P$ that goes through the nonlinearities at layers $l_1 < \dots < l_p$ and through the skip connections at all other layers. For the nonlinear layers $l_r$, assume that $P$ goes through the neurons $g^{l_r}_{k_r}$, $r = 1,\dots,p$ (we denote by $v_k$ the $k$-th element of a vector $v$, and by $A_{k,j}$ the $(k,j)$-th entry of a matrix $A$); between layers $l_r$ and $l_{r+1}$, assume that $P$ goes through the skip-connection neurons with index $j_r$. In addition, assume that $P$ starts from the input coordinate $x_j$. Then the path $P$ is given by

$$x_j \to h^{0}_{j_0} \to h^{1}_{j_0} \to \cdots \to h^{l_1-1}_{j_0} \to g^{l_1}_{k_1} \to h^{l_1}_{j_1} \to h^{l_1+1}_{j_1} \to \cdots \to h^{l_2-1}_{j_1} \to \cdots \to g^{l_p}_{k_p} \to h^{l_p}_{j_p} \to \cdots \to h^{L}_{j_p} \to f.$$

Define the weight of the path $P$ by

$$\Pi(P) = V_{j_0,j}\cdot\prod_{r=1}^{p} W^{l_r}_{k_r, j_{r-1}}\, U^{l_r}_{j_r, k_r}\cdot u_{j_p}, \tag{2.6}$$

and the activation of $P$ by

$$\mathbf{1}_{P\ \mathrm{activated}} = \prod_{r=1}^{p} \mathbf{1}_{g^{l_r}_{k_r} > 0}.$$

Then the output of the residual network can be written as a sum over all paths,

$$f(x;\theta) = \sum_{P}\Pi(P)\,\mathbf{1}_{P\ \mathrm{activated}}\; x_{j}, \tag{2.7}$$

where $x_j$ is the input coordinate at which $P$ starts, and the weighted path norm is given by

$$\|f\|_{\mathcal{P}} = \sum_{P} 3^{p}\, |\Pi(P)|. \tag{2.8}$$

We see that $\|f\|_{\mathcal{P}}$ is a weighted sum over all paths, where the weight is determined by the number $p$ of nonlinearities encountered along the path.
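The matrix-product formula (2.5) and the path-sum formula (2.8) can be checked against each other numerically. The following sketch (our own illustration; all shapes and values are assumptions) computes the weighted path norm both ways, enumerating paths by the subset of layers in which they go through a nonlinearity:

```python
import numpy as np
from itertools import combinations

def weighted_path_norm(V, Ws, Us, u):
    # || |u|^T (I + 3|U_L||W_L|) ... (I + 3|U_1||W_1|) |V| ||_1, as in (2.5)
    M = np.abs(V)
    for W, U in zip(Ws, Us):
        M = (np.eye(U.shape[0]) + 3.0 * np.abs(U) @ np.abs(W)) @ M
    return float(np.sum(np.abs(u) @ M))

def path_norm_by_enumeration(V, Ws, Us, u):
    # sum over all paths of 3^p |Pi(P)|, as in (2.8): group paths by the subset S
    # of layers where they pass a nonlinearity; the matrix product below sums the
    # absolute weights of all paths sharing that subset.
    L = len(Ws)
    total = 0.0
    for p in range(L + 1):
        for S in combinations(range(L), p):
            M = np.abs(V)
            for l in S:  # ascending l; left-multiplication keeps the layer order
                M = (np.abs(Us[l]) @ np.abs(Ws[l])) @ M
            total += 3.0 ** p * float(np.sum(np.abs(u) @ M))
    return total

rng = np.random.default_rng(1)
d, D, m, L = 2, 3, 2, 3
V = rng.standard_normal((D, d))
Ws = [rng.standard_normal((m, D)) for _ in range(L)]
Us = [rng.standard_normal((D, m)) for _ in range(L)]
u = rng.standard_normal(D)
pn_matrix = weighted_path_norm(V, Ws, Us, u)
pn_paths = path_norm_by_enumeration(V, Ws, Us, u)
```

Expanding the product in (2.5) over the choice "identity or $3|U_l||W_l|$" at each layer is exactly the enumeration over subsets above, so the two quantities agree up to floating-point error.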

### 2.2 Main theorem

###### Theorem 2.3 (A priori estimate).

Let $f^\star$ satisfy $\gamma(f^\star) < \infty$, and assume that the residual network $f(\cdot;\theta)$ has architecture (2.1). Let $n$ be the number of training samples, $L$ the number of layers and $m$ the width of the residual blocks. Let $\mathcal{L}$ and $\hat{\mathcal{L}}$ be the truncated population risk and empirical risk defined in (2.3), respectively; let $\gamma(f^\star)$ be the spectral norm of $f^\star$ and $\|\theta\|_{\mathcal{P}}$ be the weighted path norm, as in Definitions 2.1 and 2.2. For $\lambda \ge 4 + \frac{2}{3\sqrt{2\log(2d)}}$, assume that $\hat\theta$ is an optimal solution of the regularized model

$$\min_\theta J(\theta) := \hat{\mathcal{L}}(\theta) + 3\lambda\|\theta\|_{\mathcal{P}}\sqrt{\frac{2\log(2d)}{n}}. \tag{2.9}$$

Then for any $\delta \in (0,1)$, with probability at least $1-\delta$ over the random training samples, the population risk satisfies

$$\mathcal{L}(\hat\theta) \le \frac{16\gamma^2(f^\star)}{Lm} + \big(12\gamma(f^\star)+1\big)\frac{3(4+\lambda)\sqrt{2\log(2d)}+2}{\sqrt{n}} + 4\sqrt{\frac{2\log(14/\delta)}{n}}. \tag{2.10}$$
###### Remark.
1. The estimates are a priori in nature, since (2.10) depends only on the spectral norm $\gamma(f^\star)$ of the target function, without requiring knowledge of the norm of $\hat\theta$.

2. We want to emphasize that our estimate is nearly optimal. The first term in (2.10) shows that the convergence rate with respect to the size of the neural network is $O(1/(Lm))$, which matches the rate in the universal approximation theory for shallow networks (Barron, 1993). The last two terms show that the rate with respect to the number of training samples is $O(1/\sqrt{n})$ up to logarithmic factors, which matches the classical estimates of the generalization gap.

3. The last term depends only on $\delta$ and $n$ rather than on the network architecture; thus there is no need to increase the sample size with the network size $L$ and $m$ to ensure convergence. This is not the case for existing error bounds (see Section 4).
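As an illustration of how the regularized model (2.9) is assembled from the empirical risk (2.3) and the weighted path norm (2.5), here is a hedged NumPy sketch; the network sizes, the data, and the value of $\lambda$ are arbitrary assumptions for demonstration only.

```python
import numpy as np

def weighted_path_norm(V, Ws, Us, u):
    # || |u|^T (I + 3|U_L||W_L|) ... (I + 3|U_1||W_1|) |V| ||_1, cf. (2.5)
    M = np.abs(V)
    for W, U in zip(Ws, Us):
        M = (np.eye(U.shape[0]) + 3.0 * np.abs(U) @ np.abs(W)) @ M
    return float(np.sum(np.abs(u) @ M))

def objective(theta, X, y, lam):
    # J(theta) = empirical truncated square risk + 3*lam*||theta||_P*sqrt(2 log(2d)/n)
    V, Ws, Us, u = theta
    n, d = X.shape
    preds = np.empty(n)
    for i, x in enumerate(X):
        h = V @ x
        for W, U in zip(Ws, Us):
            h = h + U @ np.maximum(W @ h, 0.0)
        preds[i] = min(max(float(u @ h), 0.0), 1.0)  # truncation as in (2.2)
    emp_risk = float(np.mean((preds - y) ** 2))
    penalty = 3.0 * lam * weighted_path_norm(V, Ws, Us, u) * np.sqrt(2.0 * np.log(2.0 * d) / n)
    return emp_risk + penalty

rng = np.random.default_rng(0)
n, d, D, m, L = 8, 3, 4, 2, 2
X = np.hstack([np.ones((n, 1)), rng.uniform(-1, 1, size=(n, d - 1))])  # first coord = 1
y = rng.uniform(0, 1, size=n)
theta = (0.1 * rng.standard_normal((D, d)),
         [0.1 * rng.standard_normal((m, D)) for _ in range(L)],
         [0.1 * rng.standard_normal((D, m)) for _ in range(L)],
         0.1 * rng.standard_normal(D))
J = objective(theta, X, y, lam=4.2)
```

In practice one would minimize `objective` over `theta` with a gradient-based optimizer; the point here is only the composition of the two terms in (2.9).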

### 2.3 Extension to noisy problems

Our a priori estimate can be extended to problems with sub-Gaussian noise. Assume that the labels in the training data are given by $y_i = f^\star(x_i) + \varepsilon_i$, where the $\varepsilon_i$'s are i.i.d. random variables with $\mathbb{E}\varepsilon_i = 0$ and

$$\Pr\{|\varepsilon_i| > t\} \le c\, e^{-\frac{t^2}{2\sigma^2}}, \qquad \forall\, t \ge \tau, \tag{2.11}$$

for some constants $c$, $\sigma$ and $\tau$. Let $\ell_B$ denote the square loss truncated at level $B$ (in analogy with (2.2)), and define

$$\mathcal{L}_B(\theta) = \mathbb{E}_{x\sim\pi}\,\ell_B(x;\theta), \qquad \hat{\mathcal{L}}_B(\theta) = \frac{1}{n}\sum_{i=1}^{n}\ell_B(x_i;\theta). \tag{2.12}$$

Then, we have

###### Theorem 2.4 (A priori estimate for noisy problems).

In addition to the same conditions as in Theorem 2.3, assume that the noise satisfies (2.11). Let $\mathcal{L}_B$ and $\hat{\mathcal{L}}_B$ be the truncated population risk and empirical risk defined in (2.12). For $\lambda \ge 4 + \frac{2}{3\sqrt{2\log(2d)}}$ and a truncation level $B$ large enough that $B \ge \tau + 1$ and $e^{-\frac{(B-1)^2}{2\sigma^2}} \le n^{-1/2}$, assume that $\hat\theta$ is an optimal solution of the regularized model

$$\min_\theta J(\theta) := \hat{\mathcal{L}}_B(\theta) + 3\lambda B\|\theta\|_{\mathcal{P}}\sqrt{\frac{2\log(2d)}{n}}. \tag{2.13}$$

Then for any $\delta \in (0,1)$, with probability at least $1-\delta$ over the random training samples, the population risk satisfies

$$\mathcal{L}(\hat\theta) \le \frac{16\gamma^2(f^\star)}{Lm} + \big(12\gamma(f^\star)+1\big)\frac{3(4+\lambda)B\sqrt{2\log(2d)}+2B^2}{\sqrt{n}} + 4B^2\sqrt{\frac{2\log(14/\delta)}{n}} + \frac{2c(4\sigma^2+1)}{\sqrt{n}}. \tag{2.14}$$

We see that the a priori estimates for noisy problems differ from those for noiseless problems only by factors involving the truncation level $B$, which is logarithmic in $n$. In particular, the estimates of the generalization error are still nearly optimal.

### 2.4 Proof sketch

We prove the main theorem in 3 steps. We list the main intermediate results in this section, and leave the full proof to Section 3.

#### 2.4.1 Approximation error

First, we show that there exists a set of parameters $\tilde\theta$ such that the approximation error $\mathcal{L}(\tilde\theta)$ is small, while the norm $\|\tilde\theta\|_{\mathcal{P}}$ is controlled by $\gamma(f^\star)$.

###### Theorem 2.5.

For any distribution $\pi$ with compact support $X$, and any target function $f^\star$ with $\gamma(f^\star) < \infty$, there exists a residual network $f(\cdot;\tilde\theta)$ with depth $L$ and width $m$ such that

$$\mathbb{E}_{x\sim\pi}\big|f(x;\tilde\theta) - f^\star(x)\big|^2 \le \frac{16\gamma^2(f^\star)}{Lm} \tag{2.15}$$

and $\|\tilde\theta\|_{\mathcal{P}} \le 12\gamma(f^\star)$.

#### 2.4.2 A posteriori estimate

Second, we show that the weighted path norm can help to bound the Rademacher complexity. Since the Rademacher complexity can bound the generalization gap, this gives the a posteriori estimates.

Recall the definition of the Rademacher complexity: given a function class $\mathcal{H}$ and a sample set $S = \{x_1,\dots,x_n\}$, the (empirical) Rademacher complexity of $\mathcal{H}$ with respect to $S$ is defined as

$$\hat{R}(\mathcal{H}) = \frac{1}{n}\,\mathbb{E}_{\xi}\Big[\sup_{h\in\mathcal{H}}\sum_{i=1}^{n}\xi_i h(x_i)\Big], \tag{2.16}$$

where the $\xi_i$'s are i.i.d. Rademacher random variables with $\Pr\{\xi_i = 1\} = \Pr\{\xi_i = -1\} = 1/2$.

It is well-known that the generalization gap is controlled by the Rademacher complexity (Shalev-Shwartz and Ben-David, 2014).

###### Theorem 2.7.

Given a function class $\mathcal{H}$, for any $\delta \in (0,1)$, with probability at least $1-\delta$ over the random samples $x_1,\dots,x_n$,

$$\sup_{h\in\mathcal{H}}\Big|\mathbb{E}_x[h(x)] - \frac{1}{n}\sum_{i=1}^{n}h(x_i)\Big| \le 2\hat{R}(\mathcal{H}) + 2\sup_{h,h'\in\mathcal{H}}\|h-h'\|_{\infty}\sqrt{\frac{2\log(4/\delta)}{n}}. \tag{2.17}$$

The following theorem is a crucial step in our analysis. It shows that the Rademacher complexity of residual networks can be controlled by the weighted path norm.

###### Theorem 2.8.

Let $\mathcal{F}_Q = \{f(\cdot;\theta) : \|\theta\|_{\mathcal{P}} \le Q\}$, where the $f(\cdot;\theta)$'s are residual networks defined by (2.1). Assume that the samples satisfy $\|x_i\|_\infty \le 1$; then we have

$$\hat{R}(\mathcal{F}_Q) \le 3Q\sqrt{\frac{2\log(2d)}{n}}. \tag{2.18}$$

From Theorems 2.7 and 2.8, we have the following a posteriori estimates.

###### Theorem 2.9 (A posteriori estimate).

Let $\|\theta\|_{\mathcal{P}}$ be the weighted path norm of the residual network $f(\cdot;\theta)$, and let $n$ be the number of training samples. Let $\mathcal{L}$ and $\hat{\mathcal{L}}$ be the truncated population risk and empirical risk defined in (2.3). Then for any $\delta \in (0,1)$, with probability at least $1-\delta$ over the random training samples, we have

$$\big|\mathcal{L}(\theta) - \hat{\mathcal{L}}(\theta)\big| \le 2\big(\|\theta\|_{\mathcal{P}}+1\big)\frac{6\sqrt{2\log(2d)}+1}{\sqrt{n}} + 2\sqrt{\frac{2\log(7/\delta)}{n}}. \tag{2.19}$$

#### 2.4.3 A priori estimate

By comparing the definition of the objective function (2.9) with the a posteriori estimate (2.19), we conclude that for any $\theta$, the gap $\mathcal{L}(\theta) - J(\theta)$ is bounded with high probability via (2.19). Recalling that $\hat\theta$ is the optimal solution of the objective function (2.9) and that $\tilde\theta$ corresponds to the approximation in Theorem 2.5, we get

$$\mathcal{L}(\hat\theta) = \big[\mathcal{L}(\hat\theta) - J(\hat\theta)\big] + \big[J(\hat\theta) - J(\tilde\theta)\big] + \big[J(\tilde\theta) - \mathcal{L}(\tilde\theta)\big] + \mathcal{L}(\tilde\theta).$$

Here the first and third terms are upper-bounded with high probability; the second term satisfies $J(\hat\theta) - J(\tilde\theta) \le 0$ since $\hat\theta$ minimizes $J$; and $\mathcal{L}(\tilde\theta)$ and $\|\tilde\theta\|_{\mathcal{P}}$ are upper-bounded as shown in Theorem 2.5. Together these give the a priori estimates of Theorem 2.3.

For problems with noise, we only need the following lemma:

###### Lemma 2.10.

Assume that the noise $\varepsilon$ has zero mean and satisfies (2.11), and that $B$ is chosen as in Theorem 2.4. Then for any $\theta$ we have

$$\big|\mathcal{L}(\theta) - \mathcal{L}_B(\theta)\big| \le \frac{c(4\sigma^2+1)}{\sqrt{n}}. \tag{2.20}$$

The details of the proof are given in the following Section 3.

## 3 Proof

### 3.1 Approximation error

For the approximation error, E et al. (2018) give the following result for shallow networks.

###### Theorem 3.1.

For any distribution $\pi$ with compact support $X$, and any target function $f^\star$ with $\gamma(f^\star) < \infty$, there exists a one-hidden-layer network of width $m$, with parameters $\{(a_j, b_j, c_j)\}_{j=1}^{m}$, such that

$$\mathbb{E}_{x\sim\pi}\Big|\sum_{j=1}^{m}a_j\sigma(b_j^\top x + c_j) - f^\star(x)\Big|^2 \le \frac{16\gamma^2(f^\star)}{m} \tag{3.1}$$

and

$$\sum_{j=1}^{m}|a_j|\big(\|b_j\|_1 + |c_j|\big) \le 4\gamma(f^\star). \tag{3.2}$$

For residual networks, we prove the approximation result by splitting the shallow network into several parts and stacking them vertically.

###### Proof of Theorem 2.5.

Recall the assumption that the first element of the input $x$ is always $1$; thus we can omit the bias terms in Theorem 3.1. Hence there exists a shallow network of width $Lm$ that satisfies

$$\mathbb{E}_{x\sim\pi}\Big|\sum_{j=1}^{Lm}a_j\sigma(b_j^\top x) - f^\star(x)\Big|^2 \le \frac{16\gamma^2(f^\star)}{Lm}$$

and

$$\sum_{j=1}^{Lm}|a_j|\,\|b_j\|_1 \le 4\gamma(f^\star).$$

Now, we construct a residual network with input dimension $d$, depth $L$, width $m$ and $D = d+1$, using

$$V = \begin{bmatrix} I_d \\ 0 \end{bmatrix}, \qquad u = (0, 0, \dots, 0, 1)^\top,$$

$$W_l = \begin{bmatrix} b_{(l-1)m+1}^\top & 0 \\ b_{(l-1)m+2}^\top & 0 \\ \vdots & \vdots \\ b_{lm}^\top & 0 \end{bmatrix}, \qquad U_l = \begin{bmatrix} 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 \\ a_{(l-1)m+1} & a_{(l-1)m+2} & \cdots & a_{lm} \end{bmatrix}$$

for $l = 1,\dots,L$. Then it is easy to verify that $f(x;\tilde\theta) = \sum_{j=1}^{Lm}a_j\sigma(b_j^\top x)$, and

$$\|\tilde\theta\|_{\mathcal{P}} = 3\sum_{j=1}^{Lm}|a_j|\,\|b_j\|_1 \le 12\gamma(f^\star). \qquad\qed$$
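The construction in this proof can be verified numerically: splitting a shallow network of width $Lm$ into $L$ residual blocks of width $m$ reproduces its output exactly, and the weighted path norm of the construction equals $3\sum_j |a_j|\,\|b_j\|_1$. The following sketch (our illustration; all sizes and values are assumptions) checks both facts:

```python
import numpy as np

rng = np.random.default_rng(0)
d, L, m = 3, 2, 2            # shallow width L*m = 4
a = rng.standard_normal(L * m)
b = rng.standard_normal((L * m, d))

def shallow(x):
    # sum_j a_j * sigma(b_j^T x): the width-Lm shallow network
    return sum(a[j] * max(float(b[j] @ x), 0.0) for j in range(L * m))

# residual network of depth L, width m, D = d + 1, as in the proof
D = d + 1
V = np.vstack([np.eye(d), np.zeros((1, d))])   # V = [I_d; 0]
u = np.zeros(D); u[-1] = 1.0                   # u = (0, ..., 0, 1)^T
Ws = [np.hstack([b[l * m:(l + 1) * m], np.zeros((m, 1))]) for l in range(L)]
Us = []
for l in range(L):
    U = np.zeros((D, m)); U[-1] = a[l * m:(l + 1) * m]  # last row carries the a_j
    Us.append(U)

def resnet(x):
    h = V @ x
    for W, U in zip(Ws, Us):
        h = h + U @ np.maximum(W @ h, 0.0)
    return float(u @ h)

x = rng.standard_normal(d)
res_out, shallow_out = resnet(x), shallow(x)

# weighted path norm (2.5) of the construction vs. 3 * sum_j |a_j| * ||b_j||_1
M = np.abs(V)
for W, U in zip(Ws, Us):
    M = (np.eye(D) + 3.0 * np.abs(U) @ np.abs(W)) @ M
pn = float(np.sum(np.abs(u) @ M))
target = float(np.sum(np.abs(a) * np.abs(b).sum(axis=1)))
```

The zero last column of each $W_l$ prevents the accumulated output (stored in the last coordinate of $h$) from feeding back into later blocks, which is why every path goes through exactly one nonlinearity.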

### 3.2 A posteriori estimate

To bound the Rademacher complexity of residual networks, we first define the hidden neurons in the residual blocks and their corresponding path norm.

###### Definition 3.2.

Given a residual network defined by (2.1), let

$$g_l(x) = \sigma(W_l h_{l-1}), \qquad l = 1,\dots,L. \tag{3.3}$$

Let $g^i_l$ be the $i$-th element of $g_l$, and define the weighted path norm

$$\|g^i_l\|_{\mathcal{P}} = \Big\|\,3|W^l_{i,:}|\big(I + 3|U_{l-1}||W_{l-1}|\big)\cdots\big(I + 3|U_1||W_1|\big)|V|\,\Big\|_1, \tag{3.4}$$

where $W^l_{i,:}$ is the $i$-th row of $W_l$.

The following Lemma 3.3 establishes the relationship between $\|f\|_{\mathcal{P}}$ and the $\|g^i_l\|_{\mathcal{P}}$'s. Lemma 3.4 gives properties of the corresponding function classes. We omit the proofs here.

###### Lemma 3.3.

For the weighted path norms defined in (2.5) and (3.4), we have

$$\|f\|_{\mathcal{P}} = \sum_{l=1}^{L}\sum_{j=1}^{m}\big(|u|^\top|U^l_{:,j}|\big)\|g^j_l\|_{\mathcal{P}} + \big\||u|^\top|V|\big\|_1, \tag{3.5}$$

and

$$\|g^i_l\|_{\mathcal{P}} = \sum_{k=1}^{l-1}\sum_{j=1}^{m}3\big(|W^l_{i,:}||U^k_{:,j}|\big)\|g^j_k\|_{\mathcal{P}} + 3\big\||W^l_{i,:}||V|\big\|_1, \tag{3.6}$$

where $U^k_{:,j}$ is the $j$-th column of $U_k$.

###### Lemma 3.4.

Let $\mathcal{G}^Q_l = \{g^i_l : \|g^i_l\|_{\mathcal{P}} \le Q\}$; then

1. $\mathcal{G}^Q_l \subseteq \mathcal{G}^Q_{l'}$ for $l \le l'$;

2. $\mathcal{G}^{Q}_l = Q\,\mathcal{G}^{1}_l$ and $0 \in \mathcal{G}^Q_l$ for $Q > 0$.

Now we recall two lemmas about Rademacher complexity (Shalev-Shwartz and Ben-David, 2014).

###### Lemma 3.5.

Let $\mathcal{H} = \{x \mapsto u^\top x : \|u\|_1 \le 1\}$, and let the samples satisfy $x_i \in \mathbb{R}^d$; then

$$\hat{R}(\mathcal{H}) \le \max_i\|x_i\|_\infty\sqrt{\frac{2\log(2d)}{n}}. \tag{3.7}$$
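For this linear class the inner supremum has a closed form, $\sup_{\|u\|_1\le1}\sum_i\xi_i u^\top x_i = \|\sum_i\xi_i x_i\|_\infty$, so the bound (3.7) can be sanity-checked by Monte Carlo. The sketch below (our illustration; the sample sizes are assumptions) estimates $\hat{R}(\mathcal{H})$ and compares it with the bound:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.uniform(-1.0, 1.0, size=(n, d))   # samples with ||x_i||_inf <= 1

def empirical_rademacher(X, n_draws=2000):
    # For H = {x -> u^T x : ||u||_1 <= 1} the inner sup is attained at a vertex
    # of the l1 ball, giving sup = || sum_i xi_i x_i ||_inf; average over
    # random sign vectors xi to estimate the expectation in (2.16).
    n = X.shape[0]
    xi = rng.choice([-1.0, 1.0], size=(n_draws, n))
    return float(np.mean(np.max(np.abs(xi @ X), axis=1)) / n)

R_hat = empirical_rademacher(X)
bound = float(np.abs(X).max() * np.sqrt(2.0 * np.log(2 * d) / n))  # Lemma 3.5
```

With these sizes the empirical estimate sits well below the bound, reflecting the slack of the union-bound argument behind (3.7).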
###### Lemma 3.6.

Assume that $\phi_i$, $i = 1,\dots,n$, are Lipschitz continuous functions with uniform Lipschitz constant $L_\phi$, i.e., $|\phi_i(s) - \phi_i(t)| \le L_\phi|s - t|$ for all $s, t \in \mathbb{R}$; then

$$\mathbb{E}_{\xi}\Big[\sup_{h\in\mathcal{H}}\sum_{i=1}^{n}\xi_i\phi_i(h(x_i))\Big] \le L_\phi\,\mathbb{E}_{\xi}\Big[\sup_{h\in\mathcal{H}}\sum_{i=1}^{n}\xi_i h(x_i)\Big]. \tag{3.8}$$

With Lemmas 3.3–3.6, we can bound the Rademacher complexity of residual networks as stated in Theorem 2.8.

###### Proof of Theorem 2.8.

We first estimate the Rademacher complexity of $\mathcal{G}^Q_l$. We do this by induction, showing that

$$\hat{R}(\mathcal{G}^Q_l) \le Q\sqrt{\frac{2\log(2d)}{n}}. \tag{3.9}$$

By definition, $g_1(x) = \sigma(w^\top Vx)$ with $\|g_1\|_{\mathcal{P}} = 3\||w|^\top|V|\|_1 \le Q$; hence, using Lemmas 3.5 and 3.6, the statement (3.9) holds for $l = 1$. Now, assume the result holds for $1,\dots,l$; for $l+1$ we have

$$\begin{aligned}
n\hat{R}(\mathcal{G}^Q_{l+1}) &= \mathbb{E}_\xi\sup_{g_{l+1}\in\mathcal{G}^Q_{l+1}}\sum_{i=1}^{n}\xi_i g_{l+1}(x_i)\\
&= \mathbb{E}_\xi\sup_{(1)}\sum_{i=1}^{n}\xi_i\,\sigma\big(w^\top(U_l g_l + U_{l-1}g_{l-1} + \cdots + U_1 g_1 + h_0)\big)\\
&\le \mathbb{E}_\xi\sup_{(1)}\sum_{i=1}^{n}\xi_i\, w^\top(U_l g_l + U_{l-1}g_{l-1} + \cdots + U_1 g_1 + h_0)\\
&\le \mathbb{E}_\xi\sup_{(2)}\Big\{\sum_{k=1}^{l}a_k\sup_{g\in\mathcal{G}^1_k}\Big|\sum_{i=1}^{n}\xi_i g(x_i)\Big| + b\sup_{\|u\|_1\le1}\Big|\sum_{i=1}^{n}\xi_i u^\top x_i\Big|\Big\}\\
&\le \mathbb{E}_\xi\sup_{\substack{a+b\le Q/3\\ a,b\ge0}}\Big\{a\sup_{g\in\mathcal{G}^1_l}\Big|\sum_{i=1}^{n}\xi_i g(x_i)\Big| + b\sup_{\|u\|_1\le1}\Big|\sum_{i=1}^{n}\xi_i u^\top x_i\Big|\Big\}\\
&\le \frac{Q}{3}\Big[\mathbb{E}_\xi\sup_{g\in\mathcal{G}^1_l}\Big|\sum_{i=1}^{n}\xi_i g(x_i)\Big| + \mathbb{E}_\xi\sup_{\|u\|_1\le1}\Big|\sum_{i=1}^{n}\xi_i u^\top x_i\Big|\Big],
\end{aligned}$$

where condition (1) is $\|g_{l+1}\|_{\mathcal{P}} \le Q$, and condition (2) is $\sum_{k=1}^{l}a_k + b \le Q/3$, $a_k, b \ge 0$ (which follows from (3.6)). The first inequality is due to the contraction lemma (Lemma 3.6), while the third inequality is due to Lemma 3.4. Because the class $\{x \mapsto u^\top x : \|u\|_1 \le 1\}$ is symmetric, we know

$$\mathbb{E}_\xi\sup_{\|u\|_1\le1}\Big|\sum_{i=1}^{n}\xi_i u^\top x_i\Big| = \mathbb{E}_\xi\sup_{\|u\|_1\le1}\sum_{i=1}^{n}\xi_i u^\top x_i \le n\sqrt{\frac{2\log(2d)}{n}}.$$

On the other hand, since $0 \in \mathcal{G}^1_l$, we have

$$\mathbb{E}_\xi\sup_{g\in\mathcal{G}^1_l}\Big|\sum_{i=1}^{n}\xi_i g(x_i)\Big| \le 2\,\mathbb{E}_\xi\sup_{g\in\mathcal{G}^1_l}\sum_{i=1}^{n}\xi_i g(x_i) = 2n\hat{R}(\mathcal{G}^1_l).$$

Therefore, we have

$$\hat{R}(\mathcal{G}^Q_{l+1}) \le \frac{Q}{3}\Big[2\sqrt{\frac{2\log(2d)}{n}} + \sqrt{\frac{2\log(2d)}{n}}\Big] \le Q\sqrt{\frac{2\log(2d)}{n}}.$$

Similarly, based on the decomposition (3.5) and the bound on the Rademacher complexity of $\mathcal{G}^Q_l$, we get

$$\hat{R}(\mathcal{F}_Q) \le 3Q\sqrt{\frac{2\log(2d)}{n}}. \qquad\qed$$

###### Proof of Theorem 2.9.

Let $\mathcal{H}_Q = \{\ell(\cdot;\theta) : \|\theta\|_{\mathcal{P}} \le Q\}$. Notice that for all $x$,

$$|\ell(x;\theta) - \ell(x;\theta')| \le 2|f(x;\theta) - f(x;\theta')|.$$

By Lemma 3.6 and Theorem 2.8,

$$\hat{R}(\mathcal{H}_Q) \le 2\hat{R}(\mathcal{F}_Q) \le 6Q\sqrt{\frac{2\log(2d)}{n}}.$$

From Theorem 2.7, noticing that the truncated loss takes values in $[0,1]$, with probability at least $1-\delta$,

$$\sup_{\|\theta\|_{\mathcal{P}}\le Q}\big|\mathcal{L}(\theta) - \hat{\mathcal{L}}(\theta)\big| \le 2\hat{R}(\mathcal{H}_Q) + 2\sup_{h,h'\in\mathcal{H}_Q}\|h-h'\|_\infty\sqrt{\frac{2\log(4/\delta)}{n}} \le 12Q\sqrt{\frac{2\log(2d)}{n}} + 2\sqrt{\frac{2\log(4/\delta)}{n}}. \tag{3.10}$$

Now take $Q = 1, 2, 3, \dots$ and $\delta_Q = \frac{6\delta}{\pi^2 Q^2}$; then, with probability at least $1-\delta$, the bound

$$\sup_{\|\theta\|_{\mathcal{P}}\le Q}\big|\mathcal{L}(\theta) - \hat{\mathcal{L}}(\theta)\big| \le 12Q\sqrt{\frac{2\log(2d)}{n}} + 2\sqrt{\frac{2}{n}\log\frac{2(\pi Q)^2}{3\delta}}$$

holds simultaneously for all $Q \in \mathbb{N}_+$. In particular, for a given $\theta$, the inequality holds for $Q = \lceil\|\theta\|_{\mathcal{P}}\rceil \le \|\theta\|_{\mathcal{P}} + 1$; thus

$$\begin{aligned}
\big|\mathcal{L}(\theta) - \hat{\mathcal{L}}(\theta)\big| &\le 12\big(\|\theta\|_{\mathcal{P}}+1\big)\sqrt{\frac{2\log(2d)}{n}} + 2\sqrt{\frac{2}{n}\log\frac{7(\|\theta\|_{\mathcal{P}}+1)^2}{\delta}}\\
&\le 12\big(\|\theta\|_{\mathcal{P}}+1\big)\sqrt{\frac{2\log(2d)}{n}} + 2\Big[\frac{\|\theta\|_{\mathcal{P}}+1}{\sqrt{n}} + \sqrt{\frac{2\log(7/\delta)}{n}}\Big]\\
&= 2\big(\|\theta\|_{\mathcal{P}}+1\big)\frac{6\sqrt{2\log(2d)}+1}{\sqrt{n}} + 2\sqrt{\frac{2\log(7/\delta)}{n}}. \qquad\qed
\end{aligned}$$

### 3.3 A priori estimate

Now we are ready to prove the main theorem, Theorem 2.3.

###### Proof of Theorem 2.3.

Let $\hat\theta$ be an optimal solution of the regularized model (2.9), and let $\tilde\theta$ be the approximation in Theorem 2.5. Consider the decomposition

$$\mathcal{L}(\hat\theta) = \big[\mathcal{L}(\hat\theta) - J(\hat\theta)\big] + \big[J(\hat\theta) - J(\tilde\theta)\big] + \big[J(\tilde\theta) - \mathcal{L}(\tilde\theta)\big] + \mathcal{L}(\tilde\theta). \tag{3.11}$$

From (2.15) in Theorem 2.5,

$$\mathcal{L}(\tilde\theta) \le \frac{16\gamma^2(f^\star)}{Lm}. \tag{3.12}$$

Comparing the definition of $J$ in (2.9) with the gap bound (2.19), applied with $\delta/2$, with probability at least $1 - \delta/2$,

$$\begin{aligned}
\mathcal{L}(\hat\theta) - J(\hat\theta) &\le \big(\|\hat\theta\|_{\mathcal{P}}+1\big)\frac{3(4-\lambda)\sqrt{2\log(2d)}+2}{\sqrt{n}} + 3\lambda\sqrt{\frac{2\log(2d)}{n}} + 2\sqrt{\frac{2\log(14/\delta)}{n}}\\
&\le 3\lambda\sqrt{\frac{2\log(2d)}{n}} + 2\sqrt{\frac{2\log(14/\delta)}{n}},
\end{aligned} \tag{3.13}$$

since $\lambda \ge 4 + \frac{2}{3\sqrt{2\log(2d)}}$ implies $3(4-\lambda)\sqrt{2\log(2d)} + 2 \le 0$. Likewise, with probability at least $1 - \delta/2$, we have

$$J(\tilde\theta) - \mathcal{L}(\tilde\theta) \le \big(\|\tilde\theta\|_{\mathcal{P}}+1\big)\frac{3(4+\lambda)\sqrt{2\log(2d)}+2}{\sqrt{n}} - 3\lambda\sqrt{\frac{2\log(2d)}{n}} + 2\sqrt{\frac{2\log(14/\delta)}{n}}. \tag{3.14}$$

Thus with probability at least $1-\delta$, (3.13) and (3.14) hold simultaneously. In addition, we have

$$J(\hat\theta) - J(\tilde\theta) \le 0 \tag{3.15}$$

since $\hat\theta$ is a minimizer of $J$.

Now plug (3.12)–(3.15) into (3.11), and notice that $\|\tilde\theta\|_{\mathcal{P}} \le 12\gamma(f^\star)$ from Theorem 2.5; we see that the bound (2.10) of the main theorem holds with probability at least $1-\delta$. ∎

Finally, we deal with the noise and prove Theorem 2.4. For problems with noise, we decompose $\mathcal{L}(\hat\theta)$ as

$$\mathcal{L}(\hat\theta) = \big[\mathcal{L}(\hat\theta) - \mathcal{L}_B(\hat\theta)\big] + \big[\mathcal{L}_B(\hat\theta) - J(\hat\theta)\big] + \big[J(\hat\theta) - J(\tilde\theta)\big] + \big[J(\tilde\theta) - \mathcal{L}_B(\tilde\theta)\big] + \big[\mathcal{L}_B(\tilde\theta) - \mathcal{L}(\tilde\theta)\big] + \mathcal{L}(\tilde\theta). \tag{3.16}$$

Based on the results we had for the noiseless problem, in (3.16) we only have to estimate the first and the fifth terms, i.e., the differences between $\mathcal{L}$ and $\mathcal{L}_B$. This can be done by Lemma 2.10.

###### Proof of Lemma 2.10.

Let $Z = \min\{\max\{f(x;\theta), 0\}, 1\} - y$; then we have

$$\big|\mathcal{L}(\theta) - \mathcal{L}_B(\theta)\big| = \mathbb{E}\big[(Z^2 - B^2)\,\mathbf{1}_{|Z| > B}\big] = \int_0^\infty \Pr\{|Z| \ge \sqrt{B^2 + t^2}\}\, dt^2.$$

As $|\min\{\max\{f(x;\theta),0\},1\} - f^\star(x)| \le 1$ and $y = f^\star(x) + \varepsilon$, we have $|Z| \le |\varepsilon| + 1$, so that

$$\int_0^\infty \Pr\{|Z| \ge \sqrt{B^2 + t^2}\}\, dt^2 \le \int_0^\infty \Pr\{|\varepsilon| \ge \sqrt{B^2 + t^2} - 1\}\, dt^2.$$

Let $s = \sqrt{B^2 + t^2}$; then, using the tail bound (2.11) and the substitution $u = s - 1$,

$$\begin{aligned}
\int_0^\infty \Pr\{|\varepsilon| \ge \sqrt{B^2 + t^2} - 1\}\, dt^2 &\le \int_B^\infty c\, e^{-\frac{(s-1)^2}{2\sigma^2}}\, ds^2\\
&= \int_{B-1}^\infty c\, e^{-\frac{u^2}{2\sigma^2}}\, du^2 + \int_{B-1}^\infty 2c\, e^{-\frac{u^2}{2\sigma^2}}\, du\\
&\le 2c\sigma^2 e^{-\frac{(B-1)^2}{2\sigma^2}} + \sqrt{2\pi}\,c\sigma\, e^{-\frac{(B-1)^2}{2\sigma^2}}\\
&\le c(4\sigma^2 + 1)\, e^{-\frac{(B-1)^2}{2\sigma^2}} \le \frac{c(4\sigma^2+1)}{\sqrt{n}},
\end{aligned}$$

where the last step uses the choice of $B$ in Theorem 2.4. ∎

## 4 Related works and discussions

### 4.1 Comparison with norm-based a posteriori estimates

Several research groups have proposed different norms to bound the generalization error of deep neural networks, including the group norm and path norm given in Neyshabur et al. (2015b), the spectral norm in Bartlett et al. (2017), and the variational norm in Barron and Klusowski (2018). In these works, the bounds for the generalization gap