1 Introduction
One of the major theoretical challenges in machine learning is to understand the generalization error for deep neural networks, especially residual networks
(He et al., 2016), which have become one of the default choices for many machine learning tasks, such as the ones that arise in computer vision. Many recent attempts have been made to derive bounds that do not explicitly depend on the number of parameters. In this regard, the norm-based bounds use appropriate norms of the parameters to control the generalization error
(Neyshabur et al., 2015b; Bartlett et al., 2017; Golowich et al., 2017; Barron and Klusowski, 2018). Other bounds include the ones based on the idea of compressing the networks (Arora et al., 2018) or on the Fisher-Rao information (Liang et al., 2017). While these generalization bounds differ in many ways, they have one thing in common: they depend on information about the final parameters obtained in the training process. Following E et al. (2018), we call them a posteriori estimates. In this paper, we derive an a priori estimate of the generalization error and the population risk for deep residual networks. Compared to the a posteriori estimates mentioned above, our bounds depend only on the target function and the network structure. In addition, our bounds scale optimally with the network depth and the size of the training data: the approximation error term scales optimally with the depth, while the estimation error term scales like the Monte Carlo error rate with the size of the training data and is independent of the depth.

We should note that our interest in deriving a priori estimates also comes from the analogy with finite element methods (Ciarlet, 2002; Ainsworth and Oden, 2011). Both a priori and a posteriori error estimates are very common in the theoretical analysis of finite element methods. In fact, a priori estimates appeared there much earlier and are still more common than a posteriori estimates (Ciarlet, 2002), contrary to the situation in machine learning. For the case of two-layer neural network models, the analytical and practical advantages of a priori analysis have already been demonstrated in E et al. (2018). It was shown there that optimal error rates can be established for appropriately regularized two-layer neural network models, and that the accuracy of these models behaves in a much more robust fashion than that of the vanilla models without regularization.
In any case, we believe that both a priori and a posteriori estimates are useful and can shed some light on the principles behind modern machine learning models. In this paper, we set out to extend the work of E et al. (2018) on shallow neural network models to deep ones, and we choose residual networks as a starting point.
To derive our a priori estimate, we design a new path norm for deep residual networks, called the weighted path norm. Unlike traditional path norms, ours puts more weight on paths going through more nonlinearities. In this way, we penalize paths with many nonlinearities and hence control the complexity of the functions represented by networks with bounded norm. Moreover, by using the weighted path norm as the regularization term, we can strike a balance between the empirical risk and the complexity of the model, and thus between the approximation error and the estimation error. This allows us to prove that the minimizer of the regularized model achieves the optimal error rate in terms of the population risk.
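To make the idea concrete, here is a minimal numerical sketch of a weighted path norm for a toy residual architecture. The block structure (skip connection plus a ReLU branch per layer) follows the spirit of (2.1); the closed-form recursion and the per-nonlinearity factor `K = 3` are illustrative assumptions, not the paper's exact constants.

```python
import numpy as np

# Toy residual architecture assumed here (cf. (2.1)):
#   h_0 = V x,   h_l = h_{l-1} + U_l relu(W_l h_{l-1}),   f(x) = c^T h_L.
# K is an ASSUMED penalty factor applied once per nonlinearity on a path.
K = 3.0

def weighted_path_norm(V, Ws, Us, c, k=K):
    """Weighted sum, over all input-to-output paths, of the absolute values
    of the weights along the path, multiplied by k for each nonlinearity
    the path traverses.  Computed via the closed form
        |c|^T  prod_l (I + k |U_l| |W_l|)  |V|  1.
    """
    A = np.abs(V)  # A[i, j]: weighted sum of paths from input j to neuron i
    for W, U in zip(Ws, Us):
        A = A + k * np.abs(U) @ np.abs(W) @ A  # skip branch + ReLU branch
    return float(np.abs(c) @ A @ np.ones(V.shape[1]))
```

For a one-layer scalar example with V = [[2]], W = [[1]], U = [[1]], c = [3], the skip path contributes |3·2| = 6 and the nonlinear path contributes 3·|3·1·1·2| = 18, giving norm 24.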
Our contributions:

We propose the weighted path norm for residual networks, which gives larger weights to paths with more nonlinearities. The weighted path norm helps us better control the Rademacher complexity of the associated function space.

With the weighted path norm, we propose a regularized model and derive a priori estimates for the population risk, in the sense that the bounds depend only on the target function instead of the parameters obtained after training.

The a priori estimates are optimal in the sense that both the approximation error and the estimation error behave similarly to the Monte Carlo error rates.
The rest of the paper is organized as follows. In Section 2, we set up the problem and state our main theorem as well as a proof sketch. In Section 3 we give the full proof of the theorems. In Section 4, we compare our result with related works and put things into perspective. Conclusions are drawn in Section 5.
2 Setup of the problem and the main theorem
2.1 Setup
In this paper, we focus on the regression problem and residual networks with ReLU activation. Assume that the target function
. Let the training set be , where the ’s are independently sampled from an underlying distribution and . Consider the following residual network architecture with a skip connection in each layer. (In practice, standard residual networks use skip connections every two layers. We consider skip connections in every layer for the sake of clarity; it is easy to extend the analysis to cases where skip connections are used across multiple layers.)
(2.1) 
Here the set of parameters , , , , , is the number of layers, is the width of the residual blocks and
is the width of the skip connections. The ReLU activation function is σ(x) = max(0, x), and we extend it to vectors in a component-wise fashion. Note that we omit the bias term in the network by assuming that the first element of the input
is always 1. To simplify the proof, we will consider the truncated square loss
(2.2) 
Then the truncated population risk and empirical risk functions are
(2.3) 
In principle, we can also truncate the risk function, as is done below for the case with noise.
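As a sketch of these definitions, the following example computes a truncated square loss and the corresponding empirical risk. The truncation level `cap` is a stand-in for the paper's truncation constant, which is not specified here.

```python
import numpy as np

def truncated_sq_loss(pred, y, cap):
    """Square loss truncated at level `cap` (an assumed constant standing
    in for the truncation level used in (2.2))."""
    return np.minimum((pred - y) ** 2, cap)

def empirical_risk(preds, ys, cap):
    """Empirical risk: average truncated loss over the training sample."""
    return float(np.mean(truncated_sq_loss(preds, ys, cap)))
```

The population risk replaces the sample average by the expectation over the data distribution.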
Now we define the spectral norm (Klusowski and Barron, 2016) for the target function and the weighted path norm for residual networks.
Definition 2.1 (Spectral norm).
Let , and let be an extension of to , and
be the Fourier transform of
. Define the spectral norm of as(2.4) 
where the infimum is taken over all possible extensions .
Definition 2.2 (Weighted path norm).
Given a residual network with architecture (2.1), define the weighted path norm of as
(2.5) 
where , with being a vector or matrix, means taking the absolute values of all the entries of the vector or matrix.
Note that our weighted path norm is a weighted sum over all paths in the neural network flowing from the input to the output, where larger weights are given to the paths that go through more nonlinearities. More precisely, consider the following path : assume that goes through nonlinearities for layers and through skip connections for the other layers. For the nonlinear layers , assume that
goes through the neurons
, . (We denote by the th element of the vector and by the th element of the matrix .) For skip connections between the layers and , assume that goes through the neurons , , . In addition, assume that starts from the input , . Then the path is determined by this sequence of neurons. Define the weight of the path by
(2.6) 
and the activation of by
Then the output of the residual network can be written as
(2.7) 
and the weighted path norm is given by
(2.8) 
We see that is a weighted sum over all the paths, where the weight is determined by the number of nonlinearities encountered along the path.
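The path-sum picture can be checked numerically on a toy network: enumerate every choice of "skip vs. nonlinear branch" per layer, accumulate the absolute weights along each path times a factor per nonlinearity, and compare with the product-form recursion. The factor k = 3 and the matrix sizes are illustrative assumptions.

```python
import numpy as np
from itertools import product

# Numerical check of the path-sum representation on a toy network.
# k = 3 is an ASSUMED per-nonlinearity weight; the sizes are arbitrary.
k = 3.0
rng = np.random.default_rng(0)
d, m, D, L = 2, 2, 2, 2   # input dim, block width, trunk width, depth
V = rng.standard_normal((D, d))
Ws = [rng.standard_normal((m, D)) for _ in range(L)]
Us = [rng.standard_normal((D, m)) for _ in range(L)]
c = rng.standard_normal(D)

# Enumerate paths: at each layer a path either takes the skip connection
# (identity factor) or the ReLU branch (factor k |U_l| |W_l|).
total = 0.0
for choices in product([0, 1], repeat=L):
    M = np.abs(V)
    for l, nonlinear in enumerate(choices):
        if nonlinear:
            M = k * np.abs(Us[l]) @ np.abs(Ws[l]) @ M
    total += float(np.abs(c) @ M @ np.ones(d))

# Closed form: |c|^T prod_l (I + k |U_l| |W_l|) |V| 1
A = np.abs(V)
for W, U in zip(Ws, Us):
    A = A + k * np.abs(U) @ np.abs(W) @ A
closed = float(np.abs(c) @ A @ np.ones(d))
assert np.isclose(total, closed)
```

Expanding the product over layers yields exactly one term per subset of nonlinear layers, which is why the enumeration and the closed form agree.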
2.2 Main theorem
Theorem 2.3 (A priori estimate).
Let and assume that the residual network has architecture (2.1). Let be the number of training samples, be the number of layers and be the width of the residual blocks. Let and be the truncated population risk and empirical risk defined in (2.3), respectively; let be the spectral norm of and the weighted path norm of , as in Definitions 2.1 and 2.2. For , assume that is an optimal solution of the regularized model
(2.9) 
Then for any
, with probability at least
over the random training sample, the population risk satisfies the bound (2.10)
Remark.

The estimates are a priori in nature, since (2.10) depends only on the spectral norm of the target function and does not require knowledge of the norm of .

We want to emphasize that our estimate is nearly optimal. The first term in (2.10) shows that the convergence rate with respect to the size of the neural network is , which matches the rate in the universal approximation theory for shallow networks (Barron, 1993). The last two terms show that the rate with respect to the number of training samples is , which matches the classical estimates of the generalization gap.

The last term depends only on instead of the network architecture; thus there is no need to increase the sample size with the network size and in order to ensure convergence. This is not the case for existing error bounds (see Section 4).
2.3 Extension to noisy problems
Our a priori estimate can be extended to problems with sub-Gaussian noise. Assume that the ’s in the training data are given by , where the ’s are i.i.d. with and
(2.11) 
for some constants , and . Let be the square loss truncated by , and define
(2.12) 
Then, we have
Theorem 2.4 (A priori estimate for noisy problems).
In addition to the same conditions as in Theorem 2.3, assume that the noise satisfies (2.11). Let and be the truncated population risk and empirical risk defined in (2.12). For and , assume that is an optimal solution of the regularized model
(2.13) 
Then for any , with probability at least over the random training sample, the population risk satisfies
(2.14) 
We see that the a priori estimates for noisy problems differ from those for noiseless problems only by a logarithmic term. In particular, the estimates of the generalization error are still nearly optimal.
2.4 Proof sketch
We prove the main theorem in three steps. We list the main intermediate results in this section and leave the full proofs to Section 3.
2.4.1 Approximation error
First, we show that there exists a set of parameters such that and is controlled as .
Theorem 2.5.
For any distribution with compact support , and any target function with , there exists a residual network with depth and width , such that
(2.15) 
and .
2.4.2 A posteriori estimate
Second, we show that the weighted path norm can be used to bound the Rademacher complexity. Since the Rademacher complexity bounds the generalization gap, this gives the a posteriori estimates.
Recall the definition of Rademacher complexity:
Definition 2.6 (Rademacher complexity).
Given a function class and sample set , the (empirical) Rademacher complexity of with respect to is defined as
(2.16) 
where the ’s are independent random variables with . It is well-known that the generalization gap is controlled by the Rademacher complexity (Shalev-Shwartz and Ben-David, 2014).
Theorem 2.7.
Given a function class , for any , with probability at least over the random samples ,
(2.17) 
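Definition 2.6 can be estimated by Monte Carlo for simple classes. The sketch below does this for linear functions with unit l1-norm on the coefficients, for which the supremum over the class has a closed form; the class and the sample distribution are illustrative choices, not the ones used in the paper.

```python
import numpy as np

# Monte Carlo estimate of the empirical Rademacher complexity (2.16) for
# an ILLUSTRATIVE class: f_a(x) = a^T x with ||a||_1 <= 1.  For this class
# the supremum is attained at a coordinate vector, so
#   sup_{||a||_1<=1} (1/n) sum_i xi_i a^T x_i = (1/n) max_j |sum_i xi_i x_{ij}|.
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.uniform(-1.0, 1.0, size=(n, d))  # assumed sample distribution

def rademacher_complexity_l1(X, n_trials=2000, rng=rng):
    n = X.shape[0]
    vals = []
    for _ in range(n_trials):
        xi = rng.choice([-1.0, 1.0], size=n)        # Rademacher signs
        vals.append(np.max(np.abs(xi @ X)) / n)     # sup over the l1 ball
    return float(np.mean(vals))

R = rademacher_complexity_l1(X)
# The classical bound for this class is sqrt(2 log(2d) / n), about 0.15 here.
```

The estimate R decays like the Monte Carlo rate in n, which is the behavior exploited in Theorem 2.7.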
The following theorem is a crucial step in our analysis. It shows that the Rademacher complexity of residual networks can be controlled by the weighted path norm.
Theorem 2.8.
Let , where the ’s are residual networks defined by (2.1). Assume that the samples satisfy ; then we have
(2.18) 
Theorem 2.9 (A posteriori estimate).
Let be the weighted path norm of the residual network . Let be the number of training samples, and let and be the truncated population risk and empirical risk defined in (2.3). Then for any , with probability at least over the random training samples, we have
(2.19) 
2.4.3 A priori estimate
By comparing the definition of the objective function (2.9) with the a posteriori estimate (2.19), we conclude that for any ,
where is bounded with high probability in (2.19). Recalling that is the optimal solution of the objective function (2.9) and that corresponds to the approximation in Theorem 2.5, we get
Here the first and the third terms are upper-bounded with high probability; the second term is non-positive since minimizes (2.9); and are also upper-bounded, as shown in Theorem 2.5. These give us the a priori estimates in Theorem 2.3.
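Written out, the chain of comparisons above takes the following standard form (a sketch; we write $\mathcal{L}$ for the truncated population risk, $J$ for the regularized objective (2.9), $\hat\theta$ for its minimizer and $\tilde\theta$ for the approximant of Theorem 2.5, with these symbols assumed for illustration):

```latex
\mathcal{L}(\hat\theta)
  = \underbrace{\mathcal{L}(\hat\theta) - J(\hat\theta)}_{\text{bounded w.h.p.\ by the a posteriori estimate}}
  + \underbrace{J(\hat\theta) - J(\tilde\theta)}_{\le 0 \text{ by optimality of } \hat\theta}
  + \underbrace{J(\tilde\theta)}_{\text{bounded w.h.p.\ via Theorem 2.5}}
```

The middle term is the one controlled purely by the definition of the regularized model, which is what makes the final bound a priori.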
For problems with noise, we only need the following lemma:
Lemma 2.10.
Assume that the noise has zero mean and satisfies (2.11), and . For any we have
(2.20) 
The details of the proof are given in the following Section 3.
3 Proof
3.1 Approximation error
For the approximation error, E et al. (2018) give the following result for shallow networks.
Theorem 3.1.
For any distribution with compact support , and any target function with , there exists a one-hidden-layer network of width such that
(3.1) 
and
(3.2) 
For residual networks, we prove the approximation result by splitting the shallow network into several parts and stacking them vertically.
Proof of Theorem 2.5.
Recall the assumption that the first element of the input is always 1, so we can omit the bias terms in Theorem 3.1. Hence there exists a shallow network of width that satisfies
and
Now, we construct a residual network with input dimension , depth , width , and using
for . Then it is easy to verify that , and
∎
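The stacking construction can be sanity-checked numerically. The sketch below rewrites a random one-hidden-layer ReLU network of width L·m as an L-block residual network of width m: the skip connections carry the input unchanged, and an extra coordinate accumulates the partial sums. The specific matrix layout is an illustrative assumption consistent with the proof idea, not the paper's exact construction.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def shallow(x, a, B):
    """One-hidden-layer network f(x) = sum_j a_j relu(b_j . x)."""
    return float(a @ relu(B @ x))

def as_resnet(x, a, B, L):
    """The same function realized as an L-block residual network: the
    skip connection carries (x, accumulator); block l applies m of the
    L*m hidden units and adds their contribution to the accumulator."""
    d, m = x.size, len(a) // L
    h = np.concatenate([x, [0.0]])                 # state: (x, running sum)
    for l in range(L):
        sl = slice(l * m, (l + 1) * m)
        W = np.hstack([B[sl], np.zeros((m, 1))])   # reads only the x-part
        U = np.zeros((d + 1, m)); U[-1] = a[sl]    # writes the accumulator
        h = h + U @ relu(W @ h)
    return float(h[-1])                            # read off the accumulator

rng = np.random.default_rng(0)
d, m, L = 3, 4, 5
a = rng.standard_normal(L * m)
B = rng.standard_normal((L * m, d))
x = rng.standard_normal(d)
assert np.isclose(shallow(x, a, B), as_resnet(x, a, B, L))
```

Since the skip path preserves the x-part exactly, each block sees the original input, and the final coordinate equals the shallow network's output.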
3.2 A posteriori estimate
To bound the Rademacher complexity of residual networks, we first define the hidden neurons in the residual blocks and their corresponding path norm.
Definition 3.2.
Given a residual network defined by (2.1), let
(3.3) 
Let be the th element of , and define the weighted path norm
(3.4) 
where is the th row of .
Lemma 3.3 below establishes the relationship between and , and Lemma 3.4 gives properties of the corresponding function class. We omit the proofs here.
Lemma 3.3.
Lemma 3.4.
Let , then

for ;

and for .
Now we recall two lemmas about the Rademacher complexity (Shalev-Shwartz and Ben-David, 2014).
Lemma 3.5.
Let , and assume that the samples satisfy . Then
(3.7) 
Lemma 3.6.
Assume that the are Lipschitz continuous functions with uniform Lipschitz constant , i.e., for all . Then
(3.8) 
Proof of Theorem 2.8.
We first estimate the Rademacher complexity of . We do this by induction:
(3.9) 
By definition, . Hence, using Lemmas 3.5 and 3.6, the statement (3.9) holds for . Now assume that the result holds for ; for we have
where condition (1) is , and condition (2) is . The first inequality is due to the contraction lemma, while the third inequality is due to Lemma 3.4. Because is symmetric, we know
On the other hand, as , we have
Therefore, we have
Similarly, based on the control for the Rademacher complexity of , we get
∎
3.3 A priori estimate
Now we are ready to prove the main Theorem 2.3.
Proof of Theorem 2.3.
Let be the optimal solution of the regularized model (2.9), and let be the approximation in Theorem 2.5. Consider
(3.11) 
Finally, we deal with the noise and prove Theorem 2.4. For problems with noise, we decompose as
(3.16) 
Based on the results for the noiseless problems, in (3.16) we only need to estimate the first and the last terms. This can be done using Lemma 2.10.
Proof of Lemma 2.10.
Let , then we have
As and , we have
Let , then