1 Introduction
It is important to understand the generalization performance of deep neural networks (DNNs). An open problem that has recently attracted substantial attention is why DNNs can generalize well even when the number of parameters is much greater than the number of training samples (Zhang et al., 2016). A promising approach has been extensively exploited by considering DNNs with infinite width. Compared with finite-width networks, this approach is much easier for theoretical analysis, while it preserves the good generalization performance in the regime where the number of parameters is much larger than the number of samples. A promising path to understanding the generalization performance of finite-width DNNs can then be partitioned into two steps. One is to study the generalization error of infinite-width DNNs; the other is to study the gap between the finite-width DNN and the infinite-width DNN. We point out that the second step can be completed by utilizing probably approximately correct (PAC) theory, which is a common technique; for example, a special case has been solved by Arora et al. (2019a). Studies (Jacot et al., 2018; Mei et al., 2019; Chizat and Bach, 2018; Arora et al., 2019b) found that a fixed kernel can well characterize the behavior of two-layer DNNs with infinite width (with the number of parameters of order m, where m is the width of the hidden layer and m → ∞; also known as the "Neural Tangent Kernel (NTK) regime") trained by gradient descent (GD). In other words, the output of the infinite-width DNN through GD training can be well characterized by the first-order Taylor expansion at the initial parameters, namely,
f_θ(x) ≈ f_{θ_0}(x) + ∇_θ f_{θ_0}(x)·(θ − θ_0),   (1.1)
where f_θ(x) is the DNN output, θ is the set of DNN parameters at a given training time, and θ_0 is the initial parameter set. Chizat and Bach (2018); Arora et al. (2019b); Mei et al. (2019); Zhang et al. (2019b, a) further show that this linearized network learns the minimum norm solution, which fits the training data while keeping the distance between θ and θ_0 minimal. We here also partition the generalization problem of the infinite-width DNN into two subproblems. One is to study the generalization performance of the finite-width network with the minimum norm solution. The other is to study how this generalization error relates to the DNN width. This research line is depicted in Fig. 1.
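As a sanity check on the linearization (1.1), the following sketch (our own illustration, not from the paper; the width, input dimension, and perturbation scale are arbitrary choices) compares a small two-layer ReLU network with its first-order Taylor expansion at the initialization under a small parameter perturbation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 5, 10000  # input dim, hidden width (larger width tightens the linearization)

# NTK-style scaling: f(x) = (1/sqrt(m)) * a^T relu(W x)
W0 = rng.normal(size=(m, d))
a0 = rng.normal(size=m)

def f(W, a, x):
    return a @ np.maximum(W @ x, 0.0) / np.sqrt(m)

x = rng.normal(size=d)

# Small parameter perturbation: theta = theta_0 + delta
dW = 1e-3 * rng.normal(size=(m, d))
da = 1e-3 * rng.normal(size=m)

# First-order Taylor expansion at theta_0, as in (1.1):
#   grad wrt a_k: relu(<w_k, x>)/sqrt(m)
#   grad wrt w_k: a_k * 1{<w_k, x> > 0} * x / sqrt(m)
h = W0 @ x
relu_h = np.maximum(h, 0.0)
lin = (f(W0, a0, x)
       + da @ relu_h / np.sqrt(m)
       + (a0 * (h > 0)) @ (dW @ x) / np.sqrt(m))

exact = f(W0 + dW, a0 + da, x)
err = abs(exact - lin)  # higher-order remainder; small for small perturbations
```

The remainder `err` shrinks with the perturbation scale, which is the content of the linearized (NTK) description of training near initialization.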
Along this line, in this paper we start from the finite-width network with the minimum norm solution by considering the following minimization problem
min_θ ‖θ − θ_0‖_2   s.t.   R_S(θ) ≤ ε,   (1.2)
where R_S(θ) represents the empirical risk of the network f_θ. Here m denotes the network width, ε represents a given positive value, and n denotes the number of samples. The solution of model (1.2) is denoted by θ*. Note that when ε = 0, problem (1.2) is equivalent to the linear GD dynamics (1.1) (Zhang et al., 2019a).
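To illustrate the minimum-norm flavor of (1.2), here is a hedged sketch in the frozen-feature (linearized) setting: with a feature matrix `Phi` standing in for the gradients at initialization (a stand-in, not the paper's construction), the Moore–Penrose pseudo-inverse yields the interpolating parameter update of minimal L2 norm, and any other interpolant is strictly longer:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 8, 50  # n samples, p parameters (overparameterized: p > n)

# Hypothetical frozen features: row i plays the role of grad_theta f(x_i) at theta_0.
Phi = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Minimum-norm interpolant: delta* = pinv(Phi) @ y solves
#   min ||delta||_2  s.t.  Phi delta = y   (the eps -> 0 analogue of (1.2))
delta_star = np.linalg.pinv(Phi) @ y
assert np.allclose(Phi @ delta_star, y)  # it fits the data exactly

# Any other interpolant differs by a null-space component and has larger norm.
null_dir = np.linalg.svd(Phi)[2][-1]  # last right singular vector: Phi @ null_dir ~ 0
other = delta_star + 0.5 * null_dir
```

Because `delta_star` lies in the row space of `Phi` and `null_dir` is orthogonal to it, `other` fits the same data but has strictly larger norm, which is exactly the quantity (1.2) minimizes.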
In this work, we first estimate an a priori generalization error of model (1.2), which depends on m, n and the target function f, but not on θ. To this end, we begin by constructing a network that satisfies the constraint of model (1.2) and depends on the target function f. The distance between the parameters of the constructed network (denoted by θ̃) and the initial parameters θ_0, i.e., ‖θ̃ − θ_0‖, is an upper bound of ‖θ* − θ_0‖, based on which we estimate an upper bound of the Rademacher complexity of the function space spanned by the network after training. After that, the generalization error can be bounded through the Rademacher complexity. Similarly to E et al. (2018), this estimate is nearly optimal in the sense that the error rate scales in the same way as the Monte Carlo error rate with respect to the neuron number m. Since the minimum norm solution with ε = 0 (achieved by taking m → ∞) is equivalent to the solution of a NN with a fixed kernel (Chizat and Bach, 2018; Arora et al., 2019b; Mei et al., 2019; Zhang et al., 2019b, a), this estimate also quantitatively fills the gap between the minimum norm solutions of the finite-width network and the infinite-width network (NTK regime) by taking m → ∞ (Jacot et al., 2018; Mei et al., 2019). Finally, using Theorem 3.2 in Arora et al. (2019a), we bridge the gap between the infinite-width NN (NTK regime) and the general finite-width NN with gradient descent training. Such an a priori generalization error does not suffer from the curse of dimensionality, which indicates that DNNs can work on high-dimensional problems. In particular, our a priori generalization error without extra regularization provides insight into the widely observed good generalization performance of over-parameterized DNNs without regularization in applications (Zhang et al., 2016).
The organization of this paper is as follows. We first discuss related works in section 2, and introduce notations and several useful results proved in the previous literature in sections 3 and 4. For a given fixed value ε, section 5 shows that for each solution of (1.2), the corresponding a priori generalization error can be bounded, where the constant in the bound depends only on the target function and the initial parameters. We end the paper by providing an a priori generalization error for finite-width two-layer ReLU NNs trained by gradient descent.
2 Related work
The main technique in this paper follows E et al. (2018). We point out the important differences between this paper and E et al. (2018) as follows. The result of E et al. (2018) is general for two-layer ReLU networks; however, E et al. (2018) requires an extra regularization term in the training loss function. This work does not explicitly require the extra regularization in the loss function, but the minimum norm solution may require the width of the network to be large enough.
The minimization term in model (1.2) comes from the implicit bias of DNNs. Zhang et al. (2019b) shows that in the NTK regime, the minimum-norm implicit bias is equivalent to another implicit bias of the Frequency Principle (F-Principle), i.e., a DNN prefers low frequencies (Xu et al., 2018; Xu, 2018; Xu et al., 2019; Luo et al., 2019; Rahaman et al., 2018). Understanding these implicit biases is important for the better use of DNNs; for instance, the F-Principle guides the design of DNNs that can fit high-frequency functions (Cai et al., 2019; Cai and Xu, 2019). Zhang et al. (2019b) then estimates an a priori generalization error based on an FP-norm of the target function, which depends on the Fourier transform of the target function, but not on the norm of DNN parameters. Compared with Zhang et al. (2019b), this work provides another important view on the good generalization performance of DNNs: (i) In the training process, optimization methods tune parameters at each step rather than acting directly on the DNN output, and by minimizing the distance from the initial parameters, DNNs can achieve a good generalization error bound; (ii) Many studies have constructed various norms of parameters to estimate a posteriori generalization errors (Bartlett, 1998; Neyshabur et al., 2017), such as the path norm (Neyshabur et al., 2015). This work shows that the intrinsic norm employed by DNNs may lead to a natural a priori generalization error.

3 Notations
Here we first introduce the notations used in this paper. Let f be the target function defined on a domain Ω, let {x_i}_{i=1}^n be a fixed i.i.d. sample of size n drawn from an underlying distribution D on Ω, and let y_i = f(x_i) be the corresponding labels. The training set is denoted by S = {(x_i, y_i)}_{i=1}^n, and the two-layer fully connected neural network with the ReLU (rectified linear units) activation function is denoted by
f_θ(x) = (1/√m) Σ_{k=1}^m a_k σ(w_k·x),   (3.1)
where θ = {(a_k, w_k)}_{k=1}^m and σ(t) = max(t, 0) is the ReLU function. Let θ denote all parameters and m denote the network width. For the initialization, the entries of a_k(0) and w_k(0) are i.i.d. drawn from the standard normal distribution, whose scale is the same as in Jacot et al. (2018).
Definition 3.1.
(Spectral norm). Following E et al. (2018), for a given f we extend it to a function F on ℝ^d. Let F̂ be the Fourier transform of F; then f(x) = ∫ F̂(ω) e^{i⟨ω,x⟩} dω for x in the domain. We define the spectral norm of f by
γ(f) = ∫_{ℝ^d} ‖ω‖_1² |F̂(ω)| dω.
Assumption 3.2.
We assume the target function f has finite spectral norm, i.e., γ(f) < ∞.
We introduce the squared loss by

ℓ(y′, y) = (1/2)(y′ − y)²,   (3.2)

and the generalization error (expected risk) by

R_D(θ) = E_{x∼D} (1/2)(f_θ(x) − f(x))²,

and the empirical risk by

R_S(θ) = (1/n) Σ_{i=1}^n (1/2)(f_θ(x_i) − y_i)².   (3.3)
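The empirical risk (3.3) and the generalization error can be contrasted numerically; the sketch below (an illustration with a hypothetical 1-D target and a fixed predictor, not part of the paper's setup) estimates both:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical 1-D target and a fixed candidate predictor.
f_target = lambda x: np.sin(x)
f_model  = lambda x: x - x**3 / 6  # cubic Taylor proxy for sin near 0

# Empirical risk (3.3): average squared loss over the n training samples.
n = 100
xs = rng.uniform(-1.0, 1.0, size=n)
ys = f_target(xs)
emp_risk = np.mean((f_model(xs) - ys) ** 2) / 2

# Generalization error (expected risk): Monte Carlo estimate on fresh samples.
xt = rng.uniform(-1.0, 1.0, size=200000)
exp_risk = np.mean((f_model(xt) - f_target(xt)) ** 2) / 2
```

For this smooth predictor the two risks nearly coincide; the paper's results bound how far they can differ for trained networks.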
4 Preliminary Results
Before the detailed discussion, we present several useful results on the approximation error and generalization bound for two-layer ReLU neural networks (see E et al., 2018; Breiman, 1993; Klusowski and Barron, 2016; Barron, 1993).
Proposition 4.1.
(Proposition 1 in E et al. (2018)). For any f with finite spectral norm, one has the integral representation:
where
Here the normalization constant is chosen so that the representing density integrates to one.
Definition 4.1.
(Rademacher complexity). Let H be a hypothesis space, i.e., a set of functions. The Rademacher complexity of H with respect to a sample S = {z_1, …, z_n} is defined as

Rad_S(H) = (1/n) E_ξ [ sup_{h∈H} Σ_{i=1}^n ξ_i h(z_i) ],

where ξ_1, …, ξ_n are independent random variables with the probability P(ξ_i = 1) = P(ξ_i = −1) = 1/2.
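For a concrete feel of the definition, the sketch below estimates the empirical Rademacher complexity of the unit-ball linear class by Monte Carlo, using the closed form of the supremum (the class and the sample are illustrative choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 50, 3
X = rng.normal(size=(n, d))  # sample z_1, ..., z_n

# For the linear class H = {x -> <w, x> : ||w||_2 <= 1}, the supremum has the
# closed form sup_w <w, sum_i xi_i x_i> = ||sum_i xi_i x_i||_2, so
#   Rad_S(H) = (1/n) E_xi ||sum_i xi_i x_i||_2.
trials = 2000
vals = []
for _ in range(trials):
    xi = rng.choice([-1.0, 1.0], size=n)  # P(xi_i = +-1) = 1/2
    vals.append(np.linalg.norm(xi @ X))
rad = np.mean(vals) / n
```

By Jensen's inequality the estimate stays below sqrt(Σ_i ‖x_i‖²)/n, the standard bound for this class, and it shrinks as n grows.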
The generalization gap can be estimated by the Rademacher complexity through the following theorem (see Bartlett and Mendelson (2002); Shalev-Shwartz and Ben-David (2014)).
Theorem 4.2.
(Theorem 26.5 of Shalev-Shwartz and Ben-David (2014)). Fix a hypothesis space H. Assume that for any h ∈ H and any sample z, |ℓ(h, z)| ≤ c. Then for any δ ∈ (0, 1), with probability at least 1 − δ over the choice of S, for all h ∈ H we have

R_D(h) − R_S(h) ≤ 2 Rad_S(ℓ ∘ H) + 4c √(2 ln(4/δ)/n).
Lemma 4.1.
(Lemma 26.10 in Shalev-Shwartz and Ben-David (2014)). Let H = {x ↦ ⟨w, x⟩ : ‖w‖_2 ≤ B} be the class of bounded linear functions. Then Rad_S(H) ≤ (B max_i ‖x_i‖_2)/√n.
Lemma 4.2.
(Lemma 26.9 in Shalev-Shwartz and Ben-David (2014)). Let φ: ℝ → ℝ be a ρ-Lipschitz function. For any hypothesis space H, let φ ∘ H = {z ↦ φ(h(z)) : h ∈ H}; then we have Rad_S(φ ∘ H) ≤ ρ Rad_S(H).
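The contraction property can be checked numerically on a small finite class; in the sketch below (an illustration with ReLU as the 1-Lipschitz function, class and sample arbitrary), the Monte Carlo estimate of the contracted complexity indeed does not exceed that of the original class:

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 40, 3
X = rng.normal(size=(n, d))

# A small finite "class": 20 linear predictors h_w(x) = <w, x>.
Ws = rng.normal(size=(20, d))
H_vals = X @ Ws.T                   # (n, 20): h_w(x_i) per predictor
phi_H_vals = np.maximum(H_vals, 0)  # ReLU composition; ReLU is 1-Lipschitz

def emp_rad(vals, trials=4000):
    """Monte Carlo empirical Rademacher complexity of a finite class,
    given the matrix vals[i, j] = h_j(z_i)."""
    total = 0.0
    for _ in range(trials):
        xi = rng.choice([-1.0, 1.0], size=n)
        total += np.max(xi @ vals)  # sup over the finite class
    return total / (trials * n)

rad_H = emp_rad(H_vals)
rad_phi = emp_rad(phi_H_vals)
```

Up to Monte Carlo noise, `rad_phi <= rad_H`, matching the lemma with ρ = 1; this is exactly how the ReLU is peeled off in the proof of Lemma 5.1 below.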
5 Main results
5.1 An a priori generalization error for finite-width NN with minimum norm solution
This section focuses on bounding the generalization error of each solution to the minimization problem (1.2). At the beginning, it is necessary to prove that the minimization problem (1.2) indeed admits a solution. To this end, following E et al. (2018), a set of parameters θ̃ is constructed that satisfies the constraint of (1.2), as shown in the following theorem.
Theorem 5.1.
For any distribution D and any n ∈ ℕ, let S be a fixed i.i.d. sample of size n drawn from the distribution D. Then there exists a two-layer neural network f_θ̃ of width m such that
R_S(θ̃) ≤ C γ(f)²/m,   (5.1)
where C is an absolute constant.
Proof.
We first consider a simplified case. Set the Monte Carlo estimator by
Then we have
where the parameters are a set of samples i.i.d. randomly drawn from the distribution in the integral representation. Since
and
and, as follows from Proposition 4.1, we have
Furthermore, for any x_i, the variance can be upper bounded since
Hence, we have
Therefore, there must exist a set of parameters such that the corresponding empirical risk satisfies
For the general case, we have (still denoting the parameters by θ̃) that the estimator is a two-layer neural network with width m and the ReLU activation function. The main estimate (5.1) still holds. The proof is complete. ∎
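The Monte Carlo argument behind Theorem 5.1 predicts a squared error decaying like 1/m in the width. The sketch below checks this rate on a toy target whose integral representation is known in closed form, namely E_w[σ(⟨w, x⟩)] = ‖x‖/√(2π) for w ∼ N(0, I); this is an illustrative representation of our own, not the one from Proposition 4.1:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 2
x = np.array([0.6, 0.8])  # ||x|| = 1

# Exact value of the integral representation: E[relu(<w, x>)] = ||x|| / sqrt(2*pi).
target = np.linalg.norm(x) / np.sqrt(2 * np.pi)

def mc_error_sq(m, reps=200):
    """Average squared error of the width-m Monte Carlo estimator
    (1/m) sum_k relu(<w_k, x>), over fresh parameter draws."""
    errs = []
    for _ in range(reps):
        W = rng.normal(size=(m, d))
        est = np.mean(np.maximum(W @ x, 0.0))
        errs.append((est - target) ** 2)
    return np.mean(errs)

e_small, e_large = mc_error_sq(100), mc_error_sq(1600)
```

Increasing the width by a factor of 16 should shrink the mean squared error by roughly the same factor, which is the O(γ(f)²/m) behavior the theorem formalizes.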
Remark 5.1.

From the proof of Theorem 5.1, the parameters for the general case can be represented explicitly for all k, which implies that the parameters in the L2 norm can be bounded by
(5.2) 
It follows from Theorem 5.1 that for any ε > 0, there exists a set of parameters θ̃ such that R_S(θ̃) ≤ ε when the neural network width m is large enough. Note that the requirement on the width is independent of the dimension of the input and the number of samples; it depends only on the target function f and the parameter ε.
Denote by θ* one solution to problem (1.2). From Remark 5.1 (iii), we have ‖θ* − θ_0‖ ≤ ‖θ̃ − θ_0‖ if the neural network width m is large enough. Together with Remark 5.1 (ii), we further have
(5.3) 
We now estimate the Rademacher complexity of the corresponding function class, which is defined in the following lemma.
Lemma 5.1.
Let F be the set of two-layer neural networks with the ReLU activation function and parameters bounded in the L2 norm. Then we have
Proof.
Without loss of generality, let the bias be zero; otherwise we can define x̃ = (x, 1) and w̃ = (w, b). Then
We further have
(5.4)  
where the first inequality holds by the Cauchy–Schwarz inequality, with d the dimension of the samples, and the third inequality is based on the inequality of arithmetic and geometric means. Hence we have that
By symmetry, we arrive at
(5.5)  
Noting that the ReLU function σ is Lipschitz continuous with Lipschitz constant one, and applying Lemmas 4.1 and 4.2 together with (5.4) and (5.5), we have
The proof is complete. ∎
Proposition 5.1.
Assume that the loss function is Lipschitz continuous and bounded. Then for any δ ∈ (0, 1), with probability at least 1 − δ, we have
(5.6) 
Proof.
The above analysis shows that the generalization error can be bounded via Proposition 5.1 and estimate (5.3). Thus we obtain our main result, which bounds the generalization error in the following theorem.
Theorem 5.2.
Let θ* be a solution of (1.2) with sample size n. Then for any δ ∈ (0, 1), there exists a constant C such that for any m and any n, with probability at least 1 − δ over the choice of S, the following inequality holds:
(5.7) 
where the parameter norm bound is given by (5.2).
Proof.
In particular, we choose a specific value of the parameter ε. Then the minimization problem (1.2) can be represented as
(5.8) 
where R_S denotes the empirical risk. Here m denotes the network width and n denotes the number of samples. It follows from Theorems 5.1 and 5.2 that the minimization problem (5.8) has at least one solution, denoted by θ*, and the corresponding generalization error can be bounded. We summarise these results in the following theorem.
Theorem 5.3.
Let θ* be a solution of (5.8). Then for any δ ∈ (0, 1), with probability at least 1 − δ over the choice of S, the following inequality holds:
(5.9) 
Remark 5.2.
Similarly to E et al. (2018), this estimate is nearly optimal in the sense that the error rate scales in the same way as the Monte Carlo error rate with respect to the neuron number m.
5.2 An a priori generalization error for finite-width NN with gradient descent training
The gradient descent (GD) dynamics for the two-layer ReLU neural network is well known to be

d f_{θ(t)}(x)/dt = −(1/n) Σ_{i=1}^n K_m(x, x_i)(f_{θ(t)}(x_i) − y_i),   (5.10)

where K_m is the neural tangent kernel (NTK).
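The empirical NTK of the network (3.1) can be formed explicitly as the Gram matrix of parameter gradients. The sketch below (our own illustration with arbitrary small data) verifies that this Gram matrix is positive semi-definite and that, at large width, two independent initializations give nearly the same kernel, consistent with the fixed-kernel limit:

```python
import numpy as np

rng = np.random.default_rng(5)
d, n = 3, 4
X = rng.normal(size=(n, d))

def ntk_gram(m, seed):
    """Empirical NTK Gram matrix K_m(x_i, x_j) = <grad_theta f(x_i), grad_theta f(x_j)>
    for f(x) = (1/sqrt(m)) sum_k a_k relu(<w_k, x>) at a random initialization."""
    r = np.random.default_rng(seed)
    W = r.normal(size=(m, d))
    a = r.normal(size=m)
    H = X @ W.T                  # (n, m) pre-activations
    act = np.maximum(H, 0.0)     # relu(H): gradient wrt a_k (up to 1/sqrt(m))
    ind = (H > 0).astype(float)  # relu'(H): enters the gradient wrt w_k
    # Gradient wrt w_k contributes a_k^2 relu'(h_i) relu'(h_j) <x_i, x_j>.
    K_a = act @ act.T / m
    K_w = ((ind * a) @ (ind * a).T) * (X @ X.T) / m
    return K_a + K_w

K1 = ntk_gram(200000, 0)
K2 = ntk_gram(200000, 1)
eigs = np.linalg.eigvalsh(K1)  # PSD by construction (Gram + Schur product of PSD)
```

Shrinking `m` makes the two kernels visibly disagree, which is the finite-width fluctuation that Theorem 5.4 controls.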
Remark 5.3.
It has been proven in (Jacot et al., 2018; Mei et al., 2019) that the NTK becomes fixed as m → ∞. More specifically, K_m → K as m → ∞, where K is a deterministic kernel determined by the initialization distribution. We denote the solution to the infinite-width GD dynamics accordingly. We introduce the notations given in Arora et al. (2019a) by
(5.11)  
(5.12) 
The gap between the solutions to the finite-width GD dynamics and the infinite-width GD dynamics has been estimated in Arora et al. (2019a). We state this result for the two-layer ReLU neural network as follows.
Theorem 5.4.
(Theorem 3.2 in Arora et al. (2019a)). For the two-layer ReLU neural network, suppose the network width m is sufficiently large. Then, for any input x with bounded norm, with probability at least 1 − δ over the random initialization, we have
(5.13) 
By Theorem 5.4, the gap between the solutions to the finite-width GD dynamics and the infinite-width GD dynamics decays polynomially in the network width. The following theorem presents the generalization error for solutions to the infinite-width GD dynamics.
Theorem 5.5.
Let the solution be defined as in (5.12) and denote its generalization error accordingly. Then for any δ ∈ (0, 1), with probability at least 1 − δ over the choice of S, the following inequality holds: