It is important to understand the generalization performance of deep neural networks (DNNs). An open problem that has recently attracted substantial attention is why DNNs can generalize well even when the number of parameters is much greater than the number of training samples (Zhang et al., 2016). A promising approach has been extensively explored by considering DNNs with infinite width. This approach is much easier for theoretical analysis than finite-width networks, while it preserves the good generalization performance in the case that the number of parameters is much larger than the number of samples. A plausible path to understanding the generalization performance of finite-width DNNs can then be partitioned into two steps. One is to study the generalization error of infinite-width DNNs; the other is to study the gap between the finite-width DNN and the infinite-width DNN. We point out that the second step can be completed by utilizing probably approximately correct (PAC) theory, which is a common technique, a special case of which has been solved by Arora et al. (2019a).
Studies (Jacot et al., 2018; Mei et al., 2019; Chizat and Bach, 2018; Arora et al., 2019b) found that a fixed kernel can well characterize the behavior of two-layer DNNs with infinite width (the number of parameters is of order $m$, where $m$ is the width of the hidden layer and $m\to\infty$; also known as the "Neural Tangent Kernel (NTK) regime"), trained by gradient descent (GD). In other words, the output of the infinite-width DNN through GD training can be well characterized by the first-order Taylor expansion at the initial parameters, namely,
$$f(x;\theta)\approx f(x;\theta_0)+\nabla_\theta f(x;\theta_0)^\top(\theta-\theta_0), \tag{1.1}$$
where $f(x;\theta)$ is the DNN output, $\theta$ is the set of DNN parameters at a training step, and $\theta_0$ is the initial parameter set. Chizat and Bach (2018); Arora et al. (2019b); Mei et al. (2019); Zhang et al. (2019b,a) further show that this linearized network learns the minimum norm solution, which fits the training data while keeping the distance between $\theta$ and $\theta_0$ minimal. We here also partition the generalization problem of the infinite-width DNN into two sub-problems. One is to study the generalization performance of the finite-width network with the minimum norm solution. The other is to study how this generalization error relates to the DNN width. This research line is depicted in Fig. 1.
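As an illustrative numerical sketch (not an experiment from this paper), the linearization (1.1) can be checked directly for a width-$m$ two-layer ReLU network with $1/\sqrt{m}$ scaling: for a small parameter perturbation, the exact output and its first-order Taylor expansion at $\theta_0$ nearly coincide. All names below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 5, 4096                      # input dimension, hidden width

# Two-layer ReLU network with 1/sqrt(m) (NTK-style) scaling.
def forward(x, W, a):
    return a @ np.maximum(W @ x, 0.0) / np.sqrt(m)

W0 = rng.standard_normal((m, d))    # initial parameters theta_0
a0 = rng.standard_normal(m)
x = rng.standard_normal(d)

# Small parameter step theta = theta_0 + delta, as a few GD steps would produce.
dW = 1e-3 * rng.standard_normal((m, d))
da = 1e-3 * rng.standard_normal(m)

exact = forward(x, W0 + dW, a0 + da)

# First-order Taylor expansion around theta_0, as in eq. (1.1):
# grad wrt a is relu(W0 x)/sqrt(m); grad wrt W_k is a_k 1[w_k.x > 0] x / sqrt(m).
pre = W0 @ x
relu = np.maximum(pre, 0.0)
lin = (forward(x, W0, a0)
       + da @ relu / np.sqrt(m)
       + (a0 * (pre > 0)) @ (dW @ x) / np.sqrt(m))

print(abs(exact - lin))             # linearization error is tiny
```

The residual comes only from ReLU units whose pre-activation changes sign and from second-order cross terms, both of which vanish as the perturbation shrinks.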
Along this line, in this paper we start from the finite-width network with the minimum norm solution by considering the following minimization problem:
$$\min_{\theta}\ \|\theta-\theta_0\|_2 \quad \text{s.t.}\quad R_S(\theta)\le\lambda, \tag{1.2}$$
where $R_S(\theta)$ represents the empirical risk with the squared loss. Here $m$ denotes the network width, $\lambda$ represents a given positive value, and $n$ denotes the number of samples. The solution of model (1.2) is denoted by $\hat{\theta}$. Note that when $\lambda\to 0$, problem (1.2) is equivalent to the linear GD dynamics (1.1) (Zhang et al., 2019a).
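In the linearized (kernel) regime, the minimum norm problem with vanishing $\lambda$ reduces to least-norm interpolation, which can be sketched numerically. Here `Phi` is a hypothetical stand-in for the feature (gradient) matrix of the linearized model, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 10, 50                     # n samples, p parameters (over-parameterized)

Phi = rng.standard_normal((n, p)) # feature matrix of the linearized model
y = rng.standard_normal(n)
theta0 = rng.standard_normal(p)   # initialization

# Minimum norm solution of  min ||theta - theta0||_2  s.t.  Phi @ theta = y:
# shift to delta = theta - theta0 and take the least-norm solution via pinv.
delta = np.linalg.pinv(Phi) @ (y - Phi @ theta0)
theta = theta0 + delta

assert np.allclose(Phi @ theta, y)   # interpolates the training data
# delta lies in the row space of Phi, so no interpolant is closer to theta0.
```

Any other interpolating solution differs from `theta` by a null-space vector of `Phi`, which is orthogonal to `delta` and can only increase the distance from `theta0`.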
In this work, we first estimate an a priori generalization error of model (1.2), which depends on $\lambda$ and the target function $f$, but not on the trained parameters. To this end, we begin by constructing a network that satisfies the constraint of model (1.2); the construction depends on the target function $f$. The distance between the parameters of the constructed network (denoted by $\tilde{\theta}$) and the initial parameters $\theta_0$, i.e., $\|\tilde{\theta}-\theta_0\|_2$, is an upper bound of the distance attained by the minimum norm solution, based on which we estimate an upper bound of the Rademacher complexity of the function space spanned by the network after training. After that, the generalization error can be bounded through the Rademacher complexity. Similarly to E et al. (2018), this estimate is nearly optimal in the sense that the error rates are of the same order as the Monte Carlo error rates with respect to the neuron number $m$. Since the minimum norm solution with $\lambda\to 0$ (achieved by taking $m\to\infty$) is equivalent to the solution of an NN with a fixed kernel (Chizat and Bach, 2018; Arora et al., 2019b; Mei et al., 2019; Zhang et al., 2019b,a), this estimate also quantitatively fills the gap between the minimum norm solutions of the finite-width network and the infinite-width network (NTK regime) by taking $m\to\infty$ (Jacot et al., 2018; Mei et al., 2019). Finally, using Theorem 3.2 in Arora et al. (2019a), we bridge the gap between the infinite-width NN (NTK regime) and the general finite-width NN with gradient descent training. Such an a priori generalization error does not suffer from the curse of dimensionality, which indicates that DNNs can work on high-dimensional problems. In particular, our a priori generalization error without extra regularization provides insight into the widely observed good generalization performance of over-parameterized DNNs without explicit regularization in applications (Zhang et al., 2016).
The organization of this paper is as follows. We first discuss related works in section 2 and introduce notations and several useful results from the previous literature in sections 3 and 4. For a given fixed value $\lambda$, section 5 shows that for each solution of (1.2), the corresponding a priori generalization error can be explicitly bounded, where the constant in the bound only depends on the target function and the initial parameters. We end the paper by providing an a priori generalization error for finite-width two-layer ReLU NNs.
2 Related work
The main technique in this paper follows E et al. (2018). We point out the important differences between this paper and E et al. (2018) as follows. The result of E et al. (2018) is general for two-layer ReLU networks; however, it requires an extra regularization term in the training loss function. This work does not explicitly require the extra regularization in the loss function, but the minimum norm solution may require the width of the network to be large enough.
The minimization term in model (1.2) comes from the implicit bias of DNNs. Zhang et al. (2019b) show that in the NTK regime, the minimum-norm implicit bias is equivalent to another implicit bias, the Frequency Principle (F-Principle), i.e., DNNs prefer low frequencies (Xu et al., 2018; Xu, 2018; Xu et al., 2019; Luo et al., 2019; Rahaman et al., 2018). Understanding these implicit biases is important for the better use of DNNs; for instance, the F-Principle guides the design of DNNs that can fit high-frequency functions (Cai et al., 2019; Cai and Xu, 2019). Zhang et al. (2019b) then estimate an a priori generalization error based on an FP-norm of the target function, which depends on the Fourier transform of the target function, but not on the norm of the DNN parameters. Compared with Zhang et al. (2019b), this work provides another important view to understand the good generalization performance of DNNs: (i) in the training process, optimization methods tune parameters at each step rather than acting directly on the DNN output, and by minimizing the distance from the initial parameters, DNNs can achieve a good generalization error bound; (ii) many studies have constructed various norms of parameters to estimate a posteriori generalization errors (Bartlett, 1998; Neyshabur et al., 2017), such as the path norm (Neyshabur et al., 2015). This work shows that the intrinsic norm employed by DNNs may lead to a natural a priori generalization error.
3 Notations
Here we first introduce the notations used in this paper. Let $f$ be the target function, and let $\{(x_i,y_i)\}_{i=1}^{n}$ be a fixed i.i.d. sample of size $n$ drawn from an underlying distribution $\mathcal{D}$, with labels $y_i=f(x_i)$. The training set is denoted by
$$S=\{(x_i,y_i)\}_{i=1}^{n}.$$
We consider the two-layer neural network
$$f(x;\theta)=\frac{1}{\sqrt{m}}\sum_{k=1}^{m}a_k\,\sigma(w_k^\top x),$$
where $\sigma(z)=\max(z,0)$ is the ReLU function. Let $\theta=(a_1,\dots,a_m,w_1,\dots,w_m)$ denote all parameters and $m$ denote the network width. For the initialization, we choose $\theta_0$ with entries i.i.d. drawn from the distribution $N(0,1)$, whose scale is the same as in Jacot et al. (2018).
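As a quick numerical sanity check (a sketch assuming the $N(0,1)$ initialization and $1/\sqrt{m}$ scaling just described), the network output at initialization has variance of order one regardless of the width:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 3
x = rng.standard_normal(d)
x /= np.linalg.norm(x)                # unit-norm input for a clean comparison

def init_output(m):
    W = rng.standard_normal((m, d))   # entries i.i.d. N(0, 1), as in Jacot et al.
    a = rng.standard_normal(m)
    return a @ np.maximum(W @ x, 0.0) / np.sqrt(m)   # 1/sqrt(m) scaling

# Output variance at initialization stays O(1) as the width grows
# (for unit-norm x it concentrates near E[a^2] * E[relu(g)^2] = 1/2).
for m in (100, 10000):
    samples = np.array([init_output(m) for _ in range(1000)])
    print(m, samples.var())
```

Without the $1/\sqrt{m}$ factor the output variance would grow linearly in $m$, which is why this scaling is the natural one for the infinite-width limit.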
(Spectral norm). Following E et al. (2018), for a given function $f$ defined on $\Omega$, we extend $f$ to a function $F$ defined on $\mathbb{R}^d$. Let $\hat{F}$ be the Fourier transform of $F$; then for $x\in\Omega$,
$$f(x)=\int_{\mathbb{R}^d}\hat{F}(\omega)e^{i\langle\omega,x\rangle}\,\mathrm{d}\omega.$$
We define the spectral norm of $f$ by
$$\gamma(f)=\inf_{F|_{\Omega}=f}\int_{\mathbb{R}^d}\|\omega\|_1^2\,|\hat{F}(\omega)|\,\mathrm{d}\omega.$$
We introduce the squared loss
$$\ell(y,y')=\frac{1}{2}(y-y')^2,$$
the generalization error (expected risk)
$$R_{\mathcal{D}}(\theta)=\mathbb{E}_{(x,y)\sim\mathcal{D}}\,\ell\big(f(x;\theta),y\big),$$
and the empirical risk
$$R_S(\theta)=\frac{1}{n}\sum_{i=1}^{n}\ell\big(f(x_i;\theta),y_i\big).$$
4 Preliminary Results
Before the detailed discussion, we present several useful results on the approximation error and generalization bound for two-layer ReLU neural networks (see E et al., 2018; Breiman, 1993; Klusowski and Barron, 2016; Barron, 1993).
(Proposition 1 in E et al. (2018)). For any $f$ with $\gamma(f)<\infty$, one has the integral representation:
Here the normalization constant is chosen such that the representing measure is a probability distribution, and it satisfies an explicit bound in terms of $\gamma(f)$.
(Rademacher complexity). Let $\mathcal{H}$ be a hypothesis space. The Rademacher complexity of $\mathcal{H}$ with respect to samples $S=\{x_1,\dots,x_n\}$ is defined as
$$\mathrm{Rad}_S(\mathcal{H})=\frac{1}{n}\,\mathbb{E}_{\xi}\Big[\sup_{h\in\mathcal{H}}\sum_{i=1}^{n}\xi_i h(x_i)\Big],$$
where $\xi_1,\dots,\xi_n$ are independent random variables with the probability $\mathbb{P}(\xi_i=1)=\mathbb{P}(\xi_i=-1)=1/2$.
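For intuition, the definition can be estimated by straightforward Monte Carlo over the signs $\xi$ for a small finite hypothesis class (a hypothetical example for illustration, with each hypothesis represented by its predictions on the sample):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 200, 5
# A finite hypothesis class, represented by its predictions on the n samples.
H = rng.uniform(-1.0, 1.0, size=(k, n))

# Monte Carlo estimate of Rad_S(H) = (1/n) E_xi sup_h sum_i xi_i h(x_i).
def rademacher_estimate(H, trials=4000, rng=rng):
    n = H.shape[1]
    total = 0.0
    for _ in range(trials):
        xi = rng.choice([-1.0, 1.0], size=n)   # i.i.d. signs, P(+-1) = 1/2
        total += (H @ xi).max() / n
    return total / trials

est = rademacher_estimate(H)
print(est)   # small: roughly O(sqrt(log k) / sqrt(n)) for a finite class
```

For finite classes Massart's lemma predicts the $O(\sqrt{\log k/n})$ scaling that the estimate exhibits; the lemma below plays the analogous role for the (infinite) class of bounded-norm two-layer networks.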
(Theorem 26.5 of Shalev-Shwartz and Ben-David (2014)). Fix a hypothesis space $\mathcal{H}$. Assume that for any $h\in\mathcal{H}$ and any sample $z$, $|\ell(h,z)|\le c$. Then for any $\delta\in(0,1)$, with probability at least $1-\delta$ over the choice of $S$, we have for all $h\in\mathcal{H}$,
$$R_{\mathcal{D}}(h)\le R_S(h)+2\,\mathrm{Rad}_S(\ell\circ\mathcal{H})+4c\sqrt{\frac{2\ln(4/\delta)}{n}}.$$
(Lemma 26.9 in Shalev-Shwartz and Ben-David (2014)). Let $\phi:\mathbb{R}\to\mathbb{R}$ be a $\rho$-Lipschitz function. For any $h\in\mathcal{H}$, let $\phi\circ\mathcal{H}=\{\phi\circ h: h\in\mathcal{H}\}$; then we have
$$\mathrm{Rad}_S(\phi\circ\mathcal{H})\le \rho\,\mathrm{Rad}_S(\mathcal{H}).$$
5 Main results
5.1 An a priori generalization error for finite-width NN with minimum norm solution
This section focuses on bounding the generalization error of each solution to the minimization problem (1.2). At the beginning, it is necessary to prove that the minimization problem (1.2) does admit a solution. To this end, following E et al. (2018), a set of parameters $\tilde{\theta}$ is constructed that satisfies the constraint of model (1.2), as shown in the following theorem.
For any distribution $\mathcal{D}$ supported on $\Omega$ and any width $m\in\mathbb{N}$, let $S$ be a fixed i.i.d. sample drawn from the distribution $\mathcal{D}$. Then there exists a two-layer neural network $f(x;\tilde{\theta})$ of width $m$ such that
$$R_S(\tilde{\theta})\le \frac{C\gamma(f)^2}{m} \tag{5.1}$$
for an absolute constant $C$.
We first consider the case in which $f$ itself admits the integral representation of Proposition 4.1. Set the Monte Carlo estimator by
$$f_m(x;\tilde{\theta})=\frac{1}{m}\sum_{k=1}^{m}a_k\,\sigma(w_k^\top x),$$
absorbing the $1/\sqrt{m}$ scaling into the coefficients $a_k$. Then we have
$$\mathbb{E}_{\tilde{\theta}}\,f_m(x;\tilde{\theta})=f(x),$$
where $\{(a_k,w_k)\}_{k=1}^{m}$ is a set of samples i.i.d. randomly drawn from the representing distribution. Since the estimator is unbiased and the magnitudes of $a_k$ and $w_k$ are controlled by $\gamma(f)$, which follows from Proposition 4.1, for any $x\in\Omega$ the variance can be upper bounded:
$$\mathrm{Var}\big(f_m(x;\tilde{\theta})\big)\le \frac{C\gamma(f)^2}{m}.$$
Hence, taking the expectation of the empirical risk over the random draw of $\tilde{\theta}$, we have
$$\mathbb{E}_{\tilde{\theta}}\,R_S(\tilde{\theta})=\frac{1}{2n}\sum_{i=1}^{n}\mathrm{Var}\big(f_m(x_i;\tilde{\theta})\big)\le \frac{C\gamma(f)^2}{2m}.$$
Therefore there must exist a set of parameters $\tilde{\theta}$ such that the corresponding empirical risk satisfies
$$R_S(\tilde{\theta})\le \frac{C\gamma(f)^2}{2m}.$$
For the general case, we apply the above argument to $f$ with its affine part subtracted; the affine part can be represented exactly by ReLU units, so the resulting function (we still denote the parameters by $\tilde{\theta}$) is a two-layer neural network with width $m$ and ReLU activation function. The main estimate (5.1) still holds. The proof is complete. ∎
From the proof of Theorem 5.1, the parameters for the general case can be written down explicitly for all $k=1,\dots,m$, and their magnitudes are controlled by $\gamma(f)$, which implies that the distance of the parameters from the initialization can be bounded in $L^2$-norm as
$$\|\tilde{\theta}-\theta_0\|_2\le Q, \tag{5.2}$$
for a quantity $Q$ depending only on $\gamma(f)$ and $m$.
It follows from Theorem 5.1 that for any $\lambda>0$, there exists a set of parameters $\tilde{\theta}$ such that $R_S(\tilde{\theta})\le\lambda$ whenever the neural network width satisfies $m\ge C\gamma(f)^2/\lambda$. Note that this requirement on the width is independent of the dimension of the input and the number of samples; it depends only on the target function $f$ and the parameter $\lambda$.
if the neural network width satisfies $m\ge C\gamma(f)^2/\lambda$. Together with Remark 5.1 (ii), we further have
We now estimate the Rademacher complexity of the function class spanned by the network after training, which is defined in the following lemma.
Let $\mathcal{F}_Q=\{f(x;\theta):\|\theta-\theta_0\|_2\le Q\}$ be the set of two-layer neural networks with ReLU activation function whose parameters stay within distance $Q$ of the initialization. Then we have
Without loss of generality, we may take $\theta_0=0$; otherwise we can define the shifted functions $\hat{f}(x;\theta):=f(x;\theta+\theta_0)$ and the corresponding shifted class. Then
We further have
where the first inequality holds by the Cauchy–Schwarz inequality over the sample dimension, and the third inequality follows from the inequality between the arithmetic and geometric means. Hence we have that
Noting the symmetry, we arrive at
The proof is complete. ∎
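For comparison, for the linear class $\{x\mapsto w^\top x:\|w\|_2\le Q\}$ the supremum in the Rademacher complexity is attained in closed form, so the classical bound $Q\max_i\|x_i\|_2/\sqrt{n}$ can be checked numerically (an illustrative sketch, not the two-layer class of the lemma above):

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, Q = 400, 10, 2.0
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm samples

# For the linear class {x -> w.x : ||w||_2 <= Q}, the sup is attained in
# closed form: sup_w sum_i xi_i w.x_i = Q * || sum_i xi_i x_i ||_2.
def rad_linear(X, Q, trials=2000, rng=rng):
    n = X.shape[0]
    vals = []
    for _ in range(trials):
        xi = rng.choice([-1.0, 1.0], size=n)
        vals.append(Q * np.linalg.norm(xi @ X) / n)
    return np.mean(vals)

est = rad_linear(X, Q)
bound = Q / np.sqrt(n)       # classical bound: Q * max_i ||x_i||_2 / sqrt(n)
print(est, bound)            # the estimate sits just below the bound
```

The two-layer ReLU class in the lemma behaves analogously: the complexity is controlled by the parameter-norm radius $Q$ and decays like $1/\sqrt{n}$.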
Assume that the loss function $\ell$ is Lipschitz continuous with Lipschitz constant $L$ and bounded by $B$. Then for any $\delta\in(0,1)$, with probability at least $1-\delta$, we have
The above analysis shows that the generalization error can be bounded via Proposition 5.1 and estimate (5.3). Thus we obtain our main result, which bounds the generalization error, in the following theorem.
Let $\hat{\theta}$ be a solution of (1.2) with width $m$. Then for any $\delta\in(0,1)$, there exists a constant $C$ such that for any $\lambda>0$ and any $n$, with probability at least $1-\delta$ over the choice of $S$, the following inequality holds:
where the constant depends only on the target function and the initialization, and $Q$ is defined by (5.2).
In particular, we choose the parameter $\lambda$ appropriately. Then the minimization problem (1.2) can be represented as
where $R_S$ denotes the empirical risk. Here $m$ denotes the network width, $\lambda$ represents a given positive value, and $n$ denotes the number of samples. It follows from Theorems 5.1 and 5.2 that the minimization problem (5.8) has at least one solution, denoted by $\hat{\theta}$, and the corresponding generalization error can be bounded. We summarize these results in the following theorem.
Let $\hat{\theta}$ be a solution of (5.8) with width $m$. Then for any $\delta\in(0,1)$, with probability at least $1-\delta$ over the choice of $S$, the following inequality holds:
Similarly to E et al. (2018), this estimate is nearly optimal in the sense that the error rates are of the same order as the Monte Carlo error rates with respect to the neuron number $m$.
5.2 An a priori generalization error for finite-width NN with gradient training
The gradient descent (GD) dynamics for the two-layer ReLU neural network is well known to follow
$$\frac{\mathrm{d}u(t)}{\mathrm{d}t}=-H(t)\,\big(u(t)-y\big),$$
where $u(t)=\big(f(x_1;\theta(t)),\dots,f(x_n;\theta(t))\big)^\top$ and $H(t)$ is the Gram matrix induced by the neural tangent kernel (NTK).
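These linear dynamics can be simulated directly once a Gram matrix is fixed. In the sketch below, `H` is a generic positive-definite stand-in for the NTK Gram matrix (not the actual ReLU NTK formula), and the ODE is discretized with a small step size:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 8
# A fixed positive-definite Gram matrix standing in for the limiting NTK.
A = rng.standard_normal((n, n))
H = A @ A.T / n + 0.1 * np.eye(n)
y = rng.standard_normal(n)

# Discretize du/dt = -H (u - y) with small step eta (linear GD dynamics).
u = np.zeros(n)
eta = 0.01
for _ in range(5000):
    u -= eta * H @ (u - y)

print(np.linalg.norm(u - y))   # residual decays like exp(-lambda_min * t)
```

Each eigenmode of the residual $u(t)-y$ decays at a rate set by the corresponding eigenvalue of `H`, which is why the least eigenvalue of the NTK Gram matrix governs the convergence of GD in this regime.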
where $H(t)_{ij}=\big\langle\nabla_\theta f(x_i;\theta(t)),\nabla_\theta f(x_j;\theta(t))\big\rangle$. We denote the solution to the infinite-width GD dynamics by $f^{\mathrm{ntk}}$. We introduce the notations given in Arora et al. (2019a) by
The gap between the solutions to the finite-width GD dynamics and the infinite-width GD dynamics has been estimated in Arora et al. (2019a). We introduce this result for the two-layer ReLU neural network as follows.
(Theorem 3.2 in Arora et al. (2019a)). For the two-layer ReLU neural network, suppose the initialization scale is sufficiently small and the network width satisfies a polynomial lower bound in $n$, $1/\epsilon$, $1/\delta$, and the inverse of the least eigenvalue of the limiting NTK Gram matrix. Then, for any test point $x$ with $\|x\|_2=1$, with probability at least $1-\delta$ over the random initialization, we have
$$\big|f(x;\theta(t))-f^{\mathrm{ntk}}(x)\big|\le \epsilon.$$
From Theorem 5.4, the gap between the solutions to the finite-width GD dynamics and the infinite-width GD dynamics decays polynomially in the network width. The following theorem presents the generalization error for solutions to the infinite-width GD dynamics.
Let $f^{\mathrm{ntk}}$ be defined as in (5.12) and denote its generalization error by $R_{\mathcal{D}}(f^{\mathrm{ntk}})$. Then for any $\delta\in(0,1)$, with probability at least $1-\delta$ over the choice of $S$, the following inequality holds: