A priori generalization error for two-layer ReLU neural network through minimum norm solution

12/06/2019 ∙ by Zhi-Qin John Xu, et al. ∙ Shanghai Jiao Tong University, Institute for Advanced Study, Wuhan University

We focus on estimating the a priori generalization error of two-layer ReLU neural networks (NNs) trained with the mean squared error, an estimate that depends only on the initial parameters and the target function, through the following research line. We first estimate the a priori generalization error of a finite-width two-layer ReLU NN under the constraint of the minimum norm solution, which is proved by <cit.> to be equivalent to the solution of a linearized (w.r.t. parameters) finite-width two-layer NN. As the width goes to infinity, the linearized NN converges to the NN in the Neural Tangent Kernel (NTK) regime <cit.>. Thus, we can derive the a priori generalization error of a two-layer ReLU NN in the NTK regime. The distance between the NN in the NTK regime and a finite-width NN trained by gradient descent is estimated by <cit.>. Based on the results in <cit.>, our work proves an a priori generalization error bound for two-layer ReLU NNs. This estimate uses the intrinsic implicit bias of the minimum norm solution without requiring extra regularization in the loss function. This a priori estimate also implies that NNs do not suffer from the curse of dimensionality, and that a small generalization error can be achieved without an exponentially large number of neurons. In addition, the research line proposed in this paper can also be used to study other properties of finite-width networks, such as the a posteriori generalization error.


1 Introduction

It is important to understand the generalization performance of deep neural networks (DNNs). An open problem that has recently attracted substantial attention is why DNNs can generalize well even when the number of parameters is much greater than the number of training samples (Zhang et al., 2016).

A promising approach has been extensively exploited by considering DNNs with infinite width. Compared with finite-width networks, this approach is much easier for theoretical analysis, while it preserves the good generalization performance in the regime where the number of parameters is much larger than the number of samples. A plausible path to understanding the generalization performance of finite-width DNNs can then be partitioned into two steps. One is to study the generalization error of infinite-width DNNs; the other is to study the gap between the finite-width DNN and the infinite-width DNN. We point out that the second step can be completed by utilizing probably approximately correct (PAC) theory, which is a common technique; for example, a special case of this gap estimate has been solved by Arora et al. (2019a).

Studies (Jacot et al., 2018; Mei et al., 2019; Chizat and Bach, 2018; Arora et al., 2019b) found that a fixed kernel can well characterize the behavior of two-layer DNNs with infinite width (that is, as the width of the hidden layer tends to infinity, a setting also known as the "Neural Tangent Kernel (NTK) regime") trained by gradient descent (GD). In other words, the output of the infinite-width DNN through GD training can be well characterized by the first-order Taylor expansion at the initial parameters, namely,

$f(x;\theta) \approx f(x;\theta_0) + \nabla_\theta f(x;\theta_0)^\top (\theta - \theta_0),$   (1.1)

where $f(x;\theta)$ is the DNN output, $\theta$ is the set of DNN parameters at a point during training, and $\theta_0$ is the initial parameter set. Chizat and Bach (2018); Arora et al. (2019b); Mei et al. (2019); Zhang et al. (2019b, a) further show that this linearized network learns the minimum norm solution, which fits the training data while keeping the distance between $\theta$ and $\theta_0$ minimal. We here also partition the generalization problem of the infinite-width DNN into two sub-problems. One is to study the generalization performance of the finite-width network with the minimum norm solution. The other is to study how this generalization error relates to the DNN width. This research line is depicted in Fig. 1.
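As a concrete check of the linearization (1.1), the following minimal numpy sketch compares a two-layer ReLU network with its first-order Taylor expansion around the initial parameters for a small parameter perturbation; the network form, the 1/sqrt(m) scaling, and all variable names are our illustrative assumptions rather than the paper's code.

import numpy as np

def two_layer_relu(x, a, W):
    """Two-layer ReLU network f(x; theta) = (1/sqrt(m)) * sum_k a_k * relu(w_k . x)."""
    m = a.shape[0]
    return (a @ np.maximum(W @ x, 0.0)) / np.sqrt(m)

def linearized(x, a, W, a0, W0):
    """First-order Taylor expansion of f around the initial parameters (a0, W0)."""
    m = a0.shape[0]
    pre0 = W0 @ x                            # pre-activations at initialization
    act0 = np.maximum(pre0, 0.0)             # relu(w0_k . x)
    ind0 = (pre0 > 0).astype(float)          # relu'(w0_k . x)
    f0 = (a0 @ act0) / np.sqrt(m)
    df_da = act0 / np.sqrt(m)                                # gradient w.r.t. a
    df_dW = (a0 * ind0)[:, None] * x[None, :] / np.sqrt(m)   # gradient w.r.t. W
    return f0 + df_da @ (a - a0) + np.sum(df_dW * (W - W0))

rng = np.random.default_rng(0)
d, m = 5, 10000
x = rng.normal(size=d) / np.sqrt(d)
a0, W0 = rng.normal(size=m), rng.normal(size=(m, d))
a, W = a0 + 1e-2 * rng.normal(size=m), W0 + 1e-2 * rng.normal(size=(m, d))
print(two_layer_relu(x, a, W), linearized(x, a, W, a0, W0))  # nearly identical values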


Figure 1: We propose the following analysis framework to estimate the generalization error of a finite-width neural network trained by gradient descent. In this work, we aim to estimate the a priori generalization error of a finite-width network under the gradient descent (GD) dynamics (5.11). In Arora et al. (2019a), the gap between the finite-width network and the infinite-width NN (NTK regime) has been estimated. The NN in the NTK regime can be obtained from the finite-width NN under the linearization constraint (1.1) by taking the width to infinity (Jacot et al., 2018; Mei et al., 2019). Furthermore, the equivalence between the finite-width two-layer NN under the constraint of the minimum norm solution and the linearized NN is proved in Zhang et al. (2019a). Therefore, the estimate for the GD-trained network is transferred to an estimate for the minimum norm solution. In this paper, we present the a priori generalization error of the minimum norm solution (1.2) and then take the width to infinity to produce the a priori estimate for the GD-trained finite-width network.

Along this research line, in this paper we start from the finite-width network with the minimum norm solution by considering the following minimization problem:

(1.2)

where the constraint involves the empirical risk on the training set; the remaining quantities are the network width, a given positive value, and the number of samples. The solutions of model (1.2) are the objects studied below. Note that, in a suitable limit, problem (1.2) is equivalent to the linearized GD dynamics (1.1) (Zhang et al., 2019a).
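The body of the minimization problem (1.2) did not survive extraction. A plausible LaTeX rendering, consistent with the surrounding description (minimize the distance from the initial parameters subject to a small empirical risk) but with notation of our choosing rather than a verbatim reproduction of the paper's formula, is:

% Presumed form of (1.2): minimum norm solution under an empirical-risk constraint.
% theta_0 is the initial parameter set, \hat{R}_S the empirical risk on the training
% set S of size n, m the network width, and \varepsilon > 0 a given positive value.
\begin{equation}
  \min_{\theta}\ \|\theta - \theta_{0}\|_{2}
  \quad \text{subject to} \quad
  \hat{R}_{S}(\theta) \le \varepsilon .
  \tag{1.2}
\end{equation}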

In this work, we first estimate the a priori generalization error of model (1.2), which depends on the initial parameters and the target function, but not on the trained parameters themselves. To this end, we begin by constructing a network that satisfies the constraint of model (1.2) and depends on the target function. The distance between the parameters of the constructed network and the initial parameters is an upper bound on the corresponding distance for the minimum norm solution, based on which we estimate an upper bound on the Rademacher complexity of the function space spanned by the network after training. The generalization error can then be bounded through the Rademacher complexity. Similarly to E et al. (2018), this estimate is nearly optimal in the sense that the error rates match the Monte Carlo error rates w.r.t. the number of neurons. Since the minimum norm solution in the infinite-width limit is equivalent to the solution of an NN with a fixed kernel (Chizat and Bach, 2018; Arora et al., 2019b; Mei et al., 2019; Zhang et al., 2019b, a), this estimate also quantitatively fills the gap between the minimum norm solutions of the finite-width network and the infinite-width network (NTK regime) by taking the width to infinity (Jacot et al., 2018; Mei et al., 2019). Finally, using Theorem 3.2 in Arora et al. (2019a), we bridge the gap between the infinite-width NN (NTK regime) and a general finite-width NN trained by gradient descent. Such an a priori generalization error does not suffer from the curse of dimensionality, which indicates that DNNs can work on high-dimensional problems. In particular, our a priori generalization error without extra regularization provides insight into the widely observed good generalization performance of over-parameterized DNNs trained without explicit regularization in applications (Zhang et al., 2016).

The organization of this paper is as follows. We first discuss related work in section 2, and introduce notation and several useful results proved in the previous literature in sections 3 and 4. For a given fixed tolerance, section 5 shows that for each solution of (1.2), the corresponding a priori generalization error can be bounded by a quantity whose constant depends only on the target function and the initial parameters. We end the paper by providing an a priori generalization error for finite-width two-layer ReLU NNs.

2 Related work

The main technique in this paper follows E et al. (2018). We point out the important difference between this paper and E et al. (2018) as follows. The result of E et al. (2018) is general for two-layer ReLU networks; however, E et al. (2018) requires an extra regularization term in the training loss function. This work does not explicitly require such a term in the loss function, but the minimum norm solution may require the width of the network to be large enough.

The minimization term in model (1.2) comes from the implicit bias of DNNs. Zhang et al. (2019b) show that, in the NTK regime, the minimum-norm implicit bias is equivalent to another implicit bias, the Frequency Principle (F-Principle), i.e., DNNs prefer low frequencies (Xu et al., 2018; Xu, 2018; Xu et al., 2019; Luo et al., 2019; Rahaman et al., 2018). Understanding these implicit biases is important for the better use of DNNs; for instance, the F-Principle guides the design of DNNs that can capture high-frequency functions (Cai et al., 2019; Cai and Xu, 2019). Zhang et al. (2019b) then estimate an a priori generalization error based on an FP-norm of the target function, which depends on the Fourier transform of the target function but not on the norm of the DNN parameters. Compared with Zhang et al. (2019b), this work provides another important view for understanding the good generalization performance of DNNs: (i) in the training process, optimization methods tune parameters at each step rather than acting directly on the DNN output, and by minimizing the distance from the initial parameters, DNNs can achieve a good generalization error bound; (ii) many studies have constructed various norms of parameters to estimate a posteriori generalization errors (Bartlett, 1998; Neyshabur et al., 2017), such as the path norm (Neyshabur et al., 2015); this work shows that the intrinsic norm employed by DNNs may lead to a natural a priori generalization error.

3 Notations

Here we first introduce the notation used in this paper. We consider a bounded target function, a fixed i.i.d. sample drawn from an underlying distribution on the input domain, and the corresponding labels. The training set is denoted by

and the two-layer fully connected neural network with the ReLU (rectified linear unit) activation function is denoted by

(3.1)

where the parameters consist of the output weights and the hidden-layer weights, and $\sigma(\cdot)$ is the ReLU function. We let $\theta$ denote all parameters and $m$ denote the network width. For the initialization, the entries of the initial parameter set $\theta_0$ are drawn i.i.d. from a distribution whose scale is the same as in Jacot et al. (2018).
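To make the setup concrete, here is a minimal numpy sketch of a two-layer ReLU network of the form (3.1) with an NTK-style initialization; the 1/sqrt(m) scaling in the forward pass, the unit-sphere inputs, and all names are our assumptions following Jacot et al. (2018), not necessarily the exact convention of this paper.

import numpy as np

def init_params(m, d, rng):
    """NTK-style initialization: all entries i.i.d. standard normal.
    (The 1/sqrt(m) factor is placed in the forward pass, as in Jacot et al. (2018);
    whether the paper scales the network or the initialization is our assumption.)"""
    a0 = rng.normal(size=m)        # output weights a_k
    W0 = rng.normal(size=(m, d))   # hidden weights w_k
    return a0, W0

def f(x, a, W):
    """Two-layer ReLU network (3.1): f(x; theta) = (1/sqrt(m)) * sum_k a_k * relu(w_k . x)."""
    return (a @ np.maximum(W @ x, 0.0)) / np.sqrt(a.shape[0])

rng = np.random.default_rng(0)
d, m, n = 10, 4096, 100
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # inputs on the unit sphere (assumption)
a0, W0 = init_params(m, d, rng)
print(np.std([f(x, a0, W0) for x in X]))        # output scale stays O(1) as m grows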

Definition 3.1.

(Spectral norm). Following E et al. (2018), for a given function defined on the input domain, we consider an extension to the whole space and its Fourier transform, and we define the spectral norm of the function in terms of this Fourier transform.
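The formula in Definition 3.1 did not survive extraction. For orientation, a common Barron-type spectral norm of this kind, written with notation of our choosing, reads as follows; whether the weight $\|\omega\|_1^2$ matches the exact definition in E et al. (2018) and Klusowski and Barron (2016) is our assumption.

% Barron-type spectral norm: infimum over extensions F of f from the domain Omega
% to R^d of a weighted integral of the Fourier transform of F (our rendering).
\begin{equation*}
  \gamma(f) \;=\; \inf_{F:\,F|_{\Omega}=f}\ \int_{\mathbb{R}^{d}} \|\omega\|_{1}^{2}\,\bigl|\hat{F}(\omega)\bigr|\,\mathrm{d}\omega .
\end{equation*}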

Assumption 3.2.

Following Breiman (1993) and Klusowski and Barron (2016), we consider target functions that are bounded and have finite spectral norm. Define the space of such functions accordingly.

We assume that the target function belongs to this space.

We introduce the squared loss by

(3.2)

and the generalization error (expected risk) by

and the empirical risk by

(3.3)
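The bodies of (3.2) and (3.3) are missing from the extracted text. The standard definitions consistent with the surrounding prose, written in our own notation (squared loss, expected risk over the data distribution, empirical risk over the training set), would be:

% Presumed standard forms of the squared loss (3.2), the generalization error
% (expected risk), and the empirical risk (3.3); the notation is ours.
\begin{align}
  \ell\bigl(f(x;\theta),y\bigr) &= \bigl(f(x;\theta)-y\bigr)^{2}, \tag{3.2}\\
  R(\theta) &= \mathbb{E}_{x\sim\mathcal{D}}\,\bigl(f(x;\theta)-f^{*}(x)\bigr)^{2}, \notag\\
  \hat{R}_{S}(\theta) &= \frac{1}{n}\sum_{i=1}^{n}\bigl(f(x_{i};\theta)-y_{i}\bigr)^{2}. \tag{3.3}
\end{align}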

4 Preliminary Results

Before the detailed discussion, we present several useful results on the approximation error and generalization bounds for two-layer ReLU neural networks (see E et al., 2018; Breiman, 1993; Klusowski and Barron, 2016; Barron, 1993).

Proposition 4.1.

(Proposition 1 in E et al. (2018)). For any target function in the space defined above, one has the integral representation:

where

Here the normalization constant is chosen so that the representation defines a probability distribution, and it admits an explicit bound.

Definition 4.1.

(Rademacher complexity). Let a hypothesis space be given. The Rademacher complexity of the hypothesis space with respect to the samples is defined as

where the Rademacher variables are independent random variables taking the values $+1$ and $-1$ with probability $1/2$ each.
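The empirical Rademacher complexity can be approximated numerically by averaging, over random sign vectors, the supremum of the correlation between the signs and the hypothesis values; the sketch below does this for a toy finite hypothesis class (the class, the sizes, and all names are illustrative choices of ours, not objects from the paper).

import numpy as np

def empirical_rademacher(H_values, n_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity
    (1/n) * E_sigma [ sup_{h in H} sum_i sigma_i * h(x_i) ],
    where H_values[k, i] = h_k(x_i) for a finite hypothesis class."""
    rng = np.random.default_rng(seed)
    n = H_values.shape[1]
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)   # i.i.d. signs, P(+1) = P(-1) = 1/2
        total += np.max(H_values @ sigma) / n     # supremum over the finite class
    return total / n_draws

# Toy example: 50 random hypotheses evaluated on 200 sample points.
rng = np.random.default_rng(1)
H_values = rng.normal(size=(50, 200))
print(empirical_rademacher(H_values))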

The generalization gap can be estimated in terms of the Rademacher complexity through the following theorem (see Bartlett and Mendelson (2002); Shalev-Shwartz and Ben-David (2014)).

Theorem 4.2.

(Theorem 26.5 of Shalev-Shwartz and Ben-David (2014)). Fix a hypothesis space and assume that the loss is bounded for every hypothesis and every data point. Then, with probability at least the prescribed confidence level over the choice of the training sample, we have

Lemma 4.1.

(Lemma 26.11 in Shalev-Shwartz and Ben-David (2014)). Let the samples be vectors in a Euclidean space. Then the Rademacher complexity of the associated set has the following upper bound:

Lemma 4.2.

(Lemma 26.9 in Shalev-Shwartz and Ben-David (2014), the contraction lemma). Let a Lipschitz function with a given constant be given. Composing each hypothesis with this function, we have

5 Main results

5.1 An a priori generalization error for finite-width NN with minimum norm solution

This section focuses on bounding the generalization error of each solution to the minimization problem (1.2). At the beginning, it is necessary to prove that the minimization problem (1.2) does have a solution. To this end, following E et al. (2018), a set of parameters is constructed that satisfies the constraint, as shown in the following theorem.

Theorem 5.1.

For any distribution satisfying the assumptions above and any fixed i.i.d. sample drawn from it, there exists a two-layer neural network of the prescribed width such that

(5.1)
Proof.

We first consider a special case. Define the Monte Carlo estimator by

Then we have

where the features are a set of samples drawn i.i.d. from the distribution in the integral representation of Proposition 4.1. Since

and

and , which follows from Proposition 4.1, we have

Furthermore, for any sample point, the variance can be upper bounded since

Hence, we have

Therefore, there must exist a set of parameters such that the corresponding empirical risk satisfies

For the general case,

we have (still denoting the parameters in the same way) that

is a two-layer neural network with the prescribed width and ReLU activation function. The main estimate (5.1) still holds. The proof is complete. ∎
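To make the Monte Carlo construction concrete, the sketch below approximates a simple one-dimensional target that has an exact random-feature representation by an average of randomly drawn ReLU features, and checks that the empirical squared error decays roughly like 1/m; the target, the sampling distribution, and all names are toy choices of ours, not the objects of the proof.

import numpy as np

rng = np.random.default_rng(0)

def target(x):
    """Toy target with an exact random-feature representation:
    f*(x) = E_{w ~ N(0,1)} [ relu(w * x) ] = |x| / sqrt(2*pi)."""
    return np.abs(x) / np.sqrt(2 * np.pi)

def monte_carlo_net(x, w):
    """Monte Carlo estimator: average of m randomly drawn ReLU features."""
    return np.maximum(np.outer(x, w), 0.0).mean(axis=1)

x = rng.uniform(-1.0, 1.0, size=500)         # sample points
for m in (100, 1000, 10000):
    errs = []
    for _ in range(20):                      # average over random draws of the features
        w = rng.normal(size=m)
        errs.append(np.mean((monte_carlo_net(x, w) - target(x)) ** 2))
    print(m, np.mean(errs))                  # mean squared error decays roughly like 1/m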

Remark 5.1.
  1. Most of the proof follows Theorem 2 in E et al. (2018); here we focus on the empirical risk (3.3).

  2. From the proof of Theorem 5.1, the parameters for the general case can be represented as

    for all , and

    which implies that the parameters can be bounded in the L2-norm by

    (5.2)
  3. It follows from Theorem 5.1 that for any prescribed tolerance, there exists a set of parameters satisfying the constraint, provided the neural network width is large enough. Note that the requirement on the width is independent of the input dimension and the number of samples; it depends only on the target function and the tolerance.

Consider one solution to problem (1.2). From Remark 5.1 (3), we have

if the neural network width satisfies the requirement above. Together with Remark 5.1 (2), we further have

(5.3)

We now estimate the Rademacher complexity of the relevant function class, which is defined in the following lemma.

Lemma 5.1.

Let the hypothesis class be the corresponding set of two-layer neural networks with ReLU activation function. Then we have

Proof.

Without loss of generality, we may normalize the parameters; otherwise we can rescale them accordingly. Then

We further have

(5.4)

where the first inequality holds by the Cauchy-Schwarz inequality over the sample dimension, and the third inequality is based on the inequality of arithmetic and geometric means. Hence we have that

Noting the symmetry, we arrive at

(5.5)

Noting that the ReLU function is Lipschitz continuous with Lipschitz constant one, and applying Lemmas 4.1 and 4.2 together with (5.4) and (5.5), we have

The proof is complete. ∎

Proposition 5.1.

Assume that the loss function is Lipschitz continuous with a given Lipschitz constant and is bounded. Then, with the prescribed probability over the choice of the training sample, we have

(5.6)
Proof.

Defining the induced loss class, we have

where we use the results in Lemmas 4.2 and 5.1. Then Theorem 4.2 yields the generalization bound (5.6). The proof is complete. ∎

The above analysis shows that the generalization error can be bounded via Proposition 5.1 and the estimate (5.3). Thus, we obtain our main result, which bounds the generalization error, in the following theorem.

Theorem 5.2.

Let a solution of (1.2) with the given sample size be fixed. Then there exists a constant such that, for any admissible width and confidence level, with the corresponding probability over the choice of the training sample, the following inequality holds:

(5.7)

where the constants are explicit, one of them being defined by (5.2).

Proof.

For all inputs in the domain, we have that

Thus the Lipschitz constant of the squared loss and its bound follow. Then Proposition 5.1 and (5.3) yield the result (5.7). ∎
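The missing constants in the proof above follow from a generic calculation: if the network output is bounded by some constant B on the domain and the labels are bounded by 1, then on this range the squared loss is Lipschitz and bounded as follows (the specific value of B used in the paper is not recoverable from the extracted text).

% Generic Lipschitz constant and bound for the squared loss on a bounded range,
% assuming |u|, |v| <= B (network outputs) and |y| <= 1 (labels).
\begin{align*}
  \bigl|(u-y)^{2}-(v-y)^{2}\bigr| &= |u+v-2y|\,|u-v| \;\le\; 2(B+1)\,|u-v|,\\
  (u-y)^{2} &\le (B+1)^{2}.
\end{align*}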

In particular, we choose a specific value of the parameter. Then the minimization problem (1.2) can be represented as

(5.8)

where the empirical risk, the network width, the given positive value, and the number of samples are as before. It follows from Theorems 5.1 and 5.2 that the minimization problem (5.8) has at least one solution and that the corresponding generalization error can be bounded. We summarize these facts in the following theorem.

Theorem 5.3.

Let a solution of (5.8) with the given sample size be fixed. Then, with the prescribed probability over the choice of the training sample, the following inequality holds:

(5.9)
Remark 5.2.

Similarly to E et al. (2018), this estimate is nearly optimal in the sense that the error rates match the Monte Carlo error rates with respect to the number of neurons.

5.2 An a priori generalization error for finite-width NN with gradient training

The gradient descent (GD) dynamics for the two-layer ReLU neural network is well known:

(5.10)

where the kernel appearing in the dynamics is the neural tangent kernel (NTK).
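The body of (5.10) is missing from the extracted text. The standard form of these kernel gradient-flow dynamics on the training outputs, consistent with Remark 5.3 below (the factor 2 coming from the squared loss (3.2) without a 1/2 prefactor) and with the 1/n from the empirical risk (3.3), would be the following; the notation u(t) for the vector of network outputs on the training inputs, and the exact placement of the 1/n factor, are our assumptions.

% Presumed form of the GD (gradient-flow) dynamics (5.10), with
% u(t) = (f(x_1; theta(t)), ..., f(x_n; theta(t))) and labels y = (y_1, ..., y_n).
\begin{equation}
  \frac{\mathrm{d}u(t)}{\mathrm{d}t} \;=\; -\,\frac{2}{n}\,H(t)\,\bigl(u(t)-y\bigr),
  \qquad
  H_{ij}(t) \;=\; \bigl\langle \nabla_{\theta} f(x_{i};\theta(t)),\, \nabla_{\theta} f(x_{j};\theta(t)) \bigr\rangle .
  \tag{5.10}
\end{equation}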

Remark 5.3.

Note that the factor 2 in the GD dynamics (5.10) comes from the derivative of the loss function (3.2). Some papers, such as E et al. (2019) and Arora et al. (2019b), remove it by multiplying the loss function by 1/2. We point this out to avoid misunderstanding.

It has been proven in (Jacot et al., 2018; Mei et al., 2019) that the NTK becomes a fixed kernel as the width tends to infinity. More specifically, the NTK converges to a limiting kernel in this limit. Here,

where the limiting kernel is given in closed form. We denote the solution to the infinite-width GD dynamics accordingly. We introduce the notation given in Arora et al. (2019a) by

(5.11)
(5.12)
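To see this fixed-kernel phenomenon numerically, the sketch below computes the empirical NTK Gram matrix of a finite-width two-layer ReLU network at initialization for two widths; the displayed entries visibly stabilize as the width grows (the 1/sqrt(m) scaling, the unit-sphere inputs, and all names are our illustrative assumptions).

import numpy as np

def ntk_gram(X, a, W):
    """Empirical NTK Gram matrix H_ij = <grad_theta f(x_i), grad_theta f(x_j)>
    for the two-layer ReLU network f(x) = (1/sqrt(m)) * sum_k a_k * relu(w_k . x)."""
    m = a.shape[0]
    pre = X @ W.T                              # (n, m) pre-activations w_k . x_i
    act = np.maximum(pre, 0.0)                 # relu(w_k . x_i)
    ind = (pre > 0).astype(float)              # relu'(w_k . x_i)
    H_a = act @ act.T / m                              # gradients w.r.t. the output weights a_k
    H_w = ((ind * a**2) @ ind.T / m) * (X @ X.T)       # gradients w.r.t. the hidden weights w_k
    return H_a + H_w

rng = np.random.default_rng(0)
d, n = 5, 8
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)      # inputs on the unit sphere
for m in (1_000, 100_000):
    a, W = rng.normal(size=m), rng.normal(size=(m, d))
    print(m, np.round(ntk_gram(X, a, W)[0, :3], 3))  # entries stabilize as m grows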

The gap between solutions to the finite-width GD dynamics and the infinite-width GD dynamics has been estimated in Arora et al. (2019a). We state this result for the two-layer ReLU neural network as follows.

Theorem 5.4.

(Theorem 3.2 in Arora et al. (2019a)). For the two-layer ReLU neural network, suppose the assumptions of that theorem hold and the network width is sufficiently large. Then, for any admissible input, with the stated probability over the random initialization, we have

(5.13)

From Theorem 5.4, the gap between solutions to the finite-width GD dynamics and the infinite-width GD dynamics decays polynomially in the width. The following theorem presents the generalization error for solutions to the infinite-width GD dynamics.

Theorem 5.5.

Let the infinite-width solution be defined as in (5.12) and denote its generalization error accordingly. Then, with the prescribed probability over the choice of the training sample, the following inequality holds: