Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks

05/30/2019, by Yuan Cao, et al.

We study the training and generalization of deep neural networks (DNNs) in the over-parameterized regime, where the network width (i.e., the number of hidden nodes per layer) is much larger than the number of training data points. We show that the expected 0-1 loss of a wide enough ReLU network trained with stochastic gradient descent (SGD) and random initialization can be bounded by the training loss of a random feature model induced by the network gradient at initialization, which we call a neural tangent random feature (NTRF) model. For data distributions that can be classified by an NTRF model with sufficiently small error, our result yields a generalization error bound of order Õ(n^-1/2) that is independent of the network width. Our result is more general and sharper than many existing generalization error bounds for over-parameterized neural networks. In addition, we establish a strong connection between our generalization error bound and the neural tangent kernel (NTK) proposed in recent work.


1 Introduction

Deep learning has achieved great success in a wide range of applications, including image processing (Krizhevsky et al., 2012), natural language processing (Hinton et al., 2012) and reinforcement learning (Silver et al., 2016). Most of the deep neural networks used in practice are highly over-parameterized, such that the number of parameters is much larger than the number of training examples. One of the mysteries in deep learning is that, even in the over-parameterized regime, neural networks trained with stochastic gradient descent can still achieve small test error and do not overfit. In fact, a famous empirical study by Zhang et al. (2016) shows the following phenomena:


  • Even if one replaces the real labels of a training data set with purely random labels, an over-parameterized neural network can still fit the training data perfectly. However, since the labels are independent of the input, the resulting neural network does not generalize to the test dataset.

  • If the same over-parameterized network is trained with real labels, it not only achieves small training loss, but also generalizes well to the test dataset.

While a series of recent work has theoretically shown that a sufficiently over-parameterized (i.e., sufficiently wide) neural network can fit random labels (Du et al., 2018b; Allen-Zhu et al., 2018b; Du et al., 2018a; Zou et al., 2018), the reason why it can generalize well when trained with real labels is less understood. Existing generalization bounds for deep neural networks (Neyshabur et al., 2015; Bartlett et al., 2017; Neyshabur et al., 2017; Golowich et al., 2017; Dziugaite and Roy, 2017; Arora et al., 2018; Li et al., 2018; Wei et al., 2018; Neyshabur et al., 2018) based on uniform convergence usually cannot provide non-vacuous bounds (Langford and Caruana, 2002; Dziugaite and Roy, 2017) in the over-parameterized regime. In fact, the empirical observation by Zhang et al. (2016) indicates that in order to understand deep learning, it is important to distinguish the true data labels from random labels when studying generalization. In other words, it is essential to quantify the “classifiability” of the underlying data distribution, i.e., how difficult it is to classify.

Certain effort has been made to take the “classifiability” of the data distribution into account in the generalization analysis of neural networks. Brutzkus et al. (2017) showed that stochastic gradient descent (SGD) can learn an over-parameterized two-layer neural network with good generalization for linearly separable data. Li and Liang (2018) proved that, if the data satisfy certain structural assumptions, SGD can learn an over-parameterized two-layer network with fixed second layer weights and achieve a small generalization error. Allen-Zhu et al. (2018a) studied the generalization performance of SGD and its variants for learning two-layer and three-layer networks, and used the risk of smaller two-layer or three-layer networks with smooth activation functions to characterize the classifiability of the data distribution. There is another line of studies on the algorithm-dependent generalization bounds of neural networks in the over-parameterized regime (Daniely, 2017; Arora et al., 2019b; Cao and Gu, 2019; Yehudai and Shamir, 2019; E et al., 2019), which quantifies the classifiability of the data with a reference function class defined by random features (Rahimi and Recht, 2008, 2009) or kernels (since random feature models and kernel methods are highly related, we group them into the same category; more details are discussed in Section 3.2). Specifically, Daniely (2017) showed that a neural network of large enough size is competitive with the best function in the conjugate kernel class of the network. Arora et al. (2019b) gave a generalization error bound for two-layer ReLU networks with fixed second layer weights based on a ReLU kernel function. Cao and Gu (2019) showed that deep ReLU networks trained with gradient descent can achieve small generalization error if the data can be separated by a certain random feature model (Rahimi and Recht, 2009) with a margin. Yehudai and Shamir (2019) used the expected loss of a similar random feature model to quantify the generalization error of two-layer neural networks with smooth activation functions. A similar generalization error bound was also given by E et al. (2019), where the authors studied the optimization and generalization of two-layer networks trained with gradient descent. However, all the aforementioned results are still far from satisfactory: they are either limited to two-layer networks, or restricted to very simple and special reference function classes.

In this paper, we aim at providing a sharper and more generic analysis of the generalization of deep ReLU networks trained by SGD. In detail, we base our analysis upon the key observations that near random initialization, the neural network function is almost a linear function of its parameters and the loss function is locally almost convex. This enables us to prove a cumulative loss bound for SGD, which further leads to a generalization bound by an online-to-batch conversion (Cesa-Bianchi et al., 2004). The main contributions of our work are summarized as follows:


  • We give a bound on the expected 0-1 error of deep ReLU networks trained by SGD with random initialization. Our result relates the generalization bound of an over-parameterized ReLU network with a random feature model defined by the network gradients, which we call the neural tangent random feature (NTRF) model. It also suggests an algorithm-dependent generalization error bound of order Õ(n^-1/2), which is independent of the network width, if the data can be classified by the NTRF model with small enough error.

  • Our analysis is general enough to cover recent generalization error bounds for neural networks with random feature based reference function classes, and provides better bounds. Our expected 0-1 error bound directly covers the result by Cao and Gu (2019), and gives a tighter sample complexity in terms of the target generalization error when reduced to their setting. Compared with the recent results by Yehudai and Shamir (2019); E et al. (2019), who only studied two-layer networks, our bound not only works for deep networks, but also uses a larger reference function class when reduced to the two-layer setting, and is therefore sharper.

  • Our result has a direct connection to the neural tangent kernel studied in Jacot et al. (2018). When interpreted in the language of kernel method, our result gives a generalization bound of the form Õ(L·√(y^⊤(Θ^(L))^-1 y / n)), where y is the training label vector and Θ^(L) is the neural tangent kernel matrix defined on the training input data. This form of generalization bound is similar to, but more general and tighter than the bound given by Arora et al. (2019b).

Notation. We use lower case, lower case bold face, and upper case bold face letters to denote scalars, vectors and matrices, respectively. For a vector v = (v_1, …, v_d)^⊤ and a number 1 ≤ p < ∞, let ‖v‖_p = (Σ_{i=1}^d |v_i|^p)^{1/p}. We also define ‖v‖_∞ = max_i |v_i|. For a matrix A, we use ‖A‖_0 to denote the number of non-zero entries of A, and denote by ‖A‖_F its Frobenius norm and by ‖A‖_p = max_{‖v‖_p = 1} ‖Av‖_p its induced p-norm for p ≥ 1. For two matrices A, B of the same size, we define ⟨A, B⟩ = tr(A^⊤ B). We denote A ⪰ B if A − B is positive semidefinite. In addition, we define the asymptotic notations O(·), Ω(·), Õ(·) and Ω̃(·) as follows. Suppose that a_n and b_n are two sequences. We write a_n = O(b_n) if lim sup_{n→∞} |a_n/b_n| < ∞, and a_n = Ω(b_n) if lim inf_{n→∞} |a_n/b_n| > 0. We use Õ(·) and Ω̃(·) to hide the logarithmic factors in O(·) and Ω(·).

2 Problem Setup

In this section we introduce the basic problem setup. Following the standard setup in the line of recent work (Allen-Zhu et al., 2018b; Du et al., 2018a; Zou et al., 2018; Cao and Gu, 2019), we consider fully connected neural networks with width m, depth L and input dimension d. Such a network is defined by its weight matrices at each layer: let W_1 ∈ ℝ^{m×d}, W_l ∈ ℝ^{m×m} for l = 2, …, L−1, and W_L ∈ ℝ^{1×m} be the weight matrices of the network. Then the neural network with input x ∈ ℝ^d is defined as

f_W(x) = √m · W_L σ(W_{L−1} σ(W_{L−2} ⋯ σ(W_1 x) ⋯)),    (1)

where σ(·) is the entry-wise activation function. In this paper, we only consider the ReLU activation function σ(z) = max{0, z}, which is the most commonly used activation function in applications. It is also arguably one of the most difficult activation functions to analyze, due to its non-smoothness. We remark that our result can be generalized to many other Lipschitz continuous and smooth activation functions. For simplicity, we follow Allen-Zhu et al. (2018b); Du et al. (2018a) and assume that the widths of all hidden layers are the same. Our result can be easily extended to the setting where the widths of different layers are not equal but of the same order, as discussed in Zou et al. (2018); Cao and Gu (2019).
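To make the architecture concrete, here is a minimal NumPy sketch of the forward pass in (1), assuming the weight shapes and the √m output scaling described above; the function and variable names are ours and purely illustrative.

```python
import numpy as np

def relu(z):
    """Entry-wise ReLU activation sigma(z) = max{0, z}."""
    return np.maximum(z, 0.0)

def forward(weights, x):
    """Forward pass of the L-layer ReLU network in (1).

    weights: list [W_1, ..., W_L], where W_1 has shape (m, d),
             W_2, ..., W_{L-1} have shape (m, m), and W_L has shape (1, m).
    x:       input vector of shape (d,), normalized so that ||x||_2 = 1.
    Returns the scalar output f_W(x) = sqrt(m) * W_L relu(W_{L-1} ... relu(W_1 x)).
    """
    m = weights[0].shape[0]
    h = x
    for W in weights[:-1]:          # hidden layers, each followed by ReLU
        h = relu(W @ h)
    return np.sqrt(m) * (weights[-1] @ h)[0]
```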

When L = 1, the neural network reduces to a linear function, which has been well-studied. Therefore, for notational simplicity we focus on the case L ≥ 2, where the parameter space is defined as

𝒲 := ℝ^{m×d} × (ℝ^{m×m})^{L−2} × ℝ^{1×m}.

We also use W = (W_1, …, W_L) ∈ 𝒲 to denote the collection of weight matrices for all layers. For W, W′ ∈ 𝒲, we define their inner product as ⟨W, W′⟩ := Σ_{l=1}^L ⟨W_l, W′_l⟩ = Σ_{l=1}^L tr(W_l^⊤ W′_l).

The goal of neural network learning is to minimize the expected risk, i.e.,

L_D(W) := E_{(x,y)∼D} [ℓ(y · f_W(x))],    (2)

where L_{(x,y)}(W) := ℓ(y · f_W(x)) is the loss defined on any example (x, y), and ℓ(·) is the loss function. Without loss of generality, we consider the cross-entropy loss in this paper, which is defined as ℓ(z) = log(1 + exp(−z)). We would like to emphasize that our results also hold for most convex and Lipschitz continuous loss functions, such as the hinge loss. We now introduce the stochastic gradient descent based training algorithm for minimizing the expected risk in (2). The detailed algorithm is given in Algorithm 1.
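Before turning to the algorithm, a small numerically stable sketch of the cross-entropy loss ℓ(z) = log(1 + exp(−z)) and its derivative (illustrative only; the helper names are ours):

```python
import numpy as np

def cross_entropy(z):
    """Cross-entropy loss l(z) = log(1 + exp(-z)), evaluated at the margin
    z = y * f_W(x).  np.logaddexp avoids overflow for very negative z."""
    return np.logaddexp(0.0, -z)

def cross_entropy_grad(z):
    """Derivative l'(z) = -1 / (1 + exp(z)), computed in a stable way.
    Note |l'(z)| <= 1: the loss is 1-Lipschitz, convex and decreasing."""
    return -np.exp(-np.logaddexp(0.0, z))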

  Input: Number of iterations n, step size η.
  Generate each entry of W_l^(1) independently from N(0, 2/m), l = 1, …, L−1.
  Generate each entry of W_L^(1) independently from N(0, 1/m).
  for i = 1, 2, …, n do
     Draw (x_i, y_i) from D.
     Update W^(i+1) = W^(i) − η · ∇_W ℓ(y_i · f_{W^(i)}(x_i)).
  end for
  Output: Randomly choose Ŵ uniformly from {W^(1), …, W^(n)}.
Algorithm 1 SGD for DNNs starting at Gaussian initialization

The initialization scheme for W^(1) given in Algorithm 1 generates each entry of the weight matrices from a zero-mean independent Gaussian distribution, whose variance is determined by the rule that the expected length of the output vector in each hidden layer is equal to the length of the input. This initialization method is also known as He initialization (He et al., 2015). Here the last layer parameter W_L^(1) is initialized with variance 1/m instead of 2/m, since the last layer is not associated with the ReLU activation function.
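The initialization in Algorithm 1 can be sketched as follows; the loop at the end numerically illustrates the norm-preservation rule mentioned above (the function names and toy dimensions are ours).

```python
import numpy as np

def init_weights(L, m, d, seed=0):
    """Gaussian initialization of Algorithm 1: hidden-layer entries are drawn
    with variance 2/m (He initialization), output-layer entries with variance 1/m."""
    rng = np.random.default_rng(seed)
    weights = [rng.normal(0.0, np.sqrt(2.0 / m), size=(m, d))]             # W_1
    weights += [rng.normal(0.0, np.sqrt(2.0 / m), size=(m, m))
                for _ in range(L - 2)]                                     # W_2 ... W_{L-1}
    weights.append(rng.normal(0.0, np.sqrt(1.0 / m), size=(1, m)))         # W_L
    return weights

# Sanity check of the rule quoted above: at this initialization, each hidden
# layer approximately preserves the Euclidean norm of its input when m is large.
L, m, d = 4, 2000, 10
weights = init_weights(L, m, d)
h = np.random.default_rng(1).normal(size=d)
h /= np.linalg.norm(h)                         # unit-norm input
for W in weights[:-1]:
    h = np.maximum(W @ h, 0.0)
    print(round(float(np.linalg.norm(h)), 3))  # each printed value is close to 1
```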

3 Main Results

In this section we present the main results of this paper. In Section 3.1 we give an expected 0-1 error bound against a neural tangent random feature reference function class. In Section 3.2, we discuss the connection between our result and the neural tangent kernel proposed in Jacot et al. (2018).

3.1 An Expected 0-1 Error Bound

In this section we give a bound on the expected 0-1 error obtained by Algorithm 1. Our result is based on the following assumption.

Assumption 3.1. The data inputs are normalized: ‖x‖_2 = 1 for all (x, y) ∈ supp(D).

Assumption 3.1 is a standard assumption made in almost all previous work on optimization and generalization of over-parameterized neural networks (Du et al., 2018b; Allen-Zhu et al., 2018b; Du et al., 2018a; Zou et al., 2018; Oymak and Soltanolkotabi, 2019; E et al., 2019). As is mentioned in Cao and Gu (2019), this assumption can be relaxed to c_1 ≤ ‖x‖_2 ≤ c_2 for all (x, y) ∈ supp(D), where c_1, c_2 > 0 are absolute constants.

For any W = (W_1, …, W_L) ∈ 𝒲 and ω > 0, we define its ω-neighborhood as B(W, ω) := {W′ ∈ 𝒲 : ‖W′_l − W_l‖_F ≤ ω, l = 1, …, L}.

Below we introduce the neural tangent random feature function class, which serves as a reference function class to measure the “classifiability” of the data, i.e., how easily it can be classified.

Definition 3.1 (Neural Tangent Random Feature). Let W^(1) be generated via the initialization scheme in Algorithm 1. The neural tangent random feature (NTRF) function class is defined as

F(W^(1), R) = { f(·) = f_{W^(1)}(·) + ⟨∇_W f_{W^(1)}(·), W⟩ : W ∈ B(0, R·m^{-1/2}) },

where R > 0 measures the size of the function class, and m is the width of the neural network. The name “neural tangent random feature” is inspired by the neural tangent kernel proposed by Jacot et al. (2018), because the random features are the gradients of the neural network with random weights. Connections between the neural tangent random features and the neural tangent kernel will be discussed in Section 3.2.
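Intuitively, the NTRF features of an input x are the per-parameter gradients of the network at its random initialization, and an NTRF function adds a linear function of these features to f_{W^(1)}(x). The sketch below computes these features with manual backpropagation; it is only an illustration, all function names are ours, and the forward pass and initialization are as sketched in Section 2.

```python
import numpy as np

def ntrf_features(weights, x):
    """Gradients of f_W(x) with respect to each weight matrix, evaluated at
    W = weights (e.g., the random initialization W^(1)).  Returns a list of
    arrays with the same shapes as `weights`; flattening and concatenating
    them gives the NTRF feature vector of the input x."""
    m = weights[0].shape[0]
    # Forward pass, caching pre-activations z_l and post-activations h_l.
    hs, zs, h = [x], [], x
    for W in weights[:-1]:
        z = W @ h
        h = np.maximum(z, 0.0)
        zs.append(z)
        hs.append(h)
    # Backward pass for the scalar output f = sqrt(m) * W_L h_{L-1}.
    grads = [None] * len(weights)
    grads[-1] = np.sqrt(m) * hs[-1][None, :]       # df / dW_L, shape (1, m)
    delta = np.sqrt(m) * weights[-1].ravel()       # df / dh_{L-1}, shape (m,)
    for l in range(len(weights) - 2, -1, -1):
        delta = delta * (zs[l] > 0)                # back through the ReLU
        grads[l] = np.outer(delta, hs[l])          # df / dW for this layer
        delta = weights[l].T @ delta               # df / d(previous hidden output)
    return grads

def ntrf_function(f_init, feats, delta_W):
    """Evaluate an NTRF function f(x) = f_{W^(1)}(x) + <grad f_{W^(1)}(x), W>,
    where f_init = f_{W^(1)}(x), feats = ntrf_features(W^(1), x), and delta_W
    is a list of matrices with ||delta_W_l||_F <= R * m**(-0.5) for each layer."""
    return f_init + sum(float(np.sum(g * d)) for g, d in zip(feats, delta_W))
```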

We are ready to present our main result on the expected 0-1 error bound of Algorithm 1.

Theorem 3.1. For any δ ∈ (0, 1) and R > 0, there exists m*(δ, R, L, n) such that if m ≥ m*(δ, R, L, n), then with probability at least 1 − δ over the randomness of W^(1), the output Ŵ of Algorithm 1 with step size η = κ·R/(m·√n) for some small enough absolute constant κ satisfies

E[L_D^{0-1}(Ŵ)] ≤ inf_{f ∈ F(W^(1), R)} { (4/n) · Σ_{i=1}^n ℓ(y_i · f(x_i)) } + O( L·R/√n + √(log(1/δ)/n) ),    (3)

where L_D^{0-1}(W) := P_{(x,y)∼D}[y · f_W(x) < 0] denotes the expected 0-1 error, and the expectation is taken over the uniform draw of Ŵ from {W^(1), …, W^(n)}.

The expected 0-1 error bound given by Theorem 3.1 consists of two terms: The first term in (3) relates the expected 0-1 error achieved by Algorithm 1 with a reference function class–the NTRF function class in Definition 3.1. The second term in (3) is a standard large-deviation error term. As long as δ is not exponentially small in n, this term matches the standard rate in PAC learning bounds (Shalev-Shwartz and Ben-David, 2014).

The parameter R in Theorem 3.1 is from the NTRF class and introduces a trade-off in the bound: when R is small, the corresponding NTRF class F(W^(1), R) is small, making the first term in (3) large, while the second term in (3) is small. When R is large, the corresponding function class is large, so the first term in (3) is small, and the second term will be large. In particular, if we set R = Õ(1), the second term in (3) will be of order Õ(n^-1/2). In this case, the “classifiability” of the underlying data distribution is determined by how well its i.i.d. samples can be classified by the NTRF class F(W^(1), R) with R = Õ(1). In other words, Theorem 3.1 suggests that if the data can be classified by a function in the NTRF function class with a small training error, the over-parameterized ReLU network learnt by Algorithm 1 will have a small generalization error.

The expected 0-1 error bound given by Theorem 3.1 is in a very general form. It directly covers the result given by Cao and Gu (2019). In Appendix A.1, we show that under the same assumptions made in Cao and Gu (2019), to achieve ε expected 0-1 error, our result requires a sample complexity of order Õ(ε^-2), which improves upon the corresponding sample complexity in Cao and Gu (2019).

Our generalization bound can also be compared with two recent results (Yehudai and Shamir, 2019; E et al., 2019) for two-layer neural networks. When L = 2, the NTRF function class can be written as

F(W^(1), R) = { f(·) = f_{W^(1)}(·) + ⟨∇_{W_1} f_{W^(1)}(·), W_1⟩ + ⟨∇_{W_2} f_{W^(1)}(·), W_2⟩ : ‖W_1‖_F, ‖W_2‖_F ≤ R·m^{-1/2} }.

In contrast, the reference function classes studied by Yehudai and Shamir (2019) and E et al. (2019) are contained in a random feature class built only on the features σ(⟨w_r, x⟩), r = 1, …, m, where w_1, …, w_m are the random first-layer weights generated by the initialization schemes in Yehudai and Shamir (2019); E et al. (2019) (normalizing the weights to the same scale is necessary for a proper comparison; see Appendix A.2 for details). Evidently, our NTRF function class is richer: it also contains the features corresponding to the first layer gradient of the network at random initialization, i.e., ∇_{W_1} f_{W^(1)}(·). As a result, our generalization bound is sharper than those in Yehudai and Shamir (2019); E et al. (2019) in the sense that we can show that neural networks trained with SGD can compete with the best function in a larger reference function class.

3.2 Connection to Neural Tangent Kernel

Besides quantifying the classifiability of the data with the NTRF function class F(W^(1), R), an alternative way to apply Theorem 3.1 is to check how large the parameter R needs to be in order to make the first term in (3) small enough (e.g., smaller than n^-1/2). In this subsection, we show that this type of analysis connects Theorem 3.1 to the neural tangent kernel proposed in Jacot et al. (2018) and later studied by Yang (2019); Lee et al. (2019); Arora et al. (2019a). Specifically, we provide an expected 0-1 error bound in terms of the neural tangent kernel matrix defined over the training data. We first define the neural tangent kernel matrix for the neural network function in (1).

Definition 3.2 (Neural Tangent Kernel Matrix). For any i, j ∈ {1, …, n}, define

Θ̃^(1)_{i,j} = Σ^(1)_{i,j} = ⟨x_i, x_j⟩,   A^(l)_{i,j} = [ Σ^(l)_{i,i}, Σ^(l)_{i,j} ; Σ^(l)_{j,i}, Σ^(l)_{j,j} ],
Σ^(l+1)_{i,j} = 2 · E_{(u,v)∼N(0, A^(l)_{i,j})} [σ(u) σ(v)],
Θ̃^(l+1)_{i,j} = 2 · Θ̃^(l)_{i,j} · E_{(u,v)∼N(0, A^(l)_{i,j})} [σ′(u) σ′(v)] + Σ^(l+1)_{i,j}.

Then we call Θ^(L) = (Θ̃^(L) + Σ^(L)) / 2 the neural tangent kernel matrix of an L-layer ReLU network on the training inputs x_1, …, x_n.

Definition 3.2 is the same as the original definition in Jacot et al. (2018) when restricting the kernel function to the training inputs x_1, …, x_n, except that there is an extra coefficient 2 in the second and third lines. This extra factor is due to the difference in initialization schemes: in our paper the entries of the hidden layer matrices are randomly generated with variance 2/m, while in Jacot et al. (2018) the variance of the random initialization is 1/m. We remark that this extra factor in Definition 3.2 will remove the exponential dependence on the network depth in the kernel matrix, which is appealing. In fact, it is easy to check that under our scaling, the diagonal entries of Σ^(l) are all 1's, and the diagonal entries of Θ̃^(L) are all L's.

The following lemma is a summary of Theorem 1 and Proposition 2 in Jacot et al. (2018), which ensures that Θ^(L) is the infinite-width limit of the scaled Gram matrix of the network gradients at initialization, and is positive-definite as long as no two training inputs are parallel.

Lemma 3.2 (Jacot et al., 2018). For an L-layer ReLU network with parameters W^(1) initialized as in Algorithm 1, as the network width m → ∞ (the original result by Jacot et al. (2018) requires that the widths of different layers go to infinity sequentially; their result was later improved by Yang (2019) such that the widths of different layers can go to infinity simultaneously), it holds that

Θ^(L)_{i,j} = lim_{m→∞} m^{-1} · E[⟨∇_W f_{W^(1)}(x_i), ∇_W f_{W^(1)}(x_j)⟩],   i, j = 1, …, n,

where the expectation is taken over the randomness of W^(1). Moreover, as long as each pair of inputs among x_1, …, x_n are not parallel, Θ^(L) is positive-definite.
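Lemma 3.2 can be illustrated numerically at finite width: the scaled Gram matrix of the network gradients at initialization, averaged over a few random draws of W^(1), approaches Θ^(L) entry-wise as m grows. The sketch below assumes the init_weights and ntrf_features helpers from the earlier sketches are in scope; it is only a finite-width check, not the exact limit.

```python
import numpy as np

def empirical_ntk_gram(weights, xs):
    """Scaled Gram matrix of network gradients: G[i, j] =
    <grad f_W(x_i), grad f_W(x_j)> / m.  By Lemma 3.2, the expectation of G at
    random initialization converges entry-wise to Theta^(L) as m -> infinity."""
    m = weights[0].shape[0]
    feats = [np.concatenate([g.ravel() for g in ntrf_features(weights, x)])
             for x in xs]
    Phi = np.stack(feats)                   # shape (n, total number of parameters)
    return (Phi @ Phi.T) / m

# Average over a few initializations and inspect the result.
n, d, m, L = 5, 10, 1024, 3
rng = np.random.default_rng(0)
xs = rng.normal(size=(n, d))
xs /= np.linalg.norm(xs, axis=1, keepdims=True)          # unit-norm inputs
G = np.mean([empirical_ntk_gram(init_weights(L, m, d, seed=s), xs)
             for s in range(5)], axis=0)
print(np.round(np.diag(G), 2))              # under this scaling, diagonal entries are close to (L + 1) / 2
print(float(np.linalg.eigvalsh(G).min()))   # approximate smallest NTK eigenvalue
```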

Lemma 3.2 clearly shows the difference between our neural tangent kernel matrix in Definition 3.2 and the Gram matrix defined in Definition 5.1 in Du et al. (2018a). For any i, j ∈ {1, …, n}, by Lemma 3.2 we have

Θ^(L)_{i,j} = lim_{m→∞} m^{-1} · Σ_{l=1}^{L} E[⟨∇_{W_l} f_{W^(1)}(x_i), ∇_{W_l} f_{W^(1)}(x_j)⟩].

In contrast, the corresponding entry in the Gram matrix of Du et al. (2018a) is

lim_{m→∞} m^{-1} · E[⟨∇_{W_{L−1}} f_{W^(1)}(x_i), ∇_{W_{L−1}} f_{W^(1)}(x_j)⟩].

It can be seen that our definition of the kernel matrix takes all layers into consideration, while Du et al. (2018a) only considered the last hidden layer (i.e., the second to last layer). Moreover, it is clear that Θ^(L) dominates the Gram matrix of Du et al. (2018a) in the positive semidefinite sense. Since the smallest eigenvalue of the kernel matrix plays a key role in the analysis of optimization and generalization of over-parameterized neural networks (Du et al., 2018b, a; Arora et al., 2019b), our neural tangent kernel matrix can potentially lead to better bounds than the Gram matrix studied in Du et al. (2018a).

Corollary 3.2. Let y = (y_1, …, y_n)^⊤ and λ_0 = λ_min(Θ^(L)). For any δ ∈ (0, 1), there exists m*(δ, L, n, λ_0) that only depends on δ, L, n and λ_0 such that if m ≥ m*(δ, L, n, λ_0), then with probability at least 1 − δ over the randomness of W^(1), the output Ŵ of Algorithm 1 with step size η = κ·√(y^⊤(Θ^(L))^-1 y)/(m·√n) for some small enough absolute constant κ satisfies

E[L_D^{0-1}(Ŵ)] ≤ Õ( L·√( y^⊤(Θ^(L))^-1 y / n ) ) + O( √( log(1/δ)/n ) ),

where the expectation is taken over the uniform draw of Ŵ from {W^(1), …, W^(n)}.

Corollary 3.2 gives an algorithm-dependent generalization error bound for over-parameterized L-layer neural networks trained with SGD. It is worth noting that recently Arora et al. (2019b) gives a generalization bound of the form √(2·y^⊤(H^∞)^-1 y / n) for two-layer networks with fixed second layer weights, where H^∞ is defined as

H^∞_{i,j} = E_{w∼N(0,I)} [ ⟨x_i, x_j⟩ · 1{⟨w, x_i⟩ ≥ 0, ⟨w, x_j⟩ ≥ 0} ].

Our result in Corollary 3.2 can be specialized to two-layer neural networks by choosing L = 2, and yields a bound of order Õ(√(y^⊤(Θ^(2))^-1 y / n)), where by Definition 3.2, Θ^(2) = H^∞ + Σ^(2). Here the extra term Σ^(2) corresponds to the training of the second layer: it is the infinite-width limit of m^{-1}·E[⟨∇_{W_2} f_{W^(1)}(x_i), ∇_{W_2} f_{W^(1)}(x_j)⟩]. Since we have Θ^(2) ⪰ H^∞, our bound is sharper than theirs. This comparison also shows that our result generalizes the result in Arora et al. (2019b) from two-layer, fixed second layer networks to deep networks with all parameters being trained.
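The quantity y^⊤ K^-1 y / n that drives these kernel-based bounds is straightforward to compute once a kernel matrix K is available. A small illustrative sketch (the kernel matrix below is a random positive-definite stand-in, not an actual NTK matrix; the helper name is ours):

```python
import numpy as np

def kernel_bound_term(K, y, ridge=0.0):
    """Compute sqrt(y^T K^{-1} y / n) for a positive-definite kernel matrix K
    and a label vector y; a small ridge can be added for numerical stability."""
    n = y.shape[0]
    alpha = np.linalg.solve(K + ridge * np.eye(n), y)   # K^{-1} y, no explicit inverse
    return float(np.sqrt(y @ alpha / n))

# Toy usage with a random positive-definite matrix standing in for the NTK matrix.
rng = np.random.default_rng(0)
n = 8
A = rng.normal(size=(n, n))
K = A @ A.T + n * np.eye(n)                  # hypothetical kernel matrix
y = rng.choice([-1.0, 1.0], size=n)          # +/-1 labels
print(kernel_bound_term(K, y))
```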

Corollary 3.2 is based on the asymptotic convergence result in Lemma 3.2, which does not show how wide the network needs to be in order to make the Gram matrix close enough to the NTK matrix. Very recently, Arora et al. (2019a) provided a non-asymptotic convergence result for the Gram matrix, and showed the equivalence between an infinitely wide network trained by gradient flow and a kernel regression predictor using the neural tangent kernel, which suggests that the generalization of deep neural networks trained by gradient flow can potentially be measured by the corresponding NTK. Utilizing this non-asymptotic convergence result, one can potentially specify the detailed dependency of m* on δ, L, n and λ_0 in Corollary 3.2.

4 Proof of Main Theory

In this section we provide the proof of Theorem 3.1 and Corollary 3.2, and explain the intuition behind the proof. For notational simplicity, for i = 1, …, n we denote L_i(W) := ℓ(y_i · f_W(x_i)).

4.1 Proof of Theorem 3.1

Before giving the proof of Theorem 3.1, we first introduce several lemmas. The following lemma states that near initialization, the neural network function is almost linear in terms of its weights.

There exists an absolute constant such that, with probability at least over the randomness of , for all and with , it holds uniformly that

Since the cross-entropy loss is convex, given Lemma 4.1, we can show in the following lemma that near initialization, L_i(W) is also almost a convex function of W for any i = 1, …, n.

There exists an absolute constant such that, with probability at least over the randomness of , for any , and with , it holds uniformly that

The locally almost convex property of the loss function given by Lemma 4.1 implies that the dynamics of Algorithm 1 is similar to the dynamics of convex optimization. We can therefore derive a bound on the cumulative loss. The result is given in the following lemma. For any , there exists

such that if , then with probability at least over the randomness of , for any , Algorithm 1 with , for some small enough absolute constant has the following cumulative loss bound:

We now finalize the proof by applying an online-to-batch conversion argument (Cesa-Bianchi et al., 2004), and using Lemma 4.1 to relate the neural network function to a function in the NTRF function class.

Proof of Theorem 3.1.

For , let . Since cross-entropy loss satisfies , we have . Therefore, setting in Lemma 4.1 gives that, if is set as , then with probability at least ,

(4)

Note that for any , only depends on and is independent of . Therefore by Proposition 1 in Cesa-Bianchi et al. (2004), with probability at least we have

(5)

By definition, we have . Therefore, combining (4) and (5) and applying a union bound, we obtain that with probability at least ,

(6)

for all . We now compare the neural network function with the function . We have

where the first inequality is by the 1-Lipschitz continuity of ℓ(·) and Lemma 4.1, the second inequality is by , and the last inequality holds as long as for some large enough absolute constant . Plugging the inequality above into (6) gives

Taking the infimum over f ∈ F(W^(1), R) and rescaling finishes the proof. ∎

4.2 Proof of Corollary 3.2

In this subsection we prove Corollary 3.2. The following lemma shows that at initialization, with high probability, the neural network function values at all the training inputs are of order Õ(1). For any , if for a large enough absolute constant , then with probability at least , for all .

We now present the proof of Corollary 3.2. The idea is to construct suitable target values ŷ_1, …, ŷ_n, and then bound the norm of the solution W of the linear equations ⟨∇_W f_{W^(1)}(x_i), W⟩ = ŷ_i, i = 1, …, n.

Proof of Corollary 3.2.

Set , then for cross-entropy loss we have for . Moreover, let . Then by Lemma 4.2, with probability at least , for all . Let and , then it holds that for any ,

and therefore

(7)

Denote . Note that entries of are all bounded by . Therefore, the largest eigenvalue of is at most , and we have . By Lemma 3.2 and a standard matrix perturbation bound, there exists such that, if , then with probability at least , is strictly positive-definite and

(8)

Let be the singular value decomposition of , where have orthogonal columns, and is a diagonal matrix. Let , then we have

(9)

Moreover, by direct calculation we have

Therefore by (8) and the fact that , we have

Let be the parameter collection reshaped from . Then clearly

and therefore . Moreover, by (9), we have . Plugging this into (7) then gives