Deep learning has achieved great success in a wide range of real-world applications, from image processing (Krizhevsky et al., 2012) and speech recognition (Hinton et al., 2012) to the game of Go (Silver et al., 2016). Understanding and explaining this success has thus become a central problem for theorists. One of the mysteries is that the neural networks used in practice are often so heavily over-parameterized that they can even fit random labels to the input data (Zhang et al., 2016), yet they still achieve very small generalization error (i.e., test error) when trained with real labels.
There are multiple recent attempts towards answering the above question and demystifying the success of deep learning. Soudry and Carmon (2016); Soudry and Hoffer (2017) explained why over-parametrization can remove bad local minima. Safran and Shamir (2016) showed that over-parametrization can improve the quality of the random initialization. Arora et al. (2018b) interpreted over-parametrization as a way of implicit acceleration during optimization. Haeffele and Vidal (2015); Nguyen and Hein (2017); Venturi et al. (2018) showed that for sufficiently over-parametrized networks, all local minima are global, but do not show how to find these minima via gradient descent. Li and Liang (2018); Du et al. (2018b)
proved that with proper random initialization, (stochastic) gradient descent provably finds the global minimum for training over-parameterized one-hidden-layer ReLU networks. Du et al. (2018a)
proved that gradient descent can converge to the global minima for over-parameterized deep neural networks with smooth activation functions. Arora et al. (2018a) analyzed the convergence of GD to the global optimum for training a deep linear neural network under a set of assumptions on the network width and initialization. Allen-Zhu et al. (2018c, b); Zou et al. (2018)
proved the global convergence results of GD/SGD for deep neural networks with ReLU activation functions in the over-parameterization regime. However, in such an over-parametrized regime, the training loss function of deep neural networks may have infinitely many global minima, not all of which generalize well. Hence, minimizing the training error is not sufficient to explain the good generalization performance of GD/SGD. There are only a few studies on the generalization theory for learning neural networks in the over-parameterization regime. Brutzkus et al. (2017) showed that SGD learns over-parameterized networks that provably generalize on linearly separable data. Li and Liang (2018) relaxed the linearly separable data assumption and proved that SGD learns an over-parameterized network with a small generalization error when the data comes from mixtures of well-separated distributions. Neyshabur et al. (2018) proposed a novel complexity measure based on unit-wise capacities, and proved a tighter generalization bound for two-layer over-parameterized ReLU networks. However, this bound is independent of the specific training algorithms (e.g., GD or SGD). Allen-Zhu et al. (2018a) proved that under over-parameterization, SGD or its variants can learn some notable hypothesis classes, including two and three-layer neural networks with fewer parameters. Arora et al. (2019) provided a generalization bound of GD for two-layer ReLU networks based on a fine-grained analysis of how much the network parameters can move during GD. Nevertheless, all these results still do not explain the good generalization performance of gradient-based methods for learning over-parameterized deep neural networks.
In this paper, we aim to answer the following question:
Why can gradient descent learn an over-parameterized deep neural network with good generalization performance?
Without loss of generality, we focus on binary classification problems on the $d$-dimensional unit sphere $S^{d-1}$, and consider using gradient descent to solve the empirical risk minimization problem of learning deep fully connected neural networks with ReLU activation function and cross-entropy loss.
1.1 Our Main Results and Contributions
The following theorem gives an informal version of our main results.
[Informal version of Theorem 4] Under certain data distribution assumptions, for any $\epsilon > 0$, if the number of nodes per hidden layer is set to and the sample size , then with high probability, gradient descent with properly chosen step size and random initialization method learns a deep ReLU network and achieves a population classification error at most $\epsilon$.
Theorem 1.1 holds for ReLU networks with an arbitrary constant number of layers, as long as the data distribution satisfies certain separation conditions, which will be specified in Section 4.
Our contributions. Our main contributions can be summarized as follows:
We show that over-parameterized deep ReLU networks trained by gradient descent with random initialization provably generalize well. More specifically, we prove that, when solving the empirical loss minimization problem, with high probability, gradient descent starting from proper random initialization can find a network that gives arbitrarily small population error, as long as the number of hidden nodes per layer and the sample size are large enough. To the best of our knowledge, this is the first theoretical result in the literature that reasonably explains the mysterious empirical observation that gradient-based methods can learn over-parameterized deep neural networks and achieve good generalization.
Our result is based on a data distribution assumption that essentially requires that there exists a two-layer ReLU network with infinite hidden nodes and regularized weights that separates the two classes. This assumption is milder than the linear separability condition made in Brutzkus et al. (2017); Soudry et al. (2017).
We provide a thorough landscape analysis for deep ReLU networks that helps the study of both optimization and generalization. Compared with similar results given in earlier work (Allen-Zhu et al., 2018b; Zou et al., 2018) where the convergence guarantee heavily relies on a finite sample size $n$, our work removes this unfavorable dependency on $n$ and gives a series of results that hold uniformly over the input domain. These results characterize the geometry/landscape of the population loss function near random initialization, which is of independent interest and may be used to revisit the optimization dynamics of (stochastic) gradient descent for training over-parameterized DNNs.
Comparison with most related work. Our result is most relevant to the recent line of work on the generalization for over-parameterized shallow (two- or three-layer) networks (Brutzkus et al., 2017; Li and Liang, 2018; Allen-Zhu et al., 2018a; Arora et al., 2019). While all these existing results only hold for two- or three-layer neural networks, our result holds for deep ReLU networks with arbitrarily many layers. Compared to standard uniform convergence based generalization error bounds (Neyshabur et al., 2015; Bartlett et al., 2017; Neyshabur et al., 2017; Golowich et al., 2017; Dziugaite and Roy, 2017; Arora et al., 2018c; Li et al., 2018a; Wei et al., 2018), the major advantage of our result is that, our generalization error bound is algorithm-dependent, almost independent of the width of the network, and provably achievable by gradient-based local search algorithms.
1.2 Notation
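Throughout this paper, scalars, vectors and matrices are denoted by lower case, lower case bold face, and upper case bold face letters respectively. For a positive integer $n$, we denote $[n] = \{1, \dots, n\}$. For a vector $\mathbf{x} = (x_1, \dots, x_d)^\top$, we denote by $\|\mathbf{x}\|_p$, $\|\mathbf{x}\|_\infty$, and $\|\mathbf{x}\|_0$ the $\ell_p$, $\ell_\infty$ and $\ell_0$ norms of $\mathbf{x}$ respectively. We use $\mathrm{Diag}(\mathbf{x})$ to denote a square diagonal matrix with the entries of $\mathbf{x}$ on the main diagonal. For a matrix $\mathbf{A}$, we use $\|\mathbf{A}\|_2$ and $\|\mathbf{A}\|_F$ to denote the spectral norm (maximum singular value) and Frobenius norm of $\mathbf{A}$ respectively. We also denote by $\|\mathbf{A}\|_0$ the number of nonzero entries of $\mathbf{A}$. We denote by $S^{d-1}$ the unit sphere in $\mathbb{R}^d$. For a function $f$, we denote by $\|f\|_\infty$ the essential supremum of $f$.
We use the following standard asymptotic notations. For two sequences $\{a_n\}$ and $\{b_n\}$, we write $a_n = O(b_n)$ if $a_n \le C_1 b_n$ for some absolute constant $C_1 > 0$, and $a_n = \Omega(b_n)$ if $a_n \ge C_2 b_n$ for some absolute constant $C_2 > 0$. In addition, we use $\widetilde{O}(\cdot)$ and $\widetilde{\Omega}(\cdot)$ to hide some logarithmic terms in Big-O and Big-Omega notations.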
1.3 Organization of the Paper
In Section 2 we survey recent work most related to ours. In Section 3 we introduce basic definitions and some preliminary results. Our main theoretical results are presented in Section 4. We provide a proof sketch of our main theory in Section 5. In Section 6 we conclude the paper and briefly discuss potential future work. All remaining proofs are deferred to the appendix.
2 Additional Related Work
There is a huge body of literature on building the foundations of deep learning, and we are not able to include every work in this paper. In this section, we briefly review and comment on additional work that is most related to ours and was not discussed in Section 1.
Representation power of deep neural networks. A line of research has shown that deeper neural networks have higher expressive power (Telgarsky, 2015, 2016; Lu et al., 2017; Liang and Srikant, 2016; Yarotsky, 2017, 2018; Hanin, 2017; Hanin and Sellke, 2017) than shallow neural networks. This to a certain extent explains the advantage of deep neural networks with over-parameterization. Lin and Jegelka (2018) proved that ResNet (He et al., 2016) with one hidden node per layer is a universal approximator of any Lebesgue integrable function.
Optimization landscape of neural networks. Many studies (Haeffele and Vidal, 2015; Kawaguchi, 2016; Freeman and Bruna, 2016; Hardt and Ma, 2016; Safran and Shamir, 2017; Xie et al., 2017; Nguyen and Hein, 2017; Soltanolkotabi et al., 2017; Zhou and Liang, 2017; Yun et al., 2017; Du and Lee, 2018; Venturi et al., 2018) investigated the optimization landscape of neural networks with different activation functions. However, these results only apply to one-hidden-layer neural networks, or deep linear networks, or rely on some stringent assumptions on the data and/or activation functions. In fact, they do not hold for non-linear shallow neural networks (Yun et al., 2018a) or three-layer linear neural networks (Kawaguchi, 2016). Furthermore, Yun et al. (2018b) showed that small nonlinearities in activation functions create bad local minima in neural networks.
Implicit bias/regularization of GD and its variants. A number of papers (Gunasekar et al., 2017; Soudry et al., 2017; Ji and Telgarsky, 2018; Gunasekar et al., 2018a, b; Nacson et al., 2018; Li et al., 2018b)
have studied the implicit regularization/bias of GD, stochastic gradient descent (SGD) or mirror descent for matrix factorization, logistic regression, and deep linear networks. However, generalizing these results to deep non-linear neural networks turns out to be challenging and is still an open problem.
Connections between deep learning and kernel methods. Daniely (2017) uncovered the connection between deep neural networks and kernel methods and showed that SGD can learn a function that is comparable with the best function in the conjugate kernel space of the network. Jacot et al. (2018) showed that the evolution of a DNN during training can be described by a so-called neural tangent kernel, which makes it possible to study the training of DNNs in the functional space. Belkin et al. (2018); Liang and Rakhlin (2018)
Recovery guarantees for shallow neural networks. A series of works (Tian, 2017; Brutzkus and Globerson, 2017; Li and Yuan, 2017; Soltanolkotabi, 2017; Du et al., 2017a, b; Zhong et al., 2017; Zhang et al., 2018) have attempted to study shallow one-hidden-layer neural networks with ground truth parameters, and proved recovery guarantees for gradient-based methods such as gradient descent (GD) and stochastic gradient descent (SGD). However, the assumption of the existence of ground truth parameters is not realistic and the analysis of the recovery guarantee can hardly be extended to deep neural networks. Moreover, many of these studies need strong assumptions on the input distribution such as Gaussian, sub-Gaussian or symmetric distributions.
Distributional view of over-parameterized networks. Mei et al. (2018); Chizat and Bach (2018); Sirignano and Spiliopoulos (2018); Rotskoff and Vanden-Eijnden (2018); Wei et al. (2018) took a distributional view of over-parametrized networks, used mean field analysis to show that the empirical distribution of the two-layer neural network parameters can be described as a Wasserstein gradient flow, and proved that Wasserstein gradient flow converges to global optima under certain structural assumptions. However, their results are limited to two-layer infinitely wide neural networks. Very recently, Yang (2019) studied the scaling limit of wide multi-layer neural networks.
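In this paper, we study the binary classification problem with some unknown data distribution $\mathcal{D}$ over $\mathbb{R}^d \times \{+1, -1\}$. A data point $(\mathbf{x}, y)$ drawn from $\mathcal{D}$ consists of the input $\mathbf{x} \in \mathbb{R}^d$ and label $y \in \{+1, -1\}$. We denote by $\mathcal{D}_x$ the marginal distribution of $\mathbf{x}$. Given an input $\mathbf{x}$, we consider predicting its corresponding label $y$ using a deep neural network with the ReLU activation function $\sigma(z) = \max\{0, z\}$.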
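We consider $L$-hidden-layer neural networks with $m_l$ hidden nodes on the $l$-th layer, $l = 1, \dots, L$. We denote $m_0 = d$, and define
$$f_{\mathbf{W}}(\mathbf{x}) = \mathbf{v}^\top \sigma(\mathbf{W}_L^\top \sigma(\mathbf{W}_{L-1}^\top \cdots \sigma(\mathbf{W}_1^\top \mathbf{x}) \cdots )),$$
where $\sigma(\cdot)$ denotes the entry-wise ReLU activation function (with a slight abuse of notation), $\mathbf{W}_l \in \mathbb{R}^{m_{l-1} \times m_l}$, $l = 1, \dots, L$, are the weight matrices, and $\mathbf{v} \in \{-1, +1\}^{m_L}$ is the fixed output layer weight vector with half $1$ and half $-1$ entries. We denote by $\mathbf{W} = \{\mathbf{W}_l\}_{l=1}^{L}$ the collection of matrices $\mathbf{W}_1, \dots, \mathbf{W}_L$.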
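Given $n$ training examples $(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_n, y_n)$ drawn independently from $\mathcal{D}$, the empirical loss minimization problem is defined as follows:
$$\min_{\mathbf{W}} L_S(\mathbf{W}) = \frac{1}{n} \sum_{i=1}^{n} \ell[y_i \cdot f_{\mathbf{W}}(\mathbf{x}_i)], \qquad (1)$$
where $S = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ is the training sample set, and $\ell(z) = \log[1 + \exp(-z)]$ is the cross-entropy loss function.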
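The network and the empirical loss above can be sketched in a few lines of NumPy. This is an illustration only: it assumes the network has the form $f_{\mathbf{W}}(\mathbf{x}) = \mathbf{v}^\top \sigma(\mathbf{W}_L^\top \cdots \sigma(\mathbf{W}_1^\top \mathbf{x}) \cdots)$ with $\mathbf{W}_l$ of shape $(m_{l-1}, m_l)$ and the cross-entropy loss $\ell(z) = \log(1 + e^{-z})$; the function names are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(weights, v, x):
    """Network output v^T relu(W_L^T ... relu(W_1^T x)).

    Assumed convention: weights[l] has shape (m_l, m_{l+1}) with m_0 = d,
    and v is the fixed output vector with +1/-1 entries.
    """
    h = x
    for W in weights:
        h = relu(W.T @ h)
    return float(v @ h)

def cross_entropy(z):
    """ell(z) = log(1 + exp(-z)), evaluated in a numerically stable way."""
    return np.logaddexp(0.0, -z)

def empirical_loss(weights, v, X, y):
    """Average of ell(y_i * f_W(x_i)) over the rows of X."""
    margins = np.array([yi * forward(weights, v, xi) for xi, yi in zip(X, y)])
    return float(np.mean(cross_entropy(margins)))
```

Only the hidden-layer matrices in `weights` are treated as trainable here; `v` is passed separately to reflect that the output layer is fixed.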
3.1 Gradient Descent with Gaussian Initialization
Here we introduce the details of the algorithm we use to solve the empirical loss minimization problem (1).
Gaussian initialization. We say that the weight matrices $\mathbf{W}_1, \dots, \mathbf{W}_L$ are generated via Gaussian initialization if for all $l \in [L]$, each entry of $\mathbf{W}_l$ is generated independently from $N(0, 2/m_l)$.
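A minimal sketch of such an initialization follows. The variance $2/m_l$ (with $m_l$ the output width of layer $l$) is an assumption of this sketch; it is the choice that keeps the expected squared norm of each ReLU layer's output equal to that of its input. The helper name `gaussian_init` is hypothetical.

```python
import numpy as np

def gaussian_init(d, widths, seed=0):
    """Draw W_l with i.i.d. N(0, 2/m_l) entries (assumed variance) and build v.

    d      : input dimension (m_0)
    widths : [m_1, ..., m_L], hidden layer widths
    Returns (weights, v), where weights[l] has shape (m_l, m_{l+1}) and
    v has half +1 and half -1 entries.
    """
    rng = np.random.default_rng(seed)
    dims = [d] + list(widths)
    weights = [
        rng.normal(0.0, np.sqrt(2.0 / dims[l + 1]), size=(dims[l], dims[l + 1]))
        for l in range(len(widths))
    ]
    m_last = widths[-1]
    v = np.concatenate([np.ones(m_last // 2), -np.ones(m_last - m_last // 2)])
    return weights, v

# Usage: for a unit-norm input, the first hidden layer output has squared
# norm close to 1 when the layer is wide.
weights, v = gaussian_init(d=10, widths=[1000, 1000])
x = np.ones(10) / np.sqrt(10.0)
h1 = np.maximum(weights[0].T @ x, 0.0)
squared_norm = float(h1 @ h1)
```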
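Gradient descent. We study the generalization performance of deep ReLU networks trained by gradient descent. Let $\mathbf{W}_1^{(0)}, \dots, \mathbf{W}_L^{(0)}$ be weight matrices generated via Gaussian initialization. We consider the following gradient descent update rule to solve the empirical loss minimization problem (1):
$$\mathbf{W}_l^{(k)} = \mathbf{W}_l^{(k-1)} - \eta \nabla_{\mathbf{W}_l} L_S(\mathbf{W}^{(k-1)}), \quad l \in [L],$$
where $\eta > 0$ is the step size.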
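The update rule can be illustrated with a short full-batch implementation. This is a sketch under stated assumptions, not the exact procedure analyzed in the paper: it assumes the update $\mathbf{W}_l \leftarrow \mathbf{W}_l - \eta \nabla_{\mathbf{W}_l} L_S(\mathbf{W})$ for every hidden layer, the cross-entropy loss $\ell(z) = \log(1 + e^{-z})$, a fixed output vector `v`, and weight matrices of shape $(m_{l-1}, m_l)$. The function names are hypothetical.

```python
import numpy as np

def loss_and_grads(weights, v, X, y):
    """Empirical cross-entropy loss and its gradient w.r.t. each W_l.

    X has shape (n, d); y has entries +1/-1; v is fixed (not trained).
    """
    n = X.shape[0]
    # Forward pass on the whole batch: H_0 = X, H_l = relu(H_{l-1} W_l).
    hs = [X]
    for W in weights:
        hs.append(np.maximum(hs[-1] @ W, 0.0))
    margins = y * (hs[-1] @ v)
    loss = float(np.mean(np.logaddexp(0.0, -margins)))
    # ell'(z) = -1 / (1 + exp(z)) = -(1 - tanh(z / 2)) / 2 (stable form).
    dloss_df = -y * 0.5 * (1.0 - np.tanh(0.5 * margins)) / n
    # Backward pass: delta is the gradient w.r.t. the pre-activations.
    delta = np.outer(dloss_df, v) * (hs[-1] > 0)
    grads = [None] * len(weights)
    for l in range(len(weights) - 1, -1, -1):
        grads[l] = hs[l].T @ delta
        if l > 0:
            delta = (delta @ weights[l].T) * (hs[l] > 0)
    return loss, grads

def gradient_descent(weights, v, X, y, step_size, num_steps):
    """Run full-batch gradient descent; returns final weights and loss history."""
    weights = [W.copy() for W in weights]
    history = []
    for _ in range(num_steps):
        loss, grads = loss_and_grads(weights, v, X, y)
        history.append(loss)
        weights = [W - step_size * G for W, G in zip(weights, grads)]
    return weights, history
```

The loss history records the loss before each update, so its first entry is the loss at initialization.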
3.2 Matrix Product Representation for Deep ReLU Networks
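Here we introduce the matrix product representation for deep ReLU networks, which plays an essential role in our analysis. Given parameter matrices $\mathbf{W}_1, \dots, \mathbf{W}_L$ and an input $\mathbf{x}$, we set $\mathbf{x}_0 = \mathbf{x}$, and denote by $\mathbf{x}_l$ the output of the $l$-th layer of the ReLU network:
$$\mathbf{x}_l = \sigma(\mathbf{W}_l^\top \mathbf{x}_{l-1}), \quad l \in [L].$$
We also define diagonal binary matrices as follows
$$\mathbf{D}_l(\mathbf{x}) = \mathrm{Diag}\big(\mathbb{1}\{\mathbf{W}_l^\top \mathbf{x}_{l-1} > 0\}\big), \quad l \in [L].$$
Then we have the following representations for the neural network and its gradients:
$$f_{\mathbf{W}}(\mathbf{x}) = \mathbf{v}^\top \Bigg[\prod_{r=l}^{L} \mathbf{D}_r(\mathbf{x}) \mathbf{W}_r^\top\Bigg] \mathbf{x}_{l-1}, \qquad \nabla_{\mathbf{W}_l} f_{\mathbf{W}}(\mathbf{x}) = \mathbf{x}_{l-1} \mathbf{v}^\top \Bigg[\prod_{r=l+1}^{L} \mathbf{D}_r(\mathbf{x}) \mathbf{W}_r^\top\Bigg] \mathbf{D}_l(\mathbf{x}), \quad l \in [L],$$
where we use the following matrix product notation:
$$\prod_{r=l_1}^{l_2} \mathbf{A}_r = \begin{cases} \mathbf{A}_{l_2} \mathbf{A}_{l_2 - 1} \cdots \mathbf{A}_{l_1} & \text{if } l_1 \le l_2, \\ \mathbf{I} & \text{otherwise.} \end{cases}$$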
Since this paper studies the generalization performance, we frequently need to study the training examples $(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_n, y_n)$ as well as a test example $(\mathbf{x}, y)$. To distinguish the $i$-th example in the training sample and the $l$-th layer output of the test input $\mathbf{x}$, we adopt the following notation:
For $i \in [n]$, $l \in [L]$, we use $\mathbf{x}_i$ to denote the $i$-th training input, and $\mathbf{x}_{l,i}$ the output of the $l$-th layer with input $\mathbf{x}_i$.
For $l \in [L]$, we denote by $\mathbf{x}_l$ the output of the $l$-th layer with test input $\mathbf{x}$.
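The matrix product representation of Section 3.2 can be checked numerically. The sketch below assumes the convention $\mathbf{x}_l = \sigma(\mathbf{W}_l^\top \mathbf{x}_{l-1})$ and binary diagonal matrices $\mathbf{D}_l = \mathrm{Diag}(\mathbb{1}\{\mathbf{W}_l^\top \mathbf{x}_{l-1} > 0\})$, so that $\mathbf{x}_l = \mathbf{D}_l \mathbf{W}_l^\top \mathbf{x}_{l-1}$; under these assumptions the output equals $\mathbf{v}^\top [\prod_{r=l}^{L} \mathbf{D}_r \mathbf{W}_r^\top] \mathbf{x}_{l-1}$ and the gradient with respect to $\mathbf{W}_l$ equals $\mathbf{x}_{l-1} \mathbf{v}^\top [\prod_{r=l+1}^{L} \mathbf{D}_r \mathbf{W}_r^\top] \mathbf{D}_l$. The function names are hypothetical.

```python
import numpy as np

def layer_outputs_and_patterns(weights, x):
    """Return [x_0, ..., x_L] and the activation-pattern matrices [D_1, ..., D_L]."""
    xs, Ds = [x], []
    for W in weights:
        pre = W.T @ xs[-1]
        Ds.append(np.diag((pre > 0).astype(float)))
        xs.append(np.maximum(pre, 0.0))
    return xs, Ds

def output_via_products(weights, v, x, l):
    """v^T [D_L W_L^T ... D_l W_l^T] x_{l-1}, with layers indexed 1..L."""
    xs, Ds = layer_outputs_and_patterns(weights, x)
    h = xs[l - 1]
    for r in range(l, len(weights) + 1):
        h = Ds[r - 1] @ weights[r - 1].T @ h
    return float(v @ h)

def grad_via_products(weights, v, x, l):
    """x_{l-1} v^T [D_L W_L^T ... D_{l+1} W_{l+1}^T] D_l, shape (m_{l-1}, m_l)."""
    xs, Ds = layer_outputs_and_patterns(weights, x)
    row = v.copy()  # plays the role of the row vector v^T
    for r in range(len(weights), l, -1):  # r = L, L-1, ..., l+1
        row = row @ Ds[r - 1] @ weights[r - 1].T
    return np.outer(xs[l - 1], row @ Ds[l - 1])
```

For any layer index `l`, `output_via_products` should agree with the plain forward pass, because the activation pattern turns each ReLU into a multiplication by a 0/1 diagonal matrix.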
4 Main Theory
In this section we present our main result. We first introduce several assumptions.
We assume that the input data are normalized: $\mathrm{supp}(\mathcal{D}_x) \subseteq S^{d-1}$.
Assumption 4 is made in most existing work on over-parameterized neural networks (Li and Liang, 2018; Allen-Zhu et al., 2018b; Du et al., 2017b, 2018b). This assumption can be relaxed to the case that $c_1 \le \|\mathbf{x}\|_2 \le c_2$ for all $\mathbf{x} \in \mathrm{supp}(\mathcal{D}_x)$, where $c_2 > c_1 > 0$ are absolute constants. Such relaxation will not affect our final generalization results.
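Denote by $p(\mathbf{u})$ the density of standard Gaussian random vectors. Define
$$\mathcal{F} = \bigg\{ f(\mathbf{x}) = \int_{\mathbb{R}^d} c(\mathbf{u}) \sigma(\mathbf{u}^\top \mathbf{x}) p(\mathbf{u}) \, \mathrm{d}\mathbf{u} : \|c(\cdot)\|_\infty \le 1 \bigg\},$$
where $\sigma(\cdot)$ is the ReLU function. We assume that there exist an $f(\cdot) \in \mathcal{F}$ and a constant $\gamma > 0$ such that $y \cdot f(\mathbf{x}) \ge \gamma$ for all $(\mathbf{x}, y) \in \mathrm{supp}(\mathcal{D})$. Assumption 4 essentially states that there exists a function in the function class $\mathcal{F}$ that gives a constant margin $\gamma$. $\mathcal{F}$ is a fairly large function class. In the definition, each value of $\mathbf{u}$ can be considered as a node in an infinite-width single-hidden-layer ReLU network, and the corresponding product $c(\mathbf{u}) p(\mathbf{u})$ can be considered as the second-layer weight. Therefore $\mathcal{F}$ consists of infinite-width single-hidden-layer ReLU networks whose second-layer weights decay faster than $p(\mathbf{u})$. Assumption 4 is comparable with assumptions made in previous work: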
defined in Assumption 4 corresponds to the function class studied in Rahimi and Recht (2009) when the feature function is ReLU. In this sense, our work essentially studies the generalization performance of over-parameterized deep ReLU networks when the data can be classified by the Random Kitchen Sinks fitting procedure proposed by Rahimi and Recht (2009).
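To make the function class concrete, the following sketch approximates one of its members by Monte Carlo sampling, in the spirit of Random Kitchen Sinks. It assumes the class consists of functions $f(\mathbf{x}) = \int c(\mathbf{u}) \sigma(\mathbf{u}^\top \mathbf{x}) p(\mathbf{u}) \, \mathrm{d}\mathbf{u}$ with $|c(\mathbf{u})| \le 1$ and $p$ the standard Gaussian density; the helper name `rks_function` and the particular choice of $c$ are hypothetical. For $c(\mathbf{u}) = \mathrm{sign}(\mathbf{a}^\top \mathbf{u})$ with a unit vector $\mathbf{a}$, a direct calculation gives $f(\mathbf{x}) = \mathbf{a}^\top \mathbf{x} / \sqrt{2\pi}$, so under this assumed form data that are linearly separable with a margin also satisfy the margin condition.

```python
import numpy as np

def rks_function(c, x, num_nodes=200_000, seed=0):
    """Monte Carlo estimate of E_u[c(u) * relu(u^T x)] with u ~ N(0, I).

    c maps an array of shape (num_nodes, d) to coefficients in [-1, 1];
    each sampled row u_j acts as one hidden node of a wide two-layer network.
    """
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((num_nodes, x.shape[0]))
    return float(np.mean(c(U) * np.maximum(U @ x, 0.0)))

# Example: a bounded coefficient function that reproduces a linear classifier.
a = np.array([1.0, 0.0, 0.0])          # unit direction (illustrative choice)
c = lambda U: np.sign(U @ a)           # |c(u)| <= 1
x = np.array([0.6, 0.8, 0.0])          # a point on the unit sphere
estimate = rks_function(c, x)
exact = float(a @ x) / np.sqrt(2.0 * np.pi)
```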
Define , . We assume that .
We are now ready to present our main theoretical result.
Suppose that is generated via Gaussian initialization. For any , there exist
such that, if and , then with probability at least , gradient descent initialized by with step size finds a point that satisfies
Here are a few remarks on Theorem 4. In our setting the number of hidden layers and the margin can both be considered as constants. However for the sake of completeness we still give the detailed dependency of on and . Our current result has an exponential dependency on , which can potentially be further improved. We leave this as a future research direction. It is worth noting that such exponential dependency on also appears in existing uniform convergence based generalization bounds (Neyshabur et al., 2015; Bartlett et al., 2017; Neyshabur et al., 2017; Golowich et al., 2017; Dziugaite and Roy, 2017; Arora et al., 2018c; Li et al., 2018a), since under our over-parameterization setting the spectral norms of weight matrices are constants greater than one.
Theorem 4 suggests an sample complexity, which is almost (up to a logarithmic factor) independent of the minimum number of nodes per layer . In terms of the condition , we would like to point out that with a proof similar to the proof given in Zou et al. (2018), the dimension can in fact easily be replaced by a term, since our optimization analysis focuses on the optimization of empirical loss. However, since the condition is commonly satisfied in practice, here we choose to present the version so that our detailed analysis can be directly extended to population loss. It is worth noting that our sample complexity result is still dimension-free.
Theorem 4 holds for the over-parameterization setting in the sense that in order to provably achieve expected/population error, the minimum number of nodes per layer should be chosen to be at least . This differs from the optimization results for over-parameterized deep ReLU networks given by Allen-Zhu et al. (2018b); Du et al. (2018b); Zou et al. (2018), where is required to be of order . We remark that such polynomial dependency on the sample size hinders generalization analysis, since it will lead to loose generalization error bounds. In stark contrast, our derived over-parameterization condition is independent of the sample size and is the key to generalization analysis.
5 Proof Sketch of the Main Theory
In this section we sketch the proof of Theorem 4, and provide the insights of the analysis techniques. For the ease of exposition, we first introduce two auxiliary definitions. For a collection of parameter matrices , we call the set
the -neighborhood of . The definition of is motivated by the observation that in a small neighborhood of initialization, deep ReLU networks satisfy good scaling and landscape properties. It also provides a small parameter space and enables a sharper bound, based on Rademacher complexity, on the generalization gap between empirical and expected/population errors.
For a collection of parameter matrices , we define its empirical surrogate error and population surrogate error as follows:
The intuition behind the definition of surrogate error is that, for cross-entropy loss we have , which can be seen as a smooth version of the indicator function , and therefore is related to the classification error of the classifier. Surrogate error plays a pivotal role in our generalization analysis: on the one hand, it is closely related to the derivative of the empirical loss function. On the other hand, by , it also provides an upper bound on the classification error. It is worth noting that the surrogate error is comparable with the ramp loss studied in margin-based generalization error bounds (Neyshabur et al., 2015; Bartlett et al., 2017; Neyshabur et al., 2017; Golowich et al., 2017; Arora et al., 2018c; Li et al., 2018a) in the sense that it is Lipschitz continuous in , which ensures that concentrates around uniformly over the parameter space .
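The relation between the surrogate error and the classification error can be illustrated numerically. The sketch assumes that the empirical surrogate error is the sample average of $-\ell'(y_i f_{\mathbf{W}}(\mathbf{x}_i))$ with $\ell(z) = \log(1 + e^{-z})$, so that $-\ell'(z) = 1/(1 + e^{z})$; since $-\ell'(z) \ge 1/2$ whenever $z < 0$, the indicator $\mathbb{1}\{z < 0\}$ is at most $-2\ell'(z)$. The function names are hypothetical.

```python
import numpy as np

def neg_loss_derivative(z):
    """-ell'(z) = 1 / (1 + exp(z)) for ell(z) = log(1 + exp(-z)).

    Written with tanh for numerical stability: 1/(1+e^z) = (1 - tanh(z/2)) / 2.
    """
    return 0.5 * (1.0 - np.tanh(0.5 * np.asarray(z, dtype=float)))

def empirical_surrogate_error(margins):
    """Average of -ell'(y_i f(x_i)) over the sample (assumed definition)."""
    return float(np.mean(neg_loss_derivative(margins)))

def classification_error(margins):
    """Fraction of examples with negative margin y_i f(x_i) < 0."""
    return float(np.mean(np.asarray(margins) < 0))

# Usage on synthetic margins: the classification error never exceeds
# twice the surrogate error.
margins = np.random.default_rng(0).normal(loc=1.0, scale=2.0, size=1000)
err = classification_error(margins)
surrogate = empirical_surrogate_error(margins)
```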
The proof of Theorem 4 consists of two parts:
In Section 5.1, we study the scaling and landscape properties of Deep ReLU networks with parameter for small enough , where is the random parameter generated via Gaussian initialization.
In Section 5.2, we focus on the gradient descent iterates where is generated via Gaussian initialization, and establish a connection from gradient descent to empirical surrogate error, further to population surrogate error, and finally to population error to complete the proof.
5.1 Scaling and Landscape Analysis Around Initialization
Here we first study the properties of deep ReLU networks with parameter for small enough , where is generated via Gaussian initialization.
Scaling properties at random initialization. We first study the scaling properties of ReLU networks when all parameter matrices are generated via Gaussian initialization. Define
where in the definition of , , and we use a parameter to measure the sparsity of vectors and/or . Intuitively, these sparsity-based bounds can be combined with activation pattern analysis to give refined bounds on the network scaling. The following theorem summarizes the main scaling properties at Gaussian initialization.
Suppose that are generated via Gaussian initialization. There exist absolute constants such that for any , if
for some large enough absolute constant , then with probability at least , the following results hold:
, for all and
, , , for all , and
for all and , where , denote the output of the -th layer of the network at initialization with inputs , respectively.
for all .
It is worth noting that the condition of given in Theorem 5.1 requires that , which, for fixed , is an upper bound of . However, throughout our proof, whenever we apply Theorem 5.1 we always use of order . Therefore the condition on in Theorem 5.1 is essentially .
Uniform scaling analysis over . Based on the scaling analysis at initialization, we now show that for small enough , the key scaling properties obtained at initialized parameter in fact hold around the -neighborhood of .
Suppose that are generated via Gaussian initialization. There exist absolute constants , such that for any , if
then with probability at least the following results hold:
, for all , and ;
for all , and ;
for all satisfying , , , and ,
for all and all ,
where , are the outputs of the -th hidden layer of the ReLU network with input and weight matrices , respectively, ,
is the first order approximation of around in the parameter space. The results of Theorems 5.1 and 5.1 are comparable with similar results given in Allen-Zhu et al. (2018b) and Zou et al. (2018). However, these previous results only give scaling properties over the training set , while our results hold uniformly for all in the input domain. As a consequence, our results not only lead to nice landscape properties of the empirical loss, but also play an important role in showing the concentration of empirical surrogate error around the population surrogate error.
Optimization landscape over . We are now ready to present the results on the landscape of the empirical loss function in the -neighborhood of initialization. There exist absolute constants such that for any , if
then with probability at least ,
for all and .
for all .
For all , it holds that
Theorem 5.1 studies the landscape of the empirical loss function in three aspects: 1 gives upper bounds for the empirical gradients at each layer, 2 gives a lower bound for the last layer derivative, and 3 studies the smoothness property of . All the results relate the empirical loss function to the empirical surrogate loss . It is worth noting that, compared with similar results for over-parameterized optimization problems in Allen-Zhu et al. (2018b); Zou et al. (2018), our results have a more reasonable dependency on the sample size $n$: if we formally let $n$ go to infinity, the gradient lower bounds given in Allen-Zhu et al. (2018b) and Zou et al. (2018) both vanish; the objective semi-smoothness result given by Allen-Zhu et al. (2018b) also explodes and does not provide any meaningful result. In comparison, our results directly guarantee the same landscape properties of the population loss function as the empirical loss function.
5.2 Generalization Guarantee for Gradient Descent
We now study gradient descent starting at generated via Gaussian initialization. Our analysis consists of the following two parts:
We show that if the minimum number of nodes per layer is large enough and the step size is chosen properly, the iterates of gradient descent stay inside for small enough . Moreover, gradient descent is able to find a point such that the empirical surrogate error is small.
We establish the relation between the empirical surrogate error and the population error uniformly for all , and show that when the sample size is large enough, a small empirical surrogate error implies small expected/population error.
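The two parts above combine into a short chain of inequalities. The following is a sketch only: it assumes the surrogate errors are the averages of $-\ell'$ over the training sample and over the data distribution (written $\mathcal{E}_S$ and $\mathcal{E}_{\mathcal{D}}$ here, a notation chosen for this illustration), and it uses the elementary bound $\mathbb{1}\{z < 0\} \le -2\ell'(z)$ for the cross-entropy loss. The supremum is taken over the neighborhood of initialization that contains the iterate $\mathbf{W}$.

```latex
\begin{align*}
\mathbb{P}_{(\mathbf{x},y)\sim\mathcal{D}}\big[\, y \cdot f_{\mathbf{W}}(\mathbf{x}) < 0 \,\big]
  &= \mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}\big[\, \mathbb{1}\{ y \cdot f_{\mathbf{W}}(\mathbf{x}) < 0 \} \,\big] \\
  &\le 2\, \mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}\big[ -\ell'\big( y \cdot f_{\mathbf{W}}(\mathbf{x}) \big) \big]
   = 2\, \mathcal{E}_{\mathcal{D}}(\mathbf{W}) \\
  &\le 2\, \mathcal{E}_{S}(\mathbf{W})
   + 2 \sup_{\mathbf{W}'} \big| \mathcal{E}_{\mathcal{D}}(\mathbf{W}') - \mathcal{E}_{S}(\mathbf{W}') \big| .
\end{align*}
```

The first term on the last line is controlled by the optimization argument (gradient descent finds an iterate with small empirical surrogate error), and the second by a uniform convergence argument over the neighborhood, which is where the sample size enters.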
Convergence of gradient descent. We first give the convergence result of gradient descent based on the landscape analysis given by Theorems 5.1 and 5.1. Suppose that are generated via Gaussian initialization. For any , there exists
such that, for any , with probability at least , gradient descent starting at with step size generates iterates that satisfy:
for all , where .
There exists such that .
Theorem 5.2 suggests that gradient descent is able to find a point which gives small empirical surrogate error without escaping from . In the next step, we relate with the population error using a uniform convergence argument.
Population error bound and sample complexity. The following theorem gives an upper bound of the population error based on the empirical surrogate error and the sample size . Suppose that is generated via Gaussian initialization and the results of Theorem 5.1 and Theorem 5.1 hold with . Then there exists an absolute constant and