1 Introduction
Neural network learning has become a key machine learning approach and has achieved remarkable success in a wide range of real-world domains, such as computer vision, speech recognition, and game playing [KSH12, HZRS16, GMH13, SHM16]. In contrast to the widely accepted empirical success, much less theory is known. Despite a recent boost of theoretical studies, many questions remain largely open, including fundamental ones about the optimization and generalization in learning neural networks.
A multi-layer neural network is a function defined via layers of neurons: neurons in the input layer are the coordinates of the input; neurons in each subsequent layer take a linear combination of the output of the previous layer and then apply an activation function; the output of the neural network is that of the neurons in the last layer. The weights in the linear combinations in all layers are called the parameters of the network, and the layers strictly between the input and the output are called hidden layers. In the problem of learning neural networks, given training data whose inputs are drawn i.i.d. from some unknown distribution and are paired with labels, the goal is to find a network with a small population risk with respect to some prescribed loss function.
One key challenge in analyzing the learning of neural networks is that the corresponding optimization is nonconvex and is theoretically hard in the general case [ZLWJ17, Sha18]. This is in sharp contrast to the fact that simple optimization algorithms like stochastic gradient descent (SGD) and its variants usually produce good solutions in practice. One of the empirical tricks to overcome the learning difficulty is to use neural networks that are heavily overparameterized [ZBH17]: the number of parameters is usually larger than the number of the training samples. Unlike traditional convex models, overparameterization for neural networks actually improves both training speed and generalization. For example, it is observed by [LSS14] that on synthetic data generated from a ground truth network, SGD converges faster when the learnt network has more parameters than the ground truth. Perhaps more interestingly, [AGNZ18] found that overparameterized networks learnt in practice can often be compressed to simpler ones with much fewer parameters, without hurting their ability to generalize; however, directly learning such simpler networks runs into worse results due to the optimization difficulty.
The practical findings suggest that, although optimizing a neural network alone can be computationally expensive, using an overparameterized neural network as an improper learner for some simpler hypothesis class, especially smaller neural networks, might not actually be difficult. Towards this end, the following question is of great interest both in theory and in practice:
Question 1:
Can overparameterized networks be used as efficient improper learners for neural networks of fewer parameters or simpler structures?
Improper learners are common in the theory literature. Polynomials are often used as improper learners for various purposes such as learning DNF and density estimation (e.g., [KS04, ADLS17]). Several recent works also study using kernels as improper learners for neural networks [LSS14, ZLJ16, Dan17, GK18].
However, in practice, multi-layer neural networks with the rectified linear unit (ReLU) activation function have been the dominant learners across vastly different domains. It is known that some other activation functions, especially smooth ones, can lead to provable learning guarantees. For example, [APVZ14] uses a two-layer neural network with exponential activation to learn polynomials. To the best of our knowledge, the practical universality of the non-smooth ReLU activation is still not well understood. This motivates us to study ReLU networks.
Recently, some progress has been made towards understanding how, in two-layer networks with ReLU activations, overparameterization can make the learning process easier. In particular, [BGMS17] shows that such networks can learn linearly separable data using just SGD. [LL18] shows that SGD learns a network with good generalization when the data come from mixtures of well-separated distributions. [LL18] and [DZPS18] show that gradient descent can perfectly fit the training samples when the data is not degenerate. These results are only for two layers and only applicable to structured data or the training data. This leads to the following natural question:
Question 2:
Can overparameterization simplify the learning process without any structural assumptions about the input distribution?
Most existing works analyzing the learning process of neural networks [Kaw16, SC16, XLS16, GLM17, SJL17, Tia17, BG17, ZSJ17, LY17, BL17, LMZ17, VW18, GKLW19, BJW18] need to make unrealistic assumptions about the data (such as being random Gaussian), and/or have strong assumptions about the network (such as using linear activations), and are restricted to two-layer networks. A theorem without distributional assumptions about the data is often more desirable. Indeed, how to obtain a result that does not depend on the data distribution, but only on the hypothesis class itself, lies at the center of PAC learning, which is one of the foundations of machine learning theory [Val84].
Following these questions, we also note that determining the exact amount of overparameterization can be challenging without a clear knowledge of the ground truth. In practice, researchers usually create networks with a significantly large number of parameters, and surprisingly, the population risk often does not increase. Thus, we would also like to understand the following question:
Question 3:
Can overparameterized networks be learnt to a small population risk, using a number of samples that is (almost) independent of the number of parameters?
This question cannot be studied under the traditional VC-dimension learning theory, since in principle, the VC dimension of the network grows with the number of parameters. Recently, several works [BFT17, NBMS17, AGNZ18, GRS18] explain generalization in the overparameterized setting by studying some other “complexity” of the learnt neural networks. Most related to the discussion here is [BFT17], where the authors prove a generalization bound in terms of the norms (of weight matrices) of each layer, as opposed to the number of parameters. However, their norms are “sparsity induced norms”: in order for the norm not to scale with the number of hidden neurons, it essentially requires that the number of neurons with nonzero weights also not scale with the number of hidden neurons. This more or less reduces the problem to the non-overparameterized case. More importantly, it is not clear from these results how a network with such low “complexity” and a good training loss can be produced by the training method.
1.1 Our Results
In this work, we extend the theoretical understanding of neural networks both in the algorithmic and the generalization perspectives. We give positive answers to the above questions for networks of two and three layers. We prove that when the network is sufficiently overparameterized, simple optimization algorithms (SGD or its variants) can learn ground truth networks with a small generalization error in polynomial time using polynomially many samples.
To state our result in a simple form, we assume that there exists a (two- or three-layer) ground truth network with risk , and show that one can learn this hypothesis class, up to risk , using larger (two- or three-layer) networks of size greater than a fixed polynomial in the size of the ground truth, in , and in the “complexity” of the activation function used in the ground truth. Furthermore, the sample complexity is also polynomial in these parameters, and only polylogarithmic in the size of the overparameterized network. Our result is proved for networks with the ReLU activation, whereas the ground truth can use a variety of smooth activation functions.
Furthermore, unlike the two-layer case (where there is only one hidden layer), in which the optimization landscape in the overparameterized regime is almost convex [LL18, DZPS18], our result on three-layer networks gives the first theoretical proof that learning neural networks, even when there are sophisticated nonconvex interactions between hidden layers, might still not be difficult, as long as sufficient overparameterization is provided. This gives further insights into the fundamental questions about the algorithmic and generalization aspects of neural network learning. Since practical neural networks are heavily overparameterized, our results may also provide theoretical insights into networks used in various applications.
We highlight a few interesting conceptual findings we used to derive our main result:

In the overparameterized regime, good networks with small risks are almost everywhere. With high probability over the random initialization, there exists a good network in the “close” neighborhood of the initialization.

In the overparameterized regime, if one stays close enough to the random initialization, the learning process is tightly coupled with that of a “pseudo network” which has a benign optimization landscape.

In the overparameterized regime, every neuron matters. During training, information is spread out among all the neurons instead of collapsed into a few neurons. With this structure, we can prove a new generalization bound that is (almost) independent of the number of neurons, even when all neurons have nonnegligible contributions to the output.
Combining the first and the second items leads to the convergence of the optimization process, and combining that with the third item gives our generalization result.
Roadmap. We formally define our (improper) learning problem in Section 2, and introduce notations in Section 3. In Section 4 we present our main theorems and give some examples. Section 5 summarizes the main proof ideas for our three-layer network results, with details in Appendices B, C, D, E and F. Our two-layer proof is much easier and is included in Appendix G.
2 Problem and Assumptions
We consider learning some unknown distribution of data points , where is the input data point and is the label associated with this data point. Without loss of generality, we restrict our attention to the distributions where each data point in is of unit Euclidean norm and satisfies . (Footnote 1: This is without loss of generality, since can always be padded to the last coordinate, and can always be ensured from by padding to the second-last coordinate. We make this assumption to simplify our notations: for instance, allows us to focus only on ground truth networks without bias.)
We consider a loss function such that for every label in the support of , the loss function is nonnegative, convex, 1-Lipschitz continuous and 1-Lipschitz smooth and . (Footnote 2: In fact, the nonnegativity assumption and the 1-Lipschitz smoothness assumption are not needed for our two-layer result, but we state all of them here for consistency.) Assume that there exists a ground truth function and some so that
(1) 
Our goal is to learn a neural network for a given , satisfying
(2) 
using a data set consisting of i.i.d. samples from the distribution .
In this paper, we consider being equipped with the ReLU activation function: . It is arguably the most widely used activation function in practice. We assume uses arbitrary smooth activation functions.
Below, we describe the details of the ground truth, our network, and the learning process for the cases of two and three layers, respectively.
2.1 Two Layer Networks
Ground truth . The ground truth for our two-layer case is
(3) 
where each is infinite-order smooth, are ground truth weight vectors, and are weights. Without loss of generality, we assume and .
Remark 2.1.
Standard two-layer networks are only special cases of our formulation (3). Indeed, since , if we set then
(4) 
is a two-layer network with activation functions . Our formulation (3) allows for more functions, and in particular, captures combinations of correlations between nonlinear and linear measurements of different directions of .
Our ReLU network . Our (improper) learners are two-layer networks with
(5) 
Here, represents the hidden weight matrix and are its rows, is the bias vector, and each is the output weight vector.
To simplify analysis, we only update and keep and at initialization values. For this reason, we also write the functions as and .
Let denote the initial value for and sometimes use and to emphasize that they are at random initialization. Below we specify our random initialization:

The entries of and are i.i.d. random Gaussians from .

The entries of each are i.i.d. random Gaussians from for some fixed . (Footnote 3: We shall choose in the proof for technical reasons. As we shall see in the three-layer case, if weight decay is used, one can relax this to .)
Learning process. Given data set where each , the network is first randomly initialized and then updated by SGD. Let denote the hidden weight matrix at the th iteration of SGD. (Note that is the matrix of increments.) For , define
Given step size and , the SGD algorithm is presented in Algorithm 1.
We remark that the (sub)gradient is taken with respect to . (Footnote 4: Strictly speaking, does not have a gradient everywhere due to the non-differentiability of ReLU. Throughout the paper, is used to denote the value computed by setting , which is also what is used in practical auto-differentiation software.)
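To make the learning process concrete, the following is a minimal sketch of one SGD step in which only the hidden weight matrix is updated while the biases and output weights stay at their initial values. It is an illustration under stated assumptions, not Algorithm 1 itself: the names `init_network`, `forward`, `sgd_step`, `eps_a`, and `lr` are hypothetical, the initialization scales (variance `1/m` for hidden weights and biases, standard deviation `eps_a` for output weights) are assumed for illustration, the squared loss is a stand-in for the loss in the text, and the ReLU derivative at zero is assumed to be 0.

```python
import random


def relu(z):
    return z if z > 0.0 else 0.0


def init_network(d, m, k, eps_a, rng):
    """Random initialization (scales are assumptions made for this sketch)."""
    std_w = (1.0 / m) ** 0.5
    W = [[rng.gauss(0.0, std_w) for _ in range(d)] for _ in range(m)]
    b = [rng.gauss(0.0, std_w) for _ in range(m)]
    a = [[rng.gauss(0.0, eps_a) for _ in range(m)] for _ in range(k)]
    return W, b, a


def forward(x, W, b, a):
    """Two-layer ReLU network: output r is sum_i a[r][i] * relu(<W[i], x> + b[i])."""
    pre = [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]
    hidden = [relu(z) for z in pre]
    out = [sum(ari * h for ari, h in zip(a_r, hidden)) for a_r in a]
    return out, pre


def sgd_step(x, y, W, b, a, lr):
    """One SGD step on W only; b and a are left untouched.

    Uses the stand-in loss 0.5 * ||f(x) - y||^2 (an assumption for this sketch).
    """
    out, pre = forward(x, W, b, a)
    resid = [o - t for o, t in zip(out, y)]
    new_W = []
    for i, row in enumerate(W):
        # Subgradient convention at 0 is an assumption: inactive neurons get 0.
        gate = 1.0 if pre[i] > 0.0 else 0.0
        coeff = gate * sum(r * a_r[i] for r, a_r in zip(resid, a))
        new_W.append([w - lr * coeff * xi for w, xi in zip(row, x)])
    return new_W
```

In this sketch a neuron whose pre-activation is negative on the sampled input receives no update, which is the behavior one obtains from auto-differentiation under the assumed convention.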
2.2 Three Layer Networks
Ground truth . The ground truth for our three-layer case is
(6) 
where each function is infinite-order smooth, vectors are the ground truth weights of the first layer, vectors are the ground truth weights of the second layer, and reals are weights. Without loss of generality, we assume and .
Remark 2.2.
Standard three-layer networks are only special cases of our formulation (6). If we set as constant functions and , then
(7) 
is a three-layer network with activation functions . Our formulation (6) is much more general. As an interesting example, even in the special case of , the ground truth
(8) 
captures combinations of correlations of combinations of nonlinear measurements in different directions of . We do not know how to compute this using two-layer networks, and the ability to learn these correlations is the critical advantage of three-layer networks compared to two-layer ones.
Remark 2.3.
In fact, our results of this paper even apply to the following general form:
(9) 
with the mild requirement that for each . We choose to present the slightly weaker formulation (6) for cleaner proofs.
Our ReLU network . Our (improper) learners are three-layer networks with
(10)  
(11) 
Above, there are and hidden neurons in the first and second layers respectively. (Footnote 5: Our theorems are stated for the special case , but we state the more general case of for most lemmas because they may be of independent interest.) Matrices and represent the weights of the first and second hidden layers respectively, and and represent the corresponding bias vectors. Each is an output weight vector.
To simplify our analysis, we only update and but keep , and at their initial values. We denote by and the initial value of and , and sometimes use , and to emphasize that they are at random initialization. We also denote the th output and the vector output . Below we specify our random initialization:

The entries of and are i.i.d. random Gaussians from .

The entries of and are i.i.d. random Gaussians from .

The entries of each are i.i.d. random Gaussians from for . (Footnote 6: Recall in our two-layer result we have chosen due to technical reasons; thanks to weight decay, we can simply select in our three-layer case.)
Learning process. As in the two-layer case, we use and to denote the weight matrices at the th iteration of the optimization algorithm (so that and are the increments). For three-layer networks, we consider two variants of SGD discussed below.
Given sample and , define function
where the role of is to scale down the entire function (because a ReLU network is positively homogeneous). Both variants of SGD optimize with respect to matrices as well as this parameter . They start with and slowly decrease it across iterations; this is similar to weight decay used in practice.
Remark 2.4.
The scaling can be viewed as a simplified version of weight decay. Intuitively, in the training process, it is easy to add new information (from the ground truth) to our current network, but hard to forget “false” information that is already in the network. Such false information can be accumulated from randomness of SGD, nonconvex landscapes, and so on. Thus, by scaling down the weights of the current network, we can effectively forget false information.
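The positive homogeneity that makes such scaling meaningful can be checked on a single bias-free ReLU layer. The sketch below is only an illustration of that property, not of Algorithm 2; the helper names `relu_layer` and `scale_rows` and the scaling factor `c` are hypothetical.

```python
def relu_layer(W, x):
    """Bias-free ReLU layer: entry i is max(<W[i], x>, 0)."""
    return [max(sum(w * xi for w, xi in zip(row, x)), 0.0) for row in W]


def scale_rows(W, c):
    """Multiply every weight by the same factor c."""
    return [[c * w for w in row] for row in W]


# For any c >= 0, scaling the weights by c scales the layer output by c.
W = [[2.0, -4.0], [1.0, 3.0]]
x = [1.0, 1.0]
c = 0.5
scaled_output = relu_layer(scale_rows(W, c), x)
output_scaled = [c * h for h in relu_layer(W, x)]
```

Here `scaled_output` and `output_scaled` coincide, so shrinking the weights shrinks whatever the layer currently computes by the same factor. The property fails for negative `c`, which is why the scaling parameter must be nonnegative.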
Algorithm 2 presents the details. Choosing gives our first variant of SGD using objective function (12), and choosing gives our second variant using objective (13).
First variant of SGD. In each round , we use (noisy) SGD to minimize the following stochastic objective for some fixed :
(12) 
Above, the objective is stochastic because (1) is a random sample from the training set, and (2) and are two small perturbation random matrices with entries i.i.d. drawn from and respectively. We introduce such Gaussian perturbation for theoretical purposes (similar to smoothed analysis) and it may not be needed in practice.
Specifically, in each round , Algorithm 2 starts with weight matrices and performs iterations. In each iteration it goes in the negative direction of the gradient (with respect to a stochastic choice of ). Let the final matrices be . At the end of this round , Algorithm 2 performs weight decay by setting for some .
Remark.
Second variant of SGD. In this variant we modify the stochastic objective to make the training method more sample-efficient (at least for theory):
(13) 
This time, the stochastic randomness not only comes from , , , but also from , a random diagonal matrix with diagonal entries i.i.d. uniformly drawn from . This matrix is similar to the Dropout technique [SHK14] used in practice which randomly masks out neurons. We make a slight modification that we only mask out the weight increments , but not the initialization .
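The masking of increments (but not of the initialization) can be sketched as follows. This is a simplified illustration for a single weight matrix, and it rests on assumptions: the diagonal entries are taken to be uniform over +1 and -1, the mask is applied row-wise, and the names `sample_sign_mask` and `masked_weights` are hypothetical.

```python
import random


def sample_sign_mask(m, rng):
    """Diagonal of a random diagonal matrix; the support {+1, -1} is an assumption."""
    return [rng.choice((1.0, -1.0)) for _ in range(m)]


def masked_weights(W0, W_inc, mask):
    """Effective weights W0 + diag(mask) @ W_inc.

    Only the increment W_inc is masked; the initialization W0 is kept as is.
    """
    return [
        [w0 + s * dw for w0, dw in zip(row0, row_inc)]
        for row0, row_inc, s in zip(W0, W_inc, mask)
    ]
```

With a zero increment the effective weights equal the initialization regardless of the mask, which reflects the design choice that the random initialization is never perturbed by this noise.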
3 Notations
We use to denote that . We use or to denote the indicator function of the event . We denote by and the Euclidean and infinity norms of vectors , and the number of nonzeros of . We also abbreviate when it is clear from the context. We denote the row norm for (for ) as
(14) 
By definition, is the Frobenius norm of . We use to denote the matrix spectral norm. For a matrix , we use or sometimes to denote the th row of .
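As a concrete reading of the row norm, the sketch below assumes it is the p-norm of the vector of Euclidean row norms, which is consistent with the statement that the case p = 2 is the Frobenius norm. The function names `row_norm` and `frobenius_norm` are hypothetical.

```python
import math


def row_norm(W, p):
    """Assumed definition: (sum_i ||W[i]||_2 ** p) ** (1 / p), for p >= 1."""
    row_l2 = [math.sqrt(sum(v * v for v in row)) for row in W]
    return sum(r ** p for r in row_l2) ** (1.0 / p)


def frobenius_norm(W):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(v * v for row in W for v in row))
```

For example, for the matrix with rows (3, 4) and (6, 8), the Euclidean row norms are 5 and 10, so the assumed row norm with p = 1 is 15, while with p = 2 it agrees with the Frobenius norm.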
We say a function is Lipschitz continuous if ; say it is smooth if its gradient is Lipschitz continuous, that is ; and say it is second-order smooth if its Hessian is Lipschitz continuous, that is .
For notational simplicity, with high probability (or w.h.p.) means with probability for a sufficiently large constant for two-layer networks, and for three-layer networks. In this paper, hides factors of for two-layer networks, or for three-layer networks.
Wasserstein distance. The Wasserstein distance between random variables is
(15) 
where the infimum is taken over all possible joint distributions over where the marginal on (resp. ) is distributed in the same way as (resp. ).
Slightly abusing notation, in this paper, we say a random variable satisfies with high probability if (1) w.h.p. and (2) . For instance, if , then with high probability.
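For intuition about the infimum over couplings, the following sketch computes a Wasserstein distance between two one-dimensional empirical samples of equal size. It relies on the standard fact that, on the real line and for order p at least 1, matching sorted samples is an optimal coupling. The order p = 2 default and the name `wasserstein_1d` are assumptions of this sketch, and the empirical setting is a simplification of the definition for general random variables.

```python
def wasserstein_1d(xs, ys, p=2):
    """Order-p Wasserstein distance between two equal-size 1D empirical samples.

    On the real line, pairing the i-th smallest x with the i-th smallest y
    attains the infimum over couplings when p >= 1.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("samples must be non-empty and of equal size")
    pairs = zip(sorted(xs), sorted(ys))
    mean_cost = sum(abs(a - b) ** p for a, b in pairs) / len(xs)
    return mean_cost ** (1.0 / p)
```

Two samples that are permutations of each other are at distance zero, and translating a sample by a constant moves it by exactly that constant in this distance.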
Function complexity. The following notion measures the complexity of any smooth activation function used in the ground truth network. Suppose . Given nonnegative , the complexity
(16) 
(17) 
where is a sufficiently large constant (e.g., ). In our two-layer network, we use
(18) 
to denote the (maximum) complexity of all activation functions. In our three-layer network, we use
(19)  
(20) 
to denote the complexity of the first and second hidden layers, respectively. We assume throughout the paper that these terms are bounded.
Remark.
Intuitively, mainly measures the sample complexity: how many samples are required to learn correctly; while mainly bounds the network size: how much overparameterization is needed for the algorithm to (efficiently) learn up to error. It always satisfies . (Footnote 7: Recall for every .) However, for functions such as or low-degree polynomials, and only differ by .
4 Main Results
4.1 Two Layer Networks
We have the following main theorem for two-layer networks. (Recall that is the number of hidden neurons in the ground truth network, is the output dimension.)
Theorem 1 (two-layer).
For every , there exists
such that for every and every , choosing for the initialization, choosing learning rate and
with high probability over the random initialization, SGD after iteration satisfies
(Above, takes expectation over the randomness of SGD.)
Remark.
SGD only takes one example per iteration. Thus, sample complexity is also at most .
Example 4.1.
For functions such as or low-degree polynomials, and . Our theorem indicates that for ground truth networks with such activation functions, we can learn them using two-layer ReLU networks with
size and sample complexity 
We note is (almost) independent of , the amount of overparameterization in our network, and is independent of so is dimension-free.
Example 4.2.
If is or , to get approximation we can truncate their Taylor series at degree . One can verify that by the fact that for every , and . Thus, ground truth networks with such activations can also be learned using two-layer ReLU networks with
size and sample complexity 
One might want to compare our result to [APVZ14]. First of all, our result allows activation functions in the ground truth to be infinite-degree polynomials, which is not captured by [APVZ14]. More importantly, to learn a polynomial with degree , their sample complexity is but our result is input-dimension free.
One can also view Theorem 1 as a nonlinear analogue to the margin theory for linear classifiers. The ground truth network with a small population risk (and of bounded norm) can be viewed as a “large margin nonlinear classifier.” In this view, Theorem 1 shows that assuming the existence of such a large-margin classifier, SGD finds a good solution with sample complexity mostly determined by the margin, instead of the dimension of the data.
Inductive Bias. Some recent works (e.g., [ALS18]) show that when the possibly deep network is heavily overparameterized (that is, is polynomial in the number of training samples) and no two training samples are identical, then SGD can find a global optimum with classification error (or find a solution with training loss) in polynomial time. This does not come with generalization, since it can even fit random labels. Our theorem, combined with [ALS18], confirms the inductive bias of SGD for two-layer networks: when the labels are random, SGD finds a network that memorizes the training data; when the labels are (even only approximately) realizable by some ground truth network, then SGD learns it by finding a network that can generalize. This gives an explanation towards the well-known empirical observations of such inductive bias (e.g., [ZBH17]) in the two-layer setting, and is more general than [BGMS17], in which the ground truth network is only linear.
4.2 Three Layer Networks
We give the main theorem for using the first variant of SGD to train three-layer networks.
Theorem 2 (three-layer, ).
Consider Algorithm 2 with . For every constant , every , there exists
such that for every , and properly set in Table 1, as long as
there is a choice and such that with probability ,
We emphasize the result is for the population risk over , the real data distribution, instead of , the empirical distribution. Thus, Theorem 2 shows that using samples we can find a network with a small population risk.
Remark 4.3.
As grows large, this sample complexity bound scales polynomially with so may not be very efficient (we did not try hard to improve the exponent ). Perhaps interestingly, this is already nontrivial, because can be much smaller than , the number of parameters of the network or equivalently the naive VC dimension bound. In our second variant of SGD, we shall reduce it further to a polylogarithmic dependency in .
Example 4.4.
Recall from Remark 2.2 that a major advantage of using a three-layer network, compared to a two-layer one, is the ability to learn (combinations of) correlations between nonlinear measurements of the data. This corresponds to the special case and our three-layer ground truth (for the th output) can be
(21) 
Since , we have . Thus, learning this three-layer network has essentially the same complexity as learning each in two-layer networks. As a concrete example, a three-layer network can learn up to accuracy in complexity , while it is unclear how to do so using two-layer networks.
Remark 4.5.
For general , ignoring , the complexity of three-layer networks is essentially . This is necessary in some sense: consider the case when for a very large parameter , then is just a function , and we have .
We next give the main theorem for SGD with Dropout-type noise (Algorithm 2 with ) to train three-layer networks. It has better sample efficiency compared to Theorem 2.
Theorem 3 (three-layer, ).
Consider Algorithm 2 with . In the same setting as Theorem 2, for every , as long as
there is a choice , such that with probability at least ,
As mentioned, the training algorithm for this version of SGD uses a random diagonal scaling , which is similar to the Dropout trick used in practice to reduce sample complexity by turning on and off each hidden neuron. Theorem 3 shows that in this case, the sample complexity needed to achieve a small risk scales only polynomially with the complexity of the ground truth network, and is (almost) independent of , the amount of overparameterization in our network.
5 Main Lemmas for Three Layer Networks
We present the key technical lemmas we used for proving the three-layer network Theorems 2 and 3. The two-layer result is based on similar ideas but is simpler; we defer it to Appendix G.
In Section 5.1, we show the existence of some good “pseudo network” that can approximate the ground truth. In Section 5.2, we present our coupling technique between a real network and a pseudo network. In Section 5.3, we present the key lemma about the optimization procedure. In Section 5.4, we state a simple generalization bound that is compatible with our algorithm. These techniques together give rise to the proof of Theorem 2. In Section 5.5, we present additional techniques needed to show Theorem 3.
5.1 Existence
We wish to show the existence of some good “pseudo network” that can approximate the ground truth network. In a pseudo network, each ReLU activation is replaced with where is the value at random initialization. Formally, let

denote a diagonal sign matrix indicating the sign of the ReLU’s for the first layer at random initialization, that is, , and

denote the diagonal sign matrix of the second layer at random initialization.
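The idea of a pseudo network can be sketched on a single hidden layer, which is a simplification of the three-layer setting used here. Each ReLU is replaced by a fixed 0/1 gate determined by the sign pattern at the initial weights, so the pseudo network is linear in the current weights. The names `real_output` and `pseudo_output` are hypothetical, and the convention that a zero pre-activation counts as active is an assumption.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def real_output(W, b, a, x):
    """Scalar output sum_i a[i] * relu(<W[i], x> + b[i])."""
    return sum(ai * max(dot(row, x) + bi, 0.0) for ai, row, bi in zip(a, W, b))


def pseudo_output(W, W0, b, a, x):
    """Same network, but each ReLU is replaced by its on/off gate at W0.

    The gates depend only on the initialization W0, so the result is a
    linear function of the current weights W.
    """
    total = 0.0
    for ai, row, row0, bi in zip(a, W, W0, b):
        gate = 1.0 if dot(row0, x) + bi >= 0.0 else 0.0
        total += ai * gate * (dot(row, x) + bi)
    return total
```

At the initialization the two outputs agree exactly, and they continue to agree as long as the weights move so little that no neuron changes its sign on the given input. Once a sign flips, the two outputs can differ, which is the gap a coupling argument needs to control.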
Consider the output of a three-layer network at randomly initialized sign without bias as
(22)  
(23) 
Lemma 5.1 (existence).
For every , there exists
(24)  
(25) 
such that if , then with high probability, there exist weights with
(26) 
such that
(27) 
and hence,