Neural network learning has become a key machine learning approach and has achieved remarkable success in a wide range of real-world domains, such as computer vision, speech recognition, and game playing [KSH12, HZRS16, GMH13, SHM16]. In contrast to this widely acknowledged empirical success, much less is known in theory. Despite a recent boost of theoretical studies, many questions remain largely open, including fundamental ones about optimization and generalization in learning neural networks.
A neural network of L layers is a function defined via layers of neurons: neurons in the first layer are the coordinates of the input x; neurons in each subsequent layer take a linear combination of the outputs of the previous layer and then apply an activation function; the output of the neural network is given by the neurons in the last layer. The weights in the linear combinations across all layers are called the parameters of the network, and the layers between the first and the last are called hidden layers. In the problem of learning neural networks, given training data consisting of pairs (x, y), where each input x is drawn i.i.d. from some unknown distribution and y is its label, the goal is to find a network with a small population risk with respect to some prescribed loss function.
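To make the layered definition above concrete, here is a minimal sketch of such a forward pass. It is our own illustration in plain Python, not code from any referenced work; all names and the tiny example weights are ours.

```python
def relu(z):
    return max(z, 0.0)

def matvec(W, x):
    # one layer's linear combination of the previous layer's outputs
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def forward(weights, x, activation=relu):
    # `weights` holds one matrix per layer after the input layer;
    # hidden layers apply the activation, the last layer does not.
    h = x
    for ell, W in enumerate(weights):
        h = matvec(W, h)
        if ell < len(weights) - 1:
            h = [activation(z) for z in h]
    return h  # the output: neurons of the last layer

# usage: a network with one hidden layer on a 2-dimensional input
W1 = [[1.0, -1.0], [0.5, 0.5]]
W2 = [[1.0, 1.0]]
print(forward([W1, W2], [1.0, 2.0]))
```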
One key challenge in analyzing the learning of neural networks is that the corresponding optimization is non-convex and theoretically hard in the general case [ZLWJ17, Sha18]. This is in sharp contrast to the fact that simple optimization algorithms like stochastic gradient descent (SGD) and its variants usually produce good solutions in practice. One empirical trick to overcome the learning difficulty is to use neural networks that are heavily overparameterized [ZBH17]: the number of parameters is usually larger than the number of training samples. Unlike for traditional convex models, overparameterization in neural networks actually improves both training speed and generalization. For example, it is observed by [LSS14] that on synthetic data generated from a ground truth network, SGD converges faster when the learnt network has more parameters than the ground truth. Perhaps more interestingly, [AGNZ18] found that overparameterized networks learnt in practice can often be compressed to simpler ones with far fewer parameters, without hurting their ability to generalize; however, directly learning such simpler networks leads to worse results due to the optimization difficulty.
These practical findings suggest that, although optimizing a neural network alone can be computationally expensive, using an overparameterized neural network as an improper learner for some simpler hypothesis class, especially smaller neural networks, might not actually be difficult. Towards this end, the following question is of great interest both in theory and in practice:
Can overparameterized networks be used as efficient improper learners for neural networks of fewer parameters or simpler structures?
Improper learners are common in the theory literature. Polynomials are often used as improper learners for various purposes such as learning DNFs and density estimation (e.g., [KS04, ADLS17]). Several recent works also study using kernels as improper learners for neural networks [LSS14, ZLJ16, Dan17, GK18].
However, in practice, multi-layer neural networks with the rectified linear unit (ReLU) activation function have been the dominant learners across vastly different domains. It is known that some other activation functions, especially smooth ones, can lead to provable learning guarantees. For example, [APVZ14] uses a two-layer neural network with exponential activation to learn polynomials. To the best of our knowledge, the practical universality of the non-smooth ReLU activation is still not well understood. This motivates us to study ReLU networks.
Recently, some progress has been made towards understanding how overparameterization can make the learning process easier for two-layer networks with ReLU activations. In particular, [BGMS17] shows that such networks can learn linearly-separable data using just SGD. [LL18] shows that SGD learns a network with good generalization when the data come from mixtures of well-separated distributions. [LL18] and [DZPS18] show that gradient descent can perfectly fit the training samples when the data is non-degenerate. These results are restricted to two layers and apply only to structured data or to the training data. This leads to the following natural question:
Can overparameterization simplify the learning process without any structural assumptions about the input distribution?
Most existing works analyzing the learning process of neural networks [Kaw16, SC16, XLS16, GLM17, SJL17, Tia17, BG17, ZSJ17, LY17, BL17, LMZ17, VW18, GKLW19, BJW18] need to make unrealistic assumptions about the data (such as being random Gaussian), and/or strong assumptions about the network (such as using linear activations), and are restricted to two-layer networks. A theorem without distributional assumptions about the data is often more desirable. Indeed, how to obtain a result that does not depend on the data distribution, but only on the hypothesis class itself, lies at the center of PAC-learning, which is one of the foundations of machine learning theory [Val84].
Following these questions, we also note that determining the exact amount of overparameterization needed can be challenging without clear knowledge of the ground truth. In practice, researchers usually create networks with a very large number of parameters, and surprisingly, the population risk often does not increase. Thus, we would also like to understand the following question:
Can overparameterized networks be learnt to a small population risk, using a number of samples that is (almost) independent of the number of parameters?
This question cannot be studied under the traditional VC-dimension learning theory, since in principle, the VC dimension of the network grows with the number of parameters. Recently, several works [BFT17, NBMS17, AGNZ18, GRS18] explain generalization in the overparameterized setting by studying some other “complexity” of the learnt neural networks. Most related to the discussion here is [BFT17], where the authors prove a generalization bound in terms of the norms (of weight matrices) of each layer, as opposed to the number of parameters. However, their norms are “sparsity-induced norms”: in order for the norm not to scale with the number of hidden neurons m, it essentially requires the number of neurons with non-zero weights not to scale with m. This more or less reduces the problem to the non-overparameterized case. More importantly, it is not clear from these results how a network with such low “complexity” and a good training loss can be produced by the training method.
1.1 Our Results
In this work, we extend the theoretical understanding of neural networks from both the algorithmic and the generalization perspectives. We give positive answers to the above questions for networks of two and three layers. We prove that when the network is sufficiently overparameterized, simple optimization algorithms (SGD or its variants) can learn ground truth networks with a small generalization error in polynomial time using polynomially many samples.
To state our result in a simple form, we assume that there exists a (two or three-layer) ground truth network with small risk, and show that one can learn this hypothesis class, up to any desired additional accuracy ε, using larger (two or three-layer) networks of size greater than a fixed polynomial in the size of the ground truth, in 1/ε, and in the “complexity” of the activation functions used in the ground truth. Furthermore, the sample complexity is also polynomial in these parameters, and only poly-logarithmic in the size of the overparameterized network. Our result is proved for networks with the ReLU activation, while the ground truth can use a variety of smooth activation functions.
Furthermore, unlike the two-layer case (where there is only one hidden layer and the optimization landscape in the overparameterized regime is almost convex [LL18, DZPS18]), our result on three-layer networks gives the first theoretical proof that learning neural networks, even with sophisticated non-convex interactions between hidden layers, may still not be difficult, as long as sufficient overparameterization is provided. This gives further insight into the fundamental questions about the algorithmic and generalization aspects of neural network learning. Since practical neural networks are heavily overparameterized, our results may also provide theoretical insights into networks used in various applications.
We highlight a few interesting conceptual findings we used to derive our main result:
In the overparameterized regime, good networks with small risks are almost everywhere. With high probability over the random initialization, there exists a good network in the “close” neighborhood of the initialization.
In the overparameterized regime, if one stays close enough to the random initialization, the learning process is tightly coupled with that of a “pseudo network” which has a benign optimization landscape.
In the overparameterized regime, every neuron matters. During training, information is spread out among all the neurons instead of collapsed into a few neurons. With this structure, we can prove a new generalization bound that is (almost) independent of the number of neurons, even when all neurons have non-negligible contributions to the output.
Combining the first and the second items leads to the convergence of the optimization process, and combining that with the third item gives our generalization result.
Roadmap. We formally define our (improper) learning problem in Section 2 and introduce notations in Section 3. In Section 4 we present our main theorems and give some examples. Section 5 summarizes the main proof ideas for our three-layer results; details are in Appendices B, C, D, E and F. Our two-layer proof is much easier and is included in Appendix G.
2 Problem and Assumptions
We consider learning some unknown distribution D of data points (x, y), where x is the input data point and y is the label associated with it.
Without loss of generality, we restrict our attention to distributions where each input x is of unit Euclidean norm and has a fixed constant value in its last coordinate.111This is without loss of generality, since a constant can always be padded to the last coordinate, and unit norm can always be ensured by padding to the second-last coordinate. We make this assumption to simplify our notations: for instance, the fixed last coordinate allows us to focus only on ground truth networks without bias.
We consider a loss function L such that for every label y in the support of D, the function L(·, y) is non-negative, convex, 1-Lipschitz continuous and 1-Lipschitz smooth.222In fact, the non-negativity assumption and the 1-Lipschitz smoothness assumption are not needed for our two-layer result, but we state all of them here for consistency. Assume that there exists a ground truth function F* and some OPT ≥ 0 so that the population risk of F* is at most OPT.
Our goal is to learn a neural network F that, for a given accuracy ε > 0, has population risk at most OPT + ε, using a data set consisting of i.i.d. samples from the distribution D.
In this paper, we consider learners F equipped with the ReLU activation function σ(z) = max(z, 0), arguably the most widely-used activation function in practice. We assume the ground truth F* uses arbitrary smooth activation functions.
Below, we describe the details of the ground truth, our network, and the learning process for the cases of two and three layers, respectively.
2.1 Two Layer Networks
Ground truth F*. The ground truth for our two-layer case is a weighted combination of smooth activations applied to linear measurements of the input: each activation function is infinite-order smooth, the measurement vectors are the ground truth weight vectors, and the combination coefficients are the output weights. Without loss of generality, we assume the ground truth weight vectors have unit norm and the output weights are bounded by one.
Standard two-layer networks are only special cases of our formulation (3). Indeed, using the fixed last coordinate of x to absorb biases, an appropriate choice of weights makes F* a two-layer network with the given activation functions. Our formulation (3) allows for more functions, and in particular captures combinations of correlations between non-linear and linear measurements of different directions of x.
Our ReLU network F. Our (improper) learners are two-layer networks with m hidden ReLU neurons. Here, W represents the hidden weight matrix with rows w_1, …, w_m, b is the bias vector, and each a_r is the output weight vector for the r-th output.
To simplify the analysis, we only update W and keep b and the a_r at their initialization values. For this reason, we also write the network as a function of W alone.
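As an illustration of this learner, the following sketch (our own; the sizes m, d, k and the variance choices are illustrative placeholders, not the paper's exact scaling) implements a two-layer ReLU network in which only W would be treated as trainable.

```python
import random

def relu(z):
    return max(z, 0.0)

def two_layer(W, b, A, x):
    # hidden layer: m neurons, each ReLU(w_i . x + b_i)
    h = [relu(sum(wi_j * x_j for wi_j, x_j in zip(w_i, x)) + b_i)
         for w_i, b_i in zip(W, b)]
    # output layer: k fixed linear read-outs a_r (never updated)
    return [sum(ar_i * h_i for ar_i, h_i in zip(a_r, h)) for a_r in A]

random.seed(0)
m, d, k = 8, 3, 2                     # illustrative sizes
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(m)]
b = [random.gauss(0, 1) for _ in range(m)]
A = [[random.gauss(0, 0.1) for _ in range(m)] for _ in range(k)]
y = two_layer(W, b, A, [0.5, -0.2, 1.0])
print(len(y))  # one value per output coordinate
```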
Let W^(0) denote the initial value of W, and we sometimes write b^(0) and a_r^(0) to emphasize that they are at random initialization. Below we specify our random initialization:
The entries of W^(0) and b are i.i.d. random Gaussians (with variance inversely proportional to the width m).
The entries of each a_r are i.i.d. random Gaussians from N(0, ε_a²) for some fixed ε_a.333We choose ε_a small in the proof due to technical reasons. As we shall see in the three-layer case, if weight decay is used, one can relax this to ε_a = 1.
Learning process. Given a data set of samples (x_i, y_i), the network is first randomly initialized and then updated by SGD. Let W^(t) denote the hidden weight matrix at the t-th iteration of SGD. (Note that W^(t) − W^(0) is the matrix of increments.) For each sample, define the corresponding loss objective. Given a step size η and the data set, the SGD algorithm is presented in Algorithm 1. We remark that the (sub-)gradient is taken with respect to W.444Strictly speaking, the objective does not have a gradient everywhere due to the non-differentiability of ReLU. Throughout the paper, the gradient is taken with the convention σ'(0) = 0, which is also what is used in practical auto-differentiation software.
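A hedged sketch of a single SGD step on the hidden weights W only, using the σ'(0) = 0 subgradient convention from footnote 4. The squared loss and all constants here are our illustrative choices, not the paper's prescribed loss.

```python
def sgd_step(W, b, A, x, y_target, eta):
    # forward pass, keeping pre-activations for the ReLU gates
    h_pre = [sum(wi_j * x_j for wi_j, x_j in zip(w_i, x)) + b_i
             for w_i, b_i in zip(W, b)]
    h = [max(z, 0.0) for z in h_pre]
    out = [sum(ar_i * h_i for ar_i, h_i in zip(a_r, h)) for a_r in A]
    # d(loss)/d(out) for the illustrative squared loss 0.5 * ||out - y||^2
    grad_out = [o - t for o, t in zip(out, y_target)]
    new_W = []
    for i, w_i in enumerate(W):
        # ReLU subgradient: 1 if pre-activation > 0, else 0 (so sigma'(0) = 0)
        gate = 1.0 if h_pre[i] > 0 else 0.0
        g_i = gate * sum(A[r][i] * grad_out[r] for r in range(len(A)))
        new_W.append([w_ij - eta * g_i * x_j for w_ij, x_j in zip(w_i, x)])
    return new_W  # only W changes; b and A stay at initialization
```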
2.2 Three Layer Networks
Ground truth F*. The ground truth for our three-layer case composes two levels of smooth functions: each activation function is infinite-order smooth, there are ground truth weight vectors for the first layer and for the second layer, and real-valued output weights. Without loss of generality, we assume the ground truth weight vectors have unit norm and the output weights are bounded by one.
Standard three-layer networks are only special cases of our formulation (292). If we set the first-level functions to be constants and absorb biases using the fixed last coordinate, then F* becomes a three-layer network with the given activation functions. Our formulation (292) is much more general. As an interesting example, even in the special case of a single output, the ground truth
captures combinations of correlations of combinations of non-linear measurements in different directions of x. We do not know how to compute such functions using two-layer networks, and the ability to learn these correlations is the critical advantage of three-layer networks compared to two-layer ones.
In fact, our results in this paper even apply to the following general form:
with the mild requirement that each component function is suitably bounded. We choose to present the slightly weaker formulation (292) for cleaner proofs.
Our ReLU network F. Our (improper) learners are three-layer networks with two hidden ReLU layers. There are m1 and m2 hidden neurons in the first and second layers respectively.555Our theorems are stated for the special case m1 = m2, but we state the more general case for most lemmas because they may be of independent interest. Matrices W and V represent the weights of the first and second hidden layers respectively, and b1 and b2 represent the corresponding bias vectors. Each a_r is an output weight vector.
To simplify our analysis, we only update W and V but keep b1, b2 and the a_r at their initial values. We denote by W^(0) and V^(0) the initial values of W and V, and sometimes write b1^(0), b2^(0) and a_r^(0) to emphasize that they are at random initialization. We also denote the r-th output by F_r(x) and the vector output by F(x). Below we specify our random initialization:
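The following minimal sketch (ours, with illustrative sizes) mirrors this architecture: two hidden ReLU layers whose weight matrices W and V are the only parameters that would be trained, while the biases and output read-outs stay fixed.

```python
def relu_vec(v):
    return [max(z, 0.0) for z in v]

def matvec(M, x):
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def three_layer(W, b1, V, b2, A, x):
    # first hidden layer: m1 ReLU neurons
    h1 = relu_vec([s + t for s, t in zip(matvec(W, x), b1)])
    # second hidden layer: m2 ReLU neurons
    h2 = relu_vec([s + t for s, t in zip(matvec(V, h1), b2)])
    # fixed output read-outs a_r
    return [sum(ar_i * h_i for ar_i, h_i in zip(a_r, h2)) for a_r in A]

# usage with tiny 1-neuron layers
print(three_layer([[1.0]], [0.0], [[2.0]], [0.0], [[1.0]], [1.5]))
```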
The entries of W^(0) and b1 are i.i.d. random Gaussians (with variance inversely proportional to m1).
The entries of V^(0) and b2 are i.i.d. random Gaussians (with variance inversely proportional to m2).
The entries of each a_r are i.i.d. random Gaussians from N(0, ε_a²) for ε_a = 1.666Recall that in our two-layer result we chose a small ε_a due to technical reasons; thanks to weight decay, we can simply select ε_a = 1 in our three-layer case.
Learning process. As in the two-layer case, we use W^(t) and V^(t) to denote the weight matrices at the t-th iteration of the optimization algorithm (so that W^(t) − W^(0) and V^(t) − V^(0) are the increments). For three-layer networks, we consider the two variants of SGD discussed below.
Given a sample (x, y) and a scalar λ ∈ (0, 1], define the λ-scaled objective, where the role of λ is to scale down the entire function (because a ReLU network is positively homogeneous). Both variants of SGD optimize with respect to the matrices W and V as well as this parameter λ. They start with λ = 1 and slowly decrease it across iterations; this is similar to weight decay used in practice.
The scaling can be viewed as a simplified version of weight decay. Intuitively, during training it is easy to add new information (from the ground truth) to the current network, but hard to forget “false” information that is already in the network. Such false information can accumulate from the randomness of SGD, non-convex landscapes, and so on. By scaling down the weights of the current network, we can effectively forget false information.
First variant of SGD. In each round t, we use (noisy) SGD to minimize the following stochastic objective for some fixed λ_t:
Above, the objective is stochastic because (1) the sample is drawn randomly from the training set, and (2) two small perturbation matrices with i.i.d. Gaussian entries are added to the weights. We introduce this Gaussian perturbation for theoretical purposes (similar to smoothed analysis) and it may not be needed in practice.
Specifically, in each round t, Algorithm 2 starts with the current weight matrices and performs a fixed number of inner iterations. In each iteration it moves in the negative direction of the gradient (with respect to a stochastic choice of the sample). At the end of round t, Algorithm 2 performs weight decay by scaling λ down by a constant factor.
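Schematically, the round structure described above could look as follows. This is our sketch of the described procedure, not Algorithm 2 itself: `grad`, the constants, and the perturbation scale are placeholders.

```python
import random

def noisy_sgd_rounds(W, V, grad, rounds, inner_steps, eta, decay, sigma):
    """Outer loop: rounds of noisy SGD on (W, V), then weight decay on lambda."""
    lam = 1.0
    for t in range(rounds):
        for _ in range(inner_steps):
            # Gaussian perturbation of the weights (smoothed-analysis style)
            Wp = [[w + random.gauss(0, sigma) for w in row] for row in W]
            Vp = [[v + random.gauss(0, sigma) for v in row] for row in V]
            # one stochastic gradient step; `grad` is a placeholder oracle
            gW, gV = grad(lam, Wp, Vp)
            W = [[w - eta * g for w, g in zip(rw, rg)] for rw, rg in zip(W, gW)]
            V = [[v - eta * g for v, g in zip(rv, rg)] for rv, rg in zip(V, gV)]
        lam *= (1.0 - decay)  # scale down: "forget" accumulated false information
    return W, V, lam
```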
Second variant of SGD. In this variant we modify the stochastic objective to make the training method more sample-efficient (at least in theory):
This time, the stochastic randomness comes not only from the sample and the perturbation matrices, but also from Σ, a random diagonal matrix with i.i.d. uniform diagonal entries. This matrix is similar to the Dropout technique [SHK14] used in practice, which randomly masks out neurons. We make the slight modification that we only mask out the weight increments V − V^(0), but not the initialization V^(0).
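The key point, that Σ scales only the increments and never the initialization, can be sketched as follows. The {0, 2} diagonal distribution below is our illustrative dropout-style choice, not necessarily the paper's exact distribution.

```python
import random

def mask_increments(V, V0, seed=None):
    """Apply a random diagonal mask Sigma to the increment V - V0 only."""
    rng = random.Random(seed)
    # one diagonal entry per second-layer neuron (row of V)
    sigma = [rng.choice([0.0, 2.0]) for _ in range(len(V))]
    # row i of the effective matrix: V0_i + sigma_i * (V_i - V0_i);
    # the initialization V0 always passes through unmasked
    return [[v0 + s * (v - v0) for v, v0 in zip(row, row0)]
            for s, row, row0 in zip(sigma, V, V0)]
```

Note that when V equals V0 the mask has no effect at all, which is exactly the difference from standard Dropout (which would also mask the initialization).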
3 Notations

We use x = y ± z to denote that x ∈ [y − z, y + z]. We use 1[E] or 1_E to denote the indicator function of the event E. We denote by ‖v‖₂ and ‖v‖_∞ the Euclidean and infinity norms of a vector v, and by ‖v‖₀ the number of non-zeros of v. We also abbreviate ‖v‖ = ‖v‖₂ when it is clear from the context. We denote the row norm for a matrix W (for p ≥ 1) as ‖W‖_{2,p} = (Σ_i ‖w_i‖₂^p)^{1/p}. By definition, ‖W‖_{2,2} is the Frobenius norm of W. We use ‖W‖₂ to denote the matrix spectral norm. For a matrix W, we use w_i or sometimes W_i to denote the i-th row of W.
We say a function f is L-Lipschitz continuous if |f(x) − f(y)| ≤ L‖x − y‖; we say it is L-smooth if its gradient is L-Lipschitz continuous, that is ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖; and we say it is L-second-order smooth if its Hessian is L-Lipschitz continuous, that is ‖∇²f(x) − ∇²f(y)‖ ≤ L‖x − y‖.
For notational simplicity, with high probability (or w.h.p.) means with probability at least 1 − e^{−c log² m} for a sufficiently large constant c for two-layer networks, and the analogous statement in m1, m2 for three-layer networks. In this paper, Õ hides polylogarithmic factors in m for two-layer networks, or in m1, m2 for three-layer networks.
Wasserstein distance. The Wasserstein-1 distance between random variables X and Y is W₁(X, Y) = inf E[|X' − Y'|], where the infimum is taken over all possible joint distributions of (X', Y') whose marginal on X' (resp. Y') is distributed in the same way as X (resp. Y).
Slightly abusing notation, in this paper we say a random variable X satisfies a bound with high probability if (1) the bound holds w.h.p. and (2) it also holds in expectation. For instance, a Gaussian with small variance satisfies the corresponding magnitude bound with high probability.
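For intuition about the Wasserstein-1 distance, in one dimension the distance between two equal-size empirical samples reduces to matching them in sorted order; a minimal sketch (ours):

```python
def w1_empirical(xs, ys):
    """W1 distance between two empirical distributions with equal sample counts.

    In 1-D, the optimal coupling pairs the i-th smallest of xs with the
    i-th smallest of ys, so W1 is the mean absolute gap after sorting.
    """
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

# usage: shifting every sample by 1 gives W1 distance exactly 1
print(w1_empirical([0.0, 1.0, 2.0], [1.0, 2.0, 3.0]))
```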
Function complexity. The following notion measures the complexity of any smooth activation function φ used in the ground truth network. Suppose φ has the Taylor expansion φ(z) = Σ_i c_i z^i. Given the accuracy parameter, the complexity measures C_ε(φ) and C_s(φ) are weighted sums of the absolute Taylor coefficients |c_i|, where higher degrees are weighted by powers of a sufficiently large constant. In our two-layer network, we use C_ε and C_s to denote the (maximum) complexity over all activation functions. In our three-layer network, we use the analogous quantities for the activation functions of the first and second hidden layers, respectively. We assume throughout the paper that these terms are bounded.
Intuitively, C_s mainly measures the sample complexity: how many samples are required to learn correctly; while C_ε mainly bounds the network size: how much over-parameterization is needed for the algorithm to (efficiently) learn up to error ε. It always holds that C_ε is at least C_s. However, for functions such as exp, sin, or low-degree polynomials, C_ε and C_s only differ by a factor polynomial in 1/ε.
4 Main Results
4.1 Two Layer Networks
We have the following main theorem for two-layer networks. (Recall that p denotes the number of hidden neurons in the ground truth network and k the output dimension.)
Theorem 1 (two-layer).
For every ε ∈ (0, 1), there exists a threshold polynomial in the complexity of the ground truth and in 1/ε such that, for every width m above this threshold and every sufficiently large sample size, choosing ε_a appropriately for the initialization and choosing an appropriate learning rate, with high probability over the random initialization, SGD after sufficiently many iterations achieves population risk at most OPT + ε in expectation.
(Above, the expectation is taken over the randomness of SGD.)
SGD takes only one example per iteration. Thus, the sample complexity is also at most the number of iterations.
For functions such as exp, sin, or low-degree polynomials, both complexity measures are polynomially bounded. Our theorem indicates that ground truth networks with such activation functions can be learned using two-layer ReLU networks whose size and sample complexity are polynomial in the relevant parameters.
We note this complexity is (almost) independent of m, the amount of overparameterization in our network, and is independent of the input dimension d, so it is dimension-free.
If φ is sin or exp, to get ε-approximation we can truncate its Taylor series at degree O(log(1/ε)). One can verify that the complexity measures then grow only polynomially in 1/ε, using the fact that the Taylor coefficients decay factorially. Thus, ground truth networks with such activations can also be learned using two-layer ReLU networks with polynomial size and sample complexity.
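As a quick numeric sanity check of the logarithmic truncation claim for exp (our own check, not part of the paper's proof): the degree needed for ε-accuracy on |z| ≤ 1 grows roughly like log(1/ε).

```python
import math

def truncation_degree(eps):
    """Smallest Taylor degree d with |exp(1) - sum_{i<=d} 1/i!| <= eps.

    z = 1 is the worst case for exp on |z| <= 1, so this degree suffices
    for eps-accuracy on the whole interval.
    """
    total, term, d = 1.0, 1.0, 0
    while abs(math.e - total) > eps:
        d += 1
        term /= d          # next Taylor coefficient 1/d!
        total += term
    return d

# degrees for eps = 1e-3, 1e-6, 1e-9: they grow linearly in log(1/eps)
print([truncation_degree(10.0 ** -k) for k in (3, 6, 9)])
```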
One might want to compare our result to [APVZ14]. First of all, our result allows activation functions in the ground truth to be infinite-degree polynomials, which is not captured by [APVZ14]. More importantly, to learn a polynomial of a given degree, their sample complexity grows with the input dimension, but our result is input-dimension free.
One can also view Theorem 1 as a nonlinear analogue to the margin theory for linear classifiers. The ground truth network with a small population risk (and of bounded norm) can be viewed as a “large-margin non-linear classifier.” In this view, Theorem 1 shows that assuming the existence of such a large-margin classifier, SGD finds a good solution with sample complexity mostly determined by the margin, instead of the dimension of the data.
Inductive Bias. Some recent works (e.g., [ALS18]) show that when the possibly deep network is heavily overparameterized (that is, the width is polynomial in the number of training samples) and no two training samples are identical, then SGD can find a global optimum with zero classification error (or a solution with arbitrarily small training loss) in polynomial time. This does not come with generalization, since such a network can even fit random labels. Our theorem, combined with [ALS18], confirms the inductive bias of SGD for two-layer networks: when the labels are random, SGD finds a network that memorizes the training data; when the labels are (even only approximately) realizable by some ground truth network, SGD learns it by finding a network that generalizes. This gives an explanation for the well-known empirical observations of such inductive bias (e.g., [ZBH17]) in the two-layer setting, and is more general than [BGMS17], in which the ground truth network is only linear.
4.2 Three Layer Networks
We give the main theorem for training three-layer networks using the first variant of SGD.
Theorem 2 (three-layer, first SGD variant).
We emphasize that the result is for the population risk over D, the real data distribution, instead of the empirical distribution over the training set. Thus, Theorem 2 shows that using polynomially many samples we can find a network with a small population risk.
This sample complexity bound scales polynomially with 1/ε, so it may not be very efficient (we did not try hard to improve the exponent). Perhaps interestingly, this is already non-trivial, because the bound can be much smaller than the number of parameters of the network, or equivalently the naive VC-dimension bound. In our second variant of SGD, we shall reduce the dependency on the network size further, to polylogarithmic.
Recall from Remark 2.2 that a major advantage of a three-layer network, compared to a two-layer one, is the ability to learn (combinations of) correlations between non-linear measurements of the data. This corresponds to the special case where the r-th output of the three-layer ground truth is a product of non-linear measurements of the data.
In this case the complexity of the three-layer ground truth is comparable to that of its two-layer components, so learning this three-layer network has essentially the same complexity as learning each component with two-layer networks. As a concrete example, a three-layer network can learn such a product of two non-linear measurements up to accuracy ε with complexity polynomial in 1/ε, while it is unclear how to do so using two-layer networks.
For general outer activations, ignoring lower-order terms, the complexity of three-layer networks is essentially the composition of the complexities of the two levels. This is necessary in some sense: when the inner part degenerates, the composition is just a two-layer function, and the composed complexity matches the two-layer one.
Theorem 3 (three-layer, second SGD variant).
As mentioned, the training algorithm for this version of SGD uses a random diagonal scaling Σ, similar to the Dropout trick used in practice to reduce sample complexity by turning hidden neurons on and off. Theorem 3 shows that in this case, the sample complexity needed to achieve a small risk scales only polynomially with the complexity of the ground truth network, and is (almost) independent of the amount of overparameterization in our network.
5 Main Lemmas for Three Layer Networks
In Section 5.1, we show the existence of some good “pseudo network” that can approximate the ground truth. In Section 5.2, we present our coupling technique between a real network and a pseudo network. In Section 5.3, we present the key lemma about the optimization procedure. In Section 5.4, we state a simple generalization bound that is compatible with our algorithm. These techniques together give rise to the proof of Theorem 2. In Section 5.5, we present additional techniques needed to show Theorem 3.
We wish to show the existence of some good “pseudo network” that can approximate the ground truth network. In a pseudo network, each ReLU activation σ(z) is replaced by the linear function 1[z₀ ≥ 0]·z, where z₀ is the pre-activation value at random initialization. Formally, let D_{x,1} denote a diagonal sign matrix indicating the signs of the first-layer ReLUs at random initialization, and let D_{x,2} denote the analogous diagonal sign matrix of the second layer at random initialization.
Consider the output of a three-layer pseudo network, with the signs fixed at random initialization and without bias terms:
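A small sketch (ours) of this construction: freeze the 0/1 ReLU gates at initialization and replace each activation by the corresponding linear map. At the initialization itself the real and pseudo networks coincide; biases are dropped as in the text.

```python
def matvec(M, x):
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def real_net(W, V, x):
    # genuine two-hidden-layer ReLU network, no biases
    h1 = [max(z, 0.0) for z in matvec(W, x)]
    return [max(z, 0.0) for z in matvec(V, h1)]

def pseudo_net(W, V, x, W0, V0):
    # gates D_{x,1}, D_{x,2} computed once from the initialization (W0, V0)
    d1 = [1.0 if z > 0 else 0.0 for z in matvec(W0, x)]
    h1_init = [g * z for g, z in zip(d1, matvec(W0, x))]   # = relu(W0 x)
    d2 = [1.0 if z > 0 else 0.0 for z in matvec(V0, h1_init)]
    # pseudo output: D2 V D1 W x, linear in (W, V) once the gates are frozen
    h1 = [g * z for g, z in zip(d1, matvec(W, x))]
    return [g * z for g, z in zip(d2, matvec(V, h1))]

W0, V0, x = [[1.0], [-1.0]], [[1.0, 1.0]], [2.0]
print(real_net(W0, V0, x), pseudo_net(W0, V0, x, W0, V0))  # identical at init
```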
Lemma 5.1 (existence).
For every ε ∈ (0, 1), there exists a size threshold, polynomial in the relevant parameters, such that if the network widths exceed it, then with high probability over the random initialization, there exist weights with