It is well known that artificial neural networks can approximate any real-valued function. Fundamental results [19, 7, 4] show that a neural network with a single hidden layer provides a universal representation up to arbitrary approximation, with the number of hidden units needed depending on the function being approximated and the desired accuracy. In practice, NNs today effectively capture a wide variety of information with remarkably accurate predictions.
Besides their generality, an important feature of NNs is the ease of training them — gradient descent is used to minimize the error of the network, measured by a loss function of the current weights. This seems to work across a range of labeled data sets. Yet despite of its tremendous success, there is no satisfactory explanation for the efficiency or effectiveness of this generic training algorithm111Indeed, one might consider this a miraculous feat of engineering and even ask, is there anything to explain rigorously? We are not entirely comfortable with this view..
The difficulty is that even for highly restricted classes of neural networks, natural loss functions such as the mean squared loss have a highly non-convex landscape with many nonoptimal local minima. However, when data is generated from a model with random weights, gradient descent (the stochastic version with a small batch size) seems to consistently learn a network with error close to zero. This raises the prospect of a provable guarantee, but there are two complicating experimental observations. First, the randomness of the initialization appears essential (standard in practice) as in experiments it is possible to remain stuck at higher error. Second, we observe smaller error (and it decreases more quickly) when the model size used for training is made larger; in particular, for the realizable case, we train using many more units than the original. This aspect is also commonly encountered in the training of large neural networks on real data — even with huge amounts of data, the size of the model used can be larger.
In this paper we give nearly matching upper and lower bounds that help explain the phenomena seen in practice when training neural networks. The upper bounds are for gradient descent and the lower bounds are for all SQ algorithms. We summarize them here, and present them formally in the next section.
Our algorithmic result is an agnostic upper bound on the approximation error and time and sample complexity of gradient descent with the standard mean squared loss function. Despite training only the top level weights, our novel proof techniques avoid using any convexity in the problem. Since our analysis does not rely on reaching a global minimum, there is reason to hope the techniques will extend to nonconvex settings where we can in general expect only to find a local minimum. Prior results along this line were either for more complicated algorithms or more restricted settings; the closest is the work of Andoni et al.  where they assume the target function is a bounded degree polynomial. A detailed comparison of results is given in Section 1.3.
The upper bound shows that to get close to the best possible degree polynomial approximation of the data, it suffices to run gradient descent on a neural network with
units, using the same number of samples. This is an agnostic guarantee. We prove a matching lower bound for solving this polynomial learning problem over the uniform distribution on the unit sphere, forany statistical query algorithm that uses tolerance inversely proportional to the input dimension (roughly speaking, batch sizes are ). Thus, for this general agnostic learning problem, gradient descent is as good as it gets.
For two functions , the mean squared loss with respect to the distribution in is . Given data with ,
, we analyze gradient descent to minimize the loss of the current model with respect to the given data. Let the current network weights from the input layer to the hidden layer be the set of vectors. The function being computed by the current weights is as in Eq. (1
). By gradient descent, we mean the following procedure: in each iteration, the gradient of the loss function is computed using a finite sample of examples. The weights are then modified by adding a fixed multiple of the estimated gradient. Throughout the paper, we will assume that the input data distributionis uniform on the sphere. Randomly initialized units will have their weights drawn from the same distribution.
Our first theorem is for training networks of sigmoid gates. The -norm of a function is .
Let , , and
an odd bounded function such that ,
sigmoid gates and a linear
output layer, with high probability, will have mean squared loss of at most
an odd bounded function such that, where denotes the best polynomial of degree at most approximation of in norm. Then for any , a randomly initialized single-hidden-layer NN with
, sigmoid gates and a linear output layer, with high probability, will have mean squared loss of at mostafter at most iterations of GD using samples.
The same statement holds for ReLU activation units and even functions.
Next we state a more general theorem. This will apply to a large class of activation functions. The main property we need of the activation function is that it should not be a low-degree polynomial.
For and , an -activation is a function from to whose harmonic polynomial expansion uses polynomials of -norm at least for all degrees .
(See Section 2 for the definition of the harmonic polynomial expansion.)
For example, the commonly used sigmoid gate is an -activation function for the odd integers less than and . Similarly, ReLU gates are -activation functions for subsets of the even integers.
Let and a bounded function such that , where is the best -approximation to by a function whose harmonic polynomial expansion is supported on degrees in . Then for any , and any -activation function with a randomly initialized single-hidden-layer NN with -gates and a linear output layer, with high probability, will have mean squared loss of at most after at most iterations of GD using samples.
This general theorem has the following corollary in the realizable case, when data is generated by a one-hidden-layer NN. In this case, the function can be approximated by a low-degree polynomial. In order to allow for this approximation guarantee, and to side-step previous SQ lower bounds , we guarantee some degree on nondegeneracy by focusing on unbiased NNs, i.e., networks of the form
where is a bounded activation function.
Let be computed by an unbiased one-hidden-layer NN with sigmoid units in the hidden layer and a linear output. Suppose the norm of the output layer weights is , and each hidden layer weight vector has norm at most . Then for every , a randomly initialized single-hidden-layer NN with sigmoid units and a linear output layer, with high probability, will have mean squared loss of at most after at most iterations of GD using samples.
The use of sigmoid units in Corollary 1.4 is not essential, but the bounds on network size and training time will depend on the specific activation function chosen.
Our lower bounds hold in the very general Statistical Query (SQ) model, first defined by Kearns . An SQ algorithm solves a computational problem over an input distribution and interacts with the input only by querying the expected value of of a bounded function up to a desired accuracy. For any integer and distribution over , a oracle  takes as input a query function with expectation and returns a value such that
The bound on the RHS is the standard deviation ofindependent Bernoulli coins with desired expectation, i.e., the error that even a random sample of size would yield. The SQ complexity of an algorithm is given by the number of queries and the batch size . The remaining computation is unrestricted and can use randomization.
The statistical query framework was introduced by Kearns for supervised learning problems using the oracle, which, for , responds to a query function with a value such that . The oracle can be simulated by the oracle. The oracle was introduced by  who extended these oracles to more general problems over distributions.
Choosing a useful statistical query model for regression problems is nontrivial. We discuss some of the pitfalls in Section 4. Our lower bounds concern two query models.
The first allows quite general query functions. We say a statistical query algorithm (for regression) makes -normalized -Lipschitz queries concerning an unknown concept if it makes queries of the form , where is -Lipschitz at any fixed , to which the SQ oracle should respond with a value approximation .
We prove the following two lower bounds. The first is for general (Lipschitz) query functions.
Let . There exists a family of degree- polynomials on with the following property. For any randomized SQ algorithm learning to regression error less than with probability at least using -normalized -Lipschitz queries to , for some , we have .
We get a similar lower bound for a natural family of inner product queries with no Lipschitzness assumption. We say a statistical query algorithm makes inner product queries concerning an unknown concept if it makes queries of the form to an oracle that replies with an approximation of .
Let . There exists a family of degree- polynomials on with the following property. For any randomized SQ algorithm learning to regression error less than with probability at least using inner product queries to , we have .
In the case of training a neural network via gradient descent, the relevant queries should yield gradients of the loss function with respect to the current weights. In the case of mean squared loss, the gradients are of the form , where is the unknown concept, is the current network, and represents parameters of . These gradients can be estimated via queries in either of the two models we consider.
1.2 Approach and techniques
The gradient of the loss function with respect to any outer layer weight can be viewed as a spherical transform of the current residual error. More precisely, if the current function is computed by an unbiased single hidden-layer neural network with output-layer weights , as in Eq. (1), and the residual error with respect to the target function is , then for any ,
The latter expectation is quite special when the domain of integration is the unit sphere. Different choices of the function correspond to different spherical transformations. For example, being the indicator of is the hemispherical transform, iff is the Radon transform, etc. This type of transformation
has a closed form expression whenever the function is a harmonic polynomial (see definitions in the next section). By the classical Funk-Hecke theorem, for any bounded function and any harmonic polynomial , there is an explicit constant s.t.
In particular, the harmonic polynomials are eigenfunctions of the operator . Moreover, since the harmonic polynomials form an orthonormal basis over the unit sphere, any function (in our case the residual ) has zero norm iff the corresponding transform has zero norm, assuming the function has nonzero coefficients .
With the above observations in hand, we can now outline our analysis. We focus on the dynamics of gradient descent as an operator on a space of functions. In particular, for a set and function , we define an operator
Thus, if the current residual error is given by some function , then the empirical gradient of the mean-squared loss with respect to a set of labeled examples is (see Section 3).
Our analysis proceeds in three stages:
A crucial observation that simplifies our analysis is that when is given by a neural network as in Eq. (1), then itself is obtained by applying the operator (where is the set of hidden weights in ) to a function that computes the output-layer coefficients for each gate (see Proposition 3.2).
To better appreciate our approach, we contrast it with a more typical pattern that could be used here. Because we optimize only the top-level coefficients under gradient descent, the optimization problem is in fact convex; the usual approach for understanding gradient descent in such a setting has two main pieces:
Observe that gradient descent minimizes empirical loss, and therefore, with enough samples, also approximately minimizes population loss (using the general theory of convex optimization of gradient descent)
Prove a “representation theorem” showing that the hypothesis minimizing the population loss is a good approximation to the target function (using the particular details of the target function and the hypothesis space).
For the purposes of the present study, there are two significant drawbacks to this standard approach. First, a naive application of standard results in the convex setting would give a bound on the number of GD iterations scaling with in Theorem 1.3, rather than . Moreover, it is unclear how to replace the first step in order to extend to nonconvex settings.
By contrast, our analysis does not use the fact that the optimization produces an approximate global minimum; hence, there is a greater hope of generalizing to nonconvex regimes where we expect to instead only reach a local minimum in general. Another pleasant feature of our analysis is that we need not prove such a “representation theorem” directly; instead, we can derive such a result for free, as a corollary to our analysis. That is, since we prove directly that gradient descent on the top-level weights of a single-layer neural network with randomly-initialized gates results in small loss, it follows that any low-degree harmonic polynomial is in fact approximated by such a network. Our hope is that this new approach offers an interesting possibility for understanding gradient descent in more difficult settings.
The upper bound guarantees hold for the agnostic learning problem of minimizing the least squares error, and the bound is with respect to the best degree polynomial approximation. The size of the network needed grows as , as does the time and sample complexity. We show that this unavoidable for any SQ algorithm, including gradient descent and its variants on arbitrary network architectures. The “hard” functions used for the lower bound will be generated by spherical harmonic polynomials. Specifically, we use the univariate Legendre polynomial of degree in dimension , denoted as , and also called the Gegenbauer polynomial (see Section 2 for more background). We pick a set of unit vectors and for each one we get a polynomial . We choose the vectors randomly so that most have a small pairwise inner product. Then querying one of these polynomials gives little information about the others (on the same input ), and forces an algorithm to make many queries. As in earlier work on SQ regression algorithms , it is essential not only to bound the pairwise correlations of the “hard” functions themselves, but also of arbitary “smoothed” indicator functions composed with the hard family. This is accomplished by using a concentration of measure inequality on the sphere to avoid regions where these indicators are in fact correlated. In contrast to the earlier SQ regression lower bounds, we obtain bounds on the sensitivity parameter for the oracle that scales with the number of queries and the degree .
1.3 Related work
Explaining the success of deep neural networks and gradient descent for training neural networks has been a challenge for several years. The trade-off between depth and size for the purpose of representation has been rigorously demonstrated [29, 10]. Moreover, there are strong complexity-theoretic and cryptographic-assumption based lower bounds to contend with [5, 9, 22]. These lower bounds are typically based on Boolean functions and “hard” input distributions. More recent lower bounds hold even for specific distributions and smooth functions, for basic gradient descent , and even realizable smooth functions for any SQ algorithm and any product logconcave input distribution . These earlier lower bound constructs are degenerate in the sense that they rely on data generated by networks whose bias and weight vectors have unbounded Euclidean norm as the dimension increases. In contrast, the constructions used in this paper match a corresponding upper bound almost exactly by making use of generic harmonic polynomials in the construction, apply to a significantly broader family of functions, and achieve a much stronger bound on the sensitivity parameter .
Upper bounds have been hard to come by. Standard loss functions, even for one-hidden-layer networks with an output sum gate, are not convex and have multiple disconnected local minima. One body of work shows how to learn more restricted functions, e.g., polynomials  and restricted convolutional networks 
. Another line of work investigates classes of such networks that can be learned in polynomial time, notably using tensor methods[20, 26] and polynomial kernels [14, 16], more direct methods with assumptions on the structure of the network [15, 12] and a combination of tensor initialization followed by gradient descent . A recent paper shows that the tensor method can be emulated by gradient descent by adding a sufficiently sophisticated penalty to the objective function . Earlier work gave combinatorial methods to learn random networks , guarantees for learning linear dynamical systems by gradient descent  and ReLU networks with more restrictive assumptions . Representation theorems analogous to our own were also proved in , and a very general analysis of gradient descent is given in .
Our analysis is reminiscent of the well-known random kitchen sinks paper , which showed that gradient descent using a hard upper bound on the magnitude of coefficients (in practice, an penalty term) with many random features from some distribution achieves error that converges to the best possible error among functions whose coefficients are not much higher than those of the corresponding densities of the sampling distribution. While this approach has been quite insightful (and effective in practice), it (a) does not give a bound for standard gradient descent (with no penalty) and (b) does not address functions that have very different support than the sampling distribution. Our bounds compare with the best possible polynomial approximations and are essentially the best possible in that generality for randomly chosen features.
The work of Andoni et al.  shows that gradient descent applied to learn a bounded degree polynomial, using a 1-hidden-layer network of exponential gates, converges with roughly the same number of gates (and a higher iteration count, instead of to achieve error ). A crucial difference is that our analysis is agnostic and we show that gradient descent converges to the error of the best degree approximation of the target function given sufficient many gates. We also state our results for general and commonly-used activation functions, rather than the gate analyzed in , and obtain explicit sample complexity bounds. Of course, the proof technique is also novel; we obtain our representation theorem as a side effect of our direct analysis of GD, rather than the other way around.
We now recall the basic theorems of spherical harmonics we will require. A homogeneous polynomial of degree in is said to be harmonic if it satisfies the differential equation , where is the Laplacian operator. We denote by the set of spherical harmonics of degree on the sphere , i.e., the projections of all harmonic polynomials of degree to the sphere . The only properties of harmonic polynomials used in this paper are that they are polynomials, form an orthogonal basis for , and are eigenfunctions of Funk transforms, as we now explain. We denote by the (single-variable) Legendre polynomial of degree in dimension , which is also called the Gegenbauer polynomial. We note that for all .
Let be bounded and integrable. We define the Funk transformation for functions as and for the constant
Given a function , we denote by the projection of to the harmonic polynomials of degree , so and . We also write , and for , we write ,.
Theorem 2.3 (Funk–Hecke).
Let be bounded and integrable, and let . Then, for and as in Definition 2.1, .
The following proposition is immediate from Cauchy-Schwarz.
We have .
Let be bounded and integrable, and let . Then for any , .
2.1 Spectra for specific activation functions
We first prove a general lemma describing the harmonic spectrum of a wide class of functions, and then derive estimates of the spectra for commonly used activation functions.
Suppose has an absolutely convergent Taylor series on . Suppose that for all , we have whenever and are nonzero. Then for any positive integer , is an -activation, where .
By Rodrigues’ formula (see [17, Proposition 3.3.7]),
Hence, by the bounded convergence theorem,
We claim that
where is the Euler beta function. Indeed, integrating by parts, we see that if the expression is , and otherwise
After a change of variables , this latter integral is by definition .
Therefore, we compute for all ,
Now for any of the same parity , if we estimate
In particular, whenever we have
be the standard sigmoid function. Then for any positive integer, is an -activation function, where contains and all odd integers less than .
Let be the “softplus” function. Then for any positive integer , is an -activation function, where contains and all even integers less than .
The statement follows from Lemma 2.6 by computing the relevant Taylor series. ∎
We can also perform a similar computation for ReLU activations. (A more general estimate is given in [3, Appendix D.2].)
Let be the ReLU function. Then for any positive integer , is an -activation function, where contains and all even integers less than .
3 Analysis of Gradient Descent
In this section, we fix a function we wish to learn. We also fix an -activation function with , for some finite .
We let be a finite set of independent points drawn from . Similar to Eq. (1), we define by , so is computed by an unbiased single-hidden-layer neural network with hidden layer weight matrix given by and linear output layer weights given by . We will study how changes as we update according to gradient descent on the mean-squared loss function . We will state bounds in terms of some of the parameters, and then show that for adequate choices of these parameter, gradient descent will succeed in reducing the loss below an arbitrary threshold, proving Theorem 1.3.
We now define notation that will be used throughout the rest of this section.
We fix , the approximation error we will achieve over the projection of to harmonics of degrees in . We define quantities , , and as follows, using absolute constants , , and to be defined later in the proof. The maximum number of iterations of gradient descent will be
We define to be an error tolerance used in certain estimates in the proof,
Finally, we define to be the number of hidden units (so ), as well as the number of samples,
Let be a collection of random independent samples , The set , along with the labels for , will be the training data used by the algorithm.222We remark that there is no need to have ; our analysis simply achieves the same bound for both. It is, however, useful to have separate sets and , so that these samples are independent.
We recall the definition in Eq. (3) of the operator
defined for sets and functions . As described in Section 1.2, the empirical gradient is given by the operator applied to the residual error, i.e., the gradient of with respect to the top-level weight for the gate is estimated as . On the other hand, we will observe below that the neural network itself can also be understood as the result of applying the operator to a function representing the output-layer weights.
For integers we shall define functions recursively, corresponding to the model function and its coefficients after rounds of gradient descent. In particular, we let , i.e.,
We define and, for , set .
We therefore have the following two propositions which describe how the neural network evolves over multiple iterations of gradient descent.
Suppose is initially . Then after rounds of gradient descent with learning rate , we have .
Indeed, as we have observed in Eq. (2), for each , the true gradient of the loss with respect to the output-level weight is . So the empirical gradient using the samples in is indeed
Thus, a single iteration of gradient descent with learning rate will update the weight by adding . The proposition now follows by induction on . ∎
For all , .
By the definitions of and , we have
as desired. ∎
Having introduced and explained the necessary notation, we now turn our attention to the first step of our analysis, as outlined in Section 1.2. Namely, we now show that the operator approximates for sufficiently large sets . Lemma 3.3 gives a very general version of this approximation, which we use to prove the finer approximation described in Lemma 3.4.
For the rest of the section, we write .
Let and , and let . There is some
such that if is a set of independent random points drawn from , then with probability at least , we have both and .
Without loss of generality assume and let . Fix . We have by definition. Since , we have
and for all . By a Bernstein bound, we have
for an appropriate choice of the constant hidden in the definition of .
Let if and otherwise. By the preceding inequality, we have . Therefore, by Markov’s inequality, the probability over the choice of that is at most . Hence, with probability over the choice of , we have
In particular, the last inequality of the present lemma holds.
But for all choices of and , we have
Therefore, using Eq. (9),
as desired. ∎
We denote by the function .
In the following Lemma 3.4 we prove a finer-tuned approximation of the operator by both and . Since Lemma 3.3 doesn’t give a sufficiently tight approximation between the operators simultaneously for every function in , we restrict our attention to the subspace we care about, namely, the functions spanned by the for .
With probability over the choice of and , the following statements are all true:
For all , we have ;
For all we have
For all , we have ;
We have . For any fixed , we can set sufficiently large that there is some while also ensuring . The same statement also holds (for appropriate choice of ) with in place of , since . Then since , by Lemma 3.3, for any fixed , we have with probability over the choice of that
Similar to Eq. (10), with in place of and in place of , statement (1) of the present lemma holds with probability over the choice of . Furthermore, for any fixed , taking a union bound over , we have with probability that statement (2) holds.
Now suppose is such that Eq. (10) holds for a random with probability at least ; as we have already observed, this is the case with probability at least over the choice of . Then by a union bound over , it then follows that with probability over the choice of , statement (3) holds. Finally, supposing similarly that is such that Eq. (11) holds for a random with probability at least , we get statement (4) with probability as well, by another union bound over . Overall, statements (1)–(4) hold with probability at least . ∎
For the remainder of this section, we use the notation and .
We now focus on the second step of our analysis, as outlined in Section 1.2, bounding the rate at which error from the approximations of described above accumulates over multiple iterations of GD. More precisely, we control the norm of , measured via and . The statements are given in the following two lemmas.
Suppose statements (1)–(3) of Lemma 3.4 all hold. Then for all , we have both and .
For all , we have . Furthermore, if statement (4) of Lemma 3.4 holds, then for all , we have .
3.1 Proof of Main Results
We now prove Theorem 1.3.
Proof of Theorem 1.3.
Let . We argue by induction that for all , the following are all true:
Since and , the base cases are all trivial. Fix and assume (1)-(5) hold for all .
We first prove that (1) holds for . Indeed, using the second statement of Lemma 3.6, and then simplifying using the inductive hypothesis for statements (2) and (5), we have
Similarly, from the first statement of Lemma 3.6 and from estimate (1), we have