Deep learning, in the form of artificial neural networks, has seen a dramatic resurgence in popularity in recent years. This is mainly due to impressive performance gains on various difficult learning problems, in fields such as computer vision, natural language processing and many others. Despite the practical success of neural networks, our theoretical understanding of them is still very incomplete.
A key aspect of modern networks is that they tend to be very large, usually with many more parameters than the number of training examples: in fact, often so many that they can in principle simply memorize all the training examples (as shown in the influential work of Zhang et al.). The fact that such huge, over-parameterized networks are still able to learn and generalize is one of the big mysteries of deep learning. A current leading hypothesis is that over-parameterization makes the optimization landscape more benign, and encourages standard gradient-based training methods to find weight configurations which both fit the training data and generalize (even though there might be many other configurations which fit the training data without any generalization). However, pinpointing the exact mechanism by which over-parameterization helps remains an open problem.
Recently, a spate of papers (such as [4, 11, 14, 2, 19, 15, 9, 3, 1]) provided positive results for training and learning with over-parameterized neural networks. Although they differ in details, they are all based on the following striking observation: When the networks are sufficiently large, standard gradient-based methods change certain components of the network (such as the weights of a certain layer) so slowly that, over a bounded number of iterations, those components might as well be held fixed. To give a concrete example, consider one-hidden-layer neural networks, which can be written as a linear combination of neurons, $\mathbf{x}\mapsto\sum_{i=1}^{n}u_i\,\sigma(\mathbf{w}_i^{\top}\mathbf{x})$, with weights $u_i\in\mathbb{R}$, $\mathbf{w}_i\in\mathbb{R}^{d}$ and an activation function $\sigma$. When the width $n$ is sufficiently large, and with standard random initializations, it can be shown that gradient descent leaves the weights $\mathbf{w}_1,\dots,\mathbf{w}_n$ in the first layer nearly unchanged (at least initially). As a result, the dynamics of gradient descent resemble those where $\mathbf{w}_1,\dots,\mathbf{w}_n$ are fixed at their random initial values; namely, we learn a linear predictor (parameterized by $u_1,\dots,u_n$) over a set of random features of the form $\sigma(\mathbf{w}_i^{\top}\mathbf{x})$ (for some random choice of $\mathbf{w}_i$). For such linear predictors, it is not difficult to show that they converge quickly to an optimal predictor over the span of the random features. This leads to learning guarantees with respect to hypothesis classes which can be captured well by such random features: For example, most papers focus (explicitly or implicitly) on multivariate polynomials with certain constraints on their degree or the magnitude of their coefficients. We discuss these results in more detail (and demonstrate their close connection to random features) in Section 4.
Taken together, these results are a significant and insightful advance in our understanding of neural networks: They rigorously establish that sufficient over-parameterization allows us to learn complicated functions while solving a non-convex optimization problem. However, it is important to realize that this approach can only explain the learnability of hypothesis classes which can already be learned using random features. Considering the one-hidden-layer example above, this corresponds to learning linear predictors over a fixed representation, chosen obliviously and randomly at initialization. Thus, it does not capture any element of representation learning, which appears to provide much of the power of modern neural networks.
In this paper, we study both the power and the limitations of the random feature approach for understanding deep learning:
On the positive side, we provide a simple, self-contained analysis, showing how over-parameterized, one-hidden-layer networks can provably learn polynomials with bounded degrees and coefficients, using standard stochastic gradient descent with standard initialization. In more detail, fix any distribution $\mathcal{D}$ over labeled data $(\mathbf{x},y)$, where $\|\mathbf{x}\|\le 1$ and $y\in\{-1,+1\}$, any $\epsilon,\delta\in(0,1)$, and any multivariate polynomial with bounded degree and coefficients of bounded magnitude. If we take a one-hidden-layer neural network with an analytic activation function and sufficiently many neurons (the required number, which is exponential in the degree of the polynomial, is given in Section 3), and run sufficiently many iterations of stochastic gradient descent on i.i.d. examples, then with probability at least $1-\delta$ over the random initialization, SGD reaches weights whose expected hinge loss is within $\epsilon$ of that of the polynomial, where the expectation is over the random choice of examples. We emphasize that although our analysis improves on previous ones in certain aspects (discussed later on), it is not fundamentally novel: Our goal is mostly to present the approach developed in previous papers in a transparent and self-contained manner, focusing on clarity rather than generality.
On the negative side, we show that there are inherent limitations on which predictors can be captured with random features, and as a result, on what can be learned with neural networks using the techniques described earlier. In particular, we show that for almost any activation function $\sigma$, and any neural network $\mathbf{x}\mapsto\sum_{i=1}^{n}u_i\,\sigma(\mathbf{w}_i^{\top}\mathbf{x})$ where the weights $\mathbf{w}_i$ are randomly chosen and then fixed, we cannot efficiently approximate even a single ReLU neuron: For any dimension $d$, we can choose a bias term $b$ so that, with high probability over the choice of the random features, if the network approximates the single ReLU neuron $\mathbf{x}\mapsto[\langle\mathbf{w}^{\star},\mathbf{x}\rangle-b]_{+}$ to within small expected squared error (where $[z]_{+}=\max\{0,z\}$ is the ReLU function and $\mathbf{x}$ has a standard Gaussian distribution), then $n\cdot\max_i|u_i|$ must be exponentially large in $d$. In other words, either the number of neurons or the magnitude of the weights (or both) must be exponential in the dimension $d$. This means that the random features approach cannot fully explain polynomial-time learnability of neural networks, even with respect to data generated by an extremely simple neural network, composed of a single neuron. This is despite the fact that single ReLU neurons are easily learnable with gradient-based methods. The point we want to make here is that the random features approach, as a theory for explaining the success of neural networks, cannot explain even the fact that single neurons are learnable.
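The tractability of learning a single neuron directly, which the text contrasts with the random-features lower bound, can be illustrated with a short experiment. The teacher/student setup, dimensions, step size, and iteration count below are my own arbitrary choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 5, 500
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)               # ground-truth neuron direction
X = rng.standard_normal((m, d))                # Gaussian inputs, as in the text
y = np.maximum(X @ w_star, 0.0)                # labels from a single ReLU neuron

w = 0.1 * rng.standard_normal(d)               # student neuron

def loss(w):
    return np.mean((np.maximum(X @ w, 0.0) - y) ** 2)

init_loss = loss(w)
lr = 0.2
for _ in range(500):                           # plain gradient descent on the squared loss
    z = X @ w
    r = np.maximum(z, 0.0) - y
    w -= lr * (2.0 / m) * (X.T @ (r * (z > 0)))   # (sub)gradient step
final_loss = loss(w)
print(init_loss, final_loss)
```

Here the student has the same single-neuron form as the teacher, so gradient descent drives the loss down directly; no exponentially large collection of frozen random features is needed.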
We emphasize that there is no contradiction between the positive and negative results: In the positive results, the required size of the network is exponential in the degree of the polynomial, and constant-degree polynomials cannot express even a single ReLU neuron if its weights are large enough.
Overall, we argue that although the random feature approach captures important aspects of training neural networks, it is by no means the whole story, and we are still quite far from a satisfying general explanation for the empirical success of neural networks.
The recent literature on the theory of deep learning is too large to be thoroughly surveyed here. Instead, we focus on the works most directly relevant to the themes of our paper. In Section 4, we provide a more technical comparison of our results to those mentioned in the introduction, and explain their connection to random features.
The power of over-parameterization. The fact that over-parameterized networks are easier to train has been empirically observed in several works, and has been used in several contexts to show positive results for learning and training neural networks. For example, it is known that adding more neurons makes the optimization landscape more benign (e.g., [25, 31, 32, 26, 10, 30]), or allows networks to learn successfully in various settings (besides the papers mentioned in the introduction, see, e.g., [8, 20, 7, 34, 13]).
Learning Polynomials. It has been shown (in the work of Andoni et al., discussed further in Section 4) that over-parameterized neural networks can be used to learn polynomials, which is also the focus of the positive result we present here. However, that paper makes a few non-standard assumptions (which we do not make here), such as the activation function being the exponential function, and the data and weights being complex-valued. We also note that there are other algorithms for learning polynomials (e.g., kernel methods with a polynomial kernel), but they do not involve neural networks and standard methods to train them.
Random Features. The random features approach was originally proposed as a computationally-efficient alternative to kernel methods (although as a heuristic, it can be traced back to the "random connections" feature of Rosenblatt's Perceptron machine in the 1950's). It involves learning predictors of the form $\mathbf{x}\mapsto\sum_{i=1}^{n}u_i\,\psi_i(\mathbf{x})$, where $\psi_1,\dots,\psi_n$ are random non-linear functions, and training involves tuning only the weights $u_1,\dots,u_n$. Thus, the learning problem is as computationally easy as training linear predictors, but with the advantage that the resulting predictor is non-linear, and in fact, if $n$ is large enough, can capture arbitrarily complex functions. In our paper, we mainly consider the case where $\psi_i(\mathbf{x})=\sigma(\mathbf{w}_i^{\top}\mathbf{x})$ for randomly chosen $\mathbf{w}_i$ (see Eq. (1)). The power of random features to express certain classes of functions has been studied in past years (for example [5, 23, 18, 33]). However, in our paper we also consider negative rather than only positive results for such features. Prior work also discusses the limitations of approximating functions with a bounded number of such features, but in a different setting than ours (worst-case approximation of a large function class using a fixed set of features, rather than inapproximability of a fixed target function, and not in the context of single neurons). Less directly related, learning neural networks using kernel methods has also been studied; this can be seen as learning a linear predictor over a fixed non-linear mapping, but the algorithm there is not based on training neural networks with standard gradient-based methods.
In this section, we introduce some notation and assumptions used later in our results.
Notation. We denote by $U\big([-a,a]^{d}\big)$ the $d$-dimensional uniform distribution over the rectangle $[-a,a]^{d}$, and by $N(\mathbf{0},\Sigma)$ the multivariate Gaussian distribution with covariance matrix $\Sigma$. For $n\in\mathbb{N}$ we let $[n]=\{1,\dots,n\}$, and for a vector $\mathbf{x}$ we denote by $\|\mathbf{x}\|$ the Euclidean norm. We denote the ReLU function by $[z]_{+}=\max\{0,z\}$.
For our positive results, we consider one-hidden-layer feed-forward neural networks, defined as
$$N(\mathbf{x})=\sum_{i=1}^{n}u_i\,\sigma(\mathbf{w}_i^{\top}\mathbf{x}),$$
where $\sigma$ is an activation function which acts coordinate-wise, $\mathbf{w}_1,\dots,\mathbf{w}_n\in\mathbb{R}^{d}$ are the weights of the first layer, and $u_1,\dots,u_n\in\mathbb{R}$ are the weights of the output layer. In this paper we consider activations which are either analytic, or the popular ReLU activation. We will also use the following matrix form for the network:
$$N_{W,\mathbf{u}}(\mathbf{x})=\mathbf{u}^{\top}\sigma(W\mathbf{x}),$$
where $W\in\mathbb{R}^{n\times d}$ and $\mathbf{u}\in\mathbb{R}^{n}$.
Loss and Expected Loss. For simplicity, in our positive results we will use the hinge loss, defined by $\ell(\hat{y},y)=\max\{0,\,1-y\hat{y}\}$; thus, the optimization will be done on the function $(W,\mathbf{u})\mapsto\ell(N_{W,\mathbf{u}}(\mathbf{x}),y)$. We will also use the notation $L_{\mathcal{D}}(W,\mathbf{u})=\mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}\big[\ell(N_{W,\mathbf{u}}(\mathbf{x}),y)\big]$ for the expected loss. The (sub)gradients of the loss can be calculated directly: whenever $y\,N_{W,\mathbf{u}}(\mathbf{x})<1$,
$$\nabla_{\mathbf{u}}\,\ell=-y\,\sigma(W\mathbf{x}),\qquad \nabla_{W}\,\ell=-y\,\big(\mathbf{u}\odot\sigma'(W\mathbf{x})\big)\,\mathbf{x}^{\top},$$
and both gradients are zero when $y\,N_{W,\mathbf{u}}(\mathbf{x})>1$. Here $\odot$ denotes the entrywise product, and we view $\nabla_{W}\,\ell$ as a matrix in $\mathbb{R}^{n\times d}$.
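As a sanity check on the hinge-loss gradient formulas, the following sketch (with my own choice of tanh as an analytic activation and arbitrary small sizes) compares closed-form (sub)gradients of $\ell(N_{W,\mathbf{u}}(\mathbf{x}),y)$ against central finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 6
x = rng.standard_normal(d)
y = 1.0
W = rng.standard_normal((n, d)) / np.sqrt(d)
u = rng.standard_normal(n) / np.sqrt(n)

sigma = np.tanh
dsigma = lambda z: 1.0 - np.tanh(z) ** 2         # derivative of tanh

def hinge(W, u):
    return max(0.0, 1.0 - y * (u @ sigma(W @ x)))

# closed-form (sub)gradients, valid away from the hinge kink y * N(x) = 1
active = 1.0 if y * (u @ sigma(W @ x)) < 1.0 else 0.0
grad_u = -y * active * sigma(W @ x)
grad_W = -y * active * np.outer(u * dsigma(W @ x), x)

eps = 1e-6
num_W = np.zeros_like(W)
for i in range(n):
    for j in range(d):
        Wp = W.copy(); Wp[i, j] += eps
        Wm = W.copy(); Wm[i, j] -= eps
        num_W[i, j] = (hinge(Wp, u) - hinge(Wm, u)) / (2 * eps)
num_u = np.array([
    (hinge(W, u + eps * e) - hinge(W, u - eps * e)) / (2 * eps)
    for e in np.eye(n)
])
err_W = np.max(np.abs(num_W - grad_W))
err_u = np.max(np.abs(num_u - grad_u))
print(err_W, err_u)
```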
SGD. We will use the standard form of SGD to optimize $L_{\mathcal{D}}(W,\mathbf{u})$, stated as Algorithm 1. The initialization of $W$ is a standard Xavier initialization, i.e., its entries are drawn uniformly at random with scale proportional to $1/\sqrt{d}$. The outer layer $\mathbf{u}$ can be initialized in any manner, as long as its norm is small enough; e.g., we can initialize $\mathbf{u}=\mathbf{0}$. This kind of initialization for the outer layer has also been used in other works.
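Algorithm 1 itself is not reproduced in this excerpt; the following is a hedged sketch of the standard loop it presumably describes, with a Xavier-style uniform initialization for $W$, $\mathbf{u}=\mathbf{0}$, and the hinge-loss subgradient updates. The toy data distribution, sizes, and step size are my own choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, eta, T = 3, 200, 0.1, 1000

W = rng.uniform(-1 / np.sqrt(d), 1 / np.sqrt(d), (n, d))  # Xavier-style init
u = np.zeros(n)                                           # zero-norm outer layer

def sample():
    # i.i.d. example oracle over a toy distribution with ||x|| <= 1, y in {-1,+1}
    x = rng.uniform(-1, 1, d)
    x /= max(1.0, np.linalg.norm(x))
    y = 1.0 if x[0] * x[1] > 0 else -1.0
    return x, y

for _ in range(T):
    x, y = sample()
    h = np.tanh(W @ x)                        # analytic activation
    if y * (u @ h) < 1:                       # hinge loss is active
        g_u = y * h                           # negative of the subgradients
        g_W = y * np.outer(u * (1 - h ** 2), x)
        u += eta * g_u
        W += eta * g_W

u_norm = np.linalg.norm(u)
print(u_norm)
```

This is a sketch under stated assumptions, not a reproduction of the paper's Algorithm 1.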
Multivariate Polynomials. Finally, we introduce some notation regarding multivariate polynomials. Let $\alpha=(\alpha_1,\dots,\alpha_d)$ be a multi-index; given $\mathbf{x}\in\mathbb{R}^{d}$, we define $\mathbf{x}^{\alpha}=\prod_{i=1}^{d}x_i^{\alpha_i}$, and also $|\alpha|=\sum_{i=1}^{d}\alpha_i$. For two multi-indices we say that $\beta\le\alpha$ if $\beta_i\le\alpha_i$ for all $i$, and that $\beta<\alpha$ if $\beta\le\alpha$ and there is also an index $i$ such that $\beta_i<\alpha_i$. For $k\in\mathbb{N}$ and a multi-index $\alpha$, we say that $\alpha\le k$ if $|\alpha|\le k$. Lastly, given a multivariate polynomial $P(\mathbf{x})=\sum_{\alpha}b_{\alpha}\mathbf{x}^{\alpha}$ with $b_{\alpha}\in\mathbb{R}$, we define its degree as $\max\{|\alpha|:b_{\alpha}\neq 0\}$.
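In code, this notation is straightforward. The following toy illustration (with names of my own choosing) evaluates a monomial $\mathbf{x}^{\alpha}$, the total degree $|\alpha|$, and a polynomial stored as a map from multi-indices to coefficients.

```python
import numpy as np

def monomial(x, alpha):
    """x^alpha = prod_i x_i^(alpha_i) for a multi-index alpha."""
    return float(np.prod(np.asarray(x, dtype=float) ** np.asarray(alpha)))

def total_degree(alpha):
    """|alpha| = sum_i alpha_i."""
    return sum(alpha)

# P(x) = 2 * x1^2 * x2 - x3, stored as {multi-index: coefficient}
P = {(2, 1, 0): 2.0, (0, 0, 1): -1.0}
x = (1.0, 2.0, 3.0)
value = sum(b * monomial(x, a) for a, b in P.items())   # 2*(1^2 * 2) - 3
deg = max(total_degree(a) for a in P)                   # degree of P
print(value, deg)
```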
3 Over-Parameterized Neural Networks Learn Polynomials
In this section, the data for our network consists of pairs $(\mathbf{x},y)$ drawn from an unknown distribution $\mathcal{D}$. We assume for simplicity that $\|\mathbf{x}\|\le 1$ and $y\in\{-1,+1\}$.
The main result of this section is the following:
Theorem 3.1. Let $\sigma$ be an analytic activation function which is Lipschitz on the relevant bounded domain. Let $\mathcal{D}$ be any distribution over labelled data $(\mathbf{x},y)$ with $\|\mathbf{x}\|\le 1$ and $y\in\{-1,+1\}$, let $\epsilon,\delta\in(0,1)$, and let $k$ be some positive integer. Suppose we run SGD on the neural network $N_{W,\mathbf{u}}(\mathbf{x})=\mathbf{u}^{\top}\sigma(W\mathbf{x})$ with a suitable step size, a suitable number of iterations $T$, and a width $n$ which is sufficiently large (in particular, exponential in $k$), where $W$ is initialized with Xavier initialization and $\mathbf{u}$ is initialized with a sufficiently small norm. The required width also depends on a term $\beta$, which bounds the first $k$ coefficients of the Taylor series expansion of the activation $\sigma$. Then for every polynomial $P$ of degree at most $k$ with bounded coefficients, such that all the monomials of $P$ which have a non-zero coefficient also have a non-zero coefficient in the Taylor series of $\sigma$, with probability at least $1-\delta$ over the initialization there is some iteration $t\in\{1,\dots,T\}$ such that
$$\mathbb{E}\big[L_{\mathcal{D}}(W_t,\mathbf{u}_t)\big]\;\le\;\mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}\big[\ell(P(\mathbf{x}),y)\big]+\epsilon.$$
Here the expectation on the left-hand side is over the random choice of examples in each iteration of SGD.
We note that for simplicity, we focused on analytic activation functions, although it is possible to derive related results for non-analytic activations such as the ReLU (see Appendix E for a discussion). Also, note that we did not use a bias term in the architecture of the network in the theorem (namely, the neurons compute $\sigma(\mathbf{w}_i^{\top}\mathbf{x})$ rather than $\sigma(\mathbf{w}_i^{\top}\mathbf{x}+b_i)$). This is because if the polynomial we are trying to compete with has a constant term, then we require that the Taylor expansion of the activation also has a constant term, so the role of the bias is already played by the constant term in the Taylor expansion of the activation function.
Suppose we are given a sample set $S$. By choosing $\mathcal{D}$ to be the uniform distribution on the sample set $S$, Theorem 3.1 shows that SGD over the sample set will lead to an average loss not much worse than that of the best possible polynomial predictor with bounded degree and coefficients.
3.1 Proof of Thm. 3.1
We break the proof into three main steps, where each step contains a theorem which is independent of the other steps; a final step combines the three to prove the main theorem.
Step 1: SGD on Over-Parameterized Networks Competes with Random Features
Recall that we use a network of the form $N_{W,\mathbf{u}}(\mathbf{x})=\mathbf{u}^{\top}\sigma(W\mathbf{x})$, where $W,\mathbf{u}$ are initialized as described in the theorem. We show that for any target matrix with a small enough norm, and every $\epsilon>0$, if we run SGD on $N_{W,\mathbf{u}}$ as in Algorithm 1 with an appropriate learning rate $\eta$ and number of iterations $T$, then there is some iteration $t\le T$ at which the expected loss of the network is within $\epsilon$ of the loss attained by the target predictor (the bound stated as Eq. (7)), where the expectation is over the random choices of examples in each round of SGD.
The bound in Eq. (7) means that SGD on randomly initialized weights competes with random features. By random features here we mean any linear combination of neurons of the form $\sigma(\mathbf{w}_i^{\top}\mathbf{x})$, where the $\mathbf{w}_i$ are randomly chosen, and the norm of the weights of the linear combination is bounded. In more detail:
Theorem 3.2. Assume that $W,\mathbf{u}$ are initialized with suitably bounded norms, and that $\sigma$ is Lipschitz. Suppose we run SGD with an appropriate step size $\eta$ for $T$ steps, and let $(W_t,\mathbf{u}_t)$ be the weights produced at each step. Then for every target matrix with a small enough norm, there is a $t\le T$ at which the expected loss of the network satisfies the bound of Eq. (7), where the expectation is over the random choice of the training examples in each round of SGD.
In the proof of Theorem 3.2, we first show that for the chosen learning rate $\eta$ and limited number of iterations $T$, the matrix $W$ does not change much from its initialization. We then use results from online convex optimization for linear prediction, with respect to a comparator of sufficiently small norm, to prove the required bound. For the full proof, see Appendix C. Note that the theorem does not require a specific initialization scheme, only a bound on the norm of the initialized weights. The optimization analysis is similar to the one of Daniely.
Step 2: Random Features Concentrate Around their Expectation
In the previous step we showed that in order to bound the expected loss of the network, it is enough to consider a predictor of the form $\mathbf{x}\mapsto\sum_{i=1}^{n}u_i\,\sigma(\mathbf{w}_i^{\top}\mathbf{x})$, where the $\mathbf{w}_i$ are randomly initialized. We now show that if the number of random features $n$ is large enough, then a linear combination of them approximates functions of the form $\mathbf{x}\mapsto\mathbb{E}_{\mathbf{w}}\big[f(\mathbf{w})\,\sigma(\mathbf{w}^{\top}\mathbf{x})\big]$, for an appropriate normalization of the distribution of $\mathbf{w}$:
Theorem 3.3. Let $g(\mathbf{x})=\mathbb{E}_{\mathbf{w}}\big[f(\mathbf{w})\,\sigma(\mathbf{w}^{\top}\mathbf{x})\big]$, where $\sigma$ is Lipschitz on the relevant bounded domain, $f$ is bounded, and the expectation is with respect to an appropriately normalized distribution over $\mathbf{w}$. Then for every $\epsilon,\delta\in(0,1)$, if $n$ is large enough and $\mathbf{w}_1,\dots,\mathbf{w}_n$ are drawn i.i.d. from the uniform distribution on the appropriate rectangle, then with probability at least $1-\delta$ there is a function of the form
$$\hat{g}(\mathbf{x})=\sum_{i=1}^{n}u_i\,\sigma(\mathbf{w}_i^{\top}\mathbf{x}),$$
with bounded coefficients $u_i$ for every $i\in[n]$, such that $\hat{g}$ approximates $g$ to within $\epsilon$.
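The concentration phenomenon in this step can be illustrated numerically. In the sketch below (activation, feature distribution, and sizes are my own choices), the empirical average of $n$ random features concentrates around the corresponding expectation, with average error shrinking as $n$ grows, roughly like $1/\sqrt{n}$.

```python
import numpy as np

d = 5
x = np.random.default_rng(0).uniform(-1, 1, d)   # a fixed input point

def avg_feature(n, seed):
    # average of n random features sigma(w_i . x), w_i uniform on [-1,1]^d
    w = np.random.default_rng(seed).uniform(-1, 1, (n, d))
    return np.mean(np.tanh(w @ x))

ref = avg_feature(1_000_000, 123)                # high-precision proxy for E_w[sigma(w . x)]

def mean_err(n):
    # average deviation from the expectation over 20 independent draws
    return np.mean([abs(avg_feature(n, s) - ref) for s in range(20)])

e_small, e_big = mean_err(100), mean_err(10_000)
print(e_small, e_big)
```

With a hundred times more features, the average deviation from the expectation drops by roughly an order of magnitude, matching the usual Monte Carlo rate.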
Step 3: Representing Polynomials as Expectation of Random Features
In the previous step we showed that random features can approximate functions with the integral form $g(\mathbf{x})=\mathbb{E}_{\mathbf{w}}\big[f(\mathbf{w})\,\sigma(\mathbf{w}^{\top}\mathbf{x})\big]$. In this step we show how a polynomial with bounded degree and coefficients can be represented in this form; that is, we need to find a function $f$ for which $P(\mathbf{x})=\mathbb{E}_{\mathbf{w}}\big[f(\mathbf{w})\,\sigma(\mathbf{w}^{\top}\mathbf{x})\big]$. To do so, we use the fact that $\sigma$ is analytic, so it can be represented as an infinite sum of monomials via its Taylor expansion, and we take $f$ to be a finite weighted sum of Legendre polynomials, which are orthogonal with respect to the appropriate inner product. The main difficulty is to bound the magnitude of $f$, which in turn also bounds the distance between the sum of the random features and its expectation. The main theorem of this step is:
Theorem 3.4. Let $\sigma$ be an analytic function, and let $P$ be a polynomial of degree at most $k$ with bounded coefficients, such that all the monomials of $P$ which have a non-zero coefficient also have a non-zero coefficient in the Taylor series of $\sigma$. Then there exists a bounded function $f$ satisfying
$$P(\mathbf{x})=\mathbb{E}_{\mathbf{w}}\big[f(\mathbf{w})\,\sigma(\mathbf{w}^{\top}\mathbf{x})\big],$$
where the expectation is with respect to an appropriately normalized distribution over $\mathbf{w}$, and the bound on $f$ depends on a term $\beta$ which bounds the first $k$ coefficients of the Taylor series expansion of the activation $\sigma$.
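The key property of the Legendre polynomials used in this step is a standard fact (not specific to this paper) and is easy to verify numerically: they are orthogonal on $[-1,1]$, with $\int_{-1}^{1}P_k(t)^2\,dt=2/(2k+1)$. The check below uses Gauss-Legendre quadrature, which is exact for polynomials of the degrees involved.

```python
import numpy as np

nodes, weights = np.polynomial.legendre.leggauss(20)    # exact up to degree 39

def P(k, t):
    # evaluate the k-th Legendre polynomial at t
    c = np.zeros(k + 1)
    c[k] = 1.0
    return np.polynomial.legendre.legval(t, c)

inner_23 = np.sum(weights * P(2, nodes) * P(3, nodes))  # int_{-1}^{1} P2 * P3, should vanish
norm3_sq = np.sum(weights * P(3, nodes) ** 2)           # int_{-1}^{1} P3^2, should equal 2/7
print(inner_23, norm3_sq)
```

This orthogonality is what lets a weighted sum of Legendre polynomials pick out the desired monomial coefficients in the expectation above.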
Step 4: Putting it all Together
We are now ready to prove the main theorem of this section. For convenience, the proof proceeds in reverse order of the three steps presented above.
Proof of Theorem 3.1.
Let $a_0,a_1,a_2,\dots$ be the coefficients of the Taylor expansion of $\sigma$ up to degree $k$, and let $\tilde{P}$ be a polynomial of degree at most $k$ with bounded coefficients, such that if $a_j=0$ then the monomials of $\tilde{P}$ of degree $j$ also have a zero coefficient.
First, we use Theorem 3.4 to find a bounded function $f$ such that $\tilde{P}(\mathbf{x})=\mathbb{E}_{\mathbf{w}}\big[f(\mathbf{w})\,\sigma(\mathbf{w}^{\top}\mathbf{x})\big]$, with the bound on $f$ given by that theorem.
4 Analysis of Neural Networks as Random Features
In the results of the previous section, a key element was to analyze neural networks as if they were random features. Many previous works that analyze neural networks also take this approach, either explicitly or implicitly. Here we survey some of these works, explain how they can be viewed through the lens of random features, and describe how their results differ from ours.
4.1 Optimization with Coupling, Fixing the Output Layer
One approach is to fix the output layer and optimize only the inner layers. Most works that use this method also use the technique of "coupling", together with the popular ReLU activation. The technique builds on the following observation: a ReLU neuron can be viewed as a linear predictor multiplied by a threshold function, that is, $[\mathbf{w}^{\top}\mathbf{x}]_{+}=(\mathbf{w}^{\top}\mathbf{x})\cdot\mathbb{1}[\mathbf{w}^{\top}\mathbf{x}>0]$. The coupling argument informally states that after running gradient descent with an appropriate learning rate for a limited number of iterations, the number of neurons for which the sign of $\mathbf{w}_i^{\top}\mathbf{x}$ changes (for $\mathbf{x}$ in the data) is small. Thus, it is enough to analyze a linear predictor over random features of the form $\mathbf{x}\cdot\mathbb{1}[\mathbf{v}_i^{\top}\mathbf{x}>0]$, where the $\mathbf{v}_i$ are randomly chosen.
For example, a one-hidden-layer neural network with ReLU activation can be written as
$$N(\mathbf{x})=\sum_{i=1}^{n}u_i\,[\mathbf{w}_i^{\top}\mathbf{x}]_{+}=\sum_{i=1}^{n}u_i\,(\mathbf{w}_i^{\top}\mathbf{x})\cdot\mathbb{1}[\mathbf{w}_i^{\top}\mathbf{x}>0].$$
Using the coupling method, after running gradient descent, the number of neurons that change sign (i.e., for which the sign of $\mathbf{w}_i^{\top}\mathbf{x}$ changes) is small. As a result, using the homogeneity of the ReLU function, the following network can actually be analyzed:
$$N(\mathbf{x})=\sum_{i=1}^{n}u_i\,(\mathbf{w}_i^{\top}\mathbf{x})\cdot\mathbb{1}[\mathbf{v}_i^{\top}\mathbf{x}>0],$$
where the $\mathbf{v}_i$ are randomly chosen and fixed. This is just a linear predictor over random features of the form $\mathbf{x}\cdot\mathbb{1}[\mathbf{v}_i^{\top}\mathbf{x}>0]$. Note that the homogeneity of the ReLU function is used to show that fixing the output layer does not change the network's expressiveness. This is not true in terms of optimization, as optimizing both the inner layers and the output layer may help the network converge faster, and find a predictor with better generalization properties. Thus, the challenge in this approach is to find functions or distributions that can be approximated by this kind of random-features network with a number of neurons at most polynomial in the dimension. This method is used in several works, for example:
Li and Liang show a generalization bound for ReLU neural networks under a strong separability assumption on the data distribution, an assumption which is critical to the analysis.
Du et al. show that, under some assumptions on the data, a neural network with ReLU activation reaches a global minimum of the empirical risk in polynomial time.
Allen-Zhu et al. also show an empirical risk bound for multi-layer feed-forward neural networks with ReLU activation, likewise relying on a separability assumption on the data.
Allen-Zhu et al. show, with almost no assumptions on the data distribution, that the generalization error of neural networks with two or three layers is not much worse than that of a "ground truth" target function, for a large family of such functions which includes polynomials.
Cao and Gu show a generalization result for multi-layered networks, where the comparison class is a family of functions that take an integral form, similar to the one developed in Theorem 3.3. This integral form has also been studied in other works in the context of viewing neural networks as random features.
All the works above fix the weights of the output layer and consider only the ReLU activation function. Note that with the ReLU activation, it is possible to show that fixing the output layer does not change the expressive power of the network. That said, in practice all layers of a neural network are optimized, and the optimization process may not work as well when some of the layers are fixed.
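The coupling observation underlying these works can be checked numerically. In the sketch below (my own construction; the sizes, step size, and squared loss are arbitrary), after one gradient step on the inner layer of a wide ReLU network, only a tiny fraction of the threshold signs $\mathbb{1}[\mathbf{w}_i^{\top}\mathbf{x}>0]$ change on the training inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, m = 20, 10_000, 100
X = rng.standard_normal((m, d))
y = rng.choice([-1.0, 1.0], size=m)

W = rng.standard_normal((n, d)) / np.sqrt(d)
u = rng.choice([-1.0, 1.0], size=n) / np.sqrt(n)    # output layer, kept fixed

signs_before = X @ W.T > 0                          # threshold pattern 1[w_i . x > 0]
pred = np.maximum(X @ W.T, 0.0) @ u
g = 2.0 * (pred - y) / m                            # squared-loss gradient wrt predictions
W -= 0.1 * ((g[:, None] * signs_before) * u).T @ X  # one step on the inner layer only
signs_after = X @ W.T > 0

flip_frac = np.mean(signs_before != signs_after)
print(flip_frac)
```

Since the sign pattern is essentially frozen, the trained network coincides (on the data) with a linear predictor over the random features determined at initialization, which is exactly what the coupling analyses exploit.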
4.2 Optimization on all the Layers
A second approach in the literature is to perform optimization on all the layers of the network, choose a "good" learning rate, and bound the number of iterations so that the inner layers stay close to their initialization. For example, in the setting of a one-hidden-layer network, for every $\epsilon>0$, a learning rate $\eta$ and number of iterations $T$ are chosen such that after running gradient descent with these parameters, there is an iteration $t\le T$ at which the inner-layer weights are still close to their random initialization, so the network behaves like a linear predictor over the almost-fixed random features.
Hence, it is enough to analyze a linear predictor over a set of random features of the form $\sigma(\mathbf{w}_i^{\top}\mathbf{x})$, where $\sigma$ is not necessarily the ReLU function. Again, the difficulty is finding which functions can be approximated in this form with a number of neurons $n$ only polynomial in the dimension. This is the approach we used in this paper to prove Theorem 3.1.
In Andoni et al., this approach is also used to prove that neural networks can approximate polynomials. There, the weights are drawn from a complex-valued distribution, so optimization is done over a complex domain, which is non-standard. Moreover, that paper uses the exponential activation function and assumes that the data distribution is uniform on the complex unit circle.
In Daniely et al. and Daniely, this approach is used to obtain a generalization bound with respect to a large family of functions, where the network may have more than two layers and an architecture other than simple feed-forward. The analysis there goes through the "conjugate kernel", a function corresponding to the activation and architecture of the network. The proof of our result is more direct, and gives a bound which corresponds directly to the activation function, without going through the conjugate kernel. Moreover, those papers do not quantitatively characterize, with an explicit proof, the class of polynomials learned by the network.
Du and Lee rely on the same methods and assumptions as Du et al. to show an empirical risk bound for multi-layer neural networks, covering a large family of activations and several architectures, including ResNets and convolutional ResNets. However, this does not imply a bound on the risk.
5 Limitations of Random Features
Having shown positive results for learning using (essentially) random features, we turn to discuss the limitations of this approach.
Concretely, in this section we consider data $(\mathbf{x},y)$, where $\mathbf{x}$ is drawn from a standard Gaussian on $\mathbb{R}^{d}$, and there exists some single ground-truth neuron which generates the target values: namely, $y=[\langle\mathbf{w}^{\star},\mathbf{x}\rangle-b]_{+}$ for some fixed $\mathbf{w}^{\star},b$. We also consider the squared loss, so the expected loss of a predictor $h$, which we wish to minimize, takes the form
$$\mathbb{E}_{\mathbf{x}}\Big[\big(h(\mathbf{x})-[\langle\mathbf{w}^{\star},\mathbf{x}\rangle-b]_{+}\big)^{2}\Big].\qquad(11)$$
Importantly, when we train a single neuron of the same form to fit a single target neuron, this problem does seem to be quite tractable with standard gradient-based methods. In this section, we ask whether this positive result – that single target neurons can be learned – can be explained by the random features approach. Specifically, we consider the case where the weights $\mathbf{w}_1,\dots,\mathbf{w}_n$ of the features are sampled at random, and ask what conditions on the number of features $n$ and on the magnitudes of the coefficients $u_i$ are required to minimize Eq. (11). Our main result (Thm. 5.1) is that at least one of them has to be exponential in the dimension $d$. Since exponentially-sized or exponentially-valued networks are generally not efficiently trainable, we conclude that an approach based on random features cannot explain why learning single neurons is tractable in practice.
To simplify the notation in this section, we consider functions on $\mathbb{R}^{d}$ as elements of the $L^{2}$ space weighted by a standard Gaussian measure, that is, with inner product $\langle f,g\rangle=\mathbb{E}_{\mathbf{x}\sim N(\mathbf{0},I_d)}\big[f(\mathbf{x})\,g(\mathbf{x})\big]$, where the Gaussian density serves as the normalization term. For example, Eq. (11) can also be written as $\big\|h-[\langle\mathbf{w}^{\star},\cdot\rangle-b]_{+}\big\|^{2}$ in this space.
5.1 Warm up: Linear predictors
Before stating our main result, let us consider a particularly simple case, where $\sigma$ is the identity, and our goal is to learn a linear predictor $\mathbf{x}\mapsto\langle\mathbf{v}^{\star},\mathbf{x}\rangle$, where $\mathbf{v}^{\star}$ has unit norm. We will show that already in this case, there is a significant cost to pay for using random features. The main result in the next subsection can be seen as an elaboration of this idea.
In this setting, finding a good linear predictor, namely minimizing $\mathbb{E}_{\mathbf{x}}\big[(\langle\mathbf{v},\mathbf{x}\rangle-\langle\mathbf{v}^{\star},\mathbf{x}\rangle)^{2}\big]$ over $\mathbf{v}$, is easy: it can be done using standard gradient-based methods, and is solvable in polynomial time, since the problem is convex.
Suppose now that we are given random linear features $\mathbf{x}\mapsto\langle\mathbf{w}_i,\mathbf{x}\rangle$, $i=1,\dots,n$, and want to find coefficients $u_1,\dots,u_n$ such that $\sum_{i}u_i\langle\mathbf{w}_i,\mathbf{x}\rangle$ approximates $\langle\mathbf{v}^{\star},\mathbf{x}\rangle$ well (Eq. (12)). Then Eq. (12) is with high probability unsolvable, unless $n$, or the magnitudes of the coefficients $u_i$ (or both), grow with the dimension $d$. This shows that even linear predictors are significantly limited if, instead of learning the linear predictor itself, one learns a combination of random features. In more detail:
Let $\mathbf{v}^{\star}$ be a fixed unit vector, and let $\mathbf{w}_1,\dots,\mathbf{w}_n$ be sampled i.i.d. from an appropriately scaled Gaussian distribution. Then with high probability over the sampling of $\mathbf{w}_1,\dots,\mathbf{w}_n$, any coefficients $u_1,\dots,u_n$ for which $\sum_i u_i\langle\mathbf{w}_i,\cdot\rangle$ approximates $\langle\mathbf{v}^{\star},\cdot\rangle$ well must have $n$ or $\max_i|u_i|$ large. The argument is short: for every $i$, the random variable $|\langle\mathbf{v}^{\star},\mathbf{w}_i\rangle|$ has a half-normal distribution, whose expectation is small. Using Markov's inequality, each $|\langle\mathbf{v}^{\star},\mathbf{w}_i\rangle|$ is therefore small with high probability, and using the union bound, with high probability the bound of Eq. (13) holds simultaneously for every $i\in[n]$. The claim follows, since $\big\langle\mathbf{v}^{\star},\sum_i u_i\mathbf{w}_i\big\rangle$, which must be large for a good approximation, is at most $n\cdot\max_i|u_i|\cdot\max_i|\langle\mathbf{v}^{\star},\mathbf{w}_i\rangle|$.
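The quantitative heart of this argument, namely that a fixed direction has small inner product with each of polynomially many random directions, is easy to check numerically (the dimensions and the threshold below are my own choices).

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 1000, 200
v_star = np.zeros(d)
v_star[0] = 1.0                                   # a fixed unit target direction
W = rng.standard_normal((n, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)     # n random unit directions

max_corr = np.max(np.abs(W @ v_star))             # largest |<v*, w_i>| over all features
print(max_corr)
```

Each inner product concentrates at the $1/\sqrt{d}$ scale, so even the maximum over all $n$ features stays far below the unit correlation that an exact representation of $\mathbf{v}^{\star}$ would require, forcing large coefficients or many more features.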
We note that in this setting, it is possible to prove the result without the restriction on the coefficients $u_i$. However, we chose to present the result in this manner so that the proof would be more similar to the one we employ in the non-linear case later on (we conjecture that removing this dependence in the non-linear case is possible, and leave it to future work).
5.2 Limitations of Random Features for Learning a Single Neuron
Having discussed the linear case, let us return to the case of a non-linear neuron. Specifically, we will show that even a single ReLU neuron cannot be approximated by a neural network with random features and any "standard" activation, unless the number of neurons in the network, or the coefficients of the linear combination, are exponential in the dimension. In more detail:
Theorem 5.1. Let $\sigma$ be an activation function which, together with its derivative, is bounded by a universal constant. Then for every dimension $d$ and every unit target direction $\mathbf{w}^{\star}$, there exists a bias $b$ such that, with high probability over the sampling of the random weights (where $\mathbf{w}_1,\dots,\mathbf{w}_n$ are distributed uniformly and $\mathbf{x}$ is distributed as a standard Gaussian), if the random-features predictor $\sum_{i=1}^{n}u_i\,\sigma(\mathbf{w}_i^{\top}\mathbf{x})$ approximates the single ReLU neuron $\mathbf{x}\mapsto[\langle\mathbf{w}^{\star},\mathbf{x}\rangle-b]_{+}$ to within sufficiently small expected squared error, then $n\cdot\max_i|u_i|$ is exponentially large in $d$, up to a constant that depends only on the activation $\sigma$.
To prove the theorem, we will use the following proposition, which implies that functions of the form $\mathbf{x}\mapsto g(\langle\mathbf{w},\mathbf{x}\rangle)$, for a certain sine-like $g$ and "random" $\mathbf{w}$, are nearly uncorrelated with any fixed function.
Proposition 5.2. Let $c>0$ be a positive constant, and let $\mathbf{w}\in\mathbb{R}^{d}$ be a vector of norm at least $c\sqrt{d}$. Define $g$ to be a certain sine-like function of $\langle\mathbf{w},\mathbf{x}\rangle$ (the exact definition appears in Appendix D). Then $g$ satisfies the following:
1. It is a periodic, odd function on the relevant interval.
2. Its norm, with respect to the Gaussian-weighted inner product defined above, is suitably bounded.
3. For every fixed function $h$, the squared correlation of $h$ with $g$, in expectation over $\mathbf{w}$ sampled uniformly from vectors of the given norm, is exponentially small in $d$, up to a universal constant.
Items 1 and 2 follow from a straightforward calculation, where for item 2 we also use the fact that $\mathbf{x}$ has a symmetric distribution. Item 3 relies on a claim from prior work, which shows that periodic functions of the form $\mathbf{x}\mapsto g(\langle\mathbf{w},\mathbf{x}\rangle)$, for a random $\mathbf{w}$ with sufficiently large norm, have low correlation with any fixed function. The full proof can be found in Appendix D.
At a high level, the proof of Theorem 5.1 proceeds as follows: If we randomly choose the direction defining $g$, and fix the random features, then any linear combination of the random features with small weights will be nearly uncorrelated with $g$. But we know that $g$ can be written as a linear combination of ReLU neurons, so there must be some ReLU neuron which is nearly uncorrelated with any such linear combination of the random features (and, as a result, cannot be well-approximated by them). Finally, by a symmetry argument, we can actually fix the direction of the target neuron arbitrarily and the result still holds. We now turn to provide the formal proof:
Proof of Theorem 5.1.
Take $g$ from Proposition 5.2. If we sample the direction defining $g$ uniformly, then by item 3 of the proposition, for every $i$, the expected squared correlation between $g$ and the random feature $\sigma(\mathbf{w}_i^{\top}\mathbf{x})$ is exponentially small in $d$, up to a constant that depends only on the activation function and its norm; hence, summing over $i$, the same holds for the correlation between $g$ and the whole collection of features.
We now show that this bound does not depend on the choice of the target direction. Fix one direction; any other direction of the same norm can be written as its image under some orthogonal matrix. Since both the random features and the input $\mathbf{x}$ have spherically symmetric distributions, the relevant expectations are invariant under this rotation. Therefore, the bound holds uniformly over the choice of direction.
Using Markov's inequality, with high probability over the sampling of the random features, the correlation bound holds for each feature; and finally, using the union bound, with high probability over the choice of $\mathbf{w}_1,\dots,\mathbf{w}_n$ it holds simultaneously for every $i\in[n]$.
We can write $g$ as a linear combination of ReLU neurons. Assume, towards a contradiction, that for every such neuron we can find coefficients $u_1,\dots,u_n$ with which the random-features network approximates it well. Letting $b_0$ be the bias term of the output layer of the network (we may take $b_0=0$), it then also follows that the corresponding combination of random-features networks approximates $g$ well, contradicting the low correlation established above.