Consider the supervised learning problem in which we are given i.i.d. data $\{(y_i, x_i)\}_{i \le n}$, where $x_i \sim \mathbb{P}$, a probability distribution over $\mathbb{R}^d$, and $y_i = f_*(x_i)$. (For simplicity, we focus our introductory discussion on the case in which the response is a noiseless function of the feature vector $x_i$: some of our results go beyond this setting.) We would like to learn the unknown function $f_*$ so as to minimize the prediction risk $R(\hat f) = \mathbb{E}\{(y - \hat f(x))^2\}$. We will assume throughout that $f_*$ is square integrable, i.e. $f_* \in L^2(\mathbb{R}^d, \mathbb{P})$.
The function class of two-layers neural networks (with $N$ neurons) is defined by:
$$\mathcal{F}_{NN}^N \equiv \Big\{ f(x) = \sum_{i=1}^N a_i\, \sigma(\langle w_i, x\rangle) \;:\; a_i \in \mathbb{R},\ w_i \in \mathbb{R}^d \Big\}.$$
Classical universal approximation results [Cyb89] imply that any $f_* \in L^2(\mathbb{R}^d, \mathbb{P})$ can be approximated arbitrarily well by an element of $\mathcal{F}_{NN}^N$ (under mild conditions). At the same time, we know that such an approximation can be constructed in polynomial time only for a subset of functions $f_*$. Namely, there exist sets of functions for which no algorithm can construct a good approximation in $\mathcal{F}_{NN}^N$ in polynomial time [KK14, Sha18], even having access to the full distribution of the data (under certain complexity-theoretic assumptions).
These facts lead to the following central question in neural network theory:
For which subsets of functions $f_*$ can a neural network approximation be learnt efficiently?
Here ‘efficiently’ can be formalized in multiple ways: in this paper we will focus on learning via stochastic gradient descent.
The random features [RR08] and neural tangent [JGH18] classes are defined by
$$\mathcal{F}_{RF}^N(W) \equiv \Big\{ f(x) = \sum_{i=1}^N a_i\, \sigma(\langle w_i, x\rangle) \;:\; a_i \in \mathbb{R} \Big\}, \qquad (2)$$
$$\mathcal{F}_{NT}^N(W) \equiv \Big\{ f(x) = \sum_{i=1}^N \langle s_i, x\rangle\, \sigma'(\langle w_i, x\rangle) \;:\; s_i \in \mathbb{R}^d \Big\}. \qquad (3)$$
Here $W = (w_1, \dots, w_N)$ are weights which are not optimized and are instead drawn at random. Throughout this paper, we will assume $(w_i)_{i \le N} \sim_{iid} N(0, \Sigma_w)$. (Notice that we do not add an offset in the model, and will limit ourselves to target functions that are centered: this choice simplifies some calculations without modifying the results.)
We can think of $\mathcal{F}_{RF}^N(W)$ and $\mathcal{F}_{NT}^N(W)$ as tractable inner bounds of the class of neural networks $\mathcal{F}_{NN}^N$:
Tractable. Both $\mathcal{F}_{RF}^N(W)$ and $\mathcal{F}_{NT}^N(W)$ are finite-dimensional linear spaces, and minimizing the empirical risk over these classes can be performed efficiently.
Inner bounds. Indeed $\mathcal{F}_{RF}^N(W) \subseteq \mathcal{F}_{NN}^N$: the random features model is simply obtained by fixing all the first-layer weights. Further $\mathcal{F}_{NT}^N(W) \subseteq \overline{\mathcal{F}_{NN}^{2N}}$ (the closure of the class of neural networks with $2N$ neurons). This follows from $\varepsilon^{-1}\big[\sigma(\langle w_i + \varepsilon s_i, x\rangle) - \sigma(\langle w_i, x\rangle)\big] \to \langle s_i, x\rangle\, \sigma'(\langle w_i, x\rangle)$ as $\varepsilon \to 0$.
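The second inclusion can be checked numerically. The following sketch is our own illustration (not from the paper); the choice $\sigma = \tanh$ and the dimensions are arbitrary. A neural tangent function is recovered as the limit of a (rescaled) difference of two width-$N$ networks:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 20, 30
W = rng.normal(size=(N, d)) / np.sqrt(d)   # random first-layer weights w_i
S = rng.normal(size=(N, d)) / np.sqrt(d)   # NT directions s_i

sigma = np.tanh
dsigma = lambda u: 1.0 - np.tanh(u) ** 2

def f_NT(x):                 # sum_i <s_i, x> sigma'(<w_i, x>)
    return np.sum((S @ x) * dsigma(W @ x))

def f_NN(x, W1):             # width-N network with unit second-layer weights
    return np.sum(sigma(W1 @ x))

eps = 1e-5
x = rng.normal(size=d)
# difference of two width-N networks, divided by eps, approximates f_NT:
diff = abs((f_NN(x, W + eps * S) - f_NN(x, W)) / eps - f_NT(x))
print(diff)  # O(eps): the NT function lies in the closure of F_NN^{2N}
```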
It is possible to show that the class of neural networks is significantly more expressive than the two linearizations $\mathcal{F}_{RF}^N(W)$, $\mathcal{F}_{NT}^N(W)$; see e.g. [YS19, GMMM19]. In particular, [GMMM19] shows that, if the feature vectors are uniformly random over the $d$-dimensional sphere, and $N, d$ are large with $N = O(d)$, then $\mathcal{F}_{RF}^N(W)$ can only capture linear functions, while $\mathcal{F}_{NT}^N(W)$ can only capture quadratic functions.
Despite these findings, it could still be that the subset of functions for which we can efficiently learn a neural network approximation is well described by $\mathcal{F}_{RF}^N(W)$ and $\mathcal{F}_{NT}^N(W)$. Indeed, several recent papers show that –in a certain highly overparametrized regime– this description is accurate [DZPS18, DLL18, LXS19]. A specific counterexample is given in [YS19]: if the function to be learnt is a single neuron, then gradient descent (in the space of neural networks with $N$ neurons) learns it efficiently [MBM18]; on the other hand, $\mathcal{F}_{RF}^N(W)$ and $\mathcal{F}_{NT}^N(W)$ require a number of neurons exponential in the dimension to achieve vanishing risk.
1.1 Summary of main results
In this paper we explore systematically the gap between $\mathcal{F}_{RF}^N(W)$, $\mathcal{F}_{NT}^N(W)$, and $\mathcal{F}_{NN}^N$, by considering two specific data distributions:
Quadratic functions (qf): feature vectors are distributed according to $x_i \sim N(0, I_d)$ and responses are quadratic functions $y_i = b_0 + \langle x_i, B x_i\rangle$ with $B$ a symmetric matrix.
Mixture of Gaussians (mg): with equal probability $1/2$, $y_i = +1$ and $x_i \sim N(0, \Sigma_+)$, or $y_i = -1$ and $x_i \sim N(0, \Sigma_-)$.
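For concreteness, here is a minimal data-generation sketch for the two models. It is our own illustration: the diagonal exponential $B$ and the parametrization $\Sigma_\pm = I_d \pm \Delta/\sqrt{d}$ with a $\pm 1$ diagonal $\Delta$ are hypothetical choices, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 10, 1000

# (qf): x ~ N(0, I_d),  y = b0 + <x, B x>
B = np.diag(rng.exponential(size=d))       # illustrative psd choice of B
b0 = -np.trace(B)                          # centers y, since E<x, B x> = tr(B)
X_qf = rng.normal(size=(n, d))
y_qf = b0 + np.einsum('ni,ij,nj->n', X_qf, B, X_qf)

# (mg): y = +-1 with probability 1/2,  x | y ~ N(0, Sigma_y)
Delta = np.diag(rng.choice([-1.0, 1.0], size=d))   # hypothetical Delta
Sigma = {+1: np.eye(d) + Delta / np.sqrt(d),       # assumed parametrization
         -1: np.eye(d) - Delta / np.sqrt(d)}
y_mg = rng.choice([-1, 1], size=n)
X_mg = np.stack([rng.multivariate_normal(np.zeros(d), Sigma[yi]) for yi in y_mg])
print(X_qf.shape, X_mg.shape)
```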
Let us emphasize that the choice of quadratic functions in model qf is not arbitrary: in a sense, it is the most favorable case for training. Indeed [GMMM19] proves that (when $N = O(d)$): third- and higher-order polynomials cannot be approximated nontrivially by $\mathcal{F}_{NT}^N(W)$; linear functions are already well approximated within $\mathcal{F}_{RF}^N(W)$. (Note that [GMMM19] considers feature vectors uniformly random over the sphere rather than Gaussian. However, the results of [GMMM19] can be generalized, with certain modifications, to the Gaussian case. Roughly speaking, for Gaussian features, $\mathcal{F}_{NT}^N(W)$ with $N = O(d)$ neurons can represent quadratic functions, and a low-dimensional subspace of higher-order polynomials.)
For clarity, we will first summarize our results for the model qf, and then discuss generalizations to mg. The prediction risk achieved within any of the regimes $\mathcal{F}_{RF}$, $\mathcal{F}_{NT}$, $\mathcal{F}_{NN}$ is defined as the test error of the corresponding fitted model; for $\mathcal{F}_{NN}$, the predictor is the neural network produced by $t$ steps of stochastic gradient descent (SGD), where each sample is used once and the stepsize is set as described in Section 2.3 (see there for a complete definition). Notice that the quantities $R_{RF}(N)$, $R_{NT}(N)$, $R_{NN}(N)$ are random variables because of the random weights, and the additional randomness in SGD.
Our results are summarized by Figure 1, which compares the risk achieved by the three approaches above in the population limit $n \to \infty$, using quadratic activations $\sigma(u) = u^2$. We consider the large-network, high-dimensional regime $N, d \to \infty$, with $N/d \to \psi$. Figure 1 reports the risk achieved by various approaches in numerical simulations, and compares them with our theoretical predictions for each of the three regimes $\mathcal{F}_{RF}$, $\mathcal{F}_{NT}$, and $\mathcal{F}_{NN}$, which are detailed in the next sections.
The agreement between analytical predictions and simulations is excellent but, more importantly, a clear picture emerges. We can highlight a few phenomena that are illustrated in this figure:
Random features do not capture quadratic functions. The random features risk $R_{RF}(N)$ remains generally bounded away from zero for all values of $N/d$. It is further highly dependent on the distribution of the weight vectors $(w_i)_{i \le N}$. Section 2.1 characterizes this dependence explicitly, for general activation functions. For large $N/d$, the optimal distribution of the weight vectors uses a covariance adapted to the target matrix $B$, but even in this case the risk is bounded away from zero unless the features align perfectly with the function to be learned.
The neural tangent model achieves vanishing risk on quadratic functions for $N/d \to \infty$. However, the risk is bounded away from zero if $N/d$ remains bounded. Section 2.2 provides explicit expressions for the minimum risk as a function of $\psi = \lim N/d$. Roughly speaking, $\mathcal{F}_{NT}$ fits the quadratic function along a random subspace determined by the random weight vectors $(w_i)_{i \le N}$. For $N/d \to \infty$, these vectors span the whole space and hence the limiting risk vanishes. For bounded $N/d$, only a fraction of the space is spanned, and not the most important one (i.e. not the principal eigendirections of $B$).
Fully trained neural networks achieve vanishing risk on quadratic functions for $N \ge d$: this is to be expected on the basis of the previous point. For $N < d$ the risk is generally bounded away from $0$, but its value is smaller than for the neural tangent model. Namely, in Section 2.3 we give an explicit expression for the asymptotic risk (holding for $N < d$) implying that, for some $\delta > 0$ (independent of $N$, $d$),
We prove this result by showing convergence of SGD to gradient flow in the population risk, and then proving a strict saddle property for the population risk. As a consequence, the limiting risk on the left-hand side coincides with the minimum risk over the whole space of neural networks $\mathcal{F}_{NN}^N$. We characterize the latter and show that it amounts to fitting $f_*$ along the principal eigendirections of $B$. This mechanism is very different from the one arising in the $\mathcal{F}_{NT}$ regime.
The picture emerging from these findings is remarkably simple. The fully trained network learns the most important eigendirections of the quadratic function and fits them, hence surpassing the $\mathcal{F}_{NT}$ model, which is confined to a random set of directions.
Let us emphasize that the above separation between $\mathcal{F}_{NN}$ and $\mathcal{F}_{NT}$ is established only for $N < d$. It is natural to wonder whether this separation generalizes to wider networks for more complicated classes of functions, or if instead it always vanishes for wide networks. We expect the separation to generalize to larger $N$ by considering higher-order polynomials, instead of quadratic functions. Partial evidence in this direction is provided by [GMMM19]: for third- or higher-order polynomials, $\mathcal{F}_{NT}^N(W)$ does not achieve vanishing risk at any $N = O(d)$. The mechanism unveiled by our analysis of quadratic functions is potentially more general: neural networks are superior to linearized models such as $\mathcal{F}_{RF}^N(W)$ or $\mathcal{F}_{NT}^N(W)$ because they can learn a good representation of the data.
Our results for quadratic functions are formally presented in Section 2. In order to confirm that the picture we obtain is general, we establish similar results for mixtures of Gaussians in Section 3. More precisely, our results for $\mathcal{F}_{RF}$ and $\mathcal{F}_{NT}$ on mixtures of Gaussians are very similar to the quadratic case. In this model, however, we do not prove a convergence result for SGD analogous to (6), although we believe it should be possible by the same approach outlined above. On the other hand, we characterize the minimum prediction risk over neural networks and prove that it is strictly smaller than the minimum achieved by $\mathcal{F}_{RF}$ and $\mathcal{F}_{NT}$. Finally, Section 4 contains background on our numerical experiments.
1.2 Further related work
The connection (and differences) between two-layers neural networks and random features models has been the object of several papers since the original work of Rahimi and Recht [RR08]. An incomplete list of references includes [Bac13, AM15, Bac17a, Bac17b, RR17]. Our analysis contributes to this line of work by establishing a sharp asymptotic characterization, albeit for more specific data distributions. Sharp results have recently been proven in [GMMM19], for the special case of random weights uniformly distributed over a $d$-dimensional sphere. Here we consider the more general case of anisotropic random features with covariance $\Sigma_w$. This clarifies a key reason for the suboptimality of random features: the data representation is not adapted to the target function $f_*$. We focus on the population limit $n \to \infty$. Complementary results characterizing the variance as a function of the sample size are given in [HMRT19].
The neural tangent model (3) is much more recent [JGH18]. Several papers show that SGD optimization within the original neural network is well approximated by optimization within the $\mathcal{F}_{NT}$ model, as long as the number of neurons is large compared to a polynomial in the sample size [DZPS18, DLL18, AZLS18, ZCZG18]. Empirical evidence in the same direction was presented in [LXS19, ADH19].
Chizat and Bach [CB18] clarified that any nonlinear statistical model can be approximated by a linear one in an early (lazy) training regime. The basic argument is quite simple. Given a model $x \mapsto f(x; \theta)$ with parameters $\theta$, we can Taylor-expand around a random initialization $\theta_0$. Setting $\theta = \theta_0 + \varepsilon\beta$, we get
$$f(x; \theta_0 + \varepsilon\beta) \approx f(x; \theta_0) + \varepsilon\, \langle \beta, \nabla_\theta f(x; \theta_0)\rangle \approx \varepsilon\, \langle \beta, \nabla_\theta f(x; \theta_0)\rangle.$$
Here the second approximation holds since, for many random initializations, $f(x; \theta_0) \approx 0$ because of random cancellations. The resulting model is linear, with random features $\nabla_\theta f(x; \theta_0)$.
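The linearization above can be verified numerically. In the sketch below (our own illustration; the $\tanh$ activation and the $\pm 1/\sqrt{N}$ second-layer initialization are arbitrary choices), the first-order Taylor expansion in the first-layer weights matches the perturbed network up to second order in $\varepsilon$:

```python
import numpy as np

rng = np.random.default_rng(1)
d, N = 5, 8
sigma, dsigma = np.tanh, lambda u: 1 - np.tanh(u) ** 2

def f(x, W, a):
    return a @ sigma(W @ x)

W0 = rng.normal(size=(N, d)) / np.sqrt(d)
a0 = rng.choice([-1.0, 1.0], size=N) / np.sqrt(N)  # random signs: f(x; theta0) is small
x = rng.normal(size=d)

eps = 1e-4
dW = rng.normal(size=(N, d))                # perturbation direction beta
grad_W = np.outer(a0 * dsigma(W0 @ x), x)   # gradient of f w.r.t. W at theta0
lhs = f(x, W0 + eps * dW, a0)               # perturbed network
rhs = f(x, W0, a0) + eps * np.sum(dW * grad_W)  # first-order expansion
gap = abs(lhs - rhs)
print(gap)  # second order in eps
```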
Our objective is complementary to this literature: we prove that $\mathcal{F}_{RF}^N(W)$ and $\mathcal{F}_{NT}^N(W)$ have limited approximation power, and that a significant gain can be achieved by full training.
Finally, our analysis of fully trained networks connects to the ample literature on non-convex statistical estimation. For two-layers neural networks with quadratic activations, Soltanolkotabi, Javanmard and Lee [SJL19] showed that, as long as the number of neurons is sufficiently large, there are no spurious local minimizers. Du and Lee [DL18] showed that the same holds as long as $N \ge \sqrt{2n}$, where $n$ is the sample size. Zhong et al. [ZSJ17] established local convexity properties around global optima. Further related landscape results include [GLM17, HYV14, GJZ17].
2 Main results: quadratic functions
As mentioned in the previous section, our results for quadratic functions (qf) assume $x_i \sim N(0, I_d)$ and $y_i = f_*(x_i)$, where
$$f_*(x) = b_0 + \langle x, B x\rangle. \qquad (8)$$
2.1 Random features
We consider the random features model with first-layer weights $(w_i)_{i \le N} \sim_{iid} N(0, \Sigma_w)$. We make the following assumptions:
A1. The activation function $\sigma$ verifies $|\sigma(u)| \le c_0\, e^{c_1 |u|}$ for some constants $c_0, c_1 > 0$. Further, it is nonlinear (i.e. there are no constants $a, b$ such that $\sigma(u) = a + b\,u$ almost everywhere).
A2. We fix the weights' normalization by requiring $\operatorname{Tr}(\Sigma_w) = 1$. We assume the operator norm to satisfy $\|\Sigma_w\|_{op} \le c_0/d$ for some constant $c_0$, and that the empirical spectral distribution of $d\,\Sigma_w$ converges weakly, as $d \to \infty$, to a probability distribution over $\mathbb{R}_{\ge 0}$.
Theorem 1. Let $f_*$ be a quadratic function as per Eq. (8), with $B$ symmetric. Assume conditions A1 and A2 to hold. Denote by $\mu_k$ the $k$-th Hermite coefficient of $\sigma$ and assume $\mu_2 \neq 0$. Define $\psi \equiv \lim_{d \to \infty} N/d$. Finally, let the scalar parameter entering Eq. (10) below be the unique solution of
Then, the following holds as $N, d \to \infty$ with $N/d \to \psi$:
Moreover, assuming the relevant spectral quantities to have a limit as $d \to \infty$, (10) simplifies as follows for $\psi \to \infty$:
Notice that this is the risk normalized by the risk of the trivial predictor $\hat f(x) = \mathbb{E}[y]$. The asymptotic result in (11) is remarkably simple. By Cauchy–Schwarz, the normalized risk is bounded away from zero even as the number of neurons per dimension diverges ($\psi \to \infty$), unless the weights' covariance $\Sigma_w$ is perfectly aligned with the function to be learned. For isotropic random features, the right-hand side of Eq. (11) depends on $B$ only through the ratio $\operatorname{Tr}(B)^2/(d\,\|B\|_F^2)$. In particular, $\mathcal{F}_{RF}$ performs very poorly when this ratio is small, and no better than the trivial predictor if $\operatorname{Tr}(B) = 0$.
Notice that the above result applies to quite general activation functions. The formulas simplify significantly for quadratic activations.
Under the assumptions of Theorem 1, further assume $\sigma(u) = u^2$. Then we have, as $N, d \to \infty$ with $N/d \to \psi$:
2.2 Neural tangent
For the $\mathcal{F}_{NT}$ regime, we focus on quadratic activations $\sigma(u) = u^2$ and isotropic weights, $\Sigma_w \propto I_d$.
Theorem 2. Let $f_*$ be a quadratic function as per Eq. (8), with $B$ symmetric, and assume $\sigma(u) = u^2$. Then, we have for $N, d \to \infty$ with $N/d \to \psi$:
where the expectation is taken over the random weights $W = (w_1, \dots, w_N)$.
As for the case of random features, the risk depends on the target function only through the ratio $\operatorname{Tr}(B)^2/(d\,\|B\|_F^2)$. However, the normalized risk is always smaller than the random features baseline. Note that, by Cauchy–Schwarz, $\operatorname{Tr}(B)^2 \le d\,\|B\|_F^2$, with this worst case achieved when $B$ is proportional to the identity. In particular, the normalized risk vanishes asymptotically for $\psi \to \infty$. This comes at the price of a larger number of parameters to be fitted, namely $Nd$ instead of $N$.
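The mechanism behind this result — $\mathcal{F}_{NT}$ fits the quadratic form only along the random subspace spanned by the weights — can be illustrated directly. For $\sigma(u) = u^2$, one can check (our own derivation, used here only as an illustration) that the NT-representable quadratic forms are exactly the symmetric matrices $M$ with $\mathsf{P}^\perp M \mathsf{P}^\perp = 0$, where $\mathsf{P}$ projects onto $\operatorname{span}(w_1, \dots, w_N)$; the unfittable part of $B$ is then $\mathsf{P}^\perp B \mathsf{P}^\perp$. The sketch below (arbitrary diagonal $B$) computes the resulting normalized residual:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20
B = np.diag(rng.exponential(size=d))      # illustrative diagonal target

def nt_residual(N):
    # NT functions with sigma(u) = u^2 are x -> x^T M x with M = sym(S^T W):
    # exactly the symmetric M vanishing on the orthocomplement of span{w_i}.
    W = rng.normal(size=(N, d))
    Q, _ = np.linalg.qr(W.T)              # orthonormal basis of span{w_i}
    Pp = np.eye(d) - Q @ Q.T              # projector onto the orthocomplement
    return np.linalg.norm(Pp @ B @ Pp, 'fro') ** 2 / np.linalg.norm(B, 'fro') ** 2

r_small, r_full = nt_residual(5), nt_residual(25)
print(r_small, r_full)  # positive for N < d, zero (numerically) for N >= d
```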
2.3 Neural network
For the analysis of SGD-trained neural networks, we assume $f_*$ to be a quadratic function as per Eq. (8), but we will now restrict to the positive semidefinite case $B \succeq 0$. We consider quadratic activations $\sigma(u) = u^2$, and we fix the second-layer weights to be $1$:
$$\hat f(x; W, b) = b + \sum_{i=1}^N \langle w_i, x\rangle^2.$$
Notice that we use an explicit offset $b$ to account for the mismatch in means between the target and the sum of squares. It is useful to introduce the population risk, as a function of the network parameters $(W, b)$:
Here expectation is with respect to $x \sim N(0, I_d)$. We will study a one-pass version of SGD, whereby at each iteration $k$ we perform a stochastic gradient step with respect to a fresh sample $(y_k, x_k)$:
Notice that this is the risk with respect to a new sample, independent from the ones used to train the network: it is the test error. Also notice that $k$ is the number of SGD steps but also (because of the one-pass assumption) the sample size. Our next theorem characterizes the asymptotic risk achieved by SGD. This prediction is reported in Figure 1.
Theorem 3. Let $f_*$ be a quadratic function as per Eq. (8), with $B \succeq 0$. Consider SGD with initialization $W_0$ whose distribution is absolutely continuous with respect to the Lebesgue measure. Consider the test prediction error after $k$ SGD steps with step size $\varepsilon$.
Then we have (probability is over the initialization and the samples)
where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$ are the ordered eigenvalues of $B$.
The proof of this theorem depends on the following proposition concerning the landscape of the population risk, which is of independent interest.
Let $f_*$ be a quadratic function as per Eq. (8), with $B \succeq 0$. For any sub-level set of the risk function $R$, there exist constants $\varepsilon_0, \delta_0 > 0$ such that $R$ is $(\varepsilon_0, \delta_0)$-strict saddle in that region. Namely, for any $W$ in the region with $\|\nabla R(W)\|_2 \le \varepsilon_0$, we have $\lambda_{\min}(\nabla^2 R(W)) \le -\delta_0$.
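A minimal simulation of the one-pass SGD dynamics analyzed in Theorem 3 is sketched below. This is our own illustration, not the paper's code: the dimensions, step size, iteration count, and the diagonal exponential $B$ are arbitrary, and the risk is estimated by a crude Monte Carlo average.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, lr, steps = 8, 4, 5e-4, 200_000
B = np.diag(rng.exponential(size=d))       # psd target: y = <x, B x>

def risk(W, b, n=20_000):
    # Monte Carlo estimate of the population (test) risk
    X = rng.normal(size=(n, d))
    y = np.einsum('ni,ij,nj->n', X, B, X)
    pred = ((X @ W.T) ** 2).sum(axis=1) + b
    return np.mean((pred - y) ** 2)

W, b = 0.1 * rng.normal(size=(N, d)), 0.0  # absolutely continuous initialization
risk_start = risk(W, b)
for _ in range(steps):
    x = rng.normal(size=d)                 # one-pass: fresh sample each step
    u = W @ x
    err = (u @ u + b) - x @ B @ x
    W -= lr * err * 2.0 * np.outer(u, x)   # gradient of (pred - y)^2 / 2 in W
    b -= lr * err
risk_end = risk(W, b)
print(risk_start, risk_end)
```

With $N < d$, the limiting risk is governed by the $d - N$ smallest eigenvalues of $B$: the network fits the principal eigendirections, consistent with the theorem.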
3 Main results: mixture of Gaussians
In this section, we consider the mixture of Gaussians setting (mg): with equal probability $1/2$, $y_i = +1$ and $x_i \sim N(0, \Sigma_+)$, or $y_i = -1$ and $x_i \sim N(0, \Sigma_-)$. We parametrize the covariances as $\Sigma_+ = \Sigma + \Delta$ and $\Sigma_- = \Sigma - \Delta$, and will make the following assumptions:
M1. There exist constants $0 < c_1 \le c_2 < \infty$ such that $c_1 I_d \preceq \Sigma \preceq c_2 I_d$;
The scaling in assumption M2 ensures the signal-to-noise ratio to be of order one. If the eigenvalues of $\Delta$ are much larger than in M2, then it is easy to distinguish the two classes with high probability (they are asymptotically mutually singular). If they are much smaller, then no non-trivial classifier exists.
We will denote by $\mathbb{P}_{mg}$ the joint distribution of $(y, x)$ under the (mg) model, and by $\mathbb{E}_{mg}$ the corresponding expectation. The minimum prediction risk within any of the regimes $\mathcal{F}_{RF}$, $\mathcal{F}_{NT}$, $\mathcal{F}_{NN}$ is defined by
As mentioned in the introduction, the picture emerging from our analysis of the mg model is aligned with the results obtained in the previous section. We will limit ourselves to stating the results without repeating comments that were made above. Our results are compared with simulations in Figure 2. Notice that, in this case, the Bayes error (MMSE) is not achieved even for very wide networks, either by $\mathcal{F}_{RF}$ or by $\mathcal{F}_{NT}$.
3.1 Random features
As in the previous section, we generate random first-layer weights $(w_i)_{i \le N} \sim_{iid} N(0, \Sigma_w)$. We consider a general activation function $\sigma$ satisfying condition A1. We make the following assumption on the weights' covariance:
B2. We fix the weights' normalization by requiring $\operatorname{Tr}(\Sigma_w) = 1$. We assume that there exists a constant $c_0$ such that $\|\Sigma_w\|_{op} \le c_0/d$, and that the empirical spectral distribution of $d\,\Sigma_w$ converges weakly, as $d \to \infty$, to a probability distribution over $\mathbb{R}_{\ge 0}$.
Consider the mg distribution, with $\Sigma$ and $\Delta$ satisfying conditions M1 and M2. Assume conditions A1 and B2 to hold. Define $\mu_k$ to be the $k$-th Hermite coefficient of $\sigma$, and assume without loss of generality $\mu_0 = 0$. Define $\psi \equiv \lim_{d \to \infty} N/d$. Finally, let the scalar parameter entering the formulas below be the unique solution of
Then, the following holds as $N, d \to \infty$ with $N/d \to \psi$:
Moreover, assume the relevant spectral quantities to have limits as $d \to \infty$. Then the following holds as $\psi \to \infty$:
3.2 Neural tangent
For the $\mathcal{F}_{NT}$ model, we first state our theorem for general $\Sigma$ and $\Delta$, and then give an explicit concentration result in the case $\Sigma = I_d$ and isotropic weights.
Let $(y, x)$ follow the mixture of Gaussians distribution, with $\Sigma$ and $\Delta$ satisfying conditions M1 and M2. Further assume $\sigma(u) = u^2$. Then, the following holds for almost every $W$ (with respect to the Lebesgue measure):
where $\mathsf{P}^\perp$ is the projection perpendicular to the span of the weight vectors $(w_i)_{i \le N}$.
Assuming further that $\Sigma = I_d$ and isotropic weights, we have as $N, d \to \infty$ with $N/d \to \psi$:
In particular, for $\psi \to \infty$, we have (for almost every $W$)
3.3 Neural network
We consider quadratic activations with a general offset $b$ and second-layer coefficients $(a_i)_{i \le N}$. The risk is optimized over the offset, the second-layer coefficients, and the first-layer weights $W$.
Let $(y, x)$ follow the mixture of Gaussians distribution, with $\Sigma$ and $\Delta$ satisfying conditions M1 and M2. Then, the following holds
where $s_1 \ge s_2 \ge \cdots \ge s_d$ are the ordered singular values of $\Delta$. In particular, for $N \ge d$, we have
Let us emphasize that, for this setting, we do not have a convergence result for SGD as for the model qf, cf. Theorem 3. However, because of certain analogies between the two models, we expect a similar result to hold for mixtures of Gaussians.
We recover a similar behavior as in the case of the (qf) model: $\mathcal{F}_{NN}$ learns the most important directions of $\Delta$, while $\mathcal{F}_{RF}$ and $\mathcal{F}_{NT}$ do not. Note that the Bayes error is not achieved in this model.
4 Numerical experiments
$\mathcal{F}_{NN}$ models are trained with SGD in TensorFlow [ABC16], with a fixed total number of SGD steps for each (qf) model and for each (mg) model. The SGD batch size is fixed, and the step size is chosen from a grid; the hyper-parameter that achieves the best fit is used for the figures. $\mathcal{F}_{RF}$ and $\mathcal{F}_{NT}$ models are fitted directly by solving the KKT conditions on the training sample. After fitting the model, the test error is evaluated on fresh samples. In our figures, each data point corresponds to the test error averaged over several models with independent realizations of the random weights $W$.
For (qf) experiments, we choose $B$ to be diagonal, with diagonal elements drawn i.i.d. from an exponential distribution. For (mg) experiments, $\Delta$ is also diagonal, with diagonal elements chosen uniformly at random from a fixed finite set.
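As an illustration of the fitting procedure for the linearized models (our own sketch, not the paper's code: the ReLU activation, dimensions, and sample sizes are arbitrary choices), an $\mathcal{F}_{RF}$ model on (qf) data can be fitted by least squares:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, n, n_test = 20, 40, 4000, 2000
B = np.diag(rng.exponential(size=d))       # diagonal exponential target

def sample(m):
    X = rng.normal(size=(m, d))
    y = np.einsum('ni,ij,nj->n', X, B, X) - np.trace(B)   # centered responses
    return X, y

W = rng.normal(size=(N, d)) / np.sqrt(d)       # random first-layer weights
features = lambda X: np.maximum(X @ W.T, 0.0)  # RF map with ReLU activation

X, y = sample(n)
a, *_ = np.linalg.lstsq(features(X), y, rcond=None)  # least-squares fit of a
Xt, yt = sample(n_test)
test_risk = np.mean((features(Xt) @ a - yt) ** 2) / np.mean(yt ** 2)
print(test_risk)  # normalized test risk stays bounded away from zero
```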
This work was partially supported by grants NSF DMS-1613091, CCF-1714305, IIS-1741162, and ONR N00014-18-1-2729, NSF DMS-1418362, NSF DMS-1407813.
- [ABC16] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al., Tensorflow: A system for large-scale machine learning, 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265–283.
- [ADH19] Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, and Ruosong Wang, On exact computation with an infinitely wide neural net, arXiv:1904.11955 (2019).
- [AM15] Ahmed El Alaoui and Michael W Mahoney, Fast randomized kernel ridge regression with statistical guarantees, Advances in Neural Information Processing Systems, 2015, pp. 775–783.
- [AZLS18] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song, A convergence theory for deep learning via over-parameterization, arXiv:1811.03962 (2018).
- [Bac13] Francis Bach, Sharp analysis of low-rank kernel matrix approximations, Conference on Learning Theory, 2013, pp. 185–209.
- [Bac17a] , Breaking the curse of dimensionality with convex neural networks, The Journal of Machine Learning Research 18 (2017), no. 1, 629–681.
- [Bac17b] , On the equivalence between kernel quadrature rules and random feature expansions, The Journal of Machine Learning Research 18 (2017), no. 1, 714–751.
- [BLM13] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart, Concentration inequalities: A nonasymptotic theory of independence, Oxford university press, 2013.
- [BS10] Zhidong Bai and Jack W Silverstein, Spectral analysis of large dimensional random matrices, vol. 20, Springer, 2010.
- [CB18] Lenaic Chizat and Francis Bach, A note on lazy training in supervised differentiable programming, arXiv:1812.07956 (2018).
- [Cyb89] George Cybenko, Approximation by superpositions of a sigmoidal function, Mathematics of Control, Signals and Systems 2 (1989), no. 4, 303–314.
- [DL18] Simon S Du and Jason D Lee, On the power of over-parametrization in neural networks with quadratic activation, arXiv:1803.01206 (2018).
- [DLL18] Simon S Du, Jason D Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai, Gradient descent finds global minima of deep neural networks, arXiv:1811.03804 (2018).
- [DZPS18] Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh, Gradient descent provably optimizes over-parameterized neural networks, arXiv:1810.02054 (2018).
- [EK10] Noureddine El Karoui et al., The spectrum of kernel random matrices, The Annals of Statistics 38 (2010), no. 1, 1–50.
- [EY36] Carl Eckart and Gale Young, The approximation of one matrix by another of lower rank, Psychometrika 1 (1936), no. 3, 211–218.
- [GJZ17] Rong Ge, Chi Jin, and Yi Zheng, No spurious local minima in nonconvex low rank problems: A unified geometric analysis, Proceedings of the 34th International Conference on Machine Learning-Volume 70, JMLR. org, 2017, pp. 1233–1242.
- [GLM17] Rong Ge, Jason D Lee, and Tengyu Ma, Learning one-hidden-layer neural networks with landscape design, arXiv:1711.00501 (2017).
- [GMMM19] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari, Linearized two-layers neural networks in high dimension, arXiv:1904.12191 (2019).
- [HMRT19] Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani, Surprises in high-dimensional ridgeless least squares interpolation, arXiv:1903.08560 (2019).
- [HYV14] Benjamin Haeffele, Eric Young, and Rene Vidal, Structured low-rank matrix factorization: Optimality, algorithm, and applications to image processing, International conference on machine learning, 2014, pp. 2007–2015.
- [JGH18] Arthur Jacot, Franck Gabriel, and Clément Hongler, Neural tangent kernel: Convergence and generalization in neural networks, Advances in neural information processing systems, 2018, pp. 8571–8580.
- [KK14] Adam Klivans and Pravesh Kothari, Embedding hard learning problems into Gaussian space, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2014), Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2014.
- [Kur70] Thomas G Kurtz, Solutions of ordinary differential equations as limits of pure jump markov processes, Journal of Applied Probability 7 (1970), no. 1, 49–58.
- [Led01] Michel Ledoux, The concentration of measure phenomenon, no. 89, American Mathematical Soc., 2001.
- [Loj82] S Lojasiewicz, Sur les trajectoires du gradient d’une fonction analytique, Seminari di geometria 1983 (1982), 115–117.
- [LXS19] Jaehoon Lee, Lechao Xiao, Samuel S Schoenholz, Yasaman Bahri, Jascha Sohl-Dickstein, and Jeffrey Pennington, Wide neural networks of any depth evolve as linear models under gradient descent, arXiv:1902.06720 (2019).
- [MBM18] Song Mei, Yu Bai, and Andrea Montanari, The landscape of empirical risk for nonconvex losses, The Annals of Statistics 46 (2018), no. 6A, 2747–2774.
- [PP16] Ioannis Panageas and Georgios Piliouras, Gradient descent only converges to minimizers: Non-isolated critical points and invariant regions, arXiv:1605.00405 (2016).
- [RR08] Ali Rahimi and Benjamin Recht, Random features for large-scale kernel machines, Advances in neural information processing systems, 2008, pp. 1177–1184.
- [RR17] Alessandro Rudi and Lorenzo Rosasco, Generalization properties of learning with random features, Advances in Neural Information Processing Systems, 2017, pp. 3215–3225.
- [Sha18] Ohad Shamir, Distribution-specific hardness of learning neural networks, The Journal of Machine Learning Research 19 (2018), no. 1, 1135–1163.
- [SJL19] Mahdi Soltanolkotabi, Adel Javanmard, and Jason D Lee, Theoretical insights into the optimization landscape of over-parameterized shallow neural networks, IEEE Transactions on Information Theory 65 (2019), no. 2, 742–769.
- [Ver10] Roman Vershynin, Introduction to the non-asymptotic analysis of random matrices, arXiv:1011.3027 (2010).
- [YS19] Gilad Yehudai and Ohad Shamir, On the power and limitations of random features for understanding neural networks, arXiv:1904.00687 (2019).
- [ZCZG18] Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu, Stochastic gradient descent optimizes over-parameterized deep relu networks, arXiv:1811.08888 (2018).
- [ZSJ17] Kai Zhong, Zhao Song, Prateek Jain, Peter L Bartlett, and Inderjit S Dhillon, Recovery guarantees for one-hidden-layer neural networks, Proceedings of the 34th International Conference on Machine Learning-Volume 70, JMLR. org, 2017, pp. 4140–4149.
Appendix A Technical background
A.1 Hermite polynomials
The Hermite polynomials $\{He_k\}_{k \ge 0}$ form an orthogonal basis of $L^2(\mathbb{R}, \gamma)$, where $\gamma(dx) = e^{-x^2/2}\,dx/\sqrt{2\pi}$ is the standard Gaussian measure, and $He_k$ has degree $k$. We will follow the classical normalization (here and below, expectation is with respect to $G \sim N(0, 1)$):
$$\mathbb{E}\{He_j(G)\, He_k(G)\} = k!\, \delta_{jk}.$$
As a consequence, for any function $g \in L^2(\mathbb{R}, \gamma)$, we have the decomposition
$$g(x) = \sum_{k=0}^{\infty} \frac{\mu_k(g)}{k!}\, He_k(x), \qquad \mu_k(g) \equiv \mathbb{E}\{g(G)\, He_k(G)\}.$$
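Under this normalization, the coefficients $\mu_k(g)$ can be computed numerically with Gauss–HermiteE quadrature (weight $e^{-x^2/2}$). A small sketch, using NumPy's probabilists' Hermite module:

```python
import numpy as np
from numpy.polynomial import hermite_e as He
from math import sqrt, pi

def hermite_coeff(g, k, deg=60):
    # mu_k(g) = E[g(G) He_k(G)], G ~ N(0,1), probabilists' Hermite He_k
    x, w = He.hermegauss(deg)              # nodes/weights, weight exp(-x^2/2)
    Hk = He.hermeval(x, [0] * k + [1])     # evaluate He_k at the nodes
    return (w * g(x) * Hk).sum() / sqrt(2 * pi)

# sanity checks: He_2(x) = x^2 - 1, so for g(x) = x^2:
#   mu_0 = E[G^2] = 1,  mu_2 = E[G^2 (G^2 - 1)] = 3 - 1 = 2
mu0 = hermite_coeff(lambda x: x ** 2, 0)
mu2 = hermite_coeff(lambda x: x ** 2, 2)
print(mu0, mu2)  # -> 1.0, 2.0 (up to quadrature precision)
```

The quadrature is exact for polynomial integrands of degree up to $2\,\mathrm{deg} - 1$, so low-order coefficients of polynomial activations are computed exactly up to rounding.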
Throughout the proofs, $O_d(\,\cdot\,)$ (resp. $o_d(\,\cdot\,)$) denotes the standard big-O (resp. little-o) notation, where the subscript $d$ emphasizes the asymptotic variable. We denote by $O_{d,\mathbb{P}}(\,\cdot\,)$ (resp. $o_{d,\mathbb{P}}(\,\cdot\,)$) the big-O (resp. little-o) in probability notation: $h_1(d) = O_{d,\mathbb{P}}(h_2(d))$ if for any $\varepsilon > 0$, there exist $C_\varepsilon > 0$ and $d_\varepsilon \in \mathbb{Z}_{>0}$, such that
$$\mathbb{P}\big(|h_1(d)/h_2(d)| > C_\varepsilon\big) \le \varepsilon, \qquad \forall d \ge d_\varepsilon,$$
and respectively: $h_1(d) = o_{d,\mathbb{P}}(h_2(d))$ if $h_1(d)/h_2(d)$ converges to $0$ in probability. We will occasionally hide logarithmic factors using the notation $\tilde O_d(\,\cdot\,)$ (resp. $\tilde o_d(\,\cdot\,)$): $h_1(d) = \tilde O_d(h_2(d))$ if there exists a constant $C$ such that $h_1(d) \le C (\log d)^C\, h_2(d)$. Similarly, we will denote $\tilde O_{d,\mathbb{P}}(\,\cdot\,)$ (resp. $\tilde o_{d,\mathbb{P}}(\,\cdot\,)$) when considering the big-O in probability notation up to logarithmic factors.
Appendix B Proofs for quadratic functions
Our results for quadratic functions (qf) assume $x_i \sim N(0, I_d)$ and $y_i = f_*(x_i)$, where
$$f_*(x) = b_0 + \langle x, B x\rangle.$$
Throughout this section, we will denote by $\mathbb{E}_x$ the expectation operator with respect to $x \sim N(0, I_d)$, and by $\mathbb{E}_w$ the expectation operator with respect to $w \sim N(0, \Sigma_w)$.
B.1 Random features model: proof of Theorem 1
Recall the definition
Note that it is easy to see from the proof th