 # Limitations of Lazy Training of Two-layers Neural Networks

We study the supervised learning problem under either of the following two models: (1) Feature vectors x_i are d-dimensional Gaussians and responses are y_i = f_*( x_i) for f_* an unknown quadratic function; (2) Feature vectors x_i are distributed as a mixture of two d-dimensional centered Gaussians, and y_i's are the corresponding class labels. We use two-layers neural networks with quadratic activations, and compare three different learning regimes: the random features (RF) regime in which we only train the second-layer weights; the neural tangent (NT) regime in which we train a linearization of the neural network around its initialization; the fully trained neural network (NN) regime in which we train all the weights in the network. We prove that, even for the simple quadratic model of point (1), there is a potentially unbounded gap between the prediction risk achieved in these three training regimes, when the number of neurons is smaller than the ambient dimension. When the number of neurons is larger than the number of dimensions, the problem is significantly easier and both NT and NN learning achieve zero risk.


## 1 Introduction

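Consider the supervised learning problem in which we are given i.i.d. data $\{(x_i, y_i)\}_{i \le n}$, where $x_i \in \mathbb{R}^d$ is a feature vector and $y_i = f_*(x_i)$ is the corresponding response. (For simplicity, we focus our introductory discussion on the case in which the response $y_i$ is a noiseless function of the feature vector $x_i$: some of our results go beyond this setting.) We would like to learn the unknown function $f_*$ so as to minimize the prediction risk $\mathbb{E}\{(f_*(x) - f(x))^2\}$. We will assume throughout $f_* \in L^2$, i.e. $\mathbb{E}\{f_*(x)^2\} < \infty$.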

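The function class of two-layers neural networks (with $N$ neurons) is defined by:

$$\mathcal{F}_{\mathrm{NN},N} = \Big\{ f(x) = c + \sum_{i=1}^{N} a_i\, \sigma(\langle w_i, x\rangle) :\; c, a_i \in \mathbb{R},\; w_i \in \mathbb{R}^d,\; i \in [N] \Big\}. \tag{1}$$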

Classical universal approximation results [Cyb89] imply that any can be approximated arbitrarily well by an element in (under mild conditions). At the same time, we know that such an approximation can be constructed in polynomial time only for a subset of functions . Namely, there exist sets of functions for which no algorithm can construct a good approximation in in polynomial time [KK14, Sha18], even having access to the full distribution (under certain complexity-theoretic assumptions).

These facts lead to the following central question in neural network theory:

For which subset of function can a neural network approximation be learnt efficiently?

Here ‘efficiently’ can be formalized in multiple ways: in this paper we will focus on learning via stochastic gradient descent.

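A significant amount of work has been devoted to two subclasses of $\mathcal{F}_{\mathrm{NN},N}$, which we will refer to as the random feature model ($\mathrm{RF}$) [RR08] and the neural tangent model ($\mathrm{NT}$) [JGH18]:

$$\mathcal{F}_{\mathrm{RF},N}(W) = \Big\{ f_N(x) = \sum_{i=1}^{N} a_i\, \sigma(\langle w_i, x\rangle) :\; a_i \in \mathbb{R},\; i \in [N] \Big\}, \tag{2}$$

$$\mathcal{F}_{\mathrm{NT},N}(W) = \Big\{ f_N(x) = c + \sum_{i=1}^{N} \sigma'(\langle w_i, x\rangle)\, \langle a_i, x\rangle :\; c \in \mathbb{R},\; a_i \in \mathbb{R}^d,\; i \in [N] \Big\}. \tag{3}$$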

Here $W = (w_1, \dots, w_N)$ are weights which are not optimized and instead drawn at random. Throughout this paper, we will assume $(w_i)_{i \le N} \sim_{\mathrm{iid}} \mathsf{N}(0, \Gamma)$. (Notice that we do not add an offset in the $\mathrm{RF}$ model, and will limit ourselves to target functions that are centered: this choice simplifies some calculations without modifying the results.)

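We can think of $\mathrm{RF}$ and $\mathrm{NT}$ as tractable inner bounds of the class of neural networks $\mathrm{NN}$:

• Tractable. Both $\mathcal{F}_{\mathrm{RF},N}(W)$ and $\mathcal{F}_{\mathrm{NT},N}(W)$ are finite-dimensional linear spaces, and minimizing the empirical risk over these classes can be performed efficiently.

• Inner bounds. Indeed $\mathcal{F}_{\mathrm{RF},N}(W) \subseteq \mathcal{F}_{\mathrm{NN},N}$: the random feature model is simply obtained by fixing all the first-layer weights. Further, $\mathcal{F}_{\mathrm{NT},N}(W) \subseteq \mathrm{cl}(\mathcal{F}_{\mathrm{NN},2N})$ (the closure of the class of neural networks with $2N$ neurons). This follows from $\varepsilon^{-1}\big[\sigma(\langle w_i + \varepsilon a_i, x\rangle) - \sigma(\langle w_i, x\rangle)\big] = \langle a_i, x\rangle\, \sigma'(\langle w_i, x\rangle) + o(1)$ as $\varepsilon \to 0$.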
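To make the tractability point concrete, here is a minimal NumPy sketch (not the authors' code) that builds the RF and NT feature maps of Eqs. (2) and (3) for the quadratic activation $\sigma(u) = u^2$ and fits each class by ordinary least squares. The isotropic Gaussian weights, the sizes `d`, `N`, `n`, the centered target, and all function names are assumptions chosen for the example.

```python
import numpy as np

def rf_features(X, W):
    """RF features sigma(<w_i, x>) with sigma(u) = u^2. X is (n, d), W is (N, d)."""
    return (X @ W.T) ** 2

def nt_features(X, W):
    """NT features: a constant column (offset c) plus sigma'(<w_i, x>) * x_j, sigma'(u) = 2u."""
    n = X.shape[0]
    pre = 2.0 * (X @ W.T)                      # shape (n, N)
    feats = pre[:, :, None] * X[:, None, :]    # shape (n, N, d)
    return np.hstack([np.ones((n, 1)), feats.reshape(n, -1)])

def fit_and_test(feature_map, W, X_train, y_train, X_test, y_test):
    """Least-squares fit of the linear coefficients, then mean squared test error."""
    coef, *_ = np.linalg.lstsq(feature_map(X_train, W), y_train, rcond=1e-10)
    pred = feature_map(X_test, W) @ coef
    return float(np.mean((y_test - pred) ** 2))

rng = np.random.default_rng(0)
d, N, n = 5, 5, 400                            # assumed toy sizes with N >= d

B = rng.standard_normal((d, d))
B = (B + B.T) / 2                              # symmetric matrix of the quadratic form

def f_star(X):
    """Centered quadratic target <x, B x> - Tr(B)."""
    return np.einsum("ni,ij,nj->n", X, B, X) - np.trace(B)

W = rng.standard_normal((N, d)) / np.sqrt(d)   # assumed isotropic random weights
X_train = rng.standard_normal((n, d))
X_test = rng.standard_normal((n, d))

rf_err = fit_and_test(rf_features, W, X_train, f_star(X_train), X_test, f_star(X_test))
nt_err = fit_and_test(nt_features, W, X_train, f_star(X_train), X_test, f_star(X_test))
```

With $N \ge d$ the NT class contains every quadratic function, so `nt_err` is zero up to round-off, while the RF class has only $N$ free coefficients and leaves a visible residual.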

It is possible to show that the class of neural networks is significantly more expressive than the two linearization , , see e.g. [YS19, GMMM19]. In particular, [GMMM19] shows that, if the feature vectors are uniformly random over the -dimensional sphere, and are large with , then can only capture linear functions, while can only capture quadratic functions.

Despite these findings, it could still be that the subset of functions for which we can learn efficiently a neural network approximation is well described by and . Indeed, several recent papers show that –in a certain highly overparametrized regime– this description is accurate [DZPS18, DLL18, LXS19]. A specific counterexample is given in [YS19]: if the function to be learnt is a single neuron then gradient descent (in the space of neural networks with neurons) efficiently learns it [MBM18]; on the other hand, or require a number of neurons exponential in the dimension to achieve vanishing risk.

### 1.1 Summary of main results

In this paper we explore systematically the gap between , and , by considering two specific data distributions:

1. Quadratic functions: feature vectors are distributed according to and responses are quadratic functions with .

2. Mixture of Gaussians: with equal probability , and , .

Let us emphasize that the choice of quadratic functions in model qf is not arbitrary: in a sense, it is the most favorable case for training. Indeed [GMMM19] proves that (when ): Third- and higher-order polynomials cannot be approximated nontrivially by ; Linear functions are already well approximated within . (Note that [GMMM19] considers feature vectors uniformly random over the sphere rather than Gaussian. However, the results of [GMMM19] can be generalized, with certain modifications, to the Gaussian case. Roughly speaking, for Gaussian features, with neurons can represent quadratic functions, and a low-dimensional subspace of higher order polynomials.)

For clarity, we will first summarize our result for the model qf, and then discuss generalizations to mg. The prediction risk achieved within any of the regimes $\mathrm{RF}$, $\mathrm{NT}$, $\mathrm{NN}$ is defined by

$$R_{M,N}(f_*) = \inf_{\hat f \in \mathcal{F}_{M,N}(W)} \mathbb{E}\big\{(f_*(x) - \hat f(x))^2\big\}, \qquad M \in \{\mathrm{RF}, \mathrm{NT}, \mathrm{NN}\}, \tag{4}$$

$$R_{\mathrm{NN},N}(f_*; \ell, \varepsilon) = \mathbb{E}\big\{(f_*(x) - \hat f_{\mathrm{SGD}}(x; \ell, \varepsilon))^2\big\}, \tag{5}$$

where $\hat f_{\mathrm{SGD}}(\,\cdot\,; \ell, \varepsilon)$ is the neural network produced by $\ell$ steps of stochastic gradient descent (SGD) where each sample is used once, and the stepsize is set to $\varepsilon$ (see Section 2.3 for a complete definition). Notice that the quantities $R_{\mathrm{RF},N}(f_*)$, $R_{\mathrm{NT},N}(f_*)$, $R_{\mathrm{NN},N}(f_*; \ell, \varepsilon)$ are random variables because of the random weights $W$, and the additional randomness in SGD.

Figure 1: Left frame: Prediction (test) error of two-layer neural networks in fitting a quadratic function in $d = 450$ dimensions, as a function of the number of neurons $N$. We consider the large sample (population) limit $n \to \infty$ and compare three training regimes: random features (RF), neural tangent (NT), and fully trained neural networks (NN). Lines are analytical predictions obtained in this paper, and dots are empirical results. Right frame: Evolution of the risk for NT and NN with the number of samples. Dashed lines are our analytic prediction for the large $n$ limit.

Our results are summarized by Figure 1, which compares the risk achieved by the three approaches above in the population limit $n \to \infty$, using quadratic activations $\sigma(u) = u^2$. We consider the large-network, high-dimensional regime $N, d \to \infty$, with $N/d \to \rho \in (0, \infty)$. Figure 1 reports the risk achieved by various approaches in numerical simulations, and compares them with our theoretical predictions for each of three regimes $\mathrm{RF}$, $\mathrm{NT}$, and $\mathrm{NN}$, which are detailed in the next sections.

The agreement between analytical predictions and simulations is excellent but, more importantly, a clear picture emerges. We can highlight a few phenomena that are illustrated in this figure:

Random features do not capture quadratic functions. The random features risk $R_{\mathrm{RF},N}(f_*)$ remains generally bounded away from zero for all values of $\rho = N/d$. It is further highly dependent on the distribution of the weight vectors $w_i$. Section 2.1 characterizes explicitly this dependence, for general activation functions $\sigma$. For large $\rho$, the optimal distribution of the weight vectors uses covariance $\Gamma \propto B$, but even in this case the risk is bounded away from zero unless $\rho \to \infty$.

The neural tangent model achieves vanishing risk on quadratic functions for $N \ge d$. However, the risk is bounded away from zero if $N/d \to \rho < 1$. Section 2.2 provides explicit expressions for the minimum risk as a function of $\rho$. Roughly speaking, $\mathrm{NT}$ fits the quadratic function $f_*$ along a random subspace determined by the random weight vectors $w_i$. For $N \ge d$, these vectors span the whole space $\mathbb{R}^d$ and hence the limiting risk vanishes. For $N < d$ only a fraction of the space is spanned, and not the most important one (i.e. not the principal eigendirections of $B$).

Fully trained neural networks achieve vanishing risk on quadratic functions for $N \ge d$: this is to be expected on the basis of the previous point. For $N/d \to \rho < 1$ the risk is generally bounded away from zero, but its value is smaller than for the neural tangent model. Namely, in Section 2.3 we give an explicit expression for the asymptotic risk (holding for $B \succeq 0$) implying that, for some $\mathrm{GAP}(\rho) > 0$ (independent of $N, d$),

$$\lim_{t \to \infty} \lim_{\varepsilon \to 0} R_{\mathrm{NN},N}(f_*; \ell = t/\varepsilon, \varepsilon) = \inf_{f \in \mathcal{F}_{\mathrm{NN},N}} \mathbb{E}\big\{(f(x) - f_*(x))^2\big\} \le R_{\mathrm{NT},N}(f_*) - \mathrm{GAP}(\rho). \tag{6}$$

We prove this result by showing convergence of SGD to gradient flow in the population risk, and then proving a strict saddle property for the population risk. As a consequence the limiting risk on the left-hand side coincides with the minimum risk over the whole space of neural networks $\mathcal{F}_{\mathrm{NN},N}$. We characterize the latter and show that it amounts to fitting $f_*$ along the $N$ principal eigendirections of $B$. This mechanism is very different from the one arising in the $\mathrm{NT}$ regime.

The picture emerging from these findings is remarkably simple. The fully trained network learns the most important eigendirections of the quadratic function $f_*$ and fits them, hence surpassing the $\mathrm{NT}$ model, which is confined to a random set of directions.

Let us emphasize that the above separation between and is established only for . It is natural to wonder whether this separation generalizes to for more complicated classes of functions, or if instead it always vanishes for wide networks. We expect the separation to generalize to by considering higher order polynomial, instead of quadratic functions. Partial evidence in this direction is provided by [GMMM19]: for third- or higher-order polynomials does not achieve vanishing risk at any . The mechanism unveiled by our analysis of quadratic functions is potentially more general: neural networks are superior to linearized models such as or , because they can learn a good representation of the data.

Our results for quadratic functions are formally presented in Section 2. In order to confirm that the picture we obtain is general, we establish similar results for mixture of Gaussians in Section 3. More precisely, our results of and for mixture of Gaussians are very similar to the quadratic case. In this model, however, we do not prove a convergence result for analogous to (6), although we believe it should be possible by the same approach outlined above. On the other hand, we characterize the minimum prediction risk over neural networks and prove it is strictly smaller than the minimum achieved by and . Finally, Section 4 contains background on our numerical experiments.

### 1.2 Further related work

The connection (and differences) between two-layers neural networks and random features models has been the object of several papers since the original work of Rahimi and Recht [RR08]. An incomplete list of references includes [Bac13, AM15, Bac17a, Bac17b, RR17]. Our analysis contributes to this line of work by establishing a sharp asymptotic characterization, although in more specific data distributions. Sharp results have recently been proven in [GMMM19], for the special case of random weights uniformly distributed over a $d$-dimensional sphere. Here we consider the more general case of anisotropic random features with covariance $\Gamma$. This clarifies a key reason for suboptimality of random features: the data representation is not adapted to the target function $f_*$. We focus on the population limit $n \to \infty$. Complementary results characterizing the variance as a function of $n$ are given in [HMRT19].

The model (3) is much more recent [JGH18]. Several papers show that SGD optimization within the original neural network is well approximated by optimization within the model as long as the number of neurons is large compared to a polynomial in the sample size [DZPS18, DLL18, AZLS18, ZCZG18]. Empirical evidence in the same direction was presented in [LXS19, ADH19].

Chizat and Bach [CB18] clarified that any nonlinear statistical model can be approximated by a linear one in an early (lazy) training regime. The basic argument is quite simple. Given a model $x \mapsto f(x; \theta)$ with parameters $\theta$, we can Taylor-expand around a random initialization $\theta_0$. Setting $\theta = \theta_0 + \beta$, we get

$$f(x; \theta) \approx f(x; \theta_0) + \beta^{\mathsf T} \nabla_\theta f(x; \theta_0) \approx \beta^{\mathsf T} \nabla_\theta f(x; \theta_0). \tag{7}$$

Here the second approximation holds since, for many random initializations, $f(x; \theta_0) \approx 0$ because of random cancellations. The resulting model is linear, with random features.
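The first approximation in Eq. (7) can be checked numerically. The sketch below is illustrative only: it takes the network $f(x; W) = \sum_i \langle w_i, x\rangle^2$ (quadratic activation, unit second-layer weights), for which the Taylor remainder is exactly $\sum_i \langle \beta_i, x\rangle^2$, i.e. second order in the perturbation. The sizes, the perturbation scale and the function names are assumptions. For this particular network the second approximation in Eq. (7) does not apply, since $f(x; W_0)$ is a sum of squares and no cancellation occurs.

```python
import numpy as np

def f(x, W):
    """Two-layer network with quadratic activation and unit second-layer weights."""
    return float(np.sum((W @ x) ** 2))

def grad_f(x, W):
    """Gradient of f with respect to W: row i equals 2 <w_i, x> x."""
    return 2.0 * np.outer(W @ x, x)

rng = np.random.default_rng(0)
d, N = 6, 4                                    # assumed toy sizes
x = rng.standard_normal(d)
W0 = rng.standard_normal((N, d)) / np.sqrt(d)  # random initialization theta_0
beta = 1e-2 * rng.standard_normal((N, d))      # small parameter displacement

exact = f(x, W0 + beta)
linearized = f(x, W0) + float(np.sum(beta * grad_f(x, W0)))
remainder = exact - linearized                 # equals f(x, beta), quadratic in beta
```

Shrinking `beta` by a factor of ten shrinks `remainder` by a factor of one hundred, which is the sense in which the linearized model is accurate early in training.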

Our objective is complementary to this literature: we prove that and have limited approximation power, and significant gain can be achieved by full training.

Finally, our analysis of fully trained networks connects to the ample literature on non-convex statistical estimation. For two-layers neural networks with quadratic activations, Soltanolkotabi, Javanmard and Lee [SJL19] showed that, as long as the number of neurons satisfies $N \ge 2d$, there are no spurious local minimizers. Du and Lee [DL18] showed that the same holds as long as $N \ge \sqrt{2n}$, where $n$ is the sample size. Zhong et al. [ZSJ17] established local convexity properties around global optima. Further related landscape results include [GLM17, HYV14, GJZ17].

## 2 Main results: quadratic functions

As mentioned in the previous section, our results for quadratic functions (qf) assume and where

$$f_*(x) \equiv b_0 + \langle x, B x\rangle. \tag{8}$$

### 2.1 Random features

We consider random feature model with first-layer weights . We make the following assumptions:

• The activation function verifies for some constants with . Further it is nonlinear (i.e. there is no such that almost everywhere).

• We fix the weights’ normalization by requiring . We assume the operator norm for some constant , and that the empirical spectral distribution of converges weakly, as to a probability distribution over .

###### Theorem 1.

Let be a quadratic function as per Eq. (8), with . Assume conditions A1 and A2 to hold. Denote by the -th Hermite coefficient of and assume . Define . Let be the unique solution of

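$$-\tilde\lambda = -\frac{\rho}{\psi} + \int \frac{\lambda_1^2\, t}{1 + \lambda_1^2\, t\, \psi}\, \mathcal{D}(\mathrm{d}t). \tag{9}$$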

Then, the following holds as with :

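$$R_{\mathrm{RF},N}(f_*) = \|f_*\|_{L^2}^2 \Big( 1 - \frac{\psi \lambda_2^2\, d\, \langle \Gamma, B\rangle^2}{\|B\|_F^2 \big(2 + \psi \lambda_2^2\, d\, \|\Gamma\|_F^2\big)} + o_{d,\mathbb{P}}(1) \Big). \tag{10}$$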

Moreover, assuming to have a limit as , (10) simplifies as follows for :

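$$\lim_{\rho \to \infty}\; \lim_{d \to \infty,\, N/d \to \rho} \frac{R_{\mathrm{RF},N}(f_*)}{\|f_*\|_{L^2}^2} = \lim_{d \to \infty} \Big( 1 - \frac{\langle \Gamma, B\rangle^2}{\|\Gamma\|_F^2\, \|B\|_F^2} \Big). \tag{11}$$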

Notice that $R_{\mathrm{RF},N}(f_*)/\|f_*\|_{L^2}^2$ is the risk normalized by the risk of the trivial predictor $\hat f = 0$. The asymptotic result in (11) is remarkably simple. By Cauchy-Schwarz, the normalized risk is bounded away from zero even as the number of neurons per dimension diverges ($\rho \to \infty$), unless $\Gamma \propto B$, i.e. the random features are perfectly aligned with the function to be learned. For isotropic random features, the right-hand side of Eq. (11) reduces to $1 - \mathrm{Tr}(B)^2/(d\,\|B\|_F^2)$. In particular, $\mathrm{RF}$ performs very poorly when $\mathrm{Tr}(B)^2 \ll d\,\|B\|_F^2$, and no better than the trivial predictor if $\mathrm{Tr}(B) = 0$.

Notice that the above result applies to quite general activation functions. The formulas simplify significantly for quadratic activations.

###### Corollary 1.

Under the assumptions of Theorem 1, further assume . Then we have, as with :

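$$R_{\mathrm{RF},N}(f_*) = \|f_*\|_{L^2}^2 \Big( 1 - \frac{\rho\, d\, \langle B, \Gamma\rangle^2}{\|B\|_F^2 \big(1 + \rho\, d\, \|\Gamma\|_F^2\big)} + o_{d,\mathbb{P}}(1) \Big). \tag{12}$$

The right-hand side of Eq. (12) is plotted in Fig. 1 for isotropic features $\Gamma \propto I_d$, and for optimal features $\Gamma \propto B$.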
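The leading term of Eq. (12) is easy to evaluate directly. The following sketch computes the normalized risk (dropping the vanishing $o_{d,\mathbb{P}}(1)$ term) for a given pair $(B, \Gamma)$ and ratio $\rho$. The function name is hypothetical, and the normalization $\mathrm{Tr}(\Gamma) = 1$ used in the examples is an assumption of this sketch.

```python
import numpy as np

def rf_normalized_risk(B, Gamma, rho):
    """Leading term of R_RF / ||f*||^2 in Eq. (12), quadratic activation."""
    d = B.shape[0]
    overlap_sq = np.sum(B * Gamma) ** 2        # <B, Gamma>^2
    b_fro_sq = np.sum(B ** 2)                  # ||B||_F^2
    g_fro_sq = np.sum(Gamma ** 2)              # ||Gamma||_F^2
    return 1.0 - rho * d * overlap_sq / (b_fro_sq * (1.0 + rho * d * g_fro_sq))

d = 4
isotropic = np.eye(d) / d                      # assumed normalization Tr(Gamma) = 1

# B = I with isotropic features: the risk is 1 / (1 + rho).
risk_identity = rf_normalized_risk(np.eye(d), isotropic, rho=1.0)

# Traceless B with isotropic features: no better than the trivial predictor.
risk_traceless = rf_normalized_risk(np.diag([1.0, -1.0, 1.0, -1.0]), isotropic, rho=1.0)

# Features aligned with B: the risk vanishes only as rho grows.
B_psd = np.diag([1.0, 2.0, 3.0, 4.0])
aligned = B_psd / np.trace(B_psd)
risk_aligned_small = rf_normalized_risk(B_psd, aligned, rho=1.0)
risk_aligned_large = rf_normalized_risk(B_psd, aligned, rho=1e6)
```

These three cases mirror the discussion following Eq. (11): alignment of $\Gamma$ with $B$ helps, but a finite $\rho$ still leaves a nonzero risk.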

### 2.2 Neural tangent

For the regime, we focus on quadratic activations and isotropic weights .

###### Theorem 2.

Let be a quadratic function as per Eq. (8), with , and assume . Then, we have for with

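$$\mathbb{E}\big[R_{\mathrm{NT},N}(f_*)\big] = \|f_*\|_{L^2}^2 \Big\{ (1-\rho)_+^2 \Big( 1 - \frac{\mathrm{Tr}(B)^2}{d\,\|B\|_F^2} \Big) + (1-\rho)_+ \frac{\mathrm{Tr}(B)^2}{d\,\|B\|_F^2} + o_d(1) \Big\},$$

where the expectation is taken over $W$.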

As for the case of random features, the risk depends on the target function only through the ratio $\mathrm{Tr}(B)^2/(d\,\|B\|_F^2)$. However, the normalized risk is always smaller than the baseline . Note that, by Cauchy-Schwarz, $\mathrm{Tr}(B)^2 \le d\,\|B\|_F^2$, with this worst case achieved when $B \propto I_d$. In particular, $R_{\mathrm{NT},N}(f_*)$ vanishes asymptotically for $\rho \ge 1$. This comes at the price of a larger number of parameters to be fitted, namely $Nd$ instead of $N$.
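The dependence on $\rho$ and on the trace ratio in Theorem 2 can be tabulated with a few lines of code. This is a sketch of the leading term only (the $o_d(1)$ correction is dropped); the function name and the example matrices are assumptions.

```python
import numpy as np

def nt_normalized_risk(B, rho):
    """Leading term of E[R_NT] / ||f*||^2 in Theorem 2 (isotropic weights)."""
    d = B.shape[0]
    tau = np.trace(B) ** 2 / (d * np.sum(B ** 2))   # Tr(B)^2 / (d ||B||_F^2), in [0, 1]
    gap = max(1.0 - rho, 0.0)                       # (1 - rho)_+
    return gap ** 2 * (1.0 - tau) + gap * tau

B_identity = np.eye(4)                              # tau = 1, the worst case
B_traceless = np.diag([1.0, -1.0, 1.0, -1.0])       # tau = 0

risk_identity = nt_normalized_risk(B_identity, rho=0.5)
risk_traceless = nt_normalized_risk(B_traceless, rho=0.5)
risk_wide = nt_normalized_risk(B_identity, rho=1.5)
```

At $\rho = 1/2$ the two extreme cases give $(1-\rho)_+$ and $(1-\rho)_+^2$ respectively, and any $\rho \ge 1$ gives zero.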

### 2.3 Neural network

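For the analysis of SGD-trained neural networks, we assume $f_*$ to be a quadratic function as per Eq. (8), but we will now restrict to the positive semidefinite case $B \succeq 0$. We consider quadratic activations $\sigma(u) = u^2$, and we fix the second-layer weights to be $1$:

$$\hat f(x; W, c) = \sum_{i=1}^{N} \langle w_i, x\rangle^2 + c.$$

Notice that we use an explicit offset to account for the mismatch in means between $f_*$ and $\hat f$. It is useful to introduce the population risk, as a function of the network parameters $(W, c)$:

$$L(W, c) = \mathbb{E}\big[(f_*(x) - \hat f(x; W, c))^2\big] = \mathbb{E}\big[(\langle x x^{\mathsf T}, B - W W^{\mathsf T}\rangle + b_0 - c)^2\big].$$

Here expectation is with respect to $x \sim \mathsf{N}(0, I_d)$. We will study a one-pass version of SGD, whereby at each iteration $k$ we perform a stochastic gradient step with respect to a fresh sample $x_k$,

$$(W_{k+1}, c_{k+1}) = (W_k, c_k) - \varepsilon\, \nabla_{W,c} \big(f_*(x_k) - \hat f(x_k; W, c)\big)^2,$$

and define

$$R_{\mathrm{NN},N}(f_*; \ell, \varepsilon) \equiv L(W_\ell, c_\ell) = \mathbb{E}_{x \sim \mathsf{N}(0, I_d)}\big[(f_*(x) - \hat f(x; W_\ell, c_\ell))^2\big].$$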

Notice that this is the risk with respect to a new sample, independent from the ones used to train $(W_\ell, c_\ell)$. It is the test error. Also notice that $\ell$ is the number of SGD steps but also (because of the one-pass assumption) the sample size. Our next theorem characterizes the asymptotic risk achieved by SGD. This prediction is reported in Figure 1.

###### Theorem 3.

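Let $f_*$ be a quadratic function as per Eq. (8), with $B \succeq 0$. Consider SGD with initialization $(W_0, c_0)$ whose distribution is absolutely continuous with respect to the Lebesgue measure. Let $R_{\mathrm{NN},N}(f_*; \ell, \varepsilon)$ be the test prediction error after $\ell$ SGD steps with step size $\varepsilon$.

Then we have, for any $\delta > 0$ (probability is over the initialization and the samples),

$$\lim_{t \to \infty} \lim_{\varepsilon \to 0} \mathbb{P}\Big( \big| R_{\mathrm{NN},N}(f_*; \ell = t/\varepsilon, \varepsilon) - \inf_{W,c} L(W, c) \big| \ge \delta \Big) = 0, \qquad \inf_{W,c} L(W, c) = 2 \sum_{i=N+1}^{d} \lambda_i(B)^2,$$

where $\lambda_1(B) \ge \lambda_2(B) \ge \cdots \ge \lambda_d(B)$ are the ordered eigenvalues of $B$.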
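The value of the infimum can be verified numerically. For $x \sim \mathsf{N}(0, I_d)$ and a symmetric matrix $M$, the quadratic form $\langle x, Mx\rangle$ has mean $\mathrm{Tr}(M)$ and variance $2\|M\|_F^2$, so the population risk has the closed form $L(W, c) = 2\|B - WW^{\mathsf T}\|_F^2 + (\mathrm{Tr}(B - WW^{\mathsf T}) + b_0 - c)^2$. The sketch below uses this identity (derived here, not quoted from the paper) and a top-$N$ eigendecomposition; it treats $W$ as a $d \times N$ matrix with columns $w_i$, and the function names and test matrix are assumptions.

```python
import numpy as np

def population_risk(W, c, B, b0):
    """Closed form of L(W, c) for x ~ N(0, I_d); W is (d, N) with columns w_i."""
    M = B - W @ W.T
    return 2.0 * np.sum(M ** 2) + (np.trace(M) + b0 - c) ** 2

def best_rank_n_fit(B, b0, N):
    """Fit B along its N principal eigendirections and match the offset."""
    evals, evecs = np.linalg.eigh(B)                 # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:N]              # indices of the N largest
    top_vals = np.clip(evals[order], 0.0, None)      # B is assumed PSD
    W = evecs[:, order] * np.sqrt(top_vals)          # W W^T = top-N part of B
    c = b0 + np.trace(B - W @ W.T)
    return W, c

rng = np.random.default_rng(0)
d, N, b0 = 6, 2, 0.3                                 # assumed toy sizes
spectrum = np.array([5.0, 4.0, 3.0, 2.0, 1.0, 0.5])
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))     # random orthogonal basis
B = Q @ np.diag(spectrum) @ Q.T

W_opt, c_opt = best_rank_n_fit(B, b0, N)
risk_at_opt = population_risk(W_opt, c_opt, B, b0)
tail = 2.0 * np.sum(np.sort(spectrum)[::-1][N:] ** 2)

W_rand = rng.standard_normal((d, N))
risk_at_rand = population_risk(W_rand, 0.0, B, b0)
```

Here `risk_at_opt` matches `tail`, the quantity $2\sum_{i > N} \lambda_i(B)^2$ in the theorem, and any other choice of $(W, c)$, such as the random one above, gives a risk at least as large.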

The proof of this theorem depends on the following proposition concerning the landscape of the population risk, which is of independent interest.

###### Proposition 1.

Let be a quadratic function as per Eq. (8), with . For any sub-level set of the risk function , there exists constants such that is -strict saddle in the region . Namely, for any with , we have .

We can now compare the risk achieved within the regimes , and . Gathering the results of Corollary 1, and Theorems 2, 3 (using for and ), we obtain

 RM,N(f∗)∥f∗∥2L2 (13)

As anticipated, learns the most important directions in , while , do not.
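For reference, substituting isotropic weights into Corollary 1, Theorem 2 and Theorem 3 suggests the following side-by-side form of the comparison. This is a sketch derived from those statements rather than a quotation: it assumes $\Gamma = I_d/d$ for RF and NT, a centered target so that $\|f_*\|_{L^2}^2 = 2\|B\|_F^2$, and $B \succeq 0$ for the NN line. The shorthand $\tau$ is introduced here for convenience, and $\approx$ means equality up to terms vanishing as $d \to \infty$.

```latex
\frac{R_{M,N}(f_*)}{\|f_*\|_{L^2}^2} \approx
\begin{cases}
1 - \dfrac{\rho}{1+\rho}\,\tau & \text{for } M = \mathrm{RF},\\[2ex]
(1-\rho)_+^2\,(1-\tau) + (1-\rho)_+\,\tau & \text{for } M = \mathrm{NT},\\[2ex]
1 - \dfrac{\sum_{i=1}^{N \wedge d} \lambda_i(B)^2}{\|B\|_F^2} & \text{for } M = \mathrm{NN},
\end{cases}
\qquad
\tau \equiv \frac{\mathrm{Tr}(B)^2}{d\,\|B\|_F^2}.
```

The RF and NT lines depend on $B$ only through $\tau$, whereas the NN line depends on how much of the spectrum of $B$ is carried by its top $N$ eigenvalues.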

## 3 Main results: mixture of Gaussians

Figure 2: Left frame: Prediction (test) error of two-layer neural networks in fitting a mixture of Gaussians in $d = 450$ dimensions, as a function of the number of neurons $N$, within the three regimes RF, NT, NN. Lines are analytical predictions obtained in this paper, and dots are empirical results (both in the population limit). Dotted line is the Bayes error. Right frame: Evolution of the risk for NT and NN with the number of samples.

In this section, we consider the mixture of Gaussian setting (mg): with equal probability , and , . We parametrize the covariances as and , and will make the following assumptions:

• There exists constants such that ;

• .

The scaling in assumption M2 ensures the signal-to-noise ratio to be of order one. If the eigenvalues of are much larger than , then it is easy to distinguish the two classes with high probability (they are asymptotically mutually singular). If

then no non-trivial classifier exists.

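We will denote by $\mathbb{P}_{\Sigma,\Delta}$ the joint distribution of $(y, x)$ under the (mg) model, and by $\mathbb{E}_{\Sigma,\Delta}$ or $\mathbb{E}_{(y,x)}$ the corresponding expectation. The minimum prediction risk within any of the regimes $\mathrm{RF}$, $\mathrm{NT}$, $\mathrm{NN}$ is defined by

$$R_{M,N}(\mathbb{P}) = \inf_{f \in \mathcal{F}_{M,N}} \mathbb{E}_{(y,x)}\big\{(y - f(x))^2\big\}, \qquad M \in \{\mathrm{RF}, \mathrm{NT}, \mathrm{NN}\}.$$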

As mentioned in the introduction, the picture emerging from our analysis of the mg  model is aligned with the results obtained in the previous section. We will limit ourselves to stating the results without repeating comments that were made above. Our results are compared with simulations in Figure 2. Notice that, in this case, the Bayes error (MMSE) is not achieved even for very wide networks either by or .

### 3.1 Random features

As in the previous section, we generate random first-layer weights . We consider a general activation function satisfying condition . We make the following assumption on :

• We fix the weights’ normalization by requiring . We assume that there exists a constant such that , and that the empirical spectral distribution of converges weakly, as to a probability distribution over .

###### Theorem 4.

Consider the mg  distribution, with and satisfying condition M1 and M2. Assume conditions A1 and B2 to hold. Define to be the -th Hermite coefficient of and assume without loss of generality . Define . Let be the unique solution of

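$$-\tilde\lambda = -\frac{\rho}{\psi} + \int \frac{\lambda_1^2\, t}{1 + \lambda_1^2\, t\, \psi}\, \mathcal{D}(\mathrm{d}t). \tag{14}$$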

Define , . Then, the following holds as with :

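$$R_{\mathrm{RF},N}(\mathbb{P}_{\Sigma,\Delta}) = \frac{1 + \zeta_1(d)\, \lambda_2^2\, \psi}{1 + \big(\zeta_1(d) + \zeta_2(d)\big)\, \lambda_2^2\, \psi} + o_{d,\mathbb{P}}(1). \tag{15}$$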

Moreover, assume to have limits as , i.e. we have for . Then the following holds as :

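$$\lim_{\rho \to \infty}\; \lim_{d \to \infty,\, N/d \to \rho} R_{\mathrm{RF},N}(\mathbb{P}_{\Sigma,\Delta}) = \frac{\zeta_{1,*}}{\zeta_{1,*} + \zeta_{2,*}}. \tag{16}$$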

### 3.2 Neural tangent

For the model, we first state our theorem for general and and then give an explicit concentration result in the case and isotropic weights .

###### Theorem 5.

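Let $\mathbb{P}_{\Sigma,\Delta}$ be the mixture of Gaussian distribution, with $\Sigma$ and $\Delta$ satisfying conditions M1 and M2. Further assume . Then, the following holds for almost every $W$ (with respect to the Lebesgue measure):

$$R_{\mathrm{NT},N}(\mathbb{P}_{\Sigma,\Delta}) = \frac{2}{2 + \|\tilde\Delta\|_F^2 - \|P_\perp \tilde\Delta P_\perp\|_F^2} + o_d(1),$$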

where and is the projection perpendicular to .

Assuming further that and , we have as with :

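$$R_{\mathrm{NT},N}(\mathbb{P}_{I,\Delta}) = \frac{2}{2 + \kappa(\rho, \Delta)\, \|\Delta\|_F^2} + o_{d,\mathbb{P}}(1), \qquad \kappa(\rho, \Delta) = 1 - (1-\rho)_+^2 \Big( 1 - \frac{\mathrm{Tr}(\Delta)^2}{d\,\|\Delta\|_F^2} \Big) - (1-\rho)_+ \frac{\mathrm{Tr}(\Delta)^2}{d\,\|\Delta\|_F^2}.$$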

In particular, for , we have (for almost every )

$$R_{\mathrm{NT},N}(\mathbb{P}_{I,\Delta}) = \frac{1}{1 + \|\Delta\|_F^2/2} + o_{d,\mathbb{P}}(1).$$

### 3.3 Neural network

We consider quadratic activations with general offset and coefficients . This is optimized over and .

###### Theorem 6.

Let be the mixture of Gaussian distribution, with and satisfying conditions M1 and M2. Then, the following holds

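$$R_{\mathrm{NN},N}(\mathbb{P}_{\Sigma,\Delta}) = \frac{2}{2 + \sum_{i=1}^{N \wedge d} \lambda_i(\tilde\Delta)^2} + o_d(1),$$

where and $\lambda_1(\tilde\Delta) \ge \lambda_2(\tilde\Delta) \ge \cdots \ge \lambda_d(\tilde\Delta)$ are the singular values of $\tilde\Delta$. In particular, for $N \ge d$, we have

$$R_{\mathrm{NN},N}(\mathbb{P}_{I,\Delta}) = \frac{1}{1 + \|\tilde\Delta\|_F^2/2} + o_d(1).$$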

Let us emphasize that, for this setting, we do not have a convergence result for SGD as for the model qf, cf. Theorem 3. However, because of certain analogies between the two models, we expect a similar result to hold for mixtures of Gaussians.

We can now compare the risks achieved within the regimes , and . Gathering the results of Theorems 4, 5 and 6 for and (using for RF  and NT), we obtain

 RM,N(PI,Δ) (17)

We recover a similar behavior as in the case of the (qf) model: learns the most important directions of , while , do not. Note that the Bayes error is not achieved in this model.
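The NT and NN expressions of Theorems 5 and 6 can be compared numerically in the isotropic case. The sketch below evaluates their leading terms for a given $\Delta$; it assumes $\Sigma = I_d$ and that $\tilde\Delta$ reduces to $\Delta$ in that case, and the function names and the example matrix are hypothetical.

```python
import numpy as np

def mg_nt_risk(Delta, rho):
    """Leading term of R_NT for the mg model with Sigma = I (Theorem 5)."""
    d = Delta.shape[0]
    fro_sq = np.sum(Delta ** 2)
    tau = np.trace(Delta) ** 2 / (d * fro_sq)
    gap = max(1.0 - rho, 0.0)                        # (1 - rho)_+
    kappa = 1.0 - gap ** 2 * (1.0 - tau) - gap * tau
    return 2.0 / (2.0 + kappa * fro_sq)

def mg_nn_risk(Delta, N):
    """Leading term of R_NN (Theorem 6), assuming tilde-Delta = Delta when Sigma = I."""
    sing = np.linalg.svd(Delta, compute_uv=False)    # singular values, descending
    k = min(N, Delta.shape[0])
    return 2.0 / (2.0 + np.sum(sing[:k] ** 2))

Delta = np.diag([0.4, 0.3, 0.2, 0.1])                # assumed example, d = 4
d = Delta.shape[0]

nt_half = mg_nt_risk(Delta, rho=0.5)                 # N = 2 neurons
nn_half = mg_nn_risk(Delta, N=2)
nt_full = mg_nt_risk(Delta, rho=1.0)                 # N = d neurons
nn_full = mg_nn_risk(Delta, N=d)
```

For this example the NN value at $N = d/2$ is below the NT value, and the two coincide at $N = d$, where both reduce to $1/(1 + \|\Delta\|_F^2/2)$.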

## 4 Numerical Experiments

For the experiments illustrated in Figures 1 and 2, we use feature size of $d = 450$, and number of hidden units . $\mathrm{NT}$ and $\mathrm{NN}$ models are trained with SGD in TensorFlow [ABC16]. We run a total of SGD steps for each (qf) model and steps for each (mg) model. The SGD batch size is fixed at and the step size is chosen from the grid where the hyper-parameter that achieves the best fit is used for the figures. $\mathrm{RF}$ models are fitted directly by solving KKT conditions with observations. After fitting the model, the test error is evaluated on fresh samples. In our figures, each data point corresponds to the test error averaged over models with independent realizations of .

For (qf) experiments, we choose $B$ to be diagonal with diagonal elements chosen i.i.d. from standard exponential distribution with parameter . For (mg) experiments, $\Delta$ is also diagonal with the diagonal element chosen uniformly from the set .

## Acknowledgements

This work was partially supported by grants NSF DMS-1613091, CCF-1714305, IIS-1741162, and ONR N00014-18-1-2729, NSF DMS-1418362, NSF DMS-1407813.

## Appendix A Technical background

### A.1 Hermite polynomials

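The Hermite polynomials $\{\mathrm{He}_k\}_{k \ge 0}$ form an orthogonal basis of $L^2(\mathbb{R}, \gamma)$, where $\gamma$ is the standard Gaussian measure, and $\mathrm{He}_k$ has degree $k$. We will follow the classical normalization (here and below, expectation is with respect to $G \sim \mathsf{N}(0, 1)$):

$$\mathbb{E}\{\mathrm{He}_j(G)\, \mathrm{He}_k(G)\} = k!\, \delta_{jk}. \tag{18}$$

As a consequence, for any function $g \in L^2(\mathbb{R}, \gamma)$, we have the decomposition

$$g(x) = \sum_{k=0}^{\infty} \frac{\mu_k(g)}{k!}\, \mathrm{He}_k(x), \qquad \mu_k(g) \equiv \mathbb{E}\{g(G)\, \mathrm{He}_k(G)\}. \tag{19}$$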
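The coefficients $\mu_k(g)$ in Eq. (19) can be computed by Gauss-Hermite quadrature. The following sketch uses NumPy's probabilists' Hermite module (`numpy.polynomial.hermite_e`), whose polynomials follow the normalization of Eq. (18); the helper name `hermite_coefficient` and the number of quadrature nodes are assumptions of this example.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval

def hermite_coefficient(g, k, n_nodes=64):
    """mu_k(g) = E[g(G) He_k(G)] for G ~ N(0, 1), via Gauss-Hermite quadrature."""
    nodes, weights = hermegauss(n_nodes)       # weight function exp(-x^2 / 2)
    weights = weights / np.sqrt(2.0 * np.pi)   # normalize to the Gaussian density
    basis = np.zeros(k + 1)
    basis[k] = 1.0                             # coefficient vector selecting He_k
    return float(np.sum(weights * g(nodes) * hermeval(nodes, basis)))

def quadratic(u):
    """Quadratic activation sigma(u) = u^2."""
    return u ** 2

def he2(u):
    """Second Hermite polynomial He_2(u) = u^2 - 1."""
    return u ** 2 - 1.0

mu = [hermite_coefficient(quadratic, k) for k in range(4)]
norm_he2 = hermite_coefficient(he2, 2)         # E[He_2(G)^2], equal to 2! by Eq. (18)
```

For the quadratic activation this gives $\mu_0 = 1$, $\mu_1 = 0$, $\mu_2 = 2$, $\mu_3 = 0$, i.e. $u^2 = \mathrm{He}_0(u) + \mathrm{He}_2(u)$.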

### A.2 Notations

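Throughout the proofs, $O_d(\,\cdot\,)$ (resp. $o_d(\,\cdot\,)$) denotes the standard big-O (resp. little-o) notation, where the subscript $d$ emphasizes the asymptotic variable. We denote $O_{d,\mathbb{P}}(\,\cdot\,)$ (resp. $o_{d,\mathbb{P}}(\,\cdot\,)$) the big-O (resp. little-o) in probability notation: $h_1(d) = O_{d,\mathbb{P}}(h_2(d))$ if for any $\varepsilon > 0$, there exist $C_\varepsilon > 0$ and $d_\varepsilon$, such that

$$\mathbb{P}\big(|h_1(d)/h_2(d)| > C_\varepsilon\big) \le \varepsilon, \qquad \forall d \ge d_\varepsilon,$$

and respectively: $h_1(d) = o_{d,\mathbb{P}}(h_2(d))$, if $h_1(d)/h_2(d)$ converges to $0$ in probability.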

We will occasionally hide logarithmic factors using the notation (resp. ): if there exists a constant such that . Similarly, we will denote (resp. ) when considering the big-O in probability notation up to a logarithmic factor.

## Appendix B Proofs for quadratic functions

Our results for quadratic functions (qf) assume and where

$$f_*(x) \equiv b_0 + \langle x, B x\rangle. \tag{20}$$

Throughout this section, we will denote the expectation operator with respect to , and the expectation operator with respect to .

### B.1 Random Features model: proof of Theorem 1

Recall the definition

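$$R_{\mathrm{RF},N}(f_*) = \inf_{\hat f \in \mathcal{F}_{\mathrm{RF},N}(W)} \mathbb{E}\big\{(f_*(x) - \hat f(x))^2\big\},$$

where

$$\mathcal{F}_{\mathrm{RF},N}(W) = \Big\{ f_N(x) = \sum_{i=1}^{N} a_i\, \sigma(\langle w_i, x\rangle) :\; a_i \in \mathbb{R},\; i \in [N] \Big\}.$$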

Note that it is easy to see from the proof that the result stays the same if we add an offset .

#### B.1.1 Representation of the RF risk

###### Lemma 1.

Consider the model. We have