# A Note on Lazy Training in Supervised Differentiable Programming

In a series of recent theoretical works, it has been shown that strongly over-parameterized neural networks trained with gradient-based methods can converge linearly to zero training loss, with their parameters hardly varying. In this note, our goal is to exhibit the simple structure that is behind these results. In a simplified setting, we prove that "lazy training" essentially solves a kernel regression. We also show that this behavior is not so much due to over-parameterization as to a choice of scaling, often implicit, that makes it possible to linearize the model around its initialization. These theoretical results, complemented with simple numerical experiments, make it seem unlikely that "lazy training" is behind the many successes of neural networks in high-dimensional tasks.


## 1 Introduction

Differentiable programming is becoming an important paradigm in certain areas of signal processing and machine learning. It consists in building parameterized models, sometimes with a complex architecture and a large number of parameters, and adjusting these parameters so that the model fits training data, using gradient-based optimization methods. The resulting problem is in general highly non-convex: it has been observed empirically that seemingly innocuous changes in the parameterization, the optimization procedure, or the initialization can lead to the selection of very different models, even though they sometimes all fit the training data perfectly [30]. Our goal is to showcase this effect by studying lazy training, which refers to training algorithms that select parameters close to the initialization.

This note is motivated by a series of recent articles [12, 19, 11, 2, 3, 31] where it is shown that certain over-parameterized neural networks converge linearly to zero training loss with their parameters hardly varying. With a slightly different viewpoint, it was shown in [15] that the first phase of training behaves like a kernel regression in the over-parameterization limit, with a kernel built from the linearization of the neural network around its initialization. Blending these two points of view, we remark that lazy training essentially amounts to kernel-based regression with a specific kernel. Importantly, in all these papers, a specific and somewhat implicit choice of scaling is made. We argue that lazy training is not so much due to over-parameterization as to this choice of scaling. By introducing a scale factor α, we see that essentially any parametric model can be trained in this lazy regime if it is initialized close enough to 0 in the space of predictors. This remark allows us to better understand lazy training: its generalization properties, when it occurs, and its downsides.

The takeaway is that guaranteed fast training is indeed possible, but at the cost of recovering a linear method (by which we mean a prediction function linearly parameterized by a potentially infinite-dimensional vector). On the upside, this draws an interesting link between neural networks and kernel methods, as noticed in [15].

On the downside, we believe that most practically useful neural networks are not trained in this regime: a clue is that, in practice, neuron weights actually move quite a lot (see, e.g., [21, 5]), and the first layer of convolutional neural networks tends to learn Gabor-like filters when randomly initialized [13, Chap. 9]. Instead, neural networks seem able to perform high-dimensional, non-linear feature selection, and it remains a fundamental theoretical challenge to understand how and when they do so.

The situation is illustrated in Figure 1, where lazy training for a two-layer neural network with rectified linear units (ReLU) is achieved by increasing the variance at initialization (see Section 4). While in panel (a) the ground-truth features are identified, this is not the case for lazy training in panel (b), which tries to interpolate the observations with the least displacement in parameter space (in both cases, near-zero training loss was achieved). As seen in panel (c), this behavior hinders good generalization. The plateau reached for large initialization variance corresponds exactly to the performance of the corresponding kernel method.

##### Setting of supervised differentiable programming.

In this note, a model is a black-box function that maps an input variable to an output variable (the multi-dimensional output case adds no difficulty but makes notations more complicated) in a way that is consistent with observations. A common way to build models is to use a parametric model, that is, a function f(w, x) associated with a training algorithm designed to select a parameter w given a sequence of observations (x_k, y_k), k ≥ 0. When f is differentiable in the parameters (at least in a weak sense), most training algorithms are gradient-based: they consist in choosing a smooth and often strongly convex loss function ℓ, a (possibly random) initialization w0, and defining recursively a sequence of parameters by accessing the model only through the gradient of f with respect to w.

. For instance, stochastic gradient descent (SGD) updates the parameters as follows

 wk+1=wk−ηkℓ′(f(wk,xk),yk)∇f(wk,xk), (1)

where ℓ′ and ∇f denote the derivatives with respect to the first arguments, and (η_k) is a specified sequence of step-sizes. When the model is linear in its parameters, or in other specific cases [10, 20], such methods are known to find models that are optimal in some statistical sense, but this property is not well understood in the general case.
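For concreteness, the update (1) can be implemented in a few lines. Everything below (the model `toy_model`, the data, the step-size) is an illustrative choice of ours, not taken from the note; only the update rule itself is Eq. (1):

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_model(w, x):
    # an arbitrary differentiable model f(w, x); any other choice works
    return np.tanh(w @ x)

def toy_model_grad(w, x):
    # gradient of f(w, x) with respect to w
    return (1.0 - np.tanh(w @ x) ** 2) * x

def sq_loss_deriv(y_hat, y):
    # derivative of l(y_hat, y) = (y_hat - y)^2 / 2 in its first argument
    return y_hat - y

d, n = 5, 200
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = np.tanh(X @ w_star)                  # realizable targets

w = w_star + 0.3 * rng.normal(size=d)    # start near the target for a clean demo
loss = lambda w: np.mean((np.tanh(X @ w) - y) ** 2) / 2
initial_loss = loss(w)

for k in range(5000):
    i = rng.integers(n)                  # sample an observation (x_k, y_k)
    eta = 0.2                            # constant step-size
    # Eq. (1): w_{k+1} = w_k - eta_k * l'(f(w_k,x_k), y_k) * grad_w f(w_k,x_k)
    w = w - eta * sq_loss_deriv(toy_model(w, X[i]), y[i]) * toy_model_grad(w, X[i])

final_loss = loss(w)
print(initial_loss, final_loss)
```

On this realizable toy problem, SGD drives the training loss close to zero.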

### 1.1 Content of this note

Lazy training can be introduced as follows. We start by looking at the SGD update of Eq. (1) in the space of predictors: for an input x and a small step-size, one has the first-order approximation

 f(wk+1,x)≈f(wk,x)−ηk∇f(wk,x)⊺∇f(wk,xk)ℓ′(f(wk,xk),yk).

This is an SGD update for unregularized kernel-based regression [16, 9] with the kernel k(x, x′) = ∇f(w_k, x)⊺∇f(w_k, x′). The key point is that if the iterates remain in a neighborhood of w0, then this kernel is roughly constant throughout training. When f(w0) ≈ 0, this behavior naturally arises when scaling the model as αf with a large scaling factor α. Indeed, this scaling does not change the tangent model but brings the iterates of SGD closer to w0 by a factor α. This scaling is not artificial: rather, it is often implicit in practice (for instance, hidden in the choice of initialization, see Section 4). Another depiction of lazy training, with a geometrical point of view, is given in Figure 2. There, the scaling factor can be interpreted as a way to stretch the manifold of predictors so that it is locally well approximated by its tangent affine subspace.
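This near-constancy of the kernel is easy to observe numerically. The sketch below is our own construction (a two-layer tanh network with a symmetric initialization so that f(w0, ·) = 0, trained by full-batch gradient descent on the scaled objective): the relative drift of the tangent kernel Gram matrix during training shrinks when the output scale α grows.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, n = 3, 50, 10                    # input dim, hidden units (even), samples
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

A0 = rng.normal(size=(m // 2, d))
B0 = rng.normal(size=m // 2)
a_init = np.vstack([A0, A0])           # duplicated input weights ("doubling trick")
b_init = np.concatenate([B0, -B0])     # opposite output weights, so f(w0, .) = 0

def model(a, b, X):
    return (np.tanh(X @ a.T) @ b) / np.sqrt(len(b))

def jacobian(a, b, X):
    # rows: samples; columns: parameters (a flattened, then b)
    t = np.tanh(X @ a.T)                                   # (n, m)
    Ja = (b * (1 - t ** 2))[:, :, None] * X[:, None, :]    # d f / d a_j
    return np.concatenate([Ja.reshape(len(X), -1), t], axis=1) / np.sqrt(len(b))

def kernel_drift(alpha, steps=300, eta=0.5):
    a, b = a_init.copy(), b_init.copy()
    J0 = jacobian(a, b, X)
    K0 = J0 @ J0.T                     # tangent kernel Gram matrix at init
    for _ in range(steps):
        J = jacobian(a, b, X)
        # gradient of R(alpha f(w)) / alpha^2 with a quadratic loss
        g = J.T @ (alpha * model(a, b, X) - y) / (len(X) * alpha)
        a -= eta * g[: a.size].reshape(a.shape)
        b -= eta * g[a.size:]
    J = jacobian(a, b, X)
    return np.linalg.norm(J @ J.T - K0) / np.linalg.norm(K0)

drift_small, drift_large = kernel_drift(1.0), kernel_drift(100.0)
print(drift_small, drift_large)        # the kernel barely moves for large alpha
```

The same step-size works for both values of α, because dividing the loss by α² makes the dynamics in predictor space comparable.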

This note is organized as follows:

• in Section 2, we describe the tangent model introduced in [15] and detail the case of neural networks with a single hidden layer where the tangent model turns out to be equivalent to a random feature method [24];

• in Section 3, we give simple proofs that gradient flows in the lazy regime converge linearly, to a global minimizer for over-parameterized models, or to a local minimizer for under-parameterized models. We also prove that they are identical to gradient flows associated with the tangent model, up to higher-order terms in the scaling factor α;

• in Section 4, we emphasize that lazy training is just a specific regime: it occurs for a specific range of initialization or hyper-parameters. We give criteria to check whether a given parametric model is likely to exhibit this behavior;

• finally, Section 5 shows simple numerical experiments on synthetic cases to illustrate how lazy training differs from other regimes of training (see also Figure 1).

The main motivation for this note is to present, in a simple setting, the phenomenon underlying a series of recent results [15, 12, 19, 11, 2, 3, 31] and to emphasize that they (i) are not fundamentally related to over-parameterization nor to specific neural network architectures, and (ii) correspond however to a very specific training regime that is not typically seen in practice. Our focus is on general principles and qualitative description, so we make the simplifying assumption that f is differentiable (this assumption is relevant to some extent, as it is morally true for large networks or with a large amount of data; see, e.g., [6, App. D.4]) and we sometimes provide statements without constants.

## 2 Training the tangent model

### 2.1 Tangent model

To first order around the initial parameters w0, the parametric model f reduces to the following tangent model Tf:

 Tf(w,x)=f(w0,x)+(w−w0)⋅∇wf(w0,x). (2)

The corresponding hypothesis class is affine in the space of predictors. It should be stressed that when f is a neural network, Tf is generally not a linear neural network: it is not linear in x, but in the features ∇_w f(w0, x), which generally depend non-linearly on x. For large neural networks, the dimension of these features might be much larger than the number of training samples, which makes Tf similar to non-parametric methods. Finally, if f is already a linear model, then f and Tf are identical.
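As a quick sanity check on Eq. (2), one can verify the first-order accuracy of the tangent model numerically. The scalar model below is a hypothetical stand-in of our own choosing; the error of Tf decays quadratically in ∥w − w0∥:

```python
import numpy as np

w0 = np.array([0.3, -0.2, 0.5, 0.1])   # initial parameters (arbitrary)
x = np.array([1.0, 2.0, -1.0, 0.5])    # one input point (arbitrary)

def f(w, x):
    # an illustrative nonlinear model, standing in for a neural network
    return np.sin(w @ x)

def grad_f(w, x):
    return np.cos(w @ x) * x

def Tf(w, x):
    # tangent model of Eq. (2): affine in w, nonlinear in x
    return f(w0, x) + (w - w0) @ grad_f(w0, x)

u = np.array([1.0, 0.0, 1.0, -1.0])    # a fixed direction in parameter space
err = lambda eps: abs(f(w0 + eps * u, x) - Tf(w0 + eps * u, x))
e1, e2 = err(1e-1), err(1e-2)
print(e1, e2)                          # dividing eps by 10 divides the error by ~100
```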

##### Kernel method with an offset.

In the case where ℓ(y, y′) only depends on the difference y − y′, such as the quadratic loss, training the affine model (2) is equivalent to training a linear model in the transformed variables

 (~x,~y)\coloneqq(∇wf(w0,x),y−f(w0,x)).

This is equivalent to a kernel method with the tangent kernel (see [15]):

 k(x,x′)=∇wf(w0,x)⊺∇wf(w0,x′). (3)

This kernel is different from the one generally associated with neural networks [25, 8], which involves the derivative with respect to the output layer only. Also, the output data is shifted by the model at initialization, f(w0, x). This term inherits the randomness of the initialization: it is for instance shown in [15] that f(w0, ·) converges to a Gaussian process for certain over-parameterized neural networks initialized with random normal weights. For neural networks, we can make sure that f(w0, ·) = 0 even with a random initialization by using a simple “doubling trick”: neurons in the last layer are duplicated, with the new neurons having the same input weights and opposite output weights.
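The equivalence between the affine model and the kernel method can be checked numerically: with a quadratic loss, the minimum-norm interpolant in the feature space ∇_w f(w0, ·) and the kernel interpolant with the tangent kernel (3) on the shifted targets give identical predictions. The Jacobian, offsets, and test point below are random stand-ins for an actual model:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 8, 30                        # n samples, p parameters (p > n)
J = rng.normal(size=(n, p))         # row i: grad_w f(w0, x_i)
f0 = rng.normal(size=n)             # f(w0, x_i): offsets at initialization
y = rng.normal(size=n)

# feature-space view: minimum-norm w - w0 interpolating the shifted targets
delta = np.linalg.lstsq(J, y - f0, rcond=None)[0]

# kernel view: tangent kernel Gram matrix K_ij = k(x_i, x_j), Eq. (3)
K = J @ J.T
coef = np.linalg.solve(K, y - f0)

# prediction at a test point with gradient g and offset f0_test
g = rng.normal(size=p)              # grad_w f(w0, x_test)
f0_test = 0.7                       # f(w0, x_test)
pred_feature = f0_test + g @ delta
pred_kernel = f0_test + (g @ J.T) @ coef
print(pred_feature, pred_kernel)    # identical up to numerical precision
```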

##### Computational differences.

Figure 3 displays the SGD algorithm for the parametric model and for the tangent model side by side, in order to highlight the small differences. The (optional) scaling factor α allows one to recover the lazy regime when set to a large value (its role is detailed in the next section). For neural networks, the computational complexity per iteration is of the same order; the main difference is that for the tangent model, the forward and backward passes are done with the weights at initialization instead of the current weights. Another difference is that in the lazy regime, all the training information lies in small variations of the output around its initialization, which might make this regime unstable, e.g., for network compression [14].

### 2.2 Limit kernels and random feature

In this section, we show that the tangent kernel is a random feature kernel for neural networks with a single hidden layer. Consider a hidden layer of size m and an activation function σ:

 fm(w,x)=1√mm∑j=1bj⋅σ(aj⋅x),

with parameters w = (a1, b1, …, am, bm) (we have omitted the bias/intercept, which is recovered by fixing the last coordinate of x to 1). This scaling by 1/√m is the same as in [12] and leads to a non-degenerate limit of the kernel as m → ∞. (Since the definition of gradients depends on the choice of a metric, this scaling is not of intrinsic importance: rather, it reflects that we work with the Euclidean metric on the parameter space. Scaling this metric by 1/m, a natural choice suggested by the Wasserstein metric, would call for a scaling of the model by 1/m; see, e.g., [6, Sec. 2.2]. The choice of scaling will however become important when dealing with training; see also the discussion in Section 4.2.2.) The associated tangent kernel in Eq. (3) is the sum of two kernels km = k(a)m + k(b)m, one for each layer, where

 k(a)m(x,x′)=1mm∑j=1(x⋅x′)b2jσ′(aj⋅x)σ′(aj⋅x′) and k(b)m(x,x′)=1mm∑j=1σ(aj⋅x)σ(aj⋅x′).

If we assume that the initial weights aj (resp. bj) are independent samples of a distribution on ℝ^d (resp. on ℝ), these are random feature kernels [24] that converge, as m → ∞, to the kernels

 k(a)(x,x′)=E(a,b)[(x⋅x′)b2σ′(a⋅x)σ′(a⋅x′)] and k(b)(x,x′)=Ea[σ(a⋅x)σ(a⋅x′)].

The second component k(b), corresponding to the differential with respect to the output layer, is the one traditionally used to make the link between these networks and random features [25]. When σ is the rectified linear unit activation and the distribution of the input weights a is rotation invariant in ℝ^d, one has the following explicit formulae [7]:

 k(a)(x,x′)=(x⋅x′)E(b2)2π(π−φ), k(b)(x,x′)=∥x∥∥x′∥E(∥a∥2)2πd((π−φ)cosφ+sinφ), (4)

where φ is the angle between the two vectors x and x′. See Figure 4 for an illustration of this kernel. The link with random sampling is lost for deeper neural networks, because the non-linearities do not commute with expectations, but it is shown in [15] that tangent kernels still converge as the size of the networks increases, for certain architectures.
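The closed form for k(b) in Eq. (4) can be checked by Monte Carlo. This check is our own addition; we take standard normal input weights, so that E∥a∥² = d and the prefactor in (4) simplifies to ∥x∥∥x′∥/(2π):

```python
import numpy as np

rng = np.random.default_rng(4)
d, m = 3, 200_000
x = np.array([1.0, 0.0, 0.0])
xp = np.array([0.6, 0.8, 0.0])

# Monte Carlo random-feature estimate of k^(b) with a ~ N(0, I_d)
A = rng.normal(size=(m, d))
k_mc = np.mean(np.maximum(A @ x, 0) * np.maximum(A @ xp, 0))

# closed form (4) for the ReLU activation with rotation-invariant weights;
# E||a||^2 = d for standard normal weights, so the prefactor simplifies
phi = np.arccos(x @ xp / (np.linalg.norm(x) * np.linalg.norm(xp)))
k_exact = (np.linalg.norm(x) * np.linalg.norm(xp) / (2 * np.pi)
           * ((np.pi - phi) * np.cos(phi) + np.sin(phi)))
print(k_mc, k_exact)
```

With 200,000 random features, the empirical average matches the closed form to within a few percent.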

## 3 Analysis of lazy training dynamics

### 3.1 Theoretical setting

This section is devoted to the theoretical analysis of lazy training dynamics under simple assumptions. Our goal is to show that they are essentially the same as the training dynamics for the tangent model of Eq. (2) when the scaling factor α is large. For theoretical purposes, the space of predictors F is endowed with the structure of a separable Hilbert space, and we consider an objective function R on F with a global minimizer f∗. Our assumptions are the following:

###### Assumption 3.1.

The parametric model f is differentiable with a Lipschitz-continuous differential Df (note that Df_w is a continuous linear map from the parameter space to F, and its Lipschitz constant is defined with respect to the operator norm; when F is finite-dimensional, Df_w can be identified with the Jacobian matrix, and we adopt this matrix notation throughout for simplicity). Moreover, R is μ-strongly convex and has a ν-Lipschitz derivative.

This setting covers two cases of interest, where is built from a loss function :

• (interpolation) given a finite data set of input/output pairs (x_i, y_i), i = 1, …, n, we wish to find a model that fits these points. Here we define the objective function as R(g) = (1/n) ∑_{i=1}^n ℓ(g(x_i), y_i), and F can be identified with ℝ^n endowed with the Euclidean structure;

• (statistical learning) if instead one desires a good fit on a (hypothetically) infinite data set, one may model the data as independent samples of a couple of random variables (X, Y) distributed according to a probability distribution ρ. The objective R is then the expected or population loss, and F is the space of functions that are square-integrable with respect to ρ_X, the marginal of ρ on the input space.

##### Scaled objective function.

For a scale factor α > 0, we introduce the scaled objective functional

 Fα(w)\coloneqqR(αf(w))/α2. (5)

This scaling factor was motivated in Section 1.1 as a means to reach the lazy regime when set to a large value. Here we have also divided the objective by α², so that the limit of training algorithms is well-behaved. This normalization does not change the minimizers; it ensures that a step-size of order 1 remains appropriate when α is large. For a quadratic loss R(g) = ½∥g − f∗∥², this objective reduces to Fα(w) = ½∥f(w) − f∗/α∥², which amounts to learning a rescaled signal f∗/α that is close to 0.

##### Lazy and tangent gradient flows.

In the rest of this section, we study the gradient flow of the objective function Fα. This gradient flow is expected to reflect the behavior of first-order descent algorithms with small step-sizes, as the latter are known to approximate the former (see, e.g., [27] for gradient descent and [17, Thm. 2.1] for SGD). With an initialization w0, the gradient flow of Fα is the path w(t) in the space of parameters that satisfies w(0) = w0 and solves the ordinary differential equation:

 w′(t) = −∇Fα(w(t)) = −(1/α) Df⊺_{w(t)} ∇R(αf(w(t))), (6)

where Df⊺ denotes the transposed/adjoint differential. We will study this dynamic for itself, but will also compare it to the gradient flow of the objective for the tangent model (Eq. 2). The objective function for the tangent model is defined as

 ¯Fα(w)\coloneqqR(αTf(w))/α2.

and the tangent gradient flow is the path ¯w(t) that satisfies ¯w(0) = w0 and

 ¯w′(t) = −∇¯Fα(¯w(t)) = −(1/α) Df⊺_{w0} ∇R(αTf(¯w(t))).

As the gradient flow of a function that is strictly convex on the orthogonal complement of the kernel of Df_{w0}, ¯w(t) converges linearly to the unique global minimizer of ¯Fα. In particular, if f(w0) = 0, then this global minimizer does not depend on α.

### 3.2 Over-parameterized case

One generally says that a model is over-parameterized when the number of parameters exceeds the number of points to fit. The following proposition gives the main properties of lazy training under the slightly more stringent condition that Df_{w0} is surjective (equivalently, has rank dim F). As this rank gives the number of effective parameters or degrees of freedom of the model around w0, this over-parameterization assumption guarantees that any predictor around f(w0) can be fitted.

###### Theorem 3.2 (Over-parameterized lazy training).

Let κ = ν/μ denote the condition number of R and σ_min the smallest singular value of Df_{w0}. Assume that Df_{w0} is surjective, so that σ_min > 0, and that α is larger than an explicit threshold (made precise in the proof). Then for all t ≥ 0, it holds

 ∥αf(w(t))−f∗∥ ≤ √κ ∥αf(w0)−f∗∥ exp(−μ σ_min² t/4).

Moreover, as α → ∞, it holds sup_{t≥0} ∥w(t) − w0∥ = O(1/α),

 sup_{t≥0} ∥αf(w(t)) − αTf(¯w(t))∥ = O(1/α) and sup_{t≥0} ∥w(t) − ¯w(t)∥ = O(log α / α²).

The comparison with the tangent gradient flow over an infinite time horizon is new and follows mostly from Lemma A.1 in the appendix, where constants are given. Otherwise, we do not claim an improvement over [12, 19, 11, 2, 3, 31]; the idea is rather to exhibit the key arguments behind lazy training in a simplified setting.

###### Proof.

The trajectory αf(w(t)) in predictor space solves the differential equation

 (d/dt) αf(w(t)) = −Df_{w(t)} Df⊺_{w(t)} ∇R(αf(w(t))),

that involves the covariance Df_{w(t)} Df⊺_{w(t)} of the tangent kernel [15] evaluated at the current point w(t), instead of w0. Consider the radius r := σ_min/(2 L_{Df}), where L_{Df} is the Lipschitz constant of Df. By smoothness of f, it holds σ_min(Df_{w(t)}) ≥ σ_min/2 and ∥Df_{w(t)}∥ ≤ 2∥Df_{w0}∥ as long as ∥w(t) − w0∥ ≤ r. Thus Lemma 3.3 below guarantees that αf(w(t)) converges linearly, up to the time T := inf{t : ∥w(t) − w0∥ > r}. It only remains to find conditions on α so that T = ∞. The variation of the parameters can be bounded as follows for t < T:

 ∥w′(t)∥ ≤ (1/α) ∥Df_{w(t)}∥ ∥∇R(αf(w(t)))∥ ≤ (2ν/α) ∥Df_{w0}∥ ∥αf(w(t))−f∗∥.

By Lemma 3.3, it follows that for t < T,

 ∥w(t)−w(0)∥ ≤ (2ν^{3/2}/(α√μ)) ∥Df_{w0}∥ ∥αf(w0)−f∗∥ ∫_0^t e^{−(μσ_min²/4)s} ds ≤ (8κ^{3/2}/(ασ_min²)) ∥Df_{w0}∥ ∥αf(w0)−f∗∥.

This quantity is smaller than r, and thus T = ∞, when α is large enough compared to κ^{3/2} L_{Df} ∥Df_{w0}∥ ∥αf(w0) − f∗∥ / σ_min³. This is in particular guaranteed by the conditions in the theorem. The bound also implies the “laziness” property sup_{t≥0} ∥w(t) − w0∥ = O(1/α).

For the comparison with the tangent gradient flow, the first bound is obtained by applying Lemma A.1 with Σ(t) = Df_{w(t)} Df⊺_{w(t)} and λ = σ_min²/4, and noticing that the quantity denoted by K in that lemma is O(1/α) thanks to the previous bound on ∥w(t) − w0∥. For the last bound, we compute the integral over time of the bound

 α∥w′(t)−¯w′(t)∥ = ∥Df_{w(t)} ∇R(αf(w(t))) − Df_{w0} ∇R(αTf(¯w(t)))∥
 ≤ ∥Df_{w(t)} − Df_{w0}∥ ∥∇R(αf(w(t)))∥ + ∥Df_{w0}∥ ∥∇R(αf(w(t))) − ∇R(αTf(¯w(t)))∥.

It is easy to see from the derivations above that the integral of the first term is O(1/α). For the second term, we define a time horizon T_α of order log α, and on [0, T_α] we use the smoothness bound

 ∥∇R(αf(w(t))) − ∇R(αTf(¯w(t)))∥ ≤ ν ∥αf(w(t)) − αTf(¯w(t))∥

whose integral over [0, T_α] is O(log α / α), while on [T_α, ∞) we use the crude bound

 ∥∇R(αf(w(t))) − ∇R(αTf(¯w(t)))∥ ≤ ∥∇R(αf(w(t)))∥ + ∥∇R(αTf(¯w(t)))∥

whose integral over [T_α, ∞) is O(1/α) thanks to the definition of T_α and the exponential decrease of these gradients along both trajectories. ∎
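The laziness property sup_t ∥w(t) − w0∥ = O(1/α) established in the proof is easy to observe numerically. The toy setup below (a two-layer tanh network with a symmetric initialization, trained by gradient descent on Fα) is our own; the total parameter displacement shrinks roughly in proportion to 1/α:

```python
import numpy as np

rng = np.random.default_rng(6)
d, m, n = 3, 40, 8
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
A0, B0 = rng.normal(size=(m // 2, d)), rng.normal(size=m // 2)
a_init = np.vstack([A0, A0])            # symmetric init: f(w0, .) = 0
b_init = np.concatenate([B0, -B0])

def model(a, b):
    return (np.tanh(X @ a.T) @ b) / np.sqrt(m)

def grads(a, b, alpha):
    t = np.tanh(X @ a.T)
    r = (alpha * model(a, b) - y) / (n * alpha)   # from grad of R(alpha f)/alpha^2
    ga = ((b * (1 - t ** 2)) * r[:, None]).T @ X / np.sqrt(m)
    gb = t.T @ r / np.sqrt(m)
    return ga, gb

def displacement(alpha, steps=400, eta=0.5):
    a, b = a_init.copy(), b_init.copy()
    for _ in range(steps):
        ga, gb = grads(a, b, alpha)
        a -= eta * ga
        b -= eta * gb
    return np.sqrt(np.sum((a - a_init) ** 2) + np.sum((b - b_init) ** 2))

move1, move10 = displacement(1.0), displacement(10.0)
print(move1, move10, move1 / move10)    # ratio roughly the ratio of the alphas
```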

In geometrical terms, the proof above can be summarized as follows. It is a general fact that the parametric model induces a pushforward metric on the space of predictors [18]. With a certain choice of scaling, this metric hardly changes during training, equalling the inverse covariance of the tangent kernel. This makes the loss landscape essentially convex and allows us to invoke the following lemma, which shows linear convergence of strongly convex gradient flows in a time-dependent metric.

###### Lemma 3.3 (Strongly-convex gradient flow in a time-dependent metric).

Let F be a μ-strongly convex function with ν-Lipschitz continuous gradient and global minimizer y∗, and let Σ(t) be a time-dependent, continuous, self-adjoint linear operator with eigenvalues lower bounded by λ > 0 for t ≥ 0. Then solutions on [0, ∞) to the differential equation

 y′(t)=−Σ(t)∇F(y(t)),

satisfy, for t ≥ 0,

 ∥y(t)−y∗∥ ≤ (ν/μ)^{1/2} ∥y(0)−y∗∥ exp(−μλt).
###### Proof.

By strong convexity, it holds ∥∇F(y)∥² ≥ 2μ (F(y) − F(y∗)). Writing ¯F(y) := F(y) − F(y∗), it follows

 (d/dt) ¯F(y(t)) = −∇F(y(t))⊺ Σ(t) ∇F(y(t)) ≤ −λ ∥∇F(y(t))∥² ≤ −2μλ ¯F(y(t)),

and thus ¯F(y(t)) ≤ ¯F(y(0)) e^{−2μλt} by Grönwall’s Lemma. We now use the strong convexity inequality ¯F(y) ≥ (μ/2)∥y − y∗∥² in the left-hand side and the smoothness inequality ¯F(y) ≤ (ν/2)∥y − y∗∥² in the right-hand side. This yields the claimed bound. ∎
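A quick Euler simulation confirms the exponential decay bound of the lemma. The quadratic F and the time-varying metric below are illustrative choices of ours:

```python
import numpy as np

# F(y) = 0.5 * y^T H y: mu-strongly convex, nu-smooth, with mu and nu the
# extreme eigenvalues of H; Sigma(t) has eigenvalues >= lam at all times.
H = np.diag([0.5, 1.0, 2.0])
mu, nu, lam = 0.5, 2.0, 0.5

def sigma(t):
    return (1.0 + 0.5 * np.sin(t)) * np.eye(3)   # eigenvalues in [0.5, 1.5]

y = np.array([1.0, -2.0, 1.5])
y0_norm = np.linalg.norm(y)
dt, T = 1e-3, 5.0
for k in range(int(T / dt)):
    y = y - dt * sigma(k * dt) @ (H @ y)         # Euler step of y' = -Sigma(t) grad F

bound = np.sqrt(nu / mu) * y0_norm * np.exp(-mu * lam * T)
print(np.linalg.norm(y), bound)                  # trajectory satisfies the bound
```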

### 3.3 Under-parameterized case

In this section, we do not make the over-parameterization assumption and show that the lazy regime still occurs for large values of α. This covers, for instance, the case of the population loss, where F is infinite-dimensional. For this setting, we content ourselves with a qualitative statement (quantitative statements would involve the smallest positive singular value of Df_{w0}, which is anyway hard to control), proved in Appendix B.

###### Theorem 3.4 (Under-parameterized lazy training).

Assume that the rank of Df_w is constant on a neighborhood of w0. Then there exists α0 > 0 such that for all α ≥ α0, the gradient flow (6) converges at a geometric rate (asymptotically independent of α) to a local minimum of Fα.

The assumption that the rank is locally constant holds generically, due to the lower semi-continuity of the rank function. In particular, it holds with probability 1 if w0 is chosen at random according to an absolutely continuous probability measure. In this under-parameterized case, the limit is in general not a global minimum, because the image of Df_{w0} is a proper subspace of F which may not contain the global minimizer of R, as pictured in Figure 2. Thus it cannot be excluded that there are models with parameters far from w0 that have a smaller loss. This fact is clearly observed experimentally in Section 5 (Figure 6-(b)). Finally, a comparison with the tangent gradient flow as in Theorem 3.2 could be shown along the same lines, but would be technically slightly more involved because differential geometry comes into play.

##### Relationship to the global convergence result in [6].

A consequence of Theorem 3.4 is that when training a neural network with SGD to minimize a population loss (i.e., with infinitely many samples), lazy training gets stuck in a local minimum. In contrast, it is shown in [6] that gradient flows of neural networks with a single hidden layer converge to global optimality in the over-parameterization limit if initialized with enough diversity in the weights. This is not a contradiction, since Theorem 3.4 assumes a finite number of parameters. For lazy training, the population loss also converges to its minimum as m increases, provided the tangent kernel converges to a universal kernel as m → ∞. However, this convergence might be unreasonably slow and does not compare to the actual practical performance of neural networks, as Figure 1-(c) suggests. As a side note, we stress that the global convergence result in [6] is not limited to lazy dynamics but also covers highly non-linear dynamics, including “active” learning behaviors such as those seen in Figure 1.

## 4 Range of the lazy regime

In this section, we derive a simplified rule to determine whether a certain choice of initialization leads to lazy training. Our aim is to emphasize that this regime is due to the choice of scaling, which is often implicit. We proceed informally and do not claim mathematical rigor.

### 4.1 Informal view

Suppose that the training algorithm on the tangent model from Eq. (2) converges to a parameter w∗. This tangent model is an accurate approximation of f throughout training when the second-order remainder, of order L_{Df}∥w∗ − w0∥², remains negligible until convergence, where L_{Df} is the Lipschitz constant of w ↦ Df_w. With the rough simplification that the remainder should be small compared to ∥Df_{w0}∥∥w∗ − w0∥, this is equivalent to the property

 ∥w∗−w0∥≪∥Dfw0∥/LDf.

Making also the approximation ∥Tf(w∗) − f(w0)∥ ≈ ∥Df_{w0}∥∥w∗ − w0∥ leads to the following rule of thumb: lazy training occurs if

 ∥Tf(w∗)−f(w0)∥≪∥Dfw0∥2/LDf. (7)

This bound compares how close the model at initialization is to the best linear model, with the extent of validity of the linear approximation (it is a simplified version of the hypothesis of Theorem 3.2).
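For reference, the rule of thumb (7) can be recovered from the Taylor remainder in a few lines; a sketch under the rough simplifications of this section:

```latex
% second-order remainder of the tangent approximation:
\|f(w^*, \cdot) - Tf(w^*, \cdot)\| \;\lesssim\; L_{Df}\,\|w^* - w_0\|^2 .
% requiring it to be negligible compared to the first-order variation
% \|Tf(w^*) - f(w_0)\| \approx \|Df_{w_0}\|\,\|w^* - w_0\| gives
L_{Df}\,\|w^* - w_0\|^2 \ll \|Df_{w_0}\|\,\|w^* - w_0\|
\;\Longleftrightarrow\;
\|w^* - w_0\| \ll \|Df_{w_0}\| / L_{Df} ,
% and substituting \|w^* - w_0\| \approx \|Tf(w^*) - f(w_0)\| / \|Df_{w_0}\|
% yields Eq. (7):
\|Tf(w^*) - f(w_0)\| \ll \|Df_{w_0}\|^2 / L_{Df} .
```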

##### Scaling factor.

The interest of Eq. (7) is that it allows for simplified informal considerations. For instance, consider a scaling factor α and an initialization such that f(w0) = 0. Then for the scaled model αf, the left-hand side of (7) does not grow with α while the right-hand side is proportional to α. Thus, one is bound to reach the lazy regime when α is large.

##### Dealing with non-differentiability.

In practice, one often uses models that are only differentiable in a weaker sense, such as neural networks with ReLU activations or max-pooling. Then the quantity L_{Df} involved in Eq. (7) is not immediately well-defined and should be translated into estimates on the stability of the model's differential during training. It is one of the important technical contributions of [12, 19, 11, 2, 3, 31] to have studied such aspects rigorously in the non-differentiable setting.

### 4.2 Examples

#### 4.2.1 Homogeneous models

A parametric model is said to be q-homogeneous in the parameters (for some q > 0) if for all λ > 0 it holds

 f(λw,x)=λqf(w,x).

For such models, changing the magnitude of the initialization by a factor λ is equivalent to changing the scale factor, with the relationship α = λ^q. Indeed, the k-th derivative of a q-homogeneous function is (q − k)-homogeneous. It follows that if the initialization is multiplied by λ, then the ratio of the two sides of Eq. (7) is multiplied by λ^q. This explains why neural networks initialized with large weights, but at the same time close to 0 in the space of predictors, display a lazy regime. In our experiments, large weights correspond to a high variance at initialization (see Figures 1 and 6(b) for 2-homogeneous examples).
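The homogeneity properties used above are easy to verify on a two-layer ReLU network, which is 2-homogeneous when both layers are scaled jointly; a small check of our own:

```python
import numpy as np

rng = np.random.default_rng(9)
d, m = 4, 6
a = rng.normal(size=(m, d))
b = rng.normal(size=m)
x = rng.normal(size=d)

def f(a, b, x):
    # two-layer ReLU network: 2-homogeneous in the parameters (a, b)
    return b @ np.maximum(a @ x, 0.0)

def jac(a, b, x):
    # differential with respect to (a, b): (q - 1) = 1-homogeneous
    s = (a @ x > 0).astype(float)
    Ja = (b * s)[:, None] * x[None, :]          # d f / d a_j
    Jb = np.maximum(a @ x, 0.0)                 # d f / d b_j
    return np.concatenate([Ja.ravel(), Jb])

lam = 3.0
f_scaled, f_ref = f(lam * a, lam * b, x), lam ** 2 * f(a, b, x)
j_scaled, j_ref = (np.linalg.norm(jac(lam * a, lam * b, x)),
                   lam * np.linalg.norm(jac(a, b, x)))
print(f_scaled, f_ref)    # q = 2: outputs scale by lam^2
print(j_scaled, j_ref)    # differential scales by lam^(q-1) = lam
```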

#### 4.2.2 Large shallow neural networks

In order to understand how lazy training can be intertwined with over-parameterization, let us consider the simplest non-trivial example: a parametric model of the form

 f(w,x)=α(m)m∑i=1ϕ(θi,x),

where the parameters are w = (θ1, …, θm) and ϕ is assumed differentiable with a Lipschitz derivative. This includes neural networks with a single hidden layer, where ϕ((a, b), x) = b·σ(a·x) (when σ is the ReLU activation function, differentiability requires additional assumptions; see, e.g., [6, App. D.4]). Our notation indicates that the normalization α(m) should depend on m. Indeed, taking the limit m → ∞ makes the quantities in Eq. (7) explode or vanish, depending on α(m), thus leading to lazy training or not.

Now consider an initialization with independent and identically distributed variables θ1, …, θm such that, writing Φ(θ) := ϕ(θ, ·), one has E[Φ(θi)] = 0. The initial model satisfies:

 E∥f(w0)∥2=mα(m)2E∥Φ(θ)∥2, E∥Dfw0∥2=mα(m)2E∥DΦθ∥2 and LDf=α(m)LDΦ.

Thus the terms involved in the criterion of Eq. (7) satisfy (hiding constants that do not depend on m):

 E∥Tf(w∗)−f(w0)∥≲max{1,√mα(m)}, E∥Dfw0∥2/LDf≈mα(m).

It follows, looking at Eq. (7), that the criterion for lazy training is satisfied as m grows whenever mα(m) → ∞. For instance, Du et al. [12] consider α(m) = 1/√m, and Li et al. [19] consider an initialization whose standard deviation depends on m with a 2-homogeneous ϕ, which amounts to the same scaling and leads to lazy training. It is more difficult to analyze this ratio for deeper neural networks, as studied in [11, 2, 3, 31].

##### Mean-field limit.

Note that this scaling contrasts with the scaling α(m) = 1/m chosen in a series of recent works that study the mean-field limit of neural networks with a single hidden layer [23, 6, 26, 28]. This scaling is precisely the one that maintains a ratio of order 1 in Eq. (7) and allows the training dynamics to converge, as m → ∞, to a non-degenerate limit described by a partial differential equation. As a side note, this scaling leads to a vanishing derivative Df_{w0} as m → ∞: this might appear ill-posed, but it is actually not an intrinsic fact. Indeed, it is due to the choice of the Euclidean metric on the space of parameters, which could itself be rescaled with m to give a non-degenerate limit [6]. In contrast, the ratio in Eq. (7) is independent of the choice of the scaling of the metric on the space of parameters.

## 5 Numerical experiments

In our numerical experiments, we consider the following neural network with a single hidden layer:

 f(w,x)=m∑j=1bjmax(aj⋅x,0),

with parameters w = (a1, b1, …, am, bm). These parameters are initialized randomly and independently according to a normal distribution, except when using the “doubling trick” mentioned in Section 2. In that case (assuming m even), the parameters with index j ≤ m/2 are random as above, and we set, for j > m/2, a_j = a_{j−m/2} and b_j = −b_{j−m/2}. The input data are uniformly random on the unit sphere, and the output data are given by the output of a neural network of the same form with suitably normalized random parameters. We chose the quadratic loss and performed batch gradient descent with a small step-size on a finite data set, except for Figure 6-(b), which was obtained with mini-batch SGD with a small step-size and where each sample is used only once.
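The “doubling trick” used here is easy to check in code: duplicating each hidden neuron with an opposite output weight makes the network output identically zero at initialization, whatever the random weights (a minimal sketch):

```python
import numpy as np

rng = np.random.default_rng(5)
d, m = 4, 10                            # m even
A = rng.normal(size=(m // 2, d))        # random first-layer weights
B = rng.normal(size=m // 2)             # random output weights
a = np.vstack([A, A])                   # a_j = a_{j - m/2} for j > m/2
b = np.concatenate([B, -B])             # b_j = -b_{j - m/2} for j > m/2

def f(a, b, x):
    return b @ np.maximum(a @ x, 0.0)   # two-layer ReLU network

X = rng.normal(size=(100, d))
outputs = np.array([f(a, b, x) for x in X])
print(np.max(np.abs(outputs)))          # zero (up to round-off) at initialization
```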

##### Cover illustration.

Figure 1 in Section 1 was used to motivate this note and was discussed there; here we give more details on the setting. In panels (a)-(b), we have taken a small number of samples and neurons with the doubling trick, in order to show a simple dynamic. To obtain a 2-d representation, we plot a planar summary of each neuron's parameters throughout training (lines) and at convergence (dots). The blue or red colors stand for the sign of the output weight. The unit circle is displayed to help visualize the change of scale. For panel (c), we used the “doubling trick” and averaged the results over multiple runs. To make sure that the bad performance for large initialization variance is not due to a lack of regularization, we display the best test error throughout training (for kernel methods, early stopping is a form of regularization [29]).

##### Many neurons dynamics.

Figure 5 is similar to panels (a)-(b) of Figure 1, except that there are more neurons and training samples. We also show the trajectory of the parameters without the “doubling trick”, where we see that the neurons need to move slightly more than in panel (c) in order to compensate for the non-zero initialization f(w0). The good behavior for small initializations (also observed in Figure 1) can be related to the results in [22], where it is shown that gradient flows of this form initialized close to zero quantize features.

##### Increasing number of parameters.

Figure 6-(a) shows the evolution of the test error as m increases, as discussed in Section 4.2.2, for two choices of scaling function α(m), averaged over several runs, with the “doubling trick”. The scaling α(m) = 1/√m leads to lazy training, with poor generalization as m increases. This is in contrast with the scaling α(m) = 1/m, for which the test error remains relatively small for large m. The high test error for small values of m is due to the fact that the trained network then has fewer neurons than the ground-truth model: gradient descent seems to need a slight over-parameterization to perform well (more experiments with the scaling 1/m can be found in [6, 26, 23]).

##### Under-parameterized with SGD.

Finally, Figure 6-(b) illustrates the under-parameterized case. We consider a random initialization with the “doubling trick” and a moderate number of neurons. We used SGD with a small batch-size and display the final population loss (estimated on held-out samples) averaged over several runs. As shown in Theorem 3.4, SGD remains stuck in a local minimum in the lazy regime (here, for a large initialization variance). As in Figure 1, it behaves intriguingly well when the initialization is small. There is also an intermediate regime (hatched area) where convergence is very slow and the loss was still decreasing when the algorithm was stopped.

## 6 Discussion

By connecting a series of recent works, we have exhibited the simple structure behind lazy training, a situation in which a non-linear parametric model behaves like a linear one. We have studied under which conditions this regime occurs and shown that it is not limited to over-parameterized models. While the lazy training regime provides some of the first optimization-related theoretical insights for deeper models [11, 2, 3, 31, 15], we believe it does not yet explain the many successes of neural networks on challenging, high-dimensional tasks in machine learning. This is corroborated by numerical experiments in which the networks trained in the lazy regime are those that perform worst. Instead, the intriguing phenomenon that still defies theoretical understanding is the one displayed in Figures 1(c) and 6(b) for small initializations: neural networks trained with gradient-based methods (and neurons that move) have the ability to perform high-dimensional feature selection.

#### Acknowledgments

We acknowledge support from grants from Région Ile-de-France and from the European Research Council (grant SEQUOIA 724063).

## Appendix A Stability lemma

The following stability lemma is at the basis of the equivalence between lazy training and linearized model training. We limit ourselves to a rough estimate sufficient for our purposes.

###### Lemma A.1.

Let R be a μ-strongly convex function and let Σ(t) be a time-dependent positive definite operator on F whose eigenvalues are lower bounded by λ > 0 for t ≥ 0. Consider the paths y(t) and ¯y(t) on [0, ∞) that solve, for t ≥ 0,

 y′(t)=−Σ(t)∇R(y(t)) and ¯y′(t)=−Σ(0)∇R(¯y(t)).

Defining K := sup_{t≥0} ∥(Σ(t) − Σ(0)) ∇R(y(t))∥, it holds for t ≥ 0,

 ∥y(t)−¯y(t)∥ ≤ K ∥Σ(0)∥^{1/2} / (λ^{3/2} μ).
###### Proof.

Let Σ(0)^{1/2} be the positive definite square root of Σ(0), let z(t) := Σ(0)^{−1/2} y(t) and ¯z(t) := Σ(0)^{−1/2} ¯y(t), and let h be the function defined as h(t) := ½∥z(t) − ¯z(t)∥². It holds

 h′(t) = ⟨z′(t)−¯z′(t), z(t)−¯z(t)⟩
 = −⟨Σ(0)^{−1/2} Σ(t) ∇R(Σ(0)^{1/2} z(t)) − Σ(0)^{1/2} ∇R(Σ(0)^{1/2} ¯z(t)), z(t)−¯z(t)⟩
 = −⟨Σ(0)^{1/2} ∇R(Σ(0)^{1/2} z(t)) − Σ(0)^{1/2} ∇R(Σ(0)^{1/2} ¯z(t)), z(t)−¯z(t)⟩ (A(t))
  −⟨Σ(0)^{−1/2} (Σ(t)−Σ(0)) ∇R(Σ(0)^{1/2} z(t)), z(t)−¯z(t)⟩. (B(t))

Since the function z ↦ R(Σ(0)^{1/2} z) is λμ-strongly convex, one has A(t) ≤ −2λμ h(t). Using the quantity K introduced in the statement, one has also B(t) ≤ K √(2h(t)/λ). Summing these two terms yields the bound

 h′(t) ≤ K √(2h(t)/λ) − 2λμ h(t).

The right-hand side is a concave function of h(t) which is nonnegative for h(t) ≤ K²/(2λ³μ²) and negative for larger values. Since h(0) = 0, it follows that h(t) ≤ K²/(2λ³μ²) for all t ≥ 0, which yields the claimed bound. ∎