Differentiable programming is becoming an important paradigm in signal processing and machine learning. It consists in building parameterized models, sometimes with a complex architecture and a large number of parameters, and adjusting these parameters with gradient-based optimization methods so that the model fits training data. The resulting problem is in general highly non-convex: it has been observed empirically that seemingly innocuous changes in the parameterization, the optimization procedure, or the initialization can lead to the selection of very different models, even when they all fit the training data perfectly. Our goal is to showcase this effect by studying lazy training, which refers to training algorithms that select parameters close to the initialization.
This note is motivated by a series of recent articles [12, 19, 11, 2, 3, 31] showing that certain over-parameterized neural networks converge linearly to zero training loss while their parameters hardly vary. With a slightly different viewpoint, it was shown in [15] that the first phase of training behaves like a kernel regression in the over-parameterization limit, with a kernel built from the linearization of the neural network around its initialization. Blending these two points of view together, we remark that lazy training essentially amounts to kernel-based regression with a specific kernel. Importantly, in all these papers, a specific and somewhat implicit choice of scaling is made. We argue that lazy training is not so much due to over-parameterization as to this choice of scaling. By introducing a scale factor α, we see that essentially any parametric model can be trained in this lazy regime if it is initialized close enough to 0 in the space of predictors. This remark allows us to better understand lazy training: its generalization properties, when it occurs, and its downsides.
The takeaway is that guaranteed fast training is indeed possible, but at the cost of recovering a linear method (by which we mean a prediction function linearly parameterized by a potentially infinite-dimensional vector). On the upside, this draws an interesting link between neural networks and kernel methods, as noticed in [15]. On the downside, we believe that most practically useful neural networks are not trained in this regime: a clue is that, in practice, neuron weights actually move quite a lot (see, e.g., [21, 5]), and the first layer of convolutional neural networks tends to learn Gabor-like filters when randomly initialized [13, Chap. 9]. Instead, neural networks seem able to perform high-dimensional, non-linear feature selection, and it remains a fundamental theoretical challenge to understand how and when they do so.
The situation is illustrated in Figure 1, which compares standard and lazy training of a two-layer neural network, the lazy regime being obtained by increasing the scale at initialization (see Section 4). While in panel (a) the ground truth features are identified, this is not the case for lazy training in panel (b), which tries to interpolate the observations with the least displacement in parameter space (in both cases, near-zero training loss was achieved). As seen in panel (c), this behavior hinders good generalization. The plateau reached for large scales corresponds exactly to the performance of the corresponding kernel method.
Setting of supervised differentiable programming.
In this note, a model is a black box function that maps an input variable x to an output variable y (the case of multi-dimensional outputs adds no difficulty but makes notations more complicated) in a way that is consistent with observations. A common way to build models is to use a parametric model, that is, a function h(w, x) of a parameter vector w and an input x, associated with a training algorithm designed to select a parameter given a sequence of observations (x_i, y_i), i ≥ 1. When the function is differentiable in the parameters (at least in a weak sense), most training algorithms are gradient-based. They consist in choosing a smooth and often strongly-convex loss function ℓ, a (possibly random) initialization w_0, and defining recursively a sequence of parameters by accessing the model only through the gradient of h with respect to w.
For instance, stochastic gradient descent (SGD) updates the parameters as follows:

    w_{k+1} = w_k − η_k ∇h(w_k, x_k) ∇ℓ(h(w_k, x_k), y_k)      (1)

where (x_k, y_k) is the observation used at step k, ∇h and ∇ℓ are the gradients with respect to the first arguments, and (η_k) is a specified sequence of step-sizes. When the model is linear in its parameters, or in other specific cases [10, 20], such methods are known to find models that are optimal in some statistical sense, but this property is not well understood in the general case.
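To make the update of Eq. (1) concrete, here is a minimal numpy sketch for a toy two-layer model with a quadratic loss. The model, loss, and all names are our own illustration (not the note's experimental setup); gradients are computed by hand via the chain rule.

```python
import numpy as np

def model(w, x):
    """Toy parametric model h(w, x) = sum_j b_j * tanh(a_j . x), with w = (a, b)."""
    a, b = w
    return b @ np.tanh(a @ x)

def grad_model(w, x):
    """Gradient of h with respect to w = (a, b), computed by hand."""
    a, b = w
    t = np.tanh(a @ x)
    return np.outer(b * (1 - t ** 2), x), t

def sgd(w0, data, eta=0.1, epochs=200, seed=0):
    """SGD for the quadratic loss l(y', y) = (y' - y)^2 / 2, as in Eq. (1)."""
    rng = np.random.default_rng(seed)
    a, b = w0[0].copy(), w0[1].copy()
    for _ in range(epochs):
        for i in rng.permutation(len(data)):
            x, y = data[i]
            err = model((a, b), x) - y          # derivative of the loss in its first argument
            ga, gb = grad_model((a, b), x)
            a -= eta * err * ga                 # chain rule through the model
            b -= eta * err * gb
    return a, b
```

Running this on a handful of random observations drives the training loss down, which is all the generic guarantee one has in this non-convex setting.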
1.1 Content of this note
Lazy training can be introduced as follows. We start by looking at the SGD update of Eq. (1) in the space of predictors: for any input x and small step-size η_k, one has the first-order approximation

    h(w_{k+1}, x) ≈ h(w_k, x) − η_k K_{w_k}(x, x_k) ∇ℓ(h(w_k, x_k), y_k).

This is an SGD update for unregularized kernel-based regression [16, 9] with kernel K_w(x, x') = ⟨∇h(w, x), ∇h(w, x')⟩. The key point is that if the iterates (w_k) remain in a neighborhood of w_0, then this kernel is roughly constant throughout training. When h(w_0) ≈ 0 in the space of predictors, this behavior naturally arises when scaling the model as αh with a large scaling factor α. Indeed, this scaling does not change the tangent model and brings the iterates of SGD closer to w_0. This scaling is not artificial: rather, it is often implicit in practice (for instance, hidden in the choice of initialization; see Section 4). Another depiction of lazy training, with a geometrical point of view, is given in Figure 2. There, α can be interpreted as a way to stretch the manifold of predictors to bring it closer to its tangent space at initialization.
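The near-constancy of the kernel can be observed numerically. The sketch below (our own toy example, not from the note) trains a scaled model α·h by gradient descent, from an initialization chosen so that h(w_0, x) = 0, and measures how much the tangent kernel's Gram matrix moves during training; the drift shrinks as α grows.

```python
import numpy as np

def h(w, x):                                # toy model: h(w, x) = sum_j tanh(w_j * x)
    return np.tanh(w * x).sum()

def grad_h(w, x):                           # gradient of h with respect to w
    return x * (1 - np.tanh(w * x) ** 2)

def gram(w, xs):
    """Gram matrix of the tangent kernel K_w(x, x') = <grad h(w,x), grad h(w,x')>."""
    G = np.array([grad_h(w, x) for x in xs])
    return G @ G.T

def kernel_drift(alpha, steps=500, eta=0.2):
    """Relative change of the Gram matrix after training alpha*h on a quadratic loss."""
    xs = np.array([0.5, 1.0, -0.8])
    ys = np.array([0.3, -0.2, 0.5])
    w0 = np.array([0.3, -0.3, 0.7, -0.7])   # antisymmetric, so h(w0, x) = 0 for all x
    w = w0.copy()
    for _ in range(steps):                  # gradient descent on sum_i (alpha h - y_i)^2 / (2 alpha^2)
        g = sum((alpha * h(w, x) - y) * grad_h(w, x) / alpha for x, y in zip(xs, ys))
        w -= eta * g
    K0, K1 = gram(w0, xs), gram(w, xs)
    return np.linalg.norm(K1 - K0) / np.linalg.norm(K0)
```

With α = 1 the kernel drifts noticeably during training, while with α = 100 it barely moves, which is the lazy regime.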
This note is organized as follows:
- in Section 3, we give simple proofs that gradient flows in the lazy regime converge linearly, to a global minimizer for over-parameterized models or to a local minimizer for under-parameterized models. We also prove that they are identical to the gradient flows associated with the tangent model, up to higher-order terms in the scaling factor α;
- in Section 4, we emphasize that lazy training is just a specific regime: it occurs for a specific range of initializations or hyper-parameters. We give criteria to check whether a given parametric model is likely to exhibit this behavior;
The main motivation for this note is to present, in a simple setting, the phenomenon underlying a series of recent results [15, 12, 19, 11, 2, 3, 31], and to emphasize that these results (i) are not fundamentally related to over-parameterization nor to specific neural network architectures, and (ii) correspond to a very specific training regime that is not typically seen in practice. Our focus is on general principles and qualitative description, so we make the simplifying assumption that h is differentiable in the parameters (this assumption is relevant to some extent, as it is morally true for large networks or with a large amount of data; see, e.g., [6, App. D.4]), and we sometimes provide statements without constants.
2 Training the tangent model
2.1 Tangent model
In first-order approximation around the initial parameters w_0, the parametric model h reduces to the following tangent model h̄:

    h̄(w, x) := h(w_0, x) + ⟨w − w_0, ∇h(w_0, x)⟩.      (2)

The corresponding hypothesis class is affine in the space of predictors. It should be stressed that when h is a neural network, h̄ is generally not a linear neural network: it is linear in w, but in features ∇h(w_0, x) that generally depend non-linearly on the input x. For large neural networks, the dimension of these features might be much larger than the number of training samples, which makes h̄ similar to non-parametric methods. Finally, if h is already a linear model, then h and h̄ are identical.
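A quick numerical check of the tangent model of Eq. (2), on a toy model of our own choosing: the approximation error is second order in the size of the perturbation ‖w − w_0‖.

```python
import numpy as np

def h(w, x):                                   # toy non-linear model
    return np.sin(w @ x)

def grad_h(w, x):
    return np.cos(w @ x) * x

def tangent(w, w0, x):
    """Tangent model: h(w0, x) + <w - w0, grad h(w0, x)>."""
    return h(w0, x) + (w - w0) @ grad_h(w0, x)

def approx_error(eps, seed=0):
    """Error |h - tangent| for a perturbation of norm eps in a fixed direction."""
    rng = np.random.default_rng(seed)
    w0, x = rng.normal(size=3), rng.normal(size=3)
    d = rng.normal(size=3)
    d *= eps / np.linalg.norm(d)               # perturbation of size eps
    return abs(h(w0 + d, x) - tangent(w0 + d, w0, x))
```

Shrinking the perturbation by a factor 100 shrinks the error by roughly a factor 10,000, as expected for a second-order remainder.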
Kernel method with an offset.
In the case where the loss ℓ only depends on the difference of its arguments, such as the quadratic loss, training the affine model (2) is equivalent to training a linear model on the shifted outputs y_i − h(w_0, x_i). This is equivalent to a kernel method (see [15]) with the tangent kernel

    K(x, x') = ⟨∇h(w_0, x), ∇h(w_0, x')⟩.      (3)

This kernel is different from the one generally associated with neural networks [25, 8], which involves the derivative with respect to the output layer only. Also, the output data is shifted by the value h(w_0, x_i) of the model at initialization. This term inherits the randomness of the initialization: it is for instance shown in [15] that h(w_0, ·) converges to a Gaussian process for certain over-parameterized neural networks initialized with random normal weights. For neural networks, we can make sure that h(w_0, ·) = 0 even with a random initialization by using a simple "doubling trick": neurons in the last layer are duplicated, with the new neurons having the same input weights and opposite output weights.
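The "doubling trick" is straightforward to implement; a sketch (variable names are ours):

```python
import numpy as np

def doubled_init(m, d, rng):
    """Initialize 2m hidden neurons so that the network output is identically zero.

    Each of the m random neurons is duplicated with the same input weights
    and the opposite output weight, so contributions cancel pairwise.
    """
    a = rng.normal(size=(m, d))           # random input weights
    b = rng.normal(size=m)                # random output weights
    return np.vstack([a, a]), np.concatenate([b, -b])

def net(a, b, x):
    """Single-hidden-layer ReLU network."""
    return b @ np.maximum(a @ x, 0.0)
```

The output is exactly zero at initialization for every input, while the distribution of the individual neurons is unchanged.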
We write in Figure 3 the SGD algorithm for the parametric model and for the tangent model, in order to highlight the small differences. The (optional) scaling factor α allows one to recover the lazy regime when set to a large value (its role is detailed in the next section). For neural networks, the computational complexity per iteration is of the same order in both cases; the main difference is that, for the tangent model, the forward and backward passes are done with the weights at initialization instead of the current weights. Another difference is that in the lazy regime, all the training information lies in the small displacement w − w_0, which might make this regime unstable, e.g., for network compression [14].
2.2 Limit kernels and random features
In this section, we show that the tangent kernel is a random feature kernel for neural networks with a single hidden layer. Consider a hidden layer of size m and an activation function σ:

    h(w, x) = (1/√m) Σ_{j=1}^m b_j σ(a_j · x)

with parameters w = ((a_1, b_1), …, (a_m, b_m)); we have omitted the bias/intercept, which is recovered by fixing the last coordinate of x to 1. This scaling by 1/√m is the same as in [15] and leads to a non-degenerate limit of the kernel as m → ∞. (Since the definition of gradients depends on the choice of a metric, this scaling is not of intrinsic importance; rather, it reflects that we work with the Euclidean metric on the parameters. Scaling this metric by 1/m, a natural choice suggested by the Wasserstein metric, would call for a scaling of the model by 1/m; see, e.g., [6, Sec. 2.2]. The choice of scaling will however become important when dealing with training; see also the discussion in Section 4.2.2.) The associated tangent kernel in Eq. (3) is the sum of two kernels K_m = K_m^(a) + K_m^(b), one for each layer, where

    K_m^(a)(x, x') = (x · x') (1/m) Σ_{j=1}^m b_j² σ'(a_j · x) σ'(a_j · x'),
    K_m^(b)(x, x') = (1/m) Σ_{j=1}^m σ(a_j · x) σ(a_j · x').
If we assume that the initial weights a_j (resp. b_j) are independent samples of a fixed distribution, these are random feature kernels [24] that converge, as m → ∞, to the kernels

    K^(a)(x, x') = (x · x') E[b² σ'(a · x) σ'(a · x')],   K^(b)(x, x') = E[σ(a · x) σ(a · x')].

The second component K^(b), corresponding to the differential with respect to the output layer, is the one traditionally used to make the link between these networks and random features [25]. When σ is the rectified linear unit activation and the input weights follow a standard Gaussian distribution (or, up to constants, any rotation-invariant distribution), one has the explicit formulae [7]:

    K^(a)(x, x') = E[b²] (π − φ)/(2π) (x · x'),   K^(b)(x, x') = ‖x‖‖x'‖ (sin φ + (π − φ) cos φ)/(2π),

where φ ∈ [0, π] is the angle between the two vectors x and x'. See Figure 4 for an illustration of this kernel. The link with random feature sampling is lost for deeper neural networks, because the non-linearities do not commute with expectations, but it is shown in [15] that tangent kernels still converge as the sizes of the networks increase, for certain architectures.
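The explicit formulae above can be checked by Monte Carlo sampling of the random features. The sketch below (ours) does this for standard Gaussian input weights and unit-norm inputs; k_out is the output-layer component E[σ(a·x)σ(a·x')], and k_in the input-layer factor E[σ'(a·x)σ'(a·x')]⟨x, x'⟩ (with the E[b²] factor omitted).

```python
import numpy as np

def closed_form(x, y):
    """Arc-cosine formulas for unit vectors x, y and weights a ~ N(0, I_d)."""
    c = np.clip(x @ y, -1.0, 1.0)
    phi = np.arccos(c)                                  # angle between x and y
    k_out = (np.sin(phi) + (np.pi - phi) * np.cos(phi)) / (2 * np.pi)
    k_in = (np.pi - phi) / (2 * np.pi) * (x @ y)
    return k_out, k_in

def monte_carlo(x, y, n=200_000, seed=0):
    """Empirical averages over n random ReLU features."""
    a = np.random.default_rng(seed).normal(size=(n, x.size))
    sx, sy = a @ x, a @ y
    k_out = np.mean(np.maximum(sx, 0.0) * np.maximum(sy, 0.0))
    k_in = np.mean((sx > 0) & (sy > 0)) * (x @ y)
    return k_out, k_in
```

The two estimates agree up to the Monte Carlo error, which decreases as 1/√n.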
3 Analysis of lazy training dynamics
3.1 Theoretical setting
This section is devoted to the theoretical analysis of lazy training dynamics under simple assumptions. Our goal is to show that they are essentially the same as the training dynamics of the tangent model of Eq. (2) when the scaling factor α is large. For theoretical purposes, the space of predictors is endowed with the structure of a separable Hilbert space F, and we consider an objective function R : F → ℝ with a global minimizer y*. Our assumptions are the following:
The parametric model h is differentiable with a Lipschitz derivative Dh (note that Dh(w) is a continuous linear map from the parameter space to F, and the Lipschitz constant of Dh is defined with respect to the operator norm; when F is finite-dimensional, Dh(w) can be identified with the Jacobian matrix, and we adopt this matrix notation throughout for simplicity). Moreover, R is strongly convex and has a Lipschitz derivative.
This setting covers two cases of interest, where R is built from a loss function ℓ:
- (interpolation) given a finite data set of inputs/outputs (x_i, y_i), i = 1, …, n, we wish to find a model that fits these points. Here we define the objective function as R(g) = (1/n) Σ_i ℓ(g(x_i), y_i), and F can be identified with ℝ^n endowed with the Euclidean structure;
- (statistical learning) if instead one desires a good fit on a (hypothetically) infinite data set, one may model the data as independent samples of a pair (X, Y) of random variables distributed according to a probability distribution ρ. The objective is the expected, or population, loss R(g) = E[ℓ(g(X), Y)], and F is the space of functions that are square-integrable with respect to the marginal of ρ on the input space.
Scaled objective function.
For a scale factor α > 0, we introduce the scaled objective

    F_α(w) := (1/α²) R(α h(w)).

This scaling factor was motivated in Section 1.1 as a means to reach the lazy regime when set to a large value. Here we have also divided the objective by α², so that the limit of the training algorithms is well-behaved. This scaling does not change the minimizers, but it suggests, in practice, a step-size of order 1 even when α is large. For a quadratic loss R(g) = ‖g − y*‖²/2, this objective reduces to F_α(w) = ‖h(w) − y*/α‖²/2, which amounts to learning a signal y*/α that is close to 0.
Lazy and tangent gradient flows.
In the rest of this section, we study the gradient flow of the objective F_α. This gradient flow is expected to reflect the behavior of first-order descent algorithms with small step-sizes, as the latter are known to approximate the former (see, e.g., [27] for gradient descent and [17, Thm. 2.1] for SGD). With an initialization w_0, the gradient flow of F_α is the path (w_α(t))_{t≥0} in the space of parameters that satisfies w_α(0) = w_0 and solves the ordinary differential equation

    w_α'(t) = −(1/α) Dh(w_α(t))ᵀ ∇R(α h(w_α(t)))      (6)

where Dh(w)ᵀ denotes the transposed/adjoint differential. We will study this dynamic for itself, but will also compare it to the gradient flow of the objective for the tangent model (Eq. (2)). The objective function for the tangent model is defined as

    F̄_α(w) := (1/α²) R(α h̄(w))

and the tangent gradient flow is the path (w̄_α(t))_{t≥0} that satisfies w̄_α(0) = w_0 and

    w̄_α'(t) = −(1/α) Dh(w_0)ᵀ ∇R(α h̄(w̄_α(t))).
As the gradient flow of a function that is strictly convex on the orthogonal complement of the kernel of Dh(w_0), the tangent gradient flow w̄_α converges linearly to the unique global minimizer of F̄_α over the affine subspace w_0 + (ker Dh(w_0))^⊥. In particular, if h(w_0) = 0, then the limiting predictor α h̄(w̄_α(∞)) does not depend on α.
3.2 Over-parameterized case
One generally says that a model is over-parameterized when the number of parameters exceeds the number of points to fit. The following proposition gives the main properties of lazy training under the slightly more stringent condition that is surjective (equivalently, has rank ). As
gives the number of effective parameters or degrees of freedom of the model around, this over-parameterization assumption guarantees that any model around can be fitted.
Theorem 3.2 (Over-parameterized lazy training).
Let κ denote the condition number of Dh(w_0) and σ_min the smallest singular value of Dh(w_0). Assume that σ_min > 0 and that h(w_0) = 0. If α is large enough then, for all t ≥ 0, the objective F_α(w_α(t)) decreases at a geometric rate and the parameters satisfy sup_{t≥0} ‖w_α(t) − w_0‖ = O(1/α). Moreover, as α → ∞, the trajectory in predictor space is uniformly close to that of the tangent model: sup_{t≥0} ‖α h(w_α(t)) − α h̄(w̄_α(t))‖ = O(1/α).
The comparison with the tangent gradient flow over an infinite time horizon is new and follows mostly from Lemma A.1 in the appendix, where constants are given. Otherwise, we do not claim an improvement over [12, 19, 11, 2, 3, 31]; the idea is rather to exhibit the key arguments behind lazy training in a simplified setting.
The trajectory in predictor space g_α(t) := α h(w_α(t)) solves the differential equation

    g_α'(t) = −K_{w_α(t)} ∇R(g_α(t))

that involves the covariance K_w := Dh(w) Dh(w)ᵀ of the tangent kernel evaluated at the current point w_α(t), instead of w_0. Consider a radius r proportional to σ_min: by smoothness of h, the singular values of Dh(w_α(t)) remain of the order of those of Dh(w_0) as long as ‖w_α(t) − w_0‖ ≤ r. Thus Lemma 3.3 below guarantees that g_α converges linearly, up to the first time this bound is violated. It only remains to find conditions on α so that the parameters never leave this ball. The variation of the parameters can be bounded as follows for t ≥ 0:

    ‖w_α'(t)‖ ≤ (1/α) ‖Dh(w_α(t))‖ ‖∇R(g_α(t))‖.

By Lemma 3.3, the right-hand side decreases geometrically in time; integrating, it follows that, for t ≥ 0, ‖w_α(t) − w_0‖ = O(1/α). This quantity is smaller than r, and the bound on the singular values holds for all times, if α is large enough; this is in particular guaranteed by the conditions of the theorem. This also implies the "laziness" property sup_{t≥0} ‖h(w_α(t)) − h(w_0)‖ = O(1/α).
For the comparison with the tangent gradient flow, the first bound is obtained by applying Lemma A.1 to the trajectories α h(w_α) and α h̄(w̄_α), noticing that the quantity denoted K in that lemma vanishes as α grows, thanks to the previous bound on ‖w_α(t) − w_0‖. For the last bound, we compute the integral over t ≥ 0 of a bound on the difference between the two velocities. It is easy to see from the derivations above that the integral of the first term is of the desired order. For the second term, we choose a suitable time T: on [0, T] we use the smoothness bound, whose integral over [0, T] is of the desired order, while on [T, ∞) we use a crude bound whose integral over [T, ∞) is also controlled, thanks to the definition of T and the exponential decrease of the loss along both trajectories. ∎
In geometrical terms, the proof above can be summarized as follows. It is a general fact that the parametric model h induces a pushforward metric on the space of predictors. With a certain choice of scaling, this metric hardly changes during training, equalling the inverse covariance of the tangent kernel. This makes the loss landscape in predictor space essentially convex and allows us to invoke the following lemma, which shows linear convergence of strongly-convex gradient flows in a time-dependent metric.
Lemma 3.3 (Strongly-convex gradient flow in a time-dependent metric).
Let R be a μ-strongly-convex function with L-Lipschitz continuous gradient and global minimizer y*, and let Σ(t) be a time-dependent, continuous, self-adjoint linear operator with eigenvalues lower bounded by κ > 0 for t ∈ [0, T]. Then solutions on [0, T] to the differential equation

    g'(t) = −Σ(t) ∇R(g(t))

satisfy, for t ∈ [0, T],

    ‖g(t) − y*‖² ≤ (L/μ) ‖g(0) − y*‖² e^{−2κμt}.
Proof. By strong convexity, it holds ‖∇R(g)‖² ≥ 2μ (R(g) − R(y*)). It follows that

    d/dt (R(g(t)) − R(y*)) = −⟨∇R(g(t)), Σ(t) ∇R(g(t))⟩ ≤ −κ ‖∇R(g(t))‖² ≤ −2κμ (R(g(t)) − R(y*)),

and thus R(g(t)) − R(y*) ≤ (R(g(0)) − R(y*)) e^{−2κμt} by Grönwall's Lemma. We now use the strong convexity inequality (μ/2)‖g − y*‖² ≤ R(g) − R(y*) on the left-hand side and the smoothness inequality R(g(0)) − R(y*) ≤ (L/2)‖g(0) − y*‖² on the right-hand side. This yields the claimed bound. ∎
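The decay rate given by the Grönwall step above can be checked numerically. The sketch below (our own toy instance; all numbers are illustrative) integrates a gradient flow in a time-dependent metric whose eigenvalues are bounded below by κ, and verifies the exponential decrease of the objective at rate 2κμ.

```python
import numpy as np

A = np.diag([1.0, 3.0])            # R(g) = g^T A g / 2: strong convexity mu = 1, smoothness L = 3
KAPPA = 0.5                        # lower bound on the eigenvalues of the metric Sigma(t)

def sigma(t):
    """Time-dependent metric: diagonal, with eigenvalues always >= KAPPA."""
    return np.diag([KAPPA + 0.5 * (1 + np.sin(t)),
                    KAPPA + 0.3 * (1 + np.cos(2 * t))])

def flow(g0, T=5.0, dt=1e-3):
    """Explicit Euler discretization of g'(t) = -Sigma(t) grad R(g(t))."""
    g, t = np.array(g0, dtype=float), 0.0
    while t < T:
        g = g - dt * sigma(t) @ (A @ g)
        t += dt
    return g

def R(g):
    return 0.5 * g @ A @ g
```

Despite the oscillating metric, the objective decays at least as fast as e^{−2κμt} (here μ = 1), with plenty of slack since the actual eigenvalues exceed the lower bounds.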
3.3 Under-parameterized case
In this section, we do not make the over-parameterization assumption and show that the lazy regime still occurs for large values of α. This covers, for instance, the case of the population loss, where F is infinite-dimensional. For this setting, we content ourselves with a qualitative statement (quantitative statements would involve the smallest positive singular value of Dh(w_0), which is anyway hard to control), proved in Appendix B.
Theorem 3.4 (Under-parameterized lazy training).
Assume that h(w_0) = 0 and that the rank of Dh is constant on a neighborhood of w_0. Then there exists α_0 > 0 such that, for all α > α_0, the gradient flow (6) converges at a geometric rate (asymptotically independent of α) to a local minimum of F_α.
The assumption that the rank is locally constant holds generically, due to lower semi-continuity of the rank function. In particular, it holds with probability 1 if w_0 is chosen randomly according to an absolutely continuous probability measure. In this under-parameterized case, the limit is in general not a global minimum, because the image of Dh(w_0) is a proper subspace of F that may not contain the global minimizer of R, as pictured in Figure 2. Thus it cannot be excluded that there are models with parameters far from w_0 that have a smaller loss. This fact is clearly observed experimentally in Section 5 (Figure 6-(b)). Finally, a comparison with the tangent gradient flow as in Theorem 3.2 could be shown along the same lines, but would be technically more involved because differential geometry comes into play.
Relationship to the global convergence result in [6].
A consequence of Theorem 3.4 is that when training a neural network with SGD to minimize a population loss (i.e., in the statistical learning setting above), lazy training gets stuck in a local minimum. In contrast, it is shown in [6] that gradient flows of neural networks with a single hidden layer converge to global optimality in the over-parameterization limit if initialized with enough diversity in the weights. This is not a contradiction, since Theorem 3.4 assumes a finite number of parameters. For lazy training, the population loss also converges to its minimum when the network size m increases, if the tangent kernel converges to a universal kernel as m → ∞. However, this convergence might be unreasonably slow and does not compare to the actual practical performance of neural networks, as Figure 1-(c) suggests. As a side note, we stress that the global convergence result in [6] is not limited to lazy dynamics but also covers highly non-linear dynamics, including "active" learning behaviors such as seen in Figure 1.
4 Range of the lazy regime
In this section, we derive a simplified rule to determine whether a certain choice of initialization leads to lazy training. Our aim is to emphasize that this regime is due to the choice of scaling, which is often implicit. We proceed informally and do not claim mathematical rigor.
4.1 Informal view
Suppose that the training algorithm for the tangent model of Eq. (2) converges to the parameters w̄_∞. The tangent model is an accurate approximation of h throughout training when the second-order remainder of the Taylor expansion remains negligible until convergence, i.e., when ‖w̄_∞ − w_0‖² Lip(Dh) is small compared to the variations of the tangent model, where Lip(Dh) is the Lipschitz constant of Dh. Making rough simplifications, in particular approximating the displacement ‖w̄_∞ − w_0‖ by the quantity needed to fit the observations with the tangent model, leads to the following rule of thumb: lazy training occurs if

    ‖h̄(w̄_∞) − h(w_0)‖ ≪ ‖Dh(w_0)‖² / Lip(Dh).      (7)

This bound compares how close the model at initialization is to the best tangent-model fit with the extent of validity of the linear approximation (it is a simplified version of the hypothesis of Theorem 3.2).
The interest of Eq. (7) is that it allows for simplified informal considerations. For instance, consider a scale factor α and an initialization such that h(w_0) = 0. Then for the scaled model αh, the left-hand side of (7) does not grow with α, while the right-hand side is proportional to α. Thus, one is bound to reach the lazy regime when α is large.
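This effect can be observed directly. In the toy run below (our own example; the model and numbers are illustrative), we train α·h by gradient descent from an initialization with h(w_0) = 0 and record the total displacement of the parameters, which shrinks as α grows while the model still fits the data.

```python
import numpy as np

def h(w, x):                                 # toy model; antisymmetric w0 gives h(w0, x) = 0
    return np.tanh(w * x).sum()

def grad_h(w, x):
    return x * (1 - np.tanh(w * x) ** 2)

def displacement(alpha, steps=2000, eta=0.5):
    """Train alpha*h on a quadratic loss; return ||w_final - w_0||."""
    xs = np.array([0.4, 1.2, -0.9])
    ys = np.array([0.5, -0.3, 0.1])
    w0 = np.array([0.6, -0.6, 1.1, -1.1])    # antisymmetric: the model starts at zero
    w = w0.copy()
    for _ in range(steps):                   # gradient of (1/alpha^2) sum_i (alpha h - y_i)^2 / 2
        g = sum((alpha * h(w, x) - y) * grad_h(w, x) / alpha for x, y in zip(xs, ys))
        w -= eta * g
    return np.linalg.norm(w - w0)
```

The displacement for α = 100 is orders of magnitude smaller than for α = 1, which is the lazy regime in action.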
Dealing with non-differentiability.
In practice, one often uses models that are differentiable only in a weaker sense, such as neural networks with ReLU activations or max-pooling. Then the second-order quantity involved in Eq. (7) does not immediately make sense and should be translated into estimates on the stability of the tangent kernel. It is one of the important technical contributions of [12, 19, 11, 2, 3, 31] to have studied such aspects rigorously in the non-differentiable setting.
4.2.1 Homogeneous models
A parametric model is said to be q-homogeneous in the parameters (q > 0) if, for all λ > 0 and all parameters w, it holds

    h(λw, x) = λ^q h(w, x).

For such models, changing the magnitude of the initialization by a factor λ is equivalent to changing the scale factor, with the relationship α = λ^q. Indeed, the k-th derivative of a q-homogeneous function is (q − k)-homogeneous. It follows that if the initialization is multiplied by λ, then the right-hand side of Eq. (7) is multiplied by λ^q while the left-hand side is essentially unchanged. This explains why neural networks initialized with large weights, but at the same time close to 0 in the space of predictors, display a lazy regime. In our experiments, large weights correspond to a high variance at initialization (see Figures 1 and 6(b) for 2-homogeneous examples).
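For a single-hidden-layer ReLU network with both layers trained, q = 2; a quick check of the homogeneity relation (our own snippet):

```python
import numpy as np

def relu_net(a, b, x):
    """2-homogeneous model: scaling all parameters by lam scales the output by lam^2."""
    return b @ np.maximum(a @ x, 0.0)
```

By this homogeneity, multiplying the initialization by λ while keeping the targets fixed amounts, up to reparameterization, to training with scale factor α = λ².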
4.2.2 Large shallow neural networks
In order to understand how lazy training can be intertwined with over-parameterization, let us consider the simplest non-trivial example of a parametric model of the form

    h_m(w, x) = α(m) Σ_{j=1}^m φ(θ_j, x)

where the parameters are w = (θ_1, …, θ_m) and φ is assumed differentiable with a Lipschitz derivative. This includes neural networks with a single hidden layer, where θ_j collects the input and output weights of neuron j (when σ is the ReLU activation function, differentiability requires some care; see, e.g., [6, App. D.4]). Our notation makes explicit that the normalization α(m) should depend on m. Indeed, taking the limit m → ∞ makes the quantities in Eq. (7) explode or vanish, depending on α(m), thus leading to lazy training or not.

Now consider an initialization with independent and identically distributed variables θ_j such that E[φ(θ_j, ·)] = 0. The initial model is then of magnitude O(α(m)√m). Thus the terms involved in the criterion of Eq. (7) satisfy (hiding constants that depend neither on m nor on α(m)) a ratio of order α(m)·m: in particular, the scaling α(m) = 1/√m leads to lazy training as m → ∞. One may instead consider an initialization with a larger standard deviation with a homogeneous φ, which amounts to the same scaling and also leads to lazy training. It is more difficult to analyze this ratio for deeper neural networks, as studied in [11, 2, 3, 31].
Note that this scaling contrasts with the scaling α(m) = 1/m chosen in a series of recent works that study the mean-field limit of neural networks with a single hidden layer [23, 6, 26, 28]. This is precisely the scaling that maintains a ratio of order 1 in Eq. (7) and allows the training dynamics to converge, as m → ∞, to a non-degenerate limit described by a partial differential equation. As a side note, this scaling leads to a differential Dh_m that vanishes as m → ∞: this might appear ill-posed, but it is not an intrinsic fact. Indeed, it is due to the choice of the Euclidean metric on the parameters, which could itself be scaled by 1/m to give a non-degenerate limit. In contrast, the ratio in Eq. (7) is independent of the choice of the scaling of the metric on the space of parameters.
5 Numerical experiments
In our numerical experiments, we consider a neural network with a single hidden layer and ReLU activation, of the form discussed in Section 4.2.2. Its parameters are initialized randomly and independently according to a normal distribution, except when using the "doubling trick" mentioned in Section 2: in that case (assuming m even), the parameters with index j ≤ m/2 are random as above, and the neuron with index j > m/2 copies the input weights of neuron j − m/2 and takes the opposite output weight. The input data are uniformly random on the unit sphere, and the output data are given by the output of a smaller neural network of the same form with random parameters, normalized so that the outputs are of order 1. We chose the quadratic loss and performed batch gradient descent with a small step-size on a finite data set, except for Figure 6-(b), which was obtained with mini-batch SGD with a small step-size, where each sample is used only once.
Figure 1 in Section 1 was used to motivate this note and was discussed there; here we give more details on the setting. In panels (a)-(b), we have taken a small number of samples and neurons, with the doubling trick, in order to show a simple dynamic. To obtain a 2-d representation, we plot the position of each neuron, suitably normalized, throughout training (lines) and at convergence (dots). The blue and red colors stand for the sign of the corresponding output weight. The unit circle is displayed to help visualize the change of scale. In panel (c), we used the "doubling trick" and averaged the results over multiple experiments. To make sure that the bad performance for large scales is not due to a lack of regularization, we display the best test error throughout training (for kernel methods, early stopping is a form of regularization [29]).
Many-neuron dynamics.
Figure 5 is similar to panels (a)-(b) of Figure 1, except that there are more neurons and training samples. We also show the trajectory of the parameters without the "doubling trick", where we see that the neurons need to move slightly more than in panel (c) in order to compensate for the non-zero initialization h(w_0). The good behavior for small scales (also observed in Figure 1) can be related to the results in [22], where it is shown that gradient flows of this form initialized close to zero quantize features.
Increasing the number of parameters.
Figure 6-(a) shows the evolution of the test error as m increases, as discussed in Section 4.2.2, for two choices of scaling functions α(m), averaged over several experiments, with the "doubling trick". The scaling α(m) = 1/√m leads to lazy training, with poor generalization as m increases. This is in contrast to the scaling α(m) = 1/m, for which the test error remains low for large m. The high test error for small values of m is due to the fact that the ground truth model has a certain number of neurons: gradient descent seems to need a slight over-parameterization to perform well (more experiments with the scaling α(m) = 1/m can be found in [6, 26, 23]).
Under-parameterized case with SGD.
Finally, Figure 6-(b) illustrates the under-parameterized case. We consider a random initialization with the "doubling trick". We used SGD with a small batch size and display the final population loss (estimated on fresh samples) averaged over multiple experiments. As predicted by Theorem 3.4, SGD remains stuck in a local minimum in the lazy regime, i.e., here for large initialization scales. As in Figure 1, it behaves intriguingly well when the scale is small. There is also an intermediate regime (hatched area) where convergence is very slow and the loss was still decreasing when the algorithm was stopped.
By connecting a series of recent works, we have exhibited the simple structure behind lazy training, a situation in which a non-linear parametric model behaves like a linear one. We have studied under which conditions this regime occurs and shown that it is not limited to over-parameterized models. While the lazy training regime provides some of the first optimization-related theoretical insights for deeper models [11, 2, 3, 31, 15], we believe it does not yet explain the many successes of neural networks that have been observed on various challenging, high-dimensional tasks in machine learning. This is corroborated by numerical experiments in which networks trained in the lazy regime are those that perform worst. Instead, the intriguing phenomenon that still defies theoretical understanding is the one displayed in Figures 1(c) and 6(b) for small scales: neural networks trained with gradient-based methods (and neurons that move) have the ability to perform high-dimensional feature selection.
We acknowledge support from grants from Région Ile-de-France and from the European Research Council (grant SEQUOIA 724063).
-  Ralph Abraham, Jerrold E. Marsden, and Tudor Ratiu. Manifolds, Tensor Analysis, and Applications, volume 75. Springer Science & Business Media, 2012.
-  Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962, 2018.
-  Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918, 2018.
-  Youness Boutaib. On Lipschitz maps and their flows. arXiv preprint arXiv:1510.07614, 2015.
-  Xinyu Chen, Qiang Guan, Xin Liang, Li-Ta Lo, Simon Su, Trilce Estrada, et al. Tensorview: Visualizing the training of convolutional neural network. In Proceedings of the 1st Workshop on Distributed Infrastructures for Deep Learning, pages 11–16. ACM, 2017.
-  Lénaïc Chizat and Francis Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. In Advances in neural information processing systems, 2018.
-  Youngmin Cho and Lawrence K. Saul. Kernel methods for deep learning. In Advances in neural information processing systems, pages 342–350, 2009.
-  Amit Daniely, Roy Frostig, and Yoram Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances In Neural Information Processing Systems, pages 2253–2261, 2016.
-  Aymeric Dieuleveut and Francis Bach. Nonparametric stochastic approximation with large step-sizes. The Annals of Statistics, 44(4):1363–1399, 2016.
-  Simon Du and Jason Lee. On the power of over-parametrization in neural networks with quadratic activation. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1329–1338, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
-  Simon S. Du, Jason D. Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. arXiv preprint arXiv:1811.03804, 2018.
-  Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.
-  Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
-  Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In International Conference on Learning Representations, 2016.
-  Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, 2018.
-  Jyrki Kivinen, Alexander J. Smola, and Robert C. Williamson. Online learning with kernels. IEEE Transactions on Signal Processing, 52(8):2165–2176, 2004.
-  Harold Kushner and G. George Yin. Stochastic approximation and recursive algorithms and applications, volume 35. Springer Science & Business Media, 2003.
-  John M Lee. Smooth manifolds. In Introduction to Smooth Manifolds, pages 1–29. Springer, 2003.
-  Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, pages 8167–8176, 2018.
-  Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with relu activation. In Advances in Neural Information Processing Systems, pages 597–607, 2017.
-  Dongyu Liu, Weiwei Cui, Kai Jin, Yuxiao Guo, and Huamin Qu. Deeptracker: Visualizing the training process of convolutional neural networks. arXiv preprint arXiv:1808.08531, 2018.
-  Hartmut Maennel, Olivier Bousquet, and Sylvain Gelly. Gradient descent quantizes ReLU network features. arXiv preprint arXiv:1803.08367, 2018.
-  Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018.
-  Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in neural information processing systems, pages 1177–1184, 2008.
-  Ali Rahimi and Benjamin Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Advances in neural information processing systems, pages 1313–1320, 2009.
-  Grant M. Rotskoff and Eric Vanden-Eijnden. Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error. In Advances in neural information processing systems, 2018.
-  Damien Scieur, Vincent Roulet, Francis Bach, and Alexandre d’Aspremont. Integration methods and optimization algorithms. In Advances in Neural Information Processing Systems, pages 1109–1118, 2017.
-  Justin Sirignano and Konstantinos Spiliopoulos. Mean field analysis of neural networks: A central limit theorem. arXiv preprint arXiv:1808.09372, 2018.
-  Yuan Yao, Lorenzo Rosasco, and Andrea Caponnetto. On early stopping in gradient descent learning. Constructive Approximation, 26(2):289–315, 2007.
-  Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.
-  Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Stochastic gradient descent optimizes over-parameterized deep ReLU networks. arXiv preprint arXiv:1811.08888, 2018.
Appendix A Stability lemma
The following stability lemma underlies the equivalence between lazy training and training of the linearized model. We limit ourselves to a rough estimate that is sufficient for our purposes.
Let R be a strongly convex function and let Σ(t) be a time-dependent positive definite operator on the space of predictors with eigenvalues lower bounded by a positive constant for t ∈ [0, T]. Consider the two paths on [0, T] that solve the corresponding gradient flow equations for t ∈ [0, T]. Defining a comparison quantity K between the two dynamics, a stability bound holds for t ∈ [0, T].
Let Σ(t)^{1/2} be the positive definite square root of Σ(t), and consider the function measuring the discrepancy between the two paths after this change of variables. Since R is strongly convex, one has a lower bound on the contraction term; using the quantity K introduced in the statement, one also controls the perturbation term. Summing these two contributions yields a differential inequality whose right-hand side is a concave function of the discrepancy, nonnegative for small values and negative for higher values. It follows that, for all t ∈ [0, T], one has