Modern neural networks include millions of neurons, and one can expect that macroscopic properties of such big systems can be described by analytic models – similarly to how statistical mechanics or fluid dynamics describe macroscopic properties of big colections of physical particles. One specific approach in this general direction is to consider the limit of “large-width” neural networks, typically combined with the assumption of weight independence. There is a well-known connection between neural networks, Gaussian processes and kernel machines in this limit ([Neal, 2012, Williams, 1997]). Recently, this limit has been used, along with some complex mathematical tools, to understand the landscape of the loss surface of large networks. In particular, [Choromanska et al., 2015] establish a link to the theory of spin glasses and use this theory to explain the distribition of critical points on the loss surface; [Pennington and Bahri, 2017]
analyze the loss surface using random matrix theory;[Poole et al., 2016, Schoenholz et al., 2016] analyze propagation of information in a deep network assuming Gaussian signal distributions in wide layers.
In the present paper we apply the large-width limit to describe the collective evolution of weights under standard gradient descent used to train the network. Weight optimization is currently a topic of active theoretical research. While local convergence of gradient descent is generally well-understood ([Nesterov, 2013]), its global and statistical properties are hard to analyze due to the complex nonconvex shape of the loss surface. There are some special scenarios where the absence of spurious local minima has been proved, in particular in deep linear networks ([Kawaguchi, 2016, Laurent and von Brecht, 2017]) or for pyramidal networks trained on small training set ([Nguyen and Hein, 2017]). But in general, the loss surface is known to have many local minima or saddle points trapping or delaying optimization, see e.g. ([Safran and Shamir, 2017, Dauphin et al., 2014]). Similarly, gradient descent is known to be analytically tractable in some cases, e.g. in linear networks ([Saxe et al., 2013, Bartlett et al., 2018]
) or in ReLU networks under certain assumptions on the distribution of the inputs ([Tian, 2017]), but in general, current studies of gradient descent in large networks tend to postulate its replacement by an “effective” macroscopic model ([Shwartz-Ziv and Tishby, 2017, Chaudhari et al., 2018]).
The goal of the present paper is to derive a general macroscopic description of the gradient descent directly from the underlying finite model, without ad hoc assumptions. We do this in section 2.2, obtaining a nonlinear integro-differential transport equation. In sections 2.3,2.4
we analyze its general properties and characterize its global minima and stationary points. Interestingly, an important quantity for the transport equation is the quadratic mean-field version of the loss function: this loss does not increase under the evolution. However, the generator of the transport equation is not the formal gradient of this loss – in particular, that is why the macroscopic dynamics does have stationary points that are not global minima. Next, in sections2.5–2.7
we analyze three scenarios where the transport equation can be solved perturbatively (near a global minimizer, for small output weights, and for localized Gaussian distributions). Finally, in section3 we verify our general method on a simple one-dimensional ReLU network (essentially describing linear splines). We compare the theoretical predictions involving stationary points, their stability, and convergence rate of the dynamics with the experiment and find a good agreement between the two.
This short paper is written at a “physics level of rigor”. We strive to expose the main ideas and do not attempt to fill in all mathematical details.
2 General theory
2.1 A “generalized shallow network” model
We consider the problem of approximating a “ground truth” map . The nature of the set will not be important for the present exposition. We consider approximation by the “generalized shallow network” model:
where is some function of the weights and the input,
is the weight vector for the’th term, and is the full weight vector. The usual neural network with a single hidden layer is obtained when and
with some nonlinear activation function.111The standard network model also includes a constant term in the output layer, but we will omit it for simplicity of the exposition.
We consider the usual quadratic loss function
where is some measure on . Again, we don’t assume anything specific about : for example, it can consist of finitely many atoms (a “finite training set” scenario) or have a density w.r.t. the Lebesgue measure (assuming ; a “population average” scenario).
We will study the standard gradient descent dynamics of the weights:
where is the learning rate and is the gradient w.r.t. .
2.2 Derivation of the transport equation
We are interested in the collective behavior of the weights under gradient descent in the “wide network” limit . It is convenient to view the weight vector as a stochastic object described by a distribution with a density function such that . The common practice in network training (see e.g. [Glorot and Bengio, 2010]) is to initialize the weights randomly and independently with some density , i.e. Our immediate goal will be to derive, under suitable assumptions, the equation governing the evolution of the factors.
First we replace the system (3
) of ordinary differential equations in
by a partial differential equation in. We can view the vector field in Eq.(3
) as a “probability flux”. Then, the evolution ofcan be described by the continuity equation expressing local probability conservation:
where is the Laplacian and denotes the material derivative:
(see Appendix A.1 for details).
Expanding the continuity equation, we get
We look for factorized solutions with some density . However, it is clear that we cannot fulfill this factorization exactly because of the interaction of different weights
on the r.h.s. To decouple them, we perform the “mean field” (or “law of large numbers”) approximation:
This approximation corresponds to the limit . To obtain a finite non-vanishing r.h.s. in Eq.(5), we rescale the learning rate by setting . We discard the second term on the r.h.s. of (5), since the coefficient makes it asymptotically small compared to the first term. After all this, we obtain the equation
admitting a factorized solution. Namely, using the product form of , unwrapping the material derivative and equating same- terms, we get
Using the mean-field approximation (6), we can write
Making this replacement for in Eq.(7), dropping the index and rearranging the terms, we obtain the desired transport equation for :
2.3 Properties of the transport equation
The derived equation (8) is a nonlinear integro-differential equation in and we expect it to approximately describe the evolution of the distribution of the weights . In agreement with this interpretation, the equation (8) preserves the total probability () – this follows from the gradient form of the r.h.s. of Eq.(8) (assuming the function falls off sufficiently fast as , so that the boundary term can be omitted when integrating by parts). We also expect that, under suitable regularity assumptions, the equation preserves the nonnegativity
of the initial condition. In the sequel, our treatment of this equation will be rather heuristic; in particular, we will not distinguish between regular and weak (distributional) solutions.
It is helpful to consider Eq.(8) as a pair of an integral and a differential equations:
The quantity reflects the correlation of the current approximation error, as a function of , with the function . If is known for all , then equation (10), being first-order in , can be solved by the method of characteristics. Specifically, let be some trajectory satisfying the equation . On this trajectory, by Eq.(10),
which has the solution
This shows that, geometrically, determines the transport direction in the space, while determines the infinitesimal change of the value of .
2.4 The mean-field loss and stationary points
If is a solution of the transport equation (8), then, by a computation involving integration by parts,
This property is inherited from the original gradient descent (3). Note, however, that the transport equation (8) is not a (formal) gradient descent performed on w.r.t. – this latter would be given by and is quite different from Eq.(8). (In fact, in contrast to Eq.(8), the equation does not seem to generally admit reasonable solutions – even in simplest cases solutions may blow up in infinitesimal time).
Accordingly, the intuition behind the connection of the transport equation (8) to the mean-field loss is different from the naïve intuition of gradient descent. Note that is a convex quadratic functional in and the set of all nonnegative distributions is convex. The intuition of finite-dimensional gradient descent then suggests that the solution should invariably converge to a global minimum of . However, this is not the case with the dynamics (8) which can easily get trapped at stationary points. The meaning of the stationary point is clarified by the following proposition.
Assuming is a solution of Eq.(8), the following conditions are equivalent at any particular :
at all where .
Thus, a stationary distribution can be characterized by any of the three conditions of this proposition. Given the monotonicity (13), the proposition suggests that, under reasonable regularity assumptions, any solution of the transport equation (8) eventually converges to such a stationary distribution.
If the family of maps is sufficiently rich, then we expect a stationary distribution either to be a global minimizer of or to be concentrated on some proper subset of the -space . We state one particular proposition demonstrating this in the important case of models with linear output. We say that an approximation (1) is a model with linear output if , where and is some map. For example, the standard shallow neural network is a model with linear output. For the proposition below, we assume that the functions and the error function are square-integrable w.r.t. the measure . Recall that a set of vectors in a Hilbert space is called total if their finite linear combinations are dense in this space.
Let be a stationary distribution supported on a subset . Suppose that the functions are total in . Then
By the second stationarity condition of proposition 1, we have for all . In particular, taking the partial derivative w.r.t. ,
for all . Since the functions are total, almost everywhere w.r.t. , and hence ∎
2.5 Linearization near a global minimizer
Suppose that the solution of the transport equation converges to a limiting distribution such that i.e. (-a.e.). Consider the positive semidefinite kernel
and the corresponding integral operator
Note that the operator determines the quadratic part of the mean-field loss:
Let us represent the solution in the form with a small . Plugging this into Eq.(8) and keeping only terms linear in , we obtain
where is the integral operator with the kernel
Under this linearization, the solution to the transport equation can be written as
The operator is symmetric and negative semi-definite w.r.t. the semi-definite scalar product associated with the kernel :
This suggests that we can use the spectral theory of self-adjoint operators to analize the large– evolution of . A caveat here is that the scalar product is not strictly positive definite, in general. This issue can be addressed by considering the quotient space , where is the Hilbert space associated with the semi-definite scalar product , and . On this quotient space the scalar product is non-degenerate, and extends from to since the subspace lies in the null-space of .
Now, the self-adjoint operator has an orthogonal spectral decomposition in . In particular, suppose that
has a discrete spectrum of eigenvaluesand
–orthogonal eigenvectors. For each , the equivalence class has a unique representative such that in . On the other hand, . We can view the space as a direct sum so this gives us the full eigendecomposition of the space . Recall that we have assumed that . By Eq.(17), this means that the eigendecomposition of the vector can include only eigenvectors with stricly negative eigenvalues, and in particular it does not include an –component. We can then write, by Eq.(17) and the eigenvector expansion,
and, by Eq.(14),
We see, in particular, that the large- asymptotic of loss is determined by the distribution of the eigenvalues of and, for a particular initial condition , by the coefficients of its eigendecomposition.
For overparametrized models, we expect the subspace to be nontrivial and possibly even highly degenerate. Assuming that for some ground truth the minimum loss can be achieved, we can ask how to find the actual limiting distribution in the space of all minimizers of . In the linearized setting described above, this should be done using the already mentioned condition that expansion of over the eigenvectors of does not contain the component. In Appendix A.3, we illustrate this observation with a simple example involving a single-element set .
2.6 Models with linear output
We describe now another solvable approximation that holds for models with linear output at small values of the linear parameter. Recall that we defined such models as those where with some map and . Observe that for such models, the gradient factor in the transport equation (8) can be written as a sum of two terms:
For small the second term is small and can be dropped. Then, the transport equation simplifies to
It follows that the distribution evolves by “shifting along ”, separately at each :
where the function satisfies the equation
with the initial condition . In particular, for each , the integral – the marginal of w.r.t. – does not depend on . We will denote this marginal by , and by we will denote the operator of multiplication by , i.e.
It is convenient to introduce linear operators :
These operators are mutually adjoint w.r.t. to the scalar products and namely
The operator is self-adjoint and positive semidefinite with respect to the scalar product :
Assuming belongs to the Hilbert space associated with this scalar product, we can write the solution to Eq.(23) as
where are the complementary orthogonal projectors to the nullspace and to the strictly positive subspace of , respectively.
Using the identity and Eq.(24), the loss function can then be written as
The first, constant term on the r.h.s. is half the squared norm of the component of orthogonal to the range of the operator . The second term converges to 0 as , so the limit of the loss function is given by the first term. Suppose that the first term vanishes and suppose that the positive semidefinite operator has a pure point spectrum with eigenvalues and eigenvectors . Then Eq.(24) can be expanded as
and the loss function as
Thus, the large- evolution of loss is determined by the eigenvalues of and, for a particular , by the eigendecomposition of .
2.7 Localized Gaussian aproximation
Suppose that, for each , the distribution is approximately Gaussian with a center and a small covariance matrix :
Plugging this ansatz into the transport equation and keeping only leading terms in we obtain (see Appendix A.4):
and denotes the Hessian w.r.t. Not surprisingly, Eq.(28) coincides with the gradient descent equation for the original model (1) with . Now consider Eq.(29). If the matrix is diagonalized, then Eq.(29) is also diagonalized in the basis of matrix elements:
In particular, our “pointlike” Gaussian solution is unstable (expansive) iff has positive eigenvalues.
Consider now the special case of a model with linear output, and a distribution close to a “pointlike stationary distribution”, so that . We write the Hessian and accordingly in the block form:
Observe that, by linearity of in , and hence . Also, observe that It follows that , where we have denoted By assumption, If , this implies that . We conclude that the matrix has the block form
This shows in particular that a distribution close to a pointlike stationary distribution evolves only in the -component.
3 Application to free-knot linear splines
In this section we apply the developed general theory to the model of piecewise linear free-knot splines, which can be viewed as a simplified neural network acting on the one-dimensional input space. Specifically, let be the Lebesgue measure on and
where is the ReLU activation function. We will perform numerical simulations for this model with in Eq.(1).
Note that еру model (31) is highly degenerate in the sense that the same prediction can be obtained from multiple distributions on the parameter space :
Using the identity with Dirac delta, we get:
The derivative determines the prediction up to two constants, for example, and , which can also be expressed in terms of the distribution :
Conditions (33) on distributions are independent at different
and leave infinitely many degrees of freedom for these distributions. This property is of course shared by all models with linear output.
3.1 Global optima and stationary distributions
More generally, we can ask what are stationary distributions of the dynamics (in the sense of section 2.4 and proposition 1). Their complete general characterization is rather cumbersome, so we just consider the particular example of the ground truth function (this case is easier thanks to the constant convexity of ). One can then show the following properties (see Appendix A.5). First, any distribution supported on the subset is stationary. Second, consider the marginal distribution and its restriction to the segment . Then, for a stationary , either and is a global minimizer, or consists of a finite number of equidistantly spaced atoms in (and the approximation is a piecewise linear spline). In particular, if consists of a single point , then or
We examine in more detail the special case when a stationary measure is a single atom, i.e. . There are two possibilities: either and then is arbitrary, or and then In Fig.1 we show a series of simulations where the initial distribution is an atom, with some . We observe such distributions to converge to stationary atomic distributions of one of the above two kinds.
We consider now “pointlike” Gaussian initial distributions
, with a small standard deviationDepending on we observe two different patterns of evolution of . If the limit of the perfectly atomic distribution () is a stationary point with then the evolution is stable. In contrast, if the limit of the atomic distribution is then the evolution is unstable: near this point the weights diverge rapidly along , and then form a curve approaching a globally minimizing distribution spread relatively uniformly over see Fig.2. The existence of two patterns is explained by the linear stability study of section 2.7: the pattern depends on the sign of in Eq.(30). By Eq.(31) we have and hence, at a stationary point , . Thus, stationary points with are stable and contractive in the direction, while the stationary point is unstable and expansive in the direction.
are sampled from a distribution with a small positive variance, in the unstable case the trajectories of the weights spread along theaxis when approaching the stationary point.
3.2 The small- approximation
We can solve our model approximately in the region of small as described in section 2.6. Suppose that the initial distribution is As discussed in section 2.6, at small we expect to evolve by approximately shifting along , with some shift function . We observe indeed in an experiment that , see Fig.3(a). To further understand this dynamics, recall the approximate solutions (25),(26) (where , by our choice of the initial distribution). In our case, is the operator of multiplication by the indicator function