Multi-layer neural networks are one of the oldest approaches to statistical machine learning, dating back at least to the 1960s [Ros62]. Over the last ten years, driven by increasing computer power and larger data availability, they have emerged as a powerful tool for a wide variety of learning tasks [KSH12, GBCB16].
In this paper we focus on the classical setting of supervised learning, whereby we are given data points $(x_j, y_j)_{j \ge 1}$, indexed by $j \in \mathbb{N}$, which are assumed to be independent and identically distributed from an unknown distribution $\mathbb{P}$ on $\mathbb{R}^d \times \mathbb{R}$. Here $x_j \in \mathbb{R}^d$ is a feature vector (e.g. a set of descriptors of an image), and $y_j \in \mathbb{R}$ is a label (e.g. labeling the object in the image). Our objective is to model the dependence of the label on the feature vector in order to assign labels to previously unlabeled examples. In a two-layer neural network, this dependence is modeled as
$$\hat y(x; \boldsymbol\theta) = \frac{1}{N}\sum_{i=1}^{N} \sigma_*(x; \theta_i).$$
Here $N$ is the number of hidden units (neurons), $\sigma_* : \mathbb{R}^d \times \mathbb{R}^D \to \mathbb{R}$ is an activation function, and $\theta_1, \dots, \theta_N \in \mathbb{R}^D$ are parameters, which we collectively denote by $\boldsymbol\theta = (\theta_1, \dots, \theta_N)$. The factor $1/N$ is introduced for convenience and can be eliminated by redefining the activation. Often $\sigma_*(x; \theta_i) = a_i\,\sigma(\langle w_i, x\rangle + b_i)$ with $\theta_i = (a_i, b_i, w_i)$, for some $\sigma : \mathbb{R} \to \mathbb{R}$. Ideally, the parameters should be chosen so as to minimize the risk (generalization error) $R_N(\boldsymbol\theta) = \mathbb{E}\,\ell(y, \hat y(x; \boldsymbol\theta))$, where $\ell : \mathbb{R}\times\mathbb{R} \to \mathbb{R}$ is a certain loss function. For the sake of simplicity, we will focus on the square loss $\ell(y, \hat y) = (y - \hat y)^2$, but more general choices can be treated along the same lines.
In practice, the parameters of neural networks are learned by stochastic gradient descent (SGD) [RM51] or its variants. In the present case, this amounts to the iteration
$$\theta_i^{k+1} = \theta_i^k + 2 s_k\,\big(y_k - \hat y(x_k; \boldsymbol\theta^k)\big)\,\nabla_{\theta_i}\sigma_*(x_k; \theta_i^k). \qquad (3)$$
Here $\boldsymbol\theta^k$ denotes the parameters after $k$ iterations, $s_k$ is a step size, and $(x_k, y_k)$ is the $k$-th example. Throughout the paper, we make the following assumption:

One-pass assumption. Training examples are never revisited. Equivalently, $(x_k, y_k)_{k \ge 1}$ are i.i.d. with common distribution $\mathbb{P}$.
In large scale applications, this is not far from the truth: the data is so large that each example is visited at most a few times [Bot10]. Further, theoretical guarantees suggest that there is limited advantage to be gained from multiple passes [SSBD14]. For recent work deriving scaling limits under this assumption (in different problems) see [WML17].
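To make the setting concrete, here is a minimal one-pass SGD simulation in this spirit. The activation (tanh), the teacher-generated data, and all numerical constants are illustrative choices of ours, not taken from the paper; the update follows the iteration above, with the $1/N$ factor absorbed into the step size.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, eps, n_steps = 20, 50, 0.02, 5000

u = np.zeros(d); u[0] = 1.0          # hypothetical "teacher" direction

def sample(rng):
    # toy data distribution P (our choice): y is a noiseless
    # function of <u, x>, learnable by odd activations
    x = rng.standard_normal(d)
    return x, np.tanh(x @ u)

def predict(x, W):
    # two-layer network: hat_y = (1/N) sum_i sigma_*(x; w_i),
    # with sigma_*(x; w) = tanh(<w, x>)
    return np.tanh(W @ x).mean()

W = 0.1 * rng.standard_normal((N, d))   # iid initialization, theta_i = w_i
for k in range(n_steps):
    x, y = sample(rng)                  # fresh example: one-pass assumption
    err = y - predict(x, W)
    # SGD step for the square loss: w_i += 2*s_k*(y - hat_y)*grad sigma_*
    W += 2 * eps * err * ((1.0 - np.tanh(W @ x) ** 2)[:, None] * x[None, :])

def risk(W, n_test=2000):
    # Monte Carlo estimate of the population risk E(y - hat_y)^2
    return float(np.mean([(y - predict(x, W)) ** 2
                          for x, y in (sample(rng) for _ in range(n_test))]))
```

On this toy problem the final risk is far below that of the trivial predictor $\hat y = 0$.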
Understanding the optimization landscape of two-layer neural networks is largely an open problem even when we have access to an infinite number of examples, i.e. to the population risk $R_N(\boldsymbol\theta)$. Several studies have focused on special choices of the activation function $\sigma_*$ and of the data distribution $\mathbb{P}$, proving that the population risk has no bad local minima [SJL17, GLM17, BG17]. This type of analysis requires delicate calculations that are somewhat sensitive to the specific choice of the model. Another line of work proposes new algorithms with theoretical guarantees [ABGM14, SA15, JSA15, ZLJ16, Tia17, ZSJ17], which use initializations based on tensor factorization.
In this paper, we prove that – in a suitable scaling limit – the SGD dynamics admits an asymptotic description in terms of a certain non-linear partial differential equation (PDE). This PDE has a remarkable mathematical structure, in that it corresponds to a gradient flow in the metric space $(\mathscr{P}(\mathbb{R}^D), W_2)$: the space of probability measures on $\mathbb{R}^D$, endowed with the Wasserstein metric $W_2$. This gradient flow minimizes an asymptotic version of the population risk, which is defined for $\rho \in \mathscr{P}(\mathbb{R}^D)$ and will be denoted by $R(\rho)$. This description simplifies the analysis of the landscape of two-layer neural networks, for instance by exploiting underlying symmetries. We illustrate this by obtaining new results on several concrete examples, as well as a general convergence result for 'noisy SGD.' In the next section, we provide an informal outline, focusing on basic intuitions rather than on formal results. We then present the consequences of these ideas on a few specific examples, and subsequently state our general results.
1.1 An informal overview
A good starting point is to rewrite the population risk as
$$R_N(\boldsymbol\theta) = R_\# + \frac{2}{N}\sum_{i=1}^{N} V(\theta_i) + \frac{1}{N^2}\sum_{i,j=1}^{N} U(\theta_i, \theta_j), \qquad (4)$$
where we defined the potentials $V(\theta) \equiv -\mathbb{E}\{y\,\sigma_*(x;\theta)\}$, $U(\theta_1, \theta_2) \equiv \mathbb{E}\{\sigma_*(x;\theta_1)\,\sigma_*(x;\theta_2)\}$. In particular, $U$ is a symmetric positive-semidefinite kernel. The constant $R_\# \equiv \mathbb{E}\{y^2\}$ is the risk of the trivial predictor $\hat y = 0$.
Notice that $R_N(\boldsymbol\theta)$ depends on $\theta_1, \dots, \theta_N$ only through their empirical distribution $\hat\rho^{(N)} \equiv \frac{1}{N}\sum_{i=1}^{N}\delta_{\theta_i}$. This suggests considering a risk function defined for $\rho \in \mathscr{P}(\mathbb{R}^D)$ (we denote by $\mathscr{P}(\Omega)$ the space of probability distributions on $\Omega$):
$$R(\rho) = R_\# + 2\int V(\theta)\,\rho(d\theta) + \int U(\theta_1,\theta_2)\,\rho(d\theta_1)\,\rho(d\theta_2). \qquad (5)$$
Formal relationships can be established between $R_N(\boldsymbol\theta)$ and $R(\rho)$. For instance, under mild assumptions, $\lim_{N\to\infty}\inf_{\boldsymbol\theta} R_N(\boldsymbol\theta) = \inf_{\rho} R(\rho)$. We refer to the next sections for mathematical statements of this type.
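The decomposition of the risk into the potentials $V$ and $U$ can be verified numerically: estimating $R_\#$, $V$, $U$, and the risk on the same Monte Carlo sample, the identity holds exactly up to floating-point error. The activation, data model, and dimensions below are illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(1)
d, N, n_mc = 10, 8, 4000

# hypothetical activation and data distribution, for illustration only
X = rng.standard_normal((n_mc, d))
y = np.tanh(X[:, 0])                    # toy labels
W = rng.standard_normal((N, d)) / np.sqrt(d)

S = np.tanh(X @ W.T)                    # unit outputs sigma_*(x; theta_i)

R_sharp = np.mean(y ** 2)               # risk of the trivial predictor 0
V = -np.mean(y[:, None] * S, axis=0)    # V(theta_i) = -E[y sigma_*(x; theta_i)]
U = S.T @ S / n_mc                      # U(theta_i, theta_j) = E[sigma_i sigma_j]

# risk via the potentials: R_N = R_# + (2/N) sum_i V_i + (1/N^2) sum_ij U_ij
R_potentials = R_sharp + 2 * V.sum() / N + U.sum() / N ** 2

# direct estimate of the population risk on the same sample
y_hat = S.mean(axis=1)
R_direct = np.mean((y - y_hat) ** 2)
```

Since both quantities are computed from the same empirical expectations, the agreement is an algebraic identity, not merely asymptotic.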
Roughly speaking, $R(\rho)$ corresponds to the population risk when the number of hidden units goes to infinity, and the empirical distribution of parameters converges to $\rho$. Since $U$ is positive semidefinite, we obtain that the risk becomes convex in this limit. The fact that learning can be viewed as convex optimization in an infinite-dimensional space was indeed pointed out in the past [LBW96, BRV06]. Does this mean that the landscape of the population risk simplifies for large $N$, and descent algorithms will converge to a unique (or nearly unique) global optimum?
The answer to the last question is generally negative, and a physics analogy can explain why. Think of $\theta_1, \dots, \theta_N$ as the positions of $N$ particles in a $D$-dimensional space. When $N$ is large, the behavior of such a 'gas' of particles is effectively described by a density $\rho_t(\theta)$ (with $t$ indexing time). However, not all 'small' changes of this density profile can be realized in the actual physical dynamics: the dynamics conserves mass locally, because particles cannot move discontinuously. For instance, if $\rho_t$ is supported on $\Omega_1 \cup \Omega_2$ for two disjoint compact sets $\Omega_1, \Omega_2 \subseteq \mathbb{R}^D$ and all $t$ in some interval, then the total mass in each of these regions cannot change over that interval, i.e. $\rho_t(\Omega_1)$ and $\rho_t(\Omega_2)$ do not depend on $t$.
We will prove that stochastic gradient descent is well approximated (in a precise quantitative sense described below) by a continuum dynamics that enforces this local mass conservation principle. Namely, assume that the step size in SGD is given by $s_k = \varepsilon\,\xi(k\varepsilon)$, for $\xi : \mathbb{R}_{\ge 0} \to \mathbb{R}_{\ge 0}$ a sufficiently regular function. Denoting by $\hat\rho^{(N)}_k \equiv \frac{1}{N}\sum_{i=1}^{N}\delta_{\theta_i^k}$ the empirical distribution of parameters after $k$ SGD steps, we prove that
$$\hat\rho^{(N)}_{\lfloor t/\varepsilon\rfloor} \Rightarrow \rho_t$$
when $N \to \infty$, $\varepsilon \to 0$ (here $\Rightarrow$ denotes weak convergence). The asymptotic dynamics of $\rho_t$ is defined by the following PDE, which we shall refer to as distributional dynamics (DD):
$$\partial_t \rho_t = 2\xi(t)\,\nabla_\theta\cdot\big(\rho_t\,\nabla_\theta\Psi(\theta;\rho_t)\big), \qquad \Psi(\theta;\rho) \equiv V(\theta) + \int U(\theta,\theta')\,\rho(d\theta'). \qquad (7)$$
(Here $\nabla_\theta\cdot v$ denotes the divergence of the vector field $v$.) This should be interpreted as an evolution equation in $\mathscr{P}(\mathbb{R}^D)$. While we described the convergence to this dynamics in asymptotic terms, the results in the next sections provide explicit non-asymptotic bounds. In particular, $\rho_t$ is a good approximation of $\hat\rho^{(N)}_{\lfloor t/\varepsilon\rfloor}$ as soon as $N \gg D$ and $\varepsilon \ll 1/D$.
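The DD arises as the limit of the particle iteration $\theta_i^{k+1} = \theta_i^k - 2 s_k\,\nabla_\theta\Psi(\theta_i^k;\hat\rho^{(N)}_k)$. A minimal sketch with toy potentials of our choosing — $V(\theta) = -\langle c, \theta\rangle$ and the positive-semidefinite kernel $U(\theta,\theta') = \langle\theta,\theta'\rangle$, so that $\nabla_\theta\Psi(\theta;\hat\rho) = -c + \bar\theta$ with $\bar\theta$ the particle mean:

```python
import numpy as np

rng = np.random.default_rng(2)
D, N, dt, n_steps = 3, 200, 0.05, 400
c = np.array([1.0, -2.0, 0.5])            # target of the toy potential V

xi = lambda t: 1.0                        # constant step-size schedule xi(t)

particles = rng.standard_normal((N, D))   # rho_0: standard Gaussian cloud
t = 0.0
for _ in range(n_steps):
    # grad Psi(theta; rho_hat) = grad V + int grad_1 U d(rho_hat)
    #                          = -c + mean of the particle cloud
    grad_psi = -c + particles.mean(axis=0)
    # forward-Euler step of d(theta)/dt = -2 xi(t) grad Psi(theta; rho_t)
    particles -= dt * 2 * xi(t) * grad_psi
    t += dt
```

For this choice the mean of the particle cloud contracts to $c$, the minimizer of the limiting risk, while the shape of the cloud is preserved (the drift is the same for every particle) — an instance of the local mass conservation discussed above.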
Using these results, analyzing learning in two-layer neural networks reduces to analyzing the PDE (7). While this is far from being an easy task, the PDE formulation leads to several simplifications and insights. First of all, it factors out the invariance of the risk (4) (and of the SGD dynamics (3)) with respect to permutations of the units $\{1, \dots, N\}$.
Second, it allows us to exploit symmetries in the data distribution $\mathbb{P}$. If $\mathbb{P}$ is left invariant under a group of transformations (e.g. rotations), we can look for a solution $\rho_t$ of the DD (7) that enjoys the same symmetry, hence reducing the dimensionality of the problem. This is impossible for the finite-$N$ dynamics (3), since no arrangement of the points $\theta_1, \dots, \theta_N$ is left invariant – say – under rotations. We will provide examples of this approach in the next sections.
Third, there is a rich mathematical literature on the PDE (7), which was motivated by the study of interacting particle systems in mathematical physics. As mentioned above, a key structure exploited in this line of work is that (7) can be viewed as a gradient flow for the cost function $R(\rho)$ in the space $(\mathscr{P}(\mathbb{R}^D), W_2)$ of probability measures on $\mathbb{R}^D$ endowed with the Wasserstein metric [JKO98, AGS08, CMV03]. Roughly speaking, this means that the trajectory $t \mapsto \rho_t$ attempts to minimize the risk while maintaining the 'local mass conservation' constraint. Recall that the Wasserstein distance is defined as
$$W_2(\rho_1, \rho_2) \equiv \left(\inf_{\gamma}\int \|\theta_1 - \theta_2\|_2^2\,\gamma(d\theta_1, d\theta_2)\right)^{1/2},$$
where the infimum is taken over all couplings $\gamma$ of $\rho_1$ and $\rho_2$. Informally, the fact that $t\mapsto\rho_t$ is a gradient flow means that (7) is equivalent, for small $\delta > 0$, to
$$\rho_{t+\delta} \approx \arg\min_{\rho\in\mathscr{P}(\mathbb{R}^D)}\left\{ R(\rho) + \frac{1}{2\,\xi(t)\,\delta}\,W_2(\rho, \rho_t)^2 \right\}.$$
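On the real line, the optimal $W_2$ coupling between two equal-size empirical measures simply matches order statistics, which gives a two-line estimator (a standard fact, included here purely for illustration):

```python
import numpy as np

def w2_empirical_1d(xs, ys):
    """Wasserstein-2 distance between two empirical measures on R with the
    same number of atoms: the optimal coupling matches sorted samples."""
    xs, ys = np.sort(np.asarray(xs)), np.sort(np.asarray(ys))
    return float(np.sqrt(np.mean((xs - ys) ** 2)))
```

For example, a pure translation of a sample by $a$ yields distance exactly $|a|$, since sorting commutes with the shift.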
Most importantly, the scaling limit elucidates the dependence of the landscape of two-layer neural networks on the number of hidden units $N$.
A remarkable feature of neural networks is that, while they might be dramatically over-parametrized, this does not lead to performance degradation. In the case of bounded activation functions, this phenomenon was clarified in the nineties for empirical risk minimization algorithms, see e.g. [Bar98]. The present work provides an analogous insight for the SGD dynamics: roughly speaking, our results imply that the landscape remains essentially unchanged as $N$ grows, provided $N \gg D$. In particular, assume that the PDE (7) converges close to an optimum in time $t_*$. This time might depend on the data distribution and the dimension, but does not depend on the number of hidden units $N$ (which does not appear in the DD PDE (7)). If $\varepsilon \ll 1/D$, we can then take $N$ arbitrarily large (as long as $N \gg D$) and SGD will achieve a population risk which is independent of $N$ (and corresponds to the optimum), using $k = t_*/\varepsilon$ samples.
Our analysis can accommodate some important variants of SGD, a particularly interesting one being noisy SGD:
$$\theta_i^{k+1} = (1 - 2\lambda s_k)\,\theta_i^k + 2 s_k\,\big(y_k - \hat y(x_k;\boldsymbol\theta^k)\big)\,\nabla_{\theta_i}\sigma_*(x_k;\theta_i^k) + \sqrt{2 s_k/\beta}\;g_i^k, \qquad (11)$$
where $(g_i^k)_{i\le N,\,k\ge 0} \sim_{\mathrm{iid}} \mathsf{N}(0, I_D)$ and $\beta \in (0,\infty)$ is an inverse temperature. (The term $-2\lambda s_k\,\theta_i^k$ corresponds to an $\ell_2$ regularization and will be useful for our analysis below.) The resulting scaling limit differs from (7) by the addition of a diffusion term:
$$\partial_t\rho_t = 2\xi(t)\,\nabla_\theta\cdot\big(\rho_t\,\nabla_\theta\Psi_\lambda(\theta;\rho_t)\big) + 2\xi(t)\,\beta^{-1}\,\Delta_\theta\rho_t, \qquad (12)$$
where $\Psi_\lambda(\theta;\rho) \equiv \Psi(\theta;\rho) + \frac{\lambda}{2}\|\theta\|_2^2$, and $\Delta_\theta$ denotes the usual Laplacian. This can be viewed as a gradient flow for the free energy $F_{\beta,\lambda}(\rho) \equiv \frac12 R_\lambda(\rho) - \beta^{-1}\,\mathrm{Ent}(\rho)$, where $R_\lambda(\rho) \equiv R(\rho) + \lambda\int\|\theta\|_2^2\,\rho(d\theta)$ and $\mathrm{Ent}(\rho) \equiv -\int\rho(\theta)\log\rho(\theta)\,d\theta$ is the entropy of $\rho$ (by definition $\mathrm{Ent}(\rho) = -\infty$ if $\rho$ is singular). $F_{\beta,\lambda}$ is an entropy-regularized risk, which penalizes strongly non-uniform $\rho$.
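A single noisy-SGD update can be written as a function of the current parameters (the function and argument names are ours; the update mirrors the iteration above, with $\lambda$ the regularization and $\beta$ the inverse temperature):

```python
import numpy as np

def noisy_sgd_step(theta, err, grad_sigma, s, lam, beta, rng):
    """One noisy-SGD update for a single unit's parameters.
    theta: (D,) parameters; err = y - hat_y(x); grad_sigma: (D,) gradient
    of sigma_* at (x, theta); s: step size.
    Update: (1 - 2*lam*s)*theta + 2*s*err*grad_sigma + sqrt(2*s/beta)*g,
    with g ~ N(0, I_D)."""
    g = rng.standard_normal(theta.shape)
    return (1 - 2 * lam * s) * theta + 2 * s * err * grad_sigma \
        + np.sqrt(2 * s / beta) * g
```

As $\beta \to \infty$ the noise vanishes and, with $\lambda = 0$, the update reduces to plain SGD; with $\lambda > 0$ and no error signal, the parameters simply shrink geometrically.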
We will prove below that, for $\lambda, \beta^{-1} > 0$, the evolution (12) generically converges to the minimizer of $F_{\beta,\lambda}$, hence implying global convergence of noisy SGD in a number of steps independent of $N$.
2 Examples

In this section, we discuss some simple applications of the general approach outlined above. Let us emphasize that these examples are not realistic. First, the data distribution $\mathbb{P}$ is extremely simple: we made this choice in order to be able to carry out explicit calculations. Second, the activation function $\sigma_*$ is not necessarily optimal: we made this choice in order to illustrate some interesting phenomena.
2.1 Centered isotropic Gaussians
One-neuron neural networks perform well with (nearly) linearly separable data. The simplest classification problem which requires multilayer networks is – arguably – the one of distinguishing two Gaussians with the same mean. Assume the joint law of $(x, y)$ to be as follows:

With probability $1/2$: $y = +1$ and $x \sim \mathsf{N}(0, (1+\Delta)^2 I_d)$;

With probability $1/2$: $y = -1$ and $x \sim \mathsf{N}(0, (1-\Delta)^2 I_d)$.

(This example will be generalized later.) Of course, optimal classification in this model becomes entirely trivial if we compute the feature $\|x\|_2$. However, it is non-trivial that an SGD-trained neural network will succeed.
We choose an activation function without offset or output weights, namely $\sigma_*(x; \theta_i) = \sigma(\langle w_i, x\rangle)$ with $\theta_i = w_i$. While qualitatively similar results are obtained for other choices of $\sigma$, we will use a simple piecewise linear function as a running example: $\sigma(t) = s_1$ for $t \le t_1$, $\sigma(t) = s_2$ for $t \ge t_2$, and $\sigma$ interpolated linearly for $t \in (t_1, t_2)$. In simulations we use fixed values of $s_1$, $s_2$, $t_1$, $t_2$.
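A sketch of this setup in code (the breakpoints and plateau values below are placeholders of ours, not the constants used in the paper's simulations):

```python
import numpy as np

def sigma_pwl(t, s1=-1.0, s2=1.0, t1=0.5, t2=1.5):
    """Piecewise linear sigmoid: equals s1 for t <= t1, s2 for t >= t2,
    linear in between (assumes s1 < s2 so the clip is valid)."""
    lin = s1 + (s2 - s1) * (np.asarray(t, dtype=float) - t1) / (t2 - t1)
    return np.clip(lin, s1, s2)

def sample_isotropic(rng, d, Delta):
    """One draw from the two-Gaussian model: y = +1 w.p. 1/2 with
    x ~ N(0, (1+Delta)^2 I_d), else y = -1 with x ~ N(0, (1-Delta)^2 I_d)."""
    y = 1.0 if rng.random() < 0.5 else -1.0
    tau = 1.0 + Delta if y > 0 else 1.0 - Delta
    return tau * rng.standard_normal(d), y
```

Note that $\sigma(\langle w, x\rangle)$ with spherically symmetric data makes the unit's response depend on $w$ essentially through $\|w\|_2$, which is what drives the dimension reduction described next.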
We run SGD with initial weights $(w_i^0)_{i \le N} \sim_{\mathrm{iid}} \rho_0$, where $\rho_0$ is spherically symmetric. Figure 1 reports the result of such an experiment. Due to the symmetry of the distribution $\mathbb{P}$, the distribution $\rho_t$ remains spherically symmetric for all $t$, and hence is completely determined by the distribution $\bar\rho_t$ of the norm $r = \|w\|_2$. This distribution satisfies a one-dimensional reduced DD:
$$\partial_t\bar\rho_t = 2\xi(t)\,\partial_r\big(\bar\rho_t\,\partial_r\psi(r;\bar\rho_t)\big), \qquad (13)$$
where the form of $\psi$ can be derived from $\Psi$. This reduced PDE can be efficiently solved numerically, see Supplementary Information (SI) for technical details. As illustrated by Fig. 1, the empirical results match closely the predictions produced by this PDE.
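To illustrate how such a one-dimensional DD can be solved numerically, here is a conservative upwind (finite-volume) sketch for an equation of the form $\partial_t\bar\rho = \partial_r(\bar\rho\,\partial_r\psi)$, with a toy potential $\psi(r) = (r-1)^2/2$ in place of the model-derived one. The grid, potential, and constants are our illustrative choices; the paper's actual scheme is described in the SI.

```python
import numpy as np

L, M, dt, n_steps = 3.0, 60, 0.02, 200
h = L / M
r = (np.arange(M) + 0.5) * h              # cell centers on [0, L]

def dpsi(r):
    # toy potential psi(r) = (r - 1)^2 / 2, so d(psi)/dr = r - 1
    return r - 1.0

rho = np.ones(M) / L                      # initial density: uniform on [0, L]
for _ in range(n_steps):
    v = -dpsi(np.arange(M + 1) * h)       # drift velocity at cell faces
    F = np.zeros(M + 1)                   # mass flux through each face
    up = v[1:-1] > 0                      # upwind choice at interior faces
    F[1:-1] = np.where(up, v[1:-1] * rho[:-1], v[1:-1] * rho[1:])
    # zero-flux boundaries conserve total mass
    rho -= dt / h * (F[1:] - F[:-1])

mass = rho.sum() * h                      # conserved: stays equal to 1
mean_r = (r * rho).sum() * h / mass       # drifts toward argmin psi = 1
```

The scheme conserves mass exactly (fluxes telescope, boundary fluxes vanish) and transports the density toward the minimizer of the potential, mimicking the gradient-flow behavior of the DD.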
Among spherically symmetric distributions, the minimum of the risk is achieved by the uniform distribution over a sphere of radius $r_*$, to be denoted by $\varrho_{r_*}$. The value of $r_*$ is computed by minimizing
$$r \mapsto R(\varrho_r) = R_\# + 2\,\bar V(r) + \bar U(r, r),$$
where expressions for $\bar V(r)$, $\bar U(r_1, r_2)$ can be readily derived from $V$, $U$, and are given in the SI.
Let $r_*$ be a global minimizer of $r \mapsto R(\varrho_r)$. Then $\varrho_{r_*}$ is a global minimizer of $R(\rho)$ if and only if $\psi(r; \varrho_{r_*}) \ge \psi(r_*; \varrho_{r_*})$ for all $r \ge 0$.

Checking this condition numerically yields that $\varrho_{r_*}$ is a global minimizer when $\Delta$ lies in an interval $(\Delta_{\min}, \Delta_{\max})$, whose endpoints are determined numerically.
Figure 2 shows good quantitative agreement between empirical results and theoretical predictions, and suggests that SGD achieves a value of the risk which is close to optimum. Can we prove that this is indeed the case, and that the SGD dynamics does not get stuck in local minima? It turns out that we can use our general theory (see next section) to prove that this is the case for large $d$. In order to state this result, we need to introduce a class of good uninformative initializations for which convergence to the optimum takes place. For a spherically symmetric $\rho$, with $\bar\rho$ the distribution of $r = \|w\|_2$, the risk $R(\rho)$ has a well-defined limit as $d \to \infty$, which we denote by $R_\infty(\bar\rho)$. We say that $\bar\rho_0 \in \mathcal{G}$ if: $\bar\rho_0$ is absolutely continuous with respect to the Lebesgue measure, with bounded density; and $R_\infty(\bar\rho_0) < R_\#$.
Theorem 1. For any $\eta > 0$ and $\bar\rho_0 \in \mathcal{G}$, there exist $t_* = t_*(\eta, \bar\rho_0)$, $d_0$, and $C_0$, such that the following holds for the problem of classifying isotropic Gaussians. For any dimension $d \ge d_0$, number of neurons $N \ge C_0 d$, consider SGD initialized with $(w_i^0)_{i\le N} \sim_{\mathrm{iid}} \rho_0$ and step size $\varepsilon \le 1/(C_0 d)$. Then we have
$$R_N(\boldsymbol\theta^k) \le \inf_{\rho} R(\rho) + \eta$$
for any $k \ge t_*/\varepsilon$, with probability at least $1 - \eta$.
In particular, if we set $\varepsilon \asymp 1/d$, then the number of SGD steps (and hence of samples) is of order $t_* d$: the number of samples used by SGD does not depend on the number of hidden units $N$, and is only linear in the dimension. Unfortunately the proof does not provide the dependence of $t_*$ on $\eta$, but Theorem 6 below suggests exponential local convergence.
While we stated Theorem 1 for the piecewise linear sigmoid, the SI presents technical conditions under which it holds for a general monotone function $\sigma$.
2.2 Centered anisotropic Gaussians
We can generalize the previous result to a problem in which the network needs to select a subset of relevant nonlinear features out of many a priori equivalent ones. We assume the joint law of $(x, y)$ to be as follows:

With probability $1/2$: $y = +1$ and $x \sim \mathsf{N}(0, \Sigma_+)$;

With probability $1/2$: $y = -1$ and $x \sim \mathsf{N}(0, \Sigma_-)$.

Given a linear subspace $\mathcal{V} \subseteq \mathbb{R}^d$ of dimension $s_0$, we assume that $\Sigma_+$, $\Sigma_-$ differ uniquely along $\mathcal{V}$: $\Sigma_\pm = I_d + (\tau_\pm^2 - 1)\,P_{\mathcal{V}}$, where $\tau_\pm = 1 \pm \Delta$ and $P_{\mathcal{V}}$ is the orthogonal projector onto $\mathcal{V}$. In other words, the projection of $x$ on the subspace $\mathcal{V}$ is distributed according to an isotropic Gaussian with variance $\tau_+^2$ (if $y = +1$) or $\tau_-^2$ (if $y = -1$). The projection orthogonal to $\mathcal{V}$ has instead the same variance in the two classes. A successful classifier must be able to learn the relevant subspace $\mathcal{V}$. We assume the same class of activations as for the isotropic case, $\sigma_*(x; \theta_i) = \sigma(\langle w_i, x\rangle)$.
The distribution $\mathbb{P}$ is invariant under a reduced symmetry group: independent rotations within $\mathcal{V}$ and within its orthogonal complement $\mathcal{V}^\perp$. As a consequence, letting $r_1 = \|P_{\mathcal{V}} w\|_2$ and $r_2 = \|P_{\mathcal{V}^\perp} w\|_2$, it is sufficient to consider distributions $\rho$ that are uniform, conditional on the values of $r_1$ and $r_2$. If we initialize $\rho_0$ to be uniform conditional on $(r_1, r_2)$, this property is preserved by the evolution (7). As in the isotropic case, we can use our general theory to prove convergence to a near-optimum if $d$ is large enough.
Theorem 2. For any $\eta > 0$ and good initialization $\bar\rho_0 \in \mathcal{G}$, there exist $t_*$, $d_0$, and $C_0$, such that the following holds for the problem of classifying anisotropic Gaussians with $\Delta$, $s_0/d$ fixed. For any dimension parameters $d \ge d_0$, $s_0$, number of neurons $N \ge C_0 d$, consider SGD initialized with $(w_i^0)_{i\le N} \sim_{\mathrm{iid}} \rho_0$ and step size $\varepsilon \le 1/(C_0 d)$. Then we have $R_N(\boldsymbol\theta^k) \le \inf_{\rho} R(\rho) + \eta$ for any $k \ge t_*/\varepsilon$, with probability at least $1 - \eta$.
Even with a reduced degree of symmetry, SGD converges to a network with nearly-optimal risk, after using a number of samples that is linear in the dimension and independent of the number of hidden units $N$.
2.3 A better activation function
Our previous examples use activation functions $\sigma_*(x;\theta) = \sigma(\langle w, x\rangle)$ without output weights or offset, in order to simplify the analysis and illustrate some interesting phenomena. Here we consider instead a standard rectified linear unit (ReLU) activation, and fit both the output weight and the offset: $\sigma_*(x;\theta) = a\,\sigma_{\mathrm{ReLU}}(\langle w, x\rangle + b)$, where $\sigma_{\mathrm{ReLU}}(t) = \max(t, 0)$. Hence $\theta = (a, b, w) \in \mathbb{R}^{d+2}$.
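In code, such a unit and the corresponding network read as follows (a sketch; the function names are ours):

```python
import numpy as np

def relu_unit(x, a, b, w):
    """sigma_*(x; theta) = a * max(<w, x> + b, 0), with theta = (a, b, w)."""
    return a * max(w @ x + b, 0.0)

def predict(x, A, B, W):
    """Two-layer ReLU network: mean of N units, with output weights A and
    offsets B of shape (N,), and input weights W of shape (N, d)."""
    return float(np.mean(A * np.maximum(W @ x + B, 0.0)))
```

Fitting $a$ and $b$ alongside $w$ enlarges the parameter space of each unit from $\mathbb{R}^d$ to $\mathbb{R}^{d+2}$, which is what produces the four-dimensional reduced dynamics discussed below.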
We consider the same data distribution introduced in the last section (anisotropic Gaussians). Figure 3 reports the evolution of the risk for three experiments with common dimension parameters $d$, $s_0$, and different values of $\Delta$. SGD is initialized with $(\theta_i^0)_{i \le N} \sim_{\mathrm{iid}} \rho_0$ for a suitable distribution $\rho_0$. We observe that SGD converges to a network with very small risk, but this convergence has a nontrivial structure and presents long flat regions.
The empirical results are well captured by our predictions based on the continuum limit. In this case we obtain a reduced PDE for the joint distribution of the four quantities $(a, b, r_1, r_2)$, denoted by $\bar\rho_t$. The reduced PDE is analogous to (13), albeit in four rather than one dimension. In Figure 3 we consider the evolution of the risk, alongside three properties of the distribution $\bar\rho_t$: the means of the output weights $a_i$, of the offsets $b_i$, and of the norms $\|w_i\|_2$.
2.4 Predicting failure
SGD does not always converge to a near global optimum. Our analysis allows us to construct examples in which SGD fails. For instance, Figure 4 reports results for the isotropic Gaussians problem. We violate the assumptions of Theorem 1 by using a non-monotone activation function. Namely, we use $\sigma_*(x; w) = \sigma(\langle w, x\rangle)$, where $\sigma(t) = s_1$ for $t \le t_1$, $\sigma(t) = s_3$ for $t \ge t_3$, and $\sigma$ linearly interpolates from $(t_1, s_1)$ to $(t_2, s_2)$, and from $(t_2, s_2)$ to $(t_3, s_3)$, with the intermediate value $s_2$ chosen so that $\sigma$ is non-monotone.
Depending on the initialization, SGD converges to two different limits, one with a small risk, and the second with high risk. Again this behavior is well tracked by solving a one-dimensional PDE for the distribution $\bar\rho_t$ of $r = \|w\|_2$.
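Such a non-monotone activation can be written compactly with linear interpolation between knots (the knot positions and values below are illustrative placeholders of ours, not the paper's constants):

```python
import numpy as np

def sigma_nonmono(t, knots=(-1.0, 0.0, 1.0), vals=(1.0, -0.5, 1.0)):
    """Piecewise linear activation: constant vals[0] left of knots[0],
    constant vals[-1] right of knots[-1], linear through the middle knot.
    Non-monotone because the middle value dips below the two plateaus.
    (np.interp clamps to the end values outside the knot range.)"""
    return np.interp(t, knots, vals)
```

With these placeholder values the function decreases on $[-1, 0]$ and increases on $[0, 1]$, so it is non-monotone, violating the monotonicity assumption of Theorem 1.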
3 General results
In this section we return to the general supervised learning problem described in the introduction and describe our general results. Proofs are deferred to the SI.
First, we note that the minimum of the asymptotic risk $R(\rho)$, cf. (5), provides a good approximation of the minimum of the finite-$N$ risk $R_N(\boldsymbol\theta)$.
Proposition 1. Assume that either one of the following conditions holds: (a) $\inf_{\rho} R(\rho)$ is achieved by a distribution $\rho_*$ such that $\int U(\theta,\theta)\,\rho_*(d\theta) < \infty$; (b) there exist $\varepsilon_0, K_0 > 0$ such that, for any $\rho$ with $R(\rho) \le \inf_{\bar\rho} R(\bar\rho) + \varepsilon_0$, we have $\int U(\theta,\theta)\,\rho(d\theta) \le K_0$. Then
$$\lim_{N\to\infty}\inf_{\boldsymbol\theta} R_N(\boldsymbol\theta) = \inf_{\rho} R(\rho).$$
Further, assume that $V$ and $U$ are continuous, with $U$ bounded below. A probability measure $\rho_*$ is a global minimum of $R$ if and only if
$$\Psi(\theta;\rho_*) \ge \int \Psi(\tilde\theta;\rho_*)\,\rho_*(d\tilde\theta) \quad \text{for all } \theta \in \mathbb{R}^D. \qquad (17)$$
We next consider the distributional dynamics (7) and (12). These should be interpreted to hold in a weak sense, cf. SI. In order to establish that these PDEs indeed describe the limit of the SGD dynamics, we make the following assumptions.
A1. The step-size schedule $t \mapsto \xi(t)$ is bounded and Lipschitz: $\|\xi\|_\infty, \|\xi\|_{\mathrm{Lip}} \le K_1$.

A2. The activation function $\sigma_*$ is bounded, with sub-Gaussian gradient: $\|\sigma_*\|_\infty \le K_2$, $\|\nabla_\theta\sigma_*(x;\theta)\|_{\psi_2} \le K_2$. Labels are bounded: $|y| \le K_2$.

A3. The gradients $\nabla V(\theta)$, $\nabla_{\theta_1} U(\theta_1,\theta_2)$ are bounded and Lipschitz continuous (namely $\|\nabla V\|_\infty, \|\nabla_{\theta_1} U\|_\infty \le K_3$, $\|\nabla V\|_{\mathrm{Lip}}, \|\nabla_{\theta_1} U\|_{\mathrm{Lip}} \le K_3$).
We also introduce the following error term, which quantifies in a non-asymptotic sense the accuracy of our PDE model:
$$\mathrm{err}_{N,\varepsilon}(z) \equiv \sqrt{1/N \vee \varepsilon}\cdot\Big[\sqrt{D + \log(N/\varepsilon)} + z\Big].$$
The convergence of the SGD process to the PDE model is an example of a phenomenon which is known in probability theory as propagation of chaos [Szn91].
Theorem 3. Assume that conditions A1, A2, A3 hold. For $\rho_0 \in \mathscr{P}(\mathbb{R}^D)$, consider SGD with initialization $(\theta_i^0)_{i\le N} \sim_{\mathrm{iid}} \rho_0$ and step size $s_k = \varepsilon\,\xi(k\varepsilon)$. For $t \ge 0$, let $\rho_t$ be the solution of PDE (7). Then, for any fixed $t \ge 0$, $\hat\rho^{(N)}_{\lfloor t/\varepsilon\rfloor} \Rightarrow \rho_t$ almost surely along any sequence $(N, \varepsilon_N)$ such that $N \to \infty$, $\varepsilon_N \to 0$, and $\varepsilon_N \log N \to 0$. Further, there exists a constant $C$ (depending uniquely on the parameters of conditions A1-A3) such that, for any $f : \mathbb{R}^D \to \mathbb{R}$ with $\|f\|_\infty, \|f\|_{\mathrm{Lip}} \le 1$, and any $T \ge 0$, $z \ge 1$, with probability at least $1 - e^{-z^2}$,
$$\sup_{k \le T/\varepsilon}\Big|\int f\,d\hat\rho^{(N)}_k - \int f\,d\rho_{k\varepsilon}\Big| \le C\,e^{CT}\,\mathrm{err}_{N,\varepsilon}(z).$$
Notice that the dependence of the error term on $N$ and $\varepsilon$ is rather benign. On the other hand, the error grows exponentially with the time horizon $T$, which limits its applicability to cases in which the DD converges rapidly to a good solution. We do not expect this behavior to be improvable within the general setting of Theorem 3, which a priori includes cases in which the dynamics is unstable.
We can regard $-2\xi(t)\,\rho_t\,\nabla_\theta\Psi(\theta;\rho_t)$ as a current. The fixed points of the continuum dynamics are densities that correspond to zero current, as stated below.
Note that global optimizers of $R(\rho)$, defined by condition (17), are fixed points, but the set of fixed points is, in general, larger than the set of optimizers. Our next proposition provides an analogous characterization of the fixed points of the diffusion DD (12) (see [CMV03] for related results).
Proposition. Assume that conditions A1-A3 hold and that $\rho_0$ is absolutely continuous with respect to the Lebesgue measure, with $F_{\beta,\lambda}(\rho_0) < \infty$. If $(\rho_t)_{t\ge0}$ is a solution of the diffusion PDE (12), then $\rho_t$ is absolutely continuous for all $t$. Further, there is at most one fixed point of (12) satisfying $F_{\beta,\lambda}(\rho) < \infty$. This fixed point is absolutely continuous and its density satisfies
$$\rho_*(\theta) = \frac{1}{Z(\beta)}\,\exp\big\{-\beta\,\Psi_\lambda(\theta;\rho_*)\big\}. \qquad (21)$$
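The fixed-point structure can be illustrated with a one-dimensional toy model of our choosing: $V(t) = -c\,t$ and $U(t, t') = t\,t'$, so that $\Psi_\lambda(t;\rho) = (m - c)\,t + \lambda t^2/2$, with $m$ the mean of $\rho$. Iterating the Gibbs map on a grid converges to the unique fixed point:

```python
import numpy as np

c, lam, beta = 1.0, 2.0, 4.0
ts = np.linspace(-6.0, 6.0, 2001)
dt = ts[1] - ts[0]

m = 0.0                                    # current mean-field term
for _ in range(100):
    # Psi_lambda(t; rho) for the toy potentials, given the mean m
    psi = (m - c) * ts + lam * ts ** 2 / 2
    w = np.exp(-beta * (psi - psi.min()))  # unnormalized Gibbs density
    rho = w / (w.sum() * dt)               # normalize on the grid
    m = (ts * rho).sum() * dt              # new mean => next iterate
```

Here each Gibbs measure is Gaussian with mean $(c - m)/\lambda$, so the iteration $m \mapsto (c - m)/\lambda$ contracts (for $\lambda > 1$) to the unique fixed point $m_* = c/(1+\lambda)$.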
In the next sections we state our results about convergence of the distributional dynamics to its fixed point. In the case of noisy SGD (and for the diffusion PDE (12)), a general convergence result can be established (although at the cost of an additional regularization). For noiseless SGD (and the continuity equation (7)), we do not have such a general result. However, we obtain a stability condition for fixed points containing one point mass, which is useful to characterize possible limiting points (and is used in treating the examples in the previous section).
3.1 Convergence: noisy SGD
Remarkably, the diffusion PDE (12) generically admits a unique fixed point, which is the global minimum of $F_{\beta,\lambda}$, and the evolution (12) converges to it, if initialized so that $F_{\beta,\lambda}(\rho_0) < \infty$. This statement requires some qualifications. First of all, we introduce sufficient regularity assumptions to guarantee the existence of sufficiently smooth solutions of (12).
A4. The derivatives $\nabla^k V(\theta)$, $\nabla^k_{\theta_1} U(\theta_1,\theta_2)$ exist and are uniformly bounded for $k \le 4$.
Next notice that the right-hand side of the fixed point equation (21) is not necessarily normalizable (for instance, it is not when $\lambda = 0$ and $V$, $U$ are bounded). In order to ensure the existence of a fixed point, we need $\lambda > 0$.
Theorem. Assume that conditions A1-A4 hold, and $\lambda > 0$, $\beta \in (0,\infty)$. Then $F_{\beta,\lambda}$ has a unique minimizer, denoted by $\rho_*$, which satisfies
$$R(\rho_*) \le \inf_{\rho \in \mathscr{P}(\mathbb{R}^D)} R(\rho) + \Delta_0(\beta,\lambda),$$
where $\Delta_0(\beta,\lambda)$ is a constant depending on $\beta$, $\lambda$, $D$, and the constants of A1-A4, which vanishes as $\beta \to \infty$ and $\lambda \to 0$ at suitable relative rates. Further, letting $(\rho_t)_{t\ge0}$ be a solution of the diffusion PDE (12) with initialization satisfying $F_{\beta,\lambda}(\rho_0) < \infty$, we have, as $t \to \infty$, $\rho_t \Rightarrow \rho_*$.
The proof of this theorem is based on the following formula, which describes the free energy decrease along the trajectories of the distributional dynamics (12):
$$\frac{d}{dt} F_{\beta,\lambda}(\rho_t) = -2\xi(t)\int \big\|\nabla_\theta\Psi_\lambda(\theta;\rho_t) + \beta^{-1}\nabla_\theta\log\rho_t(\theta)\big\|_2^2\,\rho_t(\theta)\,d\theta.$$
(A key technical hurdle is of course proving that this expression makes sense, which we do by showing the existence of strong solutions.) It follows that the right-hand side must vanish as $t \to \infty$, from which we prove that (eventually taking subsequences) $\rho_t \Rightarrow \rho_\infty$, where $\rho_\infty$ must satisfy $\nabla_\theta\Psi_\lambda(\theta;\rho_\infty) + \beta^{-1}\nabla_\theta\log\rho_\infty(\theta) = 0$. This in turn means that $\rho_\infty$ is a solution of the fixed point condition (21), and is in fact a global minimum of $F_{\beta,\lambda}$ by convexity.
Theorem. Assume that conditions A1-A4 hold. Let $\rho_0$ be absolutely continuous with $F_{\beta,\lambda}(\rho_0) < \infty$ and sub-Gaussian. Consider regularized noisy SGD, cf. (11), at inverse temperature $\beta$, regularization $\lambda > 0$, with initialization $(\theta_i^0)_{i\le N} \sim_{\mathrm{iid}} \rho_0$. Then for any $\eta > 0$ there exists $\beta_0 = \beta_0(\eta)$ such that, setting $\beta \ge \beta_0$, there exist $t_* = t_*(\eta,\beta)$ and $C_0$ (the latter independent of the dimension $D$ and temperature $\beta^{-1}$) such that the following happens for $N \ge C_0$, $\varepsilon \le 1/C_0$: for any $k \ge t_*/\varepsilon$ we have, with probability at least $1 - \eta$,
$$R_N(\boldsymbol\theta^k) \le \inf_{\rho} R(\rho) + \eta.$$
Let us emphasize that the convergence time $t_*$ in the last theorem can depend on the dimension $D$ and on the data distribution $\mathbb{P}$, but is independent of the number of hidden units $N$. As illustrated by the examples in the previous section, understanding the dependence of $t_*$ on $\beta$ requires further analysis, but examining the proof of this theorem suggests $t_*$ exponential in $\beta$ quite generally (examples in which $t_*$ is exponentially large or of constant order can be constructed). We expect that our techniques could be pushed to investigate the dependence of $t_*$ on $D$ (see discussion in SI). In highly structured cases, the dimension $D$ can be of constant order, and be much smaller than the ambient dimension $d$.
3.2 Convergence: noiseless SGD
The next theorems provide necessary and sufficient conditions for distributions containing a single point mass to be a stable fixed point of the evolution. This result is useful in order to characterize the large-time asymptotics of the dynamics (7). Here, we write $\nabla_1 U(\theta_1, \theta_2)$ for the gradient of $U$ with respect to its first argument, and $\nabla_1^2 U(\theta_1, \theta_2)$ for the corresponding Hessian. Further, for a probability distribution $\rho$, we define
$$H(\theta;\rho) \equiv \nabla^2 V(\theta) + \int \nabla_1^2 U(\theta, \theta')\,\rho(d\theta').$$
Note that $H(\theta;\rho)$ is nothing but the Hessian of $\Psi(\,\cdot\,;\rho)$ at $\theta$.