The strong statistical performance of Stochastic Gradient Descent (SGD) and its variants has enabled the effective training of large–scale statistical learning models. It is widely believed that SGD acts as an “implicit regularizer” that helps it find local minimum points that generalize well (see , , ). This belief stems from its remarkable empirical performance. To justify this belief on solid theoretical grounds, the optimization behavior of SGD and its impact on generalization have been studied through attempts to characterize how SGD escapes from stationary points, including saddle points and local minima (see , ). Another stream of research focuses on interpreting SGD as performing variational inference, in which it is found that SGD minimizes the Kullback–Leibler divergence between its stationary distribution and a posterior (see , ). Along this direction, it has been realized that SGD performs variational inference using a new potential function that it implicitly constructs, given an architecture and a dataset.
In the research mentioned above, “implicit regularization” was explained as being related to the “noise tied to architecture” (see [3, Section 6]) arising from, e.g., dropout or the small mini–batches used in SGD. For example, in the variational inference approach, the new potential function, denoted \(\Phi\), can be shown to equal the original loss function up to a scalar multiple when the noise in SGD is isotropic (see [3, Lemma 6]). However, in realistic settings, due to the special architecture of deep networks, the gradient noise is highly non–isotropic (anisotropic, see [3, Section 4.1], , ), and it is believed that the potential \(\Phi\) then possesses properties that lead to both generalization and acceleration (see ). From the perspective of escaping from stationary points, it has also been found that anisotropic noise leads to faster escape from sharp local minima, which validates previous results from both empirical and theoretical analyses (see , , , ).
In this work, we provide a unified approach to these problems. Instead of the potential function \(\Phi\), we construct another potential function, the (global) quasi–potential \(V\), which characterizes the long–time behavior of SGD with small learning rate. We demonstrate that, as the learning rate tends to zero, SGD finally enters the global minimum of the (global) quasi–potential function \(V\). It is interesting to observe that when the loss landscape possesses only one global minimum and the noise is isotropic, the (global) quasi–potential that we construct reduces to a local quasi–potential and agrees with the original loss function up to a multiplicative constant. However, in the presence of multiple potential wells, SGD with anisotropic noise minimizes a (global) quasi–potential whose landscape differs from that of the original loss function.
The construction of the (global) quasi–potential function proceeds by first calculating the local quasi–potential function, which is valid within one basin of attraction around a specific local minimum point. Analytically, this local quasi–potential is the solution of a variational problem whose Lagrangian depends explicitly on the covariance structure of the noise (e.g., as determined by the architecture of the neural network) and on the loss landscape. By classical calculus of variations, such a function can be calculated from a partial differential equation of Hamilton–Jacobi type with a prescribed boundary condition. From this perspective one can explicitly see the dependence of the quasi–potential on the covariance structure of the noise. This suggests that the quasi–potential can capture the effects of the implicit regularization brought in by the “noise tied to architecture” in algorithms like SGD. In particular, the highly anisotropic noise in SGD induces a (global) quasi–potential function that differs from the original loss function, and this is more relevant to the training of real deep neural networks.
Our results concern the case in which the learning rate asymptotically tends to zero. This differs from previous works that consider the case of a moderate learning rate. We demonstrate that even in this regime, implicit regularization still exists and manifests itself through the (global) quasi–potential function \(V\). This is mainly due to the exponentially long escape time from local minima induced by the small noise in SGD. Our analysis relies on Large Deviations Theory (LDT) (see , , ), in which exponentially small transition probabilities are estimated through a path integral along trajectories exiting from local minima. These exponents depend on the learning rate as well as on the noise covariance structure. Via the Markov property of the SGD process, such exponentially small exit probabilities can be turned into estimates of the exponentially long escape time from the basins of attraction around local minima. As the exponential complexity of SGD–based algorithms has also been noted via information–theoretic guarantees, our work demonstrates that the exponentially long escape time from local minima induced by small noise in SGD should be a key factor leading to implicit regularization.
Although our constructed (global) quasi–potential function \(V\) differs from the potential function \(\Phi\) constructed via variational inference (see ), the two are inter–connected via the steady–state distribution of the Fokker–Planck equation corresponding to the stochastic dynamics of SGD. In fact, the potential function constructed in variational inference comes directly from the exponent in the exponential form of the density of the stationary distribution. This is the case when the learning rate (the step size of the recursive scheme) is moderate. When the learning rate is small, the normalizing constant (in the language of statistical physics, the partition function) also scales with the learning rate and affects the exponent from which the previous potential function \(\Phi\) is calculated. In the limit as the learning rate tends to zero, this leads to our new potential function, the (global) quasi–potential \(V\).
Another interesting point is that, from our probabilistic considerations, we can understand not only the “final” long–time behavior of the SGD algorithm, i.e., which minimum is eventually reached, but also the dynamics of SGD jumping between different local minima at intermediate time scales. Such “metastable” behavior can be characterized by a Markov chain with exponentially small transition probabilities between local minima of the quasi–potential. This Markov chain reflects the procedure by which SGD selects the specific local minimum points that it favors (such as those with good generalization properties). From here we can characterize the mechanism of implicit regularization via this Markov chain associated with the quasi–potential. We illustrate this point via an example (Example 4.1) in Section 4.
The paper is organized as follows. In Section 2 we review basic facts about continuous–time SGD and variational inference. In Section 3 we demonstrate the construction of the local quasi–potential function, in particular how this quantity is related to the noise covariance matrix (sometimes also referred to as the diffusion matrix) via a partial differential equation of Hamilton–Jacobi type. We accompany this section with an example showing that when the diffusion matrix is anisotropic, the escape from a local minimum can be faster than with isotropic noise. We further demonstrate the construction of the global quasi–potential via an example in Section 4, where the loss function is given by a two–well potential in dimension one and the diffusion matrix is anisotropic. We illustrate the construction of the (global) quasi–potential and show that its landscape differs from that of the original loss function. We also use this example to demonstrate the metastable dynamics of SGD, namely a Markov chain between different local minimum points. This Markov chain leads to SGD eventually being trapped at the global minimum of the quasi–potential, which may differ from the global minimum of the original loss function. In Section 5 we provide numerical results for the examples of Sections 3 and 4. Finally, we conclude and propose future directions in Section 6.
Main contributions of this paper. In this work, a new potential function, the quasi–potential, is introduced, and the variational inference view of SGD is interpreted as minimizing the quasi–potential function. By making use of LDT and classical calculus of variations, a relation between the quasi–potential and the noise covariance structure (the diffusion matrix of SGD) is revealed through a partial differential equation of Hamilton–Jacobi type. This relation helps to show that anisotropic noise leads to faster escape than isotropic noise. Furthermore, the mechanism of “implicit regularization” is explained through a Markov chain between local minimum points of the quasi–potential. This Markov chain is induced by the noise in SGD and is tied to the noise covariance structure via the relation between the diffusion matrix and the quasi–potential. In summary, this work proposes a quantitative way to understand the phenomenon of “implicit regularization” by introducing the quasi–potential and relating it to the noise covariance structure via partial differential equations and stochastic dynamics.
Comparison with previous works. There is a large literature dedicated to the empirical fact that SGD favors local minimum points with good generalization properties (see for example , , ). On the theoretical side, attempts have been made to explain this “implicit regularization” of the SGD process through its stochastic dynamics, such as the escape from stationary points (see , , , , , ). Variational inference has been discussed in works such as , , . There have also been attempts to relate the covariance structure of the SGD noise to its generalization properties (see , ), in particular how anisotropic noise leads to fast escape from saddle points and local minimum points (see , ). Compared to these previous works, our work proposes a unified way to understand the connection between SGD's noise covariance structure and its selection of specifically favored local minimum points. The novelty is that the quasi–potential function we introduce can be quantitatively related to SGD's noise covariance structure via a partial differential equation of Hamilton–Jacobi type. This provides a general analytic tool for comparing the effects of isotropic vs. anisotropic noise. Our LDT–based analysis also offers further insight into the mechanism behind implicit regularization. Indeed, from LDT we understand that SGD selects its favored local minimum points by performing a Markov chain between different local minima, and the behavior of this Markov chain is determined by SGD's noise covariance structure.
2 Background on continuous–time SGD and the stationary distribution, statement of the problem.
2.1 Continuous–time SGD.
Stochastic Gradient Descent (SGD) with a constant learning rate is a stochastic analogue of the gradient descent algorithm, aiming at finding the local or global minimizers of the expectation of a function parameterized by some random variable. Schematically, the algorithm can be interpreted as targeting a local minimum point of the expectation
\[ f(x) = \mathbb{E}_{\gamma}\big[ f(x;\gamma) \big], \tag{1} \]
where the index random variable \(\gamma\) follows some prescribed distribution, and \(x \in \mathbb{R}^d\) is the weight vector. Stochastic gradient descent updates \(x_k\) via the iteration
\[ x_{k+1} = x_k - \eta\, \nabla f(x_k; \gamma_k), \tag{2} \]
where \(\eta > 0\) is a fixed step size, which is also the learning rate, and the \(\gamma_k\) are i.i.d. random variables that have the same distribution as \(\gamma\). In particular, in the case of training a deep neural network, the random variable \(\gamma\) samples size–\(B\) mini–batches (\(B \ll N\)) uniformly from an index set \(\{1, \ldots, N\}\). In this case, given loss functions \(f_i\) on \(N\) training data points, we have \(f(x) = \frac{1}{N}\sum_{i=1}^{N} f_i(x)\) and \(f(x;\gamma) = \frac{1}{B}\sum_{i \in \gamma} f_i(x)\). Set
\[ \xi_k(x) = \nabla f(x; \gamma_k) - \nabla f(x), \]
the mini–batch gradient noise.
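As a concrete illustration, the iteration (2) can be run on a toy least–squares problem; the setup below (data sizes, step size, and all variable names) is illustrative and not taken from the paper.

```python
import numpy as np

# Toy setup: least-squares losses f_i(x) = 0.5*(a_i . x - b_i)^2, so that
# f(x) = (1/N) sum_i f_i(x).  All constants are chosen for illustration only.
rng = np.random.default_rng(0)
N, d, B = 200, 5, 10            # data points, dimension, mini-batch size
A = rng.normal(size=(N, d))
x_true = rng.normal(size=d)
b = A @ x_true                  # noiseless targets: every f_i shares a minimizer

def batch_grad(x, idx):
    """Gradient of the averaged loss over the mini-batch `idx`."""
    Ai, bi = A[idx], b[idx]
    return Ai.T @ (Ai @ x - bi) / len(idx)

# SGD iteration (2): x_{k+1} = x_k - eta * grad f(x_k; gamma_k)
eta, x = 0.01, np.zeros(d)
for _ in range(3000):
    batch = rng.choice(N, size=B, replace=False)   # gamma_k: a random mini-batch
    x = x - eta * batch_grad(x, batch)

print(np.linalg.norm(x - x_true))   # small: SGD has found the minimizer
```

Because the targets are noiseless, the gradient noise vanishes at the minimizer, so the iterates converge to it; with noisy targets the iterates would instead fluctuate around it, which is exactly the regime studied below.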
Based on the iteration (2), we introduce a stochastic differential equation (SDE) modeling the discrete–time SGD updates. The continuous–time limit of SGD is given by
\[ dX_t = -\nabla f(X_t)\, dt + \sqrt{\eta}\, \sigma(X_t)\, dW_t, \tag{4} \]
where \(W_t\) is a standard Brownian motion in \(\mathbb{R}^d\) and the matrix \(\sigma(x)\) satisfies \(\sigma(x)\sigma(x)^T = D(x)\), where the diffusion matrix \(D(x)\) is the nonnegative–definite matrix
\[ D(x) = \mathbb{E}_{\gamma}\Big[ \big(\nabla f(x;\gamma) - \nabla f(x)\big)\big(\nabla f(x;\gamma) - \nabla f(x)\big)^T \Big]. \tag{5} \]
We refer to , , , , ,  for proofs of the convergence of the discrete SGD (2) to (4). The diffusion matrix \(D(x)\) depends on the weight vector \(x\), the architecture of the learning model (such as a neural network), the loss function, and the data set. When \(D(x)\) is a scalar multiple of the identity, independent of \(x\), we call \(D\) an isotropic diffusion matrix; otherwise, we call \(D\) non–isotropic (anisotropic). In realistic settings, when the architecture is a deep neural network, the diffusion matrix is usually anisotropic with a large condition number (see , ).
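The diffusion matrix (5) can be estimated empirically by sampling mini–batch gradients at a fixed weight vector; the following sketch uses a toy least–squares model, with all constants chosen only for illustration.

```python
import numpy as np

# Empirical estimate of the diffusion matrix D(x) in (5): the covariance of
# the mini-batch gradient at a fixed weight vector x.
rng = np.random.default_rng(1)
N, d, B = 300, 4, 8
A = rng.normal(size=(N, d))
b = rng.normal(size=N)          # generic data, so gradient noise is nonzero

def batch_grad(x, idx):
    Ai, bi = A[idx], b[idx]
    return Ai.T @ (Ai @ x - bi) / len(idx)

x = rng.normal(size=d)          # fixed weight vector at which D(x) is estimated
samples = np.array([batch_grad(x, rng.choice(N, B, replace=False))
                    for _ in range(5000)])
D = np.cov(samples.T)           # empirical D(x); depends on x and on the data

# Anisotropy shows up as a condition number well above 1.
eigs = np.linalg.eigvalsh(D)
print("condition number of D(x):", eigs[-1] / eigs[0])
```

Even in this small model the estimated \(D(x)\) is generally far from a multiple of the identity, which is the anisotropy the text refers to.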
For mathematical reasons we make the following simple assumptions regarding the loss function \(f\) and the diffusion matrix \(D(x)\).
We assume that the loss function \(f\) admits a gradient \(\nabla f\) that is \(L\)–Lipschitz:
\[ |\nabla f(x) - \nabla f(y)| \le L\, |x - y| \quad \text{for all } x, y \in \mathbb{R}^d. \]
We assume that \(\sigma(x)\) is piecewise Lipschitz in \(x\) and that the diffusion matrix \(D(x)\) is invertible for all \(x\), such that
\[ \sup_{x} \operatorname{Tr} D(x) < \infty. \]
Here and below Tr is the trace operator applied to a square matrix.
Although we assume here that the diffusion matrix is invertible for all choices of \(x\), this does not exclude it from being anisotropic, i.e., from having a large condition number.
2.2 Steady–State Distribution and Variational Inference.
The steady–state distribution of the weights \(X_t\), described by a density function \(\rho(t, x)\), evolves according to the Fokker–Planck equation (see [30, Chapter 8])
\[ \frac{\partial \rho}{\partial t} = \nabla \cdot \big( \nabla f(x)\, \rho \big) + \frac{\eta}{2}\, \nabla \cdot \nabla \cdot \big( D(x)\, \rho \big). \tag{8} \]
We make an assumption on the uniqueness of stationary–state distribution.
We assume that the steady–state distribution of the Fokker–Planck equation exists and is unique. We denote the density of the stationary distribution by \(\rho^{ss}(x)\); it satisfies
\[ \nabla \cdot \big( \nabla f(x)\, \rho^{ss} \big) + \frac{\eta}{2}\, \nabla \cdot \nabla \cdot \big( D(x)\, \rho^{ss} \big) = 0 \]
up to a multiplicative constant. In this way, the stationary density can be expressed in terms of a potential function \(\Phi(x)\) using a normalizing constant \(Z(\eta)\), which is the partition function in statistical physics, as
\[ \rho^{ss}(x) = \frac{1}{Z(\eta)} \exp\Big( -\frac{\Phi(x)}{\eta} \Big). \tag{11} \]
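A minimal numerical illustration of how a density of the form (11) behaves as the learning rate decreases: for a toy one–dimensional potential (chosen here purely for illustration), the probability mass concentrates around the global minimum of the potential.

```python
import numpy as np

# Stationary density of the form (11), rho(x) ∝ exp(-Phi(x)/eta), on a 1-D
# grid, for a toy double-well potential tilted so that x = -1 is the global
# minimum.  As eta decreases, the mass concentrates around that minimum.
xs = np.linspace(-3.0, 3.0, 2001)
Phi = 0.25 * (xs**2 - 1.0) ** 2 + 0.1 * xs    # toy potential, global min near -1

def mass_near(eta, center, width=0.3):
    w = np.exp(-(Phi - Phi.min()) / eta)      # subtract the min for stability
    w /= w.sum()                              # normalize: discrete analogue of 1/Z
    return w[np.abs(xs - center) < width].sum()

for eta in (1.0, 0.3, 0.1, 0.03):
    print(f"eta={eta}: mass within 0.3 of x=-1 is {mass_near(eta, -1.0):.3f}")
```

The printed mass tends to 1 as `eta` shrinks, which is the concentration phenomenon formalized later through the quasi–potential.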
Under Assumption 1, that is, the existence and uniqueness of a stationary (invariant) density \(\rho^{ss}\), it is guaranteed that as \(t \to \infty\) the SGD density \(\rho(t, \cdot)\) converges to \(\rho^{ss}\) in the sense of KL divergence. This is the variational inference interpretation of SGD. We have the following theorem, proved in [3, Theorem 5].
The functional \(F(\rho) = \mathrm{KL}(\rho \,\|\, \rho^{ss})\) decreases monotonically along the trajectories of the Fokker–Planck equation (8) as \(t \to \infty\) and converges to its minimum, which is zero, at the steady state. Here
\[ \mathrm{KL}(\rho \,\|\, \rho^{ss}) = \int \rho(x) \ln \frac{\rho(x)}{\rho^{ss}(x)}\, dx \]
is the KL divergence between \(\rho\) and \(\rho^{ss}\). Further, an energy–entropy split of the functional is given by
\[ F(\rho) = \frac{1}{\eta}\, \mathbb{E}_{x \sim \rho}\big[ \Phi(x) \big] - H(\rho) + \ln Z(\eta), \]
where \(H(\rho) = -\int \rho \ln \rho \, dx\) is the entropy of the distribution \(\rho\).
In particular, the above Theorem implies the following Corollary.
We have \(\rho(t, x) \to \rho^{ss}(x)\) as \(t \to \infty\) for every \(x\).
Let us fix the learning rate \(\eta\) and thus consider the case when \(\eta\) is a fixed parameter. In this case, based on the above theorem,  shows that the steady state of SGD in (11) places most of its probability mass in regions of the parameter space with small values of \(\Phi\). In this sense, the potential function \(\Phi\) is understood as the new objective function that SGD minimizes, instead of the original function \(f\) in (1) (see further discussion in ). In particular, the function \(\Phi\) captures information from both the original loss function \(f\) and the diffusion matrix \(D(x)\). This has been discussed as a manifestation of implicit regularization along the SGD trajectory (see ). In particular, if \(D(x) \equiv c\,\mathrm{Id}\) for a constant \(c > 0\) and without boundary conditions, then one can show that \(\Phi\) equals \(f\) up to a scalar multiple (see [3, Lemma 6]), so that isotropic diffusion brings in no new effects via implicit regularization.
2.3 The asymptotics as \(\eta \to 0\) and the quasi–potential.
The above considerations concern the case when \(\eta\) is fixed, rather than the limit \(\eta \to 0\). When \(\eta\) is small, the normalizing factor \(Z(\eta)\) in (11) also scales with \(\eta\), so that the asymptotics of the stationary distribution do not depend only on the potential function \(\Phi\).
(logarithmic equivalence) Two families of quantities \(A_\eta\) and \(B_\eta\) depending on \(\eta\) are said to be logarithmically equivalent, denoted
\[ A_\eta \asymp B_\eta, \]
if and only if
\[ \lim_{\eta \downarrow 0} \eta \ln \frac{A_\eta}{B_\eta} = 0. \]
In other words, for any \(\varepsilon > 0\) there exists some \(\eta_0 > 0\) such that
\[ e^{-\varepsilon/\eta} \le \frac{A_\eta}{B_\eta} \le e^{\varepsilon/\eta} \]
for any \(0 < \eta < \eta_0\).
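A quick numerical sanity check of this definition, using the hypothetical pair \(A_\eta = e^{-1/\eta}\) and \(B_\eta = \eta\, e^{-1/\eta}\):

```python
import numpy as np

# A_eta = exp(-1/eta) and B_eta = eta * exp(-1/eta) differ only by a
# polynomial prefactor, so eta * ln(A_eta/B_eta) = eta * ln(1/eta) -> 0:
# the two families are logarithmically equivalent even though their ratio
# A_eta / B_eta = 1/eta diverges.
gaps = []
for eta in (0.5, 0.1, 0.02, 0.004):
    gap = eta * np.log(1.0 / eta)     # eta * ln(A_eta / B_eta), in closed form
    gaps.append(gap)
    print(f"eta={eta:<6} eta*ln(A/B)={gap:.4f}")
```

This is why logarithmic equivalence only pins down the exponential rate: polynomial prefactors, such as the partition function below, are invisible at this resolution.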
Our goal in this paper is to argue that, as \(\eta \downarrow 0\), we have
\[ \rho^{ss}(x) \asymp \exp\Big( -\frac{V(x)}{\eta} \Big). \tag{14} \]
That is, for any \(\varepsilon > 0\) we can pick an \(\eta_0\) small enough such that
\[ \exp\Big( -\frac{V(x) + \varepsilon}{\eta} \Big) \le \rho^{ss}(x) \le \exp\Big( -\frac{V(x) - \varepsilon}{\eta} \Big) \]
for all \(0 < \eta < \eta_0\). In other words,
\[ V(x) = -\lim_{\eta \downarrow 0} \eta \ln \rho^{ss}(x). \]
The function \(V\) is called the global quasi–potential function (sometimes abbreviated as the quasi–potential function, depending on the context) and can be constructed from the original loss function \(f\) and the diffusion matrix \(D(x)\). Thus it depends on the weight vector \(x\), the architecture of the learning model (such as a neural network), the loss function, and the data set.
The asymptotic identity (14) does not involve a normalizing constant as in (11); indeed, the global quasi–potential function \(V\) has a global minimum point \(x^*\) with \(V(x^*) = 0\). Combined with the ansatz (14), this indicates that for small \(\eta\) the stationary density \(\rho^{ss}\) concentrates on a global minimum point of the quasi–potential \(V\). By Corollary 2.4, this global minimum point can be understood as the long–time behavior of the SGD dynamics as first \(t \to \infty\) and then \(\eta \downarrow 0\). This indicates that in the asymptotic regime \(\eta \downarrow 0\), SGD minimizes the quasi–potential \(V\) rather than the original function \(f\). Comparing (11) and (14), we obtain
\[ V(x) = \lim_{\eta \downarrow 0} \big( \Phi(x) + \eta \ln Z(\eta) \big). \]
Thus the two potential functions \(\Phi\) and \(V\) differ by a term involving the normalizing factor (partition function) \(Z(\eta)\). Moreover, for fixed \(\eta\), the potential \(\Phi\) may depend on \(\eta\), and thus we can think of \(V\) as the limit of \(\Phi\) as \(\eta \downarrow 0\).
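This relation between the two potentials and the partition function can be checked in closed form in the simplest Ornstein–Uhlenbeck case, where the stationary density is known exactly; the sketch below assumes \(f(x) = x^2/2\) and \(D = 1\), which are illustrative choices, not the paper's.

```python
import numpy as np

# Toy check of V = lim_{eta -> 0} (Phi + eta * ln Z(eta)) for the 1-D
# Ornstein-Uhlenbeck case f(x) = x^2/2 with isotropic noise D = 1.
# The stationary density of dX = -X dt + sqrt(eta) dW is exactly
# rho_ss(x) = exp(-x^2/eta) / Z(eta), so Phi(x) = x^2, Z(eta) = sqrt(pi*eta),
# and the quasi-potential should be V(x) = 2 f(x) = x^2.
x = 1.3
for eta in (1.0, 0.1, 0.01, 0.001):
    Z = np.sqrt(np.pi * eta)
    V_approx = x**2 + eta * np.log(Z)   # = -eta * ln rho_ss(x), computed stably
    print(eta, V_approx)
# V_approx tends to x^2 = 1.69 as eta decreases
```

Note that the correction `eta * ln Z` vanishes in the limit here; in multi–well landscapes the analogous correction is what reshapes \(\Phi\) into the global quasi–potential.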
In the next two sections we demonstrate how \(V\) is calculated from the diffusion matrix \(D(x)\) and the loss landscape \(f\).
3 Local quasi–potential: the case of a convex loss function.
Let us assume in this section that the original loss function \(f\) is convex and admits only one minimum point, which is also its global minimum point; without loss of generality, let it be the origin \(O\). In this section we introduce the local quasi–potential function and connect it to the SGD noise covariance structure via a partial differential equation of Hamilton–Jacobi type. The analysis is based on interpreting LDT as a path–integral theory on trajectory space.
3.1 SGD as a small random perturbation of Gradient Descent (GD).
For small \(\eta\), the SGD process \(X_t\) in (4) has trajectories that are close to the Gradient Descent (GD) flow characterized by the deterministic equation
\[ \frac{d x_t^{GD}}{dt} = -\nabla f(x_t^{GD}). \tag{16} \]
In fact, it can easily be justified (see Appendix C) that we have the following.
Under Assumption 1 we have, for any \(T > 0\) and \(\delta > 0\),
\[ \mathbb{P}\Big( \sup_{0 \le t \le T} |X_t - x_t^{GD}| > \delta \Big) \le C \eta \tag{17} \]
for some constant \(C = C(T, \delta) > 0\).
When (17) holds, we simply say that \(X_t\) and \(x_t^{GD}\) are \(\delta\)–close on \([0, T]\). Thus in finite time the SGD process is attracted to a neighborhood of the origin \(O\). Since \(O\) is the only minimum point of the convex loss function \(f\), every point is attracted by the gradient flow (16) to \(O\). Let us take any open set \(G\) containing the origin \(O\). Due to the attraction property of the deterministic gradient descent flow and the \(\delta\)–closeness of the SGD path to it, the SGD process will spend a long time in this neighborhood of the origin before it escapes from \(G\) and hits somewhere on \(\partial G\). Such an escape is due to the small random term in (4), which causes fluctuations of the SGD process away from the deterministic trajectory of the gradient descent flow. In terms of optimization, SGD in this case finds the minimum point of the convex function \(f\) just as GD does. However, in the presence of multiple local minimum points, the escape caused by the small randomness in the SGD process (4) is a key feature that leads to its regularization properties, such as the selection of flat minimizers over sharp minimizers (see ). The escape properties from basins of attraction due to small random perturbations can also be studied in the case of just one minimum point \(O\): we take an open neighborhood \(G\) of \(O\) and consider the escape behavior of the SGD process from the set \(G\) to its boundary \(\partial G\). This will be done via LDT in the next subsection.
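The closeness of the SGD diffusion to the GD flow can be visualized with a short Euler–Maruyama simulation; the quadratic loss and all constants below are illustrative assumptions, not the paper's.

```python
import numpy as np

# Euler-Maruyama simulation of the SGD diffusion (4) alongside the GD flow
# (16) for the toy convex loss f(x) = 0.5*|x|^2 with identity diffusion.
# The maximal gap on [0, T] shrinks as eta decreases.
rng = np.random.default_rng(2)
dt, T = 0.01, 5.0
n = int(T / dt)
x0 = np.array([2.0, -1.0])

def simulate(eta):
    x_sgd, x_gd, max_gap = x0.copy(), x0.copy(), 0.0
    for _ in range(n):
        noise = np.sqrt(eta * dt) * rng.normal(size=2)   # sqrt(eta)*sigma*dW
        x_sgd = x_sgd - x_sgd * dt + noise               # grad f(x) = x
        x_gd = x_gd - x_gd * dt
        max_gap = max(max_gap, float(np.linalg.norm(x_sgd - x_gd)))
    return max_gap

gaps = {eta: simulate(eta) for eta in (0.1, 0.01, 0.001)}
for eta, g in gaps.items():
    print(f"eta={eta}: sup_t |X_t - x_t^GD| = {g:.4f}")
```

On any fixed window the gap is of order \(\sqrt{\eta}\); the escape from the well happens only on time scales exponentially long in \(1/\eta\), which is where LDT enters.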
3.2 Large Deviations Theory (LDT) interpreted as a path integral in the trajectory space.
To quantitatively characterize such escape properties, we use Large Deviations Theory (LDT) (see , , ). Roughly, this theory assigns probability weights in path space to the solution of (4). That is to say, for a given regular connecting path \(\phi_t\), \(0 \le t \le T\), with \(\phi_0 = x\), some \(\delta\) small enough, and any \(\eta\) small enough, we have
\[ \mathbb{P}\Big( \sup_{0 \le t \le T} |X_t - \phi_t| < \delta \Big) \asymp \exp\Big( -\frac{S_{0T}(\phi)}{\eta} \Big), \tag{18} \]
where \(\asymp\) denotes logarithmic equivalence. The asymptotic (18) can be understood as providing a density function for the process \(X_t\) in path space:
\[ \rho_{\mathrm{path}}(\phi) \asymp \exp\Big( -\frac{S_{0T}(\phi)}{\eta} \Big). \tag{19} \]
The precise statement that leads to the asymptotic (18) is given in Appendix A. In LDT, the functional \(S_{0T}(\phi)\) in (18) is called the rate functional. This rate functional can, however, also be interpreted as the action functional that generates the solution of the Fokker–Planck equation (8). In fact, following Feynman's path–integral approach to quantum mechanics (see , ), formally integrating over individual paths according to their weights in (19), we have
\[ \rho(T, y) = \int_{\phi_0 = x, \ \phi_T = y} \exp\Big( -\frac{S_{0T}(\phi)}{\eta} \Big)\, \mathcal{D}\phi, \tag{20} \]
where the integral is a formal integration over path space with “path differential” \(\mathcal{D}\phi\), and \(\rho(T, y)\) is the solution of the Fokker–Planck equation (8) (a partial differential equation) with initial density \(\rho(0, \cdot) = \delta_x\). In Feynman's theory, the term “action” corresponds to the exponent in the above integration. In this way, from (20) we get
\[ -\lim_{\eta \downarrow 0} \eta \ln \rho(T, y) = \inf\big\{ S_{0T}(\phi) : \phi_0 = x, \ \phi_T = y \big\}. \tag{21} \]
Assuming that the limits \(\eta \downarrow 0\) and \(T \to \infty\) are interchangeable, we then have by (21) that
\[ -\lim_{\eta \downarrow 0} \eta \ln \rho^{ss}(y) = \lim_{T \to \infty} \inf\big\{ S_{0T}(\phi) : \phi_0 = x, \ \phi_T = y \big\} = \inf_{T > 0} \inf\big\{ S_{0T}(\phi) : \phi_0 = O, \ \phi_T = y \big\}. \tag{22} \]
The last equality in the above display is due to the fact that we can always take a path \(\phi\) that first follows the gradient flow from \(x\) down to \(O\) at negligible cost and then stays at \(O\), i.e., \(\phi_t = O\), for an arbitrarily long time before exiting. Equation (22) demonstrates a relation between the stationary measure \(\rho^{ss}\) and the “action functional” introduced in LDT.
3.3 Local quasi–potential function as the solution to a variational problem and Hamilton–Jacobi equation.
From the above considerations, we can define a local quasi–potential function as
\[ V_{loc}(x, y) = \inf_{T > 0} \inf\big\{ S_{0T}(\phi) : \phi_0 = x, \ \phi_T = y \big\}, \tag{23} \]
which matches the identity (14) in the form
\[ \rho^{ss}(y) \asymp \exp\Big( -\frac{V_{loc}(O, y)}{\eta} \Big). \tag{24} \]
This implies that when the gradient system (16) has only one stable attractor, the quasi–potential \(V\) is given by the local quasi–potential \(V_{loc}\), which is the solution of the variational problem (23).
One may observe that the local quasi–potential function defined in (23) depends on the initial condition \(x\), while the stationary measure \(\rho^{ss}\) does not. In fact, since (18) and (19) describe the density in path space of the process \(X_t\) started from \(x\), the quantity in (23) a priori carries a dependence on \(x\). However, when the loss function \(f\) is convex, all initial points are attracted by the gradient flow (16) to the origin \(O\), and the flow segment from \(x\) to \(O\) contributes zero action. Combining these two effects, we see that \(V_{loc}(x, y) = V_{loc}(O, y)\), so that in the convex case the local quasi–potential does not depend on the initial point. In general, however, when \(f\) has several different local minimum points, one has to specify the local quasi–potential with respect to the initial point \(x\).
When obtaining (22) we exchanged the order of the limits \(\eta \downarrow 0\) and \(T \to \infty\). This is legitimate because the gradient flow (16) has only one attractor, so that over long times we do not expect transitions between different attractors (transitions between different attractors are discussed in Section 4). That being said, the asymptotics of \(\rho^{ss}\) do not depend on the order of the limits \(\eta \downarrow 0\) and \(T \to \infty\), resulting in the exponential form (24) of the stationary measure. See  for more details.
We have seen that, according to our general framework relating the variational inference view of SGD to the stationary measure, SGD minimizes the local quasi–potential function \(V_{loc}\) when the loss function is convex. The function \(V_{loc}\) is the solution of the variational problem (23). In terms of implicit regularization, the quasi–potential is related to the diffusion matrix \(D(x)\) in (5) by solving the variational problem (23) with an explicit form of the action. According to LDT (see , , ,  as well as Appendix A), when the gradient flow (16) has only one attractor \(O\), the SGD diffusion equation (4) satisfies a large deviations principle with action functional (rate functional) given by the explicit formula
\[ S_{0T}(\phi) = \frac{1}{2} \int_0^T \big( \dot{\phi}_t + \nabla f(\phi_t) \big)^T D(\phi_t)^{-1} \big( \dot{\phi}_t + \nabla f(\phi_t) \big)\, dt. \tag{25} \]
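The action functional (25) can be discretized directly. In the sketch below (1-D, \(f(x) = x^2/2\), \(D = 1\), all assumed purely for illustration), the downhill GD path has essentially zero action, while the time–reversed uphill path from near the minimum to \(x = 1\) has action close to \(2(f(1) - f(0)) = 1\), consistent with \(V = 2f\) for isotropic noise.

```python
import numpy as np

# Discretized action functional (25) for D = Id in 1-D with f(x) = x^2/2:
#   S[phi] = 0.5 * integral of |phi' + f'(phi)|^2 dt,   f'(x) = x.
def action(phi, dt):
    dphi = np.diff(phi) / dt
    mid = 0.5 * (phi[1:] + phi[:-1])              # midpoint rule for phi
    return 0.5 * np.sum((dphi + mid) ** 2) * dt

dt = 1e-4
T = np.log(1.0 / 0.01)                            # time for 0.01*e^t to reach 1
t = np.arange(0.0, T, dt)
uphill = 0.01 * np.exp(t)                         # solves phi' = +f'(phi)
downhill = uphill[::-1]                           # same curve, traversed downhill

print("uphill action   ≈", action(uphill, dt))    # ≈ 1.0 = 2*(f(1) - f(0))
print("downhill action ≈", action(downhill, dt))  # ≈ 0: the GD flow is free
```

The uphill minimizer being the time–reversed gradient flow is special to gradient systems with isotropic noise; with a nontrivial \(D(x)\) the optimal exit path bends toward the noisy directions.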
Combining the formula (25) for the action functional with the variational formulation of the quasi–potential, one obtains by classical calculus of variations that the local quasi–potential function satisfies a partial differential equation of Hamilton–Jacobi type involving the diffusion matrix \(D(x)\). We have the following.
The local quasi–potential \(V_{loc}\) is a solution of the Hamilton–Jacobi equation
\[ \frac{1}{2}\, \nabla V_{loc}(y)^T D(y)\, \nabla V_{loc}(y) - \nabla f(y)^T \nabla V_{loc}(y) = 0 \tag{26} \]
with boundary condition
\[ V_{loc}(O) = 0. \tag{27} \]
The proof is found in Appendix B. The dependence of the Hamilton–Jacobi equation (26) on the diffusion matrix \(D(y)\) can be viewed as a quantitative manifestation of implicit regularization through the quasi–potential. In particular, when \(D(y) \equiv \mathrm{Id}\), it is easy to see that \(V_{loc} = 2f\) (with \(f(O) = 0\)) is a solution. This justifies the prediction that for isotropic noise and a single minimizer, the quasi–potential is just the original loss function up to a multiplicative constant. Later in this section we provide an example with an anisotropic noise covariance structure and discuss the properties of the quasi–potential in that case by making use of Theorem 3.4.
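Equation (26) can be verified numerically for a simple quadratic loss with constant diagonal diffusion; the candidate quasi–potential below is an assumed closed form for this toy case, not a formula from the paper.

```python
import numpy as np

# Numerical check of the Hamilton-Jacobi equation (26),
#   0.5 * <grad V, D grad V> - <grad f, grad V> = 0,   V(O) = 0,
# for the toy quadratic loss f(x) = 0.5*(x1^2 + x2^2) and constant diagonal
# diffusion D = diag(d1, d2).  A candidate quasi-potential for this toy case
# is V(x) = x1^2/d1 + x2^2/d2; when d1 = d2 = 1 it reduces to V = 2f.
d1, d2 = 4.0, 0.5
D = np.diag([d1, d2])

def grad_f(x):
    return x                                   # gradient of 0.5*|x|^2

def grad_V(x):
    return np.array([2 * x[0] / d1, 2 * x[1] / d2])

rng = np.random.default_rng(3)
for _ in range(5):
    x = rng.normal(size=2)
    p = grad_V(x)
    residual = 0.5 * p @ D @ p - grad_f(x) @ p
    print(abs(residual))                       # ≈ 0 up to floating-point error
```

For \(d_1 \neq d_2\) this candidate differs from \(2f\), a concrete instance of the noise covariance reshaping the effective objective.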
3.4 Escape properties from local minimum points in terms of the local quasi–potential.
Another remarkable feature of the local quasi–potential is that it characterizes the escape properties from local minimum points. As described in the first paragraph of this section, the escape from sharp minima to flat minima is a key feature leading to good generalization (see ). The LDT estimate (18) provides a tool for obtaining exponential estimates of the exit probability and the mean first exit time from the basin of an attractor. To illustrate this, consider the set–up of this section, where the loss function \(f\) is convex and admits the single attractor \(O\). Let \(G\) be an open neighborhood of \(O\) with boundary \(\partial G\). Let the process \(X_t\) start its motion from an initial point \(x \in G\) and consider its first hitting time of \(\partial G\):
\[ \tau^{\eta} = \inf\{ t > 0 : X_t \in \partial G \}. \]
Suppose the SGD process starts from some \(x \in G\). Let \(B_{\rho}\) be a closed ball around \(O\) with radius \(\rho\). Pick some small \(\rho > 0\) and let \(\gamma = \partial B_{\rho}\) and \(\Gamma = \partial B_{2\rho}\), such that \(B_{2\rho} \subset G\). We introduce a sequence of Markov times in the following way: let \(\tau_0 = 0\), \(\sigma_n = \inf\{ t \ge \tau_n : X_t \in \gamma \}\) and \(\tau_{n+1} = \inf\{ t \ge \sigma_n : X_t \in \Gamma \cup \partial G \}\), \(n = 0, 1, 2, \ldots\). Consider the Markov chain \(Z_n = X_{\tau_n}\), \(n = 1, 2, \ldots\). The state space of \(Z_n\) is \(\Gamma \cup \partial G\). Together with the hitting time we also define the one–step transition probability by
\[ p^{\eta}(x) = \mathbb{P}_x\big( Z_1 \in \partial G \big). \]
Then we have the following theorem characterizing the exponential asymptotics of the exit probability and the mean exit time.
Assume that the boundary \(\partial G\) of the domain is smooth and
\[ \langle \nabla f(y), n(y) \rangle > 0 \]
for \(y \in \partial G\), where \(n(y)\) is the exterior normal vector to the boundary of \(G\). Then for \(x \in G\) we have the asymptotic
\[ p^{\eta}(x) \asymp \exp\Big( -\frac{1}{\eta} \min_{y \in \partial G} V_{loc}(y) \Big), \tag{29} \]
and the mean exit time has the exponential asymptotic
\[ \mathbb{E}_x \tau^{\eta} \asymp \exp\Big( \frac{1}{\eta} \min_{y \in \partial G} V_{loc}(y) \Big). \tag{30} \]
It can also be shown that the local quasi–potential is related to the first exit position. In fact, assuming that \(y^* \in \partial G\) is the only minimum point of \(V_{loc}\) on \(\partial G\), then as \(\eta \downarrow 0\) the first exit position approaches this minimum point, i.e.,
\[ \lim_{\eta \downarrow 0} \mathbb{P}_x\big( |X_{\tau^{\eta}} - y^*| < \delta \big) = 1 \quad \text{for every } \delta > 0. \]
From here we see that the escape properties of the process \(X_t\) from a local minimum point, such as the exit probability, the mean escape time and even the first exit position, are governed by the quasi–potential. Combined with Theorem 3.4, this indicates that the noise covariance structure affects the escape properties from local minimum points. Using these results, the next example shows that, in some cases, anisotropic noise helps the SGD process escape faster from a local minimum point than isotropic noise. This validates some of the predictions in .
Let \(x = (x_1, x_2) \in \mathbb{R}^2\) and consider a convex quadratic loss function \(f\). Let the neighborhood \(G\) be a ball around the origin, with initial condition at the origin, and let the diffusion matrix \(D\) be diagonal with a parameter \(\lambda\) controlling its anisotropy. Notice that when \(\lambda = 0\), \(D\) is isotropic, and when \(\lambda \neq 0\), \(D\) is anisotropic. It is easy to check, by verifying equation (26), that for each \(\lambda\) the local quasi–potential is an explicit quadratic form whose minimum over \(\partial G\) decreases as the anisotropy increases, so that by Theorem 3.5 the mean exit time is exponentially shorter in the anisotropic case. We have thus justified that, in this case, anisotropic noise leads to faster escape than isotropic noise.
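The faster anisotropic escape can also be observed directly by Monte-Carlo simulation of (4); the quadratic loss, exit domain, and diffusion parametrization below are toy assumptions (not the paper's exact constants), with the trace of \(D\) held fixed so that only the anisotropy varies.

```python
import numpy as np

# Mean exit time from the unit disk for dX = -X dt + sqrt(eta) sigma dW with
# f(x) = 0.5*|x|^2 and D = diag(1+lam, 1-lam) (trace fixed at 2).  The
# quasi-potential minimum on the boundary is 1/(1+lam), so larger lam means
# a lower effective barrier and an exponentially shorter mean exit time.
def mean_exit_time(lam, eta=0.3, dt=0.005, n_paths=200, seed=4):
    rng = np.random.default_rng(seed)
    sig = np.sqrt(eta * dt * np.array([1.0 + lam, 1.0 - lam]))
    x = np.zeros((n_paths, 2))
    t = np.zeros(n_paths)
    alive = np.ones(n_paths, dtype=bool)
    while alive.any():
        n_alive = int(alive.sum())
        x[alive] += -x[alive] * dt + sig * rng.normal(size=(n_alive, 2))
        t[alive] += dt
        alive &= (x**2).sum(axis=1) < 1.0      # a path dies once it exits
    return float(t.mean())

t_iso = mean_exit_time(lam=0.0)
t_aniso = mean_exit_time(lam=0.8)
print("isotropic mean exit time:  ", t_iso)
print("anisotropic mean exit time:", t_aniso)
```

The exit also concentrates along the noisy coordinate axis, in line with the first–exit–position statement above.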
4 Global quasi–potential: the case of multiple global minima and the stochastic dynamics of SGD.
In the previous section we introduced a local quasi–potential function for the case when the loss function is convex and admits only one stable attractor \(O\). In terms of the stationary measure, from (22) our local quasi–potential function can be viewed as the limit
\[ V_{loc}(y) = \lim_{T \to \infty} \Big( -\lim_{\eta \downarrow 0} \eta \ln \rho(T, y) \Big). \]
We explained in Remark 3.3 that the exchange of the limit order in the above demonstration is due to the fact that the loss function is convex and admits only one minimum point \(O\). In this section we consider the case when the loss function is non–convex and possibly admits several different local minimum points. Under this scenario, the exchange of the limit order is not valid, and instead we have
\[ V(y) = -\lim_{\eta \downarrow 0} \eta \ln \Big( \lim_{T \to \infty} \rho(T, y) \Big) = -\lim_{\eta \downarrow 0} \eta \ln \rho^{ss}(y). \tag{32} \]
This defines the global quasi–potential function \(V\), sometimes abbreviated as the quasi–potential function. The asymptotic (32) indicates that we have the expression of the stationary measure demonstrated in (14):
\[ \rho^{ss}(y) \asymp \exp\Big( -\frac{V(y)}{\eta} \Big). \tag{33} \]
This relation between the quasi–potential and the stationary measure indicates that SGD minimizes the quasi–potential function \(V\). However, such a definition alone does not provide useful information on how to construct the quasi–potential, nor on how it is related to the original loss function \(f\) and the covariance structure \(D(x)\).
In fact, classical Large Deviations Theory (see , ) provides a systematic, yet rather complicated, way to construct the global quasi–potential from its local versions. However, for the purpose of studying SGD we are mainly interested in the locations of the local and global minimum points of \(V\), since these are the points at which the SGD process will finally be trapped. This can be understood from the dynamics of the SGD process (4) via the LDT Theorem 3.5. Let us illustrate this via a one–dimensional (i.e., in \(\mathbb{R}\)) example as follows.
Consider a loss function defined in a piecewise way, so that it has two local minimum points \(O_1\) and \(O_2\), with the minimal values of \(f\) at \(O_1\) and \(O_2\) both equal to the same constant. The corresponding two basins of attraction are \(G_1\) and \(G_2\), in each of which the function \(f\) increases from the minimum to the common barrier between the wells. Let us consider the corresponding piecewise–defined gradient function, together with a diffusion coefficient taking different constant values on the two basins. In a very similar fashion to Example 3.1, we can calculate the local quasi–potentials \(V_1\) and \(V_2\) with respect to \(O_1\) and \(O_2\) within their respective basins. Using the same reasoning as in Example 3.1, this indicates that the two wells carry different quasi–potential barriers, \(\min_{\partial G_1} V_1 \neq \min_{\partial G_2} V_2\), even though the loss barriers are equal.
When the SGD process in (4) enters one of the basins of attraction, say \(G_1\), it is attracted to \(O_1\) by the gradient flow dynamics and fluctuates around it due to the small noise term in (4). The probability that it exits this basin is given by (29) in Theorem 3.5, that is,
\[ p^{\eta} \asymp \exp\Big( -\frac{1}{\eta} \min_{y \in \partial G_1} V_1(y) \Big). \]
Once the process reaches \(\partial G_1\), with positive probability it enters \(G_2\) and gets attracted to \(O_2\). In this way, the transition from \(O_1\) to \(O_2\) can be viewed as a Markov chain step with transition probability
\[ p_{12} \asymp \exp\Big( -\frac{1}{\eta} \min_{y \in \partial G_1} V_1(y) \Big). \]
Similarly, from \(O_2\) a transition to \(O_1\) may happen with probability
\[ p_{21} \asymp \exp\Big( -\frac{1}{\eta} \min_{y \in \partial G_2} V_2(y) \Big). \]
Since \(O_1\) and \(O_2\) are the two attractors of the SGD process with this loss function, the SGD dynamics spends most of its time near \(O_1\) and \(O_2\). Thus the dynamics can be viewed as a Markov chain on \(\{O_1, O_2\}\) with transition probabilities \(p_{12}\) and \(p_{21}\) between them. This implies that the stationary measure concentrates on \(O_1\) and \(O_2\), with
\[ \frac{\rho^{ss}(O_1)}{\rho^{ss}(O_2)} \asymp \frac{p_{21}}{p_{12}}. \]
Set \(\kappa = \min_{\partial G_2} V_2 - \min_{\partial G_1} V_1\) and suppose \(\kappa > 0\). Then \(\rho^{ss}(O_1) \to 0\) and \(\rho^{ss}(O_2) \to 1\) as \(\eta \downarrow 0\). Combined with (33), this indicates that among the two local minimum points \(O_1\) and \(O_2\) of the loss function in this example, the quasi–potential \(V\) has local minima at \(O_1\) and \(O_2\) with \(V(O_2) = 0 < V(O_1) = \kappa\). This indicates that the SGD process tends to select \(O_2\) rather than \(O_1\) as its final resting place, even though the original loss function takes the same value at both \(O_1\) and \(O_2\). This “selection of a specific local minimum point” can be viewed as a regularization induced by the anisotropic noise covariance structure.
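The metastable selection described above can be reproduced in a small simulation; the smooth double well and the two–level noise below are a stand-in for the piecewise example of this section, with all constants assumed for illustration.

```python
import numpy as np

# 1-D SGD-like dynamics in a symmetric double well with state-dependent
# ("anisotropic") noise:
#   f(x) = (x^2 - 1)^2 / 4, minima at -1 and +1 with equal loss values,
#   D(x) = d_left for x < 0 and d_right for x > 0, with d_left > d_right.
# The quasi-potential barrier out of each well scales like 2*Delta_f / d,
# so the noisier left well is easier to leave, and the trajectory spends
# most of its time near the quieter minimum x = +1.
rng = np.random.default_rng(5)
eta, dt, n_steps = 0.2, 0.01, 400_000
d_left, d_right = 2.0, 0.5
sig_left = np.sqrt(eta * d_left * dt)
sig_right = np.sqrt(eta * d_right * dt)

x, time_right = 0.0, 0
for z in rng.normal(size=n_steps):
    sig = sig_left if x < 0 else sig_right
    x += -x * (x * x - 1.0) * dt + sig * z     # drift = -f'(x) = -x(x^2 - 1)
    time_right += x > 0

frac_right = time_right / n_steps
print("fraction of time spent near x = +1:", frac_right)
```

Both minima have the same loss value, yet the empirical occupation is heavily skewed toward one of them: a direct realization of the Markov-chain selection mechanism.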
5 Numerical Experiments.
We have performed numerical experiments for Example 3.1, with a fixed choice of \(\eta\) and stepsize. In Figure 1 (a)–(d) the number of iterations is fixed. Figures 1 (a) and (c) (blue) are for isotropic noise, and Figures 1 (b) and (d) are for anisotropic noise. Figures 1 (a) and (b) show the evolution of the radial processes