Optimization plays an important role in scientific fields ranging from the physical sciences to machine learning, statistics, and engineering. Numerous algorithms and methods have been proposed to solve optimization problems. Examples include Newton's methods, gradient and subgradient descent, conjugate gradient methods, trust region methods, and interior point methods (see Polyak, 1987; Boyd and Vandenberghe, 2004; Nocedal and Wright, 2006; Ruszczynski, 2006; Boyd et al., 2011; Shor, 2012; Goodfellow et al., 2016 for expositions). Practical problems arising in fields like statistics and machine learning usually involve optimization settings where the objective functions are empirically estimated from available data (a training sample or a statistical sample) in the form of a sum of differentiable functions. We refer to such optimization problems with random objective functions as stochastic optimization. As data sets in practical problems grow rapidly in scale and complexity, methods such as stochastic gradient descent can scale to the enormous size of big data and have become very popular. There has been great recent research interest in the theory and practice of gradient descent and its extensions and variants. For example, a number of recent papers were devoted to investigating stochastic gradient descent and its variants for solving complex optimization problems (Chen et al., 2016; Li et al., 2017; Mandt et al., 2016; Kawaguchi, 2016; Keskar et al., 2017; Lee et al., 2016; Ge et al., 2015; Jin et al., 2017). Su et al. (2016) showed that the continuous-time limit of Nesterov's accelerated gradient descent is a second-order ordinary differential equation that can be used to understand and analyze the acceleration phenomenon and generalize Nesterov's scheme; and Wibisono et al.
(2016) studied acceleration from a continuous-time variational point of view and further developed a systematic approach to understanding the acceleration phenomenon and producing acceleration algorithms from continuous-time differential equations generated by a so-called Bregman Lagrangian. In spite of compelling theoretical and numerical evidence on the value of the stochastic approximation idea and the acceleration phenomenon, there remains some conceptual mystery in the acceleration and stochastic approximation schemes.
This paper establishes asymptotic theory for gradient descent, stochastic gradient descent, and accelerated gradient descent in the stochastic optimization setup. We derive continuous-time ordinary or stochastic differential equations to model the dynamic behaviors of these gradient descent algorithms and investigate their asymptotic distributions as the data size goes to infinity. Specifically, for an optimization problem whose objective function is convex and deterministic, we consider a matched stochastic optimization problem whose random objective function is an empirical estimator of the deterministic convex objective function based on available data. The solution of the stochastic optimization problem specifies a decision rule, such as an estimator or a classifier, based on the sampled data in statistics and machine learning, while the corresponding deterministic optimization problem characterizes through its solution the true value of the parameter in the statistical model. In other words, the two connected optimization problems are associated with the data sample and the population model from which the data are sampled, and the stochastic optimization is considered to be a sample version of the deterministic optimization corresponding to the population. We show that random sequences generated from these gradient descent algorithms and their continuous-time ordinary or stochastic differential equations for the stochastic optimization setting converge to ordinary differential equations for the corresponding deterministic optimization setup, with asymptotic distributions governed by certain linear ordinary or stochastic differential equations. Moreover, our analysis may offer a novel unified framework to carry out a joint asymptotic analysis for computational algorithms and the statistical decision rules that the algorithms are applied to compute.
As iterative computational methods, these gradient descent algorithms generate sequences of numerical values that converge to the exact decision rule or the true parameter value of the corresponding optimization problems, as the number of iterations goes to infinity. Thus, as time (corresponding to the number of iterations) goes to infinity, the continuous-time differential equations may have distributional limits corresponding to the asymptotic distributions of statistical decision rules as the sample size goes to infinity. In other words, the asymptotic analysis can be done in both time and data size, where the time direction corresponds to computational asymptotics on the dynamic behaviors of algorithms, and the data size direction is associated with the usual statistical asymptotics on the behaviors of decision rules such as estimators and classifiers. To the best of our knowledge, this is the first paper to provide a rigorous treatment of the connection between stochastic gradient descent and stochastic differential equations, discover the second-order stochastic differential equations for the accelerated case, and offer the unified asymptotic analysis. The continuous-time modeling and the joint asymptotic analysis may shed some light, via large deviation theory and limiting stationary distributions, on the phenomenon that stochastic gradient descent algorithms can escape from saddle points and converge to good local minimizers when solving non-convex optimization problems in deep learning.
The rest of the paper proceeds as follows. Section 2 introduces gradient descent, accelerated gradient descent, and their corresponding ordinary differential equations. Section 3 presents stochastic optimization and investigates the asymptotic behaviors of the plain and accelerated gradient descent algorithms and their associated ordinary differential equations (with random coefficients) as the sample size goes to infinity. We illustrate the unified framework to carry out a joint analysis of computational asymptotics in time (or iteration) for the gradient descent algorithms and statistical asymptotics in sample size for the statistical decision rules that the algorithms are applied to compute. Section 4 considers stochastic gradient descent algorithms for large scale data and derives stochastic differential equations to model these algorithms. We establish asymptotic theory for these algorithms and their associated stochastic differential equations, and illustrate a joint analysis of computational asymptotics in time for the stochastic gradient descent algorithms and statistical asymptotics in sample size for statistical decision rules. Section 5 features an example. All technical proofs are relegated to the appendix.
We adopt the following notations and conventions. For the stochastic optimization problem considered in Sections 3 and 4, we add a superscript $n$ to the notations for the associated processes and sequences in Section 3, and indices $m$ and/or $(n, m)$ to the notations for the corresponding processes and sequences affiliated with mini-batches in Section 4, while notations without any such subscripts or superscripts are for sequences and functions corresponding to the deterministic optimization problem given in Section 2.
2 Ordinary differential equations for gradient descent algorithms
Consider the following minimization problem
$$\min_{x \in \Theta} f(x), \qquad (1)$$
where the target function $f(x)$ is defined on a parameter space $\Theta \subseteq \mathbb{R}^p$ and assumed to have $L$-Lipschitz continuous gradients. Iterative algorithms such as gradient descent methods are often employed to numerically compute the solution of the minimization problem. Starting with some initial value $x_0$, the plain gradient descent algorithm is iteratively defined by
$$x_k = x_{k-1} - s \, \nabla f(x_{k-1}), \qquad k = 1, 2, \ldots, \qquad (2)$$
where $\nabla$ denotes the gradient operator, and $s$ is a positive constant which is often called a step size or learning rate.
It is easy to model $x_k$, $k = 0, 1, 2, \ldots$, by a smooth curve $X(t)$ with the Ansatz $x_k \approx X(ks)$ as follows. Define the step function $X_s(t) = x_k$ for $t \in [ks, (k+1)s)$, $k = 0, 1, 2, \ldots$; then, as $s \to 0$, $X_s(t)$ approaches $X(t)$ satisfying
$$\dot{X}(t) = -\nabla f(X(t)), \qquad (3)$$
where $\dot{X}(t)$ denotes the derivative of $X(t)$, and the initial value is $X(0) = x_0$.
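To make the Ansatz concrete, here is a minimal numerical sketch (our own illustration, not from the text) with an assumed one-dimensional quadratic objective $f(x) = (x - 1)^2/2$: the gradient descent iterate at step $k = T/s$ should track the closed-form ODE solution $X(T) = 1 + (x_0 - 1)e^{-T}$.

```python
import math

# Assumed objective for illustration: f(x) = (x - 1)^2 / 2, so grad f(x) = x - 1.
# The ODE dX/dt = -grad f(X) then has the closed form X(t) = 1 + (x0 - 1) e^{-t}.
def grad_f(x):
    return x - 1.0

def gradient_descent(x0, s, n_steps):
    x = x0
    path = [x]
    for _ in range(n_steps):
        x = x - s * grad_f(x)     # iteration (2) with step size s
        path.append(x)
    return path

x0, s, T = 4.0, 1e-3, 2.0
path = gradient_descent(x0, s, int(T / s))
ode_value = 1.0 + (x0 - 1.0) * math.exp(-T)   # X(T) from the ODE
print(abs(path[-1] - ode_value))  # small: the iterate x_k tracks X(ks)
```

Shrinking $s$ further tightens the agreement, matching the $s \to 0$ limit above.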
Nesterov's accelerated gradient descent scheme is a well-known algorithm that is much faster than the plain gradient descent algorithm. Starting with initial values $x_0$ and $y_0 = x_0$, Nesterov's accelerated gradient descent algorithm is iteratively defined by
$$x_k = y_{k-1} - s \, \nabla f(y_{k-1}), \qquad y_k = x_k + \frac{k-1}{k+2} (x_k - x_{k-1}), \qquad (4)$$
where $s$ is a positive constant. Using (4) we derive a recursive relationship between consecutive increments,
$$\frac{x_{k+1} - x_k}{\sqrt{s}} = \frac{k-1}{k+2} \cdot \frac{x_k - x_{k-1}}{\sqrt{s}} - \sqrt{s} \, \nabla f(y_k). \qquad (5)$$
We model $x_k$, $k = 0, 1, 2, \ldots$, by a smooth curve $X(t)$ in the sense that $x_k$ are its samples at discrete points; that is, we define the step function $X_s(t) = x_k$ for $t \in [k\sqrt{s}, (k+1)\sqrt{s})$, and introduce the Ansatz $x_k \approx X(k\sqrt{s})$ for some smooth function $X(t)$ defined for $t \geq 0$. Let $\sqrt{s}$ be the step size. For $t = k\sqrt{s}$, as $s \to 0$, we have $X(t) \approx x_k$, $X(t + \sqrt{s}) \approx x_{k+1}$, and $X(t - \sqrt{s}) \approx x_{k-1}$.
Applying Taylor expansion and using the $L$-Lipschitz continuity of the gradients we obtain
$$\frac{x_{k+1} - x_k}{\sqrt{s}} = \dot{X}(t) + \frac{1}{2} \ddot{X}(t) \sqrt{s} + o(\sqrt{s}), \qquad \frac{x_k - x_{k-1}}{\sqrt{s}} = \dot{X}(t) - \frac{1}{2} \ddot{X}(t) \sqrt{s} + o(\sqrt{s}), \qquad \sqrt{s} \, \nabla f(y_k) = \sqrt{s} \, \nabla f(X(t)) + o(\sqrt{s}),$$
where $\ddot{X}(t)$ denotes the second derivative of $X(t)$. Substituting the above results into equation (5) we obtain
$$\dot{X}(t) + \frac{1}{2} \ddot{X}(t) \sqrt{s} + o(\sqrt{s}) = \left(1 - \frac{3}{k+2}\right) \left[\dot{X}(t) - \frac{1}{2} \ddot{X}(t) \sqrt{s} + o(\sqrt{s})\right] - \sqrt{s} \, \nabla f(X(t)) + o(\sqrt{s}).$$
Re-arranging the terms and dividing by $\sqrt{s}$ on both sides lead to
$$\ddot{X}(t) + \frac{3}{(k+2)\sqrt{s}} \dot{X}(t) + \nabla f(X(t)) + o(1) = 0.$$
Note that, with $t = k\sqrt{s}$, the equation is free of the index $k$. As $s \to 0$, the equation becomes
$$\ddot{X}(t) + \frac{3}{t} \dot{X}(t) + \nabla f(X(t)) = 0, \qquad (6)$$
with the initial conditions $X(0) = x_0$ and $\dot{X}(0) = 0$. As the coefficient $3/t$ in (6) is singular at $t = 0$, classical ordinary differential equation theory is not applicable to establish the existence or uniqueness of the solution to (6). Su et al. (2016) derived (6) and proved that it has a unique solution $X(t)$ satisfying the initial conditions, and that $X_s(t)$ converges to $X(t)$ uniformly on $[0, T]$ for any fixed $T > 0$. Note the step size difference between the plain and accelerated cases: the time step is $\sqrt{s}$ for Nesterov's accelerated gradient descent algorithm but $s$ for the plain gradient descent algorithm. Su et al. (2016) have shown that, because of this difference, the accelerated gradient descent algorithm moves much faster than the plain gradient descent algorithm along the solution curve. See also Wibisono et al. (2016) for a more elaborate explanation of the acceleration phenomenon.
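A similar sketch (again with an assumed quadratic objective, $f(x) = x^2/2$, chosen for illustration) compares Nesterov's iterates under the Ansatz $x_k \approx X(k\sqrt{s})$ with a crude numerical integration of the second-order ODE, started just off the singular point $t = 0$.

```python
import math

# Assumed objective: f(x) = x^2 / 2, so grad f(x) = x. We compare Nesterov's
# iterates with a numerical solution of  X'' + (3/t) X' + grad f(X) = 0.
def grad_f(x):
    return x

s = 1e-4                      # Nesterov step size; the ODE time step is sqrt(s)
dt = math.sqrt(s)
T = 5.0
K = int(T / dt)               # number of Nesterov iterations reaching time T

# Nesterov's scheme (4)
x_prev, x = 1.0, 1.0
for k in range(1, K + 1):
    y = x + (k - 1) / (k + 2) * (x - x_prev)
    x_prev, x = x, y - s * grad_f(y)

# Semi-implicit Euler for the ODE, started slightly off the t = 0 singularity
t, X, v = 10 * dt, 1.0, 0.0   # X(0) = x0 = 1, X'(0) = 0
while t < T:
    v += dt * (-3.0 / t * v - grad_f(X))
    X += dt * v
    t += dt

print(x, X)  # the iterate and the ODE solution stay close
```

For this quadratic, both values are far closer to the minimizer $0$ than the plain gradient descent iterate after the same number of steps, which is the acceleration the time-step difference produces.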
3 Gradient descent for stochastic optimization
Let $x$ be the parameter that we are interested in, and $U$ a relevant random element on a probability space with a given distribution. Consider an objective function $\ell(x, U)$ and its corresponding expectation $f(x) = E[\ell(x, U)]$. For example, in a statistical decision problem, we may take $x$ to be a decision rule, $\ell$ a loss function, and $f$ its corresponding risk; in M-estimation, we may treat $U$ as a sample observation and $\ell$ as a $\rho$-function; in nonparametric function estimation and machine learning, we may choose $U$ to be an observation and $\ell$ equal to a loss function plus some penalty. For these problems we need to consider the corresponding minimization problem (1) for characterizing the true parameter value; but practically, because $f$ is usually unavailable, we have to employ its empirical version and consider a stochastic optimization problem, described as follows:
$$\min_{x \in \Theta} f^n(x), \qquad f^n(x) = \frac{1}{n} \sum_{i=1}^n \ell(x, U_i), \qquad (7)$$
where $\{U_1, \ldots, U_n\}$ is a training sample or statistical sample, and we assume $U_1, \ldots, U_n$ are i.i.d. and follow the distribution of $U$.
Minimization problem (1) characterizes the true value of the parameter in the statistical model, such as an underlying M-functional in the M-estimation setup. As the true target function $f$ is usually unknown in practice, we often solve the stochastic minimization problem (7) with observed data to obtain practically useful decision rules such as an M-estimator, a smoothing function estimator, or a machine learning classifier. The approach to obtaining practical procedures is based on the heuristic reasoning that as $n \to \infty$, the law of large numbers implies that $f^n$ eventually converges to $f$ in probability, and thus the solution of (7) approaches that of (1).
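This heuristic can be illustrated with a toy instance; the quadratic loss and the sampling distribution below are assumptions made here for illustration only. With $\ell(x, u) = (x - u)^2/2$, the empirical risk minimizer is the sample mean, which the law of large numbers drives toward the population minimizer $E[U]$.

```python
import random

random.seed(1)

# Assumed illustration: l(x, u) = (x - u)^2 / 2 with U ~ N(3, 4).
# The population risk f(x) = E[l(x, U)] is minimized at x* = E[U] = 3;
# the empirical risk f^n is minimized at the sample mean, which the
# law of large numbers drives toward x* as n grows.
mu = 3.0
errors = []
for n in (10, 1_000, 100_000):
    sample = [random.gauss(mu, 2.0) for _ in range(n)]
    argmin_fn = sum(sample) / n      # minimizer of the empirical risk f^n
    errors.append(abs(argmin_fn - mu))
print(errors)  # the gap to x* shrinks as n grows
```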
3.1 Plain gradient descent algorithm
The plain gradient descent algorithm can be applied to solve the stochastic minimization problem (7). Starting with an initial value $x^n_0$, it is iteratively defined by
$$x^n_k = x^n_{k-1} - s \, \nabla f^n(x^n_{k-1}), \qquad (8)$$
where $s$ is a step size or learning rate, and $f^n$ is the objective function in minimization problem (7).
Following the continuous curve approximation described in Section 2, we define the step function $X^n_s(t) = x^n_k$ for $t \in [ks, (k+1)s)$; for each $n$, as $s \to 0$, $X^n_s(t)$ approaches a smooth curve $X^n(t)$, $t \geq 0$, given by
$$\dot{X}^n(t) = -\nabla f^n(X^n(t)), \qquad (9)$$
where $\dot{X}^n(t)$ denotes the derivative of $X^n(t)$, the gradient operator here is applied to $\ell$ and $f^n$ with respect to $x$, and the initial value is $X^n(0) = x^n_0$.
As $x^n_0$ and $f^n$ are random, and our main interest is to study the distributional behaviors of the solution and algorithm, we may define a solution of equation (9) in a weak sense: there exist a process and a random vector, defined on some probability space, that are identically distributed as $X^n$ and $x^n_0$ and satisfy (9); such a process is called a (weak) solution of equation (9). Note that a weak solution is not required to be defined on a fixed probability space with given random variables; instead we define it on some probability space with some associated random variables whose distributions are given. The weak solution definition, which shares the same spirit as that for stochastic differential equations (see Ikeda and Watanabe (1981) and more in Section 4), will be very handy in facilitating our asymptotic analysis in this paper. For simplicity we suppress the distinguishing notation and drop 'weak' when there is no confusion.
3.2 Accelerated gradient descent algorithm
Nesterov's accelerated gradient descent scheme can be used to solve the minimization problem (7). Starting with initial values $x^n_0$ and $y^n_0 = x^n_0$, we obtain the following iterative algorithm to compute the solution of (7),
$$x^n_k = y^n_{k-1} - s \, \nabla f^n(y^n_{k-1}), \qquad y^n_k = x^n_k + \frac{k-1}{k+2} (x^n_k - x^n_{k-1}). \qquad (10)$$
Using the continuous curve approach described in Section 2, we can define the step function $X^n_s(t) = x^n_k$ for $t \in [k\sqrt{s}, (k+1)\sqrt{s})$; for every $n$, as $s \to 0$, we approximate $x^n_k$ by a smooth curve $X^n(t)$, $t \geq 0$, governed by
$$\ddot{X}^n(t) + \frac{3}{t} \dot{X}^n(t) + \nabla f^n(X^n(t)) = 0, \qquad (11)$$
where the initial values are $X^n(0) = x^n_0$ and $\dot{X}^n(0) = 0$, $\ddot{X}^n(t)$ denotes the second derivative of $X^n(t)$, and the gradient operator here is applied to $\ell$ and $f^n$ with respect to $x$.
3.3 Asymptotic ordinary differential equations
Assume initial values satisfy .
is continuously twice differentiable in ; , , such that , , where and for some fixed
have finite fourth moments.
, . On the parameter space , is continuously twice differentiable and strongly convex, and is -Lipschitz for some , with linear growth, where is the Laplacian operator.
Cov, , and Var is positive definite, continuously differentiable, and -Lipschitz with linear growth in .
Conditions A1-A2 are often used to make optimization problems and differential equations well defined, so that the stochastic optimization (7) corresponds to optimization (1), while Conditions A3-A4 guarantee that the solution of (7) and its associated differential equations provide large sample approximations of those for (1). Condition A4 is quite reasonable and can be easily justified by empirical process theory under common assumptions, such as that the relevant classes of functions form a Donsker class (van der Vaart and Wellner (2000)), since the solution curves of ordinary differential equations (3) and (6) are deterministic and bounded.
For a given $T > 0$, denote by $C([0, T])$ the space of all continuous functions on $[0, T]$ with the uniform metric $\sup\{|f_1(t) - f_2(t)| : t \in [0, T]\}$ between functions $f_1$ and $f_2$. For solutions $X(t)$ and $X^n(t)$ of equations (3) and (9) [or equations (6) and (11)], respectively, we define $V^n(t) = \sqrt{n}\,(X^n(t) - X(t))$. Then $X$, $X^n$ and $V^n$ live on $C([0, T])$. Treating them as random elements in $C([0, T])$, in the following theorem we establish a weak convergence limit of $V^n$.
Under conditions A0-A4, as $n \to \infty$, $V^n$ weakly converges to $V$ on $C([0, T])$, where $V(t) = \Omega(t) Z$ with $Z$ a $p$-variate standard normal random vector, and the matrix $\Omega(t)$ is the unique solution of the following linear differential equation (12)
for the plain gradient descent case, and (13)
As we discussed earlier in Section 3, as $n \to \infty$, $f^n$ converges to $f$ in probability, and the solutions of the minimization problems (1) and (7) should be very close to each other. We may heuristically illustrate the derivation of Theorem 3.1 as follows. The central limit theorem may lead us to see that, as $n \to \infty$, $\nabla f^n(x)$ is asymptotically distributed as $\nabla f(x)$ plus a Gaussian perturbation of order $n^{-1/2}$ specified by condition A4. Then asymptotically differential equations (9) and (11) are, respectively, equivalent to
Applying the perturbation method for solving ordinary differential equations, we write approximate solutions of (14) and (15) as $X^n(t) \approx X(t) + n^{-1/2} V(t)$ and substitute this into (14) and (15). With $X(t)$ satisfying (3) or (6), and higher order terms ignored, we obtain the following equations for the limit $V(t)$ of $V^n(t)$ in the two cases, respectively,
where $X(t)$ is a solution of the corresponding equation (3) or (6), the random vector $Z$ follows the $p$-variate standard normal distribution $N(0, I_p)$, and the initial conditions are as specified. Using linear scaling we show that (16) and (17) have unique solutions $V(t) = \Omega(t) Z$, where $\Omega(t)$ are the unique solutions of (12) and (13).
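For the plain gradient descent case, the perturbation argument can be checked by Monte Carlo in a setting where everything is available in closed form; the quadratic loss and Gaussian data below are our assumptions for illustration only. With $\ell(x, u) = (x - u)^2/2$ and $U \sim N(0, 1)$, equation (9) solves to $X^n(t) = \bar{U}_n + (x_0 - \bar{U}_n)e^{-t}$ and (3) to $X(t) = x_0 e^{-t}$, so $\sqrt{n}\,(X^n(t) - X(t)) = \sqrt{n}\,\bar{U}_n (1 - e^{-t})$ should have standard deviation close to $1 - e^{-t}$.

```python
import math, random, statistics

random.seed(0)

# Assumed setup: l(x, u) = (x - u)^2/2 with U ~ N(0, 1), so
# sqrt(n) (X^n(t) - X(t)) = sqrt(n) * Ubar_n * (1 - e^{-t}),
# asymptotically N(0, (1 - e^{-t})^2) by the central limit theorem.
x0, t, n, reps = 1.0, 2.0, 400, 2000
draws = []
for _ in range(reps):
    ubar = sum(random.gauss(0.0, 1.0) for _ in range(n)) / n
    draws.append(math.sqrt(n) * ubar * (1 - math.exp(-t)))

pred_sd = 1 - math.exp(-t)                  # the predicted spread at time t
print(statistics.stdev(draws), pred_sd)     # the two should be close
```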
Step functions $X^n_s(t)$ are used to model the sequences $x^n_k$ generated from gradient descent algorithms (8) and (10). To study their weak convergence, we need to introduce the Skorokhod space, denoted by $D([0, T])$, of all càdlàg functions on $[0, T]$, equipped with the Skorokhod metric (Billingsley (1999)). Then $X^n_s$ lives on $D([0, T])$, and treating it as a random element in $D([0, T])$, we derive its weak convergence limit in the following theorem.
Under assumptions A0-A4, as $n \to \infty$ and $s \to 0$, we have
where $X^n_s(t)$ are the continuous-time step processes for the discrete $x^n_k$ generated from algorithms (8) and (10), with continuous curves $X^n(t)$ defined by (9) and (11), for the cases of plain and accelerated gradient descent algorithms, respectively. In particular, if we take $s = s_n$ such that, as $n \to \infty$ and $s \to 0$, the approximation error is of order smaller than $n^{-1/2}$, then on $D([0, T])$, $\sqrt{n}\,(X^n_s - X)$ weakly converges to $V$, where $X$ is the solution of (3) or (6), and $V$ and $\Omega(t)$ are defined in Theorem 3.1. That is, $X^n_s$ and $X^n$ share the same weak convergence limit.
There are two types of asymptotic analyses in this setup. One type employs continuous differential equations to model discrete sequences generated from gradient descent algorithms, which is associated with $s$, treated as the step size between consecutive sequence points. The other type involves the use of random objective functions in stochastic optimization, which are estimated from sample data of size $n$. We refer to the first and second types as computational (modeling) and statistical asymptotics, respectively. The computational asymptotic analysis is that for each $n$, differential equations (9) and (11) [or (14) and (15)] provide continuous solutions as the limits of the discrete sequences generated from algorithms (8) and (10), respectively, when $s$ is allowed to go to zero. Theorem 3.1 provides the statistical asymptotic analysis to describe the difference in behavior between the data-based solutions $X^n$ and the true solutions $X$, as the sample size $n$ goes to infinity. Theorem 3.2 involves both types of asymptotics and gives the order of $X^n_s - X^n$ as $n \to \infty$ and $s \to 0$. As a computational modeling parameter, $s$ can be an arbitrary sequence approaching zero, and we may let it depend on $n$ and choose $s$ going to zero fast enough so that $X^n_s - X^n$ is of order smaller than $n^{-1/2}$. Then $X^n_s$ has the same asymptotic distribution as $X^n$.
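The computational (in $s$) and statistical (in $n$) directions can be separated numerically; the quadratic loss below is again an assumption for illustration. For a fixed sample, the sketch measures only the discretization error between the iterates of (8) and the closed-form solution of (9), which shrinks as $s \to 0$.

```python
import math, random

random.seed(3)

# Assumed quadratic loss l(x, u) = (x - u)^2/2, so (8) reads
#   x_k = x_{k-1} - s (x_{k-1} - ubar),  ubar = sample mean,
# and (9) has the closed form X^n(t) = ubar + (x0 - ubar) e^{-t}.
n, x0, T = 1000, 4.0, 2.0
ubar = sum(random.gauss(1.0, 1.0) for _ in range(n)) / n

def discretization_error(s):
    x = x0
    for _ in range(int(T / s)):
        x -= s * (x - ubar)                       # iteration (8)
    Xn_T = ubar + (x0 - ubar) * math.exp(-T)      # solution of (9)
    return abs(x - Xn_T)

errs = [discretization_error(s) for s in (0.1, 0.01, 0.001)]
print(errs)  # the computational error decreases roughly linearly in s
```

The sample size $n$ only enters through `ubar`; for any fixed sample, sending $s \to 0$ removes the computational error, leaving the purely statistical gap between $X^n$ and $X$.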
3.4 A framework to unify computational and statistical asymptotic analysis
The two types of asymptotics associated with $s$ and $n$ seem to be quite different, with one for computational algorithms and the other for statistical inference. This section will elaborate further on these analyses and provide a framework to unify both points of view. Denote the solutions of optimization problems (1) and (7) by $x^*$ and $\hat{x}^n$, respectively. In the statistical setup, $x^*$ and $\hat{x}^n$ represent the true parameter value and an estimator of the parameter, respectively. Then using the definitions of $x^*$ and $\hat{x}^n$ and a Taylor expansion, we have
$$0 = \nabla f^n(\hat{x}^n) \approx \nabla f^n(x^*) + \nabla^2 f^n(x^*)\,(\hat{x}^n - x^*),$$
and the law of large numbers implies that $\nabla^2 f^n(x^*)$ converges in probability to $\nabla^2 f(x^*)$ as $n \to \infty$. On the other hand, Assumption A4 indicates that $\sqrt{n}\,\nabla f^n(x^*) = \sqrt{n}\,[\nabla f^n(x^*) - \nabla f(x^*)]$ is asymptotically normal, where the Gaussian limit may be written in terms of a standard normal random vector $Z$ and the asymptotic variance from condition A4. Thus, $\sqrt{n}\,(\hat{x}^n - x^*)$ is asymptotically distributed as $-[\nabla^2 f(x^*)]^{-1}$ applied to that Gaussian limit. On the other hand, gradient descent algorithms generate sequences $x_k$ and $x^n_k$ that are expected to approach the solutions of the two optimization problems (1) and (7), respectively; hence $X(t)$ and $X^n(t)$ must move towards $x^*$ and $\hat{x}^n$, respectively, and $x_k$ and $x^n_k$ reach their corresponding targets $x^*$ and $\hat{x}^n$. Below we will provide a framework to connect $X(t)$ with $x^*$ and $X^n(t)$ with $\hat{x}^n$.
Since the time interval $[0, T]$ considered so far is for an arbitrary fixed $T$, we may extend the time intervals to $[0, \infty)$ and consider $C([0, \infty))$, the space of all continuous functions on $[0, \infty)$, equipped with a metric for the topology of uniform convergence on compacta:
$$d(f_1, f_2) = \sum_{T=1}^{\infty} 2^{-T} \min\Big\{ 1, \sup_{0 \leq t \leq T} |f_1(t) - f_2(t)| \Big\}.$$
Solutions $X(t)$, $X^n(t)$, $V^n(t)$ and $V(t)$ of ordinary differential equations (3), (6), (9), (11)-(17) all live on $C([0, \infty))$, and we can study their weak convergence on $C([0, \infty))$. Similarly we may adopt the Skorokhod space $D([0, \infty))$ equipped with the Skorokhod metric for the weak convergence study of $X^n_s$ (see Billingsley (1999)). The following theorem establishes the weak convergence of these processes on $[0, \infty)$ and studies their asymptotic behaviors as $t \to \infty$.
Suppose that assumptions A0-A4 are met, $\nabla^2 f(x^*)$ is positive definite, all eigenvalues of $\int_0^t \nabla^2 f(X(u))\, du$ diverge as $t \to \infty$, and $s \to 0$ as $n \to \infty$. Then on $C([0, \infty))$, as $n \to \infty$ and $s \to 0$, $V^n$ and $\sqrt{n}\,(X^n_s - X)$ weakly converge to $V$, $V(t) = \Omega(t) Z$.
Furthermore, for the plain gradient descent case we have, as $n \to \infty$ and $t \to \infty$,
Denote the limits of the processes in Theorem 3.3 as $t \to \infty$ by the corresponding processes with $t$ replaced by $\infty$. Then Theorem 3.3 shows that for the plain gradient descent case, the weak convergence limits are the same whether we first let $n \to \infty$ and then $t \to \infty$, or vice versa. In particular, as the process is indexed by both $t$ and $n$, its limits are the same regardless of the order of $t \to \infty$ and $n \to \infty$. Also, as $x^*$ is the minimizer of the convex function $f$, the positive definiteness assumption on $\nabla^2 f(x^*)$ is very reasonable; since the limit, $\nabla^2 f(x^*)$, of $\nabla^2 f(X(t))$ as $t \to \infty$ has all positive eigenvalues, it is natural to expect that $\int_0^t \nabla^2 f(X(u))\, du$ has diverging eigenvalues. We conjecture that for the accelerated gradient descent case, similar asymptotic results might hold as $t \to \infty$.
With the augmentation of $t = \infty$, we extend $[0, \infty)$ further to $[0, \infty]$, consider the processes $X$, $X^n$, $X^n_s$, $V^n$ and $V$ on $[0, \infty]$, and derive the limits of $V^n$ and $\sqrt{n}\,(X^n_s - X)$ on $[0, \infty]$ as in Theorem 3.2. As $n \to \infty$ and $s \to 0$, the limiting distributions of these processes are $V(t) = \Omega(t) Z$ for $t \in [0, \infty]$, which describe the dynamic evolution of the gradient descent algorithms for $t \in [0, \infty)$ and the statistical distribution of the decision rule for $t = \infty$.
The joint asymptotic analysis provides a unified framework to describe the distribution limits of $X^n_s$, $X^n$ and $X$ from both computational and statistical points of view as follows. For $t \in [0, \infty)$, the limits give the limiting behaviors of the processes corresponding to the computational algorithms, while at $t = \infty$ they illustrate the limiting behaviors of the corresponding statistical decision rule (or the exact solutions of the corresponding optimization problems (1) and (7) that the algorithms are designed to compute). We use the following simple example to explicitly illustrate the joint asymptotic analysis.
Example 1. Suppose that $U_i = (V_i, W_i)$, $i = 1, \ldots, n$, are i.i.d. random vectors, where $V_i$ and $W_i$ are independent and follow a normal distribution and an exponential distribution with a given mean, respectively. Define the loss accordingly, and denote by $x^*$ the true value of the parameter in the model. It is easy to see that the minimization problems corresponding to (1) and (7) have explicit solutions: $f$ has the minimizer $x^*$, and $f^n$ has the minimizer $\hat{x}^n$, given in terms of the sample mean. For this example, algorithms (2), (8), (4) and (10) yield explicit recursive formulas for both the plain and the accelerated gradient descent cases. While it may not be so obvious to explicitly describe the dynamic behaviors of these algorithms for the accelerated case, below we will clearly illustrate the behaviors of their corresponding ordinary differential equations through closed form expressions. First we consider the plain gradient descent case, where the closed form expressions are very simple. Ordinary differential equations (3) and (9) admit the simple solutions
which confirm that $X^n(t)$ converges to $X(t)$, as shown in Theorem 3.1. Furthermore, as $t \to \infty$,