In this work we aim to analyze a system of two stochastic differential equations, called the perturbed compositional gradient flow, which takes the form
Here follows a certain distribution on an index set ; and are assumed to be in
; the vector is the gradient column-vector of evaluated at , and the matrix is the matrix formed by the gradient column-vectors of each of the components of evaluated at ; and are two small parameters. We assume that the functions and are supported on some compact subsets of and , respectively (a further discussion of this assumption is provided in Remark 3 of Section 5).
The two Brownian motions and are independent standard Brownian motions moving in the spaces and , respectively. Here the diffusion matrix satisfies
and the diffusion matrix satisfies
and both matrices are assumed to be non–degenerate for any choice of and .
1.1 Coupled fast–slow dynamics and averaging principle.
It turns out that, by an appropriate choice of the step-size parameters, the perturbed compositional gradient flow exhibits fast–slow dynamics. To see this, we pass to the fast time scale for (1) and let . Then we have, for the time-changed process , that
We will set the vectors
and the matrices
From our assumption on and , we know that the vectors , and the matrices and , have coefficients that are bounded together with their first derivatives; consequently these quantities are also uniformly Lipschitz continuous with respect to their arguments.
One can write system (2) as
so that as , the motion runs at a fast speed and is governed by the following Ornstein–Uhlenbeck process (OU process for short; see [21, Exercise 5.5])
The invariant measure of the (multidimensional) OU process is a Gaussian measure with mean and covariance matrix :
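As a concrete one-dimensional illustration of this standard fact (with generic coefficients and , not the exact coefficients of the present paper), the invariant Gaussian measure of an OU process can be read off from its stationary variance:

```latex
% One-dimensional illustration with generic placeholder coefficients
% m and \sigma: the OU process below has stationary variance
% \sigma^2 / (2\theta) with drift rate \theta = 1.
\[
  dY_t = -(Y_t - m)\,dt + \sigma\,dW_t,
  \qquad
  \mu(dy) = \frac{1}{\sqrt{\pi\sigma^2}}
            \exp\!\Big(-\frac{(y-m)^2}{\sigma^2}\Big)\,dy ,
\]
% i.e. \mu is the Gaussian N(m, \sigma^2/2).
```

The multidimensional statement in the display above follows the same pattern, with the covariance matrix determined by the drift and diffusion matrices.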
Let us introduce the operator
where can be a scalar-, vector- or matrix-valued function with arguments and .
As the fast motion process is running at a high speed, the process in (7) plays the role of the slow motion. That is to say, changes very little, and thus could be viewed as frozen, during a small time interval in which is running very fast. Roughly speaking, in the dynamics of , the fast component can be replaced by the invariant measure of with frozen
. This heuristic supports the following asymptotic picture: as is held fixed and , thus , one can approximate the slow process in (7) by an averaged process satisfying
The approximation of by is the content of the classical averaging principle and has been discussed extensively in the literature (see e.g. , , [10, Chapter 7]). In this paper we will show (see Proposition 1) that, with fixed and , for we have
This justifies the use of the averaged motion as an approximation to the slow process .
It turns out, as we will prove quantitatively in Lemma A.1 below, that when ,
Therefore as , by (4) we see that
Thus as , the process approximates another process that solves an ordinary differential equation:
with an error of . In fact, equation (14) can be viewed as a gradient flow, namely the perturbed gradient flow without the stochastic noise terms . This averaging principle hence explains why we call (1) the perturbed compositional gradient flow.
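To make the gradient-flow interpretation concrete, recall the generic structure of a compositional objective. Writing schematically (the symbols and here are placeholders for illustration, not the paper's own notation), the chain rule yields:

```latex
% Schematic form of the limiting gradient flow for a compositional
% objective F(x) = f(g(x)); f and g are generic placeholders.
\[
  \dot{x}_t \;=\; -\nabla F(x_t)
            \;=\; -\big(\nabla g(x_t)\big)^{\!\top}\,
                  \nabla f\big(g(x_t)\big),
\]
% where \nabla g(x) denotes the Jacobian matrix of g at x.
```

This is the structure that the averaged equation (14) is expected to exhibit in the setting of Section 1.3.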
1.2 A sharper rate via normal deviation.
One major drawback of the classical averaging principle is that the approximation in (12) only identifies the deterministic drift; thus the small diffusion part in the equation for vanishes as . To overcome this difficulty, let us consider the deviation and rescale it by a factor of . Thus we consider the process
We will show (see Proposition 2) that, as , the process converges weakly to a random process . The process has a deterministic drift part and is driven by two mean-zero Gaussian processes with explicitly calculated covariance structures. This implies that, roughly speaking, from (15) we can expand
as . Here
means approximate equality of probability distributions. In fact, such approximate expansions have been introduced in the classical program in the context of stochastic climate models (see [1, equation (4.8)]), and in physics this is also known as Van Kampen's approximation (see ).
Thus by (17) we further have
From the perspective of mathematical techniques, there are two classical approaches to the averaging principle and normal deviations (a more technical, functional-analytic third method is discussed in Remark 1 of Section 5). One is the classical Khasminskii averaging method . This method chooses an intermediate time scale such that
. This intermediate time scale enables the analysis of the averaging procedure by using a fast motion with frozen slow component. To demonstrate its effectiveness, in this work we exploit this method for our averaging analysis. Another, less intuitive, method is the corrector method, which relies on the solution of an auxiliary Poisson equation. Upon obtaining appropriate a priori estimates for this Poisson equation, one can reduce the averaging principle or the normal deviations to the analysis of an application of Itô's formula. Since we are working in the case when the fast motion is an OU process, when applying the corrector method we are closest to the set-up of  (see also , , ). Our analysis of the normal deviations will follow the corrector method and will be based on the a priori bounds provided in .
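For orientation, the scale-separation requirement behind Khasminskii's scheme can be stated schematically (generic small parameter and partition width , not the paper's stripped notation): the partition width must be long compared with the relaxation time of the fast motion, yet short on the time scale of the slow motion:

```latex
% Schematic scale separation in Khasminskii's method: the fast motion
% relaxes on a time scale O(\epsilon), the slow motion evolves on a
% time scale O(1), and the intermediate width \Delta sits in between:
\[
  \Delta(\epsilon) \longrightarrow 0
  \quad\text{while}\quad
  \frac{\Delta(\epsilon)}{\epsilon} \longrightarrow \infty
  \qquad \text{as } \epsilon \downarrow 0,
  \qquad \text{e.g. } \Delta(\epsilon) = \epsilon \log(1/\epsilon).
\]
```

On each interval of width the slow component can then be frozen while the fast motion equilibrates.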
1.3 Connection with stochastic compositional gradient descent algorithm.
In the field of statistical optimization, the stochastic composition optimization problem of the following form has been of tremendous interest in both theory and applications:
Here , denotes the composite function, and
denotes a pair of random variables.  has shown that the optimization problem (20) includes many important applications in statistical learning and finance, such as reinforcement learning, statistical estimation, dynamic programming and portfolio management.
Let us consider the following version of the Stochastic Compositional Gradient Descent (SCGD) algorithm in [29, Algorithm 1], whose iteration takes the form
Here  is taken as a sequence of i.i.d. random vectors following some distribution over the parameter space (often in optimization for finite samples, the parameter space is chosen as some finite, discrete index set , and  is the uniform distribution over such an index set; we extend this setting to any distribution over a general parameter space);  and  are functions indexed by the aforementioned random vectors; the vector  is the gradient column-vector of  evaluated at , and the matrix  is the matrix formed by the gradient column-vectors of each of the components of  evaluated at . The SCGD algorithm (21) is a provably effective method for solving (20); see the early optimization literature on convergence and rates of convergence in [8, 29]. However, the convergence rate of the SCGD algorithm and its variations is not known to be comparable to that of its SGD counterpart [29, 30]. To drill further into this algorithm, we consider the coupled diffusion process (1), which is a continuum version, as both  and , of the SCGD algorithm (21). We copy below the perturbed compositional gradient flow (1) for convenience:
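The two-time-scale structure of the SCGD iteration (21) can be illustrated with a minimal runnable sketch. This is a toy instance only: the quadratic inner and outer functions below, the step-size constants, and all variable names are our own illustrative choices under the classical SCGD step-size scaling, not the exact setting of [29].

```python
import random

def scgd(x0, n_iters, noise_std=0.5, seed=0):
    """Toy two-time-scale SCGD sketch for min_x f(E_w[g_w(x)]) with
    f(y) = y**2 / 2 and g_w(x) = x + w, w ~ N(0, noise_std**2),
    so the true objective is x**2 / 2, minimized at x = 0.
    Step sizes follow the classical SCGD scaling: alpha_k ~ k^(-3/4)
    for the slow variable, beta_k ~ k^(-1/2) for the fast tracker."""
    rng = random.Random(seed)
    x, y = x0, 0.0  # y tracks the inner expectation E_w[g_w(x)] = x
    for k in range(1, n_iters + 1):
        alpha = 0.5 * k ** -0.75   # slow (gradient) step size
        beta = k ** -0.5           # fast (tracking) step size
        w = rng.gauss(0.0, noise_std)
        # Fast time scale: running average estimating g(x) = x.
        y = (1.0 - beta) * y + beta * (x + w)
        # Slow time scale: gradient step; grad g = 1, grad f(y) = y.
        x = x - alpha * y
    return x

x_final = scgd(x0=2.0, n_iters=5000)
```

With these step sizes the fast variable equilibrates around the inner expectation while the slow variable drifts toward the minimizer, mirroring the averaging picture of Section 1.1.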
Here  is taken to be distributed as , and  and  are assumed to be in . Without loss of generality, when considering an optimization problem (20), we can assume that the functions  and  are supported on some compact subsets of  and , respectively (see Remark 3 of Section 5). Also for convenience, let us further assume that the  and  in the -pair drawn from  are independent. We do not believe this assumption is necessary (see the discussions in [29, 30]); however, it does simplify our analysis, since  in the perturbed compositional gradient flow (22) can then be chosen as an independent pair of Brownian motions, which in turn simplifies the proof.
Recall that . In the case where the objective function is strongly convex,  in (18) enters a basin containing the minimizer of (20) in finite time , so that (19) implies that  in (2) enters a basin containing the minimizer of (20), also in finite time . This heuristic analysis validates, in the sense of convergence, the effectiveness of using the perturbed compositional gradient flow to solve (20) in the strongly convex case. Such an argument can be generalized to the convex case and is omitted due to space limitations.
It is worth pointing out that in an early work in the probability literature , the authors briefly mentioned in the introduction the potential application of the averaging principle to the analysis of stochastic approximation algorithms. In contrast, in the classical literature on stochastic approximation algorithms (see , ,), the techniques of normal deviations have been addressed in the context of weak convergence to diffusion processes in the discrete setting. For example, [4, Chap. 4, Part II] analyzed the asymptotic behavior of a broad class of single-equation adaptive algorithms including SGD. Moreover, [19, Chap. 8] discussed the idea of multiple-timescale analysis for stochastic approximation algorithms; see also [5, Chap. 6] for a connection to the averaging principle for constant-stepsize algorithms. However, these mathematical theories focus on long-time asymptotic analysis instead of convergence rates, which are vital in many recent applications. The current work serves as an attempt at convergence rates using one algorithmic example (SCGD) and can be viewed as a further contribution along this line of research.
Organization. The paper is organized as follows. In Section 2 we will show the averaging principle that justifies the convergence of to as . In Section 3 we will consider the rescaled deviation and we show that as it converges weakly to the process . This justifies (16). In Section 4 we show the approximation (19) and we justify the effectiveness of using SCGD in the strongly convex case. In Section 5 we discuss further problems, remarks and generalizations.
Notational Conventions. For an –vector we define the norm
We also denote for . For any matrix , let us define the norm
If is a vector or a matrix, then denotes either when is an –vector, or if is an matrix. The standard inner product in is denoted as .
The spaces , (and ) are the spaces of -times continuously differentiable functions on a domain ( can be the whole space). For a function we define to be the norm of on . When we need to highlight the target space, we also write , which refers to functions in the space that map into . If is Lipschitz continuous on , then is the Lipschitz seminorm . For vector- or matrix-valued functions, the Lipschitz norm is defined to be the largest Lipschitz norm among the corresponding component functions.
Throughout the paper, capital , etc., are quantities for the time rescaled process (2), and small , etc., are quantities for the original process (1). The constant denotes a positive constant that varies from line to line. Sometimes, to emphasize the dependence of this constant on other parameters, may also be used. For notational convenience, we use simultaneously, e.g., or to denote a stochastic process.
2 The convergence of to : Averaging principle.
Our first Lemma is about –boundedness of the system in (7).
For any and there exists some constant such that
For a matrix-valued random function adapted to the filtration of , we have (see [17, (3.12) and (3.13)])
Therefore we obtain (23).
We can write the solution in (7) in mild form as
Set and . Then we have
Therefore by Gronwall's inequality we know that for we have
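For completeness, the form of Gronwall's inequality invoked in such estimates reads (stated generically; the specific integrand is the one in the display above):

```latex
% Gronwall's inequality, integral form: if u \ge 0 satisfies
%   u(t) \le a + b \int_0^t u(s)\,ds   for all t \in [0, T],
% with constants a, b \ge 0, then
\[
  u(t) \;\le\; a\, e^{b t}, \qquad 0 \le t \le T.
\]
```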
It remains to estimate . Again, by (25) we have
Thus we obtain
which is (24). ∎
The next Lemma summarizes basic facts about the process defined in (8).
Let the process defined in (8) start from . Then for any function , for some we have
where the constant may depend on , but is independent of .
Moreover, for some constant we have
in which the constant may depend on but is independent of .
Let be the OU process satisfying the stochastic differential equation
Thus we have the explicit representation
Let be the invariant measure of , where
is the identity matrix in . Then we have the following exponential mixing estimate: for we have
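The mechanism behind such a mixing estimate can be seen from a simple synchronous-coupling argument for a generic OU process (schematic drift rate and noise, not the paper's exact constants):

```latex
% Synchronous coupling: run two copies of dY_t = -\theta Y_t\,dt + \sigma\,dW_t
% with the SAME noise W but different initial points y and z. Their
% difference satisfies d(Y_t^y - Y_t^z) = -\theta (Y_t^y - Y_t^z)\,dt, so
%   |Y_t^y - Y_t^z| = e^{-\theta t}\,|y - z|.
% Integrating z against the invariant measure \mu gives, for Lipschitz f,
\[
  \Big| \mathbb{E} f(Y_t^{y}) - \int f \, d\mu \Big|
  \;\le\; \mathrm{Lip}(f)\, e^{-\theta t} \int |y - z| \, \mu(dz).
\]
```

This yields exponential decay toward the invariant measure at the rate of the drift, with a prefactor growing at most linearly in the initial condition.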
Now we will derive the averaging principle following the classical method in . Let . Let us consider a partition of the time interval into intervals of the same length . Let us introduce the auxiliary processes , by means of the relations
The interval length can be chosen such that , as , and for any small we have
uniformly in , and .
In fact we can write, for , that