 # A convergence analysis of the perturbed compositional gradient flow: averaging principle and normal deviations

We consider in this work a system of two stochastic differential equations called the perturbed compositional gradient flow. By introducing a separation of fast and slow scales of the two equations, we show that the limit of the slow motion is given by an averaged ordinary differential equation. We then demonstrate that the deviation of the slow motion from the averaged equation, after proper rescaling, converges to a stochastic process with Gaussian inputs. This indicates that the slow motion can be approximated in the weak sense by a standard perturbed gradient flow or the continuous-time stochastic gradient descent algorithm that solves the optimization problem for a composition of two functions. As an application, the perturbed compositional gradient flow corresponds to the diffusion limit of the Stochastic Composite Gradient Descent (SCGD) algorithm for minimizing a composition of two expected-value functions in the optimization literature. For the strongly convex case, such an analysis implies that the SCGD algorithm has the same convergence time asymptotics as the classical stochastic gradient descent algorithm. This validates the effectiveness of using the SCGD algorithm in the strongly convex case.


## 1 Introduction

In this work we analyze a system of two stochastic differential equations called the perturbed compositional gradient flow, which takes the form

$$
\begin{cases}
dy(t) = -\varepsilon y(t)\,dt + \varepsilon\,\mathbf{E}g_w(x(t))\,dt + \varepsilon\,\Sigma_1(x(t))\,dW^1_t\,, & y(0)=y_0\,,\\
dx(t) = -\eta\,\mathbf{E}\widetilde{\nabla}g_w(x(t))\nabla f_v(y(t))\,dt + \eta\,\Sigma_2(x(t),y(t))\,dW^2_t\,, & x(0)=x_0\,.
\end{cases}
\tag{1}
$$

Here $w$ and $v$ follow certain distributions on index sets; $f_v:\mathbb{R}^m\to\mathbb{R}$ and $g_w:\mathbb{R}^n\to\mathbb{R}^m$ are assumed to be in $C^2$; the vector $\nabla f_v(y)$ is the gradient column $m$–vector of $f_v$ evaluated at $y$, and the matrix $\widetilde{\nabla}g_w(x)$ is the $n\times m$ matrix formed by the gradient column $n$–vectors of each of the $m$ components of $g_w$ evaluated at $x$; $\varepsilon>0$ and $\eta>0$ are two small parameters. We assume that the functions $g_w$ and $f_v$ are supported on some compact subsets of $\mathbb{R}^n$ and $\mathbb{R}^m$, respectively (a further discussion of this assumption is provided in Remark 3 of Section 5).

The two Brownian motions $W^1_t$ and $W^2_t$ are independent standard Brownian motions moving in the spaces $\mathbb{R}^m$ and $\mathbb{R}^n$, respectively. Here the diffusion matrix $\Sigma_1(x)$ satisfies

$$
\Sigma_1(x)\Sigma_1^T(x) = \mathbf{E}\left[\left(g_w(x)-\mathbf{E}g_w(x)\right)\left(g_w(x)-\mathbf{E}g_w(x)\right)^T\right]\,,
$$

and the diffusion matrix $\Sigma_2(x,y)$ satisfies

$$
\Sigma_2(x,y)\Sigma_2^T(x,y) = \mathbf{E}\left[\left(\widetilde{\nabla}g_w(x)\nabla f_v(y)-\mathbf{E}\widetilde{\nabla}g_w(x)\nabla f_v(y)\right)\cdot\left(\widetilde{\nabla}g_w(x)\nabla f_v(y)-\mathbf{E}\widetilde{\nabla}g_w(x)\nabla f_v(y)\right)^T\right]\,,
$$

and both matrices are assumed to be non–degenerate for any choice of $x$ and $y$.
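For readers who wish to experiment, the system (1) can be discretised by the Euler–Maruyama scheme. The sketch below is an illustrative assumption of ours, not an example from the paper: it takes the scalar choices $g_w(x)=x+w$ and $f_v(y)=y^2/2+vy$ with $v,w\sim\mathcal{N}(0,1)$, for which $\mathbf{E}g_w(x)=x$, $\mathbf{E}\widetilde{\nabla}g_w(x)\nabla f_v(y)=y$ and $\Sigma_1=\Sigma_2=1$.

```python
import numpy as np

def simulate_pcgf(eps=0.1, eta=0.01, T=600.0, dt=1e-2, x0=2.0, y0=0.0, seed=0):
    """Euler-Maruyama sketch of system (1) for a toy scalar instance
    (our assumption, chosen for illustration): g_w(x) = x + w and
    f_v(y) = y**2/2 + v*y with v, w ~ N(0, 1).  Then E g_w(x) = x,
    E[grad-tilde g_w(x) grad f_v(y)] = y, Sigma_1 = Sigma_2 = 1,
    and the composite objective E f_v(E g_w(x)) = x**2/2."""
    rng = np.random.default_rng(seed)
    x, y = x0, y0
    sq = np.sqrt(dt)
    for _ in range(int(T / dt)):
        dW1, dW2 = rng.normal(0.0, sq, size=2)
        # fast track of E g_w(x):  dy = -eps*y dt + eps*x dt + eps*Sigma_1 dW1
        y_new = y - eps * y * dt + eps * x * dt + eps * dW1
        # slow descent variable:   dx = -eta*y dt + eta*Sigma_2 dW2
        x = x - eta * y * dt + eta * dW2
        y = y_new
    return x, y

x_final, y_final = simulate_pcgf()
print(x_final, y_final)  # both drift toward the minimizer x* = 0
```

With these (assumed) functions the composite objective is strongly convex, so both coordinates settle near the minimizer, up to small diffusion fluctuations.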

### 1.1 Coupled fast–slow dynamics and averaging principle.

It turns out that, by an appropriate choice of the step size parameters, the perturbed compositional gradient flow exhibits fast–slow dynamics. To see this, we pass to the fast time scale $t\mapsto t/\eta$ in (1) and let $X^{\varepsilon,\eta}(t)=x(t/\eta)$, $Y^{\varepsilon,\eta}(t)=y(t/\eta)$. Then we have, for the time–changed process $(X^{\varepsilon,\eta}(t),Y^{\varepsilon,\eta}(t))$, that

$$
\begin{cases}
dY^{\varepsilon,\eta}(t) = -\dfrac{\varepsilon}{\eta}Y^{\varepsilon,\eta}(t)\,dt + \dfrac{\varepsilon}{\eta}\mathbf{E}g_w(X^{\varepsilon,\eta}(t))\,dt + \dfrac{\varepsilon}{\sqrt{\eta}}\Sigma_1(X^{\varepsilon,\eta}(t))\,dW^1_t\,, & Y^{\varepsilon,\eta}(0)=y_0\,,\\
dX^{\varepsilon,\eta}(t) = -\mathbf{E}\widetilde{\nabla}g_w(X^{\varepsilon,\eta}(t))\nabla f_v(Y^{\varepsilon,\eta}(t))\,dt + \sqrt{\eta}\,\Sigma_2(X^{\varepsilon,\eta}(t),Y^{\varepsilon,\eta}(t))\,dW^2_t\,, & X^{\varepsilon,\eta}(0)=x_0\,.
\end{cases}
\tag{2}
$$

We will set the vectors

$$
B_1(X) = \mathbf{E}g_w(X)\in\mathbb{R}^m
\tag{3}
$$

and

$$
B_2(X,Y) = -\mathbf{E}\widetilde{\nabla}g_w(X)\nabla f_v(Y)\in\mathbb{R}^n\,,
\tag{4}
$$

and the matrices

$$
A_1(X) = \Sigma_1(X)\Sigma_1^T(X)\in\mathbb{R}^m\otimes\mathbb{R}^m\,,
\tag{5}
$$

and

$$
A_2(X,Y) = \Sigma_2(X,Y)\Sigma_2^T(X,Y)\in\mathbb{R}^n\otimes\mathbb{R}^n\,.
\tag{6}
$$

From our assumptions on $f_v$ and $g_w$ we know that the vectors $B_1$, $B_2$ and the matrices $A_1$ and $A_2$ have bounded coefficients, together with their first derivatives, so that these quantities are also uniformly Lipschitz continuous with respect to their arguments.

One can write system (2) as

$$
\begin{cases}
dY^{\varepsilon,\eta}(t) = -\dfrac{\varepsilon}{\eta}Y^{\varepsilon,\eta}(t)\,dt + \dfrac{\varepsilon}{\eta}B_1(X^{\varepsilon,\eta}(t))\,dt + \dfrac{\varepsilon}{\sqrt{\eta}}\Sigma_1(X^{\varepsilon,\eta}(t))\,dW^1_t\,, & Y^{\varepsilon,\eta}(0)=y_0\,,\\
dX^{\varepsilon,\eta}(t) = B_2(X^{\varepsilon,\eta}(t),Y^{\varepsilon,\eta}(t))\,dt + \sqrt{\eta}\,\Sigma_2(X^{\varepsilon,\eta}(t),Y^{\varepsilon,\eta}(t))\,dW^2_t\,, & X^{\varepsilon,\eta}(0)=x_0\,.
\end{cases}
\tag{7}
$$

Fix $\varepsilon>0$ and let $\eta\downarrow 0$; thus $\varepsilon/\eta\to\infty$ as $\eta\downarrow 0$. Then system (7) is a standard fast–slow system of stochastic differential equations. In fact, the $Y$–component of the system (2) can be written as

$$
dY^{\varepsilon,\eta}(t) = -\frac{\varepsilon}{\eta}Y^{\varepsilon,\eta}(t)\,dt + \frac{\varepsilon}{\eta}\mathbf{E}g_w(X^{\varepsilon,\eta}(t))\,dt + \sqrt{\varepsilon}\left(\frac{\varepsilon}{\eta}\right)^{1/2}\Sigma_1(X^{\varepsilon,\eta}(t))\,dW^1_t\,,\quad Y^{\varepsilon,\eta}(0)=y_0\,,
$$

so that as $\eta\downarrow 0$, the motion $Y^{\varepsilon,\eta}(t)$ runs at a fast speed and, with the slow variable $X$ frozen, approximates the following Ornstein–Uhlenbeck process (OU process for short; see [21, Exercise 5.5])

$$
dy^{X,\varepsilon}(t) = -y^{X,\varepsilon}(t)\,dt + \mathbf{E}g_w(X)\,dt + \sqrt{\varepsilon}\,\Sigma_1(X)\,dW^1_t\,,\quad y^{X,\varepsilon}(0)=y_0\,.
\tag{8}
$$

The invariant measure $\mu^{X,\varepsilon}$ of the (multidimensional) OU process (8) is a Gaussian measure with mean $\mathbf{E}g_w(X)$ and covariance matrix $\frac{\varepsilon}{2}\Sigma_1(X)\Sigma_1^T(X)$:

$$
\mu^{X,\varepsilon}(dY)\sim\mathcal{N}\!\left(\mathbf{E}g_w(X),\ \frac{\varepsilon}{2}\Sigma_1(X)\Sigma_1^T(X)\right)\,.
\tag{9}
$$
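The Gaussian form of (9) can be checked empirically by simulating the frozen–$X$ OU process (8) and comparing the long–run empirical mean and variance with $\mathbf{E}g_w(X)$ and $\frac{\varepsilon}{2}\Sigma_1\Sigma_1^T$. The scalar values below ($\mathbf{E}g_w(X)=1$, $\Sigma_1=2$) are illustrative assumptions of ours.

```python
import numpy as np

# Empirical check of (9): for frozen X, the scalar OU process
#   dy = -(y - mean_g) dt + sqrt(eps) * sigma1 dW
# equilibrates to N(mean_g, (eps/2) * sigma1**2).
rng = np.random.default_rng(1)
eps, mean_g, sigma1 = 0.5, 1.0, 2.0
dt, n_steps, burn_in = 1e-2, 300_000, 1_000
noise = rng.normal(0.0, np.sqrt(dt), size=n_steps)   # Brownian increments
y = 0.0
samples = np.empty(n_steps - burn_in)
for k in range(n_steps):
    y += -(y - mean_g) * dt + np.sqrt(eps) * sigma1 * noise[k]
    if k >= burn_in:
        samples[k - burn_in] = y
emp_mean, emp_var = samples.mean(), samples.var()
print(emp_mean, emp_var)  # expected near 1.0 and (eps/2)*sigma1**2 = 1.0
```

The empirical moments match the stationary law up to Monte Carlo and Euler discretisation error.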

Let us introduce the averaging operator

$$
\overline{q(X,Y)}^{\,\varepsilon} = \overline{q(X,Y)}^{\,\varepsilon}(X) = \int_{\mathbb{R}^m} q(X,Y)\,\mu^{X,\varepsilon}(dY)\,,
\tag{10}
$$

where $q$ can be a scalar, vector or matrix–valued function with arguments $X$ and $Y$.

As the fast motion process $Y^{\varepsilon,\eta}$ is running at a high speed, the process $X^{\varepsilon,\eta}$ in (7) plays the role of the slow motion. That is to say, $X^{\varepsilon,\eta}$ changes very little, and thus could be viewed as frozen, during a small time interval in which $Y^{\varepsilon,\eta}$ is running very fast. Roughly speaking, in the dynamics of $X^{\varepsilon,\eta}$, the fast component $Y^{\varepsilon,\eta}$ can be replaced by the invariant measure $\mu^{X,\varepsilon}$ of the OU process (8) with frozen $X$. This heuristic supports the following asymptotic picture: with $\varepsilon$ fixed and $\eta\downarrow 0$, so that $\varepsilon/\eta\to\infty$, one can approximate the slow process $X^{\varepsilon,\eta}$ in (7) by an averaged process $X^{\varepsilon}$ satisfying

$$
dX^{\varepsilon}(t) = \overline{B_2(X^{\varepsilon}(t),Y)}^{\,\varepsilon}(X^{\varepsilon}(t))\,dt\,,\quad X^{\varepsilon}(0)=x_0\,.
\tag{11}
$$

The approximation of $X^{\varepsilon,\eta}$ by $X^{\varepsilon}$ is the content of the classical averaging principle and has been discussed extensively in the literature (see e.g. [10, Chapter 7]). In this paper we will show (see Proposition 1) that, with $\varepsilon$ fixed and $\eta\downarrow 0$, for any $T>0$ we have

$$
\sup_{0\le t\le T}\mathbf{E}\left|X^{\varepsilon,\eta}(t)-X^{\varepsilon}(t)\right|^2_{\mathbb{R}^n}\to 0\,.
\tag{12}
$$

This justifies the approximation of the slow process $X^{\varepsilon,\eta}$ by the averaged motion $X^{\varepsilon}$.

It turns out, as we will prove quantitatively in Lemma A.1 below, that when $\varepsilon\downarrow 0$,

$$
\overline{q(X,Y)}^{\,\varepsilon} - q(X,\mathbf{E}g_w(X)) \approx O(\sqrt{\varepsilon})\,.
\tag{13}
$$

Therefore as $\varepsilon\downarrow 0$, by (4) we see that

$$
\overline{B_2(X,Y)}^{\,\varepsilon}(X) \approx -\mathbf{E}\widetilde{\nabla}g_w(X)\nabla f_v(\mathbf{E}g_w(X)) + O(\sqrt{\varepsilon})\,.
$$

Thus as $\varepsilon\downarrow 0$, the process $X^{\varepsilon}$ approximates another process $\bar{X}$ that solves an ordinary differential equation:

$$
d\bar{X}(t) = -\mathbf{E}\widetilde{\nabla}g_w(\bar{X}(t))\nabla f_v(\mathbf{E}g_w(\bar{X}(t)))\,dt\,,\quad \bar{X}(0)=x_0\,,
\tag{14}
$$

with an error of $O(\sqrt{\varepsilon})$. In fact, equation (14) can be viewed as a gradient flow, namely the perturbed gradient flow with no stochastic noise terms. Such an averaging principle hence explains why we call (1) the perturbed compositional gradient flow.
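This averaging picture admits a quick numerical sanity check under a toy scalar assumption of ours ($g_w(x)=x+w$, $f_v(y)=y^2/2+vy$ with $v,w\sim\mathcal{N}(0,1)$): the averaged equation (14) then reduces to $d\bar{X}/dt=-\bar{X}$, so the slow component of (2) should track $x_0e^{-t}$.

```python
import numpy as np

# Averaging-principle sanity check for a toy scalar model (our
# illustrative assumption): E g_w(X) = X and B2(X, Y) = -Y, so the
# averaged ODE (14) is dXbar/dt = -Xbar, i.e. Xbar(t) = x0 * exp(-t).
rng = np.random.default_rng(2)
eps, eta, T, dt, x0 = 0.05, 0.001, 2.0, 1e-4, 2.0
X, Y = x0, 0.0
sq = np.sqrt(dt)
for _ in range(int(T / dt)):
    dW1, dW2 = rng.normal(0.0, sq, size=2)
    # fast motion of (2): relaxes quickly to the OU equilibrium around X
    Y += -(eps / eta) * (Y - X) * dt + (eps / np.sqrt(eta)) * dW1
    # slow motion of (2): drift B2(X, Y) = -Y in this toy model
    X += -Y * dt + np.sqrt(eta) * dW2
X_bar = x0 * np.exp(-T)   # solution of the averaged ODE (14)
print(abs(X - X_bar))     # small: the slow motion tracks the averaged ODE
```

The gap between the simulated slow motion and the averaged trajectory stays small, consistent with (12).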

### 1.2 A sharper rate via normal deviation.

One major drawback of the classical averaging principle is that the approximation in (12) only identifies the deterministic drift: the small diffusion part in the equation for $X^{\varepsilon,\eta}$ vanishes as $\eta\downarrow 0$. To overcome this difficulty, let us consider the deviation $X^{\varepsilon,\eta}(t)-X^{\varepsilon}(t)$ and rescale it by a factor of $1/\sqrt{\eta}$. Thus we consider the process

$$
Z^{\varepsilon,\eta}(t) = \frac{X^{\varepsilon,\eta}(t)-X^{\varepsilon}(t)}{\sqrt{\eta}}\,.
\tag{15}
$$

We will show (see Proposition 2) that, as $\eta\downarrow 0$, the process $Z^{\varepsilon,\eta}$ converges weakly to a random process $Z^{\varepsilon}$. The process $Z^{\varepsilon}$ has a deterministic drift part and is driven by two mean–zero Gaussian processes carrying explicitly calculated covariance structures. This implies that, roughly speaking, from (15) we can expand

$$
X^{\varepsilon,\eta}(t) \overset{\mathcal{D}}{\approx} X^{\varepsilon}(t) + \sqrt{\eta}\,Z^{\varepsilon}_t\,,
\tag{16}
$$

as $\eta\downarrow 0$. Here $\overset{\mathcal{D}}{\approx}$ means approximate equality of probability distributions. In fact, such approximate expansions have been introduced in the classical program on stochastic climate models (see [1, equation (4.8)]), and in physics this is also known as Van Kampen's approximation.

Therefore by (13), (14) and (16) we know that the slow motion $X^{\varepsilon,\eta}$ in (7) (or (2)) has an expansion around the GD dynamics $\bar{X}$ in (14):

$$
X^{\varepsilon,\eta}(t) \overset{\mathcal{D}}{\approx} \bar{X}(t) + O(\sqrt{\varepsilon}) + \sqrt{\eta}\,Z^{\varepsilon}_t\,.
\tag{17}
$$

Let us introduce the process $\widetilde{X}^{\varepsilon,\eta}$ as the following fast time–scale version of the perturbed gradient flow:

$$
d\widetilde{X}^{\varepsilon,\eta}(t) = -\mathbf{E}\widetilde{\nabla}g_w(\widetilde{X}^{\varepsilon,\eta}(t))\nabla f_v(\mathbf{E}g_w(\widetilde{X}^{\varepsilon,\eta}(t)))\,dt + \sqrt{\eta}\,dZ^{\varepsilon}_t\,,\quad \widetilde{X}^{\varepsilon,\eta}(0)=x_0\,.
\tag{18}
$$

From (14) and (18), we know that

$$
\bar{X}(t) + \sqrt{\eta}\,Z^{\varepsilon}_t - \widetilde{X}^{\varepsilon,\eta}(t) \overset{\mathcal{D}}{\approx} O(\sqrt{\eta})\,.
$$

So by (17) we further have

$$
X^{\varepsilon,\eta}(t) \overset{\mathcal{D}}{\approx} \widetilde{X}^{\varepsilon,\eta}(t) + O(\sqrt{\varepsilon}) + O(\sqrt{\eta})\,.
\tag{19}
$$

From the perspective of mathematical techniques, there are two classical approaches to the averaging principle and normal deviations (a much more technical and functional–analytic third method is discussed in Remark 1 of Section 5). One is the classical Khasminskii averaging method. This method chooses an intermediate time scale $\Delta$ between the fast and the slow scales, which enables the analysis of the averaging procedure by using a fast motion with frozen slow component. To demonstrate its effectiveness, in this work we exploit this method in our averaging analysis. The other, less intuitive method is the corrector method, which relies on the solution of an auxiliary Poisson equation. Upon obtaining appropriate a–priori estimates for this Poisson equation, one can reduce the averaging principle or the normal deviations to the analysis of an Itô formula. Since we are working in the case where the fast motion is an OU process, when applying the corrector method we are closest to the set–up of [6]. Our analysis of the normal deviations will follow the corrector method and is based on the a–priori bounds provided there.

### 1.3 Connection with stochastic compositional gradient descent algorithm.

In the field of statistical optimization, the stochastic composition optimization problem of the following form has been of tremendous interest in both theory and application:

$$
\min_x\ (\mathbf{E}f_v\circ\mathbf{E}g_w)(x)\,.
\tag{20}
$$

Here $x\in\mathbb{R}^n$, $(\mathbf{E}f_v\circ\mathbf{E}g_w)(x)=\mathbf{E}f_v(\mathbf{E}g_w(x))$ denotes the composite function, and $(v,w)$ denotes a pair of random variables. It has been shown that the optimization problem (20) includes many important applications in statistical learning and finance, such as reinforcement learning, statistical estimation, dynamic programming and portfolio management.

Let us consider the following version of the Stochastic Composite Gradient Descent (SCGD) algorithm in [29, Algorithm 1], whose iteration takes the form

$$
\begin{cases}
y_{k+1} = (1-\varepsilon)y_k + \varepsilon\,g_{w_k}(x_k)\,, & y_0\in\mathbb{R}^m\,,\\
x_{k+1} = x_k - \eta\,\widetilde{\nabla}g_{w_k}(x_k)\nabla f_{v_k}(y_{k+1})\,, & x_0\in\mathbb{R}^n\,.
\end{cases}
\tag{21}
$$

Here $(v_k,w_k)$ are taken as i.i.d. random vectors following some distribution over the parameter space (often in optimization for finite samples, the parameter space is chosen as some finite, discrete index set, with the uniform distribution over that index set; we extend this setting to any distribution over a general parameter space). The functions $f_{v_k}$ and $g_{w_k}$ are indexed by the aforementioned random vectors; the vector $\nabla f_{v_k}(y)$ is the gradient column $m$–vector of $f_{v_k}$ evaluated at $y$, and the matrix $\widetilde{\nabla}g_{w_k}(x)$ is the $n\times m$ matrix formed by the gradient column $n$–vectors of each of the $m$ components of $g_{w_k}$ evaluated at $x$. The SCGD algorithm (21) is a provably effective method that solves (20); see the early optimization literature on convergence and rates of convergence in [8, 29]. However, the convergence rate of the SCGD algorithm and its variations is not known to be comparable to that of its SGD counterpart [29, 30]. To drill further into this algorithm we consider the coupled diffusion process (1), which is a continuum version, as both $\varepsilon\downarrow 0$ and $\eta\downarrow 0$, of the SCGD algorithm (21). We copy below the perturbed compositional gradient flow (1) for convenience:

$$
\begin{cases}
dy(t) = -\varepsilon y(t)\,dt + \varepsilon\,\mathbf{E}g_w(x(t))\,dt + \varepsilon\,\Sigma_1(x(t))\,dW^1_t\,, & y(0)=y_0\,,\\
dx(t) = -\eta\,\mathbf{E}\widetilde{\nabla}g_w(x(t))\nabla f_v(y(t))\,dt + \eta\,\Sigma_2(x(t),y(t))\,dW^2_t\,, & x(0)=x_0\,.
\end{cases}
\tag{22}
$$

Here $(v,w)$ is taken to be distributed as $(v_k,w_k)$, and $f_v$ and $g_w$ are assumed to be in $C^2$. Without loss of generality, when considering the optimization problem (20), we can assume that the functions $g_w$ and $f_v$ are supported on some compact subsets of $\mathbb{R}^n$ and $\mathbb{R}^m$, respectively (see Remark 3 of Section 5). Also for convenience, let us further assume that the $v$ and $w$ in the $(v,w)$–pair are independent. We do not believe this assumption is necessary (see the discussions in [29, 30]); however it does simplify our analysis, since $W^1_t$ and $W^2_t$ in the perturbed compositional gradient flow (22) can then be chosen as an independent pair of Brownian motions, which in turn simplifies the proof.

Recall that $X^{\varepsilon,\eta}(t)=x(t/\eta)$. In the case where the objective function is strongly convex, $\widetilde{X}^{\varepsilon,\eta}$ in (18) enters a basin containing the minimizer of (20) in finite time $T$, so that (19) implies $X^{\varepsilon,\eta}$ in (2) enters a basin containing the minimizer of (20) also in finite time $T$. Such a heuristic analysis validates, in the sense of convergence, the effectiveness of using the perturbed compositional gradient flow to solve (20) in the strongly convex case. The argument can be generalized to the convex case and is omitted due to space limitations.
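The convergence heuristic can be tried numerically. The sketch below transcribes the SCGD iteration (21) for a toy strongly convex composition (an illustrative assumption of ours, not an example from [29]): $g_w(x)=x+w$ and $f_v(y)=y^2/2+vy$ with $v,w\sim\mathcal{N}(0,1)$, so that $\mathbf{E}f_v(\mathbf{E}g_w(x))=x^2/2$ with minimizer $x^\ast=0$.

```python
import numpy as np

def scgd(x0=5.0, y0=0.0, eps=0.05, eta=0.005, n_iters=20_000, seed=3):
    """SCGD iteration (21) on a toy composition (our illustrative
    assumption): g_w(x) = x + w, f_v(y) = y**2/2 + v*y, v, w ~ N(0, 1),
    so E f_v(E g_w(x)) = x**2/2 is strongly convex with minimizer 0."""
    rng = np.random.default_rng(seed)
    x, y = x0, y0
    for _ in range(n_iters):
        w = rng.normal()   # sample for the inner function g_w
        v = rng.normal()   # sample for the outer function f_v
        y = (1.0 - eps) * y + eps * (x + w)   # y_{k+1} = (1-eps) y_k + eps g_{w_k}(x_k)
        x = x - eta * (y + v)                 # grad-tilde g_{w_k} = 1, grad f_{v_k}(y) = y + v
    return x

x_scgd = scgd()
print(x_scgd)  # settles near the minimizer 0, up to stochastic fluctuation
```

With these step sizes the iterate reaches a small neighborhood of $x^\ast=0$ within the fixed iteration budget, matching the finite–time basin heuristic above.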

It is worth pointing out that in an early probability literature, the authors briefly mentioned in their introduction the potential application of the averaging principle to the analysis of stochastic approximation algorithms. In contrast, in the classical literature on stochastic approximation algorithms, the techniques of normal deviations have been addressed in the context of weak convergence to diffusion processes in the discrete setting. For example, [4, Chap. 4, Part II] analyzed the asymptotic behavior of a broad class of single–equation adaptive algorithms including SGD. Moreover, [19, Chap. 8] discussed the idea of multiple–timescale analysis for stochastic approximation algorithms; see also [5, Chap. 6] for a connection to the averaging principle for constant–stepsize algorithms. However, these mathematical theories focus on long–time asymptotic analysis instead of convergence rates, which are vital in many recent applications. The current work serves as an attempt at convergence rates using one algorithmic example (SCGD) and can be viewed as a further contribution along this line of research.

Organization. The paper is organized as follows. In Section 2 we show the averaging principle that justifies the convergence of $X^{\varepsilon,\eta}$ to $X^{\varepsilon}$ as $\eta\downarrow 0$. In Section 3 we consider the rescaled deviation $Z^{\varepsilon,\eta}$ and show that, as $\eta\downarrow 0$, it converges weakly to the process $Z^{\varepsilon}$; this justifies (16). In Section 4 we show the approximation (19) and justify the effectiveness of using SCGD in the strongly convex case. In Section 5 we discuss further problems, remarks and generalizations.

Notational Conventions. For an $n$–vector $v=(v_1,\ldots,v_n)\in\mathbb{R}^n$ we define the norm

$$
|v|_{\mathbb{R}^n} = \left(v_1^2+\ldots+v_n^2\right)^{1/2}\,.
$$

For any $n\times n$ matrix $\sigma=(\sigma_{ij})$, let us define the norm

$$
\|\sigma\|_{\mathbb{R}^n\otimes\mathbb{R}^n} = \Big(\sum_{i,j=1}^n \sigma_{ij}^2\Big)^{1/2}\,.
$$

If $A$ is a vector or a matrix, then $|A|$ denotes either $|A|_{\mathbb{R}^n}$ when $A$ is an $n$–vector, or $\|A\|_{\mathbb{R}^m\otimes\mathbb{R}^n}$ when $A$ is an $m\times n$ matrix. The standard inner product in $\mathbb{R}^n$ is denoted by $\langle\cdot,\cdot\rangle_{\mathbb{R}^n}$.

The spaces $C^k(D)$ (and $C(D)$) are the spaces of $k$–times continuously differentiable functions on a domain $D$ ($D$ can be the whole space). For a function $f\in C(D)$ we define $\|f\|_{C(D)}$ to be the supremum norm of $f$ on $D$. In case we need to highlight the target space, we also write $C^k(D;\mathbb{R}^m)$, which refers to functions in the space $C^k(D)$ that map into $\mathbb{R}^m$. If $f$ is Lipschitz continuous on $D$, then $[f]_{\mathrm{Lip}}$ is the Lipschitz seminorm of $f$. In the case of vector– or matrix–valued functions, the Lipschitz norm is defined to be the largest Lipschitz norm of the corresponding component functions.

Throughout the paper, capital $X$, $Y$, etc., denote quantities for the time–rescaled process (2), and lower–case $x$, $y$, etc., denote quantities for the original process (1). The constant $C$ denotes a positive constant that varies from line to line. Sometimes, to emphasize the dependence of this constant on other parameters, notation such as $C_T$ may also be used. For notational convenience, we use subscript and parenthesis notation interchangeably, e.g., $X^{\varepsilon,\eta}(t)$ or $X^{\varepsilon,\eta}_t$, to denote a stochastic process.

## 2 The convergence of Xε,η(t) to Xε(t): Averaging principle.

In this section we show the convergence of $X^{\varepsilon,\eta}$ to $X^{\varepsilon}$ as $\eta\downarrow 0$ by arguing as in the classical averaging principle.

Our first Lemma is about the second–moment boundedness of the system $(X^{\varepsilon,\eta}_t, Y^{\varepsilon,\eta}_t)$ in (7).

###### Lemma 2.1.

For any $T>0$ and any $\varepsilon,\eta>0$ there exists some constant $C>0$ such that

$$
\sup_{0\le t\le T}\mathbf{E}|X^{\varepsilon,\eta}_t|^2_{\mathbb{R}^n} \le C\left(1+|x_0|^2_{\mathbb{R}^n}\right)\,,
\tag{23}
$$

and

$$
\sup_{0\le t\le T}\mathbf{E}|Y^{\varepsilon,\eta}_t|^2_{\mathbb{R}^m} \le C\left(1+|y_0|^2_{\mathbb{R}^m}\right)\,.
\tag{24}
$$
###### Proof.

This Lemma can be derived in the same way as in [6, Lemma 4.2]. In fact, we can write the equation (7) for $X^{\varepsilon,\eta}_t$ in integral form as

$$
X^{\varepsilon,\eta}_t = x_0 + \int_0^t B_2(X^{\varepsilon,\eta}_s, Y^{\varepsilon,\eta}_s)\,ds + \sqrt{\eta}\int_0^t \Sigma_2(X^{\varepsilon,\eta}_s, Y^{\varepsilon,\eta}_s)\,dW^2_s\,.
$$

Therefore

$$
\mathbf{E}|X^{\varepsilon,\eta}_t|^2_{\mathbb{R}^n} \le C\left(|x_0|^2_{\mathbb{R}^n} + \mathbf{E}\left|\int_0^t B_2(X^{\varepsilon,\eta}_s, Y^{\varepsilon,\eta}_s)\,ds\right|^2_{\mathbb{R}^n} + \eta\,\mathbf{E}\left|\int_0^t \Sigma_2(X^{\varepsilon,\eta}_s, Y^{\varepsilon,\eta}_s)\,dW^2_s\right|^2_{\mathbb{R}^n}\right)\,.
$$

For a matrix–valued random function $\sigma(s)$ adapted to the filtration of $W_s$ we have the Itô isometry (see [17, (3.12) and (3.13)])

$$
\mathbf{E}\left|\int_0^t \sigma(s)\,dW_s\right|^2_{\mathbb{R}^n} = \int_0^t \mathbf{E}\|\sigma(s)\|^2_{\mathbb{R}^n\otimes\mathbb{R}^n}\,ds\,.
\tag{25}
$$

Since $B_2$ and $\Sigma_2$ are bounded, we obtain (23).

We can write the solution $Y^{\varepsilon,\eta}_t$ in (7) in mild form as

$$
Y^{\varepsilon,\eta}_t = e^{-\frac{\varepsilon}{\eta}t}y_0 + \frac{\varepsilon}{\eta}\int_0^t e^{-\frac{\varepsilon}{\eta}(t-s)}B_1(X^{\varepsilon,\eta}_s)\,ds + \frac{\varepsilon}{\sqrt{\eta}}\int_0^t e^{-\frac{\varepsilon}{\eta}(t-s)}\Sigma_1(X^{\varepsilon,\eta}_s)\,dW^1_s\,.
\tag{26}
$$

Set $\Lambda(t) = e^{-\frac{\varepsilon}{\eta}t}y_0 + \frac{\varepsilon}{\eta}\int_0^t e^{-\frac{\varepsilon}{\eta}(t-s)}B_1(X^{\varepsilon,\eta}_s)\,ds$ and $\Gamma(t) = \frac{\varepsilon}{\sqrt{\eta}}\int_0^t e^{-\frac{\varepsilon}{\eta}(t-s)}\Sigma_1(X^{\varepsilon,\eta}_s)\,dW^1_s$. Then we have

$$
d\Lambda(t) = -\frac{\varepsilon}{\eta}\left[\Lambda(t) - B_1(X^{\varepsilon,\eta}_t)\right]dt\,,\quad \Lambda(0)=y_0\,,
$$

which, by the boundedness of $B_1$, gives

$$
\frac{1}{2}\frac{d}{dt}|\Lambda(t)|^2_{\mathbb{R}^m} = \left\langle \Lambda(t),\, -\frac{\varepsilon}{\eta}\left[\Lambda(t) - B_1(X^{\varepsilon,\eta}_t)\right]\right\rangle_{\mathbb{R}^m} \le -\frac{\varepsilon}{2\eta}|\Lambda(t)|^2_{\mathbb{R}^m} + C\frac{\varepsilon}{\eta}\,.
$$

Therefore by the Gronwall inequality we know that for $0\le t\le T$ we have

$$
|\Lambda(t)|^2_{\mathbb{R}^m} \le e^{-\frac{\varepsilon}{\eta}t}|y_0|^2_{\mathbb{R}^m} + 2C \le C\left(1+|y_0|^2_{\mathbb{R}^m}\right)\,.
$$

It remains to estimate $\Gamma(t)$. Again, by (25) we have

$$
\mathbf{E}|\Gamma(t)|^2_{\mathbb{R}^m} = \frac{\varepsilon^2}{\eta}e^{-\frac{2\varepsilon}{\eta}t}\int_0^t e^{\frac{2\varepsilon}{\eta}s}\,\mathbf{E}\|\Sigma_1(X^{\varepsilon,\eta}_s)\|^2_{\mathbb{R}^m\otimes\mathbb{R}^m}\,ds \le \frac{C\varepsilon}{2}\,.
$$

Thus we obtain

$$
\mathbf{E}|Y^{\varepsilon,\eta}_t|^2_{\mathbb{R}^m} \le C\left(\mathbf{E}|\Lambda(t)|^2_{\mathbb{R}^m} + \mathbf{E}|\Gamma(t)|^2_{\mathbb{R}^m}\right) \le C\left(1+|y_0|^2_{\mathbb{R}^m}\right)\,,
$$

which is (24). ∎

The next Lemma summarizes basic facts about the process $y^{X,\varepsilon}(t)$ defined in (8).

###### Lemma 2.2.

Let the process $y^{X,\varepsilon}(t)$ defined in (8) start from $y^{X,\varepsilon}(0)=y$. Then for any Lipschitz continuous function $\varphi$ and some $\delta>0$ we have

$$
\left|\mathbf{E}_y\varphi(y^{X,\varepsilon}(t)) - \int_{\mathbb{R}^m}\varphi(Y)\,\mu^{X,\varepsilon}(dY)\right| \le Ce^{-\delta t}\left(1+|y|_{\mathbb{R}^m}\right)[\varphi]_{\mathrm{Lip}}\,,
\tag{27}
$$

where the constant $C$ is independent of $\varepsilon$.

Moreover, for some constant $C>0$ we have

$$
\mathbf{E}\left|\frac{1}{T}\int_t^{t+T}B_2(X, y^{X,\varepsilon}(s))\,ds - \int_{\mathbb{R}^m}B_2(X,Y)\,\mu^{X,\varepsilon}(dY)\right|^2_{\mathbb{R}^n} \le \frac{C}{T}\left[\sqrt{\varepsilon}\,[B_2]_{\mathrm{Lip}}\left(1+|y|_{\mathbb{R}^m}\right) + |B_2(X,0)|_{\mathbb{R}^n}\right]^2\,.
\tag{28}
$$

Finally,

$$
\mathbf{E}|y^{X,\varepsilon}(t)|^2_{\mathbb{R}^m} \le C\left(1+e^{-2t}|y|^2_{\mathbb{R}^m}\right)\,,
\tag{29}
$$

and

$$
\int_{\mathbb{R}^m}|Y|^2_{\mathbb{R}^m}\,\mu^{X,\varepsilon}(dY) \le C < \infty\,,
\tag{30}
$$

in which the constant $C$ may depend on $X$ but is independent of $\varepsilon$.

###### Proof.

Let us first recall the auxiliary process $y^{X,\varepsilon}(t)$ in (8). From (8) we have

$$
d\left(y^{X,\varepsilon}(t)-\mathbf{E}g_w(X)\right) = -\left(y^{X,\varepsilon}(t)-\mathbf{E}g_w(X)\right)dt + \sqrt{\varepsilon}\,\Sigma_1(X)\,dW^1_t\,,
$$

so that

$$
y^{X,\varepsilon}(t)-\mathbf{E}g_w(X) = \left(y^{X,\varepsilon}(0)-\mathbf{E}g_w(X)\right)e^{-t} + \sqrt{\varepsilon}\,\Sigma_1(X)\int_0^t e^{-(t-s)}\,dW^1_s\,.
$$

Let $Z(t)$ be the OU process satisfying the stochastic differential equation

$$
dZ(t) = -Z(t)\,dt + dW^1_t\,,\quad Z(0)=0\in\mathbb{R}^m\,.
$$

Thus we have the explicit representation

$$
y^{X,\varepsilon}(t) = \mathbf{E}g_w(X) + \left(y^{X,\varepsilon}(0)-\mathbf{E}g_w(X)\right)e^{-t} + \sqrt{\varepsilon}\,\Sigma_1(X)Z(t)\,.
\tag{31}
$$

Let $\mu = \mathcal{N}\!\left(0, \frac{1}{2}I_m\right)$ be the invariant measure of $Z(t)$, where $I_m$ is the identity matrix in $\mathbb{R}^m\otimes\mathbb{R}^m$. Then we have the exponential mixing estimate: for some $\delta>0$ we have

$$
\left|\mathbf{E}\varphi(Z(t)) - \int_{\mathbb{R}^m}\varphi(Y)\,\mu(dY)\right| \le Ce^{-\delta t}[\varphi]_{\mathrm{Lip}(\mathbb{R}^m)}\,.
\tag{32}
$$

This, together with (31), as well as the boundedness of $\mathbf{E}g_w(X)$ and $\Sigma_1(X)$ in terms of $X$, implies (27).

From (32), by the same argument as in [6, Lemma 2.3], we obtain

$$
\mathbf{E}\left|\frac{1}{T}\int_t^{t+T}B_2(X, Z(s))\,ds - \int_{\mathbb{R}^m}B_2(X,Y)\,\mu(dY)\right|^2_{\mathbb{R}^n} \le \frac{C}{T}\left([B_2]_{\mathrm{Lip}} + |B_2(X,0)|_{\mathbb{R}^n}\right)^2\,.
\tag{33}
$$

From here, by making use of the representation (31), we obtain (28).

Moreover, by (31) we infer that

$$
\mathbf{E}|y^{X,\varepsilon}(t)|^2_{\mathbb{R}^m} \le C\left(1+e^{-2t}+e^{-2t}|y^{X,\varepsilon}(0)|^2_{\mathbb{R}^m}\right) + C\varepsilon\,\mathbf{E}|Z(t)|^2_{\mathbb{R}^m} \le C\left(1+e^{-2t}|y^{X,\varepsilon}(0)|^2_{\mathbb{R}^m}\right) + \frac{C\varepsilon}{2}\left(1-e^{-2t}\right) \le C\left(1+e^{-2t}|y^{X,\varepsilon}(0)|^2_{\mathbb{R}^m}\right)\,,
$$

which is (29).

Finally, (30) is a consequence of (9) and the fact that $\mathbf{E}g_w(X)$ and $\Sigma_1(X)$ are uniformly bounded in $X$. ∎

Now we will derive the averaging principle following the classical Khasminskii method. Let $\Delta>0$. Let us consider a partition of the time interval $[0,T]$ into intervals of the same length $\Delta$. Let us introduce the auxiliary processes $\widehat{Y}^{\varepsilon,\eta}_t$, $\widehat{X}^{\varepsilon,\eta}_t$ by means of the relations

$$
\widehat{Y}^{\varepsilon,\eta}_t = Y^{\varepsilon,\eta}_{k\Delta} - \frac{\varepsilon}{\eta}\int_{k\Delta}^t \widehat{Y}^{\varepsilon,\eta}_s\,ds + \frac{\varepsilon}{\eta}\int_{k\Delta}^t B_1(X^{\varepsilon,\eta}_{k\Delta})\,ds + \frac{\varepsilon}{\sqrt{\eta}}\int_{k\Delta}^t \Sigma_1(X^{\varepsilon,\eta}_{k\Delta})\,dW^1_s\,,\quad t\in[k\Delta,(k+1)\Delta]\,,
\tag{34}
$$

$$
\widehat{X}^{\varepsilon,\eta}_t = x_0 + \int_0^t B_2\!\left(X^{\varepsilon,\eta}_{[s/\Delta]\Delta}, \widehat{Y}^{\varepsilon,\eta}_s\right)ds + \sqrt{\eta}\int_0^t \Sigma_2\!\left(X^{\varepsilon,\eta}_{[s/\Delta]\Delta}, \widehat{Y}^{\varepsilon,\eta}_s\right)dW^2_s\,.
\tag{35}
$$
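To make the construction concrete, the sketch below runs system (7) and the auxiliary pair (34)-(35) side by side, driven by the same Brownian increments, for a toy scalar model (an illustrative assumption of ours: $g_w(x)=x+w$, $f_v(y)=y^2/2+vy$ with $v,w\sim\mathcal{N}(0,1)$, so $B_1(X)=X$, $B_2(X,Y)=-Y$, $\Sigma_1=\Sigma_2=1$). On each block the auxiliary fast process restarts from $Y^{\varepsilon,\eta}_{k\Delta}$ and its drift sees only the frozen slow state $X^{\varepsilon,\eta}_{k\Delta}$.

```python
import numpy as np

rng = np.random.default_rng(4)
eps, eta, T, dt, Delta = 0.05, 0.001, 1.0, 1e-4, 0.05
block = int(round(Delta / dt))     # Euler steps per partition interval
X, Y = 2.0, 0.0                    # original fast-slow system (7)
X_hat, Y_hat = 2.0, 0.0            # auxiliary processes (34)-(35)
X_frozen = X
sq = np.sqrt(dt)
for k in range(int(T / dt)):
    if k % block == 0:             # start of the block [k*Delta, (k+1)*Delta)
        X_frozen = X               # freeze the slow component at X_{k*Delta}
        Y_hat = Y                  # (34) restarts from Y_{k*Delta}
    dW1, dW2 = rng.normal(0.0, sq, size=2)
    # original system (7): drift of the fast motion uses the current X
    Y += -(eps / eta) * (Y - X) * dt + (eps / np.sqrt(eta)) * dW1
    X += -Y * dt + np.sqrt(eta) * dW2
    # auxiliary system: the fast drift sees only the frozen slow state
    Y_hat += -(eps / eta) * (Y_hat - X_frozen) * dt + (eps / np.sqrt(eta)) * dW1
    X_hat += -Y_hat * dt + np.sqrt(eta) * dW2
print(abs(Y - Y_hat), abs(X - X_hat))  # both small when Delta is small
```

The two trajectories stay close because the slow component moves only $O(\Delta)$ within each block, which is the mechanism behind Lemma 2.3.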
###### Lemma 2.3.

The interval length $\Delta=\Delta(\eta)$ can be chosen such that $\Delta\to 0$ as $\eta\downarrow 0$, and for any small $\kappa>0$ we have

$$
\mathbf{E}\left|Y^{\varepsilon,\eta}_t - \widehat{Y}^{\varepsilon,\eta}_t\right|^2_{\mathbb{R}^m} \le C\varepsilon^2\eta^{1-\kappa} \to 0
\tag{36}
$$

uniformly in $t\in[0,T]$, $x_0$ and $y_0$.

###### Proof.

In fact we can write, for $t\in[k\Delta,(k+1)\Delta]$, that