 # On the modes of convergence of Stochastic Optimistic Mirror Descent (OMD) for saddle point problems

In this article, we study the convergence of Mirror Descent (MD) and Optimistic Mirror Descent (OMD) for saddle point problems satisfying the notion of coherence as proposed in Mertikopoulos et al. We prove convergence of OMD with exact gradients for coherent saddle point problems, and show that monotone convergence only occurs after some sufficiently large number of iterations. This is in contrast to the claim in Mertikopoulos et al. of monotone convergence of OMD with exact gradients for coherent saddle point problems. Besides highlighting this important subtlety, we note that the almost sure convergence guarantees of MD and OMD with stochastic gradients for strictly coherent saddle point problems that are claimed in Mertikopoulos et al. are not fully justified by their proof. As such, we fill out the missing details in the proof and as a result have only been able to prove convergence with high probability. We would like to note that our analysis relies heavily on the core ideas and proof techniques introduced in Zhou et al. and Mertikopoulos et al., and we only aim to re-state and correct the results in light of what we were able to prove rigorously while filling in the much needed missing details in their proofs.

## 1 Introduction

We analyze some recent results on Mirror Descent (MD) and Optimistic Mirror Descent (OMD), which have recently been studied extensively for alleviating convergence issues in the training of generative adversarial networks [3, 1]. In particular, these papers consider the following general saddle-point (SP) problem:

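$$\min_{x_1 \in \mathcal{X}_1} \max_{x_2 \in \mathcal{X}_2} f(x_1, x_2), \tag{1}$$

where $\mathcal{X}_1$ and $\mathcal{X}_2$ are compact convex subsets of finite-dimensional normed spaces and $f$ is continuously differentiable. Let $\mathcal{X} = \mathcal{X}_1 \times \mathcal{X}_2$ and write $x = (x_1, x_2)$. Define the vector field $g$ on $\mathcal{X}$, with values in the dual of the ambient space, as

$$g(x) := \big(\nabla_{x_1} f(x_1, x_2),\; -\nabla_{x_2} f(x_1, x_2)\big).$$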

It is well known that if $x^*$ is a solution to (1), then it satisfies the Stampacchia Variational Inequality (SVI), i.e.

$$\langle g(x^*), x - x^* \rangle \geq 0, \quad \forall x \in \mathcal{X}.$$

When the function $f$ is convex-concave, it is also well known that the point $x^*$ satisfies the Minty Variational Inequality (MVI), i.e.

$$\langle g(x), x - x^* \rangle \geq 0, \quad \forall x \in \mathcal{X}.$$

For convex-concave problems, it can be shown that these three conditions, namely that $x^*$ is an optimal point of (1), that $x^*$ satisfies SVI, and that $x^*$ satisfies MVI, are all equivalent. The interplay of these two VIs in characterizing the optimal set of solutions has been investigated extensively. We focus here on the notion of coherence proposed in [2, 1], which is assumed to be satisfied by problem (1) and which is in some sense the weakest set of conditions considered for global optimality, going beyond convexity, pseudo-monotonicity, and quasi-convexity.

In order to describe MD and OMD, one needs to define the notion of Bregman Divergence (BD) with respect to a differentiable and $K$-strongly convex function $h$ whose domain includes the set $\mathcal{X}$. There are several equivalent definitions of $K$-strong convexity; here we provide the one that will be used directly in the proofs later:

$$\langle \nabla h(x) - \nabla h(x'), x - x' \rangle \geq K \|x - x'\|^2, \quad \forall x, x' \in \mathcal{X}. \tag{2}$$

We further assume that $\nabla h$ is $L_h$-Lipschitz, which is needed in the proofs. The BD is defined as

$$D(x, y) = h(y) - h(x) - \langle \nabla h(x), y - x \rangle. \tag{3}$$

The Bregman divergence enjoys a number of properties that are critical to the success of MD and OMD; we refer the reader to the appendices of the referenced works, where they are stated and derived. Given a point $x \in \mathcal{X}$ and a dual vector $y$, define the Bregman projection operator via

$$P_x(y) = \arg\min_{x' \in \mathcal{X}} \big\{ \langle y, x - x' \rangle + D(x', x) \big\}. \tag{4}$$

In the following we assume that such an $h$ is given and fixed throughout.
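To make the Bregman divergence and the projection operator (4) concrete, here is a minimal sketch for one specific setting that is our assumption, not the paper's: the Euclidean distance-generating function $h(x) = \tfrac{1}{2}\|x\|_2^2$ (so $K = L_h = 1$) on the box $[-1, 1]^d$. The sketch uses the argument order in which the divergence is linearized around its second argument, which is the order that (4) relies on. In this case the minimizer in (4) is the Euclidean projection of $x + y$ onto the box, i.e. a coordinate-wise clip. The function names are hypothetical.

```python
def h(x):
    """Distance-generating function h(x) = 0.5 * ||x||_2^2 (assumed for illustration)."""
    return 0.5 * sum(xi * xi for xi in x)


def grad_h(x):
    """Gradient of h; for the Euclidean choice this is the identity map."""
    return list(x)


def bregman(p, x):
    """D(p, x) = h(p) - h(x) - <grad h(x), p - x>, linearized around x."""
    inner = sum(gi * (pi - xi) for gi, pi, xi in zip(grad_h(x), p, x))
    return h(p) - h(x) - inner


def prox(x, y, lo=-1.0, hi=1.0):
    """P_x(y) on the box [lo, hi]^d for the Euclidean h: clip(x + y)."""
    return [min(hi, max(lo, xi + yi)) for xi, yi in zip(x, y)]


# For the Euclidean h, the divergence is half the squared distance.
example_divergence = bregman([1.0, 0.0], [0.25, -0.5])
# The first coordinate of x + y leaves the box and is clipped back to 1.0.
example_prox = prox([0.5, 0.5], [1.0, -0.2])
```

Other choices of $h$ lead to different closed forms for the projection; the Euclidean case is used here only because it satisfies both the strong convexity and the Lipschitz-gradient assumptions in an obvious way.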

### 1.1 Variational Inequalities and Coherence

The following definition of coherence is provided in [1, Definition 2.1]; here we make explicit what it means for $x$ to be sufficiently close to $x^*$ in Condition 3. This definition will be used in the proofs presented in Section 4.

###### Definition 1.

We say that (SP) is coherent if

1. Every solution of (SVI) also solves (SP).

2. There exists a solution of (SP) that satisfies (MVI).

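3. Every solution $x^*$ of (SP) satisfies (MVI) locally. Specifically, for some fixed $\bar\epsilon > 0$, $\langle g(x), x - x^* \rangle \geq 0$ for all $x \in \mathcal{X}$ such that $D(x^*, x) \leq \bar\epsilon$.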

If, moreover, (MVI) holds as a strict inequality in Condition 2 when $x$ is not a solution of (SP), then (SP) will be called strictly coherent; by contrast, if (MVI) holds as an equality in Condition 2 for all $x \in \mathcal{X}$, we will say that (SP) is null-coherent.
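The distinction between null-coherence and strict coherence can be checked numerically on two toy problems. The sketch below is illustrative only and rests on our own choices: the bilinear game $f(x_1, x_2) = x_1 x_2$ and the quadratic $f(x_1, x_2) = x_1^2 - x_2^2$, both on $[-1, 1]^2$ with the saddle point at the origin. It evaluates the (MVI) quantity $\langle g(x), x - x^* \rangle$ on a grid.

```python
import itertools


def g_bilinear(x):
    # f(x1, x2) = x1 * x2, so g(x) = (df/dx1, -df/dx2) = (x2, -x1).
    return (x[1], -x[0])


def g_quadratic(x):
    # f(x1, x2) = x1**2 - x2**2, so g(x) = (2*x1, 2*x2).
    return (2.0 * x[0], 2.0 * x[1])


def mvi_gap(g, x, x_star=(0.0, 0.0)):
    """Evaluate <g(x), x - x_star>, the left-hand side of (MVI)."""
    return sum(gi * (xi - si) for gi, xi, si in zip(g(x), x, x_star))


grid = [i / 4.0 for i in range(-4, 5)]
points = list(itertools.product(grid, grid))

# Bilinear game: the gap vanishes everywhere (null-coherent behaviour).
bilinear_gaps = [mvi_gap(g_bilinear, x) for x in points]
# Quadratic problem: the gap is positive away from the saddle point.
quadratic_gaps = [mvi_gap(g_quadratic, x) for x in points]
```

A grid check of this kind is of course not a proof; for these two examples the same conclusions follow from a one-line calculation.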

### 1.2 Mirror Descent Algorithms

Under stochastic gradients, MD and OMD are defined below, along with the almost standard assumptions on the expected values and the variances of the gradient estimates. For all the probabilistic statements in this article, we consider the probability space $(\Omega, \mathcal{F}, \mathbb{P})$ and let $\mathbb{E}$ denote the expectation with respect to $\mathbb{P}$.

Mirror Descent (MD): The vanilla mirror descent (MD) algorithm is defined as

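$$X_{n+1} = P_{X_n}(-\gamma_n \hat g_n), \tag{5}$$

where $\gamma_n > 0$ is the step-size and the gradient estimate $\hat g_n$ satisfies, for some constant $G$,

$$\mathbb{E}[\hat g_n \mid \mathcal{F}_n] = g(X_n) \quad \text{and} \quad \mathbb{E}\big[\|\hat g_n\|_*^2 \mid \mathcal{F}_n\big] \leq G^2 \tag{6}$$

with $\mathcal{F}_n$ the $\sigma$-algebra generated by $X_1, \dots, X_n$. Note that with the assumption in (6), there exists some constant $\tau$ such that

$$\mathbb{E}\big[\|\hat g_n - g(X_n)\|_*^2 \mid \mathcal{F}_n\big] \leq 2\,\mathbb{E}\big[\|\hat g_n\|_*^2 \mid \mathcal{F}_n\big] + 2\,\|g(X_n)\|_*^2 \leq \tau^2, \tag{7}$$

since a continuous function ($g$) is bounded on a compact set ($\mathcal{X}$).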

Optimistic Mirror Descent (OMD): The optimistic mirror descent (OMD) algorithm is defined as

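$$Y_n = P_{X_n}(-\gamma_n \hat g_n), \qquad X_{n+1} = P_{X_n}(-\gamma_n \hat r_n), \tag{8}$$

with the assumption that

$$\mathbb{E}[\hat g_n \mid \mathcal{F}_{n,n-1}] = g(X_n), \quad \mathbb{E}[\hat r_n \mid \mathcal{F}_{n,n}] = g(Y_n), \quad \mathbb{E}\big[\|\hat g_n\|_*^2 \mid \mathcal{F}_{n,n-1}\big] \leq G^2, \quad \mathbb{E}\big[\|\hat r_n\|_*^2 \mid \mathcal{F}_{n,n}\big] \leq G^2, \tag{9}$$

where $\mathcal{F}_{n,n-1}$ and $\mathcal{F}_{n,n}$ denote the $\sigma$-algebras generated by the history available before $\hat g_n$ and $\hat r_n$ are drawn, respectively. Similarly, there exists some finite constant $\sigma$ such that

$$\mathbb{E}\big[\|\hat g_n - g(X_n)\|_*^2 \mid \mathcal{F}_{n,n-1}\big] \leq \sigma^2 \quad \text{and} \quad \mathbb{E}\big[\|\hat r_n - g(Y_n)\|_*^2 \mid \mathcal{F}_{n,n}\big] \leq \sigma^2. \tag{10}$$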
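The two updates (5) and (8) can be sketched as follows. This is a minimal illustration under our own assumptions, not the authors' code: the Euclidean prox-mapping on the box $[-1, 1]^2$ (a coordinate-wise clip), the bilinear game $f(x_1, x_2) = x_1 x_2$, and an exact gradient oracle passed in as a plain function. A stochastic oracle satisfying (6) or (9) could be substituted for `oracle`.

```python
def g(x):
    # Exact gradient field of the bilinear game f(x1, x2) = x1 * x2.
    return (x[1], -x[0])


def prox(x, y, lo=-1.0, hi=1.0):
    """Euclidean prox-mapping on the box [lo, hi]^d: clip(x + y)."""
    return tuple(min(hi, max(lo, xi + yi)) for xi, yi in zip(x, y))


def md_step(x, gamma, oracle):
    """One mirror descent step, as in (5)."""
    g_hat = oracle(x)
    return prox(x, tuple(-gamma * gi for gi in g_hat))


def omd_step(x, gamma, oracle):
    """One optimistic mirror descent step, as in (8)."""
    g_hat = oracle(x)                      # gradient estimate at X_n
    y = prox(x, tuple(-gamma * gi for gi in g_hat))
    r_hat = oracle(y)                      # gradient estimate at Y_n
    return prox(x, tuple(-gamma * ri for ri in r_hat))


def run(step, x0, gamma, n_steps, oracle=g):
    x = x0
    for _ in range(n_steps):
        x = step(x, gamma, oracle)
    return x


def sq_norm(x):
    return sum(xi * xi for xi in x)


x0 = (0.1, 0.1)
x_md = run(md_step, x0, gamma=0.1, n_steps=50)    # drifts away from the origin
x_omd = run(omd_step, x0, gamma=0.1, n_steps=50)  # moves toward the origin
```

Note that both projections in the OMD step are taken from the same base point $X_n$; only the gradient estimate changes between the two calls.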

## 2 Main Results

We are now ready to re-state the main results from [1] with several important corrections.

First, for coherent problems and OMD with exact (non-stochastic) gradients, it is claimed in [1] that $D(x^*, X_n)$ is monotone decreasing for some saddle point $x^*$. However, as we fill in the missing details in the proof, we believe that the monotone decrease is guaranteed only after some sufficiently large number of iterations; our modified statement is given in Theorem 1.

###### Theorem 1 (Modified from [1, Theorem 4.1]).

Suppose that (SP) is coherent and $g$ is $L_g$-Lipschitz. If the algorithm (8) is run with exact (non-stochastic) gradients and a step-size sequence $\{\gamma_n\}$ that satisfies

 0

where $K$ is defined in (2), then $X_n \to x^*$ for some solution $x^*$ of (SP). Moreover, there exists some sufficiently large $n_0$ such that $D(x^*, X_n)$ decreases monotonically in $n$ for all $n \geq n_0$.

Next, for strictly coherent problems, almost sure convergence is claimed in [1] for both MD and OMD with stochastic gradients. However, as we fill in the missing details in their proofs, we believe that only convergence with high probability can be guaranteed; our modified theorem statements for MD and OMD are given in Theorem 2 (a) and Theorem 3, respectively. Moreover, for null-coherent problems, it is claimed in [1] that the sequence $\mathbb{E}[D(x^*, X_n)]$ is non-decreasing for all saddle points $x^*$, whereas we believe that this is true only for the $x^*$'s that satisfy Condition 2 in Definition 1; our modified statement is given in Theorem 2 (b).

###### Theorem 2 (Modified from [1, Theorem 3.1]).

Suppose that MD (5) is run with a gradient oracle satisfying (6).

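1. For strictly coherent problems, for any $\epsilon > 0$ and $\delta \in (0, 1)$, if the step-size $\gamma_n$ satisfies $\sum_{n=1}^{\infty} \gamma_n = \infty$ and

   $$\sum_{n=1}^{\infty} \gamma_n^2 \leq \min\left\{ \frac{\delta \epsilon^2}{2\,\mathrm{diam}(\mathcal{X})^2 \tau^2},\; \frac{K \delta \epsilon}{2 \tau^2} \right\}, \tag{11}$$

   where $\mathrm{diam}(\mathcal{X}) = \max_{x, x' \in \mathcal{X}} \|x - x'\|$, $\mathcal{X}^*$ is the set of solutions of (SP), $K$ is defined in (2), and $\tau$ is defined in (7), then

   $$\mathbb{P}\big(\exists n_0 \in \mathbb{N},\ \exists x^* \in \mathcal{X}^* \ \text{s.t.}\ D(x^*, X_n) \leq \epsilon,\ \forall n \geq n_0\big) \geq 1 - \delta.$$

2. For null-coherent problems, the sequence $\mathbb{E}[D(p, X_n)]$ is non-decreasing for saddle points $p$ that satisfy Condition 2 (global MVI) in Definition 1.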

###### Remark 1.

Theorem 2 (b) implies that for null-coherent problems with a unique saddle point $x^*$ (which thus necessarily satisfies global (MVI) by the definition of coherence), such as the two-player zero-sum game example provided in [1, Proposition C.3], the sequence $\mathbb{E}[D(x^*, X_n)]$ is non-decreasing.

###### Theorem 3 (Modified from [1, Theorem 4.3]).

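Suppose that (SP) is strictly coherent and stochastic OMD (8) is run with a gradient oracle satisfying (9). For any $\epsilon > 0$ and $\delta \in (0, 1)$, if the step-size $\gamma_n$ satisfies $\sum_{n=1}^{\infty} \gamma_n = \infty$ and

$$\sum_{n=1}^{\infty} \gamma_n^2 \leq \min\left\{ \frac{\delta \epsilon^2}{3\,\mathrm{diam}(\mathcal{X})^2 \sigma^2},\; \frac{K \delta \epsilon}{3 \sigma^2} \right\}, \tag{12}$$

where $\mathrm{diam}(\mathcal{X}) = \max_{x, x' \in \mathcal{X}} \|x - x'\|$, $K$ is defined in (2), and $\sigma$ is defined in (10), then

$$\mathbb{P}\big(\exists n_0 \in \mathbb{N},\ \exists x^* \in \mathcal{X}^* \ \text{s.t.}\ D(x^*, X_n) \leq \epsilon,\ \forall n \geq n_0\big) \geq 1 - \delta. \tag{13}$$

###### Remark 2.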

The conditions on the step-size sequence given in (11) and (12) suggest that there is a trade-off between the evolution speed of the algorithm (how large $\gamma_n$ can be), the accuracy of the solution (how small $\epsilon$ can be), and the probability of convergence (how small $\delta$ can be).
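The trade-off can be made tangible by computing the budget on the right-hand side of (12) for given parameters. The sketch below encodes our reading of that bound, namely $\min\{\delta\epsilon^2 / (3\,\mathrm{diam}(\mathcal{X})^2\sigma^2),\ K\delta\epsilon / (3\sigma^2)\}$; the helper names and the choice of harmonic step-sizes $\gamma_n = c/n$ are hypothetical and serve only as an illustration. For that schedule, $\sum_n \gamma_n^2 = c^2 \pi^2 / 6$, so the largest admissible scale $c$ follows directly from the budget.

```python
import math


def omd_step_budget(delta, eps, K, sigma, diam):
    """Right-hand side of (12) as read here (an assumption about the garbled formula)."""
    return min(
        delta * eps ** 2 / (3.0 * diam ** 2 * sigma ** 2),
        K * delta * eps / (3.0 * sigma ** 2),
    )


def max_scale_for_harmonic_steps(budget):
    """Largest c such that gamma_n = c / n gives sum(gamma_n^2) = c^2 * pi^2 / 6 <= budget."""
    return math.sqrt(6.0 * budget) / math.pi


# Illustrative parameter values, chosen arbitrarily.
budget = omd_step_budget(delta=0.1, eps=0.1, K=1.0, sigma=1.0, diam=2.0)
c_max = max_scale_for_harmonic_steps(budget)
```

Halving $\delta$ halves the budget, and halving $\epsilon$ shrinks it by a factor between two and four, depending on which term in the minimum is active; in both cases the admissible step-sizes become smaller.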

## 3 Conclusions and Future Work

In an attempt to understand the recent body of work on MD/OMD dynamics for saddle point problems, in this article we have provided more rigorous and corrected statements of the claims in [1]. As part of future work, we aim to shed light on the rates of convergence of MD and OMD under coherence assumptions. In this context we aim to build upon the analysis conducted in .

## References

[1] P. Mertikopoulos, B. Lecouat, H. Zenati, C.-S. Foo, V. Chandrasekhar, and G. Piliouras, “Optimistic mirror descent in saddle-point problems: Going the extra(-gradient) mile,” in International Conference on Learning Representations (ICLR 2019), 2019.

[2] Z. Zhou, P. Mertikopoulos, N. Bambos, S. Boyd, and P. Glynn, “On the convergence of mirror descent beyond stochastic convex programming,” arXiv preprint arXiv:1706.05681, 2017.

[3] C. Daskalakis, A. Ilyas, V. Syrgkanis, and H. Zeng, “Training GANs with optimism,” in International Conference on Learning Representations (ICLR 2018), 2018.

[4] R. Ferrentino, “Variational inequalities and optimization problems,” Applied Mathematical Sciences, vol. 1, pp. 2327–2343, 2007.

[5] P. Mertikopoulos and M. Staudigl, “Stochastic mirror descent dynamics and their convergence in monotone variational inequalities,” Journal of Optimization Theory and Applications, vol. 179, no. 3, pp. 838–867, 2018.

## 4 Appendix

### 4.1 Proof of Theorem 1

First, we restate [1, Lemma D.1].

###### Lemma 1.

Suppose that (SP) is coherent and $g$ is $L_g$-Lipschitz. For any saddle point $x^*$, the iterates of (OMD) with exact gradients satisfy

$$D(x^*, X_{n+1}) \leq D(x^*, X_n) - \frac{1}{2}\left(K - \frac{\gamma_n^2 L_g^2}{K}\right)\|Y_n - X_n\|^2 - \gamma_n \langle g(Y_n), Y_n - x^* \rangle. \tag{14}$$

If, moreover, $p$ is (one of) the special saddle points that satisfy (MVI) globally, then

$$D(p, X_{n+1}) \leq D(p, X_n) - \frac{1}{2}\left(K - \frac{\gamma_n^2 L_g^2}{K}\right)\|Y_n - X_n\|^2. \tag{15}$$

###### Proof.

Note that no additional proof is needed, since (14) is directly obtained from the first inequality in [1, (D.2)] and (15) is the original statement of [1, Lemma D.1]. ∎
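As a sanity check of (15), the inequality can be verified numerically along an exact-gradient OMD trajectory. The setting below is our own assumption for illustration: the bilinear game $f(x_1, x_2) = x_1 x_2$ on $[-1, 1]^2$ with the Euclidean $h$, so that $K = 1$, $L_g = 1$, $p = (0, 0)$, and the Bregman divergence to $p$ is half the squared Euclidean distance.

```python
def g(x):
    # Gradient field of f(x1, x2) = x1 * x2; it is 1-Lipschitz in the Euclidean norm.
    return (x[1], -x[0])


def clip_box(z, lo=-1.0, hi=1.0):
    return tuple(min(hi, max(lo, zi)) for zi in z)


def bregman(p, x):
    # Bregman divergence for the Euclidean h: half the squared distance.
    return 0.5 * sum((pi - xi) ** 2 for pi, xi in zip(p, x))


def omd_exact(x, gamma):
    """One exact-gradient OMD step (8); returns the pair (Y_n, X_{n+1})."""
    gx = g(x)
    y = clip_box((x[0] - gamma * gx[0], x[1] - gamma * gx[1]))
    gy = g(y)
    x_next = clip_box((x[0] - gamma * gy[0], x[1] - gamma * gy[1]))
    return y, x_next


K, L_g, gamma = 1.0, 1.0, 0.5
p = (0.0, 0.0)
x_start = (0.5, -0.3)

x = x_start
slacks = []  # right-hand side of (15) minus left-hand side, per iteration
for _ in range(30):
    y, x_next = omd_exact(x, gamma)
    gap_sq = sum((yi - xi) ** 2 for yi, xi in zip(y, x))
    rhs = bregman(p, x) - 0.5 * (K - gamma ** 2 * L_g ** 2 / K) * gap_sq
    slacks.append(rhs - bregman(p, x_next))
    x = x_next
x_final = x
```

In this particular example a short calculation shows that (15) holds with equality as long as the clipping is inactive, so the recorded slacks should be zero up to floating-point rounding.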

Next, we add a result that is similar to [1, Proposition B.4(a)].

###### Lemma 2.

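Let $h$ be a $K$-strongly convex distance-generating function on $\mathcal{X}$ and further assume that $\nabla h$ is $L_h$-Lipschitz. Then, for any dual vector $y$, we have

$$\|P_{x_1}(y) - P_{x_2}(y)\| \leq \frac{L_h}{K}\|x_1 - x_2\|, \quad \forall x_1, x_2 \in \mathrm{dom}\,\partial h.$$

###### Proof.

Let $z_1 = P_{x_1}(y)$ and $z_2 = P_{x_2}(y)$. By [1, Lemma B.1(b)(c)], we have

$$\langle \nabla h(z_1) - y - \nabla h(x_1), z_1 - p \rangle \leq 0, \quad \forall p \in \mathcal{X},$$

$$\langle \nabla h(z_2) - y - \nabla h(x_2), z_2 - p' \rangle \leq 0, \quad \forall p' \in \mathcal{X}.$$

Letting $p = z_2$ and $p' = z_1$ and adding the two inequalities gives

$$\langle \nabla h(z_1) - \nabla h(z_2), z_1 - z_2 \rangle \leq \langle \nabla h(x_1) - \nabla h(x_2), z_1 - z_2 \rangle.$$

Note that the LHS of the above is lower bounded by

$$\langle \nabla h(z_1) - \nabla h(z_2), z_1 - z_2 \rangle \geq K \|z_1 - z_2\|^2$$

by strong convexity of $h$. The RHS of the above is upper bounded by

$$\langle \nabla h(x_1) - \nabla h(x_2), z_1 - z_2 \rangle \leq L_h \|x_1 - x_2\| \|z_1 - z_2\|$$

by Cauchy-Schwarz and the Lipschitz property of $\nabla h$. Combining the two and dividing by $K\|z_1 - z_2\|$ (the claim is trivial when $z_1 = z_2$), we have

$$\|z_1 - z_2\| \leq \frac{L_h}{K}\|x_1 - x_2\|.$$

∎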

We are now ready to prove Theorem 1.

###### Proof.

Let $p$ be a saddle point of (SP) that satisfies (MVI) globally. Such a $p$ exists by the definition of coherence. By (15) in Lemma 1, we have

$$D(p, X_{n+1}) \leq D(p, X_n) - \frac{1}{2}\left(K - \frac{\gamma_n^2 L_g^2}{K}\right)\|Y_n - X_n\|^2.$$

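Since the step-size condition of Theorem 1 holds, there exists $\alpha \in (0, 1)$ such that $\gamma_n \leq \alpha K / L_g$ for all $n$. Then with this $\alpha$, we have

$$D(p, X_{n+1}) \leq D(p, X_n) - \frac{K}{2}(1 - \alpha^2)\|Y_n - X_n\|^2.$$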

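Telescoping the above, we have

$$D(p, X_{n+1}) \leq D(p, X_1) - \frac{K}{2}(1 - \alpha^2)\sum_{k=1}^{n}\|Y_k - X_k\|^2.$$

Rearranging the above, we have

$$\frac{K}{2}(1 - \alpha^2)\sum_{k=1}^{n}\|Y_k - X_k\|^2 \leq D(p, X_1) - D(p, X_{n+1}) \leq D(p, X_1),$$

where the last inequality follows from the non-negativity of the Bregman divergence. Taking the limit as $n \to \infty$ on both sides of the above inequality, we have $\sum_{k=1}^{\infty}\|Y_k - X_k\|^2 < \infty$, which implies that $\lim_{n \to \infty}\|Y_n - X_n\| = 0$.

Next, by compactness of $\mathcal{X}$, the sequence $\{X_n\}$ has a convergent subsequence $\{X_{n_k}\}$ such that $\lim_{k \to \infty} X_{n_k} = \hat x$. We show in the following that, in fact, $\hat x$ is a saddle point of (SP). First notice that

$$\lim_{k \to \infty}\|Y_{n_k} - \hat x\| \leq \lim_{k \to \infty}\big(\|Y_{n_k} - X_{n_k}\| + \|X_{n_k} - \hat x\|\big) = 0.$$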

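Moreover, suppose that $\gamma_n \to \gamma$, and so $\gamma_{n_k} \to \gamma$ as well. Then

$$\hat x = \lim_{k \to \infty} Y_{n_k} = \lim_{k \to \infty} P_{X_{n_k}}\big(-\gamma_{n_k} g(X_{n_k})\big) = P_{\hat x}\big(-\gamma g(\hat x)\big),$$

where the last equality follows from

$$\begin{aligned}
\big\|P_{X_{n_k}}(-\gamma_{n_k} g(X_{n_k})) - P_{\hat x}(-\gamma g(\hat x))\big\|
&\leq \big\|P_{X_{n_k}}(-\gamma_{n_k} g(X_{n_k})) - P_{X_{n_k}}(-\gamma g(\hat x))\big\| + \big\|P_{X_{n_k}}(-\gamma g(\hat x)) - P_{\hat x}(-\gamma g(\hat x))\big\| \\
&\overset{(a)}{\leq} \frac{1}{K}\big\|\gamma_{n_k} g(X_{n_k}) - \gamma g(\hat x)\big\| + \frac{L_h}{K}\|X_{n_k} - \hat x\| \\
&\leq \frac{1}{K}\big\|\gamma_{n_k} g(X_{n_k}) - \gamma_{n_k} g(\hat x)\big\| + \frac{1}{K}\big\|\gamma_{n_k} g(\hat x) - \gamma g(\hat x)\big\| + \frac{L_h}{K}\|X_{n_k} - \hat x\| \\
&\leq \frac{L_g}{K}|\gamma_{n_k}|\,\|X_{n_k} - \hat x\| + \frac{1}{K}|\gamma_{n_k} - \gamma|\,\|g(\hat x)\| + \frac{L_h}{K}\|X_{n_k} - \hat x\|.
\end{aligned}$$

Since the $\gamma_{n_k}$ are bounded, $\gamma_{n_k} \to \gamma$, and $X_{n_k} \to \hat x$, taking the limit as $k \to \infty$ on both sides of the resulting inequality shows that the left-hand side converges to zero. In the above, step (a) follows by [1, Proposition B.4(a)] and Lemma 2. The above shows that $\hat x = P_{\hat x}(-\gamma g(\hat x))$. By [1, (B.7)], we have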

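$$\langle \nabla h(\hat x), \hat x - x \rangle \leq \langle \nabla h(\hat x) - \gamma g(\hat x), \hat x - x \rangle, \quad \forall x \in \mathcal{X}.$$

That is, $\langle g(\hat x), x - \hat x \rangle \geq 0$ for all $x \in \mathcal{X}$, which is (SVI). By the definition of coherence, $\hat x$ must be a saddle point of (SP).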

So far, we have proved that . Next we want to show that . Similar to before, fix an such that . Then (14) in Lemma 1 gives

 D(x∗,Xn+1)≤D(x∗,Xn)−K2(1−α2)∥Yn−Xn∥2−γn⟨g(Yn),Yn−x∗⟩.

Telescoping the above, we have

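$$\begin{aligned}
D(x^*, X_{n+1}) &\leq D(x^*, X_{n_0}) - \frac{K}{2}(1 - \alpha^2)\sum_{k=n_0}^{n}\|Y_k - X_k\|^2 - \frac{\alpha K}{L_g}\sum_{k=n_0}^{n}\langle g(Y_k), Y_k - x^* \rangle \\
&\leq D(x^*, X_{n_0}) - \frac{\alpha K}{L_g}\sum_{k=n_0}^{n}\langle g(Y_k), Y_k - x^* \rangle,
\end{aligned}$$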

where the choice of is as follows: let with being arbitrary but fixed,

1. Choose sufficiently large such that for all , such choice is possible since we have proved that as ;

2. Choose sufficiently large such that for all , such choice is possible since by and the Bregman reciprocity condition, we have as ;

3. Choose such that , such choice is possible since we have proved that .

With such choice of , we will prove that for all .

First we show that for all ,

$$D(x^*, X_n) \leq \epsilon \implies D(x^*, Y_n) \leq \big(2 + L_h\,\mathrm{diam}(\mathcal{X})\big)\epsilon = \bar\epsilon. \tag{16}$$

To see this, in [1, Lemma B.2], letting , we have

where the first inequality follows by Cauchy-Schwarz and the Lipschitz property of , and the second inequality follows by our choice of .

Now starting with , we have

 D(x∗,Xn0+1)≤D(x∗,Xn0)−αK/Lg⟨g(Yn0),Yn0−x∗⟩.

Since , by (16) we have . By our modified Condition 3 in the definition of coherence, we have , which implies

 D(x∗,Xn0+1)≤D(x∗,Xn0)−αK/Lg⟨g(Yn0),Yn0−x∗⟩≤D(x∗,Xn0)≤ϵ.

Using (16) again, the above implies and hence . Therefore,

 D(x∗,Xn0+2)≤D(x∗,Xn0)−αK/Lgn0+1∑k=n0⟨g(Yk),Yk−x∗⟩≤D(x∗,Xn0)≤ϵ.

Keeping this procedure, we can show that for all , we have . Since can be chosen to be arbitrarily close to zero (by choosing arbitrarily close to zero), we have proved that for all , there exists an such that for all , hence . By the Bregman reciprocity condition, we have . ∎

### 4.2 Proof of Theorem 2

###### Proof.
1. The same technique used for proving (ii) and (iii) in Theorem 3 can also be used to prove the convergence of mirror descent (MD) algorithm; we omit the details here.

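2. Note that there is a typo in the proof of [1, Theorem 3.1(b)] that can be quite confusing: in [1, (C.14)], the plus sign before the last inner product should be a minus sign. To see how [1, (C.14)] (with the corrected sign) is obtained, we first recall that $h$ is proper, convex, and closed. Therefore, we have $\nabla h^* = (\nabla h)^{-1}$, where $h^*$ is the convex conjugate of $h$. By [1, (B.5),(B.6b)], MD (5) can be written as

$$X_{n+1} = \nabla h^*\big(\nabla h(X_n) - \gamma_n \hat g_n\big).$$

It follows that

$$\nabla h(X_{n+1}) = \nabla h\big(\nabla h^*(\nabla h(X_n) - \gamma_n \hat g_n)\big) = \nabla h(X_n) - \gamma_n \hat g_n,$$

hence

$$\nabla h(X_{n+1}) - \nabla h(X_n) = -\gamma_n \hat g_n. \tag{17}$$

Now applying [1, Lemma B.2] with $p$ (a saddle point that satisfies Condition 2 of Definition 1), $X_n$, and $X_{n+1}$, we have

$$\begin{aligned}
D(p, X_{n+1}) &= D(p, X_n) + D(X_n, X_{n+1}) + \langle \nabla h(X_{n+1}) - \nabla h(X_n), X_n - p \rangle \\
&= D(p, X_n) + D(X_n, X_{n+1}) - \gamma_n \langle \hat g_n, X_n - p \rangle,
\end{aligned}$$

where the last equality follows by (17). Taking expectation on both sides,

$$\begin{aligned}
\mathbb{E}[D(p, X_{n+1})] &\overset{(a)}{=} \mathbb{E}[D(p, X_n)] + \mathbb{E}[D(X_n, X_{n+1})] - \gamma_n \mathbb{E}[\langle g(X_n), X_n - p \rangle] \\
&\overset{(b)}{=} \mathbb{E}[D(p, X_n)] + \mathbb{E}[D(X_n, X_{n+1})] \\
&\geq \mathbb{E}[D(p, X_n)],
\end{aligned}$$

where step (a) follows by $\mathbb{E}[\langle \hat g_n, X_n - p \rangle] = \mathbb{E}[\langle g(X_n), X_n - p \rangle]$, since $X_n$ is $\mathcal{F}_n$-measurable and $\hat g_n$ is an unbiased estimate of $g(X_n)$ conditioned on $\mathcal{F}_n$ by (6), and step (b) follows by the definition of null-coherence. The final inequality uses the non-negativity of the Bregman divergence. This shows that the sequence $\mathbb{E}[D(p, X_n)]$ is non-decreasing for the special saddle points $p$ that satisfy global (MVI).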
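As a hedged illustration of this argument (not part of the original proof), consider the Euclidean choice $h(x) = \tfrac{1}{2}\|x\|_2^2$ and assume that the update stays in the interior of the feasible set, so that the prox-mapping reduces to $X_{n+1} = X_n - \gamma_n \hat g_n$. Under these assumptions the Bregman divergence is half the squared Euclidean distance, and the increase in expected divergence can be written out explicitly:

```latex
\begin{aligned}
D(X_n, X_{n+1}) &= \tfrac{1}{2}\|X_n - X_{n+1}\|_2^2 = \tfrac{1}{2}\gamma_n^2 \|\hat g_n\|_2^2, \\
\mathbb{E}[D(p, X_{n+1})] &= \mathbb{E}[D(p, X_n)]
  - \gamma_n\, \mathbb{E}[\langle g(X_n), X_n - p \rangle]
  + \tfrac{1}{2}\gamma_n^2\, \mathbb{E}\big[\|\hat g_n\|_2^2\big] \\
&= \mathbb{E}[D(p, X_n)] + \tfrac{1}{2}\gamma_n^2\, \mathbb{E}\big[\|\hat g_n\|_2^2\big]
  \;\geq\; \mathbb{E}[D(p, X_n)],
\end{aligned}
```

where the second equality uses null-coherence. In this special case the expected divergence grows by exactly half the squared step-size times the second moment of the gradient estimate.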

### 4.3 Proof of Theorem 3

###### Proof.

We first include a result from [1, Proposition B.4(b)], which will be used frequently in the proof of the theorem.

###### Lemma 3 ([1, Proposition B.4(b)]).

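Let $K$ be defined in (2) and let the prox-mapping $P$ be defined in (4). Fix some $x, x' \in \mathcal{X}$ and dual vectors $y_1, y_2$. Letting $x_1^+ = P_x(y_1)$ and $x_2^+ = P_x(y_2)$, we have

$$D(x', x_2^+) \leq D(x', x) + \langle y_2, x_1^+ - x' \rangle + \frac{1}{2K}\|y_1 - y_2\|_*^2 - \frac{K}{2}\|x_1^+ - x\|^2.$$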

We are now ready to prove the theorem. The proof contains three steps:

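1. Show that

   $$\mathbb{P}\Big(\lim_{n \to \infty}\|X_n - Y_n\| = 0\Big) = 1. \tag{18}$$

2. Let $\{Y_{n_k}\}_k$ denote a subsequence of $\{Y_n\}_n$. Show that

   $$\mathbb{P}\Big(\exists \{Y_{n_k}\}_k \ \text{s.t.}\ \lim_{k \to \infty}\inf_{x^* \in \mathcal{X}^*}\|Y_{n_k} - x^*\| = 0\Big) = 1. \tag{19}$$

3. Show that (13) holds.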

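Let $\{X_{n_k}\}_k$ denote a subsequence of $\{X_n\}_n$. We notice that (i) and (ii) imply that

$$\mathbb{P}\Big(\exists \{X_{n_k}\}_k \ \text{s.t.}\ \lim_{k \to \infty}\inf_{x^* \in \mathcal{X}^*}\|X_{n_k} - x^*\| = 0\Big) = 1. \tag{20}$$

To see this, denote the events considered in (18), (19), and (20) by $E_1$, $E_2$, and $E_3$, respectively. For any $\omega \in E_1 \cap E_2$, with $\{n_k\}$ the subsequence from (19),

$$\begin{aligned}
\inf_{x^* \in \mathcal{X}^*}\|X_{n_k}(\omega) - x^*\|
&\leq \inf_{x^* \in \mathcal{X}^*}\|X_{n_k}(\omega) - Y_{n_k}(\omega) + Y_{n_k}(\omega) - x^*\| \\
&\leq \|X_{n_k}(\omega) - Y_{n_k}(\omega)\| + \inf_{x^* \in \mathcal{X}^*}\|Y_{n_k}(\omega) - x^*\|.
\end{aligned}$$

Letting $k \to \infty$ on both sides of the resulting inequality, we have $\lim_{k \to \infty}\inf_{x^* \in \mathcal{X}^*}\|X_{n_k}(\omega) - x^*\| = 0$, thus $\omega \in E_3$. To recap, we have shown that $\omega \in E_1 \cap E_2$ implies $\omega \in E_3$, which implies $E_1 \cap E_2 \subseteq E_3$, and so $\mathbb{P}(E_3) \geq \mathbb{P}(E_1 \cap E_2) = 1$; (20) is proved. We will see later that (20) is used to prove (iii).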

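We now show (i). Let $p$ be one of the special saddle points that satisfy global (MVI). Define

$$U^+_{n+1} := \hat r_n - g(Y_n), \quad \xi^+_{n+1} := -\langle U^+_{n+1}, Y_n - p \rangle,$$

$$U_{n+1} := \hat g_n - g(X_n), \quad \xi_{n+1} := -\langle U_{n+1}, X_n - p \rangle.$$

Let $x' = p$ (a saddle point that satisfies Condition 2 in Definition 1), $x = X_n$, $y_1 = -\gamma_n \hat g_n$, $y_2 = -\gamma_n \hat r_n$, $x_1^+ = Y_n$, and $x_2^+ = X_{n+1}$ in Lemma 3; then

$$\begin{aligned}
D(p, X_{n+1}) &\leq D(p, X_n) - \gamma_n \langle \hat r_n, Y_n - p \rangle + \frac{\gamma_n^2}{2K}\|\hat r_n - \hat g_n\|_*^2 - \frac{K}{2}\|Y_n - X_n\|^2 \\
&\leq D(p, X_n) - \gamma_n \langle g(Y_n), Y_n - p \rangle - \gamma_n \langle U^+_{n+1}, Y_n - p \rangle + \frac{\gamma_n^2}{K}\|\hat r_n\|_*^2 + \frac{\gamma_n^2}{K}\|\hat g_n\|_*^2 - \frac{K}{2}\|Y_n - X_n\|^2 \\
&\leq D(p, X_n) + \gamma_n \xi^+_{n+1} + \frac{\gamma_n^2}{K}\|\hat r_n\|_*^2 + \frac{\gamma_n^2}{K}\|\hat g_n\|_*^2 - \frac{K}{2}\|Y_n - X_n\|^2,
\end{aligned}$$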

where the last inequality follows by (MVI). Telescoping the above, we have

 K2n∑k=1∥Yk−Xk∥2 ≤D(p,X1)−D(p,Xn+1)+n∑k=1γkξ+k+1+1Kn∑k=1γ2k∥^gk∥2∗+1Kn∑k=1γ2k∥^rk∥2∗ ≤D(p,X1)+n∑k