
# Interaction Matters: A Note on Non-asymptotic Local Convergence of Generative Adversarial Networks

Motivated by the pursuit of a systematic computational and algorithmic understanding of Generative Adversarial Networks (GANs), we present a simple yet unified non-asymptotic local convergence theory for smooth two-player games, which subsumes several discrete-time gradient-based saddle point dynamics. The analysis reveals the surprising nature of the off-diagonal interaction term as both a blessing and a curse. On the one hand, this interaction term explains the origin of the slow-down effect in the convergence of Simultaneous Gradient Ascent (SGA) to stable Nash equilibria. On the other hand, for unstable equilibria, exponential convergence can be proved thanks to the interaction term, for three modified dynamics which have been proposed to stabilize GAN training: Optimistic Mirror Descent (OMD), Consensus Optimization (CO) and Predictive Method (PM). The analysis uncovers the intimate connections among these stabilizing techniques, and provides a detailed characterization of the choice of learning rate.


## 1 Introduction

In this paper we consider the non-asymptotic local convergence and stability of discrete-time gradient-based optimization algorithms for solving smooth two-player zero-sum games of the form,

$$\min_{\theta \in \mathbb{R}^p} \max_{\omega \in \mathbb{R}^q} U(\theta, \omega). \tag{1.1}$$

The motivation behind our non-asymptotic analysis follows from the observation that Generative Adversarial Networks (GANs) lack a principled understanding at both the computational and the algorithmic level (we refer the reader to Huszár (2017)'s concise post on this issue). GAN optimization is a special case of (1.1), which has been developed for learning a complex and multi-modal probability distribution based on samples from $P_{\rm real}$ (over $X$), through learning a generator function $g_\theta$ that transforms the input distribution $P_{\rm input}$ (over $Z$) to match the target $P_{\rm real}$. Ignoring the parameter regularization, the value function corresponding to a GAN is of the form,

$$U(\theta, \omega) = \mathbb{E}_{X \sim P_{\rm real}}\, h_1(f_\omega(X)) - \mathbb{E}_{Z \sim P_{\rm input}}\, h_2(f_\omega(g_\theta(Z))), \tag{1.2}$$

where $\theta$ and $\omega$ parametrize the generator function $g_\theta$ and the discriminator function $f_\omega$, respectively. The original GAN (Goodfellow et al., 2014), for example, corresponds to choosing $h_1(t) = \log \sigma(t)$, $h_2(t) = -\log(1 - \sigma(t))$, where $\sigma$ is the sigmoid function; Wasserstein GAN (Arjovsky et al., 2017) considers $h_1(t) = h_2(t) = t$; $f$-GAN (Nowozin et al., 2016) proposes to use $h_1(t) = t$, $h_2(t) = f^*(t)$, where $f^*$ denotes the Fenchel dual of $f$. Recently, several attempts have been made to understand whether GANs learn the target distribution in the statistical sense (Liu et al., 2017; Arora and Zhang, 2017; Liang, 2017; Arora et al., 2017).

Optimization of GANs (and value functions of the form (1.1) at large) is hard, both in theory and in practice (Singh et al., 2000; Pfau and Vinyals, 2016; Salimans et al., 2016). Global optimization of a general value function with multiple saddle points is impractical and unstable, so we instead resort to the more modest problem of searching for a local saddle point such that no player has the incentive to deviate locally

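$$\begin{aligned} U(\theta_*, \omega_*) &\le U(\theta, \omega_*), && \text{for } \theta \text{ in an open neighborhood of } \theta_*, \\ U(\theta_*, \omega_*) &\ge U(\theta_*, \omega), && \text{for } \omega \text{ in an open neighborhood of } \omega_*. \end{aligned}$$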

For smooth value functions, the above conditions are equivalent to the following solution concept:

###### Definition 1 (Local Nash Equilibrium).

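$(\theta_*, \omega_*)$ is called a local Nash equilibrium if

1. $\nabla_\theta U(\theta_*, \omega_*) = 0$, $\nabla_\omega U(\theta_*, \omega_*) = 0$;

2. $\nabla_{\theta\theta} U(\theta_*, \omega_*) \succeq 0$, $-\nabla_{\omega\omega} U(\theta_*, \omega_*) \succeq 0$.

Here we use $\nabla_{\theta\omega} U$ to denote the off-diagonal term $\partial^2 U / \partial\theta\,\partial\omega^T$, and name it the interaction term throughout the paper. $\nabla_\theta U$ denotes the gradient $\partial U / \partial\theta$, and $\nabla_{\theta\theta} U$ the Hessian $\partial^2 U / \partial\theta\,\partial\theta^T$.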

In practice, discrete-time dynamical systems are employed to numerically approach the saddle points of $U$, as is the case in GANs (Goodfellow et al., 2014), and in primal-dual methods for non-linear optimization (Singh et al., 2000). The simplest possibility is Simultaneous Gradient Ascent (SGA), which corresponds to the following discrete-time dynamical system,

$$\begin{aligned} \theta_{t+1} &= \theta_t - \eta \nabla_\theta U(\theta_t, \omega_t), \\ \omega_{t+1} &= \omega_t + \eta \nabla_\omega U(\theta_t, \omega_t), \end{aligned} \tag{1.3}$$

where $\eta$ is the step size or learning rate. In the limit of vanishing step size, SGA approximates a continuous-time autonomous dynamical system, the asymptotic convergence of which has been established in Singh et al. (2000); Cherukuri et al. (2017); Nagarajan and Kolter (2017). In practice, however, it has been widely reported that the discrete-time SGA dynamics for GAN optimization suffers from instabilities due to the possibility of complex eigenvalues in the operator of the dynamical system (Salimans et al., 2016; Metz et al., 2016; Nagarajan and Kolter, 2017; Mescheder et al., 2017; Heusel et al., 2017). We believe room for improvement still exists in the current theory, which we hope will render it more informative in practice:

• Non-asymptotic convergence speed.  In practice, one is concerned with a finite step size $\eta$, which is typically subject to extensive hyperparameter tuning. Detailed characterizations of the convergence speed, and theoretical insights on the choice of learning rate, can be helpful.

• Unified simple analysis for modified saddle point dynamics.  Several attempts to fix GAN optimization have been put forth by independent researchers, which modify the dynamics (Mescheder et al., 2017; Daskalakis et al., 2017; Yadav et al., 2017) using very different insights. A unified analysis that reviews the deeper connections amongst these proposals helps to better understand the saddle point dynamics at large.

In this paper, we address the above points by studying the theory of non-asymptotic convergence of SGA and related discrete-time saddle point dynamics, namely, Optimistic Mirror Descent (OMD), Consensus Optimization (CO), and Predictive Method (PM). More concretely, we provide the following theoretical contributions about the crucial effect of the off-diagonal interaction term in two-player games:

• Stable case: curse of the interaction term.  Locally, SGA converges exponentially fast to a stable Nash equilibrium with a carefully chosen learning rate. This can be viewed as a generalization (rather than a special case) of the local convergence guarantee for single-player gradient descent for strongly-convex functions. In addition, we quantitatively isolate the slow-down in the convergence rate of two-player SGA compared to single-player gradient descent, due to the presence of the off-diagonal interaction term for the two-player game.

• Unstable case: blessing of the interaction term.  For unstable Nash equilibria, SGA diverges for any non-zero learning rate. We discover a unified non-asymptotic analysis that encompasses three proposed modified dynamics: OMD, CO, and PM. The analysis shows that all these algorithms, at a high level, share the same idea of utilizing the curvature introduced by the interaction term $\nabla_{\theta\omega} U$. Unlike the slow sub-linear rate of convergence experienced by single-player gradient descent for non-strongly convex functions (in fact, Nesterov (2013) constructed a convex function that is non-strongly convex, such that all first order methods suffer a slow sub-linear rate of convergence; in the optimization literature, linear rate refers to exponential convergence speed), OMD, CO and PM effectively exploit the interaction term to achieve exponential convergence to unstable Nash equilibria. The analysis also provides specific advice on the choice of learning rate for each procedure, albeit restricted to the simple case of bi-linear games.

The organization of the paper is as follows. In Section 2 we consider the (admittedly idealized) situation when locally, the value function satisfies strict convexity/concavity. We show non-asymptotic exponential convergence to Nash equilibria for SGA, and identify an optimized learning rate. To reveal and understand the new features of the modified dynamics, we study in Section 3 the minimal unstable bilinear game, showing that the proposed stabilizing techniques all achieve exponential convergence to unstable Nash equilibria. Finally, in Section 4 we take a step closer to the real world by numerically evaluating each of the proposed dynamical systems using value functions of GAN form (1.2), under objective evaluation metrics. Detailed proofs are deferred to Section 5.

## 2 Stable Case: Non-asymptotic Local Convergence

In this section we will establish the non-asymptotic convergence of SGA dynamics to saddle points that are stable local Nash equilibria. With a properly chosen learning rate, the local convergence can be intuitively pictured as cycling inwards to these saddle points, where the distance to the saddle point of interest is exponentially contracting. First, let’s introduce the notion of stable equilibrium.

###### Definition 2 (Stable Local Nash Equilibrium).

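$(\theta_*, \omega_*)$ is called a stable local Nash equilibrium if

1. $\nabla_\theta U(\theta_*, \omega_*) = 0$, $\nabla_\omega U(\theta_*, \omega_*) = 0$;

2. $\nabla_{\theta\theta} U(\theta_*, \omega_*) \succ 0$, $-\nabla_{\omega\omega} U(\theta_*, \omega_*) \succ 0$.

The above notion of stability is stronger than Definition 1, in the sense that $\nabla_{\theta\theta} U(\theta_*, \omega_*)$ and $-\nabla_{\omega\omega} U(\theta_*, \omega_*)$ have smallest eigenvalues bounded away from $0$.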

###### Assumption 1 (Local Strong Convexity-Concavity).

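Consider that $U$ is smooth and twice differentiable, and let $(\theta_*, \omega_*)$ be a stable local Nash equilibrium as in Definition 2. Assume that for some $r > 0$, there exists an open neighborhood near $(\theta_*, \omega_*)$ such that for all $(\theta, \omega) \in B_2((\theta_*, \omega_*), r)$, the following strong convexity-concavity condition holds,

$$\nabla_{\theta\theta} U(\theta, \omega) \succ 0, \qquad -\nabla_{\omega\omega} U(\theta, \omega) \succ 0.$$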

It will prove convenient to introduce some notation before introducing the main theorem. Let us define the following block-wise abbreviation for the matrix of second derivatives,

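$$\begin{bmatrix} \nabla_{\theta\theta} U(\theta, \omega) & \nabla_{\theta\omega} U(\theta, \omega) \\ -\nabla_{\omega\theta} U(\theta, \omega) & -\nabla_{\omega\omega} U(\theta, \omega) \end{bmatrix} := \begin{bmatrix} A_{\theta,\omega} & C_{\theta,\omega} \\ -C_{\theta,\omega}^T & B_{\theta,\omega} \end{bmatrix}, \tag{2.1}$$

and define $\alpha, \beta$ as

$$\begin{aligned} \alpha &:= \min_{(\theta,\omega) \in B_2((\theta_*,\omega_*),r)} \lambda_{\min}\left( \begin{bmatrix} A_{\theta,\omega}^2 & 0 \\ 0 & B_{\theta,\omega}^2 \end{bmatrix} \right), \\ \beta &:= \max_{(\theta,\omega) \in B_2((\theta_*,\omega_*),r)} \lambda_{\max}\left( \begin{bmatrix} A_{\theta,\omega}^2 + C_{\theta,\omega} C_{\theta,\omega}^T & -A_{\theta,\omega} C_{\theta,\omega} + C_{\theta,\omega} B_{\theta,\omega} \\ -C_{\theta,\omega}^T A_{\theta,\omega} + B_{\theta,\omega} C_{\theta,\omega}^T & B_{\theta,\omega}^2 + C_{\theta,\omega}^T C_{\theta,\omega} \end{bmatrix} \right), \end{aligned} \tag{2.2}$$

where $\lambda_{\max}(M), \lambda_{\min}(M)$ denote the largest and smallest eigenvalue of a matrix $M$.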

###### Theorem 2.1 (Exponential Convergence: SGA).

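Consider that $U$ satisfies Assumption 1 for some radius $r$ near a stable local Nash equilibrium $(\theta_*, \omega_*)$ as in Definition 2. Consider an initialization satisfying $(\theta_0, \omega_0) \in B_2((\theta_*, \omega_*), r)$. Then the SGA dynamics (1.3) with fixed learning rate

$$\eta = \frac{\sqrt{\alpha}}{\beta},$$

($\alpha, \beta$ defined in Eqn. (2.2)) obtains an $\epsilon$-minimizer such that $(\theta_T, \omega_T) \in B_2((\theta_*, \omega_*), \epsilon)$, as long as

$$T \ge T_{\rm SGA} := \left\lceil 2\, \frac{\beta}{\alpha} \log \frac{r}{\epsilon} \right\rceil.$$

###### Remark 2.1.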

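It is interesting to compare the convergence speed of the saddle point dynamics to conventional gradient descent in one variable, for a strongly-convex function. We remind the reader that to obtain an $\epsilon$-minimizer for a strongly-convex function, one needs the following number of iterations of gradient descent,

$$T_{\rm GD} := \max\left\{ \frac{\lambda_{\max}(A_{\theta,\omega})}{\lambda_{\min}(A_{\theta,\omega})} \log \frac{r}{\epsilon}, \ \ \frac{\lambda_{\max}(B_{\theta,\omega})}{\lambda_{\min}(B_{\theta,\omega})} \log \frac{r}{\epsilon} \right\},$$

depending on whether we are optimizing with respect to $\theta$ or $\omega$, respectively. It is now evident that due to the presence of $C_{\theta,\omega}$, the convergence of two-player SGA to a saddle point can be significantly slower than the convergence of single-player gradient descent. Note the following inequality,

$$\lambda_{\max}\left( \begin{bmatrix} A_{\theta,\omega}^2 + C_{\theta,\omega} C_{\theta,\omega}^T & -A_{\theta,\omega} C_{\theta,\omega} + C_{\theta,\omega} B_{\theta,\omega} \\ -C_{\theta,\omega}^T A_{\theta,\omega} + B_{\theta,\omega} C_{\theta,\omega}^T & B_{\theta,\omega}^2 + C_{\theta,\omega}^T C_{\theta,\omega} \end{bmatrix} \right) \ge \lambda_{\max}\left( \begin{bmatrix} A_{\theta,\omega}^2 & 0 \\ 0 & B_{\theta,\omega}^2 \end{bmatrix} \right),$$

which follows from the fact that the LHS is lower-bounded by the largest eigenvalue of each diagonal block,

$$\lambda_{\max}(A_{\theta,\omega}^2 + C_{\theta,\omega} C_{\theta,\omega}^T) \ge \lambda_{\max}(A_{\theta,\omega}^2),$$

and symmetrically for the $B_{\theta,\omega}$ block. Therefore the convergence of SGA is slower than that of conventional GD,

$$T_{\rm SGA} \ge T_{\rm GD}.$$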

We would like to emphasize that for the saddle point convergence, the slow-down effect of the interaction term is explicit in our non-asymptotic analysis.
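As a rough numerical illustration of this remark, the sketch below evaluates the two iteration counts for a quadratic game with constant blocks $A$, $B$, $C$ (so the min/max over the ball in (2.2) is trivial). The matrices, the helper name `iteration_counts`, and the values of $r$ and $\epsilon$ are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def iteration_counts(A, B, C, r=1.0, eps=1e-6):
    """Return (T_SGA, T_GD), without the ceiling, for constant blocks A, B, C."""
    J = np.block([[A, C], [-C.T, B]])
    eig_A, eig_B = np.linalg.eigvalsh(A), np.linalg.eigvalsh(B)
    # alpha: smallest eigenvalue of diag(A^2, B^2), valid because A, B are PD.
    alpha = min(eig_A.min(), eig_B.min()) ** 2
    # beta: largest eigenvalue of the matrix in (2.2), which equals J J^T.
    beta = np.linalg.eigvalsh(J @ J.T).max()
    log_term = np.log(r / eps)
    t_sga = 2.0 * beta / alpha * log_term
    t_gd = max(eig_A.max() / eig_A.min(), eig_B.max() / eig_B.min()) * log_term
    return t_sga, t_gd

# Hypothetical positive-definite blocks and interaction matrix.
A = np.array([[2.0, 0.3, 0.0], [0.3, 1.0, 0.1], [0.0, 0.1, 1.5]])
B = np.array([[1.5, -0.2, 0.0], [-0.2, 0.8, 0.0], [0.0, 0.0, 1.0]])
C0 = np.array([[1.0, 2.0, 0.0], [0.0, 1.0, -1.0], [1.0, 0.0, 1.0]])

# Scale the interaction term: 0 (no interaction), 1, and 4.
counts = [iteration_counts(A, B, s * C0) for s in (0.0, 1.0, 4.0)]
for s, (t_sga, t_gd) in zip((0.0, 1.0, 4.0), counts):
    print(f"scale={s}: T_SGA ~ {t_sga:.1f}, T_GD ~ {t_gd:.1f}")
```

In this sketch $T_{\rm GD}$ does not depend on $C$, while the bound $T_{\rm SGA}$ grows once the interaction term is scaled up.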

The intuition that the discrete-time SGA dynamics cycles inward to a stable Nash equilibrium exponentially fast can be seen in the following way. The presence of the off-diagonal anti-symmetric component in Eqn. (2.1) means that the associated linear operator of the discrete-time dynamics has complex eigenvalues, which results in periodic cycling behavior. However, due to the explicit choice of $\eta$, the distance to the stable Nash equilibrium is shrinking exponentially fast. The local exponential stability in the infinitesimal/asymptotic case when $\eta \to 0$ has already been studied in Nagarajan and Kolter (2017) (Theorem 3.1 therein) by showing that the Jacobian matrix of a particular form of GAN objective is Hurwitz (all eigenvalues have strictly negative real parts). There are two distinct differences in our result: (1) we provide non-asymptotic convergence, with specific guidance on the choice of learning rate $\eta$; (2) our analysis goes through the singular values (which are rather different from the moduli of the eigenvalues for a general matrix), instead of involving the complex eigenvalues, and this simple technique generalizes to three other modified saddle point dynamics which we discuss in the next section.
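The contraction claimed in Theorem 2.1 can be checked on a toy problem. The following is a minimal sketch, assuming a purely quadratic game $U = \tfrac12\theta^T A\theta - \tfrac12\omega^T B\omega + \theta^T C\omega$ with equilibrium at the origin, so that the blocks in (2.1) are constant; the specific matrices, the starting point and the function name `sga_quadratic` are hypothetical.

```python
import numpy as np

def sga_quadratic(A, B, C, z0, eta, steps):
    """Run SGA (1.3) on the quadratic game; return distances to the origin."""
    J = np.block([[A, C], [-C.T, B]])          # constant matrix from (2.1)
    step = np.eye(J.shape[0]) - eta * J        # one SGA step is z -> (I - eta J) z
    z = z0.copy()
    dists = [np.linalg.norm(z)]
    for _ in range(steps):
        z = step @ z
        dists.append(np.linalg.norm(z))
    return np.array(dists)

# Hypothetical blocks: A and B positive definite, C the interaction term.
A = np.array([[2.0, 0.3], [0.3, 1.0]])
B = np.array([[1.5, -0.2], [-0.2, 0.8]])
C = np.array([[1.0, -2.0], [0.5, 1.0]])
J = np.block([[A, C], [-C.T, B]])

alpha = min(np.linalg.eigvalsh(A).min(), np.linalg.eigvalsh(B).min()) ** 2
beta = np.linalg.norm(J, 2) ** 2               # lambda_max(J J^T) = sigma_max(J)^2
eta = np.sqrt(alpha) / beta                    # learning rate of Theorem 2.1
rate = np.sqrt(1.0 - alpha / beta)             # per-step contraction factor

r, eps = 1.0, 1e-4
T = int(np.ceil(2.0 * beta / alpha * np.log(r / eps)))
z0 = np.array([0.6, -0.5, 0.4, 0.3])
z0 *= r / np.linalg.norm(z0)                   # start on the sphere of radius r
dists = sga_quadratic(A, B, C, z0, eta, T)
print(f"eta={eta:.4f}, rate={rate:.4f}, T={T}, final distance={dists[-1]:.2e}")
```

Every step shrinks the distance by at least the factor $\sqrt{1 - \alpha/\beta}$, even though the iterates rotate around the equilibrium because $I - \eta J$ is not symmetric.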

## 3 Unstable Case: Local Bi-Linear Problem

Oscillation and instability for SGA occur when the problem is non-strongly convex-concave, as in the bi-linear game (or more precisely, at least linear in one player). This observation was first pointed out using a very simple linear game in Salimans et al. (2016). More generally, as a result of Theorem 2.1, this phenomenon occurs when the local Nash equilibrium is non-stable,

$$\lambda_{\min}\left( \begin{bmatrix} A_{\theta_*,\omega_*} & 0 \\ 0 & B_{\theta_*,\omega_*} \end{bmatrix} \right) \approx 0 \iff \begin{bmatrix} A_{\theta_*,\omega_*} & C_{\theta_*,\omega_*} \\ -C_{\theta_*,\omega_*}^T & B_{\theta_*,\omega_*} \end{bmatrix} \approx \begin{bmatrix} 0 & C_{\theta_*,\omega_*} \\ -C_{\theta_*,\omega_*}^T & 0 \end{bmatrix}.$$

Let's consider an extreme case when $A_{\theta_*,\omega_*} = 0$ and $B_{\theta_*,\omega_*} = 0$. In this case, we will show that: (1) an improved analysis for Optimistic Mirror Descent (OMD) introduced in Daskalakis et al. (2017), (2) a modified version of Predictive Methods (PM) motivated by Yadav et al. (2017) and (3) Consensus Optimization (CO) introduced in Mescheder et al. (2017), can fix the oscillation problem and provide exponential convergence to unstable Nash equilibria, using a novel unified non-asymptotic analysis. Our analysis shows that these stabilizing techniques, at a high level, all manipulate the dynamics to utilize the curvature generated by the interaction term, which we refer to as the “blessing” of the interaction term, to contrast with the “slow-down effect” of the interaction term in the strongly convex-concave case. Once again, as alluded to in the introduction, this fast linear rate convergence result (in the non-strongly convex-concave two-player game) should be compared to the significantly slower sub-linear convergence rate for all first-order methods in convex but non-strongly convex optimization (one player). The latter was proved by a lower bound argument in (Nesterov, 2013, Theorem 2.1.7). We are ready to state informally the main result proved in this section.

###### Theorem 3.1 (Informal: Unstable Case).

All three modified dynamics (OMD, PM and CO) enjoy a last-iterate exponential convergence guarantee in the bi-linear game.

We will start with the minimal unstable bilinear game, explaining why oscillation can happen when the game is non-strongly convex-concave, and revealing the new features of the modified dynamics. This bilinear game can be motivated by considering the lowest-order truncation of the Taylor expansion of a general smooth two-player game around a Nash equilibrium (taken to be the origin, $(\theta_*, \omega_*) = (0, 0)$),

$$U(\theta, \omega) = \tfrac{1}{2} \theta^T A \theta + \tfrac{1}{2} \omega^T B \omega + \theta^T C \omega + \text{higher order terms} \approx \theta^T C \omega, \tag{3.1}$$

assuming that $A \approx 0$ and $B \approx 0$. Now consider the simple bi-linear game $U(\theta, \omega) = \theta^T C \omega$. With the SGA dynamics defined in (1.3), one can easily verify that

$$\|\theta_{t+1}\|^2 \ge (1 + \eta^2 \lambda_{\min}(CC^T)) \|\theta_t\|^2, \qquad \|\omega_{t+1}\|^2 \ge (1 + \eta^2 \lambda_{\min}(C^TC)) \|\omega_t\|^2.$$

Therefore, the continuous limit is cycling around a sphere, while with any practical learning rate $\eta$, the distance to the Nash equilibrium can be increasing exponentially instead of converging. Per Theorem 2.1 and the discussion above, instability for SGA only occurs when the local game is approximately bi-linear. From now on, we will focus on the simplest unstable form of the game, the bi-linear game, to isolate the main idea behind fixing the unstable problem.
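The divergence of SGA on the bi-linear game is easy to observe numerically. The sketch below tracks the joint squared norm $\|\theta_t\|^2 + \|\omega_t\|^2$, for which the exact identity $\|z_{t+1}\|^2 = \|z_t\|^2 + \eta^2 (\|C\omega_t\|^2 + \|C^T\theta_t\|^2)$ holds because the SGA operator is skew-symmetric; this joint form is our choice for the illustration. The matrix $C$, the step size and the starting point are arbitrary assumptions.

```python
import numpy as np

# Hypothetical square, full-rank interaction matrix for U = theta^T C omega.
C = np.array([[2.0, 0.5], [0.0, 1.0]])
zeros = np.zeros((2, 2))
M = np.block([[zeros, C], [-C.T, zeros]])   # SGA reads z_{t+1} = (I - eta M) z_t
eta = 0.1
lam_min = np.linalg.eigvalsh(C @ C.T).min()

z = np.array([1.0, 0.0, 0.0, 1.0])          # stacked (theta_0, omega_0)
sq_norms = [z @ z]
for _ in range(200):
    z = z - eta * (M @ z)
    sq_norms.append(z @ z)
sq_norms = np.array(sq_norms)

# Per-step growth factor of the joint squared distance to the equilibrium.
growth = sq_norms[1:] / sq_norms[:-1]
print(f"min growth factor: {growth.min():.5f}")
print(f"lower bound 1 + eta^2 * lambda_min(CC^T): {1 + eta**2 * lam_min:.5f}")
```

Shrinking `eta` slows the growth but never removes it, which matches the statement that only the continuous-time limit cycles on a sphere.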

### 3.1 (Improved) Optimistic Mirror Descent

Daskalakis et al. (2017) employed Optimistic Mirror Descent (OMD), motivated by online learning, to solve the instability problem in GANs. Here we provide a stronger result, showing that the last iterate of OMD enjoys exponential convergence for bi-linear games. Although the last-iterate convergence of this OMD procedure was already rigorously proved in Daskalakis et al. (2017), the exponential convergence was not known, to the best of our knowledge. Roughly speaking, for an initialization within distance $r$ of the equilibrium, (Daskalakis et al., 2017, Theorem 1) gives a last-iterate guarantee that holds for a small enough learning rate (under some additional assumptions), with convergence only in the limit of vanishing step size $\eta \to 0$. In contrast, we will prove that after $T$ iterations, with a properly chosen fixed step size $\eta$, the distance of the last iterate of OMD to the equilibrium has contracted exponentially in $T$ (Lemma 3.1). This improved theory also explains the exponential convergence found in simulations.

###### Lemma 3.1 (Exponential Convergence: OMD).

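Consider a bi-linear game $U(\theta, \omega) = \theta^T C \omega$. Assume $p = q$ and $C$ is full rank. Then the OMD dynamics,

$$\begin{aligned} \theta_{t+1} &= \theta_t - 2\eta \nabla_\theta U(\theta_t, \omega_t) + \eta \nabla_\theta U(\theta_{t-1}, \omega_{t-1}), \\ \omega_{t+1} &= \omega_t + 2\eta \nabla_\omega U(\theta_t, \omega_t) - \eta \nabla_\omega U(\theta_{t-1}, \omega_{t-1}), \end{aligned} \tag{3.2}$$

with the learning rate

$$\eta = \frac{1}{2\sqrt{\lambda_{\max}(CC^T)}},$$

obtains an $\epsilon$-minimizer such that $(\theta_T, \omega_T) \in B_2(0, \epsilon)$, as long as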

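$$T \ge T_{\rm OMD} := \left\lceil \left( 8\, \frac{\lambda_{\max}(CC^T)}{\lambda_{\min}(CC^T)} + 4 \right) \log \frac{4r \left( 1 + \frac{\lambda_{\min}(CC^T)}{\lambda_{\max}(CC^T)} \right)}{\epsilon} \right\rceil,$$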

under the assumption that .
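A small simulation illustrates the last-iterate behavior of OMD on a bi-linear game. This is a sketch under assumptions: the matrix $C$, the starting point, the number of steps, and the convention of starting the recursion with $(\theta_{-1}, \omega_{-1}) = (\theta_0, \omega_0)$ are our choices; the learning rate is $1/(2\sqrt{\lambda_{\max}(CC^T)})$ as read from the lemma.

```python
import numpy as np

def omd_bilinear(C, theta0, omega0, eta, steps):
    """OMD (3.2) on U = theta^T C omega; returns distances to the equilibrium 0."""
    def grad(th, om):
        return C @ om, C.T @ th                  # (grad_theta U, grad_omega U)

    th_prev, om_prev = theta0.copy(), omega0.copy()   # assumed z_{-1} = z_0
    th, om = theta0.copy(), omega0.copy()
    dists = [np.hypot(np.linalg.norm(th), np.linalg.norm(om))]
    for _ in range(steps):
        g_th, g_om = grad(th, om)                # gradient at the current iterate
        p_th, p_om = grad(th_prev, om_prev)      # gradient at the previous iterate
        th_prev, om_prev = th, om
        th = th_prev - 2 * eta * g_th + eta * p_th
        om = om_prev + 2 * eta * g_om - eta * p_om
        dists.append(np.hypot(np.linalg.norm(th), np.linalg.norm(om)))
    return np.array(dists)

# Hypothetical square, full-rank interaction matrix.
C = np.array([[2.0, 0.5], [0.0, 1.0]])
eta = 1.0 / (2.0 * np.sqrt(np.linalg.eigvalsh(C @ C.T).max()))
dists = omd_bilinear(C, np.array([1.0, 0.0]), np.array([0.0, 1.0]), eta, 1000)
print(f"eta={eta:.4f}, initial distance={dists[0]:.3f}, final={dists[-1]:.2e}")
```

Plotting `np.log(dists)` against the iteration index gives an approximately straight line with negative slope, the signature of exponential (linear-rate) convergence.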

### 3.2 (Modified) Predictive Methods

Starting from a very different motivation rooted in ODEs, Yadav et al. (2017) proposed Predictive Methods (PM) to fix the instability problem. The intuition is to evaluate the gradient at a predicted future location, then perform the update. In this section, we propose and analyze a modified version of the predictive method (for simultaneous gradient updates), inspired by Yadav et al. (2017).

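Consider the following modified PM dynamics

$$\begin{aligned} \text{predictive step:} \quad \theta_{t+1/2} &= \theta_t - \gamma \nabla_\theta U(\theta_t, \omega_t), \\ \omega_{t+1/2} &= \omega_t + \gamma \nabla_\omega U(\theta_t, \omega_t); \\ \text{gradient step:} \quad \theta_{t+1} &= \theta_t - \eta \nabla_\theta U(\theta_{t+1/2}, \omega_{t+1/2}), \\ \omega_{t+1} &= \omega_t + \eta \nabla_\omega U(\theta_{t+1/2}, \omega_{t+1/2}). \end{aligned} \tag{3.3}$$

###### Lemma 3.2 (Exponential Convergence: PM).

Consider a bi-linear game $U(\theta, \omega) = \theta^T C \omega$. Assume $p = q$ and $C$ is full rank. Fix some $\gamma > 0$. Then the PM dynamics in Eqn. (3.3) with learning rate

$$\eta = \frac{\gamma \lambda_{\min}(CC^T)}{\lambda_{\max}(CC^T) + \gamma^2 \lambda_{\max}^2(CC^T)},$$

obtains an $\epsilon$-minimizer such that $(\theta_T, \omega_T) \in B_2(0, \epsilon)$, as long as

$$T \ge T_{\rm PM} := \left\lceil 2\, \frac{\gamma^2 \lambda_{\max}^2(CC^T) + \lambda_{\max}(CC^T)}{\gamma^2 \lambda_{\min}^2(CC^T)} \log \frac{r}{\epsilon} \right\rceil,$$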

under the assumption that .
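The learning rate and iteration count of Lemma 3.2 can be tried out directly. The following sketch assumes a specific matrix $C$, $\gamma = 0.5$, and an initial point at distance $r$ from the origin; these values and the helper name `pm_step` are illustrative. The per-step factor `rate` is what results from substituting the lemma's $\eta$ into the squared singular values $1 - 2\eta\gamma\lambda + \eta^2(\gamma^2\lambda^2 + \lambda)$ of the one-step operator.

```python
import numpy as np

def pm_step(C, th, om, gamma, eta):
    """One step of the modified predictive method (3.3) on U = theta^T C omega."""
    th_half = th - gamma * (C @ om)          # predictive step
    om_half = om + gamma * (C.T @ th)
    # Gradient step, evaluated at the predicted point.
    return th - eta * (C @ om_half), om + eta * (C.T @ th_half)

C = np.array([[2.0, 0.5], [0.0, 1.0]])       # hypothetical full-rank matrix
lam = np.linalg.eigvalsh(C @ C.T)
lam_min, lam_max = lam.min(), lam.max()
gamma = 0.5
denom = lam_max + gamma**2 * lam_max**2
eta = gamma * lam_min / denom                # learning rate of Lemma 3.2
rate = np.sqrt(1.0 - gamma**2 * lam_min**2 / denom)

r, eps = 1.0, 1e-4
T_pm = int(np.ceil(2.0 * denom / (gamma**2 * lam_min**2) * np.log(r / eps)))

th, om = np.array([0.6, 0.0]), np.array([0.0, 0.8])   # distance r = 1 from 0
dists = [np.hypot(np.linalg.norm(th), np.linalg.norm(om))]
for _ in range(T_pm):
    th, om = pm_step(C, th, om, gamma, eta)
    dists.append(np.hypot(np.linalg.norm(th), np.linalg.norm(om)))
dists = np.array(dists)
print(f"eta={eta:.4f}, rate={rate:.4f}, T_PM={T_pm}, final={dists[-1]:.2e}")
```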

### 3.3 Consensus Optimization

Consensus Optimization (CO) is another elegant attempt to fix the aforementioned problem, proposed in Mescheder et al. (2017). The authors' motivation is to add an extra potential vector field (consensus part) on top of the current curl vector field (for the SGA of the bi-linear problem), in order to attract the dynamics to the critical points.

Mescheder et al. (2017); Nagarajan and Kolter (2017) analyzed the infinitesimal flow version of consensus optimization, and intuitively showed that it pushes the real part of the eigenvalues away from $0$, to ensure asymptotic convergence. In this section, we provide a simple convergence analysis of the discretized dynamics, of the same flavor as the previous section. An upshot of the analysis is that it sheds light on the possible choice of learning rate.

Recall that the regularization term defining consensus optimization is given by,

$$R(\theta, \omega) = \left( \|\nabla_\theta U(\theta, \omega)\|^2 + \|\nabla_\omega U(\theta, \omega)\|^2 \right) / 2. \tag{3.4}$$

We are ready to state the Lemma. Surprisingly, we find that the consensus optimization coincides with the modified predictive method for the bi-linear game.

###### Lemma 3.3 (Exponential Convergence: CO).

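Consider a bi-linear game $U(\theta, \omega) = \theta^T C \omega$. Assume $p = q$ and $C$ is full rank. Recall $R(\theta, \omega)$ defined in Eqn. (3.4), and fix some $\gamma > 0$. Then the CO dynamics with the same learning rate $\eta$ as in Lemma 3.2,

$$\begin{aligned} \theta_{t+1} &= \theta_t - \eta \left[ \nabla_\theta U(\theta_t, \omega_t) + \gamma \nabla_\theta R(\theta_t, \omega_t) \right], \\ \omega_{t+1} &= \omega_t + \eta \left[ \nabla_\omega U(\theta_t, \omega_t) - \gamma \nabla_\omega R(\theta_t, \omega_t) \right], \end{aligned} \tag{3.5}$$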

converges exponentially fast in the same way as the PM dynamics in Lemma 3.2.
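The coincidence of CO and the modified PM on a bi-linear game can be seen by writing out both updates: for $U = \theta^T C\omega$ one has $\nabla_\theta R = CC^T\theta$ and $\nabla_\omega R = C^TC\omega$, so both methods apply the same linear map. The sketch below checks this numerically; the matrix $C$, the values of $\gamma$ and $\eta$, and the function names are assumptions made for illustration.

```python
import numpy as np

def co_step(C, th, om, gamma, eta):
    """One consensus-optimization step (3.5) on U = theta^T C omega."""
    grad_th_U, grad_om_U = C @ om, C.T @ th
    # R = (||C omega||^2 + ||C^T theta||^2) / 2 for the bi-linear game.
    grad_th_R, grad_om_R = C @ (C.T @ th), C.T @ (C @ om)
    th_new = th - eta * (grad_th_U + gamma * grad_th_R)
    om_new = om + eta * (grad_om_U - gamma * grad_om_R)
    return th_new, om_new

def pm_step(C, th, om, gamma, eta):
    """One modified predictive-method step (3.3) on the same game."""
    th_half = th - gamma * (C @ om)
    om_half = om + gamma * (C.T @ th)
    return th - eta * (C @ om_half), om + eta * (C.T @ th_half)

C = np.array([[2.0, 0.5], [0.0, 1.0]])      # hypothetical interaction matrix
gamma, eta = 0.5, 0.05                      # arbitrary illustrative values
co = (np.array([1.0, -1.0]), np.array([0.5, 2.0]))
pm = (co[0].copy(), co[1].copy())
max_gap = 0.0
for _ in range(50):
    co = co_step(C, *co, gamma, eta)
    pm = pm_step(C, *pm, gamma, eta)
    gap = max(np.abs(co[0] - pm[0]).max(), np.abs(co[1] - pm[1]).max())
    max_gap = max(max_gap, gap)
print(f"largest difference between CO and PM iterates: {max_gap:.2e}")
```

The difference stays at the level of floating-point round-off, so on this game the two trajectories are the same up to numerical precision.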

## 4 Experiments

Lucic et al. (2017) recently conducted a large-scale study of GANs and found that improvements from algorithmic changes mostly disappeared after taking into account hyper-parameter tuning and randomness of initialization. They conclude that “future GAN research should be based on more systematic and objective evaluation procedures.” Inspired by this conclusion, we conduct a systematic evaluation of the proposed optimization algorithms on two basic density learning problems, and introduce corresponding objective evaluation metrics. The goal of this analysis is not to achieve state-of-art performance, but rather to compare and contrast the existing proposals in a carefully controlled learning environment. We focus on the Wasserstein GAN formulation so that the value function is given by

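$$U(\theta, \omega) = \mathbb{E}_{X \sim P_{\rm real}}\, f_\omega(X) - \mathbb{E}_{Z \sim P_{\rm input}}\, f_\omega(g_\theta(Z)), \tag{4.1}$$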

where is a multi-layer neural network with hidden layers and rectifier non-linearities and the input distribution was chosen to be -dimensional standard Gaussian noise. Following Gulrajani et al. (2017), we impose the Lipschitz-1 constraint on the discriminator network using the two-sided gradient penalty term introduced in (Gulrajani et al., 2017, Eqn. (3)). The consensus optimization loss is defined with respect to the value function as in (3.4) without including gradient penalty. The combined loss functions of the discriminator and generator are, respectively,

$$L_{\rm dis}(\theta, \omega) = -U(\theta, \omega) + \gamma R(\theta, \omega) + \lambda \Lambda(\omega), \tag{4.2}$$

$$L_{\rm gen}(\theta, \omega) = U(\theta, \omega) + \gamma R(\theta, \omega). \tag{4.3}$$

The coefficients of the gradient penalty and consensus optimization terms were determined by a coarse parameter search and then locked to throughout. In order to make close contact with our theoretical formalism, we optimize the above loss functions on an alternating schedule using vanilla gradient descent with fixed learning rate of . To ensure reproducibility, all algorithms were independently implemented on top of the TFGAN and Keras frameworks.

### 4.1 Learning Covariance of Multivariate Gaussian

Consider the problem of learning the covariance matrix of a -dimensional multivariate Gaussian distribution with non-degenerate covariance . Note that the learning problem is well-specified if we choose the generator function $g_\theta$ to be a simple linear transformation of the -dimensional latent space ( ). Although the GAN approach is clearly overkill for this simple density estimation problem, we find this example illuminating because it affords some analytical tractability for the otherwise intractable general GAN value function (1.2). Specifically, if we choose the discriminator function $f_\omega$ to be a neural network with a hidden layer consisting of $H$ hidden units with rectifier nonlinearities, and set biases to zero, then the explicit functional forms of discriminator and generator are respectively,

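$$f_\omega(x) = \sum_{i=1}^{H} v_i \max(0, w_i^T x), \qquad g_\theta(z) = V z, \tag{4.4}$$

where $\omega = (v_i, w_i)_{i=1}^{H}$ and $\theta = V$ are the discriminator and generator parameters, respectively. If, moreover, we express the covariance matrix as $AA^T$, then the value function can be expressed in closed form as,

$$U(\theta, \omega) = \text{const} \times \sum_{i=1}^{H} v_i \left[ \|A^T w_i\| - \|V^T w_i\| \right]. \tag{4.5}$$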

The above analytical form of the value function sheds some light on the nature of the local Nash equilibrium solution concept. In particular, if one solves for the condition of being a Nash equilibrium, one does not conclude that . The result depends on the rank of the matrix .
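The closed form (4.5) can be sanity-checked by Monte Carlo. The sketch below assumes a one-hidden-layer ReLU discriminator $f_\omega(x) = \sum_i v_i \max(0, w_i^T x)$ without biases, real data $x = Az$ and generated data $x = Vz$ with $z$ standard Gaussian, and the Wasserstein-type value function; under these assumptions the constant is $1/\sqrt{2\pi}$, since $\mathbb{E}\max(0, s) = \sigma/\sqrt{2\pi}$ for $s \sim N(0, \sigma^2)$. All matrices, the sample size and the function names are hypothetical.

```python
import numpy as np

def value_closed_form(A, V, W, v):
    """Closed form (4.5) with const = 1/sqrt(2*pi); row i of W is w_i."""
    real_norms = np.linalg.norm(W @ A, axis=1)   # ||A^T w_i|| for each i
    gen_norms = np.linalg.norm(W @ V, axis=1)    # ||V^T w_i|| for each i
    return float(v @ (real_norms - gen_norms)) / np.sqrt(2.0 * np.pi)

def value_monte_carlo(A, V, W, v, n, rng):
    """Sample estimate of E f(Az) - E f(Vz') with a ReLU discriminator."""
    def f(x):
        return np.maximum(x @ W.T, 0.0) @ v      # one hidden layer, no biases
    x_real = rng.standard_normal((n, A.shape[1])) @ A.T
    x_gen = rng.standard_normal((n, V.shape[1])) @ V.T
    return float(f(x_real).mean() - f(x_gen).mean())

A = np.array([[1.0, 0.5], [0.0, 1.0]])           # target covariance is A A^T
V = np.array([[0.5, 0.0], [0.2, 0.8]])           # linear generator weights
W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]])
v = np.array([1.0, -0.5, 0.7])

rng = np.random.default_rng(0)
exact = value_closed_form(A, V, W, v)
estimate = value_monte_carlo(A, V, W, v, 400_000, rng)
print(f"closed form: {exact:.4f}, Monte Carlo: {estimate:.4f}")
```

Note that the closed form depends on $V$ only through the norms $\|V^T w_i\|$, which is one way to see why the equilibrium conditions depend on the directions $w_i$ available to the discriminator.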

The evaluation of different optimization algorithms involved comparing the target density and the analytical generator density after training iterations (Fig. 1). For simplicity, we chose the evaluation metric to be the Frobenius norm of the difference between the covariance matrices . The covariance learning experiments were conducted in the well-specified and over-parametrized regime (, ) using hidden units for the discriminator network. We also performed a head-to-head comparison of simultaneous and alternating update methods on this distribution and found negligible difference (Fig. 2).

### 4.2 Mixture of Gaussians

In practical applications, GANs are typically trained using the empirical distribution of the samples, where the samples are drawn from an idealized multi-modal probability distribution. To capture the notion of a multi-modal data distribution, we focus on a mixture of 8 Gaussians with means located at the vertices of a regular octagon inscribed in the unit circle, where each component has a fixed diagonal covariance of width . In contrast to previous visual-based evaluations, we estimate the Wasserstein-1 distance between the target density and the distribution of the random variable implied by the trained generator network. The estimate is obtained by solving a linear program which computes the exact Wasserstein-1 distance between the sample estimates and , respectively, and approaches the population version as the number of samples tends to infinity.

The experiments with the mixture of Gaussians used 2 dimensional Gaussian as input (). Both the generator and discriminator networks consisted of hidden layers with units per hidden layer. The estimate of the Wasserstein-1 distance was calculated using a sample size of after training for iterations. It is clear from Fig. 3 that the Wasserstein-1 distance correlates closely with the visual fit to the target distribution. The empirical evaluation (Fig. 1) shows that the exponential separation between consensus optimization and competing algorithms disappears on the mixture distribution, suggesting that the qualitative ranking is not robust to the choice of loss landscape. These findings demand deeper understanding of the global structure of the landscape, which is not captured by our local stability analysis.

## 5 Technical Proofs

###### Proof of Theorem 2.1.

Define the line interpolation between two points,

$$\theta(x) = x \theta_t + (1 - x) \theta_*, \qquad \omega(x) = x \omega_t + (1 - x) \omega_*.$$

Then the SGA dynamics can be written as

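$$\begin{aligned} \begin{bmatrix} \theta_{t+1} - \theta_* \\ \omega_{t+1} - \omega_* \end{bmatrix} &= \begin{bmatrix} \theta_t - \theta_* \\ \omega_t - \omega_* \end{bmatrix} - \eta \begin{bmatrix} \nabla_\theta U(\theta_t, \omega_t) \\ -\nabla_\omega U(\theta_t, \omega_t) \end{bmatrix}, \\ &= \begin{bmatrix} \theta_t - \theta_* \\ \omega_t - \omega_* \end{bmatrix} - \eta \int_0^1 \begin{bmatrix} \nabla_{\theta\theta} U(\theta(x), \omega(x)) & \nabla_{\theta\omega} U(\theta(x), \omega(x)) \\ -\nabla_{\omega\theta} U(\theta(x), \omega(x)) & -\nabla_{\omega\omega} U(\theta(x), \omega(x)) \end{bmatrix} dx \cdot \begin{bmatrix} \theta_t - \theta_* \\ \omega_t - \omega_* \end{bmatrix}, \\ &= \int_0^1 \left( I - \eta \begin{bmatrix} \nabla_{\theta\theta} U(\theta(x), \omega(x)) & \nabla_{\theta\omega} U(\theta(x), \omega(x)) \\ -\nabla_{\omega\theta} U(\theta(x), \omega(x)) & -\nabla_{\omega\omega} U(\theta(x), \omega(x)) \end{bmatrix} \right) dx \cdot \begin{bmatrix} \theta_t - \theta_* \\ \omega_t - \omega_* \end{bmatrix}. \end{aligned}$$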

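Assume that one can prove that for some $r > 0$ and all $(\theta, \omega) \in B_2((\theta_*, \omega_*), r)$, with a proper choice of $\eta$, the largest singular value is strictly below 1,

$$\left\| I - \eta \begin{bmatrix} \nabla_{\theta\theta} U(\theta, \omega) & \nabla_{\theta\omega} U(\theta, \omega) \\ -\nabla_{\omega\theta} U(\theta, \omega) & -\nabla_{\omega\omega} U(\theta, \omega) \end{bmatrix} \right\|_{\rm op} < 1.$$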

Then due to convexity of the operator norm, the dynamics of SGA is contracting locally because,

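$$\begin{aligned} \left\| \begin{bmatrix} \theta_{t+1} - \theta_* \\ \omega_{t+1} - \omega_* \end{bmatrix} \right\| &\le \left\| \int_0^1 \left( I - \eta \begin{bmatrix} \nabla_{\theta\theta} U(\theta(x), \omega(x)) & \nabla_{\theta\omega} U(\theta(x), \omega(x)) \\ -\nabla_{\omega\theta} U(\theta(x), \omega(x)) & -\nabla_{\omega\omega} U(\theta(x), \omega(x)) \end{bmatrix} \right) dx \right\|_{\rm op} \cdot \left\| \begin{bmatrix} \theta_t - \theta_* \\ \omega_t - \omega_* \end{bmatrix} \right\|, \\ &\le \int_0^1 \left\| I - \eta \begin{bmatrix} \nabla_{\theta\theta} U(\theta(x), \omega(x)) & \nabla_{\theta\omega} U(\theta(x), \omega(x)) \\ -\nabla_{\omega\theta} U(\theta(x), \omega(x)) & -\nabla_{\omega\omega} U(\theta(x), \omega(x)) \end{bmatrix} \right\|_{\rm op} dx \cdot \left\| \begin{bmatrix} \theta_t - \theta_* \\ \omega_t - \omega_* \end{bmatrix} \right\|, \\ &< \left\| \begin{bmatrix} \theta_t - \theta_* \\ \omega_t - \omega_* \end{bmatrix} \right\|. \end{aligned}$$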

Let’s analyze the singular values of

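$$I - \eta \begin{bmatrix} \nabla_{\theta\theta} U(\theta, \omega) & \nabla_{\theta\omega} U(\theta, \omega) \\ -\nabla_{\omega\theta} U(\theta, \omega) & -\nabla_{\omega\omega} U(\theta, \omega) \end{bmatrix},$$

assuming $\nabla_{\theta\theta} U(\theta, \omega) \succ 0$, $-\nabla_{\omega\omega} U(\theta, \omega) \succ 0$. Abbreviate

$$\begin{bmatrix} \nabla_{\theta\theta} U(\theta, \omega) & \nabla_{\theta\omega} U(\theta, \omega) \\ -\nabla_{\omega\theta} U(\theta, \omega) & -\nabla_{\omega\omega} U(\theta, \omega) \end{bmatrix} := \begin{bmatrix} A & C \\ -C^T & B \end{bmatrix}.$$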

The largest singular value of

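$$I - \eta \begin{bmatrix} A & C \\ -C^T & B \end{bmatrix},$$

is the square root of the largest eigenvalue of the following symmetric matrix

$$\begin{bmatrix} I - \eta A & -\eta C \\ \eta C^T & I - \eta B \end{bmatrix} \begin{bmatrix} I - \eta A & \eta C \\ -\eta C^T & I - \eta B \end{bmatrix} = \begin{bmatrix} (I - \eta A)^2 + \eta^2 CC^T & -\eta^2 (AC - CB) \\ -\eta^2 (C^TA - BC^T) & (I - \eta B)^2 + \eta^2 C^TC \end{bmatrix}.$$

It is clear that when $\eta = 0$, the largest eigenvalue of the above matrix is $1$. Observe

$$\begin{aligned} \begin{bmatrix} (I - \eta A)^2 + \eta^2 CC^T & -\eta^2 (AC - CB) \\ -\eta^2 (C^TA - BC^T) & (I - \eta B)^2 + \eta^2 C^TC \end{bmatrix} &= I - 2\eta \begin{bmatrix} A & 0 \\ 0 & B \end{bmatrix} + \eta^2 \begin{bmatrix} A^2 + CC^T & -AC + CB \\ -C^TA + BC^T & B^2 + C^TC \end{bmatrix}, \\ &\preceq \left[ 1 - 2\eta \lambda_{\min}\left( \begin{bmatrix} A & 0 \\ 0 & B \end{bmatrix} \right) + \eta^2 \lambda_{\max}\left( \begin{bmatrix} A^2 + CC^T & -AC + CB \\ -C^TA + BC^T & B^2 + C^TC \end{bmatrix} \right) \right] I. \end{aligned}$$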

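If we choose $\eta$ to be

$$\eta = \frac{\min_{(\theta,\omega) \in B_2((\theta_*,\omega_*),r)} \lambda_{\min}\left( \begin{bmatrix} A_{\theta,\omega} & 0 \\ 0 & B_{\theta,\omega} \end{bmatrix} \right)}{\max_{(\theta,\omega) \in B_2((\theta_*,\omega_*),r)} \lambda_{\max}\left( \begin{bmatrix} A_{\theta,\omega}^2 + C_{\theta,\omega} C_{\theta,\omega}^T & -A_{\theta,\omega} C_{\theta,\omega} + C_{\theta,\omega} B_{\theta,\omega} \\ -C_{\theta,\omega}^T A_{\theta,\omega} + B_{\theta,\omega} C_{\theta,\omega}^T & B_{\theta,\omega}^2 + C_{\theta,\omega}^T C_{\theta,\omega} \end{bmatrix} \right)} = \frac{\sqrt{\alpha}}{\beta},$$

then we find

$$\begin{bmatrix} (I - \eta A)^2 + \eta^2 CC^T & -\eta^2 (AC - CB) \\ -\eta^2 (C^TA - BC^T) & (I - \eta B)^2 + \eta^2 C^TC \end{bmatrix} \preceq \left( 1 - \frac{\alpha}{\beta} \right) I.$$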

In this case,

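$$\begin{aligned} \left\| \begin{bmatrix} \theta_{t+1} - \theta_* \\ \omega_{t+1} - \omega_* \end{bmatrix} \right\| &\le \sup_{(\theta,\omega) \in B_2((\theta_*,\omega_*),r)} \left\| I - \eta \begin{bmatrix} \nabla_{\theta\theta} U(\theta, \omega) & \nabla_{\theta\omega} U(\theta, \omega) \\ -\nabla_{\omega\theta} U(\theta, \omega) & -\nabla_{\omega\omega} U(\theta, \omega) \end{bmatrix} \right\|_{\rm op} \left\| \begin{bmatrix} \theta_t - \theta_* \\ \omega_t - \omega_* \end{bmatrix} \right\|, \\ &\le \sqrt{1 - \frac{\alpha}{\beta}} \cdot \left\| \begin{bmatrix} \theta_t - \theta_* \\ \omega_t - \omega_* \end{bmatrix} \right\|. \end{aligned}$$

Therefore, to obtain an $\epsilon$-minimizer it suffices to take a number of steps equal to

$$2\, \frac{\beta}{\alpha} \log \frac{r}{\epsilon}.$$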

###### Proof of Lemma 3.1.

Recall that the OMD dynamics iteratively updates

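$$\begin{bmatrix} \theta_{t+1} \\ \omega_{t+1} \end{bmatrix} = \left( I - 2\eta \begin{bmatrix} 0 & C \\ -C^T & 0 \end{bmatrix} \right) \cdot \begin{bmatrix} \theta_t \\ \omega_t \end{bmatrix} + \eta \begin{bmatrix} 0 & C \\ -C^T & 0 \end{bmatrix} \cdot \begin{bmatrix} \theta_{t-1} \\ \omega_{t-1} \end{bmatrix}.$$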

Define the following matrices

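$$R_1 = \frac{\left( I - 2\eta \begin{bmatrix} 0 & C \\ -C^T & 0 \end{bmatrix} \right) + \left( I - 4\eta^2 \begin{bmatrix} CC^T & 0 \\ 0 & C^TC \end{bmatrix} \right)^{1/2}}{2}, \tag{5.1}$$

$$R_2 = \frac{\left( I - 2\eta \begin{bmatrix} 0 & C \\ -C^T & 0 \end{bmatrix} \right) - \left( I - 4\eta^2 \begin{bmatrix} CC^T & 0 \\ 0 & C^TC \end{bmatrix} \right)^{1/2}}{2}. \tag{5.2}$$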

It is easy to verify that

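$$\begin{aligned} R_1 + R_2 &= \left( I - 2\eta \begin{bmatrix} 0 & C \\ -C^T & 0 \end{bmatrix} \right), \\ R_1 R_2 = R_2 R_1 &= \frac{\left( I - 2\eta \begin{bmatrix} 0 & C \\ -C^T & 0 \end{bmatrix} \right)^2 - \left( I - 4\eta^2 \begin{bmatrix} CC^T & 0 \\ 0 & C^TC \end{bmatrix} \right)}{4} = -\eta \begin{bmatrix} 0 & C \\ -C^T & 0 \end{bmatrix}. \end{aligned}$$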

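The commutative property $R_1 R_2 = R_2 R_1$ follows from a singular value decomposition argument: letting $C = UDV^T$ be the SVD of $C$ ($D$ diagonal) one finds,

$$C (I - 4\eta^2 C^TC)^{1/2} = U D (I - 4\eta^2 D^2)^{1/2} V^T = U (I - 4\eta^2 D^2)^{1/2} D V^T = (I - 4\eta^2 CC^T)^{1/2} C.$$

Using the above equality, the commutative property follows

$$\left( I - 2\eta \begin{bmatrix} 0 & C \\ -C^T & 0 \end{bmatrix} \right) \left( I - 4\eta^2 \begin{bmatrix} CC^T & 0 \\ 0 & C^TC \end{bmatrix} \right)^{1/2} = \left( I - 4\eta^2 \begin{bmatrix} CC^T & 0 \\ 0 & C^TC \end{bmatrix} \right)^{1/2} \left( I - 2\eta \begin{bmatrix} 0 & C \\ -C^T & 0 \end{bmatrix} \right) \implies R_1 R_2 = R_2 R_1.$$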
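The algebraic identities for $R_1$ and $R_2$ can be verified numerically. In the sketch below the matrix $C$ is an arbitrary choice, and the step size is taken slightly inside the admissible range, $\eta = 0.4/\sqrt{\lambda_{\max}(CC^T)}$ (an assumption made here so that the matrix square root is well conditioned), rather than at the boundary value used in Lemma 3.1. The helper `psd_sqrt` is a hypothetical name.

```python
import numpy as np

def psd_sqrt(S):
    """Symmetric square root of a positive semi-definite matrix via eigh."""
    w, Q = np.linalg.eigh(S)
    return (Q * np.sqrt(np.clip(w, 0.0, None))) @ Q.T

C = np.array([[2.0, 0.5], [0.0, 1.0]])       # hypothetical interaction matrix
zeros = np.zeros((2, 2))
M = np.block([[zeros, C], [-C.T, zeros]])    # skew-symmetric block matrix
D = np.block([[C @ C.T, zeros], [zeros, C.T @ C]])
I = np.eye(4)

# Assumed step size strictly inside the range where I - 4 eta^2 D is PSD.
eta = 0.4 / np.sqrt(np.linalg.eigvalsh(C @ C.T).max())
P = I - 2 * eta * M
S = psd_sqrt(I - 4 * eta**2 * D)
R1 = (P + S) / 2                             # (5.1)
R2 = (P - S) / 2                             # (5.2)

print("R1 + R2 == I - 2 eta M:", np.allclose(R1 + R2, P, atol=1e-10))
print("R1 R2 == -eta M:       ", np.allclose(R1 @ R2, -eta * M, atol=1e-10))
print("R1 R2 == R2 R1:        ", np.allclose(R1 @ R2, R2 @ R1, atol=1e-10))
```

Together, $R_1 + R_2$ and $R_1 R_2$ reproduce the coefficients of the two-step OMD recursion, which is what allows it to be factored into two one-step recursions.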

Now we have the following relations for OMD,

 [θt+1ωt+1]−R1[θtωt]=R2([θtωt]−R1[θt−1ωt−1]), [θt