1 Introduction
Minimax problems—or games—of the form
are ubiquitous in economics, statistics, optimization and machine learning. In recent years, minimax formulations for neural network training rose to prominence
[14, 22], leading to intense interest in algorithms for solving large-scale minimax games [9, 13, 19, 8, 17, 23]. However, the algorithmic toolbox for minimax optimization is not as complete as the one for minimization. Variance reduction, a technique for improving stochastic gradient estimators by introducing control variates, stands as a case in point. A multitude of variance reduction schemes exist for finite-sum minimization [cf. 18, 34, 1, 3, 11], and their impact on complexity is well-understood [41]. In contrast, only a few works apply variance reduction to finite-sum minimax problems [32, 38, 4, 25], and the potential gains from variance reduction are not well-understood. We take a step towards closing this gap by designing variance-reduced minimax game solvers that offer strict runtime improvements over non-stochastic gradient methods, similar to those of optimal variance reduction methods for finite-sum minimization. To achieve this, we focus on the fundamental class of bilinear minimax games,
In particular, we study the complexity of finding an approximate saddle point (Nash equilibrium), namely with
In the setting where and
are both probability simplices, the problem corresponds to finding an approximate (mixed) equilibrium in a matrix game, a central object in game theory and economics. Matrix games are also fundamental to algorithm design due in part to their equivalence to linear programming
[7]. Alternatively, when one domain is a Euclidean ball and the other is a simplex, solving the corresponding problem finds a maximum-margin linear classifier (hard-margin SVM), a fundamental task in machine learning and statistics
[24]. We refer to the former as an ℓ1-ℓ1 game and the latter as an ℓ2-ℓ1 game; our primary focus is to give improved algorithms for these domains.
1.1 Our Approach
Our starting point is Nemirovski’s “conceptual prox-method” [27] for solving , where is convex in and concave in . The method solves a sequence of subproblems parameterized by , each of the form
(1) 
for some , where is a norm-suitable Bregman divergence from to : squared Euclidean distance for and KL divergence for . Combining each subproblem solution with an extragradient step, mirror-prox solves the original problem to accuracy by solving subproblems.^{1} (^1: We use the notation to suppress terms logarithmic in and .) (Solving (1) with is equivalent to solving .)
Our first contribution is showing that if a stochastic unbiased gradient estimator satisfies the “variance” bound
(2) 
for some , then regularized stochastic mirror descent steps using solve (1) in a suitable probabilistic sense. We call unbiased gradient estimators that satisfy (2) “centered”.
Our second contribution is the construction of “centered” gradient estimators for ℓ1-ℓ1 and ℓ2-ℓ1 bilinear games, where . Our estimator has the following form. Suppose we wish to estimate (the gradient of w.r.t. ), and we already have . Let be some distribution over , draw and set
where is the th column of . This form is familiar from variance reduction techniques [18, 42, 1], which typically use a fixed distribution . In our setting, however, a fixed will not produce sufficiently low variance. Departing from prior variance-reduction work and building on [15, 5], we choose based on according to
yielding exactly the variance bound we require. We call this technique “sampling from the difference.”
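As an illustration, the following sketch implements this difference-based sampling for estimating one gradient block of a matrix game. The notation is ours, not the paper's: `A` is an (m, n) payoff matrix, `y` and `y0` are the current and reference points on the simplex, and `ATy0` is the exact gradient precomputed at the reference point. It is a sketch of the idea rather than the paper's exact estimator.

```python
import numpy as np

def centered_estimate(A, y, y0, ATy0, rng):
    """One "sampling from the difference" estimate of A^T y.

    Illustrative notation (not the paper's): A is an (m, n) payoff matrix,
    y and y0 lie on the m-simplex, and ATy0 = A.T @ y0 is the exact
    gradient precomputed at the reference point y0.
    """
    diff = y - y0
    mass = np.abs(diff).sum()
    if mass == 0.0:
        return ATy0.copy()
    p = np.abs(diff) / mass          # p_i proportional to |y_i - y0_i|
    i = rng.choice(len(y), p=p)
    # Unbiased: sum_i p_i * A[i] * diff[i] / p_i = A^T (y - y0).
    return ATy0 + A[i] * (diff[i] / p[i])
```

Because the correction term scales with `|y_i - y0_i|`, the estimator's variance shrinks as the query point approaches the reference point, which is exactly the "centered" property.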
For our ℓ2-ℓ1 gradient estimator, we sample from the squared difference, choosing block coordinate , where
To strengthen our results for ℓ2-ℓ1 games, we consider a refined version of the “centered” criterion (2) which allows regret analysis using local norms [37, 5]. To further facilitate this analysis we follow [5]
and introduce gradient clipping. We extend our proofs to show that stochastic regularized mirror descent can solve (1) despite the (distance-bounded) bias caused by gradient clipping. Our gradient estimators attain the bound (2) with equal to the Lipschitz constant of . Specifically,
(3) 
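To illustrate the role of clipping mentioned above, here is a hedged sketch of a clipped exponentiated-gradient (mirror descent) step on the simplex. The step size `eta` and threshold `tau` are illustrative parameters of our own; the paper's exact clipping rule and its bias analysis may differ.

```python
import numpy as np

def clipped_eg_step(x, g_tilde, eta, tau):
    """Exponentiated-gradient step on the simplex with entrywise clipping
    of the stochastic gradient estimate. eta and tau are illustrative
    parameters; the paper's exact clipping rule may differ. Clipping caps
    every multiplicative factor in [exp(-eta*tau), exp(eta*tau)], which
    is the boundedness a local-norms regret analysis needs."""
    g = np.clip(g_tilde, -tau, tau)
    x_new = x * np.exp(-eta * g)
    return x_new / x_new.sum()
```

Without clipping, a single large stochastic coordinate could collapse an entire coordinate of the iterate; with clipping, each update factor stays bounded at the price of a bias that is controlled by the distance to the reference point.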
1.2 Method complexity compared with prior art
As per the discussion above, to achieve accuracy our algorithm solves subproblems. Each subproblem takes time to compute two exact gradients (one for variance reduction and one for an extragradient step), plus additional time for the inner mirror descent iterations, with as in (3). The total runtime is therefore
By setting optimally to be , we obtain the runtime
(4) 
Comparison with mirror-prox and dual extrapolation.
Nemirovski [27] instantiates his conceptual prox-method by solving the relaxed proximal problem (1) with in time , where is the Lipschitz constant of , as given in (3). The total complexity of the resulting method is therefore
(5) 
The closely related dual extrapolation method of Nesterov [30] attains the same rate of convergence. We refer to the running time (5) as linear, since it scales linearly with the problem description size . Our running time guarantee (4) is never worse than (5) by more than a constant factor, and improves on (5) when , i.e., whenever is not extremely sparse. In that regime, our method uses , hence solving a harder version of (1) than is possible for mirror-prox.
Comparison with sublinear-time methods.
Using a randomized algorithm, Grigoriadis and Khachiyan [15] solve ℓ1-ℓ1 bilinear games in time
(6) 
and Clarkson et al. [5] extend this result to ℓ2-ℓ1 bilinear games, with the values of as in (3). Since these runtimes scale with , we refer to them as sublinear. Our guarantee improves on the guarantee (6) when , i.e., whenever (6) is not truly sublinear.
Our method carefully balances linear-time extragradient steps with cheap sublinear-time stochastic gradient steps. Consequently, our runtime guarantee (4) inherits strengths from both the linear and sublinear runtimes. First, our runtime scales linearly with rather than quadratically, as does the linear runtime (5). Second, while our runtime is not strictly sublinear, its component proportional to is , which is sublinear in . Overall, our method offers the best runtime guarantee in the literature in the regime
where the lower bound on is due to the best known theoretical runtimes of interior point methods: [6] and [20], where is the (current) matrix multiplication exponent.
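As a rough numerical illustration of these regimes, the sketch below compares the three runtime families discussed above. All constants and logarithmic factors are dropped, and the simplified asymptotic forms are our own transcription of the discussion; treat the code as illustrative only, not as the paper's exact bounds.

```python
import math

# Constants and logarithmic factors dropped; these simplified forms are an
# illustrative transcription of the runtime discussion, not exact bounds.
def linear_runtime(nnz, L, eps):
    return nnz * L / eps                     # mirror-prox / dual extrapolation

def sublinear_runtime(n, m, L, eps):
    return (n + m) * (L / eps) ** 2          # Grigoriadis-Khachiyan style

def vr_runtime(nnz, n, m, L, eps):
    # variance-reduced method: linear term plus a sublinear-in-nnz term
    return nnz + math.sqrt(nnz * (n + m)) * L / eps
```

On a dense, moderately conditioned instance the variance-reduced bound dominates both alternatives, while on an extremely sparse instance it matches the linear runtime up to a constant factor, mirroring the comparison above.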
1.3 Additional contributions
We extend our development in three ways. First, we show how to combine restarting with variance reduction in order to compute exact proximal points to high accuracy. This technique applies to any function with a centered gradient estimator (rather than the bilinear functions considered so far). Second, we describe an extension of our results to “composite” saddle point problems of the form , where admits a centered gradient estimator and are “simple” convex functions. Third, we describe a number of alternative centered gradient estimators for the bilinear objective with and geometries. In particular, for the case we show that “sampling from the sum” by setting also works.
1.4 Related work
Matrix games, the canonical form of discrete zero-sum games, have long been studied in economics [31]. It is well-known that the classical mirror descent (i.e., no-regret) method yields an algorithm with running time [29]. Subsequent work [15, 27, 30, 5] improves this runtime as described above. Our work builds on the extragradient scheme of Nemirovski [27] as well as the gradient estimation and clipping technique of Clarkson et al. [5].
Palaniappan and Bach [32] apply standard variance reduction [18] to bilinear ℓ2-ℓ2 games by sampling elements proportional to squared matrix entries. Using proximal-point acceleration they obtain a runtime of , a rate we recover using our algorithm (Section 4.3). However, in this setting the mirror-prox method has runtime , which may be better than the result of [32] by a factor of due to the discrepancy in the norm of . Naive application of [32] to ℓ1 domains results in even greater potential losses. Shi et al. [38] extend the method of [32] to smooth functions using general Bregman divergences, but their extension is unaccelerated and appears limited to a rate.
Chavdarova et al. [4] propose a variance-reduced extragradient method with applications to generative adversarial training. In contrast to our algorithm, which performs extragradient steps in the outer loop, the method of [4] performs stochastic extragradient steps in the inner loop, using finite-sum variance reduction as in [18]. Chavdarova et al. [4] analyze their method in the convex-concave setting, showing improved stability over direct application of the extragradient method to noisy gradients. However, their complexity guarantees are worse than those of linear-time methods. Following up on [4], Mishchenko et al. [25] propose to reduce the variance of the stochastic extragradient method by using the same stochastic sample for both the gradient and extragradient steps. In the Euclidean strongly convex case, they show a convergence guarantee under a relaxed variance assumption, and in the noiseless full-rank bilinear case they recover the guarantees of [26]. In the general convex case, however, they show only an rate of convergence.
1.5 Paper outline
We define our notation in Section 2. In Section 3.1, we review Nemirovski’s conceptual mirror-prox method and introduce the notion of a relaxed proximal oracle; we implement such an oracle using variance-reduced gradient estimators in Section 3.2. In Section 4, we construct these gradient estimators for the ℓ1-ℓ1, ℓ2-ℓ1 and ℓ2-ℓ2 domain settings, and complete the analyses of the corresponding algorithms. Finally, in Section 5 we give the additional contributions described in Section 1.3 above.
2 Notation
Problem setup.
A setup is the triplet where: (i) is a compact and convex subset of , (ii) is a norm on and (iii) is 1-strongly convex w.r.t. and , i.e., such that for all .^{2} (^2: For non-differentiable , we define , where is the subdifferential of at .) We call the distance generating function and denote the Bregman divergence associated with it by
We also denote and assume it is finite.
Norms and dual norms.
We write for the set of linear functions on . For we define the dual norm of as . For we write the norm with . The dual norm of is with .
Domain components.
We assume is of the form for convex and compact sets and . Particular sets of interest are the simplex and the Euclidean ball . For any vector in , we write and for the first and last coordinates of , respectively.
When totally clear from context, we sometimes refer to the and components of directly as and . We write the th coordinate of vector as .
Matrices.
We consider a matrix and write for the number of its nonzero entries. For and we write , and for the corresponding row, column and entry, respectively.^{3} (^3: For , we let .) We consider the matrix norms , and .
3 Primal-dual variance reduction framework
In this section, we establish a framework for solving the saddle point problem
where is convex in and concave in , and admits a (variance-reduced) stochastic estimator for the continuous and monotone^{4} (^4: A mapping is monotone if and only if for all ; is monotone due to convexity-concavity of .) gradient mapping
Our goal is to find an approximate saddle point (Nash equilibrium), i.e. such that
(7) 
We achieve this by generating a sequence such that for every and using the fact that
(8) 
due to convexity-concavity of (see proof in Appendix A.1).
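For the bilinear matrix game f(x, y) = yᵀAx over simplices, the saddle-point gap of (7) has a simple closed form, since the inner maximum and minimum are attained at pure strategies. The sketch below (our notation) computes it:

```python
import numpy as np

def matrix_game_gap(A, x, y):
    """Saddle-point gap for f(x, y) = y^T A x with x, y on simplices.
    Best pure responses attain the inner max/min, so the gap equals
    max_i (A x)_i - min_j (A^T y)_j; it is zero exactly at equilibrium."""
    return (A @ x).max() - (A.T @ y).min()
```

This quantity is always nonnegative, and driving it below ε is precisely the approximation goal stated in (7).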
In Section 3.1 we define the notion of a (randomized) relaxed proximal oracle, and describe how Nemirovski’s mirror-prox method leverages it to solve the problem (3). In Section 3.2 we define a class of centered gradient estimators, whose variance is proportional to the squared distance from a reference point. Given such a centered gradient estimator, we show that a regularized stochastic mirror descent scheme constitutes a relaxed proximal oracle. For a technical reason, we limit our oracle guarantee in Section 3.2 to the bilinear case , which suffices for the applications in Section 4. We lift this limitation in Section 5.1, where we show a different oracle implementation that is valid for general convex-concave , with only a logarithmic increase in complexity.
3.1 The mirror-prox method with a randomized oracle
Recall that we assume the space is equipped with a norm and distance generating function that is strongly convex w.r.t. and has range . We write the induced Bregman divergence as . We use the following fact throughout the paper: by definition, the Bregman divergence satisfies, for any ,
(9) 
For any we define the proximal mapping to be the solution of the variational inequality corresponding to the strongly monotone operator , i.e. the unique such that for all [cf. 10]. Equivalently (by (9)),
(10) 
When , is also the unique solution of the saddle point problem
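In the entropy setup on the simplex, the proximal mapping defined by (10) with a fixed linear term has a well-known closed form: a multiplicative-weights update. The sketch below (our notation; `alpha` is the proximal parameter) illustrates it:

```python
import numpy as np

def entropy_prox(z0, g, alpha):
    """argmin_z <g, z> + alpha * V_{z0}(z) over the simplex, where V is
    the KL divergence generated by negative entropy. First-order
    conditions give z proportional to z0 * exp(-g / alpha); a sketch,
    with a max-shift for numerical stability."""
    s = -g / alpha
    z = z0 * np.exp(s - s.max())
    return z / z.sum()
```

When the linear term vanishes, the prox point is the starting point itself; otherwise mass moves multiplicatively away from high-cost coordinates.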
Consider iterations of the form , with . Averaging the definition (10) over , using the bound (8) and the nonnegativity of Bregman divergences gives
Thus, we can find an suboptimal point in exact proximal steps. However, computing exactly may be as difficult as solving the original problem. Nemirovski [27] proposes a relaxation of the exact proximal mapping, which we slightly extend to allow randomization, and formalize in the following definition.
Definition 1 ((,)-relaxed proximal oracle).
Let be a monotone operator and . An (,)-relaxed proximal oracle for is a (possibly randomized) mapping such that satisfies
Note that is an relaxed proximal oracle. Algorithm 1 describes the mirror-prox method of Nemirovski [27], which recovers the error guarantee of exact proximal iterations. The th iteration consists of (i) a relaxed proximal oracle call producing , and (ii) a linearized proximal (mirror) step in which we replace with the constant function , producing . We now state and prove the convergence guarantee for the mirror-prox method, first shown in [27].
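The outer-loop template just described can be sketched concretely for a matrix game on simplices, instantiating the oracle with a single mirror step (the classical extragradient choice). This is an illustrative sketch in our own notation, not the paper's Algorithm 1 verbatim; `alpha` plays the role of the proximal parameter, taken at the scale of the Lipschitz constant.

```python
import numpy as np

def prox_step(z0, g, alpha):
    """Entropic mirror step: argmin <g, z> + alpha * KL(z, z0) on the simplex."""
    z = z0 * np.exp(-g / alpha)
    return z / z.sum()

def mirror_prox(A, T, alpha):
    """Sketch of the mirror-prox outer loop for f(x, y) = y^T A x on
    simplices, with the oracle call replaced by one mirror step (the
    extragradient instantiation). Returns the averaged half-iterates."""
    m, n = A.shape
    x, y = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    x_sum, y_sum = np.zeros(n), np.zeros(m)
    for _ in range(T):
        # (i) oracle call: one prox step from (x, y) using g(x, y) = (A^T y, -A x)
        xh = prox_step(x, A.T @ y, alpha)
        yh = prox_step(y, -(A @ x), alpha)
        # (ii) linearized step using the gradient at the half-iterate
        x = prox_step(x, A.T @ yh, alpha)
        y = prox_step(y, -(A @ xh), alpha)
        x_sum += xh
        y_sum += yh
    return x_sum / T, y_sum / T
```

The output averages the half-iterates, matching the averaging step in the convergence argument that follows.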
Proposition 1 (Mirror prox convergence via oracles).
Let be an (,)-relaxed proximal oracle with respect to gradient mapping and distance-generating function with range at most . Let be the iterates of Algorithm 1 and let be its output. Then
Proof.
Fix iteration , and note that by the definition (10), satisfies
Summing over , writing and rearranging yields
for all . Note that since minimizes , for all . Therefore, maximizing the above display over and afterwards taking expectation gives
Finally, by Definition 1, for every , and the result follows by dividing by and using the bound (8). ∎
3.2 Implementation of an (,)-relaxed proximal oracle
We now explain how to use stochastic variance-reduced gradient estimators to design an efficient (,)-relaxed proximal oracle. We begin by introducing the bias and variance properties of the estimators we require.
Definition 2.
Let and . A stochastic gradient estimator is called centered for if, for all :
(i) (unbiasedness) ;
(ii) (variance bound) .
Lemma 1.
A centered estimator for satisfies .
Proof.
Writing , we have by the first centered estimator property. Therefore,
where the bounds follow from the triangle inequality, Jensen’s inequality and the second centered estimator property. ∎
Remark 1.
A gradient mapping that admits a centered gradient estimator for every is Lipschitz, since by Jensen’s inequality and Lemma 1 we have for all
Remark 2.
Definition 2 bounds the gradient variance using the distance to the reference point. Similar bounds are useful for finding stationary points in smooth nonconvex finite-sum problems [2, 33, 11, 43]. However, known variance reduction methods for smooth convex finite-sum minimization require stronger bounds [cf. 1, Section 2.1].
With the variance bounds defined, we describe Algorithm 2 which (for the bilinear case) implements a relaxed proximal oracle. The algorithm is essentially standard stochastic mirror descent, except for an additional regularization term around the initial point . Note that we do not perform extragradient steps in this stochastic method. When combined with a centered gradient estimator, the iterates of Algorithm 2 provide the following guarantee, which is one of our key technical contributions.
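For the entropy setup on the simplex, a single iteration of this regularized stochastic mirror descent has a closed form, sketched below. The notation is ours (`eta` is the step size, `alpha` the regularization strength, `w0` the initial point), and this is a sketch of the update rule only; the paper's Algorithm 2 also averages the iterates.

```python
import numpy as np

def reg_smd_step(w, w0, g, eta, alpha):
    """One regularized stochastic mirror descent step on the simplex
    (entropy setup): argmin_w eta*<g, w> + eta*alpha*V_{w0}(w) + V_{wt}(w).
    Solving the first-order conditions gives
    w ~ (w_t * w0^(eta*alpha) * exp(-eta*g))^(1/(1 + eta*alpha))."""
    log_w = (np.log(w) + eta * alpha * np.log(w0) - eta * g) / (1.0 + eta * alpha)
    w_new = np.exp(log_w - log_w.max())   # shift for numerical stability
    return w_new / w_new.sum()
```

The extra `w0` term is the regularization toward the initial point: as `alpha` grows, the update is pulled back to `w0`, which is what keeps the iterates close enough to the reference point for the centered-variance bound to pay off.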
Proposition 2.
Let , let and let be centered for monotone . Then, for and , the iterates of Algorithm 2 satisfy
(11) 
Before discussing the proof of Proposition 2, we state how it implies the relaxed proximal oracle property for the bilinear case.
Corollary 1.
Let and let . Then, in the setting of Proposition 2, is an relaxed proximal oracle.
Proof.
More generally, the proof of Corollary 1 shows that Algorithm 2 implements a relaxed proximal oracle whenever is convex for every . In Section 5.1 we implement an relaxed proximal oracle without such an assumption.
The proof of Proposition 2 is a somewhat lengthy application of existing techniques for stochastic mirror descent analysis in conjunction with Definition 2. We give it in full in Appendix B and review the main steps here.
Regret bound.
Regularization.
Variance bound.
Using the second centered gradient estimator property and strong convexity of the distance generating function, we have
for . Since the RHS of (13) is nonpositive in expectation and the gradient estimator is unbiased, we have
Exchanging maximum and expectation.
When depends on we generally have . To address this issue we use a technique due to Nemirovski et al. [28]. Writing and defining the “ghost iterates” with , we rewrite as . Since does not depend on randomness in , we have . To handle the term we use the standard mirror descent regret bound again, absorbing the result into the RHS of (13) using and , which follows from Lemma 1.
4 Application to bilinear saddle point problems
We now construct centered gradient estimators (as per Definition 2) for the linear gradient mapping
We consider two domain types, namely (the simplex) and (the Euclidean ball). In Section 4.1 we present a centered gradient estimator and the resulting runtime guarantees for ℓ1-ℓ1 games. In Section 4.2 we first give a centered gradient estimator for ℓ2-ℓ1 games with a suboptimal constant (larger than the Lipschitz constant of ). We then obtain the correct Lipschitz constant dependence using a local norms analysis, which requires clipping the gradient estimates in order to control the magnitude of the updates. Finally, in Section 4.3 we give a gradient estimator for ℓ2-ℓ2 games. Unlike in the previous two setups, the estimator constant for ℓ2-ℓ2 games does not match the Lipschitz constant of the underlying gradient mapping. Such a mismatch is consistent with prior findings in the literature.
Throughout this section, we let denote the “center” (i.e. reference point) of our stochastic gradient estimator and consider a general query point . We also recall the notation for the th entry of vector .
4.1 ℓ1-ℓ1 games
Setup.
Denoting the -dimensional simplex by , we let , and . We take to be the norm with conjugate norm . We take the distance generating function to be the negative entropy, i.e., . We note that both and are separable and in particular separate over the and blocks of . Finally, we set
and note that this is the Lipschitz constant of the gradient mapping under the chosen norm.
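As a quick numerical sanity check of this Lipschitz claim (illustrative code in our notation): for the bilinear mapping g(x, y) = (AᵀY, -Ax), the constant max_ij |A_ij| bounds the change of g in the ℓ∞ (dual) norm by the ℓ1 distance between inputs.

```python
import numpy as np

def grad_map(A, x, y):
    """The bilinear gradient mapping g(x, y) = (A^T y, -A x), concatenated."""
    return np.concatenate([A.T @ y, -(A @ x)])

def lmax(A):
    """max_ij |A_ij|: the l1-to-l_infinity Lipschitz constant of grad_map."""
    return np.abs(A).max()
```

Each coordinate of Aᵀ(y - y') is an inner product of a column of A with y - y', hence bounded by max_ij |A_ij| times the ℓ1 distance; the same holds for the x-block.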
4.1.1 Gradient estimator
Given and , we describe the reduced-variance gradient estimator . First, we define the probabilities and according to
(14) 
To compute we sample