Minimax problems—or games—of the form14, 22], leading to intense interest in algorithms for solving large scale minimax games [9, 13, 19, 8, 17, 23]. However, the algorithmic toolbox for minimax optimization is not as complete as the one for minimization. Variance reduction, a technique for improving stochastic gradient estimators by introducing control variates, stands as a case in point. A multitude of variance reduction schemes exist for finite-sum minimization [cf. 18, 34, 1, 3, 11], and their impact on complexity is well-understood . In contrast, only a few works apply variance reduction to finite-sum minimax problems [32, 38, 4, 25], and the potential gains from variance reduction are not well-understood.
We take a step towards closing this gap by designing variance-reduced minimax game solvers that offer strict runtime improvements over non-stochastic gradient methods, similar to that of optimal variance reduction methods for finite-sum minimization. To achieve this, we focus on the fundamental class of bilinear minimax games,
In particular, we study the complexity of finding an -approximate saddle point (Nash equilibrium), namely with
In the setting where and
are both probability simplices, the problem corresponds to finding an approximate (mixed) equilbrium in a matrix game, a central object in game theory and economics. Matrix games are also fundamental to algorithm design due in part to their equivalence to linear programming. Alternatively, when is an ball and
is a simplex, solving the corresponding problem finds a maximum-margin linear classifier (hard-margin SVM), a fundamental task in machine learning and statistics. We refer to the former as an - game and the latter as an - game; our primary focus is to give improved algorithms for these domains.
1.1 Our Approach
Our starting point is Nemirovski’s “conceptual prox-method”  for solving , where is convex in and concave in . The method solves a sequence of subproblems parameterized by , each of the form
for some , where is a norm-suitable Bregman divergence from to : squared Euclidean distance for and KL divergence for . Combining each subproblem solution with an extragradient step, mirror-prox solves the original problem to accuracy by solving subproblems.111 We use the notation to suppress terms logarithmic in and . (Solving (1) with is equivalent to to solving .)
Our first contribution is showing that if a stochastic unbiased gradient estimator satisfies the “variance” bound
Our second contribution is the construction of “centered” gradient estimators for - and - bilinear games, where . Our estimator has the following form. Suppose we wish to estimate (the gradient of w.r.t. ), and we already have . Let be some distribution over , draw and set
where is the th column of . This form is familiar from variance reduction techniques [18, 42, 1], that typically use a fixed distribution . In our setting, however, a fixed will not produce sufficiently low variance. Departing from prior variance-reduction work and building on [15, 5], we choose based on according to
yielding exactly the variance bound we require. We call this technique “sampling from the difference.”
For our gradient estimator, we sample from the squared difference, choosing -block coordinate , where
To strengthen our results for - games, we consider a refined version of the “centered” criterion (2) which allows regret analysis using local norms [37, 5]. To further facilitate this analysis we follow 
and introduce gradient clipping. We extend our proofs to show that stochastic regularized mirror descent can solve (1) despite the (distance-bounded) bias caused by gradient clipping.
Our gradient estimators attain the bound (2) with equal to the Lipschitz constant of . Specifically,
1.2 Method complexity compared with prior art
As per the discussion above, to achieve accuracy our algorithm solves subproblems. Each subproblem takes time for computing 2 exact gradients (one for variance reduction and one for an extragradient step), plus an additional time for the inner mirror descent iterations, with as in (3). The total runtime is therefore
By setting optimally to be , we obtain the runtime
Comparison with mirror-prox and dual extrapolation.
Nemirovski  instantiates his conceptual prox-method by solving the relaxed proximal problem (1) with in time , where is the Lipschitz constant of , as given in (3). The total complexity of the resulting method is therefore
The closely related dual extrapolation method of Nesterov  attains the same rate of convergence. We refer to the running time (5) as linear since it scales linearly with the problem description size . Our running time guarantee (4) is never worse than (5) by more than a constant factor, and improves on (5) when , i.e. whenever is not extremely sparse. In that regime, our method uses , hence solving a harder version of (1) than possible for mirror-prox.
Comparison with sublinear-time methods
Using a randomized algorithm, Grigoriadis and Khachiyan  solve - bilinear games in time
and Clarkson et al.  extend this result to - bilinear games, with the values of as in (3). Since these runtimes scale with , we refer to them as sublinear. Our guarantee improves on the guarantee (6) when , i.e. whenever (6) is not truly sublinear.
Our method carefully balances linear-time extragradient steps with cheap sublinear-time stochastic gradient steps. Consequently, our runtime guarantee (4) inherits strengths from both the linear and sublinear runtimes. First, our runtime scales linearly with rather than quadratically, as does the linear runtime (5). Second, while our runtime is not strictly sublinear, its component proportional to is , which is sublinear in . Overall, our method offers the best runtime guarantee in the literature in the regime
1.3 Additional contributions
We extend our development in three ways. First, we show how to combine restarting with variance reduction in order to compute exact proximal points to high accuracy. This technique applies to any function with a centered gradient estimator (rather than the bilinear functions considered so far). Second, we describe an extension of our results to “composite” saddle point problems of the form , where admits a centered gradient estimator and are “simple” convex functions. Third, we describe a number of alternative centered gradient estimators for the bilinear objective with and geometries. In particular, for the case we show that “sampling from the sum” by setting also works.
1.4 Related work
Matrix games, the canonical form of discrete zero-sum games, have long been studied in economics . It is well-known that the classical mirror descent (i.e. no-regret) method yields an algorithm with running time . Subsequent work [15, 27, 30, 5] improve this runtime as described above. Our work builds on the extragradient scheme of Nemirovski  as well as the gradient estimation and clipping technique of Clarkson et al. .
Palaniappan and Bach  apply standard variance reduction  to bilinear - games by sampling elements proportional to squared matrix entries. Using proximal-point acceleration they obtain a runtime of , a rate we recover using our algorithm (Section 4.3). However, in this setting the mirror-prox method has runtime , which may be better than the result of  by a factor of due to the discrepancy in the norm of . Naive application of  to domains results in even greater potential losses. Shi et al.  extend the method of  to smooth functions using general Bregman divergences, but their extension is unaccelerated and appears limited to a rate.
Chavdarova et al.  propose a variance-reduced extragradient method with applications to generative adversarial training. In contrast to our algorithm, which performs extragadient steps in the outer loop, the method of  performs stochastic extragradient steps in the inner loop, using finite-sum variance reduction as in . Chavdarova et al.  analyze their method in the convex-concave setting, showing improved stability over direct application of the extragradient method to noisy gradients. However, their complexity guarantees are worse than those of linear-time methods. Following up on , Mishchenko et al.  propose to reduce the variance of the stochastic extragradient method by using the same stochastic sample for both the gradient and extragradient steps. In the Euclidean strongly convex case, they show a convergence guarantee with a relaxed variance assumption, and in the noiseless full-rank bilinear case they recover the guarantees of . In the general convex case, however, they only show an rate of convergence.
1.5 Paper outline
We define our notation in Section 2. In Section 3.1, we review Nemirovski’s conceptual mirror-prox method and introduce the notion of a relaxed proximal oracle; we implement such oracle using variance-reduced gradient estimators in Section 3.2. In Section 4, we construct these gradient estimators for the -, - as well as - domain settings, and complete the analyses of the corresponding algorithms. Finally, in Section 5 we give our additional contributions described in Section 1.3 above.
A setup is the triplet where: (i) is a compact and convex subset of , (ii) is a norm on and (iii) is 1-strongly-convex w.r.t. and , i.e. such that for all .222 For non-differentiable , we define , where is the subdifferential of at . We call the distance generating function and denote the Bregman divergence associated with it by
We also denote and assume it is finite.
Norms and dual norms.
We write for the set of linear functions on . For we define the dual norm of as . For we write the norm with . The dual norm of is with .
We assume is of the form for convex and compact sets and . Particular sets of interest are the simplex and the Euclidean ball
. For any vector in,
|we write and for the first and last coordinates of , respectively.|
When totally clear from context, we sometimes refer to the and components of directly as and . We write the th coordinate of vector as .
We consider a matrix and write for the number of its nonzero entries. For and we write , and for the corresponding row, column and entry, respectively.333 For , we let . We consider the matrix norms , and .
3 Primal-dual variance reduction framework
In this section, we establish a framework for solving the saddle point problem
where is convex in and concave , and admits a (variance-reduced) stochastic estimator for the continuous and monotone444 A mapping is monotone if and only if for all ; is monotone due to convexity-concavity of . gradient mapping
Our goal is to find an -approximate saddle point (Nash equilibrium), i.e. such that
We achieve this by generating a sequence such that for every and using the fact that
due to convexity-concavity of (see proof in Appendix A.1).
In Section 3.1 we define the notion of a (randomized) relaxed proximal oracle, and describe how Nemirovski’s mirror-prox method leverages it to solve the problem (3). In Section 3.2 we define a class of centered gradient estimators, whose variance is proportional to the squared distance from a reference point. Given such a centered gradient estimator, we show that a regularized stochastic mirror descent scheme constitutes a relaxed proximal oracle. For a technical reason, we limit our oracle guarantee in Section 3.2 to the bilinear case , which suffices for the applications in Section 4. We lift this limitation in Section 5.1, where we show a different oracle implementation that is valid for general convex-concave , with only a logarithmic increase in complexity.
3.1 The mirror-prox method with a randomized oracle
Recall that we assume the space is equipped with a norm and distance generating function that is -strongly-convex w.r.t. and has range . We write the induced Bregman divergence as . We use the following fact throughout the paper: by definition, the Bregman divergence satisfies, for any ,
For any we define the -proximal mapping to be the solution of the variational inequality corresponding to the strongly monotone operator , i.e. the unique such that for all [cf. 10]. Equivalently (by (9)),
When , is also the unique solution of the saddle point problem
Thus, we can find an -suboptimal point in exact proximal steps. However, computing exactly may be as difficult as solving the original problem. Nemirovski  proposes a relaxation of the exact proximal mapping, which we slightly extend to include the possibility of randomization, and formalize in the following.
Definition 1 (()-relaxed proximal oracle).
Let be a monotone operator and . An ()-relaxed proximal oracle for is a (possibly randomized) mapping such that satisfies
Note that is an -relaxed proximal oracle. Algorithm 1 describes the mirror-prox method of Nemirovski , which recovers the error guarantee of exact proximal iterations. The th iteration consists of (i) a relaxed proximal oracle call producing , and (ii) a linearized proximal (mirror) step where we replace with the constant function , producing . We now state and prove the convergence guarantee for the mirror-prox method, first shown in .
Proposition 1 (Mirror prox convergence via oracles).
Let be an (,)-relaxed proximal oracle with respect to gradient mapping and distance-generating function with range at most . Let be the iterates of Algorithm 1 and let be its output. Then
Fix iteration , and note that by the definition (10), satisfies
Summing over , writing and rearranging yields
for all . Note that since minimizes , for all . Therefore, maximizing the above display over and afterwards taking expectation gives
3.2 Implementation of an -relaxed proximal oracle
We now explain how to use stochastic variance-reduced gradient estimators to design an efficient ()-relaxed proximal oracle. We begin by introducing the bias and variance properties of the estimators we require.
Let and . A stochastic gradient estimator is called -centered for if for all
A -centered estimator for satisfies .
Writing , we have by the first centered estimator property. Therefore,
where the bounds follow from the triangle inequality, Jensen’s inequality and the second centered estimator property. ∎
A gradient mapping that admits a -centered gradient estimator for every is -Lipschitz, since by Jensen’s inequality and Lemma 1 we have for all
Definition 2 bounds the gradient variance using the distance to the reference point. Similar bounds are useful for finding stationary points in smooth nonconvex finite-sum problems [2, 33, 11, 43]. However, known variance reduction methods for smooth convex finite-sum minimization require stronger bounds [cf. 1, Section 2.1].
With the variance bounds defined, we describe Algorithm 2 which (for the bilinear case) implements a relaxed proximal oracle. The algorithm is essentially standard stochastic mirror descent, except for an additional regularization term around the initial point . Note that we do not perform extragradient steps in this stochastic method. When combined with a centered gradient estimator, the iterates of Algorithm 2 provide the following guarantee, which is one of our key technical contributions.
Let , let and let be -centered for monotone . Then, for and , the iterates of Algorithm 2 satisfy
Before discussing the proof of Proposition 2, we state how it implies the relaxed proximal oracle property for the bilinear case.
Let and let . Then, in the setting of Proposition 2, is an -relaxed proximal oracle.
More generally, the proof of Corollary 1 shows that Algorithm 2 implements a relaxed proximal oracle whenever is convex for every . In Section 5.1 we implement an -relaxed proximal oracle without such an assumption.
The proof of Proposition 2 is a somewhat lengthy application of existing techniques for stochastic mirror descent analysis in conjunction with Definition 2. We give it in full in Appendix B and review the main steps here.
Substituting the equality (9) and rearranging gives
and taking guarantees for all .
Using the second centered gradient estimator property and strong convexity of the distance generating function, we have
for . Since the RHS of (13) is nonpositive in expectation and the gradient estimator is unbiased, we have
Exchanging maximum and expectation.
When depends on we generally have . To address this issue we use a technique due to Nemirovski et al. . Writing and defining the “ghost iterates” with , we rewrite as . Since does not depend on randomness in , we have . To handle the term we use the standard mirror descent regret bound again, absorbing the result into the RHS of (13) using and , which follows from Lemma 1.
4 Application to bilinear saddle point problems
We now construct centered gradient estimators (as per Definition 2) for the linear gradient mapping
We consider two domain types, namely (the simplex) and (the Euclidean ball). In Section 4.1 we present a centered gradient estimator and resulting runtime guarantees for - games. In Section 4.2 we first give a centered gradient estimator - with a suboptimal constant (larger than the Lipschitz constant of ). We then obtain the correct Lipschitz constant dependence using a local norms analysis, which requires clipping the gradient estimates in order to control the magnitude of the updates. Finally, in Section 4.3 we give a gradient estimator for - games. Unlike the previous two setups, the estimator constant for - games does not match the Lipschitz constant of the underlying gradient mapping. Such mismatch is consistent with prior findings in the literature.
Throughout this section, we let denote the “center” (i.e. reference point) of our stochastic gradient estimator and consider a general query point . We also recall the notation for the th entry of vector .
4.1 - games
Denoting the -dimensional simplex by , we let , and . We take to be the norm with conjugate norm . We take the distance generating function to be the negative entropy, i.e. . We note that both and are separable and in particular separate over the and blocks of . Finally we set
and note that this is the Lipschitz constant of the gradient mapping under the chosen norm.
4.1.1 Gradient estimator
Given and , we describe the reduced-variance gradient estimator . First, we define the probabilities and according to,
To compute we sample