Variance Reduction for Matrix Games

07/03/2019 ∙ by Yair Carmon, et al.

We present a randomized primal-dual algorithm that solves the problem min_x max_y yᵀAx to additive error ϵ in time nnz(A) + √(nnz(A)·n)/ϵ, for matrix A with larger dimension n and nnz(A) nonzero entries. This improves on Nemirovski's mirror-prox method by a factor of √(nnz(A)/n) and is faster than stochastic gradient methods in the accurate and/or sparse regime ϵ < √(n/nnz(A)). Our results hold for x, y in the simplex (matrix games, linear programming) and for x in an ℓ₂ ball and y in the simplex (perceptron / SVM, minimum enclosing ball). Our algorithm combines the mirror-prox method and a novel variance-reduced gradient estimator based on “sampling from the difference” between the current iterate and a reference point.


1 Introduction

Minimax problems—or games—of the form

min_{x∈X} max_{y∈Y} f(x, y)

are ubiquitous in economics, statistics, optimization and machine learning. In recent years, minimax formulations for neural network training rose to prominence [14, 22], leading to intense interest in algorithms for solving large scale minimax games [9, 13, 19, 8, 17, 23]. However, the algorithmic toolbox for minimax optimization is not as complete as the one for minimization. Variance reduction, a technique for improving stochastic gradient estimators by introducing control variates, stands as a case in point. A multitude of variance reduction schemes exist for finite-sum minimization [cf. 18, 34, 1, 3, 11], and their impact on complexity is well-understood [41]. In contrast, only a few works apply variance reduction to finite-sum minimax problems [32, 38, 4, 25], and the potential gains from variance reduction are not well-understood.

We take a step towards closing this gap by designing variance-reduced minimax game solvers that offer strict runtime improvements over non-stochastic gradient methods, similar to those of optimal variance reduction methods for finite-sum minimization. To achieve this, we focus on the fundamental class of bilinear minimax games,

min_{x∈X} max_{y∈Y} yᵀAx,  where A ∈ ℝ^{m×n}.

In particular, we study the complexity of finding an ϵ-approximate saddle point (Nash equilibrium), namely a pair (x, y) ∈ X × Y with

max_{y'∈Y} y'ᵀAx − min_{x'∈X} yᵀAx' ≤ ϵ.

In the setting where X and Y are both probability simplices, the problem corresponds to finding an approximate (mixed) equilibrium in a matrix game, a central object in game theory and economics. Matrix games are also fundamental to algorithm design due in part to their equivalence to linear programming [7]. Alternatively, when X is an ℓ₂ ball and Y is a simplex, solving the corresponding problem finds a maximum-margin linear classifier (hard-margin SVM), a fundamental task in machine learning and statistics [24]. We refer to the former as an ℓ₁-ℓ₁ game and the latter as an ℓ₂-ℓ₁ game; our primary focus is to give improved algorithms for these domains.

1.1 Our Approach

Our starting point is Nemirovski’s “conceptual prox-method” [27] for solving min_{x∈X} max_{y∈Y} f(x, y), where f is convex in x and concave in y. The method solves a sequence of subproblems parameterized by α > 0, each of the form

(1)  find x̂ ∈ X, ŷ ∈ Y such that  f(x̂, y) − f(x, ŷ) ≤ α·V_{x₀}(x) + α·V_{y₀}(y)  for all x ∈ X, y ∈ Y,

for some (x₀, y₀) ∈ X × Y, where V_a(b) is a norm-suitable Bregman divergence from a to b: squared Euclidean distance for ℓ₂ and KL divergence for ℓ₁. Combining each subproblem solution with an extragradient step, mirror-prox solves the original problem to ϵ accuracy by solving Õ(α/ϵ) subproblems.¹ (Solving (1) with α = 0 is equivalent to solving the original problem.)

¹ We use the Õ notation to suppress terms logarithmic in n and m.

Our first contribution is showing that if a stochastic unbiased gradient estimator g̃ satisfies the “variance” bound

(2)  E‖g̃(z) − g(z)‖*² ≤ L²·‖z − w₀‖²

for some L > 0, where g is the gradient mapping of f and w₀ is a reference point, then O((L/α)²) regularized stochastic mirror descent steps using g̃ solve (1) in a suitable probabilistic sense. We call unbiased gradient estimators that satisfy (2) “centered”.

Our second contribution is the construction of “centered” gradient estimators for ℓ₁-ℓ₁ and ℓ₂-ℓ₁ bilinear games, where f(x, y) = yᵀAx. Our estimator has the following form. Suppose we wish to estimate g^x(y) = Aᵀy (the gradient of f w.r.t. x), and we already have g^x(y₀) = Aᵀy₀. Let p(y₀, y) be some distribution over {1, …, m}, draw i ∼ p(y₀, y) and set

g̃^x(y) = Aᵀy₀ + ((y_i − [y₀]_i) / p_i(y₀, y))·A_{i:},

where A_{i:} is the i-th row of A (i.e. the i-th column of Aᵀ). This form is familiar from variance reduction techniques [18, 42, 1], which typically use a fixed distribution p. In our setting, however, a fixed p will not produce sufficiently low variance. Departing from prior variance-reduction work and building on [15, 5], we choose p based on y₀ and y according to

p_i(y₀, y) = |y_i − [y₀]_i| / ‖y − y₀‖₁,

yielding exactly the variance bound we require. We call this technique “sampling from the difference.”

For our ℓ₂ gradient estimator, we sample from the squared difference, choosing x-block coordinate j ∼ q(x₀, x), where

q_j(x₀, x) = (x_j − [x₀]_j)² / ‖x − x₀‖₂².
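To make the construction concrete, here is a minimal NumPy sketch of the two sampling rules and the resulting single-sample estimator of g^x(y) = Aᵀy (the function names and the one-sample interface are ours, for illustration only):

```python
import numpy as np

def sample_from_difference(y, y0):
    """p(y0, y) with p_i proportional to |y_i - [y0]_i| (simplex block)."""
    d = np.abs(y - y0)
    return d / d.sum()

def sample_from_squared_difference(x, x0):
    """q(x0, x) with q_j proportional to (x_j - [x0]_j)^2 (Euclidean-ball block)."""
    d = (x - x0) ** 2
    return d / d.sum()

def estimate_grad_x(A, y, y0, g0, rng):
    """Single-sample estimate of g^x(y) = A^T y, centered at y0 with g0 = A^T y0.

    Unbiased: E[estimate] = A^T y0 + sum_i p_i * (y_i - [y0]_i) / p_i * A[i, :] = A^T y.
    """
    if np.array_equal(y, y0):
        return g0  # at the reference point the stored gradient is exact
    p = sample_from_difference(y, y0)
    i = rng.choice(len(y), p=p)
    return g0 + (y[i] - y0[i]) / p[i] * A[i, :]

# Example usage:
# rng = np.random.default_rng(0); A = rng.standard_normal((100, 50))
# y0 = np.full(100, 0.01); y = rng.dirichlet(np.ones(100))
# g_tilde = estimate_grad_x(A, y, y0, A.T @ y0, rng)
```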
To strengthen our results for ℓ₂-ℓ₁ games, we consider a refined version of the “centered” criterion (2) which allows regret analysis using local norms [37, 5]. To further facilitate this analysis we follow [5] and introduce gradient clipping. We extend our proofs to show that stochastic regularized mirror descent can solve (1) despite the (distance-bounded) bias caused by gradient clipping.

Our gradient estimators attain the bound (2) with L equal to the Lipschitz constant of g. Specifically,

(3)  L = ‖A‖_max := max_{i,j} |A_{ij}| in the ℓ₁-ℓ₁ setup,  and  L = max_i ‖A_{i:}‖₂ in the ℓ₂-ℓ₁ setup.
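In NumPy terms these constants read as follows (a two-line sketch; the matrix is an arbitrary stand-in):

```python
import numpy as np

A = np.array([[1.0, -2.0, 0.5], [0.0, 3.0, -1.0]])  # arbitrary example matrix
L_l1_l1 = np.abs(A).max()                  # ||A||_max = max_ij |A_ij|
L_l2_l1 = np.linalg.norm(A, axis=1).max()  # max_i ||A_i:||_2
```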

1.2 Method complexity compared with prior art

As per the discussion above, to achieve accuracy ϵ our algorithm solves Õ(α/ϵ) subproblems. Each subproblem takes O(nnz(A)) time for computing two exact gradients (one for variance reduction and one for an extragradient step), plus an additional O((L/α)²·(n+m)) time for the inner mirror descent iterations, with L as in (3). The total runtime is therefore

Õ( (nnz(A) + (L/α)²·(n+m)) · (α/ϵ) ).

By setting α optimally to be max{ϵ, L·√((n+m)/nnz(A))}, we obtain the runtime

(4)  Õ( nnz(A) + √(nnz(A)·(n+m)) · (L/ϵ) ).
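To see where the optimal choice of α comes from, note that the two per-subproblem cost terms balance when

nnz(A) = (L/α)²·(n+m),  i.e.  α = L·√((n+m)/nnz(A));

since the method must solve at least one subproblem, α is additionally clamped below by ϵ, giving α = max{ϵ, L·√((n+m)/nnz(A))} and, upon substitution, the runtime (4).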
Comparison with mirror-prox and dual extrapolation.

Nemirovski [27] instantiates his conceptual prox-method by solving the relaxed proximal problem (1) with α = L in time O(nnz(A)), where L is the Lipschitz constant of g, as given in (3). The total complexity of the resulting method is therefore

(5)  Õ( nnz(A) · (L/ϵ) ).

The closely related dual extrapolation method of Nesterov [30] attains the same rate of convergence. We refer to the running time (5) as linear since it scales linearly with the problem description size nnz(A). Our running time guarantee (4) is never worse than (5) by more than a constant factor, and improves on (5) when nnz(A) ≫ n + m, i.e. whenever A is not extremely sparse. In that regime, our method uses α ≪ L, hence solving a harder version of (1) than is possible for mirror-prox.

Comparison with sublinear-time methods.

Using a randomized algorithm, Grigoriadis and Khachiyan [15] solve ℓ₁-ℓ₁ bilinear games in time

(6)  Õ( (n + m) · (L/ϵ)² ),

and Clarkson et al. [5] extend this result to ℓ₂-ℓ₁ bilinear games, with the values of L as in (3). Since these runtimes scale with n + m rather than nnz(A), we refer to them as sublinear. Our guarantee (4) improves on the guarantee (6) when (n + m)·(L/ϵ)² ≳ nnz(A), i.e. whenever (6) is not truly sublinear.

Our method carefully balances linear-time extragradient steps with cheap sublinear-time stochastic gradient steps. Consequently, our runtime guarantee (4) inherits strengths from both the linear and sublinear runtimes. First, our runtime scales linearly with L/ϵ rather than quadratically, as does the linear runtime (5). Second, while our runtime is not strictly sublinear, its component proportional to L/ϵ is √(nnz(A)·(n+m)), which is sublinear in nnz(A). Overall, our method offers the best runtime guarantee in the literature in the regime

√(nnz(A)·(n+m)) / max{n, m}^ω ≲ ϵ/L ≲ √((n+m)/nnz(A)),

where the lower bound on ϵ is due to the best known theoretical runtimes of interior point methods: Õ(max{n, m}^ω) [6] and Õ((nnz(A) + min{n, m}²)·√(min{n, m})) [20], where ω < 2.38 is the (current) matrix multiplication exponent.

In the square dense case (i.e. nnz(A) = n² with m = n), we improve on the linear runtime (5) by a factor of √n, the same improvement that optimal variance-reduced finite-sum minimization methods achieve over the fast gradient method [42, 1].

1.3 Additional contributions

We extend our development in three ways. First, we show how to combine restarting with variance reduction in order to compute exact proximal points to high accuracy. This technique applies to any function that admits a centered gradient estimator (rather than only the bilinear functions considered so far). Second, we describe an extension of our results to “composite” saddle point problems of the form min_{x∈X} max_{y∈Y} { f(x, y) + φ(x) − ψ(y) }, where f admits a centered gradient estimator and φ, ψ are “simple” convex functions. Third, we describe a number of alternative centered gradient estimators for the bilinear objective in the ℓ₁ and ℓ₂ geometries. In particular, for the ℓ₂ case we show that “sampling from the sum,” i.e. setting q_j(x₀, x) ∝ [x₀]_j² + x_j², also works.

1.4 Related work

Matrix games, the canonical form of discrete zero-sum games, have long been studied in economics [31]. It is well-known that the classical mirror descent (i.e. no-regret) method yields an algorithm with running time Õ(nnz(A)·(L/ϵ)²) [29]. Subsequent work [15, 27, 30, 5] improves this runtime as described above. Our work builds on the extragradient scheme of Nemirovski [27] as well as the gradient estimation and clipping technique of Clarkson et al. [5].

Palaniappan and Bach [32] apply standard variance reduction [18] to bilinear ℓ₂-ℓ₂ games by sampling elements proportional to squared matrix entries. Using proximal-point acceleration they obtain a runtime of Õ(nnz(A) + √(nnz(A)·(n+m))·‖A‖_F/ϵ), a rate we recover using our algorithm (Section 4.3). However, in this setting the mirror-prox method has runtime Õ(nnz(A)·‖A‖₂/ϵ), where ‖A‖₂ is the spectral norm; this may be better than the result of [32] due to the discrepancy in the norms of A, since ‖A‖_F can exceed ‖A‖₂ by a factor as large as √(min{n, m}). Naive application of [32] to ℓ₁ domains results in even greater potential losses. Shi et al. [38] extend the method of [32] to smooth functions using general Bregman divergences, but their extension is unaccelerated and appears limited to a strictly slower rate.

Chavdarova et al. [4] propose a variance-reduced extragradient method with applications to generative adversarial training. In contrast to our algorithm, which performs extragradient steps in the outer loop, the method of [4] performs stochastic extragradient steps in the inner loop, using finite-sum variance reduction as in [18]. Chavdarova et al. [4] analyze their method in the convex-concave setting, showing improved stability over direct application of the extragradient method to noisy gradients. However, their complexity guarantees are worse than those of linear-time methods. Following up on [4], Mishchenko et al. [25] propose to reduce the variance of the stochastic extragradient method by using the same stochastic sample for both the gradient and extragradient steps. In the Euclidean strongly convex case, they show a convergence guarantee with a relaxed variance assumption, and in the noiseless full-rank bilinear case they recover the guarantees of [26]. In the general convex case, however, they only show an ϵ⁻² rate of convergence.

1.5 Paper outline

We define our notation in Section 2. In Section 3.1, we review Nemirovski’s conceptual mirror-prox method and introduce the notion of a relaxed proximal oracle; we implement such an oracle using variance-reduced gradient estimators in Section 3.2. In Section 4, we construct these gradient estimators for the ℓ₁-ℓ₁ and ℓ₂-ℓ₁ as well as ℓ₂-ℓ₂ domain settings, and complete the analyses of the corresponding algorithms. Finally, in Section 5 we give the additional contributions described in Section 1.3 above.

2 Notation

Problem setup.

A setup is the triplet (Z, ‖·‖, r) where: (i) Z is a compact and convex subset of ℝⁿ × ℝᵐ, (ii) ‖·‖ is a norm on Z and (iii) r is 1-strongly-convex w.r.t. Z and ‖·‖, i.e. such that r(z') ≥ r(z) + ⟨∇r(z), z' − z⟩ + ½·‖z' − z‖² for all z, z' ∈ Z.² We call r the distance generating function and denote the Bregman divergence associated with it by

V_z(z') := r(z') − r(z) − ⟨∇r(z), z' − z⟩.

We also denote Θ := max_{z'∈Z} r(z') − min_{z∈Z} r(z) and assume it is finite.

² For non-differentiable r, we take ∇r(z) to be an arbitrary element of ∂r(z), where ∂r(z) is the subdifferential of r at z.
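As a concrete instance of this notation, here is a minimal Python sketch with the negative-entropy distance generating function used later for simplex domains (the helper names are ours):

```python
import numpy as np

def neg_entropy(z):
    """Distance generating function r(z) = sum_i z_i log z_i."""
    return float(np.sum(z * np.log(z)))

def bregman_divergence(z, z_prime):
    """V_z(z') = r(z') - r(z) - <grad r(z), z' - z>.

    For r = negative entropy and z, z' in the simplex this is exactly the
    KL divergence sum_i z'_i log(z'_i / z_i)."""
    grad_r = np.log(z) + 1.0
    return neg_entropy(z_prime) - neg_entropy(z) - float(grad_r @ (z_prime - z))
```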

Norms and dual norms.

We write Z* for the set of linear functions on Z. For γ ∈ Z* we define the dual norm of ‖·‖ as ‖γ‖* := max_{‖z‖≤1} ⟨γ, z⟩. For p ≥ 1 we write the ℓ_p norm ‖z‖_p = (Σ_i |z_i|^p)^{1/p}, with ‖z‖_∞ = max_i |z_i|. The dual norm of ℓ_p is ℓ_q with q = p/(p−1).

Domain components.

We assume Z is of the form X × Y for convex and compact sets X ⊂ ℝⁿ and Y ⊂ ℝᵐ. Particular sets of interest are the simplex Δ^k = {v ∈ ℝ^k : v ≥ 0, ‖v‖₁ = 1} and the Euclidean ball B^k = {v ∈ ℝ^k : ‖v‖₂ ≤ 1}. For any vector z in ℝ^{n+m}, we write z^x and z^y for the first n and last m coordinates of z, respectively.

When totally clear from context, we sometimes refer to the x and y components of z directly as x and y. We write the i-th coordinate of a vector v as [v]_i.

Matrices.

We consider a matrix A ∈ ℝ^{m×n} and write nnz(A) for the number of its nonzero entries. For i ∈ [m] and j ∈ [n] we write A_{i:}, A_{:j} and A_{ij} for the corresponding row, column and entry, respectively.³ We consider the matrix norms ‖A‖_max := max_{i,j} |A_{ij}|, ‖A‖_{p→q} := max_{‖x‖_p ≤ 1} ‖Ax‖_q and ‖A‖_F := (Σ_{i,j} A_{ij}²)^{1/2}.

³ For k ∈ ℕ, we let [k] := {1, …, k}.

3 Primal-dual variance reduction framework

In this section, we establish a framework for solving the saddle point problem

min_{x∈X} max_{y∈Y} f(x, y),

where f is convex in x and concave in y, and admits a (variance-reduced) stochastic estimator for the continuous and monotone⁴ gradient mapping

g(z) = g(x, y) := (∇_x f(x, y), −∇_y f(x, y)).

Our goal is to find an ϵ-approximate saddle point (Nash equilibrium), i.e. z ∈ Z such that

(7)  Gap(z) := max_{y'∈Y} f(z^x, y') − min_{x'∈X} f(x', z^y) ≤ ϵ.

We achieve this by generating a sequence z₁, …, z_K such that (1/K)·Σ_{k=1}^{K} ⟨g(z_k), z_k − u⟩ ≤ ϵ for every u ∈ Z and using the fact that

(8)  Gap(z̄_K) ≤ max_{u∈Z} (1/K)·Σ_{k=1}^{K} ⟨g(z_k), z_k − u⟩,  where z̄_K := (1/K)·Σ_{k=1}^{K} z_k,

due to convexity-concavity of f (see proof in Appendix A.1).

⁴ A mapping q : Z → Z* is monotone if and only if ⟨q(z') − q(z), z' − z⟩ ≥ 0 for all z, z' ∈ Z; g is monotone due to convexity-concavity of f.

In Section 3.1 we define the notion of a (randomized) relaxed proximal oracle, and describe how Nemirovski’s mirror-prox method leverages it to solve the above saddle point problem. In Section 3.2 we define a class of centered gradient estimators, whose variance is proportional to the squared distance from a reference point. Given such a centered gradient estimator, we show that a regularized stochastic mirror descent scheme constitutes a relaxed proximal oracle. For a technical reason, we limit our oracle guarantee in Section 3.2 to the bilinear case f(x, y) = yᵀAx, which suffices for the applications in Section 4. We lift this limitation in Section 5.1, where we show a different oracle implementation that is valid for general convex-concave f, with only a logarithmic increase in complexity.

3.1 The mirror-prox method with a randomized oracle

Recall that we assume the space Z is equipped with a norm ‖·‖ and a distance generating function r : Z → ℝ that is 1-strongly-convex w.r.t. ‖·‖ and has range Θ. We write the induced Bregman divergence as V_z(z') = r(z') − r(z) − ⟨∇r(z), z' − z⟩. We use the following fact throughout the paper: by definition, the Bregman divergence satisfies, for any z, z', u ∈ Z,

(9)  −⟨∇V_z(z'), z' − u⟩ = V_z(u) − V_{z'}(u) − V_z(z').
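To verify (9), expand all three divergences via V_z(z') = r(z') − r(z) − ⟨∇r(z), z' − z⟩; every r term cancels, leaving

V_z(u) − V_{z'}(u) − V_z(z') = ⟨∇r(z') − ∇r(z), u − z'⟩ = −⟨∇V_z(z'), z' − u⟩,

since ∇V_z(z') = ∇r(z') − ∇r(z).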

For any α > 0 we define the α-proximal mapping Prox_z(g) to be the solution of the variational inequality corresponding to the strongly monotone operator g + α·∇V_z(·), i.e. the unique z_α ∈ Z such that ⟨g(z_α) + α·∇V_z(z_α), z_α − u⟩ ≤ 0 for all u ∈ Z [cf. 10]. Equivalently (by (9)),

(10)  Prox_z(g) := the unique z_α ∈ Z such that ⟨g(z_α), z_α − u⟩ ≤ α·V_z(u) − α·V_{z_α}(u) − α·V_z(z_α) ≤ α·V_z(u) − α·V_{z_α}(u) for all u ∈ Z.

When g is the gradient mapping of a convex-concave f, Prox_z(g) is also the unique solution of the saddle point problem

min_{x'∈X} max_{y'∈Y} { f(x', y') + α·V_{z^x}(x') − α·V_{z^y}(y') }.

Consider iterations of the form z_k = Prox_{z_{k−1}}(g), with z₀ = argmin_{z∈Z} r(z). Averaging the definition (10) over k, using the bound (8) and the nonnegativity of Bregman divergences gives

Gap(z̄_K) ≤ max_{u∈Z} (1/K)·Σ_{k=1}^{K} ⟨g(z_k), z_k − u⟩ ≤ max_{u∈Z} (α/K)·(V_{z₀}(u) − V_{z_K}(u)) ≤ αΘ/K.

Thus, we can find an ϵ-suboptimal point in K = αΘ/ϵ exact proximal steps. However, computing Prox_z(g) exactly may be as difficult as solving the original problem. Nemirovski [27] proposes a relaxation of the exact proximal mapping, which we slightly extend to include the possibility of randomization, and formalize in the following definition.

Definition 1 ((α, ε)-relaxed proximal oracle).

Let g be a monotone operator and α, ε > 0. An (α, ε)-relaxed proximal oracle for g is a (possibly randomized) mapping O : Z → Z such that z' = O(z) satisfies

E[ max_{u∈Z} { ⟨g(z'), z' − u⟩ − α·V_z(u) } ] ≤ ε.

Note that O(z) = Prox_z(g) is an (α, 0)-relaxed proximal oracle. Algorithm 1 describes the mirror-prox method of Nemirovski [27], which recovers (up to the additive ε) the αΘ/K error guarantee of K exact proximal steps. The k-th iteration consists of (i) a relaxed proximal oracle call producing z_{k−1/2} = O(z_{k−1}), and (ii) a linearized proximal (mirror) step where we replace g with the constant function g(z_{k−1/2}), producing z_k = Prox_{z_{k−1}}(g(z_{k−1/2})). We now state and prove the convergence guarantee for the mirror-prox method, first shown in [27].

Input: (α, ε)-relaxed proximal oracle O(z) for gradient mapping g, distance-generating function r
Parameters: Number of iterations K
Output: Point z̄_K with E[Gap(z̄_K)] ≤ αΘ/K + ε
1  for k = 1, …, K do
2      z_{k−1/2} ← O(z_{k−1})   ▷ We implement O(z_{k−1}) by calling Algorithm 2
3      z_k ← Prox_{z_{k−1}}(g(z_{k−1/2})) = argmin_{z∈Z} { ⟨g(z_{k−1/2}), z⟩ + α·V_{z_{k−1}}(z) }
4  return z̄_K = (1/K)·Σ_{k=1}^{K} z_{k−1/2}
Algorithm 1 (mirror-prox [27])
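For concreteness, here is a minimal Python sketch of the outer loop (the callable interfaces and names are ours, not the paper's):

```python
import numpy as np

def mirror_prox(g, prox, oracle, z0, K):
    """Algorithm 1: mirror-prox with a relaxed proximal oracle.

    g      -- gradient mapping z -> g(z)
    prox   -- (z, gamma) -> argmin_u <gamma, u> + alpha * V_z(u), i.e. the
              proximal step for a *constant* operator gamma centered at z
    oracle -- relaxed proximal oracle O(z), e.g. the inner loop of Algorithm 2
    """
    z = z0
    half_iterates = []
    for _ in range(K):
        z_half = oracle(z)        # (i) relaxed proximal oracle call
        half_iterates.append(z_half)
        z = prox(z, g(z_half))    # (ii) linearized proximal (extragradient) step
    return np.mean(half_iterates, axis=0)  # average of the half-iterates
```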
Proposition 1 (Mirror prox convergence via oracles).

Let O be an (α, ε)-relaxed proximal oracle with respect to gradient mapping g and distance-generating function r with range at most Θ. Let z_{1/2}, z_{3/2}, …, z_{K−1/2} be the iterates of Algorithm 1 and let z̄_K be its output. Then

E[Gap(z̄_K)] ≤ αΘ/K + ε.

Proof.

Fix iteration k, and note that by the definition (10), z_k = Prox_{z_{k−1}}(g(z_{k−1/2})) satisfies

⟨g(z_{k−1/2}), z_k − u⟩ ≤ α·(V_{z_{k−1}}(u) − V_{z_k}(u) − V_{z_{k−1}}(z_k))  for all u ∈ Z.

Summing over k, writing ⟨g(z_{k−1/2}), z_k − u⟩ = ⟨g(z_{k−1/2}), z_{k−1/2} − u⟩ − ⟨g(z_{k−1/2}), z_{k−1/2} − z_k⟩ and rearranging yields

(1/K)·Σ_{k=1}^{K} ⟨g(z_{k−1/2}), z_{k−1/2} − u⟩ ≤ (α/K)·(V_{z₀}(u) − V_{z_K}(u)) + (1/K)·Σ_{k=1}^{K} [ ⟨g(z_{k−1/2}), z_{k−1/2} − z_k⟩ − α·V_{z_{k−1}}(z_k) ]

for all u ∈ Z. Note that since z₀ minimizes r, we have V_{z₀}(u) ≤ Θ for all u. Therefore, maximizing the above display over u and afterwards taking expectation gives

E[ max_{u∈Z} (1/K)·Σ_{k=1}^{K} ⟨g(z_{k−1/2}), z_{k−1/2} − u⟩ ] ≤ αΘ/K + (1/K)·Σ_{k=1}^{K} E[ ⟨g(z_{k−1/2}), z_{k−1/2} − z_k⟩ − α·V_{z_{k−1}}(z_k) ].

Finally, by Definition 1, E[⟨g(z_{k−1/2}), z_{k−1/2} − z_k⟩ − α·V_{z_{k−1}}(z_k)] ≤ ε for every k, since the term inside the expectation is at most max_{u∈Z} { ⟨g(z_{k−1/2}), z_{k−1/2} − u⟩ − α·V_{z_{k−1}}(u) }, and the result follows by using the bound (8). ∎

3.2 Implementation of an (α, 0)-relaxed proximal oracle

We now explain how to use stochastic variance-reduced gradient estimators to design an efficient (α, 0)-relaxed proximal oracle. We begin by introducing the bias and variance properties of the estimators we require.

Definition 2.

Let w₀ ∈ Z and L > 0. A stochastic gradient estimator g̃_{w₀} : Z → Z* is called (w₀, L)-centered for g if for all z ∈ Z

  1. E[g̃_{w₀}(z)] = g(z),

  2. E‖g̃_{w₀}(z) − g(w₀)‖*² ≤ L²·‖z − w₀‖².

Lemma 1.

A (w₀, L)-centered estimator for g satisfies E‖g̃_{w₀}(z) − g(z)‖*² ≤ (2L)²·‖z − w₀‖².

Proof.

Writing g̃ = g̃_{w₀}(z), we have E[g̃] = g(z) by the first centered estimator property. Therefore,

E‖g̃ − g(z)‖*² = E‖(g̃ − g(w₀)) − E[g̃ − g(w₀)]‖*² ≤ 2·E‖g̃ − g(w₀)‖*² + 2·‖E[g̃ − g(w₀)]‖*² ≤ 4·E‖g̃ − g(w₀)‖*² ≤ (2L)²·‖z − w₀‖²,

where the bounds follow from the triangle inequality, Jensen’s inequality and the second centered estimator property. ∎

Remark 1.

A gradient mapping g that admits a (w₀, L)-centered gradient estimator for every w₀ ∈ Z is L-Lipschitz, since by Jensen’s inequality and the second centered estimator property we have for all w₀, z ∈ Z

‖g(z) − g(w₀)‖* = ‖E[g̃_{w₀}(z)] − g(w₀)‖* ≤ (E‖g̃_{w₀}(z) − g(w₀)‖*²)^{1/2} ≤ L·‖z − w₀‖.

Remark 2.

Definition 2 bounds the gradient variance using the distance to the reference point. Similar bounds are useful for finding stationary points in smooth nonconvex finite-sum problems [2, 33, 11, 43]. However, known variance reduction methods for smooth convex finite-sum minimization require stronger bounds [cf. 1, Section 2.1].

With the variance bounds defined, we describe Algorithm 2, which (for the bilinear case) implements a relaxed proximal oracle. The algorithm is essentially standard stochastic mirror descent, except for an additional regularization term around the initial point w₀. Note that we do not perform extragradient steps in this stochastic method. When combined with a centered gradient estimator, the iterates of Algorithm 2 provide the following guarantee, which is one of our key technical contributions.

Input: Initial point w₀ ∈ Z, gradient estimator g̃_{w₀}, oracle quality α > 0
Parameters: Step size η, number of iterations T
Output: Point w̄_T satisfying Definition 1 (for appropriate g̃_{w₀}, α, T)
1  for t = 1, …, T do
2      w_t ← argmin_{w∈Z} { ⟨g̃_{w₀}(w_{t−1}), w⟩ + α·V_{w₀}(w) + (1/η)·V_{w_{t−1}}(w) }
3  return w̄_T = (1/T)·Σ_{t=1}^{T} w_t
Algorithm 2 (relaxed proximal oracle via regularized stochastic mirror descent)
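A minimal Python sketch of this inner loop for the entropic (simplex) setup of Section 4.1, where the regularized mirror step has a closed multiplicative-weights form; the parameter constants are illustrative stand-ins for the settings in Proposition 2:

```python
import numpy as np

def inner_loop(w0, grad_est, alpha, L, T=None):
    """Algorithm 2 sketch: regularized stochastic mirror descent on a simplex.

    grad_est(w) -- a (w0, L)-centered stochastic estimate of g(w)
    """
    eta = alpha / (10 * L ** 2)               # step size, eta = Theta(alpha / L^2)
    T = T or int(np.ceil(4 / (eta * alpha)))  # T = Theta((L / alpha)^2) iterations
    w, avg = w0.copy(), np.zeros_like(w0)
    for _ in range(T):
        gt = grad_est(w)
        # w_t = argmin_w' <gt, w'> + alpha*V_{w0}(w') + (1/eta)*V_{w_{t-1}}(w');
        # with negative entropy this is a normalized multiplicative-weights update.
        logits = (np.log(w) + eta * alpha * np.log(w0) - eta * gt) / (1 + eta * alpha)
        w = np.exp(logits - logits.max())
        w /= w.sum()
        avg += w
    return avg / T  # the averaged iterate implements the relaxed proximal oracle
```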
Proposition 2.

Let α, L > 0, let w₀ ∈ Z, and let g̃_{w₀} be (w₀, L)-centered for monotone g. Then, for step size η = α/(10·L²) and number of iterations T ≥ 4/(η·α), the iterates of Algorithm 2 satisfy

(11)  E[ max_{u∈Z} { (1/T)·Σ_{t=1}^{T} ⟨g(w_t), w_t − u⟩ − α·V_{w₀}(u) } ] ≤ 0.

Before discussing the proof of Proposition 2, we state how it implies the relaxed proximal oracle property for the bilinear case.

Corollary 1.

Let f(x, y) = yᵀAx and let g(z) = (Aᵀz^y, −A z^x) be the associated gradient mapping. Then, in the setting of Proposition 2, the mapping O(w₀) = w̄_T given by Algorithm 2 is an (α, 0)-relaxed proximal oracle.

Proof.

Note that ⟨g(z), z⟩ = (z^x)ᵀAᵀz^y − (z^y)ᵀA z^x = 0 for any z, and consequently ⟨g(z), z − u⟩ = −⟨g(z), u⟩ is linear in z. Therefore, the iterates w₁, …, w_T of Algorithm 2 and its output w̄_T satisfy, for every u ∈ Z,

⟨g(w̄_T), w̄_T − u⟩ = −⟨g(w̄_T), u⟩ = −(1/T)·Σ_{t=1}^{T} ⟨g(w_t), u⟩ = (1/T)·Σ_{t=1}^{T} ⟨g(w_t), w_t − u⟩.

Substituting into the bound (11) yields the (α, 0)-relaxed proximal oracle property in Definition 1. ∎

More generally, the proof of Corollary 1 shows that Algorithm 2 implements a relaxed proximal oracle whenever z ↦ ⟨g(z), z − u⟩ is convex for every u ∈ Z. In Section 5.1 we implement an (α, ε)-relaxed proximal oracle without such an assumption.

The proof of Proposition 2 is a somewhat lengthy application of existing techniques for stochastic mirror descent analysis in conjunction with Definition 2. We give it in full in Appendix B and review the main steps here.

Regret bound.

Viewing the iterations of Algorithm 2 as stochastic mirror descent with stochastic gradients g̃_{w₀}(w_{t−1}) and composite term α·V_{w₀}(w), the standard mirror descent regret bound (see Lemma 12 in Appendix A.2) gives

(12)  Σ_{t=1}^{T} [ ⟨g̃_{w₀}(w_{t−1}), w_{t−1} − u⟩ + α·V_{w₀}(w_t) − α·V_{w₀}(u) ] ≤ (1/η)·V_{w₀}(u) + (η/2)·Σ_{t=1}^{T} ‖g̃_{w₀}(w_{t−1})‖*²

deterministically for all u ∈ Z.

Regularization.

Substituting the equality (9), dividing by T and rearranging gives

(13)  (1/T)·Σ_{t=1}^{T} ⟨g̃_{w₀}(w_{t−1}), w_{t−1} − u⟩ − α·V_{w₀}(u) ≤ (1/(η·T))·V_{w₀}(u) + (1/T)·Σ_{t=1}^{T} [ (η/2)·‖g̃_{w₀}(w_{t−1})‖*² − α·V_{w₀}(w_t) ],

and taking T ≥ 2/(η·α) guarantees (1/(η·T))·V_{w₀}(u) ≤ (α/2)·V_{w₀}(u) for all u ∈ Z, so this term costs only a constant factor in α.

Variance bound.

Using the second centered gradient estimator property and strong convexity of the distance generating function, we have

E‖g̃_{w₀}(w_{t−1}) − g(w₀)‖*² ≤ L²·‖w_{t−1} − w₀‖² ≤ 2·L²·V_{w₀}(w_{t−1})

for every t ∈ [T]. For η small relative to α/L², these variance terms are dominated in expectation by the negative regularization terms in (13); the full proof in Appendix B also accounts for the constant offset g(w₀) and the index shift between w_{t−1} and w_t. Since the RHS of (13) is then nonpositive in expectation and the gradient estimator is unbiased, we have

E[ (1/T)·Σ_{t=1}^{T} ⟨g(w_{t−1}), w_{t−1} − u⟩ − α·V_{w₀}(u) ] ≤ 0  for every fixed u ∈ Z.

Exchanging maximum and expectation.

When u depends on the randomness in the iterates (as the maximizing u does), we generally have E⟨g̃_{w₀}(w_{t−1}) − g(w_{t−1}), w_{t−1} − u⟩ ≠ 0. To address this issue we use a technique due to Nemirovski et al. [28]. Writing δ_t := g̃_{w₀}(w_{t−1}) − g(w_{t−1}) and defining the “ghost iterates” s_t := argmin_{s∈Z} { ⟨−δ_t, s⟩ + (1/η)·V_{s_{t−1}}(s) } with s₀ = w₀, we rewrite ⟨δ_t, w_{t−1} − u⟩ as ⟨δ_t, w_{t−1} − s_{t−1}⟩ + ⟨δ_t, s_{t−1} − u⟩. Since s_{t−1} does not depend on the randomness in g̃_{w₀}(w_{t−1}), we have E⟨δ_t, w_{t−1} − s_{t−1}⟩ = 0. To handle the terms ⟨δ_t, s_{t−1} − u⟩ we use the standard mirror descent regret bound again, absorbing the result into the RHS of (13) using our choice of η and the bound E‖δ_t‖*² ≤ (2L)²·‖w_{t−1} − w₀‖², which follows from Lemma 1.

4 Application to bilinear saddle point problems

We now construct centered gradient estimators (as per Definition 2) for the linear gradient mapping

g(z) = g(x, y) = (Aᵀy, −Ax).

We consider two domain types, namely ℓ₁ (the simplex) and ℓ₂ (the Euclidean ball). In Section 4.1 we present a centered gradient estimator and resulting runtime guarantees for ℓ₁-ℓ₁ games. In Section 4.2 we first give a centered gradient estimator for ℓ₂-ℓ₁ games with a suboptimal constant (larger than the Lipschitz constant of g). We then obtain the correct Lipschitz constant dependence using a local norms analysis, which requires clipping the gradient estimates in order to control the magnitude of the updates. Finally, in Section 4.3 we give a gradient estimator for ℓ₂-ℓ₂ games. Unlike in the previous two setups, the estimator constant for ℓ₂-ℓ₂ games does not match the Lipschitz constant of the underlying gradient mapping. Such a mismatch is consistent with prior findings in the literature.

Throughout this section, we let w₀ = (x₀, y₀) denote the “center” (i.e. reference point) of our stochastic gradient estimator and consider a general query point z = (x, y). We also recall the notation [v]_i for the i-th entry of a vector v.

4.1 ℓ₁-ℓ₁ games

Setup.

Denoting the d-dimensional simplex by Δ^d, we let X = Δⁿ, Y = Δᵐ and Z = X × Y. We take ‖·‖ to be the ℓ₁ norm with conjugate norm ‖·‖_∞. We take the distance generating function r to be the negative entropy, i.e. r(z) = Σ_i [z]_i·log [z]_i. We note that both ‖·‖ and r are separable and in particular separate over the x and y blocks of z. Finally we set

L = ‖A‖_max = max_{i,j} |A_{ij}|,

and note that this is the Lipschitz constant of the gradient mapping g under the chosen norm.
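In this setup the gap (7) also has a simple closed form, since a linear function over a simplex is extremized at a vertex; a short sketch:

```python
import numpy as np

def matrix_game_gap(A, x, y):
    """Gap(x, y) = max_{y' in simplex} y'^T A x - min_{x' in simplex} y^T A x'.

    Both inner problems are linear over a simplex, hence attained at a
    coordinate vector, reducing them to a coordinatewise max and min."""
    return float((A @ x).max() - (A.T @ y).min())
```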

4.1.1 Gradient estimator

Given w₀ = (x₀, y₀) and z = (x, y), we describe the reduced-variance gradient estimator g̃_{w₀}(z). First, we define the probabilities p(z) ∈ Δᵐ and q(z) ∈ Δⁿ according to

(14)  p_i(z) = |y_i − [y₀]_i| / ‖y − y₀‖₁  (i ∈ [m])   and   q_j(z) = |x_j − [x₀]_j| / ‖x − x₀‖₁  (j ∈ [n]).

To compute