A large number of core problems in statistics, optimization, and machine learning can be framed as the solution of a two-player zero-sum game. Linear programs, for example, can be viewed as a competition between a feasibility player, who selects a candidate point, and a constraint player who aims to check for feasibility violations [Adler2013]. Boosting [freund1999adaptive] can be viewed as the competition between an agent that selects hard distributions and a weak learning oracle that aims to overcome such challenges [freund1996game]. The hugely popular technique of Generative Adversarial Networks (GANs) [goodfellow2014generative], which produce implicit generative models from unlabelled data, has been framed in terms of a repeated game, with a distribution player aiming to produce realistic samples and a discriminative player that seeks to distinguish real from fake.
While many vanilla supervised learning problems reduce to finding the minimum of an objective function $f(\cdot)$ over some constraint set, tasks that require the search for a saddle point (that is, a min-max solution of some convex-concave payoff function $g(\cdot,\cdot)$) do not easily lend themselves to standard optimization protocols such as gradient descent or Newton's method. It is not clear, for example, whether successive iterates should even increase or decrease the payoff $g$. This issue has been noticed in the training of GANs, where the standard update method is a simultaneous gradient descent procedure, and many practitioners have raised concerns about cycling.
On the other hand, what has emerged as a very popular and widely-used trick is the following: simulate a pair of online learning algorithms, each competing in the game with the objective of minimizing regret, and return the time-averaged sequence of actions taken by the players as an approximate solution. The method of applying no-regret learning strategies to find equilibria in zero-sum games was explored in [freund1999adaptive], yet the idea goes back at least as far as work by [blackwell1956analog] and [hannan1957approximation]. This methodology has several major benefits. First, this framework "decouples" the optimization into two parallel routines that have very little communication overhead. Second, the use of no-regret learning is ideal in this scenario, as most of the guarantees for such algorithms are robust to even adversarial environments. Third, one is able to bound the approximation error of the returned saddle point simply in terms of the total regret of the two players. Finally, several surprising recent results have suggested that this parallel online learning methodology leads to even stronger guarantees than the naïve theory would suggest. In short, whereas the typical no-regret analysis would lead to an approximation error of $O(1/\sqrt{T})$ after $T$ iterations, the use of optimistic learning strategies [CJ12] can be shown to guarantee $O(1/T)$ convergence; this technique was developed by [RK13] and further expanded by [SALS15].
In this work we go further, showing that even faster rates are achievable for some specific cases of saddle-point problems. In particular:
[AW17] observed that the optimization method known as Frank-Wolfe is simply an instance of the above no-regret framework for solving a particular convex-concave game, leading to a rate of $O(1/T)$. In this work we further analyze the Frank-Wolfe game, and show that when the objective function and constraint set have additional structure, and both algorithms use optimistic learning procedures, then we can achieve a rate of $O(1/T^2)$. This generalizes a result of [D15], who proved a similar convergence rate for Frank-Wolfe.
Additionally, we show that when the game payoff function is suitably curved in both inputs (i.e. it is strongly-convex-concave and smooth), then we can use no-regret dynamics to achieve a linear rate, with the error decaying as $O(\exp(-T))$. Applying our technique to the Frank-Wolfe game we are able to recover the linear rate results of [LP66, DR70] and [Dunn(1979)].
A notable aspect of our work is the combination of several key algorithmic techniques. First, our Frank-Wolfe result relies on regularization using the squared gauge function, so that the learner needs only a single linear optimization call on each round. Second, we introduce a notion of weighted regret minimization, and our rates depend on the careful selection of the weight schedule as well as a careful analysis of what has been called Optimistic FollowTheRegularizedLeader. Third, our linear convergence rate leans on a trick developed recently by [L17] that generates an adaptive weighting scheme based on the norm of the observed gradients.
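We first provide some definitions that are used in this paper. Let $f(\cdot)$ be some function.
A vector $g_x$ is a subgradient of $f$ at $x$ if, for any $y$, $f(y) \ge f(x) + \langle g_x, y - x\rangle$.
$f$ is $L$-smooth w.r.t. a norm $\|\cdot\|$ if $f$ is everywhere differentiable and $f(y) \le f(x) + \langle\nabla f(x), y - x\rangle + \frac{L}{2}\|y - x\|^2$ for any $x, y$. An equivalent definition of smoothness is that $f$ has Lipschitz continuous gradient, i.e., $\|\nabla f(x) - \nabla f(y)\|_* \le L\|x - y\|$.
$f$ is $\sigma$-strongly convex w.r.t. a norm $\|\cdot\|$ if for any $x, y$, $f(y) \ge f(x) + \langle\nabla f(x), y - x\rangle + \frac{\sigma}{2}\|y - x\|^2$ for some constant $\sigma > 0$.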
For a convex function $f$, its Fenchel conjugate is $f^*(y) := \sup_{x \in \mathrm{dom}(f)} \langle x, y\rangle - f(x)$. Note that if $f$ is convex then so is its conjugate $f^*$, since it is defined as the maximum over linear functions of $y$ ([Boyd(2004)]). Moreover, when the function $f$ is strictly convex and the above supremum is attained, we have that $\nabla f^*(y) = \arg\max_x \langle x, y\rangle - f(x)$. Furthermore, the biconjugate $f^{**}$ equals $f$ if and only if $f$ is closed and convex. It is known that $f$ is $\sigma$-strongly convex w.r.t. $\|\cdot\|$ if and only if $f^*$ is $1/\sigma$-strongly smooth with respect to the dual norm $\|\cdot\|_*$ ([Kakade et al.(2009)Kakade, Shalev-shwartz, and Tewari]), assuming that $f$ is a closed and convex function.
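A convex set $K \subseteq \mathbb{R}^m$ is an $\alpha$-strongly convex set w.r.t. a norm $\|\cdot\|$ if for any $x, y \in K$ and any $\theta \in [0,1]$, the $\|\cdot\|$ ball centered at $\theta x + (1-\theta) y$ with radius $\theta(1-\theta)\frac{\alpha}{2}\|x - y\|^2$ is included in $K$. For examples of strongly-convex sets, we refer the readers to [D15].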
Let $K$ be any closed convex set which contains the origin. Then the gauge function of $K$ is $\gamma_K(x) := \inf\{c \ge 0 : x/c \in K\}$. One can show that the gauge function is a convex function (e.g. [Rockafellar(1996)]). It is known that several closed convex sets can lead to the same gauge function ([Bach(2013)]). But if a closed convex set $K$ contains the origin, then the gauge function $\gamma_K$ is unique and one has $K = \{x : \gamma_K(x) \le 1\}$. Furthermore, $\gamma_K(x) = 1$ for every point $x$ on the boundary of $K$.
Next we provide a characterization of sets based on their gauge function. [$\beta$-Gauge set] Let $K$ be a closed convex set which contains the origin. We say that $K$ is $\beta$-Gauge if its squared gauge function, $\gamma_K^2(\cdot)$, is $\beta$-strongly-convex. This property captures a wide class of constraints. Among these are $\ell_p$ balls, Schatten $p$ balls, and Group $(s,p)$ balls. We refer the reader to Appendix B for more details. Curiously, all of these Gauge sets are also known to be strongly-convex. We conjecture that strong-convexity and the Gauge property are equivalent.
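To make the gauge function concrete, the following is a minimal numerical sketch rather than anything taken from the paper. It assumes the set is available only through a membership test (the hypothetical `contains` callable) and evaluates the gauge by bisection on the scaling factor; the helper names and the search bracket `hi` are illustrative choices.

```python
import numpy as np

def gauge(x, contains, hi=1e6, tol=1e-9):
    """Approximate inf{c >= 0 : x / c in K} by bisection.

    `contains` is a membership oracle for a closed convex set K that
    contains the origin, so membership of x / c is monotone in c.
    Assumes x / hi already lies in K.
    """
    lo = 0.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if contains(x / mid):
            hi = mid   # x / mid is inside K, so the gauge is at most mid
        else:
            lo = mid   # x / mid is outside K, so the gauge exceeds mid
    return hi

in_l2_ball = lambda z: np.linalg.norm(z, 2) <= 1.0   # unit l2 ball
in_l1_ball = lambda z: np.linalg.norm(z, 1) <= 1.0   # unit l1 ball

g2 = gauge(np.array([3.0, 4.0]), in_l2_ball)   # gauge of a unit norm ball is the norm itself
g1 = gauge(np.array([1.0, -2.0]), in_l1_ball)
```

For a unit norm ball the gauge coincides with the norm, which gives a simple sanity check on the bisection.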
2 Minimizing Regret to Solve Games
Let us now turn our attention to a now-classical trick: using sequential no-regret learning strategies to find equilibria in zero-sum games.
2.1 Weighted Regret Minimization
We begin by briefly defining the standard online learning setup. We imagine a learner who must make a sequence of decisions, selecting at each round $t$ a point $z_t$ that lies within a convex and compact decision space $\mathcal{Z}$. After selecting $z_t$ she is charged $\ell_t(z_t)$ for her action, where $\ell_t(\cdot)$ is the loss function in round $t$, and she proceeds to the next round. Typically it is assumed that when the learner selects $z_t$ in round $t$, she has observed all loss functions $\ell_1(\cdot), \ldots, \ell_{t-1}(\cdot)$ up to, but not including, time $t$. However, we will also consider learners that are prescient, i.e. that can choose $z_t$ with knowledge of the loss functions up to and including time $t$.
The standard objective for adversarial online learning is the regret, defined as the difference between the learner's loss over the sequence and the loss of the best fixed action in hindsight. However, for the purposes of this paper we consider a generalized notion which we call the weighted regret, where every time period has an importance weight that can differ from round to round. More precisely, we assume that the learning process is characterized by a sequence of weights $\alpha := \alpha_1, \alpha_2, \ldots, \alpha_T$, where $\alpha_t > 0$ for every $t$. Now we define the weighted regret according to
$$\alpha\text{-Reg}(z^*) := \sum_{t=1}^{T} \alpha_t \big( \ell_t(z_t) - \ell_t(z^*) \big).$$
(Note that when we drop the $\alpha$, this implies that $\alpha_t = 1$ for all $t$.) The sequence of $\alpha_t$'s can be arbitrary, and indeed we will consider scenarios under which these weights can be selected in an online fashion, according to the observed loss sequence. The learners also observe $\alpha_t$ at the end of each round. Throughout the paper we will use $A_t$ to denote the cumulative sum $\sum_{s=1}^{t} \alpha_s$, and of particular importance will be the weighted average regret $\overline{\alpha\text{-Reg}} := \alpha\text{-Reg} / A_T$.
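As a small illustration of the weighted regret and its average, here is a sketch on a toy sequence of scalar linear losses. The function names and the example data are hypothetical and serve only to show the bookkeeping.

```python
def weighted_regret(losses, plays, comparator, weights):
    """Sum over rounds of alpha_t * (loss_t(z_t) - loss_t(z*))."""
    return sum(a * (loss(z) - loss(comparator))
               for a, loss, z in zip(weights, losses, plays))

def weighted_average_regret(losses, plays, comparator, weights):
    """Weighted regret divided by the cumulative weight A_T."""
    return weighted_regret(losses, plays, comparator, weights) / sum(weights)

# Toy example on the interval [-1, 1] with linear losses.
losses = [lambda z: z, lambda z: -z, lambda z: z]
plays = [0.0, 1.0, -1.0]      # the learner's decisions z_1, z_2, z_3
weights = [1.0, 2.0, 3.0]     # alpha_t = t
comparator = -1.0             # a fixed action z*

reg = weighted_regret(losses, plays, comparator, weights)
avg_reg = weighted_average_regret(losses, plays, comparator, weights)
```

Here the per-round differences are 1, -2 and 0, so the weighted regret is 1 - 4 + 0 = -3 and the weighted average regret is -3/6.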
In this section we present several of the classical, and a few more recent, algorithms with well-established regret guarantees. For the most part, we present these algorithms in unweighted form, without reference to the weight sequence . In later sections we specify more precisely their weighted counterparts.
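One of the most well-known online learning strategies is FollowTheRegularizedLeader (FTRL), in which the decision point is chosen as the "best" point over the previous loss functions, with an additional regularization penalty given by some convex $R(\cdot)$. Precisely, given a parameter $\eta > 0$, the learner chooses on round $t$ the point
$$z_t = \arg\min_{z \in \mathcal{Z}} \; \eta \sum_{s=1}^{t-1} \ell_s(z) + R(z). \qquad (1)$$
For convenience, let $\nabla_t$ be the gradient $\nabla \ell_t(z_t)$. If we assume that $R(\cdot)$ is a strongly convex function (say, 1-strongly convex) with respect to some norm $\|\cdot\|$, then a well-known regret analysis grants the following bound:
$$\text{Reg}(z^*) \le \frac{D}{\eta} + \eta \sum_{t=1}^{T} \|\nabla_t\|_*^2, \qquad (2)$$
where $D := \sup_{z \in \mathcal{Z}} R(z) - \inf_{z \in \mathcal{Z}} R(z)$. With an appropriately-tuned $\eta$, one achieves $\text{Reg} \le O\big(\sqrt{D \sum_{t=1}^{T} \|\nabla_t\|_*^2}\big)$, which is $O(\sqrt{T})$ as long as the gradients have bounded norm. See, e.g., [Shalev-Shwartz et al.(2012), Hazan(2014), Rakhlin and Sridharan(2016)] for further details on this analysis.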
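The FollowTheLeader (FTL) strategy minimizes the objective (1), but without the regularization penalty; i.e. $z_t = \arg\min_{z \in \mathcal{Z}} \sum_{s=1}^{t-1} \ell_s(z)$. Another way to formalize this is to consider $\eta \to \infty$. Given that the above bound has a term that grows linearly in $\eta$, it is clear we cannot simply apply the same analysis of FollowTheRegularizedLeader, and indeed one can find examples where linear regret is unavoidable [Cesa-Bianchi and Lugosi(2006), Shalev-Shwartz et al.(2012)]. On the other hand, it has been shown that a strong regret guarantee is achievable even without regularization, as long as the loss functions are strongly convex. In particular, [Kakade and Shalev-Shwartz(2009)] show the following result:
[Corollary 1 from [Kakade and Shalev-Shwartz(2009)]] Let $\ell_1, \ldots, \ell_T$ be a sequence of functions such that for all $t \in [T]$, $\ell_t$ is $\sigma_t$-strongly convex. Assume that the FTL algorithm runs on this sequence and for each $t \in [T]$, let $\theta_t$ be in $\partial \ell_t(z_t)$. Then,
$$\sum_{t=1}^{T} \ell_t(z_t) - \min_{z} \sum_{t=1}^{T} \ell_t(z) \le \frac{1}{2} \sum_{t=1}^{T} \frac{\|\theta_t\|_*^2}{\sum_{\tau=1}^{t} \sigma_\tau}.$$
Furthermore, let $G = \max_t \|\theta_t\|_*$ and assume that $\sigma_t \ge \sigma$ for all $t \in [T]$. Then, the regret is bounded by $\frac{G^2}{2\sigma}(1 + \log T)$.
In the context of solving zero-sum games, the online learning framework allows for one of the two players to be prescient, so she has access to one additional loss function $\ell_t(\cdot)$ before selecting her $z_t$. In such a case it is much easier to achieve low regret, and we present three standard prescient algorithms:
BestResponse: $z_t = \arg\min_{z \in \mathcal{Z}} \ell_t(z)$
BeTheLeader: $z_t = \arg\min_{z \in \mathcal{Z}} \sum_{s=1}^{t} \ell_s(z)$
BeTheRegularizedLeader: $z_t = \arg\min_{z \in \mathcal{Z}} \eta \sum_{s=1}^{t} \ell_s(z) + R(z)$
Indeed it is easy to show that, for the first two of these prescient strategies, one obtains $\text{Reg} \le 0$ [kalai2005efficient]. The regret of BeTheRegularizedLeader is no more than $\frac{1}{\eta}\big(\sup_{z} R(z) - \inf_{z} R(z)\big)$. We also consider optimistic algorithms, which we discuss in Appendix A.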
Gauge Function FTRL.
While the analysis of FollowTheRegularizedLeader is natural and leads to a simple intuitive bound (2), it requires solving a non-linear optimization problem on each round even when the loss functions are themselves linear, a very common scenario. From a computational perspective, it is often impractical to solve the FollowTheRegularizedLeader objective. Nevertheless, in many scenarios a (computationally feasible) linear optimization oracle is at hand. In such instances, much attention has been focused on a perturbed version of FollowTheLeader, where one solves the unregularized optimization problem but with a linear noise term added to the objective; there is much work analyzing these algorithms and we refer the reader to [kalai2005efficient, cesa2006prediction, abernethy2014online] among many others. The main downside of such randomized approaches is that they have good expected regret but suffer in variance, which makes them less suitable in various reductions.
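In this work, we introduce a family of FollowTheRegularizedLeader algorithms that rely solely on a linear oracle, and we believe this is a novel approach to online linear optimization problems. The restriction we require is that the regularizer $R(\cdot)$ is chosen as the squared gauge function $\gamma_K^2(\cdot)$ for the decision set $K$ of the learner. Here we will assume for every $t$ that $\ell_t(\cdot) = \langle l_t, \cdot \rangle$ for some vector $l_t$ (one can reduce any arbitrary convex loss to the linear loss case by convexity, $\ell_t(z_t) - \ell_t(z) \le \langle \nabla \ell_t(z_t), z_t - z \rangle$; see [Shalev-Shwartz et al.(2012), Hazan(2014), Rakhlin and Sridharan(2016)]), hence the objective (1) reduces to
$$z_t = \arg\min_{z \in K} \; \eta \langle L_{t-1}, z \rangle + \gamma_K^2(z),$$
where $L_{t-1} := \sum_{s=1}^{t-1} l_s$. Denote $\mathrm{bndry}(K)$ as the boundary of the constraint set $K$. We can reparameterize the above optimization by observing that any point $z \in K$ can be written as $\rho z'$ where $z' \in \mathrm{bndry}(K)$ and $\rho \in [0,1]$. Hence we have
$$\min_{z \in K} \eta \langle L_{t-1}, z \rangle + \gamma_K^2(z) = \min_{z' \in \mathrm{bndry}(K),\, \rho \in [0,1]} \eta \rho \langle L_{t-1}, z' \rangle + \gamma_K^2(\rho z') = \min_{z' \in \mathrm{bndry}(K),\, \rho \in [0,1]} \eta \rho \langle L_{t-1}, z' \rangle + \rho^2. \qquad (8)$$
We are able to remove the dependence on the gauge function since it is homogeneous, $\gamma_K(\rho z') = \rho\, \gamma_K(z')$ for $\rho \ge 0$, and is identically 1 on the boundary of $K$. The inner minimization reduces to the linear optimization $z' := \arg\min_{z \in K} \langle L_{t-1}, z \rangle$, and the optimal $\rho$ is $\max\big(0, \min\big(1, -\tfrac{\eta}{2} \langle L_{t-1}, z' \rangle\big)\big)$.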
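The reparameterization above can be sketched in a few lines. The example below is a minimal illustration, assuming the decision set is an $\ell_2$ ball (whose gauge is the norm divided by the radius); the function names are hypothetical. It forms the FTRL iterate from one linear-oracle call followed by a closed-form choice of the scaling $\rho$.

```python
import numpy as np

def l2_ball_oracle(direction, radius=1.0):
    """Linear optimization oracle: argmin over the l2 ball of <direction, z>."""
    norm = np.linalg.norm(direction)
    if norm == 0.0:
        return np.zeros_like(direction)   # every point is optimal; arbitrary tie-break
    return -radius * direction / norm

def gauge_ftrl_step(cum_loss, eta, oracle):
    """FTRL iterate with squared-gauge regularizer via a single oracle call."""
    z_boundary = oracle(cum_loss)                        # boundary point z'
    rho = -0.5 * eta * float(cum_loss @ z_boundary)      # unconstrained optimum of eta*rho*<L, z'> + rho^2
    rho = max(0.0, min(1.0, rho))                        # clip to [0, 1]
    return rho * z_boundary

L_vec = np.array([3.0, 4.0])                                  # cumulative loss vector L_{t-1}
z_small = gauge_ftrl_step(L_vec, eta=0.1, oracle=l2_ball_oracle)   # interior solution
z_large = gauge_ftrl_step(L_vec, eta=1.0, oracle=l2_ball_oracle)   # solution on the boundary
```

For the unit $\ell_2$ ball the objective is $\eta\langle L, z\rangle + \|z\|_2^2$, whose constrained minimizer is $-\eta L/2$ when that point lies in the ball and $-L/\|L\|_2$ otherwise, so the two calls can be checked against this closed form.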
2.3 Solving zero-sum convex-concave games
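Let us now apply the tools described above to the problem of solving a particular class of zero-sum games; these are often referred to as convex-concave saddle point problems. Assume we have convex and compact sets $\mathcal{X} \subset \mathbb{R}^n$, $\mathcal{Y} \subset \mathbb{R}^m$, known as the action spaces for the two players. We are given a convex-concave payoff function $g : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$; that is, $g(\cdot, y)$ is convex in its first argument for every fixed $y \in \mathcal{Y}$, and $g(x, \cdot)$ is concave in its second argument for every fixed $x \in \mathcal{X}$. We say that a pair $(\hat{x}, \hat{y})$ is an $\epsilon$-equilibrium for $g(\cdot,\cdot)$ if $\sup_{y \in \mathcal{Y}} g(\hat{x}, y) - \epsilon \le g(\hat{x}, \hat{y}) \le \inf_{x \in \mathcal{X}} g(x, \hat{y}) + \epsilon$.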
The celebrated minimax theorem, first proven by von Neumann for a simple class of biaffine payoff functions [v1928theorie, neumann1944theory] and generalized by [sion1958general] and others, states that there exist 0-equilibria for convex-concave games under reasonably weak conditions. Another way to state this is that $\min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} g(x,y) = \max_{y \in \mathcal{Y}} \min_{x \in \mathcal{X}} g(x,y)$, and we tend to call this quantity $V^*$, the value of the game $g(\cdot,\cdot)$.
The method of computing an $\epsilon$-equilibrium using a pair of no-regret algorithms is reasonably straightforward, although here we will emphasize the use of weighted regret, which has been much less common in the literature. Algorithm 1 describes a basic template used throughout the paper.
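The template can be sketched as follows. This is an illustrative reading under stated assumptions, not the paper's Algorithm 1 verbatim: the toy payoff $g(x,y) = xy$ on $[-1,1]^2$, the projected-gradient $x$-player, the step size, and all function names are choices made for the example, while the best-responding $y$-player and the weighted averaging of the iterates follow the description in the text.

```python
import numpy as np

def solve_game(T, weights, x_first, x_update, y_response):
    """Run two online learners against each other and return the
    weighted averages of their iterates."""
    xs, ys = [], []
    x = x_first
    for _ in range(T):
        y = y_response(x)      # the y-player sees x_t before choosing y_t (prescient)
        xs.append(x)
        ys.append(y)
        x = x_update(x, y)     # the x-player forms x_{t+1} from the loss g(., y_t)
    w = np.asarray(weights, dtype=float)
    return float(w @ np.array(xs) / w.sum()), float(w @ np.array(ys) / w.sum())

# Toy bilinear game g(x, y) = x * y on [-1, 1] x [-1, 1]; its value is 0.
T = 400
eta = 1.0 / np.sqrt(T)
x_bar, y_bar = solve_game(
    T,
    weights=np.ones(T),                                            # unweighted case, alpha_t = 1
    x_first=0.5,
    x_update=lambda x, y: float(np.clip(x - eta * y, -1.0, 1.0)),  # projected gradient step
    y_response=lambda x: float(np.sign(x)),                        # a best response to x
)

# sup_y g(x_bar, y) - inf_x g(x, y_bar) for this particular payoff
gap = abs(x_bar) + abs(y_bar)
```

The quantity `gap` measures how far the averaged pair is from an equilibrium; in this toy run it is controlled by the average regret of the gradient player, since the best-responding player has non-positive regret.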
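[Theorem 2.3] Assume that a convex-concave game payoff $g(\cdot,\cdot)$ and a $T$-length sequence $\alpha$ are given. Assume that we run Algorithm 1 using no-regret procedures $\text{OAlg}^x$ and $\text{OAlg}^y$, and the $\alpha$-weighted average regret of each is $\overline{\alpha\text{-Reg}}^x$ and $\overline{\alpha\text{-Reg}}^y$, respectively. Then the output $(\bar{x}_\alpha, \bar{y}_\alpha)$ is an $\epsilon$-equilibrium for $g(\cdot,\cdot)$, with $\epsilon = \overline{\alpha\text{-Reg}}^x + \overline{\alpha\text{-Reg}}^y$. The theorem can be restated in terms of $V^*$, where we get the following "$\epsilon$-sandwich":
$$V^* - \epsilon \le \inf_{x \in \mathcal{X}} g(x, \bar{y}_\alpha) \le V^* \le \sup_{y \in \mathcal{Y}} g(\bar{x}_\alpha, y) \le V^* + \epsilon. \qquad (9)$$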
2.4 Application: the Frank-Wolfe Algorithm
We can tie the above set of tools together with an illustrative application, describing a natural connection to the Frank-Wolfe (FW) method frank1956algorithm for constrained optimization. The ideas presented here summarize the work of AW17, but in Section 3 we significantly strengthen the result for a special case.
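We have a convex set $K$, an $L$-smooth convex function $f : K \to \mathbb{R}$, and some initial point $w_0 \in K$. The FW algorithm makes repeated calls to a linear optimization oracle over $K$, followed by a convex averaging step:
$$v_t = \arg\min_{v \in K} \langle v, \nabla f(w_{t-1}) \rangle, \qquad w_t = (1 - \gamma_t)\, w_{t-1} + \gamma_t v_t,$$
where the parameter $\gamma_t$ is a learning rate, and following the standard analysis one sets $\gamma_t = \frac{2}{t+1}$. A well-known result is that $f(w_T) - \min_{w \in K} f(w) \le O\big(\frac{L}{T}\big)$.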
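A minimal sketch of the FW iteration may help here. It assumes the constraint set is the unit $\ell_2$ ball and the objective is a simple quadratic; the target vector `b` and all function names are hypothetical choices for the illustration, and the step size $2/(t+1)$ is the standard schedule.

```python
import numpy as np

def frank_wolfe(grad, oracle, w0, T):
    """Frank-Wolfe: one linear-oracle call and one averaging step per round."""
    w = np.asarray(w0, dtype=float)
    for t in range(1, T + 1):
        v = oracle(grad(w))               # v_t = argmin over K of <v, grad f(w_{t-1})>
        gamma = 2.0 / (t + 1)             # standard learning rate
        w = (1.0 - gamma) * w + gamma * v
    return w

b = np.array([2.0, 1.0])                       # hypothetical target outside the unit ball
f = lambda w: 0.5 * np.sum((w - b) ** 2)       # a 1-smooth convex objective
grad = lambda w: w - b
oracle = lambda g: -g / np.linalg.norm(g)      # linear oracle for the unit l2 ball

T = 100
w_T = frank_wolfe(grad, oracle, np.zeros(2), T)
w_star = b / np.linalg.norm(b)                 # minimizer of f over the ball
subopt = f(w_T) - f(w_star)
```

Every iterate is a convex combination of oracle outputs and therefore stays inside the ball without any projection, and `subopt` shrinks as `T` grows.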
Let us leverage Theorem 2.3 to obtain a convergence rate from a no-regret perspective. With a brief inspection, one can verify that FW is indeed a special case of Algorithm 1, assuming that (a) the game payoff is $g(x,y) := -x^\top y + f^*(x)$, where $f^*$ is the Fenchel conjugate of $f$; (b) the sequence $\alpha$ is $\alpha_t = t$; (c) the $x$-player and $y$-player employ FollowTheLeader and BestResponse, respectively; (d) we output the final iterate as $w_T := \bar{y}_\alpha$. We refer to [AW17] for a thorough exposition, but it is striking that this use of Algorithm 1 leads to Frank-Wolfe even up to the learning rate $\gamma_t = \frac{2}{t+1}$.
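The match in learning rates can be checked directly. The short derivation below is a sketch that assumes the weights $\alpha_t = t$ and writes $\bar{y}_t$ for the weighted average of the first $t$ plays of the $y$-player; it shows that maintaining this average incrementally is exactly a convex averaging step with rate $2/(t+1)$.

```latex
\begin{align*}
A_t &= \sum_{s=1}^{t} s = \frac{t(t+1)}{2},
\qquad
\frac{\alpha_t}{A_t} = \frac{2}{t+1}, \\
\bar{y}_t &= \frac{1}{A_t} \sum_{s=1}^{t} \alpha_s y_s
  = \frac{A_{t-1}}{A_t}\, \bar{y}_{t-1} + \frac{\alpha_t}{A_t}\, y_t
  = \Big(1 - \frac{2}{t+1}\Big) \bar{y}_{t-1} + \frac{2}{t+1}\, y_t .
\end{align*}
```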
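As we have reframed FW in terms of our repeated game, we can now appeal to our main theorem to obtain a rate. We must first observe, using the duality of Fenchel conjugation, that
$$V^* = \max_{y \in K} \min_{x} \big\{ -x^\top y + f^*(x) \big\} = \max_{y \in K} \big\{ -f^{**}(y) \big\} = -\min_{y \in K} f(y).$$
Using (9) and the above equality, we can obtain $f(w_T) - \min_{w \in K} f(w) \le \epsilon$, where $\epsilon$ is the sum of the two players' $\alpha$-weighted average regrets.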
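The convergence rate of FW thus boils down to bounding the regret of the two players. We note first that the $y$-player is prescient and employs BestResponse, hence we conclude that $\overline{\alpha\text{-Reg}}^y \le 0$. The $x$-player on the other hand will suffer the $\alpha$-weighted regret of FollowTheLeader. But notice, critically, that the choice of payoff $g(x,y) = -x^\top y + f^*(x)$ happens to be strongly convex in $x$, as $L$-smoothness of $f$ implies $\frac{1}{L}$-strong convexity of $f^*$. We may thus use Lemma 2.2 to obtain:
$$\overline{\alpha\text{-Reg}}^x \le \frac{1}{A_T} \sum_{t=1}^{T} \frac{\alpha_t^2 \|\nabla f^*(x_t) - y_t\|^2}{2 \sum_{s=1}^{t} \alpha_s / L} \le \frac{1}{A_T} \sum_{t=1}^{T} \frac{L\, t^2 D^2}{t(t+1)} \le \frac{L D^2\, T}{A_T} = \frac{2 L D^2}{T+1},$$
where we use the fact that the $x$-player observes a $\frac{t}{L}$-strongly convex function, $\alpha_t g(\cdot, y_t)$, in round $t$, and that the subgradient norm in Lemma 2.2 is $t\,\|\nabla f^*(x_t) - y_t\| \le tD$, where $D$ is the diameter of $K$. We conclude by noting that the $\log T$ term, which tends to arise from the regret of online strongly convex optimization, was removed by carefully selecting the sequence of weights $\alpha_t = t$.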
3 Fast convergence in the FW game
In this section, we introduce a new FW-like algorithm that achieves an $O(1/T^2)$ convergence rate on $\beta$-Gauge sets accessed using a linear optimization oracle. The design and analysis are based on a reweighting scheme and Optimistic-FTRL, taking advantage of recent tools developed for fast rates in solving games [CJ12, RS13, SALS15].
In Theorem 3.1 we give an instantiation of Algorithm 1 that finds an approximate saddle point for the FW game $g(x,y)$. In this instantiation the $x$-player plays Optimistic-FTL and the $y$-player plays BeTheRegularizedLeader. With an appropriate weighting, the weighted regret guarantees of these two algorithms imply that we can find an $O(1/T^2)$-approximate saddle point solution of the FW game in $T$ rounds. Recalling from Section 2.4 that the suboptimality of the FW output is bounded by the equilibrium gap of the FW game, this immediately translates to a convergence rate of $O(1/T^2)$ for the problem $\min_{w \in K} f(w)$.
The algorithm that we describe in Theorem 3.1 does not immediately yield a FW-like algorithm; in general, we may not be able to compute the $y$-player's BeTheRegularizedLeader iterates using only a linear optimization oracle. However, if the $y$-player uses the squared gauge function of $K$ as a regularizer, then the iterates are computable using a linear optimization oracle, as shown in Section 2.2. This fact immediately implies that for $\beta$-Gauge sets, and upon choosing the squared gauge function as regularizer, Algorithm 2 instantiates a projection-free procedure which provides a convergence rate of $O(1/T^2)$ for the problem $\min_{w \in K} f(w)$ (see Corollary 3.1). In Appendix G, we discuss how to get a faster rate than $O(1/T)$ for arbitrary convex sets if BeThePerturbedLeader rather than BeTheRegularizedLeader is used by the $y$-player in the FW game.
3.1 Solving the FW game with Optimistic-FTL and BeTheRegularizedLeader
In this section, we present our algorithm for finding $\epsilon$-saddle point solutions to the FW game. We instantiate Algorithm 1 using the FW objective $g(x,y) := -x^\top y + f^*(x)$, where we assume $f$ is $L$-smooth and $\sigma$-strongly convex. The $x$-player plays Optimistic-FTL and the $y$-player plays BeTheRegularizedLeader.
[Theorem 3.1] Assume that we instantiate Algorithm 1 with the FW game $g(x,y) := -x^\top y + f^*(x)$, weight sequence $\alpha$, and the following strategies for the players. The $x$-player plays Optimistic-FTL:
$$x_t = \arg\min_{x} \Big\{ \alpha_t \ell_{t-1}(x) + \sum_{s=1}^{t-1} \alpha_s \ell_s(x) \Big\},$$
where $\ell_t(x) := g(x, y_t)$, and the $y$-player plays BeTheRegularizedLeader:
with a $\beta$-strongly-convex regularizer $R(\cdot)$ and , where . Then the output $(\bar{x}_\alpha, \bar{y}_\alpha)$ of Algorithm 1 is an $\epsilon$-approximate saddle point solution to the FW game, where .
Now recall that for the FW setting, we are interested in $y$-players that may only employ a linear optimization oracle. In general it is impossible to solve Equation (12) within $O(1)$ calls to such oracles in each round. Nevertheless, recall that for $\beta$-Gauge sets, choosing $R(\cdot) = \gamma_K^2(\cdot)$ induces a $\beta$-strongly-convex regularizer, while enabling us to solve Equation (12) with a single call to the linear oracle, as shown in Equation (8). The proof of Theorem 3.1 shows that the $x$-player's strategy is the gradient of the primal objective at the point $\tilde{y}_t$, where $\tilde{\alpha}$ is a weight vector such that $\tilde{\alpha}_s = \alpha_s$ for $s \le t-2$ and $\tilde{\alpha}_{t-1} = \alpha_{t-1} + \alpha_t$, and $\tilde{y}_t$ is the $\tilde{\alpha}$-weighted average of $y_1, \ldots, y_{t-1}$ (see Equation (18)). This leads to Algorithm 2 and Corollary 3.1.
We get the following corollary of the above theorem. The full proof is in Appendix F.
3.2 Proof of Theorem 3.1
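[Proof of Theorem 3.1] In the FW game, we observe that the loss functions $\ell_t(\cdot) := g(\cdot, y_t)$ seen by the $x$-player are $\frac{1}{L}$-strongly convex (so the weighted losses $\alpha_t \ell_t(\cdot)$ are $\frac{\alpha_t}{L}$-strongly convex), since the function $f$ is $L$-smooth, which implies that $f^*$ is $\frac{1}{L}$-strongly convex.
The $x$-player chooses $x_t$ based on Optimistic-FTL: $x_t = \arg\min_{x} \big\{ \alpha_t \ell_{t-1}(x) + \sum_{s=1}^{t-1} \alpha_s \ell_s(x) \big\}$, where $\ell_t(x) := g(x, y_t)$. To analyze the regret of the $x$-player, let us first denote the update of the standard FollowTheLeader as
$$\hat{x}_t := \arg\min_{x} \sum_{s=1}^{t-1} \alpha_s \ell_s(x).$$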
Denote . (Footnote 2: The following analysis actually holds for any .) Now we are going to analyze the $\alpha$-weighted regret of the $x$-player, which is
where the last inequality uses strong convexity of so that
There are three sums in (14). Note that the second sum should be small because the expression for “exploits” more than the expression for does. The third sum is the regret of BeTheLeader, which is non-positive. In Lemma D, we show that the second and third sums in eq. 14 are in total non-positive. For the proof, please see Appendix D.
Since , each term in the first sum in (14) can be bounded by
where the last inequality uses Hölder’s inequality and the fact that is -strongly convex so that is smooth. Let us analyze . Note that, by Fenchel conjugacy, , where is the -weighted average of For notational simplicity, let us define a new weight vector , where = for and . Similarly, for , we have
where is the -weighted average of . According to (18),
Combining and , we get
Therefore, we have shown that the first sum in (14) is bounded by
Now let us turn to analyzing the regret of the $y$-player, which is defined as which equals . This means that the $y$-player actually observes the linear loss in each round $t$, due to the fact that the $y$-player plays after the $x$-player plays. We can reinterpret BeTheRegularizedLeader as Optimistic-FTRL ([Syrgkanis et al.(2015)Syrgkanis, Agarwal, Luo, and Schapire]) when the learner is fully informed as to the loss function for the current round. That is, we may write the update as , where and $R(\cdot)$ is $\beta$-strongly convex with respect to a norm on $K$.
For loss vectors , Appendix E gives the regret of Optimistic-FTRL as