Simple zero-sum games have been studied extensively, often with the goal of analyzing convergence to the Nash equilibrium. At the equilibrium, the players employ a min-max pair of strategies from which no player can improve their payoff by a unilateral deviation (von1928theory).
In this setting, one expects the players to arrive at the equilibrium via decentralized, no-regret learning algorithms, whose guarantees hold even in the presence of potentially adversarial behavior and which better model selfish play. The resulting dynamics are of great interest in optimization and behavioral economics (myerson1999nash), especially under communication constraints.
When the behavior of each player is explained by a no-regret algorithm, it is possible to significantly improve convergence rates beyond the so-called black-box, adversarial dynamics. This observation was first made by (daskalakis2011near), who tailored a decentralized version of Nesterov's primal-dual method based on the excessive gap condition.
Intriguingly, (daskalakis2011near) left open the question of whether there exists a simple algorithm that converges at optimal rates, for both the regret and the value of the game, in an uncoupled manner against both honest (i.e., cooperative) and dishonest (i.e., arbitrarily adversarial) behavior.
The challenge was partially settled by the modified optimistic mirror descent (OMD) framework of (rakhlin2013optimization). While the framework of (daskalakis2011near) is considered unnatural and involves additional logarithmic factors, similar criticisms apply to the framework of (rakhlin2013optimization): the modified OMD needs to know the game horizon a priori to set its step-sizes, and its analysis yields non-optimal regret and logarithmic factors in the convergence to the value of the game.
Besides the aforementioned drawbacks, neither approach can accommodate natural switches between honest and dishonest behavior.
In this work, we propose a simple algorithmic framework that closes the gap between upper and lower bounds for adversarial regret as well as convergence to the value of the game, while maintaining the best known rate for honest regret, thereby resolving the open problem posed by (daskalakis2011near).
We achieve the desiderata as follows: First, we provide a novel analysis of OMD and show that it can obtain fast convergence for both honest regret and value of the game, when both players are honest. Second, we introduce robust optimistic mirror descent (ROMD), which attains optimal adversarial regret without knowing the time horizon. Finally, we propose a simple signaling scheme, which enables us to bridge OMD and ROMD to achieve the best of both worlds, and seamlessly handle honest and dishonest behavior.
1.1 Related Work
Algorithms for Decentralized Games: To our knowledge, the only two explicit algorithms capable of solving zero-sum games in the decentralized setting are those of (daskalakis2011near) and (rakhlin2013optimization). A comparison of their convergence rates with ours is presented in Table 1.
The algorithm of (daskalakis2011near) is a decentralized primal-dual method based on Nesterov's excessive gap technique (nesterov2005excessive). Its convergence guarantees are only slightly worse than ours (cf., Table 1). However, due to the presence of complicated and unnatural scheduling steps, the authors of (daskalakis2011near) were themselves not convinced of the practicality of their algorithm and stated the result as merely an "existence proof."
Later on, (rakhlin2013optimization) proposed an algorithm based on Optimistic Mirror Descent (OMD), initially introduced in a special case by (chiang2012online) and studied in detail by (rakhlin2013online). While the algorithm is simple, it has several drawbacks. Foremost, it requires the time horizon beforehand, which is unsatisfactory. Second, when both players play collaboratively, their regret is sub-optimal. Third, its adversarial regret and convergence to the game value carry extra factors, which require additional care to remove. Finally, the algorithm uses adaptive step-sizes, requiring additional work per iteration.
Meta-Algorithms: There exists some work on “meta-algorithms” for games (syrgkanis2015fast; foster2016learning), which can turn certain learning algorithms into solvers for zero-sum games. For instance, leveraging the framework in (syrgkanis2015fast), one can modify OMD to achieve for honest regret + for adversarial regret. Our algorithm uniformly outperforms these rates.
2 Preliminaries and Notation
Let be a mirror map over the convex domain , and let be the Bregman divergence associated with . We assume the knowledge of the three-point identity for Bregman divergence in the sequel:
We use the notation to denote:
where is the Fenchel dual of .
Let be 1-strongly convex with respect to the norm . We define
where is the prox center. Hence controls both the diameter (in ) and the Bregman divergence to the prox center.
We frequently use the fact that
where is the maximum entry of in absolute value, and is the standard simplex. On a simplex, we will only consider the entropic mirror map:
which is well-known to be 1-strongly convex in .
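To make the entropic setup concrete, here is a minimal sketch (not from the paper) of the entropic mirror map and its Bregman divergence, which is the KL divergence; Pinsker's inequality is the quantitative form of the 1-strong convexity in the ℓ1 norm:

```python
import math

def neg_entropy(x):
    # Entropic mirror map psi(x) = sum_i x_i log x_i on the simplex.
    return sum(p * math.log(p) for p in x if p > 0)

def kl(x, y):
    # The Bregman divergence of the entropic map is the KL divergence.
    return sum(p * math.log(p / q) for p, q in zip(x, y) if p > 0)

# Pinsker's inequality reflects 1-strong convexity of psi w.r.t. l1:
# KL(x || y) >= (1/2) * ||x - y||_1^2.
x = [0.7, 0.2, 0.1]
y = [1 / 3, 1 / 3, 1 / 3]
l1 = sum(abs(p - q) for p, q in zip(x, y))
assert kl(x, y) >= 0.5 * l1 ** 2
```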
We use to denote the uniform distribution on .
3 Problem Formulation and Main Result
An (offline) two-player zero-sum game with payoff matrix refers to solving the minimax problem:
The quantity in (1) is called the value of the game, or the Nash Equilibrium Value. Any pair attaining the game value is called an equilibrium strategy.
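As a toy illustration (not from the paper), the value of a small game can be approximated by brute force, using the fact that for a fixed mixed strategy of the minimizing player the inner maximum is attained at a pure strategy; the 2x2 matrix below is matching pennies, whose value is 0:

```python
def payoff(A, x, y):
    # Bilinear payoff x^T A y; the x-player minimizes, the y-player maximizes.
    return sum(x[i] * A[i][j] * y[j] for i in range(len(x)) for j in range(len(y)))

def game_value_2x2(A, grid=200):
    # min over x of max over y of x^T A y, via grid search over the x-simplex.
    best = float("inf")
    for k in range(grid + 1):
        x = [k / grid, 1 - k / grid]
        inner = max(payoff(A, x, [1, 0]), payoff(A, x, [0, 1]))
        best = min(best, inner)
    return best

A = [[1, -1], [-1, 1]]  # matching pennies: the value is 0 at uniform play
assert abs(game_value_2x2(A)) < 1e-2
```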
In the decentralized (a.k.a. “strongly uncoupled”) setting, the payoff matrix and the number of the opponent's strategies are unknown to both players, whose goal is to learn a pair of equilibrium strategies through repeated plays of the game. Moreover, each player aims to suffer low individual regret, even in the presence of an adversary or a corrupted channel that distorts the feedback.
Specifically, at each round , the players take actions and , and then receive the loss vectors (for the -player) and (for the -player). In the honest setting, we assume that both players take actions according to a prescribed algorithm; we say the setting is adversarial if only one player (the -player in this paper) adheres to the prescribed algorithm while the other plays arbitrarily.
As in previous work, we assume that an upper bound on the maximum absolute entry of is available to both players. The goal is to achieve
for fast-decaying and sublinear in . The first requirement is to approximate the game value in (1), and the second one asks to minimize the regret .
Our main result can be stated as follows:
Theorem 1 (Main result, informal).
For (1), there is a simple decentralized algorithm with non-adaptive step-size such that
if the opponent is honest (i.e., playing collaboratively to solve the game). Moreover, against any adversary, we have
Except for the honest regret, these rates are known to be optimal (cesa2006prediction; daskalakis2015near). We are also the first to remove factors in the convergence to the value of the game, an open question posed by the very first work on learning in decentralized games (daskalakis2011near).
4 A family of optimistic mirror descents: Classical, Robust, and Let’s be honest
We first illustrate the high-level ideas to prove Theorem 1 in Section 4.1. A novel analysis for OMD in the honest setting is given in Section 4.2, and we propose a new algorithm for the adversarial setting in Section 4.3. Finally, the full algorithm is presented in Section 4.4, along with the rigorous version of the main result (cf., Theorem 4).
4.1 High-Level Ideas
Our algorithms are inspired by the iterates of the form:
which are equivalent to the OMD in (rakhlin2013optimization) (see Appendix A). It is known that directly applying (2) to (1) yields convergence in the game value, but without any guarantee on the regret.
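For intuition, here is a minimal sketch of optimistic mirror descent with the entropic map (optimistic multiplicative weights, the standard variant that uses the previous round's loss as the prediction; the exact iterate form (2) is given in the paper). Both players run it on matching pennies, and the averaged iterates approach the uniform equilibrium:

```python
import math

def normalize(w):
    s = sum(w)
    return [v / s for v in w]

def play(z, m, eta):
    # Optimistic step: play from the secondary sequence z, hedged against
    # the predicted loss m (here, last round's observed loss).
    return normalize([zi * math.exp(-eta * mi) for zi, mi in zip(z, m)])

def update(z, g, eta):
    # Secondary-sequence update with the observed loss g.
    return normalize([zi * math.exp(-eta * gi) for zi, gi in zip(z, g)])

A = [[1, -1], [-1, 1]]  # matching pennies; the equilibrium is uniform play
eta, T = 0.1, 2000
zx, zy = [0.8, 0.2], [0.3, 0.7]          # start away from the equilibrium
mx, my = [0.0, 0.0], [0.0, 0.0]
avg_x = [0.0, 0.0]
for _ in range(T):
    x = play(zx, mx, eta)
    y = play(zy, my, eta)
    gx = [sum(A[i][j] * y[j] for j in range(2)) for i in range(2)]   # x minimizes
    gy = [-sum(x[i] * A[i][j] for i in range(2)) for j in range(2)]  # y maximizes
    zx, zy = update(zx, gx, eta), update(zy, gy, eta)
    mx, my = gx, gy
    avg_x = [a + xi / T for a, xi in zip(avg_x, x)]

assert abs(avg_x[0] - 0.5) < 0.1  # averaged play is near the equilibrium
```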
To make OMD optimal for zero-sum games, we improve (2) on two fronts. First, in the honest setting, we make the following simple observation: Although the iterates are not guaranteed to possess sublinear regret, the averaged iterates do enjoy logarithmic regret, and hence, it suffices to play the averaged iterates in the honest setting.
Second, in order to make OMD robust against any adversary, we utilize the “mixing steps” of (rakhlin2013optimization) with an important improvement: Our step-sizes do not depend on the time horizon. This new feature is crucial in removing factors in both the convergence to game value and adversarial regret. In fact, our analysis is arguably simpler than (rakhlin2013optimization).
4.2 Optimistic Mirror Descent
We analyze our version of OMD below. The crux of our analysis is to first look at the regrets of the auxiliary sequences and , and we show that the sum of the auxiliary regrets, rather than any individual one, controls both the convergence to the value of the game and the honest regret of the averaged sequences and .
Suppose two players of a zero-sum game have played rounds according to the OMD algorithm with . Then
The -player suffers an regret:
and similarly for the -player.
The strategies constitute an -approximate equilibrium to the value of the game:
See Appendix B. ∎
4.3 Robust Optimistic Mirror Descent
In this section, we introduce robust optimistic mirror descent (ROMD), which is a novel algorithm even for online convex optimization.
Let be 1-strongly convex with respect to , and suppose we are minimizing the regret against an arbitrary sequence of convex functions over a constraint set . Assume that each function is -Lipschitz in . Assume also that no Bregman projection is needed (i.e., for any and ); this is, for instance, the case for the entropic mirror map.
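The no-projection assumption is easiest to see for the entropic map: the unconstrained mirror step is exactly the multiplicative-weights update, which lands back on the simplex by construction. A minimal sketch (not from the paper):

```python
import math

def mirror_step_entropic(x, g, eta):
    # nabla psi*(nabla psi(x) - eta * g) for the entropic map: this is the
    # multiplicative-weights update, and it stays on the simplex, so no
    # Bregman projection is ever needed.
    z = [xi * math.exp(-eta * gi) for xi, gi in zip(x, g)]
    s = sum(z)
    return [zi / s for zi in z]

x = [0.5, 0.3, 0.2]
x_new = mirror_step_entropic(x, [1.0, -0.5, 0.0], 0.2)
assert abs(sum(x_new) - 1.0) < 1e-12   # still a probability vector
assert all(v > 0 for v in x_new)       # still in the relative interior
```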
We state ROMD in the general form in Algorithm 3.
Theorem 3 (-Adversarial Regret).
Suppose that for all . Then playing rounds of Algorithm 3 with against an arbitrary sequence of convex functions has the following guarantee on the regret:
See Appendix C. ∎
When specialized to zero-sum games, it suffices to take , , , and to be the entropic mirror map.
Our analysis of ROMD crucially relies on the assumption that no Bregman projection is needed. We have not been able to generalize our analysis to the case with Bregman projections.
4.4 Let’s be honest: The full framework
We now present our approach for solving (1).
To ease the notation, define
At a high level, our approach exploits the following simple observation: Suppose that we know above. If the instantaneous regret bounds (5) and (6) hold true for all , then we would trivially have the desired convergence.
In contrast, if at any round the bound (5) is violated for the -player, then it must be due to an adversarial play, and we can simply switch to ROMD to get regret. However, since (cf., (B.10)) involves , the number of the opponent's strategies, the -player cannot compute it exactly. The situation is similar for the -player. We hence need a way to estimate for both players.
It is important to note that one cannot naïvely estimate by running a binary search separately for each player. The reason, and the major difficulty with the above approach, is as follows: since in general , it could happen that, in the same round, the -player detects a bad instantaneous regret and switches to ROMD while the -player remains in OMD, even though both players are honest. However, our entire analysis of OMD would break down if OMD is not played cohesively.
Furthermore, recall that we also want robustness against any adversary. A bad instantaneous regret therefore indicates the possibility of an adversarial play, and we need to switch to ROMD whenever it occurs.
To resolve these issues, we devise a simple signaling scheme ( and below), which synchronizes both players' estimates of and their OMD plays while guaranteeing robustness.
In words, our signaling scheme is a “Let's be honest” message to the opponent: “I am having a bad instantaneous regret. Please update your with me, and please pretend that I am adversarial for a small number of rounds, so that we can play honest OMD cohesively.” It turns out that these extra signaling rounds do not hurt the convergence rates of OMD and ROMD at all.
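To convey the flavor of the switching logic, here is a toy sketch; the function name, message format, and threshold are hypothetical simplifications, and the actual protocol is given in Algorithms 4 and 5:

```python
# Toy model of the per-round mode switch (hypothetical names and threshold;
# see Algorithms 4 and 5 for the actual protocol).
def step_mode(mode, inst_regret, threshold, received_signal):
    """Return (new_mode, signal_to_send) for one round."""
    if received_signal or inst_regret > threshold:
        # Bad instantaneous regret, or the opponent flagged one: treat the
        # next rounds as potentially adversarial and switch to ROMD.
        return "ROMD", inst_regret > threshold
    return "OMD", False

# Honest play with small instantaneous regret: stay in OMD, send nothing.
assert step_mode("OMD", 0.1, 1.0, False) == ("OMD", False)
# A violated regret bound: switch to ROMD and signal the opponent.
assert step_mode("OMD", 2.0, 1.0, False) == ("ROMD", True)
# The opponent signaled: switch in sync even though our own regret is fine.
assert step_mode("OMD", 0.1, 1.0, True) == ("ROMD", False)
```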
In Algorithms 4 and 5, the role of is to estimate the constant in (5). Since our analysis requires to be the same for both players throughout the run, a simple way is to assume that, say, , compute the corresponding , and set the initial . Doing so indeed improves the constants in our convergence; we chose only for simplicity.
There is some degree of freedom in Algorithms 4 and 5. For instance, instead of doubling in Line 16, one can do for some . In Line 5, one can also play rounds, rather than . As will become apparent in Theorem 4, these variants only affect the constants, not the convergence rates. However, they do have an impact on empirical performance; cf., Section 5.
The following key lemma ensures that the two players enter the ROMD plays coherently.
If the -player enters Line 12 of Algorithm 5 at the -th round, then the -player enters Line 4 of Algorithm 4 at the -th round. Conversely, if, at the -th round, the -player does not enter Line 12 of Algorithm 5, then the -player does not enter Line 4 of Algorithm 4 at the -th round.
Exactly the same statements hold when the - and -player are reversed above.
If the -player enters Line 12 of Algorithm 5 at the -th round, then is signalled at the -th round, and it must be the case that (cf., Line 12 of Algorithm 5). Therefore, at the -th round, the -player would receive and compute
and hence enters Line 4 of Algorithm 4.
Conversely, suppose that the -player does not enter Line 12 of Algorithm 5 at the -th round (or, equivalently, plays OMD at the -th round). Then , implying that
hence preventing the -player from entering Line 4 of Algorithm 4.
Exactly the same computation holds when we reverse the role of - and -player. ∎
Suppose the -player plays according to Algorithm 4 for rounds, and let be the regret up to time . Then
Let where is the number of OMD plays, is the number of ROMD plays, and is the number of signaling rounds (playing or ). Then there are constants and , depending only on and , such that
In particular, if the opponent plays honestly, then If the opponent is adversarial, we have
Suppose that the honest -player plays Algorithm 5. Then the pair constitutes an -approximate equilibrium:
for some constant .
Suppose first that both players are honest.
We first prove the individual regret for the -player. We split the terms as follows:
which establishes (9) in the honest case.
For convergence to the value of the game, we have, by (8),
where The proof of (10) is completed by using the fact that when .
Finally, we show (9) in the adversarial case.
Let , and be as before, and we again split the regret into:
Following the analysis as in the honest setting, we may further write
It hence suffices to show that
for some constant . To see (12), recall that
for some universal constant .
As is evident from the proof, we have made no attempt at sharpening the constants, and hence our bounds can be numerically loose.
The purpose of this section is to provide numerical evidence for the following claims of our theory:
The LbH algorithm does not require knowing the time horizon beforehand, and our step-sizes are non-adaptive. Therefore, all quantities of interest, such as regrets or game value, should steadily decrease along the algorithm run.
The LbH algorithm automatically adjusts to honest and adversarial opponents.
For comparison, we include the modified OMD (henceforth abbreviated as m-OMD) of (rakhlin2013optimization) in our experiment, for different choices of time horizon.
We generate the entries of uniformly at random in the interval , and we set and .
We consider two scenarios:
Honest setting: Both players adhere to the prescribed algorithms and try to reach the Nash equilibrium collaboratively.
Adversarial setting: The -player greedily maximizes the instantaneous regret of the -player.
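A sketch of the greedy adversary described above (our rendering of the setup; the helper name is ours): at each round the adversary picks the pure strategy that maximizes the honest player's instantaneous loss against their current mixed strategy.

```python
# Greedy adversary for the experiment: given the honest x-player's mixed
# strategy x, pick the column j maximizing the instantaneous loss x^T A e_j.
def greedy_adversary(A, x):
    n = len(A[0])
    losses = [sum(x[i] * A[i][j] for i in range(len(x))) for j in range(n)]
    j_star = max(range(n), key=lambda j: losses[j])
    return [1.0 if j == j_star else 0.0 for j in range(n)]

A = [[1, -1], [-1, 1]]
y = greedy_adversary(A, [0.9, 0.1])  # x leans on row 0, so column 0 hurts most
assert y == [1.0, 0.0]
```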
5.1 Honest Setting
The convergence for the honest setting is reported in Figure 1, for two different parameter choices of LbH and m-OMD.
For both convergence to the game value and individual regret, after a short burn-in period (due to not knowing the in (5) and (6)), the LbH algorithm enters a steady -decreasing phase, as expected from our theory. On the other hand, since m-OMD chooses its step-sizes according to the time horizon, it eventually saturates in both plots.
As noted by (rakhlin2013optimization), it is possible to prevent the saturation of m-OMD by employing the doubling trick or the techniques in (auer2002adaptive). However, doing so not only complicates the algorithm, but also introduces extra factors into the convergence of the honest regret, since the doubling trick loses a factor for logarithmic regrets. Such rates are sub-optimal given our results.
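For concreteness, a minimal sketch of the doubling trick mentioned above (the helper is illustrative, not part of m-OMD): a horizon-dependent algorithm is restarted on epochs of doubling length, with the step-size retuned to each epoch.

```python
# Epoch lengths for the doubling trick: restart the horizon-dependent
# algorithm on epochs of length 1, 2, 4, ..., truncating at the true horizon.
def doubling_schedule(T):
    epochs, length, t = [], 1, 0
    while t < T:
        epochs.append(min(length, T - t))
        t += epochs[-1]
        length *= 2
    return epochs

assert doubling_schedule(10) == [1, 2, 4, 3]
assert sum(doubling_schedule(1000)) == 1000
```

For a logarithmic per-epoch regret, summing over the roughly log T epochs is what produces the extra factor discussed above.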
5.2 Adversarial Setting
We report the regret comparison in Figure 2.
In the adversarial setting, the LbH algorithm is essentially running ROMD, and hence we see a steady decrease in the regret, as dictated by our upper bound in Theorem 3; see Figure 2-(b). The parameter choice does not affect the performance.
The m-OMD slightly outperforms LbH for a short period, but eventually its regret blows up. We remark that the short-term empirical advantage is due to the adaptive step-sizes of m-OMD, which require additional work per iteration. Our LbH algorithm is non-adaptive, yet already competitive in terms of empirical performance.
6 Conclusion and Future Work
We studied the problem of zero-sum games in the decentralized setting, and we resolved an open problem of achieving optimal convergence to the game value while maintaining low regrets. Our techniques were based on several simple but novel observations about the game dynamics. Namely, we noticed that the averaged iterates of OMD enjoy logarithmic regret in the honest setting, we provided horizon-independent mixing steps for OMD to achieve optimal adversarial regret, and we designed a signaling scheme to losslessly bridge OMD and ROMD. In essence, we showed that it is not necessary, as done in the work of (rakhlin2013optimization), to fix the time horizon beforehand and modify OMD accordingly. Our observations were instrumental in removing terms in all convergence rates.
Our framework suggests several research directions. First, instead of assuming that we observe the full loss vector, we may pose our problem in the bandit setting, where only the payoff value of the current strategy is observed. Second, for practical purposes, it is interesting to see whether there exists an adaptive step-size version of our algorithm. Finally, generalizing our framework to multiplayer games is a challenging future work.
This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement n 725594 - time-data), and was supported by the Swiss National Science Foundation (SNSF) under grant number 200021_178865 / 1.
Appendix A Equivalence Formulations of Optimistic Mirror Descent
In this appendix, we show that the iterates in (2) of the main text are equivalent to the following iterates given in (chiang2012online; rakhlin2013online):
Appendix B Optimistic Mirror Descent
In this appendix, we prove Theorem 2, restated below for convenience.
We define an auxiliary individual regret as
Notice that this is the regret of the sequence versus the sequence, while we actually play the 's and 's in the algorithm.
We then have
where . Inserting into the definition of , we get . A straightforward calculation then shows:
Using the fact that is 1-strongly convex with respect to the -norm, we have . Also, we have . Combining these facts with the last inequality gives:
Similarly, for the second player we define
where . We then have
Setting , we get
Now, recalling that and and using the definition of and , we get
Furthermore, by the definition of the value of the game, we have
We also trivially have