# Let's be honest: An optimal no-regret framework for zero-sum games

We revisit the problem of solving two-player zero-sum games in the decentralized setting. We propose a simple algorithmic framework that simultaneously achieves the best rates for honest regret as well as adversarial regret, and in addition resolves the open problem of removing the logarithmic terms in convergence to the value of the game. We achieve this goal in three steps. First, we provide a novel analysis of optimistic mirror descent (OMD), showing that it can be modified to guarantee fast convergence for both honest regret and the value of the game when the players are playing collaboratively. Second, we propose a new algorithm, dubbed robust optimistic mirror descent (ROMD), which attains optimal adversarial regret without knowing the time horizon beforehand. Finally, we propose a simple signaling scheme, which enables us to bridge OMD and ROMD to achieve the best of both worlds. Numerical examples are presented to support our theoretical claims and show that our non-adaptive ROMD algorithm can be competitive with OMD with adaptive step-size selection.


## 1 Introduction

Zero-sum games have been studied extensively, often from the standpoint of analyzing convergence to the Nash equilibrium. At the equilibrium, the players employ a min-max pair of strategies from which no player can improve their payoff by a unilateral deviation (von1928theory).

In this setting, one can expect that the players arrive at the equilibrium via decentralized, no-regret learning algorithms, which hold even in the presence of potential adversarial behavior, and which also better model selfish play. The resulting dynamics is of great interest in optimization and behavioral economics (myerson1999nash), especially under communication constraints.

When the behavior of each player is explained by a no-regret algorithm, it is possible to significantly improve convergence rates beyond the so-called black-box, adversarial dynamics. This observation was first made by (daskalakis2011near), which tailored a decentralized version of Nesterov’s primal-dual method based on the excessive gap condition.

Intriguingly, (daskalakis2011near) left open the question of whether there exists a simple algorithm that converges at optimal rates for both the regret and the value of the game in an uncoupled manner, against both honest (i.e., cooperative) and dishonest (i.e., arbitrarily adversarial) behavior.

The challenge was partially settled by the modified optimistic mirror descent (OMD) framework of (rakhlin2013optimization). While the framework of (daskalakis2011near) is considered unnatural and involves additional logarithmic factors, similar criticisms apply to the framework of rakhlin2013optimization: the modified OMD needs to know the game horizon a priori to determine its step-sizes, and their analysis results in non-optimal regret and logarithmic factors in the convergence to the value of the game.

Besides the aforementioned drawbacks, neither approach can accommodate natural switches between honest and dishonest behavior.

In this work, we propose a simple algorithmic framework that closes the gap between upper and lower bounds for adversarial regret as well as convergence to the value of the game, while maintaining the best known rate for honest regret, thereby resolving the open problem posed by (daskalakis2011near).

We achieve the desiderata as follows: First, we provide a novel analysis of OMD and show that it can obtain fast convergence for both honest regret and value of the game, when both players are honest. Second, we introduce robust optimistic mirror descent (ROMD), which attains optimal adversarial regret without knowing the time horizon. Finally, we propose a simple signaling scheme, which enables us to bridge OMD and ROMD to achieve the best of both worlds, and seamlessly handle honest and dishonest behavior.

### 1.1 Related Work

Algorithms for Decentralized Games: To our knowledge, the only two explicit algorithms capable of solving zero-sum games in the decentralized setting are given by (daskalakis2011near) and (rakhlin2013optimization), respectively. A comparison of their convergence rates versus ours is presented in Table 1.

The algorithm of (daskalakis2011near) is a decentralized primal-dual method based on Nesterov’s excessive gap technique (nesterov2005excessive). Its convergence guarantees are only slightly worse than ours (cf., Table 1). However, due to the presence of complicated and unnatural scheduling steps, the authors in (daskalakis2011near) themselves were not convinced by the practicality of their algorithm and stated the result as merely an “existence proof.”

Later on, rakhlin2013optimization proposed an algorithm based on Optimistic Mirror Descent (OMD), initially introduced in a special case by (chiang2012online) and also studied in detail by (rakhlin2013online). While the algorithm is simple, it features several drawbacks. Foremost, it requires knowing the time horizon beforehand, which is unsatisfactory. Second, when both players are playing collaboratively, its regret is sub-optimal. Third, its adversarial regret and convergence to the game value carry extra logarithmic factors, which require additional care to remove. Finally, the algorithm uses adaptive step-sizes, requiring additional work per iteration.

Meta-Algorithms: There exists some work on “meta-algorithms” for games (syrgkanis2015fast; foster2016learning), which can turn certain learning algorithms into solvers for zero-sum games. For instance, leveraging the framework in (syrgkanis2015fast), one can modify OMD to achieve certain rates for the honest and adversarial regrets simultaneously. Our algorithm uniformly outperforms these rates.

## 2 Preliminaries and Notation

Let ψ be a mirror map over a convex domain 𝒳, and let D(·,·) be the Bregman divergence associated with ψ. We will use the following three-point identity for the Bregman divergence in the sequel:

 D(x,y) + D(y,z) = D(x,z) + ⟨x − y, ∇ψ(z) − ∇ψ(y)⟩.
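As a quick numerical sanity check, the identity can be verified for the entropic mirror map introduced below, whose Bregman divergence is the KL divergence. The snippet is illustrative (the three test points are arbitrary interior points of the simplex) and not part of the paper's algorithms:

```python
import numpy as np

def grad_psi(x):
    # Gradient of the entropic mirror map psi(x) = sum_i x_i log x_i.
    return 1.0 + np.log(x)

def bregman(x, y):
    # Bregman divergence of the entropic mirror map: the KL divergence.
    return float(np.sum(x * np.log(x / y)))

# Three arbitrary interior points of the simplex.
x = np.array([0.2, 0.3, 0.5])
y = np.array([0.4, 0.4, 0.2])
z = np.array([0.1, 0.6, 0.3])

lhs = bregman(x, y) + bregman(y, z)
rhs = bregman(x, z) + float((x - y) @ (grad_psi(z) - grad_psi(y)))
assert abs(lhs - rhs) < 1e-12  # the three-point identity holds exactly
```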

We use the notation z = MD_η(x, g) to denote

 z = ∇ψ⋆(∇ψ(x) − ηg),

where ψ⋆ is the Fenchel dual of ψ.

Let ψ be 1-strongly convex with respect to the norm ‖·‖. We define

 D² ≔ max{ sup_{x,x′∈𝒳} ½‖x − x′‖², sup_{x∈𝒳} D(x, x_c) },

where x_c is the prox center. Hence D² controls both the squared diameter (in ‖·‖) and the Bregman divergence to the prox center.

We frequently use the fact that

 ⟨x, Ay⟩ ≤ |A|_max for all x ∈ Δ_m, y ∈ Δ_n,

where |A|_max is the maximum entry of A in absolute value, and Δ_k is the standard simplex. On a simplex, we will only consider the entropic mirror map

 ψ(x) = ∑_{i=1}^k x_i log x_i, k = m or n,

which is well known to be 1-strongly convex with respect to ‖·‖₁.
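Under the entropic mirror map, the update z = ∇ψ⋆(∇ψ(x) − ηg) takes the familiar multiplicative-weights form, with the normalization keeping the iterate on the simplex. A minimal sketch (the loss vector below is arbitrary and for illustration only):

```python
import numpy as np

def md_update(x, g, eta):
    """z = MD_eta(x, g) for the entropic mirror map on the simplex:
    a multiplicative-weights step; the normalization keeps z on the
    simplex, so no Bregman projection is required."""
    z = x * np.exp(-eta * g)
    return z / z.sum()

x = np.full(4, 0.25)                  # start at the uniform prox center
g = np.array([1.0, 0.0, -1.0, 0.0])   # an arbitrary loss vector
z = md_update(x, g, eta=0.5)
assert abs(z.sum() - 1.0) < 1e-12 and np.all(z > 0)
assert z[2] > z[1] > z[0]             # smaller loss receives more mass
```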

We denote by x_c the uniform distribution on Δ_k, which serves as the prox center for the entropic mirror map.

## 3 Problem Formulation and Main Result

An (offline) two-player zero-sum game with payoff matrix A ∈ ℝ^{m×n} refers to solving the minimax problem:

 V\coloneqqminy∈Δnmaxx∈Δm⟨x,Ay⟩. (1)

The quantity V in (1) is called the value of the game, or the Nash equilibrium value. Any pair of strategies attaining the game value is called an equilibrium strategy.

In the decentralized setting (aka., the “strongly uncoupled” setting), the payoff matrix and the number of the opponent’s strategies are unknown to both players, and their goal is to learn a pair of equilibrium strategies through repeated plays of the game. Moreover, each player aims to suffer a low individual regret, even in the presence of an adversary or a corrupted channel that distorts the feedback.

Specifically, at each round t, the players take actions x_t ∈ Δ_m and y_t ∈ Δ_n, and then receive the loss vectors −Ay_t (for the x-player) and A⊤x_t (for the y-player). In the honest setting, we assume that the two players take actions according to a prescribed algorithm, and we say the setting is adversarial if only one player (the x-player in this paper) adheres to the prescribed algorithm while the other player acts arbitrarily.

As in previous work, we assume that an upper bound on the maximum absolute entry of A is available to both players. The goal is to achieve

 |V − ⟨z_T, Aw_T⟩| ≤ r₁(T),  R_T ≔ max_{x∈Δm} ∑_{t=1}^T ⟨x_t − x, −Ay_t⟩ ≤ r₂(T)

for fast-decaying r₁(T) and sublinear r₂(T) in T, where (z_T, w_T) denotes the pair of candidate equilibrium strategies. The first requirement is to approximate the game value in (1), and the second asks to minimize the regret R_T.
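Because the regret objective is linear in the comparator x, the maximum over the simplex is attained at a vertex, so R_T can be computed exactly from the accumulated loss vectors. A small sketch (the matrix, the play sequences, and the helper name are illustrative, not from the paper):

```python
import numpy as np

def x_player_regret(A, xs, ys):
    """R_T = max_{x in simplex} sum_t <x_t - x, -A y_t>.  Linearity in x
    means the best fixed comparator is a vertex: subtract the smallest
    coordinate of the accumulated loss vector."""
    losses = [-A @ y for y in ys]                  # loss vector -A y_t per round
    played = sum(float(x @ l) for x, l in zip(xs, losses))
    return played - float(np.sum(losses, axis=0).min())

rng = np.random.default_rng(0)
A = rng.uniform(-1, 1, size=(3, 4))
xs = [np.full(3, 1 / 3)] * 10                      # x-player plays uniform
ys = [rng.dirichlet(np.ones(4)) for _ in range(10)]
R = x_player_regret(A, xs, ys)
# Nonnegative here: for constant uniform play, the mean of the
# accumulated loss coordinates is at least their minimum.
assert R >= -1e-12
```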

Our main result can be stated as follows:

###### Theorem 1 (Main result, informal).

For (1), there is a simple decentralized algorithm with non-adaptive step-size such that

 r1(T)=O(1T),r2(T)=O(logT),

if the opponent is honest (i.e., playing collaboratively to solve the game). Moreover, against any adversary, we have

 r2(T)=O(√T).

Except for the honest regret, these rates are known to be optimal (cesa2006prediction; daskalakis2015near). We are also the first to remove logarithmic factors in the convergence to the value of the game, an open question posed by the very first work on learning in decentralized games (daskalakis2011near).

## 4 A family of optimistic mirror descents: Classical, Robust, and Let’s be honest

We first illustrate the high-level ideas to prove Theorem 1 in Section 4.1. A novel analysis for OMD in the honest setting is given in Section 4.2, and we propose a new algorithm for the adversarial setting in Section 4.3. Finally, the full algorithm is presented in Section 4.4, along with the rigorous version of the main result (cf., Theorem 4).

### 4.1 High-Level Ideas

Our algorithms are inspired by the iterates of the form:

 {xt+1=MDη(xt,−2Ayt+Ayt−1)yt+1=MDη(yt,2A⊤xt−A⊤xt−1), (2)

which are equivalent to the OMD in (rakhlin2013optimization) (see Appendix A). It is known that directly applying (2) to (1) yields convergence in the game value, but without any guarantee on the regret.

To make OMD optimal for zero-sum games, we improve (2) on two fronts. First, in the honest setting, we make the following simple observation: Although the iterates are not guaranteed to possess sublinear regret, the averaged iterates do enjoy logarithmic regret, and hence, it suffices to play the averaged iterates in the honest setting.

Second, in order to make OMD robust against any adversary, we utilize the “mixing steps” of (rakhlin2013optimization) with an important improvement: our step-sizes do not depend on the time horizon. This new feature is crucial in removing logarithmic factors in both the convergence to the game value and the adversarial regret. In fact, our analysis is arguably simpler than that of (rakhlin2013optimization).
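The dynamics (2) are easy to simulate under the entropic mirror map. The sketch below runs (2) in self-play and reports the duality gap of the averaged strategies; the random payoff matrix and the step-size constant are assumptions for illustration (the precise η of Section 4.2 is not reproduced here):

```python
import numpy as np

def md(x, g, eta):
    # Entropic mirror-descent step: multiplicative weights + normalization.
    z = x * np.exp(-eta * g)
    return z / z.sum()

def omd_selfplay(A, T, eta):
    """Run the coupled optimistic updates (2) in self-play and return the
    duality gap of the averaged strategies (the objectives are linear, so
    the inner max/min are attained at vertices)."""
    m, n = A.shape
    x, y = np.full(m, 1 / m), np.full(n, 1 / n)
    x_prev, y_prev = x.copy(), y.copy()
    xs, ys = [], []
    for _ in range(T):
        g_x = -2 * A @ y + A @ y_prev       # x-player's optimistic loss
        g_y = 2 * A.T @ x - A.T @ x_prev    # y-player's optimistic loss
        x_prev, y_prev = x, y
        x, y = md(x, g_x, eta), md(y, g_y, eta)
        xs.append(x); ys.append(y)
    z, w = np.mean(xs, axis=0), np.mean(ys, axis=0)
    return float((A @ w).max() - (A.T @ z).min())

rng = np.random.default_rng(0)
A = rng.uniform(-1, 1, size=(5, 5))
eta = 1 / (2 * np.abs(A).max())             # illustrative step-size choice
gap_50, gap_500 = omd_selfplay(A, 50, eta), omd_selfplay(A, 500, eta)
assert gap_500 >= -1e-12 and gap_500 < gap_50   # gap shrinks with horizon
```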

### 4.2 Optimistic Mirror Descent

As alluded to in Section 4.1, we will play OMD with the averaged iterates. The algorithms are given explicitly in Algorithm 1 and 2.

###### Remark 1.

Note that there is no need to play the iterates three times in Algorithms 1 and 2. The players could play them just once and would still have enough information to run OMD. Our choices are motivated by the resulting ease of notation.

We analyze our version of OMD below. The crux of our analysis is to first look at the regrets of the auxiliary sequences {x_t} and {y_t}, and we show that the sum of the auxiliary regrets, rather than either of them individually, controls both the convergence to the value of the game and the honest regret of the averaged sequences {z_t} and {w_t}.

###### Theorem 2.

Suppose two players of a zero-sum game have played T rounds according to the OMD algorithm with a common constant step-size η. Then

1. The x-player suffers an O(log T) regret:

 max_{z∈Δm} ∑_{t=3}^T ⟨z_t − z, −Aw_t⟩ ≤ (log(T−2)+1)(20 + log m + log n)|A|_max = O(log T) (3)

and similarly for the y-player.

2. The strategies (z_T, w_T) constitute an O(1/T)-approximate equilibrium to the value of the game:

 |V − ⟨z_T, Aw_T⟩| ≤ (20 + log m + log n)|A|_max / (T−2) = O(1/T). (4)
###### Proof.

See Appendix B. ∎

### 4.3 Robust Optimistic Mirror Descent

In this section, we introduce robust optimistic mirror descent (ROMD), which is a novel algorithm even for online convex optimization.

Let ψ be 1-strongly convex with respect to a norm ‖·‖, and suppose we are minimizing the regret against an arbitrary sequence of convex functions {f_t} over a constraint set 𝒳. Assume that each function f_t is G-Lipschitz in ‖·‖. Assume also that no Bregman projection is needed (i.e., ∇ψ⋆(∇ψ(x) − ηg) ∈ 𝒳 for any x ∈ 𝒳 and any g); this is, for instance, the case for the entropic mirror map.

We state ROMD in the general form in Algorithm 3.

###### Theorem 3.

Suppose that ‖∇f_t(x_t)‖ ≤ G for all t. Then playing T rounds of Algorithm 3 with the step-sizes specified therein, against an arbitrary sequence of convex functions, has the following guarantee on the regret:

 max_{x∈𝒳} ∑_{t=1}^T ⟨x_t − x, ∇f_t(x_t)⟩ ≤ G√T(18 + 2D²) + GD(3√2 + 4D) = O(√T).
###### Proof.

See Appendix C. ∎

When specialized to zero-sum games, it suffices to take f_t(x) = ⟨x, −Ay_t⟩, G = |A|_max, 𝒳 = Δ_m, and ψ the entropic mirror map.

###### Remark 2.

Our analysis of ROMD crucially relies on the assumption that no Bregman projection is needed. We have not been able to generalize our analysis to the case with Bregman projections.

### 4.4 Let’s be honest: The full framework

We now present our approach for solving (1).

To ease the notation, define

 z∗t\coloneqqargminx∈Δm⟨x,−Awt⟩

and

 w∗t=argminy∈Δn⟨zt,Ay⟩.

Let constants C_1, C_2, and C_3 be such that (see Theorem 2, Theorem 3, and (B.10))

 ⟨z_t − z*_t, −Aw_t⟩ ≤ C_1/t,  z_t, w_t from OMD, (5)
 ⟨w_t − w*_t, A⊤z_t⟩ ≤ C_1/t,  z_t, w_t from OMD, (6)
 ∑_{t=1}^T ⟨z_t − z*, −Ay_t⟩ ≤ C_2√T,  z_t from ROMD and y_t arbitrary, (7)
 |V − ⟨z_T, Aw_T⟩| ≤ C_3/T,  z_T, w_T from OMD. (8)

From a high level, our approach exploits the following simple observation: suppose that we know C_1, C_2, and C_3 above. If the instantaneous regret bounds (5) and (6) hold true for all t, then we would trivially have the desired convergence.

In contrast, if at any round the bound (5) is violated for the x-player, then it must be due to an adversarial play, and we can simply switch to ROMD to get O(√T) regret. However, since C_1 (cf., (B.10)) involves n, the number of the opponent’s strategies, the x-player cannot compute it exactly. The situation is similar for the y-player. We hence need to come up with a way to estimate C_1 for both players.

It is important to note that one cannot naïvely estimate C_1 by binary search separately on both players. The reason, and the major difficulty of the above approach, is as follows: since in general m ≠ n, it could be the case that, at the same round, the x-player detects a bad instantaneous regret and switches to ROMD while the y-player remains in OMD, even though the two players are both honest. However, our entire analysis of OMD would break down if OMD is not played cohesively.

Furthermore, recall that we also want robustness against any adversary. Therefore, a bad instantaneous regret indicates the possibility of receiving an adversarial play, and we need to switch to ROMD whenever this occurs.

To resolve these issues, we devise a simple signaling scheme (the dedicated signal plays in Algorithms 4 and 5 below), which synchronizes both players’ estimates and their OMD plays while guaranteeing robustness.

In words, our signaling scheme is a “Let’s be honest” message to the opponent: “I am having a bad instantaneous regret. Please update your estimate with me, and please pretend that I am adversarial for a small number of rounds, so that we can play honest OMD cohesively.” It turns out that these extra signaling rounds do not hurt the convergence rates of OMD and ROMD at all.
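The switching logic described above can be sketched as a small state machine. Everything below — the class name, the method names, the ROMD block length, and the exact threshold b/t — is an illustrative placeholder rather than the paper's Algorithms 4 and 5:

```python
class LbHPlayer:
    """Schematic state machine for the "Let's be honest" switching logic.
    Illustrative only: names and the block length are placeholders."""

    def __init__(self, b0=1.0, block_len=4):
        self.b = b0                  # running estimate of C_1 in (5)
        self.mode = "OMD"
        self.block_len = block_len   # placeholder length of an ROMD block
        self.romd_rounds_left = 0

    def observe(self, instantaneous_regret, t, opponent_signalled=False):
        """Update the mode from round-t feedback; return True iff this
        player emits the "let's be honest" signal."""
        violated = instantaneous_regret > self.b / t   # bound (5) broken?
        if violated or opponent_signalled:
            self.b *= 2                    # refine the estimate of C_1
            self.romd_rounds_left = self.block_len
            self.mode = "ROMD"             # pretend the opponent is adversarial
            return violated
        if self.romd_rounds_left > 0:
            self.romd_rounds_left -= 1
            if self.romd_rounds_left == 0:
                self.mode = "OMD"          # resume honest OMD cohesively
        return False

p = LbHPlayer()
signalled = p.observe(instantaneous_regret=5.0, t=3)   # 5.0 > 1/3: violation
assert signalled and p.mode == "ROMD" and p.b == 2.0
for t in range(4, 8):                                  # four quiet rounds
    p.observe(instantaneous_regret=0.0, t=t)
assert p.mode == "OMD"                                 # ROMD block expired
```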

Our full algorithm, termed Let’s Be Honest (LbH) Optimistic Mirror Descent, is presented in Algorithm 4 and 5.

###### Remark 3.

In Algorithms 4 and 5, the role of b_t is to estimate the constant C_1 in (5). Since our analysis requires the estimate to be the same for both players throughout the algorithm run, a simple way is to assume an a priori bound on the unknown quantities, compute the corresponding C_1, and set the initial b accordingly. Doing so indeed improves the constants in our convergence; our choice here is only for simplicity.

###### Remark 4.

There is some degree of freedom in Algorithms 4 and 5. For instance, instead of doubling b in Line 16, one can multiply it by any constant factor greater than one. In Line 5, one can also play a different number of signaling rounds. As will become apparent in Theorem 4, these variants only affect the constants but not the convergence rates. However, they do have an impact on empirical performance; cf., Section 5.

The following key lemma ensures that the two players enter the ROMD plays coherently.

###### Lemma 1.

If the y-player enters Line 12 of Algorithm 5 at the t-th round, then the x-player enters Line 4 of Algorithm 4 at the (t+1)-th round. Conversely, if, at the t-th round, the y-player does not enter Line 12 of Algorithm 5, then the x-player does not enter Line 4 of Algorithm 4 at the (t+1)-th round.

Exactly the same statements hold when the x- and y-players are reversed.

###### Proof.

If the y-player enters Line 12 of Algorithm 5 at the t-th round, then a signal is emitted at the t-th round, and it must be the case that ⟨w_t − w*_t, A⊤z_t⟩ > b_t (cf., Line 12 of Algorithm 5). Therefore, at the (t+1)-th round, the x-player would receive the signal and compute

 G^w_{t+1} = ⟨w_t − w*_t, A⊤z_t⟩ > b_t,

and hence enters Line 4 of Algorithm 4.

Conversely, suppose that the y-player does not enter Line 12 of Algorithm 5 at the t-th round (or, equivalently, plays OMD at the t-th round). Then ⟨w_t − w*_t, A⊤z_t⟩ ≤ b_t, implying that

 G^w_{t+1} ≤ ⟨w_t − w*_t, A⊤z_t⟩ ≤ b_t,

hence preventing the x-player from entering Line 4 of Algorithm 4.

Exactly the same computation holds when we reverse the roles of the x- and y-players. ∎

Given Lemma 1, we now know that the -player switches to ROMD if and only if the -player does. The rest of the proof then readily follows from Theorems 2 and 3.

###### Theorem 4.

Suppose the x-player plays according to Algorithm 4 for T rounds, and let R_T be its regret up to time T. Then

1. Let T = T_1 + T_2 + T_3, where T_1 is the number of OMD plays, T_2 is the number of ROMD plays, and T_3 is the number of signaling rounds. Then there are constants C and C′, depending only on the constants in (5)–(8) and |A|_max, such that

 (1/T) R_T ≤ (C log T_1 + C′√T_2) / (T_1 + T_2). (9)

In particular, if the opponent plays honestly, then R_T = O(log T). If the opponent is adversarial, we have R_T = O(√T).

2. Suppose that the honest y-player plays Algorithm 5. Then the pair (z_T, w_T) constitutes an O(1/T)-approximate equilibrium:

 |V − ⟨z_T, Aw_T⟩| ≤ C′′/T (10)

for some constant C′′.

###### Proof.

Suppose first that both players are honest.

We first prove the individual regret for the -player. We split the terms as follows:

 RT =RT1(playingOMD)+RT2(playingROMD) +RT3(signaling). (11)

Recall (5)–(8). We claim that

1. (a) T_3 ≤ ⌈log₂ C_1⌉ and (b) T_2 ≤ 16(16^{T_3} − 1)/15.

Indeed, the estimate b doubles with each signal, so after T_3 ≥ ⌈log₂ C_1⌉ rounds of signaling we would have b_t ≥ C_1. Then (5) and (6) imply that we will never enter Line 12 again, which proves (a). On the other hand, we have

 T_2 ≤ ∑_{r=1}^{T_3} 16^r = (16^{T_3+1} − 16)/15.

Combining (a) and (b) and using (5), (7) in (11), we conclude that

 R_T ≤ C_1 log T_1 + C_2√T_2 + 2|A|_max T_3 ≤ C_1 log T_1 + C_2 √(16(16^{⌈log₂ C_1⌉} − 1)/15) + 2|A|_max ⌈log₂ C_1⌉ = O(log T_1) = O(log T)

which establishes (9) in the honest case.

For convergence to the value of the game, we have, by (8),

 |V − ⟨z_T, Aw_T⟩| ≤ C_3/(T − T_2 − T_3) ≤ C_3/(T − C*),

where C* is the constant bounding T_2 + T_3 in the honest case. The proof of (10) is completed by using the fact that 1/(T − C*) = O(1/T) when T ≥ 2C*.

Finally, we show (9) in the adversarial case.

Let T_1, T_2, and T_3 be as before, and we again split the regret into:

 RT =RT1(playingOMD)+RT2(playingROMD) +RT3(signaling).

Notice that this time the inequalities (5) and (6) do not apply since the opponent no longer plays OMD collaboratively. However, by Line 12 of Algorithm 4, for every OMD play we must have

 ⟨z_t, −Aw_t⟩ − ⟨z*_t, −Aw_t⟩ ≤ b_t ≤ 2^{T_3}/t.

Following the analysis as in the honest setting, we may further write

 R_T ≤ 2^{T_3} log T_1 + C_2√T_2 + 2|A|_max T_3.

It hence suffices to show that

 2^{T_3} log T_1 ≤ C** √(T_1 + T_2) (12)

for some constant C**. To see (12), recall that

 T_2 = 16(16^{T_3} − 1)/15 ≥ 16^{T_3} − 1.

But then

 2^{T_3} log T_1 / √(T_1 + T_2) ≤ 2^{T_3} log T_1 / (√2 · (T_1 T_2)^{1/4}) ≤ (2^{T_3}/(2^{T_3} − 1)) · log T_1 / (√2 · T_1^{1/4}) ≤ C**

for some universal constant C**, where the first inequality uses T_1 + T_2 ≥ 2√(T_1 T_2) and the second uses T_2^{1/4} ≥ (16^{T_3} − 1)^{1/4} ≥ 2^{T_3} − 1. ∎

###### Remark 5.

As is evident from the proof, we have made no attempt to sharpen the constants, and hence our bounds can be numerically loose.

## 5 Experiments

The purpose of this section is to provide numerical evidence for the following claims of our theory:

1. The LbH algorithm does not require knowing the time horizon beforehand, and our step-sizes are non-adaptive. Therefore, all quantities of interest, such as regrets or game value, should steadily decrease along the algorithm run.

For comparison, we include the modified OMD (henceforth abbreviated as m-OMD) of (rakhlin2013optimization) in our experiment, for different choices of time horizon.

We generate the entries of A uniformly at random in a bounded interval, and we fix the dimensions m and n across runs.

We consider two scenarios:

1. Honest setting: Both players adhere to the prescribed algorithms and try to reach the Nash equilibrium collaboratively.

2. Adversarial setting: The y-player greedily maximizes the instantaneous regret of the x-player.

### 5.1 Honest Setting

The convergence for the honest setting is reported in Figure 1, for two different parameter choices of LbH and m-OMD.

For both convergence to the game value and the individual regret, after a short burn-in period (due to not knowing the constant C_1 in (5) and (6)), the LbH algorithm enters a steadily decreasing phase, as expected from our theory. On the other hand, as m-OMD chooses its step-sizes according to the time horizon, it eventually saturates in both plots.

As noted by (rakhlin2013optimization), it is possible to prevent the saturation of m-OMD by employing the doubling trick or the techniques in (auer2002adaptive). However, doing so not only complicates the algorithm, but also introduces extra logarithmic factors in the convergence of the honest regret, since the doubling trick loses a logarithmic factor for logarithmic regrets. Such rates are sub-optimal in light of our results.
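To make the loss concrete, a standard back-of-the-envelope calculation (not spelled out in the extracted text) is the following: under the doubling trick, epoch k has horizon 2^k, so summing the per-epoch logarithmic regret over the ⌈log₂ T⌉ restarts gives

```latex
\sum_{k=0}^{\lceil \log_2 T \rceil} O\!\left(\log 2^{k}\right)
  \;=\; O\!\Bigl(\sum_{k=0}^{\lceil \log_2 T \rceil} k\Bigr)
  \;=\; O\!\left(\log^{2} T\right),
```

so a restarted method pays log²T where a single horizon-free run pays log T.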

### 5.2 Adversarial Setting

We report the regret comparison in Figure 2.

In the adversarial setting, the LbH algorithm is essentially running ROMD, and hence we see a steady decrease in the regret, as dictated by our upper bound in Theorem 3; see Figure 2-(b). The parameter choice does not affect the performance.

The m-OMD slightly outperforms LbH for a short period, but eventually blows up in regret. We remark that the short-term good empirical performance is due to the adaptive step-sizes of m-OMD, which require additional work per-iteration. Our LbH algorithm is non-adaptive, but is already competitive in terms of empirical performance.

## 6 Conclusion and Future Work

We studied the problem of zero-sum games in the decentralized setting, and we resolved an open problem of achieving optimal convergence to the game value while maintaining low regrets. Our techniques were based on several simple but novel observations about the game dynamics. Namely, we noticed that the averaged iterates of OMD enjoy logarithmic regret in the honest setting, we provided horizon-independent mixing steps for OMD to achieve optimal adversarial regret, and we designed a signaling scheme to losslessly bridge OMD and ROMD. In essence, we showed that it is not necessary to fix the time horizon beforehand and modify OMD accordingly, as done in the work of (rakhlin2013optimization). Our observations were instrumental in removing logarithmic terms in all convergence rates.

Our framework suggests several research directions. First, instead of assuming that we observe the full loss vector, we may pose our problem in the bandit setting, where only the payoff value of the current strategy is observed. Second, for practical purposes, it is interesting to see whether there exists an adaptive step-size version of our algorithm. Finally, generalizing our framework to multiplayer games is a challenging future work.

## Acknowledgements

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement n 725594 - time-data), and was supported by the Swiss National Science Foundation (SNSF) under grant number 200021_178865 / 1.

## Appendix A Equivalence Formulations of Optimistic Mirror Descent

In this appendix, we show that the iterates in (2) of the main text are equivalent to the following iterates given in (chiang2012online; rakhlin2013online):

 x̃_t = MD_η(x̃_{t−1}, −Ay_{t−1}),  x_t = MD_η(x̃_t, −Ay_{t−1}). (A.1)

By the optimality condition for (A.1), we have

 ∇ψ(xt) =∇ψ(~xt)−η(−Ayt−1), (A.2) ∇ψ(~xt) =∇ψ(~xt−1)−η(−Ayt−1), (A.3) ∇ψ(~xt−1) =∇ψ(xt−1)+η(−Ayt−2). (A.4)

We hence get (2) by applying (A.4) to (A.3) and then (A.3) to (A.2).

## Appendix B Optimistic Mirror Descent

In this appendix, we prove Theorem 2, restated below for convenience.

###### Theorem 5.

Suppose two players of a zero-sum game have played T rounds according to Algorithms 1 and 2 with a common constant step-size η. Then

1. The x-player suffers an O(log T) regret:

 maxz∈ΔmT∑t=3⟨zt−z,−Awt⟩ ≤(log(T−2)+1)(20+logm+logn)|A|max (B.1) =O(logT)

and similarly for the -player.

2. The strategies (z_T, w_T) constitute an O(1/T)-approximate equilibrium to the value of the game:

 |V−⟨zT,AwT⟩|≤(20+logm+logn)|A|maxT−2=O(1T). (B.2)
###### Proof.

Define x* as

 x∗=argminx∈Δm⟨x,−A(1T−2T∑t=3yt)⟩. (B.3)

We define an auxiliary individual regret as

 RxT\coloneqqT∑t=3⟨xt−x∗,−Ayt⟩. (B.4)

Notice that this is the regret of the auxiliary sequence {x_t} against the comparator x*, while the players are actually playing the z_t’s and w_t’s in the algorithm.

We then have

 RxT =T∑t=3⟨xt−x∗,−Ayt⟩ =⟨x3−x∗,−Ay3⟩+T∑t=4⟨xt−x∗,−Ayt⟩ ≤2|A|max+T∑t=4⟨xt−x∗,−Ayt−gt−1⟩+T∑t=4⟨xt−x∗,gt−1⟩

where g_{t−1} ≔ −2Ay_{t−1} + Ay_{t−2}. Inserting into the definition of MD_η, we get ∇ψ(x_t) = ∇ψ(x_{t−1}) − ηg_{t−1}. A straightforward calculation then shows:

 RxT ≤2|A|max+T∑t=4⟨xt−x∗,−Ayt+2Ayt−1−Ayt−2⟩+T∑t=4⟨xt−x∗,−2Ayt−1+Ayt−2⟩ =2|A|max+T∑t=4⟨xt−x∗,(−Ayt+Ayt−1)−(−Ayt−1+Ayt−2)⟩ =2|A|max+T−1∑t=4⟨xt−xt+1,−Ayt+Ayt−1⟩+⟨x4−x∗,Ay3−Ay2⟩ ≤10|A|max+T−1∑t=4⟨xt−xt+1,−Ayt+Ayt−1⟩ +1ηT∑t=4(D(x∗,xt−1)−D(x∗,xt)−D(xt,xt−1)) ≤10|A|max+T−1∑t=4∥xt−xt+1∥1⋅|A|max⋅∥yt−yt−1∥1

Using the fact that ψ is 1-strongly convex with respect to the ℓ₁-norm, we have D(x, x′) ≥ ½‖x − x′‖₁². Also, we have D(x*, x₃) ≤ log m. Combining these facts, together with Young’s inequality ab ≤ (a² + b²)/2, in the last inequality gives:

 RxT ≤10|A|max+logmη+|A|max2T−1∑t=4∥xt−xt+1∥21 +|A|max2T−1∑t=4∥yt−yt−1∥21−12ηT∑t=4∥xt−1−xt∥21.

Similarly, for the second player we define

 RyT\coloneqqT∑t=3⟨yt−y∗,A⊤xt⟩ (B.5)

where y* is defined analogously to (B.3). We then have

 RyT ≤10|A|max+lognη+|A|max2T−1∑t=4∥yt−yt+1∥21 +|A|max2T−1∑t=4∥xt−xt−1∥21−12ηT∑t=4∥yt−1−yt∥21.

Setting η appropriately (a constant multiple of 1/|A|_max), we get

 RxT+RyT≤(20+logm+logn)|A|max. (B.6)

Now, recalling that z_T = (1/(T−2)) ∑_{t=3}^T x_t and w_T = (1/(T−2)) ∑_{t=3}^T y_t, and using the definitions of x* and y*, we get

 1T−2(RxT+RyT)=maxx∈Δm⟨x,AwT⟩−miny∈Δn⟨zT,Ay⟩. (B.7)

Furthermore, by the definition of the value of the game, we have

 miny∈Δn⟨zT,Ay⟩≤V≤maxx∈Δm⟨x,AwT⟩. (B.8)

We also trivially have

 miny∈Δn⟨zT,Ay⟩≤⟨zT