1 Introduction
We consider a problem in which two players interact repeatedly in a zero-sum game. The payoff matrix of the game is unknown to the players a priori and may change arbitrarily in each round. Our objective is to find competitive strategies whose long-run average payoffs approach those of the Nash equilibrium of the game. This problem is a significant extension of the classical learning setting in zero-sum games, where the underlying payoff matrix is usually assumed to be fixed or i.i.d. In contrast, we allow the payoff matrix to evolve arbitrarily in each round, and even to be selected in a possibly adversarial fashion.
Zero-sum games [53, 44] are ubiquitous in economics and central to understanding Linear Programming duality [31, 4], convex optimization [3, 1], robust optimization [11], and Differential Privacy [22]. The task of finding the Nash equilibrium of a zero-sum game is also connected to several machine learning problems, such as Markov Games [40], Boosting [26], Multi-armed Bandits with Knapsacks [9, 34], and dynamic pricing problems [23].
We formally define the problem setting in Section 1.1. We then highlight the main contributions of this paper in Section 1.2 and discuss related work in Section 1.3.
1.1 Problem Formulation: Online Matrix Games
We start by reviewing the definition of classical two-player zero-sum games. Suppose Player 1 has $d_1$ possible actions and Player 2 has $d_2$ possible actions. The payoffs for both players are determined by a matrix $A \in \mathbb{R}^{d_1 \times d_2}$, with $A_{ij}$ corresponding to the loss of Player 1 and the reward of Player 2 when they choose to play actions $(i, j) \in [d_1] \times [d_2]$. (Throughout, $[n] = \{1, \dots, n\}$ for any positive integer $n$.) We allow the players to use mixed strategies: each mixed strategy is represented by a probability distribution over the player's actions. More specifically, when Player 1 uses a mixed strategy $x \in \Delta_{d_1}$ and Player 2 uses a mixed strategy $y \in \Delta_{d_2}$, the expected payoff is $x^\top A y$. (Here, $\Delta_d$ represents the unit simplex in dimension $d$: $\Delta_d = \{x \in \mathbb{R}^d : x \ge 0, \sum_{i=1}^d x_i = 1\}$.) Throughout the paper, we refer to the static zero-sum game as a matrix game (MG), because the players' payoffs are a bilinear function encoded by the matrix $A$. A Nash equilibrium of this game is defined as any pair of (possibly) mixed strategies $(x^*, y^*)$ such that
$$(x^*)^\top A y \le (x^*)^\top A y^* \le x^\top A y^*$$
for any $x \in \Delta_{d_1}$, $y \in \Delta_{d_2}$. It is well known that every MG has at least one Nash equilibrium [44]. The problem of finding an equilibrium of an MG can be reduced to solving a linear programming problem. In fact, [4] showed that the converse is also true: every linear programming problem can be solved by finding an equilibrium of a corresponding MG.
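As a small illustration of the definitions above, the following sketch evaluates the expected payoff of a pair of mixed strategies and checks the equilibrium condition for the matching-pennies game. The helper names (`expected_payoff`, `is_nash_equilibrium`) and the choice of game are assumptions made for this example and do not come from the paper. Because the payoff is linear in each player's strategy, it is enough to check deviations to pure strategies.

```python
from fractions import Fraction


def expected_payoff(x, A, y):
    """Expected payoff x^T A y: loss of Player 1, reward of Player 2."""
    return sum(x[i] * A[i][j] * y[j]
               for i in range(len(x)) for j in range(len(y)))


def is_nash_equilibrium(x, A, y):
    """Check that neither player gains by deviating to a pure strategy.

    Player 1 minimizes and Player 2 maximizes. The payoff is linear in each
    argument, so pure deviations cover all mixed deviations.
    """
    value = expected_payoff(x, A, y)
    d1, d2 = len(A), len(A[0])
    # Payoff if Player 1 deviates to pure action i (Player 1 wants it small).
    row_payoffs = [sum(A[i][j] * y[j] for j in range(d2)) for i in range(d1)]
    # Payoff if Player 2 deviates to pure action j (Player 2 wants it large).
    col_payoffs = [sum(x[i] * A[i][j] for i in range(d1)) for j in range(d2)]
    return min(row_payoffs) >= value and max(col_payoffs) <= value


# Matching pennies (illustrative choice): the uniform pair is an equilibrium.
half = Fraction(1, 2)
A = [[1, -1], [-1, 1]]
uniform = [half, half]
print(expected_payoff(uniform, A, uniform))
print(is_nash_equilibrium(uniform, A, uniform))
print(is_nash_equilibrium([1, 0], A, uniform))
```

Exact rational arithmetic (`Fraction`) is used so that the equilibrium inequalities are checked without floating-point tolerance; with floats, a small tolerance would be needed.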
Now, we define a problem that generalizes matrix games to an online setting, which we call the Online Matrix Games (OMG) problem. Suppose two players interact in a repeated zero-sum matrix game through $T$ rounds. In every round $t \in [T]$, they must each choose a (possibly) mixed strategy from the given action sets $\Delta_{d_1}$ and $\Delta_{d_2}$. However, we assume that the payoff matrix in OMG can evolve in each round, and the players have no knowledge of the payoff quantities in that round before they commit to an action. Let $A_1, \dots, A_T$ be an arbitrary sequence of matrices, where $A_t \in \mathbb{R}^{d_1 \times d_2}$ for all $t \in [T]$. For each round $t$, the players choose their mixed strategies $x_t$, $y_t$ before the matrix $A_t$ is revealed. Then, Player 1 (resp. Player 2) receives a loss (resp. gain) given by the payoff quantity $x_t^\top A_t y_t$. Note that the payoff matrix is allowed to change arbitrarily from round to round and may even depend on the past actions of both players. The joint goal for both players is to find strategies that ensure their average payoffs over $T$ rounds are close to the Nash Equilibrium under the average payoff matrix in hindsight.
More precisely, let us call the quantity
$$\left| \sum_{t=1}^T x_t^\top A_t y_t - \min_{x \in \Delta_{d_1}} \max_{y \in \Delta_{d_2}} \sum_{t=1}^T x^\top A_t y \right| \qquad (1)$$
the Nash Equilibrium (NE) regret. This is a natural extension of the regret concept in typical online learning or multi-armed bandit problems, which involve only a single decision maker. The primary objective of the OMG problem is to find online strategies for both players so that, as $T \to \infty$, the average NE regret (1) per round tends to 0 (i.e., the NE regret is $o(T)$).
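The following sketch computes the NE regret on a toy instance. It reads the NE regret as the absolute gap between the cumulative realized payoff and the min-max value of the summed payoff matrix, which is an assumption about the exact form of (1). The function names are hypothetical, and the exact solver is restricted to games in which Player 1 has two actions so that no linear programming library is needed.

```python
def game_value_2xn(A):
    """Exact min-max value min_x max_y x^T A y for a matrix with two rows.

    With x = (p, 1 - p), Player 2's best response is a pure column, so the
    objective max_j [p*A[0][j] + (1-p)*A[1][j]] is convex and piecewise
    linear in p. Its minimum over [0, 1] is attained at an endpoint or at a
    crossing point of two of the lines.
    """
    n = len(A[0])
    candidates = [0.0, 1.0]
    for j in range(n):
        for k in range(j + 1, n):
            slope_diff = (A[0][j] - A[1][j]) - (A[0][k] - A[1][k])
            if slope_diff != 0:
                p = (A[1][k] - A[1][j]) / slope_diff
                if 0.0 <= p <= 1.0:
                    candidates.append(p)
    return min(max(p * A[0][j] + (1 - p) * A[1][j] for j in range(n))
               for p in candidates)


def ne_regret(strategies, matrices):
    """|sum_t x_t^T A_t y_t - min_x max_y sum_t x^T A_t y| for 2 x n games."""
    d1, d2 = 2, len(matrices[0][0])
    realized = 0.0
    total = [[0.0] * d2 for _ in range(d1)]  # running sum of the A_t
    for (x, y), A in zip(strategies, matrices):
        realized += sum(x[i] * A[i][j] * y[j]
                        for i in range(d1) for j in range(d2))
        for i in range(d1):
            for j in range(d2):
                total[i][j] += A[i][j]
    return abs(realized - game_value_2xn(total))


# Illustrative sequence: matching pennies with one round of flipped payoffs.
pennies = [[1, -1], [-1, 1]]
flipped = [[-1, 1], [1, -1]]
matrices = [pennies, flipped, pennies, pennies]
uniform_play = [([0.5, 0.5], [0.5, 0.5])] * 4
pure_play = [([1.0, 0.0], [1.0, 0.0])] * 4
print(ne_regret(uniform_play, matrices))
print(ne_regret(pure_play, matrices))
```

On this sequence the summed matrix is twice the matching-pennies matrix, whose value is 0, so uniform play has NE regret 0 while always playing the first action has NE regret 2.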
We make some remarks about the choice of benchmark and the fact that the players must update jointly even though they are playing a zero-sum game. In the following examples, the comparator term arises naturally and there is a single decision maker who chooses the actions of both players.

Online Linear Programming [5]: the decision maker solves an LP where data arrives sequentially. This problem has real-world applications in ad auctions. Using Lagrangian duality, we can reduce this problem to an online zero-sum game (our setting), where Player 1 chooses primal variables and Player 2 chooses dual variables. Our benchmark corresponds to the optimal solution of the offline LP.

Generative Adversarial Networks [28]: GANs can also be viewed as a zero-sum game, where the decision maker trains the generator and discriminator to find a Nash equilibrium. Although our model cannot be used directly for GANs because they are non-convex, it is another example where both players may desire to update jointly. In Section 6 we explore this further.
In the paper, we consider the OMG problem in two distinct information feedback settings. In the full information setting (Section 4), both players are able to observe the full matrix $A_t$ at the end of round $t$. In the bandit setting (Section 5), players can only observe the entry of $A_t$ indexed by $(i_t, j_t)$ at the end of round $t$, where $i_t$ and $j_t$ are the actions sampled from the probability distributions associated with their mixed strategies $x_t$, $y_t$.
1.2 Main Contributions
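In addition to introducing a novel problem setting, the main contributions of the present work are as follows.

First, we show that no algorithm can simultaneously achieve sublinear NE regret and sublinear individual regret for both players (Theorem 1, Section 3).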

Second, in the full information setting, we provide an algorithm for the OMG problem that achieves a NE regret of (Theorem 3). Note that the regret depends logarithmically on the number of actions, allowing us to handle scenarios where the players have exponentially many actions available.

Third, we propose an algorithm for the bandit setting that achieves an NE regret of order (Theorem 5).

Fourth, we show empirically how our algorithm can be used to prevent mode collapse when training GANs in a basic setup (Section 6).
1.3 Related Work
The reader familiar with Online Convex Optimization (OCO) may find it closely related to the OMG problem. In the OCO setting, a player is given a convex, closed, and bounded action set $X$, and must repeatedly choose an action $x_t \in X$ before the convex function $f_t$ is revealed. The player's goal is to obtain sublinear individual regret, defined as $\sum_{t=1}^T f_t(x_t) - \min_{x \in X} \sum_{t=1}^T f_t(x)$. This problem is well studied, and several algorithms such as Online Gradient Descent [54], Regularized Follow the Leader [49, 2], and Perturbed Follow the Leader [36] achieve optimal individual regret bounds that scale as $O(\sqrt{T})$. The most natural (although incorrect) approach to the OMG problem is to equip each of the players with a sublinear individual regret algorithm. However, we will show in Section 3 that if both players use an algorithm that guarantees sublinear individual regret, then it is impossible to achieve sublinear NE regret when the payoff matrices are chosen adversarially. In other words, the algorithms for the OCO setting cannot be directly applied to the OMG problem considered in this paper.
We now discuss some related works that focus on learning in games. [50] study a two-player, two-action general-sum static game. They show that if both players use Infinitesimal Gradient Ascent, then either the strategy pair converges to a Nash Equilibrium (NE) or, even if it does not, the average payoffs are close to those of the NE. A result of similar flavor was derived in [20] for any zero-sum convex-concave game. Given a payoff function , they show that if both players minimize their individual regrets, then the average of actions will satisfy as , where is a NE. [14] improve upon the result of [50] by proposing an algorithm called WoLF (Win or Learn Fast), which is a modification of gradient ascent; they show that the iterates of their algorithm indeed converge to a NE. [21] further improve the results in [50] and [13] by developing an algorithm called GIGA-WoLF for multi-player nonzero-sum static games. Their algorithm learns to play optimally against stationary opponents; when used in self-play, the actions chosen by the algorithm converge to a NE. More recently, [10] studied general multi-player static games and showed that by decomposing and classifying the second-order dynamics of these games, one can prevent cycling behavior and find a NE. We note that, unlike our paper, all of the papers above consider repeated games with a static payoff matrix, whereas we allow the payoff matrix to change arbitrarily. An exception is the work by [33], who consider the same setting as our OMG problem; however, their paper only shows that the sum of the individual regrets of both players is sublinear and does not study convergence to NE.
Related to the OMG problem with bandit feedback is the seminal work of [25]. They provide the first sublinear regret bound for Online Convex Optimization with bandit feedback, using a one-point estimate of the gradient. The one-point gradient estimate used in [25] is similar to those independently proposed in [29] and in [51]. The regret bound provided in [25] is $O(T^{3/4})$, which is suboptimal. In [2], the authors give the first $\tilde{O}(\sqrt{T})$ bound for the special case when the functions are linear. More recently, [32] and [17] designed the first efficient algorithms with $\tilde{O}(\sqrt{T})$ regret for the general online convex optimization case; unfortunately, the dependence on the dimension in the regret rate is a very large polynomial. Our one-point matrix estimate is most closely related to the random estimator in [7] for linear functions. It may be possible to use the more sophisticated techniques from [2, 32, 17] to improve our NE regret bound in Section 5; however, the result does not seem to be immediate and we leave this as future work.
2 Preliminaries
In this section we introduce notation and definitions that will be used throughout the paper.
2.1 Notation
By default, all vectors are column vectors. A vector with entries $x_1, \dots, x_n$ is written as $x = (x_1, \dots, x_n)^\top$, where $\top$ denotes the transpose. For a matrix $A$, let $A_{ij}$ be the entry in the $i$-th row and $j$-th column.
2.2 Convex Functions
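For any $\alpha \ge 0$, we say that a function $f: X \to \mathbb{R}$ is $\alpha$-strongly convex with respect to a norm $\|\cdot\|$ if for any $x, x' \in X$, it holds that
$$f(x') \ge f(x) + \nabla f(x)^\top (x' - x) + \frac{\alpha}{2} \|x' - x\|^2.$$
Here, $\nabla f(x)$ denotes any subgradient of $f$ at $x$. Strong convexity implies that the optimization problem $\min_{x \in X} f(x)$ has a unique solution. If $\alpha = 0$, we simply say that the function is convex. We say a function $f$ is $\alpha$-strongly concave if $-f$ is $\alpha$-strongly convex. Furthermore, we say a function $\mathcal{L}(x, y)$ is $\alpha$-strongly convex-concave if for any fixed $y \in Y$, the function $\mathcal{L}(\cdot, y)$ is $\alpha$-strongly convex in $x$, and for any fixed $x \in X$, the function $\mathcal{L}(x, \cdot)$ is $\alpha$-strongly concave in $y$.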
2.3 Saddle Points and Nash Equilibria
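A pair $(x^*, y^*) \in X \times Y$ is called a saddle point for $\mathcal{L}: X \times Y \to \mathbb{R}$ if for any $x \in X$ and $y \in Y$, we have
$$\mathcal{L}(x^*, y) \le \mathcal{L}(x^*, y^*) \le \mathcal{L}(x, y^*). \qquad (2)$$
It is well known that if $\mathcal{L}$ is convex-concave, and $X$ and $Y$ are convex and compact sets, there always exists at least one saddle point [see e.g. 15]. Moreover, if $\mathcal{L}$ is strongly convex-concave, the saddle point is unique.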
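A saddle point is also known as a Nash equilibrium for two-player zero-sum games [45]. In a matrix game, the payoff function $\mathcal{L}(x, y) = x^\top A y$ is bilinear, and therefore convex-concave. The action spaces of the two players are $\Delta_{d_1}$ and $\Delta_{d_2}$, which are convex and compact. As a result, there always exists a Nash equilibrium for any matrix game. The famous von Neumann minimax theorem states that $\min_{x \in \Delta_{d_1}} \max_{y \in \Delta_{d_2}} x^\top A y = \max_{y \in \Delta_{d_2}} \min_{x \in \Delta_{d_1}} x^\top A y$. If Player 1 chooses $x^* \in \arg\min_{x \in \Delta_{d_1}} \max_{y \in \Delta_{d_2}} x^\top A y$ and Player 2 chooses $y^* \in \arg\max_{y \in \Delta_{d_2}} \min_{x \in \Delta_{d_1}} x^\top A y$, the pair $(x^*, y^*)$ is an equilibrium of the game [44].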
2.4 Lipschitz Continuity
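We say a function $f: X \to \mathbb{R}$ is $G$-Lipschitz continuous with respect to a norm $\|\cdot\|$ if for all $x, x' \in X$ it holds that
$$|f(x) - f(x')| \le G \|x - x'\|.$$
It is well known that the previous inequality holds if and only if
$$\|\nabla f(x)\|_* \le G$$
for all $x \in X$, where $\|\cdot\|_*$ denotes the dual norm of $\|\cdot\|$ [15, 48]. Similarly, we say a function is Lipschitz continuous with respect to a norm if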
for any and any . Again, the previous inequality holds if and only if
for all , .
Lemma 1.
Consider a matrix . If the absolute value of each entry of is bounded by , then the function is Lipschitz continuous with respect to , where . The function is also Lipschitz continuous with respect to norm , where .
3 Challenges of the OMG Problem: An Impossibility Result
Recall that we defined the Online Matrix Games (OMG) problem in Section 1.1, where two players play a zero-sum game for $T$ rounds. The sequence of payoff matrices $A_1, \dots, A_T$ is selected arbitrarily. In each round $t$, both players choose their strategies before the payoff matrix $A_t$ is revealed. The goal is to find strategies under which the players' average payoffs are close to the Nash Equilibrium of the game with payoff matrix $\frac{1}{T} \sum_{t=1}^T A_t$.
Perhaps the most natural (albeit futile) approach to the OMG problem is to equip each of the players with a sublinear individual regret algorithm to generate a sequence of iterates $\{(x_t, y_t)\}_{t=1}^T$. We gave a few examples of Online Convex Optimization (OCO) algorithms that guarantee $O(\sqrt{T})$ regret in Section 1.3. However, if each player minimizes its individual regret greedily using OCO, this approach only implies that $\sum_{t=1}^T x_t^\top A_t y_t - \min_{x} \sum_{t=1}^T x^\top A_t y_t = O(\sqrt{T})$ and $\max_{y} \sum_{t=1}^T x_t^\top A_t y - \sum_{t=1}^T x_t^\top A_t y_t = O(\sqrt{T})$. Notice that the quantity associated with the Nash Equilibrium in equation (1) does not even appear in these bounds. The reader familiar with saddle point computation may wonder how the so-called 'duality gap' [18], $\max_{y} \sum_{t=1}^T x_t^\top A_t y - \min_{x} \sum_{t=1}^T x^\top A_t y_t$, relates to achieving sublinear NE regret. It is easy to see that the duality gap is the sum of the individual regrets of both players. In view of Theorem 1, we will see that NE regret and the duality gap are in some sense incompatible.
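To spell out why the duality gap equals the sum of the individual regrets, here is a short derivation. It assumes the round-$t$ payoff is $x_t^\top A_t y_t$ with Player 1 minimizing and Player 2 maximizing; the symbols $R_1$ and $R_2$ are introduced here only for illustration.

```latex
\begin{align*}
R_1 &= \sum_{t=1}^T x_t^\top A_t y_t - \min_{x} \sum_{t=1}^T x^\top A_t y_t
      && \text{(individual regret of Player 1)} \\
R_2 &= \max_{y} \sum_{t=1}^T x_t^\top A_t y - \sum_{t=1}^T x_t^\top A_t y_t
      && \text{(individual regret of Player 2)} \\
R_1 + R_2 &= \max_{y} \sum_{t=1}^T x_t^\top A_t y - \min_{x} \sum_{t=1}^T x^\top A_t y_t
      && \text{(duality gap)}
\end{align*}
```

The realized payoff cancels in the sum. Neither remaining term involves the min-max value of the summed game, which is the benchmark in the NE regret, so a small duality gap says nothing directly about the NE regret.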
In this section we present a result that shows that there is no algorithm that simultaneously achieves sublinear NE regret and individual regret for both players. This implies that if both players individually use any existing algorithm from OCO they would inevitably fail to solve the OMG problem.
Theorem 1.
A full proof of the result is given in the Appendix; here we give a sketch. The main idea is to construct two parallel scenarios, each with its own sequence of payoff matrices. The two scenarios are identical for the first periods but differ for the rest of the horizon. In our particular construction, in both scenarios the players play the well-known "matching-pennies" game for the first periods; then, in the first scenario they play a game with equal payoffs for all of their actions, and in the second scenario they play a game where Player 1 is indifferent between its actions. One can show that if all three quantities in the statement of the theorem are $o(T)$ in the first scenario, then at least one of them is $\Omega(T)$ in the second one, which yields the result. This suggests that the machinery for OCO, which minimizes individual regret, cannot be directly applied to the OMG problem.
4 Online Matrix Games: Full Information
4.1 Saddle Point Regularized Follow-the-Leader
In this section we propose an algorithm to solve the OMG problem in the full information setting. In fact, we will consider the algorithm in a slightly more general setting than the OMG problem, allowing the sequence of payoff functions to be specified by arbitrary convex-concave Lipschitz functions, and the action sets of Player 1 and Player 2 ($X$ and $Y$ respectively) to be arbitrary convex compact sets.
Let the sequence of convex-concave functions be $\{\mathcal{L}_t\}_{t=1}^T$, which are Lipschitz with respect to some norm $\|\cdot\|$. We propose an algorithm called Saddle Point Regularized Follow the Leader (SPRFTL), shown in Algorithm 1.
The regularizers are used as input for the algorithm. We will choose regularizers that are strongly convex with respect to norm , and and Lipschitz with respect to norm , which means that for all , and for, all . Finally, we assume for all and for all .
The main difference between SPRFTL and the well known Regularized Follow the Leader (RFTL) algorithm [49, 2] is that in SPRFTL both players update jointly and play the saddle point of the sum of regularized games observed so far. In particular, they disregard their previous actions. In contrast, the updates for RFTL would be
for , and , are chosen so as to minimize and in their respective sets . It is easy to see that the two sequences of iterates are in general not the same. In fact, in view of Theorem 1 we know that RFTL cannot achieve sublinear NE regret when the sequence of functions is chosen arbitrarily. One last remark about the algorithm is that as the last iterates will converge to the set of NE of the average game . To see this, observe that if then i.e. solves the average problem where the regularization is vanishing, and a similar expression can be written for . This is in contrast with many of the results mentioned in Section 1.3, where it is the average of the iterates that is an approximate equilibrium.
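The joint update described above can be sketched for 2 x 2 matrix games as follows. Algorithm 1 is not reproduced in this text, so the update rule below is an assumption based on the description: before seeing the round's matrix, both players play the saddle point of the sum of the payoffs observed so far plus `(1/eta) * R(x) - (1/eta) * R(y)`. The Euclidean regularizer `R(z) = ||z||^2 / 2`, the parameter name `eta`, and the function names are illustrative choices, not the paper's.

```python
def regularized_saddle(S, eta):
    """Saddle point of x^T S y + R(x)/eta - R(y)/eta over 2-action simplices.

    Strategies are x = (p, 1 - p), y = (q, 1 - q), and R(z) = ||z||^2 / 2.
    The objective is strongly convex in p and strongly concave in q, so the
    inner maximization has a closed form and the outer problem is unimodal.
    """
    a = S[0][0] - S[0][1] - S[1][0] + S[1][1]   # coefficient of p*q
    b = S[0][1] - S[1][1]                       # coefficient of p
    c = S[1][0] - S[1][1]                       # coefficient of q

    def reg(z):
        return (z * z + (1 - z) * (1 - z)) / (2 * eta)

    def best_q(p):
        # Maximizer of the concave quadratic in q, clipped to [0, 1].
        return min(1.0, max(0.0, (eta * (a * p + c) + 1) / 2))

    def phi(p):
        # Value of the inner maximization (constant term S[1][1] dropped).
        q = best_q(p)
        return a * p * q + b * p + c * q + reg(p) - reg(q)

    lo, hi = 0.0, 1.0
    for _ in range(200):                        # ternary search on convex phi
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if phi(m1) < phi(m2):
            hi = m2
        else:
            lo = m1
    p = (lo + hi) / 2
    return p, best_q(p)


def sp_rftl_2x2(matrices, eta):
    """Play the regularized saddle point of all previously observed games."""
    S = [[0.0, 0.0], [0.0, 0.0]]                # running sum of observed A_t
    iterates = []
    for A in matrices:
        p, q = regularized_saddle(S, eta)       # chosen before A is revealed
        iterates.append(((p, 1 - p), (q, 1 - q)))
        for i in range(2):
            for j in range(2):
                S[i][j] += A[i][j]
    return iterates


for x, y in sp_rftl_2x2([[[1, -1], [-1, 1]]] * 3, eta=0.5):
    print(x, y)
```

Note that each iterate depends only on the summed payoffs and the regularizers, not on the players' own previous actions, which is the feature the text contrasts with RFTL.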
We have the following guarantee for SPRFTL.
Theorem 2.
For , let be Lipschitz with respect to norm . Let , be strongly convex functions with respect to the same norm, let be the Lipschitz constants of , with respect to the same norm. Let be the iterates generated by SPRFTL when run on convexconcave functions . It holds that
where the last equality follows by choosing .
A formal proof of the theorem is provided in the Appendix and a sketch will be given shortly.
We note that the bound in Theorem 2 holds for general convex-concave functions; however, the dependence on the dimension is hidden in the Lipschitz constants and the choice of regularizer. It is easy to check that if one chooses as regularizer, and the functions are Lipschitz continuous with respect to norm , then the NE regret bound will be .
We now provide a sketch of the proof of Theorem 2. Define . Notice that it is strongly convex in with respect to norm for all and strongly concave with respect to norm for all . Additionally, notice that is Lipschitz with respect to norm . Finally, notice that for , all and all it holds that
(6) 
The following lemma shows that the values of the convex-concave games defined by and are not too far from each other.
Lemma 2.
Let
It holds that
To prove the NE regret bound, we note that SPRFTL is running a Follow-the-Leader scheme on functions [36]. With the next two lemmas one can show that the NE regret of the players relative to functions is small.
Lemma 3.
Let be the iterates of SPRFTL. It holds that
Lemma 4.
Let be the sequence of iterates generated by the algorithm. It holds that
4.2 Logarithmic Dependence on the Dimension of the Action Spaces
Previously, we analyzed the OMG problem by treating the payoff functions as general convex-concave functions and the action spaces as general convex compact sets. We explained that in general one should expect to achieve NE regret which depends linearly on the dimension of the problem. The goal in this section is to obtain sharper NE regret bounds that scale logarithmically in the dimension by exploiting the geometry of the decision sets and the bilinear structure of the payoff functions. This allows us to solve games which may have exponentially many actions, which often arise in combinatorial optimization settings.
The plan to obtain the desired NE regret bounds in this more restrictive setting is to use the negative entropy as a regularization function (which is strongly convex with respect to $\|\cdot\|_1$), that is, $R_X(x) = \sum_{i=1}^{d_1} x_i \ln(x_i) + \ln(d_1)$ and $R_Y(y) = \sum_{i=1}^{d_2} y_i \ln(y_i) + \ln(d_2)$, where the extra logarithmic terms ensure $R_X$, $R_Y$ are nonnegative everywhere in their respective simplexes. Unfortunately, the negative entropy is not Lipschitz over the simplex, so we cannot leverage our result from Theorem 2. To deal with this challenge, we will restrict the new algorithm to play over a restricted simplex (we will also use the notation and to mean the restricted simplex of Player 1 and 2, respectively):
(7) 
The tuning parameter used for the algorithm will be defined later in the analysis. (Notice that when , the set is empty.) We have the following result.
Lemma 5.
The function is Lipschitz continuous with respect to over with .
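The role of the restricted simplex can be illustrated with a short sketch. It assumes the restricted simplex is the set of probability vectors whose entries are all at least `theta`, which is consistent with the remark that the set is empty when the parameter is too large, and it assumes the relevant norm is the 1-norm, whose dual is the max-norm. The mixing map used here is one simple way to land in the restricted set; it is not the projection analyzed in the paper, and all names are hypothetical.

```python
import math


def in_restricted_simplex(x, theta, tol=1e-12):
    """True if x sums to 1 and every entry is at least theta."""
    return abs(sum(x) - 1.0) <= tol and all(xi >= theta - tol for xi in x)


def mix_with_uniform(x, theta):
    """Map a point of the simplex into the restricted simplex.

    Each entry becomes (1 - d*theta) * x_i + theta, so entries are at least
    theta and the total mass is (1 - d*theta) + d*theta = 1.
    """
    d = len(x)
    if theta * d > 1:
        raise ValueError("restricted simplex is empty when theta > 1/d")
    return [(1 - d * theta) * xi + theta for xi in x]


def neg_entropy_grad_max_norm(x):
    """Max-norm of the gradient of sum_i x_i*ln(x_i), i.e. max |1 + ln x_i|."""
    return max(abs(1 + math.log(xi)) for xi in x)


theta = 0.1
vertex = [1.0, 0.0, 0.0]            # gradient of the entropy blows up here
inside = mix_with_uniform(vertex, theta)
print(inside, in_restricted_simplex(inside, theta))
print(neg_entropy_grad_max_norm(inside))
```

At a vertex of the full simplex the gradient of the negative entropy is unbounded because of the `ln(0)` terms, whereas on the restricted set every entry is at least `theta`, so the gradient's max-norm is at most `1 + |ln(theta)|`.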
The algorithm Online-Matrix-Games Regularized-Follow-the-Leader (OMGRFTL) is an instantiation of SPRFTL with a particular choice of regularization functions, which are nonnegative and Lipschitz over the sets , . With this, we can prove a NE regret bound for the OMG problem. For the remainder of the paper, the regularization functions will be set as follows:
We have the following guarantee for OMGRFTL.
Theorem 3.
Let be an arbitrary sequence of matrices with entries bounded between . Let be the Lipschitz constant (with respect to ) of for . Let be the iterates of OMGRFTL and choose such that . Set . It holds that
A full proof of the theorem can be found in the Appendix. We now give a sketch of the proof. Since the algorithm selects actions over the restricted simplex, we must quantify the potential loss in the NE regret bound imposed by this restriction. The next two lemmas make this precise.
Lemma 6.
Let define , with . Notice is unique since it is a projection. It holds that .
Lemma 7.
Let be an arbitrary sequence of convexconcave functions, , that are Lipschitz with respect to . With , and . It holds that
Combining the previous two lemmas and Theorem 2, one can show the NE regret bound for OMGRFTL holds.
5 Online Matrix Games: Bandit Feedback
In this section we focus on the OMG problem under bandit feedback. In this setting, the players observe in every round only the payoff corresponding to the chosen actions. If Player 1 chooses action $i_t$, Player 2 chooses action $j_t$, and the payoff matrix at that time step is $A_t$, then the players observe only $(A_t)_{i_t j_t}$ instead of the full matrix $A_t$. The limited feedback makes the problem significantly more challenging than the full information one: the players must find a way to exploit (use all previous information to try to play a Nash Equilibrium) and explore (try to estimate $A_t$ in every round). This problem resembles that of Online Bandit Optimization [25, 7, 17, 32]; the main difference is that with one function evaluation we must estimate a matrix instead of the gradients and where .
Before proceeding, we establish some useful notation. For $i \in [d]$, let $e_i$ denote the $i$-th standard unit vector, i.e., $e_i$ is the vector that has a 1 in the $i$-th entry and 0 in the rest. Let $e_{i_t}$ be the standard unit vector corresponding to the decision made by Player 1 in round $t$, and define $e_{j_t}$ similarly for Player 2. Notice that under bandit feedback, in round $t$ both players only observe the quantity $e_{i_t}^\top A_t e_{j_t}$.
5.1 A One-Point Estimate for $A_t$
As explained previously, in each round the players must estimate $A_t$ by observing only one of its entries. To this end, we allow the players to share their decisions with each other and to randomize jointly (a similar assumption is used to define correlated equilibria in zero-sum games, see [8]). The following result shows how to build a random estimate of $A_t$ by observing only one of its entries.
Theorem 4.
Let with and . Sample . Let be the matrix with for all such that and and . It holds that
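The idea behind a one-point estimate can be sketched as follows. The exact construction in Theorem 4 is not legible in this text, so the code assumes the standard importance-weighted estimator, in line with the later description of weighing the observation by the inverse probability of obtaining it: sample `i` from `x` and `j` from `y` independently, place `A[i][j] / (x[i] * y[j])` at position `(i, j)`, and leave every other entry zero. The function names are hypothetical, and both strategies are assumed to have strictly positive entries.

```python
import random
from fractions import Fraction


def one_point_estimate(A, x, y, rng):
    """Build a matrix estimate from the single observed entry A[i][j]."""
    d1, d2 = len(x), len(y)
    i = rng.choices(range(d1), weights=[float(w) for w in x])[0]
    j = rng.choices(range(d2), weights=[float(w) for w in y])[0]
    estimate = [[0] * d2 for _ in range(d1)]
    # Inverse-probability weighting; requires x[i] * y[j] > 0.
    estimate[i][j] = A[i][j] / (x[i] * y[j])
    return i, j, estimate


def exact_expectation(A, x, y):
    """Expectation of the estimate, enumerating all outcomes (i, j)."""
    d1, d2 = len(x), len(y)
    expected = [[Fraction(0)] * d2 for _ in range(d1)]
    for i in range(d1):
        for j in range(d2):
            prob = x[i] * y[j]                 # probability of observing (i, j)
            # Outcome (i, j) contributes only to entry (i, j) of the estimate.
            expected[i][j] += prob * (A[i][j] / prob)
    return expected


A = [[1, -1], [-1, 1]]
x = [Fraction(1, 4), Fraction(3, 4)]
y = [Fraction(1, 3), Fraction(2, 3)]
print(exact_expectation(A, x, y) == A)         # unbiasedness, checked exactly
print(one_point_estimate(A, x, y, random.Random(0)))
```

The estimate is unbiased, but its nonzero entry scales with `1 / (x[i] * y[j])`, so its magnitude grows as the sampling probabilities shrink. Keeping the strategies inside a restricted simplex bounds these probabilities away from zero.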
5.2 Bandit Online Matrix Games RFTL
We now present an algorithm that ensures sublinear (i.e., $o(T)$) NE regret under bandit feedback for the OMG problem, and that holds against an adaptive adversary. By adaptive adversary, we mean that the payoff matrices can depend on the players' actions up to time $t-1$; in particular, we assume the adversary does not observe the actions chosen by the players for time period $t$ when choosing $A_t$. We consider an algorithm that runs OMGRFTL on a sequence of functions , where is the unbiased one-point estimate of $A_t$ derived in Theorem 4. Recall that the iterates of the OMGRFTL algorithm are distributions over the possible actions of both players. In order to generate the estimate , both players will sample an action from their distributions and weigh their observation with the inverse probability of obtaining that observation.
We have the following guarantee for BanditOMGRFTL.
Theorem 5.
Let be any sequence of payoff matrices chosen by an adaptive adversary. Let be the iterates generated by BanditOMGRFTL. Setting , ensures
where the expectation is taken with respect to randomization in the algorithm.
We now give a sketch of the proof. The total payoff given to each of the players is given by so we must relate this quantity to the iterates of OMGRFTL when run on sequence of matrices . The following two lemmas will allow us to do so.
Lemma 8.
Let be the sequence of iterates generated by BanditOMGRFTL. It holds that
where the expectation is taken with respect to the internal randomness of the algorithm.
Lemma 9.
It holds that
where the expectation is with respect to all the internal randomness of the algorithm.
We will then bound the difference between the comparator term