We consider a problem in which two players interact repeatedly in a zero-sum game. The payoff matrix of the game is unknown to the players a priori and may change arbitrarily in each round. Our objective is to find competitive strategies whose payoffs, in the long run, match the Nash equilibrium of the game defined by the average payoff matrix. This problem is a significant extension of the classical learning setting in zero-sum games, where the underlying payoff matrix is often assumed to be fixed or i.i.d. In contrast, we allow the payoff matrix to evolve arbitrarily in each round, and it may even be selected in an adversarial fashion.
Zero-sum games [53, 44] are ubiquitous in economics and central to understanding Linear Programming duality [31, 4], convex optimization [3, 1], robust optimization, and Differential Privacy. The task of finding the Nash equilibrium of a zero-sum game is also connected to several machine learning problems, such as Markov Games, Boosting, Multiarmed Bandits with Knapsacks [9, 34], and dynamic pricing problems.
1.1 Problem Formulation: Online Matrix Games
We start by reviewing the definition of classical two-player zero-sum games. Suppose player 1 has $d_1$ possible actions and player 2 has $d_2$ possible actions. The payoffs for both players are determined by a matrix $A \in \mathbb{R}^{d_1 \times d_2}$, with $A_{ij}$ corresponding to the loss of player 1 and the reward of player 2 when they choose to play actions $(i, j) \in [d_1] \times [d_2]$. (Throughout, $[n] = \{1, \dots, n\}$ for any positive integer $n$.) We allow the players to use mixed strategies: each mixed strategy is represented by a probability distribution over their actions. More specifically, when Player 1 uses a mixed strategy $x \in \Delta_{d_1}$ and Player 2 uses a mixed strategy $y \in \Delta_{d_2}$, the expected payoff is $x^\top A y$. (Here, $\Delta_d$ represents the unit simplex in dimension $d$: $\Delta_d = \{x \in \mathbb{R}^d : x \ge 0, \ \sum_{i=1}^d x_i = 1\}$.) Throughout the paper, we refer to the static zero-sum game as a matrix game (MG), because the players' payoffs are a bilinear function encoded by the matrix $A$. A Nash equilibrium of this game is defined as any pair of (possibly) mixed strategies $(x^*, y^*)$ such that
$$(x^*)^\top A y \le (x^*)^\top A y^* \le x^\top A y^*$$
for any $x \in \Delta_{d_1}$, $y \in \Delta_{d_2}$. It is well known that every MG has at least one Nash equilibrium. The problem of finding an equilibrium for an MG can be reduced to solving linear programming problems. In fact, it has been shown that the converse is also true: every linear programming problem can be solved by finding an equilibrium of a corresponding MG.
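The equilibrium condition can be checked mechanically on a small example. The sketch below is illustrative only: the function names and the choice of the matching-pennies matrix are assumptions made here, not part of the paper. It relies on the fact that a bilinear payoff is extremized at pure strategies, so testing pure deviations is enough.

```python
from fractions import Fraction


def expected_payoff(x, A, y):
    """Return x^T A y: Player 1's expected loss and Player 2's expected gain."""
    return sum(x[i] * A[i][j] * y[j] for i in range(len(x)) for j in range(len(y)))


def is_nash_equilibrium(x, A, y, tol=0):
    """Check the saddle-point inequalities against pure deviations only."""
    value = expected_payoff(x, A, y)
    d1, d2 = len(A), len(A[0])
    # Player 1 (the minimizer) must not do better with any pure row i.
    best_row = min(sum(A[i][j] * y[j] for j in range(d2)) for i in range(d1))
    # Player 2 (the maximizer) must not do better with any pure column j.
    best_col = max(sum(x[i] * A[i][j] for i in range(d1)) for j in range(d2))
    return best_row >= value - tol and best_col <= value + tol


# Hypothetical example: matching pennies with exact rational arithmetic.
matching_pennies = [[1, -1], [-1, 1]]
uniform = [Fraction(1, 2), Fraction(1, 2)]
pure_first = [Fraction(1), Fraction(0)]
```

Here `is_nash_equilibrium(uniform, matching_pennies, uniform)` holds, while replacing Player 1's strategy by `pure_first` breaks the condition because Player 2 can then profit by switching to the first column.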
Now, we define a problem that generalizes matrix games to an online setting, which we call the Online Matrix Games (OMG) problem. Suppose two players interact in a repeated zero-sum matrix game over $T$ rounds. In every round $t \in [T]$, they must each choose a (possibly) mixed strategy from the given action sets, $x_t \in \Delta_{d_1}$ and $y_t \in \Delta_{d_2}$. However, we assume that the payoff matrix in OMG can evolve in each round, and the players have no knowledge of the payoff quantities in that round before they commit to an action. Let $A_1, \dots, A_T$ be an arbitrary sequence of matrices, where each $A_t \in \mathbb{R}^{d_1 \times d_2}$ for all $t \in [T]$. For each round $t$, the players choose their mixed strategies $x_t, y_t$ before the matrix $A_t$ is revealed. Then, player 1 (resp. player 2) receives a loss (resp. gain) given by the payoff quantity $x_t^\top A_t y_t$. Note that the payoff matrix is allowed to change arbitrarily from round to round and may even depend on the past actions of both players. The joint goal for both players is to find strategies that ensure their average payoff over $T$ rounds is close to the Nash equilibrium payoff under the average payoff matrix in hindsight.
More precisely, let us call the quantity
$$\left| \sum_{t=1}^{T} x_t^\top A_t y_t - \min_{x \in \Delta_{d_1}} \max_{y \in \Delta_{d_2}} \sum_{t=1}^{T} x^\top A_t y \right| \tag{1}$$
the Nash Equilibrium (NE) regret. This is a natural extension of the regret concept in typical online learning or multi-armed bandit problems, which involve only a single decision maker. The primary objective of the OMG problem is to find online strategies for both players so that, as $T \to \infty$, the average NE regret (1) per round tends to 0 (i.e., the NE regret is $o(T)$).
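To make the benchmark concrete, the sketch below computes an NE regret for games in which Player 1 has exactly two actions. It assumes the NE regret is the absolute difference between the realized cumulative payoff and the minimax value of the summed payoff matrix, as the surrounding prose describes. All function names are hypothetical, and the two-row restriction is only there so the minimax value can be found exactly without an LP solver.

```python
def payoff(x, A, y):
    """Bilinear payoff x^T A y."""
    return sum(x[i] * A[i][j] * y[j] for i in range(len(x)) for j in range(len(y)))


def minimax_value_two_rows(A):
    """Exact min_x max_y x^T A y for a matrix with exactly two rows.

    With x = (p, 1 - p), the inner maximum is the upper envelope of the lines
    p * top[j] + (1 - p) * bottom[j]. It is convex and piecewise linear in p,
    so its minimum is at p = 0, p = 1, or a crossing point of two lines.
    """
    top, bottom = A
    n = len(top)
    candidates = [0.0, 1.0]
    for j in range(n):
        for k in range(j + 1, n):
            slope_gap = (top[j] - bottom[j]) - (top[k] - bottom[k])
            if slope_gap != 0:
                p = (bottom[k] - bottom[j]) / slope_gap
                if 0.0 <= p <= 1.0:
                    candidates.append(p)

    def envelope(p):
        return max(p * top[j] + (1 - p) * bottom[j] for j in range(n))

    return min(envelope(p) for p in candidates)


def ne_regret(matrices, xs, ys):
    """|sum_t x_t^T A_t y_t - min_x max_y x^T (sum_t A_t) y| (assumed definition)."""
    realized = sum(payoff(x, A, y) for A, x, y in zip(matrices, xs, ys))
    rows, cols = len(matrices[0]), len(matrices[0][0])
    summed = [[sum(A[i][j] for A in matrices) for j in range(cols)] for i in range(rows)]
    return abs(realized - minimax_value_two_rows(summed))


# Hypothetical example: four rounds of matching pennies.
games = [[[1.0, -1.0], [-1.0, 1.0]]] * 4
uniform_play = ne_regret(games, [[0.5, 0.5]] * 4, [[0.5, 0.5]] * 4)
pure_play = ne_regret(games, [[1.0, 0.0]] * 4, [[1.0, 0.0]] * 4)
```

In this toy run the uniform strategies match the benchmark, while always playing the first row against the first column accumulates a payoff that is far from the minimax value of the summed game.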
We make some remarks about the choice of benchmark and the fact that the players must update jointly despite the fact that they are playing a zero-sum game. In the following examples, the comparator term arises naturally and there is one decision maker which chooses the actions of both players.
Online Linear Programming: the decision maker solves an LP where data arrives sequentially. This problem has real-world applications in ad auctions. Using Lagrangian duality, we can reduce this problem to an online zero-sum game (our setting), where player 1 chooses primal variables and player 2 chooses dual variables. Our benchmark corresponds to the optimal solution of the offline LP.
Generative Adversarial Networks: GANs can also be viewed as a zero-sum game, where the decision maker trains the generator and discriminator to find a Nash equilibrium. Although our model cannot be used directly for GANs because they are nonconvex, it is another example where both players may desire to update jointly. In Section 6 we explore this further.
In the paper, we consider the OMG problem in two distinct information feedback settings. In the full information setting (Section 4), both players are able to observe the full matrix $A_t$ at the end of round $t$. In the bandit setting (Section 5), players can only observe the entry of $A_t$ indexed by $(i_t, j_t)$ at the end of round $t$, where $i_t$ and $j_t$ are the actions sampled from the probability distributions associated with their mixed strategies $x_t, y_t$.
1.2 Main Contributions
In addition to introducing a novel problem setting, the main contributions of the present work are as follows.
First, we show that no algorithm can simultaneously achieve sublinear NE regret and sublinear individual regret for both players (Theorem 1).
Second, in the full information setting, we provide an algorithm for the OMG problem that achieves a NE regret of (Theorem 3). Note that the regret depends logarithmically on the number of actions, allowing us to handle scenarios where the players have exponentially many actions available.
Third, we propose an algorithm for the bandit setting that achieves an NE regret of order (Theorem 5).
Fourth, we show empirically how our algorithm can be used to prevent mode collapse when training GANs in a basic setup (Section 6).
1.3 Related Work
The reader familiar with Online Convex Optimization (OCO) may find it closely related to the OMG problem. In the OCO setting, a player is given a convex, closed, and bounded action set $\mathcal{X}$, and must repeatedly choose an action $x_t \in \mathcal{X}$ before the convex function $f_t$ is revealed. The player's goal is to obtain sublinear individual regret, defined as $\sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x)$. This problem is well studied, and several algorithms such as Online Gradient Descent, Regularized Follow the Leader [49, 2], and Perturbed Follow the Leader achieve optimal individual regret bounds that scale as $O(\sqrt{T})$. The most natural (although incorrect) approach to attack the OMG problem is to equip each of the players with a sublinear individual regret algorithm. However, we will show in Section 3 that if both players use an algorithm that guarantees sublinear individual regret, then it is impossible to achieve sublinear NE regret when the payoff matrices are chosen adversarially. In other words, the algorithms for the OCO setting cannot be directly applied to the OMG problem considered in this paper.
We now discuss some related works that focus on learning in games.  study a two player, two-action general sum static game. They show that if both players use Infinitesimal Gradient Ascent, either the strategy pair will converge to a Nash Equilibrium (NE), or even if they do not, then the average payoffs are close to that of the NE. A result of similar flavor was derived in  for any zero-sum convex-concave game. Given a payoff function , they show that if both players minimize their individual-regrets, then the average of actions will satisfy as , where is a NE.  improve upon the result of  by proposing an algorithm called WoLF (Win or Learn Fast), which is a modification of gradient ascent; they show that the iterates of their algorithm indeed converge to a NE.  further improve the results in  and  by developing an algorithm called GIGA-WoLF for multi-player nonzero sum static games. Their algorithm learns to play optimally against stationary opponents; when used in self-play, the actions chosen by the algorithm converge to a NE. More recently, 
studied general multi-player static games and show that by decomposing and classifying the second order dynamics of these games, one can prevent cycling behavior to find NE. We note that unlike our paper, all of the papers above consider repeated games with a static payoff matrix, whereas we allow the payoff matrix to change arbitrarily. An exception is the work by, who consider the same setting as our OMG problem; however their paper only shows that the sum of the individual regrets of both players is sublinear and does not study convergence to NE.
Related to the OMG problem with bandit feedback is the seminal work of . They provide the first sublinear regret bound for Online Convex Optimization with bandit feedback, using a one-point estimate of the gradient. The one-point gradient estimate used in  is similar to those independently proposed in  and in . The regret bound provided in  is , which is suboptimal. In , the authors give the first bound for the special case when the functions are linear. More recently,  and  designed the first efficient algorithms with regret for the general online convex optimization case; unfortunately, the dependence on the dimension in the regret rate is a very large polynomial. Our one-point matrix estimate is most closely related to the random estimator in  for linear functions. It is possible to use the more sophisticated techniques from [2, 32, 17] to improve our NE regret bound in section 5; however, the result does not seem to be immediate and we leave this as future work.
2 Preliminaries
In this section we introduce notation and definitions that will be used throughout the paper.
2.1 Notation
By default, all vectors are column vectors. A vector with entries $x_1, \dots, x_d$ is written as $x = (x_1, \dots, x_d)^\top$, where $\top$ denotes the transpose. For a matrix $A$, let $A_{ij}$ be the entry in the $i$-th row and $j$-th column.
2.2 Convex Functions
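For any $H \ge 0$ we say that a function $f : \mathcal{X} \to \mathbb{R}$ is $H$-strongly convex with respect to a norm $\|\cdot\|$ if, for any $x, x' \in \mathcal{X}$, it holds that
$$f(x) \ge f(x') + \nabla f(x')^\top (x - x') + \frac{H}{2} \|x - x'\|^2 .$$
Here, $\nabla f(x')$ denotes any subgradient of $f$ at $x'$. Strong convexity (with $H > 0$) implies that the optimization problem $\min_{x \in \mathcal{X}} f(x)$ has a unique solution. If $H = 0$ we simply say that the function is convex. We say a function $f$ is $H$-strongly concave if $-f$ is $H$-strongly convex. Furthermore, we say a function $\mathcal{L}(x, y)$ is $H$-strongly convex-concave if for any fixed $y$, the function $\mathcal{L}(\cdot, y)$ is $H$-strongly convex in $x$, and for any fixed $x$, the function $\mathcal{L}(x, \cdot)$ is $H$-strongly concave in $y$.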
2.3 Saddle Points and Nash Equilibria
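A pair $(x^*, y^*)$ is called a saddle point for $\mathcal{L} : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ if for any $x \in \mathcal{X}$ and $y \in \mathcal{Y}$, we have
$$\mathcal{L}(x^*, y) \le \mathcal{L}(x^*, y^*) \le \mathcal{L}(x, y^*) .$$
It is well known that if $\mathcal{L}$ is convex-concave, and $\mathcal{X}$ and $\mathcal{Y}$ are convex and compact sets, there always exists at least one saddle point [see e.g. 15]. Moreover, if $\mathcal{L}$ is strongly convex-concave, the saddle point is unique.
A saddle point is also known as a Nash equilibrium for two-player zero-sum games. In a matrix game, the payoff function $\mathcal{L}(x, y) = x^\top A y$ is bilinear, and therefore is convex-concave. The action spaces of the two players are $\Delta_{d_1}$ and $\Delta_{d_2}$, which are convex and compact. As a result, there always exists a Nash equilibrium for any matrix game. The famous von Neumann minimax theorem states that $\min_{x \in \Delta_{d_1}} \max_{y \in \Delta_{d_2}} x^\top A y = \max_{y \in \Delta_{d_2}} \min_{x \in \Delta_{d_1}} x^\top A y$. If Player 1 chooses $x^* \in \arg\min_{x} \max_{y} x^\top A y$ and Player 2 chooses $y^* \in \arg\max_{y} \min_{x} x^\top A y$, the pair $(x^*, y^*)$ is an equilibrium of the game.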
2.4 Lipschitz Continuity
We say a function is -Lipschitz continuous with respect to a norm if for all it holds that
It is well known that the previous inequality holds if and only if
for any and any . Again, the previous inequality holds if and only if
for all , .
Consider a matrix . If the absolute value of each entry of is bounded by , then the function is -Lipschitz continuous with respect to , where . The function is also -Lipschitz continuous with respect to norm , where .
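The Lipschitz claim for bilinear payoffs can be checked directly. The derivation below is a sketch under assumed notation that the text above does not fix: $c$ is the entrywise bound on $A$, the points $x, x'$ and $y, y'$ are mixed strategies in the two simplexes, and the norm on a pair is taken to be $\|(x, y)\|_1 = \|x\|_1 + \|y\|_1$. Under these assumptions it yields the constant $c$; other norm conventions would give a different constant.

```latex
\begin{align*}
|x^\top A y - x'^\top A y'|
  &= |(x - x')^\top A y + x'^\top A (y - y')| \\
  &\le \|x - x'\|_1 \, \|A y\|_\infty + \|A^\top x'\|_\infty \, \|y - y'\|_1 \\
  &\le c \left( \|x - x'\|_1 + \|y - y'\|_1 \right).
\end{align*}
% The last step uses |(A y)_i| = |\sum_j A_{ij} y_j| \le c \sum_j y_j = c
% for y in the simplex, and likewise |(A^\top x')_j| \le c.
```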
3 Challenges of the OMG Problem: An Impossibility Result
Recall that we defined the Online Matrix Games (OMG) problem in Section 1.1, where two players play a zero-sum game for $T$ rounds. The sequence of payoff matrices $A_1, \dots, A_T$ is selected arbitrarily. In each round $t$, both players choose their strategies before the payoff matrix $A_t$ is revealed. The goal is to find strategies under which the players' average payoffs are close to the Nash Equilibrium of the game with the average payoff matrix $\frac{1}{T} \sum_{t=1}^{T} A_t$.
Perhaps the most natural (albeit futile) approach to attack the OMG problem is to equip each of the players with a sublinear individual regret algorithm to generate a sequence of iterates $(x_t, y_t)$. We gave a few examples of Online Convex Optimization (OCO) algorithms that guarantee $O(\sqrt{T})$ regret in Section 1.3. However, if each player minimizes its individual regret greedily using OCO, this approach only implies that $\sum_{t=1}^{T} x_t^\top A_t y_t - \min_{x} \sum_{t=1}^{T} x^\top A_t y_t = o(T)$ and $\max_{y} \sum_{t=1}^{T} x_t^\top A_t y - \sum_{t=1}^{T} x_t^\top A_t y_t = o(T)$. Notice that the quantity associated with the Nash Equilibrium in equation (1) does not even appear in these bounds. The reader familiar with saddle point computation may wonder how the so-called "duality gap", $\max_{y} \sum_{t=1}^{T} x_t^\top A_t y - \min_{x} \sum_{t=1}^{T} x^\top A_t y_t$, relates to achieving sublinear NE regret. It is easy to see that the duality gap is the sum of the individual regrets of both players. In view of Theorem 1, we will see that NE regret and the duality gap are in some sense incompatible.
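The claim that the duality gap equals the sum of the individual regrets follows from a one-line cancellation. The display below spells it out, assuming the standard individual-regret definitions for the minimizing Player 1 and the maximizing Player 2; the symbols $x_t$, $y_t$, $A_t$ denote the iterates and payoff matrices of round $t$.

```latex
\begin{align*}
&\underbrace{\sum_{t=1}^{T} x_t^\top A_t y_t - \min_{x} \sum_{t=1}^{T} x^\top A_t y_t}_{\text{individual regret of Player 1}}
 \;+\;
 \underbrace{\max_{y} \sum_{t=1}^{T} x_t^\top A_t y - \sum_{t=1}^{T} x_t^\top A_t y_t}_{\text{individual regret of Player 2}} \\
&\qquad = \max_{y} \sum_{t=1}^{T} x_t^\top A_t y \;-\; \min_{x} \sum_{t=1}^{T} x^\top A_t y_t .
\end{align*}
% The realized payoff \sum_t x_t^\top A_t y_t cancels, so the benchmark
% \min_x \max_y \sum_t x^\top A_t y from equation (1) never enters this quantity.
```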
In this section we present a result that shows that there is no algorithm that simultaneously achieves sublinear NE regret and individual regret for both players. This implies that if both players individually use any existing algorithm from OCO they would inevitably fail to solve the OMG problem.
A full proof of the result is given in the Appendix, but here we give a sketch. The main idea is to construct two parallel scenarios, each with its own sequence of payoff matrices. The two scenarios are identical for the first periods but differ for the rest of the horizon. In our particular construction, in both scenarios the players play the well-known "matching-pennies" game for the first periods; then, in the first scenario they play a game with equal payoffs for all of their actions, and in the second scenario they play a game where Player 1 is indifferent between its actions. One can show that if all three quantities in the statement of the theorem are sublinear in the first scenario, then at least one of them is linear in the second, which yields the result. This suggests that the machinery for OCO, which minimizes individual regret, cannot be directly applied to the OMG problem.
4 Online Matrix Games: Full Information
4.1 Saddle Point Regularized Follow-the-Leader
In this section we propose an algorithm to solve the OMG problem in the full information setting. In fact, we will consider the algorithm in a slightly more general setting than the OMG problem, allowing the sequence of payoff functions to be specified by arbitrary convex-concave Lipschitz functions, and the action sets of Player 1 and Player 2 ($\mathcal{X}$ and $\mathcal{Y}$ respectively) to be arbitrary convex compact sets.
Let the sequence of convex-concave functions be , which are -Lipschitz with respect to some norm . We propose an algorithm called Saddle Point Regularized Follow the Leader (SP-RFTL), shown in Algorithm 1.
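The regularizers $R_X$, $R_Y$ are used as input for the algorithm. We will choose regularizers that are strongly convex with respect to norm $\|\cdot\|$, and $G_{R_X}$- and $G_{R_Y}$-Lipschitz with respect to norm $\|\cdot\|$, which means that $\|\nabla R_X(x)\|_* \le G_{R_X}$ for all $x \in \mathcal{X}$, and $\|\nabla R_Y(y)\|_* \le G_{R_Y}$ for all $y \in \mathcal{Y}$. Finally, we assume $R_X(x) \ge 0$ for all $x \in \mathcal{X}$ and $R_Y(y) \ge 0$ for all $y \in \mathcal{Y}$.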
The main difference between SP-RFTL and the well known Regularized Follow the Leader (RFTL) algorithm [49, 2] is that in SP-RFTL both players update jointly and play the saddle point of the sum of regularized games observed so far. In particular, they disregard their previous actions. In contrast, the updates for RFTL would be
for , and , are chosen as to minimize and in their respective sets . It is easy to see that the sequence of iterates is in general not the same. In fact, in view of Theorem 1 we know that RFTL can not achieve sublinear NE regret when the sequence of functions is chosen arbitrarily. One last remark about the algorithm is that as the last iterates will converge to the set of NE of the average game . To see this, observe that if then i.e. solves the average problem where the regularization is vanishing, and a similar expression can be written for . This is in contrast with many of the results mentioned in Section 1.3 where it is the average of the iterates which is an approximate equilibrium.
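The contrast between the joint update and the separate RFTL updates can be written out as follows. This is a sketch under assumptions, not a restatement of Algorithm 1: the symbols $\mathcal{L}_\tau$ (round-$\tau$ payoff function), $R_X, R_Y$ (regularizers), and in particular the regularization weight $1/\eta$ are notation chosen here for illustration, and the actual weighting used by SP-RFTL may differ.

```latex
% Joint update (SP-RFTL style): the saddle point of the regularized cumulative game.
\[
(x_{t+1}, y_{t+1}) \ \text{is the saddle point of}\quad
\sum_{\tau=1}^{t} \mathcal{L}_\tau(x, y) + \frac{1}{\eta} R_X(x) - \frac{1}{\eta} R_Y(y)
\quad \text{over } \mathcal{X} \times \mathcal{Y}.
\]
% Separate updates (RFTL style): each player responds to the opponent's past actions.
\begin{align*}
x_{t+1} &= \arg\min_{x \in \mathcal{X}} \Big\{ \sum_{\tau=1}^{t} \mathcal{L}_\tau(x, y_\tau) + \frac{1}{\eta} R_X(x) \Big\}, \\
y_{t+1} &= \arg\max_{y \in \mathcal{Y}} \Big\{ \sum_{\tau=1}^{t} \mathcal{L}_\tau(x_\tau, y) - \frac{1}{\eta} R_Y(y) \Big\}.
\end{align*}
% Dividing the joint objective by t does not change its saddle point:
\[
\frac{1}{t} \sum_{\tau=1}^{t} \mathcal{L}_\tau(x, y) + \frac{1}{\eta t} \big( R_X(x) - R_Y(y) \big),
\]
% so whenever 1/(\eta t) \to 0 the regularization vanishes and the last iterate
% approaches a saddle point of the average game.
```

The joint update never evaluates $\mathcal{L}_\tau$ at the past iterates $(x_\tau, y_\tau)$, which is the sense in which the players disregard their previous actions.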
We have the following guarantee for SP-RFTL.
For , let be -Lipschitz with respect to norm . Let , be strongly convex functions with respect to the same norm, let be the Lipschitz constants of , with respect to the same norm. Let be the iterates generated by SP-RFTL when run on convex-concave functions . It holds that
where the last equality follows by choosing .
A formal proof of the theorem is provided in the Appendix and a sketch will be given shortly.
We note that the bound in Theorem 2 holds for general convex-concave functions, however the dependence on the dimension is hidden on the Lipschitz constants and the choice of regularizer. It is easy to check that if one chooses as regularizer, and the functions are -Lipschitz continuous with respect to norm , then the NE regret bound will be .
We now provide a sketch of the proof of Theorem 2. Define . Notice that it is -strongly convex in with respect to norm for all and -strongly concave with respect to norm for all . Additionally, notice that is -Lipschitz with respect to norm . Finally, notice that for , all and all it holds that
The following lemma shows that the value of the convex-concave games defined by and are not too far from each other.
It holds that
To prove the NE regret bound, we note that SP-RFTL is running a Follow-the-Leader scheme on functions . With the next two lemmas one can show that the NE regret of the players relative to functions is small.
Let be the iterates of SP-RFTL. It holds that
Let be the sequence of iterates generated by the algorithm. It holds that
4.2 Logarithmic Dependence on the Dimension of the Action Spaces
Previously, we analyzed the OMG problem by treating the payoff functions as general convex-concave functions and the action spaces as general convex compact sets. We explained that in general one should expect to achieve NE regret which depends linearly on the dimension of the problem. The goal in this section is to obtain sharper NE regret bounds that scale logarithmically with the number of actions by exploiting the geometry of the decision sets and the bilinear structure of the payoff functions. This allows us to solve games which may have exponentially many actions, which often arise in combinatorial optimization settings.
The plan to obtain the desired NE regret bounds in this more restrictive setting is to use the negative entropy as a regularization function (which is strongly convex with respect to $\|\cdot\|_1$), that is, $R_X(x) = \sum_{i=1}^{d_1} x_i \ln(x_i) + \ln(d_1)$ and $R_Y(y) = \sum_{j=1}^{d_2} y_j \ln(y_j) + \ln(d_2)$, where the extra logarithmic terms ensure $R_X, R_Y$ are nonnegative everywhere in their respective simplexes. Unfortunately, the negative entropy is not Lipschitz over the simplex, so we cannot leverage our result from Theorem 2. To deal with this challenge, we will restrict the new algorithm to play over a restricted simplex:
$$\Delta_{d,\theta} = \{ x \in \mathbb{R}^d : \|x\|_1 = 1, \ x_i \ge \theta \ \text{for all } i \in [d] \} .$$
(We will also use shorthand notation for the restricted simplexes of Player 1 and Player 2, respectively.) The tuning parameter $\theta$ used for the algorithm will be defined later in the analysis. (Notice that when $\theta > 1/d$, the set is empty.) We have the following result.
The function is -Lipschitz continuous with respect to over with .
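The role of the restricted simplex can be illustrated numerically. The sketch below assumes the regularizer is the shifted negative entropy $\sum_i x_i \ln x_i + \ln d$ and that the restricted simplex requires every coordinate to be at least $\theta$. The helper names are hypothetical, and the affine shrinking map is a convenience chosen here; it is not the projection used in the analysis.

```python
import math


def neg_entropy(x):
    """Shifted negative entropy sum_i x_i ln x_i + ln d, with 0 * ln 0 taken as 0."""
    return sum(v * math.log(v) for v in x if v > 0) + math.log(len(x))


def shrink_into_restricted_simplex(x, theta):
    """Mix x with the constant vector theta so every entry is at least theta.

    The map x -> (1 - d * theta) * x + theta keeps the entries summing to one.
    """
    d = len(x)
    if theta > 1.0 / d:
        raise ValueError("restricted simplex is empty when theta > 1/d")
    return [(1 - d * theta) * v + theta for v in x]


def grad_sup_norm_bound(theta):
    """Bound on max_i |1 + ln x_i| when theta <= x_i <= 1.

    The gradient of sum_i x_i ln x_i has entries 1 + ln x_i, so this bounds the
    sup-norm of the gradient, which is the dual of the 1-norm.
    """
    return max(1.0, abs(1.0 + math.log(theta)))


# Hypothetical example: a vertex of the simplex pulled into the restricted set.
theta = 0.05
vertex = [1.0, 0.0, 0.0, 0.0]
inner = shrink_into_restricted_simplex(vertex, theta)
inner_grad = [1.0 + math.log(v) for v in inner]
```

At the vertex the gradient entry $1 + \ln x_i$ is unbounded for the zero coordinates, whereas at `inner` every entry is controlled by `grad_sup_norm_bound(theta)`, which grows only like $\ln(1/\theta)$.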
The algorithm Online-Matrix-Games Regularized-Follow-the-Leader (OMG-RFTL) is an instantiation of SP-RFTL with a particular choice of regularization functions, which are nonnegative and Lipschitz over the restricted simplexes. With this, we can prove an NE regret bound for the OMG problem. For the remainder of the paper, the regularization functions will be set as follows:
$$R_X(x) = \sum_{i=1}^{d_1} x_i \ln(x_i) + \ln(d_1), \qquad R_Y(y) = \sum_{j=1}^{d_2} y_j \ln(y_j) + \ln(d_2) .$$
We have the following guarantee for OMG-RFTL.
Let be an arbitrary sequence of matrices with entries bounded between . Let be the Lipschitz constant (with respect to ) of for . Let be the iterates of OMG-RFTL and choose such that . Set . It holds that
A full proof of the theorem can be found in the Appendix. We now give a sketch of the proof. Since the algorithm selects actions over the restricted simplex, we must quantify the potential loss in the NE regret bound imposed by this restriction. The next two lemmas make this precise.
Let define , with . Notice is unique since it is a projection. It holds that .
Let be an arbitrary sequence of convex-concave functions, , that are -Lipschitz with respect to . With , and . It holds that
Combining the previous two lemmas and Theorem 2, one can show the NE regret bound for OMG-RFTL holds.
5 Online Matrix Games: Bandit Feedback
In this section we focus on the OMG problem under bandit feedback. In this setting, the players observe in every round only the payoff corresponding to the chosen actions. If Player 1 chooses action , Player 2 chooses action , and the payoff matrix at that time step is , then the players observe only instead of the full matrix . The limited feedback makes the problem significantly more challenging than the full information one: the players must find a way to exploit (use all previous information to try to play a Nash Equilibrium) and explore (try to estimate in every round). This problem resembles that of Online Bandit Optimization [25, 7, 17, 32], while the main difference is that with one function evaluation we must estimate a matrix instead of the gradients and where .
Before proceeding we establish some useful notation. For $i \in [d]$, let $e_i$ denote the $i$-th standard unit vector, i.e. $e_i$ is the vector that has a $1$ in the $i$-th entry and $0$ in the rest. Let $e_{i_t}$ be the standard unit vector corresponding to the decision made by Player 1 in round $t$, and define $e_{j_t}$ similarly for Player 2. Notice that under bandit feedback, in round $t$ both players only observe the quantity $e_{i_t}^\top A_t e_{j_t}$.
5.1 A One-Point Estimate for $A_t$
As explained previously, in each round the players must estimate by observing only one of its entries. To this end, we allow the players to share with each other their decisions and to randomize jointly (a similar assumption is used to define correlated equilibria in zero-sum games, see ). The following result shows how to build a random estimate of by observing only one of its entries.
Let with and . Sample . Let be the matrix with for all such that and and . It holds that
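The unbiasedness of an importance-weighted one-point estimate can be verified exactly on a small instance. The sketch below assumes the estimate places $A_{ij} / (x_i y_j)$ at the sampled entry $(i, j)$ and zero elsewhere, with $i$ and $j$ drawn independently from strictly positive distributions $x$ and $y$. This is the standard inverse-probability construction and is an assumption about the exact form of the estimator; the function names are hypothetical.

```python
from fractions import Fraction
import random


def one_point_estimate(A, x, y, rng):
    """Sample (i, j) with probability x[i] * y[j]; return the estimate and the sample."""
    d1, d2 = len(A), len(A[0])
    i = rng.choices(range(d1), weights=[float(v) for v in x])[0]
    j = rng.choices(range(d2), weights=[float(v) for v in y])[0]
    estimate = [[0] * d2 for _ in range(d1)]
    # Only the observed entry is used, reweighted by its inverse probability.
    estimate[i][j] = A[i][j] / (x[i] * y[j])
    return estimate, (i, j)


def exact_expectation(A, x, y):
    """Average the estimate over all (i, j) outcomes, weighted by x[i] * y[j]."""
    d1, d2 = len(A), len(A[0])
    mean = [[Fraction(0)] * d2 for _ in range(d1)]
    for i in range(d1):
        for j in range(d2):
            prob = x[i] * y[j]
            mean[i][j] += prob * (Fraction(A[i][j]) / (x[i] * y[j]))
    return mean


# Hypothetical example with exact rational probabilities.
A = [[1, -1], [-1, 1]]
x = [Fraction(1, 4), Fraction(3, 4)]
y = [Fraction(2, 3), Fraction(1, 3)]
expected = exact_expectation(A, x, y)
sample, (i, j) = one_point_estimate(A, x, y, random.Random(0))
```

Keeping $x$ and $y$ inside a restricted simplex bounds the weights $1 / (x_i y_j)$, which is one reason the full-information algorithm's restriction is convenient in the bandit setting as well.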
5.2 Bandit Online Matrix Games RFTL
We now present an algorithm that ensures sublinear (i.e. $o(T)$) NE regret under bandit feedback for the OMG problem, and that holds against an adaptive adversary. By adaptive adversary, we mean that the payoff matrices $A_t$ can depend on the players' actions up to time $t-1$; in particular, we assume the adversary does not observe the actions chosen by the players for time period $t$ when choosing $A_t$. We consider an algorithm that runs OMG-RFTL on a sequence of functions $x^\top \hat{A}_t y$, where $\hat{A}_t$ is the unbiased one-point estimate of $A_t$ derived in Theorem 4. Recall that the iterates of the OMG-RFTL algorithm are distributions over the possible actions of both players. In order to generate the estimate $\hat{A}_t$, both players will sample an action from their distributions and weigh their observation with the inverse probability of obtaining that observation.
We have the following guarantee for Bandit-OMG-RFTL.
Let be any sequence of payoff matrices chosen by an adaptive adversary. Let be the iterates generated by Bandit-OMG-RFTL. Setting , ensures
where the expectation is taken with respect to randomization in the algorithm.
We now give a sketch of the proof. The total payoff given to each of the players is given by $\sum_{t=1}^{T} e_{i_t}^\top A_t e_{j_t}$, so we must relate this quantity to the iterates of OMG-RFTL when run on the sequence of matrices $\hat{A}_1, \dots, \hat{A}_T$. The following two lemmas will allow us to do so.
Let be the sequence of iterates generated by Bandit-OMG-RFTL. It holds that
where the expectation is taken with respect to the internal randomness of the algorithm.
It holds that
where the expectation is with respect to all the internal randomness of the algorithm.
We will then bound the difference between the comparator term