1 Introduction
Extensive-form games model sequential interactions between multiple agents, each of which maximizes its own utility. Classic examples are perfect information games (e.g. chess and Go), which have served as milestones for measuring the progress of artificial intelligence [Campbell et al.2002, Silver et al.2016]. When there are simultaneous moves, such as in Markov games, the players may need stochastic policies to guarantee their worst-case expected utility, and must use linear programming at each state for value backups. Computing policies for imperfect information games is much more difficult: no Bellman operator exists, so approximate dynamic programming is not applicable; exact equilibrium solutions can be found by sequence-form linear programming [Koller et al.1994, Shoham and LeytonBrown2009], but these techniques do not scale to very large games.
The challenge domain for imperfect information has been computer Poker, which has driven much of the progress in computational approaches to equilibrium-finding [Rubin and Watson2011]. While there are gradient descent techniques that can find an $\epsilon$-Nash equilibrium in $O(1/\epsilon)$ iterations [Hoda et al.2007], the dominant technique has been counterfactual regret minimization (CFR) [Zinkevich et al.2008]. Based on CFR, recent techniques have solved heads-up limit Texas Hold’em [Bowling et al.2015] and beaten human professionals in no-limit Texas Hold’em [Moravčík et al.2017, Brown and Sandholm2017].
Other techniques have emerged in recent years, based first on fictitious play (XFP) [Heinrich et al.2015], and generalized to double oracle and any meta-game solver over sets of policies [Lanctot et al.2017]. Both require a subroutine that computes a best response (an “oracle”). Here, reinforcement learning can be used to compute approximate oracles, and function approximation can be used to generalize over the state space without domain-specific abstraction mechanisms. Hence, deep neural networks can be trained from zero knowledge as in AlphaZero [Silver et al.2018]. Policy gradient techniques are also compatible with function approximation in this setting [Srinivasan et al.2018], but may require many iterations to converge. Combining data buffers with CFR using regression to predict regrets has also shown promise in medium-sized poker variants [Waugh et al.2015, Brown et al.2019].
In this paper, we introduce a new algorithm for computing approximate Nash equilibria. Like XFP, best responses are computed at each iteration. Unlike XFP, players optimize their policies directly against their worst-case opponent. When using tabular policies and projections after policy updates, the sequence of policies will contain an $\epsilon$-Nash equilibrium, unlike CFR and XFP, which only converge-in-average. Our algorithm works well with function approximation, as the problem can be expressed directly as a policy gradient optimization. Our experiments show convergence rates comparable to XFP and CFR in the tabular setting, and exhibit generalization over the state space using neural networks in four different games.
At the time of original submission, we were unaware of a similar algorithm recently presented at the Deep RL Workshop NeurIPS 2018: Self-Play Against a Best Response (SPAR) [Tang et al.2018]. The work we present in this paper was done independently. In this paper, we provide convergence guarantees, as well as results in both the tabular and neural network cases. We do so on four benchmark games (both of the commonly used poker benchmarks in [Tang et al.2018] and two additional games), whereas the results of SPAR focus on the sample-based setting not covered in this paper.
2 Background and Terminology
An extensive-form game describes a sequential interaction between players $i \in \{1, 2, \ldots, n\} \cup \{c\}$, where $c$ is considered a special player called chance with a fixed stochastic policy that determines the transition probabilities given states and actions. We will often use $-i$ to refer to all the opponents of $i$. In this paper, we focus on the two-player ($n = 2$) setting.
The game starts in the empty history $h = \emptyset$. On each turn, a player $i$ chooses an action $a \in \mathcal{A}_i$, changing the history to $h' = ha$. Here $h$ is called a prefix history of $h'$, denoted $h \sqsubset h'$. The full history is sometimes also called a ground state because it uniquely identifies the true state, since chance’s actions are included. In poker, for example, a ground state would include all the players’ private cards. We define an information state $s \in \mathcal{S}_i$ for player $i$ as the state as perceived by an agent which is consistent with its observations. Formally, each $s$ is a set of histories: $h, h' \in s$ if and only if the sequences of player $i$’s observations along $h$ and $h'$ are equal. In poker, an information state groups together all the histories that differ only in the private cards of $-i$. Denote by $\mathcal{Z}$ the set of terminal histories, each corresponding to the end of a game, and by $u_i(z)$ the utility to player $i$ for $z \in \mathcal{Z}$. We also define $\tau(s)$ as the player whose turn it is at $s$, and $\mathcal{Z}(h)$ as the subset of terminal histories that share $h$ as a prefix.
Since players cannot observe the ground state $h$, policies are defined as $\pi_i : \mathcal{S}_i \rightarrow \Delta(\mathcal{A}_i)$, where $\Delta(\mathcal{A}_i)$ is the set of probability distributions over $\mathcal{A}_i$. Each player tries to maximize their expected utility given the initial starting history $\emptyset$. We assume finite games, so every history $h$ is bounded in length. The expected value of a joint policy $\pi$ (all players’ policies) for player $i$ is defined as
$$v_{i,\pi} = \mathbb{E}_{z \sim \pi}\left[u_i(z)\right], \tag{1}$$
where the terminal histories $z \in \mathcal{Z}$ are composed of actions drawn from the joint policy. We also define state-action values for joint policies. The value $q_{i,\pi}(s,a)$ represents the expected return starting at state $s$, taking action $a$, and playing $\pi$:
$$q_{i,\pi}(s,a) = \mathbb{E}_{z \sim \pi}\left[u_i(z) \mid h \in s, ha \sqsubseteq z\right] = \frac{\sum_{h \in s} \eta^{\pi}(h)\, q_{i,\pi}(h,a)}{\sum_{h \in s} \eta^{\pi}(h)}, \tag{2}$$
where $q_{i,\pi}(h,a)$ is the expected utility of the ground state-action pair $(h,a)$, and $\eta^{\pi}(h)$ is the probability of reaching $h$ under the policy $\pi$. We make the common assumption that players have perfect recall, i.e. they do not forget anything they have observed while playing. Under perfect recall, the distribution of the states can be obtained only from the opponents’ policies using Bayes’ rule (see [Srinivasan et al.2018, Section 3.2]).
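Each player tries to find a policy that maximizes their own value $v_{i,\pi}$. However, this is difficult to do independently since the value depends on the joint policy, not just player $i$’s policy. A best response policy for player $i$ is defined to be $b_i(\pi_{-i}) \in BR(\pi_{-i}) = \{\pi_i \mid v_{i,(\pi_i,\pi_{-i})} = \max_{\pi_i'} v_{i,(\pi_i',\pi_{-i})}\}$. Given a joint policy $\pi$, the exploitability of a policy $\pi_{-i}$ is how much the other player could gain if they switched to a best response: $\delta_i(\pi) = \max_{\pi_i'} v_{i,(\pi_i',\pi_{-i})} - v_{i,\pi}$. In two-player zero-sum games, an $\epsilon$-minmax (or $\epsilon$-Nash equilibrium) policy is one where $\max_i \delta_i(\pi) \le \epsilon$. A Nash equilibrium is achieved when $\epsilon = 0$. A common metric to measure the distance to Nash is $\text{NashConv}(\pi) = \sum_i \delta_i(\pi)$.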
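As a concrete illustration of these definitions, the following sketch computes NashConv in a zero-sum matrix game, the simplest case in which each player has a single information state. The payoff matrix, the function names, and the matching-pennies example are assumptions made for illustration and are not part of the paper.

```python
def expected_value(payoff, x, y):
    """Row player's expected payoff under mixed policies x (rows) and y (columns)."""
    return sum(x[r] * payoff[r][c] * y[c]
               for r in range(len(x)) for c in range(len(y)))


def nash_conv(payoff, x, y):
    """Sum over both players of the gain from switching to a best response."""
    v = expected_value(payoff, x, y)
    # Value of the row player's best pure response against y.
    row_br = max(sum(payoff[r][c] * y[c] for c in range(len(y)))
                 for r in range(len(x)))
    # The column player receives -payoff, so its best response minimizes
    # the row player's value against x.
    col_br = min(sum(x[r] * payoff[r][c] for r in range(len(x)))
                 for c in range(len(y)))
    return (row_br - v) + (v - col_br)


# Matching pennies: the row player wins 1 on a match and loses 1 otherwise.
PENNIES = [[1.0, -1.0], [-1.0, 1.0]]
print(nash_conv(PENNIES, [0.5, 0.5], [0.5, 0.5]))    # equilibrium policies
print(nash_conv(PENNIES, [0.75, 0.25], [0.5, 0.5]))  # exploitable row policy
```

In an extensive-form game the two inner maximizations would be replaced by best-response tree traversals, but the quantity being measured is the same.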
2.1 Extensive-Form Fictitious Play (XFP)
Extensive-form fictitious play (XFP) is equivalent to standard fictitious play, except that it operates in the extensive-form representation of the game [Heinrich et al.2015]. In fictitious play, the joint policy is initialized arbitrarily (e.g. uniform random distribution at each information state), and players learn by aggregating best response policies. The extensive-form version, XFP, requires game-tree traversals to compute the best responses and specific update rules that account for the reach probabilities to ensure that the updates are equivalent to the classical algorithm, as described in [Heinrich et al.2015, Section 3]. Fictitious play converges to a Nash equilibrium asymptotically in two-player zero-sum games. Sample-based approximations to the best response step have also been developed [Heinrich et al.2015], as well as function approximation methods for both steps [Heinrich and Silver2016]. Both steps have also been generalized to other best response algorithms and meta-strategy combinations [Lanctot et al.2017].
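The aggregation of best responses can be sketched for classical fictitious play in a zero-sum matrix game. This is a minimal illustration and not XFP itself: it omits the reach-probability weighting that the extensive-form version needs, and the rock-paper-scissors payoffs and function names are assumptions.

```python
def normalize(counts):
    total = sum(counts)
    return [c / total for c in counts]


def fictitious_play(payoff, iterations):
    """Classical fictitious play; the row player maximizes payoff."""
    n_rows, n_cols = len(payoff), len(payoff[0])
    row_counts = [1.0] * n_rows  # uniform initial average policies
    col_counts = [1.0] * n_cols
    for _ in range(iterations):
        x, y = normalize(row_counts), normalize(col_counts)
        # Each player best-responds to the opponent's average policy ...
        row_values = [sum(payoff[r][c] * y[c] for c in range(n_cols))
                      for r in range(n_rows)]
        col_values = [sum(x[r] * payoff[r][c] for r in range(n_rows))
                      for c in range(n_cols)]
        row_br = max(range(n_rows), key=row_values.__getitem__)
        col_br = min(range(n_cols), key=col_values.__getitem__)
        # ... and the best response is aggregated into the running average.
        row_counts[row_br] += 1.0
        col_counts[col_br] += 1.0
    return normalize(row_counts), normalize(col_counts)


# Rock-paper-scissors payoffs for the row player.
RPS = [[0.0, -1.0, 1.0], [1.0, 0.0, -1.0], [-1.0, 1.0, 0.0]]
avg_row, avg_col = fictitious_play(RPS, 5000)
print(avg_row, avg_col)
```

Only the average policies approach the uniform equilibrium here; the best responses played on each iteration remain pure.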
2.2 Counterfactual Regret Minimization (CFR)
CFR decomposes the full regret computation over the tree into per-information-state regret tables and updates [Zinkevich et al.2008]. Each iteration traverses the tree to compute the local values and regrets, updating cumulative regret and average policy tables, using a local regret minimizer to derive the current policies at each information state.
The quantities of interest are counterfactual values, which are similar to $q$-values, but differ in that they weigh only the opponents’ reach probabilities, and are not normalized. Formally, let $\eta^{\pi}_{-i}(h)$ be only the opponents’ contributions to the probability of reaching $h$ under $\pi$. Then, similarly to equation 2, we define counterfactual values: $q^c_{i,\pi}(s,a) = \sum_{h \in s} \eta^{\pi}_{-i}(h)\, q_{i,\pi}(h,a)$, and $v^c_{i,\pi}(s) = \sum_a \pi_i(s,a)\, q^c_{i,\pi}(s,a)$. On each iteration $t$, with a joint policy $\pi^t$, CFR computes a counterfactual regret $r^t(s,a) = q^c_{i,\pi^t}(s,a) - v^c_{i,\pi^t}(s)$ for all information states $s$, and a new policy $\pi^{t+1}(s)$ from the cumulative regrets of $(s,a)$ over the iterations using regret-matching [Hart and MasColell2000]. The average policies converge to an $\epsilon$-Nash equilibrium in $O(1/\epsilon^2)$ iterations.
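The local update at a single information state can be sketched as follows. The function names are assumptions, and the action values passed in stand for the counterfactual values that a tree traversal would supply.

```python
def regret_matching(cumulative_regrets):
    """Policy proportional to positive cumulative regret; uniform if none is positive."""
    positive = [max(r, 0.0) for r in cumulative_regrets]
    total = sum(positive)
    if total <= 0.0:
        return [1.0 / len(cumulative_regrets)] * len(cumulative_regrets)
    return [p / total for p in positive]


def update_regrets(cumulative_regrets, policy, action_values):
    """Add the regret of each action relative to the value of the current policy."""
    state_value = sum(p * q for p, q in zip(policy, action_values))
    return [r + (q - state_value)
            for r, q in zip(cumulative_regrets, action_values)]


regrets = [0.0, 0.0]
policy = regret_matching(regrets)                      # uniform at the start
regrets = update_regrets(regrets, policy, [1.0, -1.0])
print(regret_matching(regrets))                        # all mass on the first action
```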
2.2.1 CFR versus a Best Response Oracle (CFR-BR)
Instead of both players employing CFR (CFR-vs-CFR), each player can use CFR versus their worst-case (best response) opponent, i.e. simultaneously running CFR-vs-BR and BR-vs-CFR. This is the main idea behind the counterfactual regret minimization against a best response (CFR-BR) algorithm [Johanson et al.2012]. The combined average policies of the CFR players are also guaranteed to converge to an $\epsilon$-Nash equilibrium. In fact, the current strategies also converge with high probability. Our convergence analyses are based on CFR-BR, showing that a policy gradient versus a best responder also converges to an $\epsilon$-Nash equilibrium.
2.3 Policy Gradients in Games
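We consider policies $\pi_\theta = (\pi_{\theta_i})_i$, where each policy is parameterized by a vector of parameters $\theta_i$. Using the likelihood ratio method, the gradient of $v_{i,\pi_\theta}$ with respect to the vector of parameters $\theta_i$ is:
$$\nabla_{\theta_i} v_{i,\pi_\theta} = \sum_{s \in \mathcal{S}_i} \Big( \sum_{h \in s} \eta^{\pi_\theta}(h) \Big) \sum_a \nabla_{\theta_i} \pi_{\theta_i}(s,a)\, q_{i,\pi_\theta}(s,a). \tag{3}$$
This result can be seen as an extension of the policy gradient theorem [Sutton et al.2000, Glynn and L’ecuyer1995, Williams1992, Baxter and Bartlett2001] to imperfect information games and has been used in several forms; for a detailed derivation, see [Srinivasan et al.2018, Appendix D].
The critic ($q_{i,\pi_\theta}$) can be estimated in many ways, for instance with a Monte Carlo return [Williams1992] or with a learned critic as in [Srinivasan et al.2018] in the context of games. A gradient ascent step then takes the form $\theta_i \leftarrow \theta_i + \alpha \sum_a \nabla_{\theta_i} \pi_{\theta_i}(s,a)\, \hat{q}_{i,\pi_\theta}(s,a)$, where $\alpha$ is the learning rate used by the algorithm and $\hat{q}_{i,\pi_\theta}$ is the estimate of the return used.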
3 Exploitability Descent
Exploitability Descent (ED) follows the basic form of the classic convex-concave optimization problem for solving matrix games [Gale et al.1951, Boyd and Vandenberghe2004]. Conceptually, the algorithm is uncomplicated and shares the outline of fictitious play: on each iteration, there are two steps that occur for each player. The first step is identical to fictitious play: compute the best response to each player’s policy. The second step then performs gradient ascent on the policy to increase each player’s utility against the respective best responder (aiming to decrease each player’s exploitability). The change in the second step is important for two reasons. First, it leads to convergence of the policies that are being optimized without having to compute an explicit average policy, which is complex in the sequential setting. Second, the policies can now be easily parameterized (e.g. using deep neural networks) and trained using policy gradient ascent without storing a large buffer of previous data.
The general ED algorithm takes a learning rate $\alpha^t$ on iteration $t$. Two of its steps (computing the action values and updating the policy) are intentionally unspecified: we will show properties for two specific instantiations of this general ED algorithm. The quantity $\mathbf{q}^b(s)$ refers to a set of expected values for player $i$, one for each action at $s$, using $\pi_i$ against a set of individual best responses. The GradientAscent update step is left unspecified for now, as we will describe several forms, but the main idea is to increase/decrease the probability of higher/lower utility actions via the gradients of the value functions, and project back to the space of policies.
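The two steps can be sketched in a zero-sum matrix game, where each player has one information state and the action values against the best responder are simply a row or column of the payoff matrix. This is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the $t^{-1/2}$ learning-rate schedule, and the sort-based Euclidean projection are choices made for illustration.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    threshold, cumulative = 0.0, 0.0
    for k, u_k in enumerate(sorted(v, reverse=True), start=1):
        cumulative += u_k
        candidate = (cumulative - 1.0) / k
        if u_k - candidate > 0.0:
            threshold = candidate
    return [max(x - threshold, 0.0) for x in v]


def exploitability_descent(payoff, iterations):
    """ED sketch for a zero-sum matrix game; the row player maximizes payoff."""
    n_rows, n_cols = len(payoff), len(payoff[0])
    x = [1.0 / n_rows] * n_rows
    y = [1.0 / n_cols] * n_cols
    for t in range(1, iterations + 1):
        alpha = t ** -0.5  # assumed decaying learning rate
        # Step 1: compute a best response to each player's current policy.
        col_values = [sum(x[r] * payoff[r][c] for r in range(n_rows))
                      for c in range(n_cols)]
        row_values = [sum(payoff[r][c] * y[c] for c in range(n_cols))
                      for r in range(n_rows)]
        col_br = min(range(n_cols), key=col_values.__getitem__)
        row_br = max(range(n_rows), key=row_values.__getitem__)
        # Step 2: gradient ascent on each policy against its best responder,
        # followed by projection back onto the simplex.
        q_row = [payoff[r][col_br] for r in range(n_rows)]
        q_col = [-payoff[row_br][c] for c in range(n_cols)]
        x = project_to_simplex([p + alpha * q for p, q in zip(x, q_row)])
        y = project_to_simplex([p + alpha * q for p, q in zip(y, q_col)])
    return x, y


def nash_conv(payoff, x, y):
    """Best-response value for the row player minus that for the column player."""
    row_br = max(sum(payoff[r][c] * y[c] for c in range(len(y)))
                 for r in range(len(x)))
    col_br = min(sum(x[r] * payoff[r][c] for r in range(len(x)))
                 for c in range(len(y)))
    return row_br - col_br


PENNIES = [[1.0, -1.0], [-1.0, 1.0]]
x, y = exploitability_descent(PENNIES, 10000)
print(x, y, nash_conv(PENNIES, x, y))
```

Note that the returned policies are the current iterates, not averages over iterations.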
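3.1 Tabular ED with $q$-values and $\ell_2$ projection
For a vector of real numbers $\mathbf{x}$, define the simplex as $\Delta = \{\mathbf{y} \in \mathbb{R}^{|\mathbf{x}|} : \mathbf{y} \ge 0, \sum_j y_j = 1\}$, and the $\ell_2$ projection as $\Pi_{\ell_2}(\mathbf{x}) = \arg\min_{\mathbf{y} \in \Delta} \|\mathbf{x} - \mathbf{y}\|_2$.
Let $\pi_\theta$ be a joint policy parameterized by $\theta$, and $\pi_{\theta_i}$ refer to the portion of player $i$’s parameters (i.e. in tabular form $\theta_i = \{\theta_{s,a}\}$). Here each parameter is a probability of an action at a particular state: $\theta_{s,a} = \pi_\theta(s,a)$. We refer to TabularED($q$, $\ell_2$) as an instance of exploitability descent with
$$\mathbf{q}^b(s) = \{ q_{i,(\pi_{\theta_i}, b_{-i})}(s,a) \}_a, \tag{4}$$
and the policy gradient ascent update defined to be
$$\theta_s^{t} = \Pi_{\ell_2}\left( \theta_s^{t-1} + \alpha^t \langle \nabla_{\theta_s} \pi_\theta(s), \mathbf{q}^b(s) \rangle \right), \tag{5}$$
where the Jacobian $\nabla_{\theta_s} \pi_\theta(s)$ is an identity matrix because each parameter $\theta_{s,a}$ corresponds directly to the probability $\pi_\theta(s,a)$, and $\langle \cdot, \cdot \rangle$ is the usual matrix inner product.
3.2 Tabular ED with counterfactual values and softmax transfer function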
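For some vector of real numbers, $\mathbf{x}$, define $\text{softmax}(\mathbf{x}) = \{ \exp(x_j) / \sum_k \exp(x_k) \}_j$. Reusing the tabular policy notation from the previous section, we now define a different instance of exploitability descent. We refer to TabularED($q^c$, softmax) as the algorithm that specifies $\pi_\theta(s) = \text{softmax}(\theta_s)$,
$$\mathbf{q}^{c,b}(s) = \{ q^c_{i,(\pi_{\theta_i}, b_{-i})}(s,a) \}_a, \tag{6}$$
and the policy parameter update as
$$\theta_s^{t} = \theta_s^{t-1} + \alpha^t \langle \nabla_{\theta_s} \pi_\theta(s), \mathbf{q}^{c,b}(s) \rangle, \tag{7}$$
where $\nabla_{\theta_s} \pi_\theta(s)$ represents the Jacobian of softmax.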
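The softmax update at one information state can be written out explicitly. The sketch below uses the closed form of the softmax Jacobian, $\partial \pi_a / \partial \theta_b = \pi_a (\mathbb{1}[a = b] - \pi_b)$, so that the inner product with the action values reduces to $\pi_b (q_b - \sum_a \pi_a q_a)$. The function names and the numbers in the example are assumptions for illustration.

```python
import math


def softmax(theta):
    shift = max(theta)  # subtract the maximum for numerical stability
    exps = [math.exp(t - shift) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_gradient_step(theta, action_values, alpha):
    """One step theta + alpha * J^T q, with J the Jacobian of softmax at theta."""
    pi = softmax(theta)
    baseline = sum(p * q for p, q in zip(pi, action_values))
    # (J^T q)_b = sum_a q_a * pi_a * (1[a == b] - pi_b) = pi_b * (q_b - baseline)
    gradient = [p * (q - baseline) for p, q in zip(pi, action_values)]
    return [t + alpha * g for t, g in zip(theta, gradient)]


theta = [0.0, 0.0]
theta = softmax_gradient_step(theta, [1.0, -1.0], alpha=1.0)
print(theta, softmax(theta))
```

Because the parameters are unconstrained logits, no projection is needed after the step; the softmax maps them back to a valid policy.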
3.3 Convergence Analyses
We now analyze the convergence guarantees of ED. We give results for two cases: first, in cyclical perfect information games and Markov games, and second, in imperfect information games. All the proofs are found in Appendix A.
3.3.1 Cyclical Perfect Information Games and Markov Games
The following result extends the policy gradient theorem [Sutton et al.2000, Glynn and L’ecuyer1995, Williams1992, Baxter and Bartlett2001] to the zero-sum two-player case. It proves that a generalized gradient of the worst-case value function can be estimated from experience as in the single-player case.
Theorem 1 (Policy Gradient in the Worst Case).
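The gradient of policy $\pi_{\theta_i}$’s value, $v_{i,(\pi_{\theta_i}, b_{-i})}$, against a best response, $b_{-i} \in BR(\pi_{\theta_i})$, is a generalized gradient (see [Clarke1975]) of $\pi_{\theta_i}$’s worst-case value function:
$$\nabla_{\theta_i} v_{i,(\pi_{\theta_i}, b_{-i})} \in \partial_{\theta_i} \min_{\pi_{-i}} v_{i,(\pi_{\theta_i}, \pi_{-i})}.$$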
This theorem is a natural extension of the policy gradient theorem to the zero-sum two-player case. As in policy gradient, this process is only guaranteed to converge to a local maximum of the worst-case value of the game, but not necessarily to an equilibrium of the game. An equilibrium of the game is reached when the two following conditions are met simultaneously: (1) the policy is tabular, and (2) all states are visited with nonzero probability under all policies. This statement is proven in Appendix D.
The method is called exploitability descent because policy gradient in the worst case minimizes exploitability. In a two-player, zero-sum game, if both players independently run ED, NashConv is locally minimized.
Lemma 1.
In the two-player zero-sum case, simultaneous policy gradient in the worst case locally minimizes NashConv.
3.3.2 Imperfect Information Games
We now examine convergence guarantees in the imperfect information setting. There have been two main techniques used to solve adversarial games in this case: the first is to rely on the sequence-form representation of policies, which makes the optimization problem convex [Koller et al.1994, Hoda et al.2007]. The second is to weight the values by the appropriate reach probabilities, and employ local optimizers [Zinkevich et al.2008, Johanson et al.2012]. Both take into account the probability of reaching information states, but the latter allows a convenient tabular policy representation.
We prove finite-time exploitability bounds for TabularED($q$, $\ell_2$), and we relate TabularED($q^c$, softmax) to a similar algorithm that also has finite-time bounds.
The convergence analysis is built upon two previous results: the first is CFR-BR [Johanson et al.2012]. The second is a very recent result that relates policy gradient optimization in imperfect information games to CFR [Srinivasan et al.2018]. The result here is also closely related to the optimization against a worst-case opponent [Waugh and Bagnell2014, Theorem 4], except our policies are expressed in tabular (i.e. behavioral) form rather than the sequence form.
Case: TabularED($q$, $\ell_2$). Recall that the parameters $\theta$ correspond to the tabular policy. For convenience, let .
We now present the main theorem, which states that if both players optimize their policies using TabularED($q$, $\ell_2$), it will generate policies with decreasing regret, which combined form an approximate Nash equilibrium.
Theorem 2.
Let TabularED($q$, $\ell_2$) be described as in Section 3.1 using tabular policies and the update rule in Definition 1. In a two-player zero-sum game, if each player updates their policy simultaneously using TabularED($q$, $\ell_2$), if and , then for each player $i$: after $T$ iterations, a policy will have been generated such that is $i$’s part of a Nash equilibrium, where , and .
ED is computing best responses each round already, so it is easy to track the best iterate: it will simply be the one with the highest expected value versus the opponent’s best response.
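Tracking the best iterate amounts to keeping the policy with the highest value against its best response. A minimal sketch for the row player of a zero-sum matrix game follows; the list of iterates, the payoff matrix, and the function names are illustrative assumptions.

```python
def worst_case_value(payoff, x):
    """Row player's value when the column player best-responds to x."""
    n_cols = len(payoff[0])
    return min(sum(x[r] * payoff[r][c] for r in range(len(x)))
               for c in range(n_cols))


def best_iterate(payoff, iterates):
    """Return the iterate with the highest value against its best response."""
    return max(iterates, key=lambda x: worst_case_value(payoff, x))


PENNIES = [[1.0, -1.0], [-1.0, 1.0]]
iterates = [[1.0, 0.0], [0.75, 0.25], [0.5, 0.5], [0.25, 0.75]]
print(best_iterate(PENNIES, iterates))
```

Since the best-response value is already computed on every ED iteration, this bookkeeping adds only a comparison per iteration.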
The proof can also be applied to the original CFR-BR theorem, which made only a probabilistic guarantee, so we now present an improved guarantee.
Corollary 1.
(Improved [Johanson et al.2012, Theorem 4]) If player $i$ plays $T$ iterations of CFR-BR, then it will have generated a , where is a equilibrium, where is defined as in [Johanson et al.2012, Theorem 3].
The best iterate can be tracked in the same way as in ED, and convergence is guaranteed.
Remark 1.
When using $q$-values, the values are normalized by a quantity, $\mathcal{B}_{-i}(\pi, s) = \sum_{h \in s} \eta^{\pi}_{-i}(h)$, that depends on the opponents’ policies [Srinivasan et al.2018, Section 3.2]. The convergence guarantee of TabularED($q$, $\ell_2$) relies on [Srinivasan et al.2018, Theorem 2], whose proof includes a division by $\mathcal{B}_{-i}(\pi, s)$ [Srinivasan et al.2018, Appendix E.2]. Therefore, the regret bound is undefined when $\mathcal{B}_{-i}(\pi, s) = 0$, which can happen when an opponent no longer plays to reach $s$.
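The difference between the two kinds of values, and the failure mode described in this remark, can be seen in a small sketch. The opponent reach probabilities and ground values below are made-up numbers, and the function names are assumptions.

```python
def counterfactual_value(opponent_reach, ground_values):
    """Unnormalized value: ground values weighted by the opponents' reach."""
    return sum(r * q for r, q in zip(opponent_reach, ground_values))


def q_value(opponent_reach, ground_values):
    """Normalized value; undefined (None) when the opponents never reach the state."""
    normalizer = sum(opponent_reach)
    if normalizer == 0.0:
        return None
    return counterfactual_value(opponent_reach, ground_values) / normalizer


# Two histories in one information state, with ground values +1 and -1.
print(q_value([0.2, 0.6], [1.0, -1.0]))
print(counterfactual_value([0.2, 0.6], [1.0, -1.0]))
# If the opponents stop playing to reach the state, only the
# counterfactual value remains well defined.
print(q_value([0.0, 0.0], [1.0, -1.0]))
print(counterfactual_value([0.0, 0.0], [1.0, -1.0]))
```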
Case: TabularED($q^c$, $\ell_2$). Instead of using $q$-values, we can implement ED with counterfactual values. In this case, TabularED with the $\ell_2$ projection becomes CFR-BR(GIGA), which then avoids the issue discussed in Remark 1.
Theorem 3.
Case: TabularED($q^c$, softmax). We now relate TabularED with counterfactual values and softmax policies closely to an algorithm with known finite-time convergence bounds. For details, see Appendix C.
TabularED($q^c$, softmax) is still a policy gradient algorithm: it differentiates the policy (i.e. softmax function) with respect to its parameters, and updates in the direction of higher value. With two subtle changes to the overall process, we can show that the algorithm would become CFR-BR using hedge [Freund and Schapire1997] as a local regret minimizer. CFR with hedge is known to have a better bound, but has typically not performed as well as regret matching in practice, though it has been shown to work better when combined with pruning based on dynamic probability thresholding [Brown et al.2017].
Instead of policy gradient, one can use a softmax transfer function over the sum of action values (or regrets) over time, which are the gradients of the value function with respect to the policy. Accumulating the gradients in this way, the algorithm can be recognized as Mirror Descent [Nemirovsky and Yudin1983], which also coincides with hedge given the softmax transfer [Beck and Teboulle2003]. When using the counterfactual values, ED then turns into CFR-BR(hedge), which converges for the same reasons as CFR-BR(regret-matching).
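The coincidence between a softmax over accumulated action values and hedge can be checked numerically. In this sketch the step size `eta`, the sequence of action values, and the function names are assumptions chosen for illustration.

```python
import math


def softmax(values):
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_of_cumulative_values(value_history, eta):
    """Softmax transfer applied to the running sum of action values."""
    cumulative = [0.0] * len(value_history[0])
    for values in value_history:
        cumulative = [c + v for c, v in zip(cumulative, values)]
    return softmax([eta * c for c in cumulative])


def hedge(value_history, eta):
    """Multiplicative-weights form: scale each weight by exp(eta * value)."""
    weights = [1.0] * len(value_history[0])
    for values in value_history:
        weights = [w * math.exp(eta * v) for w, v in zip(weights, values)]
    total = sum(weights)
    return [w / total for w in weights]


history = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.0], [-1.0, 2.0, 0.0]]
print(softmax_of_cumulative_values(history, eta=0.1))
print(hedge(history, eta=0.1))
```

The two functions return the same distribution up to floating-point error, since a product of exponentials is the exponential of the sum.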
We do not have a finite-time bound on the exploitability of TabularED($q^c$, softmax) as we do for the same algorithm with an $\ell_2$ projection or for CFR-BR(hedge). But since TabularED($q^c$, softmax) is a policy gradient algorithm, its policy will be adjusted toward a local optimum upon each update and will converge at that point when the gradient is zero. We use this algorithm because the policy gradient formulation allows for easily-applicable general function approximation.
4 Experimental Results
We now present our experimental results. We start by comparing empirical convergence rates to XFP and CFR in the tabular setting, followed by convergence behavior when training neural network functions to approximate the policy.
In our initial experiments, we found that using $q$-values led to plateaus in convergence in some cases, possibly due to numerical instability caused by the problem outlined in Remark 1. Therefore, we present results only using TabularED($q^c$, softmax), which for simplicity we refer to as TabularED for the remainder of this section. We also found that the algorithm converged faster with slightly higher learning rates than the ones suggested by Section 3.3.2.
4.1 Experiment Domains
Our experiments are run across four different imperfect information games. We provide very brief descriptions here; see Appendix B as well as [Kuhn1950, Southey et al.2005] and [Lanctot2013, Chapter 3] for more detail.
Kuhn poker is a simplified poker game first proposed by Harold Kuhn [Kuhn1950]. Leduc poker is a significantly larger game with two rounds and a 6-card deck in two suits, e.g. {JS,QS,KS, JH,QH,KH}. Liar’s Dice(1,1) is a dice game where each player gets a single private die, rolled at the start of the game, and players proceed to bid on the outcomes of all dice in the game. Goofspiel is a card game where players try to obtain point cards by bidding simultaneously. We use an imperfect information variant where bid cards are unrevealed.
4.2 Convergence Results
We now present empirical convergence rates to Nash equilibria. The main results are depicted in Figure 1.
For the neural network experiments, we use a single policy network for both players, which takes as input the current state of the game and whose output is a softmax distribution over the actions of the game. The state of the game is represented in a game-dependent fashion as a fixed-size vector of between 11 and 52 bits, encoding public information, private information, and the game history.
The neural network consists of a number of fully-connected hidden layers, each with the same number of units and with rectified linear activation functions after each layer. A linear output layer maps from the final hidden layer to a value per action. The values for the legal actions are selected and mapped to a policy using the softmax function.
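The forward pass of such a network can be sketched in plain Python. The layer sizes, the weight initialization, the example state encoding, and the function names are assumptions for illustration; only the overall shape (ReLU hidden layers, linear output, softmax over legal actions) follows the description above.

```python
import math
import random


def init_network(layer_sizes, seed=0):
    """Random weights and zero biases for each fully-connected layer."""
    rng = random.Random(seed)
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
                   for _ in range(n_out)]
        params.append((weights, [0.0] * n_out))
    return params


def dense(weights, bias, inputs):
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def policy(params, state_bits, legal_actions):
    """Map a binary state vector to a distribution over the legal actions."""
    hidden = state_bits
    for weights, bias in params[:-1]:
        hidden = [max(h, 0.0) for h in dense(weights, bias, hidden)]  # ReLU
    logits = dense(params[-1][0], params[-1][1], hidden)  # linear output layer
    legal_logits = [logits[a] for a in legal_actions]     # keep legal actions only
    shift = max(legal_logits)
    exps = [math.exp(l - shift) for l in legal_logits]
    total = sum(exps)
    return {a: e / total for a, e in zip(legal_actions, exps)}


# An 11-bit state, two hidden layers of 64 units, and 3 actions in total.
params = init_network([11, 64, 64, 3])
state = [1.0 if k in (0, 4, 7) else 0.0 for k in range(11)]
probs = policy(params, state, legal_actions=[0, 2])
print(probs)
```

Selecting the legal logits before the softmax guarantees that illegal actions receive no probability mass.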
At each step, we evaluate the policy for every state of the game, compute a best response to it, and evaluate each action against the best response. We then perform a single gradient descent step on the loss function:
, where the final term is a regularization for all the neural network weights, and the baseline is a computed constant (i.e. it does not contribute to the gradient calculation) with . We performed a sweep over the number of hidden layers (from 1 to 5), the number of hidden units (64, 128 or 256), the regularization weight (), and the initial learning rate (powers of 2). The plotted results show the best values from this sweep for each game.
4.3 Discussion
There are several interesting observations to make about the results. First, the convergence of the neural network policies is more erratic than the tabular counterparts. However, in two games the neural network policies have learned more accurate approximate equilibria than any of the tabular algorithms for the same number of iterations. The network could be generalizing across the state space (discovering patterns) in a way that is not possible in the tabular case, despite raw input features.
Although Tabular ED and XFP have roughly the same convergence rate, the respective function approximation versions have an order of magnitude difference in speed, with Neural ED reaching an exploitability of in Leduc Poker after iterations, a level which NFSP reaches after approximately iterations [Heinrich and Silver2016]. Neural ED and NFSP are not directly comparable as NFSP is computing an approximate equilibrium using sampling and RL while ED uses true best response. However, NFSP uses a reservoir buffer dataset of 2 million entries, whereas this is not required in ED.
5 Conclusion
We introduce Exploitability Descent (ED), which optimizes its policy directly against worst-case opponents. In cyclical perfect information and Markov games, we prove that ED policies converge to strong policies that are unexploitable in the tabular case. In imperfect information games, we also present finite-time exploitability bounds for tabular policies. While the empirical convergence rates using tabular policies are comparable to previous algorithms, the policies themselves provably converge. So, unlike XFP and CFR, there is no need to compute the average policy. Neural network function approximation is applicable via direct policy gradient ascent, also avoiding domain-specific abstractions, or the need to store large replay buffers of past experience, as in neural fictitious self-play [Heinrich and Silver2016], or a set of past networks, as in PSRO [Lanctot et al.2017].
In some of our experiments, neural networks learned lower-exploitability policies than the tabular counterparts, which could be an indication of strong generalization potential by recognizing similar patterns across states. There are interesting directions for future work: using approximate best responses and sampling trajectories for the policy optimization in larger games where enumerating the trajectories is not feasible.
Acknowledgments
We would like to thank Neil Burch, Johannes Heinrich, and Martin Schmid for feedback. Dustin Morrill was supported by The Alberta Machine Intelligence Institute (Amii) and Alberta Treasury Branch (ATB) during the course of this research.
References
 [Baxter and Bartlett2001] Jonathan Baxter and Peter L Bartlett. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319–350, 2001.
 [Beck and Teboulle2003] Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, May 2003.
 [Bowling et al.2015] Michael Bowling, Neil Burch, Michael Johanson, and Oskari Tammelin. Heads-up Limit Hold’em Poker is solved. Science, 347(6218):145–149, January 2015.
 [Boyd and Vandenberghe2004] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
 [Brown and Sandholm2017] Noam Brown and Tuomas Sandholm. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, 360(6385), December 2017.
 [Brown et al.2017] Noam Brown, Christian Kroer, and Tuomas Sandholm. Dynamic thresholding and pruning for regret minimization. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2017.
 [Brown et al.2019] Noam Brown, Adam Lerer, Sam Gross, and Tuomas Sandholm. Deep counterfactual regret minimization. In Proceedings of the Thirty-sixth International Conference on Machine Learning (ICML), 2019. Full technical report available at http://arxiv.org/abs/1811.00164.
 [Campbell et al.2002] M. Campbell, A. J. Hoane, and F. Hsu. Deep Blue. Artificial Intelligence, 134:57–83, 2002.
 [Clarke1975] Frank H Clarke. Generalized gradients and applications. Transactions of the American Mathematical Society, 205:247–262, 1975.
 [Freund and Schapire1997] Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
 [Gale et al.1951] D. Gale, H.W. Kuhn, and A.W. Tucker. Linear programming and the theory of games. In T.C. Koopmans et al., editor, Activity Analysis of Production and Allocation, pages 317–329. Wiley: New York, 1951.
 [Glynn and L’ecuyer1995] Peter W Glynn and Pierre L’ecuyer. Likelihood ratio gradient estimation for stochastic recursions. Advances in Applied Probability, 1995.
 [Hart and MasColell2000] S. Hart and A. Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127–1150, 2000.
 [Hazan2015] Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3–4):157–325, 2015.
 [Heinrich and Silver2016] Johannes Heinrich and David Silver. Deep reinforcement learning from self-play in imperfect-information games. CoRR, abs/1603.01121, 2016.
 [Heinrich et al.2015] Johannes Heinrich, Marc Lanctot, and David Silver. Fictitious self-play in extensive-form games. In ICML 2015, 2015.
 [Hoda et al.2007] S. Hoda, A. Gilpin, and J. Peña. A gradient-based approach for computing Nash equilibria of large sequential games. Optimization Online, July 2007. http://www.optimizationonline.org/DB_HTML/2007/07/1719.html.
 [Johanson et al.2012] M. Johanson, N. Bard, N. Burch, and M. Bowling. Finding optimal abstract strategies in extensive form games. In Proceedings of the Twenty-Sixth Conference on Artificial Intelligence (AAAI), pages 1371–1379, 2012.

 [Koller et al.1994] D. Koller, N. Megiddo, and B. von Stengel. Fast algorithms for finding randomized strategies in game trees. In Proceedings of the 26th ACM Symposium on Theory of Computing (STOC ’94), pages 750–759, 1994.
 [Kuhn1950] H. W. Kuhn. Simplified two-person Poker. Contributions to the Theory of Games, 1:97–103, 1950.
 [Lanctot et al.2017] Marc Lanctot, Vinicius Zambaldi, Audrunas Gruslys, Angeliki Lazaridou, Karl Tuyls, Julien Perolat, David Silver, and Thore Graepel. A unified game-theoretic approach to multiagent reinforcement learning. In Advances in Neural Information Processing Systems, 2017.
 [Lanctot2013] Marc Lanctot. Monte Carlo Sampling and Regret Minimization for Equilibrium Computation and Decision-Making in Large Extensive Form Games. PhD thesis, Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada, June 2013.
 [Lockhart et al.2019] Edward Lockhart, Marc Lanctot, Julien Pérolat, Jean-Baptiste Lespiau, Dustin Morrill, Finbarr Timbers, and Karl Tuyls. Computing approximate equilibria in sequential adversarial games by exploitability descent. CoRR, abs/1903.05614, 2019.
 [Moravčík et al.2017] Matej Moravčík, Martin Schmid, Neil Burch, Viliam Lisý, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael Bowling. DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 358(6362), October 2017.
 [Nemirovsky and Yudin1983] Arkadii Semenovich Nemirovsky and David Borisovich Yudin. Problem complexity and method efficiency in optimization. 1983.
 [Rubin and Watson2011] J. Rubin and I. Watson. Computer poker: A review. Artificial Intelligence, 175(5–6):958–987, 2011.

[ShalevShwartz and
others2012]
Shai ShalevShwartz et al.
Online learning and online convex optimization.
Foundations and Trends® in Machine Learning
, 4(2):107–194, 2012.  [Shoham and LeytonBrown2009] Y. Shoham and K. LeytonBrown. Multiagent Systems: Algorithmic, GameTheoretic, and Logical Foundations. Cambridge University Press, 2009.
 [Silver et al.2016] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529:484––489, 2016.
 [Silver et al.2018] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. A general reinforcement learning algorithm that masters chess, shogi, and Go through selfplay. Science, 632(6419):1140–1144, 2018.
 [Southey et al.2005] Finnegan Southey, Michael Bowling, Bryce Larson, Carmelo Piccione, Neil Burch, Darse Billings, and Chris Rayner. Bayes’ bluff: Opponent modelling in poker. In Proceedings of the TwentyFirst Conference on Uncertaintyin Artificial Intelligence (UAI), pages 550–558, 2005.
 [Srinivasan et al.2018] Sriram Srinivasan, Marc Lanctot, Vinicius Zambaldi, Julien Pérolat, Karl Tuyls, Rémi Munos, and Michael Bowling. Actorcritic policy optimization in partially observable multiagent environments. In Advances in Neural Information Processing Systems, 2018.
 [Sutton and Barto2018] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, 2nd edition, 2018.
 [Sutton et al.2000] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12, pages 1057–1063. MIT Press, 2000.
 [Tang et al.2018] Jie Tang, Keiran Paster, and Pieter Abbeel. Equilibrium finding via asymmetric self-play reinforcement learning. In Deep Reinforcement Learning Workshop NeurIPS 2018, 2018.
 [Waugh and Bagnell2014] Kevin Waugh and J. Andrew Bagnell. A unified view of large-scale zero-sum equilibrium computation. CoRR, abs/1411.5007, 2014.
 [Waugh et al.2015] Kevin Waugh, Dustin Morrill, J. Andrew Bagnell, and Michael Bowling. Solving games with functional regret estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, 2015.
 [Williams1992] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4):229–256, 1992.
 [Zinkevich et al.2008] M. Zinkevich, M. Johanson, M. Bowling, and C. Piccione. Regret minimization in games with incomplete information. In Advances in Neural Information Processing Systems 20 (NIPS 2007), 2008.
 [Zinkevich2003] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), 2003.
Appendix A Proofs
A.1 Cyclical Perfect Information Games and Markov Games
Proof of Theorem 1.
The proof uses tools from nonsmooth analysis to properly handle gradients of a nonsmooth function. We use the notion of generalized gradients defined in [Clarke1975]. The generalized gradient of a Lipschitz function $f$ at a point $\theta$ is the convex hull of the limits of the form $\lim_{i \to \infty} \nabla f(\theta + h_i)$ where $h_i \to 0$. The only assumption we will require is that the parameters $\theta$ of our policy remain in a compact set and that the policy is differentiable with respect to $\theta$ for all states and actions.
More precisely we use [Clarke1975, Theorem 2.1] to state our result. The theorem requires the function to be uniformly semicontinuous which is the case if the policy is differentiable since the dependence of on is polynomial. The function is Lipschitz with respect to . The uniform continuity of comes from the fact that is compact.
Using [Clarke1975, Theorem 2.1], we have that is the convex hull of , so .
The proof follows by applying the policy gradient theorem [Baxter and Bartlett2001] to . ∎
Proof of Lemma 1.
In a two-player, zero-sum game, NashConv reduces to the sum of exploitabilities:
so doing policy gradient in the worst case independently for all players locally minimizes the sum of exploitabilities and therefore NashConv. Formally, we have the following (in general one would only have an inclusion of generalized gradients, but since in our case the functions are defined on two different sets of parameters, we have an equality):
(8)  
∎
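To make the reduction used in this proof concrete, the following minimal sketch computes NashConv in a two-player zero-sum matrix game. The matrix-game setting, the function name `nash_conv`, and the rock-paper-scissors payoffs are illustrative assumptions and do not appear in the analysis above. Because the on-policy utilities of the two players cancel, NashConv is just the sum of the two best-response values.

```python
def nash_conv(payoff, row_policy, col_policy):
    """NashConv in a zero-sum matrix game; payoff[i][j] is the row player's utility."""
    n_rows, n_cols = len(payoff), len(payoff[0])
    # Expected row-player utility of each pure row against the column policy.
    row_values = [
        sum(payoff[i][j] * col_policy[j] for j in range(n_cols))
        for i in range(n_rows)
    ]
    # Expected column-player utility (negated payoff) of each pure column.
    col_values = [
        -sum(payoff[i][j] * row_policy[i] for i in range(n_rows))
        for j in range(n_cols)
    ]
    # The players' on-policy utilities sum to zero, so only the
    # best-response values remain.
    return max(row_values) + max(col_values)


# Hypothetical example: rock-paper-scissors from the row player's view.
RPS = [[0, -1, 1],
       [1, 0, -1],
       [-1, 1, 0]]
uniform = [1 / 3, 1 / 3, 1 / 3]
always_rock = [1.0, 0.0, 0.0]

at_equilibrium = nash_conv(RPS, uniform, uniform)
exploitable = nash_conv(RPS, always_rock, uniform)
```

In this example the uniform profile has NashConv zero, while a row player who always plays rock can be exploited by exactly one unit.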
A.2 Imperfect Information Games
Before we prove the main proof (of Theorem 2), we start with some definitions, a previous theorem, and a supporting lemma.
Definition 1.
This is a form of strong policy gradient policy iteration (SPGPI) defined in [Srinivasan et al.2018, Theorem 2] that separates the optimization for each player. Also notice that an iteration of TabularED(, ) is equivalent to simultaneous applications of SPGPI.
Definition 2.
Suppose all players use a sequence of joint policies over iterations. Define player ’s regret after iterations to be the difference in expected utility between the best possible policy in hindsight and the expected utility given the sequence of policies:
Theorem 4.
[Srinivasan et al.2018, Theorem 2] Suppose players play a finite game using joint policies over iterations. In a twoplayer zerosum game, if and , then the regret of SPGPI after iterations is , where , and .
Note that, despite the original application of policy gradients in self-play, it follows from the original proof that the statement about the regret incurred by player $i$ does not require that a specific algorithm generate the opponents’ policies: it is only a function of the specific sequence of opponent policies. In particular, they could be best response policies, and so SPGPI has the same regret guarantee.
We need one more lemma before we prove the convergence guarantee of Tabular ED. The following lemma states an optimality bound of the best iterate under time-independent loss functions equal to the average regret. The best strategy in a no-regret sequence of strategies then approaches an equilibrium strategy over time without averaging and without probabilistic arguments.
Lemma 2.
Denote . For any sequence of iterates, , from decision set , the regret of this sequence under loss, , is
Then, the iterate with the lowest loss, , has an optimality gap bounded by the average regret:
Proof.
Since is fixed (not varying with ),
and dividing by yields the result. ∎
In finite games, we can replace the $\inf$ operation in Lemma 2 with $\min$ when the decision set is the set of all possible strategies, since this set is closed.
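As an informal sanity check of Lemma 2, the sketch below evaluates a fixed loss on an arbitrary sequence of iterates and compares the optimality gap of the best iterate with the average regret. The quadratic loss, the finite grid used as the decision set, and the helper names `regret` and `best_iterate_gap` are illustrative assumptions, not part of the original analysis.

```python
def regret(loss, iterates, decision_set):
    """Regret of `iterates` under a fixed loss, against the best fixed decision."""
    total = sum(loss(x) for x in iterates)
    best_fixed = min(loss(x) for x in decision_set)
    return total - len(iterates) * best_fixed


def best_iterate_gap(loss, iterates, decision_set):
    """Optimality gap of the lowest-loss iterate in the sequence."""
    return min(loss(x) for x in iterates) - min(loss(x) for x in decision_set)


def quadratic_loss(x):
    """Hypothetical time-independent loss with its minimum at 0.3."""
    return (x - 0.3) ** 2


# Hypothetical decision set: a grid on [0, 1]; any iterates from it would do.
decision_set = [k / 100 for k in range(101)]
iterates = [0.9, 0.7, 0.5, 0.4, 0.35]

gap = best_iterate_gap(quadratic_loss, iterates, decision_set)
average_regret = regret(quadratic_loss, iterates, decision_set) / len(iterates)
assert gap <= average_regret
```

The inequality holds for any sequence because the smallest of the per-iterate losses can never exceed their average.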
We now provide the proof of the main theorem.
Proof of Theorem 2.
The first part of the proof follows the logic of the CFR-BR proofs [Johanson et al.2012]. Unlike CFR-BR, we then use Lemma 2 to bound the quality of the best iterate.
SPGPI(), has bounded regret sublinear in for player by Theorem 4. Define loss function, , as the negated worst-case value for player , like that described by [Waugh and Bagnell2014, Theorem 4]. Then by Lemma 2 and Theorem 4 we have, for the best iterate:
where is the exploitability defined in Section 2.
The Nash equilibrium approximation bound is just the sum of the exploitabilities, so when both and are returned from ED, they form a equilibrium. ∎
Proof of Theorem 3.
This update rule is identical to that of generalized infinitesimal gradient ascent (GIGA) [Zinkevich2003] at each information state with best response counterfactual values. CFR-BR(GIGA) therefore performs the same updates and the two algorithms coincide. With step sizes , each local GIGA instance has regret after iterations upper bounded by , where [Srinivasan et al.2018, Lemma 5]. Then, by the CFR Theorem [Zinkevich et al.2008], the regret is bounded by
Hence, the reasoning of the proof of Theorem 2 follows since it is a regret-minimization algorithm. ∎
Appendix B Longer Descriptions of Experiment Domains
Kuhn poker is a simplified poker game first proposed by Harold Kuhn [Kuhn1950]. Each player antes a single chip, and gets a single private card from a totally-ordered 3-card deck, e.g. {J, Q, K}. There is a single betting round limited to one raise of 1 chip, and two actions: pass (check/fold) or bet (raise/call). If a player folds, they lose their commitment (2 if the player made a bet, otherwise 1). If neither player folds, the player with the higher card wins the pot (2, 4, or 6 chips). The utility for each player is defined as the number of chips after playing minus the number of chips before playing.
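The utility rule for Kuhn poker can be sketched as a small lookup over terminal betting sequences. The sketch assumes the standard Kuhn betting tree with a 1-chip ante and a 1-chip bet; the card labels J, Q, K, the action strings built from `p` (pass) and `b` (bet), and the function name `kuhn_utilities` are hypothetical encodings chosen for illustration.

```python
CARD_RANK = {"J": 0, "Q": 1, "K": 2}  # assumed labels for the 3-card deck
TERMINAL_HISTORIES = ("pp", "bp", "bb", "pbp", "pbb")


def kuhn_utilities(card1, card2, actions):
    """Return (u1, u2), the chip change of each player at a terminal history."""
    if actions not in TERMINAL_HISTORIES:
        raise ValueError("not a terminal Kuhn poker history: %r" % actions)
    if card1 == card2:
        raise ValueError("players cannot hold the same card")
    showdown_winner = 0 if CARD_RANK[card1] > CARD_RANK[card2] else 1
    if actions == "pp":      # both pass: showdown for the antes
        winner, stake = showdown_winner, 1
    elif actions == "bp":    # player 2 folds to a bet and loses the ante
        winner, stake = 0, 1
    elif actions == "pbp":   # player 1 folds to a bet and loses the ante
        winner, stake = 1, 1
    else:                    # "bb" or "pbb": bet is called, each risks 2 chips
        winner, stake = showdown_winner, 2
    return (stake, -stake) if winner == 0 else (-stake, stake)
```

For instance, `kuhn_utilities("J", "K", "bb")` returns `(-2, 2)`: the first player bets with the lowest card, is called, and loses both the ante and the bet.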
Leduc poker is a significantly larger game with two rounds and a 6-card deck in two suits, e.g. {JS,QS,KS, JH,QH,KH}. Like Kuhn, each player initially antes a single chip to play and obtains a single private card, and there are three actions: fold, call, raise. There is a fixed bet amount of 2 chips in the first round and 4 chips in the second round, and a limit of two raises per round. After the first round, a single public card is revealed. A pair is the best hand, otherwise hands are ordered by their high card (suit is irrelevant). Utilities are defined similarly to Kuhn poker.
Liar’s Dice(1,1) is a dice game where each player gets a single private die with a face value in $\{1, \ldots, 6\}$, rolled at the beginning of the game. The players then take turns bidding on the outcomes of both dice, i.e. with bids of the form $q$-$f$ referring to quantity and face, or calling “Liar”. The bids represent a claim that there are at least $q$ dice with face value $f$ among both players. The highest die value, 6, counts as a wild card matching any value. Calling “Liar” ends the game, then both players reveal their dice. If the last bid is not satisfied, then the player who called “Liar” wins. Otherwise, the other player wins. The winner receives +1 and the loser −1.
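The way a call of “Liar” is resolved can be sketched as follows. This is a minimal illustration that assumes six-sided dice with 6 as the wild face; the function names `bid_is_satisfied` and `liars_dice_utilities` and the tuple encoding of a bid as (quantity, face) are hypothetical.

```python
def bid_is_satisfied(quantity, face, dice, wild=6):
    """True if at least `quantity` dice show `face`, counting `wild` as a match."""
    matches = sum(1 for d in dice if d == face or d == wild)
    return matches >= quantity


def liars_dice_utilities(last_bid, dice, caller):
    """Return (u0, u1) after player `caller` (0 or 1) calls "Liar".

    `last_bid` is the (quantity, face) pair bid by the other player and
    `dice` holds the revealed dice of both players.
    """
    quantity, face = last_bid
    if bid_is_satisfied(quantity, face, dice):
        winner = 1 - caller  # the bid was true, so the bidder wins
    else:
        winner = caller      # the bid was false, so the caller wins
    return (1, -1) if winner == 0 else (-1, 1)
```

For example, with dice `[3, 6]` the bid 2-3 is satisfied because the 6 is wild, so whoever called “Liar” on that bid loses.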
Goofspiel, or the Game of Pure Strategy, is a bidding card game where players are trying to obtain the most points. The point cards are shuffled and set face-down. Each turn, the top point card is revealed, and players simultaneously play a bid card; the point card is given to the highest bidder or discarded if the bids are equal. In this implementation, we use a fixed deck of decreasing points. In this paper, we use and an imperfect information variant where players are only told whether they have won or lost the bid, but not what the other player played.
Appendix C Connections and Differences Between Gradient Descent, Mirror Descent, Policy Gradient, and Hedge
There is a broad class of algorithms that attempt to make incremental progress on an optimization task by moving parameters in the direction of a gradient. This elementary idea is intuitive, requiring only basic knowledge of calculus and functions to understand in the abstract. One way to more formally justify this procedure comes from the field of online convex optimization [Zinkevich2003, Hazan2015, ShalevShwartz and others2012]. The linearization trick [ShalevShwartz and others2012] reveals how simple parameter adjustments based on gradients can be used to optimize complicated nonlinear functions. Perhaps the best-known learning rule is that of gradient descent: $x_{t+1} = x_t - \alpha \nabla f(x_t)$ with step size $\alpha$, where the goal is to minimize function $f$.
Often problems will include constraints on , such as the probability simplex constraint required of decision policies. One often convenient way to approach this problem is to transform unconstrained parameters, , to the nearest point in the feasible set, , with a transfer function, . This separation between the unconstrained and constrained space produces some ambiguity in the way optimization algorithms are adapted to handle constraints. Do we adjust the transformed parameters or the unconstrained parameters with the gradient? And do we take the gradient with respect to the transformed parameters or the unconstrained parameters?
Projected gradient descent (PGD) [Zinkevich2003] resolves these ambiguities by adjusting the transformed parameters with the gradient of the transformed parameters. For PGD, the unconstrained parameters are not saved; they are only produced temporarily before they can be projected into the feasible set, . Mirror descent (MD) [Nemirovsky and Yudin1983, Beck and Teboulle2003], broadly, makes adjustments exclusively in the unconstrained space, and transfers to the feasible set on demand. However, like PGD, MD uses the gradient with respect to the transformed parameters. E.g., an MD-based update is , and transfer is done on demand, .
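The difference between the two update styles can be sketched on the probability simplex. The code below is a minimal illustration under stated assumptions: the Euclidean projection uses the standard sort-based routine, softmax is used as the MD transfer function, and the names `project_to_simplex`, `softmax`, `pgd_step`, and `md_step` are hypothetical. In both cases `grad` is the gradient with respect to the policy, as described above.

```python
import math


def project_to_simplex(y):
    """Euclidean projection of y onto the probability simplex (sort-based)."""
    sorted_desc = sorted(y, reverse=True)
    cumulative, tau = 0.0, 0.0
    for k, u_k in enumerate(sorted_desc, start=1):
        cumulative += u_k
        candidate = (cumulative - 1.0) / k
        if u_k - candidate > 0:
            tau = candidate  # keep the threshold of the largest valid k
    return [max(v - tau, 0.0) for v in y]


def softmax(theta):
    """Transfer unconstrained parameters to the simplex."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def pgd_step(pi, grad, alpha):
    """PGD: step in the constrained space, then project back onto the simplex."""
    return project_to_simplex([p - alpha * g for p, g in zip(pi, grad)])


def md_step(theta, grad, alpha):
    """MD: step in the unconstrained space; the policy is produced on demand."""
    return [t - alpha * g for t, g in zip(theta, grad)]


# Hypothetical loss f(pi) = -pi . v with v = (1, 0), so grad = (-1, 0).
pgd_policy = pgd_step([0.5, 0.5], [-1.0, 0.0], alpha=0.1)
md_policy = softmax(md_step([0.0, 0.0], [-1.0, 0.0], alpha=0.1))
```

Both steps shift probability toward the first action, but PGD stores the projected policy itself while MD stores only the unconstrained parameters and applies the transfer when a policy is needed.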
Further difficulties are encountered when function approximation is involved, that is, when , for an arbitrary function . Now PGD’s approach of making adjustments in the decision space where resides is untenable because the function parameters, , might reside in a very different space. E.g. may be a complete strategy while may be a vector of neural network parameters with many fewer dimensions. But the gradient with respect to the is also in the decision space, so MD’s update cannot be done exactly either.
A simple fix is to apply the chain rule to find the gradient with respect to the function parameters. This is the approach taken by policy gradient (PG) methods [Williams1992, Sutton et al.2000, Sutton and Barto2018] (the “all actions” versions rather than sample action versions). A consequence of this choice, however, is that PG updates in the tabular setting (when the function is the identity) generally do not reproduce MD updates.
For example, hedge, also known as exponentially weighted experts or entropic mirror descent [Freund and Schapire1997, Beck and Teboulle2003], is a celebrated algorithm for approximately solving games. It is a simple no-regret algorithm that achieves the optimal regret dependence on the number of actions, , and it can be used in the CFR and CFR-BR framework instead of the more conventional regret matching to solve sequential imperfect information games. It is also an instance of MD, which we show here.
Hedge accumulates values associated with each action (e.g. counterfactual values or regrets) and transfers them into the probability simplex with the softmax function to generate a policy. Formally, given a sequence of values and temperature, , hedge plays
We now show how to recover this policy creation rule with MD.
Given a vector of bounded action values, , the expected value of policy interpreted as a probability distribution is just the weighted sum of values, .
The gradient of the expected value of ’s value is then just the vector of action values, . MD accumulates the gradients on each round,
where is a stepsize parameter. If is zero, then the current parameters, , are simply the stepsize weighted sum of the action values.
If on each round, is chosen to be , then we can rewrite this policy in terms of the action values alone:
which one can recognize as hedge with . This shows how hedge fits into the ED framework. When counterfactual values are used for action values, ED with MD gradient updates and a softmax transfer at every information state is identical to CFR-BR with hedge at every information state.
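The equivalence derived above can be checked numerically. The sketch below assumes that hedge multiplies the cumulative action values by a rate `eta` before the softmax (the temperature convention used in the text may differ, for example by an inverse), and that MD starts from zero parameters with step size `alpha`. The function names and example values are hypothetical.

```python
import math


def softmax(x):
    """Numerically stable softmax."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    z = sum(exps)
    return [e / z for e in exps]


def hedge_policy(values_so_far, eta):
    """Hedge: softmax of the eta-scaled cumulative action values."""
    n_actions = len(values_so_far[0])
    totals = [sum(v[a] for v in values_so_far) for a in range(n_actions)]
    return softmax([eta * s for s in totals])


def md_policy(values_so_far, alpha):
    """MD: accumulate gradients (the action values) from zero, then transfer."""
    theta = [0.0] * len(values_so_far[0])
    for v in values_so_far:
        # The gradient of the expected value pi . v with respect to pi is v.
        theta = [t + alpha * g for t, g in zip(theta, v)]
    return softmax(theta)


# Hypothetical sequence of action-value vectors over two rounds.
values = [[1.0, 0.0, -1.0], [0.5, 0.2, 0.0]]
from_hedge = hedge_policy(values, eta=0.5)
from_md = md_policy(values, alpha=0.5)
```

With matching rate and step size the two policies agree up to floating-point error, which mirrors the argument that hedge is MD with a softmax transfer.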
In comparison, PG, using the same transfer and tabular parameterization, generates policies according to
[Sutton and Barto2018, Section 2.8], so the update direction, , is actually the regret scaled by :
Knowing this, we can write the in concrete terms:
The fact that the PG parameters accumulate regret instead of action value is inconsequential because the difference between action values and regrets is a shift that is shared across all actions, and the softmax function is shift-invariant. But there is a substantive difference in that updates are scaled by the current policy.
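The claim that the tabular softmax PG direction is the regret scaled by the current policy can be verified with a finite-difference check. This is a sketch under assumptions: the objective is the expected value of a softmax policy over fixed action values, and the names `pg_direction` and `numeric_gradient` as well as the example numbers are hypothetical.

```python
import math


def softmax(theta):
    """Numerically stable softmax."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def expected_value(theta, v):
    """Expected value of the softmax policy under action values v."""
    return sum(p * x for p, x in zip(softmax(theta), v))


def pg_direction(theta, v):
    """Analytic gradient with respect to theta: pi(a) * (v(a) - pi . v)."""
    pi = softmax(theta)
    baseline = sum(p * x for p, x in zip(pi, v))
    return [p * (x - baseline) for p, x in zip(pi, v)]


def numeric_gradient(theta, v, eps=1e-6):
    """Central finite-difference gradient of expected_value in theta."""
    grad = []
    for a in range(len(theta)):
        up, down = list(theta), list(theta)
        up[a] += eps
        down[a] -= eps
        grad.append((expected_value(up, v) - expected_value(down, v)) / (2 * eps))
    return grad


theta = [0.2, -0.4, 1.0]   # hypothetical tabular parameters
v = [1.0, 0.0, -0.5]       # hypothetical action values
analytic = pg_direction(theta, v)
numeric = numeric_gradient(theta, v)
```

Each component of the direction is the action's regret against the policy's expected value, multiplied by that action's probability, so rarely played actions receive small updates even when their regret is large.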
Appendix D Global Minimum Conditions
In this section we will suppose that the policy under the simplex constraints and (where is a slack variable to enforce the inequality constraint )
subject to  
The Lagrangian is:
Knowing that for all :
The gradient of the Lagrangian with respect to is:
(9)  
(10)  
(11)  
(12) 
Suppose that there exists a best response such that (i.e. if 0 is in the set of generalized gradients). Two cases can appear:
If then and then:
If then and then:
Two cases (one stable and one unstable):

then we have a stable fixed point,

is not stable as then we could increase the value by switching to that action.
We conclude by noticing that is greedy with respect to the value of the joint policy , thus is a best response to . Since both policies are best responses to each other, is a Nash equilibrium. is also therefore unexploitable.