Multiagent interactions are often modeled using extensive-form games (EFGs), a powerful framework that incorporates sequential actions, hidden information, and stochastic events. Recent research has focused on computing approximately optimal strategies in large extensive-form games, resulting in a solution to heads-up limit Texas Hold’em [Bowling et al.2015], a game with states [Johanson2013], and in two independent super-human computer agents for the much larger heads-up no-limit Texas Hold’em [Moravčík et al.2017, Brown and Sandholm2018].
When modeling an interaction with an EFG we must specify a utility at every outcome for each agent. This utility is the cardinal measure of an outcome’s desirability. Utility is particularly difficult to specify. Take, for example, situations where an agent has multiple objectives to balance: a defender in a security game might have the primary objective of protecting a target and a secondary objective of minimizing expected cost, or a robot operating in a dangerous environment will have a primary task to complete and a secondary objective of minimizing damage to itself and others. How these objectives combine into a single value, the agent’s utility, is ill-specified and error prone.
One approach for handling multiple objectives is to use a linear combination of per-objective utilities. This approach has been used in EFGs to “tilt” poker agents toward taking specific actions [Johanson et al.2011], and to mix between cost minimization and risk mitigation in sequential security games [Lisý, Davis, and Bowling2016]. However, objectives are typically measured on incommensurable scales. This leads to dubious combinations of weights often selected by trial-and-error.
A second approach is to constrain the agents’ strategy spaces directly. For example, rather than minimizing the expected cost, we use a hard constraint that disqualifies high-cost strategies. Using such constraints has been extensively studied in single-agent perfect information settings [Altman1999] and partial information settings [Isom, Meyn, and Braatz2008, Poupart et al.2015, Santana, Thiébaux, and Williams2016], as well as in (non-sequential) security games [Brown et al.2014].
Incorporating strategy constraints when solving EFGs presents a unique challenge. Nash equilibria can be found by solving a linear program (LP) derived using the sequence-form representation [Koller, Megiddo, and von Stengel1996]. This LP is easily modified to incorporate linear strategy constraints; however, LPs do not scale to large games. Specialized algorithms for efficiently solving large games, such as an instantiation of Nesterov’s excessive gap technique (EGT) [Hoda et al.2010] as well as counterfactual regret minimization (CFR) [Zinkevich et al.2008] and its variants [Lanctot et al.2009, Tammelin et al.2015], cannot integrate arbitrary strategy constraints directly. Currently, the only large-scale approach is restricted to constraints that consider only individual decisions [Farina, Kroer, and Sandholm2017].
In this work we present the first scalable algorithm for solving EFGs with arbitrary convex strategy constraints. Our algorithm, Constrained CFR, provably converges towards a strategy profile that is minimax optimal under the given constraints. It does this while retaining the convergence rate of CFR and requiring additional memory proportional to the number of constraints. We demonstrate the empirical effectiveness of Constrained CFR by comparing its solution to that of an LP solver in a security game. We also present a novel constraint-based technique for opponent modeling with partial observations in a small poker game.
Formally, an extensive-form game [Osborne and Rubinstein1994] is a game tree defined by:
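- A set of players $\mathcal{P}$. This work focuses on games with two players, so $\mathcal{P} = \{1, 2\}$.
- A set of histories $H$, the tree’s nodes rooted at $\emptyset$. The leaves, $Z \subseteq H$, are terminal histories. For any history $h \in H$, we let $h' \sqsubset h$ denote a prefix $h'$ of $h$, and necessarily $h' \in H$.
- For each $h \in H \setminus Z$, a set of actions $A(h)$. For any $a \in A(h)$, $ha \in H$ is a child of $h$.
- A player function $P : H \setminus Z \to \mathcal{P} \cup \{c\}$ defining the player to act at $h$. If $P(h) = c$ then chance acts according to a known probability distribution $\sigma_c(h) \in \Delta^{|A(h)|}$, where $\Delta^{|A(h)|}$ is the probability simplex of dimension $|A(h)|$.
- A set of utility functions $u_i : Z \to \mathbb{R}$, for each player. Player $i$ receives utility $u_i(z)$ for reaching $z \in Z$. We assume the game is zero-sum, i.e., $u_1(z) = -u_2(z)$. Let $u = u_1$.
- For each player $i \in \mathcal{P}$, a collection of information sets $\mathcal{I}_i$. $\mathcal{I}_i$ partitions $H_i$, the histories where $i$ acts. Two histories $h, h'$ in an information set $I \in \mathcal{I}_i$ are indistinguishable to $i$. Necessarily $A(h) = A(h')$, which we denote by $A(I)$. When a player acts they do not observe the history, only the information set it belongs to, which we denote as $I[h]$.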
We assume a further requirement on the information sets called perfect recall. It requires that players are never forced to forget information they once observed. Mathematically this means that all indistinguishable histories share the same sequence of past information sets and actions for the actor. Although this may seem like a restrictive assumption, some perfect recall-like condition is needed to guarantee that an EFG can be solved in polynomial time, and all sequential games played by humans exhibit perfect recall.
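A behavioral strategy for player $i$ maps each information set $I \in \mathcal{I}_i$ to a distribution over actions, $\sigma_i(I) \in \Delta^{|A(I)|}$. The probability assigned to $a \in A(I)$ is $\sigma_i(I, a)$. A strategy profile, $\sigma = \{\sigma_1, \sigma_2\}$, specifies a strategy for each player. We label the strategy of the opponent of player $i$ as $\sigma_{-i}$. The sets of behavioral strategies and strategy profiles are $\Sigma_i$ and $\Sigma$ respectively.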
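A strategy profile $\sigma$ uniquely defines a reach probability for any history $h$:
$$\pi^\sigma(h) := \prod_{h'a \sqsubseteq h} \sigma_{P(h')}(I[h'], a),$$
where the product ranges over every prefix $h'a$ of $h$ (including $h$ itself), and chance's fixed distribution $\sigma_c$ is used at chance nodes.
This product decomposes into contributions from each player and chance, $\pi^\sigma(h) = \pi_1^\sigma(h)\,\pi_2^\sigma(h)\,\pi_c(h)$. For a player $i$, we denote the contributions from the opponent and chance as $\pi_{-i}^\sigma(h)$ so that $\pi^\sigma(h) = \pi_i^\sigma(h)\,\pi_{-i}^\sigma(h)$. By perfect recall we have $\pi_i^\sigma(h) = \pi_i^\sigma(h')$ for any $h, h'$ in the same information set $I \in \mathcal{I}_i$. We thus also write this probability as $\pi_i^\sigma(I)$.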
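As an illustration of reach probabilities, the following minimal sketch computes them for matching pennies written as an extensive-form game, and then weights each terminal utility by its reach probability to obtain player 1's expected utility. The dictionary encoding (`PLAYER`, `INFOSET`, `UTILITY`) and the function names are assumptions made for this example only; the toy game has no chance nodes, so the chance contribution is omitted.

```python
# Hypothetical encoding of matching pennies as a two-player extensive-form game.
# Histories are strings of actions; player 2 cannot see player 1's choice.
PLAYER = {"": 1, "H": 2, "T": 2}              # player function P(h)
INFOSET = {"": "I1", "H": "I2", "T": "I2"}    # information set I[h]
UTILITY = {"HH": 1.0, "TT": 1.0, "HT": -1.0, "TH": -1.0}  # u_1(z) at terminals


def reach_probability(sigma, history):
    """Multiply the probability of each action along the prefixes of `history`."""
    probability = 1.0
    for depth, action in enumerate(history):
        prefix = history[:depth]
        actor = PLAYER[prefix]
        probability *= sigma[actor][INFOSET[prefix]][action]
    return probability


def expected_utility(sigma):
    """Player 1's expected utility: terminal utilities weighted by reach probability."""
    return sum(reach_probability(sigma, z) * u for z, u in UTILITY.items())


# Two example profiles, indexed as sigma[player][information set][action].
uniform = {1: {"I1": {"H": 0.5, "T": 0.5}}, 2: {"I2": {"H": 0.5, "T": 0.5}}}
biased = {1: {"I1": {"H": 1.0, "T": 0.0}}, 2: {"I2": {"H": 0.25, "T": 0.75}}}
```

Because both histories in player 2's information set map to the same key, the strategy automatically assigns them the same action distribution, which is the indistinguishability requirement in the definition above.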
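Given a strategy profile $\sigma = \{\sigma_1, \sigma_2\}$, the expected utility for player $i$ is given by
$$u_i(\sigma) = u_i(\sigma_1, \sigma_2) := \sum_{z \in Z} \pi^\sigma(z)\, u_i(z).$$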
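A strategy $\sigma_i$ is an $\epsilon$-best response to the opponent’s strategy $\sigma_{-i}$ if $u_i(\sigma_i, \sigma_{-i}) \ge u_i(\sigma_i', \sigma_{-i}) - \epsilon$ for any alternative strategy $\sigma_i' \in \Sigma_i$. A strategy profile is an $\epsilon$-Nash equilibrium when each $\sigma_i$ is an $\epsilon$-best response to its opponent; such a profile exists for any $\epsilon \ge 0$. The exploitability of a strategy profile is the smallest $\epsilon = \frac{1}{2}(\epsilon_1 + \epsilon_2)$ such that each $\sigma_i$ is an $\epsilon_i$-best response. Due to the zero-sum property, all Nash equilibria coincide with the saddle-points of the minmax problem
$$\max_{\sigma_1 \in \Sigma_1} \min_{\sigma_2 \in \Sigma_2} u(\sigma_1, \sigma_2) = \min_{\sigma_2 \in \Sigma_2} \max_{\sigma_1 \in \Sigma_1} u(\sigma_1, \sigma_2). \tag{3}$$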
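A zero-sum EFG can be represented in sequence form [von Stengel1996]. The sets of sequence-form strategies for players 1 and 2 are $\mathcal{X}$ and $\mathcal{Y}$ respectively. A sequence-form strategy $x \in \mathcal{X}$ is a vector indexed by pairs $I \in \mathcal{I}_1$, $a \in A(I)$. The entry $x_{(I,a)}$ is the probability of player 1 playing the sequence of actions that reaches $I$ and then playing action $a$. A special entry, $x_\emptyset = 1$, represents the empty sequence. Any behavioral strategy $\sigma_1 \in \Sigma_1$ has a corresponding sequence-form strategy $\text{SEQ}(\sigma_1)$ where
$$\text{SEQ}(\sigma_1)_{(I,a)} := \pi_1^{\sigma_1}(I)\, \sigma_1(I, a) \qquad \forall I \in \mathcal{I}_1,\ a \in A(I).$$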
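For any history $h$ and, by perfect recall, any information set $I \in \mathcal{I}_i$, player $i$ has a unique sequence that leads to it. Let $x_h$ and $x_I$ denote the corresponding entries in $x$. Thus, we are free to write the expected utility as $u(x, y) = \sum_{z \in Z} \pi_c(z)\, x_z\, y_z\, u(z)$. This is bilinear, i.e., there exists a payoff matrix $A$ such that $u(x, y) = x^\top A y$. A consequence of perfect recall and the laws of probability is for $I \in \mathcal{I}_1$ that $x_I = \sum_{a \in A(I)} x_{(I,a)}$ and that $x \ge 0$. These constraints are linear and completely describe the polytope of sequence-form strategies. Using these together, (3) can be expressed as a bilinear saddle point problem over the polytopes $\mathcal{X}$ and $\mathcal{Y}$:
$$\max_{x \in \mathcal{X}} \min_{y \in \mathcal{Y}} x^\top A y = \min_{y \in \mathcal{Y}} \max_{x \in \mathcal{X}} x^\top A y.$$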
For a convex function $f : \mathcal{X} \to \mathbb{R}$, let $\nabla f(x)$ be any element of the subdifferential $\partial f(x)$, and let $\nabla_{(I,a)} f(x)$ be the $(I, a)$ element of this subgradient.
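The sequence-form representation described in this section can be made concrete with a small sketch. It converts a behavioral strategy into realization weights, checks the linear constraints that define the sequence-form polytope, and evaluates the bilinear utility from a sparse payoff table. The toy game (information sets `I1`, `I2` for player 1 and `J` for player 2), the payoff values, and all function names are hypothetical and chosen only for illustration.

```python
# Player 1's information sets in a hypothetical toy game: "I1" is at the root
# and "I2" is reached only after playing "a" at "I1" (its parent sequence).
P1_INFOSETS = {
    "I1": {"parent": None, "actions": ("a", "b")},
    "I2": {"parent": ("I1", "a"), "actions": ("c", "d")},
}


def to_sequence_form(sigma):
    """Map a behavioral strategy to realization weights x_(I,a)."""
    x = {None: 1.0}  # the empty sequence has weight 1
    for infoset, info in P1_INFOSETS.items():  # parents are listed before children
        for action in info["actions"]:
            x[(infoset, action)] = x[info["parent"]] * sigma[infoset][action]
    return x


def satisfies_sequence_constraints(x, tol=1e-12):
    """Check x_empty = 1, x >= 0, and sum_a x_(I,a) = x_parent(I) for every I."""
    ok = abs(x[None] - 1.0) <= tol and all(v >= -tol for v in x.values())
    for infoset, info in P1_INFOSETS.items():
        total = sum(x[(infoset, a)] for a in info["actions"])
        ok = ok and abs(total - x[info["parent"]]) <= tol
    return ok


def bilinear_utility(x, payoff, y):
    """Evaluate x^T A y, with A stored sparsely as {(row sequence, column sequence): value}."""
    return sum(x[i] * value * y[j] for (i, j), value in payoff.items())


sigma_1 = {"I1": {"a": 0.5, "b": 0.5}, "I2": {"c": 0.25, "d": 0.75}}
x = to_sequence_form(sigma_1)
y = {None: 1.0, ("J", "l"): 0.5, ("J", "r"): 0.5}  # opponent with a single information set
PAYOFF = {
    (("I1", "b"), ("J", "l")): 1.0, (("I1", "b"), ("J", "r")): 0.0,
    (("I2", "c"), ("J", "l")): 2.0, (("I2", "c"), ("J", "r")): -1.0,
    (("I2", "d"), ("J", "l")): -2.0, (("I2", "d"), ("J", "r")): 3.0,
}
value = bilinear_utility(x, PAYOFF, y)
```

Storing the payoff matrix sparsely reflects that only pairs of sequences ending at a terminal history carry a nonzero entry.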
2.2 Counterfactual regret minimization
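Counterfactual regret minimization [Zinkevich et al.2008] is a large-scale equilibrium-finding algorithm that, in self-play, iteratively updates a strategy profile in a fashion that drives its counterfactual regret to zero. This regret is defined in terms of counterfactual values. The counterfactual value of reaching information set $I$ is the expected payoff under the counterfactual that the acting player attempts to reach it:
$$v(I, \sigma) := \sum_{h \in I} \pi_{-i}^\sigma(h) \sum_{z \in Z} \pi^\sigma(h, z)\, u_i(z).$$
Here $i = P(h)$ for any $h \in I$, and $\pi^\sigma(h, z)$ is the probability of reaching $z$ from $h$ under $\sigma$. Let $\sigma_{I \to a}$ be the profile that plays $a$ at $I$ and otherwise plays according to $\sigma$. For a series of profiles $\sigma^1, \ldots, \sigma^T$, the average counterfactual regret of action $a$ at $I$ is $R^T(I, a) = \frac{1}{T} \sum_{t=1}^{T} \left( v(I, \sigma^t_{I \to a}) - v(I, \sigma^t) \right)$.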
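To minimize counterfactual regret, CFR employs regret matching [Hart and Mas-Colell2000]. In particular, actions are chosen in proportion to positive regret, $\sigma^{t+1}(I, a) \propto (R^t(I, a))^+$ where $(x)^+ = \max(x, 0)$. It follows that the average strategy profile $\bar{\sigma}^T$, defined by $\bar{\sigma}_i^T(I, a) \propto \sum_{t=1}^{T} \pi_i^{\sigma^t}(I)\, \sigma_i^t(I, a)$, is an $O(1/\sqrt{T})$-Nash equilibrium [Zinkevich et al.2008]. In sequence form, the average is given by $\bar{x}^T = \frac{1}{T} \sum_{t=1}^{T} x^t$.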
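The regret matching rule at a single information set can be sketched as follows. The function name and the dictionary representation are illustrative assumptions, and the uniform fallback when no action has positive regret is a common convention rather than something specified in the text.

```python
def regret_matching(cumulative_regret):
    """Return action probabilities proportional to positive regret.

    `cumulative_regret` maps each action at one information set to its regret.
    When no action has positive regret, fall back to the uniform distribution
    (an assumed convention; any distribution is allowed in that case).
    """
    positive = {a: max(r, 0.0) for a, r in cumulative_regret.items()}
    total = sum(positive.values())
    if total > 0.0:
        return {a: p / total for a, p in positive.items()}
    return {a: 1.0 / len(positive) for a in positive}


strategy = regret_matching({"fold": -1.0, "call": 1.0, "raise": 3.0})
```

Only the ratios between positive regrets matter, so the same strategy results whether cumulative or average regrets are stored.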
3 Solving games with strategy constraints
We begin by formally introducing the constrained optimization problem for extensive-form games. We specify convex constraints on the set of sequence-form strategies $\mathcal{X}$ with a set of $k$ convex functions $f_i : \mathcal{X} \to \mathbb{R}$, where we require $f_i(x) \le 0$ for each $i = 1, \ldots, k$. (Without loss of generality, we assume throughout this paper that the constrained player is player 1, i.e., the maximizing player.) We use constraints on the sequence form instead of on the behavioral strategies because reach probabilities and utilities are linear functions of a sequence-form strategy, but not of a behavioral strategy.
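The optimization problem can be stated as:
$$\max_{x \in \mathcal{X}} \min_{y \in \mathcal{Y}} x^\top A y \quad \text{subject to } f_i(x) \le 0 \text{ for } i = 1, \ldots, k. \tag{6}$$
The first step toward solving this problem is to incorporate the constraints into the objective with Lagrange multipliers. Assuming that the problem is feasible (i.e., there exists a feasible $x_f \in \mathcal{X}$ such that $f_i(x_f) \le 0$ for each $i$), then (6) is equivalent to:
$$\max_{x \in \mathcal{X}} \min_{y \in \mathcal{Y},\, \lambda \ge 0} x^\top A y - \sum_{i=1}^{k} \lambda_i f_i(x). \tag{7}$$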
We will now present intuition as to how CFR can be modified to solve (7), before presenting the algorithm and proving its convergence.
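One way to check the equivalence between (6) and (7) is to carry out the inner minimization over the multipliers by hand. The sketch below assumes the notation $x$ for the constrained player's sequence-form strategy, $y$ for the opponent's, $A$ for the payoff matrix, and $\lambda \ge 0$ for the vector of Lagrange multipliers on the constraint functions $f_i$; it is an illustration rather than the proof given in the appendix.

```latex
\min_{\lambda \ge 0} \Big( x^\top A y - \sum_{i=1}^{k} \lambda_i f_i(x) \Big)
  = x^\top A y + \sum_{i=1}^{k} \min_{\lambda_i \ge 0} \big( -\lambda_i f_i(x) \big),
\qquad
\min_{\lambda_i \ge 0} \big( -\lambda_i f_i(x) \big) =
\begin{cases}
  0       & \text{if } f_i(x) \le 0, \\
  -\infty & \text{if } f_i(x) > 0.
\end{cases}
```

An infeasible $x$ is therefore worth $-\infty$ to the maximizing player, so whenever a feasible strategy exists the outer maximization in (7) is effectively restricted to the feasible set, where the penalty term vanishes and the objective reduces to that of (6).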
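CFR can be seen as doing a saddle point optimization on the objective in (3), using the gradients of $u(x, y) = x^\top A y$ (for a more complete discussion of the connection between CFR and gradient ascent, see [Waugh and Bagnell2015]) given as
$$\nabla_x u(x^t, y^t) = A y^t, \qquad \nabla_y \left( -u(x^t, y^t) \right) = -(x^t)^\top A.$$
The intuition behind our modified algorithm is to perform the same updates, but with gradients of the modified utility function
$$h(x, y, \lambda) = x^\top A y - \sum_{i=1}^{k} \lambda_i f_i(x).$$
The (sub)gradients we use in the modified CFR update are then
$$\nabla_x h(x^t, y^t, \lambda^t) = A y^t - \sum_{i=1}^{k} \lambda_i^t \nabla f_i(x^t), \qquad \nabla_y \left( -h(x^t, y^t, \lambda^t) \right) = -(x^t)^\top A.$$
Note that this leaves the update of the unconstrained player unchanged. In addition, we must update $\lambda^t$ using the gradients $\nabla_\lambda \left( -h(x^t, y^t, \lambda^t) \right) = f(x^t)$, which is the $k$-vector with $f_i(x^t)$ at index $i$. This can be done with any gradient method, e.g. simple gradient ascent with the update rule
$$\lambda_i^{t+1} = \max\left( \lambda_i^t + \alpha^t f_i(x^t),\ 0 \right)$$
for some step size $\alpha^t$.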
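A minimal sketch of this multiplier update is given below. Each multiplier moves in the direction of its constraint value and is then projected back onto the non-negative numbers. The function name and the optional upper bound `beta` are assumptions added for illustration; leaving `beta` at its default gives the plain projection onto $\lambda \ge 0$.

```python
def update_multipliers(multipliers, constraint_values, step_size, beta=float("inf")):
    """One step of projected gradient ascent on the Lagrange multipliers.

    multipliers[i] grows while constraint i is violated (constraint_values[i] > 0)
    and shrinks towards zero while it is satisfied with slack.
    """
    return [
        min(max(lam + step_size * f, 0.0), beta)
        for lam, f in zip(multipliers, constraint_values)
    ]


# The first constraint is violated by 0.5, the second is satisfied with slack 3.0.
updated = update_multipliers([0.0, 1.0], [0.5, -3.0], step_size=0.5)
```

The projection is what keeps a satisfied constraint from contributing a negative penalty, since its multiplier is held at zero rather than allowed to go below it.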
3.2 Constrained counterfactual regret minimization
We give the Constrained CFR (CCFR) procedure in Algorithm 1. The constrained player’s strategy is updated with the function CCFR and the unconstrained player’s strategy is updated with unmodified CFR. In this instantiation $\lambda^t$ is updated with gradient ascent, though any regret minimizing update can be used. We clamp each $\lambda_i^t$ to the interval $[0, \beta]$ for reasons discussed in the following section. Together, these updates form a full iteration of CCFR.
The CCFR update for the constrained player is the same as the CFR update, with the crucial difference of line 6, which incorporates the second part of the gradient into the counterfactual value $v^t(I, a)$. The loop beginning on line 3 goes through the constrained player’s information sets, walking the tree bottom-up from the leaves. The counterfactual value is set on line 5 using the values of terminal states which directly follow from action $a$ at $I$ (this corresponds to the $A y^t$ term of the gradient), as well as the already computed values of successor information sets $I'$. Line 8 computes the value of the current information set using the current strategy. Lines 10 and 11 update the stored regrets for each action. Line 14 updates the current strategy with regret matching.
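To illustrate how the pieces fit together, the sketch below runs the same three updates on a single-decision (matrix) game, where each player has one information set and the sequence-form strategy is simply the mixed strategy. This is a simplified illustration under stated assumptions, not a transcription of Algorithm 1: there is no tree walk, the single linear constraint `constraint . x <= bound` is hypothetical, and the step size $1/\sqrt{t}$ and the value of `beta` are arbitrary choices.

```python
import math


def regret_matching(regrets):
    """Play in proportion to positive regret, or uniformly if there is none."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    n = len(regrets)
    return [p / total for p in positive] if total > 0.0 else [1.0 / n] * n


def ccfr_matrix_game(payoff, constraint, bound, beta, iterations):
    """Player 1 maximises x^T A y subject to constraint . x <= bound."""
    n, m = len(payoff), len(payoff[0])
    regrets_x, regrets_y = [0.0] * n, [0.0] * m
    sum_x, sum_y = [0.0] * n, [0.0] * m
    lam = 0.0
    for t in range(1, iterations + 1):
        x, y = regret_matching(regrets_x), regret_matching(regrets_y)
        # Constrained player: A y minus the multiplier-weighted constraint gradient.
        values_x = [sum(payoff[i][j] * y[j] for j in range(m)) - lam * constraint[i]
                    for i in range(n)]
        # Unconstrained player: the negated payoff, exactly as in plain CFR.
        values_y = [-sum(x[i] * payoff[i][j] for i in range(n)) for j in range(m)]
        ev_x = sum(x[i] * values_x[i] for i in range(n))
        ev_y = sum(y[j] * values_y[j] for j in range(m))
        regrets_x = [regrets_x[i] + values_x[i] - ev_x for i in range(n)]
        regrets_y = [regrets_y[j] + values_y[j] - ev_y for j in range(m)]
        # Projected gradient ascent on the multiplier, clamped to [0, beta].
        violation = sum(constraint[i] * x[i] for i in range(n)) - bound
        lam = min(max(lam + violation / math.sqrt(t), 0.0), beta)
        sum_x = [sum_x[i] + x[i] for i in range(n)]
        sum_y = [sum_y[j] + y[j] for j in range(m)]
    return [s / iterations for s in sum_x], [s / iterations for s in sum_y], lam


# Matching pennies where player 1 may play the first action at most 25% of the time.
MATCHING_PENNIES = [[1.0, -1.0], [-1.0, 1.0]]
x_avg, y_avg, final_lam = ccfr_matrix_game(
    MATCHING_PENNIES, constraint=[1.0, 0.0], bound=0.25, beta=5.0, iterations=50000)
```

In this example the unconstrained equilibrium plays the first action half of the time, so the constraint is active and the average strategy of the constrained player should settle near the bound of 0.25 on that action.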
3.3 Theoretical analysis
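In order to ensure that the utilities passed to the regret matching update are bounded, we will require $\lambda^t$ to be bounded from above; in particular, we will choose $\lambda^t \in [0, \beta]^k$. We can then evaluate the chosen sequence $\lambda^1, \ldots, \lambda^T$ using its regret in comparison to the optimal $\lambda^* \in [0, \beta]^k$:
$$R^T_\lambda(\beta) := \max_{\lambda^* \in [0, \beta]^k} \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{k} \left( \lambda_i^* - \lambda_i^t \right) f_i(x^t),$$
where $x^t$ is the constrained player's sequence-form strategy on iteration $t$.
We can guarantee $R^T_\lambda(\beta) = O(1/\sqrt{T})$, e.g. by choosing $\lambda^t$ with projected gradient ascent [Zinkevich2003].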
We now present the theorems which show that CCFR can be used to approximately solve (6). In the following theorems we assume that $T \in \mathbb{N}$, we have some convex, continuous constraint functions $f_1, \ldots, f_k$, and we use some regret-minimizing method to select the vectors $\lambda^1, \ldots, \lambda^T$ each in $[0, \beta]^k$.
First, we show that the exploitability of the average strategies approaches the optimal value:
If CCFR is used to select the sequence of strategies and CFR is used to select the sequence of strategies , then the following holds:
where is the range of possible utilities, is the maximum number of actions at any information set, is the number of constraints, is a bound on the subgradients (such a bound must exist as the strategy sets are compact and the constraint functions are continuous), and is a game-specific constant.
All proofs are given in the appendix. Theorem 1 guarantees that the constrained exploitability of the final CCFR strategy profile converges to the minimum exploitability possible over the set of feasible profiles, at a rate of $O(1/\sqrt{T})$ (assuming a suitable regret minimizer is used to select $\lambda^t$).
In order to establish that CCFR approximately solves optimization (6), we must also show that the CCFR strategies converge to being feasible. First we establish a bound for arbitrary :
If CCFR is used to select the sequence of strategies and CFR is used to select the sequence of strategies , then the following holds:
This theorem guarantees that the CCFR strategy converges to the feasible set at a rate of $O(1/\sqrt{T})$, up to an approximation error of $O(1/\beta)$ induced by the bounding of $\lambda^t$.
We can eliminate the approximation error when $\beta$ is chosen large enough for some optimal $\lambda^*$ to lie within the bounded set $[0, \beta]^k$. In order to establish the existence of such a $\lambda^*$, we must assume a constraint qualification such as Slater’s condition, which requires the existence of a feasible $x$ which strictly satisfies any nonlinear constraints ($f_i(x) \le 0$ for all $i$ and $f_j(x) < 0$ for all nonlinear $f_j$). Then there exists a finite $\lambda^*$ which is a solution to optimization (7), which we can use to give the bound:
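Assume that $f_1, \ldots, f_k$ satisfy a constraint qualification such as Slater’s condition, and define $\lambda^*$ to be a finite solution for $\lambda$ in the resulting optimization (7). Then if $\beta$ is chosen such that $\beta > \lambda_i^*$ for all $i$, and CCFR and CFR are used to respectively select the strategy sequences $\sigma_1^1, \ldots, \sigma_1^T$ and $\sigma_2^1, \ldots, \sigma_2^T$, the following holds: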
In this case, the CCFR strategy converges fully to the feasible set, at a rate of $O(1/\sqrt{T})$, given a suitable choice of regret minimizer for $\lambda^t$. We provide an explicit example of such a minimizer in the following corollary:
If the conditions of Theorem 3 hold and, in addition, the sequence is chosen using projected gradient descent with learning rate where , then the following hold:
This follows from using the projected gradient descent regret bound [Zinkevich2003] to give
Finally, we note that together Theorem 2 and Theorem 3 suggest a strategy for solving to optimal precision when $\lambda^*$ is not known. Some $\beta$ can be chosen and CCFR run for a number of iterations. If the average multiplier $\frac{1}{T} \sum_{t} \lambda_i^t$ of some constraint is close to $\beta$, this implies that $\beta \le \lambda_i^*$, so $\beta$ can be doubled and CCFR run again. If this process is continued until no average lies near the boundary, it is guaranteed that $\beta > \lambda_i^*$ for all $i$ and optimally convergent behavior is possible.
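The doubling procedure just described can be sketched as a small driver loop. Everything here is hypothetical scaffolding: `run_ccfr` stands for any routine that runs CCFR with a given bound and returns a strategy together with the average multipliers, and the `closeness` threshold used to decide that an average is "near the boundary" is an assumed heuristic.

```python
def solve_with_doubling(run_ccfr, beta=1.0, closeness=0.9, max_rounds=20):
    """Rerun CCFR with a doubled multiplier bound until no average multiplier is near it.

    `run_ccfr(beta)` must return (strategy, average_multipliers).
    """
    for _ in range(max_rounds):
        strategy, average_multipliers = run_ccfr(beta)
        if all(m < closeness * beta for m in average_multipliers):
            return strategy, beta
        beta *= 2.0
    raise RuntimeError("multiplier bound is still binding after max_rounds doublings")


def fake_run_ccfr(beta):
    """Stand-in solver whose optimal multiplier is 2.0 and saturates at smaller bounds."""
    return "strategy", [min(2.0, beta)]


result = solve_with_doubling(fake_run_ccfr)
```

With the stand-in solver the loop tries bounds 1, 2, and 4, stopping at 4 because the average multiplier of 2 is then well inside the interval.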
4 Related Work
To the best of our knowledge, no previous work has proposed a technique for solving either of the optimizations (6) or (7) for general constraints in extensive-form games. Optimization (7) belongs to a general class of saddle point optimizations for which a number of accelerated methods with $O(1/T)$ convergence have been proposed [Nemirovski2004, Nesterov2005b, Nesterov2005a, Juditsky, Nemirovski, and Tauvel2011, Chambolle and Pock2011]. These methods have been applied to unconstrained equilibrium computation in extensive-form games using a family of prox functions initially proposed by Hoda et al. [Hoda et al.2010, Kroer et al.2015, Kroer et al.2017]. Like CFR, these algorithms could be extended to solve the optimization (7).
Despite a worse theoretical dependence on $T$, CFR is preferred to accelerated methods as our base algorithm for a number of practical reasons.
- CFR can be easily modified with a number of different sampling schemes, adapting to sparsity and achieving greatly improved convergence over the deterministic version [Lanctot et al.2009]. Although the stochastic mirror prox algorithm has been used to combine an accelerated update with sampling in extensive-form games, each of its iterations still requires walking each player’s full strategy space to compute the prox functions, and it has poor performance in practice [Kroer et al.2015].
- CFR has good empirical performance in imperfect recall games [Waugh et al.2009b] and even provably converges to an equilibrium in certain subclasses of well-formed games [Lanctot et al.2012, Lisý, Davis, and Bowling2016], which we will make use of in Section 5.1. The prox function used by the accelerated methods is ill-defined in all imperfect recall games.
- CFR theoretically scales better with game size than do the accelerated techniques. The constant in the bounds of Theorems 1-3 is at worst , and for many games of interest is closer to [Burch2017, Section 3.2]. The best convergence bound for an accelerated method depends in the worst case on where is the depth of the game tree, and is at best [Kroer et al.2017].
- The CFR update can be modified to CFR+ to give a guaranteed bound on tracking regret and greatly improve empirical performance [Tammelin et al.2015]. CFR+ empirically converges at a rate faster than up to reasonable precision in a variety of games [Burch2017, Sections 4.3-4.4].
- Finally, CFR is not inherently limited to worst-case convergence. Regret minimization algorithms can be optimistically modified to give convergence in self-play [Rakhlin and Sridharan2013]. Such a modification has been applied to CFR [Burch2017, Section 4.4].
We describe CCFR as an extension of deterministic CFR for ease of exposition. All of the CFR modifications described in this section can be applied to CCFR out-of-the-box.
5 Experimental evaluation
We present two domains for experimental evaluation in this paper. In the first, we use constraints to model a secondary objective when generating strategies in a model security game. In the second domain, we use constraints for opponent modeling in a small poker game. We demonstrate that using constraints for modeling data allows us to learn counter-strategies that approach optimal counter-strategies as the amount of data increases. Unlike previous opponent modeling techniques for poker, we do not require our data to contain full observations of the opponent’s private cards for this guarantee to hold.
5.1 Transit game
The transit game is a model security game introduced in [Bosansky et al.2015]. With size parameter , the game is played on an 8-connected grid of size (see Figure 1) over time steps. One player, the evader, wishes to cross the grid from left to right while avoiding the other player, the patroller. Actions are movements along the edges of the grid, but each move has a probability of failing. The evader receives utils for each time he encounters the patroller, utils when he escapes on reaching the east end of the grid, and utils for each time step that passes without escaping. The patroller receives the negative of the evader’s utils, making the game zero-sum. The players observe only their own actions and locations.
The patroller has a secondary objective of minimizing the risk that it fails to return to its base ( in Figure 1) by the end of the game. In the original formulation, this was modeled using a large utility penalty when the patroller doesn’t end the game at its base. For the reasons discussed in the introduction, it is more natural to model this objective as a linear constraint on the patroller’s strategy, bounding the maximum probability that it doesn’t return to base.
For our experiments, we implemented CCFR on top of the NFGSS-CFR algorithm described in [Lisý, Davis, and Bowling2016]. In the NFGSS framework, each information set is defined by only the current grid state and the time step; history is not remembered. This is a case of imperfect recall, but our theory still holds as the game is well-formed. The constraint on the patroller is defined as
where are state action pairs at time step , is the probability that is the next state given that action is taken from , and is the chosen risk bound. This is a well-defined linear constraint despite imperfect recall, as is a linear combination over the sequences that reach . We update the CCFR constraint weights using stochastic gradient ascent with constant step size , which we found to work well across a variety of game sizes and risk bounds. In practice, we found that bounding was unnecessary for convergence.
Previous work has shown that double oracle (DO) techniques outperform solving the full game linear program (LP) in the unconstrained transit game [Bosansky et al.2015, Lisý, Davis, and Bowling2016]. However, DO techniques are not appropriate for the constrained setting, as an efficient best-response oracle doesn’t exist; finding a constrained best response requires solving a linear program in the full game. Thus, we omit comparison to DO methods in this work.
We first empirically demonstrate that CCFR converges to optimality by comparing its produced strategies with strategies produced by solving the LP representation of the game with the simplex solver in IBM ILOG CPLEX 12.7.1. Figure 1(a) shows the risk and exploitability for strategies produced by running CCFR for 100,000 iterations on a game of size , with a variety of values for the risk bound . In each case, the computed strategy had risk within of the specified bound , and exploitability within of the corresponding LP strategy (not shown because the points are indistinguishable). The convergence over time for one particular case, , is shown in Figure 1(b), where the plotted value is the difference in exploitability between the average CCFR strategy and the LP strategy, shown with a log-linear scale. The vertical line shows the time used to compute the LP strategy.
Convergence times for the CPLEX LP and CCFR with risk bound are shown on a log scale for a variety of game sizes in Figure 1(c). The time for CCFR is presented for a variety of precisions , which bounds both the optimality of the final exploitability and the violation of the risk bound. The points for game size are also shown in Figure 1(b). The LP time is calculated with default precision . Changing the precision to a higher value actually results in a slower computation, due to the precision also controlling the size of allowed infeasibility in the Harris ratio test [Klotz and Newman2013].
Even at , a game which has relatively small strategy sizes of 6,000 values, we can see that a significant speedup can be gained by using CCFR for a small tradeoff in precision. At and larger, the LP is clearly slower than CCFR even for the relatively modest precision of . For game size , with strategy sizes of 25,000 values, the LP is more than an order of magnitude slower than high precision CCFR.
5.2 Opponent modeling in poker
In multi-agent settings, strategy constraints can serve an additional purpose beyond encoding secondary objectives. Often, when creating a strategy for one agent, we have partial information on how the other agent(s) behave. A way to make use of this information is to solve the game with constraints on the other agents’ strategies, enforcing that their strategy in the solution is consistent with their observed behavior. As a motivating example, we consider poker games in which we always observe our opponent’s actions, but not necessarily the private card(s) that they hold when making the action.
In poker games, if either player takes the fold action, the other player automatically wins the game. Because the players’ private cards are irrelevant to the game outcome in this case, they are typically not revealed. We thus consider the problem of opponent modeling from observing past games, in which the opponent’s hand of private card(s) is only revealed when a showdown is reached and the player with the better hand wins. Most previous work in opponent modeling has either assumed full observation of private cards after a fold [Johanson, Zinkevich, and Bowling2008, Johanson and Bowling2009] or has ignored observations of opponent actions entirely, instead only using observed utilities [Bard et al.2013]. The only previous work which uses these partial observations has no theoretical guarantees on solution quality [Ganzfried and Sandholm2011].
We begin by collecting data by playing against the opponent with a probe strategy, which is a uniformly random distribution over the non-fold actions. To model the opponent in an unbiased way, we generate two types of sequence-form constraints from this data. First, for each possible sequence of public actions and for each of our own private hands, we build an unbiased confidence interval on the probability that we are dealt the hand and the public sequence occurs. This probability is a weighted sum of the opponent’s sequence probabilities over their possible private cards, and thus the confidence bounds become linear sequence-form constraints. Second, for each terminal history that is a showdown, we build a confidence interval on the probability that the showdown is reached. In combination, these two sets of constraints guarantee that the CCFR strategy converges to a best response to the opponent strategy as the number of observed games increases. A proof of convergence to a best response and full details of the constraints are provided in the appendix.
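As a rough illustration of how an observed frequency could become a pair of linear sequence-form constraints, the sketch below builds a confidence interval from counts and encodes it as two inequalities of the form `coefficients . y + offset <= 0`. The use of a Hoeffding bound is an assumption made here for simplicity; the interval construction actually used is described in the appendix, and all function names are hypothetical.

```python
import math


def hoeffding_interval(successes, trials, confidence):
    """Two-sided Hoeffding confidence interval for a Bernoulli probability (assumed choice)."""
    p_hat = successes / trials
    radius = math.sqrt(math.log(2.0 / (1.0 - confidence)) / (2.0 * trials))
    return max(p_hat - radius, 0.0), min(p_hat + radius, 1.0)


def interval_constraints(weights, lower, upper):
    """Encode lower <= weights . y <= upper as two constraints f(y) <= 0.

    Each constraint is a (coefficients, offset) pair with f(y) = coefficients . y + offset.
    """
    return [
        (list(weights), -upper),             # weights . y - upper <= 0
        ([-w for w in weights], lower),      # lower - weights . y <= 0
    ]


def evaluate(constraint, y):
    coefficients, offset = constraint
    return sum(c * v for c, v in zip(coefficients, y)) + offset


# 50 of 100 observed games followed a given public sequence with a given hand.
low, high = hoeffding_interval(50, 100, confidence=0.95)
constraints = interval_constraints([0.5, 0.5], low, high)
```

Here `weights` plays the role of the known weighting over the opponent's possible private cards, and `y` holds the corresponding opponent sequence probabilities, so each constraint stays linear in the opponent's sequence-form strategy.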
Because we construct each constraint separately, there is no guarantee that the full constraint set is simultaneously feasible. In fact, in our experiments it typically was the case that the constraints were mildly infeasible. However, this is not a problem for CCFR, which doesn’t require feasible constraints to have well-defined updates. In fact, because we bound the Lagrange multipliers, CCFR still theoretically converges to a sensible solution, especially when the total infeasibility is small. For more details on how CCFR handles infeasibility, see the appendix.
We ran our experiments in Leduc Hold’em [Southey et al.2005], a small poker game played with a six card deck over two betting rounds. To generate a target strategy profile to model, we solved the "JQ.K/pair.nopair" abstracted version of the game [Waugh et al.2009a]. We then played a probe strategy profile against the target profile to generate constraints as described above, and ran CCFR twice to find each half of a counter-profile that is optimal against the set of constrained profiles. We used gradient ascent with step size to update the $\lambda$ values, and ran CCFR for iterations, which we found to be sufficient for approximate convergence with .
We evaluate how well the trained counter-profile performs when played against the target profile, and in particular investigate how this performance depends on the number of games we observe to produce the counter-profile, and on the confidence used for the confidence interval constraints. Results are shown in Figure 3, with a log-linear scale. With a high confidence (looser constraints), we obtain an expected value that is better than the equilibrium expected value with fewer than 100 observed games on average, and with fewer than 200 observed games consistently. Lower confidence levels (tighter constraints) resulted in more variable performance and poor average value with small numbers of observed games, but also faster learning as the number of observed games increased. For all confidence levels, the expected value converges to the best response value as the number of observed games increases.
Strategy constraints are a powerful modeling tool in extensive-form games. Prior to this work, solving games with strategy constraints required solving a linear program, which scaled poorly to many of the very large games of practical interest. We introduced CCFR, the first efficient large-scale algorithm for solving extensive-form games with general strategy constraints. We demonstrated that CCFR is effective at solving sequential security games with bounds on acceptable risk. We also introduced a method of generating strategy constraints from partial observations of poker games, resulting in the first opponent modeling technique that has theoretical guarantees with partial observations. We demonstrated the effectiveness of this technique for opponent modeling in Leduc Hold’em.
- [Altman1999] Altman, E. 1999. Constrained Markov Decision Processes. Chapman and Hall/CRC.
- [Bard et al.2013] Bard, N.; Johanson, M.; Burch, N.; and Bowling, M. 2013. Online implicit agent modeling. In Proceedings of the Twelfth International Conference on Autonomous Agents and Multiagent Systems.
- [Bosansky et al.2015] Bosansky, B.; Jiang, A. X.; Tambe, M.; and Kiekintveld, C. 2015. Combining compact representation and incremental generation in large games with sequential strategies. In Proceedings of Twenty-Ninth AAAI Conference on Artificial Intelligence.
- [Bowling et al.2015] Bowling, M.; Burch, N.; Johanson, M.; and Tammelin, O. 2015. Heads-up limit hold’em poker is solved. Science 347(6218):145–149.
- [Brown and Sandholm2018] Brown, N., and Sandholm, T. 2018. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science 359(6374):418–424.
- [Brown et al.2014] Brown, M.; An, B.; Kiekintveld, C.; Ordóñez, F.; and Tambe, M. 2014. An extended study on multi-objective security games. Autonomous Agents and Multi-Agent Systems 28(1):31–71.
- [Burch2017] Burch, N. 2017. Time and Space: Why Imperfect Information Games are Hard. Ph.D. Dissertation, University of Alberta.
- [Chambolle and Pock2011] Chambolle, A., and Pock, T. 2011. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision 40(1):120–145.
- [Farina, Kroer, and Sandholm2017] Farina, G.; Kroer, C.; and Sandholm, T. 2017. Regret minimization in behaviorally-constrained zero-sum games. In Proceedings of the 34th International Conference on Machine Learning.
- [Ganzfried and Sandholm2011] Ganzfried, S., and Sandholm, T. 2011. Game theory-based opponent modeling in large imperfect-information games. In Proceedings of the Tenth International Conference on Autonomous Agents and Multiagent Systems.
- [Hart and Mas-Colell2000] Hart, S., and Mas-Colell, A. 2000. A simple adaptive procedure leading to correlated equilibrium. Econometrica 68(5):1127–1150.
- [Hoda et al.2010] Hoda, S.; Gilpin, A.; Peña, J.; and Sandholm, T. 2010. Smoothing techniques for computing Nash equilibria of sequential games. Mathematics of Operations Research 35(2):494–512.
- [Isom, Meyn, and Braatz2008] Isom, J. D.; Meyn, S. P.; and Braatz, R. D. 2008. Piecewise linear dynamic programming for constrained POMDPs. In Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence.
- [Johanson and Bowling2009] Johanson, M., and Bowling, M. 2009. Data biased robust counter strategies. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics.
- [Johanson et al.2011] Johanson, M.; Bowling, M.; Waugh, K.; and Zinkevich, M. 2011. Accelerating best response calculation in large extensive games. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence.
- [Johanson, Zinkevich, and Bowling2008] Johanson, M.; Zinkevich, M.; and Bowling, M. 2008. Computing robust counter-strategies. In Advances in Neural Information Processing Systems 20.
- [Johanson2013] Johanson, M. 2013. Measuring the size of large no-limit poker games. Technical Report TR13-01, University of Alberta.
- [Juditsky, Nemirovski, and Tauvel2011] Juditsky, A.; Nemirovski, A.; and Tauvel, C. 2011. Solving variational inequalities with stochastic mirror-prox algorithm. Stochastic Systems 1(1):17–58.
- [Klotz and Newman2013] Klotz, E., and Newman, A. M. 2013. Practical guidelines for solving difficult linear programs. Surveys in Operations Research and Management Science 18(1):1–17.
- [Koller, Megiddo, and von Stengel1996] Koller, D.; Megiddo, N.; and von Stengel, B. 1996. Efficient computation of equilibria for extensive two-person games. Games and Economic Behavior 14(2):247–259.
- [Kroer et al.2015] Kroer, C.; Waugh, K.; Kilinç-Karzan, F.; and Sandholm, T. 2015. Faster first-order methods for extensive-form game solving. In Proceedings of the Sixteenth ACM Conference on Economics and Computation.
- [Kroer et al.2017] Kroer, C.; Waugh, K.; Kilinç-Karzan, F.; and Sandholm, T. 2017. Theoretical and practical advances on smoothing for extensive-form games. In Proceedings of the Eighteenth ACM Conference on Economics and Computation.
- [Lanctot et al.2009] Lanctot, M.; Waugh, K.; Zinkevich, M.; and Bowling, M. 2009. Monte Carlo sampling for regret minimization in extensive games. In Advances in Neural Information Processing Systems 22.
- [Lanctot et al.2012] Lanctot, M.; Gibson, R.; Burch, N.; and Bowling, M. 2012. No-regret learning in extensive-form games with imperfect recall. In Proceedings of the Twenty-Ninth International Conference on Machine Learning.
- [Lisý, Davis, and Bowling2016] Lisý, V.; Davis, T.; and Bowling, M. 2016. Counterfactual regret minimization in security games. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence.
- [Moravčík et al.2017] Moravčík, M.; Schmid, M.; Burch, N.; Lisý, V.; Morrill, D.; Bard, N.; Davis, T.; Waugh, K.; Johanson, M.; and Bowling, M. H. 2017. DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science 356(6337):508–513.
- [Nemirovski2004] Nemirovski, A. 2004. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization 15(1):229–251.
- [Nesterov2005a] Nesterov, Y. 2005a. Excessive gap technique in nonsmooth convex minimization. SIAM Journal on Optimization 16(1):235–249.
- [Nesterov2005b] Nesterov, Y. 2005b. Smooth minimization of non-smooth functions. Mathematical Programming 103(1):127–152.
- [Osborne and Rubinstein1994] Osborne, M. J., and Rubinstein, A. 1994. A Course in Game Theory. The MIT Press.
- [Poupart et al.2015] Poupart, P.; Malhotra, A.; Pei, P.; Kim, K.-E.; Goh, B.; and Bowling, M. 2015. Approximate linear programming for constrained partially observable Markov decision processes. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence.
- [Rakhlin and Sridharan2013] Rakhlin, A., and Sridharan, K. 2013. Online learning with predictable sequences. In Proceedings of the 26th Annual Conference on Learning Theory.
- [Santana, Thiébaux, and Williams2016] Santana, P.; Thiébaux, S.; and Williams, B. 2016. RAO*: An algorithm for chance-constrained POMDPs. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence.
- [Southey et al.2005] Southey, F.; Bowling, M.; Larson, B.; Piccione, C.; Burch, N.; Billings, D.; and Rayner, C. 2005. Bayes’ bluff: Opponent modelling in poker. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence.
- [Tammelin et al.2015] Tammelin, O.; Burch, N.; Johanson, M.; and Bowling, M. 2015. Solving heads-up limit Texas Hold’em. In Proceedings of the 24th International Joint Conference on Artificial Intelligence.
- [von Stengel1996] von Stengel, B. 1996. Efficient computation of behavior strategies. Games and Economic Behavior 14:220–246.
- [Waugh and Bagnell2015] Waugh, K., and Bagnell, J. A. 2015. A unified view of large-scale zero-sum equilibrium computation. In AAAI Workshop on Computer Poker and Imperfect Information.
- [Waugh et al.2009a] Waugh, K.; Schnizlein, D.; Bowling, M.; and Szafron, D. 2009a. Abstraction pathologies in extensive games. In Proceedings of the 8th International Conference on Autonomous Agents and Multiagent Systems.
- [Waugh et al.2009b] Waugh, K.; Zinkevich, M.; Johanson, M.; Kan, M.; Schnizlein, D.; and Bowling, M. 2009b. A practical use of imperfect recall. In Proceedings of the Eighth Symposium on Abstraction, Reformulation and Approximation.
- [Wilson1927] Wilson, E. B. 1927. Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association 22(158):209–212.
- [Zinkevich et al.2008] Zinkevich, M.; Johanson, M.; Bowling, M.; and Piccione, C. 2008. Regret minimization in games with incomplete information. In Advances in Neural Information Processing Systems 20.
- [Zinkevich2003] Zinkevich, M. 2003. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on Machine Learning.
Appendix A Notation
Define for brevity. Let be the modified utility function:
For any history , define
to be the set of terminal histories reachable from . In addition, define
to be the subset of those histories which include no player 1 actions after the action at . Define
to be the set of histories where player 1 might act next after taking action at . Also, for any define
to be the set of all histories where player 1 might act after . Note that if , then .
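The three history sets described above can be illustrated on a toy game tree. The following is only a minimal sketch: the tree `TREE`, the tuple encoding of histories, and the function names `terminals_from`, `terminals_without_p1`, and `next_p1_histories` are hypothetical and are not the paper's notation. It assumes a history is a tuple of actions and that a history absent from `TREE` is terminal.

```python
from typing import Dict, List, Set, Tuple

History = Tuple[str, ...]

# Hypothetical toy game: each non-terminal history maps to (acting player, actions).
# Player 1 acts at the root and again after ("l", "x"); player 2 acts after "l" and "r".
TREE: Dict[History, Tuple[int, List[str]]] = {
    (): (1, ["l", "r"]),
    ("l",): (2, ["x", "y"]),
    ("r",): (2, ["x", "y"]),
    ("l", "x"): (1, ["c", "d"]),
}


def children(h: History) -> List[History]:
    """Immediate successors of h (empty if h is terminal)."""
    if h not in TREE:
        return []
    return [h + (a,) for a in TREE[h][1]]


def terminals_from(h: History) -> Set[History]:
    """Terminal histories reachable from h."""
    if h not in TREE:
        return {h}
    out: Set[History] = set()
    for c in children(h):
        out |= terminals_from(c)
    return out


def terminals_without_p1(h: History, a: str) -> Set[History]:
    """Terminal histories reachable after a at h with no later player 1 action."""
    def walk(g: History) -> Set[History]:
        if g not in TREE:
            return {g}
        if TREE[g][0] == 1:  # player 1 would act again: exclude this subtree
            return set()
        out: Set[History] = set()
        for c in children(g):
            out |= walk(c)
        return out
    return walk(h + (a,))


def next_p1_histories(h: History, a: str) -> Set[History]:
    """Histories where player 1 might act next after taking a at h."""
    def walk(g: History) -> Set[History]:
        if g not in TREE:
            return set()
        if TREE[g][0] == 1:  # first player 1 decision point on this path
            return {g}
        out: Set[History] = set()
        for c in children(g):
            out |= walk(c)
        return out
    return walk(h + (a,))
```

In this sketch, the terminal histories that follow an action split into those reached with no further player 1 decision and those reached through one of the next player 1 histories; for action `l` at the root these are `{("l", "y")}` and the terminals below `("l", "x")`, respectively.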
We extend all of the preceding definitions to information sets in the natural way:
Consider two information sets such that . By perfect recall, the series of actions that player 1 takes from to must be unique, no matter which is the starting state. Thus we may define for any and where . Similarly, for , we can define where is the unique such that (or if no such exists).
For , we will use as short for , where is the strategy profile that always takes action at , but everywhere else is identical to .
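As a small illustration of the modified profile just described, the sketch below builds a copy of a strategy that plays one action with probability 1 at a chosen information set and is otherwise unchanged. The dictionary representation (information set label to action probabilities) and the name `always_play` are assumptions made for this example only.

```python
from typing import Dict

# Hypothetical representation: information set label -> {action: probability}.
Strategy = Dict[str, Dict[str, float]]


def always_play(sigma: Strategy, infoset: str, action: str) -> Strategy:
    """Return a copy of sigma that takes `action` with probability 1 at `infoset`."""
    if action not in sigma[infoset]:
        raise KeyError(f"{action!r} is not available at {infoset!r}")
    # Copy every distribution so the original profile is left untouched.
    modified = {label: dict(probs) for label, probs in sigma.items()}
    modified[infoset] = {a: (1.0 if a == action else 0.0) for a in sigma[infoset]}
    return modified
```

For example, applying `always_play(sigma, "I1", "r")` to a profile that mixes uniformly at `"I1"` yields a profile that is deterministic at `"I1"` and identical to `sigma` at every other information set.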
Appendix B Proof of Theorems
For any information set , action , and , let be the constraint tilt added to , i.e.
Given a , define the tilted counterfactual value for strategy profile , information set , and action by the recurrences
We proceed by strong induction on . In the base case , we have that is empty for each , that there is no such that , and that . Thus: