Multi-agent strategic interactions are often modeled as extensive-form games (EFGs), a game tree representation that allows for hidden information, stochastic outcomes, and sequential interactions. Research on solving EFGs has been driven by the experimental domain of poker games, in which the Counterfactual Regret Minimization (CFR) algorithm (Zinkevich et al., 2008) has been the basis of several breakthroughs. Approaches incorporating CFR have been used to essentially solve one nontrivial poker game (Bowling et al., 2015), and to beat human professionals in another (Moravčík et al., 2017; Brown and Sandholm, 2018).
CFR is in essence a policy improvement algorithm that iteratively evaluates and improves a strategy for playing an EFG. As part of this process, it must walk the entire game tree on every iteration. However, many games have prohibitively large trees when represented as EFGs. For example, many commonly played poker games have more possible game states than there are atoms in the universe (Johanson, 2013). In such cases, performing even a single iteration of traditional CFR is impossible.
The prohibitive cost of CFR iterations is the motivation for Monte Carlo Counterfactual Regret Minimization (MCCFR), which samples trajectories to walk through the tree to allow for significantly faster iterations (Lanctot et al., 2009). Additionally, while CFR spends equal time updating every game state, the sampling scheme of MCCFR can be altered to target updates to parts of the game that are more critical or more difficult to learn (Gibson et al., 2012b, a). As a trade-off for these benefits, MCCFR requires more iterations to converge due to the variance of sampled values.
In the Reinforcement Learning (RL) community, the topic of variance reduction in sampling algorithms has been extensively studied. In particular, baseline functions that estimate state values are typically used within policy gradient methods to decrease the variance of value estimates along sampled trajectories (Williams, 1992; Greensmith et al., 2004; Bhatnagar et al., 2009; Schulman et al., 2016). Recent work by Schmid et al. (2019) has adapted these ideas to variance reduction in MCCFR, resulting in the VR-MCCFR algorithm.
In this work, we generalize and extend the ideas of Schmid et al. We introduce a framework for variance reduction of sampled values in EFGs by use of state-action baseline functions. We show that the VR-MCCFR baseline is a specific application of our baseline framework that unnecessarily generalizes across dissimilar states. We introduce alternative baseline functions that take advantage of our access to the full hidden state during training, avoiding this generalization. Empirically, our new baselines result in significantly reduced variance and faster convergence than VR-MCCFR.
Schmid et al. also discuss the idea of an oracle baseline that provably minimizes variance, but is impractical to compute. We introduce a predictive baseline that estimates this oracle value and can be efficiently computed. We show that under certain sampling schemes, the predictive baseline exactly tracks the true oracle value, thus provably computing zero-variance sampled values. For the first time, this allows for exact CFR updates to be performed along sampled trajectories.
An extensive-form game (EFG) (Osborne and Rubinstein, 1994) is a game tree, formally defined by a tuple $\langle N, H, P, \sigma_c, u, \mathcal{I} \rangle$. $N$ is a finite set of players. $H$ is a set of histories, where each history is a sequence of actions and corresponds to a vertex of the tree. For $h, h' \in H$, we write $h \sqsubseteq h'$ if $h$ is a prefix of $h'$. The set of actions available at $h \in H$ that lead to a successor history is denoted $A(h)$. Histories with no successors are terminal histories $Z \subseteq H$. $P : H \setminus Z \to N \cup \{c\}$ maps each history to the player that chooses the next action, where $c$ is the chance player that acts according to the defined distribution $\sigma_c(h) \in \Delta(A(h))$, where $\Delta(A(h))$ is the set of probability distributions over $A(h)$. The utility function $u$ assigns a value $u_i(z)$ to each terminal history $z \in Z$ for each player $i \in N$.
For each player $i \in N$, the collection of (augmented) information sets $\mathcal{I}_i$ is a partition of the histories $H$. (Augmented information sets were introduced by Burch et al. (2014).) Player $i$ does not observe the true history $h$, but only the information set $I_i(h) \in \mathcal{I}_i$ that contains it. Necessarily, this means that $A(h) = A(h')$ if $I_i(h) = I_i(h')$ and $P(h) = P(h') = i$, in which case we denote the common action set by $A(I)$.
Each player $i$ selects actions according to a (behavioral) strategy $\sigma_i$ that maps each information set $I \in \mathcal{I}_i$ where $P(I) = i$ to a distribution over actions, $\sigma_i(I) \in \Delta(A(I))$. The probability of taking a specific action $a$ at a history $h$ is $\sigma_{P(h)}(h, a) = \sigma_{P(h)}(I_{P(h)}(h), a)$. A strategy profile, $\sigma = (\sigma_i)_{i \in N}$, specifies a strategy for each player. The reach probability of a history is $\pi^{\sigma}(h) = \prod_{h'a \sqsubseteq h} \sigma_{P(h')}(h', a)$. This product can be decomposed as $\pi^{\sigma}(h) = \pi_i^{\sigma}(h)\,\pi_{-i}^{\sigma}(h)$, where the first term contains the actions of player $i$, and the second contains the actions of other players and chance. We also write $\pi^{\sigma}(h, h')$ for the probability of reaching $h'$ from $h$, defined to be 0 if $h \not\sqsubseteq h'$. A strategy profile defines an expected utility for each player as $u_i(\sigma) = \sum_{z \in Z} \pi^{\sigma}(z)\, u_i(z)$.
In this work, we consider two-player zero-sum EFGs, in which $N = \{1, 2\}$ and $u_1(z) = -u_2(z)$ for all $z \in Z$. We also assume that the information sets satisfy perfect recall, which requires that players not forget any information that they once observed. Mathematically, this means that two histories in the same information set $I \in \mathcal{I}_i$ must have the same sequence of past information sets and actions for player $i$. All games played by humans exhibit perfect recall, and solving games without perfect recall is NP-hard. We write $I \sqsubseteq z$ if there is any history $h \in I$ such that $h \sqsubseteq z$, and we denote that history (unique by perfect recall) by $z[I]$.
2.1 Solving EFGs
A common solution concept for EFGs is a Nash equilibrium, in which no player has incentive to deviate from their specified strategy. We evaluate strategy profiles by their distance from equilibrium, as measured by exploitability, which is the average expected loss against a worst-case opponent: $\operatorname{expl}(\sigma) = \frac{1}{2}\left( \max_{\sigma_1'} u_1(\sigma_1', \sigma_2) + \max_{\sigma_2'} u_2(\sigma_1, \sigma_2') \right)$.
Counterfactual Regret Minimization (CFR) is an algorithm for learning Nash equilibria in EFGs through iterative self play (Zinkevich et al., 2008). For any $h \in H$, let $Z(h)$ be the set of terminal histories reachable from $h$, and define the history’s expected utility as $u_i^{\sigma}(h) = \sum_{z \in Z(h)} \pi^{\sigma}(h, z)\, u_i(z)$. For each information set $I \in \mathcal{I}_i$ and action $a \in A(I)$, CFR accumulates the counterfactual regret of not choosing that action on previous iterations:
$$R^T(I, a) = \sum_{t=1}^{T} \sum_{h \in I} \pi_{-i}^{\sigma^t}(h) \left( u_i^{\sigma^t}(ha) - u_i^{\sigma^t}(h) \right).$$
The next strategy profile is then selected with regret matching, which sets probabilities proportional to the positive regrets: $\sigma^{T+1}(I, a) = R^T(I, a)^+ / \sum_{a' \in A(I)} R^T(I, a')^+$, where $x^+ = \max(x, 0)$ (the distribution is uniform when no regret is positive). Defining the average strategy $\bar{\sigma}^T$ such that $\bar{\sigma}^T(I, a) \propto \sum_{t=1}^{T} \pi_i^{\sigma^t}(I)\, \sigma^t(I, a)$, CFR guarantees that $\operatorname{expl}(\bar{\sigma}^T) \to 0$ as $T \to \infty$, thus converging to a Nash equilibrium.
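As an illustration, regret matching can be sketched in a few lines of Python. This is a hypothetical helper for exposition, not code from any reference implementation:

```python
def regret_matching(regrets):
    """Map accumulated regrets to a strategy: probabilities proportional
    to positive regrets, uniform when no regret is positive."""
    positives = [max(r, 0.0) for r in regrets]
    total = sum(positives)
    if total > 0.0:
        return [p / total for p in positives]
    return [1.0 / len(regrets)] * len(regrets)
```

For example, regrets of (1.0, 3.0, −2.0) yield action probabilities (0.25, 0.75, 0.0).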
The state-of-the-art CFR+ variant of CFR greedily zeroes all negative regrets on every iteration, replacing $R^T(I, a)$ with an accumulant recursively defined by $Q^T(I, a) = \left( Q^{T-1}(I, a) + r^T(I, a) \right)^+$, where $r^T(I, a)$ is the instantaneous counterfactual regret and $Q^0(I, a) = 0$ (Tammelin et al., 2015). It also alternates updates for each player, and uses linear averaging, which gives greater weight to more recent strategies.
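The CFR+ accumulant is a one-line change from CFR's plain regret sum; a minimal sketch with invented names:

```python
def cfr_plus_update(q_prev, r_t):
    """CFR+ accumulant: Q^t = (Q^{t-1} + r^t)^+ applied per action, so
    negative accumulated regret is zeroed on every iteration."""
    return [max(q + r, 0.0) for q, r in zip(q_prev, r_t)]
```

Zeroing means a poor action can return to positive regret (and nonzero probability) after a single good iteration, rather than having to pay back a large accumulated debt.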
CFR(+) requires a full walk of the game tree on each iteration, which can be a very costly operation on large games. Monte Carlo Counterfactual Regret Minimization (MCCFR) avoids this cost by only updating along sampled trajectories. For simplicity, we focus on the outcome sampling (OS) variant of MCCFR (Lanctot et al., 2009), though all results in this paper can be trivially extended to other MCCFR variants. On each iteration $t$, a sampling strategy $\xi^t$ is used to sample a single terminal history $z^t$. A sampled utility is then calculated recursively for each prefix $h \sqsubseteq z^t$ as
$$\hat{u}_i^t(h) = \begin{cases} u_i(z^t) & \text{if } h = z^t \\ \hat{u}_i^t(h, a) & \text{if } ha \sqsubseteq z^t, \end{cases}$$
where $\mathbb{1}[\cdot]$ is the indicator function and $\xi^t(h, a)$ is the probability that $\xi^t$ samples action $a$ at $h$. For any $a \in A(h)$, the sampled value
$$\hat{u}_i^t(h, a) = \frac{\mathbb{1}[ha \sqsubseteq z^t]}{\xi^t(h, a)}\, \hat{u}_i^t(ha)$$
is an unbiased estimate of the expected utility, whether $a$ is sampled or not. These sampled values are used to calculate a sample of the counterfactual regret:
$$\hat{r}^t(I, a) = \begin{cases} \dfrac{\pi_{-i}^{\sigma^t}(h)}{\xi^t(h)} \left( \hat{u}_i^t(h, a) - \hat{u}_i^t(h) \right) & \text{if } I \sqsubseteq z^t, \text{ with } h = z^t[I] \\ 0 & \text{otherwise,} \end{cases}$$
where $\xi^t(h)$ is the probability that the sampled trajectory reaches $h$. This gives an unbiased sample of the counterfactual regret for all $I$ and $a$, which is then used to perform unbiased CFR updates. As long as the sampling strategies satisfy $\xi^t(h) > 0$ for all $h$ reachable under $\sigma^t$, MCCFR guarantees that $\operatorname{expl}(\bar{\sigma}^T) \to 0$ with high probability as $T \to \infty$, thus converging to a Nash equilibrium. However, the rate of convergence depends on the variance of the sampled values (Gibson et al., 2012b).
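To see why the importance-weighted sampled value is unbiased, consider a single decision point. The toy check below (illustrative names and numbers, not the paper's code) enumerates the sampling distribution and recovers the true action value exactly:

```python
def sampled_value(sampled_action, target_action, xi, u):
    """One-sample estimate of u[target_action]: importance-weighted when
    the target action was the one sampled, zero otherwise."""
    if sampled_action == target_action:
        return u[target_action] / xi[target_action]
    return 0.0

def exact_expectation(target_action, xi, u):
    """Enumerate every sample outcome, weighted by its sampling probability."""
    return sum(xi[a] * sampled_value(a, target_action, xi, u) for a in xi)
```

With `xi = {'a': 0.25, 'b': 0.75}` and `u = {'a': 4.0, 'b': -2.0}`, the estimate of `u['a']` is 16.0 when `'a'` is sampled and 0.0 otherwise; its expectation is exactly 4.0 — unbiased, but with high variance, which is what the baselines below attack.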
3 Baseline framework for EFGs
We now introduce a method for calculating unbiased estimates of utilities in EFGs that has lower variance than the sampled utilities defined above. We do this using baseline functions, which estimate the expected utility of actions in the game. We will describe specific examples of such functions in Section 4; for now, we assume the existence of some function $b$ such that $b(h, a)$ in some way approximates $u_i^{\sigma^t}(h, a)$. We define a baseline-corrected sampled utility as
$$\hat{u}_i^{b,t}(h, a) = b(h, a) + \frac{\mathbb{1}[ha \sqsubseteq z^t]}{\xi^t(h, a)} \left( \hat{u}_i^{b,t}(ha) - b(h, a) \right) \tag{4}$$
$$\hat{u}_i^{b,t}(h) = \begin{cases} u_i(z^t) & \text{if } h = z^t \\ \sum_{a \in A(h)} \sigma^t(h, a)\, \hat{u}_i^{b,t}(h, a) & \text{otherwise.} \end{cases}$$
Equation (4) comes from the application of a control variate, in which we lower the variance of a random variable ($X$) by subtracting a correlated random variable ($Y$) and adding back its known expectation ($\mathbb{E}[Y]$), thus keeping the resulting estimate $X - Y + \mathbb{E}[Y]$ unbiased. If $X$ and $Y$ are correlated, then this estimate will have lower variance than $X$ itself. Because $\hat{u}_i^{b,t}$ is defined recursively, its computation includes the application of independent control variates at every action taken between $h$ and $z^t$.
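The control variate identity can be checked on a toy two-outcome example (all numbers invented for illustration): the corrected estimator $X - Y + \mathbb{E}[Y]$ keeps the mean of $X$ but has a smaller spread when $Y$ tracks $X$.

```python
def mean(vals, probs):
    return sum(v * p for v, p in zip(vals, probs))

def variance(vals, probs):
    m = mean(vals, probs)
    return sum(p * (v - m) ** 2 for v, p in zip(vals, probs))

# Two equally likely outcomes; Y is strongly correlated with X.
probs = [0.5, 0.5]
xs = [10.0, -10.0]          # the raw estimator X
ys = [8.0, -8.0]            # the control variate Y
ey = mean(ys, probs)        # known expectation of Y
corrected = [x - y + ey for x, y in zip(xs, ys)]
```

Here the corrected values are (2.0, −2.0): the mean is unchanged at 0 while the variance drops from 100 to 4.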
These estimates are unbiased and, if the baseline function is chosen well, have low variance:
Theorem 1. For any player $i$, any $h \in H$, and any $a \in A(h)$, the baseline-corrected utilities satisfy $\mathbb{E}_{z^t \sim \xi^t}\left[ \hat{u}_i^{b,t}(h) \right] = u_i^{\sigma^t}(h)$ and $\mathbb{E}_{z^t \sim \xi^t}\left[ \hat{u}_i^{b,t}(h, a) \right] = u_i^{\sigma^t}(h, a)$.
Theorem 2. Assume that we have a baseline that satisfies $b(h, a) = u_i^{\sigma^t}(h, a)$ for all $h \in H$, $a \in A(h)$. Then for any $h \sqsubseteq z^t$ and $a \in A(h)$, $\hat{u}_i^{b,t}(h, a) = u_i^{\sigma^t}(h, a)$ with probability 1.
All proofs are given in the appendix. Theorem 1 shows that we can use $\hat{u}_i^{b,t}$ in place of $\hat{u}_i^t$ in equation (3) and maintain the convergence guarantees of MCCFR. Theorem 2 shows that an ideal baseline eliminates all variance in the MCCFR update. By choosing our baseline well, we decrease the MCCFR variance and speed up its convergence. Pseudocode for MCCFR with baseline-corrected values is given in Appendix A.
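Theorems 1 and 2 can be illustrated on a toy depth-two tree. Everything below is invented for illustration (histories as action strings, uniform strategy and sampling); it is a sketch of the recursive baseline-corrected estimate, not the paper's implementation. With the oracle baseline $b(h, a) = u_i^{\sigma}(h, a)$, the root estimate equals the true value no matter which terminal history is sampled:

```python
# Toy tree: histories are action strings; terminals are keys of UTILS.
UTILS = {"ll": 4.0, "lr": 0.0, "rl": 2.0, "rr": -2.0}
ACTIONS = "lr"
SIGMA = 0.5   # uniform current strategy at every decision point
XI = 0.5      # uniform sampling probability for every action

def true_value(h):
    """Exact expected utility under the uniform strategy."""
    if h in UTILS:
        return UTILS[h]
    return sum(SIGMA * true_value(h + a) for a in ACTIONS)

def corrected_value(h, z, baseline):
    """Baseline-corrected sampled utility of h for sampled trajectory z."""
    if h in UTILS:
        return UTILS[h]
    total = 0.0
    for a in ACTIONS:
        est = baseline(h, a)
        if z.startswith(h + a):   # action a was the one sampled at h
            est += (corrected_value(h + a, z, baseline) - baseline(h, a)) / XI
        total += SIGMA * est
    return total

oracle = lambda h, a: true_value(h + a)   # ideal baseline of Theorem 2
zero = lambda h, a: 0.0                   # no baseline
```

Every trajectory gives `corrected_value("", z, oracle) == 1.0`, the true root value (zero variance), while the zero-baseline estimates are (4.0, 0.0, 2.0, −2.0) across the four trajectories: still unbiased (their mean under uniform sampling is 1.0), but with high variance.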
Although we focus on using our baseline-corrected samples in MCCFR, nothing in the value definition is particular to that algorithm. In fact, a lower variance estimate of sampled utilities is useful in any algorithm that performs iterative training using sampled trajectories. Examples of such algorithms include policy gradient methods (Srinivasan et al., 2018) and stochastic first-order methods (Kroer et al., 2015).
4 Baselines for EFGs
In this section we propose several baseline functions for use during iterative training. Theorem 2 shows that we can minimize variance by choosing a baseline function such that $b(h, a) = u_i^{\sigma^t}(h, a)$.
We begin by examining MCCFR under its original definition, where no baseline function is used. We note that when we run baseline-corrected MCCFR with a static choice of $b(h, a) = 0$ for all $h$ and $a$, the operation of the algorithm is identical to MCCFR. Thus, opting to not use a baseline is, in itself, a choice of a very particular baseline.
Using $b = 0$ might seem like a reasonable choice when we expect the game’s payouts to be balanced between the players. However, even when the overall expected utility $u_i(\sigma)$ is very close to 0, there will usually be particular histories with high magnitude expected utility $u_i^{\sigma}(h)$. For example, in poker games, the expected utility of a history is heavily biased toward the player who has been dealt better cards, even if these biases cancel out when considered across all histories. In fact, often there is no strategy profile $\sigma$ at all that satisfies $u_i^{\sigma}(h, a) = 0$ for all $h$ and $a$, which makes $b = 0$ a poor choice in regards to the ideal criterion $b(h, a) = u_i^{\sigma^t}(h, a)$. An example game where a zero baseline performs very poorly is explored in Section 5.
Static strategy baseline.
The simplest way to ensure that the baseline function does correspond to an actual strategy is to choose a static, known strategy profile $\sigma_b$ and let $b(h, a) = u_i^{\sigma_b}(h, a)$ for each time $t$. Once the strategy is chosen, the baseline values only need to be computed once and stored. In general this requires a full walk of the game tree, but it is sometimes possible to take advantage of the structure of the game to greatly reduce this cost. For an example, see Section 5.
Learned history baseline.
Using a static strategy for our baseline ensures that it corresponds to some expected utility, but it fails to take advantage of the iterative nature of MCCFR. In particular, when attempting to estimate $u_i^{\sigma^t}(h, a)$, we have access to all past sampled values $\hat{u}_i^{b,t'}(ha)$ for $t' < t$. Because the strategy is changed incrementally, we might expect the expected utility to change slowly, and for these to be reasonable samples of the utility at time $t$ as well.
Define $T(h, a \mid t) = \{ t' < t : ha \sqsubseteq z^{t'} \}$ to be the set of timesteps on which $(h, a)$ was sampled, and denote the $j$th such timestep as $t_j$. We define the learned history baseline as
$$b^t(h, a) = \sum_{j=1}^{|T(h, a \mid t)|} w_j\, \hat{u}_i^{b, t_j}(ha),$$
where $(w_j)$ is a sequence of weights satisfying $\sum_j w_j = 1$. Possible weighting choices include simple averaging, where $w_j = 1 / |T(h, a \mid t)|$, and exponentially-decaying averaging, where $w_j \propto (1 - \alpha)^{|T(h, a \mid t)| - j}$ for some $\alpha \in (0, 1]$. In either case, the baseline can be efficiently updated online by tracking the weighted sum and the number of times that $(h, a)$ has been sampled.
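Both weighting schemes admit a constant-time online update; a minimal sketch with a hypothetical table class, where `alpha=None` gives simple averaging and a fixed `alpha` gives exponential decay:

```python
class LearnedBaseline:
    """Per-(history, action) baseline table updated from sampled values."""

    def __init__(self, alpha=None):
        self.alpha = alpha    # None: simple average; else decay rate in (0, 1]
        self.values = {}      # (h, a) -> current baseline value
        self.counts = {}      # (h, a) -> number of samples seen

    def update(self, key, sampled_value):
        n = self.counts.get(key, 0) + 1
        self.counts[key] = n
        step = (1.0 / n) if self.alpha is None else self.alpha
        old = self.values.get(key, 0.0)
        self.values[key] = old + step * (sampled_value - old)

    def __call__(self, key):
        return self.values.get(key, 0.0)   # unvisited pairs default to zero
```

With simple averaging, updates of 2.0, 4.0, and 6.0 for one pair leave its baseline at their mean, 4.0; unvisited pairs fall back to the zero baseline.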
Learned infoset baseline.
The learned history baseline is very similar to the VR-MCCFR baseline defined by Schmid et al. (2019). The principal difference is that the VR-MCCFR baseline tracks values for each information set, rather than for each history; we thus refer to it as the learned infoset baseline. This baseline also updates values for each player separately, based on their own information sets. This can be accomplished by tracking separate values for each player throughout the tree walk, or by running MCCFR with alternating updates, where only one player’s regrets are updated on each tree walk. The VR-MCCFR baseline can be defined in our framework as
$$b_i^t(h, a) = \sum_{j=1}^{|T(I_i(h), a \mid t)|} w_j\, \hat{u}_i^{b, t_j}\!\left( z^{t_j}[I_i(h)]\, a \right),$$
where $i$ is the player being updated, $T(I, a \mid t)$ is the set of timesteps on which $(h', a)$ was sampled for any $h' \in I$, and $t_j$ is the $j$th such timestep. Following Schmid et al., we consider both simple averaging and exponentially-decaying averaging for selecting the weights $w_j$.
Predictive baseline.
Our last baseline takes advantage of the recursive nature of the MCCFR update. On each iteration, each history along the sampled trajectory is evaluated and updated in depth-first order. Thus when the update of history $h$ is complete and the value $\hat{u}_i^{b,t}(h)$ is returned, we have already calculated the next regrets $R^{t+1}(I, a)$ for all $I$ and $a$ such that $h \sqsubseteq z^t[I]$. These values will be the input to the regret matching procedure on the next iteration, computing $\sigma^{t+1}$ at these histories. Thus we can immediately compute this next strategy, and using the already sampled trajectory, compute an estimate of the strategy’s utility as $\tilde{u}_i^{t+1}(h) = \sum_{a \in A(h)} \sigma^{t+1}(h, a)\, \tilde{u}_i^{t+1}(h, a)$. This is an unbiased sample of the expected utility $u_i^{\sigma^{t+1}}(h)$, which is our target value for the next baseline $b^{t+1}$. We thus use this sample to update the baseline:
$$b^{t+1}(h, a) = \tilde{u}_i^{t+1}(ha) \quad \text{for each } ha \sqsubseteq z^t.$$
The computation for this update can be done efficiently by a simple change to MCCFR. In MCCFR, we compute $\hat{u}_i^{b,t}(h)$ at each step by using $\sigma^t$ to weight recursively-computed action values. In MCCFR with the predictive baseline, after updating the regrets at $h$, we use a second regret matching computation to compute $\sigma^{t+1}(h)$. We use this strategy to weight a second set of recursively-computed action values to compute $\tilde{u}_i^{t+1}(h)$. When we walk back up the tree, we return both of the values $\hat{u}_i^{b,t}(h)$ and $\tilde{u}_i^{t+1}(h)$, allowing this recursion to continue. The predictive value is only used for updating the baseline function. These changes do not modify the asymptotic time complexity of MCCFR. Pseudocode is given in Appendix A.
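The extra work per visited history is just one more regret-matching call and one more weighted sum. A sketch with hypothetical helpers of the predictive value computed while returning up the tree:

```python
def regret_matching(regrets):
    """Positive-regret-proportional strategy, uniform when none positive."""
    positives = [max(r, 0.0) for r in regrets]
    total = sum(positives)
    if total > 0.0:
        return [p / total for p in positives]
    return [1.0 / len(regrets)] * len(regrets)

def predictive_value(updated_regrets, action_values):
    """After the regret update at h, sigma^{t+1}(h) is already determined;
    weighting the recursively-computed action values by it gives the
    sample used to update the predictive baseline."""
    sigma_next = regret_matching(updated_regrets)
    return sum(p * v for p, v in zip(sigma_next, action_values))
```

For example, updated regrets of (1.0, −1.0) put all probability on the first action, so the predictive value is just that action's value.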
5 Experimental comparison
We run our experiments on a commodity desktop machine in Leduc hold’em (Southey et al., 2005), a small poker game commonly used as a benchmark in games research. (An open source implementation of CFR+ and Leduc hold’em is available from the University of Alberta (Computer Poker Research Group, University of Alberta and Tammelin, 2014).) We compare the effect of the various baselines on the MCCFR convergence rate. Our experiments use the regret zeroing and linear averaging of CFR+, as these improve convergence when combined with any baseline. For the static strategy baseline, we use the “always call” strategy, which matches the opponent’s bets and makes no bets of its own. Expected utility under this strategy is determined by the current size of the pot, which is measurable at run time, and the winning chance of each player’s cards. Before training, we measure and store these odds for all possible sets of cards, which requires storage significantly smaller than the size of the full game. For both of the learned baselines, we use simple averaging, as it was found to work best in preliminary experiments.
We run experiments with two sampling strategies. The first is uniform sampling, in which $\xi^t(h, a) = 1 / |A(h)|$ at every history. The second, opponent on-policy sampling, depends on the player $i$ being updated: we sample uniformly at histories where $P(h) = i$, and sample on-policy ($\xi^t(h) = \sigma^t(h)$) otherwise. For consistency, we use alternating updates for both schemes.
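The two schemes differ only in which distribution is sampled at each history. A sketch with invented interfaces (string history keys and a dict strategy table, not the paper's code):

```python
import random

def sample_action(h, acting_player, update_player, strategy, actions, rng):
    """Opponent on-policy sampling: uniform at the updating player's
    decision nodes, on-policy everywhere else. Returns the sampled
    action and its sampling probability xi(h, a)."""
    if acting_player == update_player:
        a = rng.choice(actions)
        return a, 1.0 / len(actions)
    probs = [strategy[(h, a)] for a in actions]
    a = rng.choices(actions, weights=probs)[0]
    return a, strategy[(h, a)]
```

Uniform sampling is the special case in which the first branch is taken at every history regardless of the acting player.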
Figures 0(a) and 0(b) show the convergence of MCCFR with the various baselines, as measured by exploitability (recall that exploitability converges to zero). All results in this paper are averaged over 20 runs, with 95% confidence intervals shown as error bands (often too narrow to be visible). With uniform sampling, the learned infoset (VR-MCCFR) baseline improves modestly on using no baseline at all, while the other three baselines achieve a significant improvement on top of that. With opponent on-policy sampling, the gap is smaller, but the learned infoset baseline is still noticeably worse than the other options.
Many true expected values in Leduc are very close to zero, making MCCFR without a baseline (i.e., $b = 0$) better than it might otherwise be. To demonstrate the necessity of a baseline in some games, we ran MCCFR in a modified Leduc game where player 2 always transfers 100 chips to player 1 after every game. This utility change is independent of the players’ actions, so it doesn’t strategically change the game. However, it means that 0 is now an extremely inaccurate value estimate for all histories. Figure 0(c) shows convergence in Leduc with shifted utilities. The always call baseline is omitted, as its results would be identical to those in Figure 0(b). Here we see that using any baseline at all provides a significant advantage over not using a baseline, due to the ability to adapt to the shifted utilities. We also see that the learned infoset baseline performs relatively well early on in this setting, because it generalizes across histories.
6 Public Outcome Sampling
Although the results in Section 5 show large gains in convergence speed when using baselines with MCCFR, the magnitudes are not as large as those shown with the VR-MCCFR baseline by Schmid et al. (2019). This is because their experiments use a “vectorized” form of MCCFR, which avoids the sampling of individual histories within information sets. Instead, they track a vector of values on each iteration, one for each possible true history given the player’s observed information set. Schmid et al. do not formally define their algorithm. We refer to it as Public Outcome Sampling (POS), as the algorithm samples any actions that are publicly visible to both players, while exhaustively considering all possible private states. We give a full formal definition of POS in Appendix E.
6.1 Baselines in POS
In MCCFR with POS, we still use state-action baselines, with the ideal baseline values being $b(h, a) = u_i^{\sigma^t}(h, a)$. Thus the baselines in Section 4 apply to this setting as well.
For the learned infoset baseline, we have more information available to us than in the OS case. This is because when POS samples some history-action pair $(h, a)$, it also samples every pair $(h', a)$ for $h' \in I_i(h)$. Thus, rather than using one sampled history value to update the baseline, we use a weighted sample of all of the history values. Following Schmid et al., we weight each history’s value by its opponent and chance reach probability $\pi_{-i}^{\sigma^t}(h)$. This is the same relative weighting given to each history when calculating the counterfactual regret.
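Under an assumed normalization (our choice for illustration; the relative weights are what matters), the POS infoset update can be sketched as a reach-weighted average of the sampled history values:

```python
def weighted_infoset_sample(sampled_values, opp_reach_probs):
    """Combine all histories in the sampled information set into one
    baseline sample, weighting each history by its relative opponent
    and chance reach probability."""
    total = sum(opp_reach_probs)
    return sum(v * p / total for v, p in zip(sampled_values, opp_reach_probs))
```

Histories with values (2.0, 4.0) and reach weights (1.0, 3.0) give a combined sample of 3.5.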
POS also has implications for the predictive baseline. In fact, we can guarantee that after every outcome of the game has been sampled, the predictive baseline will have learned the true value of the current strategy. For time $t$, let $Z^t \subseteq Z$ be the set of sampled terminal histories (those consistent with the sampled public outcome).

Theorem 3. If each of the terminal states reachable from history $h$ has been sampled at least once under public outcome sampling ($Z(h) \subseteq \bigcup_{t' \le t} Z^{t'}$), then the predictive baseline satisfies $b^{t+1}(h, a) = u_i^{\sigma^{t+1}}(h, a)$ for all $a \in A(h)$.
The key idea behind the proof is that POS ensures that the baseline is updated at a history if and only if the expected value of the history changes. The full proof is in Appendix F.
In order for the theorem to hold everywhere in the tree, all outcomes must be sampled, which could take a large number of iterations. An alternative is to guarantee that all outcomes are sampled during the early iterations of MCCFR. For example, one could do a full CFR tree walk on the very first iteration, and then sample on subsequent iterations. Alternatively, we can ensure the theorem always holds with smart initialization of the baseline. When there are no regrets accumulated, MCCFR uses an arbitrary strategy. If we have some strategy with known expected values throughout the tree, we can use this strategy as the default MCCFR strategy and initialize the baseline values to the strategy’s expected values. Either option guarantees that all regret updates will use zero-variance values.
6.2 POS results
As in Section 5, we run experiments in Leduc and use CFR+ updates. For the learned baselines, we use exponentially-decaying averaging, which preliminary experiments found to outperform simple averaging when combined with POS. For simplicity and consistency with the experiments of Schmid et al. (2019), we use uniform sampling and simultaneous updates.
Figure 1(a) compares the baselines’ effects on POS MCCFR. We find that using any baseline provides a significant improvement over using no baseline. The always call baseline performs well early but tails off, as it doesn’t learn during training. Even with POS, where we always see an entire information set at a time, the learned infoset baseline (VR-MCCFR) is significantly outperformed by the learned history and predictive baselines. This is likely because the learned infoset baseline has to learn the relative weighting between histories in an infoset, while the other baselines always use the current strategy to weight the learned values. Finally, we observe that the predictive baseline has a small, but statistically significant, advantage over the learned history baseline in early iterations.
In addition, we compare the baselines by directly measuring their variance. We measure the variance of the counterfactual value for each pair $(I, a)$, and we average across all such pairs. Full details are in Appendix G. Results are shown in Figure 1(b). We see that using no baseline results in high and relatively steady variance of counterfactual values. Using the always call baseline also results in steady variance, as nothing is learned, but at approximately an order of magnitude lower than no baseline. Variance with the other baselines improves over time, as the baselines become more accurate. The learned history baseline mirrors the learned infoset baseline, but with more than an order of magnitude reduction in variance. The predictive baseline is best of all, and in fact we see Theorem 3 in action as the variance drops to zero.
7 Related Work
As discussed in the introduction, the use of baseline functions has a long history in RL. Typically these approaches have used state value baselines, with some recent exceptions (Liu et al., 2018; Wu et al., 2018). Tucker et al. (2018) suggest an explanation for this by isolating the variance terms that come from sampling an immediate action and from sampling the rest of a trajectory. Typical RL baselines only reduce the action variance, so the additional benefit from using a state-action baseline is insignificant when compared to the trajectory variance. In our work, we apply a recursive baseline to reduce both the action and trajectory variances, meaning state-action baselines give a noticeable benefit.
In RL, the doubly-robust estimator (Jiang and Li, 2016) has been used to reduce variance by the recursive application of control variates (Thomas and Brunskill, 2016). Similarly, variance reduction in EFGs via recursive control variates is the basis of the advantage sum estimator (Zinkevich et al., 2008) and AIVAT (Burch et al., 2018). All of these techniques construct control variates by evaluating a static policy or strategy, either on the true game or on a static model. In this sense they are equivalent to our static strategy baseline. However, to the best of our knowledge, these techniques have only been used for evaluation of static strategies, rather than for variance reduction during training. Our work extends the EFG techniques to the training domain; we believe that similar ideas can be used in RL, and this is an interesting avenue of future research.
Concurrent to this work, Zhou et al. (2018) also suggested tracking true values of histories in a CFR variant, analogous to our predictive baseline. They use these values for truncating full tree walks, rather than for variance reduction along sampled trajectories. As such, they always initialize their values with a full tree walk, and don’t examine gradually learning the values during training.
8 Conclusion and Future Work
In this work we introduced a new framework for variance reduction in EFGs through the application of a baseline value function. We demonstrated that the existing VR-MCCFR baseline can be described in our framework with a specific baseline function, and we introduced other baseline functions that significantly outperform it in practice. In addition, we introduced a predictive baseline and showed that it gives provably optimal performance under a sampling scheme that we formally define.
There are three sources of variance when performing sampled updates in EFGs. The first is from sampling trajectory values, the second from sampling individual histories within an information set that is being updated, and the third from sampling which information sets will be updated on a given iteration. By introducing MCCFR with POS, we provably eliminate the first two sources of variance: the first because we have a zero-variance baseline, and the second because we consider all histories within the information set. For the first time, this allows us to select the MCCFR sampling strategy entirely on the basis of minimizing the third source of variance, by choosing the “best” information sets to update. Doing this in a principled way is an exciting avenue for future research.
Finally, we close by discussing function approximation. All of the baselines introduced in this paper require an amount of memory that scales with the size of the game tree. In contrast, baseline functions in RL typically use function approximation, requiring a much smaller number of parameters. Additionally, these functions generalize across states, which can allow for learning an accurate baseline function more quickly. The framework that we introduce in this work is completely compatible with function approximation, and combining the two is an area for future research.
- Bhatnagar et al. [2009] Shalabh Bhatnagar, Richard S Sutton, Mohammad Ghavamzadeh, and Mark Lee. Natural actor–critic algorithms. Automatica, 45(11):2471–2482, 2009.
- Bowling et al. [2015] Michael Bowling, Neil Burch, Michael Johanson, and Oskari Tammelin. Heads-up limit hold’em poker is solved. Science, 347(6218):145–149, January 2015.
- Brown and Sandholm [2017] Noam Brown and Tuomas Sandholm. Safe and nested subgame solving for imperfect-information games. In Advances in Neural Information Processing Systems 30, 2017.
- Brown and Sandholm [2018] Noam Brown and Tuomas Sandholm. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, 359(6374):418–424, January 2018.
- Brown et al. [2018] Noam Brown, Tuomas Sandholm, and Brandon Amos. Depth-limited solving for imperfect-information games. In Advances in Neural Information Processing Systems 31, 2018.
- Burch et al. [2014] Neil Burch, Michael Johanson, and Michael Bowling. Solving imperfect information games using decomposition. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.
- Burch et al. [2018] Neil Burch, Martin Schmid, Matej Moravcik, Dustin Morrill, and Michael Bowling. AIVAT: A new variance reduction technique for agent evaluation in imperfect information games. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- Computer Poker Research Group, University of Alberta and Tammelin [2014] Computer Poker Research Group, University of Alberta and Oskari Tammelin. CFR+, 2014. URL http://webdocs.cs.ualberta.ca/~games/poker/cfr_plus.html. [Online; accessed 23-May-2019].
- Gibson et al. [2012a] Richard Gibson, Neil Burch, Marc Lanctot, and Duane Szafron. Efficient Monte Carlo counterfactual regret minimization in games with many player actions. In Proceedings of the Twenty-Sixth Conference on Advances in Neural Information Processing Systems, 2012a.
- Gibson et al. [2012b] Richard Gibson, Marc Lanctot, Neil Burch, Duane Szafron, and Michael Bowling. Generalized sampling and variance in counterfactual regret minimization. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012b.
- Greensmith et al. [2004] Evan Greensmith, Peter L Bartlett, and Jonathan Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.
- Jakobsen et al. [2016] Sune K. Jakobsen, Troels B. Sørensen, and Vincent Conitzer. Timeability of extensive-form games. In Proceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science, 2016.
- Jiang and Li [2016] Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. In Proceedings of The 33rd International Conference on Machine Learning, 2016.
- Johanson [2013] Michael Johanson. Measuring the size of large no-limit poker games. Technical Report TR13-01, Department of Computing Science, University of Alberta, 2013.
- Johanson et al. [2011] Michael Johanson, Kevin Waugh, Michael Bowling, and Martin Zinkevich. Accelerating best response calculation in large extensive games. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, 2011.
- Kroer et al. [2015] Christian Kroer, Kevin Waugh, Fatma Kilinç-Karzan, and Tuomas Sandholm. Faster first-order methods for extensive-form game solving. In Proceedings of the Sixteenth ACM Conference on Economics and Computation, 2015.
- Lanctot et al. [2009] Marc Lanctot, Kevin Waugh, Martin Zinkevich, and Michael Bowling. Monte Carlo sampling for regret minimization in extensive games. In Advances in Neural Information Processing Systems 22, 2009.
- Liu et al. [2018] Hao Liu, Yihao Feng, Yi Mao, Dengyong Zhou, Jian Peng, and Qiang Liu. Action-dependent control variates for policy optimization via Stein identity. In International Conference on Learning Representations, 2018.
- Moravčík et al. [2017] Matej Moravčík, Martin Schmid, Neil Burch, Viliam Lisý, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael H. Bowling. DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356(6337):508–513, 2017.
- Osborne and Rubinstein [1994] Martin J. Osborne and Ariel Rubinstein. A Course in Game Theory. The MIT Press, 1994. ISBN 0-262-65040-1.
- Schmid et al. [2019] Martin Schmid, Neil Burch, Marc Lanctot, Matej Moravcik, Rudolf Kadlec, and Michael Bowling. Variance reduction in Monte Carlo counterfactual regret minimization (VR-MCCFR) for extensive form games using baselines. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, 2019.
- Schulman et al. [2016] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. In International Conference on Learning Representations, 2016.
- Southey et al. [2005] Finnegan Southey, Michael Bowling, Bryce Larson, Carmelo Piccione, Neil Burch, Darse Billings, and Chris Rayner. Bayes’ bluff: Opponent modelling in poker. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, 2005.
- Srinivasan et al. [2018] Sriram Srinivasan, Marc Lanctot, Vinicius Zambaldi, Julien Pérolat, Karl Tuyls, Rémi Munos, and Michael Bowling. Actor-critic policy optimization in partially observable multiagent environments. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
- Tammelin et al. [2015] Oskari Tammelin, Neil Burch, Michael Johanson, and Michael Bowling. Solving heads-up limit Texas hold’em. In Proceedings of the 24th International Joint Conference on Artificial Intelligence, 2015.
- Thomas and Brunskill [2016] Philip S. Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, 2016.
- Tucker et al. [2018] George Tucker, Surya Bhupatiraju, Shixiang Gu, Richard Turner, Zoubin Ghahramani, and Sergey Levine. The mirage of action-dependent baselines in reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, 2018.
- Šustr et al. [2019] Michal Šustr, Vojtěch Kovařík, and Viliam Lisý. Monte Carlo continual resolving for online strategy computation in imperfect information games. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, 2019.
- Williams [1992] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
- Wu et al. [2018] Cathy Wu, Aravind Rajeswaran, Yan Duan, Vikash Kumar, Alexandre M. Bayen, Sham M. Kakade, Igor Mordatch, and Pieter Abbeel. Variance reduction for policy gradient with action-dependent factorized baselines. In International Conference on Learning Representations, 2018.
- Zhou et al. [2018] Yichi Zhou, Tongzheng Ren, Jialian Li, Dong Yan, and Jun Zhu. Lazy-CFR: a fast regret minimization algorithm for extensive games with imperfect information. CoRR, abs/1810.04433, 2018. URL http://arxiv.org/abs/1810.04433.
- Zinkevich et al. [2008] Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione. Regret minimization in games with incomplete information. In Advances in Neural Information Processing Systems 20, 2008.
Appendix A MCCFR with baseline-corrected values
Pseudocode for MCCFR with baseline-corrected values is given in Algorithm 1. Quantities of the form $x(h, \cdot)$ refer to the vector of all quantities $x(h, a)$ for $a \in A(h)$. A version for the predictive baseline, which must calculate extra values, is given in Algorithm 2. Each of these algorithms has the same worst-case iteration complexity as MCCFR without baselines, namely $O(d\,|A|)$, where $d$ is the tree’s depth and $|A| = \max_h |A(h)|$.
Appendix B Proof of Theorem 1
This proof is a simplified version of the proof of Lemma 5 in Schmid et al. (2019).

We directly analyze the expectation of the baseline-corrected utility, proceeding by induction on the height of $h$ in the tree. If $h$ has height 0, then $h \in Z$; whenever $h$ is reached by the sampled trajectory we have $h = z^t$, and $\hat{u}_i^{b,t}(h) = u_i(h) = u_i^{\sigma^t}(h)$ by definition.

For the inductive step, consider arbitrary $h$ such that $h$ has height more than 0. We assume that $\mathbb{E}\left[ \hat{u}_i^{b,t}(h') \right] = u_i^{\sigma^t}(h')$ for all $h'$ such that $h'$ has smaller height than $h$. We then have
$$\begin{aligned}
\mathbb{E}\left[ \hat{u}_i^{b,t}(h, a) \right]
&= \mathbb{E}\left[ b(h, a) + \frac{\mathbb{1}[ha \sqsubseteq z^t]}{\xi^t(h, a)} \left( \hat{u}_i^{b,t}(ha) - b(h, a) \right) \right] \\
&= b(h, a) + \xi^t(h, a) \cdot \frac{1}{\xi^t(h, a)} \left( \mathbb{E}\left[ \hat{u}_i^{b,t}(ha) \right] - b(h, a) \right) \\
&= \mathbb{E}\left[ \hat{u}_i^{b,t}(ha) \right] \\
&= u_i^{\sigma^t}(ha) \qquad \text{by inductive hypothesis,}
\end{aligned}$$
and thus $\mathbb{E}\left[ \hat{u}_i^{b,t}(h) \right] = \sum_{a \in A(h)} \sigma^t(h, a)\, u_i^{\sigma^t}(ha) = u_i^{\sigma^t}(h)$. We are able to apply the inductive hypothesis because $ha$ is a successor of $h$ and thus must have smaller height. The proof follows by induction. ∎
Appendix C Proof of Theorem 2
This proof is similar to the proof of Lemma 3 in Schmid et al. (2019).

We begin by proving that the assumption of the theorem necessitates that the baseline-corrected values are the true expected values.

Lemma. Assume that we have a baseline that satisfies $b(h, a) = u_i^{\sigma^t}(h, a)$ for all $h \in H$, $a \in A(h)$. Then for any $h \sqsubseteq z^t$, $\hat{u}_i^{b,t}(h) = u_i^{\sigma^t}(h)$.

As for Theorem 1, we prove this by induction on the height of $h$. If $h$ has height 0, then $h = z^t$, and by definition $\hat{u}_i^{b,t}(h) = u_i(h)$. This then means
$$\hat{u}_i^{b,t}(h) = u_i(h) = u_i^{\sigma^t}(h).$$
For the inductive step, we assume that $\hat{u}_i^{b,t}(h') = u_i^{\sigma^t}(h')$ for any $h' \sqsubseteq z^t$ with smaller height than $h$. Then