Evaluating agent performance when outcomes are stochastic and agents use randomized strategies can be challenging when there is limited data available. The variance of sampled outcomes may make the simple approach of Monte Carlo sampling inadequate. This is the case for agents playing heads-up no-limit Texas hold'em poker, where man-machine competitions have involved multiple days of consistent play and still not resulted in statistically significant conclusions even when the winner's margin is substantial. In this paper, we introduce AIVAT, a low variance, provably unbiased value assessment tool that uses an arbitrary heuristic estimate of state value, as well as the explicit strategy of a subset of the agents. Unlike existing techniques which reduce the variance from chance events, or only consider game ending actions, AIVAT reduces the variance both from choices by nature and by players with a known strategy. The resulting estimator in no-limit poker can reduce the number of hands needed to draw statistical conclusions by more than a factor of 10.
Evaluating an agent’s performance in stochastic settings can be hard. Non-zero variance in outcomes means the game must be played multiple times to compute a confidence interval that likely contains the true expected value. Regardless of whether the variance arises from player actions or from chance events, we might need to observe many samples before we get a narrow enough interval to draw the desired conclusions. In many situations (e.g., when the evaluation involves human participation), it is not feasible to simply observe more samples, so we must turn to statistical techniques that use additional information to help narrow the confidence interval.
This agent evaluation problem is commonly encountered in games, where the goal is to estimate the expected performance difference between players. For example, consider poker games. Poker is not only a long-standing challenge problem for AI [von Neumann1928, Koller and Pfeffer1997, Billings et al.2002] with annual competitions [Zinkevich and Littman2006, Annual Computer Poker Competition], but also a very popular game played by an estimated 150 million players worldwide [Eco2007]. Heads-up no-limit Texas hold’em (HUNL) is a particular variant of the game that has received considerable attention in the AI community in recent years, including a “Brains vs. AI” event pitting Claudico [Brains Vs. AI2015], a top HUNL computer program, against professional poker players. That match involved 80,000 hands of poker, played over seven days, involving four poker players, playing dozens of hours each. Despite Claudico losing by over 9 big blinds per 100 hands (a margin that is considered huge by poker professionals) [Wood2015], the result is only on the edge of statistical significance, making it hard to draw a conclusion from this large investment of human time.
Previous work on variance reduction for stronger statistical conclusions in this setting has used two broad classes of statistical techniques. Techniques like MIVAT [White and Bowling2009] use the method of control variates with heuristic value estimates to reduce the variance caused by chance events. The technique of importance sampling over imaginary observations [Bowling et al.2008] takes a different approach, using knowledge of a player strategy to evaluate multiple states given a single observation. Imaginary observations can be used to reduce the variance caused by privately observed chance events, as well as the player’s randomised choice of whether to take an action that would immediately end the game.
Techniques from the two classes can be combined, but are not specifically designed to work together for the greatest reduction in variance, and none of the techniques deal with the variance caused by non-terminal action selection. Because good play in imperfect information games generally requires randomised action selection, ignoring action variance is an important shortcoming. We introduce the action-informed value assessment tool (AIVAT), an unbiased low-variance estimator for imperfect information games which extends the use of control variates to player actions, and makes explicit use of imaginary observations to exploit knowledge of the game structure and player strategies.
This paper focuses on variance reduction when evaluating agents for extensive form games, a class of imperfect information sequential decision making problems. Formally, an extensive form game is a set of players $P$ and chance player $c$, a set of states $H$ described as a history of actions from the initial state $\emptyset$, a set of terminal states $Z \subset H$, acting player $p(h)$, player value functions $u_p(z)$, and information partitions $\mathcal{I}_p$ of $\{h \mid p(h) = p\}$. We will say $h \sqsubset j$ if a game in state $j$ was previously in state $h$, $h \sqsubseteq j$ if $h \sqsubset j$ or $h = j$, $A(h)$ is the set of valid actions at $h$, and $h \cdot a$ is the successor state of $h$ that is reached by making action $a$. For all states $h$ such that $p(h) = c$, $\sigma_c(h, a)$ is the publicly known probability distribution over possible chance outcomes at state $h$.
An information set $I \in \mathcal{I}_p$ describes a set of states that player $p$ can not distinguish due to imperfect information of the game state. Any player decision is therefore made at information sets, not states. A behaviour strategy $\sigma_p(I, a)$ gives the probability of player $p$ making decision $a$ at information set $I$. The behaviour in a state $h$ is determined by the information set $I \ni h$, so that $\sigma_p(h, a) = \sigma_p(I, a)$. We will say the probability of reaching a state $h$ is $\pi(h) = \prod_{j \cdot a \sqsubseteq h} \sigma_{p(j)}(j, a)$. It is also useful to consider $\pi_p(h) = \prod_{j \cdot a \sqsubseteq h,\, p(j) = p} \sigma_p(j, a)$, the probability of a player $p$ reaching state $h$ if all other players play to reach $h$. This notation can be extended so that for any set of players $P'$, $\pi_{P'}(h) = \prod_{p \in P'} \pi_p(h)$.
When talking about estimating the value for players in a game, we are trying to find the expected value $\mathbb{E}_z[u_p(z)] = \sum_{z \in Z} \pi(z)\, u_p(z)$. An estimator $e(z)$ is said to be unbiased if the expected value $\mathbb{E}_z[e(z)] = \mathbb{E}_z[u_p(z)]$. Having an estimator be provably unbiased is important because it is in some sense truthful: a player can not appear to do better by changing their play to take advantage of the estimation method.
Techniques like DIVAT and MIVAT [White and Bowling2009] use value functions for a control variate that estimates the expected utility given observed chance events. Conceptually, the techniques subtract the expected chance utility to get a lower variance value which mostly depends on the player actions. For example, in poker, it is likely that good hands end in positive outcomes and bad hands end in negative outcomes. Starting with the observed outcome, we could subtract some value for good hands and add a value for bad hands, and we would expect the corrected value to have lower variance. If the expected value of the correction terms is zero, we can use the lower variance corrected value as an unbiased estimator of player value.
DIVAT requires a strategy for all players to generate value estimates for states through self-play, which MIVAT generalised by allowing for arbitrary value functions defined after chance events. MIVAT adds a correction term for each chance event in an observed state. In order to remain unbiased despite using an arbitrary value estimation function $u(h)$, MIVAT uses a correction term of the form $\mathbb{E}_a[u(h \cdot a)] - u(h \cdot o)$ for an observation with outcome $o$ at state $h$. Computing this expectation requires us to know the probability distribution that $o$ was drawn from, which is true in the case of chance events as $\sigma_c$ is public knowledge. These terms are guaranteed to have an expected value of zero, making the MIVAT value (observed value plus correction terms) an unbiased estimate of player value. In a game like poker, MIVAT will account for the dealer giving a player favourable or unfavourable cards, but not for lucky player actions selected from a randomised strategy.
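As a small illustration of the chance correction term described above, the following Python sketch enumerates a one-card game exactly. The game, its payoffs, and the heuristic values in `U` are hypothetical and chosen only for illustration; they are not taken from the paper. The sketch adds (expected heuristic value minus heuristic value of the observed card) to each observed payoff and compares the exact mean and variance with those of the raw payoff.

```python
from fractions import Fraction

# Hypothetical one-chance-event game: the dealer picks a card, then the
# player's payoff is determined by the card and a fair coin flip.
CARD_PROB = {"low": Fraction(1, 2), "mid": Fraction(1, 4), "high": Fraction(1, 4)}
PAYOFF = {  # (card, coin) -> chips won by the player
    ("low", 0): -4, ("low", 1): 0,
    ("mid", 0): -1, ("mid", 1): 3,
    ("high", 0): 2, ("high", 1): 6,
}
# Arbitrary (imperfect) heuristic estimate of the value after each card.
U = {"low": -1, "mid": 1, "high": 3}


def mivat_value(card, coin):
    """Observed payoff plus the chance correction term E_a[u(a)] - u(o)."""
    expected_u = sum(p * U[c] for c, p in CARD_PROB.items())
    return PAYOFF[(card, coin)] + expected_u - U[card]


def mean_and_variance(estimator):
    """Exact mean and variance of an estimator over every outcome."""
    outcomes = [(CARD_PROB[c] * Fraction(1, 2), estimator(c, k))
                for c in CARD_PROB for k in (0, 1)]
    mean = sum(p * v for p, v in outcomes)
    variance = sum(p * (v - mean) ** 2 for p, v in outcomes)
    return mean, variance


raw = mean_and_variance(lambda c, k: PAYOFF[(c, k)])
corrected = mean_and_variance(mivat_value)
print("raw:      ", raw)
print("corrected:", corrected)
```

In this toy setting both estimators have the same mean, while the corrected value has lower variance even though `U` is not the true expected value of each card. The variance from the coin flip, which stands in for everything the chance correction cannot see, remains.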
Imaginary observations with importance sampling [Bowling et al.2008] use knowledge of a player’s strategy to compute an expected value of multiple states given an observation of a single state. Due to imperfect information, there may be many states which are all guaranteed to have the same probability of the opponent making their actions. If we consider importance sampling over these imaginary observations, the opponent’s probability of reaching the state cancels out so we do not need the opponent’s strategy. By taking an expectation over a set of states for every observation, we get a lower variance value.
There are two kinds of situations where we can use imaginary observations. First, for any state $h$ where player $p$ could have made an action $a$ which ends the game, we can add the imaginary observation of the terminal state $h \cdot a$. For example, in poker this lets us consider player $p$ folding to a bet they called or raised, or calling a bet they folded to in the final round. Second, because of the information partitions in imperfect information games, there may be other states that have identical opponent probabilities. In poker, this lets us consider all the states where the public player actions are the same, the opponent private cards and public board cards are the same, but player $p$ has different private cards. Imaginary observations do not let us reduce the variance caused by choosing non-terminal actions or the outcomes of publicly visible chance events.
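The second kind of imaginary observation (averaging over private cards) can be sketched on a toy game. Everything below is a hypothetical example, not the paper's implementation: a three-card deck, a known betting strategy `BET_PROB` for the evaluated player, and an opponent whose call probability is unknown to the estimator and is used only to check the result by exact enumeration. Each compatible private card is weighted by the probability that chance and the known player reach the observed public action sequence with that card.

```python
from fractions import Fraction

CARDS = ("J", "Q", "K")
CARD_PROB = Fraction(1, 3)  # public chance probability of each private card
# Known strategy of the evaluated player: probability of betting per card.
BET_PROB = {"J": Fraction(1, 4), "Q": Fraction(1, 2), "K": Fraction(3, 4)}
SHOWDOWN = {"J": -2, "Q": 0, "K": 2}
LINES = ("check", "bet-fold", "bet-call")  # publicly visible action sequences


def utility(card, line):
    """Chips won by the evaluated player at a terminal state."""
    if line == "bet-fold":
        return Fraction(1)
    if line == "bet-call":
        return Fraction(SHOWDOWN[card])
    return Fraction(SHOWDOWN[card], 2)  # check: showdown at half stakes


def known_reach(card, line):
    """Reach probability contributed by chance and the known player only."""
    bet = BET_PROB[card]
    return CARD_PROB * (bet if line.startswith("bet") else 1 - bet)


def imaginary_value(line):
    """Average utility over every private card compatible with the line."""
    weights = {card: known_reach(card, line) for card in CARDS}
    total = sum(weights.values())
    return sum(w * utility(card, line) for card, w in weights.items()) / total


def exact_stats(estimator, call_prob):
    """Exact mean and variance; call_prob is used only for this check."""
    opponent = {"check": 1, "bet-fold": 1 - call_prob, "bet-call": call_prob}
    outcomes = [(known_reach(card, line) * opponent[line], estimator(card, line))
                for card in CARDS for line in LINES]
    mean = sum(p * v for p, v in outcomes)
    variance = sum(p * (v - mean) ** 2 for p, v in outcomes)
    return mean, variance


call_prob = Fraction(1, 3)  # opponent strategy, never read by the estimator
raw = exact_stats(utility, call_prob)
imagined = exact_stats(lambda card, line: imaginary_value(line), call_prob)
print("raw:     ", raw)
print("imagined:", imagined)
```

Because the opponent's probability is the same for every compatible card, it cancels in the weighted average, so `imaginary_value` never needs it. The estimate depends only on the public line, which removes the variance from the private card but leaves the variance from which line was played.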
MIVAT and imaginary observations consider different information and can be combined to get a value estimate with lower variance than either technique used individually. Instead of using the terminal value $u_p(z')$ for an imaginary observation $z'$, we could use the MIVAT value estimate given $z'$. However, because neither technique has terms which address the effect of non-terminal actions, we would never expect this combination of techniques to produce a zero variance value estimate. Even with a “perfect” value function that correctly estimates the expected value of a state and action for the players, there would still be some variance in the value estimate due to the random action selection by players.
Conceptually, AIVAT combines the chance correction terms of MIVAT with imaginary observations across private information, along with new MIVAT-like correction terms for player actions. The AIVAT estimator is the sum of a base value using imaginary observations, plus imaginary observation correction terms for both player actions and chance events. Roughly speaking, moving backwards through the choices in an observed game, the AIVAT correction terms are constructed in a fashion that shifts an estimate of the expected value after a choice was made towards an estimate of the expected value before the choice.
Because imaginary observations with importance sampling provides an unbiased estimate of the expected value of the players, and the MIVAT-like terms have an expected value of zero, AIVAT is also an unbiased estimator of the expected player value. Furthermore, with well-structured games, “perfect” value functions, and knowledge of all player strategies, we could see zero variance: the imaginary observation values and the correction terms would sum to the expected player value, regardless of the observed game.
Figure 1 gives a high level overview of MIVAT, imaginary observations, and AIVAT. In this example, we are interested in the expected value for player 1, and know player 1’s strategy. We use an observation of one hand of Leduc hold’em poker, a small synthetic game constructed for artificial intelligence research [Southey et al.2005]. Leduc hold’em is a two round game with one private card for each player, and one publicly visible board card that is revealed after the first round of player actions. In the example, player 1 is dealt Q and player 2 is dealt K. Player 1 makes the check action followed by a player 2 check action. The public board card is revealed to be J. After the round two actions check, raise, call, player 1 loses 5 chips.
We start by describing the correction terms added for chance events and actions. Given information about a player’s strategy, we can treat that player’s choice events as chance events and construct MIVAT-like correction terms for them. The player strategy also allows imaginary observations considering alternative histories with identical opponent probabilities, so we can compute an expectation over a set of compatible histories rather than using the single observed outcome.
The correction term at a decision point will be the expectation across all compatible histories of the expected value before a choice, minus the value after the observed choice. As with MIVAT, the values are estimated using an arbitrary fixed value function to estimate the value after every decision. Value estimates which more closely approximate the true expected value will result in greater variance reduction.
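To consider imaginary observations, we need at least one player for which we know the strategy. Let $P_a \subseteq P \cup \{c\}$ be a non-empty set of players, including $c$, such that we know $\sigma_p$ for all $p \in P_a$, and $P_o = (P \cup \{c\}) \setminus P_a$ be the set of opponent players for which we do not know the strategy. If $P_a = \{c\}$ then AIVAT would be identical to MIVAT. We must also partition the states into the sets we can evaluate given an observation of a completed game. Let $\mathcal{H}$ be a partition of states $\{h \mid p(h) \in P_a\}$ such that for every part $H \in \mathcal{H}$ and all $h, j \in H$,
1. $\pi_{P_o}(h) = \pi_{P_o}(j)$. For example, this can be enforced by requiring $h$ and $j$ to pass through the same sequence of player $P_o$ information sets and make the same actions at those information sets.
2. $h \not\sqsubset j$. This implies a uniqueness property, where for any terminal $z$, $\{h \in H \mid h \sqsubset z\}$ is either empty or a singleton.
3. $A(h) = A(j)$.
We will extend the actions so that $A(H) = A(h)$ for any $h \in H$, and let $H \cdot a = \{h \cdot a \mid h \in H\}$. Because $\pi_{P_o}(h) = \pi_{P_o}(j)$ for all $h, j \in H$, we will say $\pi_{P_o}(H) = \pi_{P_o}(h)$.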
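Similar to MIVAT, we need value functions that give an estimate of the expected value after an action. Let there be arbitrary functions $u_h(a)$ for each state $h$ where $p(h) \in P_a$. Say we have seen a terminal state $z$. Consider a part $H \in \mathcal{H}$. If there is no $h \in H$ such that $h \sqsubset z$, then the correction term $k_H(z) = 0$. Otherwise, property 2 of $\mathcal{H}$ implies there is a unique observed action $a_O$ such that $h \cdot a_O \sqsubseteq z$ for some $h \in H$, and the correction term is
$$k_H(z) = \frac{\sum_{a \in A(H)} \sum_{h \in H} \pi_{P_a}(h \cdot a)\, u_h(a)}{\sum_{h \in H} \pi_{P_a}(h)} - \frac{\sum_{h \in H} \pi_{P_a}(h \cdot a_O)\, u_h(a_O)}{\sum_{h \in H} \pi_{P_a}(h \cdot a_O)}$$
AIVAT uses the sum of $k_H(z)$ across all $H \in \mathcal{H}$.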
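The correction term for a single part can be sketched in Python by reading it as "expected value estimate before the choice, averaged over all compatible histories, minus the expected estimate after the observed choice". The function name `correction_term`, its dictionary-based arguments, and the example numbers are assumptions made for this illustration, not the authors' implementation.

```python
from fractions import Fraction


def correction_term(reach, policy, values, observed_action):
    """Correction term for one part of compatible histories.

    reach[h]     -- reach probability of history h for chance and known players
    policy[h][a] -- known probability of choice a at history h
    values[h][a] -- heuristic estimate of the value after choice a at h
    """
    histories = list(reach)
    actions = list(policy[histories[0]])  # all histories share one action set
    # Known-player reach probability of each successor state h.a
    after = {a: {h: reach[h] * policy[h][a] for h in histories} for a in actions}
    # Expected estimate before the choice, over all histories and choices.
    before = sum(after[a][h] * values[h][a] for a in actions for h in histories)
    before /= sum(reach.values())
    # Expected estimate after the observed choice, over all histories.
    chosen = after[observed_action]
    observed = sum(chosen[h] * values[h][observed_action] for h in histories)
    observed /= sum(chosen.values())
    return before - observed


# Hypothetical part: the known player holds Q or K and may check or bet.
reach = {"Q": Fraction(1, 3), "K": Fraction(1, 3)}
policy = {"Q": {"check": Fraction(3, 4), "bet": Fraction(1, 4)},
          "K": {"check": Fraction(1, 4), "bet": Fraction(3, 4)}}
values = {"Q": {"check": -1, "bet": -2}, "K": {"check": 1, "bet": 3}}

k_check = correction_term(reach, policy, values, "check")
k_bet = correction_term(reach, policy, values, "bet")
print(k_check, k_bet)
```

In this example each action is observed with probability 1/2 given that the part is reached, and the two correction values cancel, so the term has zero expectation regardless of how accurate `values` is.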
The AIVAT correction terms have an expected value of zero, and are not a value estimate by themselves. They must be combined with an unbiased estimate of player value. For improved variance reduction, the form of the correction terms must match the choice of base value estimate.
To see how the terms match, consider a simplified version of AIVAT where the final correction term for a terminal state $z = h \cdot a_O$ has the form $\mathbb{E}_a[u_h(a)] - u_h(a_O)$. Ideally, we would like the value estimate for $z$ to be $u_h(a_O)$. The value estimate plus the correction term will then have the same value for all actions at $h$, resulting in zero variance.
For the AIVAT correction terms, the correct choice is to use imaginary observations of all possible private information for players in $P_a$, as in “Example 3: Private Information” of the paper by Bowling et al. [Bowling et al.2008]. In poker, it corresponds to evaluating the game with all possible private cards, weighted by the likelihood of holding the cards given the observed game. For completeness, we formally describe the particular instance of this existing estimator using the notation of this paper.
Given the correction term partition $\mathcal{H}$ of player $P_a$ states, we construct a matching partition $\mathcal{Z}$ of terminal states such that for every part $Z \in \mathcal{Z}$ and all $z, z' \in Z$,
1. $\pi_{P_o}(z) = \pi_{P_o}(z')$.
2. a player in $P_a$ made an action in $z$ if and only if a player in $P_a$ made an action in $z'$.
3. if a player in $P_a$ made an action in $z$, then for the longest prefixes $h \cdot a \sqsubseteq z$ and $h' \cdot a \sqsubseteq z'$ such that $p(h) \in P_a$ and $p(h') \in P_a$, both $h$ and $h'$ are in the same part of $\mathcal{H}$.
The last two conditions on $\mathcal{Z}$ ensure that the imaginary observation estimate does not include terminal states that the correction terms will also account for. This rules out a form of double counting which would not produce a biased estimator, but would increase the variance when using high quality estimates in the correction terms.
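If we observe a terminal state $z$, let $Z(z) \in \mathcal{Z}$ be the part such that $z \in Z(z)$. The base estimated value for player $p$ is
$$\hat{v}_p(z) = \frac{\sum_{z' \in Z(z)} \pi_{P_a}(z')\, u_p(z')}{\sum_{z' \in Z(z)} \pi_{P_a}(z')}$$
The AIVAT estimator gives an unbiased estimate of the expected value $\mathbb{E}_z[u_p(z)]$. If we use partitions $\mathcal{H}$ and $\mathcal{Z}$ as described above, and are given an observation of a terminal state $z$, the value estimate is
$$\mathrm{AIVAT}(z) = \frac{\sum_{z' \in Z(z)} \pi_{P_a}(z')\, u_p(z')}{\sum_{z' \in Z(z)} \pi_{P_a}(z')} + \sum_{H \in \mathcal{H}} k_H(z) \qquad (1)$$
where $k_H(z)$ denotes the correction term for part $H$.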
Note that there is a subtle difference between AIVAT and a simple combination of imaginary observations and an extended MIVAT framework using player strategy information to add control variates for actions. Using an extended MIVAT plus imaginary observations, we would consider the expected MIVAT value estimate across all terminal histories compatible with the observed terminal state. In AIVAT, for each correction term we would consider all histories compatible with the state at that decision point.
As a concrete example of the difference, consider the game used in Figure 1. MIVAT with imaginary observations would only consider private cards for player 1 that do not conflict with the opponent’s K or the public card J, even when computing the control variate term for the public card. In contrast, AIVAT considers J as a possible player card when computing that term.
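To make the combination of a base value and correction terms concrete, here is a minimal Python sketch on a hypothetical toy game; it is not the paper's poker implementation. Chance deals one of three private cards to a player with known betting strategy `BET_PROB`, and an opponent with an unknown call probability responds to a bet. The values in `HEURISTIC` are arbitrary estimates. The sketch assumes the value estimate for the private deal ignores the known player's card, in which case that chance correction is identically zero and is omitted, leaving one correction term for the known player's action.

```python
from fractions import Fraction

CARDS = ("J", "Q", "K")
CARD_PROB = Fraction(1, 3)
BET_PROB = {"J": Fraction(1, 4), "Q": Fraction(1, 2), "K": Fraction(3, 4)}  # known
SHOWDOWN = {"J": -2, "Q": 0, "K": 2}
LINES = ("check", "bet-fold", "bet-call")
# Heuristic value estimates after the known player's action, per private card.
HEURISTIC = {"check": {"J": -1, "Q": 0, "K": 1}, "bet": {"J": -1, "Q": 0, "K": 2}}


def utility(card, line):
    """Chips won by the known player at a terminal state."""
    if line == "bet-fold":
        return Fraction(1)
    if line == "bet-call":
        return Fraction(SHOWDOWN[card])
    return Fraction(SHOWDOWN[card], 2)


def action_of(line):
    return "bet" if line.startswith("bet") else "check"


def action_reach(card, action):
    """Chance and known-player probability of holding card and taking action."""
    return CARD_PROB * (BET_PROB[card] if action == "bet" else 1 - BET_PROB[card])


def base_value(line):
    """Imaginary-observation base value over compatible private cards."""
    action = action_of(line)
    total = sum(action_reach(c, action) for c in CARDS)
    return sum(action_reach(c, action) * utility(c, line) for c in CARDS) / total


def action_correction(line):
    """Expected heuristic value before the action minus after the observed one."""
    observed = action_of(line)
    before = sum(action_reach(c, a) * HEURISTIC[a][c]
                 for a in HEURISTIC for c in CARDS) / (CARD_PROB * len(CARDS))
    after = (sum(action_reach(c, observed) * HEURISTIC[observed][c] for c in CARDS)
             / sum(action_reach(c, observed) for c in CARDS))
    return before - after


def aivat_value(line):
    return base_value(line) + action_correction(line)


def exact_stats(estimator, call_prob):
    """Exact mean and variance; call_prob is used only for this check."""
    opponent = {"check": 1, "bet-fold": 1 - call_prob, "bet-call": call_prob}
    outcomes = [(action_reach(c, action_of(l)) * opponent[l], estimator(c, l))
                for c in CARDS for l in LINES]
    mean = sum(p * v for p, v in outcomes)
    return mean, sum(p * (v - mean) ** 2 for p, v in outcomes)


q = Fraction(1, 2)  # opponent call probability, never read by the estimators
chips = exact_stats(utility, q)
io_only = exact_stats(lambda c, l: base_value(l), q)
aivat = exact_stats(lambda c, l: aivat_value(l), q)
print(chips, io_only, aivat)
```

All three estimators share the same mean in this toy game. The remaining variance of the combined estimate comes from the opponent's call-or-fold choice, which cannot be corrected without the opponent's strategy, and from the imperfection of `HEURISTIC`.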
It is desirable to have an unbiased value estimate for games, so that players can not improve their estimated value by changing their strategy to fit the estimation technique. We prove that AIVAT is unbiased. The value estimate in Equation 1 is a sum of two parts. The fraction in the first part is an unbiased estimator based on imaginary observations [Bowling et al.2008], so we only need to show that the sum of all correction terms has an expected value of 0.
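Lemma 1. The correction terms have an expected value of zero.
Proof. Consider an arbitrary part $H \in \mathcal{H}$. Let $Z(H)$ be the set of terminal states passing through $H$. Expanding definitions, and using property 1 of $\mathcal{H}$ to write $\pi(h \cdot a) = \pi_{P_o}(H)\, \pi_{P_a}(h \cdot a)$, we get
$$\mathbb{E}_z[k_H(z)] = \sum_{z \in Z(H)} \pi(z)\, k_H(z) = \pi_{P_o}(H) \sum_{h \in H} \sum_{a \in A(h)} \pi_{P_a}(h \cdot a) \left[ \frac{\sum_{a' \in A(H)} \sum_{h' \in H} \pi_{P_a}(h' \cdot a')\, u_{h'}(a')}{\sum_{h' \in H} \pi_{P_a}(h')} - \frac{\sum_{h' \in H} \pi_{P_a}(h' \cdot a)\, u_{h'}(a)}{\sum_{h' \in H} \pi_{P_a}(h' \cdot a)} \right]$$
Using property 3 of $\mathcal{H}$ to exchange the sums over $h$ and $a$, together with $\sum_{a \in A(H)} \pi_{P_a}(h \cdot a) = \pi_{P_a}(h)$,
$$\mathbb{E}_z[k_H(z)] = \pi_{P_o}(H) \left[ \sum_{a' \in A(H)} \sum_{h' \in H} \pi_{P_a}(h' \cdot a')\, u_{h'}(a') - \sum_{a \in A(H)} \sum_{h' \in H} \pi_{P_a}(h' \cdot a)\, u_{h'}(a) \right] = 0$$
Because the expected value is 0 for an arbitrary $H$, the expected value is 0 for the sum over all $H \in \mathcal{H}$.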
Proof. That the sum of all correction terms has an expected value of 0 immediately follows from Lemma 1, as the expected value of a sum of terms is the sum of the expected values of the terms, which are all 0.
We demonstrate the effectiveness of AIVAT in two poker games, Leduc hold’em and heads-up no-limit Texas hold’em (HUNL). Both Leduc hold’em and HUNL have a convenient structure where all actions are public, and there is a mix of chance events in the form of completely public board cards and completely private hole cards. The uncomplicated structure leads to a clear choice for the partition $\mathcal{H}$. Each $H \in \mathcal{H}$ has states with identical betting, public board cards, and private hole cards for any players in $P_o$.
In all experiments the value functions are self-play values, generated by solving the game to find a Nash equilibrium strategy using a variant of the Monte Carlo CFR algorithm [Lanctot et al.2009]. For each player $p$ and part $H$ of the partition, we save the average observed values for the opponent across all iterations, giving us a value for each action $a$ at $H$. This value is an expected self-play value for the opponent at $H \cdot a$, given the probability distribution of hands for $p$ that reach $H$ and play $a$. Because we are playing a zero-sum game with two players, we can use the negation of this value as the value estimate for $p$. In HUNL, which is too large to solve directly, we solve a very small abstraction of the game [Billings et al.2003, Ganzfried and Sandholm2014] with only 8 million information sets, which gives us a rough value estimate that is identical across many partitions of HUNL states.
Poker is played in an alternating fashion, where agents take turns playing in different positions. Let us say we have two agents, $x$ and $y$. In poker, in odd-numbered games (starting at game 1) we would have $x$ as player 1 and $y$ as player 2, and in even-numbered games we would have $y$ as player 1 and $x$ as player 2. For the experiments, we model this as an extended game where there is an initial 50/50 chance event that assigns a position to the agent, along with an AIVAT correction term for the position.
All experiments will compare AIVAT value estimates with the unmodified game values from counting chips, the MIVAT value estimate, and the combination of MIVAT and imaginary observations using the strategy for agent $x$ (MIVAT+IO). Because poker is a zero-sum game, it is sufficient to present results from the point of view of agent $x$.
The small size of Leduc hold’em lets us test both the case where $P_a$ only contains one non-chance player, as well as the full-knowledge case where $P_a = P \cup \{c\}$. AIVAT and chip count results are generated from observations of 100,000 games. All of the numbers are in units of chips, where Leduc hold’em has a 1 chip ante, and 2 chip and 4 chip bets in the first and second rounds, respectively.
Figure 2 looks at self-play, where both agents $x$ and $y$ play the same Nash equilibrium that was used to generate the value functions. The true expected value for player $x$ is 0. Because we are using value functions computed from their self-play, this experiment represents a best-case situation. With knowledge of both players’ strategies, the only remaining variance comes from noise in the value function that arises from the sampling and averaging used in the MCCFR computation.
With knowledge of both players’ strategies, we reduce the per-game standard deviation of the estimated player value by a little less than 99.9%. This situation might be unlikely in practice, but does demonstrate that the AIVAT computation correctly shifts every observed outcome to the expected player value, given full correct information. Surprisingly, the one-sided evaluation where we use only one player’s strategy still reduces the standard deviation by 99.8%. Using MIVAT or MIVAT+IO, we only see a 33.8% and 45.1% reduction, respectively.
Moving away from the best-case situation, Figure 3 looks at games where agent $x$ is the same Nash equilibrium from above, and agent $y$ is an agent that randomly calls or raises. Given these strategies, the true expected value for player $x$ is 0.69358.
Using the call/raise strategy for $y$ demonstrates that the amount of variance reduction does depend on how well the value functions estimate the true expected value of a situation. We used value functions which encode self-play values for $x$, and while $y$ is sufficiently similar to $x$ that the true values are still positively correlated with the estimated values for both players, they are no longer an almost-perfect match. Despite the strategic mismatch, using AIVAT we see a reduction in the standard deviation of 48% to 75% compared to the basic chip-count estimate. All of the AIVAT estimators outperform the 25% reduction using MIVAT plus imaginary observations.
The game of HUNL better represents a potential real-world application. The game is commonly played, it is too large to easily compute exact expected values directly even when the strategy of both agents is known, average win rate is a statistic of interest to players and observers, and the high per-game variance of outcomes obscures the win rate even after hundreds of thousands of hands.
The variant of HUNL that we use has a small blind of 1 chip and big blind of 2 chips, and each player has 200 chips (i.e., 100 big blinds.) Due to the large branching factor of chance events, we can only present results for AIVAT analysis using the strategy of one agent. All results are generated from observations of 1 million games.
We start by looking at self-play, using a low-quality Nash equilibrium approximation for both agents $x$ and $y$. The value functions are generated using this same weak approximation. Figure 4 gives the results for the different estimation methods. The true expected value for $x$ is 0.
In Figure 5 we look at games where agent $x$ uses the same low-quality approximation of a Nash equilibrium, and agent $y$ is a much stronger agent using a high-quality approximation of a Nash equilibrium. The value functions are still generated using the low-quality approximation. The true expected value for player $x$ is not known.
In both experiments, we see a 39% reduction in the standard deviation when using MIVAT with imaginary observations, and a bit more than a 68% reduction using AIVAT. It must be noted that our value function could be improved, as the 18% reduction for MIVAT in this experiment does not match the 23% improvement previously demonstrated using values learned from data [White and Bowling2009]. The small abstract game used to generate the value functions does not do a good job of understanding the consequences of cards being dealt, as it can not distinguish most card situations. Despite this handicap, the full AIVAT estimator still significantly improves on the state of the art for low-variance value estimators for imperfect information games.
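To relate the reported reductions in standard deviation to data requirements, the short Python sketch below uses the standard fact that the width of a confidence interval for a mean scales with the standard deviation divided by the square root of the number of samples, so the number of games needed for a fixed width scales with the variance. The helper name `data_reduction_factor` is an assumption for this illustration; the percentages are the HUNL figures quoted above.

```python
def data_reduction_factor(std_reduction):
    """Factor by which the required number of games shrinks when the per-game
    standard deviation is reduced by the fraction std_reduction (0 to 1)."""
    remaining_std = 1.0 - std_reduction
    # Required sample size is proportional to variance = std ** 2.
    return 1.0 / remaining_std ** 2


for name, reduction in [("MIVAT", 0.18), ("MIVAT+IO", 0.39), ("AIVAT", 0.68)]:
    factor = data_reduction_factor(reduction)
    print(f"{name}: about {factor:.1f} times fewer games")
```

A 68% reduction leaves 32% of the standard deviation, which is roughly 10% of the variance, so about ten times fewer games are needed for the same interval width.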
We introduce a technique for value estimation in imperfect information games that extends and combines existing techniques. AIVAT uses heuristic value functions, knowledge of game structure, and knowledge about player strategies to both add a control variate term for chance and player decisions, and to average over multiple possible outcomes given a single observation. We prove AIVAT is unbiased, and demonstrate that with (almost) perfect value functions we see (almost) complete elimination of variance. Even with imprecise value functions, we show variance reduction in a real-world game that significantly exceeds existing techniques. AIVAT’s three times reduction in standard deviation allows us to achieve the same statistical significance with ten times less data. A factor of ten is substantial: for problems with limited data, like human play against bots, ten times as many games could be the distinction between practical and impractical.
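[Bowling et al.2008] Michael Bowling, Michael Johanson, Neil Burch, and Duane Szafron. Strategy evaluation in extensive games with importance sampling. In Proceedings of the Twenty-Fifth International Conference on Machine Learning (ICML), pages 72–79, 2008.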