CFR+ was introduced [Tammelin 2014] as an algorithm for approximately solving imperfect information games, and was subsequently used to essentially solve the game of heads-up limit Texas Hold’em poker [Bowling et al. 2015]. Another paper associated with the poker result gives a correctness proof for CFR+, showing that approximation error approaches zero [Tammelin et al. 2015].
CFR+ is a variant of the CFR algorithm [Zinkevich et al. 2007], with much better empirical performance than CFR. One of the CFR+ changes is switching from simultaneous updates to alternately updating a single player at a time. A crucial step in proving the correctness of both CFR and CFR+ is linking regret, a hindsight measurement of performance, to exploitability, a measurement of the solution quality.
Later work pointed out a problem with the CFR+ proof [Farina et al. 2018], noting that the proof refers to a folk theorem that makes the necessary link between regret and exploitability, but fails to satisfy the theorem’s requirements because CFR+ uses alternating updates. Farina et al. give an example of a sequence of updates which leads to zero regret for both players, but high exploitability.
We state a version of the folk theorem that links alternating update regret and exploitability, with an additional term in the exploitability bound relating to strategy improvement. By proving that CFR and CFR+ generate improved strategies, we can give a new correctness proof for CFR+, recovering the original bound on approximation error.
We need a fairly large collection of definitions to get to the correctness proof. CFR and CFR+ make use of the regret-matching and regret-matching+ algorithms, respectively, and we need to show some properties of these component algorithms. Both CFR and CFR+ operate on extensive form games, a compact tree-based formalism for describing an imperfect information sequential decision making problem.
2.1 Regret-Matching and Regret-Matching+
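Regret-matching [Hart and Mas-Colell 2000] is an algorithm for solving the online regret minimisation problem. External regret is a hindsight measurement of how well a policy did, compared to always selecting some action. Given a set of possible actions $A$, a sequence of value functions $v^t \in \mathbb{R}^{|A|}$, and a sequence of policies $\sigma^t \in \Delta^{|A|}$, the regret for an action is
$$R^t(a) := \sum_{i=1}^{t} \left( v^i(a) - \sigma^i \cdot v^i \right)$$
An online regret minimisation algorithm specifies a policy $\sigma^t$ based on past value functions and policies, such that $\max_a R^t(a)/t \to 0$ as $t \to \infty$.
Let $x^+ := \max(x, 0)$, $\vec{x}^+ := [x_1^+, \ldots, x_n^+]$, and
$$\sigma_{rm}(\vec{x}) := \begin{cases} \vec{x}^+ / (\vec{1} \cdot \vec{x}^+) & \text{if } \exists a \text{ such that } x_a > 0 \\ \vec{1} / |A| & \text{otherwise} \end{cases}$$
Then for any $t \geq 0$, regret-matching uses a policy
$$\sigma^{t+1} := \sigma_{rm}(R^t)$$
Regret-matching+ [Tammelin 2014] is a variant of regret-matching that stores a set of non-negative regret-like values
$$Q^t(a) := \left( Q^{t-1}(a) + v^t(a) - \sigma^t \cdot v^t \right)^+, \qquad Q^0(a) := 0$$
and uses the same regret-matching mapping from stored values to policy
$$\sigma^{t+1} := \sigma_{rm}(Q^t)$$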
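The two update rules can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names and the two value vectors are hypothetical, and the uniform fallback follows the "default of uniform random" mentioned later in the appendix.

```python
def rm_policy(stored):
    """Map stored regret(-like) values to a policy; uniform if none is positive."""
    positive = [max(x, 0.0) for x in stored]
    total = sum(positive)
    if total > 0.0:
        return [x / total for x in positive]
    return [1.0 / len(stored)] * len(stored)


def dot(policy, values):
    """Expected value of a policy under a value vector."""
    return sum(p * v for p, v in zip(policy, values))


def rm_update(stored, policy, values):
    """Regret-matching: accumulate instantaneous regrets, keeping negatives."""
    baseline = dot(policy, values)
    return [s + v - baseline for s, v in zip(stored, values)]


def rm_plus_update(stored, policy, values):
    """Regret-matching+: same accumulation, then clip each entry at zero."""
    return [max(x, 0.0) for x in rm_update(stored, policy, values)]


# Two hypothetical value vectors over three actions (illustrative only).
history = [[1.0, 0.0, -1.0], [-1.0, 0.0, 1.0]]
regrets, q_values = [0.0] * 3, [0.0] * 3
for values in history:
    regrets = rm_update(regrets, rm_policy(regrets), values)
    q_values = rm_plus_update(q_values, rm_policy(q_values), values)
```

With these inputs the stored values end at [1, 1, 1] for regret-matching and [1, 1, 2] for regret-matching+: the clipped variant has already forgotten the early negative regret of the third action, so its next policy favours that action while plain regret-matching is back at uniform.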
2.2 Extensive Form Games
An extensive form game [Von Neumann and Morgenstern 1947] is a sequential decision-making problem where players have imperfect (asymmetric) information. The formal description of an extensive form game is given by a tuple $\langle H, P, p, \sigma_c, u, \mathcal{I} \rangle$.
$H$ is the set of all states $h$, which are a history of actions from the beginning of the game $\emptyset$. Given a history $h$ and an action $a$, $ha$ is the new state reached by taking action $a$ at $h$. To denote a descendant relationship, we say $h \sqsubseteq j$ if $j$ can be reached by some (possibly empty) sequence of actions from $h$, and $h \sqsubset j$ if $h \sqsubseteq j$ and $h \neq j$.
We will use $Z := \{h \in H \mid \nexists j \text{ such that } h \sqsubset j\}$ to denote the set of terminal histories, where the game is over. We will use $Z(h) := \{z \in Z \mid h \sqsubseteq z\}$ to refer to the set of terminal histories that can be reached from some state $h$.
$A(h)$ gives the set of valid actions at $h \in H \setminus Z$. We assume some fixed ordering $a_1, a_2, \ldots, a_{|A(h)|}$ of the actions.
$P$ is the set of players, and $p(h)$ gives the player that is acting at state $h$, or the special chance player $c$ for states where a chance event occurs according to probabilities specified by $\sigma_c(h)$. Our work is restricted to two player games, so we will say $P = \{1, 2\}$.
The utility of a terminal history $z \in Z$ for player $p$ is given by $u_p(z)$. We will restrict ourselves to zero-sum games, where $\sum_{p \in P} u_p(z) = 0$.
A player’s imperfect information about the game state is represented by a partition $\mathcal{I}$ of states based on player knowledge. For all information sets $I \in \mathcal{I}$, all states $h, j \in I$ are indistinguishable to player $p(h) = p(j)$, with the same legal actions $A(h) = A(j)$. Given this equality, we can reasonably talk about $p(I) := p(h)$ and $A(I) := A(h)$ for any $h \in I$. For any $h$, we will use $I(h) \in \mathcal{I}$ such that $h \in I(h)$ to refer to the information set containing $h$. It is convenient to group information sets by the acting player, so we will use $\mathcal{I}_p := \{I \in \mathcal{I} \mid p(I) = p\}$ to refer to player $p$’s information sets.
We will also restrict ourselves to extensive form games where players have perfect recall. Informally, a player has perfect recall if they do not forget anything they once knew: for all states $h, j$ in some information set, both $h$ and $j$ passed through the same sequence of player $p(h)$ information sets from the beginning of the game $\emptyset$, and made the same player $p(h)$ actions.
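A player strategy $\sigma_p \in \Sigma_p$ gives a probability distribution $\sigma_p(I)$ over legal actions for player $p$ information sets. For convenience, let $\sigma_p(h) := \sigma_p(I(h))$. A strategy profile $\sigma := (\sigma_1, \sigma_2)$ is a tuple of strategies for both players. Given a profile $\sigma$, we will use $\sigma_{-p}$ to refer to the strategy of $p$’s opponent.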
Because states are sequences of actions, we frequently need to refer to various products of strategy action probabilities. Given a strategy profile $\sigma$,
$\pi^\sigma(h)$ refers to the probability of a game reaching state $h$ when players sample actions according to $\sigma$ and chance events occur according to $\sigma_c$.
$\pi^\sigma(h|j)$ refers to the probability of a game reaching $h$ given that $j$ was reached.
$\pi_p^\sigma(h)$ and $\pi_{-p}^\sigma(h)$ refer to probabilities of player $p$ or all actors but $p$ making the actions to reach $h$, given that $p$’s opponent and chance made the actions in $h$. Note that there is a slight difference in the meaning of the label $-p$ here, with $\pi_{-p}^\sigma$ considering actions by both player $p$’s opponent and chance, whereas $\sigma_{-p}$ refers to the strategy of $p$’s opponent.
$\pi_p^\sigma(h|j)$ refers to the probability of player $p$ making the actions to reach $h$, given $j$ was reached and $p$’s opponent and chance make the actions to reach $h$. There are a few useful relationships:
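The expected utility of a strategy profile $\sigma$ is
$$u_p(\sigma) := \sum_{z \in Z} \pi^\sigma(z) \, u_p(z)$$
The counterfactual values of a history and of an information set are defined as
$$v_p^\sigma(h) := \sum_{z \in Z(h)} \pi_{-p}^\sigma(h) \, \pi^\sigma(z|h) \, u_p(z), \qquad v_p^\sigma(I) := \sum_{h \in I} v_p^\sigma(h)$$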
For later convenience, we will assume that for each player there exists an information set at the beginning of the game, containing a single state with a single action, leading to the rest of the game. This lets us say that .
Given a sequence of strategies, we denote the average strategy from to as
Given a sequence of strategy profiles, we denote the average player regret as
The exploitability of a strategy profile is a measurement of how much expected utility each player could gain by switching their strategy:
Achieving zero exploitability – a Nash equilibrium [Nash 1950] – is possible. In two player, zero-sum games, finding a strategy with low exploitability is a reasonable goal for good play.
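For intuition, exploitability is easy to compute in the special case of a zero-sum matrix game, where each player has a single decision. The sketch below is an illustration under assumptions that are not from the text: it sums the two best-response gains (some conventions halve this), and both the function name and the example payoff matrix are hypothetical.

```python
def exploitability(payoff, row_strategy, col_strategy):
    """Sum of both players' best-response gains in a zero-sum matrix game.

    payoff[i][j] is the row player's utility; the column player gets -payoff[i][j].
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    # Row player's utility for each pure row against the column strategy.
    row_values = [sum(payoff[i][j] * col_strategy[j] for j in range(n_cols))
                  for i in range(n_rows)]
    # Row player's utility when the column player deviates to each pure column.
    col_values = [sum(row_strategy[i] * payoff[i][j] for i in range(n_rows))
                  for j in range(n_cols)]
    # The utilities at the profile itself cancel because the game is zero-sum.
    return max(row_values) - min(col_values)


payoff = [[2.0, -1.0], [-1.0, 1.0]]  # hypothetical example game
uniform_gap = exploitability(payoff, [0.5, 0.5], [0.5, 0.5])
equilibrium_gap = exploitability(payoff, [0.4, 0.6], [0.4, 0.6])
```

In this example the uniform profile has a gap of 0.5, while the profile in which both players put probability 0.4 on their first action is an equilibrium, so its gap is zero up to rounding.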
2.3 CFR and CFR+
CFR and its variant CFR+ are both algorithms for finding an extensive form game strategy with low exploitability. Both are iterative self-play algorithms that track the average of a current strategy that is based on many loosely coupled regret minimisation problems.
CFR and CFR+ track regret-matching values or regret-matching+ values respectively, for all . At time , CFR and CFR+ use strategy profile and , respectively. When doing alternating updates, with the first update done by player 1, the values used for updating regrets are
and the output of CFR is the profile of average strategies , while the output of CFR+ is the profile of weighted average strategies .
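The alternating-update scheme can be illustrated on a matrix game, which is the degenerate case of one information set per player. This is only a sketch under assumptions not taken from the text: the function name is hypothetical, linear weights $t$ are used for the weighted average, and each player-1 strategy is averaged together with the player-2 strategy it was computed against.

```python
def rm_policy(stored):
    """Regret-matching map: normalise positive parts, uniform if none is positive."""
    positive = [max(x, 0.0) for x in stored]
    total = sum(positive)
    if total > 0.0:
        return [x / total for x in positive]
    return [1.0 / len(stored)] * len(stored)


def rm_plus_self_play(payoff, iterations):
    """Regret-matching+ self-play with alternating updates and linear averaging.

    payoff[i][j] is player 1's utility; player 2 receives -payoff[i][j].
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    q_row, q_col = [0.0] * n_rows, [0.0] * n_cols
    row, col = rm_policy(q_row), rm_policy(q_col)
    avg_row, avg_col = [0.0] * n_rows, [0.0] * n_cols
    for t in range(1, iterations + 1):
        # Player 1 updates first, against the current player-2 strategy.
        values = [sum(payoff[i][j] * col[j] for j in range(n_cols))
                  for i in range(n_rows)]
        baseline = sum(p * v for p, v in zip(row, values))
        q_row = [max(q + v - baseline, 0.0) for q, v in zip(q_row, values)]
        row = rm_policy(q_row)
        # Weight t: pair the updated row strategy with the column strategy it faced.
        avg_row = [a + t * p for a, p in zip(avg_row, row)]
        avg_col = [a + t * p for a, p in zip(avg_col, col)]
        # Player 2 updates second, against the already-updated player-1 strategy.
        values = [-sum(row[i] * payoff[i][j] for i in range(n_rows))
                  for j in range(n_cols)]
        baseline = sum(p * v for p, v in zip(col, values))
        q_col = [max(q + v - baseline, 0.0) for q, v in zip(q_col, values)]
        col = rm_policy(q_col)
    weight = iterations * (iterations + 1) / 2
    return [a / weight for a in avg_row], [a / weight for a in avg_col]


# Hypothetical 2x2 game; the averages should drift towards (0.4, 0.6) for both players.
row_avg, col_avg = rm_plus_self_play([[2.0, -1.0], [-1.0, 1.0]], 2000)
```

A full CFR+ implementation would run one such regret-matching+ update at every information set using counterfactual values; the loop structure and the order of the two updates are the part this sketch is meant to show.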
3 Theoretical Results
The folk theorem links regret and exploitability, and there is a clear analogue that links alternating update regret and exploitability.
Theorem 1. Let be the strategy profile at some time , and be the regrets computed using alternating updates so that player 1 regrets are updated using and player 2 regrets are updated using . If the regrets are bounded by , then the exploitability of is bounded by .
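One way to see where the extra term comes from is the following sketch. It rests on assumptions that are not taken from the text: $R_1$ and $R_2$ are total (unaveraged) regrets over $T$ rounds against the best fixed strategy, $u_2 = -u_1$, player 1's regret in round $t$ is measured against $\sigma_2^t$, and player 2's regret is measured against the already-updated $\sigma_1^{t+1}$.

```latex
\begin{align*}
R_1 &= \max_{\sigma_1^*} \sum_{t=1}^{T}
       \left( u_1(\sigma_1^*, \sigma_2^t) - u_1(\sigma_1^t, \sigma_2^t) \right) \\
R_2 &= \max_{\sigma_2^*} \sum_{t=1}^{T}
       \left( u_2(\sigma_1^{t+1}, \sigma_2^*) - u_2(\sigma_1^{t+1}, \sigma_2^t) \right) \\
R_1 + R_2 &= \max_{\sigma_1^*} \sum_{t=1}^{T} u_1(\sigma_1^*, \sigma_2^t)
           + \max_{\sigma_2^*} \sum_{t=1}^{T} u_2(\sigma_1^{t+1}, \sigma_2^*)
           + \sum_{t=1}^{T} \left( u_1(\sigma_1^{t+1}, \sigma_2^t) - u_1(\sigma_1^t, \sigma_2^t) \right)
\end{align*}
```

The last line uses $-u_2(\sigma_1^{t+1}, \sigma_2^t) = u_1(\sigma_1^{t+1}, \sigma_2^t)$. By linearity of expected utility in each player's average strategy, the first two terms are $T$ times the best-response values against the averages of $\sigma_2^t$ and $\sigma_1^{t+1}$, so moving the final sum to the other side gives a bound of the form regret minus accumulated improvement. With simultaneous updates both regrets are measured against the same profile and the final sum vanishes.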
Motivated by the trailing sum in Theorem 1, the next results do not relate to exploitability, but to expected utility: regret-matching, CFR, and their variants generate new policies which are not worse than the current policy. Specifically, measured with respect to the current value function, the expected utility of the new policy is never worse than the expected utility of the current policy.
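Theorem 2. For any $t \geq 0$, let $s^t$ be the stored value $R^t$ used by regret-matching or $Q^t$ used by regret-matching+, and $\sigma^{t+1}$ be the associated policy computed from $s^t$. Then for all $t \geq 1$, $\sigma^{t+1} \cdot v^t \geq \sigma^t \cdot v^t$.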
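The property that the new policy is never worse than the current one under the current value function can be spot-checked numerically. The sketch below is an illustration, not a proof: the function names are hypothetical, the value vectors are drawn uniformly at random as an arbitrary choice, and a fixed seed keeps the run reproducible.

```python
import random


def rm_policy(stored):
    """Regret-matching map: normalise positive parts, uniform if none is positive."""
    positive = [max(x, 0.0) for x in stored]
    total = sum(positive)
    if total > 0.0:
        return [x / total for x in positive]
    return [1.0 / len(stored)] * len(stored)


def smallest_improvement(num_actions, steps, plus, seed):
    """Return the minimum over time of (new policy . v) - (current policy . v)."""
    rng = random.Random(seed)
    stored = [0.0] * num_actions
    policy = rm_policy(stored)
    smallest = float("inf")
    for _ in range(steps):
        # Arbitrary value function for this round.
        values = [rng.uniform(-1.0, 1.0) for _ in range(num_actions)]
        baseline = sum(p * v for p, v in zip(policy, values))
        stored = [s + v - baseline for s, v in zip(stored, values)]
        if plus:
            # Regret-matching+ clips the stored values at zero.
            stored = [max(s, 0.0) for s in stored]
        new_policy = rm_policy(stored)
        improvement = sum(p * v for p, v in zip(new_policy, values)) - baseline
        smallest = min(smallest, improvement)
        policy = new_policy
    return smallest


worst_rm = smallest_improvement(num_actions=4, steps=500, plus=False, seed=0)
worst_rm_plus = smallest_improvement(num_actions=4, steps=500, plus=True, seed=0)
```

Both returned values should be non-negative up to floating point rounding, whatever the seed, since the inequality is claimed for arbitrary value sequences.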
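Theorem 3. Let $p$ be the player that is about to be updated in CFR or CFR+ at some time $t$. Let $\sigma_p^t$ be the current strategy for $p$, and $\sigma_o$ be the opponent strategy $\sigma_{-p}^t$ or $\sigma_{-p}^{t+1}$ used by the values defined in Equation 16. Then $u_p(\sigma_p^{t+1}, \sigma_o) \geq u_p(\sigma_p^t, \sigma_o)$.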
As an immediate corollary of Theorems 1 and 3, when using alternating updates with either CFR or CFR+, the average strategy has $O(1/\sqrt{t})$ exploitability. From the original papers, both algorithms have an $O(\sqrt{t})$ regret bound, and the trailing sum in Theorem 1 is non-negative by Theorem 3. However, this only applies to a uniform average, so we need yet another theorem to bound the exploitability of the CFR+ weighted average.
Theorem 4. Let be the CFR+ strategy profile at some time , using alternating updates so that player 1 regret-like values are updated using and player 2 regrets are updated using . Let be the bound on terminal utilities. Then the exploitability of the weighted average strategy is bounded by , where .
The original CFR+ convergence proof makes unsupported use of the folk theorem linking regret to exploitability. We re-make the link between regret and exploitability for alternating updates, and provide a corrected CFR+ convergence proof that recovers the original exploitability bound. The proof uses a specific property of CFR and CFR+, where for any single player update, both algorithms are guaranteed to never generate a new strategy which is worse than the current strategy.
With a corrected proof, we once again have a theoretical guarantee of correctness to fall back on, and can safely use CFR+ with alternating updates, in search of its strong empirical performance, without worrying that it might be worse than CFR.
The alternating update analogue of the folk theorem also provides some theoretical motivation for the empirically observed benefit of using alternating updates. Exploitability is now bounded by the regret minus the average improvement in expected values. While we proved that the improvement is guaranteed to be non-negative for CFR and CFR+, we would generally expect non-zero improvement on average, with a corresponding reduction in the bound on exploitability.
Appendix A Proofs
We give an updated proof of an exploitability bound for CFR+. There are three steps. First, we show that regret-matching and regret-matching+ never generate a new policy that is worse than the current policy. Next, we show that CFR and CFR+ have the same property for strategies. Finally, we show that the exploitability of the CFR+ weighted average strategy approaches zero.
A.1 Regret-Matching and Regret-Matching+ Properties
We show that when using regret-matching or regret-matching+, once there is at least one positive stored regret or regret-like value, there will always be a positive stored value. As a consequence, the policy will not switch back to the default of uniform random, which lets us show that with respect to the current values, both algorithms move to a new policy which is no worse than the current policy.
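For any $t$, let $s^t$ be the stored value $R^t$ used by regret-matching or $Q^t$ used by regret-matching+, and $\sigma^{t+1}$ be the associated policy computed from $s^t$. Then for all $t$ where there exists $a$ such that $s^t(a) > 0$, there exists $b$ such that $s^{t+1}(b) > 0$.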
Proof. Given we have
Now consider . There are two cases:
In both cases, such that .
For any , let be the stored value used by regret-matching or used by regret-matching, and be the associated policy. Then for all and , .
Proof. There are two cases.
In both cases, we have .
Then we have
Given some ordering of the actions such that , let
Note that , , and , , so that is always well-defined. We can show by induction that for all
For the base case of , we have
Now assume that Equation 22 holds for some . By construction,
by Eq. 21 (23)
For notational convenience, let .
, so by induction Equation 22 holds for all . In particular, we can now say
In all cases, we have .
A.2 CFR and CFR+ Properties
We start by showing that after a player updates their strategy using CFR or CFR+, the player’s counterfactual value does not decrease for any action at any of their information sets. This implies that with the opponent strategy fixed, the expected value of the player’s new strategy does not decrease. Given this non-decreasing value and the original CFR+ regret bounds, we show a bound on exploitability.
Let be the player that is about to be updated in CFR or CFR+ at some time . Let be the current strategy for , and be the opponent strategy or used by Equation 16. Then and , .
Proof. We will use some additional terminology. Let the terminal states reached from by action be
and for any descendant state of , we will call the ancestor in
Let be the set of information sets which are immediate children of given action :
Note that by perfect recall, for , such that for all : if one state in is reached from by action , all states in are reached from by action . Let the depth of an information set be
Using this new terminology, we can re-write
We will now show that
For the base case , consider any such that . Given these assumptions,
Assume the inductive hypothesis, Equation 29, holds for some . If , , Equation 29 trivially holds for . Otherwise, consider any such that . For notational convenience, call the (possibly empty) set of terminal histories in that do not pass through another player information set
If we partition based on , we end up with sets of terminals passing through different , and possibly some additional terminals . Note that by the induction assumption, because
Given this, we have
|by Eq. 28|