Revisiting CFR+ and Alternating Updates

10/26/2018
by Neil Burch, et al.
Google

The CFR+ algorithm for solving imperfect information games is a variant of the popular CFR algorithm, with faster empirical performance on a range of problems. It was introduced with a theoretical upper bound on solution error, but subsequent work showed an error in one step of the proof. We provide updated proofs to recover the original bound.

1 Introduction

CFR+ was introduced [Tammelin 2014] as an algorithm for approximately solving imperfect information games, and was subsequently used to essentially solve the game of heads-up limit Texas Hold'em poker [Bowling et al. 2015]. Another paper associated with the poker result gives a correctness proof for CFR+, showing that approximation error approaches zero [Tammelin et al. 2015].

CFR+ is a variant of the CFR algorithm [Zinkevich et al. 2007], with much better empirical performance than CFR. One of the CFR+ changes is switching from simultaneous updates to alternately updating a single player at a time. A crucial step in proving the correctness of both CFR and CFR+ is linking regret, a hindsight measurement of performance, to exploitability, a measurement of the solution quality.

Later work pointed out a problem with the CFR+ proof [Farina et al. 2018], noting that the CFR+ proof makes reference to a folk theorem making the necessary link between regret and exploitability, but fails to satisfy the theorem's requirements due to the use of alternating updates in CFR+. Farina et al. give an example of a sequence of updates which leads to zero regret for both players, but high exploitability.

We state a version of the folk theorem that links alternating update regret and exploitability, with an additional term in the exploitability bound relating to strategy improvement. By proving that CFR and CFR+ generate improved strategies, we can give a new correctness proof for CFR+, recovering the original bound on approximation error.

2 Definitions

We need a fairly large collection of definitions to get to the correctness proof. CFR and CFR+ make use of the regret-matching and regret-matching+ algorithms, respectively, and we need to show some properties of these component algorithms. Both CFR and CFR+ operate on extensive form games, a compact tree-based formalism for describing an imperfect information sequential decision making problem.

2.1 Regret-Matching and Regret-Matching+

Regret-matching [Hart and Mas-Colell 2000] is an algorithm for solving the online regret minimisation problem. External regret is a hindsight measurement of how well a policy did, compared to always selecting some action. Given a set of possible actions A, a sequence of value functions v^t : A → ℝ, and a sequence of policies σ^t ∈ Δ_A, the regret for an action a is

R^T(a) = \sum_{t=1}^{T} \Big( v^t(a) - \sum_{b \in A} \sigma^t(b)\, v^t(b) \Big)    (1)

An online regret minimisation algorithm specifies a policy σ^{t+1} based on past value functions and policies, such that max_{a∈A} R^T(a)/T → 0 as T → ∞.

Let x^+ = max(x, 0), R^0(a) = 0 for all a ∈ A, and

R^t(a) = R^{t-1}(a) + v^t(a) - \sum_{b \in A} \sigma^t(b)\, v^t(b)    (2)

Then for any t, regret-matching uses a policy

\sigma^{t+1}(a) = \frac{R^t(a)^+}{\sum_{b \in A} R^t(b)^+} \;\text{ if } \sum_{b \in A} R^t(b)^+ > 0, \qquad \sigma^{t+1}(a) = \frac{1}{|A|} \;\text{ otherwise}    (3)

Regret-matching+ [Tammelin 2014] is a variant of regret-matching that stores a set of non-negative regret-like values

Q^t(a) = \Big( Q^{t-1}(a) + v^t(a) - \sum_{b \in A} \sigma^t(b)\, v^t(b) \Big)^+ \quad \text{with } Q^0(a) = 0    (4)

and uses the same regret-matching mapping from stored values to policy

\sigma^{t+1}(a) = \frac{Q^t(a)}{\sum_{b \in A} Q^t(b)} \;\text{ if } \sum_{b \in A} Q^t(b) > 0, \qquad \sigma^{t+1}(a) = \frac{1}{|A|} \;\text{ otherwise}    (5)
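
To make the two update rules concrete, the following is a minimal sketch in Python (our own illustration, not code from the paper; the function names are ours) of a single update of regret-matching and regret-matching+ at one decision point, using the notation of Equations 2-5.

    import numpy as np

    def policy_from_values(stored):
        # Map stored (regret or regret-like) values to a policy, as in Eqs. 3 and 5:
        # normalise the positive part, falling back to uniform random.
        positive = np.maximum(stored, 0.0)
        total = positive.sum()
        return positive / total if total > 0 else np.full(len(stored), 1.0 / len(stored))

    def regret_matching_update(R, policy, values):
        # One regret-matching step (Eq. 2): accumulate the instantaneous regret.
        return R + (values - policy @ values)

    def regret_matching_plus_update(Q, policy, values):
        # One regret-matching+ step (Eq. 4): accumulate, then clip at zero.
        return np.maximum(Q + (values - policy @ values), 0.0)

    # Toy usage: three actions and one value function v^t.
    values = np.array([1.0, 0.0, -2.0])
    R, Q = np.zeros(3), np.zeros(3)
    policy = policy_from_values(R)          # uniform on the first iteration
    R = regret_matching_update(R, policy, values)
    Q = regret_matching_plus_update(Q, policy, values)
    print(policy_from_values(R), policy_from_values(Q))

The only difference is the clipping of the stored values: regret-matching can carry a large negative total for an action, while regret-matching+ resets it to zero, which is one commonly cited intuition for CFR+'s faster empirical convergence.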

2.2 Extensive Form Games

An extensive form game [von Neumann and Morgenstern 1947] is a sequential decision-making problem where players have imperfect (asymmetric) information. The formal description of an extensive form game is given by a tuple ⟨H, A, P, σ_c, u, ℐ⟩, whose components we describe below.

H is the set of all states h, each of which is a history of actions from the beginning of the game ∅. Given a history h and an action a ∈ A(h), h·a is the new state reached by taking action a at h. To denote a descendant relationship, we say h ⊑ j if j can be reached by some (possibly empty) sequence of actions from h, and h ⊏ j if h ⊑ j and h ≠ j.

We will use Z ⊆ H to denote the set of terminal histories, where the game is over. We will use Z(h) to refer to the set of terminal histories that can be reached from some state h.

A(h) gives the set of valid actions at h. We assume some fixed ordering of the actions, so we can speak about a vector of values or probabilities across actions.

P is the set of players, and P(h) gives the player that is acting at state h, or the special chance player c for states where a chance event occurs according to probabilities specified by σ_c(h). Our work is restricted to two player games, so we will say P = {1, 2}.

The utility of a terminal history z ∈ Z for player p is given by u_p(z). We will restrict ourselves to zero-sum games, where u_1(z) = -u_2(z) for all z ∈ Z.

A player's imperfect information about the game state is represented by a partition ℐ of states based on player knowledge. For all information sets I ∈ ℐ and all states h, j ∈ I, h and j are indistinguishable to the acting player P(h) = P(j), with the same legal actions A(h) = A(j). Given this equality, we can reasonably talk about P(I) and A(I) for any I. For any h, we will use I(h), with h ∈ I(h), to refer to the information set containing h. It is convenient to group information sets by the acting player, so we will use ℐ_p to refer to player p's information sets.

We will also restrict ourselves to extensive form games where players have perfect recall. Informally, a player has perfect recall if they do not forget anything they once knew: for all states h, j in some information set I, both h and j passed through the same sequence of player P(I) information sets from the beginning of the game ∅, and made the same player actions.

A player strategy σ_p : ℐ_p → Δ_{A(I)} gives a probability distribution σ_p(I) over legal actions for player p information sets. For convenience, let σ_p(I, a) denote the probability of action a under σ_p(I). A strategy profile σ = (σ_1, σ_2) is a tuple of strategies for both players. Given a profile σ, we will use σ_{-p} to refer to the strategy of p's opponent.

Because states are sequences of actions, we frequently need to refer to various products of strategy action probabilities. Given a strategy profile σ, and writing σ(j, a) for the probability that the actor at j (player P(j), or chance according to σ_c) selects action a,

\pi^{\sigma}(h) = \prod_{j \cdot a \,\sqsubseteq\, h} \sigma(j, a)    (6)

refers to the probability of a game reaching state h when players sample actions according to σ and chance events occur according to σ_c.

\pi^{\sigma}(h, z) = \prod_{h \,\sqsubseteq\, j,\; j \cdot a \,\sqsubseteq\, z} \sigma(j, a)    (7)

refers to the probability of a game reaching z given that h was reached.

\pi_p^{\sigma}(h) = \prod_{j \cdot a \,\sqsubseteq\, h,\; P(j) = p} \sigma(j, a), \qquad \pi_{-p}^{\sigma}(h) = \prod_{j \cdot a \,\sqsubseteq\, h,\; P(j) \neq p} \sigma(j, a)    (8)

refer to the probabilities of player p, or of all actors but p, making the actions to reach h, given that the remaining actors made the actions in h. Note that there is a slight difference in the meaning of the -p label here, with π^σ_{-p} considering actions by both player p's opponent and chance, whereas σ_{-p} refers to the strategy of p's opponent.

\pi_p^{\sigma}(h, z) = \prod_{h \,\sqsubseteq\, j,\; j \cdot a \,\sqsubseteq\, z,\; P(j) = p} \sigma(j, a)    (9)

refers to the probability of player p making the actions to reach z, given that h was reached and that p's opponent and chance make the actions to reach z. There are a few useful relationships:

\pi^{\sigma}(h) = \pi_p^{\sigma}(h)\, \pi_{-p}^{\sigma}(h), \qquad \pi^{\sigma}(z) = \pi^{\sigma}(h)\, \pi^{\sigma}(h, z), \qquad \pi_p^{\sigma}(z) = \pi_p^{\sigma}(h)\, \pi_p^{\sigma}(h, z) \quad \text{for } h \sqsubseteq z    (10)

The expected utility of a strategy profile σ is

u_p(\sigma) = \sum_{z \in Z} \pi^{\sigma}(z)\, u_p(z)    (11)

The counterfactual value of a history h or an information set I is defined as

v_p(\sigma, h) = \sum_{z \in Z(h)} \pi_{-p}^{\sigma}(h)\, \pi^{\sigma}(h, z)\, u_p(z), \qquad v_p(\sigma, I) = \sum_{h \in I} v_p(\sigma, h)    (12)

For later convenience, we will assume that for each player p there exists an information set I_{∅,p} at the beginning of the game, containing a single state with a single action, leading to the rest of the game. This lets us say that v_p(σ, I_{∅,p}) = u_p(σ).

Given a sequence of strategies σ_p^{t_1}, …, σ_p^{t_2}, we denote the average strategy from t_1 to t_2 as

\bar{\sigma}_p^{[t_1, t_2]}(I, a) = \frac{\sum_{t=t_1}^{t_2} \pi_p^{\sigma^t}(I)\, \sigma_p^t(I, a)}{\sum_{t=t_1}^{t_2} \pi_p^{\sigma^t}(I)}, \qquad \text{where } \pi_p^{\sigma}(I) = \sum_{h \in I} \pi_p^{\sigma}(h)    (13)
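
In an implementation, the average strategy of Equation 13 is usually maintained incrementally: each iteration adds the current policy at an information set, weighted by the updating player's own reach probability. A generic sketch (our own, not from the paper; the weight argument anticipates the weighted average used by CFR+ in Section 2.3):

    def accumulate_average(strategy_sum, reach_p, current_policy, weight=1.0):
        # Add one iteration's contribution to the numerator of Eq. 13.
        #   reach_p        -- pi_p^{sigma^t}(I), the updating player's reach probability of I
        #   current_policy -- sigma_p^t(I, .), the current policy at the information set
        #   weight         -- 1.0 for a uniform average; t for a linearly weighted average
        for a, prob in enumerate(current_policy):
            strategy_sum[a] += weight * reach_p * prob
        return strategy_sum

    def normalise(strategy_sum):
        # Recover the average strategy (the ratio in Eq. 13) from the accumulated sums.
        total = sum(strategy_sum)
        if total > 0:
            return [x / total for x in strategy_sum]
        return [1.0 / len(strategy_sum)] * len(strategy_sum)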

Given a sequence of strategy profiles σ^1, …, σ^T, we denote the average player regret as

R_p^T = \frac{1}{T} \max_{\sigma_p^*} \sum_{t=1}^{T} \Big( u_p(\sigma_p^*, \sigma_{-p}^t) - u_p(\sigma_p^t, \sigma_{-p}^t) \Big)    (14)

The exploitability of a strategy profile is a measurement of how much expected utility each player could gain by switching their strategy:

\mathrm{expl}(\sigma) = \frac{1}{2} \Big( \max_{\sigma_1^*} u_1(\sigma_1^*, \sigma_2) + \max_{\sigma_2^*} u_2(\sigma_1, \sigma_2^*) \Big)    (15)

Achieving zero exploitability – a Nash equilibrium [Nash 1950] – is possible. In two player, zero-sum games, finding a strategy with low exploitability is a reasonable goal for good play.
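
For intuition, exploitability can be computed exactly in a small zero-sum matrix game, where each player's best response is simply the best pure action against the opponent's mixed strategy. The sketch below is our own illustrative example (not from the paper) and uses the averaged form of Equation 15:

    import numpy as np

    def exploitability(payoff, sigma1, sigma2):
        # Exploitability of a profile in a zero-sum matrix game.
        # payoff[i, j] is player 1's utility; player 2's utility is its negation.
        br1 = np.max(payoff @ sigma2)       # player 1's best response value against sigma2
        br2 = np.max(-(sigma1 @ payoff))    # player 2's best response value against sigma1
        return 0.5 * (br1 + br2)

    # Rock-paper-scissors: the uniform profile is a Nash equilibrium (exploitability 0),
    # while always playing rock is exploitable.
    rps = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])
    uniform = np.full(3, 1/3)
    rock = np.array([1.0, 0.0, 0.0])
    print(exploitability(rps, uniform, uniform))  # 0.0
    print(exploitability(rps, rock, rock))        # 1.0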

2.3 CFR and CFR+

CFR and its variant CFR+ are both algorithms for finding an extensive form game strategy with low exploitability. Both are iterative self-play algorithms that track the average of a current strategy, which is itself generated from many loosely coupled regret minimisation problems.

CFR and CFR+ track regret-matching values R^t(I, a) or regret-matching+ values Q^t(I, a), respectively, for all I ∈ ℐ and a ∈ A(I). At time t, CFR and CFR+ use the strategy profile σ^t generated from these stored values by Equation 3 and Equation 5, respectively. When doing alternating updates, with the first update done by player 1, the values used for updating regrets are

v_1^t(I, a) = v_1\big((\sigma_1^t, \sigma_2^t), I, a\big), \qquad v_2^t(I, a) = v_2\big((\sigma_1^{t+1}, \sigma_2^t), I, a\big)    (16)

(where v_p(σ, I, a) denotes the counterfactual value of Equation 12 restricted to the terminal histories reached by taking action a at I),

and the output of CFR is the profile of average strategies (σ̄_1^{[1,T]}, σ̄_2^{[1,T]}), while the output of CFR+ is the profile of weighted average strategies, in which iteration t is given weight t.
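
The full algorithms traverse the game tree and maintain stored values at every information set, but the alternating update pattern and the averaging can be illustrated in the degenerate case where each player has a single information set, i.e. a zero-sum matrix game. The sketch below is our own minimal illustration (not the paper's implementation): player 1 updates first, player 2 then updates against player 1's new policy, and the output is the (optionally weighted) average of the per-iteration policies.

    import numpy as np

    def rm_policy(stored):
        # Regret-matching / regret-matching+ mapping from stored values to a policy.
        positive = np.maximum(stored, 0.0)
        total = positive.sum()
        return positive / total if total > 0 else np.full(len(stored), 1.0 / len(stored))

    def solve_matrix_game(payoff, T, plus=True):
        # Alternating-update regret minimisation on a zero-sum matrix game,
        # the single-information-set case of CFR (plus=False) and CFR+ (plus=True).
        # payoff[i, j] is player 1's utility; player 2's utility is its negation.
        n, m = payoff.shape
        stored1, stored2 = np.zeros(n), np.zeros(m)
        avg1, avg2 = np.zeros(n), np.zeros(m)
        for t in range(1, T + 1):
            sigma1, sigma2 = rm_policy(stored1), rm_policy(stored2)
            # Accumulate the average profile; CFR+ weights iteration t by t.
            weight = t if plus else 1.0
            avg1 += weight * sigma1
            avg2 += weight * sigma2
            # Player 1 updates first, against player 2's current policy ...
            values1 = payoff @ sigma2
            stored1 += values1 - sigma1 @ values1
            if plus:
                stored1 = np.maximum(stored1, 0.0)
            # ... then player 2 updates against player 1's *new* policy.
            new_sigma1 = rm_policy(stored1)
            values2 = -(new_sigma1 @ payoff)
            stored2 += values2 - sigma2 @ values2
            if plus:
                stored2 = np.maximum(stored2, 0.0)
        return avg1 / avg1.sum(), avg2 / avg2.sum()

    # Rock-paper-scissors: both average strategies approach the uniform equilibrium.
    rps = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)
    print(solve_matrix_game(rps, 1000))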

3 Theoretical Results

The folk theorem links regret and exploitability, and there is a clear analogue that links alternating update regret and exploitability.

Theorem 1

Let σ^t be the strategy profile at some time t, and let the regrets be computed using alternating updates, so that player 1 regrets are updated using (σ_1^t, σ_2^t) and player 2 regrets are updated using (σ_1^{t+1}, σ_2^t). If the regrets are bounded by , then the exploitability of the average strategy profile σ̄^{[1,T]} is bounded by .

Proof.

(17)
by Eq. 14
by Eq. 15

Given , we have

by zero-sum

 

Motivated by the trailing sum in Theorem 1, the next results do not relate to exploitability, but to expected utility: regret-matching, CFR, and their variants generate new policies which are no worse than the current policy. Specifically, measured with respect to the current value function, the expected utility of the new policy is never lower than the expected utility of the current policy.

Theorem 2

For any t, let R^t be the stored values used by regret-matching or Q^t the stored values used by regret-matching+, and σ^t, σ^{t+1} be the associated policies. Then for all t, Σ_{a∈A} σ^{t+1}(a) v^t(a) ≥ Σ_{a∈A} σ^t(a) v^t(a).

Theorem 3

Let p be the player that is about to be updated in CFR or CFR+ at some time t. Let σ_p be the current strategy for p, and σ_{-p} be the opponent strategy (σ_2^t or σ_1^{t+1}) used by the values defined in Equation 16. Then u_p(σ_p', σ_{-p}) ≥ u_p(σ_p, σ_{-p}), where σ_p' is the strategy for p after the update.

As an immediate corollary of Theorems 1 and 3, when using alternating updates with either CFR or CFR+, the average strategy has O(1/√T) exploitability. From the original papers, both algorithms have an O(√T) regret bound, and the trailing sum in Theorem 1 is non-negative by Theorem 3. However, this only applies to a uniform average, so we need yet another theorem to bound the exploitability of the CFR+ weighted average.
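
Spelling the corollary out (a sketch of how the pieces combine, in the notation reconstructed above; the precise improvement term is the trailing sum in Theorem 1):

    \mathrm{expl}\big(\bar{\sigma}^{[1,T]}\big)
      \;\le\; R_1^T + R_2^T \;-\; \underbrace{\big(\text{average improvement term}\big)}_{\ge\, 0 \text{ by Theorem 3}}
      \;\le\; R_1^T + R_2^T \;\in\; O\!\big(1/\sqrt{T}\big)

since a total regret bound of O(√T) makes the average regrets of Equation 14 O(1/√T).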

Theorem 4

Let σ^t be the CFR+ strategy profile at some time t, using alternating updates so that player 1 regret-like values are updated using (σ_1^t, σ_2^t) and player 2 regret-like values are updated using (σ_1^{t+1}, σ_2^t). Let be the bound on terminal utilities. Then the exploitability of the weighted average strategy is bounded by , where .

4 Conclusions

The original CFR+ convergence proof makes unsupported use of the folk theorem linking regret to exploitability. We re-make the link between regret and exploitability for alternating updates, and provide a corrected CFR+ convergence proof that recovers the original exploitability bound. The proof uses a specific property of CFR and CFR+: for any single player update, both algorithms are guaranteed to never generate a new strategy which is worse than the current strategy.

With a corrected proof, we once again have a theoretical guarantee of correctness to fall back on, and can safely use CFR+ with alternating updates in search of its strong empirical performance, without worrying that it might be worse than CFR.

The alternating update analogue of the folk theorem also provides some theoretical motivation for the empirically observed benefit of using alternating updates. Exploitability is now bounded by the regret minus the average improvement in expected values. While we proved that the improvement is guaranteed to be non-negative for CFR and CFR+, we would generally expect non-zero improvement on average, with a corresponding reduction in the bound on exploitability.

Appendix A Proofs

We give an updated proof of an exploitability bound for CFR+. There are three steps. First, we show that regret-matching and regret-matching+ never generate a new policy that is worse than the current policy. Next, we show that CFR and CFR+ have the same property for strategies. Finally, we show that the exploitability of the CFR+ weighted average strategy approaches zero.

A.1 Regret-Matching and Regret-Matching+ Properties

We show that when using regret-matching or regret-matching+, once there is at least one positive stored regret or regret-like value, there will always be a positive stored value. As a consequence, the policy will not switch back to the default of uniform random, which lets us show that, with respect to the current values, both algorithms move to a new policy which is no worse than the current policy.

Lemma 5

For any t, let R^t be the stored values used by regret-matching or Q^t the stored values used by regret-matching+, and σ^t be the associated policy. Then for all t where there is some action with a positive stored value, there is some action with a positive stored value at time t+1.

Proof. Given we have

by Eqs. 235 (18)

Now consider . There are two cases:

  1. by Lemma assumption, Eq. 2.14
  2. by Eq. 18
    by Eqs. 2.14

In both cases, such that .   

Lemma 6

For any t, let R^t be the stored values used by regret-matching or Q^t the stored values used by regret-matching+, and σ^t be the associated policy. Then for all and , .

Proof. There are two cases.

  1. For regret-matching, where , we have

    by Eq. 2.1

    For regret-matching+, where , we have

    by Eq. 4
    by Eq. 4

    Therefore for both algorithms we have

  2. by Eqs. 2.14

In both cases, we have .   

Proof of Theorem 2. By Lemma 5 we do not need to consider the case where and . This leaves three cases.

  1. by Eqs. 235
  2. by Eqs. 235
    by Eqs. 2.14

  3. Let

    (19)

    Then we have

    by Eqs. 235 (20)

    Given some ordering of the actions such that , let

    (21)

    Note that , , and , , so that is always well-defined. We can show by induction that for all

    (22)

    For the base case of , we have

    by Eq. 21
    by Eq. 20

    Now assume that Equation 22 holds for some . By construction,

    by Eq. 21 (23)

    For notational convenience, let .

    by Eq. 19
    by Eqs. 21 and 23
    by Eq. 19
    by assumption
    by Lemma 6

    , so by induction Equation 22 holds for all . In particular, we can now say

    by Eq. 21
    by Eq. 20

In all cases, we have .   

A.2 CFR and CFR+ Properties

We start by showing that after a player updates their strategy using CFR or CFR+, the player's counterfactual value does not decrease for any action at any of their information sets. This implies that with the opponent strategy fixed, the expected value of the player's new strategy does not decrease. Given this non-decreasing value and the original CFR regret bounds, we show a bound on exploitability.

Lemma 7

Let p be the player that is about to be updated in CFR or CFR+ at some time t. Let σ_p be the current strategy for p, and σ_{-p} be the opponent strategy (σ_2^t or σ_1^{t+1}) used by Equation 16. Then for all I ∈ ℐ_p and a ∈ A(I), the counterfactual value of action a at I does not decrease when σ_p is replaced by p's updated strategy.

Proof. We will use some additional terminology. Let the terminal states reached from by action be

(24)

and for any descendant state of , we will call the ancestor in

(25)

Let be the set of information sets which are immediate children of given action :

(26)

Note that by perfect recall, for , such that for all : if one state in is reached from by action , all states in are reached from by action . Let the depth of an information set be

(27)

Using this new terminology, we can re-write

by Eq. 2.2
by Eqs. 24 and 25 (28)

We will now show that

(29)

For the base case , consider any such that . Given these assumptions,

by Eqs. 26 and 27
by Eq. 9 (30)

Now consider

by Eq. 28
by Eq. 30
by Eq. 28

Assume the inductive hypothesis, Equation 29, holds for some . If , , Equation 29 trivially holds for . Otherwise, consider any such that . For notational convenience, call the (possibly empty) set of terminal histories in that do not pass through another player information set

(31)

If we partition based on , we end up with sets of terminals passing through different , and possibly some additional terminals . Note that by the induction assumption, because

by Eqs. 26 and 27
(32)

Given this, we have

by Eq. 28