Lazy-CFR: a fast regret minimization algorithm for extensive games with imperfect information

10/10/2018 · by Yichi Zhou et al.

In this paper, we focus on solving two-player zero-sum extensive games with imperfect information. Counterfactual regret minimization (CFR) is the most popular algorithm for solving such games and achieves state-of-the-art performance in practice. However, the performance of CFR is not fully understood, since the regrets observed empirically are much smaller than the upper bound proved by Zinkevich et al. (2008). Another issue is that CFR has to traverse the whole game tree in each round, which is prohibitive in large-scale games. In this paper, we present a novel technique, lazy update, which avoids traversing the whole game tree in CFR. Further, we present a novel analysis of CFR with lazy update. Our analysis can also be applied to the vanilla CFR, resulting in a much tighter regret bound than that proved by Zinkevich et al. (2008). Inspired by lazy update, we further present a novel CFR variant, named Lazy-CFR. Compared to traversing O(|I|) information sets per round in the vanilla CFR, Lazy-CFR needs to traverse only O(√(|I|)) information sets per round while the regret bound remains almost the same, where I denotes the set of all information sets. As a result, Lazy-CFR enjoys a better convergence rate than the vanilla CFR. Experimental results consistently show that Lazy-CFR outperforms the vanilla CFR significantly.


1 Introduction

Extensive games provide a mathematical framework for modeling sequential decision-making problems with imperfect information. They are widely used in economics, negotiation and security. In this paper, we focus on solving two-player zero-sum extensive games with imperfect information (TEGI). In a TEGI, there is an environment with uncertainty and two players on opposite sides (Koller & Megiddo, 1992).

Counterfactual regret minimization (CFR) (Zinkevich et al., 2008) provides a state-of-the-art algorithm for solving TEGIs and has led to much progress in practice. The most famous application of CFR is Libratus, the first program to defeat top human players in heads-up no-limit Texas Hold'em poker (Brown & Sandholm, 2017). CFR is based on the fact that minimizing the regrets of both players makes the time-averaged strategy profile approach a Nash equilibrium (NE) (Zinkevich et al., 2008). Furthermore, CFR bounds the original regret by a summation of many immediate counterfactual regrets, each of which corresponds to an information set (infoset). These immediate regrets are defined via counterfactual rewards, and they can be iteratively minimized by a standard online learning algorithm called regret matching (RM) (Blackwell, 1956).

Though CFR has succeeded in practice, its behavior is not fully understood. Specifically, experiments have shown that the regret is significantly smaller than the upper bound proved in (Zinkevich et al., 2008), which suggests that a tighter theoretical analysis of the regret bound is possible. Besides, a more crucial limitation of CFR is that it requires traversing the whole game tree in each round, which is time-consuming in large-scale games; this is because we have to apply RM to every immediate regret in each round. Though various attempts have been made to avoid traversing the whole game tree in each round, often leading to significant speedups over the vanilla CFR in practice, they either lack theoretical guarantees on the running time or can even degenerate in the worst case (Brown & Sandholm, 2015, 2016; Lanctot et al., 2009).

In this paper, we present a novel technique, called lazy update, which provides a unified framework for avoiding traversing the whole game tree in CFR. For each infoset, CFR with lazy update divides the time horizon into disjoint subsets of consecutive rounds, which we call segments, and keeps the strategy on that infoset unchanged within each segment. That is, CFR with lazy update updates the strategy only at the start of each segment, which saves computation. It is noteworthy that our framework includes the vanilla CFR as a degenerate case in which every segment has length 1. Moreover, we present a novel analysis of the regret of CFR with lazy update. Our analysis is also based on the immediate regrets, as in (Zinkevich et al., 2008). The difference is that, in contrast to (Zinkevich et al., 2008)'s analysis, which treats each immediate regret independently, our analysis reveals the correlation among them via the underlying optimal strategy. Specifically, we prove that the immediate regrets cannot all be very large simultaneously. As an application of our analysis, we refine the regret bound of CFR proved in (Zinkevich et al., 2008), which is of order $O(|\mathcal{I}_i|\sqrt{AT})$, to a substantially tighter bound that depends on the depth $D$ of the game tree rather than linearly on $|\mathcal{I}_i|$, where $\mathcal{I}_i$ is the set of infosets of player $i$, $A$ is the number of actions and $T$ is the length of time.

Obviously, the performance of CFR with lazy update relies on the underlying segmentation procedure. Intuitively, if the strategy on an infoset at some round has only a small effect on the overall regret, it is not necessary to update it at that round. The key observation is that, in CFR, the importance of the strategy at some round can be measured by the norm of the corresponding counterfactual reward: if the norm is very small, the regret will not increase much if we do not update the strategy. Furthermore, in a TEGI, the norm of the counterfactual rewards on most infosets is very small (see Sec. 3 for a formal statement). The combination of this observation and the framework of lazy update naturally leads to a novel variant of CFR, named Lazy-CFR, which dramatically outperforms the vanilla CFR in both theory and practice. In our final algorithm, Alg. 1, we only need to update the strategies on $O(\sqrt{|\mathcal{I}|})$ infosets in each round, which is significantly smaller than the $O(|\mathcal{I}|)$ infosets updated by the vanilla CFR, while the regret of Lazy-CFR remains of the same order up to a factor that depends only on the depth $D$. Accordingly, Lazy-CFR needs much less running time than the vanilla CFR to compute an $\epsilon$-Nash equilibrium; roughly, we accelerate CFR by a factor of about $\sqrt{|\mathcal{I}|}$ (up to factors of $D$), which is a dramatic improvement in large-scale games, since $D$ (the depth of the game tree) is usually of order $\log|\mathcal{I}|$.

We empirically evaluate our algorithm on a standard benchmark, Leduc Hold'em (Brown & Sandholm, 2015). We compare our algorithm with the vanilla CFR, MC-CFR (Lanctot et al., 2009), and CFR+ (Bowling et al., 2017). It is noteworthy that the idea behind Lazy-CFR can also be applied to CFR+; the resulting algorithm is named Lazy-CFR+. We evaluate Lazy-CFR+ empirically but do not analyze its regret bound. Experiments show that Lazy-CFR and Lazy-CFR+ dramatically improve the convergence rates of CFR and CFR+ in practice, respectively.

The rest of this paper is organized as follows. Sec. 2 reviews the necessary preliminaries. In Sec. 3, we present the idea of lazy update together with its analysis, and our algorithm is presented in Sec. 4. We discuss related work in Sec. 5 and present our experimental results in Sec. 6.

2 Notations and Preliminaries

We first introduce the notation and definitions of extensive games and TEGIs. Then we introduce regret minimization, a concept from online learning, and discuss the connection between TEGIs and regret minimization; this connection underlies the powerful CFR algorithm. Finally, we close this section by discussing the details of CFR.

2.1 Extensive games

Extensive games (see Osborne & Rubinstein (1994), page 200, for a formal definition) compactly model decision-making problems with sequential interactions among multiple agents. An extensive game can be represented by a game tree of histories (a history is a sequence of past actions). Suppose that there are $N$ players participating in an extensive game, and let $c$ denote the chance player, which is usually used to model the uncertainty in the environment. A player function $P$ assigns a player $P(h)\in\{1,\dots,N\}\cup\{c\}$ to each non-terminal history $h$ in the game tree; that is, $P(h)$ is the player who takes an action after $h$. Each player $i$ receives a reward $u_i(z)$ at a terminal history $z$.

Let $A(h)$ denote the set of valid actions of $P(h)$ after $h$, and let $A=\max_h|A(h)|$. A strategy $\sigma_i$ of player $i$ is a function which assigns a distribution over $A(h)$ to every $h$ with $P(h)=i$. A strategy profile $\sigma$ consists of the strategy of each player, i.e., $\sigma=(\sigma_1,\dots,\sigma_N)$. We will use $\sigma_{-i}$ to refer to all the strategies in $\sigma$ except $\sigma_i$, and we use the pair $(\sigma_i,\sigma_{-i})$ to denote the full strategy profile. In games with imperfect information, the actions of the other players are only partially observable to a player $i$. So for player $i$, the histories are partitioned into disjoint information sets (infosets) $\mathcal{I}_i$; two histories in the same infoset $I\in\mathcal{I}_i$ are not distinguishable to player $i$. Thus, $\sigma_i$ should assign the same distribution over actions to all histories in an infoset $I$. With a little abuse of notation, we let $\sigma_i(I)$ denote the strategy of player $i$ on infoset $I$.

Moreover, let $\pi^{\sigma}(h)$ denote the probability of arriving at a history $h$ if the players take actions according to strategy profile $\sigma$. Obviously, we can decompose $\pi^{\sigma}(h)$ into the product of each player's contribution, that is, $\pi^{\sigma}(h)=\prod_{j\in\{1,\dots,N\}\cup\{c\}}\pi^{\sigma}_j(h)$. Similarly, we can define $\pi^{\sigma}(I)$ as the probability of arriving at an infoset $I$ and $\pi^{\sigma}_i(I)$ as the corresponding contribution of player $i$. Let $\pi^{\sigma}_{-i}(h)$ and $\pi^{\sigma}_{-i}(I)$ denote the product of the contributions of all players except player $i$ on arriving at $h$ and $I$, respectively.
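For concreteness, this decomposition can be written as follows, using standard notation for CFR-style analyses (the prefix relation $h'a \sqsubseteq h$ ranges over all prefixes of $h$):

```latex
\pi^{\sigma}(h)
  = \prod_{h'a \,\sqsubseteq\, h} \sigma_{P(h')}(h', a)
  = \underbrace{\prod_{\substack{h'a \,\sqsubseteq\, h \\ P(h') = i}} \sigma_i(h', a)}_{\pi^{\sigma}_i(h)}
    \cdot
    \underbrace{\prod_{\substack{h'a \,\sqsubseteq\, h \\ P(h') \neq i}} \sigma_{P(h')}(h', a)}_{\pi^{\sigma}_{-i}(h)},
\qquad
\pi^{\sigma}(I) = \sum_{h \in I} \pi^{\sigma}(h).
```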

In game theory, the solution of a game usually refers to a Nash equilibrium (NE) (Osborne & Rubinstein, 1994). With a slight abuse of notation, let $u_i(\sigma)$ denote the expected reward of player $i$ if all players take actions according to $\sigma$. An NE is a strategy profile $\sigma^*=(\sigma_1^*,\dots,\sigma_N^*)$ in which every $\sigma_i^*$ is optimal given $\sigma_{-i}^*$, that is, $u_i(\sigma_i^*,\sigma_{-i}^*)\ge u_i(\sigma_i,\sigma_{-i}^*)$ for every player $i$ and every strategy $\sigma_i$.

In this paper, we focus on computing an approximation of an NE, namely an $\epsilon$-NE, since an $\epsilon$-NE can usually be computed much faster (Brown & Sandholm, 2017; Zinkevich et al., 2008). An $\epsilon$-NE is a strategy profile $\sigma$ such that:

$u_i(\sigma_i,\sigma_{-i}) \;\ge\; \max_{\sigma_i'} u_i(\sigma_i',\sigma_{-i}) - \epsilon \quad \text{for every player } i.$

With the above notation, a two-player zero-sum extensive game with imperfect information (TEGI) is an extensive game with $N=2$ and $u_1(z)=-u_2(z)$ for every terminal history $z$. The $\epsilon$-NE of a TEGI can be computed efficiently by regret minimization, as discussed later in this section.

2.2 Regret minimization

Now we introduce regret, a core concept in online learning (Cesa-Bianchi & Lugosi, 2006). Many powerful online learning algorithms can be framed as minimizing some kind of regret and are therefore known as regret minimization algorithms. Generally, the regret is defined as follows:

Definition 1 (Regret).

Consider the case where a player takes actions repeatedly. At each round $t$, the player selects an action $a^t\in\mathcal{A}$, where $\mathcal{A}$ is the set of valid actions. At the same time, the environment (which may be an adversary in online learning) selects a reward function $r^t:\mathcal{A}\to\mathbb{R}$. Then, the overall reward of the player is $\sum_{t=1}^{T}r^t(a^t)$, where $T$ is the length of time, and the regret is defined as:

$R^T \;=\; \max_{a\in\mathcal{A}}\sum_{t=1}^{T}r^t(a)-\sum_{t=1}^{T}r^t(a^t).$

One of the most famous examples of online learning is online linear optimization (OLO), in which $r^t$ is a linear function. If the decision set is the set of distributions over some discrete action set, an OLO can be solved by a standard regret minimization algorithm called regret matching (RM) (Blackwell, 1956; Abernethy et al., 2011).

CFR employs RM as a sub-procedure, so we summarize OLO and RM as follows:

Definition 2 (Online linear optimization (OLO) and regret matching (RM)).

Consider the online learning problem with linear rewards. In each round $t$, an agent plays a mixed strategy $\sigma^t\in\Sigma$, where $\Sigma$ is the set of probability distributions over the action set $\mathcal{A}$, while an adversary selects a reward vector $r^t\in\mathbb{R}^{|\mathcal{A}|}$. The reward of the agent at this round is $\langle\sigma^t,r^t\rangle$, where $\langle\cdot,\cdot\rangle$ denotes the inner product. The goal of the agent is to maximize the cumulative reward, which is equivalent to minimizing the following regret:

$R^T \;=\; \max_{\sigma\in\Sigma}\sum_{t=1}^{T}\langle\sigma,r^t\rangle-\sum_{t=1}^{T}\langle\sigma^t,r^t\rangle.$

Let $R^t(a)=\sum_{\tau=1}^{t}\big(r^\tau(a)-\langle\sigma^\tau,r^\tau\rangle\big)$. RM picks $\sigma^{t+1}$ as follows:

$\sigma^{t+1}(a)\;=\;\frac{[R^t(a)]^+}{\sum_{a'\in\mathcal{A}}[R^t(a')]^+}$ if $\sum_{a'\in\mathcal{A}}[R^t(a')]^+>0$, and the uniform distribution otherwise, where $[x]^+=\max(x,0)$.    (1)

According to the result in (Blackwell, 1956), RM enjoys the following regret bound:

$R^T \;\le\; \sqrt{|\mathcal{A}|\sum_{t=1}^{T}\|r^t\|_\infty^2}.$    (2)
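To make Definition 2 concrete, here is a minimal sketch of RM in Python (our own illustration, not code from the paper); the update follows Eq. (1), and the played strategy is recomputed from the clipped cumulative regrets:

```python
import numpy as np

class RegretMatcher:
    """Minimal regret matching (RM) for an online linear optimization problem."""

    def __init__(self, num_actions):
        self.num_actions = num_actions
        self.cum_regret = np.zeros(num_actions)   # R^t(a) for each action a

    def strategy(self):
        """Current strategy: positive regrets normalized, uniform if none are positive."""
        pos = np.maximum(self.cum_regret, 0.0)
        total = pos.sum()
        if total <= 0.0:
            return np.full(self.num_actions, 1.0 / self.num_actions)
        return pos / total

    def observe(self, reward_vec):
        """After the adversary reveals r^t, update the cumulative regrets R^t(a)."""
        sigma = self.strategy()                    # the strategy that was just played
        expected = float(np.dot(sigma, reward_vec))
        self.cum_regret += np.asarray(reward_vec, dtype=float) - expected
```

Playing `strategy()` each round and then calling `observe()` with the revealed reward vector yields the guarantee in Eq. (2).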

2.3 Counterfactual regret minimization (CFR)

CFR is built on a connection between $\epsilon$-NE and regret minimization. This connection is established naturally by viewing repeated play of a TEGI as an online learning problem. It is noteworthy that there are two online learning problems in a TEGI, one for each player.

Suppose player $i$ plays $\sigma_i^t$ at time step $t$, and let $\sigma^t=(\sigma_1^t,\sigma_2^t)$. Consider the online learning problem for player $i$ obtained by setting the decision at round $t$ to $\sigma_i^t$ and the reward to $u_i(\sigma_i^t,\sigma_{-i}^t)$. The regret of player $i$ is $R_i^T=\max_{\sigma_i}\sum_{t=1}^{T}u_i(\sigma_i,\sigma_{-i}^t)-\sum_{t=1}^{T}u_i(\sigma_i^t,\sigma_{-i}^t)$.

Furthermore, define the time-averaged strategy $\bar\sigma_i^T$ as follows: for each infoset $I\in\mathcal{I}_i$ and action $a$,

$\bar\sigma_i^T(I)(a)\;=\;\frac{\sum_{t=1}^{T}\pi_i^{\sigma^t}(I)\,\sigma_i^t(I)(a)}{\sum_{t=1}^{T}\pi_i^{\sigma^t}(I)}.$
It is well-known that (Nisan et al., 2007):

Lemma 1.

If $R_i^T/T\le\epsilon_i$ for $i=1,2$, then $(\bar\sigma_1^T,\bar\sigma_2^T)$ is an $(\epsilon_1+\epsilon_2)$-NE.

However, it is hard to apply regret minimization algorithms directly to TEGIs, since the reward function is non-convex with respect to the behavioral strategy $\sigma_i$. One approach, as in (Gordon, 2007), is to first transform a TEGI into a normal-form game and then apply the Lagrangian-Hedge algorithm (Gordon, 2005). However, this approach is time-consuming, since the dimension of the corresponding normal-form game is exponential in the number of infosets. To address this problem, Zinkevich et al. (2008) propose a novel decomposition of the regret into a summation of immediate regrets (Zinkevich et al. (2008) directly upper bounded the regret by the counterfactual regret, i.e., Eq. (4), and omitted the proof of Eq. (2.3); we present a proof of Eq. (2.3) in Appendix B):

(3)

where $\sigma|_{I\to a}$ denotes the strategy generated by modifying $\sigma$ so that action $a$ is taken at infoset $I$, and $u_i(\sigma,I)$ denotes the expected reward of player $i$ conditioned on arriving at infoset $I$ when the strategy $\sigma$ is executed.

Further, Zinkevich et al. (2008) upper bound Eq. (2.3) by the counterfactual regret:

$R_i^T \;\le\; \sum_{I\in\mathcal{I}_i}\Big[\max_{a\in A(I)}\sum_{t=1}^{T}\pi_{-i}^{\sigma^t}(I)\big(u_i(\sigma^t|_{I\to a},I)-u_i(\sigma^t,I)\big)\Big]^+.$    (4)

For convenience, we call $r_I^t(a)=\pi_{-i}^{\sigma^t}(I)\,u_i(\sigma^t|_{I\to a},I)$ the counterfactual reward of action $a$ at infoset $I$ at round $t$.

Notice that Eq. (4) essentially decomposes the regret minimization of a TEGI into $|\mathcal{I}_i|$ OLOs, one for each infoset. Thus, in each round, we can apply RM directly to each individual OLO to minimize the counterfactual regret, and the original regret is also minimized since the counterfactual regret is an upper bound. However, we have to traverse the whole game tree, which is very time-consuming in large-scale games. Furthermore, with Eq. (2), Eq. (4) and the fact that the norm of a counterfactual reward vector is bounded by the maximum absolute reward, we can upper bound the counterfactual regret by $O(|\mathcal{I}_i|\sqrt{AT})$.
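To illustrate why vanilla CFR must touch every infoset, here is a minimal sketch of one CFR pass for one player in Python, written against a hypothetical game-tree interface (`is_terminal`, `utility`, `current_player`, `infoset_key`, `actions`, `child`) of our own choosing; it sketches the standard procedure described above (chance nodes are omitted for brevity) and is not the authors' implementation:

```python
import numpy as np

def cfr_round(node, player, reach_i, reach_minus_i, regret, strategy_sum):
    """One recursive pass of vanilla CFR for `player`.

    `regret` and `strategy_sum` map infoset keys to numpy arrays.
    Returns the counterfactual value of `node` for `player`.
    """
    if node.is_terminal():
        return node.utility(player)

    actions = node.actions()
    key = node.infoset_key()
    # Current strategy at this infoset from regret matching (Eq. (1)).
    pos = np.maximum(regret.setdefault(key, np.zeros(len(actions))), 0.0)
    sigma = pos / pos.sum() if pos.sum() > 0 else np.full(len(actions), 1.0 / len(actions))

    if node.current_player() != player:
        # Opponent node: only the opponents' reach probability changes.
        return sum(sigma[k] * cfr_round(node.child(a), player, reach_i,
                                        reach_minus_i * sigma[k], regret, strategy_sum)
                   for k, a in enumerate(actions))

    # Player's node: compute action values, update regrets and the average strategy.
    action_vals = np.array([cfr_round(node.child(a), player, reach_i * sigma[k],
                                      reach_minus_i, regret, strategy_sum)
                            for k, a in enumerate(actions)])
    node_val = float(np.dot(sigma, action_vals))
    regret[key] += reach_minus_i * (action_vals - node_val)   # counterfactual regrets
    strategy_sum.setdefault(key, np.zeros(len(actions)))
    strategy_sum[key] += reach_i * sigma                       # for the averaged strategy
    return node_val
```

Running this once per player per round, starting from the root with both reach probabilities equal to 1, is exactly the full-tree traversal whose cost Lazy-CFR aims to reduce.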

However, updating the strategy on every infoset in each round is not cost-effective. The regret is determined by the norm of the counterfactual reward vector at each infoset (see Eq. (2)), and on most infosets this norm is very small, since the counterfactual reward is scaled by the reach probability $\pi_{-i}^{\sigma^t}(I)$. In Sec. 3, we will show how to avoid the heavy updates in CFR by exploiting this property.

3 Lazy update and analysis

In this section, we present the idea of lazy update. We first discuss lazy update in the context of OLOs and then extend it to extensive games. After that, we analyze the regret bound of CFR with lazy update in Sec. 3.2. Our analysis is novel since it reveals the correlation among the immediate regrets and explicitly encodes the structure of the game tree. The regret bound is presented in our main theorem, Thm. 1. Furthermore, Thm. 1 can also be used to analyze the regret of the vanilla CFR; by applying it, we refine the regret bound of the vanilla CFR.

3.1 Lazy update for OLOs

We now introduce lazy update for the OLOs of Defn. 2. We call an online learning algorithm for OLOs a lazy-update algorithm if:

  • It divides the time steps $\{1,\dots,T\}$ into disjoint subsets of consecutive elements, $T_1,\dots,T_K$, where $T_k=\{t_k,t_k+1,\dots,t_{k+1}-1\}$ and $1=t_1<t_2<\dots<t_{K+1}=T+1$. For convenience, we call these subsets segments.

  • It updates the strategy only at the time steps $t_k$, $k=1,\dots,K$, and keeps it the same within each segment. That is, the OLO shrinks into a new OLO whose length of time is $K$, and $\hat r^k=\sum_{t\in T_k} r^t$, where $r^t$ is the vector selected by the adversary in the original OLO at time step $t$ and $\hat r^k$ is the vector selected by the adversary in the shrunken OLO at time step $k$.

Figure 1: An illustration of RM with lazy update for OLOs. On the bottom is the standard RM; on the top is RM with lazy update. The lengths of time of the original OLO and the shrunken OLO are $T$ and $K$, respectively. When the segments are chosen so that the accumulated reward vectors $\hat r^k$ are comparable in norm to the individual $r^t$, the two regrets are almost the same.

Suppose we update at the beginning of each segment by RM. On the one hand, RM with lazy update does not need to update the strategy at every round. On the other hand, if the division is reasonable, that is, if $\sum_{k}\|\hat r^k\|_\infty^2$ is comparable to $\sum_{t}\|r^t\|_\infty^2$, then according to Eq. (2) the regrets of lazy-update RM and vanilla RM are of a similar order. See Fig. 1 for an illustration of RM with lazy update. Formally, the regret of lazy-update RM is bounded by:

$R^T \;\le\; \sqrt{|\mathcal{A}|\sum_{k=1}^{K}\|\hat r^k\|_\infty^2}.$    (5)

It is noteworthy that, for a generic OLO, the running time of lazy-update RM is the same as applying RM directly, because we still have to compute the accumulated reward vectors $\hat r^k$, which is time-consuming. Fortunately, this problem can be addressed in TEGIs; see Sec. 4 for how to overcome it by exploiting the structure of the game tree.
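A minimal sketch of RM with lazy update, assuming the segment boundaries are supplied by the caller (in Lazy-CFR they are chosen adaptively from accumulated reach probabilities); this is our own illustration of the framework above:

```python
import numpy as np

class LazyRegretMatcher:
    """RM with lazy update: the strategy is refreshed only at the start of a segment."""

    def __init__(self, num_actions):
        self.cum_regret = np.zeros(num_actions)
        self.pending = np.zeros(num_actions)      # \hat r^k accumulated within the current segment
        self.sigma = np.full(num_actions, 1.0 / num_actions)

    def observe(self, reward_vec):
        """Accumulate the adversary's reward vector without touching the strategy."""
        self.pending += np.asarray(reward_vec, dtype=float)

    def refresh(self):
        """Close the current segment: fold the accumulated rewards into the regrets
        and recompute the strategy by regret matching."""
        self.cum_regret += self.pending - float(np.dot(self.sigma, self.pending))
        self.pending[:] = 0.0
        pos = np.maximum(self.cum_regret, 0.0)
        self.sigma = pos / pos.sum() if pos.sum() > 0 else np.full(len(pos), 1.0 / len(pos))
```

Between two calls to `refresh()`, the played strategy is constant, so the regret incurred over the segment equals the regret of one round of the shrunken OLO with reward $\hat r^k$.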

3.2 Lazy update for TEGIs

We now extend the idea of lazy update to TEGIs. According to Eq. (4), the regret minimization procedure can be divided into $|\mathcal{I}_i|$ OLOs, one for each infoset. For each infoset $I$, we divide the time steps into segments $T_1^I,\dots,T_{K_I}^I$, where $T_k^I=\{t_k^I,\dots,t_{k+1}^I-1\}$. Let $\hat r_I^k(a)$ denote the summation of the counterfactual rewards of action $a$ over segment $T_k^I$, and let $\hat r_I^k$ denote the vector consisting of the $\hat r_I^k(a)$. Similar to lazy update for OLOs, we only update the strategy on infoset $I$ at the time steps $t_k^I$ according to RM. Let $\sigma_i^k(I)$ denote the strategy after the $k$-th update on infoset $I$, that is, $\sigma_i^t(I)=\sigma_i^k(I)$ for $t\in T_k^I$. According to Eq. (5), the counterfactual regret on infoset $I$ is bounded by $\sqrt{|A(I)|\sum_{k=1}^{K_I}\|\hat r_I^k\|_\infty^2}$.

Now we analyze the upper bound on the overall regret of a TEGI, i.e., Eq. (2.3), for the above lazy-update algorithm. Our main theorem is presented in Thm. 1. Our analysis is also based on the immediate regrets, as in (Zinkevich et al., 2008), and the improvements are in the following two aspects:

First, instead of providing an upper bound on the counterfactual regret in Eq. (4), we directly analyze the original regret, Eq. (2.3). This allows us to analyze the effect of the optimal strategy's reach probabilities $\pi_i^{\sigma^*}(I)$ on the regret. To see intuitively how this improves the regret bound, consider the case where the counterfactual regret is large on an infoset $I$. Even so, it makes the overall regret increase dramatically only if $\pi_i^{\sigma^*}(I)$ is also large, because the immediate regret of infoset $I$ is weighted by $\pi_i^{\sigma^*}(I)$ and the overall regret is the summation of the immediate regrets. Moreover, it is impossible for $\pi_i^{\sigma^*}(I)$ to be very large on all infosets, since it is the probability of arriving at $I$ contributed by player $i$ (see Corollary 1 for a formal statement). Consequently, the immediate regrets cannot all be very large at the same time.

Second, we upper bound the regret by quantities that reflect the structure of the underlying game tree (the quantities appearing in Eq. (6)), so that a more detailed analysis of these quantities leads to a tighter regret bound.

Based on the above ideas, we prove the following theorem, which provides an upper bound on the overall regret of a TEGI, i.e., Eq. (2.3), for a lazy-update algorithm.

Theorem 1.

The regret of CFR with lazy update can be bounded as follows:

(6)
(7)


Proof.

We defer the proof to Appendix B. ∎

According to Thm. 1, we can bound the regret by bounding the quantities on the right-hand side of Eq. (6). In the sequel of this paper, we upper bound them using the following inequality:

(8)

The proof of Eq. (8) relies on a mild assumption, which we introduce first.

Assumption 1.

The tree of infosets of each player is a full $M$-ary tree.

Assumption 1 naturally leads to the following corollary:

Corollary 1.

If a TEGI satisfies Assumption 1, then $\sum_{I\in\mathcal{I}_i}\pi_i^{\sigma}(I)\le D$ and $\sum_{I\in\mathcal{I}_i}\pi^{\sigma}(I)\le D$ for any strategy profile $\sigma$.

Proof.

We first prove the bound on $\sum_{I\in\mathcal{I}_i}\pi_i^{\sigma}(I)$ by mathematical induction over the depth of the infoset tree. The key point is that the contributions of player $i$ to the children of an infoset $I$ sum to at most $\pi_i^{\sigma}(I)$, since $\sigma_i(I)$ is a probability distribution over the actions at $I$. Hence the sum of $\pi_i^{\sigma}$ over the infosets at each depth is at most $1$, and summing over the $D$ depths gives the bound. The second claim can be proved in the same way. ∎

Now we can present the proof of Eq. (8).

Proof of Eq. (8).

With straightforward computations and Corollary 1, we have:

A tighter regret bound for CFR: It is easy to see that the vanilla CFR is a special case of lazy update in which every segment has length 1, i.e., $T_k^I=\{k\}$ for every infoset $I$. So we can apply Thm. 1 and Eq. (8) to CFR directly, which leads to a tighter regret bound. Formally, we prove a regret bound for CFR that is much tighter than the $O(|\mathcal{I}_i|\sqrt{AT})$ bound proved in (Zinkevich et al., 2008).

Lemma 2.

The regret of the vanilla CFR is bounded by the bound of Thm. 1 instantiated with segments of length 1, which is much tighter than $O(|\mathcal{I}_i|\sqrt{AT})$.

Proof.

We only need to bound the quantity in Eq. (8) and then insert it into Thm. 1. By directly applying Eq. (8) and Corollary 1, we have

4 Lazy-CFR

1:  Input: a two-player zero-sum extensive game.
2:  Initialize the accumulated reward vector $\hat r_I \leftarrow \mathbf{0}$ for all infosets $I$.
3:  while $t \le T$ do
4:     $Q \leftarrow \{I_0\}$, where $I_0$ is the root of the infoset tree of the player being updated.
5:     while $Q$ is not empty do
6:        Pop an infoset $I$ from $Q$.
7:        Update the strategy on $I$ via RM, using the accumulated reward vector $\hat r_I$; reset $\hat r_I$.
8:        For each child $I'$ of $I$, if the probability accumulated at $I'$ since its last update is at least $1$, push $I'$ into $Q$.
9:     end while
10:     Update the accumulated reward vectors $\hat r_I$ for all infosets affected by the strategy changes of this round.
11:  end while
Algorithm 1 Lazy-CFR

In this section, we discuss how to design an efficient variant of CFR within the framework of lazy update.

Intuitively, an efficient $\epsilon$-NE solver for TEGIs based on minimizing the regret of the OLO on each infoset should satisfy the following two conditions. The first is to prevent the overall regret from growing too fast; according to Thm. 1, it suffices to keep the accumulated counterfactual reward $\hat r_I^k$ of every segment small, which in the framework of lazy update can be done by keeping the probability accumulated within each segment small. The second is to update as few infosets as possible in each round, which is equivalent to keeping the number of segments small.

It is easy to see that the longer the segments are, the fewer updates are needed, but the larger the accumulated rewards become, so we have to balance this tradeoff. The key observation is that the probability mass that accumulates across the infosets in a single round is significantly smaller than $|\mathcal{I}_i|$; more specifically, under Assumption 1, only on the order of $\sqrt{|\mathcal{I}_i|}$ infosets can accumulate a non-negligible probability per round. Thus, for most infosets $I$, $\pi_{-i}^{\sigma}(I)$ is extremely small, so that not updating the strategy on them at these time steps will not make the regret increase dramatically.

Our final algorithm is pretty simple. At time step $t$, let $t_0(I)$ denote the last time step before $t$ at which we updated the strategy on infoset $I$. Let $m_t(I)=\sum_{\tau=t_0(I)}^{t-1}\pi_{-i}^{\sigma^{\tau}}(I)$ denote the summation of the probabilities of arriving at $I$ after $t_0(I)$, contributed by all players except $i$. Let $\mathcal{T}(I)$ denote the subtree rooted at infoset $I$ (here the tree is composed of infosets rather than histories). We simply update the strategies on infosets recursively as follows: after updating the strategy on an infoset $I$, we keep on updating the strategies on the infosets $I'\in\mathcal{T}(I)$ with $m_t(I')\ge 1$. We summarize our algorithm in Alg. 1.
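The following short Python sketch illustrates the recursive selection rule of Alg. 1 (the `children` attribute and the `accumulated_reach` map are a hypothetical interface of our own, and the threshold value of 1 is an assumption consistent with the description above):

```python
from collections import deque

def lazy_update_schedule(root, accumulated_reach, threshold=1.0):
    """Select which infosets to update this round: after updating an infoset, keep
    descending into children whose accumulated reach probability (since their last
    update) has reached the threshold."""
    to_update = []
    queue = deque([root])
    while queue:
        infoset = queue.popleft()
        to_update.append(infoset)          # its strategy is refreshed via RM
        for child in infoset.children:
            if accumulated_reach[child] >= threshold:
                queue.append(child)        # enough mass has accumulated to justify an update
    return to_update
```

Only the returned infosets have their strategies refreshed in this round; all other infosets keep accumulating counterfactual rewards lazily.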

Now we analyze Alg. 1. To give a clean theoretical result, we further make the following assumption:

Assumption 2.

Every infoset in the tree of infosets corresponds to the same number of nodes in the game tree.

It is noteworthy that Alg. 1 remains valid and efficient without Assumptions 1 and 2.

Now we present our theoretical results on the regret and time complexity of Alg. 1 in Lem 3 and Lem 4 respectively.

Lemma 3.

If the underlying game satisfies Assumption 1, then the regret of Alg. 1 exceeds the regret bound of the vanilla CFR in Lemma 2 only by a factor that depends on the depth $D$.

Proof.

According to Thm. 1 and Eq. (8), we only need to bound the probability accumulated at each infoset within a single segment. Below, we prove that this accumulated probability is at most $d_I$, where $d_I$ is the depth of infoset $I$ in the infoset tree.

We prove this by mathematical induction on $d_I$. The claim clearly holds for the root. Consider an infoset $I$ whose parent satisfies the claim. At the last time step at which its parent was updated, less than $1$ unit of probability had accumulated at $I$ (otherwise $I$ would have been updated as well); after that, $I$ accumulates at most as much probability as its parent, i.e., at most $d_I-1$ by the induction hypothesis, so the total is at most $d_I$. ∎

Lemma 4.

The time complexity of Alg. 1 at round $t$ is $O(n_t)$, where $n_t$ is the number of nodes in the tree of histories touched by Alg. 1 during round $t$. More specifically, if the underlying game satisfies both Assumption 1 and 2, then the time complexity of Alg. 1 in each round is $O(\sqrt{|\mathcal{I}|})$ on average.

Proof.

Proving the first statement involves a little engineering; we defer the details to Appendix A.

The second statement is proved as follows. According to Corollary 1, on average there are at most $O(\sqrt{|\mathcal{I}|})$ infosets at which the accumulated probability reaches the update threshold in a round, so that $O(\sqrt{|\mathcal{I}|})$ nodes are touched in each round. ∎

According to Lem. 3 and Lem. 4, the regret of Lazy-CFR is larger than the regret of CFR only by a factor that depends on the depth $D$, whilst the running time per round is smaller by a factor of about $\sqrt{|\mathcal{I}|}$, since the vanilla CFR has to traverse the whole game tree in each round. Thus, according to Lem. 1, Lem. 2 and with a little algebra, Alg. 1 is significantly faster than the vanilla CFR at achieving the same approximation error, and the improvement is dramatic in large-scale TEGIs.

Lazy-CFR+: It is noteworthy that we can directly apply the idea of lazy update to CFR+ (Bowling et al., 2017), a variant of CFR that uses a different regret minimization algorithm, RM+, instead of RM. Tammelin et al. prove that CFR+ enjoys a convergence guarantee of the same order as CFR, but in practice CFR+ outperforms CFR. To obtain Lazy-CFR+, we only need to replace RM by RM+ in Alg. 1 and compute the time-averaged strategy as in (Bowling et al., 2017). We empirically evaluate Lazy-CFR+ in Sec. 6.
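For reference, here is a minimal sketch of the RM+ update that Lazy-CFR+ plugs into Alg. 1 in place of RM (our own illustration; the weighted averaging of strategies used by CFR+ is omitted here):

```python
import numpy as np

class RegretMatcherPlus:
    """Regret matching+ (RM+): like RM, but cumulative regrets are clipped at zero
    after every update, which empirically speeds up convergence."""

    def __init__(self, num_actions):
        self.q = np.zeros(num_actions)   # clipped cumulative regrets Q^t(a)

    def strategy(self):
        total = self.q.sum()
        return self.q / total if total > 0 else np.full(len(self.q), 1.0 / len(self.q))

    def observe(self, reward_vec):
        sigma = self.strategy()
        reward_vec = np.asarray(reward_vec, dtype=float)
        self.q = np.maximum(self.q + reward_vec - float(np.dot(sigma, reward_vec)), 0.0)
```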

5 Related work

There are several variants of CFR that attempt to avoid traversing the whole game tree in each round. Monte-Carlo CFR (MC-CFR) (Lanctot et al., 2009), also known as CFR with partial pruning, uses Monte-Carlo sampling to avoid updating the strategy on infosets with a small probability of being reached. Pruning-based variants (Brown & Sandholm, 2016, 2015) skip branches of the game tree if they do not affect the regret, but their performance can deteriorate to that of the vanilla CFR in the worst case.

Besides CFR, several other techniques are used to solve large-scale extensive games; we give a brief summary of them. Brown & Sandholm (2017) proposed a subgame-solving technique, which makes it possible to solve for the NE on a subtree. Another useful technique is abstraction. Abstraction (Gilpin & Sandholm, 2007) reduces the computational complexity by solving an abstracted game which is much smaller than the original game. There are two main kinds of abstraction, lossless and lossy. Lossless abstraction algorithms (Gilpin & Sandholm, 2007) ensure that each equilibrium in the abstracted game is also an equilibrium in the original game; for poker games, they can reduce the size of the game by one to two orders of magnitude. Lossy abstraction algorithms (Kroer & Sandholm, 2014; Sandholm, 2015) create a smaller, coarser game at the cost of a decrease in solution quality. Both kinds of abstraction can be used to reduce the number of actions or the number of information sets.

6 Experiment

Figure 2: Convergence of Lazy-CFR, Lazy-CFR+, MC-CFR, CFR and CFR+ on Leduc Hold'em: (a) Leduc-5, (b) Leduc-10, (c) Leduc-15.

In this section, we empirically compare our algorithm with existing methods. We compare Lazy-CFR and Lazy-CFR+ with CFR, MC-CFR, and CFR+. In our experiments, we do not use any heuristic pruning in CFR, CFR+, Lazy-CFR and Lazy-CFR+.

Experiments are conducted on variants of Leduc hold'em (Brown & Sandholm, 2015), a common benchmark in imperfect-information game solving. Leduc hold'em is a simplified version of Texas hold'em. In Leduc hold'em, the deck consists of 6 cards: two Jacks, two Queens and two Kings. There are two dealing rounds in the game. In the first round, each player receives a single private card; in the second round, a single public card is revealed. A betting round takes place after each dealing round, and player 1 goes first. In our experiments, the bet-maximum varies over 5, 10 and 15 (denoted Leduc-5, Leduc-10 and Leduc-15).

As discussed in Lem. 4, Alg. 1 uses $O(1)$ running time per touched node, the same as the vanilla CFR, CFR+ and MC-CFR. Thus, we compare these algorithms by the number of nodes they touch, since this measure is independent of hardware and implementation.

We measure the exploitability of the strategies computed by these algorithms. The exploitability of a strategy profile can be interpreted as its approximation error with respect to the Nash equilibrium; for a profile $\sigma=(\sigma_1,\sigma_2)$ it is defined as $\max_{\sigma_1'}u_1(\sigma_1',\sigma_2)+\max_{\sigma_2'}u_2(\sigma_1,\sigma_2')$.

Results are presented in Fig. 2. The performance of Lazy-CFR is slightly worse than MC-CFR and CFR+ on Leduc-5, but as the size of the game grows, Lazy-CFR outperforms all baselines, and Lazy-CFR+ consistently outperforms all other algorithms. Thus, the empirical results show that our technique, lazy update, is a powerful way to accelerate regret minimization algorithms for TEGIs. More specifically, on our largest experiment, Leduc-15, Lazy-CFR converges over 200 times faster than CFR, and Lazy-CFR+ converges over 500 times faster than CFR+.

References

  • Abernethy et al. (2011) Abernethy, J., Bartlett, P. L., and Hazan, E. Blackwell approachability and no-regret learning are equivalent. In Proceedings of the 24th Annual Conference on Learning Theory, pp. 27–46, 2011.
  • Blackwell (1956) Blackwell, D. An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics, 6(1):1–8, 1956.
  • Bowling et al. (2017) Bowling, M., Burch, N., Johanson, M., and Tammelin, O. Heads-up limit hold’em poker is solved. Communications of the ACM, 60(11):81–88, 2017.
  • Brown & Sandholm (2015) Brown, N. and Sandholm, T. Regret-based pruning in extensive-form games. In Advances in Neural Information Processing Systems, pp. 1972–1980, 2015.
  • Brown & Sandholm (2016) Brown, N. and Sandholm, T. Reduced space and faster convergence in imperfect-information games via regret-based pruning. arXiv preprint arXiv:1609.03234, 2016.
  • Brown & Sandholm (2017) Brown, N. and Sandholm, T. Safe and nested subgame solving for imperfect-information games. In Advances in Neural Information Processing Systems, pp. 689–699, 2017.
  • Cesa-Bianchi & Lugosi (2006) Cesa-Bianchi, N. and Lugosi, G. Prediction, learning, and games. Cambridge university press, 2006.
  • Gilpin & Sandholm (2007) Gilpin, A. and Sandholm, T. Lossless abstraction of imperfect information games. Journal of the ACM (JACM), 54(5):25, 2007.
  • Gordon (2005) Gordon, G. J. No-regret algorithms for structured prediction problems. Technical report, Carnegie Mellon University, School of Computer Science, 2005.
  • Gordon (2007) Gordon, G. J. No-regret algorithms for online convex programs. In Advances in Neural Information Processing Systems, pp. 489–496, 2007.
  • Koller & Megiddo (1992) Koller, D. and Megiddo, N. The complexity of two-person zero-sum games in extensive form. Games and economic behavior, 4(4):528–552, 1992.
  • Kroer & Sandholm (2014) Kroer, C. and Sandholm, T. Extensive-form game abstraction with bounds. In Proceedings of the fifteenth ACM conference on Economics and computation, pp. 621–638. ACM, 2014.
  • Lanctot et al. (2009) Lanctot, M., Waugh, K., Zinkevich, M., and Bowling, M. Monte carlo sampling for regret minimization in extensive games. In Advances in neural information processing systems, pp. 1078–1086, 2009.
  • Nisan et al. (2007) Nisan, N., Roughgarden, T., Tardos, E., and Vazirani, V. V. Algorithmic game theory. Cambridge University Press, 2007.
  • Osborne & Rubinstein (1994) Osborne, M. J. and Rubinstein, A. A course in game theory. MIT Press, 1994.
  • Sandholm (2015) Sandholm, T. Abstraction for solving large incomplete-information games. In AAAI, pp. 4127–4131, 2015.
  • Tammelin et al. (2015) Tammelin, O., Burch, N., Johanson, M., and Bowling, M. Solving heads-up limit Texas hold'em. In Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI), 2015.
  • Zinkevich et al. (2008) Zinkevich, M., Johanson, M., Bowling, M., and Piccione, C. Regret minimization in games with incomplete information. In Advances in neural information processing systems, pp. 1729–1736, 2008.

Appendix A Details of the implementation of lazy update in TEGIs

In this section, we discuss how to efficiently implement the idea of lazy update in TEGIs.

The challenge is that if we are going to update the strategy on an infoset, we need to compute the summation of the counterfactual rewards over a segment. If we computed this summation naively, lazy update would enjoy no improvement over the vanilla CFR. Fortunately, this problem can be addressed in TEGIs. The key observation is that the reward at an infoset changes if and only if the strategy on some infoset in the subtree below it has changed.

More specifically, suppose we are going to compute the cumulative counterfactual reward at a history $h$ over a segment, and the segment is divided into sub-intervals within each of which the strategies in the subtree below $h$ stay the same. Within each such sub-interval the counterfactual reward at $h$ does not change, so the cumulative reward over the segment is the sum, over these sub-intervals, of the per-round reward multiplied by the length of the sub-interval.

Obviously, by exploiting the tree structure and elementary data structures, these per-sub-interval quantities can be maintained incrementally, so we only need to analyze the number of segments.
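A small Python sketch of this bookkeeping, under the assumption that each node stores its current per-round counterfactual reward and the round at which it last changed (the interface is hypothetical, not the authors' implementation):

```python
class RewardCache:
    """Maintain cumulative counterfactual rewards lazily. Because the reward at a
    node changes only when a strategy in its subtree changes, the cumulative reward
    over an interval with no such change is simply `value * length_of_interval`."""

    def __init__(self):
        self.value = {}        # node -> current per-round counterfactual reward
        self.acc = {}          # node -> reward accumulated up to round `since[node]`
        self.since = {}        # node -> round at which `value` became valid

    def set_value(self, node, new_value, round_t):
        """Called when a strategy change in `node`'s subtree alters its reward."""
        self.acc[node] = self.cumulative(node, round_t)
        self.value[node] = new_value
        self.since[node] = round_t

    def cumulative(self, node, round_t):
        """Cumulative counterfactual reward of `node` over rounds [0, round_t)."""
        return (self.acc.get(node, 0.0)
                + self.value.get(node, 0.0) * (round_t - self.since.get(node, round_t)))
```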

We exploit the following property to bound the number of segments:

Property 1.

In Lazy-CFR, if the strategy on an infoset $I$ is updated in a round and $I'$ is an ancestor of $I$, then the strategy on $I'$ is also updated in that round.

Therefore, if the strategies of $m$ infosets are updated in round $t$, then at most $m$ new segments start in that round.

Appendix B Proof of Thm 1

We first prove Eq. (2.3).

Proof.

Let $I(h,a)$ denote the infoset of player $i$ reached after player $i$ takes action $a$ at $h$. Without loss of generality, we assume that play for player $i$ starts at the root of the tree of player $i$'s infosets. Then we have:

Now we prove Thm 1.

Proof.

With Eq. (2.3) and the regret bound of RM, we have

Then, applying Jensen's inequality and with some calculations, we obtain the claimed bound. ∎