No-regret learning dynamics for extensive-form correlated and coarse correlated equilibria

04/01/2020 ∙ by Andrea Celli, et al. ∙ Carnegie Mellon University ∙ Politecnico di Milano

Recently, there has been growing interest around less-restrictive solution concepts than Nash equilibrium in extensive-form games, with significant effort towards the computation of extensive-form correlated equilibrium (EFCE) and extensive-form coarse correlated equilibrium (EFCCE). In this paper, we show how to leverage the popular counterfactual regret minimization (CFR) paradigm to induce simple no-regret dynamics that converge to the set of EFCEs and EFCCEs in n-player general-sum extensive-form games. For EFCE, we define a notion of internal regret suitable for extensive-form games and exhibit an efficient no-internal-regret algorithm. These results complement those for normal-form games introduced in the seminal paper by Hart and Mas-Colell. For EFCCE, we show that no modification of CFR is needed, and that in fact the empirical frequency of play generated when all the players use the original CFR algorithm converges to the set of EFCCEs.

1 Introduction

The Nash equilibrium (NE) (Nash, 1950) is the most common notion of rationality in game theory, and its computation in two-player, zero-sum games has been the flagship computational challenge in the area at the interplay between computer science and game theory (see, e.g., the landmark results in heads-up no-limit poker by Brown and Sandholm (2017) and Moravčík et al. (2017)). The assumption underpinning NE is that the interaction among players is fully decentralized. Therefore, an NE is a distribution on the uncorrelated strategy space (i.e., a product of independent distributions, one per player). A competing notion of rationality is the correlated equilibrium (CE) proposed by Aumann (1974). A correlated strategy is a general distribution over joint action profiles and it is customarily modeled via a trusted external mediator that draws an action profile from this distribution, and privately recommends to each player his component. A correlated strategy is a CE if no player has an incentive to choose an action different from the mediator’s recommendation, because, assuming that all other players also obey, the suggested strategy is the best in expectation.

Many real-world strategic interactions involve more than two players with arbitrary (i.e., general-sum) utilities. In these settings, the notion of NE presents some weaknesses which render the CE a natural solution concept: (i) computing an NE is an intractable problem, being PPAD-complete even in two-player games (Chen and Deng, 2006; Daskalakis et al., 2009); (ii) the NE is prone to equilibrium selection issues; and (iii) the social welfare that can be attained via an NE may be significantly lower than what can be achieved via a CE (Koutsoupias and Papadimitriou, 1999; Roughgarden and Tardos, 2002). Moreover, in normal-form games, the notion of CE arises from simple learning dynamics in senses that NE does not (Hart and Mas-Colell, 2000; Cesa-Bianchi and Lugosi, 2006).

The notion of extensive-form correlated equilibrium (EFCE) by von Stengel and Forges (2008) is a natural extension of the CE to the case of sequential strategic interactions. In an EFCE, the mediator draws, before the beginning of the sequential interaction, a recommended action for each of the possible decision points (i.e., information sets) that players may encounter in the game, but she does not immediately reveal recommendations to each player. Instead, the mediator incrementally reveals relevant individual moves as players reach new information sets. At any decision point, the acting player is free to defect from the recommended action, but doing so comes at the cost of future recommendations, which are no longer issued if the player deviates. Another suitable notion of correlation for sequential games has been recently introduced by Farina et al. (2019a) as the coarse version of an EFCE. In general, a coarse correlated equilibrium enforces protection against deviations which are independent of the recommended move. Then, in an extensive-form coarse correlated equilibrium (EFCCE), at each information set, the acting player has to commit to following the mediator’s recommended move before it is revealed to him.

Original contributions

We focus on general-sum sequential games with an arbitrary number of players (including the chance player). In this setting, the problem of computing a feasible EFCE (and, therefore, a feasible EFCCE) can be solved in polynomial time in the size of the game tree (Huang and von Stengel, 2008) via a variation of the Ellipsoid Against Hope algorithm (Papadimitriou and Roughgarden, 2008; Jiang and Leyton-Brown, 2015). However, in practice, this approach cannot scale beyond toy problems. Therefore, the following question remains open: is it possible to devise simple dynamics leading to a feasible EFCE/EFCCE? In this paper, we show that the answer is positive.

First, we define a suitable notion of internal regret and show how to minimize it by decomposing the problem locally at each information set. We propose the ICFR algorithm as a natural extension of the laminar regret decomposition framework to minimize such internal regrets across the game tree. The empirical frequency of play generated by this algorithm converges to an EFCE, both with high probability at finite time and almost surely in the limit. These results generalize the seminal work by Hart and Mas-Colell (2000) to the sequential case via a simple and natural framework.

CFR-S was introduced by Celli et al. (2019b) as a simple extension of vanilla CFR (Zinkevich et al., 2008) which keeps track of the empirical frequency of play in order to recover a CE of the sequential game (i.e., a CE defined over the normal-form representation of the game). Surprisingly enough, we show that the empirical frequency of play generated by CFR-S converges to the set of EFCCEs, both with high probability at finite time and almost surely in the limit. Therefore, relaxing the EFCE constraints to their coarse counterparts yields an even simpler dynamic which renders the EFCCE an appealing notion of rationality for complex multi-player, general-sum sequential problems.

2 Preliminaries

In this section, we provide the needed groundings and definitions on sequential games and regret minimization (see the books by Shoham and Leyton-Brown (2008) and Cesa-Bianchi and Lugosi (2006), respectively, for additional details).

2.1 Extensive-Form Games

We focus on extensive-form games (EFGs) with imperfect information. We denote the set of players as $\mathcal{P} \cup \{c\}$, where $c$ is a chance player that selects actions according to fixed known probability distributions, representing exogenous stochasticity. An EFG is usually defined by means of a game tree, where $H$ is the set of nodes of the tree, and a node $h \in H$ is identified by the ordered sequence of actions from the root to the node. $Z \subseteq H$ is the set of terminal nodes, which are the leaves of the game tree. For every $h \in H \setminus Z$, we let $P(h) \in \mathcal{P} \cup \{c\}$ be the unique player who acts at $h$ and $A(h)$ be the set of actions she has available. For each player $i \in \mathcal{P}$, we let $u_i : Z \to \mathbb{R}$ be his payoff function. Moreover, we denote by $p_c : Z \to [0, 1]$ the function assigning each terminal node $z \in Z$ to the probability of reaching it given chance moves on the path from the root of the game tree to $z$. Finally, we let $\Delta$ be the maximum range of payoffs in the game, i.e., it holds $\Delta = \max_{i \in \mathcal{P}} \left( \max_{z \in Z} u_i(z) - \min_{z \in Z} u_i(z) \right)$.

Imperfect information is encoded by using information sets (infosets). Given $i \in \mathcal{P}$, a player $i$’s infoset $I$ groups nodes belonging to player $i$ that are indistinguishable for him, i.e., $A(h) = A(h')$ for any pair of nodes $h, h' \in I$. $\mathcal{I}_i$ denotes the set of all player $i$’s infosets, which form a partition of $\{h \in H \setminus Z : P(h) = i\}$. Moreover, we let $A(I)$ be the set of actions available at infoset $I \in \mathcal{I}_i$. As usual in the literature, we assume that the game has perfect recall, i.e., the infosets are such that no player forgets information once acquired. In EFGs with perfect recall, the infosets of each player are partially ordered. We write $I \preceq J$ whenever infoset $I \in \mathcal{I}_i$ precedes $J \in \mathcal{I}_i$ according to such ordering, i.e., there exists a path in the game tree connecting a node $h \in I$ to some node $k \in J$. For ease of notation, given $I \in \mathcal{I}_i$, we let $\mathcal{C}^\star(I)$ be the set of player $i$’s infosets that follow infoset $I$ (this included), defined as $\mathcal{C}^\star(I) = \{J \in \mathcal{I}_i : I \preceq J\}$. Moreover, given $I \in \mathcal{I}_i$ and $a \in A(I)$, we let $\mathcal{C}(I, a) \subseteq \mathcal{C}^\star(I)$ be the set of player $i$’s infosets that immediately follow infoset $I$ by playing action $a$, i.e., those reachable from some node $h \in I$ by following a path that includes $a$ and does not pass through another infoset of player $i$.

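Normal-form plans and strategies. A normal-form plan for player $i \in \mathcal{P}$ is a tuple $\pi_i \in \Pi_i = \times_{I \in \mathcal{I}_i} A(I)$ which specifies an action for each player $i$’s infoset, where $\pi_i(I)$ represents the action selected by $\pi_i$ at infoset $I \in \mathcal{I}_i$. We denote with $\pi \in \Pi = \times_{i \in \mathcal{P}} \Pi_i$ a joint normal-form plan, defining a plan $\pi_i \in \Pi_i$ for each player $i \in \mathcal{P}$. The expected payoff of player $i$, when she plays $\pi_i \in \Pi_i$ and the opponents play normal-form plans specified by $\pi_{-i} \in \Pi_{-i} = \times_{j \neq i} \Pi_j$, is denoted, with an overload of notation, by $u_i(\pi_i, \pi_{-i})$ (this also includes the probability of chance moves, as determined by $p_c$). A normal-form strategy $\mu_i$ is a probability distribution over $\Pi_i$, where $\mu_i(\pi_i)$ denotes the probability of selecting a plan $\pi_i \in \Pi_i$ according to $\mu_i$. We let $\Delta_{\Pi_i}$ be the set of strategies of player $i$. Moreover, $\mu \in \Delta_{\Pi}$ is a joint probability distribution defined over $\Pi$, with $\mu(\pi)$ being the probability that the players end up playing the plans prescribed by $\pi \in \Pi$.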
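To make the definition of a normal-form plan concrete, the following minimal Python sketch enumerates the plans of a single player as the Cartesian product of the actions available at each of her infosets. The infoset labels, action names, and the function name `normal_form_plans` are illustrative assumptions, not notation from the paper.

```python
from itertools import product


def normal_form_plans(actions_at_infoset):
    """Return every plan as a dict mapping each infoset to one chosen action."""
    infosets = sorted(actions_at_infoset)
    action_lists = [actions_at_infoset[infoset] for infoset in infosets]
    # One plan per element of the Cartesian product of the action sets.
    return [dict(zip(infosets, choice)) for choice in product(*action_lists)]


# Hypothetical player with two infosets (labels are illustrative assumptions).
example_infosets = {"A": ["a", "b"], "B": ["c", "d"]}
plans = normal_form_plans(example_infosets)
```

The number of plans is the product of the numbers of actions at the infosets, so it grows exponentially with the number of infosets; this is one reason why working directly in the normal form is impractical for large games.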

Sequences. For any $i \in \mathcal{P}$, given an infoset $I \in \mathcal{I}_i$ and an action $a \in A(I)$, we denote with $\sigma = (I, a)$ the sequence of player $i$’s actions reaching infoset $I$ and terminating with $a$. Notice that, in EFGs with perfect recall, such a sequence is uniquely determined, as paths that reach nodes belonging to the same infoset identify the same sequence of player $i$’s actions. We let $\Sigma_i = \{(I, a) : I \in \mathcal{I}_i, a \in A(I)\} \cup \{\varnothing_i\}$ be the set of player $i$’s sequences, where $\varnothing_i$ is the empty sequence of player $i$ (representing the case in which he never plays). Moreover, for ease of notation, given a sequence $\sigma \in \Sigma_i \setminus \{\varnothing_i\}$, we let $I(\sigma)$ be the player $i$’s infoset where the last action in $\sigma$ is played.

Subsets of (joint) normal-form plans. We now define certain subsets of $\Pi$. The reader is encouraged to refer to Figure 1 while reading the definitions to see what these subsets are in a small example.

For every player $i \in \mathcal{P}$ and infoset $I \in \mathcal{I}_i$, we let $\Pi_i(I) \subseteq \Pi_i$ be the set of player $i$’s normal-form plans that prescribe to play so as to reach infoset $I$ whenever possible (depending on the opponents’ actions up to that point) and any action whenever reaching $I$ is not possible anymore. Moreover, for every sequence $\sigma = (I, a) \in \Sigma_i$, we let $\Pi_i(\sigma) \subseteq \Pi_i(I)$ be the set of player $i$’s plans that reach infoset $I$ and recommend action $a$ at $I$. Similarly, given a terminal node $z \in Z$, we denote with $\Pi_i(z) \subseteq \Pi_i$ the set of normal-form plans by which player $i$ plays so as to reach $z$, while $\Pi(z) = \times_{i \in \mathcal{P}} \Pi_i(z)$ and $\Pi_{-i}(z) = \times_{j \neq i} \Pi_j(z)$.

Figure 1: (Left) Sample game tree. Black round nodes belong to Player 1, white round nodes belong to Player 2, and white square nodes are leaves. Rounded, gray lines denote information sets. (Center) Set of normal-form plans $\Pi_1$ for Player 1. Each plan identifies an action at each information set. (Right) Examples of certain subsets of $\Pi_1$ defined in this subsection.

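Additional notation. For every $i \in \mathcal{P}$ and $I \in \mathcal{I}_i$, we let $Z(I) \subseteq Z$ be the set of terminal nodes that are reachable from infoset $I$ of player $i$. Moreover, $Z(I, a) \subseteq Z(I)$ is the set of terminal nodes reachable by playing action $a \in A(I)$ at infoset $I$, whereas $Z^c(I, a) = Z(I) \setminus Z(I, a)$ is the set of the terminal nodes which are reachable by playing an action different from $a$ at $I$. For any joint normal-form plan $\pi \in \Pi$, we denote with $\rho^{\pi}$ a vector in which each component $\rho^{\pi}_z$ is equal to $1$ if and only if $z \in Z$ can be reached when the players play according to $\pi$ (and to $0$ otherwise). Analogously, given $\pi_i \in \Pi_i$ and $\pi_{-i} \in \Pi_{-i}$, we define the vectors $\rho^{\pi_i}$ and $\rho^{\pi_{-i}}$, while, with an abuse of notation, $\rho^{\pi_i}_I$ is $1$ if and only if infoset $I \in \mathcal{I}_i$ can be reached when playing according to $\pi_i$.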

2.2 Regret Minimization

In the regret minimization framework (Cesa-Bianchi and Lugosi, 2006), each player plays repeatedly against the others by making a series of decisions. For each $i \in \mathcal{P}$, let $\pi_i^t \in \Pi_i$ be the normal-form plan adopted by player $i$ at iteration $t$. Then, each player observes a utility $u_i^t : \Pi_i \to \mathbb{R}$ defined as $u_i^t(\pi_i) = u_i(\pi_i, \pi_{-i}^t)$ for $\pi_i \in \Pi_i$, where $\pi_{-i}^t$ denotes the normal-form plans played by the opponents at iteration $t$.

The cumulative (external) regret of player $i$ up to iteration $T$ is defined as:

$$R_i^T = \max_{\hat{\pi}_i \in \Pi_i} \sum_{t=1}^{T} u_i^t(\hat{\pi}_i) - \sum_{t=1}^{T} u_i^t(\pi_i^t), \tag{1}$$

which represents how much player $i$ would have gained by always playing the best plan in hindsight, given the history of utilities observed up to iteration $T$.
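The definition of cumulative external regret can be illustrated with a short Python sketch. It assumes, purely for illustration, that the utility of every plan at every iteration is available as a dictionary; the names `cumulative_external_regret`, `utility_history`, and the plan labels are hypothetical.

```python
def cumulative_external_regret(utilities, played):
    """Best fixed plan in hindsight minus the utility actually collected.

    utilities[t][plan] is the utility `plan` would have earned at iteration t;
    played[t] is the plan actually chosen at iteration t.
    """
    plans = utilities[0].keys()
    best_fixed = max(sum(u[plan] for u in utilities) for plan in plans)
    realized = sum(u[plan] for u, plan in zip(utilities, played))
    return best_fixed - realized


# Hypothetical three-iteration history over two plans "x" and "y".
utility_history = [
    {"x": 1.0, "y": 0.0},
    {"x": 0.0, "y": 1.0},
    {"x": 1.0, "y": 0.0},
]
regret_always_y = cumulative_external_regret(utility_history, ["y", "y", "y"])
```

In this toy history the best fixed plan is "x" with total utility 2, while always playing "y" collects 1, so the regret is 1. The regret can be negative when the player switches plans favorably across iterations.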

A regret minimizer is a function providing, after each iteration $t$, the next player $i$’s normal-form plan $\pi_i^{t+1}$ on the basis of the past history of play and the observed utilities up to iteration $t$. A desirable property for regret minimizers is Hannan consistency (Hannan, 1957), which requires that $\limsup_{T \to \infty} \frac{1}{T} R_i^T \leq 0$, i.e., the cumulative regret grows at a sublinear rate in the number of iterations $T$. There are many regret-minimizing procedures that ensure this property, one of which is regret matching (RM) (Hart and Mas-Colell, 2000).
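As a rough illustration of a regret minimizer, the sketch below implements the external-regret form of regret matching that is commonly paired with CFR: each action is played with probability proportional to the positive part of its cumulative regret, and uniformly when no regret is positive. The class name `RegretMatcher` and its method names are assumptions made for this example, and the sketch operates on a flat set of actions rather than on normal-form plans.

```python
import random


class RegretMatcher:
    """Regret matching over a finite action set (illustrative sketch)."""

    def __init__(self, actions):
        self.actions = list(actions)
        self.cum_regret = {a: 0.0 for a in self.actions}

    def strategy(self):
        # Probabilities proportional to the positive part of cumulative regret.
        positive = {a: max(r, 0.0) for a, r in self.cum_regret.items()}
        total = sum(positive.values())
        if total <= 0.0:
            return {a: 1.0 / len(self.actions) for a in self.actions}
        return {a: positive[a] / total for a in self.actions}

    def sample(self, rng=random):
        probs = self.strategy()
        weights = [probs[a] for a in self.actions]
        return rng.choices(self.actions, weights=weights)[0]

    def observe(self, played, utility):
        # utility[a] is what action a would have earned this iteration.
        for a in self.actions:
            self.cum_regret[a] += utility[a] - utility[played]
```

A typical loop would call `sample()` to pick an action, obtain the utility of every action against the opponents' current play, and then call `observe(...)` before the next iteration.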

We define $\bar{\mu}^T \in \Delta_{\Pi}$ as the empirical frequency of play up to iteration $T$, where $\bar{\mu}^T(\pi) = \frac{1}{T} \left| \{ 1 \leq t \leq T : \pi^t = \pi \} \right|$ for every $\pi \in \Pi$, with $\pi^t = (\pi_i^t)_{i \in \mathcal{P}}$ denoting the joint normal-form plan in which each player $i$ plays $\pi_i^t$. Then, it is well known that, if all the players play according to a Hannan consistent regret-minimizing procedure, then $\bar{\mu}^T$ approaches the set of normal-form coarse correlated equilibria of the game (Celli et al., 2019b) (see Section 3 for further details on equilibria).
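The empirical frequency of play is simply a normalized count of the joint plans observed so far. A minimal Python sketch, assuming each joint plan is represented as a hashable tuple with one entry per player (the labels below are hypothetical):

```python
from collections import Counter


def empirical_frequency(joint_plans):
    """Map each observed joint plan to the fraction of iterations it was played."""
    counts = Counter(joint_plans)
    total = len(joint_plans)
    return {plan: count / total for plan, count in counts.items()}


# Hypothetical history of joint plans, one tuple per iteration.
joint_history = [("x1", "y1"), ("x1", "y2"), ("x1", "y1"), ("x2", "y2")]
mu_bar = empirical_frequency(joint_history)
```

Note that the result is a distribution over joint plans, which in general is not a product of the players' individual frequencies; this is what allows it to represent correlated play.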

3 Extensive-form Correlated and Coarse Correlated Equilibria

In the context of EFGs, the two most widely adopted notions of correlated equilibrium are the normal-form correlated equilibrium (NFCE) (Aumann, 1974) and the extensive-form correlated equilibrium (EFCE) (von Stengel and Forges, 2008). In the former, the mediator draws and recommends a complete normal-form plan to each player before the game starts. Then, each player decides whether to follow the recommended plan or deviate to an arbitrary strategy she desires. In an EFCE the mediator draws a normal-form plan for each player before the beginning of the game, but she does not immediately reveal it to each player. Instead, the mediator incrementally reveals individual moves as players reach new infosets. At any infoset, the acting player is free to deviate from the recommended action, but doing so comes at the cost of future recommendations, which are no longer issued if the player deviates.

A coarse correlated equilibrium enforces protection against deviations which are independent of the recommended move. Players have to decide ex ante, i.e., before observing the recommendations, whether or not to commit to playing according to them. In a normal-form coarse correlated equilibrium (NFCCE) (Moulin and Vial, 1978) players decide to commit to following the recommended normal-form plan before actually observing it (i.e., with the only knowledge of the mediator’s distribution over joint normal-form plans). Those who decide to commit to following the mediator will privately receive their recommendations, while the remaining players are free to play any desired strategy and they will not receive any recommendation. An extensive-form coarse correlated equilibrium (EFCCE) (Farina et al., 2019a) is the coarse equivalent of an EFCE. At each infoset, the acting player has to commit to following the relevant recommended move before it is revealed to her.
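The coarse incentive constraint is easiest to see in a small matrix game. The following Python sketch checks whether a distribution over joint actions is an NFCCE of a normal-form game, by comparing the expected utility of following the mediator with that of committing in advance to any fixed action. The game of chicken payoffs and the function name `is_nfcce` are illustrative assumptions, not taken from the paper.

```python
def is_nfcce(mu, payoffs, actions, tol=1e-9):
    """Check the coarse (ex ante) incentive constraints of every player.

    mu maps joint action tuples to probabilities;
    payoffs maps joint action tuples to one payoff per player.
    """
    for player in range(len(actions)):
        follow = sum(p * payoffs[joint][player] for joint, p in mu.items())
        for fixed in actions[player]:
            deviate = 0.0
            for joint, p in mu.items():
                # The deviator ignores the recommendation and plays `fixed`.
                swapped = joint[:player] + (fixed,) + joint[player + 1:]
                deviate += p * payoffs[swapped][player]
            if deviate > follow + tol:
                return False
    return True


# Game of chicken with illustrative payoffs (an assumption for this example).
actions = [("D", "C"), ("D", "C")]
payoffs = {
    ("D", "D"): (0, 0), ("D", "C"): (7, 2),
    ("C", "D"): (2, 7), ("C", "C"): (6, 6),
}
mu = {("C", "C"): 1 / 3, ("D", "C"): 1 / 3, ("C", "D"): 1 / 3}
```

With these payoffs, following the mediator yields 5 to each player, while any fixed action yields 14/3, so `mu` passes the check; the distribution that always recommends ("D", "D") fails it, since either player gains by committing to "C".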

In an EFCE, players know less about the normal-form plans that were sampled by the mediator than in an NFCE, where the whole normal-form plan is immediately revealed. Therefore, by exploiting an EFCE, the mediator can more easily incentivize players to follow strategies that may hurt them, as long as players are indifferent as to whether or not to follow the recommendations. This is beneficial when the mediator wants to maximize, e.g., the social welfare of the game. For arbitrary EFGs with perfect recall, the following inclusion of the sets of equilibria holds: $\text{NFCE} \subseteq \text{EFCE} \subseteq \text{EFCCE} \subseteq \text{NFCCE}$ (von Stengel and Forges, 2008; Farina et al., 2019a).

The remainder of the section provides suitable formal definitions of the set of EFCEs and EFCCEs (respectively, Sections 3.1 and 3.2) via the notion of trigger agent (originally introduced by Gordon et al. (2008) and Dudík and Gordon (2009)). Finally, Section 3.3 summarizes existing approaches for computing EFCEs and EFCCEs.

3.1 Formal Definition of the Set of EFCEs

The definition of EFCE requires the following notion of trigger agent, which, intuitively, is associated to each player and sequence of action recommendations for him.

Definition 1 (Trigger agent for EFCE).

Given a player $i \in \mathcal{P}$, a sequence $\sigma = (I, a) \in \Sigma_i$, and a probability distribution $\hat{\mu}_i \in \Delta_{\Pi_i(I)}$, a $(\sigma, \hat{\mu}_i)$-trigger agent for player $i$ is an agent that takes on the role of player $i$ and commits to following all recommendations unless she reaches $I$ and gets recommended to play $a$. If this happens, the agent stops committing to the recommendations and plays according to a plan sampled from $\hat{\mu}_i$ until the game ends.
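The behavior of such a trigger agent can be sketched in a few lines of Python. The sketch assumes that the deviation plan has already been sampled from the agent's distribution and is given as a dictionary from infosets to actions; the class name `EFCETriggerAgent` and the infoset and action labels are hypothetical.

```python
class EFCETriggerAgent:
    """Follows recommendations until the trigger sequence is recommended."""

    def __init__(self, trigger_infoset, trigger_action, deviation_plan):
        self.trigger_infoset = trigger_infoset
        self.trigger_action = trigger_action
        # Plan assumed to be already sampled from the agent's distribution.
        self.deviation_plan = deviation_plan
        self.triggered = False

    def act(self, infoset, recommendation):
        # The trigger fires only at the trigger infoset AND only if the
        # recommended action there equals the trigger action.
        if (not self.triggered and infoset == self.trigger_infoset
                and recommendation == self.trigger_action):
            self.triggered = True
        if self.triggered:
            # From the trigger infoset onwards, recommendations are ignored.
            return self.deviation_plan[infoset]
        return recommendation
```

For instance, an agent built with trigger infoset "B", trigger action "c", and deviation plan `{"B": "d", "C": "f"}` plays the recommended action at an earlier infoset "A", switches to "d" when recommended "c" at "B", and keeps following its own plan afterwards.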

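It follows that a joint probability distribution $\mu \in \Delta_{\Pi}$ is an EFCE if, for every $i \in \mathcal{P}$, player $i$’s expected utility when following the recommendations is at least as large as the expected utility that any $(\sigma, \hat{\mu}_i)$-trigger agent for player $i$ can achieve (assuming the opponents do not deviate). Given $\sigma = (I, a) \in \Sigma_i$, in order to express the expected utility of a $(\sigma, \hat{\mu}_i)$-trigger agent, it is convenient to define the probability of the game ending in each terminal node $z \in Z$. Three cases are possible. In the first one, $z \in Z(I, a)$ and the probability of reaching $z$ is defined as:

$$p_{\mu}^{\sigma, \hat{\mu}_i}(z) = \left( \sum_{\pi_i \in \Pi_i(\sigma)} \sum_{\pi_{-i} \in \Pi_{-i}(z)} \mu(\pi_i, \pi_{-i}) \right) \left( \sum_{\hat{\pi}_i \in \Pi_i(z)} \hat{\mu}_i(\hat{\pi}_i) \right) p_c(z), \tag{2}$$

which accounts for the fact that the agent follows recommendations until she is recommended to play $a$ at $I$, and, thus, she ‘gets triggered’ and plays according to $\hat{\pi}_i$ sampled from $\hat{\mu}_i$ from $I$ onwards. The second case is $z \in Z^c(I, a)$, which is reached with probability:

$$p_{\mu}^{\sigma, \hat{\mu}_i}(z) = \left( \sum_{\pi_i \in \Pi_i(\sigma)} \sum_{\pi_{-i} \in \Pi_{-i}(z)} \mu(\pi_i, \pi_{-i}) \right) \left( \sum_{\hat{\pi}_i \in \Pi_i(z)} \hat{\mu}_i(\hat{\pi}_i) \right) p_c(z) + \left( \sum_{\pi \in \Pi(z)} \mu(\pi) \right) p_c(z), \tag{3}$$

where the first term accounts for the event that $z$ is reached when the agent ‘gets triggered’, while the second term is the probability of reaching $z$ while not being triggered (notice that the two events are mutually exclusive). Finally, the third case is when $z \in Z \setminus Z(I)$ and the infoset $I$ is never reached. Then, the probability of reaching $z$ is defined as:

$$q_{\mu}(z) = \left( \sum_{\pi \in \Pi(z)} \mu(\pi) \right) p_c(z). \tag{4}$$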

The following is the formal definition of EFCE.

Definition 2 (Extensive-form correlated equilibrium).

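An EFCE of an EFG is a probability distribution $\mu \in \Delta_{\Pi}$ such that, for every $i \in \mathcal{P}$ and $(\sigma, \hat{\mu}_i)$-trigger agent for player $i$, with $\sigma = (I, a) \in \Sigma_i$, it holds:

$$\sum_{z \in Z} q_{\mu}(z) \, u_i(z) \geq \sum_{z \in Z(I)} p_{\mu}^{\sigma, \hat{\mu}_i}(z) \, u_i(z) + \sum_{z \in Z \setminus Z(I)} q_{\mu}(z) \, u_i(z). \tag{5}$$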

Noticing that the left-hand side of Equation (5) is equal to and that , we can rewrite Equation (5) as follows:

(6)

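A probability distribution $\mu \in \Delta_{\Pi}$ is said to be an $\epsilon$-EFCE if, for every $i \in \mathcal{P}$ and $(\sigma, \hat{\mu}_i)$-trigger agent for player $i$, with $\sigma = (I, a) \in \Sigma_i$, it holds: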

(7)

3.2 Formal Definition of the Set of EFCCEs

In a similar way to Farina et al. (2019a), we define the following notion of trigger agent for EFCCEs. Differently from the EFCE case, here an agent ‘gets triggered’ when an infoset is reached (before observing the action recommendation).

Definition 3 (Trigger agent for EFCCE).

Given a player $i \in \mathcal{P}$, an infoset $I \in \mathcal{I}_i$, and a probability distribution $\hat{\mu}_i \in \Delta_{\Pi_i(I)}$, an $(I, \hat{\mu}_i)$-trigger agent for player $i$ is an agent that takes on the role of player $i$ and commits to following all recommendations unless she reaches $I$. If this happens, the agent stops committing to the recommendations and, instead, plays according to a normal-form plan sampled from $\hat{\mu}_i$ until the game ends.
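The difference from the EFCE case is that the trigger depends only on the infoset, not on the recommended action. A minimal Python sketch makes this explicit by passing the recommendation as a callable that is invoked only while the agent is still committed, so that the recommendation at the trigger infoset is never revealed. The class name `EFCCETriggerAgent`, the callable-based interface, and the labels are assumptions made for illustration.

```python
class EFCCETriggerAgent:
    """Follows recommendations until the trigger infoset is reached."""

    def __init__(self, trigger_infoset, deviation_plan):
        self.trigger_infoset = trigger_infoset
        # Plan assumed to be already sampled from the agent's distribution.
        self.deviation_plan = deviation_plan
        self.triggered = False

    def act(self, infoset, reveal_recommendation):
        # The trigger fires on reaching the infoset, before any
        # recommendation for that infoset has been observed.
        if infoset == self.trigger_infoset:
            self.triggered = True
        if self.triggered:
            return self.deviation_plan[infoset]
        # Still committed: only now is the recommendation revealed.
        return reveal_recommendation()
```

Because the decision to deviate cannot condition on the recommended move, the set of deviations is smaller than in the EFCE case, which is consistent with every EFCE also being an EFCCE.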

Then, a joint probability distribution $\mu \in \Delta_{\Pi}$ is an EFCCE if, for every $i \in \mathcal{P}$, player $i$’s expected utility when following the recommendations is at least as large as the expected utility that any $(I, \hat{\mu}_i)$-trigger agent for player $i$ can achieve (assuming the opponents do not deviate either). Following the reasoning adopted to define EFCEs, it is convenient to express the expected utility of an $(I, \hat{\mu}_i)$-trigger agent via the probability of the game ending in each $z \in Z$. Two cases are possible. In the first one, $z \in Z(I)$ and the probability of reaching $z$ is defined as:

$$p_{\mu}^{I, \hat{\mu}_i}(z) = \left( \sum_{\pi_i \in \Pi_i(I)} \sum_{\pi_{-i} \in \Pi_{-i}(z)} \mu(\pi_i, \pi_{-i}) \right) \left( \sum_{\hat{\pi}_i \in \Pi_i(z)} \hat{\mu}_i(\hat{\pi}_i) \right) p_c(z), \tag{8}$$

which accounts for the fact that the trigger agent follows recommendations until the infoset $I$ is reached, and, then, she plays according to a plan $\hat{\pi}_i$ sampled from $\hat{\mu}_i$ from $I$ onwards. The second case is when $z \in Z \setminus Z(I)$ and the infoset $I$ is never reached. Thus, the probability of getting to $z$ is defined as in Equation (4).

We can now provide the following formal definition of EFCCE.

Definition 4 (Extensive-form coarse correlated equilibrium).

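An EFCCE of an EFG is a probability distribution $\mu \in \Delta_{\Pi}$ such that, for every $i \in \mathcal{P}$ and $(I, \hat{\mu}_i)$-trigger agent for player $i$, it holds:

$$\sum_{z \in Z} q_{\mu}(z) \, u_i(z) \geq \sum_{z \in Z(I)} p_{\mu}^{I, \hat{\mu}_i}(z) \, u_i(z) + \sum_{z \in Z \setminus Z(I)} q_{\mu}(z) \, u_i(z). \tag{9}$$

The left-hand side of Equation (9) defines player $i$’s expected utility by following recommendations, while the right-hand side is the expected utility of an $(I, \hat{\mu}_i)$-trigger agent for player $i$. Equation (9) can be equivalently rewritten as: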

(10)

An -EFCCE is defined analogously, replacing zero with in the right-hand side of Equation (10).

3.3 Computation of EFCEs and EFCCEs

The problem of computing an optimal EFCE in extensive-form games with more than two players and/or chance moves is known to be NP-hard (von Stengel and Forges, 2008). However, Huang and von Stengel (2008) show that the problem of finding one EFCE can be solved in polynomial time via a variation of the Ellipsoid Against Hope algorithm (Papadimitriou and Roughgarden, 2008; Jiang and Leyton-Brown, 2015). This holds for arbitrary EFGs with multiple players and/or chance moves. Unfortunately, that algorithm is mainly a theoretical tool, and it is known to have limited scalability beyond toy problems. Dudík and Gordon (2009) provide an alternative sampling-based algorithm to compute EFCEs. However, their algorithm is centralized and based on MCMC sampling which may limit its practical appeal. Our framework is arguably simpler and based on the classical counterfactual regret minimization algorithm (Zinkevich et al., 2008; Farina et al., 2019b). Moreover, our framework is fully decentralized since each player, at every decision point, plays so as to minimize her internal/external regret.

If we restrict our attention to two-player perfect-recall games without chance moves, then the problem of determining an optimal EFCE can be characterized through a succinct linear program with polynomial size in the game description (von Stengel and Forges, 2008). In this setting, Farina et al. (2019c) showed that the problem of computing an EFCE can be formulated as the solution to a bilinear saddle-point problem, which they solve via a subgradient descent method. Moreover, Farina et al. (2019d) design a regret minimization algorithm suitable for this specific scenario.

Computing coarse correlated equilibria in arbitrary sequential games is still a largely unexplored problem. Celli et al. (2019a) study the computation of optimal NFCCEs, but their results are mainly theoretical and of limited practical applicability. There exist variations of the CFR algorithm (Zinkevich et al., 2008) that are shown to converge to the set of NFCCEs (Celli et al., 2019b). However, the problem of computing an EFCCE via regret minimization is still open.

4 Suitable Notion of Internal Regret for Extensive-Form Games and Convergence to EFCE

In this section, we introduce a new notion of regret whose minimization allows us to approach the set of EFCEs in general EFGs (with any number of players, including chance). Our main idea is to define a regret for each trigger agent, i.e., for each player $i \in \mathcal{P}$ and sequence $\sigma = (I, a) \in \Sigma_i$. Intuitively, this represents the regret of not having played the best trigger agent’s plan in hindsight, taking into account all the iterations in which the agent actually gets triggered (i.e., infoset $I$ is reached and action $a$ is recommended). Then, we show how all these regrets can be minimized by minimizing other regret terms that can be defined locally at each infoset. In particular, our approach builds on and extends the laminar regret decomposition framework introduced by Farina et al. (2019b).

4.1 Approaching the Set of EFCEs

First, we introduce our notion of regret and show that its minimization allows us to approach the set of EFCEs in any EFG. We start with some preliminary definitions.

For each , we denote with the immediate utility observed by player at infoset during iteration . For every action , represents the utility experienced by player if the game ends after playing at , without passing through another player ’s infoset and assuming that the other players play as prescribed by the plans . Formally, letting be the set of terminal nodes immediately reachable from by playing , it holds (notice that the payoff of each leaf is multiplied by chance probabilities, as determined by the function ).

For , the following is player ’s utility attainable at infoset when a normal-form plan is selected:

(11)

Moreover, for notational convenience, for every , we define , which represents the utility player gets at infoset by means of the normal-form plan played at iteration .

For every player , sequence , and infoset following that one where the last action of is played (this included), we let be the cumulative internal trigger agent regret representing the regret at infoset experienced by the trigger agent that gets triggered on , defined as follows:

(12)

Notice that only accounts for those iterations in which , i.e., intuitively, when the actions prescribed by the normal-form plan trigger the agent associated to sequence .

The following theorem shows that minimizing the internal trigger agent regrets allows us to approach the set of EFCEs.

Theorem 1.

The empirical frequency of play $\bar{\mu}^T$ is an $\epsilon$-EFCE, where $\epsilon$ is the maximum of the average internal trigger agent regrets computed over $i \in \mathcal{P}$ and $\sigma \in \Sigma_i$.

Proof.

By definition of cumulative internal trigger agent regret, we have that for each player and sequence , it holds that:

(13)

Let us fix and . By expanding the recursive definition of and recalling the definition of the immediate utility function , it is easy to see that the following holds:

Given , let us define, for every , the following probabilities:

which are the equivalent of and obtained by replacing with the empirical frequency of play . Then:

Using convexity, for any

and therefore for any

and we conclude using Equation 7. ∎

Hence, in particular, whenever the internal trigger agent regrets grow sublinearly, we obtain the following as a simple corollary:

Theorem 2.

If , then the empirical frequency of play converges almost surely to an EFCE.

In the next sections we show how one can guarantee that the internal trigger agent regrets grow sublinearly.

4.2 Internal Laminar Regret Decomposition

Next, we show how the internal trigger agent regrets can be minimized by minimizing other suitable regrets defined locally at each infoset. In order to define them, we need to introduce, for every player $i \in \mathcal{P}$ and infoset $I \in \mathcal{I}_i$, the following parameterized utility function at each iteration $t$:

(14)

which represents the utility that player gets, at iteration , by playing action at and following the actions prescribed by at the subsequent infosets. Then, for each sequence , infoset , and action , the cumulative internal laminar regret of action is defined as:

(15)

while, for and , the cumulative internal laminar regret is:

(16)

The following two lemmas show that the internal trigger agent regrets can be minimized by minimizing the cumulative internal laminar regrets at all the infosets of the game.

Lemma 1.

The cumulative internal trigger agent regret for each sequence and infoset can be decomposed as:

Proof.

By using the recursive definitions of and , we get:

(17)