Alternative Function Approximation Parameterizations for Solving Games: An Analysis of f-Regression Counterfactual Regret Minimization

12/06/2019 ∙ Ryan D'Orazio et al. ∙ University of Alberta

Function approximation is a powerful approach for structuring large decision problems that has facilitated great achievements in the areas of reinforcement learning and game playing. Regression counterfactual regret minimization (RCFR) is a flexible and simple algorithm for approximately solving imperfect information games with policies parameterized by a normalized rectified linear unit (ReLU). In contrast, the more conventional softmax parameterization is standard in the field of reinforcement learning and has a regret bound with a better dependence on the number of actions in the tabular case. We derive approximation error-aware regret bounds for (Φ, f)-regret matching, which applies to a general class of link functions and regret objectives. These bounds recover a tighter bound for RCFR and provide a theoretical justification for RCFR implementations with alternative policy parameterizations (f-RCFR), including softmax. We provide exploitability bounds for f-RCFR with the polynomial and exponential link functions in zero-sum imperfect information games, and examine empirically how the link function interacts with the severity of the approximation to determine exploitability in practice. Although a ReLU parameterized policy is typically the best choice, a softmax parameterization can perform as well or better in settings that require aggressive approximation.


1. Introduction

The dominant framework for approximating Nash equilibria in sequential games with imperfect information is Counterfactual Regret Minimization (CFR), which has successfully been used to solve and expertly play human-scale poker games Bowling et al. (2015); Moravčík et al. (2017); Brown and Sandholm (2018, 2019). This framework is built on the idea of decomposing a game into a network of simple regret minimizers Zinkevich et al. (2008); Farina et al. (2019). For very large games, abstraction is typically used to yield a strategically similar game with a smaller number of information sets that is feasible to solve with CFR (Zinkevich et al., 2008; Waugh et al., 2009; Johanson et al., 2013; Ganzfried and Sandholm, 2013).

Function approximation is a natural generalization of abstraction. In CFR, this amounts to estimating the regrets for each regret minimizer instead of storing them all in a table (Waugh et al., 2015; Morrill, 2016; Brown et al., 2019; Li et al., 2018; Steinberger, 2019). Game solving with function approximation can be competitive with domain-specific state abstraction Waugh et al. (2015); Morrill (2016); Brown et al. (2019); Heinrich and Silver (2016), and in some cases is able to outperform tabular CFR without abstraction if the players are optimizing against their best responses Lockhart et al. (2019). Function approximation has facilitated many recent successes in game playing more broadly Silver et al. (2016, 2018); Vinyals et al. (2019).

Combining regression and regret minimization with applications to CFR was initially studied by Waugh et al. (2015), who introduced the Regression Regret-Matching (RRM) theorem—a sufficient condition for achieving no external regret despite function approximator error. The extension to Regression Counterfactual Regret Minimization (RCFR) yields an algorithm that utilizes function approximation in a way similar to reinforcement learning (RL), particularly policy-based RL. Action preferences—cumulative counterfactual regrets—are learned and predicted, and these predictions parameterize a policy. In particular, there are hints of strong connections with recent RL algorithms that emphasize regret minimization as their primary objective, e.g., Regret Policy Gradient (RPG) Srinivasan et al. (2018), Exploitability Descent (ED) Lockhart et al. (2019), Politex Abbasi-Yadkori et al. (2019), and Neural Replicator Dynamics Omidshafiei et al. (2019).

CFR was originally introduced using regret matching (RM) Hart and Mas-Colell (2000) as its component learners. This learning algorithm generates policies by normalizing positive regrets and setting the weight of actions with negative regrets to zero. This truncation of negative regrets is exactly the application of a rectified linear unit (ReLU) function, which is used extensively in machine learning for constructing neural network layers. RCFR, following in CFR's lineage, previously had theoretical guarantees only for normalized ReLU policies.

However, most RL algorithms for discrete action spaces take a different approach: they exponentiate and normalize the preferences according to the softmax function. The Hedge or Exponential Weights learning algorithm Freund and Schapire (1997) also uses a softmax function to generate policies. It even has a regret bound with a logarithmic dependence on the number of actions, rather than the square-root dependence of RM. This motivates generalizing the RRM and RCFR theory to allow for alternative policy parameterizations.

In fact, RM and Hedge can be unified. Greenwald et al. (2006a) present (Φ, f)-regret matching, a general framework for constructing learners that minimize Φ-regret—a family of regret metrics that includes external regret, internal regret, and swap regret—using a policy parameterized by a link function f. RM and Hedge are instances of (Φ, f)-regret matching for particular link functions—the polynomial and exponential link functions, respectively—when Φ-regret is chosen to be external regret. Generalizing to internal regret has an important connection to correlated equilibria in non-zero-sum games Cesa-Bianchi and Lugosi (2006).
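To make the distinction concrete, the following sketch (ours, not code from the paper) shows how the two classic external-regret learners arise from different link functions applied to the same cumulative regret vector; the function names and the placement of the temperature parameter are our own assumptions.

```python
import numpy as np

def poly_link(regrets, p=2):
    # Polynomial link: componentwise ReLU raised to the power p - 1 (p = 2 is the plain ReLU).
    return np.maximum(regrets, 0.0) ** (p - 1)

def exp_link(regrets, temperature=1.0):
    # Exponential link used by Hedge; subtracting the max is only for numerical stability
    # and cancels after normalization.
    z = regrets / temperature
    return np.exp(z - z.max())

def external_regret_policy(link_outputs):
    # Normalize the link outputs into a policy; fall back to uniform when they are all ~0.
    total = link_outputs.sum()
    if total <= 1e-12:
        return np.full(len(link_outputs), 1.0 / len(link_outputs))
    return link_outputs / total

cumulative_regrets = np.array([3.0, -1.0, 0.5])
rm_policy = external_regret_policy(poly_link(cumulative_regrets, p=2))      # regret matching
hedge_policy = external_regret_policy(exp_link(cumulative_regrets, 1.0))    # Hedge / softmax
```

With the polynomial link, actions with negative cumulative regret receive zero weight; with the exponential link, every action retains some weight.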

In this paper we first generalize the RRM theorem to (Φ, f)-regret matching by extending Greenwald et al.'s framework to the case where the regret inputs to the algorithms are approximate. This new approximate (Φ, f)-regret matching framework allows for the use of a broad class of link functions and regret objectives, and provides a simple recipe for generating regret bounds under new choices of both when regrets are estimated. Our analysis, both due to improvements previously made by Greenwald et al. (2006a) and a more careful application of conventional inequalities, tightens the bound for RRM. This improvement has a large impact on the RCFR regret and exploitability bounds, since they essentially magnify the RRM bound by the size of the game. In addition, this framework provides insight into the effectiveness of combining function approximation and regret minimization, as the effect of function approximation error on the bounds may vary between link functions and parameter choices.

The approximate (Φ, f)-regret matching framework provides the basis for bounds that apply to RCFR algorithms with alternative link functions, thereby allowing the sound use of alternative policy parameterizations, including softmax. We call this generalization f-RCFR and provide bounds for the polynomial and exponential link functions. We test exponential and polynomial f-RCFR in two games commonly used in games research, Leduc hold'em poker Southey et al. (2005) and imperfect information goofspiel Lanctot (2013), with a simple but extensible linear representation, to investigate how the link function and the degree of approximation interact during learning. We find that the conventional normalized ReLU policy often works well, but when the degree of approximation is large, the softmax policy can achieve a lower exploitability.

This paper is organized as follows. First, we define online decision problems and connect them to relevant prior work in this area. We then define approximate regret matching and provide regret bounds for this new class of algorithms. Afterward, we begin our discussion of RCFR and our new generalization by describing extensive-form games and prior work on RCFR. Finally, we present f-RCFR, its exploitability bound, and experiments in Leduc hold'em poker and goofspiel.

2. Online Decision Problems

2.1. Background

We adopt the notation from Greenwald et al. (2006a) to describe an online decision problem (ODP). An ODP consists of a set of possible actions A and a set of possible rewards R. In this paper we assume a finite set of actions and a bounded reward set. The tuple (A, R) fully characterizes the problem and is referred to as a reward system. Furthermore, a reward function is a map r : A → R, and we refer to the collection of all such maps as the set of reward functions.

At each round t, an agent selects a distribution over actions π_t ∈ Δ(A), samples an action a_t ∼ π_t, and then receives the reward function r_t. The agent is able to compute the rewards r_t(a) for actions that were not taken at time t, in contrast to the bandit setting where the agent only observes r_t(a_t). Crucially, each r_t may be selected arbitrarily from the set of reward functions. As a consequence, this ODP model is flexible enough to encompass multi-agent, adversarial interactions, and game-theoretic equilibrium concepts even though it is described from the perspective of a single agent's decisions.

A learning algorithm in an ODP selects π_t using information from the history of reward functions observed and actions previously taken. We denote this information at time t as the history h_{t−1} = (a_1, r_1, …, a_{t−1}, r_{t−1}). Formally, an online learning algorithm is a sequence of functions, each mapping a possible history to a distribution over actions.
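As a point of reference, a full-information ODP interaction can be sketched as below; the learner interface and variable names are illustrative assumptions, not notation from the paper.

```python
import random

def run_odp(learner, reward_functions, actions):
    # Full-information setting: after acting, the learner observes the entire reward
    # function r_t (here a dict action -> reward), not just the reward of the action taken.
    history = []
    for r_t in reward_functions:                    # r_t may be chosen adversarially
        policy = learner(history)                   # distribution over actions given the history
        a_t = random.choices(actions, weights=[policy[a] for a in actions])[0]
        history.append((a_t, r_t))
    return history
```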

We denote the rectified linear unit (ReLU) function as (x)^+ = max(x, 0) for all x ∈ ℝ. Similarly, for vectors x ∈ ℝ^n, we define (x)^+ to be the componentwise application of the ReLU function.

2.1.1. Action Transformations

To generalize the analysis to different forms of regret (e.g., swap, internal, and external regret), it is useful to define action transformations. Action transformations are functions of the form φ : A → Δ(A), giving a distribution over actions for each action input. Let Φ_ALL denote the set of all action transformations for the action set A, and let Φ_SWAP denote the set of all action transformations whose codomain is restricted to distributions that place full weight on a single action.

Two important subsets of Φ_ALL are Φ_EXT and Φ_INT. Φ_EXT denotes the set of all external transformations—the constant action transformations. More formally, if δ_a is the distribution with full weight on action a, then Φ_EXT = { φ : a' ↦ δ_a | a ∈ A }.

Φ_INT consists of the set of all possible internal transformations for action set A, where an internal transformation from action a to action b ≠ a is defined as φ_{a→b}(a') = δ_b if a' = a, and φ_{a→b}(a') = δ_{a'} otherwise.

We have that Φ_EXT ⊆ Φ_SWAP and Φ_INT ⊆ Φ_SWAP, with |Φ_EXT| = |A| and |Φ_INT| = |A|(|A| − 1) Greenwald et al. (2006a).

We will also make use of the linear extension of an action transformation φ ∈ Φ_ALL to distributions π ∈ Δ(A), defined as (φπ)(b) = Σ_{a∈A} π(a) φ(a)(b).

2.1.2. Regret

For a given action transformation φ, we can compute the difference in expected reward for a particular action a and reward function r. This expected difference, known as φ-regret, is denoted ρ_φ(a, r) = E_{a'∼φ(a)}[r(a')] − r(a). For a set of action transformations Φ, the Φ-regret vector ρ_Φ(a, r) stacks ρ_φ(a, r) for each φ ∈ Φ. Note that the expected value of the φ-regret if the agent chooses a ∼ π is E_{a∼π}[ρ_φ(a, r)].

For an ODP with observed history at time T, with reward functions r_1, …, r_T and actions a_1, …, a_T, the cumulative Φ-regret at time T for a set of action transformations Φ is R_Φ(T) = Σ_{t=1}^{T} ρ_Φ(a_t, r_t). For brevity we will omit the history argument, and for convenience we set R_Φ(0) = 0. Note that R_Φ(T) is a random vector, and we seek to bound

E[ max_{φ ∈ Φ} R_φ(T) ].     (1)

Choosing Φ to be Φ_EXT, Φ_INT, or Φ_SWAP in (1) amounts to minimizing external regret, internal regret, and swap regret, respectively. One can also change (1) by interchanging the max and the expectation. In RRM, max_{φ} E[R_φ(T)] is bounded (Waugh et al., 2015; Morrill, 2016); however, bounds for (1) still apply when the algorithm observes and uses the expected cumulative regret, E[R_Φ(t)], to form its decisions at time t + 1 (Greenwald et al., 2006a, Corollary 18). The bounds remain the same, with the exception of replacing the observed random regrets with their corresponding expected values.
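For the external-transformation case, the cumulative Φ-regret reduces to the familiar per-action regret sums; a minimal sketch, assuming actions are indexed 0..n−1 and rewards arrive as vectors:

```python
import numpy as np

def cumulative_external_regret(history, n_actions):
    # Sum over time of r_t(a) - r_t(a_t) for every fixed action a (external transformations).
    regret = np.zeros(n_actions)
    for a_t, r_t in history:        # r_t is a length-n_actions reward vector
        regret += np.asarray(r_t) - r_t[a_t]
    return regret                   # the maximum entry is the (realized) external regret
```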

2.2. Approximate Regret-Matching

Given a set of action transformations Φ and a link function f that is a subgradient of a convex potential function G (as long as G is bounded from above on the negative orthant, the codomain of f is the positive orthant), we can define a general class of online learning algorithms known as (Φ, f)-regret-matching algorithms Greenwald et al. (2006a). A (Φ, f)-regret-matching algorithm at time t + 1 chooses a policy π_{t+1} that is a fixed point of the combined transformation

π ↦ Σ_{φ ∈ Φ} [ f_φ(R_Φ(t)) / Σ_{φ'} f_{φ'}(R_Φ(t)) ] (φπ)

when Σ_{φ} f_φ(R_Φ(t)) > 0, and arbitrarily otherwise. Since this combined transformation is a linear operator over the simplex Δ(A), a fixed point always exists by the Brouwer fixed point theorem, and the fixed point is itself a distribution in Δ(A) Greenwald et al. (2006b). Examples of (Φ, f)-regret-matching algorithms include Hart's algorithm Hart and Mas-Colell (2000)—typically called "regret matching", or the polynomially weighted average forecaster Cesa-Bianchi and Lugosi (2006)—and Hedge Freund and Schapire (1997)—the exponentially weighted average forecaster Cesa-Bianchi and Lugosi (2006)—with link functions f(x) = ((x)^+)^{p−1} for p > 1, and f(x) = exp(x/τ) with temperature parameter τ > 0, respectively.
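For external transformations the fixed point is simply the normalized vector of link outputs, as in the earlier sketch. For internal transformations the fixed point is a stationary distribution of the stochastic matrix assembled from the link outputs; the sketch below illustrates this under our own indexing convention (link_outputs[a, b] for the transformation a → b) and approximates the fixed point by power iteration.

```python
import numpy as np

def internal_fixed_point(link_outputs, iterations=1000):
    # link_outputs[a, b] holds f applied to the cumulative internal regret of the
    # transformation "a -> b"; the diagonal is ignored.
    n = link_outputs.shape[0]
    w = np.maximum(link_outputs, 0.0)
    np.fill_diagonal(w, 0.0)
    total = w.sum()
    if total <= 1e-12:
        return np.full(n, 1.0 / n)          # all weights ~0: the policy may be arbitrary
    w = w / total                           # convex weights over the internal transformations
    # Combined transformation: row a keeps mass 1 - sum_b w[a, b] on a and sends w[a, b] to b.
    M = np.eye(n) * (1.0 - w.sum(axis=1, keepdims=True)) + w
    pi = np.full(n, 1.0 / n)
    for _ in range(iterations):             # approximate the fixed point pi = pi @ M
        pi = pi @ M
    return pi / pi.sum()
```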

A useful technique for bounding regret when estimates are used in place of true values is to define an ε-Blackwell condition, as was done in the RRM theorem (Waugh et al., 2015). The analysis in RRM was specific to external regret and the polynomial link with p = 2. To generalize across different link functions and transformation sets, we define the ε-Blackwell condition for arbitrary (Φ, f).

[ε-Blackwell Condition] For a given reward system (A, R), a finite set of action transformations Φ, and a link function f, a learning algorithm satisfies the ε-Blackwell condition if

f(R_Φ(t − 1)) · E_{a ∼ π_t}[ ρ_Φ(a, r_t) ] ≤ ε

for all times t and reward functions r_t.

The Regret Matching Theorem Greenwald et al. (2006a) shows that the ε-Blackwell condition with ε = 0 holds with equality for (Φ, f)-regret-matching algorithms, for any finite set of action transformations Φ and link function f.

We seek to bound objective (1) when an algorithm at time t + 1 chooses the fixed point of the combined transformation built from ŷ_t, when the entries of ŷ_t are not all zero, and arbitrarily otherwise, where ŷ_t is an estimate of the exact link outputs f(R_Φ(t)), possibly produced by a function approximator. Such an algorithm is referred to as approximate (Φ, f)-regret-matching.

Similarly to the RRM theorem (Waugh et al., 2015; Morrill, 2016), we show that the ε parameter of the ε-Blackwell condition depends on the error in approximating the exact link outputs, f(R_Φ(t)).

Given a reward system (A, R), a finite set of action transformations Φ, and a link function f, an approximate (Φ, f)-regret-matching algorithm satisfies the ε-Blackwell condition with an ε that scales with the error between the estimated and the exact link outputs. All proofs are deferred to the appendix.

For a (Φ, f)-regret-matching algorithm, an approach to bounding (1) is to use the ε-Blackwell condition and provide a bound on the potential of the cumulative regret for an appropriate potential function G (Greenwald et al., 2006a; Cesa-Bianchi and Lugosi, 2006). Bounding the regret (1) for an approximate (Φ, f)-regret-matching algorithm proceeds similarly, except that the bound on ε from Theorem B is used. Proceeding in this fashion yields the following theorem: Given a real-valued reward system and a finite set of action transformations, if (G, f, g) is a Gordon triple (see Definition A.1 in the appendix), then an approximate (Φ, f)-regret-matching algorithm guarantees at all times a bound on the expected potential of the cumulative regret in terms of the per-round g terms and the accumulated approximation errors.

2.3. Bounds for Specific Link Functions

2.3.1. Polynomial

Given the polynomial link function f(x) = ((x)^+)^{p−1}, we consider two ranges of the parameter p. For the following results it is useful to denote the maximal activation—the largest magnitude any single-round Φ-regret can take—following Greenwald et al. (2006a).

For the first range of p we have the following bound on (1). Given an ODP, a finite set of action transformations Φ, and the polynomial link function with p in this range, an approximate (Φ, f)-regret-matching algorithm guarantees a bound on (1) that grows with the square root of the number of rounds and with a root of the accumulated link-output approximation errors, with constants depending on p, the maximal activation, and |Φ|.

Similarly, for the second range of p we have the following. Given an ODP, a finite set of action transformations Φ, and the polynomial link function with p in this range, an approximate (Φ, f)-regret-matching algorithm guarantees a bound of the same form, with constants that again depend on p, the maximal activation, and |Φ|.

In comparison to the RRM theorem (Morrill, 2016), the above bound is tighter: there is no extra factor multiplying the approximation errors, and one of the remaining terms is replaced by a smaller quantity. These improvements are due to the tighter bound in Theorem B and the original Φ-regret analysis Greenwald et al. (2006a), respectively. Aside from these differences, the bounds coincide.

2.3.2. Exponential

Given an ODP, a finite set of action transformations Φ, and the exponential link function with temperature τ > 0, an approximate (Φ, f)-regret-matching algorithm guarantees a bound on (1) whose dependence on |Φ| is logarithmic and in which the accumulated link-output approximation errors appear outside of any root, with constants depending on τ and the maximal activation.

The Hedge algorithm corresponds to the exponential link function when Φ = Φ_EXT, so Theorem B provides a bound on a regression Hedge algorithm. Note that in this case, the approximation error term is not inside a root function as it is under the polynomial link function. This seems to imply that, at the level of link outputs, polynomial link functions have a better dependence on the approximation errors. However, the exponential link outputs in this bound are normalized to the simplex, while the polynomial link outputs can take on larger values. So which link function has a better dependence on the approximation errors depends on the magnitude of the cumulative regrets, which in turn depends on the environment and the algorithm's empirical performance.

3. Extensive-Form Games

3.1. Background

A zero-sum extensive-form game (EFG) is a tuple consisting of the components described below.

H is the set of valid action sequences and chance outcomes, called histories, where an action is an element of the action set A and the set of actions available at each history h is A(h). The player to act (including the chance "player", c) at each non-terminal history is determined by the player function P; terminal histories Z are those with no valid actions, A(z) = ∅. σ_c is a fixed stochastic policy assigned to the chance player that determines the likelihood of random outcomes, like those from die rolls or draws from a shuffled deck of cards. The information partition describes which histories players can distinguish between. The set of histories where player i acts, H_i, is partitioned into a set of information states, I_i, where each information state I ∈ I_i is a set of histories indistinguishable to i. For all h, h' ∈ I we must have A(h) = A(h'); we therefore overload notation with A(I) giving the set of possible actions available at information state I. We require perfect recall, so that for all histories in an information state, the sequence of information states admitted by the preceding histories must be identical. Finally, u is a reward or utility function for player 1. The game is zero-sum because player 2's utility function is simply −u.

Player i's policy or behavioral strategy, σ_i, defines a probability distribution over valid actions at each of i's information states, and a joint policy or strategy profile is an assignment of policies for each player, σ = (σ_1, σ_2). We use π^σ(z) to denote the probability of reaching terminal history z under profile σ from the beginning of the game, and π^σ(h, z) for the same probability except starting from history h. We subscript by the player, π^σ_i, to denote that player's contribution to these probabilities. The expected value to player i under profile σ is v_i(σ) = Σ_{z ∈ Z} π^σ(z) u_i(z).

A best response to player i's policy, σ_i, is a policy for the opponent, −i, that maximizes the opponent's value, b_{−i}(σ_i) = max_{σ'_{−i}} v_{−i}(σ_i, σ'_{−i}).

A profile σ is an ε-Nash equilibrium if neither player can deviate unilaterally from their assigned policy and gain more than ε, i.e., v_i(σ) ≥ max_{σ'_i} v_i(σ'_i, σ_{−i}) − ε for both players i.

We call the average equilibrium approximation error the exploitability of the profile σ; since the game is zero-sum, it equals the average of the best-response values, (b_1(σ_2) + b_2(σ_1)) / 2. In a zero-sum game, policies that are part of a Nash equilibrium (i.e., a 0-Nash equilibrium) are minimax optimal, so they are safe to play against any opponent in the sense that they guarantee the largest minimum payoff.

The exploitability of a profile is related to external regret in a fundamental way. First consider the induced normal form of an EFG, where an action taken by a player consists of specifying an action at each information state, i.e., a pure strategy. That is, from an ODP perspective, the set of actions available to player i (the learning algorithm) is the set of i's pure strategies. We can then define the expected regret at time t for player i with respect to a pure strategy s when selecting the policy σ_i^t as the difference v_i(s, σ_{−i}^t) − v_i(σ^t).

We can then define the cumulative external regret of player i at time T, R_i^T, as the maximum over pure strategies of the sum of these differences. Note that this is an instance of an ODP where the sequence of reward functions for player i is induced by the opponent's sequence of policies. Furthermore, the external regret defined here reflects interchanging the max and expectation in objective (1). The connection between external regret and Nash equilibria then follows from the well-known folk theorem. If two ODPs are enmeshed so that the rewards of the learners always sum to zero and the action of one learner influences the reward vector of the other, then they form a zero-sum game. Furthermore, if neither learner has more than ε external regret after T rounds in their respective ODP, then the average of their policies forms a (2ε/T)-Nash equilibrium. See, for example, Blum and Mansour (2007) for a proof.

3.2. Counterfactual Regret Minimization

The idea of counterfactual regret minimization (CFR) Zinkevich et al. (2008) is that we can decompose an EFG into multiple ODPs, one at each information state, in such a way that the regret of the sequence of game-wide policies generated by combining the individual ODP policies (together forming a behavioral strategy) is controlled by the regret of the ODP learners. We define the rewards for each ODP as the counterfactual action values,

v_i^σ(I, a) = Σ_{h ∈ I} π^σ_{−i}(h) Σ_{z ∈ Z} π^σ(ha, z) u_i(z),

for each action a at each information state I, where ha is the history that results from taking action a at history h, and π^σ(ha, z) = 0 whenever z is unreachable from ha. Intuitively, this is the expected value that player i receives if they were to play a after playing to reach I. This can be computed recursively by backing up the counterfactual values from successor information states weighted by the policy at those information states, and aggregating them through opponent information states. Accordingly, the regret of the ODP learners is called counterfactual regret, and it is defined for player i at information state I and action a as

r^t(I, a) = v_i^{σ^t}(I, a) − Σ_{a' ∈ A(I)} σ^t(I, a') v_i^{σ^t}(I, a').

We denote the cumulative counterfactual regret as R^T(I, a) = Σ_{t=1}^{T} r^t(I, a).
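A compact recursive traversal that accumulates these quantities is sketched below using the OpenSpiel state API (the experiments in Section 4 use OpenSpiel); the function signature, the `policy` callback, and the `cf_regret` accumulator are our own illustrative choices rather than the paper's implementation.

```python
def cfr_traverse(state, player, reach_opp, policy, cf_regret):
    # `policy(info_state, legal_actions) -> {action: prob}` is the current profile.
    # `cf_regret` maps (info_state, action) to accumulated instantaneous counterfactual regret.
    # `reach_opp` is the product of chance and opponent action probabilities so far.
    # Returns the expected value of `state` for `player` under the profile.
    if state.is_terminal():
        return state.returns()[player]
    if state.is_chance_node():
        return sum(p * cfr_traverse(state.child(a), player, reach_opp * p, policy, cf_regret)
                   for a, p in state.chance_outcomes())
    info_state = state.information_state_string()
    legal = state.legal_actions()
    sigma = policy(info_state, legal)
    acting = state.current_player()
    child_values, node_value = {}, 0.0
    for a in legal:
        next_reach = reach_opp if acting == player else reach_opp * sigma[a]
        child_values[a] = cfr_traverse(state.child(a), player, next_reach, policy, cf_regret)
        node_value += sigma[a] * child_values[a]
    if acting == player:
        for a in legal:
            # counterfactual reach times the advantage of playing a at this information state
            key = (info_state, a)
            cf_regret[key] = cf_regret.get(key, 0.0) + reach_opp * (child_values[a] - node_value)
    return node_value
```

Calling this once per player for the current profile and adding the results into a cumulative table corresponds to one unsampled CFR regret update; the average-strategy bookkeeping is omitted.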

Zinkevich et al. (2008) showed the following. [CFR] For both players i, the external regret of i's policies constructed from their ODP learners after T iterations of CFR satisfies R_i^T ≤ Σ_{I ∈ I_i} max_a (R^T(I, a))^+. Furthermore, i's normalized average sequence-weight policy,

σ̄_i^T(I, a) = Σ_{t=1}^{T} π_i^{σ^t}(I) σ_i^t(I, a) / Σ_{t=1}^{T} π_i^{σ^t}(I),

where π_i^{σ^t}(I) is player i's contribution to the probability of reaching I, is part of a (2 max_i R_i^T / T)-Nash equilibrium. See Farina et al. (2019) for the sketch of an alternative proof using the regret circuits framework that is perhaps more intuitive than the one originally posed.

3.3. Function Approximation

If the game has many information states, it may be infeasible to compute counterfactual values exactly for each ODP learner and to always update all ODP learners simultaneously. In addition, games that humans are interested in playing, or that model real-world problems, often contain structure that is lost in the process of formalizing them as an EFG. We can recover this structure by describing a feature representation over information state–action sequences. This way, information states and actions can be endowed with shared features that a function approximator could make use of to more efficiently represent and update values across information states that would otherwise be stored in a large table.

RCFR Waugh et al. (2015) uses a function approximator to predict cumulative or average counterfactual regrets at each information state, which can be used to parameterize a normalized ReLU policy. The function approximator can be trained in different ways.

The original RCFR paper suggested that all of the sequence feature–counterfactual regret pairs ever computed can be aggregated or sampled into a dataset, and that after every iteration a new functional average regret estimator could be produced by minimizing Euclidean prediction error with a machine learning algorithm. Brown et al. (2019) and Steinberger (2019) follow this approach by using a reservoir buffer. Morrill (2016) and Li et al. (2018) describe "bootstrap" approximate cumulative regret targets that can be constructed by adding instantaneous regrets to current regret estimates, which permit the use of alternative machine learning objectives and do not require data to be preserved across model updates.
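A reservoir buffer in this context is ordinary reservoir sampling over the stream of (features, regret target) pairs; the class below is a standard sketch, not code from the cited works.

```python
import random

class ReservoirBuffer:
    # Keeps a uniform random sample of at most `capacity` items from a stream (Algorithm R).
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []
        self.num_seen = 0

    def add(self, item):
        self.num_seen += 1
        if len(self.data) < self.capacity:
            self.data.append(item)
        else:
            j = random.randrange(self.num_seen)
            if j < self.capacity:
                self.data[j] = item
```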

Exploitability descent (ED) Lockhart et al. (2019), Politex Abbasi-Yadkori et al. (2019), and Neural replicator dynamics (NeuRD) Omidshafiei et al. (2019) are three recent related algorithms that face the same problem of training an estimator to predict sums from summand targets. ED and NeuRD suggest the use of neural network estimators and online model gradient updates. Politex instead opts to train and store models to predict instantaneous values, and computes their sum on demand.

3.4. f-RCFR

Thanks to our new analysis of approximate regret matching, we now know that any link function that admits a no-Φ-regret regret-matching algorithm also has an approximate version, and that the regret bound of this approximate version has a similar dependence on the approximation error as that of the original RRM. Rather than restricting ourselves to the polynomial link function with parameter p = 2, we can now use alternative parameter values for the polynomial link function, or even alternative link functions, like the exponential link. So instead of a normalized ReLU policy, we use a policy generated by the external-regret fixed point of a link function f with respect to approximate regrets predicted by a functional regret estimator. More formally, the f-RCFR policy for player i at each information state is the normalization of f applied to the predicted cumulative counterfactual regrets when those link outputs are not all zero, and arbitrary otherwise. Since the input to any link function in an approximate regret-matching algorithm is simply an estimate of the cumulative regrets, we can reuse all of the techniques previously developed for RCFR to train regret estimators.
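Concretely, at a single information state the f-RCFR policy can be computed as below; the estimator interface and the uniform fallback for the "arbitrary" case are our assumptions.

```python
import numpy as np

def f_rcfr_policy(regret_estimator, features, link, n_actions):
    # Apply the link function to the predicted cumulative counterfactual regrets and
    # normalize; fall back to a uniform policy when every link output is (numerically) zero.
    predicted = np.array([regret_estimator(features, a) for a in range(n_actions)])
    y = link(predicted)
    total = y.sum()
    if total <= 1e-12:
        return np.full(n_actions, 1.0 / n_actions)
    return y / total
```

Swapping `link` between the polynomial and exponential functions of Section 2.3 switches between the normalized-ReLU and softmax parameterizations without touching the training code.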

Using Theorem B and the CFR theorem 3.2, we can derive an improved regret bound with the polynomial link and a new bound with the exponential link.

[polynomial RCFR] Given the polynomial link function with parameter p in the first range above, let σ_i^t be the policy that f-RCFR assigns to player i at iteration t given its functional regret estimator. After T iterations, f-RCFR guarantees, for both players, an external regret bound that grows with the square root of T, with the size of the game, and with a root of the accumulated link-output approximation errors across i's information states. Furthermore, i's normalized average sequence-weight policy, σ̄_i^T, is part of an ε-Nash equilibrium with ε given by this bound through the folk theorem.

Proof.

This result follows directly from theorem 3.2. The counterfactual regret at each information state corresponds to the regret of an ODP with Φ = Φ_EXT. Therefore, playing an approximate (Φ, f)-regret matching algorithm at each information state with a polynomial link function with the chosen parameter p results in the regret bound presented in theorem B for each state-specific ODP. Although this bound is an expectation of a max and the counterfactual regret is a max of an expectation, the analysis of Greenwald et al. (Greenwald et al., 2006a, Corollary 18) allows us to extend our bounds in section 2.3 to this case. The result then follows from theorem 3.2. ∎

The proofs for the other polynomial range and for the exponential link are very similar and are omitted for brevity.

[polynomial RCFR] Given the polynomial link function with parameter p in the second range above, let σ_i^t be the policy that f-RCFR assigns to player i at iteration t given its functional regret estimator. After T iterations, f-RCFR guarantees, for both players, an external regret bound of the same form as above, and i's normalized average sequence-weight policy, σ̄_i^T, is part of the corresponding ε-Nash equilibrium. The above theorem provides a tighter bound for RCFR than what exists in the literature. The improvement is a direct consequence of the tighter bound for RRM presented in theorem B in section 2.3. Given the application of the RRM theorem by Brown et al. (2019), these results should also lead to a tighter bound when a function approximator is learning from sampled counterfactual regret targets.

[exponential RCFR] Given the exponential link function with temperature τ > 0, let σ_i^t be the policy that f-RCFR assigns to player i at iteration t given its functional regret estimator. After T iterations, f-RCFR guarantees, for both players, an external regret bound whose dependence on the number of actions is logarithmic and in which the accumulated link-output approximation errors appear outside of any root. Furthermore, i's normalized average sequence-weight policy, σ̄_i^T, is part of the corresponding ε-Nash equilibrium. This bound has the same advantage in its action-set-size dependence over the polynomial f-RCFR bounds as the exponential bound of Theorem B has over the polynomial bounds of Theorems B and 3.4.

4. Experiments

Figure 1. Cumulative regression error over 100K f-RCFR iterations for Leduc hold'em poker and goofspiel for selected f-RCFR instances. For each number of partitions and game, the link function and parameter with the smallest average exploitability over 5 runs at 100K iterations was selected. The solid lines connect the average error and the dots show the errors of individual runs.
Figure 2. (left) Exploitability of the average strategy profile of tabular CFR and f-RCFR instances during the first 100K iterations in the games Leduc hold'em poker (top) and goofspiel (bottom). For each number of partitions, the link function and parameter with the lowest average final exploitability over 5 runs is shown. The mean exploitability and the individual runs are plotted for the chosen instances. A larger number of partitions permits a better approximation and lower exploitability. (right) The final average exploitability after 100K f-RCFR iterations for the best exponential and polynomial link function instances in Leduc (top) and goofspiel (bottom).

4.1. Algorithm Implementation

We use independent linear function approximators for each player i and action a as our regret estimators. The features are constructed using the tug-of-war hashing features from Bellemare et al. (2012). Consider k random hash functions drawn from a universal family, each mapping an information state to one of m indices. The feature representation is a k-sparse vector whose non-zero entries sit at the indices selected by the hash functions. The values of the non-zero entries are all either +1 or −1, determined by k additional random sign hash functions. Together, the index and sign hashes give the feature vector.

The tug-of-war hashing features aim to reduce the bias introduced by collisions from the hash functions: by design, the expected sign of all other information states that share a non-zero entry in their feature vector is zero. In other words, for player i the first k hash functions form k random partitions of i's information states, where each partition has m buckets. We use the number of partitions to control the severity of approximation in our experiments. See Bellemare et al. (2012) for specific properties of this representation scheme.
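A minimal construction of such features is sketched below; Python's built-in `hash` on tuples stands in for hash functions drawn from a universal family, which is only an approximation of the scheme in Bellemare et al. (2012).

```python
import numpy as np

def tug_of_war_features(info_state_key, num_partitions, buckets_per_partition, seed=0):
    # One active bucket per partition with a pseudo-random +/-1 sign; `info_state_key`
    # is any hashable identifier for the information state.
    x = np.zeros(num_partitions * buckets_per_partition)
    for p in range(num_partitions):
        bucket = hash((seed, "bucket", p, info_state_key)) % buckets_per_partition
        sign = 1.0 if hash((seed, "sign", p, info_state_key)) % 2 == 0 else -1.0
        x[p * buckets_per_partition + bucket] = sign
    return x
```

Note that Python randomizes string hashes across interpreter runs, so a real implementation would use explicit, seeded universal hash functions.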

To train our regret estimators, we do least-squares ridge regression on exact counterfactual regret targets. On each iteration we fit a new weight vector to that iteration's targets and, after the first iteration, simply add it to our previous weights. Since the counterfactual regrets are computed for every information state–action sequence on every iteration, the same feature matrix is used during training after each iteration. Therefore, the least-squares solution is a linear function of the targets, and the sum of the optimal weights for predicting the per-iteration counterfactual regrets yields the optimal weights for predicting their sum, with respect to squared Euclidean prediction error. Beyond training the weights at the end of each iteration, the regrets do not need to be saved or reprocessed.
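The linearity argument can be made explicit: with a fixed feature matrix X, the ridge solution is a fixed linear map of the targets, so per-iteration weight vectors can simply be summed. A sketch under that assumption:

```python
import numpy as np

def fit_ridge(X, targets, l2=1e-3):
    # Closed-form ridge solution w = (X^T X + l2 I)^{-1} X^T y.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + l2 * np.eye(d), X.T @ targets)

# Because X is identical on every iteration, fit_ridge(X, y1) + fit_ridge(X, y2)
# equals fit_ridge(X, y1 + y2), so the cumulative weights predict the cumulative regrets:
# cumulative_w += fit_ridge(X, instantaneous_counterfactual_regrets)
```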

Since we are most interested in comparing the performance of -RCFR with different link functions and parameters, we track the average policies for each instance exactly in a table. While this is less practical than other approaches, such as learning the average policies from data, it removes another variable from the analysis and allows us to examine the impact of different link functions in relative isolation. Equivalently, we could have saved copies of the regret estimator weights across all iterations and computed the average policy on demand, similarly to Steinberger Steinberger (2019).

Figure 3. Exploitability of the average strategy profile for all configurations and runs with the exponential and polynomial link functions. (top) The 30 and 40 partition runs are shown to display the difference in final exploitability after 100K iterations when there is a moderate degree of function approximation. (bottom) The 40 and 50 partition runs, comparable to the 30 and 40 partition runs in Leduc, are selected to display the similarity in final exploitability across the exponential and polynomial link functions in goofspiel.

4.2. Games

In Leduc hold'em poker (Southey et al., 2005), the deck consists of 6 cards, two suits each with 3 ranks, and the game is played by two players. At the start of the game each player antes 1 chip and receives one private card. Betting is restricted to two rounds with a maximum of two raises each round, and bets are limited to 2 and 4 chips. Before the second round of betting, a public card is revealed from the deck. Provided no one folds, the player with a private card matching the public card wins; if neither player matches, the winnings go to the player with the private card of highest rank. This game has 936 information states.

Goofspiel is played with two players and a deck with three suits. Each suit consists of cards of different rank. Two of the suits form the hands of the players. The third is used as a deck of point cards. At each round a card is revealed from the point deck and players simultaneously bid by playing a card from their hand. The player with the highest bid (i.e. highest rank) receives points equal to the rank of the revealed card. The player with the most points when the point deck runs out is the winner and receives a utility of +1. The loser receives a utility of -1. We use an imperfect information variant of goofspiel where the bidding cards are not revealed (Lanctot, 2013), and each suit contains five ranks. To keep the size of the game manageable, we also order the cards in the point deck from highest to lowest rank. This goofspiel variant is roughly twice as large as Leduc poker at 2124 information states.

Our experiments use the OpenSpiel (Lanctot et al., 2019) implementations of these games, and our solver code is also derived from solvers provided by this library.

Convergence to a Nash equilibrium is measured by the exploitability of the average strategy profile after each f-RCFR iteration. From here onward, exploitability will refer to the exploitability of the average profile generated after a given number of iterations, and the iteration count will be omitted for brevity when clear from context. Exploitability in Leduc is measured in milli-big-blinds, and in goofspiel in milli-utils.
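For reference, the exploitability of a tabular average profile can be computed with OpenSpiel's built-in utilities roughly as follows; this is a sketch, and the exact module paths reflect our understanding of the library rather than the paper's scripts.

```python
import pyspiel
from open_spiel.python import policy as policy_lib
from open_spiel.python.algorithms import exploitability

game = pyspiel.load_game("leduc_poker")
avg_profile = policy_lib.TabularPolicy(game)   # would be filled with the averaged f-RCFR policy
gap = exploitability.exploitability(game, avg_profile)
print(1000 * gap)                              # reported in milli-units, as in the experiments
```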

4.3. Parameters

From theorems 3.2 and 3.1, any network of external regret minimizers (one at each information state) can be combined to produce an average strategy profile with bounded exploitability. Therefore, the bounds presented in section 2.3 provide an exploitability bound for f-RCFR algorithms where f is a polynomial or exponential link function and estimates of counterfactual regrets are used at each information state in place of true values (the corollaries of section 3.4).

Most notably, the function approximator error within the regret bounds of section 2.3 appears in different forms depending on the link function f. For the polynomial link function, the bounds vary with the parameter p; similarly, the exponential link varies with its temperature parameter. We tested the polynomial link function with the common choice p = 2, in addition to a smaller and a larger value. The exponential link function was tested with several temperature values in Leduc and in goofspiel.

To understand the interdependence between a link function, link-function-specific parameters, and function approximator error, we examine the empirical exploitability of f-RCFR with different levels of approximation in Leduc and goofspiel. The degree of approximation is adjusted via the quality of the features. Each information state partition maps each of player i's information states to one of ten buckets, making the number of features per partition ten. Nine of these buckets contain an equal share of the information states, while the remaining bucket also contains any leftover states, making it slightly larger. The number of partitions, however, is allowed to vary. Adding partitions increases discriminative power and reduces approximation error (Figure 1), at the cost of increasing the computation required to train and use the regret estimator.

4.4. Results and Analysis

In Figure 2 (left), the average exploitability of the best link function and hyper-parameter configuration is plotted at different iterations. The best parameterization was selected by examining the final exploitability after 100K iterations over 5 runs. For both Leduc and goofspiel, the exploitability of the average strategy profile, bounded by the external regrets of each learner in f-RCFR, decreases at each iteration count as the number of partitions grows. Similarly, the final average exploitability of the best parameter configuration for each link function decreases for larger numbers of partitions (Figure 2 (right)).

For 30 and 40 partitions in Leduc, with moderate function approximation error, the best exponential link function outperforms all polynomial link functions, including RCFR (the polynomial link with p = 2) (Figure 3 (top)). In addition, this performance difference was observed across all configurations of the exponential and polynomial link in Leduc, i.e., all of the exponential configurations plateau to a final average exploitability lower than that of all of the polynomial link functions tested.

This performance difference was not observed in goofspiel (Figure 3 (bottom)), as all of the instances arrive at a similar average exploitability after 100K iterations (Figures 2 (right) and 3 (bottom)). This shows that the relative performance of different link functions is game dependent. However, among the different choices of p for the polynomial link function, p = 2 (RCFR) was observed to perform well relative to the other polynomial instances across all partition numbers and in both games (Figure 2 (right)).

5. Conclusions

In this paper, we generalize the RRM theorem in two dimensions: (1) alternative link functions—the polynomial and exponential link functions—and (2) a more general class of regret metrics, including external, internal, and swap regret. The generalization to different link functions allows us to construct regret bounds for a general f-RCFR algorithm, thus facilitating the use and analysis of other functional policy parameterizations, such as softmax, to solve zero-sum games with imperfect information. We then examine the performance of f-RCFR for different link functions, parameters, and levels of function approximation error in Leduc hold'em poker and imperfect information goofspiel. f-RCFR with the polynomial link function and p = 2 often achieves an exploitability competitive with or lower than other choices, but the exponential link function can outperform all polynomial parameters when the functional regret estimator has a moderate degree of approximation.

This work was primarily focused on the benefits of generalizing beyond the ReLU link function. However, extending the RRM Theorem to a more general class of regret metrics such as internal regret also enables future directions such as applying regret minimization and function approximation to find an approximate correlated equilibrium Cesa-Bianchi and Lugosi (2006) or extensive-form correlated equilibrium Von Stengel and Forges (2008).

NeuRD Omidshafiei et al. (2019) and Politex Abbasi-Yadkori et al. (2019) demonstrate that there are problems with traditional RL algorithms, such as softmax policy gradient and Q-learning Watkins (1989), that can be rectified by adapting a regret-minimizing method to the function approximation case. These algorithms are also particular ways of implementing approximate Hedge, utilizing softmax policies. Our experimental results suggest that other parameterizations, like a normalized ReLU, could perhaps improve performance in RL applications.

Acknowledgments

We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC), the Alberta Machine Intelligence Institute (Amii), and Alberta Treasury Branch (ATB). Computing resources were provided by WestGrid and Compute Canada.

References

  • Abbasi-Yadkori et al. (2019) Yasin Abbasi-Yadkori, Peter Bartlett, Kush Bhatia, Nevena Lazic, Csaba Szepesvari, and Gellert Weisz. 2019. Politex: Regret bounds for policy iteration using expert prediction. In International Conference on Machine Learning. 3692–3702.
  • Bellemare et al. (2012) Marc Bellemare, Joel Veness, and Michael Bowling. 2012. Sketch-based linear value function approximation. In Advances in Neural Information Processing Systems. 2213–2221.
  • Blum and Mansour (2007) A. Blum and Y. Mansour. 2007. Learning, Regret Minimization, and Equilibria. In

    Algorithmic Game Theory

    . Cambridge University Press, Chapter 4.
  • Bowling et al. (2015) Michael Bowling, Neil Burch, Michael Johanson, and Oskari Tammelin. 2015. Heads-up limit hold’em poker is solved. Science 347, 6218 (2015), 145–149.
  • Brown et al. (2019) Noam Brown, Adam Lerer, Sam Gross, and Tuomas Sandholm. 2019. Deep Counterfactual Regret Minimization. In Proceedings of the 36th International Conference on Machine Learning (ICML-19). 793–802.
  • Brown and Sandholm (2018) Noam Brown and Tuomas Sandholm. 2018. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science 359, 6374 (2018), 418–424.
  • Brown and Sandholm (2019) Noam Brown and Tuomas Sandholm. 2019. Superhuman AI for multiplayer poker. Science 365, 6456 (2019), 885–890.
  • Cesa-Bianchi and Lugosi (2006) Nicolo Cesa-Bianchi and Gabor Lugosi. 2006. Prediction, learning, and games. Cambridge university press.
  • Farina et al. (2019) Gabriele Farina, Christian Kroer, and Tuomas Sandholm. 2019. Regret Circuits: Composability of Regret Minimizers. In International Conference on Machine Learning. 1863–1872.
  • Freund and Schapire (1997) Yoav Freund and Robert E Schapire. 1997. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences 55, 1 (1997), 119–139.
  • Ganzfried and Sandholm (2013) Sam Ganzfried and Tuomas Sandholm. 2013. Action translation in extensive-form games with large action spaces: Axioms, paradoxes, and the pseudo-harmonic mapping. In

    Workshops at the Twenty-Seventh AAAI Conference on Artificial Intelligence

    .
  • Gordon (2005) Geoffrey J Gordon. 2005. No-regret algorithms for structured prediction problems. Technical Report. CARNEGIE-MELLON UNIV PITTSBURGH PA SCHOOL OF COMPUTER SCIENCE.
  • Greenwald et al. (2006a) Amy Greenwald, Zheng Li, and Casey Marks. 2006a. Bounds for Regret-Matching Algorithms.. In ISAIM.
  • Greenwald et al. (2006b) Amy Greenwald, Zheng Li, and Casey Marks. 2006b. Bounds for Regret-Matching Algorithms. Technical Report CS-06-10. Brown University, Department of Computer Science.
  • Hart and Mas-Colell (2000) S. Hart and A. Mas-Colell. 2000. A Simple Adaptive Procedure Leading to Correlated Equilibrium. Econometrica 68, 5 (2000), 1127–1150.
  • Heinrich and Silver (2016) Johannes Heinrich and David Silver. 2016. Deep reinforcement learning from self-play in imperfect-information games. arXiv preprint arXiv:1603.01121 (2016).
  • Johanson et al. (2013) Michael Johanson, Neil Burch, Richard Valenzano, and Michael Bowling. 2013. Evaluating state-space abstractions in extensive-form games. In Proceedings of the 2013 international conference on Autonomous agents and multi-agent systems. International Foundation for Autonomous Agents and Multiagent Systems, 271–278.
  • Lanctot (2013) Marc Lanctot. 2013. Monte Carlo Sampling and Regret Minimization for Equilibrium Computation and Decision-Making in Large Extensive Form Games. Ph.D. Dissertation. Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada.
  • Lanctot et al. (2019) Marc Lanctot, Edward Lockhart, Jean-Baptiste Lespiau, Vinicius Zambaldi, Satyaki Upadhyay, Julien Pérolat, Sriram Srinivasan, Finbarr Timbers, Karl Tuyls, Shayegan Omidshafiei, Daniel Hennes, Dustin Morrill, Paul Muller, Timo Ewalds, Ryan Faulkner, János Kramár, Bart De Vylder, Brennan Saeta, James Bradbury, David Ding, Sebastian Borgeaud, Matthew Lai, Julian Schrittwieser, Thomas Anthony, Edward Hughes, Ivo Danihelka, and Jonah Ryan-Davis. 2019. OpenSpiel: A Framework for Reinforcement Learning in Games. CoRR abs/1908.09453 (2019). arXiv:cs.LG/1908.09453 http://arxiv.org/abs/1908.09453
  • Li et al. (2018) Hui Li, Kailiang Hu, Zhibang Ge, Tao Jiang, Yuan Qi, and Le Song. 2018. Double neural counterfactual regret minimization. arXiv preprint arXiv:1812.10607 (2018).
  • Lockhart et al. (2019) Edward Lockhart, Marc Lanctot, Julien Pérolat, Jean-Baptiste Lespiau, Dustin Morrill, Finbarr Timbers, and Karl Tuyls. 2019. Computing Approximate Equilibria in Sequential Adversarial Games by Exploitability Descent. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19. International Joint Conferences on Artificial Intelligence Organization, 464–470. https://doi.org/10.24963/ijcai.2019/66
  • Moravčík et al. (2017) Matej Moravčík, Martin Schmid, Neil Burch, Viliam Lisỳ, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael Bowling. 2017. Deepstack: Expert-level artificial intelligence in heads-up no-limit poker. Science 356, 6337 (2017), 508–513.
  • Morrill (2016) Dustin Morrill. 2016. Using Regret Estimation to Solve Games Compactly. Master’s thesis. University of Alberta.
  • Omidshafiei et al. (2019) Shayegan Omidshafiei, Daniel Hennes, Dustin Morrill, Remi Munos, Julien Perolat, Marc Lanctot, Audrunas Gruslys, Jean-Baptiste Lespiau, and Karl Tuyls. 2019. Neural Replicator Dynamics. arXiv preprint arXiv:1906.00190 (2019).
  • Silver et al. (2016) David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Vedavyas Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy P. Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529, 7587 (2016), 484–489.
  • Silver et al. (2018) David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. 2018. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 362, 6419 (2018), 1140–1144.
  • Southey et al. (2005) Finnegan Southey, Michael Bowling, Bryce Larson, Carmelo Piccione, Neil Burch, Darse Billings, and Chris Rayner. 2005. Bayes' bluff: Opponent modelling in poker. In Proceedings of the 21st Annual Conference on Uncertainty in Artificial Intelligence (UAI), 550–558.
  • Srinivasan et al. (2018) Sriram Srinivasan, Marc Lanctot, Vinicius Zambaldi, Julien Perolat, Karl Tuyls, Remi Munos, and Michael Bowling. 2018. Actor-Critic Policy Optimization in Partially Observable Multiagent Environments. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.). Curran Associates, Inc., 3422–3435. http://papers.nips.cc/paper/7602-actor-critic-policy-optimization-in-partially-observable-multiagent-environments.pdf
  • Steinberger (2019) Eric Steinberger. 2019. Single Deep Counterfactual Regret Minimization. arXiv preprint arXiv:1901.07621 (2019).
  • Vinyals et al. (2019) Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H. Choi, Richard Powell, Timo Ewalds, Petko Georgiev, Junhyuk Oh, Dan Horgan, Manuel Kroiss, Ivo Danihelka, Aja Huang, Laurent Sifre, Trevor Cai, John P. Agapiou, Max Jaderberg, Alexander S. Vezhnevets, Rémi Leblond, Tobias Pohlen, Valentin Dalibard, David Budden, Yury Sulsky, James Molloy, Tom L. Paine, Caglar Gulcehre, Ziyu Wang, Tobias Pfaff, Yuhuai Wu, Roman Ring, Dani Yogatama, Dario Wünsch, Katrina McKinney, Oliver Smith, Tom Schaul, Timothy Lillicrap, Koray Kavukcuoglu, Demis Hassabis, Chris Apps, and David Silver. 2019. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature (2019). https://doi.org/10.1038/s41586-019-1724-z
  • Von Stengel and Forges (2008) Bernhard Von Stengel and Françoise Forges. 2008. Extensive-form correlated equilibrium: Definition and computational complexity. Mathematics of Operations Research 33, 4 (2008), 1002–1022.
  • Watkins (1989) Christopher John Cornish Hellaby Watkins. 1989. Learning from delayed rewards. (1989).
  • Waugh et al. (2015) Kevin Waugh, Dustin Morrill, James Andrew Bagnell, and Michael Bowling. 2015. Solving games with functional regret estimation. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
  • Waugh et al. (2009) Kevin Waugh, David Schnizlein, Michael Bowling, and Duane Szafron. 2009. Abstraction pathologies in extensive games. In Proceedings of The 8th International Conference on Autonomous Agents and Multiagent Systems-Volume 2. International Foundation for Autonomous Agents and Multiagent Systems, 781–788.
  • Zinkevich et al. (2008) Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione. 2008. Regret minimization in games with incomplete information. In Advances in neural information processing systems. 1729–1736.

Appendix A Existing Results

Below we recall results from Greenwald et al. (2006a) and include the detailed proofs omitted in the main body of the paper.

Many of the following results make use of a Gordon triple. We restate the definition from Greenwald et al. below.

A Gordon triple consists of three functions G, f, and g such that for all x, y ∈ ℝ^n, G(x + y) ≤ G(x) + f(x) · y + g(y).

The first supporting lemma bounds the maximum component of a random vector taking values in ℝ^n in terms of an appropriate potential; see (Greenwald et al., 2006a, Lemma 21).

Given a reward system (A, R) and a finite set of action transformations Φ, the instantaneous Φ-regrets are bounded for any reward function. The proof is identical to (Greenwald et al., 2006a, Lemma 22), except that here the regrets lie in a different bounded interval. Also note that by assumption the reward set is bounded.

[Gordon 2005] Assume (G, f, g) is a Gordon triple. Let X_1, X_2, … be a sequence of random vectors over ℝ^n, and define the partial sums S_t = Σ_{k=1}^{t} X_k for all times t, with S_0 = 0.
If, for all times t, f(S_{t−1}) · E[X_t | S_{t−1}] ≤ ε_t,
then, for all times t, E[G(S_t)] ≤ G(0) + Σ_{k=1}^{t} ( E[g(X_k)] + ε_k ).

It should be noted that the above theorem was originally proved by Gordon (2005).

Appendix B Proofs

Given a reward system (A, R), a finite set of action transformations Φ, and a link function f, an approximate (Φ, f)-regret-matching algorithm satisfies the ε-Blackwell condition with an ε that scales with the error between the estimated and the exact link outputs.

Proof.

We denote by r the reward vector for an arbitrary reward function. Since, by construction, this algorithm chooses its policy at each timestep to be the fixed point of the combined transformation built from the estimated link outputs, all that remains to be shown is that it satisfies the ε-Blackwell condition with the stated ε.

By expanding the value of interest in the ε-Blackwell condition and applying elementary upper bounds, we arrive at the desired bound. For simplicity, we omit timestep indices. First, suppose the estimated link outputs are not all zero: