Bounds for Approximate Regret-Matching Algorithms

10/03/2019 · Ryan D'Orazio et al. · University of Alberta

A dominant approach to solving large imperfect-information games is Counterfactual Regret Minimization (CFR). In CFR, many regret minimization problems are combined to solve the game. For very large games, abstraction is typically needed to render CFR tractable. Abstractions are often manually tuned, possibly removing important strategic differences in the full game and harming performance. Function approximation provides a natural solution to finding good abstractions to approximate the full game. A common approach to incorporating function approximation is to learn the inputs needed for a regret minimizing algorithm, allowing for generalization across many regret minimization problems. This paper gives regret bounds when a regret minimizing algorithm uses estimates instead of true values. This form of analysis is the first to generalize to a larger class of (Φ, f)-regret matching algorithms, and includes different forms of regret such as swap, internal, and external regret. We demonstrate how these results give a slightly tighter bound for Regression Regret-Matching (RRM), and present a novel bound for combining regression with Hedge.


1 Introduction

The dominant framework for approximating Nash equilibria in extensive-form games with imperfect information is Counterfactual Regret Minimization (CFR), and it has successfully been used to solve and expertly play human-scale poker games [1, 2, 3, 4]. This framework is built on the idea of decomposing a game into a network of simple regret minimizers [5, 6]. For very large games, abstraction is typically used to reduce the size of the game, yielding a strategically similar game that is feasible to solve with CFR [5, 7, 8, 9].

Function approximation is a natural generalization of abstraction. In CFR, this amounts to estimating the regrets for each regret minimizer instead of storing them all in a table [10, 11, 12, 13, 14, 15]. Function approximation can be competitive with domain specific state abstraction [10, 11, 12], and in some cases is able to outperform tabular CFR without abstraction if the players are optimizing against their best responses [14].

Combining regression and regret minimization with applications to CFR was initially studied by Waugh et al. [10], who introduced the RRM theorem, giving a sufficient condition on function approximator error under which no external regret is still achieved. In this paper we generalize the RRM theorem to a larger class of regret minimizers and to Φ-regret, a family of regret metrics that includes external regret, internal regret, and swap regret. Extending to a larger class of regret minimizers provides insight into the effectiveness of combining function approximation and regret minimization, since the effect of function approximation error on the bounds varies between algorithms. Furthermore, extending to other algorithms can give theory for existing or future methods. For example, there has been interest in a functional version of Hedge, an algorithm within the studied class, for general multiagent and non-stationary settings, where it can outperform softmax policy gradient methods [13]. Extending to a more general class of regret metrics, such as internal regret, allows for potentially novel applications of regret minimization with function approximation, including finding an approximate correlated equilibrium [16].

2 Preliminaries

We adopt the notation from Greenwald et al. [17] to describe an online decision problem (ODP). An ODP consists of a set of possible actions $A$ and a set of possible rewards $\mathcal{R}$. In this paper we assume a finite set of actions and bounded rewards $\mathcal{R} \subseteq [0, U]$ with $U > 0$ (the restriction to positive rewards is without loss of generality and is only used for convenience). The tuple $(A, \mathcal{R})$ fully characterizes the problem and is referred to as a reward system. Furthermore, let $\mathcal{R}^A$ denote the set of reward functions $r: A \to \mathcal{R}$.

At each round $t$ the agent selects a distribution over actions $\pi_t \in \Delta(A)$, samples an action $a_t \sim \pi_t$, and then receives the reward function $r_t \in \mathcal{R}^A$. The agent is able to compute the reward $r_t(a)$ for actions $a$ that were not taken at time $t$, in contrast to the bandit setting where the agent only observes $r_t(a_t)$. Crucially, each $r_t$ is allowed to be selected arbitrarily from $\mathcal{R}^A$. As a consequence, this ODP model is flexible enough to encompass multi-agent, adversarial interactions and game-theoretic equilibrium concepts, even though it is described from the perspective of a single agent's decisions.

A learning algorithm in an ODP selects $\pi_t$ using information from the history of observations and actions previously taken. We denote this information at time $t$ as the history $h_t = (a_1, r_1, \ldots, a_t, r_t)$, where $h_t \in H_t = (A \times \mathcal{R}^A)^t$. Formally, an online learning algorithm is a sequence of functions $(L_t)_{t \ge 1}$, where $L_t: H_{t-1} \to \Delta(A)$.
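To make the setting concrete, below is a minimal sketch of the ODP interaction loop under full-information feedback. The uniform learner, the random reward functions, and all names here are illustrative assumptions rather than part of the formal model.

```python
import numpy as np

rng = np.random.default_rng(0)
U = 1.0           # assumed reward bound: rewards lie in [0, U]
n_actions = 3     # |A|

def learner(history):
    """Illustrative online learning algorithm L_t: maps the history to a distribution over actions.
    Here it simply plays uniformly at random."""
    return np.full(n_actions, 1.0 / n_actions)

history = []
for t in range(5):
    pi_t = learner(history)                    # distribution over actions pi_t
    a_t = rng.choice(n_actions, p=pi_t)        # sampled action a_t ~ pi_t
    r_t = rng.uniform(0.0, U, size=n_actions)  # the reward function may be chosen arbitrarily
    # Full information: the learner observes r_t(a) for every action a, not only r_t(a_t).
    history.append((a_t, r_t))
```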

2.1 Action Transformations

To generalize the analysis to different forms of regret (e.g., swap, internal, and external regret), it is useful to define action transformations. Action transformations are functions of the form $\phi: A \to \Delta(A)$, giving a distribution over actions for each action input. Let $\Phi^{\text{ALL}}_A$ denote the set of all action transformations for the set of actions $A$, and $\Phi^{\text{SWAP}}_A$ the set of all action transformations whose codomain is the set of pure strategies for action set $A$.

Two important subsets of $\Phi^{\text{SWAP}}_A$ are $\Phi^{\text{EXT}}_A$ and $\Phi^{\text{INT}}_A$. $\Phi^{\text{EXT}}_A$ denotes the set of all external transformations, that is, the constant action transformations in $\Phi^{\text{SWAP}}_A$. More formally, if $\delta_a$ is the distribution with full weight on action $a$, then $\Phi^{\text{EXT}}_A = \{\phi : \exists\, b \in A \text{ such that } \phi(a) = \delta_b \text{ for all } a \in A\}$.

$\Phi^{\text{INT}}_A$ consists of the set of all possible internal transformations for action set $A$, where the internal transformation from action $a$ to action $b$ is defined as $\phi_{a \to b}(a') = \delta_b$ if $a' = a$, and $\phi_{a \to b}(a') = \delta_{a'}$ otherwise.

We have that $\Phi^{\text{EXT}}_A, \Phi^{\text{INT}}_A \subseteq \Phi^{\text{SWAP}}_A \subseteq \Phi^{\text{ALL}}_A$ [17].

We will also make use of the linear extension of an action transformation $\phi$ to mixed strategies, $\phi: \Delta(A) \to \Delta(A)$, defined as $\phi(\pi) = \sum_{a \in A} \pi(a)\,\phi(a)$.
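As an illustration of these definitions (using an encoding of our own choosing), each action transformation over a finite action set can be stored as a row-stochastic matrix whose row $a$ is the distribution $\phi(a)$; the linear extension to mixed strategies is then a vector-matrix product.

```python
import numpy as np

n = 3  # |A|

def external_transformation(b, n):
    """External transformation: phi(a') = delta_b for every a' (a constant transformation)."""
    phi = np.zeros((n, n))
    phi[:, b] = 1.0
    return phi

def internal_transformation(a, b, n):
    """Internal transformation from a to b: phi(a') = delta_b if a' == a, delta_{a'} otherwise."""
    phi = np.eye(n)
    phi[a] = 0.0
    phi[a, b] = 1.0
    return phi

def apply_to_mixed(phi, pi):
    """Linear extension: phi(pi) = sum_a pi(a) * phi(a)."""
    return pi @ phi

pi = np.array([0.5, 0.3, 0.2])
print(apply_to_mixed(external_transformation(1, n), pi))     # all mass moved to action 1
print(apply_to_mixed(internal_transformation(0, 2, n), pi))  # mass of action 0 moved to action 2
```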

2.2 Regret

For a given action transformation $\phi$ we can compute the difference in expected reward for a particular action and reward function. This expected difference, known as $\phi$-regret, is denoted $\rho^{\phi}(r, a) = \mathbb{E}_{a' \sim \phi(a)}[r(a')] - r(a)$. For a set of action transformations $\Phi$, the $\Phi$-regret vector is $\rho^{\Phi}(r, a) = \left(\rho^{\phi}(r, a)\right)_{\phi \in \Phi} \in \mathbb{R}^{|\Phi|}$. Note that the expected value of the $\phi$-regret if the agent chooses $a \sim \pi$ is $\mathbb{E}_{a \sim \pi}\left[\rho^{\phi}(r, a)\right] = \mathbb{E}_{a' \sim \phi(\pi)}[r(a')] - \mathbb{E}_{a \sim \pi}[r(a)]$.

For an ODP with observed history $h_t$ at time $t$, with reward functions $r_1, \ldots, r_t$ and actions $a_1, \ldots, a_t$, the cumulative $\Phi$-regret for time $t$ and action transformations $\Phi$ is $R^{\Phi}_t(h_t) = \sum_{\tau=1}^{t} \rho^{\Phi}(r_\tau, a_\tau)$. For brevity we will omit the $h_t$ argument, and for convenience we set $R^{\Phi}_0 = \mathbf{0}$. Note that $R^{\Phi}_t$ is a random vector, and we seek to bound

$$\mathbb{E}\left[\max_{\phi \in \Phi} R^{\phi}_t\right]. \tag{1}$$

Choosing $\Phi \in \{\Phi^{\text{EXT}}_A, \Phi^{\text{INT}}_A, \Phi^{\text{SWAP}}_A\}$ in (1) amounts to minimizing external regret, internal regret, and swap regret, respectively. One can also change (1) by interchanging the max and the expectation. In RRM, $\max_{\phi \in \Phi} \mathbb{E}[R^{\phi}_t]$ is the quantity bounded [10, 11]; however, bounds for (1) still apply [17, Corollary 18].
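The $\Phi$-regret vector and its cumulative sum can be computed directly from the matrix encoding above. The sketch below is illustrative only: it uses external transformations, a uniform placeholder policy, and random rewards.

```python
import numpy as np

n = 3
# Phi = external transformations: one constant transformation per target action.
Phi = []
for b in range(n):
    phi = np.zeros((n, n))
    phi[:, b] = 1.0
    Phi.append(phi)

def phi_regret_vector(transformations, r, a):
    """rho^phi(r, a) = E_{a' ~ phi(a)}[r(a')] - r(a), stacked over phi in Phi."""
    return np.array([phi[a] @ r - r[a] for phi in transformations])

rng = np.random.default_rng(1)
R = np.zeros(len(Phi))          # cumulative Phi-regret, R_0 = 0
for t in range(100):
    a_t = rng.integers(n)       # placeholder policy: uniform random actions
    r_t = rng.uniform(0.0, 1.0, size=n)
    R += phi_regret_vector(Phi, r_t, a_t)
print("max external regret:", R.max())
```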

3 Approximate Regret-Matching

A $(\Phi, f)$-regret-matching algorithm is an ODP algorithm, characterized by a set of action transformations $\Phi$ and a link function $f: \mathbb{R}^{|\Phi|} \to \mathbb{R}^{|\Phi|}_{+}$ that is a subgradient of a convex potential function $G: \mathbb{R}^{|\Phi|} \to \mathbb{R}$, where $\mathbb{R}^{|\Phi|}_{+}$ denotes the $|\Phi|$-dimensional positive orthant (as long as $G$ is bounded from above on the negative orthant, the codomain of $f$ is the positive orthant), that satisfies the generalized $(\Phi, f)$-Blackwell condition [17]. Examples of $(\Phi, f)$-regret-matching algorithms include Hart's algorithm [18], typically called "regret-matching" or the polynomial weighted average forecaster [16], and Hedge [19], the exponentially weighted average forecaster [16], with link functions $f: x \mapsto \left((x)^{+}\right)^{p-1}$ for $p \ge 2$ and $f: x \mapsto e^{\eta x}$ (applied componentwise) with parameter $\eta > 0$, respectively.
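For concreteness, here is a minimal sketch of the two link functions named above, applied componentwise to a cumulative regret vector; the function names and parameter values are our own.

```python
import numpy as np

def polynomial_link(x, p=2.0):
    """Polynomial link: f(x) = ((x)^+)^(p - 1), componentwise; p = 2 recovers Hart's regret-matching."""
    return np.maximum(x, 0.0) ** (p - 1.0)

def exponential_link(x, eta=0.1):
    """Exponential link: f(x) = exp(eta * x), componentwise, as used by Hedge."""
    return np.exp(eta * x)

R = np.array([3.0, -1.0, 0.5])  # a toy cumulative regret vector
print(polynomial_link(R))       # [3.  0.  0.5]
print(exponential_link(R))
```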

A useful technique for bounding regret when estimates are used in place of true values is to define an approximate Blackwell condition, as was done for the RRM theorem [10]. RRM is a specific instance of the class where $\Phi = \Phi^{\text{EXT}}_A$ and $f$ is the polynomial link with $p = 2$. To generalize across different link functions and transformation sets $\Phi$, we define the $(\Phi, f, \epsilon)$-Blackwell condition.

Definition 1 ($(\Phi, f, \epsilon)$-Blackwell Condition).

For a given reward system $(A, \mathcal{R})$, a finite set of action transformations $\Phi$, and a link function $f$, a learning algorithm satisfies the $(\Phi, f, \epsilon)$-Blackwell condition with value $\epsilon \ge 0$ if, for all times $t$ and all reward functions $r_t \in \mathcal{R}^A$,

$$f\!\left(R^{\Phi}_{t-1}\right) \cdot \mathbb{E}_{a_t \sim \pi_t}\!\left[\rho^{\Phi}(r_t, a_t)\right] \le \epsilon.$$

The Regret Matching Theorem [17] shows that the $(\Phi, f, \epsilon)$-Blackwell condition ($\epsilon = 0$) holds with equality for any finite set of action transformations $\Phi$ and link function $f$, if the algorithm chooses a $\pi_t$ that is a fixed point of the transformation $\phi_t = \sum_{\phi \in \Phi} \frac{y_\phi}{\lVert y \rVert_1}\, \phi$, where $y = f(R^{\Phi}_{t-1})$ (since $\phi_t$ is a linear operator over the simplex $\Delta(A)$, the fixed point always exists by the Brouwer fixed point theorem). If $y \ne \mathbf{0}$ then the fixed point of $\phi_t$ is a distribution [20]. An algorithm that chooses the above fixed point of $\phi_t$ when $y \ne \mathbf{0}$ and plays arbitrarily otherwise is $(\Phi, f)$-regret-matching.

We seek to bound objective (1) when an algorithm at time $t$ chooses the fixed point of $\tilde{\phi}_t = \sum_{\phi \in \Phi} \frac{(\tilde{y}_t)_\phi}{\lVert \tilde{y}_t \rVert_1}\, \phi$ when $\tilde{y}_t \ne \mathbf{0}$, and plays arbitrarily otherwise, where $\tilde{y}_t$ is an estimate of $f(R^{\Phi}_{t-1})$, possibly from a function approximator. Such an algorithm is referred to as approximate $(\Phi, f)$-regret-matching.
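With the matrix encoding from the earlier sketches, the link-weighted combination of transformations is itself a row-stochastic matrix, so the required fixed point is a stationary distribution and can be approximated with a few power-iteration steps. This is one convenient illustrative implementation, not necessarily how the fixed point is computed in practice.

```python
import numpy as np

def matching_policy(transformations, y, n, iters=100):
    """Fixed point of pi -> sum_phi (y_phi / ||y||_1) * phi(pi).

    `y` is the (possibly estimated) link output; if y = 0, any policy may be played."""
    if np.all(y == 0.0):
        return np.full(n, 1.0 / n)           # arbitrary choice when the link output is zero
    w = y / y.sum()                          # normalized weights over transformations
    M = sum(wj * phi for wj, phi in zip(w, transformations))  # row-stochastic matrix
    pi = np.full(n, 1.0 / n)
    for _ in range(iters):                   # power iteration towards a stationary distribution
        pi = pi @ M
    return pi

n = 3
Phi = []                                     # external transformations as a running example
for b in range(n):
    phi = np.zeros((n, n))
    phi[:, b] = 1.0
    Phi.append(phi)

y_estimate = np.array([2.0, 0.5, 0.0])       # stand-in for an estimate of f(R_{t-1})
print(matching_policy(Phi, y_estimate, n))   # for external transformations this equals y / ||y||_1
```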

Similarly to the RRM theorem [10, 11], we show that the $\epsilon$ parameter of the $(\Phi, f, \epsilon)$-Blackwell condition depends on the error in approximating the exact link outputs, $f(R^{\Phi}_{t-1}) - \tilde{y}_t$.

Theorem 1.

Given a reward system $(A, \mathcal{R})$, a finite set of action transformations $\Phi$, and a link function $f$, an approximate $(\Phi, f)$-regret-matching algorithm, $\mathcal{L}$, is a $(\Phi, f, \epsilon_t)$-regret-matching algorithm with $\epsilon_t = U \, \lVert f(R^{\Phi}_{t-1}) - \tilde{y}_t \rVert_1$, where $\tilde{y}_t$ is the estimate of $f(R^{\Phi}_{t-1})$ used at time $t$, and $U = \sup \mathcal{R}$.

All proofs are deferred to the appendix.

For a $(\Phi, f, \epsilon)$-regret-matching algorithm, an approach to bounding (1) is to use the $(\Phi, f, \epsilon)$-Blackwell condition and provide a bound on the potential $G(R^{\Phi}_t)$ for a particular potential function $G$ [17, 16]. Bounding the regret (1) for an approximate $(\Phi, f)$-regret-matching algorithm will be done similarly, except the bound on $\epsilon$ from Theorem 1 will be used. Proceeding in this fashion yields the following theorem:
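The sketch below numerically illustrates the mechanism behind this argument for external transformations and the exponential link: when the policy is the fixed point for the estimated link outputs, the $\tilde{y}$ term of the Blackwell value vanishes, and what remains is controlled by the link-output estimation error. The quantity checked in the assertion, $U$ times the $\ell_1$ norm of the error, follows from Hölder's inequality and the fixed-point property; it is offered as an illustration of the dependence, not as a statement of the theorem's exact constant.

```python
import numpy as np

rng = np.random.default_rng(2)
n, U, eta = 4, 1.0, 0.5

Phi = []                         # external transformations as row-stochastic matrices
for b in range(n):
    phi = np.zeros((n, n))
    phi[:, b] = 1.0
    Phi.append(phi)

def expected_regret_vector(Phi, r, pi):
    """E_{a ~ pi}[rho^phi(r, a)] for each phi in Phi."""
    return np.array([(pi @ phi) @ r - pi @ r for phi in Phi])

for trial in range(1000):
    R = rng.normal(scale=5.0, size=len(Phi))                     # some cumulative regret vector
    y = np.exp(eta * R)                                          # exact link outputs
    y_tilde = np.abs(y + rng.normal(scale=0.3, size=len(Phi)))   # noisy nonnegative estimate
    pi = y_tilde / y_tilde.sum()        # fixed point of the estimated weights (external case)
    r = rng.uniform(0.0, U, size=n)     # an arbitrary reward function in [0, U]^A
    blackwell_value = y @ expected_regret_vector(Phi, r, pi)
    assert blackwell_value <= U * np.abs(y - y_tilde).sum() + 1e-9
```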

Theorem 2.

Given a real-valued reward system $(A, \mathcal{R})$ and a finite set of action transformations $\Phi$, if $(G, f, \gamma)$ is a Gordon triple (see Definition 2 in the appendix), then an approximate $(\Phi, f)$-regret-matching algorithm guarantees at all times $t$

$$\mathbb{E}\left[G\!\left(R^{\Phi}_t\right)\right] \le G(\mathbf{0}) + \sum_{\tau=1}^{t} \mathbb{E}\left[\gamma\!\left(\rho^{\Phi}(r_\tau, a_\tau)\right) + \epsilon_\tau\right],$$

with $\epsilon_\tau = U \, \lVert f(R^{\Phi}_{\tau-1}) - \tilde{y}_\tau \rVert_1$ as in Theorem 1.

4 Bounds

4.1 Polynomial Link

Given the polynomial link function $f: x \mapsto ((x)^{+})^{p-1}$, we consider two cases, $p > 2$ and $p = 2$. For the following results it is useful to recall the maximal activation of a transformation set, as defined by Greenwald et al. [17].

For the case $p > 2$ we have the following bound on (1).

Theorem 3.

Given an ODP, a finite set of action transformations $\Phi$, and the polynomial link function with $p > 2$, an approximate $(\Phi, f)$-regret-matching algorithm guarantees a bound on (1) that grows on the order of $\sqrt{t}$, with the accumulated link-output approximation errors entering under the root.

Similarly, for the case $p = 2$ we have the following.

Theorem 4.

Given an ODP, a finite set of action transformations $\Phi$, and the polynomial link function with $p = 2$, an approximate $(\Phi, f)$-regret-matching algorithm guarantees a bound on (1) of the same form as in Theorem 3, again with the accumulated link-output approximation errors entering under the root.

In comparison to the RRM theorem [11], the above bound is tighter: there is no additional multiplicative factor in front of the approximation errors, and one term of the RRM bound is replaced by a smaller quantity (for $p = 2$; see [17]). These improvements are due to the tighter bound in Theorem 1 and to the original $\Phi$-regret analysis [17], respectively. Aside from these differences, the bounds coincide.

4.2 Exponential Link

Theorem 5.

Given an ODP, a finite set of action transformations $\Phi$, and the exponential link function $f: x \mapsto e^{\eta x}$ with $\eta > 0$, an approximate $(\Phi, f)$-regret-matching algorithm guarantees a bound on (1) in which the accumulated link-output approximation errors enter additively, outside of any root.

The Hedge algorithm corresponds to the exponential link function with $\Phi = \Phi^{\text{EXT}}_A$, so Theorem 5 provides a bound for a regression Hedge algorithm. Note that in this case the approximation error term is not inside a root function, as it is under a polynomial link function. This seems to imply that, at the level of link outputs, polynomial link functions have a better dependence on the approximation errors. However, in the exponential link function bound the link outputs are normalized to the simplex, while the polynomial link outputs can take on larger values. So which link function has a better dependence on the approximation errors depends on the magnitude of the cumulative regrets, which in turn depends on the environment and the algorithm's empirical performance.
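To illustrate how regression and Hedge might be combined in practice, the sketch below fits a simple least-squares regressor that predicts cumulative external regrets from hand-crafted per-action features and plays the softmax of the predictions. The features, the regressor, and the training loop are illustrative assumptions; this is not the construction analyzed in the paper, only a toy instance of the same idea.

```python
import numpy as np

rng = np.random.default_rng(4)
n, eta, T = 5, 0.5, 200

def features(a):
    """Hypothetical per-action features; in CFR these would describe an information set and action."""
    return np.array([1.0, a, a ** 2])

X = np.stack([features(a) for a in range(n)])    # feature matrix, one row per action
R = np.zeros(n)                                  # true cumulative external regrets (training targets)
for t in range(1, T + 1):
    w = np.linalg.lstsq(X, R, rcond=None)[0]     # least-squares fit of regrets from features
    R_hat = X @ w                                # estimated cumulative regrets
    pi = np.exp(eta * R_hat)                     # approximate Hedge: softmax of estimated regrets
    pi /= pi.sum()
    a_t = rng.choice(n, p=pi)
    r_t = rng.uniform(0.0, 1.0, size=n)          # full-information rewards
    R += r_t - r_t[a_t]                          # update the true external regrets (kept only as targets)
print("average external regret:", R.max() / T)
```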

Acknowledgments

We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC), the Alberta Machine Intelligence Institute (Amii), and Alberta Treasury Branch (ATB).

References

  • [1] Michael Bowling, Neil Burch, Michael Johanson, and Oskari Tammelin. Heads-up limit hold’em poker is solved. Science, 347(6218):145–149, 2015.
  • [2] Matej Moravčík, Martin Schmid, Neil Burch, Viliam Lisý, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael Bowling. DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356(6337):508–513, 2017.
  • [3] Noam Brown and Tuomas Sandholm. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, 359(6374):418–424, 2018.
  • [4] Noam Brown and Tuomas Sandholm. Superhuman AI for multiplayer poker. Science, 365(6456):885–890, 2019.
  • [5] Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione. Regret minimization in games with incomplete information. In Advances in neural information processing systems, pages 1729–1736, 2008.
  • [6] Gabriele Farina, Christian Kroer, and Tuomas Sandholm. Regret circuits: Composability of regret minimizers. In International Conference on Machine Learning, pages 1863–1872, 2019.
  • [7] Kevin Waugh, David Schnizlein, Michael Bowling, and Duane Szafron. Abstraction pathologies in extensive games. In Proceedings of The 8th International Conference on Autonomous Agents and Multiagent Systems-Volume 2, pages 781–788. International Foundation for Autonomous Agents and Multiagent Systems, 2009.
  • [8] Michael Johanson, Neil Burch, Richard Valenzano, and Michael Bowling. Evaluating state-space abstractions in extensive-form games. In Proceedings of the 2013 international conference on Autonomous agents and multi-agent systems, pages 271–278. International Foundation for Autonomous Agents and Multiagent Systems, 2013.
  • [9] Sam Ganzfried and Tuomas Sandholm. Action translation in extensive-form games with large action spaces: Axioms, paradoxes, and the pseudo-harmonic mapping. In Workshops at the Twenty-Seventh AAAI Conference on Artificial Intelligence, 2013.
  • [10] Kevin Waugh, Dustin Morrill, James Andrew Bagnell, and Michael Bowling. Solving games with functional regret estimation. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
  • [11] Dustin Morrill. Using Regret Estimation to Solve Games Compactly. Master’s thesis, University of Alberta, 2016.
  • [12] Noam Brown, Adam Lerer, Sam Gross, and Tuomas Sandholm. Deep counterfactual regret minimization. In Proceedings of the 36th International Conference on Machine Learning (ICML-19), pages 793–802, 2019.
  • [13] Shayegan Omidshafiei, Daniel Hennes, Dustin Morrill, Remi Munos, Julien Perolat, Marc Lanctot, Audrunas Gruslys, Jean-Baptiste Lespiau, and Karl Tuyls. Neural replicator dynamics. arXiv preprint arXiv:1906.00190, 2019.
  • [14] Edward Lockhart, Marc Lanctot, Julien Pérolat, Jean-Baptiste Lespiau, Dustin Morrill, Finbarr Timbers, and Karl Tuyls. Computing approximate equilibria in sequential adversarial games by exploitability descent. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 464–470. International Joint Conferences on Artificial Intelligence Organization, 7 2019.
  • [15] Eric Steinberger. Single deep counterfactual regret minimization. arXiv preprint arXiv:1901.07621, 2019.
  • [16] Nicolo Cesa-Bianchi and Gabor Lugosi. Prediction, learning, and games. Cambridge university press, 2006.
  • [17] Amy Greenwald, Zheng Li, and Casey Marks. Bounds for regret-matching algorithms. In ISAIM, 2006.
  • [18] S. Hart and A. Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127–1150, 2000.
  • [19] Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119–139, 1997.
  • [20] Amy Greenwald, Zheng Li, and Casey Marks. Bounds for regret-matching algorithms. Technical Report CS-06-10, Brown University, Department of Computer Science, 2006.
  • [21] Geoffrey J Gordon. No-regret algorithms for structured prediction problems. Technical report, Carnegie Mellon University, School of Computer Science, 2005.
  • [22] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.
  • [23] Elad Hazan. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3-4):157–325, 2016.

Appendix A Existing Results

Lemma 1.

If $X$ is a random vector that takes values in $\mathbb{R}^n$, then $\mathbb{E}\left[\max_i X_i\right] \le \left(\mathbb{E}\left[\sum_{i=1}^{n} \left((X_i)^{+}\right)^p\right]\right)^{1/p}$ for $p \ge 1$.

See [17, Lemma 21].

Lemma 2.

Given a reward system $(A, \mathcal{R})$ and a finite set of action transformations $\Phi$, for any reward function $r \in \mathcal{R}^A$ and action $a \in A$, every component of the regret vector $\rho^{\Phi}(r, a)$ lies in $[-U, U]$.

The proof is identical to [17, Lemma 22], except that regrets are bounded in $[-U, U]$ instead of $[-1, 1]$. Also note that by assumption $\mathcal{R}$ is bounded.

Theorem 6 (Gordon 2005).

Assume $(G, f, \gamma)$ is a Gordon triple and $\epsilon_t \ge 0$ for all times $t$. Let $X_0 = \mathbf{0}$, let $(\hat{x}_t)_{t \ge 1}$ be a sequence of random vectors over $\mathbb{R}^n$, and define $X_t = X_{t-1} + \hat{x}_t$ for all times $t$.
If for all times $t$,

$$\mathbb{E}\left[f(X_{t-1}) \cdot \hat{x}_t \mid X_{t-1}\right] \le \epsilon_t,$$

then, for all times $t$,

$$\mathbb{E}\left[G(X_t)\right] \le G(\mathbf{0}) + \sum_{\tau=1}^{t} \mathbb{E}\left[\gamma(\hat{x}_\tau) + \epsilon_\tau\right].$$

It should be noted that the above theorem was originally proved by Gordon [21].
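As a concrete example (our own standard choice, not necessarily the triple used in the proofs), the $p = 2$ polynomial potential $G(x) = \lVert (x)^{+} \rVert_2^2$ with $f(x) = 2 (x)^{+}$ and $\gamma(y) = \lVert y \rVert_2^2$ forms a Gordon triple, i.e., $G(x + y) \le G(x) + f(x) \cdot y + \gamma(y)$; the sketch below checks this inequality numerically.

```python
import numpy as np

rng = np.random.default_rng(3)

def G(x):      # p = 2 polynomial potential
    return np.sum(np.maximum(x, 0.0) ** 2)

def f(x):      # a subgradient of G
    return 2.0 * np.maximum(x, 0.0)

def gamma(y):  # curvature term
    return np.sum(y ** 2)

# Check the Gordon-triple inequality G(x + y) <= G(x) + f(x) . y + gamma(y) on random vectors.
for _ in range(10000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    assert G(x + y) <= G(x) + f(x) @ y + gamma(y) + 1e-9
```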

Appendix B Proofs

Proof of Theorem 1.

We denote by $\mathbf{r} \in \mathcal{R}^{|A|}$ the reward vector for an arbitrary reward function $r$. Since by construction this algorithm chooses $\pi_t$ at each timestep to be the fixed point of $\tilde{\phi}_t$, all that remains to be shown is that this algorithm satisfies the $(\Phi, f, \epsilon)$-Blackwell condition with the stated value $\epsilon_t$.

By expanding the value of interest in the $(\Phi, f, \epsilon)$-Blackwell condition and applying elementary upper bounds, we arrive at the desired bound. For simplicity, we omit timestep indices and write $y = f(R^{\Phi})$ and $\tilde{y}$ for its estimate. First, suppose $\tilde{y} \ne \mathbf{0}$: expanding $y \cdot \mathbb{E}_{a \sim \pi}\left[\rho^{\Phi}(r, a)\right] = (y - \tilde{y}) \cdot \mathbb{E}_{a \sim \pi}\left[\rho^{\Phi}(r, a)\right]$ (the $\tilde{y}$ term vanishes at the fixed point) and applying Hölder's inequality together with the reward bound gives the desired bound.

If $\tilde{y} = \mathbf{0}$, it is easy to see that the inequality still holds.

Therefore, the algorithm satisfies the $(\Phi, f, \epsilon)$-Blackwell condition with the stated value $\epsilon_t$, as required to complete the argument. ∎

An important consequence of Theorem 1 is the following corollary:

Corollary 1.

For a reward system $(A, \mathcal{R})$, a finite set of action transformations $\Phi$, and two link functions $f_1$ and $f_2$, if there exists a strictly positive function $c$ such that $f_1(x) = c(x)\, f_2(x)$ for any $x$, then an approximate $(\Phi, f_1)$-regret-matching algorithm also satisfies a $(\Phi, f_2, \epsilon)$-Blackwell condition, with $\epsilon$ given by the corresponding approximation error as in Theorem 1.

Proof.

The reasoning is similar to [17, Lemma 20]. The played fixed point is the same under both link functions, so following the same steps as in Theorem 1 provides the above bound. ∎

Proof of Theorem 2.

The proof is similar to [17, Corollary 7], except that the learning algorithm plays the approximate fixed point with respect to the link function $f$. From Theorem 1 we have a bound on the $(\Phi, f, \epsilon)$-Blackwell condition value at each time. The result then follows directly from Theorem 6 applied to the cumulative regret vectors $R^{\Phi}_t$. ∎

Proof of Theorem 3.

The proof follows closely [17, Theorem 9]. Taking the polynomial potential with $p > 2$ and its corresponding curvature term, $(G, f, \gamma)$ is a Gordon triple [17]. Given this Gordon triple, (1) is bounded by a chain of three inequalities: the first is from Lemma 1, the second follows from Corollary 1 and Theorem 2, and the third is an application of Lemma 2. The result then immediately follows. ∎

Proof of Theorem 4.

The proof follows closely [17, Theorem 11]. Taking the polynomial potential with $p = 2$ and its corresponding curvature term, $(G, f, \gamma)$ is a Gordon triple [17]. Given this Gordon triple, (1) is bounded by a chain of three inequalities: the first is from Lemma 1, the second follows from Corollary 1 and Theorem 2, and the third is an application of Lemma 2. The result then immediately follows. ∎

Proof of Theorem 5.

The proof follows closely [17, Theorem 13]. Taking the exponential potential and its corresponding curvature term, $(G, f, \gamma)$ is a Gordon triple [17]. Given this Gordon triple, (1) is bounded by a chain of inequalities in which the second inequality follows from Corollary 1 and Theorem 2. The result then immediately follows. ∎