1 Introduction
The dominant framework for approximating Nash equilibria in extensive-form games with imperfect information is Counterfactual Regret Minimization (CFR), which has been used to solve and expertly play human-scale poker games [1, 2, 3, 4]. This framework is built on the idea of decomposing a game into a network of simple regret minimizers [5, 6]. For very large games, abstraction is typically used to reduce the size of the game and yield a strategically similar game that is feasible to solve with CFR [5, 7, 8, 9].
Function approximation is a natural generalization of abstraction. In CFR, this amounts to estimating the regrets for each regret minimizer instead of storing them all in a table [10, 11, 12, 13, 14, 15]. Function approximation can be competitive with domain-specific state abstraction [10, 11, 12], and in some cases is able to outperform tabular CFR without abstraction if the players are optimizing against their best responses [14].
Combining regression and regret minimization, with applications to CFR, was initially studied by Waugh et al. [10], who introduced the RRM theorem, giving a sufficient condition under which function approximator error still permits no external regret. In this paper we generalize the RRM theorem to a larger class of regret minimizers and to a larger class of regret metrics, including external regret, internal regret, and swap regret. Extending to a larger class of regret minimizers provides insight into the effectiveness of combining function approximation and regret minimization: the effect of function approximation error on the bounds varies between algorithms. Furthermore, extending to other algorithms can provide theory for existing or future methods. For example, there has been interest in a functional version of Hedge, an algorithm within the studied class, for general multi-agent and non-stationary settings, where it can outperform softmax policy gradient methods [13]. Extending to a more general class of regret metrics, such as internal regret, allows for potentially novel applications of regret minimization with function approximation, including finding an approximate correlated equilibrium [16].
2 Preliminaries
We adopt the notation of Greenwald et al. [17] to describe an online decision problem (ODP). An ODP consists of a set of possible actions A and a set of possible rewards R. In this paper we assume a finite set of actions and bounded, positive rewards.^1 The tuple (A, R) fully characterizes the problem and is referred to as a reward system. Furthermore, the set of reward functions is the set of functions from actions to rewards.

^1 The restriction to positive rewards is without loss of generality and is only used for convenience.
At each round the agent selects a distribution over actions, samples an action from it, and then receives a reward function. The agent is able to compute the rewards for actions that were not taken at time t, in contrast to the bandit setting, where the agent only observes the reward of the sampled action. Crucially, each reward function is allowed to be selected arbitrarily. As a consequence, this ODP model is flexible enough to encompass multi-agent, adversarial interactions and game-theoretic equilibrium concepts, even though it is described from the perspective of a single agent's decisions.
A learning algorithm in an ODP selects each distribution using information from the history of observations and actions previously taken. We denote this information at time t as history_t. Formally, an online learning algorithm is a sequence of functions, each mapping a history to a distribution over actions.
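To make the protocol concrete, the interaction loop above can be sketched in a few lines of Python. This is only an illustrative sketch: the placeholder learner shown here (a uniform-random strategy named `learner`) and all variable names are our own, not the paper's notation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 3

def learner(history):
    """Placeholder online learning algorithm: maps the observed history
    to a distribution over actions (here, simply uniform)."""
    return np.full(n_actions, 1.0 / n_actions)

history = []  # sequence of (sampled action, observed reward function) pairs
for t in range(5):
    q = learner(history)            # distribution over actions chosen at round t
    a = rng.choice(n_actions, p=q)  # sampled action
    r = rng.random(n_actions)       # arbitrarily chosen reward function for round t
    # Full-information feedback: the entire vector r is observed,
    # not just r[a] as in the bandit setting.
    history.append((a, r))
```

Any of the regret-matching algorithms discussed later can be substituted for `learner` without changing the surrounding loop.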
2.1 Action Transformations
To generalize the analysis to different forms of regret (e.g., swap, internal, and external regret), it is useful to define action transformations. Action transformations are functions that give a distribution over actions for each action input. Let Φ_A denote the set of all action transformations for the action set A, and let the pure transformations be those whose codomain is the set of pure strategies for A.

Two important subsets of Φ_A are the external and internal transformations. Φ_EXT denotes the set of all external transformations: the constant action transformations, each mapping every action to the pure strategy with full weight on some fixed action b.

Φ_INT consists of the set of all possible internal transformations for action set A, where an internal transformation from action a to action b maps a to the pure strategy on b and maps every other action to itself.
These transformation sets are related as described in [17].
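Since each action transformation gives a distribution over actions for every input action, it can be represented as a row-stochastic matrix. The following sketch (our own illustration; the helper names `external` and `internal` are not from the paper) constructs the external and internal transformations for a small action set.

```python
import numpy as np

def external(b, n):
    """External transformation: every action is mapped to the pure
    strategy that plays action b."""
    phi = np.zeros((n, n))
    phi[:, b] = 1.0
    return phi

def internal(a, b, n):
    """Internal transformation from a to b: action a is mapped to the
    pure strategy on b; every other action is mapped to itself."""
    phi = np.eye(n)
    phi[a] = 0.0
    phi[a, b] = 1.0
    return phi

n = 3
externals = [external(b, n) for b in range(n)]
internals = [internal(a, b, n) for a in range(n) for b in range(n) if a != b]
```

For n actions there are n external transformations and n(n - 1) internal transformations, and every row of each matrix is a probability distribution.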
2.2 Regret
For a given action transformation we can compute the difference in expected reward between the transformed action and the chosen action, for a particular action and reward function. This expected difference is known as regret. For a set of action transformations, the regret vector collects this quantity for each transformation in the set. For an ODP with observed history at time t, with its reward functions and actions, the cumulative regret for time t and a set of action transformations is the sum of the per-round regret vectors. For brevity we will omit the history argument, and for convenience we set the cumulative regret at time zero to zero. Note that the cumulative regret is a random vector, and we seek to bound
(1) 
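Under the matrix representation of a transformation φ, the regret of playing action a against reward function r is the expected reward of φ(a) minus the reward of a, and the cumulative regret vector sums this over the history. A minimal sketch with hypothetical helper names, not the paper's notation:

```python
import numpy as np

def regret(phi, a, r):
    """Regret of transformation phi at (a, r): expected reward of the
    transformed action phi(a) minus the reward of the action a played."""
    return phi[a] @ r - r[a]

def cumulative_regret(phis, history):
    """Cumulative regret vector: one entry per transformation,
    summed over the observed (action, reward function) history."""
    return np.array([sum(regret(phi, a, r) for a, r in history)
                     for phi in phis])

# Example: two actions; the internal transformation 0 -> 1 redirects
# action 0 to action 1, which always pays one unit more here.
phi_01 = np.array([[0.0, 1.0],
                   [0.0, 1.0]])
hist = [(0, np.array([0.0, 1.0]))] * 3
print(cumulative_regret([phi_01], hist))  # regret grows by 1 per round
```

The identity transformation always has zero regret, which matches the convention that regret measures the gain from deviating.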
3 Approximate Regret-Matching
A regret-matching algorithm is an ODP algorithm characterized by a set of action transformations and a link function that is a subgradient of a convex potential function on the positive orthant^3 satisfying the generalized Blackwell condition [17]. Examples of regret-matching algorithms include Hart's algorithm [18], typically called "regret-matching" or the polynomial weighted average forecaster [16], and Hedge [19], the exponentially weighted average forecaster [16]; their link functions are the polynomial link with parameter p and the exponential link with parameter η, respectively.

^3 Note that as long as the potential is bounded from above on the negative orthant, the codomain of the link function is the positive orthant.
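The two link functions named above are easy to write down explicitly. The following sketch shows the componentwise polynomial link (Hart's regret-matching corresponds to p = 2) and the exponential link used by Hedge; the function names and the parameter names `p` and `eta` are our own.

```python
import numpy as np

def poly_link(x, p=2):
    """Polynomial link: componentwise (x_+)^(p - 1), where x_+ is the
    positive part. p = 2 gives Hart's regret-matching weights."""
    return np.maximum(x, 0.0) ** (p - 1)

def exp_link(x, eta=0.1):
    """Exponential link used by Hedge: componentwise exp(eta * x)."""
    return np.exp(eta * x)
```

Both maps send a cumulative regret vector to a nonnegative weight vector; normalizing the weights over the external transformations recovers the familiar regret-matching and Hedge strategies.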
A useful technique for bounding regret when estimates are used in place of true values is to define an approximate Blackwell condition, as was done in the RRM theorem [10]. RRM is a specific instance of this class in which the transformations are the external transformations and the link function is the polynomial link with p = 2. To generalize across different link functions and transformation sets, we define the following approximate Blackwell condition.
Definition 1 (Approximate Blackwell Condition).
For a given reward system (A, R), a finite set of action transformations, and a link function, a learning algorithm satisfies the approximate Blackwell condition with a given value if
The Regret Matching Theorem [17] shows that the Blackwell condition holds with equality for any finite set of action transformations and link function, if the algorithm chooses a distribution that is a fixed point^4 of the link-weighted transformation map. If the link outputs are not all zero, then this fixed point is a distribution [20]. An algorithm that chooses the above fixed point when it exists, and chooses arbitrarily otherwise, is regret-matching.

^4 Note that since the map is a linear operator over the simplex, the fixed point always exists by the Brouwer fixed point theorem.
We seek to bound objective (1) when the algorithm at time t chooses the fixed point of the analogous map built from estimated link outputs when it exists, and chooses arbitrarily otherwise, where the estimated link outputs are possibly produced by a function approximator. Such an algorithm is referred to as approximate regret-matching.
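Concretely, at each round the (approximate) regret-matching strategy is the fixed point of the row-stochastic matrix obtained by averaging the transformation matrices, weighted by the (possibly estimated) link outputs. The sketch below uses the matrix representation of transformations; the eigenvector computation and the uniform fallback for the all-zero case are our own illustrative choices.

```python
import numpy as np

def regret_matching_policy(phis, link_values):
    """Return the fixed point q of q -> q M, where M is the
    link-weighted average of the transformation matrices.
    M is row-stochastic, so q is a stationary distribution of M."""
    n = phis[0].shape[0]
    total = link_values.sum()
    if total <= 0.0:
        # All link outputs are zero: the algorithm may choose arbitrarily;
        # here we pick the uniform distribution.
        return np.full(n, 1.0 / n)
    M = sum(w * phi for w, phi in zip(link_values, phis)) / total
    # Stationary distribution: eigenvector of M^T for eigenvalue 1.
    vals, vecs = np.linalg.eig(M.T)
    q = np.abs(np.real(vecs[:, np.argmax(np.real(vals))]))
    return q / q.sum()
```

With external transformations only, M has identical rows, so the fixed point is simply the normalized link-output vector; approximate regret-matching is obtained by passing estimated rather than exact link outputs as `link_values`.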
Similarly to the RRM theorem [10, 11], we show that the parameter of the approximate Blackwell condition depends on the error in approximating the exact link outputs.
Theorem 1.
Given a reward system (A, R), a finite set of action transformations, and a link function, an approximate regret-matching algorithm is a regret-matching algorithm whose approximate Blackwell condition parameter is bounded by the error in the estimated link outputs.
All proofs are deferred to the appendix.
For a regret-matching algorithm, an approach to bounding (1) is to use the Blackwell condition and provide a bound on the potential for a particular potential function [17, 16]. Bounding the regret (1) for an approximate regret-matching algorithm proceeds similarly, except that the bound from Theorem 1 is used. Proceeding in this fashion yields the following theorem:
Theorem 2.
Given a real-valued reward system and a finite set of action transformations, if the associated functions form a Gordon triple,^5 then an approximate regret-matching algorithm guarantees the following bound at all times.

^5 See Definition 2 in the appendix.
4 Bounds
4.1 Polynomial Link
Given the polynomial link function, we consider two cases of the parameter. For the following results it is useful to define the maximal activation [17].
In the first case we have the following bound on (1).
Theorem 3.
Given an ODP, a finite set of action transformations, and the polynomial link function, an approximate regret-matching algorithm guarantees
where and if otherwise .
Similarly, in the second case we have the following.
Theorem 4.
Given an ODP, a finite set of action transformations, and the polynomial link function, an approximate regret-matching algorithm guarantees
where and .
In comparison to the RRM theorem [11], the above bound is tighter: there is no multiplicative term in front of the errors, and one of the remaining terms has been replaced by a smaller quantity.^6 These improvements are due to the tighter bound in Theorem 1 and to the original regret analysis [17], respectively. Aside from these differences, the bounds coincide.

^6 See [17].
4.2 Exponential Link
Theorem 5.
Given an ODP, a finite set of action transformations, and an exponential link function, an approximate regret-matching algorithm guarantees
where and .
The Hedge algorithm corresponds to the exponential link function, so Theorem 5 provides a bound for a regression Hedge algorithm. Note that in this case the approximation error term is not inside a root function, as it is under a polynomial link function. This seems to imply that, at the level of link outputs, polynomial link functions have a better dependence on the approximation errors. However, the exponential link output in the bound is normalized to the simplex, while the polynomial link functions can take on larger values. Which link function has the better dependence on the approximation errors therefore depends on the magnitude of the cumulative regrets, which in turn depends on the environment and the algorithm's empirical performance.
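The normalization point can be illustrated numerically by applying both links to the same cumulative regret vector; the snippet below is our own illustration with a hypothetical regret vector and parameter values.

```python
import numpy as np

R = np.array([10.0, 5.0, -2.0])  # hypothetical cumulative regret vector

poly = np.maximum(R, 0.0)        # polynomial link, p = 2 (unnormalized)
hedge = np.exp(0.1 * R)          # exponential link, eta = 0.1

# Hedge's strategy is the normalized exponential weights (a simplex point),
# while the raw polynomial link outputs can be arbitrarily large.
hedge_strategy = hedge / hedge.sum()
poly_strategy = poly / poly.sum()
```

Scaling R up leaves both strategies on the simplex but makes the unnormalized polynomial outputs, and hence their approximation-error scale, grow with the regrets.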
Acknowledgments
We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC), the Alberta Machine Intelligence Institute (Amii), and Alberta Treasury Branch (ATB).
References
 [1] Michael Bowling, Neil Burch, Michael Johanson, and Oskari Tammelin. Heads-up limit hold'em poker is solved. Science, 347(6218):145–149, 2015.
 [2] Matej Moravčík, Martin Schmid, Neil Burch, Viliam Lisý, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael Bowling. DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356(6337):508–513, 2017.
 [3] Noam Brown and Tuomas Sandholm. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, 359(6374):418–424, 2018.
 [4] Noam Brown and Tuomas Sandholm. Superhuman AI for multiplayer poker. Science, 365(6456):885–890, 2019.
 [5] Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione. Regret minimization in games with incomplete information. In Advances in Neural Information Processing Systems, pages 1729–1736, 2008.
 [6] Gabriele Farina, Christian Kroer, and Tuomas Sandholm. Regret circuits: Composability of regret minimizers. In International Conference on Machine Learning, pages 1863–1872, 2019.
 [7] Kevin Waugh, David Schnizlein, Michael Bowling, and Duane Szafron. Abstraction pathologies in extensive games. In Proceedings of the 8th International Conference on Autonomous Agents and Multiagent Systems - Volume 2, pages 781–788. International Foundation for Autonomous Agents and Multiagent Systems, 2009.
 [8] Michael Johanson, Neil Burch, Richard Valenzano, and Michael Bowling. Evaluating state-space abstractions in extensive-form games. In Proceedings of the 2013 International Conference on Autonomous Agents and Multiagent Systems, pages 271–278. International Foundation for Autonomous Agents and Multiagent Systems, 2013.
 [9] Sam Ganzfried and Tuomas Sandholm. Action translation in extensive-form games with large action spaces: Axioms, paradoxes, and the pseudo-harmonic mapping. In Workshops at the Twenty-Seventh AAAI Conference on Artificial Intelligence, 2013.
 [10] Kevin Waugh, Dustin Morrill, James Andrew Bagnell, and Michael Bowling. Solving games with functional regret estimation. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
 [11] Dustin Morrill. Using Regret Estimation to Solve Games Compactly. Master's thesis, University of Alberta, 2016.
 [12] Noam Brown, Adam Lerer, Sam Gross, and Tuomas Sandholm. Deep counterfactual regret minimization. In Proceedings of the 36th International Conference on Machine Learning (ICML-19), pages 793–802, 2019.
 [13] Shayegan Omidshafiei, Daniel Hennes, Dustin Morrill, Remi Munos, Julien Perolat, Marc Lanctot, Audrunas Gruslys, Jean-Baptiste Lespiau, and Karl Tuyls. Neural replicator dynamics. arXiv preprint arXiv:1906.00190, 2019.
 [14] Edward Lockhart, Marc Lanctot, Julien Pérolat, Jean-Baptiste Lespiau, Dustin Morrill, Finbarr Timbers, and Karl Tuyls. Computing approximate equilibria in sequential adversarial games by exploitability descent. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 464–470. International Joint Conferences on Artificial Intelligence Organization, July 2019.
 [15] Eric Steinberger. Single deep counterfactual regret minimization. arXiv preprint arXiv:1901.07621, 2019.
 [16] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
 [17] Amy Greenwald, Zheng Li, and Casey Marks. Bounds for regret-matching algorithms. In ISAIM, 2006.
 [18] Sergiu Hart and Andreu Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127–1150, 2000.
 [19] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
 [20] Amy Greenwald, Zheng Li, and Casey Marks. Bounds for regret-matching algorithms. Technical Report CS-06-10, Brown University, Department of Computer Science, 2006.
 [21] Geoffrey J. Gordon. No-regret algorithms for structured prediction problems. Technical report, Carnegie Mellon University, School of Computer Science, 2005.
 [22] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
 [23] Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.
Appendix A Existing Results
Lemma 1.
If is a random vector that takes values in , then for .
See Lemma 21 of [17].
Lemma 2.
Given a reward system and a finite set of action transformations , then for any reward function .
The proof is identical to that of Lemma 22 in [17], except that the regrets are bounded in a slightly different range. Also note that the rewards are bounded by assumption.
Theorem 6 (Gordon 2005).
Assume is a Gordon triple and . Let , let be a sequence of random vectors over , and define for all times . If for all times , then, for all times ,
It should be noted that the above theorem was originally proved by Gordon [21].
Appendix B Proofs
See Theorem 1.
Proof.
Let the reward vector for an arbitrary reward function be given. Since by construction this algorithm chooses its distribution at each timestep to be the specified fixed point, all that remains to be shown is that the algorithm satisfies the approximate Blackwell condition with the stated value.
By expanding the value of interest in the Blackwell condition and applying elementary upper bounds, we arrive at the desired bound. For simplicity, we omit timestep indices and set . First, suppose :
In the remaining case it is easy to see that the inequality still holds.
Therefore, the algorithm satisfies the approximate Blackwell condition with the stated value, as required to complete the argument.
∎
An important observation of Theorem 1 is the following corollary:
Corollary 1.
For a reward system, a finite set of action transformations, and two link functions, if there exists a strictly positive function relating the two link functions, then an approximate regret-matching algorithm satisfies
Proof.
The reasoning is similar to that of Lemma 20 in [17]. The played fixed point is the same under both link functions, thus following the same steps as in Theorem 1 provides the above bound. ∎
See Theorem 2.
Proof.
The proof is similar to that of Corollary 7 in [17], except that the learning algorithm is playing the approximate fixed point with respect to the link function. From Theorem 1 we have the approximate Blackwell condition. Making the appropriate substitutions, we have
The result directly follows from Theorem 6 by taking . ∎
See Theorem 3.
Proof.
The proof follows closely that of Theorem 9 in [17]. Taking the appropriate potential and bounding functions yields a Gordon triple [17]. Given this Gordon triple we have
(2)  
(3)  
(4)  
(5) 
The first inequality is from Lemma 1. The second inequality follows from Corollary 1 and Theorem 2. The third inequality is an application of Lemma 2. The result then immediately follows. ∎
See Theorem 4.
Proof.
The proof follows closely that of Theorem 11 in [17]. Taking the appropriate potential and bounding functions yields a Gordon triple [17]. Given this Gordon triple we have
(6)  
(7)  
(8)  
(9) 
The first inequality is from Lemma 1. The second inequality follows from Corollary 1 and Theorem 2. The third inequality is an application of Lemma 2. The result then immediately follows. ∎
See Theorem 5.