Actor-Critic Policy Optimization in Partially Observable Multiagent Environments

by Sriram Srinivasan et al.

Optimization of parameterized policies for reinforcement learning (RL) is an important and challenging problem in artificial intelligence. Among the most common approaches are algorithms based on gradient ascent of a score function representing discounted return. In this paper, we examine the role of these policy gradient and actor-critic algorithms in partially-observable multiagent environments. We show several candidate policy update rules and relate them to a foundation of regret minimization and multiagent learning techniques for the one-shot and tabular cases, leading to previously unknown convergence guarantees. We apply our method to model-free multiagent reinforcement learning in adversarial sequential decision problems (zero-sum imperfect information games), using RL-style function approximation. We evaluate on commonly used benchmark Poker domains, showing performance against fixed policies and empirical convergence to approximate Nash equilibria in self-play with rates similar to or better than a baseline model-free algorithm for zero-sum games, without any domain-specific state space reductions.





1 Introduction

There has been much success in learning parameterized policies for sequential decision-making problems. One paradigm driving progress is deep reinforcement learning (Deep RL), which uses deep learning to train function approximators that represent policies, reward estimates, or both, to learn directly from experience and rewards Sutton18. These techniques have learned to play Atari games beyond human-level Mnih15DQN, Go, chess, and shogi from scratch Silver17AGZ; Silver17AChess, complex behaviors in 3D environments Mnih2016asynchronous; WuTian17; Jaderberg17UNREAL, robotics Gu16; Quillen18, and character animation Peng18, among others.

When multiple agents learn simultaneously, policy optimization becomes more complex. First, each agent’s environment is non-stationary and naive approaches can be non-Markovian Matignon12Independent , violating the requirements of many traditional RL algorithms. Second, the optimization problem is not as clearly defined as maximizing one’s own expected reward, because each agent’s policy affects the others’ optimization problems. Consequently, game-theoretic formalisms are often used as the basis for representing interactions and decision-making in multiagent systems Busoniu08Comprehensive ; Shoham09 ; Nowe12Game .

Computer poker is a common multiagent benchmark domain. The presence of partial observability poses a challenge for traditional RL techniques that exploit the Markov property. Nonetheless, there has been steady progress in poker AI. Near-optimal solutions for heads-up limit Texas Hold'em were found with tabular methods using state aggregation, powered by policy iteration algorithms based on regret minimization CFR; Tammelin15CFRPlus; Bowling15Poker. These approaches were founded on counterfactual regret minimization (CFR), which is the root of recent advances in no-limit, such as Libratus Brown17Libratus and DeepStack Moravcik17DeepStack. However, (i) both required Poker-specific domain knowledge, and (ii) neither was model-free, and hence both are unable to learn directly from experience without look-ahead search using a perfect model of the environment.

In this paper, we study the problem of multiagent reinforcement learning in adversarial games with partial observability, with a focus on the model-free case where agents (a) do not have a perfect description of their environment (and hence cannot do a priori planning), and (b) learn purely from their own experience without explicitly modeling the environment or other players. We show that actor-critics reduce to a form of regret minimization and propose several policy update rules inspired by this connection. We then analyze the convergence properties and present experimental results.

2 Background and Related Work

In this section, we briefly describe the necessary background. While we draw on game-theoretic formalisms, we choose to align our terminology with the RL literature to emphasize the setting and motivations. We include clarifications in Appendix A. For details, see Shoham09 ; Sutton18 .

2.1 Reinforcement Learning and Policy Gradient Algorithms

An agent acts by taking actions $a \in \mathcal{A}$ in states $s \in \mathcal{S}$ drawn from its policy $\pi: \mathcal{S} \to \Delta(\mathcal{A})$, where $\Delta(\mathcal{A})$ is the set of probability distributions over $\mathcal{A}$; taking an action results in a change to the state of the environment, and the agent then receives an observation $o$ and reward $r$.¹ In partially observable environments Kaelbling98POMDPs; Oliehoek16, an observation function is used to sample $o$. A sum of rewards is a return $G = \sum_t r_t$, and the agent aims to find a policy $\pi$ that maximizes expected return $\mathbb{E}_\pi[G]$.²

¹ Note that in fully-observable settings, the observation is the state itself. ² We assume finite episodic tasks of bounded length and leave out the discount factor to simplify the notation, without loss of generality. We use $\gamma$-discounted returns in our experiments.

Value-based solution methods achieve this by computing estimates of $v^\pi(s)$ or $q^\pi(s, a)$, using temporal difference learning to bootstrap from other estimates, and produce a series of $\epsilon$-greedy policies. In contrast, policy gradient methods define a score function $J(\pi_\theta)$ of some parameterized (and differentiable) policy $\pi_\theta$ with parameters $\theta$, and use gradient ascent directly on $J$ to update $\theta$.

There have been several recent successful applications of policy gradient algorithms in complex domains such as self-play learning in AlphaGo Silver16Go, Atari and 3D maze navigation Mnih2016asynchronous, continuous control problems Schulman15TRPO; Lillicrap16DDPG; Duan16, robotics Gu16, and autonomous driving ShalevShwartz16. At the core of several recent state-of-the-art Deep RL algorithms Jaderberg17UNREAL; Espeholt18IMPALA is the advantage actor-critic (A2C) algorithm defined in Mnih2016asynchronous. In addition to learning a policy (actor), A2C learns a parameterized critic: an estimate of $v^\pi(s)$, which it then uses both to estimate the remaining return after $k$ steps, and as a control variate (i.e. baseline) that reduces the variance of the return estimates.

2.2 Game Theory, Regret Minimization, and Multiagent Reinforcement Learning

In multiagent RL (MARL), $n$ agents interact within the same environment. At each step, each agent $i$ takes an action, and the joint action leads to a new state; each player then receives their own separate observation $o_i$ and reward $r_i$. Each agent maximizes their own return $G_i$, or their expected return $\mathbb{E}_{\boldsymbol{\pi}}[G_i]$, which depends on the joint policy $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_n)$.

Much work in classical MARL focuses on Markov games, where the environment is fully observable and agents take actions simultaneously, which in some cases admit Bellman operators Littman94markovgames; Zinkevich05; Perolat15; Perolat16. When the environment is partially observable, policies generally map agents' observation histories to values and actions; even when the problem is cooperative, learning is hard Oliehoek16.

We focus our attention on the setting of zero-sum games, where $\sum_i G_i = 0$. In this case, polynomial-time algorithms exist for finding optimal policies in finite tasks for the two-player case. The guarantees that Nash equilibrium provides are less clear for the $n$-player case, and finding one is hard Daskalakis06. Despite this, regret minimization approaches are known to filter out dominated actions, and have empirically found good (e.g. competition-winning) strategies in this setting Risk10; Gibson13; Lanctot14Further.

The partially observable setting in multiagent reinforcement learning requires a few more key definitions in order to properly describe the notion of state. A history $h$ is a sequence of actions from all players, including the environment, taken from the start of an episode. The environment (also called "nature") is treated as a player with a fixed policy, such that there is a deterministic mapping from any $h$ to the actual state of the environment. An information state for player $i$ is defined from player $i$'s sequence of observations as the set $s = \{h : h \text{ is consistent with player } i\text{'s observations}\}$.³ So, $s$ includes the histories leading to it that are indistinguishable to player $i$; e.g. in Poker, the $h \in s$ differ only in the private cards dealt to opponents. A joint policy $\boldsymbol{\pi}$ is a Nash equilibrium if the incentive to deviate to a best response, $\delta_i(\boldsymbol{\pi}) = \max_{\pi_i'} \mathbb{E}[G_i \mid (\pi_i', \boldsymbol{\pi}_{-i})] - \mathbb{E}[G_i \mid \boldsymbol{\pi}]$, is zero for each player $i$, where $\boldsymbol{\pi}_{-i}$ is the set of opponents' policies. Otherwise, $\epsilon$-equilibria are approximate, with $\max_i \delta_i(\boldsymbol{\pi}) \le \epsilon$. Regret minimization algorithms produce iterates whose average policy reduces an upper bound on $\delta_i$; convergence is measured using $\mathrm{NashConv}(\boldsymbol{\pi}) = \sum_i \delta_i(\boldsymbol{\pi})$. Nash equilibrium is minimax-optimal in two-player zero-sum games, so playing one minimizes worst-case losses.

³ In defining $s$, we drop the reference to the acting player in turn-based games without loss of generality.

There are well-known links between learning, game theory, and regret minimization Blum07. One method, counterfactual regret (CFR) minimization CFR, has led to significant progress in Poker AI. Let $\eta_{\boldsymbol{\pi}}(h) = \prod_{h'a \sqsubseteq h} \boldsymbol{\pi}(h', a)$, where $h'a \sqsubseteq h$ ranges over prefixes $h'$ of $h$ followed by the next action $a$, be the reach probability of $h$ under $\boldsymbol{\pi}$ from all policies' action choices. This can be split into player $i$'s contribution and their opponents' (including nature's) contribution: $\eta_{\boldsymbol{\pi}}(h) = \eta^i_{\boldsymbol{\pi}}(h)\, \eta^{-i}_{\boldsymbol{\pi}}(h)$. Suppose player $i$ is to play at $s$: under perfect recall, player $i$ remembers the sequence of their own states reached, which is the same for all $h \in s$, since they differ only in private information seen by the opponent(s); as a result, $\eta^i_{\boldsymbol{\pi}}(h) = \eta^i_{\boldsymbol{\pi}}(s)$ for all $h \in s$. For some history $h$ and action $a$, we call $h'$ a prefix history when $h' \sqsubseteq ha$, where $ha$ is the history $h$ followed by action $a$; prefixes may also be strict, $h' \sqsubset ha$. Let $\mathcal{Z}$ be the set of terminal histories and $\eta_{\boldsymbol{\pi}}(h, z)$ the reach probability of terminal $z \in \mathcal{Z}$ from $h$. CFR defines counterfactual values $v_i(s, a) = \sum_{h \in s,\, z \in \mathcal{Z}} \eta^{-i}_{\boldsymbol{\pi}}(h)\, \eta_{\boldsymbol{\pi}}(ha, z)\, u_i(z)$ and $v_i(s) = \sum_a \pi(s, a)\, v_i(s, a)$, where $u_i(z)$ is the return to player $i$ along $z$, and accumulates regrets $R^T(s, a) = \sum_{t=1}^{T} \big( v_i^t(s, a) - v_i^t(s) \big)$, producing new policies from cumulative regret using e.g. regret-matching Hart00 or exponentially-weighted experts Exp3; Brown17.
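To make the reach-probability decomposition concrete, here is a minimal sketch (our own illustration, not the paper's implementation; the history encoding and the `reach_probs` helper are assumptions made for this example) that factors a history's reach probability into per-player contributions:

```python
import numpy as np

def reach_probs(history, policies):
    """Split the reach probability of `history` into per-player contributions.

    `history` is a list of (player, action) pairs; `policies` maps each
    player (including a "nature" player for the environment) to a dict
    of action -> probability. Returns (total_reach, per_player_contribs).
    """
    contrib = {p: 1.0 for p in policies}
    for player, action in history:
        contrib[player] *= policies[player][action]
    total = float(np.prod(list(contrib.values())))
    return total, contrib

# Toy two-step history: nature deals "K" with prob 0.5, player 0 bets with prob 0.7.
policies = {"nature": {"K": 0.5, "Q": 0.5}, 0: {"bet": 0.7, "pass": 0.3}}
eta, contrib = reach_probs([("nature", "K"), (0, "bet")], policies)

# eta factors exactly as player 0's contribution times everyone else's (nature's).
assert abs(eta - contrib[0] * contrib["nature"]) < 1e-12
```

Here `contrib[0]` plays the role of $\eta^i$ and `contrib["nature"]` the role of $\eta^{-i}$ in the two-party case.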

CFR is a policy iteration algorithm that computes expected values by visiting every possible trajectory; it is described in detail in Appendix B. Monte Carlo CFR (MCCFR) samples trajectories using an exploratory behavior policy, computing unbiased value estimates corrected by importance sampling Lanctot09mccfr. Therefore, MCCFR is an off-policy Monte Carlo method. In one MCCFR variant, model-free outcome sampling (MFOS), the behavior policy at opponent states is defined to be the opponents' current policy, enabling online regret minimization: player $i$ can update their policy independently of the opponents' policies and of the environment.

There are two main problems with (MC)CFR methods: (i) significant variance is introduced by sampling (off-policy) since quantities are divided by reach probabilities, (ii) there is no generalization across states except through expert abstractions and/or forward simulation with a perfect model. We show that actor-critics address both problems and that they are a form of on-policy MCCFR.

2.3 Most Closely Related Work

There is a rich history of policy gradient approaches in MARL. Early uses of gradient ascent showed that cyclical learning dynamics could arise, even in zero-sum matrix games Singh00 . This was partly addressed by methods that used variable learning rates Bowling02 ; Bowling04 , policy prediction Zhang10 , and weighted updates Abdallah08 . The main limitation with these classical works was scalability: there was no direct way to use function approximation, and empirical analyses focused almost exclusively on one-shot games.

Recent work on policy gradient approaches to MARL addresses scalability by using newer algorithms such as A3C or TRPO Schulman15TRPO . However, they focus significantly less (if at all) on convergence guarantees. Naive approaches such as independent reinforcement learning fail to find optimal stochastic policies Littman94markovgames ; Heinrich16 and can overfit the training data, failing to generalize during execution Lanctot17PSRO . Considerable progress has been achieved for cooperative MARL: learning to communicate Lazaridou17 , Starcraft unit micromanagement Foerster17 , taxi fleet optimization Nguyen17 , and autonomous driving ShalevShwartz16 . There has also been significant progress for mixed cooperative/competitive environments: using a centralized critic Lowe17 , learning to negotiate Kao18 , anticipating/learning opponent responses in social dilemmas Foerster18 ; Lerer17 , and control in realistic physical environments AlShedivat18 ; Bansal18 . In this line of research, the most common evaluation methodology has been to train centrally (for decentralized execution), either having direct access to the other players’ policy parameters or modeling them explicitly. As a result, assumptions are made about the form of the other agents’ policies, utilities, or learning mechanisms.

There are also methods that attempt to model the opponents dpiqn; He16DRON; Albrecht18Modeling. Our methods do no such modeling, and can be classified in the "forget" category of the taxonomy proposed in HernandezLeal18Survey: that is, due to their on-policy nature, actors and critics adapt to and learn mainly from new/current experience.

We focus on the model-free (and online) setting: other agents' policies are inaccessible, and training is not separated from execution. Actor-critics were recently studied in this setting for multiagent games Perolat18, whereas we focus on partially-observable environments; only tabular methods are known to converge. Fictitious Self-Play computes approximate best responses via RL Heinrich15FSP; Heinrich16, and can also be model-free. Regression CFR (RCFR) uses regression to estimate cumulative regrets from CFR Waugh15solving, and is closely related to Advantage Regret Minimization (ARM) Jin17ARM. ARM showed that regret-estimation methods handle partial observability better than standard RL, but it was not evaluated in multiagent environments. In contrast, we focus primarily on the multiagent setting.

3 Multiagent Actor-Critics: Advantages and Regrets

CFR defines policy update rules from thresholded cumulative counterfactual regret $R^{T,+}(s, a) = \max\{R^T(s, a), 0\}$, where $T$ is the number of iterations and $R^T(s, a)$ is the cumulative counterfactual regret defined in Section 2.2. In CFR, regret matching updates a policy to be proportional to the thresholded regret: $\pi^{T+1}(s, a) = R^{T,+}(s, a) / \sum_b R^{T,+}(s, b)$ when the denominator is positive, and uniform otherwise.
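The regret-matching rule can be sketched in a few lines (an illustrative helper of our own, not code from the paper):

```python
import numpy as np

def regret_matching(cum_regret):
    """Map cumulative regrets to a policy proportional to positive regrets.

    Falls back to the uniform policy when no action has positive regret.
    """
    pos = np.maximum(cum_regret, 0.0)
    total = pos.sum()
    if total <= 0.0:
        return np.full(len(cum_regret), 1.0 / len(cum_regret))
    return pos / total

# Two actions with positive regret share the mass; the negative one gets zero.
print(regret_matching(np.array([2.0, -1.0, 2.0])))  # proportions (0.5, 0, 0.5)
```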

On the other hand, REINFORCE Williams92 samples trajectories and computes gradients for each state $s$, updating $\theta$ in the direction of $\nabla_\theta \log \pi(s, a; \theta)\, G$. A baseline is often subtracted from the return, giving $(G - b(s))$, and policy gradients then become actor-critics, training the policy $\pi$ and the critic $b(s) \approx v(s)$ separately. The log appears due to the fact that action $a$ is sampled from the policy: the value is divided by $\pi(s, a; \theta)$ to ensure the estimate properly estimates the true expectation (Sutton18, Section 13.3), and $\nabla_\theta \log \pi = \nabla_\theta \pi / \pi$. One could instead train $q$-based critics from states and actions. This leads to a $q$-based Policy Gradient (QPG) (also known as Mean Actor-Critic Allen18MAC):

$$\nabla_\theta^{\mathrm{QPG}}(s) = \sum_a \big[ \nabla_\theta \pi(s, a; \theta) \big] \Big( q(s, a) - \sum_b \pi(s, b; \theta)\, q(s, b) \Big),$$

an advantage actor-critic algorithm differing from A2C in the (state-action) representation of the critics liu2018action; Wu18, and summing over actions similarly to the all-action algorithms Sutton01Comparing; Peters02Policy; Ciosek18EPG; Allen18MAC. Interpreting the advantage $A(s, a) = q(s, a) - \sum_b \pi(s, b; \theta)\, q(s, b)$ as a regret, we can instead minimize a loss defined by an upper bound on the thresholded cumulative regret, $\sum_a (A(s, a))^+$ where $(x)^+ = \max(x, 0)$, moving the policy toward a no-regret region. We call this Regret Policy Gradient (RPG):

$$\nabla_\theta^{\mathrm{RPG}}(s) = -\nabla_\theta \sum_a \Big( q(s, a) - \sum_b \pi(s, b; \theta)\, q(s, b) \Big)^+.$$

The minus sign in front represents a switch from gradient ascent on the score to descent on the loss. Another way to implement an adaptation of the regret-matching rule is by weighting the policy gradient by the thresholded regret, which we call Regret Matching Policy Gradient (RMPG):

$$\nabla_\theta^{\mathrm{RMPG}}(s) = \sum_a \big[ \nabla_\theta \pi(s, a; \theta) \big] \Big( q(s, a) - \sum_b \pi(s, b; \theta)\, q(s, b) \Big)^+.$$

In each case, the critic $q$ is trained in the standard way, using a regression loss from sampled returns. The pseudo-code is given in Appendix C. In Appendix F, we show that the QPG gradient is proportional to the RPG gradient.
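As a sanity-check sketch of the three update rules (our own illustrative code, assuming a single state, a tabular softmax policy over logits, and a fixed critic $q$; not the paper's implementation), the gradients can be computed in closed form:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def policy_gradients(theta, q):
    """QPG, RPG, and RMPG update directions for a single-state softmax policy.

    theta: policy logits; q: critic's action-value estimates (held fixed).
    Returns ascent directions on theta for each rule.
    """
    pi = softmax(theta)
    jac = np.diag(pi) - np.outer(pi, pi)      # jac[a, b] = d pi_a / d theta_b
    adv = q - pi @ q                          # advantage, interpreted as a regret
    qpg = jac.T @ adv                         # sum_a [grad pi_a] * adv_a
    # RPG descends the loss sum_a (adv_a)^+; with q fixed, each positive term
    # contributes -jac.T @ q (through the baseline), so ascent negates that.
    loss_grad = -float((adv > 0.0).sum()) * (jac.T @ q)
    rpg = -loss_grad
    rmpg = jac.T @ np.maximum(adv, 0.0)       # thresholded-regret weighting
    return qpg, rpg, rmpg
```

In this single-state view, the QPG and RPG directions coincide up to a positive factor (the number of positive advantages), consistent with the proportionality noted above.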

3.1 Analysis of Learning Dynamics on Normal-Form Games

The first question is whether any of these variants can converge to an equilibrium, even in the simplest case. So, we now show phase portraits of the learning dynamics on Matching Pennies: a two-action version of Rock, Paper, Scissors. These analyses are common in multiagent learning as they allow visual depiction of the policy changes and how different factors affect the (convergence) behavior Singh00 ; walsh:02 ; Bowling02 ; Walsh03 ; Bowling04 ; Wellman06 ; Abdallah08 ; Zhang10 ; Wunder2010 ; BloembergenTHK15 ; Tuyls18 . Convergence is difficult in Matching Pennies as the only Nash equilibrium requires learning stochastic policies. We give more detail and results on different games that cause cyclic learning behavior in Appendix D.

In Figure 1, we see the similarity of the regret dynamics to replicator dynamics TaylorJonkerRD; Sandholm17. We also show the average policy dynamics and observe convergence to equilibrium in each game we tried, which is known to be guaranteed in two-player zero-sum games using CFR, fictitious play Brown51, and continuous replicator dynamics Hofbauer09. However, computing the average policy is complex Heinrich15FSP; CFR, and potentially worse with function approximation, requiring storing past data in large buffers Heinrich16.
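The cycling-versus-averaging phenomenon can be illustrated with discrete fictitious play Brown51 on Matching Pennies (an illustrative sketch of our own, not the RPG/replicator computation behind Figure 1): best responses oscillate between the two actions, while the empirical average policy approaches the unique equilibrium (0.5, 0.5).

```python
import numpy as np

# Matching Pennies payoff to the row player; the column player gets the negative.
A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])

counts = [np.ones(2), np.ones(2)]   # empirical action counts (uniform prior)
for _ in range(20000):
    avg = [c / c.sum() for c in counts]
    br_row = int(np.argmax(A @ avg[1]))      # row best response to column's average
    br_col = int(np.argmax(-A.T @ avg[0]))   # column best response to row's average
    counts[0][br_row] += 1.0
    counts[1][br_col] += 1.0

avg_row = counts[0] / counts[0].sum()
print(avg_row)  # close to the equilibrium frequencies (0.5, 0.5)
```

The iterates themselves (the pure best responses) never settle; only the time average converges, which mirrors why the average-policy dynamics in panel (c) converge while the current-policy dynamics cycle.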

Figure 1: Learning Dynamics in Matching Pennies. Panels (a) Replicator Dynamics and (b) RPG Dynamics show the vector field of the policy updates, including example particle traces, where each point is each player's probability of their first action; panel (c) Average RPG Dynamics shows example traces of policies following a discrete approximation to the average RPG dynamics.

3.2 Partially Observable Sequential Games

How do the counterfactual values $v_i(s, a)$ and the standard action values $q(s, a)$ differ? The authors of Jin17ARM posit that they are approximately equal when $s$ rarely occurs more than once in a trajectory. First, note that $s$ cannot be reached more than once in a trajectory by our definition of $s$, because the observation histories (of the player to play at $s$) would be different in each occurrence (i.e. due to perfect recall). So, the two values are indeed equal in deterministic, single-agent environments. In general, counterfactual values are conditioned on player $i$ playing to reach $s$, whereas $q$-function estimates are conditioned on $s$ having been reached. So,

$$v_i(s, a) = \Big( \sum_{h \in s} \eta^{-i}_{\boldsymbol{\pi}}(h) \Big)\, q_{\boldsymbol{\pi}}(s, a).$$

The derivation is similar to show that $v_i(s) = \big( \sum_{h \in s} \eta^{-i}_{\boldsymbol{\pi}}(h) \big)\, v_{\boldsymbol{\pi}}(s)$. Hence, counterfactual values and standard value functions are generally not equal, but are scaled by the Bayes normalizing constant $\sum_{h \in s} \eta^{-i}_{\boldsymbol{\pi}}(h)$. If there is a low probability of reaching $s$ due to the environment or due to opponents' policies, these values will differ significantly.
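A small numeric check of this scaling relationship (toy numbers chosen purely for illustration):

```python
import numpy as np

# Toy information state s with two indistinguishable histories h1, h2 that
# differ only in the opponents' (and chance's) reach contribution eta_{-i}.
eta_opp = np.array([0.2, 0.3])    # opponents' reach contribution per history
q_h = np.array([4.0, -2.0])       # expected return for action a from each history

# q-value: conditioned on having reached s (Bayes-normalized weights over h).
q_sa = (eta_opp / eta_opp.sum()) @ q_h
# Counterfactual value: weighted by the unnormalized opponents' reach.
v_cf = eta_opp @ q_h

# The two differ exactly by the Bayes normalizing constant sum_h eta_{-i}(h).
assert abs(v_cf - eta_opp.sum() * q_sa) < 1e-12
```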

This leads to a new interpretation of actor-critic algorithms in the multiagent partially observable setting: the advantage values $q(s, a) - v(s)$ are immediate counterfactual regrets scaled by $1 / \sum_{h \in s} \eta^{-i}_{\boldsymbol{\pi}}(h)$. This then determines requirements for convergence guarantees in the tabular case.

Note that the standard policy gradient theorem holds: gradients can be estimated from samples. This follows from the derivation of the policy gradient in the tabular case (see Appendix E). When TD bootstrapping is not used, the Markov property is not required; having multiple agents and/or partial observability does not change this. For a proof using REINFORCE (actor only), see (ShalevShwartz16, Theorem 1). The proof trivially follows for the actor-critic case, since the critic is trained separately and its targets do not depend on $\theta$.

Policy gradient algorithms perform gradient ascent on $J(\pi_\theta)$, using $\nabla_\theta J = \mathbb{E}_{s \sim \mu_{\boldsymbol{\pi}}} \big[ \sum_a q(s, a)\, \nabla_\theta \pi(s, a; \theta) \big]$, where $\mu_{\boldsymbol{\pi}}$ is the on-policy state distribution under $\boldsymbol{\pi}$ (Sutton18, Section 13.2). The actor-critic equivalent replaces $q(s, a)$ with the advantage $q(s, a) - v(s)$. Note that the baseline is unnecessary when summing over the actions, since $\sum_a v(s)\, \nabla_\theta \pi(s, a; \theta) = v(s)\, \nabla_\theta \sum_a \pi(s, a; \theta) = 0$ Allen18MAC. However, our analysis relies on a projected gradient descent algorithm that does not assume simplex constraints on the policy: in that case, in general $\sum_a \nabla_\theta \pi(s, a; \theta) \ne \mathbf{0}$.
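A quick numeric illustration (our own sketch, for a softmax parameterization) of why the baseline drops out when summing over actions under a simplex-respecting parameterization:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.array([0.4, -1.2, 0.7])
pi = softmax(theta)
jac = np.diag(pi) - np.outer(pi, pi)   # jac[a, b] = d pi_a / d theta_b

# The per-action gradients sum to the zero vector...
assert np.allclose(jac.sum(axis=0), 0.0, atol=1e-12)

# ...so subtracting any state baseline v(s) from every q(s, a) leaves the
# summed-over-actions gradient unchanged.
q = np.array([1.0, -0.5, 2.0])
v = 0.7  # an arbitrary baseline
assert np.allclose(jac.T @ (q - v), jac.T @ q, atol=1e-12)
```

A projected tabular parameterization lacks this property, since its per-action gradient directions need not sum to zero off the simplex.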

Definition 1.

Define policy gradient policy iteration (PGPI) as a process that, for every player and state, iteratively runs the update $\theta \leftarrow \theta + \alpha \sum_a q(s, a)\, \nabla_\theta \pi(s, a; \theta)$, and actor-critic policy iteration (ACPI) similarly, using the advantage $q(s, a) - v(s)$ in place of $q(s, a)$.

In two-player zero-sum games, PGPI/ACPI are gradient ascent-descent problems, because each player is trying to ascend their own score function, and when using tabular policies a solution exists due to the minimax theorem Shoham09. Define player $i$'s external regret over $T$ steps as $R_i^T = \max_{\pi_i' \in \Pi_i} \sum_{t=1}^{T} \big( \mathbb{E}[G_i \mid (\pi_i', \boldsymbol{\pi}^t_{-i})] - \mathbb{E}[G_i \mid \boldsymbol{\pi}^t] \big)$, where $\Pi_i$ is the set of deterministic policies.

Theorem 1.

In two-player zero-sum games, when using tabular policies and an $\ell_2$ projection $P(\theta) = \operatorname{argmin}_{\theta' \in \Theta} \|\theta - \theta'\|_2$, where $\Theta$ is the space of tabular simplices, if player $i$ uses learning rates of $\eta^i_{\boldsymbol{\pi}^t}(s)/\sqrt{t}$ at $s$ on iteration $t$, then projected PGPI has regret $R_i^T \in O\big( |\mathcal{S}_i|\, \Delta_u \sqrt{|\mathcal{A}|\, T} \big)$, where $\mathcal{S}_i$ is the set of player $i$'s states, $\Delta_u$ is the reward range, and $|\mathcal{A}|$ is the number of actions. The same holds for projected ACPI (see appendix).

The proof is given in Appendix E. In the case of sampled trajectories, as long as every state is reached with positive probability, Monte Carlo estimators of will be consistent. Therefore, we use exploratory policies and decay exploration over time. With a finite number of samples, the probability that an estimator differs by some quantity away from its mean is determined by Hoeffding’s inequality and the reach probabilities. We suspect these errors could be accumulated to derive probabilistic regret bounds similar to the off-policy Monte Carlo case Lanctot09Sampling .

What happens in the sampling case with a fixed per-state learning rate $\alpha$? If player $i$ collects a batch of data from many sampled episodes and applies the updates all at once, then the effective learning rate (the expected update rate relative to the other states) is scaled by the probability of reaching $s$, which matches the per-state value in the condition of Theorem 1. This suggests using a globally decaying learning rate to simulate the remaining $1/\sqrt{t}$ factor.

The analysis so far has concentrated on establishing guarantees for the optimization problem that underlies the standard formulation of policy gradient and actor-critic algorithms. A better guarantee can be achieved by using a stronger policy improvement step (proof and details are found in Appendix E):

Theorem 2.

Define a state-local score function $J_s(\theta)$, the composite gradient $\sum_s \nabla_\theta J_s(\theta)$, and strong policy gradient policy iteration (SPGPI) and strong actor-critic policy iteration (SACPI) as in Definition 1, except replacing the gradient components with those of the composite gradient. Then, in two-player zero-sum games, when using tabular policies and projection as defined in Theorem 1 with learning rates $1/\sqrt{t}$ on iteration $t$, projected SPGPI has regret $R_i^T \in O\big( |\mathcal{S}_i|\, \Delta_u \sqrt{|\mathcal{A}|\, T} \big)$, where $\mathcal{S}_i$ is the set of player $i$'s states and $\Delta_u$ is the reward range. This also holds for projected SACPI (see appendix).

4 Empirical Evaluation

We now assess the behavior of the actor-critic algorithms in practice. While the analyses in the previous section established guarantees for the tabular case, ultimately we want to assess scalability and generalization potential for larger settings. Our implementation parameterizes critics and policies using neural networks with two fully-connected layers of 128 units each and rectified linear unit activation functions, followed by a linear layer to output a single value (for critics) or a softmax layer to output a policy. We chose these architectures to remain consistent with previous evaluations Heinrich16; Lanctot17PSRO.

4.1 Domains: Kuhn and Leduc Poker

We evaluate the actor-critic algorithms on two $n$-player games, with $n \in \{2, 3\}$: Kuhn poker and Leduc poker.

Kuhn poker is a toy game where each player starts with 2 chips, antes 1 chip to play, and receives one card face down from a deck of size $n + 1$ (one card remains hidden). Players proceed by betting (raise/call), adding their remaining chip to the pot, or passing (check/fold), until all players are either in (have contributed as many chips as all other players to the pot) or out (folded, i.e. passed after a raise). The player with the highest-ranked card that has not folded wins the pot.

In Leduc poker, players have a limitless number of chips, and the deck has size $2(n + 1)$, divided into two suits of identically-ranked cards. There are two rounds of betting; after the first round, a single public card is revealed from the deck. Each player antes 1 chip to play, bets are limited to two per round, and the bet size is fixed at 2 chips in the first round and 4 chips in the second round.

The reward to each player is the number of chips they have after the game minus the number they had before the game. To remain consistent with other baselines, we use the form of Leduc described in Lanctot17PSRO, which does not restrict the action space, adding reward penalties if/when illegal moves are chosen.

4.2 Baseline: Neural Fictitious Self-Play

We compare to one main baseline. Neural Fictitious Self-Play (NFSP) is an implementation of fictitious play in which approximate best responses are used in place of full best responses Heinrich16. Two transition buffers are used: a best-response buffer and an average-policy buffer. The former is used to train a DQN agent toward a best response to the average policy; data in the latter is replaced using reservoir sampling, and the average policy is trained from it by classification.
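The reservoir-sampling replacement used for the average-policy buffer can be sketched as follows (Algorithm R; an illustrative sketch, not NFSP's actual implementation):

```python
import random

def reservoir_update(reservoir, item, n_seen, capacity):
    """Algorithm R: keep a uniform random sample of the stream in `reservoir`.

    `n_seen` is the number of items seen so far, including `item`. Each item
    ends up in the reservoir with probability capacity / n_seen.
    """
    if len(reservoir) < capacity:
        reservoir.append(item)
    else:
        j = random.randrange(n_seen)   # uniform in [0, n_seen)
        if j < capacity:
            reservoir[j] = item        # replace a uniformly chosen slot

reservoir, cap = [], 100
for n, transition in enumerate(range(10000), start=1):
    reservoir_update(reservoir, transition, n, cap)
assert len(reservoir) == cap
```

Keeping a uniform sample of all past transitions is what lets the classification network approximate the time-averaged policy rather than the most recent one.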

4.3 Main Performance Results

Here we show the empirical convergence to approximate Nash equilibria for each algorithm in self-play, and performance against fixed bots. The standard metric for this is NashConv($\boldsymbol{\pi}$), defined in Section 2.2, which reports the accuracy of the approximation to a Nash equilibrium.

Training Setup. In the domains we tested, we observed that the variance in returns was high, and hence we performed multiple policy evaluation updates ($q$-updates for QPG, RPG, and RMPG, and $v$-updates for A2C) followed by policy improvement (a policy gradient update). These updates were done using separate SGD optimizers, with learning rates fixed at 0.001 for policy evaluation and annealed from a starting learning rate to 0 over 20M steps for policy improvement (see Appendix G for exact values). Further, the policy improvement step is applied only after a number of policy evaluation updates; we treat this number and the batch size as hyperparameters and sweep over a few reasonable values. In order to handle different scales of rewards across domains, we used streaming Z-normalization on the rewards, inspired by its use in Proximal Policy Optimization (PPO) schulman2017proximal. In addition, the agent's policy is controlled by an (inverse) temperature added as part of the softmax operator. The temperature is annealed from 1 to 0 over 1M steps to ensure adequate state space coverage. An additional entropy cost hyperparameter is added, as is standard practice with Deep RL policy gradient methods such as A3C Mnih2016asynchronous; schulman2017proximal. For NFSP, we used the same values presented in Lanctot17PSRO.
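Streaming Z-normalization can be implemented with a Welford-style running estimator; a minimal sketch of our own (not the exact implementation used in the experiments):

```python
class StreamingNormalizer:
    """Welford-style running mean/variance for online reward normalization."""

    def __init__(self, eps=1e-8):
        self.count, self.mean, self.m2 = 0, 0.0, 0.0
        self.eps = eps  # guards against division by zero early on

    def update(self, x):
        """Fold one observed reward into the running statistics."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def normalize(self, x):
        """Return (x - mean) / std using the statistics seen so far."""
        var = self.m2 / max(self.count, 1)
        return (x - self.mean) / ((var + self.eps) ** 0.5)

norm = StreamingNormalizer()
for r in [1.0, 3.0, 5.0]:
    norm.update(r)
print(norm.mean)  # -> 3.0
```

The single-pass update avoids storing reward histories and is numerically stable, which is why this style of estimator is a common choice for reward normalization in RL training loops.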

Convergence to Equilibrium. See Figure 2 for convergence results. Note that we plot the NashConv of the average policy in the case of NFSP, and of the current policy in the case of the policy gradient algorithms. We see that in 2-player Leduc, the actor-critic variants we tried are similar in performance; NFSP converges faster in the short term, but in the long run the actor-critics are comparable. Each converges significantly faster than A2C. However, RMPG seems to plateau.

Performance Against Fixed Bots. We also measure the expected reward against fixed bots, averaged over player seats. These bots, cfr500, correspond to the average policy after 500 iterations of CFR. QPG and RPG do well here, scoring higher than A2C and even beating NFSP in the long run.

Figure 2: Empirical convergence rates for NashConv($\boldsymbol{\pi}$) in 2- and 3-player Kuhn and Leduc poker, and performance in 2- and 3-player Leduc versus cfr500 agents.

5 Conclusion

In this paper, we discuss several update rules for actor-critic algorithms in multiagent reinforcement learning. One key property of this class of algorithms is that they are model-free, leading to a purely online algorithm, independent of the opponents and environment. We show a connection between these algorithms and (counterfactual) regret minimization, leading to previously unknown convergence properties underlying model-free MARL in zero-sum games with imperfect information.

Our experiments show that these actor-critic algorithms converge to approximate Nash equilibria in commonly-used benchmark Poker domains, with rates similar to or better than baseline model-free algorithms for zero-sum games. However, they may be easier to implement, and they do not require storing a large memory of transitions. Furthermore, the current policies of some variants do significantly better than the baselines (including the average policy of NFSP) when evaluated against fixed bots. Of the actor-critic variants, RPG and QPG seem to outperform RMPG in our experiments.

As future work, we would like to formally develop the (probabilistic) guarantees of the sample-based on-policy Monte Carlo CFR algorithms and/or extend to continuing tasks as in MDPs Kash18 . We are also curious about what role the connections between actor-critic methods and CFR could play in deriving convergence guarantees in model-free MARL for cooperative and/or potential games.

Acknowledgments. We would like to thank Martin Schmid, Audrūnas Gruslys, Neil Burch, Noam Brown, Kevin Waugh, Rich Sutton, and Thore Graepel for their helpful feedback and support.


  • (1) Sherief Abdallah and Victor Lesser. A multiagent reinforcement learning algorithm with non-linear dynamics. JAIR, 33(1):521–549, 2008.
  • (2) Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Nicolas Heess, Rémi Munos, and Martin Riedmiller. Maximum a posteriori policy optimisation. CoRR, abs/1806.06920, 2018.
  • (3) Maruan Al-Shedivat, Trapit Bansal, Yuri Burda, Ilya Sutskever, Igor Mordatch, and Pieter Abbeel. Continuous adaptation via meta-learning in nonstationary and competitive environments. In Proceedings of the Sixth International Conference on Learning Representations, 2018.
  • (4) Stefano V. Albrecht and Peter Stone. Autonomous agents modelling other agents: A comprehensive survey and open problems. Artificial Intelligence, 258:66–95, 2018.
  • (5) Cameron Allen, Melrose Roderick, Kavosh Asadi, Abdelrahman Mohamed, George Konidaris, and Michael Littman. Mean actor critic. CoRR, abs/1709.00503, 2017.
  • (6) P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of the 36th Annual Symposium on Foundations of Computer Science, pages 322–331, 1995.
  • (7) Trapit Bansal, Jakub Pachocki, Szymon Sidor, Ilya Sutskever, and Igor Mordatch. Emergent complexity via multi-agent competition. In Proceedings of the Sixth International Conference on Learning Representations, 2018.
  • (8) Daan Bloembergen, Karl Tuyls, Daniel Hennes, and Michael Kaisers. Evolutionary dynamics of multi-agent learning: A survey. J. Artif. Intell. Res. (JAIR), 53:659–697, 2015.
  • (9) A. Blum and Y. Mansour. Learning, regret minimization, and equilibria. In Algorithmic Game Theory, chapter 4. Cambridge University Press, 2007.
  • (10) Branislav Bošanský, Viliam Lisý, Marc Lanctot, Jiří Čermák, and Mark H.M. Winands. Algorithms for computing strategies in two-player simultaneous move games. Artificial Intelligence, 237:1–40, 2016.
  • (11) Michael Bowling. Convergence and no-regret in multiagent learning. In Advances in Neural Information Processing Systems 17 (NIPS), pages 209–216, 2005.
  • (12) Michael Bowling, Neil Burch, Michael Johanson, and Oskari Tammelin. Heads-up Limit Hold’em Poker is solved. Science, 347(6218):145–149, January 2015.
  • (13) Michael Bowling and Manuela Veloso. Multiagent learning using a variable learning rate. Artificial Intelligence, 136:215–250, 2002.
  • (14) G. W. Brown. Iterative solutions of games by fictitious play. In T.C. Koopmans, editor, Activity Analysis of Production and Allocation, pages 374–376. John Wiley & Sons, Inc., 1951.
  • (15) Noam Brown, Christian Kroer, and Tuomas Sandholm. Dynamic thresholding and pruning for regret minimization. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2017.
  • (16) Noam Brown and Tuomas Sandholm. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, 360(6385), December 2017.
  • (17) L. Busoniu, R. Babuska, and B. De Schutter. A comprehensive survey of multiagent reinforcement learning. IEEE Transaction on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 38(2):156–172, 2008.
  • (18) Kris Cao, Angeliki Lazaridou, Marc Lanctot, Joel Z Leibo, Karl Tuyls, and Stephen Clark. Emergent communication through negotiation. In Proceedings of the Sixth International Conference on Learning Representations (ICLR), 2018.
  • (19) Kamil Ciosek and Shimon Whiteson. Expected policy gradients. In Proceedings of the Thirty-Second AAAI conference on Artificial Intelligence (AAAI-18), 2018.
  • (20) Constantinos Daskalakis, Paul W. Goldberg, and Christos H. Papadimitriou. The complexity of computing a Nash equilibrium. In Proceedings of the Thirty-Eighth Annual ACM Symposium on Theory of Computing, STOC ’06, pages 71–78, New York, NY, USA, 2006. ACM.
  • (21) Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. CoRR, abs/1604.06778, 2016.
  • (22) Lasse Espeholt, Hubert Soyer, Rémi Munos, Karen Simonyan, Volodymyr Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, and Koray Kavukcuoglu. IMPALA: scalable distributed deep-rl with importance weighted actor-learner architectures. CoRR, abs/1802.01561, 2018.
  • (23) Jakob N. Foerster, Richard Y. Chen, Maruan Al-Shedivat, Shimon Whiteson, Pieter Abbeel, and Igor Mordatch. Learning with opponent-learning awareness. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2017.
  • (24) Jakob N. Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2017.
  • (25) N. Gatti, F. Panozzo, and M. Restelli. Efficient evolutionary dynamics with extensive-form games. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence, pages 335–341, 2013.
  • (26) Richard Gibson. Regret minimization in non-zero-sum games with applications to building champion multiplayer computer poker agents. CoRR, abs/1305.0034, 2013.
  • (27) Shixiang Gu, Ethan Holly, Timothy P. Lillicrap, and Sergey Levine. Deep reinforcement learning for robotic manipulation. CoRR, abs/1610.00633, 2016.
  • (28) S. Hart and A. Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127–1150, 2000.
  • (29) Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3–4):157–325, 2015.
  • (30) He He, Jordan L. Boyd-Graber, Kevin Kwok, and Hal Daumé III. Opponent modeling in deep reinforcement learning. In Proceedings of The 33rd International Conference on Machine Learning (ICML 2016), 2016.
  • (31) Johannes Heinrich, Marc Lanctot, and David Silver. Fictitious self-play in extensive-form games. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), 2015.
  • (32) Johannes Heinrich and David Silver. Deep reinforcement learning from self-play in imperfect-information games. CoRR, abs/1603.01121, 2016.
  • (33) Pablo Hernandez-Leal, Michael Kaisers, Tim Baarslag, and Enrique Munoz de Cote. A survey of learning in multiagent environments: Dealing with non-stationarity. CoRR, abs/1707.09183, 2017.
  • (34) Josef Hofbauer and Karl Sigmund. Evolutionary Games and Population Dynamics. Cambridge University Press, 1998.
  • (35) Josef Hofbauer, Sylvain Sorin, and Yannick Viossat. Time average replicator and best-reply dynamics. Mathematics of Operations Research, 34(2):263–269, 2009.
  • (36) Zhang-Wei Hong, Shih-Yang Su, Tzu-Yun Shann, Yi-Hsiang Chang, and Chun-Yi Lee. A deep policy inference q-network for multi-agent systems. CoRR, abs/1712.07893, 2017.
  • (37) Max Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. In Proceedings of the International Conference on Representation Learning, 2017.
  • (38) Peter H. Jin, Sergey Levine, and Kurt Keutzer. Regret minimization for partially observable deep reinforcement learning. CoRR, abs/1710.11424, 2017.
  • (39) Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101:99–134, 1998.
  • (40) Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In Proceedings of the Nineteenth International Conference on Machine Learning, ICML ’02, pages 267–274, San Francisco, CA, USA, 2002. Morgan Kaufmann Publishers Inc.
  • (41) Ian A. Kash and Katja Hoffman. Combining no-regret and Q-learning. In European Workshop on Reinforcement Learning (EWRL) 14, 2018.
  • (42) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
  • (43) Vojtech Kovarík and Viliam Lisý. Analysis of Hannan consistent selection for Monte Carlo tree search in simultaneous move games. CoRR, abs/1509.00149, 2015.
  • (44) H. W. Kuhn. Extensive games and the problem of information. Contributions to the Theory of Games, 2:193–216, 1953.
  • (45) L. Shapley. Some topics in two-person games. In Advances in Game Theory. Princeton University Press, 1964.
  • (46) M. Lanctot, K. Waugh, M. Bowling, and M. Zinkevich. Sampling for regret minimization in extensive games. In Advances in Neural Information Processing Systems (NIPS 2009), pages 1078–1086, 2009.
  • (47) Marc Lanctot. Monte Carlo Sampling and Regret Minimization for Equilibrium Computation and Decision-Making in Large Extensive Form Games. PhD thesis, Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada, June 2013.
  • (48) Marc Lanctot. Further developments of extensive-form replicator dynamics using the sequence-form representation. In Proceedings of the Thirteenth International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), pages 1257–1264, 2014.
  • (49) Marc Lanctot, Kevin Waugh, Martin Zinkevich, and Michael Bowling. Monte Carlo sampling for regret minimization in extensive games. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1078–1086, 2009.
  • (50) Marc Lanctot, Vinicius Zambaldi, Audrunas Gruslys, Angeliki Lazaridou, Karl Tuyls, Julien Perolat, David Silver, and Thore Graepel. A unified game-theoretic approach to multiagent reinforcement learning. In Advances in Neural Information Processing Systems, 2017.
  • (51) Angeliki Lazaridou, Alexander Peysakhovich, and Marco Baroni. Multi-agent cooperation and the emergence of (natural) language. In Proceedings of the International Conference on Learning Representations (ICLR), April 2017.
  • (52) Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521:436–444, 2015.
  • (53) Adam Lerer and Alexander Peysakhovich. Maintaining cooperation in complex social dilemmas using deep reinforcement learning. CoRR, abs/1707.01068, 2017.
  • (54) Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. CoRR, abs/1509.02971, 2015.
  • (55) Michael L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning, pages 157–163. Morgan Kaufmann, 1994.
  • (56) Hao Liu, Yihao Feng, Yi Mao, Dengyong Zhou, Jian Peng, and Qiang Liu. Action-dependent control variates for policy optimization via stein identity. In Proceedings of the International Conference on Learning Representations (ICLR), 2018.
  • (57) Ryan Lowe, YI WU, Aviv Tamar, Jean Harb, OpenAI Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6379–6390. Curran Associates, Inc., 2017.
  • (58) L. Matignon, G. J. Laurent, and N. Le Fort-Piat. Independent reinforcement learners in cooperative Markov games: a survey regarding coordination problems. The Knowledge Engineering Review, 27(01):1–31, 2012.
  • (59) Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 1928–1937, 2016.
  • (60) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015.
  • (61) Matej Moravčík, Martin Schmid, Neil Burch, Viliam Lisý, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael Bowling. DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 358(6362), October 2017.
  • (62) Todd W. Neller and Marc Lanctot. An introduction to counterfactual regret minimization. In Proceedings of Model AI Assignments, The Fourth Symposium on Educational Advances in Artificial Intelligence (EAAI-2013), 2013.
  • (63) Duc Thien Nguyen, Akshat Kumar, and Hoong Chuin Lau. Policy gradient with value function approximation for collective multiagent planning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 4319–4329. Curran Associates, Inc., 2017.
  • (64) A. Nowé, P. Vrancx, and Y-M. De Hauwere. Game theory and multi-agent reinforcement learning. In Reinforcement Learning: State-of-the-Art, chapter 14, pages 441–470. 2012.
  • (65) Frans A. Oliehoek and Christopher Amato. A Concise Introduction to Decentralized POMDPs. Springer, 2016.
  • (66) Fabio Panozzo, Nicola Gatti, and Marcello Restelli. Evolutionary dynamics of q-learning over the sequence form. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, pages 2034–2040, 2014.
  • (67) Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. Deepmimic: Example-guided deep reinforcement learning of physics-based character skills. CoRR, abs/1804.02717, 2018.
  • (68) Julien Perolat, Bilal Piot, and Olivier Pietquin. Actor-critic fictitious play in simultaneous move multistage games. In Amos Storkey and Fernando Perez-Cruz, editors, Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pages 919–928, Playa Blanca, Lanzarote, Canary Islands, 09–11 Apr 2018. PMLR.
  • (69) Julien Pérolat, Bilal Piot, Bruno Scherrer, and Olivier Pietquin. On the use of non-stationary strategies for solving two-player zero-sum markov games. In The 19th International Conference on Artificial Intelligence and Statistics (AISTATS 2016), 2016.
  • (70) Julien Pérolat, Bruno Scherrer, Bilal Piot, and Olivier Pietquin. Approximate dynamic programming for two-player zero-sum markov games. In Proceedings of the International Conference on Machine Learning (ICML), 2015.
  • (71) Jan Peters. Policy gradient methods for control applications. Technical Report TR-CLMC-2007-1, University of Southern California, 2002.
  • (72) Yu Qian, Fang Debin, Zhang Xiaoling, Jin Chen, and Ren Qiyu. Stochastic evolution dynamic of the rock–scissors–paper game based on a quasi birth and death process. Scientific Reports, 6(1):28585, 2016.
  • (73) Deirdre Quillen, Eric Jang, Ofir Nachum, Chelsea Finn, Julian Ibarz, and Sergey Levine. Deep reinforcement learning for vision-based robotic grasping: A simulated comparative evaluation of off-policy methods. CoRR, abs/1802.10264, 2018.
  • (74) N. A. Risk and D. Szafron. Using counterfactual regret minimization to create competitive multiplayer poker agents. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 159–166, 2010.
  • (75) William H. Sandholm. Population Games and Evolutionary Dynamics. The MIT Press, 2010.
  • (76) John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. Trust region policy optimization. CoRR, abs/1502.05477, 2015.
  • (77) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • (78) Shai Shalev-Shwartz, Shaked Shammah, and Amnon Shashua. Safe, multi-agent, reinforcement learning for autonomous driving. CoRR, abs/1610.03295, 2016.
  • (79) Y. Shoham and K. Leyton-Brown. Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations. Cambridge University Press, 2009.
  • (80) David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529:484–489, 2016.
  • (81) David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy P. Lillicrap, Karen Simonyan, and Demis Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. CoRR, abs/1712.01815, 2017.
  • (82) David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of Go without human knowledge. Nature, 550:354–359, 2017.
  • (83) Satinder P. Singh, Michael J. Kearns, and Yishay Mansour. Nash convergence of gradient dynamics in general-sum games. In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, UAI ’00, pages 541–548, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc.
  • (84) R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, 2nd edition, 2018.
  • (85) Richard S. Sutton, Satinder Singh, and David McAllester. Comparing policy-gradient algorithms, 2001. Unpublished.
  • (86) Oskari Tammelin, Neil Burch, Michael Johanson, and Michael Bowling. Solving heads-up limit Texas Hold’em. In Proceedings of the 24th International Joint Conference on Artificial Intelligence, 2015.
  • (87) Taylor and Jonker. Evolutionarily stable strategies and game dynamics. Mathematical Biosciences, 40:145–156, 1978.
  • (88) Karl Tuyls, Julien Perolat, Marc Lanctot, Joel Z. Leibo, and Thore Graepel. A generalised method for empirical game theoretic analysis. In AAMAS, 2018.
  • (89) Jeffrey S. Vitter. Random sampling with a reservoir. ACM Transactions on Mathematical Software, 11(1):37–57, 1985.
  • (90) W. E. Walsh, D. C. Parkes, and R. Das. Choosing samples to compute heuristic-strategy Nash equilibrium. In Proceedings of the Fifth Workshop on Agent-Mediated Electronic Commerce, 2003.
  • (91) William E Walsh, Rajarshi Das, Gerald Tesauro, and Jeffrey O Kephart. Analyzing Complex Strategic Interactions in Multi-Agent Systems. In AAAI, 2002.
  • (92) Kevin Waugh, Dustin Morrill, J. Andrew Bagnell, and Michael Bowling. Solving games with functional regret estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, 2015.
  • (93) Michael P. Wellman. Methods for empirical game-theoretic analysis. In Proceedings, The Twenty-First National Conference on Artificial Intelligence and the Eighteenth Innovative Applications of Artificial Intelligence Conference, pages 1552–1556, 2006.
  • (94) R.J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, 1992.
  • (95) Cathy Wu, Aravind Rajeswaran, Yan Duan, Vikash Kumar, Alexandre M. Bayen, Sham Kakade, Igor Mordatch, and Pieter Abbeel. Variance reduction for policy gradient with action-dependent factorized baselines. In Proceedings of the International Conference on Learning Representations (ICLR), 2018.
  • (96) Yuxin Wu and Yuandong Tian. Training agent for first-person shooter game with actor-critic curriculum learning. In Proceedings of the International Conference on Representation Learning, 2017.
  • (97) Michael Wunder, Michael Littman, and Monica Babes. Classes of multiagent Q-learning dynamics with ε-greedy exploration. In Proceedings of the 27th International Conference on Machine Learning, ICML’10, pages 1167–1174, 2010.
  • (98) Chongjie Zhang and Victor Lesser. Multi-agent learning with policy prediction. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, pages 927–934, 2010.
  • (99) M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of Twentieth International Conference on Machine Learning (ICML-2003), 2003.
  • (100) M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. Technical Report CMU-CS-03-110, Carnegie Mellon University, 2003.
  • (101) M. Zinkevich, M. Johanson, M. Bowling, and C. Piccione. Regret minimization in games with incomplete information. In Advances in Neural Information Processing Systems 20 (NIPS 2007), 2008.
  • (102) Martin Zinkevich, Amy Greenwald, and Michael L. Littman. Cyclic equilibria in markov games. In Proceedings of the 18th International Conference on Neural Information Processing Systems, NIPS’05, pages 1641–1648, Cambridge, MA, USA, 2005. MIT Press.

Appendix A Some Notes on Notation and Terminology

Here we clarify some notational differences between the work on computational game theory and (multiagent) reinforcement learning.

There are analogues between approximate dynamic programming and RL on the one hand, and counterfactual regret minimization in zero-sum games on the other.

CFR is a policy iteration technique: it implements generalized policy iteration, where policy evaluation computes the (counterfactual) values, and policy “improvement” is implemented by the regret minimizers at each information state, such as regret matching, which yields the next policy by assigning probabilities to each action proportional to its thresholded (positive part of the) cumulative regret. There is one main difference: this improvement step is not (necessarily) a contraction with respect to any distance to an optimal policy. However, the average policy does converge due to the folk theorem, so in some sense the policy update operator is improving the average policy. We give more detail on CFR in the following subsection (Appendix B).
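As a concrete sketch of this improvement step at a single information state (names are illustrative; the cumulative regrets would come from the evaluation step):

```python
import numpy as np

def regret_matching_policy(cum_regret):
    """Policy 'improvement' by regret matching: play each action with
    probability proportional to its thresholded (positive part of the)
    cumulative regret, or uniformly if no regret is positive."""
    positive = np.maximum(cum_regret, 0.0)
    total = positive.sum()
    if total > 0.0:
        return positive / total
    return np.full(len(cum_regret), 1.0 / len(cum_regret))

# e.g. regrets (2, -1, 2) -> play actions 1 and 3 with probability 1/2 each
assert np.allclose(regret_matching_policy(np.array([2.0, -1.0, 2.0])),
                   [0.5, 0.0, 0.5])
```

Unlike a greedy improvement step, this update is not a contraction toward an optimal policy; it is the average of these policies over iterations that converges.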

Like standard value/policy iteration, CFR requires a full sweep of the state space on each iteration. Instead, Monte Carlo sampling can be applied to obtain estimated values [49]. The equivalent policy update operator can then be applied, and there are probabilistic bounds on the convergence to equilibrium.

One critical point is that temporal-difference bootstrapping from values recursively is not possible in general, because the Markov property does not hold for information states in partially observable multiagent environments: the optimal policy at one information state generally depends on the policies at other information states.

POMDPs represent hidden state using belief states. These differ from information states in that each is paired with an associated distribution over the histories consistent with it.
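A belief state can be maintained with a standard Bayes filter; a minimal sketch, assuming array shapes T[s, a, s'] for transitions and O[s', o] for observation likelihoods (both names are illustrative):

```python
import numpy as np

def belief_update(belief, T, O, action, obs):
    """One Bayes-filter step over hidden states:
    b'(s') is proportional to O[s', obs] * sum_s T[s, action, s'] * b(s)."""
    predicted = belief @ T[:, action, :]   # push the belief through the transitions
    updated = predicted * O[:, obs]        # reweight by the observation likelihood
    return updated / updated.sum()         # renormalize to a distribution

# two hidden states, one (identity) action, a noisy observation of the state
T = np.eye(2)[:, None, :]                  # T[s, 0, s'] = 1 iff s == s'
O = np.array([[0.9, 0.1],
              [0.2, 0.8]])                 # O[s', o]
b = belief_update(np.array([0.5, 0.5]), T, O, action=0, obs=0)
assert np.allclose(b, [9 / 11, 2 / 11])
```

An information state, by contrast, is just the observable history; the belief is this extra distribution over the hidden histories consistent with it.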

The following table shows a mapping between commonly used terms that are analogous (mostly equivalent) but used within the two separate communities:

Computational Game Theory          | Reinforcement Learning
Player                             | Agent
Information set                    | Information state
Action (or move)                   | Action
History                            | State
Utility                            | Reward
Strategy                           | Policy
Reach probability                  | (no precise equivalent; the closest is the on-policy distribution in episodic tasks described in [84, Section 9.2])
Chance event probability           | Transition probability
Chance                             | Nature
Imperfect information              | Partial observability
Extensive-form game                | Multiagent environment (also: finite-horizon Dec-POMDP, in the cooperative setting)
Simultaneous-move/stochastic game  | Markov/stochastic game

Table 1: A mapping of analogous terms across fields.

Appendix B Counterfactual Regret Minimization

As mentioned above, Counterfactual Regret Minimization (CFR) is a policy iteration technique with a different policy improvement step. In this section we describe the algorithm using the terminology defined in this paper. Again, it is presented slightly differently than in previous papers, to emphasize the elements of policy iteration. For an overview with more background, see [47, Chapter 2]. For a thorough introductory tutorial, including background and code exercises, see [62].


Appendix C Regret-based Policy Gradients: Algorithm Pseudo-Code

The algorithm is similar in form to A2C [59]. The differences are:

  1. Gradient descent on the regret-based loss is used (RPG) instead of gradient ascent on the return, as in QPG, RMPG, and A2C.

  2. A state-action critic q(s, a) is used in place of the usual state baseline v(s).

The pseudo-code is presented in Algorithm 1.


In this paper we focus on episodes of bounded length, and the return length is greater than the maximum number of moves per episode, so there is no TD-style bootstrapping from other values. In environments with longer episodes, it may be necessary to truncate the sequence lengths, as is common in deep RL.
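The per-state quantity minimized by the descent step can be sketched as follows (a simplification: q here is a fixed critic output at a single information state, and the exact estimator used by the full algorithm may differ):

```python
import numpy as np

def rpg_loss(pi, q):
    """Regret policy gradient loss at one information state: the sum of
    positive parts of the advantages q(s, a) - v(s), where v(s) is the
    policy-weighted value. Gradient descent on this quantity (through pi)
    drives the positive immediate regrets toward zero."""
    v = np.dot(pi, q)                      # state value under the current policy
    return np.maximum(q - v, 0.0).sum()    # sum of positive advantages

# e.g. a uniform policy over q = (1, 0): v = 0.5, so the loss is 0.5;
# a policy that plays the better action deterministically has zero loss
assert rpg_loss(np.array([0.5, 0.5]), np.array([1.0, 0.0])) == 0.5
assert rpg_loss(np.array([1.0, 0.0]), np.array([1.0, 0.0])) == 0.0
```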

Appendix D Analysis of Regret Dynamics in Matrix Games

In Tables 2, 3, and 4 we show the three games under study, i.e., matching pennies (MP), rock-paper-scissors (RPS), and a skewed version of the latter, called bias rock-paper-scissors (bRPS), from [10]. In Figures 3, 4, and 5 we illustrate several dynamics in these respective games. More precisely, we show classical replicator dynamics (RD) as a reference point (row a), RPG dynamics (row b), and time-averaged RPG dynamics (row c). As can be observed from the figures, and as is well known, the RD cycle around the mixed Nash equilibrium (indicated by a yellow dot) in all three games; see row (a). The RPG dynamics cycle as well, though in a slightly different manner than RD, as can be seen in row (b). Finally, in row (c) we show the time-averaged RPG dynamics. Interestingly, these plots show that in all three cases the learning dynamics converge to the mixed Nash equilibrium. These final plots illustrate that the average intended behavior of RPG converges to the mixed Nash equilibrium and that the RPG algorithm is regret-minimizing in these specific normal-form games [35].
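The cycling-versus-time-average behavior can be reproduced with a minimal Euler discretization of the replicator dynamics in RPS (a sketch; the step size and starting point are arbitrary choices):

```python
import numpy as np

# Rock-Paper-Scissors payoff matrix for the row player (zero-sum).
A = np.array([[ 0., -1.,  1.],
              [ 1.,  0., -1.],
              [-1.,  1.,  0.]])

def replicator_step(x, dt=0.01):
    """One Euler step of the replicator dynamics: each action's share grows
    in proportion to its fitness advantage over the population average."""
    fitness = A @ x
    avg = x @ fitness
    return x + dt * x * (fitness - avg)

nash = np.ones(3) / 3.0
# the interior mixed Nash equilibrium is a rest point of the dynamics...
assert np.allclose(replicator_step(nash), nash)

# ...while other interior points cycle around it; the simplex is invariant
x = np.array([0.6, 0.3, 0.1])
avg_x = np.zeros(3)
for t in range(1, 5001):
    x = replicator_step(x)
    avg_x += (x - avg_x) / t  # running time average of the trajectory
assert np.isclose(x.sum(), 1.0) and (x > 0).all()
```

In continuous time the trajectory orbits the equilibrium while its time average tends toward it; the Euler discretization only approximates this, so the running average avg_x approaches (1/3, 1/3, 1/3) up to discretization error.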

D.1 Dynamical Systems for Normal-Form Games

Here we present the dynamical systems that describe each policy gradient update rule in two-player matrix games. For further detail on the construction and analysis of these dynamical systems, see [8, 34].

Let us recall the updates we consider in this paper. Writing $v(s) = \sum_b \pi_\theta(s,b)\, q(s,b)$ for the policy-weighted value and $A(s,a) = q(s,a) - v(s)$ for the advantage, the updates are, with step size $\eta$:

QPG (gradient ascent): $\theta \leftarrow \theta + \eta \sum_a \frac{\partial \pi_\theta(s,a)}{\partial \theta}\, A(s,a)$,

RPG (gradient descent): $\theta \leftarrow \theta - \eta\, \nabla_\theta \sum_a \left(A(s,a)\right)^+$,

RMPG (gradient ascent): $\theta \leftarrow \theta + \eta \sum_a \frac{\partial \pi_\theta(s,a)}{\partial \theta}\, \left(A(s,a)\right)^+$.
Let us consider that the game is in normal form, and suppose the policy is parametrized only by logits. The parameter is then $\theta = (\theta_a)_{a \in \mathcal{A}}$, with $\pi_\theta(a) = \exp(\theta_a) / \sum_b \exp(\theta_b)$ in a stateless game. It follows that:

$\frac{\partial \pi(a)}{\partial \theta_b} = \pi(a)\left(\mathbf{1}\{a = b\} - \pi(b)\right)$.
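With a softmax (logit) parametrization, this derivative is easy to verify numerically; a quick finite-difference check with illustrative values:

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())   # shift for numerical stability
    return z / z.sum()

def softmax_jacobian(theta):
    """d pi(a) / d theta_b = pi(a) * (1[a == b] - pi(b))."""
    pi = softmax(theta)
    return np.diag(pi) - np.outer(pi, pi)

theta = np.array([0.3, -0.7, 1.1])
J = softmax_jacobian(theta)
eps = 1e-6
for b in range(3):
    e = np.zeros(3); e[b] = eps
    fd = (softmax(theta + e) - softmax(theta - e)) / (2 * eps)
    assert np.allclose(fd, J[:, b], atol=1e-6)   # matches the closed form
```

Note that each column of the Jacobian sums to zero, reflecting that the policy stays on the probability simplex as the logits move.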
D.1.1 QPG

The dynamical system followed by QPG on a normal-form game can be written as follows. With $q(a)$ the expected payoff of action $a$ against the opponent's current policy and $v = \sum_b \pi(b)\, q(b)$, the logit dynamics are:

$\dot{\theta}_a = \sum_b \frac{\partial \pi(b)}{\partial \theta_a}\left(q(b) - v\right) = \pi(a)\left(q(a) - v\right)$.

Pushing this through the softmax yields the final dynamical system on the policy simplex:

$\dot{\pi}_a = \sum_b \frac{\partial \pi(a)}{\partial \theta_b}\, \dot{\theta}_b = \pi(a)\left[\pi(a)\left(q(a) - v\right) - \sum_b \pi(b)^2 \left(q(b) - v\right)\right]$.

D.1.2 RPG

Since RPG performs gradient descent on the loss $\ell(\theta) = \sum_b \left(q(b) - v\right)^+$, where $v = \sum_c \pi_\theta(c)\, q(c)$ and $\frac{\partial v}{\partial \theta_a} = \pi(a)\left(q(a) - v\right)$, it follows that the dynamical system of RPG is:

$\dot{\theta}_a = -\frac{\partial \ell}{\partial \theta_a} = \left|\{b : q(b) > v\}\right| \cdot \pi(a)\left(q(a) - v\right)$,

i.e., the QPG direction rescaled by the number of actions with positive instantaneous regret.
          H    T
     H    1   -1
     T   -1    1
Table 2: Matching Pennies (row player's payoff).

          R    P    S
     R    0   -1    1
     P    1    0   -1
     S   -1    1    0
Table 3: Rock-Paper-Scissors (row player's payoff).

Table 4: Bias RPS (payoffs as given in [10]).
Figure 3: Matching Pennies. Figure 4: Rock-Paper-Scissors. Figure 5: Bias Rock-Paper-Scissors. Each figure shows three rows of plots: (a) replicator dynamics, (b) RPG dynamics, and (c) average RPG dynamics.

D.2 Generalised Rock-Paper-Scissors Game

For the sake of completeness we also looked at the behavior of the dynamics in the generalised Rock-Paper-Scissors (gRPS) game [72, 45]. More precisely, the gRPS game can be described as illustrated in Table 5.

Table 5: Generalized Rock-Paper-Scissors (payoffs as given in [72]).

We describe the dynamics in this game for replicator dynamics and RPG dynamics, and for both as time-averaged dynamics plots. As in the RPS game, the replicator dynamics and RPG dynamics cycle around the Nash equilibrium, while the time-averaged replicator dynamics and time-averaged RPG dynamics converge to the Nash equilibrium, as illustrated in Figures 6, 7, 8 and 9. A more detailed description of the convergence properties of the replicator equations in this game can be found in [72].

Figure 6: Replicator Dynamics in the gRPS Game
Figure 7: RPG dynamics in the gRPS Game
Figure 8: Avg. Replicator Dynamics in the gRPS Game
Figure 9: Avg. RPG Dynamics in the gRPS Game

Appendix E Sequential Partially-Observable Case

Let the quantities be defined as in Theorem 1. We first define the four update rules that we will discuss in this section. On iteration t, at state s, the updates to the policy parameters are:

Tabular policies are represented in behavioral strategy form: a probability $\pi(s,a)$ is a weight per state-action pair, where the weights obey the simplex constraints: $\pi(s,a) \ge 0$ and $\sum_a \pi(s,a) = 1$ at each state $s$.

In turn-based games, the gradient of a tabular policy is then simply a sum of partial derivatives with respect to each specific weight.

The score function . The contribution of some to the gradient is: