In multiagent learning, agents interact in inherently nonstationary environments due to their concurrent policy updates. It is, therefore, paramount to develop and analyze algorithms that learn effectively despite these nonstationarities. A number of works have successfully conducted this analysis under the lens of evolutionary game theory (EGT), wherein a population of individuals interact and evolve based on biologically-inspired operators. These studies have mainly focused on establishing connections to value-iteration based approaches in stateless or tabular games. We extend this line of inquiry to formally establish links between EGT and policy gradient (PG) methods, which have been extensively applied in single and multiagent learning. We pinpoint weaknesses of the commonly-used softmax PG algorithm in adversarial and nonstationary settings and contrast PG's behavior to that predicted by replicator dynamics (RD), a central model in EGT. We consequently provide theoretical results that establish links between EGT and PG methods, then derive Neural Replicator Dynamics (NeuRD), a parameterized version of RD that constitutes a novel method with several advantages. First, as NeuRD reduces to the well-studied no-regret Hedge algorithm in the tabular setting, it inherits no-regret guarantees that enable convergence to equilibria in games. Second, NeuRD is shown to be more adaptive to nonstationarity, in comparison to PG, when learning in canonical games and imperfect information benchmarks including Poker. Third, modifying any PG-based algorithm to use the NeuRD update rule is straightforward and incurs no added computational costs. Finally, while single-agent learning is not the main focus of the paper, we verify empirically that NeuRD is competitive in these settings with a recent baseline algorithm.
In multiagent reinforcement learning (MARL), agents interact in a shared environment and aim to learn policies that maximize their returns (Busoniu et al., 2008; Panait and Luke, 2005; Tuyls and Weiss, 2012). The associated core challenge is that the agents' concurrent learning implies that they each perceive the environment as nonstationary (Matignon et al., 2012; Tuyls and Weiss, 2012). It has been suggested that enabling agents to adapt to nonstationary environments, rather than merely learn static policies, is the paramount objective in multiagent learning (Shoham et al., 2007; Tuyls and Parsons, 2007). Recent works have made considerable progress in increasing the scalability of MARL algorithms and the complexity of application domains. These include approaches relying on self-play (Heinrich and Silver, 2016), regret minimization (Bowling et al., 2015; Waugh et al., 2015; Moravčík et al., 2017; Brown and Sandholm, 2017; Brown et al., 2018; Lockhart et al., 2019), and a large body of works that assume a standard centralized learning, decentralized execution paradigm (Oliehoek et al., 2016; Lowe et al., 2017; Foerster et al., 2018; Rashid et al., 2018). However, approaches such as Neural Fictitious Self-Play (NFSP) (Heinrich and Silver, 2016) and Deep CFR (Brown et al., 2018) rely on maintenance of extremely large data buffers; continual resolving approaches such as DeepStack (Moravčík et al., 2017) and Libratus (Brown and Sandholm, 2017) are restricted to finite-horizon domains with enumerable belief spaces; and centralized learning approaches assume team-wide shared knowledge of experiences or policies. Most other works that focus on nonstationary multiagent learning are limited to repeated and stochastic games (Busoniu et al., 2008; Hernandez-Leal et al., 2017; Abdallah and Kaisers, 2016; Palmer et al., 2018b, a).

This paper focuses on model-free learning in the context of nonstationary games. Specifically, we leverage insights from evolutionary game theory (EGT) (Maynard Smith and Price, 1973; Weibull, 1997; Hofbauer and Sigmund, 1998) to develop our proposed algorithm. EGT models the interactions of a population of individuals that have minimal knowledge of opponent strategies or preferences, and has proven vital for not only the analysis and evaluation of MARL agents (Ponsen et al., 2009; Tuyls et al., 2018b, a; Omidshafiei et al., 2019), but also the development of novel algorithms (Tuyls et al., 2004). Links between EGT and MARL have been primarily identified between the Replicator Dynamics (RD), a standard model in EGT, and simple policy iteration algorithms such as Cross Learning and Learning Automata (Börgers and Sarin, 1997; Tuyls et al., 2006; Bloembergen et al., 2015), value-iteration algorithms such as Q-learning in stateless cases (Tuyls et al., 2003; Kaisers and Tuyls, 2010), and no-regret learning (Klos et al., 2010).
In this paper, we go beyond these simple settings and focus our analysis on the connections between EGT and Policy Gradient (PG) algorithms, which have been extensively applied to MARL (Singh et al., 2000; Bowling, 2005; Bowling and Veloso, 2002; Foerster et al., 2018; Al-Shedivat et al., 2017; Lowe et al., 2017; Srinivasan et al., 2018). We first establish links between the RD, PG methods, and online learning, enabling new interpretations of these approaches. We use these links to identify a key limitation in commonly-used softmax PG-based methods that prevents their rapid adaptation in nonstationary settings, in contrast to RD. Given this insight, we derive a new PG method, called Neural Replicator Dynamics (NeuRD). We prove that in the tabular setting, NeuRD reduces to Hedge, a well-studied no-regret learning algorithm. We then introduce a variant of NeuRD that enables model-free learning in sequential imperfect information games. We empirically evaluate NeuRD in canonical and sequential games. By including variants of each where reward spaces are nonstationary, we demonstrate that NeuRD is not only more stable when learning in fixed games, but also more adaptive under nonstationarity compared to standard PG. Moreover, though not the focus of the paper, we demonstrate that NeuRD matches state-of-the-art performance in a suite of complex stationary single-agent tasks. Critically, the conversion of standard PG-based algorithms to the NeuRD update rule is simple, involving changes to only a few lines of code. These features make NeuRD a strong alternative to usual PG algorithms in nonstationary environments, such as in MARL or for computing equilibria in games.
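This section briefly introduces the necessary game-theoretic and reinforcement learning background.

Game theory studies strategic interactions of players. A normal-form game (NFG) specifies the simultaneous interaction of $K$ players with corresponding action sets $\mathcal{A}^1, \ldots, \mathcal{A}^K$. The payoff function $u: \prod_{k=1}^{K} \mathcal{A}^k \to \mathbb{R}^K$ assigns a numerical utility to each player for each possible joint action $a = (a^1, \ldots, a^K)$, where $a^k \in \mathcal{A}^k$ for all $k \in \{1, \ldots, K\}$. Let $\pi^k \in \Delta^{|\mathcal{A}^k|}$ denote the $k$-th player's mixed strategy. The expected utility for player $k$ given strategy profile $\pi = (\pi^1, \ldots, \pi^K)$ is then $\bar{u}^k(\pi) = \mathbb{E}_{a \sim \pi}[u^k(a)]$. The best response for player $k$ given $\pi$ is $\text{BR}^k(\pi^{-k}) = \arg\max_{\pi^k} \bar{u}^k((\pi^k, \pi^{-k}))$, where $\pi^{-k}$ is the set of opponent policies. Profile $\pi_*$ is a Nash equilibrium if $\pi_*^k \in \text{BR}^k(\pi_*^{-k})$ for all $k$. A useful metric for evaluating policies is $\text{NashConv}(\pi) = \sum_k \left[\bar{u}^k((\text{BR}^k(\pi^{-k}), \pi^{-k})) - \bar{u}^k(\pi)\right]$ (Lanctot et al., 2017).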
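As a concrete illustration of the NashConv metric, the following is a minimal sketch for the special case of a two-player zero-sum matrix game. The function name `nash_conv`, the zero-sum restriction, and the standard Rock–Paper–Scissors payoff matrix are assumptions made for this example; they are not taken from the experiments.

```python
def nash_conv(payoff, pi_row, pi_col):
    """NashConv of (pi_row, pi_col) in a two-player zero-sum matrix game.

    payoff[a][b] is the row player's utility; the column player receives its
    negation (a zero-sum assumption made for this sketch).
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    # Value of each pure action against the opponent's current mixed strategy.
    row_values = [sum(payoff[a][b] * pi_col[b] for b in range(n_cols))
                  for a in range(n_rows)]
    col_values = [-sum(payoff[a][b] * pi_row[a] for a in range(n_rows))
                  for b in range(n_cols)]
    # Expected utility each player currently obtains.
    row_now = sum(pi_row[a] * row_values[a] for a in range(n_rows))
    col_now = sum(pi_col[b] * col_values[b] for b in range(n_cols))
    # A best response can always be found among the pure actions.
    return (max(row_values) - row_now) + (max(col_values) - col_now)


# Standard Rock-Paper-Scissors payoffs for the row player (assumed).
RPS = [[0.0, -1.0, 1.0],
       [1.0, 0.0, -1.0],
       [-1.0, 1.0, 0.0]]
uniform = [1.0 / 3.0] * 3
print(nash_conv(RPS, uniform, uniform))                  # at the equilibrium
print(nash_conv(RPS, [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # both play rock
```

NashConv is zero exactly at a Nash equilibrium and grows with the total incentive the players have to deviate, which is why it serves as a convergence measure.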
Replicator Dynamics (RD) are a key concept from EGT, which describe how a population evolves through time based on biologically-inspired operators, such as selection and mutation (Taylor and Jonker, 1978; Taylor, 1979; Hofbauer and Sigmund, 1998; Zeeman, 1980, 1981). The single-population RD are defined by the following system of differential equations:

$$\dot{\pi}_t(a) = \pi_t(a)\left[u(a, \pi_t) - \bar{u}(\pi_t)\right] \quad \forall a \in \mathcal{A}. \tag{1}$$

Each component of $\pi_t$ determines the proportion of an action $a$ (or pure strategy) being played in the population at time $t$. The time derivative of each component is proportional to the difference in its expected payoff, $u(a, \pi_t)$, and the average population payoff, $\bar{u}(\pi_t) = \sum_{a \in \mathcal{A}} \pi_t(a)\, u(a, \pi_t)$.
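To make the dynamics of Eq. (1) concrete, here is a minimal sketch that evaluates the replicator right-hand side, $\pi(a)[u(a, \pi) - \bar{u}(\pi)]$, and takes explicit Euler steps. The symmetric Rock–Paper–Scissors payoff matrix, the step size `dt`, and all function names are assumptions chosen for illustration.

```python
# Standard zero-sum Rock-Paper-Scissors payoffs for the row player (assumed).
RPS = [[0.0, -1.0, 1.0],
       [1.0, 0.0, -1.0],
       [-1.0, 1.0, 0.0]]


def expected_payoffs(payoff, pi):
    """u(a, pi): payoff of each pure action against the population mix pi."""
    return [sum(row[b] * pi[b] for b in range(len(pi))) for row in payoff]


def replicator_derivative(payoff, pi):
    """Right-hand side of the single-population RD: pi(a) * (u(a) - u_bar)."""
    u = expected_payoffs(payoff, pi)
    u_bar = sum(p * ua for p, ua in zip(pi, u))
    return [p * (ua - u_bar) for p, ua in zip(pi, u)]


def euler_step(payoff, pi, dt=0.01):
    """One explicit Euler step; the components still sum to one."""
    d = replicator_derivative(payoff, pi)
    return [p + dt * dp for p, dp in zip(pi, d)]


pi = [0.5, 0.25, 0.25]
for _ in range(5):
    pi = euler_step(RPS, pi)
print(pi)
```

Because the derivative sums to zero across actions, the Euler iterates stay on the hyperplane of normalized vectors, and the uniform policy is a rest point of the dynamics in this game.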
Online learning examines the performance of learning prediction algorithms in potentially adversarial environments. On each round, $t$, the learner samples an action, $a_t \in \mathcal{A}$, from a discrete set of actions, $\mathcal{A}$, according to a policy, $\pi_t \in \Delta^{|\mathcal{A}|}$, and receives utility, $u_t(a_t) \in \mathbb{R}$, where $u_t \in \mathbb{R}^{|\mathcal{A}|}$ is a bounded vector provided by the environment. A typical objective is for the learner to minimize its expected regret in hindsight for not committing to $a \in \mathcal{A}$ after $T$ rounds; the regret is defined as

$$R_T(a) = \sum_{t=1}^{T} \left[u_t(a) - \pi_t \cdot u_t\right]. \tag{2}$$

Algorithms that guarantee their average worst-case regret goes to zero as the number of rounds increases, i.e., $\max_a R_T(a) / T \to 0$ as $T \to \infty$, are called no-regret; these algorithms learn optimal policies under fixed or stochastic environments. According to a folk theorem, the average policies of no-regret algorithms in self-play or against best responses converge to a Nash equilibrium in two-player zero-sum games (e.g., see Waugh (2009, Section 2.2.1)). This result can be extended from matrix games to sequential imperfect information games by composing learners in a tree and defining utility as counterfactual value (Zinkevich et al., 2008; Hofbauer et al., 2009). We provide background on sequential games in Appendix A for completeness.
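The family of no-regret algorithms known as Follow the Regularized Leader (FoReL) (McMahan, 2011; Shalev-Shwartz and Singer, 2007a, b; McMahan et al., 2013) generalizes well-known decision making algorithms and population dynamics. For a discrete action set, $\mathcal{A}$, FoReL is defined through the following updates:

$$\pi_t = \arg\max_{\pi' \in \Delta^{|\mathcal{A}|}} \left[\pi' \cdot y_{t-1} - h(\pi')\right], \tag{3}$$
$$y_t = y_{t-1} + \eta_t u_t, \tag{4}$$

where $\eta_t > 0$ is the learning rate at timestep $t$, $u_t \in \mathbb{R}^{|\mathcal{A}|}$ is the vector of action utilities observed at $t$, and the regularizer $h$ is a convex function. Note that this algorithm assumes that the learner observes the entire action utility vector at each timestep rather than only the reward for taking a particular action. This is known as the full information or all-actions setting.

Under negative entropy regularization $h(\pi) = \sum_a \pi(a) \log \pi(a)$, policy $\pi_t$ reduces to a softmax function $\pi_t = \Pi(y_{t-1})$, where $\Pi(z) \propto \exp(z)$. This yields the well-known Hedge (Freund and Schapire, 1997) algorithm:

$$\pi_T = \Pi\left(\sum_{t=1}^{T-1} \eta_t u_t\right). \tag{5}$$

Hedge is no-regret as long as the learning rate $\eta_t$ is chosen appropriately (e.g., decaying as $\Theta(1/\sqrt{t})$). Likewise, the continuous-time FoReL dynamics (Mertikopoulos et al., 2018) are

$$\pi_t = \arg\max_{\pi' \in \Delta^{|\mathcal{A}|}} \left[\pi' \cdot y_t - h(\pi')\right], \tag{6}$$
$$\dot{y}_t(a) = u_t(a) \quad \forall a \in \mathcal{A}, \tag{7}$$

which in the case of entropy regularization yields RD as defined in Eq. (1) (e.g., see Mertikopoulos et al. (2018)). This implies that RD is no-regret, thereby enjoying equilibration to Nash and convergence to the optimal prediction in the time-average.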
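The discrete-time Hedge update can be sketched in a few lines: accumulate scaled utilities in a logit vector and play its softmax. The example below is illustrative only; it assumes a constant learning rate `eta`, a toy utility sequence in which the first action is always better, and helper names (`softmax`, `hedge`) that are not from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax, i.e. the map written as Pi in the text."""
    m = max(logits)
    exps = [math.exp(y - m) for y in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hedge(utilities, eta):
    """Run Hedge with a constant learning rate on full-information utilities.

    Returns the policies played and the regret against the best fixed action.
    """
    n = len(utilities[0])
    y = [0.0] * n            # accumulated, scaled utilities (the logits)
    cumulative = [0.0] * n   # unscaled utility each fixed action would earn
    earned = 0.0             # expected utility actually collected
    policies = []
    for u in utilities:
        pi = softmax(y)
        policies.append(pi)
        earned += sum(p * ua for p, ua in zip(pi, u))
        cumulative = [c + ua for c, ua in zip(cumulative, u)]
        y = [ya + eta * ua for ya, ua in zip(y, u)]
    return policies, max(cumulative) - earned


policies, regret = hedge([[1.0, 0.0]] * 100, eta=0.1)
print(policies[-1], regret / 100)  # final policy and average regret
```

In this toy run the regret stays bounded while the number of rounds grows, so the average regret shrinks, which is the no-regret property in miniature.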
In a Markov Decision Process, at each timestep $t$, an agent in state $s_t$ selects an action $a_t$, receives a reward $r_t$, then transitions to a new state $s_{t+1}$. In the discounted infinite-horizon regime, the reinforcement learning (RL) objective is to learn a policy $\pi$, which maximizes the expected return $v^\pi(s) = \mathbb{E}_\pi\left[\sum_{k=t}^{\infty} \gamma^{k-t} r_k \mid s_t = s\right]$, with discount factor $\gamma \in [0, 1)$.

In actor-critic algorithms, one generates trajectories according to some parameterized policy $\pi(\cdot \mid s; \theta)$ while learning to estimate the action-value function $q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{k=t}^{\infty} \gamma^{k-t} r_k \mid s_t = s, a_t = a\right]$. Temporal difference learning (Sutton and Barto, 2018) can be used to learn an action-value function estimator, $q(s, a; w)$, which is parameterized by $w$. A PG algorithm then updates policy parameters $\theta$ in the direction of the gradient $\nabla_\theta \log \pi(a \mid s; \theta)\left[q(s, a; w) - v(s; w)\right]$, where the quantity in square brackets is defined as the advantage, denoted $A(s, a; \theta, w)$, and $v(s; w) = \sum_{a'} \pi(a' \mid s; \theta)\, q(s, a'; w)$. The advantage is analogous to regret in the online learning literature. In sample-based learning, the PG update incorporates a $1 / \pi(a \mid s; \theta)$ factor that accounts for the fact that $a$ was sampled from $\pi(\cdot \mid s; \theta)$. The all-actions PG algorithm without this factor is then $\nabla_\theta \pi(s; \theta) \cdot \left[q(s; w) - v(s; w)\mathbf{1}\right]$, where $\mathbf{1}$ is a column vector of 1s. While different policy parameterizations are possible, the most common choice for discrete decision problems is a softmax function over the logits $y$, $\pi(s; \theta) = \Pi(y(s; \theta))$, and we focus the rest of our analysis on this form of PG.

PG-based methods learn policies directly, handle continuous actions seamlessly, and combine readily with deep learning to solve high-dimensional tasks (Sutton and Barto, 2018). These benefits have, in part, led to the success of recent PG-based algorithms (e.g., A3C (Mnih et al., 2016), IMPALA (Espeholt et al., 2018), MADDPG (Lowe et al., 2017), and COMA (Foerster et al., 2018)).

This section motivates and introduces a novel algorithm, Neural Replicator Dynamics (NeuRD), then presents unifying theoretical results for NeuRD, online learning, RD, and PG.
Let us consider the strengths and weaknesses of the algorithms described thus far. While RD and the closely-related FoReL are no-regret and enable learning of equilibria in games, they are limited in application to tabular settings. By contrast, PG is applicable to high-dimensional single and multiagent RL domains. Unfortunately, PG suffers from the fact that increasing the probability of taking an action which already has low probability mass can be very slow, in contrast to the considered no-regret algorithms. We can see this by writing out the single state, tabular, all-actions PG update explicitly, using the notation of online learning to identify the correspondences to that literature. On round $t$, PG updates its logits and policy as

$$y_t = y_{t-1} + \eta_t \nabla_{y_{t-1}} \left(\pi_t \cdot u_t\right), \tag{8}$$
$$\pi_{t+1} = \Pi(y_t). \tag{9}$$

As there is no action or state sampling in this setting, shifting all the payoffs by the expected value $\bar{u}_t = \pi_t \cdot u_t$ (or $v^\pi$ in RL terms) has no impact on the policy, so this term is omitted here. Noting that $\partial \pi_t(a') / \partial y_{t-1}(a) = \pi_t(a')\left[\mathbb{1}_{a' = a} - \pi_t(a)\right]$ (Sutton and Barto, 2018, Section 2.8), we observe that the update direction, $\nabla_{y_{t-1}} (\pi_t \cdot u_t)$, is actually the instantaneous regret scaled by $\pi_t$, yielding the concrete update (this scaling by $\pi_t$ is also apparent when taking the continuous-time limit of PG dynamics; see Appendix B):

$$y_t(a) = y_{t-1}(a) + \eta_t\, \pi_t(a)\left[u_t(a) - \bar{u}_t\right] \quad \forall a \in \mathcal{A}. \tag{10}$$

See Section C.1 for details. Scaling the regret by $\pi_t$ leads to an update that can prevent PG from achieving reasonable performance in even simple environments.
The fact that the update includes a $\pi_t$ scaling factor can hinder learning in nonstationary settings (e.g., in games) when an action might be safely disregarded at first, but later the value of this action improves. We illustrate this adaptability issue in the game of Rock–Paper–Scissors, by respectively comparing the dynamics of RD and PG in Figs. 0(b) and 0(a); note the differences near the vertices, where a single action retains a majority of policy mass. Figure 0(c) compares the speeds of the dynamics by plotting their ratio. This example illustrates the practical issues that arise when using PG in settings where learning has converged to a near-deterministic policy and then must adapt to a different policy given, e.g., dynamic payoffs or opponents. While PG fails to adapt rapidly to the game at hand, RD does not exhibit this issue.
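The adaptation issue can be reproduced in a tiny single-state experiment. The sketch below starts from a near-deterministic two-action policy whose preferred action has become the worse one, then counts how many logit updates are needed before the better action regains a majority of the mass, once with the regret scaled by the policy (softmax PG) and once with the unscaled regret (as in Hedge or discretized RD). The initial logits, utilities, learning rate, and function names are all assumptions chosen for illustration.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(y - m) for y in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_step(y, u, eta, scale_by_policy):
    """One tabular logit update driven by the instantaneous regret.

    scale_by_policy=True multiplies each regret by pi(a), as softmax PG does;
    False applies the regret unscaled.
    """
    pi = softmax(y)
    u_bar = sum(p * ua for p, ua in zip(pi, u))
    scale = pi if scale_by_policy else [1.0] * len(y)
    return [ya + eta * s * (ua - u_bar) for ya, s, ua in zip(y, scale, u)]


def steps_until_majority(y0, u, eta, scale_by_policy, max_steps=200000):
    """Number of updates before the best action holds more than half the mass."""
    y = list(y0)
    best = max(range(len(u)), key=lambda a: u[a])
    for step in range(max_steps):
        if softmax(y)[best] > 0.5:
            return step
        y = logit_step(y, u, eta, scale_by_policy)
    return max_steps


y0 = [8.0, 0.0]   # policy is almost deterministic on action 0
u = [0.0, 1.0]    # but action 1 is now the better one
unscaled_steps = steps_until_majority(y0, u, 0.1, scale_by_policy=False)
pg_steps = steps_until_majority(y0, u, 0.1, scale_by_policy=True)
print(unscaled_steps, pg_steps)
```

With the unscaled update the logit gap closes at a constant rate, whereas under the policy-scaled update the rate is proportional to the product of the two action probabilities, so recovery from a near-deterministic policy takes orders of magnitude longer.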
Given these insights, our objective is to derive an algorithm that combines the best of both worlds, in that it is theoretically-grounded and adaptive in the manner of RD, while still enjoying the practical benefits of the parametric PG update rule in RL applications.
While we have highlighted key limitations of PG in comparison to RD, the latter has limited practicality when computational updates are inherently discrete-time or a parameterized policy is desired for generalizability. To address these limitations, we derive a discrete-time parameterized policy update rule, titled Neural Replicator Dynamics (NeuRD), which is later compared against PG. For seamless comparison of our update rule to standard PG, we next switch our nomenclature from the utilities used in online learning, $u(a)$, to the analogous action-values used in RL, $q^\pi(s, a)$. We write the RD (i.e., FoReL with entropy regularization) logit dynamics of Eq. (7) as

$$\dot{y}_t(s, a) = q^{\pi_t}(s, a) - v^{\pi_t}(s), \tag{11}$$

where $v^\pi$ is the standard variance-reducing baseline (Sutton and Barto, 2018). Let $y(s, a; \theta_t)$ denote the logits parameterized by $\theta_t$. A natural way to derive a parametric update rule is to compute the Euler discretization of Eq. (11) (given a tabular softmax policy, this definition matches the standard discrete-time RD; see Section C.5),

$$y_t(s, a) = y(s, a; \theta_{t-1}) + \eta_t\left(q^{\pi_{t-1}}(s, a) - v^{\pi_{t-1}}(s)\right), \tag{12}$$

and consider $y_t(s, a)$ a fixed target value that the parameterized logits $y(s, a; \theta_{t-1})$ are adjusted toward. Specifically, one can update $\theta$ to minimize some choice of metric $d(\cdot, \cdot)$,

$$\theta_t = \theta_{t-1} - \sum_{s, a} \nabla_\theta\, d\left(y_t(s, a),\, y(s, a; \theta_{t-1})\right). \tag{13}$$

In particular, minimizing the Euclidean distance yields

$$\theta_t = \theta_{t-1} - \sum_{s, a} \nabla_\theta \tfrac{1}{2}\left\|y_t(s, a) - y(s, a; \theta_{t-1})\right\|^2 \tag{14}$$
$$\phantom{\theta_t} = \theta_{t-1} + \sum_{s, a} \left(y_t(s, a) - y(s, a; \theta_{t-1})\right) \nabla_\theta y(s, a; \theta_{t-1}) \tag{15}$$
$$\phantom{\theta_t} = \theta_{t-1} + \eta_t \sum_{s, a} \nabla_\theta y(s, a; \theta_{t-1}) \left(q^{\pi_{t-1}}(s, a) - v^{\pi_{t-1}}(s)\right), \tag{16}$$

which we later prove has a rigorous connection to Hedge and, thereby, inherits the no-regret guarantees that are useful in nonstationary settings such as games. As our experiments use a neural network parameterization, we refer to the update rule of Eq. (16) as Neural Replicator Dynamics (NeuRD).

Overall, NeuRD is not only practical to use as it involves a simple modification of PG with no added computational expense, but also benefits from rigorous links to algorithms with no-regret guarantees, as shown in this section. All proofs are in Appendix C.
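To show how small the change from PG to NeuRD is in code, the following sketch implements both all-actions updates for a single state with linear logits, $y(a; \theta) = \theta_a \cdot \phi$. The linear parameterization, the feature vector `features`, and the function names are assumptions made for illustration; the experiments themselves use neural networks.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(y - m) for y in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits(theta, features):
    """Linear logits y(a; theta) = theta[a] . features (assumed parameterization)."""
    return [sum(w * x for w, x in zip(row, features)) for row in theta]


def neurd_step(theta, features, q, eta):
    """NeuRD: theta += eta * sum_a grad_theta y(a) * (q(a) - v)."""
    pi = softmax(logits(theta, features))
    v = sum(p * qa for p, qa in zip(pi, q))
    # For linear logits, grad of y(a) w.r.t. theta[a] is the feature vector.
    return [[w + eta * x * (qa - v) for w, x in zip(row, features)]
            for row, qa in zip(theta, q)]


def pg_step(theta, features, q, eta):
    """All-actions softmax PG: the same step with each advantage scaled by pi(a)."""
    pi = softmax(logits(theta, features))
    v = sum(p * qa for p, qa in zip(pi, q))
    return [[w + eta * x * p * (qa - v) for w, x in zip(row, features)]
            for row, p, qa in zip(theta, pi, q)]


theta = [[4.0], [0.0]]   # action 0 currently dominates the policy
features = [1.0]
q = [0.0, 1.0]           # but action 1 has the higher value
print(neurd_step(theta, features, q, eta=1.0))
print(pg_step(theta, features, q, eta=1.0))
```

For this parameterization the two functions differ only by the factor `p` in the PG step, which is what makes rarely played actions slow to recover under PG.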
We first consider the single state, tabular case to make a key connection to no-regret algorithms.

Single state, all-actions, tabular NeuRD is Hedge.

As a reminder, Hedge (and therefore tabular NeuRD) is no-regret, so NeuRD can be used to find optimal policies or Nash equilibria. In sequential decision making settings, ensuring no-regret requires independent learners at every decision point, and additionally a counterfactual weighting of utilities in imperfect information games (IIGs) (Zinkevich et al., 2008); for completeness, we provide the necessary background on this in Appendix A. At a high level, in IIGs the set of information states for player $i$ corresponds to a partitioning of action histories, and counterfactual values are defined as the expected player utility weighted by the product of opponent sequence probabilities. With these changes, applying NeuRD to imperfect information settings is trivial, as shown in our experiments.

These facts ensure that tabular NeuRD can be used to solve a broad class of problems where PG may fail. While these guarantees are limited to the tabular case, they constitute a principled theoretical grounding on which parameterized NeuRD is constructed.
We next formalize the connection between RD and PG, expanding beyond the scope of prior works that have considered only the links between EGT and value-iteration based algorithms (Tuyls et al., 2003; Kaisers and Tuyls, 2010).

PG is a policy-level Euler discretization approximation of continuous-time RD (i.e., computing $\pi_{t+1}$ using $\pi_t$), under a KL-divergence minimization criterion.

Next we establish a formal link between NeuRD and Natural Policy Gradient (NPG) (Kakade, 2002).

The NeuRD update rule of Eq. (16) corresponds to a naturalized policy gradient rule, in the sense that NeuRD applies a natural gradient only at the policy output level of the softmax function over logits, and uses the standard gradient otherwise.

Unlike NPG, NeuRD does not require computation of the inverse Fisher information matrix, which is especially expensive when, e.g., the policy is parameterized by a large-scale neural network (Martens and Grosse, 2015).
Since, on average, the logits are incremented by the advantage, they may diverge to $\pm\infty$. To avoid numerical issues, in practice one can stop updating the logits if the gap between them exceeds a large threshold. We apply this by using the following clipped gradient in lieu of a standard gradient:
(17) 
where $\mathbb{I}$ is the indicator function, $\eta$ is a learning rate, and $\beta$ controls the allowable logits gap. This thresholding is not problematic at the policy representation level, since actions can have a probability arbitrarily close to 0 or 1 given a large enough $\beta$. Moreover, while we have so far assumed an all-actions NeuRD update, a corresponding sample-based variant, where $a \sim \pi$, is given by
(18) 
which includes a variance-reduction term, useful since the update rule scales inversely with the action selection probability, which may be close to 0 for certain actions.
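One simple way to realize the thresholding idea in the tabular case is sketched below: an action's logit is updated only if the result stays inside a symmetric interval. This is an assumed reading of the clipping rule for illustration, not a transcription of Eq. (17); the function name and the use of an absolute bound `beta` on each logit are hypothetical.

```python
def thresholded_logit_update(logits, advantages, eta, beta):
    """Apply y(a) += eta * A(a) only where the new logit stays in [-beta, beta].

    Assumed illustration of logit thresholding: updates that would push a
    logit outside the allowed range are skipped, leaving it unchanged.
    """
    updated = []
    for y, adv in zip(logits, advantages):
        candidate = y + eta * adv
        updated.append(candidate if -beta <= candidate <= beta else y)
    return updated


# The first two updates would leave the range and are skipped.
print(thresholded_logit_update([1.9, -1.9, 0.0], [1.0, -1.0, 1.0], eta=0.5, beta=2.0))
```

Skipping an update, rather than clamping the logit, keeps logits that are already inside the range free to move back toward the interior, so the policy can still adapt when the advantages change sign.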
We conduct a series of evaluations demonstrating the effectiveness of NeuRD when learning in nonstationary settings such as NFGs, standard imperfect information benchmarks, and variants of each with additional reward nonstationarity. As NeuRD involves only a simple modification of the PG update rule to improve adaptivity, we focus our comparisons against standard PG as a baseline, noting that additional benefits can be analogously gained by combining NeuRD with more intricate techniques that improve PG (e.g., variance reduction, improved exploration, or off-policy learning). Experimental procedures are detailed in Section D.1.

We consider several domains. Rock–Paper–Scissors (RPS) is a well-known canonical NFG involving two players, with a cyclic dominance among the three strategies. Goofspiel is a card game where players try to obtain point cards by bidding simultaneously. We use an imperfect information variant with 4 cards where bid cards are not revealed (Lanctot, 2013). Kuhn Poker is a game wherein each player starts with 2 chips, antes 1 chip to play, receives a face-down card from a deck of $K + 1$ cards, where $K$ is the number of players, such that one card remains hidden, and either bets (raise or call) or folds until all players are in (contributed equally to the pot) or out (folded). Amongst those that are in, the player with the highest-ranked card wins the pot. In Leduc Poker (Southey et al., 2005), players instead have limitless chips, one initial private card, and ante 1 chip to play. Bets are limited to 2 and 4 chips, respectively, in the first and second round, with two raises maximum in each round. A public card is revealed before the second round. Though not the primary focus of the paper, we also provide empirical results for single-agent stationary RL tasks in Section D.3, with the key observation being that updating the recently-introduced IMPALA (Espeholt et al., 2018) agent to use the NeuRD update rule matches state-of-the-art performance.

We first show that the differences between NeuRD and PG detailed in Section 3.3 are more than theoretical. Consider the NashConv of the time-average NeuRD and PG tabular policies in the game of RPS, shown in Fig. 1(a). Note that by construction, NeuRD and RD are equivalent in this tabular, single-state setting. Not only does NeuRD converge towards the Nash equilibrium faster, but PG also eventually plateaus. Consider next a more complex imperfect information setting, where Fig. 2 shows that tabular, all-actions, counterfactual value NeuRD more quickly and more closely approximates a Nash equilibrium in two-player Leduc Poker than tabular, all-actions, counterfactual value PG. (Counterfactual value NeuRD can be seen as counterfactual regret minimization (CFR) (Zinkevich et al., 2008) with Hedge; also see Brown et al. (2017).)

We next consider modifications of our domains wherein reward functions change at specific intervals during learning, compounding the usual nonstationarities in games. Specifically, we consider games with three phases, wherein learning commences under a particular reward function, after which it switches to a different function in each phase while learning continues without the policies being reset. In biased RPS, each phase corresponds to a particular choice of the bias parameter in the payoff table (Table 0(a)); specifically, we set this parameter to 20, 0, and 20, respectively, for the three phases. This has the effect of biasing the Nash equilibrium towards one of the simplex corners (see Fig. 2(b)), then to the simplex center (Fig. 0(a)), then back again towards the corner. In Fig. 2(b), we plot the NashConv of NeuRD and PG with respect to the Nash for that particular phase. Despite the changing payoffs, the NeuRD NashConv decreases towards 0 in each of the phases, while PG again plateaus.

Finally, we consider imperfect information games, with the reward function being iteratively negated in each game phase for added nonstationarity, and policies parameterized using neural networks. Due to the complexity of maintaining a time-average neural network policy to ensure no-regret, we use entropy regularization to induce real-time policy convergence towards the Nash (e.g., as done by Srinivasan et al. (2018)). Figure 4 illustrates the NashConv for NeuRD and PG for the imperfect information domains considered, for an intermediate level of entropy regularization. NeuRD converges faster than PG in all three domains. Section D.2 provides full sweeps over the entropy regularization parameter; as regularization increases, so does the rate of convergence, although to a fixed point further from the Nash. In Fig. 5, we plot the NashConv area-under-the-curve (AUC) for all game phases in nonstationary Leduc, and for all entropy regularization levels. Notably, NeuRD is significantly more stable in learning than PG, even without entropy regularization. Of particular note is Phase II of the game, wherein PG fails to match NeuRD's performance for any value of regularization.

This paper rigorously investigated the links between RD and PG methods, extending prior inquiries between EGT and value-iteration based methods. The insights gained led to the introduction of a novel algorithm, NeuRD, that generalizes the no-regret Hedge algorithm and RD to utilize function approximation. NeuRD was empirically shown to better cope in highly nonstationary and adversarial settings than PG. While NeuRD represents an important extension of classical learning dynamics to utilize function approximation, Hedge and RD are also instances of the more general FoReL framework that applies to arbitrary convex decision sets and problem geometries expressed through a regularization function. This connection suggests that NeuRD could perhaps likewise be generalized to convex decision sets and various parametric forms, which would allow a general FoReL-like method to take advantage of functional representations. Moreover, as NeuRD is practical in that it involves a very simple modification of the standard PG update rule, a natural avenue for future work is to investigate NeuRD-based extensions of standard PG-based methods (e.g., A3C (Mnih et al., 2016), DDPG (Lillicrap et al., 2015), and MADDPG (Lowe et al., 2017)), in addition to naturally nonstationary single-agent RL tasks such as intrinsic motivation-based exploration (Graves et al., 2017).
An extensive-form game (EFG) specifies the sequential interaction between players $i \in \{1, \ldots, N\} \cup \{c\}$, where $c$ is a special player called chance or nature (which is also assigned a player index) who has a fixed stochastic policy that determines the transition probabilities at random events like dice rolls or the dealing of cards from a shuffled deck. Actions, $a \in \mathcal{A}$, are played in turns according to a player function, $\tau: \mathcal{H} \to \{1, \ldots, N\} \cup \{c\}$, and are recorded in a history, $h \in \mathcal{H}$, where $\mathcal{H}$ is the set of all possible action sequences. To model games like Poker that require some actions to be hidden from particular players, players do not observe the game's history directly. Instead, histories are partitioned into information states, $s \in \mathcal{S}_i$, for each player, $i$, according to a function mapping sets of histories into information states. Player $i$ must act from $s$ without knowing which particular history led to $s$. This requires that the set of actions in each history within an information set must match, so we can define $\mathcal{A}(s) = \mathcal{A}(h)$ with $h \in s$. A behavioral policy for player $i$ maps every $s \in \mathcal{S}_i$ to a probability distribution over actions: $\pi_i(s) \in \Delta^{|\mathcal{A}(s)|}$. Payoffs, $u_i(z)$, are provided to each player upon reaching a terminal history, $z \in \mathcal{Z}$. We assume finite games so a terminal history is always eventually reached. We denote the subset of terminal histories that share $h$ as a prefix as $\mathcal{Z}(h)$. For any two histories, $h, h' \in \mathcal{H}$, we use the notation $h \sqsubseteq h'$ to indicate that history $h$ is a prefix of $h'$.

We also define state-action values for joint policies. The value $q_i^\pi(s, a)$ represents the expected return to player $i$ starting at state $s$, taking action $a$, and playing $\pi$. Formally,

$$q_i^\pi(s, a) = \frac{\sum_{h \in s} \eta^\pi(h)\, q_i^\pi(h, a)}{\sum_{h \in s} \eta^\pi(h)}, \tag{19}$$

where $q_i^\pi(h, a)$ is the expected utility of the ground state-action pair $(h, a)$, and $\eta^\pi(h)$ is the probability of reaching $h$ under the profile $\pi$. We make the common assumption that players have perfect recall, i.e., they do not forget anything they have observed while playing. Under perfect recall, the distribution of the states can be obtained only from the opponents' policies using Bayes' rule (see Srinivasan et al. (2018, Section 3.2)). For convenience, in turn-based games, we define and .

The probability of reaching any history can be decomposed as the product of player sequence probabilities, $\eta^\pi(h) = \eta_i^\pi(h)\, \eta_{-i}^\pi(h)$. Counterfactual value (Zinkevich et al., 2008) plays a critical role in many Nash equilibrium approximation methods and is defined as the expected utility weighted by the product of opponent sequence probabilities,

$$q_i^{c}(s, a) = \sum_{h \in s} \eta_{-i}^\pi(h)\, q_i^\pi(h, a), \tag{20}$$

with $v_i^{c}(s)$ defined analogously. Policy averaging is also part of many algorithms and is done in terms of sequence probabilities in EFGs.
In this regime, the policy for player on round in information state , is
(21) 
where is the learning rate at round , while and are, respectively, the counterfactual actionvalue and value.
We can also consider the continuous-time $q$-value based policy gradient (QPG) dynamics (Srinivasan et al., 2018, Section D.1.1), which are amenable to comparison against the RD and provide a reasonable approximation to the discrete-time PG models given a sufficiently small learning rate (Borkar, 2009):

$$\dot{\pi}_t(a) = \pi_t(a)\left[\pi_t(a)\left(u(a, \pi_t) - \bar{u}(\pi_t)\right) - \sum_{b} \pi_t(b)^2 \left(u(b, \pi_t) - \bar{u}(\pi_t)\right)\right]. \tag{22}$$

In contrast to the RD of Eq. (1), the QPG dynamics in Eq. (22) have an additional $\pi_t(a)$ term that modulates learning and slows adaptation for actions that are taken with low probability under $\pi_t$.
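The difference in speed between the two flows can be checked numerically. The sketch below evaluates the RD right-hand side and the policy-space flow that tabular softmax PG induces, obtained here by pushing the logit flow $\dot{y}(b) = \pi(b)(u(b) - \bar{u})$ through the softmax Jacobian. The utility vector is held fixed for simplicity, and the test points and function names are assumptions for illustration.

```python
import math


def rd_derivative(pi, u):
    """Replicator dynamics: pi(a) * (u(a) - u_bar)."""
    u_bar = sum(p * ua for p, ua in zip(pi, u))
    return [p * (ua - u_bar) for p, ua in zip(pi, u)]


def qpg_derivative(pi, u):
    """Policy flow of tabular softmax PG.

    The logits move as dy(b)/dt = pi(b) * (u(b) - u_bar); the softmax
    Jacobian then gives dpi(a)/dt = pi(a) * (dy(a)/dt - sum_b pi(b) dy(b)/dt).
    """
    u_bar = sum(p * ua for p, ua in zip(pi, u))
    logit_flow = [p * (ua - u_bar) for p, ua in zip(pi, u)]
    weighted = sum(p * f for p, f in zip(pi, logit_flow))
    return [p * (f - weighted) for p, f in zip(pi, logit_flow)]


def speed(vector):
    return math.sqrt(sum(x * x for x in vector))


u = [1.0, 0.0, -2.0]
for pi in ([1.0 / 3.0] * 3, [0.98, 0.01, 0.01]):
    ratio = speed(rd_derivative(pi, u)) / speed(qpg_derivative(pi, u))
    print(pi, ratio)
```

At the uniform policy the two flows are parallel and differ only by a constant factor, while near a vertex the PG flow becomes much slower than RD, matching the qualitative picture described for Rock–Paper–Scissors.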
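Here, we fully derive the single state, tabular PG update. On round $t$, PG updates its logits as

$$y_t = y_{t-1} + \eta_t \nabla_{y_{t-1}} \left(\pi_t \cdot u_t\right). \tag{23}$$

As there is no action or state sampling in this setting, shifting all the payoffs by the expected value $\bar{u}_t = \pi_t \cdot u_t$ (or $v^\pi$ in RL terms) has no impact on the policy, so this term is omitted here. Noting that $\partial \pi_t(a') / \partial y_{t-1}(a) = \pi_t(a')\left[\mathbb{1}_{a' = a} - \pi_t(a)\right]$ (Sutton and Barto, 2018, Section 2.8), we observe that the update direction, $\nabla_{y_{t-1}} (\pi_t \cdot u_t)$, is actually the instantaneous regret scaled by $\pi_t$:

$$\frac{\partial (\pi_t \cdot u_t)}{\partial y_{t-1}(a)} = \sum_{a'} \frac{\partial \pi_t(a')}{\partial y_{t-1}(a)}\, u_t(a')$$
$$= \sum_{a'} \pi_t(a')\left[\mathbb{1}_{a' = a} - \pi_t(a)\right] u_t(a')$$
$$= \pi_t(a)\, u_t(a) - \pi_t(a) \sum_{a'} \pi_t(a')\, u_t(a')$$
$$= \pi_t(a)\left[u_t(a) - \bar{u}_t\right].$$

Therefore, the concrete update is:

$$y_t(a) = y_{t-1}(a) + \eta_t\, \pi_t(a)\left[u_t(a) - \bar{u}_t\right] \quad \forall a \in \mathcal{A}. \tag{30}$$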
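Proof that single state, all-actions, tabular NeuRD is Hedge. In the single state, tabular case, $\nabla_\theta y(\theta)$ is the identity matrix, so unrolling the NeuRD update of Eq. (16) across $T - 1$ rounds, we see that the NeuRD policy is

$$\pi_T = \Pi\left(\sum_{t=1}^{T-1} \eta_t \left(u_t - (\pi_t \cdot u_t)\mathbf{1}\right)\right) = \Pi\left(\sum_{t=1}^{T-1} \eta_t u_t\right), \tag{31}$$

since $\Pi$ is shift invariant. But Eq. (31) is identical to Eq. (5), so NeuRD and Hedge use the same policy on every round and are therefore equivalent in this setting. ∎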
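The equivalence can also be checked numerically: running the tabular NeuRD logit update (with the baseline subtracted) and Hedge (without it) on the same utility sequence should produce identical policies, because the softmax ignores constant shifts of the logits. The utility sequence, learning rate, and function names below are assumptions for illustration.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(y - m) for y in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hedge_policies(utilities, eta):
    """Policies played by Hedge: softmax of the accumulated scaled utilities."""
    y = [0.0] * len(utilities[0])
    played = []
    for u in utilities:
        played.append(softmax(y))
        y = [ya + eta * ua for ya, ua in zip(y, u)]
    return played


def tabular_neurd_policies(utilities, eta):
    """Policies played by tabular NeuRD: logits accumulate u(a) - pi . u."""
    y = [0.0] * len(utilities[0])
    played = []
    for u in utilities:
        pi = softmax(y)
        played.append(pi)
        u_bar = sum(p * ua for p, ua in zip(pi, u))
        y = [ya + eta * (ua - u_bar) for ya, ua in zip(y, u)]
    return played


utilities = [[1.0, 0.0, -1.0], [0.0, 2.0, 1.0], [-1.0, 0.5, 3.0], [2.0, -2.0, 0.0]] * 5
gap = max(abs(a - b)
          for p, r in zip(hedge_policies(utilities, 0.3),
                          tabular_neurd_policies(utilities, 0.3))
          for a, b in zip(p, r))
print(gap)  # difference is at the level of floating-point rounding
```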
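The following result unifies the Replicator Dynamics and Policy Gradient.

Proof that PG is a policy-level Euler discretization approximation of continuous-time RD under a KL-divergence minimization criterion. A discrete-time Euler discretization of the RD at the policy level is:

$$\pi_{t+1}(a) = \pi_t(a)\left[1 + \eta_t\left(q^{\pi_t}(a) - v^{\pi_t}\right)\right].$$

Note that while $\sum_a \pi_{t+1}(a) = 1$, this Euler-discretized update may still be outside the simplex; however, $\pi_{t+1}$ merely provides a target for our parameterized policy update, which is subsequently reprojected back to the simplex via $\Pi(\cdot)$.

Now if we consider parameterized policies $\pi_t \approx \pi_{\theta_t}$, and our goal is to define dynamics on $\theta_t$ that capture those of the RD, a natural approach consists in updating $\theta_t$ in order to make $\pi_{\theta_t}$ move towards $\pi_{t+1}$, for example in the sense of minimizing their KL divergence, $\text{KL}(\pi_{t+1} \,\|\, \pi_{\theta_t}) = \sum_a \pi_{t+1}(a) \log \frac{\pi_{t+1}(a)}{\pi_{\theta_t}(a)}$.

Of course, the KL divergence is defined only when both inputs are in the positive orthant, so in order to measure the divergence from $\pi_{t+1}$, which may have negative values, we need to define a KL-like divergence. Fortunately, since the $\pi_{t+1}(a) \log \pi_{t+1}(a)$ term is inconsequential from an optimization perspective and this is the only term that requires $\pi_{t+1}(a) > 0$, a natural modification of the KL divergence to allow for negative values in its first argument is to drop this term entirely, resulting in $\widetilde{\text{KL}}(\pi_{t+1} \,\|\, \pi_{\theta_t}) = -\sum_a \pi_{t+1}(a) \log \pi_{\theta_t}(a)$.

The gradient-descent step on the $\widetilde{\text{KL}}$ objective is:

$$\theta_{t+1} = \theta_t - \nabla_\theta \widetilde{\text{KL}}(\pi_{t+1} \,\|\, \pi_{\theta_t})$$
$$= \theta_t + \sum_a \pi_{t+1}(a)\, \nabla_\theta \log \pi_{\theta_t}(a)$$
$$= \theta_t + \sum_a \pi_t(a)\left[1 + \eta_t\left(q^{\pi_t}(a) - v^{\pi_t}\right)\right] \nabla_\theta \log \pi_{\theta_t}(a).$$

Assuming $\pi_t = \pi_{\theta_t}$,

$$\theta_{t+1} = \theta_t + \sum_a \left[1 + \eta_t\left(q^{\pi_t}(a) - v^{\pi_t}\right)\right] \nabla_\theta \pi_{\theta_t}(a)$$
$$= \theta_t + \nabla_\theta \sum_a \pi_{\theta_t}(a) + \eta_t \sum_a \nabla_\theta \pi_{\theta_t}(a)\left(q^{\pi_t}(a) - v^{\pi_t}\right)$$
$$= \theta_t + \eta_t \sum_a \nabla_\theta \pi_{\theta_t}(a)\left(q^{\pi_t}(a) - v^{\pi_t}\right),$$

where the middle term vanishes because $\sum_a \pi_{\theta_t}(a) = 1$. This is precisely a policy gradient step. ∎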
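We detail here a unification of the Natural PG and NeuRD update rules.

Proof that the NeuRD update rule corresponds to a naturalized policy gradient rule. Consider a policy $\pi(y)$ defined by a softmax over a set of logits $y$: $\pi(y) = \Pi(y)$. Define the Fisher information matrix of the policy with respect to the logits $y$:

$$F_{a,b}(y) = \sum_c \pi_c(y)\, \frac{\partial \log \pi_c(y)}{\partial y_a}\, \frac{\partial \log \pi_c(y)}{\partial y_b}$$
$$= \sum_c \pi_c(y)\left(\delta_{a,c} - \pi_a(y)\right)\left(\delta_{b,c} - \pi_b(y)\right)$$
$$= \pi_a(y)\, \delta_{a,b} - \pi_a(y)\, \pi_b(y).$$

Note that

$$\frac{\partial \pi_a(y)}{\partial y_b} = \pi_a(y)\left(\delta_{a,b} - \pi_b(y)\right) = F_{a,b}(y)$$

from the definition of $\pi$. This means that considering the variables $y$ as parameters of the policy, the natural gradient $\widetilde{\nabla}_y \pi$ of $\pi$ with respect to $y$, which satisfies $F(y)\, \widetilde{\nabla}_y \pi = \nabla_y \pi$, is

$$\widetilde{\nabla}_y \pi = I. \tag{50}$$

Now assume the logits $y_\theta$ are parameterized by some parameter $\theta$ (e.g., with a neural network). Let us define the pseudo-natural gradient of the probabilities $\pi$ with respect to $\theta$ as the composition of the natural gradient of $\pi$ with respect to $y$ (i.e., the softmax transformation) and the gradient of $y_\theta$ with respect to $\theta$:

$$\widetilde{\nabla}_\theta \pi = \left(\nabla_\theta y_\theta\right)\left(\widetilde{\nabla}_y \pi\right) = \nabla_\theta y_\theta. \tag{51}$$

From the above, we have that a natural policy gradient yields:

$$\sum_a \widetilde{\nabla}_\theta \pi(a)\left(q^\pi(a) - v^\pi\right) = \sum_a \nabla_\theta y_\theta(a)\left(q^\pi(a) - v^\pi\right), \tag{52}$$

which is nothing else than the NeuRD update rule in Eq. (16). ∎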
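The key identity used in this argument, that the Fisher information of a softmax policy with respect to its logits coincides with the softmax Jacobian, can be verified numerically. The sketch below also checks that applying this matrix to an advantage vector returns the advantages scaled by the policy, which is the logit-level PG direction; removing the matrix, as the natural gradient does, leaves the unscaled advantages used by NeuRD. The chosen logits, action values, and function names are assumptions for illustration. Since the matrix is singular along the all-ones direction, the identity matrix is one valid choice of natural gradient rather than a literal inverse.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(y - m) for y in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fisher_information(pi):
    """F = E_{c ~ pi}[score_c score_c^T], with score_c = e_c - pi for a softmax."""
    n = len(pi)
    fisher = [[0.0] * n for _ in range(n)]
    for c in range(n):
        score = [(1.0 if a == c else 0.0) - pi[a] for a in range(n)]
        for a in range(n):
            for b in range(n):
                fisher[a][b] += pi[c] * score[a] * score[b]
    return fisher


def softmax_jacobian(pi):
    """d pi(a) / d y(b) = pi(a) * (1[a == b] - pi(b))."""
    n = len(pi)
    return [[pi[a] * ((1.0 if a == b else 0.0) - pi[b]) for b in range(n)]
            for a in range(n)]


pi = softmax([1.0, 0.0, -1.0])
q = [0.5, -1.0, 2.0]
v = sum(p * qa for p, qa in zip(pi, q))
advantage = [qa - v for qa in q]

F = fisher_information(pi)
J = softmax_jacobian(pi)
# Logit-level PG direction: F applied to the advantages.
pg_direction = [sum(F[a][b] * advantage[b] for b in range(3)) for a in range(3)]
print(pg_direction)
print([p * adv for p, adv in zip(pi, advantage)])
```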
A common way to define discretetime replicator dynamics is according to the socalled standard discretetime replicator dynamic (Cressman, 2003),
(53) 
The action values are exponentiated to ensure all the utility values are positive, which is the typical assumption required by this model. Since the policy is a softmax function applied to logits, we can rewrite this dynamic in the tabular case to also recover an equivalent to the NeuRD update rule in creftype 16 with :
(54)  
(55) 
The resulting policy is generated from logits that differ from those of Eq. 16 only by a constant shift across all actions. Since the softmax function is shift invariant, the sequence of policies generated from these update rules will be identical.
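The shift-invariance argument can be illustrated with a small numerical sketch. It is not taken from the paper: the function names and example values are hypothetical, and the tabular NeuRD-style update is assumed here to add the advantage $q(a) - v$ to each logit with a step size of one.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(y - m) for y in logits]
    total = sum(exps)
    return [e / total for e in exps]

def discrete_replicator_step(policy, q_values):
    """Standard discrete-time replicator step with exponentiated action values:
    the new probability of a is proportional to pi(a) * exp(q(a))."""
    weights = [p * math.exp(q) for p, q in zip(policy, q_values)]
    total = sum(weights)
    return [w / total for w in weights]

logits = [0.5, -0.2, 1.0]     # example logits (assumption)
q_values = [1.0, -1.0, 0.25]  # example action values (assumption)
policy = softmax(logits)
v = sum(p * q for p, q in zip(policy, q_values))  # expected value under the policy

# 1) Replicator step applied directly to the policy.
via_replicator = discrete_replicator_step(policy, q_values)
# 2) Softmax of the logits shifted by the action values.
via_shifted_logits = softmax([y + q for y, q in zip(logits, q_values)])
# 3) Softmax of the logits shifted by the advantages (assumed NeuRD-style step, step size 1).
via_advantage_logits = softmax([y + (q - v) for y, q in zip(logits, q_values)])

print(via_replicator)
print(via_shifted_logits)
print(via_advantage_logits)
```

All three printed policies agree up to floating-point error, since subtracting the constant $v$ from every logit leaves the softmax unchanged.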
Interestingly, there have been sequence-form extensions of these standard discrete-time replicator dynamics (Gatti et al., 2013) that are also related to counterfactual regret minimization (Lanctot, 2014), though with a different, if similar, regret minimizer. There have also been sampling variants, such as sequence-form Q-learning (Panozzo et al., 2014), and sequence-form logit dynamics (Gatti and Restelli, 2016), which seems similar to the NeuRD update. The main benefit of NeuRD over these algorithms is that using counterfactual values allows representing the policies in behavioral form, as is standard in reinforcement learning, rather than in sequence form. As a result, it is straightforward to do sampling and function approximation.
We detail the experimental procedures here.
For Fig. 1, we simulate the continuous-time RD (Eq. 1) and the continuous-time PG dynamics (Eq. 22) from regular grid points on the simplex. For Fig. 1(a) and Fig. 1(b), we plot 15 trajectories of length ; arrows indicate the direction of evolution. For Fig. 1(c), the plotted quantity is computed at 20,000 regularly spaced points on the policy simplex.
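The simulation procedure can be sketched as follows. This is an illustrative sketch only: the game (single-population Rock-Paper-Scissors), the grid resolution, the step size, the number of steps, and the use of forward-Euler integration are all assumptions, and only the replicator dynamics are shown.

```python
def simplex_grid(resolution):
    """Regular grid on the 3-action simplex: all (i, j, k) / resolution with i + j + k = resolution."""
    points = []
    for i in range(resolution + 1):
        for j in range(resolution + 1 - i):
            k = resolution - i - j
            points.append((i / resolution, j / resolution, k / resolution))
    return points

# Rock-Paper-Scissors payoffs for the row player (assumed example game).
RPS = [[0.0, -1.0, 1.0],
       [1.0, 0.0, -1.0],
       [-1.0, 1.0, 0.0]]

def replicator_velocity(policy, payoff):
    """Single-population RD: d pi(a)/dt = pi(a) * (q(a) - v)."""
    n = len(policy)
    q = [sum(payoff[a][b] * policy[b] for b in range(n)) for a in range(n)]
    v = sum(p * qa for p, qa in zip(policy, q))
    return [p * (qa - v) for p, qa in zip(policy, q)]

def euler_trajectory(start, payoff, step_size=0.01, num_steps=500):
    """Forward-Euler integration of the replicator dynamics from `start`."""
    trajectory = [list(start)]
    for _ in range(num_steps):
        current = trajectory[-1]
        velocity = replicator_velocity(current, payoff)
        trajectory.append([p + step_size * d for p, d in zip(current, velocity)])
    return trajectory

grid = simplex_grid(4)  # resolution chosen arbitrarily for the example
trajectories = [euler_trajectory(point, RPS) for point in grid]
print(len(grid), trajectories[5][-1])
```

The velocities sum to zero across actions, so each Euler step keeps the iterate on the simplex up to floating-point error; a plotting routine would then draw each trajectory with arrows along the direction of evolution.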
For Fig. 2(a), the continuous-time dynamics of NeuRD (Eq. 11) and PG (Eq. 22) are integrated over time with a step size corresponding to 1 iteration. The figure shows the mean NashConv of 100 trajectories starting from initial conditions sampled uniformly from the policy simplex. The shaded area corresponds to the confidence interval computed with bootstrapping from 1000 samples.
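The two quantities mentioned here, NashConv and a bootstrapped confidence interval of its mean, can be sketched as below. This is a hedged illustration rather than the paper's code: it assumes a two-player zero-sum matrix game, a percentile bootstrap at the 95% level, and that the "1000 samples" are bootstrap resamples. The function names and the synthetic data are hypothetical.

```python
import random

def nash_conv(payoff, row_policy, col_policy):
    """NashConv in a two-player zero-sum matrix game; `payoff` holds the row player's payoffs.

    It sums each player's gain from deviating to a best response, which in the
    zero-sum case reduces to max_a (A y)_a - min_b (x^T A)_b.
    """
    num_rows, num_cols = len(payoff), len(payoff[0])
    row_values = [sum(payoff[a][b] * col_policy[b] for b in range(num_cols))
                  for a in range(num_rows)]
    col_values = [sum(row_policy[a] * payoff[a][b] for a in range(num_rows))
                  for b in range(num_cols)]
    return max(row_values) - min(col_values)

def bootstrap_mean_ci(values, num_resamples=1000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(sum(rng.choice(values) for _ in range(n)) / n
                   for _ in range(num_resamples))
    lower_index = int((1.0 - level) / 2.0 * num_resamples)
    upper_index = int((1.0 + level) / 2.0 * num_resamples) - 1
    return means[lower_index], means[upper_index]

# Rock-Paper-Scissors payoffs for the row player (assumed example game).
RPS = [[0.0, -1.0, 1.0],
       [1.0, 0.0, -1.0],
       [-1.0, 1.0, 0.0]]
uniform = [1.0 / 3.0] * 3
rock = [1.0, 0.0, 0.0]
print(nash_conv(RPS, uniform, uniform))  # zero at the uniform equilibrium, up to rounding
print(nash_conv(RPS, rock, rock))        # each player can gain 1 by deviating

# Synthetic stand-in for the final NashConv of 100 trajectories (not real data).
data_rng = random.Random(1)
values = [data_rng.random() for _ in range(100)]
mean_value = sum(values) / len(values)
lower, upper = bootstrap_mean_ci(values)
print(mean_value, lower, upper)
```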
For Fig. 2(b), in every iteration, each information-state policy over the entire game was updated for both players in an alternating fashion, i.e., the first player's policy was updated, then the second player's (Burch, 2017, Section 4.3.6; Burch et al., 2019). The only difference between the NeuRD and PG algorithms in this case is the information-state logit update rule, and the only difference here, as described by Eq. 10, is that PG scales the NeuRD update by the current policy. The performance displayed is that of the sequence-probability time-average policy for both algorithms (see Appendix A). The same set of constant step sizes was tried for both algorithms. The shaded area corresponds to the interval that would result from a uniform sampling of the step size from this set.
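The statement that PG scales the NeuRD update by the current policy can be checked on a single information state. The sketch below is illustrative and not the paper's implementation: it assumes a tabular softmax policy, takes the NeuRD-style logit direction to be the advantage $q(a) - v$, and obtains the PG direction by differentiating the expected value with respect to the logits (with $q$ held fixed) using finite differences. All names and values are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(y - m) for y in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_value(logits, q_values):
    """v = sum_a pi(a) q(a) under the softmax policy."""
    return sum(p * q for p, q in zip(softmax(logits), q_values))

def neurd_logit_direction(logits, q_values):
    """Assumed NeuRD-style tabular direction: the advantage q(a) - v for every action."""
    v = expected_value(logits, q_values)
    return [q - v for q in q_values]

def pg_logit_direction(logits, q_values, eps=1e-6):
    """Softmax PG direction: d v / d y_a by central finite differences, q held fixed."""
    grads = []
    for a in range(len(logits)):
        up = list(logits)
        up[a] += eps
        down = list(logits)
        down[a] -= eps
        grads.append((expected_value(up, q_values) - expected_value(down, q_values)) / (2 * eps))
    return grads

logits = [2.0, 0.0, -2.0]     # policy concentrated on the first action (assumption)
q_values = [-1.0, 0.5, 1.0]   # the rarely played third action is the best one (assumption)

policy = softmax(logits)
neurd = neurd_logit_direction(logits, q_values)
pg = pg_logit_direction(logits, q_values)
scaled_neurd = [p * d for p, d in zip(policy, neurd)]
print(pg)
print(scaled_neurd)
```

In this example the action with the largest advantage is also the one with the smallest probability, so its PG logit update is scaled down the most, whereas the NeuRD-style direction is unaffected by the current probability.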
We use the same setup detailed above for Fig. 2(a) for the nonstationary case shown in Fig. 3. The payoff matrix is switched every 1000 iterations, i.e., at iterations 1000 and 2000.
For each game in Fig. 4, we randomly initialize a policy parameterized by a two-layer neural network (128 hidden units). Results are reported for 40 random hyperparameter sweeps for both NeuRD and PG. We update the policy once every 4 updates of the Q-function. The batch size of the policy update is 256 trajectories, with trajectory lengths of 5, 8, and 8 for Kuhn Poker, Leduc Poker, and Goofspiel, respectively. The Q-function update batch size is 4 trajectories (same lengths). A learning rate of 0.002 was used for policy updates, and 0.01 for Q-value function updates. The reward function is negated at regular intervals of learning iterations, with the three reward function phases separated by the vertical red stripes in the plots. Upon conclusion of each learning phase, policies are not reset; instead, learning continues given the latest policy (for each of the 40 trials).
For Fig. 5, we simply compute the NashConv area-under-the-curve for each phase of learning in Leduc Poker, across all entropy regularization levels, using 40 hyperparameter seeds. That is, this plot provides a concise summary of Fig. D.6.
With the exception of the experiments presented in Section D.3, all experiments were performed on local workstations. The single-agent DMLab-30 (Beattie et al., 2016) experiments were conducted using a cloud computing platform with P100 GPUs. We performed 20 independent runs with 20 actors and one learner per run. The hyperparameters were sampled from the same distributions as selected by Espeholt et al. (2018), with the addition of the NeuRD-specific hyperparameters: one sampled log-uniformly and max-logit-gap sampled uniformly, each within its own interval.
The sample mean is used as the central tendency estimator in plots, with variation indicated as the 95% confidence interval that is the default setting used in the Seaborn visualization library that generates our plots. No data was excluded and no other preprocessing was conducted to generate these plots.
Full results for all considered nonstationary imperfect information domains, under varying levels of policy entropy regularization, are shown in Figs. D.8, D.7 and D.6. Increasing policy network regularization causes the NashConv to be biased away from the Nash equilibrium. In every learning phase of every domain, the lowest NashConv over the sweep of entropy regularization coefficients is achieved by NeuRD.
Though not considered a main focus in this paper, we also explore how NeuRD behaves in the single-agent stationary RL setting. Specifically, we consider IMPALA (Espeholt et al., 2018), a state-of-the-art policy-gradient-based distributed learning agent, as a baseline. We replace the IMPALA policy update rule with that of NeuRD, otherwise keeping the training procedures the same as those used by Espeholt et al. (2018). We then compare the two algorithms in a suite of DMLab-30 (Beattie et al., 2016) tasks. All IMPALA scores correspond to the 'experts' (i.e., non-multitask) results reported by Espeholt et al. (2018, Table B.1).
Of particular note is the rooms_select_nonmatching_object domain, where NeuRD attains a score of 48.9 ± 4.0, which is significantly higher than IMPALA's score of 7.3. We conjecture this is due to the inherent adaptivity required on behalf of the agent to maximize its reward. In this domain, the agent spawns in a room, and views an out-of-reach object shown in one of two displays. If the agent reaches a specific pad in the room, it receives a reward, and is spawned to a second room with two objects, one of which was the one in the previous room. The agent receives a positive (respectively, negative) reward if it collects the object not matching (respectively, matching) the one in the previous room, then respawns to the first room. The episode ends after 12 seconds. This domain can be viewed as a contextual two-arm bandit problem, where the agent needs to learn to choose the non-matching object based on the context provided in the first room, rather than converge to a deterministic policy that always chooses the object that yielded the first reward. We, therefore, conjecture that the adaptivity properties of NeuRD enable it to attain the high reward in contrast to IMPALA. These preliminary results indicate that investigation of NeuRD's performance in related nonstationary single-agent settings may make for an interesting avenue of future work.

| Level name | Human | Random | IMPALA | NeuRD |
|---|---|---|---|---|
| rooms_select_nonmatching_object | 65.9 | 0.3 | 7.3 | 48.9 ± 4.0 |
| rooms_watermaze | 54.0 | 4.1 | 26.9 | 29.8 ± 0.4 |
| rooms_keys_doors_puzzle | 53.8 | 4.1 | 28.0 | 24.2 ± 1.2 |
| lasertag_one_opponent_small | 18.6 | 0.1 | 0.1 | 0.0 ± 0.0 |
| lasertag_three_opponents_small | 31.5 | 0.1 | 19.1 | 0.0 ± 0.0 |
| lasertag_one_opponent_large | 12.7 | 0.2 | 0.2 | 0.0 ± 0.0 |
| lasertag_three_opponents_large | 18.6 | 0.2 | 0.1 | 0.0 ± 0.0 |
| explore_goal_locations_small | 267.5 | 7.7 | 209.4 | 208.6 ± 0.4 |
| explore_object_locations_small | 74.5 | 3.6 | 57.8 | 54.0 ± 0.4 |
| explore_object_locations_large | 65.7 | 4.7 | 37.0 | 42.8 ± 0.8 |
| natlab_fixed_large_map | 36.9 | 2.2 | 34.7 | 22.5 ± 4.6 |
| natlab_varying_map_regrowth | 24.4 | 3.0 | 20.7 | 22.6 ± 0.7 |
| natlab_varying_map_randomized | 42.4 | 7.3 | 36.1 | 37.2 ± 1.3 |
| psychlab_arbitrary_visuomotor_mapping | 58.8 | 0.2 | 16.4 | 16.5 ± 0.6 |
| psychlab_continuous_recognition | 58.3 | 0.2 | 29.9 | 30.7 ± … |
| psychlab_sequential_comparison | 39.5 | 0.1 | 0.0 | 0.0 ± 0.0 |