1. Introduction
Due to the advent of deep reinforcement learning (RL) methods that allow the study of many agents in rich environments, multiagent RL has flourished in recent years. However, most of this recent work considers fully cooperative settings (Omidshafiei et al., 2017; Foerster et al., 2018, 2017), and emergent communication in particular (Das et al., 2017; Mordatch and Abbeel, 2017; Lazaridou et al., 2016; Foerster et al., 2016; Sukhbaatar et al., 2016). Considering future applications of multiagent RL, such as self-driving cars, it is obvious that many of these will be only partially cooperative and contain elements of competition and conflict.
The human ability to maintain cooperation in a variety of complex social settings has been vital for the success of human societies. Emergent reciprocity has been observed even in strongly adversarial settings such as wars (Axelrod, 2006), making it a quintessential and robust feature of human life.
In the future, artificial learning agents are likely to take an active part in human society, interacting both with other learning agents and humans in complex, partially competitive settings. Failing to develop learning algorithms that lead to emergent reciprocity in these artificial agents could lead to disastrous outcomes.
How reciprocity can emerge among a group of learning, self-interested, reward-maximizing RL agents is thus a question both of theoretical interest and of practical importance. Game theory has a long history of studying the learning outcomes in games that contain cooperative and competitive elements. In particular, the tension between cooperation and defection is commonly studied in the iterated prisoners’ dilemma. In this game, selfish interests can lead to an outcome that is overall worse for all participants, while cooperation maximizes social welfare, one measure of which is the sum of rewards for all agents.
Interestingly, in the simple setting of an infinitely repeated prisoners’ dilemma with discounting, randomly initialised RL agents pursuing independent gradient ascent on the exact value function learn to defect with high probability. This shows that current state-of-the-art learning methods in deep multiagent RL can lead to agents that fail to cooperate reliably even in simple social settings with explicit actions to cooperate and defect. One well-known shortcoming is that they fail to consider the learning process of the other agents and simply treat the other agent as a static part of the environment (Hernandez-Leal et al., 2017).
As a step towards reasoning over the learning behaviour of other agents in social settings, we propose Learning with Opponent-Learning Awareness (LOLA). The LOLA learning rule includes an additional term that accounts for the impact of one agent’s parameter update on the learning step of the other agents. For convenience we use the word ‘opponents’ to describe the other agents, even though the method is not limited to zero-sum games and can be applied in the general-sum setting. We show that this additional term, when applied by both agents, leads to emergent reciprocity and cooperation in the iterated prisoners’ dilemma (IPD). Experimentally we also show that in the IPD, each agent is incentivised to switch from naive learning to LOLA, while there are no additional gains in attempting to exploit LOLA with higher-order gradient terms. This suggests that within the space of local, gradient-based learning rules, both agents using LOLA is a stable equilibrium. This is further supported by the good performance of the LOLA agent in a round-robin tournament, where it successfully manages to shape the learning of a number of multiagent learning algorithms from the literature. This leads to the overall highest average return on the IPD and good performance on iterated matching pennies (IMP).
We also present a version of LOLA adapted to the deep RL setting using likelihood ratio policy gradients, making LOLA scalable to settings with high-dimensional input and parameter spaces.
We evaluate the policy gradient version of LOLA on the IPD and iterated matching pennies (IMP), a simplified version of rock-paper-scissors. We show that LOLA leads to cooperation with high social welfare, while independent policy gradients, a standard multiagent RL approach, do not. The policy gradient finding is consistent with prior work, e.g., Sandholm and Crites (1996). We also extend LOLA to settings where the opponent policy is unknown and needs to be inferred from state-action trajectories of the opponent’s behaviour.
Finally, we apply LOLA with and without opponent modelling to a gridworld task with an embedded underlying social dilemma. This task has temporally extended actions and therefore requires high-dimensional recurrent policies for agents to learn to reciprocate. Again, cooperation emerges in this task when using LOLA, even when the opponent’s policy is unknown and needs to be estimated.
2. Related Work
The study of general-sum games has a long history in game theory and evolution. Many papers address the iterated prisoners’ dilemma (IPD) in particular, including the seminal work on the topic by Axelrod (2006). This work popularised tit-for-tat (TFT), a strategy in which an agent cooperates on the first move and then copies the opponent’s most recent move, as an effective and simple strategy in the IPD.
A number of methods in multiagent RL aim to achieve convergence in self-play and rationality in sequential, general-sum games. Seminal work includes the family of WoLF algorithms (Bowling and Veloso, 2002), which uses different learning rates depending on whether an agent is winning or losing, joint-action learners (JAL), and AWESOME (Conitzer and Sandholm, 2007). Unlike LOLA, these algorithms typically have well-understood convergence behaviour given an appropriate set of constraints. However, none of these algorithms has the ability to shape the learning behaviour of the opponents in order to obtain higher payouts at convergence. AWESOME aims to learn the equilibria of the one-shot game, a subset of the equilibria of the iterated game.
Detailed studies have analysed the dynamics of JALs in general-sum settings: this includes work by Uther and Veloso (1997) in zero-sum settings and by Claus and Boutilier (1998) in cooperative settings. Sandholm and Crites (1996) study the dynamics of independent Q-learning in the IPD under a range of different exploration schedules and function approximators. Wunder et al. (2010) and Zinkevich et al. (2006) explicitly study the convergence dynamics and equilibria of learning in iterated games. Unlike LOLA, these papers do not propose novel learning rules.
Littman (2001) proposes a method that assumes each opponent to be either a friend, i.e., fully cooperative, or a foe, i.e., fully adversarial. In contrast, LOLA considers general-sum games.
By comparing a set of models with different history lengths, Chakraborty and Stone (2014) propose a method to learn a best response to memory-bounded agents with fixed policies. In contrast, LOLA assumes learning agents, which effectively correspond to unbounded-memory policies.
Brafman and Tennenholtz (2003) introduce the solution concept of an efficient learning equilibrium (ELE), in which neither side is encouraged to deviate from the learning rule. The algorithm they propose applies to settings where all Nash equilibria can be computed and enumerated; LOLA does not require either of these assumptions.
By contrast, most work in deep multiagent RL focuses on fully cooperative or zero-sum settings, in which learning progress is easier to evaluate (Omidshafiei et al., 2017; Foerster et al., 2018, 2017), and on emergent communication in particular (Das et al., 2017; Mordatch and Abbeel, 2017; Lazaridou et al., 2016; Foerster et al., 2016; Sukhbaatar et al., 2016). As an exception, Leibo et al. (2017) analyse the outcomes of independent learning in general-sum settings using feedforward neural networks as policies.
Lowe et al. (2017) propose a centralised actor-critic architecture for efficient training in these general-sum environments. However, none of these methods explicitly reasons about the learning behaviour of other agents. Lanctot et al. (2017) generalise the ideas of game-theoretic best-response-style algorithms, such as NFSP (Heinrich and Silver, 2016). In contrast to LOLA, these best-response algorithms assume a given set of opponent policies, rather than attempting to shape the learning of the other agents.
The problem setting and approach of Lerer and Peysakhovich (2017) is closest to ours. They directly generalise tit-for-tat to complex environments using deep RL. The authors explicitly train a fully cooperative and a defecting policy for both agents and then construct a tit-for-tat policy that switches between these two in order to encourage the opponent to cooperate. Similar in spirit to this work, Munoz de Cote and Littman (2008) propose a Nash equilibrium algorithm for repeated stochastic games that explicitly attempts to find the egalitarian equilibrium by switching between competitive and cooperative strategies. A similar idea underlies M-Qubed (Crandall and Goodrich, 2011), which balances best-response, cautious, and optimistic learning biases.
Reciprocity and cooperation are not emergent properties of the learning rules in these algorithms but rather directly coded into the algorithm via heuristics, limiting their generality.
Our work also relates to opponent modelling, such as fictitious play (Brown, 1951) and action-sequence prediction (Mealing and Shapiro, 2015; Rabinowitz et al., 2018). Mealing and Shapiro (2013) also propose a method that finds a policy based on predicting the future action of a memory-bounded opponent. Furthermore, Hernandez-Leal and Kaisers (2017) directly model the distribution over opponents. While these methods model the opponent strategy, or a distribution thereof, and use lookahead to find optimal response policies, they do not address the learning dynamics of opponents. For further details we refer the reader to excellent reviews on the subject (Hernandez-Leal et al., 2017; Busoniu et al., 2008).
By contrast, Zhang and Lesser (2010) carry out policy prediction under one-step learning dynamics. However, the opponents’ policy updates are assumed to be given and only used to learn a best response to the anticipated updated parameters. By contrast, a LOLA agent directly shapes the policy updates of all opponents in order to maximise its own reward. Differentiating through the opponent’s learning step, which is unique to LOLA, is crucial for the emergence of tit-for-tat and reciprocity. To the best of our knowledge, LOLA is the first method that aims to shape the learning of other agents in a multiagent RL setting.
With LOLA, each agent differentiates through the opponents’ policy update. Similar ideas were proposed by Metz et al. (2016), whose training method for generative adversarial networks differentiates through multiple update steps of the opponent. Their method relies on an end-to-end differentiable loss function and thus does not work in the general RL setting. However, the overall results are similar: differentiating through the opponent’s learning process stabilises the training outcome in a zero-sum setting.
Outside of purely computational studies, the emergence of cooperation and defection in RL settings has also been studied and compared to human data (Kleiman-Weiner et al., 2016).
3. Notation
Our work assumes a multiagent task that is commonly described as a stochastic game, specified by a tuple $\langle S, U, P, r, n, \gamma \rangle$. Here $n$ agents, $a \in \{1, \dots, n\}$, choose actions, $u^a \in U$, and $s \in S$ is the state of the environment. The joint action $\mathbf{u} = (u^1, \dots, u^n)$ leads to a state transition based on the transition function $P(s' \mid s, \mathbf{u})$. The reward functions $r^a(s, \mathbf{u})$ specify the reward for each agent, and $\gamma \in [0, 1)$ is the discount factor.
We further define the discounted future return from time $t$ onward as $R^a_t = \sum_{l=0}^{\infty} \gamma^l r^a_{t+l}$ for each agent $a$. In a naive approach, each agent maximises its total discounted return in expectation separately. This can be done with policy gradient methods (Sutton et al., 1999) such as REINFORCE (Williams, 1992). Policy gradient methods update an agent’s policy, parameterised by $\theta^a$, by performing gradient ascent on an estimate of the expected discounted total reward $\mathbb{E}[R^a_0]$.
By convention, bold lowercase letters denote column vectors.
4. Methods
In this section, we review the naive learner’s strategy and introduce the LOLA learning rule. We first derive the update rules when agents have access to exact gradients and Hessians of their expected discounted future return in Sections 4.1 and 4.2. In Section 4.3, we derive the learning rules purely based on policy gradients, thus removing access to exact gradients and Hessians. This renders LOLA suitable for deep RL. However, we still assume agents have access to opponents’ policy parameters in policy-gradient-based LOLA. Next, in Section 4.4, we incorporate opponent modeling into the LOLA learning rule, such that each LOLA agent only infers the opponent’s policy parameters from experience. Finally, we discuss higher-order LOLA in Section 4.5.
For simplicity, here we assume the number of agents is 2 and display the update rules for agent 1 only. The same derivation holds for arbitrary numbers of agents.
4.1. Naive Learner
Suppose each agent’s policy $\pi^a$ is parameterised by $\theta^a$ and $V^a(\theta^1, \theta^2)$ is the expected total discounted return for agent $a$ as a function of both agents’ policy parameters $(\theta^1, \theta^2)$. A naive learner (NL) iteratively optimises for its own expected total discounted return, such that at the $i$th iteration, $\theta^1_i$ is updated to $\theta^1_{i+1}$ according to $\theta^1_{i+1} = \arg\max_{\theta^1} V^1(\theta^1, \theta^2_i)$.
In the reinforcement learning setting, agents do not have access to $V^1$ over all parameter values. Instead, we assume that agents only have access to the function values and gradients at $(\theta^1_i, \theta^2_i)$. Using this information, the naive learners apply the gradient ascent update rule:

\theta^1_{i+1} = \theta^1_i + \delta \, \nabla_{\theta^1_i} V^1(\theta^1_i, \theta^2_i),   (4.1)

where $\delta$ is the step size.
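As a concrete illustration of (4.1), the following sketch implements a simultaneous naive-learner step, with finite-difference gradients standing in for the exact gradients; the helper names and interfaces here are our own illustration, not part of the original method.

```python
import numpy as np

def num_grad(f, x, eps=1e-5):
    """Central-difference gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for k in range(x.size):
        e = np.zeros_like(x)
        e[k] = eps
        g[k] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

def naive_step(V1, V2, th1, th2, delta=0.1):
    """One simultaneous naive-learner update (eq. 4.1) for both agents:
    each ascends the gradient of its own value function, treating the
    other agent's parameters as a static part of the environment."""
    new1 = th1 + delta * num_grad(lambda t: V1(t, th2), th1)
    new2 = th2 + delta * num_grad(lambda t: V2(th1, t), th2)
    return new1, new2
```

On any pair of smooth value functions, iterating `naive_step` performs independent gradient ascent for both agents, which is exactly the baseline behaviour LOLA improves upon.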
4.2. Learning with Opponent-Learning Awareness
A LOLA learner optimises its return under one-step lookahead of opponent learning: instead of optimising the expected return under the current parameters, $V^1(\theta^1_i, \theta^2_i)$, a LOLA agent optimises $V^1(\theta^1_i, \theta^2_i + \Delta\theta^2)$, which is the expected return after the opponent updates its policy with one naive learning step, $\Delta\theta^2$. Going forward we drop the subscript $i$ for clarity. Assuming a small $\Delta\theta^2$, a first-order Taylor expansion results in:

V^1(\theta^1, \theta^2 + \Delta\theta^2) \approx V^1(\theta^1, \theta^2) + (\Delta\theta^2)^\top \nabla_{\theta^2} V^1(\theta^1, \theta^2).   (4.2)
The LOLA objective (4.2) differs from prior work, e.g., Zhang and Lesser (2010), that predicts the opponent’s policy parameter update and learns a best response. LOLA learners attempt to actively influence the opponent’s future policy update, and explicitly differentiate through $\Delta\theta^2$ with respect to $\theta^1$. Since LOLA focuses on this shaping of the learning direction of the opponent, the dependency of $\nabla_{\theta^2} V^1(\theta^1, \theta^2)$ on $\theta^1$ is dropped during the backward pass. Investigation of how differentiating through this term would affect the learning outcomes is left for future work.
By substituting the opponent’s naive learning step,

\Delta\theta^2 = \eta \, \nabla_{\theta^2} V^2(\theta^1, \theta^2),   (4.3)

into (4.2) and taking the derivative of (4.2) with respect to $\theta^1$, we obtain our LOLA learning rule, $\theta^1_{i+1} = \theta^1_i + f^1_{\text{lola}}$, which includes a second-order correction term:

f^1_{\text{lola}} = \delta \, \nabla_{\theta^1} V^1(\theta^1, \theta^2) + \delta\eta \, \big(\nabla_{\theta^2} V^1(\theta^1, \theta^2)\big)^\top \nabla_{\theta^1} \nabla_{\theta^2} V^2(\theta^1, \theta^2),   (4.4)

where $\delta$ and $\eta$ are the step sizes for the first- and second-order updates respectively. Exact LOLA and NL agents (LOLA-Ex and NL-Ex) have access to the gradients and Hessians of $V^a(\theta^1, \theta^2)$ at the current policy parameters and can evaluate (4.1) and (4.4) exactly.
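The exact LOLA update (4.4) can likewise be sketched with finite differences. This scalar-parameter version (the function name and toy value functions are our own, for illustration only) makes the second-order shaping term explicit:

```python
def lola_step(V1, V2, th1, th2, delta=0.1, eta=0.1, eps=1e-4):
    """One exact-LOLA update for agent 1 (eq. 4.4), via finite differences.
    Scalar policy parameters are assumed for clarity; V1, V2 map
    (th1, th2) -> value for agents 1 and 2 respectively."""
    # grad_{th1} V1 and grad_{th2} V1 (central differences)
    dV1_d1 = (V1(th1 + eps, th2) - V1(th1 - eps, th2)) / (2 * eps)
    dV1_d2 = (V1(th1, th2 + eps) - V1(th1, th2 - eps)) / (2 * eps)
    # cross derivative grad_{th1} grad_{th2} V2 (finite-difference stencil)
    cross = (V2(th1 + eps, th2 + eps) - V2(th1 + eps, th2 - eps)
             - V2(th1 - eps, th2 + eps) + V2(th1 - eps, th2 - eps)) / (4 * eps ** 2)
    # first-order ascent term plus the second-order shaping correction
    return th1 + delta * dV1_d1 + delta * eta * dV1_d2 * cross
```

The second summand is exactly the shaping term: it pushes $\theta^1$ in directions that alter the opponent's naive gradient step in agent 1's favour.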
4.3. Learning via Policy Gradient
When agents do not have access to exact gradients or Hessians, we derive the update rules based on policy gradient approximations of the derivatives in (4.1) and (4.4). Denote an episode of horizon $T$ as $\tau = (s_0, u^1_0, u^2_0, r^1_0, r^2_0, \dots, s_{T+1})$ and its corresponding discounted return for agent $a$ at timestep $t$ as $R^a_t(\tau) = \sum_{l=t}^{T} \gamma^{l-t} r^a_l$. The expected episodic returns conditioned on the agents’ policies $(\pi^1, \pi^2)$, $\mathbb{E}R^1_0(\tau)$ and $\mathbb{E}R^2_0(\tau)$, approximate $V^1$ and $V^2$ respectively, as do the corresponding gradients and Hessians.
The gradient of $\mathbb{E}R^1_0(\tau)$ follows from the policy gradient derivation:

\nabla_{\theta^1} \mathbb{E}R^1_0(\tau) = \mathbb{E}\Big[ \textstyle\sum_{t=0}^{T} \nabla_{\theta^1} \log \pi^1(u^1_t \mid s_t) \, \gamma^t \big( R^1_t(\tau) - b(s_t) \big) \Big],

where $b(s_t)$ is a baseline for variance reduction. Then the update rule for the policy-gradient-based naive learner (NL-PG) is

f^1_{\text{nl,pg}} = \delta \, \nabla_{\theta^1} \mathbb{E}R^1_0(\tau).   (4.5)
For the LOLA update, we derive the following estimator of the second-order term in (4.4) based on policy gradients. The derivation (omitted) closely resembles the standard proof of the policy gradient theorem, exploiting the fact that the agents sample actions independently. We further note that this second-order term is exact in expectation:

\nabla_{\theta^1} \nabla_{\theta^2} \mathbb{E}R^2_0(\tau) = \mathbb{E}\Big[ R^2_0(\tau) \, \nabla_{\theta^1} \log \pi^1(\tau) \, \big( \nabla_{\theta^2} \log \pi^2(\tau) \big)^\top \Big],   (4.6)

where $\nabla_{\theta^a} \log \pi^a(\tau) = \sum_{t=0}^{T} \nabla_{\theta^a} \log \pi^a(u^a_t \mid s_t)$.
The complete LOLA update using policy gradients (LOLA-PG) is

f^1_{\text{lola,pg}} = \delta \, \nabla_{\theta^1} \mathbb{E}R^1_0(\tau) + \delta\eta \, \big( \nabla_{\theta^2} \mathbb{E}R^1_0(\tau) \big)^\top \nabla_{\theta^1} \nabla_{\theta^2} \mathbb{E}R^2_0(\tau).   (4.7)
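The second-order estimator (4.6) can be sanity-checked on a one-shot, two-action game where each policy is a Bernoulli with a sigmoid-parameterised probability. The Monte-Carlo sketch below (a simplification with our own naming, not the paper's implementation) is unbiased for the cross derivative of agent 2's expected payoff:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sampled_cross_term(th1, th2, r2, n=200_000):
    """Monte-Carlo estimate of the second-order term (4.6) for a one-shot
    two-action game: E[ R2 * dlog pi1/dth1 * dlog pi2/dth2 ].
    r2[u1, u2] is agent 2's payoff; each agent plays action 1 with
    probability sigmoid(th)."""
    p1, p2 = sigmoid(th1), sigmoid(th2)
    u1 = rng.random(n) < p1                 # True = action 1
    u2 = rng.random(n) < p2
    # score function of Bernoulli(sigmoid(th)) is (u - p)
    s1 = u1.astype(float) - p1
    s2 = u2.astype(float) - p2
    R2 = r2[u1.astype(int), u2.astype(int)]
    return np.mean(R2 * s1 * s2)
```

Because the agents sample independently, the expectation factorises into the product of the two per-agent probability derivatives, which is exactly the analytic cross derivative of $\mathbb{E}R^2$.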
4.4. LOLA with Opponent Modeling
Both versions (4.4) and (4.7) of LOLA learning assume that each agent has access to the exact parameters of the opponent. However, in adversarial settings the opponent’s parameters are typically obscured and have to be inferred from the opponent’s state-action trajectories. Our proposed opponent modeling is similar to behavioral cloning (Ross et al., 2011; Bojarski et al., 2016). Instead of accessing agent 2’s true policy parameters $\theta^2$, agent 1 models the opponent’s behavior with $\hat\theta^2$, where $\hat\theta^2$ is estimated from agent 2’s trajectories using maximum likelihood:

\hat\theta^2 = \arg\max_{\theta^2} \textstyle\sum_t \log \pi_{\theta^2}(u^2_t \mid s_t).   (4.8)
Then, $\hat\theta^2$ replaces $\theta^2$ in the LOLA update rule, both for the exact version (4.4) using the value function and for the gradient-based approximation (4.7). We compare the performance of policy-gradient-based LOLA agents (4.7) with and without opponent modeling in our experiments. In particular, we can obtain $\hat\theta^2$ from the past action-observation history. In our experiments we incrementally fit $\hat\theta^2$ to the most recent data in order to address the non-stationarity of the opponent.
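For a memoryless Bernoulli opponent, the maximum-likelihood fit (4.8) reduces to gradient ascent on the log-likelihood of the observed actions. A minimal sketch (the function name and sigmoid parameterisation are our own illustrative choices):

```python
import numpy as np

def fit_opponent(actions, lr=0.5, steps=500):
    """Maximum-likelihood estimate (eq. 4.8) of a memoryless opponent policy
    pi(u=1) = sigmoid(theta_hat), fitted to observed binary actions by
    gradient ascent on the average log-likelihood."""
    theta = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-theta))
        # gradient of mean_t log pi(u_t) w.r.t. theta is mean(u - p)
        theta += lr * np.mean(actions - p)
    return theta
```

In this degenerate case the fixed point is simply the empirical action frequency; with state-conditional policies the same ascent is run on $\sum_t \log \pi_{\theta^2}(u^2_t \mid s_t)$ instead.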
4.5. HigherOrder LOLA
By substituting the naive learning rule (4.3) into the LOLA objective (4.2), the LOLA learning rule so far assumes that the opponent is a naive learner; we call this setting first-order LOLA. However, we can also consider a higher-order LOLA agent that assumes the opponent applies a first-order LOLA learning rule, thus replacing (4.3). This leads to third-order derivatives in the learning rule. While the third-order terms are typically difficult to compute with policy gradient methods due to high variance, they are tractable when the exact value function is available. We examine the benefits of higher-order LOLA in our experiments.
5. Experimental Setup
In this section, we summarise the settings in which we compare the learning behavior of NL and LOLA agents. The first setting (Sec. 5.1) consists of two classic infinitely iterated games, the iterated prisoners’ dilemma (IPD) (Luce and Raiffa, 1957) and iterated matching pennies (IMP) (Lee and Louis, 1967). Each round in these two environments requires a single action from each agent. We can obtain the discounted future return of each player given both players’ policies, which leads to exact policy updates for NL and LOLA agents. The second setting (Sec. 5.2) is the Coin Game, a more difficult two-player environment, where each round requires the agents to take a sequence of actions and the exact discounted future reward cannot be calculated. The policy of each player is parameterised with a deep recurrent neural network.
In the policy gradient experiments with LOLA, we assume offline learning, i.e., agents play many (batch-size) parallel episodes using their latest policies. Policies remain unchanged within each episode, with learning happening between episodes. One setting in which this kind of offline learning naturally arises is when policies are trained on real-world data. For example, in the case of autonomous cars, the data from a fleet of cars is used to periodically train and dispatch new policies.
5.1. Iterated Games
We first review the two iterated games, the IPD and IMP, and explain how we can model iterated games as a memory-1 two-agent MDP.
C  D  

C  (−1, −1)  (−3, 0) 
D  (0, −3)  (−2, −2) 
Table 1 shows the per-step payoff matrix of the prisoners’ dilemma. In a single-shot prisoners’ dilemma, there is only one Nash equilibrium (Fudenberg and Tirole, 1991), in which both agents defect. In the infinitely iterated prisoners’ dilemma, the folk theorem (Roger, 1991) shows that there are infinitely many Nash equilibria. Two notable ones are the always-defect strategy (DD) and tit-for-tat (TFT). In TFT each agent starts out with cooperation and then repeats the previous action of the opponent. The average returns per step in self-play are −1 and −2 for TFT and DD respectively.
Matching pennies (Gibbons, 1992) is a zero-sum game, with per-step payouts shown in Table 2. This game has only a single Nash equilibrium, a mixed strategy in which both players play heads and tails with equal probability.
Head  Tail  

Head  (+1, −1)  (−1, +1) 
Tail  (−1, +1)  (+1, −1) 
Agents in both the IPD and IMP can condition their actions on past history. Agents in an iterated game endowed with a memory of length $K$ act based on the results of the last $K$ rounds. Press and Dyson (2012) proved that a longer memory provides no advantage against a memory-1 opponent, so a memory-1 strategy can effectively force the iterated game to be played as a memory-1 game. Thus, we consider memory-1 iterated games in our work.
We can model the memory-1 IPD and IMP as a two-agent MDP, where the state at time $t = 0$ is empty, denoted $s_0$, and the state at time $t \geq 1$ is both agents’ actions from $t - 1$: $s_t = (u^1_{t-1}, u^2_{t-1})$.
Each agent’s policy is fully specified by five probabilities. For agent 1 in the case of the IPD, they are the probability of cooperation at game start, $\pi^1(C \mid s_0)$, and the cooperation probabilities in the four memory states: $\pi^1(C \mid CC)$, $\pi^1(C \mid CD)$, $\pi^1(C \mid DC)$, and $\pi^1(C \mid DD)$. By analytically solving the multiagent MDP we can derive each agent’s future discounted reward as an analytical function of the agents’ policies and calculate the exact policy update for both NL-Ex and LOLA-Ex agents.
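The analytical solution can be sketched as follows: the five cooperation probabilities per agent induce a Markov chain over the four joint-action states, and the discounted return follows in closed form from the matrix inverse $(I - \gamma P)^{-1}$. The function below is our own illustration of this computation, with γ = 0.96 used only as an example value:

```python
import numpy as np

GAMMA = 0.96  # example discount factor
# per-step IPD payoffs in states (CC, CD, DC, DD); first letter = agent 1's action
R1 = np.array([-1.0, -3.0, 0.0, -2.0])
R2 = np.array([-1.0, 0.0, -3.0, -2.0])

def exact_values(policy1, policy2):
    """Exact discounted returns in the memory-1 IPD.
    Each policy is (p0, pCC, pCD, pDC, pDD): cooperation probabilities at the
    start and in the four memory states, ordered by (own previous action,
    opponent's previous action)."""
    a = np.asarray(policy1, dtype=float)
    b = np.asarray(policy2, dtype=float)
    def joint(p, q):  # distribution over (CC, CD, DC, DD) from coop probs p, q
        return np.array([p * q, p * (1 - q), (1 - p) * q, (1 - p) * (1 - q)])
    s0 = joint(a[0], b[0])                         # initial state distribution
    # agent 2 conditions on its own perspective, so its memory swaps CD and DC
    b_cond = b[[1, 3, 2, 4]]
    P = np.stack([joint(a[1 + s], b_cond[s]) for s in range(4)])
    M = s0 @ np.linalg.inv(np.eye(4) - GAMMA * P)  # discounted state visitation
    return M @ R1, M @ R2
```

With both agents playing TFT, the normalised discounted return $(1 - \gamma)V$ comes out at −1 per step, and with both always defecting at −2, matching the self-play values quoted for Table 1.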
We also organise a round-robin tournament in which we compare LOLA-Ex to a number of state-of-the-art multiagent learning algorithms, both on the IPD and IMP.
IPD  IMP  
TFT %  R (std)  Nash %  R (std)  

NL-Ex  20.8  −1.98 (0.14)  0.0  0 (0.37) 
LOLA-Ex  81.0  −1.06 (0.19)  98.8  0 (0.02) 
NL-PG  20.0  −1.98 (0.00)  13.2  0 (0.19) 
LOLA-PG  66.4  −1.17 (0.34)  93.2  0 (0.06) 
Table 3: We summarise results for NL vs. NL and LOLA vs. LOLA settings with either exact gradient evaluation (Ex) or policy gradient approximation (PG). Shown are the probabilities of agents playing TFT (for the IPD) and the Nash strategy (for IMP), as well as the average reward per step, R, and its standard deviation (std) at the end of training, over 50 training runs.
5.2. Coin Game
Next, we study LOLA in a more highdimensional setting called Coin Game. This is a sequential game and the agents’ policies are parametrised as deep neural networks. Coin Game was first proposed by Lerer and Peysakhovich (2017) as a higher dimensional alternative to the IPD with multistep actions. As shown in Figure 3, in this setting two agents, ‘red’ and ‘blue’, are tasked with collecting coins.
The coins are either blue or red, and appear at random positions on the gridworld. A new coin with random colour and random position appears after the previous one is picked up. Agents pick up coins by moving onto the position where the coin is located. While an agent receives a point for picking up a coin of any colour, whenever an agent picks up a coin of the other agent’s colour, the other agent loses 2 points.
As a result, if both agents greedily pick up any coin available, they receive 0 points in expectation. Since the agents’ policies are parameterised as a recurrent neural network, one cannot obtain the future discounted reward as a function of both agents’ policies in closed form. Policy gradientbased learning is applied for both NL and LOLA agents in our experiments. We further include experiments of LOLA with opponent modelling (LOLAOM) in order to examine the behavior of LOLA agents without access to the opponent’s policy parameters.
5.3. Training Details
In the policy-gradient-based NL and LOLA settings, we train agents with an actor-critic method (Sutton and Barto, 1998), parameterising each agent with a policy (actor) and a critic for variance reduction during policy updates.
During training, we use gradient descent with a step size of 0.005 for the actor, 1 for the critic, and a batch size of 4000 for rollouts. The discount rate is set to a high value for the prisoners’ dilemma and Coin Game, chosen to allow for the long time horizons that are known to be required for cooperation. We found that a lower discount rate produced more stable learning on IMP.
For Coin Game, the agent’s policy architecture is a recurrent neural network with 2 convolutional layers (ReLU activations) for input processing. The input is presented as a 4-channel grid, with 2 channels encoding the positions of the 2 agents and 2 channels encoding the red and blue coins respectively.
For the tournament, we use baseline algorithms and the corresponding hyperparameter values as provided in the literature (Bowling and Veloso, 2002): naive Q-learner (NL-Q), joint-action Q-learner (JAL-Q), policy hill-climbing (PHC), and “Win or Learn Fast” (WoLF). The tournament is played in a round-robin fashion between all pairs of agents for 1000 episodes of 200 steps each.

Figure 4 (caption): Normalised returns of a round-robin tournament on the IPD (left) and IMP (right). LOLA-Ex agents achieve the best performance in the IPD and are within error bars for IMP. Shading indicates a 95% confidence interval of the error of the mean.

6. Results
In this section, we summarise the experimental results. We denote LOLA and naive agents with exact policy updates as LOLA-Ex and NL-Ex respectively, and LOLA and naive agents with policy gradient updates as LOLA-PG and NL-PG. We aim to answer the following questions:

How do pairs of LOLA-Ex agents behave in iterated games compared with pairs of NL-Ex agents?

Using policy gradient updates instead, how do LOLA-PG agents and NL-PG agents behave?

How do LOLA-Ex agents fare in a round-robin tournament involving a set of multiagent learning algorithms from the literature?

Does the learning of LOLA-PG agents scale to high-dimensional settings where the agents’ policies are parameterised by deep neural networks?

Does LOLA-PG maintain its behavior when access to the exact parameters of the opponent agent is replaced with opponent modeling?

Can LOLA agents be exploited by using higher-order gradients, i.e., does LOLA lead to an arms race of ever higher-order corrections, or is LOLA vs. LOLA stable?
We answer the first three questions in Sec. 6.1, the next two questions in Sec. 6.2, and the last one in Sec. 6.3.
6.1. Iterated Games
We first compare the behaviors of LOLA agents with NL agents, with either exact policy updates or policy gradient updates.
Figures 1a and 1b show the policies of both agents at the end of training under NL-Ex and LOLA-Ex, when the agents have access to exact gradients and Hessians of $V^a$. Here we consider the settings of NL-Ex vs. NL-Ex and LOLA-Ex vs. LOLA-Ex; we study the mixed setting of one LOLA-Ex agent vs. an NL-Ex agent in Section 6.3. Under NL-Ex, the agents learn to defect in all states, indicated by the accumulation of points in the bottom left corner of the plot. However, under LOLA-Ex, in most cases the agents learn TFT: each agent cooperates in the starting state $s_0$ and in the states where its opponent cooperated in the previous round. As a result, Figure 1c shows that the normalised discounted reward, defined as $(1 - \gamma) \sum_t \gamma^t r_t$, is close to −1 for LOLA-Ex vs. LOLA-Ex, corresponding to TFT, while NL-Ex vs. NL-Ex results in a normalised discounted reward of −2, corresponding to the fully defective (DD) equilibrium. Figure 1d shows the normalised discounted reward for NL-PG and LOLA-PG, where agents learn via policy gradients. LOLA-PG also demonstrates cooperation, while agents defect under NL-PG.
We conduct the same analysis for IMP in Figure 2. In this game, under naive learning the agents’ strategies fail to converge. In contrast, under LOLA the agents’ policies converge to the only Nash equilibrium, playing heads and tails with equal probability.
Table 3 summarises the numerical results comparing LOLA with NL agents in both the exact and policy gradient settings on the two iterated games. In the IPD, LOLA agents learn policies consistent with TFT with much higher probability than NL agents and achieve correspondingly higher normalised discounted rewards. In IMP, LOLA agents converge to the Nash equilibrium more stably, while NL agents do not. The difference in stability is illustrated by the high variance of the normalised discounted returns for NL agents compared to the low variance under LOLA.
In Figure 4 we show the average normalised return of our LOLA-Ex agent against a set of learning algorithms from the literature. We find that LOLA-Ex receives the highest normalised return in the IPD, indicating that it successfully shapes the learning outcomes of other algorithms in this general-sum setting.
In IMP, LOLA-Ex achieves stable performance close to the middle of the distribution of results.
Figure 5 (caption): The percentage of all picked-up coins that match in colour (left) and the total points obtained per episode (right) for a pair of naive learners using policy gradients (NL-PG), LOLA agents (LOLA-PG), and a pair of LOLA agents with opponent modelling (LOLA-OM). Also shown is the standard error of the mean (shading), based on 10 training runs. While LOLA-PG and LOLA-OM agents learn to cooperate, LOLA-OM is less stable and obtains lower returns than LOLA-PG. Best viewed in colour.
6.2. Coin Game
We summarise our experimental results in the Coin Game environment. To examine the scalability of the LOLA learning rules, we compare NL-PG vs. NL-PG to LOLA-PG vs. LOLA-PG. Figure 5 demonstrates that NL-PG agents collect coins indiscriminately, corresponding to defection. In contrast, LOLA-PG agents learn to pick up coins predominantly of their own colour, showing that the LOLA learning rule leads to cooperation in the Coin Game as well.
Removing the assumption that agents can access the exact parameters of opponents, we examine LOLA agents with opponent modeling (Section 4.4). Figure 5 demonstrates that without access to the opponent’s policy parameters, LOLA agents with opponent modeling still pick up coins of their own colour the majority of the time, though their performance is inferior to that of LOLA-PG agents. We emphasise that with opponent modeling neither agent can recover the exact policy parameters of the opponent, since there is a large amount of redundancy in the neural network parameters. For example, each agent could permute the weights of its fully connected layers. Opponent modeling introduces noise into the estimate of the opponent agent’s policy parameters, thus increasing the variance and bias of the gradients (4.7) during policy updates, which leads to the inferior performance of LOLA-OM vs. LOLA-PG in Figure 5.
6.3. Exploitability of LOLA
In this section we address the exploitability of the LOLA learning rule. We consider the IPD, where one can calculate the exact value function of each agent given the policies. Thus, we can evaluate the higher-order LOLA terms. We pit an NL-Ex or LOLA-Ex agent against NL-Ex, LOLA-Ex, and a second-order LOLA agent. We compare the normalised discounted return of each agent in all settings and address the question of whether there is an arms race to incorporate ever higher orders of LOLA correction terms.
Table 4 shows that a LOLA-Ex learner achieves higher payouts against NL-Ex. Thus, there is an incentive for either agent to switch from naive learning to first-order LOLA. Furthermore, two LOLA-Ex agents playing against each other both receive a higher normalised discounted reward than a LOLA-Ex agent playing against an NL-Ex agent. This makes LOLA a dominant learning rule in the IPD compared to naive learning. We further find that second-order LOLA provides no incremental gains when playing against a LOLA-Ex agent, instead leading to a reduction in payouts for both agents. These experiments were carried out with a discount factor of 0.5. While it is beyond the scope of this work to prove that LOLA vs. LOLA is a dominant learning rule in the space of gradient-based rules, these initial results are encouraging.
NL-Ex  LOLA-Ex  2nd-Order  

NL-Ex  (−1.99, −1.99)  (−1.54, −1.28)  (−1.46, −1.46) 
LOLA-Ex  (−1.28, −1.54)  (−1.04, −1.04)  (−1.14, −1.17) 
7. Conclusions & Future Work
We presented Learning with Opponent-Learning Awareness (LOLA), a learning method for multiagent settings that considers the learning processes of other agents. We show that when both agents have access to exact value functions and apply the LOLA learning rule, cooperation based on tit-for-tat emerges in the infinitely iterated prisoners’ dilemma, while independent naive learners defect. We also find that LOLA leads to stable learning of the Nash equilibrium in IMP. In our round-robin tournament against other multiagent learning algorithms we show that exact LOLA agents achieve the highest average returns on the IPD and respectable performance on IMP. We also derive a policy-gradient-based version of LOLA, applicable to the deep RL setting. Experiments on the IPD and IMP demonstrate learning behavior similar to the setting with exact value functions.
In addition, we scaled the policy gradient-based version of LOLA to the Coin Game, a multi-step game that requires deep recurrent policies. LOLA agents learn to cooperate: they pick up coins of their own colour with high probability, while naive learners pick up coins indiscriminately. We further removed agents’ access to the opponent’s policy parameters and replaced it with opponent modeling. While less reliable, LOLA agents with opponent modeling also learn to cooperate.
We also briefly addressed the exploitability of LOLA agents. Empirical results show that in the IPD both agents are incentivised to use LOLA, while higher-order corrections yield no further gain.
In the future, we would like to continue to address the exploitability of LOLA, when adversarial agents explicitly aim to take advantage of a LOLA learner using global search methods rather than just gradient-based methods. Just as LOLA is a way to exploit a naive learner, there should be means of exploiting LOLA learners in turn, unless LOLA is itself an equilibrium learning strategy.
Acknowledgements
This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement number 637713) and from the National Institutes of Health (grant agreement number R01GM114311). It was also supported by the Oxford-Google DeepMind Graduate Scholarship and a generous equipment grant from NVIDIA. We would like to thank Jascha Sohl-Dickstein, David Balduzzi, Karl Tuyls, Marc Lanctot, Michael Bowling, Ilya Sutskever, Bob McGrew, and Paul Christiano for fruitful discussions. We would like to thank Michael Littman for providing feedback on an early version of the manuscript. We would like to thank our reviewers for constructive and thoughtful feedback.
References
 Axelrod (2006) Robert M Axelrod. 2006. The evolution of cooperation: revised edition. Basic books.
 Bojarski et al. (2016) Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. 2016. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316 (2016).
 Bowling and Veloso (2002) Michael Bowling and Manuela Veloso. 2002. Multiagent learning using a variable learning rate. Artificial Intelligence 136, 2 (2002), 215–250.
 Brafman and Tennenholtz (2003) Ronen I. Brafman and Moshe Tennenholtz. 2003. Efficient Learning Equilibrium. In Advances in Neural Information Processing Systems, Vol. 9. 1635–1643.
 Brown (1951) George W Brown. 1951. Iterative solution of games by fictitious play. (1951).
 Busoniu et al. (2008) Lucian Busoniu, Robert Babuska, and Bart De Schutter. 2008. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 38, 2 (2008).
 Chakraborty and Stone (2014) Doran Chakraborty and Peter Stone. 2014. Multiagent learning in the presence of memory-bounded agents. Autonomous Agents and Multi-Agent Systems 28, 2 (2014), 182–213.
 Claus and Boutilier (1998) Caroline Claus and Craig Boutilier. 1998. The dynamics of reinforcement learning in cooperative multiagent systems. AAAI/IAAI 1998 (1998), 746–752.
 Conitzer and Sandholm (2007) Vincent Conitzer and Tuomas Sandholm. 2007. AWESOME: A general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents. Machine Learning 67, 1-2 (2007), 23–43.
 Crandall and Goodrich (2011) Jacob W Crandall and Michael A Goodrich. 2011. Learning to compete, coordinate, and cooperate in repeated games using reinforcement learning. Machine Learning 82, 3 (2011), 281–314.
 Das et al. (2017) Abhishek Das, Satwik Kottur, José MF Moura, Stefan Lee, and Dhruv Batra. 2017. Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning. arXiv preprint arXiv:1703.06585 (2017).
 Foerster et al. (2016) Jakob Foerster, Yannis M Assael, Nando de Freitas, and Shimon Whiteson. 2016. Learning to communicate with deep multiagent reinforcement learning. In Advances in Neural Information Processing Systems. 2137–2145.
 Foerster et al. (2018) Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. 2018. Counterfactual MultiAgent Policy Gradients. In AAAI.
 Foerster et al. (2017) Jakob Foerster, Nantas Nardelli, Gregory Farquhar, Philip Torr, Pushmeet Kohli, Shimon Whiteson, et al. 2017. Stabilising experience replay for deep multi-agent reinforcement learning. In 34th International Conference on Machine Learning.
 Fudenberg and Tirole (1991) Drew Fudenberg and Jean Tirole. 1991. Game Theory. MIT Press, Cambridge, MA.
 Gibbons (1992) Robert Gibbons. 1992. Game theory for applied economists. Princeton University Press.
 Heinrich and Silver (2016) Johannes Heinrich and David Silver. 2016. Deep reinforcement learning from self-play in imperfect-information games. arXiv preprint arXiv:1603.01121 (2016).
 Hernandez-Leal and Kaisers (2017) Pablo Hernandez-Leal and Michael Kaisers. 2017. Learning against sequential opponents in repeated stochastic games. (2017).
 Hernandez-Leal et al. (2017) Pablo Hernandez-Leal, Michael Kaisers, Tim Baarslag, and Enrique Munoz de Cote. 2017. A Survey of Learning in Multiagent Environments: Dealing with Non-Stationarity. arXiv preprint arXiv:1707.09183 (2017).
 Kleiman-Weiner et al. (2016) Max Kleiman-Weiner, Mark K Ho, Joseph L Austerweil, Michael L Littman, and Joshua B Tenenbaum. 2016. Coordinate to cooperate or compete: abstract goals and joint intentions in social interaction. In COGSCI.
 Lanctot et al. (2017) Marc Lanctot, Vinicius Zambaldi, Audrunas Gruslys, Angeliki Lazaridou, Karl Tuyls, Julien Perolat, David Silver, and Thore Graepel. 2017. A Unified Game-Theoretic Approach to Multiagent Reinforcement Learning. In Advances in Neural Information Processing Systems (NIPS).
 Lazaridou et al. (2016) Angeliki Lazaridou, Alexander Peysakhovich, and Marco Baroni. 2016. Multiagent cooperation and the emergence of (natural) language. arXiv preprint arXiv:1612.07182 (2016).
 Lee and Louis (1967) King Lee and K Louis. 1967. The Application of Decision Theory and Dynamic Programming to Adaptive Control Systems. Ph.D. Dissertation.
 Leibo et al. (2017) Joel Z Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel. 2017. Multi-agent Reinforcement Learning in Sequential Social Dilemmas. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems. International Foundation for Autonomous Agents and Multiagent Systems, 464–473.
 Lerer and Peysakhovich (2017) Adam Lerer and Alexander Peysakhovich. 2017. Maintaining cooperation in complex social dilemmas using deep reinforcement learning. arXiv preprint arXiv:1707.01068 (2017).
 Littman (2001) Michael L Littman. 2001. Friend-or-foe Q-learning in general-sum games. In ICML, Vol. 1. 322–328.
 Lowe et al. (2017) Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. 2017. Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. arXiv preprint arXiv:1706.02275 (2017).
 Luce and Raiffa (1957) R Duncan Luce and Howard Raiffa. 1957. Games and Decisions: Introduction and Critical Survey. (1957).
 Mealing and Shapiro (2015) Richard Mealing and Jonathan Shapiro. 2015. Opponent Modelling by Expectation-Maximisation and Sequence Prediction in Simplified Poker. IEEE Transactions on Computational Intelligence and AI in Games (2015).
 Mealing and Shapiro (2013) Richard Mealing and Jonathan L Shapiro. 2013. Opponent Modelling by Sequence Prediction and Lookahead in Two-Player Games. In ICAISC (2). 385–396.
 Metz et al. (2016) Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. 2016. Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163 (2016).
 Mordatch and Abbeel (2017) Igor Mordatch and Pieter Abbeel. 2017. Emergence of Grounded Compositional Language in Multi-Agent Populations. arXiv preprint arXiv:1703.04908 (2017).
 Munoz de Cote and Littman (2008) Enrique Munoz de Cote and Michael L. Littman. 2008. A Polynomial-time Nash Equilibrium Algorithm for Repeated Stochastic Games. In 24th Conference on Uncertainty in Artificial Intelligence (UAI'08). http://uai2008.cs.helsinki.fi/UAI_camera_ready/munoz.pdf
 Omidshafiei et al. (2017) Shayegan Omidshafiei, Jason Pazis, Christopher Amato, Jonathan P How, and John Vian. 2017. Deep Decentralized Multi-task Multi-Agent RL under Partial Observability. arXiv preprint arXiv:1703.06182 (2017).
 Press and Dyson (2012) William H Press and Freeman J Dyson. 2012. Iterated Prisoner’s Dilemma contains strategies that dominate any evolutionary opponent. Proceedings of the National Academy of Sciences 109, 26 (2012), 10409–10413.
 Rabinowitz et al. (2018) Neil C Rabinowitz, Frank Perbet, H Francis Song, Chiyuan Zhang, SM Eslami, and Matthew Botvinick. 2018. Machine Theory of Mind. arXiv preprint arXiv:1802.07740 (2018).
 Myerson (1991) Roger B Myerson. 1991. Game theory: analysis of conflict. (1991).
 Ross et al. (2011) Stéphane Ross, Geoffrey J Gordon, and J Andrew Bagnell. 2011. No-regret reductions for imitation learning and structured prediction. In AISTATS.
 Sandholm and Crites (1996) Tuomas W Sandholm and Robert H Crites. 1996. Multiagent reinforcement learning in the iterated prisoner’s dilemma. Biosystems 37, 1-2 (1996), 147–166.
 Sukhbaatar et al. (2016) Sainbayar Sukhbaatar, Rob Fergus, et al. 2016. Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems. 2244–2252.
 Sutton and Barto (1998) Richard S Sutton and Andrew G Barto. 1998. Reinforcement learning: An introduction. Vol. 1. MIT Press, Cambridge.
 Sutton et al. (1999) Richard S Sutton, David A McAllester, Satinder P Singh, Yishay Mansour, et al. 1999. Policy gradient methods for reinforcement learning with function approximation. In NIPS, Vol. 99. 1057–1063.
 Uther and Veloso (1997) William Uther and Manuela Veloso. 1997. Adversarial reinforcement learning. Technical Report, Carnegie Mellon University. Unpublished.
 Williams (1992) Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8, 3-4 (1992), 229–256.
 Wunder et al. (2010) Michael Wunder, Michael L Littman, and Monica Babes. 2010. Classes of multiagent Q-learning dynamics with epsilon-greedy exploration. In Proceedings of the 27th International Conference on Machine Learning (ICML-10). 1167–1174.
 Zhang and Lesser (2010) Chongjie Zhang and Victor R Lesser. 2010. Multi-Agent Learning with Policy Prediction. In AAAI.
 Zinkevich et al. (2006) Martin Zinkevich, Amy Greenwald, and Michael L Littman. 2006. Cyclic equilibria in Markov games. In Advances in Neural Information Processing Systems. 1641–1648.