1 Introduction
Consider a sequential social dilemma, where individually rational behavior leads to outcomes that are suboptimal for every individual in the group (Hardin, 1968; Ostrom, 1990; Ostrom et al., 1999; Dietz et al., 2003). Current state-of-the-art Multi-Agent Deep Reinforcement Learning (MARL) methods that train agents independently can produce agents that fail to cooperate reliably, even in simple social dilemma settings. This failure to cooperate results in suboptimal individual and group outcomes (Foerster et al., 2018; Lerer and Peysakhovich, 2017; Section 2.2).
To illustrate why it is challenging to evolve cooperation in such dilemmas, we consider the Coin Game (Foerster et al., 2018; Figure 1). Each agent can play either selfishly (pick all coins) or cooperatively (pick only coins of its own color). Regardless of the behavior of the other agent, the individually rational choice for an agent is to play selfishly, either to minimize losses (avoid being exploited) or to maximize gains (exploit the other agent). However, when both agents behave rationally, they try to pick all coins and obtain a lower average long-term reward than if both had played cooperatively. Therefore, when agents cooperate, they are both better off. Training Deep RL agents independently in the Coin Game using state-of-the-art methods leads to mutually harmful selfish behavior (Section 2.2).
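The arithmetic behind this dilemma can be sketched with a back-of-the-envelope calculation. The payoff values below (+1 for picking any coin, −2 to a coin's owner when the other agent takes it) are our assumption based on the standard Coin Game of Foerster et al. (2018), and the helper function is purely illustrative:

```python
PICK_REWARD = 1.0      # reward for picking any coin (assumed LOLA Coin Game payoff)
OWNER_PENALTY = -2.0   # penalty to the coin's owner when the other agent picks it

def expected_reward_per_coin(p_pick, p_opponent_picks_mine):
    """Expected per-coin reward for one agent.
    p_pick: probability this agent picks the next coin.
    p_opponent_picks_mine: probability the opponent picks a coin of this agent's color.
    """
    return p_pick * PICK_REWARD + p_opponent_picks_mine * OWNER_PENALTY

# Both selfish: each agent grabs half the coins; a quarter of all coins are
# this agent's color taken by the opponent (the opponent picks half the coins,
# and half of those are this agent's color).
selfish = expected_reward_per_coin(p_pick=0.5, p_opponent_picks_mine=0.25)

# Both cooperative: each agent picks only its own color (half of all coins),
# so nobody ever takes the other's coin.
cooperative = expected_reward_per_coin(p_pick=0.5, p_opponent_picks_mine=0.0)

print(selfish, cooperative)  # 0.0 0.5
```

Under these assumed payoffs, mutual selfishness nets each agent nothing on average, while mutual cooperation is strictly better for both.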
In this paper, we present a novel MARL algorithm that allows independently learning Deep RL agents to converge to individually and socially desirable cooperative behavior in such social dilemma situations. Our key contributions can be summarized as follows:

We introduce a Status-Quo loss (SQLoss, Section 2.3) and an associated policy gradient-based algorithm to evolve optimal behavior for agents that can act in either a cooperative or a selfish manner, by choosing between a cooperative and a selfish policy. We empirically demonstrate that agents trained with SQLoss evolve optimal behavior in several social dilemma iterated matrix games (Section 4).

We propose GameDistill (Section 2.4), an algorithm that reduces a social dilemma game with visual observations to an iterated matrix game by extracting policies that implement cooperative and selfish behavior. We empirically demonstrate that GameDistill extracts cooperative and selfish policies for the Coin Game (Section 4.2).

We demonstrate that when agents run GameDistill followed by MARL gameplay using SQLoss, they converge to individually as well as socially desirable cooperative behavior in a social dilemma game with visual observations (Section 4.2).
The problem of how independently learning agents evolve cooperative behavior in social dilemmas has been studied through human studies and simulation models (Fudenberg and Maskin, 1986; Green and Porter, 1984; Fudenberg et al., 1994; Kamada and Kominers, 2010; Abreu et al., 1990). A large body of work has examined the evolution of cooperation through reciprocal behavior and indirect reciprocity (Trivers, 1971; Axelrod, 1984; Nowak and Sigmund, 1992, 1993, 1998); through variants of reinforcement using aspiration (Macy and Flache, 2002), attitude (Damer and Gini, 2008), or multi-agent reinforcement learning (Sandholm and Crites, 1996; Wunder et al., 2010); under specific conditions (Banerjee and Sen, 2007), using different learning rates (de Cote et al., 2006) similar to WoLF (Bowling and Veloso, 2002); as well as using embedded emotion (Yu et al., 2015) and social networks (Ohtsuki et al., 2006; Santos and Pacheco, 2006).
However, these approaches do not directly apply to Deep RL agents (Leibo et al., 2017). Recent work in this direction (Kleiman-Weiner et al., 2016; Julien et al., 2017; Peysakhovich and Lerer, 2018) focuses on letting agents learn strategies in multi-agent settings through interactions with other agents. Leibo et al. (2017) define the problem of social dilemmas in the Deep RL framework and analyze the outcomes of a fruit-gathering game (Julien et al., 2017). They vary the abundance of resources and the cost of conflict in the fruit environment to generate degrees of cooperation between agents. Hughes et al. (2018) define an intrinsic reward (inequality aversion) that attempts to reduce the difference in obtained rewards between agents. The agents are designed to have an aversion to both advantageous (guilt) and disadvantageous (unfairness) reward allocation. This hand-crafting of a loss with mutual fairness evolves cooperation, but it leaves the agent vulnerable to exploitation. LOLA (Foerster et al., 2018) uses opponent awareness to achieve high levels of cooperation in the Coin Game and the Iterated Prisoner's Dilemma game. However, the LOLA agent assumes access to the other agent's policy parameters and gradients. This level of access is analogous to having complete access to the other agent's private information and therefore devising a strategy with full knowledge of how they are going to play. Wang et al. (2019) propose an evolutionary Deep RL setup to evolve cooperation. They define an intrinsic reward based on features generated from an agent's past and future rewards, and this reward is shared with other agents. They use evolution to maximize the sum of rewards among the agents and thus evolve cooperative behavior. However, sharing rewards in this indirect way enforces cooperation rather than evolving it through independently learning agents.
In contrast, we introduce a Status-Quo loss (SQLoss) that evolves cooperation between agents without sharing rewards, sharing gradients, or using a communication channel. SQLoss encourages an agent to imagine the consequences of sticking to the status quo. This imagined stickiness ensures that an agent gets a better estimate of a cooperative or selfish policy. Without SQLoss, agents repeatedly switch policies (from cooperative to selfish), obtain short-term rewards (through exploitation), and therefore incorrectly learn that a selfish strategy gives higher rewards in the long term.

To work with social dilemma games that have visual observations, we introduce GameDistill. GameDistill uses self-supervision and clustering to automatically extract a cooperative and a selfish policy from a social dilemma game. The input to GameDistill is a collection of state sequences derived from gameplay between two randomly initialized agents. Each state sequence represents a collection of states and actions (of both agents) leading up to a reward in the environment. GameDistill uses this collection of state sequences to learn two oracles. One oracle represents a cooperative policy, and the other represents a selfish policy. Given a state, an oracle returns an action according to the specific policy. It is important to note that each agent independently runs GameDistill to extract oracles. (For instance, Figure 8 (Appendix A) illustrates the cooperation and defection oracles extracted by the Red agent using GameDistill in the Coin Game.) Figure 2 shows the high-level architecture of our approach.

For a social dilemma game with visual observations, each RL agent runs GameDistill to learn oracles that implement cooperative and selfish behavior.

We train agents (with SQLoss) to play the game such that at any step, an agent can take the action suggested by either the cooperation oracle or the selfish oracle.
We empirically demonstrate in Section 4 that our approach evolves cooperative behavior between independently trained agents.
2 Approach
2.1 Social Dilemmas modeled as Iterated Matrix Games
We adopt the definitions in Foerster et al. (2018). We model social dilemmas as general-sum Markov (simultaneous move) games. A multi-agent Markov game is specified by a tuple (S, N, A, T, r, γ). S denotes the state space of the game and N denotes the number of agents playing the game. At each step of the game, each agent i selects an action a^i ∈ A. a = (a^1, …, a^N) denotes the joint action vector that represents the simultaneous actions of all agents. The joint action a changes the state of the game from s to s′ according to the state transition function T(s′ | s, a). At the end of each step, each agent i gets a reward according to the reward function r^i(s, a). The reward obtained by an agent at each step is a function of the actions played by all agents. For an agent i, the discounted future return from time t is defined as R^i_t = Σ_{l=0}^{∞} γ^l r^i_{t+l}, where γ ∈ [0, 1) is the discount factor. Each agent independently attempts to maximize its expected total discounted reward.

Matrix games are the special case of two-player perfectly observable Markov games (Foerster et al., 2018). Table 1 shows examples of matrix games that represent social dilemmas. Consider the Prisoner's Dilemma matrix game in Table 1(a). Each agent can either cooperate (C) or defect (D). For an agent, playing D is the rational choice, regardless of whether the other agent plays C or D. However, if both agents play rationally (D, D), each receives a lower reward than if both had played C. This fact, that individually rational behavior leads to a suboptimal group (and individual) outcome, highlights the dilemma.
In Infinitely Iterated Matrix Games, agents repeatedly play a particular matrix game against each other. In each iteration of the game, each agent has access to the actions played by both agents in the previous iteration. Therefore, the state input to an RL agent consists of the actions of both agents in the previous iteration of the game. We adopt this state formulation to remain consistent with Foerster et al. (2018). The infinitely iterated variations of the matrix games in Table 1 represent sequential social dilemmas. For ease of representation, we refer to infinitely iterated matrix games as iterated matrix games in subsequent sections.
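The iterated matrix game setup above can be sketched as a minimal environment in which the state handed to each agent is simply the pair of previous actions. The payoff values are an assumption (the commonly used Prisoner's Dilemma numbers), since Table 1 is not reproduced here:

```python
C, D = 0, 1
# Assumed standard IPD payoffs; PAYOFF[(a1, a2)] -> (reward1, reward2).
PAYOFF = {
    (C, C): (-1, -1), (C, D): (-3, 0),
    (D, C): (0, -3), (D, D): (-2, -2),
}

class IteratedMatrixGame:
    """Iterated matrix game where the state is the pair of previous actions."""
    def __init__(self, payoff):
        self.payoff = payoff
        self.state = None  # no previous actions at the start of an episode

    def reset(self):
        self.state = None
        return self.state

    def step(self, a1, a2):
        rewards = self.payoff[(a1, a2)]
        self.state = (a1, a2)  # next state = both agents' previous actions
        return self.state, rewards

env = IteratedMatrixGame(PAYOFF)
env.reset()
state, rewards = env.step(C, D)
print(state, rewards)  # (0, 1) (-3, 0)
```

Agents never observe the number of remaining iterations, matching the infinitely iterated formulation.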
2.2 Learning Policies in Iterated Matrix Games: The Selfish Learner
The standard method to model agents in iterated matrix games is to model each agent as a Deep RL agent that independently attempts to maximize its expected total discounted reward. Several approaches model agents in this way using policy gradient-based methods (Sutton et al., 2000; Williams, 1992). Policy gradient methods update an agent's policy, parameterized by θ, by performing gradient ascent on the expected total discounted reward. Formally, let π(a | s; θ) denote the parameterized version of an agent's policy and V^1(θ^1, θ^2) denote the total expected discounted reward for agent 1. Here, V^1 is a function of the policy parameters of both agents. In the i-th iteration of the game, each agent updates θ^1_i to θ^1_{i+1}, such that it maximizes its total expected discounted reward. θ^1_{i+1} is computed as follows:
θ^1_{i+1} = argmax_{θ^1} V^1(θ^1, θ^2_i)    (1)
For agents trained using reinforcement learning, the gradient ascent rule to update θ^1_{i+1} is:
θ^1_{i+1} = θ^1_i + α ∇_{θ^1} V^1(θ^1_i, θ^2_i)    (2)
where α is the step size of the updates.
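A minimal numeric sketch of this gradient-ascent rule, using a tabular softmax policy for a single state (all names here are illustrative, not from the paper's code):

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

alpha = 0.1  # step size of the updates

def reinforce_update(theta, action, discounted_return, baseline=0.0):
    """One REINFORCE step: theta <- theta + alpha * grad log pi(a) * (R - b)."""
    probs = softmax(theta)
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0  # d/dtheta of log softmax(theta)[action]
    return theta + alpha * grad_log_pi * (discounted_return - baseline)

# A positive return after playing action C (index 0) should raise pi(C).
theta = np.zeros(2)
before = softmax(theta)[0]
theta = reinforce_update(theta, action=0, discounted_return=1.0)
after = softmax(theta)[0]
assert after > before
```

Each Selfish Learner applies this update to its own parameters only, treating the other agent as part of the environment.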
In the Iterated Prisoner's Dilemma (IPD) game, agents trained with the policy gradient update method converge to a suboptimal mutual defection equilibrium (Figure 4; Lerer and Peysakhovich, 2017). This suboptimal equilibrium attained by Selfish Learners motivates us to explore alternative methods that could lead to a desirable cooperative equilibrium. We denote an agent trained using policy gradient updates as a Selfish Learner (SL).
2.3 Learning Policies in Iterated Matrix Games: The Status-Quo Aware Learner (SQLearner)
2.3.1 SQLoss: Intuition
Why do independent, selfish learners converge to mutually harmful behavior in the IPD? To understand this, consider the payoff matrix for a single iteration of the IPD in Table 1(a). In each iteration, an agent can play either C or D. Mutual defection (D, D) is worse for each agent than mutual cooperation (C, C). However, one-sided exploitation (D, C) is better than mutual cooperation for the exploiter and far worse for the exploited. Therefore, as long as an agent perceives the possibility of exploitation ((D, C) or (C, D)), it is drawn to defect, both to maximize reward (through exploitation) and to minimize loss (through avoiding being exploited). To increase the likelihood of cooperation, it is important to reduce instances of exploitation between agents. We posit that if agents mostly either mutually cooperate (C, C) or mutually defect (D, D), then they will learn to prefer (C, C) and achieve a socially desirable cooperative equilibrium.
Motivated by this idea, we introduce a status-quo loss (SQLoss) for each agent, derived from the idea of imaginary gameplay, as depicted in Figure 3. Intuitively, the loss encourages an agent to imagine an episode where the status quo (the current situation) is repeated for a number of steps. This imagined episode causes the exploited agent (in (C, D)) to perceive a continued risk of exploitation and, therefore, to quickly move to (D, D). Hence, for an agent, the short-term gain from exploitation is overcome by the long-term loss from mutual defection (D, D). Therefore, agents move towards either mutual cooperation (C, C) or mutual defection (D, D). With exploitation (and subsequently, the fear of being exploited) out of the picture, agents move towards mutual cooperation.
2.3.2 SQLoss: Formulation
We describe below the formulation of SQLoss with respect to agent 1; the formulation for agent 2 is identical. Let τ_t = (s_t, a^1_t, a^2_t, r^1_t, …) denote the collection of an agent's experiences after t time steps. Let R^1_t(τ) denote the discounted future return for agent 1 starting at s_t in actual gameplay. Let τ̂ denote the collection of an agent's imagined experiences. For a state s_t, an agent imagines an episode by starting at s_t and repeating the previous joint action for κ steps. This is equivalent to imagining a κ-step repetition of already played actions. We sample κ from a discrete uniform distribution U(1, z), where z is a hyperparameter. To simplify notation, let φ_t(s_t, κ) denote the ordered set of states, actions, and rewards starting at time t and repeated κ times for imagined gameplay. Let R̂^1_t(τ̂) denote the discounted future return starting at s_t in imagined status-quo gameplay:

R^1_t(τ) = Σ_{l=0}^{∞} γ^l r^1_{t+l}    (3)

R̂^1_t(τ̂) = Σ_{l=0}^{κ−1} γ^l r^1_{t−1} + γ^κ R^1_t(τ)    (4)

V^1(θ^1, θ^2) ≈ E[R^1_t(τ)],  V̂^1(θ^1, θ^2) ≈ E[R̂^1_t(τ̂)]    (5)
V^1 and V̂^1 are the expected actual and imagined returns, conditioned on both agents' policies (π^1, π^2). For agent 1, the regular gradient and the status-quo gradient, ∇_{θ^1} V^1 and ∇_{θ^1} V̂^1, can be derived from the policy gradient formulation as:
∇_{θ^1} V^1 = E[(R^1_t(τ) − b(s_t)) ∇_{θ^1} log π^1(a^1_t | s_t)]    (6)
∇_{θ^1} V̂^1 = E[(R̂^1_t(τ̂) − b(s_t)) ∇_{θ^1} log π^1(a^1_{t−1} | s_t)]    (7)
where b(s_t) is a baseline for variance reduction.
Then the update rule for the policy gradient-based Status-Quo Learner (SQLPG) is:
θ^1_{i+1} = θ^1_i + α ∇_{θ^1} V^1 + β ∇_{θ^1} V̂^1    (8)
where α and β denote the loss scaling factors for the REINFORCE and the imagined gameplay terms, respectively.
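Our reading of the imagined status-quo return can be sketched as follows. The composition of repeated and actual rewards is an assumption drawn from the formulation above (repeat the previous step's reward for κ imagined steps, then resume the actual return), and `z` is the hyperparameter bounding κ:

```python
import random

def discounted_return(rewards, gamma):
    """Discounted future return of a reward sequence (actual gameplay)."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def imagined_return(prev_reward, actual_return, gamma, z=10, rng=random):
    """Imagined status-quo return (sketch): repeat the previous step's reward
    for kappa ~ U{1, z} imagined steps, then resume the actual return."""
    kappa = rng.randint(1, z)
    repeated = sum(gamma ** l * prev_reward for l in range(kappa))
    return repeated + gamma ** kappa * actual_return, kappa

gamma = 0.96
actual = discounted_return([0.0, 0.0, 1.0], gamma)  # reward arrives later
imagined, kappa = imagined_return(prev_reward=-1.0, actual_return=actual,
                                  gamma=gamma)
# Sticking with an exploited status quo drags the imagined return below the
# actual one, which is exactly what pushes the exploited agent toward (D, D).
assert imagined < actual
```

The gradient in Eq. (7) then weights the log-probability of repeating the previous action by this imagined return.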
2.4 GameDistill: Moving Beyond Iterated Matrix Games
In the previous sections, we focused on evolving cooperative behavior in the iterated matrix game formulation of sequential social dilemmas. In that formulation, an agent is only allowed to either cooperate or defect in each iteration. However, in a social dilemma game with visual observations, it is not clear what set of low-level actions constitutes cooperative or selfish behavior. Therefore, to work on social dilemmas with visual observations, we propose GameDistill, an approach that automatically learns a cooperation and a defection policy by analyzing the behavior of randomly initialized agents. GameDistill learns these policies in the form of cooperation and defection oracles. Given a state, the cooperation oracle suggests an action that represents cooperative behavior. Similarly, the defection oracle suggests an action that represents selfish behavior (Figure 8, Appendix A). When RL agents play the social dilemma game, each agent independently runs GameDistill before playing the game. Once both agents have run GameDistill, they consult either of the two extracted oracles at every step of the game. Therefore, in each step, an agent takes either the action recommended by the cooperation oracle or the action recommended by the defection oracle. In this way, we reduce the visual input game to an iterated matrix game and subsequently apply SQLoss to evolve cooperative behavior. GameDistill (see Figure 2) works as follows.

We initialize RL agents with random weights and play them against each other in the game. In these random gameplay episodes, whenever an agent receives a reward, we store the sequence of the last three states up to the current state.

This collection of state sequences is used to train the state-sequence encoder network. The network takes as input a sequence of states and predicts the rewards of both agents as well as game-dependent environment parameters. For instance, in the Coin Game, the network predicts the rewards of both agents and the color of the picked coin.

Training the network leads to the emergence of feature embeddings for the various state sequences. Subsequently, clustering these embeddings using Agglomerative Clustering (with the number of clusters set to 2) produces cooperation and defection clusters. One of the learned clusters contains state sequences that represent cooperative behavior, and the other contains state sequences that represent defection. For instance, in the Coin Game, a point in the cooperation cluster contains a sequence of states in which an agent picks a coin of its own color.

To train the cooperation and defection oracle networks, we use the collection of state sequences in each cluster. For each sequence of states in a cluster, we train the oracle network to predict the next action, given the current state. For instance, Figure 8 (Appendix A) shows the cooperation and defection oracles extracted by the Red agent using GameDistill in the Coin Game.
Section 3.3 describes the architectural choices for each component of GameDistill.
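Step 3 of the pipeline (clustering encoder embeddings into two behavior clusters) can be sketched with scikit-learn's Agglomerative Clustering on synthetic stand-in embeddings; the real embeddings would come from the trained state-sequence encoder:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)

# Stand-ins for the 100-dimensional embeddings produced by the encoder:
# two well-separated synthetic groups, mimicking cooperative vs. selfish
# episodes (purely illustrative data, not from the paper).
coop_embeddings = rng.normal(loc=0.0, scale=0.2, size=(50, 100))
defect_embeddings = rng.normal(loc=2.0, scale=0.2, size=(50, 100))
embeddings = np.vstack([coop_embeddings, defect_embeddings])

# Split the embeddings into two behavior clusters.
labels = AgglomerativeClustering(n_clusters=2).fit_predict(embeddings)

# Each synthetic group should land entirely in one cluster.
coop_labels, defect_labels = labels[:50], labels[50:]
assert len(set(coop_labels)) == 1 and len(set(defect_labels)) == 1
assert coop_labels[0] != defect_labels[0]
```

In the actual pipeline, the state sequences in each resulting cluster are then used to train the corresponding oracle network.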
3 Experimental Setup



To compare our results to previous work (Foerster et al., 2018), we use the Normalized Discounted Reward (NDR). A higher NDR implies that an agent obtains a higher reward in the environment. We compare our approach (Status-Quo Aware Learner: SQLearner) to Learning with Opponent-Learning Awareness (Lola-PG) (Foerster et al., 2018) and the Selfish Learner (SL, Section 2.2). For all experiments, we perform several runs and report the average NDR, along with the variance across runs. The bold line in all figures is the mean, and the shaded region is the one-standard-deviation region around the mean. All of our code is available at Code (2019).

3.1 Social Dilemma Games
For our experiments with social dilemma matrix games, we use the Iterated Prisoner's Dilemma (IPD) (Luce and Raiffa, 1989), Iterated Matching Pennies (IMP) (Lee and Louis, 1967), and the Iterated Stag Hunt (ISH) (Foerster et al., 2018). Table 1 shows the payoff matrix for a single iteration of each game. In iterated matrix games, at each iteration, agents take an action according to a policy and receive the rewards in Table 1. To simulate an infinitely iterated game, we let agents play 200 iterations of the game against each other and do not provide an agent with any information about the number of remaining iterations (Foerster et al., 2018). In an iteration, the state for an agent is the pair of actions played by both agents in the previous iteration. Each matrix game in Table 1 represents a different dilemma.
In the Prisoner's Dilemma, the rational policy for each agent is to defect, regardless of the policy of the other agent. However, when each agent plays rationally, each is worse off. In Matching Pennies, if an agent plays predictably, it is prone to exploitation by the other agent. Therefore, the optimal policy is to randomize between Heads and Tails, obtaining an average NDR of 0. The Stag Hunt game represents a coordination dilemma. In the game, given that the other agent will cooperate, an agent's optimal action is to cooperate as well. However, at each step, each agent has an attractive alternative: defecting and obtaining a smaller guaranteed reward. Therefore, the promise of a safer alternative, and the fear that the other agent might select the safer choice, could drive an agent to also select the safer alternative, thereby sacrificing the higher reward of mutual cooperation.
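The claim that uniform randomization in Matching Pennies yields an average reward of 0, while predictable play invites exploitation, can be checked directly. The ±1 payoff matrix below is the standard one and is our assumption for the paper's Table 1:

```python
import itertools

HEADS, TAILS = 0, 1
# Zero-sum Matching Pennies: agent 1 wins +1 when the coins match
# (assumed standard +/-1 payoffs).
PAYOFF = {(HEADS, HEADS): (1, -1), (HEADS, TAILS): (-1, 1),
          (TAILS, HEADS): (-1, 1), (TAILS, TAILS): (1, -1)}

def expected_rewards(p1_heads, p2_heads):
    """Expected one-step rewards when each agent independently plays Heads
    with the given probability."""
    e1 = e2 = 0.0
    for a1, a2 in itertools.product((HEADS, TAILS), repeat=2):
        p = (p1_heads if a1 == HEADS else 1 - p1_heads) * \
            (p2_heads if a2 == HEADS else 1 - p2_heads)
        r1, r2 = PAYOFF[(a1, a2)]
        e1 += p * r1
        e2 += p * r2
    return e1, e2

# Uniform randomization is unexploitable: both agents expect 0...
assert expected_rewards(0.5, 0.5) == (0.0, 0.0)
# ...while a fully predictable agent (always Heads) is exploited by
# an opponent who always plays Tails.
assert expected_rewards(1.0, 0.0) == (-1.0, 1.0)
```

This is the exploiter-exploited structure that Selfish Learner and Lola-PG agents fall into in Section 4.1.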
3.2 SQLoss
For our experiments with the Selfish and Status-Quo Aware Learners (SL and SQLearner), we use policy gradient-based learning, training each agent with the Actor-Critic method (Sutton and Barto, 2011). Each agent is parameterized with a policy actor and a critic for variance reduction in policy updates. During training, we use gradient descent with fixed step sizes for the actor and the critic. We use a batch size of 200 for SQLearner rollouts and follow Foerster et al. (2018) for Lola-PG. We use an episode length of 200 for all iterated matrix games. We use a high discount rate (γ) for the Iterated Prisoner's Dilemma, Iterated Stag Hunt, and Coin Game; for the Iterated Matching Pennies, we use a different value of γ. The high value of γ allows for long time horizons, thereby incentivizing long-term reward. Each agent randomly samples κ at each step (discussed in Appendix C).
3.3 GameDistill
GameDistill consists of two components. First, the state-sequence encoder (Step 2, Section 2.4) takes as input a sequence of states and outputs a feature representation. We encode each state in the sequence using a series of standard convolution layers with kernel size 3. We then use a fully connected layer with 100 neurons that outputs a dense representation of the sequence of states. The picked-coin color, agent reward, and opponent reward branches each consist of a series of dense layers with linear activation. We use linear activation so that we can cluster the feature vectors (embeddings) using a linear clustering algorithm, such as Agglomerative Clustering. We obtain similar results with the K-means clustering algorithm. We use the binary cross-entropy (BCE) loss for classification and the mean-squared error (MSE) loss for regression, with the Adam optimizer (Kingma and Ba, 2014).

Second, the oracle network (Step 4, Section 2.4) predicts an action for an input state. We encode the input state using convolution layers with ReLU activation. To predict the action, we use fully connected layers with ReLU activation and the BCE loss. We use L2 regularization and gradient descent with the Adam optimizer.
4 Results
4.1 Learning optimal policies in Iterated Matrix Games using SQLoss
Iterated Prisoner’s Dilemma (IPD):
We train the different learners to play the IPD game. Figure 4 shows the results. For all learners, agents initially defect, and NDR falls. This initial bias towards defection is expected since, for agents trained with random gameplay episodes, the benefits of exploitation outweigh the costs of mutual defection. For Selfish Learner (SL) agents, the bias intensifies, and the agents converge to mutually harmful selfish behavior (low NDR). Lola-PG agents learn to predict each other's behavior and therefore realize that defection is more likely to lead to mutual harm than to selfish benefit. They subsequently move towards cooperation, but occasionally defect. In contrast, SQLearner agents quickly realize the costs of defection, indicated by the small initial dip in the NDR curves. They subsequently move towards almost perfect cooperation. Finally, it is important to note that SQLearner agents have close to zero variance, unlike the other methods, where the variance in NDR across runs is significant.
Iterated Matching Pennies (IMP):
We train the different learners to play the IMP game. The optimal policy for an agent to avoid exploitation is to play Heads or Tails perfectly randomly and obtain an NDR of 0. Figure 5 shows the results. SQLearner agents learn to play optimally and obtain an NDR close to 0. Interestingly, Selfish Learner and Lola-PG agents converge to an exploiter-exploited equilibrium where one agent consistently exploits the other. This asymmetric exploitation equilibrium is more pronounced for Selfish Learner agents than for Lola-PG agents. As before, we observe that SQLearner agents have close to zero variance across runs, unlike the other methods, where the variance in NDR across runs is significant.
Appendix B shows the results for the ISH game.
4.2 Evolving Cooperation in Games with visual input using GameDistill followed by SQLoss
4.2.1 The Coin Game: GameDistill
To evaluate the clustering step in GameDistill, we make two t-SNE (Maaten and Hinton, 2008) plots of the 100-dimensional feature vectors extracted from the last layer of the state-sequence encoder network. The first plot colors each point (state sequence) by the rewards obtained by both agents in the sequence. The second plot colors each point by the cluster label output by Agglomerative Clustering. Figure 6 shows the results. GameDistill correctly learns two clusters, one for state sequences that represent cooperation and the other for state sequences that represent defection. We also experiment with different feature vector dimensions and obtain similar clustering results with sufficient training. Once we have the clusters, we train oracle networks using the state sequences in each cluster. To verify that the trained oracles represent a cooperation and a defection policy, we modify the Coin Game environment to contain only the Red agent. We then play two variations of the game. In the first variation, the Red agent is forced to play the action suggested by the first oracle. In this variation, we find that the Red agent picks only a small fraction of Blue coins, indicating a high rate of cooperation; the first oracle therefore represents a cooperation policy. In the second variation, the Red agent is forced to play the action suggested by the second oracle. In this case, we find that the Red agent picks a large fraction of Blue coins, indicating a high rate of defection; the second oracle therefore represents a defection policy. Hence, the oracles learned by the Red agent using GameDistill represent cooperation and defection policies.
4.2.2 The Coin Game: SQLoss
Before playing the game, each agent uses GameDistill to learn cooperation and defection oracles. During gameplay, at each step, an agent follows either the action suggested by its cooperation oracle or the action suggested by its defection oracle. Further, each agent has an additional SQLoss term. We compare approaches using the degree of cooperation between agents, measured by the probability of an agent picking the coin of its own color (Foerster et al., 2018). Figure 7 shows the results. The probability that an SQLearner agent will pick the coin of its color is close to 1. This high probability indicates that the other agent is cooperating with it and only picking coins of its own color. In contrast, the probability that a Lola-PG agent will pick its own coin is much smaller, indicating higher rates of defection. As expected, the probability of an agent picking its own coin is smallest for Selfish Learners (SL). A probability value of 0.5 indicates that a Selfish Learner is just as likely to pick the other agent's coin as its own.
4.3 SQLearner: Exploitability and Adaptability
Given that an SQLearner agent does not have any prior information about the other agent, it is important that it evolves its strategy based on the strategy of its opponent. To evaluate an SQLearner agent's ability to avoid exploitation by a selfish agent, we train one SQLearner agent against an agent that always defects in the Coin Game. We find that the SQLearner agent also learns to always defect. This persistent defection is important since, given that the other agent is selfish, the SQLearner agent can do no better than also being selfish. To evaluate an SQLearner agent's ability to exploit a cooperative agent, we train one SQLearner agent with an agent that always cooperates in the Coin Game. In this case, we find that the SQLearner agent learns to always defect. This persistent defection is important since, given that the other agent is cooperative, the SQLearner agent obtains maximum reward by behaving selfishly. Hence, the SQLearner agent is both resistant to exploitation and able to exploit, depending on the strategy of the other agent.
5 Conclusion
We have described a status-quo loss (SQLoss) that encourages an agent to imagine the consequences of sticking to the status quo. We demonstrated how agents trained with SQLoss evolve cooperative behavior in several social dilemmas without sharing rewards, sharing gradients, or using a communication channel. To work with visual input games, we proposed GameDistill, an approach that automatically extracts a cooperative and a selfish policy from a social dilemma game. We combined GameDistill and SQLoss to demonstrate how agents evolve desirable cooperative behavior in a social dilemma game with visual observations.
References
 Toward a theory of discounted repeated games with imperfect monitoring. Econometrica 58 (5), pp. 1041–1063. External Links: ISSN 00129682, 14680262, Link Cited by: §1.
 Robert Axelrod's (1984) The Evolution of Cooperation. Cited by: §1.
 Reaching paretooptimality in prisoner’s dilemma using conditional joint action learning. Autonomous Agents and MultiAgent Systems 15 (1). External Links: ISSN 13872532 Cited by: §1.
 MARL with sqloss. GitHub. Note: https://github.com/user12423/MARLwithSQLoss/ Cited by: item 3, item 2, item 7, §3.
 Achieving cooperation in a minimally constrained environment.. Vol. 1, pp. 57–62. Cited by: §1.
 Learning to cooperate in multiagent social dilemmas. In Proceedings of the Fifth International Joint Conference on Autonomous Agents and Multiagent Systems, AAMAS ’06. Cited by: §1.
 The struggle to govern the commons. Science 302 (5652), pp. 1907–1912. External Links: Document Cited by: §1.
 Learning with opponentlearning awareness. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp. 122–130. Cited by: §1, §1, §1, §2.1, §2.1, §2.1, §3.1, §3.1, §3.2, §3, §4.2.2.
 The folk theorem with imperfect public information. Econometrica 62 (5), pp. 997–1039. External Links: ISSN 00129682, 14680262, Link Cited by: §1.
 The folk theorem in repeated games with discounting or with incomplete information. Econometrica 54 (3), pp. 533–554. External Links: ISSN 00129682, 14680262 Cited by: §1.
 Noncooperative Collusion under Imperfect Price Information. Econometrica 52 (1), pp. 87–100. Cited by: §1.
 The tragedy of the commons. Science 162 (3859), pp. 1243–1248. External Links: Document Cited by: §1.
 Inequity aversion improves cooperation in intertemporal social dilemmas. In Proceedings of the 32Nd International Conference on Neural Information Processing Systems, NIPS’18. Cited by: §1.
 A multiagent reinforcement learning model of commonpool resource appropriation. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, pp. . Cited by: §1.
 Information can wreck cooperation: a counterpoint to kandori (1992). Economics Letters 107, pp. 112–114. External Links: Document Cited by: §1.
 Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.3.
 Coordinate to cooperate or compete: abstract goals and joint intentions in social interaction. In CogSci, Cited by: §1.
 The application of decision theory and dynamic programming to adaptive control systems. Ph.D. Thesis. Cited by: §3.1.
 Multiagent reinforcement learning in sequential social dilemmas. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, AAMAS ’17. Cited by: §1.
 Maintaining cooperation in complex social dilemmas using deep reinforcement learning. External Links: arXiv:1707.01068 Cited by: §1, §2.2.
 Games and decisions: introduction and critical survey. Courier Corporation. Cited by: §3.1.

 Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605. Cited by: §4.2.1.
 Learning dynamics in social dilemmas. Proceedings of the National Academy of Sciences of the United States of America 99 Suppl 3, pp. 7229–36. External Links: Document Cited by: §1.
 [24] Bowling, M. and Veloso, M. (2002). Multiagent learning using a variable learning rate. Artificial Intelligence 136 (2), pp. 215–250. External Links: ISSN 00043702 Cited by: §1.
 Tit for tat in heterogeneous populations. Nature 355 (6357), pp. 250–253. Cited by: §1.
 Evolution of indirect reciprocity by image scoring. Nature 393 (6685), pp. 573–577. Cited by: §1.
 A strategy of winstay, loseshift that outperforms titfortat in the prisoner’s dilemma game. Nature 364, pp. 56–8. External Links: Document Cited by: §1.
 A simple rule for the evolution of cooperation on graphs and social networks. Nature 441 (7092), pp. 502–505. External Links: ISSN 14764687, Document, Link Cited by: §1.
 Governing the commonsthe evolution of institutions for collective actions. Political economy of institutions and decisions. Cited by: §1.
 Revisiting the commons: local lessons, global challenges. Science 284 (5412), pp. 278–282. External Links: Document Cited by: §1.
 Consequentialist conditional cooperation in social dilemmas with imperfect information. In International Conference on Learning Representations, ICLR 2018,Vancouver, BC, Canada, April 30  May 3, 2018, Conference Track Proceedings, Cited by: §1.
 The Machine Learning Reproducibility Checklist. Note: https://www.cs.mcgill.ca/ jpineau/ReproducibilityChecklist.pdf External Links: Link Cited by: Appendix E.
 Multiagent reinforcement learning in the iterated prisoner’s dilemma.. Bio Systems 37 12, pp. 147–66. Cited by: §1.
 A new route to the evolution of cooperation. Journal of evolutionary biology 19, pp. 726–33. External Links: Document Cited by: §1.
 Reinforcement learning: an introduction. Cited by: §3.2.
 Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pp. 1057–1063. Cited by: §2.2.
 The evolution of reciprocal altruism. Quarterly Review of Biology 46, pp. 35–57.. External Links: Document Cited by: §1.
 Evolving intrinsic motivations for altruistic behavior. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS ’19, pp. 683–692. External Links: ISBN 9781450363099 Cited by: §1.
 Simple statistical gradientfollowing algorithms for connectionist reinforcement learning. Machine learning 8 (34), pp. 229–256. Cited by: §2.2.
 Classes of multiagent qlearning dynamics with greedy exploration. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML’10. Cited by: §1.

Emotional multiagent reinforcement learning in spatial social dilemmas.
IEEE Transactions on Neural Networks and Learning Systems
26 (12), pp. 3083–3096. Cited by: §1.
Supplementary Material
Appendix A Illustrations of Trained Oracle Networks for the Coin Game
Figure 8 shows the predictions of the oracle networks learned by the Red agent using SQLoss in the Coin Game. We see that the cooperation oracle suggests an action that avoids picking the other agent's (Blue) coin. Analogously, the defection oracle suggests a selfish action that picks the other agent's coin.
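To make the role of the two oracles concrete, the sketch below shows how a Red agent could query a cooperation oracle and a defection oracle for an action on a Coin Game observation. The linear `Oracle` class, the 3x3 board encoding, and the action set are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

ACTIONS = ["up", "down", "left", "right"]

def softmax(logits):
    # Numerically stable softmax over action logits.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

class Oracle:
    """A linear stand-in for an oracle network (hypothetical)."""
    def __init__(self, rng):
        # Observation: flattened 4-channel 3x3 grid (red agent, blue agent,
        # red coin, blue coin) -> 36 features.
        self.w = rng.normal(size=(36, len(ACTIONS)))

    def action_probs(self, obs):
        return softmax(obs @ self.w)

rng = np.random.default_rng(0)
cooperation_oracle = Oracle(rng)  # would be trained to avoid the Blue coin
defection_oracle = Oracle(rng)    # would be trained to grab any coin

obs = rng.integers(0, 2, size=36).astype(float)
coop_action = ACTIONS[int(np.argmax(cooperation_oracle.action_probs(obs)))]
selfish_action = ACTIONS[int(np.argmax(defection_oracle.action_probs(obs)))]
```

In the trained system each oracle would be a deep network; the point here is only that both oracles consume the same observation and each proposes its own action.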
Appendix B Results for the Iterated Stag Hunt using SQLoss
Figure 9 shows the results of training two agents on the Iterated Stag Hunt game. The agents coordinate successfully to obtain a near-optimal NDR value for this game.
Appendix C Effect of the Hyperparameter on Convergence to Cooperation
We explore the effect of the hyperparameter (Section 2) on convergence to cooperation. To imagine the consequences of maintaining the status quo, each agent samples the length of the imagined episode from a discrete uniform distribution whose upper bound is set by this hyperparameter. A larger value therefore yields longer imaginary episodes. We find that larger values (and hence longer imaginary episodes) lead to faster cooperation between agents in the IPD and the Coin Game. This effect plateaus beyond a threshold value, which we select for our experiments.
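The sampling step above can be sketched as follows. The names `kappa` (the upper bound of the uniform distribution) and `reward_for_repeat` are assumptions made for illustration, since the paper's symbols are defined in Section 2:

```python
import random

def imagined_statusquo_return(reward_for_repeat, kappa, gamma=0.99):
    """Sample an imagined episode length z ~ U{1, ..., kappa} and
    accumulate the discounted reward of repeating the previous
    (status-quo) action for z steps."""
    z = random.randint(1, kappa)  # larger kappa -> longer imagined episodes
    ret = 0.0
    for t in range(z):
        # reward_for_repeat(t): hypothetical reward for repeating the
        # status-quo action at imagined step t.
        ret += (gamma ** t) * reward_for_repeat(t)
    return z, ret
```

With a larger `kappa`, the sampled `z` is larger on average, so the imagined rollout weighs the long-term consequences of maintaining the status quo more heavily.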
Appendix D Architecture Details
We performed all our experiments on an AWS instance with the following specifications.

Model name: Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz

RAM: 189GB

CPU(s): 96

Architecture: x86_64

Thread(s) per core: 2
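For completeness, details like those above can be recorded programmatically at experiment time; a minimal sketch using only the Python standard library (field names are our own choice):

```python
import os
import platform

def infrastructure_report():
    """Collect basic machine details for a reproducibility log."""
    return {
        "architecture": platform.machine(),  # e.g. "x86_64"
        "processor": platform.processor(),
        "cpu_count": os.cpu_count(),         # logical CPUs (threads)
        "platform": platform.platform(),
    }

report = infrastructure_report()
```

Logging such a report alongside each run makes the "computing infrastructure" checklist item automatic rather than manual.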
Appendix E Reproducibility Checklist
We follow the reproducibility checklist of Pineau (2019) and include further details here. For all the models and algorithms, we have included the details that we believe are useful for reproducing the results of this work.

For all models and algorithms presented, check if you include:

A clear description of the mathematical setting, algorithm, and/or model: Yes. The algorithm is described in detail in Section 2, with all the loss functions used for training clearly defined. The details of the architecture, hyperparameters, and other algorithmic settings are given in Section 3. Environment details are explained in the sections in which they are introduced.
An analysis of the complexity (time, space, sample size) of any algorithm: No. We do not include a formal complexity analysis of our algorithm. However, in Section 2 we highlight the additional computational steps (in terms of losses and parameter updates) that our approach requires over standard independently learning multi-agent RL algorithms.

A link to downloadable source code, with specification of all dependencies, including external libraries: Yes. We have made the source code available at Code (2019).


For any theoretical claim, check if you include:

A statement of the result: NA. Our paper is primarily empirical, and we do not make any major theoretical claims.

A clear explanation of any assumptions: NA.

A complete proof of the claim: NA.


For all figures and tables that present empirical results, check if you include:

A complete description of the data collection process, including sample size: NA. We did not collect any data for our work.

A link to a downloadable version of the dataset or simulation environment: Yes. We have made the source code available at Code (2019).

An explanation of any data that were excluded, description of any preprocessing step: NA. We did not perform any preprocessing step.

An explanation of how samples were allocated for training / validation / testing: Yes. Details of the data used for training are given in Section 2.4. The number of iterations used for learning (training) is given in Figures 4, 5, and 7. The details of the number of runs and the batch sizes used for various experiments are given in Section 3.

The range of hyperparameters considered, method to select the best hyperparameter configuration, and specification of all hyperparameters used to generate results: Yes. We did not do any hyperparameter tuning as part of this work. All the hyperparameters that we used are specified in Section 3.

The exact number of evaluation runs: Yes. For all our environments, we repeat each experiment multiple times and evaluate performance using an average of Monte Carlo estimates; the exact counts are stated in Section 3. We do not fix any seeds. The details of the number of runs and the batch sizes used for various experiments are also given there.

A description of how experiments were run: Yes. The README with instructions on how to run the experiments along with the source code is provided at Code (2019).

A clear definition of the specific measure or statistics used to report results: Yes. We plot the mean and the one-standard-deviation region around the mean for all our numerical experiments. This is stated in Section 3.

Clearly defined error bars: Yes. We plot the mean and the one-standard-deviation region around the mean for all our numerical experiments. This is stated in Section 3.

A description of results with central tendency (e.g. mean) & variation (e.g. stddev): Yes. We plot the mean and the one-standard-deviation region around the mean for all our numerical experiments. This is stated in Section 3.
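The mean-and-one-standard-deviation band described in these items can be computed from per-run learning curves as follows (a generic sketch, not code from our released repository):

```python
import numpy as np

def mean_and_std_band(curves):
    """Aggregate per-run curves (shape: runs x iterations) into the mean
    curve and the lower/upper edges of the one-standard-deviation band."""
    curves = np.asarray(curves, dtype=float)
    mean = curves.mean(axis=0)
    std = curves.std(axis=0)
    return mean, mean - std, mean + std

# Two hypothetical runs of three iterations each.
runs = [[0.1, 0.3, 0.5], [0.2, 0.4, 0.6]]
mean, lower, upper = mean_and_std_band(runs)
# mean -> [0.15, 0.35, 0.55]
```

The `lower`/`upper` arrays are what a plotting call such as `fill_between` would shade around the mean curve.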

A description of the computing infrastructure used: Yes. We have provided this detail in the Supplementary material in Section D.
