Inducing Cooperative behaviour in Sequential-Social dilemmas through Multi-Agent Reinforcement Learning using Status-Quo Loss

01/15/2020, by Pinkesh Badjatiya et al.

In social dilemma situations, individual rationality leads to sub-optimal group outcomes. Several human engagements can be modeled as sequential (multi-step) social dilemmas. However, in contrast to humans, Deep Reinforcement Learning agents trained to optimize individual rewards in sequential social dilemmas converge to selfish, mutually harmful behavior. We introduce a status-quo loss (SQLoss) that encourages an agent to stick to the status quo, rather than repeatedly changing its policy. We show how agents trained with SQLoss evolve cooperative behavior in several social dilemma matrix games. To work with social dilemma games that have visual input, we propose GameDistill. GameDistill uses self-supervision and clustering to automatically extract cooperative and selfish policies from a social dilemma game. We combine GameDistill and SQLoss to show how agents evolve socially desirable cooperative behavior in the Coin Game.


1 Introduction

Consider a sequential social dilemma, where individually rational behavior leads to outcomes that are sub-optimal for each individual in the group (Hardin, 1968; Ostrom, 1990; Ostrom et al., 1999; Dietz et al., 2003). Current state-of-the-art Multi-Agent Deep Reinforcement Learning (MARL) methods that train agents independently can lead to agents that fail to cooperate reliably, even in simple social dilemma settings. This failure to cooperate results in sub-optimal individual and group outcomes (Foerster et al., 2018; Lerer and Peysakhovich, 2017; Section 2.2).

To illustrate why it is challenging to evolve cooperation in such dilemmas, we consider the Coin Game (Foerster et al., 2018; Figure 1). Each agent can play either selfishly (pick all coins) or cooperatively (pick only coins of its own color). Regardless of the behavior of the other agent, the individually rational choice for an agent is to play selfishly, either to minimize losses (avoid being exploited) or to maximize gains (exploit the other agent). However, when both agents behave rationally, they try to pick all coins and obtain a lower average long-term reward than they would if both played cooperatively. Therefore, when agents cooperate, they are both better off. Training Deep RL agents independently in the Coin Game using state-of-the-art methods leads to mutually harmful selfish behavior (Section 2.2).

In this paper, we present a novel MARL algorithm that allows independently learning Deep RL agents to converge to individually and socially desirable cooperative behavior in such social dilemma situations. Our key contributions can be summarised as:

  1. We introduce a Status-Quo loss (SQLoss, Section 2.3) and an associated policy gradient-based algorithm to evolve optimal behavior for agents that can act in either a cooperative or a selfish manner, by choosing between a cooperative and a selfish policy. We empirically demonstrate that agents trained with SQLoss evolve optimal behavior in several social dilemma iterated matrix games (Section 4).

  2. We propose GameDistill (Section 2.4), an algorithm that reduces a social dilemma game with visual observations to an iterated matrix game by extracting policies that implement cooperative and selfish behavior. We empirically demonstrate that GameDistill extracts cooperative and selfish policies for the Coin Game (Section 4.2).

  3. We demonstrate that when agents run GameDistill followed by MARL game-play using SQLoss, they converge to individually as well as socially desirable cooperative behavior in a social dilemma game with visual observations (Section 4.2).

Figure 1: Two agents (Red and Blue) playing the Coin Game. The agents, along with a Blue or Red coin, appear at random positions in a 3x3 grid. An agent observes the complete 3x3 grid as input and can move either left, right, up, or down. When an agent moves into a cell with a coin, it picks up the coin, and a new instance of the game begins. If the Red agent picks the Red coin, it gets a reward of +1 and the Blue agent gets no reward. If the Red agent picks the Blue coin, it gets a reward of +1, and the Blue agent gets a reward of -2. The Blue agent's reward structure is symmetric to that of the Red agent.

The problem of how independently learning agents evolve cooperative behavior in social dilemmas has been studied by researchers through human studies and simulation models (Fudenberg and Maskin, 1986; Green and Porter, 1984; Fudenberg et al., 1994; Kamada and Kominers, 2010; Abreu et al., 1990). A large body of work has looked at the evolution of cooperation through reciprocal behaviour and indirect reciprocity (Trivers, 1971; Axelrod, 1984; Nowak and Sigmund, 1992, 1993, 1998), through variants of reinforcement learning based on aspiration (Macy and Flache, 2002) or attitude (Damer and Gini, 2008), through multi-agent reinforcement learning (Sandholm and Crites, 1996; Wunder et al., 2010), under specific conditions (Banerjee and Sen, 2007) such as different learning rates (de Cote et al., 2006) similar to WoLF (Bowling and Veloso, 2002), and through embedded emotion (Yu et al., 2015) or social networks (Ohtsuki et al., 2006; Santos and Pacheco, 2006).

However, these approaches do not directly apply to Deep RL agents (Leibo et al., 2017). Recent work in this direction (Kleiman-Weiner et al., 2016; Julien et al., 2017; Peysakhovich and Lerer, 2018) focuses on letting agents learn strategies in multi-agent settings through interactions with other agents. Leibo et al. (2017) define the problem of social dilemmas in the Deep RL framework and analyze the outcomes of a fruit-gathering game (Julien et al., 2017). They vary the abundance of resources and the cost of conflict in the fruit environment to generate degrees of cooperation between agents. Hughes et al. (2018) define an intrinsic reward (inequality aversion) that attempts to reduce the difference in obtained rewards between agents. The agents are designed to have an aversion to both advantageous (guilt) and disadvantageous (unfairness) reward allocation. This handcrafting of loss with mutual fairness evolves cooperation, but it leaves the agent vulnerable to exploitation. LOLA (Foerster et al., 2018) uses opponent awareness to achieve high levels of cooperation in the Coin Game and the Iterated Prisoner’s Dilemma game. However, the LOLA agent assumes access to the other agent’s policy parameters and gradients. This level of access is analogous to getting complete access to the other agent’s private information and therefore devising a strategy with full knowledge of how they are going to play. Wang et al. (2019) propose an evolutionary Deep RL setup to evolve cooperation. They define an intrinsic reward that is based on features generated from the agent’s past and future rewards, and this reward is shared with other agents. They use evolution to maximize the sum of rewards among the agents and thus evolve cooperative behavior. However, sharing rewards in this indirect way enforces cooperation rather than evolving it through independently learning agents.

In contrast, we introduce a Status-Quo loss (SQLoss) that evolves cooperation between agents without sharing rewards, gradients, or using a communication channel. SQLoss encourages an agent to imagine the consequences of sticking to the status-quo. This imagined stickiness ensures that an agent gets a better estimate of a cooperative or selfish policy. Without SQLoss, agents repeatedly switch policies (from cooperative to selfish), obtain short-term rewards (through exploitation), and therefore incorrectly learn that a selfish strategy gives higher rewards in the long term.

To work with social dilemma games that have visual observations, we introduce GameDistill. GameDistill uses self-supervision and clustering to automatically extract a cooperative and a selfish policy from a social dilemma game. The input to GameDistill is a collection of state sequences derived from game-play between two randomly initialized agents. Each state sequence represents a collection of states and actions (of both agents) leading up to a reward in the environment. GameDistill uses this collection of state sequences to learn two oracles. One oracle represents a cooperative policy, and the other oracle represents a selfish policy. Given a state, an oracle returns an action according to the specific policy. It is important to note that each agent independently runs GameDistill to extract its oracles. (For instance, Figure 8 (Appendix A) illustrates the cooperation and defection oracles extracted by the Red agent using GameDistill in the Coin Game.) Figure 2 shows the high-level architecture of our approach.

Figure 2: High-level architecture of our approach. Each agent runs GameDistill individually before playing the social dilemma game. This creates two oracles per agent. During game-play, each agent (enhanced with SQLoss) takes either the action suggested by its cooperation oracle or the action suggested by its defection oracle.
  1. For a social dilemma game with visual observations, each RL agent runs GameDistill to learn oracles that implement cooperative and selfish behavior.

  2. We train agents (with SQLoss) to play the game such that, at any step, an agent can take either the action suggested by its cooperation oracle or the action suggested by its defection oracle.

We empirically demonstrate in Section 4 that our approach evolves cooperative behavior between independently trained agents.

2 Approach

2.1 Social Dilemmas modeled as Iterated Matrix Games

We adopt the definitions in Foerster et al. (2018). We model social dilemmas as general-sum Markov (simultaneous move) games. A multi-agent Markov game is specified by the tuple $(S, N, \{A^i\}_{i=1}^{N}, \mathcal{T}, \{r^i\}_{i=1}^{N}, \gamma)$, where $S$ denotes the state space of the game and $N$ denotes the number of agents playing the game. At each step of the game, each agent $i \in \{1, \dots, N\}$ selects an action $a^i \in A^i$. Let $\mathbf{a} = (a^1, \dots, a^N)$ denote the joint action vector that represents the simultaneous actions of all agents. The joint action $\mathbf{a}$ changes the state of the game from $s$ to $s'$ according to the state transition function $\mathcal{T}(s, \mathbf{a})$. At the end of each step, each agent $i$ gets a reward $r^i(s, \mathbf{a})$, which is a function of the actions played by all agents. For an agent $i$, the discounted future return from time $t$ is defined as $R^i_t = \sum_{l=0}^{\infty} \gamma^l\, r^i_{t+l}$, where $\gamma \in [0, 1)$ is the discount factor. Each agent independently attempts to maximize its expected total discounted reward.

Matrix games are the special case of two-player perfectly observable Markov games (Foerster et al., 2018). Table 1 shows examples of matrix games that represent social dilemmas. Consider the Prisoner's Dilemma matrix game in Table 1(a). Each agent can either cooperate (C) or defect (D). For an agent, playing D is the rational choice, regardless of whether the other agent plays C or D. Therefore, if both agents play rationally, they each receive a reward of -2. However, if both agents play C, then each obtains a reward of -1. The fact that individually rational behavior leads to a sub-optimal group (and individual) outcome highlights the dilemma.

In Infinitely Iterated Matrix Games, agents repeatedly play a particular matrix game against each other. In each iteration of the game, each agent has access to the actions played by both agents in the previous iteration. Therefore, the state input to an RL agent consists of the actions of both agents in the previous iteration of the game. We adopt this state formulation to remain consistent with Foerster et al. (2018). The infinitely iterated variations of the matrix games in Table 1 represent sequential social dilemmas. For ease of representation, we refer to infinitely iterated matrix games as iterated matrix games in subsequent sections.
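To make this setting concrete, the sketch below implements a minimal two-player iterated Prisoner's Dilemma environment using the payoffs from Table 1(a), with the observation given by both agents' previous actions. The class and method names are illustrative and not the paper's implementation.

```python
import numpy as np

class IteratedPrisonersDilemma:
    """Minimal two-player iterated PD. The observation is a one-hot encoding
    of the previous joint action (CC, CD, DC, DD) plus a 'start' flag."""

    # Payoffs from Table 1(a); action 0 = Cooperate, 1 = Defect.
    PAYOFFS = {
        (0, 0): (-1, -1),  # CC
        (0, 1): (-3, 0),   # CD
        (1, 0): (0, -3),   # DC
        (1, 1): (-2, -2),  # DD
    }

    def __init__(self, episode_length=200):
        self.episode_length = episode_length

    def reset(self):
        self.t = 0
        return self._obs(None)

    def step(self, a1, a2):
        r1, r2 = self.PAYOFFS[(a1, a2)]
        self.t += 1
        done = self.t >= self.episode_length
        return self._obs((a1, a2)), (r1, r2), done

    def _obs(self, joint_action):
        obs = np.zeros(5, dtype=np.float32)
        if joint_action is None:
            obs[4] = 1.0                      # initial state
        else:
            obs[2 * joint_action[0] + joint_action[1]] = 1.0
        return obs

if __name__ == "__main__":
    env = IteratedPrisonersDilemma()
    obs, done = env.reset(), False
    while not done:  # two random players
        obs, (r1, r2), done = env.step(np.random.randint(2), np.random.randint(2))
```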

2.2 Learning Policies in Iterated Matrix Games: The Selfish Learner

The standard method to model agents in iterated matrix games is to model each agent as a Deep RL agent that independently attempts to maximize its expected total discounted reward. Several approaches to model agents in this way use policy gradient-based methods (Sutton et al., 2000; Williams, 1992). Policy gradient methods update an agent's policy, parameterized by $\theta^i$, by performing gradient ascent on the expected total discounted reward. Formally, let $\pi_{\theta^i}$ denote the parameterized version of an agent's policy and $V^i(\theta^1, \theta^2)$ denote the total expected discounted reward for agent $i$. Here, $V^i$ is a function of the policy parameters of both agents. In the $k$-th iteration of the game, each agent updates $\theta^i_k$ to $\theta^i_{k+1}$, such that it maximizes its total expected discounted reward. For agent 1, $\theta^1_{k+1}$ is computed as follows:

$$\theta^1_{k+1} = \operatorname*{argmax}_{\theta^1} V^1(\theta^1, \theta^2_k) \tag{1}$$

For agents trained using reinforcement learning, the gradient ascent rule to update $\theta^1_{k+1}$ is:

$$\theta^1_{k+1} = \theta^1_k + \alpha \,\nabla_{\theta^1} V^1(\theta^1_k, \theta^2_k) \tag{2}$$

where $\alpha$ is the step size of the updates.

In the Iterated Prisoner's Dilemma (IPD) game, agents trained with the policy gradient update method converge to a sub-optimal mutual defection equilibrium (Figure 4; Lerer and Peysakhovich, 2017). This sub-optimal equilibrium attained by Selfish Learners motivates us to explore alternative methods that could lead to a desirable cooperative equilibrium. We denote the agent trained using policy gradient updates as a Selfish Learner (SL).
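For concreteness, here is a minimal policy-gradient (REINFORCE-style) update for a single Selfish Learner, corresponding to the gradient ascent rule above. The PyTorch framework, network size, and the normalized-return baseline are illustrative assumptions rather than the paper's exact training code (the paper trains with an Actor-Critic variant, Section 3.2).

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Maps the 5-dim IPD observation to logits over {Cooperate, Defect}."""
    def __init__(self, obs_dim=5, n_actions=2, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions))

    def forward(self, obs):
        return self.net(obs)

def discounted_returns(rewards, gamma):
    """R_t = sum_l gamma^l * r_{t+l}, computed backwards over one episode."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return list(reversed(out))

def selfish_learner_update(policy, optimizer, observations, actions, rewards, gamma=0.96):
    """One REINFORCE-style step: ascend the expected discounted return.
    `observations` is a list of float tensors; `actions`/`rewards` are lists."""
    returns = torch.tensor(discounted_returns(rewards, gamma))
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # crude baseline
    logits = policy(torch.stack(observations))
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(
        torch.tensor(actions))
    loss = -(log_probs * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```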

Figure 3: Intuition behind SQLoss. At each step, SQLoss encourages an agent to imagine the consequences of sticking to the status-quo by imagining an episode where the status-quo is repeated for $\kappa$ steps. Section 2.3 describes SQLoss in more detail.

2.3 Learning Policies in Iterated Matrix Games: The Status-Quo Aware Learner (SQLearner)

2.3.1 SQLoss: Intuition

Why do independent, selfish learners converge to mutually harmful behavior in the IPD? To understand this, consider the payoff matrix for a single iteration of the IPD in Table 1(a). In each iteration, an agent can play either C or D. Mutual defection (DD) is worse for each agent than mutual cooperation (CC). However, one-sided exploitation (DC or CD) is better than mutual cooperation for the exploiter and far worse for the exploited. Therefore, as long as an agent perceives the possibility of exploitation (DC or CD), it is drawn to defect, both to maximize reward (through exploitation) and to minimize loss (by avoiding being exploited). To increase the likelihood of cooperation, it is important to reduce instances of exploitation between agents. We posit that, if agents mostly either mutually cooperate (CC) or mutually defect (DD), then they will learn to prefer CC and achieve a socially desirable cooperative equilibrium.

Motivated by this idea, we introduce a status-quo loss (SQLoss) for each agent, derived from the idea of imaginary game-play, as depicted in Figure 3. Intuitively, the loss encourages an agent to imagine an episode where the status-quo (current situation) is repeated for a number of steps. This imagined episode causes the exploited agent (in DC or CD) to perceive a continued risk of exploitation and, therefore, quickly move to DD. Hence, for an agent, the short-term gain from exploitation is overcome by the long-term loss from mutual defection (DD). Therefore, agents move towards either mutual cooperation (CC) or mutual defection (DD). With exploitation (and, subsequently, the fear of being exploited) out of the picture, agents move towards mutual cooperation. Figure 3 illustrates this idea.

2.3.2 SQLoss: Formulation

We describe below the formulation of SQLoss with respect to agent 1; the formulation for agent 2 is identical. Let $\tau = (s_0, a^1_0, a^2_0, r^1_0, \dots, s_T, a^1_T, a^2_T, r^1_T)$ denote the collection of agent 1's experiences after $T$ time steps. Let $R^1_t(\tau) = \sum_{l=t}^{T} \gamma^{l-t}\, r^1_l$ denote the discounted future return for agent 1 starting at state $s_t$ in actual game-play. Let $\hat{\tau}$ denote the collection of an agent's imagined experiences. For a state $s_t$, an agent imagines an episode by starting at $s_t$ and repeating the previously played joint action for $\kappa$ steps. This is equivalent to imagining a $\kappa$-step repetition of already played actions. We sample $\kappa$ from a Discrete Uniform distribution $\mathcal{U}\{1, \kappa_{max}\}$, where $\kappa_{max}$ is a hyper-parameter. To simplify notation, let $\phi(s_t, \kappa)$ denote the ordered set of states, actions, and rewards starting at time $t$ and repeated $\kappa$ times for imagined game-play. Let $\hat{R}^1_t(\hat{\tau})$ denote the discounted future return starting at $s_t$ in imagined status-quo game-play.

(3)
(4)
(5)

$R^1_t$ and $\hat{R}^1_t$ are approximated by the sampled returns from actual and imagined game-play, respectively. These values are the expected rewards conditioned on both agents' policies $(\pi_{\theta^1}, \pi_{\theta^2})$. For agent 1, the regular gradient and the Status-Quo gradient can be derived from the policy gradient formulation as:

$$\nabla_{\theta^1} J^1 = \mathbb{E}\Big[\sum_{t} \nabla_{\theta^1} \log \pi_{\theta^1}(a^1_t \mid s_t)\,\big(R^1_t - b(s_t)\big)\Big] \tag{6}$$

$$\nabla_{\theta^1} \hat{J}^1 = \mathbb{E}\Big[\sum_{t} \nabla_{\theta^1} \log \pi_{\theta^1}(a^1_t \mid s_t)\,\big(\hat{R}^1_t - b(s_t)\big)\Big] \tag{7}$$

where $b(s_t)$ is a baseline for variance reduction.

Then the update rule for the policy gradient-based Status-Quo Learner (SQL-PG) is:

$$\theta^1_{k+1} = \theta^1_k + \alpha\,\big(\lambda_{re}\,\nabla_{\theta^1} J^1 + \lambda_{sq}\,\nabla_{\theta^1} \hat{J}^1\big) \tag{8}$$

where $\lambda_{re}$ and $\lambda_{sq}$ denote the loss scaling factors for the REINFORCE and the imaginative game-play terms, respectively.
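To make the imagined status-quo roll-out concrete, the sketch below shows one way to construct an imagined return by repeating the just-obtained reward for κ ~ DiscreteUniform(1, κ_max) steps before resuming the actual return, and then combining the REINFORCE and status-quo terms with scaling factors. This is a simplified reading under the stated assumptions; the exact return construction and scaling in the paper's implementation may differ, and all function names are illustrative.

```python
import random
import torch

def status_quo_returns(rewards, gamma, kappa_max=10):
    """For each step t, build an imagined return in which the just-obtained
    reward is repeated for kappa ~ DiscreteUniform(1, kappa_max) steps before
    the actual remainder of the episode resumes. Assumes gamma < 1."""
    T = len(rewards)
    actual = [0.0] * (T + 1)
    for t in reversed(range(T)):                 # actual discounted returns R_t
        actual[t] = rewards[t] + gamma * actual[t + 1]
    imagined = []
    for t in range(T):
        kappa = random.randint(1, kappa_max)
        repeated = rewards[t] * (1 - gamma ** kappa) / (1 - gamma)
        imagined.append(repeated + (gamma ** kappa) * actual[t + 1])
    return actual[:T], imagined

def sq_policy_loss(log_probs, rewards, gamma=0.96, lambda_re=1.0, lambda_sq=1.0):
    """Combine the usual REINFORCE term with the status-quo term
    (cf. Equations 6-8); lambda_re and lambda_sq are the scaling factors."""
    actual, imagined = status_quo_returns(rewards, gamma)
    actual, imagined = torch.tensor(actual), torch.tensor(imagined)
    re_term = -(log_probs * (actual - actual.mean())).mean()
    sq_term = -(log_probs * (imagined - imagined.mean())).mean()
    return lambda_re * re_term + lambda_sq * sq_term
```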

2.4 GameDistill: Moving Beyond Iterated Matrix Games

In the previous sections, we have focused on evolving cooperative behavior in the iterated matrix game formulation of sequential social dilemmas. In the iterated matrix game formulation, an agent is only allowed to either cooperate or defect in each iteration. However, in a social dilemma game with visual observations, it is not clear what set of low-level actions constitutes cooperative or selfish behavior. Therefore, to work on social dilemmas with visual observations, we propose GameDistill, an approach that automatically learns a cooperation and a defection policy by analyzing the behavior of randomly initialized agents. GameDistill learns these policies in the form of cooperation and defection oracles. Given a state, the cooperation oracle suggests an action that represents cooperative behavior. Similarly, the defection oracle suggests an action that represents selfish behavior (Figure 8, Appendix A). When RL agents play the social dilemma game, each agent independently runs GameDistill before playing the game. Once both agents have run GameDistill, they consult either of the two extracted oracles at every step of the game. Therefore, in each step, an agent either takes the action recommended by its cooperation oracle or the action recommended by its defection oracle. In this way, we reduce the visual-input game to an iterated matrix game and subsequently apply SQLoss to evolve cooperative behavior. GameDistill (see Figure 2) works as follows.

  1. We initialize RL agents with random weights and play them against each other in the game. In these random game-play episodes, whenever an agent receives a reward, we store the sequence of the last three states up to the current state.

  2. This collection of state sequences is used to train the state-sequence encoder network. The network takes as input a sequence of states and predicts the rewards of both agents as well as environment-specific parameters that depend on the game. For instance, in the Coin Game, the network predicts the rewards of both agents and the color of the picked coin.

  3. Training the network leads to the emergence of feature embeddings for the various state sequences. Subsequently, clustering these embeddings using Agglomerative Clustering (with the number of clusters set to 2) yields a cooperation cluster and a defection cluster. One of the learned clusters contains state sequences that represent cooperative behavior, and the other cluster contains state sequences that represent defection. For instance, in the Coin Game, a point in the cooperation cluster contains a sequence of states where an agent picks a coin of its own color.

  4. To train the cooperation and defection oracle networks, we use the collection of state sequences in each cluster. For each sequence of states in a cluster, we train the oracle network to predict the next action, given the current state. For instance, Figure 8 (Appendix A) shows the cooperation and defection oracles extracted by the Red agent using GameDistill in the Coin Game.

Section 3.3 describes the architectural choices of each component of GameDistill.
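As a compact illustration of Steps 3 and 4 above, the following sketch clusters encoder embeddings into two groups with Agglomerative Clustering and assembles per-cluster (state, next-action) datasets for the oracle networks. The data layout and helper names are illustrative assumptions, not the released implementation.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_trajectories(embeddings):
    """Step 3: split state-sequence embeddings into two behaviour clusters
    (one will correspond to cooperation, the other to defection)."""
    return AgglomerativeClustering(n_clusters=2).fit_predict(embeddings)

def build_oracle_datasets(sequences, labels):
    """Step 4 (data preparation): per cluster, collect (state, next action)
    pairs used to train that cluster's oracle network."""
    datasets = {0: [], 1: []}
    for seq, label in zip(sequences, labels):
        for state, action in seq:        # each seq: list of (state, action) pairs
            datasets[int(label)].append((state, action))
    return datasets

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(50, 100))   # stand-in for encoder outputs
    sequences = [[(rng.normal(size=(3, 3, 4)), int(rng.integers(4)))
                  for _ in range(3)] for _ in range(50)]
    labels = cluster_trajectories(embeddings)
    oracle_data = build_oracle_datasets(sequences, labels)
    print({cluster: len(pairs) for cluster, pairs in oracle_data.items()})
```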

3 Experimental Setup

       C          D
C  (-1, -1)   (-3, 0)
D  (0, -3)    (-2, -2)
(a) Prisoners' Dilemma (PD)

       H          T
H  (+1, -1)   (-1, +1)
T  (-1, +1)   (+1, -1)
(b) Matching Pennies (MP)

       C          D
C  (0, 0)     (-4, -1)
D  (-1, -4)   (-3, -3)
(c) Stag Hunt (SH)
Table 1: Payoff matrices for the different games used in our experiments. A pair (a, b) in a cell represents a reward of a to the row player and b to the column player. C and D (or H and T in Matching Pennies) denote the actions available to the row and column players. In the iterated versions of these games, agents play against each other over several iterations. In each iteration, an agent takes an action and receives a reward based on the actions of both agents. Each matrix represents a different kind of social dilemma.

In order to compare our results to previous work (Foerster et al., 2018), we use the Normalized Discounted Reward (NDR). A higher NDR implies that an agent obtains a higher reward in the environment. We compare our approach (the Status-Quo Aware Learner, SQLearner) to Learning with Opponent-Learning Awareness (Lola-PG) (Foerster et al., 2018) and the Selfish Learner (SL, Section 2.2). For all experiments, we perform several runs and report the average NDR along with the variance across runs. The bold line in all the figures is the mean, and the shaded region is the one-standard-deviation region around the mean. All of our code is available at Code (2019).

3.1 Social Dilemma Games

For our experiments with social dilemma matrix games, we use the Iterated Prisoners' Dilemma (IPD) (Luce and Raiffa, 1989), Iterated Matching Pennies (IMP) (Lee and Louis, 1967), and the Iterated Stag Hunt (ISH) (Foerster et al., 2018). Table 1 shows the payoff matrix for a single iteration of each game. In iterated matrix games, at each iteration, agents take an action according to a policy and receive the rewards in Table 1. To simulate an infinitely iterated game, we let agents play 200 iterations of the game against each other and do not provide an agent with any information about the number of remaining iterations (Foerster et al., 2018). In an iteration, the state for an agent is the pair of actions played by both agents in the previous iteration. Each matrix game in Table 1 represents a different dilemma.

In the Prisoner's Dilemma, the rational policy for each agent is to defect, regardless of the policy of the other agent. However, when each agent plays rationally, each is worse off. In Matching Pennies, if an agent plays predictably, it is prone to exploitation by the other agent. Therefore, the optimal policy is to randomize between Heads and Tails, obtaining an average NDR of 0. The Stag Hunt game represents a coordination dilemma. In the game, given that the other agent will cooperate, an agent's optimal action is to cooperate as well. However, at each step, each agent has an attractive alternative: defecting, which yields a safer payoff. Therefore, the promise of a safer alternative and the fear that the other agent might select the safer choice could drive an agent to also select the safer alternative, thereby sacrificing the higher reward of mutual cooperation.

For our experiments on a social dilemma with visual observations, we use the Coin Game (Figure 1) (Foerster et al., 2018). The rational policy for an agent is to defect and try to pick all coins, regardless of the policy of the other agent. However, when both agents defect, both are worse off.

3.2 SQLoss

For our experiments with the Selfish Learner and the Status-Quo Aware Learner (SQLearner), we use policy gradient-based learning, where we train each agent with the Actor-Critic method (Sutton and Barto, 2011). Each agent is parameterized with a policy actor and a critic for variance reduction in policy updates. During training, we use gradient descent with separate step sizes for the actor and the critic. We use separate roll-out batch sizes for Lola-PG (Foerster et al., 2018) and for SQLearner, and an episode length of 200 for all iterated matrix games. We use a high discount rate (γ) for the Iterated Prisoners' Dilemma, Iterated Stag Hunt, and Coin Game, and a different value for Iterated Matching Pennies. The high value of γ allows for long time horizons, thereby incentivizing long-term reward. At each step, each agent randomly samples the imagined episode length κ (discussed in Appendix C).

3.3 GameDistill

GameDistill consists of two components. The first is the state-sequence encoder (Step 2, Section 2.4), which takes as input a sequence of states and outputs a feature representation. We encode each state in the sequence using a series of standard convolution layers with kernel size 3. We then use a fully-connected layer with 100 neurons that outputs a dense representation of the sequence of states. The picked-coin color, agent reward, and opponent reward branches consist of a series of dense layers with linear activation. We use linear activation so that we can cluster the feature vectors (embeddings) using a linear clustering algorithm, such as Agglomerative Clustering; we obtain similar results with the K-means clustering algorithm. We use the Binary Cross-Entropy (BCE) loss for classification and the mean-squared error (MSE) loss for regression, and optimize with Adam (Kingma and Ba, 2014).

The second component is the oracle network (Step 4, Section 2.4), which predicts an action for an input state. We encode the input state using convolution layers with ReLU activation. To predict the action, we use fully-connected layers with ReLU activation and the BCE loss. We use L2 regularization and gradient descent with the Adam optimizer.
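A minimal PyTorch sketch of the state-sequence encoder described above: convolution layers with kernel size 3, a 100-unit fully-connected embedding, and linear heads for the picked-coin color and both agents' rewards, trained with BCE and MSE losses. The framework choice, layer counts, channel widths, and input shapes are illustrative assumptions; the paper specifies only the kernel size, embedding width, activations, and losses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StateSequenceEncoder(nn.Module):
    """Encodes a short sequence of grid observations into a 100-dim embedding
    and predicts the picked-coin colour (classification) plus both agents'
    rewards (regression). Input shape: (batch, seq_len, channels, 3, 3)."""

    def __init__(self, in_channels=4, seq_len=4, grid=3, embed_dim=100):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU())
        self.embed = nn.Linear(seq_len * 32 * grid * grid, embed_dim)
        # Linear heads keep the embedding linearly separable for clustering.
        self.coin_colour = nn.Linear(embed_dim, 1)
        self.own_reward = nn.Linear(embed_dim, 1)
        self.opp_reward = nn.Linear(embed_dim, 1)

    def forward(self, x):
        b, t = x.shape[:2]
        h = self.conv(x.flatten(0, 1))      # fold the sequence into the batch
        z = self.embed(h.reshape(b, -1))    # 100-dim embedding used for clustering
        return z, self.coin_colour(z), self.own_reward(z), self.opp_reward(z)

def encoder_loss(colour_logit, colour_target, r1_pred, r1, r2_pred, r2):
    """BCE for the colour head, MSE for the two reward heads."""
    bce = F.binary_cross_entropy_with_logits(colour_logit, colour_target)
    mse = F.mse_loss(r1_pred, r1) + F.mse_loss(r2_pred, r2)
    return bce + mse
```

An Adam optimizer (torch.optim.Adam) over this combined loss matches the description above; the learning rate and batch size are not reproduced here.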

4 Results

4.1 Learning optimal policies in Iterated Matrix Games using SQLoss

Iterated Prisoner’s Dilemma (IPD):
Figure 4: Average NDR values for different learners in the IPD game. SQLearner agents obtain a near-optimal NDR value for this game. In contrast to other methods, SQLearner agents have close to zero variance across runs.

We train different learners to play the IPD game. Figure 4 shows the results. For all learners, agents initially defect, and their NDR moves towards the mutual-defection value. This initial bias towards defection is expected since, for agents trained with random game-play episodes, the benefits of exploitation outweigh the costs of mutual defection. For Selfish Learner (SL) agents, the bias intensifies, and the agents converge to mutually harmful selfish behavior. Lola-PG agents learn to predict each other's behavior and therefore realize that defection is more likely to lead to mutual harm than to selfish benefit. They subsequently move towards cooperation but occasionally defect. In contrast, SQLearner agents quickly realize the costs of defection, indicated by the small initial dip in the NDR curves. They subsequently move towards almost perfect cooperation. Finally, it is important to note that SQLearner agents have close to zero variance, unlike other methods where the variance in NDR across runs is significant.

Iterated Matching Pennies (IMP):
Figure 5: Average NDR values for different learners in the IMP game. SQLearner agents avoid exploitation by randomising between Heads and Tails to obtain a near-optimal NDR value (0) for this game. In contrast to other methods, SQLearner agents have close to zero variance across runs.

We train different learners to play the IMP game. The optimal policy for an agent to avoid exploitation is to play Heads or Tails perfectly randomly and obtain an NDR of 0. Figure 5 shows the results. SQLearner agents learn to play optimally and obtain an NDR close to 0. Interestingly, Selfish Learner and Lola-PG agents converge to an exploiter-exploited equilibrium where one agent consistently exploits the other. This asymmetric exploitation equilibrium is more pronounced for Selfish Learner agents than for Lola-PG agents. As before, we observe that SQLearner agents have close to zero variance across runs, unlike other methods where the variance in NDR across runs is significant.

Appendix B shows the results for the ISH game.

4.2 Evolving Cooperation in Games with visual input using GameDistill followed by SQLoss

4.2.1 The Coin Game: GameDistill

Figure 6: Representation of the clusters learned by GameDistill. Each point is a t-SNE projection of the 100-dimensional feature vector output by the encoder network for an input sequence of states. The figure on the left is colored based on the actual rewards obtained by each agent (Red agent followed by Blue agent). The figure on the right is colored based on the clusters learned by GameDistill. GameDistill correctly identifies two types of state sequences, one for cooperation (blue cluster) and the other for defection (red cluster).
Figure 7: Probability of an agent picking a coin of its own color for learners trained in the Coin Game. SQLearner agents cooperate (pick only their own coins) to achieve a near-optimal strategy in the game. In contrast to Lola-PG, SQLearner agents have close to zero variance across runs.

To evaluate the clustering step in GameDistill, we make two t-SNE (Maaten and Hinton, 2008) plots of the 100-dimensional feature vectors extracted from the last layer of the encoder network. The first plot colors each point (state sequence) by the rewards obtained by both agents in the sequence. The second plot colors each point by the cluster label output by Agglomerative Clustering. Figure 6 shows the results. GameDistill correctly learns two clusters, one for state sequences that represent cooperation and the other for state sequences that represent defection. We also experiment with different feature vector dimensions and obtain similar clustering results with sufficient training. Once we have the clusters, we train the oracle networks using the state sequences in each cluster. To verify that the trained oracles represent a cooperation and a defection policy, we modify the Coin Game environment to contain only the Red agent. We then play two variations of the game. In the first variation, the Red agent is forced to play the action suggested by the first oracle. In this variation, we find that the Red agent rarely picks the Blue coin, indicating a high rate of cooperation. Therefore, the first oracle represents a cooperation policy. In the second variation, the Red agent is forced to play the action suggested by the second oracle. In this case, we find that the Red agent picks the Blue coin in a large fraction of instances, indicating a high rate of defection. Hence, the second oracle represents a defection policy. Therefore, the oracles learned by the Red agent using GameDistill represent cooperation and defection policies.
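The cluster-quality check described above can be sketched with scikit-learn and matplotlib as below, assuming the 100-dimensional embeddings and two numeric label arrays (reward codes and cluster labels) are already available; variable names are illustrative.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_clusters(embeddings, reward_labels, cluster_labels):
    """Project the 100-dim embeddings to 2-D with t-SNE and colour the points
    by (left) observed rewards and (right) the learned cluster label."""
    xy = TSNE(n_components=2, random_state=0).fit_transform(embeddings)
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    axes[0].scatter(xy[:, 0], xy[:, 1], c=reward_labels, cmap="coolwarm", s=10)
    axes[0].set_title("Coloured by rewards")
    axes[1].scatter(xy[:, 0], xy[:, 1], c=cluster_labels, cmap="coolwarm", s=10)
    axes[1].set_title("Coloured by cluster label")
    plt.tight_layout()
    plt.show()
```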

4.2.2 The Coin Game: SQLoss

Before playing the game, each agent uses GameDistill to learn its cooperation and defection oracles. During game-play, at each step, an agent follows either the action suggested by its cooperation oracle or the action suggested by its defection oracle. Further, each agent has an additional SQLoss term. We compare approaches using the degree of cooperation between agents, measured by the probability of an agent picking the coin of its own color (Foerster et al., 2018). Figure 7 shows the results. For SQLearner agents, the probability that an agent will pick the coin of its own color is close to 1. This high probability indicates that the other agent is cooperating with it and only picking coins of its own color. In contrast, the probability that a Lola-PG agent will pick its own coin is much smaller, indicating higher rates of defection. As expected, the probability of an agent picking its own coin is the smallest for Selfish Learner (SL) agents; a probability value close to 0.5 indicates that a selfish learner is just as likely to pick the other agent's coin as its own.
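One way to picture the game-play loop in this section is as a wrapper that shrinks each agent's action space to two meta-actions, follow the cooperation oracle or follow the defection oracle, over which SQLoss is then applied. The sketch below is illustrative only; the environment API and the oracle interface (an act method) are assumptions, not the released code.

```python
class OracleActionWrapper:
    """Wraps a two-player visual environment so that each agent chooses between
    meta-action 0 (follow its cooperation oracle) and 1 (follow its defection
    oracle); SQLoss is then applied over these binary meta-actions."""

    def __init__(self, env, agent1_oracles, agent2_oracles):
        self.env = env
        # Each entry: (cooperation_oracle, defection_oracle) for that agent.
        self.oracles = (agent1_oracles, agent2_oracles)

    def reset(self):
        self.obs = self.env.reset()
        return self.obs

    def step(self, meta_action1, meta_action2):
        # Translate each meta-action into a low-level move via the chosen oracle.
        a1 = self.oracles[0][meta_action1].act(self.obs)
        a2 = self.oracles[1][meta_action2].act(self.obs)
        self.obs, rewards, done = self.env.step(a1, a2)
        return self.obs, rewards, done
```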

4.3 SQLearner: Exploitability and Adaptability

Given that an SQLearner agent does not have any prior information about the other agent, it is important that it evolves its strategy based on the strategy of its opponent. To evaluate an SQLearner agent's ability to avoid exploitation by a selfish agent, we train one SQLearner agent against an agent that always defects in the Coin Game. We find that the SQLearner agent also learns to always defect. This persistent defection is important: given that the other agent is selfish, the SQLearner agent can do no better than to also be selfish. To evaluate an SQLearner agent's ability to exploit a cooperative agent, we train one SQLearner agent against an agent that always cooperates in the Coin Game. In this case, we find that the SQLearner agent learns to always defect. This persistent defection is important: given that the other agent is cooperative, the SQLearner agent obtains the maximum reward by behaving selfishly. Hence, the SQLearner agent is both resistant to exploitation and able to exploit, depending on the strategy of the other agent.

5 Conclusion

We have described a status-quo loss (SQLoss) that encourages an agent to imagine the consequences of sticking to the status-quo. We demonstrated how agents trained with SQLoss evolve cooperative behavior in several social dilemmas without sharing rewards, gradients, or using a communication channel. To work with visual-input games, we proposed GameDistill, an approach that automatically extracts a cooperative and a selfish policy from a social dilemma game. We combined GameDistill and SQLoss to demonstrate how agents evolve desirable cooperative behavior in a social dilemma game with visual observations.

References

  • D. Abreu, D. Pearce, and E. Stacchetti (1990) Toward a theory of discounted repeated games with imperfect monitoring. Econometrica 58 (5), pp. 1041–1063. External Links: ISSN 00129682, 14680262, Link Cited by: §1.
  • R. Axelrod (1984) The evolution of cooperation. Basic Books. Cited by: §1.
  • D. Banerjee and S. Sen (2007) Reaching pareto-optimality in prisoner’s dilemma using conditional joint action learning. Autonomous Agents and Multi-Agent Systems 15 (1). External Links: ISSN 1387-2532 Cited by: §1.
  • A. Code (2019) MARL with sqloss. GitHub. Note: https://github.com/user12423/MARL-with-SQLoss/ Cited by: item 3, item 2, item 7, §3.
  • S. Damer and M. Gini (2008) Achieving cooperation in a minimally constrained environment. Vol. 1, pp. 57–62. Cited by: §1.
  • E. M. de Cote, A. Lazaric, and M. Restelli (2006) Learning to cooperate in multi-agent social dilemmas. In Proceedings of the Fifth International Joint Conference on Autonomous Agents and Multiagent Systems, AAMAS ’06. Cited by: §1.
  • T. Dietz, E. Ostrom, and P. C. Stern (2003) The struggle to govern the commons. Science 302 (5652), pp. 1907–1912. External Links: Document Cited by: §1.
  • J. Foerster, R. Y. Chen, M. Al-Shedivat, S. Whiteson, P. Abbeel, and I. Mordatch (2018) Learning with opponent-learning awareness. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp. 122–130. Cited by: §1, §1, §1, §2.1, §2.1, §2.1, §3.1, §3.1, §3.2, §3, §4.2.2.
  • D. Fudenberg, D. Levine, and E. Maskin (1994) The folk theorem with imperfect public information. Econometrica 62 (5), pp. 997–1039. External Links: ISSN 00129682, 14680262, Link Cited by: §1.
  • D. Fudenberg and E. Maskin (1986) The folk theorem in repeated games with discounting or with incomplete information. Econometrica 54 (3), pp. 533–554. External Links: ISSN 00129682, 14680262 Cited by: §1.
  • E. J. Green and R. H. Porter (1984) Noncooperative Collusion under Imperfect Price Information. Econometrica 52 (1), pp. 87–100. Cited by: §1.
  • G. Hardin (1968) The tragedy of the commons. Science 162 (3859), pp. 1243–1248. External Links: Document Cited by: §1.
  • E. Hughes, J. Z. Leibo, M. Phillips, K. Tuyls, E. Dueñez-Guzman, A. G. Castañeda, I. Dunning, T. Zhu, K. McKee, R. Koster, H. Roff, and T. Graepel (2018) Inequity aversion improves cooperation in intertemporal social dilemmas. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18. Cited by: §1.
  • P. Julien, J. Leibo, V. Zambaldi, C. Beattie, K. Tuyls, and T. Graepel (2017) A multi-agent reinforcement learning model of common-pool resource appropriation. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17. Cited by: §1.
  • Y. Kamada and S. Kominers (2010) Information can wreck cooperation: a counterpoint to kandori (1992). Economics Letters 107, pp. 112–114. External Links: Document Cited by: §1.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.3.
  • M. Kleiman-Weiner, M. K. Ho, J. L. Austerweil, M. L. Littman, and J. B. Tenenbaum (2016) Coordinate to cooperate or compete: abstract goals and joint intentions in social interaction. In CogSci, Cited by: §1.
  • K. Lee and K. Louis (1967) The application of decision theory and dynamic programming to adaptive control systems. Ph.D. Thesis. Cited by: §3.1.
  • J. Z. Leibo, V. Zambaldi, M. Lanctot, J. Marecki, and T. Graepel (2017) Multi-agent reinforcement learning in sequential social dilemmas. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, AAMAS ’17. Cited by: §1.
  • A. Lerer and A. Peysakhovich (2017) Maintaining cooperation in complex social dilemmas using deep reinforcement learning. External Links: arXiv:1707.01068 Cited by: §1, §2.2.
  • R. D. Luce and H. Raiffa (1989) Games and decisions: introduction and critical survey. Courier Corporation. Cited by: §3.1.
  • L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605. Cited by: §4.2.1.
  • M. Macy and A. Flache (2002) Learning dynamics in social dilemmas. Proceedings of the National Academy of Sciences of the United States of America 99 (Suppl 3), pp. 7229–7236. External Links: Document Cited by: §1.
  • M. Bowling and M. Veloso (2002) Multiagent learning using a variable learning rate. Artificial Intelligence 136 (2), pp. 215–250. External Links: ISSN 0004-3702 Cited by: §1.
  • M. A. Nowak and K. Sigmund (1992) Tit for tat in heterogeneous populations. Nature 355 (6357), pp. 250–253. Cited by: §1.
  • M. A. Nowak and K. Sigmund (1998) Evolution of indirect reciprocity by image scoring. Nature 393 (6685), pp. 573–577. Cited by: §1.
  • M. Nowak and K. Sigmund (1993) A strategy of win-stay, lose-shift that outperforms tit-for-tat in the prisoner’s dilemma game. Nature 364, pp. 56–8. External Links: Document Cited by: §1.
  • H. Ohtsuki, C. Hauert, E. Lieberman, and M. A. Nowak (2006) A simple rule for the evolution of cooperation on graphs and social networks. Nature 441 (7092), pp. 502–505. External Links: ISSN 1476-4687, Document, Link Cited by: §1.
  • E. Ostrom (1990) Governing the commons-the evolution of institutions for collective actions. Political economy of institutions and decisions. Cited by: §1.
  • E. Ostrom, J. Burger, C. B. Field, R. B. Norgaard, and D. Policansky (1999) Revisiting the commons: local lessons, global challenges. Science 284 (5412), pp. 278–282. External Links: Document Cited by: §1.
  • A. Peysakhovich and A. Lerer (2018) Consequentialist conditional cooperation in social dilemmas with imperfect information. In International Conference on Learning Representations, ICLR 2018,Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, Cited by: §1.
  • J. Pineau (2019) The Machine Learning Reproducibility Checklist. Note: https://www.cs.mcgill.ca/~jpineau/ReproducibilityChecklist.pdf External Links: Link Cited by: Appendix E.
  • T. W. Sandholm and R. H. Crites (1996) Multiagent reinforcement learning in the iterated prisoner’s dilemma.. Bio Systems 37 1-2, pp. 147–66. Cited by: §1.
  • F. Santos and J. Pacheco (2006) A new route to the evolution of cooperation. Journal of evolutionary biology 19, pp. 726–33. External Links: Document Cited by: §1.
  • R. S. Sutton and A. G. Barto (2011) Reinforcement learning: an introduction. Cited by: §3.2.
  • R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour (2000) Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pp. 1057–1063. Cited by: §2.2.
  • R. Trivers (1971) The evolution of reciprocal altruism. Quarterly Review of Biology 46, pp. 35–57.. External Links: Document Cited by: §1.
  • J. X. Wang, E. Hughes, C. Fernando, W. M. Czarnecki, E. A. Duéñez-Guzmán, and J. Z. Leibo (2019) Evolving intrinsic motivations for altruistic behavior. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS ’19, pp. 683–692. External Links: ISBN 978-1-4503-6309-9 Cited by: §1.
  • R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8 (3-4), pp. 229–256. Cited by: §2.2.
  • M. Wunder, M. Littman, and M. Babes (2010) Classes of multiagent Q-learning dynamics with ε-greedy exploration. In Proceedings of the 27th International Conference on Machine Learning, ICML'10. Cited by: §1.
  • C. Yu, M. Zhang, F. Ren, and G. Tan (2015) Emotional multiagent reinforcement learning in spatial social dilemmas. IEEE Transactions on Neural Networks and Learning Systems 26 (12), pp. 3083–3096. Cited by: §1.

Supplementary Material

Appendix A Illustrations of Trained Oracle Networks for the Coin Game

Figure 8: Illustrative predictions of the oracle networks learned by the Red agent using GameDistill in the Coin Game. The cooperation oracle suggests an action that avoids picking the coin of the other agent. The defection oracle suggests an action that involves picking the coin of the other agent.

Figure 8 shows the predictions of the oracle networks learned by the Red agent using GameDistill in the Coin Game. We see that the cooperation oracle suggests an action that avoids picking the coin of the other agent (the Blue coin). Analogously, the defection oracle suggests a selfish action that picks the coin of the other agent.

Appendix B Results for the Iterated Stag Hunt using SQLoss

Figure 9 shows the results of training two SQLearner agents on the Iterated Stag Hunt game. SQLearner agents coordinate successfully to obtain a near-optimal NDR value (0) for this game.

Figure 9: NDR values for SQLearner agents in the ISH game. SQLearner agents coordinate successfully to obtain a near-optimal NDR value (0) for this game.

Appendix C: Effect of $\kappa_{max}$ on convergence to cooperation

We explore the effect of the hyper-parameter $\kappa_{max}$ (Section 2.3.2) on convergence to cooperation. To imagine the consequences of maintaining the status-quo, each agent samples $\kappa$ from the Discrete Uniform distribution $\mathcal{U}\{1, \kappa_{max}\}$. Therefore, a larger value of $\kappa_{max}$ implies a larger expected value of $\kappa$ and longer imaginary episodes. We find that a larger $\kappa_{max}$ (and hence $\kappa$) leads to faster cooperation between agents in the IPD and the Coin Game. This effect plateaus beyond a certain value of $\kappa_{max}$, which we then select for our experiments.

Appendix D Architecture Details

We performed all our experiments on an AWS instance with the following specifications.

  • Model name: Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz

  • RAM: 189GB

  • CPU(s): 96

  • Architecture: x86_64

  • Thread(s) per core: 2

Appendix E Reproducibility Checklist

We follow the reproducibility checklist from  (Pineau, 2019) and include further details here. For all the models and algorithms we have included details that we think would be useful for reproducing the results of this work.

  • For all models and algorithms presented, check if you include:

    1. A clear description of the mathematical setting, algorithm, and/or model: Yes. The algorithm is described in detail in Section 2, with all the loss functions used for training clearly defined. The details of the architecture, hyperparameters used, and other algorithm details are given in Section 3. Environment details are explained in the sections in which they are introduced.

    2. An analysis of the complexity (time, space, sample size) of any algorithm: No. We do not include a formal complexity analysis of our algorithm. However, we do highlight the additional computational steps (in terms of losses and parameter updates) in Section 2 over standard multi-agent independently learning RL algorithms that would be needed in our approach.

    3. A link to a downloadable source code, with specification of all dependencies, including external libraries.: Yes. We have made the source code available at Code (2019).

  • For any theoretical claim, check if you include:

    1. A statement of the result: NA. Our paper is primarily empirical and we do not have any major theoretical claims. Hence this is Not Applicable.

    2. A clear explanation of any assumptions: NA.

    3. A complete proof of the claim: NA.

  • For all figures and tables that present empirical results, check if you include:

    1. A complete description of the data collection process, including sample size: NA. We did not collect any data for our work.

    2. A link to a downloadable version of the dataset or simulation environment: Yes. We have made the source code available at Code (2019).

    3. An explanation of any data that were excluded, description of any pre-processing step: NA. We did not perform any pre-processing step.

    4. An explanation of how samples were allocated for training / validation / testing: Yes. The details regarding the data used for training are given in Section 2.4. The number of iterations used for learning (training) is shown in Figures 4, 5, and 7. The details of the number of runs and the batch sizes used for various experiments are given in Section 3.

    5. The range of hyper-parameters considered, method to select the best hyper-parameter configuration, and specification of all hyper-parameters used to generate results: Yes. We did not do any hyperparameter tuning as part of this work. All the hyperparameters that we used are specified in Section 3.

    6. The exact number of evaluation runs: Yes. For all our environments, we repeat the experiment multiple times. For evaluation of performance, we use an average of Monte Carlo estimates. We state this in Section 3. We do not fix any seeds. The details of the number of runs and the batch sizes used for various experiments are also given there.

    7. A description of how experiments were run: Yes. The README with instructions on how to run the experiments along with the source code is provided at Code (2019).

    8. A clear definition of the specific measure or statistics used to report results: Yes. We plot the mean and the one standard deviation region over the mean for all our numerical experiments. This is stated in Section 3.

    9. Clearly defined error bars: Yes. We plot the mean and the one standard deviation region over the mean for all our numerical experiments. This is stated in Section 3.

    10. A description of results with central tendency (e.g. mean) & variation (e.g. stddev): Yes. We plot the mean and the one standard deviation region over the mean for all our numerical experiments. This is stated in Section 3.

    11. A description of the computing infrastructure used: Yes. We have provided this detail in the Supplementary material in Section D.