Inducing Cooperation in Multi-Agent Games Through Status-Quo Loss

01/15/2020 ∙ by Pinkesh Badjatiya, et al.

Social dilemma situations bring out the conflict between individual and group rationality. When individuals act rationally in such situations, the group suffers sub-optimal outcomes. The Iterative Prisoner's Dilemma (IPD) is a two-player game that offers a theoretical framework to model and study such social situations. In the Prisoner's Dilemma, individualistic behavior leads to mutual defection and sub-optimal outcomes. This result is in contrast to what one observes in human groups, where humans often sacrifice individualistic behavior for the good of the collective. It is interesting to study how and why such cooperative and individually irrational behavior emerges in human groups. To this end, recent work models this problem by treating each player as a Deep Reinforcement Learning (RL) agent and evolves cooperative behavioral policies through internal information or reward sharing mechanisms. We propose an approach to evolve cooperative behavior between RL agents playing the IPD game without sharing rewards, internal details (weights, gradients), or a communication channel. We introduce a Status-Quo loss (SQLoss) that incentivizes cooperative behavior by encouraging policy stationarity. We also describe an approach to transform a two-player game (with visual inputs) into its IPD formulation through self-supervised skill discovery (IPDistill). We show how our approach outperforms existing approaches in the Iterative Prisoner's Dilemma and the two-player Coin game.




1. Introduction

Social dilemma situations often bring out the contradiction between individual and group rationality. In such situations, individually rational choices lead to worse outcomes for the group Hardin1968; ostrom:1990; Ostrom278; Dietz1907. For instance, when sharing a pool of resources (such as a fishing pond or forest), individual rationality suggests overuse since the costs incurred through a single agent’s selfishness are distributed equally across all agents while the rewards accrue to the selfish agent. Such collective individual rationality (or selfishness) leads to outcomes that are harmful to the group (e.g., deforestation, draining the resource pool). Further, several studies have shown that humans do not exhibit such individual rationality in similar situations Ostrom278; Ostrom:1993. Therefore, it is unclear why and how humans have evolved this individually irrational (or cooperative) behavior.

Game theorists have been studying the development of cooperation between rational agents Fudenberg; Edward:1984; Fudenberg:1984; Kamada:2010; Abreu:1990, and the Prisoner’s Dilemma Rapoport1974 provides a sound theoretical framework to study this conflict between individual and group rationality. The question is: How can one incentivize agents to evolve cooperative behavior in these kinds of situations? A large body of work has looked at the mechanism of emergence of cooperation through reciprocal behaviour and indirect reciprocity reciprocal1971; reciprocal1984; reciprocal1992; reciprocal1993; in_reciprocal1998, through variants of reinforcement using aspiration reinforce_variant, attitude NonRL_attitude or multi-agent reinforcement learning Sandholm1996MultiagentRL; Wunder:2010, and under specific conditions R_plus_S_g_2P using different learning rates deCote:2006 similar to WoLF WOLF2002, as well as using embedded emotion Emotional_Multiagent and social networks Ohtsuki2006; Santos:2006.

Figure 1. In the Coin Game, two agents (Red and Blue) appear at random positions in a 3x3 grid. A coin of color Red or Blue appears at a random location. Agents traverse the grid to pick up coins, which maximises their reward. For each agent, eating a coin of any color gives a reward, but eating a coin of the opponent’s color also penalizes the opponent. The best strategy (which maximizes long-term reward) requires cooperating with the opponent.

However, these approaches do not directly apply to Deep Reinforcement Learning agents. Recent work kleiman2016coordinate; Julien:2017; consequentialist18, in this direction, focuses on letting agents learn strategies in multi-agent settings through interaction with other agents. Leibo:2017 define the problem of social dilemma in the deep reinforcement learning framework and analyze the outcome of the 2d Fruit gathering game Julien:2017 as a general sum matrix game. They vary the abundance of resources and the cost of conflict to generate degrees of cooperation between agents. Hughes:2018 define an intrinsic reward (inequality aversion) that attempts to reduce the difference in rewards among agents by having an aversion to both advantageous (guilt) and disadvantageous (unfairness) inequity. This handcrafting of loss with mutual fairness leads to better cooperation but also makes the agent vulnerable to exploitation. LOLA foerster2018learning achieves high levels of cooperation in the Coin Game (see Figure 1) and in the Iterative Prisoner’s Dilemma. However, LOLA assumes access to the opponent’s policy parameters and gradients. This access is analogous to getting complete access to the opponent agent’s brain and devising a strategy with complete knowledge of how they are going to play. Wang:2019 propose an evolutionary deep reinforcement learning setup to evolve cooperation. They define an intrinsic reward that is based on features generated from the agent’s past and future rewards, and this reward is shared with other agents. They use evolution to maximize the sum of rewards among the agents and thus achieve cooperation. However, sharing rewards in this indirect way enforces cooperative behavior rather than evolving it through individual behavior.

We propose an approach to evolve cooperative behavior between RL agents without sharing internal details (weights, gradients, etc.) or a communication channel. We introduce a Status-Quo loss (SQLoss) that can be thought of as a stable strategy loss which encourages agents to explore stationary policies. By stationary policy, we mean a policy (which defines the learning agent’s way of behaving at a given time) where agents have an incentive to stick to their current actions as opposed to changing actions.

We evaluate the policy gradient learner with Status-Quo Loss on the IPD & Iterated Matching Pennies (IMP) and show that RL agents trained with the Status Quo loss converge to a cooperative behavior. This result is particularly interesting because behavioral economists samuelson1988status; kahneman1991anomalies; kahneman2011thinking; thaler2009nudge have studied the effect of the status-quo bias in decision making in humans. Our work shows that it is possible that the status-quo bias helps to evolve cooperative behavior in human groups.

To work with games that have visual input foerster2018learning, we extend the idea in Wang:2018. They start with two policies, cooperation and defection. They use these policies to generate intermediate policies with varying degrees of cooperation and defection. During gameplay, each agent predicts the degree of cooperation of its opponent and chooses a policy of the same degree of cooperation. We propose an IPD-Distilling architecture (IPDistill) to transform a multi-agent game to its IPD formulation. IPDistill uses self-supervision and clustering to learn cooperation and defection policies automatically. We use IPDistill to transform a game to an IPD, followed by SQLoss-enhanced agents to evolve cooperative behavior for RL agents in the Coin game foerster2018learning. We show that RL agents trained using our approach achieve higher levels of cooperation and average reward than those trained with existing approaches (Section 4).

2. Approach

Figure 2. Converting a Naïve Learner into a Status-Quo aware Learner using the imaginary -Stationary Environment intuition. In the figure, K refers to parameter as used in the paper Section 2.1.3
Figure 3. Framework Overview. Some parts of the network are representational; actual parameters are provided in Section 3

In this section, we formally describe our problem setup by borrowing the notation from foerster2018learning. A multi-agent game is specified by a tuple $G = \langle S, A, U, P, r, \gamma \rangle$. In the game, agents, $a \in A$, select actions, $u^a \in U$, to reach state $s \in S$ in the environment. The joint action vector $\mathbf{u}$ results in a state transition according to the transition function $P(s' \mid s, \mathbf{u})$. The reward functions $r^a(s, \mathbf{u})$ denote the rewards and $\gamma \in [0, 1)$ is the discount factor. For each agent, $a$, the discounted future return from time $t$ is defined as $R^a_t = \sum_{l=0}^{\infty} \gamma^l r^a_{t+l}$.

Each agent independently attempts to maximize its expected total discounted reward. Several approaches to solve this use policy gradient-based methods sutton2000policy (for example, REINFORCE williams1992simple). Policy gradient methods update the policy of an agent $a$, parameterized by $\theta^a$, by performing gradient ascent on the expected discounted total reward $\mathbb{E}[R^a_0]$. In the Iterative Prisoner’s Dilemma (IPD), agents trained with this update method converge to a sub-optimal mutual defection equilibrium Peysakhovich1707.01068. The IPD game has two agents, 1 and 2. Each agent has two possible actions, Cooperate (denoted by C) and Defect (denoted by D). Table 1(a) shows the reward matrix for all possible combinations of actions. The Nash Equilibrium (NE) in a single-step Prisoner’s Dilemma is (D, D) Game_Theory_1991. The notation (A, B) denotes that the first agent takes action A and the second agent takes action B. This equilibrium describes the scenario when both players always defect. According to the Folk theorem folk_theorem; Fudenberg, the IPD has infinitely many Nash equilibria, out of which (D, D) is the most commonly occurring NE in the deep reinforcement learning setup foerster2018learning. From the reward matrix, it is clear that both players would do better if they played (C, C). However, this is not a stable equilibrium because agent 1 can improve its reward by playing D instead of C since $r^1(D, C) > r^1(C, C)$. Here, $r^1$ denotes the reward for agent 1. The same reasoning holds for agent 2. Therefore, both players have an incentive to switch from C to D. In contrast, if the two players are playing (D, D), then neither has an incentive to change their action. This sub-optimal equilibrium attained by policy gradients motivates us to explore other techniques that could lead to a mutually beneficial equilibrium.
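The equilibrium argument above can be checked mechanically. The following is a minimal sketch, assuming the payoffs of Table 1(a) are laid out with Cooperate before Defect in both rows and columns; the names `PAYOFFS`, `is_nash`, and `pure_nash_equilibria` are illustrative and not from the paper.

```python
from itertools import product

# Payoffs (agent 1, agent 2) taken from Table 1(a).
# Assumption: rows/columns are ordered Cooperate ("C") then Defect ("D").
PAYOFFS = {
    ("C", "C"): (-1, -1),
    ("C", "D"): (-3, 0),
    ("D", "C"): (0, -3),
    ("D", "D"): (-2, -2),
}
ACTIONS = ("C", "D")


def is_nash(a1, a2, payoffs=PAYOFFS):
    """True if neither agent gains by unilaterally changing its action."""
    r1, r2 = payoffs[(a1, a2)]
    agent1_ok = all(payoffs[(alt, a2)][0] <= r1 for alt in ACTIONS)
    agent2_ok = all(payoffs[(a1, alt)][1] <= r2 for alt in ACTIONS)
    return agent1_ok and agent2_ok


def pure_nash_equilibria(payoffs=PAYOFFS):
    """Enumerate all pure-strategy Nash equilibria of the single-step game."""
    return [(a1, a2) for a1, a2 in product(ACTIONS, ACTIONS)
            if is_nash(a1, a2, payoffs)]
```

Calling `pure_nash_equilibria()` on these payoffs returns only mutual defection, while `is_nash("C", "C")` is false because either agent can raise its own reward from -1 to 0 by defecting.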

When RL agents optimize for individual rewards in an IPD, they reach a state of mutual defection, as discussed earlier. Here, we propose an approach to evolve cooperative behavior in two-player IPD games. Our approach has two components.

  • In Section 2.1, we describe the Status-Quo loss (SQLoss). We show how agents trained with an additional SQLoss evolve cooperative behavior in an IPD setting.

  • In Section 2.2, we describe the architecture of the IPDistill component, which uses self-supervised skill discovery to transform a game into its IPD formulation.

We then show how applying SQLoss along with IPDistill evolves cooperative behavior between agents in the Coin game.

2.1. SQLoss: Encouraging a Stationary Policy

2.1.1. Naïve Learner (NL)

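Following the notation of foerster2018learning, let $\theta^a$ denote a parameterized version of agent $a$’s policy and $V^a(\theta^1, \theta^2)$ denote the expected total discounted reward for agent $a$. Here $V^a$ is a function of both agents’ policy parameters $(\theta^1, \theta^2)$. The objective of a Naïve Learner (NL) at iteration $i$ is to update $\theta^1_i$ to $\theta^1_{i+1}$ such that it maximizes its expected total discounted reward $V^1(\theta^1, \theta^2_i)$.

$\theta^1_{i+1}$ is updated as follows:

$$\theta^1_{i+1} = \arg\max_{\theta^1} V^1(\theta^1, \theta^2_i) \qquad (1)$$

The gradient ascent rule to update $\theta^1_{i+1}$ in reinforcement learning is given by:

$$\theta^1_{i+1} = \theta^1_i + \delta \, \nabla_{\theta^1_i} V^1(\theta^1_i, \theta^2_i) \qquad (2)$$

where $\delta$ is the step size of the updates. Agents trained to optimize for individual rewards evolve mutually harmful behavior and subsequently obtain a lower average reward.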
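To illustrate why naïve gradient ascent ends in mutual defection, here is a minimal sketch on a simplified, single-step version of the game (not the iterated setup or the actor-critic learner used in the experiments). Each agent holds one logit whose sigmoid is its probability of cooperating, and the gradient is approximated by central differences. The payoffs follow Table 1(a); all function names and the step size are assumptions made for illustration.

```python
import math

# Reward of an agent, keyed by (own action, opponent action); from Table 1(a).
R = {("C", "C"): -1.0, ("C", "D"): -3.0, ("D", "C"): 0.0, ("D", "D"): -2.0}


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def value(theta_self, theta_other):
    """Expected single-step reward of an agent given both agents' logits."""
    p, q = sigmoid(theta_self), sigmoid(theta_other)  # P(cooperate)
    return (p * q * R[("C", "C")] + p * (1 - q) * R[("C", "D")]
            + (1 - p) * q * R[("D", "C")] + (1 - p) * (1 - q) * R[("D", "D")])


def grad(theta_self, theta_other, eps=1e-6):
    """Central-difference gradient of an agent's value w.r.t. its own logit."""
    return (value(theta_self + eps, theta_other)
            - value(theta_self - eps, theta_other)) / (2 * eps)


def train_naive(theta1=0.0, theta2=0.0, step=1.0, iters=200):
    """Both agents follow independent gradient ascent on their own value."""
    for _ in range(iters):
        g1 = grad(theta1, theta2)
        g2 = grad(theta2, theta1)  # the game is symmetric, so roles swap
        theta1 += step * g1
        theta2 += step * g2
    return sigmoid(theta1), sigmoid(theta2)
```

In this toy setting the gradient with respect to the probability of cooperating is negative regardless of the opponent's policy, so both cooperation probabilities shrink toward zero even though mutual cooperation would pay more.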

2.1.2. -Stationary Environment Learner

Here we introduce the intuition behind our Status-Quo loss. A Naïve Learner plays the game in an environment and gathers the experience for updates defined by Equation 2. If the environment enforces a stationary-policy (defined earlier in Section 1) for horizons, then the action an agent takes at time-step will be executed for the next future time-steps. Since the environment is fair to all agents, the time-step enforced stationariness is synchronous to both agents. This means both the agents are forced to stick with the same action every steps. Figure 2 shows the states, rewards, and actions for such a scenario. A Naïve learner playing in an -stationary environment is incentivized to make decisions that optimize long-term rewards while having a stationary policy.

The stationary policy reduces the likelihood that an agent will defect in a mutually cooperative situation. This is because the policy encourages sticking to the current action. Further, optimizing the long-term reward with a stationary policy reduces the impact of the short-term gains introduced by defection. Therefore, agents playing an IPD game in an -stationary environment are more likely to evolve cooperative behavior.

However, in general, it might not be possible to alter environment dynamics to enforce stationarity. Therefore, we introduce a Status-Quo loss that encourages a stationary policy and, subsequently, cooperative behavior through a separate loss term.
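The enforced stationarity described above can be pictured as an environment that only lets agents pick a new action every fixed number of steps and repeats the joint action in between. The sketch below is a hypothetical illustration: the parameter `k` stands in for the stationarity horizon, and the policy interface (a callable from the previous joint action to "C" or "D") is an assumption, not the paper's implementation.

```python
def rollout_stationary(policy1, policy2, horizon, k):
    """Roll out a game in which both agents may only change actions every k steps.

    policy1, policy2: callables mapping the previous joint action
    (or None at the start) to "C" or "D".
    Returns the list of joint actions actually executed at each step.
    """
    history = []
    previous = None
    current = None
    for t in range(horizon):
        if t % k == 0:
            # Decision point: both agents choose synchronously.
            current = (policy1(previous), policy2(previous))
        # Between decision points the environment repeats the joint action.
        history.append(current)
        previous = current
    return history


def flip_own_action(index):
    """Toy policy: start with C, then alternate between C and D."""
    def policy(previous):
        if previous is None or previous[index] == "D":
            return "C"
        return "D"
    return policy
```

For example, `rollout_stationary(flip_own_action(0), flip_own_action(1), horizon=6, k=2)` yields two steps of mutual cooperation, two of mutual defection, then two of cooperation again, so every choice is held for `k` steps before it can be revised.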

2.1.3. Status-Quo aware Learner (SQLoss)

Let denote an episode of actual experiences of the agent in the environment ( is the horizon of the episode) as , while denotes the imaginary episode of the agent as a result of maintaining the Status-quo. denotes the trajectory from time , and is states, actions and rewards at time repeated times, and (>=1) denotes the length of the imaginary game-play of the agents.


here, is

sampled from Discrete Uniform distribution

(both inclusive, is a hyper-parameter ) at time .

The discounted return for agent at time-step is . The imaginary status-quo discounted return is , which is defined by:


and are approximated by and respectively. These are the expected rewards conditioned on the agent’s policies (, ) respectively. For learner 1, the regular gradients and the Status-Quo gradients, and , can be derived from the policy gradient formulation as,



is a baseline for variance reduction,

is the discount rate.

Then the update rule for the policy gradient-based Status-Quo-learner (SQL-PG) is,


where and denotes the loss scaling factor for the reinforce and the imaginative gameplay, respectively.
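One possible reading of the imaginary status-quo return is sketched below. It assumes that the imagined episode repeats the reward observed at time t for a sampled number of steps and then continues with the rest of the real trajectory; the exact equations are not recoverable from the text here, so this is an illustration of the idea rather than the paper's definition. The names `status_quo_return`, `sample_repeat_length`, and the upper bound `z` are hypothetical.

```python
import random


def discounted_return(rewards, gamma):
    """Sum of gamma**i * rewards[i], accumulated from the end."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def sample_repeat_length(z, rng=random):
    """Draw the imagined repeat length uniformly from {1, ..., z}, both inclusive."""
    return rng.randint(1, z)


def status_quo_return(rewards, t, repeats, gamma):
    """Imagined return from time t if the reward at t is held for `repeats` steps.

    Assumption: after the repeated block, the real trajectory resumes at t + 1.
    """
    imagined = [rewards[t]] * repeats + list(rewards[t + 1:])
    return discounted_return(imagined, gamma)
```

With `repeats=1` the imagined return coincides with the ordinary discounted return from time t, which is a convenient sanity check; larger values weight the current joint outcome more heavily, which is what makes sticking with a good status quo attractive.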

2.2. IPDistill: Reducing a Game to an IPD formulation

In the previous section, we focus on the IPD formulation of the games, where agents are only allowed to either cooperate or defect at each step. Table 1 tabulates the payoff matrices of the iterated versions of different games. In iterated games, agents interact by playing a stage game (such as the prisoner’s dilemma) numerous times.

We propose an approach (IPDistill) to transform a two-player game into its IPD formulation. We explain the idea using the Coin game described in foerster2018learning. Figure 3 shows a high-level description of IPDistill. First, we initialize RL agents with random weights and pit them against each other in the Coin game (Figure 1). In the game, whenever an agent captures a coin, it obtains a reward, depending on its color and the color of the coin it captured. A cooperative strategy in the Coin game is one where agents only capture coins of their color. In contrast, a defection strategy is one where agents capture all the available coins. Therefore, the notion of cooperation or defection in the Coin game is closely linked to the color of the coins. Whenever an agent receives a reward (when it picks a coin), we store a sequence of the last three states up to the current state. This collection of sequences is used to train the IPDistill network. The network takes as input a sequence of states and predicts the rewards of the agents and the environment attributes. Predicting the environment variables gives the agents an implicit understanding of the environment, while knowing the rewards of the opponent is essential for differentiating defection from cooperation. We then use Agglomerative Clustering on the embeddings of the final layer to create two clusters, which leads to the automatic discovery of cooperation and defection. Figure 3 shows examples of the events clustered in the cooperation and defection skills discovered by IPDistill for the Coin game.
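The clustering step can be illustrated with a small, dependency-free sketch. It merges clusters bottom-up until two remain, using average linkage and Euclidean distance; the linkage criterion is an assumption, since the text does not state which one was used, and the toy points merely stand in for learned event embeddings.

```python
import math


def agglomerative_two_clusters(points):
    """Bottom-up clustering of points into two clusters (average linkage).

    points: list of equal-length coordinate tuples (stand-ins for embeddings).
    Returns two sorted lists of point indices.
    """
    clusters = [[i] for i in range(len(points))]

    def average_distance(a, b):
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 2:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = average_distance(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        # Merge the closest pair into the earlier slot and drop the later one.
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]
```

On well-separated toy embeddings such as three points near the origin and three near (5, 5), the function returns the two groups of indices; which group corresponds to cooperation or defection still has to be read off from the rewards of the events in each cluster.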

We also train an action prediction network for each discovered strategy cluster. Each agent trains neural networks, one for each skill/strategy, such that given a state and skill as input, it can predict the action that needs to be taken. Figure 3 shows the network diagram with the appropriate input and output of the network. More details about the specific architecture parameters are given in Section 3. The network gets a single state as input, which is passed through a set of convolution layers that extract features from the state. Along with predicting the action, we also predict the environment attributes, which helps in training the action prediction network better. We train the network with three losses: action prediction loss, color-of-coin prediction loss, and the L2 weight regularization loss (with weight ) that ensures stable learning. Therefore, given a state, the action prediction network for a strategy returns the action that an agent should take to play according to that strategy. The RL agents consult these networks during gameplay. Therefore, at each step, the agent can take an action recommended by either the cooperation or the defection oracle (trained networks). Hence, IPDistill reduces the game to an IPD formulation. Figure 4 shows the schematic diagram of the proposed reduction.

Figure 4. Reduction of a game to an IPD formulation. We summarize complex skills using oracles and the agent has to learn a meta-policy to switch between skills.
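The reduction can be sketched as an agent whose only decision is which skill to invoke, with the chosen oracle translating that skill into a primitive action. In the sketch below, the oracles are hand-written stand-ins for the trained action prediction networks, and the toy state encoding `(coin_color, move_toward_coin, move_away_from_coin)` is purely hypothetical.

```python
from typing import Callable, Dict, Tuple

# Hypothetical toy state: (coin_color, move_toward_coin, move_away_from_coin).
State = Tuple[str, str, str]
Oracle = Callable[[State], str]


def make_oracles(own_color: str) -> Dict[str, Oracle]:
    """Hand-written stand-ins for the trained action prediction networks."""

    def cooperate(state: State) -> str:
        coin_color, toward, away = state
        # Cooperation: only go for coins of the agent's own color.
        return toward if coin_color == own_color else away

    def defect(state: State) -> str:
        _, toward, _ = state
        # Defection: go for every coin regardless of its color.
        return toward

    return {"C": cooperate, "D": defect}


class SkillReducedAgent:
    """Agent whose meta-policy picks a skill; the oracle picks the move."""

    def __init__(self, oracles: Dict[str, Oracle],
                 meta_policy: Callable[[State], str]):
        self.oracles = oracles
        self.meta_policy = meta_policy

    def act(self, state: State) -> Tuple[str, str]:
        skill = self.meta_policy(state)            # "C" or "D", as in an IPD
        return skill, self.oracles[skill](state)   # primitive game action
```

From the learner's point of view the action space is now just {C, D}, so the same two-action learner used for the IPD can be trained on the reduced game.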

3. Experimental Setup

In this section, we provide details of our experiments and values of hyper-parameters used in our experiments.

3.1. IPDistill: Skill Clustering

We define an Experience Event for the Coin Game as a sequence of states when either agent receives a positive or a negative reward.

Our architecture for Skill Clustering consists of two components. First, the Experience Event encoder that takes as input an Experience event and outputs a feature representation of the sequence. The Event encoder encodes each of the states from the sequence using a series of standard Convolution layers with a kernel of size 3. This is followed by a fully-connected layer with 100 neurons that outputs a unified representation of the Experience Event.

The Color-of-Coin, self-Reward, opponent-Reward, and Skill prediction branches consist of a series of dense layers with no non-linear activations, enabling us to cluster the feature vectors using a linear clustering algorithm, Agglomerative Clustering. We also experiment with the K-means clustering algorithm, which gives similar results. We use Binary-Cross-Entropy (BCE) loss for classification, mean-squared error (MSE) loss for regression and Adam kingma2014adam as our optimizer with a learning-rate of .

For the Action Prediction Network, the input is the game state, encoded using a series of Convolutions with kernel-size and relu activations. The Action prediction and Color-of-Coin prediction branches consist of a series of fully-connected layers with relu activation, with weighted-BCE and BCE as the loss functions respectively. We use L2 regularization and a Gradient Descent optimizer (learning-rate 0.001).

3.2. SQLoss

For all our experiments with Learners, we use a policy gradient-based learning where we train agents with the Actor-Critic method sutton2011reinforcement. Each agent is parameterised with a policy actor and -critic for variance reduction in policy updates.

We use parameters similar to foerster2018learning. During training, we use gradient descent with step size, for the actor and 1 for the critic. We use a batch size of 4000 for LOLA foerster2018learning and 500 for SQLearner for rollouts. We set the trace-length (length of an episode) to 200 for both IPD and IMP in our experiments. The discount rate $\gamma$ is set to 0.96 for the prisoner’s dilemma. The high value of $\gamma$ allows for long time horizons, thereby incentivising long-term reward.

We randomly sample from for each step, independently for each agent.

In order to compare our results with previous work, we use the Normalized Discounted Reward (NDR) defined as,


Section 4 describes the results obtained in more detail with additional discussion and analysis in Section 5.
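As a rough illustration of how a normalized discounted reward behaves, the sketch below assumes the common normalization in which the discounted sum of rewards is multiplied by one minus the discount rate. This form is an assumption, since the defining equation is not available in the text above; the function name is illustrative.

```python
def normalized_discounted_reward(rewards, gamma=0.96):
    """(1 - gamma) * sum_t gamma**t * r_t, an assumed form of the NDR."""
    return (1.0 - gamma) * sum(gamma ** t * r for t, r in enumerate(rewards))


# Episodes of length 200 (the trace-length used in the experiments) with the
# per-step IPD rewards of Table 1(a) for sustained CC and sustained DD play.
ndr_cc = normalized_discounted_reward([-1.0] * 200)
ndr_dd = normalized_discounted_reward([-2.0] * 200)
```

Under this assumed normalization, a long episode with a constant per-step reward has an NDR close to that per-step reward, so sustained mutual cooperation in the IPD scores near -1 and sustained mutual defection near -2.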

4. Results

We present the results first from the proposed SQLoss, and then we show results of the Skill clustering strategy on Coin Game. We also provide the Probability of Cooperation plots for both agents for the IPD game and the IMP game.

For iterated games, we consider two classic infinitely iterated games, the iterated prisoner’s dilemma (IPD) luce1989games and iterated matching pennies (IMP) lee1967application. Each round in these two environments requires a single action from each agent. We can obtain the discounted future return of each player given both players’ policies. The Coin Game is a more difficult two-player environment, where each round requires the agents to take a sequence of actions and the exact discounted future reward cannot be calculated.

(a) 3-dimensional feature vector
(b) 4-dimensional feature vector
(c) 10-dimensional feature vector
(d) 100-dimensional feature vector
Figure 5. In each subfigure, Left figures represent the actual rewards obtained by both the agents (the ground-truth), while the figures on the Right represent the clusters obtained by Agglomerative Clustering (when no-of-clusters was set to 2). For feature vectors of higher dimensions (>2), we show their t-SNE maaten2008visualizing projections. In the figures on the left, the Legend shows the events that give a reward of and a reward of denoted in the format |. In the plots on the right, from the legend, Class and Class represent the Cooperation and Defection clusters respectively. From the figures, all the events that resulted in the opponent getting penalized belong to the same cluster, namely the Defection cluster.

4.1. Developing cooperation in Iterated Games

(a) Prisoner’s Dilemma (IPD)
        C            D
C    (-1, -1)    (-3, 0)
D    (0, -3)     (-2, -2)

(b) Matching Pennies (IMP)
        H            T
H    (+1, -1)    (-1, +1)
T    (-1, +1)    (+1, -1)

(c) Coin Game in IPD formulation
        C              D
C    (1.0, 1.0)    (0.0, 2.8)
D    (2.8, 0.0)    (-1.0, -1.0)

Table 1. Payoff matrices of the different games. Rows index the first agent’s action and columns the second agent’s action; each cell lists the payoffs of the first and second agent. The payoff for the Coin game was inferred by averaging the returns for a sequence of states when the agent plays the game using a skill in the environment. The absolute payoffs for such a game can vary, but the relative value of each skill remains the same. Here C & D denote the Cooperation & Defection skills respectively, and H & T denote Heads & Tails.
(a) ()-Stationary environment. Agents first cooperate but eventually end up defecting
(b) ()-Stationary environment
Figure 6. Instantaneous probability of cooperation for both the agents in the IPD game when the environment enforces the stationary criterion. Green line indicates probability of cooperation while Blue line indicates probability of defection.

4.1.1. IPD

Figure 7. Results of 10 independent trials for the IPD game with SQLearner agents along with the results from foerster2018learning for different pairs of agents as baselines. For SQLearner agents, varies from to . SQLearner’s reward graph is hidden because of overlap with SQLearner’s reward shown in magenta. Here the x-axis is epochs while the y-axis shows the probability.

Figure 7 shows the average reward and rate of defection for two Naïve learners in the IPD game with a -stationary environment. As expected, we observe that enforcing stationarity in the environment induces cooperative behavior.

The results obtained for the IPD game are shown in Figures 6(a) & 6(b). Figure 6(b) shows the Probability of Cooperation for both the agents for the game of IPD with the environment enforcing the stationary policy with . Agents initially decide to defect, but eventually, both the agents choose to cooperate. They occasionally do defect and end up in the Defect-Defect state, but they again switch to cooperation for the majority of the game.

Having has a similar effect on the agents, but the convergence to cooperation takes longer. Setting the parameter to 5 did not result in cooperation, as visible in Figure 6(a), which aligns with our understanding that there is a minimum threshold, a minimum policy stationarity, that the agents require in order to overcome the loss from the rewards when taking C as compared to D.

We experiment with the proposed changes with the learning policy, as shown in Section 2.1.3 for the traditional IPD game. Figure 7 shows the results obtained on training the agents using the proposed SQLoss. The SQLearner agents attempt to defect for a few epochs, but then switch to cooperation resulting in CC state. The SQLearner agents also give high NDR scores on average ( vs. ). The Naive Learners trained using Policy gradient (NL-PG) end up defecting, giving a very low NDR of . For this experiment, we sample from the Discrete Uniform distribution randomly for each step. This ensures that for both the agents is not synchronous which makes it more challenging. With synchronous , it would collapse to a setting similar to the -Stationary environment.

4.1.2. IMP

Figure 8. Results of 5 independent trials for the IMP game with pairs of LOLA, NL and SQLearner agents. SQLearner agents converge on the Nash Equilibrium, which is 50%/50% Heads and Tails, resulting in an NDR close to 0. Best viewed in color.

Figure 8 shows the average Normalized Discounted Reward of 5 independent trials for a pair of SQLearner agents in the IMP game. The Nash Equilibrium for IMP game is 50% Heads/50% Tails. As visible in Figure 8, the Naïve Learners trained using policy-gradient (denoted by NL-PG) end up diverging from the equilibrium. The LOLA-PG agents still manage to stay near the equilibrium but farther as compared to the pair of SQLearner agents. SQLearner agents converge at the Nash Equilibrium of the IMP game.
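The 50%/50% equilibrium can be verified directly from the payoffs in Table 1(b). The sketch below assumes the rows and columns of that table are ordered Heads then Tails; the function name is illustrative.

```python
# Payoff of agent 1 from Table 1(b); agent 2 receives the negative (zero-sum).
# Assumption: rows/columns are ordered Heads ("H") then Tails ("T").
PAYOFF_1 = {("H", "H"): 1.0, ("H", "T"): -1.0, ("T", "H"): -1.0, ("T", "T"): 1.0}


def expected_payoff_1(p, q):
    """Expected reward of agent 1 when agents play Heads with prob. p and q."""
    return (p * q * PAYOFF_1[("H", "H")]
            + p * (1 - q) * PAYOFF_1[("H", "T")]
            + (1 - p) * q * PAYOFF_1[("T", "H")]
            + (1 - p) * (1 - q) * PAYOFF_1[("T", "T")])
```

When the opponent plays Heads with probability 0.5, `expected_payoff_1(p, 0.5)` is zero for every `p`, so agent 1 cannot gain by deviating; against any other opponent mix (for example 0.7), a pure strategy earns a positive payoff, which is why only the 50%/50% profile is stable.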

(a) Average NDR for a pair of SQLearner
(b) Probability(picking-own-coin) for SQLearner agents with baseline results from foerster2018learning
Figure 9. In the figures above, we have (a) average NDR for a pair of SQLearner agents, and (b) Probability(picking-own-coin) for CoinGame. y-axis represents the average NDR results for 5 independent runs of the game.

4.2. IPDistill: Skill Clustering

Percentage of coins picked:

Color of Coin    Max steps    Defection Skill    Cooperation Skill
Red              1            12.7%              46.7%
Red              2            22.4%              95.8%
Blue             1            48.4%              4.7%
Blue             2            99.4%              8.4%

Table 2. Percentage of coins picked by the Red agent in the Coin game when the actions are predicted by the oracle trained from the events obtained by Skill clustering.

Figure 5 shows the results obtained from the clustering of the Experience Events for the Coin Game. In Figure 5(a), the Left figure represents the Experience Events with similar rewards representing the ground truths, while the Right figure shows the same set of events now labeled with the class prediction from the Skill Clustering network. As visible in the Figure, the events where the Red agent defected are clustered in Cluster (denoted in red, they represent Defection), while the remaining events belong to the other cluster (Cluster ), representing the Cooperation strategy.

We also experiment with different dimensions for the feature vectors and obtain similar clustering results with sufficient training. Figures 5(a), 5(b), 5(c) & 5(d) show the results obtained for 2-D, 3-D, 10-D and 100-D feature vectors respectively.

Once the cooperation and defection skills are clustered in the feature space, we train an Action Prediction Network to predict the action at any state of the game.

We train the networks, for each strategy, using the discovered skills and then evaluate the oracles by executing it in the environment keeping the opponent frozen. We initiate 5000 independent games with each game-play lasting for a maximum of 2 steps (For CoinGame an agent can pick the coin in a maximum of 2 steps, irrespective of both the agent’s and coin’s positions). We terminate the game once an agent picks the coin. Table 2 shows the percentage of picking coins of a particular color for a given strategy when guided by the Action Prediction Network. The network can learn to either pick or leave the coin based on the strategy.

The Action Prediction Network consists of a set of convolution layers to encode the game state, followed by a fully-connected layer to obtain the state feature vector, followed by 2 branches consisting of a fully-connected layer to predict the coin-color and the action. We use the relu activation and the Gradient Descent optimizer for training the network with . We train the network for epochs with a batch size of 128.

After obtaining the Cooperation and Defection skills for the Coin game, agents play the skills in the environment and obtain the reward values shown in Table 1(c). The trained oracles and these reward values are then used in the IPD formulation of the Coin game. Figure 9 shows the average NDR of the pair of SQLearner agents along with Probability(picking-own-coin). From the figure, the oracle-based SQLearner agents have a higher Probability(picking-own-coin) than the Naïve Learner and LOLA agents. Figure 9(a) shows the average Normalized Discounted Reward (NDR), which reaches -0.5 when trained for a sufficiently long time and is much higher than that of the Naïve Learner agents.

5. Discussion and Analysis

In this section we perform various experiments to understand the effect of SQLoss on the agent’s behaviour, the effect of various parameters on the results and explain some of the observations.

We observe that the SQLearner agents initially play randomly for a few epochs. This initial random play causes the Status-Quo loss to penalize the agents heavily, which makes the logits diverge and results in very high probabilities of taking an action given a state. Eventually the agents stabilize and thus the probabilities come near zero and they converge to CC.

We also observe certain sudden drops in Cooperation probabilities after every few epochs when the Naïve agents play in the -stationary environment. This trend is clearly visible in Figure 6(b). We believe this is due to the inherent incentive to exploit the opponent: the agents are enticed to defect, but once they realize that it does not help, they switch back to CC. The interval between consecutive drops in Probability(Cooperation) reduces as training proceeds.

5.0.1. Exploitability

Given that the RL agents don’t have any prior information about the opponent, it is necessary that they evolve their strategies based on the opponent. For a SQLearner, we attempt to understand its adaptability when the opponent suddenly switches its strategy mid-game. When playing against another SQLearner, the agent learns to Cooperate. In contrast, when playing against an always-Cooperate agent or an always-Defect agent, the SQLearner learns to Defect. This tendency to exploit the opponent, and to avoid being exploited, is a useful indication that the agent adapts to its opponent. The SQLearner agent, when played against a Tit-for-tat (TFT) agent, failed to converge to Cooperation, possibly because of the difficulty of modeling instantaneous changes. Both agents converged to the DD state.

5.0.2. Effect of , & on convergence

In this section we attempt to explore the effect of various new parameters introduced in the Section 2 on convergence to CC.

For -Stationary Environment, controls the amount of stationary criterion the environment enforces on the agents. Experimenting with varying , we observed that starting with , the agents reach the CC state for quite some time, but each of the agents attempts to exploit the opponent and both eventually end up defecting. The agents end up in the DD state if played for long. But as we increase (to, say, 20), the frequency of exploitation reduces, and the agents stay in equilibrium in the CC state. Figures 6(a) & 6(b) show the results of varying and the nature of agents when is relatively low.

and , both control the degree of imaginary self-play. Since, is sampled from Discrete Uniform distribution (both inclusive), larger implies larger . Having higher adds variance in the rewards and thus possible future outcomes, but the overall long-term expected reward is still higher for CC, allowing the agents to converge to cooperation. It also makes the imaginary game-play increasingly asynchronous for both the agents. Larger results in faster cooperation.

6. Conclusion

We have described an approach to evolve cooperative behavior between RL agents playing the IPD and IMP game without sharing rewards, internal details (weights, gradients, etc.) or a communication channel. We introduce a Status Quo loss (SQLoss) that incentivizes cooperative behavior by encouraging policy stationarity. Further, we have described an approach to transform a two-player game (with visual inputs) into its IPD formulation through self-supervised skill discovery (IPDistill). Finally, we showed how our approach (IPDistill + SQLoss) outperforms existing approaches in the IPD, IMP and the two-player Coin game.