
Towards Cooperation in Sequential Prisoner's Dilemmas: a Deep Multiagent Reinforcement Learning Approach

The Iterated Prisoner's Dilemma has guided research on social dilemmas for decades. However, it distinguishes between only two atomic actions: cooperate and defect. In real-world prisoner's dilemmas, these choices are temporally extended and different strategies may correspond to sequences of actions, reflecting grades of cooperation. We introduce a Sequential Prisoner's Dilemma (SPD) game to better capture the aforementioned characteristics. In this work, we propose a deep multiagent reinforcement learning approach that investigates the evolution of mutual cooperation in SPD games. Our approach consists of two phases. The first phase is offline: it synthesizes policies with different cooperation degrees and then trains a cooperation degree detection network. The second phase is online: an agent adaptively selects its policy based on the detected degree of opponent cooperation. The effectiveness of our approach is demonstrated in two representative SPD 2D games: the Apple-Pear game and the Fruit Gathering game. Experimental results show that our strategy can avoid being exploited by exploitative opponents and achieve cooperation with cooperative opponents.


1 Introduction

Learning is the key to achieving coordination with others in multiagent environments [Stone and Veloso2000]. Over the last couple of decades, a large body of multiagent learning techniques has been proposed, aiming to coordinate on various solutions (e.g., Nash equilibrium) in different settings, e.g., minimax Q-learning [Littman1994], Nash Q-learning [Hu and Wellman2003], and Conditional-JAL [Banerjee and Sen2007], to name just a few.

One commonly investigated class of games is the Prisoner's Dilemma (PD), in which a Nash equilibrium solution is not a desirable learning target. Until now, a large body of work [Axelrod1984, Nowak and Sigmund1993, Banerjee and Sen2007, Crandall and Goodrich2005, Damer and Gini2008, Hao and Leung2015, Mathieu and Delahaye2015] has been devoted to incentivizing rational agents towards mutual cooperation in repeated matrix PD games. However, all the above works focus on classic repeated PD games, which ignore several key aspects of real world prisoner's dilemma scenarios. In repeated PD games, the moves are atomic actions and can be easily labeled as cooperative or uncooperative, or learned from the payoffs [Busoniu et al.2008]. In contrast, in real world PD scenarios, cooperation/defection behaviors are temporally extended and the payoff signals are usually delayed (available only after a number of steps of interaction).

Crandall [Crandall2012] proposes the Pepper framework for repeated stochastic PD games (e.g., the two-player gate entering problem), which can extend strategies originally proposed for classic repeated matrix games. Later, several techniques extended Pepper to different scenarios, e.g., stochastic games with a large state space under a tabular framework [Elidrisi et al.2014] and playing against switching opponents [Hernandez-Leal and Kaisers2017]. However, these approaches rely on hand-crafted state inputs and tabular Q-learning techniques to learn optimal policies. Thus, they cannot be directly applied to more realistic environments whose states are too large and complex to be analyzed beforehand.

Leibo et al. [Leibo et al.2017] introduce a 2D Fruit Gathering game to better capture real world social dilemma characteristics, while also maintaining the characteristics of classical iterated PD games. In this game, at each time step, an agent selects its action based on its image observation and cannot directly observe the actions of the opponent. Different policies represent different levels of cooperativeness, which is a graded quantity. They investigate the cooperation/defection emergence problem by leveraging the power of deep reinforcement learning [Mnih et al.2013, Mnih et al.2015] from a descriptive point of view: how do multiple selfish independent agents' behaviors evolve when each agent updates its policy using deep Q-learning? In contrast, this paper takes a prescriptive and non-cooperative perspective and considers the following question: how should an agent learn effectively in real world social dilemma environments when it is faced with different opponents?

To this end, in this paper we first formally introduce the general notion of the sequential prisoner's dilemma (SPD) to model real world PD problems, and we propose a multiagent deep reinforcement learning approach for mutual cooperation in SPD games. Our approach consists of two phases: an offline phase and an online phase. The offline phase generates policies with varying cooperation degrees and trains a cooperation degree detection network. To generate policies, we propose using a weighted target reward and two training schemes, IAC and JAC, to train baseline policies with varying cooperation degrees; we then use a policy generation approach to synthesize the full range of policies from these baseline policies. Lastly, we propose a cooperation degree detection network, implemented as an LSTM-based structure with an encoder-decoder module, and generate a training dataset for it. The online phase extends the Tit-for-Tat principle to sequential prisoner's dilemma scenarios: our strategy adaptively selects a policy with the proper cooperation degree from a continuous range of candidates based on the detected cooperation degree of the opponent. Intuitively, on one hand, our overall algorithm is cooperation-oriented and seeks mutual cooperation whenever possible; on the other hand, it is also robust against selfish exploitation and resorts to a defection strategy to avoid being exploited whenever necessary. We evaluate the performance of our deep multiagent reinforcement learning approach using two 2D SPD games (the Fruit Gathering and Apple-Pear games). Our experiments show that our agent can efficiently achieve mutual cooperation under self-play and also performs well against opponents with changing stationary policies.

2 Background

2.1 Matrix Games and the Prisoner’s Dilemma

A matrix game can be represented as a tuple $(N, \{A_i\}_{i \in N}, \{R_i\}_{i \in N})$, where $N$ is the set of agents, $A_i$ is the set of actions available to agent $i$ with $A = \times_{i \in N} A_i$ being the joint action space, and $R_i: A \rightarrow \mathbb{R}$ is the reward function for agent $i$. One representative class of matrix games is the prisoner's dilemma game, as shown in Table 1. In this game, each agent has two actions: cooperate (C) and defect (D), and is faced with four possible rewards: $R$, $S$, $T$, and $P$. The four payoffs satisfy the following five inequalities in a prisoner's dilemma game (a minimal check of these conditions is sketched in code after the list):

  • $R > P$: mutual cooperation is preferred to mutual defection.

  • $R > S$: mutual cooperation is preferred to being exploited by a defector.

  • $2R > T + S$: mutual cooperation is preferred to an equal probability of unilateral cooperation and defection.

  • $T > R$: exploiting a cooperator is preferred over mutual cooperation.

  • $P > S$: mutual defection is preferred over being exploited.
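As a small illustration (not from the paper; the payoff values below are the textbook example, not those of the games studied here), a function that checks these five conditions might look as follows:

    def is_prisoners_dilemma(R, S, T, P):
        """Return True if the payoffs (R, S, T, P) satisfy the five PD inequalities."""
        return (R > P              # mutual cooperation beats mutual defection
                and R > S          # mutual cooperation beats being exploited
                and 2 * R > T + S  # cooperation beats alternating exploitation
                and T > R          # temptation to exploit a cooperator
                and P > S)         # mutual defection beats being exploited

    # Illustrative payoffs only: T=5, R=3, P=1, S=0.
    assert is_prisoners_dilemma(R=3, S=0, T=5, P=1)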

2.2 Markov Game

Markov games combine matrix games and Markov decision processes and can be considered as an extension of matrix games to multiple states. A Markov game $M$ is defined by a tuple $(n, S, A_1, \ldots, A_n, R_1, \ldots, R_n, \mathcal{T})$, where $S$ is the set of states and $n$ is the number of agents, $\{A_1, \ldots, A_n\}$ is the collection of action sets with $A_i$ being the action set of agent $i$, $\{R_1, \ldots, R_n\}$ is the set of reward functions with $R_i: S \times A_1 \times \cdots \times A_n \rightarrow \mathbb{R}$ being the reward function for agent $i$, and $\mathcal{T}$ is the state transition function $\mathcal{T}: S \times A_1 \times \cdots \times A_n \rightarrow \Delta(S)$, where $\Delta(S)$ denotes the set of discrete probability distributions over $S$. Matrix games are the special case of Markov games when $|S| = 1$.

          C        D
  C     R, R     S, T
  D     T, S     P, P
Table 1: Prisoner's Dilemma

Next we formally introduce SPD by extending the classic iterated PD game to multiple states.

2.3 Definition of Sequential Prisoner’s Dilemma

A two-player SPD is a tuple $(M, \Pi)$, where $M$ is a two-player Markov game with state space $S$ and $\Pi$ is the set of policies with varying cooperation degrees. The empirical payoff matrix $(R, P, S, T)$ can be induced by a pair of policies $\pi^C$ and $\pi^D$, where $\pi^C$ is more cooperative than $\pi^D$. Given the two policies $\pi^C$ and $\pi^D$, the corresponding empirical payoffs under any starting state $s$, with respect to the payoff matrix in Section 2.1, can be defined as $(R(s), P(s), S(s), T(s))$ through the agents' long-term expected payoffs, where

R(s) := V_1^{\pi^C, \pi^C}(s) = V_2^{\pi^C, \pi^C}(s)    (1)
P(s) := V_1^{\pi^D, \pi^D}(s) = V_2^{\pi^D, \pi^D}(s)    (2)
S(s) := V_1^{\pi^C, \pi^D}(s) = V_2^{\pi^D, \pi^C}(s)    (3)
T(s) := V_1^{\pi^D, \pi^C}(s) = V_2^{\pi^C, \pi^D}(s)    (4)

We define the long-term payoff $V_i^{\vec{\pi}}(s_0)$ for agent $i$ when the joint policy $\vec{\pi} = (\pi_1, \pi_2)$ is followed starting from state $s_0$:

V_i^{\vec{\pi}}(s_0) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r_i(s_t, \vec{a}_t) \,\Big|\, \vec{a}_t \sim \vec{\pi}(s_t),\; s_{t+1} \sim \mathcal{T}(s_t, \vec{a}_t)\Big]    (5)

A Markov game $M$ is an SPD when there exists a state $s \in S$ for which the induced empirical payoff matrix satisfies the five inequalities in Section 2.1. Since SPD is more complex than PD, the existing approaches addressing learning in matrix PD games cannot be directly applied to SPD.
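As an illustration of how such empirical payoffs can be obtained in practice, the following sketch estimates the long-term payoff of Equation (5) by Monte Carlo rollouts. The environment interface (reset/step) and the function name are our own assumptions, not the paper's code:

    import numpy as np

    def estimate_value(env, policy_1, policy_2, start_state, gamma=0.99,
                       episodes=100, horizon=100):
        """Monte Carlo estimate of V_1 under the joint policy (policy_1, policy_2).

        Assumes a hypothetical env with reset(state) -> (obs_1, obs_2) and
        step(a1, a2) -> (obs_1, obs_2, r1, r2); this interface is illustrative only.
        """
        returns = []
        for _ in range(episodes):
            obs1, obs2 = env.reset(start_state)
            ret, discount = 0.0, 1.0
            for _ in range(horizon):
                a1, a2 = policy_1(obs1), policy_2(obs2)
                obs1, obs2, r1, r2 = env.step(a1, a2)
                ret += discount * r1
                discount *= gamma
            returns.append(ret)
        return np.mean(returns)

    # Example: R(s) and T(s) from Equations (1) and (4), estimated empirically.
    # R_s = estimate_value(env, pi_C, pi_C, s)
    # T_s = estimate_value(env, pi_D, pi_C, s)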

2.4 Deep Reinforcement Learning

Q-Learning and Deep Q-Networks: Q-learning and Deep Q-Networks (DQN) [Mnih et al.2013, Mnih et al.2015] are value-based reinforcement learning approaches for learning optimal policies in Markov environments. Q-learning makes use of the action-value function for policy $\pi$, defined as $Q^{\pi}(s, a) = \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi\big]$. DQN uses a deep convolutional neural network to estimate Q-values, and the optimal Q-values are learned by minimizing the following loss function:

L(\theta) = \mathbb{E}_{(s, a, r, s')}\big[(y^{DQN} - Q(s, a; \theta))^2\big]    (6)
y^{DQN} = r + \gamma \max_{a'} Q(s', a'; \theta^{-})    (7)

where $\theta^{-}$ are the parameters of a target network that are periodically updated with the most recent $\theta$.
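For concreteness, a minimal PyTorch sketch of Equations (6)-(7) is given below; it is illustrative only, since the paper does not specify an implementation framework:

    import torch
    import torch.nn.functional as F

    def dqn_loss(q_net, target_net, batch, gamma=0.99):
        """Compute the loss of Equations (6)-(7) for a batch of transitions.

        batch is assumed to provide tensors (states, actions, rewards, next_states),
        with actions given as int64 indices.
        """
        s, a, r, s_next = batch
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)       # Q(s, a; theta)
        with torch.no_grad():
            y = r + gamma * target_net(s_next).max(dim=1).values   # Eq. (7)
        return F.mse_loss(q_sa, y)                                  # Eq. (6)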

Policy Gradient and Actor-Critic Algorithms: Policy gradient methods are widely used in a variety of RL tasks [Williams1992, Sutton et al.2000]. Their objective is to maximize $J(\theta) = \mathbb{E}_{s \sim \rho^{\pi}, a \sim \pi_{\theta}}[R]$ by taking steps in the direction of $\nabla_{\theta} J(\theta)$, where

\nabla_{\theta} J(\theta) = \mathbb{E}_{s \sim \rho^{\pi}, a \sim \pi_{\theta}}\big[\nabla_{\theta} \log \pi_{\theta}(a \mid s)\, Q^{\pi}(s, a)\big]    (8)

and $\rho^{\pi}$ is the state distribution under $\pi$. In practice, the value of $Q^{\pi}(s, a)$ can be estimated in different ways. For example, a learned value function can serve as a critic to guide the update direction of the actor $\pi_{\theta}$, which leads to the class of actor-critic algorithms [Schulman et al.2015, Wang et al.2016].
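A compact one-step actor-critic update in the same spirit is sketched below, using a learned state-value critic as the baseline; the actor/critic interfaces are assumptions of ours, not the authors' code:

    import torch
    import torch.nn.functional as F

    def actor_critic_losses(actor, critic, s, a, r, s_next, gamma=0.99):
        """One-step actor-critic losses; actor(s) is assumed to return a
        torch.distributions.Categorical, critic(s) a per-state value of shape [B, 1]."""
        v = critic(s).squeeze(-1)
        v_next = critic(s_next).squeeze(-1).detach()
        td_target = r + gamma * v_next
        advantage = (td_target - v).detach()          # critic's estimate of Q - V
        log_prob = actor(s).log_prob(a)               # log pi_theta(a | s)
        actor_loss = -(log_prob * advantage).mean()   # gradient ascent on Eq. (8)
        critic_loss = F.mse_loss(v, td_target)
        return actor_loss, critic_loss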

3 Deep RL: Towards Mutual Cooperation

Algorithm 1 describes our deep multiagent reinforcement learning approach, which consists of two phases, as discussed in Section 1. In the offline phase, we first seek to generate policies with varying cooperation degrees. Since the number of policies with different cooperation degrees is infinite, it is computationally infeasible to train all of them from scratch. To address this issue, we first train representative baseline policies (i.e., cooperation and defection baseline policies) using Actor-Critic until convergence (Lines 3-5), as detailed in Section 3.1; second, we synthesize the full range of policies from these baseline policies (Lines 6-7), as detailed in Section 3.2. Another task is how to effectively detect the cooperation degree of the opponent. We divide this task into two steps: we first train an LSTM-based cooperation degree detection network offline (Lines 8-10), which is then used for real-time detection during the online phase, as detailed in Section 3.3. In the online phase, our agent plays against any opponent by reciprocating with a policy of a slightly higher cooperation degree than the one we detect for the opponent (Lines 12-18), as detailed in Section 3.4. Intuitively, on one hand, our algorithm is cooperation-oriented and seeks mutual cooperation whenever possible; on the other hand, it is also robust against selfish exploitation and resorts to a defection strategy to avoid being exploited whenever necessary.

1:  // offline training
2:  initialize the training policy set size $N_p$, the generated policy set size $N_g$, the training data set size $N_d$, the episode number $N_e$ and the step number $N_s$ of each episode
3:  for training policy set index t = 1 to $N_p$ do
4:     set the agents' attitudes
5:     train the agents' policy set using the weighted target reward
6:  for generated policy set index g = 1 to $N_g$ do
7:     use the trained policy set to generate policy set g with varying cooperation degrees
8:  for training data set index d = 1 to $N_d$ do
9:     generate training data set d as in Equation (13)
10:  use the data sets to train the cooperation degree detection network
11:  // adjust the policy online
12:  initialize the cooperation degree
13:  for episode index e = 1 to $N_e$ do
14:     for step index r = 1 to $N_s$ do
15:         agent 1 and agent 2 take actions and get rewards
16:         agent 1 uses the recent state trajectory to detect the cooperation degree of agent 2
17:         agent 1 updates its estimate of agent 2's cooperation degree incrementally based on the detected value
18:         agent 1 synthesizes a policy with its own cooperation degree using policy generation
Algorithm 1 The Approach of Deep Multiagent Reinforcement Learning Towards Mutual Cooperation

3.1 Train Baseline Policies with Different Cooperation Degrees

One way of generating policies with different cooperation degrees is by directly changing key parameters of the environment. For example, Leibo et al. [Leibo et al.2017] investigate the influence of the degree of resource abundance on the learned policy's cooperation tendency in sequential social dilemma games where agents compete for limited resources. It is found that when both agents employ deep Q-learning algorithms, more cooperative behaviors are learned when resources are plentiful, and vice versa. We could leverage similar ideas of modifying game settings to generate policies with different cooperation degrees. However, this type of approach requires a perfect understanding of the environment a priori, and may not be practically feasible when one cannot modify the underlying game engine.

Another, more general way of generating policies with different cooperation degrees is to modify agents' reward signals during learning. Intuitively, agents whose reward is the sum of both agents' immediate rewards would eventually learn cooperative policies that maximize the expected accumulated social welfare, while agents maximizing only their own reward would learn more selfish (defecting) policies. Formally, for a two-player environment (agents 1 and 2), agent 1 computes a weighted target reward as follows:

\hat{r}_1 = r_1 + att_{12}\, r_2    (9)

where $att_{12}$ is agent 1's attitude towards agent 2, which reflects the relative importance of agent 2 in agent 1's perceived reward (agent 2's weighted target reward is defined symmetrically). By setting the values of $att_{12}$ and $att_{21}$ to 0, agents would update their strategies in the direction of maximizing their own accumulated discounted rewards. By setting the values of $att_{12}$ and $att_{21}$ to 1, agents would update their strategies in the direction of maximizing the overall accumulated discounted reward. The greater the value of agent 1's attitude towards agent 2, the higher the cooperation degree of agent 1's learned policy.
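A minimal sketch of this reward shaping (function and variable names are ours, not the paper's):

    def weighted_target_rewards(r1, r2, att12, att21):
        """Weighted target rewards of Equation (9) for both agents.

        att12 (att21) is agent 1's (agent 2's) attitude towards the other agent;
        0 recovers purely selfish learning, 1 recovers fully social learning.
        """
        return r1 + att12 * r2, r2 + att21 * r1

    # Example: a fully selfish agent 1 and a fully social agent 2.
    r1_hat, r2_hat = weighted_target_rewards(r1=1.0, r2=0.5, att12=0.0, att21=1.0)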

Figure 1: Cooperation Degree Detection Network

Given the modified reward signal for each agent, the next question is how agents should learn to effectively converge to the expected behaviors. One natural way is to resort to independent Actor-Critic (IAC) or some other independent deep reinforcement learning algorithm, e.g., equipping each agent with an individual deep Q-learning learner trained on the modified reward. However, a key element of the success of deep Q-learning, the experience replay memory, might prohibit effective learning in deep multiagent learning environments [Foerster et al.2017b, Sunehag et al.2017]. The nonstationarity introduced by the coexistence of multiple independent learners means that data in the replay memory may no longer reflect the current dynamics in which the agent is learning. Thus, independent learners may frequently be confused by obsolete experience, which impedes the learning process. A number of methods have been proposed to remedy this issue [Foerster et al.2017b, Lowe et al.2017, Foerster et al.2017a]; we omit the details, which are out of the scope of this paper.

Since the baseline policy training step is performed offline, another way of improving training is to use the joint Actor-Critic (JAC) by treating both agents as a single learner during training. Note that we use JAC only for the purpose of training baseline policies offline, and we do not require that we can control the policies that an opponent may use online. In JAC, both agents share the same underlying network that learns the optimal policy over the joint action space using a single reward signal. In this way, the aforementioned nonstationarity problem can be avoided. Besides, compared with IAC, the training efficiency can be improved significantly since the network parameters are shared across agents in JAC.

In JAC, the weighted target reward is defined as follows:

r = w_1\, r_1 + (1 - w_1)\, r_2    (10)

where $w_1$ represents the relative importance of agent 1 in the overall reward. The smaller the value of $w_1$, the higher the cooperation degree of agent 1's learned policy, and vice versa. Given the learned joint policy $\pi(a_1, a_2 \mid s)$, agent 1 can easily obtain its individual policy as follows:

\pi_1(a_1 \mid s) = \sum_{a_2 \in A_2} \pi(a_1, a_2 \mid s)    (11)
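A sketch of this marginalization for a joint policy stored as a probability table (our own illustration, not the paper's code):

    import numpy as np

    def individual_policy(joint_policy):
        """Marginalize a joint policy over the opponent's actions (Equation (11)).

        joint_policy: array of shape (|A1|, |A2|) with joint action probabilities
        for a fixed state s. Returns agent 1's individual policy over A1.
        """
        return joint_policy.sum(axis=1)

    # Example with two actions per agent:
    pi_joint = np.array([[0.4, 0.1],
                         [0.2, 0.3]])
    pi_1 = individual_policy(pi_joint)   # -> [0.5, 0.5]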

As we mentioned previously, it is computationally prohibitive to train a large number of policies with different cooperation degrees, due to the high training cost of deep Q-learning and because the policy space is infinite. To alleviate this issue, we propose that only two baseline policies need to be trained: a cooperation policy $\pi^C$ and a defection policy $\pi^D$. Other policies with cooperation degrees between these baselines can be synthesized efficiently, as introduced in Section 3.2.

Figure 2: The Structure of Deep Reinforcement Learning Approach Towards Mutual Cooperation

3.2 Policy Generation

Given the baseline policies $\pi^C$ and $\pi^D$, we synthesize multiple policies. Each continuous weighting factor $\alpha \in [0, 1]$ corresponds to a new policy defined as follows:

\pi_{\alpha}(a \mid s) = \alpha\, \pi^C(a \mid s) + (1 - \alpha)\, \pi^D(a \mid s)    (12)

The weighting factor $\alpha$ is defined as policy $\pi_{\alpha}$'s cooperation degree. This linear combination of two policies has two advantages: it 1) generates policies with varying cooperation degrees and 2) ensures low computational cost. Any synthesized policy $\pi_{\alpha}$ should be more cooperative than $\pi^D$ and more defecting than $\pi^C$. The higher the value of $\alpha$, the more cooperative the corresponding policy. It is important to mention that the cooperation degrees of synthesized policies are ordinal, i.e., they only reflect the policies' relative cooperation ranking. For example, considering two synthesized policies $\pi_{\alpha_1}$ and $\pi_{\alpha_2}$ with $\alpha_1 = 2\alpha_2$, we can only conclude that $\pi_{\alpha_1}$ is more cooperative than $\pi_{\alpha_2}$; we cannot say that $\pi_{\alpha_1}$ is twice as cooperative as $\pi_{\alpha_2}$. Our way of synthesizing new policies can be understood as synthesizing policies over expert policies [He et al.2016]. That previous work applies a similar idea to generate policies that better respond to different opponents in competitive environments; in contrast, our goal here is to synthesize policies with varying cooperation degrees in sequential prisoner's dilemmas.
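Equation (12) can be sketched for stochastic policies represented as action-probability vectors; the function below is a minimal illustration under our own naming:

    import numpy as np

    def synthesize_policy(pi_cooperate, pi_defect, alpha):
        """Linearly mix the baseline policies (Equation (12)).

        pi_cooperate, pi_defect: action-probability vectors for the same state.
        alpha: cooperation degree in [0, 1] of the synthesized policy.
        """
        assert 0.0 <= alpha <= 1.0
        return alpha * np.asarray(pi_cooperate) + (1 - alpha) * np.asarray(pi_defect)

    # Example: a policy halfway between cooperation and defection.
    pi_half = synthesize_policy([0.9, 0.1], [0.1, 0.9], alpha=0.5)  # -> [0.5, 0.5]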

3.3 Opponent Cooperation Degree Detection

Figure 3: The Fruit Gathering game (left) and the Apple-Pear game (right).

In classical iterated PD games, a number of techniques have been proposed to estimate the opponent's cooperation degree, e.g., counting the cooperation frequency when actions can be observed, or using a particle filter or a Bayesian approach otherwise [Damer and Gini2008, Hernandez-Leal et al.2016a, Hernandez-Leal et al.2016b, Leibo et al.2017]. However, in SPD, the actions are temporally extended and the opponent's information (actions and rewards) cannot be observed directly, so previous works cannot be directly applied. We need a way of accurately predicting the cooperation degree of the opponent from the observed sequence of moves. In SPD, given such sequential observations (time-series data), we propose an LSTM-based cooperation degree detection network.

In previous sections, we introduced a way of synthesizing policies of any cooperation degree; thus we can easily prepare a large dataset of agents' behaviors with varying cooperation degrees. Based on this, we can transform the cooperation degree detection problem into a supervised learning problem: given a sequence of moves of an opponent, our task is to detect the cooperation degree (label) of this opponent. We propose a recurrent neural network that combines an autoencoder and a recurrent classifier, as shown in Figure 1. Combining an autoencoder with a recurrent classifier brings two major benefits. First, the classifier and the autoencoder share the underlying layer parameters of the neural network. This ensures that the classification task is based on effective feature extraction from the observed moves, which improves the detection accuracy. Second, concurrent training of the autoencoder also helps to accelerate the training of the classifier and reduces fluctuation during training.

The network is trained on experiences collected by the agents. Both agents 1 and 2 interact with the environment starting from initialized policies $\pi_1$ and $\pi_2$, yielding a training set $D$:

Figure 4: Agents' average rewards under policies trained with different cooperation attitudes.

D = \big\{ \big(O^{(\pi_1, \pi_2^i)},\; c_2^i\big) \big\}    (13)

where $\pi_1 \in \{\pi_1^C, \pi_1^D\}$ are the baseline policies of agent 1, $\{\pi_2^i\}$ is the learned policy set of its opponent agent 2, $c_2^i$ is the relative cooperation degree of policy $\pi_2^i$, and $O^{(\pi_1, \pi_2^i)}$ is the set of trajectories under the joint policy $(\pi_1, \pi_2^i)$. For each trajectory, its label is the cooperation degree of agent 2's policy. The network is trained to minimize the following weighted cross-entropy loss:

L = -\sum_{k} w_{y_k} \big[ y_k \log p_k + (1 - y_k) \log(1 - p_k) \big]    (14)

where $w_{y_k}$ is the weight of label $y_k$ and $p_k$ is the network output, i.e., the predicted probability that trajectory $k$ is cooperative.
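A possible reading of Equation (14) as a weighted binary cross-entropy is sketched below in PyTorch; the exact form and class weights used by the authors may differ (the weights here mirror the values 1 and 2 mentioned in Section 4.2, but are otherwise illustrative):

    import torch

    def weighted_bce(pred_prob, labels, class_weights=(1.0, 2.0)):
        """Weighted cross-entropy over trajectories (cf. Equation (14)).

        pred_prob: predicted probability that each trajectory is cooperative.
        labels: 1 for cooperative trajectories, 0 for defecting ones.
        class_weights: (weight of label 0, weight of label 1).
        """
        w = torch.where(labels > 0.5,
                        torch.tensor(class_weights[1]),
                        torch.tensor(class_weights[0]))
        eps = 1e-8
        loss = -(labels * torch.log(pred_prob + eps)
                 + (1 - labels) * torch.log(1 - pred_prob + eps))
        return (w * loss).mean()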

3.4 Play Against Different Opponents

Once we have detected the cooperation degree of the opponent, the final question arises as to how an agent should select proper policies to play against that opponent. A self-interested approach would be to simply play the best response policy toward the detected policy of the opponent. However, as we mentioned before, we seek a solution that can allow agents to achieve cooperation while avoiding being exploited.

Figure 2 shows our overall approach for playing with opponents towards mutual cooperation. At each time step $t$, agent 1 uses its previous $n$-step sequence of observations (from time step $t-n$ to $t$) as the input of the detection network and obtains the detected cooperation degree $c_d$ of its opponent. However, a one-shot detection of the opponent's cooperation degree might be misleading, due to either the detection error of our classifier or the stochastic behaviors of the opponent, and this may lead to high variance in the detection outcome and hence in our response policy. To reduce this variance, agent 1 uses exponential smoothing to update its current estimation $c_2$ of its opponent's cooperation degree as follows:

c_2 \leftarrow (1 - \lambda)\, c_2^{old} + \lambda\, c_d    (15)

where $c_2^{old}$ is the estimated cooperation degree of agent 2 at the last time step, and $\lambda$ is the cooperation degree changing factor.

Finally, agent 1 sets its own cooperation degree equal to the estimated cooperation degree of its opponent plus its reciprocation level, and then synthesizes a new policy with the updated cooperation degree following Equation (12) as its next-step strategy against its opponent. Note that the reciprocation level can be quite low and still produce cooperation. The benefit is that it will not lead to a significant loss if the opponent is not cooperative at all, while full cooperation can be reached if the opponent reciprocates in the same way. Also note that here we only provide one way of responding to an opponent with changing policies; our overall approach is general and any existing multiagent strategy selection approach can be applied here. A sketch of the full online update is given below.
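Putting the pieces together, a minimal sketch of one online update of agent 1's policy follows; the detector and policy interfaces and the reciprocation value are our own assumptions, and synthesize_policy refers to the sketch in Section 3.2:

    def online_step(detector, pi_C, pi_D, obs_history, c2_old,
                    reciprocation=0.1, changing_factor=0.02):
        """One online update of agent 1's policy (Sections 3.2-3.4).

        detector maps a sequence of observations to a detected cooperation
        degree in [0, 1]; pi_C and pi_D map a state to action probabilities.
        """
        c_detected = detector(obs_history)                                    # Section 3.3
        c2 = (1 - changing_factor) * c2_old + changing_factor * c_detected    # Eq. (15)
        own_degree = min(1.0, c2 + reciprocation)                             # Tit-for-Tat style reciprocation
        policy = lambda state: synthesize_policy(pi_C(state), pi_D(state), own_degree)  # Eq. (12)
        return policy, c2

    # At every step, agent 1 re-detects, smooths, and re-synthesizes its policy.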

(a) The Apple-Pear game
(b) The Fruit Gathering game
Figure 5: Average and total rewards under different cooperation degrees. The cooperation degrees of agent 1 and agent 2 increase from left to right and from bottom to top, respectively. Each cell corresponds to the rewards of a different policy pair.

4 Simulation and Results

4.1 SPD Game Descriptions

In this section, we adopt the Fruit Gathering game [Leibo et al.2017] to evaluate the effectiveness of our approach. We also propose another game, Apple-Pear, which also satisfies the real world social dilemma conditions mentioned before. Each game involves two agents (in blue and red). The task of an agent in the Fruit Gathering game is to collect as many apples, represented by green pixels, as possible (see Figure 3 (left)). An agent's action set is: step forward, step backward, step left, step right, rotate left, rotate right, use beam, and stand still. The agent obtains the corresponding fruit when it steps on the square where the fruit is located. When an agent collects an apple, it receives a reward of 1, and the apple is removed from the environment and respawns after 40 frames. Each agent can also emit a beam in a straight line along its current orientation. An agent is removed from the map for 20 frames if it is hit by the beam twice. Intuitively, a defecting policy in this game is one that frequently tags the rival agent to remove it from the game, while a cooperative policy is one that rarely tags the other agent. In the Apple-Pear game, there is a red apple and a green pear (see Figure 3 (right)). The blue agent prefers the apple while the red agent prefers the pear. Each agent has four actions: step right, step left, step backward, and step forward, and each step of moving incurs a cost of 0.01. A fruit is collected when the agent steps on the square where it is located. When the blue (red) agent collects an apple (pear) individually, it receives a higher reward of 1. When the blue agent collects a pear individually, it receives a lower reward of 0.5, and the situation is reversed for the red agent. One exception is that both agents receive half of their corresponding rewards when they share a pear or an apple. In this game, a fully defecting policy is to collect both fruits whenever the fruit-collecting reward exceeds the moving cost, while a cooperative one is to collect only the preferred fruit so as to maximize the social welfare of the agents. In Section 4.4, we verify that both games satisfy the definition of SPD games in Section 2.3 by matching policies with different cooperation degrees against each other. A sketch of the Apple-Pear reward structure is given below.
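The Apple-Pear reward structure described above can be summarized in a short sketch; the encoding of the rules and all names are ours, not the paper's:

    def apple_pear_reward(agent_color, fruit, shared, moved, step_cost=0.01):
        """Reward for one agent in the Apple-Pear game as described above.

        agent_color: 'blue' (prefers apple) or 'red' (prefers pear).
        fruit: 'apple', 'pear', or None if no fruit was collected this step.
        shared: True if both agents stepped on the fruit simultaneously.
        moved: True if the agent took a movement action this step.
        """
        preferred = 'apple' if agent_color == 'blue' else 'pear'
        reward = 0.0
        if fruit is not None:
            reward = 1.0 if fruit == preferred else 0.5
            if shared:
                reward *= 0.5    # shared fruit yields half the reward
        if moved:
            reward -= step_cost  # every movement costs 0.01
        return reward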

4.2 Network Architecture and Parameter Settings

In both games, our network architectures for training the baseline policies follow standard actor-critic networks, except that the actor and critic share the same underlying network to reduce the parameter space. In the underlying network, the first hidden layer convolves the input image with a set of filters and applies a rectifier nonlinearity; the second and third hidden layers are further convolutional layers, each followed by a rectifier nonlinearity. For the actor, on top of the shared network, the next layer is a fully connected layer with a rectifier nonlinearity, and the final softmax layer has as many units as the number of actions. The critic is similar to the actor, but with only one scalar output.

The recurrent cooperation degree detection network is shown in Figure 1. The autoencoder and the detection network share the same underlying network. The first hidden layer convolves the input image with a set of filters and applies a rectifier nonlinearity; the second hidden layer has the same form, and the third hidden layer is a further convolutional layer followed by a rectifier nonlinearity. The autoencoder then mirrors these layers with three deconvolutional layers, each followed by a sigmoid, to reconstruct the input image. The cooperation degree detection branch is followed by two LSTM layers, and the final layer is a single output node. The code and network architectures will be available soon: https://goo.gl/3VnFHj.

For the Apple-Pear game, each episode has a fixed maximum number of steps, and the exploration rate is annealed linearly over the first part of training. The weight parameters are updated by soft target updates [Lillicrap et al.2015] every 4 steps to avoid update fluctuation, as follows:

\theta^{-} \leftarrow \tau\, \theta + (1 - \tau)\, \theta^{-}    (16)

where $\theta$ are the parameters of the policy network and $\theta^{-}$ are the parameters of the target network. The learning rate, replay memory size, and batch size are fixed throughout training. For the loss function of the cooperation degree detection network, we set the two class weights to 1 and 2 (see Equation (14)). When agents play with different opponents online, the length of the state sequence fed to the cooperation degree detection network is set to the number of states visited so far. The cooperation degree changing factor is set to 1.
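The soft target update of Equation (16) corresponds to the following short PyTorch sketch (illustrative; the value of tau is a placeholder, not taken from the paper):

    import torch

    @torch.no_grad()
    def soft_update(policy_net, target_net, tau=0.001):
        """theta_target <- tau * theta + (1 - tau) * theta_target (Equation (16))."""
        for p, p_target in zip(policy_net.parameters(), target_net.parameters()):
            p_target.mul_(1.0 - tau).add_(tau * p)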

The Fruit Gathering game uses the same detection network architecture, while the actor-critic uses independent policy networks. Each episode has a fixed maximum number of steps during training. The exploration rate and the replay memory are the same as in the Apple-Pear game, and the weight parameters are updated in the same way:

\theta^{-} \leftarrow \tau\, \theta + (1 - \tau)\, \theta^{-}    (17)

When agents play with different opponents online, we set the state sequence length to 50 and the changing factor to 0.02.

(a) Apple-Pear
(b) Fruit Gathering
(c) Fruit Gathering (agent 1's cooperation degree fixed at 1)
Figure 6: The detection results for agent 2 under different cooperation degrees of agent 2: (a) the Apple-Pear game; (b) the Fruit Gathering game; (c) the Fruit Gathering game with agent 1's cooperation degree fixed at 1.

4.3 Effect of Baseline Policy Generation

For the Apple-Pear game, the baseline policies are trained using the JAC scheme. The individual reward of each agent increases gradually as its attitude increases and the other agent's attitude decreases. The results also indicate that an agent learns a more defecting policy when its attitude, which represents its relative importance in the overall reward, increases, and vice versa. We set the attitudes in Equation (10) between 0.1 and 0.9 to train the baseline policies. The reason that the attitudes are not set to 0 and 1 is that these settings would lead to a meaningless policy: an agent whose attitude equals 0 has no influence on the total reward in Equation (10) and hence will always avoid collecting fruit. This would, in turn, affect the quality of the other agent's learned policy (i.e., the lazy agent problem [Perolat et al.2017]). Figure 4 shows the average rewards of agents under policies trained with different weighted target rewards. We also evaluate the IAC scheme; similar results can be obtained and we omit them here.

For the Fruit Gathering game, the baseline policies are trained using the IAC scheme. IAC is more efficient than JAC for training baseline policies in this game, since the rewards for collecting an apple are the same for both agents. If we adopted a JAC approach, it might lead to the consequence that the agent with the higher weight collects all the apples while the other agent collects none. Similar results to the Apple-Pear game can be observed and we omit them here.

4.4 Effect of Policy Generation

For the Apple-Pear game, the baseline cooperation and defection policies of both agents are trained under the corresponding attitude settings. We then generate the policy set of agent 1 and the policy set of agent 2. After that, two policies, one sampled from each set, are matched against each other in the game for 200,000 episodes. The average rewards are assigned to the individual cells of the different attitude pairs, where the two axes correspond to policies with varying cooperation degrees for agent 1 and agent 2 (see Figure 5 (a)). In Figure 5 (a), we observe that when the agents' cooperation degrees decrease, their rewards decrease. When both of their cooperation degrees increase, which means they are more cooperative, the sum of their rewards increases. Besides, given a fixed cooperation degree of its opponent, an agent's reward increases as its own cooperation degree decreases, and vice versa; this pattern holds for both agents. Therefore, the results indicate that we can successfully synthesize policies with a continuous range of cooperation degrees. They also confirm that the Apple-Pear game can be seen as an SPD following the definition in Section 2.3.

For the Fruit Gathering game, we use the policies with both attitudes equal to 0 and 0.5 as the baseline defection and cooperation policies, respectively. Figure 5 (b) shows their rewards, which follow a pattern similar to the Apple-Pear game.

4.5 Effect of Cooperation Degree Detection

This section evaluates the detection power of the cooperation degree detection network. First, we train the network using datasets which include only data labeled as full cooperation or full defection. The training data in the simulation is obtained based on the baseline policies of agent 2: we set the label as 1 when agent 2 uses its baseline cooperation policy and 0 when it uses its baseline defection policy. Then we use this dataset to train the detection network. After training, we evaluate the detection accuracy by applying the network to detect the cooperation degree of agent 2 when it uses policies with cooperation degrees from 0 to 1 at intervals of 0.1. For each episode, a policy pair is sampled from the two agents' policy sets and matched against each other. After the average detected cooperation degree becomes stable, we view the output value as the cooperation degree of agent 2.

For the Apple-Pear game, we collect 10,000 data points for state sequence lengths {3, 4, 5, 6, 7, 8}. The cooperation degree detection results for agent 2 are shown in Figure 6 (a). We can see that the detected values approximate the true values well, with slight variance. Besides, the network can clearly detect the order of different cooperation degrees, which allows us to calculate the cooperation degree of agent 2 accurately. For the Fruit Gathering game, we collect 4,000 data points for each state sequence length in {40, 50, 60, 70}, of which 2,000 are labeled as 1 and 2,000 as 0. The network is trained in a similar way and the cooperation degree detection accuracy is high (see Figure 6 (b)). From Figure 6, we observe that the detection results are almost linear in the true values. Figure 6 (c) shows the detection results for agent 2 when agent 1's policy is fixed with cooperation degree 1. We can see that the true cooperation degree of agent 2 can be easily obtained by fitting a linear curve between the predicted and true values. Thus, for each policy of agent 1 in its policy set, we can fit a linear function to recover the true cooperation degree of agent 2. During practical online learning, when agent 1 uses policies of varying cooperation degrees to play with agent 2, it first chooses the fitted function whose corresponding policy is closest to its currently used policy, and then computes the cooperation degree of agent 2. A sketch of this calibration step is given below.
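This calibration step can be sketched as follows with NumPy; the helper name is ours, and the measurement arrays are assumed to be collected offline as described above:

    import numpy as np

    def fit_calibration(true_degrees, detected_means):
        """Fit detected = a * true + b from offline measurements, and return a
        function mapping a raw detector output back to an estimated true degree."""
        a, b = np.polyfit(true_degrees, detected_means, deg=1)
        return lambda detector_output: float(np.clip((detector_output - b) / a, 0.0, 1.0))

    # Offline: one calibration function per policy of agent 1 in its policy set.
    # Online: pick the calibration whose policy is closest to the one currently in use.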

Figure 7: Performance under self-play in the Apple-Pear game when both agents use our strategy.
Figure 8: Performance under self-play in the Fruit Gathering game when both agents use our strategy.
Figure 9: Apple-Pear game: agent 2's policy varies between $\pi^C$ and $\pi^D$ every 110 episodes. The average rewards of agent 1 are higher when it uses our approach than when it uses $\pi^C$, which means our approach can avoid being exploited by defective opponents. The social welfare is higher than when it uses $\pi^D$, indicating that our approach can seek cooperation against cooperative opponents.
Figure 10: Fruit Gathering game: agent 2's policy varies between $\pi^C$ and $\pi^D$ every 300 steps. A similar phenomenon can be observed as in Figure 9.

4.6 Performance under Self-Play

Next, we evaluate the learning performance of our approach under self-play. Since the agents' initial policies can affect their behaviors in the game, we evaluate all the different initial conditions: a) agent 1 starts with the cooperation policy and agent 2 with the defection policy; b) agent 1 starts with the defection policy and agent 2 with the cooperation policy; c) both agents start with cooperation policies; d) both agents start with defection policies. The agents converge to full cooperation in all four cases. We present the results for the last case, which is the most challenging one (see Figure 7 and Figure 8): when both agents start with a defection policy, it is more likely for them to model each other as defective and thus both play defect thereafter. Our approach enables agents to successfully detect the cooperation tendency of their opponents from sequential actions and eventually converge to cooperation. In the Apple-Pear game, agents converge efficiently within a few episodes. The reason is that when agents are close to the fruits they prefer, they will collect them no matter whether their policies are cooperative or not. Thus, an agent is more likely to be detected as being cooperative, which induces its opponent to change its policy towards cooperation. In contrast, in the Fruit Gathering game, agents need a relatively longer time before converging to full cooperation. This is because in the Fruit Gathering game the main feature for detecting the cooperation degree is the beam-emitting frequency: when one agent emits the beam continuously, the agents shift towards defection, and only when both agents collect fruits without emitting the beam does mutual cooperation emerge.

4.7 Playing with Opponents with Changing Strategies

Now we evaluate the performance against switching opponents: during a repeated interaction, the opponent changes its policy after a certain number of episodes and the learning agent does not know when the switches happen. Through this, we can evaluate the cooperation degree detection performance of our network and verify whether our strategy performs better than the full cooperation or full defection strategy in two respects: 1) our approach seeks mutual cooperation whenever possible; 2) our approach is robust against selfish exploitation. In the Apple-Pear game, we vary the switching period from 50 to 200 episodes at intervals of 30, and similarly in the Fruit Gathering game we vary it from 100 to 500 steps at intervals of 100. Only one set of results for each game is provided in Figures 9 and 10; the results for the other switching periods are in the Appendix, and a similar phenomenon can be observed there. From Figures 9 and 10, we observe that for both games the average rewards of the learning agent (agent 1) are higher than its rewards when using the cooperation strategy $\pi^C$, and the social welfare (the sum of both agents' rewards) is higher than that when using the defection strategy $\pi^D$. This indicates that our approach can prevent agents from being exploited by defecting opponents and can seek cooperation against cooperative ones. By comparing the results for different switching periods, we find that the detection accuracy decreases when the opponent changes its policy quickly. Since the agent requires several episodes of observations to detect the cooperation degree of the opponent, by the time it realizes its opponent has changed policy and adjusts its own, the opponent may have changed policy again. This problem becomes less severe when the switching period is comparatively large.

5 Conclusions

In this paper, we take a first step towards investigating the multiagent learning problem in large-scale PD games by leveraging recent advances in deep reinforcement learning. We propose a deep multiagent RL approach towards mutual cooperation in SPD games that supports adaptive end-to-end learning. Empirical simulations show that our agent can efficiently achieve mutual cooperation under self-play and also performs well against opponents with changing strategies. As a first step towards solving the multiagent learning problem in large-scale environments, we believe there are many interesting questions remaining for future work. One worthwhile direction is how to generalize our approach to other classes of large-scale multiagent games. Generalized policy detection and reuse techniques should be proposed, e.g., by extending existing approaches in traditional reinforcement learning contexts [Hernandez-Leal et al.2016b, Hernandez-Leal et al.2016a, Hernandez-Leal and Kaisers2017].

References

Appendix A Playing with Opponents with Changing Strategies

A.1 The Apple-Pear Game

Figure 11: Apple-Pear game: agent 2's policy varies between $\pi^C$ and $\pi^D$ every 50 episodes.
Figure 12: Apple-Pear game: agent 2's policy varies between $\pi^C$ and $\pi^D$ every 80 episodes.
Figure 13: Apple-Pear game: agent 2's policy varies between $\pi^C$ and $\pi^D$ every 110 episodes.
Figure 14: Apple-Pear game: agent 2's policy varies between $\pi^C$ and $\pi^D$ every 140 episodes.
Figure 15: Apple-Pear game: agent 2's policy varies between $\pi^C$ and $\pi^D$ every 170 episodes.
Figure 16: Apple-Pear game: agent 2's policy varies between $\pi^C$ and $\pi^D$ every 200 episodes.

A.2 The Fruit Gathering Game

Figure 17: Fruit Gathering game: agent 2's policy varies between $\pi^C$ and $\pi^D$ every 100 steps.
Figure 18: Fruit Gathering game: agent 2's policy varies between $\pi^C$ and $\pi^D$ every 200 steps.
Figure 19: Fruit Gathering game: agent 2's policy varies between $\pi^C$ and $\pi^D$ every 300 steps.
Figure 20: Fruit Gathering game: agent 2's policy varies between $\pi^C$ and $\pi^D$ every 400 steps.
Figure 21: Fruit Gathering game: agent 2's policy varies between $\pi^C$ and $\pi^D$ every 500 steps.