1 Introduction
Learning is the key to achieving coordination with others in multiagent environments [Stone and Veloso, 2000]. Over the last couple of decades, a large body of multiagent learning techniques has been proposed that aim to coordinate on various solutions (e.g., Nash equilibrium) in different settings, e.g., minimax Q-learning [Littman, 1994], Nash Q-learning [Hu and Wellman, 2003], and Conditional-JAL [Banerjee and Sen, 2007], to name just a few.
One commonly investigated class of games is the Prisoner's Dilemma (PD), in which a Nash equilibrium is not a desirable learning target. Until now, a large body of work [Axelrod, 1984; Nowak and Sigmund, 1993; Banerjee and Sen, 2007; Crandall and Goodrich, 2005; Damer and Gini, 2008; Hao and Leung, 2015; Mathieu and Delahaye, 2015] has been devoted to incentivizing rational agents towards mutual cooperation in repeated matrix PD games. However, all the above works focus on classic repeated PD games, which ignore several key aspects of real world prisoner's dilemma scenarios. In repeated PD games, the moves are atomic actions and can be easily labeled as cooperative or uncooperative, or learned from the payoffs [Busoniu et al., 2008]. In contrast, in real world PD scenarios, cooperation/defection behaviors are temporally extended and the payoff signals are usually delayed (available only after a number of steps of interaction).
Crandall [2012] proposes the Pepper framework for repeated stochastic PD games (e.g., the two-player gate entering problem), which can extend strategies originally proposed for classic repeated matrix games. Later techniques extend Pepper to different scenarios, e.g., stochastic games with a large state space under a tabular framework [Elidrisi et al., 2014] and playing against switching opponents [Hernandez-Leal and Kaisers, 2017]. However, these approaches rely on handcrafted state inputs and tabular Q-learning to learn optimal policies. Thus, they cannot be directly applied to more realistic environments whose states are too large and complex to be analyzed beforehand.
Leibo et al. [2017] introduce a 2D Fruit Gathering game to better capture real world social dilemma characteristics while maintaining the characteristics of classic iterated PD games. In this game, at each time step, an agent selects its action based on its image observation and cannot directly observe the actions of the opponent. Different policies represent different levels of cooperativeness, which is a graded quantity. They investigate the cooperation/defection emergence problem by leveraging the power of deep reinforcement learning [Mnih et al., 2013; Mnih et al., 2015] from a descriptive point of view: how do the behaviors of multiple selfish independent agents evolve when each updates its policy using deep Q-learning? In contrast, this paper takes a prescriptive and non-cooperative perspective and considers the following question: how should an agent learn effectively in real world social dilemma environments when it is faced with different opponents?
To this end, in this paper we first formally introduce the general notion of the sequential prisoner's dilemma (SPD) to model real world PD problems. We propose a multiagent deep reinforcement learning approach for mutual cooperation in SPD games. Our approach consists of two phases: an offline phase and an online phase. The offline phase generates policies with varying cooperation degrees and trains a cooperation degree detection network. To generate policies, we propose using a weighted target reward and two training schemes, IAC and JAC, to train baseline policies with varying cooperation degrees, and then a policy generation approach to synthesize the full range of policies from these baseline policies. Lastly, we propose a cooperation degree detection network, implemented as an LSTM-based structure with an encoder-decoder module, and generate a training dataset for it. The online phase extends the Tit-for-Tat principle to sequential prisoner's dilemma scenarios: our strategy adaptively selects a policy with the proper cooperation degree from a continuous range of candidates, based on the detected cooperation degree of the opponent. Intuitively, on one hand, our overall algorithm is cooperation-oriented and seeks mutual cooperation whenever possible; on the other hand, it is also robust against selfish exploitation and resorts to a defection strategy to avoid being exploited whenever necessary. We evaluate the performance of our deep multiagent reinforcement learning approach using two 2D SPD games (the Fruit Gathering and Apple-Pear games). Our experiments show that our agent can efficiently achieve mutual cooperation under self-play and also perform well against opponents with changing stationary policies.
2 Background
2.1 Matrix Games and the Prisoner’s Dilemma
A matrix game can be represented as a tuple $(N, \{A_i\}_{i \in N}, \{r_i\}_{i \in N})$, where $N$ is the set of agents, $A_i$ is the set of actions available to agent $i$ (with $A = A_1 \times \dots \times A_n$ being the joint action space), and $r_i: A \to \mathbb{R}$ is the reward function for agent $i$. One representative class of matrix games is the prisoner's dilemma game, as shown in Table 1. In this game, each agent has two actions: cooperate (C) and defect (D), and is faced with four possible rewards: $R$, $S$, $T$, and $P$. The four payoffs satisfy the following five inequalities under a prisoner's dilemma game:

$R > P$: mutual cooperation is preferred to mutual defection.

$R > S$: mutual cooperation is preferred to being exploited by a defector.

$2R > T + S$: mutual cooperation is preferred to an equal probability of unilateral cooperation and defection.

$T > R$: exploiting a cooperator is preferred over mutual cooperation.

$P > S$: mutual defection is preferred over being exploited.
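The five conditions above can be checked mechanically. A minimal sketch (the payoff values below are the classic illustrative ones, not values from this paper):

```python
# Check that a payoff matrix (R, S, T, P) satisfies the five social
# dilemma inequalities listed above.
def is_prisoners_dilemma(R, S, T, P):
    return (
        R > P and          # mutual cooperation beats mutual defection
        R > S and          # cooperation beats being exploited
        2 * R > T + S and  # beats alternating unilateral C/D
        T > R and          # temptation to exploit a cooperator
        P > S              # mutual defection beats being exploited
    )

assert is_prisoners_dilemma(R=3, S=0, T=5, P=1)
assert not is_prisoners_dilemma(R=3, S=0, T=2, P=1)  # no temptation, so not a PD
```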
2.2 Markov Game
Markov games combine matrix games and Markov Decision Processes and can be considered an extension of matrix games to multiple states. A Markov game is defined by a tuple $(N, S, \{A_i\}_{i=1}^{N}, \{r_i\}_{i=1}^{N}, T)$, where $S$ is the set of states and $N$ is the number of agents, $\{A_i\}$ is the collection of action sets, with $A_i$ being the action set of agent $i$, and $\{r_i\}$ is the set of reward functions, with $r_i: S \times A_1 \times \dots \times A_N \to \mathbb{R}$ being the reward function for agent $i$. $T$ is the state transition function $T: S \times A_1 \times \dots \times A_N \to \Delta(S)$, where $\Delta(S)$ denotes the set of discrete probability distributions over $S$. Matrix games are the special case of Markov games when $|S| = 1$.

Table 1: payoff matrix of the prisoner's dilemma (row player's payoff listed first).

      C       D
C   R, R   S, T
D   T, S   P, P
Next we formally introduce SPD by extending the classic iterated PD game to multiple states.
2.3 Definition of Sequential Prisoner’s Dilemma
A two-player SPD is a tuple $(M, \Pi)$, where $M$ is a 2-player Markov game with state space $S$, and $\Pi$ is a set of policies with varying cooperation degrees. An empirical payoff matrix $(R, P, S, T)$ can be induced by a pair of policies $(\pi^c, \pi^d) \in \Pi^2$, where $\pi^c$ is more cooperative than $\pi^d$. Given the two policies $\pi^c$ and $\pi^d$, the corresponding empirical payoffs under any starting state $s$, with respect to the payoff matrix in Section 2.1, can be defined through their long-term expected payoffs:

$R(s) = V_1^{\pi^c, \pi^c}(s)$ (1)

$P(s) = V_1^{\pi^d, \pi^d}(s)$ (2)

$S(s) = V_1^{\pi^c, \pi^d}(s)$ (3)

$T(s) = V_1^{\pi^d, \pi^c}(s)$ (4)

Here $V_i^{\pi_1, \pi_2}(s)$ is the long-term payoff for agent $i$ when the joint policy $(\pi_1, \pi_2)$ is followed starting from state $s$:

$V_i^{\pi_1, \pi_2}(s) = \mathbb{E}\big[\textstyle\sum_{t=0}^{\infty} \gamma^t\, r_i(s_t, a_t^1, a_t^2) \,\big|\, a_t^j \sim \pi_j(\cdot \mid s_t),\ s_0 = s\big]$ (5)
A Markov game is an SPD when there exists a state for which the induced empirical payoff matrix satisfies the five inequalities in Section 2.1. Since SPD is more complex than PD, the existing approaches addressing learning in matrix PD games cannot be directly applied in SPD.
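To make the definition concrete, the long-term payoffs of Equations (1)-(5) can be estimated by Monte Carlo rollouts. A hedged sketch with assumed `env_step`/policy interfaces (not this paper's implementation) and a trivial stand-in environment:

```python
# Hedged sketch: estimate the long-term payoff V_i of Equation (5) for a
# joint policy by averaging discounted returns over finite-horizon rollouts.
def long_term_payoff(env_step, policies, s0, gamma=0.96, horizon=100,
                     episodes=200, agent=0):
    total = 0.0
    for _ in range(episodes):
        s, ret, disc = s0, 0.0, 1.0
        for _ in range(horizon):
            actions = [pi(s) for pi in policies]   # each agent acts on the state
            s, rewards = env_step(s, actions)      # env returns next state + rewards
            ret += disc * rewards[agent]
            disc *= gamma
        total += ret
    return total / episodes

# Toy stand-in: both agents receive reward 1 every step regardless of state.
v = long_term_payoff(lambda s, a: (s, (1.0, 1.0)),
                     [lambda s: 0, lambda s: 0],
                     s0=None, gamma=0.5, horizon=50, episodes=3)
# Geometric series: sum_{t<50} 0.5^t is approximately 2.0
assert abs(v - 2.0) < 1e-6
```

The same routine evaluated for the four pairings of $\pi^c$ and $\pi^d$ yields the empirical payoffs $R(s)$, $P(s)$, $S(s)$, $T(s)$.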
2.4 Deep Reinforcement Learning
Q-Learning and Deep Q-Networks: Q-learning and Deep Q-Networks (DQN) [Mnih et al., 2013; Mnih et al., 2015] are value-based reinforcement learning approaches for learning optimal policies in Markov environments. Q-learning makes use of an action-value function for policy $\pi$, $Q^{\pi}(s, a) = \mathbb{E}[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, a_0 = a]$. DQN uses a deep convolutional neural network to estimate Q-values; the optimal Q-values are learned by minimizing the following loss function:

$y = r + \gamma \max_{a'} \hat{Q}(s', a'; \theta^-)$ (6)

$L(\theta) = \mathbb{E}_{s, a, r, s'}\big[(y - Q(s, a; \theta))^2\big]$ (7)

where $\hat{Q}$ is a target network whose parameters $\theta^-$ are periodically updated with the most recent $\theta$.
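As a sanity check of Equations (6) and (7), the target and squared TD error for a single transition can be sketched with Q-tables standing in for the networks (the dictionary-based `q`/`q_target` interface is illustrative only):

```python
# Sketch of the DQN target (Eq. 6) and squared TD loss (Eq. 7) for one
# transition; q and q_target map state -> {action: value}.
def dqn_loss(q, q_target, s, a, r, s_next, gamma=0.99, done=False):
    y = r if done else r + gamma * max(q_target[s_next].values())  # Eq. (6)
    return (y - q[s][a]) ** 2                                      # Eq. (7)

q        = {0: {"C": 0.5, "D": 0.2}, 1: {"C": 0.0, "D": 0.0}}
q_target = {0: {"C": 0.4, "D": 0.1}, 1: {"C": 1.0, "D": 0.3}}
loss = dqn_loss(q, q_target, s=0, a="C", r=1.0, s_next=1, gamma=0.9)
# y = 1.0 + 0.9 * 1.0 = 1.9; loss = (1.9 - 0.5)^2 = 1.96
assert abs(loss - 1.96) < 1e-9
```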
Policy Gradient and Actor-Critic Algorithms: Policy gradient methods are used for a variety of RL tasks [Williams, 1992; Sutton et al., 2000]. Their objective is to maximize $J(\theta) = \mathbb{E}_{s \sim \rho^{\pi}, a \sim \pi_{\theta}}[\sum_t \gamma^t r_t]$ by taking steps in the direction of $\nabla_{\theta} J(\theta)$, where

$\nabla_{\theta} J(\theta) = \mathbb{E}_{s \sim \rho^{\pi}, a \sim \pi_{\theta}}\big[\nabla_{\theta} \log \pi_{\theta}(a \mid s)\, Q^{\pi}(s, a)\big]$ (8)

and $\rho^{\pi}$ is the state distribution under $\pi$. In practice, $Q^{\pi}(s, a)$ can be estimated in different ways; for example, a learned value estimate serves as a critic to guide the updating direction of $\pi_{\theta}$, which leads to the class of actor-critic algorithms [Schulman et al., 2015; Wang et al., 2016].
3 Deep RL: Towards Mutual Cooperation
Algorithm 1 describes our deep multiagent reinforcement learning approach, which consists of two phases, as discussed in Section 1. In the offline phase, we first seek to generate policies with varying cooperation degrees. Since the number of policies with different cooperation degrees is infinite, it is computationally infeasible to train all the policies from scratch. To address this issue, we first train two representative baseline policies (cooperation and defection) using Actor-Critic until convergence (Lines 3-5), detailed in Section 3.1; second, we synthesize the full range of policies from these baseline policies (Lines 6-7), detailed in Section 3.2. Another task is how to effectively detect the cooperation degree of the opponent. We divide this task into two steps: we first train an LSTM-based cooperation degree detection network offline (Lines 8-10), which is then used for real-time detection during the online phase, detailed in Section 3.3. In the online phase, our agent plays against any opponent by reciprocating with a policy of a slightly higher cooperation degree than the one we detect for the opponent (Lines 12-18), detailed in Section 3.4. Intuitively, on one hand, our algorithm is cooperation-oriented and seeks mutual cooperation whenever possible; on the other hand, it is also robust against selfish exploitation and resorts to a defection strategy to avoid being exploited whenever necessary.
3.1 Train Baseline Policies with Different Cooperation Degrees
One way of generating policies with different cooperation degrees is to directly change key parameters of the environment. For example, Leibo et al. [2017] investigate the influence of resource abundance on the learned policy's cooperation tendency in sequential social dilemma games where agents compete for limited resources. It is found that when both agents employ deep Q-learning, more cooperative behaviors are learned when resources are plentiful, and vice versa. We could leverage a similar idea of modifying game settings to generate policies with different cooperation degrees. However, this type of approach requires a perfect prior understanding of the environment, and it may not be practically feasible when one cannot modify the underlying game engine.
A more general way of generating policies with different cooperation degrees is to modify the agents' reward signals during learning. Intuitively, agents rewarded with the sum of all agents' immediate rewards would eventually learn cooperative policies that maximize the expected accumulated social welfare, while agents maximizing only their own rewards would learn more selfish (defecting) policies. Formally, for a two-player environment (agents 1 and 2), agent 1 computes a weighted target reward as follows:

$\hat{r}_1 = r_1 + w_{12}\, r_2$ (9)

where $w_{12}$ is agent 1's attitude towards agent 2, which reflects the relative importance of agent 2 in agent 1's perceived reward. By setting the values of $w_{12}$ and $w_{21}$ to 0, agents update their strategies in the direction of maximizing their own accumulated discounted rewards; by setting them to 1, agents update their strategies towards maximizing the overall accumulated discounted reward. The greater the value of agent 1's attitude towards agent 2, the higher the cooperation degree of agent 1's learned policy.
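A sketch of the weighted target reward for both agents (the additive form of Equation (9) is inferred from the surrounding description):

```python
# Equation (9) in the IAC scheme: each agent folds the other's immediate
# reward into its own target, scaled by its attitude w_ij.
def weighted_target_rewards(r1, r2, w12, w21):
    return r1 + w12 * r2, r2 + w21 * r1

# Selfish agents (attitudes 0) keep their own rewards ...
assert weighted_target_rewards(1.0, 0.5, 0.0, 0.0) == (1.0, 0.5)
# ... fully prosocial agents (attitudes 1) both target the social welfare.
assert weighted_target_rewards(1.0, 0.5, 1.0, 1.0) == (1.5, 1.5)
```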
Given the modified reward signal for each agent, the next question is how agents should learn to effectively converge to the expected behaviors. One natural way is to resort to the independent Actor-Critic (IAC) or some other independent deep reinforcement learning method, e.g., equipping each agent with an individual deep Q-learning algorithm (IDQL) trained on the modified reward. However, the key element behind the success of deep Q-learning, the experience replay memory, might prohibit effective learning in deep multiagent Q-learning environments [Foerster et al., 2017b; Sunehag et al., 2017]. The non-stationarity introduced by the coexistence of multiple independent learners means that the data in the replay memory may no longer reflect the current dynamics in which the agent is learning. Thus independent learners may frequently be confused by obsolete experience, which impedes the learning process. A number of methods have been proposed to remedy this issue [Foerster et al., 2017b; Lowe et al., 2017; Foerster et al., 2017a]; we omit the details, which are out of the scope of this paper.
Since the baseline policy training step is performed offline, another way of improving training is to use the joint Actor-Critic (JAC), treating both agents as a single learner during training. Note that we use JAC only for training baseline policies offline; we do not require control over the policies that an opponent may use online. In JAC, both agents share the same underlying network, which learns the optimal policy over the joint action space using a single reward signal. In this way, the aforementioned non-stationarity problem is avoided. Besides, compared with IAC, the training efficiency can be improved significantly since the network parameters are shared across agents in JAC.
In JAC, the weighted target reward is defined as follows:

$\hat{r} = w\, r_1 + (1 - w)\, r_2$ (10)

where $w$ represents the relative importance of agent 1 on the overall reward. The smaller the value of $w$, the higher the cooperation degree of agent 1's learned policy, and vice versa. Given the learned joint policy $\pi(a_1, a_2 \mid s)$, agent $i$ can easily obtain its individual policy by marginalizing out the other agent's actions:

$\pi_i(a_i \mid s) = \sum_{a_{-i}} \pi(a_i, a_{-i} \mid s)$ (11)
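The JAC target reward and the marginalization step can be sketched as follows (the exact forms of Equations (10) and (11) are reconstructed from the surrounding text, so treat them as assumptions):

```python
# Sketch of Equations (10) and (11): the JAC target reward and recovering
# an individual policy from the learned joint policy by marginalization.
def jac_reward(r1, r2, w):
    return w * r1 + (1 - w) * r2

def marginal_policy(joint, agent):
    # joint: dict mapping (a1, a2) -> probability, for one fixed state
    probs = {}
    for actions, p in joint.items():
        ai = actions[agent]
        probs[ai] = probs.get(ai, 0.0) + p
    return probs

joint = {("C", "C"): 0.4, ("C", "D"): 0.2, ("D", "C"): 0.3, ("D", "D"): 0.1}
assert abs(marginal_policy(joint, 0)["C"] - 0.6) < 1e-9   # P(a1 = C)
assert abs(jac_reward(1.0, 0.5, 0.5) - 0.75) < 1e-9
```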
As we mentioned previously, it is computationally prohibitive to train a large number of policies with different cooperation degrees, due to the high training cost of deep Q-learning and because the policy space is infinite. To alleviate this issue, we propose that only two policies, the cooperation policy $\pi_c$ and the defection policy $\pi_d$, need to be trained. Other policies with cooperation degrees between these baselines can be synthesized efficiently, as introduced in Section 3.2.
3.2 Policy Generation
Given the baseline policies $\pi_c$ and $\pi_d$, we synthesize multiple policies. Each continuous weighting factor $\alpha \in [0, 1]$ corresponds to a new policy defined as follows:

$\pi_{\alpha}(a \mid s) = \alpha\, \pi_c(a \mid s) + (1 - \alpha)\, \pi_d(a \mid s)$ (12)

The weighting factor $\alpha$ is defined as policy $\pi_{\alpha}$'s cooperation degree. This linear combination of two policies has two advantages: it 1) generates policies with varying cooperation degrees and 2) ensures low computational cost. Any synthesized policy $\pi_{\alpha}$ is more cooperative than $\pi_d$ and more defecting than $\pi_c$, and the higher the value of $\alpha$, the more cooperative the corresponding policy. It is important to mention that the cooperation degrees of synthesized policies are ordinal, i.e., they only reflect the policies' relative cooperation ranking. For example, for two synthesized policies $\pi_{\alpha_1}$ and $\pi_{\alpha_2}$ with $\alpha_1 = 2\alpha_2$, we can only conclude that $\pi_{\alpha_1}$ is more cooperative than $\pi_{\alpha_2}$; we cannot say that $\pi_{\alpha_1}$ is twice as cooperative as $\pi_{\alpha_2}$. Our way of synthesizing new policies can be understood as synthesizing over expert policies [He et al., 2016]. That work applies a similar idea to generate policies that better respond to different opponents in competitive environments; our goal here, in contrast, is to synthesize policies with varying cooperation degrees in sequential prisoner's dilemmas.
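Equation (12) amounts to a per-state convex combination of action distributions. A minimal sketch with illustrative baseline distributions (the action names are placeholders, not the games' actual action sets):

```python
# Equation (12): synthesize a policy of cooperation degree alpha as a
# per-state convex combination of the cooperation and defection baselines.
def synthesize(pi_c, pi_d, alpha):
    return {a: alpha * pi_c[a] + (1 - alpha) * pi_d[a] for a in pi_c}

pi_c = {"collect": 0.9, "beam": 0.1}   # cooperative baseline (illustrative)
pi_d = {"collect": 0.3, "beam": 0.7}   # defecting baseline (illustrative)
pi_half = synthesize(pi_c, pi_d, 0.5)
assert abs(pi_half["collect"] - 0.6) < 1e-9
assert abs(sum(pi_half.values()) - 1.0) < 1e-9  # still a distribution
```

A convex combination of two distributions is always a valid distribution, which is what keeps the synthesis cost negligible compared with training each policy from scratch.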
3.3 Opponent Cooperation Degree Detection
In classical iterated PD games, a number of techniques have been proposed to estimate the opponent's cooperation degree, e.g., counting the cooperation frequency when actions can be observed, or using a particle filter or a Bayesian approach otherwise [Damer and Gini, 2008; Hernandez-Leal et al., 2016a; Hernandez-Leal et al., 2016b; Leibo et al., 2017]. However, in SPD, actions are temporally extended, and the opponent's information (actions and rewards) cannot be observed directly, so the previous works cannot be directly applied. We need a way of accurately predicting the cooperation degree of the opponent from the observed sequence of moves. Given the sequential observations (time-series data) in SPD, we propose an LSTM-based cooperation degree detection network.
In previous sections, we introduced a way of synthesizing policies of any cooperation degree, so we can easily prepare a large dataset of agents' behaviors with varying cooperation degrees. Based on this, we can transform the cooperation degree detection problem into a supervised learning problem: given a sequence of moves of an opponent, the task is to detect the cooperation degree (label) of this opponent. We propose a recurrent neural network that combines an autoencoder and a recurrent classifier, as shown in Figure 1. Combining an autoencoder with a recurrent classifier brings two major benefits. First, the classifier and the autoencoder share the underlying layer parameters of the network, which ensures that classification is based on an effective feature extraction of the observed moves and improves the detection accuracy. Second, concurrent training of the autoencoder also helps to accelerate the training of the classifier and reduces fluctuation during training.
The network is trained on experiences collected by the agents. Both agents interact with the environment starting from initialized policies, yielding a training set $D$ of labeled trajectories:

$D = \{(\tau, c_{\alpha}) \mid \tau \in T_{(\pi_1, \pi_2^{\alpha})},\ \pi_1 \in \{\pi_c, \pi_d\},\ \pi_2^{\alpha} \in \Pi_2\}$ (13)

where $\pi_c$ and $\pi_d$ are the baseline policies of agent 1, $\Pi_2$ is the learned policy set of its opponent (agent 2), $c_{\alpha}$ is the relative cooperation degree of policy $\pi_2^{\alpha}$, and $T_{(\pi_1, \pi_2^{\alpha})}$ is the set of trajectories under the joint policy $(\pi_1, \pi_2^{\alpha})$. For each trajectory $\tau$, its label is the cooperation degree of agent 2's policy. The network is trained to minimize the following weighted cross-entropy loss:

$L = -\sum_{(\tau, c) \in D} \big[w_1\, c \log p_{\theta}(\tau) + w_2\, (1 - c) \log(1 - p_{\theta}(\tau))\big]$ (14)

where $w_1$ and $w_2$ are the class weights and $p_{\theta}(\tau)$ is the network output, i.e., the probability that $\tau$ was generated by a cooperative policy.
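A hedged sketch of the weighted binary cross-entropy (the per-class weighting is an assumption consistent with the $w_1 = 1$, $w_2 = 2$ setting reported in Section 4.2):

```python
import math

# Weighted cross-entropy of Equation (14) for binary labels
# (1 = cooperative trajectory); w1, w2 weight the two classes.
def weighted_bce(labels, probs, w1=1.0, w2=2.0):
    loss = 0.0
    for y, p in zip(labels, probs):
        loss -= w1 * y * math.log(p) + w2 * (1 - y) * math.log(1 - p)
    return loss / len(labels)

loss = weighted_bce([1, 0], [0.9, 0.2])
# = -(1 * log 0.9 + 2 * log 0.8) / 2
assert abs(loss - (-(math.log(0.9) + 2 * math.log(0.8)) / 2)) < 1e-12
```

Up-weighting the defection class penalizes the costlier error: mistaking a defector for a cooperator invites exploitation.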
3.4 Play Against Different Opponents
Once we have detected the cooperation degree of the opponent, the final question is how an agent should select proper policies to play against that opponent. A self-interested approach would be to simply play the best response policy toward the detected policy of the opponent. However, as mentioned before, we seek a solution that allows agents to achieve cooperation while avoiding being exploited.
Figure 2 shows our overall approach for playing against opponents towards mutual cooperation. At each time step $t$, agent 1 uses its previous sequence of observations (from time step $t - l$ to $t$) as the input of the detection network and obtains the detected cooperation degree $c_t^{\text{detect}}$ of its opponent. However, a one-shot detection of the opponent's cooperation degree might be misleading, due either to the detection error of our classifier or to the stochastic behaviors of the opponent, and may thus lead to high variance in our detection outcome and, thereafter, in our response policy. To reduce the variance, agent 1 uses exponential smoothing to update its current estimate of its opponent's cooperation degree as follows:

$c_t = (1 - \beta)\, c_{t-1} + \beta\, c_t^{\text{detect}}$ (15)

where $c_{t-1}$ is the estimated cooperation degree from the last time step and $\beta$ is the cooperation degree changing factor.
Finally, agent 1 sets its own cooperation degree equal to the estimated cooperation degree of its opponent plus its reciprocation level, and then synthesizes a new policy with the updated cooperation degree following Equation (12) as its next-step strategy against the opponent. Note that the reciprocation level can be quite low and still produce cooperation. The benefit is that it does not lead to a significant loss if the opponent is not cooperative at all, while full cooperation can be reached if the opponent reciprocates in the same way. Also note that we only provide one way of responding to opponents with changing policies; our overall approach is general, and any existing multiagent strategy selection approach can be applied here.
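The online adaptation loop, exponential smoothing followed by the reciprocation offset, can be sketched as follows (`beta` and the reciprocation level of 0.1 are illustrative values, not the paper's settings):

```python
# Exponential smoothing of the detected degree (Eq. 15), then respond
# with a slightly higher cooperation degree, clipped to [0, 1].
def update_estimate(prev, detected, beta):
    return (1 - beta) * prev + beta * detected

def own_degree(opponent_estimate, reciprocation=0.1):
    return min(1.0, opponent_estimate + reciprocation)

est = 0.0
for detected in [0.8, 0.8, 0.8]:   # detector keeps reporting 0.8
    est = update_estimate(est, detected, beta=0.5)
assert abs(est - 0.7) < 1e-9       # estimates: 0.4, 0.6, 0.7
assert abs(own_degree(est) - 0.8) < 1e-9
```

With a small reciprocation offset, two agents running this loop against each other ratchet their degrees upward toward full cooperation, while a stubborn defector only ever elicits a degree slightly above its own.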


4 Simulation and Results
4.1 SPD Game Descriptions
In this section, we adopt the Fruit Gathering game [Leibo et al., 2017] to evaluate the effectiveness of our approach. We also propose another game, Apple-Pear, which also satisfies the real world social dilemma conditions mentioned before. Each game involves two agents (one blue, one red). The task of an agent in the Fruit Gathering game is to collect as many apples, represented by green pixels, as possible (see Figure 3 (left)). An agent's action set is: step forward, step backward, step left, step right, rotate left, rotate right, use beam, and stand still. The agent obtains the corresponding fruit when it steps onto the square where the fruit is located. When an agent collects an apple, it receives a reward of 1, and the apple is removed from the environment, respawning after 40 frames. Each agent can also emit a beam in a straight line along its current orientation; an agent is removed from the map for 20 frames if it is hit by the beam twice. Intuitively, a defecting policy in this game is one that frequently tags the rival agent to remove it from the game, while a cooperative policy is one that rarely tags the other agent. In the Apple-Pear game, there is a red apple and a green pear (see Figure 3 (right)). The blue agent prefers the apple while the red agent prefers the pear. Each agent has four actions: step right, step left, step backward, and step forward, and each step of moving incurs a cost of 0.01. A fruit is collected when the agent steps onto its square. When the blue (red) agent collects an apple (pear) individually, it receives a higher reward of 1; when the blue agent collects a pear individually, it receives a lower reward of 0.5, and the situation is the opposite for the red agent. One exception is that both agents receive half of their corresponding rewards when they share a pear or an apple.
In this game, a fully defecting policy is to collect both fruits whenever the fruit-collecting reward exceeds the moving cost, while a cooperative one is to collect only the fruit it prefers, maximizing the agents' social welfare. In Section 4.4, we verify that the two games satisfy the definition of SPD in Section 2.3 by playing policies with different cooperation degrees against each other.
4.2 Network Architecture and Parameter Settings
In both games, our network architectures for training the baseline policies follow standard actor-critic networks, except that we allow the actor and the critic to share the same underlying network to reduce the parameter space. The underlying network consists of three convolutional layers applied to the input image, each followed by a rectifier nonlinearity. For the actor, on top of the shared network, the next layer is a fully connected layer with rectifier nonlinearity, and the final softmax layer has as many units as the number of actions. The critic is similar to the actor, but with only one scalar output.
The recurrent cooperation degree detection network is shown in Figure 1. The autoencoder and the detection network share the same underlying network of three convolutional layers, each followed by a rectifier nonlinearity. The autoencoder branch then applies three deconvolutional layers, each followed by a sigmoid, to reconstruct the input. The detection branch is followed by two LSTM layers, and the final layer is a single output node.¹

¹ The code and network architectures will be available soon: https://goo.gl/3VnFHj
For the Apple-Pear game, each episode has a fixed maximum number of steps, and the exploration rate is annealed linearly over the first training steps. The weight parameters are updated by soft target updates [Lillicrap et al., 2015] every 4 steps to avoid update fluctuation, as follows:

$\theta^- \leftarrow \tau\, \theta + (1 - \tau)\, \theta^-$ (16)

where $\theta$ denotes the parameters of the policy network and $\theta^-$ those of the target network. For the loss function of the cooperation degree detection network, we set $w_1$ and $w_2$ to 1 and 2, respectively (see Equation 14). When the agents play against different opponents online, the length of the state sequence fed to the cooperation detection network is set to the number of states visited. The cooperation degree changing factor is set to 1.
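The soft target update of Equation (16) can be sketched parameter-wise (the `tau` values below are illustrative, not the paper's setting):

```python
# Soft target-network update: the target parameters slowly track the
# policy parameters, with mixing rate tau (Lillicrap et al., 2015).
def soft_update(theta, theta_target, tau=0.01):
    return [tau * w + (1 - tau) * wt for w, wt in zip(theta, theta_target)]

theta, theta_t = [1.0, -2.0], [0.0, 0.0]
theta_t = soft_update(theta, theta_t, tau=0.1)
assert all(abs(a - b) < 1e-12 for a, b in zip(theta_t, [0.1, -0.2]))
```

Compared with periodically copying the full parameter vector, the soft update keeps the TD targets changing smoothly, which is the "update fluctuation" the text refers to avoiding.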
The Fruit Gathering game uses the same detection network architecture, while the actor-critic uses independent policy networks. The exploration rate and replay memory are the same as in the Apple-Pear game, and the weight parameters are updated in the same way:

$\theta^- \leftarrow \tau\, \theta + (1 - \tau)\, \theta^-$ (17)

When the agents play against different opponents online, we set the state sequence length to 50, and the changing factor is set to 0.02.
4.3 Effect of Baseline Policy Generation
For the Apple-Pear game, the baseline policies are trained using the JAC scheme. The individual reward of each agent increases gradually as its attitude increases and the other agent's attitude decreases. The results also indicate that an agent learns a more defecting policy when its attitude, which represents its relative importance on the overall reward, increases, and vice versa. We set the attitude in Equation (10) between 0.1 and 0.9 to train the baseline policies. The reason the attitudes are not set to 0 and 1 is that those settings cause an agent to learn a meaningless policy: an agent whose reward makes no contribution to the total reward in Equation (10) will always avoid collecting fruit, which would, in turn, degrade the learned policy of the other agent (i.e., the lazy agent problem [Perolat et al., 2017]). Figure 4 shows the average rewards of the agents under policies trained with different weighted target rewards. We also evaluated the IAC scheme; similar results can be obtained, and we omit them here.
For the Fruit Gathering game, the baseline policies are trained based on IAC. IAC is more efficient than JAC for training baseline policies in this game, since the rewards of collecting an apple are the same for both agents; if we adopted a JAC approach, the agent with the higher weight might collect all the apples while the other agent collects none. Results similar to those for the Apple-Pear game can be observed, and we omit them here.
4.4 Effect of Policy Generation
For the Apple-Pear game, the baseline cooperation and defection policies are trained under the attitude settings described above, from which we generate one policy set for each agent. After that, policies sampled from the two sets are matched against each other in the game for 200,000 episodes. The average rewards are assigned to the individual cells of different attitude pairs, which correspond to policies with varying cooperation degrees for agents 1 and 2 (see Figure 5 (a)). In Figure 5 (a), we observe that when the agents' cooperation degrees decrease, their rewards decrease; when both cooperation degrees increase, i.e., both agents become more cooperative, the sum of their rewards increases. Besides, given a fixed cooperation degree of one agent, the other agent's reward increases as its own cooperation degree decreases, and vice versa; a similar pattern can be observed for both agents. This indicates that we can successfully synthesize policies with a continuous range of cooperation degrees, and it confirms that the Apple-Pear game can be seen as an SPD following the definition in Section 2.3.
For the Fruit Gathering game, we use the policies with attitudes equal to 0 and 0.5 as the baseline defection and cooperation policies, respectively. Figure 5 (b) shows the resulting rewards, which are similar to those of the Apple-Pear game.
4.5 Effect of Cooperation Degree Detection
This section evaluates the detection power of the cooperation degree detection network. First, we train the network using datasets that include only data labeled as full cooperation or full defection. The training data are obtained from the baseline policies: we set the label to 1 when agent 2 uses its baseline cooperation policy and to 0 when it uses its baseline defection policy. After training, we evaluate the detection accuracy by detecting the cooperation degree of agent 2 when it uses policies with cooperation degrees from 0 to 1 at intervals of 0.1. For each episode, a policy pair is sampled from the two policy sets and matched against each other. After the average detected value is stable, we take the output value as the detected cooperation degree of agent 2.
For the Apple-Pear game, we collect 10,000 samples for each state sequence length in {3, 4, 5, 6, 7, 8}. The cooperation detection results are shown in Figure 6 (a). We can see that the detected values approximate the true values well, with slight variance. Besides, the network can clearly detect the order of different cooperation degrees, which allows us to rank the cooperation degree of agent 2 accurately. For the Fruit Gathering game, we collect 4,000 samples for each state sequence length in {40, 50, 60, 70}, including 2,000 samples labeled 1 and 2,000 labeled 0. The network is trained in a similar way, and the detection accuracy is high (see Figure 6 (b)). From Figure 6, we observe that the detection results are almost linear in the true values. Figure 6 (c) shows the detection results for agent 2 when agent 1's policy is fixed with cooperation degree 1. We can see that the true cooperation degree of agent 2 can be easily obtained by fitting a linear curve between the predicted and true values. Thus, for each policy of agent 1, we can fit a linear function to estimate the true cooperation degree of agent 2. During practical online learning, when agent 1 plays policies of varying cooperation degrees against agent 2, it first chooses the fitted function whose corresponding policy is closest to the policy in use, and then computes the cooperation degree of agent 2.
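The calibration step described above, fitting a linear function from detector outputs to true cooperation degrees, is ordinary least squares on a line. A minimal sketch with illustrative data (not the paper's measurements):

```python
# Fit y = slope * x + intercept by least squares; xs are detector outputs,
# ys the corresponding true cooperation degrees.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Illustrative detector outputs that under-shoot the true degree linearly.
true = [0.0, 0.5, 1.0]
pred = [0.1, 0.5, 0.9]
slope, intercept = fit_line(pred, true)
# Applying the fitted map to a raw detector output recovers the true degree.
assert abs(slope * 0.9 + intercept - 1.0) < 1e-9
```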
4.6 Performance under Self-Play
Next, we evaluate the learning performance of our approach under self-play. Since the initial policies of the agents can affect their behaviors in the game, we evaluate all different initial conditions: a) agent 1 starts with the cooperation policy and agent 2 with the defection policy; b) agent 1 starts with the defection policy and agent 2 with the cooperation policy; c) both agents start with cooperation policies; d) both agents start with defection policies. The agents converge to full cooperation in all four cases. We present the results for the last case, which is the most challenging one (see Figure 7 and Figure 8): when both agents start with a defection policy, it is more likely for them to model each other as defective and thus both play defect thereafter. Our approach enables agents to successfully detect the cooperation tendency of their opponents from sequential actions and eventually converge to cooperation. In the Apple-Pear game, agents converge efficiently within a few episodes. The reason is that when agents are close to the fruits they prefer, they collect them no matter whether their policies are cooperative or not; thus an agent is more likely to be detected as cooperative, which induces its opponent to change its policy towards cooperation. In contrast, in the Fruit Gathering game, agents need a relatively longer time to converge to full cooperation. This is because in this game the main feature for detecting the cooperation degree is the beam emitting frequency: an agent that emits the beam continuously is detected as defecting, and mutual cooperation emerges only when both agents collect fruits without emitting beams.
4.7 Playing with Opponents with Changing Strategies
Now, we evaluate the performance against switching opponents: during a repeated interaction of episodes, the opponent changes its policy after a certain number of episodes, and the learning agent does not know when the switches happen. This lets us evaluate the cooperation degree detection performance of our network, and verify whether our strategy outperforms the fully cooperative and fully defective strategies in two respects: 1) our approach seeks mutual cooperation whenever possible; 2) our approach is robust against selfish exploitation. In the Apple-Pear game, we vary the switching period at intervals of 30 episodes, and similarly in the Fruit Gathering game at intervals of 100 episodes. One set of results for each game is provided in Figures 9 and 10; the results for the other switching periods, which exhibit similar phenomena, are given in the Appendix. From Figures 9 and 10, we observe that for both games, the average rewards of the learning agent are higher than those it would obtain with the fully cooperative strategy, and the social welfare (the sum of both agents' rewards) is higher than that under the fully defective strategy. This indicates that our approach prevents the agent from being exploited by defecting opponents while seeking cooperation against cooperative ones. By comparing the results across switching periods, we find that the detection accuracy decreases when the opponent changes its policy quickly: since the agent needs to observe several episodes to detect the opponent's cooperation degree, by the time it realizes the opponent has changed policy and adjusts its own, the opponent may have switched again. This problem becomes less severe when the switching period is comparatively large.
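The effect of the switching period on detection accuracy can be illustrated with a back-of-the-envelope model. All numbers here are assumptions for illustration, not measurements from the paper: if the agent needs a fixed detection lag of episodes after each switch before it identifies the opponent's new policy, the fraction of correctly identified episodes shrinks as the switching period gets smaller.

```python
def detection_accuracy(period, lag):
    """Fraction of each switching period spent with a correct detection,
    assuming a fixed detection lag after every switch (an idealization)."""
    if period <= lag:
        return 0.0  # the opponent switches again before detection completes
    return (period - lag) / period

# Shorter periods leave less time with a correct model of the opponent.
for period in (30, 60, 100, 300):
    print(period, round(detection_accuracy(period, lag=20), 2))
```

This matches the observed trend: accuracy degrades for fast-switching opponents and the problem fades as the period grows.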
5 Conclusions
In this paper, we take a first step towards the multiagent learning problem in large-scale PD games by leveraging recent advances in deep reinforcement learning. We propose a deep multiagent RL approach towards mutual cooperation in SPD games that supports adaptive end-to-end learning. Empirical simulations show that our agent can efficiently achieve mutual cooperation under self-play and also performs well against opponents with changing strategies. As a first step towards solving the multiagent learning problem in large-scale environments, we believe many interesting questions remain for future work. One worthwhile direction is how to generalize our approach to other classes of large-scale multiagent games, which calls for more general policy detection and reuse techniques, e.g., by extending existing approaches from traditional reinforcement learning contexts [Hernandez-Leal et al.2016b, Hernandez-Leal et al.2016a, Hernandez-Leal and Kaisers2017].
References
 [Axelrod1984] Robert Axelrod. The evolution of cooperation, 1984.
 [Banerjee and Sen2007] Dipyaman Banerjee and Sandip Sen. Reaching Pareto-optimality in prisoner’s dilemma using conditional joint action learning. Autonomous Agents and Multi-Agent Systems, 15(1):91–108, 2007.
 [Busoniu et al.2008] Lucian Busoniu, Robert Babuska, and Bart De Schutter. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 38(2), 2008.
 [Crandall and Goodrich2005] Jacob W Crandall and Michael A Goodrich. Learning to teach and follow in repeated games. In AAAI workshop on Multiagent Learning, 2005.
 [Crandall2012] Jacob W Crandall. Just add pepper: extending learning algorithms for repeated matrix games to repeated Markov games. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems - Volume 1, pages 399–406. International Foundation for Autonomous Agents and Multiagent Systems, 2012.
 [Damer and Gini2008] Steven Damer and Maria L Gini. Achieving cooperation in a minimally constrained environment. In AAAI, pages 57–62, 2008.
 [Elidrisi et al.2014] Mohamed Elidrisi, Nicholas Johnson, Maria Gini, and Jacob Crandall. Fast adaptive learning in repeated stochastic games by game abstraction. In Proceedings of the 2014 international conference on Autonomous agents and multiagent systems, pages 1141–1148. International Foundation for Autonomous Agents and Multiagent Systems, 2014.
 [Foerster et al.2017a] Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multiagent policy gradients. arXiv preprint arXiv:1705.08926, 2017.
 [Foerster et al.2017b] Jakob Foerster, Nantas Nardelli, Gregory Farquhar, Philip Torr, Pushmeet Kohli, Shimon Whiteson, et al. Stabilising experience replay for deep multiagent reinforcement learning. arXiv preprint arXiv:1702.08887, 2017.
 [Hao and Leung2015] Jianye Hao and Ho-fung Leung. Introducing decision entrustment mechanism into repeated bilateral agent interactions to achieve social optimality. Autonomous Agents and Multi-Agent Systems, 29(4):658–682, 2015.
 [He et al.2016] He He, Jordan Boyd-Graber, Kevin Kwok, and Hal Daumé III. Opponent modeling in deep reinforcement learning. In International Conference on Machine Learning, pages 1804–1813, 2016.
 [Hernandez-Leal and Kaisers2017] Pablo Hernandez-Leal and Michael Kaisers. Towards a fast detection of opponents in repeated stochastic games. In The Workshop on Transfer in Reinforcement Learning, 2017.
 [Hernandez-Leal et al.2016a] Pablo Hernandez-Leal, Benjamin Rosman, Matthew E Taylor, L Enrique Sucar, and Enrique Munoz de Cote. A Bayesian approach for learning and tracking switching, non-stationary opponents. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, pages 1315–1316, 2016.
 [Hernandez-Leal et al.2016b] Pablo Hernandez-Leal, Matthew E Taylor, Benjamin Rosman, L Enrique Sucar, and Enrique Munoz de Cote. Identifying and tracking switching, non-stationary opponents: a Bayesian approach. In Multiagent Interaction without Prior Coordination Workshop at AAAI, 2016.
 [Hu and Wellman2003] Junling Hu and Michael P Wellman. Nash Q-learning for general-sum stochastic games. Journal of Machine Learning Research, 4(Nov):1039–1069, 2003.
 [Leibo et al.2017] Joel Z Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel. Multiagent reinforcement learning in sequential social dilemmas. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pages 464–473. International Foundation for Autonomous Agents and Multiagent Systems, 2017.
 [Lillicrap et al.2015] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
 [Littman1994] Michael L Littman. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning, pages 157–163, 1994.
 [Lowe et al.2017] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. arXiv preprint arXiv:1706.02275, 2017.
 [Mathieu and Delahaye2015] Philippe Mathieu and JeanPaul Delahaye. New winning strategies for the iterated prisoner’s dilemma. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pages 1665–1666. International Foundation for Autonomous Agents and Multiagent Systems, 2015.
 [Mnih et al.2013] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
 [Mnih et al.2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
 [Nowak and Sigmund1993] Martin Nowak and Karl Sigmund. A strategy of win-stay, lose-shift that outperforms tit-for-tat in the prisoner’s dilemma game. Nature, 364(6432):56–58, 1993.
 [Perolat et al.2017] Julien Perolat, Joel Z Leibo, Vinicius Zambaldi, Charles Beattie, Karl Tuyls, and Thore Graepel. A multi-agent reinforcement learning model of common-pool resource appropriation. arXiv preprint arXiv:1707.06600, 2017.
 [Schulman et al.2015] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
 [Stone and Veloso2000] Peter Stone and Manuela Veloso. Multiagent systems: A survey from a machine learning perspective. Autonomous Robots, 8(3):345–383, 2000.
 [Sunehag et al.2017] Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z Leibo, Karl Tuyls, et al. Valuedecomposition networks for cooperative multiagent learning. arXiv preprint arXiv:1706.05296, 2017.
 [Sutton et al.2000] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.
 [Wang et al.2016] Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, and Nando de Freitas. Sample efficient actorcritic with experience replay. arXiv preprint arXiv:1611.01224, 2016.
 [Williams1992] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.