Learning through Probing: a decentralized reinforcement learning architecture for social dilemmas

09/26/2018 ∙ by Nicolas Anastassacos, et al.

Multi-agent reinforcement learning has received significant interest in recent years, notably due to the advancements made in deep reinforcement learning, which have allowed for the development of new architectures and learning algorithms. Using social dilemmas as the training ground, we present a novel learning architecture, Learning through Probing (LTP), where agents utilize a probing mechanism to incorporate how their opponent's behavior changes when an agent takes an action. We use distinct training phases and adjust rewards according to the overall outcome of the experiences, accounting for changes to the opponent's behavior. We introduce a parameter eta to determine the significance of these future changes to opponent behavior. When applied to the Iterated Prisoner's Dilemma (IPD), LTP agents demonstrate that they can learn to cooperate with each other, achieving higher average cumulative rewards than other reinforcement learning methods while also maintaining good performance in playing against static agents that are present in Axelrod tournaments. We compare this method with traditional reinforcement learning algorithms and agent-tracking techniques to highlight key differences and potential applications. We also draw attention to the differences between solving games and societal-like interactions and analyze the training of Q-learning agents in makeshift societies. This is to emphasize how cooperation may emerge in societies, which we demonstrate using environments where interactions with opponents are determined through a random encounter format of the IPD.





Multi-agent reinforcement learning (RL) has garnered a significant amount of interest in recent years due to the advancements in deep RL, which have allowed for extensive study of agent behaviors. There has been emphasis on designing cooperative agents for decades [Tan1993, Kapetanakis and Kudenko2002], yet extending this success to multi-agent environments has proven difficult: the Markov property is not satisfied since agent behaviors are continuously changing [Sutton and Barto1998], and the use of experience replay does little to inhibit unstable learning in the presence of multiple learners. Indeed, there are still challenges to be tackled in order to enable broader applications, e.g., in automated decision-making such as self-driving cars, personalized assistants, and the eventuality of artificial agents operating in society. A central aspect of this evolution lies in understanding the competitive and collaborative nature of environments and the emergence of such behaviors [Axelrod and Hamilton1981, Nowak2006].

Humans have cooperated and maintained cooperation to great effect, which has been paramount for the development of civilization [Axelrod and Hamilton1981, Boyd and Richerson2009, Fehr and Fischbacher2004]. Many plants and animals have also demonstrated the tendency to cooperate with relatives and have even been observed cooperating with members of a different species in highly competitive environments, likely to take advantage of long-term rewards [Clutton-Brock2004, Stevens and Hauser2004]. The evolution of cooperation in competitive environments has therefore been relevant to studies in economics, game theory, psychology, social science, and now computer science, as the future will certainly demand interaction between artificial agents in human and artificial societies. The emergence of cooperative and competitive strategies has been studied in Social Dilemmas [Ostrom, Gardner, and Walker1994, Yu et al.2015, Lange et al.2013, Leibo et al.2017]. These are games where an individual profits from selfishness unless everyone chooses to behave selfishly, in which case the group as a whole achieves an undesirable outcome. In other words, problems arise when too many group members choose to pursue individual profit and immediate satisfaction rather than behave in the group’s best long-term interests. From a game-theoretic perspective, the dominant strategy in social dilemmas is often to behave selfishly, which results in arriving at a socially deficient Nash Equilibrium [Leibo et al.2017]. One of the first and most studied examples of social dilemmas is the Prisoner’s Dilemma (PD), a two-player social dilemma that is formalized by a payoff matrix and a dominant strategy to defect despite a collaborative effort from both players leading to a higher reward. The Iterated Prisoner’s Dilemma (IPD) is an extended, sequential version of the PD that has often been the focus for multi-agent RL.

Most RL algorithms are designed for the single-agent case. Qualitatively, the Q-value describes how much reward an agent is expected to receive when taking a particular action at a particular state [Sutton and Barto1998]. As the environment changes, the learned Q-values become increasingly irrelevant, and this is exacerbated as the number of learning agents increases. Other research projects have aimed to address the issue of non-stationarity in multi-agent environments in a variety of ways, ranging from refreshing the experience replay buffer [Leibo et al.2017], using importance sampling [Foerster et al.2017, Uchibe and Doya2004] to stabilize learning, using defined policy-types [Lerer and Peysakhovich2017], agent-tracking techniques to predict the policy of an opposing agent [Tesauro2004, Foerster et al.2017, Lowe et al.2017, Zhang and Lesser2010], and centralized functions to share information across all participating agents [Lowe et al.2017]. To tackle this, we instead propose a decentralized Learning through Probing architecture, which allows agents to gather experiences that have been adjusted to reflect behavioral changes in a sequence of events over a period of time via an adjusted reward signal. Experimentally, we demonstrate that two agents trained with this approach learn to cooperate in the IPD as each agent accounts for the opposing agent’s learning while also revealing how their own behavior will change as a result of an opposing agent’s chosen actions. Furthermore, we also demonstrate how this type of training mechanism results in an RL agent learning optimal policies for the IPD when matched with other stationary and quasi-stationary strategies from Axelrod tournaments. Finally, we contrast this with current methodologies in multi-agent RL to highlight potential difficulties and we discuss how probing and using experiences through updates might help established methods achieve better performance in dynamic environments.

Alongside this architecture, we also demonstrate that shaping the environment can lead to cooperative behaviors using a standard Q-learning algorithm to take into account the effect of external factors beyond each individual agent’s decision-making in determining what behaviors can emerge when introducing artificial agents into open environments like societies. The focus in multi-agent RL is predominantly based around stationarizing the environment in order for agents to learn how to achieve optimal outcomes in closed environments. This typically requires the same agents to be used in training and testing. However, an overlooked aspect of agent behavior is how to design environments to nurture certain types of behavior. We present experimental results on how untrained agents can learn to cooperate with other untrained agents using standard Q-learning when interacting with static agents at the same time.

Related Work and Motivation

Modeling multi-agent systems and designing learning algorithms for dynamic settings is hard and the majority of work in this area focuses on competitive, zero-sum games. A common approach is to simply have each agent treat all other agents as part of the environment and learn independently, however, this generally leads to less than optimal performance [Kapetanakis and Kudenko2002, Shoham, Powers, and Grenager2003]. Firstly, a naive implementation of experience replay is not suitable for dynamically changing environments [Adam, Buşoniu, and Babuška2012, Schaul et al.2016]. Others have employed importance sampling in order to stabilize the application of replay buffers where experiences that are collected from “old” environments can still be used to update Q-values [Uchibe and Doya2004, Foerster et al.2017]. Secondly, agents have to account for actions taken by other agents in their Q-value approximations [Buşoniu, Babuška, and Schutter2010]. Finally, learning to play optimally against just an opponent’s current behavior may trap players in undesirable states, as can be the case in social dilemmas, as there is no consistent and suitable method of exploration.

The study of sequential social dilemmas, notably the IPD, has been prevalent across numerous disciplines such as game theory, economics, and the social sciences as a way of analyzing complex behaviors such as altruism, reciprocity and cooperation [Rapoport1974, Macy and Flache2002, Nowak and Sigmund2005]. Tit-For-Tat (TFT), a strategy where an agent cooperates on the first move and then replicates an opponent’s previous action (known as equivalent retaliation), is simple yet has been shown to be one of the most effective strategies in the IPD and has served as a basis for the modeling of many real-world behaviors [Axelrod and Hamilton1981]. Social dilemmas like the IPD have proven to be an effective training ground for multi-agent RL as they might involve cooperation as a viable method of achieving optimal performance against fixed policies as well as learning agents [Sandholm and Crites1996, Zhang and Lesser2010, Lerer and Peysakhovich2017]. In order to perform well in social dilemmas, agents must learn to forgo their desire for early rewards and “agree” on strategies that will benefit those involved. Many agents designed for Axelrod tournaments utilize reciprocating behavior, which is the general philosophy behind the Tit-For-Tat strategy, whereas others employ more grudging techniques to maneuver opponents into cooperating.

Leibo et al. have attempted to train Q-learning agents for sequential social dilemmas so as to analyze how conflict, competition, and cooperation can emerge via a multiplayer Wolfpack hunting game and a Gathering game [Leibo et al.2017]. Replay buffers of a fixed capacity were used to try to accommodate the multiple learners; once filled, they were refreshed so that the agent emphasized training on more recent experiences. A more recently popular approach to multi-agent reinforcement learning involves policy prediction which, alongside Q-learning (independent learners), we will contrast our approach with. We emphasize this approach because we think that it closely captures a necessary element of multi-agent learning: understanding opponent behavior and incorporating it directly into learning. It expands on methods like fictitious play [Erev and Roth1998] and joint action learning (JAL) to more accurately represent an opponent’s policy [Claus and Boutilier1998, Marden, Arslan, and Shamma2009] and enables coordination (in cooperative games). Tesauro presents Hyper-Q learning, which learns the value of mixed strategies instead of base actions by estimating opponent actions using observed data, and evaluates it using Rock-Paper-Scissors [Tesauro2004]. He further argues that Hyper-Q learning may be effective against agents even if they are persistently dynamic. This has been corroborated by other research employing a similar philosophy of policy prediction. Experiments on Starcraft and other abstract games that require complex multi-agent coordination have shown that this methodology significantly improves performance compared to independent learners trained with Q-learning, though it is often combined with other techniques, e.g., sampling techniques or a centralized value function [Foerster et al.2017, Lowe et al.2017]. However, as we will demonstrate in our experiments, these types of methods do not perform adequately in social dilemmas as they aim to shape learning around what is happening currently rather than what could happen in the future. In contrast, our approach focuses directly on understanding the consequence of actions on an opponent’s behavior and incorporates that knowledge directly into agent learning via an adjusted reward function.

The use of social behavior metrics is another approach to tackle the issue of describing what is really happening at a state at any moment in time in decentralized learning environments [Perolat et al.2017]; however, it is difficult to determine how these metrics should be designed as they are contextually dependent on the environment. Matignon et al. achieve cooperative behavior in a decentralized RL system using a modified update equation that is conditioned on the size of the reward [Matignon, Laurent, and le Fort-Piat2007]. Another decentralized method by Yu et al. attempts to embed emotional context into agents to drive them to learn cooperative behaviors using various metrics to represent an agent’s drive and emotions relative to neighbouring agents [Yu et al.2015]. However, these approaches represent only the current standings that are available, without an indication of how things may or may not change, which we maintain is essential to developing cooperative behavior. An approach that also emphasizes integrating the future behaviour of one’s opponents is LOLA, which considers opponent learning, optimizing its return using a one-step look ahead that requires direct access to the opposing agent’s parameters [Foerster et al.2018]. Our approach differs in a number of ways. Firstly, we use two components, one to probe opponent agents at train time and another to play at test time. Secondly, the probes simultaneously demonstrate changes in behaviour and gather data around how their opponent’s behaviour changes. Thirdly, we use a defined time horizon to determine the number of future updates to consider during training.



Q-learning is a popular off-policy reinforcement learning technique to learn optimal behaviors. An agent trained with Q-learning looks to take actions that maximize its expected cumulative reward. A value function for a policy $\pi$ is given by

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_{0} = s\right]$$

Among all possible value functions there exists a maximum optimal value function $V^{*}(s) = \max_{\pi} V^{\pi}(s)$ and an optimal policy $\pi^{*}$ that corresponds to the optimal value function. The optimal Q-function is defined as the expected cumulative reward received by an agent starting in $s$, picking action $a$ and behaving optimally from that point onwards. We can therefore write the optimal Q-function as

$$Q^{*}(s, a) = \mathbb{E}_{s'}\left[r + \gamma \max_{a'} Q^{*}(s', a') \,\middle|\, s, a\right]$$

From here, we can see clearly that the Q-value is optimal assuming the distribution over states remains the same. This is sensible in single-agent games but not in multi-agent settings since we expect the state distribution to change as agents learn. Agent-tracking methods expand on this, conditioning the transition to $s'$ on both the agent’s own action as well as any opponent’s actions.
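The Q-learning update discussed above can be illustrated with a minimal tabular sketch. This is an assumption made for readability: the experiments in this paper use neural networks, and the function name `q_update`, the learning rate `alpha` and the state labels are hypothetical.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    best_next = max(q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


# Hypothetical usage: unseen (state, action) pairs default to a value of 0.0.
q = defaultdict(float)
actions = ("C", "D")
new_value = q_update(q, state="s0", action="C", reward=3.0,
                     next_state="s1", actions=actions)
```

The target uses only the agent's own next-state estimate, which is why the learned values go stale when another learner changes the distribution over next states.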


Hyper-Q-learning is an agent-tracking technique and an extension of Q-learning for multi-agent systems. It estimates an opponent’s mixed strategy and then evaluates the best response. In the single agent case, the Hyper-Q function is adjusted such that

$$Q(s, a, y) \leftarrow Q(s, a, y) + \alpha\left[r + \gamma \max_{a'} Q(s', a', y') - Q(s, a, y)\right]$$

where $\alpha$ is the learning rate and $y'$ is a new estimated opponent strategy in $s'$. Variations of Hyper-Q have performed well using deep neural networks with sampling modifications to the replay buffer. A similar approach in multi-agent scenarios has been implemented with success using actor critic techniques such that the target value of agent


In this paper, we use a variation of Hyper-Q for simplicity in the IPD, adopting a separate neural network to estimate an opponent’s next action directly from an observation, and optimize by taking recent samples of an agent from a replay buffer. While agent-tracking methods have performed better than independent learners, they still suffer from a problem of non-stationarity, which is that the distribution of states changes as each agent updates its policy $\pi$. If the updates to $\pi$ are small and the environment is less dynamic, then the state distribution is approximately unchanged and the learned Q-function is sufficient for calculating an optimal policy, though this problem is exacerbated the more learners there are. Furthermore, since agents cannot anticipate how behavior will change in the future, they cannot avoid getting trapped in socially deficient Nash Equilibria (which are optimal and dominant strategies given knowledge only of an opponent’s current policy). We contrast this with our approach that, instead, emphasizes an awareness of how an agent’s actions can influence an opponent’s future response, as we argue that optimizing against the current policy of an opponent can trap agents into policies that result in both parties receiving inadequate rewards.
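To make the trap concrete, the following sketch pairs an opponent-action estimate with a one-step best response. It is only an illustration: a simple frequency count stands in for the neural network predictor described above, and the names `FrequencyOpponentModel` and `myopic_best_response` are hypothetical. The rewards are the standard IPD payoffs (3, 0, 5, 1).

```python
from collections import Counter, defaultdict

# Reward to the agent, indexed by (own_action, opponent_action).
REWARD = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}


class FrequencyOpponentModel:
    """Assumed stand-in for a learned predictor: counts opponent actions per observation."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def observe(self, observation, opponent_action):
        self.counts[observation][opponent_action] += 1

    def predict(self, observation):
        seen = self.counts[observation]
        total = seen["C"] + seen["D"]
        if total == 0:
            return {"C": 0.5, "D": 0.5}  # uninformed prior
        return {"C": seen["C"] / total, "D": seen["D"] / total}


def myopic_best_response(opponent_probs):
    """Pick the action with the highest expected immediate reward."""
    expected = {
        mine: sum(opponent_probs[theirs] * REWARD[(mine, theirs)] for theirs in ("C", "D"))
        for mine in ("C", "D")
    }
    return max(expected, key=expected.get)


model = FrequencyOpponentModel()
for opponent_action in ["C", "C", "D", "C"]:
    model.observe(("C", "C"), opponent_action)
probs = model.predict(("C", "C"))
choice = myopic_best_response(probs)
```

Because defection yields a higher immediate reward against either opponent action, the one-step best response is to defect whatever the prediction is, which is the behavior this section argues against optimizing for.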

Forms of Iterated Prisoner’s Dilemma

The Prisoner’s Dilemma (PD) is a simple game that serves as the basis for research on social dilemmas. The premise of the game is that two partners in crime are imprisoned separately and each is offered leniency if they provide evidence against the other. Each player can choose between two actions: cooperation (C) or defection (D), and the payoffs of the game are displayed in Figure 1. The game is modeled so that $T > R > P > S$ and $2R > T + S$. Solving this from a game-theoretic perspective, the dominant strategy is to defect; however, if both players take this action then they arrive at a Nash Equilibrium that is socially deficient. Originally, the PD is a one-round game, but the IPD is a sequential PD often studied to understand the effects of previous outcomes and the emergence of cooperative behaviours.

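     C       D
C    R, R    S, T
D    T, S    P, P

     C       D
C    3, 3    0, 5
D    5, 0    1, 1

Figure 1: Payoff Matrix for Social Dilemmas and Iterated Prisoner’s Dilemma. Rows give the first player’s action and columns the second player’s action; each cell lists the first player’s payoff, then the second player’s. The motivation to defect comes from fear of an opponent defecting or acting greedily to gain the maximum reward when one anticipates the opponent might cooperate.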
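The payoff matrix in Figure 1 can be written down and checked directly. The sketch below encodes the numeric payoffs and verifies the conventional Prisoner's Dilemma conditions, T > R > P > S and 2R > T + S, along with the fact that defection dominates; the helper names are hypothetical.

```python
# Payoffs from Figure 1: temptation, reward, punishment, sucker's payoff.
T, R, P, S = 5, 3, 1, 0

# Indexed by (row_action, column_action) -> (row_payoff, column_payoff).
PAYOFF = {
    ("C", "C"): (R, R),
    ("C", "D"): (S, T),
    ("D", "C"): (T, S),
    ("D", "D"): (P, P),
}


def is_prisoners_dilemma(t, r, p, s):
    """Conventional conditions for a (repeated) Prisoner's Dilemma."""
    return t > r > p > s and 2 * r > t + s


def defect_dominates(payoff):
    """Defecting pays the row player more than cooperating against either action."""
    return all(payoff[("D", opp)][0] > payoff[("C", opp)][0] for opp in ("C", "D"))


valid = is_prisoners_dilemma(T, R, P, S)
dominant = defect_dominates(PAYOFF)
```

Mutual defection pays each player P, which is lower than the R each would receive under mutual cooperation; this gap is what makes the equilibrium socially deficient.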

Learning through Probing

In this section, we present and summarize the architecture and learning methodology of the probing technique that we apply to RL agents. Each agent consists of two separate models that we term the probe and the player. Each player has a policy defined by its own set of parameters. Each probe, similarly, has its own parameterized policy. The role of the probe is to generate experiences that account for opponent learning when performing an action after observing a state. The eventual consequences of this action are measured by aggregating the rewards of the total sequence of events over a finite time horizon, to capture the results after adaptations are made by the opposing agent to its policy. To gauge the consequences of a type of action, the probes group the set of stored experiences according to the reward outcome. Each probe update is based on the set of opposing experiences stored. After taking a one-step update based on these initial experiences, the probes play against each other according to their learned policies, updating after each interaction and storing the resulting sequence of experiences.

We take the decision here to group actions according to their reward outcome rather than opponent action, as is done in Hyper-Q and JAL. In the context of the PD, this is the same as grouping according to the opponent’s action; however, in general it is possible for that information to be unavailable to the agent directly, while the reward is always known. We also argue that grouping according to reward outcome makes intuitive sense: if a similar reward can be achieved performing action $a$ in state $s$ as performing action $a$ in state $s'$, then it is likely that state $s$ and state $s'$ share similar properties in the context of taking action $a$.

The probe continues to play against the opponent, updating after each interaction, and stores the future responses and rewards of the opponent in memory. Once the experiences have been generated, the rewards are adjusted using the parameter η, and the adjusted experiences are used to train the player. η is an added “discount through updates” term. By manipulating the value of η we can determine how each player values the approximated long-term outcome; setting η to zero indicates an approach identical to Q-learning. The player is then updated by gradient descent on the adjusted experiences.

After training has concluded on the adjusted experiences, the learned policy is transferred to the probe. In a two-player game, each probe takes actions according to its current policy at each time step. When only one of the participants is learning to maximize an RL objective function, the architecture is adjusted so that the probe interacts directly with the opponent.
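The exact reward adjustment is not recoverable from the text here, so the sketch below is purely hypothetical: it assumes the adjusted reward blends the immediate reward with the mean reward observed over the probing horizon, weighted by η, so that η = 0 leaves the reward unchanged. The function name `adjust_reward` and the blending rule are illustrative assumptions, not the authors' formula.

```python
def adjust_reward(immediate_reward, future_rewards, eta):
    """Hypothetical adjustment (assumed, not the paper's formula):
    (1 - eta) * immediate reward + eta * mean reward over the probing horizon."""
    if not future_rewards:
        return float(immediate_reward)
    future_mean = sum(future_rewards) / len(future_rewards)
    return immediate_reward + eta * (future_mean - immediate_reward)


# Defecting against a cooperator pays 5 now, but suppose the opponent adapts
# and mutual defection (reward 1) follows over the horizon.
defect_high_eta = adjust_reward(5.0, [1.0, 1.0, 1.0], eta=0.99)
# Cooperating pays 3 now and, assuming cooperation is sustained, 3 afterwards.
cooperate_high_eta = adjust_reward(3.0, [3.0, 3.0, 3.0], eta=0.99)
# With a small eta the immediate reward still dominates.
defect_low_eta = adjust_reward(5.0, [1.0, 1.0, 1.0], eta=0.01)
```

Under this assumed rule, a large η makes the defecting experience look worse than the cooperating one, while a small η preserves the usual incentive to defect, which matches the qualitative role the text assigns to η.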

Figure 2: Learning through Probing architecture diagram involving two RL agents. 1) After exploring the environment, the probe component trains on subsets of experiences to learn consequences for actions. Actions are then selected according to a learned policy. 2) Experiences are collected into a replay buffer and adjusted. 3) The player component trains on the adjusted experiences. 4) The players are matched against each other after training. 5) In continuously adaptive games, probes could adopt learned player policies and adapt their strategies over time.

Implementation and Experiments

In this section, we describe the methodology and experimental setup. The first subsection describes how RL agents train against Axelrod agents simultaneously and compares and contrasts the results of Q-learning and Hyper-Q-learning to examine how policy prediction improves learning under certain conditions. The second subsection describes how we tackle these difficulties using a probing technique and Q-learning to account for behavior changes through updates, and investigates the influence of this new learning methodology on the emergence of cooperation. Finally, the third subsection describes the external aspects of learning in artificial societies and the impact on agent behavior.

The default neural networks had two hidden layers with 40 hidden units and ReLU activation functions. Agents were trained with gradient descent with a buffer size of 1e5, a learning rate of 1e-4 and a batch size of 300. The exploration rate, ε, was initially set to 1.0 and decreased linearly with iterations, stopping at 0.1. The discount rate was set to 0.99. Target networks were also used to further stabilize training. A value for the time horizon was selected through experimentation.
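The linear exploration schedule described above (starting at 1.0 and stopping at 0.1) can be sketched as follows. The number of decay steps is not stated in the text, so `total_steps` is an assumed parameter, and the function names are hypothetical.

```python
import random


def linear_epsilon(step, total_steps, start=1.0, end=0.1):
    """Linearly anneal epsilon from `start` to `end`, then hold it at `end`."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return start + fraction * (end - start)


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))
    return max(q_values, key=q_values.get)


rng = random.Random(0)
eps_mid = linear_epsilon(step=5_000, total_steps=10_000)
greedy_action = epsilon_greedy({"C": 2.5, "D": 4.0}, epsilon=0.0, rng=rng)
```

With `epsilon=0.0` the selection is purely greedy, which is convenient for evaluating a trained policy without exploration noise.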

RL Agents versus Axelrod Agents

The Axelrod library [Knight et al.2016] contains an extensive set of strategies that have been used in previous Axelrod tournaments as well as those that have been rigorously covered in scientific literature. We will refer to the collection of these strategies as “Axelrod agents”. All strategies in the tournament follow a simple set of rules: players are unaware of the number of turns in a match, players carry no acquired state between matches, players cannot observe the outcome of other matches and players cannot manipulate or inspect their opponent in any way beyond what is required in a match. We highlight a diverse set of strategies taken from the Axelrod library that will be used as opponents.

Tit for Tat (TFT). The Tit For Tat strategy is a forgiving strategy that will cooperate on the first move and then perform the same action as the opponent’s most recent move.

Punisher. The Punisher strategy is a grudging strategy that starts by cooperating, however, if at any point its opponent defects, it will defect for memory length where memory length is proportional to the opponent’s historical percentage of defecting.

Forgetful Grudger. The Forgetful Grudger strategy is a grudging strategy which defects for a fixed length of time if an opponent defects at any point. If an opponent cooperates for long enough it forgets its grudge and will cooperate until it sees another defection.

Prober. The Prober strategy plays an initial sequence of moves to feel out an opponent’s strategy. It keeps a count of defections that are retaliating and defections that are unmerited. If the number of justified defections and number of unjustified defections differs by more than 2, cooperate for the next 5 turns and then play Tit For Tat’s strategy. Otherwise defect forever.

Sneaky. The Sneaky strategy is an original strategy that tracks the three most recent actions of the opponent. If all three actions are to cooperate, then it will defect. Also, if the total number of opponent defections are greater than the total number of opponent cooperations then it will defect.
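Two of the strategies listed above can be sketched directly from their descriptions. These are illustrative re-implementations, not the Axelrod library's code: the function signatures are hypothetical, and the Sneaky sketch assumes it cooperates whenever neither of its two defection conditions holds, including before three opponent moves have been seen.

```python
# Standard IPD payoffs: (first_player_reward, second_player_reward).
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5), ("D", "C"): (5, 0), ("D", "D"): (1, 1)}


def tit_for_tat(own_history, opp_history):
    """Cooperate first, then copy the opponent's most recent move."""
    return "C" if not opp_history else opp_history[-1]


def sneaky(own_history, opp_history):
    """Defect after three straight opponent cooperations, or when the opponent
    has defected more often than it has cooperated (otherwise cooperate: assumed)."""
    last_three = opp_history[-3:]
    if len(last_three) == 3 and all(move == "C" for move in last_three):
        return "D"
    if opp_history.count("D") > opp_history.count("C"):
        return "D"
    return "C"


def always_defect(own_history, opp_history):
    return "D"


def play_match(strategy_a, strategy_b, turns=100):
    """Play a fixed-length match and return both cumulative scores."""
    history_a, history_b = [], []
    score_a = score_b = 0
    for _ in range(turns):
        move_a = strategy_a(history_a, history_b)
        move_b = strategy_b(history_b, history_a)
        reward_a, reward_b = PAYOFF[(move_a, move_b)]
        score_a += reward_a
        score_b += reward_b
        history_a.append(move_a)
        history_b.append(move_b)
    return score_a, score_b


tft_scores = play_match(tit_for_tat, tit_for_tat)
defector_scores = play_match(always_defect, tit_for_tat)
```

Against Tit for Tat, sustained cooperation earns 3 per step, whereas a constant defector gains 5 once and then 1 per step, so over a 100-step match cooperation is the better reply.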

Some of these agents, like the Prober, are not stationary to begin with but converge to stationary distributions given enough time. We refer to these as quasi-stationary behaviors. Q-learning, Hyper-Q-learning and LTP agents are each trained to play against the above agents. Agents are able to observe the previous four actions taken by themselves and their opponent. Each match lasts for 100 timesteps. Agents are paired using a round robin matchmaking format. This was run for 10,000 episodes. For Q-learning and Hyper-Q-learning agents, ε was set to 1.0 initially and decreased linearly with iterations until it reached 0.1.
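An observation built from the previous four actions of both players could be encoded as a fixed-length vector, as sketched below. The numeric encoding (+1 for cooperate, -1 for defect, 0 for padding at the start of a match) and the function name are assumptions for illustration; the text does not specify how observations were represented.

```python
def encode_observation(own_actions, opp_actions, window=4):
    """Encode the last `window` actions of each player as one flat vector.

    Assumed encoding: C -> 1.0, D -> -1.0, missing history -> 0.0 (left-padded).
    """
    code = {"C": 1.0, "D": -1.0}

    def last_moves(actions):
        recent = [code[action] for action in actions[-window:]]
        return [0.0] * (window - len(recent)) + recent

    return last_moves(own_actions) + last_moves(opp_actions)


start_of_match = encode_observation([], [])
mid_match = encode_observation(["C", "D"], ["D", "D", "D", "D", "C"])
```

The vector always has `2 * window` entries, so it can be fed to a network with a fixed input size from the first timestep onwards.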


Figure 3: (a) Average cumulative rewards and standard deviation for various RL agents over 20 encounters at numerous training iterations. (b) Average rewards and standard deviation of LTP agents over 20 timesteps trained with various η values. In (a), the blue line shows the results achieved by LTP agents trained with a high η. These agents learn to cooperate early and consistently, with variance decreasing over training iterations, indicating stable performance. Trained with a low η, LTP agents are slower to converge than Deep-Q or Hyper-Q agents but achieve similar results with consistency. Hyper-Q agents learn to defect the quickest of all agents. In (b), agents trained with higher η values learn cooperative policies with little error. A threshold value of η, specific to the configuration of the IPD used, is noted.

RL Agents versus RL Agents

We carry out experiments in order to understand how Q-learning agents, Hyper-Q-learning agents and LTP agents perform against their counterparts in the IPD. Each state was characterized by the agent’s and its opponent’s previous four interactions. Agents had no information about who they were playing with beyond what was available in the observation data. While keeping the exploration rate at 1.0 is feasible for Q-learning and Hyper-Q-learning agents when playing against static strategies such as the Axelrod agents, Hyper-Q relies on developing accurate predictions of an opponent’s next move, and so we ensure that the observed policies are non-random. We train LTP agents against one another using η values of 0.99, 0.7, 0.5, and 0.01 to demonstrate the effect of accounting for behavioral adaptations.

External Factors Influencing Agent Learning

The emphasis in current multi-agent RL is to stationarize the environment using techniques that give the agent more information about how the environment might change at the next timestep. However, in open environments such as artificial societies, agents are likely to come into contact with unseen scenarios, and their learning in these new environments is hard to predict. In other words, we are interested in understanding what happens when an agent is inserted into an unseen multi-agent environment with given dynamics, like a society with established social norms. Evaluating these developmental aspects may provide key insights into how types of behaviours are established in a society or how certain behaviours might provide the basis for stable social norms while others might not. We look to demonstrate in a simple, yet insightful, way that agents’ behaviors may change in society based on their encounters with others and how this analysis can be useful for understanding social interactions.

We start with an environment that features only two players, both RL agents that train with a Q-learning update. TFT agents are added one-by-one to the environment to observe the changes in Q-values. Sneaky agents and Punisher agents are also added to see how cumulative reward increases and decreases. The format was modeled as random encounters, where each agent was matched with a random opponent with equal probability. Each match lasted for 20 timesteps. Training lasted for 20,000 episodes and the RL agents implemented ε-greedy policies with ε decreasing linearly.
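The random-encounter format can be sketched as uniform sampling of an opponent from the rest of the population. The agent names and helper functions below are hypothetical; the sketch also computes how often one of the two learners meets a TFT agent as more are added, which follows directly from the equal-probability matching.

```python
import random


def sample_opponent(agent, population, rng):
    """Pick any other member of the population with equal probability."""
    candidates = [other for other in population if other != agent]
    return rng.choice(candidates)


def tft_encounter_probability(num_tft, num_learners=2):
    """Chance that a learner's sampled opponent is a TFT agent."""
    others = num_tft + num_learners - 1
    return num_tft / others


rng = random.Random(0)
population = ["q_agent_1", "q_agent_2", "tft_1", "tft_2"]
opponent = sample_opponent("q_agent_1", population, rng)
probabilities = [tft_encounter_probability(n) for n in (0, 2, 4, 6)]
```

With two learners, the share of encounters against TFT agents grows from 0 with no TFT agents to 6/7 with six, so most of a learner's experience then comes from an opponent that reciprocates immediately.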

Results and Discussion

(a) 0 TFT agents
(b) 2 TFT agents
(c) 4 TFT agents
(d) 6 TFT agents
Figure 4: Adapting surroundings using TFT agents to get two independent Q-learning agents to cooperate in the IPD. (a) Two Q-learning agents learn independently to play the IPD and both learn defecting policies to achieve the minimum cumulative reward. As more TFT agents are introduced, cumulative reward rises and with (d) 6 TFT agents both Q-learning agents learn to cooperate consistently achieving the maximum cumulative reward.

Optimal Policies under Stationary Distributions

Individually, Q-learning agents and Hyper-Q learning agents score well against Axelrod agents due to their patterned behavior. The strategies that Axelrod agents use can be considered stationary, and it is straightforward for the RL agents to learn an appropriate policy to play when matched against these agents, provided that they have access to enough information to determine their strategy. In the case of the Sneaky agent, an RL agent requires at least four previous interactions at every time step in order to learn the optimal policy, as it can then determine the pattern in the Sneaky agent’s behavior. Both Q-learning and Hyper-Q-learning perform well against these agents. However, against the Prober agent, Hyper-Q performs significantly better and more consistently than either of the other two RL agents, demonstrating an advantage of agent-tracking techniques in quasi-stationary environments. The average scores over 100 timesteps are recorded in Table 1. Overall, when matched with the Axelrod agents, LTP agents achieve desirable results versus all the opponents and perform comparably to Q-learning, though Hyper-Q outperforms both.

          TFT    Punisher    Prober    Sneaky    FG
Q         300    300         166.7     391.6     300
Hyper-Q   300    300         305       400       300
LTP       300    300         154.5     384.9     300

Table 1: Average scores vs. static agents (FG = Forgetful Grudger)

Incorporating Information about Future Behaviors

LTP agents perform well with other LTP agents and are able to consistently achieve the maximum cumulative reward when playing in the IPD. When testing various values of η we get a variety of learned policies. With η values of 0.99, LTP agents heavily favor cooperation and consistently learn to cooperate with each other. Fig 3 shows the scores of LTP agents trained with the different values of η. Between the lower values of η tested there are no stark differences, and they result in defecting policies. As η increases, there is a noticeable separation from a “defect only”-type policy. LTP agents with η values of less than 0.7 learn to defect, following a similar trajectory to that of Q-learning and Hyper-Q-learning. With sufficiently high η, both agents learn cooperative policies to achieve the maximum cumulative reward. When matching Q-learning agents with each other, they learn defecting policies every time over the course of 10,000 iterations. The number of times they defect increases exponentially until every action taken is to defect. While Hyper-Q-learning agents outperformed the other algorithms versus stationary agents, they perform poorly in this scenario. As the exploration rate decreases, the predicted probability of the opponent defecting increases and Hyper-Q agents learn to defect the quickest, with negligible variance. We expect this will be further compounded by sampling techniques and buffer refreshing for social dilemmas, as the new policies learned by the opponent will only be more likely to defect than before. Since the optimal policy to play against an opponent that is increasingly likely to defect is to defect oneself, predicting what the opponent might do next is not a viable strategy to maximize cumulative reward in the IPD. However, as LTP agents demonstrated, it is beneficial to take actions that maximize reward conditioned on changes in an opponent’s behavior.

Cooperative Behavior in Presence of Stable Norms

This experiment displays the changes in the behavior of two agents after interacting with other pre-existing agents. As before, Q-learning agents do not learn to cooperate with one another when matched together in the IPD. However, we show that by tailoring their overall experience, without explicitly changing the reward function, they can learn to cooperate with one another. This is an interesting finding, as it better represents how these agents would act in societal-like contexts rather than in closed environments. When TFT agents are added to the environment and it is modeled as a random-encounter setting, the Q-values of cooperating go up, as the punishment for defecting is more likely to be immediate. When facing other agents that are also exploring the environment, a Q-learning agent may be able to reap rewards from defecting behavior; in a stricter environment, however, this behavior can be punished. By inserting more TFT agents into the environment, this behavior can be eliminated altogether, and agents learn to cooperate with one another consistently, as shown in Fig 4. Similarly, adding a Punisher agent who severely punishes defecting behavior enforces cooperation even more quickly. However, introducing a Sneaky agent heavily disrupts the learning of a cooperative policy, and many more TFT agents must be introduced to compensate. In contrast, Hyper-Q agents do not learn to cooperate with each other in this setting; they continue to learn defecting policies when matched together and are therefore less suitable in such scenarios. These initial experiments provide insights into the emergence of cooperation (or other behaviors) in environments or societies with stable pre-established dynamics, i.e., social norms.
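To illustrate why TFT agents make punishment for defection immediate, the following minimal sketch plays fixed strategies against each other in a repeated Prisoner's Dilemma. It is not the experimental setup used here: the payoff values (5, 3, 1, 0), the 100-round match length, and all function names are assumptions chosen for illustration, and no learning or random encounters are modeled.

```python
# Assumed payoff table (hypothetical values): (my move, their move) -> (my payoff, their payoff).
PAYOFF = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def always_cooperate(opponent_history):
    return "C"


def always_defect(opponent_history):
    return "D"


def tit_for_tat(opponent_history):
    # Cooperate first, then copy the opponent's previous move.
    return opponent_history[-1] if opponent_history else "C"


def play_match(strategy_a, strategy_b, rounds=100):
    """Play a repeated game and return the cumulative scores (a, b)."""
    history_a, history_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        # Each strategy only sees the moves its opponent has made so far.
        move_a = strategy_a(history_b)
        move_b = strategy_b(history_a)
        payoff_a, payoff_b = PAYOFF[(move_a, move_b)]
        score_a += payoff_a
        score_b += payoff_b
        history_a.append(move_a)
        history_b.append(move_b)
    return score_a, score_b


if __name__ == "__main__":
    print("cooperator vs TFT:", play_match(always_cooperate, tit_for_tat))
    print("defector vs TFT:  ", play_match(always_defect, tit_for_tat))
    print("defector vs naive:", play_match(always_defect, always_cooperate))
```

Under these assumptions a persistent defector gains the temptation payoff against TFT only once and is punished in every later round, ending with 104 points over 100 rounds against 300 for a cooperator. Against an unconditional cooperator the same defector is never punished. The more TFT agents an agent is likely to encounter, the more often defection is met with the first outcome rather than the second.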


In this paper we have presented a novel architecture for agents to learn optimal strategies for the IPD, where elements of cooperation and competition are prevalent and important. Our LTP agents successfully learn to cooperate with one another by demonstrating changes to behavior via the use of probes and adjusting experiences to reflect these changes. We also show that these agents are able to learn optimal strategies when matched against stationary and quasi-stationary agents that have been used in Axelrod tournaments. In future work, we plan to scale this research to more advanced social dilemmas and to incorporate the useful aspects of agent-tracking techniques to broaden the applicability of our approach. Building on this work, we also plan to investigate how different types of behaviors may emerge in agent societies and how they might develop according to their surroundings.


  • [Adam, Buşoniu, and Babuška2012] Adam, S.; Buşoniu, L.; and Babuška, R. 2012. Experience replay for real-time reinforcement learning control. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42(2):201–212.
  • [Axelrod and Hamilton1981] Axelrod, R., and Hamilton, W. D. 1981. The evolution of cooperation. Science 211(4489):1390–1396.
  • [Boyd and Richerson2009] Boyd, R., and Richerson, P. J. 2009. Culture and the evolution of human cooperation. Philosophical Transactions of the Royal Society of London B: Biological Sciences 364(1533):3281–3288.
  • [Buşoniu, Babuška, and Schutter2010] Buşoniu, L.; Babuška, R.; and Schutter, B. D. 2010. Multi-agent reinforcement learning: An overview. Innovations in Multi-agent Systems and Applications 1(310):183–221.
  • [Claus and Boutilier1998] Claus, C., and Boutilier, C. 1998. The dynamics of reinforcement learning in cooperative multiagent systems. AAAI/IAAI 746–752.
  • [Clutton-Brock2004] Clutton-Brock, T. 2004. Cooperation between non-kin in animal societies. Nature 462(7269):51.
  • [Erev and Roth1998] Erev, I., and Roth, A. E. 1998. Predicting how people play games: Reinforcement learning in experimental games with unique, mixed strategy equilibria. American Economic Review 848–881.
  • [Fehr and Fischbacher2004] Fehr, E., and Fischbacher, U. 2004. Social norms and human cooperation. Trends in Cognitive Sciences 8(4):185–190.
  • [Foerster et al.2017] Foerster, J.; Nardelli, N.; Farquhar, G.; Afouras, T.; Torr, P. H.; Kohli, P.; and Whiteson, S. 2017. Stabilising experience replay for deep multi-agent reinforcement learning. In ICML.
  • [Foerster et al.2018] Foerster, J.; Chen, R. Y.; Al-Shedivat, M.; Whiteson, S.; Abbeel, P.; and Mordatch, I. 2018. Learning with opponent-learning awareness. In AAMAS.
  • [Kapetanakis and Kudenko2002] Kapetanakis, S., and Kudenko, D. 2002. Reinforcement learning of coordination in cooperative multi-agent systems. In AAAI/IAAI, 326–331.
  • [Knight et al.2016] Knight, V. A.; Campbell, O.; Harper, M.; and et al., K. M. L. 2016. An open reproducible framework for the study of the Iterated Prisoner’s Dilemma. Journal of Open Research Software 4(1).
  • [Lange et al.2013] Lange, P. A. V.; Joireman, J.; Parks, C. D.; and Dijk, E. V. 2013. The psychology of social dilemmas: A review. Organizational Behavior and Human Decision Processes 120(2):125–141.
  • [Leibo et al.2017] Leibo, J. Z.; Zambaldi, V.; Lanctot, M.; Marecki, J.; and Graepel, T. 2017. Multi-agent reinforcement learning in sequential social dilemmas. In AAMAS, 464–473.
  • [Lerer and Peysakhovich2017] Lerer, A., and Peysakhovich, A. 2017. Maintaining cooperation in complex social dilemmas using deep reinforcement learning. arXiv preprint arXiv:1707.01068.
  • [Lowe et al.2017] Lowe, R.; Wu, Y.; Tamar, A.; Harb, J.; Abbeel, P.; and Mordatch, I. 2017. Multi-agent actor-critic for mixed cooperative-competitive environments. In NIPS, 6379–6390.
  • [Macy and Flache2002] Macy, M. W., and Flache, A. 2002. Learning dynamics in social dilemmas. Proceedings of the National Academy of Sciences 99(suppl 3):7229–7236.
  • [Marden, Arslan, and Shamma2009] Marden, J. R.; Arslan, G.; and Shamma, J. S. 2009. Joint strategy fictitious play with inertia for potential games. In IEEE Transactions on Automatic Control, volume 54, 208–220.
  • [Matignon, Laurent, and le Fort-Piat2007] Matignon, L.; Laurent, G.; and le Fort-Piat, N. 2007. Hysteretic Q-Learning: an algorithm for decentralized reinforcement learning in cooperative multi-agent teams. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 64–69.
  • [Nowak and Sigmund2005] Nowak, M. A., and Sigmund, K. 2005. Evolution of indirect reciprocity. Nature 437(7063):1291.
  • [Nowak2006] Nowak, M. 2006. Five rules for the evolution of cooperation. Science 314(5805):1560–1563.
  • [Ostrum, Gardner, and Walker1994] Ostrom, E.; Gardner, R.; and Walker, J. 1994. Rules, games, and common-pool resources. University of Michigan Press.
  • [Perolat et al.2017] Perolat, J.; Leibo, J. Z.; Zambaldi, V.; Beattie, C.; Tuyls, K.; and Graepel, T. 2017. A multi-agent reinforcement learning model of common-pool resource appropriation. In NIPS, 3643–3652.
  • [Rapoport1974] Rapoport, A. 1974. Prisoner’s Dilemma – Recollections and observations. Game Theory as a Theory of Conflict Resolution 17–34.
  • [Sandholm and Crites1996] Sandholm, T. W., and Crites, R. H. 1996. Multiagent reinforcement learning in the Iterated Prisoner’s Dilemma. Biosystems 37(1-2).
  • [Schaul et al.2016] Schaul, T.; Quan, J.; Antonoglou, I.; and Silver, D. 2016. Prioritized experience replay. In ICLR.
  • [Shoham, Powers, and Grenager2003] Shoham, Y.; Powers, R.; and Grenager, T. 2003. Multi-agent reinforcement learning: a critical survey. Technical report, Stanford University 1–13.
  • [Stevens and Hauser2004] Stevens, J. R., and Hauser, M. D. 2004. Why be nice? Psychological constraints on the evolution of cooperation. Trends in Cognitive Sciences 8(2):60–65.
  • [Sutton and Barto1998] Sutton, R. S., and Barto, A. G. 1998. Reinforcement Learning: An Introduction. MIT Press.
  • [Tan1993] Tan, M. 1993. Multi-agent reinforcement learning: Independent vs. cooperative agents. In ICML, volume 10, 330–337.
  • [Tesauro2004] Tesauro, G. 2004. Extending Q-learning to general adaptive multi-agent systems. In NIPS, 871–878.
  • [Uchibe and Doya2004] Uchibe, E., and Doya, K. 2004. Competitive-cooperative-concurrent reinforcement learning with importance sampling. In Proceedings of International Conference on Simulation of Adaptive Behavior: From Animals and Animats, 287–296.
  • [Yu et al.2015] Yu, C.; Zhang, M.; Ren, F.; and Tan, G. 2015. Emotional multiagent reinforcement learning in spatial social dilemmas. IEEE Transactions on Neural Networks and Learning Systems 3083–3096.
  • [Zhang and Lesser2010] Zhang, C., and Lesser, V. R. 2010. Multi-agent learning with policy prediction. In AAAI.