Maintaining cooperation in complex social dilemmas using deep reinforcement learning

07/04/2017 ∙ by Adam Lerer, et al. ∙ 0

Social dilemmas are situations where individuals face a temptation to increase their payoffs at a cost to total welfare. Building artificially intelligent agents that achieve good outcomes in these situations is important because many real world interactions include a tension between selfish interests and the welfare of others. We show how to modify modern reinforcement learning methods to construct agents that act in ways that are simple to understand, nice (begin by cooperating), provokable (try to avoid being exploited), and forgiving (try to return to mutual cooperation). We show both theoretically and experimentally that such agents can maintain cooperation in Markov social dilemmas. Our construction does not require training methods beyond a modification of self-play, thus if an environment is such that good strategies can be constructed in the zero-sum case (eg. Atari) then we can construct agents that solve social dilemmas in this environment.




1 Introduction

Bilateral cooperative relationships, where individuals can pay costs to give larger benefits to others, are ubiquitous in our daily lives. In such situations mutual cooperation can lead to higher payoffs for all involved but there always exists an incentive to free ride. In a seminal work Axelrod asks a practical question: since social dilemmas are so ubiquitous, how should a person behave when confronted with one (Axelrod, 1984)? In this work we will take up a variant of that question: how can we construct artificial agents that can solve complex bilateral social dilemmas?

First, we must define what it means to ‘solve’ a social dilemma. The simplest social dilemma is the two player, repeated Prisoner’s Dilemma (PD). Here each player chooses to either cooperate or defect each turn. Mutual cooperation earns high rewards for both players. Defection improves one’s payoff but only at a larger cost to one’s partner. For the PD, Axelrod & Hamilton suggest the strategy of tit-for-tat (TFT): begin by cooperating and in later turns copy whatever your partner did in the last turn (Axelrod & Hamilton, 1981).
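To make the TFT rule concrete, the following is a minimal sketch of the repeated PD with three simple strategies. The payoff numbers and function names are illustrative assumptions and are not taken from the paper; they are chosen only so that defection raises one's own payoff by less than it costs the partner.

```python
# Hypothetical stage-game payoffs: PAYOFFS[(a1, a2)] = (payoff to player 1, payoff to player 2).
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def tit_for_tat(partner_history):
    """Cooperate first, then copy the partner's previous action."""
    return "C" if not partner_history else partner_history[-1]


def always_defect(partner_history):
    return "D"


def always_cooperate(partner_history):
    return "C"


def play(strategy_1, strategy_2, turns=10):
    """Play the repeated PD and return the total payoff of each player."""
    history_1, history_2 = [], []
    total_1 = total_2 = 0
    for _ in range(turns):
        # Each strategy only sees what its partner did in earlier turns.
        a1 = strategy_1(history_2)
        a2 = strategy_2(history_1)
        r1, r2 = PAYOFFS[(a1, a2)]
        total_1 += r1
        total_2 += r2
        history_1.append(a1)
        history_2.append(a2)
    return total_1, total_2
```

Under these assumed payoffs, two TFT players cooperate throughout, while TFT against a pure defector is exploited only on the first turn; a pure cooperator is exploited on every turn.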

TFT and its variants (eg. Win-Stay-Lose-Shift, (Nowak & Sigmund, 1993; Imhof et al., 2007)) have been studied extensively across many domains including the social and behavioral sciences, biology, and computer science. TFT is popular for several reasons. First, it is able to avoid exploitation by defectors while reaping the benefits of cooperation with cooperators. Second, when TFT is paired with other conditionally cooperative strategies (eg. itself) it achieves cooperative payoffs. Third, it is error correcting because after an accidental defection it provides a way to return to cooperation. Fourth, it is simple to explain to a partner and creates good incentives: if one person commits to using TFT, their partner’s best choice is to cooperate rather than try to cheat.

Our contribution is to expand the idea behind TFT to a different environment: one shot Markov social dilemmas that require function approximation (eg. deep reinforcement learning). We will work with the standard deep RL setup: at training time, our agent is given access to the Markov social dilemma and can use RL to compute a strategy. At test time the agent is matched with an unknown partner and gets to play the game with that partner once.

We will say that the agent can solve a social dilemma if it can satisfy the four TFT properties listed above. We call our strategy approximate (because we use RL function approximation) Markov (because the game is Markov) tit-for-tat (amTFT).

The first issue amTFT needs to tackle is that unlike in the PD ‘cooperation’ and ‘defection’ are no longer simple labeled strategies, but rather sequences of choices. amTFT uses modified self-play to learn two policies at training time: a fully cooperative policy and a ‘safe’ policy (we refer to this as defection). [Footnote 2: We note that one advantage of amTFT is that it requires no additional machinery beyond what is required by standard self-play, thus if we can construct competitive agents in some environment (eg. Atari, (Mnih et al., 2015)) then we can also construct agents that solve social dilemmas in that environment.] [Footnote 3: In the PD this action is ‘defect’ but in most real social dilemmas this is not the case. For example social dilemmas occur naturally in economic situations where agents face a choice of trading and realizing gains from trade or simply producing everything they need on their own. In this case a safe policy is the outside option of ‘stop transacting with this agent.’]

The second issue is that we are considering a setup where our agent will only play the social dilemma once at test time. Thus the goal of amTFT is to intelligently switch between the learned policies within a single game. [Footnote 4: The focus on a single test game is a way in which the problem we consider differs from what is normally studied in the literature on maintaining cooperation in repeated games (Fudenberg & Maskin, 1986; Littman & Stone, 2005; de Cote & Littman, 2008). In a standard ‘folk theorem’ setup agents play a game repeatedly and maintain cooperation in one iteration of a game by threats of defection in the next iteration.] amTFT performs this as follows: at each time step during test time the amTFT agent computes the gain from the action their partner actually chose compared to the one prescribed by the cooperative policy. This can be done either using a learned $Q$ function or via policy rollouts. We refer to this as a per period debit. If the total debit is below a threshold amTFT behaves according to the cooperative policy. If the debit is above the threshold, the agent switches to the defecting policy for $k$ turns and then returns to cooperation. This $k$ is computed such that the partner’s gains (debit) are smaller than the losses they incur ($k$ lost turns of cooperation).

We show both analytically and experimentally that amTFT is a good strategy for Markov social dilemmas in the Axelrod sense defined above. Using a grid-world, Coins, and a modification of an Atari game where players must learn from pixels we demonstrate experimentally that an important component of amTFT is defining a partner’s ‘defection’ in terms of value and not actions. This choice makes amTFT robust to a partner using one of a class of outcome-equivalent cooperative policies as well as to function approximation, both important properties for scaling agents beyond simple games.

We note that for the purposes of this paper we define the ‘cooperative’ policies as the ones which maximize the sum of both players’ payoffs. This definition seems natural for the case of the symmetric games we study (and is the one that is typically used in eg. the literature on the evolution of cooperation). However, it is well known that human social preferences take into account distribution (eg. inequity (Fehr & Schmidt, 1999)), various forms of altruism (Andreoni, 1990; Peysakhovich et al., 2014), and context dependent concerns (eg. social norms, see (Roth et al., 1991; Herz & Taubinsky, 2014; Peysakhovich & Rand, 2015) for how social norms affect economic games and can change even in the lab). Thus when applying amTFT in other circumstances the correct ‘focal point’ needs to be chosen. The automatic determination of focal points is an important topic for future research but far beyond the scope of this paper. However, we note that once this focal point is determined the amTFT algorithm can be used exactly as in this paper simply by swapping out the cooperative objective function during training time.

1.1 Related Work

A large literature on the ‘folk theorem’ asks whether in a repeated game there exists an equilibrium which maintains cooperative payoffs using strategies which take as input histories of observations (Fudenberg & Maskin, 1986; Dutta, 1995) and output stage-game actions. A computer science branch of this literature asks whether it is possible to compute such equilibria either in repeated matrix games (Littman & Stone, 2005) or in repeated Markov games (de Cote & Littman, 2008). These works are related to our questions but have two key differences: first, they focus on switching strategies across iterations of a repeated game rather than within a single game. Second, perhaps more importantly, this literature focuses on finding equilibria unlike the Axelrod setup which focuses on finding a ‘good’ strategy for a single agent. This difference in focus is starkly illustrated by TFT itself because both agents choosing TFT is not an equilibrium (since if one agent commits to TFT the partner’s best response is not TFT, but rather always cooperate).

A second related literature focuses on learning and evolution in games (Fudenberg & Levine, 1998; Sandholm & Crites, 1996; Shoham et al., 2007; Nowak, 2006; Conitzer & Sandholm, 2007) with recent examples applying deep learning to this question (Leibo et al., 2017; Perolat et al., 2017). Though there is a large component of this literature focusing on social dilemmas, these works typically are interested in how properties of the environment (eg. initial states, payoffs, information, learning rules used) affect the final state of a set of agents that are governed by learning or evolutionary dynamics. This literature gives us many useful insights, but is not usually focused on the question of design of a single agent as we are.

A third literature focuses on situations where long term interactions with the same partner mean that a good agent needs either to discern a partner’s type (Littman, 2001) or to be able to shape the adaptation of a learning partner (Babes et al., 2008; Foerster et al., 2017b). Babes et al. use reward shaping in the Prisoner’s Dilemma to construct ‘leader’ agents that convince ‘followers’ to cooperate and Foerster et al. use a policy gradient learning rule which includes an explicit model of the partner’s model. These works are related to ours but deal with situations where interactions are long enough for the partner to learn (rather than a single iteration) and require either explicit knowledge about the game structure (Babes et al., 2008) or the partner’s learning rule (Foerster et al., 2017b).

There is a recent surge of interest in using deep RL to construct agents that can get high payoffs in multi-agent environments. Much of this literature focuses either on zero-sum environments (Tesauro, 1995; Silver et al., 2016, 2017; Brown et al., 2015; Kempka et al., 2016; Wu & Tian, 2016; Usunier et al., 2016) or coordination games without an incentive to defect (Lowe et al., 2017; Foerster et al., 2017a; Riedmiller et al., 2009; Tampuu et al., 2017; Peysakhovich & Lerer, 2017; Lazaridou et al., 2017; Das et al., 2017; Evtimova et al., 2017; Havrylov & Titov, 2017; Foerster et al., 2016) and uses self-play to construct agents that can achieve good outcomes. [Footnote 5: One example that is closer to our problem is (Lewis et al., 2017) which applies deep RL to bargaining, which can be thought of as an imperfect information general-sum game.] We show that applying this self-play approach naively does not produce agents that can solve social dilemmas (see the Appendix for more discussion).

There is a large literature using the repeated PD to study human decision-making in social dilemmas (Fudenberg et al., 2012; Bó & Fréchette, 2011). In addition, recent work in cognitive science has begun to use more complex games and RL techniques quite related to ours (Kleiman-Weiner et al., 2016). However, while this work provides useful insights into potentially useful strategies the main objective of this work is to understand human decision-making, not to actively improve the construction of agents.

Finally, Crandall et al. study how to construct machines for social dilemma games (Crandall et al., 2017). This work is the closest to our own of the recent literature but differs in that it focuses specifically on cooperation with a human partner in simple games using cheap talk (English) communication from an existing set of messages. Given the importance of communication in human interactions, incorporating explicit signaling into amTFT-like strategies is an important and interesting direction for future work.

2 The Basic Model

We now turn to formalizing our main idea. We will work with a generalization of Markov decision problems:

Definition 1

A (finite, 2-player) Markov game consists of a set of states $S$; a set of actions for each player, $A_1$ and $A_2$; a transition function $\tau : S \times A_1 \times A_2 \to \Delta(S)$, which tells us the probability distribution on the next state as a function of current state and actions; and a reward function for each player $R_i : S \times A_1 \times A_2 \to \mathbb{R}$, which tells us the utility that player $i$ gains from a state, action tuple. We assume rewards are bounded.

Players can choose between policies $\pi_i : S \to \Delta(A_i)$, which are maps from states to probability distributions on actions. We denote by $\Pi_i$ the set of all policies for a player. Through the course of the paper we will use the notation $\pi$ to refer to some abstract policy and $\hat{\pi}$ to refer to learned approximations of it (eg. the output of a deep RL procedure).

Definition 2

A value function for a player $i$, written $V_i(s, \pi_1, \pi_2)$, inputs a state and a pair of policies and gives the expected discounted reward to that player from starting in state $s$. We assume agents discount the future with rate $\delta$ which we subsume into the value function. A related object is the $Q$ function for a player, written $Q_i(s, a, \pi_1, \pi_2)$, which inputs a state, action, and a pair of policies and gives the expected discounted reward to that player from starting in state $s$, taking action $a$, and then continuing according to $\pi_1, \pi_2$ afterwards.
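Written out, the two objects in Definition 2 take the familiar form below. This is a sketch under stated assumptions rather than the paper's own display: it assumes $\delta \in (0, 1)$ acts as a per-step discount factor, writes the $Q$ function for player 1 (player 2 is symmetric), and uses $\tau$ and $R_i$ for the transition and reward functions of Definition 1.

```latex
\begin{align*}
V_i(s, \pi_1, \pi_2)
  &= \mathbb{E}\!\left[\, \sum_{t=0}^{\infty} \delta^{t}\, R_i(s_t, a_1^t, a_2^t)
     \;\middle|\; s_0 = s,\; a_j^t \sim \pi_j(s_t),\; s_{t+1} \sim \tau(s_t, a_1^t, a_2^t) \right], \\
Q_1(s, a, \pi_1, \pi_2)
  &= \mathbb{E}_{a_2 \sim \pi_2(s),\; s' \sim \tau(s, a, a_2)}
     \left[\, R_1(s, a, a_2) + \delta\, V_1(s', \pi_1, \pi_2) \right].
\end{align*}
```

Bounded rewards together with $\delta < 1$ keep both quantities finite.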

We will be talking about strategic agents so we often refer to the concept of a best response:

Definition 3

A policy for agent $j$ denoted $\pi_j$ is a best response starting at state $s_0$ to a policy $\pi_i$ if for any $\pi_j'$ and any $s$ along the trajectory generated by these policies we have $V_j(s, \pi_i, \pi_j) \geq V_j(s, \pi_i, \pi_j')$. We denote the set of such best responses as $BR_j(\pi_i, s_0)$. If $\pi_j$ obeys the inequality above for any choice of state $s$ we call it a perfect best response.

The set of stable states in a game is the set of equilibria. We call a policy for player 1 and a policy for player 2 a Nash equilibrium if they are best responses to each other. We call them a Markov perfect equilibrium if they are perfect best responses.

We are interested in a special set of policies:

Definition 4

Cooperative Markov policies starting from state $s$ are those pairs $(\pi_1^C, \pi_2^C)$ which, starting from state $s$, maximize $V_1(s, \pi_1, \pi_2) + V_2(s, \pi_1, \pi_2)$. We let the set of cooperative policies be denoted by $\Pi^C(s)$. Let the set of policies which are cooperative from any state be the set of perfectly cooperative policies.

A social dilemma is a game where there are no cooperative policies which form equilibria. In other words, if one player commits to always cooperate, there is a way for their partner to exploit them and earn higher rewards at their expense. Note that in a social dilemma there may be policies which achieve the payoffs of cooperative policies because they cooperate on the trajectory of play and prevent exploitation by threatening non-cooperation on states which are never reached by the trajectory. [Footnote 6: An example of such a policy is Grim Trigger in the one-memory repeated PD. Grim Trigger cooperates if its partner has always cooperated in the past and defects otherwise. Thus two Grim players achieve the payoffs of two cooperators but do not use policies that cooperate at every state.]
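The definition can be checked mechanically in the simplest case. The sketch below treats a one-shot PD as a single-state game, finds the joint action that maximizes the sum of rewards, and tests whether it is an equilibrium. The reward table and function names are illustrative assumptions; in a full Markov game the same check would range over policies rather than single actions.

```python
from itertools import product

ACTIONS = ("C", "D")

# Hypothetical stage-game rewards: REWARDS[(a1, a2)] = (reward to player 1, reward to player 2).
REWARDS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def cooperative_profiles():
    """Joint actions that maximize the sum of both players' rewards."""
    profiles = list(product(ACTIONS, repeat=2))
    best = max(sum(REWARDS[p]) for p in profiles)
    return [p for p in profiles if sum(REWARDS[p]) == best]


def is_equilibrium(profile):
    """True if neither player can gain by unilaterally changing their action."""
    a1, a2 = profile
    r1, r2 = REWARDS[profile]
    no_gain_1 = all(REWARDS[(alt, a2)][0] <= r1 for alt in ACTIONS)
    no_gain_2 = all(REWARDS[(a1, alt)][1] <= r2 for alt in ACTIONS)
    return no_gain_1 and no_gain_2


def is_social_dilemma():
    """True if no cooperative profile is an equilibrium."""
    return not any(is_equilibrium(p) for p in cooperative_profiles())
```

With these assumed rewards the only cooperative profile is mutual cooperation, which is not an equilibrium, while mutual defection is.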

The state representation used plays an important role in determining whether equilibria which achieve cooperative payoffs exist. Specifically, a policy which rewards cooperation today with cooperation tomorrow must be able to remember whether cooperation happened yesterday. In both of our example games, Coins and the PPD, if the game is played from the pixels without memory maintaining cooperation is impossible. This is because the current state does not contain information about past behavior of one’s partner.

Thus, some memory is required to create policies which maintain cooperation. This memory can be learned (eg. an RNN) or it can be an explicitly designed summary statistic (our approach). However, adding memory does not remove equilibria where both players always defect, so adding memory does not imply that self-play will find policies that maintain cooperation (Foerster et al., 2017b; Sandholm & Crites, 1996). In the appendix we show that even in the simplest situation, the one memory repeated PD, always defecting equilibria can be more robust attractors than ones which maintain cooperation. amTFT is designed to get around this problem by using modified self-play to explicitly construct the cooperative and cooperation maintaining strategies as well as the switching rule. We begin with the theory behind amTFT.

3 Approximate Markov TFT

We begin with a social dilemma where pure cooperators can be exploited. We aim to construct a simple meta-policy which incentivizes cooperation along the path of play by switching intelligently between policies in response to its partner.

We assume that cooperative policies are exchangeable. That is, for any two pairs $(\pi_1, \pi_2), (\pi_1', \pi_2') \in \Pi^C$ any combination of the two (eg. $(\pi_1, \pi_2')$) is also in $\Pi^C$, and all pairs give a unique distribution of the total rewards between the two players.

If policies are not exchangeable or can give different distributions of the total payoff then in addition to having a cooperation problem, we also have a coordination problem (ie. in which particular way should agents cooperate? how should gains from cooperation be split?). This is an important question, especially if we want our agents to interact with humans, and is related to the notion of choosing focal points in coordination/bargaining games. However, a complete solution is beyond the scope of this work and will often depend on contextual factors. See eg. (Schelling, 1980; Roth et al., 1991; Kleiman-Weiner et al., 2016; Peysakhovich & Lerer, 2017) for more detailed discussion.

For the social dilemma to be solvable, there must be strategies with worse payoffs to both players. Consider an equilibrium $(\pi_1^D, \pi_2^D)$ which has worse payoffs for player 2. We assume that $(\pi_1^D, \pi_2^D)$ is an equilibrium even if played for a finite time, which we call $\pi^D$-dominance. We use $\pi^D$-dominance to bound the payoffs of a partner during the execution of a punishment phase, thus it is a sufficient but not necessary condition. We discuss in the Appendix how this assumption can be relaxed. To define this formally, we first introduce the notation of a compound policy $\pi^{X_k Y}$ which is a policy that behaves according to $\pi^X$ for $k$ turns and then $\pi^Y$ afterwards.

Definition 5

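We say a game is $\pi^D$ dominant (for player 2) if for any $k$, any state $s$, and any policy $\pi^A$ we have $V_2(s, \pi_1^{D_k C}, \pi_2^{D_k C}) \geq V_2(s, \pi_1^{D_k C}, \pi_2^{A_k C})$.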

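In theory, with access to $\pi^C$, $\pi^D$, their $Q$ functions, and no noise or function approximation, we can construct amTFT as follows. Suppose the amTFT agent plays as player 1 (the reverse is symmetric).

At the start of the game the amTFT agent begins in phase $C$. If the phase is $C$ then the agent plays according to $\pi_1^C$. At each time step, if the agent is in a $C$ phase, the agent looks at the action $a_2$ chosen by their partner. Writing $Q_2^{CC}$ for player 2's $Q$ function when both players continue with the cooperative policies, the agent computes

$$d = Q_2^{CC}(s, a_2) - Q_2^{CC}(s, \pi_2^C(s)).$$

If $d > 0$ then starting at the next time step when state $s'$ is reached the agent enters into a $D$ phase where they choose according to $\pi_1^D$ for $k$ periods. $k$ is computed such that the partner's loss from the punishment exceeds a multiple of their gain:

$$V_2(s', \pi_1^C, \pi_2^C) - V_2(s', \pi_1^{D_k C}, \pi_2^{D_k C}) > \alpha d.$$

Here $\alpha$ controls how often an agent can be exploited by a pure defector. After this is over the agent returns to the $C$ phase. The amTFT strategy gives a nice guarantee: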
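As an illustration of how a punishment length can be chosen so that the partner's losses outweigh their gain, here is a minimal sketch. It assumes, purely for illustration, that the partner earns a constant per-turn reward under cooperation and a lower constant reward during punishment, and that `delta` is a per-step discount factor; the paper instead estimates these quantities with value functions or rollouts. All names are hypothetical.

```python
def punishment_length(debit, alpha, coop_reward, punish_reward, delta, max_k=1000):
    """Smallest k such that k discounted turns of punishment cost the partner
    more than alpha * debit, or None if no k up to max_k suffices.

    Assumes constant per-turn rewards (an illustrative simplification).
    """
    per_turn_loss = coop_reward - punish_reward
    loss = 0.0
    for k in range(1, max_k + 1):
        # Turn k of the punishment phase is discounted by delta^(k-1).
        loss += (delta ** (k - 1)) * per_turn_loss
        if loss > alpha * debit:
            return k
    return None
```

A larger multiplier or a larger debit lengthens the punishment. When the discounted loss is bounded below the target, no finite punishment is enough, which the sketch signals by returning `None`.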

Theorem 1

Define If for any state we have that then if player 1 is an amTFT agent, a fully omniscient player 2 maximizes their payoffs by behaving according to $\pi_2^C$ when 1 is in a $C$ phase and $\pi_2^D$ when 1 is in a $D$-phase. Thus, if agents start in the $C$ phase and there is no noise, they cooperate forever. If they start in a $D$ phase, they eventually return to a $C$ phase.

The proof is quite simple and we relegate it to the Appendix. However, we now see that amTFT has the desiderata we have asked for: it is easy to explain, it cooperates with a pure cooperator, it does not get completely exploited by a pure defector, and it incentivizes cooperation along the trajectory of play. [Footnote 7: In an infinite length game amTFT will get exploited an infinite number of times as it tries to return to cooperation after each $D$ phase. One potential way to avoid this is to make the punishment more severe at each $D$ phase.] [Footnote 8: This makes amTFT subtly different from TFT. TFT requires one’s partner to cooperate even during the $D$ phase for the system to return to cooperation. By contrast, amTFT allows any action during the $D$ phase, which makes it similar to the rPD strategy of Win-Stay-Lose-Shift or Pavlov (Nowak & Sigmund, 1993).]

4 Constructing an amTFT Agent

We now use RL methods to construct the components required for amTFT by approximating the cooperative and defect policies as well as the switching policy. To construct the required policies we use self-play and two reward schedules: selfish and cooperative.

In the selfish reward schedule each agent treats the other agent just as a part of their environment and tries to maximize their own reward. We assume that RL training converges and we call the converged policies under the selfish reward schedule $\hat{\pi}^D$ and the associated $Q$ function approximations $\hat{Q}^{DD}$. If policies converge with this training then $\hat{\pi}^D$ is a Markov equilibrium (up to function approximation).

In the cooperative reward schedule each agent gets rewards both from their own payoff and the rewards the other agent receives. That is, we modify the reward function so that it is

$$R_i^{CC}(s, a_1, a_2) = R_1(s, a_1, a_2) + R_2(s, a_1, a_2).$$

We call the converged policy and value function approximations $\hat{\pi}^C$ and $\hat{Q}^{CC}$. In this paper we are agnostic to which learning algorithm is used to compute policies.
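The two reward schedules differ only in what each learner is paid, so they can be written as a thin wrapper around the environment's rewards. The sketch below is illustrative: the function names and the stage-game numbers are assumptions, and a one-shot PD stands in for the full Markov game to show how the schedule changes which action is dominant.

```python
# Hypothetical stage-game rewards: STAGE[(a1, a2)] = (reward to player 1, reward to player 2).
STAGE = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def selfish_schedule(r1, r2):
    """Each agent is trained on its own reward."""
    return r1, r2


def cooperative_schedule(r1, r2):
    """Each agent is trained on the sum of both agents' rewards."""
    total = r1 + r2
    return total, total


def dominant_action_for_player_1(schedule):
    """Action that is at least as good for player 1 against every partner action,
    measured in the training reward produced by `schedule`; None if there is none."""
    for action in ("C", "D"):
        other = "D" if action == "C" else "C"
        if all(
            schedule(*STAGE[(action, b)])[0] >= schedule(*STAGE[(other, b)])[0]
            for b in ("C", "D")
        ):
            return action
    return None
```

Under the selfish schedule defection dominates, while under the cooperative schedule cooperation dominates, which mirrors why the two training regimes converge to different policies.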

In general there can be convergence issues with selfish self-play (Fudenberg & Levine, 1998; Conitzer & Sandholm, 2007; Papadimitriou, 2007) while in the cooperative reward schedule the standard RL convergence guarantees apply. The latter is because cooperative training is equivalent to one super-agent controlling both players and trying to optimize for a single scalar reward.

With the value functions and policies in hand from the procedure above, we can construct an amTFT meta-policy. For the purposes of this construction, we consider agent 1 as the amTFT agent (but everything is symmetric). The amTFT agent keeps a memory state $(W_t, b_t)$ whose components both start at $0$.

The amTFT agent sees the action $a_2'$ of their partner at time $t$ and approximates the gain from this deviation as $D_t = \hat{Q}_2^{CC}(s, a_2') - \hat{Q}_2^{CC}(s, \hat{\pi}_2^C(s))$. To compute this debit we can either use learned $Q$ functions or we can simply use rollouts.

The amTFT agent accumulates the total payoff balance of their partner as $W_t = W_{t-1} + D_t$. If $W_t$ is below a fixed threshold $T$ the amTFT agent chooses actions according to $\hat{\pi}^C$. If $W_t$ crosses the threshold $T$ the amTFT agent uses rollouts to compute a $k$ such that the partner loses more from $k$ turns of $\hat{\pi}^D$ relative to cooperation than some constant $\alpha$ times the current debit. The hyperparameters $T$ and $\alpha$ trade off robustness to approximation error and noise. Raising $T$ allows for more approximation error in the calculation of the debit but relaxes the incentive constraints on the agent’s partner. Raising $\alpha$ makes the cost of defection higher but makes false positives more costly. The algorithm is formalized below:

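  Input: $\hat{\pi}^C$, $\hat{\pi}^D$ and their $\hat{Q}$
  $b \leftarrow 0$, $W \leftarrow 0$
  while game is running do
      $D \leftarrow \hat{Q}_2^{CC}(s, a_2) - \hat{Q}_2^{CC}(s, \hat{\pi}_2^C(s))$   ($D$ comes from model or rollouts)
     if $b = 0$ then
        choose $a \leftarrow \hat{\pi}_1^C(s)$; $W \leftarrow W + D$
     end if
     if $b > 0$ then
        choose $a \leftarrow \hat{\pi}_1^D(s)$; $b \leftarrow b - 1$
     end if
     if $W > T$ then
        Compute $k$ using rollouts; $b \leftarrow k$; $W \leftarrow 0$
     end if
  end while
Algorithm 1 Approximate Markov Tit For Tat (for Agent 1)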
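The switching logic in Algorithm 1 is a small state machine over a running balance and a punishment counter. The sketch below is one way it might look in code, under stated assumptions: the per-period debit is supplied by the caller, `compute_k` is a hypothetical callback standing in for the rollout-based computation of the punishment length, and the class and method names are invented for illustration.

```python
class AmTFTController:
    """Tracks the partner's payoff balance and decides which learned policy to follow."""

    def __init__(self, threshold, compute_k):
        self.threshold = threshold    # debit threshold before punishment starts
        self.compute_k = compute_k    # hypothetical callback: balance -> punishment length
        self.balance = 0.0            # accumulated debit of the partner
        self.punish_left = 0          # remaining turns of the defect phase

    def step(self, debit):
        """Given the debit from the partner's latest action, return 'C' or 'D'
        to indicate which policy the agent should act with this turn."""
        if self.punish_left == 0:
            # Debits are only accumulated while cooperating.
            self.balance += debit
            if self.balance > self.threshold:
                self.punish_left = self.compute_k(self.balance)
                self.balance = 0.0
        if self.punish_left > 0:
            self.punish_left -= 1
            return "D"
        return "C"
```

For example, with a threshold of 1.0 and a fixed two-turn punishment, the debit sequence 0, 0.5, 0.8, 0, 0, 0 produces two cooperative turns, two punishment turns once the balance passes the threshold, and then a return to cooperation. Debits observed during the punishment phase are ignored in this sketch, in line with amTFT allowing any partner action during that phase.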

Figure 1: (a) Coins, (b) PPD, (c) Training. In two Markov social dilemmas we find that standard self-play converges to defecting strategies while modified self-play finds cooperative, but exploitable strategies. We use the results of these two training schedules to construct $\hat{\pi}^C$ and $\hat{\pi}^D$.

A key component of the amTFT strategy is the computation of the per period debit $D_t$. In our experiments we do this via batched policy rollouts (a similar procedure is used to calculate the length of the $D$ phase, $k$). Each rollout is computed as follows:

  1. The amTFT agent has policy pairs $(\hat{\pi}_1^C, \hat{\pi}_2^C)$ and $(\hat{\pi}_1^D, \hat{\pi}_2^D)$ saved from training

  2. At time $t$ when the state was $s$ the amTFT agent compares the action chosen by their partner, which we denote as $a_2'$, to $\hat{\pi}_2^C(s)$

  3. If $a_2' = \hat{\pi}_2^C(s)$ then $D_t$ is set to $0$

  4. If $a_2' \neq \hat{\pi}_2^C(s)$ then the amTFT agent simulates replicas of the game for a fixed number of turns.

    1. In one group of replicas their partner starts with $a_2'$ and play continues according to $(\hat{\pi}_1^C, \hat{\pi}_2^C)$ - we call this the ‘true path’

    2. In the other group of replicas their partner starts with $\hat{\pi}_2^C(s)$ and play continues according to $(\hat{\pi}_1^C, \hat{\pi}_2^C)$ - we call this the ‘counterfactual path’

    3. The amTFT agent takes the difference in the average total reward to the partner from the two paths and uses that as $D_t$ - this is an estimate of the reward of the one shot deviation to $a_2'$ from the recommended strategy $\hat{\pi}_2^C(s)$

This procedure is an unbiased estimator of $Q_2^{CC}(s, a_2') - Q_2^{CC}(s, \pi_2^C(s))$ in the limit of many replicas and long rollouts but is computationally intensive at test time. [Footnote 9: A less computationally demanding way to execute amTFT is to use a model to approximate $Q_2$ directly. This is difficult in practice since any bias in the model is accumulated across periods and because the model needs to be accurate everywhere, not just on the trajectory of $\hat{\pi}^C$. In the appendix we discuss some results on learning a model; improving the efficiency of such procedures is an important direction for future work.] In games where an action today can only affect payoffs a bounded number of periods into the future it suffices to use rollouts of that length and elide the continuation value.
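The rollout procedure can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: `simulate` is a hypothetical function that plays the game forward from the current state with the partner's first action fixed and both players following the cooperative policies afterwards, and the toy simulator below is deterministic with made-up rewards.

```python
import random


def toy_partner_return(first_action, horizon, rng):
    """Hypothetical simulator: the partner's total reward over `horizon` turns when
    their first action is `first_action` and play is cooperative afterwards.
    A real simulator would draw random transitions using `rng`."""
    total = 5.0 if first_action == "D" else 3.0   # first-turn reward
    for _ in range(horizon - 1):
        total += 3.0                              # cooperative continuation
    return total


def estimate_debit(partner_action, coop_action, n_replicas, horizon, simulate, seed=0):
    """Difference in the partner's average return between the 'true path'
    (starting with their actual action) and the 'counterfactual path'
    (starting with the cooperative action)."""
    if partner_action == coop_action:
        return 0.0
    rng = random.Random(seed)
    true_path = [simulate(partner_action, horizon, rng) for _ in range(n_replicas)]
    counterfactual = [simulate(coop_action, horizon, rng) for _ in range(n_replicas)]
    return sum(true_path) / n_replicas - sum(counterfactual) / n_replicas
```

In a stochastic environment the estimate becomes noisier as the number of replicas shrinks, which is one reason the threshold on the accumulated debit matters.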

The value-based construction gives amTFT a particular robustness property - if the partner is not using exactly $\hat{\pi}_2^C$ but is using a policy that is outcome equivalent to it, the estimated debits $D_t$ will end up being $0$ in expectation and so the amTFT agent will continue to cooperate. We will see in our experiments that this property is important to the success of amTFT in real Markov social dilemmas.
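A small sketch may help show why judging defection by value rather than by action matters. The $Q$-values and action names below are invented for illustration: two of the partner's actions lead to the same cooperative outcome, and a third exploits the agent.

```python
# Hypothetical partner Q-values under cooperative continuation, in a single state.
Q_CC = {"path_a": 10.0, "path_b": 10.0, "steal": 12.0}

# The action the learned cooperative policy happens to prescribe in this state.
COOP_ACTION = "path_a"


def action_based_defection(partner_action):
    """Grim-style test: any mismatch with the prescribed action counts as defection."""
    return partner_action != COOP_ACTION


def value_based_debit(partner_action):
    """amTFT-style test: how much the partner gained relative to the prescribed action."""
    return Q_CC[partner_action] - Q_CC[COOP_ACTION]
```

An outcome-equivalent action such as `path_b` is flagged by the action-based test but produces a debit of zero, so only the genuinely exploitative action contributes to the balance.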

5 Experiments

Strategy   SelfMatch   Safety   IncentC
$\pi^C$    68          -58      -41
$\pi^D$    2           0        -58
amTFT      63          -16      33
Grim       2           0        -59
(a) Coins

Strategy   SelfMatch   Safety   IncentC
$\pi^C$    -1          -18      -14
$\pi^D$    -7          0        -19
amTFT      -2          -1       2.5
Grim       -7          0        -18
(b) PPD

Figure 2: In two Markov social dilemmas, amTFT satisfies the Axelrod desiderata: it mostly cooperates with itself, is robust against defectors, and incentivizes cooperation from its partner. The ‘Grim’ strategy based on (de Cote & Littman, 2008) behaves almost identically to pure defection in these social dilemmas. The result of standard self-play is $\pi^D$. The full tournament of all strategies against each other is shown in the Appendix.

We test amTFT in two environments: one grid-world and one where agents must learn from raw pixels. In the grid-world game Coins two players move on a board. The game has a small probability of ending in every time step, we set this so the average game length is time steps. Coins of different colors appear on the board periodically, and a player receives a reward of 1 for collecting (moving over) any coin. However, if a player picks up a coin of the other player’s color, the other player loses 2 points. The payoff for each agent at the end of each game is just their own point total. The strategy which maximizes total payoff is for each player to only pick up coins of their own color; however each player is tempted to pick up the coins of the other player’s color.
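The Coins reward rule described above fits in a few lines. In this sketch players are indexed 0 and 1 and each coin is labeled with the index of the player whose color it matches; the function name and this encoding are illustrative assumptions.

```python
def coin_rewards(collector, coin_owner):
    """Rewards (to player 0, to player 1) when `collector` picks up a coin
    whose color belongs to `coin_owner`."""
    rewards = [0, 0]
    rewards[collector] += 1            # any coin is worth 1 to whoever collects it
    if coin_owner != collector:
        rewards[coin_owner] -= 2       # taking the other player's coin costs them 2
    return tuple(rewards)
```

Collecting one's own coin adds 1 to total welfare, while collecting the other player's coin changes total welfare by -1, which is what makes the game a social dilemma.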

We also look at an environment where strategies must be learned from raw pixels. We use the method of (Tampuu et al., 2017) to alter the reward structure of Atari Pong so that whenever an agent scores a point they receive a reward of and the other player receives . We refer to this game as the Pong Player’s Dilemma (PPD). In the PPD the only (jointly) winning move is not to play. However, a fully cooperative agent can be exploited by a defector.

We are interested in constructing general strategies which scale beyond tabular games so we use deep neural networks for state representation for both setups. We use standard setups so we relegate the details of the networks as well as the training to the appendix.

We perform both Selfish (self play with reactive agents receiving own rewards) and Cooperative (self play with both agents receiving sum of rewards) training for both games. We train replicates for Coins and replicates for the PPD. In both games Selfish training leads to suboptimal behavior while Cooperative training does find policies that implement socially optimal outcomes. In Coins selfishly trained agents converge to picking up coins of all colors while prosocially trained agents learn to only pick up matching coins. In PPD selfishly trained agents learn to compete and try to score while prosocially trained agents gently hit the ball back and forth.

We evaluate the performance of various Markov social dilemma strategies in a tournament. To construct a matchup between two strategies we construct agents and have them play a fixed length iteration of the game. Note that at training time we use a random length game but at test time we use a fixed length one so that we can compare payoffs more efficiently. We use replicates per strategy pair to compute the average expected payoff. We compare $\pi^C$, $\pi^D$, and amTFT.

We also compare the direct adaptation of the construction in (de Cote & Littman, 2008). Recall that the folk theorem algorithm maintains equilibria by threat of deviation later: if either agent’s behavior in one iteration of the game does not accord with the cooperative policy, both agents switch to a different policy in the next repetition of the game. We adapt this to the single test game setting as follows: the agent computes policies $\hat{\pi}^C$ and $\hat{\pi}^D$. If their partner takes an action $a$ in a state $s$ where $a \neq \hat{\pi}_2^C(s)$, the agent switches to $\hat{\pi}^D$ forever. We call this the Grim Trigger Strategy due to its resemblance to the rPD strategy of the same name.

In both games we ask how well the strategies satisfy Axelrod’s desiderata from the introduction. Specifically, we would like to measure whether a strategy avoids exploitation, cooperates with conditional cooperators, and incentivizes its partner to cooperate.

Let $S_i(X, Y)$ be the average reward to player $i$ when a policy of type $X$ is matched with type $Y$. The metric

$$\mathrm{Safety}(X) = S_1(X, \pi^D) - S_1(\pi^D, \pi^D)$$

measures how safe a strategy is from exploitation by a defector. The more negative this value, the worse that $X$ is exploited by a pure defector.

We measure a strategy’s ability to achieve cooperative outcomes with policies of their same type as

$$\mathrm{SelfMatch}(X) = S_1(X, X).$$

This measure can be thought of as quantifying two things. First, how much social welfare is achieved in a world where everyone behaves according to strategy $X$. Second, while we cannot enumerate all possible conditionally cooperative strategies, in the case of Grim and amTFT this serves as an indicator of how well they would behave against a particular conditional cooperator - themselves.

Finally, we measure if $X$ incentivizes cooperation from its partner. For this we use the measure

$$\mathrm{IncentC}(X) = S_2(X, \pi^C) - S_2(X, \pi^D).$$

The higher this number, the better off a partner is from committing to pure cooperation rather than trying to cheat.
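The three tournament metrics can be computed directly from a table of average payoffs. The sketch below assumes that Safety compares a strategy's payoff against a pure defector with the defector's self-play payoff, that SelfMatch is the strategy's payoff against itself, and that IncentC is the partner's payoff from cooperating minus the partner's payoff from defecting. The payoff numbers are a hypothetical toy table, not the paper's results.

```python
# Hypothetical tournament table: PAYOFF[(X, Y)] = (avg reward to player 1 using X,
#                                                   avg reward to player 2 using Y).
PAYOFF = {
    ("C", "C"): (3.0, 3.0),
    ("C", "D"): (0.0, 5.0),
    ("D", "C"): (5.0, 0.0),
    ("D", "D"): (1.0, 1.0),
}


def self_match(x):
    """Payoff a strategy earns when paired with a copy of itself."""
    return PAYOFF[(x, x)][0]


def safety(x):
    """Payoff against a pure defector, relative to defecting oneself."""
    return PAYOFF[(x, "D")][0] - PAYOFF[("D", "D")][0]


def incent_c(x):
    """How much better off the partner is cooperating with x than defecting on x."""
    return PAYOFF[(x, "C")][1] - PAYOFF[(x, "D")][1]
```

In this toy table pure cooperation is unsafe and gives its partner a reason to defect, while pure defection is safe by construction but also fails to reward a cooperating partner.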

Figure 2 shows our metrics evaluated for the strategies of always cooperate, always defect, amTFT and Grim. Pure cooperation is fully exploitable and pure defection gets poor payoffs when matched with itself. Neither pure C nor pure D incentivizes cooperation. However, amTFT avoids being exploited by defectors, does well when paired with itself and incentivizes cooperative strategies from its partner. We also see that inferring a partner’s cooperation using the value function (amTFT) is much more stable than inferring it via actions, as Grim immediately interprets any deviation from its preferred cooperative strategy as defection.

6 amTFT As Teacher

The results above show that amTFT is a good strategy to employ in a mixed environment which includes some cooperators, some tit-for-tat agents and some defectors. In particular, we have shown that amTFT is not exploited by $\pi^D$. However, what happens when amTFT’s partner is itself a learning agent?

We consider what happens if we fix one player (the Teacher) to use a fixed policy but let the other player be a selfish deep RL agent (the Learner). We perform the retraining in the domain of Coins. [Footnote 10: We tried to perform the retraining in the PPD but incentivizing cooperation via a shift to $\pi^D$ requires a low discount rate and we found A3C to be unstable in this regime.] This retraining procedure can also be used as an additional metric of the exploitability of a given strategy: rather than asking whether $\pi^D$ can exploit it, we ask whether a learner trying to maximize its own payoff can find some way to cheat.

Recall that when selfish RL agents played with each other, they converged to the Selfish ‘grab all coins’ strategy. We see that Learners paired with purely cooperative Teachers learn to exploit them. Learners paired with purely selfish Teachers also learn to exploit, although this learning happens much more slowly: a fully trained selfish policy grabs coins very quickly, so it is hard for a blank-slate agent to learn at all. Learners paired with amTFT, however, learn to cooperate. Note that choosing amTFT as a strategy leads to higher payoffs for both the Learner and the Teacher; thus, even if we only care about the payoffs accrued to our own agent, we can do better with amTFT than with a purely greedy strategy.

Figure 3: Both purely selfish and purely cooperative Teachers lead Learners to exploitative strategies. However, amTFT Teachers lead Learners to cooperate, and thus both agents reach a higher payoff in the long run.

7 Conclusion

Humans are remarkably adapted to solving bilateral social dilemmas. We have focused on how to give artificial agents this capability. We have shown that amTFT can maintain cooperation and avoid exploitation in Markov games. In addition we have provided a simple construction for this strategy that requires no more than modified self-play. Thus, amTFT can be applied to social dilemmas in many environments.

Our results emphasize the importance of treating agents as fundamentally different from other parts of the environment. In particular, agents have beliefs and desires, learn, and use some form of optimization, while objects follow simple fixed rules. An important future direction for constructing cooperative agents is to continue to incorporate ideas from inverse reinforcement learning (Abbeel & Ng, 2004; Ng et al., 2000) and cognitive science (Baker et al., 2009; Kleiman-Weiner et al., 2016) to construct agents that exhibit some theory of mind.

There is a growing literature on hybrid systems which include both human and artificial agents (Crandall et al., 2017; Shirado & Christakis, 2017). In this work we have focused on defining ‘cooperation’ as maximizing the joint payoff. This assumption seems reasonable in symmetric situations such as those we have considered; however, as we discuss in the introduction, it may not always be appropriate. The amTFT construction can be easily modified to allow other types of focal points simply by changing the modified reward function used in the training of the cooperative strategies (for example, by using the inequity-averse utility functions of Fehr & Schmidt (1999)). However, moving forward in constructing agents that can interact in social dilemmas with humans will require AI designers (and their agents) to understand and adapt to human cooperative and moral intuitions (Yoeli et al., 2013; Hauser et al., 2014; Ouss & Peysakhovich, 2015).


8 Appendix

8.1 Standard Self-Play Fails to Discover Cooperative Strategies in the Repeated PD

In a social dilemma there exists an equilibrium of mutual defection, and there may exist additional equilibria of conditional cooperation. Standard self-play may converge to any of these equilibria. When policy spaces are large, it is often the case that simple equilibria of constant mutual defection have larger basins of attraction than policies which maintain cooperation.

We can illustrate this with the simple example of the repeated Prisoner’s Dilemma. Consider a PD with payoffs of to mutual defection, for mutual cooperation, for defecting on a cooperative partner and for being defected on while cooperating. Consider the simplest possible state representation where the set of states is the pair of actions played last period and let the initial state be (this is the most optimistic possible setup). We consider RL agents that use policy gradient (results displayed here come from using Adam (Kingma & Ba, 2014), similar results were obtained with SGD though convergence speed was much more sensitive to the setting of the learning rate) to learn policies from states (last period actions) to behavior.

Note that this policy space contains TFT (cooperate after (C,C) or (D,C), defect otherwise), Grim Trigger (cooperate after (C,C), defect otherwise) and Pavlov or Win-Stay-Lose-Shift (cooperate after (C,C) or (D,D), defect otherwise; Nowak & Sigmund, 1993), where each state is written as (own last action, partner’s last action). All three are cooperation-maintaining strategies (though only Grim and WSLS are themselves full equilibria).
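The one-memory strategies named above can be written as lookup tables from the last-period state to an action. The following is a minimal sketch, assuming states are ordered as (own last action, partner's last action); the helper names `make_policy` and `play` are introduced here for illustration and are not from the paper.

```python
# One-memory strategies for the repeated PD as lookup tables.
# Assumption: a state is (own last action, partner's last action).
C, D = "C", "D"
STATES = [(C, C), (C, D), (D, C), (D, D)]


def make_policy(cooperate_after):
    """Deterministic one-memory policy: cooperate only after the given states."""
    return {s: (C if s in cooperate_after else D) for s in STATES}


TFT = make_policy({(C, C), (D, C)})    # copy the partner's last action
GRIM = make_policy({(C, C)})           # cooperate only after mutual cooperation
WSLS = make_policy({(C, C), (D, D)})   # win-stay, lose-shift (Pavlov)
ALL_D = make_policy(set())             # always defect


def play(policy_1, policy_2, n_periods, start=(C, C)):
    """Return the joint actions (player 1, player 2) for n_periods."""
    state_1 = start  # player 1's view of the last period
    history = []
    for _ in range(n_periods):
        state_2 = (state_1[1], state_1[0])  # player 2 sees the mirrored state
        a1, a2 = policy_1[state_1], policy_2[state_2]
        history.append((a1, a2))
        state_1 = (a1, a2)
    return history
```

For instance, `play(TFT, ALL_D, 3)` shows TFT being exploited once and then defecting, while two WSLS players started from a mismatched state return to mutual cooperation after one round of mutual defection.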

Each episode is defined as one repeated PD game which lasts a random number of periods with stopping probability of stopping after each period. Policies in the game are maps from the one-memory state space to either cooperation or not. These policies are trained using policy gradient and the REINFORCE algorithm (Williams, 1992). We vary and set such that is the most efficient strategy always. Note that all of these parameters are well within the range where humans discover cooperative strategies in experimental applications of the repeated PD (Bó & Fréchette, 2011).
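To make the training setup concrete, here is a minimal sketch of two REINFORCE learners with one-memory policies in a repeated PD with random stopping. It is an illustration under stated assumptions, not the code behind the reported results: the payoff values, stopping probability, and learning rate are hypothetical placeholders, each policy is a logistic function of one logit per state, and plain gradient ascent is used in place of Adam.

```python
import math
import random

C, D = 0, 1
# Hypothetical stage-game payoffs: (my action, partner action) -> my reward.
PAYOFF = {(C, C): 1.0, (D, D): 0.0, (D, C): 1.5, (C, D): -1.0}


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def run_episode(theta_1, theta_2, stop_prob, rng):
    """Play one repeated PD; return each player's (state, action, reward) list."""
    state = (C, C)  # player 1's view: (own, partner) last actions
    traj_1, traj_2 = [], []
    while True:
        s1, s2 = state, (state[1], state[0])
        a1 = C if rng.random() < sigmoid(theta_1[s1]) else D
        a2 = C if rng.random() < sigmoid(theta_2[s2]) else D
        traj_1.append((s1, a1, PAYOFF[(a1, a2)]))
        traj_2.append((s2, a2, PAYOFF[(a2, a1)]))
        state = (a1, a2)
        if rng.random() < stop_prob:
            return traj_1, traj_2


def reinforce_update(theta, traj, lr):
    """REINFORCE step: sum over t of (reward-to-go) * grad log pi(a_t | s_t)."""
    grad = {s: 0.0 for s in theta}
    reward_to_go = 0.0
    for s, a, r in reversed(traj):
        reward_to_go += r
        p_c = sigmoid(theta[s])
        # d/dtheta log pi(a | s) for a Bernoulli policy with P(C) = sigmoid(theta)
        grad[s] += reward_to_go * ((1.0 - p_c) if a == C else -p_c)
    for s in theta:
        theta[s] += lr * grad[s]


def train(n_episodes=2000, stop_prob=0.1, lr=0.05, seed=0):
    """Train both players simultaneously; return their P(cooperate | state)."""
    rng = random.Random(seed)
    states = [(x, y) for x in (C, D) for y in (C, D)]
    theta_1 = {s: 0.0 for s in states}
    theta_2 = {s: 0.0 for s in states}
    for _ in range(n_episodes):
        traj_1, traj_2 = run_episode(theta_1, theta_2, stop_prob, rng)
        reinforce_update(theta_1, traj_1, lr)
        reinforce_update(theta_2, traj_2, lr)
    return ({s: sigmoid(v) for s, v in theta_1.items()},
            {s: sigmoid(v) for s, v in theta_2.items()})
```

Inspecting the returned cooperation probabilities per state is one way to check which of the one-memory strategies, if any, a pair of learners has converged to.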

Figure 4 shows that cooperation only robustly occurs when it is a dominant strategy for both players () and thus the game is no longer a social dilemma.

Footnote 11: Note that these results use pairwise learning and therefore are different from evolutionary game theoretic results on the emergence of cooperation (Nowak, 2006). Those results show that cooperation can indeed emerge robustly in these kinds of strategy spaces under evolutionary processes. They rely on the following argument: suppose we have a population of defectors. This population can be invaded by TFT mutants because TFT tries cooperation in the first round. If it is matched with a defector, it loses once and then defects for the rest of the game; if it is matched with another TFT, the two cooperate for a long time. Thus, for sufficiently long games, the risk of one round of loss is far smaller than the potential fitness gain of meeting another mutant, and TFT can eventually gain a foothold. It is clear why such arguments cannot apply in learning scenarios.
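The evolutionary invasion argument contrasted with pairwise learning above can be made concrete with a short calculation. The notation here is assumed for illustration and is not taken from the paper: R, P, T, S are the payoffs for mutual cooperation, mutual defection, defecting on a cooperator, and being defected on; \delta is the per-period continuation probability; x is the fraction of TFT mutants in a population of defectors; and W(a, b) is the expected total payoff of strategy a against strategy b.

```latex
\begin{align*}
W(\mathrm{TFT},\mathrm{TFT}) &= \frac{R}{1-\delta}, &
W(\mathrm{TFT},\mathrm{D})   &= S + \frac{\delta P}{1-\delta},\\
W(\mathrm{D},\mathrm{TFT})   &= T + \frac{\delta P}{1-\delta}, &
W(\mathrm{D},\mathrm{D})     &= \frac{P}{1-\delta}.
\end{align*}
% TFT mutants out-earn resident defectors when
\[
x\,W(\mathrm{TFT},\mathrm{TFT}) + (1-x)\,W(\mathrm{TFT},\mathrm{D})
\;>\;
x\,W(\mathrm{D},\mathrm{TFT}) + (1-x)\,W(\mathrm{D},\mathrm{D}),
\]
% which simplifies to
\[
x\left(\frac{R-\delta P}{1-\delta} - T\right) \;>\; (1-x)\,(P - S).
\]
```

The right-hand side is the bounded one-round loss from meeting a defector, while the left-hand side grows without bound as \delta approaches 1, which is why even a small fraction of mutants suffices in sufficiently long games.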

Figure 4: Results from training one-memory strategies using policy gradient in the repeated Prisoner’s Dilemma. Even in extremely favorable conditions self-play fails to discover cooperation maintaining strategies. Note that temptation payoff is not a PD and here is a dominant strategy in the stage game.

8.2 Proof of Main Theorem

To prove the theorem we will apply the one deviation principle. To show this, we fix player to be an amTFT agent and look at player . Note that from the point of view of player this is now a Markov game with a state representation of where if player behaves according to and if player is in the phase and thus behaves according to .

We consider the policy for player of ‘play when player is in the phase and play when player is in the phase.’ Recall by the Principle of Optimality if there does not exist a one shot deviation at any state under which player earns a higher payoff, then there does not exist a better policy than the one prescribed.

Consider starting at . The suggested policy has player play . By -dominance this is the best response to so there are no one-shot deviations in the phase.

Let us consider what happens in the phase (). By the assumption of the theorem at any state we know that

Let be the per-period reward stream (note here each

is a random variable) for player

induced by the policies Since

where is the discount rate. Because rewards are bounded then for any there exists such that

That is, the first terms time steps approximate the full discounted expectation arbitrarily well. This also means that for some

From any state, the highest profit an agent can make from deviating from with a single action with an amTFT partner is However we have shown that there exists a length such that moving to for turns costs the agent more than Therefore there is no time during the phase they wish to deviate. This completes the proof.
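The truncation step used in the proof rests on a standard tail bound for discounted sums. As a sketch in assumed notation (the symbols r_t for per-period rewards, R_{\max} for the reward bound, \delta for the discount rate, and \epsilon for the tolerance are introduced here and may differ from the paper's own):

```latex
% Assume |r_t| \le R_{\max} for all t and 0 < \delta < 1.
\[
\left|\sum_{t=k}^{\infty} \delta^{t} r_t\right|
\;\le\; R_{\max}\sum_{t=k}^{\infty}\delta^{t}
\;=\; \frac{\delta^{k} R_{\max}}{1-\delta}.
\]
% The tail is therefore smaller than \epsilon whenever
\[
k \;>\; \frac{\log\!\bigl(\epsilon\,(1-\delta)/R_{\max}\bigr)}{\log \delta},
\]
% where the inequality flips because \log\delta < 0.
```

So a finite number of periods always captures the discounted payoff difference up to any desired tolerance, and the required number grows as the discount rate approaches 1.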

8.3 Aside on -Dominance

The reason we made the -dominance assumption is to bound the expected payoff of an agent playing against and therefore bound the necessary length of a phase after a particular deviation. However, in order to compute what the length of the phase the amTFT agent needs access to the best response policy to , or its associated value function. With -dominance we assume that is that best response. Even if -dominance does not strictly hold, it is likely a sufficient approximation. If necessary however, one can train an RL agent on episodes where their partner plays , where is observed. This allows one to approximate the best response policy to which will then give us what we need to compute the responses to deviations from in the phase that incentivize full cooperation.

8.4 Experimental Details

We used rollouts to calculate the debit to the amTFT’s partner at each time period. This estimator has good performance for both PPD and Coins given their reward structure. It is also possible to use a learned model of . Learning a sufficiently accurate model is challenging for several reasons. First, it has to have very low bias, since any bias in will be accumulated over periods. Second, the one-shot deviation principle demands that be accurate for all state-action pairs, not just those sampled by the policies . Standard on-policy value function estimation will only produce accurate estimates of at states sampled by the cooperative policies. As an example, in Coins, since the cooperative policies never collect their partner’s coins for these state-action pairs may be inaccurate.
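As an illustration of a rollout-based estimate, the sketch below treats the per-period debit as the partner's estimated gain from the action it actually took relative to the cooperative action, with cooperative play assumed afterwards. This reading of the debit, the simulator interface `step(state, action, rng) -> (next_state, reward)`, and all parameter values are assumptions made for the example rather than details confirmed by the text.

```python
import random


def rollout_return(step, state, first_action, coop_policy, gamma, horizon, rng):
    """Discounted return to the partner when it plays first_action once and
    then follows coop_policy. `step` is a hypothetical simulator in which the
    other player's cooperative behavior is already folded in."""
    total, discount, action = 0.0, 1.0, first_action
    for _ in range(horizon):
        state, reward = step(state, action, rng)
        total += discount * reward
        discount *= gamma
        action = coop_policy(state)
    return total


def estimate_debit(step, state, actual_action, coop_policy,
                   gamma=0.98, horizon=50, n_rollouts=100, seed=0):
    """Monte Carlo estimate of the gain from actual_action over the
    cooperative action at `state`."""
    rng = random.Random(seed)

    def mean_return(action):
        returns = [rollout_return(step, state, action, coop_policy,
                                  gamma, horizon, rng)
                   for _ in range(n_rollouts)]
        return sum(returns) / n_rollouts

    return mean_return(actual_action) - mean_return(coop_policy(state))
```

Because both terms are estimated by simulating from the actual state, this estimator does not rely on a learned value function being accurate off the cooperative path, at the cost of extra simulation at each step.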

We found that it was possible in Coins to learn a model to calculate debit without policy rollouts using the same neural network architecture that was used to train the policies. However, we found that in order to train a model accurate enough to work well we had to use a modified training procedure. After finishing Selfish and Cooperative training, we perform a second step of training using a fixed (converged) . In order to sample states off the path of during this step, the learner behaves according to a mixture of , , and random policies while the partner continues according to . is updated via off-policy Bellman iteration. We found this modified procedure produced a function that was good enough to maintain cooperation (though still not as efficient as rollouts). For more complex games, an important area for future work is to develop methodologies to compute more accurate approximations of or combine a model with rollouts effectively.
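The idea of evaluating the cooperative policy from off-path data can be sketched in tabular form. This is a simplified stand-in for the neural-network procedure described above, assuming a small discrete state space; the function name, the transition format `(state, action, reward, next_state)`, and the step sizes are hypothetical.

```python
def bellman_evaluate(transitions, target_policy, gamma, lr=0.1, sweeps=200):
    """Off-policy evaluation of target_policy by repeated Bellman backups.

    `transitions` may come from any behavior policy (for example a mixture of
    cooperative, selfish, and random play); the backup always bootstraps from
    the action that target_policy would take in the next state.
    """
    q = {}
    for _ in range(sweeps):
        for state, action, reward, next_state in transitions:
            next_action = target_policy(next_state)
            target = reward + gamma * q.get((next_state, next_action), 0.0)
            old = q.get((state, action), 0.0)
            q[(state, action)] = old + lr * (target - old)
    return q
```

Sampling actions that the cooperative policy never takes is what allows the table to hold meaningful values for deviations, which is the property the one-shot deviation principle requires.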

8.4.1 Coins Game and Training

For Coins there are four actions (up, down, left, right), and the state is represented as a binary tensor in which the first two channels encode the location of each agent and the other two channels encode the location of the coin (if one exists). At each time step, if there is no coin on the board, a coin is generated at a random location with a random color, with probability
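The state encoding described above can be sketched as follows. The board size is left as a parameter, and the choice to index the two coin channels by coin color is an assumption made for the example; the function name `encode_state` is hypothetical.

```python
import numpy as np


def encode_state(board_size, agent_positions, coin_position=None, coin_color=None):
    """Return a 4 x board_size x board_size binary tensor.

    Channels 0-1 mark the locations of the two agents; channels 2-3 mark the
    coin location, one channel per coin color (assumed layout).
    """
    state = np.zeros((4, board_size, board_size), dtype=np.uint8)
    for agent, (row, col) in enumerate(agent_positions):
        state[agent, row, col] = 1
    if coin_position is not None:
        row, col = coin_position
        state[2 + coin_color, row, col] = 1
    return state
```

With no coin on the board, the last two channels are simply all zeros.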


A policy is learned via the advantage actor-critic algorithm. We use a multi-layer convolutional neural network to jointly approximate the policy and the state-value function. For this small game, a simpler model could be used, but this model generalizes directly to games with higher-dimensional 2D state spaces (e.g. environments with obstacles). For a given board size , the model has repeated layers, each consisting of a 2D convolution with kernel size 3, followed by batch normalization and ReLU. The first layer has stride 1, while the successive layers each have stride 2, which decreases the width and height from to while doubling the number of channels. For the board, channel sizes are 13, 26, 52, 104. From these 104 features, the policy is computed via a linear layer with 4 outputs followed by a softmax, giving a distribution over actions, while the value function is computed via a single-output linear layer.
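To see how the stated strides and channel sizes fit together, the sketch below traces the tensor shape through the convolutional stack. It assumes a padding of 1 on each convolution and, in the usage note, a 5x5 board; neither value is stated in the text above, so both should be read as illustrative.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Output size of a 2D convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def layer_shapes(board_size, channels=(13, 26, 52, 104)):
    """Return (channels, height, width) after each layer.

    The first layer uses stride 1 and every later layer uses stride 2,
    as described in the text; padding of 1 is an assumption.
    """
    shapes, size = [], board_size
    for index, n_channels in enumerate(channels):
        size = conv_out(size, stride=1 if index == 0 else 2)
        shapes.append((n_channels, size, size))
    return shapes
```

Under these assumptions, `layer_shapes(5)` ends at a 104 x 1 x 1 tensor, which is consistent with the 104 features fed to the policy and value heads.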

The actor and critic are updated episodically with a common learning rate: at the end of each game we update the model on a batch of episodes via

where is the advantage

and is the advantage normalized over all episodes and periods in the batch

We train with a learning rate of , continuation probability (i.e. games last on average 500 steps), discount rate , and a batch size of . We train for a total of games.
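The batch-level advantage normalization used in the update can be sketched as below. Two details are assumptions made for the example: the advantage is taken to be the discounted Monte Carlo return minus the critic's value estimate, and normalization is taken to mean subtracting the batch mean and dividing by the batch standard deviation.

```python
import math


def discounted_returns(rewards, gamma):
    """Discounted reward-to-go for each period of one episode."""
    returns, running = [], 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    return returns[::-1]


def normalized_advantages(batch_rewards, batch_values, gamma, eps=1e-8):
    """Advantages standardized over all periods of all episodes in the batch.

    batch_rewards and batch_values hold one list per episode; the output has
    the same nested shape.
    """
    advantages = [
        [g - v for g, v in zip(discounted_returns(rewards, gamma), values)]
        for rewards, values in zip(batch_rewards, batch_values)
    ]
    flat = [a for episode in advantages for a in episode]
    mean = sum(flat) / len(flat)
    std = math.sqrt(sum((a - mean) ** 2 for a in flat) / len(flat))
    return [[(a - mean) / (std + eps) for a in episode] for episode in advantages]
```

Normalizing across the whole batch rather than per episode keeps the relative scale of advantages between short and long games.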

8.4.2 Pong Player Dilemma Training

We use the Arcade Learning Environment modified for 2-player play as proposed in Tampuu et al. (2017), with modified rewards of +1 for scoring a point and -2 for being scored on. Policies are trained directly from pixels via A3C (Mnih et al., 2016), using the pytorch-a3c package. Inputs are rescaled to 42x42 and normalized, and we augment the state with the difference between successive frames with a frame skip of . We use 38 threads for A3C, over a total of 38,000 games (1,000 per thread). We use the default settings from pytorch-a3c: a discount rate of , learning rate of , 20-step returns, and entropy regularization weight of .
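The input pipeline and reward modification can be sketched as follows. This is not the pytorch-a3c preprocessing code: the grayscale conversion, the nearest-neighbor rescaling, and the division by 255 are simple stand-ins chosen for the example, and the function names are hypothetical.

```python
import numpy as np


def preprocess(frame, size=42):
    """Convert to grayscale, rescale to size x size, and scale to [0, 1]."""
    gray = frame.mean(axis=2) if frame.ndim == 3 else frame
    rows = np.arange(size) * gray.shape[0] // size  # nearest-neighbor sampling
    cols = np.arange(size) * gray.shape[1] // size
    return gray[np.ix_(rows, cols)].astype(np.float32) / 255.0


def augment_with_difference(prev_frame, frame):
    """Stack the processed current frame with its change since prev_frame."""
    current = preprocess(frame)
    previous = preprocess(prev_frame)
    return np.stack([current, current - previous])


def ppd_reward(points_scored, points_conceded):
    """Modified Pong reward: +1 for scoring a point, -2 for being scored on."""
    return 1.0 * points_scored - 2.0 * points_conceded
```

The difference channel gives a feed-forward network access to motion information, which is one reason a recurrent layer may not be needed.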

The policy is implemented as a convolutional neural network with four layers, following pytorch-a3c. Each layer uses a 3x3 kernel with stride 2, followed by ELU. The network has two heads for the actor and critic. We elide the LSTM layer used in the pytorch-a3c library, as we found it to be unnecessary.

8.4.3 Tournament Results

(a) Coins Results
(b) PPD Results
Figure 5: Results of the tournament in two Markov social dilemmas. Each cell contains the average total reward of the row strategy against the column strategy. amTFT achieves close to cooperative payoffs with itself and achieves close to the defect payoff against defectors. Its partner also receives a higher payoff for cooperation than defection.