
Learning Reciprocity in Complex Sequential Social Dilemmas

by   Tom Eccles, et al.

Reciprocity is an important feature of human social interaction and underpins our cooperative nature. What is more, simple forms of reciprocity have proved remarkably resilient in matrix game social dilemmas. Most famously, the tit-for-tat strategy performs very well in tournaments of Prisoner's Dilemma. Unfortunately this strategy is not readily applicable to the real world, in which options to cooperate or defect are temporally and spatially extended. Here, we present a general online reinforcement learning algorithm that displays reciprocal behavior towards its co-players. We show that it can induce pro-social outcomes for the wider group when learning alongside selfish agents, both in a 2-player Markov game, and in 5-player intertemporal social dilemmas. We analyse the resulting policies to show that the reciprocating agents are strongly influenced by their co-players' behavior.



1. Introduction

Sustained cooperation among multiple individuals is a hallmark of human social behavior, and may even underpin the evolution of our intelligence Reader and Laland (2002); Dunbar (1993). Often, individuals must sacrifice some personal benefit for the long-term good of the group, for example to manage a common fishery or provide clean air. Logically, it seems that such problems are insoluble without the imposition of some extrinsic incentive structure Olson (1965). Nevertheless, small-scale societies show a remarkable aptitude for self-organization to resolve public goods and common pool resource dilemmas Ostrom (1990). Reciprocity provides a key mechanism for the emergence of collective action, since it rewards pro-social behavior and punishes anti-social acts. Indeed, it is a common norm shared by diverse societies Becker (1990); Ostrom (1998); Blau (1964); Thibaut and Kelley (1966). Moreover, laboratory studies find experimental evidence for conditional cooperation in public goods games; see for example Croson et al. (2005).

By far the best-known model of reciprocity is Rapoport’s Tit-for-Tat Rapoport et al. (1965). This hand-coded algorithm was designed to compete in tournaments where each round consisted of playing the repeated Prisoner’s Dilemma game against an unknown opponent. The algorithm cooperates on its first move, and thereafter mimics the previous move of its partner, by definition displaying perfect reciprocity. Despite its simplicity, Tit-for-Tat was victorious in the tournaments Axelrod (1980a, b). Axelrod (1984) identified four key features of the Tit-for-Tat strategy which contributed to its success, and which later successful algorithms such as win-stay lose-shift Nowak and Sigmund (1993) also share:

  • Nice: start by cooperating.

  • Clear: be easy to understand and adapt to.

  • Provocable: retaliate against anti-social behavior.

  • Forgiving: cooperate when faced with pro-social play.

Later, we will think of these as design principles for models of reciprocity.
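The Tit-for-Tat rule described above can be sketched in a few lines. This is an illustrative toy, not code from the paper: the payoff values are the conventional Prisoner's Dilemma numbers and are assumed here, and the names `tit_for_tat`, `always_defect` and `play` are hypothetical.

```python
from typing import Callable, List, Tuple

C, D = "C", "D"

# (payoff to A, payoff to B) for each joint move. Assumed illustrative values.
PAYOFFS = {
    (C, C): (3, 3),
    (C, D): (0, 5),
    (D, C): (5, 0),
    (D, D): (1, 1),
}

# A strategy maps the opponent's past moves to the next move.
Strategy = Callable[[List[str]], str]


def tit_for_tat(opponent_history: List[str]) -> str:
    """Cooperate first (nice), then copy the opponent's previous move."""
    return C if not opponent_history else opponent_history[-1]


def always_defect(opponent_history: List[str]) -> str:
    """A purely selfish baseline that never cooperates."""
    return D


def play(strategy_a: Strategy, strategy_b: Strategy, rounds: int) -> Tuple[int, int]:
    """Play a repeated Prisoner's Dilemma and return both total scores."""
    moves_a: List[str] = []
    moves_b: List[str] = []
    score_a = score_b = 0
    for _ in range(rounds):
        a = strategy_a(moves_b)  # each player only sees the other's history
        b = strategy_b(moves_a)
        pay_a, pay_b = PAYOFFS[(a, b)]
        score_a += pay_a
        score_b += pay_b
        moves_a.append(a)
        moves_b.append(b)
    return score_a, score_b
```

Under these assumed payoffs, two Tit-for-Tat players cooperate throughout, while against `always_defect` Tit-for-Tat is exploited only in the first round and retaliates thereafter, which illustrates the nice and provocable properties.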

Although Tit-for-Tat and its variants have proved resilient to modifications in the matrix game setup Duersch et al. (2013); Boyd (1989); Nowak (2006), they are clearly not applicable to realistic situations. In general, cooperating and defecting require an agent to carry out complex sequences of actions across time and space, and the payoffs defining the social dilemma may be delayed. More sophisticated models of reciprocity should be applicable to the multi-agent reinforcement learning domain, where the tasks are defined as Markov games Shapley (1953); Littman (1994), and the nature of cooperation is not pre-specified. In this setting, agents must learn both the high-level strategy of reciprocity and the low-level policies required for implementing (gradations of) cooperative behavior. An important class of such games are intertemporal social dilemmas Pérolat et al. (2017); Hughes et al. (2018), in which individual rationality is at odds with group-level outcomes and the negative impact of individual greed is temporally distant from its adverse consequences for the group.

Lerer and Peysakhovich (2017); Kleiman-Weiner et al. (2016); Peysakhovich and Lerer (2017) propose reinforcement learning models for 2-agent problems, based on a planning approach. These models first pre-train cooperating and defecting policies using explicit knowledge of other agents’ rewards. The policies are then used as options in hand-coded meta-policies. The main variation between these approaches is the algorithm for switching between these options. In Lerer and Peysakhovich (2017) and Kleiman-Weiner et al. (2016), the policy is chosen in response to the last action, and that action is assessed through planning. In Peysakhovich and Lerer (2017), the decision is based on the recent rewards of the agent, which defects if it is not doing sufficiently well. These models of reciprocity are important stepping stones, but have some limitations. Firstly, reciprocity imitates a range of behaviors, rather than pre-determined ones (pure cooperation and pure defection), which are not obviously well-defined in general. Secondly, reciprocity is applicable beyond the 2-player case. Thirdly, reciprocity does not necessarily rely on directly observing the rewards of others. Finally, reciprocity can be learned online, offering better scalability and flexibility than planning.

We propose an online-learning model of reciprocity which addresses these limitations while still significantly outperforming selfish baselines in the 2-player setting. Our setup comprises two types of reinforcement learning agents, innovators and imitators. Both innovators and imitators are trained using the A3C algorithm Mnih et al. (2016) with the V-Trace correction Espeholt et al. (2018). An innovator optimizes for a purely selfish reward. An imitator has two components: (1) a mechanism for measuring the level of sociality of different behaviors and (2) an intrinsic motivation Chentanez et al. (2005) for matching the sociality of others. We investigate two mechanisms for assessing sociality. The first is based on hand-crafted features of the environment. The other uses a learned “niceness network”, which estimates the effect of one agent’s actions on another agent’s future returns, hence providing a measure of social impact Latané (1981). The niceness network also encodes a social norm among the imitators, for it represents “a standard of behavior shared by members of a social group” Encyclopaedia Britannica (2008). Hence our work represents a model-based generalization of Sen and Airiau (2007) to Markov games.

An innovator’s learning is affected by the reciprocity displayed by imitators co-training in the same environment. The innovator is incentivized to behave pro-socially, because otherwise its anti-social actions are quickly repeated by the group, leading to a bad payoff for all, including the innovator. With one innovator and one imitator, the imitator learns to respond in a Tit-for-Tat-like fashion, which we verify in a dyadic game called the Coins dilemma Peysakhovich and Lerer (2017). For one innovator with many imitators, the setup resembles the phenomenon of pro-social leadership Henrich et al. (2015); Gächter and Renner (2018). Natural selection favours altruism when individuals exert influence over their followers; we see an analogous effect in the reinforcement learning setting.

More specifically, we find the presence of imitators elicits socially beneficial outcomes for the group (5 players) in both the Harvest (common pool resource appropriation) and Cleanup (public good provision) environments Hughes et al. (2018). We also quantify the social influence of the innovator on the imitators by training a graph network Battaglia et al. (2016) to predict future actions for all agents, and examining the edge norms between the agents, just as in Tacchetti et al. (2018). This demonstrates that the influence of the innovator on the imitator is greater than the influence between other pairs of agents in the environment. Moreover, we find that the innovator’s policies return to selfish free-riding when we continue training without reciprocating imitators, showing that reciprocity is important for maintaining stability of a learning equilibrium. Finally, we demonstrate that the niceness network learns an appropriate notion of sociality in the dyadic Coins environment, thereby inducing a Tit-for-Tat-like strategy.

2. Agent models

We use two types of reinforcement learning agent, innovators and imitators. Innovators learn purely from the environment reward. Imitators learn to match the sociality level of innovators, hence demonstrating reciprocity. We based the design of the imitators on Axelrod’s principles, which in our language become:

  • Nice: imitators should not behave anti-socially at the start of training, else innovators may not discover pro-sociality.

  • Clear: imitation must occur within the timescale of an episode, else innovators will be unable to adapt.

  • Provocable: imitators must reciprocate anti-social behavior from innovators.

  • Forgiving: the discount factor with which anti-social behavior is remembered must be less than 1.

Note that imitating the policy of another agent over many episodes is not sufficient to produce cooperation. This type of imitation does not change behaviour during an episode based on the other agent’s actions, and so does not provide feedback which enables the innovators to learn. We validate this in an ablation study. As such our methods are complementary to, but distinct from the extensive literature on imitation learning; see Hussein et al. (2017) for a survey.

Moreover, observe that merely training with collective reward for all agents is inferior to applying reciprocity in several respects. Firstly, collective reward suffers from a lazy agent problem due to spurious reward Sunehag et al. (2018). Secondly, it produces agent societies that are exploitable by selfish learning agents, who will free-ride on the contributions of others. Finally, in many real-world situations, agents do not have direct access to the reward functions of others.

2.1. Innovator

The innovator comprises a deep neural network trained to generate a policy from observations using the asynchronous advantage actor-critic algorithm Mnih et al. (2016) with V-Trace correction for stale experience Espeholt et al. (2018). For details of the architecture, see Appendix A.

2.2. Imitator

We use two variants of the imitator. The difference is in what is being imitated. The metric-matching imitator imitates a hand-coded metric. The niceness network imitator instead learns what to imitate. The metric-matching variant allows for more experimenter control over the nature of reciprocity, but at the expense of generality. Moreover, this allows us to disentangle the learning dynamics which result from reciprocity from the learning of reciprocity itself, a scientifically useful tool. On the other hand, the niceness network can readily be applied to any multi-agent reinforcement learning environment with no prior knowledge required.

2.2.1. Reinforcement learning

Imitators learn their policy using the same algorithm and architecture as innovators, with an additional intrinsic reward to encourage reciprocation. Consider first the case with one innovator and one imitator; the general case will follow easily. Let the imitator have trajectory $T^{im}$ and the innovator have trajectory $T^{inn}$. The intrinsic reward penalizes the deviation between $N(T^{im})$ and $N(T^{inn})$, where $N$ is some function of the trajectory, which is intended to capture the effect of the actions in the trajectory on the return of the other agent. We shall refer to $N$ as the niceness of the agent whose trajectory is under consideration. This intrinsic reward is added to the environment reward.

We normalize the intrinsic reward so that it accounts for a proportion $\alpha$ of the total absolute reward in each batch, where the mean is taken over a batch of experience and $\alpha$ is a constant hyperparameter, which we tuned separately on each environment.
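A minimal sketch of the intrinsic reward and its batch normalization, under two stated assumptions: the deviation in niceness is penalized by a negative squared difference, and the proportion hyperparameter is read as the fraction of the batch's mean absolute reward contributed by the rescaled intrinsic term. Both choices, and the names `intrinsic_reward`, `combine_rewards`, `alpha` and `eps`, are illustrative rather than taken from the paper.

```python
import numpy as np


def intrinsic_reward(niceness_imitator, niceness_innovator):
    """Penalty for a mismatch in niceness (squared deviation is an assumption)."""
    diff = np.asarray(niceness_imitator, dtype=float) - np.asarray(
        niceness_innovator, dtype=float
    )
    return -(diff ** 2)


def combine_rewards(r_env, r_int, alpha, eps=1e-8):
    """Add the intrinsic reward to the environment reward.

    The intrinsic term is rescaled so that its mean absolute value is a
    fraction `alpha` of the combined mean absolute reward in the batch.
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    r_env = np.asarray(r_env, dtype=float)
    r_int = np.asarray(r_int, dtype=float)
    mean_env = np.mean(np.abs(r_env))
    mean_int = np.mean(np.abs(r_int))
    # Solve scale * mean_int / (mean_env + scale * mean_int) = alpha for scale.
    scale = alpha / (1.0 - alpha) * mean_env / (mean_int + eps)
    return r_env + scale * r_int
```

With several imitators, the same two functions would simply be applied to each imitator's niceness separately, each compared against the innovator's.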

Generalizing to the case of one innovator with several imitators is simple: we merely apply the intrinsic reward to each imitator separately, based on the deviation between their niceness and that of the innovator. Since our method uses online learning, it automatically adapts to the details of multi-agent interaction in different environments. This is difficult to capture in planning algorithms, because the correct cooperative policy for interactions with one agent depends on the policies of all the other agents.

The two imitator variants differ primarily in the choice of the niceness function $N$, as follows.

2.2.2. Metric matching

For the metric-matching imitator, we hand-code $N(T)$ for trajectories $T$ in each environment in a way that measures the pro-sociality of an agent’s behavior. If these metrics are accurate, this should lead to robust reciprocity, which gives us a useful comparison for the niceness network imitators.

2.2.3. Niceness network

The niceness network estimates the value of the innovator’s states and actions to the imitator. Let $R_t$ be the discounted return to the imitator from time $t$. We learn approximations to the following functions:

$$V(s) = \mathbb{E}\left[R_t \mid s_t = s\right], \qquad Q(s, a) = \mathbb{E}\left[R_t \mid s_t = s,\ a_t = a\right],$$

where $s_t$ and $a_t$ are the state and action of the innovator at time $t$. Clearly this requires access to the states and actions of the innovator. This is not an unreasonable assumption when compared with human cognition; indeed, there is neuroscientific evidence that the human brain automatically infers the states and actions of others Mitchell (2009). Extending our imitators to model the states and actions of innovators would be a valuable extension, but is beyond our scope here.

The niceness of action $a_t$ is defined as

$$n(s_t, a_t) = Q(s_t, a_t) - V(s_t).$$

This quantity estimates the expected increase in the discounted return of the imitator given the innovator’s action $a_t$.

Then, for a generic trajectory $T_t = (s_0, a_0, \ldots, s_t, a_t)$ we define the niceness of the trajectory, $N(T_t)$, to be

$$N(T_t) = \sum_{i=0}^{t} \gamma^{t-i}\, n(s_i, a_i).$$

This is used as the function $N$ in calculating the intrinsic reward.

The parameter $\gamma$ controls the timescale on which the imitation will happen; the larger $\gamma$ is, the slower the imitator is to forget. This balances between the criteria of provocability and forgiveness.

We learn the functions $V$ and $Q$ with an algorithm from Sutton and Barto (1998), using the innovator’s states and actions and the imitator’s reward.
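The accumulation of per-action niceness into a trajectory-level niceness can be sketched as a running discounted sum. The sketch assumes that per-action niceness is the gap between an action-value estimate and a state-value estimate, and that older actions are down-weighted geometrically; the names `action_niceness`, `trajectory_niceness` and `gamma` are illustrative, and the value estimates are passed in as plain numbers rather than produced by a trained network.

```python
from typing import List, Sequence


def action_niceness(q_value: float, v_value: float) -> float:
    """Estimated change in the other agent's return caused by one action."""
    return q_value - v_value


def trajectory_niceness(
    q_values: Sequence[float], v_values: Sequence[float], gamma: float
) -> List[float]:
    """Return the discounted niceness after every step of a trajectory.

    Uses the recurrence N_t = gamma * N_{t-1} + n_t, which equals the sum
    over i <= t of gamma**(t - i) * n_i.
    """
    running = 0.0
    history: List[float] = []
    for q, v in zip(q_values, v_values):
        running = gamma * running + action_niceness(q, v)
        history.append(running)
    return history
```

A larger `gamma` keeps old actions in memory for longer (more provocable), while a smaller one lets them fade quickly (more forgiving).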

While the niceness network is trained only on the states and actions of the innovator, in calculating the intrinsic reward it is used for inference on both imitator and innovator trajectories. For this to work we require symmetry between the imitator and the innovator: they must have the same state-action space, the same observation for a given state and the same payoff structure. Therefore our approach is not applicable in asymmetric games. Cooperation in asymmetric games may be better supported by indirect reciprocity than by generalizations of Tit-for-Tat; see for example Johnstone and Bshary (2007).

2.2.4. Off-policy correction

When calculating the intrinsic reward for the niceness network imitator, we evaluate the niceness of the imitator’s trajectories $T^{im}$, having only trained on trajectories $T^{inn}$ from the innovator. This is problematic if the states and actions of $T^{im}$ are drawn from a different distribution from those of $T^{inn}$. In this case, on the trajectory $T^{im}$, we might expect that $V$ and $Q$ would be inaccurate estimates for the effect of the imitator on the innovator’s payoff. In other words, the flip of perspective from innovator to imitator at inference time is only valid if the imitator’s policy is sufficiently closely aligned with that of the innovator.

In practice, we find that applying $Q$ to $T^{im}$ is particularly problematic. Specifically, if the imitator’s policy contains actions which are very rare in the innovator’s policy, then the estimate of $Q$ for these actions is not informative. To correct this issue, we add a bias for policy imitation to the model. The imitator estimates the policy of the innovator, giving an estimate $\hat{\pi}(s)$ for each state $s$. Then we add an additional KL-divergence loss for policy imitation,

$$L_{KL} = \beta\, D_{KL}\left(\pi^{im}(s) \,\|\, \hat{\pi}(s)\right),$$

where $\pi^{im}$ is the policy of the imitator, and $\beta$ is a constant hyperparameter. The effect of this loss term is to penalize actions that are very unlikely in $\hat{\pi}$; these are the actions for which the niceness network is unable to produce a useful estimate.
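A small numerical sketch of a policy-imitation penalty of this kind, assuming discrete action distributions and the KL direction that penalizes the imitator for placing probability on actions that are rare under the estimated innovator policy. The names `policy_imitation_loss`, `beta` and `eps` are illustrative and not taken from the paper.

```python
import numpy as np


def policy_imitation_loss(pi_imitator, pi_hat_innovator, beta, eps=1e-8):
    """beta * KL(pi_imitator || pi_hat_innovator) over the last axis.

    Both inputs are arrays of action probabilities for the same state(s).
    `eps` guards against log(0) and is a numerical convenience only.
    """
    p = np.asarray(pi_imitator, dtype=float)
    q = np.asarray(pi_hat_innovator, dtype=float)
    kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    return beta * kl
```

With this direction of the divergence, an imitator that spreads probability onto an action the innovator almost never takes incurs a large loss, whereas an imitator that matches the estimated policy incurs none.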

In our ablation study, we will demonstrate that the policy imitation term alone does not suffice to yield cooperation. Indeed, our choice of terminology was based on the high-level bias for reciprocity introduced by the niceness matching, not the low-level policy imitation. The latter might better be thought of as an analogue to human behavioral mimicry Chartrand and Lakin (2013), which while an important component of human social interaction Lakin et al. (2003), alone does not constitute a robust means of generating positive social outcomes Hale and de C. Hamilton (2016).

Empirically, we find that niceness matching alone often suffices to generate cooperative outcomes, despite the off-policy problem. This is likely because environments contain some state-action pairs which are universally advantageous or disadvantageous to others, regardless of the particular policy details. Moreover, this is not a limitation of our environments; it is a feature familiar from everyday life: driving a car without a catalytic converter is bad for society, regardless of the time and place. The policy imitation correction does, however, serve to stabilize cooperative outcomes, a suggested effect of mimicry in human social groups Tanner et al. (2008).

3. Experiments

We test our imitation algorithms in three domains. The first is Coins. This is a 2-player environment introduced in Lerer and Peysakhovich (2017). This environment has simple mechanics, and a strong social dilemma between the two players, similar to the Prisoner’s Dilemma. This allows us to study our algorithms in a setup close to the Prisoner’s Dilemma, and make comparisons to previous work.

The other two environments are Harvest and Cleanup. These are more complex environments, with delayed results of actions, partial observability of a somewhat complex gridworld, and more than two players. These environments are designed to test the main hypothesis of this paper, that our algorithms are able to learn to reciprocate in complex environments where reciprocity is temporally extended and hard to define. We choose these two environments because they represent different classes of social dilemma; Cleanup is a public goods game, while Harvest is a common pool resource dilemma.

3.1. Coins

3.1.1. Environment

Figure 1. The Coins game. Agents are rewarded for picking up coins, and punished when the other agent picks up a coin of their color.

We use the gridworld game Coins, introduced in Lerer and Peysakhovich (2017). Here, two players move on a fully-observed gridworld, on which coins of two colors periodically appear. When a player picks up a coin of either color, they receive a positive reward. When a player picks up a coin of the other player’s color, the other player receives a negative reward. The episode ends after a fixed number of steps. The total return is maximized when each player picks up only coins of their own color, but players are tempted to pick up coins of the wrong color. At each timestep, coins spawn on each unoccupied square with a fixed probability. The maximum collective return is therefore achieved, in expectation, if neither agent chooses to defect and both agents collect all coins of their own color.

In this game, the metric-matching agent uses the number of recent defections as its measure of niceness $N$. We define $d(a)$ to be $1$ if the action $a$ picks up a coin which penalizes the other player, and $0$ otherwise. Then we define $N(T)$ in terms of the recent values of $d$ along the trajectory $T$.


To make the environment tractable for our niceness network we symmetrize the game by swapping the coin colors in the observation of the innovator. This means that the $V$ and $Q$ functions learned for the innovator can be consistently applied to the imitator. Note that only the colors in the observation are swapped; the underlying game dynamics remain the same, so the social dilemma still applies.

3.1.2. Results

Both the metric-matching and niceness network imitators outperformed the greedy baseline in this environment, reaching greater overall rewards and a larger proportion of coins being collected by the agent which owns them (Figure 2). However, neither model was able to reach the near-perfect cooperation achieved by the approximate Markov Tit-for-Tat algorithm Lerer and Peysakhovich (2017), as shown in Figure 3 of that paper. We do not provide a numerical comparison, because we reimplemented the Coins environment for this paper.

This suggests that when it is possible to pre-learn policies that purely cooperate or defect and roll these out into the future, it is advantageous to leverage this prior knowledge to generate precise and extreme reciprocity. One might imagine improving our algorithm to display binary imitation behavior by attempting to match the maximum and minimum of recent niceness, rather than a discounted history. It would be interesting to see whether this variant elicited more cooperative behavior from innovators.

Figure 2. Reciprocity generates pro-social outcomes in Coins. (A) Both metric-matching and niceness-network variants significantly outperform the baseline, measured according to the total payoff. (B) The metric-matching imitators closely match the coin collection profile of the innovator during training, driving the innovator towards pro-sociality. No such beneficial outcome is seen for the selfish agents. (C) The same holds for the niceness network imitators.

3.2. Cleanup

3.2.1. Environment

In the Cleanup environment, the aim is to collect apples. Each apple collected provides a reward to the agent which collects it. Apples spawn at a rate determined by the state of a geographically separated river. Over time, this river fills with waste, lowering the rate of apple spawning linearly. For high enough levels of waste, no apples can spawn. The episode starts with the level of waste slightly above this critical point. The agents can take actions to clean the waste when near the river, which provides no reward but is necessary to generate any apples. The episode ends after a fixed number of steps, and the map is reset to its initial state. For details of the environment hyperparameters, and evidence that this is a social dilemma, see Hughes et al. (2018).

More precisely, this is a public goods dilemma. If some agents are contributing to the public good by clearing waste from the river, there is an incentive to stay in the apple spawning region to collect apples as they spawn. However, if all players adopt this strategy, then no apples spawn and there is no reward.

Figure 3. The Cleanup and Harvest games. Agents take simultaneous actions in a partially observable 2D gridworld. Rewards are obtained by eating apples, and agents may fine each other, conferring negative utility. The games are intertemporal social dilemmas: short term individual rationality is at odds with the long-term benefit of the whole group.

In this game, the metric-matching agent uses the number of contributions to the public good as its measure of niceness: for a given state $s$ and action $a$, $c(s, a)$ is $1$ if the action removes waste from the river, and $0$ otherwise. Then we define $N(T)$ in terms of the recent values of $c$ along the trajectory $T$.


3.2.2. Results

Both metric-matching and niceness network imitators are able to induce pro-social behaviour in the innovator they play alongside, greatly exceeding the return and contributions to the public good of selfish agents (Figure 4). Niceness network imitators come close to the final performance of metric-matching imitators, despite having to learn online which actions are pro-social. A representative episode from the niceness network case reveals the mechanism by which the society solves the social dilemma. (A video is available.) The innovator periodically leads an expedition to clean waste, which is subsequently joined by multiple imitators. Everyone benefits from this regular cleaning, since many apples are made available (and consumed) throughout the episode.

Figure 4. The effect of reciprocity on social outcomes in Cleanup. (A) Collective return is higher when metric-matching or niceness-network imitators co-learn with a selfish innovator. (B) The improvement in collective return is mediated by a greater contribution to the public good. (C) Contributions remain reasonably equal in the imitation conditions, indicating that the imitators are successfully matching their pro-social behavior with that of the innovator.

3.3. Harvest

3.3.1. Environment

In the Harvest environment, introduced in Pérolat et al. (2017), collecting apples again provides a reward. Harvested apples regrow at a rate determined by the number of nearby apples; the more other apples are present, the more likely the apple is to regrow on a timestep. If all the apples in a block are harvested, none of them will ever regrow. The episode ends after a fixed number of steps, and the map is reset to its initial state. For details of the environment hyperparameters, and evidence that this is a social dilemma, see Hughes et al. (2018).

This is a commons dilemma Hardin (1968). A selfish agent will harvest as rapidly as possible; if the whole group adopts this approach, then the resource is quickly depleted and the return over the episode is low. In order to get a high group return, agents must abstain from harvesting apples which would overexploit the common pool resource.

In this game, there is no clear metric of pro-sociality for the metric-matching agent to use. The anti-social actions in this game are taking apples with few neighbours, as this leads to the common pool resource being depleted. The sustainability of taking a particular apple can therefore be approximated by the number of apples within a fixed distance of it, capped at a maximum value. We use this quantity as the per-apple measure, following an analogous definition in Janssen et al. (2008). For a trajectory in which an agent eats a sequence of apples in order, we define the niceness $N(T)$ from these per-apple values.


3.3.2. Results

We present our findings in Figure 5. Selfish agents are successful near the start of training, but as they become more efficient at collection they deplete the common pool resource earlier in each episode, and collective return falls. This effect is most obvious when examining the sustainability metric, which we define as the average proportion of the episode that had passed when each apple was collected. Agents which collect apples perfectly uniformly would achieve a sustainability of 0.5. The baseline achieves far less.
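The sustainability metric defined above is straightforward to compute from the timesteps at which apples were collected. The sketch below is illustrative: the function name `sustainability` and the episode length used in the usage note are assumptions, not values from the paper.

```python
from typing import Sequence


def sustainability(collection_times: Sequence[int], episode_length: int) -> float:
    """Average fraction of the episode elapsed when each apple was collected."""
    if not collection_times:
        raise ValueError("no apples were collected in this episode")
    if episode_length <= 0:
        raise ValueError("episode_length must be positive")
    fractions = [t / episode_length for t in collection_times]
    return sum(fractions) / len(fractions)
```

For an assumed episode of 1000 steps, collecting one apple on every step gives a value very close to one half, while collecting all apples within the first 100 steps gives a value below 0.05, which is the signature of early depletion.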

For niceness network imitators, we see the same pattern near the start, where all the agents become more competent at collecting apples and begin to deplete the apple pool faster. We then see sustainability and collective return rise again. This is because the niceness network learns to classify sustainable behaviour, generating imitators that learn to match the sustainability of innovators, which creates an incentive for innovators to behave less selfishly.

Similarly, the experiment with metric-matching imitators enters a tragedy of the commons early in training, before recovering to achieve higher collective return and better sustainability than the niceness network in a shorter period of training time. This makes intuitive sense: by pre-specifying the nature of cooperative behavior, the metric-matching imitator has a much easier optimization problem, and more quickly demonstrates clear reciprocity to the innovator. The outcome is greater pro-sociality by all, and in a faster training time. To save compute, we terminated the metric-matching runs once they were consistently outperforming the niceness network.

A representative episode from the trained agents in the niceness-network case shows the innovator taking a sustainable path through the apples, with imitators striving to match this. (A video is available.) Interestingly, the society comes relatively close to causing the tragedy of the commons. Intuitively, when apples are scarce, the actions of each agent should have a more significant effect on their co-players, leading to a better learning signal for reciprocity.

Figure 5. The effect of reciprocity on social outcomes in Harvest. (A) Collective return is higher when metric-matching or niceness-network imitators co-learn with a selfish innovator. (B) In the imitation conditions, the group learns a more sustainable strategy. (C) Equality remains high throughout training, suggesting that the imitators are successfully matching the cooperativeness of innovators.

3.4. Analysis

In this section we examine the policies learned by the models, and the learning dynamics of the system.

3.4.1. Influence between agents

If our imitators are learning reciprocity, then the policy of the innovator should meaningfully influence the behavior of the imitators at the end of training. We demonstrate this for the Cleanup environment, by learning a GraphNet model that indirectly measures time-varying influence between entities Tacchetti et al. (2018). In Cleanup, the entities of interest are agents, waste and apples.

The input to our model is a complete directed graph with entities as nodes. The node features are the positions of each entity and additional metadata. For waste and apples, the metadata indicates whether the entity has spawned. For agents, it contains orientation, last action and last reward, and a one-hot encoding of agent type (innovator or imitator). In addition, the graph contains the timestep as a global feature.

The model is trained to predict agent actions from recorded gameplay trajectories. The architecture is as follows: the input is processed by a GraphNet encoder block with MLPs for its edge, node and global nets, followed by independent GRU units for the edge, node and global attributes, and finally a decoder MLP network for the agent nodes. Importantly, the graph output of the GRU is identical in structure to that of the input.

In Tacchetti et al. (2018), it was shown that the GRU output graph contains information about relationships between different entities. More precisely, the norm of the state vector along the edge from entity $i$ to entity $j$ computes the effective influence of $i$ on $j$, in the sense of Granger causality Granger (1969). We may use this metric to evaluate the degree of reciprocity displayed by imitators.

Table 1 shows the average norms of the state vectors along edges between imitators and innovators for our different imitation models, alongside an A3C baseline. The edge norm is greatest from the innovator to the imitator, strongly exceeding the baseline, indicating the innovator has a significant effect on the imitator’s actions. The effect is strikingly visible when the output graph net edges are superimposed on a representative episode with metric-matching imitators, with thicknesses proportional to their norms. (A video is available.) In the video, the innovator is purple, and the imitators are sky blue, lime, rose and magenta.

Experiment                In-Im   Im-In   Im-Im   In-In
A3C                       -       -       -       0.27
Metric matching, 4 A3C    0.97    0.30    0.28    -
Niceness network, 4 A3C   0.35    0.21    0.22    -
Table 1. Edge norms for graph net models trained to predict future states and actions. For metric-matching and niceness network imitators, we see that the influence of the innovators on imitators is greater than for any other pair of agent types.

3.4.2. Ablation

In the Cleanup environment, we examine the performance of the model with various components of the imitator ablated. We observe that with only the policy deviation cost, the performance is no better than with purely selfish agents. With the niceness network intrinsic reward, but no policy deviation cost, we see less consistent results across random seeds for the environment and agent (Figure 6A).

Figure 6. (A) Ablating the model reduces performance. Policy imitation alone cannot resolve the social dilemma. Imitation reward alone resolves the dilemma inconsistently, since policy divergence may occur, destabilizing the niceness network. (B) Imitation is necessary to maintain cooperation. When innovators that have learned to be pro-social are co-trained, the group outcomes quickly degrade.

3.4.3. Instability of solution without imitation

In the Cleanup environment, we take an innovator trained alongside four imitators, and run several copies of it in the environment, continuing to learn. We see that the contributions and collective reward quickly fall, as the innovators learn to be more selfish (Figure 6B). This shows that for this environment, reciprocity is necessary not only to find cooperative solutions but also to sustain cooperative behaviour.

3.4.4. Predictions of niceness network

We analysed the predictions of the niceness network in the Coins environment, to determine whether the imitator has correctly captured which actions of the innovator are beneficial and harmful. We rolled out episodes using the final policies of the innovator and imitator from a run with the niceness network. We found that, on average, the niceness network makes significantly negative predictions () for actions which pick up the wrong coin, and near-zero predictions both for picking up one’s own coin () and for actions which do not pick up a coin ().
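This kind of analysis amounts to grouping the network's predictions by the outcome of each action and averaging within each group. The sketch below shows that grouping under assumed names and an assumed record format, (actor, coin owner or `None`, prediction). The numbers are placeholders for illustration only and are not results from the paper.

```python
from statistics import mean


def classify_pickup(actor, coin_owner):
    """Label an action by the coin it picks up, if any."""
    if coin_owner is None:
        return "no_coin"
    return "own_coin" if coin_owner == actor else "wrong_coin"


def mean_prediction_by_category(records):
    """Average the predicted niceness within each action category."""
    grouped = {}
    for actor, coin_owner, prediction in records:
        category = classify_pickup(actor, coin_owner)
        grouped.setdefault(category, []).append(prediction)
    return {category: mean(values) for category, values in grouped.items()}


# Placeholder records: (actor, owner of the coin picked up or None, prediction).
records = [
    ("innovator", "imitator", -1.0),
    ("innovator", "imitator", -0.5),
    ("innovator", "innovator", 0.0),
    ("innovator", None, 0.1),
    ("innovator", None, -0.1),
]
summary = mean_prediction_by_category(records)
```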

On a more qualitative level, we display some of the predictions of the niceness network for the first of these episodes in Figure 7. There we see that the niceness network predicts negative values for picking up the other agent’s coins, and for actions which take the agent nearer to such coins.

Figure 7. The niceness network accurately measures the pro-sociality of different actions taken by the innovator. In this example, we see two consecutive frames from a test episode. The innovator (dark blue) starts two steps away from the imitator’s coin (light blue). The niceness network’s predicted value for each action is indicated on each frame. It is lowest for actions that move the innovator closer to the coin (bold), and highest for actions that move the innovator away from the coin.

4. Discussion

Our reciprocating agents demonstrate an ability to elicit cooperation in otherwise selfish individuals, both in 2-player and 5-player social dilemmas. Reciprocity, in the form of reciprocal altruism Trivers (1971), is not just a feature of human interactions. For example, it is also thought to underlie the complex social lives of teleost fish Brandl and Bellwood (2015). As such, it may be of fundamental importance in the construction of next generation social learning agents that display simple collective intelligence Wolpert and Tumer (1999). Moreover, combining online learned reciprocity with the host of other inductive biases in the multi-agent reinforcement learning literature Fehr and Schmidt ([n. d.]); Jaques et al. (2018); Lerer and Peysakhovich (2018); Peysakhovich and Lerer (2017); Lowe et al. (2017) may well be important for producing powerful social agents.

In the 5-player case, our experiments place the innovator in the leadership role, giving rise to pro-sociality. An obvious extension would involve training several imitators in parallel, leading to a situation more akin to conformity; see for example Cialdini and Goldstein (2004). In this case, all individuals change their responses to match those of others. A conformity variant may well be a better model of human behavior in public goods games Bardsley and Sausgruber (2005), and hence may generalize better to human co-players.

It is instructive to compare our model to the current state-of-the-art planning-based approach, approximate Markov Tit-for-Tat (amTFT) Lerer and Peysakhovich (2017). There, reciprocity is achieved by first learning cooperative and defecting policies, by training agents to optimize collective and individual return respectively. The reciprocating strategy uses rollouts based on the cooperative policy to classify actions as cooperative or defecting, and responds accordingly by switching strategy in a hard-coded manner.

In the Coins environment, amTFT performs significantly better than both our niceness network and metric-matching imitators, solving the dilemma perfectly. We believe this is because it better fulfills Axelrod’s clarity condition for reciprocity. By switching between two well-defined strategies, it produces very clear responses to defection, which provides a better reinforcement learning signal driving innovators towards pro-sociality.

On the other hand, our model is more easily scalable to complex environments. We identify three properties of an environment which make it difficult for amTFT, but which do not stop our model from learning to reciprocate.

  1. If no perfect model of the environment exists, or rolling out such a model is infeasible, one must evaluate the cooperativeness of others online, using a learned model.

  2. The cooperative strategy for amTFT is learned on collective return. For games with multiple agents, this may not yield a unique policy. For example, in the Cleanup environment, the set of policies maximizing collective return involves some agents cleaning the river, while others eat apples.

  3. If cooperation and defection are nuanced, rather than binary choices, then to reciprocate you may need to adjust your level of cooperativeness to match that of your opponent. This is hard to achieve by switching between a discrete set of strategies.

This leaves open an important question: how do we produce reciprocity which is both clear and scalable to complex tasks? One approach would be to combine a model like ours, which learns what to reciprocate, with a method which switches between policies in a discrete fashion, as in the previous planning-based approaches of Lerer and Peysakhovich (2017); Kleiman-Weiner et al. (2016); Peysakhovich and Lerer (2017). This might lead to reciprocity algorithms which can be learned in complex environments, but which are clearer and so can induce cooperation in co-players even more strongly than our model.


Appendix A Hyperparameters

In all experiments, for both imitators and innovators, the network consists of a single convolutional layer with output channels, a kernel and stride , followed by a two-layer MLP with hidden sizes , an LSTM Hochreiter and Schmidhuber (1997) with hidden size and linear output layers for the policy and baseline. The discount factor for the return is set to , and the learning rate and entropy cost were tuned separately for each model-environment pair.

The architecture for the niceness network is a non-recurrent neural network with the same convnet and MLP structure as the reinforcement learning architecture (though the weights are not shared between the two). The output layer of the MLP is mapped linearly to give outputs for , for each possible action, and for each possible action.

For the niceness network, we used a separate optimizer from the RL model, with a separate learning rate. Both optimizers used the RMSProp algorithm Hinton et al. (2012). The hyperparameters used in each experiment are shown in Table 2.
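To make the two-optimizer setup concrete, the sketch below implements the standard RMSProp update in plain Python and creates one independent instance per model. The class name, the decay and epsilon defaults, and both learning rates are illustrative assumptions; the values actually used are those reported in Table 2.

```python
import math


class RMSProp:
    """Minimal RMSProp: scale each gradient by a running RMS of past gradients."""

    def __init__(self, learning_rate, decay=0.9, epsilon=1e-8):
        self.learning_rate = learning_rate
        self.decay = decay
        self.epsilon = epsilon
        self.mean_square = None  # running average of squared gradients

    def step(self, params, grads):
        """Return updated parameters; optimizer state is kept per instance."""
        if self.mean_square is None:
            self.mean_square = [0.0] * len(params)
        updated = []
        for i, (param, grad) in enumerate(zip(params, grads)):
            self.mean_square[i] = (self.decay * self.mean_square[i]
                                   + (1.0 - self.decay) * grad * grad)
            scale = math.sqrt(self.mean_square[i]) + self.epsilon
            updated.append(param - self.learning_rate * grad / scale)
        return updated


# Hypothetical learning rates, chosen only for illustration.
rl_optimizer = RMSProp(learning_rate=0.1)
niceness_optimizer = RMSProp(learning_rate=0.01)

rl_params = rl_optimizer.step([0.0], [1.0])
niceness_params = niceness_optimizer.step([0.0], [1.0])
```

Because each instance holds its own running average, updates to the RL model do not affect the step sizes of the niceness network, and vice versa.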

Hyperparameter Coins (MM) Coins (AN) Cleanup (MM) Cleanup (AN) Harvest (MM) Harvest (AN)
- imitation reward weight
- imitation memory decay
- policy imitation weight
Entropy weight
RL learning rate
Advantage network learning rate
Advantage network TD-
Advantage network discount factor
Table 2. Hyperparameters used in metric matching and advantage network imitation experiments.