Inequity aversion improves cooperation in intertemporal social dilemmas

03/23/2018 ∙ by Edward Hughes, et al. ∙ UCL

Groups of humans are often able to find ways to cooperate with one another in complex, temporally extended social dilemmas. Models based on behavioral economics are only able to explain this phenomenon for unrealistic stateless matrix games. Recently, multi-agent reinforcement learning has been applied to generalize social dilemma problems to temporally and spatially extended Markov games. However, this has not yet generated an agent that learns to cooperate in social dilemmas as humans do. A key insight is that many, but not all, human individuals have inequity averse social preferences. This promotes a particular resolution of the matrix game social dilemma wherein inequity-averse individuals are personally pro-social and punish defectors. Here we extend this idea to Markov games and show that it promotes cooperation in several types of sequential social dilemma, via a profitable interaction with policy learnability. In particular, we find that inequity aversion improves temporal credit assignment for the important class of intertemporal social dilemmas. These results help explain how large-scale cooperation may emerge and persist.




1 Introduction

In intertemporal social dilemmas, there is a tradeoff between short-term individual incentives and long-term collective interest. Humans face such dilemmas when contributing to a collective food storage during the summer in preparation for a harsh winter, organizing annual maintenance of irrigation systems, or sustainably sharing a local fishery. Classical models of human behavior based on rational choice theory predict that cooperation in these situations is impossible olson1965logic; hardin1968tragedy. This poses a puzzle since humans evidently do find ways to cooperate in many everyday intertemporal social dilemmas, as documented by decades of fieldwork ostrom1990governing; dietz2003struggle and laboratory experiments ostrom1992covenants; fehr2002altruistic. Providing an empirically grounded explanation of how individual behavior gives rise to societal cooperation is seen as a core goal in several subfields of the social sciences and evolutionary biology ostrom1998behavioral; fehr2007human; rand2013human.

fehr1999theory; falk2006theory proposed influential models based on behavioral game theory. However, these models have limited applicability since they only generate predictions when the problem can be cast as a matrix game (see e.g. sandholm1996multiagent; CoteLR06). Here we consider a more realistic video-game setting, like those introduced in the behavioral research of janssen2010lab; janssen2010introducing; janssen2013role. In this environment, agents do not simply choose to cooperate or defect like they do in matrix games. Rather they must learn policies to implement their strategic decisions, and must do so while coping with the non-stationarity arising from other agents learning simultaneously. Several papers used multi-agent reinforcement learning leibo2017multiagent; perolat2017multi; foerster2017learning and planning lerer2017maintaining; peysakhovich2017prosocial; peyler2017; KleimanWeiner2016CoordinateTC to generate cooperation in this setting. However, this approach has not yet demonstrated robust cooperation in games with more than two players, which is often observed in human behavioral experiments. Moreover, naïvely optimizing group reward is also ineffective, due to the lazy agent problem sunehag2017. (For more detail on the motivations for our research program, see the supplementary information.)

It is difficult for both natural and artificial agents to find cooperative solutions to intertemporal social dilemmas for the following reasons:

  1. Collective action – individuals must learn and coordinate policies at a group level to avoid falling into socially deficient equilibria.

  2. Temporal credit assignment – rational defection in the short-term must become associated with long-term negative consequences.

Many different research traditions, including economics, evolutionary biology, sociology, psychology, and political philosophy have all converged on the idea that fairness norms are involved in resolving social dilemmas Rousseau1992; Hart1955; rawls1958justice; Klosko1987; Frey1995fairness; Bicchieri2010fairness; Henrich2010fairness. In one well-known model, agents are assumed to have inequity-averse preferences fehr1999theory. They balance their selfish desire for individual rewards against a need to keep deviations between their own rewards and the rewards of others as small as possible. Inequity-averse individuals are able to solve social dilemmas by resisting the temptation to pull ahead of others or, if punishment is possible, by punishing and discouraging free-riding. The inequity aversion model has been successfully applied to explain human behavior in a variety of laboratory economic games, such as the ultimatum game, the dictator game, the gift exchange game, market games, the trust game and public goods gibbons1992; eckel2010blaming. (For alternative theories of the other-regarding preferences that may underlie human cooperative behavior in economic games, see charness2002understanding; engelmann2004inequality.)

In this research, we generalize the inequity aversion model to Markov games, and show that it resolves intertemporal social dilemmas. Crucial to our analysis will be the distinction between disadvantageous inequity aversion (negative reward received by individuals who underperform relative to others) and advantageous inequity aversion (negative reward received by individuals who overperform relative to others). Colloquially, these may be thought of as reductionist models of envy (disadvantageous inequity aversion) and guilt (advantageous inequity aversion) respectively camerer2011. We hypothesise that these directly address the two challenges set out above in the following way.

Inequity aversion mitigates the problem of collective action by changing the effective payoff structure experienced by agents through both a direct and an indirect mechanism. In the direct mechanism, defectors experience advantageous inequity aversion, diminishing the marginal benefit of defection over cooperation. The indirect mechanism arises when cooperating agents are disadvantageous-inequity averse. This motivates them to punish defectors by sanctioning them, reducing the payoff incentive for free-riding. Since agents must learn a defecting strategy via exploration, initially cooperative agents are deterred from switching strategies if the payoff bonus does not outweigh the cost of inefficiently executing the defecting strategy while learning.

Inequity aversion also ameliorates the temporal credit assignment problem. Learning the association between short-term actions and long-term consequences is a high-variance and error-prone process, both for animals grice1948delay and reinforcement learning algorithms kearns2000. Inequity aversion short-circuits the need for such long-term temporal credit assignment by acting as an “early warning system” for intertemporal social dilemmas. As before, both a direct and an indirect mechanism are at work. With the direct mechanism, advantageous-inequity-averse defectors receive negative rewards in the short-term, since the benefits of defection are delivered on that timescale. The indirect mechanism operates because cooperators experience disadvantageous inequity aversion at precisely the time when other agents defect. This leads cooperators to punish defectors on a short-term timescale. Both systems have the effect of operant conditioning staddon2003, incentivizing agents that cannot resolve long-term uncertainty to act in the lasting interest of the group.

Figure 1: Screenshots from (A) the Cleanup game, (B) the Harvest game, (C) the Dictate apples game, and (D) the Take apples and Give apples games. The size of the agent-centered observation window is also shown in (B). The same size observation was used in all experiments.

2 Reinforcement learning in sequential social dilemmas

2.1 Partially observable Markov games

We consider multi-agent reinforcement learning in partially-observable general-sum Markov games shapley1953stochastic; Littman94markovgames. In each game state, agents take actions based on a partial observation of the state space and receive an individual reward. Agents must learn through experience an appropriate behavior policy while interacting with one another. We formalize this as follows.

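Consider an $N$-player partially observable Markov game $\mathcal{M}$ defined on a finite set of states $\mathcal{S}$. The observation function $O : \mathcal{S} \times \{1, \dots, N\} \to \mathbb{R}^d$ specifies each player’s $d$-dimensional view on the state space. From each state, players may take actions from the sets $\mathcal{A}^1, \dots, \mathcal{A}^N$ (one for each player). As a result of their joint action $(a^1, \dots, a^N) \in \mathcal{A}^1 \times \dots \times \mathcal{A}^N$ the state changes following the stochastic transition function $\mathcal{T} : \mathcal{S} \times \mathcal{A}^1 \times \dots \times \mathcal{A}^N \to \Delta(\mathcal{S})$ (where $\Delta(\mathcal{S})$ denotes the set of discrete probability distributions over $\mathcal{S}$). Write $\mathcal{O}^i = \{ o^i \mid s \in \mathcal{S},\ o^i = O(s, i) \}$ to indicate the observation space of player $i$. Each player receives an individual extrinsic reward defined as $r^i : \mathcal{S} \times \mathcal{A}^1 \times \dots \times \mathcal{A}^N \to \mathbb{R}$ for player $i$. (In our games, $N = 5$, and actions comprise movement, rotation and firing.)

Each agent learns, independently through its own experience of the environment, a behavior policy $\pi^i : \mathcal{O}^i \to \Delta(\mathcal{A}^i)$ (written $\pi(a^i \mid o^i)$) based on its own observation $o^i = O(s, i)$ and extrinsic reward $r^i(s, a^1, \dots, a^N)$. For the sake of simplicity we will write $a = (a^1, \dots, a^N)$, $o = (o^1, \dots, o^N)$ and $\pi(\cdot \mid o) = (\pi^1(\cdot \mid o^1), \dots, \pi^N(\cdot \mid o^N))$. Each agent’s goal is to maximize a long term $\gamma$-discounted payoff defined as follows:

\[ V^i_{\pi}(s_0) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t \, r^i(s_t, a_t) \;\middle|\; a_t \sim \pi(\cdot \mid o_t),\ s_{t+1} \sim \mathcal{T}(s_t, a_t) \right] \qquad (1) \]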
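The discounted payoff that each agent maximizes can be estimated from a single sampled reward trace. The following is a minimal sketch of that computation, not the authors' code; the function name `discounted_return` and the example trace are assumptions made for illustration.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Finite-horizon estimate of sum_t gamma^t * r_t for one player."""
    total = 0.0
    # Accumulate backwards so each earlier step applies one more factor of gamma.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical trace: the player collects an apple at t=0 and at t=2.
example_return = discounted_return([1.0, 0.0, 1.0], gamma=0.5)
```

With a discount of 0.5 the second apple contributes only a quarter of its face value, which illustrates why delayed consequences are hard to associate with the actions that caused them.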


2.2 Learning agents

We deploy asynchronous advantage actor-critic (A3C) as the learning algorithm for our agents Mnih16. A3C maintains both value (critic) and policy (actor) estimates using a deep neural network. The policy is updated according to the policy gradient method, using a value estimate as a baseline to reduce variance. Gradients are generated asynchronously by independent copies of each agent, playing simultaneously in distinct instantiations of the environment. Explicitly, the gradients are $\nabla_\theta \log \pi(a_t \mid s_t; \theta) \, A(s_t, a_t; \theta, \theta_v)$, where $A$ is the advantage function, estimated via $k$-step backups, $A(s_t, a_t; \theta, \theta_v) = \sum_{i=0}^{k-1} \gamma^i u_{t+i} + \gamma^k V(s_{t+k}; \theta_v) - V(s_t; \theta_v)$, where $u_{t+i}$ is the subjective reward. In section 3.1 we decompose this into an extrinsic reward from the environment and an intrinsic reward that defines the agent’s inequity-aversion.
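To make the backup concrete, the sketch below computes advantage estimates for one short rollout in the usual actor-critic way: the return is bootstrapped from the critic's value at the end of the rollout and the critic's value at each step is subtracted as a baseline. It is an illustration under assumptions, not the authors' implementation; the function name, argument names and the numbers in the example are hypothetical.

```python
from typing import List, Sequence


def k_step_advantages(
    rewards: Sequence[float],
    values: Sequence[float],
    bootstrap_value: float,
    gamma: float,
) -> List[float]:
    """Advantage estimates for a rollout of k = len(rewards) steps.

    rewards[t] is the subjective reward at step t, values[t] is the critic's
    estimate V(s_t), and bootstrap_value is V(s_{t+k}) after the last step.
    """
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have the same length")
    advantages = [0.0] * len(rewards)
    running_return = bootstrap_value
    # Walk backwards so each step reuses the discounted return of its successor.
    for t in reversed(range(len(rewards))):
        running_return = rewards[t] + gamma * running_return
        advantages[t] = running_return - values[t]
    return advantages


# Hypothetical two-step rollout.
example_advantages = k_step_advantages(
    rewards=[1.0, 0.0], values=[0.5, 0.2], bootstrap_value=1.0, gamma=0.5
)
```

Each advantage then weights the log-probability gradient of the action taken at that step.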

Figure 2: The public goods game (Cleanup) and the commons game (Harvest) are social dilemmas. (A) shows the Schelling diagram for Cleanup. (B) shows the Schelling diagram for Harvest. The dotted line shows the overall average return were the individual to choose defection.

2.3 Intertemporal social dilemmas

An intertemporal social dilemma is a temporally extended multi-agent game in which individual short-term optimal strategies lead to poor long-term outcomes for the group. To define this term precisely, we employ a formalization of empirical game theoretic analysis walsh2002analyzing; wellman2006methods. Our definition is consistent with that of leibo2017multiagent. However, since that work was limited to the $2$-player case, it relied on the empirical payoff matrix to represent the relative values of cooperation and defection. This quantity is unwieldy for $N > 2$ since it becomes a tensor. Therefore we base our definition on a different representation of the $N$-player game. Explicitly, a Schelling diagram schelling73; perolat2017multi depicts the relative payoffs for a single cooperator or defector given a fixed number of other cooperators. Thus Schelling diagrams are a natural and convenient generalization of payoff matrices to multi-agent settings. Game-theoretic properties like Nash equilibria are readily visible in Schelling diagrams; see schelling73 for additional details and intuition.

An $N$-player sequential social dilemma is a tuple $(\mathcal{M}, \Pi = \Pi_c \sqcup \Pi_d)$ of a Markov game and two disjoint sets of policies, said to implement cooperation and defection respectively, satisfying the following properties. Consider the strategy profile $(\pi_c^1, \dots, \pi_c^{\ell}, \pi_d^1, \dots, \pi_d^{m}) \in \Pi_c^{\ell} \times \Pi_d^{m}$ with $\ell + m = N$. We shall denote the average payoff for the cooperating policies by $R_c(\ell)$ and for the defecting policies by $R_d(\ell)$. A Schelling diagram plots the curves $R_c(\ell + 1)$ and $R_d(\ell)$. Intuitively, the diagram displays the two possible payoffs to the $N$-th player given that $\ell$ of the remaining players elect to cooperate and the rest defect. We say that $(\mathcal{M}, \Pi)$ is a sequential social dilemma iff the following hold:

  1. Mutual cooperation is preferred over mutual defection: $R_c(N) > R_d(0)$.

  2. Mutual cooperation is preferred to being exploited by defectors: $R_c(N) > R_c(0)$.

  3. Either the fear property, the greed property, or both:

    • Fear: mutual defection is preferred to being exploited. $R_d(i) > R_c(i)$ for sufficiently small $i$.

    • Greed: exploiting a cooperator is preferred to mutual cooperation. $R_d(i) > R_c(i)$ for sufficiently large $i$.

We show that the matrix games Stag Hunt, Chicken and Prisoner’s Dilemma satisfy these properties in Figure 6 of the supplementary information.
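The conditions above can be checked mechanically once the two Schelling curves are known. The sketch below does so under simplifying assumptions: both curves are indexed by the number of other players who cooperate, and the fear and greed properties are tested only at the two endpoints rather than "for sufficiently small" or "sufficiently large" counts. The function name and the payoff values are illustrative choices, not taken from the paper.

```python
from typing import Dict, Sequence


def classify_dilemma(
    payoff_c: Sequence[float], payoff_d: Sequence[float]
) -> Dict[str, bool]:
    """Check the social dilemma conditions on two Schelling curves.

    payoff_c[i] and payoff_d[i] are the payoffs to a focal cooperator and a
    focal defector when i of the other players cooperate (i = 0 .. N-1).
    """
    if len(payoff_c) != len(payoff_d) or len(payoff_c) < 2:
        raise ValueError("need two curves of equal length covering 0..N-1")
    last = len(payoff_c) - 1
    mutual_cooperation = payoff_c[last]  # everyone cooperates
    mutual_defection = payoff_d[0]       # everyone defects
    exploited = payoff_c[0]              # lone cooperator
    exploiting = payoff_d[last]          # lone defector
    fear = mutual_defection > exploited
    greed = exploiting > mutual_cooperation
    is_dilemma = (
        mutual_cooperation > mutual_defection
        and mutual_cooperation > exploited
        and (fear or greed)
    )
    return {"is_dilemma": is_dilemma, "fear": fear, "greed": greed}


# Illustrative 2-player payoffs: (cooperator curve, defector curve).
GAMES = {
    "prisoners_dilemma": ([0.0, 3.0], [1.0, 4.0]),
    "stag_hunt": ([0.0, 4.0], [1.0, 3.0]),
    "chicken": ([1.0, 3.0], [0.0, 4.0]),
}
RESULTS = {name: classify_dilemma(c, d) for name, (c, d) in GAMES.items()}
```

With these payoffs, Prisoner's Dilemma exhibits both fear and greed, Stag Hunt only fear, and Chicken only greed.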

A sequential social dilemma is intertemporal if the choice to defect is optimal in the short-term. More precisely, consider an individual $i$ and an arbitrary set of policies for the rest of the group. Given a starting state, for all $k$ sufficiently small, the policy with maximum return in the next $k$ steps is a defecting policy. There is thus a tension between short-term personal gain and long-term group utility.

Figure 3: Advantageous inequity aversion facilitates cooperation in the Cleanup game. (A) compares the collective return achieved by A3C and advantageous inequity averse agents, (B) shows contributions to the public good, and (C) shows equality over the course of training. (D-F) demonstrate that disadvantageous inequity aversion does not promote greater cooperation in the Cleanup game.

2.4 Examples

kollock1998social divides all multi-person social dilemmas into two broad categories:

  1. Public goods dilemmas, in which an individual must pay a personal cost in order to provide a resource that is shared by all.

  2. Commons dilemmas, in which an individual is tempted by a personal benefit, depleting a resource that is shared by all.

We consider two dilemmas in this paper, one of the public goods type and one of the commons type. Each was implemented as a partially observable Markov game on a 2D grid. Both are also intertemporal social dilemmas because individually selfish actions produce immediate benefits while their impacts on the collective develop over a longer time horizon. The availability of costly punishment is of critical importance in human sequential social dilemmas oliver1980rewards; Gurerk2006 and is therefore an action in the environments presented here. (In both games, players can fine each other using a punishment beam. This contrasts with perolat2017multi, in which a timeout beam was used.)

In the Cleanup game, the aim is to collect apples from a field. Each apple provides a positive reward. The spawning of apples is controlled by a geographically separate aquifer that supplies water and nutrients. Over time, this aquifer fills up with waste, lowering the respawn rate of apples linearly. For sufficiently high waste levels, no apples can spawn. At the start of each episode, the environment resets with waste just beyond this saturation point. To cause apples to spawn, agents must clean some of the waste.

Here we have a dilemma. Provided that some agents contribute to the public good by cleaning up the aquifer, it is individually more rewarding to stay in the apple field. However, if all players defect, then no-one gets any reward. A successful group must balance the temptation to free-ride with the provision of the public good. Cooperative agents must make a positive commitment to group-level well-being to solve the task.

The goal of the Harvest game is to collect apples. Each apple provides a positive reward. The apple regrowth rate varies across the map, dependent on the spatial configuration of uncollected apples: the more nearby apples, the higher the local regrowth rate. If all apples in a local area are harvested then none ever grow back. After a fixed number of steps the episode ends, at which point the game resets to an initial state.

The dilemma is as follows. The short-term interests of each individual lead toward harvesting as rapidly as possible. However, the long-term interests of the group as a whole are advanced if individuals refrain from doing so, especially when many agents are in the same local region. Such situations are precarious because the more harvesting agents there are, the greater the chance of permanently depleting the local resources. Cooperators must abstain from a personal benefit for the good of the group. (Precise details of the ecological dynamics may be found in the supplementary information.)

2.5 Validating the environments

We would like to demonstrate that these environments are social dilemmas by plotting Schelling diagrams. In complex, spatially and temporally extended Markov games, it is not feasible to analytically determine cooperating and defecting policies. Instead, we must study the environment empirically. One method employs reinforcement learning to train such policies. We enforce cooperation or defection by making appropriate modifications to the environment, as follows.

In Harvest, we enforce cooperation by modifying the environment to prevent some agents from gathering apples in low-density areas. In Cleanup, we enforce free-riding by removing the ability of some agents to clean up waste. We also add a small group reward signal to encourage the remaining agents to cooperate. The resulting empirical Schelling diagrams in Figure 2 prove that our environments are indeed social dilemmas.

Figure 4: Inequity aversion promotes cooperation in the Harvest game. When all 5 agents have advantageous inequity aversion, there is a small improvement over A3C in the three social outcome metrics: (A) collective return, (B) apple consumption, and (C) sustainability. Disadvantageous inequity aversion provides a much larger improvement over A3C, and works even when only 1 out of 5 agents are inequity averse. (D) shows collective return, (E) apple consumption, and (F) sustainability.

3 The model

We first introduce the inequity aversion model of fehr1999theory. It is directly applicable only to stateless games. We then extend their model to sequential or multi-state problems, making use of deep reinforcement learning.

3.1 Inequity aversion

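The fehr1999theory utility function is as follows. Let $r_1, \dots, r_N$ be the extrinsic payoffs achieved by each of $N$ players. Each agent receives a utility

\[ U_i(r_i, \dots, r_N) = r_i - \frac{\alpha_i}{N-1} \sum_{j \neq i} \max(r_j - r_i, 0) - \frac{\beta_i}{N-1} \sum_{j \neq i} \max(r_i - r_j, 0) \qquad (2) \]

where the additional terms may be interpreted as intrinsic payoffs, in the language of chentanez2005intrinsically.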

The parameter $\alpha_i$ controls an agent’s aversion to disadvantageous inequity. A larger value for $\alpha_i$ implies a larger utility loss when other agents achieve rewards greater than one’s own. Likewise, the parameter $\beta_i$ controls an agent’s aversion to advantageous inequity, utility lost when performing better than others. fehr1999theory argue that $\alpha > \beta$. That is, most people are loss averse in social comparisons. There is some empirical support for this prediction loewenstein1989social, though the evidence is mixed Bellemare2008; Hoppe2013. We selected $\alpha$ and $\beta$ by a sweep over values for each and report our strongest results.
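A small worked example may help fix the roles of the two parameters in the stateless utility. The sketch below implements the standard Fehr and Schmidt form: own payoff, minus an envy term averaged over the other players, minus a guilt term averaged over the other players. The function name and the parameter values in the example are hypothetical and are not the values used in the experiments.

```python
from typing import Sequence


def inequity_averse_utility(
    payoffs: Sequence[float], i: int, alpha: float, beta: float
) -> float:
    """Utility of player i given everyone's extrinsic payoffs.

    alpha scales disadvantageous inequity (others are ahead of player i) and
    beta scales advantageous inequity (player i is ahead of others).
    """
    n_players = len(payoffs)
    if n_players < 2:
        raise ValueError("inequity aversion needs at least two players")
    own = payoffs[i]
    others = [r for j, r in enumerate(payoffs) if j != i]
    envy = sum(max(r - own, 0.0) for r in others)
    guilt = sum(max(own - r, 0.0) for r in others)
    return own - alpha * envy / (n_players - 1) - beta * guilt / (n_players - 1)


# Hypothetical payoffs for three players, with alpha = 1.0 and beta = 0.5.
example_payoffs = [4.0, 2.0, 0.0]
example_utilities = [
    inequity_averse_utility(example_payoffs, i, alpha=1.0, beta=0.5)
    for i in range(len(example_payoffs))
]
```

In this example the best-off player loses utility only through guilt, the worst-off player only through envy, and the middle player through both.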

Figure 5: Inequity aversion promotes cooperation by improving temporal credit assignment. (A) shows collective return for delayed advantageous inequity aversion in the Cleanup game. (B) shows apple consumption for delayed disadvantageous inequity aversion in the Harvest game.

3.2 Inequity aversion in sequential dilemmas

Experimental work in behavioral economics suggests that some proportion of natural human populations are inequity averse fehr2007human. However, as a computational model, inequity aversion has only been expounded for the matrix game setting. Equation (2) can be directly applied only to stateless games VerbeeckPN02; JongT11. In this section we extend this model of inequity aversion to the temporally extended Markov game case.

The main problem in re-defining the social preference of equation (2) for Markov games is that the rewards of different players may occur on different timesteps. Thus the key step in extending (2) to this case is to introduce per-player temporal smoothing of the reward traces.

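Let $r_i(s, a)$ denote the reward obtained by the $i$-th player when it takes action $a$ from state $s$. For convenience, we also sometimes write it with a time index: $r_i^t$. We define the subjective reward $u_i(s, a)$ received by the $i$-th player when it takes action $a$ from state $s$ to be

\[ u_i(s_i^t, a_i^t) = r_i(s_i^t, a_i^t) - \frac{\alpha_i}{N-1} \sum_{j \neq i} \max\left(e_j^t(s_j^t, a_j^t) - e_i^t(s_i^t, a_i^t), 0\right) - \frac{\beta_i}{N-1} \sum_{j \neq i} \max\left(e_i^t(s_i^t, a_i^t) - e_j^t(s_j^t, a_j^t), 0\right) \qquad (3) \]

where the temporal smoothed rewards $e_j^t$ for the agents $j = 1, \dots, N$ are updated at each timestep $t$ according to

\[ e_j^t(s_j^t, a_j^t) = \gamma \lambda \, e_j^{t-1}(s_j^{t-1}, a_j^{t-1}) + r_j^t(s_j^t, a_j^t) \qquad (4) \]

where $\gamma$ is the discount factor and $\lambda$ is a hyperparameter. This is analogous to the mathematical formalism used for eligibility traces sutton1998rl. Furthermore, we allow agents to observe the smoothed reward of every player on each timestep.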
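The two-stage computation described here (smooth each player's rewards, then compare the smoothed traces) can be sketched as a small stateful helper. This is an illustration under assumptions rather than the authors' implementation: the class name is hypothetical, all agents share the same two aversion parameters, traces start at zero, and the parameter values in the example are arbitrary.

```python
from typing import List, Sequence


class SubjectiveReward:
    """Turns per-step extrinsic rewards into inequity-averse subjective rewards."""

    def __init__(self, n_players: int, alpha: float, beta: float,
                 gamma: float, lam: float) -> None:
        if n_players < 2:
            raise ValueError("inequity aversion needs at least two players")
        self.alpha = alpha
        self.beta = beta
        self.decay = gamma * lam
        self.traces = [0.0] * n_players  # smoothed rewards, one per player

    def step(self, extrinsic: Sequence[float]) -> List[float]:
        """Update every trace with this step's rewards, then compare traces."""
        self.traces = [self.decay * e + r for e, r in zip(self.traces, extrinsic)]
        n_others = len(self.traces) - 1
        subjective = []
        for i, reward in enumerate(extrinsic):
            own = self.traces[i]
            others = [e for j, e in enumerate(self.traces) if j != i]
            envy = sum(max(e - own, 0.0) for e in others)
            guilt = sum(max(own - e, 0.0) for e in others)
            subjective.append(
                reward - self.alpha * envy / n_others - self.beta * guilt / n_others
            )
        return subjective


# Hypothetical two-player run: player 0 eats one apple, then nothing happens.
model = SubjectiveReward(n_players=2, alpha=1.0, beta=0.5, gamma=0.5, lam=1.0)
first_step = model.step([1.0, 0.0])
second_step = model.step([0.0, 0.0])
```

The second step shows the point of the smoothing: both players still receive an intrinsic signal one step after the apple was eaten, although neither receives any extrinsic reward on that step.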

4 Results

We show that advantageous inequity aversion is able to resolve certain intertemporal social dilemmas without resorting to punishment by providing a temporally correct intrinsic reward. For this mechanism to be effective, the population must have sufficiently many advantageous-inequity-averse individuals. By contrast disadvantageous-inequity-averse agents can drive mutual cooperation even in small numbers. They achieve this by punishing defectors at a time concomitant with their offences. In addition, we find that advantageous inequity aversion is particularly effective for resolving public goods dilemmas, whereas disadvantageous inequity aversion is more powerful for addressing commons dilemmas. Our baseline A3C agent fails to find socially beneficial outcomes in either category of game. We define the metrics used to quantify our results in the supplementary information.

4.1 Advantageous inequity aversion promotes cooperation

Advantageous-inequity-averse agents are better than A3C at maintaining cooperation in both public goods and commons games. This effect is particularly pronounced in the Cleanup game (Figure 3). Here groups of advantageous-inequity-averse agents find solutions in which agents consistently clean large amounts of waste, producing a large collective return. (A video of this behavior is available online.) We clarify the effect of advantageous inequity aversion on the intertemporal nature of the problem by delaying the delivery of the intrinsic reward signal. Figure 5 suggests that improving temporal credit assignment is an important function of inequity aversion since delaying the time at which the intrinsic reward signal is delivered removes its beneficial effect.

4.2 Disadvantageous inequity aversion promotes cooperation

Disadvantageous-inequity-averse agents are better than A3C at maintaining cooperation via punishment in commons games (Figure 4). In particular, a single disadvantageous-inequity-averse agent can fine defectors, generating a sustainable outcome. (A video of this behavior is available online.) In Figure 5, we see that the disadvantageous-inequity-aversion signal must be temporally aligned with over-consumption for effective policing to arise. Hence, it is plausible that inequity aversion bridges the temporal gap between short-term incentives and long-term outcomes. Disadvantageous inequity aversion has no such positive impact in the Cleanup game, for reasons that we discuss in section 5.

5 Discussion

In the Cleanup game, advantageous inequity aversion is an unambiguous feedback signal: it encourages agents to contribute to the public good. In the direct pathway, trial and error will quickly discover that the fastest way to diminish the negative rewards arising from advantageous inequity aversion is to clean up waste, since doing so creates more apples for others to consume. However the indirect mechanism of disadvantageous inequity aversion and punishment lacks this property; while punishment may help exploration of new policies, it does not directly increase the attractiveness of waste cleaning.

The Harvest game requires passive abstention rather than active provision. In this setting, advantageous inequity aversion provides a noisy signal for sustainable behaviour. This is because it is sensitive to the precise apple configuration in the environment, which changes rapidly over time. Hence advantageous inequity aversion does not greatly aid the exploration of policy space. Punishment, on the other hand, operates as a valuable shaping reward for learning, dis-incentivizing overconsumption at precisely the correct time and place.

In the Harvest game, disadvantageous inequity aversion generates cooperation in a grossly inefficient manner: huge amounts of collective resource are lost to fines (compare Figures 4D and 4E). This parallels human behavior in laboratory matrix games, e.g. Yamagishi1986; FehrGachter2000. In the Cleanup game, advantageous-inequity-averse agents resolve the social dilemma without such losses, but must comprise a large proportion of the population to be successful. This mirrors the cultural modulation of advantageous inequity aversion in humans blake2015. Evolution is hypothesized to have favored fairness as a mechanism for continued human cooperation Brosnan2014. It remains to be seen whether emergent inequity-aversion can be obtained by evolving reinforcement learning agents.

We conclude by putting our approach in the context of prior work. Since our mechanism does not require explicitly training cooperating and defecting agents or modelling their behaviour, it scales more easily to complex environments and large populations of agents. However, our method has several limitations. Firstly, our guilty agents are quite exploitable, as evidenced by the necessity of a homogeneous guilty population to achieve cooperation. Secondly, our agents use outcomes rather than predictions to inform their policies. This is known to be a problem in environments with high stochasticity peyler2017. Finally, the heterogeneity of the population is an additional hyperparameter in our model. Clearly, one must set this appropriately, particularly in games with asymmetric outcomes. It is likely that a hybrid approach will be required to solve these challenging issues at scale.

Appendix A Supplementary information

A.1 Motivating research on emergent cooperation

The aims of this new research program are twofold. First, we seek to better understand the individual level inductive biases that promote emergent cooperation at the group level in humans. Second, we want to develop agents that exhibit these inductive biases, in the hope that they might navigate complex multi-agent tasks in a human-like way. Much as the fields of neuroscience and reinforcement learning have enjoyed a symbiotic relationship over the past fifty years, so also can behavioral economics and multi-agent reinforcement learning.

Consider, for comparison, maximizing joint utility. Firstly, this assumes away the problem of emergent altruism on the individual level, which is exactly our object of study. Therefore, it is not a relevant baseline for our research. Moreover, it is known to suffer from a serious spurious reward problem (Sunehag et al. 2017), which gets worse as the number of agents increases. Furthermore, in realistic environments, one may not have access to the collective reward function, for privacy reasons for example. Finally, groups of agents trained with a group reward are by definition overfitting to the outcomes of their co-players. Thus maximizing joint utility does not easily generalize to complicated multi-agent problems with large numbers of agents and subtasks that mix cooperation and competition.

Individual-level inductive biases sidestep these issues, while allowing us to learn from the extensive human behavioral literature. In this paper, we have taken an extremely well-studied model in the game-theoretic setting (Fehr and Schmidt 1999) and recast it as an intrinsic reward for reinforcement learning. We can thus evaluate the strengths and weaknesses of inequity aversion from a completely new perspective. We note its success in solving social dilemmas, but find that the success is task-conditional, and that the policies are sometimes quite exploitable. This suggests various fascinating extensions, such as a population-based study with evolved intrinsic rewards (Wang et al. to appear).

A.2 Illustrative Schelling diagrams for 2-player matrix games and SSDs

Figure 6 shows Schelling diagrams and the associated payoff matrices for the canonical matrix games Chicken, Stag Hunt and Prisoner’s Dilemma. We may read off the pure strategy Nash equilibria by considering the social pressure generated by the dominant strategy. Where this is defection, then there is a negative pressure on the number of cooperators; where this is cooperation, there is a positive pressure. Hence the pure strategy Nash equilibria in Chicken are $(c, d)$ and $(d, c)$, in Stag Hunt $(c, c)$ and $(d, d)$, and in Prisoner’s Dilemma $(d, d)$. Moreover, the different motivations for defection are immediately apparent. In Chicken, greed promotes defection: $R_d(1) > R_c(1)$, so exploiting a cooperator pays more than mutual cooperation. In Stag Hunt, the problem is fear: $R_d(0) > R_c(0)$, so mutual defection pays more than being exploited. Prisoner’s Dilemma suffers from both temptations to defect.
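The equilibria read off the diagrams can be cross-checked by brute force on the payoff matrices. The sketch below enumerates the pure strategy profiles of a symmetric 2-player game and keeps those from which neither player gains by deviating alone. The payoff numbers are illustrative textbook-style values chosen for this example, not the entries shown in the figure, and the function name is hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple

Payoff = Dict[Tuple[str, str], float]
ACTIONS = ("c", "d")


def pure_nash_equilibria(payoff: Payoff) -> List[Tuple[str, str]]:
    """Pure Nash equilibria of a symmetric 2-player game.

    payoff[(mine, theirs)] is the payoff to a player choosing `mine` against
    an opponent choosing `theirs`.
    """
    equilibria = []
    for a, b in product(ACTIONS, repeat=2):
        # Neither player can do strictly better by switching unilaterally.
        first_ok = all(payoff[(a, b)] >= payoff[(alt, b)] for alt in ACTIONS)
        second_ok = all(payoff[(b, a)] >= payoff[(alt, a)] for alt in ACTIONS)
        if first_ok and second_ok:
            equilibria.append((a, b))
    return equilibria


# Illustrative payoffs (assumed for this sketch).
CHICKEN = {("c", "c"): 3, ("c", "d"): 1, ("d", "c"): 4, ("d", "d"): 0}
STAG_HUNT = {("c", "c"): 4, ("c", "d"): 0, ("d", "c"): 3, ("d", "d"): 1}
PRISONERS_DILEMMA = {("c", "c"): 3, ("c", "d"): 0, ("d", "c"): 4, ("d", "d"): 1}

chicken_eq = pure_nash_equilibria(CHICKEN)
stag_hunt_eq = pure_nash_equilibria(STAG_HUNT)
prisoners_dilemma_eq = pure_nash_equilibria(PRISONERS_DILEMMA)
```

The enumeration recovers the asymmetric equilibria of Chicken, the two symmetric equilibria of Stag Hunt, and mutual defection in Prisoner's Dilemma.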

A.3 Parameters for Cleanup and Harvest games

In both Cleanup and Harvest, all agents are equipped with a fining beam which administers a negative reward to the user and a negative reward to the individual that is being fined. There is no penalty to the user for unsuccessful fining. In Cleanup each agent is additionally equipped with a cleaning beam, which allows them to remove waste from the aquifer. In both games, eating apples provides a positive reward. There are no other extrinsic rewards.

In Cleanup, waste is produced uniformly in the river with a fixed probability on each timestep, until the river is saturated with waste, which happens when the waste covers 40% of the river. For a given saturation of the river, apples spawn in the field with a probability that decreases linearly as the saturation increases. Initially the river is saturated with waste, so some contribution to the public good is required for any agent to receive a reward.

Figure 6: These Schelling diagrams demonstrate that classic matrix games are social dilemmas by our definition.

In Harvest, apples spawn at a rate that depends on the current number of other apples within a fixed radius. The spawn probability increases with the number of apples inside the radius, and is zero when there are none. The initial distribution of apples creates a number of more or less precariously linked regions. Sustainable policies must preferentially harvest denser regions, and avoid removing the important apples that link patches.
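The density-dependent regrowth rule can be illustrated with a short sketch. Everything numeric here is a placeholder: the radius, the use of Manhattan distance, and the probability table are assumptions chosen only to show the shape of the rule (no neighbours means no regrowth, more neighbours means faster regrowth), and are not the parameters of the actual environment.

```python
from typing import Iterable, Tuple

Cell = Tuple[int, int]

# Hypothetical values, for illustration only.
ASSUMED_RADIUS = 2
ASSUMED_SPAWN_PROBS = (0.0, 0.01, 0.05, 0.1)  # indexed by neighbour count, capped


def neighbor_count(apples: Iterable[Cell], cell: Cell,
                   radius: int = ASSUMED_RADIUS) -> int:
    """Number of apples within `radius` of `cell`, using Manhattan distance."""
    x, y = cell
    return sum(
        1
        for (ax, ay) in apples
        if (ax, ay) != cell and abs(ax - x) + abs(ay - y) <= radius
    )


def spawn_probability(n_neighbors: int) -> float:
    """Per-step probability that an empty cell regrows an apple."""
    capped = min(n_neighbors, len(ASSUMED_SPAWN_PROBS) - 1)
    return ASSUMED_SPAWN_PROBS[capped]


# A hypothetical patch: two apples close to the empty cell, one far away.
patch = {(0, 0), (1, 0), (3, 3)}
nearby = neighbor_count(patch, (0, 1))
regrowth = spawn_probability(nearby)
```

Because the probability is zero with no neighbours, harvesting the last apple in a patch removes that patch permanently, which is what makes over-harvesting irreversible.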

A.4 Social outcome metrics

Unlike in single-agent reinforcement learning where the value function is the canonical metric of agent performance, in multi-agent systems with mixed incentives, there is no scalar metric that can adequately track the state of the system (see e.g. Chalkiadakis03; perolat2017multi). Thus we use several different social outcome metrics in order to summarize group behavior and facilitate its analysis.

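Consider $N$ independent agents. Let $\{ r_t^i \mid t = 1, \dots, T \}$ be the sequence of rewards obtained by the $i$-th agent over an episode of duration $T$. Likewise, let $\{ o_t^i \mid t = 1, \dots, T \}$ be the $i$-th agent’s observation sequence. Its return is given by $R^i = \sum_{t=1}^{T} r_t^i$.

The Utilitarian metric ($U$), also known as collective return, measures the sum total of all rewards obtained by all agents. It is defined as the average over players of sum of rewards $R^i$. The Equality metric ($E$) is defined using the Gini coefficient gini1912variabilita. The Sustainability metric ($S$) is defined as the average time at which the rewards are collected. For the Cleanup game, we also consider a measure of total contribution to the public good ($P$), defined as the number of waste cells cleaned.

\[ U = \mathbb{E}\left[ \frac{\sum_{i=1}^{N} R^i}{T} \right], \qquad E = 1 - \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} |R^i - R^j|}{2N \sum_{i=1}^{N} R^i}, \qquad S = \mathbb{E}\left[ \frac{1}{N} \sum_{i=1}^{N} t^i \right] \text{ where } t^i = \mathbb{E}[\, t \mid r_t^i > 0 \,], \qquad P = \sum_{i=1}^{N} p^i \]

where $p^i$ is the number of waste cells cleaned by player $i$.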
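For a single episode, the first three metrics can be computed directly from the per-agent reward sequences. The sketch below is one reading of the definitions, under stated assumptions: collective return is the summed return divided by episode length, equality is one minus the Gini coefficient of returns, and sustainability averages, over agents, the mean timestep (counted from 1) at which each agent received positive reward. The handling of edge cases (zero total return, agents with no positive reward) is a choice made for this example, and the function name is hypothetical.

```python
from typing import Dict, Sequence


def social_metrics(rewards: Sequence[Sequence[float]]) -> Dict[str, float]:
    """Per-episode social outcome metrics; rewards[i][t] is agent i's reward."""
    n_agents = len(rewards)
    duration = len(rewards[0])
    returns = [sum(agent_rewards) for agent_rewards in rewards]
    total = sum(returns)

    utilitarian = total / duration

    if total > 0:
        pairwise = sum(abs(a - b) for a in returns for b in returns)
        equality = 1.0 - pairwise / (2 * n_agents * total)
    else:
        equality = 1.0  # assumption: treat an all-zero episode as perfectly equal

    mean_times = []
    for agent_rewards in rewards:
        times = [t for t, r in enumerate(agent_rewards, start=1) if r > 0]
        # Assumption: an agent that never collects a reward contributes 0.
        mean_times.append(sum(times) / len(times) if times else 0.0)
    sustainability = sum(mean_times) / n_agents

    return {
        "utilitarian": utilitarian,
        "equality": equality,
        "sustainability": sustainability,
    }


# Hypothetical 4-step episodes with two agents.
balanced = social_metrics([[1, 0, 1, 0], [0, 1, 0, 1]])
lopsided = social_metrics([[1, 1, 0, 0], [0, 0, 0, 0]])
```

The lopsided episode scores lower on equality, and lower on sustainability because all rewards were collected early.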

A.5 Dictate apples, Give apples and Take apples games

In each game, two players are isolated from one another in separate “rooms”. They can interact only by pressing buttons. In the Dictate apples game, initially all apples are in the left room. At any time, the left agent can press a button that transports all the apples it has not yet consumed to the right room. In the Take apples game, both players begin with apples in their room, but there are twice as many in the left room as the right room. The right agent has the option at any time of pressing a button that removes all the apples from the other player’s room that have not yet been collected. In the Give apples game, both players begin with apples, and the left player again has twice as many as the right player. The left player can press a button to add more apples on the right side. Unlike in the Dictate apples game, this has no effect on the left agent’s own apple supply. Each episode terminates when all apples are collected.

A.6 Inequity aversion models “irrational” behavior

The inequity aversion model of fehr1999theory is supported by experimental evidence from behavioral game theory. In particular, human behavior in the Dictator game is consistent with the prediction that some people have inequity-averse social preferences. A subject in a typical Dictator game experiment must decide how much of an initial endowment (if any) to give to another subject in a one-shot anonymous manner. In contrast to the prediction of rational choice theory that subjects would offer nothing, but in accord with the prediction of the inequity aversion model of fehr1999theory, most subjects offer a positive share of their endowment camerer2004measuring.
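To make the contrast with rational choice theory concrete, the following sketch evaluates the static two-player inequity aversion utility, $u_i = x_i - \alpha \max(x_j - x_i, 0) - \beta \max(x_i - x_j, 0)$, for a dictator choosing how much to give. It is a minimal illustration, not the temporally extended model used in our experiments. The function names, the endowment of 10 units and the parameter values are assumptions chosen for the example.

```python
def fehr_schmidt_utility(own, other, alpha, beta):
    """Static two-player inequity aversion utility.

    alpha weights disadvantageous inequity (the other player has more),
    beta weights advantageous inequity (this player has more).
    """
    disadvantage = max(other - own, 0.0)
    advantage = max(own - other, 0.0)
    return own - alpha * disadvantage - beta * advantage


def best_dictator_offer(endowment, alpha, beta):
    """Offer (in whole units) that maximizes the dictator's utility.

    Hypothetical helper: ties are broken in favor of the smallest offer.
    """
    offers = range(endowment + 1)
    return max(
        offers,
        key=lambda give: fehr_schmidt_utility(endowment - give, give, alpha, beta),
    )


# A purely selfish dictator (alpha = beta = 0) keeps everything.
selfish_offer = best_dictator_offer(10, alpha=0.0, beta=0.0)

# With a sufficiently strong aversion to advantageous inequity (beta > 0.5
# in this linear form), giving becomes preferable to keeping.
averse_offer = best_dictator_offer(10, alpha=1.0, beta=0.6)
```

In this linear form the dictator's utility changes by $2\beta - 1$ per unit given while it still holds more than the recipient, so the preferred offer switches from nothing to an equal split once $\beta$ exceeds one half. Intermediate offers would require a nonlinear variant, which this sketch does not attempt.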

Figure 7: Behavioral economics laboratory paradigms can be simulated by gridworld Markov games. Agent behavior is shown in (A) for the Dictate apples game, in (B) for the Take apples game, and in (C) for the Give apples game.

To test whether our temporally extended inequity-aversion model makes predictions consistent with these findings, we introduce simple two-player gridworld games (see Figure 1). These capture the essential features of Dictator game laboratory experiments. As in all our experiments, agents can obtain positive external rewards only by collecting apples. In addition, an agent can press buttons which Dictate apples (give from its own store), Give apples from an external store, or Take apples from the other agent. A full description is provided in the supplementary information.

A selfish rational agent would never press the button in any of these games. This prediction was borne out by our A3C agent baseline (Figure 7). On the other hand, advantageous-inequity-averse agents pressed their buttons significantly more often in the Give apples and Dictate apples games. They pressed the button even in the Dictate apples game when doing so could only reduce their own (extrinsic) payoff. Disadvantageous-inequity-averse agents pressed their button in the Take apples game to reduce the rewards obtained by the player with the larger initial endowment despite there being no extrinsic benefit to doing this.

A.7 Theoretical arguments for the success of inequity aversion

We provide theoretical arguments for inequity aversion as an improvement to temporal credit assignment, extending the work of Fehr and Schmidt (1999) beyond simple market games. In an intertemporal social dilemma, defection by definition dominates cooperation in the short term, so to leading order the short-term Schelling diagram looks like Figure 8A. Here and in the sequel we work in the limit of a large number of players $N$. Mathematically, we denote defector payoff by , cooperator payoff by and average payoff across the population by , writing:


First consider the effect of advantageous inequity aversion (AIA) on the short-term payoffs. Clearly the cooperator line is unchanged, since it is dominated. Hence the cooperator and defector lines become:


The transformed short-term payoffs are shown in Figure 8B. Since the cooperator curve dominates in some region, cooperative behavior can be self-sustaining in the short term. Thus AIA improves temporal credit assignment. AIA can resolve the social dilemma when the earliest learned behavior generates multiple cooperators. This is the case for the Cleanup game but not the Harvest game, explaining the results.
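The qualitative argument for AIA can be illustrated with a toy Schelling diagram. The sketch below does not reproduce the equations of this appendix. It assumes a hypothetical linear parameterization in which every player receives a shared benefit `a * k / N` when `k` of `N` players cooperate, defectors receive an extra short-term gain `b`, and the advantageous-inequity penalty takes the $N$-player form $\frac{\beta}{N-1}\sum_{j} \max(x_i - x_j, 0)$. All names and parameter values are assumptions.

```python
def defector_advantage(n_players, n_cooperators, b=0.5, beta=0.0):
    """Subjective payoff gap (defector minus cooperator) under AIA.

    Toy model: a defector out-earns each of the n_cooperators cooperators
    by b, so its penalty is beta * n_cooperators * b / (n_players - 1).
    Cooperators are never ahead of anyone, so they incur no penalty.
    """
    penalty = beta * n_cooperators * b / (n_players - 1)
    return b - penalty


def subjective_payoffs(n_players, n_cooperators, a=1.0, b=0.5, beta=0.0):
    """Return (cooperator_payoff, defector_payoff) after the AIA transform."""
    shared_benefit = a * n_cooperators / n_players
    gap = defector_advantage(n_players, n_cooperators, b, beta)
    return shared_benefit, shared_benefit + gap


def min_cooperators_for_cooperation(n_players, b=0.5, beta=0.0):
    """Smallest number of cooperators at which cooperating strictly dominates.

    Returns None if defection dominates for every cooperator count that
    still leaves at least one defector in the population.
    """
    for k in range(n_players):
        if defector_advantage(n_players, k, b, beta) < 0:
            return k
    return None


# Without inequity aversion, defection dominates everywhere.
baseline = min_cooperators_for_cooperation(101, beta=0.0)

# With a strong enough penalty, cooperation dominates once enough
# players already cooperate.
threshold = min_cooperators_for_cooperation(101, beta=2.0)
```

In this toy model cooperation dominates only when the fraction of cooperators exceeds roughly $1/\beta$, which requires $\beta > 1$. That threshold is a property of the assumed parameterization, not a claim about the model analyzed here, but it shows the same qualitative pattern: the transformed payoffs favor cooperation only in the region where several cooperators are already present.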

The primary effect of disadvantageous inequity aversion (DIA) is to lower the payoff to a cooperator. However, it also motivates the cooperator to use the fining tool to reduce the payoffs of other players. There are several simple reasons why defectors might end up being especially targeted. Firstly, the behavior that avoids the policing agent may be cooperative (as in the Harvest game). Secondly, policing agents are motivated to avoid tagging other policers, because of the danger of retaliation.

Assuming that defectors are especially targeted, the cooperator and defector lines become:


with . The transformed short-term payoffs are shown in Figure 8C. Here the Nash equilibrium has moved to a positive number of cooperators. Hence DIA has improved temporal credit assignment. Of course, this argument requires the policing effect to emerge in the first place. This is possible when the earliest learned behavior is defection (Harvest), but not when it is cooperation (Cleanup), explaining the results.

Figure 8: Inequity aversion alters the effective payoffs from cooperation and defection in the short term, in such a way that cooperative behavior is rationally learnable. Hence, inequity aversion helps to solve the intertemporal social dilemma.