
Social Diversity and Social Preferences in Mixed-Motive Reinforcement Learning

by Kevin R. McKee et al.

Recent research on reinforcement learning in pure-conflict and pure-common interest games has emphasized the importance of population heterogeneity. In contrast, studies of reinforcement learning in mixed-motive games have primarily leveraged homogeneous approaches. Given the defining characteristic of mixed-motive games–the imperfect correlation of incentives between group members–we study the effect of population heterogeneity on mixed-motive reinforcement learning. We draw on interdependence theory from social psychology and imbue reinforcement learning agents with Social Value Orientation (SVO), a flexible formalization of preferences over group outcome distributions. We subsequently explore the effects of diversity in SVO on populations of reinforcement learning agents in two mixed-motive Markov games. We demonstrate that heterogeneity in SVO generates meaningful and complex behavioral variation among agents similar to that suggested by interdependence theory. Empirical results in these mixed-motive dilemmas suggest agents trained in heterogeneous populations develop particularly generalized, high-performing policies relative to those trained in homogeneous populations.





1. Introduction

In multi-agent reinforcement learning, the actions of one agent can influence the experience and outcomes for other agents—that is, agents are interdependent. Interdependent interactions can be sorted into two categories based on the alignment of incentives for the agents involved schelling1960strategy:

  1. Pure-motive interactions, in which the group’s incentives are either entirely aligned (pure-common interest) or entirely opposed (pure-conflict),

  2. and mixed-motive interactions, in which the group’s incentives are sometimes aligned and sometimes in conflict.[1]

[1] When Schelling originally introduced the pure- and mixed-motive framework, he explained, “Mixed-motive refers not, of course, to an individual’s lack of clarity about his own preferences but rather to the ambivalence of his relation to the other player—the mixture of mutual dependence and conflict, of partnership and competition” schelling1960strategy.

Examples of the former include games such as Hanabi bard2019hanabi and Go silver2018general. The latter category is typified by games like the Prisoner’s Dilemma tucker1950two; sandholm1996multiagent and the tragedy of the commons hardin1968tragedy; leibo2017multi. This categorical distinction is especially relevant for spatially and temporally extended Markov games. In these games, interdependence emerges both as a direct impact of one agent’s actions on another’s outcomes and as an indirect effect of each agent on the state of the substrate environment in which others co-exist.

Figure 1. Homogeneity and heterogeneity in population-based multi-agent reinforcement learning. (a) Population homogeneity and heterogeneity result in different training experiences for a given agent i. In the homogeneous case, agent policies are either identical or very similar (e.g., due to identical training distributions or shared motivations). In the heterogeneous setting, a given agent i encounters a range of group compositions over time. The variability in policies can stem from agents training under different distributions or with different motivations. (b) Representative examples of previous multi-agent reinforcement learning research. We study the mixed-motive, heterogeneous setting.

In pure-conflict reinforcement learning, self-play solutions for Markov games bansal2017emergent; heinrich2016deep; silver2018general have gradually given way to population-based approaches jaderberg2019human; vinyals2019grandmaster (Figure 1a). A central impetus for this shift has been an interest in ensuring agent performance is robust to opponent heterogeneity (i.e., variation in the set of potential opponent policies). Similarly, recent work on pure-common interest reinforcement learning in Markov games has highlighted the importance of robustness to diverse partner policies amato2013decentralized; carroll2019utility. In both of these contexts, it is desirable to train agents capable of adapting and best responding to a wide range of potential policy sets.

In mixed-motive Markov games, the effects of partner heterogeneity have not received much attention. Most mixed-motive reinforcement learning research has produced policies through self-play lerer2017maintaining; peysakhovich2017consequentialist or co-training policies in fixed groups leibo2017multi; hughes2018inequity. Such methods foster homogeneity in the set of policies each agent encounters.

We aim to introduce policy heterogeneity into mixed-motive reinforcement learning (Figure 1b). In recent years, a growing body of work has explored the effects of furnishing agents in mixed-motive games with various motivations such as inequity aversion hughes2018inequity; wang2019evolving, social imitation eccles2019imitation, and social influence jaques2019social. Thus, a natural starting point for the study of heterogeneity is to explore the effects of diversity in intrinsic motivation singh2004intrinsically. Here we endow our agents with Social Value Orientation (SVO), an intrinsic motivation to prefer certain group reward distributions between self and others.

Figure 2. Interdependence theory. The top pathway depicts a transformation process for a row player who has altruistic preferences. In this case, the outcome transformation directly transfers the column player’s payoff into the effective matrix. The bottom pathway shows a transformation process for a row player with competitive preferences, who finds it rewarding to maximize the distance between their payoff and the column player’s payoff. These two outcome transformations suggest different dominant strategies (highlighted in blue).

Psychology and economics research has repeatedly demonstrated that human groups sustain high levels of cooperation across different games through heterogeneous distributive preferences batson2012history; cooper2016other; eckel1996altruism; rushton1981altruistic; simon1993altruism. A particularly compelling account from interdependence theory holds that humans deviate from game theoretic predictions in economic games because each player acts not on the “given matrix” of a game—which reflects the extrinsic payoffs set forth by the game rules—but on an “effective matrix”, which represents the set of outcomes as subjectively evaluated by that player kelley1978interpersonal. Players receive the given matrix from their environment and subsequently apply an “outcome transformation” reflecting their individual preferences for various outcome distributions. The combination of the given matrix and the outcome transformation form the effective matrix (Figure 2). Though an individual’s social preferences may decrease their given payoff in a single game, groups with diverse sets of preferences are capable of resisting suboptimal Nash equilibria.

In multi-agent reinforcement learning, reward sharing is commonly used to resolve mixed-motive dilemmas hughes2018inequity; peysakhovich2018prosocial; sunehag2018value; wang2019evolving. To date, agent hyperparameters controlling reward mixture have typically been shared. This approach implies a homogeneous population of policies, echoing the representative agent assumption from economics hartley1996retrospectives; kirman1992whom. The continued reliance on shared reward mixtures is especially striking considering that the ability to capture heterogeneity is a key strength of agent-based models over other modeling approaches haldane2019drawing.

Homogeneous populations often fall prey to a peculiar variant of the lazy agent problem sunehag2018value, wherein one or more agents begin ignoring the individual learning task at hand and instead optimize for the shared reward hughes2018inequity. These shared-reward agents shoulder the burden of prosocial work, in a manner invariant to radical shifts in group composition across episodes. This “specialization” represents a failure of training to generate generalized policies.

To investigate the effects of heterogeneity in mixed-motive reinforcement learning, we introduce a novel, generalized mechanism for reward sharing. We derive this reward-sharing mechanism, Social Value Orientation (SVO), from studies of human cooperative behavior in social psychology. We show that across several games, heterogeneous distributions of these social preferences within groups generate more generalized individual policies than do homogeneous distributions. We subsequently explore how heterogeneity in SVO sustains positive group outcomes. In doing so, we demonstrate that this formalization of social preferences leads agents to discover specific prosocial behaviors relevant to each environment.

2. Agents

2.1. Multi-agent reinforcement learning and Markov games

In this work, we consider N-player partially observable Markov games. A partially observable Markov game M is defined on a finite set of states S. The game is endowed with an observation function O : S × {1, …, N} → R^d; a set of available actions for each player, A_1, …, A_N; and a stochastic transition function T : S × A_1 × ⋯ × A_N → Δ(S), which maps from the joint actions taken by the N players to the set of discrete probability distributions over states. From each state, players take a joint action \vec{a} = (a_1, …, a_N).

Each agent i independently experiences the environment and learns a behavior policy π_i(a_i | o_i) based on its own observation o_i = O(s, i) and (scalar) extrinsic reward r_i(s, \vec{a}). Each agent learns a policy which maximizes a long-term γ-discounted payoff defined as:

    V_i(s_0) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t u_i(s_t, \vec{a}_t) \,\Big|\, \vec{a}_t \sim \vec{\pi}(\cdot \mid s_t),\; s_{t+1} \sim T(s_t, \vec{a}_t) \right]    (1)

where u_i is a utility function and \vec{\pi} = (π_1, …, π_N) denotes the joint policy. In standard reinforcement learning, the utility function maps directly to the extrinsic reward provided by the environment.
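For a single sampled trajectory, the discounted payoff above reduces to a weighted sum of per-step utilities. A minimal sketch (the function name and γ = 0.99 default are illustrative, not from the paper):

```python
def discounted_return(utilities, gamma=0.99):
    """Monte-Carlo estimate of the gamma-discounted payoff for one
    sampled trajectory: sum_t gamma^t * u_t."""
    return sum((gamma ** t) * u for t, u in enumerate(utilities))
```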

2.2. Social Value Orientation

Here we introduce Social Value Orientation (SVO), an intrinsic motivation to prefer certain group reward distributions between self and others.

We introduce the concept of a reward angle as a scalar representation of the observed distribution of reward between player i and all other players in the group. The size of the angle formed by these two scalars represents the relative distribution of reward between self and others (Figure 3). Given a group of size n, its corresponding reward vector is R = (r_1, …, r_n). The reward angle for player i is:

    \theta(R) = \arctan\left( \frac{\bar{r}_{-i}}{r_i} \right)    (2)

where \bar{r}_{-i} is a statistic summarizing the rewards of all other group members. We choose the arithmetic mean nisbett1985perception. Note that reward angles are invariant to the scalar magnitude of R.
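The reward angle defined above can be sketched in a few lines, assuming angles in degrees and the arithmetic-mean summary statistic; `atan2` is used so the zero-own-reward case is handled gracefully:

```python
import math

def reward_angle(r_i, others):
    """Scalar reward angle for player i, in degrees: the angle between
    the player's own reward and the mean reward of the other players."""
    r_bar = sum(others) / len(others)  # arithmetic mean of others' rewards
    return math.degrees(math.atan2(r_bar, r_i))
```

Scaling every reward by the same positive constant leaves the angle unchanged, matching the invariance noted above.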

The Social Value Orientation, θ_i^{SVO}, for player i is player i’s target distribution of reward among group members. We use the difference between the observed reward angle and the targeted SVO to calculate an intrinsic reward. Combining this intrinsic reward with the extrinsic reward signal from the environment, we can define the following utility function to be maximized in Eq. (1):

    u_i(s, \vec{a}) = r_i(s, \vec{a}) - w \cdot \left| \theta_i^{SVO} - \theta(R(s, \vec{a})) \right|    (3)

where w is a weight term controlling the effect of Social Value Orientation on u_i.
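A minimal sketch of this utility, under the assumptions that angles are in degrees and the summary statistic is the arithmetic mean (the authors additionally smooth observed rewards over time, which is omitted here):

```python
import math

def svo_utility(r_i, others, theta_svo, w):
    """Extrinsic reward minus a penalty proportional to the angular
    distance between the observed reward angle and the target SVO."""
    r_bar = sum(others) / len(others)
    theta = math.degrees(math.atan2(r_bar, r_i))  # observed reward angle
    return r_i - w * abs(theta_svo - theta)
```

For example, an altruist (target 90°) who harvests alone while others earn nothing pays a large intrinsic penalty, whereas an individualist (target 0°) does not.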

In constructing this utility function, we follow a standard approach in mixed-motive reinforcement learning research hughes2018inequity; jaques2019social; wang2019evolving which provides agents with an overall reward signal combining extrinsic and intrinsic reward singh2004intrinsically. This approach parallels interdependence theory, wherein the effective matrix is formed from the combination of the given matrix and the outcome transformation, and ultimately serves as the basis of actors’ decisions kelley1978interpersonal.

For the exploratory work we detail in the following sections, we restrict our experiments to SVO in the non-negative quadrant (all θ_i^{SVO} ∈ [0°, 90°]). The preference profiles in the non-negative quadrant provide the closest match to parameterizations in previous multi-agent research on reward sharing. Nonetheless, we note that interesting preference profiles exist throughout the entire ring murphy2014social.

Figure 3. Reward angles and the ring formulation of Social Value Orientation (SVO). Reward angles are a scalar representation of the tradeoff between an agent’s own reward and the reward of other agents in the environment. The reward angle an agent prefers is its SVO.

2.3. Algorithm

We deploy advantage actor-critic (A2C) as the learning algorithm for our agents mnih2016asynchronous. A2C maintains both value (critic) and policy (actor) estimates using a deep neural network. The policy is updated according to the REINFORCE policy-gradient method, using a value estimate as a baseline to reduce variance. Our neural network comprises a convolutional layer, a feedforward module, an LSTM with contrastive predictive coding oord2018representation, and linear readouts for policy and value. We apply temporal smoothing to observed rewards within the model’s intrinsic motivation function, as described by hughes2018inequity.

We use a distributed, asynchronous framework for training wang2019evolving. We train populations of N agents with policies {π_1, …, π_N}. For each population, we sample n = 5 players at a time to populate each of 100 arenas running in parallel (see also Figure 1a, in which arenas are represented as “samples” from the agent population). Each arena is an instantiation of a single episode of the environment. Within each arena, the sampled agents play an episode of the environment, after which a new group is sampled. Episode trajectories last 1000 steps and are written to queues for learning. Weights are updated from queues using V-Trace espeholt2018impala.
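The arena-sampling scheme can be sketched as follows. The function name, the without-replacement draw, and the population size of 20 are assumptions for illustration, not details confirmed by the text:

```python
import random

def sample_arenas(population_size, group_size, n_arenas, seed=0):
    """For each parallel arena, draw a group of agent indices from the
    population to play one episode together."""
    rng = random.Random(seed)
    return [rng.sample(range(population_size), group_size)
            for _ in range(n_arenas)]

# 100 parallel arenas, each an independent sample of 5 agents.
arenas = sample_arenas(population_size=20, group_size=5, n_arenas=100)
```

Because group composition is resampled every episode, each agent repeatedly encounters different co-player sets, which is what makes population heterogeneity matter during training.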

3. Mixed-motive games

3.1. Intertemporal social dilemmas

Figure 4. Screenshots of gameplay from (a) HarvestPatch and (b) Cleanup.

For our experiments, we consider two temporally and spatially extended mixed-motive games played with a group size of n = 5: HarvestPatch and Cleanup. These two environments are intertemporal social dilemmas, a particular class of mixed-motive Markov games (cf. littman1994markov).

Intertemporal social dilemmas are group situations which present a tension between short-term individual incentives and the long-term collective interest hughes2018inequity. Each individual has the option of behaving prosocially (cooperation) or selfishly (defection). Though unanimous cooperation generates welfare-maximizing outcomes in the long term, on short timescales the personal benefits of acting selfishly strictly dominate those of prosocial behavior. Thus, though all members of the group prefer the rewards of mutual cooperation, the intertemporal incentive structure pushes groups toward welfare-suppressing equilibria. Previous work has evaluated the game theoretic properties of intertemporal social dilemmas hughes2018inequity.

3.2. HarvestPatch

HarvestPatch is a variant of the common-pool resource appropriation game Harvest hughes2018inequity (Figure 4a). Players are rewarded for collecting apples (reward +1) within a gridworld environment. Apples regrow after being harvested at a rate dependent on the number of unharvested apples within a regrowth radius of 3. If there are no apples within its radius, an apple cannot regrow. At the beginning of each episode, apples are probabilistically spawned in a hex-like pattern of patches, such that each apple is within the regrowth radius of all other apples in its patch and outside of the regrowth radius of apples in all other patches. This creates localized stock and flow properties gardner1990nature for each apple patch. Each patch is irreversibly depleted when all of its apples have been harvested—regardless of how many apples remain in other patches. Players are also able to use a beam to punish other players (reward −50), at a small cost to themselves (reward −1). This enables the possible use of punishment to discourage free-riding henrich2006cooperation; o2008constraining.
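The regrowth rule can be sketched as a per-step probability for each harvested apple cell. The linear dependence and the base rate are assumptions chosen for illustration; only the zero-neighbors case (irreversible patch depletion) is stated by the text:

```python
def regrowth_probability(n_nearby, base_rate=0.01):
    """Per-step probability that a harvested apple regrows, as a
    function of the number of unharvested apples within its regrowth
    radius. With no nearby apples, the cell can never regrow."""
    if n_nearby == 0:
        return 0.0  # patch irreversibly depleted
    return min(1.0, base_rate * n_nearby)
```

This stock-and-flow coupling is what makes the last apple in a patch ("endangered") disproportionately valuable to leave unharvested.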

A group can achieve indefinite sustainable harvesting by abstaining from eating “endangered apples” (apples which are the last unharvested apple remaining in their patch). However, the reward for sustainable harvesting only manifests after a period of regrowth if all players abstain. In contrast, an individual is immediately and unilaterally guaranteed the reward for eating an endangered apple if it acts greedily. This creates a dilemma juxtaposing the short-term individual temptation to maximize reward through unsustainable behavior and the long-term group interest of generating higher reward by acting sustainably.

In HarvestPatch, episodes last 1000 steps. Each agent’s observability is limited to an RGB window centered on its current location. The action space consists of movement, rotation, and use of the punishment beam (8 actions total).

3.3. Cleanup

Cleanup hughes2018inequity is a public goods game (Figure 4b). Players are again rewarded for collecting apples (reward +1) within a gridworld environment. In Cleanup, apples grow in an orchard at a rate inversely related to the cleanliness of a nearby river. The river accumulates pollution with a constant probability over time. Beyond a certain threshold of pollution, the apple growth rate in the orchard drops to zero. Players have an additional action allowing them to clean a small amount of pollution from the river. However, the cleaning action only works on pollution within a small distance in front of the agents, requiring them to physically leave the apple orchard to clean the river. Thus, players maintain the public good of orchard regrowth through effortful contributions. As in HarvestPatch, players are able to use a beam to punish other players (reward −50), at a small cost to themselves (reward −1).

A group can achieve continuous apple growth in the orchard by keeping the pollution levels of the river consistently low over time. However, on short timescales, each player would prefer to collect apples in the orchard while other players provide the public good in the river. This creates a tension between the short-term individual incentive to maximize reward by staying in the orchard and the long-term group interest of maintaining the public good through sustained contributions over time.
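The coupling between river pollution and orchard growth can be sketched as below. The linear shape, threshold, and maximum rate are illustrative assumptions; the text only specifies that growth is inversely related to pollution and is zero beyond a threshold:

```python
def orchard_growth_rate(pollution, threshold=0.5, max_rate=0.05):
    """Apple growth rate in the orchard as a function of the fraction
    of the river that is polluted: zero at or beyond the threshold,
    and increasing as the river gets cleaner."""
    if pollution >= threshold:
        return 0.0
    return max_rate * (1.0 - pollution / threshold)
```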

Episodes last 1000 steps. Agent observability is again limited to an RGB window centered on the agent’s current location. In Cleanup, agents have an additional action for cleaning (9 actions total).

4. Results

4.1. Social diversity and agent generality

We began by training 12 homogeneous populations per task: four consisting of individualistic agents (all θ^{SVO} = 0°), four of prosocial agents (all θ^{SVO} = 45°), and four of altruistic agents (all θ^{SVO} = 90°). These resemble previous approaches using selfishness perolat2017multi, inequity aversion hughes2018inequity; wang2019evolving, and strong reward sharing peysakhovich2018prosocial; sunehag2018value, respectively. The population training curves for homogeneous selfish populations closely resembled group training curves from previous studies perolat2017multi; hughes2018inequity (see sample population training curves in Figure 5). In particular, performance in both environments generated negative returns at the beginning of training due to high-frequency use of the punishment beam. Agents quickly improved performance by learning not to punish one another, but failed to learn cooperative policies. Ultimately, selfish agents were unable to consistently avoid the tragedy of the commons in HarvestPatch or provide public goods in Cleanup.

Figure 5. Episode rewards for homogeneous selfish populations playing (a) HarvestPatch and (b) Cleanup. Each individual line shows a single agent’s return over training.

Optimal hyperparameter values may vary between HarvestPatch and Cleanup. Thus, we selected the weight w for each task by conducting an initial sweep over w with homogeneous populations of altruistic agents (all θ^{SVO} = 90°), choosing in each task the weight that produced the highest collective returns across several runs (Figure 6a for HarvestPatch, Figure 6b for Cleanup).

Figure 6. Equilibrium collective return for homogeneous populations of altruistic agents in (a) HarvestPatch and (b) Cleanup. Closed dots reflect populations in which all agents receive positive returns at equilibrium. Open dots indicate populations in which some agents receive zero or negative reward.

As expected, in HarvestPatch, the highest collective returns among the homogeneous populations were achieved by the altruistic populations (Table 1, Homogeneous row). The prosocial and individualistic populations performed substantially worse. In Cleanup, the highest collective returns similarly emerged among the altruistic populations. The populations of prosocial and individualistic agents, in contrast, achieved near-zero collective returns.

                                  HarvestPatch      Cleanup
Homogeneous
  individualistic                 587.6  (101.7)    -9.9   (11.7)
  prosocial                       665.9  (52.4)     1.1    (2.1)
  altruistic                      1552.7 (248.2)    563.8  (235.2)
Heterogeneous (ascending mean SVO)
  lowest mean                     553.4  (574.6)    -0.1   (5.7)
  ·                               658.7  (107.1)    2.0    (2.4)
  ·                               764.1  (236.3)    6.3    (7.1)
  ·                               860.9  (121.5)    318.5  (335.0)
  highest mean                    1167.9 (232.6)    1938.5 (560.6)

Table 1. Mean collective returns achieved at equilibrium by homogeneous and heterogeneous populations. Standard deviations are reported in parentheses. Heterogeneous rows are ordered by ascending mean population SVO.

We next trained 80 heterogeneous populations per task. To generate each heterogeneous population, we sampled agents’ SVO values from a normal distribution with a specified mean and dispersion. Since we treated SVO as a bounded variable for these initial experiments, we selected five equally spaced values spanning the non-negative quadrant to act as population means, and four equally spaced values to act as population standard deviations. For each mean–standard deviation pair, we generated four populations using different random seeds. We used the same weights w as for the homogeneous populations.
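Population generation can be sketched as follows. Clamping out-of-range draws to the non-negative quadrant (rather than, say, rejection sampling) is an assumption, as is the function name:

```python
import random

def sample_population_svos(mean_deg, sd_deg, n_agents, seed=0):
    """Draw n_agents SVO values (degrees) from a normal distribution
    and clamp them to the bounded non-negative quadrant [0, 90]."""
    rng = random.Random(seed)
    return [min(90.0, max(0.0, rng.gauss(mean_deg, sd_deg)))
            for _ in range(n_agents)]
```

With a standard deviation of zero this reduces to the homogeneous case, so the homogeneous populations can be seen as a special case of the same scheme.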

                                  HarvestPatch   Cleanup
Homogeneous
  individualistic                 0.90 (0.09)    0.16 (0.09)
  prosocial                       0.97 (0.01)    0.54 (0.07)
  altruistic                      0.29 (0.03)    0.41 (0.08)
Heterogeneous (ascending mean SVO)
  lowest mean                     0.90 (0.13)    0.34 (0.09)
  ·                               0.94 (0.06)    0.40 (0.09)
  ·                               0.95 (0.03)    0.38 (0.10)
  ·                               0.91 (0.02)    0.64 (0.21)
  highest mean                    0.76 (0.04)    0.87 (0.08)
Table 2. Mean equality scores achieved at equilibrium by homogeneous and heterogeneous populations. Equality is calculated as the inverse Gini coefficient, with 0 representing all reward being received by a single agent and 1 representing all agents receiving an identical positive reward. Standard deviations are reported in parentheses.
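The equality metric in Table 2 can be computed as one minus the Gini coefficient of agents' episode returns. A minimal sketch (note that with a finite group, full concentration in one agent yields 1/n rather than exactly 0):

```python
def equality(rewards):
    """1 - Gini coefficient: 1.0 for identical positive rewards,
    approaching 0 as reward concentrates in a single agent."""
    n = len(rewards)
    mean = sum(rewards) / n
    if mean == 0:
        return 0.0  # degenerate all-zero case
    # Gini via mean absolute difference over all ordered pairs.
    diff_sum = sum(abs(a - b) for a in rewards for b in rewards)
    gini = diff_sum / (2 * n * n * mean)
    return 1.0 - gini
```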

Among the heterogeneous populations, we observed the highest equilibrium collective returns in the populations with the highest mean SVO (Table 1, Heterogeneous rows). In HarvestPatch, the performance of homogeneous altruistic populations outstripped that of the heterogeneous populations. In Cleanup, the reverse pattern emerged: the highest collective returns among all populations were achieved by the heterogeneous populations with high mean SVO.

We unexpectedly found that homogeneous populations of altruistic agents produced lower equality scores than most other homogeneous and heterogeneous populations (Table 2). Homogeneous, altruistic populations earned relatively high collective returns in both tasks. However, in each case the produced rewards were concentrated in a small proportion of the population. Agents in these homogeneous populations appear to adopt a lazy-agent approach sunehag2018value to resolve the conflict created by the group’s shared preference for selfless reward distributions. To break the symmetry of this dilemma, most agents in the population selflessly support collective action, thereby optimizing for their social preferences. A smaller number of agents then specialize in accepting the generated reward—shouldering the “burden” of being selfish, in contravention of their intrinsic preferences. This result highlights a drawback of using collective return as a performance metric: though collective return is the traditional social outcome metric used in multi-agent reinforcement learning, it can mask high levels of inequality.

We therefore revisited population performance by measuring median return, which incorporates signal concerning both the efficiency and the equality of a group’s outcome distribution blakely2001difference. Median return can help estimate the generality of learned policies within homogeneous and heterogeneous populations. We compare median return for the two population types by measuring the median return for each population after it reaches equilibrium. We conduct a Welch’s t-test and report the resulting t-statistic, degrees of freedom, and p-value. We subsequently provide effect estimates (β) and p-values from linear models regressing median return on the mean and standard deviation of population SVO.
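Welch's unequal-variance t-test can be sketched in pure Python; the degrees of freedom follow the standard Welch–Satterthwaite approximation (the p-value lookup against the t distribution is omitted here):

```python
import math

def welch_t(a, b):
    """Welch's t-statistic and Welch-Satterthwaite degrees of freedom
    for two samples with possibly unequal variances."""
    def mean(x):
        return sum(x) / len(x)
    def var(x):  # unbiased sample variance
        m = mean(x)
        return sum((v - m) ** 2 for v in x) / (len(x) - 1)
    na, nb = len(a), len(b)
    va, vb = var(a) / na, var(b) / nb  # squared standard errors
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df
```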

Figure 7. Equilibrium performance of homogeneous and heterogeneous agent populations in HarvestPatch. (a) Heterogeneous populations enjoyed significantly higher median return at equilibrium. (b) Among heterogeneous populations, the highest median returns emerged among the populations whose SVO distributions have both a high mean and a high standard deviation.
Figure 8. Equilibrium performance of homogeneous and heterogeneous agent populations in Cleanup. (a) Heterogeneous populations received higher median return at equilibrium. (b) Among heterogeneous populations, the highest median returns emerged among the populations with a high mean SVO.

Figures 7 and 8 show the generality of policies trained in HarvestPatch and Cleanup, respectively. In HarvestPatch, heterogeneous populations enjoyed significantly higher median return than homogeneous populations at equilibrium (Figure 7a). Among heterogeneous populations, a clear pattern could be observed: the higher the population mean SVO, the higher the median return received (Figure 7b). Specifically, for populations with high mean SVO, median return appeared to increase slightly when the SVO distribution was more dispersed. When tested with a linear model regressing median return on the mean and standard deviation of SVO, these trends primarily manifested as a significant interaction effect between mean population SVO and the standard deviation of population SVO. In Cleanup, heterogeneous populations likewise received significantly higher median return than homogeneous populations at equilibrium (Figure 8a). Among heterogeneous populations, the highest median returns were observed in tandem with high mean SVO (Figure 8b). However, in this case the interaction between mean SVO and standard deviation of SVO was non-significant.

In summary, our comparison of homogeneous and heterogeneous populations shows that populations of altruists performed effectively in traditional terms (collective return). However, these populations produced highly specialized agents, resulting in undesirably low equality metrics. Populations with diverse SVO distributions were able to circumvent this symmetry-breaking problem and achieve high levels of median return in HarvestPatch and Cleanup.

4.2. Social preferences and prosocial behavior

Figure 9. Correlation between target and observed reward angles among SVO agents in (a) HarvestPatch and (b) Cleanup. The higher an agent’s SVO, the higher the reward angles it tended to observe.

How exactly does the distribution of SVO help diverse populations resolve these social dilemmas? We next evaluated the behavioral effects of SVO by examining a single, heterogeneous population within each task. For each task, we randomly selected one population that achieved high equilibrium performance during training from among those parameterized with a high mean SVO and a dispersed SVO distribution. We gathered data from 100 episodes of play for both of these evaluation experiments, sampling 5 agents randomly for each episode. All regressions reported in this section are mixed error-component models, incorporating a random effect to account for the repeated sampling of individual agents. The accompanying figures depict average values per agent, with superimposed regression lines representing the fixed-effect estimate (β) of SVO.

In our evaluation experiments, we observed a positive relationship between an agent’s target reward angle and the group reward angles it tended to observe in HarvestPatch (Figure 9a). The effect of SVO on observed reward angle was similarly significant in Cleanup (Figure 9b). This reflects the association of higher agent SVO with the realization of more-prosocial distributions. In both environments, the estimated effect lies below the identity line, indicating that agents acted somewhat more selfishly than their SVO would suggest.

In HarvestPatch, an agent’s prosociality can be estimated by measuring its abstention from consuming endangered apples. We calculated abstention as an episode-level metric incorporating the number of endangered apples an agent consumed and a normalization factor encoding at what points in the episode the endangered apples were consumed. An abstention score of 1 indicates that an agent did not eat a single endangered apple (or that it ate one or more endangered apples on the final step of the episode). An abstention score of 0, though not technically achievable, would indicate that an agent consumed one endangered apple from every apple patch in the environment on the first step of the episode. We observe a significant and positive relationship between an agent’s SVO and its abstention, , (Figure 10).

Figure 10. SVO and prosocial behavior in HarvestPatch. Agents with higher SVOs were significantly more likely to abstain from depleting local resource stocks. Here the yellow agent faces the choice of consuming an endangered apple for immediate reward or abstaining and traveling to a different patch.

The structure of the HarvestPatch environment creates localized stock and flow components. Hosting too many agents in a single patch threatens to quickly deplete the local resource pool. Thus, one rule groups can use to maintain sustainability is for group members to harvest in separate patches, rather than harvesting together and sequentially destroying the environment’s apple patches. We find that SVO correlated positively with the distance an agent maintained from other group members (Figure 11). Consequently, groups with higher mean SVO established stronger conventions of interagent distance. This simple behavioral convention helped higher-SVO groups guard against environmental collapse.

Figure 11. SVO and prosocial conventions in HarvestPatch. The higher an agent’s SVO, the more distance it tended to maintain from other agents in its environment. Here the teal agent is maintaining a particularly high interagent distance, allowing it to sustainably harvest from a single patch.

In Cleanup, an agent’s prosociality can be estimated by measuring the amount of pollution it cleans from the river. There was a significant and positive relationship between an agent’s SVO and the amount of pollution it cleaned, , (Figure 12). Agents with higher SVOs acted more prosocially by making greater contributions to the public good.

Figure 12. SVO and prosocial behavior in Cleanup. Agents with higher SVOs cleaned a significantly greater amount of pollution per episode than did peers with low SVO. Here the pink agent is actively cleaning two cells of pollution from the river. The yellow agent is using its cleaning action outside of the river, which does not affect its contribution score.

Finally, do SVO agents develop any prosocial conventions in Cleanup that help maintain high levels of river cleanliness? We examined one potential coordinating convention that we term behavioral preparedness: an inclination to transition from harvesting to cleaning even before the orchard is fully depleted. Groups that follow the short-term, individual-level incentive structure will respond primarily to the depletion of the orchard, rather than acting preventatively to ensure the public good is sustained over time. Groups that adopt welfare-maximizing strategies, on the other hand, will not wait for the orchard to be fully harvested before cleaning the river. We find a positive relationship between an agent’s SVO and the average number of apples observable to it at the times of its transitions to cleaning in each episode. The size and significance of this effect are not meaningfully affected by controlling for the number of times each agent transitioned to the river in a given episode (Figure 13). In aggregate, this behavioral pattern helped high-SVO groups maintain higher levels of orchard regrowth over time.
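
The behavioral-preparedness measure can be sketched as the mean number of apples in an agent's view at the steps where it switches from harvesting to cleaning. The function below is an illustrative reconstruction; its name and inputs are assumptions:

```python
def preparedness(transition_steps, apples_in_view):
    # apples_in_view[t]: number of apples visible to the agent at step t.
    # transition_steps: steps at which the agent switched from harvesting
    # to cleaning. Returns the mean number of apples visible at those steps.
    visible = [apples_in_view[t] for t in transition_steps]
    return sum(visible) / len(visible)
```

A higher value means the agent tends to head to the river while the orchard still holds apples, i.e., it acts preventatively rather than reactively.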

Figure 13. SVO and prosocial conventions in Cleanup. Agents with higher SVOs were significantly more likely to enter the river while there were unharvested apples within view. Here the magenta agent is transitioning to clean the river, even though it can observe multiple unharvested apples in the orchard.

5. Discussion

Recent research on pure-conflict and pure-cooperation reinforcement learning has highlighted the importance of developing robustness to diversity in opponent and partner policies carroll2019utility; jaderberg2019human; vinyals2019grandmaster. We extend this argument to the mixed-motive setting, focusing in particular on the effects of heterogeneity in social preferences. Drawing from interdependence theory, we endow agents with Social Value Orientation (SVO), a flexible formulation for reward sharing among group members.
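
A common angle-based formalization of SVO can be sketched as a reward transform that weights an agent's own reward against the mean reward of its group; the exact parameterization used by the agents here may differ:

```python
import math

def svo_reward(own_reward, others_rewards, theta):
    # theta is the SVO angle in radians: 0 corresponds to a purely selfish
    # agent, pi/2 to a purely altruistic one, and pi/4 weights own and
    # group reward equally.
    group_reward = sum(others_rewards) / len(others_rewards)
    return math.cos(theta) * own_reward + math.sin(theta) * group_reward
```

Sampling theta from different distributions across a population is one way to instantiate the homogeneous and heterogeneous conditions compared in this work.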

In the mixed-motive games considered here, homogeneous populations of pure altruists achieved high collective returns. However, these populations tended to produce hyper-specialized agents that reaped reward primarily from either intrinsic or extrinsic motivation, rather than both, thereby breaking the symmetry of the shared motivation structure. Thus, when equality-sensitive metrics are considered, populations with diverse distributions of SVO values were able to outperform homogeneous populations.

This pattern echoes the historic observation from interdependence theory that, if both players in a two-player matrix game adopt a “maximize other’s outcome” transformation process, the resulting effective matrices often produce deficient group outcomes:

“It must be noted first that, in a number of matrices with a mutual interest in prosocial transformations, if one person acts according to such a transformation, the other is better off by acting according to his own given outcomes than by adopting a similar transformation.” kelley1978interpersonal

This quote highlights a striking parallel between our findings and the predictions of interdependence theory. We believe this is indicative of a broader overlap in perspective and interests between multi-agent reinforcement learning and the social-behavioral sciences. Here we capitalize on this overlap, drawing inspiration from social psychology to formalize a general mechanism for reward sharing. Moving forward, SVO agents can be leveraged as a modeling tool for social psychology research morrison1999models.

In this vein, group formation is a topic important to both fields. It is well established among psychologists that an individual’s behavior is strongly guided by their ingroup, the group with which they psychologically identify de2010social. However, the processes by which individuals form group identities are still being investigated brewer1979group; turner2010social. What sorts of mechanisms transform and redefine self-interest to incorporate the interests of a broader group? This line of inquiry has potential linkages to the study of team and coalition formation in multi-agent research shenoy1979coalition.

Our findings show that in multi-agent environments, heterogeneous distributions of SVO can generate high levels of population performance. A natural question follows from these results: how can we identify optimal SVO distributions for a given environment? Evolutionary approaches to reinforcement learning jaderberg2017population could be applied to study the variation in optimal distributions of SVO across individual environments. We note that our results mirror findings from evolutionary biology that across-individual genetic diversity can produce group-wide benefits nonacs2007social. We suspect that SVO agents can be leveraged in silico to study open questions concerning the emergence and adaptiveness of human altruism bowles2006group; mitteldorf2000population.

The development of human-compatible agents still faces major challenges amershi2014power; amershi2019guidelines; ishowo2019behavioural. In pure-common interest reinforcement learning, robustness to partner heterogeneity is seen as an important step toward human compatibility carroll2019utility. The same holds true for mixed-motive contexts. Within “hybrid systems” containing humans and artificial agents christakis2019blueprint, agents should be able to predict and respond to a range of potential partner behaviors. Social preferences are, of course, an important determinant of human behavior balliet2009social; kelley1978interpersonal. Endowing agents with SVO is a promising path forward for training diverse agent populations, expanding the capacity of agents to adapt to human behavior, and fostering positive human-agent interdependence.