Log In Sign Up

Evolving intrinsic motivations for altruistic behavior

by   Jane X Wang, et al.

Multi-agent cooperation is an important feature of the natural world. Many tasks involve individual incentives that are misaligned with the common good, yet a wide range of organisms from bacteria to insects and humans are able to overcome their differences and collaborate. Therefore, the emergence of cooperative behavior amongst self-interested individuals is an important question for the fields of multi-agent reinforcement learning (MARL) and evolutionary theory. Here, we study a particular class of multi-agent problems called intertemporal social dilemmas (ISDs), where the conflict between the individual and the group is particularly sharp. By combining MARL with appropriately structured natural selection, we demonstrate that individual inductive biases for cooperation can be learned in a model-free way. To achieve this, we introduce an innovative modular architecture for deep reinforcement learning agents which supports multi-level selection. We present results in two challenging environments, and interpret these in the context of cultural and ecological evolution.


page 2

page 5

page 7


Improved cooperation by balancing exploration and exploitation in intertemporal social dilemma tasks

When an individual's behavior has rational characteristics, this may lea...

Escaping the State of Nature: A Hobbesian Approach to Cooperation in Multi-agent Reinforcement Learning

Cooperation is a phenomenon that has been widely studied across many dif...

EgoMap: Projective mapping and structured egocentric memory for Deep RL

Tasks involving localization, memorization and planning in partially obs...

Modelling Cooperation in Network Games with Spatio-Temporal Complexity

The real world is awash with multi-agent problems that require collectiv...

MARLeME: A Multi-Agent Reinforcement Learning Model Extraction Library

Multi-Agent Reinforcement Learning (MARL) encompasses a powerful class o...

1. Introduction

Nature shows a substantial amount of cooperation at all scales, from microscopic interactions of genomes and bacteria to species-wide societies of insects and humans (Maynard Smith and Szathmary, 1997). This is in spite of natural selection pushing for short-term individual selfish interests (Darwin, 1859). In its purest form, altruism can be favored by selection when cooperating individuals preferentially interact with other cooperators, thus realising the rewards of cooperation without being exploited by defectors (Hamilton, 1964a, b; Dawkins, 1976; Santos et al., 2006; Fletcher and Doebeli, 2009). However, many other possibilities exist, including kin selection, reciprocity and group selection (Nowak, 2006; Úbeda and Duéñez-Guzmán, 2011; Trivers, 1971; Nowak and Sigmund, 2005; Wilson, 1975; Smith, 1964).

Lately the emergence of cooperation among self-interested agents has become an important topic in multi-agent deep reinforcement learning (MARL). Leibo et al. (2017) and Hughes et al. (2018) formalize the problem domain as an intertemporal social dilemma (ISD), which generalizes matrix game social dilemmas to Markov settings. Social dilemmas are characterized by a trade-off between collective welfare and individual utility. As predicted by evolutionary theory, self-interested reinforcement-learning agents are typically unable to achieve the collectively optimal outcome, converging instead to defecting strategies (Leibo et al., 2017; Pérolat et al., 2017). The goal is to find multi-agent training regimes in which individuals resolve social dilemmas, i.e., cooperation emerges.

Previous work has found several solutions, belonging to three broad categories: 1) opponent modelling (Foerster et al., 2017; Kleiman-Weiner et al., 2016), 2) long-term planning using perfect knowledge of the game’s rules (Lerer and Peysakhovich, 2017; Peysakhovich and Lerer, 2018) and 3) a specific intrinsic motivation function drawn from behavioral economics (Hughes et al., 2018)

. These hand-crafted approaches run at odds with more recent end-to-end model-free learning algorithms, which have been shown to have a greater ability to generalize (e.g.

(Espeholt et al., 2018)

). We propose that evolution can be applied to remove the hand-crafting of intrinsic motivation, similar to other applications of evolution in deep learning.

Evolution has been used to optimize single-agent hyperparameters

(Jaderberg et al., 2017), implement black-box optimization (Wierstra et al., 2008), and to evolve neuroarchitectures (Miller et al., 1989; Stanley and Miikkulainen, 2002), regularization (Chan et al., 2002)

, loss functions

(Jaderberg et al., 2018; Houthooft et al., 2018), behavioral diversity (Conti et al., 2017), and entire reward functions (Singh et al., 2009, 2010). These principles tend to be driven by single-agent search and optimization or competitive multi-agent tasks. Therefore there is no guarantee of success when applying them in the ISD setting. More closely related to our domain are evolutionary simulations of predator-prey dynamics (Yong and Miikkulainen, 2001)

, which used enforced subpopulations to evolve populations of neurons which are sampled to form the hidden layer of a neural network.

111See also Potter and Jong (2000) and Panait and Luke (2005) for reviews of other evolutionary approaches to cooperative multi-agent problems.

To address the specific challenges of ISDs, the system we propose distinguishes between optimization processes that unfold over two distinct time-scales: (1) the fast time-scale of learning and (2) the slow time-scale of evolution (similar to Hinton and Nowlan, 1987). In the former, individual agents repeatedly participate in an intertemporal social dilemma using a fixed intrinsic motivation. In the latter, that motivation is itself subject to natural selection in a population. We model this intrinsic motivation as an additional additive term in the reward of each agent (Chentanez et al., 2005)

. We implement the intrinsic reward function as a two-layer fully-connected feed-forward neural network, whose weights define the genotype for evolution. We propose that evolution can help mitigate this intertemporal dilemma by bridging between these two timescales via an intrinsic reward function.

Evolutionary theory predicts that evolving individual intrinsic reward weights across a population who interact uniformly at random does not lead to altruistic behavior (Axelrod and Hamilton, 1981). Thus, to achieve our goal, we must structure the evolutionary dynamics (Nowak, 2006). We first implement a “Greenbeard” strategy (Dawkins, 1976; Jansen and van Baalen, 2006) in which agents choose interaction partners based on an honest, real-time signal of cooperativeness. We term this process assortative matchmaking. Although there is ecological evidence of assortative matchmaking (Keller and G. Ross, 1998), it cannot explain cooperation in all taxa (Grafen et al., 1990; Henrich, 2004; Gardner and West, 2010). Moreover it isn’t a general method for multi-agent reinforcement learning, since honest signals of cooperativeness are not normally observable in the ISD models typically studied in deep reinforcement learning.

Figure 1. Screenshots from (a) the Cleanup game, (b) the Harvest game. The size of the agent-centered observation window is shown in (b). The same size observation was used in all experiments.

To address the limitations of the assortative matchmaking approach, we introduce an alternative modular training scheme loosely inspired by ideas from the theory of multi-level (group) selection (Wilson, 1975; Henrich, 2004), which we term shared reward network evolution. Here, agents are composed of two neural network modules: a policy network and a reward network. On the fast timescale of reinforcement learning, the policy network is trained using the modified rewards specified by the reward network. On the slow timescale of evolution, the policy network and reward network modules evolve separately from one another. In each episode every agent has a distinct policy network but the same reward network. As before, the fitness for the policy network is the individual’s reward. In contrast, the fitness for the reward network is the collective return for the entire group of co-players. In terms of multi-level selection theory, the policy networks are the lower level units of evolution and the reward networks are the higher level units. Evolving the two modules separately in this manner prevents evolved reward networks from overfitting to specific policies. This evolutionary paradigm not only resolves difficult ISDs without handcrafting but also points to a potential mechanism for the evolutionary origin of social inductive biases.

2. Methods

We varied and explored different combinations of parameters, namely: (1) environments {Harvest, Cleanup}, (2) reward network features {prospective, retrospective}, (3) matchmaking {random, assortative}, and (4) reward network evolution {individual, shared, none}. We describe these in the following sections.

2.1. Intertemporal social dilemmas

In this paper, we consider Markov games (Littman, 1994) within a MARL setting. Specifically we study intertemporal social dilemmas (Leibo et al., 2017; Hughes et al., 2018), defined as games in which individually selfish actions produce individual benefit on short timescales but have negative impacts on the group over a longer time horizon. This conflict between the two timescales characterizes the intertemporal nature of these games. The tension between individual and group-level rationality identifies them as social dilemmas (e.g. the famous Prisoner’s Dilemma).

We consider two dilemmas, each implemented as a partially observable Markov game on a 2D grid (see Figure 1). In the Cleanup game, agents tried to collect apples (reward ) that spawned in a field at a rate inversely related to the cleanliness of a geographically separate aquifer. Over time, this aquifer filled up with waste, lowering the respawn rate of apples linearly, until a critical point past which no apples could spawn. Episodes were initialized with no apples present and zero spawning, thus necessitating cleaning. The dilemma occurred because in order for apples to spawn, agents must leave the apple field and clean, which conferred no reward. However if all agents declined to clean (defect), then no rewards would be received by any. In the Harvest game, again agents collected rewarding apples. The apple spawn rate at a particular point on the map depended on the number of nearby apples, falling to zero once there were no apples in a certain radius. There is a dilemma between the short-term individual temptation to harvest all the apples quickly and the consequential rapid depletion of apples, leading to a lower total yield for the group in the long-term.

All episodes last 1000 steps, and the total size of the playable area is 2518 for Cleanup and 3616 for Harvest. Games are partially observable in that agents can only observe via a 1515 RGB window, centered on their current location. The action space consists of moving left, right, up, and down, rotating left and right, and the ability to tag each other. This action has a reward cost of 1 to use, and causes the player tagged to lose 50 reward points, thus allowing for the possibility of punishing free-riders (Oliver, 1980; Gürerk et al., 2006). The Cleanup game has an additional action for cleaning waste.

Figure 2. (a) Agent adjusts policy using off-policy importance weighted actor-critic (V-Trace) (Espeholt et al., 2018)

by sampling from a queue with (possibly stale) trajectories recorded from 500 actors acting in parallel arenas. (b) The architecture includes a visual encoder (1-layer convolutional neural net with 6 3x3 filters, stride 1, followed by two fully-connected layers with 32 units each), intrinsic and extrinsic value heads, a policy head, and a long-short term memory (LSTM, with 128 hidden units), which takes last intrinsic and extrinsic rewards and last action as input. The reward network weights are evolved based on total episode return.

2.2. Modeling social preferences as intrinsic motivations

In our model, there are three components to the reward that enter into agents’ loss functions (1) total reward, which is used for the policy loss, (2) extrinsic reward, which is used for the extrinsic value function loss and (3) intrinsic reward, which is used for the intrinsic value function loss.

The total reward for player is the sum of the extrinsic reward and an intrinsic reward as follows:


The extrinsic reward is the environment reward obtained by player when it takes action from state , sometimes also written with a time index . The intrinsic reward is an aggregate social preference across features and is calculated according to the formula,



is the ReLU activation function, and

are the parameters of a 2-layer neural network with 2 hidden nodes. These parameters are evolved based on fitness (see Section 2.3). The elements of can be seen to approximately correspond to a linear combination of the coefficients related to advantagenous and disadvantagenous inequity aversion mentioned in Hughes et al. (2018), which were found via grid search in this previous work, but are here evolved.

The feature vector

is a player-specific quantity that other agents can transform into intrinsic reward via their reward network. Each agent has access to the same set of features, with the exception that its own feature is demarcated specially. The features themselves are a function of recently received or expected future (extrinsic) reward for each agent. In Markov games the rewards received by different players may not be aligned in time. Thus, any model of social preferences should not be overly influenced by the precise temporal alignment of different players’ rewards. Intuitively, they ought to depend on comparing temporally averaged reward estimates between players, rather than instantaneous values. Therefore, we considered two different ways of temporally aggregating the rewards.

Figure 3. (a) Agents assigned and evolved with individual reward networks. (b) Assortative matchmaking, which preferentially plays cooperators with other cooperators and defectors with other defectors. (c) A single reward network is sampled from the population and assigned to all players, while 5 policy networks are sampled and assigned to the 5 players individually. After the episode, policy networks evolve according to individual player returns, while reward networks evolve according to aggregate returns over all players.

The retrospective method derives intrinsic reward from whether an agent judges that other agents have been actually (extrinsically) rewarded in the recent past. The prospective variant derives intrinsic reward from whether other agents are expecting to be (extrinsically) rewarded in the near future.222Our terms prospective and retrospective map onto the terms intentional and consequentialist respectively as used by Lerer and Peysakhovich (2017); Peysakhovich and Lerer (2018). For the retrospective variant, , where the temporally decayed reward for the agents are updated at each timestep according to


and . The prospective variant uses the value estimates for and has a stop-gradient before the reward network module so that gradients don’t flow back into other agents (as in for example DIAL from (Foerster et al., 2016)).

2.3. Architecture and Training

We used the same training framework as in Jaderberg et al. (2018), which performs distributed asynchronous training in multi-agent environments, including population-based training (PBT) (Jaderberg et al., 2017). We trained a population of agents333Similar to as in (Espeholt et al., 2018) we distinguish between an "agent" which acts in the environment according to some policy, and a "learner" which updates the parameters of a policy. In principle, a single agent’s policy may depend on parameters updated by several separate learners. with policies , from which we sampled players in order to populate each of arenas running in parallel. Within each arena, an episode of the environment was played with the sampled agents, before resampling new ones. Agents were sampled using one of two matchmaking processes (described in more detail below). Episode trajectories lasted 1000 steps and were written to queues for learning, from which weights were updated using V-Trace (Figure 2(a)).

The set of weights evolved included learning rate, entropy cost weight, and reward network weights 444We can imagine that the reward weights are simply another set of optimization hyperparameters since they enter into the loss.. The parameters of the policy network were inherited in a Lamarckian fashion as in (Jaderberg et al., 2017). Furthermore, we allowed agents to observe their last actions , last intrinsic rewards (), and last extrinsic rewards () as input to the LSTM in the agent’s neural network.

The objective function was identical to that presented in Espeholt et al. (2018) and comprised three components: (1) the value function gradient, (2) policy gradient, and (3) entropy regularization, weighted according to hyperparameters baseline cost and entropy cost (see Figure 2b).

Evolution was based on a fitness measure calculated as a moving average of total episode return, which was a sum of apples collected minus penalties due to tagging, smoothed as follows:


where and is the return obtained on episode by agent (or reward network in the case of the shared reward network evolution (see Section 2.5 for details).

Training was done via joint optimization of network parameters via SGD and hyperparameters/reward network parameters via evolution in the standard PBT setup. Gradient updates were applied for every trajectory up to a maximum length of 100 steps, using a batch size of 32. Optimization was via RMSProp with epsilon=

, momentum=0, decay rate=0.99, and an RL discount factor of 0.99. The baseline cost weight (see Mnih et al. (2016)) was fixed at 0.25, and the entropy cost was sampled from LogUniform(,0.01) and evolved throughout training using PBT. The learning rates were all initially set to and then allowed to evolve.

PBT uses evolution (specifically genetic algorithms) to search over a space of hyperparameters rather than manually tuning or performing a random search, resulting in an adaptive schedule of hyperparameters and joint optimization with network parameters learned through gradient descent

Jaderberg et al. (2017).

There was a mutation rate of when evolving hyperparameters, using either multiplicative perturbations of for entropy cost and learning rate, and additive perturbation of for reward network parameters. We implemented a burn-in period for evolution of agent steps, to allow network parameters and hyperparameters to be used in enough episodes for an accurate assessment of fitness before evolution.

2.4. Random vs. assortative matchmaking

Matches were determined according to two methods: (1) random matchmaking and (2) assortative matchmaking. Random matchmaking simply selected uniformly at random from the pool of agents to populate the game, while cooperative matchmaking first ranked agents within the pool according to a metric of recent cooperativeness, and then grouped agents such that players of similar rank played with each other. This ensured that highly cooperative agents played only with other cooperative agents, while defecting agents played only with other defectors. For Cleanup, cooperativeness was calculated based on the amount of steps in the last episode the agent chose to clean. For Harvest, it was calculated based on the difference between the the agent’s return and the mean return of all players, so that having less return than average yielded a high cooperativeness ranking. Cooperative metric-based matchmaking was only done with either individual reward networks or no reward networks (Figure 3b). We did not use cooperative metric-based matchmaking for our multi-level selection model, since these are theoretically separate approaches.

Figure 4. Total episode rewards, aggregated over players. (a), (b): Comparing retrospective (backward-looking) reward evolution with assortative matchmaking and PBT-only baseline in (a) Cleanup and (b) Harvest. (c), (d): Comparing prospective (forward-looking) with retrospective (backward-looking) reward evolution in (c) Cleanup and (d) Harvest. The black dotted line indicates performance from Hughes et al. (2018)

. The shaded region shows standard error of the mean, taken over the population of agents.

2.5. Individual vs. shared reward networks

Building on previous work that evolved either the intrinsic reward (Jaderberg et al., 2018) or the entire loss function (Houthooft et al., 2018), we separately evolved the reward network within its own population, thereby allowing different modules of the agent to compete only with like components. This allowed for independent exploration of hyperparameters via separate credit assignment of fitness and thus considerably more of the hyperparameter landscape could be explored compared with using only a single pool. In addition, reward networks could be randomly assigned to any policy network, and so were forced to generalize to a wide range of policies. In a given episode, separate policy networks were paired with the same reward network, which we term a shared reward network. In line with (Jaderberg et al., 2017), the fitness determining the copying of policy network weights and evolution of optimization-related hyperparameters (entropy cost and learning rate) were based on individual agent return. By contrast, the reward network parameters were evolved according to fitness based on total episode return across the group of co-players (Figure 3(c)).

This contribution is distinct from previous work which evolved intrinsic rewards (e.g. Jaderberg et al., 2018) because (1) we evolve over social features rather than a remapping of environmental events, and (2) reward network evolution is motivated by dealing with the inherent tension in ISDs, rather than merely providing a denser reward signal. In this sense it’s closer to evolving a form of communication for social cooperation, rather than learning reward-shaping in a sparse-reward environment. We allow for multiple agents to share the same components, and as we shall see, in a social setting, this winds up being critical. Shared reward networks provide a biologically principled method that mixes between group fitness on a long timescale and individual reward on a short timescale. This contrasts with hand-crafted means of aggregation, as in previous work (Chang et al., 2004; Mataric, 1994).

3. Results

As shown in Figure 4, PBT without using an intrinsic reward network performs poorly on both games, where it asymptotes to 0 total episode reward in Cleanup and 400 for Harvest (the number of apples gained if all agents collect as quickly as they can).

Figures 4(a) and (b) compare random and assortative matchmaking with PBT and reward networks using retrospective social features. When using random matchmaking, individual reward network agents perform no better than PBT at Cleanup, and only moderately better at Harvest. Hence there is little benefit to adding reward networks over social features if players have separate networks, as these tend to be evolved selfishly. The assortative matchmaking experiments used either no reward network ( = 0) or individual reward networks. Without a reward network, performance was the same as the PBT baseline. With individual reward networks, performance was very high, indicating that both conditioning the internal rewards on social features and a preference for cooperative agents to play together were key to resolving the dilemma. On the other hand, shared reward network agents perform as well as assortative matchmaking and the handcrafted inequity aversion intrinsic reward from (Hughes et al., 2018), even using random matchmaking. This implies that agents didn’t necessarily need to have immediate access to honest signals of other agents’ cooperativeness to resolve the dilemma; it was enough to simply have the same intrinsic reward function, evolved according to collective episode return. Videos comparing performance of the PBT baseline with the retrospective variant of shared reward network evolution can be found at and

Figures 4(c) and (d) compare the retrospective and prospective variants of reward network evolution. The prospective variant, although better than PBT when using a shared reward network, generally results in worse performance and more instability. This is likely because the prospective variant depends on agents learning good value estimates before the reward networks become useful, whereas the retrospective variant only depends on environmentally provided reward and thus does not suffer from this issue. Interestingly, we observed that the prospective variant does achieve very high performance if gradients are allowed to pass between agents via the value estimates (data not shown); however, this constitutes centralized learning, albeit with decentralized execution (see Foerster et al. (2016)). Such approaches are promising but less consistent with biologically plausible mechanisms of multi-agent learning which are of interest here and so were not pursued.

Figure 5. Social outcome metrics for (a) Cleanup and (b) Harvest. Top: equality, middle: total amount of tagging, bottom: sustainability. The shaded region shows the standard error of the mean.

We next plot various social outcome metrics in order to better capture the complexities of agent behavior (see Figure 5). Equality is calculated as , where is the Gini coefficient over individual returns. Figure 5(b) demonstrates that, in Harvest, having the prospective version of reward networks tends to lead to lower equality, while the retrospective variant has very high equality. Equality in Cleanup is more unstable throughout training, but tends to be lower overall than for Harvest, even when performance is high, indicating that equality might be harder to achieve in different games.

Tagging measures the average number of times a player fined another player throughout the episode. The middle panel of Figure 5b shows that there is a higher propensity for tagging in Harvest when using either a prospective reward network or an individual reward network, compared to the retrospective shared reward network. This explains the performance shown in Figure 4. Tagging in the Cleanup task is overall much lower than in Harvest.

Sustainability measures the average time step on which agents received positive reward, averaged over the episode and over agents. We see in the bottom panel of 5b that having no reward network results in players collecting apples extremely quickly in Harvest, compared with much more sustainable behavior with reward networks. In Cleanup, the sustainability metric is not meaningful and so this was not plotted.

Finally, we can directly examine the weights of the final retrospective shared reward networks which were best at resolving the ISDs. Interestingly, the final weights evolved in the second layer suggest that resolving each game might require a different set of social preferences. In Cleanup, one of the final layer weights evolved to be close to , whereas in Harvest, and evolved to be of large magnitude but opposite sign. We can see a similar pattern with the biases . We interpret this to mean that Cleanup required a less complex reward network: it was enough to simply find other agents’ being rewarded as intrinsically rewarding. In Harvest, however, a more complex reward function was perhaps needed in order to ensure that other agents were not over-exploiting the apples. We found that the first layer weights tended to take on arbitrary (but positive) values. This is because of random matchmaking: co-players were randomly selected and thus there was little evolutionary pressure to specialize these weights.

Figure 6. Distribution of layer 2 weights and biases of evolved retrospective shared reward network at training steps for (a) Cleanup, and (b) Harvest.

4. Discussion

Real environments don’t provide scalar reward signals to learn from. Instead, organisms have developed various internal drives based on either primary or secondary goals (Baldassarre and Mirolli, 2013). Here we examined intrinsic rewards based on features derived from other agents in the environment, in order to establish whether such social signals could enable the evolution of altruism to solve intertemporal social dilemmas. In accord with evolutionary theory (Axelrod and Hamilton, 1981; Nowak, 2006), we found that naïvely implementing natural selection via genetic algorithms did not lead to the emergence of cooperation. Furthermore, assortative matchmaking was sufficient to generate cooperative behavior in cases where honest signals were available. Finally, we proposed a new multi-level evolutionary paradigm based on shared reward networks that achieves cooperation in more general situations.

We demonstrated that the reward network weights evolved differently for Cleanup versus Harvest, indicating that the two tasks necessitate different forms of social cooperation for optimal performance. This highlights the advantage of evolving rather than hand-crafting the weighting between individual reward and group reward, as optimal weightings cannot necessarily be anticipated for all environments. Evolving such weightings thus constitutes a form of meta-learning, wherein an entire learning system, including intrinsic reward functions, is optimized for fast learning Singh et al. (2010); Fernando et al. (2018). Here we have extended these ideas to the multi-agent domain.

Why does evolving intrinsic social preferences promote cooperation? Firstly, evolution ameliorates the intertemporal choice problem by distilling the long timescale of collective fitness into the short timescale of individual reinforcement learning, thereby improving credit assignment between selfish acts and their temporally displaced negative group outcomes (Hughes et al., 2018). Secondly, it mitigates the social dilemma itself by allowing evolution to expose social signals that correlate with, for example, an agent’s current level of selfishness. Such information powers a range of mechanisms for achieving mutual cooperation like competitive altruism (Hardy and Van Vugt, 2006), other-regarding preferences (Cooper and Kagel, 2016), and inequity aversion (Fehr and Schmidt, 1999). In accord, laboratory experiments show that humans cooperate more readily when they can communicate (Ostrom et al., 1992; Janssen et al., 2010).

The shared reward network evolution model was inspired by multi-level selection; yet it does not correspond to the prototypical case of that theory since its lower level units of evolution (the policy networks) are constantly swapping which higher level unit (reward network) they are paired with. Nevertheless, there are a variety of ways in which we see this form of modularity arise in nature. For example, free-living microorganisms occasionally form multi-cellular structures to solve a higher order adaptive problem, like slime mold forming a spore-producing stalk for dispersal (West et al., 2006), and many prokaryotes can incorporate plasmids (modules) found in their environment or received from other individuals as functional parts of their genome, thereby achieving cooperation in social dilemmas (Griffin et al., 2004; McGinty et al., 2011). Alternatively, in humans a reward network may represent a shared “cultural norm”, with its fitness based on cultural information accumulated from the groups in which it holds sway. In this way, the spread of norms can occur independently of the success of individual agents (Boyd and Richerson, 2009).

Note that in this work, we have assumed that agents have perfect knowledge of other agents’ rewards, while in real-world systems this is not typically the case. This assumption was made in order to disentangle the effects of cultural evolution from the quality of the signals being evolved over. Natural next steps include adding partial observability to this signal (to make it more analogous to, for instance, a smile/frown or other locally observable social signals), and to add noise or even deception.

The approach outlined here opens avenues for investigating alternative evolutionary mechanisms for the emergence of cooperation, such as kin selection (Griffin and West, 2002) and reciprocity (Trivers, 1971). It would be interesting to see whether these lead to different weights in a reward network, potentially hinting at the evolutionary origins of different social biases. Along these lines, one might consider studying an emergent version of the assortative matchmaking model along the lines suggested by Henrich (2004), adding further generality and power to our setup. Finally, it would be fascinating to determine how an evolutionary approach can be combined with multi-agent communication to produce that most paradoxical of cooperative behaviors: cheap talk.

We would like to thank Simon Osindero, Iain Dunning, Andrea Tacchetti, and many DeepMind colleagues for valuable discussions and feedback, as well as code development and support.