
Malthusian Reinforcement Learning

by   Joel Z. Leibo, et al.

Here we explore a new algorithmic framework for multi-agent reinforcement learning, called Malthusian reinforcement learning, which extends self-play to include fitness-linked population size dynamics that drive ongoing innovation. In Malthusian RL, increases in a subpopulation's average return drive subsequent increases in its size, just as Thomas Malthus argued in 1798 was the relationship between preindustrial income levels and population growth. Malthusian reinforcement learning harnesses the competitive pressures arising from growing and shrinking population size to drive agents to explore regions of state and policy spaces that they could not otherwise reach. Furthermore, in environments where there are potential gains from specialization and division of labor, we show that Malthusian reinforcement learning is better positioned to take advantage of such synergies than algorithms based on self-play.



1. Introduction

Reinforcement learning algorithms have considerable difficulty avoiding local optima and continually exploring large state and policy spaces. This is known as the problem of exploration. In single-agent reinforcement learning, the main approach is to rely on intrinsic motivations, e.g., for individual curiosity schmidhuber2010formal; bellemare2016unifying; ostrovski2017count; pathak2017curiosity; martin2017count; burda2018large, empowerment klyubin2005empowerment, or social influence jaques2018intrinsic.

However, critical adaptation events in human history are difficult to explain with intrinsic motivations. Consider the dispersal of Homo sapiens out of Africa, where they first evolved between and years ago, throughout the globe, eventually occupying essentially all terrestrial climatic conditions and habitats by years before the present goebel2008late. This example is relevant to AI research because intelligence is often defined as an ability to adapt to a diverse set of environments legg2007universal. Essentially no other process on earth has led so quickly to so much adaptive diversity as did the expansion of human foragers throughout the globe [1]. Human foraging communities were capable both of discovering water-finding strategies suitable for the arid Australian desert and of learning how to hunt seals hiding beneath ice sheets and keep warm in the Arctic boyd2011cultural. From this perspective, the great dispersal of human foragers can be seen as some of the best evidence for an “existence proof” that intelligence, by this definition, is even possible. Yet there’s no evidence that intrinsic motivations like curiosity played a role in it. Rather, considerable evidence points to a variety of extrinsic motivation mechanisms as the main drivers of human migration, including climate change stewart2012human; eriksson2012late and demographic expansion mellars2006did; powell2009late.

[1] The rapid dispersal of Homo sapiens across the globe and their adaptation to the full range of diverse terrestrial habitats can be regarded as a great feat of intelligence. We may even assess its universal intelligence by adapting the following definition from legg2007universal, $\Upsilon(\pi) = \sum_{\mu \in E} 2^{-K(\mu)} V^{\pi}_{\mu}$, where $\pi$ is a policy, $E$ is the set of terrestrial habitats, $V^{\pi}_{\mu}$ is the expected value of the sum of rewards from inhabiting the habitat $\mu$, and $K$ is a measure of the complexity of $\mu$. The complexity measure could be defined as any sensible ecological distance metric measuring the distance from the ancestrally adapted environment to $\mu$.

One perspective on the problem of exploration is that the difficulty comes from the sparseness of extrinsic rewards. If extrinsic rewards are very sparse, then it is hard to estimate state-value functions and policy gradients since returns will have very high variance. Thus intrinsic motivation methods produce frequent intermediate (dense) rewards in hopes of bridging the long gaps between extrinsic rewards chentanez2005intrinsically. Multi-agent reinforcement learning offers an alternative. Algorithms based on self-play like AlphaGo tesauro1995td; Silver16Go; silver2017chess; bansal2017emergent; jaderberg2018human are aimed at an essentially single-agent objective, e.g., defeat a specific human grandmaster. Directly training by single-agent reinforcement learning to accomplish that objective is a lost cause: since the untrained agent could never win a game, it would never get any reward signal to learn from. Self-play, on the other hand, provides an alternative incentive for agents to explore deeply through the strategy space. In two-player zero-sum games, self-play escapes local optima by learning to exploit them, diminishing the returns available from such strategies, and thereby extrinsically motivating new exploration. Of course, life is not a zero-sum game. Real-life feats of exploration like the dispersal of Homo sapiens out of Africa really were motivated in part by a pressure to out-compete rivals in a struggle for scarce resources that became increasingly difficult over time due to demographic expansion. Moreover, local competition between individuals is a universal feature of natural habitats, and underlies the evolution of dispersal doi:10.1146/ In this paper we investigate whether such population size dynamics can be exploited as an algorithmic mechanism in multi-agent reinforcement learning.

As an algorithmic mechanism, augmenting multi-agent reinforcement learning with population size dynamics appears to have the properties needed to evade local optima and traverse large state spaces. Strategies that work well at low densities do not necessarily translate well to high densities, but success at any density ensures density will increase further in the future. The rules of the game therefore naturally shift over time in a manner that depends on past outcomes. This ensures that species cannot remain too long in comfortable local optima. When resources are scarce, rising populations eventually dissipate gains from learning, forcing agents to innovate just to maintain existing reward levels clark2008farewell [2].

[2] clark2008farewell argues that preindustrial human populations generally oscillated around a fixed, and only very slowly increasing, carrying capacity until the industrial revolution. Similar oscillations in subpopulation sizes were recently observed in a large-scale multi-agent learning simulation by yang2018study. As those authors pointed out, it is possible for population dynamics to endlessly oscillate rather than increasing over time. The same is true for the strategies used by learning algorithms based on self-play. One fix that was used in AlphaGo and elsewhere is to require agents to learn to defeat all previous versions of themselves, not just the most recent Silver16Go; silver2017chess. This prevents self-play-based agents from endlessly learning and forgetting the same exploit and defense.

So far we’ve motivated introducing population dynamics to multi-agent reinforcement learning by appealing to a competitive struggle for existence against a well-matched foe, i.e., the same argument underlying the performance of self-play in two-player zero-sum games. However, there is more to life than competition. As an algorithmic mechanism to promote learning in general-sum multi-agent environments, population dynamics may also be more suitable than self-play-based approaches. In particular, we will consider whether this approach provides greater scope for adapting to synergies between specialists, making it easier to discover joint policies involving significant division of labor.

In this work we introduce a new algorithm for multi-agent reinforcement learning based on these principles of population dynamics. It is called Malthusian reinforcement learning because improvements in returns for any subpopulation translate directly into increases in the size of that subpopulation in subsequent episodes. Thus it may be evaluated on either the individual or the group level. In this work we are interested in two specific questions:

  1. Is Malthusian reinforcement learning better at avoiding becoming stuck in bad local optima in individual policy space than competing algorithms based on intrinsic motivation?

  2. Is it easier to evolve joint policies to implement heterogeneous mutualism behaviors with Malthusian reinforcement learning than with alternative approaches based on self-play?

2. Model

2.1. Characteristics of the Malthusian reinforcement learning framework

Malthusian reinforcement learning differs from standard multi-agent reinforcement learning paradigms in a number of ways.

  1. Malthusian reinforcement learning may be seen as an algorithm for “community coevolution”. It produces a set of communities, called islands in our terminology. Each island has a set of agents implementing policies that should, if the training was successful, function well together.

  2. Each individual is a member of a species. All individuals of the same species share a policy neural network.

  3. The algorithm unfolds on two timescales corresponding to (A) the population dynamics (ecological) time, and (B) policy execution (behavioral) time.

  4. The population dynamics are linked to individual reinforcement learning returns. If individuals of a given species perform well on a particular island then their population will increase there in the future.

  5. During each episode all individuals of a given species generate experience to train a common neural network via V-trace pmlr-v80-espeholt18a. Experiences generated by individuals of a particular species are used only to update their own species' neural network. After each episode all the distributions over islands are updated by a REINFORCE-like rule Williams1992 that depends only on individual rewards (not group outcomes). This has the effect of increasing the population size for a given species on islands where individuals of that species have been doing well.

  6. Conservation of compute:

    Biologically realistic population dynamics all contain at least the possibility of exponential growth. In practice, they are limited by carrying capacities, i.e., by environment properties. Since Malthusian reinforcement learning is mainly intended for multi-agent machine learning applications rather than for ecological simulations, we cannot rely on resource constraints in the environment to limit growth. Thus, to ensure it can be executed with bounded compute resources, the population dynamic works by maintaining probability distributions for each species over the set of all islands. The probability assigned to any given island may grow or shrink based on the individual returns achieved there, but it is always constrained to be a valid probability distribution. Compute remains bounded because a fixed number of samples are used to assign individuals to islands. The total population varies on any given island from episode to episode, but across the entire archipelago, the number of individuals is always constant.

the set of islands in the archipelago
the number of islands in an archipelago
indexes islands
indexes ecological scale time
the total number of species
a species
the policy network of a species
the parameters of the policy network
the distribution of species over the archipelago
the parameters of the distribution
the set of distributions over the archipelago
total number of individuals
number of species individuals across all islands
labels individual of species
the set of individuals of species allocated to island at time-step
the average fitness received by species on island at time
the fitness received by individual of species at time
entropy regularization of the population
learning of the population
indexes behavioral scale time
the number of agents
the state space
a state
the action space of player
an action of player
the observation of player
the function that maps a state to the observation of player
is the transition kernel
the reward of player

2.2. Archipelagos, islands, and species

Island and Archipelago

An island is a multi-agent environment where a variable number of agents can interact. An archipelago is a set of islands. Furthermore we will write the number of islands in an archipelago.


A species is a set of individuals sharing the same policy network parameterization. Each species has an index and consists of a policy network whose parameters encode the behavior of every individual of that species. Each species also maintains a distribution of its agents over the islands, an element of the set of distributions over islands, which is defined as a softmax over a vector of weights. We track both the total number of individuals and the number of individuals per species, and label each individual within its species.

Figure 1. At each iteration on the ecological timescale, each island samples the players to participate in its next episode according to the probability distributions over islands maintained by each species. Experienced trajectories from all conspecifics, on all islands, are used to update the same species policy network. The distributions of returns to each species over islands are used to update the distributions from which to sample players for the next episode.

The learning process unfolds over two timescales: a slow ecological scale, which adapts the distribution of species over islands, and a fast behavioral scale, over which individuals execute their policies. Species adapt to behave in the presence of others at the level of the island. Ecological and behavioral timesteps are indexed separately. One tick of the ecological scale corresponds to a single episode at the behavioral scale.

2.3. Population dynamics

The population dynamics govern how individuals of each species are assigned to the different islands over the ecological time scale. At a fixed ecological timestep, the individuals of each species are assigned to islands by sampling, once per individual, from that species' distribution over islands. For each island, this yields an allocation: the set of individuals from each species playing on that island at that ecological timestep.
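The assignment step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `softmax`, `sample_allocation`, `species_weights`, and the sizes used (2 species, 4 islands, 20 individuals per species) are all hypothetical, and a single multinomial draw stands in for drawing one island per individual.

```python
import numpy as np

def softmax(weights):
    """Map unnormalized island weights to a probability distribution."""
    shifted = weights - np.max(weights)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def sample_allocation(weights, num_individuals, rng):
    """Return how many individuals of one species land on each island.

    Drawing one island per individual from the same distribution is
    equivalent to one multinomial draw over per-island counts.
    """
    probs = softmax(np.asarray(weights, dtype=float))
    return rng.multinomial(num_individuals, probs)

rng = np.random.default_rng(0)
# Hypothetical setup: 2 species, 4 islands, 20 individuals per species.
species_weights = np.zeros((2, 4))
allocation = np.stack([sample_allocation(w, 20, rng) for w in species_weights])
# Island populations vary between draws, but each species always totals 20.
print(allocation, allocation.sum(axis=1))
```

Because the number of samples per species is fixed, the total population across the archipelago stays constant however the weights change, which is the bounded-compute property described earlier.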

Each island has its own environment. In general the islands could have different environments from one another, though in this work we only consider the case where they are all the same.

Over the course of ecological time, the population evolves according to a gradient-based dynamic. At each ecological timestep, each individual of each species receives a fitness, which is exactly its cumulative reward over the behavioral scale timesteps that have elapsed during one step of the ecological timescale. The per-island fitness for each species is then calculated as the average fitness of the individuals of that species allocated to that island.

The distribution over islands for each species is updated according to policy gradient with entropy regularization. Explicitly, the distribution weights for each species over all islands change by gradient ascent on the expected per-island fitness under that distribution plus an entropy bonus, where island weights can be thought of as determining action probabilities for a step of archipelago time. The goal of the entropy regularization term is to ensure that some minimal population of each species remains on sub-optimal islands.
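A small numerical sketch of an entropy-regularized policy-gradient step on the island weights is given here. It assumes the objective is the expected per-island fitness under the softmax distribution plus an entropy bonus. The function names, the symbol `alpha` for the entropy coefficient, and all numbers are illustrative assumptions rather than values from the paper.

```python
import numpy as np

def softmax(weights):
    shifted = weights - np.max(weights)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def objective(weights, island_fitness, alpha):
    """Expected per-island fitness under mu, plus alpha times the entropy of mu."""
    mu = softmax(weights)
    entropy = -np.sum(mu * np.log(mu))
    return float(mu @ island_fitness + alpha * entropy)

def objective_gradient(weights, island_fitness, alpha):
    """Analytic gradient of `objective` with respect to the softmax weights."""
    mu = softmax(weights)
    entropy = -np.sum(mu * np.log(mu))
    fitness_grad = mu * (island_fitness - mu @ island_fitness)
    entropy_grad = -mu * (np.log(mu) + entropy)
    return fitness_grad + alpha * entropy_grad

# Hypothetical numbers: three islands, the first one the most rewarding.
weights = np.zeros(3)
island_fitness = np.array([2.0, 1.0, 0.5])
alpha, learning_rate = 0.1, 0.5
for _ in range(200):
    weights = weights + learning_rate * objective_gradient(weights, island_fitness, alpha)
mu = softmax(weights)
# Island 0 ends up with the largest share; every island keeps nonzero probability.
print(mu)
```

The fitness term pushes probability toward islands with above-average fitness, while the entropy term pulls the distribution back toward uniform, so less rewarding islands are not abandoned entirely.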

More precisely, our training setup assigns individuals to islands asynchronously, depending on when islands complete episodes. Therefore we implement the policy-gradient update using REINFORCE Williams1992, applying an update whenever an island completes an episode. In fact, we use a simplified version of this update, in which a term in the denominator is approximated by the empirical distribution of individuals over islands. This approximation has low variance provided that the number of sampled individuals is large.
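One plausible reading of a per-island REINFORCE-style step is sketched below. It is not the paper's exact formula. It assumes each finished island contributes its mean fitness times the score function of the softmax, weighted by the empirical share of the species found on that island. The entropy term is left out for brevity, and `island_update` with all its arguments are hypothetical names.

```python
import numpy as np

def softmax(weights):
    shifted = weights - np.max(weights)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def island_update(weights, island, mean_fitness, count, num_individuals, learning_rate):
    """Apply one REINFORCE-style step when `island` finishes an episode.

    For a softmax, grad log mu(island) = one_hot(island) - mu. It is scaled
    by the island's mean fitness and by the empirical share of the species
    on that island (count / num_individuals).
    """
    mu = softmax(weights)
    grad_log_mu = -mu
    grad_log_mu[island] += 1.0
    share = count / num_individuals
    return weights + learning_rate * share * mean_fitness * grad_log_mu

# Hypothetical example: island 2 finishes with 8 of 20 individuals, mean fitness 1.5.
weights = np.zeros(4)
weights = island_update(weights, island=2, mean_fitness=1.5, count=8,
                        num_individuals=20, learning_rate=0.1)
print(softmax(weights))  # probability of island 2 has increased
```

When the empirical shares match the current distribution, summing these per-island steps over all islands recovers the exact gradient of the expected fitness, which is why a large number of sampled individuals keeps the approximation accurate.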

2.4. Multiagent Reinforcement Learning

A Partially Observable Markov Game (POMG) is a sequential decision model of a multiagent environment in which individuals interact. At each state of a POMG, each agent selects an action based on its observation of the state of the game. The observation of each player is defined here as a function of the state. The state then changes and the individuals receive rewards. Each species learns a policy given the experience of each of its individuals. At each step, all individuals of a species collect trajectories of the experience gathered on the island they have been assigned to. The reinforcement learning algorithm produces gradient updates of the species parameters for each individual of the species. The gradient updates are then averaged over all individuals of the species to update the parameters. The V-trace algorithm is used to update the parameters as described in pmlr-v80-espeholt18a with truncation levels set to . Note that experience (observations, actions and rewards) from all individuals of a species contributes equally, but that the individuals may be spread non-uniformly over islands. This means that the species parameter update may be disproportionately affected by the performance of the species on particular islands.
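For orientation, the V-trace value targets from pmlr-v80-espeholt18a can be sketched for a single trajectory as below. This is a simplified NumPy illustration, not the training code used in the paper: the function name, argument names, and the discount `gamma` are assumptions, and the default truncation levels of 1.0 are placeholders rather than values taken from this paper.

```python
import numpy as np

def vtrace_targets(rewards, values, bootstrap_value, log_rhos,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Compute V-trace targets v_s for one trajectory of length T.

    rewards, values, log_rhos: length-T sequences, where log_rhos are
    log(pi(a|x) / mu(a|x)), the log importance ratios between the learner
    policy pi and the behavior policy mu.
    bootstrap_value: value estimate for the state after the last step.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    rhos = np.exp(np.asarray(log_rhos, dtype=float))
    clipped_rhos = np.minimum(rho_bar, rhos)  # truncated weights for the TD errors
    cs = np.minimum(c_bar, rhos)              # truncated "trace-cutting" weights
    next_values = np.append(values[1:], bootstrap_value)
    deltas = clipped_rhos * (rewards + gamma * next_values - values)

    # Backward recursion: v_s - V(x_s) = delta_s + gamma * c_s * (v_{s+1} - V(x_{s+1})).
    acc = 0.0
    vs_minus_v = np.zeros_like(values)
    for t in reversed(range(len(rewards))):
        acc = deltas[t] + gamma * cs[t] * acc
        vs_minus_v[t] = acc
    return values + vs_minus_v

# On-policy data (all log ratios zero) reduces V-trace to the n-step return.
targets = vtrace_targets(rewards=[1.0, 1.0, 1.0], values=[0.5, 0.2, -0.3],
                         bootstrap_value=2.0, log_rhos=[0.0, 0.0, 0.0], gamma=0.9)
print(targets)
```

In the setting described above, such targets would be computed for the trajectories of every individual of a species, and the resulting gradients averaged into the single shared network.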

RL Agent
LSTM Unroll Length:
Entropy Regularizer:
Baseline loss scaling:
RMSProp learning rate:
RMSProp :
RMSProp decay:
Batch size:
Function approximation:

The neural network architecture was similar to that of mnih2016asynchronous. It consists of a convnet with channels, kernel size of , and stride of . The output of the convnet is passed to a 1-layer MLP of size , followed by a recurrent module (an LSTM Hochreiter:1997:LSM:1246443.1246450) of size . The recurrent module’s output is then linearly transformed into the policy and value. All nonlinearities between layers were rectified linear units.

Distributed computing:

The island simulation and the species neural network updates were implemented as separate processes, potentially running on different machines. Islands produce trajectories and send them to a circular queue on the species update process. The species update process waits until it can dequeue a complete batch of trajectories, at which point it computes the v-trace update.
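The hand-off between islands and the species update process can be illustrated with a small single-process sketch. Everything here is an assumption made for illustration: the class name `TrajectoryQueue`, the overwrite-oldest behavior chosen to model "circular", the capacity, and the batch size. The real system runs these roles as separate processes.

```python
from collections import deque

class TrajectoryQueue:
    """Bounded FIFO buffer: when full, the oldest trajectory is overwritten."""

    def __init__(self, capacity):
        self._buffer = deque(maxlen=capacity)

    def put(self, trajectory):
        self._buffer.append(trajectory)  # deque drops the oldest item when full

    def try_dequeue_batch(self, batch_size):
        """Return the oldest `batch_size` trajectories, or None if too few are queued."""
        if len(self._buffer) < batch_size:
            return None
        return [self._buffer.popleft() for _ in range(batch_size)]

queue = TrajectoryQueue(capacity=8)
for island in range(3):
    queue.put({"island": island, "rewards": [0.0, 1.0]})
print(queue.try_dequeue_batch(4))      # None: the learner keeps waiting
queue.put({"island": 3, "rewards": [1.0]})
batch = queue.try_dequeue_batch(4)     # a full batch is now available
print([t["island"] for t in batch])    # [0, 1, 2, 3]
```

Decoupling producers from the learner this way means slow islands never block the update, at the cost of the learner sometimes training on slightly stale trajectories, which is the off-policy gap that V-trace is designed to correct.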


The games studied in this work are all partially observable in that individuals can only observe via a 15×15 RGB window, centered on their current location. The action space consists of moving left, right, up, and down, and rotating left and right. Each species was assigned a unique color, shared by all conspecifics and preserved across all islands.

3. Results

3.1. Exploration experiments

Given an unrefined and infrequently emitted behavior, reinforcement learning algorithms are very good at estimating its value with respect to alternatives and refining it into a well-honed strategy for achieving rewards. However, a central problem in reinforcement learning concerns the initial origin of such behaviors, especially in cases where the state space is too large for exhaustive search, and there are many local optima where the policy’s reward gradient becomes zero.

This section explores how population dynamics may drive innovation in individual behavior. To study this, we introduce a new game that taxes individual exploration skills. It can be seen as a multi-agent analog of the well-known Montezuma’s Revenge single player game that has often been used for studies of intrinsic motivations for single-agent exploration bellemare2013arcade; bellemare2016unifying; ostrovski2017count; martin2017count; conti2017improving. We hold to the game theory tradition of introducing each game with a facetious (but hopefully memorable) story, and offer the following:

In the Clamity game, agents begin in the trochophore stage of the bivalve mollusk lifecycle. They can freely swim around the map, a partially observed grid-world (map size = , window size = ). Then whenever they are ready, they can perform the *settle* action. This action causes the agent to metamorphose into the adult clam stage of their lifecycle at their current location and removes their ability to swim. After settling, their shell grows around them, up to a maximum size. Shell growth is also restricted by the presence of adjacent shells from other clams. Each adult clam filters invisible food particles from the ocean at a rate proportional to the size of its shell, receiving reward for each food particle filtered. However, clam shells that are adjacent to the shell of another individual become unhealthy and do not filter any food. There are also nutrient patches located a considerable distance away from the starting location (more than steps away, see maps in Fig. 2-A). Individuals that settle near a nutrient patch so that it is either partially or fully engulfed in their shell absorb additional nutrients from it. Episodes terminate after steps. Settling immediately on the first action is a very attractive local optimum. The global optimum solution is to swim quickly out to a nutrient patch and settle there instead [3].

[3] A video of the single-agent global optimum policy can be viewed here:

Single-agent reinforcement learning algorithms become stuck in the local optimum [4] and fail to ever discover the nutrient patches. To see why, consider the number of consecutive seemingly suboptimal actions that an agent would have to take in order to discover a nutrient patch. The settle action can be taken at any time, always provides some level of reward, and, once taken, prevents movement for the rest of the episode. Thus any reasonable reinforcement learning algorithm that follows the initial gradient of its experience will reach the local optimum. If it starts out settling on step , it will receive an expected return of . But if it were to settle earlier instead, e.g., on step , it would receive a larger expected return. Thus there is a strong gradient from any policy initialization to the local optimum of settling on step . Furthermore, since the environment is partially observable, a single agent would need to choose to move in the same direction for several steps despite registering no change at all in its observation during that time.

[4] A video of the single-agent local optimum policy can be viewed here:
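The shape of this reward landscape can be made concrete with a toy calculation. The model below is a deliberate simplification and every number in it is hypothetical: it ignores shell growth, assumes the agent swims straight toward a patch, and uses an invented episode length, patch distance, and reward rates.

```python
def settle_return(settle_step, episode_length, filter_rate):
    """Reward from settling at `settle_step` and filtering until the episode ends."""
    return (episode_length - settle_step) * filter_rate

# All numbers are hypothetical, chosen only to show the shape of the landscape.
EPISODE_LENGTH = 100
PATCH_DISTANCE = 30   # steps needed to swim to a nutrient patch
BASE_RATE = 1.0       # reward per step for a clam away from any patch
PATCH_RATE = 3.0      # reward per step for a clam engulfing a patch

returns = []
for step in range(EPISODE_LENGTH):
    # Only an agent that settles exactly on arrival at the patch gets the bonus.
    rate = PATCH_RATE if step == PATCH_DISTANCE else BASE_RATE
    returns.append(settle_return(step, EPISODE_LENGTH, rate))

print(returns[0], returns[1], returns[29], returns[30])  # 100.0 99.0 71.0 210.0
```

In this toy model every extra step of swimming before the patch is reached lowers the return, so an agent following its local gradient is pushed toward settling at step 0 even though settling at the patch is worth more than twice as much.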

On the other hand, Clamity can also be played by multiple agents simultaneously. All the trochophores begin each episode near one another in the center of the map. Since intersecting shells become unhealthy and provide no reward, individuals are penalized for settling too close to one another [5]. This provides a gradient that incentivizes agents to swim away from the starting location to avoid competing with one another for shell space. If the population size is large enough then this competition-motivated spreading eventually leads individuals to discover the nutrient patches [6].

[5] A video of such a multiplayer bad outcome can be viewed here:

[6] A video of a group of agents implementing a multiplayer global optimum joint policy can be viewed here:

Figure 2. Experiments with extrinsically and intrinsically motivated individual exploration using the Clamity game. (A) Local optimum outcome. (B) Global optimum outcome. (C) Catastrophic multi-player outcome. (D) Multi-player global optimum. (E) Returns as a function of the number of times the species playing on the evaluated solitary island was updated. Except where indicated otherwise, reward values were smoothed over time with a window size of 100. Malthusian RL parameters were and . Each episode lasted behavior steps.

3.1.1. Experimental procedure

To make like-for-like comparisons between single-agent and multi-agent training regimes, we adopt the following protocol. In parallel with the archipelago ( islands), we run (the number of species) additional solitary islands. On the -th solitary island, a single individual of species plays each episode alone. All the experience generated on islands where species appears, even its solitary island, is used to update its policy . However, the amount of each species’s total experience derived from the solitary island is comparatively small since in this experiment, , the number of individuals of species appearing across all islands of the archipelago. The final results are reported only from the solitary islands but reflect the policy learning accumulated in the competitive archipelago setting.

Our single agent training protocol simply sets the number of islands in the archipelago to and replicates each solitary island times. Since there is only a single species (), and all solitary island replicas are the same as one another (though with different random environment seeds), the result is exactly equivalent to the A3C training regime mnih2016asynchronous.

This protocol also makes it easy to compare the proposed training regime where population sizes are dynamic and variable from episode to episode to the case of a “standard self-play” training regime, where population sizes are fixed. In this case, the archipelago contains just one island inhabited by a fixed number of individuals. As before, most of the experience is generated from the island where multiple individuals play. Results are reported only from the (single) solitary island, just as it is in the dynamic case.

3.1.2. Results

Individuals of species trained by Malthusian reinforcement learning find the globally optimal single-player solution, despite most of their experience coming from multi-player islands. Individuals trained by two baseline single-agent reinforcement learning algorithms completely fail to escape the local optimum. The first baseline we tried had exactly the same hyperparameters as in the Malthusian case, but all of its experience was in solitary islands ( of them in parallel) [7]. The second single-agent reinforcement learning baseline we tried was an implementation of the current state of the art in curiosity-driven reinforcement learning: the intrinsic curiosity module, which provides a pseudo-reward to the agent based on its error in predicting the next timestep in the evolution of a compressed encoding of its observations pathak2017curiosity; martin2017count. In this case, augmenting the agent with the intrinsic curiosity module is still insufficient to get it to consistently discover the nutrient patches. It does stumble upon them from time to time, especially early on in training (Fig. 2), but does not do so consistently enough to register in a smoothed plot of rewards versus time with a step smoothing window (Fig. 2). In contrast, individual members of species trained by Malthusian reinforcement learning with dynamic population sizes consistently implement globally optimal policies once they have discovered them (Fig. 2).

[7] These hyperparameters were not tuned for the Malthusian case; they were prespecified before the runs of both methods, and not subsequently changed.

Next we asked whether dynamic population sizes were specifically important or whether the key was just the simultaneous training in multi-agent islands with a given, sufficiently large, population size. We noticed that most runs with dynamical population sizes converged to an island population size around in the best performing islands. Thus we ran several experiments where agents trained in fixed population islands, evaluated on solitary islands as before. We found that individuals that trained in a fixed population size of were able to discover the global optimum, but apparently less consistently than in the case with dynamic population size (red curve above navy curve in Fig. 2), and apparently with greater vulnerability to forgetting (the navy colored curve eventually declines back to the local optimum).

Figure 3. Experiments with the evolution of mutualism using the Allelopathy game. All results in this figure were smoothed with a window size of 25 ecological steps. (A-D) Unbiased Allelopathy game. Malthusian RL parameters were and . (E-H) Biased Allelopathy game. Malthusian RL parameters were and . (A, E) Maximum collective return over all islands as a function of ecological time. (B, F) Maximum per capita collective return over all islands as a function of ecological time. (C, G) Maximum island population size over all islands as a function of ecological time. (D, H) Minimum number of times incurred a switching cost as a function of ecological time. (I) Two screenshots of random procedurally generated initial map configurations. Maps were procedurally generated by randomly placing shrubs at the start of each episode. Episodes lasted behavior steps.

3.2. Mutualism and specialization experiments

Solution concepts for general-sum games may involve mutualistic interactions between synergistic strategies. Successful cooperative joint strategies may be either homogeneous, as in facultative mutualism, or heterogeneous. In nature, partners in mutually profitable associations are often very different from one another so that they can provide complementary capabilities to the partnership. In fact, most known mutualisms involve partners from different kingdoms, e.g., corals and their algae symbionts, vascular plants and mycorrhizal fungi, mammals and their gut bacteria, etc. bruno2003inclusion. Moreover, division of labor and the subsequent efficiency gains from specialization are thought to be key components of complex human society smith1776inquiry.

However, it may be difficult to learn such mutually profitable partnerships of widely divergent strategies with self-play. All partners would need to represent all specializations, wasting valuable representation capacity. In addition, a policy learned by self-play requires a switching mechanism to break the symmetry and determine which sub-policy to emit in any given situation. For example, an agent could learn to become a blacksmith if standing on the left and a farmer if standing on the right. The complexity of the switching policy is itself related to the extent of partial observability in the environment. In some cases it may be very difficult to determine the right proportion of individuals needed to perform each part of the partnership at any given time, e.g. if the others’ strategies cannot easily be observed. It would be easier to learn a heterogeneous set of policies, each one implementing only its own part of the partnership. But then, it would seem that the number of copies of each would have to be known in advance, thus adding many new difficult-to-tune hyperparameters, one for each species.

In this section we explore whether Malthusian reinforcement learning can find mutualistic partnerships more easily than other multi-agent reinforcement learning methods, especially when there is a potential for gains from heterogeneous populations containing multiple specialized members. To test this, we created another partially observed Markov game environment. Again continuing the game-theoretic tradition of accompanying each game with a facetious and memorable story, we offer the following.

The Allelopathy game has two main rules. (1) Shrubs grow in random positions on an open field. Shrubs allelopathically suppress one another’s growth. That is, the probability that a seed of a given type grows into a shrub in any given timestep is inversely proportional to the number of nearby shrubs of other types. (2) Agents in Allelopathy are herbivorous animals that can eat many different types of shrub. However, switching frequently between digesting different shrub types imposes a metabolic cost since different enzymes must be synthesized for each. Thus, agents benefit from specializing in eating only a single type of shrub. Agents receive increasing rewards for repeatedly harvesting the same type of shrub (up to a maximum of ). Rewards drop back down to their lowest level, , when the agent harvests a different type of shrub (since that entails switching into a different metabolic regime). Thus an agent that randomly harvests any shrub it comes across is likely to receive low rewards. An agent that only harvests a particular type of shrub while ignoring others will obtain significantly greater rewards. The combined effect of these two rules is that a specialist in any one shrub type benefits from the presence of others who specialize in different shrub types, since their foraging increases the growth rate of all the shrubs they do not consume.
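The second rule can be illustrated with a small reward tracker. This is a sketch under stated assumptions: the class name `StreakReward`, the linear growth of reward with streak length, and the specific values (minimum 1, maximum 4, increment 1) are all hypothetical, since the actual reward schedule is not reproduced here.

```python
class StreakReward:
    """Reward that grows while an agent keeps harvesting the same shrub type."""

    def __init__(self, min_reward, max_reward, increment):
        self.min_reward = min_reward
        self.max_reward = max_reward
        self.increment = increment
        self.last_type = None
        self.current = min_reward

    def harvest(self, shrub_type):
        if shrub_type != self.last_type:
            # First harvest or a switch of shrub type: back to the lowest level.
            self.current = self.min_reward
        else:
            # Same type again: climb toward the maximum reward.
            self.current = min(self.current + self.increment, self.max_reward)
        self.last_type = shrub_type
        return self.current

def total_reward(sequence):
    tracker = StreakReward(min_reward=1, max_reward=4, increment=1)
    return sum(tracker.harvest(shrub) for shrub in sequence)

specialist = total_reward(["A", "A", "A", "A", "A"])   # 1 + 2 + 3 + 4 + 4
generalist = total_reward(["A", "B", "A", "B", "A"])   # 1 + 1 + 1 + 1 + 1
print(specialist, generalist)  # 14 5
```

Under these assumed values a specialist earns almost three times as much as an agent that alternates between shrub types, which is the incentive for specialization that the story describes.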

We studied two variants of the Allelopathy game. The first variant, unbiased Allelopathy, has two shrub types and that appear with equal probability. In the second variant, biased Allelopathy, the two shrub types do not appear with the same frequency: type is significantly more common than type . In addition, each shrub of type consumed provides a maximum reward of when at least in a row are consumed, whereas type shrubs yield a maximum reward of for any agent that manages to consume that many consecutively. Biased Allelopathy is a social dilemma since specialists in type are clearly better off than specialists in type , but both do better when the other is around.

3.2.1. Results

Here the critical comparison is between homogeneous and heterogeneous population dynamics. Therefore, the object of study is the performance of the islands rather than specific individuals. The Allelopathy environment contains two niches, corresponding to specialization in consuming either shrub type or . In the heterogeneous case, mutualistic partnerships may develop from initial conditions where species in proximity to one another randomly fill either role. This situation features a gradient that guides each species in different directions. Whichever species begins with a propensity toward role ends up specializing in role . Likewise, the other species evolves to specialize in role , to the mutual benefit of both partners. On the other hand, in the homogeneous case, it is still possible for mutualistic interactions to develop, but it is more difficult since (1) both specialized parts of the joint policy must be represented in the same network, and (2) the policy must include a switching mechanism that breaks the symmetry, determining which sub-policy to implement in each situation. Heterogeneous species avoid the need for this symmetry breaking, and the relative proportions assigned to each role are handled naturally by the adapting relative population sizes (Fig. 4).

The total number of individuals in each experiment was . Thus in the homogeneous case, , and in the heterogeneous case, . The number of islands was in both dynamic population size conditions. In the fixed population size conditions the number of islands was chosen so that the total number of individual instances would still be , e.g., for fixed population size , this required .

Results were similar for both the biased and unbiased Allelopathy games. Heterogeneous population dynamics achieved higher returns, both per capita and in aggregate, than the other tested methods, including the homogeneous population with size dynamics (Fig. 3). Interestingly, both heterogeneous and homogeneous runs converged to the same population size, but the heterogeneous case increased more slowly to that point, and did so while maintaining a higher per capita rate of return.

Figure 4. Representative island timecourses for the Allelopathy game. The different lines represent different islands. Notice that the results are consistent across islands. (A-B) Results from the unbiased Allelopathy game. (C-D) Results from the biased Allelopathy game. (A, C) Collective return per capita as a function of ecological time for four representative islands. (B, D) Island population size as a function of ecological time for four representative islands.

4. Discussion

This paper introduces Malthusian reinforcement learning, a multi-agent reinforcement learning algorithm that motivates individual exploration and takes advantage of possibilities for synergy to evolve heterogeneous mutualisms. If populations rise when returns improve, then the problem itself shifts over time, so no local optimum need ever be reached. This gives rise to a strategy that we may term exploration by exploitation. Individuals can always follow the gradient of their experience; they need never depart from their current estimate of the best policy just to explore the state space. They will naturally explore it just by following the gradient in a changing world. Opportunities for heterogeneous mutualism may also be detected by gradient following. Initially weak specialization in one agent incentivizes its soon-to-be partner to specialize in a complementary direction, which in turn catalyzes more specialization, and so on.

How does this paper’s proposed population dynamic relate to dynamics studied in evolutionary theory? Our requirement of conservation of compute (the number of individuals of a given species on a given island may vary from episode to episode, but the total number of individuals of each species in the archipelago is always the same fixed value) implies that for populations to increase on one island they must decrease elsewhere. Thus the population dynamic introduced here may be understood as an evolutionary model of migration. Moreover, since fitnesses are computed globally, i.e. relative to the entire archipelago, it is more similar to hard selection models in evolutionary theory, where populations are regulated globally, than to soft selection models, where population regulation occurs locally within each island doi:10.1086/284328; West72; henrich2004.
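The conservation constraint can be illustrated with a short sketch. It assumes, purely for illustration, that each individual of a species is assigned to an island by sampling from a softmax over globally compared island fitnesses; the softmax rule, the `temperature` parameter, and the function names are hypothetical and are not claimed to be the update used by the algorithm. What the sketch does show is the accounting: because every individual lands on exactly one island, the archipelago total stays fixed, and growth on one island implies decline elsewhere.

```python
import math
import random
from collections import Counter
from typing import List, Optional, Sequence


def island_probabilities(island_fitness: Sequence[float],
                         temperature: float = 1.0) -> List[float]:
    """Softmax over island fitness, compared across the whole archipelago.

    Assumed allocation rule for illustration; subtracting the peak keeps
    the exponentials numerically stable.
    """
    peak = max(island_fitness)
    weights = [math.exp((f - peak) / temperature) for f in island_fitness]
    total = sum(weights)
    return [w / total for w in weights]


def allocate_species(island_fitness: Sequence[float],
                     total_individuals: int,
                     temperature: float = 1.0,
                     rng: Optional[random.Random] = None) -> List[int]:
    """Place a fixed number of individuals of one species onto islands."""
    rng = rng or random.Random()
    probs = island_probabilities(island_fitness, temperature)
    # Each individual is placed on exactly one island, so the counts
    # always sum to total_individuals (conservation of compute).
    placements = rng.choices(range(len(probs)), weights=probs,
                             k=total_individuals)
    counts = Counter(placements)
    return [counts[i] for i in range(len(probs))]


# Hypothetical fitness values for three islands and 60 individuals.
counts = allocate_species([1.0, 2.0, 0.5], total_individuals=60,
                          rng=random.Random(0))
```

Since the probabilities are normalized over all islands rather than within each one, regulation in this sketch is global, which matches the hard selection reading given above.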

Other possible relationships between population size and innovation have appeared in the evolutionary anthropology literature. For instance, it is possible that, especially in preliterate societies, larger populations provide for more protection from forgetting of useful cultural elements since more elders, functioning as repositories of cultural knowledge, will be alive at any given time henrich2004demography. Or alternatively, larger social networks may provide more opportunities for recombination of disparate cultural elements that originated in farther and farther away contexts boyd2011cultural; kempe2014cultural; henrich2010markets; muthukrishna2016innovation. As hypotheses for the origin of innovative behaviors in biology, these possibilities appear to be at odds with the mechanism implied by our algorithm, since each could explain, for instance, the same correlations between brain size and group size across the primate order dunbar2017there as well as innovation and (cultural) group size in humans richerson2009cultural; muthukrishna2016innovation. However, they are not mutually exclusive. In fact, all three mechanisms may even operate synergistically with one another. More research is needed in order to tease apart the precise mechanisms in biology. In computer science, we think this line of thought opens up a goldmine of new algorithmic ideas concerning the combination of population dynamics with social learning and imitation.