1 Introduction
The recent availability of large datasets collected from various resources, such as digital transactions, location data and government census, is transforming the ways we study and understand social systems [lazer2009computational]. Researchers and policy makers are able to observe and model social interactions and dynamics in great detail, including the structure of friendship networks [eagle2009inferring], the behavior of cities [doi:10.1098/rsif.2016.1048], politically polarized societies [morales2015measuring], or the spread of information on social media [vosoughi2018spread]. These studies show the behaviors present in the data but do not explore the space of possibilities that human dynamics may evolve to. Robust policies should consider mechanisms to respond to every type of event [ashby1991requisite], including those that are very rare [taleb2007black]. Therefore it is crucial to develop simulation environments in which potentially unobserved social dynamics can be assessed empirically.
Agent Based Modeling (ABM) is a generative approach to study social phenomena based on the interaction of individuals [sayama2015introduction]. These models show how different types of individual behavior give rise to emergent macroscopic regularities [schelling1971dynamic], such as unequal wealth distributions [epstein1996growing], new political actors [axelrod2006model], multipolarity in interstate systems [cederman1997emergent] and cultural differentiation [axelrod1997dissemination]. Moreover, ABM allows testing core sociological theories against simulations [epstein1996growing] with emphasis on heterogeneous, autonomous actors with bounded, spatial information [epstein1999agent]. However, the rules of agent interactions are generally fixed, which limits the exploration of the space of possible behaviors.
Reinforcement Learning (RL) is a simulation method where agents become intelligent and create new, optimal behaviors based on the state of their environment and a previously defined structure of incentives. This method is referred to as Multi-Agent Reinforcement Learning (MARL) if multiple agents are employed. Recently, the combination of RL with Deep Learning architectures has achieved human-level performance in complex tasks, including video gaming [mnih2015human], motion in harsh environments [heess2017emergence], and effective communication networks without assumptions [sert2018optimizing]. Moreover, it has recently been applied to study societal dilemmas and game theory problems [lanctot2017unified] such as the emergence of cooperation [de2006learning, leibo2017multi], the Prisoner's Dilemma [sandholm1996multiagent] and payoff matrices in equilibrium [wunder2010classes]. Although Deep RL algorithms applied to multiple agents (MARL) can shed light on social phenomena, to the best of our knowledge, the applications of these methods have been confined to classical game-theoretic problems [zawadzki2014empirically], and connections to real-world examples remain unexplored.
In this paper we extend the standard ABM of social segregation using MARL in order to explore the space of possible behaviors as we modify the structure of incentives and promote the interaction among agents of different kinds. The idea is to observe the behavior of agents that want to segregate from each other when interactions are promoted. We achieve the segregation dynamics by considering the rules from the Schelling model [schelling1971dynamic]. The creation of interdependencies among agents of different kinds is inspired by the dynamics of the Predator-Prey model [sayama2015introduction], where agents hunt each other. Our experiments show that spatial segregation diminishes as more interdependencies among agents of different kinds are added. Moreover, our results shed light on previously unknown behaviors regarding segregation and the age of individuals, which we confirmed using Census data. These methods can be extended to study other types of social phenomena and inform policy makers about possible actions.
2 Methods
Figure 1 (bottom): Each network receives an input of 11x11 locations, runs it through five convolution steps and concatenates the resulting activations with the agent's remaining age normalized by the maximum initial age. The feature vector is mapped onto the action space using a fully connected layer. The action with the maximum Q-value is taken for the agent.
Table 1: Training details.
Parameter  Value
Number of Episodes  1
Batch Size  256
Number of Iterations  5000
Number of Training Steps  60,000
Experience Memory Length  1,000,000
Discount Factor ($\gamma$)  0.98
Learning Rate  0.001
Momentum  0.999
Double Network Copy Parameter  0.05
Initial Exploration Rate  0.999
Final Exploration Rate  0
Exploration Decay (per agent action)  100,000
We design a game in which agents are incentivized both to self-segregate and to interact with others. By varying the reward of interactions we are able to explore different incentives that affect the self-organizing process of segregation. Our experiments are based on two types of agents: A and B. Agents try to survive in a 50x50 grid where they can move around and interact with other agents. They observe an 11x11 patch of the grid centered around their current position and can live for a total of 100 iterations in isolation. Figure 1 shows a schematic view of the grid world and the agents. Distinct colors indicate the agents' types and the green square represents the observation window of the agent illustrated in green.
Each type of agent utilizes one Deep Q-Network for maximizing rewards [mnih2015human]. The rewards of the game are as follows:

Segregation reward. This incentive encourages agents to self-segregate. An agent is rewarded +1 for each agent of similar kind that joins its observation window, and -1 for each agent of different kind.

Interdependence reward. This incentive promotes interactions among agents of different kinds. When an agent meets another agent of different kind, we randomly choose a winner of the interaction (following hunting dynamics). The winner (hunter) receives a positive reward, which we vary across experiments, and an extension of its lifetime by one iteration.

Vigilance reward. This incentive encourages agents to stay alive by providing a reward of +0.1 for every time step they survive.

Death reward. This incentive negatively rewards agents who die or are hunted by agents of the opposite kind. Agents receive a reward of -1 when they die.

Occlusion reward. This incentive negatively rewards movements towards occupied cells. If an agent tries to move towards an occluded area, the agent receives a reward of -1.

Stillness reward. This incentive promotes the exploration of space. Agents who choose to stay still receive a reward of -1.
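The reward structure listed above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the encoding of the window (+1 and -1 for the two kinds, 0 for empty cells), the function names, the reading of the segregation reward as a count over the agents currently in the window, and the choice to sum the individual rewards into one value per step are all assumptions made here.

```python
# Reward magnitudes follow the list above. The constant and function names
# are hypothetical, and summing the terms per step is an assumption.
SEGREGATION_SAME = 1.0
SEGREGATION_OTHER = -1.0
VIGILANCE = 0.1
DEATH = -1.0
OCCLUSION = -1.0
STILLNESS = -1.0


def segregation_reward(window, own_kind):
    """Return +1 per same-kind and -1 per other-kind agent in the window.

    `window` is a square list of rows holding +1 / -1 for the two kinds and
    0 for empty cells (assumed encoding). The centre cell, which holds the
    observing agent itself, is skipped.
    """
    centre = len(window) // 2
    total = 0.0
    for r, row in enumerate(window):
        for c, cell in enumerate(row):
            if (r, c) == (centre, centre) or cell == 0:
                continue
            total += SEGREGATION_SAME if cell == own_kind else SEGREGATION_OTHER
    return total


def step_reward(window, own_kind, interdependence=0.0, died=False,
                blocked=False, stayed_still=False):
    """Combine the individual rewards for one agent action (assumed additive)."""
    if died:
        return DEATH
    reward = segregation_reward(window, own_kind) + VIGILANCE + interdependence
    if blocked:
        reward += OCCLUSION
    if stayed_still:
        reward += STILLNESS
    return reward
```

The `interdependence` argument stands for the hunter's reward that is varied across experiments (0, 25, 50 or 75 in the Results section).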
Every agent takes one action at each iteration. The order in which agents take actions is chosen randomly. There are five possible actions for agents: to stay still or to move left, right, up or down. Agents are confined within the borders of the grid and cannot move towards agents of their own kind. If an agent moves to a location occupied by an agent of the opposite kind, it receives the interdependence reward and the opponent receives the death reward.
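The movement rules in this paragraph can be illustrated with a small sketch. The grid encoding (+1 and -1 for the two kinds, 0 for empty cells), the action names, and the outcome labels are assumptions for illustration; whether a move into the border also triggers the occlusion reward is not stated in the text, so both cases are simply labeled "blocked" here.

```python
import random

# Five possible actions as (row offset, column offset); names are hypothetical.
ACTIONS = {
    "stay": (0, 0),
    "left": (0, -1),
    "right": (0, 1),
    "up": (-1, 0),
    "down": (1, 0),
}


def random_order(positions, rng=random):
    """Return the agent positions in a random acting order."""
    return rng.sample(positions, len(positions))


def resolve_move(grid, row, col, action):
    """Apply one action for the agent at (row, col) and update `grid` in place.

    Returns (new_position, outcome), where outcome is one of
    "stay", "blocked", "move" or "hunt".
    """
    kind = grid[row][col]
    d_row, d_col = ACTIONS[action]
    if (d_row, d_col) == (0, 0):
        return (row, col), "stay"
    new_row, new_col = row + d_row, col + d_col
    size = len(grid)
    inside = 0 <= new_row < size and 0 <= new_col < size
    # Agents cannot leave the grid or step onto an agent of their own kind.
    if not inside or grid[new_row][new_col] == kind:
        return (row, col), "blocked"
    # Stepping onto the opposite kind removes it (hunting); otherwise plain move.
    outcome = "hunt" if grid[new_row][new_col] == -kind else "move"
    grid[new_row][new_col] = kind
    grid[row][col] = 0
    return (new_row, new_col), outcome
```

A "hunt" outcome corresponds to the mover receiving the interdependence reward and the removed agent receiving the death reward.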
Mathematically, agents of type A are represented as , B as , empty space as and border as on the grid. Hence every agent's spatial observation at time is . Moreover, every agent has the information of its remaining normalized lifetime, represented as . The full observation of the agent at time is . Let and denote the Q-Networks of type A and B. Then the networks' goal is to satisfy Equations 1 and 2.
(1) 
(2) 
where denotes the number of agents of type , denotes the discount factor, denotes the reward at time and denotes the Q-Network of agents of type .
Each network is initialized with the same parameters. In order to homogenize the networks' inputs, we normalize the observation windows by the agents' own kind, such that positive and negative values respectively represent equal and opposite kind for each agent. Actions are taken following an $\epsilon$-greedy exploration strategy, and the exploration rate decays exponentially. In order to stabilize the learning process, we use the Adam optimizer [kingma2014adam], Experience Replay [lin1992self] and Double Q-Learning [van2016deep]. Networks are trained in parallel over 12 CPUs using data parallelism. We run one episode per experiment. Each episode comprises 5000 iterations. Each experiment is repeated 10 times for statistical analysis. Network details are given in Figure 1 (bottom) and training details are given in Table 1.
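As a rough illustration of the input normalization and the decaying exploration described in this paragraph, the sketch below assumes an epsilon-greedy rule, a +1 / -1 / 0 encoding of the observation window, and a decay of the form final + (initial - final) * exp(-step / decay). The default values come from Table 1, but the functional form and all function names are assumptions, not the authors' code.

```python
import math
import random


def exploration_rate(step, initial=0.999, final=0.0, decay=100_000):
    """Exponentially decaying exploration rate (assumed functional form)."""
    return final + (initial - final) * math.exp(-step / decay)


def normalize_observation(window, own_kind):
    """Flip signs so that +1 always means 'my kind' and -1 'the other kind'.

    Assumes agents are encoded as +1 / -1 and empty cells as 0.
    """
    return [[cell * own_kind for cell in row] for row in window]


def select_action(q_values, step, rng=random):
    """Epsilon-greedy choice over a list of Q-values, one per action."""
    if rng.random() < exploration_rate(step):
        # Explore: pick any action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: pick the action with the largest Q-value.
    return max(range(len(q_values)), key=q_values.__getitem__)
```

Because both kinds see their own agents as positive values after normalization, networks that start from the same parameters receive inputs on the same scale.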
3 Results
Experiments are conducted by setting up different values of incentives and observing the emergent collective behavior associated with each experiment. During simulations, agents explore the space of possible behaviors and reveal which behaviors are promoted under certain incentives and environmental rules. As a result, we create an artificial environment for testing hypotheses and obtaining information through simulations that is hard to anticipate given the complexity of the space of possibilities.
Footnote 1: Demonstration of the experiments: (IR: 0) https://youtu.be/AgAeYMe2tUE (IR: 25) https://youtu.be/OZbl8qD50Mg (IR: 50) https://youtu.be/Ca2p2cATmlw (IR: 75) https://youtu.be/R32Xu_EUpBQ
In this case, we create agents who want to segregate from other kinds and provide incentives to create interactions and interdependencies across kinds. For this purpose, we model the Schelling dynamics for segregation and combine it with the interdependence reward. The interdependence reward is given when agents of different kinds compete and win against each other following hunting dynamics. The one who is hunted dies and the hunter gets a positive reward and a life extension. In total, there are four different experiments with interdependence rewards of 0, 25, 50 and 75 respectively. A set of videos is available with one simulation for each setting. In the videos, the colors yellow/orange and cyan/magenta denote the types of agents. The color brightness indicates the age of agents for both kinds.
Interdependence rewards diminish spatial segregation among different types. In Figure 2a we show the collective behavior of the population, using heat maps proportional to the probability of agents' locations during simulations according to their type. The heat maps are visualized over one trial of the experiments. Blue and red regions show biases towards each kind. White regions show uniform occupation. The dynamics of segregation quickly result in patches of segregated groups (top panels). As interdependence rewards increase, the probability of a grid cell being occupied by an agent of type A or B becomes uniform and the plots become white (bottom right panels). Creating interdependencies among agents increases their interactions and reduces the spatial segregation.
We measure segregation among agents using multiscale entropy. We convolve the grid space with low-pass filters of size 6x6, 12x12 and 25x25 using sliding windows whose output is the window average value. We measure the entropy of the distribution of window averages after each convolution across all iterations. The segregation per iteration is defined as the average entropy across the distributions resulting from the different filter sizes. The resulting segregation dynamics are visualized in Figure 3. Segregation is high when interdependencies are not rewarded (yellow curve). As interdependencies increase (purple and black curves), the agents mix and the spatial segregation is significantly reduced (see Section S2 in the Supplement).
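The multiscale entropy measure described above can be sketched as follows. The text does not specify the grid encoding, the stride of the sliding window, or how the distribution of window averages is binned, so this sketch assumes +1 / -1 / 0 cell values, a stride of one cell, and a 20-bin histogram over [-1, 1]; the function names are hypothetical.

```python
import math
from collections import Counter


def window_averages(grid, size):
    """Average cell value of every size x size window (stride 1, assumed)."""
    n = len(grid)
    averages = []
    for r in range(n - size + 1):
        for c in range(n - size + 1):
            total = sum(grid[i][j]
                        for i in range(r, r + size)
                        for j in range(c, c + size))
            averages.append(total / (size * size))
    return averages


def entropy(values, bins=20, low=-1.0, high=1.0):
    """Shannon entropy of the histogram of `values` (bin count is assumed)."""
    counts = Counter(
        min(int((v - low) / (high - low) * bins), bins - 1) for v in values
    )
    n = len(values)
    return -sum((k / n) * math.log(k / n) for k in counts.values())


def multiscale_entropy(grid, sizes=(6, 12, 25)):
    """Average the entropy of window averages over the three filter sizes."""
    return sum(entropy(window_averages(grid, s)) for s in sizes) / len(sizes)
```

Under these assumptions, a grid split into two homogeneous halves produces window averages spread between -1 and +1 and therefore a higher value than a finely mixed grid, whose window averages concentrate near zero.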
Interdependencies affect the group dynamics. As we increase the reward for interdependencies, the initially stable patches emerging from promoting segregation become dynamic and mix with the other kind. The properties of the population and associated activities reflect the change of dynamics. Agents create an internal hierarchy where younger agents go out and hunt and elder agents segregate and ensure reproduction. Evidence of such behavior is that the average age of agents decreases and the hunting rate increases (Figure 4a and 4c), and that the average hunter age is much lower than the average agent age (Figure 4a and 4e). Moreover, the maximum age of agents per kind increases (Figure 4b), showing that some agents stay protected and do not hunt. The hunting strategy of agents is also affected by increasing interdependencies. Pack size increases consistently with interdependence rewards. Figure 4d shows the size of hunting clusters one step before hunting an agent. The increase in cluster size with interdependence rewards suggests that agent association yields better results. It also shows that hostile systems favor agglomeration of agents for safety, which can result in ultimate polarization. We also analyzed the effects of the vigilance reward on the dynamics for multiple reward values. Results show that a higher vigilance reward increases intra-kind interaction and results in more segregation (see Section S3).
Diverse areas attract younger people and people are older in segregated areas. We show that older agents are more segregated than younger ones in the model (see Figure 2b). The behavior has been observed in the model and verified with human behavior using Census data. We analyzed the relationship between age and segregation using Census data across the whole US (see Section S4). A segregation metric based on racial entropy correlated positively with median age by census tract (r=0.4). Our simulations shed light on a non-trivial observation about current societies.
In summary, our experiments show that increasing interdependencies among kinds can be applied to reduce segregation. Moreover, hostile interdependencies will result in in-group cooperation for hunting and competition for sheltering. The emergent behavior of the population can be framed in the exploiter and explorer discussion. One part of the population chooses to segregate and another to go out and explore. The one who explores hunts and is vulnerable to being hunted, but creates spatial integration. The one who segregates lives longer and ensures reproduction of its own kind. In this model, explorers tend to be younger and keepers tend to live longer. Spatial mixing was achieved by increasing interaction rewards but was accompanied by larger clusters of agents of the same size. Polarization may arise when there is an adversarial relationship between the parts that segregate from each other. More generally, emergent behaviors lie in a nonlinear space where interaction properties determine outcomes, which may happen simultaneously and in different combinations.
4 Discussion
We created an artificial environment for testing rules of interactions and incentives by observing the behaviors that emerge when applied to multi-agent populations. Incentives can generate surprising behaviors because of the complexity of social systems. As problems become complex, evolutionary computing is necessary to achieve sustainable solutions. We combine system modeling (ABMs) with artificial intelligence (RL) in order to explore the space of solutions associated with promoted incentives. RL provides ABMs with the information processing capabilities that enable the exploration of strategies that satisfy the conditions imposed by the interaction rules. In turn, ABMs provide RL with access to models of collective behavior that achieve emergence and complexity. While ABMs provide access to the complexity of the problem space, RL facilitates the exploration of the solution space. Our methodology opens a new avenue for policy makers to design and test incentives in artificial environments.
Acknowledgements
We would like to thank the Intel AI DevCloud Team for granting access to their cloud with powerful parallel processing capabilities. We would also like to thank Dhaval Adjodah for his valuable suggestions on training RL algorithms.
Authors Contributions
ES, YBY and AJM contributed equally in the conceptualization, development and interpretation of the experiments as well as in the paper write up.
Data Availability
The source code of the model implementation as well as the data generated to create this report will be made available upon publication.
Additional Information
We declare that we have no competing interests.
References
Supplement
S1 Future Work
There are many potential improvements to our work. We classify directions of future work under three categories: representation, training and experimentation. Our method can be advanced by representing agents more realistically, such as by introducing heterogeneous personalities to agents or by facilitating network structure over agents to promote alliances. Moreover, training RL agents yields better results with sophisticated exploration strategies [nikolov2018information, tang2017exploration, fu2017ex2]. In addition to exploration strategies, MARL is shown to perform better with curriculum learning [bansal2017emergent]. Our aim is to extend the work on multi-agent curriculum learning to our problem.
The Schelling and Predator-Prey models cover only a small portion of the ABM domain [macy2002factors]. We are currently working on extending this artificial environment to other ABMs, e.g. the Axelrod model [axelrod1997dissemination]. Our goal is to develop an easy interface where policy makers and AI researchers can collaborate on solving societal problems.
S2 Statistical Significance
We validate the significance of the patterns we observe during the execution of the simulation as we change the IR incentive in Figure 3. We analyze the distribution of values across the last 1000 interactions for each IR value and test the difference among their averages. In Table S1 we summarize the results of the statistical tests. The differences in averages are statistically significant across all pairs of curves.
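The text does not state which statistical test was used. As one possible illustration of comparing the averages of two curves, the sketch below computes a two-sample Welch t statistic and approximates the two-sided p-value with the normal distribution, which is reasonable only for large samples such as the 1000 values per curve mentioned above. The function name and the choice of test are assumptions.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two samples, with a normal-approximation p-value.

    The p-value uses the standard normal tail, which approximates the
    t distribution well only when both samples are large.
    """
    var_a = variance(a) / len(a)
    var_b = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(var_a + var_b)
    p = math.erfc(abs(t) / math.sqrt(2))
    return t, p
```

Swapping the two samples only flips the sign of t and leaves its magnitude and the p-value unchanged, which is why each pair in a table of pairwise comparisons can appear twice with the same magnitude.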
S3 Vigilance Reward
We analyze the effects of the Vigilance Reward (VR) on the dynamics of agents. In Figures S1 and S2 we show the impact on segregation and age distribution for multiple values of VR. The results show that an increased VR increases intra-kind behavior and as a result increases segregation. Therefore, segregation may also be fostered by other types of behaviors.
S4 Agents Age
We analyze the significance of the spatial distribution of agent ages shown in Figure 2b. In Figure S3 we show the entropy of the spatial distribution of agent ages at multiple iteration times and interdependence rewards (IR), together with a randomized case for comparison. The randomized case is constructed by drawing agent ages on the grid from a uniform distribution between (0, 1) and calculating the entropy. The difference between the random case and the empirical results is significant in all cases. We tested significance by comparing each curve. A summary of the test results is presented in Table S2.
The behavior has been observed in the model and verified with human behavior using Census data. We analyzed the relationship between age and segregation using Census data across the whole US. A segregation metric based on racial entropy correlated positively with median age by census tract (r=0.4). In Figure S4 we present a scatter plot of the segregation metric (x-axis) and average age (y-axis) of each census tract (dots).
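Table S1: Results of the statistical tests comparing the averages across pairs of IR values.
First IR  Second IR  t-value  p-value
0  25  4.373  6.44e06
0  50  15.935  0.0
0  75  23.890  0.0
25  0  4.373  6.44e06
25  50  11.267  0.0
25  75  20.331  0.0
50  0  15.925  0.0
50  25  11.267  0.0
50  75  11.691  0.0
75  0  23.890  0.0
75  25  20.331  0.0
75  50  11.691  0.0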
Table S2: Results of the statistical tests comparing the entropy of the spatial distribution of agent ages across IR values and the randomized case.
First case  Second case  t-value  p-value
Random  IR: 75  18.755  0.0
Random  IR: 50  17.105  0.0
Random  IR: 25  16.347  0.0
Random  IR: 0  15.665  0.0
IR: 75  Random  18.755  0.0
IR: 75  IR: 50  2.676  0.010
IR: 75  IR: 25  3.772  0.0
IR: 75  IR: 0  3.520  0.001
IR: 50  Random  17.105  0.0
IR: 50  IR: 75  2.676  0.010
IR: 50  IR: 25  1.214  0.231
IR: 50  IR: 0  1.039  0.304
IR: 25  Random  16.347  0.0
IR: 25  IR: 75  3.772  0.0
IR: 25  IR: 50  1.214  0.231
IR: 25  IR: 0  0.144  0.886
IR: 0  Random  15.665  0.0
IR: 0  IR: 75  3.521  0.001
IR: 0  IR: 50  1.039  0.304
IR: 0  IR: 25  0.144  0.886