Modeling the Formation of Social Conventions in Multi-Agent Populations

by   Ismael T. Freire, et al.

In order to understand the formation of social conventions we need to know the specific role of control and learning in multi-agent systems. To advance in this direction, we propose, within the framework of the Distributed Adaptive Control (DAC) theory, a novel Control-based Reinforcement Learning architecture (CRL) that can account for the acquisition of social conventions in multi-agent populations that are solving a benchmark social decision-making problem. Our new CRL architecture, as a concrete realization of DAC multi-agent theory, implements a low-level sensorimotor control loop handling the agent's reactive behaviors (pre-wired reflexes), along with a layer based on model-free reinforcement learning that maximizes long-term reward. We apply CRL in a multi-agent game-theoretic task in which coordination must be achieved in order to find an optimal solution. We show that our CRL architecture is able to both find optimal solutions in discrete and continuous time and reproduce human experimental data on standard game-theoretic metrics such as efficiency in acquiring rewards, fairness in reward distribution and stability of convention formation.







1 Introduction

In his seminal work “Convention” [1], David Lewis defines social conventions as regularities in action that emerge to solve coordination problems. According to Lewis’ approach, conventions exhibit two characteristic features: (i) they are self-sustaining and (ii) they are largely arbitrary. Self-sustaining, in the sense that a group of agents in a given population will continue to conform to a particular convention as long as they expect the others to do so; and arbitrary, in the sense that there are other equally plausible solutions to the same problem. Understanding the set of conditions that lead to the formation of such conventions is still an open question, traditionally studied through cooperation and competition games within the purview of Game Theory [2].

In game theory, a Nash equilibrium [3] is a set of strategies, one for each player, such that if everyone behaves according to them, no player has an incentive to deviate unilaterally from their choice. Therefore, from a Lewisian perspective, we could say that all Nash equilibria are self-sustaining. However, they can only be operationalized as conventions in contexts in which there is more than one possible equilibrium. This is usually the case in coordination games, a sub-domain of game theory.

Over the past decades, the study of cooperation in humans and other animals has been dominated by classical game theoretical approaches [4, 5]. However, shortcomings in classical formulations have subsequently led to alternative models [6, 7]. One of the major concerns was related to the ecological validity of experiments such as the Iterated Prisoner’s Dilemma (IPD) [8], with critics arguing that the conditions under which these experiments are conducted are hardly (if ever) found in realistic social circumstances [4, 6, 9]. In particular, many studies pointed to the fact that cooperation between humans and animals usually requires a continuous exchange of information in order for conventions to emerge, a feature that the IPD and other related cooperation games lack, precisely because they are based on discrete-time turns that impose a significant delay between actions [10, 9, 11]. In order to address this problem, several studies have devised ways to modify standard game theoretic discrete-time tasks into dynamic versions where individuals can respond to the other agent’s actions in real or continuous time [12, 13, 14, 15, 16, 17]. Their results point out that cooperation can be more readily achieved in the dynamic version of the task due to the rapid flow of information between individuals and their capacity to react in real time [13, 18].

A recent example of such an ecological approach can be found in [19], where Hawkins and Goldstone show that continuous-time interactions help to converge to more stable strategies in a game theoretic task (Battle of the Exes) compared to the same task modeled in discrete-time. They also show that the involved payoffs affect the formation of social conventions. According to these results, they suggest that real-life coordination problems can be solved either by forming a convention or through spontaneous coordination and that these solutions depend on what is at stake if the coordination fails. To illustrate this point, they suggest two real-life examples of a coordination problem: On one hand, when we drive a car, the stakes are high because if we fail to coordinate the outcome could be fatal, so we resort to a convention – e.g. to drive on the right side of the road. On the other hand, when we try to avoid people on a crowded street, we do it “on the fly” because the stakes are low, so it’s not risky to rely on purely reactive behaviors (e.g. avoidance behavior) to solve it.

Another paradigm shift in the study of decision-making has been the one produced in the cognitive sciences with the introduction of the Embodied Cognition perspective [20]. This new perspective made it possible to draw a clear distinction between a ’disembodied mind’ first generation of cognitive science and an ’embodied mind’ second generation [21]. The first generation approach relied heavily on the ’computational metaphor’ to describe cognition. It assumed that cognitive processes could be substrate-independent, like software that can be implemented on different hardware. On the other hand, the approach of the new generation viewed cognition as substrate-dependent, embodied and situated in a world with which it has to interact to survive. It assumed that cognitive processes arise from a close interaction between mind, body and environment. From this perspective, perception and action are two tightly coupled phenomena known as sensorimotor contingencies. This view on cognition was strongly validated by the discovery of canonical and mirror neurons in the mammalian premotor cortex.


In this paper, we introduce a computational model of embodied cognitive agents involved in a social decision-making task called the Battle of the Exes, and we compare the performance metrics of our cognitive model against the human behavioral data published in [19]. For this purpose, we develop a Control-based Reinforcement Learning (CRL) cognitive architecture based on the principles of Distributed Adaptive Control (DAC) theory. Our architecture integrates a low-level reactive control loop to manage within-round conflicts, along with a policy learning algorithm to acquire across-round strategies. We run simulations showing that the modeled cognitive agents rely more on across-round policy learning when the stakes of the game are higher and that reactive (feedback) control helps enhance performance in terms of efficiency and fairness. This provides a computational hypothesis explaining key aspects of the emergence of social conventions such as turn-taking or pure dominance in game-theoretic setups and provides new experimental predictions to be tested in human coordination tasks.

As for computational modeling of game-theoretical tasks, there is an extensive body of literature where the study of the emergence of conflict and cooperation in agent populations has been addressed, especially through the use of Multi-Agent Reinforcement Learning (for extensive reviews, check [23, 24, 25]). In this direction, a lot of focus has been recently directed towards developing enhanced versions of the Deep Q-Learning Network architecture proposed in [26], particularly on their extensions to the social domain [27, 28, 29, 30]. This architecture uses a reinforcement learning algorithm that extracts abstract features from raw pixels through a deep convolutional network. Along those lines, some researchers [27, 28, 29] are modeling the type of conflicts represented in the classic game-theoretic tasks (e.g. the IPD) into more ecologically valid environments [27] where agent learning is based on deep Q-networks [28, 29]. For instance, agents based on this cognitive model are already capable of learning how to play a two-player video game such as Pong from raw sensory data and achieve human-level performance [26], both in cooperative and competitive modes [30]. Other similar approaches have focused on constructing agent models that achieve good outcomes in general-sum games and complex social dilemmas, by focusing on maintaining cooperation [31], by making an agent prosocial (taking into account the other’s rewards) [32] or by conditioning its behavior solely on its outcomes [33].

However, in all of the above cases, the games studied involve social dilemmas that provide only a single cooperative equilibrium, whereas the case we study in this paper provides several, a prerequisite for studying the formation of conventions. Also, the above examples relax one key assumption of embodied agents, that is, that sensory inputs must be obtained through one’s own bodily sensors. Agents in previous studies gather their sensory data from a third-person perspective. They are trained using raw pixel data from the screen, with either completely observable [30, 31, 32, 33] or partially observable [28, 29] conditions. Another point of difference between previous approaches and the work presented here concerns the continuity of the interaction itself. Most of the work done so far in multi-agent reinforcement learning using game theoretical setups has been modeled using grid-like or discrete conditions [27, 28, 29, 31, 32, 33]. Although this is still an advance insofar as they provide a spatial and temporal dimension (situatedness) to many classical games, they still lack the continuous-time properties of real-world interactions. Even in the few cases where the coordination task has been modeled in real time [30] and the agents are situated, the aforementioned approaches do not consider lower-level sensorimotor control loops bootstrapping learning in higher levels of a cognitive architecture.

In contrast, the Control-based Reinforcement Learning (CRL) model we introduce here follows the Distributed Adaptive Control theory [34, 35], where learning processes are bootstrapped from sensorimotor control loops, as we will see in the next section. Moreover, we systematically compare our results to the experimental human data collected in [19] to study the conditions under which such agents are able to converge towards social conventions. For this purpose, we use an already designed and tested game-theoretical task called the Battle of the Exes [19], which we explain at the end of the following section. In Section 3, we describe the CRL architecture and its two layers: one dealing with the low-level intrinsic behaviors of the agent and another based on model-free reinforcement learning, allowing the agents to acquire rules for maximizing long-term reward [36]. In Section 4 we compare the results of our model against existing human data. Finally, we conclude this study by discussing our main results and their implications in Section 5, where we also comment on limitations and possible extensions of the current model and outline experimental predictions.

2 Foundations of our approach

Our CRL model, introduced in this paper, puts together and advances two important studies in the literature (described in detail below). Firstly, we use an existing spatially and temporally extended game-theoretic task, called the Battle of the Exes, for which human behavioral data is available in various experimental conditions [19]. Secondly, our CRL model follows the principles of DAC theory, where learning processes are bootstrapped from pre-existing reactive control loops [34, 35]. The main objective of this paper is to validate our control-theoretic cognitive model by benchmarking it with the human behavioral results of [19]. While doing so, we will identify the specific roles of reactive feedback control and policy learning in the emergence of social conventions.

2.1 Game Theory Benchmark

The Battle of the Exes is a coordination game similar to the classic Battle of the Sexes [37] that supposes the following social scenario: A couple just broke up and they don’t want to see each other. Both have their coffee break at the same time, but there are only two coffee shops in the neighborhood: one offers great coffee whereas the other offers average coffee. If both go to the great coffee shop they will come across each other and will not enjoy the break at all. Therefore, if they want to enjoy their coffee break, they will have to coordinate in a way that they avoid each other every day. This situation can be modeled within the framework of game theory with a payoff relation such as $a > b > 0$, where $a$ is the payoff for getting the great coffee, $b$ the payoff for the average coffee and $0$ the payoff for both players if they go to the same location.

In [19], Hawkins and Goldstone perform a human behavioral experiment based on the above-mentioned game to investigate how two factors, the continuity of the interaction (ballistic versus dynamic) and the stakes of the interaction (high versus low condition), affect the formation of conventions in a social decision-making task. Concerning the stakes of the interaction, the payoff matrix is manipulated to create two different conditions, high and low, based on a bigger and smaller difference between rewards, respectively. The payoff matrices in Figure 1 illustrate these two conditions.

Figure 1: Payoff matrices of the original “Battle of the Exes” game. The numbers indicate the reward received by each player (red and blue). Reproduced from [19].
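The claim that this game has more than one equilibrium can be checked with a few lines of code. The following is a minimal sketch, not part of the original study: it assumes the reward values stated in the rules of the simulated task (1 for the low reward, 2 or 4 for the high reward, 0 for a tie), and the function names `payoffs` and `pure_nash_equilibria` are hypothetical.

```python
from itertools import product

ACTIONS = ("high", "low")


def payoffs(a1, a2, high_reward, low_reward=1, tie_reward=0):
    """Return (payoff_1, payoff_2) for one round, given both players' choices."""
    if a1 == a2:
        # Both players went to the same spot: a tie, nobody is rewarded.
        return tie_reward, tie_reward
    value = {"high": high_reward, "low": low_reward}
    return value[a1], value[a2]


def pure_nash_equilibria(high_reward):
    """List the action pairs from which no player gains by deviating alone."""
    equilibria = []
    for a1, a2 in product(ACTIONS, repeat=2):
        p1, p2 = payoffs(a1, a2, high_reward)
        best_1 = all(p1 >= payoffs(d, a2, high_reward)[0] for d in ACTIONS)
        best_2 = all(p2 >= payoffs(a1, d, high_reward)[1] for d in ACTIONS)
        if best_1 and best_2:
            equilibria.append((a1, a2))
    return equilibria


for label, high in (("high stakes", 4), ("low stakes", 2)):
    print(label, pure_nash_equilibria(high))
```

Under these assumptions both payoff conditions yield the same two pure equilibria, (high, low) and (low, high), which is what leaves room for a convention to select among them.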

As for the continuity of the interaction, the experiment has a ballistic and a dynamic condition. In the ballistic condition, as in classical game theory, the players can only choose an action at the beginning of every round of the game, without any further control over the outcome. However, in the dynamic condition, the players can freely change the course of their avatars until one of them reaches a reward (for a visual example of the difference between conditions, see the original videos accompanying [19]). In both conditions, the round ends when one of the players reaches one of the reward spots that represent the coffee shops. Altogether, this results in four conditions: two for the stakes of the interaction (high vs. low) combined with two for the continuity of the interaction (ballistic vs. dynamic). For the experiment, they pair human players in dyads that, depending on the payoff condition, play 50 (high) or 60 (low) consecutive rounds together. In order to analyze the coordination between the players of each dyad, they use three measures (efficiency, fairness, and stability) based on Binmore’s three levels of priority [38]:

  • Efficiency – It measures the cumulative sum of rewards that players were able to earn collectively in each round, divided by the total amount of possible rewards. If the efficiency value is 1, it means that the players got the maximum amount of reward.

  • Fairness – It quantifies the balance between the earnings of the two players. If the fairness value is 1, it means that both players earned the higher payoff the same amount of times.

  • Stability – It measures how well the strategy is maintained over time. In other words, it quantifies how predictable the outcomes of the following rounds are, based on previous results, by “using the information-theoretic measure of surprisal, which Shannon defined as the negative logarithm of the probability of an event” [19].


In other words, Efficiency measures utility maximization, Fairness measures the amount of cooperation, and Stability measures the amount of conventions formed. The results show that players in the dynamic condition achieve greater efficiency and fairness than their counterparts in the ballistic condition, both in the high payoff and low payoff setups. However, their key finding is that in the dynamic condition, the players coordinate more “on the fly” (i.e. without the need of a long-term strategy) when the payoff is low, but when the payoff is high, the participants coordinate into more stable strategies. Namely, they identified the stakes of the interaction as a crucial factor in the formation of social conventions when the interaction happens in real-time.
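To make the three measures concrete, here is a minimal sketch of how they could be computed from the round-by-round record of one dyad. The exact formulas used in [19] are not reproduced in this section, so everything below is an assumption: fairness is taken as the ratio between the smaller and the larger count of high-reward wins, and stability is approximated by the mean surprisal of each outcome under a Laplace-smoothed model conditioned on the previous outcome (lower surprisal meaning a more stable strategy). All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def efficiency(rewards_1, rewards_2, high_reward, low_reward=1):
    """Collected reward divided by the maximum collectable reward."""
    earned = sum(rewards_1) + sum(rewards_2)
    possible = (high_reward + low_reward) * len(rewards_1)
    return earned / possible


def fairness(rewards_1, rewards_2, high_reward):
    """Ratio of the smaller to the larger number of high-reward wins."""
    wins_1 = sum(r == high_reward for r in rewards_1)
    wins_2 = sum(r == high_reward for r in rewards_2)
    if max(wins_1, wins_2) == 0:
        return 1.0  # assumption: equal (zero) counts are treated as fair
    return min(wins_1, wins_2) / max(wins_1, wins_2)


def mean_surprisal(outcomes, n_outcomes=3):
    """Average of -log2 p(outcome | previous outcome) over a sequence.

    Probabilities are estimated online with add-one smoothing, so the
    first time a transition is seen it has probability 1 / n_outcomes.
    """
    if len(outcomes) < 2:
        raise ValueError("need at least two rounds")
    counts = defaultdict(Counter)
    total = 0.0
    for prev, cur in zip(outcomes, outcomes[1:]):
        seen = counts[prev]
        p = (seen[cur] + 1) / (sum(seen.values()) + n_outcomes)
        total += -math.log2(p)
        seen[cur] += 1
    return total / (len(outcomes) - 1)


# Perfect turn-taking in the high-payoff condition (illustrative data only).
r1 = [4, 1, 4, 1, 4, 1]
r2 = [1, 4, 1, 4, 1, 4]
print(efficiency(r1, r2, high_reward=4), fairness(r1, r2, high_reward=4))
print(mean_surprisal(["p1", "p2", "p1", "p2", "p1", "p2"]))
```

In this illustrative turn-taking record both efficiency and fairness equal 1, and the surprisal shrinks as the alternation repeats, which is the signature of a stable convention under these assumed definitions.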

2.2 Distributed Adaptive Control Theory

DAC is a theory of brain and mind that proposes that cognition is based on four control layers operating at different levels of abstraction [39, 34, 35]. The first level, the Soma layer, contains the whole body of the agent with all the sensors and actuators and represents the interface between the agent and its environment. This layer also contains the physiological needs of the agent, which are the driving force of the whole system. In the Reactive layer, those physiological needs are satisfied through the self-regulation of internal drives, implemented as reactive sensorimotor loops for maintaining stability (homeostasis). These reactive interactions bootstrap the learning of instantaneous policies implemented in the Adaptive layer for acquiring a state-space of the agent-environment interaction. Outside the scope of this paper, the Contextual layer acquires temporally extended policies that contribute to the acquisition of more abstract cognitive abilities such as goal selection, memory and planning [34]. These higher-level representations, in turn, affect the behavior of lower layers in a top-down fashion. Control in this architecture is therefore distributed between all layers thanks to the interactions in both directions, top-down and bottom-up, as well as laterally within each layer.

DAC makes explicit the distinction between real-time control on the one hand (Reactive layer) and perceptual and behavioral learning on the other (Adaptive layer). It is, therefore, an adequate theoretical framework for understanding the specific roles of reactive control and policy learning in the formation of social conventions, which is the aim of this paper. This allows us to identify the functions that agents will need in both the ballistic and the dynamic conditions of the Battle of the Exes. In the ballistic condition, where players can only make a decision at the beginning of each round, our agents will only need to use the Adaptive layer for solving the task. In the dynamic condition, by contrast, the agents will need both the Reactive and the Adaptive layer, as they will be moving through the environment, sensing and acting in real time, and not only making abstract discrete decisions.

3 Methods

3.1 Control-Based Reinforcement Learning

In this section, we introduce our Control-based Reinforcement Learning (CRL) model. This is an operational minimal model, where reinforcement learning interacts with a feedback controller by inhibiting specific reactive behaviors. The CRL is a model-free approach to reinforcement learning, but with the addition of a reactive controller (for model-based adaptive control see [40]). The CRL is composed of two layers, a Reactive and an Adaptive layer. The former governs sensorimotor contingencies of the agent within the rounds of the game, whereas the latter is in charge of learning across rounds.

Figure 2: Representation of the Control-based Reinforcement Learning (CRL) model. On top, the Adaptive layer (AL; reinforcement learning control loop), composed of a Critic or value function, an Actor or action policy, and an inhibitor function. At the bottom, the Reactive layer (RL; sensorimotor control loop), composed of three sets of sensors (corresponding to the High reward, the Low reward and the other Agent, respectively), three functions (corresponding to the orienting towards High reward, orienting towards Low reward and avoid agents behaviors, respectively) and two motors (left and right). The action selected by the AL is passed through the inhibitor function, which will turn off one of the attraction behaviors of the RL depending on the action selected. If the action is go to the high, the orienting towards low reward reactive behavior will be inhibited. If the AL selects go to the low, the RL will inhibit its orienting towards high reward behavior. If the AL selects none, the RL will act normally without any inhibition.

3.1.1 Reactive Layer

The Reactive Layer (RL) represents the agent’s sensorimotor control system and is assumed to be prewired (typically by evolutionary processes, from a biological perspective). In the Battle of the Exes game that we are considering here, we equip agents with two predefined reactive behaviors: orienting towards rewards and escaping from other agents. This means that, even in the absence of any learning process, the agents are intrinsically attracted to the reward spots and repelled by each other. This intrinsic dynamic will bootstrap learning in the Adaptive Layer, as we shall see.

To model this layer, we follow an approach inspired by Valentino Braitenberg’s Vehicles [41]. These simple vehicles consist of just a set of sensors and actuators (e.g. motors) that, depending on the type of connections created between them, can perform complex behaviors. For a visual depiction of the two behaviors (orienting towards rewards and avoid agents), see this video.

  • The orienting towards rewards behavior is made by a combination of a crossed excitatory connection and a direct inhibitory connection between the reward spot sensors ($s^{X}$) and the motors ($m$), plus a constant forward speed $f$,

    $$m_{l} = f + s^{X}_{r} - s^{X}_{l}, \qquad m_{r} = f + s^{X}_{l} - s^{X}_{r},$$

    where $s^{X}_{l}$ is the sensor positioned on the left side of the robot indicating the proximity of a reward spot, and $X$ is either the high ($H$) or the low reward ($L$) sensor. The sensors perceive the proximity of the spot: the closer the reward spot, the higher the sensor activation. Therefore, if no reward spot is detected ($s^{X}_{l} = s^{X}_{r} = 0$), the robot will go forward at speed $f$. Otherwise, the most activated sensor (left or right) will make the robot turn in the direction of the corresponding reward spot.

  • The avoid agents behavior is made by the opposite combination: a direct excitatory connection and a crossed inhibitory connection, but in this case between the agent sensors ($s^{A}$) and the motors ($m$),

    $$m_{l} = f + s^{A}_{l} - s^{A}_{r}, \qquad m_{r} = f + s^{A}_{r} - s^{A}_{l},$$

    where $s^{A}_{l}$ is the sensor positioned on the left side of the robot indicating the proximity of the other agent. The closer the other agent, the higher the sensor activation. In this case as well, if no agent is detected ($s^{A}_{l} = s^{A}_{r} = 0$), the robot will go forward at speed $f$. Otherwise, the most activated sensor will make the robot turn in the opposite direction of the other agent, thus avoiding it.
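The two wiring schemes described above can be sketched as a pair of small functions. This is an illustrative reading of the verbal description (crossed versus direct, excitatory versus inhibitory connections), assuming unit connection weights and a differential-drive robot in which a faster right wheel turns the body to the left. The value of `FORWARD_SPEED` is a placeholder and is not taken from the paper; the function names are hypothetical.

```python
FORWARD_SPEED = 0.3  # placeholder value, chosen only for illustration


def orient_towards(s_left, s_right, f=FORWARD_SPEED):
    """Crossed excitation plus direct inhibition: steer towards the stimulus.

    Each sensor excites the motor on the opposite side and inhibits the
    motor on its own side, so the wheel far from the target spins faster.
    """
    m_left = f + s_right - s_left
    m_right = f + s_left - s_right
    return m_left, m_right


def avoid(s_left, s_right, f=FORWARD_SPEED):
    """Direct excitation plus crossed inhibition: steer away from the stimulus."""
    m_left = f + s_left - s_right
    m_right = f + s_right - s_left
    return m_left, m_right


# Stimulus mostly on the left side of the robot.
print(orient_towards(0.8, 0.2))  # right wheel faster: turn left, towards it
print(avoid(0.8, 0.2))           # left wheel faster: turn right, away from it
print(orient_towards(0.0, 0.0))  # nothing sensed: both wheels at forward speed
```

How the outputs of several behaviors are combined into one motor command is not specified in this section, so the sketch deliberately stops at the per-behavior level.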

3.1.2 Adaptive Layer

The agent’s Adaptive layer (AL) is based on a model-free reinforcement learning algorithm that endows the agent with learning capacities for maximizing long-term reward. Functionally, it decides the agent’s action at the beginning of the round, based on the state of the previous round and its policy. The possible states are three: high, low and tie; and indicate the outcome of the previous round for each agent. That is, if an agent got the high reward on the previous round, the state is high; if it got the low reward, the state is low; and if both agents went to the same reward, the state is tie. The actions are three as well: go to the high, go to the low and none.

The Adaptive Layer implements reinforcement learning for maximizing accumulated reward over rounds through action, similar to the one implemented in [42] and adapted to operate on discrete state and action spaces. More specifically, we use an Actor-Critic Temporal Difference Learning algorithm (TD-learning), which is based on the interaction between two main components:

  • an Actor, or action policy $\pi$, which learns the mapping from states ($s \in S$) to actions ($a \in A$) and defines which action $a$ is performed in each state $s$, based on a probability $P(A = a \mid S = s)$;

  • and a Critic, or value function $V_\pi(s)$, which estimates the expected accumulated reward of a state $s$ when following the policy $\pi$,

    $$V_\pi(s) = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s \right],$$

    where $\gamma$ is the discount factor, and $r_t$ is the reward at step $t$.

The Critic also estimates whether the Actor performed better or worse than expected, by comparing the observed reward with the prediction of $V_\pi$. This provides a learning signal to the Actor for optimizing it, where actions performing better (resp. worse) than expected are reinforced (resp. diminished). This learning signal is called the temporal-difference error (TD error). The TD error $\delta_t$ is computed as a function of the prediction from the value function $V_\pi$ and the currently observed reward $r_t$,

$$\delta_t = r_t + \gamma V_\pi(s_t) - V_\pi(s_{t-1}),$$

where $\gamma$ is a discount factor whose value is set empirically. When $\delta_t > 0$ (respectively $\delta_t < 0$), this means that the action performed better (resp. worse) than expected. The TD error signal is then sent both to the Actor and back to the Critic for updating their current values.

The Critic (value function) is updated following

$$V_\pi(s_{t-1}) \leftarrow V_\pi(s_{t-1}) + \alpha_C \, \delta_t,$$

where $\alpha_C$ is a learning rate.

The update of the Actor is done in two steps. First, a matrix $E(a, s)$, with rows indexed by discrete actions and columns by discrete states, is updated according to the TD error,

$$E(a_t, s_{t-1}) \leftarrow E(a_t, s_{t-1}) + \alpha_A \, \delta_t,$$

where $\alpha_A$ is a learning rate, $a_t$ is the current action and $s_{t-1}$ the previous state. $E(a_t, s_{t-1})$ integrates the observed TD errors when executing the action $a_t$ in the state $s_{t-1}$. It is initialized to the same value for all $(a, s)$ and kept above a fixed lower bound. $E$ is then used for updating the probabilities $P(A = a \mid S = s)$ by applying Laplace’s Law of Succession [43],

$$P(A = a_t \mid S = s_{t-1}) = \frac{E(a_t, s_{t-1}) + 1}{\sum_{a \in A} E(a, s_{t-1}) + k},$$

where $k$ is the number of possible actions.

Figure 3: Panel A: Top view of an agent’s body, as represented by the dark-blue large circle. Here the agent is facing toward the top of the page. The two thin black rectangles at the sides represent the two wheels, controlled by their speed. On its front, the agent is equipped with three types of sensors. A: agent sensors (sensing the proximity of the other agent), L: low reward sensors, and H: high reward sensors. For each type, the agent is able to sense the proximity of the corresponding entity both on its left and right side (hence six sensors in total). Panel B: Screenshot of the Experimental Setup (top view). In blue, the two cognitive agents in their initial position at the start of a round. In green, the two reward spots; the bigger one representing the high reward and the smaller, the low reward (i.e. lower payoff). In white, the circles that delimit the tie area.

Laplace’s Law of Succession is a generalized histogram (frequency count) where it is assumed that each value has already been observed once prior to any actual observation. By doing so it prevents null probabilities (when no data has been observed, it returns a uniform probability distribution). Therefore, the higher the accumulated TD error $E(a, s)$ of an action $a$ in a state $s$, the more probable it is that $a$ will be executed in $s$.

Using these equations, actions performing better than expected (positive TD error, $\delta_t > 0$) will increase their probability of being chosen the next time the agent is in state $s$. When the TD error is negative ($\delta_t < 0$), the probability will decrease. If this probability distribution converges for both agents, we consider that a convention has been attained.
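The learning loop of the Adaptive layer can be sketched as a small tabular actor-critic. This is a minimal illustration under stated assumptions rather than the authors' implementation: the TD error is taken as the observed reward plus the discounted value of the new state minus the value of the previous state, the table of accumulated TD errors starts at zero and is clipped at zero, and the hyperparameter defaults (`gamma`, `alpha_critic`, `alpha_actor`) are placeholders that are not taken from the paper. The class and method names are hypothetical.

```python
import random

STATES = ("high", "low", "tie")
ACTIONS = ("go_high", "go_low", "none")


class AdaptiveLayer:
    """Tabular actor-critic with a Laplace-smoothed action policy."""

    def __init__(self, gamma=0.5, alpha_critic=0.25, alpha_actor=0.25, seed=None):
        # Placeholder hyperparameters, chosen only for illustration.
        self.gamma = gamma
        self.alpha_critic = alpha_critic
        self.alpha_actor = alpha_actor
        self.V = {s: 0.0 for s in STATES}                            # Critic
        self.E = {s: {a: 0.0 for a in ACTIONS} for s in STATES}      # Actor table
        self.rng = random.Random(seed)

    def policy(self, state):
        """Laplace's Law of Succession: (E + 1) / (sum of E + number of actions)."""
        row = self.E[state]
        denom = sum(row.values()) + len(ACTIONS)
        return {a: (row[a] + 1.0) / denom for a in ACTIONS}

    def act(self, state):
        """Sample an action according to the current policy for this state."""
        probs = self.policy(state)
        return self.rng.choices(ACTIONS, weights=[probs[a] for a in ACTIONS])[0]

    def update(self, prev_state, action, reward, state):
        """Apply one TD update after a round and return the TD error."""
        delta = reward + self.gamma * self.V[state] - self.V[prev_state]
        self.V[prev_state] += self.alpha_critic * delta
        updated = self.E[prev_state][action] + self.alpha_actor * delta
        self.E[prev_state][action] = max(0.0, updated)  # assumed lower bound of 0
        return delta


agent = AdaptiveLayer(seed=0)
print(agent.policy("tie"))                    # uniform before any experience
agent.update("tie", "go_high", 4, "high")     # rewarded better than expected
print(agent.policy("tie"))                    # go_high is now more probable
```

With these placeholder settings a single rewarded round is enough to move the policy away from uniform, and the clipping keeps every probability strictly positive so that no action is ever ruled out.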

3.2 Multi-Agent Simulations

As in the Battle of the Exes benchmark [19], we follow a 2x2 between-subjects experimental design. One dimension represents the ballistic and dynamic versions of the game, whereas the other dimension is composed of the high and low difference between payoffs. Each condition is played by 50 agents who are paired in dyads and play together for 50 rounds of the game if they are in one of the high payoff conditions (ballistic or dynamic), or 60 rounds if they are in one of the low payoff conditions.

Regarding the task, we have developed the two versions (ballistic and dynamic) of the Battle of the Exes in a 2D simulated robotic environment (see Figure 3B for a visual representation). The source code to replicate this experiment is available online at:

In the ballistic condition, where there is no possibility of changing the action chosen at the beginning of the round, agents only use the Adaptive layer to operate. The two first actions (high and low) will take the agent directly to the respective reward spots, while the none action will choose randomly between them. In each round, the action chosen by the AL is sampled according to , where is the actual state observed by the agent.
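A ballistic round can be sketched as a pure function of the two chosen actions. The following is a minimal illustration, assuming that a tie occurs whenever both agents end up heading for the same spot and that rewards follow the values stated for the task (1 for the low reward, 2 or 4 for the high reward, 0 for a tie). The function names and the action labels are hypothetical.

```python
import random


def resolve_choice(action, rng=random):
    """Turn an Adaptive-layer action into the spot the agent heads for."""
    if action == "none":
        # The none action picks one of the two spots at random.
        return rng.choice(("high", "low"))
    return {"go_high": "high", "go_low": "low"}[action]


def play_ballistic_round(action_1, action_2, high_reward, low_reward=1, rng=random):
    """Return ((state_1, state_2), (reward_1, reward_2)) for one round."""
    spot_1 = resolve_choice(action_1, rng)
    spot_2 = resolve_choice(action_2, rng)
    if spot_1 == spot_2:
        # Assumed tie rule: same destination, no reward for either agent.
        return ("tie", "tie"), (0, 0)
    value = {"high": high_reward, "low": low_reward}
    return (spot_1, spot_2), (value[spot_1], value[spot_2])


print(play_ballistic_round("go_high", "go_low", high_reward=4))
print(play_ballistic_round("go_high", "go_high", high_reward=4))
```

The returned states (high, low or tie) are exactly the inputs each agent's policy conditions on in the following round.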

In the dynamic condition, the agent uses the whole architecture, with the Adaptive and the Reactive layer working together (see Figure 2). As in the previous condition, the agent’s AL chooses an action at the beginning of the round, based on the state of the previous round and its policy. This action is then signaled to the RL, that will inhibit the opposite reward-attraction reactive behavior according to the action selected by the AL. In the case that the AL chooses the action go to the high, the RL will inhibit the orienting towards low reward behavior, allowing the agent to focus only on the high reward. Conversely, if the AL chooses the action go to the low, the reactive attraction to the high reward will be inhibited. In both cases, the agent avoidance reactive behavior still operates. Finally, if the action none is selected, instead of choosing randomly between the other two actions as in the ballistic condition, the AL will rely completely on the behaviors of the RL to play that round of the game.
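The inhibition rule described in this paragraph amounts to a small lookup from the selected action to the set of reactive behaviors that remain active. A minimal sketch follows; the function name `active_behaviors` and the behavior labels are hypothetical.

```python
def active_behaviors(action):
    """Return which reactive behaviors stay active for a given AL action."""
    behaviors = {"orient_high": True, "orient_low": True, "avoid_agent": True}
    if action == "go_high":
        behaviors["orient_low"] = False    # focus on the high reward only
    elif action == "go_low":
        behaviors["orient_high"] = False   # focus on the low reward only
    elif action != "none":
        raise ValueError(f"unknown action: {action!r}")
    # Agent avoidance is never inhibited, whatever the action.
    return behaviors


for action in ("go_high", "go_low", "none"):
    print(action, active_behaviors(action))
```

Note that the none action leaves all three behaviors on, so the round is then played by the Reactive layer alone.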

The rules of the game are as follows: A round of the game finishes when one of the agents reaches a reward spot. If both agents are within the white circle area when this happens, it’s considered a tie, and both get 0 points. The small spot always gives a reward of 1, whereas the big spot gives 2 or 4 depending on the payoff condition (low or high respectively, see Figure 1). The reward spots are allocated randomly between the two positions at the beginning of each round.

4 Results

We report the main results of our model simulations in relation to human performance in the Battle of the Exes task [19], which are analyzed using: efficiency, fairness, and stability [38]. For each of these measures, we report the results of the model and plot them in contrast with human data from [19]. Then, we interpret those results and analyze the role of each layer of the CRL architecture in relation to the data obtained in each condition.

Figure 4: Results of Control-based Reinforcement Learning and TD-learning compared to human performance in the Battle of the Exes game, measured by Efficiency (left), Fairness (center) and Stability (right). The top panel shows the results on the high-payoff condition. The bottom panel shows the results on the low-payoff condition. Within each panel, blue bars represent the results on the ballistic condition, and red bars represent the results on the dynamic condition. Human data from [19]. All error bars reflect standard errors.

Regarding the efficiency scores in the low-payoff condition (see Figure 4, bottom-left panel), a non-parametric Kruskal-Wallis H-test first showed a statistically significant difference between groups. Post-hoc Mann-Whitney U-tests showed significant differences in efficiency between humans playing the ballistic condition of the game and the TD-learning benchmark algorithm. However, there were no significant differences between human scores in the dynamic condition and the scores achieved by the CRL model. The same statistical relationships hold in the high-payoff condition, where human ballistic scores and TD-learning scores were significantly different, while the CRL model shows no statistical difference from human dynamic scores.

As for the fairness scores in the low-payoff condition (see Figure 4, bottom-center panel), a non-parametric Kruskal-Wallis H-test showed no statistically significant difference between groups, which means that both TD-learning and the CRL model matched human scores on this metric in their respective ballistic and dynamic conditions. The story is similar for the high-payoff condition (see Figure 4, top-center panel). Although this time the Kruskal-Wallis H-test showed a significant difference between groups, the post-hoc analysis showed no statistical difference between human ballistic and TD-learning, nor between human dynamic and CRL.

On the stability metric, the results of the four conditions showed a non-Gaussian distribution, so a non-parametric Kruskal-Wallis H-test was performed, which showed a statistically significant difference between groups. The post-hoc Mann-Whitney U-tests showed that both the difference between human ballistic and TD-learning and the difference between human dynamic and the CRL model were statistically significant. In the high-payoff condition, a Kruskal-Wallis H-test also showed significant differences among all stability scores. Post-hoc Mann-Whitney U-tests confirmed the statistical difference between human ballistic scores and TD-learning. Similarly, human dynamic scores were significantly smaller than the ones obtained by the CRL model.
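For readers unfamiliar with the post-hoc test, the statistic underlying a Mann-Whitney U-test can be computed directly from pairwise comparisons. The sketch below shows only the U statistic, not the p-value, and the function name is hypothetical. In practice a statistics library would be used; for instance, SciPy provides `scipy.stats.kruskal` and `scipy.stats.mannwhitneyu`. Which software the original analysis used is not stated here.

```python
def mann_whitney_u(xs, ys):
    """U statistic of sample xs against sample ys.

    Counts, over all (x, y) pairs, how often x exceeds y; ties count one half.
    """
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u
```

The two directions always sum to the number of pairs, `mann_whitney_u(xs, ys) + mann_whitney_u(ys, xs) == len(xs) * len(ys)`, so a value near 0 or near that maximum indicates well-separated groups.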

4.1 Analysis

Figure 5: Top panel: Outcomes of two dyads of CRL agents (dyad 45 on the left, dyad 25 on the right) on the high dynamic condition, showing the formation of turn-taking (left) and pure dominance (right) equilibria. Each bar represents the outcome of a round of the game. A red bar means that player 1 got the high reward, and a blue bar means that player 2 got the high reward. Black bars represent ties. Bottom panel: Surprisal measure over rounds of play. When a convention is formed, surprisal drops because the outcomes become predictable.

Overall, the model achieved a good fit with the benchmark data. As in the human experiment, we observe that the dynamic (real/continuous-time) version of the model achieves better results in efficiency and fairness, and that this improvement is consistent regardless of the manipulation of the payoff difference.

The remarkable efficiency of the CRL model is due to the key role of the Reactive Layer in avoiding within-round conflict when both agents have chosen to go to the same reward, a feature that a ballistic model such as TD-learning lacks. The reactive behavior exhibited by the CRL model represents a kind of ’fight or flight’ response that can be triggered to make the agent attracted to or repelled by other agents, depending on the context in which it finds itself. In this case, due to the anti-coordination context presented in the Battle of the Exes, the reactive behavior provides the agent with a fast (flight) mechanism to avoid conflict. But in a coordination game like the Battle of the Sexes, this same reactive behavior could be tuned to provide an attraction (fight) response towards the other agent. Future work will extend this model to observe how the manipulation of this reactive behavior can be learned to help the agent in both cooperative and competitive scenarios.

As for the results in stability, the model was overall less stable than the human benchmark data, although it reflected a similar relation between payoff conditions: an increase in stability in the high dynamic condition compared to the low dynamic condition (see Figure 4, right panels). Nonetheless, our results show that social conventions, such as turn-taking and dominance, can be formed by two CRL agents, as shown in Figure 5. The examples shown in the figure illustrate how these two conventions were formed in the dynamic high condition, where these types of equilibria occurred more often and lasted for more rounds than in the other three conditions, thus explaining the higher stability in this condition. Overall, these results are consistent with human data in that dynamic, continuous-time interactions help agents converge to more efficient, fair and stable strategies when the stakes are high.
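To make the link between conventions and low surprisal concrete, the following sketch computes a per-round surprisal series from a sequence of round outcomes. It is an illustration under stated assumptions rather than the measure used for the reported results: it conditions only on the previous round, estimates probabilities online from earlier transitions, and uses add-one smoothing. The function name and outcome labels are hypothetical.

```python
import math
from collections import defaultdict


def surprisal_series(outcomes, alpha=1.0, n_symbols=3):
    """Surprisal in bits of each outcome given the previous outcome.

    outcomes: sequence of labels such as "p1", "p2" or "tie".
    alpha: smoothing constant (an assumption of this sketch).
    n_symbols: number of distinct outcome labels.
    """
    counts = defaultdict(lambda: defaultdict(float))
    series = []
    prev = None  # no context before the first round
    for outcome in outcomes:
        context = counts[prev]
        total = sum(context.values())
        # Smoothed estimate of P(outcome | previous outcome) from past rounds.
        p = (context[outcome] + alpha) / (total + alpha * n_symbols)
        series.append(-math.log2(p))
        context[outcome] += 1.0
        prev = outcome
    return series
```

With a strict turn-taking sequence such as `["p1", "p2"] * 20`, the series starts at `log2(3)` bits and decreases as the alternation becomes predictable, which is the qualitative pattern described for the bottom panel of Figure 5.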

Role of the Adaptive Layer

We now analyze the role of each CRL layer in different payoff conditions through the measurement of the “none” action, which refers to the case when the Adaptive Layer is not used during that trial. Based on the results of the benchmark and the CRL model in the dynamic condition, where higher payoff differences helped to achieve higher stability, we expect that the more we increase this difference between payoffs, the more the agents will rely on the Adaptive layer.

Figure 6: Mean of the percentage of "not-none" actions (i.e., go to the high and go to the low actions) selected by the agents plotted against 6 conditions with an increasing difference between high and low payoffs. Bars reflect standard errors.

To test this prediction, we performed a simulation with six different conditions with varying levels of difference between payoffs (high vs. low reward value), from 1-1 to 32-1. To measure the level of reliance on each layer, we logged the number of times each agent selected the none action, that is, the action in which the agent relies completely on the Reactive layer to solve the round.

Considering that there are only 3 possible actions (’go high’, ’go low’, ’none’), if the Adaptive layer is choosing actions at random, we should observe that the agent selects each action, on average, the same number of times. That means that prior to any learning, at the beginning of each dyad, the reliance on the Reactive layer would be 1/3 (about 33%) and the reliance on the Adaptive layer 2/3 (about 67%). Starting from this point, if our hypothesis is correct, we expect to observe an increase in the reliance on the Adaptive layer as the payoff difference increases. As expected, the results confirm, as seen in Figure 6, that there is a steady increase in the percentage of selection of the Adaptive layer as the payoff difference grows.
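The chance baseline and the quantity plotted in Figure 6 can be expressed as a short calculation. The sketch below is illustrative only; the function names and action labels are hypothetical, and the action log in the usage note is made-up example input, not simulation data.

```python
from fractions import Fraction

ACTIONS = ("go_high", "go_low", "none")


def chance_reliance():
    """Expected reliance on each layer under uniformly random action choice."""
    p_none = Fraction(1, len(ACTIONS))
    # "none" defers to the Reactive layer; the other actions use the Adaptive layer.
    return {"reactive": p_none, "adaptive": 1 - p_none}


def adaptive_reliance(action_log):
    """Percentage of "not-none" actions in a log of selected actions."""
    not_none = sum(1 for action in action_log if action != "none")
    return 100.0 * not_none / len(action_log)
```

Here `chance_reliance()` returns 1/3 for the Reactive layer and 2/3 for the Adaptive layer, and a toy log such as `["go_high", "none", "go_low", "go_low"]` gives an Adaptive reliance of 75%.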

Role of the Reactive Layer

To analyze the specific contribution of the Reactive Layer to the overall results of the CRL architecture, we now perform a model-ablation procedure. In this scenario we deactivate the Adaptive layer, so the resulting behavior of the agents is entirely driven by the Reactive layer (this scenario exists only in the dynamic condition). As in the main experiment, there are two payoff conditions (high and low) and 50 dyads per condition.

Figure 7: Results of the model-ablation experiment compared to the complete CRL results. Red bars show the results of the high-payoff conditions, whereas the orange bars refer to the low-payoff conditions. The ablated model operates using only the Reactive layer’s sensorimotor control. Results are represented in terms of Efficiency (left panel), Fairness (center) and Stability (right panel). Note that stability is measured by the level of surprisal, which means that lower surprisal values imply higher stability. All error bars reflect standard errors.

As we see in Figure 7, agents exclusively dependent on the Reactive layer perform worse overall, with a significant drop in efficiency. This drop is caused by a higher number of rounds that end in ties, in which neither agent receives any reward. The results in fairness are comparable to those of the CRL model. However, note that these results are computed from fewer rounds, precisely because of the high number of ties (fairness measures how evenly the high reward is distributed among agents). Regarding stability, we observe that it is lower than that obtained by the full CRL model, as shown by the higher surprisal values in Figure 7. In summary, we find that the Reactive Layer, when disconnected from the Adaptive Layer, leads to more unstable and less efficient outcomes.
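The interaction between ties, efficiency and fairness noted above can be illustrated with simplified versions of the two metrics. These definitions are assumptions made for this sketch and may differ from the exact formulas used for the reported results. Efficiency is taken as the share of rounds in which rewards were actually collected. Fairness is taken as the ratio between the number of high rewards obtained by the less successful and the more successful agent, so that tied rounds do not enter the computation.

```python
def efficiency(outcomes):
    """Share of rounds that did not end in a tie.

    outcomes: list of "p1" or "p2" (who obtained the high reward) or "tie".
    Assumption: every non-tied round distributes both rewards.
    """
    non_ties = sum(1 for outcome in outcomes if outcome != "tie")
    return non_ties / len(outcomes)


def fairness(outcomes):
    """Ratio of high rewards between the two agents, in [0, 1].

    Returns None when no agent ever obtained the high reward.
    """
    p1 = outcomes.count("p1")
    p2 = outcomes.count("p2")
    if max(p1, p2) == 0:
        return None
    return min(p1, p2) / max(p1, p2)
```

Under these definitions, a toy sequence such as `["p1", "p2", "tie", "p1", "tie", "p2"]` has perfect fairness (1.0) but an efficiency of only 4/6, showing how a dyad can look fair while losing many rounds to ties.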

5 Discussion

We have investigated the role of real-time control and learning in the formation of social conventions in a multi-agent game-theoretic task. Based on principles of Distributed Adaptive Control theory, we have introduced a new Control-based Reinforcement Learning (CRL) cognitive architecture. The CRL is a model-free approach to reinforcement learning, but with the addition of a reactive controller. Our CRL architecture is composed of a module based on an actor-critic TD-learning algorithm that endows the agent with learning capacities for maximizing long-term reward, and a low-level sensorimotor control loop handling the agent’s reactive behaviors. This integrated cognitive architecture is applied to a multi-agent game-theoretic task, the Battle of the Exes, in which coordination between two agents can be achieved. We have demonstrated that real-time agent interaction leads to more stable, fair and efficient social conventions than the same task modeled in discrete time. The results of our model are consistent with those obtained by Hawkins and Goldstone with human subjects [19].

Interpreting our results in the context of a functional cognitive model we have elucidated the role of reactive and adaptive control loops in the formation of social conventions and of spontaneous coordination. We found that the Reactive layer plays a significant role in avoiding within-round conflict (spontaneous coordination), whereas the Adaptive layer is required to achieve across-round coordination (social conventions). In addition, the CRL model supports our hypothesis that higher payoff differences will increase the reliance on the Adaptive layer.

Furthermore, there exists biological evidence supporting the functions identified by modules of the CRL architecture. Computations described by temporal difference learning have been found in the human brain, particularly in the ventral striatum and the orbitofrontal cortex [44]. It has also been shown that premotor neurons directly regulate sympathetic nervous system responses such as fight-or-flight [45]. The top-down control system of the brain has been identified in the dorsal posterior parietal and frontal cortex, and shown to be involved in cognitive selection of sensory information and responses. On the other hand, the bottom-up feedback system is linked to the right temporoparietal and ventral frontal cortex and is activated when behaviorally relevant sensory events are detected [46, 47, 48].

In our simulations, we have also modeled extensions of experimental conditions (such as increasing differences between payoffs, presented in Figure 6) which affect task outcomes as well as functionality of each control loop. These results allow us to make predictions that can later be tested in new human experiments. In that sense, we expect to see an increase in the number of conventions formed in the Battle of the Exes that will be positively correlated with the increased difference in the value of the two rewards. At the cognitive level we suggest that this increase in convention formation could be linked to a higher level of top-down cognitive control, as predicted by the increase in activation of the Adaptive layer.

To the best of our knowledge, this is the first embodied and situated cognitive model that is able to explain human behavioral data in a social decision-making game in continuous-time setups. Moreover, unlike previous attempts, we take into account the role of sensorimotor control loops in solving social dilemmas in real-life scenarios. This is arguably a fundamental requirement for the development of a fully embodied and situated AI.

For future work, there are several directions in which we can continue to develop the multi-agent framework presented in this paper. One possibility is the addition of a computational model of the DAC Contextual layer to our CRL architecture. As discussed in [34, 36], the Contextual layer facilitates integration of sensory-motor contingencies into a long-term memory that allows for learning of rules. This is important for building causal models of the world and for taking context into account when learning optimal action policies. The goal of such extensions will be to build meta-learning mechanisms that can identify the particular social scenario in which an agent is placed (e.g., social dilemmas, coordination problems, etc.) and then learn the appropriate policy for each context. Extending our model with such functionality could enable solving more diverse and complicated social coordination problems, including those that provide a delayed reward.

Another interesting avenue concerns the emergence of communication. We could extend our model by adding signaling behaviors to agents and test them in experimental setups similar to the seminal sender-receiver games proposed by Lewis [1]. One could also follow a more robot-centric approach such as that of [49, 50]. These approaches enable one to study the emergence of complex communicative systems embedding a proto-syntax [42, 51].

Taken together, the model presented in this paper and recent related work (see [52]) help advance our understanding of a functional embodied and situated AI that can operate in a multi-agent social environment. For this purpose, we plan to extend this model to study other aspects of cooperation, such as wolf-pack hunting behavior [53, 54], and also aspects of competition within agent populations, as in predator-prey scenarios. In ongoing work, we are developing a setup in which embodied cognitive agents will have to compete for limited resources in complex multi-agent environments. This setup will also allow us to test the hypothesis proposed in [55, 56, 57] concerning the role of consciousness as an evolutionary game-theoretic strategy that might have resulted through natural selection triggered by a cognitive arms-race between goal-oriented agents competing for limited resources in a social world.


This research has been funded by the European Commission’s Horizon 2020 socSMC project (socSMC-641321H2020-FETPROACT-2014) and by the European Research Council’s CDAC project (ERC-2013-ADG 341196).

Conflict of Interest

The authors declare that they have no competing interests.