Promoting Coordination through Policy Regularization in Multi-Agent Reinforcement Learning

08/06/2019 ∙ by Paul Barde, et al. ∙ Corporation de l'ecole Polytechnique de Montreal ∙ McGill University

A central challenge in multi-agent reinforcement learning is the induction of coordination between agents of a team. In this work, we investigate how to promote inter-agent coordination and discuss two possible avenues based respectively on inter-agent modelling and guided synchronized sub-policies. We test each approach in four challenging continuous control tasks with sparse rewards and compare them against three variants of MADDPG, a state-of-the-art multi-agent reinforcement learning algorithm. To ensure a fair comparison, we rely on a thorough hyper-parameter selection and training methodology that allows a fixed hyper-parameter search budget for each algorithm and environment. We consequently assess the hyper-parameter sensitivity, sample-efficiency and asymptotic performance of each learning method. Our experiments show that our proposed algorithms are more robust to the hyper-parameter choice and reliably lead to strong results.




1 Introduction

Multi-agent Reinforcement Learning (MARL) refers to the task of training an agent to maximize its expected return by interacting with an environment that contains other learning agents. It represents a challenging branch of reinforcement learning with interesting developments in recent years jaderberg2018human; lowe2017multi; foerster2018counterfactual; hernandez2018multiagent.

A popular framework for MARL is the use of a centralized training procedure and decentralized execution lowe2017multi; foerster2018counterfactual; iqbal2018actor; foerster2018bayesian; rashid2018qmix. This is typically done by training critics that approximate the value of the joint observations and actions, and using these critics to train actors that are restricted to the observation-action pair of a single agent. Such critics, if exposed to coordinated joint actions leading to high returns, can steer the agents’ policies toward these highly rewarding behaviors. However, these approaches depend on the agents luckily stumbling on coordinated joint actions in order to grasp their benefit. Thus, they might fail in scenarios where coordination is unlikely to occur by chance. We hypothesize that in such scenarios, coordination-promoting inductive biases on the policy search could help discover coordinated behaviors more efficiently and supersede task-related reward shaping and curriculum learning.

In this work, we explore two different priors to successful coordination and use these to regularize the learned policies. The first avenue (TeamMADDPG) supposes that an agent must be able to predict the behavior of its teammates in order to coordinate with them. The second (CoachMADDPG) assumes that coordinating agents individually recognize different modes or situations, and synchronously use different sub-policies (or behaviors) to deal with these distinct events. In the following sections we show how to derive practical regularization terms from these premises and meticulously evaluate them. Source code for the algorithms and environments will be made public upon publication of this work.

Our contributions are twofold. First, we propose two novel algorithms, TeamMADDPG and CoachMADDPG, that aim at promoting coordination in multi-agent systems. Our approaches build on the widely used MADDPG algorithm and augment it with additional multi-agent objectives that act as regularizers and are optimized jointly with the main return-maximisation objective. Second, we design four sparse-reward collaboration tasks in the multi-agent particle environment mordatch2018emergence and present a detailed evaluation of these algorithms against three baseline variations. We further analyze the effect of the induced behavioral bias through an ablation study. It suggests that our team-spirit objective provides a dense learning signal that helps guide the policy towards coordination in the absence of external reward, and effectively leads it to discover high-performing team strategies in a number of cooperative tasks.

2 Background

2.1 Markov Games

In this work we consider the framework of Markov Games littman1994markov, a multi-agent extension of Markov Decision Processes (MDPs) with $N$ independent agents. A Markov Game is defined by the tuple $\langle \mathcal{S}, \mathcal{T}, \mathcal{P}, \{\mathcal{O}^i, \mathcal{A}^i, \mathcal{R}^i\}_{i=1}^{N} \rangle$. $\mathcal{S}$, $\mathcal{T}$, and $\mathcal{P}$ respectively are the set of all possible states, the transition function and the initial state distribution. While these are global properties of the environment, $\mathcal{O}^i$, $\mathcal{A}^i$ and $\mathcal{R}^i$ are individually defined for each agent $i$. They are respectively the observation functions, the sets of all possible actions and the reward functions. At each time-step $t$, the global state of the environment is given by $s_t \in \mathcal{S}$ and every agent’s individual action vector is denoted by $a^i_t \in \mathcal{A}^i$. To select its action, each agent $i$ only has access to its own observation vector $o^i_t$, which is extracted by its observation function $\mathcal{O}^i$ from the global state $s_t$. The initial global state $s_0$ is sampled from the initial state distribution $\mathcal{P}$, and the next state of the environment $s_{t+1}$ is sampled from the probability distribution over the possible next states given by the transition function $\mathcal{T}$. Finally, at each time-step each agent receives an individual scalar reward $r^i_t$ from its reward function $\mathcal{R}^i$. Agents aim at maximizing their expected discounted return $\mathbb{E}\left[\sum_{t=0}^{T} \gamma^t r^i_t\right]$ over the time horizon $T$, where $\gamma \in [0, 1]$ is a discount factor.
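As a concrete illustration of the quantity each agent maximizes, the following minimal sketch computes the discounted return of one agent from its sequence of rewards. The function name `discounted_return` and the example numbers are illustrative assumptions.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for one agent's reward sequence."""
    total = 0.0
    # Accumulate backwards so that each step applies one factor of gamma.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# A sparse reward received only at the last of three steps.
example = discounted_return([0.0, 0.0, 1.0], gamma=0.5)
```

With a sparse reward, most terms of the sum are zero, which is why the return alone gives little guidance early in training.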

2.2 Multi-Agent Deep Deterministic Policy Gradient (MADDPG)

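MADDPG lowe2017multi is an adaptation of the Deep Deterministic Policy Gradient algorithm (DDPG) lillicrap2015continuous to the multi-agent setting. It allows the training of cooperating and competing decentralized policies through the use of a centralized training procedure. In this framework, each agent $i$ possesses its own deterministic policy $\mu^i$ for action selection and critic $Q^i$ for state-action value estimation, which are respectively parametrized by $\theta^i$ and $\phi^i$. All parametric models are trained off-policy from previous transitions $\zeta_t := (\mathbf{o}_t, \mathbf{a}_t, \mathbf{r}_t, \mathbf{o}_{t+1})$ uniformly sampled from a replay buffer $\mathcal{D}$. Note that $\mathbf{o}_t := [o^1_t, \dots, o^N_t]$ is the joint observation vector and $\mathbf{a}_t := [a^1_t, \dots, a^N_t]$ is the joint action vector, obtained by concatenating the individual observation vectors and action vectors of all agents. Each centralized critic is trained to estimate the expected return for a particular agent $i$ using the Deep Q-Network (DQN) mnih2015human loss, where $\bar{\theta}^j$ and $\bar{\phi}^i$ denote target-network parameters:

$$\mathcal{L}^i(\phi^i) = \mathbb{E}_{\zeta_t \sim \mathcal{D}}\left[\tfrac{1}{2}\left(Q^i(\mathbf{o}_t, \mathbf{a}_t; \phi^i) - y^i_t\right)^2\right], \quad y^i_t = r^i_t + \gamma\, Q^i(\mathbf{o}_{t+1}, \mathbf{a}_{t+1}; \bar{\phi}^i)\Big|_{a^j_{t+1} = \mu^j(o^j_{t+1}; \bar{\theta}^j)\ \forall j} \tag{1}$$

Each policy is updated to maximize the expected discounted return of the corresponding agent $i$:

$$J^i_{PG}(\theta^i) = \mathbb{E}_{\mathbf{o}_t, \mathbf{a}_t \sim \mathcal{D}}\left[Q^i(\mathbf{o}_t, \mathbf{a}_t; \phi^i)\Big|_{a^i_t = \mu^i(o^i_t; \theta^i)}\right] \tag{2}$$

yielding the following policy gradient:

$$\nabla_{\theta^i} J^i_{PG}(\theta^i) = \mathbb{E}_{\mathbf{o}_t, \mathbf{a}_t \sim \mathcal{D}}\left[\nabla_{\theta^i}\mu^i(o^i_t; \theta^i)\, \nabla_{a^i_t} Q^i(\mathbf{o}_t, \mathbf{a}_t; \phi^i)\Big|_{a^i_t = \mu^i(o^i_t; \theta^i)}\right] \tag{3}$$

By guiding each agent’s policy toward states that have been more positively evaluated when taking into account all agents’ observation-action pairs, the centralized training of the value-function restores stationarity in the multi-agent setting. In addition, this mechanism allows agents to implicitly learn coordinated strategies that can then be deployed in a decentralized way. However, this procedure does not encourage the discovery of coordinated strategies, since high-reward behaviors have to be randomly experienced through unguided exploration.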
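The centralized critic update can be sketched in a few lines: the target for agent i bootstraps from a target critic evaluated on the next joint observation and on the next joint action produced by all agents' target policies. This is a minimal illustration under assumed names (`td_target`, `critic_loss`) with toy callables standing in for neural networks; it is not the implementation used in this work.

```python
def td_target(reward_i, gamma, target_critic_i, next_joint_obs,
              target_policies, next_obs_per_agent):
    """One-step target: r_i + gamma * Q_i_target(o', a'), a' from target policies."""
    next_joint_action = []
    for policy, obs in zip(target_policies, next_obs_per_agent):
        # Each agent only sees its own observation when choosing its action.
        next_joint_action.extend(policy(obs))
    return reward_i + gamma * target_critic_i(next_joint_obs, next_joint_action)


def critic_loss(critic_i, joint_obs_batch, joint_action_batch, targets):
    """Mean of 0.5 * (Q_i(o, a) - y)^2 over a minibatch."""
    losses = [
        0.5 * (critic_i(obs, act) - y) ** 2
        for obs, act, y in zip(joint_obs_batch, joint_action_batch, targets)
    ]
    return sum(losses) / len(losses)


# Toy stand-ins (assumptions): a critic that sums its inputs, two linear policies.
toy_critic = lambda joint_obs, joint_action: sum(joint_obs) + sum(joint_action)
toy_policies = [lambda obs: [0.5 * obs[0]], lambda obs: [-obs[0]]]

y = td_target(1.0, 0.5, toy_critic, [2.0, 1.0], toy_policies, [[2.0], [1.0]])
loss = critic_loss(toy_critic, [[1.0, 0.0]], [[0.5, 0.5]], [y])
```

The critic receives the joint observation and joint action, while every policy call receives a single agent's observation, which is what keeps execution decentralized.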

3 Related Work

Many works in multi-agent reinforcement learning consider explicit communication channels between the agents and distinguish between communicative actions (broadcasting a given message) and physical actions (moving in a given direction) foerster2016learning; mordatch2018emergence; lazaridou2016multi. Consequently, they often focus on the emergence of language, considering tasks where the agents must discover a common communication protocol in order to succeed. Deriving a successful communication protocol can already be seen as coordination in the communicative action space and can enable, to some extent, successful coordination in the physical action space ahilan2019feudal. However, explicit communication is not a necessary condition for coordination as agents can rely on physical communication mordatch2018emergence; gupta2017cooperative.

Approaches to shape RL agents’ behaviors with respect to other agents have also been explored. Strouse et al. strouse2018learning use the mutual information between the agent’s policy and a goal-independent policy to shape the agent’s behaviour towards hiding or spelling out its current goal. However, this approach is only applicable for tasks with an explicit goal representation and is not specifically intended for coordination. Jaques et al. jaques2018intrinsic use the direct causal effect between agents’ actions as an intrinsic reward to encourage social empowerment, but rely on giving to some agents (the influencees) the action of other agents (the influencers) as input to their policy, which violates the decentralized execution framework.

Finally, Barton et al. barton2018measuring propose CCM (convergent cross mapping) as a metric for measuring the degree of effective coordination between two agents. While this may represent an interesting avenue for behavior analysis, it does not provide any tool for actually enforcing coordination: CCM must be computed over long time series, which makes it impractical as a learning signal for single-step temporal difference based methods. In this work, we design two coordination-driven multi-agent algorithms that do not rely on the existence of explicit communication channels and that carry the learned coordinated behaviors over to test time, when all agents act in a decentralized fashion.

4 Policy regularization

Our two proposed algorithms add team objectives that act as regularizers alongside the common policy gradient update. Pseudocode for our implementations is provided in Appendix 8.2 (see Algorithms 1 and 2).

4.1 Team regularization

We hypothesize that coordination can be defined as the degree of predictability of one agent’s behavior with respect to its teammate(s). In other words, in a coordinated team, there should be some predictable structure between the agents’ actions. We cast this assumption into the decentralized framework by training agents to predict their teammates’ actions given only their own observation, which can be simply implemented by adding additional heads to each agent’s policy network. We define this regularization term as the Mean Squared Error (MSE) between the predicted and real actions of the teammates, yielding a teammate-modelling secondary objective similar to the models of other agents used in jaques2018intrinsic; hern2019agent and often referred to as agent modelling schadd2007opponent. Importantly, most previous work such as hern2019agent exclusively uses this approach to improve the richness of the learned representations. In our case, the same objective is also used to drive the teammates’ behaviors closer to the prediction, a process that requires a differentiable action selection mechanism. We call this the Team-Spirit regularization term between agents $i$ and $j$:

$$J^{i,j}_{TS}(\theta^i, \theta^j) = -\mathbb{E}_{\mathbf{o}_t \sim \mathcal{D}}\left[\mathrm{MSE}\left(\mu^j(o^j_t; \theta^j),\ \hat{\mu}^{i,j}(o^i_t; \theta^i)\right)\right] \tag{4}$$

where $\mu^j(o^j_t; \theta^j)$ is the action of agent $j$ and $\hat{\mu}^{i,j}(o^i_t; \theta^i)$ is agent $i$’s prediction of that action from its own observation. The total gradient for a given agent $i$ becomes:

$$\nabla_{\theta^i} J^i_{total}(\theta^i) = \nabla_{\theta^i} J^i_{PG}(\theta^i) + \lambda_1 \sum_{j \neq i} \nabla_{\theta^i} J^{i,j}_{TS}(\theta^i, \theta^j) + \lambda_2 \sum_{j \neq i} \nabla_{\theta^i} J^{j,i}_{TS}(\theta^j, \theta^i) \tag{5}$$

where $\lambda_1$ and $\lambda_2$ are hyper-parameters that weigh respectively how well an agent should predict its teammates’ actions, and how predictable an agent should be for its teammates. We call this modified algorithm TeamMADDPG. Figure 1(a) summarizes these interactions.
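To make the two roles of the Team-Spirit term concrete, here is a minimal sketch that evaluates the negative MSE between a teammate's action and its prediction, then combines it with the policy-gradient objective using the two weights. The function names, the argument names `lam_predict` and `lam_predictable`, and the numbers are illustrative assumptions; a real implementation would differentiate these quantities with an autodiff library.

```python
def mse(u, v):
    """Mean squared error between two equally sized action vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def team_spirit(actual_action_j, predicted_action_by_i):
    """Negative MSE between agent j's action and agent i's prediction of it."""
    return -mse(actual_action_j, predicted_action_by_i)


def total_objective(pg_objective, ts_as_predictor, ts_as_predicted,
                    lam_predict, lam_predictable):
    """Policy-gradient objective plus both weighted Team-Spirit sums for one agent."""
    return (pg_objective
            + lam_predict * sum(ts_as_predictor)
            + lam_predictable * sum(ts_as_predicted))


# Agent i predicts [0, 0] while teammate j actually plays [1, 0].
ts_ij = team_spirit([1.0, 0.0], [0.0, 0.0])
objective = total_objective(2.0, [ts_ij], [-1.0],
                            lam_predict=0.5, lam_predictable=0.25)
```

Setting `lam_predictable` to zero keeps only the agent-modelling side task, which corresponds to the first ablation discussed in Section 6.3.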

(a) TeamMADDPG
(b) CoachMADDPG
Figure 1: Illustrations of the proposed regularization schemes in a two-agent MARL problem. (a) Each agent’s policy is equipped with additional heads that are trained to predict other agents’ actions and every agent is regularized to produce actions that its teammates correctly predict. Note that this regularization is depicted for agent-1 only to avoid cluttering. (b) A central model called Coach takes all agents’ observations as input and outputs the current mode (one-hot vector). Agents are regularized to predict the same mode from their local observations only, and use it as a mask to select their corresponding subpolicy.

4.2 Coach regularization

This regularization choice builds on the assumption that coordination is more easily achieved if agents share the same representation of the current situation and use this representation to synchronously select a sub-behavior. To promote this, we introduce the coach, a new entity parametrized by $\psi$, which learns to produce from the joint observations of all agents a discrete embedding (behavior mask) $u^c_t$ that is passed to the agents at training time. This mask is a one-hot vector of size $K$ drawn from a multinomial distribution $p^c(u^c_t \mid \mathbf{o}_t; \psi)$. In practice, the policy network of each agent $i$ is modified as follows: a first linear pre-hidden layer is split into two heads. The first one consists of the pre-activations $\tilde{h}^i_t$ of the hidden layer. The second head, of dimension $K$, consists of the pre-activations of a softmax function yielding the distribution $p^i(u^i_t \mid o^i_t; \theta^i)$ from which a behavior mask $u^i_t$ is sampled. The mask coming either from the coach or from the agent is then tiled along the dimensions of the first head of the linear layer and applied via element-wise multiplication, a process similar to dropout srivastava2014dropout but in a deterministic, structured fashion. The output forms $h^i_t$, consisting therefore of the masked pre-activations of the first linear head.
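The masking step can be sketched as follows: a one-hot mask is tiled to the width of the first head and multiplied element-wise with its pre-activations, so only the units selected by the active mode stay non-zero. The function name `apply_behavior_mask` is hypothetical, and the sketch assumes literal tiling (the whole mask repeated end to end); a layout with one contiguous block of units per mode would be an equally plausible reading.

```python
def apply_behavior_mask(pre_activations, mask):
    """Tile a one-hot mask over the pre-activations and multiply element-wise."""
    width, size = len(pre_activations), len(mask)
    if width % size != 0:
        raise ValueError("head width must be a multiple of the mask size")
    # Repeat the whole mask until it covers every unit of the head.
    tiled = list(mask) * (width // size)
    return [h * m for h, m in zip(pre_activations, tiled)]


# Six hidden units, three modes, second mode active.
masked = apply_behavior_mask([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [0, 1, 0])
```

Unlike dropout, the set of zeroed units is fully determined by the selected mode, so each mode consistently exposes the same sub-network.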

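In other words, the policy of agent $i$ is conditioned via embedding-masking, $a^i_t = \mu^i(o^i_t, u_t; \theta^i)$, where $u_t$, the behavior mask at time $t$, can either be the agent’s private mask $u^i_t \sim p^i(\cdot \mid o^i_t; \theta^i)$ or the coach’s mask $u^c_t \sim p^c(\cdot \mid \mathbf{o}_t; \psi)$, with $\psi$ the parameters of the coach. To allow this, the coach is trained to output masks that (1) yield high returns when passed to the agents and (2) are predictable by the agents. Similarly, each agent is regularized so that (1) its private mask matches the coach’s mask and (2) it derives efficient behavior from it. This scheme enforces coordinated sub-policies in agents, while having agents predict the best behavior that the coach would impose given all observations. The policy gradient loss when agent $i$ is provided with the coach’s plan is given by:

$$J^i_{EPG}(\theta^i, \psi) = \mathbb{E}_{\mathbf{o}_t, \mathbf{a}_t \sim \mathcal{D}}\left[Q^i(\mathbf{o}_t, \mathbf{a}_t)\Big|_{a^i_t = \mu^i(o^i_t, u^c_t; \theta^i),\ u^c_t \sim p^c(\cdot \mid \mathbf{o}_t; \psi)}\right] \tag{6}$$

And the difference between the plan of agent $i$ and the coach’s one is measured from the Kullback–Leibler divergence:

$$J^i_{E}(\theta^i, \psi) = -\mathbb{E}_{\mathbf{o}_t \sim \mathcal{D}}\left[D_{KL}\left(p^c(\cdot \mid \mathbf{o}_t; \psi)\ \|\ p^i(\cdot \mid o^i_t; \theta^i)\right)\right] \tag{7}$$

The total gradient for agent $i$ is:

$$\nabla_{\theta^i} J^i_{total}(\theta^i) = \nabla_{\theta^i} J^i_{PG}(\theta^i) + \lambda_1 \nabla_{\theta^i} J^i_{E}(\theta^i, \psi) + \lambda_2 \nabla_{\theta^i} J^i_{EPG}(\theta^i, \psi) \tag{8}$$

with $\lambda_1$ and $\lambda_2$ the regularization coefficients. Similarly for the coach: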


In order to propagate gradients through the sampled plan, we reparametrize the multinomial distribution using the Gumbel-softmax trick jang2016categorical. We call this modified algorithm CoachMADDPG. Figure 1(b) summarizes these interactions.
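The two ingredients of this section can be sketched with the standard library alone: a Gumbel-softmax sample, which replaces a hard categorical draw with a differentiable relaxation, and the Kullback–Leibler divergence between the coach's and an agent's mask distributions. Function names, the temperature value and the example logits are illustrative assumptions; in practice these operations would run inside an autodiff framework so that gradients reach the logits.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def gumbel_softmax_sample(logits, temperature, rng):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / temperature)."""
    noisy = []
    for logit in logits:
        # Clamp the uniform draw away from 0 and 1 so both logarithms are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / temperature)
    return softmax(noisy)


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions, with q strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


rng = random.Random(0)
coach_probs = softmax([0.0, 1.0, 2.0])
agent_probs = softmax([0.5, 0.5, 0.5])
relaxed_mask = gumbel_softmax_sample([0.0, 1.0, 2.0], temperature=0.5, rng=rng)
mismatch = kl_divergence(coach_probs, agent_probs)
```

Lower temperatures push the relaxed sample closer to a one-hot vector, at the cost of higher-variance gradients.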

5 Experimental setup

5.1 Training environments

All of our environments are based on a modified version of the OpenAI multi-agent particle environments mordatch2018emergence. While non-zero return can be achieved in these tasks by selfish agents, they all benefit from coordinated strategies and optimal return can only be achieved by agents working closely together. Figure 2 presents visualizations and a brief description of all four environments. A more detailed description is provided in Appendix 8.1. In all environments except Imitation (see Figure 2d), agents receive as observations their own global position and velocity as well as the relative position of other agents and landmarks. Importantly, most work showcasing experiments with this engine uses discrete action spaces at least part of the time and usually learns on dense rewards (e.g. the inverse of the distance to the objective) iqbal2018actor; lowe2017multi; jiang2018learning. In contrast, in all of our experiments, agents learn on continuous action spaces and from sparse rewards only.

Figure 2: Multi-agent environments used in this work. (a) Spread: Agents must spread themselves on the different landmarks. (b) Bounce: Two agents are linked together by a spring and must position themselves so that the red ball bounces towards a target. (c) Chase: Two agents (red) chase a scripted prey (green) that moves w.r.t repulsion forces. (d) Imitation: The top agent (purple) needs to learn on which side the activated landmark is only by looking at the bottom agent’s behavior.

5.2 Hyper-parameter tuning and training details

To offer a fair comparison between all methods, the hyper-parameter search routine is the same for each algorithm and environment. For each search-experiment (one per algorithm per environment), 50 randomly sampled hyper-parameter configurations are used to train the models for steps on 3 different training seeds. During training, the policies are evaluated on a fixed set of 10 different episodes every 100 learning steps. For a learned policy performance assessment, its best iteration-model is evaluated on 100 different episodes. The performance of a hyper-parameter configuration is defined as the performance of the corresponding learned policies averaged across training seeds.
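The search routine described above can be summarized by the following minimal sketch: draw a fixed number of random configurations, train each one on several seeds, and score a configuration by its seed-averaged performance. The names `random_search`, `sample_config` and `train_and_evaluate` are hypothetical placeholders, and the toy objective at the bottom only exists to exercise the loop.

```python
import random


def random_search(sample_config, train_and_evaluate, n_configs=50,
                  seeds=(0, 1, 2), rng=None):
    """Score each sampled configuration by its mean performance across seeds."""
    if rng is None:
        rng = random.Random(0)
    results = []
    for _ in range(n_configs):
        config = sample_config(rng)
        scores = [train_and_evaluate(config, seed) for seed in seeds]
        results.append((sum(scores) / len(scores), config))
    # Compare on the score only, so ties never fall back to comparing dicts.
    best_score, best_config = max(results, key=lambda item: item[0])
    return best_config, best_score, results


# Toy stand-ins (assumptions): one hyper-parameter and a synthetic score.
def toy_sample(rng):
    return {"actor_lr": rng.uniform(0.0, 1.0)}


def toy_evaluate(config, seed):
    return -(config["actor_lr"] - 0.5) ** 2


best_config, best_score, results = random_search(toy_sample, toy_evaluate)
```

Using the same budget (50 configurations, 3 seeds) for every algorithm and environment is what makes the resulting score distributions comparable.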

We perform searches over the following hyper-parameters: the learning rate of the actor, the learning rate of the critic relative to the actor, the target-network soft-update parameter and the initial scale of the exploration noise for the Ornstein-Uhlenbeck noise generating process uhlenbeck1930theory (such as in lillicrap2015continuous). In the case of TeamMADDPG and CoachMADDPG, we additionally search over the regularization weights $\lambda_1$ and $\lambda_2$. The learning rate of the coach is always equal to the actor’s learning rate, motivated by their similar architectures and learning signals and in order to reduce the search space. Table 1 in Appendix 8.4 shows the tunable hyper-parameters and their search ranges.

In all of our experiments, we use the Adam optimizer kingma2014adam to perform parameter updates. All models (actors and critics) are parametrized by MLPs containing two hidden layers of units each. We use the Rectified Linear Unit (ReLU) as activation function and layer normalization ba2016layer to stabilize the learning. We use a buffer-size of entries and a batch-size of . We collect transitions by interacting with the environment for each learning update. For all tasks in our hyper-parameter searches, we train the agents for episodes of steps and then re-train the best configuration for each (algorithm, environment) pair for twice as long ( episodes) to ensure full convergence for the final evaluation. The scale of the exploration noise is kept constant for the first half of the training time and then decreases linearly to until the end of training. We use a reward discount factor of and a gradient clipping threshold of in all experiments.
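The exploration-noise schedule described above (constant for the first half of training, then a linear decrease) can be written as a small function. The name `noise_scale` is hypothetical, and the final scale is kept as an explicit argument rather than a fixed constant; the numbers in the example are illustrative only.

```python
def noise_scale(step, total_steps, initial_scale, final_scale):
    """Constant scale for the first half of training, then linear decay."""
    half = total_steps / 2
    if step <= half:
        return initial_scale
    # Fraction of the second half that has elapsed, between 0 and 1.
    fraction = (step - half) / (total_steps - half)
    return initial_scale + fraction * (final_scale - initial_scale)


schedule = [noise_scale(step, 100, 0.4, 0.0) for step in (0, 50, 75, 100)]
```

The returned scale would multiply the output of the noise process before it is added to the deterministic action.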

6 Results and Discussion

We compare our algorithms (TeamMADDPG and CoachMADDPG) against MADDPG, DDPG and SharedMADDPG on the four cooperative multi-agent tasks described in Section 5.1. DDPG is the single-agent counterpart of MADDPG (decentralized training) whereas SharedMADDPG shares the policy and value-function models across agents. We report in Figure 3 the results of our per-algorithm hyper-parameter tuning as described in Section 5.2.

6.1 Robustness to hyper-parameters

Figure 3: Summarized performance distributions of the sampled hyper-parameter configurations for each (algorithm, environment) pair. The box-plots divide into quartiles the 49 lower-performing configurations of each distribution, while the score of the best-performing configuration is highlighted above the box-plots by a single dot.

Stability across hyper-parameter configurations is a recurring challenge in deep reinforcement learning. The average performance (across seeds) of each randomly sampled hyper-parameter configuration allows us to empirically evaluate the robustness of an algorithm with respect to its hyper-parameters. Figure 3 shows that while multiple algorithms exhibit good maximum performance after training steps on several environments, our proposed coordination-regularizing strategies can boost robustness to hyper-parameter variations in many environments, which can be of great value with more limited search budgets.
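The summary used for these box-plots can be reproduced from a list of per-configuration scores: the best configuration is set aside, and quartiles are computed over the remaining ones. This is a minimal sketch with a hypothetical function name and synthetic scores; the quartile convention of Python's `statistics.quantiles` may differ slightly from the one used to draw the original figure.

```python
import statistics


def summarize_search(scores):
    """Separate the best score and compute quartiles of the remaining ones."""
    ordered = sorted(scores)
    best, rest = ordered[-1], ordered[:-1]
    q1, median, q3 = statistics.quantiles(rest, n=4)
    return {"best": best, "min": rest[0], "q1": q1,
            "median": median, "q3": q3, "max_rest": rest[-1]}


# Synthetic scores for 50 configurations (illustration only).
summary = summarize_search([float(value) for value in range(50)])
```

A narrow box that sits close to the best score indicates that many configurations perform nearly as well as the best one, which is the sense of robustness discussed here.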

6.2 Final performances

We retrain all algorithms with their most successful configuration for each environment, this time on 10 different seeds, and report the average learning curves in Figure 4. TeamMADDPG and CoachMADDPG outperform all other algorithms by large margins in Spread and Bounce, both in terms of sample efficiency and asymptotic performance. All algorithms perform competitively in Chase, with SharedMADDPG slightly above. However, SharedMADDPG fails to learn the task in the three other environments (Spread, Bounce and Imitation). Finally, CoachMADDPG and DDPG obtain the best results in Imitation, with TeamMADDPG and MADDPG slightly lower.

Lowe et al. lowe2017multi suggest that DDPG’s lower performance can be caused by the absence of centralized critics, which makes the environment appear non-stationary from any individual agent’s perspective. This could explain why it performs best on the most asymmetric task, Imitation, where only one agent is influenced by the other. The trailing results of SharedMADDPG might be explained by the fact that sharing parameters across agents prevents specialization, a useful asset even in homogeneous cooperative tasks. Finally, TeamMADDPG and CoachMADDPG perform competitively with or better than MADDPG on all environments, suggesting that learning team-related secondary tasks is a valuable way to foster efficient coordinated strategies, resulting in faster learning and better performance.

Figure 4: Average learning curves for all algorithms on all four environments. Solid lines are the mean across 10 training seeds of the evaluation performance on 10 fixed episodes. The envelopes are the standard error across the 10 training seeds.
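The solid lines and envelopes of such learning curves can be computed point by point from the per-seed curves. The following minimal sketch assumes every seed was evaluated at the same training steps; the function name and the toy curves are illustrative.

```python
import math
import statistics


def mean_and_standard_error(curves):
    """Per-step mean and standard error across seeds (one curve per seed)."""
    means, errors = [], []
    for values in zip(*curves):
        means.append(statistics.fmean(values))
        # Sample standard deviation divided by the square root of the seed count.
        errors.append(statistics.stdev(values) / math.sqrt(len(values)))
    return means, errors


# Two seeds, two evaluation points each (toy numbers).
means, errors = mean_and_standard_error([[0.0, 2.0], [2.0, 2.0]])
```

The standard error shrinks with the number of seeds, so envelopes computed from 10 seeds are narrower than the raw spread between runs.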

6.3 The effects of being predictable

Here we analyze the effects of the regularizers of TeamMADDPG. Specifically, we ask whether the regularizer weight $\lambda_2$, which pushes agents to be predictable by others, is valuable. To this end, we compare one run of the best-performing hyper-parameter configuration for TeamMADDPG on the Spread environment with two ablated versions. For the first ablation, we keep everything else fixed and retrain this model while setting $\lambda_2$ to zero. For the second ablation, we set both regularizers’ weights $\lambda_1$ and $\lambda_2$ to zero. The former variant corresponds to MADDPG only augmented by the side task of predicting others’ actions, whereas the latter is exactly equivalent to the MADDPG algorithm, where agents neither try to predict others’ actions, nor to be predictable. The expected return and Team-Spirit loss defined in Section 4.1, averaged over the three agents of the environment, are presented in Figure 5 for these three experiments.

Figure 5: Comparison between enabling and disabling the regularizing weights $\lambda_1$ and $\lambda_2$ in TeamMADDPG on a successful configuration for the Spread environment

At the beginning of training, due to the weight initialization, the outputted and predicted actions from agents both have relatively small norms, explaining the small Team-Spirit loss. As training goes on, the norms of the action vectors quickly increase and the regularization loss becomes more important. As is to be expected, the non-regularized baseline ($\lambda_1$ OFF, $\lambda_2$ OFF) has a high Team-Spirit loss, as it is in no way encouraged to predict the actions of other agents correctly. When reintegrating the agent-modelling objective ($\lambda_1$ ON, $\lambda_2$ OFF), the agents significantly reduce the Team-Spirit loss, but it never reaches values as low as for the full TeamMADDPG, since we removed the incentive for agents to be predictable. Interestingly, this setting also performs slightly worse than the baseline overall on the task. We hypothesize that the regularizing task becomes too hard in this case, and that the learning signal from this auxiliary loss becomes too important in the learning dynamics, hindering performance. Finally, when also pushing agents to be predictable ($\lambda_1$ ON, $\lambda_2$ ON), the agents best predict others’ actions and performance is also improved. We also notice that the Team-Spirit loss increases when performance starts to increase, i.e. when agents start to master the task. Once the reward-maximisation signal becomes stronger, the relative importance of the second task is reduced. We hypothesize that being predictable with respect to one another may push agents to explore in a more structured and informed manner in the absence of reward signal, as similarly pursued by intrinsic motivation approaches chentanez2005intrinsically.

7 Conclusion

In this work we introduced two policy regularization methods to promote multi-agent coordination. One is based on inter-agent action predictability and is called TeamMADDPG. The other one, CoachMADDPG, is based on synchronized and corresponding behavior selection with the help of an auxiliary coach network present only during training. We performed fair and transparent hyper-parameter searches for each compared algorithm and environment and found that our regularizing strategies have a positive effect on robustness to hyper-parameter selection and on asymptotic performance, empirically showing the benefits of such coordination-promoting regularization in the cooperative multi-agent setting. Our techniques work with sparse rewards and continuous action spaces, and do not break the decentralized execution paradigm. One downside of our methods is that they are restricted in their current form to metrics that only account for the current timestep, a restriction that simplifies off-policy learning but also greatly impairs the longer-term coordination possibilities. In future work we aim to explore model-based planning approaches for multi-agent coordination.


We wish to thank Olivier Delalleau for providing insightful comments on previous versions of these methods as well as Fonds de Recherche Nature et Technologies (FRQNT), Ubisoft Montréal and Mitacs for providing part of the funding for this work.

8 Appendix

8.1 Tasks descriptions

Spread (Figure 2a): In this environment, there are 3 agents (small orange circles) and 3 landmarks (bigger gray circles). At every timestep, agents receive a team-reward where is the number of landmarks occupied by at least one agent. To maximize their return, agents must therefore spread themselves on all landmarks.
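The quantity that drives the Spread reward, the number of landmarks occupied by at least one agent, can be computed as in the minimal sketch below. The function name, the occupancy `radius` and the coordinates are hypothetical; the sketch only counts occupied landmarks and does not reproduce the reward formula itself.

```python
import math


def occupied_landmarks(agent_positions, landmark_positions, radius):
    """Count landmarks that have at least one agent within `radius`."""
    count = 0
    for landmark in landmark_positions:
        # A landmark counts once, however many agents stand on it.
        if any(math.dist(agent, landmark) <= radius for agent in agent_positions):
            count += 1
    return count


agents = [(0.0, 0.0), (0.05, 0.0), (5.0, 5.0)]
landmarks = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.05)]
n_occupied = occupied_landmarks(agents, landmarks, radius=0.1)
```

Because two agents on the same landmark only count once, the count is maximized only when the agents split up, which is the coordination the task is designed to require.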

Bounce (Figure 2b): In this environment, two agents (small purple circles) are linked together with a spring that pulls them toward each other when stretched above its relaxation length. At mid-point during the episode, a ball (smaller red circle) falls from the top of the environment. Agents must position themselves correctly so as to make the ball bounce on the spring towards the target (bigger beige circle). They receive a team-reward of if the ball reflects towards the side walls, if the ball reflects towards the top of the environment, and if the ball reflects towards the target.

Chase (Figure 2c): In this environment, two predators (red circles) are chasing a prey (green circle). The prey moves according to a hardcoded policy consisting of repulsion forces from the walls and predators. At every timestep, the learning agents (predators) receive a team-reward of where is the number of predators touching the prey. The prey has a greater top speed and acceleration than the predators. Catching the prey does not trigger the end of the episode. Therefore, to maximize their return, the two agents must coordinate in order to squeeze the prey into a corner or against a wall and effectively immobilize it.

Imitation (Figure 2d): In this environment, at the beginning of the episode, one of the two sets of landmarks (left or right) is randomly picked to be activated. The bottom agent (small blue circle) receives an individual-reward of for being in its activated landmark (bottom green bigger circle), and the top agent (small purple circle) is also individually rewarded for standing in its own landmark (top green bigger circle). The trick is that while the bottom agent knows the relative position of its two landmarks as well as which one is activated, the top agent receives neither piece of information and only has access to the relative position of the bottom agent. It must learn to imitate the bottom agent to reach its activated landmark. Importantly, a door closes its corridor once it has committed to one side, leading to a win or lose situation.

8.2 Algorithms

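  Randomly initialize $N$ critic networks $Q^i$ and actor networks $\mu^i$
  Initialize target networks $\bar{Q}^i$ and $\bar{\mu}^i$
  Initialize one replay buffer $\mathcal{D}$
  for episode from 1 to number of episodes do
     Initialize random processes for action exploration
     Receive initial joint observation $\mathbf{o}_1$
     for timestep t from 1 to episode length do
        Select action $a^i_t = \mu^i(o^i_t)$ plus exploration noise for each agent $i$
        Execute joint action $\mathbf{a}_t$ and observe joint reward $\mathbf{r}_t$ and new observation $\mathbf{o}_{t+1}$
        Store transition $(\mathbf{o}_t, \mathbf{a}_t, \mathbf{r}_t, \mathbf{o}_{t+1})$ in $\mathcal{D}$
     end for
     Sample a random minibatch of transitions from $\mathcal{D}$
     for each agent $i$ do
        Evaluate $\mathcal{L}^i$ and $J^i_{PG}$ from Equations (1) and (2)
        for each other agent $j$ ($j \neq i$) do
           Evaluate $J^{i,j}_{TS}$ from Equation (4)
           Update actor $j$ with $\theta^j \leftarrow \theta^j + \alpha_\theta \lambda_2 \nabla_{\theta^j} J^{i,j}_{TS}$
        end for
        Update critic with $\phi^i \leftarrow \phi^i - \alpha_\phi \nabla_{\phi^i} \mathcal{L}^i$
        Update actor $i$ with $\theta^i \leftarrow \theta^i + \alpha_\theta \nabla_{\theta^i} \left(J^i_{PG} + \lambda_1 \sum_{j \neq i} J^{i,j}_{TS}\right)$
     end for
     Update all target networks with $\bar{\theta}^i \leftarrow \tau \theta^i + (1 - \tau) \bar{\theta}^i$ and $\bar{\phi}^i \leftarrow \tau \phi^i + (1 - \tau) \bar{\phi}^i$
  end for
Algorithm 1 TeamMADDPG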
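  Randomly initialize $N$ critic networks $Q^i$, actor networks $\mu^i$ and one coach network $p^c$
  Initialize target networks $\bar{Q}^i$ and $\bar{\mu}^i$
  Initialize one replay buffer $\mathcal{D}$
  for episode from 1 to number of episodes do
     Initialize random processes for action exploration
     Receive initial joint observation $\mathbf{o}_1$
     for timestep t from 1 to episode length do
        Select action $a^i_t = \mu^i(o^i_t)$ plus exploration noise for each agent $i$
        Execute joint action $\mathbf{a}_t$ and observe joint reward $\mathbf{r}_t$ and new observation $\mathbf{o}_{t+1}$
        Store transition $(\mathbf{o}_t, \mathbf{a}_t, \mathbf{r}_t, \mathbf{o}_{t+1})$ in $\mathcal{D}$
     end for
     Sample a random minibatch of transitions from $\mathcal{D}$
     for each agent $i$ do
        Evaluate $\mathcal{L}^i$, $J^i_{PG}$, $J^i_{E}$ and $J^i_{EPG}$ from Equations (1), (2), (7) and (6)
        Update critic with $\phi^i \leftarrow \phi^i - \alpha_\phi \nabla_{\phi^i} \mathcal{L}^i$
        Update actor with $\theta^i \leftarrow \theta^i + \alpha_\theta \nabla_{\theta^i} \left(J^i_{PG} + \lambda_1 J^i_{E} + \lambda_2 J^i_{EPG}\right)$
     end for
     Update coach $\psi$ with its total gradient (Equation 9)
     Update all target networks with $\bar{\theta}^i \leftarrow \tau \theta^i + (1 - \tau) \bar{\theta}^i$ and $\bar{\phi}^i \leftarrow \tau \phi^i + (1 - \tau) \bar{\phi}^i$
  end for
Algorithm 2 CoachMADDPG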
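Two building blocks shared by both algorithms, the replay buffer and the soft update of target networks, can be sketched with the standard library. The class and function names are hypothetical, parameters are represented as plain lists of floats, and the capacity and `tau` values in the example are illustrative only.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity buffer with uniform sampling of stored transitions."""

    def __init__(self, capacity, seed=0):
        self._storage = deque(maxlen=capacity)  # oldest entries are dropped first
        self._rng = random.Random(seed)

    def store(self, transition):
        self._storage.append(transition)

    def sample(self, batch_size):
        return self._rng.sample(list(self._storage), batch_size)

    def __len__(self):
        return len(self._storage)


def soft_update(target_params, online_params, tau):
    """Move each target parameter a fraction tau toward its online counterpart."""
    return [tau * w + (1.0 - tau) * w_target
            for w_target, w in zip(target_params, online_params)]


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.store(("obs", "action", float(step), "next_obs"))
batch = buffer.sample(2)
new_target = soft_update([0.0, 1.0], [1.0, 1.0], tau=0.5)
```

A small `tau` makes the target networks change slowly, which stabilizes the bootstrapped critic targets.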

8.3 Hyper-parameter search results

Figure 6 shows the raw results of the hyper-parameter searches that are summarized in the main text (see Figure 3).

Figure 6: Hyper-parameter tuning results for all algorithms. There is one distribution per (algorithm, environment) pair, each one formed of 50 points (hyperparameter configuration sample). Each point represents the performance over 100 evaluation episodes for each of the 3 training seeds at the end of training for one sampled hyper-parameters configuration (total of 300 performance values per sampled configuration).

8.4 Hyper-parameter search ranges

Table 1 shows the ranges from which hyper-parameter values were drawn uniformly during the searches, including the initial exploration noise scale. In our experiments, the learning rates of the agents and of the coach network are always equal, motivated by their similar architectures and learning signals, and in order to reduce the search space. For the critic networks, we aim for greater flexibility, while constraining their learning rate to be related to that of the agents and the coach. We thus use a critic learning rate coefficient that multiplies the actor learning rate to give the critic learning rate.
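The coupling between learning rates described above can be sketched as a sampling function: every tunable value is drawn uniformly from its range, the critic learning rate is derived from the actor learning rate through a coefficient, and the coach reuses the actor learning rate. All names and numeric ranges below are hypothetical placeholders and are not the ranges of Table 1.

```python
import random


def sample_config(rng, ranges):
    """Draw each tunable value uniformly, then derive the tied learning rates."""
    config = {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}
    # The critic learning rate is expressed relative to the actor learning rate.
    config["critic_lr"] = config["critic_lr_coef"] * config["actor_lr"]
    # The coach always shares the actor learning rate.
    config["coach_lr"] = config["actor_lr"]
    return config


# Placeholder ranges (assumptions for illustration only).
example_ranges = {
    "actor_lr": (1e-5, 1e-2),
    "critic_lr_coef": (0.1, 10.0),
    "tau": (1e-3, 1e-1),
    "initial_noise_scale": (0.1, 2.0),
}
config = sample_config(random.Random(0), example_ranges)
```

Tying the coach and critic rates to the actor rate removes two free dimensions from the search while still letting the critic learn faster or slower than the actor.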

Hyper-parameter Range
Table 1: Ranges for hyper-parameter search

8.5 Selected hyper-parameters

Tables 2, 3, 4, and 5 show the best hyper-parameters found by the random searches for each of the environments and each of the algorithms.

Table 2: Best found hyper-parameters for the SPREAD environment
Table 3: Best found hyper-parameters for the BOUNCE environment
Table 4: Best found hyper-parameters for the CHASE environment
Table 5: Best found hyper-parameters for the IMITATION environment