Universal Policies to Learn Them All

08/24/2019 ∙ by Hassam Ullah Sheikh, et al. ∙ University of Central Florida

We explore a collaborative and cooperative multi-agent reinforcement learning setting where a team of reinforcement learning agents attempts to solve a single cooperative task in a multi-scenario setting. We propose a novel multi-agent reinforcement learning algorithm, inspired by universal value function approximators, that generalizes not only over the state space but also over a set of different scenarios. Additionally, to support our claim, we introduce a challenging 2D multi-agent urban security environment where the learning agents try to protect a person from nearby bystanders in a variety of scenarios. Our study shows that state-of-the-art multi-agent reinforcement learning algorithms fail to generalize a single task over multiple scenarios, while our proposed solution performs as well as scenario-dependent policies.




1 Introduction

Recent research in deep reinforcement learning (RL) has led to a wide range of accomplishments in learning optimal policies for sequential decision-making problems. These accomplishments include training agents in simulated environments such as playing Atari games [9], beating the best players in board games like Go and Chess [16], as well as learning to solve real-world problems. Similar to single-agent reinforcement learning, multi-agent reinforcement learning (MARL) is also producing breakthrough results in challenging collaborative-competitive environments such as [12, 3, 6].

The success in reinforcement learning has prompted interest in more complex challenges as well as a shift towards cases in which an agent tries to learn multiple tasks in a single environment. Formally this paradigm of learning is known as multitask reinforcement learning [18]. The essence of multitask reinforcement learning is to simultaneously learn multiple tasks jointly to speed up learning and induce better generalization by exploiting the common structures among multiple tasks.

Despite the success of single-agent multitask reinforcement learning [2, 18], multitask multi-agent reinforcement learning has been explored in only one recent study [11]. In this paper, we explore the opposite problem, where multiple reinforcement learning agents try to master a single task across multiple scenarios. In order for the agents to generalise, they need to be able to identify and exploit the common structure of the single task under multiple scenarios. One possible structure is the similarity between the solutions of the single task over multiple scenarios, either in the policy space or in the associated value-function space. For this, we build our solution upon two frameworks. The first framework is universal value function approximators (UVFAs) by [13] and the second framework is multi-agent deep deterministic policy gradient (MADDPG) by [7]. UVFAs are an extension of value functions that also include the notion of a task or a scenario, thus exploiting common structure in the associated optimal value functions. MADDPG is a multi-agent reinforcement learning algorithm that uses the centralized training with decentralized execution paradigm to stabilize learning. The outcome of combining these two frameworks is a solution for multiple agents that learn to generalize over both the state space and a set of multiple scenarios.

To investigate the emergence of collaboration in multi-scenario learning for multi-agent learning agents, we designed a challenging environment with simulated physics in the Multi-Agent Particle Environment [10]. We developed 4 simulated scenarios representing a challenging urban security problem: providing physical protection to a VIP from nearby bystanders of more than one class. The complexity in our environment arises primarily from the different movement patterns of these bystanders, which a standard state-of-the-art MARL algorithm such as MADDPG [7] fails to capture. The goal here is to learn a stable and consistent multi-agent cooperative behavior across all the known scenarios. We do not deal with unknown scenarios.

2 Background

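A partially observable Markov game [5] is a multi-agent extension of an MDP characterized by a state space $\mathcal{S}$, $N$ agents with partial observations $\mathcal{O} = \{\mathcal{O}_1, \ldots, \mathcal{O}_N\}$ of the environment, a collective action space $\mathcal{A} = \{\mathcal{A}_1, \ldots, \mathcal{A}_N\}$, a reward function $\mathcal{R}$ and a state transition function $\mathcal{T}$. At every time step, each agent $i$ chooses an action $a_i$ from its policy $\pi_{\theta_i}$, parameterized by $\theta_i$ and conditioned on its private observation $o_i$, i.e. $a_i = \pi_{\theta_i}(o_i)$, and receives a reward $r_i = \mathcal{R}(s, a_i)$. The goal of each agent is to maximise its own total expected return $R_i = \sum_{t=0}^{T} \gamma^{t} r_i^{t}$, where $r_i^{t}$ is the reward collected by agent $i$ at time $t$.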
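As a small illustration of the return each agent maximises, the sketch below computes a discounted sum of one agent's rewards. It assumes the standard discounted form with a factor `gamma`; the function name and the sample rewards are illustrative and not taken from the paper.

```python
def discounted_return(rewards, gamma=0.95):
    """Return sum_t gamma**t * rewards[t] for one agent's reward sequence."""
    total = 0.0
    # Accumulate from the last step backwards so each step is one multiply-add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Illustrative rewards for a single agent over three time steps.
rewards_agent_i = [1.0, 0.0, 2.0]
example_return = discounted_return(rewards_agent_i, gamma=0.5)
print(example_return)
```

In a Markov game every agent would evaluate this quantity on its own reward sequence, which is why agents may pursue different objectives.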

2.1 Policy Gradients

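Policy gradient methods have been shown to learn the optimal policy in a variety of reinforcement learning tasks. The main idea behind policy gradient methods is to maximize the objective function $J(\theta) = \mathbb{E}_{s \sim p^{\pi}, a \sim \pi_{\theta}}[R]$ by parameterizing the policy with $\theta$ and updating the policy parameters in the direction of the gradient of the objective function, $\nabla_{\theta} J(\theta)$. The gradient is defined as

$$\nabla_{\theta} J(\theta) = \mathbb{E}_{s \sim p^{\pi}, a \sim \pi_{\theta}}\left[\nabla_{\theta} \log \pi_{\theta}(a|s)\, Q^{\pi}(s, a)\right]$$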

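Silver et al. [17] have shown that it is possible to extend the policy gradient framework to deterministic policies, i.e. $\mu_{\theta}: \mathcal{S} \to \mathcal{A}$. In particular, we can write $\nabla_{\theta} J(\theta)$ as

$$\nabla_{\theta} J(\theta) = \mathbb{E}_{s \sim \mathcal{D}}\left[\nabla_{\theta} \mu_{\theta}(a|s)\, \nabla_{a} Q^{\mu}(s, a)|_{a=\mu_{\theta}(s)}\right]$$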
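The deterministic policy gradient is a chain rule: the gradient of the policy output with respect to its parameters, multiplied by the gradient of the critic with respect to the action. The toy sketch below makes this concrete under strong assumptions that are not from the paper: a scalar linear policy `a = theta * s` and a known critic `Q(s, a) = -(a - 2s)^2`, whose best action is `a = 2s`.

```python
def dpg_step(theta, states, lr=0.1):
    """One gradient-ascent step on a toy deterministic policy a = theta * s."""
    grad = 0.0
    for s in states:
        a = theta * s                 # deterministic action mu_theta(s)
        dq_da = -2.0 * (a - 2.0 * s)  # gradient of the toy critic w.r.t. the action
        dmu_dtheta = s                # gradient of the policy output w.r.t. theta
        grad += dmu_dtheta * dq_da    # chain rule, evaluated at a = mu_theta(s)
    return theta + lr * grad / len(states)


theta = 0.0
for _ in range(200):
    theta = dpg_step(theta, states=[0.5, 1.0, 1.5])
print(theta)  # approaches 2.0, the parameter that maximises the toy critic
```

In DDPG and MADDPG both factors come from neural networks and the expectation is estimated from replayed samples rather than a fixed list of states.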

A variation of this model, Deep Deterministic Policy Gradients (DDPG) [4], is an off-policy algorithm that approximates the policy $\mu$ and the critic $Q^{\mu}$ with deep neural networks. DDPG also uses an experience replay buffer alongside a target network to stabilize the training.
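The two stabilizing components mentioned above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the class and function names are hypothetical, parameters are plain lists of floats instead of network weights, and the soft-update rate `tau` is an assumed hyperparameter.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of transitions; the oldest entries are dropped first."""

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size, rng=random):
        # Sample without replacement, capped at the number of stored items.
        k = min(batch_size, len(self.storage))
        return rng.sample(list(self.storage), k)


def soft_update(target_params, online_params, tau=0.01):
    """Move each target parameter a fraction tau towards its online value."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_params, target_params)]


buffer = ReplayBuffer(capacity=2)
for step in range(3):
    buffer.add(("state", "action", float(step), "next_state"))
batch = buffer.sample(batch_size=2)
new_target = soft_update(target_params=[0.0], online_params=[1.0], tau=0.1)
```

Sampling old transitions at random breaks the correlation between consecutive updates, while the slowly moving target parameters keep the regression target from shifting at every step.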

Multi-agent deep deterministic policy gradients (MADDPG) [7] extends DDPG to the multi-agent setting where each agent has its own policy. The gradient of each policy is written as

$$\nabla_{\theta_i} J(\theta_i) = \mathbb{E}_{\mathbf{x}, a \sim \mathcal{D}}\left[\nabla_{\theta_i} \mu_i(a_i|o_i)\, \nabla_{a_i} Q_i^{\mu}(\mathbf{x}, a_1, \ldots, a_N)|_{a_i=\mu_i(o_i)}\right]$$

where $\mathbf{x} = (o_1, \ldots, o_N)$ and $Q_i^{\mu}(\mathbf{x}, a_1, \ldots, a_N)$ is a centralized action-value function that takes the actions of all the agents in addition to the state of the environment to estimate the Q-value for agent $i$. Since every agent has its own Q-function, the model allows the agents to have different action spaces and reward functions. The primary motivation behind MADDPG is that knowing the actions of all other agents makes the environment stationary, which helps stabilize training even though the policies of the agents change.

The Universal Value Function Approximator (UVFA) [13] is an extension of DQN [9] that generalizes not only over a set of states but also over a set of goals. At the beginning of the episode, a state-goal pair is sampled from a probability distribution; the goal remains constant throughout the episode. At each timestep, the agent receives a pair of current state and goal and gets the reward $r_t = r_g(s_t, a_t)$. As a result, the Q-function depends not only on the state-action pair but also on the goal. The extension of this approach is straightforward for DDPG [8] and MADDPG.
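The goal-conditioned interaction loop described above can be sketched as follows. Everything here is a toy assumption made for illustration (a one-dimensional state, two candidate goals, a hand-written policy and reward); only the structure, a goal sampled once per episode and passed to both the policy and the reward, reflects the text.

```python
import random


def rollout(goals, reward_fn, policy, step_fn, initial_state, horizon, rng):
    """Collect one episode in which a single sampled goal stays fixed."""
    goal = rng.choice(goals)  # sampled once, constant for the whole episode
    state = initial_state
    trajectory = []
    for _ in range(horizon):
        action = policy(state, goal)             # policy sees state and goal
        next_state = step_fn(state, action)
        reward = reward_fn(state, action, goal)  # reward depends on the goal
        trajectory.append((state, action, reward, next_state, goal))
        state = next_state
    return trajectory


# Toy one-dimensional problem: walk along the integers towards the goal.
def toy_policy(state, goal):
    return 1 if goal > state else (-1 if goal < state else 0)


def toy_step(state, action):
    return state + action


def toy_reward(state, action, goal):
    return -abs((state + action) - goal)


traj = rollout([-3, 3], toy_reward, toy_policy, toy_step,
               initial_state=0, horizon=5, rng=random.Random(0))
```

Because every stored transition carries its goal, a value function trained on such data can take the goal as an extra input.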

3 Multi-Agent Universal Policy Gradient

We propose multi-agent universal policy gradient: a multi-agent deep reinforcement learning algorithm that learns distributed policies not only over state space but also over a set of scenarios.

While generalization across multiple tasks in multi-agent reinforcement learning has been studied in [11], to the best of our knowledge we are the first to consider a multi-scenario multi-agent deep RL system. Our approach uses Universal Value Function Approximators [13] to train policies and value functions that take a state-scenario pair as input. The outcome is a set of universal multi-agent policies that perform on multiple scenarios as well as policies that are trained separately.

The main idea is to represent the different value function approximators for each agent by a single unified value function approximator that generalizes over both the state space and a set of scenarios. For agent $i$ we consider $V_i(s, g; \phi) \approx V^{*}_{i,g}(s)$ or $Q_i(s, a, g; \phi) \approx Q^{*}_{i,g}(s, a)$, which approximate the optimal unified value functions over multiple scenarios $g$ and a large state space. These value functions can be used to extract policies implicitly or as critics for policy gradient methods.

The learning paradigm we used is similar to the centralized training with decentralized execution during testing used by [7]. In this setting, additional information is provided for the agents during training that is not available during test time. Thus, extracting policies from value functions is not feasible in this model. However, the value functions can be used as critics in a multi-agent deep deterministic policy gradient setting.

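Concretely, consider an environment with $N$ agents with policies $\mu = \{\mu_1, \ldots, \mu_N\}$ parameterized by $\theta = \{\theta_1, \ldots, \theta_N\}$. The multi-agent deep deterministic policy gradient for agent $i$ can then be written as

$$\nabla_{\theta_i} J(\theta_i) = \mathbb{E}_{\mathbf{x}, a \sim \mathcal{D}}\left[\nabla_{\theta_i} \mu_i(a_i|o_i)\, \nabla_{a_i} Q_i^{\mu}(\mathbf{x}, a_1, \ldots, a_N)|_{a_i=\mu_i(o_i)}\right]$$

where $\mathbf{x} = (o_1, \ldots, o_N)$ and $Q_i^{\mu}(\mathbf{x}, a_1, \ldots, a_N)$ is a centralized action-value function, parameterized by $\phi_i$, that takes the actions of all the agents in addition to the state of the environment to estimate the Q-value for agent $i$. We extend the idea of MADDPG with a universal function approximator; specifically, we augment the centralized critic with an embedding of the scenario $g$. The modified policy gradient for each agent $i$ can now be written as

$$\nabla_{\theta_i} J(\theta_i) = \mathbb{E}_{\mathbf{x}, a, g \sim \mathcal{D}}\left[\nabla_{\theta_i} \mu_i(a_i|o_i, g)\, \nabla_{a_i} Q_i^{\mu}(\mathbf{x}, a_1, \ldots, a_N, g)|_{a_i=\mu_i(o_i, g)}\right]$$

where $a_i = \mu_i(o_i, g)$ is the action of agent $i$ following policy $\mu_i$ and $\mathcal{D}$ is the experience replay buffer. The centralized critic $Q_i^{\mu}$ is updated as:

$$\mathcal{L}(\phi_i) = \mathbb{E}_{\mathbf{x}, a, r, \mathbf{x}', g}\left[\left(Q_i^{\mu}(\mathbf{x}, a_1, \ldots, a_N, g) - y\right)^2\right]$$

where $y$ is defined as:

$$y = r_i^{g} + \gamma\, Q_i^{\mu'}(\mathbf{x}', a_1', \ldots, a_N', g)|_{a_j'=\mu_j'(o_j', g)}$$

where $\mu' = \{\mu_1', \ldots, \mu_N'\}$ are target policies parameterized by $\theta' = \{\theta_1', \ldots, \theta_N'\}$.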
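The critic update can be sketched as a mean squared temporal-difference error in which the critic, the target critic and the target policies all receive the scenario as an extra argument. This is a minimal sketch under assumptions: the callables stand in for neural networks, the tuple layout of a transition is hypothetical, and the target is taken to be the usual one-step form (reward plus discounted target-critic value).

```python
def critic_loss(batch, q, target_q, target_policies, gamma=0.95):
    """Mean squared TD error of a scenario-conditioned centralized critic."""
    total = 0.0
    for obs, actions, reward, next_obs, scenario in batch:
        # Target actions come from every agent's target policy, each
        # conditioned on its own next observation and on the scenario.
        next_actions = [mu(o, scenario)
                        for mu, o in zip(target_policies, next_obs)]
        y = reward + gamma * target_q(next_obs, next_actions, scenario)
        total += (q(obs, actions, scenario) - y) ** 2
    return total / len(batch)


# Toy check with two agents and constant stand-in functions.
def zero_q(obs, actions, scenario):
    return 0.0


def one_q(obs, actions, scenario):
    return 1.0


def still(observation, scenario):
    return 0.0


toy_batch = [(([0.0], [1.0]), (0.0, 0.0), 1.0, ([0.0], [1.0]), (1, 0, 0, 0))]
loss = critic_loss(toy_batch, zero_q, one_q, [still, still], gamma=0.5)
```

Compared with a plain MADDPG critic, the only structural change in this sketch is the additional `scenario` argument threaded through every call.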

The overall algorithm, which we refer to as multi-agent universal policy gradient (MAUPG), is described in algorithm 1. Additionally, we refer to the learnt policies as universal policies. An overview of the architecture can be seen in fig. 1.

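1:Sample a random scenario $g$
2:for episode = 1 to $M$ do
3:     for t = 1 to episode-length do
4:         For each agent $i$, select action $a_i = \mu_i(o_i, g)$
5:         Execute actions $a = (a_1, \ldots, a_N)$
6:         For each agent $i$, get next observation $o_i'$
7:     end for
8:     Sample an additional scenario $g'$
9:     for t = 1 to episode-length do
10:         for agent i = 1 to N do
11:              Get reward $r_i^{g}$
12:              Store $(\mathbf{x}, a, r_i^{g}, \mathbf{x}', g)$ in replay buffer $\mathcal{D}$
13:              /* Hindsight Replay */
14:              Get reward $r_i^{g'}$
15:              Store $(\mathbf{x}, a, r_i^{g'}, \mathbf{x}', g')$ in replay buffer $\mathcal{D}$
16:         end for
17:     end for
18:     Set
19:     for agent $i = 1$ to $N$ do
20:         Sample minibatch of size S from $\mathcal{D}$
23:         Set $y = r_i^{g} + \gamma\, Q_i^{\mu'}(\mathbf{x}', a_1', \ldots, a_N', g)|_{a_j'=\mu_j'(o_j', g)}$
24:         Update critic by minimizing $\mathcal{L}(\phi_i)$
26:     end for
27:     Update target network parameters for each agent $i$
28:end for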
Algorithm 1 Multi-agent Universal Policy Gradient
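The hindsight replay step (lines 11 to 15 of algorithm 1) stores every transition twice: once with the reward of the scenario that was actually played and once with the reward of an additionally sampled scenario. The sketch below illustrates only that storage step; the function name, the joint (rather than per-agent) transition layout and the `reward_fn` signature are assumptions.

```python
def store_with_hindsight(buffer, episode, reward_fn, scenario, extra_scenario):
    """Append each transition under the played scenario and an extra one."""
    for obs, actions, next_obs in episode:
        for g in (scenario, extra_scenario):
            # Re-evaluate the reward as if the transition belonged to g.
            reward = reward_fn(obs, actions, g)
            buffer.append((obs, actions, reward, next_obs, g))


# Toy reward that only pays off in the "street" scenario.
def toy_reward(obs, actions, g):
    return 1.0 if g == "street" else 0.0


replay = []
toy_episode = [("o0", "a0", "o1"), ("o1", "a1", "o2")]
store_with_hindsight(replay, toy_episode, toy_reward, "street", "mall")
```

The same trajectory therefore contributes training signal for two scenarios, which is the source of the additional experience discussed in the ablation study.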
(a) An overview of the multi-agent decentralized actor with centralized critic represented by a universal value function approximator.
(b) Representation of the single centralized critic by a universal function approximator.
Figure 1: An overview of the multi-agent universal policy gradient architecture where both actors and critics are augmented with the goal that the agents are trying to achieve.

Figure 2: Visual representation of the four different scenarios: (a) Random Landmarks, (b) Shopping Mall, (c) Street, (d) Pie-in-the-face. The emergence of complex behavior can be clearly seen: the bodyguards (in blue) have positioned themselves between the VIP (in brown) and the bystanders (in red), shielding the VIP from potential threats.
Figure 3: Learning curves of the scenarios in terms of average cumulative reward for the bodyguards: (a) Random Landmarks, (b) Shopping Mall, (c) Street, (d) Red Carpet. Notice that state-of-the-art multi-agent reinforcement learning algorithms such as COMA and Q-MIX fail to even take off in most of the challenging scenarios. MADDPG was the only algorithm that learned consistently in the environment.

4 Experimental Setup

4.1 The VIP Protection Problem

To demonstrate the effectiveness of our proposed algorithm, we simulated an urban security problem of VIP protection where a team of learning agents (bodyguards) provides physical protection to a VIP from bystanders in a crowded space. This problem was briefly explored in [14, 15].

We consider a VIP moving in a crowd of bystanders $\mathcal{B} = \{b_1, \ldots, b_M\}$, protected from assault by a team of bodyguards $R = \{r_1, \ldots, r_N\}$. To be able to reason about this problem, we need to quantify the threat to the VIP at a given moment from the nearby bystanders; the aim of the bodyguards is to reduce this value.

Two agents $x$ and $y$ have a line of sight if $x$ can directly observe $y$ with no obstacle between them. A bystander $b$ can only pose a threat to the VIP if it is closer than the safe distance $SafeDist$. The threat level $TL(VIP, b)$ [1] is defined as the probability that bystander $b$ can successfully assault the VIP, a value that decays exponentially with distance:

$$TL(VIP, b) = \exp\left(-A \cdot Dist(VIP, b) / B\right) \tag{2}$$

where the VIP should be in line of sight of $b$ and $Dist(VIP, b) < SafeDist$. $A$ and $B$ are positive constants that control the decay rate of the threat level.

The residual threat $RT(VIP, \mathcal{B}, t)$ is defined as the threat to the VIP at time $t$ from bystanders $\mathcal{B}$. Bodyguards can block the line of sight from the bystanders, thus the residual threat is always smaller than the threat level and depends on the position of the bodyguards with respect to the bystanders and the VIP. The cumulative residual threat to the VIP from bystanders $\mathcal{B}$ in the presence of bodyguards $R$ over the time period $[0, T]$ is defined as:

$$CRT = \int_{0}^{T} \left(1 - \prod_{b \in \mathcal{B}} \left(1 - RT(VIP, b, t)\right)\right) dt \tag{3}$$

Our end goal is to minimize $CRT$ through multi-agent reinforcement learning.
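The threat quantities can be illustrated with a short sketch. It rests on assumptions stated here rather than confirmed by the text: the threat of one bystander is taken as `exp(-A * distance / B)` inside the safe distance and zero otherwise or when the line of sight is blocked, per-bystander threats are combined as independent probabilities, and the time integral is approximated by a sum over discrete steps. All names and numbers are illustrative.

```python
import math


def threat_level(distance, safe_distance, a=1.0, b=1.0, line_of_sight=True):
    """Threat from one bystander: exponential decay inside the safe distance."""
    if not line_of_sight or distance >= safe_distance:
        return 0.0  # a blocked or distant bystander poses no threat
    return math.exp(-a * distance / b)


def combined_threat(threats):
    """Probability that at least one bystander succeeds, assuming independence."""
    no_assault = 1.0
    for t in threats:
        no_assault *= 1.0 - t
    return 1.0 - no_assault


def cumulative_residual_threat(per_step_threats, dt=1.0):
    """Discrete-time approximation of the integral of the combined threat."""
    return sum(combined_threat(step) * dt for step in per_step_threats)


# Two time steps: two unblocked bystanders, then every bystander blocked.
crt = cumulative_residual_threat([[0.5, 0.5], [0.0]])
```

In this sketch a bodyguard that blocks a line of sight simply sets that bystander's contribution to zero, which is how positioning lowers the cumulative value.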

4.2 Simulation and Scenarios

We designed four scenarios inspired by possible real-world situations of VIP protection and implemented them as behaviors in the Multi-Agent Particle Environment [10] (fig. 2).

In each scenario, the participants are the VIP, 4 bodyguards and 10 bystanders of one or more classes. The scenario description contains a number of landmarks, points in a 2D space that serve as starting points and destinations for the goal-directed movement of the agents. For each scenario, the VIP (brown disk) starts from the starting point and moves towards the destination landmark (green disk). The VIP exhibits a simple path-following behavior, augmented with a simple social skill metric: if it is about to enter the personal space of a bystander, it will slow down or come to a halt.

The scenarios differ in the arrangement of the landmarks and the behavior of the different classes of bystanders.

  1. Random Landmark: In this scenario, 12 landmarks are placed randomly in the area. The starting point and destination for the VIP are randomly selected landmarks. The bystanders perform random waypoint navigation: they pick a random landmark, move towards it, and when they reach it, they choose a new destination. A set of fixed seeds was used for the placement of landmarks and a different set of seeds was used for spawning bystanders in the environment.

  2. Shopping Mall: In this scenario, 12 landmarks are placed in fixed positions on the periphery of the area, representing shops in a market. The bystanders visit randomly selected shops and were spawned using a fixed set of random seeds.

  3. Street: This scenario aims to model the movement on a crowded sidewalk. The bystanders move towards waypoints that are outside the current area. However, due to their proximity to each other, the positions of the other bystanders influence their movement, as described by the laws of particle motion [19].

  4. Pie-in-the-Face: While in the other scenarios the bystanders treat the VIP as just another person, in this “red carpet” scenario the bystanders take an active interest in the VIP. We consider two distinct classes of bystanders with different behaviors. Rule-abiding bystanders stay behind a designated line, observing as the VIP passes in front of them. Unruly bystanders break the limit imposed by the line and try to approach the VIP (presumably, to throw a pie in his/her face).
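The random waypoint navigation used by the bystanders in the Random Landmark scenario can be sketched as a single movement step. This is an illustrative sketch, not the environment's code: positions are plain 2D tuples, the speed is an assumed constant, and the function name is hypothetical. It expects at least two landmarks so that a new destination always exists.

```python
import math
import random


def random_waypoint_step(position, destination, landmarks, speed, rng):
    """Move one step towards the destination; on arrival pick a new landmark."""
    dx = destination[0] - position[0]
    dy = destination[1] - position[1]
    dist = math.hypot(dx, dy)
    if dist <= speed:
        # Reached the landmark: stop there and choose a different destination.
        candidates = [lm for lm in landmarks if lm != destination]
        return destination, rng.choice(candidates)
    # Otherwise advance along the straight line by a fixed step length.
    step = (position[0] + speed * dx / dist, position[1] + speed * dy / dist)
    return step, destination


landmarks = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
rng = random.Random(0)
pos, dest = random_waypoint_step((0.0, 0.0), (1.0, 0.0), landmarks, 0.4, rng)
arrived, new_dest = random_waypoint_step((0.9, 0.0), (1.0, 0.0), landmarks, 0.4, rng)
```

Passing a seeded `random.Random` instance mirrors the use of fixed seeds for reproducible bystander behavior.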

Observation and Action Space

Following the model of the Multi-Agent Particle Environment [10], the action space of each bodyguard $i$ consists of a 2D vector of forces applied to the bodyguard and, to promote collaboration and cooperation [10], a $c$-dimensional communication channel.

The observation of each bodyguard is the physical state of the nearest bystanders, all the bodyguards in the scenario and their verbal utterances, such that $o_i = [x_j, c_k]$, where $x_j$ is the observation of entity $j$ from the perspective of agent $i$ and $c_k$ is the verbal utterance of agent $k$.

In this problem, we assume that all bodyguards have identical observation and action spaces. Moreover, each scenario embedding is represented as a one-hot vector.
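A one-hot scenario embedding can be built and appended to an agent's observation as sketched below. The scenario identifiers, their ordering and the plain concatenation are assumptions made for illustration; the text only states that the embedding is a one-hot vector.

```python
# Hypothetical identifiers for the four scenarios, in an arbitrary fixed order.
SCENARIOS = ["random_landmarks", "shopping_mall", "street", "pie_in_the_face"]


def one_hot(index, size):
    """Vector of zeros with a single one at the given index."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def augment_observation(observation, scenario_name):
    """Append the one-hot scenario embedding to a flat observation vector."""
    embedding = one_hot(SCENARIOS.index(scenario_name), len(SCENARIOS))
    return list(observation) + embedding


augmented = augment_observation([0.1, 0.2], "street")
print(augmented)  # [0.1, 0.2, 0.0, 0.0, 1.0, 0.0]
```

With four scenarios the embedding adds four inputs to whichever network consumes it.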

4.3 Reward Function

Using the definitions of threat level and cumulative residual threat in eq. 2 and eq. 3 respectively, the reward function for each bodyguard (eq. 4) can be written in terms of these quantities.

To encourage the bodyguards to stay within a limited distance of the VIP and to discourage them from attacking the bystanders, a distance regularizer is added to eq. 4 to form the final reward function. The regularizer depends on the minimum distance the bodyguard has to maintain from the VIP and on the safe distance mentioned in section 4.1.

Depending on the scenario, different values of the reward parameters were chosen for optimal performance. Using different parameter values per scenario also fulfills the requirement of a different reward function for each scenario to train a UVFA.

5 Experiments and Results

In this section, we first evaluate the usefulness of multi-agent reinforcement learning on the given problem by comparing its results with a hand-engineered solution for the VIP problem. Then we demonstrate the inability of state-of-the-art MARL algorithms to generalize over different scenarios. Finally, we compare the results of scenario-dependent policies with the results of universal policies. Our primary evaluation metric is Cumulative Residual Threat (CRT), defined in eq. 3.

5.1 Multi-Agent Reinforcement Learning vs Quadrant Load Balancing

In order to verify that multi-agent reinforcement learning solutions are better than explicitly programmed bodyguard behavior, we compare policies trained on individual scenarios against the quadrant load balancing (QLB) technique introduced in [1].

To identify the best MARL algorithm to compete with the hand-engineered solution, we trained five state-of-the-art MARL algorithms, namely Q-Mix, VDN, IQL, COMA (https://github.com/oxwhirl/pymarl/) and MADDPG, on our environment. It can be seen in fig. 3 that MADDPG was the only algorithm that was successful in learning in our environment, while the other algorithms fail to even take off. Therefore, we omit their CRT graphs.

We then compared the results of MADDPG with quadrant load balancing (QLB). From the results in fig. 4 we can see that the outcomes differ depending on the characteristics of the scenario. For the Pie-in-the-face scenario, where most of the bystanders stay behind the lines, both the RL-learned agents and the QLB model succeeded in essentially eliminating the threat. This was feasible in this specific setting, as there were four bodyguard agents for one “unruly” bystander. For the other scenarios, the average cumulative residual threat values are higher for both algorithms. However, for the Random Landmark and Shopping Mall scenarios the RL algorithm is able to reduce the threat to less than half of the QLB value, and in the case of the Street scenario, to less than one ninth of the QLB value.

Figure 4: Comparing the average cumulative residual threat values of MADDPG and QLB on four different scenarios.

Overall, these experiments demonstrate that multi-agent reinforcement learning can learn behaviors that improve upon algorithms that were hand-crafted for this specific task.

5.2 Universal Policies vs Scenario-Dependent Policies

In order to verify the claim that MARL algorithms trained on a specific scenario fail to generalize over different scenarios, we train policies via MADDPG on a specific scenario and test them on different scenarios. Policies on specific scenarios were trained using the same settings and configurations as in the experiments of section 5.1.

Figure 5: A confusion matrix representing the average residual threat values of MADDPG policies trained on a specific scenario when tested on different scenarios over 100 episodes.
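The cross-scenario evaluation behind such a matrix can be sketched as a nested loop over training and test scenarios. The `run_episode` callable, which is assumed to return the residual threat of one episode, and the dictionary layout are hypothetical; the stub below returns dummy values only so that the sketch runs and does not represent measured results.

```python
def cross_evaluate(policies, scenarios, run_episode, episodes=100):
    """Average an episode metric for every (trained-on, tested-on) pair."""
    matrix = {}
    for trained_on, policy in policies.items():
        row = {}
        for tested_on in scenarios:
            total = sum(run_episode(policy, tested_on, seed)
                        for seed in range(episodes))
            row[tested_on] = total / episodes
        matrix[trained_on] = row
    return matrix


# Stub episode runner with dummy values: zero threat on the training scenario,
# one otherwise. Each stand-in "policy" is just the name of its scenario.
def stub_episode(policy, scenario, seed):
    return 0.0 if policy == scenario else 1.0


names = ["street", "mall"]
matrix = cross_evaluate({n: n for n in names}, names, stub_episode, episodes=10)
```

The diagonal of the resulting matrix corresponds to testing on the training scenario with different seeds, and the off-diagonal entries to testing on unseen scenarios.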

Figure 6: Comparing the average cumulative residual threat values for universal policy agents with MADDPG and QLB agents

From the results shown in fig. 5 we can see that MADDPG policies trained on specific scenarios performed poorly when tested on different scenarios, compared to when they were tested on the same scenario with different seeds. In order to tackle the generalization problem, we train the agents using multi-agent universal policy gradient and compare its results with the results of scenario-dependent MADDPG policies.

From the results in fig. 6 we can see that our proposed method performs better than policies trained on specific scenarios as well as quadrant load balancing. Overall, these experiments demonstrate that state-of-the-art MARL algorithms such as MADDPG fail to generalize a single task over multiple scenarios, while our proposed solution MAUPG learns policies that allow a single task to be learnt across multiple scenarios and improves upon the state-of-the-art multi-agent reinforcement learning algorithm.

6 Ablation Study

The first natural question that can be asked here is: why can’t we just sample scenarios during the training of a standard MADDPG?

To answer this question and to see the effect of the UVFA and hindsight replay used in MAUPG, we perform an ablation study in which we gradually add important building blocks to MADDPG to transform the solution into MAUPG. First, to answer the question, we trained a MADDPG while sampling different scenarios. Second, we replaced the standard centralized critic with a UVFA, and finally we added the hindsight replay step. All the training settings and hyperparameters were kept the same across all the experiments.

From figs. 7 and 8, we can see that MADDPG does not learn half as well as MAUPG with or without the hindsight replay step. MAUPG learns better and faster than MAUPG without hindsight replay. This happens because MAUPG with hindsight replay benefits from replaying trajectories from one scenario in other scenarios (see lines 14 and 15 of algorithm 1), thus providing more experience to learn efficiently.

Figure 7: Learning curves of the ablated version of MAUPG. Notice the increase in performance in terms of average cumulative reward with an addition of UVFAs and hindsight replay step.
Figure 8: Learning curves of the ablated version of MAUPG. Notice the decline in average cumulative residual threat with an addition of UVFAs and hindsight replay step.

7 Conclusion

In this paper, we highlighted the failure of MARL algorithms to generalize a single task over multiple known scenarios. To solve the generalization problem, we proposed multi-agent universal policy gradient, a policy gradient method inspired by universal value function approximators that generalizes not only over the state space but also over a set of different scenarios. We also built a challenging 2D environment simulating an urban security problem that can be used as a benchmark for similar problems. Experimental studies have demonstrated that our proposed method generalizes well over different scenarios and performs better than MADDPG trained on the different scenarios individually.


  • [1] T.S. Bhatia, G. Solmaz, D. Turgut, and L. Bölöni (2016-05) Controlling the movement of robotic bodyguards for maximal physical protection. In Proc. of the 29th International FLAIRS Conference, pp. 380–385.
  • [2] D. Borsa, A. Barreto, J. Quan, D. J. Mankowitz, H. van Hasselt, R. Munos, D. Silver, and T. Schaul (2019) Universal successor features approximators. In International Conference on Learning Representations.
  • [3] M. Jaderberg, W. M. Czarnecki, I. Dunning, L. Marris, G. Lever, A. G. Castaneda, C. Beattie, N. C. Rabinowitz, A. S. Morcos, A. Ruderman, N. Sonnerat, T. Green, L. Deason, J. Z. Leibo, D. Silver, D. Hassabis, K. Kavukcuoglu, and T. Graepel (2018) Human-level performance in first-person multiplayer games with population-based deep reinforcement learning. arXiv preprint arXiv:1807.01281.
  • [4] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2015) Continuous control with deep reinforcement learning. In Proc. of the 3rd Int’l Conf. on Learning Representations (ICLR).
  • [5] M. L. Littman (1994) Markov games as a framework for multi-agent reinforcement learning. In Proc. of the 11th Int’l Conf. on Machine Learning (ICML), pp. 157–163.
  • [6] S. Liu, G. Lever, N. Heess, J. Merel, S. Tunyasuvunakool, and T. Graepel (2019) Emergent coordination through competition. In International Conference on Learning Representations.
  • [7] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch (2017) Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems 30, pp. 6379–6390.
  • [8] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba (2017) Hindsight experience replay. In Advances in Neural Information Processing Systems 30, pp. 5048–5058.
  • [9] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis (2015-02-26) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533.
  • [10] I. Mordatch and P. Abbeel (2017) Emergence of grounded compositional language in multi-agent populations. arXiv preprint arXiv:1703.04908.
  • [11] S. Omidshafiei, J. Pazis, C. Amato, J. P. How, and J. Vian (2017) Deep decentralized multi-task multi-agent reinforcement learning under partial observability. In Proc. of the 34th Int’l Conf. on Machine Learning (ICML), pp. 2681–2690.
  • [12] OpenAI (2018) OpenAI five. Note: https://blog.openai.com/openai-five/
  • [13] T. Schaul, D. Horgan, K. Gregor, and D. Silver (2015) Universal value function approximators. In Proc. of the 32nd Int’l Conf. on Machine Learning (ICML), pp. 1312–1320.
  • [14] H. U. Sheikh and L. Bölöni (2018-07) Designing a multi-objective reward function for creating teams of robotic bodyguards using deep reinforcement learning. In Proc. of 1st Workshop on Goal Specifications for Reinforcement Learning (GoalsRL-2018) at ICML 2018.
  • [15] H. U. Sheikh and L. Bölöni (2018-07) The emergence of complex bodyguard behavior through multi-agent reinforcement learning. In Proc. of Autonomy in Teams (AIT-2018) workshop at ICML-2018.
  • [16] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529, pp. 484–503.
  • [17] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller (2014) Deterministic policy gradient algorithms. In Proc. of the 31st Int’l Conf. on Machine Learning (ICML), pp. 387–395.
  • [18] Y. Teh, V. Bapst, W. M. Czarnecki, J. Quan, J. Kirkpatrick, R. Hadsell, N. Heess, and R. Pascanu (2017) Distral: robust multitask reinforcement learning. In Advances in Neural Information Processing Systems 30, pp. 4496–4506.
  • [19] T. Vicsek, A. Czirók, E. Ben-Jacob, I. Cohen, and O. Shochet (1995) Novel type of phase transition in a system of self-driven particles. Physical Review Letters 75 (6), pp. 1226.