Inferring and Learning Multi-Robot Policies by Observing an Expert

In this paper we present a technique for learning how to solve a multi-robot mission that requires interaction with an external environment by repeatedly observing an expert system executing the same mission. We define the expert system as a team of robots equipped with a library of controllers, each designed to solve a specific task, supervised by an expert policy that selects controllers based on the states of the robots and the environment. The objective is for an un-trained team of robots equipped with the same library of controllers, but agnostic to the expert policy, to execute the mission with performance comparable to that of the expert system. From observations of the expert system, the Interactive Multiple Model technique is used to estimate the individual controllers executed by the expert policy. Then, the history of estimated controllers and environmental states is used to learn a policy for the un-trained robots. Considering a perimeter protection scenario on a team of simulated differential-drive robots, we show that the learned policy endows the un-trained team with performance comparable to that of the expert system.





I Introduction

One of the primary issues with human control of a multi-robot system, or human-swarm interaction, is the cognitive load associated with tracking and delegating tasks to each robot [kolling2015human]. Along with the poor scalability of a one-to-many approach in controlling teams of robots, further exacerbation of this problem occurs when the robots are equipped with heterogeneous attributes, as the complexity of the robot-to-task allocation process can easily become intractable [prorok2016fast].

To circumvent the scalability issues, abstractions of the state information [luo2015asynchronous, li2016handling] and control strategies [kolling2015human, cortes2017coordinated] of the system are often used by human operators to effectively manipulate multi-robot systems. In particular, local interaction rules between the robots can be designed such that a desired task-oriented collective behavior emerges at the ensemble level, e.g. [cortes2017coordinated, antonelli2013interconnected, schwager2011unifying]. However, the resulting task-oriented controllers cannot be directly applied to complex, real-world missions, which usually require an assortment of such controllers (see for instance [nagavalli2017automated, pierpaoli2019sequential] for implementations of this idea). Even though it is possible to enhance the expressiveness of individual controllers through their sequential combination [twu2010graph], human-level intelligence and expertise are required in order to interact with unknown and unpredictable environments [nikolaidis2013human]. Inspired by the paradigm of learning from demonstration [argall2009survey, billard2016learning], we propose a solution to the problem of composing multi-robot primitive controllers by leveraging observations of an expert system.

In this paper, we consider an expert system composed of a team of robots solving a mission that requires interaction with an external environment. To make the discussion more concrete, let us consider a perimeter protection scenario, where the robots are tasked with preventing intruders from accessing a restricted area. In this case, we can consider the intruders as the environment external to the robots. In order to solve this mission, the policy supervising the robots selects actions from a finite list of primitives (e.g., gather at locations, assemble formations, chase intruders) by considering the state of the robots and the environment. In this paper, we consider the expert policy to be a deterministic automated strategy capable of solving the mission with a certain level of performance. Extensions of the proposed framework to policies obtained with a human in the loop are possible but will not be investigated here.

Parallel to the expert, we consider a team of un-trained imitating robots that have access to the same list of primitive controllers used by the expert system but are agnostic to the actual policy controlling them. Our objective is a procedure that allows the imitating robots to construct a policy capable of achieving expert-like performance by collecting noisy observations of the expert system and of the environmental state during execution of the mission.

We note that the idea of learning how to create sequences from a pre-defined list of controllers, rather than learning entirely new controllers, is motivated by the rich literature on coordinated control for multi-agent systems [mesbahi2010graph]. In addition, although beyond the scope of the current work, the use of coordinated controllers makes our framework naturally predisposed to fully distributed implementations. Finally, by focusing on high-level strategies that compose well-understood controllers, we can interpret the actions executed by the robots.

The remainder of this paper is organized as follows. In Section II we discuss the main formulation of the problem, along with the expert and imitator dynamical models. In Section III we describe the estimation scheme used by the imitator to infer primitive controllers from noisy observations of the expert. In Section IV we describe the training procedure used by the imitator to learn the expert’s policy. Finally, in Section V we apply the proposed technique to a perimeter protection scenario on a team of differential-drive robots.

I-A Related Work

Robotic systems capable of inferring human behavior have been studied in a variety of contexts, such as human-robot interaction (HRI) [wang2013probabilistic, luo2018unsupervised], autonomous navigation [kelley2008understanding, jain2016recurrent], and interactive learning [cakmak2012designing, dillmann2004teaching]. For instance, within the context of HRI, inferring human behavior is often referred to as intention inference, wherein the robot attempts to identify the underlying intention or goal of the human partner [ravichandar2016human, ravichandar2018gaze]. In the context of interactive learning, the robot aims to infer what the human partner is teaching [cakmak2012designing]. In these contexts, it is either assumed that labelled training data can be acquired or it is sufficient to classify the unlabelled data into abstract clusters. In our work, we are interested not only in inferring the expert behavior, but also in learning from such inferences.

Learning from demonstrations (LfD) is a popular paradigm in robotics that provides a plethora of techniques aimed at learning from and imitating expert behavior [argall2009survey]. A vast majority of techniques, however, rely on labelled demonstrations and solve a supervised learning problem. A few exceptions to this assumption include learning reward functions from unlabelled demonstrations (e.g., [babes2011apprenticeship, liu2018imitation]), and learning manipulation skills from unlabelled videos (e.g., [sermanet2018time]).

Both the inference and the learning methods discussed thus far, however, are limited to single-robot tasks and scenarios. In the context of multi-robot systems, prior work has explored identifying joint-intention and shared plans of a group of agents [demiris2007prediction, pei2011parsing]. However, similar to examples in single-robot systems, the inferred information is not utilized to learn control policies. On the other hand, algorithms that learn multi-agent policies (e.g., [natarajan2010multi, vsovsic2017inverse, bhattacharyya2018multi]) rely on labelled expert demonstrations.

As mentioned earlier, a rich body of work exists on task-oriented controllers for multi-agent teams. Coordinated controllers based on weighted consensus protocols have been used to achieve, for example, flocking [olfati2006flocking], coverage [santos2018coverage], formation control [zelazo2008decentralized], and cyclic pursuit [ramirez2010distributed]. Examples of alternative methods include Null Space Methods [antonelli2008null], Navigation Functions [tanner2005towards], and Model Predictive Control [richards2004decentralized]. Related to our contribution, solutions to the problem of composing sequences of controllers include formal methods [kress2018synthesis], path planning [nagavalli2017automated], Finite State Machines [marino2009behavioral], Petri Nets [klavins2000formalism], Behavior Trees [colledanchise2014behavior], and Reinforcement Learning [mehta2006optimal, pierpaoli2019reinforcement]. We take inspiration from a variety of communities in order to develop a unified framework to learn multi-robot controller selection policies from unlabelled and noisy observations of an expert system.

II Problem Formulation

Let us consider a team of robots interacting with an external environment, with the completion of a mission as their final objective. At each time step, the team is described by the robots’ state and the environment by its own state. The motion of the robots is governed by a controller selected from a finite library of controllers, each of which is a continuous function of the state of the robots. At any given time, all robots are driven by the same controller, which excludes the possibility of sub-teams acting in parallel.

A supervising deterministic policy selects controllers from the library as a function of the states of the robots and the environment. The resulting discrete-time dynamics of the robots are driven by the selected controller and perturbed by a white Gaussian additive process noise with known variance. The policy selects controllers so that the mission is executed with a measurable performance. We refer to this system of robots as the expert system, to its policy as the expert’s policy, and to the resulting performance as the expert’s performance under that policy. As mentioned above, we assume the expert’s policy takes actions based on the exact expert and environmental states. Although beyond the scope of this paper, we acknowledge that appropriate estimation schemes can be considered to recover these states when they are not directly available.

In addition to the expert system just described, we assume the existence of a second team of robots identical to those in the expert system, whose dynamics are described by the same library of controllers. We refer to this second system as the imitator. The objective of the imitator is to learn an approximation of the expert policy in order to achieve performance similar to the expert’s. The imitator’s dynamics take the same form as the expert’s, with the approximated policy in place of the expert policy and, as for the expert system, we assume the imitator has access to the exact state of the environment.

In order to construct such an approximation, the imitator collects a sequence of environmental states and a sequence of noisy observations of the expert’s state, where each observation is corrupted by a white Gaussian additive noise with zero mean and known variance.
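The formulation above can be illustrated with a minimal simulation sketch. The single-integrator update `x + dt * u(x)`, the two toy controllers, the threshold policy, and every name below are assumptions made for illustration only; the formulation itself only requires a finite controller library, a deterministic policy of the robot and environment states, additive Gaussian process noise, and noisy observations of the expert's state.

```python
import random

# Hypothetical two-controller library acting on a list of scalar robot positions.
def gather(x):
    """Drive every robot toward the team centroid."""
    c = sum(x) / len(x)
    return [c - xi for xi in x]

def spread(x):
    """Drive every robot away from the team centroid."""
    c = sum(x) / len(x)
    return [xi - c for xi in x]

LIBRARY = [gather, spread]

def expert_policy(x, env):
    """Deterministic policy (assumed rule): gather when the environment state is positive."""
    return 0 if env > 0.0 else 1

def step(x, env, policy, dt=0.1, process_std=0.01, rng=random):
    """One step of the assumed dynamics: x_next = x + dt * u(x) + process noise."""
    k = policy(x, env)
    u = LIBRARY[k](x)
    x_next = [xi + dt * ui + rng.gauss(0.0, process_std) for xi, ui in zip(x, u)]
    return x_next, k

def observe(x, obs_std=0.05, rng=random):
    """Noisy observation of the expert state, as available to the imitator."""
    return [xi + rng.gauss(0.0, obs_std) for xi in x]

def rollout(x0, env_seq, policy, seed=0):
    """Return the observation history and the true controller history."""
    rng = random.Random(seed)
    x, observations, controllers = list(x0), [], []
    for env in env_seq:
        x, k = step(x, env, policy, rng=rng)
        observations.append(observe(x, rng=rng))
        controllers.append(k)
    return observations, controllers
```

In this sketch the imitator would only ever see `observations` and the environment sequence; the `controllers` list plays the role of the ground truth that must be inferred.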

In this paper we propose a two-stage procedure for the imitator to learn its approximated policy. First, from noisy observations of the expert’s state and the exact state of the environment, the imitator creates an online estimate of the controllers being executed by the expert. Then, based on these estimates, we propose a policy approximation procedure that allows the imitator to solve the task with performance comparable to the expert’s performance on the same task.

III Inferring Controllers from Observations

In this section, we describe the Interactive Multiple Model (IMM) technique [blom1988interacting], implemented by the imitator to estimate the sequence of controllers the expert executes, given a history of observations. The choice of this technique is motivated by its popularity and computational efficiency, and we emphasize that the same objective could be achieved with alternative estimation schemes.

III-A Interactive Multiple Models Estimator

The IMM estimator requires a finite list of controllers (or modes) whose transitions are described by a Markov process, characterized by the probability of switching from one mode to another between consecutive time steps. In addition, we define a bank of filters (e.g., Kalman filters or extended Kalman filters (EKF)), each corresponding to a controller in the library. By combining the expert’s state observations with the modes’ transition probabilities, the IMM computes 1) the probability of each mode being active at the current time and 2) an estimate of the expert’s state. The estimation technique follows three main steps, which we briefly discuss here for completeness. We refer the reader to [mazor1998interacting] for a review of IMM implementations and possible variations. In the following, the covariance expressions involve the outer product of a vector with itself, i.e., the vector multiplied by its own transpose.


First, mixing probabilities are computed by propagating each mode’s probability through the Markov process, up to a normalizing factor. These mixing probabilities are then used to blend the previous state estimates and covariance matrices of the individual filters into one mixed initial condition per mode.

Second, the posterior estimate, covariance matrix, and likelihood for each mode are computed by applying an EKF iteration for each of the known modes. The individual mode probabilities are then updated by weighting each mode’s predicted probability with its likelihood, again up to a normalizing factor.

Third, the final state estimate and covariance matrix are obtained by combining the values from each filter, weighted by the probability of the corresponding mode.
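For reference, the three steps can be written in the notation commonly used in the IMM literature (e.g., [mazor1998interacting]). The symbols below are assumptions introduced for this sketch: $\mu_j(k)$ is the probability of mode $j$ at step $k$, $p_{ij}$ is the Markov transition probability from mode $i$ to mode $j$, $\hat{x}_j$, $P_j$, and $\Lambda_j$ are the estimate, covariance, and likelihood of the $j$-th filter, and $M$ is the number of modes.

```latex
\begin{align*}
% Step 1: mixing probabilities and mixed initial conditions
\mu_{i|j}(k-1) &= \frac{1}{\bar{c}_j}\, p_{ij}\, \mu_i(k-1),
\qquad \bar{c}_j = \sum_{i=1}^{M} p_{ij}\, \mu_i(k-1), \\
\hat{x}_{0j}(k-1) &= \sum_{i=1}^{M} \mu_{i|j}(k-1)\, \hat{x}_i(k-1), \\
P_{0j}(k-1) &= \sum_{i=1}^{M} \mu_{i|j}(k-1)
\Big[ P_i(k-1) + \big(\hat{x}_i(k-1) - \hat{x}_{0j}(k-1)\big)\big(\hat{x}_i(k-1) - \hat{x}_{0j}(k-1)\big)^{\top} \Big], \\
% Step 2: mode probability update from the filter likelihoods
\mu_j(k) &= \frac{1}{c}\, \Lambda_j(k)\, \bar{c}_j,
\qquad c = \sum_{j=1}^{M} \Lambda_j(k)\, \bar{c}_j, \\
% Step 3: combination of the mode-conditioned estimates
\hat{x}(k) &= \sum_{j=1}^{M} \mu_j(k)\, \hat{x}_j(k), \\
P(k) &= \sum_{j=1}^{M} \mu_j(k)
\Big[ P_j(k) + \big(\hat{x}_j(k) - \hat{x}(k)\big)\big(\hat{x}_j(k) - \hat{x}(k)\big)^{\top} \Big].
\end{align*}
```

Here $\hat{x}_j(k)$, $P_j(k)$, and $\Lambda_j(k)$ are produced by running the $j$-th filter from the mixed initial condition $(\hat{x}_{0j}, P_{0j})$ on the new observation.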

Every time a new observation is collected, the imitator performs an iteration of the procedure just described, from which we obtain the probability distribution over all the controllers in the library. The controller being executed by the expert system at that time step is then estimated as the maximum a posteriori (MAP) estimate, i.e., the controller with the highest mode probability.
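One IMM iteration followed by the MAP selection can be sketched as follows. To stay short and dependency-free, the sketch assumes a scalar state, one linear Kalman filter per mode (rather than an EKF), and modes that differ only by a known constant drift; the function name `imm_step` and all parameter names are hypothetical.

```python
import math

def imm_step(y, x_hat, P, mu, trans, drifts, dt=0.1, Q=1e-4, R=1e-2):
    """One IMM iteration for a scalar state.

    y: new observation; x_hat, P, mu: per-mode estimates, variances, probabilities;
    trans[i][j]: probability of switching from mode i to mode j;
    drifts[j]: assumed constant velocity applied by mode j.
    """
    M = len(mu)
    # Step 1: mixing probabilities and mixed initial conditions.
    c_bar = [sum(trans[i][j] * mu[i] for i in range(M)) for j in range(M)]
    mix = [[trans[i][j] * mu[i] / c_bar[j] for j in range(M)] for i in range(M)]
    x0 = [sum(mix[i][j] * x_hat[i] for i in range(M)) for j in range(M)]
    P0 = [sum(mix[i][j] * (P[i] + (x_hat[i] - x0[j]) ** 2) for i in range(M))
          for j in range(M)]
    # Step 2: one Kalman iteration per mode, then mode probability update.
    x_new, P_new, lik = [], [], []
    for j in range(M):
        x_pred = x0[j] + dt * drifts[j]      # mode-specific prediction
        P_pred = P0[j] + Q
        r = y - x_pred                       # innovation
        S = P_pred + R                       # innovation variance
        K = P_pred / S                       # Kalman gain
        x_new.append(x_pred + K * r)
        P_new.append((1.0 - K) * P_pred)
        lik.append(math.exp(-0.5 * r * r / S) / math.sqrt(2.0 * math.pi * S))
    c = sum(lik[j] * c_bar[j] for j in range(M))
    mu_new = [lik[j] * c_bar[j] / c for j in range(M)]
    # Step 3: combine the mode-conditioned estimates.
    x_comb = sum(mu_new[j] * x_new[j] for j in range(M))
    P_comb = sum(mu_new[j] * (P_new[j] + (x_new[j] - x_comb) ** 2) for j in range(M))
    # MAP estimate of the active mode.
    mode = max(range(M), key=lambda j: mu_new[j])
    return x_new, P_new, mu_new, x_comb, P_comb, mode
```

Calling `imm_step` once per observation and storing the returned `mode` yields the estimated controller history used later for training.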


III-B Multi-Robot Coordinated Controllers Inference

We demonstrate the application of the IMM estimator to the identification of the controllers executed by a multi-robot expert system. Taking inspiration from the literature on coordinated behaviors for multi-agent systems [cortes2017coordinated], we assume each of the controllers in the library to be composed of two terms. The first term corresponds to a coordinated effect between the robots, which is represented by a weighted consensus term [mesbahi2010graph]. The second term is the agent’s individual objective, used to represent, for example, one or more leaders in the team. Each controller is thus described by a weighted Laplacian specific to that controller, which encodes the consensus term, together with a leader-like term. In this case, we consider the library to be composed of the following behaviors.

  • Circular formation: the robots are driven to the desired inter-robot separations corresponding to a circle-shaped formation.

  • Cone formation: defined as above, but with desired separations corresponding to a cone-shaped formation.

  • Cyclic pursuit: the robots pursue one another along a cycle of given radius around a given center point.

  • Leader-follower: the robots maintain a desired separation from one another while one robot, the leader, is additionally driven toward its goal.
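The two-term structure of these controllers (a weighted consensus term plus an optional leader term) can be sketched as below. The displacement-based formation law, the uniform weights, the proportional leader term, and all function names are assumptions chosen for illustration and are not the exact controllers used in the experiments.

```python
import math

def circle_offsets(n, radius):
    """Desired robot positions on a circle; only their differences are used."""
    return [(radius * math.cos(2.0 * math.pi * i / n),
             radius * math.sin(2.0 * math.pi * i / n)) for i in range(n)]

def coordinated_control(x, weights, offsets, leader=None, goal=None, gain=1.0):
    """Weighted consensus on displacement errors plus an optional leader term.

    x: list of (px, py) robot positions; weights[i][j]: edge weight between i and j;
    offsets: desired positions defining the formation shape;
    leader, goal: index of the leader robot and its goal point (both optional).
    """
    n = len(x)
    u = []
    for i in range(n):
        ux, uy = 0.0, 0.0
        for j in range(n):
            w = weights[i][j]
            # Consensus term: drive relative positions to the desired displacements.
            ux += w * ((x[j][0] - x[i][0]) - (offsets[j][0] - offsets[i][0]))
            uy += w * ((x[j][1] - x[i][1]) - (offsets[j][1] - offsets[i][1]))
        if leader == i and goal is not None:
            # Leader term: proportional pull of the leader toward its goal.
            ux += gain * (goal[0] - x[i][0])
            uy += gain * (goal[1] - x[i][1])
        u.append((ux, uy))
    return u
```

Swapping `offsets` changes the formation shape (e.g., circle versus cone), while setting `leader` and `goal` gives a leader-follower behavior; the consensus term vanishes whenever the robots already form a translated copy of the desired shape.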

In this case, since we are not interested in approximating the expert’s policy but rather in assessing the performance of the IMM, we assume the expert randomly selects any of the controllers described above at Poisson-distributed instants of time. The results obtained from simulations are reported in Fig. 1, which shows that the IMM correctly estimates all modes, provided a minimum sojourn time for the controllers is respected.

Fig. 1: IMM estimation performance for a library of behaviors. The solid line represents the behavior being executed by the expert system, while the dashed line is the IMM estimate. The controllers are ordered as 1) cyclic pursuit, 2) leader-follower, 3) formation - circle, and 4) formation - cone.

IV Learning to Compose Behaviors

In this section, we introduce our approach to learning the expert policy. We propose to learn the imitator’s policy, which approximates the expert policy in mapping the current environmental and estimated robot states onto the appropriate behavior from the controller library. As noted in (2), the policy attempts to capture the expert’s general strategy for choosing behaviors, as opposed to encoding a deterministic sequence of behaviors. In order to train such a policy, we collect training data over multiple episodes, so that the training set is the collection of the data associated with each episode.

Note that the training process does not assume access to either the true behavior sequence or the true state of the expert system. Put another way, the system is trained using the inferred behaviors and state estimates provided by the IMM filter. Once the imitator policy is learned, it can be utilized to compose the individual behaviors of the imitator system in order to solve the task of interest.
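As a rough illustration of how such a training set could be assembled from the estimator's output, the sketch below turns per-step IMM results into feature and label lists. The episode layout, the function name `build_dataset`, and the `use_robot_state` switch (which drops the estimated robot state from the inputs) are hypothetical choices made for this example.

```python
def build_dataset(episodes, use_robot_state=True):
    """Flatten episodes into (features, labels) for supervised training.

    episodes: list of episodes; each episode is a list of tuples
        (env_state, estimated_robot_state, mode_probabilities).
    Labels are the MAP controller indices inferred by the estimator,
    not the true controllers executed by the expert.
    """
    features, labels = [], []
    for episode in episodes:
        for env_state, est_state, mode_probs in episode:
            label = max(range(len(mode_probs)), key=lambda j: mode_probs[j])
            row = list(env_state)
            if use_robot_state:
                row += list(est_state)
            features.append(row)
            labels.append(label)
    return features, labels
```

Both the inputs and the labels come from inferred quantities, so any estimation error shows up directly as noisy features or mislabelled samples.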

To illustrate our approach, we utilize a neural network (NN) to parameterize the imitator’s policy. The parameters of the network are trained using the standard backpropagation algorithm. Note that the choice of neural networks over other models is motivated by their universal approximation capability [cybenko1989approximation] and is not central to our framework. Any model that is capable of sufficiently capturing the expert policy would be applicable.

V Experimental Validation

In this section, we apply the policy approximation technique described in the previous sections to observations of an expert system performing a perimeter protection mission. During the expert’s execution of the mission, the imitator collects observations of the expert’s state and the environment’s state in order to estimate which controllers are used by the expert for different configurations of the environment. Then, using data from the estimation process, we train the imitator’s policy, which is finally executed by the imitator on the same scenario.

V-A Perimeter Defense Scenario

Perimeter protection can be considered either as a stand-alone scenario or as a sub-task common to many games, which makes it an ideal testing scenario for multi-agent control protocols [shishika2018local]. In this scenario, the expert team is tasked with defending a region while an adversarial team, representing the environment, tries to intrude. With reference to the domain represented in Fig. 2, the intruders’ objective is to reach the innermost circular region (dark blue), and the objective of the defenders is to prevent intruders from reaching it. We now describe both the defenders’ and the intruders’ strategies, summarized in Fig. 3.

Fig. 2: Perimeter protection domain representation. Expert robots (orange) defend the protected area (dark blue) from intruders (yellow). The two radii delimit the regions that trigger different expert controllers. The white area is the intruders’ recovery area.

V-A1 Intruders’ strategy

We assume two intruders, each described at any given time by one of three possible modes, namely loiter, attack, and retreat. In addition, each intruder is surrounded by a circular protected area of fixed radius. Initially, both intruders start at random points inside the recovery area, in loiter mode. When both intruders are in the recovery area, at Poisson-distributed instants of time, one or both intruders are selected at random and switched to attack mode. Intruders in attack mode proceed straight towards the center of the protected region. During an attack, if an intruder encounters a defending robot within the radius of its protected area, the intruder switches to retreat mode and moves back to a uniformly selected point in the recovery area.

V-A2 Defenders’ policy

The policy employed by the defending robots selects controllers from the library, which we assume to be composed of the same controllers described in Section III-B, namely leader-follower, cyclic pursuit, and two formation-assembling protocols, cone formation and circular formation. The policy switches on the two trigger radii shown in Fig. 2. When none of the intruders is within the outer radius, the defending robots execute the circular formation inside the protected region. If one or both intruders are between the inner and outer radii, the defenders switch to the cyclic pursuit controller. If a single intruder is within the inner radius from the center of the protected region, the defenders execute the leader-follower controller towards the intruder. Finally, if both intruders are within the inner radius, the policy selects the cone formation.
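The switching rule above amounts to a small decision function over the intruders' distances from the center of the protected region. In the sketch below, the names `r_inner` and `r_outer` for the two trigger radii and the tie-breaking order (an intruder inside the inner radius takes priority over one between the two radii) are assumptions; the rule as described does not state how mixed cases are resolved.

```python
import math

CIRCULAR = "circular formation"
CYCLIC_PURSUIT = "cyclic pursuit"
LEADER_FOLLOWER = "leader-follower"
CONE = "cone formation"

def defender_policy(intruders, r_inner, r_outer, center=(0.0, 0.0)):
    """Select a controller from the intruders' distances to the protected region.

    intruders: list of (px, py) intruder positions;
    r_inner, r_outer: assumed names for the two trigger radii (r_inner < r_outer).
    """
    dists = [math.dist(p, center) for p in intruders]
    n_inner = sum(d < r_inner for d in dists)   # intruders inside the inner radius
    n_outer = sum(d < r_outer for d in dists)   # intruders inside the outer radius
    if n_inner >= 2:
        return CONE
    if n_inner == 1:
        return LEADER_FOLLOWER
    if n_outer >= 1:
        return CYCLIC_PURSUIT
    return CIRCULAR
```

Because the rule depends only on the intruders' positions, the selected controller is independent of the defenders' own state.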

Finally, we define the mission performance as the ratio between the number of intruders’ successful attacks and the total number of initiated attacks.

Fig. 3: Attack and defense policies. Intruders’ actions are described by one of three possible modes: loiter, retreat, and attack. Defenders’ actions are described by four different coordinated controls: rendezvous, leader-follower, cyclic pursuit, and formation control.

Fig. 4: Screenshots from the Robotarium simulation of the perimeter protection scenario. All robots are colored in orange, and intruders are distinguishable by the yellow disc representing their capture area. The three blue circles, from dark to light, represent the protected area and the two regions that trigger different expert controllers, respectively. The text in the figure indicates the behavior executed by the defenders.

V-B Results

The perimeter protection scenario was implemented on the Robotarium simulator [pickem2017robotarium], which includes a number of features of the real robots, e.g., differential-drive kinematics and collision avoidance through control barrier functions. In the following, we define an episode as the time between the initiation of two consecutive intruders’ attacks. Data collected in simulation were processed by the IMM to produce training data consisting of 200 episodes. The performance of the IMM in estimating the controllers executed by the expert during the perimeter defense scenario is reported in Fig. 5. We note that, in this scenario, correctly estimating cyclic pursuit and leader-follower is particularly challenging since, by definition of the expert policy, these controllers are executed for a limited period of time and, as a result, the estimator is unlikely to acquire sufficient evidence to accurately infer the underlying behavior.

Fig. 5: IMM estimation performance in the perimeter protection scenario. Each bar represents the fraction of occurrences that a controller was correctly estimated by the IMM.
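The per-controller metric described in this caption can be computed as sketched below, given the true and estimated controller indices at each time step. The function name and the convention of returning `None` for a controller that never occurs are assumptions made for this example.

```python
def per_controller_accuracy(true_modes, estimated_modes, n_modes):
    """Fraction of time steps at which each controller was correctly estimated.

    true_modes, estimated_modes: equal-length sequences of controller indices;
    returns one value per controller, or None if that controller never occurs.
    """
    counts = [0] * n_modes
    correct = [0] * n_modes
    for t, e in zip(true_modes, estimated_modes):
        counts[t] += 1
        if t == e:
            correct[t] += 1
    return [correct[k] / counts[k] if counts[k] else None for k in range(n_modes)]
```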

Since the expert policy described in Section V-A does not depend on the state of the robots, we trained two different models: one assuming knowledge of this prior information and a second without this prior knowledge. In addition, in order to evaluate the effects of the IMM performance on the results, additional policies were trained assuming exact knowledge of the expert system’s states and controllers. Similar to the IMM-based policies, we trained one ground-truth-based policy assuming prior information and one without. In the remainder of this section, we present results comparing these four policies alongside the expert policy.

Fig. 6: Training and validation accuracy of the imitator neural network with ground truth data, with IMM data, with ground truth data and prior, and with IMM data and prior.

All four policies are represented using a multi-layer neural network with two 32-neuron, fully connected layers with hyperbolic tangent activation and an output classification layer with softmax activation to classify between the four control modes. For each policy, the data were split into training and validation sets. Training results are reported in Fig. 6. As one would expect, training with ground truth data results in very high training and validation performance irrespective of whether prior information about the policy structure is assumed. In support of our framework, the IMM-based policy trained without prior information is shown to perform similarly to the ground-truth-based policies. Perhaps more interestingly, we note that this policy results in better imitation performance than the other IMM-based policy, which was trained with prior information. We believe this is because the robot state estimates, while seemingly superfluous, might have helped to better deal with mislabelled training data.
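The architecture described above (two 32-neuron tanh layers followed by a four-way softmax) can be sketched as a plain forward pass. The sketch covers inference only: training by backpropagation is omitted, and the initialization scheme, function names, and input dimension are assumptions made for illustration.

```python
import math
import random

def init_layer(n_in, n_out, rng):
    """Random weights (assumed uniform initialization) and zero biases."""
    scale = 1.0 / math.sqrt(n_in)
    W = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

def dense(W, b, x):
    """Affine map W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def init_policy(n_inputs, n_hidden=32, n_modes=4, seed=0):
    """Two hidden layers of n_hidden units and one output layer of n_modes units."""
    rng = random.Random(seed)
    return [init_layer(n_inputs, n_hidden, rng),
            init_layer(n_hidden, n_hidden, rng),
            init_layer(n_hidden, n_modes, rng)]

def policy_probs(params, features):
    """Probability assigned to each control mode for one feature vector."""
    h = list(features)
    for W, b in params[:-1]:
        h = [math.tanh(v) for v in dense(W, b, h)]
    W, b = params[-1]
    return softmax(dense(W, b, h))

def select_controller(params, features):
    """Index of the most probable control mode."""
    p = policy_probs(params, features)
    return max(range(len(p)), key=lambda j: p[j])
```

At execution time the imitator would call `select_controller` on the current environment state (and, for the models trained without the prior, the robot state as well) and run the corresponding controller from the library.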

Finally, the mission performance of the learned policies was evaluated in the perimeter defense scenario. Table I reports the mission performance obtained with the different policies, while Fig. 7 shows the error rates of the different policies with respect to the expert policy. In order to better understand the performance of the different policies, we separate attacks executed by a single intruder (SOLO), attacks executed by both intruders (DUO), and overall attacks (COMBO). Despite the limited performance in estimating two of the controllers, as shown in Fig. 5, we observe that the imitator’s performance does not deviate significantly from the target one. This is explained by the limited impact of cyclic pursuit and leader-follower on the actual mission performance.

SOLO    6.7   1.6   0.7   2.3    9.2
DUO    13.5   8.0   3.5   7.6   16.1
COMBO  10.5   4.9   2.1   4.9   12.3
TABLE I: Mission performance results. Percentage of intruders’ successful attacks over total attacks.

Fig. 7: Relative error between expert’s and imitator’s performance for the different policies. From the left: imitator with ground truth training, imitator with IMM training, imitator with ground truth training and prior, and imitator with IMM training and prior. For each policy, the bars correspond to attacks by a single intruder (left), attacks by two intruders (center), and overall attacks (right).

VI Conclusions

In this paper we addressed the problem of transferring a mission-specific control policy to an un-trained group of robots by exploiting repeated observations of an expert system. Histories of observations of the expert’s state were first converted into sequences of known controllers. Then, histories of controllers and environmental states were used to train approximated policies. The work presented in this paper suggests at least two directions for future work. First, the fact that the controllers’ estimation performance is directly correlated with the similarity between the controllers’ outputs motivates efforts towards the definition of a measure of separability between controllers. Similarly, controllers that are executed by the expert system for short periods of time expose the weakness of the estimation scheme. Second, as observed in this work, different controllers might have different importance for the final mission performance. For this reason, being able to evaluate, from observations of a system, the relative importance of its actions towards its performance is another interesting direction of work.