I Introduction
One of the primary issues with human control of a multi-robot system, or human-swarm interaction, is the cognitive load associated with tracking and delegating tasks to each robot [kolling2015human]. Beyond the poor scalability of a one-to-many approach to controlling teams of robots, the problem is further exacerbated when the robots are equipped with heterogeneous attributes, as the complexity of the robot-to-task allocation process can easily become intractable [prorok2016fast].
To circumvent the scalability issues, abstractions of the state information [luo2015asynchronous, li2016handling] and of the control strategies [kolling2015human, cortes2017coordinated] of the system are often used by human operators to effectively manipulate multi-robot systems. In particular, local interaction rules between the robots can be designed such that a desired task-oriented collective behavior emerges at the ensemble level, e.g., [cortes2017coordinated, antonelli2013interconnected, schwager2011unifying]. However, the resulting task-oriented controllers cannot be directly applied to complex, real-world missions, which usually require an assortment of these task-oriented controllers (see, for instance, [nagavalli2017automated, pierpaoli2019sequential] for implementations of this idea). Even though it is possible to enhance the expressiveness of individual controllers through their sequential combination [twu2010graph], human-level intelligence and expertise are required in order to interact with unknown and unpredictable environments [nikolaidis2013human]. Inspired by the paradigm of learning from demonstration [argall2009survey, billard2016learning], we propose a solution to the problem of composing multi-robot primitive controllers by leveraging observations of an expert system.
In this paper, we consider an expert system composed of a team of robots solving a mission that requires interaction with an external environment. To make the discussion more concrete, consider a perimeter protection scenario, where the robots are tasked with preventing intruders from accessing a restricted area. In this case, we can consider the intruders as the environment external to the robots. In order to solve this mission, the policy supervising the robots selects actions from a finite list of primitives (e.g., gather at locations, assemble formations, chase intruders) by considering the state of the robots and of the environment. In this paper, we consider the expert policy to be a deterministic automated strategy capable of solving the mission with a certain level of performance. Extensions of the proposed framework to policies obtained with a human in the loop are possible but will not be investigated here.
Parallel to the expert, we consider a team of untrained imitating robots that have access to the same list of primitive controllers used by the expert system but are agnostic to the actual policy controlling them. Our objective is a procedure that allows the imitating robots to construct a policy capable of achieving expert-like performance by collecting noisy observations of the expert system and of the environmental state during execution of the mission.
We note that the idea of learning how to create sequences from a predefined list of controllers, rather than learning entirely new controllers, is motivated by the rich literature on coordinated control for multi-agent systems [mesbahi2010graph]. In addition, although beyond the scope of the current work, the use of coordinated controllers makes our framework naturally predisposed to fully distributed implementations. Finally, by focusing on high-level strategies that compose well-understood controllers, we can interpret the actions executed by the robots.
The remainder of this paper is organized as follows. In Section II we discuss the main formulation of the problem, along with the expert and imitator dynamical models. In Section III we describe the estimation scheme used by the imitator to infer primitive controllers from noisy observations of the expert. In Section IV we describe the training procedure used by the imitator to learn the expert’s policy. Finally, in Section V we apply the proposed technique to a perimeter protection scenario on a team of differential-drive robots.
I-A Related Work
Robotic systems capable of inferring human behavior have been studied in a variety of contexts, such as human-robot interaction (HRI) [wang2013probabilistic, luo2018unsupervised], autonomous navigation [kelley2008understanding, jain2016recurrent], and interactive learning [cakmak2012designing, dillmann2004teaching]. For instance, within the context of HRI, inferring human behavior is often referred to as intention inference, wherein the robot attempts to identify the underlying intention or goal of the human partner [ravichandar2016human, ravichandar2018gaze]. In the context of interactive learning, the robot aims to infer what the human partner is teaching [cakmak2012designing]. In these contexts, it is either assumed that labelled training data can be acquired or it is sufficient to classify the unlabelled data into abstract clusters. In our work, we are interested not only in inferring the expert behavior, but also in learning from such inferences.
Learning from demonstrations (LfD) is a popular paradigm in robotics that provides a plethora of techniques aimed at learning from and imitating expert behavior [argall2009survey]. The vast majority of techniques, however, rely on labelled demonstrations and solve a supervised learning problem. A few exceptions to this assumption include learning reward functions from unlabelled demonstrations (e.g., [babes2011apprenticeship, liu2018imitation]), and learning manipulation skills from unlabelled videos (e.g., [sermanet2018time]).
Both the inference and the learning methods discussed thus far, however, are limited to single-robot tasks and scenarios. In the context of multi-robot systems, prior work has explored identifying the joint intention and shared plans of a group of agents [demiris2007prediction, pei2011parsing]. However, similar to the examples in single-robot systems, the inferred information is not utilized to learn control policies. On the other hand, algorithms that learn multi-agent policies (e.g., [natarajan2010multi, vsovsic2017inverse, bhattacharyya2018multi]) rely on labelled expert demonstrations.
As mentioned earlier, a rich body of work exists on task-oriented controllers for multi-agent teams. Coordinated controllers based on weighted consensus protocols have been used to achieve, for example, flocking [olfati2006flocking], coverage [santos2018coverage], formation control [zelazo2008decentralized], and cyclic pursuit [ramirez2010distributed]. Examples of alternative methods include Null Space Methods [antonelli2008null], Navigation Functions [tanner2005towards], and Model Predictive Control [richards2004decentralized]. Related to our contribution, solutions to the problem of composing sequences of controllers include formal methods [kress2018synthesis], path planning [nagavalli2017automated], Finite State Machines [marino2009behavioral], Petri Nets [klavins2000formalism], Behavior Trees [colledanchise2014behavior], and Reinforcement Learning [mehta2006optimal, pierpaoli2019reinforcement]. We take inspiration from a variety of communities in order to develop a unified framework to learn multi-robot controller selection policies from unlabelled and noisy observations of an expert system.
II Problem Formulation
Let us consider a team of robots interacting with an external environment having as final objective the completion of a mission. We denote by and the robots’ and environment states at time step respectively. At each time, the motion of the robots is described by a controller, selected from a library of controllers , where each element is a continuous function of the state of the robots. At any given time, all robots are driven by the same controller, which excludes the possibility of subteams acting in parallel.
A supervising deterministic policy selects controllers in as a function of the states of the robots and of the environment. Then, the discrete-time dynamics of the robots are
(1) 
where
is a white Gaussian additive process noise, with known variance. The policy
selects controllers so that the mission is executed with a measurable performance . We refer to the system of robots with state as the expert system, to as the expert’s policy, and to as the expert’s performance under the policy . As mentioned above, we assume the expert’s policy to take actions based on the exact expert and environmental states. To this end, although beyond the scope of this paper, we acknowledge that appropriate estimation schemes can be considered to recover values of or .
In addition to the expert system just described, we assume the existence of a second team of robots identical to those in the expert system, whose dynamics are described by the same library of controllers . We refer to this second system as the imitator and we denote its state by . The objective of the imitator is to learn an approximation of the expert policy in order to achieve performance similar to . With as the imitator’s approximated policy, the dynamics of the imitator are
(2) 
where, similarly to the expert system, we assume the imitator has access to the exact environment’s state .
In order to construct such approximation, the imitator collects a sequence of environmental states and a sequence of noisy observations of the expert’s state, where
(3) 
and is a white Gaussian additive noise with zero mean and known variance.
In this paper we propose a two-stage procedure for the imitator to learn . First, from noisy observations of the expert’s state and the exact state of the environment, the imitator creates an online estimate of the controllers being executed by the expert. Then, based on these estimates, we propose a policy approximation procedure that allows the imitator to solve the task with performance comparable to the expert’s performance on the same task.
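As a rough illustration of the data available to the imitator, the following Python sketch rolls out an expert under a supervising policy and records noisy observations of its state. It assumes a forward-Euler, single-integrator form for the dynamics (1) and an identity observation map for (3); these modeling choices, as well as the names `simulate_expert`, `dt`, `proc_std`, and `obs_std`, are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def simulate_expert(policy, controllers, x0, env_states,
                    dt=0.05, proc_std=0.01, obs_std=0.05, seed=0):
    """Roll out an expert and return (true states, noisy observations, modes).

    Assumed model (illustrative only):
        x_{t+1} = x_t + dt * u_k(x_t) + w_t,   y_t = x_t + v_t,
    where k = policy(x_t, e_t) indexes the controller library and
    w_t, v_t are zero-mean white Gaussian noises.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    states, observations, modes = [], [], []
    for e in env_states:
        k = policy(x, e)  # the policy sees the exact robot and environment states
        states.append(x.copy())
        observations.append(x + rng.normal(0.0, obs_std, x.shape))
        modes.append(k)
        # One Euler step under the selected controller, plus process noise.
        x = x + dt * controllers[k](x) + rng.normal(0.0, proc_std, x.shape)
    return np.array(states), np.array(observations), np.array(modes)
```

The imitator would only receive the second output and the environment states; the first and third outputs are kept here so that an estimator can be checked against ground truth.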
III Inferring Controllers from Observations
In this section, we describe the Interacting Multiple Model (IMM) technique [blom1988interacting], implemented by the imitator to estimate the sequence of controllers the expert executes, given a history of observations . The choice of this technique is motivated by its popularity and computational efficiency, and we note that the same objective could be achieved with alternative estimation schemes.
III-A Interacting Multiple Model Estimator
The IMM estimator requires a finite list of controllers (or modes) whose transitions are described by a Markov process. We denote by
the process probability of switching from mode
to , i.e.,
(4) 
In addition, we define a bank of filters (e.g., Kalman filters, EKF), each corresponding to a controller in the library
. By combining the expert’s state observations with the modes’ transition probabilities, the IMM computes 1) the probability of being active at time , which we denote by , and 2) the estimate of the expert’s state . The estimation technique follows three main steps, which we briefly discuss here for completeness. We refer the reader to [mazor1998interacting] for a review of IMM implementations and their possible variations. In the following, we denote with the outer product of vector
, defined as , and we use the shorthand notation to denote .
Interaction
First, the mixing probabilities are computed by propagating each mode’s probability through the Markov process as
where is a normalizing factor. Then, we compute the effect of the mode probabilities on the previous state estimates and covariance matrices:
Filtering
The posterior estimates , covariance matrix , and likelihood for each mode are computed by applying EKF iterations for each of the known modes. Then, individual mode probabilities are computed as follows:
where is a normalizing factor.
Combination
The final state estimate and covariance matrix are then obtained by combining the values from each filter, weighted by the probability of the corresponding mode
Every time a new observation is collected, the imitator performs an iteration of the procedure just described, from which we obtain the probability distribution for all the controllers in the library. Then, the estimated controller being executed by the expert system at time step
is obtained from a maximum a posteriori (MAP) estimate
(5) 
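The three steps above can be sketched in Python as a single IMM iteration. This is a minimal sketch under stated assumptions, not the implementation used in this work: each mode is given linear dynamics `x' = F x + b + w` so that a plain Kalman filter replaces the EKF, the observation model is linear (`z = H x + v`), and all names (`imm_step`, `Pi`, `models`, `mix`) are hypothetical.

```python
import numpy as np

def imm_step(x_prev, P_prev, mu_prev, Pi, models, z, H, R):
    """One IMM iteration (interaction, filtering, combination).

    x_prev: (M, n) per-mode state estimates, P_prev: (M, n, n) covariances,
    mu_prev: (M,) mode probabilities, Pi[i, j]: P(switch from mode i to j),
    models: list of (F, b, Q) per mode (assumed linear dynamics),
    z: observation, H: observation matrix, R: observation noise covariance.
    """
    M, n = x_prev.shape
    # Interaction: mixing probabilities mix[i, j] = Pi[i, j] * mu_i / c_j.
    c = Pi.T @ mu_prev                       # normalizing factors c_j
    mix = (Pi * mu_prev[:, None]) / c[None, :]
    x_mix = mix.T @ x_prev                   # mixed initial state per mode
    P_mix = np.zeros_like(P_prev)
    for j in range(M):
        for i in range(M):
            d = x_prev[i] - x_mix[j]
            P_mix[j] += mix[i, j] * (P_prev[i] + np.outer(d, d))
    # Filtering: one Kalman predict/update per mode and its likelihood.
    x_post = np.zeros_like(x_prev)
    P_post = np.zeros_like(P_prev)
    lik = np.zeros(M)
    for j, (F, b, Q) in enumerate(models):
        x_pred = F @ x_mix[j] + b
        P_pred = F @ P_mix[j] @ F.T + Q
        r = z - H @ x_pred                   # innovation
        S = H @ P_pred @ H.T + R             # innovation covariance
        K = P_pred @ H.T @ np.linalg.inv(S)
        x_post[j] = x_pred + K @ r
        P_post[j] = (np.eye(n) - K @ H) @ P_pred
        lik[j] = (np.exp(-0.5 * r @ np.linalg.solve(S, r))
                  / np.sqrt(np.linalg.det(2.0 * np.pi * S)))
    mu = lik * c
    mu = mu / mu.sum()                       # updated mode probabilities
    # Combination: probability-weighted estimate and covariance.
    x_hat = mu @ x_post
    P_hat = sum(mu[j] * (P_post[j] + np.outer(x_post[j] - x_hat,
                                              x_post[j] - x_hat))
                for j in range(M))
    return x_post, P_post, mu, x_hat, P_hat, int(np.argmax(mu))
```

The last returned value is the MAP mode index. Calling `imm_step` once per new observation, and feeding `x_post`, `P_post`, and `mu` back in, reproduces the recursive structure described above.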
III-B Multi-Robot Coordinated Controllers Inference
We demonstrate the application of the IMM estimator to the identification of the controllers, assuming an expert system composed of robots. Taking inspiration from the literature on coordinated behaviors for multi-agent systems [cortes2017coordinated], we assume each of the controllers in the library to be composed of two terms. The first term corresponds to a coordinated effect between the robots, which is represented by a weighted consensus term [mesbahi2010graph]. The second term is the agent’s individual objective, used to represent, for example, one or more leaders in the team. Each controller can then be described as:
(6) 
where is the weighted Laplacian of the controller , while , represents a leader-like controller. In this case, we consider to be composed of the following behaviors.

Circular formation:
where is the desired separation between robots and corresponding to a circleshaped formation.

Cone formation:
where is defined as above, but corresponds to a cone-shaped formation.

Cyclic pursuit:
where , is the radius of the cycle formed by the robots, and . The point is the center of the cycle.

Leader-follower:
where is the desired separation between the robots, represents the leader’s goal, and subscript denotes the leader.
In this case, since we are not interested in approximating the expert’s policy but rather in assessing the performance of the IMM, we assume the expert to randomly select any of the controllers described above at Poisson instants of time. The results obtained from simulations are reported in Fig. 1. As we can observe from Fig. 1, the IMM correctly estimates all modes, provided a minimum sojourn time for the controllers is respected.
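To make the two-term structure of these controllers concrete, the following Python sketch implements a distance-based weighted consensus term and an additive leader term. The specific edge weight (squared distance error) is a common choice in the formation control literature and is an assumption here, as are the function names and the `gain` parameter; the exact weights used for each behavior in the library may differ.

```python
import numpy as np

def formation_control(x, adjacency, desired_dist):
    """Weighted consensus term for formation assembly (assumed weights).

    x: (N, 2) robot positions, adjacency: (N, N) 0/1 matrix,
    desired_dist: (N, N) desired inter-robot separations.
    u_i = sum_j a_ij * (||x_j - x_i||^2 - d_ij^2) * (x_j - x_i)
    """
    diff = x[None, :, :] - x[:, None, :]          # diff[i, j] = x_j - x_i
    weights = adjacency * ((diff ** 2).sum(axis=-1) - desired_dist ** 2)
    return (weights[:, :, None] * diff).sum(axis=1)

def leader_follower(x, adjacency, desired_dist, goal, leader=0, gain=1.0):
    """Consensus term plus a leader-like term driving one robot to a goal."""
    u = formation_control(x, adjacency, desired_dist)
    u[leader] += gain * (np.asarray(goal, dtype=float) - x[leader])
    return u
```

Different formations (for example, circle-shaped or cone-shaped) would correspond to different `desired_dist` matrices passed to the same function, while the leader term is nonzero only for the designated robot.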
IV Learning to Compose Behaviors
In this section, we introduce our approach to learn the expert policy . We propose to learn the imitator’s policy, denoted by , that approximates the expert policy in mapping the current environmental and estimated robot states onto the appropriate behavior from the controller library . As noted in (2), the policy is attempting to capture the general strategy of choosing the behaviors from the expert as opposed to encoding a deterministic sequence of behaviors. In order to train such a policy, we collect the training data from episodes. Thus, the training data is given by , where , in each episode, denotes the data associated with the episode.
As can be seen from the notation, the training process does not assume access to either the true behavior sequence or the true state of the expert system. Put another way, the system is trained using the inferred quantities and provided by the IMM filter. Once the imitator policy is learned, it can be utilized to compose the individual behaviors of the imitator system in order to solve the task of interest.
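A minimal sketch of this training stage, assuming the policy is a small classifier fitted to IMM-inferred labels, is given below in Python. The layer sizes mirror the network described in Section V-B (two 32-unit tanh layers and a softmax output over four modes); everything else, including the plain gradient-descent update, the initialization, and all function names, is an illustrative assumption.

```python
import numpy as np

def init_params(n_in, n_hidden=32, n_out=4, seed=0):
    """Random weights for a two-hidden-layer network (assumed initialization)."""
    rng = np.random.default_rng(seed)
    sizes = [n_in, n_hidden, n_hidden, n_out]
    return [(rng.normal(0.0, 1.0 / np.sqrt(a), (a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def forward(params, X):
    """Return hidden activations and softmax probabilities over controllers."""
    acts = [X]
    for W, b in params[:-1]:
        acts.append(np.tanh(acts[-1] @ W + b))
    W, b = params[-1]
    logits = acts[-1] @ W + b
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return acts, probs / probs.sum(axis=1, keepdims=True)

def cross_entropy(params, X, y):
    _, probs = forward(params, X)
    return -np.mean(np.log(probs[np.arange(len(y)), y]))

def train_step(params, X, y, lr=0.2):
    """One full-batch backpropagation step on the cross-entropy loss."""
    acts, probs = forward(params, X)
    delta = probs.copy()
    delta[np.arange(len(y)), y] -= 1.0       # gradient w.r.t. the logits
    delta /= len(y)
    updated = []
    for layer in reversed(range(len(params))):
        W, b = params[layer]
        grad_W, grad_b = acts[layer].T @ delta, delta.sum(axis=0)
        if layer > 0:                        # propagate through tanh
            delta = (delta @ W.T) * (1.0 - acts[layer] ** 2)
        updated.append((W - lr * grad_W, b - lr * grad_b))
    return updated[::-1]

def predict(params, X):
    """Index of the most probable controller for each input row."""
    return forward(params, X)[1].argmax(axis=1)
```

Here each row of `X` would stack the environment state with the estimated robot state, and `y` would hold the MAP controller indices produced by the estimator, so that mislabelled samples from the estimator enter training exactly as they would in the procedure described above.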
To illustrate our approach, we utilize a neural network (NN) to parameterize the imitator’s policy. The parameters of the network are trained using the standard backpropagation algorithm. Note that the choice of neural networks over other models is motivated by their universal approximation capability [cybenko1989approximation] and is not central to our framework. Any model that is capable of sufficiently capturing the expert policy would be applicable.
V Experimental Validation
In this section, we apply the policy approximation technique described in the previous sections to observations of an expert system performing a perimeter protection mission. During the expert’s execution of the mission, the imitator collects observations of the expert’s state and of the environment’s state in order to estimate which controllers are used by the expert for different configurations of the environment. Then, using data from the estimation process, we train the imitator’s policy, which is finally executed by the imitator on the same scenario.
V-A Perimeter Defense Scenario
Perimeter protection can be considered either as a standalone scenario or as a subtask common to many games, which makes it an ideal testing scenario for multi-agent control protocols [shishika2018local]. In this scenario, the expert team is tasked with defending a region while an adversarial team, representing the environment, tries to intrude. With reference to the domain represented in Fig. 2, the intruders’ objective is to reach the innermost circular region (dark blue), which we denote by . The objective of the defenders is to prevent intruders from reaching . We now describe both the defenders’ and the intruders’ strategies, summarized in Fig. 3.
V-A1 Intruders’ strategy
We assume two intruders, with collective state , each described at any given time by one of three possible modes, namely loiter, attack, and retreat. In addition, each intruder is equipped with a circular protected area of radius around it. At the initial time, both intruders start at random points inside region , in loiter mode. When both intruders are in
, at Poisson-distributed instants of time, one or both intruders are selected at random and switched to
attack mode. Intruders in attack mode proceed straight towards the center of region . During attacks, if any of the intruders encounters a defending robot within distance less than , the intruder switches its state to retreat and moves back to a uniformly selected point in region .
V-A2 Defenders’ policy
The policy employed by the defending robots acts by selecting controllers in the library . Here, we assume to be composed of the same controllers described in Section III-B, namely leader-follower, cyclic pursuit, and two different formation assembling protocols: cone formation and circular formation. When none of the intruders are within distance , the defending robots execute the circular formation inside region . If either one or two intruders are between distances and , the defenders switch to a cyclic pursuit controller. Finally, if a single intruder is within distance less than from the center of , the defenders execute the leader-follower controller towards the intruder. If both intruders are within distance , the policy selects the cone formation.
Finally, we define the mission performance as the ratio between the number of intruders’ successful attacks and the total number of initiated attacks.
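The rule-based expert policy and the performance measure just described can be sketched in Python as follows. The two threshold radii `r_near` and `r_far`, the order in which the rules are checked, and the mode constants are hypothetical names and choices introduced for illustration; they stand in for the distances used in the definitions above.

```python
import numpy as np

# Hypothetical indices for the four controllers in the library.
CIRCULAR, CONE, CYCLIC_PURSUIT, LEADER_FOLLOWER = range(4)

def defender_policy(intruders, center, r_near, r_far):
    """Select a controller from the intruders' distances to the center.

    intruders: (2, 2) intruder positions, center: (2,) center of the region,
    r_near < r_far: assumed threshold radii (placeholders).
    The policy depends only on the environment state, not on the robots.
    """
    dist = np.linalg.norm(np.asarray(intruders, dtype=float)
                          - np.asarray(center, dtype=float), axis=1)
    n_near = int((dist < r_near).sum())
    if n_near >= 2:                 # both intruders are close
        return CONE
    if n_near == 1:                 # a single intruder is close
        return LEADER_FOLLOWER
    if (dist < r_far).any():        # at least one intruder in the outer band
        return CYCLIC_PURSUIT
    return CIRCULAR                 # no intruder nearby

def mission_performance(successful_attacks, initiated_attacks):
    """Ratio of successful to initiated attacks (lower favors the defenders)."""
    if initiated_attacks == 0:
        return 0.0
    return successful_attacks / initiated_attacks
```

Because the rules read only the intruders' positions, the same function can generate ground-truth labels for evaluating how often an inferred controller sequence agrees with the expert.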
V-B Results
The perimeter protection scenario was implemented on the Robotarium simulator [pickem2017robotarium], which includes a number of features of the real robots, e.g., differential-drive kinematics and collision avoidance through barrier control functions. In the following, we define an episode as the time between the initiation of two consecutive intruders’ attacks. Data collected in simulation were processed by the IMM to produce training data consisting of 200 episodes. The performance of the IMM in estimating the controllers executed by the expert during the perimeter defense scenario is reported in Fig. 5. We note that, in this scenario, correctly estimating cyclic pursuit and leader-follower is particularly challenging since, by definition of the expert policy, these controllers are executed for a limited period of time, and as a result, the estimator is unlikely to acquire sufficient evidence to accurately infer the underlying behavior.
Since the expert policy described in Section V-A does not depend on the state of the robots, we trained two different models: one assuming knowledge of this prior information () and a second without this prior knowledge (). In addition, in order to evaluate the effects of the IMM performance on the results, additional policies were trained assuming exact knowledge of the expert system’s states and controllers. Similar to the IMM-based policies, we trained one ground-truth-based policy assuming prior information () and one without (). In the remainder of this section, we present results comparing the above-mentioned four policies in conjunction with the expert policy .
All four policies are represented using a multilayer neural network with two 32-neuron, fully connected layers with hyperbolic tangent activation and an output classification layer with softmax activation to classify between the four control modes. For each policy, the training data were split into
training and validation data. Training results are reported in Fig. 6. As one would expect, training with ground-truth data results in very high training and validation performance irrespective of whether prior information about the policy structure is assumed. In support of our framework, the IMM-based policy trained without prior information () is shown to perform similarly to the ground-truth-based policies. Perhaps more interestingly, we note that results in better imitation performance than the other IMM-based policy trained with prior information (). We believe this is because the robot state estimates , while seemingly superfluous, might have helped better deal with mislabelled training data.
Finally, the mission performance of the learned policies was evaluated in the perimeter defense scenario. Tab. I shows the actual mission performance obtained with the different policies, while the error rates of the different policies with respect to the expert policy are shown in Fig. 7. In order to better understand the performance of the different policies, we separate attacks executed by a single intruder (SOLO), by both intruders (DUO), and overall attacks (COMBO). Despite the limited performance in estimating two of the controllers, as shown in Fig. 5, we observe that the imitator performance does not deviate significantly from the target one. This is explained by the limited impact of cyclic pursuit and leader-follower on the actual mission performance.
SOLO  6.7  1.6  0.7  2.3  9.2 
DUO  13.5  8.0  3.5  7.6  16.1 
COMBO  10.5  4.9  2.1  4.9  12.3 
VI Conclusions
In this paper we addressed the problem of transferring a mission-specific control policy to an untrained group of robots by exploiting repeated observations of an expert system. Histories of observations of the expert’s state were first converted into sequences of known controllers. Then, histories of controllers and environmental states were used to train approximated policies. The work presented in this paper suggests at least two directions for future work. On one side, the fact that the controllers’ estimation performance is directly correlated with the similarities between the controllers’ outputs motivates efforts towards the definition of a measure of separability between controllers. Similarly, controllers that are executed by the expert system for short periods of time exacerbate the weakness of the estimation scheme. Nevertheless, as observed in this work, different controllers might have different importance for the final mission performance. For this reason, being able to evaluate, from observations of a system, the relative importance of its actions towards its performance is another interesting direction for future work.