Learning of Coordination Policies for Robotic Swarms

09/19/2017 ∙ by Qiyang Li, et al. ∙ 0

Inspired by biological swarms, robotic swarms are envisioned to solve real-world problems that are difficult for individual agents. Biological swarms can achieve collective intelligence based on local interactions and simple rules; however, designing effective distributed policies for large-scale robotic swarms to achieve a global objective can be challenging. Although it is often possible to design an optimal centralized strategy for smaller numbers of agents, those methods can fail as the number of agents increases. Motivated by the growing success of machine learning, we develop a deep learning approach that learns distributed coordination policies from centralized policies. In contrast to traditional distributed control approaches, which are usually based on human-designed policies for relatively simple tasks, this learning-based approach can be adapted to more difficult tasks. We demonstrate the efficacy of our proposed approach on two different tasks, the well-known rendezvous problem and a more difficult particle assignment problem. For the latter, no known distributed policy exists. From extensive simulations, it is shown that the performance of the learned coordination policies is comparable to the centralized policies, surpassing state-of-the-art distributed policies. Thereby, our proposed approach provides a promising alternative for real-world coordination problems that would be otherwise computationally expensive to solve or intangible to explore.




I Introduction

Biological swarms can act in coordination to perform tasks far beyond the capabilities of individuals [1]. In the absence of a centralized control mechanism and global observation, swarm intelligence emerges from the local behaviour of individual agents governed by simple, unified rules [2]. The distributed structure of swarm systems makes them less vulnerable to individual failures. When this robust nature is well-realized in robotic systems, robotic swarms can then be relied upon to solve complex real-world problems such as search and rescue, object transportation, and Mars exploration, where centralized control can be extremely costly or may be impossible [3, 4, 5].

Fig. 1: The proposed approach learns distributed control and communication strategies from well-designed centralized policies for identical agents.

To explore this domain, this work focuses on studying distributed robotic systems with the following properties:

  • All agents are identical (i.e., homogeneous).

  • Each agent has only local interactions and observations.

  • Each agent follows an identical distributed policy with no presence of a leader.

To coordinate individual agents towards achieving a mutual objective, a key challenge is to relay local actions and observations from agent to agent. A distributed policy, therefore, contains two main components: an action policy that determines what action the agent performs given its inputs, and a communication protocol that defines how agents communicate with each other. To achieve system-level coordination, classical methods rely on human-designed policies, in which communication protocols are usually predefined [6]. This manual design process can be especially difficult for complex coordination tasks with large swarms. Recently, learning-based approaches, such as deep reinforcement learning, have been able to successfully learn communication protocols in coordination tasks [7, 8, 9, 10, 11, 12, 13, 14, 15]. The biggest drawback of these approaches is the tremendous amount of data and computational resources required for training.

In this paper, we propose a learning-based approach that utilizes pre-designed centralized policies to train the distributed policies. Each agent follows an identical distributed policy, which (i) interprets the agent’s observation and communication information received, (ii) determines the action of the agent, and (iii) generates the communication information to be broadcast to the neighbouring agents. This distributed policy is modelled by a differentiable deep neural network (DNN) called the distributed policy network, whose inputs and outputs are represented as fixed-size vectors. The distributed policy network serves as a part of a larger neural network that maps from all agents’ multi-step observations to their multi-step actions (see Fig. 4). We refer to this augmented neural network as the multi-step, multi-agent neural network (MSMANN). Since communication vectors sent from agent to agent are hidden states of the MSMANN, we can perform backpropagation on the MSMANN to capture the communication protocols in addition to the action policies, which completes the distributed policy we aim to learn. Since the input and output vectors for the communication information all have fixed sizes, the distributed policy network must learn how to aggregate information for effective communication. This also enables us to control the communication flow and analyze the communication information sent from agent to agent. This can be regarded as an analogy to controlling the size of the representation layer of an auto-encoder and analyzing the learned representation.

Our learning approach can be applied to a variety of distributed robotic tasks, provided that a well-designed centralized policy is available. On a rendezvous task with limited visibility, this approach consistently outperforms the state-of-the-art distributed control law for systems consisting of different numbers of agents. The generalizability of our proposed approach to other robotic tasks is demonstrated in a task for which no distributed solution exists. Our approach provides a tangible alternative for complex robotic tasks for which centralized policies can be designed. Moreover, we analyze the learned communication protocols and provide insights on their meaning. This analysis of learned coordination policies could potentially benefit the manual design of communication protocols for complex robotic swarm systems.

In the following sections, we begin with a brief overview of related literature in Section II. After defining the control framework and the problem statement in Sections III and IV, we discuss our learning approach in Section V. The simulation setup and the corresponding results are presented in Sections VI and VII.

II Related Work

In the distributed control literature, most work focused on the manual design and analysis of distributed control laws for simple robotic tasks [6]. For example, [16] and [17] both study rendezvous and flocking tasks and analyze the performance of various distributed control laws. However, these studies are based on simplified robot dynamics (e.g., single or double integrator). Adapting these approaches to more complicated distributed robotic tasks can be challenging.

With little human expertise required, learning-based approaches provide alternatives for challenging distributed robotic tasks. The majority of these works focus on learning only the action policy, and assume pre-defined communication protocols or no communication among the agents at all [18, 19, 20]. These approaches lack the flexibility to adapt the communication information, thus limiting the capability of the distributed robotic system. Our approach learns both the control policy and the communication protocols among the agents, allowing the emergence of more effective inter-agent communication.

Some recent work has explored learning communication protocols through reinforcement learning. Specifically, [15] and [7] perform independent learning where each agent learns solely from its local observations. Others assume that centralized learning can be performed [8, 9, 10, 14]. In particular, [14], [9] and [13] assume that the communication channel can be differentiated during backpropagation. While [14] employs multiple communication cycles at each time step, [9] and [13] are similar to our control framework, which assumes that the communication emitted can only be received at the next time step.

Unlike the reinforcement learning approaches above, our approach makes use of pre-designed centralized policies and learns communication protocols by imitating the behaviour of the centralized policies. Compared to the reinforcement learning approaches, our approach can learn communication and action policies more efficiently using the guidance of the centralized policies. This approach is inspired by [14], which performs supervised learning on a simple lever-pulling task using multiple communication cycles at each time step. We perform learning in the distributed robotic domain using a different framework, which assumes a single communication cycle at each time step.

III Distributed Control Framework

Consider a group of homogeneous agents that are to complete a task with a dynamic connectivity network defined by a bidirectional graph , where is the discrete time index. Each , , represents an agent and each represents a connection between agent and agent .

We define to be the set of all the neighbours of agent . For each neighbour , we can define to be the communication information sent from agent to agent at time . In this work, we define as a communication vector, where is the size of the communication information.

For convenience, we also define


to be the communication inflow to , with , and


to be the communication outflow from . This means that the communication information sent at the previous time step is received at the current time step . When and are the same agent, the communication vector is a special case, which allows the agent to leave information to itself. In this work, we allow this self-talking behaviour.

At each time step, agent obtains an observation information and receives information . The agent then sends information, , to its corresponding neighbours , and performs an action , where is the unified action space of all agents (see Fig. 2).
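The one-step delay described above can be sketched in a few lines of Python. This is only an illustration under assumed names: `step`, `mailbox`, and `toy_policy` are hypothetical and do not come from the paper. Messages written at one step are stored in a mailbox keyed by (sender, receiver) and are only read at the following step; each agent may address its current neighbours and itself.

```python
def step(policy, observations, neighbours, mailbox):
    """Advance every agent by one time step.

    mailbox[(j, i)] is the vector agent j sent to agent i at the previous step.
    Returns the chosen actions and the mailbox to deliver at the next step.
    """
    actions = {}
    next_mailbox = {}
    for i, obs in observations.items():
        # Inflow: everything addressed to agent i at the previous step
        # (including the message the agent left for itself).
        inbox = {j: vec for (j, k), vec in mailbox.items() if k == i}
        # Outflow may only go to current neighbours and to the agent itself.
        recipients = sorted(set(neighbours[i]) | {i})
        outbox, action = policy(obs, inbox, recipients)
        actions[i] = action
        for j in recipients:
            next_mailbox[(i, j)] = outbox[j]
    return actions, next_mailbox


def toy_policy(obs, inbox, recipients):
    """Hypothetical stand-in policy: broadcast the first observation entry
    and use the number of received messages as the action."""
    outbox = {j: [obs[0]] for j in recipients}
    return outbox, len(inbox)
```

With two mutually visible agents, the first call sees an empty inbox; the second call receives one message from the neighbour and one self-message.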

Fig. 2: The communication framework with three agents at two consecutive time steps. The connectivity network is indicated by the arrows among the agents. We show the neighbours and the communication associated with for each time step. At time index , agent can only communicate to agent and send information to itself. At time index , agent receives the information sent by from the previous time step and sends out information to its neighbours, and , and to itself.
Fig. 3: Overall picture of the homogeneous, distributed control framework. The agent receives the communication inflow from its neighbours from the previous time step and observes the surrounding . Given the inputs, the policy then provides the communication outflow and the action probability distribution . The communication outflow is sent to the agent’s neighbours . The action of the agent is sampled based on the probability distribution.
Fig. 4: Multi-step, multi-agent neural network (MSMANN) consisting of identical components . This is an example of the MSMANN for the three-agent example in Fig. 2. The grey arrows represent the communication vectors, which are also hidden states of the MSMANN; the green arrows represent the inputs (local observation ) of the neural network and the red arrows represent the outputs (action probability distribution ) of the neural network.

We define as a unified, distributed policy for all agents with the following equations holding true for all :


where is the probability distribution over the action space used by at time (see Fig. 3).
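For orientation, the relations described in this section can be written compactly as follows. The notation is an assumption made for this sketch and is not necessarily the paper's own: c_{ij}[t] denotes the message from agent i to agent j at time t, N_i[t] the neighbour set of agent i, o_i[t] its observation, p_i[t] its action distribution, and a_i[t] the sampled action. Which neighbour set indexes the inflow is also an assumption.

```latex
% Assumed notation, for illustration only.
\begin{align}
  c^{\mathrm{in}}_i[t]  &= \bigl\{\, c_{ji}[t-1] : j \in \mathcal{N}_i[t-1] \cup \{i\} \,\bigr\}, \\
  c^{\mathrm{out}}_i[t] &= \bigl\{\, c_{ij}[t] : j \in \mathcal{N}_i[t] \cup \{i\} \,\bigr\}, \\
  \bigl(c^{\mathrm{out}}_i[t],\; p_i[t]\bigr) &= \pi\bigl(c^{\mathrm{in}}_i[t],\; o_i[t]\bigr),
  \qquad a_i[t] \sim p_i[t].
\end{align}
```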

IV Problem Statement

Consider agents with the homogeneous, distributed control framework described in Section III. The problem targeted by this work is to learn a mapping from communication inflow and local observations to communication outflow and action probability distribution . This mapping is referred to as the distributed policy modelled by a DNN, which interprets the communication inflow and the local observations, and infers the communication outflow and the action probability distribution. The goal is to find a policy that optimizes the performance of the target task . During the learning process, robot actions and observations of a centralized strategy are available (Fig. 1).

V Methodology

V-A Neighbour Discretization

The amount of communication inflow and outflow for each agent is changing dynamically as the number of its neighbours changes. Since learning a model that has variable input and output dimensions can be challenging, we perform a neighbour discretization to provide constant input/output dimensions. In this process, each agent’s neighbours are partitioned into groups based on a discretization rule :


where each , , represents a subset of neighbouring nodes and may be used to obtain the discretization. To apply this discretization, we process the communication inflow and average the communication vectors within each group of agents. We also restrict the communication outflow such that the communication vectors sent to the agents in the same group are identical. To formulate this, we define the communication inflow and outflow after the discretization as follows:


with , where and are the concatenations of communication vectors after discretization, and represents the cardinality of a set. This renders the dimensions of the communication inflow and outflow constant. Thus, the distributed policy we aim to learn is transformed into


In this work, we assume that the neighbour discretization is pre-designed and task-specific. Learning the discretization rule is left for future work.
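A minimal sketch of the neighbour discretization follows. The function names, the `group_of` callback, and the choice to fill empty groups with zeros are assumptions for illustration. Incoming vectors are averaged per group and concatenated into a fixed-length inflow; the fixed-length outflow is split into one slice per group, and every neighbour in a group receives the same slice.

```python
def discretize_inflow(inflow, group_of, num_groups, dim):
    """Average the received vectors within each group, then concatenate.

    inflow maps a neighbour id to its communication vector (length dim);
    group_of maps a neighbour id to a group index in range(num_groups).
    """
    sums = [[0.0] * dim for _ in range(num_groups)]
    counts = [0] * num_groups
    for j, vec in inflow.items():
        g = group_of(j)
        counts[g] += 1
        for k in range(dim):
            sums[g][k] += vec[k]
    flat = []
    for g in range(num_groups):
        if counts[g] == 0:
            flat.extend([0.0] * dim)  # assumption: empty groups contribute zeros
        else:
            flat.extend(s / counts[g] for s in sums[g])
    return flat


def expand_outflow(flat, neighbours, group_of, dim):
    """Give each neighbour the slice of the outflow that belongs to its group."""
    return {j: flat[group_of(j) * dim:(group_of(j) + 1) * dim] for j in neighbours}
```

Both directions therefore have dimension `num_groups * dim`, regardless of how many neighbours are present.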

V-B Learning from a Centralized Policy

Using supervised learning, our approach builds upon a pre-designed centralized policy , which defines the action probability distributions of all agents based on global full observations. This can be formulated as follows:



  • is the global observation, which includes all the local observations in the global frame.

  • is the action probability distribution of agent suggested by the centralized policy .

The objective is to make the learned distributed policy behave as similarly to the centralized policy as possible.

Directly learning the communication protocols among the agents from the centralized policy is difficult because the centralized policy does not provide correct and labelled communication protocols among the agents. Based on the problem setup, the communication vectors sent at the previous time step are received at the current time step, resulting in a communication information flow across multiple time steps. To learn such a flow of information, we require the learning process to operate across multiple time steps as well.

To formulate the learning target, we first assume can be parametrized by as . Over the course of performing task with time steps using a distributed policy , we can define the overall mapping we aim to achieve as


This mapping is modelled by a neural network consisting of identical components connected by communication vectors among the agents across multiple time steps (see Fig. 4 for an example of the neural network). We recall that this neural network is a multi-step, multi-agent neural network (MSMANN), where the identical components are the distributed policy network for every agent at each time step. The backpropagation can be performed throughout the MSMANN to optimize the distributed policy with respect to a learning objective, which is minimizing :


where is the number of agents and is the loss function for evaluating the difference between two action probability distributions.

By minimizing , we can learn the distributed policy such that it behaves similarly to the centralized policy. With the guidance of the centralized policy, we hypothesize that this supervised learning process is more efficient than the reward-based learning, which has no direct guidance. An interesting possibility is to use reward-based learning for improvement after having completed supervised learning from the centralized policy. This possibility is left as future work.
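The learning objective can be illustrated with a small sketch. It assumes the loss between two distributions is the cross entropy (the choice named in the training details) and averages it over all agents and time steps; the normalization and the function names are assumptions of this sketch.

```python
import math


def cross_entropy(target, predicted, eps=1e-12):
    """Cross entropy between a target and a predicted discrete distribution."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))


def imitation_loss(targets, predictions):
    """Mean cross entropy over time steps and agents.

    targets[t][i] is the centralized policy's distribution for agent i at
    step t; predictions[t][i] is the distributed policy's distribution.
    """
    total = 0.0
    count = 0
    for step_targets, step_predictions in zip(targets, predictions):
        for target, predicted in zip(step_targets, step_predictions):
            total += cross_entropy(target, predicted)
            count += 1
    return total / count
```

In the actual method the gradient of this quantity would be backpropagated through the unrolled network, including the communication vectors; the sketch only evaluates the loss value.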

Fig. 5: Radial discretization used in the rendezvous task. The visible area is discretized into 9 components. The vectors from the agent to the center of each area represent the discretized action space.

VI Simulation Setup: Two Different Tasks

In order to demonstrate our proposed learning approach, we consider two coordination tasks: (i) the rendezvous problem with limited vision range and (ii) the particle assignment task. The former is well-studied in the control literature and our solution is compared against the state-of-the-art distributed control law; the latter is a task that has not been solved in the framework of distributed control before.

VI-A Rendezvous with Limited Visibility

We demonstrate our learning approach on the rendezvous task with limited visibility given (i) its simplicity, and (ii) its assumption of local interactions which fit our framework.

VI-A1 Task Formulation

Consider homogeneous agents located in a 2-dimensional plane with the position vectors . Each agent is governed by double integrator dynamics and its position is controlled by a PD-type controller. The input to the controller is the desired position vector relative to the agent’s position, which corresponds to the action of the agent. We define dynamic connectivity network based on visibility:


where is the visibility range.

We also define that each agent can only observe the relative positions of all its neighbours to itself as . (Footnote 1: The local frame, however, has the same heading as the global frame. This means that the local frame can be transformed to the global frame by only a translational transformation.) To formulate the action space and observation space, we discretize the 2-dimensional visibility space of each agent into different components: (see Fig. 5). The observation of agent can be approximated by the number of agents in each discretized component:


with . Using the same discretization, we restrict the action space to be the set of center points of the discretized components (see Fig. 5):


The action probability distribution can be simplified into a vector of probabilities for choosing each discretized action:


where represents the probability of agent to choose action at time for all .
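The discretized observation can be sketched as a histogram over the components of the visible area. The exact layout of the 9 components in Fig. 5 is not recoverable from the text, so this sketch assumes one inner disc around the agent plus 8 equal angular sectors; the layout, the `inner_radius` parameter, and the function names are all assumptions.

```python
import math


def component_index(dx, dy, visibility, inner_radius, sectors=8):
    """Component of a relative position, or None if outside the visible area.

    Index 0 is the assumed inner disc; indices 1..sectors are angular
    sectors counted counter-clockwise from the positive x-axis.
    """
    r = math.hypot(dx, dy)
    if r > visibility:
        return None
    if r <= inner_radius:
        return 0
    angle = math.atan2(dy, dx) % (2 * math.pi)
    return 1 + int(angle / (2 * math.pi / sectors)) % sectors


def observe(rel_positions, visibility, inner_radius, sectors=8):
    """Count how many neighbours fall into each discretized component."""
    counts = [0] * (sectors + 1)
    for dx, dy in rel_positions:
        idx = component_index(dx, dy, visibility, inner_radius, sectors)
        if idx is not None:
            counts[idx] += 1
    return counts
```

Under the same assumption, the discrete action set would be the centre points of these components expressed relative to the agent.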

The objective of this task is to make all agents converge to a common location as quickly as possible. We assume that the dynamic connectivity network is connected initially. The task performance is primarily evaluated based on the rendezvous time , which is defined as the smallest that satisfies the following condition:



is a small constant that defines the maximum distance of the two farthest agents. If the constraint can never be satisfied, we classify this as a failure to converge ( ). After evaluating the task for multiple trials, we can define the convergence rate as follows: , where is the number of successful trials and is the number of total trials.
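The two evaluation metrics can be sketched as follows. The sketch assumes the rendezvous condition is that the largest pairwise distance is at most the small constant (called `eps` here) and takes the first time step at which this holds; a failed trial is represented by `None`. These representational choices and the function names are assumptions.

```python
import math
from itertools import combinations


def max_pairwise_distance(positions):
    """Distance between the two farthest agents (0.0 for fewer than two agents)."""
    return max((math.dist(p, q) for p, q in combinations(positions, 2)), default=0.0)


def rendezvous_time(trajectory, eps):
    """First step at which all agents lie within eps of each other, else None.

    trajectory[t] is the list of agent positions at time step t.
    """
    for t, positions in enumerate(trajectory):
        if max_pairwise_distance(positions) <= eps:
            return t
    return None


def convergence_rate(times):
    """Fraction of trials that converged (trials with time None count as failures)."""
    return sum(t is not None for t in times) / len(times)
```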

VI-A2 Neighbour Discretization

The neighbour discretization rule builds upon the space discretization performed in the task formulation. Using the relative positions to the neighbouring agents, we can define the partition as follows:


for all , where is the position vector of agent and is the discretization group of agent ’s neighbours under the discretization rule .

Under this rule, all the neighbouring agents in the same discretized component belong to the same discretization group. The intuition behind this is that the neighbouring agents with similar relative positions should receive similar communication vectors, and the communication vectors they are sending back should also be similar.

VI-A3 Centralized Policy

The centralized policy we use computes an optimal rendezvous coordinate that minimizes . It can be proved that is the center of the smallest circle that encloses all the agents (the smallest enclosing circle). By moving towards the optimal rendezvous coordinate, the optimal solution can be achieved. To adapt to the discrete action space , the optimal action for each agent is the closest one to the optimal coordinate:


Since this centralized policy is deterministic, we define the probability distribution over as follows:


where is the probability of choosing action at time suggested by the centralized policy . This distribution can be represented by a vector similar to Eq. 15.
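As an illustration of this centralized policy, the sketch below finds the smallest enclosing circle by brute force over all pairs and triples of agents (adequate for small swarms; it is not claimed to be the method used in the paper), then returns a one-hot distribution over the discrete action closest to the circle's centre. Function names are hypothetical, and actions are assumed to be displacement vectors relative to the agent.

```python
import math
from itertools import combinations


def circle_from_two(p, q):
    """Circle with the segment pq as diameter."""
    return ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2), math.dist(p, q) / 2


def circle_from_three(p, q, r):
    """Circumcircle of three points, or None if they are (nearly) collinear."""
    (ax, ay), (bx, by), (cx, cy) = p, q, r
    d = 2 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        return None
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return (ux, uy), math.dist((ux, uy), p)


def smallest_enclosing_circle(points, tol=1e-9):
    """Brute-force search: the optimum is defined by two or three of the points."""
    if len(points) == 1:
        return points[0], 0.0
    candidates = [circle_from_two(p, q) for p, q in combinations(points, 2)]
    for triple in combinations(points, 3):
        circle = circle_from_three(*triple)
        if circle is not None:
            candidates.append(circle)
    best = None
    for center, radius in candidates:
        encloses = all(math.dist(center, p) <= radius + tol for p in points)
        if encloses and (best is None or radius < best[1]):
            best = (center, radius)
    return best


def centralized_action(agent_pos, center, actions):
    """One-hot distribution on the discrete action nearest to the rendezvous point."""
    target = (center[0] - agent_pos[0], center[1] - agent_pos[1])
    best = min(range(len(actions)), key=lambda k: math.dist(actions[k], target))
    return [1.0 if k == best else 0.0 for k in range(len(actions))]
```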

VI-A4 Existing Distributed Policies

To the best of our knowledge, the circumcenter law is the state-of-the-art distributed policy for this task that guarantees convergence on single integrator dynamics [16]. This result can be extended to second-order systems with double integrator dynamics, which we consider for this task. The circumcenter law is defined as follows: at each time step, each agent pursues the circumcenter of the point set consisting of its neighbours and itself (that is, the center of the smallest circle enclosing these points) [16]. To adapt this control law to our control framework described in Section III, we choose the agent’s action based on the relative position of the agent to the circumcenter, equivalently to Eq. 18. The communication inflow and outflow consist only of zero vectors at all time steps.

There are other existing distributed policies such as the averaging law and cyclic pursuit. The averaging law requires each agent to pursue the average coordinate of its neighbours and itself. However, it suffers from convergence issues [17]. Cyclic pursuit usually assumes a fixed connectivity network, which is not given in this task [16].

VI-B Particle Assignment

We also propose a new distributed robotic task that has no existing solution, to the best of our knowledge. The objective is to move all agents to target points such that every target point is covered by an agent, using only communication and the agents’ local observations of neighbouring agents and target points. Unlike the cooperative navigation task defined in [10], which assumes an unlimited visibility range, we assume a limited visibility range.

VI-B1 Task Formulation

We consider homogeneous agents with position vectors in a 2-dimensional plane. The agent dynamics, PD-controller, discrete action space , connectivity network and discretization rule are the same as described in the rendezvous task. Adding on top of these, we introduce target points with the position vectors . We can define as the connectivity of the target points based on the visibility:


where is the limited visibility range.

We define the set of covered target points as follows:


where is a small constant that defines the distance requirement for covering a target point.

We assume that , are both connected initially, and at least one agent can observe one target point initially. Each agent also has a pre-defined potential field layer as a collision avoidance mechanism on top of the PD-controller. For convenience, we can define the neighbouring target points as follows:


The observation of each agent is defined as the relative positions of its neighbouring agents and target points to itself. Using the same space discretization in Section VI-A, we represent the observation as follows:


To evaluate the performance, we define as the smallest such that . If the constraint can never be satisfied, . The convergence rate is defined as in the rendezvous task.

VI-B2 Centralized Policy

We design our centralized policy to be an optimal assignment of agents to target points so that the maximum distance travelled by any agent is minimized. This leads to the minimization of the completion time . The optimization is done by using the Hungarian algorithm [21].
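To make the assignment objective concrete, the sketch below minimizes the maximum agent-to-target distance by exhaustive search over permutations. This is a deliberately simple stand-in that is only practical for a handful of agents; it is not the Hungarian algorithm used in the paper, and the function name is hypothetical.

```python
import math
from itertools import permutations


def min_max_assignment(agents, targets):
    """Assign each agent a distinct target so the largest distance is minimal.

    Returns (assignment, bottleneck), where assignment[i] is the index of the
    target given to agent i and bottleneck is the largest assigned distance.
    Requires len(agents) <= len(targets).
    """
    best_perm, best_cost = None, math.inf
    for perm in permutations(range(len(targets)), len(agents)):
        cost = max(math.dist(agents[i], targets[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return list(best_perm), best_cost
```

For example, with agents at (0, 0) and (10, 0) and targets at (9, 0) and (1, 0), the search pairs each agent with its nearby target, giving a bottleneck distance of 1.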

VII Simulation Results

In this section, we describe the training details and present our simulation results for the two distributed robotic tasks.

VII-A Training Details

To perform this learning task, we use a DNN as the distributed policy network to model the mapping we aim to learn:


where is the local observation after discretization (see Eq. 13 for the rendezvous task, Eq. 23 for the particle assignment task), is the action probability vector after discretization (see Eq. 15), and are the communication vectors after discretization (see Eq. 6 and 7).

For both tasks, the DNN is a feedforward neural network (2 layers with 32 neurons per layer for the rendezvous task; and 4 layers with 128 neurons per layer for the particle assignment task). The probability distribution is obtained by a softmax layer. The loss function between the two probability distributions and is the cross entropy. To train the DNN, we run simulations in parallel with different initial setups. During the simulations, each agent’s action is sampled from the action distribution predicted by the distributed policy network. After every time steps, we construct the MSMANN model (using the DNN as the identical components) based on the dynamic connectivity of the agents in the past time steps. Using the action probability distribution from the centralized policy as the target, we apply the Adam algorithm [22], with gradients obtained by backpropagation through this MSMANN, to minimize the average of the loss functions over the parallel simulations. Each loss function is an approximation of the overall learning objective:


where is the current time step and is the number of time steps we back-propagate. In this work, we choose for both tasks and for the rendezvous task, and for the particle assignment task. For the rendezvous task, we further simplify the communication inflow as the sum of the communication vectors after discretization () instead of the concatenation of these communication vectors as in Eq. 7. This allows us to better analyze the communication content learned.
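A forward pass of a distributed policy network of the kind described here can be sketched in plain Python. The sketch uses two hidden layers of 32 units (the rendezvous setting) and a softmax over the action outputs. The ReLU activation, the weight initialization, the way one output layer is split into action logits and a communication outflow, and all names are assumptions of this sketch; training is not shown.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one dense layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, layer, relu=True):
    """Affine map followed by an optional ReLU (assumed activation)."""
    weights, biases = layer
    y = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [max(0.0, v) for v in y] if relu else y


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def make_policy(obs_dim, comm_in_dim, comm_out_dim, n_actions, hidden=32, seed=0):
    """Build a policy mapping (observation, inflow) to (action probs, outflow)."""
    rng = random.Random(seed)
    layers = [
        init_layer(obs_dim + comm_in_dim, hidden, rng),
        init_layer(hidden, hidden, rng),
        init_layer(hidden, n_actions + comm_out_dim, rng),
    ]

    def policy(obs, comm_in):
        h = dense(list(obs) + list(comm_in), layers[0])
        h = dense(h, layers[1])
        out = dense(h, layers[2], relu=False)
        # First n_actions outputs feed the softmax; the rest is the outflow.
        return softmax(out[:n_actions]), out[n_actions:]

    return policy
```

For instance, `make_policy(9, 2, 18, 9)` would correspond to a 9-component observation, a summed 2-dimensional inflow, and one 2-dimensional outgoing vector per neighbour group, where the communication size of 2 is chosen purely for illustration.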

Fig. 6: Performance comparison of the centralized policy, the state-of-the-art distributed policy, the learned policy with communication enabled, and the learned policy without communication on the rendezvous task with different numbers of agents. Each data point is the average of 25 trials with random initial conditions. We explicitly exclude the trials that fail to converge () since including them would distort the average performance, which is not desired for our comparison purposes. Instead, we show the convergence rate. A video of the simulations is available online at: http://tiny.cc/DNNswarm.

VII-B Rendezvous with Limited Visibility

For the rendezvous task, we train our distributed policy network on 10 agents with random initial positions and limit the size of the communication vector to be . We evaluate the performance of the DNN on scenarios with different numbers of agents where the agent density is similar to the training cases. Agent density is defined as the ratio of the number of agents to the area of the smallest circle that encloses all the agents.

We compare the performance of our learning approach against the state-of-the-art circumcenter distributed control law and the centralized policy described in Section VI-A. We also show the performance of our learning approach without communication (i.e., ).

Fig. 7: Comparison of the centralized policy, the state-of-the-art distributed policy, and the learned distributed policy with communication enabled and disabled on a 100-agent rendezvous task. The dots are the agents; the circles are the visible regions of the agents; the red line represents the distance of the two farthest agents. We show that the learned distributed policy can resemble the behaviour of the centralized policy significantly better when communication is available.
Fig. 8: The performance of the learned policy for different communication vector sizes on the rendezvous task. Each data point is the average of 10 trials with different random initial conditions. A sudden drop in performance from to can be observed.

In Fig. 6, we demonstrate that our learning approach consistently outperforms the state-of-the-art distributed policy for different numbers of agents. However, the learning approach without communication performs poorly under almost all circumstances, which demonstrates the necessity of inter-agent communication in resembling the behaviour of the centralized policy (see Fig. 7). Note that the circumcenter control law does not require communication because it behaves qualitatively differently from the centralized strategy.

However, the convergence rate can drop significantly for scenarios with large numbers of agents, which are not included in the training data. This is a generalization issue of the DNN, which might over-fit to the simple situations that the model is trained on. Despite this limitation, our learning approach still demonstrates the ability to learn an effective distributed policy with reasonable scalability on this rendezvous task: we train with 10 agents and test on up to 100.

VII-C Analysis of Communication Learned

We demonstrate that reducing the size of the communication vectors leads to a decrease in task performance with a significant drop from to (see Fig. 8). To provide more insights into this result, we choose to analyze the learned distributed policy with communication size because it is relatively easy to visualize while achieving comparable performance. We keep the observation input of the model constant and observe the change in the model output with the changing communication input. For the constant observation input, we assume a hypothetical situation where an agent has two neighbours that are located in exactly the opposite direction. For convenience, we define and . In Fig. 9, we hypothesize that the two communication inflow values can be transformed into a 2-dimensional vector that is correlated to the tendency of choosing the action that is closest to the vector’s direction. We can interpret the communication vector as an “intent vector”, which influences the tendency of the moving direction of the agent that receives this “intent vector”. This explains the sudden drop observed in performance from to since a one-dimensional communication vector cannot fully represent a 2-dimensional direction vector.

Fig. 9: The probability of choosing each discretized action () given the communication inflow values. We show five of them in this figure. “Channel 1” represents and “Channel 2” represents . It can be observed that in each discretized area, the direction of the increasing action probability is always opposite to the desired relative position of the corresponding action in the discretized area.
Fig. 10: A behaviour comparison of the centralized policy and the learned distributed policy for 2-agent (top) and 15-agent (bottom) scenarios. Dots represent the agents; lines represent the trajectories of the agents; small circles represent the cover range of the agents; large dashed circles represent the visibility of the agents; and diamonds represent the target points. We demonstrate that agents controlled by our learned distributed policy are able to (1) follow other agents that see targets when there is no target in their own sight, (2) explore neighbouring targets rather than stopping at the nearest target, and (3) resolve target assignment conflicts. These distributed behaviours emerge from learning from a centralized policy.
Fig. 11: A performance comparison of the centralized policy and the learned distributed policy on different numbers of agents for the particle assignment task. Each data point is the average of 100 trials. A video of the simulations is available online at: http://tiny.cc/DNNswarm

VII-D Particle Assignment

For the particle assignment task, we also train our DNN on 10-agent scenarios and test our approach on various numbers of agents up to 20. Fig. 10 demonstrates some examples of the performance achieved with 2 and 15 agents and the emergence of distributed behaviours. In Fig. 11, we also show that the average performance of our learned distributed policy is comparable with the centralized policy when there are fewer agents. This approach suffers from convergence issues for larger swarms. We hypothesize that this could be attributed to the inherent complexity of the particle assignment task. There are multiple aspects of the task that must be achieved: exploring, resolving assignment conflict, and staying connected with other agents. Achieving all aspects at once can be much more challenging as the number of agents increases.

VIII Conclusions and Future Work

We present a DNN-based approach that learns distributed action and communication policies from well-designed centralized policies for homogeneous, distributed robotic systems. The main advantages of our proposed approach are as follows: (i) it can be applied to various distributed robotic tasks, provided that pre-designed centralized policies are available; (ii) it requires little human expertise for task-specific control law and communication protocol designs; and (iii) it is computationally efficient compared to reward-based learning approaches. Moreover, the learned communication protocols reveal that meaningful messages are conveyed, which could potentially inspire the coordination and communication designs for real-world distributed robotic systems. Future work will address some of the observed convergence issues in the more complex scenarios, which may be due to over-fitting.