Neural Flocking: MPC-based Supervised Learning of Flocking Controllers

08/26/2019 ∙ by Shouvik Roy, et al. ∙ Microsoft ∙ Stony Brook University ∙ TU Wien

We show how a distributed flocking controller can be synthesized using deep learning from a centralized controller which generates the trajectories of the flock. Our approach is based on supervised learning, with the centralized controller providing the training data to the learning agent, i.e., the synthesized distributed controller. We use Model Predictive Control (MPC) for the centralized controller, an approach that has been successfully demonstrated on flocking problems. MPC-based flocking controllers are high-performing but also computationally expensive. By learning a symmetric distributed neural flocking controller from a centralized MPC-based flocking controller, we achieve the best of both worlds: the neural controllers have high performance (on par with the MPC controllers) and high efficiency. Our experimental results demonstrate the sophisticated nature of the distributed controllers we learn. In particular, the neural controllers are capable of achieving myriad flocking-oriented control objectives, including flocking formation, collision avoidance, obstacle avoidance, predator avoidance, and target seeking. Moreover, they generalize the behavior seen in the training data in order to achieve these objectives in a significantly broader range of scenarios.




1 Introduction

With the introduction of Reynolds' rule-based model REYNOLDS1987; Reynolds99, it is now possible to understand the flocking problem as one of distributed control. Specifically, in this model, at each time step, each agent executes a control law given in terms of the weighted sum of three competing forces to determine its next acceleration. Each of these forces has its own rule: separation (keep a safe distance away from your neighbors), cohesion (move towards the centroid of your neighbors), and alignment (steer toward the average heading of your neighbors). Reynolds' controller is distributed, i.e., it is executed separately by each agent, using information about only itself and nearby agents, and without communication. Furthermore, it is symmetric, i.e., every agent runs the same controller (same code).
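The weighted-sum control law described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `reynolds_acceleration`, the weights `w_sep`, `w_coh`, `w_ali`, and the inverse-distance form of the separation force are assumptions made here, not details taken from REYNOLDS1987.

```python
import numpy as np

def reynolds_acceleration(i, pos, vel, neighbors, w_sep=1.0, w_coh=1.0, w_ali=1.0):
    """Weighted sum of separation, cohesion and alignment for agent i.

    pos, vel: arrays of shape (n, 2); neighbors: indices of nearby agents.
    The weights and the exact force definitions are illustrative assumptions.
    """
    nbr_pos = pos[neighbors]
    nbr_vel = vel[neighbors]
    # Separation: push away from each neighbor, more strongly when it is close.
    offsets = pos[i] - nbr_pos
    dist2 = np.sum(offsets ** 2, axis=1, keepdims=True)
    separation = np.sum(offsets / dist2, axis=0)
    # Cohesion: move toward the centroid of the neighbors.
    cohesion = nbr_pos.mean(axis=0) - pos[i]
    # Alignment: steer toward the average velocity of the neighbors.
    alignment = nbr_vel.mean(axis=0) - vel[i]
    return w_sep * separation + w_coh * cohesion + w_ali * alignment
```

Every agent calls the same function with only its own state and that of its neighbors, which is what makes the controller both distributed and symmetric.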

It was subsequently shown that a simpler, more declarative approach to the flocking problem is possible Mehmood2018. In this setting, flocking is achieved when the agents combine to minimize a flock-wide cost function. Centralized and distributed solutions for achieving this form of “declarative flocking” were presented, both of which were formulated in terms of Model-Predictive Control (MPC) CAMACHO2007. The problem with MPC is that computing the next control action can be computationally expensive, as MPC attempts to find an action that minimizes the cost function over a given prediction horizon. This renders MPC unsuitable for real-time applications with short control periods, of which flocking is a prime example. Another potential problem with MPC-based approaches to flocking is their performance (at achieving the desired flight formations), which may suffer in a fully distributed setting.

In this paper, we present Neural Flocking (NF), a new approach to the flocking problem that uses Supervised Learning to learn a symmetric and fully distributed flocking controller from a centralized MPC-based controller. By doing so, we achieve the best of both worlds: high performance (on par with the MPC controllers) in terms of meeting flocking flight-formation objectives, and high efficiency leading to real-time flight controllers. Moreover, our NF controllers can easily be parallelized on specialized hardware such as GPUs and TPUs.

Figure 1: Neural Flocking Architecture

Figure 1 gives an overview of the NF approach. A high-performing centralized MPC controller provides the labeled training data to the learning agent: a symmetric and distributed neural controller in the form of a DNN or LSTM. The training data consists of trajectories of state-action pairs, where a state contains the information known to an agent at a time step (e.g., its own position and velocity, and the position and velocity of its neighbors), and the action (the label) is the acceleration assigned to that agent at that time step by the centralized MPC controller.

We formulate and evaluate NF in a number of essential flocking scenarios: basic flocking as in REYNOLDS1987; Mehmood2018, and more advanced flocking scenarios with additional objectives, including inter-agent collision avoidance, obstacle avoidance, predator avoidance, and target seeking by the flock. We conduct an extensive performance evaluation of NF. Our experimental results, which include videos, demonstrate the sophisticated nature of NF controllers. In particular, they are capable of achieving all of the control objectives listed above. Moreover, they generalize the behavior seen in the training data in order to achieve these objectives in a significantly broader range of scenarios.

2 Background

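We consider a set $\mathcal{A} = \{1, \ldots, n\}$ of dynamic agents that move according to the following discrete-time equations of motion:

$$
\begin{aligned}
p_i(k+1) &= p_i(k) + dt \cdot v_i(k), &\quad \lVert v_i(k) \rVert &\le \bar{v}, \\
v_i(k+1) &= v_i(k) + dt \cdot a_i(k), &\quad \lVert a_i(k) \rVert &\le \bar{a},
\end{aligned} \tag{1}
$$

where $p_i(k)$, $v_i(k)$, $a_i(k) \in \mathbb{R}^2$ are the position, velocity and acceleration of agent $i \in \mathcal{A}$, respectively, at time step $k$, and $dt$ is the time step. The magnitudes of velocities and accelerations are bounded by $\bar{v}$ and $\bar{a}$, respectively. We define the state of agent $i$ as the set $\{p_i(k), v_i(k)\}$. Acceleration $a_i(k)$ is the control input for agent $i$ at time step $k$. The acceleration is updated after every $\eta$ time steps, where $\eta$ is the duration of the control step relative to the time step. The flock configuration at time step $k$ is thus given by the following vectors (in boldface):

$$
\mathbf{p}(k) = [p_1^T(k) \cdots p_n^T(k)]^T, \quad
\mathbf{v}(k) = [v_1^T(k) \cdots v_n^T(k)]^T, \quad
\mathbf{a}(k) = [a_1^T(k) \cdots a_n^T(k)]^T. \tag{2}
$$

The configuration vectors are referred to without the time indexing as $\mathbf{p}$, $\mathbf{v}$, and $\mathbf{a}$. The neighborhood of agent $i$ at time step $k$, denoted by $\mathcal{N}_i(k)$, contains its $N$-nearest neighbors, i.e., the $N$ other agents closest to it. We use this definition for simplicity, and expect that a radius-based definition of neighborhood would lead to similar results for our distributed flocking algorithms.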
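As a rough illustration of the agent model and the nearest-neighbor definition above, the following Python sketch advances a flock by one time step and looks up a neighborhood. The names `clip_norm`, `step` and `nearest_neighbors` are hypothetical, and enforcing the magnitude bounds by rescaling is an assumption; the text only states that the magnitudes are bounded.

```python
import numpy as np

def clip_norm(x, max_norm):
    """Rescale each row of x so that its Euclidean norm is at most max_norm."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    scale = np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))
    return x * scale

def step(pos, vel, acc, dt, v_max, a_max):
    """One discrete time step: positions use the current velocities,
    velocities use the (bounded) current accelerations."""
    acc = clip_norm(acc, a_max)
    new_pos = pos + dt * vel
    new_vel = clip_norm(vel + dt * acc, v_max)
    return new_pos, new_vel

def nearest_neighbors(pos, i, k):
    """Indices of the k other agents closest to agent i."""
    d = np.linalg.norm(pos - pos[i], axis=1)
    d[i] = np.inf  # an agent is not its own neighbor
    return np.argsort(d)[:k]
```

A simulation loop would call `step` once per time step and recompute the accelerations only at every control step.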

2.1 Model-Predictive Control

Model-Predictive Control (MPC) CAMACHO2007 is a well-known control technique that has recently been applied to the flocking problem Mehmood2018; zhang2015model; zhan2013flocking. At each control step, an optimization problem is solved to find the optimal sequence of control actions (agent accelerations in our case) that minimizes a given cost function with respect to a predictive model of the system. The first control action of the optimal control sequence is then applied to the system; the rest is discarded. In the computation of the cost function, the predictive model is evaluated for a finite prediction horizon of $T$ control steps.

MPC-based flocking models can be categorized as centralized or distributed. A centralized model assumes that complete information about the flock is available to the central controller, which uses the states of all agents to compute the optimal accelerations for each agent. The following optimization problem is solved by a centralized MPC at each control step $k$:

$$
\min_{\mathbf{a}(k \mid k), \ldots, \mathbf{a}(k+T-1 \mid k)} \; J(k) + \lambda \sum_{t=0}^{T-1} \lVert \mathbf{a}(k+t \mid k) \rVert^2 \tag{3}
$$

The first term $J(k)$ is the centralized model-specific cost, evaluated for $T$ control steps (this embodies the predictive aspect of MPC), starting at time step $k$. It encodes the control objective. The second term, scaled by a weight $\lambda > 0$, penalizes large control inputs: $\mathbf{a}(k+t \mid k)$ are the predictions made at time step $k$ for the accelerations at time step $k+t$.
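To make the structure of this objective concrete, the sketch below evaluates it for one candidate acceleration sequence by rolling the model forward over the horizon. It is a simplified illustration under stated assumptions: the function name `mpc_objective` and the argument `cost_fn` are hypothetical, the control step is taken to equal the time step, the model-specific cost is summed over the predicted configurations, and the velocity and acceleration bounds are ignored.

```python
import numpy as np

def mpc_objective(pos, vel, acc_seq, dt, lam, cost_fn):
    """Cost of a candidate acceleration sequence over the prediction horizon.

    pos, vel: current flock configuration, shape (n, 2)
    acc_seq:  candidate accelerations, shape (T, n, 2)
    cost_fn:  model-specific cost of a predicted position array (assumed given)
    """
    p, v = pos.copy(), vel.copy()
    total = 0.0
    for acc in acc_seq:
        # Roll the predictive model forward by one step.
        p = p + dt * v
        v = v + dt * acc
        # Model-specific cost plus the penalty on large control inputs.
        total += cost_fn(p) + lam * np.sum(acc ** 2)
    return total
```

In a receding-horizon loop, an optimizer would minimize this value over `acc_seq`, apply only `acc_seq[0]` to the flock, and repeat the optimization at the next control step.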

In distributed MPC, each agent $i$ computes its acceleration based only on its state and its local knowledge, e.g., information about its neighbors:

$$
\min_{a_i(k \mid k), \ldots, a_i(k+T-1 \mid k)} \; J_i(k) + \lambda \sum_{t=0}^{T-1} \lVert a_i(k+t \mid k) \rVert^2 \tag{4}
$$

$J_i(k)$ is the distributed model-specific cost function for agent $i$, analogous to $J(k)$. In distributed MPC, due to limited information, an agent cannot calculate the exact future behavior of its neighbors. Hence, the predictive aspect of $J_i(k)$ must rely on some assumption about that behavior during the prediction horizon. Our distributed cost functions are based on the assumption that the neighbors have zero accelerations during the prediction horizon. While this simple design is clearly not completely accurate, our experiments show that it still achieves good results.

2.2 Declarative Flocking

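Declarative flocking (DF) Mehmood2018 is a high-level approach to designing flocking algorithms, by defining a suitable cost function for MPC, instead of defining the algorithms operationally using rules, as in Reynolds' model. For basic flocking, the cost function contains two terms: (1) a cohesion term based on the squared distance among all pairs of agents in the flock; and (2) a separation term based on the inverse of the squared distance among the agents. The flock evolves toward a configuration in which these two opposing forces are balanced. For centralized DF, i.e., centralized MPC (CMPC), the cohesion term considers all pairs of agents, and the separation term considers only neighbors:

$$
J^{C}(\mathbf{p}) = \frac{2}{|\mathcal{A}| \, (|\mathcal{A}| - 1)} \sum_{i \in \mathcal{A}} \sum_{j \in \mathcal{A},\, i < j} \lVert p_{ij} \rVert^2 + \omega \sum_{(i,j) \in \mathcal{E}(\mathbf{p})} \frac{1}{\lVert p_{ij} \rVert^2} \tag{5}
$$

where $p_{ij} = p_i - p_j$, $\omega$ is the weight of the separation term and controls the density of the flock, and $\mathcal{E}(\mathbf{p})$ is the set of pairs of agents separated by distance less than $r$, where $r$ defines the distance-based neighborhood. The control law for CMPC is given by Eq. (3), with $J(k) = \sum_{t=1}^{T} J^{C}(\mathbf{p}(k+t \mid k))$.

The basic flocking cost function for distributed DF is similar to that for CMPC, except that the cost function for agent $i$ is computed over its set of neighbors $\mathcal{N}_i$:

$$
J_i^{D}(\mathbf{p}) = \frac{1}{|\mathcal{N}_i|} \sum_{j \in \mathcal{N}_i} \lVert p_{ij} \rVert^2 + \omega \sum_{j \in \mathcal{N}_i} \frac{1}{\lVert p_{ij} \rVert^2} \tag{6}
$$

The control law for agent $i$ is Eq. (4), with $J_i(k) = \sum_{t=1}^{T} J_i^{D}(\mathbf{p}(k+t \mid k))$.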
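The two basic flocking costs can be sketched with NumPy as follows. This is an illustration under assumptions rather than the authors' implementation: the names `centralized_cost` and `distributed_cost` and the parameters `omega` and `radius` are hypothetical, and the cohesion term is taken to be the average squared pairwise distance.

```python
import numpy as np

def centralized_cost(pos, omega, radius):
    """Cohesion over all pairs plus separation over pairs closer than radius."""
    n = len(pos)
    diff = pos[:, None, :] - pos[None, :, :]
    dist2 = np.sum(diff ** 2, axis=-1)
    pair_d2 = dist2[np.triu_indices(n, k=1)]  # each unordered pair once
    cohesion = pair_d2.mean()
    close = pair_d2 < radius ** 2
    separation = np.sum(1.0 / pair_d2[close])
    return cohesion + omega * separation

def distributed_cost(pos, i, neighbors, omega):
    """Same two terms, computed by agent i over its own neighbors only."""
    d2 = np.sum((pos[neighbors] - pos[i]) ** 2, axis=1)
    return d2.mean() + omega * np.sum(1.0 / d2)
```

The cohesion term decreases as agents move together while the separation term grows, so the minimum of the sum lies at a configuration in which the two effects balance.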

3 Neural Flocking

We learn a distributed neural controller (DNC) from trajectories obtained from a CMPC. In addition to learning basic flocking behavior, we learn additional flocking-related behaviors, namely, collision avoidance, obstacle avoidance, target seeking, and predator avoidance. We also show how the learned behavior generalizes to a larger number of agents to achieve successful collision-free flocking.

We use supervised learning to train our DNC. Supervised learning learns a function that maps an input to an output based on an example sequence of input-output pairs. For our task, the trajectory data obtained from CMPC contains both the training inputs and training labels: the state of the agent is the input, and the agent’s acceleration at the same time step is the label, i.e., the output.

3.1 Required Extensions to Declarative Flocking

While the simple cost function in Eqs. (5) and (6) for basic flocking guarantees that in the steady state the agents are well separated, it does not guarantee collision avoidance. Additional goals such as collision avoidance and obstacle avoidance are added to the MPC problem as minimum distance constraints. The constrained MPC is recast as an equivalent unconstrained MPC problem Kerrigan2000softconstraints by converting the constraints to a penalty term, using the theory of exact penalty functions Fletcher1987. The weighted penalty term, derived based on the magnitudes of the constraint violations, is added to the MPC cost function.

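Consider a generic nonlinear constrained optimization problem, as given in Eq. (7a). It can be recast as the unconstrained optimization problem given by Eq. (7b):

\begin{align}
& \min_{x} \; f(x) \quad \text{s.t.} \quad c(x) \le 0 \tag{7a} \\
& \min_{x} \; f(x) + \rho \, \lVert c^{+}(x) \rVert \tag{7b}
\end{align}

where $\rho$ is the weight of the penalty term, and $c^{+}(x)$ is a vector containing the magnitudes of the constraint violations. For a constraint $c_i(x) \le 0$, the magnitude of the constraint violation is equal to $c_i^{+}(x) = \max(c_i(x), 0)$. According to Theorem 1 of Kerrigan2000softconstraints, the solution of problem (7a) is equal to the one of problem (7b) if $\rho > \lVert \lambda^{*} \rVert_{D}$ and $c(x^{*}) \le 0$. Here $\lambda^{*}$ is the optimal Lagrange multiplier vector, $\lVert \cdot \rVert_{D}$ is the dual norm, and $x^{*}$ is the optimal solution.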

We introduce cost function terms for collision avoidance, obstacle avoidance, target seeking, and predator avoidance. Objectives can be combined by including the corresponding terms in the cost function.

Cost Function Term for Collision Avoidance.

For collision avoidance in a flock of $n$ agents, pairwise constraints of the form $\lVert p_i - p_j \rVert \ge d_{min}$ for all $i < j$ are applied to the MPC optimization problem. The pairwise constraints ensure that no two agents are closer than a distance of $d_{min}$ from each other. The constraints are converted into a penalty term, that is, the 2-norm of the magnitudes of the constraint violations. For a pair of agents $(i, j)$, the magnitude of the constraint violation is equal to $\max(d_{min} - \lVert p_i - p_j \rVert, 0)$. Hence, the cost-function term for collision avoidance is $J_{ca}(\mathbf{p}) = \lVert c^{+}_{ca} \rVert$, where $c^{+}_{ca}$ is the vector of the magnitudes of collision-avoidance constraint violations.
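The conversion of the pairwise minimum-distance constraints into a single penalty value can be sketched as follows. The function name `collision_penalty` and the parameter name `d_min` (the minimum allowed inter-agent distance) are hypothetical; the weight that multiplies the penalty in the MPC cost is applied outside this function.

```python
import numpy as np

def collision_penalty(pos, d_min):
    """2-norm of the violations of the pairwise minimum-distance constraints."""
    i_idx, j_idx = np.triu_indices(len(pos), k=1)  # each unordered pair once
    dist = np.linalg.norm(pos[i_idx] - pos[j_idx], axis=1)
    # A pair violates its constraint by the amount it is closer than d_min.
    violations = np.maximum(d_min - dist, 0.0)
    return np.linalg.norm(violations)
```

The penalty is zero exactly when every pair satisfies its constraint, so it leaves the cost of collision-free configurations unchanged.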

Cost Function Term for Obstacle Avoidance.

For a flock of size $n$ and an obstacle field containing $m$ obstacles, constraints of the form $d_{ij} \ge d_{min}$ are applied to the optimization problem, where $d_{ij}$ is the distance between agent $i$ and the closest point on obstacle $j$. For an agent-obstacle pair $(i, j)$, the magnitude of the constraint violation is equal to $\max(d_{min} - d_{ij}, 0)$. Hence, the cost-function term for obstacle avoidance is $J_{oa}(\mathbf{p}, \mathbf{o}) = \lVert c^{+}_{oa} \rVert$, where $c^{+}_{oa}$ is the vector of the magnitudes of obstacle-avoidance constraint violations, and $\mathbf{o}$ is the set of points on obstacle boundaries.

Cost Function Term for Target Seeking.

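This term is the average of the squared distance between the agents and the target. Let $g$ denote the position of the fixed target. Then the term is defined as $J_{ts}(\mathbf{p}) = \frac{1}{n} \sum_{i=1}^{n} \lVert p_i - g \rVert^2$, where $n$ is the number of agents.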

Cost Function Term for Predator Avoidance.

We introduce one predator, which is more agile than the flocking agents, with a maximum speed and maximum acceleration a factor of $f_p$ times higher than $\bar{v}$ and $\bar{a}$, respectively. Apart from being more agile, the predator has the same dynamics as the agents, given by Eq. (1). The control law for the predator consists of a single rule that causes it to seek the centroid of the flock with maximum acceleration.
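The predator's single rule can be sketched as below. The function name `predator_acceleration` and the parameter `a_max_pred` (the predator's maximum acceleration) are hypothetical names introduced for this illustration.

```python
import numpy as np

def predator_acceleration(pred_pos, flock_pos, a_max_pred):
    """Accelerate toward the flock centroid with maximum magnitude."""
    direction = flock_pos.mean(axis=0) - pred_pos
    norm = np.linalg.norm(direction)
    if norm == 0.0:
        # The predator is already at the centroid: no preferred direction.
        return np.zeros_like(direction)
    return a_max_pred * direction / norm
```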

For a flock of $n$ agents and one predator, constraints of the form $d_{i,pred} \ge d_{p}$ are applied to the optimization problem, where $d_{i,pred}$ is the distance between agent $i$ and the predator, and $d_{p}$ is the desired minimum distance. The magnitude of the constraint violation for an agent $i$ is $\max(d_{p} - d_{i,pred}, 0)$. The cost-function term is $J_{pa} = \lVert c^{+}_{pa} \rVert$, where $c^{+}_{pa}$ is the vector of the magnitudes of predator-avoidance constraint violations.

Cost Function terms used in the experiments.

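The cost functions for our experiments are weighted sums of the cost function terms introduced above. We refer to the first term of Eq. (5) as $J_{cohes}$ and the second as $J_{sep}$. If in an experiment collision avoidance is added as a penalty term, then the separation term is omitted due to redundancy. If multiple objectives expressed as constraints are used, the constraint violations are collected in one penalty term with weight $\rho$. Writing $c^{+}_{ca}$, $c^{+}_{oa}$, and $c^{+}_{pa}$ for the vectors of collision-avoidance, obstacle-avoidance, and predator-avoidance constraint violations, we use the following cost functions $J_1$, $J_2$, $J_3$, and $J_4$ for experiments with basic flocking, collision avoidance, obstacle avoidance with target seeking, and predator avoidance, respectively:

\begin{align}
J_1 &= J_{cohes} + J_{sep} \tag{8a} \\
J_2 &= J_{cohes} + \rho \, \lVert c^{+}_{ca} \rVert \tag{8b} \\
J_3 &= J_{cohes} + \rho \, \lVert [\, c^{+}_{ca} ;\; c^{+}_{oa} \,] \rVert + \gamma \, J_{ts} \tag{8c} \\
J_4 &= J_{cohes} + \rho \, \lVert [\, c^{+}_{ca} ;\; c^{+}_{pa} \,] \rVert \tag{8d}
\end{align}

where $\gamma$ is the weight of the target-seeking term.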

3.2 Neural Network Architectures

We consider two main neural network (NN) architectures. The first is a class of recurrent neural networks called Long Short-Term Memory (LSTM) lstm. LSTMs have been shown to perform well in motion planning Everettlstm and obstacle-field navigation chenlstm. Our motivation for using LSTMs was to exploit the temporal nature of the trajectory data, as LSTMs employ memory cells well suited for handling temporal data. The second class of NN we use is the Deep Neural Network (DNN). The performance of the DNC controllers we obtain strongly depends upon the chosen NN architecture. We refer to the resulting DNC controllers as DNC-LSTM and DNC-DNN, respectively.

3.3 Training Distributed Flocking Controllers

Our objective is to learn basic flocking, collision avoidance, obstacle avoidance with target seeking, and predator avoidance. The last two implicitly also include collision avoidance. For each of these tasks, our methodology is to train a DNC using the trajectory data obtained from the CMPC. CMPC is run with a neighborhood size of , starting from 100 random initial states, producing 100 trajectories, each with a duration of 100 time units. We learn a single DNC from the state-action pairs of all agents. This yields a symmetric distributed controller, which we use for each agent during evaluation.

Basic Flocking.

Trajectory data for basic flocking is generated using the cost function in Eq. (5). The input to the NN is the position and velocity of the agent along with the positions and velocities of its nearest neighbors. We will refer to the agent (DNC) being learned as $A_0$. Since we use a neighborhood size of 5, the input to the NN is of the form $[p^x_0 \; p^y_0 \; v^x_0 \; v^y_0 \; p^x_1 \; p^y_1 \; v^x_1 \; v^y_1 \; \ldots \; p^x_5 \; p^y_5 \; v^x_5 \; v^y_5]$, where $p^x_0$, $p^y_0$ are the position coordinates, and $v^x_0$, $v^y_0$ the velocity coordinates for the agent $A_0$. Similarly, $p^x_j$, $p^y_j$ and $v^x_j$, $v^y_j$, for $j = 1, \ldots, 5$, are the position and velocity coordinates of its neighbors. Since this vector has 24 components, the input to the NN consists of 24 features.
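Assembling the 24-feature input for one agent can be sketched as follows. The function name `basic_features`, the ordering of neighbors by increasing distance, and the use of absolute rather than relative coordinates are assumptions made for this illustration.

```python
import numpy as np

def basic_features(pos, vel, i, k=5):
    """Input vector for agent i: its own position and velocity followed by
    the positions and velocities of its k nearest neighbors (4 * (k + 1) values)."""
    d = np.linalg.norm(pos - pos[i], axis=1)
    d[i] = np.inf  # exclude the agent itself
    nbrs = np.argsort(d)[:k]
    own = np.concatenate([pos[i], vel[i]])
    others = np.concatenate([np.concatenate([pos[j], vel[j]]) for j in nbrs])
    return np.concatenate([own, others])
```

The training label paired with this vector is the two-component acceleration that the centralized controller assigned to the same agent at the same time step.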

Collision Avoidance.

The CMPC cost function for collision avoidance is given in Eq. (8b). The input to the NN is the same as for basic flocking.

Obstacle Avoidance with Target Seeking.

For obstacle avoidance with target seeking (and collision avoidance), we use CMPC with the cost function in Eq. (8c). The target is located behind the obstacles, forcing the agents to move through the obstacle field. For this task, the input to the NN is given by the positions and velocities of agent $A_0$ along with its 5 nearest neighbors, as well as the position of the closest point on any obstacle from agent $A_0$ and from each of its 5 nearest neighbors, and the target location of the flock. The input to the NN consists of the 38 features $[p^x_0 \; p^y_0 \; v^x_0 \; v^y_0 \; \ldots \; p^x_5 \; p^y_5 \; v^x_5 \; v^y_5 \; o^x_0 \; o^y_0 \; o^x_1 \; o^y_1 \; \ldots \; o^x_5 \; o^y_5 \; g^x \; g^y]$, where $o^x_0$, $o^y_0$ is the closest point to agent $A_0$ on any obstacle; $o^x_j$, $o^y_j$, for $j = 1, \ldots, 5$, give the closest point on any obstacle for the 5 neighboring agents, and $g^x$, $g^y$ is the target location.

Predator Avoidance.

The CMPC cost function for predator avoidance (with collision avoidance) is given in Eq. (8d). The position, velocity and acceleration of the predator are denoted by $p_{pred}$, $v_{pred}$, and $a_{pred}$, respectively. We take , hence and . The input features to the neural network are the positions and velocities of agent $A_0$ along with its 5 nearest neighbors and the position and velocity of the predator. The input with 28 features has the form $[p^x_0 \; p^y_0 \; v^x_0 \; v^y_0 \; \ldots \; p^x_5 \; p^y_5 \; v^x_5 \; v^y_5 \; p^x_{pred} \; p^y_{pred} \; v^x_{pred} \; v^y_{pred}]$.

4 Experimental Evaluation

We conducted an extensive performance evaluation of Neural Flocking, taking into account various control objectives: basic flocking, collision avoidance, obstacle avoidance with target seeking, and predator avoidance. As illustrated in Fig. 1, this involved running CMPC to generate the training data for the distributed neural controllers, whose performance was then compared to the DMPC controllers. We also show that the learned DNC flocking controllers generalize the training data in two important ways: they achieve successful collision-free flocking in flocks larger than those used in the training data, and they achieve obstacle avoidance in obstacle fields having a larger number of obstacles than what was present in the training data. We include as supplementary material multiple videos depicting the quality of learning for these controllers. We consider different NN architectures, each with its own training parameters.

The CMPC and DMPC problems are solved using gradient-descent optimization. In the training phase, the size of the flock is . For obstacle-avoidance experiments, we use 5 obstacles. The simulation time is , time units, , and . As reported in Mehmood2018, the weight of the separation term in DMPC and CMPC is and , respectively. We use and ; a higher value is used for due to the agility of the predator. The weight of the penalty term is . For the initial configuration, the positions and velocities are uniformly sampled from and , respectively. We ensure that the initial configuration is recoverable; i.e., no two agents are so close to each other that they cannot avoid a collision when resorting to maximal acceleration. The predator starts at rest from a fixed location at a distance of 50 from the flock center.

For training, we considered 30 agents and 100 trajectories per agent, each trajectory 334 time steps in length. This yielded a total of 1,002,000 training samples. We use two variants of neural nets, LSTM and DNN, to learn DNCs. For LSTM, we use 2 hidden layers with 34 cells per hidden layer, with a sigmoid activation function. For the DNN version, we use 5 hidden layers with 64 neurons per hidden layer, with a sigmoid activation function. We chose the number of hidden layers and neurons such that the numbers of trainable parameters are comparable for both classes of NNs.

The Adam optimizer adamopt was used with the following settings: , , , . The number of epochs used for training is 10,000 and the batch size is 500. For measuring training loss, we use the mean-squared error metric. For both basic flocking and collision-avoidance, we give the neural network an input vector with 24 features. The number of trainable parameters for these two control objectives for the DNN and LSTM configurations are 18,370 and 18,644, respectively. For obstacle-avoidance and target-seeking, we give the neural network an input vector with 38 features. The number of trainable parameters for the DNN and LSTM are 19,138 and 20,324, respectively. Finally, the predator-avoidance control objective has 28 features as the input to the neural network; the resulting number of trainable parameters for the DNN and LSTM architectures are 19,266 and 20,604, respectively. For training the neural networks, we use Keras, which is a high-level neural network API written in Python and capable of running on top of TensorFlow.
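The trainable-parameter count of the DNN variant for the 24-feature input can be checked with a short calculation. The sketch assumes fully connected layers with bias terms and a two-component output (the planar acceleration); the function name `dense_param_count` is hypothetical.

```python
def dense_param_count(n_inputs, hidden_layers, hidden_units, n_outputs):
    """Number of weights and biases in a fully connected feed-forward network."""
    sizes = [n_inputs] + [hidden_units] * hidden_layers + [n_outputs]
    # Each layer has fan_in * fan_out weights plus fan_out biases.
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(sizes[:-1], sizes[1:]))

# 24 input features, 5 hidden layers of 64 neurons, 2 outputs.
count = dense_param_count(24, 5, 64, 2)
print(count)  # 18370
```

Under these assumptions the result matches the 18,370 trainable parameters reported for the DNN on basic flocking and collision avoidance.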

To test the learned DNCs, we generated 100 simulations (runs) for each of the desired control objectives: basic flocking, flocking with collision avoidance, flocking with obstacle avoidance and target seeking, and flocking with predator avoidance. The results presented in Tables 1-3 were obtained using the same number of agents and obstacles and the same predator as in the training phase. We also ran tests that show DNC controllers can achieve collision-free flocking with obstacle avoidance where the numbers of agents and obstacles are greater than those used during training. The Supplemental Material includes videos demonstrating flocking with 35 agents and obstacle avoidance with 10 obstacles.

We use flock diameter and velocity convergence zhang2015model as performance metrics for flocking behavior. At any time step, the flock diameter is the largest distance between any two agents in the flock. The velocity convergence VC is the average of the magnitude of the discrepancy between the velocities of agents and the flock’s average velocity. For both metrics, lower values are better, indicating a dense and coherent flock. A successful flocking controller should also ensure that both values eventually stabilize.
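Both metrics can be computed directly from a flock configuration, as in the following sketch. The function names are hypothetical, and the velocity convergence is implemented literally as the mean Euclidean norm of each agent's deviation from the average velocity, which is one reading of the definition above.

```python
import numpy as np

def flock_diameter(pos):
    """Largest distance between any two agents; pos has shape (n, 2)."""
    diff = pos[:, None, :] - pos[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1)).max()

def velocity_convergence(vel):
    """Average magnitude of the deviation from the flock's mean velocity."""
    return np.linalg.norm(vel - vel.mean(axis=0), axis=1).mean()
```

Evaluating both functions at every time step of a run gives the curves whose stabilization indicates successful flocking.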

For collision avoidance, obstacle avoidance, and predator avoidance, collision rates are used as a performance metric. An inter-agent collision (IC) occurs when the distance between two agents at any point in time is less than . An obstacle-agent collision (OC) occurs when the distance between an agent and the closest point on any obstacle is less than . A predator-agent collision (PC) occurs when the distance between an agent and the predator is less than . The collision counts reported below are the total numbers of collisions of each type in the 100 test trajectories. The collision rate is the number of states in those trajectories in which a collision occurs divided by the total number of states in those trajectories.
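The inter-agent collision count and rate for one trajectory can be sketched as follows. The function name `inter_agent_collisions` and the threshold parameter `d_min` are hypothetical, and a "state" is interpreted here as the flock configuration at one time step, which is an assumption.

```python
import numpy as np

def inter_agent_collisions(traj, d_min):
    """Return (collision count, collision rate) for a trajectory.

    traj: positions of shape (T, n, 2); d_min: collision distance threshold.
    """
    count = 0
    colliding_states = 0
    for pos in traj:
        i_idx, j_idx = np.triu_indices(len(pos), k=1)
        dist = np.linalg.norm(pos[i_idx] - pos[j_idx], axis=1)
        c = int(np.sum(dist < d_min))  # colliding pairs at this time step
        count += c
        colliding_states += int(c > 0)
    return count, colliding_states / len(traj)
```

Obstacle-agent and predator-agent collisions can be counted the same way by replacing the pairwise agent distances with the corresponding agent-obstacle or agent-predator distances.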

| Models | Average Converged Diameter | SD of Converged Diameter | Velocity Convergence | SD of Velocity Convergence |
|---|---|---|---|---|
| DNC-DNN | 22.8138 | 2.0137 | 0.0406 | 0.0037 |
| DNC-LSTM | 23.7629 | 2.8272 | 0.0435 | 0.0041 |
| DMPC | 21.1231 | 2.5358 | 0.0376 | 0.0036 |
| CMPC | 22.0111 | 2.6494 | 0.0303 | 0.0012 |

Table 1: Performance comparison for basic flocking

| Models | Average Converged Diameter | SD of Converged Diameter | Velocity Convergence | SD of Velocity Convergence |
|---|---|---|---|---|
| DNC-DNN | 22.3521 | 2.0362 | 0.1501 | 0.0180 |
| DNC-LSTM | 23.2676 | 2.2193 | 0.1638 | 0.0211 |
| DMPC | 61.7392 | 34.8767 | 0.2499 | 0.1941 |
| CMPC | 21.4441 | 1.8788 | 0.1449 | 0.0258 |

Table 2: Performance comparison for flocking with collision avoidance

We calculate the average converged diameter by averaging the flock diameter in the final time step of the simulation over the 100 runs. Similarly, the standard deviation of the velocity convergence is obtained from the last time step of all 100 runs. Table 1 shows the performance of the DNC variants against the MPC controllers with respect to basic flocking for 30 agents. Although the DMPC performance is better than DNC-DNN and DNC-LSTM, the difference is marginal. Table 2 presents the collision-avoidance results for 30 agents. Both DNC-DNN and DNC-LSTM outperform the DMPC controller in terms of flock diameter and velocity convergence. As seen in Fig. (2b), DMPC does poorly for collision avoidance due to flock fragmentation; this leads to an increase in flock diameter. Although the DNCs are also distributed, they do not encounter flock fragmentation. This is likely because they are trained using CMPC-generated data, and the CMPC controller has a flock-wide view of the system. Thus, the DNCs are able to learn patterns in the state of their neighbors that help them flock better.

(a) Flock diameter for basic flocking
(b) Flock diameter for collision avoidance
(c) Velocity convergence for basic flocking
(d) Velocity convergence for collision avoidance
Figure 2: Comparison of performance measures for basic flocking and collision avoidance, averaged over 100 runs for each flocking controller.

Table 3 shows the collision rates for the DNC controllers and CMPC. An ideal controller should produce no collisions. The DNCs achieve zero inter-agent collisions. Furthermore, for all three types of collisions, the DNCs achieve significantly fewer collisions than the DMPC.

| Models | IC Count | IC Rate | OC Count | OC Rate | PC Count | PC Rate |
|---|---|---|---|---|---|---|
| DNC-DNN | 0 | 0% | 311 | 0.93% | 1621 | 4.85% |
| DNC-LSTM | 0 | 0% | 338 | 1.01% | 1737 | 5.20% |
| CMPC | 17 | 0.05% | 569 | 1.70% | 2130 | 6.38% |

Table 3: Performance comparison for collision, obstacle and predator avoidance

An important advantage of DNCs over MPCs is that they are much faster. Executing a DNC requires a modest number of arithmetic operations, whereas executing an MPC requires simulation of a model and controller over the prediction horizon. In our experiments, on average, the CMPC and DMPC take 10 msec and 57 msec of CPU time, respectively, whereas DNC-DNN and DNC-LSTM take only 1.6 msec and 1.8 msec, respectively.

5 Related Work

The work in bentley2018 synthesizes a flocking controller using multi-agent reinforcement learning (MARL) and natural evolution strategies (NES). The target model from which the system learns is Reynolds' flocking model REYNOLDS1987. For training purposes, a list of metrics called entropy is chosen; these metrics provide a measure of the collective behavior displayed by the target model. As the authors of bentley2018 observe, this technique does not quite work: although it consistently leads to agents forming recognizable patterns during simulation, agents self-organized into a cluster instead of flowing like a flock. The work of La2015predavoid combines reinforcement learning and flocking control for the purpose of predator avoidance, where the learning module determines safe spaces in which the flock can navigate to avoid predators. Their approach to predator avoidance, however, is not distributed, as it requires a majority consensus by the flock to determine its action to avoid predators. They also impose an $\alpha$-lattice structure olfati2006flocking on the flock to ensure predator avoidance. In contrast, our approach is geometry-agnostic and can achieve predator avoidance in a distributed manner.

The approach of kahn2017RL develops an uncertainty-aware reinforcement learning algorithm to estimate the probability of a mobile robot colliding with an obstacle in an unknown environment. Their approach is based on bootstrapped neural networks using dropout, allowing it to process raw sensory inputs. Similarly, a learning-based approach to robot navigation and obstacle avoidance is given in PfeifferSNSC17. They train a model that maps sensor inputs and the target position to motion commands generated by the ROS ros navigation package. Our work, in contrast, considers obstacle avoidance (and other control objectives) in a multi-agent flocking scenario under the simplifying assumption of full state observation. In Godoy2016, an approach based on Bayesian inference is proposed that allows an agent in a heterogeneous multi-agent environment to estimate the navigation model and goal of each of its neighbors. It then uses this information to compute a plan that minimizes inter-agent collisions while allowing the agent to reach its goal. Flocking formation is not considered in that paper.

6 Conclusions

With the introduction of Neural Flocking (NF), we have shown in this paper how machine learning in the form of Supervised Learning can bring many benefits to the flocking problem. As our experimental evaluation confirms, the symmetric and distributed neural controllers we derive in this manner are capable of achieving a multitude of flocking-oriented flight objectives, including: flocking formation, inter-agent collision avoidance, obstacle avoidance, predator avoidance, and target seeking. Moreover, NF controllers exhibit real-time performance and generalize the behavior seen in the training data to achieve these objectives in a significantly broader range of scenarios.

For future work, we plan to investigate a distance-based notion of agent neighborhood as opposed to our current nearest-neighbors formulation. We also plan to switch from 2D geometry to 3D geometry and to use a more realistic model of agent dynamics based on a quadcopter model, as given e.g. in ZhangKLA2016berkeley. Finally, motivated by the quadcopter study of ZhangKLA2016berkeley, we will seek to combine MPC with reinforcement learning in the framework of guided policy search as an alternative solution technique for the NF problem.


This material is based on work supported in part by NSF Grants CCF-1414078, CNS-1445770, CNS-1430010, CNS-1421893, CPS-1446832, and IIS-1447549.