## I Introduction

Designing distributed controllers is a challenging task, as the associated agents are typically attempting to achieve a global objective despite only having a local view of the global configuration. They must therefore take actions based on incomplete information. Often it is not possible to optimize for global objectives using locally-optimal actions alone. High-performing distributed controllers may thus need to employ information-sharing among non-neighbors via complicated protocols, such as distributed consensus.

This state of affairs raises the following question. Rather than manually designing distributed controllers, can we automatically learn them? If so, how would we obtain the requisite training data without already having a solution for the distributed control problem in hand?

In this paper, we explore the use of a centralized controller, with global system knowledge, to generate the training data needed to learn a fully distributed neural controller. It is not obvious that this approach would work, since learning a high-performing distributed controller would require the learning process to (implicitly) figure out a way to gather information from non-neighbors. Moreover, there is nothing in the training data that suggests how to perform such consensus tasks. A priori, we do not even know if such information sharing is possible in the distributed setting without explicit communication between agents.

To investigate this idea, we consider a particularly challenging multi-agent flight-formation problem: *V-formation*, an emergent behavior of significant interest to the aerospace industry. The V-formation problem refers to the task of bringing a collection of agents from an arbitrary initial state to a state where they are all flying in a V-shape, with one agent leading the group and the others following on the left and right branches of the V. V-formation provides numerous benefits. It is historically known for being energy-efficient due to the *upwash benefit* an agent in the configuration enjoys from its frontal neighbor. It also offers each agent a clear frontal view, unobstructed by any flock-mate.

The V-formation problem has been shown to be one of optimal control, which can be solved using model predictive control (MPC) [CAMACHO2007]. Section VI discusses various approaches that have been proposed to solve this problem. In particular, there exist centralized [ARES] and (partially) distributed [Lukina2019] solutions for achieving V-formation using MPC. None of these approaches, however, leads to a truly distributed solution for V-formation, i.e., one without any form of consensus or information-sharing among non-neighbors. Specifically, the distributed solutions in prior work have three shortcomings. First, the distributed controller [Lukina2019] uses a consensus round at the beginning of every time step, so that all agents agree on a consistent set of actions. This augmented controller performs tasks similar to leader election in the process. Second, the controller uses *adaptive neighborhood resizing* to enable agents to increase their neighborhood sizes to ensure convergence to a V-formation. MPC-based controllers can be computationally expensive, and increasing the neighborhood size increases the computational cost. Third, each control step consists of many ministeps in which agents exchange information and solve multiple optimization problems, leading to a complicated procedure overall.

In this paper, we present *Neural V-formation*, a new approach to the V-formation problem that uses Supervised Learning and a retraining technique we introduce called *Counterexample-Guided $k$-fold Retraining* to learn a symmetric and fully distributed controller from a centralized, adaptive-horizon MPC controller [ARES]. By doing so, we achieve the best of both worlds: high performance on par with the MPC controllers, and high efficiency, which leads to real-time flight controllers.

Notably, we also show how our neural V-formation controller generalizes to a significantly larger number of agents (up to 15) than the number of agents on which it is trained (only 7). This generalization by our neural V-formation controller is achieved using only local neighborhood information and a local cost-function value, without any communication with other agents. Our experiments demonstrate that attempting to use a distributed MPC controller (without explicit communication or consensus) to achieve this level of generalization does not yield satisfactory results and is computationally more expensive.

Figure 1 provides an overview of our approach. A high-performing, centralized, adaptive-horizon MPC controller (CAMPC) provides the labeled training data to the learning agent: a symmetric and fully distributed neural controller (DNC) in the form of a Deep Neural Network (DNN). The training data consists of trajectories of state-action pairs, where a state contains the information known to an agent at a time step (its position and velocity, the positions and velocities of its neighbors, and the value of its local cost function), and the action (the label) is the acceleration assigned to that agent at that time step by the CAMPC controller.

The key point here is that the CAMPC controller uses knowledge of the full state (positions and velocities of all agents) to find the optimal action for each agent, whereas the DNC controller is trained to compute the same output action only from information about its local state. The DNC has to do more than just a table lookup over the training data: it has to learn a function that uses only locally sensed data to compute the optimal action such that the same DNC works for all agents (and their local views) at all times.

The learning process we use for neural V-formation is significantly enhanced through the introduction of *Counterexample-Guided $k$-fold Retraining* (CEGkR). In this context, a counterexample is a trajectory along which the neural controller failed to achieve V-formation. CEGkR utilizes the first states of such failed trajectories as retraining samples, repeating this process until the desired performance of the neural controller is attained. In terms of verification of our neural controller, we use a form of statistical model checking [Larsen14, Grosu14] to compute confidence intervals for its rate of convergence to a V-formation and for its time to convergence.

The rest of the paper is organized as follows. Section II describes the model dynamics and the AMPC algorithm, including its cost function for V-formation. Section III presents impossibility results that illustrate the difficulty of achieving V-formation through distributed control. Section IV introduces our distributed neural controller for V-formation, and the associated learning process, with a focus on Counterexample-Guided $k$-fold Retraining. Section V contains experimental results comparing our neural controller with MPC-based controllers, along with our statistical model checking results. Section VI discusses related work, while Section VII offers concluding remarks.

## II Background

We describe the model for our agents (including their equations of motion), the Centralized Adaptive-horizon MPC controller used to generate training data, and the distributed variant of the MPC controller with which we compare the performance of our neural V-formation controller.

The state of agent consists of four variables: a 2-dimensional vector giving the agent’s position in 2D space, and a 2-dimensional vector giving the agent’s velocity. The state of a collection of agents is denoted . The control input, also called “action”, for agent is a 2-dimensional acceleration denoted . Let , and be the 2-dimensional positions, velocities and accelerations, respectively, of agent at time step , . The discrete-time equations of motion for agent are:

(1) | ||||

(2) |

where is the duration of a time step.
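As an illustration of the kind of update these equations describe, the following minimal sketch advances all agents by one time step. It assumes a standard forward-Euler discretization, which may differ in detail from Eqs. (1) and (2); the names `step` and `dt` are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

Vec = Tuple[float, float]


def step(positions: List[Vec], velocities: List[Vec],
         accelerations: List[Vec], dt: float) -> Tuple[List[Vec], List[Vec]]:
    """Advance every agent by one time step.

    Assumption: forward Euler, i.e. positions are updated with the current
    velocities and velocities with the current accelerations.
    """
    new_positions = [(x + dt * vx, y + dt * vy)
                     for (x, y), (vx, vy) in zip(positions, velocities)]
    new_velocities = [(vx + dt * ax, vy + dt * ay)
                      for (vx, vy), (ax, ay) in zip(velocities, accelerations)]
    return new_positions, new_velocities


# One agent at the origin, moving along x, accelerating along y.
pos, vel = step([(0.0, 0.0)], [(1.0, 0.0)], [(0.0, 2.0)], dt=0.5)
```

Applying `step` repeatedly to the accelerations chosen by a controller yields a trajectory of the multi-agent system.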

The goal of V-formation is to compute control actions (accelerations for the agents) that drive the system from an initial state (picked arbitrarily from some reasonable set of initial states) to a desired target state (a V-formation). We assume the desired final state is specified by a cost function, , that maps a state to a real-valued cost such that exactly when represents the desired target state (V-formation), and otherwise. Further details about are given below.

A *Centralized Adaptive-horizon Model Predictive Control* (CAMPC) algorithm is proposed in [ARES]. CAMPC generates action (acceleration) sequences using an adaptive prediction horizon to find the next action to execute towards the global optimum. CAMPC maintains multiple clones of the current state, and runs Particle Swarm Optimization (PSO) [kennedy95] on each of them. This allows it to call PSO for each clone with a different prediction horizon . CAMPC performs a system-wide minimization of the global cost function (defined in Eq. 5) at each time-step to obtain an optimal action sequence of length . The optimization is subject to the following constraints on the maximum velocities and accelerations:

(3) |

where is a constant and . PSO creates a swarm of particles uniformly at random within the given bounds on their positions and velocities. It then computes the fitness (the value of the cost function) of each particle. The fittest particle becomes a global best for the next iteration. This procedure is repeated until the number of iterations reaches its maximum, a time limit is reached, or the cost function reaches its minimum value (i.e., a V-formation is achieved).
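The PSO loop described above can be sketched as follows. This is a generic textbook PSO applied to a toy cost function, not the implementation used in CAMPC: the inertia and attraction coefficients, the velocity initialization range, the time-limit-free termination, and all function and parameter names are assumptions made for illustration.

```python
import random
from typing import Callable, List, Sequence, Tuple


def pso_minimize(cost: Callable[[List[float]], float],
                 bounds: Sequence[Tuple[float, float]],
                 n_particles: int = 30, max_iters: int = 200,
                 target: float = 1e-8, inertia: float = 0.7,
                 c_personal: float = 1.5, c_global: float = 1.5,
                 seed: int = 0) -> Tuple[List[float], float]:
    """Minimize `cost` over a box; stops at max_iters or when cost <= target."""
    rng = random.Random(seed)
    dim = len(bounds)
    # Particles are created uniformly at random within the given bounds.
    xs = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vs = [[rng.uniform(-(hi - lo) / 2, (hi - lo) / 2) for lo, hi in bounds]
          for _ in range(n_particles)]
    best_xs = [x[:] for x in xs]
    best_costs = [cost(x) for x in xs]  # fitness of each particle
    g = min(range(n_particles), key=best_costs.__getitem__)
    g_x, g_cost = best_xs[g][:], best_costs[g]  # fittest particle so far
    for _ in range(max_iters):
        if g_cost <= target:
            break
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vs[i][d] = (inertia * vs[i][d]
                            + c_personal * r1 * (best_xs[i][d] - xs[i][d])
                            + c_global * r2 * (g_x[d] - xs[i][d]))
                lo, hi = bounds[d]
                # Keep the particle inside the box.
                xs[i][d] = min(hi, max(lo, xs[i][d] + vs[i][d]))
            c = cost(xs[i])
            if c < best_costs[i]:
                best_costs[i], best_xs[i] = c, xs[i][:]
                if c < g_cost:
                    g_cost, g_x = c, xs[i][:]
    return g_x, g_cost


def sphere(x: List[float]) -> float:
    """Toy cost with its minimum (zero) at the origin."""
    return sum(c * c for c in x)


best_x, best_cost = pso_minimize(sphere, bounds=[(-1.0, 1.0), (-1.0, 1.0)])
```

In the MPC setting, a particle would encode a candidate action sequence over the prediction horizon, and `cost` would evaluate the V-formation cost of the state reached by applying it.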

The adaptive prediction horizons are chosen such that the best-performing PSO instances succeed in decreasing the objective cost by at least a pre-defined amount. The adaptive-horizon feature allows PSO to escape from local minima by gradually increasing the MPC prediction horizon when necessary. This provides convergence guarantees that would otherwise be impossible.

In [Lukina2019], a distributed version of MPC is used to solve the V-formation problem, albeit with a reliance on a distributed consensus algorithm. It deploys adaptive neighborhood resizing and an adaptive-horizon version of MPC to determine the optimal action (acceleration) for every agent at every time-step. Our comparative performance evaluation considers a modified version of this controller: *Distributed Adaptive-Horizon Model Predictive Control* (DAMPC), which uses the adaptive-horizon feature of [Lukina2019], but eschews any form of consensus. This is to ensure a fair comparison with our neural controller, which is also “consensus free”.
DAMPC does not use the adaptive-neighborhood feature in [Lukina2019]; instead, it uses a fixed neighborhood size of 7 agents, just like our neural controller. At any time-step, an agent’s neighborhood consists of the 7 nearest agents, including itself.
Thus, in DAMPC, each agent computes the optimal action sequences for the agents in its neighborhood, and then uses the first acceleration in the sequence for itself. The accelerations are computed using PSO, as in CAMPC, except that the scope of the cost function is restricted to agents in ’s neighborhood, instead of all agents.

The global cost function , for state , used in CAMPC for capturing V-formation, is defined in terms of the following metrics [YANG2016].

*Clear View*: An agent’s visual field is a cone with angle that can be blocked by the wings of other agents. The clear-view metric is defined as the sum over all agents of the percentage of agent ’s visual field that is blocked by other agents.
Let be the part of the angle subtended by the wing of agent on the view of agent that intersects with agent ’s visual cone with angle . Then, the clear view for agent , , is defined as , and the total clear view, , is defined as . The optimal value in a V-formation is , as all agents have a clear view.

*Velocity Matching*: is defined as the accumulated differences between the velocity of each agent and all other agents, summed up over all agents. Formally. . The minimum value of VM is attained when all agents have the same velocity.

*Upwash Benefit*: is the sum of (the inverse of) each agent’s upwash benefit. A trailing upwash effect is generated near the wingtips of an agent. An upwash measure is defined using a Gaussian model that peaks at the appropriate upwash regions.
Let be the projection of the vector along the wing-span of agent . Similarly, let be the projection of along the direction of . Specifically, the upwash benefit for agent coming from agent is given by

(4) |

where is the error function, which is a smooth approximation of the sign function, . is a 2D-Gaussian shifted so that the mean is , where is a 2D-Gaussian with mean at the origin. The parameter is the wing span, and is the relative position where upwash benefit is maximized. The total upwash benefit, , for agent is . The maximum upwash an agent can obtain is upper-bounded by 1. Since we are working with cost (that we want to minimize), we define . The upwash benefit in a V-formation is UB, as all agents, except for the leader, enjoy maximum upwash benefit.

The overall cost function is defined as a sum of squares:

(5) |

For distributed controllers (DAMPC and the neural controller), we need to define a local cost function for agent . It is the same as the global cost function, except that only agents in ’s neighborhood are considered. This restriction applies to all aspects of the cost function. For example, is the sum over agents in agent ’s neighborhood of the percentage of agent ’s visual field that is blocked by other agents in agent ’s neighborhood.

We consider an agent’s neighborhood to consist of a fixed number of the nearest agents, including the agent itself. Thus, agent ’s neighborhood consists of agent itself and the agents closest to it. We take in our experiments. State is considered to be a V-formation if for a specified threshold . (The threshold is a small positive constant chosen to allow for numerical errors due to floating-point computation, and also to allow for tiny perturbations that result in formations which are visually indistinguishable from a V.)

## III Impossibility Results

Designing controllers that achieve V-formation when the controllers are distributed, symmetric, and deterministic is difficult. This further motivates our proposed research on learning distributed and symmetric controllers for flight formation, including V-formation, from centralized controllers.

Distributed V-formation is an interesting and challenging problem. First note that a V-formation implicitly elects a leader. Hence a correct distributed algorithm will also solve the distributed leader election problem. It is known that there is no deterministic distributed leader election algorithm when all agents are identical [Attiya2004, chap. 3]. This result, however, does not directly carry over to V-formation since the state of each agent consists of its (and its neighbors') spatial location, and two agents can never be identical (i.e., have the same spatial locations for themselves and their neighbors). Nevertheless, most attempts to design deterministic distributed algorithms for V-formation will build in some form of spatial symmetry, and it is often possible to exploit this symmetry to devise initial configurations from which a proposed algorithm will fail to reach a V-formation.

First, V-formation inherits the issue in distributed systems that stems from the agents forming a disconnected partition. For the next two results, we assume that: (A1) if the agents in the neighborhood of an agent (including itself) are in a perfect V-formation, then that agent would set its acceleration to zero (that is, it will maintain the formation).

###### Proposition 1.

Under Assumption (A1), if agents are spatially separated into two groups of and agents, where , such that (1) each group is in a perfect V-formation, and (2) the nearest neighbors of any agent are in its own group, then a distributed procedure using neighborhood will fail to achieve a full V-formation on agents.

###### Proof.

In the distributed case, the agents have no knowledge of the existence of the other group, and hence use an acceleration of zero to keep their current formation, according to Assumption (A1). ∎

Agents partitioning into disconnected groups is not the only issue. Even when the neighborhood graph is connected, formations can look optimal locally, but remain suboptimal globally.

###### Proposition 2.

Under Assumption (A1), there exist initial configurations of agents such that starting from that configuration, a distributed procedure using neighborhood size will fail to achieve a full V-formation.

###### Proof.

Consider a perfect V-formation on three agents with one leader (at coordinate ), one agent on the left branch (at coordinate ), and one agent on the right (at coordinate ), where are positive. Now, add the -th agent at the position . Note that the position experiences optimal upwash (coming from and ). Assume all agents have velocity , and hence all agents are velocity matched. Agent , however, does not have optimal clear view. If , then every agent sees one other agent that is nearest to it. Every such pair of agents, however, are in a local V-formation, so all agents set their acceleration to zero, by Assumption (A1). Note that agent would not realize it doesn’t have clear view unless it looks at at least two other agents, i.e., unless is at least . ∎

These two propositions highlight two potential issues faced by a distributed approach: first, agents could get disconnected, and second, clear view is not a local property. However, what if the agents are connected and ? We present a scenario that demonstrates a third difficulty faced by a distributed procedure, namely the existence of multiple different optimal V-formations. We need some assumptions. We assume that: (A2) if the velocity of an agent is aligned with the average velocity of all neighboring agents, then the controller picks an acceleration that is also aligned with that direction. In other words, this assumption implies that if an agent is moving in the direction that is given by the average of the velocities of its neighbors, then it does not change its direction – it can still speed up or slow down, but it keeps the direction of its motion unchanged. This is a reasonable assumption for a controller since the controller is trying to achieve velocity matching and picking the average velocity is a commonly used strategy for this purpose.

We further assume that: (A3) the controller is invariant to rotation of the coordinate axes; that is, just changing the frame of reference does not change the action computed by the controller for any configuration. Assumption (A3) is also a reasonable assumption. If a controller uses only the relative positions of its neighbors (with respect to its own position) and relative velocities of its neighbors (with respect to its own velocity), then such a controller can be seen to satisfy Assumption (A3). If a controller satisfies Assumptions (A2) and (A3), then we can show that if every agent uses such a controller in a truly distributed manner to compute its own acceleration, then there are configurations that will never converge to a V-formation.

###### Proposition 3.

If every agent’s local controller satisfies Assumptions (A2)–(A3), then there exists an initial configuration such that the trajectory of the multi-agent system starting from that initial configuration will never converge to a V-formation, even as the neighborhood graph remains connected and the agents use a neighborhood size greater than .

###### Proof.

Consider a total of agents placed on a circle equidistant from each other and moving radially outwards with equal speed. Let this be the initial configuration. Let the neighborhood size be . In this case, the neighborhood of each agent will include one neighbor on its left and one on its right. Note the symmetry in this configuration. The local configuration (involving agents) that is available to every agent is equivalent modulo rotation of the coordinate axes. Hence, by Assumption (A3), if we know the action computed by any one agent, then we would know the action computed by all agents (by just rotating it appropriately). Therefore, let us focus on one agent. Without loss of generality, assume that this agent, call it , has position and velocity , where is the center of the circle of radius on which all the agents lie. Let and be the two neighbors of . Therefore, has position and has position . Furthermore, has velocity and has velocity . If we compute the average of the velocities of , , and , we get the velocity . The direction of this average velocity is aligned with the velocity of , and hence by Assumption (A2), the acceleration for computed by the controller will be of the form , for some , which is acceleration in the radial direction. By Assumption (A3), the controller for every agent will pick an acceleration that is aligned with its current velocity. Consequently, after one time step, the agents will continue to lie on a circle with center , and with velocities that are pointing radially outwards or inwards. We can now apply our argument again, and we can do so repeatedly to conclude that the agents will continue to lie on a circle forever. This shows that they will fail to converge to a V-formation. ∎
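The symmetry argument in this proof can be checked numerically. The sketch below places agents equidistantly on a circle, moving radially outwards with equal speed, and verifies that for every agent the average velocity over itself and its two adjacent neighbors is aligned with its own velocity, which is the premise needed to apply Assumption (A2). The number of agents (8), the unit radius and speed, and all function names are hypothetical choices made for illustration.

```python
import math
from typing import List, Tuple

Vec = Tuple[float, float]


def radial_configuration(n_agents: int, radius: float = 1.0,
                         speed: float = 1.0) -> Tuple[List[Vec], List[Vec]]:
    """Agents equidistant on a circle around the origin, moving radially outwards."""
    positions, velocities = [], []
    for i in range(n_agents):
        theta = 2.0 * math.pi * i / n_agents
        positions.append((radius * math.cos(theta), radius * math.sin(theta)))
        velocities.append((speed * math.cos(theta), speed * math.sin(theta)))
    return positions, velocities


def neighborhood_average_velocity(velocities: List[Vec], i: int) -> Vec:
    """Average velocity of agent i and its left and right neighbors on the circle."""
    n = len(velocities)
    trio = [velocities[(i - 1) % n], velocities[i], velocities[(i + 1) % n]]
    return (sum(v[0] for v in trio) / 3.0, sum(v[1] for v in trio) / 3.0)


N_AGENTS = 8  # hypothetical value chosen for this example
_, vels = radial_configuration(N_AGENTS)

max_abs_cross = 0.0       # zero cross product means the two vectors are parallel
min_dot = float("inf")    # positive dot product means they point the same way
for i in range(N_AGENTS):
    avg = neighborhood_average_velocity(vels, i)
    own = vels[i]
    max_abs_cross = max(max_abs_cross, abs(avg[0] * own[1] - avg[1] * own[0]))
    min_dot = min(min_dot, avg[0] * own[0] + avg[1] * own[1])
```

Because the tangential components of the two neighbors' velocities cancel, the average is a scalar multiple of the agent's own velocity, so a controller satisfying (A2) and (A3) keeps every agent on a radial line.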

The issue highlighted in the above proof is that there are several optimal configurations, and different agents can decide to pick a different end configuration. In the above proof, agent concludes that it is moving in the “correct” direction and that its two neighbors and should change their direction to match its own direction. And every agent, including and , comes to the same conclusion. This is because there are different V-formations – one heading in each of the different directions. And each agent picks a different final V-formation to target. One might wonder if fixing the heading direction (or the target destination) would solve this issue: any such change in the problem definition surely invalidates Proposition 3. However, we note that direction is not the only thing that can vary. The speed with which each agent is moving in the final V-formation can also change. The different agents can not only pick different final speeds in their final V-formation, but also different directions and even different leaders. This suggests that some coordination/consensus is required so that all agents agree and work toward the same final V-formation. But building in any form of coordination and consensus is tedious and error-prone.

## IV Neural V-formation

We learn a Distributed Neural Controller (DNC) for V-formation from trajectories obtained from a Centralized Adaptive-horizon Model Predictive Controller (CAMPC). Our learning procedure makes use of a technique we call Counterexample-Guided $k$-fold Retraining (CEGkR), which uses counterexamples generated during testing of the neural controller as sources of new initial configurations for the CAMPC to generate additional training data.

### IV-A Training a Distributed V-Formation Controller

We use Deep Learning to synthesize a distributed and symmetric V-formation controller from the CAMPC controller (see Section II), which generates the requisite training data in the form of trajectories leading to V-formation. A trajectory is a sequence of state-action pairs, where a state contains the information known to an agent at a time step (e.g., the positions and velocities of all agents in its neighborhood, including itself), and the action (the label) is the acceleration assigned to that agent at that time-step by CAMPC. We employ Supervised Learning to train our neural controller with the trajectories obtained from CAMPC.

The input features to the neural network are the 2-dimensional positions and velocities of all 7 agents in the agent’s neighborhood and the value of the agent’s local cost function. Thus, the NN has 29 input features, and the input has the form , where , and , are the position and velocity coordinates, respectively, of the learning agent (i.e., the agent whose controller is being learned), , and , are the positions and velocities of the neighboring agents where , and is the local cost function of the learning agent.

We use CAMPC to generate trajectories, each with a duration of 50 time-steps. Let be a trajectory generated using CAMPC, where each is a -dimensional state and is the -dimensional action computed for that state by CAMPC. For training the DNC controller, we obtain data points from each such trajectory, namely , where denotes agent ’s *view* of state , and is ’s 2-dimensional acceleration.

The view is obtained by (a) replacing absolute positions with positions relative to the position of agent (i.e., is replaced by and is replaced by , for every agent ); and (b) permuting the indices of the agents so that the entries are in order of increasing distance from . Hence, agent is at index , the nearest neighbor of in state is at index , etc. Note that the relative position of agent is always , so the first two entries of are zero. Also, note that during all of the training (but not during the testing) performed in our experiments, the neighborhood size equals the total number of agents. Thus, the local cost functions are equivalent to the global cost function .
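The construction of an agent's view can be sketched as follows. The feature layout (four entries per agent, in the order relative position then velocity, followed by the local cost as the last entry) is an assumption consistent with the 29 input features described above; the function name `build_view` and its parameters are hypothetical.

```python
import math
from typing import List, Tuple

Vec = Tuple[float, float]


def build_view(agent: int, positions: List[Vec], velocities: List[Vec],
               local_cost: float, neighborhood_size: int = 7) -> List[float]:
    """Build the input vector for `agent` from its local neighborhood.

    (a) positions are made relative to the agent's own position;
    (b) agents are ordered by increasing distance, the agent itself first.
    """
    ax, ay = positions[agent]

    def sort_key(j: int) -> Tuple[bool, float]:
        # The leading boolean forces the agent itself to index 0, even if
        # another agent happens to be at distance zero.
        return (j != agent,
                math.hypot(positions[j][0] - ax, positions[j][1] - ay))

    order = sorted(range(len(positions)), key=sort_key)[:neighborhood_size]
    features: List[float] = []
    for j in order:
        features += [positions[j][0] - ax, positions[j][1] - ay,
                     velocities[j][0], velocities[j][1]]
    features.append(local_cost)  # assumed to be the last input feature
    return features


# Seven agents on a line; build the view of the middle agent.
example_positions = [(float(j), 0.0) for j in range(7)]
example_velocities = [(1.0, 0.1 * j) for j in range(7)]
view = build_view(3, example_positions, example_velocities, local_cost=0.25)
```

With a neighborhood of 7 agents, the view has 7 × 4 + 1 = 29 entries, and its first two entries are always zero because the agent's position relative to itself is the origin.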

We learn a single neural V-formation controller from the state-action pairs of all agents. This yields a symmetric distributed controller, which we use for each agent during evaluation. Note that the neural controller produces accelerations for only one agent, so it needs to be run separately for each agent.

Our neural controller is a fully connected feed-forward deep neural network (DNN), with 5 hidden layers, 84 neurons per hidden layer, and with a sigmoid activation function. To perform optimizations involving the MPC cost function, the Adam optimizer [adamopt] was used with the following settings: , , ,. The number of trainable DNN parameters is 31,335, the batch size (number of samples processed before the model is updated) is 500, and the number of epochs (number of complete passes through the training dataset) used for training is 1000. The mean-squared error metric is used to measure training loss. To train the neural networks, we use Keras [chollet2015keras], which is a high-level neural network API written in Python and capable of running on top of TensorFlow. We used an iterative approach (based on the success rate of the neural controllers) for choosing the appropriate DNN hyperparameters and architecture.

### IV-B Counterexample-Guided $k$-fold Retraining

We introduce a new counterexample-based retraining technique we call CEGkR to further improve the performance of the distributed neural controller (DNC) we obtain using the learning approach described in Section IV-A.
In the context of our V-formation investigation, CEGkR works as follows. A retraining procedure first tests the neural controller by running it for time-steps, starting from randomly generated initial states. We use a V-formation convergence threshold of as the success criterion; i.e., the DNC successfully achieved V-formation if at the end of the trajectory. We refer to the failed trajectories as *counterexamples*. Let be a counterexample. Note that the neural controller fails to reach a V-formation (by the end of ) not only from the initial state , but also from the subsequent states of .

We do not know, however, whether states near the end of a failed trajectory are problematic for the neural controller, because there are not enough remaining time-steps in the trajectory to properly evaluate the controller’s performance starting from those states. Therefore, we pick a cutoff and use the first states in each counterexample as initial states to generate new training trajectories. We do this by running the CAMPC controller for 50 time-steps starting from each of these states. Note that each counterexample leads to a total of new training data points for improving the neural controller.

After updating the controller using this new training data and using the same learning algorithm as in initial training, we test the updated controller by running it from a new batch of randomly generated initial states. If the success rate (i.e., number of trajectories ending in a V-formation) for this batch is higher than in the previous round of testing, we perform another round of retraining. Otherwise, the CEGkR retraining procedure is terminated.

As noted, CEGkR uses the first states from each counterexample trajectory to generate new training data. Regarding the choice of , we first observe that should not be too large, partly for the reason mentioned above, and partly because states near the end of the trajectory may be uninteresting, in the sense that they are encountered during testing only as the result of an accumulation of poor decisions made by the current controller earlier in the execution. Our final improved controller will never encounter those states, so there is no benefit of using them for training. For example, states where the flock has split into disconnected subflocks are uninteresting in this sense.

Secondly, should not be too small. It is possible that the neural controller makes good decisions in the first several states, but later on, say in , it chooses a poor action that leads to failure. Using data for these later states during retraining provides the most benefit. Note that these later states might not occur in a trajectory computed by CAMPC starting from an earlier state such as or . In our experiments, we found that increasing (from 10 to 30 to 35) led to a concomitant increase in the success rate. See Section V-A.
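The retraining loop described in this section can be summarized by the sketch below. The callables `train`, `evaluate`, `expert_rollout` and `sample_initial_states` are hypothetical placeholders for DNN training, testing of the neural controller, a CAMPC run from a given state, and the generation of a fresh test batch; they are exercised here with toy stand-ins only. Returning the last controller that improved the success rate is also an assumption, since the text only specifies when the loop terminates.

```python
from typing import Any, Callable, List, Sequence, Tuple


def cegkr(initial_data: Sequence[Any],
          train: Callable[[List[Any]], Any],
          evaluate: Callable[[Any, List[Any]], Tuple[float, List[List[Any]]]],
          expert_rollout: Callable[[Any], List[Any]],
          sample_initial_states: Callable[[], List[Any]],
          k: int) -> Tuple[Any, float]:
    """Retrain on the first k states of each failed trajectory until the
    success rate stops improving."""
    data = list(initial_data)
    controller = train(data)
    rate, failures = evaluate(controller, sample_initial_states())
    while failures:
        # Each counterexample contributes its first k states as new
        # initial states for the expert (centralized) controller.
        for trajectory in failures:
            for state in trajectory[:k]:
                data.extend(expert_rollout(state))
        candidate = train(data)
        new_rate, new_failures = evaluate(candidate, sample_initial_states())
        if new_rate <= rate:
            break  # no improvement over the previous round: terminate
        controller, rate, failures = candidate, new_rate, new_failures
    return controller, rate


# Toy stand-ins: the "controller" is just the amount of training data, and
# its success rate grows with that amount until it plateaus at 0.9.
def toy_train(data: List[Any]) -> int:
    return len(data)


def toy_evaluate(controller: int, states: List[int]) -> Tuple[float, List[List[int]]]:
    rate = min(0.9, controller / 100.0)
    n_fail = round((1.0 - rate) * len(states))
    return rate, [[s + t for t in range(5)] for s in states[:n_fail]]


def toy_expert(state: int) -> List[Tuple[int, float]]:
    return [(state, 0.0)] * 3


final_controller, final_rate = cegkr(
    initial_data=[(0, 0.0)] * 20, train=toy_train, evaluate=toy_evaluate,
    expert_rollout=toy_expert, sample_initial_states=lambda: list(range(10)),
    k=2)
```

In the toy run, the success rate rises over three retraining rounds and the loop stops at the first round that brings no further improvement, mirroring the plateau behavior described in the text.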

## V Experimental Results

This section contains the results of our performance analysis of the distributed neural V-formation controller (DNC). It specifically reports on the performance improvement due to CEGkR, compares the performance of DNC with DAMPC, and uses Statistical Model Checking to obtain confidence intervals for DNC’s correctness/performance.

### V-A CEGkR Performance Evaluation

Table I demonstrates the DNC’s performance improvement due to CEGkR. For the initial configurations used to generate the initial training samples, agent positions and velocities are uniformly sampled from and , respectively. The total number of initial training samples for each experiment is , where is the number of unique initial configurations, and is the number of agents. An “experiment” is an instance of using the CEGkR methodology to generate a DNC. A single training sample (trajectory) is comprised of discrete time-steps. For all experiments, , and for Experiment 1, , and for Experiments 2, 3 and 4, . As we show below, increasing , which thereby increases the total number of training samples, increases the success rate; i.e., the rate of reaching V-formation.

For Experiment 1, Run 1, we initially train our DNC with 161,000 training samples, and perform no retraining. This version of DNC achieves a success rate of 85.07% on test cases, which are generated from the same distribution used for initial training. For Run 2, we take the first 10 states of all Run 1 failed test cases and use them as initial states for CAMPC to use to generate new trajectory data, i.e., the guided training samples. The total number of guided retraining samples is , where is the number of failed test cases. For example, in Run 1, and , so the number of guided retraining samples is . Retraining with these samples leads to a 4% increase in the success rate. As described in Section IV-B, we repeat this procedure until there is no improvement in the DNC success rate.
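The guided-retraining sample counts reported in Table I can be reproduced with a short calculation. The sketch below assumes 7 agents (the number used for training), 10,000 test cases per run, and a count equal to the number of failed test cases times the cutoff times the number of agents. The test-batch size and this product form are inferred from the reported numbers rather than stated explicitly here, and the function name is hypothetical.

```python
N_AGENTS = 7           # number of agents used for training
N_TEST_CASES = 10_000  # assumption: inferred from the success rates and counts


def guided_samples(prev_success_rate_percent: float, k: int,
                   n_agents: int = N_AGENTS,
                   n_tests: int = N_TEST_CASES) -> int:
    """Guided retraining samples produced by the failures of the previous run."""
    failed = round(n_tests * (100.0 - prev_success_rate_percent) / 100.0)
    return failed * k * n_agents


# Experiment 1 (first 10 states of each failed trajectory).
exp1_run2 = guided_samples(85.07, k=10)  # failures of Run 1
exp1_run3 = guided_samples(89.03, k=10)  # failures of Run 2
# Experiments 3 and 4 start from the same Run 1 (90.08% success) but use
# cutoffs of 30 and 35 states, respectively.
exp3_run2 = guided_samples(90.08, k=30)
exp4_run2 = guided_samples(90.08, k=35)
```

Under these assumptions, a success rate of 85.07% corresponds to 1,493 failed test cases, and 1,493 × 10 × 7 = 104,510 guided samples, matching the Run 2 entry of Experiment 1.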

| Retraining Run Id | # Guided Retraining Samples | Success Rate (%) | Median Final Cost |
|---|---|---|---|
| **Experiment 1:** | 161,000 Initial Training Samples | | |
| Run 1 | 0 | 85.07 | 0.0002747 |
| Run 2 | 104,510 | 89.03 | 0.0000499 |
| Run 3 | 76,790 | 91.12 | 0.0000299 |
| Run 4 | 62,160 | 91.12 | 0.0000299 |
| **Experiment 2:** | 350,000 Initial Training Samples | | |
| Run 1 | 0 | 90.08 | 0.0000351 |
| Run 2 | 69,440 | 91.10 | 0.0000300 |
| Run 3 | 62,300 | 92.22 | 0.0000231 |
| Run 4 | 54,600 | 92.91 | 0.0000222 |
| Run 5 | 49,630 | 93.04 | 0.0000222 |
| Run 6 | 48,720 | 93.04 | 0.0000222 |
| **Experiment 3:** | 350,000 Initial Training Samples | | |
| Run 1 | 0 | 90.08 | 0.0000351 |
| Run 2 | 208,320 | 92.16 | 0.0000249 |
| Run 3 | 164,640 | 93.33 | 0.0000221 |
| Run 4 | 140,070 | 94.01 | 0.0000200 |
| Run 5 | 125,790 | 94.95 | 0.0000155 |
| Run 6 | 106,050 | 94.95 | 0.0000155 |
| **Experiment 4:** | 350,000 Initial Training Samples | | |
| Run 1 | 0 | 90.08 | 0.0000351 |
| Run 2 | 243,040 | 93.02 | 0.0000235 |
| Run 3 | 171,010 | 93.88 | 0.0000213 |
| Run 4 | 149,940 | 94.51 | 0.0000202 |
| Run 5 | 134,505 | 95.16 | 0.0000153 |
| Run 6 | 118,335 | 95.16 | 0.0000151 |

Experiment 2 is similar to Experiment 1. The only difference is that Experiment 2 has approximately twice the number of initial training samples as compared to Experiment 1, which gives it an improved initial success rate. Experiments 3 and 4 use the same set of initial training samples as Experiment 2; the difference is that they use and , respectively, instead of .

Table I demonstrates the benefits of CEGkR, which include the following. (1) CEGkR always improves the performance of the learned controller. Specifically, Run in every experiment shows significant improvement over Run . (2) As expected, CEGkR does not improve the success rate forever; rather, the success rate eventually plateaus. (3) Using higher values for $k$ in the CEGkR retraining loop improves the quality of the learned controller: the success rate in Experiment 4 is better than that in Experiment 3, which is better than that in Experiment 2. Note that increasing $k$ also increases the size of the training data and therefore the cost of retraining.
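The retrain-until-plateau behavior described above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the procedure of Section IV-B: the callables `evaluate`, `guided_samples` and `retrain` are hypothetical placeholders for testing the controller, running the centralized controller from the first `k` states of each failed test case, and retraining the network. The toy stand-ins simply replay the Experiment 1 success rates from Table I.

```python
def cegkr(controller, evaluate, guided_samples, retrain, k, max_runs=20):
    """Retrain on counterexample-guided samples until the success rate stops improving.

    Returns the best controller found and the success rate of every run.
    """
    rate, failures = evaluate(controller)
    history = [rate]
    for _ in range(max_runs - 1):
        samples = guided_samples(failures, k)        # new trajectories from failed cases
        candidate = retrain(controller, samples)
        new_rate, new_failures = evaluate(candidate)
        history.append(new_rate)
        if new_rate <= rate:                         # plateau: keep the previous controller
            break
        controller, rate, failures = candidate, new_rate, new_failures
    return controller, history


# Toy stand-ins (hypothetical): a "controller" is just an index into RATES.
RATES = [85.07, 89.03, 91.12, 91.12]

def toy_evaluate(c):
    return RATES[c], [f"failed-case-of-controller-{c}"]

def toy_guided_samples(failures, k):
    return [(case, step) for case in failures for step in range(k)]

def toy_retrain(c, samples):
    return min(c + 1, len(RATES) - 1)

best, history = cegkr(0, toy_evaluate, toy_guided_samples, toy_retrain, k=10)
print(history)  # [85.07, 89.03, 91.12, 91.12]
```

The loop records the final, non-improving run in `history`, which mirrors how Table I lists a last run whose success rate equals that of the run before it.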

Table II presents a comparative evaluation of the performance of neural controllers obtained with and without CEGkR. The same number of training samples is used to train both controllers. We use the DNC obtained from Experiment 4 in Table I as our neural controller with CEGkR. We trained the non-CEGkR controller using 1,116,830 training samples, which is equal to the total number of training samples (initial training samples + guided retraining samples) used for training the DNC with CEGkR. The results show that using CEGkR offers a clear advantage: the CEGkR controller has a consistently higher success rate as the number of agents grows from 7 to 16.

Number of Agents | CEGkR Success Rate (%) | Non-CEGkR Success Rate (%) |
---|---|---|
7 | 95.16 | 90.51 |
8 | 94.57 | 89.03 |
9 | 93.78 | 87.66 |
10 | 93.05 | 85.25 |
11 | 93.05 | 84.10 |
12 | 92.67 | 81.98 |
13 | 91.38 | 80.24 |
14 | 87.35 | 78.47 |
15 | 84.25 | 74.72 |
16 | 73.40 | 65.39 |

### V-B Comparing the Performance of DNC vs DAMPC

The experiments in this section compare the performance of DNC with the distributed adaptive-horizon MPC controller (DAMPC). We focus on DAMPC and DNC because unlike CAMPC, they both rely on sensing only and not communication. The DNC we use from here onwards is the one obtained from Experiment 4 in Table I. For determining DNC’s success rate, we modify the convergence threshold and number of time steps that the controller runs to be proportional to the number of agents . Specifically, we use a convergence threshold of and a number of time-steps of , where and . We observed experimentally (visually) that this proportional increase in the threshold is justified. The rationale for increasing the number of time-steps is that with an increasing , the DNC will take longer to converge.

The DAMPC controller is presented in Section II; recall that it is a variant of the one presented in [Lukina2019]. The adaptive-horizon feature is used with the prediction horizon restricted to the interval .

Number of Agents | DAMPC Success Rate (%) | DAMPC Avg. Conv. Time | DNC Success Rate (%) | DNC Avg. Conv. Time |
---|---|---|---|---|
7 | 89.84 | 20.11 | 95.16 | 19.69 |
8 | 85.16 | 21.73 | 94.57 | 20.05 |
9 | 79.04 | 24.27 | 93.78 | 20.58 |
10 | 75.37 | 24.52 | 93.05 | 22.16 |
11 | 70.91 | 26.03 | 92.67 | 23.89 |
12 | 66.82 | 27.86 | 91.38 | 25.23 |
13 | 61.58 | 32.23 | 89.97 | 27.77 |
14 | 52.49 | 34.87 | 87.35 | 29.24 |
15 | 41.75 | 39.71 | 84.25 | 34.31 |
16 | 34.03 | 39.84 | 73.40 | 39.05 |

Table III demonstrates the generalization capabilities of DNC (from 7 to 16 agents), and compares its performance with that of DAMPC. While increasing the number of agents from 7 to 16, the neighborhood size is fixed at 7. The main observations from Table III are the following. (1) DNC consistently outperforms DAMPC, thus demonstrating that our approach for learning distributed controllers from training data generated by a centralized controller produces a very effective distributed controller, one that outperforms a distributed controller designed following the well-established MPC-based approach.
(2) DNC’s average convergence time is considerably smaller than that for DAMPC. Note that the *convergence time* is the time when the global cost function first drops below the success threshold . Since the calculation of average convergence time only uses successful runs (ignoring the failed runs), it follows that not only does DNC achieve success more often, it does so in fewer steps. This means it is better than DAMPC at avoiding wrong decisions that lead to local minima.
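The two metrics reported in Table III can be computed from per-run cost trajectories as sketched below. This is an illustration under assumptions: a run is represented as a list of global cost values indexed from time-step 0, and a run counts as successful if its cost drops below the threshold at some time-step. The threshold value and the cost trajectories in the example are made up.

```python
from typing import List, Optional, Sequence, Tuple


def convergence_time(costs: Sequence[float], threshold: float) -> Optional[int]:
    """First time-step at which the cost drops below the threshold, or None."""
    for step, cost in enumerate(costs):
        if cost < threshold:
            return step
    return None


def summarize(runs: List[Sequence[float]], threshold: float) -> Tuple[float, Optional[float]]:
    """Success rate in percent and average convergence time over successful runs only."""
    times = [convergence_time(costs, threshold) for costs in runs]
    successful = [t for t in times if t is not None]
    success_rate = 100.0 * len(successful) / len(runs)
    avg_time = sum(successful) / len(successful) if successful else None
    return success_rate, avg_time


runs = [
    [4.0, 2.0, 0.5, 0.1],  # converges at step 2
    [3.0, 0.8, 0.2, 0.1],  # converges at step 1
    [5.0, 4.0, 3.0, 2.5],  # never converges: excluded from the average
    [2.0, 1.5, 1.2, 0.9],  # converges at step 3
]
print(summarize(runs, threshold=1.0))  # (75.0, 2.0)
```

Because failed runs are dropped from the average, a controller cannot lower its average convergence time merely by failing on the slow cases; its success rate has to be read alongside it.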

An important advantage of the neural controller over CAMPC and DAMPC is that the former is much faster at generating the action at every time-step for each agent. Executing a DNC requires a modest number of arithmetic operations, whereas executing an MPC controller requires simulation of a model and a controller over the prediction horizon.

In our experiments, on average, CAMPC and DAMPC take 1,730 msec and 524 msec of CPU time, respectively, whereas the DNC takes only 1.5 msec. These results are averages over runs with 7 agents. Although multiple instances of DNC are needed (one per agent), they all run in parallel, so it is reasonable to compare the CPU time of CAMPC with that of one instance of DNC. Even the total CPU time for all instances of DNC is much less than that of CAMPC.
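The comparison can be made concrete with the reported averages. The snippet below only restates the arithmetic; it assumes that the total DNC cost for 7 agents is seven times the per-instance figure.

```python
CAMPC_MS = 1730.0  # reported average CPU time of CAMPC
DAMPC_MS = 524.0   # reported average CPU time of DAMPC
DNC_MS = 1.5       # reported average CPU time of one DNC instance
N_AGENTS = 7

total_dnc_ms = N_AGENTS * DNC_MS         # all DNC instances, if run sequentially
print(total_dnc_ms)                      # 10.5
print(round(CAMPC_MS / DNC_MS))          # 1153
print(round(CAMPC_MS / total_dnc_ms))    # 165
print(round(DAMPC_MS / DNC_MS))          # 349
```

Even when all seven DNC instances are charged to a single CPU, the total of 10.5 msec is roughly 165 times smaller than the CAMPC figure.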

Config. Space | # Agents | Success Rate (%) | Avg. Convergence Time |
---|---|---|---|
Pos: | 7 | 94.28 | 19.88 |
Vel: | 15 | 82.19 | 39.12 |
Pos: | 7 | 91.84 | 20.54 |
Vel: | 15 | 78.33 | 40.01 |
Pos: | 7 | 87.63 | 20.93 |
Vel: | 15 | 75.46 | 39.43 |

### V-C Evaluating Robustness of Distributed Neural Controller

We also demonstrate that our DNC is robust to variations in the initial conditions: it performs well even from initial states well beyond the range of initial states on which it was trained. Recall that during training, the positions and velocities are uniformly sampled from and , respectively. We test the controller on initial states selected from three other configuration spaces (i.e., ranges of initial states), which are defined in Table IV. The initial positions and velocities are uniformly sampled from these ranges. The table also shows the number of agents, the percentage of successful executions, and the average convergence time. The results in each row are averages over runs.

When we move from the configuration space used during initial training to the third configuration space, the size of the set of possible initial positions expands by a factor of , and the size of the set of possible initial velocities expands by a factor of ; hence there is an overall expansion factor of in the initial state space. This means that the probability that an initial state picked randomly from the third configuration space also lies inside the initial training configuration space is approximately . Thus, among the runs, there are only around runs on which we definitely expect a high rate of success. The actual success rate is much better than this argument suggests. The success rate decreases from 95.16% (in Table III) to 87.63% for 7 agents, and from 84.25% to 75.46% for 15 agents. This is roughly a % decrease, much less than the % drop that would occur if the NN controller did not generalize its training in order to perform well from initial states beyond those used during training.

Figure 3 shows the progression of seven agents starting from initial positions randomly selected from the range until they successfully converge to a V-formation. At $t = 25$, we can observe that the agents have reached a V-formation, and thus the convergence time is 25.

### V-D Statistical Model Checking Results

We use Monte Carlo (MC) approximation as a form of Statistical Model Checking [Larsen14, Grosu14] to compute confidence intervals for the DNC’s success rate for convergence to V-formation and for the (normalized) convergence time. The main idea of MC is to use $N$ random variables, $Z_1, \ldots, Z_N$, also called samples, independent and identically distributed (IID) according to a random variable $Z$ with mean $\mu_Z$, and to take the sum $\tilde{\mu}_Z = (Z_1 + \ldots + Z_N)/N$ as the value approximating the mean $\mu_Z$. Since an exact computation of $\mu_Z$ is almost always intractable, an MC approach is used to compute an $(\epsilon, \delta)$-approximation of this quantity.

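*Additive Approximation* [Thomas2004] is an $(\epsilon, \delta)$-approximation scheme where the mean $\mu_Z$ of an RV $Z$ is approximated with absolute error $\epsilon$ and probability $1 - \delta$:

$$\Pr[\mu_Z - \epsilon \le \tilde{\mu}_Z \le \mu_Z + \epsilon] \ge 1 - \delta \qquad (6)$$

where $\tilde{\mu}_Z$ is an approximation of $\mu_Z$. An important issue is to determine the number of samples $N$ needed to ensure that $\tilde{\mu}_Z$ is an $(\epsilon, \delta)$-approximation of $\mu_Z$. If $Z$ is a Bernoulli variable expected to be large, one can use the Chernoff-Hoeffding instantiation of the Bernstein inequality and take $N$ to be $N = 4\ln(2/\delta)/\epsilon^2$, as in [Thomas2004]. This results in the following *additive approximation algorithm* [Grosu14]:

$N = 4\ln(2/\delta)/\epsilon^2$; $S = 0$;

for ($i = 1$; $i \le N$; $i$++) do $S = S + Z_i$;

$\tilde{\mu}_Z = S/N$; return $\tilde{\mu}_Z$;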

We use this algorithm to obtain a joint $(\epsilon, \delta)$-approximation of the mean success rate and mean normalized convergence time for the DNC trained using CEGkR. Each sample is based on the result of an execution obtained by simulating the system starting from a random initial state, and we take , where is a Boolean variable indicating whether the agents converged to a V-formation during the execution, and is a real value denoting the normalized convergence time in the execution. The normalized convergence time is the time when the global cost function first drops below the success threshold and remains below it for the rest of the execution, measured as a fraction of the total duration of the simulation. The assumptions about required for validity of the additive approximation hold, because RV is a Bernoulli variable, the success rate is expected to be large (i.e., closer to 1 than to 0), and the proportionality constraint of the Bernstein inequality is also satisfied for RV .
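The definition of normalized convergence time given above can be sketched as follows. The representation is an assumption made for illustration: an execution is a list of global cost values, one per time-step starting at 0, and the total duration is the length of that list. How non-converging executions are scored is not stated in the text; this sketch returns `None` for them.

```python
from typing import Optional, Sequence


def normalized_convergence_time(costs: Sequence[float], threshold: float) -> Optional[float]:
    """Fraction of the run elapsed when the cost drops below the threshold for good.

    Returns None if the cost is not below the threshold at the end of the run.
    """
    last_violation = -1
    for step, cost in enumerate(costs):
        if cost >= threshold:
            last_violation = step          # cost is (still or again) too high here
    if last_violation == len(costs) - 1:   # also covers an empty run
        return None
    return (last_violation + 1) / len(costs)


# The cost dips below 1.0 at step 2 but rises again at step 3,
# so convergence is only counted from step 4 of 6.
print(normalized_convergence_time([5.0, 3.0, 0.5, 2.0, 0.4, 0.3], threshold=1.0))
```

Requiring the cost to stay below the threshold is stricter than taking the first crossing: a temporary dip does not count as convergence.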

In these experiments, the initial states are sampled from the same uniform random distributions as in Section V-A, and we set and , to obtain 396,140. We perform the required set of simulations for different numbers of agents, ranging from 7 to 16.
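A minimal Python sketch of the additive approximation scheme follows. It assumes the sample size is $N = \lceil 4\ln(2/\delta)/\epsilon^2 \rceil$, the usual Chernoff-Hoeffding bound for this scheme. The values $\epsilon = 0.01$ and $\delta = 10^{-4}$ used in the first call are an assumption rather than values taken from the text above, although under this formula they reproduce the reported 396,140 samples. The Bernoulli sampler is a stand-in for one simulated execution of the controller.

```python
import math
import random
from typing import Callable


def sample_size(eps: float, delta: float) -> int:
    """Number of samples assumed sufficient for an (eps, delta)-approximation."""
    return math.ceil(4.0 * math.log(2.0 / delta) / eps ** 2)


def additive_approximation(draw: Callable[[], float], eps: float, delta: float) -> float:
    """Average of N IID samples, with N chosen by sample_size."""
    n = sample_size(eps, delta)
    total = 0.0
    for _ in range(n):
        total += draw()
    return total / n


print(sample_size(0.01, 1e-4))  # 396140

# Stand-in for "did this simulated run reach V-formation?" with true mean 0.9.
rng = random.Random(0)
estimate = additive_approximation(
    lambda: 1.0 if rng.random() < 0.9 else 0.0, eps=0.05, delta=0.01
)
print(abs(estimate - 0.9) <= 0.05)
```

The sample size depends only on $\epsilon$ and $\delta$, not on the number of agents, which is why the same number of simulations can be reused for every row of Table V.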

Table V presents the results, specifically, the $(\epsilon, \delta)$-approximations of the mean success rate and mean normalized convergence time. While the results for the success rate are (as expected) numerically similar to the results in Table II, the results in Table V are much stronger, because they come with the guarantee that they are $(\epsilon, \delta)$-approximations of the actual mean values.

# Agents | Approx. Mean Success Rate | Approx. Mean Normalized Convergence Time |
---|---|---|
7 | 0.9511 | 0.3942 |
8 | 0.9453 | 0.4024 |
9 | 0.9382 | 0.4128 |
10 | 0.9305 | 0.4426 |
11 | 0.9262 | 0.4770 |
12 | 0.9141 | 0.5058 |
13 | 0.8994 | 0.5560 |
14 | 0.8727 | 0.5852 |
15 | 0.8419 | 0.6874 |
16 | 0.7338 | 0.7822 |

## VI Related Work

Distributed control/coordination has been used extensively in multi-agent systems. Distributed controllers are typically designed by hand for specific objectives, and they are often very clever about what information is exchanged between agents and how that information is used to update local state [fb-jc-sm:08c, jwd-af-fb:10v]. Informally, coordination is required when the cost function is non-separable. A cost function is *separable* if it does not contain terms that couple the states of two different neighbors [dunbar2006, dunbar2004, Balas2006]. Here we take the novel view of *learning* distributed controllers from data generated using a centralized controller, while avoiding coordination. We apply it to a problem whose cost is clearly not separable and involves tight coupling of state vectors of all pairs of agents.

Previous work on V-formation, including approaches based on centralized and distributed model-predictive control, is presented in [YANG2016, ARES, Lukina2019]. Other related work, including [D'Andrea2003, Fowler2002, Ye2017], focuses on distributed controllers for flight formation (of moving-wing aircraft) that operate in an environment where the multi-agent system is already in the desired formation and the distributed controller’s objective is to maintain formation in the presence of disturbances. A distinguishing feature of these approaches is the particular formation they are seeking to maintain, including half-V [Fowler2002], ring and torus [D'Andrea2003], and a leader-follower formation [Ye2017]. In [Moreno2017], MPC-inspired approaches to system self-adaptation are considered, including the Proactive Latency-aware Approach (PLA) [Moreno2015]. The PLA problem is designed as a Markov decision process, where a sequence of actions is computed from the current state for the length of the prediction horizon.

In terms of related work on counter-example-guided retraining, Dreossi et al. [Dreossi2018] propose an approach called *counter-example guided data augmentation* to improve the performance of machine learning models. They use synthetically generated data items that are misclassified by the ML model to augment the training data sets. In [Carr2019], the authors use counter-example guided retraining as part of their strategy for synthesizing partially observable Markov decision processes (POMDPs). Claviere et al. [Claviere2019] use counter-example guided training for trajectory-tracking control of robotic vehicles. The CEGkR retraining approach shares the same high-level philosophy underlying these approaches, but there are subtle differences in the way counter-examples are generated. In [Claviere2019], counter-examples are generated using falsification of desired temporal properties about a closed-loop system, whereas in our approach, safety constraints, if any, are included in the cost function and multiple retraining data points are generated from one counterexample. Further, our goal for retraining is to learn a distributed controller, rather than an NN representation of an existing controller.

In terms of deep-learning methodologies for synthesizing distributed controllers, deep reinforcement learning is used in [CondeLT17] for designing controllers for UAVs that reach time-varying formations. The authors also use a DNN to estimate how good a state is, so the agent can choose actions accordingly. Deep reinforcement learning is also used in [Yang2018] to generate a controller for UAVs in uncertain environments. As the multi-agent learning efficiency is constrained by the high-dimensional and continuous action spaces, a methodology is presented in [Yang2018] to slice the action spaces into a number of tractable fractions to achieve efficient convergence to optimal policies in continuous domains. Graph neural networks are deployed in [tolstaya2019] to learn a distributed controller for a drone swarm capable of achieving flocking formation. The learned controller, which is synthesized by imitating the policy of a centralized controller, exploits information from distant teammates using only local communication interchanges.

## VII Conclusion

We have presented a new learning-based approach for designing distributed controllers that uses centralized controllers to generate the training data, in a teacher-learner fashion. The data generated by a centralized controller undergoes a transformation to yield the requisite training data, a transformation defined by the information available to an agent in the distributed setting. During training, we use counterexample-guided $k$-fold retraining to generate additional data points to train the distributed controller. We demonstrated the power of this approach by developing a distributed neural controller for the V-formation problem, and used Statistical Model Checking to reason about the controller’s correctness.

The V-formation problem is particularly challenging. We showed that a symmetric deterministic distributed controller does not exist under certain reasonable assumptions. This motivates the use of a data-driven approach to automatically synthesize such controllers. The general idea of learning distributed controllers from training data generated by centralized controllers is promising. We believe that our approach will generalize to any distributed control synthesis problem whose objective is specified by a state-based cost function. Investigating its performance on other applications, and exploring enhancements to our learning-based approach to distributed controller design are directions for future work.
