Learning Safe Unlabeled Multi-Robot Planning with Motion Constraints

07/11/2019
by   Arbaaz Khan, et al.
University of Pennsylvania

In this paper, we present a learning approach to goal assignment and trajectory planning for unlabeled robots operating in 2D, obstacle-filled workspaces. More specifically, we tackle the unlabeled multi-robot motion planning problem with motion constraints as a multi-agent reinforcement learning problem with a sparse global reward. In contrast with previous works, which formulate an entirely new hand-crafted optimization cost or trajectory-generation algorithm for each robot dynamic model, our framework is a general approach applicable to arbitrary robot models. Further, by using the velocity obstacle, we devise a smooth projection that guarantees collision-free trajectories for all robots with respect to their neighbors and obstacles. The efficacy of our algorithm is demonstrated through varied simulations.

I Introduction

In many applications in robotics such as formation flying [1, 2] or perimeter defense and surveillance [3], there exist teams of interchangeable robots operating in complex environments. In these scenarios, the goal is to have a team of robots execute a set of identical tasks such that each robot executes only one task, but it does not matter which robot executes which task. One example of such a problem is the concurrent goal assignment and trajectory planning problem where robots must simultaneously assign goals and plan motion primitives to reach assigned goals.

Solutions to this unlabeled multi-robot planning problem must solve both the goal assignment and trajectory optimization problems. It has been shown that the flexibility to freely assign goals to robots allows for polynomial-time solutions under certain conditions [4, 5, 6]. Nonetheless, there are still significant drawbacks to existing approaches. Solutions with polynomial-time complexities depend on minimum separations between start positions, goal positions, or robots and obstacles [4, 5, 7]. Motion plans generated by graph-based approaches for this problem such as those proposed in [6, 8] are limited to simple real-world robots that have approximately first-order dynamics. Closest to our work, [9] proposes an algorithm to coordinate unlabeled robots with arbitrary dynamics in obstacle-filled environments. However, it assumes the existence of a single-robot trajectory optimizer that can produce a candidate trajectory for a robot to any given goal, which is in itself a difficult research problem. Furthermore, the algorithm is a priority-based method that depends on offsetting the times at which robots start traversing their trajectories to guarantee collision avoidance. In the worst case, it degenerates into a completely sequential algorithm where only one robot is moving at a time.

Fig. 1: Learning Unlabeled Motion Planning. Robots observe their own state and velocity, the relative positions of all goals, and other entities within a sensing region. For each robot, this information is compiled into an observation vector. Each robot uses its own policy network to compute an action. During training, the policy networks also exchange information with a centralized Q-network, which uses information from all robots to compute robot-specific Q-functions. These Q-functions are used by the individual robots to update their policies. To guarantee safety, a model-based policy runs in the background.

In light of these works, we propose a novel learning-based framework for goal assignment and trajectory optimization for a team of identical robots with arbitrary dynamics operating in obstacle-filled workspaces. Firstly, we observe that the unlabeled multi-robot planning problem can be recast as a multi-agent reinforcement learning (MARL) problem. Each robot is given its own state, the configuration of obstacles and of other robots within some sensing radius, and the configuration of all goals in the environment. The objective then is to learn policies that couple assignment and trajectory generation for each robot. It is important to note that in such an approach, since goals are not assigned beforehand, it is non-trivial to assign individual rewards to each robot. Instead, we simply assign a global reward to all robots that takes into account whether all robots have reached goals in a collision-free manner. This global reward forces each robot to learn policies such that all robots can reach a goal without colliding or having to communicate with each other. Thus, by casting the concurrent goal assignment problem as a MARL problem, we attempt to learn policies that maximize this global reward in minimal time. A centralized training, decentralized execution strategy is used to train policies for each robot, building on a body of work in MARL that employs this strategy [10, 11].

When utilizing a deep reinforcement learning (RL) model, one loses optimality guarantees as well as any guarantee of collision-free trajectories. To still obtain collision-free trajectories, we make use of an analytical model-based policy that runs in the background and checks whether the target velocities produced by each robot are "safe". For each robot, velocity obstacles are computed at every instant in time. The velocity obstacle divides the set of all possible velocities into safe and unsafe sets. If the velocity computed by the learned model lies in the unsafe set, it is projected back into the safe set. We also ensure that this projection preserves certain properties, such as smoothness of the transition function, in order to be compatible with the learning process. Thus, when using this model-based policy in conjunction with the learned policy, we are guaranteed almost safe trajectories at every instant, and we show empirically that the computed policies converge to the desired values.

Thus, the main contributions of our algorithm are: 1) it extends to arbitrary robot dynamics (we test with both holonomic and non-holonomic robot models); 2) there is no need to re-train if the obstacles are moved or if the robots' start and goal positions are changed, in contrast to model-based approaches, where one would need to recompute the assignment and regenerate the trajectory; 3) it retains the safety guarantees of model-based methods; and 4) we show experimentally that it achieves better performance, in terms of total time to reach goals, than model-based methods alone.

II Problem Formulation

Consider a two-dimensional Euclidean space with N homogeneous disk-shaped robots of radius R, indexed by n ∈ {1, …, N}, and M goal locations, indexed by m ∈ {1, …, M}. Goal location m is represented as a vector of its position (x_m, y_m) and heading (θ_m) in the plane. Thus, goal m is:

g_m = [x_m, y_m, θ_m]    (1)

The full goal state vector G is given as:

G = [g_1, g_2, …, g_M]    (2)

Let robot n be equipped with a sensor with range r_s. The data from the sensor at time t is denoted s_n^t. We assume each robot has full information about the location of the goals and full information about its own pose. The observation of the n-th robot at time t is then given as:

o_n^t = [p_n^t, s_n^t, G]    (3)

where p_n^t is a vector of the position (x_n^t, y_n^t) and heading (θ_n^t) of robot n in the plane and is given as:

p_n^t = [x_n^t, y_n^t, θ_n^t]    (4)

The linear velocity of robot n at time t is denoted v_n^t and the angular velocity is denoted ω_n^t. Let O be the set of states describing all possible configurations of all robots. For each robot, given an action a_n ∈ A_n, where A_n describes all possible actions for robot n, the state of the robot evolves according to some stationary dynamics distribution with conditional density p(o_n^{t+1} | o_n^t, a_n^t). These dynamics can be linear or non-linear. Further, let A be the set of all possible actions of all robots. We define some arbitrary scalar δ such that the necessary and sufficient condition to ensure collision avoidance is given as:

‖p_n^t − p_{n'}^t‖_2 ≥ δ   for all n ≠ n' and all t    (5)

where ‖·‖_2 is the Euclidean distance. Lastly, we define the assignment matrix φ(t) ∈ {0, 1}^{N×M} as:

φ_nm(t) = 1 if d_c(p_n^t, g_m) ≤ ε, and 0 otherwise    (6)

where d_c is the cosine distance and ε is some threshold. In this work, we consider the case N = M. The necessary and sufficient condition for all goals to be covered by robots at some time T is then:

φ(T)^⊤ φ(T) = I    (7)

where I is the identity matrix. Thus, the problem statement considered in this paper can be defined as:


Problem 1. Given an initial set of observations {o_1^0, …, o_N^0} and a set of goals G, compute a set of policies {π_1, …, π_N} such that applying the resulting actions produces a sequence of observations satisfying Eqn. 5 at every timestep and Eqn. 7 at the final time T. In the next section, we outline our methodology for converting this problem into a Markov game for MARL and for computing policies such that the constraints in Eqn. 5 and Eqn. 7 are satisfied.
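To make the coverage condition in Eqn. 7 concrete, the following sketch builds a binary assignment matrix from robot and goal poses and checks whether every goal is covered by exactly one robot. The function names and thresholds are our own illustrative choices; in particular, the paper's check uses a cosine-distance threshold (Eqn. 6), which we split here into separate position and heading tests for readability.

```python
import numpy as np

def assignment_matrix(robot_poses, goal_poses, pos_tol=0.02, ang_tol=0.1):
    """Binary matrix phi with phi[n, m] = 1 if robot n covers goal m.

    robot_poses, goal_poses: (N, 3) arrays of rows [x, y, theta].
    pos_tol, ang_tol: illustrative position / heading thresholds.
    """
    N = robot_poses.shape[0]
    phi = np.zeros((N, N), dtype=int)
    for n in range(N):
        for m in range(N):
            d_pos = np.linalg.norm(robot_poses[n, :2] - goal_poses[m, :2])
            # Cosine difference between headings: 1 - cos(theta_n - theta_m).
            d_ang = 1.0 - np.cos(robot_poses[n, 2] - goal_poses[m, 2])
            phi[n, m] = int(d_pos <= pos_tol and d_ang <= ang_tol)
    return phi

def all_goals_covered(phi):
    """Eqn. 7: every goal is claimed by exactly one robot, i.e. phi^T phi = I."""
    return np.array_equal(phi.T @ phi, np.eye(phi.shape[0], dtype=int))
```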

III Learning Unlabeled Multi-Robot Planning

III-A Markov Games for Multi-Robot Planning

One can reformulate the unlabeled multi-robot planning problem described in Problem 1 as a Markov game [12]. A Markov game for N robots is defined by a tuple (S, A, O, T, R, γ). S describes the full true state of the environment and corresponds to the full state space in our problem setup. Similarly, the action space A is the same as defined in Problem 1, and the transition function T is equivalent to the dynamics distribution described in Section II. In the Markov game, the environment is partially observed: at every timestep each robot can only observe some small part of the environment. This is also the case in our problem, so for every timestep we simply use the observations o_n^t defined in Eqn. 3. γ is a discount factor and can be set close to one. In the Markov game, each robot obtains a reward as a function of the robots' state and its action. We propose a reward structure that satisfies the constraints in Eqn. 5 and Eqn. 7:

r^t = α if Eqn. 7 is satisfied, −β if Eqn. 5 is violated, and 0 otherwise    (8)

where α and β are some positive constants. It is important to note that this reward structure is global and is given to all robots {1, …, N}. By using a global reward, we remove the need for any carefully designed heuristic function. In the Markov game, the solution for each robot n is a policy π_n that maximizes the discounted expected reward E[Σ_t γ^t r^t]. Once again, drawing parallels between the Markov game and Problem 1, we conclude that the solution of the Markov game for multi-robot planning is the solution of the unlabeled multi-robot planning problem considered in Problem 1.

III-B Learning Policies for Continuous Actions

Consider a single-robot setting. The MDP associated with a single robot is given as (O, A, T_n, R, γ), where T_n is the transition function associated with just the robot under consideration. The goal of any RL algorithm is to find a stochastic policy π_θ (where θ are the parameters of the policy) that maximizes the expected sum of rewards:

J(θ) = E_{π_θ} [ Σ_t γ^t r(o_t, a_t) ]    (9)

Policy gradient methods look to maximize this objective by estimating the gradient ∇_θ J(θ) and using it in a stochastic gradient ascent algorithm. A general form for the policy gradient can be given as:

∇_θ J(θ) = E [ ∇_θ log π_θ(a_t | o_t) Q^π(o_t, a_t) ]    (10)

where Q^π represents the action-value function (an estimate of how good it is to take an action in a given state):

Q^π(o_t, a_t) = E [ Σ_{t' ≥ t} γ^{t'−t} r(o_{t'}, a_{t'}) | o_t, a_t ]    (11)

The Bellman equation also gives us the recursive rule for updating the action-value function:

Q^π(o_t, a_t) = E_{o_{t+1}} [ r(o_t, a_t) + γ E_{a_{t+1} ~ π} [ Q^π(o_{t+1}, a_{t+1}) ] ]

where r(o_t, a_t) is the reward for executing action a_t in o_t. The gradient is estimated by differentiating the objective with respect to θ and approximating the expectation with sampled trajectories:

∇_θ J(θ) ≈ (1/T) Σ_t ∇_θ log π_θ(a_t | o_t) Q^π(o_t, a_t)    (12)

In order to extend this to continuous actions and continuous deterministic policies, [13] propose the Deep Deterministic Policy Gradient (DDPG) algorithm. The algorithm maintains an actor function (parameterized by θ) that estimates the deterministic continuous policy μ_θ, and a critic function (parameterized by w) that estimates the action-value function Q_w. The critic is updated using the Bellman loss as in Q-learning [14] (Eqn. III-B) and the actor is updated by computing the following policy gradient:

∇_θ J(θ) = E [ ∇_a Q_w(o_t, a) |_{a = μ_θ(o_t)} ∇_θ μ_θ(o_t) ]    (13)

The DDPG algorithm is off-policy and samples trajectories from a replay buffer of stored experiences. Similar to DQN [15], it also uses a target network to stabilize training.
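For concreteness, the following is a minimal single-robot DDPG update sketch in PyTorch, assuming flat observation and action tensors sampled from a replay buffer; the network sizes, names, and hyperparameters are illustrative and are not the architecture used in the paper.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy mu_theta(o) -> continuous, bounded action."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh())

    def forward(self, obs):
        return self.net(obs)

class Critic(nn.Module):
    """Action-value estimate Q_w(o, a)."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

def ddpg_update(actor, critic, target_actor, target_critic,
                batch, actor_opt, critic_opt, gamma=0.99):
    """One DDPG step: Bellman regression for the critic (Eqn. III-B),
    then the deterministic policy gradient of Eqn. 13 for the actor."""
    obs, act, rew, next_obs = batch          # rew is shaped (B, 1)
    with torch.no_grad():                    # target from slow-moving target nets
        target_q = rew + gamma * target_critic(next_obs, target_actor(next_obs))
    critic_loss = nn.functional.mse_loss(critic(obs, act), target_q)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # Ascend dQ/da * da/dtheta by minimising -Q(o, mu(o)).
    actor_loss = -critic(obs, actor(obs)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```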

A natural question to ask at this point is: why not treat every robot as an independently operating entity and run DDPG separately for each robot? In fact, this exact idea has been proposed as Independent Q-Learning [16]. However, there are two major drawbacks to this approach. When operating in high-dimensional continuous spaces with sparse rewards, the lack of information sharing between the robots makes it difficult to learn any coordination. Further, as each robot's policy changes during training, the environment becomes non-stationary from the perspective of any individual robot (in a way that cannot be explained by changes in the robot's own policy). This is the non-stationarity problem in multi-agent learning.

III-C Learning Continuous Policies for Multiple Robots

To overcome the aforementioned drawbacks of treating each robot as an independent entity, a small modification to the critic (action-value) function during training is needed: the critic for robot n uses extra information from all other robots, as proposed in [10, 11]. The modified action-value function for robot n can then be represented as Q_n(x, a_1, …, a_N), where x is additional information. The most naive choice is to simply set x = (o_1, …, o_N), the observations of all robots.

Let the policy for robot n, parameterized by θ_n, be μ_n. For brevity, let O = (o_1, …, o_N) and a = (a_1, …, a_N). Thus, in the multi-robot case, the gradient of the actor function for robot n is given as:

∇_{θ_n} J(θ_n) = E [ ∇_{a_n} Q_n(O, a_1, …, a_N) |_{a_n = μ_n(o_n)} ∇_{θ_n} μ_n(o_n) ]    (14)

where Q_n is the centralized critic function for robot n: it takes as input all robot observations and all robot actions and outputs a Q-value for robot n. The robot then takes the gradient of this Q-value with respect to its own action, and this gradient, together with the policy gradient of robot n's policy, is used to update its actor function. It is important to note that the extra information from the other robots' actions is only used during training to update the critic. This gives rise to centralized training with decentralized policies at execution time. Thus, we now have a policy gradient algorithm that attempts to learn a policy for each robot such that the objective of Problem 1, captured in Eqn. 9, is maximized.
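The sketch below illustrates the centralized-training idea behind Eqn. 14: each robot keeps its own actor, but its critic sees every robot's observation and action during training. Class and function names are ours; the actors can be any per-robot policy network, such as the Actor from the previous sketch.

```python
import torch
import torch.nn as nn

class CentralizedCritic(nn.Module):
    """Q_n(o_1..o_N, a_1..a_N): used only at training time; execution
    relies on the per-robot actors alone (decentralized policies)."""
    def __init__(self, obs_dim, act_dim, n_robots, hidden=128):
        super().__init__()
        in_dim = n_robots * (obs_dim + act_dim)
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, all_obs, all_act):
        # all_obs: (B, N, obs_dim), all_act: (B, N, act_dim)
        x = torch.cat([all_obs.flatten(1), all_act.flatten(1)], dim=-1)
        return self.net(x)

def actor_update(n, actors, critic_n, all_obs, all_act, actor_opt):
    """Eqn. 14 for robot n: differentiate Q_n with respect to robot n's
    own action only, holding the other robots' sampled actions fixed."""
    own_act = actors[n](all_obs[:, n]).unsqueeze(1)
    act = torch.cat([all_act[:, :n], own_act, all_act[:, n + 1:]], dim=1)
    loss = -critic_n(all_obs, act).mean()
    actor_opt.zero_grad(); loss.backward(); actor_opt.step()
```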

III-D Backup Policies for Safety

When using deep RL, one often loses any guarantee of safety. Thus, when attempting to maximize the reward in Eqn. 9, we have no guarantee that the actions generated by our actor network are collision-free (i.e., satisfy the constraint in Eqn. 5). In real-world robotics applications this can simply be infeasible. Instead, we propose the use of a simple analytical backup policy that ensures collision-free trajectories.

We use the Velocity Obstacle (VO) concept introduced in [17]. While there exist more sophisticated algorithms for collision avoidance, such as ORCA [18] and NH-ORCA [19], we opt for VO due to its simplicity. Consider a robot a operating in the plane with its reference point at p_a, and let another planar obstacle b (another robot or a static obstacle) be at p_b, moving at velocity v_b. The velocity obstacle of obstacle b to robot a is the set of all velocities v_a for robot a that will result in a collision with obstacle b at some moment in time. The velocity obstacle (VO) is defined as:

VO_b^a(v_b) = { v_a | λ(p_a, v_a − v_b) ∩ (b ⊕ −a) ≠ ∅ }

where b ⊕ −a gives the Minkowski sum of object b and object −a, −a denotes object a reflected in its reference point p_a, and λ(p_a, v_a − v_b) represents a ray starting at p_a and heading in the direction of the relative velocity of a and b, given by v_a − v_b. [17] show that the VO partitions the absolute velocities of robot a into avoiding and colliding velocities. This implies that if v_a ∈ VO_b^a(v_b), then robot a and obstacle b will collide at some point in time. If v_a is chosen such that it lies outside the VO of b, the two objects will never collide, and if v_a lies on the boundary, robot a will brush past obstacle b at some point in time. This concept is illustrated in Fig. 2.

Fig. 2: Velocity Obstacle. The velocity obstacle of obstacle b to robot a. When there exist multiple obstacles, the VO is defined as the union of all individual velocity obstacles.
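A minimal membership test for the velocity obstacle of a disk-shaped obstacle, following the cone construction described above; the function and variable names are our own sketch, not the paper's implementation.

```python
import numpy as np

def in_velocity_obstacle(p_a, v_a, r_a, p_b, v_b, r_b):
    """True if velocity v_a of robot a lies inside the velocity obstacle
    induced by disk obstacle b, i.e. the relative velocity points into the
    collision cone around the line of centres.  p_*, v_* are 2D arrays."""
    rel_p = p_b - p_a
    dist = np.linalg.norm(rel_p)
    if dist <= r_a + r_b:                 # already overlapping
        return True
    rel_v = v_a - v_b                     # VO is a cone on the relative velocity
    speed = np.linalg.norm(rel_v)
    if speed == 0.0:
        return False                      # not moving relative to b
    half_angle = np.arcsin((r_a + r_b) / dist)
    cos_angle = np.dot(rel_v, rel_p) / (speed * dist)
    angle_to_centre = np.arccos(np.clip(cos_angle, -1.0, 1.0))
    return angle_to_centre <= half_angle
```

With multiple obstacles, a commanded velocity is unsafe if this test returns True for any of them, matching the union in the caption above.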

Each robot's actor network outputs a linear force and a torque, i.e. a_n^t = [f_n^t, τ_n^t]. The dynamics of the robot then evolve according to:

v_n^{t+1} = v_n^t + (f_n^t / m) Δt + ε    (15)
p_n^{t+1} = p_n^t + v_n^{t+1} Δt    (16)

where ε is some normally distributed noise, ε ~ N(0, σ²), m is the mass of the robot, σ is some fixed constant, and Δt is the fixed time interval. Similarly, the rotational acceleration is derived from the torque and integrated twice to update the orientation. For simplicity, say that the observation is set to just the position and the velocity in the 2D plane, i.e. o_n^t = [p_n^t, v_n^t]. From Eqn. 15, we see that the stationary dynamics distribution with conditional density p(o_n^{t+1} | o_n^t, a_n^t) is Gaussian.
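A minimal sketch of the noisy double-integrator update described by Eqns. 15 and 16, with the rotational part handled analogously; the mass, inertia, time step, noise scale, and the choice of integrating the position with the updated velocity are our assumptions.

```python
import numpy as np

def step_dynamics(p, v, theta, omega, force, torque,
                  m=1.0, inertia=1.0, dt=0.1, sigma=0.01, rng=np.random):
    """One Euler step of the noisy robot dynamics (illustrative constants).

    p, v: 2D position and velocity; theta, omega: heading and angular rate;
    force, torque: the actor network's outputs for this timestep."""
    v_next = v + (force / m) * dt + rng.normal(0.0, sigma, size=2)
    p_next = p + v_next * dt
    omega_next = omega + (torque / inertia) * dt + rng.normal(0.0, sigma)
    theta_next = theta + omega_next * dt
    return p_next, v_next, theta_next, omega_next
```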

A fundamental assumption for the existence of the gradient in Eqn. 13 (and, by extension, Eqn. 14) is that the conditional probability distribution p(o_n^{t+1} | o_n^t, a_n^t) be continuous with respect to o_n^t and a_n^t. We observe that when the state evolves according to Eqn. 15, this distribution is simply a Gaussian and is therefore continuous. In order to incorporate the VO as a backup policy, we need to show that the new transition function is still continuous. From Eqn. 15 we have the velocity of the robot, and from Eqn. 16 its position. Since Δt is fixed, it suffices to establish the continuity of the velocity in order to conclude the continuity of the position. Consider an obstacle b inside the sensing range of robot n. To ensure safety, at every timestep we compute the VO and check whether the velocity v_n^{t+1} lies inside it. In case the velocity computed by the actor network falls inside the VO, we project it back into the safe set VOᶜ (the complement of the VO). The easiest projection is:

v_safe = argmin_{u ∈ VOᶜ} ‖u − v_n^{t+1}‖    (17)

Thus, the safe velocity for robot n is then given as:

v̄_n^{t+1} = 1[v_n^{t+1} ∈ VO] v_safe + 1[v_n^{t+1} ∉ VO] v_n^{t+1}    (18)

where 1[·] is the indicator function, equal to 1 if its argument holds and 0 otherwise. However, an issue with such a projection is that it yields a discontinuous probability distribution for the safe velocity, because there now exists a set of values that the velocity never takes. Thus, it violates the smooth-transition-function assumption necessary to compute deterministic policy gradients. Additionally, in most real-world systems it is infeasible to make large changes to the velocity instantaneously.
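A discretised sketch of the min projection in Eqns. 17 and 18: if the commanded velocity is unsafe, snap it to the nearest safe candidate. Representing the admissible velocities by a sampled grid and passing the VO test as a predicate are our simplifications.

```python
import numpy as np

def min_projection(v, is_unsafe, candidates):
    """Return v if it is outside the VO, otherwise the closest velocity in
    `candidates` that is safe.  `is_unsafe(u)` can be a closure around the
    in_velocity_obstacle test above; `candidates` is a sampled velocity grid."""
    if not is_unsafe(v):
        return v
    safe = [u for u in candidates if not is_unsafe(u)]
    if not safe:                     # no safe sample found: stop the robot
        return np.zeros_like(v)
    dists = [np.linalg.norm(u - v) for u in safe]
    return safe[int(np.argmin(dists))]
```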

To overcome this, we propose an alternate projection that ensures a smooth distribution of the safe velocity. We note that at any given time t, the RVO set, i.e. the set of infeasible velocities, is always a continuous set. By exploiting this property, we propose the following alternative projection:

(19)

where v_first, v_mid and v_last are the first, middle and last elements of the VO respectively, and c is a hyperparameter that depends on how quickly the robots can change their velocities. This is a shifted sigmoid projection and is visualized in Fig. 3.

Fig. 3: Min Projection vs. Sigmoid Projection of the velocity. When using the minimum projection given in Eqn. 17, the safe velocity has a discontinuity. However, if we assume that this event occurs with very low probability, we get a continuous function.

Using the sigmoid projection, the safe velocity for robot n is given as:

(20)

and from this we conclude that the resulting transition density is a continuous probability distribution, thus enabling us to take the gradients specified in Eqn. 14. We put all these parts together into our system for learning unlabeled multi-robot planning with motion constraints, present the full algorithm in Algorithm 1, and in the rest of the paper refer to it as MARL+RVO.
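The exact form of Eqn. 19 does not survive in the text above, so the following is only our reading of the idea: along a one-dimensional slice through the VO, a velocity that lands inside the unsafe interval is pushed toward its boundaries with a sigmoid centred at the interval midpoint, which keeps the overall mapping smooth. The interval endpoints and the gain c are placeholders.

```python
import numpy as np

def smooth_projection_1d(v, v_lo, v_hi, c=20.0):
    """Illustrative shifted-sigmoid projection for a 1-D unsafe interval
    [v_lo, v_hi] (a slice through the VO).  Safe velocities pass through;
    unsafe ones are blended toward the nearer boundary, with c controlling
    how sharply the velocity is allowed to change."""
    if v <= v_lo or v >= v_hi:
        return v                                 # already outside the VO
    v_mid = 0.5 * (v_lo + v_hi)
    w = 1.0 / (1.0 + np.exp(-c * (v - v_mid)))   # ~0 near v_lo, ~1 near v_hi
    return (1.0 - w) * v_lo + w * v_hi
```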

1: Initialize random policy (actor) networks and critic networks for all robots; initialize replay buffer D
2: for episode = 1 to E do
3:     construct every robot's initial observation (Eqn. 3)
4:     for t = 1 to max episode length do
5:         for each robot, compute an action from its actor network
6:         for each robot, project the commanded velocity to a safe one (Eqn. 20)
7:         for each robot, compute the next observation and the reward r^t
8:         store the transition in D
9:         for robot n = 1 to N do
10:             sample a minibatch of transitions from D
11:             compute the Bellman error using Eqn. III-B
12:             update the critic network using the Bellman error
13:             compute the policy gradient from Eqn. 14
14:             update the actor network using SGD
Algorithm 1: Learning Safe Unlabeled Multi-Robot Motion Planning (MARL+RVO)
Fig. 4: Training curves for holonomic robots (with 2, 3 and 4 obstacles (Obs)). We observe that the proposed MARL+RVO algorithm is able to converge and performs better than a centralized RL (C-PPO) policy. The global reward scale differs across plots since it is a function of the space the robots operate in. Each curve is produced from three independent runs of the algorithm; the darker line represents the mean and the shaded area the standard deviation of the mean.

IV Experimental Results

The efficacy of our algorithm is tested in simulated robotics experiments. We vary the number of robots, the number of obstacles in the environment, and the robot dynamics. In order to choose a meaningful reward function that ensures all goals are covered, we first compute, for each goal, the distance to its nearest robot. Among this set of distances, we pick the maximum, negate it, and add it to the reward. This is the part of the reward function that forces all robots to cover all goals (denoted r_g^t). A similar strategy is adopted to ensure that the cosine difference between orientations is minimized (denoted r_θ^t). In order not to depend overly on the projected velocity, we add a negative penalty, given to all robots, every time the projection to the safe set needs to be computed (denoted r_p^t). Thus, the overall reward given to each robot at time t is:

r^t = α r_g^t + β r_θ^t + κ r_p^t    (21)

where α, β and κ are coefficients that balance each part of the reward function. This global reward function is the same for every robot operating in the environment, and maximizing it requires a collective effort from all robots.
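A sketch of the shaped global reward in Eqn. 21, assuming the orientation term is measured against each goal's nearest robot; the coefficient values and helper names are illustrative, not the paper's.

```python
import numpy as np

def global_reward(robot_poses, goal_poses, n_projections,
                  alpha=1.0, beta=0.1, kappa=0.05):
    """Shaped global reward shared by all robots (illustrative coefficients).

    robot_poses: (N, 3) rows [x, y, theta]; goal_poses: (M, 3) likewise;
    n_projections: how many times the safety projection fired this step."""
    # Coverage term: penalise the worst-covered goal's distance to its
    # nearest robot (max over goals of the min over robots).
    d = np.linalg.norm(goal_poses[:, None, :2] - robot_poses[None, :, :2], axis=-1)
    r_g = -np.max(np.min(d, axis=1))
    # Orientation term: cosine difference between each goal's heading and
    # the heading of its nearest robot, again taking the worst case.
    nearest = np.argmin(d, axis=1)
    r_theta = -np.max(1.0 - np.cos(robot_poses[nearest, 2] - goal_poses[:, 2]))
    # Penalty for relying on the safety projection.
    r_p = -float(n_projections)
    return alpha * r_g + beta * r_theta + kappa * r_p
```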

It is important to note that at inference time, to guarantee safety, we do away with the soft projection introduced in Eqn. 19 and instead use the min projection given in Eqn. 17. This is because during inference we no longer need to take gradients, and hence the transition function no longer needs to be smooth.

Fig. 5: Training curves for non-holonomic robots (with 2, 3 and 4 obstacles (Obs)). When the robot dynamics are changed, MARL+RVO is still able to converge without any changes to the loss function or the training parameters.

IV-A Experimental Setup Details

For each robot, we set up an actor and a critic network. The actor network is a two-layer fully connected multi-layer perceptron (MLP); the critic network is a similar fully connected MLP. The number of units in the hidden layers is varied depending on the size of the problem (additional units and hidden layers when the number of robots or obstacles is increased). For each episode, we set a maximum episode length of 300 steps. To update our networks we use Adam, with the learning rate varied depending on the experiment under consideration. The discount factor γ is set close to one. We also make use of a replay buffer, with fixed buffer and minibatch sizes, so that dependencies between samples are reduced. The actions from the neural networks represent accelerations for the robots.

Fig. 6: Trajectories executed by 3 robots in three randomly generated episodes after training is complete. In addition to the robots reaching their goals in a collision-free manner, the proposed approach also aligns the robots to the desired final goal orientations.

The robots operate in a two-dimensional space and do not have access to the third dimension. The space under consideration stretches from -1 unit to 1 unit in both the X and Y directions. Each robot is homogeneous, has nonzero mass, and has a fixed radius. At the start of every episode, obstacles, goals and robot start positions are randomly populated. The radius of the obstacles is 0.12 units and the goal regions have a radius of 0.02 units. Robots are equipped with a sensor that returns perfect information (no noise in the measurements) about the pose and velocity of entities within the sensing range, which is set to a radius of 0.2 units. While the learning algorithm has no explicit assignment of goals in the states or in the reward function, when using RVO to avoid collisions we need to greedily assign goals to agents based on the distance to the nearest goal, breaking ties by randomly choosing goals. Once the RVO subprocess has run, we again have no notion of goal assignment. In our simulated experiments (below) we set all units to meters.
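For reference, the simulation parameters reported above can be grouped into a single configuration; the dictionary itself and its key names are our own, and all distances are in meters.

```python
ENV_CONFIG = {
    "workspace_bounds": (-1.0, 1.0),  # both X and Y
    "obstacle_radius": 0.12,
    "goal_radius": 0.02,
    "sensing_radius": 0.20,
    "sensor_noise": 0.0,              # sensors return perfect information
    "max_episode_steps": 300,
}
```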

Fig. 7: Time taken to reach goals (with 2, 3 and 4 obstacles (Obs)) over 500 runs. We compare Multi-Agent Reinforcement Learning (MARL), Reciprocal Velocity Obstacles (RVO), GAP [9], and our algorithm, which combines RL and safety (MARL+RVO). For the RVO method, we assign each robot its nearest goal (in terms of Euclidean distance). GAP uses discrete nodes to search through space, so its performance is contingent on the discretization of our continuous space. We observe that MARL gives the best time performance, but this performance is not guaranteed to be collision-free. Our method (MARL+RVO) trades off some time performance for guaranteed collision-free trajectories.

IV-B Simulation Results

We first observe from Fig. 4 that the proposed MARL+RVO algorithm is able to converge even when the number of robots and the number of obstacles are increased. One of the key strengths of a learning-based solution for concurrent goal assignment and planning is that the algorithm can be used even when the dynamics of the robot change. When the robot dynamics are changed to those of a non-holonomic robot, we observe that our algorithm still converges, as seen in Fig. 5. A simulated experiment is shown in Fig. 6. In Fig. 6 (left), a simple instance is shown where all robots must execute mostly straight-line trajectories to arrive at goals. In Fig. 6 (center), an interesting interaction takes place between Robots 1 and 2: Robot 2 takes a longer, curved path around Robot 1. Lastly, in Fig. 6 (right), Robot 3 chooses to take a longer path around the obstacles in order not to cut off Robot 2's path. These locally suboptimal but globally optimal behaviors are induced by using a global reward. We also design a centralized RL controller that uses information from all the agents and outputs an action distribution for each agent. This controller is trained using PPO [20] and also uses velocity obstacles as a backup policy. We call this method Centralized PPO (C-PPO). We observe that C-PPO is unable to converge to an acceptable goal-coverage policy.

In the RL framework, the policy attempts to maximize the reward function within a fixed time horizon, so the policy is inherently being optimized for minimum time. To demonstrate this, we compare our algorithm with vanilla MADDPG or MARL as described in [10], reciprocal velocity obstacles (RVO) [21], and Goal Assignment and Planning (GAP) as introduced in [9]. The RVO framework improves over VO; however, it is not a full goal assignment and planning framework and only generates collision-free trajectories once goals have been assigned to robots. To benchmark it, we assign goals greedily: each robot is assigned the goal closest to it. GAP utilizes a similar assignment but needs a discretization of the state space and a priority sequence for robots/goals. This prioritization is assigned randomly and the space is discretized into units of 0.1 m. Of these methods, only RVO, GAP and MARL+RVO are guaranteed to produce collision-free trajectories. For a fair comparison in terms of time, we only consider those MARL runs in which no collision occurred. Our results are shown in Fig. 7. It can be seen that MARL and MARL+RVO are faster than or comparable to GAP and RVO, without needing any of their requirements (assigned goals, a discretized state space, or a priority sequence). Vanilla MARL is faster than MARL+RVO but is not guaranteed to generate safe trajectories, as seen from Table I.

           3 Robots   4 Robots   5 Robots
MARL           84        192        354
MARL+RVO        0          0          0
TABLE I: Number of collisions in the presence of 3 obstacles over 500 runs.
Fig. 8: Training time to convergence for different configurations of robots and obstacles when trained on an NVIDIA DGX-1 (8x Tesla V100, 32 GB).

V Discussion

In this paper we propose to solve the concurrent goal assignment and planning problem using MARL. Traditional approaches to this problem utilize carefully designed heuristics that produce guaranteed safe trajectories but break down if any of their assumptions is not satisfied; these assumptions restrict the class of problems that traditional algorithms can solve. By utilizing RL, we remove assumptions on the robot dynamics and the environment, and instead use a global reward function that forces the robots to collaborate with each other in order to maximize the reward. To overcome the lack of safety guarantees, we use a model-based policy in conjunction with the RL policy, thus ensuring safe, collision-free trajectories. We demonstrate the effectiveness of our algorithm in simulations with varying numbers of obstacles, varying numbers of robots and varying robot dynamics, and show that our proposed algorithm works faster and more robustly than traditional algorithms.

V-A Caveats

While this work attempts to learn an approximate solution for the unlabeled multi-robot problem, it has a few caveats. One of the biggest drawbacks of our work is the significant engineering effort required to scale up the number of robots. When the number of robots is increased, two major challenges hamper MARL. The first is that the input space of the critic function grows as the number of robots increases; this increase in dimensionality necessitates longer training times for the critic. It might be possible to instead use a local critic function that only takes in information from nearby robots, for example by thinking of the robots as nodes on a graph and leveraging advances in graph neural networks [22] instead of using a fully connected network. The second problem is the need for more exploration as the number of robots increases. We observe from Fig. 8 that the time required to train the algorithm grows almost exponentially as the number of robots is increased. While there exist massively parallel methods [23, 24] and software libraries [25] for scaling up reinforcement learning, scaling up the number of robots still poses a significant computational challenge. Methods that learn hierarchical policies for agents, such as those in [26], might prove to be a suitable alternative. Lastly, our choice of VO for collision avoidance, while rooted in its simplicity, suffers from drawbacks, many of which have been addressed by the methods presented in [18, 27]. Adapting the projection step to more sophisticated collision avoidance algorithms is something we intend to explore in future work.

References