Towards Optimally Decentralized Multi-Robot Collision Avoidance via Deep Reinforcement Learning

09/28/2017 ∙ by Pinxin Long, et al. ∙ City University of Hong Kong 0

Developing a safe and efficient collision avoidance policy for multiple robots is challenging in the decentralized scenarios where each robot generate its paths without observing other robots' states and intents. While other distributed multi-robot collision avoidance systems exist, they often require extracting agent-level features to plan a local collision-free action, which can be computationally prohibitive and not robust. More importantly, in practice the performance of these methods are much lower than their centralized counterparts. We present a decentralized sensor-level collision avoidance policy for multi-robot systems, which directly maps raw sensor measurements to an agent's steering commands in terms of movement velocity. As a first step toward reducing the performance gap between decentralized and centralized methods, we present a multi-scenario multi-stage training framework to find an optimal policy which is trained over a large number of robots on rich, complex environments simultaneously using a policy gradient based reinforcement learning algorithm. We validate the learned sensor-level collision avoidance policy in a variety of simulated scenarios with thorough performance evaluations and show that the final learned policy is able to find time efficient, collision-free paths for a large-scale robot system. We also demonstrate that the learned policy can be well generalized to new scenarios that do not appear in the entire training period, including navigating a heterogeneous group of robots and a large-scale scenario with 100 robots. Videos are available at



There are no comments yet.


page 1

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

Multi-robot navigation has recently gained much interest in robotics and artificial intelligence, and has many real-world applications including multi-robot search and rescue, navigation through human crowds, and autonomous warehouse. One of the major challenges for multi-robot navigation is to develop a safe and robust collision avoidance policy for each robot navigating from its starting position to its desired goal.

Some of prior works, known as centralized methods, assume that the comprehensive knowledge about all agents’ intents (e.g. initial states and goals) and their workspace (e.g. a 2D grid map) is given for a central server to control the action of agents. These methods can generate the collision avoidance action by planning optimal paths for all robots simultaneously. However, these centralized methods are difficult to scale to large systems with many robots and their performance can be low when frequent task/goal reassignment is necessary. Besides, in practice, they heavily rely on the reliable communication network between robots and the central server. Therefore, once the central server and/or the communication network fails, the multi-robot systems will break down. Furthermore, these centralized methods are inapplicable when multiple robots are deployed in an unknown and unstructured environments.

Fig. 1: Robot trajectories in the Circle scenario using our learned policy. Note that the robots are square-shaped. It shows the good generalization capability of the learned policy since we directly test the policy trained with disc-shaped robots on this scenarios.

Compared with centralized methods, some existing works propose agent-level decentralized collision avoidance policies, where each agent independently makes decision taking into account the observable states (e.g. shapes, velocities and positions) of other agents as inputs. Most agent-level policies are based on the velocity obstacle (VO) [1, 2, 3, 4, 5], and they can compute local collision-free action efficiently for multiple agents in cluttered workspaces. However, several limitations greatly restrict their applications. First, the simulation based works [6, 1] assume that each agent has perfect sensing about the surrounding environment which does not hold in real world scenarios due to omnipresent sensing uncertainty. To moderate the limitation of perfect sensing, previous approaches use a global positioning system to track the positions and velocities of all robots [2, 5] or design an inter-agent communication protocol for sharing position and velocity information among nearby agents [4, 3, 7]. However, these approaches introduce external tools or communication protocols into the multi-robot systems, which may not be adequately robust. Second, VO based policies have many tunable parameters that are sensitive to scenario settings and thus the parameters must be carefully set offline to achieve satisfying performance. Finally, the performance of previous decentralized methods in terms of navigation speed and navigation time is significantly lower than their centralized counterparts.

Inspired by VO based approaches, Chen et al. [8] train an agent-level collision avoidance policy using deep reinforcement learning, which learns a two-agent value function that explicitly maps an agent’s own state and its neighbors’ states to collision-free action, whereas it still demands the perfect sensing. In their later work [9]

, multiple sensors are deployed to perform tasks of segmentation, recognition, and tracking in order to estimate the states of nearby agents and moving obstacles. However, this complex pipeline not only requires expensive online computation but makes the whole system less robust to the perception uncertainty.

In this paper, we focus on sensor-level decentralized collision avoidance

policies that directly map the raw sensor data to desired, collision-free steering commands. Compared with agent-level policies, the perfect sensing for the neighboring agents and obstacles, and offline parameter-tuning for different scenarios are not required. Sensor-level collision avoidance policies are often modeled by deep neural networks (DNNs) 

[10, 11]

and trained using supervised learning on a large dataset. However, there are several limitations for learning policies under supervision. First, it requires a large amount of training data that should cover different kinds of interaction situations for multiple robots. Second, the expert trajectories in datasets are not guaranteed to be optimal in the interaction scenarios, which makes training difficult to converge to a robust solution. Third, it is difficult to hand-design a proper loss function for training robust collision avoidance policies. To overcome these drawbacks, we propose a multi-scenario multi-stage deep reinforcement learning framework to learn the optimal collision avoidance policy using the policy gradient method.

Main results: In this paper, we address the collision avoidance of multiple robots in a fully decentralized framework, in which the input data is only collected from onboard sensors. To learn the optimal collision avoidance policy, we propose a novel multi-scenario multi-stage training framework which exploits a robust policy gradient based reinforcement learning algorithm trained in a large-scale robot system in a set of complex environments. We demonstrate that the collision avoidance policy learned from the proposed method is able to find time efficient, collision-free paths for a large-scale nonholonomic robot system, and it can be well generalized to unseen scenarios. Its performance is also much better than previous decentralized methods, and can serve as a first step toward reducing the gap between centralized and decentralized navigation policies.

Fig. 2: An overview of our approach. At each timestep , each robot receives its observation and reward from the environment, and generates an action following the policy . The policy is shared across all robots and updated by a policy gradient based reinforcement learning algorithm.

Ii Related Work

Learning-based collision avoidance techniques have been extensively studied for one robot avoiding static obstacles. Many approaches adopt the supervised learning paradigm to train a collision avoidance policy by imitating a dataset of sensor input and motion commands. Muller et al. [12] trained a vision-based static obstacle avoidance system in supervised mode for a mobile robot by training a 6-layer convolutional network which maps raw input images to steering angles. Zhang et al. [13] exploited a successor-feature-based deep reinforcement learning algorithm to transfer depth information learned in previously mastered navigation tasks to new problem instances. Sergeant et al. [14]

proposed a mobile robot control system based on multimodal deep autoencoders. Ross et al. 


trained a discrete controller for a small quadrotor helicopter with imitation learning techniques. The quadrotor was able to successfully avoid collisions with static obstacles in the environment using only a single cheap camera. Yet only discrete movement (left/right) has to be learned and also the robot only trained within static obstacles. Note that the aforementioned approaches only take into account the static obstacles and require a human driver to collect training data in a wide variety of environments. Another data-driven end-to-end motion planner is presented by Pfeiffer et al. 

[11]. They trained a model maps laser range findings and target positions to motion commands using expert demonstrations generated by the ROS navigation package. This model can navigate the robot through a previously unseen environment and successfully react to sudden changes. Nonetheless, similar to the other supervised learning methods, the performance of the learned policy is seriously constrained by the quality of the labeled training sets. To overcome this limitation, Tai et al. [16] proposed a mapless motion planner trained through a deep reinforcement learning method. Kahn et al. [17]

presented an uncertainty-aware model-based reinforcement learning algorithm to estimate the probability of collision in a priori unknown environment. However, the test environments are relative simple and structured, and the learned planner is hard to generalize to the scenarios with dynamic obstacles and other proactive agents.

Regarding multi-agent collision avoidance, the Optimal Reciprocal Collision Avoidance (ORCA) framework [1] has been popular in crowd simulation and multi-agent systems. ORCA provides a sufficient condition for multiple robots to avoid collisions with each other in a short time horizon, and can easily be scaled to handle large systems with many robots. ORCA and its extensions [2, 5]

used heuristics or first principles to construct a complex model for the collision avoidance policy, which has many parameters that are tedious and difficult to be tuned properly. Besides, these methods are sensitive to the uncertainties ubiquitous in the real-world scenarios since they assume each robot to have perfect sensing about the surrounding agents’ positions, velocities and shapes. To alleviate the requirement of perfect sensing, communication protocols are introduce by 

[3, 4, 7] to share the state information including agents’ positions and velocities among the group. Moreover, the original formulation of ORCA is based on holonomic robots which is less common than nonholonomic robots in the real world. To deploy ORCA on the most common differential drive robots, several methods have been proposed to deal with the difficulty of non-holonomic robot kinematics. ORCA-DD [18] enlarges the robot to twice the radius of the original size to ensure collision free and smooth paths for robots under differential constraints. However, this enlarged virtual size of the robot can result in problems in narrow passages or unstructured environments. NH-ORCA [19]

makes a differential drive robot tracking a holonomic speed vector with a certain tracking error

. It is preferred over ORCA-DD, since the virtual increase of the robots’ radii is only by a size of instead of doubling the radii.

In this paper, we focus on learning a collision-avoidance policy which can make multiple nonholonomic mobile robots navigate to their goal positions without collisions in rich and complex environments.

Iii Problem Formulation

The multi-robot collision avoidance problem is defined primarily in the context of a nonholonomic differential drive robot moving on the Euclidean plane with obstacles and other decision-making robots. During training, all of robots are modeled as discs with the same radius , i.e., all robots are homogeneous.

At each timestep , the -th robot has access to an observation and computes a collision-free steering command that drives it to approach the goal from the current position . The observation

drawn from a probability distribution w.r.t. the underlying system state

, , only provides partial information about the state since the -th robot has no explicit knowledge about other robots’ states and intents. Instead of the perfect sensing assumption applied in prior methods (e.g. [8, 9, 4, 3, 6, 1]), our formulation based on partial observation makes our approach more applicable and robust in real world applications. The observation vector of each robot can be divided into three parts: (here we ignore the robot ID for legibility), where denotes the raw 2D laser measurements about its surrounding environment, stands for its relative goal position (i.e. the coordinates of the goal in the robot’s local polar coordinate frame), and refers to its current velocity. Given the partial observation , each robot independently computes an action or a steering command, , sampled from a stochastic policy shared by all robots:


where denotes the policy parameters. The computed action is actually a velocity that guides the robot approaching its goal while avoiding collisions with other robots and obstacles within the time horizon until the next observation is received.

Hence, the multi-robot collision avoidance problem can be formulated as a partially observable sequential decision making problem. The sequential decisions consisting of observations and actions (velocities) made by the robot can be considered as a trajectory from its start position to its desired goal , where is the traveled time. To wrap up the above formulation, we define as the set of trajectories for all robots, which are subject to the robot’s kinematic (e.g. non-holonomic) constraints, i.e.:


To find an optimal policy shared by all robots, we adopt an objective by minimizing the expectation of the mean arrival time of all robots in the same scenario, which is defined as:


where is the travel time of the trajectory in controlled by the shared policy .

The average arrival time will also be used as an important metric to evaluate the learned policy in Section V. We solve this optimization problem through a policy gradient based reinforcement learning method, which bounds the policy parameter updates to a trust region to ensure stability.

Iv Approach

We begin this section by introducing the key ingredients of our reinforcement learning framework. Next, we describe the details about the architecture of the collision avoidance policy in terms of a deep neural network. Finally, we elaborate the training protocols used to optimize the policy.

Iv-a Reinforcement Learning Setup

The partially observable sequential decision making problem defined in Section III

can be formulated as a Partially Observable Markov Decision Process (POMDP) solved with reinforcement learning. Formally, a POMDP can be described as a 6-tuple

, where is the state space, is the action space, is the state-transition model, is the reward function, is the observation space () and is the observation probability distribution given the system state (). In our formulation, each robot only has access to the observation sampled from the underlying system states. Furthermore, since each robot plans its motions in a fully decentralized manner, a multi-robot state-transition model determined by the robots’ kinematics and dynamics is not needed. Below we describe the details of the observation space, the action space, and the reward function.

Iv-A1 Observation space

As mentioned in Section III, the observation consists of the readings of the 2D laser range finder , the relative goal position and robot’s current velocity . Specifically, includes the measurements of the last three consecutive frames from a 180-degree laser scanner which has a maximum range of 4 meters and provides 512 distance values per scanning (i.e. ). In practice, the scanner is mounted on the forepart of the robot instead of the center (see the left image in Figure 1) to obtain a large unoccluded view. The relative goal position is a 2D vector representing the goal in polar coordinate (distance and angle) with respect to the robot’s current position. The observed velocity

includes the current translational and rotational velocity of the differential-driven robot. The observations are normalized by subtracting the mean and dividing by the standard deviation using the statistics aggregated over the course of the entire training.

Iv-A2 Action space

The action space is a set of permissible velocities in continuous space. The action of differential robot includes the translational and rotational velocity, i.e. . In this work, considering the real robot’s kinematics and the real world applications, we set the range of the translational velocity and the rotational velocity in . Note that backward moving (i.e. ) is not allowed since the laser range finder can not cover the back area of the robot.

Iv-A3 Reward design

Our objective is to avoid collisions during navigation and minimize the mean arrival time of all robots. A reward function is designed to guide a team of robots to achieve this objective:


The reward received by robot at timestep is a sum of three terms, , , and . In particular, the robot is awarded by for reaching its goal:


When the robot collides with other robots or obstacles in the environment, it is penalized by :


To encourage the robot to move smoothly, a small penalty is introduced to punish the large rotational velocities:


We set , , and in the training procedure.

Iv-B Network architecture

Given the input (observation ) and the output (action ), now we elaborate the policy network mapping to .

We design a 4-hidden-layer neural network as a non-linear function approximator to the policy . Its architecture is shown in Figure 3. We employ the first three hidden layers to process the laser measurements

effectively. The first hidden layer convolves 32 one-dimensional filters with kernel size = 5, stride = 2 over the three input scans and applies ReLU nonlinearities 

[20]. The second hidden layer convolves 32 one-dimensional filters with kernel size = 3, stride = 2, again followed by ReLU nonlinearities. The third hidden layer is a fully-connected layer with 256 rectifier units. The output of the third layer is concatenated with the other two inputs ( and

), and then are fed into the last hidden layer, a fully-connected layer with 128 rectifier units. The output layer is a fully-connected layer with two different activations: a sigmoid function is used to constrained the mean of translational velocity

in and the mean of rotational velocity in through a hyperbolic tangent function ().

Overall, the neural network maps the input observation vector to a vector . The final action

is sampled from a Gaussian distribution

, where serves as the mean and refers to a log standard deviation which will be updated solely during training.

Fig. 3: The architecture of the collision avoidance neural network. The network has the scan measurements , relative goal position and current velocity as inputs, and outputs the mean of velocity . The final action is sampled from the Gaussian distribution constructed by with a separated log standard deviation vector .

Iv-C Multi-scenario multi-stage training

Iv-C1 Training algorithm

Even if deep reinforcement learning algorithms have been successfully applied in mobile robot motion planning, they have mainly focused on a discrete action space [21, 13] or on small-scale problems [16, 8, 9, 17]. Here we focus on learning a collision avoidance policy which performs robustly and effectively with a large number of robots in complex scenarios with obstacles, such as corridors and mazes. We extend a recently proposed robust policy gradient algorithm, Proximal Policy Optimization (PPO) [22, 23, 24], to our multi-robot system. Our approach adapts the centralized learning, decentralized execution paradigm. In particular, each robot receives its own at each time step and executes the action generated from the shared policy ; the policy is trained with experiences collected by all robots simultaneously.

As summarized in Algorithm 1 (adapted from [22, 23]), the training process alternates between sampling trajectories by executing the policy in parallel, and updating the policy with the sampled data. During data collection, each robot exploits the same policy to generate trajectories until they collect a batch of data above . Then the sampled trajectories are used to construct the surrogate loss , and this loss is optimized with the Adam optimizer [25] for epochs under the Kullback-Leiber (KL) divergence constraint. The state-value function , used as an baseline to estimate the advantage , is also approximated with a neural network with parameters on sampled trajectories. The network structure of is the same as that of the policy network , except that it has only one unit in its last layer with a linear activation. We construct the squared-error loss for , and optimize it also with the Adam optimizer for epochs. We update and independently and their parameters are not shared since we have found that using two separated networks will lead to better results in practice.

This parallel PPO algorithm can be easily scaled to a large-scale multi-robot system with hundred robots in a decentralized fashion since each robot in the team is an independent worker collecting data. The decentralized execution not only dramatically reduce the time of sample collection, also make the algorithm suitable for training many robots in various scenarios.

1:Initialize policy network and value function

, and set hyperparameters as shown in Table 

2:for  do
3:     // Collect data in parallel
4:     for  do
5:         Run policy for timesteps, collecting , where
6:         Estimate advantages using GAE [26] , where
7:         break, if
8:     end for
10:     // Update policy
11:     for  do
13:         if  then
14:              break and continue with next iteration
15:         end if
16:         Update with by Adam [25] w.r.t
17:     end for
18:     // Update value function
19:     for  do
21:         Update with by Adam w.r.t
22:     end for
23:     // Adapt KL Penalty Coefficient
24:     if  then
26:     else if  then
28:     end if
29:end for
Algorithm 1 PPO with Multiple Robots

Iv-C2 Training scenarios

To expose our robots to diverse environments, we create different scenarios with a variety of obstacles using the Stage mobile robot simulator111 (as shown in Figure 4) and move all robots concurrently. In scenario 1, 2, 3, 5, and 6 in Figure 4 (black solid lines are obstacles), we first select reasonable starting and arrival areas from the available workspace, then randomly sample the start and goal positions for each robot in the corresponding areas. Robots in scenario 4 are randomly initialized in a circle with a varied radius, and they aim to reach their antipodal positions by crossing the central area. As for scenario 7, we generate random positions for both robots and obstacles (shown in black) at the beginning of each episode; and the target positions of robots are also randomly selected. These rich, complex training scenarios enable robots to explore their high-dimensional observation space and are likely to improve the quality and robustness of the learned policy. Combining with the centralized learning, decentralized execution mechanism, the collision avoidance policy is effectively optimized at each iteration over a variety of environments.

Fig. 4: Scenarios used to train the collision avoidance policy. All robots are modeled as a disc with the same radius. Obstacles are shown in black.

Iv-C3 Training stages

Although training on multiple environments simultaneously brings robust performance over different test cases (see Section V-C), it makes the training process harder. Inspired by the curriculum learning paradigm [27], we propose a two-stage training process, which accelerates the policy to converge to a satisfying solution, and gets higher rewards than the policy trained from scratch with the same number of epoch (as shown in Figure 5). In the first stage, we only train 20 robots on the random scenarios (scenario 7 in Figure 4) without any obstacles, this allows our robots learn fast on relatively simple collision avoidance tasks. Once the robots achieve reliable performance, we stop the Stage 1 and save the trained policy. The policy will continue to be updated in the Stage 2, where the number of robots increases to 58 and they are trained on richer and more complex scenarios shown in Figure 4.

Fig. 5: Average rewards shown in wall time during the training process.

V Experiments and Results

In this section, we first describe the hyper-parameters and computational complexity of the training process. Then, we quantitatively compare our policy with other methods in various simulated scenarios. Lastly, we demonstrate the good generalization capability of the learned policy in several challenging and complex environments.

V-a Training configuration and computational complexity

The implementation of our algorithm is in TensorFlow and the large-scale robot group with laser scanners is simulated in the Stage simulator. We train the multi-robot collision avoidance policy on a computer with an i7-7700 CPU and a Nvidia GTX 1080 GPU. The offline training takes 12 hours (about 600 iterations in Algorithm 

1) for the policy to converge to a robust performance in all scenarios. The hyper-parameters in Algorithm 1 is summarized in Table I. In particular, the learning rate of policy network is set to 5 5 in the first stage, and is then reduced to 2 5 in the second training stage. As for running ten robots in the online decentralized control, it takes 3ms for the policy network to compute new actions on the CPU and about 1.3ms on the GPU.

Parameter Value
in line 6 0.95
in line 6 and 20 0.99
in line 7 8000
in line 11 20
in line 12 1.0
in line 12
in line 12 50.0
in line 16 (first stage), (second stage)
in line 19 10
in line 21
in line 24 2.0
in line 24 and 27 1.5
in line 26 0.5
TABLE I: The hyper-parameters of our training algorithm described in Algorithm 1.

V-B Quantitative comparison on various scenarios

V-B1 Performance metrics

To compare the performance of our policy with other methods over various test cases, we use the following performance metrics. For each method, every test case is evaluated for repeats.

  • Success rate is the ratio of the number of robots reaching their goals within a certain time limit without any collisions over the total number of robots.

  • Extra time measures the difference between the travel time averaged over all robots and the lower bound of the travel time (i.e. the average cost time of going straight toward the goal for robots at max speed [7, 8]).

  • Extra distance measures the difference between the average traveled trajectory length of the robots and the lower bound of traveled distance of the robots (i.e. the average traveled distance for robots following the shortest paths toward the goals).

  • Average speed measures the average speed of the robot team during navigation.

Note that the extra time and extra distance

metrics are measured over all robots during the evaluation which remove the effects due to the variance in the number of agents and the different distance to goals.

V-B2 Circle scenarios

We first compare our multi-scenario multi-stage learned policy with the NH-ORCA policy [19], and the policy trained using supervised learning (SL-policy, a variation of  [10], see below for details) on circle scenarios with different number of robots. The circle scenarios are similar to the scenario 4 shown in Figure 4

, except we set robots uniformly on the circle. We use the open-sourced NH-ORCA implementation from 

[3, 4], and share the ground truth positions and velocities for all robots in the simulations. The policy learned in supervised mode, with the same architecture of our policy (described in Section IV-B), is trained on about 800,000 samples with the method adapted from [10, 11].

Compared to the NH-ORCA policy, our learned policy has significant improvement over it in terms of success rate, average extra time and travel speed. Although our learned policy has a slightly longer traveled paths than the NH-ORCA policy in scenarios with robot number above 15 (the third row in Table II), the larger speed (the fourth row in Table II) helps our robots reach their goals more quickly. Actually the slightly longer path is a byproduct of the higher speed since the robots need more space to decelerate before stopping at goals.

Metrics Method 4 6 8 10 12 15 20
Success Rate SL-policy 0.6 0.7167 0.6125 0.71 0.6333 - -
NH-ORCA 1.0 0.9667 0.9250 0.8900 0.9000 0.8067 0.7800
Our policy 1.0 1.0 1.0 1.0 1.0 1.0 0.9650
Extra Time SL-policy 9.254 / 2.592 9.566 / 3.559 12.085 / 2.007 13.588 / 1.206 19.157 / 2.657 - -
NH-ORCA 0.622 / 0.080 0.773 / 0.207 1.067 / 0.215 0.877 /0.434 0.771 / 0.606 1.750 / 0.654 1.800 / 0.647
Our policy 0.148 / 0.004 0.193 / 0.006 0.227 / 0.005 0.211 / 0.007 0.271 / 0.005 0.350 / 0.014 0.506 / 0.016
Extra Distance SL-policy 0.358 / 0.205 0.181 / 0.146 0.138 / 0.079 0.127 / 0.047 0.141 / 0.027 - -
NH-ORCA 0.017 / 0.004 0.025 / 0.005 0.041 / 0.034 0.034 / 0.009 0.062 / 0.024 0.049 / 0.019 0.056 / 0.018
Our policy 0.017 / 0.007 0.045 / 0.002 0.040 / 0.001 0.051 / 0.001 0.056 / 0.002 0.062 / 0.003 0.095 / 0.007
Average Speed SL-policy 0.326 / 0.072 0.381 / 0.087 0.354 / 0.042 0.355 / 0.022 0.308 / 0.028 - -
NH-ORCA 0.859 / 0.012 0.867 / 0.026 0.839 / 0.032 0.876 / 0.045 0.875 / 0.054 0.820 / 0.052 0.831 / 0.042
Our policy 0.956 / 0.006 0.929 / 0.002 0.932 / 0.001 0.928 / 0.002 0.921 / 0.002 0.912 / 0.003 0.880 / 0.006
TABLE II: Performance metrics (as mean/std) evaluated for different methods on the circle scenarios with different number of robots.

V-B3 Random scenarios

Random scenarios are other frequently used scenes to evaluate the performance of multi-robot collision avoidance. To measure the performance of our method on random scenarios (as shown in the 7th scenario in Figure 4), we first create 5 different random scenarios with 15 robots in each case. For each random scenario, we repeat our evaluations 50 times. The results are shown in Figure 6, which compares our final policy with the policy only trained in the stage 1 (Section IV-C1) and the NH-ORCA policy. We can observe that both policies trained using deep reinforcement learning have higher success rate than NH-ORCA policy (Figure 5(a)). It can also be seen that robots using the learned policies (in Stage 1 and 2) are able to reach their targets much faster than that of NH-ORCA (Figure 5(b)). Although the learned policies have longer trajectory length (Figure 5(c)), the higher average speed (Figure 5(d)) and success rate indicate that our policies enable a robot to better anticipate other robots’ movements. Similar to the circle scenarios above, the slightly longer path is due to robots’ requirement to decelerate before stopping at goals. In addition, the stage-1 policy’s high performance in the random scenario is partially contributed by the overfitting since it is trained in a similar random scenario, while the stage-2 policy is trained in multiple scenarios.

(a) Success rate
(b) Extra time
(c) Extra distance
(d) Average speed
Fig. 6: Performance metrics evaluated for our learned policies and the NH-ORCA policy on random scenarios.

V-B4 Group scenarios

In order to evaluate the cooperation between robots, we would like to test our trained policy on more challenging scenarios, e.g. group swap, group crossing and group moving in the corridors. In the group swap scenarios, we navigate two groups of robots (each group has 6 robots) moving in opposite directions to swap their positions. As for group crossing scenarios, robots are separated in two groups, and their paths will intersect in the center of the scenarios. We compare our method with NH-ORCA on these two cases by measuring the average extra time with 50 trials. It can be seen from Figure 8 that our policies performs much better than NH-ORCA on both cases. The shorter goal-reached time demonstrates that our policies have learned to produce more cooperative behaviors than reaction-based methods (NH-ORCA). We then evaluate three policies on the corridor scene, where two groups exchange their positions in a narrow corridor with two obstacles as shown in Figure 6(a). However, only the Stage-2 policy can complete this challenging task (with paths as illustrated in Figure 6(b)).The failure of the Stage-1 policy shows that the co-training on a wide range of scenarios can lead to robust performance across different situations. NH-ORCA policy fails on this case because it relies on global planners to guide robots navigating in the complex environment. As mentioned in Section I, the agent-level collision avoidance policy (e.g. NH-ORCA) requires extra pipeline (e.g. a grid map indicating the obstacles) to explicitly identify and process the static obstacles while our method (sensor-level policy) implicitly infers the obstacles from raw sensor readings without any additional processing.

(a) Corridor scenario
(b) Robot trajectories
Fig. 7: Two group of robots moving in a corridor with obstacles. (a) shows the corridor scenario. (b) shows trajectories generated by our Stage-2 policy.
Fig. 8: Extra time of our policies (Stage 1 and Stage 2) and the NH-ORCA policy on two group scenarios.

V-C Generalization

A notable feature benefited from multi-scenario training is the good generalization capability of the learned policy (Stage-2 policy). As mentioned in Section III, our policy is trained within a robot team where all robots share the same collision avoidance strategy. Non-cooperative robots are not introduced over the entire training process. Interestingly, the result shown in Figure 8(b) demonstrates that the learned policy can directly generalize well to avoid non-cooperative agents (i.e. the rectangle-shaped robots in Figure 8(b) which travel in straight lines with a fixed speed). Recall our policy is trained on robots with the same shape and a fixed radius. Figure 8(a) exhibits that the learned policy can also navigate efficiently a heterogeneous group of robots consisting of robots with different sizes and shapes to reach their goals without any collisions. To test the performance of our method on large-scale scenarios, we simulate 100 robots in a large circle moving to antipodal positions as shown in Figure 10. It shows that our learned policy can directly generalize to large-scale environments without any fine-tuning.

(a) Heterogeneous robots
(b) Non-cooperative robots
Fig. 9: In the heterogeneous robot team (a), only the two disc-shaped robots are used in training. (b) shows 6 robots moving around two non-cooperative robots (rectangle-shaped), which traveled in straight lines with a fast speed.
Fig. 10: Simulation of 100 robots trying to move through the center of a circle to antipodal positions.

Vi Conclusion

In this paper, we present a multi-scenario multi-stage training framework to optimize a fully decentralized sensor-level collision avoidance policy with a robust policy gradient algorithm. The learned policy has demonstrated several advantages on an extensive evaluation of the state-of-the-art NH-ORCA policy in terms of success rate, collision avoidance performance, and generalization capability. Our work can serve as a first step towards reducing the navigation performance gap between the centralized and decentralized methods, though we are fully aware that the learned policy focusing on local collision avoidance cannot replace a global path planner when scheduling many robots to navigate through complex environments with dense obstacles.