End-to-end Decentralized Multi-robot Navigation in Unknown Complex Environments via Deep Reinforcement Learning

07/03/2019 ∙ by Juntong Lin, et al. ∙ SUN YAT-SEN UNIVERSITY 0

In this paper, a novel deep reinforcement learning (DRL)-based method is proposed to navigate the robot team through unknown complex environments, where the geometric centroid of the robot team aims to reach the goal position while avoiding collisions and maintaining connectivity. Decentralized robot-level policies are derived using a mechanism of centralized learning and decentralized executing. The proposed method can derive end-to-end policies, which map raw lidar measurements into velocity control commands of robots without the necessity of constructing obstacle maps. Simulation and indoor real-world unmanned ground vehicles (UGVs) experimental results verify the effectiveness of the proposed method.



There are no comments yet.


page 1

page 4

page 6

page 7

page 8

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

Multi-robot system has broad applications such as search, rescue, forest inspection, agricultural spraying, collaborative transportation, etc. When a team of robots collaboratively accomplishes tasks in complex environments, multi-robot navigation techniques are essential to guarantee the safety of the robot team [1]. When the workspaces of the robot team are unknown and complex, the navigation policies are required to be both on-line and able to handle obstacles of different shapes. In addition, the communication range of each robot is usually constrained, thus the navigation policy should take connectivity into consideration to ensure communication among the robots. Furthermore, decentralized navigation strategies are preferable to centralized ones on enabling autonomous decision making for each robot.

This paper studies the decentralized multi-robot navigation in unknown complex environments as shown in Fig. (a)a and Fig. (b)b, where the geometric centroid of the formation is required to reach the goal position efficiently while avoiding collision and maintaining connectivity of the robot team. A deep reinforcement learning (DRL)-based approach is proposed to accomplish the multi-robot navigation task. By means of a centralized learning and decentralized executing mechanism as shown in Fig. 2, a decentralized end-to-end robot-level policy can be obtained, which directly maps raw lidar range data to continuous velocity control commands without the necessity of constructing obstacle maps.

I-a Related work

(a) Scene description.
(b) Adaptive control.
(c) Invlid assignment.
Fig. 1: Multi-robot navigation task.
Fig. 2: An overview of our method. During centralized learning period, the robot team interacts with the environment and evolves a team-level policy in a trial-and-error manner. In our method, a robot-level policy can be determined once the team-level policy is obtained. With the robot-level policy , each robot can make decision merely depending on local observations in decentralized executing stage.

There exists extensive research work for multi-robot navigation, which can be further categorized into rule-based and learning-based strategies. Rule-based approaches include using leader-follower scheme [2], artificial potential field (APF) [3], graph theory [4], consensus theory [5], model predictive control [6], virtual structure [7, 8], etc.

In the rule-based navigation methods, the obstacle map should be constructed using the sensor data. The performance of the rule-based methods highly relies on the constructed obstacle map. In addition, the real-time mapping of the navigation environment with the robots’ onboard sensors is challenging and computationally prohibitive sometimes.

Learning-based methods are alternatives for rule-based methods. The learning-based approaches can derive end-to-end policies which map raw sensor data to control commands without the necessity of constructing obstacle maps.

Most of the existing learning-based navigation methods focus on single-robot settings [9, 10, 11]. In the case of multi-robot systems, the research work mainly focuses on local collision avoidance [12, 13, 14], where multiple robots move to their designated goal positions correspondingly without colliding with other robots and the obstacles. As shown in Fig. (c)c, it is difficult to assign the goal position to each robot beforehand when the robot team navigates through unknown cluttered environments. Moreover, as the communication range of the robot is usually constrained, the information exchange among the robots cannot be ensured if the connectivity of the robot team is not considered.

In this paper, a DRL-based method is proposed to accomplish the multi-robot navigation task, where the geometric centroid of the robot team aims to reach the goal position while avoiding collisions and maintaining connectivity.

I-B Contribution and paper organization

In this paper, the multi-robot navigation problem is studied. The main contributions of the paper are presented as follows:

  • A novel DRL-based method is proposed to navigate the multi-robot team through unknown cluttered environments in a decentralized manner. The derived policy directly maps the raw lidar measurements into the continuous velocity control commands of the robots without the necessity of constructing the obstacle maps.

  • Simulation results illustrate that the robot team can navigate through unknown complex environments based on the proposed method. Comparison results show that our proposed method (which makes decisions based on local observations) outperforms an APF-based method (which can obtain global environment information) in narrow scenarios.

  • A team of 3 holonomic unmanned ground vehicles (UGVs) is built to serve as a real-world policy verification platform. Indoor real-world experimental results show that the learned policy can successfully navigate a team of 3 holonomic UGVs through the dense obstacles.

The remainder of this paper is organized as follows: In Section II, the multi-robot navigation problem is described, and the problem formulation is addressed. The proposed DRL-based navigation method is presented in Section III. In Section IV, numerical and experimental results are presented to illustrate its effectiveness. Section V concludes the paper.

Ii Problem Formulation

In this section, the multi-robot navigation problem is formulated as a partially observable Markov decision process (POMDP).

(a) Local observation.
(b) Local frame.
Fig. 3: Setting description. Information about teammates , goal and current velocity are green, yellow and red arrows, respectively. Information about environment

are purple lines. Vectors are represented under robots’ local frames.

In this paper, a team of homogeneous robots is considered, where the robots are holonomic with equal radius . Assume there are robots and static obstacles in a navigation environment. The obstacles are modeled as polygons, and the -th obstacles () is denoted as , where is the position of the -th vertex of the -th obstacles and is the number of vertices of the -th obstacle. At each time step , The -th robot () with velocity locates at . The robot team computes the control command based on the local observations of all robots by sampling from the distribution , where the distribution is generated from the policy and denotes the policy parameters. Within the time horizon , the positions of robots are updated according to the control value .

The local observation of the -th robot consists of current velocity , relative positions of robot team’s goal as well as other robots , and the partial environment information . We assume the local observation is always obtainable for the policy during the navigation. The form of the partial environment information and the control value can be various depending on the configuration of the multi-robot system. In this paper, we study the case where robots perceive the environment with 2D laser scanners and take velocity as control commands. Therefore, the partial environment information represents the raw 2D laser measurements of the -th robot as shown in Fig. (a)a and denotes the velocity control command. Vectors (e.g., velocity control command , current velocity , relative position of the goal , and relative positions of other robots ) are represented under the robots’ local frame as shown in Fig. (b)b.

Accordingly, the multi-robot navigation problem can be formulated as a POMDP. A POMDP can be denoted as a 6-tuple , where is the state space, is the action space, is the observation space, is the state transition distribution, is the emission distribution, and is the reward function.

Specifically, the vector represents the states of all the robots at the time step , where denote the positions/velocities of all the robots, and denotes all the obstacles and denotes the goal position. The vector and denote the actions and observations of all the robots. State transition distribution determines , which is the distribution of the next states on the condition of taking the action under the state . Emission distribution determines , which is the distribution of the accessible observations of the robots under the state . Note that both state transition distribution and emission distribution are defined implicitly by the physical meaning of and . The reward function is a map , which returns the scalar immediate reward of taking the action under the state . The design of the reward function is elaborated in Section III.

Once formulating the multi-robot navigation problem as a POMDP, we aim to find the optimal parameter for the policy to maximize the expected accumulated reward , where refers to the maximum time step, is the trajectory of the robot team, denotes the distribution of a trajectory under policy , state transition distribution and emission distribution . The finding of the optimal parameter can be formulated as:


Iii Methodology

In this section, we first present a DRL-based method to derive a centralized policy, where the whole robot team is regarded as an entity. Then the method is adapted to derive a decentralized policy to enhance the autonomy of each robot.

Iii-a Centralized policy learning

In this part, the robot team is regarded as an entity. To solve the POMDP defined in II, the reward function is presented, which primarily rewards desirable robot team behaviors and penalizes unwanted manners. It can be formulated as:


where is a negative constant serving as a penalty to motivate the robot team to reach the goal position as fast as possible; the reward functions , , and are related to the objectives of reaching the goal position, avoiding collisions, maintaining connectivity, and achieving smooth motion of the robots, respectively; and are the corresponding weights. The reward functions are addressed in detail as follows.

The robot team will be rewarded by when the geometric centroid of the robot team reaches the goal position , and get a reward or penalty depending on the centroid moving to or away from the goal. The reward function can be formulated as:


where refers to the radius of goal zone.

When any robot collides with other robots or obstacles in the environment, the robot team will be penalized:


where is a negative value.

The robot team will get a penalty if the connectivity is not guaranteed (distances among the robots exceeding the communication range threshold ):


where computes the average of a set of values.

Smooth motion of the robot team is desirable, and a penalty is applied with respect to the velocity variation:


where denotes inner product of two vectors.

With the reward functions defined above, actor-critic-based DRL algorithms [15] can be applied to solve the optimization problem in (1). In actor critic-based DRL algorithms, a critic is used to guide the update of actor (i.e., policy ). The critic is based on the value function with parameter . The value function is a map , which returns the expected accumulated reward under an observation . Once the form of parametric function aproximators for the policy and the value function are determined, actor critic-based DRL algorithms can then be applied to derive a centralized policy .

Iii-B Decentralized policy learning

Though a policy can be derived with the method mentioned above, the policy cannot be applied in a decentralized manner since it takes the global observation of the robot team as input, which prohibits the decentralized execution. There exists literatures [16, 17] about deriving decentralized policies for multi-robot tasks, which update the policy for each robot independently. Considering the demand of group behaviors in the navigation task, we adopt a mechanism to obtain the decentralized robot-level policy while updating the joint policy of the robot team during training. Similar mechanism has also succeeded in solving multi-agent graphical games [18].

Specifically, decentralized policies are derived by adding a constraint to the policy and carrying out the centralized policy learning in Section III-A. Note that the local observation is sufficient for the -th robot to identify itself, since the robot can infer its position in the team from the relative positions of other robots . It implies that all robots can share a robot-level policy to make decision as long as the capacity of the function approximator for is high enough. In other words, the team-level policy can be constructed with the robot-level policy . In particular, each robot computes its’ action with the robot-level policy based on its’ local observation . Then the team-level action is obtained by concatenating actions of all robots. The process above can be formulated as a constraint to the team-level policy , where the parameter is shared by all robot-level policies. By adding this constraint, a robot-level policy can be obtained once we find the team-level policy by the centralized learning in Section III-A. Hence, the robot team can navigate in a decentralized manner with the robot-level policy during the execution.

Considering the high capacity requirements, a neural network is used as the function approximator for the robot-level policy

. As shown in Fig. 4, the team-level policy is constructed with a robot-level policy shared by all robots. In the robot-level policy

, the local observations of each robot sequentially go through an observation feature extraction module (wrapped by red rectangles in Fig.

4) and subsequent modules. In the observation feature extraction module, two one-dimensional convolutional layers [19, 20]

with ReLU (denoted by Conv-ReLU) are adopted so as to process raw sensor data

effectively. Since we consider homogeneous robots, the policy function should be insensitive to the order of relative positions in the teammate observation . Therefore, we process the teammate observation with an order insensitive submodule. In particular, all relative positions () in

are separately processed by a shared fully connected layer with ReLU (denoted by FC-ReLU) and the outputs are averaged subsequently. During execution, each robot samples its action from a Gaussian distribution

, where the mean is computed by robot-level policy based on robot’s local observation

and the variance

is obtained during training.

Fig. 4: Policy network architecture.
Fig. 5: Value function network architecture.

Though the value function is centralized in the sense that it takes the global observation of the robot team as input, it does not prohibit the decentralized execution since it is not used during execution. A neural network is also used as the non-linear function approximator for the value function . As shown in Fig. 5, the architecture of value function network is similar to the policy network and consists of observation feature extraction modules (wrapped by blue rectangles in Fig. 5) and subsequent modules (i.e., ). Instead of sharing the whole subnetworks by all robots like the policy network, only the observation feature extraction module is shared in the value function network. The insight of not sharing the subsequent modules is that the expected accumulated reward depends on the global observations of the robot team. Since homogeneous robots are considered in this paper, the value function should be insensitive to the order of robots . This property is guaranteed by sharing the observation feature extraction module among the robots as well as averaging extracted features before further processing.

Iii-C Optimization with proximal policy optimization

1 Initialize and ;
2 ;
3 for iteration  do
       // Collect data in parallel
4       for worker  do
5             for time step  do
6                   for robot  do
7                         The -th robot computes the action with the robot-level policy based on the local observation ;
9                   end for
10                  The robot team interacts with the environment with a team-level action by concatenating robot-level actions (i.e., );
12             end for
13            Collect trajectories;
15       end for
      // Update parameters
16       for epoch  do
             // Update policy
17             Update w.r.t the policy objective function of PPO, which depends on the value function ;
             // Update value function
18             Update w.r.t the value function objective function of PPO;
20       end for
21      , ;
23 end for
Algorithm 1 Optimizing the decentralized policy with PPO

With the designed policy network and value function network, a decentralized policy can be derived in a centralized learning and decentralized executing paradigm once an actor-critic-based DRL algorithm is applied. In this paper, the recently proposed proximal policy optimization (PPO) [21], which is a state-of-the-art DRL algorithm and guarantees stable performance improvement during training.

As summarized in Algorithm 1, the optimization process alternates collecting data and updating parameters. In the data collection stage, multiple workers (a worker refers to a robot team in our case) interact with the environment in parallel and collect a batch of trajectories, which are used in the update step. In the parameter update procedure, the parameters and are optimized with respect to the objective functions of PPO and respectively.

Iv Simulations and Experiments

In this section, simulations and real-world experiments on a team of three robots are carried out to illustrate the effectiveness of the proposed method.

Iv-a Training configurations

The interactive simulation environment for training is implemented based on the OpenAI Gym [22]. An open-source version of the PPO [23]

based on Pytorch

[24] is adapted to optimize the policies. All the programs run on a computer with Intel(R) Xeon(R) E5-2620 v4 CPU and Nvidia TITAN X (Pascal) GPU. The on-line running of the policy takes about ms on CPU and ms on GPU. The on-line computational complexity is low and efficient for real-time implementations. Detailed training settings are defined as follows:

  • Robot team settings: robot number , robot radius , and robot maximum velocity . Each robot is equipped with laser ejectors (i.e., a laser ejector every ) whose sensing range is m. All robots are holonomic.

  • Environment settings: environment size is . Obstacles of different sizes and shapes are generated randomly (look like scenarios in Fig. 8). Time interval . In each run, the positions of the robot team and the goal are randomly initialized.

  • Reward settings: efficiency term , weights of reward functions , goal reward , collision penalty , radius of goal zone , and communication range threshold .

  • PPO settings: number of iteration , epoch , number of workers , and maximum time step . Adam optimizer [25] is used to optimize parameters and . The learning rates ( and ) in the first half and the second half of the training procedure are set to and respectively.

Iv-B Simulation results on various scenarios

We verify the effectiveness of our proposed method by applying the learned policy in different scenarios.

(a) Block.

(b) Gear.

(c) Groove.
Fig. 6: Avoid obstacles of different shapes. Robots are pink, brown and blue circles. Obstacles are grey polygons. The goal is the dark red circle. Shaded discs are lidar measurements. Green circle is the centroid of the robot team and colored dashed lines are the trajectories of robots with corresponding colors. The color scheme applies across all simulation results.

Iv-B1 Obstacles of different shapes

When operating in an unknown complex environment, a multi-robot team may encounter obstacles of different shapes. Therefore, the capacity of handling the diversity of obstacle shape is an important metric to evaluate the performance of a navigation policy. In order to assess the ability of avoiding obstacles of different shapes, we set up scenarios where obstacles of different shapes lie in between the robot team and the goal. In particular, test cases cover both convex and concave obstacles including block, gear and groove. As shown in Fig. 6, the policy succeeds in all test cases, which implies the policy’s capacity of avoiding obstacles of different shapes.

Iv-B2 Narrow scenarios

(a) Narrow passageway.

(b) Corridor with corners.
Fig. 7: Navigation in narrow scenarios.

Narrow scenes are common in real world and is challenging in the sense that it imposes strict spacial constraints to the robot team. To verify the effectiveness of the derived policy in narrow scenes, we set up scenarios including narrow passageway and corridor with corners. It can be seen in the Fig. 7 that the policy successfully navigates the centroid of the robot team to the goal in different narrow scenarios.

Iv-B3 Random scenarios

Random scenarios are frequently used to evaluate the performance of a navigation policy. We randomly generate scenarios with obstacles of different shapes and sizes to test the derived policy. As shown in Fig. 8, the policy successfully navigates the centroid of the robot team to the goal in random scenarios.

Simulation results show that our method can navigate the robot team through various complex scenarios. Note that our policy is run in a decentralized manner and computes continuous velocity control commands with raw laser measurements, which implies that our policy can achieve end-to-end decentralized cooperation in difficult environments.

(a) Random scenario 1.

(b) Random scenario 2.
Fig. 8: Navigation in random scenarios.

Iv-C A comparison with a rule-based method

A comparison with a rule-based method (in particular, an artificial potential field (APF)-based method [3]) is conducted to investigate the performance of the proposed method. As shown in Fig. 9, our policy navigates the robot team through the narrow passageway, while the APF-based method gets stuck at the entrance since the repulsive forces from obstacles neutralize the attractive force from the goal. Note that in this experiment, our policy computes velocity control commands merely based on local observations while the APF-based method can obtain global environment information.

(a) Our method.

(b) APF-based method.
Fig. 9: A comparison with an APF-based method.

Iv-D A comparison with other decentralization mechanism

(a) Convergence.
(b) Success rate.
(c) Connectivity.
(d) Efficiency.
Fig. 10: A quantitative comparison among independent PPO, MAPPO and our method.

To verify the effectiveness of the adopted mechanism for centralized learning and decentralized execution, we conduct quantitative comparisons with ablated variants of our method called independent PPO (which ablates the centralized learning procedure) and MAPPO (which ablates the joint policy update). In particular, the independent PPO follows the paradigm of independent Q-learning [16], where each robot independently updates the policy and treats other robots as a part of the environment. It implies that the value function takes the local observation as input. The MAPPO applies the mechanism proposed in MADDPG [17] where value function takes the global observation as input while robot-level policy for each agent is updated separately. In comparison, our method exploits the centralized learning and updates the team-level joint policy for the robot team.

The following quantitative metrics are adopted to evaluate the performance:

  • Success rate: The ratio of the success times over the number of the total test cases , i.e., .

  • Connectivity: The negative average violation of the distance constraint over a trajectory, i.e., , where is the average violation of distance constrain at time step and is the trajectory length.

  • Efficiency: The ratio of the lower bound of travel time (dividing Euclidean distance from the initial position to the goal by the maximum velocity) over the actual travel time , i.e., .

For all quantitative metrics, higher values indicate better performance. For efficiency and connectivity, we only consider successful navigations and neglect failure cases. To alleviate random errors, 3 policies are trained with different random seeds for each method. The final quantitative metrics of a method is the average performance of 3 derived policies.

100 random scenarios (look like scenarios in Fig. 8) are generated to evaluate policies derived by independent PPO, MAPPO and our method. The curves of accumulated total rewards over training process are shown in Fig. (a)a. It can be seen that our method converges to much higher accumulated total rewards in comparison with independent PPO and MAPPO. As for quantitative metrics, our method significantly outperforms independent PPO and MAPPO in success rate (Fig. (b)b) and connectivity (Fig. (c)c). It can be seen from Fig. (d)d that the MAPPO slightly surpass our method in efficiency, while the performance of the former severely degenerates with respect to the success rate and connectivity. From the quantitative results, we can conclude that both centralized learning and joint policy update are essential for the navigation task, which highly demands group behaviors.

It’s noticed that the policy derived by our method may fail in some of the navigation tasks. A typical reason for the failures can be demonstrated by Fig. 11. When the robot team encounters a crossway and has to decide which way to go (following the green arrow or the red arrow), it’s difficult for the policy to predict the dead end beforehand merely based on local observations. Therefore, trying to approach the goal directly (following the red arrow) is a reasonable choice, which is also a behavior (reaching the goal as fast as possible) encouraged by the reward function (2). Consequently, the robot team runs into the dead end and fails the navigation. Our method currently relies on guide points from higher level path planning algorithm to avoid such failures.

Fig. 11: A demonstration of typical failure cases. The robot team cannot predict the dead end merely based on local observations. Therefore, the robot team may choose to follow the red arrow so as to reach the goal as fast as possible and runs into the dead end.
(a) The chasis of an UGV with an onboard computer (Nvidia Jetson TX2).
(b) A holonomic UGV equipped with a 2D laser scanner (RPLIDAR-A2).
(c) A robot team of three holonomic UGVs.
Fig. 12: Real-world experimental platform.
(a) Initial position of the robot team and obstacles.
(b) The robot team switches to line formation.
(c) The robot team reaches the goal.
Fig. 13: Snapshots of a team of 3 holonomic UGVs navigating through obstacles.

Iv-E Real-world experimental results

Real-world experiments are performed to verify the practicability of the proposed method. Three holonomic UGVs (as shown in Fig. 12) are built to form a robot team, which serves as a platform to test derived policies. In order to enable real-time end-to-end decentralized execution, each UGV is equipped with a 2D-laser scanner (RPLIDAR-A2) and an onboard computer (Nvidia Jetson TX2). We wrap the UGVs with boxes and set laser scanners at different heights to make laser scanners detect teammate UGVs. The OptiTrack motion capture system is used to provide the positions of the robots.

The snapshots of an experiment are shown in Fig. 13. In this experiment, a narrow corridor lies between the robot team (which is initialized on the left) and the goal (which is set on the right). It can be seen that the derived policy switches the robot team into line formation to go through the narrow corridor and successfully navigates the centroid of the robot team to the goal. Note that the policy is run in a decentralized manner and directly maps raw laser measurements into continuous velocities without constructing obstacle maps.

We think the success of sim-to-real deployment partially credits to the input consistency. In particular, the inputs (e.g., laser measurements) of the policy are relatively consistent in simulations and real-world environments (compared with RGB inputs which are heavily affected by light). Moreover, scene pattern similarity between simulations and real-world experiments is also an important factor for the success. In spite of the consistency and similarity, it should be pointed out that the sim-to-real deployment is challenging because both the observations and actions in real world are fairly noisy and imperfect in comparison with simulations. The success of real-world experiments shows the robustness and practicability of the proposed method.

A demo video of the multi-robot navigation can be seen in https://youtu.be/G8UZU1hk9K4.

V Conclusions

This paper proposes a DRL-based method to derive decentralized policies for a robot team to navigate through unknown complex environments safely while maintaining connectivity. The derived policy directly maps raw laser measurements to continuous velocity control commands without constructing the obstacle maps. By means of the learned policies, the robots cooperatively plan the motions to accomplish the navigation task of the robot team using each robot’s local observations. Simulation results in various scenarios verify the effectiveness of the proposed method. Furthermore, indoor real-world experimental results demonstrate that the three-robot team can navigate through the dense obstacles. In the future work, the learning-based navigation approach will be further studied in robot teams of larger scale.


This work is supported by Major Program of Science and Technology Planning Project of Guangdong Province (2017B010116003), NSFC-Shenzhen Robotics Projects (U1613211), and Guangdong Natural Science Foundation (1614050001452, 2017A030310050).