Multi-robot systems have broad applications such as search and rescue, forest inspection, agricultural spraying, and collaborative transportation. When a team of robots collaboratively accomplishes tasks in complex environments, multi-robot navigation techniques are essential to guarantee the safety of the robot team. When the workspaces of the robot team are unknown and complex, the navigation policies are required to be both on-line and able to handle obstacles of different shapes. In addition, the communication range of each robot is usually constrained, so the navigation policy should take connectivity into consideration to ensure communication among the robots. Furthermore, decentralized navigation strategies are preferable to centralized ones in enabling autonomous decision making for each robot.
This paper studies decentralized multi-robot navigation in unknown complex environments as shown in Fig. 1(a) and Fig. 1(b), where the geometric centroid of the formation is required to reach the goal position efficiently while avoiding collisions and maintaining connectivity of the robot team. A deep reinforcement learning (DRL)-based approach is proposed to accomplish the multi-robot navigation task. By means of a centralized-learning, decentralized-execution mechanism as shown in Fig. 2, a decentralized end-to-end robot-level policy can be obtained, which directly maps raw lidar range data to continuous velocity control commands without the necessity of constructing obstacle maps.
I-A Related work
There exists extensive research work on multi-robot navigation, which can be categorized into rule-based and learning-based strategies. Rule-based approaches include the leader-follower scheme, artificial potential field (APF), graph theory, consensus theory, model predictive control, virtual structure [7, 8], etc.
In rule-based navigation methods, an obstacle map must be constructed from the sensor data, and the performance of these methods highly relies on the quality of the constructed map. In addition, real-time mapping of the navigation environment with the robots’ onboard sensors is challenging and sometimes computationally prohibitive.
Learning-based methods are alternatives to rule-based methods. They can derive end-to-end policies that map raw sensor data to control commands without the necessity of constructing obstacle maps.
Most of the existing learning-based navigation methods focus on single-robot settings [9, 10, 11]. In the case of multi-robot systems, the research work mainly focuses on local collision avoidance [12, 13, 14], where multiple robots move to their respective designated goal positions without colliding with each other or the obstacles. As shown in Fig. 1(c), it is difficult to assign a goal position to each robot beforehand when the robot team navigates through unknown cluttered environments. Moreover, as the communication range of each robot is usually constrained, information exchange among the robots cannot be ensured if the connectivity of the robot team is not considered.
In this paper, a DRL-based method is proposed to accomplish the multi-robot navigation task, where the geometric centroid of the robot team aims to reach the goal position while avoiding collisions and maintaining connectivity.
I-B Contribution and paper organization
In this paper, the multi-robot navigation problem is studied. The main contributions of the paper are presented as follows:
A novel DRL-based method is proposed to navigate the multi-robot team through unknown cluttered environments in a decentralized manner. The derived policy directly maps the raw lidar measurements into the continuous velocity control commands of the robots without the necessity of constructing the obstacle maps.
Simulation results illustrate that the robot team can navigate through unknown complex environments based on the proposed method. Comparison results show that our proposed method (which makes decisions based on local observations) outperforms an APF-based method (which can obtain global environment information) in narrow scenarios.
A team of 3 holonomic unmanned ground vehicles (UGVs) is built to serve as a real-world policy verification platform. Indoor real-world experimental results show that the learned policy can successfully navigate a team of 3 holonomic UGVs through the dense obstacles.
The remainder of this paper is organized as follows: In Section II, the multi-robot navigation problem is described, and the problem formulation is addressed. The proposed DRL-based navigation method is presented in Section III. In Section IV, numerical and experimental results are presented to illustrate its effectiveness. Section V concludes the paper.
II Problem Formulation
In this section, the multi-robot navigation problem is formulated as a partially observable Markov decision process (POMDP).
In this paper, a team of homogeneous robots is considered, where the robots are holonomic with equal radius r. Assume there are N robots and M static obstacles in a navigation environment. The obstacles are modeled as polygons, and the j-th obstacle (j = 1, ..., M) is described by the positions of its vertices. At each time step t, the i-th robot (i = 1, ..., N) has a velocity and a position. The robot team computes the control command based on the local observations of all robots by sampling from the distribution generated by the policy, where θ denotes the policy parameters. Within the time horizon Δt, the positions of the robots are updated according to the control value.
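The position update above follows simple holonomic kinematics; a minimal sketch (the function and variable names are ours, assuming a fixed control interval dt):

```python
def step(positions, velocities, dt):
    """Holonomic update: each robot moves with its commanded velocity for dt seconds."""
    return [(x + vx * dt, y + vy * dt)
            for (x, y), (vx, vy) in zip(positions, velocities)]

# Example: two robots with a 0.1 s control interval.
p = [(0.0, 0.0), (1.0, 0.0)]
v = [(1.0, 0.0), (0.0, 2.0)]
p_next = step(p, v, 0.1)  # -> [(0.1, 0.0), (1.0, 0.2)]
```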
The local observation of the i-th robot consists of its current velocity, the relative position of the robot team’s goal, the relative positions of the other robots, and the partial environment information. We assume the local observation is always obtainable for the policy during navigation. The forms of the partial environment information and the control value can vary depending on the configuration of the multi-robot system. In this paper, we study the case where robots perceive the environment with 2D laser scanners and take velocities as control commands. Therefore, the partial environment information is the raw 2D laser measurement of the i-th robot as shown in Fig. 3(a), and the control value is the velocity command. Vectors (e.g., the velocity control command, the current velocity, the relative position of the goal, and the relative positions of the other robots) are represented in the robots’ local frames as shown in Fig. 3(b).
Accordingly, the multi-robot navigation problem can be formulated as a POMDP, denoted as a 6-tuple (S, A, O, T, Ω, R), where S is the state space, A is the action space, O is the observation space, T is the state transition distribution, Ω is the emission distribution, and R is the reward function.
Specifically, the state s_t represents the states of all the robots at time step t, including the positions and velocities of all the robots, all the obstacles, and the goal position. The vectors a_t and o_t denote the actions and observations of all the robots. The state transition distribution T determines p(s_{t+1} | s_t, a_t), the distribution of the next state conditioned on taking action a_t in state s_t. The emission distribution Ω determines p(o_t | s_t), the distribution of the observations accessible to the robots in state s_t. Note that both the state transition distribution and the emission distribution are defined implicitly by the physical meaning of the states and observations. The reward function is a map R(s_t, a_t), which returns the scalar immediate reward of taking action a_t in state s_t. The design of the reward function is elaborated in Section III.
Having formulated the multi-robot navigation problem as a POMDP, we aim to find the optimal parameter θ* for the policy that maximizes the expected accumulated reward J(θ), where the expectation is taken over trajectories τ of the robot team up to the maximum time step T, distributed according to the policy, the state transition distribution, and the emission distribution. The search for the optimal parameter can be formulated as:
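A standard reconstruction of the optimization problem (1), consistent with the POMDP notation above (our symbols, not the paper's original typography):

```latex
\theta^{*} \;=\; \arg\max_{\theta}\, J(\theta)
          \;=\; \arg\max_{\theta}\,
          \mathbb{E}_{\tau \sim p_{\theta}(\tau)}
          \left[ \sum_{t=0}^{T} R(s_t, a_t) \right]
```

where p_θ(τ) factorizes through the state transition distribution T and the emission distribution Ω.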
In this section, we first present a DRL-based method to derive a centralized policy, where the whole robot team is regarded as a single entity. Then the method is adapted to derive a decentralized policy to enhance the autonomy of each robot.
III-A Centralized policy learning
In this part, the robot team is regarded as a single entity. To solve the POMDP defined in Section II, a reward function is presented, which rewards desirable robot team behaviors and penalizes undesirable ones. It can be formulated as:
where a negative constant serves as a penalty to motivate the robot team to reach the goal position as fast as possible; the remaining reward terms correspond to the objectives of reaching the goal position, avoiding collisions, maintaining connectivity, and achieving smooth motion of the robots, respectively, each scaled by a corresponding weight. The reward terms are detailed as follows.
The robot team is rewarded when the geometric centroid of the robot team reaches the goal position, and receives a reward or penalty depending on whether the centroid moves toward or away from the goal. The corresponding reward function can be formulated as:
where the radius of the goal zone determines whether the centroid has reached the goal.
When any robot collides with another robot or an obstacle in the environment, the robot team is penalized:
where the collision penalty is a negative value.
The robot team receives a penalty if connectivity is not guaranteed, i.e., if the distances among the robots exceed the communication range threshold:
where the mean operator computes the average of a set of values.
Smooth motion of the robot team is desirable, so a penalty is applied with respect to the velocity variation:
where the operator denotes the inner product of two vectors.
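The reward shaping above can be sketched as follows. All constants, weights, and thresholds are illustrative placeholders (not the paper's tuned values), and the smoothness term uses one plausible form of the velocity-variation penalty:

```python
import math

def team_reward(centroid, prev_centroid, goal, collided, pair_dists, v, v_prev,
                goal_radius=0.5, comm_range=3.0,
                r_time=-0.05, r_goal=20.0, r_collision=-20.0,
                w_approach=2.5, w_connect=0.1, w_smooth=0.1):
    """Sum of the four shaped terms plus a constant time penalty (placeholder weights)."""
    d, d_prev = math.dist(centroid, goal), math.dist(prev_centroid, goal)
    # Goal term: big bonus inside the goal zone, otherwise reward progress toward it.
    goal_term = r_goal if d < goal_radius else w_approach * (d_prev - d)
    # Collision term: flat penalty if any robot hits a teammate or an obstacle.
    collision_term = r_collision if collided else 0.0
    # Connectivity term: penalize the average violation of the communication range.
    violations = [max(0.0, dist - comm_range) for dist in pair_dists]
    connect_term = -w_connect * sum(violations) / len(violations)
    # Smoothness term: penalize velocity variation between consecutive steps
    # (one plausible form; the paper's exact expression uses an inner product).
    dv = [a - b for a, b in zip(v, v_prev)]
    smooth_term = -w_smooth * math.sqrt(sum(x * x for x in dv))
    return r_time + goal_term + collision_term + connect_term + smooth_term
```

With these placeholders, moving the centroid toward the goal yields a higher reward than moving away, and any collision sharply lowers the return, which is the qualitative behavior the shaping is meant to encourage.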
With the reward functions defined above, actor-critic-based DRL algorithms can be applied to solve the optimization problem in (1). In actor-critic-based DRL algorithms, a critic is used to guide the update of the actor (i.e., the policy). The critic is based on a value function with its own parameters, which maps an observation to the expected accumulated reward. Once the forms of the parametric function approximators for the policy and the value function are determined, actor-critic-based DRL algorithms can be applied to derive a centralized policy.
III-B Decentralized policy learning
Though a policy can be derived with the method above, it cannot be applied in a decentralized manner since it takes the global observation of the robot team as input, which prohibits decentralized execution. There exist studies [16, 17] on deriving decentralized policies for multi-robot tasks, which update the policy for each robot independently. Considering the demand for group behaviors in the navigation task, we adopt a mechanism that obtains a decentralized robot-level policy while updating the joint policy of the robot team during training. A similar mechanism has also succeeded in solving multi-agent graphical games.
Specifically, decentralized policies are derived by adding a constraint to the policy and carrying out the centralized policy learning in Section III-A. Note that the local observation is sufficient for the i-th robot to identify itself, since the robot can infer its position in the team from the relative positions of the other robots. This implies that all robots can share a robot-level policy for decision making as long as the capacity of the function approximator is high enough. In other words, the team-level policy can be constructed from the robot-level policy. In particular, each robot computes its action with the robot-level policy based on its local observation; the team-level action is then obtained by concatenating the actions of all robots. This process can be formulated as a constraint on the team-level policy, where the parameters are shared by all robot-level policies. By adding this constraint, a robot-level policy is obtained once the team-level policy is found by the centralized learning in Section III-A. Hence, the robot team can navigate in a decentralized manner with the robot-level policy during execution.
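The constraint above simply builds the team-level action by applying one shared robot-level policy to each robot's own observation; a minimal sketch (the toy policy below is a stand-in for the trained network, not the paper's model):

```python
def team_policy(robot_policy, local_observations):
    """Decentralized execution: every robot runs the same shared policy on its own observation."""
    return [robot_policy(obs) for obs in local_observations]

# Stand-in robot-level policy: head toward the goal seen in the local observation.
def toy_policy(obs):
    gx, gy = obs["goal"]           # relative goal position in the robot's frame
    norm = max((gx**2 + gy**2) ** 0.5, 1e-6)
    return (gx / norm, gy / norm)  # unit-speed velocity command

actions = team_policy(toy_policy, [{"goal": (1.0, 0.0)}, {"goal": (0.0, 2.0)}])
```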
Considering the high capacity requirements, a neural network is used as the function approximator for the robot-level policy. As shown in Fig. 4, the team-level policy is constructed from a robot-level policy shared by all robots. In the robot-level policy, the local observations of each robot sequentially go through an observation feature extraction module (wrapped by red rectangles in Fig. 4) and subsequent modules. In the observation feature extraction module, two one-dimensional convolutional layers [19, 20] with ReLU activations (denoted by Conv-ReLU) are adopted to process the raw sensor data effectively. Since we consider homogeneous robots, the policy function should be insensitive to the order of the relative positions in the teammate observation. Therefore, we process the teammate observation with an order-insensitive submodule: all relative positions are separately processed by a shared fully connected layer with ReLU (denoted by FC-ReLU), and the outputs are averaged subsequently. During execution, each robot samples its action from a Gaussian distribution, where the mean is computed by the robot-level policy based on the robot’s local observation and the variance is obtained during training.
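The order-insensitive submodule works because a shared per-teammate map followed by an average is invariant to teammate ordering; a minimal sketch with a toy feature map standing in for the shared FC-ReLU layer (any per-teammate map behaves the same way under this construction):

```python
def encode_teammates(rel_positions, feature_map):
    """Apply a shared map to each relative position, then average the features."""
    feats = [feature_map(p) for p in rel_positions]
    dim = len(feats[0])
    return tuple(sum(f[i] for f in feats) / len(feats) for i in range(dim))

# Toy stand-in for the shared FC-ReLU layer.
def toy_map(p):
    x, y = p
    return (max(0.0, x + y), max(0.0, x - y))  # crude "ReLU features"

a = encode_teammates([(1.0, 0.0), (0.0, 1.0)], toy_map)
b = encode_teammates([(0.0, 1.0), (1.0, 0.0)], toy_map)  # permuted teammate order
# a == b: the encoding is insensitive to teammate ordering.
```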
Though the value function is centralized in the sense that it takes the global observation of the robot team as input, this does not prohibit decentralized execution since the value function is not used during execution. A neural network is also used as the non-linear function approximator for the value function. As shown in Fig. 5, the architecture of the value function network is similar to that of the policy network and consists of observation feature extraction modules (wrapped by blue rectangles in Fig. 5) and subsequent modules. Instead of sharing the whole subnetwork among all robots as in the policy network, only the observation feature extraction module is shared in the value function network. The rationale for not sharing the subsequent modules is that the expected accumulated reward depends on the global observation of the robot team. Since homogeneous robots are considered in this paper, the value function should be insensitive to the order of the robots. This property is guaranteed by sharing the observation feature extraction module among the robots and averaging the extracted features before further processing.
III-C Optimization with proximal policy optimization
With the designed policy network and value function network, a decentralized policy can be derived in a centralized-learning, decentralized-execution paradigm once an actor-critic-based DRL algorithm is applied. In this paper, we apply the recently proposed proximal policy optimization (PPO), a state-of-the-art DRL algorithm that provides stable performance improvement during training.
As summarized in Algorithm 1, the optimization process alternates between collecting data and updating parameters. In the data collection stage, multiple workers (a worker refers to a robot team in our case) interact with the environment in parallel and collect a batch of trajectories, which are used in the update step. In the parameter update procedure, the policy and value function parameters are optimized with respect to the corresponding PPO objective functions.
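At the core of the policy update is PPO's clipped surrogate objective; a per-sample sketch (epsilon is the standard clipping coefficient, not a value reported by the paper):

```python
def ppo_clip_objective(ratio, advantage, epsilon=0.2):
    """Clipped surrogate: take the pessimistic (minimum) of the unclipped and clipped
    terms, removing any incentive to push the probability ratio outside
    [1 - epsilon, 1 + epsilon] in a single update."""
    clipped = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped * advantage)

# With a positive advantage, a ratio beyond 1 + epsilon gains nothing extra;
# with a negative advantage, shrinking the ratio below 1 - epsilon is not rewarded.
```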
IV Simulations and Experiments
In this section, simulations and real-world experiments on a team of three robots are carried out to illustrate the effectiveness of the proposed method.
IV-A Training configurations
An open-source implementation based on PyTorch is adapted to optimize the policies. All the programs run on a computer with an Intel(R) Xeon(R) E5-2620 v4 CPU and an Nvidia TITAN X (Pascal) GPU. The on-line running of the policy takes about ms on CPU and ms on GPU. The on-line computational complexity is low enough for real-time implementation. Detailed training settings are as follows:
Robot team settings: the robot number, robot radius, and robot maximum velocity are fixed. Each robot is equipped with evenly spaced laser emitters with a limited sensing range. All robots are holonomic.
Environment settings: the environment size and time interval are fixed. Obstacles of different sizes and shapes are generated randomly (similar to the scenarios in Fig. 8). In each run, the positions of the robot team and the goal are randomly initialized.
Reward settings: the efficiency term, reward weights, goal reward, collision penalty, radius of the goal zone, and communication range threshold are fixed constants.
PPO settings: the number of iterations, epochs, number of workers, and maximum time step are fixed. The Adam optimizer is used to optimize the policy and value function parameters, with separate learning rates for the first half and the second half of the training procedure.
IV-B Simulation results on various scenarios
We verify the effectiveness of our proposed method by applying the learned policy in different scenarios.
IV-B1 Obstacles of different shapes
When operating in an unknown complex environment, a multi-robot team may encounter obstacles of different shapes. Therefore, the capacity to handle diverse obstacle shapes is an important metric for evaluating the performance of a navigation policy. In order to assess the ability to avoid obstacles of different shapes, we set up scenarios where obstacles of different shapes lie between the robot team and the goal. In particular, the test cases cover both convex and concave obstacles, including blocks, gears, and grooves. As shown in Fig. 6, the policy succeeds in all test cases, which demonstrates its capacity for avoiding obstacles of different shapes.
IV-B2 Narrow scenarios
Narrow scenes are common in the real world and are challenging in the sense that they impose strict spatial constraints on the robot team. To verify the effectiveness of the derived policy in narrow scenes, we set up scenarios including a narrow passageway and a corridor with corners. It can be seen in Fig. 7 that the policy successfully navigates the centroid of the robot team to the goal in different narrow scenarios.
IV-B3 Random scenarios
Random scenarios are frequently used to evaluate the performance of a navigation policy. We randomly generate scenarios with obstacles of different shapes and sizes to test the derived policy. As shown in Fig. 8, the policy successfully navigates the centroid of the robot team to the goal in random scenarios.
Simulation results show that our method can navigate the robot team through various complex scenarios. Note that our policy runs in a decentralized manner and computes continuous velocity control commands from raw laser measurements, which implies that it achieves end-to-end decentralized cooperation in difficult environments.
IV-C A comparison with a rule-based method
A comparison with a rule-based method (in particular, an artificial potential field (APF)-based method) is conducted to investigate the performance of the proposed method. As shown in Fig. 9, our policy navigates the robot team through the narrow passageway, while the APF-based method gets stuck at the entrance since the repulsive forces from obstacles neutralize the attractive force from the goal. Note that in this experiment, our policy computes velocity control commands merely based on local observations while the APF-based method can obtain global environment information.
IV-D A comparison with other decentralization mechanisms
To verify the effectiveness of the adopted mechanism for centralized learning and decentralized execution, we conduct quantitative comparisons with ablated variants of our method, called independent PPO (which ablates the centralized learning procedure) and MAPPO (which ablates the joint policy update). In particular, independent PPO follows the paradigm of independent Q-learning, where each robot independently updates its policy and treats the other robots as part of the environment; this implies that the value function takes the local observation as input. MAPPO applies the mechanism proposed in MADDPG, where the value function takes the global observation as input while the robot-level policy of each agent is updated separately. In comparison, our method exploits centralized learning and updates the team-level joint policy of the robot team.
The following quantitative metrics are adopted to evaluate the performance:
Success rate: the ratio of the number of successful runs to the total number of test cases.
Connectivity: the negative of the average violation of the distance constraint over a trajectory, where the per-time-step violation is averaged over the trajectory length.
Efficiency: the ratio of the lower bound of the travel time (the Euclidean distance from the initial position to the goal divided by the maximum velocity) to the actual travel time.
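The three metrics can be computed directly from logged runs; a minimal sketch (function and argument names are ours):

```python
def success_rate(n_success, n_total):
    """Fraction of test cases in which the centroid reached the goal."""
    return n_success / n_total

def connectivity(violations):
    """Negative mean violation of the pairwise distance constraint over a trajectory."""
    return -sum(violations) / len(violations)

def efficiency(start, goal, v_max, travel_time):
    """Ratio of the straight-line lower bound on travel time to the actual travel time."""
    dist = ((goal[0] - start[0]) ** 2 + (goal[1] - start[1]) ** 2) ** 0.5
    return (dist / v_max) / travel_time
```

For all three functions, a higher return value indicates better performance, matching the convention stated below.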
For all quantitative metrics, higher values indicate better performance. For efficiency and connectivity, we only consider successful navigations and neglect failure cases. To alleviate random errors, 3 policies are trained with different random seeds for each method. The final quantitative metrics of a method are the average performance of the 3 derived policies.
100 random scenarios (similar to the scenarios in Fig. 8) are generated to evaluate the policies derived by independent PPO, MAPPO, and our method. The curves of accumulated total rewards over the training process are shown in Fig. 10(a). It can be seen that our method converges to much higher accumulated total rewards than independent PPO and MAPPO. As for the quantitative metrics, our method significantly outperforms independent PPO and MAPPO in success rate (Fig. 10(b)) and connectivity (Fig. 10(c)). It can be seen from Fig. 10(d) that MAPPO slightly surpasses our method in efficiency, while its performance severely degenerates with respect to success rate and connectivity. From the quantitative results, we conclude that both centralized learning and joint policy update are essential for the navigation task, which highly demands group behaviors.
It is noticed that the policy derived by our method may still fail in some navigation tasks. A typical cause of failure is illustrated in Fig. 11. When the robot team encounters a crossway and has to decide which way to go (following the green arrow or the red arrow), it is difficult for the policy to predict the dead end beforehand merely from local observations. Therefore, approaching the goal directly (following the red arrow) is a reasonable choice, and this behavior (reaching the goal as fast as possible) is encouraged by the reward function (2). Consequently, the robot team runs into the dead end and fails the navigation. Our method currently relies on guide points from a higher-level path planning algorithm to avoid such failures.
IV-E Real-world experimental results
Real-world experiments are performed to verify the practicability of the proposed method. Three holonomic UGVs (as shown in Fig. 12) are built to form a robot team, which serves as a platform to test derived policies. In order to enable real-time end-to-end decentralized execution, each UGV is equipped with a 2D-laser scanner (RPLIDAR-A2) and an onboard computer (Nvidia Jetson TX2). We wrap the UGVs with boxes and set laser scanners at different heights to make laser scanners detect teammate UGVs. The OptiTrack motion capture system is used to provide the positions of the robots.
The snapshots of an experiment are shown in Fig. 13. In this experiment, a narrow corridor lies between the robot team (which is initialized on the left) and the goal (which is set on the right). It can be seen that the derived policy switches the robot team into line formation to go through the narrow corridor and successfully navigates the centroid of the robot team to the goal. Note that the policy is run in a decentralized manner and directly maps raw laser measurements into continuous velocities without constructing obstacle maps.
We attribute part of the success of the sim-to-real deployment to input consistency. In particular, the inputs (e.g., laser measurements) of the policy are relatively consistent between simulations and real-world environments (compared with RGB inputs, which are heavily affected by lighting). Moreover, the similarity of scene patterns between simulations and real-world experiments is also an important factor in the success. In spite of this consistency and similarity, it should be pointed out that sim-to-real deployment is challenging because both the observations and actions in the real world are fairly noisy and imperfect in comparison with simulations. The success of the real-world experiments shows the robustness and practicability of the proposed method.
A demo video of the multi-robot navigation can be seen in https://youtu.be/G8UZU1hk9K4.
This paper proposes a DRL-based method to derive decentralized policies for a robot team to navigate through unknown complex environments safely while maintaining connectivity. The derived policy directly maps raw laser measurements to continuous velocity control commands without constructing obstacle maps. By means of the learned policies, the robots cooperatively plan their motions to accomplish the navigation task using each robot’s local observations. Simulation results in various scenarios verify the effectiveness of the proposed method. Furthermore, indoor real-world experimental results demonstrate that the three-robot team can navigate through dense obstacles. In future work, the learning-based navigation approach will be further studied on robot teams of larger scale.
This work is supported by Major Program of Science and Technology Planning Project of Guangdong Province (2017B010116003), NSFC-Shenzhen Robotics Projects (U1613211), and Guangdong Natural Science Foundation (1614050001452, 2017A030310050).
-  H. G. Tanner and A. Kumar, “Towards decentralization of multi-robot navigation functions,” in IEEE Int. Conf. on Robotics and Automation, 2005, pp. 4132–4137.
-  T. J. Koo and S. M. Shahruz, “Formation of a group of unmanned aerial vehicles (UAVs),” in American Control Conf., 2001, pp. 69–74.
-  T. Balch and M. Hybinette, “Social potentials for scalable multi-robot formations,” in IEEE Int. Conf. on Robotics and Automation, 2000, pp. 73–80.
-  A. Wasik, J. N. Pereira, R. Ventura, P. U. Lima, and A. Martinoli, “Graph-based distributed control for adaptive multi-robot patrolling through local formation transformation,” in IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, 2016, pp. 1721–1728.
-  R. Olfati-Saber, “Flocking for multi-agent dynamic systems: algorithms and theory,” IEEE Trans. on Automatic Control, vol. 51, no. 3, pp. 401–420, March 2006.
-  M. Saska, V. Vonásek, T. Krajník, and L. Přeučil, “Coordination and navigation of heterogeneous mav–ugv formations localized by a ‘hawk-eye’-like approach under a model predictive control scheme,” The Int. Journal of Robotics Research, vol. 33, no. 10, pp. 1393–1412, 2014.
-  J. Alonso-Mora, S. Baker, and D. Rus, “Multi-robot navigation in formation via sequential convex programming,” in IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, 2015, pp. 4634–4641.
-  J. Alonso-Mora, E. Montijano, M. Schwager, and D. Rus, “Distributed multi-robot formation control among obstacles: A geometric and optimization approach with consensus,” in IEEE Int. Conf. on Robotics and Automation, 2016, pp. 5356–5363.
-  M. Duguleana and G. Mogan, “Neural networks based reinforcement learning for mobile robots obstacle avoidance,” Expert Systems with Applications, vol. 62, no. C, pp. 104–115, 2016.
-  L. Xie, S. Wang, A. Markham, and N. Trigoni, “Towards monocular vision based obstacle avoidance through deep reinforcement learning,” arXiv preprint arXiv:1706.09829, 2017.
-  G. Kahn, A. Villaflor, B. Ding, P. Abbeel, and S. Levine, “Self-supervised deep reinforcement learning with generalized computation graphs for robot navigation,” arXiv preprint arXiv:1709.10489, 2017.
-  P. Long, W. Liu, and J. Pan, “Deep-learned collision avoidance policy for distributed multiagent navigation,” IEEE Robotics and Automation Letters, vol. 2, no. 2, pp. 656–663, 2017.
-  P. Long, T. Fan, X. Liao, W. Liu, H. Zhang, and J. Pan, “Towards optimally decentralized multi-robot collision avoidance via deep reinforcement learning,” arXiv preprint arXiv:1709.10082, 2017.
-  Y. F. Chen, M. Everett, M. Liu, and J. P. How, “Socially aware motion planning with deep reinforcement learning,” in IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, 2017, pp. 1343–1350.
-  R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press, 1998.
-  M. Tan, “Multi-agent reinforcement learning: Independent vs. cooperative agents,” in Proceedings of the Tenth Int. Conf. on Machine Learning, 1993, pp. 330–337.
-  R. Lowe, Y. Wu, A. Tamar, J. Harb, O. Pieter Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” in Advances in Neural Information Processing Systems 30, 2017, pp. 6379–6390.
-  Q. Zhang, D. Zhao, and F. L. Lewis, “Model-free reinforcement learning for fully cooperative multi-agent graphical games,” 2018 Int. Joint Conf. on Neural Networks, pp. 1–6, 2018.
-  K. Fukushima, “Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position,” Biological Cybernetics, vol. 36, no. 4, pp. 193–202, 1980.
-  Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural Computation, vol. 1, no. 4, pp. 541–551, 1989.
-  J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
-  G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “Openai gym,” arXiv preprint arXiv:1606.01540, 2016.
-  I. Kostrikov, “Pytorch implementations of reinforcement learning algorithms,” https://github.com/ikostrikov/pytorch-a2c-ppo-acktr, 2018.
-  A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” 2017.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.