Danger-aware Weighted Advantage Composition of Deep Reinforcement Learning for Robot Navigation

09/11/2018 ∙ by Wei Zhang, et al. ∙ National University of Singapore

Self-navigation, the ability to reach a goal automatically while avoiding collisions with obstacles, is a fundamental skill of mobile robots. Deep Reinforcement Learning (DRL) currently enables robots to navigate in more complex environments with less computation power than conventional methods. However, training a robot to learn goal-reaching and obstacle-avoidance skills simultaneously with DRL-based algorithms is difficult and time-consuming. In this paper, two Dueling Deep Q Networks (DQN), named Goal Network and Avoidance Network, are used to learn the goal-reaching and obstacle-avoidance skills individually. A novel method named danger-aware advantage composition is proposed to fuse the two networks without any redesigning or retraining. The composed Navigation Network enables the robot to reach a goal right behind a wall and to navigate in unknown complex environments safely and quickly.




I Introduction

Reinforcement learning (RL) enables robots to learn complex skills automatically by interacting with the environment. Deep reinforcement learning (DRL), which uses deep neural networks (DNN) to approximate the value function or policy, shows promising potential in controlling mobile robots. Compared to conventional methods, DRL-based end-to-end control is more efficient and needs less computation power. With the help of DRL, robots can navigate in more complex environments with fewer sensors. The most exciting results in robot navigation come from the field of DRL-based obstacle avoidance. Xie et al. achieve monocular vision-based obstacle avoidance by converting RGB images into depth images and using a Dueling Double Deep Q-network (DQN) to train robots [1]. In [2], an uncertainty-aware deep RL model is presented to automatically generate safe strategies for the collision avoidance task. Moreover, with the help of DRL, robots can successfully avoid obstacles under multiagent conditions [3] and in social environments [4]. DRL can also enable robots to learn the navigation skill directly. The most common approach is adding the target position [5, 6] or image [7] to the inputs of the DRL framework. However, training such a double-task DRL agent is much more difficult and takes much longer than training a pure obstacle-avoidance agent, because the navigation task has more requirements and more constraints, and its state space is larger. It is quite hard to balance the trade-off between goal reaching and obstacle avoidance when designing the rewarding system. For example, to avoid sparse rewards, a long moving distance is usually encouraged with positive rewards [1] because a longer survival time means a stronger obstacle-avoidance ability, whereas it is punished with negative rewards in [6] because the robot is also required to reach the goal as fast as possible. Hence, even though many powerful DRL methods have been proposed for learning obstacle-avoidance skills, the trained networks cannot be reused directly for navigation tasks. Since there are additional inputs and the rewarding system is changed, redesigning and retraining a new network is inevitable. This process is quite time-consuming, and it is a waste of computation power if those well-trained networks cannot be reused.

The motivation of our paper comes from an interesting question: can we reuse a DRL network trained for obstacle avoidance to execute the robot navigation task without redesigning and retraining the neural network? One straightforward idea is to train an additional DRL agent that performs goal reaching and then combine the two skills to form the navigation skill. However, to implement this idea, several challenges must be considered. Firstly, how should the DRL algorithm for learning the goal-reaching ability be designed? Secondly, what rewards should be used, and what modifications should be made to the network trained for obstacle avoidance? Last and most importantly, once the two skills are learned, how should they be combined? This paper proposes a novel method to address these challenges. The contributions of our paper are as follows:

  • A novel DRL-based method is proposed for addressing the robot navigation problem.

  • The proposed method only reuses the trained networks for the obstacle-avoidance and goal-reaching tasks; no redesigning or retraining is performed.

  • A novel danger-aware weighted advantage composition method is proposed for fusing the obstacle-avoidance and goal-reaching skills.

  • Autonomous navigation is achieved in an unknown challenging environment.

The rest of this paper is arranged as follows. A brief introduction of DQN will be given in Section II. The proposed danger-aware weighted composition method is described in Section III, followed by the experiment and results in Section IV. Last, we draw the conclusions in Section V.

II Background

II-A Problem Definition

The robot navigation problem can be treated as a decision-making process in which the robot needs to reach a goal while avoiding obstacles. Under a policy $\pi$, given an input $s_t$ at time $t$, the robot takes an action $a_t$. The input consists of the relative position of the goal in the robot’s local frame and a stack of laser scans. After the robot takes the action and reaches a new state $s_{t+1}$, it receives a reward $r_t$. The objective of the decision process is to find an optimal policy that maximizes the total return $R_t = \sum_{\tau=t}^{T} \gamma^{\tau-t} r_\tau$, where $T$ is the time at which the terminal state is reached and $\gamma$ is the discount rate, which determines the present value of future rewards.
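As a small illustration of the discounted return described above, the following sketch accumulates rewards backwards through an episode. The reward values and the discount rate in the usage lines are hypothetical and are not taken from the paper.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k], accumulated from the last step backwards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical episode: two penalised steps followed by a terminal bonus.
episode = [-1.0, -1.0, 10.0]
print(discounted_return(episode, gamma=0.5))  # 1.0
```

A smaller discount rate shrinks the weight of the terminal bonus relative to the early step penalties, which is why the discount rate controls how far-sighted the learned behaviour is.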

II-B Deep Q Network (DQN)

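A DQN is a Q-learning algorithm that uses deep neural networks to approximate the Q function [8]. To apply DQN, the decision process should satisfy the Markov property, i.e., the next state $s_{t+1}$ depends only on the current state $s_t$ and the action taken in it. Given a policy $\pi$, the values of the state $s_t$ and the state-action pair $(s_t, a_t)$ are defined as:

$$V^{\pi}(s_t) = \mathbb{E}\left[R_t \mid s_t, \pi\right], \qquad Q^{\pi}(s_t, a_t) = \mathbb{E}\left[R_t \mid s_t, a_t, \pi\right]$$

where $V^{\pi}$ is called the value function, which evaluates the expected return of the robot in state $s_t$ under the policy $\pi$; $Q^{\pi}$ is referred to as the Q function, which evaluates the expected return of executing action $a_t$ in state $s_t$. The Q function can be computed recursively with the Bellman equation:

$$Q^{\pi}(s_t, a_t) = \mathbb{E}\left[r_t + \gamma\, Q^{\pi}(s_{t+1}, a_{t+1}) \mid s_t, a_t, \pi\right]$$

By selecting the optimal action at each timestep, the Bellman optimality equation is obtained:

$$Q^{*}(s_t, a_t) = \mathbb{E}\left[r_t + \gamma \max_{a_{t+1}} Q^{*}(s_{t+1}, a_{t+1}) \mid s_t, a_t\right]$$

Another important function in reinforcement learning is the advantage function $A^{\pi}$. It measures the relative importance of each action and is the difference between the Q function and the value function:

$$A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t)$$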


In traditional Q-learning, tabular methods record the Q values of all state-action pairs, which becomes ineffective as the state space grows. To address this problem, DQN uses a neural network to approximate the Q function by minimizing the loss function

$$L(\theta) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim M}\left[\left(y_t - Q(s_t, a_t; \theta)\right)^2\right]$$

where $\theta$ denotes the parameters of the network; $(s_t, a_t, r_t, s_{t+1})$ represents an experience collected during the experiment or simulation, and all experiences are saved in the memory $M$ with capacity H; $y_t = r_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1})$ is the target of the Q function. DQN only saves the parameters of the deep neural network instead of each Q value, which makes problems with large state spaces tractable.
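The loss described above can be sketched for a single minibatch with NumPy. This is a minimal illustration under assumptions, not the authors' training code: the array names, the terminal-step mask `dones`, and all numbers are hypothetical, and `next_q` stands in for whatever network supplies the next-state Q values.

```python
import numpy as np


def td_targets(rewards, next_q, dones, gamma):
    """Targets r + gamma * max_a' Q(s', a'); no bootstrapping on terminal steps."""
    return rewards + gamma * (1.0 - dones) * next_q.max(axis=1)


def dqn_loss(q_values, actions, targets):
    """Mean squared error between the targets and the Q values of the taken actions."""
    chosen = q_values[np.arange(len(actions)), actions]
    return np.mean((targets - chosen) ** 2)


# Hypothetical minibatch of two experiences with two actions each.
q = np.array([[1.0, 2.0], [0.5, 0.0]])       # Q(s, .) for the sampled states
next_q = np.array([[0.0, 4.0], [3.0, 1.0]])  # Q(s', .) for the next states
rewards = np.array([1.0, -1.0])
dones = np.array([0.0, 1.0])                 # second experience is terminal
actions = np.array([1, 0])

y = td_targets(rewards, next_q, dones, gamma=0.5)
loss = dqn_loss(q, actions, y)
print(y, loss)
```

In an actual implementation the gradient of this loss with respect to the network parameters would be taken by an automatic-differentiation framework; the sketch only shows how the scalar loss is assembled from sampled experiences.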

Wang et al. improve the performance of DQN by decomposing the network into two streams: one approximating the value function $V$ and the other approximating the advantage function $A$. The resulting architecture is known as Dueling DQN [9]. After the value function and advantage function are computed, the Q function is obtained by adding the two together:

$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \left(A(s, a; \theta, \alpha) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s, a'; \theta, \alpha)\right)$$

where $\theta$ denotes the weights of the shared shallow layers of the two streams; the parameters $\alpha$ and $\beta$ are the weights of the separate deep layers of the advantage stream and the value stream, respectively; $|\mathcal{A}|$ is the number of actions; and the term in parentheses is the real advantage value with zero mean, computed by subtracting the average of the outputs of the advantage stream.
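The dueling aggregation described above, in which the advantage outputs are centred to zero mean before being added to the value, can be sketched as follows. The function name and the example numbers are illustrative assumptions.

```python
import numpy as np


def dueling_q(value, advantages):
    """Combine a value stream (batch,) and an advantage stream (batch, n_actions).

    The advantages are centred per state so that their mean over actions is zero.
    """
    centered = advantages - advantages.mean(axis=-1, keepdims=True)
    return value[..., None] + centered


# Hypothetical outputs of the two streams for one state with three actions.
value = np.array([2.0])
advantages = np.array([[1.0, 2.0, 3.0]])
q = dueling_q(value, advantages)
print(q)  # [[1. 2. 3.]]
```

Because the centred advantages sum to zero, the mean of the resulting Q values over actions equals the value estimate, and the ranking of actions is determined by the advantage stream alone.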

III Danger-aware Weighted Advantage Composition and Navigation Network

Our method contains two stages: i.e. the basic skill-learning stage and skill-fusion stage. In the first stage, two Dueling DQNs named Goal Network and Avoidance Network are used to train the robot to learn goal-reaching and obstacle-avoidance skills, respectively. In the next stage, a weighted composition method is proposed for fusing the two basic skills, and the corresponding network is called Navigation Network.

III-A Goal Network

In this subsection, we elaborate on how Dueling DQN is applied to train the goal-reaching skill. The network is trained on a task in which the robot needs to reach the goal as fast as possible. The Gazebo simulation environment for training the robot is shown in Fig.1. The position of the goal varies between episodes, and each episode ends when the robot reaches the target or crashes into the wall. Treating a collision with the wall as a terminal condition prevents an endless searching process. Note that the reward is zero when the robot crashes into obstacles, because the robot should not learn any obstacle-avoidance behaviour while learning the goal-reaching skill.

Fig. 1: Gazebo environment for training goal-reaching skill.
Fig. 2: Structure of Dueling DQN for learning goal-reaching skill (Goal network).

The Dueling DQN structure for learning the goal-reaching skill is shown in Fig.2. The input, the relative position of the goal in the robot frame, is fed into the shared fully-connected (FC) layers. After being processed by the shared layers, the 10 generated representations are fed into two streams of FC layers, each with its own parameters. One stream predicts the value function, while the other predicts the advantage function. We call the network shown in Fig.2 the Goal Network; its output Q function combines the two streams in the same way as the Dueling DQN described in Section II-B.


During training, the robot acquires a positive reward when reaching the goal and is punished with negative rewards for all other steps in order to encourage reaching the goal quickly.

To avoid sparse rewards and accelerate the training process, we decompose the reward for goal reaching into an action reward and a state reward.

The action-rewarding system encourages the shortest goal-reaching time: for each step in which the robot does not reach the goal, it is punished with a negative action reward. The state-rewarding system encourages the shortest goal-reaching path: the robot is awarded 10 when it reaches the goal, and it also gets a positive reward whenever it moves closer to the goal. The state reward is computed from the distance between the goal and the robot, which is also the first element of the Goal Network's input state.
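A possible shape for such a goal-reaching reward is sketched below. Only the bonus of 10 and the 0.4 m goal radius (mentioned in Section IV-A) come from the text; the function name, the `step_penalty` and `progress_gain` parameters, and the choice of a progress reward proportional to the reduction in distance are assumptions made for illustration, since the exact reward equations are not reproduced here.

```python
def goal_reward(prev_dist, dist, step_penalty, progress_gain,
                goal_radius=0.4, goal_bonus=10.0):
    """Action reward plus state reward for one step of goal reaching.

    step_penalty and progress_gain are hypothetical tuning parameters.
    """
    if dist < goal_radius:
        # Goal reached: terminal bonus, no step penalty.
        return goal_bonus
    action_reward = -step_penalty
    # Assumed form: reward only the distance actually gained towards the goal.
    state_reward = progress_gain * max(prev_dist - dist, 0.0)
    return action_reward + state_reward


# Hypothetical values: moving 0.5 m closer versus moving 0.5 m away.
print(goal_reward(2.0, 1.5, step_penalty=0.1, progress_gain=2.0))
print(goal_reward(1.5, 2.0, step_penalty=0.1, progress_gain=2.0))
```

Under this assumed form, every non-terminal step carries the same penalty, so shorter episodes accumulate less penalty, while the progress term gives a dense signal long before the terminal bonus is reached.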

III-B Avoidance Network

In this subsection, we will elaborate on how to apply Dueling DQN for training the obstacle avoidance skill. The Gazebo environment for this task is shown in Fig.3. The robot is equipped with laser range finders whose scan range is and maximum measuring distance is 8 meters.

Fig. 3: Gazebo environment for training obstacle-avoidance skill.
Fig. 4: Structure of Dueling DQN for learning obstacle-avoidance skill (Avoidance network).

The Dueling DQN structure for learning the obstacle-avoidance skill is shown in Fig.4. Laser scans with 100 beams are processed as input and fed into the shared 1-D convolutional layers. Unusually, no pooling layers are added after the convolutional layers, in order to preserve more spatial information. After being processed by the convolutional layers, the generated representations are flattened into a row of 400 neurons and then fed into two streams of fully-connected (FC) layers, each with its own parameters. One stream predicts the advantage values, and the other predicts the value function. We call the network shown in Fig.4 the Avoidance Network; its output Q function combines the two streams in the same way as the Dueling DQN described in Section II-B.


As with the Goal Network, we decompose the reward for obstacle avoidance into an action reward and a state reward.

The action-rewarding system encourages moving straight instead of turning through a large angle, which helps prevent the robot from circling in a small space. The robot receives one reward if it goes straight and another when the angular velocity equals 0.2; the remaining actions earn no action reward. The action reward is thus a function of the linear speed and the angular speed w.

To keep the robot far away from obstacles, it gets a negative state reward if the distance from the nearest obstacle to the robot is within 0.6 meters. In addition, when the robot crashes into an obstacle (i.e., the minimum distance between the robot and the obstacles is smaller than 0.2 meters), it is punished with -10. The minimum distance is taken as the minimum value of the laser scan data.
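The state reward for obstacle avoidance can be sketched as a piecewise function of the minimum laser reading. The 0.6 m and 0.2 m thresholds and the crash reward of -10 follow the text; the function name, the `near_penalty` parameter (whose value is not given here), the strict inequalities, and the zero reward in the safe case are assumptions.

```python
def avoidance_state_reward(scan, near_penalty,
                           crash_dist=0.2, near_dist=0.6, crash_reward=-10.0):
    """Piecewise state reward based on the closest laser return.

    near_penalty is a hypothetical positive magnitude for the near-obstacle case.
    """
    d_min = min(scan)
    if d_min < crash_dist:
        return crash_reward      # collision
    if d_min < near_dist:
        return -near_penalty     # too close to an obstacle
    return 0.0                   # assumed: no state reward when clear


# Hypothetical scans (metres) for the three cases.
print(avoidance_state_reward([1.5, 0.9, 2.0], near_penalty=1.0))
print(avoidance_state_reward([1.5, 0.4, 2.0], near_penalty=1.0))
print(avoidance_state_reward([1.5, 0.1, 2.0], near_penalty=1.0))
```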

III-C Danger-aware Weighted Composition and Navigation Network

Lastly, we fuse the goal-reaching and obstacle-avoidance skills. The framework of the fusion method is shown in Fig.5.

Fig. 5: Framework for fusing the Goal network and Avoidance Network into Navigation Network.

As shown, the output advantage values of the Goal Network and the Avoidance Network are combined by a weighted sum, which forms the output of the Navigation Network. The weight is a changeable regulation hyperparameter: it sets the contribution of the output of the Avoidance Network relative to the output of the Goal Network. It is computed by a monotonically decreasing function of the value estimated by the Avoidance Network, and it also depends on whether the distance between the robot and the goal exceeds a threshold. The decision made by the fused Navigation Network is to choose the action that maximizes the combined advantage function.
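The fusion rule can be sketched as follows, assuming the combined advantage is the Goal Network advantage plus the regulation weight times the Avoidance Network advantage. The exponential form of the decreasing function, the parameters `sharpness`, `v_ref`, `dist_threshold` and `near_goal_weight`, and all numbers in the usage lines are hypothetical choices for illustration; the function actually used in the paper is not reproduced here.

```python
import numpy as np


def regulation_weight(v_avoid, goal_dist, dist_threshold,
                      near_goal_weight, sharpness, v_ref):
    """Weight of the avoidance advantage; decreases as the avoidance value rises.

    The exponential form and every parameter here are illustrative assumptions.
    """
    if goal_dist <= dist_threshold:
        # Goal is close enough: fall back to a fixed weight that favours goal reaching.
        return near_goal_weight
    return float(np.exp(-sharpness * (v_avoid - v_ref)))


def navigate(adv_goal, adv_avoid, weight):
    """Weighted advantage composition followed by greedy action selection."""
    combined = adv_goal + weight * adv_avoid
    return int(np.argmax(combined)), combined


# Hypothetical advantages over three actions (0 = straight, 1 and 2 = turns).
adv_goal = np.array([1.0, 0.2, -0.5])
adv_avoid = np.array([-2.0, 0.5, 1.0])

w_danger = regulation_weight(-40.0, 3.0, 1.0, 0.0, sharpness=0.2, v_ref=-36.5)
w_safe = regulation_weight(-30.0, 3.0, 1.0, 0.0, sharpness=0.2, v_ref=-36.5)
print(navigate(adv_goal, adv_avoid, w_danger)[0])  # 2
print(navigate(adv_goal, adv_avoid, w_safe)[0])    # 0
```

With a low avoidance value the weight exceeds 1 and the turning action preferred by the avoidance advantages wins; with a high avoidance value the weight drops below 1 and the action preferred by the goal advantages is chosen. No network is retrained in either case, since only the two advantage outputs and one value output are read.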

Fig. 6: Structure of Navigation Network.

Now, we explain the insight behind the regulation hyperparameter and the reason why we call the weighted combination method danger-aware. Some previous works have studied the problem of combining Q functions learned by DRL. However, those papers focus on analysing the performance of the linear combination method [10, 11, 12]. The linear combination method is easy to evaluate in theory and is useful for problems where rewards are linearly separable. However, the overestimation problem caused by exchanging the max and summation operations in the Bellman equation is unsolvable [10], and the drawbacks of linear combination will be shown in the experiment section. Considering these drawbacks, our idea is to make each network's contribution changeable instead of constant, i.e., to use a changeable regulation hyperparameter. The value function for obstacle avoidance and the position of the goal can be regarded as good indicators of the current situation. The key of the danger-aware weighted combination method is using the avoidance value function to evaluate the situation and decide the contribution of each sub-skill. A large regulation hyperparameter takes obstacle avoidance into account more strongly. A dangerous situation results in a small avoidance value, and a large regulation hyperparameter then helps the robot concentrate more on obstacle avoidance. Conversely, a larger avoidance value corresponds to a safer situation, and hence the regulation hyperparameter can be decreased to encourage the robot to move quickly to the target. In the special case where the goal is surrounded by obstacles, the situation may be evaluated as dangerous, which can result in obstacle avoidance instead of goal reaching even if the goal is nearby. Accordingly, a distance threshold is used to judge whether the goal is close enough for the robot to focus on goal reaching.

The structure of the Navigation Network is shown in Fig.6. Note that all the parameters in this network are fixed and need no further training. The input signals are separated into position information and laser scans, which are fed into the trained Goal Network and Avoidance Network, respectively. The advantage values of the Avoidance Network are scaled by the regulation hyperparameter and then added to the advantage values output by the Goal Network. Once the final advantage values are obtained, the final decision is made by choosing the action corresponding to the maximum advantage value. Take Fig.7 as an example: since the obstacle is close to the robot, the condition is assessed as hazardous, and the corresponding regulation hyperparameter is large. A large regulation hyperparameter increases the contribution of obstacle avoidance, and hence the combined advantage function takes its maximum value at action 2, which is favourable for avoiding the obstacles.

Fig. 7: An example for the decision-making process of Navigation Network.

IV Experiment and Results

This section will compare the proposed danger-aware weighted advantage composition method with other composition methods, and it will show the performance of the Navigation Network in an unknown environment. We use gym-gazebo [13] to perform DRL algorithms in Gazebo environment.

IV-A Value Visualization of Goal Network

We visualize the value function of the Goal Network in Fig.8. As shown, when the robot faces the target, the value is monotonically increasing with the distance between the robot and the goal, which accords with the encouragement of the rewarding system. Notably, the maximum distance is 8 meters, which significantly exceeds the maximum moving distance (around 4.5 m) over which the robot was trained. This is the reason why our robot can navigate successfully, as shown in the following results. In addition, the abnormal shape of the estimated values around the origin is caused by the terminal condition that the goal is reached when the robot is within 0.4 m of it. Fig.9 illustrates how the robot makes decisions based on the estimated advantage function in different positions. As shown, in each position the robot is 3 meters away from the goal, and the action with the largest advantage value is consistent with the optimal action.

Fig. 8: Visualization of the value function with respect to the positions of the robot.
Fig. 9: Advantage values learned by Goal Network: the robot is 3m away from the goal and locates in four different directions (i.e. the relative angles are and ).

IV-B Impact of Regulation Hyperparameter

According to the previous description, a constant regulation hyperparameter means the contribution of the repulsive advantage value stays the same, while a changeable one implies that the contribution of the repulsive advantage changes with the situation. We compared three combination approaches: the direct combination, the normalized combination, and the danger-aware weighted combination. The direct combination adds the advantage functions of the Goal Network and the Avoidance Network directly without any data processing. The normalized combination normalizes the range of each of the two advantage functions to [0,1] by Equation(17) and then adds the normalized advantage functions together directly. The normalized combination aims to make the contributions of the two advantage functions equal.
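The two baseline combinations can be sketched as below, assuming that the normalization to [0,1] is a per-state min-max rescaling over actions; that choice, the function names, and the example advantages are assumptions made for illustration.

```python
import numpy as np


def min_max(adv):
    """Rescale one advantage vector to [0, 1] (assumed min-max form)."""
    span = adv.max() - adv.min()
    if span == 0.0:
        return np.zeros_like(adv)
    return (adv - adv.min()) / span


def direct_combination(adv_goal, adv_avoid):
    """Add the two advantage vectors without any processing."""
    return adv_goal + adv_avoid


def normalized_combination(adv_goal, adv_avoid):
    """Add the two advantage vectors after rescaling each to [0, 1]."""
    return min_max(adv_goal) + min_max(adv_avoid)


# Hypothetical advantages: the goal advantages have a much larger scale.
adv_goal = np.array([4.0, 0.0, -4.0])
adv_avoid = np.array([-1.0, 0.5, 1.0])
print(int(np.argmax(direct_combination(adv_goal, adv_avoid))))      # 0
print(int(np.argmax(normalized_combination(adv_goal, adv_avoid))))  # 1
```

In this example the direct sum is dominated by the network with the larger output scale, while the normalized sum gives both networks the same range. Neither baseline changes its weighting with the situation, which is the property the danger-aware combination adds.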


The danger-aware weighted combination shown in Equation(14) is used to weight the contribution of the two advantage functions according to the current situation. The regulation hyperparameter chosen in this experiment is


To better understand this function, we introduce , and the new combination


will not change the optimal value of Equation(18). From Equation(19), we can find that means a preference of reaching the goal and denotes a preference of avoiding the obstacles. Hence, the value -36.5 is a threshold for the preference of choice.

The simulation environments shown in Fig.10(a) and Fig.11(a) are used to investigate the effect of the regulation hyperparameter. In each environment, the robot is spawned inside an open corner formed by two walls, and the task is to move out of the corner to reach the goal. Note that when running the Navigation Network, the maximum length of the laser scan is reduced to 1.5 meters, as given in Equation(20):

$$d_i \leftarrow \min(d_i,\ 1.5) \tag{20}$$

where $d_i$ is the distance of each beam. The reason for this preprocessing is that a long measuring distance enhances the global decision ability of the Avoidance Network, which may cause conflicts with the decision made by the Goal Network when the situation is considered safe. Besides, a shorter measuring distance helps the Avoidance Network focus on local planning for obstacle avoidance. The trajectories generated by each combination method are visualized in Fig.10 and Fig.11. As shown in Fig.10, both the danger-aware weighted combination and the normalized combination can accomplish the task. However, the trajectory generated by the danger-aware weighted combination is much shorter than the one generated by the normalized combination, which indicates that the weighted combination may help find a near-optimal path. Moreover, in Fig.11, only the danger-aware weighted combination enables the robot to finish the task when the goal is right behind the wall. The reason is that when the robot approaches the wall, the avoidance value function decreases and the regulation hyperparameter increases. A large regulation hyperparameter encourages obstacle avoidance, and hence the robot takes actions that benefit obstacle avoidance. As for the other two combination methods, since the goal is right in front of the robot, the Goal Network focuses on going straight, and this tendency cannot be sufficiently suppressed even when the robot is near the wall. Accordingly, the danger-aware weighted combination plays a significant role in accomplishing navigation tasks. The video of the test is available at https://youtu.be/OV7ZuwLGiHw

(a) Testing environment 1
(b) Trajectories of three combination methods
Fig. 10: First Gazebo environment for examining the impact of the regulation hyperparameter. The goal is located in the upper-right, outside the corner.
(a) Testing environment 2
(b) Trajectories of three combination methods
Fig. 11: Second Gazebo environment for examining the impact of the regulation hyperparameter. The goal is located in front of the robot and behind the wall.

IV-C Testing in the Complex Environment

Lastly, the robot navigates in the unknown environment shown in Fig.12(a) to reach seven successive target points without colliding with the obstacles. This task is similar to [5]. However, no global planner is used to generate a path in advance to help find the goal. The results are shown in Fig.12(b), and it can be seen that all the goal points are reached successfully while all the obstacles are avoided. Notably, the longest distance between a start point and a goal point is about 8.8 meters, which exceeds the maximum moving distance (around 4.5 meters) during the training of the Goal Network. This indicates that the Goal Network can address goal-reaching problems beyond its training range, and this is the reason why our Navigation Network can plan over much longer distances without the need for a global planner. In addition, when there are multiple obstacles on the way, such as from the start point to goal 1 and from goal 3 to goal 4, the robot can also make the right decisions and produce an acceptable path. The video of the test is available at https://youtu.be/Jn64Sg-5QCo

(a) Unknown complex environment
(b) Trajectories
Fig. 12: The robot navigates in the unknown environment to reach seven successive target points without colliding with the obstacles.

V Conclusion

This paper proposes a novel method, danger-aware weighted advantage composition, to fuse the goal-reaching and obstacle-avoidance skills into the navigation skill. Without time-consuming redesigning and retraining, the robot controlled by the composed Navigation Network can successfully navigate in an unknown complex environment. Unlike conventional linear unity composition methods, our method is a nonlinear composition. Most importantly, it automatically adjusts the contribution of each DQN based on the current situation. Although this paper focuses on robot navigation, the idea can be useful in other applications where the main task can be divided into smaller subtasks.

In the future, we will use a learning-based composition method to learn the regulation hyperparameter instead of using the given one for addressing more challenging problems.