Unmanned Aerial Vehicles (UAVs) have shown great promise in recent years because of their excellent mobility and flexibility. More and more missions, such as search and rescue, involve navigating through unknown environments. Obstacle avoidance is an essential capability for UAVs to navigate autonomously in complex environments. However, due to limited capacity and computing resources, autonomous navigation is still a challenging task.
Conventional robotics methods for exploration and navigation, such as Simultaneous Localisation and Mapping (SLAM), tackle the navigation problem through an explicit focus on position inference and mapping. However, they require a large amount of computation and memory. There are also local approaches that do not build a map but act directly on the sensor data gathered at the current time step, such as 3DVFH+, the potential field method (PFM) and other reactive methods [14, 3]. These algorithms are faster but are usually unable to find the optimal path.
Recently, some end-to-end methods have been proposed to address the UAV navigation problem. The control command is generated directly from raw sensor data by a trained neural network. Compared with the traditional hierarchical pipeline, a deep neural network does not need hand-crafted feature extraction and can deal with high-dimensional raw sensor data such as images. It also runs in a reactive manner without any optimization or search, which is beneficial for real-time applications. Deep reinforcement learning (DRL) is usually used to train this end-to-end policy network. However, DRL is sample inefficient and relies on a large amount of interaction data with the environment. Learning from scratch is time consuming and severely limits the application of DRL to many real-world tasks.
In this work, an end-to-end policy network is proposed for UAV navigation in unknown 3D environments. The network is trained using an off-policy model-free DRL method. To speed up the training process, a novel framework that combines the advantages of imitation learning and reinforcement learning is proposed. Specifically, both the Q-value and policy networks are trained in the imitation phase, and a decayed imitation loss is used to obtain a smooth transition between the imitation and reinforcement learning phases. The training environment is shown in Fig. 1.
II Related Work
II-A Learning-based UAV Navigation
To address the UAV navigation problem with DRL methods, many works focus only on the 2D case. Ross et al. proposed a vision-based navigation system for a UAV using imitation learning. However, this method needs a human in the loop during the training phase. Pham et al. trained a quadrotor to navigate to a target point in an unknown environment using a PID-assisted Q-learning algorithm. However, there are no obstacles in the environment. In the work of Wang et al., the navigation problem is formulated as a partially observable Markov decision process (POMDP) and solved by a novel online DRL method. Singla et al. used a GAN architecture for depth prediction from RGB images and augmented DRL with memory networks and temporal attention, which helps the agent retain vital information gathered from past observations.
Because of its difficulty, only a few works focus on the 3D navigation problem. Sharma et al. proposed an RL-based autonomous waypoint generation strategy (AWGS) for on-line path planning in unknown 2D and 3D environments. However, the policy is learned from scratch, which is time consuming.
II-B Learn from Demonstrations
Demonstrations are widely used in high-dimensional robotic problems. Hester et al. proposed Deep Q-learning from Demonstrations (DQfD), which leverages small sets of demonstration data to accelerate the learning process. Vecerik et al. proposed a general, model-free approach that builds upon the DDPG algorithm to use demonstrations (DDPGfD). Both demonstrations and actual interactions are used to fill the replay buffer and are sampled via a prioritized replay mechanism. Similar to DQfD, DDPGfD also uses a mix of 1-step and n-step return losses and L2 regularization losses. Nair et al. also proposed a method that builds on top of DDPG; they use a behaviour cloning (BC) loss and a Q-filter as an auxiliary loss function when updating the policy in the training phase. Gao et al. proposed Normalized Actor-Critic (NAC), which is robust to sub-optimal demonstrations.
In applications, to address the mapless navigation problem for mobile robots, Xie et al. proposed Assisted DDPG, where a classical controller is used as an alternative, switchable policy to speed up the training of DRL. This method needs the assisting controller to be online throughout the training phase. Pfeiffer et al. leverage prior expert demonstrations to pre-train the policy and then use a safety-constrained RL method to improve the performance. However, only the policy network is pre-trained using the demonstration data; the value network is still initialized randomly. When the agent starts interacting with the environment, the policy performance drops because of the incorrect value function estimation.
III-A Reinforcement Learning for Navigation Problem
In this work, the navigation and obstacle avoidance problem is formulated as a standard Markov Decision Process (MDP), which can be solved using DRL. An MDP is defined by a tuple $\langle S, A, R, T, \gamma \rangle$, which consists of a set of states $S$, a set of actions $A$, a reward function $R(s, a)$, a transition function $T(s, a, s') = P(s' \mid s, a)$, and a discount factor $\gamma$. In each state $s \in S$, the agent takes an action $a \in A$. By executing the action in the environment, the agent receives a reward $R(s, a)$ and reaches a new state $s'$, determined from the probability distribution $P(s' \mid s, a)$. The goal of DRL is to find a policy $\pi$ mapping states to actions that maximizes the expected discounted total reward over the agent's lifetime. This concept is formalized by the action value function: $Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{+\infty} \gamma^{t} R(s_t, a_t)\right]$, where $\mathbb{E}_{\pi}$ is the expectation over the distribution of the admissible trajectories $(s_0, a_0, s_1, a_1, \dots)$ obtained by executing the policy $\pi$ starting from $s_0 = s$ and $a_0 = a$.
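As a small worked example of the quantity inside this expectation, the discounted return of a single finite trajectory can be computed as in the sketch below. The function name, the reward values and the discount factor are illustrative assumptions, not taken from the experiments; the action value is the average of this quantity over trajectories generated by the policy.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for one finite trajectory."""
    total = 0.0
    # Accumulate backwards so that each step computes G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Illustrative trajectory with three unit rewards: 1 + 0.5 + 0.25.
example_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```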
In the UAV navigation and obstacle avoidance problem, the state is represented by the relative goal position and sensor data. In our case, the raw depth image obtained from a depth camera or binocular camera is used to extract the obstacle information. The action generated by the policy network consists of the linear velocities along the x- and y-axes and the rotation speed about the z-axis, which are used to navigate the UAV in a 3D environment. The policy network is shown in Fig. 2.
III-B Twin Delayed DDPG
Our method builds upon an off-policy model-free reinforcement learning algorithm, Twin Delayed DDPG (TD3). A common failure mode for DDPG is that the learned Q-function begins to dramatically overestimate Q-values, which then leads to the policy breaking. TD3 addresses this issue by introducing three critical tricks: clipped double Q-learning, delayed policy updates and target policy smoothing.
Target policy smoothing: Actions used to form the Q-learning target are based on the target policy, $\mu_{\theta_{\text{targ}}}$, but with clipped noise added on each dimension of the action. After adding the clipped noise, the target action is then clipped to lie in the valid action range (all valid actions $a$ satisfy $a_{Low} \le a \le a_{High}$). The target actions are thus:
$$a'(s') = \mathrm{clip}\left(\mu_{\theta_{\text{targ}}}(s') + \mathrm{clip}(\epsilon, -c, c),\ a_{Low},\ a_{High}\right) \qquad (1)$$
where $\epsilon \sim \mathcal{N}(0, \sigma)$. Target policy smoothing essentially serves as a regularizer for the algorithm. It addresses a particular failure mode that can happen in DDPG: if the Q-function network develops an incorrect sharp peak for some actions, the policy will quickly exploit that peak and then exhibit brittle or incorrect behaviour. This can be averted by smoothing out the Q-function over similar actions, which target policy smoothing is designed to do.
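The two clipping steps of target policy smoothing can be sketched in plain Python as follows. This is a minimal illustration under the standard TD3 formulation; the function names, the noise scale, the noise clip and the action bounds are assumptions chosen for the example rather than the settings used in this work.

```python
import random


def clip(x, lo, hi):
    """Clamp a scalar to the interval [lo, hi]."""
    return max(lo, min(hi, x))


def smoothed_target_action(mu_targ, noise_std, noise_clip, a_low, a_high, rng):
    """Add clipped Gaussian noise to each action dimension, then clip to the valid range."""
    action = []
    for m in mu_targ:
        eps = clip(rng.gauss(0.0, noise_std), -noise_clip, noise_clip)
        action.append(clip(m + eps, a_low, a_high))
    return action


rng = random.Random(0)
mu = [0.95, -0.2, 0.0]  # hypothetical output of the target policy
a_next = smoothed_target_action(mu, noise_std=0.2, noise_clip=0.5,
                                a_low=-1.0, a_high=1.0, rng=rng)
```

Because the noise is clipped before it is added, the smoothed action never moves further than the noise clip from the target policy output, and the outer clip keeps it inside the valid range.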
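Clipped double-Q learning: TD3 concurrently learns two Q-functions, $Q_{\phi_1}$ and $Q_{\phi_2}$, by mean square Bellman error minimization, in almost the same way that DDPG learns its single Q-function. Both Q-functions use a single target, calculated using whichever of the two target Q-functions gives the smaller target value:
$$y(r, s', d) = r + \gamma (1 - d) \min_{i=1,2} Q_{\phi_{i,\text{targ}}}(s', a'(s')) \qquad (2)$$
and then the parameters of both Q-value functions, $\phi_1$ and $\phi_2$, are updated by one step of gradient descent using:
$$\nabla_{\phi_i} \frac{1}{|B|} \sum_{(s, a, r, s', d) \in B} \left(Q_{\phi_i}(s, a) - y(r, s', d)\right)^2 \qquad (3)$$
where $i = 1, 2$ and $B$ is a mini-batch sampled from the replay buffer $\mathcal{D}$. Here $d$ equals 1 if $s'$ is terminal and 0 otherwise, and $a'(s')$ is the smoothed target action. Using the smaller Q-value for the target, and regressing towards that, helps decrease overestimation in the Q-function.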
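For scalar values, the clipped double-Q target and the mean squared Bellman error can be sketched as below. The sketch assumes the standard TD3 form in which a done flag masks the bootstrap term; the function names and all numbers are illustrative only.

```python
def td3_target(reward, done, gamma, q1_targ, q2_targ):
    """Bellman target that bootstraps from the smaller of the two target Q-values."""
    return reward + gamma * (1.0 - done) * min(q1_targ, q2_targ)


def critic_loss(q_values, targets):
    """Mean squared Bellman error over a mini-batch; applied to each critic separately."""
    return sum((q - y) ** 2 for q, y in zip(q_values, targets)) / len(q_values)


# Non-terminal transition: the target uses min(10.0, 8.0) = 8.0.
y = td3_target(reward=1.0, done=0.0, gamma=0.99, q1_targ=10.0, q2_targ=8.0)
# Terminal transition: the bootstrap term is masked out.
y_terminal = td3_target(reward=1.0, done=1.0, gamma=0.99, q1_targ=10.0, q2_targ=8.0)
loss = critic_loss([9.0, 1.5], [y, y_terminal])
```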
Delayed policy updates: Lastly, the parameter $\theta$ of the policy network is updated by one step of gradient ascent to maximize the Q-value using:
$$\nabla_{\theta} \frac{1}{|B|} \sum_{s \in B} Q_{\phi_1}(s, \mu_{\theta}(s)) \qquad (4)$$
which is essentially unchanged from DDPG. However, in TD3, the policy is updated less frequently than the Q-functions are. This helps damp the volatility that normally arises in DDPG because of how a policy update changes the target.
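The update schedule can be illustrated with a short sketch that counts critic and actor updates and applies a Polyak (soft) target update. The values 2 and 0.005 correspond to the policy update delay and soft update coefficient reported in the hyperparameter table of this paper; the function names and the scalar parameter lists are hypothetical, and the use of Polyak averaging for the target networks is an assumption based on the standard TD3 algorithm.

```python
def soft_update(target_params, params, tau):
    """Polyak averaging: target <- (1 - tau) * target + tau * online."""
    return [(1.0 - tau) * t + tau * p for t, p in zip(target_params, params)]


def update_schedule(num_steps, policy_delay):
    """Count updates when the critics update every step and the actor every `policy_delay` steps."""
    critic_updates, actor_updates = 0, 0
    for step in range(num_steps):
        critic_updates += 1
        if step % policy_delay == 0:
            actor_updates += 1
    return critic_updates, actor_updates


counts = update_schedule(num_steps=10, policy_delay=2)
new_target = soft_update([0.0, 1.0], [1.0, 1.0], tau=0.005)
```

With a delay of 2, the actor receives half as many updates as the critics, so each policy step is taken against a Q-function that has had more time to settle.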
In this section, a learning-from-demonstration method, TD3fD (TD3 from Demonstration), is proposed to address the UAV navigation problem. Our method combines reinforcement learning and imitation learning, which yields better data efficiency than learning from scratch. Notably, differing from DQfD and DDPGfD, both the policy and Q-value networks are initialized using imitation learning. In addition, a decaying behaviour cloning loss is used at the beginning of the training phase to stabilize the training process.
IV-A Problems with Behaviour Cloning
Given a set of demonstrations that contains all the transition information and the corresponding environment, an agent should perform appropriate actions when it starts interacting with the environment and then continue to improve. The behaviour cloning (BC) method can learn the mapping between the input observations and their corresponding expert actions, but it leads to compounding errors: an early error can cascade into a sequence of mistakes, especially in long-horizon decision problems. Also, the BC method cannot deal with unseen data. Because the demonstration set is collected using an expert (a classical obstacle avoidance algorithm in our case), only correct, collision-free transitions are collected. The demonstration set is therefore a highly biased sample of the real environment. Using an off-policy RL method directly on the demonstration set also leads to a mismatch problem.
IV-B TD3 with Demonstrations
To deal with the mismatch problem and speed up the training process, our method combines BC with RL. The whole process has two phases: imitation and reinforcement. Our work is inspired by the previous works DQfD and DDPGfD. However, differing from DDPGfD, both the policy and Q-value networks are trained during the imitation phase. Moreover, the imitation loss is preserved at the beginning of the reinforcement phase and decays as training goes on. Furthermore, we do not keep the demonstration data permanently. After a certain number of training steps, the reinforcement phase degenerates to the original TD3. This decayed imitation loss stabilizes the beginning of the reinforcement phase and leads to a smooth transfer from the demonstration set to the real environment.
For the actor-critic reinforcement learning framework, if only the policy network is pre-trained using the BC method, the performance will decline dramatically when the agent starts interacting with the environment because of the incorrect Q-value estimation. So, in our work, both the Q-value and policy networks are initialized using the demonstration data in the imitation phase. The imitation loss (or BC loss) is defined as:
where is the expert action.
To learn both Q-value and policy network simultaneously, the imitation loss is added to equation (4) as an auxiliary loss and the Q-value network is updated by maximizing as well as minimizing simultaneously:
where is the weight of imitation loss.
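One way to read the combined objective is sketched below. The exact form of the imitation loss is not recoverable from the text here, so the sketch assumes a mean squared error between the policy action and the expert action, and a single loss to minimize equal to the negative mean Q-value plus the weighted BC loss. The function names, the weight and the sample values are all hypothetical.

```python
def bc_loss(policy_actions, expert_actions):
    """Mean squared error between policy and expert actions (assumed form of the BC loss)."""
    total = 0.0
    for a_pi, a_exp in zip(policy_actions, expert_actions):
        total += sum((p - e) ** 2 for p, e in zip(a_pi, a_exp))
    return total / len(policy_actions)


def actor_loss(q_values, policy_actions, expert_actions, bc_weight):
    """Loss to minimize: maximizing the mean Q-value while minimizing the weighted BC loss."""
    mean_q = sum(q_values) / len(q_values)
    return -mean_q + bc_weight * bc_loss(policy_actions, expert_actions)


pi_a = [[0.5, 0.0, 0.1], [0.2, 0.1, 0.0]]   # hypothetical policy actions
exp_a = [[0.5, 0.0, 0.0], [0.0, 0.1, 0.0]]  # hypothetical expert actions
loss = actor_loss([1.0, 3.0], pi_a, exp_a, bc_weight=20.0)
```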
After the imitation phase, a modified TD3 algorithm is used to improve the policy network so that it can deal with unseen scenarios and correct the mismatch between the demonstration set and the real environment. To get a smooth transfer from the imitation phase to the reinforcement phase, a decay factor is added to equation (6):
where is the decay factor calculated by:
where is the current time step, is the total decay step number. At the beginning is equal to 1 and will gradually decrease to 0 after steps. The TD3fD algorithm is outlined in Algorithm 1.
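A schedule with the stated behaviour (equal to 1 at the start and reaching 0 after the total number of decay steps) can be sketched as follows. A linear decay is assumed purely for illustration, since the exact schedule is not recoverable from the text here; the function and argument names are hypothetical.

```python
def decay_factor(step, decay_steps):
    """Assumed linear decay from 1 at step 0 to 0 at `decay_steps`; stays at 0 afterwards."""
    return max(0.0, 1.0 - step / decay_steps)


def decayed_actor_loss(mean_q, bc, bc_weight, step, decay_steps):
    """Actor loss whose imitation term is scaled by the decay factor."""
    return -mean_q + decay_factor(step, decay_steps) * bc_weight * bc


halfway = decay_factor(2500, 5000)
after_decay = decayed_actor_loss(mean_q=2.0, bc=0.025, bc_weight=20.0,
                                 step=5000, decay_steps=5000)
```

Once the factor reaches 0 the imitation term vanishes and the update reduces to the plain TD3 policy update, which matches the described degeneration to the original TD3.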
Several experiments are conducted to evaluate the performance of the proposed TD3fD against the original TD3. The network is trained in the ROS-based Gazebo simulation environment with an OpenAI Gym interface. To simulate the real-world situation as closely as possible, the UAV is controlled using the PX4 flight stack running in Software In The Loop (SITL) mode. Our TD3fD algorithm is modified from the Stable Baselines TD3 implementation, which is based on OpenAI Baselines. The training environment is shown in Fig. 1.
V-A Expert Demonstration
The PX4 local planner, based on the 3DVFH* algorithm, is used as the expert instead of a human. It is an open-source ROS package for obstacle avoidance. Using the depth image as input, the PX4 local planner generates a vector field histogram to represent the local information around the vehicle. Multiple collision-free trajectories are then generated based on this histogram, and the best one is selected according to a cost function. Although this algorithm has been optimized for on-board application, it still consumes considerable computing resources because of the look-ahead tree search.
In the expert demonstration gathering phase, 10 different goals are set randomly. The multirotor takes off at the centre of the environment and flies to the goal position guided by the local planner. The original output of the local planner is a target waypoint. We convert these target waypoints to velocity commands in the UAV body frame and use them as the expert actions. To make better use of the expert demonstrations, all the transition information is recorded and stored in the replay buffer.
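The conversion from a target waypoint to a body-frame velocity command is not detailed in the text. One possible version is sketched below under stated assumptions: the commanded speed is proportional to the distance to the waypoint and capped at a maximum, and the world-frame velocity is rotated by the vehicle yaw into a body frame with x pointing forward. The function name, the frame convention and the speed limit are hypothetical.

```python
import math


def waypoint_to_body_velocity(position, yaw, waypoint, max_speed):
    """Velocity toward `waypoint`, expressed in the body frame (x forward, y left, z up)."""
    dx = waypoint[0] - position[0]
    dy = waypoint[1] - position[1]
    dz = waypoint[2] - position[2]
    dist = math.sqrt(dx * dx + dy * dy + dz * dz)
    if dist < 1e-9:
        return (0.0, 0.0, 0.0)
    # Cap the speed at max_speed and slow down when close to the waypoint.
    scale = min(max_speed, dist) / dist
    vx_w, vy_w, vz_w = dx * scale, dy * scale, dz * scale
    # Rotate the horizontal components by -yaw to go from the world to the body frame.
    c, s = math.cos(yaw), math.sin(yaw)
    return (c * vx_w + s * vy_w, -s * vx_w + c * vy_w, vz_w)


# Vehicle facing the world +y axis with a waypoint 10 m straight ahead.
v_body = waypoint_to_body_velocity((0.0, 0.0, 0.0), math.pi / 2,
                                   (0.0, 10.0, 0.0), max_speed=2.0)
```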
V-B Network Framework and Training Settings
The policy network takes the depth image and the relative position between the current UAV position and the goal position as input. A CNN feature extractor is used to extract useful information from the raw depth image. The detailed structure of the policy network is shown in Fig. 2. The output of the policy network is a velocity command consisting of the forward speed, climb rate and yaw rate in the vehicle body frame. The activation function for the hidden layers is ReLU, and tanh is used in the final dense layer to generate symmetric control commands. All commands are transformed into ROS topics and published at 5 Hz. The low-level control is executed by the PX4 flight firmware. The training hyperparameters are shown in Table I.
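The role of the final tanh layer can be illustrated with a short sketch: tanh maps each raw output to the interval (-1, 1), which is then scaled to symmetric command limits. The limit values and the function name are hypothetical and not taken from the experiments.

```python
import math


def scale_action(raw_outputs, max_forward, max_climb, max_yaw_rate):
    """Squash raw network outputs with tanh and scale them to symmetric command limits."""
    limits = (max_forward, max_climb, max_yaw_rate)
    return tuple(limit * math.tanh(x) for x, limit in zip(raw_outputs, limits))


# Large positive or negative raw outputs saturate at the command limits.
cmd = scale_action((0.0, 100.0, -100.0),
                   max_forward=3.0, max_climb=1.0, max_yaw_rate=0.5)
```

A symmetric squashing function lets the same network output represent commands in both directions, for example climbing and descending or yawing left and right.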
Table I: Training hyperparameters
| Hyperparameter | Value |
|---|---|
| replay buffer size | 50000 |
| soft update coefficient | 0.005 |
| policy update delay | 2 |
| random exploration steps | 1000 |
| square deviation of exploration noise | 0.1 |
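For reference, these settings could be collected in a small configuration object such as the sketch below. The class and field names are hypothetical and do not correspond to any particular library; only the numeric values come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TD3fDConfig:
    """Hypothetical container for the training hyperparameters listed in the table."""
    buffer_size: int = 50000               # replay buffer size
    tau: float = 0.005                     # soft update coefficient
    policy_delay: int = 2                  # policy update delay
    random_exploration_steps: int = 1000   # random exploration steps
    exploration_noise: float = 0.1         # square deviation of exploration noise


config = TD3fDConfig()
```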
V-C Reward Function
The agent’s objective is to reach the target in the smallest possible number of time-steps while avoiding obstacles. The reward function provides the required feedback to the agent in the training phase. To simplify the training process, a hand-designed reward function that includes a continuous reward is used:
where is the Euclidean distance from current position to goal position at time .
is a constant used as time penalty. In order to reduce the variance, no punishment term is used for a crash.
V-D Training Results
The imitation phase starts with a randomly initialized network and a replay buffer initialized with expert transitions. In this phase, the TD3fD algorithm is executed for 5000 time-steps using the data from the replay buffer rather than interacting with the environment. In our experiment, the weight of the imitation loss is set to 20 to get better behaviour cloning of the policy network. After the imitation phase, the policy network has acquired some sense of the environment and can succeed occasionally. Then the training phase is executed for 50000 time-steps to improve the pre-trained network through interaction with the environment. The decay step number is set to 5000.
To show the advantage in learning speed, the original TD3 algorithm is compared with our TD3fD method. Training results are shown in Fig. 3. The results show that TD3fD learns faster than the original TD3. After 50000 time-steps of training, TD3fD reaches an acceptable success rate, while the original TD3 suffers from poor data efficiency and needs more data to reach the same performance.
To test the generalization ability of the learned policy network, three new environments are built. World1 is the training environment shown in Fig. 1. World2 adds to World1 a tall building that cannot be flown over. In addition, two environments, Rocks and Neighborhood, are built with the AirSim simulator based on Unreal Engine, which provides more visually realistic scenes. The new environments are shown in Fig. 4.
In each environment, the policy networks learned using the pure BC method and TD3fD are each executed for 50 episodes without action noise. In the World1 and World2 environments, the goal position is generated randomly on a circle with a radius of 40 meters centered on the take-off point. In the Rocks environment, the radius is set to 60 meters. In the Neighborhood environment, the goal position is selected randomly from a list of 10 reachable positions. Trajectories generated in the training environment are shown in Fig. 5. The trajectories show that the UAV learned to climb and fly over some low obstacles to reach the goal.
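The random goal placement on a circle around the take-off point can be sketched as follows. The 40 meter radius is the value stated for World1 and World2; the function name, the uniform sampling of the angle and the fixed random seed are assumptions made for the example.

```python
import math
import random


def sample_goal_on_circle(center, radius, rng):
    """Draw a goal uniformly at random on a circle around the take-off point."""
    angle = rng.uniform(0.0, 2.0 * math.pi)
    return (center[0] + radius * math.cos(angle),
            center[1] + radius * math.sin(angle))


rng = random.Random(42)
goal = sample_goal_on_circle((0.0, 0.0), 40.0, rng)
```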
| Environment | Policy | Average reward | Success rate |
The average reward and success rate for the different environments are shown in Table II. Because the expert controller can only run with ROS, there is no expert data for the AirSim environments. As Table II shows, the policy trained with TD3fD greatly outperforms the policy learned using the pure BC method, and its final performance is similar to that of the traditional method. It is worth noting that the average reward of the policy learned using TD3fD exceeds that of the expert even with a slightly lower success rate, which means that the learned policy finds shorter paths than the expert. However, compared with the training environment, the success rate declines when the learned policy is executed in World2, because the learned policy relies too much on climbing rather than steering to avoid obstacles. This could be addressed by a better hand-designed reward function.
According to the evaluation results, the learned policy achieves acceptable performance in different unseen environments. Because the goal distance differs from that in the Gazebo environments, the average reward in the AirSim environments cannot be compared with that in the Gazebo environments directly. In the AirSim environments, the ground-truth state is used for low-level control, so the velocity control is better than in the Gazebo environments, which run the full PX4 flight stack in SITL mode, including state estimation. The success rate shows that the learned policy performs quite well in the complex AirSim environments, which indicates that good state estimation is important for the obstacle avoidance and navigation problem.
In this work, a DRL framework is proposed for UAV navigation and obstacle avoidance in unknown 3D environments. In particular, expert demonstrations are used to speed up the training process, and both the policy and Q-value networks are pre-trained in the imitation phase. Simulation results show that the learned end-to-end policy network achieves performance similar to that of the traditional navigation method. In addition, the DRL process can be accelerated significantly by leveraging only a small amount of expert demonstration data. Our method shows promise for learning in the real environment and can be integrated with any other off-policy actor-critic RL method.
In this work, training was carried out only in simulation; in future work, we will evaluate the learned policy in the real environment. We also plan to add safety constraints during the training process in order to achieve on-policy learning in the real environment safely.
- (2018) Spinning Up in Deep Reinforcement Learning. Cited by: §III-B.
- (2017) OpenAI Baselines. GitHub. Note: https://github.com/openai/baselines Cited by: §V.
- (2018) R-ADVANCE: rapid adaptive prediction for vision-based autonomous navigation, control, and evasion. Journal of Field Robotics 35 (1), pp. 91–100. Cited by: §I.
- (2018) Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477. Cited by: §III-B.
- (2016) Online quadrotor trajectory generation and autonomous navigation on point clouds. In 2016 IEEE International Symposium on Safety, Security, and Rescue Robotics (SSRR), pp. 139–146. Cited by: §I.
- (2018) Reinforcement learning from imperfect demonstrations. arXiv preprint arXiv:1802.05313. Cited by: §II-B, §IV-A.
- (2018) Stable Baselines. GitHub. Note: https://github.com/hill-a/stable-baselines Cited by: §V.
- (2016) Improved potential field method for unknown obstacle avoidance using UAV in indoor environment. In 2016 IEEE 14th International Symposium on Applied Machine Intelligence and Informatics (SAMI), pp. 345–350. Cited by: §I.
- (2015) PX4: a node-based multithreaded open source robotics framework for deeply embedded platforms. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pp. 6235–6240. Cited by: §V.
- (2018) Overcoming exploration in reinforcement learning with demonstrations. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 6292–6299. Cited by: §II-B.
- (2018) Reinforced imitation: sample efficient deep reinforcement learning for mapless navigation by leveraging prior demonstrations. IEEE Robotics and Automation Letters 3 (4), pp. 4423–4430. Cited by: §II-B.
- (2018) Autonomous UAV navigation using reinforcement learning. arXiv preprint arXiv:1801.05086. Cited by: §II-A.
- (2013) Learning monocular reactive UAV control in cluttered natural environments. In 2013 IEEE International Conference on Robotics and Automation, pp. 1765–1772. Cited by: §II-A.
- (2018) Real-time on-board obstacle avoidance for UAVs based on embedded stereo vision. arXiv preprint arXiv:1807.06271. Cited by: §I.
- (2018) AirSim: high-fidelity visual and physical simulation for autonomous vehicles. In Field and Service Robotics, pp. 621–635. Cited by: §V-E.
- (2012) Autonomous waypoint generation strategy for on-line navigation in unknown environments. environment 2, pp. 3D. Cited by: §II-A.
- (2019) Memory-based deep reinforcement learning for obstacle avoidance in UAV with limited environment knowledge. IEEE Transactions on Intelligent Transportation Systems. Cited by: §II-A.
- (2014) 3DVFH+: real-time three-dimensional obstacle avoidance using an octomap. In MORSE 2014 Model-Driven Robot Software Engineering: proceedings of the 1st International Workshop on Model-Driven Robot Software Engineering co-located with International Conference on Software Technologies: Applications and Foundations (STAF 2014), York, UK, July 21, 2014/Assmann, Uwe [edit.], pp. 91–102. Cited by: §I.
- (2017) Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817. Cited by: §II-B.
- (2019) Autonomous navigation of UAVs in large-scale complex environments: a deep reinforcement learning approach. IEEE Transactions on Vehicular Technology 68 (3), pp. 2124–2136. Cited by: §II-A.
- (2018) Learning with training wheels: speeding up training with a simple controller for deep reinforcement learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 6276–6283. Cited by: §II-B.
- (2016) Extending the OpenAI Gym for robotics: a toolkit for reinforcement learning using ROS and Gazebo. arXiv preprint arXiv:1608.05742. Cited by: §V.