Autonomous Unmanned Aerial Vehicle Navigation using Reinforcement Learning: A Systematic Review

by Fadi AlMahamid et al.
Western University

There is an increasing demand for using Unmanned Aerial Vehicles (UAVs), commonly known as drones, in different applications such as package delivery, traffic monitoring, search and rescue operations, and military combat engagements. In all of these applications, the UAV is used to navigate the environment autonomously, without human interaction, perform specific tasks, and avoid obstacles. Autonomous UAV navigation is commonly accomplished using Reinforcement Learning (RL), where agents act as experts in a domain to navigate the environment while avoiding obstacles. Understanding the navigation environment and algorithmic limitations plays an essential role in choosing the appropriate RL algorithm to solve the navigation problem effectively. Consequently, this study first identifies the main UAV navigation tasks and discusses navigation frameworks and simulation software. Next, RL algorithms are classified and discussed based on the environment, algorithm characteristics, abilities, and applications in different UAV navigation problems, which will help practitioners and researchers select the appropriate RL algorithm for their UAV navigation use cases. Moreover, identified gaps and opportunities will drive future UAV navigation research.





1 Introduction

Autonomous Systems (AS) are systems that can perform desired tasks without human interference, such as robots performing tasks without human involvement, self-driving cars, and delivery drones. AS are entering different domains to make operations more efficient and to reduce the cost and risk incurred from the human factor.

An Unmanned Aerial Vehicle (UAV) is an aircraft without a human pilot, commonly known as a drone. Autonomous UAVs have been receiving increasing interest due to their diverse applications, such as delivering packages to customers, responding to traffic collisions to attend to the injured, tracking military targets, assisting with search and rescue operations, and many other applications.

Typically, UAVs are equipped with cameras, among other sensors, that collect information from the surrounding environment, enabling UAVs to navigate that environment autonomously. UAV navigation training is typically conducted in a virtual 3D environment because UAVs have limited computation resources and power supply, and replacing UAV parts due to crashes can be expensive.

Different Reinforcement Learning (RL) algorithms are used to train UAVs to navigate the environment autonomously. RL can solve various problems where the agent acts as a human expert in the domain. The agent interacts with the environment by processing the environment’s state, responding with an action, and receiving a reward. UAV cameras and sensors capture information from the environment for state representation. The agent processes the captured state and outputs an action that determines the UAV movement’s direction or controls the propellers’ thrust, as illustrated in Figure 1.

Figure 1: UAV training using deep reinforcement agent

The research community has provided reviews of different UAV navigation problems, such as Visual UAV navigation [lu2018survey, zeng2020survey], UAV Flocking [azoulay2021machine], and Path Planning [aggarwal2020path]. Nevertheless, to the best of the authors' knowledge, there is no survey related to applications of RL in UAV navigation. Hence, this paper aims to provide a comprehensive and systematic review of the application of various RL algorithms to different autonomous UAV navigation problems. This survey has the following contributions:


  • Help practitioners and researchers select the right algorithm to solve the problem at hand based on the application area and environment type.

  • Explain primary principles and characteristics of various RL algorithms, identify relationships among them, and classify them according to the environment type.

  • Discuss and classify different RL UAV navigation frameworks according to the problem domain.

  • Recognize the various techniques used to solve different UAV autonomous navigation problems and the different simulation tools used to perform UAV navigation tasks.

The remainder of the paper is organized as follows: Section 2 presents the systematic review process, Section 3 introduces RL, Section 4 provides a comprehensive review of the application of various RL algorithms and techniques in autonomous UAV navigation, Section 5 discusses the UAV navigation frameworks and simulation software, Section 6 classifies RL algorithms and discusses the most prominent ones, Section 7 explains the RL algorithm selection process, and Section 8 identifies challenges and research opportunities. Finally, Section 9 concludes the paper.

2 Review Process

This section describes the inclusion criteria, paper identification process, and threats to validity.

2.1 Inclusion Criteria and Identification of Papers

The study's main objective is to analyze the application of Reinforcement Learning in UAV navigation and provide insights into RL algorithms. Therefore, the survey considered all papers from the past five years (2016-2021) written in the English language that include the following terms combined, along with their variations: Reinforcement Learning, Navigation, and UAV.

In contrast, RL algorithms are listed based on the authors' domain knowledge of the most prominent algorithms and by going through the related work of the identified algorithms, with no restriction on publication time, in order to include a large number of algorithms.

The identification process of the papers went through the following stages:


  • First stage: The authors identified all studies that strictly applied RL to UAV navigation and observed that model-free RL is typically utilized to tackle UAV navigation challenges, except for a single article [lou2016adaptive] that employs model-based RL. Therefore, the authors chose to concentrate on model-free RL and to exclude research irrelevant to UAV navigation, such as UAV networks and traditional optimization tools and techniques [guerra2021networks, guerra2021real, guerra2020dynamic, liu2020distributed, zhang2020self].

  • Second stage: The authors listed all RL algorithms based on the authors' knowledge of the most prominent algorithms and the references of the recognized algorithms, then identified the corresponding paper for each algorithm.

  • Third stage: The authors identified how RL is used to solve different UAV navigation problems, classified the work, and then recognized more related papers using existing work references.

IEEE Xplore and Scopus were the primary sources for paper identification between 2016 and 2021. The search query used the different terms that alternatively describe a UAV, such as UNMANNED AERIAL VEHICLE, DRONE, QUADCOPTER, or QUADROTOR, cross-checked with REINFORCEMENT LEARNING and NAVIGATION, which resulted in a total of 104 papers. After removing 15 duplicate papers and 5 unrelated papers, the count became 84.

The authors identified another 75 papers that mainly describe the RL algorithms, based on the authors' experience and the reference lists of the recognized work, using Google Scholar as the primary search engine. While RL for UAV navigation studies were restricted to five years, all RL algorithms were included, as many are still extensively used regardless of their age. The search was completed in November 2021, with a total of 159 papers after all exclusions.

2.2 Threats to Validity

Despite the authors' effort to include all relevant papers, the study might be subject to the following main threats to validity:


  • Location bias: The search for papers was performed using two primary digital libraries (databases), IEEE Xplore and Scopus, which might limit the retrieved papers based on the published journals, conferences, and workshops in the database.

  • Language bias: Only papers published in English are included.

  • Time Bias: The search query was limited to retrieving papers published between 2016 and 2021, which excludes relevant papers published before 2016.

  • Knowledge reporting bias: The research papers on RL algorithms were identified using the authors' knowledge of the various algorithms and the related work cited in the recognized algorithms. It is hard to pinpoint all algorithms using a search query, which could result in missing some RL algorithms.

3 Reinforcement Learning

RL can be explained using the Markov Decision Process (MDP), where an RL agent learns through experience by taking actions in the environment, causing a change in the environment's state, and receiving a reward for the action taken, which measures the success or failure of that action. Equation 1 defines the probability of transitioning from state s to the new state s' with reward r by taking action a, for all s', s, r, and a [almahamid2021reinforcement]:

p(s', r | s, a) = Pr{S_t = s', R_t = r | S_{t-1} = s, A_{t-1} = a}    (1)


The reward is generated using a reward function, which can be expressed as a function of the action, R(a), or as a function of state-action pairs, R(s, a). The reward helps the agent distinguish good actions from bad actions, and as the agent accumulates experience, it starts taking more successful actions and avoiding bad ones [almahamid2021reinforcement].

All actions the agent takes from a start state to a final (terminal) state make up an episode (trajectory). The goal of the MDP is to maximize the expected summation of the discounted rewards by adding all the rewards generated from an episode. However, sometimes the environment has an infinite horizon, where the actions cannot be divided into episodes. Therefore, using a discount factor (multiplier) γ raised to the power k, where 0 ≤ γ ≤ 1, as expressed in Equation 2, helps the agent emphasize the reward at the current time step and reduce the value of rewards granted at future time steps; moreover, it helps the expected summation of discounted rewards converge when the horizon is infinite [almahamid2021reinforcement]:

G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... = Σ_{k=0}^{∞} γ^k R_{t+k+1}    (2)
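As a worked illustration of Equation 2, the discounted return of a finite episode can be accumulated from the last reward backwards; the sketch below assumes a plain list of per-step reward values:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite episode.

    Iterating backwards turns the summation into the recursion
    G_t = R_{t+1} + gamma * G_{t+1}.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

With gamma = 0.5 and rewards [1, 1, 1], the return is 1 + 0.5·1 + 0.25·1 = 1.75; a smaller gamma weights immediate rewards more heavily, as described above.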
The following subsections introduce important reinforcement learning concepts.

3.1 Policy and Value Function

A policy π defines the agent's behavior by specifying the probability of taking action a while in state s, expressed as π(a|s). The agent evaluates its behavior (action) using a value function, which can be either a state-value function V(s), which estimates how good it is to be in state s, or an action-value function Q(s, a), which measures how good it is to select action a while in state s. The value produced by the action-value function in Equation 3 is known as the Q-value and is expressed in terms of the expected summation of the discounted rewards [almahamid2021reinforcement]:

Q_π(s, a) = E_π[ Σ_{k=0}^{∞} γ^k R_{t+k+1} | S_t = s, A_t = a ]    (3)
Since the objective is to maximize the expected summation of discounted rewards under the optimal policy π*, the agent tries to find the optimal Q-value q*(s, a) as defined in Equation 4. This optimal Q-value must satisfy the Bellman Optimality Equation 5, defined as the sum of the expected reward received from executing the current action a and the discounted sum of all future rewards received from any possible future state-action pairs (s', a'). In other words, the agent tries to select the actions that grant the highest rewards in an episode. In general, selecting the optimal value means selecting the action with the highest Q-value; however, the action with the highest Q-value sometimes might not lead to better rewarding actions in the future [almahamid2021reinforcement]:

q*(s, a) = max_π Q_π(s, a)    (4)

q*(s, a) = E[ R_{t+1} + γ max_{a'} q*(s', a') | S_t = s, A_t = a ]    (5)
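A minimal tabular sketch of this Bellman backup is the classic Q-learning-style update, which moves Q(s, a) toward the target r + γ max_a' Q(s', a'); the state and action labels below are hypothetical:

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One Bellman backup: nudge Q[s][a] toward r + gamma * max_a' Q[s'][a'].

    Q is a dict of dicts mapping state -> {action: value}; an unknown or
    terminal next state contributes 0 to the target.
    """
    best_next = max(Q[s_next].values()) if Q.get(s_next) else 0.0
    td_target = r + gamma * best_next
    Q[s][a] += alpha * (td_target - Q[s][a])
    return Q[s][a]
```

The learning rate alpha controls how far each estimate moves toward the target, so repeated backups converge toward q*(s, a) in the tabular case.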
3.2 Exploration vs Exploitation

Exploration vs. exploitation may be demonstrated using the multi-armed bandit problem, which portrays the behavior of a person playing a slot machine for the first time. The money (reward) the player receives early in the game is unrelated to any previously selected choices, and as the player develops an understanding of the reward, he/she begins selecting choices that contribute to earning a greater reward. The choices the player makes randomly to acquire knowledge can be described as the player Exploring the environment. In contrast, Exploiting the environment describes the choices selected based on his/her experience.

The RL agent needs to find the right balance between exploration and exploitation to maximize the expected return of rewards. Constantly exploiting the environment and selecting the action with the highest reward does not guarantee that the agent performs the optimal action, because the agent may miss out on a higher reward obtainable by taking an alternative set of actions in the future. The ratio between exploration and exploitation can be set through different strategies such as the ε-greedy strategy, Upper Confidence Bound (UCB), and Gradient Bandits [sutton2018reinforcement].
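For example, the ε-greedy strategy can be sketched in a few lines: with probability ε the agent explores by picking a random action, and otherwise it exploits the action with the highest estimated value:

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Pick an action index from a list of Q-value estimates.

    Explore with probability epsilon, otherwise exploit the argmax.
    """
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # explore: random action
    return max(range(len(q_values)), key=lambda i: q_values[i])  # exploit
```

A common refinement is to decay ε over training so the agent explores heavily early on and exploits its knowledge later.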

3.3 Experience Replay

The RL agent does not need data to learn; rather, it learns from experience by interacting with the environment. The agent's experience can be formulated as a tuple (s, a, r, s'), which describes the agent taking action a in a given state s, receiving a reward r for the performed action, and causing a new state s'. Experience Replay (ER) [lin1992self] is a technique that stores experiences in a replay memory (buffer) and uses a batch of uniformly sampled experiences for RL agent training.

On the other hand, Prioritized Experience Replay (PER) [schaul2015prioritized] prioritizes experiences according to their significance using the Temporal Difference error (TD-error) and replays experiences with higher TD-error more frequently when training the agent, which improves convergence.
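A uniform replay memory in the spirit of ER can be sketched as follows; the capacity and tuple layout are illustrative rather than taken from a specific implementation:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size memory of (s, a, r, s') tuples with uniform sampling."""

    def __init__(self, capacity):
        # deque with maxlen silently evicts the oldest experience when full
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # uniform sampling without replacement, as in plain ER
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

PER would replace the uniform `sample` with draws weighted by each experience's TD-error.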

3.4 On-Policy vs Off-Policy

In order to interact with the environment, the RL agent attempts to learn two policies: the first is referred to as the target policy π, which the agent learns through the value function, and the second is referred to as the behavior policy b, which the agent uses for action selection when interacting with the environment.

An RL algorithm is referred to as an on-policy algorithm when the same target policy π is employed to collect the training samples and to determine the expected return. In contrast, off-policy algorithms are those where the training samples are collected in accordance with the behavior policy b, while the expected return is generated using the target policy π [silver2014deterministic]. Another main difference is that off-policy algorithms can reuse past experiences: they do not require all the experiences within an episode (a full episode) to generate training samples, and the experiences can be collected from different episodes.

3.5 Deep Reinforcement Learning

Deep Reinforcement Learning (DRL) uses deep agents to learn the optimal policy, combining artificial Neural Networks (NN) with Reinforcement Learning (RL). The NN type used in DRL varies from one application to another depending on the problem being solved, the input type (state), and the number of inputs passed to the NN. For example, the RL framework can be integrated with a Convolutional Neural Network (CNN) to process images representing the environment's state or combined with a Recurrent Neural Network (RNN) to process inputs over different time steps.

The NN loss function is generically computed from the Temporal Difference (TD) error: the difference between the output of the NN, Q(s, a; θ), and the target Q-value obtained from the Bellman equation, as shown in Equation 6 [almahamid2021reinforcement]:

Loss = ( r + γ max_{a'} Q(s', a'; θ) - Q(s, a; θ) )²    (6)


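In scalar form, and assuming the target is built from the Bellman equation as in Equation 6, the loss for a single transition can be sketched as:

```python
def td_loss(q_pred, reward, q_next_max, gamma=0.99):
    """Squared TD error for one transition.

    q_pred:     Q(s, a) predicted by the network
    q_next_max: max_a' Q(s', a'), typically taken from a target network
    """
    td_target = reward + gamma * q_next_max
    return (td_target - q_pred) ** 2
```

In practice this quantity is averaged over a minibatch of replayed experiences and minimized by gradient descent on the network weights θ.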
The architecture of the deep agent can be simple or complex based on the problem at hand, where a complex architecture combines multiple NNs. What all deep agents have in common is that they receive the state as an input and output the action that maximizes the discounted return of rewards.

The application of deep NNs to the RL framework enabled the research community to solve more complex problems in autonomous systems that were previously hard to solve, and to achieve better performance than the previous state of the art, for example in drone navigation and obstacle avoidance using images received from the drone's monocular camera.

4 Autonomous UAV Navigation using DRL

Different DRL algorithms and techniques were used to solve various problems in autonomous UAV navigation, such as UAV control, obstacle avoidance, path planning, and flocking. The DRL agent acted as an expert in all of these problems, selecting the best action that maximizes the reward to achieve the desired objective. The input and the output of the DRL algorithm are generally determined based on the desired objective and the implemented technique.

The RL agent design for UAV navigation depicted in Figure 2 shows the different UAV input devices used to capture the state processed by the RL agent. The agent produces action values that can be either the movement values of the UAV or the waypoint values to which the UAV needs to relocate. Once the agent executes the action in the environment, it receives the new state and the reward generated based on the performed action. The reward function is designed to generate the reward subject to the intended objective while using various information from the environment. The agent design ('Agent' box in the figure) is influenced by the RL algorithms discussed in Section 6, where the agent components and inner workings vary from one algorithm to another.

Figure 2: RL agent design for UAV navigation task

Table 1 summarizes the application of RL to different UAV navigation tasks (objectives), and the following subsections discuss the UAV navigation tasks in more detail. As seen from this table, most of the research focused on two UAV navigation objectives: 1) Obstacle avoidance using various UAV sensor devices such as cameras and LIDARs and 2) Path planning to find the optimal or shortest route.

Objective | Sub-Objective | Papers
UAV Control | Controlling UAV flying behavior (attitude control) | [zhou2020efficient, karthik2020reinforcement, lou2016adaptive, deshpande2020developmental, camci2019learning, li2019optimal, greatwood2019reinforcement, koch2019reinforcement]
Obstacle Avoidance | Obstacle avoidance using images and sensor information | [salvatore2020neuro, bouhamed2020uav, huang2019autonomous, shin2019automatic, wang2017autonomous, wang2019autonomous, anwar2020autonomous, bouhamed2020autonomous, yang2020autonomous, li2019autonomous, chen2020collision, grando2020deep, camci2020deep, wang2020deep, cetin2019drone, morad2021embodied, yan2020flocking, yoon2019hierarchical, williams2017information, he2020integrated, singla2019memory, wu2018navigating, anwar2018navren, zhou2018neural, yijing2017q, villanueva2019deep, walvekar2019vision, zhou2019vision, hasanzade2021dynamically, munoz2019deep, hodge2021deep, doukhi2021deep, bakale2020indoor, maxey2019navigation, zhao2021reinforcement, greatwood2019reinforcement, tong2021uav]
Obstacle Avoidance | Obstacle avoidance while considering the battery level | [bouhamed2020ddpg]
Path Planning | Local and global path planning (finding the shortest/optimal route) | [walker2019deep, bouhamed2020generic, bouhamed2020uav, shin2019automatic, zhang2020iadrl, yu2019navigation, li2018path, wu2018navigating, sacharny2019optimal, camci2019planning, guerra2020reinforcement, cui2021uav, hasanzade2021dynamically, wang2021pretrained, eslamiat2019autonomous, bakale2020indoor, tong2021uav]
Path Planning | Path planning while considering the battery level | [bouhamed2020uav, imanberdiyev2016autonomous, abedin2020data]
Path Planning | Find fixed or moving targets (points of interest) | [andrew2018deep, pham2018reinforcement, guerra2020reinforcement, kulkarni2020uav, peake2020wilderness, akhloufi2019drones, tong2021uav]
Path Planning | Landing the UAV on a selected point | [polvara2018toward, polvara2019autonomous, lee2018vision]
Flocking | Maintain speed and orientation with other UAVs (formation) | [wang2018deep, lee2020autonomous, yan2020flocking, madridano2021software]
Flocking | Obstacle avoidance | [wang2020two, madridano2021software]
Flocking | Target tracking | [moon2021deep, liu2019distributed, omi2021introduction, viseras2021wildfire, bonnet2019uav]
Flocking | Flocking while considering the battery level | [liu2019distributed]
Flocking | Covering geographical region | [liu2019distributed, fan2020prioritized]
Flocking | Path planning and finding the safest route | [majd2018integrating, madridano2021software]
Table 1: DRL application to different UAV Navigation tasks

4.1 UAV Control

RL is used to control the movement of the UAV in the environment by applying changes to the flight mechanics of the UAV, which vary based on the UAV type. In general, UAVs can be classified based on flight mechanics into 1) multirotor, 2) fixed-wing, 3) single-rotor, and 4) fixed-wing hybrid Vertical Take-Off and Landing (VTOL) [chapman2016dronetypes].

Multirotor, also known as a multicopter or drone, uses more than two rotors to control the flight mechanics by applying different amounts of thrust to the rotors, causing changes about the principal axes and leading to four UAV movements: 1) pitch, 2) roll, 3) yaw, and 4) throttle, as illustrated in Figure 3. Similarly, single-rotor and fixed-wing hybrid VTOL UAVs apply changes to different rotors to generate the desired movement, except that they use tilt-rotor(s), with wings added in the fixed-wing hybrid VTOL. On the other hand, fixed-wing UAVs can only achieve three actions (pitch, roll, and yaw); they take off by generating enough speed for the aerodynamics to lift the UAV.

Figure 3: Multirotor Flight Mechanics

Quadrotors have four propellers: two diagonal propellers rotate clockwise and the other two rotate counter-clockwise, producing the throttle action. When the propellers generate thrust greater than the UAV's weight, the UAV elevates; when the thrust equals the UAV's weight, the UAV stops elevating and starts hovering in place. In contrast, if all propellers rotate in the same direction, they cause a yaw action in the opposite direction, as shown in Figure 4.
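The mapping from the four movement commands to individual rotor thrusts can be sketched as a simplified motor-mixing rule for an X-configuration quadrotor; the sign conventions below are illustrative and differ between flight controllers:

```python
def motor_mix(throttle, roll, pitch, yaw):
    """Simplified thrust commands for the (front-left, front-right,
    rear-left, rear-right) rotors of an X-configuration quadrotor.

    Equal thrust on all rotors gives pure throttle (climb or hover);
    a yaw command speeds up one diagonal pair and slows the other.
    """
    fl = throttle + roll + pitch - yaw
    fr = throttle - roll + pitch + yaw
    rl = throttle + roll - pitch + yaw
    rr = throttle - roll - pitch - yaw
    return fl, fr, rl, rr
```

Note that a pure yaw command leaves the total thrust unchanged while creating a torque imbalance between the clockwise and counter-clockwise pairs, which is what turns the airframe.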

The steps described in Figure 5 depict the RL process used to control the UAV, which depends on the RL algorithm used; the most important takeaway is that RL uses the UAV state to produce actions. These actions are responsible for moving the UAV in the environment and can be either direct changes to the pitch, roll, yaw, and throttle values or indirect changes that require transformation into commands understood by the UAV.

4.2 Obstacle Avoidance

Avoiding obstacles is an essential task required by the UAV to navigate any environment, which can be achieved by estimating the distance to the objects in the environment using different devices such as front-facing cameras or distance sensors. The output generated by these different devices provides input to the RL algorithm and plays a significant role in the NN architecture.

Lu et al. [lu2018survey] described different front-facing cameras, such as monocular cameras, stereo cameras, and RGB-D cameras, that a UAV can use. Each camera type produces a different image type used as raw input to the RL agent. However, regardless of the camera type, these images can be preprocessed using computer vision to produce specific image types, as described below:

Figure 4: Yaw vs Throttle Mechanics
Figure 5: UAV Control using RL

  • RGB Images: standard color images where each pixel is represented by three values (Red, Green, Blue), each ranging between 0 and 255.

  • Depth-Map Images: contain information about the distance of objects within the Field Of View (FOV).

  • Event-Based Images: are special images that output the changes in brightness intensity instead of standard images. Event-based images are produced by an event camera, also known as Dynamic Vision Sensor (DVS).

RGB images lack depth information; therefore, the agent cannot estimate how far or close the UAV is to an object, leading to unexpected flying behavior. On the other hand, depth information is essential for building a successful reward function that penalizes moving closer to objects. Some techniques used RGB images and depth maps simultaneously as input to the agent to provide more information about the environment. In contrast, event-based image data are represented as one-dimensional sequences of events over time, which are used to capture quickly changing information in the scene [salvatore2020neuro].

Similar to cameras, distance sensors come in different types, such as LiDAR, RADAR, and acoustic sensors: they estimate the distance of the surrounding objects to the UAV but require less storage than 2D images since they do not use RGB channels.

The output generated by these devices reflects the different states the UAV has over time and is used as input to the RL agent, which produces actions that move the UAV in different directions to avoid obstacles. The NN architecture of the RL agent is based on: 1) the input type, 2) the number of inputs, and 3) the algorithm used. For example, processing RGB images or depth-map images using the DQN algorithm requires a Convolutional Neural Network (CNN) followed by fully-connected layers, since CNNs are known for their power in processing images. In contrast, processing event-based images is performed using Spiking Neural Networks (SNN), which are designed to handle spatio-temporal data and identify spatio-temporal patterns [salvatore2020neuro].

4.3 Path Planning

Autonomous UAVs must have a well-defined objective before executing a flying mission. Typically, the goal is to fly from a start to a destination point, such as in delivery drones. But, the goal can also be more sophisticated, such as performing surveillance by hovering over a geographical area or participating in search and rescue operations to find a missing person.

Autonomous UAV navigation requires path planning to find the best UAV path to achieve the flying objective while avoiding obstacles. The optimal path does not always mean the shortest path or a straight line between two points; instead, the UAV aims to find a safe path while considering UAV’s limited power and flying mission.

Path planning can be divided into two main types:


  • Global Path Planning: concerned with planning the path from the start point to the destination point in an attempt to select the optimal path.

  • Local Path Planning: concerned with planning the local optimal waypoints in an attempt to avoid static and dynamic obstacles while considering the final destination.

Path planning can be solved using different techniques and algorithms; in this work, we focus on RL techniques used to solve global and local path planning, where the RL agent receives information from the environment and outputs the optimal waypoints according to the reward function. RL techniques can be classified according to their usage of the environment's local information into 1) map-based navigation and 2) mapless navigation.

4.3.1 Map-Based Navigation

A UAV that adopts map-based navigation uses a representation of the environment either in 3D or 2D format. The representation might include one or more of the following about the environment: 1) the different terrains, 2) fixed-obstacles locations, and 3) charging/ground stations.

Some maps oversimplify the environment representation: the map is divided into a grid of equally-sized cells that store information about the environment [elnaggar2018irl, andrew2018deep, cui2021uav]. Others oversimplify the environment's structure by simplifying object representation or by using 1D/2D representations of the environment [grando2020deep, wang2020deep, liu2019distributed, yan2020flocking, williams2017information, omi2021introduction, sacharny2019optimal, yijing2017q, pham2018reinforcement, guerra2020reinforcement, bakale2020indoor]. The UAV has to plan a safe and optimal path over the cells to avoid cells containing obstacles until it reaches its destination, and it has to plan its stopovers at the charging stations based on the battery level and path length.

In a more realistic scenario, the UAV calculates a route using the map information and the GPS signal to track the UAV's current location, starting point, and destination point. The RL agent evaluates the change in the distance and the angle between the UAV's current GPS location and the target GPS location, and penalizes the reward if the distance increases or if the path is unsafe, depending on the reward function (objective).
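A distance-based reward of this kind can be sketched as below; the collision penalty and the choice of rewarding the change in distance are illustrative, not taken from a specific paper:

```python
import math

def navigation_reward(uav_pos, target_pos, prev_distance, collision=False):
    """Reward the UAV for closing the distance to the target.

    Returns (reward, new_distance); a collision incurs a large
    (hypothetical) penalty regardless of progress toward the target.
    """
    distance = math.dist(uav_pos, target_pos)
    if collision:
        return -100.0, distance
    # positive when the UAV moved closer, negative when it moved away
    return prev_distance - distance, distance
```

The returned distance is fed back as `prev_distance` at the next time step, so the agent is rewarded per step for progress rather than only at the destination.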

4.3.2 Mapless Navigation

Mapless navigation does not rely on maps; instead, it applies computer vision techniques to extract features from the environment and learn the different patterns to reach the destination, which requires computation resources that might be overwhelming for some UAVs.

Localization information about the UAV, obtained by different means such as the Global Positioning System (GPS) or an Inertial Measurement Unit (IMU), is used in mapless navigation to plan the optimal path. The DRL agent receives the scene image, the destination target, and the localization information as input and outputs the change in the UAV movements.

For example, Zhou et al. [zhou2019vision] calculated and tracked the angle between the UAV and the destination point, then encoded it with the depth image extracted from the scene and used both as the state representation for the DRL agent. Although localization information seems essential for planning the path, some techniques achieved high-speed navigation using a monocular visual reactive navigation system without GPS [escobar2018r].
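A state encoding along these lines, concatenating perception features with the relative distance and bearing to the goal, can be sketched in 2D (a simplification of Zhou et al.'s angle encoding; the feature layout is hypothetical):

```python
import math

def mapless_state(depth_features, uav_pos, uav_heading, target_pos):
    """Build a flat state vector: perception features + goal information.

    The bearing is the angle between the UAV's heading and the target,
    encoded as (sin, cos) to avoid the wrap-around at +/- pi.
    """
    dx = target_pos[0] - uav_pos[0]
    dy = target_pos[1] - uav_pos[1]
    distance = math.hypot(dx, dy)
    bearing = math.atan2(dy, dx) - uav_heading
    return list(depth_features) + [distance, math.sin(bearing), math.cos(bearing)]
```

The resulting vector can be fed directly to the DRL agent's network alongside, or in place of, raw image input.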

4.4 Flocking

Although UAVs are known for performing individual tasks, they can flock to perform tasks efficiently and quickly, which requires maintaining flight formation. UAV flocking has many applications, such as search and rescue operations to cover a wide geographical area.

UAV flocking is considered a more sophisticated task than a single UAV flying mission because UAVs need to orchestrate their flight to maintain flight formation while performing other tasks such as UAV control and obstacle avoidance. Flocking can be executed using different topologies:


  • Flock Centering: maintaining flight formation as suggested by Reynolds [reynolds1987flocks] involves three concepts: 1) flock centering, 2) avoiding obstacles, and 3) velocity matching. This topology was applied in several research papers [olfati2006flocking, lee2020autonomous, la2010flocking, jia2017three, su2009flocking].

  • Leader-Follower Flocking: the flock leader has its own mission of reaching the destination, while the followers (the other UAVs) flock with the leader with the mission of maintaining distance and relative position to the leader [quintero2013flocking, hung2016q].

  • Neighbors Flocking: close neighbors coordinate with each other, where each UAV communicates with two or more nearest neighbors to maintain flight formation by maintaining relative distance and angle to the neighbors [wang2018deep, morihiro2007reinforcement, xu2018multi].

Maintaining flight formation using RL requires communication between UAVs to learn the best policy to maintain the formation while avoiding obstacles. These RL systems can be trained using a single agent or multi-agents in centralized or distributed settings.

4.4.1 Centralized Training

A centralized RL agent trains a shared flocking policy to maintain the flock formation using the experience collected from all UAVs, while each UAV acts individually according to its local environment information such as obstacles. The reward function of the centralized agent can be customized to serve the flocking topology, such as flock centering or leader-follower flocking.

Yan et al. [yan2020flocking] used the Proximal Policy Optimization (PPO) algorithm to train a centralized shared flocking control policy in which each UAV flocks as close as possible to the flock center, with decentralized execution for obstacle avoidance according to each UAV's local environment information. Similarly, Hung and Givigi [hung2016q] trained a leader UAV to reach a destination while avoiding obstacles, and trained a shared policy for the followers to flock with the leader, considering the relative dynamics between the leader and the followers.

Zhao et al. [zhao2021reinforcement] used Multi-Agent Reinforcement Learning (MARL) to train a centralized flock control policy shared by all UAVs with decentralized execution. The MARL agent received the position, speed, and flight-path angle of all UAVs at each time step and sought the optimal flocking control policy.

Centralized training does not generalize well to the neighbors-flocking topology, since the policy learned for one neighbor differs from the other neighbors' policies due to differences in the neighbors' dynamics.

4.4.2 Distributed Training

UAV flocking can be trained using a distributed (decentralized) approach, where each UAV has its designated RL agent responsible for finding the optimal flock policy for the UAV. The reward function is defined to maintain distance and flying direction with other UAVs and can be customized to include further information depending on the objective.

Because the RL agents are distributed, flight information such as location and heading angle must be communicated to the other UAVs, and the state representation must include the other UAVs' information. A UAV that fails to receive this information becomes isolated from the flock.

Liu et al. [liu2019distributed] proposed a decentralized DRL framework that controls each UAV in a distributed setting to maximize the average coverage score and geographical fairness while minimizing the UAVs' energy consumption.

5 UAV Navigation Frameworks and Simulation Software

Subsection 5.1 discusses and classifies UAV navigation frameworks based on the navigation objectives/sub-objectives explained in Section 4, identifying categories such as path planning frameworks, flocking frameworks, energy-aware navigation frameworks, and others. Subsection 5.2 explains the components of simulation software and the simulation packages most commonly used to perform experiments.

5.1 UAV Navigation Frameworks

In general, a software framework is a conceptual structure, analogous to a blueprint, that guides the construction of software by defining different functions and their interrelationships. By this definition, RL can be considered a framework by itself; therefore, we consider only UAV navigation frameworks that add functionality beyond traditional navigation using sensor or camera data. Table 2 classifies UAV frameworks based on the framework objective, and the subsequent sections discuss the frameworks in more detail.

Framework Objective Papers
Energy-aware UAV Navigation [bouhamed2020ddpg, imanberdiyev2016autonomous]
Path Planning [walker2019deep, bouhamed2020autonomous, li2019autonomous, zhang2020iadrl, camci2019planning, yijing2017q, eslamiat2019autonomous]
Flocking [bouhamed2020generic, majd2018integrating]
Vision-Based Frameworks [andrew2018deep, he2020integrated, singla2019memory, akhloufi2019drones]
Transfer Learning [yoon2019hierarchical]
Table 2: UAV Navigation Frameworks

5.1.1 Energy-Aware UAV Navigation Frameworks

UAVs operate mainly on batteries and hence have limited flight time. Therefore, planning the flight route and recharge stopovers is crucial for reaching the destination. Energy-aware UAV navigation frameworks aim to provide obstacle-avoiding navigation while considering the UAV's battery capacity.

Bouhamed et al. [bouhamed2020ddpg] developed a framework based on the Deep Deterministic Policy Gradient (DDPG) algorithm to guide the UAV to a target position while communicating with ground stations, allowing the UAV to recharge its battery if it drops below a specific threshold. Similarly, Imanberdiyev et al. [imanberdiyev2016autonomous] monitored the battery level, rotor condition, and sensor readings to plan the route and applied the route changes required for battery charging.
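The recharge rule described above can be sketched as a simple decision function. This is a hypothetical illustration; the names and the threshold policy are not taken from the cited frameworks.

```python
def next_waypoint(battery_level, threshold, target, charging_station):
    """Energy-aware routing sketch: divert to a ground charging station
    when the battery drops below a threshold, otherwise fly to the target.
    All arguments are illustrative placeholders."""
    if battery_level < threshold:
        return charging_station
    return target
```

In a full framework this check would run inside the navigation loop, with the threshold set from the remaining distance and expected energy consumption.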

5.1.2 Path Planning Frameworks

Path planning is the process of determining the most efficient route that meets the flight objective, such as finding the shortest, fastest, or safest route. Different frameworks [walker2019deep, yijing2017q] implemented a modular path planning scheme, where each module has a specialized function and exchanges data with the other modules to train action-selection policies and discover the optimal path.

Similarly, Li et al. [li2019autonomous] developed a four-layer framework in which each layer generates a set of objective and constraint functions. Each layer's functions serve the layer below while respecting the objectives and constraints of the layer above, with the overall goal of generating trajectories.

Other frameworks suggested stage-based learning, choosing actions from the appropriate stage depending on the current environment. For example, Camci and Kayacan [camci2019planning] proposed learning a set of motion primitives offline and then using them online to design quick maneuvers, enabling seamless switching between two modes: near-hover motions, which generate motion plans for the stable completion of maneuvers, and swift maneuvers, which deal smoothly with abrupt inputs.

In a collaborative setting, Zhang et al. [zhang2020iadrl] suggested a coalition in which an Unmanned Ground Vehicle (UGV) and a UAV complement each other to reach the destination: the UAV cannot reach distant locations alone due to its limited battery power, and the UGV cannot reach high altitudes due to its limited abilities.

5.1.3 Flocking Frameworks

UAV flocking frameworks provide functionality beyond navigating while maintaining flight formation. For example, Bouhamed et al. [bouhamed2020generic] presented an RL-based spatiotemporal scheduling system for autonomous UAVs. The system enables UAVs to autonomously arrange their schedules to cover the largest number of pre-scheduled events physically and temporally spread throughout a specified geographical region and time horizon. Majd et al. [majd2018integrating], on the other hand, predicted the movement of drones and dynamic obstacles in the flying zone to generate efficient and safe routes.

5.1.4 Vision-Based Frameworks

Vision-based frameworks depend on the UAV camera for navigation, using the images it produces to add functionality for improved navigation. Such frameworks may augment the agent's CNN architecture to fuse data from several sensors, use Long Short-Term Memory (LSTM) cells to retain navigation choices, use an RNN to capture the UAV states over different time steps, or pre-process images to provide more information about the environment [he2020integrated, singla2019memory].

5.1.5 Transfer Learning Frameworks

UAVs are trained on target environments before executing the flight mission; the training is carried out either in a virtual or a real-world environment. The UAV requires retraining when introduced to a new environment or when moving from virtual training to the real world, since environments differ in terrain and in obstacle structures and textures. Moreover, UAV training takes a long time and is hardware-intensive, while actual UAVs have limited hardware resources. Therefore, when a UAV is introduced to a new environment, transfer learning frameworks reduce the training time by reusing the NN weights trained in the previous environment and retraining only parts of the agent's NN.

Yoon et al. [yoon2019hierarchical] proposed an algorithm-hardware co-design in which the UAV is trained in a virtual environment; after deployment to the real world, the agent loads the weights stored in embedded Non-Volatile Memory (eNVM), evaluates new actions, and retrains only the last few CNN layers, whose weights are stored in on-die Static Random Access Memory (SRAM).
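The idea of retraining only the tail of the network can be sketched abstractly. The helper below is hypothetical: the layer names and the two-layer trainable tail are illustrative, not the co-design's actual partitioning.

```python
def partition_for_transfer(layer_names, trainable_tail=2):
    """Transfer-learning sketch: early layers stay frozen (as if read from
    eNVM), only the last few layers are retrained (as if held in SRAM)."""
    frozen = layer_names[:-trainable_tail]      # weights reused as-is
    trainable = layer_names[-trainable_tail:]   # weights updated on-device
    return frozen, trainable
```

In a deep-learning framework, the same split would be realized by disabling gradient computation for the frozen layers.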

5.2 Simulation Software

The research community has used different evaluation methods for autonomous UAV navigation with RL. Simulation software is widely used instead of actual UAVs because of the cost of the hardware (drone) and of the replacement parts required after crashes. The intent here is not to compare simulation software but to make the research community aware of the most commonly used evaluation tools, as illustrated in Figure 6. 3D UAV navigation simulation mainly requires three components, as illustrated in Figure 7:

Figure 6: UAV Simulation Software Usage
Figure 7: UAV Simulation Software Components

  • RL Agent: represents the RL algorithm used with all computations required to generate the reward, process the states, and compute the optimal action. RL Agent interacts directly with the UAV Flight simulator to send/receive UAV actions/states.

  • UAV Flight Simulator: responsible for simulating the UAV movements and interactions with the 3D environment, such as obtaining images from the UAV camera or reporting UAV crashes into obstacles. Examples of UAV flight simulators are the Robot Operating System (ROS) [ros2021online] and Microsoft AirSim [airsim2021online].

  • 3D Graphics Engine: provides a 3D graphics environment with the physics engine, which is responsible for simulating the gravity and dynamics similar to the real world. Examples of 3D graphics engines are Gazebo [gazebo2021online] and Unreal Engine [unrealengine2021online].
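The three components can be wired together as in the following minimal sketch. These are pure-Python stand-ins for illustration only, not the real ROS, AirSim, or Gazebo APIs.

```python
class GraphicsEngine:
    """Stand-in for Gazebo/Unreal: owns world state and physics."""
    def __init__(self):
        self.steps = 0
    def advance(self, command):
        self.steps += 1
        return {"image": [0] * 16, "collided": False}  # dummy sensor frame

class FlightSimulator:
    """Stand-in for ROS/AirSim: turns agent actions into UAV commands."""
    def __init__(self, engine):
        self.engine = engine
    def step(self, action):
        return self.engine.advance(command=action)

class Agent:
    """Stand-in RL agent: maps a state to an action."""
    def act(self, state):
        return "FORWARD"

# One interaction step flowing through all three components
engine = GraphicsEngine()
sim = FlightSimulator(engine)
agent = Agent()
state = sim.step(agent.act(state=None))
```

The real stack replaces each class with the corresponding software, but the data flow (agent → flight simulator → graphics engine → agent) is the same.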

Due to compatibility/support considerations, ROS is used in conjunction with Gazebo, while AirSim uses Unreal Engine to run the simulations. However, all three components are not always present, especially if the simulation software has internal modules or plugins that provide the required functionality, as in MATLAB.

6 Reinforcement Learning Algorithms Classification

The previous sections discussed the UAV navigation tasks and frameworks without elaborating on RL algorithms. However, to choose a suitable algorithm for the application environment and the navigation task, the comprehension of RL algorithms and their characteristics is necessary. For example, the DQN algorithm and its variations can be used for UAV navigation tasks that use simple movement actions (UP, DOWN, LEFT, RIGHT, FORWARD) since they are discrete. Therefore, this section examines RL algorithms and their characteristics.

AlMahamid and Grolinger [almahamid2021reinforcement] categorized RL algorithms into three main categories according to the number of states and the type of actions: 1) limited number of states and discrete actions, 2) unlimited number of states and discrete actions, and 3) unlimited number of states and continuous actions. We extend this with sub-classes, analyze more than 50 RL algorithms, and examine their use in UAV navigation. Table 3 classifies all RL algorithms found in UAV navigation studies and includes other prominent RL algorithms to show the intersection between RL and UAV navigation. Furthermore, this section discusses the algorithms' characteristics and highlights RL applications in different UAV navigation studies. Note that Table 3 includes many algorithms, but only the most prominent ones are discussed in the following subsections.

State Action Class Algorithm On/Off Policy Actor-Critic Multi-Thread Distributed Multi-Agent Usage
Limited Discrete Simple RL Q-Learning [watkins1992q] - No No No No [bouhamed2020generic, yu2019navigation, li2018path, sacharny2019optimal, karthik2020reinforcement, pham2018reinforcement, guerra2020reinforcement, bouhamed2020uav, kulkarni2020uav, cui2021uav, fotouhi2021deep, greatwood2019reinforcement]
SARSA [rummery1994line] - No No No No -
Unlimited Discrete DQN Variations DQN [mnih2013playing] Off No No No No [salvatore2020neuro, huang2019autonomous, shin2019automatic, chen2020collision, abedin2020data, camci2020deep, huang2019deep, cetin2019drone, williams2017information, wu2018navigating, zhou2018neural, camci2019planning, yijing2017q, walvekar2019vision, viseras2021wildfire, eslamiat2019autonomous, fotouhi2021deep, akhloufi2019drones, bakale2020indoor, camci2019learning, madridano2021software, bonnet2019uav]
Double DQN [VanHasselt2016] Off No No No No [zhou2020efficient, shin2019automatic, anwar2020autonomous, yang2020autonomous, cetin2019drone, yoon2019hierarchical, anwar2018navren, polvara2018toward, polvara2019autonomous, fotouhi2021deep, munoz2019deep]
Dueling DQN [Wang2016] Off No No No No [shin2019automatic, cetin2019drone]
DRQN [hausknecht2015deep] Off No No No No [andrew2018deep, singla2019memory, peake2020wilderness, tong2021uav]
DD-DQN [Wang2016] Off No No No No [shin2019automatic, villanueva2019deep]
DD-DRQN Off No No No No -
Distributional DQN Noisy DQN [fortunato2017noisy] Off No No No No -
C51-DQN [bellemare2017distributional] Off No No No No -
QR-DQN [dabney2018distributional] Off No No No No -
IQN [dabney2018implicit] Off No No No No -
Rainbow DQN [hessel2018rainbow] Off No No No No -
FQF [yang2019fully] Off No No No No -
Distributed DQN R2D2 [kapturowski2018recurrent] Off No No Yes No -
Ape-X DQN [horgan2018distributed] Off No No Yes No -
NGU [badia2020never] Off No No Yes No -
Agent57 [badia2020agent57] Off No No Yes No -
Deep SARSA Variations Deep SARSA [zhao2016deep] On No No No No -
Double SARSA On No No No No -
Dueling SARSA On No No No No -
DR-SARSA On No No No No -
DD-SARSA On No No No No -
DD-DR-SARSA On No No No No -
Continuous Policy Based REINFORCE [williams1992simple] On No No No No -
TRPO [schulman2015trust] On No No No No [koch2019reinforcement]
PPO [schulman2017proximal] On No No No No [morad2021embodied, yan2020flocking, zhang2020iadrl, hasanzade2021dynamically, wang2021pretrained, hodge2021deep, deshpande2020developmental, maxey2019navigation, koch2019reinforcement]
PPG [cobbe2020phasic] Off No No No No -
SVPG [liu2017stein] Off No No No No -
Actor-Critic SLAC [lee2019stochastic] Off Yes No No No -
ACE [zhang2019ace] Off Yes Yes No No -
DAC [zhang2019dac] Off Yes No No No -
DPG [silver2014deterministic] Off Yes No No No [li2019optimal]
RDPG [heess2015memory] Off Yes No No No -
DDPG [lillicrap2015continuous] Off Yes No No No [bouhamed2020ddpg, wang2018deep, wang2020two, bouhamed2020uav, bouhamed2020autonomous, li2019autonomous, grando2020deep, wang2020deep, liu2019distributed, he2020integrated, zhou2019vision, doukhi2021deep, koch2019reinforcement, lee2018vision]
TD3 [fujimoto2018addressing] Off Yes No No No [omi2021introduction]
SAC [haarnoja2018soft] Off Yes No No No [grando2020deep]
Multi-Agent and Distributed Actor-Critic Ape-X DPG [horgan2018distributed] Off Yes No Yes No -
D4PG [barth2018distributed] Off Yes No Yes Yes -
A2C [mnih2016asynchronous] On Yes Yes Yes No [lee2020autonomous, peake2020wilderness]
DPPO [heess1707emergence] On No No Yes Yes -
A3C [mnih2016asynchronous] On Yes Yes No Yes [wang2020deep]
PAAC [alfredo2017efficient] On Yes Yes No No -
ACER [wang2016sample] Off Yes Yes No No -
Reactor [gruslys2017reactor] Off Yes Yes No No -
ACKTR [wu2017scalable] On Yes Yes No No -
MADDPG [lowe2017multi] Off Yes No No Yes -
MATD3 [ackermann2019reducing] Off Yes No No Yes -
MAAC [iqbal2019actor] Off Yes No No Yes [fan2020prioritized]
IMPALA [espeholt2018impala] Off Yes Yes Yes Yes -
SEED [espeholt2019seed] Off Yes Yes Yes Yes -
Table 3: RL Algorithms usage and classification

6.1 Limited States and Discrete Actions

Generally, simple environments have a limited number of states and the agent transitions between states by executing discrete (limited number) actions. For example, in a tic-tac-toe game, the agent has a predefined set of two actions X or O that are used to update the nine boxes constituting the predefined set of known states. Q-Learning [watkins1992q] and State–Action–Reward–State–Action (SARSA) [rummery1994line] algorithms can be applied to environments with a limited number of states and discrete actions, where they maintain a Q-Table with all possible states and actions while iteratively updating the Q-values for each state-action pair to find the optimal policy.

SARSA is similar to Q-learning, except that to update the current value it computes the next state-action value by actually executing the next action [zhao2016deep]. In contrast, Q-learning updates the current value by computing the next state-action value with the Bellman equation: since the next action is unknown, it takes the greedy action that maximizes the reward [zhao2016deep].
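The two update rules can be contrasted in a few lines of Python. This is a tabular sketch with the Q-table stored as nested dicts; the step size and discount factor are illustrative.

```python
def q_learning_update(Q, s, a, r, s2, alpha=0.1, gamma=0.9):
    """Q-learning (off-policy): bootstrap from the greedy (max) next action."""
    best_next = max(Q[s2].values())
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])

def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.9):
    """SARSA (on-policy): bootstrap from the next action actually taken."""
    Q[s][a] += alpha * (r + gamma * Q[s2][a2] - Q[s][a])
```

The only difference is the bootstrap target: `max(Q[s2].values())` versus `Q[s2][a2]`, which is exactly the off-policy/on-policy distinction described above.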

6.2 Unlimited States and Discrete Actions

An RL agent uses a Deep Neural Network (DNN), usually a CNN, in complex environments such as the Pong game, where the states are unlimited and the actions are discrete (UP, DOWN). The agent's DNN processes the environment's state as input and outputs the Q-values of the available actions. The following subsections discuss the algorithms that can be used in this type of environment, such as DQN, Deep SARSA, and their variations [almahamid2021reinforcement].

6.2.1 Deep Q-Networks Variations

Deep Q-Learning, also known as Deep Q-Network (DQN)[mnih2013playing], is a primary method used in settings with an unlimited number of states and discrete actions, and it serves as an inspiration for other algorithms used for the same goal. As illustrated in Figure 8 [anwar2020autonomous], DQN architecture frequently employs convolutional and pooling layers, followed by fully connected layers that provide Q-values corresponding to the number of actions. A significant disadvantage of the DQN algorithm is that it overestimates the action-value (Q-value), with the agent selecting the actions with the highest Q-value, which may not be the optimal action [VanHasselt2010].

Double DQN solves the overestimation issue in DQN by using two networks. The first, known as the policy network, optimizes the Q-value, while the second, known as the target network, is a clone of the policy network and is used to generate the estimated Q-value [VanHasselt2016]. After a specified number of time steps, the target network parameters are updated by copying the policy network parameters instead of performing backpropagation.
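The periodic hard update of the target network can be sketched as follows, with parameters held in plain dicts; the `sync_every` interval is illustrative.

```python
def hard_sync(policy_params, target_params, step, sync_every=1000):
    """Target-network sketch: every `sync_every` steps, copy the policy
    network parameters into the target network (no backpropagation)."""
    if step % sync_every == 0:
        target_params = dict(policy_params)  # clone the policy weights
    return target_params
```

Between synchronizations the target network stays fixed, which stabilizes the bootstrap targets used for training the policy network.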

Figure 8: DQN using AlexNet CNN

Dueling DQN, as depicted in Figure 9 [Wang2016], is a further enhancement of DQN. To improve Q-value evaluation, Dueling DQN employs the following functions in place of the Q-value function:


  • The State-Value function V(s) quantifies how desirable it is for an agent to be in state s.

  • The Advantage-Value function A(s, a) assesses the superiority of the selected action in a given state over the other actions.

The two functions depicted in Figure 9 are integrated using a custom aggregation layer to generate an estimate of the state-action value function [Wang2016]:

Q(s, a; θ, α, β) = V(s; θ, β) + ( A(s, a; θ, α) − (1/|𝒜|) ∑_{a′} A(s, a′; θ, α) )

The subtracted term is the mean advantage, where |𝒜| denotes the number of actions. This addresses the identifiability problem while having no effect on the relative rank of the A (and thus Q) values, and it improves optimization because the advantage function only needs to change as fast as its mean [Wang2016].
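The aggregation can be checked numerically with a minimal NumPy sketch of the mean-subtracted combination.

```python
import numpy as np

def dueling_q(value, advantages):
    """Dueling DQN aggregation: Q(s,a) = V(s) + (A(s,a) - mean_a' A(s,a')).
    Subtracting the mean advantage resolves the identifiability problem."""
    advantages = np.asarray(advantages, dtype=float)
    return value + (advantages - advantages.mean())
```

Note that adding a constant to every advantage leaves the resulting Q-values unchanged, which is exactly the identifiability property discussed above.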

Figure 9: DQN vs. Dueling DQN

As explained by Wang et al. [Wang2016], Double Dueling DQN (DD-DQN) extends DQN by combining Dueling DQN and Double DQN to determine the optimal Q-value, with the output of Dueling DQN passed to Double DQN.

The Deep Recurrent Q-Network (DRQN) [hausknecht2015deep] algorithm is a DQN variation that uses a recurrent LSTM layer in place of the first fully connected layer. This changes the input from a single environment state to a group of states treated as a single input, which aids the integration of information over time [hausknecht2015deep]. The doubling and dueling techniques can be used independently or combined with a recurrent neural network.

6.2.2 Distributional DQN

The goal of distributional Q-learning is to obtain a more accurate representation of the distribution of observed rewards. Fortunato et al. [fortunato2017noisy] introduced NoisyNet, a deep reinforcement learning agent that uses gradient descent to learn parametric noise added to the network weights, and demonstrated that the stochasticity induced in the agent's policy can be used to aid efficient exploration [fortunato2017noisy].

Categorical Deep Q-Networks (C51-DQN) [bellemare2017distributional] applied a distributional perspective, using the Wasserstein metric on the random return given by Bellman's equation, to approximate value distributions instead of the value function. The algorithm first performs a heuristic projection step and then minimizes the Kullback-Leibler (KL) divergence between the projected Bellman update and the prediction.

Quantile Regression Deep Q-Networks (QR-DQN) [dabney2018distributional] performs distributional reinforcement learning over the Wasserstein metric in a stochastic approximation setting: the distance to the target distribution is minimized by stochastically adjusting the distributions' locations using quantile regression [dabney2018distributional]. QR-DQN assigns fixed, uniform probabilities to adjustable locations and minimizes the quantile Huber loss between the Bellman-updated distribution and the current return distribution [yang2019fully], whereas C51-DQN uses fixed locations for the distribution approximation and adjusts the location probabilities [dabney2018distributional].
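The quantile Huber loss minimized by QR-DQN can be sketched for a batch of TD errors. This is a simplified single-quantile version; κ = 1 is the commonly used default.

```python
import numpy as np

def quantile_huber_loss(td_errors, tau, kappa=1.0):
    """Quantile Huber loss sketch: the Huber loss on the TD error u is
    weighted asymmetrically by |tau - 1{u < 0}|, so over- and
    under-estimates are penalized according to the quantile fraction tau."""
    u = np.asarray(td_errors, dtype=float)
    huber = np.where(np.abs(u) <= kappa,
                     0.5 * u ** 2,
                     kappa * (np.abs(u) - 0.5 * kappa))
    return np.mean(np.abs(tau - (u < 0).astype(float)) * huber / kappa)
```

For τ = 0.5 the loss is symmetric; smaller τ penalizes positive errors less than negative ones, which is what lets each output head track its own quantile.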

Implicit Quantile Networks (IQN) [dabney2018implicit] incorporates QR-DQN [dabney2018distributional] to learn full quantile function controlled by the size of the network and the amount of training, in contrast to QR-DQN quantile function that learns a discrete set of quantiles dependent on the number of quantiles output [dabney2018implicit]. IQN distribution function assumes the base distribution to be non-uniform and reparameterizes samples from a base distribution to the respective quantile values of a target distribution.

Rainbow DQN [hessel2018rainbow] combines several improvements to the traditional DQN algorithm into a single algorithm: 1) addressing the overestimation bias, 2) using Prioritized Experience Replay (PER) [schaul2015prioritized], 3) using Dueling DQN [Wang2016], 4) shifting the bias-variance trade-off and propagating newly observed rewards faster to earlier visited states, as implemented in A3C [mnih2016asynchronous], 5) learning a return distribution instead of the expected return, similar to C51-DQN [bellemare2017distributional], and 6) implementing stochastic network layers using Noisy DQN [fortunato2017noisy].

Yang et al. [yang2019fully] proposed Fully parameterized Quantile Function (FQF) for distributional RL providing full parameterization for both quantile fractions and corresponding quantile values. In contrast, QR-DQN [dabney2018distributional] and IQN [dabney2018implicit] only parameterize the corresponding quantile values, while quantile fractions are either fixed or sampled [yang2019fully].

FQF for distributional RL uses two networks: 1) quantile value network that maps quantile fractions to corresponding quantile values, and 2) fraction proposal network that generates quantile fractions for each state-action pair with the goal of distribution approximation while minimizing the 1-Wasserstein distance between the approximated and actual distribution [yang2019fully].

6.2.3 Distributed DQN

The distributed DRL architecture used by different RL algorithms, depicted in Figure 10 [badia2020agent57], aims to decouple acting from learning in distributed settings, relying on prioritized experience replay to focus on the significant experiences generated by the actors. The actors share the same NN and replay buffer; they interact with the environment and store their experiences in the shared replay buffer. The learner, in turn, replays prioritized experiences from the shared buffer and updates the learner NN accordingly [horgan2018distributed]. In principle, acting and learning can be distributed across multiple workers or run on the same machine [horgan2018distributed].
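The priority-proportional sampling at the heart of this scheme can be sketched as follows; the exponent α and the fixed seed are illustrative choices.

```python
import numpy as np

def sample_prioritized(priorities, batch_size, alpha=0.6, rng=None):
    """Prioritized replay sketch: draw transition indices with probability
    proportional to priority**alpha, so the learner focuses on the most
    significant experiences produced by the actors."""
    if rng is None:
        rng = np.random.default_rng(0)
    p = np.asarray(priorities, dtype=float) ** alpha
    p /= p.sum()                                   # normalized probabilities
    return rng.choice(len(p), size=batch_size, p=p), p
```

In a full implementation the priorities are typically the absolute TD errors, updated after each learning step, with importance-sampling weights correcting for the non-uniform sampling.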

Figure 10: Distributed DRL agent scheme

Ape-X DQN [horgan2018distributed], based on the Ape-X framework, was the first algorithm to suggest distributed DRL, which was later extended by Recurrent Replay Distributed DQN (R2D2) [kapturowski2018recurrent] with two main differences: 1) R2D2 adds an LSTM layer after the convolutional stack to overcome partial observability, and 2) it trains a recurrent neural network from randomly sampled replay sequences using the “burn-in” strategy, which produces a start state through using a portion of the replay sequence and updates the network only on the remaining part of the sequence [kapturowski2018recurrent].

Never Give Up (NGU) [badia2020never] is another algorithm that combines the R2D2 architecture with a novel approach that encourages the agent to learn exploratory strategies throughout training, using a compound intrinsic reward consisting of two modules:


  • Life-long novelty module: uses Random Network Distillation (RND) [burda2018exploration], which consists of two networks used to generate an intrinsic reward: 1) a target network and 2) a prediction network. This mechanism is known as curiosity because it motivates the agent to explore the environment by visiting novel or unfamiliar states.

  • Episodic novelty module: uses a dynamically sized episodic memory that stores the controllable states in an online fashion and turns state-action counts into a bonus reward, where the count is computed using the k-nearest neighbors.

While NGU uses an intrinsic reward to promote exploration, it promotes exploitation through an extrinsic reward generated using the Universal Value Function Approximator (UVFA). NGU uses a conditional architecture with shared weights to learn a family of policies that separate exploration and exploitation [badia2020never].

Agent57 [badia2020agent57] is the first RL algorithm that outperforms the human benchmark on all 57 games of Atari 2600. Agent57 implements the NGU algorithm with the main differences of applying an adaptive mechanism for the exploration-exploitation trade-off and utilizing a parameterization of the architecture that allows more consistent and stable learning [badia2020agent57].

6.2.4 Deep SARSA

SARSA is based on Q-learning and is designed for situations with limited states and discrete actions, as explained in Subsection 6.1. Deep SARSA [zhao2016deep] uses a deep neural network similar to DQN and has the same extensions: Double SARSA, Dueling SARSA, Double Dueling SARSA (DD-SARSA), Deep Recurrent SARSA (DR-SARSA), and Double Dueling Deep Recurrent SARSA (DD-DR-SARSA). The main difference compared to DQN is that Deep SARSA computes the next state-action value Q(s′, a′) by taking the next action a′, which is needed to determine the current state-action value, rather than taking a greedy action that maximizes the reward.

6.3 Unlimited States and Continuous Actions

While discrete actions are adequate to drive a car or unmanned aerial vehicle in a simulated environment, they do not enable realistic movements in real-world scenarios. Continuous actions specify the quantity of movement in various directions, and the agent does not select from a predetermined set of actions. For instance, a realistic UAV movement defines the amount of roll, pitch, yaw, and throttle changes necessary to navigate the environment while avoiding obstacles, as opposed to flying the UAV in preset directions: forward, left, right, up, and down [almahamid2021reinforcement].

Continuous action spaces demand learning a parameterized policy π_θ that maximizes the expected sum of discounted rewards, since it is not feasible to determine the action-value for every continuous action in every distinct state. Learning a parameterized policy is considered a maximization problem, which can be handled using gradient methods to obtain the optimal θ in the following manner [almahamid2021reinforcement]:

θ_{t+1} = θ_t + α ∇_θ J(θ_t)

Here, ∇_θ J(θ_t) is the gradient of the objective and α is the learning rate.

The goal is to maximize the expected reward under the parameterized policy π_θ [sutton2018reinforcement]:

J(θ) = ∑_s d^π(s) V^π(s) = ∑_s d^π(s) ∑_a π_θ(a|s) Q^π(s, a)

where d^π(s) denotes the stationary probability of state s when starting from an initial state and transitioning to future states according to the policy π_θ. To determine the best θ that maximizes the function J(θ), the gradient is calculated as follows (the policy gradient theorem):

∇_θ J(θ) = ∑_s d^π(s) ∑_a Q^π(s, a) ∇_θ π_θ(a|s)
Since ∇_θ π_θ(a|s) = π_θ(a|s) ∇_θ ln π_θ(a|s) and the action space is continuous, the gradient can be rewritten as an expectation:

∇_θ J(θ) = E_{s∼d^π, a∼π_θ}[ Q^π(s, a) ∇_θ ln π_θ(a|s) ]

The off-policy gradient theorem [silver2014deterministic] defines the policy change in relation to the ratio of the target policy π_θ to the behavior policy β: the training samples are selected according to the behavior policy β, while the expected return is calculated for the target policy π_θ:

∇_θ J_β(θ) = E_{s∼d^β, a∼β}[ (π_θ(a|s) / β(a|s)) Q^π(s, a) ∇_θ ln π_θ(a|s) ]

The policy gradient theorem [sutton2000policy] served as the foundation for a variety of other Policy Gradient (PG) algorithms, including REINFORCE, actor-critic algorithms, and various multi-agent and distributed actor-critic algorithms.

6.3.1 Policy-Based Algorithms

Policy-based algorithms are devoted to improving the gradient descent performance by means of applying different methods such as REINFORCE [williams1992simple], Trust Region Policy Optimization (TRPO) [schulman2015trust], Proximal Policy Optimization (PPO) [schulman2017proximal], Phasic Policy Gradient (PPG) [cobbe2020phasic], and Stein Variational Policy Gradient (SVPG) [liu2017stein].


REINFORCE is a Monte-Carlo policy gradient approach: it samples complete episodes and updates the policy parameter θ with step size α in proportion to the gradient. Given that Q^π(s_t, a_t) = E_π[G_t | s_t, a_t], REINFORCE may be defined as follows [sutton2018reinforcement]:

∇_θ J(θ) = E_π[ G_t ∇_θ ln π_θ(a_t|s_t) ]

The Monte-Carlo estimate has high variance and, hence, a slow pace of learning. By subtracting a baseline value b(s_t) from the expected return G_t, REINFORCE decreases variance and accelerates learning without introducing bias [sutton2018reinforcement].
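A toy REINFORCE-with-baseline run on a two-armed bandit illustrates the variance-reduced update; all hyperparameters, the running-average baseline, and the bandit itself are illustrative.

```python
import numpy as np

# REINFORCE with a baseline on a 2-armed bandit (sketch):
# softmax policy over two actions; arm 1 pays more on average.
rng = np.random.default_rng(0)
theta = np.zeros(2)
baseline, alpha = 0.0, 0.1
for _ in range(2000):
    probs = np.exp(theta) / np.exp(theta).sum()     # softmax policy
    a = rng.choice(2, p=probs)
    reward = rng.normal(1.0 if a == 1 else 0.0, 0.1)
    baseline += 0.05 * (reward - baseline)          # running-average baseline
    grad_log = -probs
    grad_log[a] += 1.0                              # d/dtheta log pi(a)
    theta += alpha * (reward - baseline) * grad_log # variance-reduced update
```

After training, the policy should strongly prefer the higher-paying arm, with the baseline shrinking the magnitude of each update as the average reward is learned.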

Trust Region Policy Optimization (TRPO)

Trust Region Policy Optimization (TRPO) [schulman2015trust] belongs to the category of PG methods: it enhances gradient descent by taking large steps within trust regions defined by a KL-divergence constraint, and it updates the policy after each trajectory instead of after each state [almahamid2021reinforcement]. Proximal Policy Optimization (PPO) [schulman2017proximal] may be thought of as an extension of TRPO in which the KL-divergence constraint is applied as a penalty and the objective is clipped to guarantee that the optimization occurs within a predetermined range [shin2019obstacle].
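PPO's clipped surrogate objective can be sketched numerically; ε = 0.2 is the commonly used default for the clipping range.

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO clipped surrogate sketch: take the minimum of the unclipped and
    clipped ratio-weighted advantage, which keeps each update close to the
    old policy (ratio = pi_new(a|s) / pi_old(a|s))."""
    ratio = np.asarray(ratio, dtype=float)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.minimum(ratio * advantage, clipped * advantage).mean()
```

The minimum makes the objective pessimistic: a policy change cannot gain more than the clipped estimate allows, in either direction of the advantage sign.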

Phasic Policy Gradient (PPG) [cobbe2020phasic] is an extension of PPO [schulman2017proximal]: it incorporates a recurring auxiliary phase that distills information from the value function into the policy network to enhance the training while maintaining decoupling.

Stein Variational Policy Gradient (SVPG)

Stein Variational Policy Gradient (SVPG) [liu2017stein] algorithm updates the policy using Stein variational gradient descent (SVGD) [liu2016stein], therefore reducing variance and improving convergence. When used in conjunction with REINFORCE and the advantage actor-critic algorithms, SVPG enhances average return and data efficiency [liu2016stein].

6.3.2 Actor-Critic

The term "Actor-Critic algorithms" refers to a collection of algorithms based on the policy gradients theorem. They are composed of two components:

  1. The actor, which is responsible for finding the optimal policy.

  2. The critic, which estimates the value function using a parameter vector and a policy evaluation technique such as temporal-difference learning [silver2014deterministic].

The actor can be thought of as a network attempting to discover the probabilities of all possible actions and perform the one with the highest probability, whereas the critic can be thought of as a network evaluating the chosen action by assessing the quality of the new state produced by that action. Numerous algorithms can be classified under the actor-critic category, including Deterministic Policy Gradients (DPG) [silver2014deterministic], Deep Deterministic Policy Gradient (DDPG) [lillicrap2015continuous], Twin Delayed Deep Deterministic (TD3) [fujimoto2018addressing], and many others.
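This actor-critic interplay can be sketched as a single online update with linear function approximation; the feature representation and step sizes below are illustrative assumptions:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def actor_critic_step(theta, w, phi_s, phi_s_next, a, r,
                      alpha_actor=0.01, alpha_critic=0.1,
                      gamma=0.99, done=False):
    """One online actor-critic update (linear function approximation).

    The critic (weights w) evaluates states via v(s) = w . phi(s); its
    TD error scores the action the actor (weights theta) just took.
    """
    v_s = w @ phi_s
    v_next = 0.0 if done else w @ phi_s_next
    td_error = r + gamma * v_next - v_s          # critic's assessment
    w += alpha_critic * td_error * phi_s         # critic update, TD(0)
    probs = softmax(theta @ phi_s)               # actor: softmax policy
    grad_log = np.outer(-probs, phi_s)
    grad_log[a] += phi_s
    theta += alpha_actor * td_error * grad_log   # actor follows the critic
    return theta, w, td_error
```

A positive TD error means the action turned out better than the critic expected, so the actor raises its probability; a negative error lowers it.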

Deterministic Policy Gradients (DPG)

Deterministic policy gradient (DPG) algorithms implement a deterministic policy μ(s) instead of a stochastic policy π(a|s). The deterministic policy gradient is a special case of the stochastic policy gradient in which the target policy's objective function is averaged over the state distribution of the behavior policy, as depicted in Equation 14 [silver2014deterministic].


Importance sampling is frequently used in off-policy techniques with a stochastic policy to account for mismatches between behavior and target policies. The deterministic policy gradient eliminates the integral over actions; therefore, the importance sampling can be skipped, resulting in the following gradient [almahamid2021reinforcement]:


Numerous strategies are employed to enhance DPG; for example, experience replay (ER) can be combined with DPG to improve stability and data efficiency [heess2015memory]. Deep Deterministic Policy Gradient (DDPG) [lillicrap2015continuous], on the other hand, extends DPG by leveraging DQN to operate in continuous action spaces, whereas Twin Delayed Deep Deterministic (TD3) [fujimoto2018addressing] builds on DDPG by adopting Double DQN's idea to prevent overestimation of the value function, taking the minimum value between two critics [fujimoto2018addressing].
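TD3's clipped double-Q idea reduces to one line when forming the bootstrapped target; a minimal sketch (the function name and signature are illustrative):

```python
def td3_target(r, q1_next, q2_next, gamma=0.99, done=False):
    """TD3's clipped double-Q target.

    Using the minimum of the two critics' estimates of the next action
    value curbs the overestimation bias that a single critic accumulates.
    """
    q_min = min(q1_next, q2_next)
    return r + (0.0 if done else gamma * q_min)
```

Both critics are then regressed toward this shared pessimistic target, while the actor is updated less frequently (the "delayed" part of TD3).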

Recurrent Deterministic Policy Gradients (RDPG)

Wierstra et al. [wierstra2010recurrent] applied RNNs to Policy Gradient (PG) to build a model-free RL algorithm, namely Recurrent Policy Gradient (RPG), for Partially Observable Markov Decision Problems (POMDP), which does not require the agent to have complete knowledge of the environment's state [wierstra2010recurrent]. RPG applies a method for backpropagating return-weighted characteristic eligibilities through time to approximate a policy gradient for a recurrent neural network [wierstra2010recurrent].

Recurrent Deterministic Policy Gradient (RDPG) [heess2015memory] implements DPG using an RNN and extends the work of RPG [wierstra2010recurrent] to partially observed domains. The RNN with LSTM cells preserves information about past observations over many time steps.

Soft Actor-Critic (SAC)

The objective of Soft Actor-Critic (SAC) is to maximize both the expected reward and the entropy of the policy [haarnoja2018soft]. SAC augments the standard objective, the sum of rewards over state transitions, with the expected entropy of the policy over states [haarnoja2018soft]. Equation 16 illustrates this extended entropy objective, in which the temperature parameter determines the stochasticity of the optimal policy by specifying the importance of the entropy term relative to the reward [haarnoja2018soft].


By using function approximators and two independent NNs for the actor and critic, SAC estimates a soft Q-function parameterized by θ, a state value function parameterized by ψ, and a tractable policy parameterized by φ [almahamid2021reinforcement].
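The entropy-augmented objective appears most directly in the soft Bellman target used to train the Q-function; a hedged sketch, with the temperature value and function name as assumptions:

```python
def soft_q_target(r, q_next, logp_next, alpha=0.2, gamma=0.99, done=False):
    """Entropy-regularized (soft) Bellman target in the style of SAC.

    alpha is the temperature: subtracting alpha * log pi(a'|s') rewards
    stochastic policies in proportion to their entropy, since
    -log pi(a'|s') is an unbiased sample of the policy's entropy.
    """
    soft_v_next = q_next - alpha * logp_next
    return r + (0.0 if done else gamma * soft_v_next)
```

A larger temperature pushes the optimal policy toward higher entropy (more exploration); as the temperature approaches zero, the target degenerates to the standard maximum-reward objective.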

6.3.3 Multi-Agent and Distributed Actor-Critic

This group of algorithms includes multi-agent and distributed actor-critic algorithms. They are grouped together as multi-agents can be deployed across several nodes making it a distributed system.

Advantage Actor-Critic

Asynchronous Advantage Actor-Critic (A3C) [mnih2016asynchronous] is a policy gradient algorithm that parallelizes training by using multiple threads, commonly known as workers or agents. Each agent has a local policy and a value function estimate. The agent and a same-structured global network asynchronously exchange parameters in both directions, from the agent to the global network and vice versa. After a fixed number of actions or when a final state is reached, the policy and the value function are updated [mnih2016asynchronous].

Advantage Actor-Critic (A2C) [mnih2016asynchronous] is a policy gradient method identical to A3C, except that it includes a coordinator for synchronizing all agents. After all agents complete their work, either by reaching a final state or by completing a fixed number of actions, the coordinator updates the policy and value function, exchanging parameters between the agents and the global network in both directions.
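The quantity each A2C/A3C worker computes before an update is the n-step return and its advantage over the critic's estimate; a minimal sketch, where the segment layout is an illustrative assumption:

```python
import numpy as np

def n_step_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """n-step returns and advantages as used by A2C/A3C workers.

    rewards/values cover one n-step segment collected by a worker;
    bootstrap_value is the critic's estimate of the state that follows
    the segment (0 if the segment ended in a terminal state).
    """
    R = bootstrap_value
    returns = np.empty(len(rewards))
    for t in reversed(range(len(rewards))):   # accumulate backwards
        R = rewards[t] + gamma * R
        returns[t] = R
    advantages = returns - np.asarray(values)
    return returns, advantages
```

The advantages weight the policy gradient, while the returns serve as regression targets for the value function.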

Another variant of A3C is Actor-Critic with Kronecker-Factored Trust Region (ACKTR) [wu2017scalable] which uses Kronecker-factored approximation curvature (K-FAC) [martens2015optimizing] to optimize the actor and critic. It improves the computation of natural gradients by efficiently inverting the gradient covariance matrix.

Actor-Critic with Experience Replay (ACER)

Actor-Critic with Experience Replay (ACER) [wang2016sample] is an off-policy actor-critic algorithm with experience replay that estimates the policy and the value function using a single deep neural network [wang2016sample]. In comparison to A3C, ACER employs a stochastic dueling network and a novel trust region policy optimization [wang2016sample], while improving importance sampling with a bias correction [mnih2016asynchronous].

ACER applies an improved Retrace algorithm [munos2016safe], using truncated importance sampling with bias correction and the Retrace action-value estimate as the target to train the critic [wang2016sample]. The gradient is computed by truncating the importance weights at a constant c and adding a bias-correction term, which reduces variance.
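The truncation-with-correction idea can be sketched directly; the probability values and truncation constant below are illustrative:

```python
import numpy as np

def truncated_is_weights(pi_probs, mu_probs, c=10.0):
    """ACER-style truncated importance weights with bias correction.

    rho = pi/mu is truncated at c to bound the variance; the correction
    factor max(0, (rho - c)/rho) reweights the leftover probability mass
    so the combined estimator remains unbiased.
    """
    rho = pi_probs / mu_probs
    truncated = np.minimum(rho, c)
    correction = np.maximum(0.0, (rho - c) / rho)
    return truncated, correction
```

When the target and behavior policies agree (rho ≤ c), the correction term is zero and the estimator reduces to plain importance sampling.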

Retrace-Actor (Reactor) [gruslys2017reactor] increases sample and time efficiency by combining contributions from several techniques. It employs distributional Retrace [munos2016safe] to provide multi-step off-policy distributional RL updates while prioritizing replay on transitions [gruslys2017reactor]. Additionally, by using action values as a baseline, Reactor improves the trade-off between variance and bias via β-leave-one-out (β-LOO), resulting in an improved policy gradient [gruslys2017reactor].

Multi-Agent Reinforcement Learning (MARL)

Distributed Distributional DDPG (D4PG) [barth2018distributed] adds features such as N-step returns and prioritized experience replay to the distributed settings of DDPG [barth2018distributed]. On the other hand, Multi-Agent DDPG (MADDPG) [lowe2017multi] expands DDPG to coordinate between multiple agents and learn policies while considering each agent’s policy [lowe2017multi]. In comparison, Multi-Agent TD3 (MATD3) [ackermann2019reducing] extends TD3 to work with multiple agents using centralized training and decentralized execution while, similarly to TD3, controlling the overestimation bias by employing two centralized critics for each agent.

Importance Weighted Actor-Learner Architecture (IMPALA) [espeholt2018impala] is an off-policy algorithm that separates action execution and policy learning. It can be applied using two distinct configurations: 1) a single learner and multiple actors, or 2) multiple synchronous learners and multiple actors.

Using a single learner and several actors, the trajectories generated by the actors are transferred to the learner. Before initiating a new trajectory, each actor waits for the learner to update the policy, while the learner simultaneously queues trajectories received from the actors and constructs the updated policy. Nonetheless, an actor may hold an older policy version due to the actors' lack of awareness of one another and the lag between the actors and the learner. To address this challenge, IMPALA employs a v-trace correction approach based on truncated importance sampling (IS), defined as the ratio of the learner's policy to the actor's current policy [almahamid2021reinforcement]. Likewise, with multiple synchronous learners, policy parameters are distributed across numerous learners who communicate synchronously via a master learner [espeholt2018impala].
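A minimal sketch of the v-trace target computation, assuming precomputed importance ratios; variable names loosely follow the IMPALA paper's notation and the layout is an illustrative assumption:

```python
import numpy as np

def vtrace_targets(rewards, values, bootstrap_value, rhos,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """V-trace value targets (IMPALA-style off-policy correction).

    rhos are the per-step importance ratios pi(a|s) / mu(a|s) between
    the learner's policy pi and the actor's behavior policy mu; both
    ratios are truncated (by rho_bar and c_bar) to bound the variance.
    """
    clipped_rho = np.minimum(rhos, rho_bar)
    clipped_c = np.minimum(rhos, c_bar)
    T = len(rewards)
    vs = np.empty(T)
    acc = 0.0                                 # vs_{t+1} - V(x_{t+1})
    for t in reversed(range(T)):
        v_next = values[t + 1] if t + 1 < T else bootstrap_value
        delta = clipped_rho[t] * (rewards[t] + gamma * v_next - values[t])
        acc = delta + gamma * clipped_c[t] * acc
        vs[t] = values[t] + acc
    return vs
```

In the on-policy case (all ratios equal to 1) the targets reduce to ordinary n-step bootstrapped returns, so the correction only activates when the actor's policy has lagged behind the learner's.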

Scalable, Efficient Deep-RL (SEED RL) [espeholt2019seed] provides a scalable architecture that combines IMPALA with R2D2 and can train on millions of frames per second at a lower experiment cost compared to IMPALA [espeholt2019seed]. SEED moves inference to the learner while the environments run remotely, which introduces a latency issue due to the increased number of remote calls; this is mitigated by a fast communication layer based on gRPC.

7 Problem Formulation and Algorithm Selection

The previous section categorized RL algorithms based on state and action types and reviewed the most prominent algorithms. With such a large number of algorithms, it is challenging to select the RL algorithms suitable to tackle the task at hand. Consequently, Figure 11 depicts the process of selecting a suitable RL algorithm or a group of RL algorithms through six steps/questions that are answered to guide an informed selection.

The selection process places a greater emphasis on how the environment and RL objective are formulated than on the RL problem type because the algorithm selection is dependent on the environment and objective formulation. For instance, UAV navigation tasks can employ several sets of algorithms dependent on the desired action type. The six steps, indicated in Figure 11, guide the selection of algorithms: the selected option at each step limits the choices available in the next step based on the available algorithms’ characteristics. The steps are as follows:

•  Step 1 - Define State Type: When assessing an RL task, it is essential to comprehend the state that can be obtained from the surrounding environment. For instance, some navigation tasks simplify the environment’s states using grid-cell representations [elnaggar2018irl, andrew2018deep, cui2021uav], where the agent has a limited and predetermined set of states, whereas in other tasks the environment can have unlimited states [grando2020deep, morad2021embodied, yoon2019hierarchical]. Therefore, this step involves a decision between limited vs. unlimited states.

•  Step 2 - Define Action Type: Choosing between discrete and continuous action types limits the number of applicable algorithms. For instance, discrete actions can be used to move the UAV in pre-specified directions (UP, DOWN, RIGHT, LEFT, etc.), whereas continuous actions, such as changes in pitch, roll, and yaw angles, specify the quantity of the movement using a real number.

•  Step 3 - Define Policy Type: As addressed and explained in Subsection 3.4, RL algorithms can be either off-policy or on-policy algorithms. The policy type selected restricts the alternatives available in the subsequent step. On-policy algorithms converge faster than off-policy algorithms and find a sub-optimal policy, making them a good fit for environments requiring much exploration. Moreover, on-policy algorithms provide stable training since the same policy is used for both learning and data sampling. On the other hand, off-policy algorithms can provide an optimal policy but require a good exploration strategy.

The off-policy algorithms’ convergence can be improved using techniques such as prioritized experience replay and importance sampling, making them a good fit for navigation tasks that require finding the optimal path.

•  Step 4 - Define Processing Type: While some RL algorithms run in a single thread, others support multi-threading and distributed processing. This step selects the processing type that suits the application needs and the available computational power.

•  Step 5 - Define Number of Agents: This step specifies the number of agents the application should have. This is needed as some RL algorithms enable MARL, which accelerates learning but requires more computational resources, while other techniques employ only a single agent.

•  Step 6 - Select the Algorithms: The last phase of the process results in a collection of algorithms that may be applied to the RL problem at hand. However, the performance of the algorithms is affected by a number of factors and may vary depending on variables such as hyper-parameter settings, reward engineering, and the agent’s NN architecture. Consequently, the procedure seeks to reduce the algorithm selection to a group of algorithms rather than a single algorithm.
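The narrowing procedure above is easy to mechanize as successive attribute filters; the sketch below uses a small hypothetical attribute table whose entries and tags are illustrative only, not the paper's Table 3:

```python
# Hypothetical attribute table; entries and tags are illustrative only.
ALGORITHMS = {
    "DQN":    {"actions": "discrete",   "policy": "off", "distributed": False, "agents": 1},
    "PPO":    {"actions": "continuous", "policy": "on",  "distributed": False, "agents": 1},
    "DDPG":   {"actions": "continuous", "policy": "off", "distributed": False, "agents": 1},
    "IMPALA": {"actions": "discrete",   "policy": "off", "distributed": True,  "agents": 1},
    "MADDPG": {"actions": "continuous", "policy": "off", "distributed": True,  "agents": 2},
}

def select_algorithms(**requirements):
    """Keep only the algorithms whose attributes match every answered
    question, mirroring how each of Steps 1-5 prunes the candidate set."""
    return sorted(name for name, attrs in ALGORITHMS.items()
                  if all(attrs.get(k) == v for k, v in requirements.items()))
```

For example, answering "continuous actions, off-policy, single agent" would leave DDPG-style algorithms in this toy table; the real selection, per Step 6, still ends with a group of candidates to be compared empirically.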

While this section presented the process of narrowing down the algorithms for a use case, Section 6 provided descriptions of and references to many algorithms to assist in comprehending the distinctions between them and making an informed selection.

Figure 11: Algorithm Selection Process

8 Challenges and Opportunities

Previous sections demonstrated the diversity of UAV navigation tasks as well as the diversity of RL algorithms. Due to this large number of algorithms, selecting the appropriate algorithm for the task at hand is challenging. Table 3 and the discussion in Section 6 provide an overview of the RL algorithms and assist in selecting an RL algorithm for a navigation task. Nevertheless, there are still numerous challenges and opportunities in RL for UAV navigation, including:

Evaluation and benchmarking: Atari 2600 is a home video game console whose 57 built-in games laid the foundation for benchmarking RL algorithms: the games are used to compare RL algorithms against one another and against a human-performance baseline. An agent’s performance is typically evaluated against other algorithms on this benchmark, or on various non-navigation environments [andrychowicz2020matters]. Performance on such benchmarks might differ when the algorithms are applied to UAV navigation simulated in a 3D environment or in the real world: Atari 2600 games provide the full state of the environment, so the agent does not need to make assumptions about the state and the problem can be modeled as an MDP, whereas in UAV navigation simulation the agent knows only partial states of the environment (observations) and is trained using a POMDP, which changes the behavior of some algorithms. Furthermore, the images processed from the games are 2D, and most algorithms try to learn an optimal policy based on pixel patterns; the same cannot be inferred from 3D simulators or real-world images, because object depth plays a vital role in learning an optimal policy that avoids nearby objects. Therefore, there is a need for new evaluation and benchmarking techniques for RL-driven navigation.

Environment complexity: The tendency to oversimplify the environment and the absence of standardized benchmarking tools make it difficult to compare performances obtained using different algorithms and simulated using various tools and environments. Nevertheless, the UAV needs to perform tasks in different environments and is subject to various conditions, for example:


  • Navigating in various environment types such as indoor vs. outdoor.

  • Considering the changing environment conditions such as wind speed, lighting conditions, and moving objects.

Some of the simulation tools discussed in Section 5, such as AirSim combined with Unreal Engine, provide different environment types out-of-the-box and are capable of simulating several environmental effects such as changing wind speed and lighting conditions. Still, these complex environments remain to be combined with new benchmarking techniques for improved comparison of RL algorithms for UAV navigation.

Knowledge transfer: Knowledge transfer imposes another challenge: training an RL agent in a selected environment does not guarantee similar performance in another environment due to differences in the environments’ nature, such as different object/obstacle types, background texture, lighting density, and added noise. Most existing research has focused on applying transfer learning to reduce the agent’s training time in the new environment [yoon2019hierarchical]. However, generalized training methods or other techniques are needed to guarantee similar agent performance in different environments and under various conditions.

UAVs complexity: Training UAVs is often accomplished in a 3D virtual environment since UAVs have limited computational resources and power supply, with a typical flight time of 10 to 30 minutes. Reducing the computation time will create possibilities for more complex navigation tasks and increase the flight time, since it will reduce energy consumption. Figure 6 shows that only a small fraction of the investigated research used real drones for navigation training. Therefore, more research is required that focuses on energy-aware navigation utilizing low-complexity, efficient RL algorithms and experimenting with real drones.

Algorithm diversity: As seen in Table 3, many recent and very successful algorithms have not been applied to UAV navigation. As these algorithms have shown great success in other domains, outperforming the human benchmark, there is significant potential in applying them to UAV navigation. These algorithms are expected to generalize better across environments, speed up the training process, and even efficiently solve more complex tasks such as UAV flocking.

9 Conclusion

This review deliberates on the application of RL for autonomous UAV navigation. RL uses an intelligent agent to control the UAV movement by processing the states from the environment and moving the UAV in desired directions. The data received from the UAV camera or other sensors such as LiDAR are used to estimate the distance from various objects in the environment and avoid colliding with these objects.

RL algorithms and techniques have been used to solve navigation problems such as controlling the UAV while avoiding obstacles, path planning, and flocking. For example, RL is used in single-UAV path planning and multi-UAV flocking to plan the path waypoints of the UAV(s) while avoiding obstacles or maintaining flight formation (flocking). Furthermore, this study recognizes various navigation frameworks and simulation software used to conduct the experiments, along with identifying their use within the reviewed papers.

The review discusses over fifty RL algorithms, explains their contributions and relations, and classifies them according to the application environment and their use in UAV navigation. Furthermore, the study highlights other algorithmic traits such as multi-threading, distributed processing, and multi-agents, followed by a systematic process that aims to assist in finding the set of applicable algorithms.

The study observes that the research community tends to experiment with a specific set of algorithms: Q-learning, DQN, Double DQN, DDPG, and PPO, although some recent algorithms, such as Agent57, show more promising results. To the best of the authors’ knowledge, this study is the first systematic review identifying a large number of RL algorithms while focusing on their application in autonomous UAV navigation.

Analysis of the current RL algorithms and their use in UAV navigation identified the following challenges and opportunities: the need for navigation-focused evaluation and benchmarking techniques, the necessity to work with more complex environments, the need to examine knowledge transfer, the complexity of UAVs, and the necessity to evaluate state-of-the-art RL algorithms on navigation tasks.

Appendix A Acronyms


  • A2C : Advantage Actor-Critic

  • A3C : Asynchronous Advantage Actor-Critic

  • AC : Actor-Critic

  • ACE : Actor Ensemble

  • ACER : Actor-Critic with Experience Replay

  • ACKTR : Actor-Critic using Kronecker-Factored Trust Region

  • Agent57 : Agent57

  • Ape-X DPG : Ape-X Deterministic Policy Gradients

  • Ape-X DQN : Ape-X Deep Q-Networks

  • AS : Autonomous Systems

  • C51-DQN : Categorical Deep Q-Networks

  • CNN : Convolutional Neural Network

  • D4PG : Distributed Distributional DDPG

  • DAC : Double Actor-Critic

  • DD-DQN : Double Dueling Deep Q-Networks

  • DD-DRQN : Double Dueling Deep Recurrent Q-Networks

  • DDPG : Deep Deterministic Policy Gradient

  • Double DQN : Double Deep Q-Networks

  • DPG : Deterministic Policy Gradients

  • DPPO : Distributed Proximal Policy Optimization

  • DQN : Deep Q-Networks

  • DRL : Deep Reinforcement Learning

  • DRQN : Deep Recurrent Q-Networks

  • Dueling DQN : Dueling Deep Q-Networks

  • DVS : Dynamic Vision Sensor

  • eNVM : embedded Non-Volatile Memory

  • FOV : Field Of View

  • FQF : Fully parameterized Quantile Function

  • GPS : Global Positioning System

  • IMPALA : Importance Weighted Actor-Learner Architecture

  • IMU : Inertial Measurement Unit

  • IQN : Implicit Quantile Networks

  • K-FAC : Kronecker-factored approximation curvature

  • KL : Kullback-Leibler

  • LSTM : Long-Short Term Memory

  • MAAC : Multi-Actor-Attention-Critic

  • MADDPG : Multi-Agent DDPG

  • MARL : Multi-Agent Reinforcement Learning

  • MATD3 : Multi-Agent Twin Delayed Deep Deterministic

  • MDP : Markov Decision Problem

  • NGU : Never Give Up

  • Noisy DQN : Noisy Deep Q-Networks

  • PAAC : Parallel Advantage Actor-Critic

  • PER : Prioritized Experience Replay

  • PG : Policy Gradients

  • POMDP : Partially Observable Markov Decision Problem

  • PPG : Phasic Policy Gradient

  • PPO : Proximal Policy Optimization

  • QR-DQN : Quantile Regression Deep Q-Networks

  • R2D2 : Recurrent Replay Distributed Deep Q-Networks

  • Rainbow DQN : Rainbow Deep Q-Networks

  • RDPG : Recurrent Deterministic Policy Gradients

  • Reactor : Retrace-Actor

  • REINFORCE : REward Increment Nonnegative Factor Offset Reinforcement Characteristic Eligibility

  • RL : Reinforcement Learning

  • RND : Random Network Distillation

  • RNN : Recurrent Neural Network

  • ROS : Robot Operating System

  • SAC : Soft Actor-Critic

  • SARSA : State-Action-Reward-State-Action

  • SEED RL : Scalable, Efficient Deep-RL

  • SLAC : Stochastic Latent Actor-Critic

  • SRAM : Static Random Access Memory

  • SVPG : Stein Variational Policy Gradient

  • TD : Temporal Difference

  • TD3 : Twin Delayed Deep Deterministic

  • TRPO : Trust Region Policy Optimization

  • UAV : Unmanned Aerial Vehicle

  • UCB : Upper Confidence Bound

  • UGV : Unmanned Ground Vehicle

  • UVFA : Universal Value Function Approximator


This research has been supported by NSERC under grant RGPIN-2018-06222