Obstacle Avoidance for UAS in Continuous Action Space Using Deep Reinforcement Learning

11/13/2021
by   Jueming Hu, et al.

Obstacle avoidance for small unmanned aircraft is vital for the safety of future urban air mobility (UAM) and Unmanned Aircraft System (UAS) Traffic Management (UTM). There are many techniques for real-time robust drone guidance, but many of them operate in discretized airspace and control, which would require an additional path-smoothing step to provide flexible commands for UAS. To provide safe and efficient computational guidance of operations for unmanned aircraft, we explore the use of a deep reinforcement learning algorithm based on Proximal Policy Optimization (PPO) to guide autonomous UAS to their destinations while avoiding obstacles through continuous control. The proposed scenario state representation and reward function can map the continuous state space to continuous control for both heading angle and speed. To verify the performance of the proposed learning framework, we conducted numerical experiments with static and moving obstacles. Uncertainties associated with the environments and safety operation bounds are investigated in detail. Results show that the proposed model can provide accurate and robust guidance and resolve conflicts with a success rate of over 99%.

1 Introduction

From delivery drones to autonomous electrical vertical take-off and landing (eVTOL) passenger aircraft, modern unmanned aircraft systems (UAS) can perform many different tasks efficiently, including goods delivery, surveillance, public safety, weather monitoring, disaster relief, search and rescue, traffic monitoring, videography, and air transportation (Balakrishnan et al., 2018; Kopardekar et al., 2016). Urban air mobility (UAM) is likely to occur in urban areas close to buildings or airports. Thus, it is expected that UAS can use onboard detect and avoid systems to avoid other traffic, hazardous weather, terrain, and man-made and natural obstacles without constant human intervention (Kopardekar et al., 2016).

Many mathematical models for aircraft conflict resolution have been proposed in the literature. Research efforts can be divided into centralized algorithms and decentralized algorithms. Centralized methods can be based on semidefinite programming (Frazzoli et al., 2001), nonlinear programming (Raghunathan et al., 2004; Enright and Conway, 1992), mixed-integer linear programming (Schouwenaars et al., 2001; Richards and How, 2002; Pallottino et al., 2002; Vela et al., 2009), mixed-integer quadratic programming (Mellinger et al., 2012), sequential convex programming (Augugliaro et al., 2012; Morgan et al., 2014), second-order cone programming (Acikmese and Ploen, 2007), evolutionary techniques (Delahaye et al., 2010; Cobano et al., 2011), and particle swarm optimization (Pontani and Conway, 2010). These centralized methods often pursue the global optimum for all the aircraft. However, as the number of aircraft grows, the computation cost of these methods typically scales exponentially. Among the decentralized methods, the conflict resolution problem can also be formulated as a Markov Decision Process (MDP). Reinforcement Learning (RL) has proved to be a good solution to air traffic management, but most studies use traditional algorithms (Sun and Zhang, 2019). The next-generation airborne collision avoidance system (ACAS X) formulates the collision avoidance system (CAS) problem as a partially observable Markov Decision Process (POMDP) and has been extended to unmanned aircraft as ACAS Xu (Kochenderfer et al., 2012). Both ACAS X and ACAS Xu use Dynamic Programming (DP) to determine the expected cost of each action (Manfredi and Jestin, 2016; Owen et al., 2019). Chryssanthacopoulos and Kochenderfer (2012) combined decomposition methods and DP for optimized collision avoidance with multiple threats. Traditional RL algorithms require a fine discretization scheme of the state space and a finite action space. Discretization potentially reduces safety by adding discretization errors and cannot provide flexible maneuver guidance for UAS. In addition, discretizing a large airspace implies a high computational demand and can be time-consuming. Tree-search-based algorithms (Yang and Wei, 2018, 2020) have also been applied to CAS problems using an MDP formulation that does not involve state discretization, but they typically require high onboard computation time to accommodate the continuous state space.

The large and continuous state and action spaces present a challenge for conflict resolution problems using reinforcement learning. Recently, Deep Reinforcement Learning (DRL) has been studied to address this challenge by applying deep neural networks to approximate the cost and optimal policy functions. The development of DRL algorithms, such as Policy Gradient (Sutton et al., 2000), Deep Q-Networks (DQN) (Mnih et al., 2013), Double DQN (Van Hasselt et al., 2016), Dueling DQN (Wang et al., 2015), Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2015), Asynchronous Advantage Actor-Critic (A3C) (Mnih et al., 2016), and Proximal Policy Optimization (PPO) (Schulman et al., 2017), has increased the potential of automation. Li et al. (2019) used DQN to compute corrections for an existing collision avoidance approach to account for dense airspace. In Yang et al. (2019), the feasibility of using DQN-based algorithms for UAV obstacle avoidance is verified. Wulfe (2017) concluded that DQN can outperform value iteration both in evaluation performance and in solution speed when solving a UAV collision avoidance problem. The performance of an agent in avoiding a single aircraft up to multiple aircraft using the DQN algorithm is investigated in Keong et al. (2019). Brittain et al. (2020) proposed a novel deep multi-agent reinforcement learning framework based on PPO to identify and resolve conflicts among a variable number of aircraft in a high-density, stochastic, and dynamic en-route sector. The DRL work mentioned above operates in continuous state and discrete action spaces.

There has been less progress on utilizing DRL to solve UAS conflict resolution with continuous control. Pham et al. (2019) proposed a method inspired by Deep Q-learning and Deep Deterministic Policy Gradient algorithms that can resolve conflicts with a success rate of over 81% in the presence of traffic and varying degrees of uncertainty. Ma et al. (2018) developed a generic framework that integrates an autonomous obstacle detection module and an actor-critic based reinforcement learning (RL) module to develop reactive obstacle avoidance behavior for a UAV. Experiments in Schulman et al. (2017) test PPO on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, and show that PPO outperforms other online policy gradient methods. PPO appears to strike a favorable balance between sample complexity, simplicity, and wall-clock time. Thus, a PPO-based conflict resolution model is very valuable for UAS traffic management, which is the major motivation of this study.

To the best of the authors’ knowledge, this is the first study to develop a DRL approach based on the PPO algorithm that allows the UAS to navigate successfully in continuous state and action spaces. The benefit of operating in continuous space is that there is no need to discretize the state space or to smooth the resulting trajectories in a post-processing step. The proposed model, with the optimal policy obtained after offline training, can be utilized for UAS real-time online trajectory planning. The main contributions of this paper are as follows:

  • A PPO-based framework has been proposed for UAS to avoid both static and moving obstacles in continuous state and action spaces.

  • A novel scenario state representation and reward function are developed and can effectively map the environment to maneuvers. The trained model can generate continuous heading angle commands and speed commands.

  • We have tested the effectiveness of the proposed learning framework in the environment with static obstacles, the environment with static obstacles and UAS position uncertainty, and the deterministic and stochastic environments with moving obstacles. Results show that the proposed model can provide accurate and robust guidance and resolve conflict with a success rate of over 99%.

The remainder of this paper is organized as follows. Section 2 describes the backgrounds of Markov Decision Process and Deep Reinforcement Learning. Section 3 presents the model formulation using Markov Decision Process for UAS conflict resolution in continuous action space. In Section 4, the numerical experiments are presented to show the capability of the proposed approach to make the UAS learn to avoid conflict. Section 5 concludes this paper.

2 Background

In this section, we briefly review the backgrounds of Markov Decision Process (MDP) and Deep Reinforcement Learning (DRL).

2.1 Markov Decision Process (MDP)

Since the 1950s, MDPs (Bellman, 1957) have been well studied and applied to a wide range of disciplines (Howard, 1964; White, 1993; Feinberg and Shwartz, 2012), including robotics (Koenig and Simmons, 1998; Thrun, 2002), automatic control (Mariton, 1990), economics, and manufacturing. In an MDP, the agent may choose any action that is available in the current state at each time step. The process responds at the next time step by moving into a new state with a certain transition probability and giving the agent a corresponding reward.

More precisely, the MDP includes the following components:

  1. The state space S, which consists of all the possible states.

  2. The action space A, which consists of all the actions that the agent can take.

  3. The transition function T(s_{t+1} | s_t, a_t), which describes the probability of arriving at state s_{t+1} given the current state s_t and action a_t.

  4. The reward function R(s_t, a_t, s_{t+1}), which decides the immediate reward (or expected immediate reward) received after transitioning from state s_t to state s_{t+1} due to action a_t. In general, the reward will depend on the current state, the current action, and the next state. r_t is the immediate reward at time step t and R_t is the total discounted reward from time step t onwards.

  5. A discount factor \gamma, which decides the preference for immediate reward versus future rewards. Setting the discount factor to less than 1 is also beneficial for the convergence of the cumulative reward.

In an MDP problem, a policy is a mapping from the state to a distribution over actions (known as a stochastic policy) or to one specific action (known as a deterministic policy):

\pi(a \mid s) = \Pr(a_t = a \mid s_t = s) \quad \text{or} \quad a = \pi(s) \qquad (1)

The goal of MDP is to find an optimal policy \pi^* that, if followed from any initial state, maximizes the expected cumulative immediate rewards:

\pi^* = \arg\max_{\pi} \; \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, \pi \right] \qquad (2)

The Q-function and value function are two important concepts in MDP. The optimal Q-function Q^*(s, a) represents the expected cumulative reward received by an agent that starts from state s, picks action a, and chooses actions optimally afterward. Therefore, Q^*(s, a) is an indication of how good it is for an agent to pick action a while being at state s. The optimal value function V^*(s) denotes the maximum expected cumulative reward when starting from state s, which can be expressed as the maximum of Q^*(s, a) over all possible actions:

V^*(s) = \max_{a} Q^*(s, a) \qquad (3)
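To make the relation between Q^* and V^* concrete, the sketch below runs value iteration on a toy discretized grid MDP. It is purely illustrative (the present paper deliberately avoids such discretization); the states, transition model, and rewards are made-up values.

```python
import numpy as np

# Toy 1-D grid MDP: move left or right; stepping into the right end is rewarded.
n_states, n_actions, gamma = 5, 2, 0.95
T = np.zeros((n_states, n_actions, n_states))   # T[s, a, s']: transition probabilities
for s in range(n_states):
    T[s, 0, max(s - 1, 0)] = 1.0                # action 0: move left
    T[s, 1, min(s + 1, n_states - 1)] = 1.0     # action 1: move right
R = np.full((n_states, n_actions), -1.0)        # constant step penalty
R[n_states - 2, 1] = 10.0                       # reaching the rightmost state pays off

Q = np.zeros((n_states, n_actions))
for _ in range(500):                            # Bellman optimality backups
    V = Q.max(axis=1)                           # V*(s) = max_a Q*(s, a), cf. Eq. (3)
    Q = R + gamma * (T @ V)                     # Q*(s, a) = R(s, a) + gamma * E[V*(s')]
print(Q.max(axis=1))                            # converged optimal values V*
```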

2.2 Deep Reinforcement Learning

Reinforcement learning (Sutton and Barto, 2018) is an efficient approach to solving the MDP problem. With the advent of deep learning, Deep Reinforcement Learning (DRL) has achieved much success recently, including in the game of Go (Silver et al., 2017), Atari games (Mnih et al., 2015; Hessel et al., 2018), and StarCraft II (Vinyals et al., 2017). In general, deep reinforcement learning can be divided into value-based learning (Mnih et al., 2015; Lillicrap et al., 2015) and policy-based algorithms (Mnih et al., 2016; Schulman et al., 2015, 2017). In this paper, we consider a policy-based DRL algorithm to generate policies for the agent. Compared with value-based DRL algorithms, policy-based algorithms are effective in high-dimensional or continuous action spaces and can learn stochastic policies, which is beneficial when there is uncertainty in the environment.

Typically, a policy-based algorithm uses a function approximator such as a neural network to approximate the policy \pi_\theta, where the input is the current state and the output is the probability of each action (for a discrete action space) or an action distribution (for a continuous action space). After each trajectory \tau, the algorithm updates the parameters \theta of the function approximator to maximize the cumulative reward using gradient ascent:

\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}}\!\left[\sum_{t} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, R_t \right] \qquad (4)

where J(\theta) is the expected cumulative reward of the policy parameterized by \theta, \pi_{\theta}(a_t \mid s_t) is the probability of action a_t at state s_t, and R_t is the cumulative reward gathered by the agent for the remaining trajectory in one episode. The general idea of Eq. 4 is to reduce the probability of sampling actions that lead to a lower return and to increase the probability of actions that lead to a higher reward. One issue, however, is that the cumulative reward usually has very high variance, which makes convergence slow. To address this issue, researchers proposed the actor-critic algorithm (Sutton and Barto, 2018), in which a critic function is introduced to approximate the state value function V(s). By subtracting the value function V(s_t), the expectation of the gradient remains unchanged while the variance is reduced dramatically:

\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}}\!\left[\sum_{t} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, \big(R_t - V_{\phi}(s_t)\big) \right] \qquad (5)

where the function approximator V_{\phi} is updated to approximate the value function of the current policy \pi_{\theta}.
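As a concrete illustration of the baseline-subtracted gradient in Eq. 5, the sketch below computes an actor loss and a critic loss for a discrete-action toy case in PyTorch. The function and tensor names (policy_net, value_net, returns) are illustrative assumptions, not part of the paper's implementation.

```python
import torch

def actor_critic_loss(policy_net, value_net, states, actions, returns):
    """Illustrative loss for the baseline-subtracted gradient of Eq. (5).
    states: [T, obs_dim] float tensor; actions: [T] long tensor of discrete
    action indices; returns: [T] discounted returns R_t."""
    logits = policy_net(states)                          # unnormalized action scores
    log_probs = torch.log_softmax(logits, dim=-1)        # log pi_theta(a | s)
    log_pi_a = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    values = value_net(states).squeeze(-1)               # V_phi(s_t)
    advantages = returns - values.detach()               # R_t - V_phi(s_t)
    actor_loss = -(log_pi_a * advantages).mean()         # gradient matches Eq. (5)
    critic_loss = (returns - values).pow(2).mean()       # regress V_phi toward R_t
    return actor_loss + 0.5 * critic_loss
```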

3 Markov Decision Process Formulation

In this study, the UAS and intruders are considered as point masses. The objective of the proposed conflict resolution algorithm is to find the shortest path for a UAS to its goal while avoiding conflicts with other UAS and static obstacles. Guiding the UAS to its destination is a discrete-time stochastic control process that can be formulated as a Markov Decision Process (MDP). In the following subsections, we introduce the MDP formulation by describing its state representation, action space, terminal states, and reward function. For this work, a deep reinforcement learning method, the proximal policy optimization algorithm developed in Schulman et al. (2017), is adopted. The reason and details are also introduced in this section.

3.1 State representation

The agent gains knowledge of the environment from the state of the formulated MDP. The state should include all the necessary information for an agent to make optimal actions. In this paper, we let s_t denote the agent’s state at time t. All the parameters are normalized when querying the neural network.

The state can be divided into two parts, that is, s_t = [s_t^o, s_t^e], where s_t^o denotes the part that is related to the agent itself and the goal, and s_t^e denotes the part related to the environment such as obstacles. We use s_t^e to denote the information of the environment (which represents moving/static obstacles in this paper) and set s_t^e = [s_t^{e,1}, \dots, s_t^{e,N}], where s_t^{e,i} indicates the information of obstacle i. To speed up the training of the DRL algorithm, we transform the state by following the robot-centric parameterization in Chen et al. (2017), where the agent is located at the origin and the x-axis is pointing toward its goal.
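A minimal sketch of this robot-centric transformation is shown below, assuming 2-D positions; the function name to_agent_frame and the use of NumPy are illustrative choices, not the paper's code.

```python
import numpy as np

def to_agent_frame(agent_pos, goal_pos, point):
    """Express `point` in a frame centered at the agent with the x-axis
    pointing toward the goal (robot-centric parameterization). Illustrative."""
    dx, dy = goal_pos[0] - agent_pos[0], goal_pos[1] - agent_pos[1]
    theta = np.arctan2(dy, dx)                 # rotation so the x-axis points to the goal
    c, s = np.cos(theta), np.sin(theta)
    rel = np.asarray(point, dtype=float) - np.asarray(agent_pos, dtype=float)
    # Rotate by -theta: world coordinates -> agent-centric coordinates.
    return np.array([c * rel[0] + s * rel[1], -s * rel[0] + c * rel[1]])

# Example: with the agent at (0, 0) and the goal at (100, 100), the goal maps
# to roughly (141.4, 0) in the transformed frame.
print(to_agent_frame((0, 0), (100, 100), (100, 100)))
```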

3.1.1 State representation for static obstacle avoidance

In the simulations of static obstacle avoidance,

s_t^o = [\, d_g, v_x, v_y \,] \qquad (6)

where d_g is the agent’s distance to the goal and (v_x, v_y) denotes the agent’s velocity. The information of obstacle i is

s_t^{e,i} = [\, p_y^i, d^i \,] \qquad (7)

where p_y^i is the y-axis position of obstacle i and d^i is the agent’s distance to the center of obstacle i. The term p_y^i is introduced to help the agent learn the globally optimal solution; for example, when approaching the obstacle, turn a small angle counterclockwise if p_y^i is positive, which means the agent is on the right side of the line passing through the obstacle center and the goal. We note that the velocity and obstacle position are expressed in the transformed coordinate system.

3.1.2 State representation for moving obstacle avoidance

As for moving obstacle avoidance, the position of the goal, p^g, is added to s_t^o:

s_t^o = [\, d_g, v_x, v_y, p^g \,] \qquad (8)

The information of intruder i, s_t^{e,i}, is represented by

s_t^{e,i} = [\, p^i, v^i, d^i, v_{\mathrm{rel}}^i \,] \qquad (9)

where p^i is the position of intruder i, v^i is the velocity of intruder i, d^i is the distance between the agent and intruder i, and v_{\mathrm{rel}}^i is the velocity of the agent relative to intruder i. We note that p^g, p^i, v^i, and v_{\mathrm{rel}}^i are vectors in the transformed coordinate system.
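As an illustration, the sketch below assembles one intruder's portion of the state in the spirit of Eq. 9; the component ordering and the normalization constants are assumptions for illustration only.

```python
import numpy as np

def intruder_state(agent_vel, intruder_pos, intruder_vel,
                   pos_scale=4000.0, vel_scale=40.0):
    """Illustrative s_t^{e,i} in the spirit of Eq. (9). All inputs are assumed to
    already be expressed in the transformed, agent-centric coordinate frame
    (agent at the origin); the normalization scales are placeholders."""
    p = np.asarray(intruder_pos, float)          # intruder position p^i
    v = np.asarray(intruder_vel, float)          # intruder velocity v^i
    d = np.linalg.norm(p)                        # agent-intruder distance d^i
    v_rel = np.asarray(agent_vel, float) - v     # relative velocity v_rel^i
    scales = np.array([pos_scale, pos_scale, vel_scale, vel_scale,
                       pos_scale, vel_scale, vel_scale])
    return np.concatenate([p, v, [d], v_rel]) / scales
```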

3.2 Action space

3.2.1 Action space for static obstacle avoidance and stochastic intruder avoidance

In the implementations of static obstacle avoidance and stochastic intruder avoidance, the action represents the change in the heading angle for the controlled UAS at each time step. The action space is set to,

(10)

More specifically, at each time step, the agent will select an action \Delta\psi_t and change its heading angle \psi:

\psi_{t+1} = \psi_t + \Delta\psi_t \qquad (11)
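For concreteness, a minimal sketch of how the heading-angle action could propagate the agent's position over one time step is given below; the constant speed of 20 m/s and a 1 s step length are assumptions taken from the experiment description, and the function name is illustrative.

```python
import numpy as np

def step_heading(pos, heading, d_heading, speed=20.0, dt=1.0):
    """Apply a heading-angle change (cf. Eq. 11) and advance the agent one step.
    speed (m/s) and dt (s) are assumed values for illustration."""
    heading = heading + d_heading                       # psi_{t+1} = psi_t + delta psi_t
    pos = np.asarray(pos, float) + speed * dt * np.array(
        [np.cos(heading), np.sin(heading)])             # constant-speed kinematics
    return pos, heading
```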

3.2.2 Action space for deterministic intruder avoidance

As for the deterministic intruder avoidance case, besides the heading angle change \Delta\psi, the UAS is also controlled by a speed command, which is updated every second. During the interval, the two commands are fixed. Since a UAS is more flexible than manned aircraft and there is no available regulation on UAS speed change, the UAS speed command can be chosen from a continuous range,

(12)

More specifically, the agent will select a speed command a_{v,t} and change its speed at the next time step t+1:

v_{t+1} = a_{v,t} \qquad (13)

In real-world applications, however, making a sharp turn is usually not desirable for controlling a UAS. Thus, a penalty on large heading or speed changes, reflecting the associated power consumption, may be considered in future work.

3.3 Terminal state

In the current study, a conflict is defined as the situation in which the distance from the agent to the obstacle is less than a minimum separation distance. When the UAS operation is deterministic, a buffer zone is not necessary and the minimum separation distance is set to zero. In the implementations of static obstacle avoidance with uncertainty and moving obstacle avoidance, the UAS position uncertainty is taken into account. The separation requirement is determined according to the operational safety bound proposed in Hu et al. (2020). With a UAS speed of 20 m/s and other UAS operational performance following the mean values shown in Table 3 of Hu et al. (2020), the minimum separation distance is 75 m for static obstacle avoidance and 150 m for moving obstacle avoidance.

3.3.1 Terminal state for static obstacle avoidance

The terminal state for static obstacle avoidance includes two different types of states:

  • Conflict state: the distance between the agent and obstacle is less than the minimum separation distance.

  • Goal state: the agent is within 400 m of the destination.

3.3.2 Terminal state for moving obstacle avoidance

The episode terminates only when the agent is within 200 m of the destination, which indicates that the agent has accomplished the navigation task.

3.4 Reward function

To guide the agent to reach its goal and avoid conflicts, the reward function is developed to reward accomplishment of the task while penalizing conflicts or failure to move towards the goal.

3.4.1 Reward function for static obstacle avoidance

In the simulations of static obstacle avoidance, the reward function is expressed in the following form, where we set a reward for the goal state and a penalty for the conflict state. The linear term of the reward function guides the UAS to fly towards the destination. The constant penalty at each step emphasizes the shortest-path rule.

(14)
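Since the constants of Eq. 14 are not reproduced here, the sketch below only illustrates the structure described above (goal reward, conflict penalty, a linear progress term toward the destination, and a constant per-step penalty); every coefficient is a placeholder, not a value from the paper.

```python
def static_obstacle_reward(dist_to_goal, prev_dist_to_goal, in_conflict, at_goal,
                           r_goal=1.0, r_conflict=-1.0, w_progress=0.001, c_step=-0.01):
    """Illustrative reward with the structure described for Eq. (14); all
    coefficients here are placeholders, not the paper's values."""
    if at_goal:
        return r_goal                                # reward for reaching the goal state
    if in_conflict:
        return r_conflict                            # penalty for entering a conflict state
    progress = prev_dist_to_goal - dist_to_goal      # linear term: move toward the destination
    return w_progress * progress + c_step            # constant per-step penalty -> shortest path
```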

3.4.2 Reward function for moving obstacle avoidance

In the simulations of intruder avoidance, the reward function is expressed in the following form, similar to Eq. 14.

(15)

In this reward function, the coefficients of the different cost terms should be balanced to help the agent learn conflict resolution and reach the goal simultaneously. When the ownship is close to an intruder, the inverse tangent term of the reward function is activated to keep the distance in an appropriate range. With the coefficients set as in the stochastic intruder case in Section 4.2.1, the relation between the distance and the inverse tangent term is shown in Fig. 1. The agent starts to receive a penalty when the distance approaches 250 m. This reward setting helps the agent avoid conflicts with intruders at a relatively early stage. We note that the coefficients can be tuned to fit different separation standards.

Figure 1: Reward related to the distance from the agent to the intruder.
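A hedged sketch of such an inverse-tangent distance term is given below; only the 250 m activation distance comes from the text, and the remaining coefficients are placeholders chosen to reproduce the qualitative shape in Fig. 1.

```python
import numpy as np

def separation_penalty(dist, weight=1.0, sharpness=0.05, d_activate=250.0):
    """Illustrative inverse-tangent penalty: close to zero when the intruder is far,
    increasingly negative as the distance drops below roughly d_activate (m).
    `weight` and `sharpness` are placeholders, not the paper's coefficients."""
    return -weight * (np.arctan(sharpness * (d_activate - dist)) / np.pi + 0.5)
```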

3.5 Proximal policy optimization algorithm

One drawback of the policy-gradient update proposed in Sutton and Barto (2018), also shown in Eq. 5, is that one bad update can have large destructive effects and hinder the final performance of the model. The Proximal Policy Optimization (PPO) algorithm (Schulman et al., 2017) was proposed recently to solve this problem by introducing a policy changing ratio r_t(\theta) describing the change from the previous policy to the new policy at time step t:

r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)} \qquad (16)

where \theta_{\mathrm{old}} and \theta denote the network weights before and after the update.

By restricting the policy changing ratio to the range [1-\epsilon, 1+\epsilon], with \epsilon set to 0.2 in this paper, the PPO loss functions for the actor and critic networks are formulated as follows:

L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\hat{A}_t,\; \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\big)\right] \qquad (17)
L^{\mathrm{actor}}(\theta) = -L^{\mathrm{CLIP}}(\theta) - c\,\mathbb{E}_t\big[S[\pi_{\theta}](s_t)\big] \qquad (18)
L^{\mathrm{critic}}(\phi) = \mathbb{E}_t\!\left[\big(V_{\phi}(s_t) - R_t\big)^2\right] \qquad (19)

where \epsilon is a hyperparameter that bounds the policy changing ratio r_t(\theta). In Eq. 17 and Eq. 18, the advantage function \hat{A}_t measures whether the action is better or worse than the policy’s default behavior. Also, the policy entropy S[\pi_{\theta}] is added to the actor loss function to encourage exploration by discouraging premature convergence to sub-optimal deterministic policies.
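A compact sketch of these losses in PyTorch is shown below; the tensor names, the entropy coefficient, and the separate treatment of the critic loss are assumptions made for illustration.

```python
import torch

def ppo_losses(new_log_probs, old_log_probs, advantages, values, returns,
               entropy, clip_eps=0.2, ent_coef=0.01):
    """Sketch of the clipped-surrogate actor loss and the critic loss; tensor
    names and the entropy coefficient are illustrative assumptions."""
    ratio = torch.exp(new_log_probs - old_log_probs)              # r_t(theta), cf. Eq. (16)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)  # restrict the ratio
    surrogate = torch.min(ratio * advantages, clipped * advantages).mean()  # cf. Eq. (17)
    actor_loss = -surrogate - ent_coef * entropy.mean()           # entropy bonus, cf. Eq. (18)
    critic_loss = (returns - values).pow(2).mean()                # value regression, cf. Eq. (19)
    return actor_loss, critic_loss
```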

In the implementation, we use a two-layer multilayer perceptron (MLP) with 64 hidden units per layer for both the actor and critic networks. The tanh function is chosen as the activation function for the hidden layers.
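A possible realization of this architecture is sketched below; the Gaussian policy head for the continuous actions and the state-independent log standard deviation are assumptions, while the two tanh hidden layers of 64 units follow the description above.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Two-hidden-layer (64 units, tanh) actor and critic, matching the
    architecture described above; the Gaussian policy head is an assumption."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        def mlp():
            return nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                 nn.Linear(64, 64), nn.Tanh())
        self.actor_body, self.critic_body = mlp(), mlp()
        self.mu = nn.Linear(64, act_dim)            # mean of the continuous action
        self.log_std = nn.Parameter(torch.zeros(act_dim))
        self.value = nn.Linear(64, 1)

    def forward(self, obs):
        dist = torch.distributions.Normal(self.mu(self.actor_body(obs)),
                                          self.log_std.exp())
        return dist, self.value(self.critic_body(obs)).squeeze(-1)
```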

4 Numerical experiments

Numerical experiments are presented in this section to evaluate the proposed conflict resolution model in continuous action space. There are two categories of collision avoidance: static obstacle avoidance and moving obstacle avoidance. For static obstacle avoidance, we investigate the performance for different obstacle shapes and sizes and under uncertainty in UAS operation. We also study the environment with stochastic intruders under control of the heading angle, and the environment with deterministic intruders under control of both heading angle and speed. In all the simulations, one pixel in the figures represents 10 m in the real world. The PPO algorithm is implemented with OpenAI Baselines (Dhariwal et al., 2017). The deep reinforcement learning model for each case is trained for 30 million time steps.
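For reference, a hedged sketch of invoking the Baselines PPO2 implementation is given below; a standard Gym continuous-control task stands in for the (unpublished) UAS environment, and the keyword arguments shown may differ slightly across Baselines versions.

```python
# Hedged sketch of training with OpenAI Baselines' PPO2; a standard Gym
# continuous-control task stands in for the UAS environment.
import gym
import tensorflow as tf
from baselines.ppo2 import ppo2
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv

venv = DummyVecEnv([lambda: gym.make('Pendulum-v0')])   # stand-in environment
model = ppo2.learn(
    network='mlp',                 # MLP policy/value networks
    env=venv,
    total_timesteps=int(3e7),      # 30 million time steps, as stated above
    cliprange=0.2,                 # epsilon in Eqs. (17)-(18)
    num_layers=2, num_hidden=64, activation=tf.tanh,    # network kwargs
)
```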

4.1 Static obstacle avoidance

The simulation environment is free-flight airspace of 4 km length and 4 km width. The UAS speed is set to 20 m/s. During the training process, the starting position of the aircraft is randomly sampled from the four edges of the airspace boundary and is integer-valued for simplification. The goal is located at (2500, 2500). In this experiment, we study two types of static obstacles: circular obstacles and rectangular obstacles, as shown in Fig. 2. The plus sign represents the goal position and the blue region represents the no-passing area. The episode reward mean is shown in Fig. 3, which shows that the episode reward is growing and the policy is converging to the optimal solution. To visualize the performance of the proposed conflict resolution model, we generate a testing set of 160 trajectories starting from different origins. The origin for testing is chosen every 100 m on each edge of the airspace boundary. Heading angles in the 160 trajectories are collected and the heading angle is plotted every 15 time steps.

Figure 2: (axis unit: m) (a) Circular obstacle environment. (b) Rectangular obstacle environment. +: goal; blue: no-passing area.
Figure 3: Episode reward mean for (a) circular obstacle avoidance, (b) rectangular obstacle avoidance, and (c) circular obstacle avoidance with probabilistic agent position.

4.1.1 Circular obstacle avoidance

The static obstacle setup for this case study is shown in Fig. 2(a), and the testing result of 160 trajectories starting from different locations on the airspace boundary is shown in Fig. 4. The black arrows represent the agent’s selected heading direction at each position.

Figure 4: (axis unit: m) Results of circular obstacle avoidance. +: goal; blue: no-passing area; arrow: the selected heading direction.

From Fig. 4, it can be seen that the agent selects heading angles pointing toward the goal while tending to avoid the no-passing region. Also, the agent chooses the optimal behavior according to the relative positions of the agent, obstacle, and goal. For example, near the lower-left obstacle, if the agent’s position is above the line passing through the obstacle center and the goal, the UAS takes a small left turn to avoid the obstacle. Otherwise, the UAS bypasses the lower semicircle. For the 160 generated trajectories in Fig. 4, there are no failures.

4.1.2 Rectangular obstacle avoidance

The difference between this rectangular obstacle case and the previous circular obstacle case in simulation is the condition used to check whether the agent is in a conflict state. The environment for this case is shown in Fig. 2(b) and the testing result of 160 trajectories is shown in Fig. 5.

Figure 5: (axis unit: m) Results of rectangular obstacle avoidance. +: goal; blue: no-passing area; arrow: the selected heading direction.

The performance in Fig. 5 is similar to the result in Fig. 4. The UAS learns to bypass the obstacle around one side, depending on the relative positions of the agent, obstacle, and goal. No failures occur among the 160 tested trajectories.

Results in Fig. 4 and Fig. 5 show that the proposed model has the capability to make the UAS learn to find the shortest path and also avoid static obstacles for different obstacle sizes or shapes.

4.1.3 Circular obstacle avoidance with uncertainty

This case is studied to assess how the proposed conflict resolution model handles uncertainty. UAS operation is stochastic and randomness exists in almost every aspect of UTM. Inclusion of uncertainty quantification of aircraft operation is critical for future safety analysis (e.g., deviation from a trajectory plan due to wind, true speed, positioning error) (Hu et al., 2020; Liu and Goebel, 2018; Hu and Liu, 2020; Pang et al., 2019b, a, 2021). Thus, to model the uncertainties in UAS operation, we form a circle whose center is the predicted UAS position without uncertainty and whose radius is the separation requirement, 75 m. With 90% probability, the UAS position is located exactly at the center of the circle; with 10% probability, the UAS position is located at a point on the circle following a uniform distribution. Such uncertainty is considered when calculating the agent’s position at the next time step after taking an action.
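A minimal sketch of this position-uncertainty model is given below; the function name and the use of NumPy's global random generator are illustrative.

```python
import numpy as np

def noisy_position(predicted_pos, radius=75.0, p_center=0.9, rng=np.random):
    """Sample the agent's position under the uncertainty model described above:
    with probability 0.9 the predicted position itself, otherwise a point
    uniformly distributed on a circle of radius 75 m around it."""
    predicted_pos = np.asarray(predicted_pos, float)
    if rng.random() < p_center:
        return predicted_pos
    angle = rng.uniform(0.0, 2.0 * np.pi)
    return predicted_pos + radius * np.array([np.cos(angle), np.sin(angle)])
```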

The testing results are shown in Fig. 6. In Fig. 6(a), the agent’s position with uncertainty is plotted, while in Fig. 6(b), the uncertainty of 75 m is added to the obstacle, which is indicated by the red circle. As expected, the UAS tries to keep 75 m away from the obstacles, so either method can work when simulating collision avoidance with uncertainty. One failure happens near the upper-left obstacle in Fig. 6(a). There are three failures near the lower-left obstacle in Fig. 6(b). The common feature among the failures is that the agent’s origin is approximately on the line passing through the obstacle center and the goal. The possible reason is that the policy network gets stuck at a local optimum, since the two trajectories next to it behave well.

Figure 6: (axis unit: m) (a) Results with probabilistic agent position. (b) Results with uncertainty added to obstacles. +: goal; blue: no-passing area; arrow: the selected heading direction; red: separation requirement due to uncertainty in UAS operation.

4.2 Moving obstacle avoidance

For the moving intruder aircraft avoidance case, the speed of the intruders is set to 20 m/s. There are two cases for moving obstacle avoidance: a stochastic intruder case with control of the heading angle, and a deterministic intruder case with control of both heading angle and speed. In the stochastic intruder case, the scenario changes every episode: each intruder has a different origin and heading angle in each episode, but within one episode the intruders’ heading angles are fixed. The reward coefficients are listed in Table 1. The episode reward mean is shown in Fig. 7. To visualize the performance of the proposed conflict resolution model, we generate a testing set of 500 episodes following the setting used during the training process for each case. Also, the minimum distance of the agent to the three intruders within each episode is collected.

Coefficient
Stochastic-intruder avoidance 0.007 0.15 17 0.1 12
Deterministic-intruder avoidance 0.22 0.05 3 0.1 12
Table 1: Reward coefficient
Figure 7: (a) Episode reward mean of stochastic-intruder avoidance with control of heading angle. (b) Episode reward mean of deterministic-intruder avoidance with control of heading angle and speed.

4.2.1 Stochastic-intruder avoidance with control of heading angle

The origins and heading angles of the three intruders are assumed to follow uniform distributions, with ranges shown in Table 2. The origin coordinate of the agent is uniformly sampled from a fixed range, and the goal is located at a fixed position. The agent moves at 20 m/s. The intruders are designed to cross the line connecting the UAS origin and the goal.

Intruder 1 2 3
Origin coordinate range
Heading angle range
Table 2: Intruder information

The demonstration of one scenario and the UAS performance is shown in Fig. 8. Information related to the ownship is plotted in blue, and black represents the information of the intruders. The plus sign denotes the origin and the star sign is the goal for the agent. The centers of the circles are the positions of the aircraft, which are plotted every 5 time steps and labeled with the time step every 10 time steps. The radius of each circle represents the aircraft speed. In this scenario, the agent learns to go around the left side to avoid the three intruders. The minimum distance from the agent to the three intruders within each episode is plotted in Fig. 9 as blue dots. The orange line is the separation requirement of 150 m. All the blue dots are above the orange line, which indicates that there is no failure case in Fig. 9 and that the model succeeds in avoiding the three intruders in 500 different testing scenarios.

Figure 8: Demonstration of one scenario and UAS performance of stochastic-intruder avoidance with control of heading angle. Number: time step. Blue: ownship; black: intruders. Circle center: UAS position; circle radius: UAS speed. +: origin; *: goal.
Figure 9: Minimum distance results of stochastic-intruder avoidance with control of heading angle. Orange line: separation requirement of 150 m. Blue dot: the minimum distance from the agent to the three intruders within each episode.

4.2.2 Deterministic-intruder avoidance with control of heading angle and speed

We also investigate the possibility of utilizing the proposed reward function to generate both heading angle change commands and speed commands. This investigation is valuable when changing the heading angle alone cannot efficiently resolve conflicts. Moreover, with the extra choice of changing speed, the UAS may have less influence on the flight plans of other aircraft and on airspace capacity. However, due to the larger action space, the training process requires more effort.

The origins and heading angles of the three intruders are listed in Table 3. The origin coordinate of the agent and the location of the goal are fixed. Intruder 1 is designed to test whether the ownship can fly at a suitable speed, and the other two intruders are set to test the performance of the heading angle change command.

Intruder 1 2 3
Origin coordinate
Heading angle
Table 3: Intruder information

Similar to the result in Fig. 8, the demonstration of the scenario and the UAS performance is shown in Fig. 10. Information related to the ownship is plotted in blue, and black represents the information of the intruders. The plus sign denotes the origin and the star sign is the goal for the agent. The centers of the circles are the positions of the aircraft, which are plotted every 3 time steps and labeled with the time step every 6 time steps. The radius of each circle represents the aircraft speed. It can be seen that the agent reduces its speed from time step 12 to time step 24 to keep a safe separation from intruder 1. Also, the agent goes around the right side to avoid the approaching intruder 2, and after resolving the possible conflict with intruder 3 at time step 66, it flies towards the goal to save time. The minimum distance result is plotted in Fig. 11 as blue dots. All the blue dots are above the orange line, which represents the separation requirement of 150 m. Thus, there is no failure case in Fig. 11, indicating that the model succeeds in avoiding the three intruders under the control of heading angle and speed.

Figure 10: Demonstration of the scenario and UAS performance of deterministic-intruder avoidance with control of heading angle and speed. Number: time step. Blue: ownship; black: intruders. Circle center: UAS position; circle radius: UAS speed. +: origin; *: goal.
Figure 11: Minimum distance results of deterministic-intruder avoidance with control of heading angle and speed. Orange line: separation requirement of 150 m. Blue dot: the minimum distance from the agent to the three intruders within each episode.

5 Conclusion

In this work, we present a method for using deep reinforcement learning to allow a UAS to navigate successfully in urban airspace with a continuous action space. Both static and moving obstacles are simulated, and the trained UAS has the capability to reach the goal and perform conflict resolution simultaneously. We also investigate the performance for different static obstacle shapes and sizes and under uncertainty in UAS operation. Stochastic intruders are considered in the training process of the moving obstacle experiments. Moreover, we investigate the possibility of using the proposed reward function to resolve conflicts through both heading angle and speed. Results show that the proposed model can provide accurate and robust guidance and resolve conflicts with a success rate of over 99%. To make the proposed algorithm more practical and efficient in the real world, in future work we would model part of the intruders as agents, allowing cooperation among multiple aircraft.

Acknowledgments

The research reported in this paper was supported by funds from NASA University Leadership Initiative program (Contract No. NNX17AJ86A, PI: Yongming Liu, Technical Officer: Anupa Bajwa). The support is gratefully acknowledged.

References

  • B. Acikmese and S. R. Ploen (2007) Convex programming approach to powered descent guidance for mars landing. Journal of Guidance, Control, and Dynamics 30 (5), pp. 1353–1366. External Links: Document Cited by: §1.
  • F. Augugliaro, A. P. Schoellig, and R. D’Andrea (2012) Generation of collision-free trajectories for a quadrocopter fleet: a sequential convex programming approach. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 1917–1922. External Links: Document Cited by: §1.
  • K. Balakrishnan, J. Polastre, J. Mooberry, R. Golding, and P. Sachs (2018) Blueprint for the sky. The roadmap for the safe integration of autonomous aircraft. Airbus A 3. Cited by: §1.
  • R. Bellman (1957) A markovian decision process. Indiana Univ. Math. J. 6, pp. 679–684. External Links: ISSN 0022-2518 Cited by: §2.1.
  • M. Brittain, X. Yang, and P. Wei (2020) A deep multi-agent reinforcement learning approach to autonomous separation assurance. arXiv preprint arXiv:2003.08353. Cited by: §1.
  • Y. F. Chen, M. Liu, M. Everett, and J. P. How (2017) Decentralized non-communicating multiagent collision avoidance with deep reinforcement learning. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pp. 285–292. External Links: Document Cited by: §3.1.
  • J. P. Chryssanthacopoulos and M. J. Kochenderfer (2012) Decomposition methods for optimized collision avoidance with multiple threats. Journal of Guidance, Control, and Dynamics 35 (2), pp. 398–405. External Links: Document Cited by: §1.
  • J. A. Cobano, R. Conde, D. Alejo, and A. Ollero (2011) Path planning based on genetic algorithms and the monte-carlo method to avoid aerial vehicle collisions under uncertainties. In Robotics and Automation (ICRA), 2011 IEEE International Conference on, pp. 4429–4434. External Links: Document Cited by: §1.
  • D. Delahaye, C. Peyronne, M. Mongeau, and S. Puechmorel (2010) Aircraft conflict resolution by genetic algorithm and b-spline approximation. In EIWAC 2010, 2nd ENRI International Workshop on ATM/CNS, pp. 71–78. Cited by: §1.
  • P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, Y. Wu, and P. Zhokhov (2017) OpenAI baselines. GitHub. Note: https://github.com/openai/baselines Cited by: §4.
  • P. J. Enright and B. A. Conway (1992) Discrete approximations to optimal trajectories using direct transcription and nonlinear programming. Journal of Guidance, Control, and Dynamics 15 (4), pp. 994–1002. External Links: Document Cited by: §1.
  • E. A. Feinberg and A. Shwartz (2012) Handbook of markov decision processes: methods and applications. Vol. 40, Springer Science & Business Media. External Links: Document Cited by: §2.1.
  • E. Frazzoli, Z. Mao, J. Oh, and E. Feron (2001) Resolution of conflicts involving many aircraft via semidefinite programming. Journal of Guidance, Control, and Dynamics 24 (1), pp. 79–86. External Links: Document Cited by: §1.
  • M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver (2018) Rainbow: combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §2.2.
  • R. A. Howard (1964) Dynamic programming and markov processes. External Links: Document Cited by: §2.1.
  • J. Hu, H. Erzberger, K. Goebel, and Y. Liu (2020) Probabilistic risk-based operational safety bound for rotary-wing unmanned aircraft systems traffic management. Journal of Aerospace Information Systems 17 (3), pp. 171–181. Cited by: §3.3, §4.1.3.
  • J. Hu and Y. Liu (2020) UAS conflict resolution integrating a risk-based operational safety bound as airspace reservation with reinforcement learning. In AIAA Scitech 2020 Forum, pp. 1372. Cited by: §4.1.3.
  • C. W. Keong, H. Shin, and A. Tsourdos (2019) Reinforcement learning for autonomous aircraft avoidance. In 2019 Workshop on Research, Education and Development of Unmanned Aerial Systems (RED UAS), pp. 126–131. Cited by: §1.
  • M. J. Kochenderfer, J. E. Holland, and J. P. Chryssanthacopoulos (2012) Next-generation airborne collision avoidance system. Technical report Massachusetts Institute of Technology-Lincoln Laboratory Lexington United States. Cited by: §1.
  • S. Koenig and R. Simmons (1998) Xavier: a robot navigation architecture based on partially observable markov decision process models. Artificial Intelligence Based Mobile Robotics: Case Studies of Successful Robot Systems, pp. 91–122. Cited by: §2.1.
  • P. Kopardekar, J. Rios, T. Prevot, M. Johnson, J. Jung, and J. E. Robinson (2016) Unmanned aircraft system traffic management (utm) concept of operations. Cited by: §1.
  • S. Li, M. Egorov, and M. Kochenderfer (2019) Optimizing collision avoidance in dense airspace using deep reinforcement learning. arXiv preprint arXiv:1912.10146. Cited by: §1.
  • T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §1, §2.2.
  • Y. Liu and K. Goebel (2018) Information fusion for national airspace system prognostics. In PHM Society Conference, Vol. 10. Cited by: §4.1.3.
  • Z. Ma, C. Wang, Y. Niu, X. Wang, and L. Shen (2018) A saliency-based reinforcement learning approach for a uav to avoid flying obstacles. Robotics and Autonomous Systems 100, pp. 108–118. Cited by: §1.
  • G. Manfredi and Y. Jestin (2016) An introduction to acas xu and the challenges ahead. In 2016 IEEE/AIAA 35th Digital Avionics Systems Conference (DASC), pp. 1–9. Cited by: §1.
  • M. Mariton (1990) Jump linear systems in automatic control. M. Dekker New York. Cited by: §2.1.
  • D. Mellinger, A. Kushleyev, and V. Kumar (2012) Mixed-integer quadratic program trajectory generation for heterogeneous quadrotor teams. In Robotics and Automation (ICRA), 2012 IEEE International Conference on, pp. 477–483. External Links: Document Cited by: §1.
  • V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928–1937. Cited by: §1, §2.2.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller (2013) Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. Cited by: §1.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529. Cited by: §2.2.
  • D. Morgan, S. Chung, and F. Y. Hadaegh (2014) Model predictive control of swarms of spacecraft using sequential convex programming. Journal of Guidance, Control, and Dynamics 37 (6), pp. 1725–1740. External Links: Document Cited by: §1.
  • M. P. Owen, A. Panken, R. Moss, L. Alvarez, and C. Leeper (2019) ACAS xu: integrated collision avoidance and detect and avoid capability for uas. In 2019 IEEE/AIAA 38th Digital Avionics Systems Conference (DASC), pp. 1–10. Cited by: §1.
  • L. Pallottino, E. M. Feron, and A. Bicchi (2002) Conflict resolution problems for air traffic management systems solved with mixed integer programming. IEEE transactions on intelligent transportation systems 3 (1), pp. 3–11. External Links: Document Cited by: §1.
  • Y. Pang, N. Xu, and Y. Liu (2019a) Aircraft trajectory prediction using lstm neural network with embedded convolutional layer. In Proceedings of the Annual Conference of the PHM Society, Vol. 11. Cited by: §4.1.3.
  • Y. Pang, H. Yao, J. Hu, and Y. Liu (2019b) A recurrent neural network approach for aircraft trajectory prediction with weather features from sherlock. In AIAA Aviation 2019 Forum, pp. 3413. Cited by: §4.1.3.
  • Y. Pang, X. Zhao, H. Yan, and Y. Liu (2021) Data-driven trajectory prediction with weather uncertainties: a bayesian deep learning approach. Transportation Research Part C: Emerging Technologies 130, pp. 103326. Cited by: §4.1.3.
  • D. Pham, N. P. Tran, S. Alam, V. Duong, and D. Delahaye (2019) A machine learning approach for conflict resolution in dense traffic scenarios with uncertainties. Cited by: §1.
  • M. Pontani and B. A. Conway (2010) Particle swarm optimization applied to space trajectories. Journal of Guidance, Control, and Dynamics 33 (5), pp. 1429–1441. External Links: Document Cited by: §1.
  • A. U. Raghunathan, V. Gopal, D. Subramanian, L. T. Biegler, and T. Samad (2004) Dynamic optimization strategies for three-dimensional conflict resolution of multiple aircraft. Journal of guidance, control, and dynamics 27 (4), pp. 586–594. External Links: Document Cited by: §1.
  • A. Richards and J. P. How (2002) Aircraft trajectory planning with collision avoidance using mixed integer linear programming. In American Control Conference, 2002. Proceedings of the 2002, Vol. 3, pp. 1936–1941. External Links: Document Cited by: §1.
  • T. Schouwenaars, B. De Moor, E. Feron, and J. How (2001) Mixed integer programming for multi-vehicle path planning. In Control Conference (ECC), 2001 European, pp. 2603–2608. External Links: Document Cited by: §1.
  • J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz (2015) Trust region policy optimization. In International conference on machine learning, pp. 1889–1897. Cited by: §2.2.
  • J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §1, §1, §2.2, §3.5, §3.
  • D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. (2017) Mastering the game of go without human knowledge. Nature 550 (7676), pp. 354. Cited by: §2.2.
  • J. Sun and Y. Zhang (2019) A reinforcement learning-based decentralized method of avoiding multi-uav collision in 3-d airspace. In Proceedings of the 2019 3rd International Conference on Computer Science and Artificial Intelligence, pp. 77–82. Cited by: §1.
  • R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT press. Cited by: §2.2, §2.2, §3.5.
  • R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour (2000) Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pp. 1057–1063. Cited by: §1.
  • S. Thrun (2002) Probabilistic robotics. Communications of the ACM 45 (3), pp. 52–57. External Links: Document Cited by: §2.1.
  • H. Van Hasselt, A. Guez, and D. Silver (2016) Deep reinforcement learning with double q-learning. In Thirtieth AAAI conference on artificial intelligence, Cited by: §1.
  • A. Vela, S. Solak, W. Singhose, and J. Clarke (2009) A mixed integer program for flight-level assignment and speed control for conflict resolution. In Decision and Control, 2009 held jointly with the 2009 28th Chinese Control Conference. CDC/CCC 2009. Proceedings of the 48th IEEE Conference on, pp. 5219–5226. External Links: Document Cited by: §1.
  • O. Vinyals, T. Ewalds, S. Bartunov, P. Georgiev, A. S. Vezhnevets, M. Yeo, A. Makhzani, H. Küttler, J. Agapiou, J. Schrittwieser, et al. (2017) Starcraft ii: a new challenge for reinforcement learning. arXiv preprint arXiv:1708.04782. Cited by: §2.2.
  • Z. Wang, T. Schaul, M. Hessel, H. Van Hasselt, M. Lanctot, and N. De Freitas (2015) Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581. Cited by: §1.
  • D. J. White (1993) A survey of applications of markov decision processes. Journal of the operational research society 44 (11), pp. 1073–1096. External Links: Document Cited by: §2.1.
  • B. Wulfe (2017) UAV collision avoidance policy optimization with deep reinforcement learning. Cited by: §1.
  • S. Yang, Z. Meng, X. Chen, and R. Xie (2019) Real-time obstacle avoidance with deep reinforcement learning three-dimensional autonomous obstacle avoidance for uav. In Proceedings of the 2019 International Conference on Robotics, Intelligent Control and Artificial Intelligence, pp. 324–329. Cited by: §1.
  • X. Yang and P. Wei (2018) Autonomous on-demand free flight operations in urban air mobility using monte carlo tree search. In 8th International Conference on Research in Air Transportation (ICRAT), Cited by: §1.
  • X. Yang and P. Wei (2020) Scalable multi-agent computational guidance with separation assurance for autonomous urban air mobility. Journal of Guidance, Control, and Dynamics 43 (8), pp. 1473–1486. External Links: Document Cited by: §1.