1 Introduction
From delivery drones to autonomous electric vertical takeoff and landing (eVTOL) passenger aircraft, modern unmanned aircraft systems (UAS) can perform many different tasks efficiently, including goods delivery, surveillance, public safety, weather monitoring, disaster relief, search and rescue, traffic monitoring, videography, and air transportation (Balakrishnan et al., 2018; Kopardekar et al., 2016). Urban air mobility (UAM) is likely to occur in urban areas close to buildings or airports. Thus, it is expected that UAS can use onboard detect-and-avoid systems to avoid other traffic, hazardous weather, terrain, and man-made and natural obstacles without constant human intervention (Kopardekar et al., 2016).
Many mathematical models for aircraft conflict resolution have been proposed in the literature. Research efforts can be divided into centralized and decentralized algorithms. Centralized methods can be based on semidefinite programming (Frazzoli et al., 2001), nonlinear programming (Raghunathan et al., 2004; Enright and Conway, 1992), mixed-integer linear programming (Schouwenaars et al., 2001; Richards and How, 2002; Pallottino et al., 2002; Vela et al., 2009), mixed-integer quadratic programming (Mellinger et al., 2012), sequential convex programming (Augugliaro et al., 2012; Morgan et al., 2014), second-order cone programming (Acikmese and Ploen, 2007), evolutionary techniques (Delahaye et al., 2010; Cobano et al., 2011), and particle swarm optimization (Pontani and Conway, 2010). These centralized methods often pursue the global optimum for all aircraft. However, as the number of aircraft grows, the computation cost of these methods typically scales exponentially. Among the decentralized methods, the conflict resolution problem can be formulated as a Markov Decision Process (MDP). Reinforcement Learning (RL) has proved to be a good solution to air traffic management, but most existing work uses traditional algorithms
(Sun and Zhang, 2019). The next-generation airborne collision avoidance system (ACAS X) formulates the collision avoidance system (CAS) problem as a partially observable Markov Decision Process (POMDP) and has been extended to unmanned aircraft as ACAS Xu (Kochenderfer et al., 2012). Both ACAS X and ACAS Xu use Dynamic Programming (DP) to determine the expected cost of each action (Manfredi and Jestin, 2016; Owen et al., 2019). Chryssanthacopoulos and Kochenderfer (2012) combined decomposition methods and DP for optimized collision avoidance with multiple threats. Traditional RL algorithms require a fine discretization of the state space and a finite action space. Discretization potentially reduces safety by adding discretization errors and cannot provide flexible maneuver guidance for UAS. In addition, discretizing a large airspace implies a high computation demand and can be time-consuming. Tree-search-based algorithms (Yang and Wei, 2018, 2020) have also been applied to CAS problems using an MDP formulation that does not involve state discretization, but they typically require substantial onboard computation time to accommodate the continuous state space. The large and continuous state and action spaces present a challenge for conflict resolution using reinforcement learning. Recently, Deep Reinforcement Learning (DRL) has been studied to address this challenge by applying deep neural networks to approximate the cost and optimal policy functions. The development of DRL algorithms, such as Policy Gradient
(Sutton et al., 2000), Deep Q-Networks (DQN) (Mnih et al., 2013), Double DQN (Van Hasselt et al., 2016), Dueling DQN (Wang et al., 2015), Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2015), Asynchronous Advantage Actor-Critic (A3C) (Mnih et al., 2016), and Proximal Policy Optimization (PPO) (Schulman et al., 2017), has increased the potential for automation. Li et al. (2019) used DQN to compute corrections for an existing collision avoidance approach to account for dense airspace. In Yang et al. (2019), the feasibility of using DQN-based algorithms for UAV obstacle avoidance is verified. Wulfe (2017) concluded that DQN can outperform value iteration in both evaluation performance and solution speed when solving a UAV collision avoidance problem. The performance of an agent in avoiding from a single up to multiple aircraft using the DQN algorithm is investigated in Keong et al. (2019). Brittain et al. (2020) proposed a novel deep multi-agent reinforcement learning framework based on PPO to identify and resolve conflicts among a variable number of aircraft in a high-density, stochastic, and dynamic en-route airspace sector. The DRL work mentioned above operates in continuous state and discrete action spaces. There has been less progress on utilizing DRL to solve UAS conflict resolution with continuous control. Pham et al. (2019) proposed a method inspired by the Deep Q-learning and Deep Deterministic Policy Gradient algorithms that can resolve conflicts with a success rate of over 81% in the presence of traffic and varying degrees of uncertainty. Ma et al. (2018) developed a generic framework that integrates an autonomous obstacle detection module and an actor-critic reinforcement learning (RL) module to develop reactive obstacle avoidance behavior for a UAV. Experiments in Schulman et al. (2017) test PPO on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, and show that PPO outperforms other online policy gradient methods while striking a favorable balance between sample complexity, simplicity, and wall-time. Thus, a PPO-based conflict resolution model is very valuable for UAS traffic management, which is the major motivation of this study.
To the best of the authors’ knowledge, this is the first study to develop a DRL approach based on the PPO algorithm that allows a UAS to navigate successfully in continuous state and action spaces. The benefit of working in continuous space is that there is no need to discretize the state space or to smooth results in post-processing. After offline training, the proposed model with the optimal policy can be used for real-time online UAS trajectory planning. The main contributions of this paper are as follows:

A PPO-based framework has been proposed for UAS to avoid both static and moving obstacles in continuous state and action spaces.

A novel scenario state representation and reward function are developed that effectively map the environment to maneuvers. The trained model can generate continuous heading angle commands and speed commands.

We have tested the effectiveness of the proposed learning framework in environments with static obstacles, with static obstacles and UAS position uncertainty, and in deterministic and stochastic environments with moving obstacles. Results show that the proposed model can provide accurate and robust guidance and resolve conflicts with a success rate of over 99%.
The remainder of this paper is organized as follows. Section 2 describes the background of Markov Decision Processes and Deep Reinforcement Learning. Section 3 presents the model formulation using a Markov Decision Process for UAS conflict resolution in continuous action space. In Section 4, numerical experiments are presented to show the capability of the proposed approach to make the UAS learn to avoid conflicts. Section 5 concludes this paper.
2 Background
In this section, we briefly review the background of Markov Decision Processes (MDP) and Deep Reinforcement Learning (DRL).
2.1 Markov Decision Process (MDP)
Since the 1950s, MDPs (Bellman, 1957) have been well studied and applied to a wide range of disciplines (Howard, 1964; White, 1993; Feinberg and Shwartz, 2012), including robotics (Koenig and Simmons, 1998; Thrun, 2002), automatic control (Mariton, 1990), economics, and manufacturing. In an MDP, the agent may choose any action that is available in the current state at each time step. The process responds at the next time step by moving into a new state with a certain transition probability and giving the agent a corresponding reward. More precisely, an MDP includes the following components:

The state space, which consists of all possible states.

The action space, which consists of all the actions the agent can take.

The transition function, which describes the probability of arriving at state $s'$ given the current state $s$ and action $a$.

The reward function, which decides the immediate reward (or expected immediate reward) received after transitioning from state $s$ to state $s'$ due to action $a$. In general, the reward depends on the current state, the current action, and the next state. $r_t$ denotes the immediate reward at time step $t$ and $G_t$ denotes the total discounted reward from time step $t$ onwards.

A discount factor $\gamma \in [0, 1]$, which decides the preference for immediate rewards versus future rewards. Setting the discount factor to less than 1 is also beneficial for the convergence of the cumulative reward.
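As a concrete illustration of the discounted return $G_t$, a minimal sketch (the reward sequence below is hypothetical):

```python
def discounted_return(rewards, gamma=0.99):
    """Total discounted reward G_t from a sequence of immediate rewards r_t."""
    g = 0.0
    for r in reversed(rewards):  # accumulate backwards: g = r_t + gamma * g
        g = r + gamma * g
    return g

# Hypothetical episode: small step penalties, then a terminal goal reward.
rewards = [-1.0, -1.0, -1.0, 10.0]
print(discounted_return(rewards, gamma=0.9))  # → 4.58
```

With $\gamma = 0.9$, the distant goal reward of 10 is discounted to $10 \times 0.9^3 = 7.29$, illustrating how $\gamma < 1$ keeps the cumulative reward bounded.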
In an MDP problem, a policy $\pi$ is a mapping from the state to a distribution over actions (known as a stochastic policy) or to one specific action (known as a deterministic policy):
(1) $\pi(a \mid s) = \Pr(a_t = a \mid s_t = s) \quad \text{or} \quad a = \pi(s)$
The goal of the MDP is to find an optimal policy $\pi^*$ that, if followed from any initial state, maximizes the expected cumulative immediate reward:
(2) $\pi^* = \arg\max_{\pi} \, \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, \pi\right]$
The Q-function and the value function are two important concepts in MDPs. The optimal Q-function $Q^*(s, a)$ represents the expected cumulative reward received by an agent that starts from state $s$, picks action $a$, and chooses actions optimally afterward. Therefore, $Q^*(s, a)$ indicates how good it is for an agent to pick action $a$ while in state $s$. The optimal value function $V^*(s)$ denotes the maximum expected cumulative reward when starting from state $s$, which can be expressed as the maximum of $Q^*(s, a)$ over all possible actions:
(3) $V^*(s) = \max_{a} Q^*(s, a)$
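The optimality relation in Eq. 3 can be checked numerically with a tiny value-iteration sketch; the two-state MDP below (transition tensor `P` and reward matrix `R`) is entirely hypothetical:

```python
import numpy as np

# A tiny hypothetical MDP: 2 states, 2 actions.
# P[s, a, s'] = transition probability, R[s, a] = expected immediate reward.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):      # value iteration until (near) convergence
    Q = R + gamma * P @ V  # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] V(s')
    V = Q.max(axis=1)      # V*(s) = max_a Q*(s, a), i.e. Eq. 3

print(np.allclose(V, (R + gamma * P @ V).max(axis=1)))  # → True at the fixed point
```

At convergence, applying the Bellman update once more leaves $V^*$ unchanged, which is exactly the fixed-point characterization behind Eq. 3.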
2.2 Deep Reinforcement Learning
Reinforcement learning (Sutton and Barto, 2018) is an effective approach to solving MDP problems. With the advent of deep learning, Deep Reinforcement Learning (DRL) has achieved much success recently, including the game of Go (Silver et al., 2017), Atari games (Mnih et al., 2015; Hessel et al., 2018), and StarCraft II (Vinyals et al., 2017). In general, deep reinforcement learning can be divided into value-based methods (Mnih et al., 2015; Lillicrap et al., 2015) and policy-based methods (Mnih et al., 2016; Schulman et al., 2015, 2017). In this paper, we consider a policy-based DRL algorithm to generate policies for the agent. Compared with value-based DRL algorithms, policy-based algorithms are effective in high-dimensional or continuous action spaces and can learn stochastic policies, which is beneficial when there is uncertainty in the environment. Typically, a policy-based algorithm uses a function approximator such as a neural network to represent the policy $\pi_\theta$, where the input is the current state and the output is the probability of each action (for a discrete action space) or an action distribution (for a continuous action space). After each trajectory $\tau$, the algorithm updates the parameters $\theta$ of the function approximator to maximize the cumulative reward using gradient ascent:
(4) $\nabla_\theta J(\theta) = \mathbb{E}_{\tau}\!\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]$
where $J(\theta)$ is the expected cumulative reward of the policy parameterized by $\theta$, $\pi_\theta(a_t \mid s_t)$ is the probability of action $a_t$ in state $s_t$, and $G_t$ is the cumulative reward gathered by the agent over the remaining trajectory in one episode. The general idea of Eq. 4 is to reduce the probability of sampling an action that leads to a lower return and to increase the probability of an action that leads to a higher return. One issue is that the cumulative reward usually has very high variance, which makes convergence slow. To address this issue, researchers proposed the actor-critic algorithm (Sutton and Barto, 2018), in which a critic function is introduced to approximate the state value function $V(s_t)$. By subtracting the value function $V(s_t)$, the expectation of the gradient remains unchanged while the variance is reduced dramatically:
(5) $\nabla_\theta J(\theta) = \mathbb{E}_{\tau}\!\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \big(G_t - V_\phi(s_t)\big)\right]$
where the function approximator $V_\phi$ is updated to approximate the value function of the current policy.
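The advantage-weighted gradient of Eq. 5 can be sketched numerically for one short episode; the rewards, critic values, and stand-in log-probability gradients below are all hypothetical, and a real implementation would backpropagate through the policy network instead:

```python
import numpy as np

gamma = 0.99
rewards = np.array([0.0, 0.0, 1.0])  # immediate rewards r_t (hypothetical)
values = np.array([0.3, 0.5, 0.8])   # critic estimates V(s_t) (hypothetical)

# Returns G_t computed backwards from the rewards.
returns = np.zeros_like(rewards)
g = 0.0
for t in reversed(range(len(rewards))):
    g = rewards[t] + gamma * g
    returns[t] = g

advantages = returns - values                 # (G_t - V(s_t)): lower-variance weight
grad_log_pi = np.ones((len(rewards), 4))      # stand-in for grad_theta log pi(a_t|s_t)
policy_grad = (advantages[:, None] * grad_log_pi).sum(axis=0)
print(policy_grad.shape)  # → (4,)
```

Because the critic baseline depends only on the state, subtracting it leaves the gradient's expectation unchanged while shrinking the magnitude of the per-step weights.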
3 Markov Decision Process Formulation
In this study, the UAS and intruders are modeled as point masses. The objective of the proposed conflict resolution algorithm is to find the shortest path for a UAS to its goal while avoiding conflicts with other UAS and static obstacles. Guiding the UAS to its destination is a discrete-time stochastic control process that can be formulated as a Markov Decision Process (MDP). In the following subsections, we introduce the MDP formulation by describing its state representation, action space, terminal states, and reward function. For this work, a deep reinforcement learning method, the proximal policy optimization algorithm developed in Schulman et al. (2017), is adopted; the rationale and details are also introduced in this section.
3.1 State representation
The agent gains knowledge of the environment from the state of the formulated MDP. The state should include all the information necessary for the agent to make optimal decisions. In this paper, we let $s_t$ denote the agent’s state at time $t$. All parameters are normalized before querying the neural network.
The state can be divided into two parts: one related to the agent itself and its goal, and one related to the environment, such as obstacles. The environment part (which represents moving/static obstacles in this paper) is the concatenation of the information of each obstacle. To speed up the training of the DRL algorithm, we transform the state following the robot-centric parameterization in Chen et al. (2017), where the agent is located at the origin and the x-axis points toward its goal.
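The robot-centric transformation described above can be sketched as follows; the function name and interface are illustrative, not the paper's implementation:

```python
import numpy as np

def to_robot_centric(agent_pos, goal_pos, point):
    """Transform a world-frame point into the agent-centric frame where the
    agent sits at the origin and the x-axis points toward the goal
    (after the parameterization in Chen et al., 2017; names illustrative)."""
    dx, dy = goal_pos[0] - agent_pos[0], goal_pos[1] - agent_pos[1]
    theta = np.arctan2(dy, dx)          # heading of the agent-to-goal line
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, s], [-s, c]])   # rotation by -theta
    return rot @ (np.asarray(point, float) - np.asarray(agent_pos, float))

# The goal itself maps onto the positive x-axis:
print(to_robot_centric((0, 0), (3, 4), (3, 4)))  # → [5. 0.]
```

Expressing all positions and velocities in this frame removes the dependence on the absolute placement of the scenario, which is what accelerates training.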
3.1.1 State representation for static obstacle avoidance
In the simulations of static obstacle avoidance,
(6) 
where the components are the agent’s distance to the goal and the agent’s velocity. The information of each obstacle is,
(7) 
where the components are the obstacle’s position along the y-axis and the agent’s distance to the obstacle center. The y-axis position is introduced to help the agent learn the globally optimal solution: for example, when approaching the obstacle, the agent turns a small angle counterclockwise if this value is positive, meaning the agent is on the right side of the line passing through the obstacle center and the goal. We note that these position and velocity quantities are vectors in the transformed coordinate system.
3.1.2 State representation for moving obstacle avoidance
As for moving obstacle avoidance, the position of the goal is added to the agent-related part of the state,
(8) 
The information of each intruder is represented by,
(9) 
where the components are the intruder’s position, the intruder’s velocity, the distance between the agent and the intruder, and the velocity of the agent relative to the intruder. We note that the position and velocity quantities are vectors in the transformed coordinate system.
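Assembling the per-intruder observation described above can be sketched as follows; the ordering and lack of normalization here are assumptions, since the paper's exact vector layout is not reproduced above:

```python
import numpy as np

def intruder_state(agent_pos, agent_vel, intr_pos, intr_vel):
    """Per-intruder observation sketch: intruder position and velocity,
    agent-intruder distance, and the agent's velocity relative to the
    intruder (ordering illustrative)."""
    agent_pos = np.asarray(agent_pos, float)
    agent_vel = np.asarray(agent_vel, float)
    intr_pos = np.asarray(intr_pos, float)
    intr_vel = np.asarray(intr_vel, float)
    dist = np.linalg.norm(intr_pos - agent_pos)  # separation distance
    rel_vel = agent_vel - intr_vel               # relative velocity components
    return np.concatenate([intr_pos, intr_vel, [dist], rel_vel])

obs = intruder_state((0, 0), (20, 0), (300, 400), (-20, 0))
print(obs)  # → [300. 400. -20.   0. 500.  40.   0.]
```

Including the relative velocity lets the policy distinguish head-on encounters from overtaking ones even when the instantaneous distance is the same.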
3.2 Action space
3.2.1 Action space for static obstacle avoidance and stochastic intruder avoidance
In the implementations of static obstacle avoidance and stochastic intruder avoidance, the action represents the change in the heading angle for the controlled UAS at each time step. The action space is set to,
(10) 
More specifically, at each time step, the agent selects an action and changes its heading angle accordingly:
(11) 
3.2.2 Action space for deterministic intruder avoidance
As for the deterministic intruder avoidance case, besides the heading angle change, the UAS is also controlled by a speed command, which is updated every second. During the interval, the two commands are fixed. Since a UAS is more flexible than manned aircraft and there is no available regulation on UAS speed changes, the UAS speed command can be chosen from the following set,
(12) 
More specifically, the agent selects an action and changes its speed at the next time step:
(13) 
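Applying the two commands can be sketched as follows; the turn bound, speed limits, and time discretization are illustrative assumptions, since the paper's numeric action ranges are not reproduced above:

```python
import numpy as np

MAX_TURN = np.radians(30.0)  # assumed bound on heading change per step
V_MIN, V_MAX = 10.0, 30.0    # assumed speed command bounds

def apply_action(heading, speed, d_heading, speed_cmd):
    """Update the heading by the commanded change and set the commanded speed,
    clipping both to the assumed action bounds."""
    heading = (heading + np.clip(d_heading, -MAX_TURN, MAX_TURN)) % (2 * np.pi)
    speed = float(np.clip(speed_cmd, V_MIN, V_MAX))
    return heading, speed

h, v = apply_action(0.0, 20.0, np.radians(10.0), 25.0)
```

In the static-obstacle and stochastic-intruder cases only the heading command is used, so `speed_cmd` would simply be held at the current speed.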
In real-world applications, however, making a sharp turn is usually not desirable for controlling a UAS. Thus a penalty on large heading or speed changes, reflecting power consumption, may be considered in future work.
3.3 Terminal state
In the current study, a conflict is defined as the event that the distance from the agent to an obstacle is less than a minimum separation distance. When the UAS operation is deterministic, a buffer zone is not necessary and the minimum separation distance is set to zero. In the implementations of static obstacle avoidance with uncertainty and moving obstacle avoidance, the UAS position uncertainty is taken into account. The separation requirement is determined according to the operational safety bound proposed in Hu et al. (2020). With a UAS speed of 20 and the other UAS operational performance parameters following the mean values shown in Table 3 of Hu et al. (2020), the minimum separation distance is 75 for static obstacle avoidance and 150 for moving obstacle avoidance.
3.3.1 Terminal state for static obstacle avoidance
The terminal state for static obstacle avoidance includes two different types of states:

Conflict state: the distance between the agent and obstacle is less than the minimum separation distance.

Goal state: the agent is within 400 from the destination.
3.3.2 Terminal state for moving obstacle avoidance
The episode terminates only when the agent is within 200 of the destination, which indicates the agent has accomplished the navigation task.
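The terminal-state checks for the static-obstacle case above can be sketched as follows; the function treats obstacle centers as the reference points, which is a simplification (the paper checks the distance to the obstacle itself), and the radii are taken from the text:

```python
import numpy as np

MIN_SEPARATION = 75.0  # minimum separation distance (units as in the paper)
GOAL_RADIUS = 400.0    # goal-reached radius for the static-obstacle case

def terminal_state(agent_pos, obstacle_centers, goal_pos):
    """Return 'conflict', 'goal', or None for the static-obstacle MDP sketch."""
    agent = np.asarray(agent_pos, float)
    for center in obstacle_centers:  # conflict: too close to any obstacle
        if np.linalg.norm(agent - np.asarray(center, float)) < MIN_SEPARATION:
            return "conflict"
    if np.linalg.norm(agent - np.asarray(goal_pos, float)) < GOAL_RADIUS:
        return "goal"
    return None
```

For the moving-obstacle case, the conflict branch would be dropped and only the goal check (with its radius of 200) would terminate the episode.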
3.4 Reward function
To guide the agent to reach its goal and avoid conflicts, the reward function is designed to reward accomplishments while penalizing conflicts and failure to move towards the goal.
3.4.1 Reward function for static obstacle avoidance
In the simulations of static obstacle avoidance, the reward function is expressed in the following form, where we set a reward for the goal state and a penalty for the conflict state. The linear term of the reward function guides the UAS toward the destination, and the constant per-step penalty encourages the shortest path.
(14) 
3.4.2 Reward function for moving obstacle avoidance
In the simulations of intruder avoidance, the reward function is expressed in the following form, similar to Eq. 14.
(15) 
In this reward function, the coefficients of the different cost terms should be balanced to help the agent learn conflict resolution and achieve the goal simultaneously. When the ownship is close to an intruder, the inverse tangent term of the reward function is activated to keep the distance in an appropriate range. With the coefficients set as in the stochastic intruder case in Section 4.2.1, the relation between the distance and the inverse tangent term is shown in Fig. 1. The agent starts to receive a penalty when the distance approaches 250. This reward setting helps the agent avoid conflicts with intruders at a relatively early stage. We note that the coefficients can be tuned to fit different separation standards.
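The distance-based inverse-tangent penalty discussed above can be sketched as follows; the functional form and the `scale`, `onset`, and `weight` parameters are assumptions chosen to match the described behavior (penalty activating near 250 and saturating when well inside), not the paper's exact coefficients from Table 1:

```python
import numpy as np

def proximity_penalty(dist, scale=0.1, onset=250.0, weight=12.0):
    """Smooth penalty that activates as the agent-intruder distance shrinks
    below `onset` and saturates at -weight (parameters illustrative)."""
    return -weight * (np.arctan(scale * (onset - dist)) / np.pi + 0.5)

print(proximity_penalty(500.0) > proximity_penalty(150.0))  # → True
```

Because the penalty ramps up smoothly rather than switching on at the separation boundary, the agent is pushed to maneuver well before a conflict becomes imminent.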
3.5 Proximal policy optimization algorithm
One drawback of the policy gradient method in Eq. 5 (Sutton and Barto, 2018) is that one bad update can have large destructive effects and hinder the final performance of the model. The Proximal Policy Optimization (PPO) algorithm (Schulman et al., 2017) was proposed to solve this problem by introducing a policy changing ratio $r_t(\theta)$ describing the change from the previous policy to the new policy at time step $t$:
(16) $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$
where $\theta_{\text{old}}$ and $\theta$ denote the network weights before and after the update.
By restricting the policy changing ratio to the range $[1-\epsilon,\, 1+\epsilon]$, with $\epsilon$ set to 0.2 in this paper, the PPO loss functions for the actor and critic networks are formulated as follows:
(17) $L_t^{\text{CLIP}}(\theta) = \min\!\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)$
(18) $L^{\text{actor}}(\theta) = \hat{\mathbb{E}}_t\!\big[L_t^{\text{CLIP}}(\theta) + c\, S[\pi_\theta](s_t)\big]$
(19) $L^{\text{critic}}(\phi) = \hat{\mathbb{E}}_t\!\big[\big(V_\phi(s_t) - V_t^{\text{target}}\big)^2\big]$
where $\epsilon$ is a hyperparameter that bounds the policy changing ratio $r_t(\theta)$. In Eq. 17 and Eq. 18, the advantage function $\hat{A}_t$ measures whether the action is better or worse than the policy’s default behavior. Also, the policy entropy $S[\pi_\theta]$ is added to the actor loss function to encourage exploration by discouraging premature convergence to suboptimal deterministic policies. In the implementation, we use a two-layer multilayer perceptron (MLP) with 64 hidden units in each layer for both the actor and critic networks. The tanh function is chosen as the activation function for the hidden layers.
4 Numerical experiments
Numerical experiments are presented in this section to evaluate the proposed conflict resolution model in continuous action space. There are two categories of collision avoidance: static obstacle avoidance and moving obstacle avoidance. For static obstacle avoidance, we investigate the performance on different obstacle shapes and sizes, and under uncertainty in UAS operation. We also study the environment with stochastic intruders under heading angle control, and the environment with deterministic intruders under heading angle and speed control. In all the simulations, one pixel in the figures represents 10 in the real world. The PPO algorithm is implemented using OpenAI Baselines (Dhariwal et al., 2017). The deep reinforcement learning model for each case is trained for 30 million time steps.
4.1 Static obstacle avoidance
The simulation environment is free-flight airspace of 4 length and 4 width. The UAS speed is set to 20. During the training process, the starting position of the aircraft is randomly sampled from the four edges of the airspace boundary and, for simplification, is restricted to integer coordinates. The goal is located at (2500, 2500). In this experiment, we study two types of static obstacles, circular and rectangular, as shown in Fig. 2. The plus sign represents the goal position and the blue region represents the no-passing area. The mean episode reward is shown in Fig. 3: the episode reward grows as the policy converges to the optimal solution. To visualize the performance of the proposed conflict resolution model, we generate a testing set of 160 trajectories starting from different origins. The testing origins are chosen every 100 along each edge of the airspace boundary. Heading angles along the 160 trajectories are collected and plotted every 15 time steps.
4.1.1 Circular obstacle avoidance
The static obstacle set up for this case study is shown in Fig. 2 and the testing result of 160 trajectories starting from different locations on the airspace boundary is shown in Fig. 4. The black arrow represents the agent’s selected heading direction at each position.
From Fig. 4, it can be seen that the agent selects heading angles pointing toward the goal while tending to avoid the no-passing region. The agent also chooses the optimal behavior according to the relative positions of the agent, obstacle, and goal. For example, near the lower-left obstacle, if the agent’s position is above the line passing through the obstacle center and the goal, the UAS takes a small left turn to avoid the obstacle; otherwise, the UAS bypasses the lower semicircle. Among the 160 generated trajectories in Fig. 4, there are no failures.
4.1.2 Rectangular obstacle avoidance
The only difference between this rectangular obstacle case and the previous circular obstacle case in the simulation is the condition used to check whether the agent is in a conflict state. The environment for this case is shown in Fig. 2 and the testing results for 160 trajectories are shown in Fig. 5.
4.1.3 Circular obstacle avoidance with uncertainty
This case is studied to evaluate how the proposed conflict resolution model handles uncertainty. UAS operation is stochastic, and randomness exists in almost every aspect of UTM. Inclusion of uncertainty quantification of aircraft operation is critical for future safety analysis (e.g., deviation from a trajectory plan due to wind, true speed, or positioning error) (Hu et al., 2020; Liu and Goebel, 2018; Hu and Liu, 2020; Pang et al., 2019b, a, 2021). Thus, to model the uncertainties in UAS operation, we form a circle whose center is the predicted UAS position without uncertainty and whose radius is the separation requirement, 75. With 90% probability, the UAS position is located exactly at the center of the circle; with 10% probability, the UAS position is located at a point on the circle, at an angle drawn from a uniform distribution. Such uncertainty is considered when calculating the agent’s position at the next time step after taking an action.
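The position-uncertainty model just described can be sketched as follows (function and variable names are illustrative; the radius is the 75 separation requirement from the text):

```python
import numpy as np

RADIUS = 75.0  # separation requirement used as the perturbation radius

def noisy_position(predicted_pos, rng):
    """Sample the realized UAS position for one time step: 90% at the
    predicted position, 10% on the circle of radius 75 around it."""
    pos = np.asarray(predicted_pos, float)
    if rng.random() < 0.9:
        return pos
    angle = rng.uniform(0.0, 2.0 * np.pi)  # uniform angle on the circle
    return pos + RADIUS * np.array([np.cos(angle), np.sin(angle)])

rng = np.random.default_rng(0)
samples = np.array([noisy_position((0.0, 0.0), rng) for _ in range(2000)])
```

Over many samples, roughly 10% of the realized positions lie exactly on the circle of radius 75, matching the stated mixture.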
The testing results are shown in Fig. 6. In one subfigure of Fig. 6, the agent’s position with uncertainty is plotted, while in the other, the uncertainty of 75 is added to the obstacle, indicated by the red circle. As expected, the UAS tries to stay 75 away from the obstacles, so either method works when simulating collision avoidance with uncertainty. One failure happens near the upper-left obstacle and three failures happen near the lower-left obstacle. What the failures have in common is that the agent’s origin lies approximately on the line passing through the obstacle center and the goal. A possible reason is that the policy network gets stuck at a local optimum, since the two neighboring trajectories behave well.
4.2 Moving obstacle avoidance
For the moving intruder avoidance cases, the speed of the intruders is set to 20. There are two cases: the stochastic intruder case with heading angle control, and the deterministic intruder case with heading angle and speed control. In the stochastic intruder case, the scenario changes every episode: each intruder has a different origin and heading angle in each episode, but within one episode, the intruders keep fixed heading angles. The reward coefficients are listed in Table 1. The mean episode reward is shown in Fig. 7. To visualize the performance of the proposed conflict resolution model, we generate a testing set of 500 episodes following the training setup for each case. The minimum distance from the agent to the three intruders within each episode is also collected.
Table 1: Reward coefficients.
Case                             | Coefficient values
Stochastic-intruder avoidance    | 0.007, 0.15, 17, 0.1, 12
Deterministic-intruder avoidance | 0.22, 0.05, 3, 0.1, 12
4.2.1 Stochastic-intruder avoidance with control of heading angle
The origins and heading angles of the three intruders are assumed to follow uniform distributions, with the ranges shown in Table 2. The origin coordinate of the agent is uniformly sampled from . The goal is located at . The agent moves at 20. The intruders are designed to cross the line connecting the UAS origin and the goal.
Intruder  1  2  3 

Origin coordinate range  
Heading angle range 
A demonstration of one scenario and the UAS performance is shown in Fig. 8. Information related to the ownship is plotted in blue, and black represents the intruders. The plus sign denotes the origin and the star sign denotes the goal of the agent. The circle centers are the aircraft positions, plotted every 5 time steps and labeled with the time step every 10 time steps; the circle radius represents the aircraft speed. In this scenario, the agent learns to go around the left side to avoid the three intruders. The minimum distance from the agent to the three intruders within each episode is plotted as blue dots in Fig. 9, and the orange line is the separation requirement of 150. All the blue dots are above the orange line, so there are no failure cases in Fig. 9: the model succeeds in avoiding the three intruders in 500 different testing scenarios.
4.2.2 Deterministic-intruder avoidance with control of heading angle and speed
We also investigate the possibility of using the proposed reward function to generate both heading angle change commands and speed commands. This capability is valuable when changing the heading angle alone cannot efficiently resolve conflicts. Moreover, with the extra choice of changing speed, the UAS may have less influence on the flight plans of other aircraft and on airspace capacity. However, due to the larger action space, the training process requires more effort.
The origins and heading angles of the three intruders are listed in Table 3. The origin coordinate of the agent is and the goal is located at . Intruder 1 is designed to test whether the ownship can fly at a suitable speed, and the other two intruders are set to test the performance of the heading angle change command.
Intruder  1  2  3 

Origin coordinate  
Heading angle 
Similar to the result in Fig. 8, a demonstration of the scenario and the UAS performance is shown in Fig. 10. Information related to the ownship is plotted in blue, and black represents the intruders. The plus sign denotes the origin and the star sign denotes the goal of the agent. The circle centers are the aircraft positions, plotted every 3 time steps and labeled with the time step every 6 time steps; the circle radius represents the aircraft speed. It can be seen that the agent reduces its speed from time step 12 to time step 24 to keep a safe separation from intruder 1. The agent also goes around the right side to avoid the approaching intruder 2, and after resolving the possible conflict with intruder 3 at time step 66, it flies towards the goal to save time. The minimum distance is plotted as blue dots in Fig. 11. All the blue dots are above the orange line, which represents the separation requirement of 150. Thus, there are no failure cases in Fig. 11, indicating that the model succeeds in avoiding the three intruders under heading angle and speed control.
5 Conclusion
In this work, we present a deep reinforcement learning method that allows a UAS to navigate successfully in urban airspace with a continuous action space. Both static and moving obstacles are simulated, and the trained UAS is capable of achieving its goal and performing conflict resolution simultaneously. We also investigate the performance on different static obstacle shapes and sizes, and under uncertainty in UAS operation. Stochastic intruders are considered in the training process of the moving obstacle experiments. Moreover, we investigate the possibility of using the proposed reward function to resolve conflicts through both heading angle and speed. Results show that the proposed model can provide accurate and robust guidance and resolve conflicts with a success rate of over 99%. To make the proposed algorithm more practical and efficient in the real world, in future work we plan to model some of the intruders as agents, allowing cooperation among multiple aircraft.
Acknowledgments
The research reported in this paper was supported by funds from NASA University Leadership Initiative program (Contract No. NNX17AJ86A, PI: Yongming Liu, Technical Officer: Anupa Bajwa). The support is gratefully acknowledged.
References
 Convex programming approach to powered descent guidance for Mars landing. Journal of Guidance, Control, and Dynamics 30 (5), pp. 1353–1366.
 Generation of collision-free trajectories for a quadrocopter fleet: a sequential convex programming approach. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1917–1922.
 Blueprint for the sky: the roadmap for the safe integration of autonomous aircraft. Airbus A³.
 A Markovian decision process. Indiana Univ. Math. J. 6, pp. 679–684.
 A deep multi-agent reinforcement learning approach to autonomous separation assurance. arXiv preprint arXiv:2003.08353.
 Decentralized non-communicating multiagent collision avoidance with deep reinforcement learning. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 285–292.
 Decomposition methods for optimized collision avoidance with multiple threats. Journal of Guidance, Control, and Dynamics 35 (2), pp. 398–405.

 Path planning based on genetic algorithms and the Monte Carlo method to avoid aerial vehicle collisions under uncertainties. In 2011 IEEE International Conference on Robotics and Automation (ICRA), pp. 4429–4434.
 Aircraft conflict resolution by genetic algorithm and B-spline approximation. In EIWAC 2010, 2nd ENRI International Workshop on ATM/CNS, pp. 71–78.
 OpenAI baselines. GitHub. Note: https://github.com/openai/baselines Cited by: §4.
 Discrete approximations to optimal trajectories using direct transcription and nmiscar programming. Journal of Guidance, Control, and Dynamics 15 (4), pp. 994–1002. External Links: Document Cited by: §1.
 Handbook of markov decision processes: methods and applications. Vol. 40, Springer Science & Business Media. External Links: Document Cited by: §2.1.
 Resolution of conflicts involving many aircraft via semidefinite programming. Journal of Guidance, Control, and Dynamics 24 (1), pp. 79–86. External Links: Document Cited by: §1.

Rainbow: combining improvements in deep reinforcement learning.
In
ThirtySecond AAAI Conference on Artificial Intelligence
, Cited by: §2.2.  Dynamic programming and markov processes. External Links: Document Cited by: §2.1.
 Probabilistic riskbased operational safety bound for rotarywing unmanned aircraft systems traffic management. Journal of Aerospace Information Systems 17 (3), pp. 171–181. Cited by: §3.3, §4.1.3.
 UAS conflict resolution integrating a riskbased operational safety bound as airspace reservation with reinforcement learning. In AIAA Scitech 2020 Forum, pp. 1372. Cited by: §4.1.3.
 Reinforcement learning for autonomous aircraft avoidance. In 2019 Workshop on Research, Education and Development of Unmanned Aerial Systems (RED UAS), pp. 126–131. Cited by: §1.
 Nextgeneration airborne collision avoidance system. Technical report Massachusetts Institute of TechnologyLincoln Laboratory Lexington United States. Cited by: §1.
 Xavier: a robot navigation architecture based on partially observable markov decision process models. Artificial Intelligence Based Mobile Robotics: Case Studies of Successful Robot Systems, pp. 91–122. Cited by: §2.1.
 Unmanned aircraft system traffic management (utm) concept of operations. Cited by: §1.
 Optimizing collision avoidance in dense airspace using deep reinforcement learning. arXiv preprint arXiv:1912.10146. Cited by: §1.
 Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §1, §2.2.
 Information fusion for national airspace system prognostics. In PHM Society Conference, Vol. 10. Cited by: §4.1.3.
 A saliencybased reinforcement learning approach for a uav to avoid flying obstacles. Robotics and Autonomous Systems 100, pp. 108–118. Cited by: §1.
 An introduction to acas xu and the challenges ahead. In 2016 IEEE/AIAA 35th Digital Avionics Systems Conference (DASC), pp. 1–9. Cited by: §1.
 Jump linear systems in automatic control. M. Dekker New York. Cited by: §2.1.
 Mixedinteger quadratic program trajectory generation for heterogeneous quadrotor teams. In Robotics and Automation (ICRA), 2012 IEEE International Conference on, pp. 477–483. External Links: Document Cited by: §1.

Asynchronous methods for deep reinforcement learning.
In
International conference on machine learning
, pp. 1928–1937. Cited by: §1, §2.2.  Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. Cited by: §1.
 Humanlevel control through deep reinforcement learning. Nature 518 (7540), pp. 529. Cited by: §2.2.
 Model predictive control of swarms of spacecraft using sequential convex programming. Journal of Guidance, Control, and Dynamics 37 (6), pp. 1725–1740. External Links: Document Cited by: §1.
 ACAS xu: integrated collision avoidance and detect and avoid capability for uas. In 2019 IEEE/AIAA 38th Digital Avionics Systems Conference (DASC), pp. 1–10. Cited by: §1.
 Conflict resolution problems for air traffic management systems solved with mixed integer programming. IEEE transactions on intelligent transportation systems 3 (1), pp. 3–11. External Links: Document Cited by: §1.
 Aircraft trajectory prediction using lstm neural network with embedded convolutional layer. In Proceedings of the Annual Conference of the PHM Society, Vol. 11. Cited by: §4.1.3.

A recurrent neural network approach for aircraft trajectory prediction with weather features from sherlock
. In AIAA Aviation 2019 Forum, pp. 3413. Cited by: §4.1.3.  Datadriven trajectory prediction with weather uncertainties: a bayesian deep learning approach. Transportation Research Part C: Emerging Technologies 130, pp. 103326. Cited by: §4.1.3.
 A machine learning approach for conflict resolution in dense traffic scenarios with uncertainties. Cited by: §1.
 Particle swarm optimization applied to space trajectories. Journal of Guidance, Control, and Dynamics 33 (5), pp. 1429–1441. External Links: Document Cited by: §1.
 Dynamic optimization strategies for threedimensional conflict resolution of multiple aircraft. Journal of guidance, control, and dynamics 27 (4), pp. 586–594. External Links: Document Cited by: §1.
 Aircraft trajectory planning with collision avoidance using mixed integer linear programming. In American Control Conference, 2002. Proceedings of the 2002, Vol. 3, pp. 1936–1941. External Links: Document Cited by: §1.
 Mixed integer programming for multivehicle path planning. In Control Conference (ECC), 2001 European, pp. 2603–2608. External Links: Document Cited by: §1.
 Trust region policy optimization. In International conference on machine learning, pp. 1889–1897. Cited by: §2.2.
 Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §1, §1, §2.2, §3.5, §3.
 Mastering the game of go without human knowledge. Nature 550 (7676), pp. 354. Cited by: §2.2.
 A reinforcement learningbased decentralized method of avoiding multiuav collision in 3d airspace. In Proceedings of the 2019 3rd International Conference on Computer Science and Artificial Intelligence, pp. 77–82. Cited by: §1.
 Reinforcement learning: an introduction. MIT press. Cited by: §2.2, §2.2, §3.5.
 Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pp. 1057–1063. Cited by: §1.
 Probabilistic robotics. Communications of the ACM 45 (3), pp. 52–57. External Links: Document Cited by: §2.1.
 Deep reinforcement learning with double qlearning. In Thirtieth AAAI conference on artificial intelligence, Cited by: §1.
 A mixed integer program for flightlevel assignment and speed control for conflict resolution. In Decision and Control, 2009 held jointly with the 2009 28th Chinese Control Conference. CDC/CCC 2009. Proceedings of the 48th IEEE Conference on, pp. 5219–5226. External Links: Document Cited by: §1.
 Starcraft ii: a new challenge for reinforcement learning. arXiv preprint arXiv:1708.04782. Cited by: §2.2.
 Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581. Cited by: §1.
 A survey of applications of markov decision processes. Journal of the operational research society 44 (11), pp. 1073–1096. External Links: Document Cited by: §2.1.
 UAV collision avoidance policy optimization with deep reinforcement learning. Cited by: §1.
 Realtime obstacle avoidance with deep reinforcement learning threedimensional autonomous obstacle avoidance for uav. In Proceedings of the 2019 International Conference on Robotics, Intelligent Control and Artificial Intelligence, pp. 324–329. Cited by: §1.
 Autonomous ondemand free flight operations in urban air mobility using monte carlo tree search. In 8th International Conference on Research in Air Transportation (ICRAT), Cited by: §1.
 Scalable multiagent computational guidance with separation assurance for autonomous urban air mobility. Journal of Guidance, Control, and Dynamics 43 (8), pp. 1473–1486. External Links: Document Cited by: §1.