I Introduction
Pursuit/evasion games pit two opponents against each other such that the pursuer must capture the evader. Within the aerospace community, pursuit/evasion of aircraft has long been of interest and is seeing a resurgence due to the growing capability and acceptance of autonomous unmanned aircraft. Pursuit/evasion games are also interesting because they pose scalability challenges, especially for UAV swarm applications. Problem formulations which lead to efficient and effective pursuit/evasion for 1 versus 1 (1v1) contests do not always extend efficiently to larger contests with multiple members per team (e.g., 2v2, 10v10). For problem formulations and algorithms that can support larger teams, it may be possible to solve the problem offline, but solving it online may be exponentially harder.
In this paper, we propose a pursuit/evasion problem formulation based on Markov Decision Processes (MDPs) and use our recently proposed algorithm [bertram2019] to efficiently solve the problem even for large teams. The algorithm seamlessly switches between pursuit and evasion while simultaneously avoiding collisions with other aircraft in the same team and with the ground. The algorithm is adaptable to multiple aircraft types through the use of forward projection of the aircraft dynamics, and a pseudo-6DOF model is presented.
Our main contributions for this work are:

Extension of the 2D algorithm with discrete state space in [bertram2019] to a continuous 3D state space;

Addition of a forward projection module that allows the algorithm to support any arbitrary aircraft type;

Demonstration of efficient algorithm performance that scales to large team sizes.
We additionally develop a 3D visualization tool to evaluate the algorithm and to provide insight to reviewers and readers on the complexity of the problem.
In Section II we identify and discuss related work. In Section III we briefly provide background on Markov Decision Processes. In Section IV we describe the method we use, including the pseudo-6DOF model and aircraft dynamics, as well as the details of the Markov Decision Process problem formulation used in this approach. Section V describes the experimental setup to evaluate the pursuit/evasion contests, including the definition of two metrics we use to evaluate the behavior. Section VI describes the results of our experiments demonstrating the efficient performance and consistent behavior of the algorithm. Section VI also provides links to videos showing examples of varying sized teams competing against each other.
II Related Work
There is extensive work from many communities that addresses different approaches to pursuit/evasion. We describe several approaches and discuss how they relate to the Markov Decision Process approach used in this paper.
Eklund et al. [eklund2005implementing] described a nonlinear model predictive control (NMPC) approach to a pursuit/evasion problem using a set of cost functions with repulsive and attractive natures to shape the behavior of the pursuer. An iterative optimization method was used to produce a solution at each time step using simplified aircraft dynamics. Multiple matrices in the NMPC formulation required tuning to obtain good behavior. It is worth noting that the cost functions used in their work are analogous to reward functions used for Markov Decision Processes.
Schopferer and Pfeifer [schopferer2015performance] proposed a method to perform flight planning in the presence of a uniform wind field, with the aircraft motion modeled with trochoids. The three dimensional flight path is constructed by superimposing a horizontal and vertical solution to obtain an approximate 3D path. A probabilistic roadmap planner is used to generate global plans.
Vector field approaches have also been used for pursuit/evasion problems. Goncalves et al. [gonccalves2010vector] described a vector field approach for convergence, circulation, and correction around a closed loop pattern. Lawrence et al. [lawrence2008lyapunov] presented a vector field approach for circular (or warped circular) patterns, and also described a switching mechanism to handle waypoint following or arbitrary paths. Stable tracking of the vector field is explored using Lyapunov techniques. Vector fields can be viewed as similar in nature to the optimal policy that is generated by solving a Markov Decision Process. Where vector fields are generally applied over a continuous state space, MDP optimal policies normally describe actions that are intended to cause a transition from the current discrete state to a desired next discrete state.
Within the robotics and computational geometry community, pursuit/evasion is often considered in a different context. The pursuers attempt to search through an environment to observe the evaders, similar to security guards searching through a museum for a potential intruder. Often in these problem formulations, the goal is to identify the minimum number of pursuers needed to guarantee that an evader present within the environment will be detected; the focus is not on tracking or chasing the evader as in the target problem of this paper. However, these works are instructive, as the algorithm used in this paper is built on the recognition that an MDP can be represented as a graph. Examples of this type of pursuit/evasion problem are [guibas1997visibility, lavalle1997finding, kehagias2009graph]. An example of a graph-based pursuit/evasion problem applied to graphs with infinitely many nodes is [lehner2016pursuit], where the problem is described as a cops-and-robbers problem and a winning strategy is defined as preventing the robber from visiting a node in the infinite graph infinitely many times. This allows strategies which either catch the robber or force the robber to flee ‘to infinity’. Markov Decision Processes are normally viewed as a tree of sequential actions, but can also be understood as a graph. As most MDP problems have a discrete state space, this graph would normally also have a finite number of nodes. Our method provides a way to support MDP problem formulations with continuous state spaces, and the corresponding graph would then have an infinite number of nodes. Like the cops-and-robbers problem above, forcing an adversary to flee would be an acceptable strategy for our aircraft pursuit/evasion problem as well.
Jia et al. [shengde2014continuous] proposed a continuous-time Markov Decision Process (CTMDP) approach where variable time steps are allowed to be taken within a discretized state space and where the transition function is defined instead as a transition rate function, allowing the possible resulting state transitions to be predicted with varied time steps. The large state space is simplified by classifying the states into neutral, advantaged, disadvantaged, and mutually disadvantaged categories, and a Bayesian method is used to determine the transition probabilities. Pursuit/evasion within a 2D grid world environment is considered.
Within the optimal control community, one area of related work is Differential Dynamic Programming (DDP), which uses dynamic programming to iteratively improve a locally optimal control policy. Sun et al. [sun2018min] used DDP to solve an adversarial aircraft pursuit/evasion problem, terming their approach game-theoretic DDP (GT-DDP) by combining DDP with a min-max problem formulation. Differential Dynamic Programming and Markov Decision Processes have much in common and both stem from Bellman’s original work on dynamic programming [bellman1957dynamic]. Where the optimal control field focuses on the Hamilton-Jacobi-Bellman (HJB) equation and differentiable dynamics, MDPs often generalize the dynamics into a (deterministic or stochastic) transition function which captures uncertainty about the environment through probabilities (similar to those used for Markov chains). Comparing [sun2018min] to this paper’s work, GT-DDP does have a much richer capability to incorporate system dynamics, but this comes at the expense of additional computation time and a need for the iterative algorithm to converge.
The most relevant paper to this work is [mcgrew2010air], which describes a Markov Decision Process based pursuit/evasion problem for aircraft using approximate dynamic programming (ADP). A state space was formed from a set of features which minimized mean squared error using a forward-backward search. Trajectory sampling was used to obtain training data that would be likely to have value during training. Reward shaping was used to guide the exploration toward the desired behavior in the form of a scoring function heuristic developed by an expert. Rollout was used to extract a refined policy from the approximation computed via ADP and was accelerated with a neural net. The dynamics model used for the airplane is a Dubins airplane model without any vertical components or altitude modeled.
There are some subtle differences between this paper and the work in [mcgrew2010air]. [mcgrew2010air] is a good example of using a variety of practical techniques to deal with the intractability of large MDP state spaces, whereas this work explicitly uses a state space designed to be intractable by traditional MDP methods: a continuous state space, which results in an MDP with an infinite number of states, in order to demonstrate scaling to continuous state spaces. [mcgrew2010air] uses a 2D aircraft model, whereas this paper uses a 3D pseudo-6DOF model to demonstrate scaling to a continuous 3D state space and to demonstrate full maneuvering by the aircraft (e.g., loops, rolls, spirals). In this paper, no reward shaping is required to speed up or aid convergence, as the underlying MDP is solved directly without relying on typical methods used for approximate dynamic programming. Finally, [mcgrew2010air] explores 1v1 pursuit/evasion, whereas this paper demonstrates scaling to 10v10 teams.
Also of note are [park2016differential] and [zhang2018research]. Park et al. [park2016differential] used a higher fidelity 3D model and a min-max approach over a sliding window to demonstrate 1 vs 1 pursuit/evasion, and while the behavior in simulation appears promising, the real-time performance of the algorithm is not reported. In [zhang2018research], a reinforcement learning approach is taken using deep Q-learning with a 2-layer multilayer perceptron as the function approximator, and with a modified epsilon-greedy exploration strategy where a heuristic function is used in place of a random action in order to avoid wasteful actions during exploration. Performance is examined in 2D.
III Background
Markov Decision Processes (MDPs) are a framework for sequential decision making with broad applications to finance, robotics, operations research, and many other domains [suttonbarto]. MDPs are formulated as the tuple $(s_t, a_t, r_t, T)$, where $s_t \in S$ is the state at a given time $t$, $a_t \in A$ is the action taken by the agent at time $t$ as a result of the decision process, $r_t$ is the reward received by the agent as a result of taking the action $a_t$ from $s_t$ and arriving at $s_{t+1}$, and $T(s_t, a_t, s_{t+1})$ is a transition function that describes the dynamics of the environment and captures the probability $p(s_{t+1} \mid s_t, a_t)$ of transitioning to a state $s_{t+1}$ given the action $a_t$ taken from state $s_t$.
A policy $\pi$ can be defined that maps each state $s \in S$ to an action $a \in A$. From a given policy $\pi$, a value function $V^{\pi}(s)$ can be computed that gives the expected return that will be obtained within the environment by following the policy $\pi$. The value function can be expressed in the iterative Bellman equation as follows, where $R(s, a, s')$ represents the immediate reward collected by taking an action $a$ which leads to a next state $s'$ with a value of $V(s')$, and $\gamma \in [0, 1)$ is the discount factor. Thus the value function for any state is the current reward plus the discounted future reward that can be obtained by taking the best action from the current state, and is an expectation of the future reward that can be obtained from the current state.
$$V(s) = \max_{a} \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V(s') \right] \tag{1}$$
The solution of an MDP is termed the optimal policy $\pi^{*}$, which defines the optimal action $a^{*}$ that can be taken from each state $s$ to maximize the expected return. From this optimal policy $\pi^{*}$, the optimal value function $V^{*}(s)$ can be computed, which describes the maximum expected value that can be obtained from each state $s$. Conversely, from the optimal value function $V^{*}(s)$, the optimal policy $\pi^{*}$ can easily be extracted.
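To make the Bellman backup concrete, the following minimal sketch runs classical value iteration on a tiny, hypothetical 4-state chain. The chain, its rewards, the discount factor of 0.9, and the function name `value_iteration` are all assumptions chosen for illustration; this is the textbook method for small discrete MDPs, not the algorithm used later in this paper.

```python
import numpy as np

def value_iteration(T, R, gamma=0.9, tol=1e-8, max_iter=10_000):
    """Solve a small finite MDP by repeated Bellman backups.

    T[s, a, s2] is the transition probability and R[s, a, s2] the
    immediate reward; both are dense arrays of shape (S, A, S).
    """
    V = np.zeros(T.shape[0])
    for _ in range(max_iter):
        # Q[s, a] = sum over s2 of T[s, a, s2] * (R[s, a, s2] + gamma * V[s2])
        Q = np.einsum("ijk,ijk->ij", T, R + gamma * V[None, None, :])
        V_new = Q.max(axis=1)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    # Extract the greedy policy from the last set of Q-values.
    policy = Q.argmax(axis=1)
    return V, policy

# Hypothetical chain: action 0 moves left, action 1 moves right,
# and entering the right-most state (index 3) yields a reward of 1.
S, A = 4, 2
T = np.zeros((S, A, S))
R = np.zeros((S, A, S))
for s in range(S):
    T[s, 0, max(s - 1, 0)] = 1.0
    T[s, 1, min(s + 1, S - 1)] = 1.0
R[:, :, 3] = 1.0

V, policy = value_iteration(T, R)
```

In this toy problem the greedy policy moves right from every state, and states closer to the rewarding state have higher value.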
IV Method
We use the algorithm described in [bertram2019], which demonstrated collision avoidance in a 2D environment, as the underlying guidance and collision avoidance algorithm. The algorithm is extremely efficient, and that paper demonstrated good performance on a discretized state space. We extend the method to a continuous state space and to a 3D environment to demonstrate scaling to the higher dimensional space. Scaling is further highlighted by showing large teams performing pursuit/evasion together. Finally, we introduce a pseudo-6DOF model allowing the aircraft to roll, pitch, and perform complex aerial maneuvers, which serves to further demonstrate the power of this approach.
IV-A Dynamic Model
The aircraft kinematic model is a pseudo 6 degree of freedom (pseudo-6DOF) model which approximates fixed-wing aircraft motion given inputs similar to stick and throttle inputs. The model provides a way to study the algorithm’s behavior without requiring full aerodynamics to be modelled. The algorithm needs this pseudo-6DOF model to provide “forward prediction”. This means that from a given current state, the model must be able to calculate the future state that results from applying a given set of possible control actions for a fixed number of time steps. Any model which satisfies this requirement can be integrated with the algorithm, including full-fidelity 6DOF fixed-wing models, helicopters, quadrotors, and models with underlying autopilot controllers.
The model used is an extension of the pseudo-6DOF formulation in [park2016differential] and also incorporates a few additional terms from the model in [huynh1987numerical]. It should be considered a simplified version of the model in [huynh1987numerical].

$n_x$: Throttle acceleration directed out the nose of the aircraft, in g’s

$V$: Airspeed in meters/second

$\gamma$: Flight path angle in radians

$(x, y, z)$: Position in NED coordinates in meters, where altitude $h = -z$

$\phi$: Roll angle in radians

$\psi$: Horizontal azimuth angle in radians

$\alpha$: Angle of attack in radians with respect to the flight path vector
The inputs to the model are: (1) the thrust $n_x$, (2) the rate of change of angle of attack $\dot{\alpha}$, and (3) the rate of change of the roll angle $\dot{\phi}$.
The equations of motion for the aircraft are:
$$\dot{V} = g \left( n_x \cos\alpha - \sin\gamma \right) \tag{2}$$
$$\dot{\gamma} = \frac{g}{V} \left( n_f \cos\phi - \cos\gamma \right) \tag{3}$$
$$\dot{\psi} = \frac{g \, n_f \sin\phi}{V \cos\gamma} \tag{4}$$
where the acceleration exerted out the top of the aircraft in g’s is defined as:
$$n_f = n_x \sin\alpha + L \tag{5}$$
with a constant lift acceleration $L$. Here, 1 “g” is a unit of acceleration equivalent to approximately $9.81\ \mathrm{m/s^2}$. $L$ was chosen to provide some amount of lift while in flight to partially counteract gravity and provide a stable flight condition with a low positive angle of attack in the pseudo-6DOF model. For a true aerodynamic model, this lift varies with the velocity (Mach number), but this level of detail is omitted in our simplified pseudo-6DOF model.
The kinematic equations are:
$$\dot{x} = V \cos\gamma \cos\psi \tag{6}$$
$$\dot{y} = V \cos\gamma \sin\psi \tag{7}$$
$$\dot{z} = -V \sin\gamma \tag{8}$$
While this model is not aerodynamically comprehensive, it is sufficient to describe aircraft motion suitable for examining the algorithm behavior without loss of generality. Again, our algorithm can integrate with any aircraft dynamic model that provides a forward prediction.
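As a rough illustration of how such a point-mass model can be stepped forward, the sketch below integrates a standard pseudo-6DOF formulation with a simple Euler step. The exact equation forms, the Euler integrator, the default lift constant `lift=0.9`, and the names `AircraftState` and `step` are assumptions made for this example and should not be read as the authors' implementation.

```python
import math
from dataclasses import dataclass

G = 9.81  # m/s^2, acceleration equivalent to 1 g

@dataclass
class AircraftState:
    x: float      # north position, m
    y: float      # east position, m
    z: float      # down position, m (altitude h = -z)
    V: float      # airspeed, m/s (assumed nonzero)
    gamma: float  # flight path angle, rad
    psi: float    # horizontal azimuth angle, rad
    phi: float    # roll angle, rad
    alpha: float  # angle of attack, rad

def step(s, n_x, alpha_dot, phi_dot, dt=0.1, lift=0.9):
    """Advance the state by one Euler step of length dt.

    n_x is thrust in g's; alpha_dot and phi_dot are rates in rad/s.
    lift is an assumed constant lift acceleration in g's.
    """
    # Acceleration out the top of the aircraft, in g's.
    n_f = n_x * math.sin(s.alpha) + lift
    V_dot = G * (n_x * math.cos(s.alpha) - math.sin(s.gamma))
    gamma_dot = (G / s.V) * (n_f * math.cos(s.phi) - math.cos(s.gamma))
    psi_dot = G * n_f * math.sin(s.phi) / (s.V * math.cos(s.gamma))
    # Kinematics in NED coordinates (z points down).
    x_dot = s.V * math.cos(s.gamma) * math.cos(s.psi)
    y_dot = s.V * math.cos(s.gamma) * math.sin(s.psi)
    z_dot = -s.V * math.sin(s.gamma)
    return AircraftState(
        x=s.x + x_dot * dt,
        y=s.y + y_dot * dt,
        z=s.z + z_dot * dt,
        V=s.V + V_dot * dt,
        gamma=s.gamma + gamma_dot * dt,
        psi=s.psi + psi_dot * dt,
        phi=s.phi + phi_dot * dt,
        alpha=s.alpha + alpha_dot * dt,
    )

# Straight and level flight heading north at 100 m/s, 1000 m altitude.
level = AircraftState(x=0.0, y=0.0, z=-1000.0, V=100.0,
                      gamma=0.0, psi=0.0, phi=0.0, alpha=0.0)
nxt = step(level, n_x=0.0, alpha_dot=0.0, phi_dot=0.0, lift=1.0)
```

With a lift of exactly 1 g and no thrust, the level-flight state in the usage example simply translates north, which is a quick sanity check on the signs.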
IV-A1 Forward Projection
In order to determine the future state resulting from a given action, we use forward projection to simulate the dynamics forward in time. We use a discrete time step of $\Delta t = 0.1$ seconds and apply the control actions at each time step for a specified number of time steps.
For the purposes of determining the future state of an action, we forward project for 10 time steps (1 second). After selecting an action and applying it to the simulation, we advance the simulation one time step (0.1 seconds). Thus an action is chosen at a 10 Hz rate with a 1 second forward projection horizon.
The simulated future states can be viewed as an approximation of the reachable states, and are applied to the solution of the Markov Decision Process (MDP) to determine the value of the potential future states the agent might reach. Thus the agent follows the optimal policy of the MDP at each time step by determining which future reachable state is most valuable, and then takes the action in the next time step that will lead it towards that state.
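The forward projection and action selection loop can be sketched generically as follows. The helper names `forward_project` and `select_action`, the toy 1D point-mass dynamics, and the toy value function are hypothetical and exist only to make the example self-contained; any dynamics model with a step function of this shape could be substituted.

```python
def forward_project(state, action, step_fn, n_steps=10, dt=0.1):
    """Hold one action fixed and simulate n_steps into the future."""
    for _ in range(n_steps):
        state = step_fn(state, action, dt)
    return state

def select_action(state, actions, step_fn, value_fn, n_steps=10, dt=0.1):
    """Return the action whose projected future state has the highest value."""
    best_action, best_value = None, float("-inf")
    for action in actions:
        future = forward_project(state, action, step_fn, n_steps, dt)
        value = value_fn(future)
        if value > best_value:
            best_action, best_value = action, value
    return best_action

# Toy example: state = (position, velocity), action = acceleration.
def toy_step(state, accel, dt):
    pos, vel = state
    return (pos + vel * dt, vel + accel * dt)

def toy_value(state):
    # Value peaks at position 10 and falls off linearly with distance.
    return -abs(state[0] - 10.0)

chosen = select_action((0.0, 5.0), [-2.0, 0.0, 2.0], toy_step, toy_value)
```

Only the first time step of the chosen action would then be applied in simulation before the selection is repeated from the new state.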
Each team is provided with different aircraft performance limits, which serve to provide the “blue” team (team 0) with a performance advantage over the “red” team (team 1) and to prevent deadlocks where neither team is able to obtain an advantage over the other. Table II lists the performance limits, with speeds expressed as Mach numbers (multiples of the speed of sound). These limits were chosen to represent a highly maneuverable subsonic UAV and do not represent any real aircraft.
Team  Min speed (Mach)  Max speed (Mach)  (rad/s)  (rad/s)  (rad)  (rad)
Blue  0.1  0.35  1.5  1.5  0.009  0.69
Red  0.1  0.30  1.3  1.3  0.009  0.52
IV-B MDP Formulation
IV-B1 State Space
We define the environment in which the aircraft operate as a 25 km by 25 km by 25 km volume, which is treated as a continuous state space. There are two teams of aircraft in this environment: a “blue” team and a “red” team. Each aircraft (an “ownship”) is controlled by our proposed algorithm, and aircraft on the blue team have a slight performance advantage over aircraft on the red team.
The state includes all the information each ownship needs for its decision making: the full aircraft state of the ownship, the position and velocity of every teammate aircraft, and the position and velocity of every opponent aircraft.
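Each ownship is aware of its own aircraft state produced by the pseudo-6DOF model. For each ownship, the state is formed by concatenating the following:

the pseudo-6DOF state: position $(x, y, z)$, the heading angle $\psi$, the roll angle $\phi$, the flight path angle $\gamma$, the pitch angle $\theta$, the angle of attack $\alpha$, and the speed $V$,

for each teammate $i$: the position $p_i^{(t)}$ and velocity $v_i^{(t)}$, and

for each opponent aircraft $j$: the position $p_j^{(o)}$ and velocity $v_j^{(o)}$
$$s_o = \left[ x, y, z, \psi, \phi, \gamma, \theta, \alpha, V, p_1^{(t)}, v_1^{(t)}, \ldots, p_N^{(t)}, v_N^{(t)}, p_1^{(o)}, v_1^{(o)}, \ldots, p_M^{(o)}, v_M^{(o)} \right] \tag{9}$$
where $N$ represents the number of teammates, and $M$ represents the number of opponents.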
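The concatenation described above can be sketched with NumPy as follows. The function name `build_state` and the ordering of the nine ownship values are assumptions for illustration; the point is only that the state length grows linearly with the number of other aircraft, as 9 ownship values plus 6 values (3D position and 3D velocity) per teammate or opponent.

```python
import numpy as np

def build_state(own, teammates, opponents):
    """Concatenate the ownship state with teammate and opponent data.

    own: sequence of 9 ownship values (position, angles, speed).
    teammates, opponents: lists of (position, velocity) pairs,
    each a length-3 sequence.
    """
    parts = [np.asarray(own, dtype=float)]
    for pos, vel in list(teammates) + list(opponents):
        parts.append(np.asarray(pos, dtype=float))
        parts.append(np.asarray(vel, dtype=float))
    return np.concatenate(parts)

# Hypothetical 3v3 contest seen from one ownship: 2 teammates, 3 opponents.
own = np.zeros(9)
teammates = [((0.0, 100.0, -1000.0), (100.0, 0.0, 0.0))] * 2
opponents = [((5000.0, 0.0, -1200.0), (-90.0, 0.0, 0.0))] * 3
state = build_state(own, teammates, opponents)
```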
IV-B2 Action Space
Inputs to the model are (1) the thrust $n_x$, (2) the rate of change of angle of attack $\dot{\alpha}$, and (3) the rate of change of the roll angle $\dot{\phi}$.
The action space is then:
$$A = \{ n_x, \dot{\alpha}, \dot{\phi} \} \tag{10}$$
There are two teams of aircraft, where team 0 is the “blue team” and team 1 is the “red team”. When the teams’ aircraft have equivalent performance, simulations often result in a stalemate which represents a Nash equilibrium where neither aircraft is able to gain an advantage over the other. In these cases, the simulation will not naturally terminate. Therefore, in the simulations we provide a performance advantage to the blue team, which more naturally leads to simulations that terminate.
Team  (rad/s)  (rad/s)  Thrust (g’s)
Red  -1, -0.8, ..., 0.8, 1  -0.5, -0.4, ..., 0.4, 0.5  0, 1, ..., 6
Blue  -1.5, -1.2, ..., 1.2, 1.5  -0.5, -0.4, ..., 0.4, 0.5  0, 1, ..., 8
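A discretized action set of this kind can be enumerated as a Cartesian product of the per-input levels. The sketch below assumes uniformly spaced rate levels (11 per rate input, which matches the listed end values and first steps) and integer thrust levels; the intermediate values, the generic names `rate_1` and `rate_2`, and the helper `action_grid` are assumptions, and no claim is made about which rate column corresponds to which input.

```python
import itertools
import numpy as np

def action_grid(rate_1_max, rate_2_max, thrust_max_g, n_rate_levels=11):
    """Enumerate every combination of the discretized control inputs."""
    rate_1 = np.linspace(-rate_1_max, rate_1_max, n_rate_levels)  # rad/s
    rate_2 = np.linspace(-rate_2_max, rate_2_max, n_rate_levels)  # rad/s
    thrust = np.arange(0, thrust_max_g + 1)                       # g's
    return list(itertools.product(rate_1, rate_2, thrust))

red_actions = action_grid(1.0, 0.5, 6)
blue_actions = action_grid(1.5, 0.5, 8)
```

Under these assumptions the red team has 847 candidate actions per decision and the blue team has 1089, each of which would be forward projected once per time step.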
IV-B3 Reward Function
The primary mechanism to control the behavior of an agent in a Markov Decision Process (MDP) is the reward function. Positive and negative rewards allow the agent to determine which actions are desirable, and the solution of an MDP maximizes the expectation of future reward. In our pursuit/evasion problem, we use positive and negative rewards that are coupled together to create tension between potential actions. For example, we place a positive reward near the location of an aircraft to attract other aircraft, but we also place a negative reward at the aircraft to prevent a collision. A natural equilibrium develops between these positive and negative rewards that generates the desired behavior of approaching another aircraft without colliding with it.
Following the approach used by Bertram et al. in [bertram2019], we treat each negative reward as a “risk well”, which is a region of negative reward (i.e., a penalty) that is most intense at the center and decays outward until a fixed radius is reached, after which no penalty is applied. We present our reward function in terms of the behaviors we wish to obtain in Table III. In this table, $p$ represents the current position of an aircraft (teammate or opponent) and $v$ represents that aircraft’s current linear velocity. In some cases we project the aircraft’s position forward in time with an expression of the form $p + v t$ and then define a range of time $t$ to indicate that we create a reward at the projected location of the aircraft at each future time step in that range.
For each teammate:  
Magnitude  Decay factor  Location  Radius  Timesteps  Comment 
Collision avoidance, 5 rewards  
N/A  Weak formation flight or clustering  


For each opponent:  
Magnitude  Decay factor  Location  Radius  Timesteps  Comment 
Collision avoidance, 4 rewards  
N/A  Pursuit 
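A single risk well and its forward-projected copies can be sketched as below. The exponential decay form `magnitude * decay ** distance`, the parameter values in the usage lines, and the names `risk_well` and `projected_wells` are assumptions for illustration; the magnitudes, decay factors, and radii actually used are those of Table III.

```python
import numpy as np

def risk_well(query, center, magnitude, decay, radius):
    """Penalty felt at `query` from one well centered at `center`.

    The penalty is strongest at the center, decays with distance,
    and is exactly zero beyond `radius`.
    """
    d = np.linalg.norm(np.asarray(query, float) - np.asarray(center, float))
    if d > radius:
        return 0.0
    return -abs(magnitude) * decay ** d

def projected_wells(p, v, timesteps, dt=1.0):
    """Well centers at p + v * t for each future time step in `timesteps`."""
    p = np.asarray(p, float)
    v = np.asarray(v, float)
    return [p + v * (k * dt) for k in timesteps]

# Hypothetical teammate at the origin moving north at 10 m/s.
centers = projected_wells((0.0, 0.0, 0.0), (10.0, 0.0, 0.0), range(5))
at_center = risk_well((0.0, 0.0, 0.0), centers[0], 100.0, 0.99, 500.0)
far_away = risk_well((600.0, 0.0, 0.0), centers[0], 100.0, 0.99, 500.0)
```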
All aircraft also receive a penalty below a certain altitude, which prevents the aircraft from plummeting into the terrain. This altitude is defined relative to the maximum height of the terrain that is loaded into the simulation. We define a minimum safe altitude known as the “hard deck”, above which we allow the aircraft to fly. Any aircraft which goes below the hard deck has, for the purposes of the game, crashed and is removed from the simulation. For any state with an altitude in a band extending upward from the hard deck, a penalty is applied which is a very strong negative reward that will override any other positive rewards in the game.
IV-C Algorithm
We alter the algorithm from [bertram2019] by extending it to handle 3D aircraft positions in a continuous state space and by adapting it to allow for forward projection. We present the algorithm which we call FastMDP in Algorithm 1. Note that the algorithm is presented for clarity here; when implemented certain optimizations are made which improve performance but make it more difficult to understand.
In order to efficiently solve the MDP, the algorithm from [bertram2019] divides the rewards into positive and negative rewards and processes them separately. The positive rewards are processed in a straightforward manner where each reward is treated as a peak which decays exponentially, and the resulting value function surface is the maximum of all of these exponentially decaying peaks. Negative rewards are treated differently. They are first converted to what Bertram et al. refer to as “Standard Positive Form” (S.P.F.), where each negative reward is negated so that it becomes a positive reward in this standard positive form space. Once in standard positive form, a new value function surface is computed from the rewards in the same manner as for the positive rewards. The value function surface in standard positive form is then negated, resulting in a value function surface that is negative and is a close approximation of solving the MDP with only the negative rewards present. When this surface resulting from the negative rewards is summed with the surface resulting from the positive rewards, the result is a value function surface that closely approximates the result that would be obtained from solving the original MDP with all rewards present. We reuse this basic approach, but employ some additional computational optimizations to make these operations more efficient.
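The positive/negative decomposition described above can be sketched for a single query state as follows. This is a simplified reading of the description, assuming Euclidean distance, a single shared decay factor, and no radius cutoff on the negative rewards; the names `peak_surface` and `value_at` and all numeric values are hypothetical, and the actual algorithm in [bertram2019] includes details not reproduced here.

```python
import numpy as np

def peak_surface(query, centers, magnitudes, decay):
    """Value at `query` as the max over exponentially decaying peaks."""
    query = np.asarray(query, float)
    centers = np.asarray(centers, float).reshape(-1, query.shape[0])
    if centers.shape[0] == 0:
        return 0.0
    d = np.linalg.norm(centers - query, axis=1)
    return float(np.max(np.asarray(magnitudes, float) * decay ** d))

def value_at(query, pos_centers, pos_mags, neg_centers, neg_mags, decay=0.99):
    """Approximate value: positive surface plus negated S.P.F. surface."""
    v_pos = peak_surface(query, pos_centers, pos_mags, decay)
    # Standard positive form: negate the negative rewards, build the
    # surface exactly as for positive rewards, then negate the result.
    spf_mags = -np.asarray(neg_mags, float)
    v_neg = -peak_surface(query, neg_centers, spf_mags, decay)
    return v_pos + v_neg

# One attracting reward at the origin, one penalty 10 m to the north.
v_origin = value_at((0, 0, 0), [(0, 0, 0)], [100.0], [(10, 0, 0)], [-50.0], decay=0.9)
v_penalty = value_at((10, 0, 0), [(0, 0, 0)], [100.0], [(10, 0, 0)], [-50.0], decay=0.9)
```

Because each surface is just a maximum over independent peaks, evaluating a candidate future state costs time linear in the number of rewards, which is consistent with the scaling behavior reported later.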
All of these steps are optimized as much as possible for operation on a CPU. As the code is implemented in Python, the Numba library is employed to just-in-time compile key sections of the code to native machine code for faster operation. Additionally, the code is written to take advantage of the numerical library NumPy to perform vectorized operations over arrays. No parallelization via multiple CPU cores or a GPU is employed.
V Experimental Setup
We demonstrate this MDP based planner in a 3D aircraft simulation showing a view of the two teams of aircraft. The simulation covers a configurable sized volume which contains a configurable number of team members on each of the two teams.
Simulation begins with both teams spawned randomly on opposing sides of the environment. The teams must each avoid collisions with teammates while simultaneously pursuing members of the opposing team, using only the reward system we have defined above.
At each time step, the simulation generates the state updates for each ownship. Each ownship creates and solves its own MDP using the highly efficient algorithm presented in [bertram2019]. Each ownship forward projects each possible action by 1 second, and then uses the solution of the MDP to determine which action results in the highest valued future state. The action selected with this method is then applied in simulation for 1 time step (0.1 seconds). The actions of all aircraft from both sides are selected and performed simultaneously, without knowledge of the actions selected by any other aircraft in the simulation. Simulation then advances by one time step. Note that a new MDP is calculated at each time step, which is made possible by the performance of the algorithm in [bertram2019].
In this pursuit/evasion game, we define a pursuer as “capturing” an opponent if it is in a certain region behind the evading aircraft. The “control point” is defined as the position the evader was at 3 seconds previously. If the pursuer is within 100 meters of the control point and the relative angle between the two velocity vectors of the aircraft is within 60 degrees, then the pursuer is close to the control point and pointing at the evader, and we consider this a sufficient condition for the pursuer to be able to “capture” the evader (e.g., within range of some weapon). The pursuer must maintain this condition for 30 consecutive time steps in order to successfully “hit” the evader, which is analogous to a weapon taking some time to track the evader. This is indicated visually in the simulation as a red pulsing rectangle around an aircraft that is in danger of being captured.
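The capture rule can be expressed as a small per-pair tracker, sketched below. The thresholds (3 second delay, 100 meter radius, 60 degree angle, 30 consecutive steps, 0.1 second time step) come from the text; the class name `CaptureTracker`, its interface, and the use of a fixed-length history buffer are assumptions for this example, and both velocity vectors are assumed to be nonzero.

```python
import math
from collections import deque
import numpy as np

class CaptureTracker:
    """Tracks one pursuer/evader pair; call update() once per time step."""

    def __init__(self, dt=0.1, delay_s=3.0, radius_m=100.0,
                 max_angle_deg=60.0, hold_steps=30):
        # Buffer long enough that its oldest entry is delay_s in the past.
        self.history = deque(maxlen=int(round(delay_s / dt)) + 1)
        self.radius_m = radius_m
        self.max_angle = math.radians(max_angle_deg)
        self.hold_steps = hold_steps
        self.count = 0

    def update(self, pursuer_pos, pursuer_vel, evader_pos, evader_vel):
        """Return True once the condition has held for hold_steps steps."""
        self.history.append(np.asarray(evader_pos, float))
        in_position = False
        if len(self.history) == self.history.maxlen:
            control_point = self.history[0]  # evader position delay_s ago
            dist = np.linalg.norm(np.asarray(pursuer_pos, float) - control_point)
            pv = np.asarray(pursuer_vel, float)
            ev = np.asarray(evader_vel, float)
            cos_angle = pv @ ev / (np.linalg.norm(pv) * np.linalg.norm(ev))
            angle = math.acos(float(np.clip(cos_angle, -1.0, 1.0)))
            in_position = bool(dist <= self.radius_m and angle <= self.max_angle)
        # Any break in the condition resets the consecutive-step counter.
        self.count = self.count + 1 if in_position else 0
        return self.count >= self.hold_steps

# Evader flies north at 100 m/s; pursuer trails exactly at the control point.
tracker = CaptureTracker()
results = []
for k in range(1, 61):
    evader = (10.0 * k, 0.0, 0.0)
    pursuer = (10.0 * k - 300.0, 0.0, 0.0)
    results.append(tracker.update(pursuer, (100.0, 0.0, 0.0),
                                  evader, (100.0, 0.0, 0.0)))
```

In this usage example the condition cannot be evaluated until 3 seconds of evader history exist, so the capture completes on the 60th update.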
We build a scoring system that tracks the number of airplanes that have been captured. When a team’s airplane is captured, the opposing team is awarded one point. Thus complete success is when one team reaches a score that equals the number of airplanes on the opposing team. A “win” is described as one team scoring higher than the other, with the other team necessarily incurring a “loss”, and a “draw” is when both teams score the same.
We define a metric to study the effect of the algorithm over $N$ runs, which is defined for a team as the number of wins the team obtained over the number of runs: $P_{win} = N_{wins} / N$. This metric can be applied to 1 vs 1 encounters and can scale to larger teams as well.
The $P_{win}$ measurement alone is not sufficient. Beyond the probability of a win, we also wish to define a metric that describes the survivability of the team. In a 10 vs 10 game, it is clearly better when winning if all 10 of the teammates survive, as compared to a win when only 1 of the teammates remains at the end. If we define the number of aircraft at the beginning of the contest as $N_{start}$ and the number remaining at the end of the contest as $N_{end}$, then we can define the ratio of teammates that survived a given contest as $N_{end} / N_{start}$. Over $N$ contests, we define the overall probability of survivability as $P_{survive} = \frac{1}{N} \sum_{k=1}^{N} N_{end}^{(k)} / N_{start}^{(k)}$, where $N$ is the number of contests and $P_{survive}$ is the average probability that a team member will survive the contest.
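Both metrics reduce to simple averages over runs, as the following sketch shows. The function names and the run data in the usage lines are hypothetical and are not results from the paper.

```python
def win_probability(scores):
    """Fraction of runs won, given (own_score, opponent_score) per run."""
    wins = sum(1 for own, opp in scores if own > opp)
    return wins / len(scores)

def survivability(contests):
    """Mean fraction of a team surviving, given (n_start, n_end) per run."""
    return sum(n_end / n_start for n_start, n_end in contests) / len(contests)

# Hypothetical results from four 10v10 runs: two wins, one draw, one loss.
p_win = win_probability([(10, 3), (4, 4), (10, 9), (2, 10)])
p_survive = survivability([(10, 7), (10, 6), (10, 1), (10, 0)])
```

A draw counts as neither a win nor a loss, so it lowers the win probability without indicating that the other team prevailed.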
VI Results
In Figure 2, results are shown for a typical 1 versus 1 (1v1) encounter. As blue has a performance advantage, it is able to maneuver more effectively and is able to capture the red aircraft. Figure 3 shows the actions selected by the blue aircraft during this run, while Figure 4 shows the values of the pseudo-6DOF state variables during the run.
The win probability and survivability of the blue team for all experiments are shown in Table IV. The high win probability is an indicator that the algorithm is functioning correctly, as the blue team was given an advantage in the selection of actions and in aircraft dynamics. Better dynamics allow the aircraft to maneuver into an offensive position more readily, leading to an expected high win probability. Also as expected, as the airspace volume becomes more crowded and complex due to the increase in team size, the probability of survivability tends to decrease.
Team Size  Win probability  Survivability
1v1  100%  100%
2v2  100%  100%
3v3  100%  100%
4v4  100%  100%
10v10  100%  99%
100v100  100%  97%
The amount of processing time required to formulate and solve the MDP for each agent at each time step is summarized in Table V. Processing was performed on a laptop with an Intel i9-8950HK CPU at 2.90 GHz. While the code is written in Python, it takes advantage of the Numba and NumPy libraries, which execute the computation loops as optimized compiled code. Additionally, the underlying LLVM library may allow some Numba-optimized code to take advantage of SIMD instructions in the CPU. No GPU acceleration is used.
Team Size  Mean (ms) 

1v1  2.26 
2v2  2.50 
3v3  2.70 
4v4  3.16 
10v10  5.55 
100v100  27.59 
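As a rough check of these timings against the 10 Hz decision rate, the short calculation below compares each mean solve time to the 100 ms available per decision. It assumes that each aircraft solves its own MDP on its own processor and ignores any overhead outside the MDP solve; both are assumptions made for this illustration.

```python
# Mean per-agent solve times in milliseconds, copied from the timing table.
mean_solve_ms = {
    "1v1": 2.26, "2v2": 2.50, "3v3": 2.70,
    "4v4": 3.16, "10v10": 5.55, "100v100": 27.59,
}
DECISION_PERIOD_MS = 100.0  # one decision every 0.1 s (10 Hz)

# Fraction of the per-decision time budget consumed by the MDP solve.
budget_fraction = {k: v / DECISION_PERIOD_MS for k, v in mean_solve_ms.items()}

# Growth in solve time when going from 1v1 to 100v100.
slowdown = mean_solve_ms["100v100"] / mean_solve_ms["1v1"]
```

Under these assumptions even the 100v100 case uses under 30% of the per-decision budget, and a 100-fold increase in team size raises the mean solve time by roughly a factor of 12.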
Videos of example runs of 1v1, 2v2, 3v3, 4v4, and 10v10 contests are listed in Table VI. Note that the size of the aircraft is exaggerated by a factor of 3 for improved visibility in the video.
Team Size  URL 

1v1  https://youtu.be/zGWXxtJUwk8 
2v2  https://youtu.be/Q9O50cqpVtA 
3v3  https://youtu.be/6Zok4sj43C4 
4v4  https://youtu.be/qhI6av3oJN4 
10v10  https://youtu.be/6twTWNRurwo 
VII Conclusion
We have presented an efficient problem formulation for pursuit/evasion problems that scales to large team sizes (100v100) while remaining computationally efficient. This method formulates the problem as a Markov Decision Process (MDP) and uses a recently proposed approach in [bertram2019] to efficiently solve the MDP, making it suitable for embedded systems commonly found on aircraft. The use of “risk wells” to represent the potential future actions of friendly and opposing aircraft allows the problem to remain tractable even as the number of aircraft per team increases.
For future work, we plan to explore how to incorporate mutual support, combat tactics, and multi-agent cooperation to increase the effectiveness of the teams. This should be a rich area to explore with ample problems to examine. We also plan to extend the aircraft model used here to a higher fidelity model to test the algorithm in different areas of the flight envelope.