A Predictive Deep Learning Approach to Output Regulation: The Case of Collaborative Pursuit Evasion

by Shashwat Shivam, et al.
Georgia Institute of Technology

In this paper, we consider the problem of controlling an underactuated system in unknown and potentially adversarial environments. The emphasis is on autonomous aerial vehicles, modelled by Dubins dynamics. The proposed control law is based on a variable integrator via online prediction for target tracking. To showcase the efficacy of our method, we analyze a pursuit evasion game between multiple autonomous agents. To obviate the need for perfect knowledge of the evader's future strategy, we use a deep neural network that is trained to approximate the behavior of the evader based on measurements gathered online during the pursuit.




I Introduction

Output tracking in dynamical systems, such as robots, flight control, economics, biology, and cyber-physical systems, is the practice of designing decision makers which ensure that a system’s output tracks a given signal [6, 14].

Well-known existing methods for nonlinear output regulation and tracking include control techniques based on nonlinear inversions [9], high-gain observers [11], and the framework of model predictive control (MPC) [1, 18]. Recently a new approach has been proposed, based on the Newton-Raphson flow for solving algebraic equations [24]. Subsequently it has been tested on various applications including controlling an inverted pendulum, and position control of platoons of mobile robotic vehicles [25, 20]. While perhaps not as general as the aforementioned established techniques, it seems to hold out promise of efficient computations and large domains of stability.

The successful deployment of complex control systems in real-world applications increasingly depends on their ability to operate in highly unstructured, even adversarial, settings, where a priori knowledge of the evolution of the environment is impossible to acquire. Moreover, due to the increasing interconnection between the physical and the cyber domains, control systems become more intertwined with human operators, making model-based solutions fragile to unpredictable behavior. Towards that, methods that augment low-level control techniques with intelligent decision-making mechanisms have been extensively investigated in [19]. Machine learning [7, 23] offers a suitable framework to allow control systems to autonomously adapt by leveraging data gathered from their environment. To enable data-driven solutions for autonomy, learning algorithms use artificial neural networks (NNs): classes of functions that, due to properties that stem from their neurobiological analogy, offer adaptive data representations and prediction based on external observations.

NNs have been used extensively in control applications [15], both in open-loop and closed-loop fashion. In closed-loop applications, NNs have been utilized as dynamics approximators, or in the framework of reinforcement learning, in enabling online solution of the Hamilton-Jacobi-Bellman equation. However, the applicability of NNs in open-loop control objectives is broader, due to their ability to operate as classifiers, or as nonlinear function approximators.

The authors of [15] introduced NN structures for system identification as well as adaptive control. Extending the identification capabilities of learning algorithms, the authors of [4] introduce a robustification term that guarantees asymptotic estimation of the state and the state derivative. Furthermore, reinforcement learning has received increasing attention since the development of methods that solve optimal control problems for continuous-time control systems online without knowledge of the dynamics [22]. Prediction has been at the forefront of research conducted on machine learning. Learning-based attack prediction was employed both in [26] and [2] in the context of cyber-security, and [16] utilized NNs to solve a pursuit evasion game by constructing both the evader’s and the pursuer’s strategies offline using pre-computed trajectories. Recently, authors of this paper have applied NNs for online model construction in a control application [10].

This paper applies an NN technique to the pursuit-evasion problem investigated in [17], which is more challenging than the problem addressed in [10]. The strategies of both pursuers and evader are based on respective games. In Ref. [17], the pursuers know the game of the evader ahead of time, and an MPC technique is used to determine their trajectories. In this paper the pursuers do not have a priori knowledge of the evader’s game or its structure, and they employ an NN in real time to identify its input-output mapping. We use our tracking-control technique [24] rather than MPC, and obtain results similar to those of [17]. Furthermore, the input to the system has a lower dimension than its output, and hence the system is underactuated. We demonstrate a way of overcoming this limitation, which may have a broad scope in applications.

The rest of the paper is structured as follows. Section II describes our proposed control technique and some preliminary results on NN, and it formulates the pursuers-evader problem. Section III describes results on model-based and learning-based strategies. Simulation results are presented in Section IV. Finally, Section V concludes the paper and discusses directions for future research.

II Preliminaries and Problem Formulation

II-A Tracking Control Technique

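This subsection recounts results published in our previous work, in which prediction-based output tracking was used for fully-actuated systems [24, 25, 20]. Consider a system as shown in Figure 1 with $r(t) \in \mathbb{R}^m$, $y(t) \in \mathbb{R}^m$, $u(t) \in \mathbb{R}^m$, and $e(t) := r(t) - y(t)$. The objective of the controller is to ensure that

$$\limsup_{t \to \infty} \| r(t) - y(t) \| < \varepsilon \qquad (1)$$

for a given (small) $\varepsilon > 0$.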

Fig. 1: Basic control system scheme.

To illustrate the basic idea underlying the controller, let us first assume that (i) the plant subsystem is a memoryless nonlinearity of the form

$$y(t) = g(u(t)) \qquad (2)$$

for a continuously-differentiable function $g : \mathbb{R}^m \to \mathbb{R}^m$, and (ii) the target reference is a constant, $r(t) \equiv r$ for a given $r \in \mathbb{R}^m$ (footnote 1: henceforth we use the notation $\{x(t)\}$ for a generic signal, to distinguish it from its value $x(t)$ at a particular point $t$). These assumptions will be relaxed later. In this case, the tracking controller is defined by the following equation,

$$\dot{u}(t) = \left( \frac{\partial g}{\partial u}(u(t)) \right)^{-1} \big( r(t) - g(u(t)) \big), \qquad (3)$$

assuming that the Jacobian matrix $\frac{\partial g}{\partial u}(u(t))$ is nonsingular at every point $u(t)$ computed by the controller via (3). Observe that (3) defines the Newton-Raphson flow for solving the algebraic equation $r - g(u) = 0$, and hence (see [24, 25]) the controller converges in the sense that $\lim_{t \to \infty} \big( r - y(t) \big) = 0$. Next, suppose that the reference target is time-dependent, while keeping the assumption that the plant is a memoryless nonlinearity. Suppose that $\{r(t)\}$ is bounded, continuous, and piecewise-continuously differentiable, and that $\{\dot{r}(t)\}$ is bounded. Define

$$\eta := \limsup_{t \to \infty} \| \dot{r}(t) \|; \qquad (4)$$

then (see [25]), with the controller defined by (3), we have that

$$\limsup_{t \to \infty} \| r(t) - y(t) \| \le \eta. \qquad (5)$$

Note that Eqs. (2) and (3) together define the closed-loop system. Observe that the plant equation (2) is an algebraic equation while the controller equation (3) is a differential equation, hence the closed-loop system represents a dynamical system. Its stability, in the sense that $\{y(t)\}$ is bounded whenever $\{r(t)\}$ and $\{\dot{r}(t)\}$ are bounded, is guaranteed by (5) as long as the control trajectory $\{u(t)\}$ does not pass through a point $u(t)$ where the Jacobian matrix $\frac{\partial g}{\partial u}(u(t))$ is singular.
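As an illustration of the memoryless case, the following is a minimal sketch of the Newton-Raphson flow integrated with a forward-Euler scheme. The plant `g`, its Jacobian, the step size, and the reference value are hypothetical choices made only for this example; they do not come from the paper. The plant is chosen so that its Jacobian is nonsingular everywhere.

```python
import numpy as np

def g(u):
    # Hypothetical memoryless plant y = g(u) (an assumption for illustration).
    return np.array([u[0] + 0.5 * np.sin(u[1]),
                     u[1] + 0.5 * np.cos(u[0])])

def jacobian(u):
    # Analytic Jacobian dg/du of the hypothetical plant above.
    # Its determinant is 1 + 0.25*cos(u[1])*sin(u[0]) >= 0.75, so it is never singular.
    return np.array([[1.0, 0.5 * np.cos(u[1])],
                     [-0.5 * np.sin(u[0]), 1.0]])

def newton_raphson_flow(r, u0, dt=1e-3, t_final=15.0):
    """Forward-Euler integration of du/dt = (dg/du)^{-1} (r - g(u))."""
    u = np.array(u0, dtype=float)
    for _ in range(int(t_final / dt)):
        # Solve the linear system instead of forming the inverse explicitly.
        u = u + dt * np.linalg.solve(jacobian(u), r - g(u))
    return u

r = np.array([1.0, -0.5])                 # constant reference (assumed value)
u_final = newton_raphson_flow(r, u0=[0.0, 0.0])
err = np.linalg.norm(r - g(u_final))      # residual tracking error
print("final tracking error:", err)
```

Along the flow the error obeys approximately de/dt = -e, so the residual decays exponentially regardless of the particular plant, as long as the Jacobian stays nonsingular.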

Finally, let us dispense with the assumption that the plant subsystem is a memoryless nonlinearity. Instead, suppose that it is a dynamical system modeled by the following two equations,

$$\dot{x}(t) = f(x(t), u(t)), \quad x(0) = x_0, \qquad (6)$$

$$y(t) = h(x(t)), \qquad (7)$$

where the state variable $x(t)$ is in $\mathbb{R}^n$, and the functions $f$ and $h$ satisfy the following assumption. (i) The function $f : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^n$ is continuously differentiable, and for every compact set $\Gamma \subset \mathbb{R}^m$ there exists $K > 0$ such that, for every $x \in \mathbb{R}^n$ and $u \in \Gamma$, $\| f(x, u) \| \le K (\| x \| + 1)$. (ii) The function $h : \mathbb{R}^n \to \mathbb{R}^m$ is continuously differentiable.  ∎ This assumption ensures that whenever the control signal $\{u(t)\}$ is bounded and continuous, the state equation (6) has a unique solution on the interval $[0, \infty)$.

In this setting, $y(t)$ is no longer a function of $u(t)$, but rather of $x(t)$, which is a function of $\{u(\tau) : \tau < t\}$. Therefore (2) is no longer valid, and hence the controller cannot be defined by (3). To get around this conundrum we pull the feedback not from the output but from a predicted value thereof. Specifically, fix the look-ahead time $T > 0$, and suppose that at time $t$ the system computes a prediction of $y(t+T)$, denoted by $\hat{y}(t+T)$. Suppose also that $\hat{y}(t+T)$ is a function of $(x(t), u(t))$, hence can be written as $\hat{y}(t+T) = g(x(t), u(t))$, where the function $g : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^m$ is continuously differentiable.

Now the feedback law is defined by the following equation,

$$\dot{u}(t) = \left( \frac{\partial g}{\partial u}(x(t), u(t)) \right)^{-1} \big( r(t+T) - g(x(t), u(t)) \big). \qquad (8)$$

The state equation (6) and control equation (8) together define the closed-loop system. This system can be viewed as an $(n+m)$-dimensional dynamical system with the state variable $(x(t)^\top, u(t)^\top)^\top$ and input $r(t)$. We are concerned with a variant of Bounded-Input-Bounded-State (BIBS) stability whereby if $\{r(t)\}$ and $\{\dot{r}(t)\}$ are bounded, $\{x(t)\}$ is bounded as well. Such stability can no longer be taken for granted as in the case where the plant is a memoryless nonlinearity.

We remark that a larger $T$ means larger prediction errors, and these translate into larger asymptotic tracking errors. On the other hand, an analysis of various second-order systems in [24] reveals that they all were unstable if $T$ is too small, and stable if $T$ is large enough. Thus a requirement for a small prediction error can conflict with the stability requirement. This issue was resolved by speeding up the controller in the following manner. Consider $\alpha > 1$, and modify (8) by multiplying its right-hand side by $\alpha$, resulting in the following control equation:

$$\dot{u}(t) = \alpha \left( \frac{\partial g}{\partial u}(x(t), u(t)) \right)^{-1} \big( r(t+T) - g(x(t), u(t)) \big). \qquad (9)$$

It was verified in [24, 25, 20] that regardless of the value of $T$, a large-enough $\alpha$ stabilizes the closed-loop system (footnote 2: this statement seems to have a broad scope, and does not require the plant to be a minimum-phase system). Furthermore, if the closed-loop system is stable then the following bound holds,

$$\limsup_{t \to \infty} \| r(t) - \hat{y}(t) \| \le \frac{\eta}{\alpha},$$

where $\eta$ is defined by (4). Thus, a large gain $\alpha$ can stabilize the closed-loop system and reduce the asymptotic tracking error.
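The effect of the controller gain can be illustrated with a small numerical sketch. Everything specific here is an assumption made for the example and is not taken from the paper: the scalar plant dx/dt = -x + u with output y = x, the sinusoidal reference, the look-ahead time, and the two gain values. For this plant, holding the input constant over the horizon gives the closed-form prediction e^{-T} x + (1 - e^{-T}) u, so the sensitivity of the prediction to u is 1 - e^{-T}.

```python
import numpy as np

def simulate(alpha, T=0.5, dt=1e-3, t_final=40.0):
    """Lookahead tracking of r(t) = sin(0.5 t) for the toy plant dx/dt = -x + u.

    Returns the largest prediction error |r(t+T) - y_hat(t+T)| observed over
    the second half of the run, after the initial transient has died out.
    """
    c = np.exp(-T)
    dg_du = 1.0 - c                      # d(prediction)/du for this plant
    x, u = 0.0, 0.0
    worst = 0.0
    for k in range(int(t_final / dt)):
        t = k * dt
        r_future = np.sin(0.5 * (t + T))     # reference at the look-ahead time
        y_pred = c * x + dg_du * u           # predicted output, input held constant
        err = r_future - y_pred
        if t >= t_final / 2:
            worst = max(worst, abs(err))
        x_dot = -x + u                       # plant dynamics
        u_dot = alpha * err / dg_du          # sped-up Newton-Raphson flow
        x += dt * x_dot
        u += dt * u_dot
    return worst

err_low_gain = simulate(alpha=1.0)
err_high_gain = simulate(alpha=20.0)
print("prediction error, alpha=1 :", err_low_gain)
print("prediction error, alpha=20:", err_high_gain)
```

In this toy setting the larger gain yields a visibly smaller steady-state prediction error, which is the qualitative behavior the bound above describes.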

II-B Problem Formulation

In an attempt to broaden the application scope of the control algorithm, we explore underactuated systems such as fixed-wing aircraft, which are widely used in aerospace engineering. The behavior of a fixed-wing aircraft at constant elevation can be approximated by a planar Dubins vehicle with states [12] ,

where denotes the planar position of the vehicle, its heading and the angular velocity, constrained as, . The input saturation enforces a minimum turning radius equal to . To test the efficacy of the controller for the underactuated system, henceforth referred to as the pursuer, it is tasked with tracking an evading vehicle, modeled as a single integrator, with dynamics as follows:

where denote the planar position of the evader, and is its speed. We consider two cases: one where the evader is agnostic to the pursuer and follows a known trajectory, and the other where the evader is adversarial in nature and its trajectory is not known to the pursuer. The next section provides two solutions for the problem of estimating the evader’s trajectory based, respectively, on a model-based approach and a learning-based approach.
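For readers unfamiliar with the Dubins model, the sketch below uses the standard textbook form: constant forward speed, heading driven by a saturated turning-rate input, and a minimum turning radius equal to speed divided by the maximum turning rate. The variable names and the numerical values of speed and saturation are assumptions chosen for illustration, not the values used in the paper.

```python
import math

def dubins_step(state, u, v, u_max, dt):
    """One forward-Euler step of a planar Dubins vehicle.

    state = (x, y, theta); u is the commanded turning rate in rad/s,
    saturated to [-u_max, u_max]; v is the constant forward speed.
    """
    x, y, theta = state
    u = max(-u_max, min(u_max, u))       # input saturation
    return (x + dt * v * math.cos(theta),
            y + dt * v * math.sin(theta),
            theta + dt * u)

def min_turning_radius(v, u_max):
    # Tightest circle the vehicle can fly: radius = speed / max turning rate.
    return v / u_max

v, u_max, dt = 10.0, 0.5, 1e-3           # hypothetical speed [m/s] and saturation [rad/s]
radius = min_turning_radius(v, u_max)
center = (0.0, radius)                   # left turn from the origin, heading along +x

state = (0.0, 0.0, 0.0)
max_dev = 0.0
for _ in range(int(2 * math.pi / u_max / dt)):   # one full revolution
    # Command more than the limit; the saturation clips it to u_max.
    state = dubins_step(state, 5.0, v, u_max, dt)
    dist = math.hypot(state[0] - center[0], state[1] - center[1])
    max_dev = max(max_dev, abs(dist - radius))

print("minimum turning radius:", radius)
print("max deviation from the limiting circle:", max_dev)
```

The loop confirms that an over-commanded turn traces a circle of the minimum turning radius, up to the small drift introduced by Euler integration.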

III Predictive Framework

III-A Model-Based Pursuit Evasion

The considered system is underactuated because the pursuer’s position, , is two-dimensional while it is controlled by a one-dimensional variable, . This raises a problem since the application of the proposed tracking technique requires the control variable and system’s output to have the same dimension. To get around this difficulty, we define a suitable function and set where and are the predicted position of the pursuer and the evader at time ; we apply the Newton-Raphson flow to the equation . The modified controller becomes


Since is a scalar, the modified algorithm works similarly to the base case.

Assume general nonlinear system dynamics as in (6) with output described in (7). The predicted state trajectory is computed by holding the input to a constant value over the prediction horizon, given by the following differential equation:


with the initial condition as shown in [24]. The predicted output at is . Furthermore, by taking the partial derivative of (11) with respect to u(t), we obtain


with the initial condition . The above is a differential equation in and (11) and (12) can be solved numerically. Finally, the values of and can be substituted in (10) to get the control law.
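The two differential equations described above (the held-input prediction and its sensitivity with respect to the input) can be integrated side by side. The sketch below does this for a Dubins vehicle and checks the sensitivity against a finite difference. The scalar function `F`, taken here as the squared distance between the predicted pursuer and evader positions, is an assumption for illustration because the paper's exact choice is not recoverable from this text; the numerical values are likewise hypothetical.

```python
import numpy as np

def predict_with_sensitivity(state, u, v, T, n_steps=500):
    """Euler-integrate the held-input prediction and its derivative w.r.t. u.

    state = (x, y, theta) of a Dubins vehicle with speed v. The turning rate u
    is held constant over [t, t + T]. Returns (predicted state, d state / du).
    """
    xi = np.array(state, dtype=float)    # prediction starts at the current state
    s = np.zeros(3)                      # sensitivity starts at zero
    dt = T / n_steps
    for _ in range(n_steps):
        theta = xi[2]
        f = np.array([v * np.cos(theta), v * np.sin(theta), u])
        df_dx = np.array([[0.0, 0.0, -v * np.sin(theta)],
                          [0.0, 0.0,  v * np.cos(theta)],
                          [0.0, 0.0,  0.0]])
        df_du = np.array([0.0, 0.0, 1.0])
        s = s + dt * (df_dx @ s + df_du)     # variational (sensitivity) equation
        xi = xi + dt * f                     # prediction equation
    return xi, s

def control_rate(state, u, evader_pred, v, T, alpha=1.0):
    """Newton-Raphson flow on the assumed scalar F(u) = ||p_hat - e_hat||^2."""
    xi, s = predict_with_sensitivity(state, u, v, T)
    diff = xi[:2] - np.asarray(evader_pred, dtype=float)
    F = diff @ diff
    dF_du = 2.0 * diff @ s[:2]
    return -alpha * F / dF_du            # not defined where dF_du vanishes

state, v, T = (0.0, 0.0, 0.0), 10.0, 1.0     # hypothetical values
xi, s = predict_with_sensitivity(state, 0.3, v, T)

# Central finite difference of the predicted state with respect to u.
h = 1e-5
fd = (predict_with_sensitivity(state, 0.3 + h, v, T)[0]
      - predict_with_sensitivity(state, 0.3 - h, v, T)[0]) / (2 * h)

# Evader predicted to the left of the straight-ahead point: expect a left turn.
u_dot = control_rate(state, 0.0, evader_pred=(10.0, 2.0), v=v, T=T)
print("sensitivity:", s, "finite difference:", fd, "u_dot:", u_dot)
```

A practical implementation would need a safeguard for the case where the derivative of `F` with respect to the input approaches zero, since the flow is undefined there.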

In the next section, results are presented for an agnostic as well as an adversarial pursuer-evader system. However, as mentioned above, in the adversarial problem formulation, the trajectory of the evader is not known in advance; this can be overcome in two ways.

In the first approach, the pursuer(s) use game theory to predict the approximate direction of evasion. As mentioned in [8], in the case of a single pursuer, the evader’s optimal strategy is to move along the line joining the evader’s and pursuer’s positions, if the pursuer is far enough. When the distance between the pursuer and the evader reduces to the turning radius of the pursuer, the evader switches strategies and enters into the non-holonomic constraint region of the pursuer. This can be represented as follows:


Here is the expected evasion angle of the evader and is the distance between the pursuer and evader.

If there are multiple pursuers, it is assumed that the evader follows the same strategy by considering only the closest pursuer. It is notable that this will not provide the pursuers a correct prediction of the evader’s motion as they do not know about the goal seeking behavior mentioned above. However, it gives a good enough approximation of the evader’s motion that the algorithm can be used for tracking.

The second approach involves learning the evader’s behavior over time using an NN. The pursuers take their positions and the position of the evader as input, and after training the NN gives the estimated evasion direction as the output.

To showcase the efficacy of our method, we consider a pursuit evasion problem, involving multiple pursuing agents. Such problems are typically formulated as zero-sum differential games [8]. Due to the difficulty of solving the underlying Hamilton-Jacobi-Isaacs (HJI) equations [3] of this problem, we shall utilize the method described in II-A to approximate the desired behavior. Furthermore, we show that augmenting the controller with learning structures in order to tackle the pursuit evasion problem without explicit knowledge of the evader’s behavior is straightforward.

In order to formulate the pursuit evasion problem, we define a global state space system consisting of the dynamics of the pursuers and the evader. For ease of exposition, the analysis will focus on the -pursuer, -evader problem, since extending the results to multiple pursuers is straightforward.

The global state dynamics become,


where the subscripts indicate the autonomous agent. For compactness, we denote the global state vector as , the pursuers’ control vector , and the nonlinear mapping described by the right-hand side of (14). Thus, given the initial states of the agents , the evolution of the pursuit evasion game is described by .

Subsequently, this zero-sum game can be described as a minimax optimization problem through the cost index,


where , is the distance between the -th pursuer and the evader, are user-defined constants, and is a discount factor. The first term ensures that the pursuers remain close to the evader, while the second term encourages cooperation between the agents. The cost decreases exponentially to ensure that the integral has a finite value in the absence of equilibrium points.

Let be a smooth function quantifying the value of the game when specific policies are followed starting from state . Then, we can define the corresponding Hamiltonian of the game as,


The optimal feedback policies , of this game are known to constitute a saddle point [3] such that,


Under the optimal policies (17),(18), the HJI equation is satisfied,


Evaluating the optimal pursuit policies yields the singular optimal solutions described by, , where is the partial derivative of the value function with respect to the state , calculated by solving (19). To obviate the need for bang-bang control, as derived from (17) and (18), we shall employ the predictive tracking technique described in Section II-A to derive approximate, easy-to-implement feedback controllers for the pursuing autonomous agents. Furthermore, by augmenting the predictive controller with learning mechanisms, the approximate controllers will have no need for explicit knowledge of , the evader’s policy.

The following theorem presents bounds on the optimality loss induced by the use of the look-ahead controller approximation.

Theorem 1.

Let the pursuit evasion game evolve according to the dynamics given by (14), where the evader is optimal with respect to (15) and the pursuers utilize the learning-based predictive tracking strategy given by (10). Then, the tracking error of the pursuers and the optimality loss due to the use of the predictive controller are bounded if , such that, , where with denoting the partial derivative of the game value with respect to the state component .

Proof: Consider the Hamiltonian function when the approximate controller, denoted and the NN-based prediction of the evader’s policy, are used,


Taking into account the nonlinear dynamics of the system (14), one can rewrite (20) in terms of the optimal Hamiltonian as,, where is the HJI equation that is obtained after substituting (17) and (18) in (16). Now, take the orbital derivative of the value function along the trajectories using the approximate controllers as, Substituting (20) yields Thus, since , ,

Hence for , we have . Thus is a forward invariant set, which implies that the tracking error and the optimality loss over any finite horizon is bounded.

Note that we do not use optimal control or MPC to solve the pursuit evasion problem. Instead, the controller is governed by (10), which is simple to implement and has low computational complexity.  ∎

III-B Deep Learning-Based Pursuit Evasion

A deep NN, consisting of hidden layers, describes a nonlinear mapping between its input space and output space . Each layer receives the output of the previous layer as an input and, subsequently, feeds its own output to the next layer. Each layer’s output consists of the weighted sum of its input alongside a bias term, filtered through an application-specific activation function.

Specifically, let be the input space of a specific layer, and the corresponding output space. Then the layer’s output is,

where is the input vector, gathered from training data or from the output of previous layers, is a collection of weights for each layer, the bias term and is the layer’s activation function. We note that it is typical to write the output of layer compactly, with slight abuse of notation, as,


where , and is the activation function of the previous layer, taking as input the vector .

It is known [13] that two-layer NNs possess the universal approximation property, according to which any smooth function can be approximated arbitrarily closely by an NN of two or more layers. Let be a simply connected compact set and consider the nonlinear function . Given any , there exists an NN structure such that,

where . We note that, typically, the activation function of the output layer is taken to be linear.

Evaluating the weight matrix in a network is the main concern of the area of machine learning. In this work, we employ the gradient descent based backpropagation algorithm. Given a collection of training data, stored in the tuple , where , , , we denote the output errors as . Then, the update equation for the weights at each optimization iteration is given by,


where denotes the learning rate. We note that the update index need not correspond to the sample index , since different update schedules leverage the gathered data in different ways [13]. It can be seen that in order for the proposed method to compute the pursuers’ control inputs, an accurate prediction of the future state of the evader is required. However, this presupposes that the pursuers themselves have access to the evader’s future decisions, an assumption that is, in most cases, invalid. Thus, we augment the pursuers’ controllers with an NN structure that learns to predict the actions of the evader, based on past recorded data.
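To make the layer equation and the gradient-descent weight update concrete, here is a minimal sketch of a network with one tanh hidden layer and a linear output layer, trained by full-batch backpropagation. The training data, target function, layer width, learning rate, and iteration count are all hypothetical choices for illustration; the paper's network architecture and training schedule are not specified in this text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training set: 2-D inputs and a smooth scalar target.
X = rng.uniform(-2.0, 2.0, size=(200, 2))
Y = (np.sin(X[:, 0]) + 0.5 * X[:, 1])[:, None]

n_hidden, lr = 16, 0.05                  # assumed width and learning rate
W1 = rng.normal(0.0, 0.5, size=(2, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(0.0, 0.5, size=(n_hidden, 1))
b2 = np.zeros(1)

losses = []
for _ in range(2000):
    # Forward pass: weighted sum plus bias, then activation (linear output layer).
    H = np.tanh(X @ W1 + b1)
    E = H @ W2 + b2 - Y                  # output errors
    losses.append(0.5 * np.mean(np.sum(E ** 2, axis=1)))

    # Backward pass: gradients of the mean squared-error loss.
    dY = E / len(X)
    dZ = (dY @ W2.T) * (1.0 - H ** 2)    # tanh'(z) = 1 - tanh(z)^2
    gW2, gb2 = H.T @ dY, dY.sum(axis=0)
    gW1, gb1 = X.T @ dZ, dZ.sum(axis=0)

    # Gradient-descent update: weights move against the loss gradient.
    W2 -= lr * gW2
    b2 -= lr * gb2
    W1 -= lr * gW1
    b1 -= lr * gb1

print("initial loss:", losses[0], "final loss:", losses[-1])
```

In the pursuit setting the inputs would be the pursuer-to-evader distances and the target the observed evasion behavior, but that mapping is only described qualitatively here.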

Initially, we assume that the evader’s strategy is computed by a feedback algorithm, given her relative position to the pursuers. This way, the unknown function we wish to approximate is , with, where, denote the distance of pursuer to the evader in the X and Y axes, respectively. In order to train the network, we let the pursuers gather data regarding the fleet’s position with respect to the evader, as well as her behavior over a predefined time window .

Increasing the time window will allow the pursuers to gather more training data for the predictive network. However, this will not only increase the computational complexity of the learning procedure, but will also make the pursuers slower to react to sudden changes in the evader’s behavior. Simulation results corroborate our choice of training parameters.  ∎

Subsequently, we denote by , the current prediction function for the evader’s strategy, i.e., , where , denotes the current weight estimate of the NN’s output layer, and is the current estimate of the hidden layers, parametrized by appropriate hidden weights.

While the learning algorithm for the evader’s behavior operates throughout the duration of the pursuit, thus making the approximation weights time-varying, we suppress their explicit dependence on time since the process is open-loop, in the sense that the system is learning in batches, rather than in a continuous fashion.  ∎

Algorithm 1 Deep Learning-Based and Predictive Pursuit Evasion

Inputs: , , and evasion strategy approximation weights .
Output: , .

1: Compute , .
2: Predict evader’s future behavior via (21).
3: Train NN as in (22).
4: Predict evader’s future state as .
5: Propagate pursuer dynamics to get .
6: Compute current Newton flow parameters using the expression in Section IV-C.
7: Compute control dynamics from (3).
8: Propagate actual system evolution using (14).
9: Append current distances to a stack of previous observations.
10: Update evader prediction network through (22).

IV Simulation Results

This section presents results for the problems briefly described in the previous section. First, the agnostic evader case is considered followed by the adversarial case. For the second case, single and multiple pursuer systems are considered separately. The controller is implemented on a Dubins vehicle. For the purpose of tracking, we define the system output to be , .

IV-A Single Pursuer - Agnostic Target

In this subsection, the controller is tested on a Dubins vehicle with the task of pursuing an agnostic target moving along a known trajectory. Since the vehicle has a constant speed and an input saturation is enforced, it has an inherent minimum turning radius. For this simulation, we set  m/s and the input saturation is first set to  rad/s and then to  rad/s. The evader moves along two semicircular curves with a constant speed which is less than .

As a consequence, when the pursuer catches up to the evader, it overshoots and has to go around a full circle to again start tracking. Naturally, lower turning radius translates to better tracking as the vehicle can make “tighter” turns. This can be seen when comparing the trajectories of the vehicle in Figure 2 with Figure 2. For the same trajectory of the evader, the tracking performance is far better in the second case. Once the pursuer catches up to the target, the maximum tracking error in the first case is approximately meters and only meter in the second case, shown in Figures 2 and 2. This is consistent with the fact that the ratio of the turning radii is .

Fig. 2: Agnostic evader with a large turning radius.

IV-B Single Pursuer - Adversarial Evader

The pursuer is again modelled as a Dubins vehicle, while the evader is modelled as a single integrator with a maximum velocity less than the speed of the pursuer. Hence, while the pursuer is faster, the evader is more agile, and can instantly change its direction of motion. In this and subsequent cases, the evader is considered adversarial in nature and uses game theory to choose evasion direction.

Let and be the position vector of the pursuer and evader respectively at time . First, the pursuer makes an estimate of the optimal evasion direction based on the relative position of the evader and itself at time using (13). Assuming this direction of evasion to be fixed over the prediction window from to gives the predicted position of the evader at all time instances in this interval, denoted as . Next, the pursuer estimates its own predicted position if its input is kept constant, called . Finally, is set as and the value of ( being the ensemble vector of the states of the pursuer and the evader) is used to compute the input differential equation (10).

Figure 3 shows the trajectories of the pursuer and the evader, with the goal for the evader set to the point . It can be observed that the evader moves towards the goal while the pursuer is far away and starts evasive maneuvers when the pursuer gets close to it, by entering the pursuer’s non-holonomic region. Figure 3 also displays the tracking error, defined as the distance between the pursuer and the evader, which is almost periodic. This is because the evader’s maneuver forces the pursuer to circle back. The peak tracking error after the pursuer catches up is slightly more than twice the turning radius, as expected.

Fig. 3: Trajectories for a single pursuer-evader system.

IV-C Multiple Pursuers - Adversarial Evader

While the previous section had only one pursuer, this simulation considers the case of two pursuers and a single evader. Having multiple pursuers means there must be cooperation between them in order to optimally utilize resources. Thus, a pursuer can no longer make decisions solely based on the position of the evader relative to itself. The positions of the rest of the pursuers must also be factored in. Thus we redefine the expression for to include these parameters as shown below for the case of two pursuers. Let be the distance between the two pursuers, and let


The first term ensures that the pursuers remain close to the evader, while the second term encourages cooperation between agents. The last term is added to repel pursuers apart if they come close to each other, as having multiple pursuers in close vicinity of each other is sub-optimal.

Figure 4 shows the trajectories of the pursuers and the evader when the goal for the evader is set to the point . In this case, the pursuers close in on the evader and trap it away from its goal due to their cooperative behavior. The evader is forced to continuously perform evasive maneuvers as the other pursuer closes in when the first has to make a turn. This can be seen more clearly in the tracking error plot given in Figure 5. After catching up with the evader, it can be seen that when one pursuer is at its maximum distance, the other is at its minimum. The results show good coordination between the pursuers and low tracking error, and are qualitatively comparable to [17].

Lastly, we present the results under the learning-based prediction. In Figure 7, we present a comparative result of the tracking error of the model-based algorithm vis-à-vis the NN-based control. Figure 7 showcases the quality of the performance of the proposed algorithm based on the game theoretic cost metric. From these figures, it can be seen that the NN structure offers fast predictive capabilities to the controller; hence the overall performance is comparable to the model based control.

Fig. 4: Trajectories for the two pursuer-single evader system.
Fig. 5: Evolution of the tracking error for the two pursuer-single evader system.
Fig. 6: Trajectories for two pursuers-single evader system with learning.
Fig. 7: Evolution of the tracking error for the systems with and without learning.

V Conclusion and Future Work

This work extends the framework of prediction-based nonlinear tracking in the context of pursuit evasion games. We present results for vehicle pursuit of agnostic targets, modeled as moving along known trajectories, as well as adversarial target tracking, where the evader evolves according to game-theoretic principles. Furthermore, to obviate the need for explicit knowledge of the evader’s strategy, we employ learning algorithms alongside the predictive controller. The overall algorithm is shown to produce comparable results to those in the literature, while it precludes the need for solving an optimal control problem.

Future work will focus on developing robustness guarantees that will allow for more realistic scenarios, where noise and external disturbances are taken into consideration.


  • [1] F. Allgöwer and A. Zheng (2012) Nonlinear model predictive control. Vol. 26, Birkhäuser. Cited by: §I.
  • [2] T. Alpcan and T. Başar (2010) Network security: a decision and game-theoretic approach. Cambridge University Press. Cited by: §I.
  • [3] T. Basar and G. J. Olsder (1999) Dynamic noncooperative game theory. Vol. 23, Siam. Cited by: §III-A, §III-A.
  • [4] S. Bhasin, R. Kamalapurkar, M. Johnson, K. G. Vamvoudakis, F. L. Lewis, and W. E. Dixon (2013) A novel actor–critic–identifier architecture for approximate optimal control of uncertain nonlinear systems. Automatica 49 (1), pp. 82–92. Cited by: §I.
  • [5] C. M. Bishop et al. (1995) Neural networks for pattern recognition. Oxford University Press. Cited by: §I.
  • [6] S. Devasia, D. Chen, and B. Paden (1996) Nonlinear inversion-based output tracking. IEEE Transactions on Automatic Control 41 (7), pp. 930–942. Cited by: §I.
  • [7] S. S. Haykin (2009) Neural networks and learning machines. Vol. 3, Pearson Upper Saddle River. Cited by: §I, §III-B.
  • [8] R. Isaacs (1999) Differential games: a mathematical theory with applications to warfare and pursuit, control and optimization. Courier Corporation. Cited by: §III-A, §III-A.
  • [9] A. Isidori and C.I. Byrnes (1990) Output regulation of nonlinear systems. IEEE Trans. Automat. Control 35, pp. 131–140. Cited by: §I.
  • [10] A. Kanellopoulos, K. Vamvoudakis, and Y. Wardi (2019) Predictive learning via lookahead simulation. In AIAA Scitech 2019 Forum, San Diego, California, January 7-11. Cited by: §I, §I.
  • [11] H.K. Khalil (1998) On the design of robust servomechanisms for minimum phase nonlinear systems. Proc. 37th IEEE Conf. Decision and Control, Tampa, FL, pp. 3075–3080. Cited by: §I.
  • [12] S. M. LaValle (2006) Planning algorithms. Cambridge university press. Cited by: §II-B.
  • [13] F. Lewis, S. Jagannathan, and A. Yesildirak (1998) Neural network control of robot manipulators and non-linear systems. CRC Press. Cited by: §III-B, §III-B.
  • [14] P. Martin, S. Devasia, and B. Paden (1996) A different look at output tracking: control of a vtol aircraft. Automatica 32 (1), pp. 101–107. Cited by: §I.
  • [15] K. S. Narendra and K. Parthasarathy (1990) Identification and control of dynamical systems using neural networks. IEEE Transactions on neural networks 1 (1), pp. 4–27. Cited by: §I, §I.
  • [16] H. J. Pesch, I. Gabler, S. Miesbach, and M. H. Breitner (1995) Synthesis of optimal strategies for differential games by neural networks. In New Trends in Dynamic Games and Applications, pp. 111–141. Cited by: §I.
  • [17] S. A. Quintero, D. A. Copp, and J. P. Hespanha (2015) Robust uav coordination for target tracking using output-feedback model predictive control with moving horizon estimation. In American Control Conference, Chicago, Illinois, July 1-3, Cited by: §I, §IV-C.
  • [18] J.B. Rawlings, D.Q. Mayne, and M.M. Diehl (2017) Model predictive control: theory, computation, and design, 2nd edition. Nob Hill, LLC. Cited by: §I.
  • [19] G. Saridis (1983) Intelligent robotic control. IEEE Transactions on Automatic Control 28 (5), pp. 547–557. Cited by: §I.
  • [20] S. Shivam, I. Buckley, Y. Wardi, C. Seatzu, and M. Egerstedt (2018) Tracking control by the newton-raphson flow: applications to autonomous vehicles. In European Control Conference, Naples, Italy, June 25-28, Cited by: §I, §II-A, §II-A.
  • [21] K. G. Vamvoudakis and F. L. Lewis (2010) Online actor–critic algorithm to solve the continuous-time infinite horizon optimal control problem. Automatica 46 (5), pp. 878–888. Cited by: §I.
  • [22] K. G. Vamvoudakis (2017) Q-learning for continuous-time linear systems: a model-free infinite horizon optimal control approach. Systems & Control Letters 100, pp. 14–20. Cited by: §I.
  • [23] D. Vrabie, K. G. Vamvoudakis, and F. L. Lewis (2013) Optimal adaptive control and differential games by reinforcement learning principles. Vol. 2, IET. Cited by: §I.
  • [24] Y. Wardi, C. Seatzu, M. Egerstedt, and I. Buckley (2017) Performance regulation and tracking via lookahead simulation: preliminary results and validation. In 56th IEEE Conf. on Decision and Control, Melbourne, Australia, December 12-15. Cited by: §I, §I, §II-A, §II-A, §II-A, §III-A.
  • [25] Y. Wardi, C. Seatzu, and M. Egerstedt (2018) Tracking control via variable-gain integrator and lookahead simulation: application to leader-follower multiagent networks. In Sixth IFAC Conference on Analysis and Design of Hybrid Systems (2018 ADHS), Oxford, the UK, July 11-13. Cited by: §I, §II-A, §II-A, §II-A.
  • [26] B. G. Weber and M. Mateas (2009) A data mining approach to strategy prediction. In 2009 IEEE Symposium on Computational Intelligence and Games, pp. 140–147. Cited by: §I.