I. Introduction
Autonomous vehicles have attracted dramatic attention from researchers in recent years due to advances in automation technology, high-speed communication networks, and new energy. Path planning and path following are two major tasks for the behaviour control of autonomous vehicles [1, 38]. Path planning is executed to plan the path considering safety constraints, and a controller is then used to follow this path accurately by considering the current states and providing suitable control. Path planning has been well explored by many researchers [33, 14]. However, path following still remains a problem due to the highly dynamic nature and the limited computation and communication resources of autonomous vehicles. The path following controller is expected to provide accurate control inputs in real time under constrained computation and communication. Path following control can be implemented using different controllers, e.g., proportional–integral–derivative (PID) control, state feedback controllers, model predictive control (MPC), and so on.
MPC is capable of handling multi-input multi-output (MIMO) systems with various constraints, making it especially suitable for the real-world autonomous vehicle path following problem. MPC dates back to the 1980s, when engineers in the process industry first began to deploy it in real-world practice [18]. Since then, the increasing computing power of microprocessors has greatly expanded its application scope [39]. MPC uses a system model to predict future behavior and selects the best control action by solving an optimization problem [2, 3, 23, 45, 50]. Despite the advances of MPC over the years [35, 34, 13, 30, 46], solving the constrained optimal control problem requires high computational power, which further increases as the system dimension and prediction horizon grow. This has hindered its application to autonomous vehicle path following, which requires a short sampling time but has limited computation power. To reduce the computational burden without significantly degrading control performance, event-triggered MPC (eMPC) has emerged as a promising paradigm in which, instead of at each time instant as in the traditional MPC implementation, the MPC problem is solved only when triggered by a predefined trigger condition [6, 31, 9, 20, 15, 24]. In such a framework, a triggering event can be defined based on either the deviation of the system states [6, 31, 9] or the cost function value [20, 15]. By solving the optimization problem only when necessary, eMPC can significantly reduce online computation. However, the trigger mechanism design, concerning when to trigger the optimization so as to preserve system performance while keeping the number of triggers low, still remains a challenge [45].
The most common event-trigger policy is the threshold-based policy, where an event is triggered if the predicted state trajectory and the real-time feedback diverge beyond a certain threshold [6, 31, 9]. However, the threshold calibration usually relies on knowledge of the closed-loop system behavior, which is not always available, especially for complex systems. To address this limitation, our prior work [8] investigated the use of a model-free RL technique, a simple linear Q-learning approach, to synthesize a triggering policy with the aim of achieving an optimal balance between control performance and computational efficiency. However, linear Q-learning has a hard time capturing the nonlinear event-trigger policy, leading to an unnecessarily high event frequency. Therefore, in this paper, we propose to use deep RL to learn the event-trigger policy, which enables the proposed framework to achieve better trade-offs between system performance and computation cost.
This paper addresses the autonomous driving path following problem using a novel eMPC framework. First, it extends the previous work [8] with an improved vehicle model, thereby removing the limitation of using only the front steering angle as the driving control. Second, we develop a model-free deep-RL-eMPC framework that uses deep RL to learn the event-trigger policy online, so that no prior knowledge of the closed-loop system is needed, which is essential for a dynamic and complex system. Both off-policy and on-policy RL methods are tested. Meanwhile, techniques including a prioritized experience replay (PER) buffer and long short-term memory (LSTM) are exploited to significantly improve training efficiency and control performance. The validity of the proposed deep-RL-eMPC is demonstrated using a nonlinear autonomous vehicle model, and the results show that our approach clearly outperforms the conventional threshold-based approach in [9] and the previous linear Q-learning based approach in [8].
The remainder of the paper is organized as follows. Section II reviews relevant literature on RL and MPC integration to provide more context for the presented work. Section III formulates the autonomous vehicle path following problem. Section IV presents the framework of eMPC with the triggering policy obtained from RL. The experiment setup and results of the proposed deep-RL-eMPC method on the autonomous vehicle path following problem are presented in Section V. Finally, concluding remarks are provided in Section VI.
II. Relevant Work on RL/MPC Integration
Utilizing RL to aid MPC is not new in the literature. For example, [16, 27] propose an off-policy actor-critic algorithm called DMPC, where an off-policy critic learns a value function while the actor utilizes MPC to interact with the environment. It is assumed that the system dynamics are known, but the cost function that MPC should minimize is unknown and is learned through the critic's value function estimation. Both analytical and numerical results demonstrate improvements in learning convergence.
RL can also be used to learn the system dynamics that are then used by MPC for prediction [11, 12, 29, 26, 40, 47, 21, 25, 7]. This approach is called model-based RL in [11, 12, 29, 26, 40, 47, 21, 25, 7]. Specifically, [11, 12, 29, 26, 40] study learning-based probabilistic MPC in the framework of RL, where the system dynamics and environment uncertainties are modeled as Gaussian processes (GP), whose parameters are iteratively identified through trial and error. The authors of [21, 25, 7] use a GP model to learn the errors between measurements and a nominal model, which are then used to set up the optimal control problem for MPC so as to guarantee robust constraint satisfaction.
Reference [47] combines RL and MPC in the context of surgical robot control. The system dynamics are modeled by an artificial neural network (ANN), whose parameters are identified through RL or through learning from demonstration. In the RL approach, the agent explores the action space using an ε-greedy strategy, collects observations, and iteratively trains the ANN to model the system dynamics, while MPC is used to optimize the action based on the trained ANN. In the learning-from-demonstration approach, the ANN parameters are initialized using observations collected from human operators.

Finally, RL can also be used to directly optimize the MPC control law. For example, [48] proposes a robust MPC where the control law is restricted to an affine function of the feedback, with the gain being precomputed offline and the offset being learnt by RL. Reference [48] additionally shows that the robust MPC can also reject disturbances when the Gaussian process model is unknown and learnt online. The authors in [5] investigated the use of a gradient-based Partially Observable Markov Decision Process (POMDP) algorithm [4] to train an RL recomputation policy for event-triggered MPC control to save energy. However, the solutions of POMDP algorithms often suffer from the high variance of the gradient estimate [49].

To the best of our knowledge, the use of deep RL to trigger MPC has not been reported in the literature. In this paper, we attempt to fill this gap by investigating deep RL-based event-triggered MPC, or deep-RL-eMPC, which learns the optimal event-trigger policy without requiring any knowledge of the closed-loop dynamics and therefore significantly reduces the amount of calibration.
III. Problem Formulation

This paper aims to improve autonomous vehicle path following control by proposing a systematic, algorithmic framework in which eMPC can be used without prior knowledge of the closed-loop system behavior. Our goal is to use an RL agent to learn the optimal event-trigger policy automatically.
III-A Task Description: Autonomous Vehicle Dynamics and Path Following Problem

In order to demonstrate the proposed deep-RL-eMPC and its improving techniques, a path following task is chosen. For a single-track vehicle model, the equations for the vehicle center of gravity (CG) and wheel dynamics are given by
\dot{X} = v_x \cos\psi - v_y \sin\psi   (1a)
m \dot{v}_x = F_{xf} + F_{xr} - F_{\mathrm{aero}} + m v_y \omega   (1b)
\dot{Y} = v_x \sin\psi + v_y \cos\psi   (1c)
m \dot{v}_y = F_{yf} + F_{yr} - m v_x \omega   (1d)
\dot{\psi} = \omega   (1e)
I_z \dot{\omega} = l_f F_{yf} - l_r F_{yr}   (1f)
where $X$ and $Y$ are the longitudinal and lateral position of the center of gravity of the vehicle, respectively; $\psi$ is the vehicle heading (yaw) angle in the global inertial frame; and $v_x$, $v_y$, and $\omega$ are, respectively, the vehicle longitudinal velocity, lateral velocity, and yaw rate in the vehicle frame. $F_{\mathrm{aero}}$ is the aerodynamic drag force [41] and $F_{xi}$ and $F_{yi}$ are the tire forces. $m$ is the vehicle mass, $I_z$ is the vehicle rotational inertia about the yaw axis, and $l_f$ and $l_r$ are the distances from the CG to the middle of the front and rear axle, respectively.
The tire forces $F_{xi}$ and $F_{yi}$ in (1b), (1d), expressed in the vehicle frame, can be modeled by
F_{xi} = F^{w}_{xi} \cos\delta_i - F^{w}_{yi} \sin\delta_i   (2a)
F_{yi} = F^{w}_{xi} \sin\delta_i + F^{w}_{yi} \cos\delta_i   (2b)
where $\delta_i$ is the wheel-road angle for wheel $i$, $i \in \{f, r\}$ represents the front or rear wheel, and $F^{w}_{xi}$ and $F^{w}_{yi}$ are the tire forces in the wheel frame, which can be obtained as
F^{w}_{xi} = \frac{T_i}{r_e}   (3a)
F^{w}_{yi} = \mu\, C_i\, F_{zi}\, \alpha_i   (3b)
where $T_i$ is the propulsion/braking torque along the axle, $r_e$ is the effective tire radius, $C_i$ is the tire cornering stiffness, $\mu$ characterizes the road surface, and $\alpha_i$ is the slip angle. We refer readers to [9] for a detailed computation of the slip angle $\alpha_i$.
The normal force $F_{zi}$ in (3b) can be modeled by static load transfer,
F_{zf} = \frac{l_r}{l_f + l_r}\, m g, \qquad F_{zr} = \frac{l_f}{l_f + l_r}\, m g   (4)
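To make the model concrete, the following is a minimal Python sketch of the single-track dynamics (1)–(4) as reconstructed above. It uses a small-angle slip-angle form and an illustrative parameter dictionary (the keys m, Iz, lf, lr, re, Cf, Cr, mu, ca and the even front/rear torque split are assumptions for illustration, not values or choices from the paper); the exact slip-angle computation follows [9].

```python
import numpy as np

def vehicle_dynamics(state, u, p):
    """Continuous-time single-track model, cf. (1)-(4); parameters are illustrative."""
    X, Y, psi, vx, vy, omega = state
    T, delta_f = u                                   # axle torque and front steering angle

    # Static load transfer (4)
    Fzf = p["m"] * 9.81 * p["lr"] / (p["lf"] + p["lr"])
    Fzr = p["m"] * 9.81 * p["lf"] / (p["lf"] + p["lr"])

    # Slip angles (simple kinematic form; see [9] for the detailed derivation)
    alpha_f = delta_f - np.arctan2(vy + p["lf"] * omega, max(vx, 0.1))
    alpha_r = -np.arctan2(vy - p["lr"] * omega, max(vx, 0.1))

    # Tire forces in the wheel frame (3); torque split between axles is an assumption
    Fxf_w, Fxr_w = T / (2 * p["re"]), T / (2 * p["re"])
    Fyf_w = p["mu"] * p["Cf"] * Fzf * alpha_f
    Fyr_w = p["mu"] * p["Cr"] * Fzr * alpha_r

    # Rotate front tire forces into the vehicle frame (2); rear wheel is not steered
    Fxf = Fxf_w * np.cos(delta_f) - Fyf_w * np.sin(delta_f)
    Fyf = Fxf_w * np.sin(delta_f) + Fyf_w * np.cos(delta_f)
    Fxr, Fyr = Fxr_w, Fyr_w

    Faero = p["ca"] * vx ** 2                        # aerodynamic drag [41]

    return np.array([
        vx * np.cos(psi) - vy * np.sin(psi),             # (1a) global X position
        vx * np.sin(psi) + vy * np.cos(psi),             # (1c) global Y position
        omega,                                           # (1e) yaw angle
        (Fxf + Fxr - Faero) / p["m"] + vy * omega,       # (1b) longitudinal velocity
        (Fyf + Fyr) / p["m"] - vx * omega,               # (1d) lateral velocity
        (p["lf"] * Fyf - p["lr"] * Fyr) / p["Iz"],       # (1f) yaw rate
    ])
```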
III-B Optimal Control Problem and its Goal

Consider a discrete-time system with the following dynamics

x_{t+1} = f(x_t, u_t)   (6)

where $x_t$ is the system state at discrete time $t$ and $u_t$ is the control input. Given a prediction horizon $N$, MPC aims to find the optimal control sequence $U^*_t$ and optimal state sequence $X^*_t$ by solving the following optimal control problem:
\min_{U_t, X_t} \; \sum_{k=0}^{N-1} \ell(x_{t+k|t}, u_{t+k|t})   (7a)
s.t. \; x_{t+k+1|t} = f(x_{t+k|t}, u_{t+k|t}), \quad k = 0, \dots, N-1   (7b)
x_{t|t} = x_t   (7c)
x_{t+k|t} \in \mathcal{X}, \quad k = 1, \dots, N   (7d)
u_{t+k|t} \in \mathcal{U}, \quad k = 0, \dots, N-1   (7e)
x_{t+N|t} \in \mathcal{X}_f   (7f)
where $U_t$ and $X_t$ are defined as $U_t \triangleq \{u_{t|t}, \dots, u_{t+N-1|t}\}$ and $X_t \triangleq \{x_{t|t}, \dots, x_{t+N|t}\}$, $\ell(\cdot, \cdot)$ is the stage cost function, $x_t$ denotes the real state or the current state estimate, and $u_{t+k|t}$ denotes the control action at time step $t+k$ computed at time $t$, with $\mathcal{X}$, $\mathcal{U}$, and $\mathcal{X}_f$ the state, input, and terminal constraint sets. For conventional time-triggered MPC, the above optimal control problem is solved at every sampling instant, and only the first element of $U^*_t$ is applied to the system as the control command, while all the remaining elements are discarded.
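For illustration, the receding-horizon procedure described above can be sketched as follows. This is a minimal single-shooting formulation of (7) using SciPy that only enforces input bounds and omits state and terminal constraints; the names f_d, stage_cost, u_lb, and u_ub are placeholders, not the paper's actual implementation.

```python
import numpy as np
from scipy.optimize import minimize

def rollout(x0, u_seq, f_d):
    """Roll the discrete dynamics (6) forward over the prediction horizon."""
    xs = [x0]
    for u in u_seq:
        xs.append(f_d(xs[-1], u))
    return xs

def mpc_cost(u_flat, x0, f_d, stage_cost, N, nu):
    """Cost (7a) evaluated along a single-shooting rollout of the candidate inputs."""
    u_seq = u_flat.reshape(N, nu)
    xs = rollout(x0, u_seq, f_d)
    return sum(stage_cost(x, u) for x, u in zip(xs[:-1], u_seq))

def solve_mpc(x0, f_d, stage_cost, N, nu, u_lb, u_ub, u_init=None):
    """Approximate solution of (7) with input bounds only, for brevity."""
    u0 = np.zeros(N * nu) if u_init is None else u_init.ravel()
    bounds = [(lo, hi) for _ in range(N) for lo, hi in zip(u_lb, u_ub)]
    res = minimize(mpc_cost, u0, args=(x0, f_d, stage_cost, N, nu),
                   bounds=bounds, method="SLSQP")
    return res.x.reshape(N, nu)                     # optimal control sequence U_t^*

def run_time_triggered(x0, f_d, stage_cost, N, nu, u_lb, u_ub, steps):
    """Conventional time-triggered MPC: re-solve at every step, apply the first input."""
    x, traj = x0, []
    for _ in range(steps):
        U = solve_mpc(x, f_d, stage_cost, N, nu, u_lb, u_ub)
        x = f_d(x, U[0])                            # apply first element, discard the rest
        traj.append(x)
    return traj
```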
Let $t$ and $t_e$ represent the current time step and the last event time, respectively; there thus exists an integer $j \geq 0$ such that $t = t_e + j\,\Delta T$, where $\Delta T$ is the sampling time of the discrete system. Let $a_t$ denote the triggering command in event-triggered MPC at time step $t$. When $a_t = 1$, the above optimal control problem is solved and the first element of the optimal control sequence computed at the current time step is used as the control command. When $a_t = 0$, the optimal control sequence computed at the last event, i.e., at time $t_e$, is shifted to determine the control command [9]. The control input can then be compactly represented as:

u_t = \begin{cases} u^*_{t|t}, & a_t = 1 \\ u^*_{t|t_e}, & a_t = 0 \end{cases}   (8)
To implement (8) for eMPC, a buffer can be used to store the optimal control sequence $U^*_{t_e}$ computed at the last event time $t_e$. At each time step, the event-trigger policy block generates $a_t$ based on the current feedback from the plant. In eMPC, only when $a_t = 1$ is a new control sequence computed by solving (7); its first element is implemented by the actuator as $u_t$, while the entire sequence is saved into the buffer. If $a_t = 0$, indicating the absence of an event, the control sequence currently stored in the buffer is shifted based on the time elapsed since the last event to determine the current control input $u_t$. This process is depicted in Fig. 1.
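The buffer-and-shift logic of (8) can be sketched as follows; the solve_ocp callable and the class interface are hypothetical and intended only to illustrate how the stored sequence is reused between events.

```python
class EventTriggeredMPC:
    """Implements the control law (8): re-solve on a_t = 1, otherwise shift the buffer."""

    def __init__(self, solve_ocp):
        self.solve_ocp = solve_ocp        # returns the optimal sequence U* for a given state
        self.buffer = None                # U* computed at the last event time t_e
        self.steps_since_event = 0        # j, with t = t_e + j * dT

    def control(self, x_t, a_t):
        if a_t == 1 or self.buffer is None:
            self.buffer = self.solve_ocp(x_t)        # trigger: solve (7) at the current state
            self.steps_since_event = 0
        else:
            self.steps_since_event += 1              # no event: reuse and shift the old plan
        j = min(self.steps_since_event, len(self.buffer) - 1)
        return self.buffer[j]                        # u_t as in (8)
```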
In general, the event can be generated by a certain event-trigger policy $\pi$, denoted as

a_t = \pi_\theta\big(x_t, X^*_{t_e}\big)   (9)

where $X^*_{t_e}$ is the optimal state sequence computed at the last event time $t_e$, $x_t$ is the real state (or the current state estimate if not directly measured), and $\theta$ are parameters characterizing the policy. It is worth noting that, for nonlinear constrained MPC, the design of the event-trigger policy is challenging and requires extensive calibration and prior knowledge of the closed-loop system behavior. Therefore, the design of the event-trigger policy and its calibration are usually problem specific and nontrivial. To address this limitation, the objective of this paper is to learn the optimal event-trigger policy using model-free deep RL techniques.
We discretize (1) to obtain a discrete-time model in the form of (6), with $x = [X, Y, \psi, v_x, v_y, \omega]^\top$ and $u = [T, \delta_f]^\top$, where $T$ is the axle driving torque and $\delta_f$ is the front steering angle. The stage cost of (7a) is defined as

\ell(x_{t+k|t}, u_{t+k|t}) = \|e(x_{t+k|t})\|^2_{Q} + \|u_{t+k|t}\|^2_{R}   (10)

where the first, nonlinear term penalizes the path tracking error $e(\cdot)$ and the second term penalizes large control efforts. Here the norm $\|z\|^2_{Q}$ is defined as $z^\top Q z$. More specifically, the MPC cost function in (7a) in this case can be equivalently represented as:

J(X_t, U_t) = \|e(X_t)\|^2_{\bar{Q}} + \|U_t\|^2_{\bar{R}}   (11)

where $\bar{Q}$ and $\bar{R}$ are block-diagonal matrices defined as $\bar{Q} \triangleq \mathrm{diag}(Q, \dots, Q)$ and $\bar{R} \triangleq \mathrm{diag}(R, \dots, R)$, and the terms independent of $X_t$ and $U_t$ are ignored.
IV. Event-Triggered MPC with Deep RL-Based Policy Learning

In this section, we present our proposed deep RL-based policy learning eMPC, or deep-RL-eMPC.

IV-A Deep-RL-eMPC Framework

The process of our deep-RL-eMPC framework is shown in Fig. 2. The RL agent learns the event-trigger policy parameters $\theta$ by continuously interacting with the environment. Specifically, at each time step, the agent sends an action to the environment. The environment then implements the eMPC following (8), simulates the dynamic system following (6), and emits an immediate reward following the designed reward function. The agent then observes the reward signal, updates $\theta$, and transitions to the next state.
For an eMPC problem, the discrete action space for the RL agent is defined as $\mathcal{A} = \{0, 1\}$, where the event is triggered when $a_t = 1$ and is not triggered when $a_t = 0$. As the feedback from the environment, the immediate reward function is defined as

r_t = -\ell(x_t, u_t) - \beta\, a_t   (12)

where the first term measures the closed-loop system performance and the second term measures the cost of triggering events. Note that $\ell$ is the stage cost and is computed using the real state (or the current state estimate if not directly measured) and the real-time control (8). Furthermore, $\beta$ is a hyperparameter used to balance the control performance index and the triggering frequency. One can fine-tune this hyperparameter to trade off control performance against computational cost.

The complete deep-RL-eMPC algorithm is shown in Algorithm 1. Its inputs are the total number of training epochs, the length of each episode representing the total training time in each epoch, the discount factor in the reward function, the discrete time step, and the number of experiences sampled at each update (the batch size). The output of Algorithm 1 is the learned policy parameters $\theta$. The RL agent interacts with the environment for the specified number of epochs (Lines 2–24). After initialization, Line 5 shows how the action is chosen. Lines 7–12 implement the event-triggered MPC to compute the control command $u_t$, which is used to simulate the dynamical system (6) (Line 13). After that, the environment emits the next state and the immediate reward (Line 16), which are observed by the RL agent (Line 18). The latest experience tuple is then added into an experience buffer (Line 19). The RL parameters $\theta$ are updated using a batch of experiences sampled from the experience buffer (Line 20). The RL agent then moves to the next state (Line 21). After each epoch, the RL agent is reset for the next epoch (Line 3). Lines 7–16 are part of the environment, whose computation is unknown to the RL agent. Note that the agent only observes the environment outputs, i.e., the next state and the reward.
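A schematic Python rendering of Algorithm 1 is given below to clarify how the RL agent, the trigger, the eMPC, and the plant interact; the agent and environment interfaces (act, step, replay, update, size) are placeholders and do not reflect the authors' actual implementation.

```python
def train_deep_rl_empc(agent, env, n_epochs, episode_len, batch_size):
    """Schematic version of Algorithm 1; method names on `agent` / `env` are placeholders."""
    for epoch in range(n_epochs):                       # outer loop over epochs (Lines 2-24)
        s = env.reset()                                 # reset plant and control buffer
        for t in range(episode_len):
            a = agent.act(s)                            # trigger command a_t in {0, 1}
            # Environment step: eMPC control (8), plant update (6), reward (12)
            s_next, r, done = env.step(a)
            agent.replay.add((s, a, r, s_next, done))   # store the latest experience tuple
            if agent.replay.size() >= batch_size:       # update once enough samples exist
                agent.update(agent.replay.sample(batch_size))
            s = s_next
            if done:
                break
    return agent
```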
IV-B Deep RL Algorithms and Improving Techniques

The framework shown in Fig. 2 and Algorithm 1 is a general frame that can accommodate different RL algorithms. In this paper, we investigate three different RL agents, namely Double Q-learning (DDQN) [37], Proximal Policy Optimization (PPO) [44], and Soft Actor-Critic (SAC) [19], and show that the proposed framework is also suitable for other RL algorithms.
In this subsection, we first briefly describe these three deep RL algorithms. Then two improving techniques for the RL agent, Prioritized Experience Replay (PER) and Long Short-Term Memory (LSTM), are presented.
IV-B1 Double Q-learning
The deep Q-network is a form of Q-learning that uses a neural network to approximate the action-value function. To address the overestimation of Q-values in the deep Q-network [36], Double Q-learning (DDQN) explicitly separates action selection from action evaluation, which allows each step to use a different function approximator and yields a better overall approximation of the action-value function [37]. DDQN improves the deep Q-network by replacing its target with $y_t = r_{t+1} + \gamma\, Q\big(s_{t+1}, \arg\max_{a} Q(s_{t+1}, a; \theta_t); \theta^-_t\big)$, where $\theta_t$ and $\theta^-_t$ are the online and target network parameters, resulting in the Double Q-learning loss:

L(\theta_t) = \mathbb{E}\big[\big(y_t - Q(s_t, a_t; \theta_t)\big)^2\big]   (13)
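In a PyTorch-style implementation, the target and loss in (13) can be computed as sketched below, with the online network selecting the greedy action and the target network evaluating it; tensor shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F

def ddqn_loss(q_net, target_net, batch, gamma):
    """Double Q-learning loss (13): the online network selects the action,
    the target network evaluates it."""
    s, a, r, s_next, done = batch                # tensors with leading batch dimension
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)        # argmax under theta_t
        q_next = target_net(s_next).gather(1, a_star).squeeze(1)  # evaluated under theta_t^-
        y = r + gamma * (1.0 - done) * q_next
    return F.mse_loss(q_sa, y)
```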
IV-B2 PPO

PPO is an on-policy policy-gradient method that improves a stochastic policy by maximizing a clipped surrogate objective, retaining much of the stability of trust-region methods [43] while being simpler to implement and tune [44].

IV-B3 Soft Actor-Critic

SAC achieves state-of-the-art performance across a wide range of continuous-action control problems and updates a stochastic actor-critic policy in an off-policy way. SAC strikes a good exploration-exploitation trade-off via entropy regularization.

In this paper, we adapt SAC and PPO to the discrete action space setting following the discrete categorical distribution design in [10]. For further details, we refer to DDQN [37], PPO [44], and SAC [19].
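For the discrete trigger action space, the actor used by such discrete SAC/PPO variants can be realized with a categorical output head as sketched below, following the general recipe of [10]; the hidden widths and the two-layer body are illustrative choices rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DiscretePolicy(nn.Module):
    """Categorical policy over the two trigger actions {no trigger, trigger}."""

    def __init__(self, state_dim, n_actions=2, hidden=128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.logits = nn.Linear(hidden, n_actions)

    def forward(self, s):
        dist = torch.distributions.Categorical(logits=self.logits(self.body(s)))
        a = dist.sample()
        return a, dist.log_prob(a), dist.entropy()   # entropy term used by SAC/PPO updates
```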
The training performance of the proposed deep-RL-eMPC framework depends on the quality of the selected experience samples, so how to choose them is critical when using off-policy RL algorithms. The experience replay buffer is a fixed-size buffer that holds the most recent transitions collected by the policy [32, 17]. In RL, the weight updates and optimization of the neural networks are based on experience replay. The experience replay in the original DDQN uniformly samples the stored experience to train the network weights. However, the importance of experiences differs: some experiences are more valuable than others in the long run, and important experiences should be replayed more frequently. To address this problem, prioritized experience replay has been proposed [42] to replay more frequently those transitions with high expected learning progress, as measured by the magnitude of their temporal-difference (TD) error. Specifically, the probability of sampling transition $i$ is defined as follows:

P(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}   (15)

where $\alpha$ controls how much prioritization is applied; when $\alpha = 0$, the experiences are sampled uniformly. Here $p_i$ represents the priority of transition $i$, which is initialized as 1 and updated based on the TD error of the transition.
More specifically, to alleviate the bias in the gradient magnitudes introduced by priority replay, importance sampling (IS) is introduced in [42] as:

w_i = \left(\frac{1}{N_{\mathrm{buf}}} \cdot \frac{1}{P(i)}\right)^{\beta_{\mathrm{IS}}}   (16)

where $\beta_{\mathrm{IS}}$ is a hyperparameter annealing the amount of importance-sampling correction over time and $N_{\mathrm{buf}}$ is the size of the experience buffer. The weight $w_i$ is then used in the Q-learning update by replacing the TD error $\delta_i$ with $w_i \delta_i$. In practice, we can apply PER by replacing line 24 in Algorithm 1 with the designed PER scheme.
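A compact proportional-prioritization replay buffer implementing (15) and (16) is sketched below; the default exponents (alpha = 0.6, beta_is = 0.4) are common choices from [42] rather than values used in this paper, and the list-based storage is for readability only.

```python
import numpy as np

class PrioritizedReplay:
    """Proportional prioritized experience replay, following (15)-(16) and [42]."""

    def __init__(self, capacity, alpha=0.6, eps=1e-6):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.data, self.prio = [], []

    def add(self, transition):
        self.data.append(transition)
        self.prio.append(max(self.prio, default=1.0))    # new samples get maximal priority
        if len(self.data) > self.capacity:
            self.data.pop(0); self.prio.pop(0)

    def size(self):
        return len(self.data)

    def sample(self, batch_size, beta_is=0.4):
        p = np.asarray(self.prio) ** self.alpha
        P = p / p.sum()                                  # sampling probabilities (15)
        idx = np.random.choice(len(self.data), batch_size, p=P)
        w = (len(self.data) * P[idx]) ** (-beta_is)      # importance-sampling weights (16)
        w /= w.max()                                     # normalize for stability
        return [self.data[i] for i in idx], idx, w

    def update_priorities(self, idx, td_errors):
        for i, d in zip(idx, td_errors):
            self.prio[i] = abs(d) + self.eps             # priority from TD-error magnitude
```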
To encode the historical information in the network, a straightforward way is to feed all historical states to the RL agent, but this increases the state dimension significantly and may distract the attention of the RL agent from the recent input states. To address this challenge, recurrent neural networks (RNN) have been developed; an RNN is a class of artificial neural networks that can encode and learn temporal information. A traditional RNN does not have the ability to retain long-term memory and suffers from the vanishing gradient problem. Long short-term memory (LSTM) [22], a type of RNN architecture, solves this issue by using feedback connections and is thus suitable for long time-series data. In this paper, we explore the use of an LSTM as the last hidden layer to extract representations from different state types and encode the history information.
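A sketch of a Q-network that uses an LSTM as the last hidden layer to encode the recent state history is given below; the layer widths are illustrative and only roughly follow the architecture described later in Section V-A.

```python
import torch
import torch.nn as nn

class RecurrentQNet(nn.Module):
    """Q-network whose last hidden layer is an LSTM over the recent state history."""

    def __init__(self, state_dim, n_actions=2, hidden=128):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_actions)     # Q-values for {no trigger, trigger}

    def forward(self, s_seq, hc=None):
        # s_seq: [batch, time, state_dim] sequence of recent environment states
        z, hc = self.lstm(self.fc(s_seq), hc)
        return self.head(z[:, -1]), hc               # Q-values at the latest time step
```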
V. Autonomous Vehicle Path Following Using Deep-RL-eMPC

In this section we apply the proposed deep-RL-eMPC to a nonlinear autonomous vehicle path tracking problem. The prediction horizon of the MPC is fixed, with upper and lower bounds imposed on all control inputs. Since an autonomous vehicle requires a short control sampling time but has limited onboard computation power, this nonlinear path tracking problem is a good example to demonstrate the proposed deep-RL-eMPC.
V-A RL Structure and Settings
In this paper, we encode the input state with one fully connected (FC) layer with 128 neurons, followed by two 128-neuron FC layers. In the LSTM design, we replace the last FC layer with a 128-unit LSTM layer. The last layer outputs two Q-values corresponding to the two actions, i.e., trigger and not trigger. The target network in DDQN is updated periodically.

The state of the environment is defined to be $s_t = (x_t, X^*_{t_e})$, where, as mentioned above, $x_t$ is the state estimate of the dynamical system and $X^*_{t_e}$ is the MPC prediction made at the last event. The reward function follows (12), with $\ell$ defined as follows:

\ell(x_t, u_t) = \|e(x_t)\|^2_{Q} + \|u_t\|^2_{R}   (17)

where $x_t$ is the real state (or the current state estimate if not directly measured) and $u_t$ is the real-time applied control computed by (8). The return for one episode in the RL algorithm is then:

G = \sum_{t=1}^{T_s} r_t = -\sum_{t=1}^{T_s} \big(\ell(x_t, u_t) + \beta\, a_t\big)   (18)

where $G$ is the episodic return of the RL algorithm, $T_s$ is the number of steps in the episode, and $\beta$ is the hyperparameter introduced in (12) to balance control performance and event-trigger frequency. To evaluate the performance of the different RL algorithms in our deep-RL-eMPC framework, we adopt the following two evaluation metrics: the total MPC cost $J_{\mathrm{MPC}}$ and the event-triggering frequency $f$, which are defined as follows:

J_{\mathrm{MPC}} = \sum_{t=1}^{T_s} \ell(x_t, u_t)   (19)

f = \frac{1}{T_s} \sum_{t=1}^{T_s} a_t   (20)
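Given the logged stage costs and trigger commands of one episode, the episodic return (18) and the two evaluation metrics (19), (20) can be computed as in the following sketch, assuming the undiscounted return written above.

```python
import numpy as np

def evaluate_episode(stage_costs, triggers, beta):
    """Episode metrics: return (18), total MPC cost J_MPC (19), trigger frequency f (20).
    stage_costs[t] is l(x_t, u_t) and triggers[t] is a_t for each step of the episode."""
    stage_costs, triggers = np.asarray(stage_costs), np.asarray(triggers)
    J_mpc = stage_costs.sum()                        # (19)
    freq = triggers.mean()                           # (20)
    ret = -(stage_costs + beta * triggers).sum()     # (18), rewards summed over the episode
    return ret, J_mpc, freq
```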
TABLE I: Evaluation return, event-triggering frequency f, and MPC cost J_MPC for the threshold-based benchmark and the RL-based triggering policies, for the three tested values of the trade-off weight β (increasing from top to bottom; the first block corresponds to no trigger penalty).

              Threshold     LSTDQ        SAC          DDQN         DDQN+LSTM+PER  PPO          PPO+LSTM
return        1.606         0.062        0.058        0.056        0.058          0.055        0.055
f / J_MPC     0.118/1.606   0.902/0.062  0.99/0.058   0.902/0.056  0.99/0.058     0.99/0.058   0.99/0.055
return        1.618         0.157        0.158        0.152        0.137          0.119        0.112
f / J_MPC     0.118/1.606   0.931/0.157  0.98/0.058   0.951/0.055  0.794/0.056    0.594/0.059  0.594/0.059
return        1.728         0.66         1.015        0.627        0.431          0.634        0.529
f / J_MPC     0.118/1.606   0.559/0.660  0.922/0.075  0.5/0.117    0.255/0.171    0.515/0.114  0.515/0.069
We train the off-policy RL algorithms over 50,000 steps, which is around 500 episodes of fixed length and sampling time. On-policy algorithms, e.g., PPO, often require longer training but offer improved stability [10]; we therefore train them for 1,000 episodes to ensure convergence. The discount factor, batch size, and learning rate are kept fixed, and the replay buffer size is set to 5,000. Also, ε-greedy exploration is adopted in DDQN, with ε linearly decaying from 1.0 to 0.01 during the first 5,000 steps of training.
V-B Simulation Results and Analysis
Numerical simulation results on the evaluation returns for the three values of β, with the threshold-based benchmark and different variants of the RL algorithms, are summarized in Tab. I. The simple linear Q-learning method (least-squares temporal difference Q-learning, LSTDQ) [8] is also included as a benchmark. To measure the computational burden required by the different RL algorithms and by MPC, we run the simulation 10,000 times and use the average time as the time cost. The results show that the average time cost of one MPC computation is about 0.1 s, while the average time cost of evaluating the RL policies considered in this paper is negligible by comparison; hence the time spent on the decision making of the RL algorithms can be ignored. Overall, fewer MPC queries therefore translate into a lower computational burden.
The threshold-based event-trigger policy [9] depends on a manually tuned threshold to determine when an event is triggered. However, this method is very sensitive to the tracking error and is susceptible to over-triggering when the error is large. As a result, the return of the threshold-based method stays at roughly the same poor level for all three values of β, much worse than the RL-based methods, as shown in Tab. I.
Comparing LSTDQ, SAC, DDQN, and PPO, the experimental results clearly show that the deep-RL-eMPC framework achieves a better evaluation return than the conventional threshold-based approach and the previous LSTDQ for all three values of β. PPO presents the best result for the two smaller values of β, while DDQN performs better in terms of evaluation return for the largest β, partly due to its reduced overestimation. To show the flexibility of the proposed framework, a PER buffer and LSTM are employed to foster the exploration and training efficiency of DDQN and PPO. Since PPO is an on-policy RL method, PER cannot be applied to it, so only PPO+LSTM is tested. Specifically, DDQN+LSTM+PER and PPO+LSTM are implemented and compared. The experimental results show that LSTM and PER significantly increase the evaluation return of the system, outperforming the baseline methods. SAC performs well for the smallest β, while it fails in the more challenging cases with larger β. The intrinsic reason for the poor performance of SAC deserves investigation in future work.
Recall that the hyperparameter β can be used to balance control performance and triggering frequency. When β = 0, RL triggers MPC at nearly every time step and achieves the smallest tracking error. As the value of β increases, the reward function (12) penalizes triggering MPC more heavily, resulting in less frequent events and higher MPC costs J_MPC. The larger β is, the larger the penalty the system incurs for triggering events. From Tab. I, we can see that when β is larger, the system tends to give smaller returns because of the heavier punishment for triggering events.
Fig. 3 shows the path following error and the event-triggering commands when using three different RL algorithms (LSTDQ, DDQN+LSTM+PER, PPO+LSTM) in the deep-RL-eMPC framework under the different values of β in the reward (18). The first row shows the comparison of the tracking error of the three algorithms; the corresponding triggering commands during the process are shown in the second row for LSTDQ, in the third row for DDQN+LSTM+PER, and in the fourth row for PPO+LSTM. The best deep-RL-eMPC results are obtained by PPO+LSTM for the two smaller values of β and by DDQN+LSTM+PER for the largest β; the corresponding MPC costs and triggering frequencies are reported in Tab. I. When β = 0, there is no penalty on triggering MPC, so the RL agent triggers MPC at nearly every sampling time and the path tracking error is the smallest. For the intermediate β, the RL agent tends to trigger an event when the tracking error is large and keeps silent when the error approaches zero. For the largest β, DDQN+LSTM+PER achieves the best performance; its event-trigger pattern is similar to that of the intermediate case, but with a lower triggering frequency. It is worth noting that, for each case, DDQN+LSTM+PER triggers MPC less frequently (resulting in less MPC computation) while incurring a smaller MPC cost (resulting in better control performance). We can therefore conclude that DDQN+LSTM+PER and PPO+LSTM outperform the previous LSTDQ method presented in [8].
VI. Conclusion
This paper investigated a path following problem for autonomous driving. A novel event-triggered model predictive control (eMPC) framework with the triggering policy obtained from deep reinforcement learning was presented to solve the problem. A reward function was proposed to balance control performance and event-trigger frequency through a hyperparameter β. Compared to existing eMPC, the proposed algorithm does not require any knowledge of the closed-loop dynamics (i.e., it is model-free) and offers better performance. We also showed that incorporating techniques such as prioritized experience replay and long short-term memory can significantly enhance the performance. The learnt deep RL-based triggering policy effectively decreases the computational burden while achieving satisfactory control performance. In future work, we will consider time-varying computational budgets and costs within this deep-RL-eMPC framework for autonomous driving path following, as well as other applications of relevance and impact. Additionally, we will examine the stability and convergence of the proposed deep-RL-eMPC framework.
References
[1] (2018) An overview of nature-inspired, conventional, and hybrid methods of autonomous vehicle path planning. Journal of Advanced Transportation 2018.
[2] (2018) Deep reinforcement learning for event-triggered control. In 2018 IEEE Conference on Decision and Control (CDC), pp. 943–950.
[3] (2019) Event-triggered pulse control with model learning (if necessary). In 2019 American Control Conference (ACC), pp. 792–797.
[4] (2001) Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research 15, pp. 319–350.
[5] (2021) Optimization of the model predictive control update interval using reinforcement learning. IFAC-PapersOnLine 54 (14), pp. 257–262.
[6] (2017) Robust event-triggered MPC with guaranteed asymptotic bound and average sampling rate. IEEE Transactions on Automatic Control 62 (11), pp. 5694–5709.
[7] (2019) Data-driven model predictive control for trajectory tracking with a robotic arm. IEEE Robotics and Automation Letters 4 (4), pp. 3758–3765.
[8] (June 8–10, 2022) Reinforcement learning-based event-triggered model predictive control for autonomous vehicle path following. In 2022 American Control Conference, Atlanta, GA.
[9] (August 8–11, 2021) Comparison of event-triggered model predictive control for autonomous vehicle path tracking. In 2021 IEEE Conference on Control Technology and Applications (CCTA), San Diego, CA.
[10] (2019) Soft actor-critic for discrete action settings. arXiv preprint arXiv:1910.07207.
[11] (2019) Reinforcement learning boat autopilot: a sample-efficient and model predictive control based approach. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2868–2875.
[12] (2021) Autonomous boat driving system using sample-efficient model predictive control-based reinforcement learning approach. Journal of Field Robotics 38 (3), pp. 331–354.
[13] (2018) Model predictive control and its application in agriculture: a review. Computers and Electronics in Agriculture 151, pp. 104–117.
[14] (2010) Path planning for autonomous vehicles in unknown semi-structured environments. The International Journal of Robotics Research 29 (5), pp. 485–501.
[15] (December 12–15, 2011) Novel event-triggered strategies for model predictive controllers. In 2011 50th IEEE Conference on Decision and Control and European Control Conference, Orlando, FL, pp. 3392–3397.
[16] (2019) Deep value model predictive control. arXiv preprint arXiv:1910.03358.
[17] (2020) Revisiting fundamentals of experience replay. In International Conference on Machine Learning, pp. 3061–3071.
[18] (2010) Model predictive control tuning methods: a review. Industrial & Engineering Chemistry Research 49 (8), pp. 3505–3515.
[19] (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pp. 1861–1870.
[20] (2015) Event-based robust sampled-data model predictive control: a non-monotonic Lyapunov function approach. IEEE Transactions on Circuits and Systems I: Regular Papers 62 (10), pp. 2555–2564.
[21] (2019) Cautious model predictive control using Gaussian process regression. IEEE Transactions on Control Systems Technology 28 (6), pp. 2736–2743.
[22] (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
[23] (2020) Event-triggered reinforcement learning; an application to buildings' micro-climate control. In AAAI Spring Symposium: MLPS.
[24] (2022) Event-triggered model predictive control for autonomous vehicle with rear steering. SAE Technical Paper (2022-01-0877).
[25] (2019) Learning-based model predictive control for autonomous racing. IEEE Robotics and Automation Letters 4 (4), pp. 3363–3370.
[26] (2018) Data-efficient reinforcement learning with probabilistic model predictive control. In International Conference on Artificial Intelligence and Statistics, pp. 1701–1710.
[27] (2020) Practical reinforcement learning for MPC: learning from sparse objectives in under an hour on a real robot. In Learning for Dynamics and Control, pp. 211–224.
[28] (June 28–July 1, 2015) Kinematic and dynamic vehicle models for autonomous driving control design. In 2015 IEEE Intelligent Vehicles Symposium, Seoul, Korea, pp. 1094–1099.
[29] (2020) Sample-and-computation-efficient probabilistic model predictive control with random features. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 307–313.
[30] (2021) A review on the application of the MPC technology in wind power control of wind farms. Journal of Energy and Power Technology 3 (3), pp. 1–1.
[31] (2014) Event-triggered robust model predictive control of continuous-time nonlinear systems. Automatica 50 (5), pp. 1507–1513.
[32] (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning 8 (3), pp. 293–321.
[33] (2017) Path planning for autonomous vehicles using model predictive control. In 2017 IEEE Intelligent Vehicles Symposium (IV), pp. 174–179.
[34] (2020) Computationally efficient MPC for path following of underactuated marine vessels using projection neural network. Neural Computing and Applications 32 (11), pp. 7455–7464.
[35] (2020) Computationally efficient stochastic MPC: a probabilistic scaling approach. In 2020 IEEE Conference on Control Technology and Applications (CCTA), pp. 25–30.
[36] (2013) Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
[37] (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533.
[38] (2021) Real-time implementation of randomized model predictive control for autonomous driving. IEEE Transactions on Intelligent Vehicles 7 (1), pp. 11–20.
[39] (2016) Introduction to the History of Computing: A Computing History Primer. Springer.
[40] (2016) Learning-based nonlinear model predictive control to improve vision-based mobile robot path tracking. Journal of Field Robotics 33 (1), pp. 133–152.
[41] (2011) Vehicle Dynamics and Control. Springer Science & Business Media.
[42] (2015) Prioritized experience replay. arXiv preprint arXiv:1511.05952.
[43] (2015) Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897.
[44] (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
[45] (2020) Machine learning in event-triggered control: recent advances and open issues. arXiv preprint arXiv:2009.12783.
[46] (2018) Model predictive control (MPC) for enhancing building and HVAC system energy efficiency: problem formulation, applications and opportunities. Energies 11 (3), pp. 631.
[47] (2019) Autonomous tissue manipulation via surgical robot using learning based model predictive control. In 2019 International Conference on Robotics and Automation (ICRA), pp. 3875–3881.
[48] (2018) Learning-based robust model predictive control with state-dependent uncertainty. IFAC-PapersOnLine 51 (20), pp. 442–447.
[49] (2015) ACIS: an improved actor-critic method for POMDPs with internal state. In 2015 IEEE 27th International Conference on Tools with Artificial Intelligence (ICTAI), pp. 369–376.
[50] (2021) Event-triggered model predictive control with a statistical learning. IEEE Transactions on Systems, Man, and Cybernetics: Systems 51 (4), pp. 2571–2581.