
Event-Triggered Model Predictive Control with Deep Reinforcement Learning for Autonomous Driving

Event-triggered model predictive control (eMPC) is a popular optimal control method that aims to alleviate the computation and/or communication burden of MPC. However, it generally requires a priori knowledge of the closed-loop system behavior along with the communication characteristics for designing the event-trigger policy. This paper addresses this challenge by proposing an efficient eMPC framework and demonstrates its successful implementation on autonomous vehicle path following. First, a model-free reinforcement learning (RL) agent is used to learn the optimal event-trigger policy without the need for complete knowledge of the system dynamics and communication in this framework. Furthermore, techniques including a prioritized experience replay (PER) buffer and long short-term memory (LSTM) are employed to foster exploration and improve training efficiency. In this paper, we use the proposed framework with three deep RL algorithms, i.e., Double Q-learning (DDQN), Proximal Policy Optimization (PPO), and Soft Actor-Critic (SAC), to solve this problem. Experimental results show that all three deep RL-based eMPC (deep-RL-eMPC) variants achieve better evaluation performance than the conventional threshold-based and the previous linear Q-learning based approach in autonomous path following. In particular, PPO-eMPC with LSTM and DDQN-eMPC with PER and LSTM obtain a superior balance between closed-loop control performance and event-trigger frequency. The associated code is open-sourced and available at: https://github.com/DangFengying/RL-based-event-triggered-MPC.


I Introduction

Autonomous vehicles have attracted significant research attention in recent years due to advances in automation, high-speed communication networks, and new energy technology. Path planning and path following are two major tasks for the behavior control of autonomous vehicles [1, 38]. Path planning is executed to plan the path considering safety constraints, and a controller is then used to follow this path accurately by considering the current states and providing suitable control inputs. Path planning has been well explored by many researchers [33, 14]. However, path following remains a challenge due to the highly dynamic behavior and the limited computation and communication resources of autonomous vehicles. The path following controller is expected to provide accurate control inputs in real time under constrained computation and communication. Path following control can be implemented using different controllers, e.g., proportional-integral-derivative (PID) control, state feedback controllers, model predictive control (MPC), and so on.

MPC is capable of handling multi-input multi-output (MIMO) systems with various constraints, making it especially suitable for real-world autonomous vehicle path following problems. MPC dates back to the 1980s, when engineers in the process industry first began to deploy it in practice [18]. Since then, the increasing computing power of microprocessors has greatly expanded its application scope [39]. MPC uses a system model to predict future behavior and selects the best control action by solving an optimization problem [2, 3, 23, 45, 50]. Despite the advances of MPC over the years [35, 34, 13, 30, 46], solving the constrained optimal control problem requires high computational power, which further increases with the system dimension and prediction horizon. This has hindered its application to autonomous vehicle path following, which requires a short sampling time but runs on limited computation power. To reduce the computational burden without significantly degrading control performance, event-triggered MPC (eMPC) has emerged as a promising paradigm in which the MPC optimization is solved only when triggered by a predefined trigger condition, instead of at every time instant as in the traditional MPC implementation [6, 31, 9, 20, 15, 24]. In such a framework, a triggering event can be defined based on either the deviation of the system states [6, 31, 9] or the cost function value [20, 15]. By solving the optimization problem only when necessary, eMPC can significantly reduce online computation. However, the trigger mechanism design, i.e., deciding when to trigger the optimization so as to preserve system performance while keeping the number of triggers low, remains a challenge [45].

The most common event-trigger policy is threshold-based, where an event is triggered if the predicted state trajectory and the real-time feedback diverge beyond a certain threshold [6, 31, 9]. However, threshold calibration usually relies on knowledge of the closed-loop system behavior, which is not always available, especially for complex systems. To address this limitation, our prior work [8] investigates the use of a model-free RL technique, a simple linear Q-learning approach, to synthesize a triggering policy with the aim of achieving the optimal balance between control performance and computational efficiency. However, linear Q-learning has a hard time capturing the nonlinear event-trigger policy, leading to unnecessarily high event frequency. Therefore, in this paper, we propose to use deep RL to learn the event-trigger policy, which allows the proposed framework to achieve better trade-offs between system performance and computation cost.
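For concreteness, a minimal sketch of such a threshold-based trigger is given below. It is illustrative only: the variable names, the error norm, and the threshold value are assumptions rather than the calibration used in [6, 31, 9].

import numpy as np

def threshold_trigger(x_measured, x_predicted, threshold=0.05):
    """Trigger an MPC update when the measured state deviates from the state
    predicted at the last event by more than a fixed threshold.

    x_measured, x_predicted: state vectors (np.ndarray)
    threshold: calibration constant (assumed value, problem specific)
    Returns 1 (trigger) or 0 (no trigger).
    """
    deviation = np.linalg.norm(x_measured - x_predicted)
    return int(deviation > threshold)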

This paper addresses the autonomous driving path following problem using a novel eMPC framework. First, it extends the previous work [8] with an improved vehicle model, thereby removing the limitation of using only the front steering angle as the driving control. Second, we develop a model-free deep-RL-eMPC framework that uses deep RL to learn the event-trigger policy online, so that no prior knowledge of the closed-loop system is needed, which is essential for a dynamic and complex system. Both off-policy and on-policy RL methods are tested. Meanwhile, techniques including a prioritized experience replay (PER) buffer and long short-term memory (LSTM) are exploited to significantly improve training efficiency and control performance. The validity of the proposed deep-RL-eMPC is demonstrated on a nonlinear autonomous vehicle model, and the results show that our approach clearly outperforms the conventional threshold-based approach in [9] and the previous linear Q-learning based approach in [8].

The remainder of the paper is organized as follows. Section II reviews relevant literature on RL and MPC integration to provide more context for the presented work. Section III formulates the autonomous vehicle path following problem. Section IV presents the framework of eMPC with the triggering policy obtained from RL. The experimental setup and results of the proposed deep-RL-eMPC method on the autonomous vehicle path following problem are presented in Section V. Finally, concluding remarks are provided in Section VI.

II Relevant Work on RL/MPC Integration

Utilizing RL to aid MPC is not new in the literature. For example, [16, 27] propose an off-policy actor-critic algorithm called DMPC, where an off-policy critic learns a value function while the actor utilizes MPC to interact with the environment. It is assumed that the system dynamics are known, but the cost function that MPC should minimize is unknown and is learned through the critic's value function estimation. Both analytical and numerical results demonstrate improvements in learning convergence.

RL can also be used to learn the system dynamics that are then used by MPC for prediction [11, 12, 29, 26, 40, 47, 21, 25, 7]; this approach is called model-based RL. Specifically, [11, 12, 29, 26, 40] study learning-based probabilistic MPC in the framework of RL, where the system dynamics and environment uncertainties are modeled as Gaussian processes (GP) whose parameters are iteratively identified through trial and error. The authors of [21, 25, 7] use a GP model to learn the error between measurements and a nominal model, which is then used to set up the optimal control problem for MPC with robust constraint satisfaction.

Reference [47] combines RL and MPC in the context of surgical robot control. The system dynamics are modeled by an artificial neural network (ANN), whose parameters are identified through RL or through learning from demonstration. In the RL approach, the agent explores the action space using ε-greedy exploration, collects observations, and iteratively trains the ANN to model the system dynamics, while MPC is used to optimize actions based on the trained ANN. In the learning-from-demonstration approach, the ANN parameters are initialized using observations collected from human operators.

Finally, RL can also be used to directly optimize the MPC control law. For example, [48] proposes a robust MPC where the control law is restricted to an affine function of the feedback, with the gain pre-computed offline and the offset learned by RL. Reference [48] additionally shows that the robust MPC can reject disturbances when the Gaussian process model is unknown and learned online. The authors of [5] investigated the use of a gradient-based Partially Observable Markov Decision Process (POMDP) algorithm [4] to train an RL recomputation policy for event-triggered MPC in order to save energy. However, solutions based on POMDP policy-gradient algorithms often suffer from the high variance of the gradient estimate [49].

To the best of our knowledge, the use of deep RL to trigger MPC has not been reported in the literature. In this paper, we attempt to fill this gap by investigating deep RL-based event-triggered MPC, or deep-RL-eMPC, which learns the optimal event-trigger policy without requiring any knowledge of the closed-loop dynamics and therefore significantly reduces the amount of calibration.

III Problem Formulation

This paper aims to improve autonomous vehicle path following control by proposing a systematic, algorithmic framework in which eMPC can be used without prior knowledge of the closed-loop system behavior. Our goal is to use an RL agent to learn the optimal event-trigger policy automatically.

III-A Task Description: Autonomous Vehicle Dynamics and Path Following Problem

To demonstrate the proposed deep-RL-eMPC and its improving techniques, a path following task is chosen. For a single-track vehicle model, the equations for the vehicle center of gravity (CG) and wheel dynamics are given by

(1a)
(1b)
(1c)
(1d)
(1e)
(1f)

where the state variables are the longitudinal and lateral position of the vehicle center of gravity, the rotation angle of the vehicle longitudinal axis in the global inertial frame (i.e., the heading angle), and the vehicle longitudinal velocity, lateral velocity, and yaw rate in the vehicle frame. The model also involves the aerodynamic drag force [41] and the tire forces. The parameters are the vehicle mass, the vehicle rotational inertia about the yaw axis, and the distances from the CG to the middle of the front and rear axle, respectively.

The tire forces in (1b) and (1d), expressed in the vehicle frame, can be modeled by

(2a)
(2b)

where the wheel-road angle is defined for each wheel, with the subscript denoting the front or rear wheel, and the tire forces in the wheel frame can be obtained as

(3a)
(3b)

where the quantities involved are the propulsion/braking torque along the axle, the effective tire radius, the tire cornering stiffness, a coefficient characterizing the road surface, and the slip angle. We refer readers to [9] for a detailed computation of the slip angle.

The normal force in (1f) can be modeled by static load transfer,

(4)

In this paper, we consider the problem of an autonomous vehicle following a sinusoidal trajectory using the proposed deep-RL-eMPC method [28, 9]; the path to be followed is given by

(5)

III-B Optimal Control Problem and Its Goal

Consider a discrete-time system with the following dynamics

(6)  $x_{k+1} = f(x_k, u_k)$

where $x_k$ is the system state at discrete time $k$ and $u_k$ is the control input. Given a prediction horizon $N$, MPC aims to find the optimal control sequence and the optimal state sequence by solving the following optimal control problem:

(7a)
s.t. (7b)
(7c)
(7d)
(7e)
(7f)

where the decision variables are the stacked control and state sequences over the horizon, the stage cost function defines the objective, and the current state (or its estimate) initializes the prediction. For conventional time-triggered MPC, the above optimal control problem is solved at every sampling time, and only the first element of the optimal control sequence is applied to the system as the control command, while all remaining elements are discarded.

Let the current time step and the last event time be given; the time elapsed since the last event is then an integer multiple of the sampling time of the discrete system. Let the triggering command in event-triggered MPC at the current time step take binary values. When an event is triggered, the above optimal control problem is solved and the first element of the optimal control sequence computed at the current time step is used as the control command. When no event is triggered, the optimal control sequence computed at the last event is shifted to determine the control command [9]. The control input can then be compactly represented as:

(8)

To implement (8) for eMPC, a buffer can be used to store the optimal control sequence computed at the last event. At each time step, the event-trigger policy block generates the triggering command based on the current feedback from the plant. In eMPC, only when an event is triggered is a new control sequence computed by solving (7); its first element is implemented by the actuator as the control input, while the entire sequence is saved into the buffer. If no event is triggered, the control sequence currently stored in the buffer is shifted based on the time elapsed since the last event to determine the current control input. This process is depicted in Fig. 1.
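The buffer-and-shift logic of (8) can be sketched as follows. This is a minimal illustration, assuming a hypothetical solve_mpc helper that returns the optimal control sequence over the horizon; it is not the implementation in the released code.

import numpy as np

class EventTriggeredMPC:
    """Minimal sketch of the eMPC buffer logic in Fig. 1 (assumed interfaces)."""

    def __init__(self, solve_mpc, horizon):
        self.solve_mpc = solve_mpc      # returns an array of shape (horizon, n_u)
        self.horizon = horizon
        self.u_buffer = None            # control sequence from the last event
        self.steps_since_event = 0

    def step(self, trigger, x):
        if trigger == 1 or self.u_buffer is None:
            # Event: solve the optimal control problem (7) and refill the buffer.
            self.u_buffer = self.solve_mpc(x)
            self.steps_since_event = 0
        else:
            # No event: reuse the stored sequence, shifted by the elapsed time.
            self.steps_since_event += 1
        idx = min(self.steps_since_event, self.horizon - 1)
        return self.u_buffer[idx]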

Fig. 1: The scheme of event-triggered model predictive control (eMPC).

In general, the event can be generated by a parameterized event-trigger policy, denoted as

(9)

where the inputs are the optimal state sequence computed at the last event and the real state (or current state estimate if not directly measured), and the policy is characterized by a set of parameters. It is worth noting that, for nonlinear constrained MPC, the design of the event-trigger policy is challenging and requires extensive calibration and prior knowledge of the closed-loop system behavior. Therefore, the design of the event-trigger policy and its calibration are usually problem specific and non-trivial. To address this limitation, the objective of this paper is to learn the optimal event-trigger policy using model-free deep RL techniques.

We discretize (1) to obtain a discrete-time model in the form of (6), where the control input consists of the axle driving torque and the front steering angle. The stage cost of (7a) is defined as

(10)

where the first (nonlinear) term penalizes the path tracking error and the second term penalizes large control effort; both are expressed through weighted norms. More specifically, the MPC cost function in (7a) in this case can be equivalently represented as:

(11)

where the stacked state and control sequences over the horizon are used, and the terms independent of the decision variables are ignored.
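Although (10) and (11) are not reproduced above, a typical quadratic tracking stage cost of the kind described takes the following form; the symbols $Q$, $R$ and $y_{\mathrm{ref}}$ below are illustrative assumptions, not the paper's exact weighting:

\ell(x_k, u_k) = \left\| y(x_k) - y_{\mathrm{ref}}(x_k) \right\|_Q^2 + \left\| u_k \right\|_R^2,
\qquad \|v\|_M^2 := v^\top M v,

where $y(x_k)$ is the (nonlinear) output used for path tracking, $y_{\mathrm{ref}}$ is the reference path (5), and $Q \succeq 0$, $R \succ 0$ are weighting matrices.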

IV Event-Triggered MPC with Deep RL-Based Policy Learning

In this section, we present the proposed deep-RL-eMPC framework, i.e., eMPC with deep RL-based policy learning.

IV-A Deep-RL-eMPC Framework

The workflow of our deep-RL-eMPC framework is shown in Fig. 2. The RL agent learns the event-trigger policy parameters by continuously interacting with the environment. Specifically, at each time step, the agent sends an action to the environment. The environment then implements the eMPC following (8), simulates the dynamic system following (6), and emits an immediate reward following the designed reward function. The agent then observes the reward signal, updates the policy parameters, and transitions to the next state.

Fig. 2: The scheme of RL based event-triggered MPC.

For an eMPC problem, the discrete action space for the RL agent is binary: the event is triggered when the action is 1 and not triggered when it is 0. As feedback from the environment, the immediate reward function is defined as

(12)

where the first term measures the closed-loop system performance and the second term measures the cost of triggering events. The first term is the stage cost, computed using the real state (or current state estimate if not directly measured) and the real-time control (8). The second term is scaled by a hyper-parameter used to balance the control performance index against the triggering frequency; one can tune this hyper-parameter to trade off control performance against computational cost.
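As a concrete illustration of (12), the reward computation inside the environment might look like the sketch below. The negative sign and the exact weighting are assumptions for illustration, since only the structure of (12) (stage cost plus a weighted trigger penalty) is described above; beta stands for the trade-off hyper-parameter.

def empc_reward(stage_cost, action, beta):
    """Immediate reward of the eMPC environment, per the structure of (12).

    stage_cost: MPC stage cost evaluated at the real state and applied control (8)
    action:     1 if MPC was triggered at this step, 0 otherwise
    beta:       trade-off weight between control performance and trigger frequency
    """
    # Better control (small stage cost) and fewer triggers yield higher reward.
    return -(stage_cost + beta * action)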

The complete deep-RL-eMPC procedure is shown in Algorithm 1. In this algorithm, the inputs are the total number of training epochs, the length of each episode (i.e., the total training time in each epoch), the discount factor in the reward function, the discrete time step, and the size of the experience batch sampled at each update (batch size). The output of Algorithm 1 is the learned policy parameters. The RL agent interacts with the environment for the specified number of epochs (Lines 2-24). After initialization, Line 5 shows how the action is chosen. Lines 7-12 implement the event-triggered MPC to compute the control command, which is used to simulate the dynamical system (6) (Line 13). After that, the environment emits the next state and the immediate reward (Line 16), which are observed by the RL agent (Line 18). The latest experience tuple is then added into an experience buffer (Line 19). The RL parameters are updated using a batch of experiences sampled from the experience buffer (Line 20). The RL agent then moves to the next state (Line 21). After each epoch, the RL agent is reset for the next epoch (Line 3). Lines 7-16 are part of the environment, whose computation is unknown to the RL agent; the agent only observes the environment outputs, i.e., the next state and reward.

Input: number of training epochs E, episode length T, discount factor, time step Δt, batch size
Output: learned policy parameters
1 Initialize the policy parameters and the experience buffer D;
2 for epoch = 1 to E do
3       Initialize the state, the time t, and the eMPC control-sequence buffer;
4       while t < T do
5             select action a_t (trigger or not) with the current policy;
6             % Simulate Environment;
7             if a_t = 1 then
8                   reset the time elapsed since the last event;
9                   Solve the optimal control problem (7) and store the sequence;
10            else
11                  increment the time elapsed since the last event;
12             end if
13            compute the control input u_t by (8);
14             Simulate the system dynamics (6) using u_t;
15             compute the next state s_{t+1};
16             compute the immediate reward r_t by (12);
17             % End of Environment Simulation;
18             Observe s_{t+1} and r_t;
19             Update D to include (s_t, a_t, r_t, s_{t+1});
20             Sample a batch of experiences from D and update the policy parameters;
21             t ← t + Δt;
22
23       end while
24
25 end for
Note: D can be either a conventional on-policy or off-policy experience buffer or a priority experience buffer.
Algorithm 1 RL-based Event-Triggered MPC
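A compact Python rendering of this loop is sketched below; the agent and environment interfaces (select_action, update, env.step) are assumed placeholders rather than the API of the released code.

def train_deep_rl_empc(env, agent, n_epochs, episode_len):
    """Training loop mirroring Algorithm 1 with generic agent/env interfaces (assumed)."""
    for epoch in range(n_epochs):
        state = env.reset()                       # Line 3: reset environment per epoch
        for t in range(episode_len):
            action = agent.select_action(state)   # Line 5: trigger (1) or not (0)
            # Lines 7-16 happen inside env.step: eMPC per (8), dynamics (6), reward (12)
            next_state, reward, done, _ = env.step(action)
            agent.buffer.add(state, action, reward, next_state)   # Line 19
            agent.update()                         # Line 20: sample a batch and update
            state = next_state                     # Line 21
            if done:
                break
    return agent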

IV-B Deep RL Algorithms and Improving Techniques

The framework shown in Fig. 2 and Algorithm 1 is general and can accommodate different RL algorithms. In this paper, we investigate three RL agents, namely Double Q-learning (DDQN) [37], Proximal Policy Optimization (PPO) [44], and Soft Actor-Critic (SAC) [19], and show that the proposed framework is also suitable for other RL algorithms.

In this subsection, we first briefly describe these three deep RL algorithms. Then two improving techniques for the RL agent, namely Prioritized Experience Replay (PER) and Long Short-Term Memory (LSTM), are presented.

IV-B1 Double Q-learning

The deep Q network is a type of Q-learning that uses a neural network as the policy. To address the overestimation of Q values in the deep Q network [36], Double Q-learning (DDQN) explicitly separates action selection from action evaluation, which allows each step to use a different function approximator and yields a better overall approximation of the action-value function [37]. DDQN improves the deep Q network by replacing the target $r_t + \gamma \max_{a} Q_{\theta^-}(s_{t+1}, a)$ with $y_t = r_t + \gamma Q_{\theta^-}\big(s_{t+1}, \arg\max_{a} Q_{\theta}(s_{t+1}, a)\big)$, resulting in the Double Q-learning loss:

(13)  $L(\theta) = \mathbb{E}\big[\big(y_t - Q_{\theta}(s_t, a_t)\big)^2\big]$

where $\theta$ and $\theta^-$ denote the online and target network parameters, respectively.
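A minimal PyTorch-style sketch of this target computation is given below, assuming standard online/target networks; it is illustrative, not the paper's implementation.

import torch

def double_q_target(reward, next_state, done, q_online, q_target, gamma=0.99):
    """Double Q-learning target: select the action with the online network,
    evaluate it with the target network.

    reward, done: float tensors of shape (batch,); done is 1.0 for terminal steps.
    """
    with torch.no_grad():
        best_action = q_online(next_state).argmax(dim=1, keepdim=True)   # action selection
        next_q = q_target(next_state).gather(1, best_action).squeeze(1)  # action evaluation
        return reward + gamma * (1.0 - done) * next_q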

IV-B2 PPO

PPO, an on-policy policy gradient RL algorithm, replaces the KL-divergence constraint used in TRPO [43] with a clipped surrogate objective (14), which is simpler to implement and performs comparably to or better than TRPO:

(14)  $L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\Big[\min\big(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\big)\Big]$

where $r_t(\theta) = \pi_{\theta}(a_t|s_t)/\pi_{\theta_{\mathrm{old}}}(a_t|s_t)$ is the probability ratio, $\hat{A}_t$ is the estimated advantage, and $\epsilon$ is the clipping parameter.
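The clipped surrogate translates to a few lines of code; the sketch below assumes PyTorch tensors of per-step log-probabilities and advantages and is illustrative only.

import torch

def ppo_clip_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped surrogate objective of (14), returned as a loss to minimize."""
    ratio = torch.exp(logp_new - logp_old)                 # r_t(theta)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Take the pessimistic (minimum) of the unclipped and clipped objectives.
    return -torch.min(ratio * advantage, clipped * advantage).mean()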

IV-B3 Soft Actor-Critic

SAC achieves state-of-the-art performance across a wide range of continuous-action control problems and updates a stochastic actor-critic policy in an off-policy way. SAC achieves a good exploration-exploitation trade-off via entropy regularization.
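For reference, the entropy-regularized objective that SAC maximizes can be written as follows (standard form from [19]; the temperature $\alpha$ below is not a symbol defined in this paper):

J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\big[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \big],

where $\mathcal{H}$ denotes the policy entropy and $\alpha$ is the temperature weighting exploration against reward.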

In this paper, we adapt SAC and PPO to the discrete action space setting following the discrete categorical distribution design in [10]. For details, we refer the reader to DDQN [37], PPO [44] and SAC [19].

The training performance of the proposed deep-RL-eMPC framework depends on the quality of the selected experience samples, so how to choose them is critical when using off-policy RL algorithms. The experience replay buffer is a fixed-size buffer that holds the most recent transitions collected by the policy [32, 17]; the network weights are updated using experiences drawn from this buffer. The experience replay in the original DDQN samples the stored experience uniformly to train the network weights. However, experiences differ in importance: some are more valuable than others in the long run, and important experiences should be replayed more frequently. To address this, prioritized experience replay has been proposed [42], which replays more often the transitions with high expected learning progress, as measured by the magnitude of their temporal-difference (TD) error. Specifically, the probability of sampling transition $i$ is defined as

(15)  $P(i) = \dfrac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}$

where $\alpha$ controls how much prioritization is applied; when $\alpha = 0$, the experience is sampled uniformly. Here $p_i$ represents the priority of transition $i$, which is initialized as 1 and updated based on the TD error of the transition.

More specifically, to alleviate the bias in the gradient magnitudes introduced by priority replay, importance-sampling (IS) weights are introduced in [42] as

(16)  $w_i = \Big(\dfrac{1}{N} \cdot \dfrac{1}{P(i)}\Big)^{\beta}$

where the exponent $\beta$ anneals the amount of importance-sampling correction over time (this is the PER exponent of [42], not the reward trade-off weight of (12)) and $N$ is the size of the experience buffer. The weight is then used in the Q-learning update by replacing the TD error $\delta_i$ with $w_i \delta_i$. In practice, PER can be applied by replacing the experience sampling and update step (Line 20) in Algorithm 1 with the designed PER scheme.
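A minimal sketch of prioritized sampling with IS weights is shown below; the proportional-priority variant and the array-based buffer are assumptions for illustration (a production implementation would typically use a sum-tree).

import numpy as np

def per_sample(priorities, batch_size, alpha=0.6, beta=0.4):
    """Sample transition indices with probability (15) and return IS weights (16)."""
    scaled = np.asarray(priorities) ** alpha
    probs = scaled / scaled.sum()                       # P(i)
    idx = np.random.choice(len(probs), size=batch_size, p=probs)
    n = len(probs)
    weights = (1.0 / (n * probs[idx])) ** beta          # importance-sampling weights
    weights /= weights.max()                            # normalize for stability
    return idx, weights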

To encode historical information in the network, a straightforward way is to feed all historical states to the RL agent, but this increases the state dimension significantly and may distract the agent's attention from recent input states. To address this challenge, recurrent neural networks (RNN), a class of artificial neural networks that can encode and learn temporal information, have been developed. A traditional RNN lacks long-term memory and suffers from the vanishing gradient problem. Long short-term memory (LSTM) [22], a type of RNN architecture, solves this issue by using feedback connections and is thus suitable for long time-series data. In this paper, we explore the use of LSTM as the last hidden layer to extract representations from different state types and encode the history information.

V Autonomous Vehicle Path Following Using Deep-RL-eMPC

In this section we apply the proposed deep-RL-eMPC to a nonlinear autonomous vehicle path tracking problem. The MPC uses a finite prediction horizon with upper and lower bounds on all control inputs. Since autonomous vehicles require a short control sampling time but have limited onboard computation power, this nonlinear path tracking problem is a good example to demonstrate the proposed deep-RL-eMPC.

V-A RL Structure and Settings

In this paper, we encode the input state with one fully connected (FC) layer of 128 neurons, followed by two more 128-neuron FC layers. In the LSTM design, we replace the last FC layer with a 128-unit LSTM layer. The output layer produces two Q values corresponding to the two actions, i.e., trigger and not trigger. The target network in DDQN is updated at a fixed interval of training steps.
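The description above corresponds roughly to the following PyTorch sketch; the state dimension and the use of nn.LSTM here are assumptions for illustration and may differ from the released code.

import torch
import torch.nn as nn

class QNetLSTM(nn.Module):
    """Q-network with two 128-unit FC layers and a 128-unit LSTM as the last
    hidden layer, outputting Q values for the two actions (trigger / no trigger)."""

    def __init__(self, state_dim):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
        )
        self.lstm = nn.LSTM(input_size=128, hidden_size=128, batch_first=True)
        self.head = nn.Linear(128, 2)   # Q(s, trigger), Q(s, no trigger)

    def forward(self, state_seq, hidden=None):
        # state_seq: (batch, seq_len, state_dim)
        z = self.fc(state_seq)
        z, hidden = self.lstm(z, hidden)
        return self.head(z[:, -1]), hidden   # Q values at the last time step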

The state of the environment is defined as a combination of the state estimate of the dynamical system and the MPC prediction made at the last event. The reward function follows (12), with the stage-cost term defined as

(17)

where the arguments are the real state (or current state estimate if not directly measured) and the real-time applied control computed by (8). The return for one episode in the RL algorithm is then

(18)

where the episodic return is accumulated over the number of steps in the episode and weighted by the hyper-parameter that balances control performance against event-trigger frequency. To evaluate the performance of different RL algorithms in our deep-RL-eMPC framework, we adopt two evaluation metrics: the total MPC cost and the event triggering frequency, which are defined as follows:

(19)
(20)
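Although (19) and (20) are not reproduced above, the two metrics described, total MPC cost and event triggering frequency, are commonly of the following form; the symbols below are illustrative assumptions consistent with the surrounding text:

C = \sum_{k=0}^{T-1} \ell(x_k, u_k), \qquad
f = \frac{1}{T} \sum_{k=0}^{T-1} \gamma_k,

where $T$ is the number of time steps in the evaluation episode, $\ell$ is the stage cost, and $\gamma_k \in \{0, 1\}$ indicates whether MPC was triggered at step $k$.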
            Threshold    LSTDQ        SAC         DDQN         DDQN+LSTM+PER  PPO          PPO+LSTM
Return R    1.606        0.062        0.058       0.056        0.058          0.055        0.055
f / C       0.118/1.606  0.902/0.062  0.99/0.058  0.902/0.056  0.99/0.058     0.99/0.058   0.99/0.055
Return R    1.618        0.157        0.158       0.152        0.137          0.119        0.112
f / C       0.118/1.606  0.931/0.157  0.98/0.058  0.951/0.055  0.794/0.056    0.594/0.059  0.594/0.059
Return R    1.728        0.66         1.015       0.627        0.431          0.634        0.529
f / C       0.118/1.606  0.559/0.660  0.922/0.075 0.5/0.117    0.255/0.171    0.515/0.114  0.515/0.069
TABLE I: Evaluation return R, triggering frequency f, and MPC cost C using different RL agents in deep-RL-eMPC. The three row groups correspond to the three values of the reward trade-off weight, from smallest (top) to largest (bottom).
Fig. 3: Experimental results of deep-RL-eMPC for the first reward trade-off weight setting. The comparison of the tracking error using three different RL algorithms in deep-RL-eMPC (first row), and the corresponding triggering commands during the process when using LSTDQ (second row), DDQN+LSTM+PER (third row), and PPO+LSTM (fourth row).
Fig. 4: Experimental results of deep-RL-eMPC for the second reward trade-off weight setting. Panels are arranged as in Fig. 3.
Fig. 5: Experimental results of deep-RL-eMPC for the third reward trade-off weight setting. Panels are arranged as in Fig. 3.

We train the off-policy RL algorithms for 50,000 steps, i.e., around 500 episodes of roughly 100 time steps each. On-policy algorithms, e.g., PPO, often require longer training but with improved stability [10], so we train them for 1,000 episodes for better convergence. The discount factor, batch size, and learning rate are kept fixed across experiments, and the replay buffer size is set to 5,000. In addition, ε-greedy exploration is adopted in DDQN, with ε linearly decaying from 1.0 to 0.01 during the first 5,000 steps of training.

V-B Simulation Results and Analysis

Numerical simulation results on the evaluation returns for the three values of the reward trade-off weight, with the threshold-based benchmark and different variants of RL algorithms, are summarized in Tab. I. The simple linear Q-learning method (least-squares temporal difference Q-learning, LSTDQ) [8] is also shown as a benchmark. To measure the computational burden of the different RL algorithms and of MPC, we run the simulation 10,000 times and use the average time as the time cost. The results show that the average time cost of one MPC computation is about 0.1 s, whereas the average time cost of evaluating the RL policies considered in this paper is much smaller, so the time spent on the decision making of the RL algorithms is negligible. Overall, fewer MPC queries therefore translate directly into a lower computational burden.

The threshold-based event-trigger policy [9] depends on a manually tuned threshold to determine when an event is triggered. However, this method is very sensitive to the tracking error and is susceptible to over-triggering when the error is large. As a result, the return of the threshold-based method stays between roughly 1.6 and 1.7 for all three trade-off weights, much worse than the RL-based methods, as shown in Tab. I.

Comparing LSTDQ, SAC, DDQN, and PPO, the experimental results clearly show that the deep-RL-eMPC variants achieve better evaluation returns than the conventional threshold-based approach and the previous LSTDQ for all three trade-off weights. PPO gives the best evaluation return under the two smaller weights, while DDQN performs better under the largest weight, partly due to its low overestimation. To show the flexibility of the proposed framework, a PER buffer and LSTM are employed to foster exploration and improve the training efficiency of DDQN and PPO. Since PPO is an on-policy RL method, PER cannot be applied to it, so only PPO+LSTM is tested; specifically, DDQN+LSTM+PER and PPO+LSTM are implemented and compared. The experimental results show that LSTM and PER significantly improve the evaluation return of the system, outperforming the baseline methods. SAC performs well under the smallest weight, while it fails in the more challenging cases with the two larger weights. The intrinsic reason for the poor performance of SAC deserves investigation in future work.

Recall that the trade-off hyper-parameter in (12) balances control performance against triggering frequency. When it is zero, RL triggers MPC at nearly every time step and achieves the smallest tracking error. As its value increases, the reward function (12) penalizes triggering MPC more heavily, resulting in less frequent events and a higher MPC cost. The larger the weight, the larger the penalty the system incurs for triggering events. From Tab. I, we can see that a larger weight leads to worse returns because of the larger penalty on triggering the events.

Fig. 3 to Fig. 5 show the path following error and the event triggering commands when using three different RL algorithms (LSTDQ, DDQN+LSTM+PER, PPO+LSTM) in the deep-RL-eMPC framework under the different trade-off weights in the reward (18). In each figure, the first row compares the tracking error of the three RL algorithms, and the following rows show the corresponding triggering commands when using LSTDQ (second row), DDQN+LSTM+PER (third row), and PPO+LSTM (fourth row). The best results of deep-RL-eMPC are obtained by PPO+LSTM for the two smaller weights and by DDQN+LSTM+PER for the largest weight; the corresponding triggering frequencies and MPC costs are listed in Tab. I. With zero weight there is no penalty on triggering MPC, so the RL agent triggers MPC at nearly every sampling time and the path tracking error is the smallest. With the intermediate weight, the RL agent tends to trigger an event when the tracking error is large and keeps silent when the error is small. With the largest weight, DDQN+LSTM+PER achieves the best performance; the event-trigger pattern is similar to that of the intermediate case but with a lower triggering frequency. It is worth noting that, for each case, DDQN+LSTM+PER triggers MPC less frequently (resulting in less MPC computation) while incurring a smaller MPC cost (resulting in better control performance). We can therefore conclude that DDQN+LSTM+PER and PPO+LSTM outperform the previous LSTDQ method presented in [8].

VI Conclusion

This paper investigated a path following problem for autonomous driving. A novel event-triggered model predictive control (eMPC) framework, with the triggering policy obtained from deep reinforcement learning, was presented to solve the problem. A reward function was proposed to balance control performance and event-trigger frequency through a tunable hyper-parameter. Compared to existing eMPC, the proposed algorithm does not require any knowledge of the closed-loop dynamics (i.e., it is model-free) and offers better performance. We also showed that incorporating techniques such as prioritized experience replay and long short-term memory can significantly enhance the performance. The learned deep RL-based triggering policy effectively decreases the computational burden while achieving satisfactory control performance. In future work, we will consider time-varying computational budgets and costs in this deep-RL-eMPC framework for autonomous driving path following, as well as other applications of relevance and impact. Additionally, we will examine the stability and convergence of the proposed deep-RL-eMPC framework.

References

  • [1] B. B. K. Ayawli, R. Chellali, A. Y. Appiah, and F. Kyeremeh (2018) An overview of nature-inspired, conventional, and hybrid methods of autonomous vehicle path planning. Journal of Advanced Transportation 2018. Cited by: §I.
  • [2] D. Baumann, J. Zhu, G. Martius, and S. Trimpe (2018) Deep reinforcement learning for event-triggered control. In 2018 IEEE Conference on Decision and Control (CDC), pp. 943–950. Cited by: §I.
  • [3] D. Baumann, F. Solowjow, K. H. Johansson, and S. Trimpe (2019) Event-triggered pulse control with model learning (if necessary). In 2019 American Control Conference (ACC), pp. 792–797. Cited by: §I.
  • [4] J. Baxter and P. L. Bartlett (2001) Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research 15, pp. 319–350. Cited by: §II.
  • [5] E. Bøhn, S. Gros, S. Moe, and T. A. Johansen (2021) Optimization of the model predictive control update interval using reinforcement learning. IFAC-PapersOnLine 54 (14), pp. 257–262. Cited by: §II.
  • [6] F. D. Brunner, W. Heemels, and F. Allgöwer (2017) Robust event-triggered MPC with guaranteed asymptotic bound and average sampling rate. IEEE Transactions on Automatic Control 62 (11), pp. 5694–5709. Cited by: §I, §I.
  • [7] A. Carron, E. Arcari, M. Wermelinger, L. Hewing, M. Hutter, and M. N. Zeilinger (2019) Data-driven model predictive control for trajectory tracking with a robotic arm. IEEE Robotics and Automation Letters 4 (4), pp. 3758–3765. Cited by: §II.
  • [8] J. Chen, X. Meng, and Z. Li (June 8–10, 2022) Reinforcement learning-based event-triggered model predictive control for autonomous vehicle path following. In 2022 American Control Conference, Atlanta, GA. External Links: Link Cited by: §I, §I, §V-B, §V-B.
  • [9] J. Chen and Z. Yi (August 8–11, 2021) Comparison of event-triggered model predictive control for autonomous vehicle path tracking. In 2021 IEEE Conference on Control Technology and Applications (CCTA), San Diego, CA. Cited by: §I, §I, §I, §III-A, §III-A, §III-B, §V-B.
  • [10] P. Christodoulou (2019) Soft actor-critic for discrete action settings. arXiv preprint arXiv:1910.07207. Cited by: §IV-B3, §V-A.
  • [11] Y. Cui, S. Osaki, and T. Matsubara (2019) Reinforcement learning boat autopilot: a sample-efficient and model predictive control based approach. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2868–2875. Cited by: §II.
  • [12] Y. Cui, S. Osaki, and T. Matsubara (2021) Autonomous boat driving system using sample-efficient model predictive control-based reinforcement learning approach. Journal of Field Robotics 38 (3), pp. 331–354. Cited by: §II.
  • [13] Y. Ding, L. Wang, Y. Li, and D. Li (2018) Model predictive control and its application in agriculture: a review. Computers and Electronics in Agriculture 151, pp. 104–117. Cited by: §I.
  • [14] D. Dolgov, S. Thrun, M. Montemerlo, and J. Diebel (2010) Path planning for autonomous vehicles in unknown semi-structured environments. The international journal of robotics research 29 (5), pp. 485–501. Cited by: §I.
  • [15] A. Eqtami, D. V. Dimarogonas, and K. J. Kyriakopoulos (December 12-15, 2011) Novel event-triggered strategies for model predictive controllers. In 2011 50th IEEE Conference on Decision and Control and European Control Conference, Orlando, FL, pp. 3392–3397. Cited by: §I.
  • [16] F. Farshidian, D. Hoeller, and M. Hutter (2019) Deep value model predictive control. arXiv preprint arXiv:1910.03358. Cited by: §II.
  • [17] W. Fedus, P. Ramachandran, R. Agarwal, Y. Bengio, H. Larochelle, M. Rowland, and W. Dabney (2020) Revisiting fundamentals of experience replay. In International Conference on Machine Learning, pp. 3061–3071. Cited by: §IV-B3.
  • [18] J. L. Garriga and M. Soroush (2010) Model predictive control tuning methods: a review. Industrial & Engineering Chemistry Research 49 (8), pp. 3505–3515. Cited by: §I.
  • [19] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pp. 1861–1870. Cited by: §IV-B3, §IV-B.
  • [20] N. He and D. Shi (2015) Event-based robust sampled-data model predictive control: a non-monotonic lyapunov function approach. IEEE Transactions on Circuits and Systems I: Regular Papers 62 (10), pp. 2555–2564. Cited by: §I.
  • [21] L. Hewing, J. Kabzan, and M. N. Zeilinger (2019) Cautious model predictive control using gaussian process regression. IEEE Transactions on Control Systems Technology 28 (6), pp. 2736–2743. Cited by: §II.
  • [22] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §IV-B3.
  • [23] A. H. Hosseinloo and M. A. Dahleh (2020) Event-triggered reinforcement learning; an application to buildings’ micro-climate control.. In AAAI Spring Symposium: MLPS, Cited by: §I.
  • [24] S. Huang and J. Chen (2022) Event-triggered model predictive control for autonomous vehicle with rear steering. SAE Technical Paper (2022-01-0877). Cited by: §I.
  • [25] J. Kabzan, L. Hewing, A. Liniger, and M. N. Zeilinger (2019) Learning-based model predictive control for autonomous racing. IEEE Robotics and Automation Letters 4 (4), pp. 3363–3370. Cited by: §II.
  • [26] S. Kamthe and M. Deisenroth (2018) Data-efficient reinforcement learning with probabilistic model predictive control. In International conference on artificial intelligence and statistics, pp. 1701–1710. Cited by: §II.
  • [27] N. Karnchanachari, M. I. Valls, D. Hoeller, and M. Hutter (2020) Practical reinforcement learning for mpc: learning from sparse objectives in under an hour on a real robot. In Learning for Dynamics and Control, pp. 211–224. Cited by: §II.
  • [28] J. Kong, M. Pfeiffer, G. Schildbach, and F. Borrelli (June 28–July 1, 2015) Kinematic and dynamic vehicle models for autonomous driving control design. In 2015 IEEE Intelligent Vehicles Symposium, Seoul, Korea, pp. 1094–1099. Cited by: §III-A.
  • [29] C. Kuo, Y. Cui, and T. Matsubara (2020) Sample-and-computation-efficient probabilistic model predictive control with random features. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 307–313. Cited by: §II.
  • [30] C. Li, J. Hu, J. Yu, J. Xue, R. Yang, Y. Fu, and B. Sun (2021) A review on the application of the mpc technology in wind power control of wind farms. Journal of Energy and Power Technology 3 (3), pp. 1–1. Cited by: §I.
  • [31] H. Li and Y. Shi (2014) Event-triggered robust model predictive control of continuous-time nonlinear systems. Automatica 50 (5), pp. 1507–1513. Cited by: §I, §I.
  • [32] L. Lin (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8 (3), pp. 293–321. Cited by: §IV-B3.
  • [33] C. Liu, S. Lee, S. Varnhagen, and H. E. Tseng (2017) Path planning for autonomous vehicles using model predictive control. In 2017 IEEE Intelligent Vehicles Symposium (IV), pp. 174–179. Cited by: §I.
  • [34] C. Liu, C. Li, and W. Li (2020) Computationally efficient mpc for path following of underactuated marine vessels using projection neural network. Neural Computing and Applications 32 (11), pp. 7455–7464. Cited by: §I.
  • [35] M. Mammarella, T. Alamo, F. Dabbene, and M. Lorenzen (2020) Computationally efficient stochastic mpc: a probabilistic scaling approach. In 2020 IEEE Conference on Control Technology and Applications (CCTA), pp. 25–30. Cited by: §I.
  • [36] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller (2013) Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. Cited by: §IV-B1.
  • [37] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. Cited by: §IV-B1, §IV-B3, §IV-B.
  • [38] A. Muraleedharan, H. Okuda, and T. Suzuki (2021) Real-time implementation of randomized model predictive control for autonomous driving. IEEE Transactions on Intelligent Vehicles 7 (1), pp. 11–20. Cited by: §I.
  • [39] G. O’Regan (2016) Introduction to the history of computing: a computing history primer. Springer. Cited by: §I.
  • [40] C. J. Ostafew, A. P. Schoellig, T. D. Barfoot, and J. Collier (2016) Learning-based nonlinear model predictive control to improve vision-based mobile robot path tracking. Journal of Field Robotics 33 (1), pp. 133–152. Cited by: §II.
  • [41] R. Rajamani (2011) Vehicle dynamics and control. Springer Science & Business Media. Cited by: §III-A.
  • [42] T. Schaul, J. Quan, I. Antonoglou, and D. Silver (2015) Prioritized experience replay. arXiv preprint arXiv:1511.05952. Cited by: §IV-B3, §IV-B3.
  • [43] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz (2015) Trust region policy optimization. In International conference on machine learning, pp. 1889–1897. Cited by: §IV-B2.
  • [44] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §IV-B3, §IV-B.
  • [45] L. Sedghi, Z. Ijaz, K. Witheephanich, D. Pesch, et al. (2020) Machine learning in event-triggered control: recent advances and open issues. arXiv preprint arXiv:2009.12783. Cited by: §I.
  • [46] G. Serale, M. Fiorentini, A. Capozzoli, D. Bernardini, and A. Bemporad (2018) Model predictive control (mpc) for enhancing building and hvac system energy efficiency: problem formulation, applications and opportunities. Energies 11 (3), pp. 631. Cited by: §I.
  • [47] C. Shin, P. W. Ferguson, S. A. Pedram, J. Ma, E. P. Dutson, and J. Rosen (2019) Autonomous tissue manipulation via surgical robot using learning based model predictive control. In 2019 International Conference on Robotics and Automation (ICRA), pp. 3875–3881. Cited by: §II, §II.
  • [48] R. Soloperto, M. A. Müller, S. Trimpe, and F. Allgöwer (2018) Learning-based robust model predictive control with state-dependent uncertainty. IFAC-PapersOnLine 51 (20), pp. 442–447. Cited by: §II.
  • [49] D. Xu and Q. Liu (2015) ACIS: an improved actor-critic method for pomdps with internal state. In 2015 IEEE 27th International Conference on Tools with Artificial Intelligence (ICTAI), pp. 369–376. Cited by: §II.
  • [50] J. Yoo and K. H. Johansson (2021) Event-triggered model predictive control with a statistical learning. IEEE Transactions on Systems, Man, and Cybernetics: Systems 51 (4), pp. 2571–2581. Cited by: §I.