I Introduction
Model Predictive Control (MPC) is a well-known method for solving finite-horizon optimal control problems online, and it has been extensively investigated in various fields [1, 2, 3]. However, existing MPC algorithms still suffer from a major challenge: relatively low computational efficiency [4].
One famous approach to tackling this issue is the move blocking technique, which assumes a constant control input over fixed portions of the prediction horizon. It increases computational efficiency by reducing the number of variables to be optimized [5]. However, this approach cannot guarantee system stability and constraint satisfaction. In addition, Wang and Boyd (2009) proposed an early-termination interior-point method that reduces calculation time by limiting the maximum number of iterations per time step [6].
However, these methods are still unable to meet the online computing requirements of nonlinear and large-scale systems. Some control algorithms instead calculate a near-optimal explicit policy offline and then implement it online. Bemporad et al. (2002) first proposed the explicit MPC method to increase computational efficiency, which partitions the constrained state space into several regions and calculates an explicit feedback control law for each region [7]. During online implementation, the onboard computer only needs to choose the corresponding state feedback control law according to the current system state, thereby reducing the burden of online calculation to some extent. Such algorithms are only suitable for small-scale systems, since the required storage capacity grows exponentially with the state dimension [8].
Furthermore, significant efforts have been devoted to approximate MPC algorithms, which reduce the number of polyhedral state regions and simplify the explicit control laws. Geyer et al. (2008) provided an optimal merging approach that reduces the number of partitions by merging regions with the same control law [9]. Jones et al. (2010) proposed a polytopic approximation method using double description and barycentric functions to estimate the optimal policy, which greatly reduces the number of partitions and can be applied to any convex problem [10]. Wen et al. (2009) proposed a piecewise-continuous grid function to represent the explicit MPC solution, which reduces the storage requirements and improves online computational efficiency [11]. Borrelli et al. (2010) proposed an explicit MPC algorithm that can be executed partially online and partially offline [12]. In addition, some MPC studies employ a parameterized function to approximate the MPC controller, updating the function parameters by minimizing the MPC cost function with a fixed prediction horizon through supervised learning or reinforcement learning [13, 14, 15, 16].
Note that both the policy performance and the per-step computation time usually increase with the number of prediction steps. The above algorithms therefore have to trade off control performance against computation-time constraints and select a conservative, fixed prediction horizon, even though onboard computing resources often vary dynamically. As a result, they tend to cause calculation timeouts or wasted resources. In other words, these algorithms cannot adapt to the dynamic allocation of computing resources and make full use of the available computing time to select the longest possible prediction horizon.
In this paper, we propose an offline MPC algorithm, called Recurrent MPC (RMPC), for finite-horizon optimal control problems with large-scale nonlinearities and non-affine inputs. Our main contributions are summarized below:

A recurrent function is employed to approximate the optimal policy, mapping the system states and reference values directly to the control inputs. Compared with previous algorithms employing non-recurrent functions (such as multi-layer neural networks), which must select a fixed prediction horizon in advance [13, 14, 15, 16], the recurrent structure enables the algorithm to select an appropriate prediction horizon according to the current computing resources. In particular, the output of the learned policy function after N recurrent cycles corresponds to the nearly optimal solution of N-step MPC.

A policy optimization objective is designed by decomposing the MPC cost function according to Bellman's principle of optimality. The optimal recurrent policy can be obtained by directly minimizing this objective. Therefore, unlike traditional explicit MPC algorithms [7, 8, 9, 10, 11, 12], which can only handle linear systems, the proposed algorithm is applicable to general nonlinear and non-input-affine systems. Meanwhile, the proposed RMPC algorithm utilizes the recursiveness of Bellman's principle: when the cost function for the longest prediction horizon is optimized, the cost functions for shorter horizons automatically become optimal. Thus the proposed algorithm can deal with problems of different, shorter prediction horizons while training only with an objective function for a long prediction horizon. Other MPC algorithms [7, 8, 9, 10, 11, 12, 13, 14, 15, 16] do not exploit this recursiveness: when the prediction horizon changes, the optimization problem must be reconstructed and the training or computing process re-executed.
The paper is organized as follows. Section II provides the formulation of the MPC problem. Section III presents the RMPC algorithm and proves its convergence. Section IV presents simulation demonstrations that show the generality and effectiveness of the RMPC algorithm. Section V concludes the paper.
II Preliminaries
Consider a general time-invariant discrete-time dynamic system
(1) x_{k+1} = f(x_k, u_k),
with state x_k ∈ X, control input u_k ∈ U, and system dynamics function f : X × U → X. We assume that f is Lipschitz continuous on a compact set X and that the system is stabilizable on X.
Define the cost function of the N-step Model Predictive Control (MPC) problem as
(2) J(x_0, N) = Σ_{i=1}^{N} l(x_i, r_i),
where x_0 is the initial state, N is the length of the prediction horizon, r_i is the reference trajectory, l(x_i, r_i) is the i-th step cost of state x_i with reference r_i, u_{i|N} is the control input of the i-th step in the N-step prediction, and l is the utility function. The purpose of MPC is to find the optimal control sequence minimizing the objective J(x_0, N), which can be denoted as
(3) U*_N = {u*_{1|N}, u*_{2|N}, …, u*_{N|N}} = arg min_{u_{1|N}, …, u_{N|N}} J(x_0, N),
where the superscript * denotes optimality.
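As a concrete illustration of the cost (2), the following sketch rolls a candidate control sequence through the dynamics (1) and accumulates the utility. The toy dynamics f and utility l here are hypothetical stand-ins for illustration only, not the vehicle model used later in the paper.

```python
import numpy as np

def mpc_cost(x0, u_seq, r_seq, f, l):
    """Evaluate the N-step MPC cost J(x0, N) = sum_i l(x_i, r_i)
    by rolling the dynamics forward under a candidate control sequence."""
    x, cost = np.asarray(x0, dtype=float), 0.0
    for u, r in zip(u_seq, r_seq):
        x = f(x, u)       # x_{i} = f(x_{i-1}, u_{i})
        cost += l(x, r)   # accumulate the utility l(x_i, r_i)
    return cost

# Toy example: a scalar double-integrator-like system tracking r = 0.
f = lambda x, u: np.array([x[0] + 0.1 * x[1], x[1] + 0.1 * u])
l = lambda x, r: float((x[0] - r) ** 2)
J = mpc_cost([1.0, 0.0], u_seq=[0.0] * 5, r_seq=[0.0] * 5, f=f, l=l)
```

An MPC solver would search over `u_seq` to minimize this rollout cost.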
III Recurrent Model Predictive Control
III-A Recurrent Policy Function
In practical applications, we only execute the first control input u*_{1|N} of the optimal sequence in (3) at each time step. Given a control problem, assume N_max is the maximum feasible prediction horizon. Our aim is to make full use of the computing resources and adaptively select the longest feasible prediction horizon, which means that we need to calculate and store the optimal first control inputs u*_{1|1}, u*_{1|2}, …, u*_{1|N_max} in advance. This requires an efficient way to represent the policy and to solve it offline.
We first introduce a recurrent function, denoted π(x, r; θ, N), to approximate the control input u*_{1|N}, where θ is the vector of function parameters and N is the number of recurrent cycles of the policy function. The goal of the proposed Recurrent MPC (RMPC) algorithm is to find the optimal parameters θ*, such that
(4) π(x, r; θ*, N) = u*_{1|N}, ∀N ∈ {1, 2, …, N_max}.
The structure of the recurrent policy function is illustrated in Fig. 1. All recurrent cycles share the same parameters θ, and h_i denotes the vector of hidden states. Each recurrent cycle is mathematically described as
(5) h_i = σ_h(W_h h_{i−1} + W_z [x, r] + b_h), u_i = σ_u(W_u h_i + b_u),
where the weights W and biases b are collected in θ, and σ_h and σ_u are the activation functions of the hidden layer and output layer, respectively.
As shown in Fig. 1, the recurrent policy function calculates and outputs a control input at each recurrent cycle. Assuming we have found the optimal parameters θ*, the output of the i-th cycle satisfies u_i = u*_{1|i} for i ∈ {1, …, N_max}. This indicates that the more cycles, the longer the prediction horizon. In practical applications, the calculation time of each cycle varies due to the dynamic change of computing-resource allocation (see Fig. 2). At each time step, a total time budget is assigned to the control-input calculation. Denoting the total number of recurrent cycles completed within that budget as N, the control input is u = π(x, r; θ*, N).
Therefore, the recurrent policy is able to make full use of the computing resources and adaptively select the longest feasible prediction horizon N. In other words, the more computing resources are allocated, the longer the prediction horizon that will be selected, which usually leads to better control performance.
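The anytime behavior described above can be sketched as follows: every recurrent cycle shares the same parameters and emits a control input, and the caller simply runs as many cycles as the current time budget allows. The tanh layers, layer sizes, and random parameters here are illustrative assumptions, not the trained GRU used later in the paper.

```python
import numpy as np

def recurrent_policy(x, r, params, n_cycles):
    """Minimal sketch of the recurrent policy pi(x, r; theta, N):
    all cycles share the same parameters; the output after cycle i
    plays the role of the i-step-horizon solution."""
    Wh, Wx, Wu = params                 # shared across all cycles
    h = np.zeros(Wh.shape[0])           # hidden state h_0
    z = np.concatenate([np.atleast_1d(x), np.atleast_1d(r)])
    u = None
    for _ in range(n_cycles):
        h = np.tanh(Wh @ h + Wx @ z)    # hidden-layer activation sigma_h
        u = np.tanh(Wu @ h)             # output-layer activation sigma_u
    return u

# Anytime use: run as many cycles as the time budget permits.
rng = np.random.default_rng(0)
params = (rng.normal(size=(8, 8)), rng.normal(size=(8, 2)),
          rng.normal(size=(1, 8)))
u_short = recurrent_policy(0.5, 0.0, params, n_cycles=1)   # tight budget
u_long = recurrent_policy(0.5, 0.0, params, n_cycles=15)   # ample budget
```

With trained parameters, `u_long` would correspond to a longer prediction horizon and hence usually better control performance.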
Remark 1.
Previous MPC algorithms employ non-recurrent neural networks [13, 14, 15, 16], which must select a fixed prediction horizon in advance. RMPC employs a recurrent function to approximate the optimal policy, which maps the system states and reference values directly to the control inputs. The recurrent structure enables the algorithm to select an appropriate prediction horizon according to the current computing resources. The output of the learned policy network after N recurrent cycles corresponds to the nearly optimal solution of N-step MPC.
III-B Objective Function
To find the optimal parameters θ* offline, we first need to represent the MPC cost function in (2) in terms of θ, denoted J(x_0, θ, N). From (2) and Bellman's principle of optimality, the global minimum can be expressed as
J*(x_0, N) = min_{u_{1|N}} { l(x_1, r_1) + min_{u_{2|N}, …, u_{N|N}} Σ_{i=2}^{N} l(x_i, r_i) }.
It follows that
(6) J*(x_0, N) = min_{u_{1|N}} { l(x_1, r_1) + J*(x_1, N−1) }.
Therefore, for the same state and reference, it is clear that
(7) u*_{i|N} = u*_{1|N−i+1}, i ∈ {1, 2, …, N}.
This indicates that the i-th optimal control input in (3) can be regarded as the first optimal control input of the (N−i+1)-step MPC problem with initial state x_{i−1}. Hence, by replacing each u_{i|N} in (2) with π(x_{i−1}, r; θ, N−i+1), the N-step MPC problem can also be solved by minimizing over θ. We thus obtain the N-step cost function in terms of θ:
(8) J(x_0, θ, N) = Σ_{i=1}^{N} l(x_i, r_i), with x_i = f(x_{i−1}, π(x_{i−1}, r; θ, N−i+1)).
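A minimal sketch of evaluating the cost (8) under the recurrent policy: per (7), the control at step i is the policy run for N−i+1 recurrent cycles at the current state. The dynamics, utility, and zero policy below are toy stand-ins, not the paper's system.

```python
import numpy as np

def policy_rollout_cost(x0, r_seq, policy, f, l):
    """Sketch of J(x0, theta, N): roll the system forward, applying the
    recurrent policy with a decreasing cycle count, and sum utilities."""
    x, cost, N = np.asarray(x0, dtype=float), 0.0, len(r_seq)
    for i, r in enumerate(r_seq):
        u = policy(x, r, n_cycles=N - i)  # N-i+1 cycles at step i+1
        x = f(x, u)
        cost += l(x, r)
    return cost

# Toy stand-ins (hypothetical, for illustration only).
f = lambda x, u: np.array([x[0] + 0.1 * x[1], x[1] + 0.1 * u])
l = lambda x, r: float((x[0] - r) ** 2)
zero_policy = lambda x, r, n_cycles: 0.0
J = policy_rollout_cost([1.0, 0.0], [0.0] * 3, zero_policy, f, l)
```

Training then amounts to minimizing this rollout cost with respect to the policy parameters.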
To find the optimal parameters θ* that make (4) hold, we construct the following objective function:
(9) L(θ) = E_{x_0 ∈ X} [ J(x_0, θ, N_max) ].
Therefore, we can update θ by directly minimizing L(θ). The policy update gradient can be derived as
(10) ∇_θ L(θ) = E_{x_0 ∈ X} [ ∂J(x_0, θ, N_max)/∂θ ],
where the partial derivative is obtained by backpropagating through the rollout in (8). By defining two intermediate variables for the gradients with respect to the state and the hidden state, the gradient admits a recursive formula that simplifies its computation; the details of the gradient backpropagation are shown in Fig. 3.
Taking the gradient descent (GD) method as an example, the updating rule of the policy function is
(11) θ^{k+1} = θ^k − α ∇_θ L(θ^k),
where α denotes the learning rate and k indicates the k-th iteration.
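The update rule (11) can be illustrated on a toy scalar objective. The quadratic loss and the finite-difference gradient below are stand-ins for L(θ) and the backpropagated gradient (10); the paper itself uses backpropagation through the policy rollout.

```python
# Sketch of the gradient-descent update (11) on a stand-in objective.
def loss(theta):
    # Hypothetical stand-in for L(theta): quadratic, minimum at theta = 2.
    return (theta - 2.0) ** 2

theta, alpha = 0.0, 0.1
for _ in range(200):
    # Central finite-difference estimate of dL/dtheta.
    grad = (loss(theta + 1e-6) - loss(theta - 1e-6)) / 2e-6
    theta -= alpha * grad   # theta^{k+1} = theta^k - alpha * dL/dtheta
```

After enough iterations `theta` settles near the minimizer, mirroring how (11) drives θ toward θ*.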
Remark 2.
Traditional explicit MPC algorithms [7, 8, 9, 10, 11, 12] can only handle linear systems. The proposed RMPC algorithm uses an optimization objective designed by decomposing the MPC cost function according to Bellman's principle of optimality. The optimal recurrent policy can be obtained by directly minimizing this objective, without restrictions on the form of the system. Meanwhile, the proposed algorithm utilizes the recursiveness of Bellman's principle: when the cost function for the longest prediction horizon is optimized, the cost functions for shorter horizons automatically become optimal. Thus the proposed algorithm can deal with problems of different, shorter prediction horizons while training only with an objective function for a long prediction horizon. Other MPC algorithms [7, 8, 9, 10, 11, 12, 13, 14, 15, 16] do not exploit this recursiveness: when the prediction horizon changes, the optimization problem must be reconstructed and the training or computing process re-executed.
III-C Convergence and Optimality
There are many types of recurrent functions with the structure defined in (5), and recurrent neural networks (RNNs) are the most commonly used. In recent years, deep RNNs have been successfully applied in many fields, such as natural language processing and system control, owing to their ability to process sequential data [19, 20]. Next, we show that as the iteration index tends to infinity, the optimal policy that makes (4) hold can be achieved by Algorithm 1, as long as the policy function is an over-parameterized RNN. Over-parameterization means that the number of hidden neurons is sufficiently large. Before the main theorem, the following lemma and assumption need to be introduced.
Lemma 1.
(Universal Approximation Theorem [21, 22, 23]). Consider a sequence generated by a continuous function on a compact set, mapping inputs of a given dimension to outputs of a given dimension, and describe the RNN by the recurrent structure in (5), where N is the number of recurrent cycles, the weights and biases are the parameters, and σ_h, σ_u are the activation functions. Supposing the RNN is over-parameterized, for any such target sequence there exist parameters such that the RNN approximates it to within an arbitrarily small error ε.
Reported experimental results and theoretical proofs have shown that straightforward optimization methods, such as GD and stochastic GD (SGD), can find global minima of most training objectives in polynomial time if the approximating function is an over-parameterized neural network or RNN [24, 25]. Based on this fact, we make the following assumption.
Assumption 1.
We now present our main result.
Theorem 1.
Proof. From Assumption 1, by repeatedly minimizing L(θ) using (11), we can always find parameters θ such that L(θ) approaches its global minimum. According to the definition of L(θ) in (9), and by Lemma 1, there always exist over-parameterized RNN parameters that attain this minimum to within an arbitrarily small error. Since the obtained θ attains the global minimum of L(θ), it follows from (6), (7), and Bellman's principle of optimality that θ also makes (4) hold, i.e., π(x, r; θ, N) converges to u*_{1|N}.
Thus, we have proven that the RMPC algorithm converges to the optimal policy. In other words, it can find the nearly optimal policy of MPC for different prediction horizons, whose output after the N-th recurrent cycle corresponds to the nearly optimal solution of N-step MPC.
IV Algorithm Verification
To evaluate the performance of the proposed RMPC algorithm, we take the vehicle lateral control problem in a path-tracking task as an example [27].
IV-A Overall Settings
The policy network is trained offline on a PC and then deployed to an industrial personal computer (IPC). The vehicle dynamics used for policy training differ from the controlled plant. For online application, the IPC controller sends the control signal to the plant according to the state information and the reference trajectory; the plant feeds the state information back to the IPC controller, realizing the closed-loop control process. The feedback scheme of the hardware-in-the-loop (HIL) experiment is depicted in Fig. 5. The IPC is an ADLINK MXC-6401, equipped with an Intel i7-6820EQ CPU and 8 GB RAM, and is used as the vehicle onboard controller [28]. The plant is a real-time system simulated by the vehicle dynamics model of CarSim [29]. The longitudinal speed is assumed to be constant, and the expected trajectory is shown in Fig. 16. The system states and control inputs of this problem are listed in Table I, and the vehicle parameters are listed in Table II.
TABLE I: System states and control input

Mode    Name                                 Unit
state   Lateral velocity                     [m/s]
        Yaw rate at center of gravity (CG)   [rad/s]
        Longitudinal velocity                [m/s]
        Yaw angle                            [rad]
        Trajectory                           [m]
input   Front wheel angle                    [rad]

TABLE II: Vehicle parameters

Name                              Value
Front wheel cornering stiffness   88000 [N/rad]
Rear wheel cornering stiffness    94000 [N/rad]
Mass                              1500 [kg]
Distance from CG to front axle    1.14 [m]
Distance from CG to rear axle     1.40 [m]
Polar moment of inertia at CG     2420 [kg·m²]
Tire-road friction coefficient    1.0
Sampling frequency                20 [Hz]
System frequency                  20 [Hz]
IV-B Problem Description
The offline policy is trained based on the nonlinear and non-input-affine vehicle dynamics, in which the lateral tire forces of the front and rear tires appear [30]. The lateral tire forces can be approximated according to the Fiala tire model, in terms of the tire slip angle, the tire load, and the friction coefficient, where the subscript denotes the front or rear tire. The slip angles can be calculated from the geometric relationship between the front/rear axle and the center of gravity (CG), and the loads on the front and rear tires can be approximated from the static weight distribution.
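The tire-force and axle-load computations described above can be sketched as follows, using the parameters of Table II. The lateral-force model here is a simplified linear-with-saturation stand-in for the full Fiala model (which adds higher-order terms before saturation), and the static load split is an assumption based on the CG distances.

```python
import numpy as np

def lateral_tire_force(alpha, C, mu, Fz):
    """Simplified stand-in for the Fiala lateral-force model: linear in
    the slip angle alpha, saturated at the friction limit mu * Fz."""
    return np.clip(-C * alpha, -mu * Fz, mu * Fz)

def static_axle_loads(m, a, b, g=9.81):
    """Static front/rear axle loads from the CG position:
    Fzf = m*g*b/(a+b), Fzr = m*g*a/(a+b)."""
    Fzf = m * g * b / (a + b)
    Fzr = m * g * a / (a + b)
    return Fzf, Fzr

# Parameters from Table II: m = 1500 kg, a = 1.14 m, b = 1.40 m,
# front cornering stiffness C = 88000 N/rad, mu = 1.0.
Fzf, Fzr = static_axle_loads(1500.0, 1.14, 1.40)
Fy = lateral_tire_force(alpha=0.02, C=88000.0, mu=1.0, Fz=Fzf)
```

At small slip angles the force stays in the linear region; at large slip angles it saturates at the friction limit, as in the Fiala model.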
The utility function of this problem is set to penalize the deviation from the reference trajectory. The policy optimization problem of this example is then formulated as an instance of (9), with the vehicle dynamics above and this utility function.
IV-C Algorithm Details
The policy function is represented by a variant of the RNN, the gated recurrent unit (GRU). The input layer is composed of the states, followed by 4 hidden layers using rectified linear units (ReLU) as activation functions, and the output layer is a saturating layer scaled to enforce the bounded control. We use the Adam method to update the network.
IV-D Result Analysis
Nonlinear MPC problems can be solved with optimization solvers such as ipopt [17] and bonmin [18], whose solutions can be regarded as the numerical optimum.
Fig. 6 compares the computational efficiency of RMPC and the optimization solvers, based on the symbolic framework CasADi [31], under different prediction steps for online application. The calculation time of the optimization solvers is much longer than that of RMPC, and the gap increases with the number of prediction steps. In particular, the fastest optimization solver, ipopt, is over 5 times slower than RMPC. This demonstrates the efficiency of the RMPC method.
We run Algorithm 1 ten times and calculate the policy error between the ipopt solution and RMPC at each iteration. The error is normalized by the range (maximum minus minimum) of the optimal control input over the test states, so it indicates the relative error of the control output of the N-cycle network with respect to the optimum of the N-step prediction problem.
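One plausible reading of this normalized error metric, as a sketch (the exact normalization used in the paper may differ in detail):

```python
import numpy as np

def relative_policy_error(u_net, u_opt):
    """Relative policy error: mean absolute deviation of the network
    output from the solver optimum, normalized by the range of the
    optimal control over the test states."""
    u_net, u_opt = np.asarray(u_net), np.asarray(u_opt)
    spread = u_opt.max() - u_opt.min()   # max - min of the optimum
    return np.abs(u_net - u_opt).mean() / spread

# Hypothetical network outputs vs. solver optima over three test states.
err = relative_policy_error([0.10, -0.21, 0.29], [0.1, -0.2, 0.3])
```

An error below 0.02 would correspond to the "less than 2%" level reported in the training curves.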
In Fig. 7, we plot the policy error curves during training for different prediction steps N. All policy errors decrease rapidly to a small value during training; in particular, the policy errors for all N eventually reduce to less than 2%. This indicates that Algorithm 1 is able to find the near-optimal policy of MPC problems with different prediction horizons.
Fig. 8 shows the policy performance of the ipopt solution and the learned policy with different prediction horizons. The policy performance is measured by the loss function accumulated over 200 steps (10 s) of simulation starting from a randomly initialized state, i.e.,
(12) C = Σ_{k=1}^{200} l(x_k, r_k).
For all prediction horizons, the learned policy performs as well as the ipopt solution. More recurrent cycles (i.e., longer prediction horizons) help reduce the accumulated cost.
In detail, Fig. 16 presents the control results of the learned policy with different numbers of recurrent cycles, and Fig. 9 compares the control output of the learned policy (after 15 recurrent cycles) with the ipopt controller. The trajectory controlled by the RMPC controller almost overlaps with that of the ipopt controller, and the more recurrent cycles of the learned policy, the smaller the trajectory tracking error. This is why we adaptively select the control law with the longest prediction horizon in real applications.
To summarize, this example demonstrates the optimality, efficiency, and generality of the RMPC algorithm.
V Conclusion
This paper proposes the Recurrent Model Predictive Control (RMPC) algorithm to solve general nonlinear finite-horizon optimal control problems. Unlike traditional MPC algorithms, it can make full use of the current computing resources and adaptively select the longest feasible prediction horizon. Our algorithm employs an RNN to approximate the optimal policy, mapping the system states and reference values directly to the control inputs; the output of the learned policy network after N recurrent cycles corresponds to the nearly optimal solution of N-step MPC. A policy optimization objective is designed by decomposing the MPC cost function according to Bellman's principle of optimality, and the optimal recurrent policy is obtained by directly minimizing this objective, which makes the method applicable to general nonlinear and non-input-affine systems. The convergence and optimality of RMPC are further proved. We demonstrate its optimality, generality, and efficiency in a HIL experiment: RMPC is over 5 times faster than the traditional MPC solver, and the control performance of the learned policy can be further improved as the number of recurrent cycles increases.
References
 [1] S. J. Qin and T. A. Badgwell, “A survey of industrial model predictive control technology,” Control engineering practice, vol. 11, no. 7, pp. 733–764, 2003.
 [2] S. Vazquez, J. Leon, L. Franquelo, J. Rodriguez, H. A. Young, A. Marquez, and P. Zanchetta, “Model predictive control: A review of its applications in power electronics,” IEEE Industrial Electronics Magazine, vol. 8, no. 1, pp. 16–31, 2014.
 [3] S. E. Li, Z. Jia, K. Li, and B. Cheng, “Fast online computation of a model predictive controller and its application to fuel economy–oriented adaptive cruise control,” IEEE Transactions on Intelligent Transportation Systems, vol. 16, no. 3, pp. 1199–1209, 2014.
 [4] J. H. Lee, “Model predictive control: Review of the three decades of development,” International Journal of Control, Automation and Systems, vol. 9, no. 3, p. 415, 2011.
 [5] R. Cagienard, P. Grieder, E. C. Kerrigan, and M. Morari, “Move blocking strategies in receding horizon control,” Journal of Process Control, vol. 17, no. 6, pp. 563–570, 2007.
 [6] Y. Wang and S. Boyd, “Fast model predictive control using online optimization,” IEEE Transactions on control systems technology, vol. 18, no. 2, pp. 267–278, 2009.
 [7] A. Bemporad, M. Morari, V. Dua, and E. N. Pistikopoulos, “The explicit linear quadratic regulator for constrained systems,” Automatica, vol. 38, no. 1, pp. 3–20, 2002.
 [8] B. Kouvaritakis, M. Cannon, and J. A. Rossiter, “Who needs qp for linear mpc anyway?” Automatica, vol. 38, no. 5, pp. 879–884, 2002.
 [9] T. Geyer, F. D. Torrisi, and M. Morari, “Optimal complexity reduction of polyhedral piecewise affine systems,” Automatica, vol. 44, no. 7, pp. 1728–1740, 2008.
 [10] C. N. Jones and M. Morari, “Polytopic approximation of explicit model predictive controllers,” IEEE Transactions on Automatic Control, vol. 55, no. 11, pp. 2542–2553, 2010.
 [11] C. Wen, X. Ma, and B. E. Ydstie, “Analytical expression of explicit mpc solution via lattice piecewiseaffine function,” Automatica, vol. 45, no. 4, pp. 910–917, 2009.
 [12] F. Borrelli, M. Baotić, J. Pekar, and G. Stewart, “On the computation of linear model predictive control laws,” Automatica, vol. 46, no. 6, pp. 1035–1041, 2010.
 [13] B. M. Åkesson, H. T. Toivonen, J. B. Waller, and R. H. Nyström, “Neural network approximation of a nonlinear model predictive controller applied to a ph neutralization process,” Computers & chemical engineering, vol. 29, no. 2, pp. 323–335, 2005.
 [14] B. M. Åkesson and H. T. Toivonen, “A neural network model predictive controller,” Journal of Process Control, vol. 16, no. 9, pp. 937–946, 2006.
 [15] L. Cheng, W. Liu, Z.G. Hou, J. Yu, and M. Tan, “Neuralnetworkbased nonlinear model predictive control for piezoelectric actuators,” IEEE Transactions on Industrial Electronics, vol. 62, no. 12, pp. 7717–7727, 2015.
 [16] J. Duan, Z. Liu, S. E. Li, Q. Sun, Z. Jia, and B. Cheng, “Deep adaptive dynamic programming for nonaffine nonlinear optimal control problem with state constraints,” arXiv preprint arXiv:1911.11397, 2019.
 [17] A. Wächter and L. T. Biegler, “On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming,” Mathematical Programming, vol. 106, no. 1, pp. 25–57, 2006.
 [18] P. Bonami, L. T. Biegler, A. R. Conn, G. Cornuéjols, I. E. Grossmann, C. D. Laird, J. Lee, A. Lodi, F. Margot, N. Sawaya et al., “An algorithmic framework for convex mixed integer nonlinear programs,” Discrete Optimization, vol. 5, no. 2, pp. 186–204, 2008.
 [19] T. Mikolov, M. Karafiát, L. Burget, J. Černockỳ, and S. Khudanpur, “Recurrent neural network based language model,” in Eleventh annual conference of the international speech communication association, 2010.
 [20] S. Li, H. Wang, and M. U. Rafique, “A novel recurrent neural network for manipulator control with improved noise tolerance,” IEEE transactions on neural networks and learning systems, vol. 29, no. 5, pp. 1908–1918, 2017.
 [21] L. K. Li, “Approximation theory and recurrent networks,” in Proc. of IJCNN, vol. 2, pp. 266–271. IEEE, 1992.
 [22] A. M. Schäfer and H.G. Zimmermann, “Recurrent neural networks are universal approximators,” International journal of neural systems, vol. 17, no. 04, pp. 253–263, 2007.
 [23] B. Hammer, “On the approximation capability of recurrent neural networks,” Neurocomputing, vol. 31, no. 14, pp. 107–123, 2000.

 [24] Z. Allen-Zhu, Y. Li, and Z. Song, “A convergence theory for deep learning via over-parameterization,” in International Conference on Machine Learning, pp. 242–252. Long Beach, California, USA: ICML, 2019.  [25] S. Du, J. Lee, H. Li, L. Wang, and X. Zhai, “Gradient descent finds global minima of deep neural networks,” in International Conference on Machine Learning, pp. 1675–1685. Long Beach, California, USA: ICML, 2019.
 [26] Z. Allen-Zhu, Y. Li, and Z. Song, “On the convergence rate of training recurrent neural networks,” in Advances in Neural Information Processing Systems, pp. 6673–6685, 2019.
 [27] R. Li, Y. Li, S. E. Li, E. Burdet, and B. Cheng, “Driverautomation indirect shared control of highly automated vehicles with intentionaware authority transition,” in 2017 IEEE Intelligent Vehicles Symposium (IV), pp. 26–32. Redondo Beach, California, USA: IEEE, 2017.
 [28] C. Chen, J. Pan, X. Chang, Q. Xu, and J. Wang, “System design and function verification of an extensible autonomous driving platform,” in 2019 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1–5. Sapporo, Japan: IEEE, 2019.
 [29] R. Benekohal and J. Treiterer, “Carsim: Car-following model for simulation of traffic in normal and stop-and-go conditions,” Transportation Research Record, vol. 1194, pp. 99–111, 1988.
 [30] J. Kong, M. Pfeiffer, G. Schildbach, and F. Borrelli, “Kinematic and dynamic vehicle models for autonomous driving control design,” in 2015 IEEE Intelligent Vehicles Symposium (IV), pp. 1094–1099. Seoul, South Korea: IEEE, 2015.
 [31] J. A. E. Andersson, G. Joris, H. Greg, J. B. Rawlings, and D. Moritz, “Casadi: a software framework for nonlinear optimization and optimal control,” Mathematical Programming Computation, vol. 11, pp. 1–36, 2018.