1 Introduction
Imitation Learning (IL, osa_algorithmic_2018
) aims at reproducing an existing control policy by means of a function approximator and can be used, for instance, to hotstart reinforcement learning. Effective learning and generalisation to unseen data are paramount to IL success, especially in safety critical applications. Model Predictive Control (MPC,
Maciejowski_book; Camacho2007; rawlingsMPC; Cannon_book; Gallieri2016; Borrelli_book; Rakovic2019) is the most successful advanced control methodology for systems with hard safety constraints. At each time step, a finite horizon forecast is made from a predictive model of the system and the optimal actions are computed, generally relying on convex constrained Quadratic Programming (QP, Boyd_convexopt; Bemporad2000). Stability of the MPC in closed loop with the physical system requires the solution of a simpler unconstrained infinite horizon control problem (rawlings_mayne_paper) which results in a value function (terminal cost and constraint) and a candidate terminal controller to be accounted for in the MPC forecasting. For Linear Time Invariant (LTI) models and quadratic costs, this means solving (offline) a Riccati equation (KalmanLQR) or a linear matrix inequality (Boyd_lmi). Under these conditions, an MPC controller will effectively control a system, up to a certain accuracy, provided that uncertainties in the model dynamics are limited (Limon2009). Inaccuracies in the MPC predictions can reduce its effectiveness (and robustness) as the forecast diverges from the physical system trajectory over long horizons. This is particularly critical in applications with both short and longterm dynamics and it is generally addressed, for instance in robust MPC (richards_a._g._robust_2004; Rakovic2012), by using a controller to prestabilise the predictions.This paper presents an infinitehorizon differentiable linear quadratic MPC that can be learned using gradientbased methods. In particular, the learning method uses an MPC controller where the terminal cost and terminal policy are the solution of an unconstrained infinitehorizon Linear Quadratic Regulator (LQR). A closedform solution for the derivative of the Discretetime Algebraic Riccati Equation (DARE) associated with the LQR is presented so that the stationary solution of the forward pass is fully differentiable. This method allows analytical results from control theory to be used to determine the stabilizing properties of the learned controller when implemented in closedloop. Once the unconstrained LQR is computed, the predictive model is prestabilised using a linear statefeedback controller to improve the conditioning of the QP and the numerical accuracy of the MPC solution and gradients. The proposed algorithm successfully learns an MPC with both local stability and intrinsic robustness guarantees under small model uncertainties.
Contributions
This paper provides a framework for correctly learning an infinitehorizon, LTI quadratic MPC using recent developments in differentiable QPs (amos_optnet:_2017) and principles from optimal control (Blanchini). A primary contribution is that the Discretetime Algebraic Riccati Equation (DARE) is used to provide infinitehorizon optimality and stability, and an analytical derivative of the solution of the DARE is derived so that differentiationbased optimization can be used for training. This connects known results on MPC stability (Limon2003StableCM; Limon2009) and on infinitehorizon optimality (Scokaert1998) to imitation learning (osa_algorithmic_2018).
A further contribution is the MPC control formulation: a prestabilizing linear statefeedback controller is implemented from the solution of the DARE, and then the total control input is obtained as a perturbation of the feedback control law from the solution of a convex QP. The prestabilizing controller ensures that the QP is well conditioned and promotes a highly accurate global solution, which in turn ensures that the gradients calculated in the backwards pass are accurate. Additionally, an augmented Lagrangian penalty method is used to enforce constraints on state and control input. This approach ensures that the hard constraints are strictly enforced if the penalty term is sufficiently large, and also guarantees that the MPC problem is feasible throughout the training process. These contributions are in contrast to (amos_differentiable_2018) which did not consider state constraints, and implemented a differential dynamic programming (tassa_controllimited_2014) method to solve the MPC optimization for which convergence could not be guaranteed.
The framework is implemented on a set of second order massspringdamper systems and a vehicle platooning model, where it is demonstrated that the infinite horizon cost can be learned and the hard constraints can be guaranteed using a short finite prediction horizon.
Notation
identity matrix. matrix of zeros.
a vector of
zeros. a vector of ones. All inequalities and are considered elementwise in the context of vectors.largest absolute eigenvalue of given matrix
. is defined as i.e. the columns of a matrix stacked into a vector. For a matrix , the permutation matrix is implicitly defined by . The Kronecker product, , is defined as in (magnus99, pp. 440).2 Differentiable MPC
Linear quadratic MPC
This paper considers linear time invariant systems of the form
(1) 
where is the system state, is the control input, is the state transition matrix, is the input matrix, is the time, and is the timestep (assumed constant). The control problem for this system is to determine the sequence of values of that achieve a desired level of performance (e.g. stability, frequency response, etc…), and when the system is subject to hard constraints on control input, , and state, , (or a combination of both), a well studied framework for controller synthesis is MPC. The principle of MPC is that the system’s control input and state are optimized over a finite prediction horizon, then the first element of the obtained control sequence is implemented at the current time step and the process is repeated ad infinitum. For linear MPC it is common to use a quadratic stage cost and box constraints on state and control ( and where ), so that at each time index the vector of optimized control variables is obtained from
(2)  
s.t.  
where is the predicted control trajectory, is the predicted state trajectory, represents the stage cost on control input, represents the stage cost on state, represents the terminal cost on state, is the prediction horizon, are slack variables for the control constraint, are slack variables for the state constraint, and and represent the cost of control and state constraint violations. The variables and enforce the box constraints on state and control using the augmented Lagrangian method (nocedal2006a, §17.2), and it can be shown that for sufficiently high and the constraints and can be exactly guaranteed (Kerrigan00softconstraints) (i.e. ). The benefit of this approach is that it ensures that the MPC optimization is feasible at each iteration of the learning process, whilst still ensuring that the constraints are ‘hard’. To close the MPC control loop, at each timestep, , the first element of the optimized control sequence, , is implemented as .
Prestabilised MPC
If the control input is decomposed into , where is a stabilizing linear statefeedback matrix and is a perturbation to the feedback control, system (1) becomes
(3) 
and problem (2) becomes
(4)  
s.t.  
so that is implemented as . Using this decomposition, system (3) controlled with the solution of (4) is precisely equal to system (1) controlled with the solution of (2), but problem (4) is preferable from a computational standpoint if is openloop unstable (i.e. ) and is ‘large’, as this can lead to poor conditioning of the matrices defined in Appendix A. This is important in the context of differentiable MPC, as if is being learned then there may be no bounds on its eigenvalues at any given iteration.
MPC derivative.
Problems (2) and (4) can be rearranged into the QP form (details in Appendix A)
(5) 
When is uniquely defined by (5), it can also be considered as the solution of an implicit function defined by the KarushKuhnTucker (KKT) conditions, and in amos_optnet:_2017 it was demonstrated that it is possible to differentiate through this function to obtain the derivatives of with respect to the parameters , , , , and . ^{1}^{1}1Note that (5) differs from the form presented in amos_optnet:_2017, and is instead the form of problem solved by the OSQP solver used in this paper. Appendix B demonstrates how to differentiate (5) using the solution returned by OSQP.
The MPC controller can then be used as a layer in a neural network, and backpropagation can be used to determine the derivatives of an imitation cost function with respect to the MPC parameters
, , , , , , , , and .Imitation Learning.
A possible use case of the derivative of a model predictive controller is imitation learning, where a subset of cost function, system dynamics, constraints are learned from observations of a system being controlled by an ‘expert’. Imitation learning can be performed by minimizing the loss
(6) 
where is the measured control input, is the full MPC solution, and
is a hyperparameter. It is assumed that both the learning algorithm and MPC controller have completely precise measurements of both the state and control input. The first term of (
6) is the control imitation loss, and the second term penalises the onestep ahead prediction error In practice, the prediction error loss might not be needed for the MPC to be learned correctly, however its use can be instrumental for stability, as discussed in the next section.3 Terminal cost for infinite horizon
Terminal cost.
The infinitehorizon discretetime Linear Quadratic Regulator (LQR, KalmanLQR) is given with state feedback gain
(7) 
where is obtained as a solution of the DARE
(8) 
The principle of the approach presented in this paper is the MPC controller (2,4) is implemented with . Proposition 1 summarises the relevant properties of the proposed MPC, building on classic MPC results from Scokaert1998; Limon2003StableCM; Limon2009.
Proposition 1.
Consider the MPC problem (4) with , where and solve (78). Define as the optimal objective in (4) with . Denote the optimal stage cost with as . Then, for the closedloop system, it follows that:

For any , there exists a closed and bounded set, , such that, if and , then the MPC solution is infinitehorizon optimal for any .

There exist positive scalars , , such that, for any , if then the MPC constraints are feasible, , and the origin is asymptotically stable , with
(9) 
There exist a scalar, , such that, for any the MPC constraints are robustly feasible, , and the system is InputtoState Stable (ISS) given an additive model error, , such that: . In other words:
for some strictly increasing, bounded function, , with .

The QP matrices, , and the vector , in (5), have finite norms for any .
Implications.
Proposition 1 has some important implications. First, point 1 implies that there exists a statedependant finite horizon length, , which is sufficient to make the MPC problem infinitehorizon optimal. This can be upper bounded for a closed and bounded set of feasible states, . Scokaert1998 proposed an iterative search that increases the horizon until optimality is verified; a similar algorithm is discussed in Appendix E where learning is completed with a large horizon and then iteratively reduced afterwards, although it is not implemented in this paper. Point 2,3 state that MPC that can provide stability and constraints satisfaction, hence safety, if the model error is small. This also applies to small errors in the QP solution. Finally, point 4 states that the QP matrices have finite norm when the system dynamics are prestabilised using the LQR gain^{2}^{2}2Note that any stabilising gain would be acceptable for the purpose of QP conditioning only., so the MPC problem is well conditioned and can be solved reliably to high accuracy, even over long horizons. If the openloop system is unstable then the terms of the matrices in Appendix A for the standard form are unbounded, so the QP solution may be poorly conditioned and the result inaccurate for long horizons. This can in turn invalidate the results of amos_optnet:_2017 which assumes that the KKT conditions are exactly satisfied in order to compute its gradients.
DARE Derivative.
In order to implement in a differentiable imitation learning framework such as that presented in Section 2, the solution of the DARE is differentiated as follows.
Proposition 2.
The sensitivity of the DARE solution has been investigated in the context of robustness to perturbations in the input matrices, e.g. riccati_sensitivity_Sun; konstantinov1993perturbation, and the analytical derivative of the continuoustime algebraic Riccati equation was derived in riccati_derivative_brewer by differentiating the exponential of the Hamiltonian matrix, but to the best of the authors’ knowledge this is the first presentation of an analytic derivative of the DARE using the differential calculus approach of magnus99.
Algorithm overview
Algorithm 1 presents the overall procedure for learning a subset, , of the MPC controller parameters, , with the key steps of the forwards and backwards pass of a gradientbased optimization method. In each forward pass the MPC terminal cost matrix, , and the prestabilizing controller, , are set from the solution of the DARE, then the DARE and MPC QP solutions are differentiated in the backward pass to obtain the gradients. Note that the horizon, , is not differentiable, and that learning the entire set simultaneously is challenging in general.
4 Numerical Experiments
In this section the performance of the algorithm was demonstrated through numerical experiments in two test cases: firstly on a set of second order massspringdamper models to provide a performance baseline in an easily interpretable setting, and then on vehicle platooning problem to investigate a higherdimensional realworld application.
4.1 MassSpringDamper
Model & Expert
Expert data was generated using a massspringdamper model parameterized by a mass, , damping coefficient, , stiffness, , and timestep , where
so that is the position and velocity of the mass, and the is a force applied to the mass.
System  

1  0.5  0.1  0.1  0.3  0.5  0.6 
Seven models were created with , , and , and was varied as shown in Table 1 to affect the openloop stability of the models ( stable, unstable). The expert data was then generated by simulating each of the systems the initial condition in closedloop with an infinitehorizon MPC controller (i.e. the horizon was increased until the openloop state predictions matched the closedloop response), using , , , , and . The constraint set was chosen so that the constraints on both state and control input were strongly active at the solution whilst ensuring that the expert MPC optimization was feasbile. The values were found to be sufficient to enforce the hard constraints and were used for all experiments. It is important to note that the approach of (amos_differentiable_2018) cannot be used reliably for even this simple example as it does not consider state constraints, and when hard constraints are added to the method it fails in general because the optimization problem has become infeasible in the forwards pass at some time .
Learning
The learner and expert shared all system and controller information apart from the state transition matrix , which was learned, and the MPC horizon length, which was implemented as each of in three separate experiments.
was initialized with the correct state transition matrix plus a uniformly distributed pseudorandom perturbation in the interval
added to each element. The learner was supplied with the first 50 elements of the closed loop state trajectory and corresponding controls as a batch of inputs, and was trained to minimize the imitation loss (6) with , i.e. the state dynamics were learned using predicted control trajectories only, and the state transitions are not made available to the learner (this is the same approach used in amos_differentiable_2018). The experiments were implemented in Pytorch 1.2.0 using the builtin Adam optimizer
(kingma2015a) for 1000 steps using default parameters. The MPC optimization problems were solved for the ‘expert’ and ‘learner’ using OSQP (osqp) with settings (eps_abs=1E10, eps_rel=1E10, eps_rim_inf=1E10, eps_dual_inf=1E10).Results
Figure 1 shows the imitation and model loss at each of the 1000 optimization iterations for each of the tested horizon lengths. It can be seen that for all of the generated systems the imitation loss converges to a low value, although this is a local minimum in general. In most cases, the learned model converges to a close approximation of the real model, although as the problem is nonconvex this cannot be guaranteed, and it is also shown that there are some cases in which the model does not converge correctly. This occurred exclusively for , where neither system nor system converge to the correct dynamics. Additionally, it can be seen that both the imitation loss and model loss converge faster as the prediction horizon is increased. This suggests that a longer learning horizon improves the learning capabilities of the methods, but there is not sufficient data to demonstrate this conclusively.
To test generalization performance, each of the systems was reinitialized with initial condition and simulated in closed loop using the learned controller for each horizon length. The results are compared in Figure 2 against the same systems controlled with an infinite horizon MPC controller. The primary observation is that as the learned MPC horizon is increased to , the closed loop trajectories converge to expert trajectories, indicating that the infinite horizon cost has been learned (when using the infinite horizon cost with no model mismatch or disturbance, the predicted MPC trajectory is exactly the same as the closed loop trajectory), and that the state constraints are guaranteed for . Furthermore, it can be seen that the learned controllers are stabilizing, even for the shortest horizon and the most unstable openloop systems. This is also the case for systems and where the incorrect dynamics were learned, although in this case the state constraints are not guaranteed for .
4.2 Vehicle Platooning
vehicles in 1 degree of freedom where
is longitudinal displacement.Model & Expert
Vehicle platoon control is a problem that has been studied using control theory (e.g. platoon), but here it is demonstrated that a safe, stabilizing controller can be learned from examples of vehicles driving in formation. Figure 3 shows an illustration of a platoon of vehicles for which the objective is to stabilize the relative longitudinal positions of each vehicle to the steadystate conditions and , subject to the hard constraint that relative position of the vehicles is never lower than a safe threshold , and that the vehicles’ ability to brake and accelerate is constrained by where (note that only the relative positions and velocities of the vehicles is considered, as the global position and velocity of the platoon can be controlled separately by adding an equal perturbation to each element of ). In appendix F it is shown that this can be modelled as a discrete time LTI system. Expert data was generated from the model with vehicles so that and . 20 instances were generated using random feasible initial conditions with m and m, and then simulated for s in time intervals of s with an infinitehorizon MPC controller, using and .
Learning
The learner and expert shared all system and controller information apart from the cost matrices and , which were learned, and the MPC horizon length, which was implemented as each of in four separate experiments. The matrices and were initialized as completely random diagonal matrices with each element uniformly distributed in the interval
, and the diagonal structure was maintained through training. 500 training iterations were used; otherwise the learning process (loss function, learning rate, etc…) was the same as in Section
4.1.Results
Figure 4 shows the imitation and cost function losses at each of the 500 optimization iterations for each of the tested horizon lengths and initial conditions. As with the massspringdamper experiments, it is suggested that a longer prediction horizon improves training as the imitation loss generally converges to a lower value for the examples with , but only convergence to a local minimum is achieved in general. The cost error also does not converge in general (although better convergence is observed again for the longer horizon lengths), however for this learning problem there is a manifold of matrices and with the same minimizing argument, so divergence of the cost error does not necessarily indicate that the learned cost function is ‘incorrect’. Furthermore, in this case the model is known exactly, so the closedloop infinitehorizon properties can be obtained even without the correct cost function.
Figure 5 shows the model simulated from the same initial condition in closed loop using a learned controller for each of the horizon lengths, together with the error between the MPC state predictions and ensuing closedloop behaviour. All of the controllers are observed to successfully satisfy the hard constraints on vehicle separation, and all converge to the correct steadystate vehicle separation. The differences between the prediction capabilities of the controllers is highlighted by the state prediction errors, and it can be seen that for the state predictions match the ensuing behaviour, indicating that the infinite horizon cost is being used and that closedloop stability is guaranteed, even without the use of a terminal constraint set. It is also demonstrated for that the largest errors occur from predictions made at times when the state constraints are active, suggesting that these controllers deviate from their predictions to satisfy the constraints at later intervals.
4.3 Limitations
The above approach is limited in scope to LTI systems, and a more comprehensive solution would cover linear time varying systems (for which the MPC is still obtained from the solution of a QP). In this case the infinite horizon cost cannot be obtained from the solution of the DARE, and the extension of the methods presented in this paper to time varying or nonlinear models is nontrivial (see Appendix G for further discussion). Additionally, the derivative of the DARE in Proposition 2 involves multiple Kronecker products and matrix inversions (including an matrix) that do not scale well to large state and control dimensions, although the dynamics of physical systems can usually be reasonably approximated with only a few tens of variables, so this may not become an issue in practice. The algorithm also requires a stabilizing solution of the DARE to exist; theories for the existence of stabilizing solutions are nontrivial (e.g. RAN198863), and it is not immediately obvious how to enforce their existence throughout the training process (stabilizibility can be encouraged using the onestep ahead term in (6)).
Acknowledgments
The authors are grateful to Brandon Amos for providing support using his differentiable QP tool (https://github.com/locuslab/optnet) in the preliminary work for this project (all of the methods presented in this paper were developed independently).
References
Appendices
Appendix A MPC quadratic program
Appendix B OSQP derivatives
OSQP solves quadratic programs of the form (5), and returns values for , , and that satisfy
(osqp, §2), where is the set , and is the normal cone of . The values of that are returned by the solver can be used to determine whether the constraints are strongly active at the solution, where indicates that the constraints and are inactive, indicates that is strongly active, and indicates that is strongly active. The solution can therefore be completely characterised by the KKT system
(10) 
where and , and the notation indicates a matrix consisting of the columns of given matrix , and indicates a vector consisting of the elements of given vector . Equation (10) can then be differentiated using the techniques detailed in (amos_optnet:_2017, §3).
Appendix C Proof of Proposition 1
Proof.
(Proposition 1) The first point follows from (Scokaert1998). The next two points of Proposition 1 stem from the results in (Limon2003StableCM; Limon2009). In particular, the closedloop is Lipschitz since the model is linear and the controller is the solution of a strictly convex QP. Moreover, the LQR provides a contractive terminal set. The final point follows from the fact that has eigenvalues in the unit circle, . Proof of point 4 is concluded by inspection of the QP matrices (Appendix A) and by application of Theorem 5.6.12, page 298 of Horn:2012:MA:2422911 which states that, given a bound, , on the spectral radius, then there exists a matrix norm which is also less than . ∎
Appendix D Proof of Proposition 2
Proof.
(Proposition 2) If a stabilizing solution () to (8) exists, it is unique (IONESCU1992229, Proposition 1), and the DARE can therefore be considered an implicit function of , , , and . Using the assumption that exists, it can be concluded that and exist (the Kronecker product and matrix addition, subtraction, and multiplication always exist). Equation (8) can be given by
(11) 
which is differentiable, and are also differentiable. Differentials are taken for (11) and each of as
then these can be combined using the differential chain rule
(magnus99, Theorem 18.2) to obtainThe Jacobians, as defined in Proposition 2, therefore exist if exists. ∎
Appendix E Verification and reduction of the prediction horizon
A method is proposed for the reduction of the MPC prediction horizon after imitation learning. The idea is to be able to reproduce the infinitehorizon optimal MPC up to a tolerance
with high probability. Do do so, we check that, for a candidate horizon
, the MPC action deltas, , satisfy , for all . This means that the optimal action is equal to the LQR up to a tolerance . In order to provide a high probability guarantee of this condition, we propose the use of a probabilistic verification approach, similar to bobiti_samplingdriven_nodate. This is described in Algorithm 2. In particular, the condition is checked on a high number, , of initial states. These states are sampled uniformly from a set of interest , which can be either the state constraintsor an estimate of the region of attraction,
. If verified, this set is a region of attraction for the system with high probability. The relationship between the number of samples and the verification probability is discussed in (bobiti_samplingdriven_nodate, Chapter 5). The algorithm also checks whether the infinite horizon condition has been reached for the used during training. Finally, a line search for a suitable is proposed using a scaling factor . In particular, the initial set is downscaled until either an horizon is found or the set becomes empty. In latter case the search fails and the procedure returns to the training algorithm with an increased . Noticeably, the proposed algorithm does not require to explicitly compute the terminal set in which the LQR is invariant and it could be used also for nonlinear MPC if an infinitehorizon (or a stabilising) terminal controller is available.Appendix F Platoon Model Derivation
The problem described in Section 4.2 can be decomposed into the regulation problem
subject to the constraints
If each vehicle is modelled as a mass then a continuoustime LTI state space model can be formed as
(12) 
which can then be given as
If it is assumed that the control input is constant between sampling intervals and , then this can be given in discrete time as
(13) 
where , and and are subject to the constraints
Appendix G Nonlinear models
As discussed in the main paper, our approach is currently limited to Linear Time Invariant (LTI) systems. In general, conditions for infinitehorizon optimality of systems that are not LTI are nontrivial. Some of the results on MPC stability could however be maintained, for example in the case when the LQR value function, , is a local control Lyapunov function (khalil2001; rawlings_mayne_paper). In this case, the stability and intrinsic robustness results are maintained (see Limon2003StableCM; Limon2009). For these system, it would be possible to use our method, for instance in combination with amos_differentiable_2018, to provide a stable Nonlinear MPC. This is however a big assumptions for systems that are very nonlinear. Assessing this LQR controllability condition could be done, for instance, by training a local linear model around the target equilibrium (origin) and then checking whether the DARE is solvable. This should be performed before starting the imitation learning. We leave the study of more general systems to future work.