 # Infinite-Horizon Differentiable Model Predictive Control

This paper proposes a differentiable linear quadratic Model Predictive Control (MPC) framework for safe imitation learning. The infinite-horizon cost is enforced using a terminal cost function obtained from the discrete-time algebraic Riccati equation (DARE), so that the learned controller can be proven to be stabilizing in closed-loop. A central contribution is the derivation of the analytical derivative of the solution of the DARE, thereby allowing the use of differentiation-based learning methods. A further contribution is the structure of the MPC optimization problem: an augmented Lagrangian method ensures that the MPC optimization is feasible throughout training whilst enforcing hard constraints on state and input, and a pre-stabilizing controller ensures that the MPC solution and derivatives are accurate at each iteration. The learning capabilities of the framework are demonstrated in a set of numerical studies.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Imitation Learning (IL, osa_algorithmic_2018

) aims at reproducing an existing control policy by means of a function approximator and can be used, for instance, to hot-start reinforcement learning. Effective learning and generalisation to unseen data are paramount to IL success, especially in safety critical applications. Model Predictive Control (MPC,

Maciejowski_book; Camacho2007; rawlingsMPC; Cannon_book; Gallieri2016; Borrelli_book; Rakovic2019) is the most successful advanced control methodology for systems with hard safety constraints. At each time step, a finite horizon forecast is made from a predictive model of the system and the optimal actions are computed, generally relying on convex constrained Quadratic Programming (QP, Boyd_convexopt; Bemporad2000). Stability of the MPC in closed loop with the physical system requires the solution of a simpler unconstrained infinite horizon control problem (rawlings_mayne_paper) which results in a value function (terminal cost and constraint) and a candidate terminal controller to be accounted for in the MPC forecasting. For Linear Time Invariant (LTI) models and quadratic costs, this means solving (offline) a Riccati equation (KalmanLQR) or a linear matrix inequality (Boyd_lmi). Under these conditions, an MPC controller will effectively control a system, up to a certain accuracy, provided that uncertainties in the model dynamics are limited (Limon2009). Inaccuracies in the MPC predictions can reduce its effectiveness (and robustness) as the forecast diverges from the physical system trajectory over long horizons. This is particularly critical in applications with both short and long-term dynamics and it is generally addressed, for instance in robust MPC (richards_a._g._robust_2004; Rakovic2012), by using a controller to pre-stabilise the predictions.

This paper presents an infinite-horizon differentiable linear quadratic MPC that can be learned using gradient-based methods. In particular, the learning method uses an MPC controller where the terminal cost and terminal policy are the solution of an unconstrained infinite-horizon Linear Quadratic Regulator (LQR). A closed-form solution for the derivative of the Discrete-time Algebraic Riccati Equation (DARE) associated with the LQR is presented so that the stationary solution of the forward pass is fully differentiable. This method allows analytical results from control theory to be used to determine the stabilizing properties of the learned controller when implemented in closed-loop. Once the unconstrained LQR is computed, the predictive model is pre-stabilised using a linear state-feedback controller to improve the conditioning of the QP and the numerical accuracy of the MPC solution and gradients. The proposed algorithm successfully learns an MPC with both local stability and intrinsic robustness guarantees under small model uncertainties.

#### Contributions

This paper provides a framework for correctly learning an infinite-horizon, LTI quadratic MPC using recent developments in differentiable QPs (amos_optnet:_2017) and principles from optimal control (Blanchini). A primary contribution is that the Discrete-time Algebraic Riccati Equation (DARE) is used to provide infinite-horizon optimality and stability, and an analytical derivative of the solution of the DARE is derived so that differentiation-based optimization can be used for training. This connects known results on MPC stability (Limon2003StableCM; Limon2009) and on infinite-horizon optimality (Scokaert1998) to imitation learning (osa_algorithmic_2018).

A further contribution is the MPC control formulation: a pre-stabilizing linear state-feedback controller is implemented from the solution of the DARE, and then the total control input is obtained as a perturbation of the feedback control law from the solution of a convex QP. The pre-stabilizing controller ensures that the QP is well conditioned and promotes a highly accurate global solution, which in turn ensures that the gradients calculated in the backwards pass are accurate. Additionally, an augmented Lagrangian penalty method is used to enforce constraints on state and control input. This approach ensures that the hard constraints are strictly enforced if the penalty term is sufficiently large, and also guarantees that the MPC problem is feasible throughout the training process. These contributions are in contrast to (amos_differentiable_2018) which did not consider state constraints, and implemented a differential dynamic programming (tassa_control-limited_2014) method to solve the MPC optimization for which convergence could not be guaranteed.

The framework is implemented on a set of second order mass-spring-damper systems and a vehicle platooning model, where it is demonstrated that the infinite horizon cost can be learned and the hard constraints can be guaranteed using a short finite prediction horizon.

#### Notation

identity matrix. matrix of zeros.

a vector of

zeros. a vector of ones. All inequalities and are considered element-wise in the context of vectors.

largest absolute eigenvalue of given matrix

. is defined as i.e. the columns of a matrix stacked into a vector. For a matrix , the permutation matrix is implicitly defined by . The Kronecker product, , is defined as in (magnus99, pp. 440).

## 2 Differentiable MPC

This paper considers linear time invariant systems of the form

 xt+dt=Axt+But, (1)

where is the system state, is the control input, is the state transition matrix, is the input matrix, is the time, and is the timestep (assumed constant). The control problem for this system is to determine the sequence of values of that achieve a desired level of performance (e.g. stability, frequency response, etc…), and when the system is subject to hard constraints on control input, , and state, , (or a combination of both), a well studied framework for controller synthesis is MPC. The principle of MPC is that the system’s control input and state are optimized over a finite prediction horizon, then the first element of the obtained control sequence is implemented at the current time step and the process is repeated ad infinitum. For linear MPC it is common to use a quadratic stage cost and box constraints on state and control ( and where ), so that at each time index the vector of optimized control variables is obtained from

 ^u⋆0:N=argmin^u 12N−1∑k=0^u⊤kR^uk+12N−1∑k=1^x⊤kQ^xk+12^x⊤NQN^xN+kuN−1∑k=01⊤mrk+kxN∑k=11⊤nsk (2) s.t. ^x0=xt, ^xk+1=A^xk+B^uk,k∈{0,…,N−1}, u––−rk≤^uk≤¯¯¯u+rkandrk≥0,k∈{0,⋯N−1}, x––−sk≤^xk≤¯¯¯x+skandsk≥0,k∈{1,…,N},

where is the predicted control trajectory, is the predicted state trajectory, represents the stage cost on control input, represents the stage cost on state, represents the terminal cost on state, is the prediction horizon, are slack variables for the control constraint, are slack variables for the state constraint, and and represent the cost of control and state constraint violations. The variables and enforce the box constraints on state and control using the augmented Lagrangian method (nocedal2006a, §17.2), and it can be shown that for sufficiently high and the constraints and can be exactly guaranteed (Kerrigan00softconstraints) (i.e. ). The benefit of this approach is that it ensures that the MPC optimization is feasible at each iteration of the learning process, whilst still ensuring that the constraints are ‘hard’. To close the MPC control loop, at each timestep, , the first element of the optimized control sequence, , is implemented as .

#### Pre-stabilised MPC

If the control input is decomposed into , where is a stabilizing linear state-feedback matrix and is a perturbation to the feedback control, system (1) becomes

 xt+dt=(A+BK)xt+Bδut, (3)

and problem (2) becomes

 δ^u⋆0:N=argminδ^u 12N−1∑k=0(K^xk+δ^uk)⊤R(K^xk+δ^uk)+12N−1∑k=1^x⊤kQ^xk+12^x⊤NQN^xN (4) +kuN−1∑k=01⊤mrk+kxN∑k=11⊤nsk s.t. ^x0=xt, ^xk+1=(A+BK)^xk+Bδ^uk,k∈{0,…,N−1}, u––−rk≤K^xk+δ^uk≤¯¯¯u+rkandrk≥0,k∈{0,…,N−1}, x––−sk≤^xk≤¯¯¯x+skandsk≥0,k∈{1,…,N},

so that is implemented as . Using this decomposition, system (3) controlled with the solution of (4) is precisely equal to system (1) controlled with the solution of (2), but problem (4) is preferable from a computational standpoint if is open-loop unstable (i.e. ) and is ‘large’, as this can lead to poor conditioning of the matrices defined in Appendix A. This is important in the context of differentiable MPC, as if is being learned then there may be no bounds on its eigenvalues at any given iteration.

#### MPC derivative.

Problems (2) and (4) can be rearranged into the QP form (details in Appendix A)

 z⋆=argminz 12z⊤Hz+q⊤zs.t.  lb≤Mz≤ub. (5)

When is uniquely defined by (5), it can also be considered as the solution of an implicit function defined by the Karush-Kuhn-Tucker (KKT) conditions, and in amos_optnet:_2017 it was demonstrated that it is possible to differentiate through this function to obtain the derivatives of with respect to the parameters , , , , and . 111Note that (5) differs from the form presented in amos_optnet:_2017, and is instead the form of problem solved by the OSQP solver used in this paper. Appendix B demonstrates how to differentiate (5) using the solution returned by OSQP.

The MPC controller can then be used as a layer in a neural network, and backpropagation can be used to determine the derivatives of an imitation cost function with respect to the MPC parameters

, , , , , , , , and .

#### Imitation Learning.

A possible use case of the derivative of a model predictive controller is imitation learning, where a subset of cost function, system dynamics, constraints are learned from observations of a system being controlled by an ‘expert’. Imitation learning can be performed by minimizing the loss

 1TT∑t=0∥ut:t+Ndt−^u⋆0:N(xt)∥22+β∥^wt∥22, (6)

where is the measured control input, is the full MPC solution, and

is a hyperparameter. It is assumed that both the learning algorithm and MPC controller have completely precise measurements of both the state and control input. The first term of (

6) is the control imitation loss, and the second term penalises the one-step ahead prediction error In practice, the prediction error loss might not be needed for the MPC to be learned correctly, however its use can be instrumental for stability, as discussed in the next section.

## 3 Terminal cost for infinite horizon

#### Terminal cost.

The infinite-horizon discrete-time Linear Quadratic Regulator (LQR, KalmanLQR) is given with state feedback gain

 K=−(R+B⊤PB)−1B⊤PA, (7)

where is obtained as a solution of the DARE

 P=A⊤PA−A⊤PB(R+B⊤PB)−1B⊤PA+Q. (8)

The principle of the approach presented in this paper is the MPC controller (2,4) is implemented with . Proposition 1 summarises the relevant properties of the proposed MPC, building on classic MPC results from Scokaert1998; Limon2003StableCM; Limon2009.

###### Proposition 1.

Consider the MPC problem (4) with , where and solve (7-8). Define as the optimal objective in (4) with . Denote the optimal stage cost with as . Then, for the closed-loop system, it follows that:

1. For any , there exists a closed and bounded set, , such that, if and , then the MPC solution is infinite-horizon optimal for any .

2. There exist positive scalars , , such that, for any , if then the MPC constraints are feasible, , and the origin is asymptotically stable , with

 ΓN={x∈Rn:V⋆N(x)≤ℓ(x,^u⋆0(x))+(N−1)d+α}. (9)
3. There exist a scalar, , such that, for any the MPC constraints are robustly feasible, , and the system is Input-to-State Stable (ISS) given an additive model error, , such that: . In other words:

 V⋆N(xt+dt)≤V⋆N(xt)−ℓ(xt,^u⋆0(xt))+σ(∥^wt∥),

for some strictly increasing, bounded function, , with .

4. The QP matrices, , and the vector , in (5), have finite norms for any .

###### Proof.

Proof of Proposition 1 is given in Appendix C. ∎

#### Implications.

Proposition 1 has some important implications. First, point 1 implies that there exists a state-dependant finite horizon length, , which is sufficient to make the MPC problem infinite-horizon optimal. This can be upper bounded for a closed and bounded set of feasible states, . Scokaert1998 proposed an iterative search that increases the horizon until optimality is verified; a similar algorithm is discussed in Appendix E where learning is completed with a large horizon and then iteratively reduced afterwards, although it is not implemented in this paper. Point 2,3 state that MPC that can provide stability and constraints satisfaction, hence safety, if the model error is small. This also applies to small errors in the QP solution. Finally, point 4 states that the QP matrices have finite norm when the system dynamics are pre-stabilised using the LQR gain222Note that any stabilising gain would be acceptable for the purpose of QP conditioning only., so the MPC problem is well conditioned and can be solved reliably to high accuracy, even over long horizons. If the open-loop system is unstable then the terms of the matrices in Appendix A for the standard form are unbounded, so the QP solution may be poorly conditioned and the result inaccurate for long horizons. This can in turn invalidate the results of amos_optnet:_2017 which assumes that the KKT conditions are exactly satisfied in order to compute its gradients.

#### DARE Derivative.

In order to implement in a differentiable imitation learning framework such as that presented in Section 2, the solution of the DARE is differentiated as follows.

###### Proposition 2.

Let be the stabilizing solution of (8), and assume that and exist, then the Jacobians of the implicit function defined by (8) are given by

 ∂vecP∂vecA=Z−11Z2,∂vecP∂vecB=Z−11Z3,∂vecP∂vecQ=Z−11Z4,∂vecP∂vecR=Z−11Z5,

where are defined by

 Z1 :=In2−(A⊤⊗A⊤)[In2−(PBM2B⊤⊗In)−(In⊗PBM2B⊤) +(PB⊗PB)(M2⊗M2)(B⊤⊗B⊤)] Z2 :=(Vn,n+In2)(In⊗A⊤M1) Z3 :=(A⊤⊗A⊤)[(PB⊗PB)(M2⊗M2)(Im2+Vm,m)(Im⊗B⊤P) −(In2+Vn,n)(PBM2⊗P)] Z4 :=In2 Z5 :=(A⊤⊗A⊤)(PB⊗PB)(M2⊗M2),

and are defined by

 M1:=P−PBM2B⊤P,M2:=M−13,M3:=R+B⊤PB.
###### Proof.

The proof of Proposition 2 is given in Appendix D. ∎

The sensitivity of the DARE solution has been investigated in the context of robustness to perturbations in the input matrices, e.g. riccati_sensitivity_Sun; konstantinov1993perturbation, and the analytical derivative of the continuous-time algebraic Riccati equation was derived in riccati_derivative_brewer by differentiating the exponential of the Hamiltonian matrix, but to the best of the authors’ knowledge this is the first presentation of an analytic derivative of the DARE using the differential calculus approach of magnus99.

#### Algorithm overview

Algorithm 1 presents the overall procedure for learning a subset, , of the MPC controller parameters, , with the key steps of the forwards and backwards pass of a gradient-based optimization method. In each forward pass the MPC terminal cost matrix, , and the pre-stabilizing controller, , are set from the solution of the DARE, then the DARE and MPC QP solutions are differentiated in the backward pass to obtain the gradients. Note that the horizon, , is not differentiable, and that learning the entire set simultaneously is challenging in general.

## 4 Numerical Experiments

In this section the performance of the algorithm was demonstrated through numerical experiments in two test cases: firstly on a set of second order mass-spring-damper models to provide a performance baseline in an easily interpretable setting, and then on vehicle platooning problem to investigate a higher-dimensional real-world application.

### 4.1 Mass-Spring-Damper

#### Model & Expert

Expert data was generated using a mass-spring-damper model parameterized by a mass, , damping coefficient, , stiffness, , and timestep , where

 A=exp(Acdt),Ac=[01−km−cm],B=(A−In)A−1cBc,Bc=[01m],

so that is the position and velocity of the mass, and the is a force applied to the mass.

Seven models were created with , , and , and was varied as shown in Table 1 to affect the open-loop stability of the models ( stable, unstable). The expert data was then generated by simulating each of the systems the initial condition in closed-loop with an infinite-horizon MPC controller (i.e. the horizon was increased until the open-loop state predictions matched the closed-loop response), using , , , , and . The constraint set was chosen so that the constraints on both state and control input were strongly active at the solution whilst ensuring that the expert MPC optimization was feasbile. The values were found to be sufficient to enforce the hard constraints and were used for all experiments. It is important to note that the approach of (amos_differentiable_2018) cannot be used reliably for even this simple example as it does not consider state constraints, and when hard constraints are added to the method it fails in general because the optimization problem has become infeasible in the forwards pass at some time . Figure 1: Mass-spring-damper. Imitation loss and model loss at each iteration of the training process. Top row: imitation loss. Bottom row: model loss given by ∥vecA−vecAj∥22, where Aj is the learned model at iteration j, and A is the correct model. The model loss was not used as part of the training process, and shown only to indicate whether the model is converging correctly.

#### Learning

The learner and expert shared all system and controller information apart from the state transition matrix , which was learned, and the MPC horizon length, which was implemented as each of in three separate experiments.

was initialized with the correct state transition matrix plus a uniformly distributed pseudo-random perturbation in the interval

added to each element. The learner was supplied with the first 50 elements of the closed loop state trajectory and corresponding controls as a batch of inputs, and was trained to minimize the imitation loss (6) with , i.e. the state dynamics were learned using predicted control trajectories only, and the state transitions are not made available to the learner (this is the same approach used in amos_differentiable_2018

). The experiments were implemented in Pytorch 1.2.0 using the built-in Adam optimizer

(kingma2015a) for 1000 steps using default parameters. The MPC optimization problems were solved for the ‘expert’ and ‘learner’ using OSQP (osqp) with settings (eps_abs=1E-10, eps_rel=1E-10, eps_rim_inf=1E-10, eps_dual_inf=1E-10).

#### Results

Figure 1 shows the imitation and model loss at each of the 1000 optimization iterations for each of the tested horizon lengths. It can be seen that for all of the generated systems the imitation loss converges to a low value, although this is a local minimum in general. In most cases, the learned model converges to a close approximation of the real model, although as the problem is non-convex this cannot be guaranteed, and it is also shown that there are some cases in which the model does not converge correctly. This occurred exclusively for , where neither system nor system converge to the correct dynamics. Additionally, it can be seen that both the imitation loss and model loss converge faster as the prediction horizon is increased. This suggests that a longer learning horizon improves the learning capabilities of the methods, but there is not sufficient data to demonstrate this conclusively. Figure 2: Mass-spring-damper. Closed-loop trajectories using the expert and learned controllers. Trajectories only shown for x1 (position), but x2 (velocity) can be inferred. Expert controllers shown with solid lines, and learned controller shown with dotted lines. The hard state constraints are shown in the red regions.

To test generalization performance, each of the systems was re-initialized with initial condition and simulated in closed loop using the learned controller for each horizon length. The results are compared in Figure 2 against the same systems controlled with an infinite horizon MPC controller. The primary observation is that as the learned MPC horizon is increased to , the closed loop trajectories converge to expert trajectories, indicating that the infinite horizon cost has been learned (when using the infinite horizon cost with no model mismatch or disturbance, the predicted MPC trajectory is exactly the same as the closed loop trajectory), and that the state constraints are guaranteed for . Furthermore, it can be seen that the learned controllers are stabilizing, even for the shortest horizon and the most unstable open-loop systems. This is also the case for systems and where the incorrect dynamics were learned, although in this case the state constraints are not guaranteed for .

### 4.2 Vehicle Platooning Figure 3: Platoon Model. nvvehicles in 1 degree of freedom where y is longitudinal displacement.

#### Model & Expert

Vehicle platoon control is a problem that has been studied using control theory (e.g. platoon), but here it is demonstrated that a safe, stabilizing controller can be learned from examples of vehicles driving in formation. Figure 3 shows an illustration of a platoon of vehicles for which the objective is to stabilize the relative longitudinal positions of each vehicle to the steady-state conditions and , subject to the hard constraint that relative position of the vehicles is never lower than a safe threshold , and that the vehicles’ ability to brake and accelerate is constrained by where (note that only the relative positions and velocities of the vehicles is considered, as the global position and velocity of the platoon can be controlled separately by adding an equal perturbation to each element of ). In appendix F it is shown that this can be modelled as a discrete time LTI system. Expert data was generated from the model with vehicles so that and . 20 instances were generated using random feasible initial conditions with m and m, and then simulated for s in time intervals of s with an infinite-horizon MPC controller, using and .

#### Learning

The learner and expert shared all system and controller information apart from the cost matrices and , which were learned, and the MPC horizon length, which was implemented as each of in four separate experiments. The matrices and were initialized as completely random diagonal matrices with each element uniformly distributed in the interval

, and the diagonal structure was maintained through training. 500 training iterations were used; otherwise the learning process (loss function, learning rate, etc…) was the same as in Section

4.1. Figure 4: Vehicle platooning. Imitation loss and cost function loss at each iteration of the training process. Left: imitation loss. Right: model loss given by ∥vecQ−vecQj∥22+∥vecR−vecRj∥22, where Q and R are the correct cost matrices and Qj and Rj are the cost matrices at iteration j.

#### Results

Figure 4 shows the imitation and cost function losses at each of the 500 optimization iterations for each of the tested horizon lengths and initial conditions. As with the mass-spring-damper experiments, it is suggested that a longer prediction horizon improves training as the imitation loss generally converges to a lower value for the examples with , but only convergence to a local minimum is achieved in general. The cost error also does not converge in general (although better convergence is observed again for the longer horizon lengths), however for this learning problem there is a manifold of matrices and with the same minimizing argument, so divergence of the cost error does not necessarily indicate that the learned cost function is ‘incorrect’. Furthermore, in this case the model is known exactly, so the closed-loop infinite-horizon properties can be obtained even without the correct cost function. Figure 5: Vehicle platooning. Closed loop simulation and prediction error for all horizon lengths. Top row: closed loop simulation where each shaded region is the safe separation distance for each vehicle. Bottom row: prediction error given by x[t:t+N]−^xt, where ^x is the state trajectory predicted by the MPC at time t.

Figure 5 shows the model simulated from the same initial condition in closed loop using a learned controller for each of the horizon lengths, together with the error between the MPC state predictions and ensuing closed-loop behaviour. All of the controllers are observed to successfully satisfy the hard constraints on vehicle separation, and all converge to the correct steady-state vehicle separation. The differences between the prediction capabilities of the controllers is highlighted by the state prediction errors, and it can be seen that for the state predictions match the ensuing behaviour, indicating that the infinite horizon cost is being used and that closed-loop stability is guaranteed, even without the use of a terminal constraint set. It is also demonstrated for that the largest errors occur from predictions made at times when the state constraints are active, suggesting that these controllers deviate from their predictions to satisfy the constraints at later intervals.

### 4.3 Limitations

The above approach is limited in scope to LTI systems, and a more comprehensive solution would cover linear time varying systems (for which the MPC is still obtained from the solution of a QP). In this case the infinite horizon cost cannot be obtained from the solution of the DARE, and the extension of the methods presented in this paper to time varying or non-linear models is non-trivial (see Appendix G for further discussion). Additionally, the derivative of the DARE in Proposition 2 involves multiple Kronecker products and matrix inversions (including an matrix) that do not scale well to large state and control dimensions, although the dynamics of physical systems can usually be reasonably approximated with only a few tens of variables, so this may not become an issue in practice. The algorithm also requires a stabilizing solution of the DARE to exist; theories for the existence of stabilizing solutions are non-trivial (e.g. RAN198863), and it is not immediately obvious how to enforce their existence throughout the training process (stabilizibility can be encouraged using the one-step ahead term in (6)).

## Acknowledgments

The authors are grateful to Brandon Amos for providing support using his differentiable QP tool (https://github.com/locuslab/optnet) in the preliminary work for this project (all of the methods presented in this paper were developed independently).

## Appendix A MPC quadratic program

Problem (2) is equivalent to

 z⋆=argminz 12z⊤⎡⎢⎣R+Ψ⊤QΨONm×NmONn×Nn⎤⎥⎦z+⎡⎢⎣ΨQΦxtku1Nmkx1Nn⎤⎥⎦⊤z s.t. ⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣u––−∞0Nmx––−Φxt−∞0Nn⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦≤⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣INmINmINm−INmINmΨINnΨ−INnINn⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦z≤⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣∞¯¯¯¯u∞∞¯¯¯x−Φxt∞⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦,

where

 z=⎡⎢⎣^urs⎤⎥⎦,R=⎡⎢ ⎢⎣R⋱R⎤⎥ ⎥⎦,Q=⎡⎢ ⎢ ⎢ ⎢ ⎢⎣Q⋱QQN⎤⎥ ⎥ ⎥ ⎥ ⎥⎦,Φ=⎡⎢ ⎢⎣A⋮AN⎤⎥ ⎥⎦, Ψ=⎡⎢ ⎢⎣B⋮⋱AN−1B⋯B⎤⎥ ⎥⎦,x––=⎡⎢ ⎢⎣x––⋮x––⎤⎥ ⎥⎦,¯¯¯x=⎡⎢ ⎢⎣¯x⋮¯¯¯x⎤⎥ ⎥⎦,u––=⎡⎢ ⎢⎣u––⋮u––⎤⎥ ⎥⎦,¯¯¯¯u=⎡⎢ ⎢⎣¯u⋮¯¯¯u⎤⎥ ⎥⎦,

are of conformal dimensions. Using the above, problem (4) is then equivalent to

 z⋆=argminz 12z⊤⎡⎢ ⎢⎣(K^Ψ+INm)⊤R(K^Ψ+INm)+^Ψ⊤^Q^ΨONm×NmONn×Nn⎤⎥ ⎥⎦z +⎡⎢⎣(K⊤R(K^Ψ+INm)+^Q^Ψ)⊤^Φxtku1Nmku1Nn⎤⎥⎦⊤z s.t. ⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣u––−K^Φxt−∞0Nmx––−Φxt−∞0Nn⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦≤⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣(K^Ψ+INm)INm(K^Ψ+INm)−INmINmΨINnΨ−INnINn⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦z≤⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣∞¯¯¯u−K^Φxt∞∞¯¯¯x−Φxt∞⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦,

where now

 z=⎡⎢⎣δ^urs⎤⎥⎦,Φ=⎡⎢ ⎢⎣(A+BK)⋮(A+BK)N⎤⎥ ⎥⎦andΨ=⎡⎢ ⎢⎣B⋮⋱(A+BK)N−1B⋯B⎤⎥ ⎥⎦,

and

 ^Q=[On×nQ],^Φ=[InΦ],^Ψ=[On×NnΨ],K=⎡⎢ ⎢⎣K⋱ONm×nK⎤⎥ ⎥⎦,

are of conformal dimensions.

## Appendix B OSQP derivatives

OSQP solves quadratic programs of the form (5), and returns values for , , and that satisfy

 Mz=s, Hz+q−M⊤y=0, s∈C,y∈NC(s),

(osqp, §2), where is the set , and is the normal cone of . The values of that are returned by the solver can be used to determine whether the constraints are strongly active at the solution, where indicates that the constraints and are inactive, indicates that is strongly active, and indicates that is strongly active. The solution can therefore be completely characterised by the KKT system

 ⎡⎢⎣HM⊤UM⊤LMUML⎤⎥⎦⎡⎢⎣zyUyL⎤⎥⎦=⎡⎢⎣quUlL⎤⎥⎦ (10)

where and , and the notation indicates a matrix consisting of the columns of given matrix , and indicates a vector consisting of the elements of given vector . Equation (10) can then be differentiated using the techniques detailed in (amos_optnet:_2017, §3).

## Appendix C Proof of Proposition 1

###### Proof.

(Proposition 1) The first point follows from (Scokaert1998). The next two points of Proposition 1 stem from the results in (Limon2003StableCM; Limon2009). In particular, the closed-loop is Lipschitz since the model is linear and the controller is the solution of a strictly convex QP. Moreover, the LQR provides a contractive terminal set. The final point follows from the fact that has eigenvalues in the unit circle, . Proof of point 4 is concluded by inspection of the QP matrices (Appendix A) and by application of Theorem 5.6.12, page 298 of Horn:2012:MA:2422911 which states that, given a bound, , on the spectral radius, then there exists a matrix norm which is also less than . ∎

## Appendix D Proof of Proposition 2

###### Proof.

(Proposition 2) If a stabilizing solution () to (8) exists, it is unique (IONESCU1992229, Proposition 1), and the DARE can therefore be considered an implicit function of , , , and . Using the assumption that exists, it can be concluded that and exist (the Kronecker product and matrix addition, subtraction, and multiplication always exist). Equation (8) can be given by

 P=A⊤M1A+Q, (11)

which is differentiable, and are also differentiable. Differentials are taken for (11) and each of as

 dvecP= (Vn,n+In2)(In⊗A⊤M1)dvecA+(A⊤⊗A⊤)dvecM1+dvecQ dvecM1= [In2−(PBM2B⊤⊗In)−(In⊗PBM2B⊤)]dvecP −(PB⊗PB)dvecM2−(In2+Vn,n)(PBM2⊗P)vecdB dvecM2= −(M2⊗M2)dvecM3 dvecM3= dvecR+(B⊤⊗B⊤)dvecP+(Im2+Vm,m)(Im⊗B⊤P)% vecdB,

then these can be combined using the differential chain rule

(magnus99, Theorem 18.2) to obtain

 Z1dvecP=Z2dvecA+Z3dvecB+Z4dvecQ+Z5dvecR.

The Jacobians, as defined in Proposition 2, therefore exist if exists. ∎

## Appendix E Verification and reduction of the prediction horizon

A method is proposed for the reduction of the MPC prediction horizon after imitation learning. The idea is to be able to reproduce the infinite-horizon optimal MPC up to a tolerance

with high probability. Do do so, we check that, for a candidate horizon

, the MPC action deltas, , satisfy , for all . This means that the optimal action is equal to the LQR up to a tolerance . In order to provide a high probability guarantee of this condition, we propose the use of a probabilistic verification approach, similar to bobiti_samplingdriven_nodate. This is described in Algorithm 2. In particular, the condition is checked on a high number, , of initial states. These states are sampled uniformly from a set of interest , which can be either the state constraints

or an estimate of the region of attraction,

. If verified, this set is a region of attraction for the system with high probability. The relationship between the number of samples and the verification probability is discussed in (bobiti_samplingdriven_nodate, Chapter 5). The algorithm also checks whether the infinite horizon condition has been reached for the used during training. Finally, a line search for a suitable is proposed using a scaling factor . In particular, the initial set is downscaled until either an horizon is found or the set becomes empty. In latter case the search fails and the procedure returns to the training algorithm with an increased . Noticeably, the proposed algorithm does not require to explicitly compute the terminal set in which the LQR is invariant and it could be used also for non-linear MPC if an infinite-horizon (or a stabilising) terminal controller is available.

## Appendix F Platoon Model Derivation

The problem described in Section 4.2 can be decomposed into the regulation problem

 ⎡⎢ ⎢⎣y2−y1⋮yn−yn−1⎤⎥ ⎥⎦=⎡⎢ ⎢⎣z2−z1⋮zn−zn−1⎤⎥ ⎥⎦+1yss,⎡⎢ ⎢⎣z2−z1⋮zn−zn−1⎤⎥ ⎥⎦→0,

subject to the constraints

 ⎡⎢ ⎢⎣z2−z1⋮zn−zn−1⎤⎥ ⎥⎦≥1(y–−yss),and1b≤¨y≤1a.

If each vehicle is modelled as a mass then a continuous-time LTI state space model can be formed as

 (12)

which can then be given as

 ˙x=[0I00]x+[0^B]u.

If it is assumed that the control input is constant between sampling intervals and , then this can be given in discrete time as

 xt+dt=[IdtIOI]xt+[12^Bc(dt)2^Bcdt]ut,^Bc=⎡⎢ ⎢⎣−11⋱⋱−11⎤⎥ ⎥⎦ (13)

where , and and are subject to the constraints

 xt≥[(y–−yss)1nv−1−∞],andb1≤ut≤a1% ∀t.

## Appendix G Nonlinear models

As discussed in the main paper, our approach is currently limited to Linear Time Invariant (LTI) systems. In general, conditions for infinite-horizon optimality of systems that are not LTI are non-trivial. Some of the results on MPC stability could however be maintained, for example in the case when the LQR value function, , is a local control Lyapunov function (khalil2001; rawlings_mayne_paper). In this case, the stability and intrinsic robustness results are maintained (see Limon2003StableCM; Limon2009). For these system, it would be possible to use our method, for instance in combination with amos_differentiable_2018, to provide a stable Non-linear MPC. This is however a big assumptions for systems that are very non-linear. Assessing this LQR controllability condition could be done, for instance, by training a local linear model around the target equilibrium (origin) and then checking whether the DARE is solvable. This should be performed before starting the imitation learning. We leave the study of more general systems to future work.