# Robust Learning-based Predictive Control for Constrained Nonlinear Systems

The integration of machine learning methods and Model Predictive Control (MPC) has received increasing attention in recent years. In general, learning-based predictive control (LPC) is promising for building data-driven models and solving the online optimization problem at lower computational cost. However, the robustness of LPC is difficult to guarantee because of the uncertainties introduced by the function approximation used in machine learning algorithms. In this paper, a novel robust learning-based predictive control (r-LPC) scheme is proposed for constrained nonlinear systems with unknown dynamics. In r-LPC, the Koopman operator is used to form a global linear representation of the unknown dynamics, and an incremental actor-critic algorithm is presented for receding horizon optimization. To enforce the system constraints, soft logarithmic barrier functions are designed within the learning predictive framework. The recursive feasibility and stability of the closed-loop system are discussed under the convergence arguments of the approximation algorithms adopted. Also, the robustness property of r-LPC is analyzed theoretically by taking into consideration the existence of perturbations on the controller due to possible approximation errors. Simulation results of the proposed learning control approach for the data-driven regulation of a Van der Pol oscillator system are reported, including comparisons with a classic MPC and an infinite-horizon Dual Heuristic Programming (DHP) algorithm. The results show that the r-LPC significantly outperforms the DHP algorithm in terms of control performance and is comparable to the MPC in terms of regulation performance as well as energy consumption. Moreover, its average computational cost is much smaller than that of the MPC in the adopted environment.

## 1 Introduction

Model predictive control (MPC), also known as receding horizon or moving horizon control, has received notable attention due to its theoretical developments and widespread applications in industrial plants, see [23, 27]. With MPC, the control problem is usually transformed into an online optimization problem that penalizes the errors of the state and control with respect to the origin or steady-state values over a predefined finite prediction horizon, with suitably chosen weights and subject to model and variable constraints. In MPC, the optimal control sequence is computed by solving the underlying optimization problem at each sampling instant, and only the first control action is applied. At the subsequent time instant, the optimization problem is solved again, according to the receding horizon strategy.

As it is a model-based approach, a mathematical description of the original model is required for the design and implementation of MPC in order to obtain the multi-step prediction used in each prediction horizon. Most classic MPC algorithms assume that the model has been generated a priori, in which case the identification process can be disregarded. In fact, identifying an accurate model description, especially for unknown nonlinear dynamics, is nontrivial because of possibly noisy data sets and of an unsuitable presumed model structure. To account for the modeling uncertainty from identification, robust MPC such as min-max MPC in [2] or tube-based MPC in [24] can be used; however, it might lead to conservativeness and degradation of control performance.

Recently, a class of learning-based MPC frameworks that rely on an online update of the controller parameters, such as the model description and system constraints, has drawn increasing interest for its capability of reducing conservativeness and improving control performance. Many works have been developed in this new direction. Among them, from the theoretical perspective, a unitary learning-based predictive controller for linear systems has been addressed in [31], where an identification process based on set membership is adopted to obtain a multi-step linear prediction, which is then used with robust MPC. Similarly, resorting to set membership identification, adaptive MPC algorithms for uncertain time-varying systems have been proposed in [22, 7] to reduce the conservativeness caused by robust MPC. Relying on the main idea of iterative learning control, a data-driven learning MPC for iterative tasks has been studied in [29], with terminal constraints updated in an iterative fashion. In [18], a learning nonlinear model predictive control algorithm with a machine learning based model estimation has been proposed with stability guarantees. In the aforementioned approaches, a robust (or stabilizing in [18]) MPC problem is still to be solved online at each time instant, which might lead to a heavy computational load for large-scale systems and/or nonlinear dynamics. This could preclude the application of these approaches to nonlinear systems that must exhibit fast closed-loop dynamics. An attempt has been made in [12] to learn global linear predictors for nonlinear dynamics using the Koopman operator, which has recently been noted to be very useful for representing nonlinear dynamics by linear ones, though probably of higher dimension, see [1]. This approach paves the way to a linear MPC formulation for nonlinear systems with a global linear predictor that represents the whole operating range, but it leaves the theoretical properties unresolved. The work presented in this paper is partially inspired by [12]. It is also worth mentioning that a supervised machine learning algorithm has been used to approximate nonlinear MPC in [10], where robustness is guaranteed under bounded control approximation errors with a statistically verified empirical risk.

As an alternative for solving optimal control problems with infinite or finite horizon, reinforcement learning (RL) and adaptive dynamic programming (ADP) have also received notable attention in the past decades, see [16, 25, 15, 34] and the references therein. Instead of solving online optimization problems, RL and ADP seek approximate solutions via value function and policy iteration in a trial-and-error manner, and are suitable for complex nonlinear control tasks that are hard to solve with optimal control techniques such as exact dynamic programming, due to the nonlinearity of the Hamilton-Jacobi-Bellman equation and the presence of state constraints. Similar to MPC, for control problems with high dimensions, RL and ADP might face issues of computational complexity and learning efficiency, also known as “the curse of dimensionality”. To cope with this problem, adaptive critic designs (ACDs) have been proposed, see for instance [38, 17, 39], where the value function and policy iteration are replaced with an actor-critic approximate network structure. Along this direction, various notable algorithms have been studied in [42, 20, 41] for known dynamics, and in [33, 11, 19] for unknown dynamics. These methods are designed for infinite-horizon optimal control. In [43], a finite-horizon near-optimal control algorithm has been presented for affine nonlinear systems with unknown dynamics, where an identifier designed with neural networks is used as the predictor. Relying on receding horizon optimization, a learning-based ADP controller for perturbed discrete-time systems has been studied in [37]. The algorithm learns in an iterative batch mode, and only the convergence under null approximation error is discussed. In [6], an ADP-based functional nonlinear MPC has been proposed for nonlinear discrete-time systems, where only control saturation is considered. The uniform ultimate boundedness of the closed-loop system in each prediction horizon is proven. However, to the best of our knowledge, algorithms with recursive feasibility and robustness verified under approximation errors according to the receding horizon strategy are still to be developed. Also, state constraints are not yet handled in the aforementioned algorithms, see [42, 20, 41, 33, 11, 19, 43, 37, 6].

For these reasons, a novel robust learning-based predictive control (r-LPC) is proposed in this work for constrained nonlinear systems with unknown dynamics. The r-LPC uses the Koopman operator to compute a global linear predictor of the nonlinear dynamics, and presents an incremental actor-critic algorithm for receding horizon optimization. The approximation errors caused by the actor-critic network and the linear predictor are regarded as disturbance terms to be rejected within the tube-based MPC framework. Moreover, constraint satisfaction is handled within the learning predictive framework by means of soft logarithmic barrier functions. The recursive feasibility and stability of the closed-loop system are discussed under the convergence arguments of the approximation algorithms adopted, and the robustness property is studied in depth, taking into consideration the existence of perturbations on the controller due to possible approximation errors. Simulation studies with the r-LPC for the regulation of a Van der Pol oscillator system have been performed, including comparisons with a classic MPC and an infinite-horizon Dual Heuristic Programming (DHP). The results show that the proposed approach significantly outperforms the DHP in terms of control performance and is comparable to the MPC in terms of regulation performance as well as energy consumption. Moreover, its average computational time is about 319 times smaller than that of the MPC in the adopted environment.

The main contributions of this paper are summarized as follows.

1. An incremental receding horizon ACDs algorithm is proposed for learning near-optimal control policies. Also, we use the Koopman operator to compute a global linear predictor of the unknown dynamics. Hence, the proposed r-LPC is fully data-driven, both in the modeling phase and in the learning process. No prior knowledge of the system structure and dynamics is required. Moreover, the proposed r-LPC is comparable to MPC in control performance and shows a significant advantage in terms of computational efficiency, see Section 6 for details.

2. The state and control constraints are handled within the receding horizon ACDs by means of soft logarithmic barrier functions. To the best of our knowledge, this is the first time this point is addressed in the framework of ACDs.

3. The convergence of the Koopman operator approximation and of the actor-critic structure is discussed. Resorting to statistical learning techniques, the convergence in the ideal case, the robustness under approximation errors, and the recursive feasibility in both cases are proven for the closed-loop control system under mild assumptions.

The rest of the paper is organized as follows. Section 2 introduces the considered control problem and the main idea of the r-LPC. In Section 3 the data-driven linear predictor with the Koopman operator is computed, while the design and main algorithm of the proposed r-LPC are described in Section 4. Section 5 presents the theoretical properties of the closed-loop system, and Section 6 shows the simulation results on a Van der Pol oscillator. Conclusions are drawn in Section 7, while proofs of the main results are given in the Appendix.

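Notation: Given the variable $r$, we use $\overrightarrow{r}(k:k+N-1)$ to denote the sequence $r(k),\dots,r(k+N-1)$, where $k$ is the discrete time index and $N$ is a positive integer. We use $\|x\|_Q^2$ to stand for $x^\top Qx$, and $\|x\|$ to denote its Euclidean norm. Given two sets $\mathcal{A}$ and $\mathcal{D}$, their Minkowski sum is represented by $\mathcal{A}\oplus\mathcal{D}$. Given a set $\mathcal{Z}$, we use $\operatorname{Int}(\mathcal{Z})$ to denote its interior and $\partial\mathcal{Z}$ to represent its boundary. We use the notation $\mathbb{N}$ to denote the non-negative integers. For a given set of variables $z_i$, $i=1,\dots,M$, we define the vector whose vector-components are $z_i$ in the following compact form: $(z_1,\dots,z_M)=[z_1^\top\ \cdots\ z_M^\top]^\top$. Finally, a ball with radius $\rho_{\varepsilon_i}$ and centered at $\bar x$ in the space $\mathbb{R}^{\dim}$ is defined as follows

$$\mathcal{B}_{\rho_{\varepsilon_i}}(\bar x):=\{x\in\mathbb{R}^{\dim}:\|x-\bar x\|\le\rho_{\varepsilon_i}\}.$$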

## 2 Problem formulation

Consider a class of discrete-time nonlinear systems described by

$$x(k+1)=f(x(k),u(k)) \tag{1}$$

where $k$ is the discrete-time index, $x(k)\in\mathcal{X}$ and $u(k)\in\mathcal{U}$ are the state and control variables, $\mathcal{X}$ and $\mathcal{U}$ are convex and compact sets containing the origin in their interiors, and $f$ is the state transition function, which is assumed to be smooth but unknown.

Starting from any initial condition $x(0)$, the control objective is to drive the pair $(x(k),u(k))$ to the origin as $k$ goes to infinity. In principle, if $f$ is perfectly known, one can use nonlinear MPC to achieve this objective; since this is a model-based optimization problem, the model information is required in each prediction horizon. For systems with unknown dynamics, standard nonlinear MPC cannot be employed unless a nonlinear model that precisely represents (1) is identified from experimental data samples. Since the identification process is usually data-driven, approximation error inevitably exists. Denoting by $\hat f$ the approximated model, the original model (1) can be rewritten as $\hat f$ plus a modeling error term. In order to account for the modeling gap, the robust tube-based MPC described in [24] can be used. In this case, at any time instant $k$, the following optimization problem is typically solved

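$$
\begin{aligned}
\min_{\overrightarrow{\hat u}(k:k+N-1)}\ & V(\hat x(k))\\
\text{subject to:}\ & \hat x(k+1)=\hat f(\hat x(k),\hat u(k)),\\
& \hat x(k+i)\in\mathcal{X}\ominus\mathcal{Z},\quad i=0,\dots,N-1,\\
& \hat u(k+i)\in\mathcal{U}\ominus\Omega,\quad i=0,\dots,N-1,\\
& \hat x(0)=x(0),\\
& \hat x(k+N)\in\mathcal{X}_f,
\end{aligned}
\tag{2}
$$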

where $N$ is the prediction horizon, $\mathcal{Z}$ is a (possibly minimal) robust invariant set for the perturbed error system under a stabilizing feedback policy $h$, $\mathcal{X}_f$ is a terminal constraint set containing the origin, usually selected as a subset of the maximal state admissible invariant set under a stabilizing feedback policy, and the cost function is

$$V(\hat x(k))=\sum_{i=0}^{N-1}r(\hat x(k+i),\hat u(k+i))+V_f(\hat x(k+N)),$$

where the stage cost $r$ is quadratic in the state and control, weighted by symmetric positive-definite matrices, and $V_f$ is the terminal cost in quadratic form. Assuming that at any generic time instant $k$ the optimal solution can be found by solving (2), the overall control to be applied is given as

$$u(k)=\hat u(k|k)+h(x(k),\hat x(k|k)).$$

Even so, the problem is still challenging for the following reasons: i) nonlinear MPC problems are typically computationally intensive and difficult to solve, especially for systems with highly nonlinear features and/or high dimension; ii) although several nonlinear identification techniques exist, such as Hammerstein and Wiener models and neural networks, the resulting model structure might be complex, which makes (2) nontrivial to solve; iii) in principle, traditional linear model identification based on the least-squares method can be used instead, which is convenient for online MPC, but it is difficult to obtain a global model that represents the whole operating space for systems with a wide operating range.

For the above reasons, we propose the r-LPC to achieve the aforementioned control objective. The r-LPC uses ACDs in the receding horizon optimization formulation so as to reduce the computational load and improve control performance. To obtain the finite-step prediction model for the learning process, we resort to the Koopman operator to approximate (1) from collected data samples with an abstract linear predictor that is valid in a global sense. The approximation errors caused by the actor-critic network and the linear predictor are regarded as disturbance terms to be rejected in the tube-based MPC framework. Moreover, we employ barrier functions to avoid constraint violation in the learning framework: the state, control, and terminal state constraints are transformed into soft ones, using logarithmic barrier functions, in the predictive control formulation and in the actor-critic structure. The main idea, the implementation details, and the theoretical analysis are described in depth in the following sections.

## 3 Data-driven model predictor with Koopman operator

In this section, we present the main idea and the computation of the data-driven linear predictor approximation with the Koopman operator, and the model to be used in the learning-based predictive control framework.

### 3.1 The model predictor with Koopman operator

As mentioned above, according to the receding horizon principle, a multi-step prediction model is required in each prediction horizon. In what follows, we use the Koopman operator described in [1, 12] to compute the prediction model of (1), which is linear and suitable for multi-step ahead prediction. Given the nonlinear dynamics described in (1), the main idea is to use a set of scalar observables of the original states in order to define a new high-dimensional state or feature space and estimate their evolution using linear transition matrices. The linear mapping approximation can ideally represent the original nonlinear dynamics as long as the selected dimension of the observables is sufficiently large, see [1]. To show the rationale behind this approach, we first present the main concept and the rigorous definition of the Koopman operator. Consider the following unforced dynamics described as

$$x(k+1)=g(x(k)) \tag{3}$$

where $g$ is the nonlinear mapping.

The Koopman operator is defined as

$$(\mathcal{K}\phi)(x)=(\phi\circ g)(x)=\phi(g(x)) \tag{4}$$

where $\phi$ belongs to a feature space typically consisting of functions defined on the state space, which are often referred to as observables. The Koopman operator $\mathcal{K}$ acts on any scalar observable $\phi$ of the state such that the observable of the successive state can be generated. In this way, although the mapping from the state space to the observable might be nonlinear, the Koopman operator establishes a linear dynamical transition in the feature space. If the feature space is invariant under the Koopman operator, i.e., the evolution of all the scalar observables remains in that space, the linear dynamics can be regarded as a good estimate of the original nonlinear system (3). This can be fulfilled by choosing a sufficient (possibly infinite) number of features. Nevertheless, for practical computations, it is reasonable to retain only the dominant features at the expense of a certain but acceptable loss of approximation accuracy.
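As a small worked illustration of an invariant feature space, consider the following toy system, which is a hypothetical example and not taken from the paper. With constants $a$, $b$, $c$ and the three observables $x_1$, $x_2$, $x_1^2$, the Koopman operator in (4) acts on the observable vector exactly as a $3\times 3$ matrix:

```latex
\begin{aligned}
&x_1(k+1)=a\,x_1(k),\qquad x_2(k+1)=b\,x_2(k)+c\,x_1(k)^2,\\[2pt]
&\phi(x)=\begin{bmatrix}x_1 & x_2 & x_1^2\end{bmatrix}^\top,\\[2pt]
&\phi(g(x))=\begin{bmatrix}a\,x_1\\ b\,x_2+c\,x_1^2\\ a^2x_1^2\end{bmatrix}
=\begin{bmatrix}a&0&0\\ 0&b&c\\ 0&0&a^2\end{bmatrix}\phi(x).
\end{aligned}
```

Dropping the observable $x_1^2$ would break invariance, since $x_2(k+1)$ could no longer be written as a linear combination of the remaining features; this is the kind of truncation error that is later treated as a disturbance.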

With slight changes, the Koopman operator based approximation can be naturally extended to forced dynamics like (1). In this case, at any time instant $k$, the state is redefined and extended as $z(k)$, which collects the state $x(k)$ and the control sequence, and the extended dynamics are

$$z(k+1)=f(z(k))=\begin{bmatrix}f(x(k),u(k))\\ \Gamma u(k)\end{bmatrix},$$

where $\Gamma$ is a left shift operator. In principle, the above extended dynamics are suitable for defining a Koopman operator similar to (4), i.e.,

$$(\mathcal{K}\phi)(z)=(\phi\circ f)(z)=\phi(f(z)) \tag{5}$$

The objective of interest is to find a finite (possibly minimal)-dimensional approximation for the Koopman operator (5), which can readily be used for generating the model parameters of the linear predictor.

Assume that the resulting linear predictor has been computed and is given in the following form:

$$\Sigma:\ \begin{cases}s(k+1)=As(k)+Bu(k)\\ \bar x(k)=Cs(k),\end{cases} \tag{6}$$

where $s\in\mathbb{R}^{N_s}$ is the abstract state variable, $A$ is the linear state transition matrix, $B$ is the input mapping matrix, $C$ is the output matrix mapping from the feature space to the original state space, and the output $\bar x$ is the estimated value of $x$. The initial condition of $s$ starting from any time instant $k$ is given by feature mappings of the original state $x(k)$, i.e.,

$$s(k)=\Phi(x(k),N_s):=\begin{bmatrix}\phi_1(x(k))\\ \vdots\\ \phi_{N_s}(x(k))\end{bmatrix} \tag{7}$$

where each $\phi_i$, $i=1,\dots,N_s$, is a scalar mapping function, which can be chosen as a basis function or some other regular nonlinear function. The details on how to compute the model matrices $A$, $B$, and $C$ by approximating the Koopman operator are deferred to Section 3.2.
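To make the roles of (6) and (7) concrete, the following minimal sketch lifts a state with a feature map and rolls the linear predictor forward. The toy system, the constants `a`, `b`, `c`, and the function names `f`, `lift`, and `rollout` are all illustrative assumptions rather than anything specified in the paper; the system is chosen only because it admits an exact three-dimensional lifting.

```python
import numpy as np

# Hypothetical toy system (not from the paper) with an exact finite lifting:
#   x1(k+1) = a*x1(k)
#   x2(k+1) = b*x2(k) + c*x1(k)**2 + u(k)
a, b, c = 0.9, 0.5, 0.3

def f(x, u):
    """Nonlinear state transition of the toy system."""
    return np.array([a * x[0], b * x[1] + c * x[0] ** 2 + u])

def lift(x):
    """Feature map Phi(x, N_s) with N_s = 3 observables: x1, x2, x1^2."""
    return np.array([x[0], x[1], x[0] ** 2])

# Matrices of the linear predictor (6) for this particular lifting.
A = np.array([[a, 0.0, 0.0],
              [0.0, b, c],
              [0.0, 0.0, a ** 2]])
B = np.array([0.0, 1.0, 0.0])
C = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])

def rollout(x0, controls):
    """Simulate the true system and the lifted linear predictor side by side."""
    x = np.array(x0, dtype=float)
    s = lift(x)                      # initial condition (7)
    true_traj, pred_traj = [], []
    for u in controls:
        x = f(x, u)                  # nonlinear dynamics
        s = A @ s + B * u            # linear predictor in feature space
        true_traj.append(x)
        pred_traj.append(C @ s)      # map back to the original state space
    return np.array(true_traj), np.array(pred_traj)

true_traj, pred_traj = rollout([1.0, -0.5], [0.2, -0.1, 0.0, 0.3, 0.1])
max_error = float(np.max(np.abs(true_traj - pred_traj)))
print(max_error)
```

For a general nonlinear system no finite set of observables is exactly invariant, so the prediction error would be nonzero and would grow with the horizon.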

### 3.2 The model predictor approximation with EDMD

The emphasis here is on approximating the infinite-dimensional Koopman operator with a finite-dimensional square matrix, which is then used to construct the linear predictor defined in (6). To this end, along the same lines as [12], the extended dynamic mode decomposition (EDMD) algorithm is adopted to construct a finite-dimensional approximation of the Koopman operator in the observable space. Note however that the extended state is infinite-dimensional, which makes practical computations impossible. Hence, we select a special form of $\Phi$ as

$$\Phi(z,N_k)=\begin{bmatrix}\Phi(x,N_s)\\ u\end{bmatrix}$$

where $N_k$ equals $N_s$ plus the dimension of $u$. Let $K$ be a finite-dimensional approximation of $\mathcal{K}$, such that $K\Phi(z,N_k)$ matches $\Phi(z^{+},N_k)$ up to an approximation residual. Given $M$ data pairs $(z_i,z_i^{+})$, the objective is to minimize this residual by solving the corresponding regularized optimization problem

$$\min_{K}\ \sum_{i=1}^{M}\|K\Phi(z_i,N_k)-\Phi(z_i^{+},N_k)\|^2+\theta\|K\|^2 \tag{8}$$

where $\theta$ is a positive scalar and $z_i$, $z_i^{+}$ are the samples belonging to the $i$-th data set. As $\Phi$ is finite-dimensional and might not be invariant under $\mathcal{K}$, the optimal residual is not identically zero, but it can be regarded as an additive disturbance to be rejected in the robust control framework.

Since we are only interested in the future evolution of the state, the last rows of the computed $K$, which describe the transition of the control components, can be disregarded. The matrix pair $[A\ \ B]$ coincides with the first $N_s$ rows of the computed $K$. In order to find the optimal $C$ that maps from the observable space to the original state space, the following optimization problem is solved:

$$\min_{C}\ \sum_{i=1}^{M}\|C\Phi(x_i,N_s)-x_i\| \tag{9}$$
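The regressions (8) and (9) can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data-generating system `f`, the feature map `lift_state`, the sample size `M`, and the value of `theta` are hypothetical; $\|K\|$ in (8) is taken to be the Frobenius norm, which gives the closed-form ridge solution used below; and (9) is replaced by an ordinary squared least-squares fit.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, u):
    # Hypothetical toy dynamics, used only to generate training data.
    return np.array([0.9 * x[0], 0.5 * x[1] + 0.3 * x[0] ** 2 + u])

def lift_state(x):
    # Phi(x, N_s) with N_s = 3 observables.
    return np.array([x[0], x[1], x[0] ** 2])

def lift_extended(x, u):
    # Phi(z, N_k) = [Phi(x, N_s); u]
    return np.append(lift_state(x), u)

M, theta, N_s = 200, 1e-8, 3
X = rng.uniform(-1.0, 1.0, size=(M, 2))     # sampled states x_i
U = rng.uniform(-1.0, 1.0, size=M)          # sampled controls u_i
U_next = rng.uniform(-1.0, 1.0, size=M)     # controls paired with x_i^+

# Columns are the lifted samples Phi(z_i) and Phi(z_i^+).
Z = np.column_stack([lift_extended(x, u) for x, u in zip(X, U)])
Y = np.column_stack([lift_extended(f(x, u), un)
                     for x, u, un in zip(X, U, U_next)])

# Ridge solution of (8): K = Y Z^T (Z Z^T + theta I)^{-1}.
G = Z @ Z.T + theta * np.eye(Z.shape[0])
K = np.linalg.solve(G, Z @ Y.T).T

# [A B] is given by the first N_s rows of K; the remaining rows are discarded.
A, B = K[:N_s, :N_s], K[:N_s, N_s:]

# Squared least-squares stand-in for (9): C Phi(x_i) ~ x_i.
Phi_s = Z[:N_s, :]
C = np.linalg.lstsq(Phi_s.T, X, rcond=None)[0].T

print(np.round(A, 4))
print(np.round(B.ravel(), 4))
print(np.round(C, 4))
```

Because the chosen observables are exactly invariant for this toy system, the residual of (8) is essentially zero here; with a generic system and a truncated dictionary it would not be.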

### 3.3 Model to be used in the learning predictive controller

Suppose that the optimal solutions $A$, $B$, and $C$ can be computed with (8) and (9); then system (1) is represented by the linear system (6) plus a residual uncertainty term

$$\Sigma_a:\ \begin{cases}s(k+1)=As(k)+Bu(k)+\delta_w(k)\\ x(k)=Cs(k)+\delta_v(k),\end{cases} \tag{10}$$

where $\delta_w(k)$ is the minimal residual caused by (8), while $\delta_v(k)$ is the minimal residual of (9). Note that in Section 4.4 the actor-critic network is used to approximate the robust MPC, which might also lead to uncertainty in the control channel. Therefore, the real dynamics from the control to the state $x$ is given as

$$\Sigma:\ \begin{cases}s(k+1)=As(k)+Bu(k)+d(k)\\ x(k)=Cs(k)+v(k),\end{cases} \tag{11}$$

where $d(k)$ collects the residual $\delta_w(k)$ and the effect of the approximation error of the control action by the actor-critic network, and $v(k)$ is the output residual. It is assumed that $d(k)$ and $v(k)$ belong to compact sets containing the origin. Given the structure of the linear predictor and the actor-critic network, and under certain assumptions, (possibly conservative) choices of these sets can be given; they are deferred to Section 5.3.

System (11) is not ready for prediction due to the unknown disturbance terms $d(k)$ and $v(k)$; hence we define the unperturbed system of (11) as

$$\hat\Sigma:\ \begin{cases}\hat s(k+1)=A\hat s(k)+B\hat u(k)\\ \hat x(k)=C\hat s(k),\end{cases} \tag{12}$$

The control action applied to (11) in robust tube-based MPC is defined as

$$u(k)=\hat u(k)+Ke_s(k) \tag{13}$$

where $\hat u(k)$ can be computed with a standard MPC like (2) with respect to (12), $e_s(k)=s(k)-\hat s(k)$ is the error, and $K$ is a feedback gain matrix such that $F=A+BK$ is Schur stable. The error evolves according to the following system, obtained by subtracting (12) from (11) under (13):

$$\Delta\Sigma:\ \begin{cases}e_s(k+1)=Fe_s(k)+d(k)\\ e_x(k)=Ce_s(k)+v(k),\end{cases} \tag{14}$$

Given a (possibly minimal) robust invariant set of (14) for the error $e_s$, the robust “output” invariant set for $e_x$ is defined accordingly through the output equation of (14).
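The tube mechanism in (11)-(14) can be checked numerically with a minimal sketch. All numbers below are assumptions made for illustration: the matrices `A` and `B`, the gain `K` (chosen so that $F=A+BK$ has eigenvalues 0.5 and 0.4), the disturbance bound `d_max`, and the nominal input, which is a simple linear feedback standing in for the MPC solution.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical lifted model and ancillary feedback gain.
A = np.array([[1.1, 0.2],
              [0.0, 0.9]])
B = np.array([[0.0],
              [1.0]])
K = np.array([[-2.1, -1.1]])
F = A + B @ K
spectral_radius = max(abs(np.linalg.eigvals(F)))   # must be < 1 (Schur)

d_max = 0.05                       # assumed bound on each entry of d(k)
s_hat = np.array([1.0, -1.0])      # nominal state of (12)
s = s_hat.copy()                   # perturbed state of (11)
errors = []
for k in range(200):
    u_hat = K @ s_hat                        # stand-in for the nominal MPC input
    u = u_hat + K @ (s - s_hat)              # tube control law (13)
    d = rng.uniform(-d_max, d_max, size=2)   # bounded disturbance
    s = A @ s + B @ u + d
    s_hat = A @ s_hat + B @ u_hat
    errors.append(np.linalg.norm(s - s_hat)) # e_s follows (14)

# Conservative bound: ||e_s(k)|| <= sum_j ||F^j||_2 * sup ||d||_2.
bound = d_max * np.sqrt(2.0) * sum(
    np.linalg.norm(np.linalg.matrix_power(F, j), 2) for j in range(200))

print(spectral_radius, max(errors), bound)
```

The error stays inside a bounded region around the nominal trajectory regardless of the nominal input, which is why the state and control constraints of the nominal problem can be tightened by the corresponding invariant sets.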

## 4 Design of the robust learning predictive controller

In this section, we first present the formulation of the robust predictive controller with the linear predictor obtained in Section 3 and the main idea of the r-LPC framework. The barrier function based value function is then reformulated in order to cope with the system constraints in the receding horizon. At the end of this section, the main algorithm and the computational details of the proposed r-LPC with the actor-critic network are described.

### 4.1 Robust model predictive controller

With the nominal linear model (12) in the feature space, we are now ready to state the robust model predictive controller. At any time instant $k$, the online optimization problem (2) can be reformulated as

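$$
\begin{aligned}
\min_{\overrightarrow{\hat u}(k:k+N-1)}\ & V_b(\hat s(k))\\
\text{subject to:}\ & \text{the dynamics (???)},\\
& \text{initial constraint (???)},\\
& \hat s(k+i)\in\mathcal{S},\quad i=0,\dots,N-1,\\
& \hat u(k+i)\in\hat{\mathcal{U}},\quad i=0,\dots,N-1,\\
& \hat s(0)=s(0),\\
& \hat s(k+N)\in\mathcal{S}_f,
\end{aligned}
\tag{15}
$$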

where

$$V_b(\hat s(k))=\sum_{i=0}^{N-1}\left(\|\hat s(k+i)\|_{\bar Q}^2+\|\hat u(k+i)\|_R^2\right)+V_f(\hat s(k+N)), \tag{16}$$

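Here $\bar Q$ and $R$ are the state and control weighting matrices, $\mathcal{S}$ and $\hat{\mathcal{U}}$ are the state and control constraint sets of the nominal system, and the invariant set $\mathcal{S}_f$ is defined through a symmetric positive-definite matrix. The terminal penalty matrix $P$ is a symmetric positive-definite matrix computed with the following Lyapunov equation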

$$F^\top PF-P=-\bar Q-K^\top RK. \tag{17}$$

Assume that at any time instant $k$ the optimal control sequence can be found; then the control applied to system (1) is given as

$$u(k)=\hat u^{o}(k|k)+K(s(k)-\hat s(k|k)).$$

Note that the output and control of (6) with respect to the origin are minimized in (15). For this reason, the following assumptions are introduced for the stability arguments described later.

###### Assumption 1

The pair $(A,C)$ is observable and the pair $(A,B)$ is stabilizable.

###### Remark 1

In view of [40], under mild assumptions, the stabilizability and observability gramians of smooth affine nonlinear systems can be computed and balanced in the observable space, which implies that Assumption 1 can be verified for system (1).

###### Assumption 2

The matrix is full rank, i.e., .

Note that the linear predictor in (15) might be of high dimension, since a sufficient number of observables is usually required to obtain a precise model representation of (1); hence it might be computationally intensive to solve the online optimization problem (15). For this reason, a more computationally efficient learning control algorithm, i.e., r-LPC, is proposed using ACDs.
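The terminal penalty matrix in (17) can be obtained numerically once $F$ is Schur stable. The sketch below uses a plain fixed-point iteration, which converges to $P=\sum_{j\ge 0}(F^\top)^j(\bar Q+K^\top RK)F^j$. The matrices `A`, `B`, `K`, `Q_bar`, and `R` and the helper name `solve_terminal_penalty` are hypothetical values and names chosen only for illustration.

```python
import numpy as np

def solve_terminal_penalty(F, W, tol=1e-10, max_iter=10000):
    """Solve F^T P F - P = -W by iterating P <- F^T P F + W (F Schur stable)."""
    P = W.copy()
    for _ in range(max_iter):
        P_next = F.T @ P @ F + W
        if np.max(np.abs(P_next - P)) < tol:
            return P_next
        P = P_next
    raise RuntimeError("iteration did not converge; check that F is Schur stable")

# Hypothetical data: lifted model, feedback gain, and weights.
A = np.array([[1.1, 0.2],
              [0.0, 0.9]])
B = np.array([[0.0],
              [1.0]])
K = np.array([[-2.1, -1.1]])
F = A + B @ K                       # eigenvalues 0.5 and 0.4
Q_bar = np.eye(2)
R = np.array([[1.0]])

W = Q_bar + K.T @ R @ K             # right-hand side of (17)
P = solve_terminal_penalty(F, W)
print(P)
```

In practice a library routine can replace the iteration; for instance, `scipy.linalg.solve_discrete_lyapunov(F.T, W)` solves the same equation.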

### 4.2 Cost function reformulation with barrier functions

In the proposed r-LPC, the hard state and control constraints are treated as soft ones and included in the cost function through continuous and differentiable barrier functions multiplied by scalar weights, so that the optimization problem (15) is transformed into one with only the model equality constraint. In doing so, the resulting optimization problem can be analyzed with standard HJB equations and solved using policy and value function approximations with the actor-critic structure. The adopted barrier functions take values close to zero when the constrained variables are in the interior of the constraint sets and go to infinity as the variables reach the constraint boundary. Due to this property, including the barrier functions in the optimization index keeps the system variables in the interior of their constraint sets and results in almost null values of the barrier functions. With small weights, the influence of the barrier functions on the closed-loop control performance can be effectively limited, and the stability arguments of standard MPC, which choose the optimal value function as a candidate Lyapunov function, can be applied for the convergence proofs. In the following, we briefly introduce the cost function reformulation with barrier functions representing the state, control, and terminal state constraints along the lines of [9], which is described as

$$V_b(\hat s(k))=\sum_{i=0}^{N-1}\Big(\|\hat s(k+i)\|_{\bar Q}^2+\|\hat u(k+i)\|_R^2+\mu B(\hat s(k+i))+\mu B(\hat u(k+i))\Big)+V_f(\hat s(k+N))+\mu B_f(\hat s(k+N)) \tag{18}$$

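where $\mu$ is a weighting scalar, and $B(\hat s)$, $B(\hat u)$, and $B_f(\hat s)$ are the state, control, and terminal-state barrier functions respectively. Each barrier function takes finite values when its argument lies in the interior of the corresponding constraint set and is infinite otherwise.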

To strictly approximate the inequality constraints with soft barrier functions, the following definitions are introduced.

###### Definition 1

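For any variable $z\in\mathcal{Z}$, where $\mathcal{Z}=\{z: a_i^\top z\le b_i,\ i=1,\dots,p\}$ is a polyhedron, the barrier function is defined as

$$\bar B(z)=\begin{cases}-\sum_{i=1}^{p}\log(b_i-a_i^\top z) & z\in\operatorname{Int}(\mathcal{Z})\\ +\infty & \text{otherwise.}\end{cases}$$

###### Definition 2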

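For any variable $z\in\mathcal{Z}$, where $\mathcal{Z}=\{z: z^\top Zz\le 1\}$ is an ellipsoid and $Z$ is a symmetric positive-definite matrix with suitable dimensions, the barrier function is defined as

$$\bar B(z)=\begin{cases}-\log(1-z^\top Zz) & z\in\operatorname{Int}(\mathcal{Z})\\ +\infty & \text{otherwise.}\end{cases}$$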
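The two definitions translate directly into code. The sketch below is a plain-Python illustration; the function names and the example constraint data (`a_rows`, `b`, `Z`) are hypothetical. It also shows that the polyhedral barrier is generally nonzero at the origin.

```python
import math

def polyhedral_barrier(z, a_rows, b):
    """Definition 1: -sum_i log(b_i - a_i^T z) in the interior, +inf otherwise."""
    slacks = [b_i - sum(a_ij * z_j for a_ij, z_j in zip(a_i, z))
              for a_i, b_i in zip(a_rows, b)]
    if min(slacks) <= 0.0:
        return math.inf
    return -sum(math.log(s) for s in slacks)

def ellipsoidal_barrier(z, Z):
    """Definition 2: -log(1 - z^T Z z) in the interior, +inf otherwise."""
    n = len(z)
    quad = sum(z[i] * Z[i][j] * z[j] for i in range(n) for j in range(n))
    if quad >= 1.0:
        return math.inf
    return -math.log(1.0 - quad)

# Hypothetical box |z1| <= 1, |z2| <= 2 written as a_i^T z <= b_i.
a_rows = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b = [1.0, 1.0, 2.0, 2.0]
# Hypothetical unit-ball ellipsoid.
Z = [[1.0, 0.0], [0.0, 1.0]]

print(polyhedral_barrier([0.0, 0.0], a_rows, b))   # nonzero at the origin
print(polyhedral_barrier([1.0, 0.0], a_rows, b))   # on the boundary
print(ellipsoidal_barrier([0.5, 0.0], Z))
```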

Note however that $\bar B(0)$ is not guaranteed to be zero, so the optimal value function at the origin may be nonzero. This impedes the stability arguments based on selecting the optimal value function as a Lyapunov function, since its value at the origin is no longer zero. To address this, we introduce the following lemma about barrier functions [36].

###### Lemma 1
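1. Let $B_c(z)=\bar B(z)-\bar B(0)-[\nabla\bar B(0)]^\top z$ be the gradient re-centered barrier function of $\bar B(z)$; then $B_c(z)$ is differentiable and convex for all $z\in\operatorname{Int}(\mathcal{Z})$, and $B_c(0)=0$;

2. Let the relaxed barrier function for a polyhedral constraint be defined as

$$B(z)=\begin{cases}B_c(z) & \bar\sigma\ge\kappa\\ \gamma(z,\bar\sigma) & \bar\sigma<\kappa\end{cases} \tag{19}$$

where the small positive scalar $\kappa$ is the relaxing factor and the function $\gamma(z,\bar\sigma)$ is strictly monotone and differentiable, chosen such that $B(z)$ remains differentiable at the switching point $\bar\sigma=\kappa$; then there exists a positive-definite matrix $H$ that bounds $B(z)$ from above in quadratic form.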

###### Remark 2

The control, state, and terminal state constraints can be easily transformed into the corresponding gradient re-centered barrier functions; the computation procedures are omitted due to space limitations. It is highlighted that, in view of Lemma 1.1), the optimal value function at the origin is zero, which allows us to choose it as a Lyapunov function candidate.
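The effect of gradient re-centering can be illustrated with a short sketch. It assumes the usual re-centering $B_c(z)=\bar B(z)-\bar B(0)-[\nabla\bar B(0)]^\top z$ applied to the polyhedral barrier of Definition 1; the constraint data `A_c`, `b_c` (the interval $-1\le z\le 3$) and the function names are hypothetical. The relaxation in (19) is not implemented here.

```python
import numpy as np

def log_barrier(z, A_c, b_c):
    """Polyhedral logarithmic barrier of Definition 1 for A_c z <= b_c."""
    slack = b_c - A_c @ z
    if np.any(slack <= 0.0):
        return np.inf
    return -np.sum(np.log(slack))

def log_barrier_grad(z, A_c, b_c):
    """Gradient of the barrier: sum_i a_i / (b_i - a_i^T z)."""
    slack = b_c - A_c @ z
    return A_c.T @ (1.0 / slack)

def recentered_barrier(z, A_c, b_c):
    """Subtract the value and the linear term at the origin."""
    z0 = np.zeros_like(z)
    return (log_barrier(z, A_c, b_c)
            - log_barrier(z0, A_c, b_c)
            - log_barrier_grad(z0, A_c, b_c) @ z)

# Hypothetical asymmetric interval -1 <= z <= 3, i.e. z <= 3 and -z <= 1.
A_c = np.array([[1.0], [-1.0]])
b_c = np.array([3.0, 1.0])

for z in (-0.5, 0.0, 1.0, 2.5):
    print(z, log_barrier(np.array([z]), A_c, b_c),
          recentered_barrier(np.array([z]), A_c, b_c))
```

The plain barrier equals $-\log 3$ at the origin and attains its minimum at $z=1$, whereas the re-centered version is zero at the origin and non-negative elsewhere in the interior, which is the property needed for the Lyapunov argument.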

With (19) included in (18), the terminal penalty matrix $P$ is modified as

$$F^\top PF-P=-\bar Q-K^\top RK-\mu H, \tag{20}$$

where $H$ combines the matrices computed according to Lemma 1.2), with a presumed value of $\kappa$ in (19), for the state and control barrier functions.

### 4.3 Learning-based predictive controller

At any time instant $k$, let the index $\tau\in[k,k+N-1]$ belong to the prediction horizon and define

$$r_b(\tau)=\|\hat s(\tau)\|_{\bar Q}^2+\|\hat u(\tau)\|_R^2+\mu B(\hat s(\tau))+\mu B(\hat u(\tau)).$$

We define

$$V_{b,\tau}(\hat s(\tau))=\sum_{i=\tau-k}^{N-1}r_b(k+i)+V_f(\hat s(k+N))+\mu B_f(\hat s(k+N))=r_b(\tau)+V_{b,\tau+1}(\hat s(\tau+1)) \tag{21}$$

where $V_{b,k+N}(\hat s(k+N))=V_f(\hat s(k+N))+\mu B_f(\hat s(k+N))$. Different from (15), where the control sequence is regarded as a whole and computed by solving the underlying optimization problem once, the proposed learning-based predictive controller computes the control action at each time instant $\tau\in[k,k+N-1]$, i.e.,

$$\min_{\hat u(\tau)}\ r_b(\tau)+V_{b,\tau+1}(\hat s(\tau+1))\quad\text{subject to: the dynamics (???) with (???)} \tag{22}$$

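Assume that the optimal solution $u^*(\hat s(\tau))$ can be found for (22), i.e., that it solves the discrete-time HJB equation $V^*_{b,\tau}(\hat s(\tau))=\min_{\hat u(\tau)}\{r_b(\tau)+V^*_{b,\tau+1}(\hat s(\tau+1))\}$. According to the principle of optimality, and in view of (21) and (22), minimizing $V_{b,\tau}$ is equivalent to solving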

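$$\frac{\partial V^*_{b,\tau}(\hat s(\tau))}{\partial u^*(\hat s(\tau))}=\mu\frac{\partial B^*(u(\hat s(\tau)))}{\partial u^*(\hat s(\tau))}+2Ru^*(\hat s(\tau))+B^\top\lambda^*(\hat s(\tau+1))=0, \tag{23}$$

where the costate $\lambda^*(\hat s(\tau))$ is the partial derivative of the optimal value function with respect to $\hat s(\tau)$:

$$\lambda^*(\hat s(\tau))=\mu\frac{\partial B^*(\hat s(\tau))}{\partial\hat s^*(\tau)}+2\bar Q\hat s^*(\tau)+A^\top\lambda^*(\hat s(\tau+1)).$$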

In principle, one can resort to standard policy or value iteration algorithms to solve the above problem. However, due to the nonlinear logarithmic barrier terms, it might be difficult to solve the above HJB equation analytically. This motivates the reinforcement learning algorithm with the actor-critic structure described in the following section.

### 4.4 Near-optimal control with the actor-critic structure

In this section we present an efficient algorithm to implement the learning-based predictive controller, with the aim of obtaining a near-optimal control policy with an actor-critic network. Over the prediction horizon starting at any time instant $k$, the critic network approximates the costate at each prediction step, while the actor estimates the optimal control sequence that can be applied to the system. The weight updates of the actor and the critic are coupled, in that the output of the critic is used as one of the inputs to the actor. To decouple them during learning, the learning rate of the critic is usually selected to be larger than that of the actor. In doing so, the critic weight converges quickly to a global or local minimum, so that the actor weight can then converge on a slower timescale; this is analyzed in detail in Section 5.2.

To define the actor network, in view of (23), we define the desired control action $u_d(\hat{s}(\tau))$ to be estimated, for all $\tau \in [k, k+N-1]$, in the form:

$$\mu \frac{\partial B(u_d(\hat{s}(\tau)))}{\partial u_d(\hat{s}(\tau))} + 2 R u_d(\hat{s}(\tau)) = -B^\top \hat{\lambda}(\hat{s}(\tau+1)) \tag{24}$$

where $\hat{\lambda}(\hat{s}(\tau+1))$ is the estimate of $\lambda^*(\hat{s}(\tau+1))$ generated by the critic network. We define the output of the actor network as

$$\hat{u}_d(\hat{s}(\tau)) = W_a(\tau)^\top h(\hat{s}(\tau)), \quad \tau \in [k, k+N-1] \tag{25}$$

where $W_a(\tau)$ is the weighting matrix and $h(\hat{s}(\tau))$ is a vector whose entries are basis functions.

Usually, an actor network approximates the desired decision variables by minimizing a quadratic cost on the error between its outputs and their desired values. This is not suitable here because the left-hand side of (24) consists of $u_d$ multiplied by a constant matrix plus the gradient of the barrier function with respect to $u_d$, in which the entries of $u_d$ appear explicitly in the denominator. For this reason, with the actor structured as in (25), and unlike the classic actor, the objective here is to treat the left-hand side of (24) as a single quantity to be estimated. To this end, its estimate is computed from the output of the actor network, and the difference between the desired value and this estimate is the approximation residual $\epsilon_a(\tau)$. At each step $\tau$, the residual is to be minimized, typically through the quadratic cost $\|\epsilon_a(\tau)\|_{Q_a}^2$, where $Q_a$ is a positive-definite matrix. The weighting matrix $W_a$ is usually updated by gradient descent; however, this might lead to violation of the constraint on $\hat{u}_d$ during learning. To deal with this problem, a barrier function of $\hat{u}_d$ is also included in the cost function:

$$\delta_a(\tau) = \|\epsilon_a(\tau)\|_{Q_a}^2 + \mu B(\hat{u}_d(\tau)) \tag{26}$$

At each step $\tau$, the weight $W_a$ is updated according to the following rule

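$$W_a(\tau+1) = W_a(\tau) - \gamma_\tau \frac{\partial \hat{u}_d(\hat{s}(\tau))}{\partial W_a(\tau)} \left( \frac{\partial \delta_a(\tau)}{\partial \hat{u}_d(\hat{s}(\tau))} \right)^{\top} \tag{27}$$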

where $\gamma_\tau$ is the learning rate.
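The actor update (25)-(27) can be sketched as follows. This is an illustrative reading, not the authors' implementation: the residual is taken as the right-hand side of (24) minus its left-hand side evaluated at the actor output, the barrier is a plain logarithmic barrier on an assumed box `|u_i| < U_MAX`, and the basis vector `h`, the target, and all gains are hypothetical values.

```python
import numpy as np

U_MAX = 1.0  # assumed input bound |u_i| < U_MAX

def barrier(u):
    return float(-np.sum(np.log(1.0 - (u / U_MAX) ** 2)))

def barrier_grad(u):
    return 2.0 * u / (U_MAX ** 2 - u ** 2)

def barrier_hess_diag(u):
    return 2.0 * (U_MAX ** 2 + u ** 2) / (U_MAX ** 2 - u ** 2) ** 2

def actor_step(W_a, h, target, R, Q_a, mu, gamma):
    """One gradient step on delta_a = ||eps_a||^2_{Q_a} + mu * B(u_hat), cf. (26)-(27)."""
    u_hat = W_a.T @ h                                    # actor output, cf. (25)
    lhs = mu * barrier_grad(u_hat) + 2.0 * R @ u_hat     # left-hand side of (24)
    eps = target - lhs                                   # assumed residual definition
    delta = float(eps @ Q_a @ eps) + mu * barrier(u_hat)
    jac = mu * np.diag(barrier_hess_diag(u_hat)) + 2.0 * R   # d(lhs)/d(u_hat)
    grad_u = -2.0 * jac.T @ Q_a @ eps + mu * barrier_grad(u_hat)
    # For u_hat = W_a^T h, the gradient with respect to W_a is outer(h, grad_u)
    W_new = W_a - gamma * np.outer(h, grad_u)
    return W_new, delta, eps

# Hypothetical data (assumptions, not taken from the paper)
h = np.array([0.5, -0.3, -0.15])      # basis functions evaluated at one state
target = np.array([0.8])              # stands in for -B^T lambda_hat in (24)
R, Q_a = np.array([[1.0]]), np.array([[1.0]])
mu, gamma = 0.1, 0.05

W_a = np.zeros((3, 1))
deltas = []
for _ in range(300):
    W_a, delta, eps = actor_step(W_a, h, target, R, Q_a, mu, gamma)
    deltas.append(delta)
u_final = W_a.T @ h
print("delta_a: first", deltas[0], "last", deltas[-1], "u_hat:", u_final)
```

In this example the barrier term keeps the actor output inside the box during learning while the residual of (24) shrinks; the small remaining residual reflects the trade-off between the two terms of (26).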

Along the same line, the critic network is given as

$$\hat{\lambda}(\hat{s}(\tau)) = W_c(\tau)^\top h(\hat{s}(\tau)), \tag{28}$$

where $W_c(\tau)$ is the weighting matrix. For the sake of clarity, we use to represent . As in the traditional DHP algorithm, the critic network minimizes the residual between the optimal costate $\lambda^*(\hat{s}(\tau))$ and its estimate $\hat{\lambda}(\hat{s}(\tau))$. However, as $\lambda^*(\hat{s}(\tau))$ is not available, the following target $\lambda_d(\hat{s}(\tau))$, built from the one-step-ahead costate estimated by the critic network, is used instead:

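$$\lambda_d(\hat{s}(\tau)) = \begin{cases} \mu \dfrac{\partial B(\hat{s}(\tau))}{\partial \hat{s}(\tau)} + 2\bar{Q}\hat{s}(\tau) + A^\top \hat{\lambda}(\hat{s}(\tau+1)), & \text{for } \tau \in [k, k+N-2] \\[2ex] \mu \dfrac{\partial B(\hat{s}(\tau))}{\partial \hat{s}(\tau)} + 2\bar{Q}\hat{s}(\tau) + \mu \dfrac{\partial B_f(\hat{s}(\tau))}{\partial \hat{s}(\tau)} + 2A^\top P \hat{s}(\tau+1), & \text{for } \tau = k+N-1 \end{cases} \tag{29}$$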
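The piecewise target in (29) translates directly into a small helper. The sketch below follows the two cases as written; the function name, its arguments, and the barrier gradient callables `grad_B` and `grad_Bf` are hypothetical, and the step index `j` stands for $\tau - k$.

```python
import numpy as np

def costate_target(j, N, s, s_next, lam_next_hat, A, Q_bar, P, mu, grad_B, grad_Bf):
    """Critic target of (29) at prediction step j = tau - k, with j in [0, N-1]."""
    base = mu * grad_B(s) + 2.0 * Q_bar @ s
    if j < N - 1:
        # Intermediate steps bootstrap from the critic's one-step-ahead estimate
        return base + A.T @ lam_next_hat
    # Last step uses the terminal barrier and the terminal penalty matrix P
    return base + mu * grad_Bf(s) + 2.0 * A.T @ P @ s_next

# Hypothetical data (assumptions, not taken from the paper)
A = np.array([[1.0, 0.1], [0.0, 1.0]])
Q_bar, P = np.eye(2), np.eye(2)
grad_B = lambda s: 2.0 * s / (4.0 - s ** 2)   # gradient of a log barrier on |s_i| < 2
grad_Bf = grad_B                               # same barrier reused for the terminal set

s = np.array([1.0, 0.0])
s_next = np.array([1.0, 1.0])
lam_next_hat = np.array([2.0, 4.0])

mid = costate_target(0, 3, s, s_next, lam_next_hat, A, Q_bar, P, 0.1, grad_B, grad_Bf)
last = costate_target(2, 3, s, s_next, lam_next_hat, A, Q_bar, P, 0.1, grad_B, grad_Bf)
print("intermediate target:", mid, "terminal target:", last)
```

Only the intermediate case depends on the critic output, which is why the target moves together with the critic weights during learning.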

The target can also be represented as the output of the critic network plus a residual term, i.e., $\lambda_d(\hat{s}(\tau)) = \hat{\lambda}(\hat{s}(\tau)) + \epsilon_c(\tau)$, where $\epsilon_c(\tau)$ is the corresponding approximation residual to be minimized. To minimize $\epsilon_c(\tau)$, the quadratic performance index $\|\epsilon_c(\tau)\|_{Q_c}^2$ can be adopted, where $Q_c$ is a positive-definite matrix. Notice that it is almost impossible to design a barrier function in the critic that exactly represents the state constraint, as was done in the actor for the control constraint. Instead, we use a soft constraint relying on a barrier function on $\hat{\lambda}$ to guide the weight update of $W_c$. The remaining problem is how to find an upper bound for $\hat{\lambda}(\hat{s}(\tau))$ for each $\tau$. As can be seen in (29), a conservative upper bound of can be easily found since is bounded (see (15)). Note however that, under Assumption 2, in order to guarantee the feasibility of (15), the corresponding feasible state region , can be found at starting from the terminal set , see [30]. In view of (29), and considering , , it holds that . Denoting , one promptly has , the region in which , lies can be defined and represented as . In fact, for , the effect of the barrier function on the scale of is limited. Therefore, in order to reduce conservativeness, we simplify the calculation formula of as . The details on how to compute are described in Algorithm 1. Thus, the cost to be minimized by the critic network is

$$\delta_c(\tau) = \|\epsilon_c(\tau)\|_{Q_c}^2 + \mu B(\hat{\lambda}(\hat{s}(\tau))), \tag{30}$$

where, for each $\tau$, $B(\hat{\lambda}(\hat{s}(\tau)))$ is the relaxed re-centered barrier function for the region in which the costate is required to lie. At each step $\tau$, the weight update is also based on gradient descent and is given by

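$$W_c(\tau+1) = W_c(\tau) - \beta_\tau \frac{\partial \hat{\lambda}(\hat{s}(\tau))}{\partial W_c(\tau)} \left( \frac{\partial \delta_c(\tau)}{\partial \hat{\lambda}(\hat{s}(\tau))} \right)^{\top}. \tag{31}$$

###### Remark 3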

Note that the barrier function on $\hat{\lambda}$ adopted in (30) may still fail to enforce the state constraint, but it can shrink the region in which $\hat{\lambda}$ lies. In doing so, the number of failures in the learning process can be greatly reduced.
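A corresponding sketch of the critic update (28), (30), (31) is given below. It rests on several assumptions: the target $\lambda_d$ is treated as a constant during the step, the barrier is a plain logarithmic barrier on an assumed box `|lambda_i| < LAM_MAX` rather than the relaxed re-centered barrier, and the basis vector, target, and gains are hypothetical. The critic learning rate `beta` is chosen larger than the actor rate, in line with the two-timescale argument of this section.

```python
import numpy as np

LAM_MAX = 5.0  # assumed bound on each costate entry

def barrier(lam):
    return float(-np.sum(np.log(1.0 - (lam / LAM_MAX) ** 2)))

def barrier_grad(lam):
    return 2.0 * lam / (LAM_MAX ** 2 - lam ** 2)

def critic_step(W_c, h, lam_target, Q_c, mu, beta):
    """One gradient step on delta_c = ||eps_c||^2_{Q_c} + mu * B(lam_hat), cf. (30)-(31)."""
    lam_hat = W_c.T @ h                       # critic output, cf. (28)
    eps = lam_target - lam_hat                # residual with respect to the target
    delta = float(eps @ Q_c @ eps) + mu * barrier(lam_hat)
    # Target held constant, so only lam_hat contributes to the gradient
    grad_lam = -2.0 * Q_c @ eps + mu * barrier_grad(lam_hat)
    W_new = W_c - beta * np.outer(h, grad_lam)
    return W_new, delta

# Hypothetical data (assumptions, not taken from the paper)
h = np.array([0.5, -0.3, -0.15])
lam_target = np.array([1.0, -2.0])
Q_c = np.eye(2)
mu, beta = 0.1, 0.5

W_c = np.zeros((3, 2))
deltas = []
for _ in range(200):
    W_c, delta = critic_step(W_c, h, lam_target, Q_c, mu, beta)
    deltas.append(delta)
lam_final = W_c.T @ h
print("delta_c: first", deltas[0], "last", deltas[-1], "lambda_hat:", lam_final)
```

The barrier term biases the estimate slightly toward the origin, so `lam_final` settles close to, but not exactly at, the target, which matches the role of the barrier as a soft constraint described in Remark 3.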

The main implementation steps of the r-LPC are described in Algorithm 2.