# Policy iteration for Hamilton-Jacobi-Bellman equations with control constraints

Policy iteration is a widely used technique for solving the Hamilton-Jacobi-Bellman (HJB) equation, which arises in nonlinear optimal feedback control theory. Its convergence analysis has attracted much attention in the unconstrained case. Here we analyze the case with control constraints, both for the HJB equations which arise in deterministic and in stochastic optimal control. The linear equations in each iteration step are solved by an implicit upwind scheme. Numerical examples are presented for the HJB equation with control constraints, and comparisons with the unconstrained case are shown.


## 1. Introduction

Stabilization is one of the major objectives in optimal control theory. Since an open-loop control depends only on time and the initial state, it must be recomputed whenever the initial state changes. In many physical situations we are therefore interested in a control law which depends on the state and which can cope with additional external perturbations or model errors. When the dynamical system is linear with unconstrained control and the corresponding cost functional, or performance index, is quadratic, the closed-loop or feedback control law which minimizes the cost functional is obtained from the so-called Riccati equation. When the dynamics are nonlinear, one popular approach is to linearize the dynamics and apply the resulting Riccati-based controller to the original nonlinear system, see e.g. [33].
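As an illustration of the Riccati-based feedback just described, the following sketch computes an LQR gain with SciPy's `solve_continuous_are`; the double-integrator data are an illustrative choice, not from the paper:

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Double integrator y'' = u as a linear system, quadratic cost with Q = I, R = 1.
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])

# Solve A^T P + P A - P B R^{-1} B^T P + Q = 0 for the stabilizing P.
P = solve_continuous_are(A, B, Q, R)
K = np.linalg.solve(R, B.T @ P)   # feedback gain: u = -K y
A_cl = A - B @ K                  # closed-loop dynamics
```

For this example the gain works out to K = [1, √3], and the closed-loop matrix A − BK is Hurwitz, so the feedback stabilizes the system for every initial state without recomputation.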

If the optimal feedback cannot be obtained by LQR theory, it can be approached via the verification theorem and the value function, which in turn is a solution of the Hamilton-Jacobi-Bellman (HJB) equation associated to the optimal control problem, see e.g. [21]. In most situations, however, it is very difficult to obtain the exact value function, and thus one has to resort to iterative techniques. One possible approach utilizes the so-called value iteration. Here we are interested in a more efficient technique known as policy iteration. The discrete counterpart of policy iteration is also known as Howard's algorithm [27]. Policy iteration can be interpreted as a Newton method applied to the HJB equation. Hence, using policy iteration, the HJB equation reduces to a sequence of linearized HJB equations, which for historical reasons are called 'generalized HJB equations'. The policy iteration requires a stabilizing control as initialization. If such a control is not available, one can use a discounted path-following policy iteration, see e.g. [28]. The focus of the present paper is the analysis of policy iteration in the presence of control constraints. To the authors' knowledge, such an analysis is not available in the literature.

Let us next mention, very selectively, some of the literature on the numerical solution of control-HJB equations. Specific comments on the constrained case are given further below. Finite difference methods and the vanishing viscosity method were developed in the work of Crandall and Lions [17]. Other notable techniques are finite difference methods [16], semi-Lagrangian schemes [4, 19], the finite element method [26], filtered schemes [11], domain decomposition methods [14], and level set methods [36]. For an overview we refer to [20]. Policy iteration algorithms are developed in [32, 37, 8, 9].

If the dynamics are given by a partial differential equation (PDE), then the corresponding HJB equations become infinite dimensional. Applying a grid-based scheme to convert the PDE dynamics to ODE dynamics leads to a high-dimensional HJB equation. This phenomenon is known as the curse of dimensionality. In the literature there are different techniques to tackle this situation. Related recent works include polynomial approximation [28], deep neural networks [31], tensor calculus [18], Taylor series expansions [13], and graph-tree structures [5].

Iteration in policy space for second order HJB PDEs arising in stochastic control is discussed in [40]. A tensor calculus technique is used in [39], where the associated HJB equation becomes linear after an exponential transformation and scaling. For an overview of numerical approximations to stochastic HJB equations, we refer to [30]. To solve the continuous-time stochastic control problem by the Markov chain approximation method (MCAM), the diffusion operator is approximated by a finite dimensional Markov decision process, which is then solved by the policy iteration algorithm. For the connection between the finite difference method and the MCAM approach for the second order HJB equation, see [12].

Let us next recall contributions for the case with control constraints. Regularity of the value function in the presence of control constraints has been discussed in [24, 25] for linear dynamics and in [15] for nonlinear systems. In [34] non-quadratic penalty functionals were introduced to approximately include control constraints. Later, in [1], such functionals were discussed in the context of policy iteration for HJB equations.

Convergence of the policy iteration without constraints has been investigated in earlier work, see in particular [37, 40]. In our work, the control constraints are realized exactly by means of a projection operator. This approach has been used earlier only for numerical purposes in the context of the value iteration, see [23] and [29]. In our work we prove the convergence of the policy iteration with control constraints for both the first and the second order HJB-PDEs. For numerical experiments we use an implicit upwind scheme as proposed in [2].

The rest of the paper is organized as follows. Section 2 provides a convergence analysis of the policy iteration for the first order HJB equation in the presence of control constraints. Section 3 establishes the corresponding convergence result for the second order HJB equation with control constraints. Numerical tests are presented in Section 4. Finally, Section 5 contains concluding remarks.

## 2. Nonlinear H2 feedback control problem subject to a deterministic system

### 2.1. First order Hamilton-Jacobi-Bellman equation

We consider the following infinite horizon optimal control problem

 (2.1) min_{u(⋅)∈U} J(x,u(⋅)) := ∫₀^∞ (ℓ(y(t)) + ∥u(t)∥²_R) dt,

subject to the nonlinear deterministic dynamical constraint

 (2.2) ẏ(t) = f(y(t)) + g(y(t))u(t), y(0) = x,

where y(t) ∈ ℝ^d is the state vector, and u(t) ∈ ℝ^m is the control input with u(t) ∈ U. Further ℓ : ℝ^d → ℝ, with ℓ(x) ≥ 0 for x ∈ ℝ^d, is the state running cost, and ∥u∥²_R = uᵗRu represents the control cost, with R ∈ ℝ^{m×m} a positive definite matrix. Throughout we assume that the dynamics f and g, as well as ℓ, are Lipschitz continuous on ℝ^d, and that f(0) = 0 and ℓ(0) = 0. We also require that g is globally bounded on ℝ^d. This set-up relates to our aim of asymptotic stabilization to the origin by means of the control u. We shall concentrate on the case where the initial conditions x are chosen from a domain Ω ⊂ ℝ^d containing the origin in its interior.

The specificity of this work relates to controls which need to obey constraints u(t) ∈ U, where U ⊂ ℝ^m is a closed convex set containing 0. As a special case we mention bilateral point-wise constraints of the form

 (2.3) U={u|α≤u≤β},

where α, β ∈ ℝ^m with α ≤ 0 ≤ β, and the inequalities act coordinate-wise.

The optimal value function associated to (2.1)-(2.2) is given by

 V(x)=minu(⋅)∈UJ(x,u(⋅)),

where U denotes the set of measurable functions u : [0,∞) → U. Here and throughout the paper we assume that for every x ∈ Ω a solution to (2.1) exists. Thus, implicitly, we also assume the existence of a control u such that (2.2) admits a solution. If required by the context, we shall indicate the dependence of y on u or x by writing y(⋅; u), respectively y(⋅; x).

In case V ∈ C¹(Ω), it satisfies the Hamilton-Jacobi-Bellman (HJB) equation

 (2.4) inf_{u∈U} {∇V(x)ᵗ(f(x) + g(x)u) + ℓ(x) + ∥u∥²_R} = 0, V(0) = 0,

where ∇V denotes the gradient of V. Otherwise, sufficient conditions which guarantee that V is the unique viscosity solution to (2.4) are well investigated, see for instance [6, Chapter III].

In the unconstrained case, i.e. with U = ℝ^m, the value function is always differentiable (see [6, pg. 80]). If ℓ(x) = xᵗQx, with Q a positive definite matrix, and the control dynamics are linear of the form ẏ = Ay + Bu, the value function is differentiable provided that U is a nonempty polyhedral set (finite horizon) or a closed convex set with 0 ∈ int U (infinite horizon), see [24, 25]. Sufficient conditions for (local) differentiability of the value function associated to finite horizon optimal control problems with nonlinear dynamics and control constraints have been obtained in [15]. For our analysis, we assume that

###### Assumption 1.

The value function satisfies V ∈ C¹(ℝ^d). Moreover it is radially unbounded, i.e. lim_{∥x∥→∞} V(x) = ∞.

With Assumption 1 holding, the verification theorem, see e.g. [22, Chapter 1], implies that an optimal control in feedback form is given as the minimizer in (2.4), and thus

 (2.5) u*(x) = P_U(−½R⁻¹g(x)ᵗ∇V(x)),

where P_U is the orthogonal projection, in the R-weighted inner product on ℝ^m, onto U, i.e.

 (R⁻¹g(x)ᵗ∇V(x) + 2u*, u − u*)_R ≥ 0 for all u ∈ U.

Alternatively, u*(x) can be expressed as

 u*(x) = R^{−1/2} P_Ū(−½R^{−1/2}g(x)ᵗ∇V(x)),

where P_Ū is the orthogonal projection in ℝ^m, with the Euclidean inner product, onto Ū = R^{1/2}U. For the particular case of (2.3) we have

 u*(x) = P_U(−½R⁻¹g(x)ᵗ∇V(x)) = min{β, max{α, −½R⁻¹g(x)ᵗ∇V(x)}},

where the min and max operations act coordinate-wise. It is common practice to refer to the open-loop control as in (2.1) and to the closed-loop control as in (2.5) by the same letter.
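When R is diagonal, the R-weighted projection onto the box constraint set (2.3) decouples coordinate-wise, so the feedback law can be evaluated as a simple clamp. The following minimal sketch assumes a diagonal R; the function name and all data are illustrative, not from the paper:

```python
import numpy as np

def projected_control(grad_V, g, R_diag, alpha, beta):
    """Evaluate u*(x) = P_U(-(1/2) R^{-1} g(x)^t grad V(x)) for the box
    U = {u : alpha <= u <= beta}.  Assumes R = diag(R_diag), so the
    R-weighted projection reduces to a coordinate-wise clamp."""
    u_unc = -0.5 * (g.T @ grad_V) / R_diag   # unconstrained minimizer
    return np.minimum(beta, np.maximum(alpha, u_unc))

# Illustrative data: d = m = 2, g = I, R = I, box U = [-1, 1]^2.
alpha, beta = np.full(2, -1.0), np.full(2, 1.0)
u_star = projected_control(np.array([4.0, -6.0]), np.eye(2), np.ones(2), alpha, beta)
```

Here the unconstrained minimizer (−2, 3) is clamped to u* = (−1, 1); one can check numerically that the variational inequality characterizing the projection holds for every u in the box.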

For the unconstrained case the corresponding control law reduces to u*(x) = −½R⁻¹g(x)ᵗ∇V(x). Using the optimal control law in (2.4) (where we now drop the superscript ∗), we obtain the equivalent form of the HJB equation as

 (2.6) ∇V(x)ᵗ(f(x) + g(x)P_U(−½R⁻¹g(x)ᵗ∇V(x))) + ℓ(x) + ∥P_U(−½R⁻¹g(x)ᵗ∇V(x))∥²_R = 0.

We shall require the notion of generalized Hamilton-Jacobi-Bellman (GHJB) equations (see e.g. [37, 7]), given as follows:

 (2.7) GHJB(V,∇V;u) := ∇Vᵗ(f + gu) + (ℓ + ∥u∥²_R) = 0, V(0) = 0.

Here V and u are considered as functions of x.

###### Remark 1.

Concerning the existence of a solution in the unconstrained case to the Zubov type equation (2.7), we refer to [33, Lemma 2.1, Theorem 1.1], where it is shown in the analytic case that if the linearization of (2.2) at the origin is a stabilizable pair, then (2.7) admits a unique locally positive definite solution V, which is locally analytic at the origin.

For solving (2.6) we analyse the following iterative scheme, which is referred to as policy iteration or Howard's algorithm, see e.g. [37, 4, 7, 10], where the case without control constraints is treated. We require the notion of admissible controls, a concept introduced in [32, 7].

###### Definition 1.

(Admissible Controls). A measurable function u : Ω → U is called admissible with respect to Ω, denoted by u ∈ A(Ω), if

• (i) u is continuous on Ω,

• (ii) u(0) = 0,

• (iii) u stabilizes (2.2) on Ω, i.e. lim_{t→∞} y(t; u) = 0 for every x ∈ Ω,

• (iv) J(x, u) < ∞ for every x ∈ Ω.

Here y(⋅; u) denotes the solution to (2.2), where y(0) = x, and with the control in feedback form u = u(y(t)). As in (2.1), the value of the cost in (iv) associated to the closed-loop control is denoted by J(x, u). In general we cannot guarantee that the controlled trajectories remain in Ω for all t ≥ 0. For this reason we demand continuity and differentiability properties of u and V on all of ℝ^d. Under additional assumptions we could introduce a set Ω̃ with Ω ⊂ Ω̃ ⊂ ℝ^d with the property that the trajectories remain in Ω̃, and demand the regularity properties of u and V on Ω̃ only. We shall not pursue such a set-up here.

We are now prepared to present the algorithm.

###### Algorithm 1. (Policy iteration with control constraints)

Given an admissible control u(0) ∈ A(Ω), for i = 0, 1, 2, …:

1. Solve for V(i) the generalized HJB equation

 (2.8) GHJB(V(i), ∇V(i); u(i)) = 0, V(i)(0) = 0.

2. Update the control

 (2.9) u(i+1)(x) = P_U(−½R⁻¹g(x)ᵗ∇V(i)(x)).

Note that (2.8) can equivalently be expressed as ∇V(i)ᵗ(f + gu(i)) = −(ℓ + ∥u(i)∥²_R).
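To make the iteration concrete, the following sketch applies it to a one-dimensional model problem. The dynamics ẏ = y + u, the cost y² + u², the constraint |u| ≤ 1, the grid, and the tolerances are all illustrative choices, not data from the paper. Each linear GHJB equation is solved with a first-order implicit upwind discretization, and the policy update is the projected formula (2.5):

```python
import numpy as np

# 1-D model problem: f(y) = y, g(y) = 1, running cost ℓ(y) = y^2, R = 1,
# control constraint U = [-1, 1].  All data here are illustrative choices.
xs = np.linspace(-0.8, 0.8, 161)     # grid on Ω = [-0.8, 0.8]
h = xs[1] - xs[0]
i0 = len(xs) // 2                    # index of x = 0, where V(0) = 0 is pinned
umax = 1.0

def solve_ghjb(u):
    """Implicit upwind solve of V'(x)(f(x) + g(x)u(x)) + ℓ(x) + u(x)^2 = 0."""
    n = len(xs)
    A = np.zeros((n, n))
    b = np.zeros(n)
    c = xs + u                       # advection speed f + g u
    for i in range(n):
        if i == i0:                  # pin V(0) = 0
            A[i, i] = 1.0
            continue
        # upwind: difference in the direction the speed points
        # (forced inward at the two boundary nodes)
        if (c[i] >= 0.0 and i < n - 1) or i == 0:
            A[i, i], A[i, i + 1] = -c[i] / h, c[i] / h
        else:
            A[i, i], A[i, i - 1] = c[i] / h, -c[i] / h
        b[i] = -(xs[i] ** 2 + u[i] ** 2)
    return np.linalg.solve(A, b)

def policy_update(V):
    """Projected update u = P_U(-(1/2) R^{-1} g^t V')."""
    return np.clip(-0.5 * np.gradient(V, h), -umax, umax)

u = np.clip(-2.0 * xs, -umax, umax)  # admissible (stabilizing) initial policy
V = solve_ghjb(u)
for _ in range(30):                  # policy iteration loop
    u = policy_update(V)
    V_new = solve_ghjb(u)
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new
```

On this example the iterates decrease monotonically; near the origin, where the constraint is inactive, the computed value agrees with the LQR value (1+√2)x² of the unconstrained problem, while for larger |x| the control saturates at ±1.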

###### Lemma 1.

Assume that u is an admissible feedback control with respect to Ω. If there exists a function V(⋅; u) ∈ C¹(ℝ^d) satisfying

 (2.10) GHJB(V,∇V;u)=∇V(x;u)t(f(x)+g(x)u(x))+ℓ(x)+∥u(x)∥2R=0,V(0;u)=0,

then V(x; u) = J(x, u) for all x ∈ Ω. Moreover, the optimal control law u* from (2.5) is admissible on Ω, the optimal value function satisfies V(x) = V(x; u*), and V(x) ≤ V(x; u).

###### Proof.

The proof takes advantage, in part, of the verification of an analogous result in the unconstrained case, see e.g. [37], where a finite horizon problem is treated. Let x ∈ Ω be arbitrary and fixed, and choose any T > 0. Then we have

 (2.11) V(y(T;u);u)−V(x;u)=∫T0ddtV(y(t;u))dt.

Since lim_{T→∞} y(T; u) = 0 by (iii), and due to V(0; u) = 0, we can take the limit T → ∞ in this equation to obtain

 (2.12) V(y(∞;u);u) − V(x;u) = −V(x;u) = ∫₀^∞ (d/dt)V(y(t;u);u) dt = ∫₀^∞ ∇V(y;u)ᵗ(f(y) + g(y)u(y)) dt,

where y = y(t; u) on the right-hand side of the above equation. Adding both sides of (2.12) to the respective sides of (2.1), we obtain

 J(x,u)−V(x;u)=∫∞0(∇V(y;u)t(f(y)+g(y)u(y))+ℓ(y)+∥u(y)∥2R)dt,

and from (2.10) we conclude that J(x, u) = V(x; u) for all x ∈ Ω.

Let us next consider the optimal feedback law u* given by (2.5). We need to show that it is admissible in the sense of Definition 1. By Assumption 1 and (2.5) it is continuous on ℝ^d. Moreover V(x) ≥ 0 for all x ∈ ℝ^d, thus V attains its minimum at the origin, and consequently ∇V(0) = 0 and u*(0) = 0. Here we also use that 0 ∈ U. Thus (i) and (ii) of Definition 1 are satisfied for u*, and (iv) follows from our general assumption that (2.1) has a solution for every x ∈ Ω. To verify (iii), note that from (2.11), (2.5), and (2.6) we have for every T > 0

 (2.13) V(y(T;u*)) − V(x) = −∫₀ᵀ (ℓ(y(t;u*)) + ∥u*(y(t;u*))∥²_R) dt < 0.

Thus T ↦ V(y(T; u*)) is strictly monotonically decreasing (unless y(T; u*) = 0 for some finite T). Hence lim_{T→∞} V(y(T; u*)) = V̄ for some V̄ ≥ 0. If V̄ > 0, then there exists ε > 0 such that ∥y(T; u*)∥ ≥ ε for all T ≥ 0. Let S = {z ∈ ℝ^d : V̄ ≤ V(z) ≤ V(x), ∥z∥ ≥ ε}. Due to continuity of V the set S is closed. The radial unboundedness assumption on V further implies that S is bounded and thus it is compact. Let us set

 V̇(x) = ∇V(x)ᵗ(f(x) + g(x)P_U(−½R⁻¹g(x)ᵗ∇V(x))) for x ∈ ℝ^d.

Then by (2.6) we have

 V̇(x) = −ℓ(x) − ∥P_U(−½R⁻¹g(x)ᵗ∇V(x))∥²_R < 0.

By compactness of S we have ζ := sup_{z∈S} V̇(z) < 0. Note that y(T; u*) ∈ S for all T ≥ 0. Hence by (2.13) we find

 limT→∞V(y(T;u∗))−V(x)≤limT→∞ζT,

which is impossible. Hence V̄ = 0 and lim_{T→∞} y(T; u*) = 0.

Now we can apply the arguments from the first part of the proof with u = u* and obtain V(x) = V(x; u*) for every x ∈ Ω. This concludes the proof. ∎

### 2.2. Convergence of policy iteration

The previous lemma establishes the fact that the value J(x, u), for a given admissible control u and x ∈ Ω, can be obtained as the evaluation at x of the solution of the GHJB equation (2.10). In the following proposition we commence the analysis of the convergence of Algorithm 1.

###### Proposition 1.

If u(0) ∈ A(Ω), then u(i) ∈ A(Ω) for all i ∈ ℕ. Moreover we have V(x) ≤ V(i+1)(x) ≤ V(i)(x) for all x ∈ Ω and i ∈ ℕ. Further, V(i) converges from above pointwise to some V̄ ≥ V in Ω.

###### Proof.

We proceed by induction. Given i ∈ ℕ, we assume that u(i) ∈ A(Ω), and establish that u(i+1) ∈ A(Ω). We shall frequently refer to Lemma 1 with u = u(i) and V(i) as in (2.8). In view of (2.9), u(i+1) is continuous, since ∇V(i) is continuous and P_U is continuous. Using Lemma 1 we obtain that V(i) is positive definite. Hence it attains its minimum at the origin, and consequently ∇V(i)(0) = 0 and u(i+1)(0) = 0. Thus (i) and (ii) in the definition of admissibility of u(i+1) are established.

Next we take the time-derivative of V(i)(y(i+1)(t)), where y(i+1) is the trajectory corresponding to

 ẏ(t) = f(y(t)) + g(y(t))u(i+1)(y(t)), y(0) = x.

Let us recall that GHJB(V(i), ∇V(i); u(i)) = 0 for x ∈ Ω. Then we find

 (2.14) d/dt V(i)(y(i+1)(t)) = ∇V(i)(y(i+1))ᵗ(f(y(i+1)) + g(y(i+1))u(i+1)(y(i+1))) = −ℓ(y(i+1)) − ∥u(i)(y(i+1))∥²_R + ∇V(i)(y(i+1))ᵗg(y(i+1))(u(i+1)(y(i+1)) − u(i)(y(i+1))).

Throughout the following computation we do not indicate the dependence on t on the right-hand side of the equality. Next we need to rearrange the terms on the right-hand side of the last expression. For this purpose it will be convenient to introduce z := ½R⁻¹g(y(i+1))ᵗ∇V(i)(y(i+1)) and observe that u(i+1)(y(i+1)) = P_U(−z). We can express the above equality as

 d/dt V(i)(y(i+1)(t)) = −ℓ(y(i+1)) − ∥u(i)(y(i+1))∥²_R + 2zᵗR P_U(−z) − 2zᵗR u(i)(y(i+1)) = −ℓ(y(i+1)) − ∥P_U(−z)∥²_R − ∥u(i)(y(i+1)) − P_U(−z)∥²_R + 2(z + P_U(−z))ᵗR(P_U(−z) − u(i)(y(i+1))).

Since u(i)(y(i+1)) ∈ U, the variational inequality characterizing P_U implies 2(z + P_U(−z))ᵗR(P_U(−z) − u(i)(y(i+1))) ≤ 0, and thus

 (2.15) d/dt V(i)(y(i+1)(t)) ≤ −ℓ(y(i+1)) − ∥u(i+1)(y(i+1))∥²_R − ∥u(i+1)(y(i+1)) − u(i)(y(i+1))∥²_R.

Hence t ↦ V(i)(y(i+1)(t)) is strictly monotonically decreasing. As mentioned above, V(i) is positive definite. With the arguments as in the last part of the proof of Lemma 1 it follows that lim_{t→∞} y(i+1)(t) = 0. Finally, (2.15) implies that J(x, u(i+1)) < ∞. Since x ∈ Ω was chosen arbitrarily, it follows that u(i+1) defined in (2.9) is admissible. Lemma 1 further implies that V ≤ V(i+1) on Ω.

Since for each x ∈ Ω the trajectory y(i+1) corresponding to u(i+1) and satisfying

 (2.16) ˙y =(f(y)+g(y)u(i+1)),y(0)=x,

is asymptotically stable, the difference between V(i+1)(x) and V(i)(x) can be obtained as

 V(i+1)(x)−V(i)(x) =∫∞0((∇V(i)(y(i+1))t(f+gu(i+1)))−(∇V(i+1)t(f+gu(i+1))))dt,

where V(i) and V(i+1) are evaluated at y(i+1)(t). Utilizing the generalized HJB equation (2.8), we get ∇V(i)ᵗ(f + gu(i)) = −ℓ − ∥u(i)∥²_R and ∇V(i+1)ᵗ(f + gu(i+1)) = −ℓ − ∥u(i+1)∥²_R. This leads to

 V(i+1)(x)−V(i)(x)=∫∞0(∥u(i+1)∥2R−∥u(i)∥2R+∇V(i)(y(i+1))tg(u(i+1)−u(i)))dt.

The last two terms in the above integrand appeared in (2.14) and were estimated in the subsequent steps. We can reuse this estimate and obtain

 V(i+1)(x)−V(i)(x)≤−∫∞0∥u(i+1)−u(i)∥2Rdt≤0.

Hence, {V(i)(x)} is a monotonically decreasing sequence which is bounded below by the optimal value function V(x), see Lemma 1. Consequently it converges pointwise to some V̄(x) ≥ V(x). ∎

To show convergence of V̄ to V, additional assumptions are needed. This is considered in the following proposition. In the literature, for the unconstrained case, one can find the statement that, based on Dini's theorem, the monotonically convergent sequence {V(i)} converges uniformly to V̄ if Ω̄ is compact. This, however, only holds true once it is argued that V̄ is continuous. For the following it will be useful to recall that C^k_b(Ω), k ∈ ℕ, consists of all functions V ∈ C^k(Ω) such that Dᵅ V is bounded and uniformly continuous on Ω for all multi-indices α with |α| ≤ k, see e.g. [3, pg. 10].

###### Proposition 2.

If Ω is bounded, the functions V(i) ∈ C¹_b(Ω) satisfy (2.8), V̄ ∈ C¹(Ω), and {∇V(i)} is equicontinuous in Ω, then ∇V(i) converges pointwise to ∇V̄ in Ω, and V̄ satisfies the HJB equation (2.6) for all x ∈ Ω, with lim_{i→∞} u(i)(x) = ū(x) := P_U(−½R⁻¹g(x)ᵗ∇V̄(x)) for all x ∈ Ω.

###### Proof.

Let ε > 0 and x ∈ Ω be arbitrary. Denote by e_k, k = 1, …, d, the k-th unit vectors, and choose δ > 0 such that x + e_kδ ∈ Ω. By continuity of ∇V̄ and equicontinuity of {∇V(i)} in Ω, δ can be chosen such that

 (2.17) ∥∇V̄(x) − ∇V̄(x+h)∥ < ε/3 and ∥∇V(i)(x) − ∇V(i)(x+h)∥ < ε/3,

with x + h ∈ Ω, for all h with ∥h∥ ≤ δ and all i ∈ ℕ.

By assumption Ω is bounded and thus Ω̄ is compact. Hence by Dini's theorem, V(i) converges to V̄ uniformly on Ω̄. Here we use the assumption that V̄ is continuous. We can now choose ī ∈ ℕ such that

 (2.18) |V̄(y) − V(i)(y)| ≤ εδ/6 ∀y ∈ Ω, and ∀i ≥ ī.

We have

 ∂xk V̄(x) − ∂xk V(i)(x) = (∂xk V̄(x) − (1/δ)(V̄(x + e_kδ) − V̄(x))) + (1/δ)((V̄(x + e_kδ) − V(i)(x + e_kδ)) − (V̄(x) − V(i)(x))) + ((1/δ)(V(i)(x + e_kδ) − V(i)(x)) − ∂xk V(i)(x)) =: I1 + I2 + I3.

We estimate I1 and I3 by using (2.17) as

 |I1| = (1/δ)|∫₀¹ (∂xk V̄(x) − ∂xk V̄(x + e_kσδ)) δ dσ| ≤ ε/3,

and

 |I3| = (1/δ)|∫₀¹ (∂xk V(i)(x + e_kσδ) − ∂xk V(i)(x)) δ dσ| ≤ ε/3.

We estimate I2 by using (2.18):

 |I2| ≤ ε/3,

and combining this with the estimates for I1 and I3, we obtain

 (2.19) ∥∇V̄(x) − ∇V(i)(x)∥ ≤ ε√d for all i ≥ ī.

Since ε > 0 was arbitrary, this implies that

 ∇V(i) → ∇V̄ pointwise in Ω.

It then follows from (2.9) that

 lim_{i→∞} P_U(−½R⁻¹g(x)ᵗ∇V(i)(x)) = lim_{i→∞} u(i+1)(x) = P_U(−½R⁻¹g(x)ᵗ∇V̄(x)) =: ū(x), in Ω,

and by (2.8)

 ∇V̄(x)ᵗ(f(x) + g(x)ū(x)) + ℓ(x) + ∥ū(x)∥²_R = 0, ∀x ∈ Ω.

For uniqueness of the value function, we refer to [22, pg. 86] and [6, Chapter III]. ∎

We next provide a sufficient condition for the assumptions posed in Proposition 2.

###### Lemma 2.

If the system (2.2) is linear, i.e.

 (2.20) ˙y=Ay+Bu,y(0)=x,

with the spectrum of A strictly contained in the left half-plane, and Ω and U are bounded, then V(i) ∈ C¹_b(Ω), {∇V(i)} is equicontinuous in Ω, and V̄ ∈ C¹(Ω).

###### Proof.

We denote by y(i)(x) the solution to (2.20) with initial condition x and control u(i) as in (2.9). In particular the controls are uniformly bounded; i.e. there exists M > 0 such that |u(i)(t)| ≤ M for all t ≥ 0 and all i ∈ ℕ.
Let us first show that the functions {V(i)} are equicontinuous on Ω. Let x1, x2 ∈ Ω, and observe that

 V(i)(x1)−V(i)(x2)=∫∞0(ℓ(y(i)(x1))−ℓ(y(i)(x2)))dt,

where ℓ(y) = yᵗQy. Since the spectrum of A lies strictly in the left half-plane, there exist C > 0 and w > 0 such that ∥e^{At}∥ ≤ Ce^{−wt} for all t ≥ 0.
For every t ≥ 0 we have

 |ℓ(y(i)(x1))−ℓ(y(i)(x2))| ≤∥Q∥∥e2At(x21−x22)∥+2∥Q∥∥eAt(x1−x2)∥∫t0∥eA(t−s)Bu(i)(s)∥ds ≤C∥Q∥∥x1−x2∥(2e−2wtdiam(Ω)+2e−wt∥B∥∫t0e−w(t−s)|u(i)(s)|ds) ≤C∥Q∥∥x1−x2∥(2e−2wtdiam(Ω)+2e−wtM∥B∥1w(1−e−wt)).

Consequently we get

 |V(i)(x1) − V(i)(x2)| ≤ C∥Q∥∥x1 − x2∥((1/w)diam(Ω) + 2M∥B∥(1/w)∫₀^∞ e^{−wt}dt) = C∥Q∥∥x1 − x2∥((1/w)diam(Ω) + 2M∥B∥(1/w²)),

from which the equicontinuity of {V(i)} in Ω follows. Since {V(i)(x)} is bounded for each x ∈ Ω and each V(i) is uniformly continuous in Ω, we have V(i) ∈ C_b(Ω). Thus, by the Arzelà-Ascoli theorem and the pointwise convergence V(i) → V̄, we have uniform convergence of V(i) to V̄ and V̄ ∈ C(Ω̄).

To verify the equicontinuity of {∇V(i)}, let us first note that the derivative of V(i) in direction h, denoted by