# Safe Approximate Dynamic Programming Via Kernelized Lipschitz Estimation

We develop a method for obtaining safe initial policies for reinforcement learning via approximate dynamic programming (ADP) techniques for uncertain systems evolving with discrete-time dynamics. We employ kernelized Lipschitz estimation and semidefinite programming for computing admissible initial control policies with provably high probability. Such admissible controllers enable safe initialization and constraint enforcement while providing exponential stability of the equilibrium of the closed-loop system.


## I Introduction

Recent advances in deep learning and machine learning have led to a renewed interest in using learning for control of physical systems [vamtutorial]. Reinforcement learning (RL) is a learning framework that handles sequential decision-making problems, wherein an ‘agent’ or decision maker learns a policy to optimize a long-term reward by interacting with the (unknown) environment. At each step, an RL agent obtains evaluative feedback (called reward or cost) about the performance of its action, allowing it to improve the performance of subsequent actions [sutton1998reinforcement, vrabie2013optimal]. While RL has witnessed huge success in recent times [silver2016mastering, silver2017mastering], several unsolved challenges restrict the use of these algorithms in industrial systems. In most practical applications, control policies must be designed to satisfy operational constraints, so one has to guarantee constraint satisfaction during learning and policy optimization. Therefore, initializing with an unverified control policy is not ‘safe’ (in terms of stability or constraint handling). In other words, using on-line RL for expensive equipment or safety-critical applications necessitates that the initial policy used for obtaining data for subsequently improved policies be at least stabilizing and, generally, constraint-enforcing. The work presented in this paper is motivated by this challenge. We present a framework for deriving initial control policies from historical data that can be verified to satisfy constraints and guarantee stability while the optimal control policy is learned on-line from operational data.

A successful RL method needs to balance a fundamental trade-off between exploration and exploitation: one needs to gather data safely (exploration) in order to best extract information from this data for optimal decision-making (exploitation). One way to address this dilemma is to use optimistic initialization [7329650, brafman2002r, thomas2015high, jha2016data], but this assumes optimality of the policy until data is obtained that proves otherwise. Such approaches have been applied to robotic systems with discrete and continuous state-action spaces [levine2016end, duan2016benchmarking]. A limitation of these methods is that, before the optimal policy is learned, the agent is quite likely to explore actions that violate task-specific constraints as it aims to optimize the cumulative reward. This shortcoming significantly limits the applicability of such methods to industrial systems, since exploratory actions could lead to irreparable hardware damage or harm to human operators due to unexpected dynamics. Consequently, safe learning focuses on learning while enforcing safety constraints. There are primarily two types of approaches to safe RL and approximate/adaptive dynamic programming (ADP): modification of the optimization criterion with a safety component, such as barrier functions, by transforming the operational constraints into soft constraints [achiam2017constrained, chow2018lyapunov]; and modification of the exploration process through the incorporation of external system knowledge or historical data [garcia2015comprehensive]. Our method belongs to the latter class, because our operational constraints are hard constraints, and softening them could lead to intermittent failure modes.

High-performance model-based control requires precise model knowledge for controller design. However, it is well known that for most applications, accurate model knowledge is practically elusive due to the presence of unmodeled dynamical interactions (e.g., friction, contacts). Recent efforts tackle this issue by learning control policies from operational (on-line) or archival (off-line) data. Since the exact structure of the nonlinearity may be unknown or not amenable to analysis, researchers have proposed ‘indirect’ data-driven controllers that employ non-parametric learning methods such as Gaussian processes to construct models from operational data [romeres2016, DBLP:journals/corr/abs-1809-04993] and improve control policies on-line [Berkenkamp2016, hewing2017]. Conversely, ‘direct’ methods, such as those proposed in [jiang2015optimal, tanaskovic2017data, piga2018direct, kiumarsi2018optimal], compute policies using a combination of archival/legacy and operational input-output data without constructing an intermediate model. For example, in [fagiano2014automatic], a human expert was introduced into the control loop to conduct initial experiments and ensure safety while generating archival data. A common assumption in many of these approaches is the availability of an initial control policy that is stabilizing and robust to unmodeled dynamics. Designing such safe initial control policies in a computationally tractable manner remains an open challenge.

In this work, we present a formalism for synthesizing safe initial policies for uncertain nonlinear systems. We assume the presence of historical/archival/legacy data, with which we estimate Lipschitz constants for the unmodeled system dynamics; the estimation of the Lipschitz constant is done via kernel density estimation (KDE). The estimated Lipschitz constant is used to design control policies via semidefinite programming that incorporate stability and constraint satisfaction into the policy search. We show that the proposed approach is able to design feasible policies for different constrained tasks on several systems while respecting all active constraints. Our key insight is that information regarding the structure of classes of unmodeled nonlinearities can be encapsulated using only a few parameters, without knowing the exact form of the nonlinearity. Therefore, it may not be necessary to model the unknown component itself in order to compute a safe control policy. For example, the class of Lipschitz nonlinearities (which constitute a large share of nonlinearities observed in applications) can be described using only a few parameters: the Lipschitz constants of the nonlinear components. Recent work has investigated the utility of Lipschitz properties in constructing controllers when an oracle is available [chakrabarty2017support] or in designing models for prediction [calliess2014conservative], with on-line data used for controller refinement [limon2017learning]. In this paper, we construct control policies that respect constraints and certify stability (with high probability) for applications where only off-line data is available and no oracle is present. We do so through the systematic use of multiplier matrices that enable the representation of nonlinear dynamics through quadratic constraints [chakrabarty2017state, xu2018observer] without requiring knowledge of the underlying nonlinearity.
The control policies can then be obtained by solving semidefinite programs. However, the construction of multiplier matrices for Lipschitz systems requires knowledge of the Lipschitz constants, which are not always available and, therefore, must be estimated. We refer to the estimation of Lipschitz constants from data as Lipschitz learning. Historically, methods that estimate the Lipschitz constant [wood1996estimation, strongin1973convergence, hansen1992using] do not provide certificates on the quality of the estimate. Herein, we provide conditions that, if satisfied, enable us to estimate the Lipschitz constant of an unknown locally Lipschitz nonlinearity with high probability. To this end, we employ kernel density estimation (KDE), a non-parametric data-driven method that employs kernels to approximate smooth probability density functions to arbitrarily high accuracy. We refer to our proposed KDE-based Lipschitz constant estimation algorithm as kernelized Lipschitz learning.

#### Contributions

Compared to the existing literature on safe learning, the contributions of the present paper are threefold. First, we formulate an algorithm to construct stabilizing and constraint-satisfying policies for nonlinear systems without knowing the exact form of the nonlinearity. Second, we leverage a kernelized Lipschitz learning mechanism to estimate Lipschitz constants of the unmodeled dynamics with high probability. Finally, we use a multiplier-matrix based controller design, built on Lipschitz learning from legacy data, that enforces exponential stability of the closed-loop dynamics (with the same probability as the kernelized Lipschitz learner).

##### Structure

The rest of the paper is structured as follows. We present the formal motivation of our work in Section II. Our kernelized Lipschitz learning algorithm is described in Section III, where the proposed learner is also evaluated on benchmark Lipschitz functions. The utility of Lipschitz learning in policy design via multiplier matrices is elucidated in Section IV, and a numerical example demonstrating the potential of our overall formalism is provided in Section V. We provide concluding remarks and discuss future directions in Section VI.

##### Notation

We denote by $\mathbb{R}$ the set of real numbers, by $\mathbb{R}_+$ the set of positive reals, and by $\mathbb{N}$ the set of natural numbers. The measure-based distance between two measurable subsets $\mathbb{A}$ and $\mathbb{B}$ of a metric space is given by $d_\mu(\mathbb{A}, \mathbb{B}) := \mu(\mathbb{A} \,\triangle\, \mathbb{B})$, where $\mu$ is a measure and $\mathbb{A} \,\triangle\, \mathbb{B} := (\mathbb{A} \setminus \mathbb{B}) \cup (\mathbb{B} \setminus \mathbb{A})$ is the symmetric difference. We define the ball $\mathcal{B}_\eta(x) := \{y : \|y - x\| \le \eta\}$ and the Minkowski sum $\mathbb{A} \oplus \mathbb{B} := \{a + b : a \in \mathbb{A},\; b \in \mathbb{B}\}$. The complement of a set $\mathbb{A}$ is denoted by $\mathbb{A}^{c}$, and the indicator function of the set $\mathbb{A}$ by $\mathbf{1}_{\mathbb{A}}$. A block diagonal matrix with blocks $A_1, \ldots, A_N$ is denoted by $\mathrm{blkdiag}(A_1, \ldots, A_N)$. For every $v \in \mathbb{R}^n$, we denote $\|v\|^2 := v^\top v$, where $v^\top$ is the transpose of $v$. The sup-norm or $\infty$-norm is defined as $\|v\|_\infty := \max_i |v_i|$. We denote by $\lambda_{\min}(M)$ and $\lambda_{\max}(M)$ the smallest and largest eigenvalues of a square, symmetric matrix $M$. The symbol $\succ$ ($\prec$) indicates positive (negative) definiteness, and $M_1 \succ M_2$ means $M_1 - M_2 \succ 0$ for $M_1$, $M_2$ of appropriate dimensions. Similarly, $\succeq$ ($\preceq$) indicates positive (negative) semi-definiteness. The operator norm is denoted $\|M\|$ and is defined as the maximum singular value of $M$. For a symmetric matrix, we use the notation $\star$ to denote symmetric terms, that is, $\begin{bmatrix} A & B \\ \star & C \end{bmatrix} = \begin{bmatrix} A & B \\ B^\top & C \end{bmatrix}$. The symbol $\mathbb{P}$ denotes the probability measure.

## II Problem Formulation

### II-A Problem statement

Consider the following discrete-time nonlinear system,

$$x_{t+1} = F(x_t, u_t), \quad t \in \mathbb{N}, \qquad q_t = C_q x_t,$$

where $x_t \in \mathbb{R}^{n_x}$ and $u_t \in \mathbb{R}^{n_u}$ denote the state and the control input of the system, respectively.

For simplicity of exposition we will write

$$x_{t+1} = A x_t + B u_t + G\,\phi(q_t), \quad t \in \mathbb{N}, \tag{1a}$$
$$q_t = C_q x_t, \tag{1b}$$

where the system matrices $A$, $B$, $G$, and $C_q$ have appropriate dimensions. Denote by $\phi$ the system’s uncertainty, or unmodeled nonlinearity, whose argument $q_t$ is represented by a linear combination of the state. The origin is an equilibrium state for (1); that is, $\phi(0) = 0$.

The following assumptions and definition are now needed. The matrices $A$ and $B$ are known. The matrix $G$ has full column rank, is sparse, and all its entries are 0 or 1. Only the locations of the non-zero elements of $G$ are known. The matrix $C_q$ and the function $\phi$ are unknown.

We require the following definition to describe the class of nonlinearities considered in this paper. A function $f$ is Lipschitz continuous in the domain $\mathcal{X}$ if

$$\|f(x_1) - f(x_2)\| \le L_f \|x_1 - x_2\| \tag{2}$$

for some $L_f \ge 0$ and all $x_1, x_2 \in \mathcal{X}$. We define the scalar

$$L_f^* = \inf\,\{L_f \in \mathbb{R}_+ : \text{condition (2) holds}\} \tag{3}$$

as the Lipschitz constant of $f$ in $\mathcal{X}$. A function is globally Lipschitz if (2) holds on the entire space. The nonlinearity $\phi$ is globally Lipschitz continuous. That is,

$$\|\phi(q_1) - \phi(q_2)\| \le L_\phi^* \|q_1 - q_2\| \tag{4}$$

for any $q_1$, $q_2$, and the global Lipschitz constant $L_\phi^*$ is unknown. The assumptions above imply that the linear component of the true system (1) can be assumed known, but the rest is unknown. However, we do know the vector space through which the nonlinearity enters the dynamics of (1), since the non-zero locations of $G$ are flagged.

Assumption 1 is mild. For instance, one could relax the assumption on $G$ and take the unknown $G$ to be the identity matrix. Then the nonlinearity would be $\tilde\phi := G\phi$, whose support is the index set of non-zero rows of $G$, so that the subsequent analysis applies to $\tilde\phi$ in place of $\phi$.

Given a control policy $u : \mathbb{R}^{n_x} \to \mathbb{R}^{n_u}$, we define an infinite-horizon cost functional given an initial state $x_0$ as

$$J(x_0, u) = \sum_{t=0}^{\infty} \gamma^t\, U\big(x_t, u(x_t)\big), \tag{5}$$

where $U$ is a function with non-negative range, $\gamma \in (0, 1)$, and $\{x_t\}_{t \in \mathbb{N}}$ denotes the sequence of states generated by the closed-loop system

$$x_{t+1} = A x_t + B u(x_t) + G\,\phi(C_q x_t). \tag{6}$$

The scalar $\gamma$ is a forgetting/discount factor that weights the cost toward the current state and control action and lends less credence to costs incurred in the distant future.
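To make the closed-loop model (6) and the discounted cost (5) concrete, the sketch below simulates a hypothetical scalar instance with a quadratic stage cost; the matrices, the tanh nonlinearity, the linear policy, and the horizon truncation are all illustrative assumptions, not quantities from the paper.

```python
import numpy as np

# Hypothetical 1-D instance of (6): x+ = A x + B u(x) + G*phi(Cq x),
# with phi = 0.5*tanh (globally Lipschitz, constant 0.5) and a linear policy u = K x.
A, B, G, Cq, K = 1.0, 1.0, 1.0, 1.0, -0.8
phi = lambda q: 0.5 * np.tanh(q)
policy = lambda x: K * x

def rollout_cost(x0, T=200, gamma=0.95):
    """Simulate (6) for T steps and return the truncated discounted cost (5)
    with the assumed quadratic stage cost U(x, u) = x^2 + u^2."""
    x, J = x0, 0.0
    for t in range(T):
        u = policy(x)
        J += gamma**t * (x**2 + u**2)
        x = A * x + B * u + G * phi(Cq * x)
    return J, x

J, xT = rollout_cost(x0=1.0)
print(J, xT)  # the terminal state has decayed toward the origin
```

For this contractive example, the truncated sum is already close to the infinite-horizon value, since both the state and the discount factor decay geometrically.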

Before formally stating our objective, we need to introduce the following standard definition [vamtutorial]. A continuous control policy $u(\cdot)$ is admissible on $\mathcal{X}$ if it stabilizes the closed-loop system (6) on $\mathcal{X}$ and $J(x_0, u)$ is finite for any $x_0 \in \mathcal{X}$. We want to design an optimal control policy that achieves the optimal cost

$$J_\infty(x_0) = \inf_{u \in \mathcal{U}} J(x_0, u), \tag{7}$$

for any $x_0$. Here, $\mathcal{U}$ denotes the set of all admissible control policies. In other words, we wish to compute an optimal control policy

$$u_\infty = \arg\inf_{u \in \mathcal{U}} J(x_0, u). \tag{8}$$

Directly constructing such an optimal controller is very challenging for general nonlinear systems; this is further complicated because the system (1) contains unmodeled/uncertain dynamics. Therefore, we shall use adaptive/approximate dynamic programming (ADP): a class of iterative, data-driven algorithms that generate a convergent sequence of control policies whose limit is provably the optimal control policy $u_\infty$.

Recall from [lewis2012reinforcement, 7823092] that a necessary condition for convergence of policy iteration methods (a sub-class of ADP) is the availability of an initial admissible control policy $u_0$, which is non-trivial to derive for systems with unmodeled dynamics. Therefore, our objective in this work is to systematically derive an initial admissible control policy using only partial model information via kernelized Lipschitz learning and semidefinite programming. We also extend this idea to handle the case when the control input is constrained. In such cases, along with an admissible controller, we also derive a domain of attraction within which the control policy is guaranteed to satisfy input constraints and the closed-loop system remains stable. We refer to the derivation of admissible control policies with guaranteed stabilizability and/or constraint enforcement as safe initialization for ADP: a crucial property required for ADP algorithms to gain traction in expensive industrial applications.

We invoke the assumption in [tanaskovic2017data, piga2018direct] regarding the availability of legacy/archival/historical data generated by the system during prior experiments. That is, at design time, we have a dataset consisting of $N$ unique triples: state-input pairs along with the corresponding state updates. Concretely, we have access to $\mathcal{D} := \{(x_j, u_j, x_j^+)\}_{j=1}^{N}$. For each $j$, we estimate the nonlinear term using (1); that is,

$$\phi(q_j) = G^{\dagger}\big(x_j^+ - A x_j - B u_j\big),$$

where the pseudo-inverse $G^{\dagger}$ exists by Assumption II-A. Note that we also need to estimate the matrix $C_q$ (see (1)) so that $q_j$ can be calculated from $x_j$. While estimating the exact elements of this matrix is quite challenging, we can estimate its non-zero elements, which is enough to design safe initial control policies, because the exact elements of $C_q$ will be subsumed within the Lipschitz constant.
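The recovery of nonlinearity samples from legacy data can be sketched as follows; the system matrices, the sinusoidal nonlinearity, and the synthetic dataset are assumed purely for illustration, and $C_q$ is treated as known here only in order to generate the data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed 2-state example: known (A, B); sparse 0/1 matrix G with full column rank.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
G = np.array([[0.0], [1.0]])          # nonlinearity enters the second state only
Cq = np.array([[1.0, 0.0]])           # unknown in the paper; fixed here to simulate data
phi = lambda q: 0.4 * np.sin(q)       # "unknown" nonlinearity, used only for simulation

# Legacy dataset D = {(x_j, u_j, x_j^+)} generated by random excitation.
X = rng.uniform(-1, 1, size=(100, 2))
U = rng.uniform(-1, 1, size=(100, 1))
Xp = X @ A.T + U @ B.T + phi(X @ Cq.T) @ G.T

# Recover samples of the nonlinearity: phi(q_j) = G^dagger (x_j^+ - A x_j - B u_j).
G_dagger = np.linalg.pinv(G)          # exists since G has full column rank
Phi_hat = (Xp - X @ A.T - U @ B.T) @ G_dagger.T

print(np.max(np.abs(Phi_hat - phi(X @ Cq.T))))  # ~0: recovery is exact here
```

In this noise-free sketch the recovery is exact up to floating-point error; with measurement noise, the recovered samples would carry the corresponding perturbations into the Lipschitz estimates.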

The problem of estimating the sparsity pattern of $C_q$ is analogous to the problem of feature selection and sparse learning known as automatic relevance determination (ARD) [tipping2001sparse]. The basic idea in ARD is to give feature weights parametric prior densities; these densities are subsequently refined by maximizing the likelihood of the data [tipping2001sparse, chu2005preference]. For example, one can define hyperparameters which explicitly represent the relevance of different inputs to a machine learning algorithm with respect to the desired output (e.g., in a regression problem). These relevance hyperparameters determine the range of variation of the parameters relating to a particular input. ARD can then determine these hyperparameters during learning to discover which inputs are relevant.

We need the following assumption on the data $\mathcal{D}$, without which one cannot attain the global Lipschitz constant of the nonlinearity with high accuracy. Let $\mathcal{Q}$ denote the convex hull of the samples $\{q_j\}_{j=1}^{N}$. The Lipschitz constant of $\phi$ in the domain $\mathcal{Q}$ is identical to the global Lipschitz constant $L_\phi^*$. Assumption II-A ensures that the samples obtained from the archival data are contained in a subregion of the domain where the nonlinearity $\phi$’s local Lipschitz constant is the same as its global Lipschitz constant.

Suppose, for instance, that the slope of $\phi$ is largest near the origin. As long as the convex hull of the samples contains zero, the Lipschitz constant of $\phi$ on the convex hull and on the whole domain are identical.

In the following section, we will leverage the dataset $\mathcal{D}$ to estimate the Lipschitz constant of $\phi$ using kernelized Lipschitz learning/estimation, and consequently design an initial admissible linear control policy via semidefinite programs. We will demonstrate how such an initial admissible linear control policy fits into a neural-network based ADP formulation (such as policy iteration) to asymptotically generate the optimal control policy $u_\infty$. The control algorithm proposed in this paper is a direct data-driven controller, because no model of $\phi$ is identified in the controller design step. Although we focus only on discrete-time systems, our results hold for continuous-time systems with slight modifications. If $n_q > 1$, our proposed Lipschitz learning algorithm will yield multiple Lipschitz constant estimates, one for each dimension of $q$. To avoid notational complications, we proceed (without loss of generality) with $n_q = 1$. For larger $n_q$, our algorithm can be used component-wise.

## III Kernelized Lipschitz Learning

In this section, we provide a brief overview of kernel density estimation (KDE) and provide a methodology for estimating Lipschitz constants from data.

### III-A Empirical density of Lipschitz estimates

With the data $\mathcal{D}$, we obtain underestimates $\ell_{jk}$ of the global Lipschitz constant using

$$\ell_{jk} = \frac{|\phi(q_j) - \phi(q_k)|}{\|q_j - q_k\|}, \tag{9}$$

where $j \ne k$. The values $\ell_{jk}$ are empirical samples drawn from an underlying univariate distribution $L$. Clearly, the true distribution has finite support; its left-hand endpoint is a non-negative scalar (zero if the slope of $\phi$ vanishes somewhere on $\mathcal{Q}$, but possibly positive otherwise), and its right-hand endpoint is $L_\phi^*$. This leads us to the key idea of our approach, which is to identify the support of the distribution in order to yield an estimate of the true Lipschitz constant of $\phi$.

Variants of the estimator (9) have been widely used in the literature to construct algorithms for determining Lipschitz constants; see, for example, [wood1996estimation, strongin1973convergence, calliess2017lipschitz].

In the literature, common methods of tackling the support estimation problem assume prior knowledge about the exact density of the Lipschitz estimates [calliess2017lipschitz] or use Strongin overestimates of the Lipschitz constant [strongin1973convergence]. We avoid such overestimators because they are provably unreliable, even for globally Lipschitz functions [hansen1992using, Theorem 3.1]. Instead, we fit the density directly from the local estimates and the data in a non-parametric manner using KDE, and exploit characteristics of the estimated density.
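The underestimates (9) can be computed directly from samples of $q$ and $\phi(q)$. The sketch below uses $\tanh$ (whose Lipschitz constant is 1) as a stand-in nonlinearity to illustrate that the pairwise quotients approach, but never exceed, the true constant; the function and sampling range are illustrative assumptions.

```python
import numpy as np

def lipschitz_underestimates(q, phi_vals):
    """Pairwise slope estimates (9): ell_jk = |phi(q_j) - phi(q_k)| / |q_j - q_k|.
    Each ell_jk underestimates the true Lipschitz constant."""
    q = np.asarray(q, dtype=float)
    phi_vals = np.asarray(phi_vals, dtype=float)
    ells = []
    for j in range(len(q)):
        for k in range(j + 1, len(q)):
            dq = abs(q[j] - q[k])
            if dq > 1e-12:                      # skip (near-)duplicate samples
                ells.append(abs(phi_vals[j] - phi_vals[k]) / dq)
    return np.array(ells)

# Illustrative check on phi(q) = tanh(q), whose Lipschitz constant is 1.
rng = np.random.default_rng(1)
q = rng.uniform(-2, 2, size=200)
ells = lipschitz_underestimates(q, np.tanh(q))
print(ells.max())  # close to, and never above, 1
```

By the mean value theorem each quotient equals the slope of $\tanh$ at some intermediate point, so the largest quotient approaches 1 as sample pairs cluster near the origin.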

### Iii-B Plug-in support estimation

With a set of underestimates $\{\ell_r\}_{r=1}^{n}$, we generate an estimate of the true density using the kernel density estimator

$$\hat{L}_n(\ell) = \frac{1}{n h_n} \sum_{r=1}^{n} K\!\left(\frac{\ell - \ell_r}{h_n}\right), \tag{10}$$

where $K$ is a smooth function called the kernel function and $h_n > 0$ is the kernel bandwidth. A plug-in estimate of the support of the true density is

$$\hat{S}_n := \{\ell \in \mathbb{R}_{\ge 0} : \hat{L}_n(\ell) \ge \beta_n\}, \tag{11}$$

where $\beta_n$ is an element of a sequence that converges to zero as $n \to \infty$; this plug-in estimator was proposed in [cuevas1997plug].

### Iii-C Implementation details

Implementing the plug-in estimator involves first constructing the KDE (10) from the $n$ samples. Then, if one picks $\beta_n$ small enough, one can easily compute $\hat{S}_n$ from (11). Then

$$\hat{L}_\phi := \max(\hat{S}_n). \tag{12}$$

This is a straightforward operation with the availability of tools like ksdensity (MATLAB) and KernelDensity in scikit-learn (Python). The pseudocode is detailed in Algorithm 1.

Note that the true support is a subset of $\mathbb{R}_{\ge 0}$. Therefore, when computing the density estimate, this information should be passed to the tool being used; for example, MATLAB provides the option 'support', 'positive'. Essentially, this subroutine transforms the data to the log scale and estimates the log-density so that, upon returning to the linear scale, positivity is preserved.
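A minimal numpy-only version of the plug-in estimator (10)-(12) is sketched below; the Gaussian kernel, Silverman bandwidth rule, grid resolution, and threshold $\beta$ are assumed implementation choices rather than prescriptions from the paper.

```python
import numpy as np

def kde_support_max(ells, h=None, beta=1e-3, grid_size=500):
    """Plug-in Lipschitz estimate (10)-(12): build a Gaussian KDE of the slope
    samples, threshold it at beta to get the support estimate (11), and return
    the maximum of that set, as in (12). Bandwidth defaults to Silverman's rule."""
    ells = np.asarray(ells, dtype=float)
    n = len(ells)
    if h is None:
        h = 1.06 * ells.std() * n ** (-1 / 5)   # Silverman's rule of thumb
    grid = np.linspace(0.0, ells.max() + 3 * h, grid_size)
    # Kernel density estimate (10) with a Gaussian kernel K.
    dens = np.exp(-0.5 * ((grid[:, None] - ells[None, :]) / h) ** 2)
    dens = dens.sum(axis=1) / (n * h * np.sqrt(2 * np.pi))
    support = grid[dens >= beta]                # plug-in support estimate (11)
    return support.max()                        # (12)

# Slope samples (9) for phi = tanh (true Lipschitz constant 1).
rng = np.random.default_rng(2)
q = rng.uniform(-2, 2, size=100)
j, k = np.triu_indices(100, k=1)
ells = np.abs(np.tanh(q[j]) - np.tanh(q[k])) / np.abs(q[j] - q[k])
print(kde_support_max(ells))  # near the true constant 1
```

Because the Gaussian kernel smears mass slightly beyond the largest sample, the returned value tends to sit at or a little above the largest observed slope, which is the conservative direction for the safety guarantees discussed later.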

### Iii-D Theoretical guarantees

We now formally describe the density $L$. We consider that the samples $q_j$ are drawn according to some probability distribution $\mu$ with support $\mathcal{Q}$. For any measurable set $\mathbb{A} \subseteq \mathcal{Q}$, suppose that $\mu$ can be written as $\mu(\mathbb{A}) = \int_{\mathbb{A}} p \, d\,\mathrm{Leb}$, where $\mathrm{Leb}$ is the Lebesgue measure and $p$ is continuous and positive on $\mathcal{Q}$. Let $\mu_X$ denote the product measure $\mu \times \mu$ on $\mathcal{Q} \times \mathcal{Q}$. Since $\mu$ is absolutely continuous with respect to the Lebesgue measure, $\mu_X$ assigns zero mass to the diagonal $\{(q_1, q_2) : q_1 = q_2\}$. The cumulative distribution function for $L$ is then given by

$$\tilde{L}(\lambda) = \mu_X\!\left(\left\{(q_1, q_2) : q_1 \ne q_2,\; \frac{|\phi(q_1) - \phi(q_2)|}{\|q_1 - q_2\|} \le \lambda\right\}\right).$$

Since $\tilde{L}$ is non-decreasing, $\tilde{L}'$ exists almost everywhere by Lebesgue's theorem for the differentiability of monotone functions, and the support of $L$ is contained within $[0, L_\phi^*]$ because of (9).

We investigate the worst-case sample complexity involved in overestimating $L_\phi^*$ under the following mild assumption: the nonlinearity is twice continuously differentiable, that is, $\phi \in \mathcal{C}^2$.

Suppose that Assumptions II-A and III-D hold. Then there exists some $q^\star \in \mathcal{Q}$ such that $\|\nabla \phi(q^\star)\| = L_\phi^*$.

###### Proof.

Suppose $(q_1^i, q_2^i)$ denotes a sequence of paired samples in $\mathcal{Q} \times \mathcal{Q}$ such that $\ell(q_1^i, q_2^i) \to L_\phi^*$ as $i \to \infty$. Since $\mathcal{Q}$ is the convex hull of finitely many samples, it is compact, so we can choose a subsequence of $(q_1^i, q_2^i)$ that converges to $(q_1^\infty, q_2^\infty)$, where both limits are in $\mathcal{Q}$. If $q_1^\infty = q_2^\infty$, then a Taylor expansion estimate implies $\|\nabla \phi(q_1^\infty)\| = L_\phi^*$; since $L_\phi^*$ is an upper bound on $\|\nabla \phi\|$ at any point of $\mathcal{Q}$, we may take $q^\star = q_1^\infty$. If $q_1^\infty \ne q_2^\infty$, then the result follows by applying the mean value theorem to

$$\varphi(t) = \phi\big(q_1^\infty + t(q_2^\infty - q_1^\infty)\big) - \phi(q_1^\infty)$$

for $t \in [0, 1]$, for which $\varphi(0) = 0$ and $|d\varphi/dt| \le L_\phi^* \|q_2^\infty - q_1^\infty\|$. Also, $\ell(q_1^\infty, q_2^\infty) = L_\phi^*$. Reordering $q_1^\infty$ and $q_2^\infty$ if needed, we have

$$L_\phi^* \|q_2^\infty - q_1^\infty\| = \varphi(1) = \int_0^1 (d\varphi/dt)\, dt \le L_\phi^* \|q_2^\infty - q_1^\infty\|.$$

Hence, the rightmost inequality must be an equality, which implies that $d\varphi/dt = L_\phi^* \|q_2^\infty - q_1^\infty\|$ for all $t \in [0, 1]$. That is, if the Lipschitz constant is attained with $q_1^\infty \ne q_2^\infty$, then $\phi$ restricted to the segment connecting $q_1^\infty$ and $q_2^\infty$ is linear with slope $L_\phi^*$, and $\|\nabla\phi\| = L_\phi^*$ along this segment. This concludes the proof.∎

Lemma III-D enables the worst-case complexity result described in the following theorem. Let $\epsilon \in (0, 1)$, and suppose that Assumptions II-A and III-D hold. There exists a constant $C_0 > 0$ such that, for all sufficiently small $\delta > 0$ and $\eta > 0$ and any set of $n$ uniform random samples in $\mathcal{Q}$, the probability that some pair $(q_+, q_-)$ gives the Lipschitz estimate

$$\phi'(q_+, q_-) \ge (1 - \delta) L_\phi^* - C_0 \delta$$

is at least $1 - \epsilon$. Here $C_0$ depends on $C_R$, where $C_R$ is a constant depending on the second derivative of $\phi$.

###### Proof.

By Lemma III-D, there exists at least one $q^\star \in \mathcal{Q}$ such that $\|\nabla \phi(q^\star)\| = L_\phi^*$. For the worst-case analysis, suppose this occurs only at the single point $q^\star$. A Taylor expansion at $q^\star$ yields

$$\phi(q^\star + q) = \phi(q^\star) + \nabla \phi(q^\star)^\top q + R(q), \tag{13}$$

where $R(q)$ is a remainder term with $|R(q)| \le C_R \|q\|^2$ when $\|q\| \le \eta$, for some $C_R > 0$ and $\eta > 0$. Note that

$$\nabla \phi(q^\star)^\top q = \|\nabla \phi(q^\star)\| \, \|q\| \cos\theta,$$

where $\theta$ is the angle between $\nabla \phi(q^\star)$ and $q$. To obtain a good estimate of $L_\phi^*$, one needs to sample two points in this neighborhood: one point with $\theta \le \chi$ and a second point with $\theta \ge \pi - \chi$, for a small $\chi > 0$. Each of these conditions defines a cone. Regarding one of these cones in cylindrical coordinates with radial variable $r \le \eta$, we can integrate the $(n_q - 1)$-dimensional cross-sectional volume to get the volume of this cone as $\kappa \chi^{n_q - 1} \eta^{n_q} / n_q$ for some dimension-dependent constant $\kappa$. A calculation shows that $\cos\theta \ge 1 - \chi^2/2$ for small $\chi$. With $q_+$ and $q_-$ as one sample from each cone, the difference $q_+ - q_-$ is contained in the cone with half-angle $\chi$, and so we can use (13) to approximate

$$\big|\phi(q^\star + q_+) - \phi(q^\star + q_-)\big| = \big|\nabla \phi(q^\star)^\top (q_+ - q_-) + R(q_+) - R(q_-)\big| \ge \|\nabla \phi(q^\star)\| \, \|q_+ - q_-\| \left(1 - \frac{\chi^2}{2}\right) - \big|R(q_+) - R(q_-)\big|.$$

Dividing by $\|q_+ - q_-\|$ and using the defining property of the cones gives

$$\phi'(q_+, q_-) \ge \left(1 - \frac{\chi^2}{2}\right) L_\phi^* - \frac{|R(q_+) - R(q_-)|}{\|q_+ - q_-\|}. \tag{14}$$

Subsequently, one can decompose $q_\pm = q_\pm^{\parallel} + q_\pm^{\perp}$, with $q_\pm^{\parallel}$ parallel to $\nabla \phi(q^\star)$ and $q_\pm^{\perp}$ perpendicular to it and satisfying $\|q_\pm^{\perp}\| \le \chi \|q_\pm^{\parallel}\|$. Identical arguments can be used for both remainder terms. Since $q_+$ and $q_-$ are chosen from opposite cones, $\|q_+ - q_-\|$ is bounded below in terms of $\|q_+\| + \|q_-\|$. Using $|R(q_\pm)| \le C_R \|q_\pm\|^2$ and $\|q_\pm\| \le \eta$, we have

$$\frac{|R(q_+) - R(q_-)|}{\|q_+ - q_-\|} \le \frac{C_R \big(\|q_+\|^2 + \|q_-\|^2\big)}{\big(\|q_+\| + \|q_-\|\big)(1 - \chi)(1 - \chi^2/2)} \le \frac{2\, C_R \max\{\|q_+\|, \|q_-\|\}^2}{\max\{\|q_+\|, \|q_-\|\}\,(1 - \chi)(1 - \chi^2/2)} \le \frac{2\, C_R\, \eta\, (1 + \chi^2)^{1/2}}{(1 - \chi)(1 - \chi^2/2)}.$$

By combining the aforementioned inequality with (14), one obtains

$$\phi'(q_+, q_-) \ge (1 - \chi^2)\, L_\phi^* - \frac{2\, C_R\, \eta\, (1 + \chi^2)^{1/2}}{(1 - \chi)(1 - \chi^2/2)}.$$

Set $\chi = \delta$ and take $\eta = \delta$. Then there exists $C_0 > 0$ such that, for all sufficiently small $\delta$,

$$\phi'(q_+, q_-) \ge (1 - \delta)\, L_\phi^* - C_0 \delta, \tag{15}$$

which implies that $\phi'(q_+, q_-) \to L_\phi^*$ as $\delta \to 0$.

Since the samples are drawn uniformly, the probability of sampling in one of the cones is

$$\int_0^\eta \kappa (\chi r)^{n_q - 1}\, dr = \frac{\kappa \chi^{n_q - 1} \eta^{n_q}}{n_q}$$

for some $\kappa > 0$ that depends on the dimension $n_q$. Using $\chi = \eta = \delta$ and absorbing the factors of $2$ and $1/n_q$ into $\kappa$ yields the per-cone sampling probability $\kappa \delta^{2 n_q - 1}$.

Let $X_1^{\pm}$ be the event of sampling at least one point in the corresponding cone, and let $X_0^{\pm}$ be the event of sampling no point in that cone. The probability of sampling at least one of each of the points $q_+$ and $q_-$ described above is

$$1 - P(X_0^+ \cap X_0^-) - P(X_0^+ \cap X_1^-) - P(X_1^+ \cap X_0^-) \ge 1 - \big(1 - 2\kappa\delta^{2 n_q - 1}\big)^{n} - 2\big(1 - \kappa\delta^{2 n_q - 1}\big)^{n},$$

where the factor of $2$ multiplying $\kappa\delta^{2 n_q - 1}$ in the second term comes from the fact that both cones are excluded and they are disjoint, and the $2$ before the third term comes from combining the final two terms in the first expression.

By using the fact that the inequality $(1 - x)^n \le \exp(-nx)$ holds for any $x \in [0, 1]$ and $n \in \mathbb{N}$, we can conclude that

$$1 - P(X_0^+ \cap X_0^-) - P(X_0^+ \cap X_1^-) - P(X_1^+ \cap X_0^-) \ge 1 - \exp\!\big(-2 n \kappa \delta^{2 n_q - 1}\big) - 2\exp\!\big(-n \kappa \delta^{2 n_q - 1}\big) \ge 1 - \epsilon,$$

for any given $\epsilon \in (0, 1)$; the latter inequality can be ensured by choosing $n$ large enough. This gives a lower bound on the probability of obtaining (15) and hence the desired result. ∎

### Iii-E Benchmarking the Lipschitz estimator

Our Lipschitz estimator is tested on well-studied benchmark examples from [wood1996estimation, calliess2014conservative]; the benchmark functions are described in Table I along with their domains and true local Lipschitz constants. Note that not all of the functions are globally Lipschitz or differentiable everywhere, and one is specifically constructed to ensure that naive overestimation of the Lipschitz constant using Strongin methods provably fails [hansen1992using]. To evaluate the proposed Lipschitz estimator, we vary the number of data points $n$ and the confidence parameter $\delta$. Over 100 runs, we report the mean $\pm$ one standard deviation of the following quantities: the time required by our learning algorithm, the estimated Lipschitz constant $\hat{L}_\phi$, and the error $\hat{L}_\phi - L_\phi^*$ (which is positive when we overestimate $L_\phi^*$).

The final column of Table I reveals an important empirical detail: all our estimates of the Lipschitz constant are overestimates. This is a critical advantage of our proposed approach, because an overestimate enables us to provide stability and constraint-satisfaction guarantees for the data-driven controller, as we discuss in subsequent sections. Furthermore, the estimation error is small for every test run and, as expected, the error increases as $\delta$ decreases, because a smaller value of $\delta$ indicates the need for greater confidence, which results in more conservative estimates.

## IV Safe Initialization in ADP

In this section, we begin by reviewing a general ADP procedure, and then explain how to safely initialize unconstrained, as well as input-constrained ADP.

Recall the optimal value function (7) and the optimal control policy (8). From the Bellman optimality principle, we know that the discrete-time Hamilton-Jacobi-Bellman equations are given by

$$J_\infty(x_t) = \inf_{u \in \mathcal{U}} \big( U(x_t, u(x_t)) + \gamma J_\infty(x_{t+1}) \big), \tag{16}$$
$$u_\infty(x_t) = \arg\inf_{u \in \mathcal{U}} \big( U(x_t, u(x_t)) + \gamma J_\infty(x_{t+1}) \big), \tag{17}$$

where $J_\infty$ is the optimal value function and $u_\infty$ is the optimal control policy. The key operations in ADP methods [lewis2012reinforcement] involve setting an initial admissible control policy $u_0$ and then iterating the policy evaluation step

$$J_{k+1}(x_t) = U(x_t, u_k(x_t)) + \gamma J_{k+1}(x_{t+1}) \tag{18a}$$

and the policy improvement step

$$u_{k+1}(x_t) = \arg\min_{u(\cdot)} \big( U(x_t, u(x_t)) + \gamma J_{k+1}(x_{t+1}) \big) \tag{18b}$$

until convergence.
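For intuition, the iteration (18a)-(18b) can be specialized to a linear system with quadratic stage cost, where the value function is quadratic ($J_k(x) = x^\top P_k x$), policy evaluation reduces to a fixed-point equation in $P_k$, and policy improvement has a closed form; the system, cost weights, and initial stabilizing gain below are assumed for illustration (with $\phi = 0$, i.e., no unmodeled nonlinearity).

```python
import numpy as np

# Assumed toy instance: discounted LQR, stage cost U(x, u) = x'Qx + u'Ru.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R, gamma = np.eye(2), np.eye(1), 0.95

K = np.array([[-10.0, -15.0]])        # initial admissible (stabilizing) policy

for _ in range(200):
    Acl = A + B @ K
    # Policy evaluation (18a): solve P = Q + K'RK + gamma * Acl' P Acl
    # by fixed-point iteration (converges since the discounted loop is stable).
    P = np.zeros((2, 2))
    for _ in range(500):
        P = Q + K.T @ R @ K + gamma * Acl.T @ P @ Acl
    # Policy improvement (18b): minimize one-step cost plus discounted cost-to-go.
    K = -gamma * np.linalg.solve(R + gamma * B.T @ P @ B, B.T @ P @ A)

print(K)  # converged gain; sqrt(gamma)*(A + BK) has spectral radius < 1
```

The necessity of an admissible $u_0$ is visible here: if the initial gain does not stabilize the discounted system, the evaluation fixed point diverges and the iteration produces no meaningful value function.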

#### IV-A1 Semidefinite programming for safe initial control policy

Recall the following definition. The equilibrium point $x = 0$ of the closed-loop system (6) is globally exponentially stable with decay rate $\alpha \in (0, 1)$ if there exist scalars $\theta > 0$ and $\alpha \in (0, 1)$ such that $\|x_t\| \le \theta\, \alpha^{t} \|x_0\|$ for any $t \in \mathbb{N}$.

Conditions for global exponential stability (GES) of the equilibrium state, adapted from [khalil2015nonlinear], are provided next. Let $V$ be a continuously differentiable function such that

$$\gamma_1 \|x\|^2 \le V(t, x_t) \le \gamma_2 \|x\|^2, \tag{19a}$$
$$V(t+1, x_{t+1}) - V(t, x_t) \le -(1 - \alpha^2)\, V(t, x_t), \tag{19b}$$

for any $t \in \mathbb{N}$ and along the trajectories of the system

$$x^+ = \varphi(x), \tag{20}$$

where $\gamma_1$, $\gamma_2$, and $\alpha \in (0, 1)$ are positive scalars, and $\varphi$ is a nonlinear function. Then the equilibrium state $x = 0$ of the system (20) is GES with decay rate $\alpha$.

The following design theorem provides a method to construct an initial linear stabilizing policy $u_0(x) = K_0 x$ such that the origin is a GES equilibrium state of the closed-loop system (6). Suppose that Assumptions 1–2 hold, and that there exist matrices $P \succ 0$, $K_0$, and scalars $\nu > 0$, $\alpha \in (0, 1)$ such that

$$\Psi + \Gamma^\top M \Gamma \preceq 0, \tag{21}$$

where

$$\Psi = \begin{bmatrix} (A + BK_0)^\top P (A + BK_0) - \alpha^2 P & \star \\ G^\top P (A + BK_0) & G^\top P G \end{bmatrix},$$
$$\Gamma = \begin{bmatrix} C_q & 0 \\ 0 & I \end{bmatrix}, \quad \text{and} \quad M = \begin{bmatrix} \nu (L_\phi^*)^2 I & 0 \\ 0 & -\nu I \end{bmatrix}.$$

Then the equilibrium $x = 0$ of the closed-loop system (6) is GES with decay rate $\alpha$.

###### Proof.

Let $V = x^\top P x$. Then (19a) in Lemma IV-A1 is satisfied with $\gamma_1 = \lambda_{\min}(P)$ and $\gamma_2 = \lambda_{\max}(P)$. Let $\phi := \phi(C_q x)$. Note that

$$V^+ = (x^+)^\top P x^+ = \big((A + BK_0)x + G\phi\big)^\top P \big((A + BK_0)x + G\phi\big) = x^\top (A + BK_0)^\top P (A + BK_0)\, x + 2 x^\top (A + BK_0)^\top P G \phi + \phi^\top G^\top P G\, \phi.$$

Therefore,

$$\begin{bmatrix} x \\ \phi \end{bmatrix}^{\!\top} \Psi \begin{bmatrix} x \\ \phi \end{bmatrix} = x^\top (A + BK_0)^\top P (A + BK_0)\, x - \alpha^2 x^\top P x + 2 x^\top (A + BK_0)^\top P G \phi + \phi^\top G^\top P G\, \phi = V^+ - \alpha^2 V = \Delta V + (1 - \alpha^2) V,$$

and

$$\begin{bmatrix} x \\ \phi \end{bmatrix}^{\!\top} \Gamma^\top M \Gamma \begin{bmatrix} x \\ \phi \end{bmatrix} = \begin{bmatrix} q \\ \phi \end{bmatrix}^{\!\top} M \begin{bmatrix} q \\ \phi \end{bmatrix} = \nu \big( (L_\phi^*)^2 q^\top q - \phi^\top \phi \big).$$

Thus, pre- and post-multiplying (21) with $\begin{bmatrix} x^\top & \phi^\top \end{bmatrix}$ and its transpose, respectively, we get

$$\Delta V + (1 - \alpha^2) V + \nu \big( (L_\phi^*)^2 q^\top q - \phi^\top \phi \big) \le 0.$$

By inequality (4) in Assumption II-A and recalling that $q = C_q x$, we get $(L_\phi^*)^2 q^\top q - \phi^\top \phi \ge 0$. Since $\nu > 0$, this implies $\Delta V \le -(1 - \alpha^2) V$, which is identical to (19b).∎

Note that we do not need to know $\phi$ exactly to satisfy condition (21). Instead, Theorem IV-A1 provides conditions that leverage matrix multipliers similar to those described in [chakrabarty2017state].
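For fixed candidate $P$, $K_0$, $\nu$, and $\alpha$, condition (21) can be checked by a single eigenvalue computation; the numeric instance below is an assumed example, with the multiplier scaled as $M = \mathrm{blkdiag}(\nu (L_\phi^*)^2 I, -\nu I)$ (since $\nu > 0$ is a free scalar, this checks the same family of conditions). Searching over the decision variables themselves, as the paper proposes, requires a semidefinite-programming solver.

```python
import numpy as np

# Assumed 2-state instance of Theorem IV-A1 (illustrative, not from the paper):
A  = np.array([[0.5, 0.1], [0.3, 0.4]])
B  = np.array([[0.0], [1.0]])
G  = np.array([[0.0], [1.0]])
Cq = np.array([[1.0, 0.0]])
L_phi = 0.3                      # (over)estimate of the Lipschitz constant
K0 = np.array([[-0.3, 0.0]])     # candidate linear policy u = K0 x
P  = np.eye(2)                   # candidate Lyapunov matrix
nu, alpha2 = 2.0, 0.9            # multiplier scalar nu and decay rate alpha^2

Acl = A + B @ K0
# Psi, Gamma, and M as in (21), with M = blkdiag(nu*L^2*I, -nu*I).
Psi = np.block([
    [Acl.T @ P @ Acl - alpha2 * P, Acl.T @ P @ G],
    [G.T @ P @ Acl,                G.T @ P @ G ]])
Gamma = np.block([
    [Cq,               np.zeros((1, 1))],
    [np.zeros((1, 2)), np.eye(1)       ]])
M = np.block([
    [nu * L_phi**2 * np.eye(1), np.zeros((1, 1))],
    [np.zeros((1, 1)),          -nu * np.eye(1) ]])

lmi = Psi + Gamma.T @ M @ Gamma
print(np.max(np.linalg.eigvalsh(lmi)))  # < 0: condition (21) holds for this
                                        # instance, certifying GES of the origin
```

Note that (21) is bilinear in $(P, K_0)$ jointly; for a fixed gain $K_0$, however, it is linear in $P$ and $\nu$, which is what makes the LMI-based synthesis described next tractable.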

We shall now provide LMI-based conditions for computing the initial control policy, the initial domain of attraction, and the Lyapunov matrix $P$ via convex programming.