I Introduction
Recent advances in deep learning and machine learning have led to renewed interest in using learning for the control of physical systems
[vamtutorial]. Reinforcement learning (RL) is a learning framework for sequential decision-making problems, wherein an ‘agent’ or decision maker learns a policy to optimize a long-term reward by interacting with the (unknown) environment. At each step, an RL agent obtains evaluative feedback (called reward or cost) about the performance of its action, allowing it to improve the performance of subsequent actions [sutton1998reinforcement, vrabie2013optimal]. While RL has witnessed huge success in recent times [silver2016mastering, silver2017mastering], several unsolved challenges restrict the use of these algorithms in industrial systems. In most practical applications, control policies must be designed to satisfy operational constraints, so one must guarantee constraint satisfaction during learning and policy optimization. Initializing with an unverified control policy is therefore not ‘safe’ (in terms of stability or constraint handling). In other words, using online RL for expensive equipment or safety-critical applications necessitates that the initial policy used for obtaining data for subsequently improved policies be at least stabilizing and, generally, constraint-enforcing. The work presented in this paper is motivated by this challenge. We present a framework for deriving initial control policies from historical data that can be verified to satisfy constraints and guarantee stability while learning the optimal control policy online, from operational data.

A successful RL method needs to balance a fundamental tradeoff between exploration and exploitation: one needs to gather data safely (exploration) in order to best extract information from this data for optimal decision-making (exploitation).
One way to resolve the exploration-exploitation dilemma is optimistic initialization [7329650, brafman2002r, thomas2015high, jha2016data], which assumes a policy is optimal until data is obtained that proves otherwise. Such approaches have been applied to robotics applications involving systems with discrete and continuous state-action spaces [levine2016end, duan2016benchmarking]. A limitation of these methods is that, before the optimal policy is learned, the agent is quite likely to explore actions that violate the task-specific constraints as it aims to optimize the cumulative reward for the task. This shortcoming significantly limits the applicability of such methods to industrial settings, since unexpected dynamics could lead to irreparable hardware damage or harm to human operators. Consequently, safe learning focuses on learning while enforcing safety constraints. There are primarily two types of approaches to safe RL and approximate/adaptive dynamic programming (ADP): modification of the optimization criterion with a safety component, such as barrier functions, by transforming the operational constraints into soft constraints [achiam2017constrained, chow2018lyapunov]; and modification of the exploration process through the incorporation of external system knowledge or historical data [garcia2015comprehensive]. Our method belongs to the latter class of methods, because our operational constraints are hard constraints and softening them could lead to intermittent failure modes.
High-performance model-based control requires precise model knowledge for controller design. However, it is well known that for most applications, accurate model knowledge is practically elusive due to the presence of unmodeled dynamical interactions (e.g., friction, contacts, etc.). Recent efforts tackle this issue by learning control policies from operational (online) or archival (offline) data. Since the exact structure of the nonlinearity may be unknown or not amenable to analysis, researchers have proposed ‘indirect’ data-driven controllers that employ nonparametric learning methods such as Gaussian processes to construct models from operational data [romeres2016, DBLP:journals/corr/abs180904993] and to improve control policies online [Berkenkamp2016, hewing2017]. Conversely, ‘direct’ methods, such as those proposed in [jiang2015optimal, tanaskovic2017data, piga2018direct, kiumarsi2018optimal], directly compute policies using a combination of archival/legacy and operational input-output data without constructing an intermediate model. For example, in [fagiano2014automatic], a human expert was introduced into the control loop to conduct initial experiments to ensure safety while generating archival data. A common assumption in many of these approaches is the availability of an initial control policy that is stabilizing and robust to unmodeled dynamics. Designing such safe initial control policies in a computationally tractable manner remains an open challenge.
In this work, we present a formalism for synthesizing safe initial policies for uncertain nonlinear systems. We assume the presence of historical/archival/legacy data, with which we estimate Lipschitz constants for the unmodeled system dynamics. The estimation of the Lipschitz constant is done via kernel density estimation (KDE). The estimated Lipschitz constant is used to design control policies via semidefinite programming that incorporate stability and constraint satisfaction into the policy search. We show that the proposed approach is able to design feasible policies for different constrained tasks on several systems while respecting all active constraints. Our key insight is that information regarding the structure of classes of unmodeled nonlinearities can be encapsulated using only a few parameters, without knowing the exact form of the nonlinearity. Therefore, it may not be necessary to model the unknown component itself in order to compute a safe control policy. For example, the class of Lipschitz nonlinearities (which constitute a large share of nonlinearities observed in applications) can be described using only a few parameters: the Lipschitz constants of the nonlinear components. Recent work has investigated the utility of Lipschitz properties in constructing controllers when an oracle is available [chakrabarty2017support] or in designing models for prediction [calliess2014conservative] with online data used for controller refinement [limon2017learning]. In this paper, we construct control policies that respect constraints and certify stability (with high probability) for applications where only offline data is available and no oracle is present. We do so through the systematic use of multiplier matrices, which enable the representation of nonlinear dynamics through quadratic constraints [chakrabarty2017state, xu2018observer] without requiring knowledge of the underlying nonlinearity.
The control policies can then be obtained by solving semidefinite programs. However, the construction of multiplier matrices for Lipschitz systems requires knowledge of the Lipschitz constants, which are not always available and must therefore be estimated. We refer to the estimation of Lipschitz constants from data as Lipschitz learning. Historically, methods that estimate the Lipschitz constant [wood1996estimation, strongin1973convergence, hansen1992using] do not provide certificates on the quality of the estimate. Herein, we provide conditions that, if satisfied, enable us to estimate the Lipschitz constant of an unknown locally Lipschitz nonlinearity with high probability. To this end, we employ kernel density estimation (KDE), a nonparametric data-driven method that uses kernels to approximate smooth probability density functions to arbitrarily high accuracy. We refer to our proposed KDE-based Lipschitz constant estimation algorithm as kernelized Lipschitz learning.

Contributions
Compared to the existing literature on safe learning, the contributions of the present paper are threefold. First, we formulate an algorithm to construct stabilizing, constraint-satisfying policies for nonlinear systems without knowing the exact form of the nonlinearity. Second, we leverage a kernelized Lipschitz learning mechanism to estimate Lipschitz constants of the unmodeled dynamics with high probability. Finally, we use a multiplier-matrix-based controller design, built on Lipschitz learning from legacy data, that enforces exponential stability of the closed-loop dynamics (with the same probability as the kernelized Lipschitz learner).
Structure
The rest of the paper is structured as follows. We present the formal motivation of our work in Section II. Our kernelized Lipschitz learning algorithm is described in Section III, where the proposed learner is also evaluated on benchmark Lipschitz functions. The utility of Lipschitz learning in policy design via multiplier matrices is elucidated in Section IV, and a numerical example demonstrating the potential of our overall formalism is provided in Section V. We provide concluding remarks and discuss future directions in Section VI.
Notation
We denote by $\mathbb{R}$ the set of real numbers, $\mathbb{R}_+$ the set of positive reals, and $\mathbb{N}$ the set of natural numbers. The measure-based distance between two measurable subsets $A$ and $B$ of a metric space equipped with the metric $d$ is given by $d_\mu(A,B) = \mu(A \,\triangle\, B)$, where $\mu$ is a measure on the space and $A \,\triangle\, B$ is the symmetric difference $(A\setminus B)\cup(B\setminus A)$. We define the ball $\mathbb{B}_r(x_0) := \{x : d(x,x_0)\le r\}$ and the Minkowski sum $A \oplus B := \{a+b : a\in A,\, b\in B\}$. The complement of a set $A$ is denoted by $A^{\mathsf{c}}$. The indicator function of the set $A$ is denoted by $\mathbb{1}_A$. A block diagonal matrix is denoted by $\mathrm{blkdiag}(\cdot)$. For every vector $v$, we denote $\|v\|^2 := v^\top v$, where $v^\top$ is the transpose of $v$. The sup-norm or $\infty$-norm is defined as $\|v\|_\infty := \max_i |v_i|$. We denote by $\lambda_{\min}(P)$ and $\lambda_{\max}(P)$ the smallest and largest eigenvalues of a square, symmetric matrix $P$. The symbol $\succ$ ($\prec$) indicates positive (negative) definiteness, and $A \succ B$ implies $A - B \succ 0$ for $A$, $B$ of appropriate dimensions. Similarly, $\succeq$ ($\preceq$) implies positive (negative) semidefiniteness. The operator norm is denoted $\|\cdot\|$ and is defined as the maximum singular value of its argument. For a symmetric matrix, we use the notation $\star$ to imply symmetric terms, that is, $\begin{bmatrix} A & B \\ \star & C \end{bmatrix} = \begin{bmatrix} A & B \\ B^\top & C \end{bmatrix}$. The symbol $\mathrm{Pr}$ denotes the probability measure.

II Problem Formulation
II-A Problem statement
Consider the following discretetime nonlinear system,
where denote the state and the control input of the system respectively.
For simplicity of exposition we will write
(1a)  
(1b) 
where the system matrices , , and have appropriate dimensions. Denote by the system’s uncertainty, or unmodeled nonlinearity, whose argument is represented by a linear combination of the state. The origin is an equilibrium state for (1); that is, .
The following assumptions and definition are now needed. The matrix is known. The matrix has full column rank, is sparse, and all its entries are 0 or 1. Only the locations of the nonzero elements of are known. The matrix and the function are unknown.
We require the following definition to describe the class of nonlinearities considered in this paper. A function is Lipschitz continuous in the domain if
(2) 
for some and all . We define the scalar
(3) 
as the Lipschitz constant of in . A function is globally Lipschitz if (2) holds for . The nonlinearity is globally Lipschitz continuous. That is,
(4) 
for any , and the global Lipschitz constant is unknown. The assumptions above imply that the linear component of the true system (1) is known, but the rest is unknown. However, we do know the vector space through which the nonlinearity enters the dynamics of (1), since the nonzero locations of are flagged.

Assumption 1 is mild. For instance, one could relax the assumption on and take the unknown to be the identity matrix. Then the nonlinearity would be , with the index set of nonzero rows of , so that .

Given a control policy , we define an infinite-horizon cost functional given an initial state as
(5) 
where is a function with nonnegative range, , and denotes the sequence of states generated by the closed-loop system
(6) 
The scalar is a forgetting/discount factor that places more emphasis on current states and control actions and progressively less on those far in the future.
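To make the role of the discount factor concrete, the sketch below evaluates a truncated version of the discounted cost (5) along a simulated closed-loop trajectory. The scalar dynamics, quadratic stage cost, and linear gain are hypothetical placeholders, since the paper's system is more general.

```python
def discounted_cost(x0, policy, step, stage_cost, gamma=0.95, horizon=200):
    """Truncated evaluation of J(x0) = sum_k gamma^k r(x_k, u_k), cf. (5)."""
    x, total = x0, 0.0
    for k in range(horizon):
        u = policy(x)
        total += (gamma ** k) * stage_cost(x, u)
        x = step(x, u)  # closed-loop update, cf. (6)
    return total

# Illustrative scalar system x+ = 0.9 x + u under the linear policy u = -0.5 x.
step = lambda x, u: 0.9 * x + u
stage_cost = lambda x, u: x * x + u * u
J = discounted_cost(1.0, lambda x: -0.5 * x, step, stage_cost)
```

For this stable closed loop (x+ = 0.4 x), the geometric series converges to 1.25 / (1 - 0.95 · 0.16) ≈ 1.474 and the truncation error at horizon 200 is negligible; with an unstable closed loop and a discount factor near 1, the sum diverges, which is exactly what admissibility rules out.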
Before formally stating our objective, we need to introduce the following standard definition [vamtutorial]. A continuous control policy is admissible on if it stabilizes the closed-loop system (6) on and is finite for any . We want to design an optimal control policy that achieves the optimal cost
(7) 
for any . Here, denotes the set of all admissible control policies. In other words, we wish to compute an optimal control policy
(8) 
Directly constructing such an optimal controller is very challenging for general nonlinear systems; this is further complicated by the fact that the system (1) contains unmodeled/uncertain dynamics. Therefore, we shall use adaptive/approximate dynamic programming (ADP): a class of iterative, data-driven algorithms that generate a convergent sequence of control policies whose limit is provably the optimal control policy .
Recall from [lewis2012reinforcement, 7823092] that a necessary condition for the convergence of policy iteration methods (a subclass of ADP) is the availability of an initial admissible control policy , which is nontrivial to derive for systems with unmodeled dynamics. Therefore, our objective in this work is to systematically derive an initial admissible control policy using only partial model information via kernelized Lipschitz learning and semidefinite programming. We also extend this idea to handle the case when the control input is constrained. In such cases, along with an admissible controller, we also derive a domain of attraction within which the control policy is guaranteed to satisfy input constraints and the closed-loop system remains stable. We refer to the derivation of admissible control policies with guaranteed stabilizability and/or constraint enforcement as safe initialization for ADP: a crucial property required for ADP algorithms to gain traction in expensive industrial applications.
We invoke the assumption in [tanaskovic2017data, piga2018direct] regarding the availability of legacy/archival/historical data generated by the system during prior experiments. That is, at design time, we have a dataset consisting of unique triples: state-input pairs along with the corresponding state update information. Concretely, we have access to . For each , we estimate the nonlinear term using (1); that is,
where exists by Assumption II-A. Note that we also need to estimate the matrix (see (1)) so that can be calculated from . While estimating the exact elements of these matrices is quite challenging, we can estimate the nonzero elements of the matrices, which is enough to design safe initial control policies, because the exact elements of will be subsumed within the Lipschitz constant.
The problem of estimating the sparsity pattern of is analogous to the problem of feature selection and sparse learning, known as automatic relevance determination (ARD) [tipping2001sparse]. The basic idea in ARD is to give feature weights parametric prior densities; these densities are subsequently refined by maximizing the likelihood of the data [tipping2001sparse, chu2005preference]. For example, one can define hyperparameters that explicitly represent the relevance of different inputs to a machine learning algorithm with respect to the desired output (e.g., in a regression problem). These relevance hyperparameters determine the range of variation of the parameters relating to a particular input. ARD can then tune these hyperparameters during learning to discover which inputs are relevant.
We need the following assumption on the data , without which one cannot estimate the global Lipschitz constant of the nonlinearity with high accuracy. Let denote the convex hull of the samples . The Lipschitz constant of in the domain is identical to the global Lipschitz constant . Assumption II-A ensures that the samples obtained from the archival data are contained in a subregion of where the local Lipschitz constant of is the same as its global Lipschitz constant.
Suppose . As long as the convex hull of the samples contains zero, the Lipschitz constant of on the convex hull and on are identical.
In the following section, we will leverage the dataset to estimate the Lipschitz constant of using kernelized Lipschitz learning/estimation, and consequently design an initial admissible linear control policy via semidefinite programs. We will demonstrate how such an initial admissible linear control policy fits into a neural-network-based ADP formulation (such as policy iteration) to asymptotically generate the optimal control policy . The control algorithm proposed in this paper is a direct data-driven controller because no model of is identified in the controller design step. Although we focus only on discrete-time systems, our results hold for continuous-time systems with slight modifications. If , our proposed Lipschitz learning algorithm will yield Lipschitz constant estimates, one for each dimension of . To avoid notational complications, we proceed (without loss of generality) with . For larger , our algorithm can be applied componentwise.
III Kernelized Lipschitz Learning
In this section, we provide a brief overview of kernel density estimation (KDE) and provide a methodology for estimating Lipschitz constants from data.
III-A Empirical density of Lipschitz estimates
With the data , we obtain underestimates of the global Lipschitz constant using
(9) 
where . The sequence consists of empirical samples drawn from an underlying univariate distribution . Clearly, the true distribution has finite support; indeed, its left-hand endpoint is a nonnegative scalar (zero if , but possibly positive if ) and its right-hand endpoint is . This leads to the key idea of our approach: identify the support of the distribution to obtain an estimate of the true Lipschitz constant of .
Variants of the estimator (9), such as , have been widely used in the literature to construct algorithms for determining Lipschitz constants; see, for example, [wood1996estimation, strongin1973convergence, calliess2017lipschitz].
In the literature, the support estimation problem is commonly tackled by assuming prior knowledge of the exact density of the Lipschitz estimates [calliess2017lipschitz] or by using Strongin overestimates of the Lipschitz constant [strongin1973convergence]. We avoid these overestimators because they are provably unreliable, even for globally Lipschitz functions [hansen1992using, Theorem 3.1]. Instead, we fit the density directly from the local estimates and the data in a nonparametric manner, using KDE and characteristics of the estimated density.
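The local underestimates in (9) are simple pairwise difference quotients, so they are easy to generate from archival data. Below is a minimal pure-Python sketch; the test function, domain, and sample count are illustrative and not taken from the paper.

```python
import itertools
import random

def lipschitz_underestimates(xs, ys):
    """Pairwise difference quotients |y_i - y_j| / |x_i - x_j|, cf. (9).
    Each quotient underestimates the Lipschitz constant on the sampled domain."""
    ests = []
    for (xi, yi), (xj, yj) in itertools.combinations(zip(xs, ys), 2):
        if xi != xj:
            ests.append(abs(yi - yj) / abs(xi - xj))
    return ests

# Illustrative example: f(x) = x^2 has Lipschitz constant 2 on [0, 1].
random.seed(0)
xs = [random.uniform(0.0, 1.0) for _ in range(200)]
ests = lipschitz_underestimates(xs, [x * x for x in xs])
# max(ests) approaches, but never exceeds, the true constant 2 on this domain.
```

This illustrates why the empirical samples concentrate just below the true Lipschitz constant: every quotient is a strict underestimate, and the largest ones cluster near the support's right endpoint, which is what the density-based estimator of the next subsection targets.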
III-B Plug-in support estimation
With a set of underestimates , we generate an estimate of the true density using a kernel density estimator
(10) 
where is a smooth function called the kernel function and is the kernel bandwidth. A plug-in estimate of the support of the true density is
(11) 
where is an element of a sequence that converges to zero as ; this plug-in estimator was proposed in [cuevas1997plug].
III-C Implementation details
Implementing the plug-in estimator involves first constructing a KDE of with samples. Then, if one picks small enough, one can easily compute from (11). Then
(12) 
This is a straightforward operation with tools such as ksdensity (MATLAB) and KernelDensity in scikit-learn (Python). Pseudocode is provided in Algorithm 1.
Note that the true support is a subset of . Therefore, when computing the density estimate, this information should be supplied to the tool being used. For example, MATLAB provides the option 'support','positive'. Essentially, this subroutine transforms the data to the log scale and estimates the log-density, so that positivity is preserved upon returning to the linear scale.
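For readers who prefer a self-contained reference implementation over toolbox calls, the following pure-Python sketch implements a Gaussian KDE in the spirit of (10) and the plug-in right-endpoint estimate of the support in the spirit of (11)-(12). The synthetic sample set, bandwidth, and threshold are illustrative choices, not values prescribed by the paper.

```python
import math

def kde(samples, h):
    """Gaussian kernel density estimate, cf. (10), with bandwidth h."""
    n = len(samples)
    c = 1.0 / (n * h * math.sqrt(2.0 * math.pi))
    return lambda x: c * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples)

def support_right_endpoint(samples, h, eps, grid=2000):
    """Plug-in estimate, cf. (11)-(12): largest grid point where the KDE exceeds eps."""
    f = kde(samples, h)
    lo, hi = min(samples) - 3.0 * h, max(samples) + 3.0 * h
    pts = [lo + (hi - lo) * k / grid for k in range(grid + 1)]
    return max(x for x in pts if f(x) > eps)

# Synthetic stand-in for the slope samples of (9): values filling [0, 2],
# as if the true Lipschitz constant were 2.
samples = [2.0 * k / 100 for k in range(101)]
L_hat = support_right_endpoint(samples, h=0.05, eps=1e-3)
```

Because the kernel smears mass slightly beyond the largest sample, the estimate lands a little above 2. This mild overestimation is the conservative direction for the stability and constraint guarantees discussed later in the paper.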
III-D Theoretical guarantees
We now formally describe the density . We consider that the samples are drawn according to some probability distribution with support . For any set , suppose that can be written as , where is the Lebesgue measure and is continuous and positive on . Let denote the product measure on . Since is absolutely continuous with respect to the Lebesgue measure, assigns zero mass on the diagonal. The cumulative distribution function for is then given by

Since is nondecreasing, exists almost everywhere by Lebesgue's theorem on the differentiability of monotone functions, and the support of is contained within because of (9).

We investigate the worst-case sample complexity involved in overestimating under the following mild assumption. The nonlinearity is twice continuously differentiable, that is, .
Proof.
Suppose denotes a sequence of paired samples in such that as . Since is the convex hull of finitely many samples, it is compact, so we can choose a subsequence of that converges to , where both limits lie in . If , then a Taylor expansion estimate implies . Since is an upper bound of at any sample in , and . If , then the result follows by applying the mean value theorem to for , for which and . Also, . Since , this implies . Reordering and if needed, we have

Hence, the rightmost inequality must be an equality, which implies that for all . That is, if the Lipschitz constant is attained with , then restricted to the segment connecting and is linear with slope . This concludes the proof.∎
Lemma III-D enables the worst-case complexity result described in the following theorem. Let , and suppose that Assumptions II-A and III-D hold. There exists such that, for all sufficiently small and and any set of uniform random samples in , the probability that some pair gives the Lipschitz estimate
is at least . Here , where is a constant depending on .
Proof.
By Lemma III-D, there exists at least one such that . For the worst-case analysis, suppose this occurs only at a single sample, . A Taylor expansion at yields
(13) 
where is a remainder term with when , for some and . Note that
where is the angle between and . To obtain a good estimate of , one needs to sample two points in this neighborhood: one point with and a second point with . Each of these conditions defines a cone. Viewing one of these cones in cylindrical coordinates and for , we can integrate the dimensional volume to obtain the volume of this cone as for some dimension-dependent constant . A calculation shows that for small . With and as one sample from each cone, is contained in the cone , and so we can use (13) to approximate
Dividing by and using the defining property of gives
(14) 
Subsequently, one can decompose , with parallel to and perpendicular to it, satisfying . Identical arguments can be used to infer , and hence . Since and are chosen from opposite cones, we have . Using and , we have . Hence,
By combining the aforementioned inequality with (14), one obtains
Set and take . Then there exists such that for all sufficiently small ,
(15) 
which implies that as .
From the assumption on uniformly drawn samples, the probability of sampling in one of the cones is,
for some that depends on . Using and absorbing the factors of 2 and into yields .
Let be the event of sampling at least one point in the cone as above, and let be the event of sampling nothing in the cone. The probability of sampling at least one of each of the points and just described is,
where the factor 2 before in the second term comes from the fact that both cones are excluded and they are disjoint, and the 2 before the third term comes by combining the final two terms in the first expression.
By using the fact that for any and the inequality holds, we can conclude that,
for any given . The latter can be ensured by choosing large enough. This gives a lower bound on the probability of obtaining (15) and hence, the desired result. ∎
III-E Benchmarking the Lipschitz estimator
Our Lipschitz estimator is tested on well-studied benchmark examples from [wood1996estimation, calliess2014conservative]; the benchmark functions are described in Table I along with their domains and true local Lipschitz constants. Note that not all of the functions are globally Lipschitz (e.g., ) or differentiable everywhere (e.g., , ), and, in the special case of , the function is specifically constructed so that naive Strongin-type overestimation of provably fails [hansen1992using]. To evaluate the proposed Lipschitz estimator, we vary the number of data points and the confidence parameter . Over 100 runs, we report the mean ± one standard deviation of the following quantities: the time required by our learning algorithm, the estimated Lipschitz constant , and the error (which should be positive when we overestimate ).

Time [s] | estimate (mean ± stdev) | error | OE?
, on 
0.596 ± 0.127 | 3.361 ± 0.093
0.610 ± 0.122 | 3.528 ± 0.177
2.463 ± 0.432 | 3.252 ± 0.083
2.438 ± 0.427 | 3.369 ± 0.158
, on 
0.455 ± 0.123 | 1.018 ± 0.011
0.459 ± 0.114 | 1.030 ± 0.020
1.656 ± 0.218 | 1.005 ± 0.004
1.585 ± 0.208 | 1.010 ± 0.009
, on 
0.556 ± 0.121 | 1.780 ± 0.074
0.547 ± 0.124 | 1.923 ± 0.166
1.826 ± 0.224 | 1.684 ± 0.010
1.821 ± 0.221 | 1.720 ± 0.002
Hansen test function from [wood1996estimation], on 
0.450 ± 0.117 | 8.969 ± 0.262
0.488 ± 0.123 | 9.401 ± 0.476
1.921 ± 0.138 | 8.507 ± 0.046
1.923 ± 0.130 | 8.707 ± 0.103
, on 
0.546 ± 0.061 | 3.139 ± 0.051
0.612 ± 0.066 | 3.200 ± 0.087
1.893 ± 0.245 | 3.043 ± 0.014
1.989 ± 0.230 | 3.104 ± 0.024

OE indicates an overestimate of the true Lipschitz constant, that is, .
The final column of Table I reveals an important empirical detail: all our estimates of are overestimates for and . This is a critical advantage of our proposed approach, because an overestimate enables us to provide stability and constraint satisfaction guarantees for the data-driven controller, as discussed in subsequent sections. Furthermore, the estimation error is small for every test run and, as expected, increases as decreases, because a smaller value of indicates the need for greater confidence, which results in more conservative estimates.
IV Safe Initialization in ADP
In this section, we begin by reviewing a general ADP procedure, and then explain how to safely initialize unconstrained, as well as input-constrained, ADP.
IV-A Unconstrained ADP
Recall the optimal value function (7) and the optimal control policy (8). From the Bellman optimality principle, we know that the discrete-time Hamilton-Jacobi-Bellman equations are given by
(16)  
(17) 
where is the optimal value function and is the optimal control policy. The key operations in ADP methods [lewis2012reinforcement] involve setting an admissible control policy and then iterating the policy evaluation step
(18a)  
and the policy improvement step  
(18b) 
until convergence.
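To ground the abstract iteration (18), the following pure-Python sketch runs policy iteration on a scalar linear-quadratic special case, where both the evaluation and improvement steps have closed form. The system parameters and the initial gain are illustrative only; the paper's setting is nonlinear and more general.

```python
def policy_iteration(a, b, q, r, gamma, k0, iters=50):
    """Policy iteration, cf. (18), for the scalar LQ problem
    x+ = a*x + b*u, stage cost q*x^2 + r*u^2, linear policy u = -k*x."""
    k = k0
    for _ in range(iters):
        acl = a - b * k                                   # closed-loop pole
        assert gamma * acl * acl < 1.0, "policy not admissible"
        p = (q + r * k * k) / (1.0 - gamma * acl * acl)   # evaluation (18a): V(x) = p*x^2
        k = gamma * a * b * p / (r + gamma * b * b * p)   # improvement (18b)
    return k, p

# Open-loop unstable system (a = 1.2); k0 = 1.0 happens to be admissible.
k_star, p_star = policy_iteration(a=1.2, b=1.0, q=1.0, r=1.0, gamma=0.95, k0=1.0)
```

Note that the iteration aborts if the current gain is not admissible, since the evaluated cost in (18a) would be infinite; this mirrors the requirement, central to this paper, of an admissible initial policy.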
IV-A1 Semidefinite programming for a safe initial control policy
Recall the following definition. The equilibrium point of the closed-loop system (6) is globally exponentially stable with decay rate if there exist scalars and such that for any .
Conditions for global exponential stability (GES) of the equilibrium state, adopted from [khalil2015nonlinear], are provided next. Let be a continuously differentiable function such that
(19a)  
(19b) 
for any and along the trajectories of the system
(20) 
where , , and are positive scalars, and is a nonlinear function. Then the equilibrium state for the system (20) is GES with decay rate .
The following design theorem provides a method for constructing an initial linear stabilizing policy such that the origin is a GES equilibrium state of the closed-loop system (6). Suppose that Assumptions 1–2 hold and that there exist matrices , , and scalars , such that
(21) 
where
Then the equilibrium of the closedloop system (6) is GES with decay rate .
Proof.
Note that we do not need to know to satisfy conditions (21); instead, Theorem IV-A1 provides conditions that leverage matrix multipliers similar to those described in [chakrabarty2017state].
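To illustrate the multiplier mechanism behind (21), consider the following standard quadratic-constraint encoding of the Lipschitz property. The symbols here ($H$ for the map from the state to the nonlinearity's argument, $\hat\gamma$ for the learned Lipschitz estimate, $w$ for the nonlinearity output) are our own placeholders, since the corresponding notation is garbled in the source.

```latex
% Assuming w = \phi(Hx), \phi(0) = 0, and the Lipschitz bound (4) with estimate
% \hat\gamma, we have \|\phi(Hx)\| \le \hat\gamma \|Hx\| for all x, and hence
\begin{bmatrix} x \\ w \end{bmatrix}^{\!\top}
\underbrace{\begin{bmatrix} \hat\gamma^2 H^\top H & 0 \\ 0 & -I \end{bmatrix}}_{\text{multiplier matrix}}
\begin{bmatrix} x \\ w \end{bmatrix}
= \hat\gamma^2 \|Hx\|^2 - \|w\|^2 \;\ge\; 0 .
```

Combining this inequality with a quadratic Lyapunov decrease condition via the S-procedure yields semidefinite conditions of the form (21), without requiring knowledge of the nonlinearity itself.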
We now provide LMI-based conditions for computing the initial control policy , the initial domain of attraction, and via convex programming.
(22)  