# Stability-certified reinforcement learning: A control-theoretic perspective

We investigate the important problem of certifying stability of reinforcement learning policies when interconnected with nonlinear dynamical systems. We show that by regulating the input-output gradients of policies, strong guarantees of robust stability can be obtained based on a proposed semidefinite programming feasibility problem. The method is able to certify a large set of stabilizing controllers by exploiting problem-specific structures; furthermore, we analyze and establish its (non)conservatism. Empirical evaluations on two decentralized control tasks, namely multi-flight formation and power system frequency regulation, demonstrate that the reinforcement learning agents can have high performance within the stability-certified parameter space, and also exhibit stable learning behaviors in the long run.


## 1 Introduction

Remarkable progress has been made in reinforcement learning (RL) using (deep) neural networks to solve complex decision-making and control problems [43]. While RL algorithms, such as policy gradient [52, 26, 41], Q-learning [49, 35], and actor-critic methods [32, 34], aim at optimizing control performance, the security aspect is of great importance for mission-critical systems, such as autonomous cars and power grids [20, 4, 44]. A fundamental problem is to analyze or certify stability of the interconnected system in both RL exploration and deployment stages, which is challenging due to its dynamic and nonconvex nature [20].

The problem under study focuses on a general continuous-time dynamical system:

$$\dot{x}(t) = f_t(x(t), u(t)), \tag{1}$$

with the state $x(t) \in \mathbb{R}^{n_s}$ and the control action $u(t) \in \mathbb{R}^{n_a}$. In general, $f_t$ can be a time-varying and nonlinear function, but for the purpose of stability analysis, we study the important case that

$$f_t(x(t), u(t)) = A x(t) + B u(t) + g_t(x(t)), \tag{2}$$

where $f_t$ comprises a linear time-invariant (LTI) component $A$ that is Hurwitz (i.e., every eigenvalue of $A$ has strictly negative real part), a control matrix $B$, and a slowly time-varying component $g_t$ that is allowed to be nonlinear and even uncertain. (Footnote 1: This requirement is not difficult to meet in practice, because one can linearize any nonlinear system around the equilibrium point to obtain a linear component and a nonlinear part.) The condition that $A$ is stable is a basic requirement, but the goal of reinforcement learning is to design a controller that optimizes some performance metric that is not necessarily related to the stability condition. For feedback control, we also allow the controller to obtain observations $y(t) = C x(t)$ that are a linear function of the states, where $C$ may have a sparsity pattern to account for partial observations in the context of decentralized controls [8].

Suppose that $u(t) = \pi_\theta(y(t)) + e(t)$, where $\pi_\theta$ is a neural network given by an RL agent (parametrized by $\theta$, which can be time-varying due to learning) to optimize some reward $r(x, u)$ revealed through the interaction with the environment. The exploration vector $e(t)$ captures the additive randomization effect during the learning phase, and is assumed to have a bounded energy over time (i.e., $e \in L_2$). The main goal is to analyze the stability of the system with the actuation of $\pi_\theta$, which is typically a neural network controller, as illustrated in Fig. 1. Specifically, the stability criterion is stated using the concept of $L_2$ gain [55, 16]. (Footnote 2: This stability metric is widely adopted in practice, and is closely related to bounded-input bounded-output (BIBO) stability and absolute stability (or asymptotic stability). For controllable and observable LTI systems, the equivalence can be established.)

###### Definition 1 (Input-output stability)

The $L_2$ gain of the system $G$ controlled by $\pi$ is the worst-case ratio between total output energy and total input energy:

$$\gamma(G, \pi) = \sup_{u \in L_2} \frac{\|y\|_2}{\|u\|_2}, \tag{3}$$

where $L_2$ is the set of all square-summable signals, $\|y\|_2^2 = \int_0^\infty |y(t)|^2 \, dt$ is the total energy over time, and $u$ is the control input with exploration. If $\gamma(G, \pi)$ is finite, then the interconnected system is said to have input-output stability (or finite $L_2$ gain).

This study investigates the possibility of using the gradient information of the policy to obtain a stability certificate, because this information can be easily extracted in real-time and is generic enough to include a large set of performance-optimizing nonlinear controllers. Let $[n] = \{1, \ldots, n\}$ be the set notation. By denoting

$$\mathcal{P}(\xi) = \left\{ \pi \;\middle|\; \underline{\xi}_{ij} \le \partial_j \pi_i(y) \le \overline{\xi}_{ij},\ \forall i \in [n_a],\ j \in [n_s],\ y \in \mathbb{R}^{n_s} \right\} \tag{4}$$

as the set of controllers whose partial derivatives are bounded by $\underline{\xi}$ and $\overline{\xi}$, it is desirable to provide a stability certificate as long as the RL policy remains within the above “safety set.” Indeed, this can be checked efficiently, as stated (informally) in the following theorem.
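As a rough illustration of what membership in the safety set (4) means, the sketch below estimates the partial derivatives of a policy by central finite differences and compares them against lower and upper bounds at a set of sample points. The policy `toy_policy`, the bound matrices, and the sampling grid are hypothetical choices made for this example; passing the check on samples is only a necessary condition, not a proof that the bounds hold everywhere.

```python
import math


def jacobian_fd(policy, y, h=1e-5):
    """Central finite-difference estimate of d policy_i / d y_j at the point y."""
    n_out = len(policy(y))
    jac = [[0.0] * len(y) for _ in range(n_out)]
    for j in range(len(y)):
        y_plus = list(y)
        y_minus = list(y)
        y_plus[j] += h
        y_minus[j] -= h
        f_plus, f_minus = policy(y_plus), policy(y_minus)
        for i in range(n_out):
            jac[i][j] = (f_plus[i] - f_minus[i]) / (2.0 * h)
    return jac


def within_gradient_bounds(policy, lower, upper, samples, tol=1e-6):
    """True if every sampled partial derivative lies in [lower_ij, upper_ij]."""
    for y in samples:
        jac = jacobian_fd(policy, y)
        for i, row in enumerate(jac):
            for j, d in enumerate(row):
                if d < lower[i][j] - tol or d > upper[i][j] + tol:
                    return False
    return True


# Hypothetical two-input, two-output policy, used only for illustration.
def toy_policy(y):
    return [math.tanh(0.5 * y[0]) - 0.1 * y[0], math.sin(y[1])]


# Hypothetical gradient bounds and sample grid.
lower = [[-0.1, 0.0], [0.0, -1.0]]
upper = [[0.6, 0.0], [0.0, 1.0]]
grid = [[0.5 * a, 0.5 * b] for a in range(-8, 9) for b in range(-8, 9)]

print(within_gradient_bounds(toy_policy, lower, upper, grid))
```

In a learning loop, a check of this kind could be run after each policy update, with automatic differentiation replacing the finite differences.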

###### Theorem 1 (Main result)

If there exist constants $\underline{\xi}$ and $\overline{\xi}$ such that the condition (39) is feasible for the system (1), then the interconnected system has a finite $L_2$ gain as long as $\pi_t \in \mathcal{P}(\xi)$ for all $t \ge 0$.

We call the constants $\underline{\xi}$ and $\overline{\xi}$ stability-certified gradient bounds for the underlying system. The above result is based on the intuition that a real-world stable controller should exhibit “smoothness” in the sense that small changes in the input should lead to small changes in the output. This incorporates the special case where controllers are known to have bounded Lipschitz constants (a simple strategy to calculate the Lipschitz constant of a deep neural network is suggested in [48]). To compute the gradient bounds, we borrow powerful ideas from the framework of integral quadratic constraint (in frequency domain) [33] and dissipativity theory (in time domain) [51] for robustness analysis. While these tools are celebrated for their non-conservatism in the robust control literature, existing characterizations of multi-input multi-output (MIMO) Lipschitz functions are insufficient. Thus, one major obstacle is to derive non-trivial bounds that could be of use in practice.

To this end, we develop a new quadratic constraint on gradient-bounded functions, which exploits the sparsity of the control architecture and the non-homogeneity of the output vector. Some key features of the stability-certified smoothness bounds are as follows: (a) the bounds are inherent to the targeted real-world control task; (b) they can be computed efficiently by solving a semidefinite programming (SDP) problem; (c) they can be used to certify stability when reinforcement learning is employed in real-world control with either off-policy or on-policy learning [47]. Furthermore, the stability certification can be regarded as an S-procedure, and we analyze its conservatism to show that it is necessary for the robustness of a surrogate system that is closely related to the original system.

The paper is organized as follows. Preliminaries on policy gradient reinforcement learning, the integral quadratic constraint (IQC) and dissipativity frameworks are presented in Section 2. Main results on gradient bounds for a linear or nonlinear system are presented in Section 3, where we also analyze the conservatism of the certificate. The method is evaluated in Section 4 on two nonlinear decentralized control tasks. Conclusions are drawn in Section 5.

## 2 Preliminary

In this section, we give an overview of the main topics relevant to this study, namely policy gradient reinforcement learning and robustness analysis based on IQC framework and dissipativity theory.

### 2.1 Reinforcement learning using policy gradient

Reinforcement learning aims at guiding an agent to perform a task as efficiently and skillfully as possible through interactions with the environment. The control task is modeled as a Markov decision process (MDP), defined by the tuple $(\mathcal{X}, \mathcal{U}, T, r, \rho)$, where $\mathcal{X}$ is the set of states $x$, $\mathcal{U}$ is a set of actions $u$, $T$ indicates the world dynamics as in (1), $r(x, u)$ is the reward at state $x$ and action $u$, and $\rho$ is the factor to discount future rewards. A control strategy is defined by a policy $\pi$, which can be approximated by a neural network $\pi_\theta$ with parameters $\theta$. For continuous control, the actions follow a multivariate normal distribution, where $\pi_\theta(x)$ is the mean, and the standard deviation in each action dimension is set to be a diminishing number during exploration or learning, and 0 during actual deployment. With a slight abuse of notation, we use $\pi_\theta(u|x)$ to denote this normal distribution over actions, and use $\pi$ to denote $\pi_\theta$ for simplicity. The goal of RL is to maximize the expected return:

$$\eta(\pi_\theta) = \mathbb{E}_{x_0,\, u_t \sim \pi_\theta(\cdot|x_t),\, x_{t+1} \sim T(x_t, u_t)} \left[ \sum_{t=0}^{T} \rho^t r(x_t, u_t) \right], \tag{5}$$

where $T$ in the upper summation limit is the control horizon, and the expectation is taken over the policy, the initial state distribution and the world dynamics.
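The expected return in (5) is usually approximated by averaging discounted reward sums over sampled trajectories. A minimal sketch, assuming each trajectory is given simply as a list of per-step rewards (the function names are illustrative, not from the paper):

```python
def discounted_return(rewards, rho):
    """Sum of rho**t * r_t over one trajectory, t = 0, 1, ..., len(rewards) - 1."""
    total, discount = 0.0, 1.0
    for r in rewards:
        total += discount * r
        discount *= rho
    return total


def expected_return(trajectories, rho):
    """Monte Carlo estimate of the expected return from sampled reward sequences."""
    returns = [discounted_return(rewards, rho) for rewards in trajectories]
    return sum(returns) / len(returns)
```

For instance, `discounted_return([1.0, 1.0, 1.0], 0.5)` evaluates to `1 + 0.5 + 0.25 = 1.75`.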

From a practitioner’s point of view, the existing methods can be categorized into four groups based on how the optimal policy is determined: (a) policy gradient methods directly optimize the policy parameters $\theta$ by estimating the gradient of the expected return (e.g., REINFORCE [52], natural policy gradient [26], and trust region policy optimization (TRPO) [41]); (b) value-based algorithms like Q-learning do not aim at optimizing the policy directly, but instead approximate the Q-value of the optimal policy for the available actions [49, 35]; (c) actor-critic algorithms keep an estimate of the value function (critic) as well as a policy that maximizes the value function (actor) (e.g., DDPG [32] and A3C [34]); lastly, (d) model-based methods focus on the learning of the transition model for the underlying dynamics, and then use it for planning or to improve a policy (e.g., Dyna [46] and guided policy search [30]). We adopt an approach based on end-to-end policy gradient that combines TRPO [41] with natural gradient [26] and a smoothness penalty (this method is very useful for RL in dynamical systems described by differential or difference equations).

Trust region policy optimization is a policy gradient method that constrains the step length to be within a “trust region” so that the local estimation of the gradient/curvature has a monotonic improvement guarantee. By manipulating the expected return using the identity proposed in [25], the “surrogate objective” can be designed:

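$$L_{\pi_{\mathrm{old}}}(\pi) = \mathbb{E}_{x, u \sim \pi_{\mathrm{old}}} \left[ \frac{\pi(u|x)}{\pi_{\mathrm{old}}(u|x)} \Lambda_{\pi_{\mathrm{old}}}(x, u) \right], \tag{6}$$

where the expectation is taken over the old policy $\pi_{\mathrm{old}}$, the ratio inside the expectation is also known as the importance weight, and $\Lambda_{\pi_{\mathrm{old}}}$ is the advantage function given by:

$$\Lambda_{\pi_{\mathrm{old}}}(x, u) = \mathbb{E}_{x' \sim T(x, u)} \left[ r(x, u) + \rho V_{\pi_{\mathrm{old}}}(x') - V_{\pi_{\mathrm{old}}}(x) \right], \tag{7}$$

where the expectation is with respect to the dynamics $T$ (the dependence on $T$ is omitted), and it measures the improvement of taking action $u$ at state $x$ over the old policy in terms of the value function $V_{\pi_{\mathrm{old}}}$. A bound on the difference between $L_{\pi_{\mathrm{old}}}(\pi)$ and $\eta(\pi)$ has been derived in [41], which also proves a monotonic improvement result as long as the KL divergence between the new and old policies is small (i.e., the new policy stays within the trust region). In practice, the surrogate loss can be estimated using trajectories sampled from $\pi_{\mathrm{old}}$ as follows,

$$\hat{L}_{\pi_{\mathrm{old}}}(\pi) = \sum_t \frac{\pi(u_t|x_t)}{\pi_{\mathrm{old}}(u_t|x_t)} \hat{\Lambda}_{\pi_{\mathrm{old}}}(x_t, u_t), \tag{8}$$

and the averaged KL divergence over observed states can be used to estimate the trust region.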
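The sample-based surrogate loss and the averaged KL divergence can be sketched as follows for a Gaussian policy with a fixed, shared standard deviation. This is a simplified illustration under that assumption: the policy means are passed in as precomputed lists rather than produced by a network, and the advantage estimates are taken as given.

```python
import math


def gaussian_logpdf(u, mean, std):
    """Log-density of a diagonal Gaussian with common standard deviation std."""
    return sum(
        -0.5 * ((ui - mi) / std) ** 2 - math.log(std) - 0.5 * math.log(2.0 * math.pi)
        for ui, mi in zip(u, mean)
    )


def surrogate_loss(actions, old_means, new_means, advantages, std):
    """Importance-weighted sum of advantage estimates along sampled steps."""
    total = 0.0
    for u, m_old, m_new, adv in zip(actions, old_means, new_means, advantages):
        log_ratio = gaussian_logpdf(u, m_new, std) - gaussian_logpdf(u, m_old, std)
        total += math.exp(log_ratio) * adv
    return total


def mean_kl(old_means, new_means, std):
    """Average KL(old || new) over states; for equal std it is ||dmu||^2 / (2 std^2)."""
    kls = [
        sum((a - b) ** 2 for a, b in zip(m_old, m_new)) / (2.0 * std ** 2)
        for m_old, m_new in zip(old_means, new_means)
    ]
    return sum(kls) / len(kls)
```

When the new and old policies coincide, every importance weight equals one, so the surrogate loss reduces to the sum of the advantage estimates and the averaged KL divergence is zero.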

Natural gradient is defined by a metric based on the probability manifold induced by the KL divergence. It improves the standard gradient by making a step invariant to reparametrization of the parameter coordinates [3]:

$$\theta_{t+1} \leftarrow \theta_t - \lambda H_\theta^{-1} \zeta_t, \tag{9}$$

where $\zeta_t$ is the standard gradient, $H_\theta$ is the Fisher information matrix estimated with the trajectory data, and $\lambda$ is the step size. In practice, when the number of parameters is large, conjugate gradient is employed to estimate the term $H_\theta^{-1} \zeta_t$ without requiring any matrix inversion. Since the Fisher information matrix coincides with the second-order approximation of the KL divergence, one can perform a back-tracking line search on the step size $\lambda$ to ensure that the updated policy stays within the trust region.
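The conjugate gradient step mentioned above can be sketched as follows. For readability the Fisher matrix is an explicit small list of lists; a practical implementation would typically supply only Fisher-vector products. The function names and the fixed step size are illustrative assumptions, and the line search is omitted.

```python
def mat_vec(H, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(h * x for h, x in zip(row, v)) for row in H]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def conjugate_gradient(H, g, iters=50, tol=1e-12):
    """Approximately solve H x = g for a symmetric positive definite H."""
    x = [0.0] * len(g)
    r = list(g)  # residual g - H x for the initial guess x = 0
    p = list(r)  # first search direction
    rs_old = dot(r, r)
    for _ in range(iters):
        if rs_old < tol:
            break
        Hp = mat_vec(H, p)
        alpha = rs_old / dot(p, Hp)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * hpi for ri, hpi in zip(r, Hp)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


def natural_gradient_step(theta, H, zeta, step):
    """One update theta - step * H^{-1} zeta, with H^{-1} zeta obtained by CG."""
    direction = conjugate_gradient(H, zeta)
    return [t - step * d for t, d in zip(theta, direction)]
```

For the matrix `[[4, 1], [1, 3]]` and gradient `[1, 2]`, the solve converges in two iterations to approximately `[1/11, 7/11]`.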

Smoothness penalty is introduced in this study to empirically improve learning performance on physical dynamical systems. Specifically, we propose to use

$$L_{\mathrm{explore}} = \sum_{t=1}^{T} \left\| u_{t-1} - \pi_\theta(x_t) \right\|^2 \tag{10}$$

as a regularization term to induce consistency during exploration. The intuition is that since the change in states between two consecutive time steps is often small, it is desirable to ensure small changes in output actions. This is closely related to another penalty term that has been used in [15], which is termed “double backpropagation”, and recently rediscovered in [37, 22]:

$$L_{\mathrm{smooth}} = \sum_{t=1}^{T} \left\| \frac{\partial}{\partial \theta} \pi_\theta(x_t) \right\|^2, \tag{11}$$

which penalizes the gradient of the policy along the trajectories. Since bounded gradients lead to a bounded Lipschitz constant, these penalties will induce smooth neural network functions, which is essential to ensure generalizability and, as we will show, stability. In addition, we incorporate a hard threshold (HT) approach that rescales the weight matrices at each layer by a common factor whenever the Lipschitz constant of the neural network $\pi_\theta$ exceeds the certified Lipschitz constant, with the factor determined by these two constants and the number of layers of the neural network. This ensures that the Lipschitz constant of the RL policy remains bounded by the certified value.
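The exploration penalty and the hard threshold step can be sketched as below. The rescaling factor used here, the `n_layers`-th root of the ratio between the certified and the current Lipschitz estimate, is an assumption of this sketch: it is the natural choice when the Lipschitz estimate is multiplicative across layers (for example a product of layer norms with 1-Lipschitz activations), but it is not confirmed by the text above. All function names are illustrative.

```python
def explore_penalty(prev_actions, policy_means):
    """Sum over t of ||u_{t-1} - pi_theta(x_t)||^2.

    prev_actions[k] holds u_{t-1} and policy_means[k] holds pi_theta(x_t)
    for the same time index t.
    """
    return sum(
        sum((u - m) ** 2 for u, m in zip(u_prev, mean))
        for u_prev, mean in zip(prev_actions, policy_means)
    )


def hard_threshold_scale(lipschitz, certified, n_layers):
    """Per-layer scale factor (assumed form); 1.0 if the bound already holds."""
    if lipschitz <= certified:
        return 1.0
    return (certified / lipschitz) ** (1.0 / n_layers)


def rescale_weights(weights, scale):
    """Multiply every entry of every layer's weight matrix by scale."""
    return [[[scale * w for w in row] for row in layer] for layer in weights]
```

With a two-layer network whose Lipschitz estimate is 8 and a certified constant of 2, the factor is 0.5 per layer, so a multiplicative estimate drops to 8 × 0.5 × 0.5 = 2.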

In summary, our policy gradient is based on the weighted objective:

$$L_{\mathrm{pol}}(\pi_\theta) = \hat{L}_{\pi_{\mathrm{old}}}(\pi_\theta) + w_1 L_{\mathrm{explore}}(\pi_\theta) + w_2 L_{\mathrm{smooth}}(\pi_\theta), \tag{12}$$

where the penalty coefficients $w_1$ and $w_2$ are selected relative to the scale of the surrogate loss value $\hat{L}_{\pi_{\mathrm{old}}}(\pi_\theta)$. In each round, a set of trajectories is collected using $\pi_{\mathrm{old}}$, which is used to estimate the gradient $\zeta_t$ and the Fisher information matrix $H_\theta$; a backtracking line search on the step size is then conducted to ensure that the updated policy stays within the trust region. This learning procedure is known as on-policy learning [47].

### 2.2 Overview of IQC framework

The IQC theory is celebrated for systematic and efficient stability analysis of a large class of uncertain, dynamic, and interconnected systems [33]. It unifies and extends classical passivity-based multiplier theory, and has close connections to dissipativity theory in the time domain [42].

To state the IQC framework, some terminology is necessary. We define the space $L_2^n[0, \infty) = \{ \mathbf{x} : \int_0^\infty |x(t)|^2 \, dt < \infty \}$ for signals supported on $[0, \infty)$, where $n$ denotes the spatial dimension of $\mathbf{x}$, and the extended space $L_{2e}^n[0, \infty) = \{ \mathbf{x} : \int_0^T |x(t)|^2 \, dt < \infty,\ \forall T \ge 0 \}$ (we will use $L_2$ and $L_{2e}$ if it is not necessary to specify the dimension and signal support), where we use $\mathbf{x}$ to denote the signal in general and $x(t)$ to denote its value at time $t$. For a vector or matrix, we use superscript $*$ to denote its conjugate transpose. An operator is causal if the current output does not depend on future inputs. It is bounded if it has a finite $L_2$ gain. Let $\Pi$ be a bounded linear operator on a Hilbert space. Then, its Hilbert adjoint is the operator $\Pi^*$ such that $\langle \Pi \mathbf{x}, \mathbf{y} \rangle = \langle \mathbf{x}, \Pi^* \mathbf{y} \rangle$ for all $\mathbf{x}, \mathbf{y}$, where $\langle \cdot, \cdot \rangle$ denotes the inner product. It is self-adjoint if $\Pi = \Pi^*$.

Consider the feedback interconnection (see Fig. 1):

$$\mathbf{y} = G(\mathbf{u}), \tag{13}$$

$$\mathbf{u} = \Delta(\mathbf{y}) + \mathbf{e}, \tag{14}$$

where $G$ is the transfer function of a causal and bounded LTI system (i.e., it maps input $\mathbf{u}$ to output $\mathbf{y}$ through the internal state dynamics $\dot{x} = A x + B u$), $\mathbf{e}$ is the disturbance, and $\Delta$ is a bounded and causal function that is used to represent uncertainties in the system. IQC provides a framework to treat uncertainties such as nonlinear dynamics, model approximation and identification errors, time-varying parameters and disturbance noise, by using their input-output characterizations.

###### Definition 2 (Integral quadratic constraints)

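Consider the signals $\mathbf{y}$ and $\mathbf{w}$ associated with Fourier transforms $\hat{\mathbf{y}}$ and $\hat{\mathbf{w}}$, and $\mathbf{w} = \Delta(\mathbf{y})$, where $\Delta$ is a bounded and causal operator. We present both the frequency- and time-domain IQC definitions:

1. (Frequency domain) Let $\Pi$ be a bounded and self-adjoint operator. Then, $\Delta$ is said to satisfy the IQC defined by $\Pi$ (i.e., $\Delta \in \mathrm{IQC}(\Pi)$) if:

$$\sigma_\Pi(\hat{\mathbf{y}}, \hat{\mathbf{w}}) = \int_{-\infty}^{\infty} \begin{bmatrix} \hat{y}(j\omega) \\ \hat{w}(j\omega) \end{bmatrix}^* \Pi(j\omega) \begin{bmatrix} \hat{y}(j\omega) \\ \hat{w}(j\omega) \end{bmatrix} d\omega \ge 0. \tag{15}$$

2. (Time domain) Let $(\Psi, M)$ be any factorization of $\Pi = \Psi^* M \Psi$ such that $\Psi$ is stable and $M = M^\top$. Then, $\Delta$ is said to satisfy the hard IQC defined by $(\Psi, M)$ (i.e., $\Delta \in \mathrm{IQC}(\Psi, M)$) if:

$$\int_0^T z(t)^\top M z(t) \, dt \ge 0, \quad \forall T \ge 0, \tag{16}$$

where $\mathbf{z} = \Psi \begin{bmatrix} \mathbf{y} \\ \mathbf{w} \end{bmatrix}$ is the filtered output given by the stable operator $\Psi$. If instead of requiring nonnegativity at each time $T$, the nonnegativity is considered only when $T \to \infty$, then the corresponding condition is called soft IQC.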

As established in [42], the time- and frequency-domain IQC definitions are equivalent if there exists $\Pi = \Psi^* M \Psi$ as a spectral factorization of $\Pi$ with $M = \begin{bmatrix} I & 0 \\ 0 & -I \end{bmatrix}$ such that $\Psi$ and $\Psi^{-1}$ are stable.

###### Example 1 (Sector IQC)

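A single-input single-output uncertainty $\Delta: \mathbb{R} \to \mathbb{R}$ is called “sector bounded” between $[\alpha, \beta]$ if $(\Delta(y(t)) - \alpha y(t))(\beta y(t) - \Delta(y(t))) \ge 0$, for all $y(t) \in \mathbb{R}$ and $t \ge 0$. It thus satisfies the sector $\mathrm{IQC}(\Psi, M)$ with $\Psi = I$ and $M = \begin{bmatrix} -2\alpha\beta & \alpha + \beta \\ \alpha + \beta & -2 \end{bmatrix}$. It also satisfies $\mathrm{IQC}(\Pi)$ with $\Pi = M$ defined above.

###### Example 2 ($L_2$ gain bound)

A MIMO uncertainty $\Delta: \mathbb{R}^n \to \mathbb{R}^m$ has the $L_2$ gain $\gamma$ if $\int_0^\infty \|w(t)\|^2 \, dt \le \gamma^2 \int_0^\infty \|y(t)\|^2 \, dt$, where $w(t) = \Delta(y(t))$. Thus, it satisfies $\mathrm{IQC}(\Psi, M)$ with $\Psi = I_{n+m}$ and $M = \begin{bmatrix} \lambda \gamma^2 I_n & 0 \\ 0 & -\lambda I_m \end{bmatrix}$, where $\lambda > 0$. It also satisfies $\mathrm{IQC}(\Pi)$ with $\Pi = M$ defined above. This can be used to characterize nonlinear operators with fast time-varying parameters.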

Before stating a stability result, we define the system (13)–(14) (see Fig. 1) to be well-posed if for any $\mathbf{e} \in L_{2e}$, there exists a solution $\mathbf{u} \in L_{2e}$, which depends causally on $\mathbf{e}$. A main IQC result for stability is stated below:

###### Theorem 2 ([33])

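Consider the interconnected system (13)–(14). Assume that: (i) the interconnected system $(G, \tau\Delta)$ is well posed for all $\tau \in [0, 1]$; (ii) $\tau\Delta \in \mathrm{IQC}(\Pi)$ for $\tau \in [0, 1]$; and (iii) there exists $\epsilon > 0$ such that

$$\begin{bmatrix} \hat{G}(j\omega) \\ I \end{bmatrix}^* \Pi(j\omega) \begin{bmatrix} \hat{G}(j\omega) \\ I \end{bmatrix} \preceq -\epsilon I, \quad \forall \omega \in [0, \infty). \tag{17}$$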

Then, the system (13)–(14) is input-output stable (i.e., finite gain).

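The above theorem requires three technical conditions. The well-posedness condition is a generic property for any acceptable model of a physical system. The second condition is implied if $\Pi = \begin{bmatrix} \Pi_{11} & \Pi_{12} \\ \Pi_{12}^* & \Pi_{22} \end{bmatrix}$ has the properties $\Pi_{11} \succeq 0$ and $\Pi_{22} \preceq 0$. The third condition is central, and it requires checking the feasibility at every frequency, which represents a main obstacle. As discussed in Section 3.2, this condition can be equivalently represented as a linear matrix inequality (LMI) using the Kalman-Yakubovich-Popov (KYP) lemma. In general, the more IQCs exist for the uncertainty, the better characterization can be obtained. If $\Delta \in \mathrm{IQC}(\Pi_k)$, $k \in [n_K]$, where $n_K$ is the number of IQCs satisfied by $\Delta$, then it is easy to show that $\Delta \in \mathrm{IQC}(\Pi_\tau)$, where $\Pi_\tau = \sum_{k=1}^{n_K} \tau_k \Pi_k$ with $\tau_k \ge 0$; thus, the stability test (17) becomes a convex program, i.e., to find $\tau_k \ge 0$ such that:

$$\begin{bmatrix} \hat{G}(j\omega) \\ I \end{bmatrix}^* \left( \sum_{k=1}^{n_K} \tau_k \Pi_k(j\omega) \right) \begin{bmatrix} \hat{G}(j\omega) \\ I \end{bmatrix} \preceq -\epsilon I, \quad \forall \omega \in [0, \infty). \tag{18}$$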

The counterpart for the frequency-domain stability condition in the time-domain can be stated using a standard dissipation argument [42].

### 2.3 Related work

To close this section, we summarize some connections to existing literature. This work is closely related to the body of works on safe reinforcement learning, defined as the process of learning policies that maximize performance in problems where safety is required during the learning and/or deployment [20]. A detailed literature review can be found in [20], which has categorized two main approaches by modifying: (1) the optimality condition with a safety factor, and (2) the exploration process to incorporate external knowledge or risk metrics. Risk-aversion can be specified in the reward function, for example, by defining risk as the probability of reaching a set of unknown states in a discrete Markov decision process setting [14, 21]. Robust MDP is designed to maximize rewards while safely exploring the discrete state space [36, 50]. For continuous states and actions, robust model predictive control can be employed to ensure robustness and safety constraints for the learned model with bounded errors [7]. These methods require accurate or estimated models for policy learning. Recently, model-free policy optimization has been successfully demonstrated in real-world tasks such as robotics, business management, smart grid and transportation [31]. Safety requirements are high in these settings. Existing approaches are based on constraint satisfaction that holds with high probability [45, 1].

The present analysis tackles the safe reinforcement learning problem from a robust control perspective, which is aimed at providing theoretical guarantees for stability [55]. Lyapunov functions are widely used to analyze and verify stability when the system and its controller are known [39, 10]. For nonlinear systems without global convergence guarantees, the region of convergence is often estimated, where any state trajectory that starts within this region stays within the region for all times and converges to a target state eventually [27]. For example, recently, [9] has proposed a learning-based Lyapunov stability verification for physical systems, whose dynamics are sequentially estimated by Gaussian processes. In the same vein, [2] has employed reachability analysis to construct safe regions in the state space by solving a partial differential equation. The main challenge of these methods is to find a suitable non-conservative Lyapunov function to conduct the analysis.

The IQC framework proposed in [33] has been widely used to analyze the stability of large-scale complex systems such as aircraft control [19]. The main advantages of IQC are its computational efficiency, non-conservatism, and unified treatment of a variety of nonlinearities and uncertainties. It has also been employed to analyze the stability of small-sized neural networks in reinforcement learning [28, 5]; however, in their analysis, the exact coefficients of the neural network need to be known a priori for the static stability analysis, and a region of safe coefficients needs to be calculated at each iteration for the dynamic stability analysis. This is computationally intensive, and it quickly becomes intractable when the neural network size grows. On the contrary, because the present analysis is based on a broad characterization of control functions with bounded gradients, it does not need to access the coefficients of the neural network (or any forms of the controller). In general, robust analysis using advanced methods such as structured singular value [38] or IQC can be conservative. There are only a few cases where the necessity conditions can be established, such as when the uncertain operator has a block diagonal structure of bounded singular values [16], but this set of uncertainties is much smaller than the set of performance-oriented controllers learned by RL. To this end, we are able to reduce conservatism of the results by introducing more informative quadratic constraints for those controllers, and analyze the necessity of the certificate criteria. This significantly extends the possibilities of stability-certified reinforcement learning to large and deep neural networks in nonlinear large-scale real-world systems, whose stability is otherwise impossible to certify using existing approaches.

## 3 Main results

This section introduces a set of quadratic constraints on gradient-bounded functions and describes the computation of a smoothness margin for linear (Theorem 3) and nonlinear systems (Theorem 4). Furthermore, we examine the conservatism of the certificate condition in Theorem 3 for linear systems.

The starting point of this analysis is a less conservative constraint on general vector-valued functions. We start by recalling the definition of a Lipschitz continuous function:

###### Definition 3 (Lipschitz continuous function)

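We define both the local and global versions of the Lipschitz continuity for a function $f: \mathcal{B} \to \mathbb{R}^m$:

1. The function $f$ is locally Lipschitz continuous on the open subset $\mathcal{B} \subseteq \mathbb{R}^n$ if there exists a constant $\xi > 0$ (i.e., Lipschitz constant of $f$ on $\mathcal{B}$) such that

$$|f(x) - f(y)| \le \xi |x - y|, \quad \forall x, y \in \mathcal{B}. \tag{19}$$

2. If $f$ is Lipschitz continuous on $\mathbb{R}^n$ with a constant $\xi$ (i.e., $\mathcal{B} = \mathbb{R}^n$ in (19)), then $f$ is called globally Lipschitz continuous with the Lipschitz constant $\xi$.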

Lipschitz continuity implies uniform continuity. The above definition also establishes a connection between locally and globally Lipschitz continuity. The norm in the definition can be any norm, but the Lipschitz constant depends on the particular choice of the norm. Unless otherwise stated, we use the Euclidean norm in our analysis.

To explore some useful properties of Lipschitz continuity, consider a scalar-valued function $f$ (i.e., $m = 1$). Let $h^{(j)}_{xy} = [x_1, \ldots, x_j, y_{j+1}, \ldots, y_n]^\top$ denote a hybrid vector between $x$ and $y$, with $h^{(0)}_{xy} = y$ and $h^{(n)}_{xy} = x$. Then, local Lipschitz continuity of $f$ on $\mathcal{B}$ implies that

$$\frac{\left| f(h^{(j)}_{xy}) - f(h^{(j-1)}_{xy}) \right|}{|x_j - y_j|} \le \xi, \quad \forall x, y \in \mathcal{B},\ x_j \ne y_j,\ j \in [n]. \tag{20}$$

If we were to assume that $f$ is differentiable, then it follows that its (partial) derivative is bounded by the Lipschitz constant. For a vector-valued function that is $\xi$-Lipschitz, it is necessary that every component be $\xi$-Lipschitz. In general, every continuously differentiable function is locally Lipschitz, but the reverse is not true, since the definition of Lipschitz continuity does not require differentiability. Indeed, by Rademacher’s theorem, if $f$ is locally Lipschitz on $\mathcal{B}$, then it is differentiable at almost every point in $\mathcal{B}$ [13].
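The coordinate-wise quotients in (20) can be computed directly. The sketch below assumes the hybrid vector takes its first `j` coordinates from `x` and the remaining ones from `y`, so that consecutive hybrid vectors differ in exactly one coordinate; `toy_f` is a hypothetical scalar function chosen for illustration.

```python
import math


def hybrid(x, y, j):
    """Hybrid vector: first j coordinates from x, remaining coordinates from y."""
    return list(x[:j]) + list(y[j:])


def coordinate_quotients(f, x, y):
    """Difference quotients of f along each coordinate where x and y differ."""
    quotients = []
    for j in range(1, len(x) + 1):
        if x[j - 1] != y[j - 1]:
            change = abs(f(hybrid(x, y, j)) - f(hybrid(x, y, j - 1)))
            quotients.append(change / abs(x[j - 1] - y[j - 1]))
    return quotients


# Hypothetical scalar-valued function; its gradient norm is at most sqrt(1.25).
def toy_f(x):
    return math.sin(x[0]) + 0.5 * x[1]


print(coordinate_quotients(toy_f, [0.3, 2.0], [-1.0, 0.5]))
```

Each quotient is bounded by the Lipschitz constant of `toy_f`, and in this example the individual quotients are also bounded by the tighter per-coordinate slopes (1 and 0.5), which is the kind of structure the quadratic constraint developed later exploits.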

For the purpose of stability analysis, we can express (19) as a point-wise quadratic constraint:

$$\begin{bmatrix} x - y \\ f(x) - f(y) \end{bmatrix}^\top \begin{bmatrix} \xi^2 I_n & 0 \\ 0 & -I_m \end{bmatrix} \begin{bmatrix} x - y \\ f(x) - f(y) \end{bmatrix} \ge 0, \quad \forall x, y \in \mathcal{B}. \tag{21}$$

The above constraint, nevertheless, can sometimes be too conservative, because it does not exploit the structure of a given problem. To elaborate on this, consider the function $f: \mathbb{R}^2 \to \mathbb{R}^2$ defined as

$$f(x_1, x_2) = \left[ \tanh(0.5 x_1) - a x_1,\ \sin(x_2) \right]^\top, \tag{22}$$

where $x_1, x_2 \in \mathbb{R}$ and $a$ is a deterministic but unknown parameter with a bounded magnitude. Clearly, to satisfy (19) on $\mathbb{R}^2$ for all possible tuples $(x_1, x_2)$, we need to choose $\xi \ge 1$ (i.e., the function has the Lipschitz constant 1). However, this characterization is too general in this case, because it ignores the non-homogeneity of $f_1$ and $f_2$, as well as the sparsity of the problem representation. Indeed, $f_1$ only depends on $x_1$ with its slope restricted to a bounded interval for all possible $a$, and $f_2$ only depends on $x_2$ with its slope restricted to $[-1, 1]$. In the context of controller design, the non-homogeneity of control outputs often arises from physical constraints and domain knowledge, and the sparsity of control architecture is inherent in scenarios with distributed local information. To explicitly address these requirements, we state the following quadratic constraint.

###### Lemma 1

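For a vector-valued function $f: \mathbb{R}^n \to \mathbb{R}^m$ that is differentiable with bounded partial derivatives on $\mathcal{B}$ (i.e., $\underline{\xi}_{ij} \le \partial_j f_i(x) \le \overline{\xi}_{ij}$ for all $x \in \mathcal{B}$), the following quadratic constraint is satisfied for all $\lambda_{ij} \ge 0$, $i \in [m]$, $j \in [n]$, and $x, y \in \mathcal{B}$:

$$\begin{bmatrix} x - y \\ q(x, y) \end{bmatrix}^\top M(\lambda; \xi) \begin{bmatrix} x - y \\ q(x, y) \end{bmatrix} \ge 0, \tag{23}$$

where $M(\lambda; \xi)$ is given by

$$M(\lambda; \xi) = \begin{bmatrix} \mathrm{diag}\left( \left\{ \sum_{i} \lambda_{ij} (\bar{c}_{ij}^2 - c_{ij}^2) \right\} \right) & U(\{\lambda_{ij}, c_{ij}\})^\top \\ U(\{\lambda_{ij}, c_{ij}\}) & \mathrm{diag}(\{-\lambda_{ij}\}) \end{bmatrix}, \tag{24}$$

where $\mathrm{diag}(x)$ denotes a diagonal matrix with diagonal entries specified by $x$, and $U(\{\lambda_{ij}, c_{ij}\}) = \left[ \mathrm{diag}(\{\lambda_{1j} c_{1j}\}) \ \cdots \ \mathrm{diag}(\{\lambda_{mj} c_{mj}\}) \right]^\top \in \mathbb{R}^{mn \times n}$ is determined by $\lambda$ and $c$, $\{\lambda_{ij}\}$ is a set of non-negative multipliers that follow the same index order as $q$, $c_{ij} = \frac{1}{2}(\underline{\xi}_{ij} + \overline{\xi}_{ij})$, $\bar{c}_{ij} = \overline{\xi}_{ij} - c_{ij}$, and $q \in \mathbb{R}^{mn}$ is related to the output of $f$ by the constraint:

$$f(x) - f(y) = \left[ I_m \otimes \mathbf{1}_{1 \times n} \right] q = W q, \tag{25}$$

where $\otimes$ denotes the Kronecker product.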

###### Proof

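For a vector-valued function $f$ that is differentiable with bounded partial derivatives on $\mathcal{B}$ (i.e., $\underline{\xi}_{ij} \le \partial_j f_i(x) \le \overline{\xi}_{ij}$ for all $x \in \mathcal{B}$), there exist functions $\delta_{ij}(x, y)$ bounded by $\underline{\xi}_{ij} \le \delta_{ij}(x, y) \le \overline{\xi}_{ij}$ for all $i \in [m]$ and $j \in [n]$ such that

$$f(x) - f(y) = \begin{bmatrix} \sum_{j=1}^{n} \delta_{1j}(x, y)(x_j - y_j) \\ \vdots \\ \sum_{j=1}^{n} \delta_{mj}(x, y)(x_j - y_j) \end{bmatrix}. \tag{26}$$

By defining $q_{ij} = \delta_{ij}(x, y)(x_j - y_j)$, since $(\delta_{ij}(x, y) - c_{ij})^2 \le \bar{c}_{ij}^2$, it follows that

$$\begin{bmatrix} x_j - y_j \\ q_{ij} \end{bmatrix}^\top \begin{bmatrix} \bar{c}_{ij}^2 - c_{ij}^2 & c_{ij} \\ c_{ij} & -1 \end{bmatrix} \begin{bmatrix} x_j - y_j \\ q_{ij} \end{bmatrix} \ge 0. \tag{27}$$

The result follows by introducing nonnegative multipliers $\lambda_{ij} \ge 0$, and the fact that $f_i(x) - f_i(y) = \sum_{j=1}^{n} q_{ij}$.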

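The above bound is a direct consequence of standard tools in real analysis [54]. To understand this result, it can be observed that (23) is equivalent to:

$$\sum_{i,j} \lambda_{ij} \left( (\bar{c}_{ij}^2 - c_{ij}^2)(x_j - y_j)^2 + 2 c_{ij} q_{ij} (x_j - y_j) - q_{ij}^2 \right) \ge 0, \quad \forall \lambda_{ij} \ge 0, \tag{28}$$

with $q_{ij} = \delta_{ij}(x, y)(x_j - y_j)$, where $\delta_{ij}$ depends on $x$ and $y$. Since (28) holds for all $\lambda_{ij} \ge 0$, it is equivalent to the condition that $(\bar{c}_{ij}^2 - c_{ij}^2)(x_j - y_j)^2 + 2 c_{ij} q_{ij} (x_j - y_j) - q_{ij}^2 \ge 0$ for all $i \in [m]$ and $j \in [n]$, which is a direct result of the bounds imposed on the partial derivatives of $f$. To illustrate its usage, let us apply the constraint to characterize the example function (22), where the bounds on $\partial_1 f_1$ and $\partial_2 f_2$ are the slope ranges noted above, and all the other bounds ($\underline{\xi}_{12}$, $\overline{\xi}_{12}$, $\underline{\xi}_{21}$, $\overline{\xi}_{21}$) are zero. This clearly yields a more informative constraint than merely relying on the Lipschitz constraint (21). In fact, for a differentiable $\xi$-Lipschitz function, we have $\overline{\xi}_{ij} = -\underline{\xi}_{ij} = \xi$, and by restricting the choice of the multipliers $\lambda_{ij}$, (28) is reduced to (21). However, as illustrated in this example, the quadratic constraint in Lemma 1 can incorporate richer information about the structure of the problem; therefore, it often gives rise to non-trivial stability bounds in practice.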
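The per-entry inequality behind (28) can be checked numerically on the example function (22). The sketch below fixes the unknown parameter at an assumed value `a = 0.1` and uses assumed slope bounds of `[-0.1, 0.6]` for the first component and `[-1, 1]` for the second; neither value is taken from the text. It also uses the centre `c = (lower + upper) / 2` and half-width `c_bar = upper - c`, which is the parametrisation consistent with (27).

```python
import math
import random

A_PARAM = 0.1  # assumed value of the unknown parameter a in (22)

# Assumed bounds on d f_i / d x_j; off-diagonal bounds are zero by sparsity.
LOWER = [[-0.1, 0.0], [0.0, -1.0]]
UPPER = [[0.6, 0.0], [0.0, 1.0]]


def f(x):
    return [math.tanh(0.5 * x[0]) - A_PARAM * x[0], math.sin(x[1])]


def phi(lo, hi, d, q):
    """One summand of (28): (c_bar^2 - c^2) d^2 + 2 c q d - q^2."""
    c = 0.5 * (lo + hi)
    c_bar = hi - c
    return (c_bar ** 2 - c ** 2) * d ** 2 + 2.0 * c * q * d - q ** 2


def lemma1_terms(x, y):
    """All four summands of (28) for the pair (x, y)."""
    fx, fy = f(x), f(y)
    # f_i depends only on x_i here, so q_ii carries the whole change in f_i.
    q = [[fx[0] - fy[0], 0.0], [0.0, fx[1] - fy[1]]]
    return [
        phi(LOWER[i][j], UPPER[i][j], x[j] - y[j], q[i][j])
        for i in range(2)
        for j in range(2)
    ]


def min_term_over_samples(n_samples, seed=0):
    """Smallest summand observed over random pairs drawn from [-5, 5]^2."""
    rng = random.Random(seed)
    worst = float("inf")
    for _ in range(n_samples):
        x = [rng.uniform(-5.0, 5.0), rng.uniform(-5.0, 5.0)]
        y = [rng.uniform(-5.0, 5.0), rng.uniform(-5.0, 5.0)]
        worst = min(worst, min(lemma1_terms(x, y)))
    return worst
```

Every summand should be nonnegative up to rounding error, so any nonnegative weighting by the multipliers preserves the inequality.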

The constraint introduced above is not a classical IQC, since it involves an intermediate variable $q$ that relates to the output through a set of linear equalities. For stability analysis, let $x^*$ be the equilibrium point, and without loss of generality, assume that $x^* = 0$ and $f(x^*) = 0$. Then, one can define the quadratic functions

$$\phi_{ij}(x, q) = (\bar{c}_{ij}^2 - c_{ij}^2) x_j^2 + 2 c_{ij} q_{ij} x_j - q_{ij}^2,$$

and the condition (23) can be written as

$$\sum_{i,j} \lambda_{ij} \phi_{ij}(x, q) \ge 0, \quad \forall \lambda_{ij} \ge 0, \tag{29}$$

which can be used to characterize the set of $(x, q)$ associated with the function $f$, as we will discuss in Section 3.4.

To simplify the mathematical treatment, we have focused on differentiable functions in Lemma 1; nevertheless, the analysis can be extended to non-differentiable but continuous functions (e.g., the ReLU function $\max\{0, x\}$) using the notion of generalized gradient [13, Chap. 2]. In brief, by re-assigning the bounds on partial derivatives to uniform bounds on the set of generalized partial derivatives, the constraint (23) can be directly applied.

In relation to the existing IQCs, this constraint has wider applications for the characterization of gradient-bounded functions. The Zames-Falb IQC introduced in [53] has been widely used for single-input single-output (SISO) functions $f: \mathbb{R} \to \mathbb{R}$, but it requires the function to be monotone with the slope restricted to $[\alpha, \beta]$ with $\alpha \ge 0$, i.e., $0 \le \alpha \le \frac{f(x) - f(y)}{x - y} \le \beta$ whenever $x \ne y$. The MIMO extension holds true only if the nonlinear function is restricted to be the gradient of a convex real-valued function [40, 24]. As for the sector IQC, the scalar version cannot be used (because it requires $f_i(x) = 0$ whenever there exists $j \in [n]$ such that $x_j = 0$, which is extremely restrictive), and the vector version is in fact (21). In contrast, the quadratic constraint in Lemma 1 can be applied to non-monotone, vector-valued Lipschitz functions.

### 3.2 Computation of the smoothness margin

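With the newly developed quadratic constraint in place, this subsection explains the computation for a smoothness margin of an LTI system $G$, whose state-space representation is given by:

$$\begin{cases} \dot{x}_G = A x_G + B u \\ w = \pi(x_G) \\ u = e + w \end{cases} \tag{30}$$

where $x_G$ is the state (the dependence on $t$ is omitted for simplicity). The system is assumed to be stable, i.e., $A$ is Hurwitz. We can connect this linear system in feedback with a controller $\pi$. The signal $e$ is the exploration vector introduced in reinforcement learning, and $w$ is the policy action. We are interested in certifying the set of gradient bounds $\xi$ of $\pi$ such that the interconnected system is input-output stable at all times $T \ge 0$, i.e.,

$$\int_0^T |y(t)|^2 \, dt \le \gamma^2 \int_0^T |e(t)|^2 \, dt, \tag{31}$$

where $\gamma$ is a finite upper bound for the $L_2$ gain. Let $A \succeq B$ or $A \succ B$ denote that $A - B$ is positive semidefinite or positive definite, respectively. To this end, define $\mathrm{SDP}(P, \lambda, \gamma, \xi)$ as follows:

$$\mathrm{SDP}(P, \lambda, \gamma, \xi): \quad \begin{bmatrix} O(P, \lambda, \xi) & S(P) \\ S(P)^\top & -\gamma I \end{bmatrix} \prec 0, \tag{32}$$

where $P = P^\top \succ 0$ and

$$O(P, \lambda, \xi) = \begin{bmatrix} A^\top P + P A & P B W \\ W^\top B^\top P & 0 \end{bmatrix} + \frac{1}{\gamma} \begin{bmatrix} I & 0 \\ 0 & 0 \end{bmatrix} + M(\lambda; \xi), \qquad S(P) = \begin{bmatrix} P B \\ 0 \end{bmatrix},$$

where $M(\lambda; \xi)$ is defined in (24). We will show next that the stability of the interconnected system can be certified using linear matrix inequalities.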

###### Theorem 3

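Let $G$ be stable (i.e., $A$ is Hurwitz) and $\pi$ be a bounded causal controller. Assume that:

1. the interconnection of $G$ and $\pi$ is well-posed;

2. $\pi$ has bounded partial derivatives on $\mathcal{B}$ (i.e., $\underline{\xi}_{ij} \le \partial_j \pi_i(x) \le \overline{\xi}_{ij}$, for all $x \in \mathcal{B}$, $i \in [n_a]$, and $j \in [n_s]$).

If there exist $P \succ 0$, $\lambda \ge 0$, and a scalar $\gamma > 0$ such that $\mathrm{SDP}(P, \lambda, \gamma, \xi)$ is feasible, then the feedback interconnection of $G$ and $\pi$ is stable (i.e., it satisfies (31)).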

###### Proof

The proof follows a standard dissipation argument. To proceed, we multiply the augmented matrix in (32) by $[x_G^\top \; q^\top \; e^\top]$ on the left and by its transpose on the right, and use the constraints $w = Wq$ and $y = x_G$. Then, $\text{SDP}(P, \lambda, \gamma, \xi)$ can be written as a dissipation inequality:

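$$
\dot{V}(x_G) +
\begin{bmatrix} x_G \\ q \end{bmatrix}^\top
M(\lambda; \xi)
\begin{bmatrix} x_G \\ q \end{bmatrix}
< \gamma e^\top e - \frac{1}{\gamma} y^\top y,
$$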

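where $V(x_G) = x_G^\top P x_G$ is known as the storage function, and $\dot{V}(x_G)$ is its derivative with respect to time $t$. Because the second term is guaranteed to be non-negative by Lemma 1, if $\text{SDP}(P, \lambda, \gamma, \xi)$ is feasible with a solution $(P, \lambda, \gamma, \xi)$, we have: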

$$
\dot{V}(x_G) + \frac{1}{\gamma} y^\top y - \gamma e^\top e < 0, \tag{33}
$$

which is satisfied at all times $t$. From well-posedness, the above inequality can be integrated from $t = 0$ to $t = T$, and then it follows from $P \succeq 0$ that:

$$
\int_0^T |y(t)|^2 \, dt \le \gamma^2 \int_0^T |e(t)|^2 \, dt. \tag{34}
$$

Hence, the interconnected system with the RL policy is stable.
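The integration step in the proof can be spelled out as follows. This is a sketch under two assumptions that are not stated explicitly in the text: the storage function is the quadratic form $V(x_G) = x_G^\top P x_G$ with $P \succeq 0$, and the system starts at rest, so that $V(x_G(0)) = 0$.

```latex
\begin{align*}
0 &\ge \int_0^T \Big( \dot{V}(x_G) + \tfrac{1}{\gamma}\, y^\top y - \gamma\, e^\top e \Big)\, dt \\
  &= V(x_G(T)) - V(x_G(0))
     + \tfrac{1}{\gamma} \int_0^T |y(t)|^2 \, dt
     - \gamma \int_0^T |e(t)|^2 \, dt \\
  &\ge \tfrac{1}{\gamma} \int_0^T |y(t)|^2 \, dt
     - \gamma \int_0^T |e(t)|^2 \, dt .
\end{align*}
```

The last line drops $V(x_G(T)) \ge 0$, which holds because $P \succeq 0$, and uses $V(x_G(0)) = 0$. Multiplying through by $\gamma > 0$ and rearranging gives the gain bound (34).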

The above theorem requires that $G$ be stable when there is no feedback policy $\pi$. This is automatically satisfied in many physical systems with an existing stabilizing (but not performance-optimizing) controller. In the case that the original system is not stable, one needs to first design a controller to stabilize the system or design the controller under uncertainty (in this case, the RL policy), which are well-studied problems in the literature (e.g., controller synthesis [16]). Then, the result can be used to ensure stability while delegating reinforcement learning to optimize the performance of the policy under gradient bounds.

The above result essentially suggests a computational approach in robust control analysis. Given a stable LTI system as in (30), the first step is to represent the RL policy as an uncertainty block in a feedback interconnection. Because the parameters of the neural network policy may not be known a priori and will be continuously updated during learning, we characterize it using bounds on partial gradients (e.g., if it is known that the action is positively correlated with a certain observation metric, we can specify its partial gradient to be mostly positive with only a small negative margin). A simple but conservative choice is an $\mathcal{L}_2$-gain bound IQC; nevertheless, to achieve a less conservative result, we can employ the quadratic constraint developed in Lemma 1, which exploits both the sparsity of the control architecture and the non-homogeneity of the outputs. For a given set of gradient bounds $\xi$, we find the smallest $\gamma$ such that (32) is feasible, and $\gamma$ corresponds to the upper bound on the $\mathcal{L}_2$ gain of the interconnected system both during learning (with the excitation $e$ added to facilitate policy exploration) and actual deployment. If $\gamma$ is finite, then the system is provably stable in the sense of (31).
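To make the meaning of the gain bound in (31) concrete, the following minimal sketch simulates a scalar toy interconnection and compares the empirical energy ratio with a hand-derived bound. Everything in it is an assumption chosen for illustration: the plant values `a` and `b`, the `tanh` policy, the sinusoidal excitation, and the helper name `simulate_gain`. The bound `gamma_bound = |b| / c` is obtained by completing the square in the scalar case, not by solving (32).

```python
import math

def simulate_gain(a, b, policy, excitation, T=20.0, dt=1e-3):
    """Forward-Euler simulation of x' = a*x + b*(e + policy(x)), y = x, x(0) = 0.

    Returns sqrt(int |y|^2 dt / int |e|^2 dt), an empirical estimate that the
    true L2 gain must upper-bound.
    """
    x = 0.0
    energy_y = 0.0
    energy_e = 0.0
    steps = int(round(T / dt))
    for k in range(steps):
        t = k * dt
        e = excitation(t)
        energy_y += x * x * dt
        energy_e += e * e * dt
        u = e + policy(x)          # u = e + w with w = policy(x)
        x += dt * (a * x + b * u)  # Euler step of the plant dynamics
    return math.sqrt(energy_y / energy_e)

# Toy data (assumptions): stable plant a < 0, policy slope in [0, xi_max].
a, b, xi_max = -1.0, 1.0, 0.5
policy = lambda x: xi_max * math.tanh(x)  # derivative lies in [0, 0.5]
excitation = lambda t: math.sin(t)        # stand-in for the exploration signal

c = -(a + b * xi_max)     # worst-case closed-loop decay rate, must be > 0
gamma_bound = abs(b) / c  # hand-derived scalar gain bound
gain_est = simulate_gain(a, b, policy, excitation)
print(f"empirical gain {gain_est:.3f} <= bound {gamma_bound:.3f}")
```

A single excitation signal only gives a lower estimate of the gain, so a simulation like this can falsify a claimed bound but cannot certify one; certification is what the feasibility of (32) provides.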

We remark that $\text{SDP}(P, \lambda, \gamma, \xi)$ is quasiconvex, in the sense that it reduces to a standard LMI with a fixed $\gamma$. To solve it numerically, we start with a small $\gamma$ and gradually increase it until a solution $(P, \lambda)$ is found. This is repeated for multiple sets of $\xi$. Each iteration (i.e., an LMI for a given set of $\gamma$ and $\xi$) can be solved efficiently by interior-point methods. As an alternative to searching on $\gamma$ for a given $\xi$, more sophisticated methods for solving the generalized eigenvalue optimization problem can be employed [11].

### 3.3 Extension to nonlinear systems with uncertainty

The previous analysis for LTI systems can be extended to a generic nonlinear system described in (1). The key idea is to model the nonlinear and potentially time-varying part as an uncertain block with IQC constraints on its behavior. Specifically, consider the LTI component $G$:

$$
\begin{cases}
\dot{x}_G = A x_G + B u + v \\
y = x_G
\end{cases} \tag{35}
$$

where $x_G$ is the state and $y$ is the output. The linearized system is assumed to be stable, i.e., $A$ is Hurwitz. The nonlinear part is connected in feedback:

$$
\begin{cases}
u = e + w \\
w = \pi(y) \\
v = g_t(y)
\end{cases} \tag{36}
$$

where $e$ and $w$ are defined as before, and $g_t$ is the nonlinear and time-varying component. In addition to characterizing $\pi$ using the Lipschitz property as in (23), we assume that $g_t$ satisfies the IQC defined by $(\Psi, M_g)$ as in Definition 2. The system $\Psi$ has the state-space representation:

$$
\begin{cases}
\dot{\psi} = A_\psi \psi + B_\psi^v v + B_\psi^y y \\
z = C_\psi \psi + D_\psi^v v + D_\psi^y y,
\end{cases} \tag{37}
$$

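where $\psi$ is the internal state and $z$ is the filtered output. By denoting $x = [x_G^\top \; \psi^\top]^\top$ as the new state, one can combine (35) and (37) via eliminating $y$ and letting $w = Wq$:

$$
\begin{cases}
\dot{x} = \underbrace{\begin{bmatrix} A & 0 \\ B_\psi^y & A_\psi \end{bmatrix}}_{\underline{A}} x
+ \underbrace{\begin{bmatrix} B \\ 0 \end{bmatrix}}_{\underline{B}_e} e
+ \underbrace{\begin{bmatrix} BW \\ 0 \end{bmatrix}}_{\underline{B}_q} q
+ \underbrace{\begin{bmatrix} I \\ B_\psi^v \end{bmatrix}}_{\underline{B}_v} v \\
z = \underbrace{\begin{bmatrix} D_\psi^y & C_\psi \end{bmatrix}}_{\underline{C}} x + D_\psi^v v
\end{cases} \tag{38}
$$

where $\underline{A}$, $\underline{B}_e$, $\underline{B}_q$, $\underline{B}_v$, and $\underline{C}$ are matrices of proper dimensions defined above. Similar to the case of LTI systems, the objective is to find the gradient bounds on $\pi$ such that the system becomes stable in the sense of (31). In the same vein, we define $\underline{\text{SDP}}(P, \lambda, \gamma, \xi)$ as: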

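$$
\underline{\text{SDP}}(P, \lambda, \gamma, \xi): \quad
\begin{bmatrix}
O(P, \lambda, \xi) & O_v(P) & S(P) \\
O_v(P)^\top & D_\psi^{v\top} M_g D_\psi^v & 0 \\
S(P)^\top & 0 & -\gamma I
\end{bmatrix} \prec 0, \tag{39}
$$

where $P = P^\top \succeq 0$, and

$$
\begin{aligned}
O(P, \lambda, \xi) &=
\begin{bmatrix}
\underline{A}^\top P + P \underline{A} & P \underline{B}_q \\
\underline{B}_q^\top P & 0
\end{bmatrix}
+
\begin{bmatrix}
\underline{C}^\top M_g \underline{C} & 0 \\
0 & 0
\end{bmatrix}
+ M(\lambda; \xi)
+ \frac{1}{\gamma}
\begin{bmatrix}
I & 0 \\
0 & 0
\end{bmatrix}, \\
O_v(P) &=
\begin{bmatrix}
\underline{C}^\top M_g D_\psi^v + P \underline{B}_v \\
0
\end{bmatrix},
\qquad
S(P) =
\begin{bmatrix}
P \underline{B}_e \\
0
\end{bmatrix},
\end{aligned}
$$

where $M(\lambda; \xi)$ is defined in (24). The next theorem provides a stability certificate for the nonlinear time-varying system (1).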

###### Theorem 4

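Let $G$ be stable (i.e., $A$ in (35) is Hurwitz) and $\pi$ be a bounded causal controller. Assume that:

1. the interconnection of $G$, $\pi$, and $g_t$ is well-posed;

2. $\pi$ has bounded partial derivatives on $\mathcal{B}$ (i.e., $\underline{\xi}_{ij} \le \partial_j \pi_i(x) \le \overline{\xi}_{ij}$ for all $x \in \mathcal{B}$ and all output indices $i$ and input indices $j$);

3. $g_t \in \text{IQC}(\Psi, M_g)$, where $\Psi$ is stable.

If there exist $P \succeq 0$, $\lambda \ge 0$, and a scalar $\gamma > 0$ such that $\underline{\text{SDP}}(P, \lambda, \gamma, \xi)$ in (39) is feasible, then the feedback interconnection of the nonlinear system (1) and $\pi$ is stable (i.e., it satisfies (31)).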

###### Proof

The proof is in the same vein as that of Theorem 3. The main technical difference is the consideration of the filtered state $\psi$ and the output $z$ to impose IQC constraints on the nonlinearities in the dynamical system [33]. The dissipation inequality follows by multiplying both sides of the matrix in (39) by