## 1 Introduction

Neural network decision making and control has seen huge advances recently, accompanied by the success of reinforcement learning (RL) (sutton2018reinforcement). In particular, deep reinforcement learning (DRL) has achieved super-human performance with neural network policies (also referred to as controllers in control tasks) in various domains (mnih2015human; lillicrap2016continuous; silver2016mastering).

Policy gradient (sutton1999policy) is one of the most important approaches to DRL that synthesizes policies for continuous decision making problems. For control tasks, the policy gradient method and its variants have successfully synthesized neural network controllers to accomplish complex control goals (levine2018learning) without solving potentially nonlinear planning problems at test time (levine2016end). However, most of these methods focus on maximizing the reward function, which only indirectly enforces desirable properties. Specifically, global stability of the closed-loop system (sastry2013nonlinear) guarantees convergence to the desired state of origin from any initial state and is therefore a very important property for safety-critical systems (e.g., aircraft control (chakraborty2011susceptibility)) where not a single diverging trajectory is acceptable. However, the set of parameters corresponding to stabilizing controllers is in general nonconvex even in the simple setting of linear systems with linear controllers (fazel2018global), which poses significant computational challenges for neural network controllers under the general setting of nonlinear systems.

Thanks to recent robustness studies of deep learning, we have seen attempts at giving stability certificates and/or ensuring stability at test time for fully observed systems controlled by neural networks. Yet stability problems for neural network controlled partially observed systems remain open. Unlike fully observed control systems, where the plant states are fully revealed to the controller, most real-world control systems are only partially observed due to modeling inaccuracy, sensing limitations, and physical constraints (braziunas2003pomdp). Here, sensible estimates of the full system state usually depend on historical observations (callier2012linear). Some partially observed systems are modeled using a partially observed Markov decision process (POMDP) (monahan1982state), for which finding an optimal solution is NP-hard in general (mundhenk2000complexity).

Paper contributions. In this paper, we propose a method to synthesize recurrent neural network (RNN) controllers with exponential stability guarantees for partially observed systems. We derive a convex inner approximation to the non-convex set of stable RNN parameters based on integral quadratic constraints (megretski1997system), loop transformation (sastry2013nonlinear, Chap. 4) and a sequential semidefinite convexification technique, which guarantees exponential stability for both linear time invariant (LTI) systems and general nonlinear uncertain systems. A novel framework of projected policy gradient is proposed to maximize some unknown/complex reward function and ensure stability in the online setting, where a guaranteed-stable RNN controller is synthesized and iteratively updated while interacting with and controlling the underlying system. This differentiates our work from most post-hoc validation methods. Finally, we carry out comprehensive comparisons with policy gradient, and demonstrate that our method effectively ensures closed-loop stability and achieves higher reward on a variety of control tasks, including vehicle lateral control and power system frequency regulation.

Paper outline. In Section 2, we outline related works on addressing partial observability, and enforcing stability in reinforcement learning. Section 3 discusses our proposed method for synthesizing RNN controllers for LTI plants with stability guarantees, and Section 4 extends it to systems with uncertainties and nonlinearities. Section 5 compares the proposed projected policy gradient method with policy gradient through numerical experiments.

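Notation. $\mathbb{S}^n$, $\mathbb{S}^n_{+}$, and $\mathbb{S}^n_{++}$ denote the sets of $n$-by-$n$ symmetric, positive semidefinite, and positive definite matrices, respectively. $\mathbb{D}^n_{+}$ and $\mathbb{D}^n_{++}$ denote the sets of $n$-by-$n$ diagonal positive semidefinite and diagonal positive definite matrices. The notation $\|\cdot\|: \mathbb{R}^n \to \mathbb{R}$ denotes the standard 2-norm. We define $\ell_{2e}^n$ to be the set of all one-sided sequences $x: \mathbb{N} \to \mathbb{R}^n$. The subset $\ell_2^n \subset \ell_{2e}^n$ consists of all square-summable sequences. When applied to vectors, the orders $>$ and $\leq$ are applied elementwise.

## 2 Related Work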

Partially Observed Decision Making and Output Feedback Control. In many problems (talpaert2019exploring; barto1995learning), only specific outputs, but not the full system states, are available to the decision maker. Therefore, memory in the controller is required to recover the full system states (scherer1997multiobjective). Control of these partially observed systems is often referred to as output feedback control (callier2012linear), and has been studied extensively from both control and optimization perspectives (doyle1978guaranteed; zheng2021analysis). Under the setting with convexifiable objectives (e.g., $\mathcal{H}_2$ or $\mathcal{H}_\infty$ performances), the optimal linear dynamic (i.e., with memory) controller can be obtained by using a change of variables or solving algebraic Riccati equations (gahinet1994linear; zhou1996robust). However, for more sophisticated settings with unknown and/or flexibly defined cost functions, the problems become intractable for the aforementioned traditional methods, and RL techniques have been proposed to reduce the computation cost and improve overall performance at test time, including those with static neural network controllers (levine2013guided; levine2016end) and those with dynamic controllers represented by RNNs/long short-term memory neural networks (zhang2016learning; heess2015memory; wierstra2007solving).

Stability Guarantees For Neural Network Controlled Systems. As neural networks become popular in control tasks, the safety and robustness of neural networks and neural network controlled systems have been actively discussed (morimoto2005robust; luo2014off; friedrich2017robust; berkenkamp2017safe; chow2018lyapunov; matni2019self; han2019h; recht2019tour; choi2020reinforcement; zhang2020policy; fazlyab2020safety). Closely related to this work are recent papers on robustness analysis of memoryless neural network controlled systems based on robust control ideas. The works (yin2021stability; yin2021imitation; pauli2021linear; jin2020stability) conduct stability analysis of neural network controlled linear and nonlinear systems and propose verification methods by characterizing activation functions using quadratic constraints. The work (donti2020enforcing) adds an additional projection layer to the controller to ensure stability for fully observed systems. The work (revay2020convex) studies the stability of an RNN itself when fitted to data but does not consider any plant controlled by such an RNN. The most related works are those that study dynamic neural network controllers. The works (anderson2007robust; knight2011stable) adapt RNN controllers through RL techniques to obtain stability guarantees. However, in these works, the reward function is assumed to be known, and conservative updates of controller parameters projected to a box neighborhood of the previous iterate are applied due to the non-convexity in their conditions. In contrast, our work enables much larger and more efficient updates thanks to jointly convex conditions derived through a novel sequential convexification and loop transformation approach unseen in these works.

## 3 Partially Observed Linear Systems

### Problem Formulation

Consider the feedback system (shown in Fig. 2) consisting of a plant $G$ and an RNN controller $\pi_\theta$, which is expected to stabilize the system (i.e., steer the states of $G$ to the origin). To streamline the presentation, we consider a partially observed, linear, time-invariant (LTI) system $G$ defined by the following discrete-time model:

$$x(k+1) = A_G\, x(k) + B_G\, u(k), \tag{1a}$$

$$y(k) = C_G\, x(k), \tag{1b}$$

where $x(k) \in \mathbb{R}^{n_G}$ is the state, $u(k) \in \mathbb{R}^{n_u}$ is the control input, and $y(k) \in \mathbb{R}^{n_y}$ is the output, with $A_G \in \mathbb{R}^{n_G \times n_G}$, $B_G \in \mathbb{R}^{n_G \times n_u}$, and $C_G \in \mathbb{R}^{n_y \times n_G}$. Since the plant is partially observed, the observation matrix $C_G$ may have a sparsity pattern or be column-rank deficient.

###### Assumption 1.

We assume that $(A_G, B_G)$ is stabilizable and $(A_G, C_G)$ is detectable. The definitions of stabilizability and detectability can be found in (callier2012linear).

###### Assumption 2.

We assume $A_G$, $B_G$, and $C_G$ are known.

Assumption 2 is partially lifted in Section 4 where we only assume partial information on the system dynamics.
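To make partial observability concrete, the following minimal sketch rolls out a discrete-time LTI plant of the form x(k+1) = A x(k) + B u(k), y(k) = C x(k) in plain Python. The matrices, the helper names (`mat_vec`, `simulate_plant`), and the double-integrator example are assumptions chosen for illustration; they are not taken from the paper's experiments.

```python
def mat_vec(M, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def vec_add(a, b):
    """Element-wise sum of two vectors."""
    return [x + y for x, y in zip(a, b)]


def simulate_plant(A, B, C, x0, inputs):
    """Roll out x(k+1) = A x(k) + B u(k), y(k) = C x(k).

    Returns the list of outputs y(0), ..., y(N-1) and the final state x(N).
    """
    x, outputs = list(x0), []
    for u in inputs:
        outputs.append(mat_vec(C, x))  # the controller only ever sees y(k)
        x = vec_add(mat_vec(A, x), mat_vec(B, u))
    return outputs, x


# Hypothetical two-state plant (a discretized double integrator).
# Only the first state (position) is measured, so C has one row.
A = [[1.0, 0.1], [0.0, 1.0]]
B = [[0.0], [0.1]]
C = [[1.0, 0.0]]

outputs, x_final = simulate_plant(A, B, C, x0=[1.0, -1.0], inputs=[[0.0]] * 3)
```

In this toy rollout the velocity never appears in `outputs`, which is why a controller with memory is needed to infer it from the history of position measurements.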

###### Problem 1.

Our goal is to find a controller that maps the observation to an action to both maximize some unknown reward over finite horizon and stabilize the plant .

The single-step reward is assumed to be unknown and potentially highly complex, so that it can capture a wide range of desired controller behaviors. For example, in many cases, to ensure extra safety, the reward is set to if there is a state violation at step . This cannot be captured by any simple negative quadratic function.

### Controllers Parameterization

Output feedback control with known and convexifiable reward has been studied extensively (scherer1997multiobjective), and linear dynamic controllers suffice for this case. However, in our problem setting, since the reward is unknown and nonconvex, and systems dynamics will become uncertain and nonlinear in Section 4, we consider a dynamic controller in the form of an RNN, which makes a class of high-capacity flexible controllers.

We model the RNN controller as an interconnection of an LTI system , and combined activation functions as shown in Fig. 2. This parameterization is expressive, and contains many widely used model structures (revay2020convex). The RNN is defined as follows

(2) |

where is the hidden state, are the input and output of , and matrices are parameters to be learned. Define as the collection of the learnable parameters of . We assume the initial condition of to be zero . The combined nonlinearity is applied element-wise, i.e., , where is the -th scalar activation function. We assume that the activation has a fixed point at origin, i.e. .
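The interconnection of an LTI block with an element-wise activation can be sketched in a few lines. The snippet below assumes one common way of writing such a controller: a hidden state `xi`, a pre-activation `v = C2 xi + D3 y`, an activation output `w = phi(v)`, and state and control updates that are linear in `(xi, w, y)`. The parameter names in the dictionary and the numerical values are hypothetical, and the exact form of (2) may differ.

```python
import math


def matvec(M, v):
    """Matrix (list of rows) times vector (list)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def vec_sum(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(vals) for vals in zip(*vectors)]


def rnn_controller_step(p, xi, y, activation=math.tanh):
    """One step of an RNN controller (assumed form, see the text above).

    p  : dict of parameter matrices (hypothetical names)
    xi : hidden state, y : current observation
    Returns the next hidden state and the control action.
    """
    v = vec_sum(matvec(p["C2"], xi), matvec(p["D3"], y))  # pre-activation
    w = [activation(v_i) for v_i in v]                    # element-wise
    xi_next = vec_sum(matvec(p["A"], xi), matvec(p["B1"], w), matvec(p["B2"], y))
    u = vec_sum(matvec(p["C1"], xi), matvec(p["D1"], w), matvec(p["D2"], y))
    return xi_next, u


# Hypothetical parameters: 2 hidden states, 1 activation, 1 input, 1 output.
params = {
    "A": [[0.5, 0.0], [0.1, 0.3]], "B1": [[0.2], [0.0]], "B2": [[1.0], [0.0]],
    "C1": [[1.0, -1.0]], "D1": [[0.5]], "D2": [[0.0]],
    "C2": [[0.3, 0.7]], "D3": [[1.0]],
}

# Zero hidden state and zero observation: the activation's fixed point at the
# origin keeps both the state and the action at zero.
xi_next, u = rnn_controller_step(params, xi=[0.0, 0.0], y=[0.0])
```

Because the activation has a fixed point at the origin, the origin is an equilibrium of the controller, which is what makes the origin a meaningful target for closed-loop stability.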

### Quadratic Constraints for Activation Functions

The stability condition relies on quadratic constraints (QCs) to bound the activation function. A typical QC is the sector bound as defined next.

###### Definition 3.1.

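Let $\alpha \leq \beta$ be given. The function $\phi: \mathbb{R} \to \mathbb{R}$ lies in the sector $[\alpha, \beta]$ if:

$$(\phi(\nu) - \alpha \nu)\,(\beta \nu - \phi(\nu)) \geq 0 \quad \forall \nu \in \mathbb{R}. \tag{3}$$

The interpretation of the sector $[\alpha, \beta]$ is that $\phi$ lies between lines passing through the origin with slopes $\alpha$ and $\beta$. Many activations are sector bounded, e.g., leaky ReLU is sector bounded in $[a, 1]$ with its parameter $a \in (0, 1)$; ReLU and $\tanh$ are sector bounded in $[0, 1]$ (denoted as sector $[0, 1]$). Fig. 4 illustrates different activations (blue solid) and their sector bounds (green dashed).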
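The sector condition is easy to check numerically. The sketch below tests the product inequality (phi(v) - alpha v)(beta v - phi(v)) >= 0 on a grid of sample points for a few common activations; the function names, the leaky ReLU slope of 0.1, and the sample grid are assumptions made for illustration, and a grid check is only a sanity test, not a proof.

```python
import math


def in_sector(phi, alpha, beta, samples, tol=1e-12):
    """Check (phi(v) - alpha*v) * (beta*v - phi(v)) >= 0 on all sample points."""
    return all(
        (phi(v) - alpha * v) * (beta * v - phi(v)) >= -tol for v in samples
    )


def relu(v):
    return max(0.0, v)


def leaky_relu(v, a=0.1):
    return v if v >= 0 else a * v


samples = [k / 10.0 for k in range(-50, 51)]

tanh_ok = in_sector(math.tanh, 0.0, 1.0, samples)         # sector [0, 1]
relu_ok = in_sector(relu, 0.0, 1.0, samples)              # sector [0, 1]
leaky_ok = in_sector(leaky_relu, 0.1, 1.0, samples)       # sector [a, 1]
tanh_too_tight = in_sector(math.tanh, 0.0, 0.5, samples)  # slope 0.5 is too small
```

The last check fails because tanh has slope 1 at the origin, so no upper slope below 1 can bound it near zero.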

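Sector constraints can also be defined for combined activations $\phi$. Assume the $i$-th scalar activation $\phi_i$ in $\phi$ is sector bounded by $[\alpha_i, \beta_i]$, $i = 1, \dots, n_\phi$, then these sectors can be stacked into vectors $\alpha_\phi, \beta_\phi \in \mathbb{R}^{n_\phi}$, where $\alpha_\phi = [\alpha_1, \dots, \alpha_{n_\phi}]$ and $\beta_\phi = [\beta_1, \dots, \beta_{n_\phi}]$, to provide QCs satisfied by $\phi$.

###### Lemma 3.1.

Let $\alpha_\phi, \beta_\phi \in \mathbb{R}^{n_\phi}$ be given with $\alpha_\phi \leq \beta_\phi$. Suppose that $\phi$ satisfies the sector bound $[\alpha_\phi, \beta_\phi]$ element-wise. For any $\Lambda \in \mathbb{D}^{n_\phi}_{+}$, and for all $v \in \mathbb{R}^{n_\phi}$ and $w = \phi(v)$, it holds that

$$\begin{bmatrix} v \\ w \end{bmatrix}^\top \begin{bmatrix} -2 A_\phi B_\phi \Lambda & (A_\phi + B_\phi)\Lambda \\ (A_\phi + B_\phi)\Lambda & -2\Lambda \end{bmatrix} \begin{bmatrix} v \\ w \end{bmatrix} \geq 0, \tag{4}$$

where $A_\phi = \mathrm{diag}(\alpha_\phi)$, and $B_\phi = \mathrm{diag}(\beta_\phi)$.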

A proof is available in (fazlyab2020safety).
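In the scalar case, a quadratic constraint of this kind reduces to the sector inequality itself. Assuming the lemma takes the standard stacked form used in (fazlyab2020safety), with a scalar multiplier $\lambda \geq 0$, sector $[\alpha, \beta]$, and a pair $(v, w)$ with $w = \phi(v)$ (symbols chosen here for illustration), expanding the quadratic form gives:

```latex
\begin{aligned}
\begin{bmatrix} v \\ w \end{bmatrix}^{\top}
\begin{bmatrix} -2\alpha\beta\lambda & (\alpha+\beta)\lambda \\
                (\alpha+\beta)\lambda & -2\lambda \end{bmatrix}
\begin{bmatrix} v \\ w \end{bmatrix}
&= 2\lambda\left(-\alpha\beta v^{2} + (\alpha+\beta)\,v w - w^{2}\right) \\
&= 2\lambda\,(w - \alpha v)(\beta v - w) \;\geq\; 0 .
\end{aligned}
```

Each diagonal entry of the multiplier therefore weights one scalar sector inequality, and the stacked constraint is a nonnegative combination of them.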

### Loop Transformation

To derive convex stability conditions for their efficient enforcement in the learning process, we first perform a loop transformation on the RNN as shown in Fig. 4. Through loop transformation, we obtain a new representation of the controller , which is equivalent to the one shown in Fig. 2:

(5a) | ||||

(5b) |

The newly obtained nonlinearity , defined in Fig. 4, is sector bounded by , and thus it satisfies a simplified QC: for any , it holds that

(6) |
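The effect of normalizing a sector can be checked numerically. The sketch below assumes the standard decomposition phi(v) = ((alpha + beta)/2) v + ((beta - alpha)/2) phi_tilde(v), which moves the sector center into the linear part and leaves a nonlinearity phi_tilde in the sector [-1, 1]; the exact transformation defined in Fig. 4 may be arranged differently, and the function names are hypothetical.

```python
import math


def normalize_sector(phi, alpha, beta):
    """Return phi_tilde such that
    phi(v) = ((alpha + beta) / 2) * v + ((beta - alpha) / 2) * phi_tilde(v).
    Requires beta > alpha.
    """
    center = (alpha + beta) / 2.0
    radius = (beta - alpha) / 2.0
    return lambda v: (phi(v) - center * v) / radius


# tanh lies in the sector [0, 1]; its normalized version is 2*tanh(v) - v.
phi_tilde = normalize_sector(math.tanh, 0.0, 1.0)

samples = [k / 10.0 for k in range(-50, 51)]

# Sector [-1, 1] means |phi_tilde(v)| <= |v|, i.e. v^2 - phi_tilde(v)^2 >= 0,
# which is the scalar version of the simplified quadratic constraint.
normalized_ok = all(v * v - phi_tilde(v) ** 2 >= -1e-12 for v in samples)
```

The simplified constraint has no cross term between the input and output of the nonlinearity, which is what makes the resulting stability condition easier to convexify.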

The transformed system , defined in Fig. 4, is of the form:

where

(7) |

The derivation of can be found in Appendix A. We define the learnable parameters of as . Since there is a one-to-one correspondence (7) between the transformed parameters and the original parameters , we will learn in the reparameterized space and uniquely recover the original parameters accordingly.

### Convex Lyapunov Condition

The feedback system of plant and RNN controller in (5) is defined by the following equations

(8a) | ||||

(8b) | ||||

(8c) |

where gathers the states of and , and

Note that matrices are affine in . The following theorem incorporates the QC for in the Lyapunov condition to derive the exponential stability condition of the feedback system using the S-Lemma (yakubovich1971; boyd1994linear).

###### Theorem 3.1 (Sequential Convexification).

Consider the feedback system of plant in (1), and RNN controller in (5). Given a rate with , and matrices and , if there exist matrices and , and parameters such that the following condition holds

(9) |

then for any , we have for all , where cond is the condition number of , and i.e., the feedback system is exponentially stable with rate .

The above convex relaxation of the non-convex condition (22) leverages a “linearizing” semi-definite inequality based on a previous guess of and (as and ). A complete proof is provided in Appendix A. The linear matrix inequality (LMI) condition (9) is jointly convex in the decision variables , , , where and are the inverse matrices of the Lyapunov certificate and the multiplier in (22), and this allows for its efficient enforcement in the reinforcement learning process. Denote the LMI (9), , and altogether as LMI, which will later be incorporated in the policy gradient process to provide exponential stability guarantees.
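The "linearizing" step can be illustrated with a standard matrix inequality of this type. The following is a sketch under the assumption that the convexification lower-bounds a matrix inverse by an expression that is affine in the decision variable and built from a previous guess; the symbols $Q$ and $\bar{Q}$ are placeholders chosen here, and the inequality used in the actual proof may be arranged differently.

```latex
\begin{aligned}
0 \;\preceq\; \left(Q^{-1} - \bar{Q}^{-1}\right) Q \left(Q^{-1} - \bar{Q}^{-1}\right)
  &= Q^{-1} - 2\bar{Q}^{-1} + \bar{Q}^{-1} Q\, \bar{Q}^{-1}, \\
\Longrightarrow\qquad
Q^{-1} &\succeq 2\bar{Q}^{-1} - \bar{Q}^{-1} Q\, \bar{Q}^{-1},
\qquad Q \succ 0,\ \bar{Q} \succ 0 .
\end{aligned}
```

The right-hand side is affine in $Q$ and the bound holds with equality at $Q = \bar{Q}$, so replacing the inverse by this bound gives a sufficient condition that is tight at the previous guess. This suggests why a better guess reduces conservatism.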

Based on the stability condition (9), define the convex stability set as

(10) |

Given matrices and , any parameter drawn from ensures the exponential stability of the feedback system (8). The set is a convex inner-approximation to the set of parameters that renders the feedback system stable, and the choice of and affects the conservatism in the approximation. One way of choosing is provided in Algorithm 1.

###### Remark 3.1.

Although only sector bounds (6) are used to describe the activation functions, we can further reduce the conservatism by using off-by-one integral quadratic constraints (lessard2016analysis) to also capture the slope information of activation functions as done in (yin2021stability).

###### Remark 3.2.

Note that although we only consider LTI plant dynamics, the framework can be immediately extended to plant dynamics described by RNNs, or neural state space models provided in (2018Kim).

### Projected Policy Gradient

Policy gradient methods (sutton1999policy; williams1992simple) enjoy convergence to optimality under the tabular setting and achieve good empirical performance for more challenging problems. However, because they make few assumptions about the problem setting, they do not offer any stability guarantee for the closed-loop system. We propose the projected policy gradient method, which enforces the stability of the interconnected system while the policy is dynamically explored and updated.

Policy gradient approximates the gradient with respect to the parameters of a stochastic controller using samples of trajectories via (11) without any prior knowledge of plant parameters and the reward structures. Gradient ascent is then applied to refine the controller with the estimated gradients.

(11) |

In the above, represents the parameters of . is the expected reward (negative cost) of the controller . is the distribution of states under , where is a set of states. is the reward-to-go after executing control at state under , where is a set of actions.
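The idea of estimating a gradient purely from sampled rewards can be shown with a one-dimensional score-function (REINFORCE-style) estimator. This is a minimal sketch, not the estimator used in the paper: the Gaussian policy, the quadratic reward, the sample count, and the function name `estimate_policy_gradient` are all assumptions made for illustration.

```python
import random


def estimate_policy_gradient(theta, reward_fn, sigma=1.0, num_samples=20000, seed=0):
    """Score-function estimate of d/dtheta E[reward(a)] with a ~ N(theta, sigma^2).

    Only sampled rewards are used; the reward function is treated as a
    black box, as in the model-free setting.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        a = rng.gauss(theta, sigma)            # sample an action from the policy
        score = (a - theta) / sigma ** 2       # d/dtheta log N(a; theta, sigma^2)
        total += score * reward_fn(a)
    return total / num_samples


def reward(a):
    """Hypothetical reward, peaked at a = 2."""
    return -(a - 2.0) ** 2


# Analytically, E[reward] = -((theta - 2)^2 + sigma^2), so the true gradient
# at theta = 0 is 4; the sampled estimate is a noisy approximation of that.
grad_estimate = estimate_policy_gradient(0.0, reward)
```

In practice a baseline is usually subtracted from the reward to reduce the variance of this estimator; it is omitted here to keep the sketch short.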

Like any gradient method, policy gradient does not ensure that the controller lies in a specific preferred set (the set of stabilizing controllers in our setting). To that end, a projection to the stability set , , is applied between gradient updates, where is the updated parameter, and the projection operator is defined as the following convex program,

s.t. | (12) |

Through the recursively feasible projection step (i.e. the feasibility is inherited in subsequent steps, summarized in Theorem A.1 in Appendix A), we conclude with a projected policy gradient method to synthesize stabilizing RNN controllers as summarized in Algorithm 1 and illustrated in Fig. 5.

In the algorithm, the gradient step performs gradient ascent using the estimated gradient . The projection step projects the updated parameters from the gradient step to the convex stability set , where and are computed using and from the previous projection step. We choose , and construct based on the method in (scherer1997multiobjective).
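The alternation between a gradient step and a projection step can be sketched on a toy problem. The example below deliberately swaps in a much simpler setting than the paper's: a scalar plant x(k+1) = a x(k) + b u(k) with a static gain u = theta x, for which the set of gains with decay rate rho, {theta : |a + b theta| <= rho}, is an interval and the projection is a clip rather than a semidefinite program. All names and numbers are hypothetical.

```python
def project_to_stable_interval(theta, a, b, rho):
    """Euclidean projection onto {theta : |a + b*theta| <= rho}, assuming b > 0."""
    lo, hi = (-rho - a) / b, (rho - a) / b
    return min(max(theta, lo), hi)


def projected_gradient_ascent(theta0, grad_fn, a, b, rho, step=0.1, iters=50):
    """Alternate a gradient ascent step with a projection onto the stable set."""
    theta = project_to_stable_interval(theta0, a, b, rho)
    for _ in range(iters):
        theta = theta + step * grad_fn(theta)                 # gradient step
        theta = project_to_stable_interval(theta, a, b, rho)  # projection step
    return theta


# Hypothetical open-loop unstable plant and target decay rate.
a, b, rho = 1.2, 1.0, 0.9


def reward_gradient(theta):
    """Gradient of the toy reward -(theta - 0.5)^2, whose maximizer 0.5 is unstable."""
    return -2.0 * (theta - 0.5)


theta_final = projected_gradient_ascent(-1.2, reward_gradient, a, b, rho)
closed_loop_pole = a + b * theta_final
```

Every iterate stays inside the stable interval, and the method settles at the boundary point closest to the unconstrained maximizer, which mirrors the role of the projection in keeping the controller stabilizing throughout learning.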

###### Remark 3.3.

The projection step (12) is a semi-definite program (SDP) involving variables. The complexity of interior point SDP solvers usually scales cubically with the number of variables, potentially bringing a computational burden when is large. Fortunately, most high-dimensional problems admit low-dimensional structures (Wright-Ma-2021), and this overhead is only paid during training, with no further operations at deployment.

## 4 Partially Observed Nonlinear Systems with Uncertainty

In the context of RL, we often need to handle systems with nonlinear dynamics and/or unmodeled dynamics. Here we model such a nonlinear and uncertain plant (shown in Fig. 5(a)) as an interconnection of the nominal plant , and the perturbation representing the nonlinear, and uncertain part of the system. Therefore, in this new problem setting, we only require system dynamics to be partially known, and we use to cover the difference between the original real system dynamics, and partially known dynamics . The plant is defined by the following equations:

(13) |

where , , and are the state, control input, and output of the nominal plant , and and are the input and output of . The perturbation is a causal and bounded operator.

The perturbation can represent various types of uncertainties and nonlinearities, including sector bounded nonlinearities, slope restricted nonlinearities, and unmodeled dynamics. Thus considering extends our framework to the class of plants beyond LTI plants. The input-output relationship of is characterized with an integral quadratic constraint (IQC) (megretski1997system), which consists of a filter applied to the input and output of , and a constraint on the output of . The filter is an LTI system with the zero initial condition :

(14a) |

To enforce exponential stability of the feedback system, we characterize using the time-domain $\rho$-hard IQC, which was introduced in (lessard2016analysis); its definition is also provided below.

###### Definition 4.1.

Let be an LTI system defined in (14), and . Suppose . A bounded, causal operator satisfies the time-domain $\rho$-hard IQC defined by , , and , if the following condition holds for all , , and

(15) |

where is the output of driven by inputs .
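A numerical check of such a constraint can be sketched as follows, assuming the form commonly stated for $\rho$-hard IQCs in (lessard2016analysis): every weighted partial sum of r(k)^T M r(k) with weights rho^(-2k) is nonnegative. The static filter (r simply stacks the perturbation's input and output), the matrix M = diag(1, -1), and the two example perturbations are assumptions chosen for illustration.

```python
import math


def satisfies_rho_hard_iqc(r_seq, M, rho, tol=1e-12):
    """Check sum_{k=0}^{N} rho^(-2k) * r(k)^T M r(k) >= 0 for every horizon N."""
    total = 0.0
    for k, r in enumerate(r_seq):
        quad = sum(r[i] * M[i][j] * r[j] for i in range(len(r)) for j in range(len(r)))
        total += rho ** (-2 * k) * quad
        if total < -tol:
            return False
    return True


def filter_output(delta, inputs):
    """Static filter assumption: r(k) = [p(k), q(k)] with q(k) = delta(p(k))."""
    return [[p, delta(p)] for p in inputs]


M = [[1.0, 0.0], [0.0, -1.0]]                   # encodes |q(k)| <= |p(k)|
inputs = [math.cos(0.3 * k) for k in range(40)]

bounded = filter_output(lambda p: 0.5 * math.sin(p), inputs)  # gain at most 1
too_large = filter_output(lambda p: 2.0 * p, inputs)          # gain 2

bounded_ok = satisfies_rho_hard_iqc(bounded, M, rho=0.9)
too_large_ok = satisfies_rho_hard_iqc(too_large, M, rho=0.9)
```

For a memoryless perturbation the constraint holds term by term, so the weights play no role; they matter for dynamic perturbations, where the filter has its own state.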

###### Remark 4.1.

For a particular perturbation , there is typically a class of valid $\rho$-hard IQCs defined by a fixed filter and a matrix drawn from a convex set . Thus, in the stability condition derived later, will also be treated as a decision variable. A library of frequency-domain $\rho$-IQCs is provided in (boczar2017exponential) for various types of perturbations. As shown in (schwenkel2021model), a general class of frequency-domain $\rho$-IQCs can be translated into time-domain $\rho$-hard IQCs by a multiplier factorization.

When deriving the stability condition, the perturbation will be replaced by the time-domain $\rho$-hard IQC (15) that describes it, and the associated filter , as shown in Fig. 5(b). Therefore, the stabilizing controller will be designed for the extended system (an interconnection of and ) subject to IQCs, instead of the original . This controller will also be able to stabilize the original . Define the extended state as , and the dynamics of the extended system are given in Appendix A. Define to gather the states of the extended system and the controller. The feedback system of the extended system and the controller has the dynamics

(16) |

where

and the state space matrices of the extended system are defined in Appendix A. The next theorem merges the QC for and the time-domain $\rho$-hard IQC for with the Lyapunov theorem to derive the exponential stability condition for the uncertain feedback system.

###### Theorem 4.1.

Consider the feedback system of uncertain plant , and RNN controller . Assume satisfies the time-domain $\rho$-hard IQC defined by , , and , with . Given and . If there exist matrices , , , and parameters such that the following condition holds

(17) |

where is a block diagonal matrix and . Then for any , we have for all , where , i.e., the feedback system is exponentially stable with rate .

The complete proof is provided in Appendix A. This LMI (17) is jointly convex in and for any given and . Based on this LMI, we define the convex robust stability set :

Any parameter drawn from ensures the exponential stability of the feedback system of and , and this convex robust stability set can be used in the projection step.

###### Remark 4.2.

If we only require the feedback system to be stable ( in (17)), a more general class of IQCs, the time-domain hard IQCs (megretski1997system), can be used to describe .

## 5 Numerical Experiment

To compare our method against a regular RNN controller trained without projection, we consider 6 different tasks involving control of partially observed dynamical systems, including a linearized inverted pendulum and its nonlinear variant, a cartpole, vehicle lateral dynamics, a pendubot, and a high dimensional power system. Fig. 8 gives a demonstrative visualization of tasks including vehicle lateral control and IEEE 39-bus power system frequency regulation, whose communication topologies are shown in Fig. 8. Experimental settings and task definitions are detailed in Appendix B.

The experimental results, including rewards and sample trajectories at convergence, are reported in Fig. 9. In all experiments, our method achieves high reward after the first few projection steps that ensure stability, greatly outperforming the regular method, which suffers from instability even after converging. For the pendubot and inverted pendulum tasks, our method keeps improving the performance after the first projection steps, which already give high performance. For the cartpole, vehicle lateral control, and power system frequency regulation tasks, our method converges to optimal performance in one step. Our method gives converging trajectories for all tasks and achieves faster converging trajectories on the vehicle lateral control task. In comparison, policy gradient is greatly impacted by the partial observability: it converges to sub-optimal performance in the cartpole, pendubot, and power system frequency regulation tasks, and requires more steps to achieve optimal performance in the inverted pendulum and vehicle lateral control tasks. Without a stability guarantee, policy gradient fails to ensure converging trajectories from some initial conditions for all tasks except vehicle lateral control, which is open-loop stable.

## 6 Conclusion

In this work, we present a method to synthesize stabilizing RNN controllers, which ensures the stability of the feedback system during both the learning and control processes. We develop a convex set of stabilizing RNN parameters for nonlinear and partially observed systems. A novel projected policy gradient method is developed to synthesize a controller while enforcing stability by recursively projecting the parameters of the RNN controller to the convex set. By evaluating on a variety of control tasks, we demonstrate that our method learns stabilizing controllers with fewer samples, faster converging trajectories, and higher final performance than policy gradient. Future directions include extensions to implicit models (bai2019deep; elghaoui2020) or other memory units.

## References

## Appendix A Proofs and Illustrations

### Derivation for the Transformed Plant

### Proof of Theorem 3.1

###### Proof.

Assume there exist , , and , such that (9) holds. It follows from Schur complements that (9) is equivalent to

(20) |

It follows from the inequalities and for any and (tobenkin2017convex; revay2020convex) that (20) implies

(21) |

Defining , and , and rearranging (21), we have that , and satisfy the following condition

(22) |

Define the Lyapunov function . Multiplying (22) on the left and right by and its transpose yields

(23) |

It follows from sector that the last term in (23) is nonnegative, and thus . Iterating it down to , we have , which implies . Recall . Therefore

and this completes the proof.
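The final step of this argument can be spelled out as follows. This is a sketch assuming a quadratic Lyapunov function $V(\zeta) = \zeta^\top P \zeta$ with $P \succ 0$ and a one-step decrease $V(\zeta(k+1)) \leq \rho^2 V(\zeta(k))$; the symbols $\zeta$, $P$, and $\rho$ are chosen here for illustration.

```latex
\begin{aligned}
V(\zeta(k)) &\leq \rho^{2} V(\zeta(k-1)) \leq \cdots \leq \rho^{2k} V(\zeta(0)), \\
\lambda_{\min}(P)\,\|\zeta(k)\|^{2} &\leq V(\zeta(k))
  \leq \rho^{2k}\, V(\zeta(0))
  \leq \rho^{2k}\, \lambda_{\max}(P)\,\|\zeta(0)\|^{2}, \\
\|\zeta(k)\| &\leq \sqrt{\frac{\lambda_{\max}(P)}{\lambda_{\min}(P)}}\;\rho^{k}\,\|\zeta(0)\|
  = \sqrt{\mathrm{cond}(P)}\;\rho^{k}\,\|\zeta(0)\| .
\end{aligned}
```

The condition number of $P$ thus bounds the transient overshoot, while $\rho$ sets the geometric decay rate.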

### Recursive Feasibility of the Projection Steps

###### Theorem A.1 (Recursive Feasibility).

If LMI is feasible (i.e. LMI holds for some , , and ), then LMI is also feasible, where and are from the -th step of projected policy gradient, for

###### Proof.

The main idea is to show that is already a feasible point for LMI. Since LMI is feasible, at optimum of the projection step, we obtain the minimizer and LMI holds. It follows from inequalities and that

(24) |

which renders LMI true at a feasible point of .

### Dynamics of the Extended System

The extended system (shown in Fig. 5(b)) is of the form
