Actor-Critic Reinforcement Learning for Control with Stability Guarantee

by   Minghao Han, et al.
Harbin Institute of Technology

Deep Reinforcement Learning (DRL) has achieved impressive performance in various robotic control tasks, ranging from motion planning and navigation to end-to-end visual manipulation. However, stability is not guaranteed in DRL. From a control-theoretic perspective, stability is the most important property of any control system, since it is closely related to the safety, robustness, and reliability of robotic systems. In this paper, we propose a DRL framework with a stability guarantee by exploiting Lyapunov's method from control theory. A sampling-based stability theorem is proposed for stochastic nonlinear systems modeled by a Markov decision process. We then show that the stability condition can be exploited as a critic in the actor-critic RL framework and propose an efficient DRL algorithm to learn a controller/policy with a stability guarantee. In simulated experiments, our approach is evaluated on several well-known examples including classic CartPole balancing, 3-dimensional robot control, and control of synthetic biology gene regulatory networks. As a qualitative evaluation of stability, we show that the learned policies enable the systems to recover, to a certain extent, to the equilibrium or tracking target when perturbed by uncertainties such as unseen disturbances and system parametric variations.




I Introduction

Robotics problems are generally related to nonlinear stochastic systems with high-dimensional states and actions and are naturally phrased as problems of reinforcement learning [1]. Recently, significant progress has been made by combining advances in deep learning with reinforcement learning. Impressive results have been obtained in a series of high-dimensional robotic control tasks where sophisticated and hard-to-engineer behaviors are achieved [2, 3, 4, 5]. However, the performance of an RL agent is by and large evaluated through trial and error, and RL can hardly provide any guarantee for the reliability of the learned control policy.

Given a control system, regardless of which controller design method is used, the first and most important property that needs to be guaranteed is stability, because an unstable control system is typically useless and potentially dangerous [6]. A stable system is guaranteed to converge to the equilibrium or reference signal, and it can recover to these targets even in the presence of parametric uncertainties and disturbances [6]. Stability is thus closely related to the robustness, safety and reliability of robotic systems.

The most useful and general approach for studying the stability of robotic systems is Lyapunov's method [7], which is dominant in control engineering [8, 9]. In Lyapunov's method, a scalar “energy-like” function $L$, called the Lyapunov function, is constructed to analyze the stability of the system. The controller is designed so that the difference of the Lyapunov function along the state trajectory is semi-negative definite, i.e., $L(s_{t+1}) - L(s_t) \le 0$ for all time instants $t$, so that the state moves in the direction of decreasing Lyapunov function value and eventually converges to the equilibrium [10, 11]. In learning methods, the “energy decreasing” condition has to be verified by trying out all possible consecutive data pairs $(s_t, s_{t+1})$, i.e., by verifying infinitely many inequalities $L(s_{t+1}) - L(s_t) \le 0$. Obviously, this “infinity” requirement cannot be met in practice, which makes a direct application of Lyapunov's method infeasible.

In this paper, we propose a data-based stability theorem and a stability-guaranteed reinforcement learning framework to jointly learn the controller or policy (the two terms are used interchangeably throughout the paper) and a Lyapunov function, both of which are parameterized by deep neural networks, with a focus on stabilization and tracking problems in robotic systems. The contributions of our paper can be summarized as follows: 1) a novel data-based stability theorem where only one inequality needs to be evaluated; 2) the stability condition proposed above is exploited as the critic, and an actor-critic algorithm is designed to search for a stability-guaranteed controller; 3) we show through experiments that the learned controller can stabilize the systems, to a certain extent, when they are perturbed by uncertainties such as unseen disturbances and system parametric variations. In our experiments, we show that the stability-guaranteed controller is more capable of handling uncertainties than controllers without such a guarantee in nonlinear control problems, including classic CartPole stabilization, control of 3D legged robots and a manipulator, and reference tracking for synthetic biology gene regulatory networks.

I-A Related Works

In model-free reinforcement learning (RL), stability is rarely addressed due to the formidable challenge of analyzing and designing the closed-loop system dynamics in a model-free manner [12], and the associated stability theory in model-free RL remains an open problem [12, 13].

Recently, Lyapunov analysis has been used in model-free RL to solve control problems with safety constraints [14, 15]. In [14], a Lyapunov-based approach for solving constrained Markov decision processes is proposed with a novel way of constructing the Lyapunov function through linear programming. In [15], these results were further generalized to continuous control tasks. Even though Lyapunov-based methods were adopted in these works, neither of them addressed the stability of the system.

Other interesting results on the stability of learning-based control systems have been reported in recent years. In [16], an initial result is proposed for the stability analysis of deterministic nonlinear systems with an optimal controller for infinite-horizon discounted cost, based on the assumption that the discount factor is sufficiently close to 1. In [17, 18], a model-based safe RL approach with a safety guarantee during exploration is introduced, but it is limited to Lipschitz continuous nonlinear systems such as Gaussian process models. In addition, the verification of the stability condition requires discretization of the state space, which limits its application to tasks with a low-dimensional finite state space.

II Problem Statement

In this paper, we focus on the stabilization and tracking problems in robotic systems modeled by a Markov decision process (MDP). The state of the robot and its environment at time $t$ is given by the state $s_t \in \mathcal{S}$, where $\mathcal{S}$ denotes the state space. The robot then takes an action $a_t \in \mathcal{A}$ according to a stochastic policy $\pi(a_t|s_t)$, resulting in the next state $s_{t+1}$. The transition of the state is modeled by the transition probability $P(s_{t+1}|s_t, a_t)$. In both stabilization and tracking problems, there is always a cost function $c(s_t, a_t)$ that measures how good or bad a state-action pair is.

In stabilization problems, the goal is to find a policy $\pi$ such that the norm of the state $\|s_t\|$ goes to zero, where $\|\cdot\|$ denotes the Euclidean norm. In this case, the cost function is $c(s_t, a_t) = \mathbb{E}_{P(\cdot|s_t, a_t)}\|s_{t+1}\|$. In tracking problems, we divide the state $s$ into two vectors, $s^1$ and $s^2$, where $s^1$ is composed of the elements of $s$ that are aimed at tracking the reference signal $r$ while $s^2$ contains the rest. The reference signal could be the desired velocity, a path, or even the picture of grasping an object in a certain pose. For tracking problems, $c(s_t, a_t) = \mathbb{E}_{P(\cdot|s_t, a_t)}\|s^1_{t+1} - r\|$.
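As a minimal sketch of the two cost definitions above, the following Python snippet evaluates a single-sample version of each cost, i.e., the expectation over the next state is replaced by one observed next state. The function names, the index list `tracked_idx` and the example vectors are hypothetical and only serve the illustration.

```python
import math


def euclidean_norm(vec):
    """Euclidean norm of a sequence of floats."""
    return math.sqrt(sum(x * x for x in vec))


def stabilization_cost(next_state):
    """Single-sample stabilization cost: norm of the observed next state."""
    return euclidean_norm(next_state)


def tracking_cost(next_state, tracked_idx, reference):
    """Single-sample tracking cost: distance between the tracked
    components of the next state and the reference signal."""
    tracked = [next_state[i] for i in tracked_idx]
    return euclidean_norm([x - r for x, r in zip(tracked, reference)])


print(stabilization_cost([3.0, 4.0]))                       # 5.0
print(tracking_cost([3.0, 4.0, 1.0], [0, 1], [0.0, 0.0]))   # 5.0
```

In both cases the cost is zero exactly when the control goal is met, which is what allows the same stability notion to cover stabilization and tracking.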

From a control perspective, both stabilization and tracking are related to the asymptotic stability of the closed-loop system (or error system) under $\pi$, i.e., starting from an initial point, the state trajectories always converge to the origin or the reference trajectory. Let $c_\pi(s_t) \doteq \mathbb{E}_{a_t \sim \pi}\, c(s_t, a_t)$ denote the cost function under the policy $\pi$. The definition of stability studied in this paper is given as follows.

Definition 1

The stochastic system is said to be stable in mean cost if $\lim_{t \to \infty} \mathbb{E}_{s_t} c_\pi(s_t) = 0$ holds for any initial condition $s_0 \in \{s_0 \mid c_\pi(s_0) \le b\}$. If $b$ is arbitrarily large, then the stochastic system is globally stable in mean cost.

The stabilization and tracking problems can be collectively summarized as finding a policy $\pi$ such that the closed-loop system is stable in mean cost according to Definition 1.

Before proceeding, some notation needs to be defined. $\rho(s)$ denotes the distribution of starting states. The closed-loop transition probability is denoted as $P_\pi(s'|s) \doteq \int_{\mathcal{A}} \pi(a|s) P(s'|s, a)\, \mathrm{d}a$. We also introduce the closed-loop state distribution at a certain instant $t$ as $P(s|\rho, \pi, t)$, which can be defined in an iterative way: $P(s'|\rho, \pi, t+1) = \int_{\mathcal{S}} P_\pi(s'|s) P(s|\rho, \pi, t)\, \mathrm{d}s$ for all $t \in \mathbb{Z}_+$, and $P(s|\rho, \pi, 0) = \rho(s)$.
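To make the iterative definition of the closed-loop state distribution concrete, here is a small sketch for a finite state space, where the integral becomes a sum. The two-state transition matrix `P` and the starting distribution `rho` are made-up numbers chosen for illustration, not taken from the paper.

```python
def step_distribution(dist, P):
    """One step of p_{t+1}(s') = sum_s P[s][s'] * p_t(s)."""
    n = len(dist)
    return [sum(dist[s] * P[s][s_next] for s in range(n)) for s_next in range(n)]


def state_distributions(rho, P, horizon):
    """Return [p_0, p_1, ..., p_horizon], starting from p_0 = rho."""
    dists = [list(rho)]
    for _ in range(horizon):
        dists.append(step_distribution(dists[-1], P))
    return dists


# Hypothetical closed-loop transition matrix: P[s][s'] is the
# probability of moving from state s to state s' under the policy.
P = [[0.9, 0.1], [0.5, 0.5]]
rho = [1.0, 0.0]

dists = state_distributions(rho, P, 200)
print([round(p, 4) for p in dists[1]])    # [0.9, 0.1]
print([round(p, 4) for p in dists[-1]])   # [0.8333, 0.1667]
```

For this particular chain the distribution settles at (5/6, 1/6) regardless of the starting distribution, since that vector is left unchanged by one more application of `step_distribution`.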

III Data-Based Stability Analysis

In this section, we propose the main assumptions and a new theorem for stability analysis of stochastic systems. We assume that the Markov chain induced by the policy $\pi$ is ergodic with a unique stationary distribution $q_\pi$, $q_\pi(s) = \lim_{t \to \infty} P(s|\rho, \pi, t)$, as is commonly assumed in the RL literature [19, 20, 21, 22].

In Definition 1, stability is defined in relation to the set of starting states, which is also called the region of attraction (ROA). If the system starts within the ROA, its trajectory will surely be attracted to the equilibrium. To build a sample-based stability guarantee, we need to ensure that the states in the ROA are accessible for the stability analysis. The following assumption is therefore made to ensure that every state in the ROA has a chance of being sampled as the starting state.

Assumption 1

There exists a positive constant $b$ such that $\rho(s) > 0$ for all $s \in \{s \mid c_\pi(s) \le b\}$.

Our approach is to construct/find a Lyapunov function whose difference along the state trajectory is semi-negative definite, so that the state moves in the direction of decreasing Lyapunov function value and eventually converges to the origin. Lyapunov's method has long been used for stability analysis and controller design in control theory [23], but it is mostly exploited along with a known model so that the energy decreasing condition on the entire state space can be transformed into one inequality regarding model parameters [6, 24]. In the following, we show that without a dynamic model, this “infinity” problem can be solved through sampling. We now give sufficient conditions for a stochastic system to be stable in mean cost.

Theorem 1

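The stochastic system is stable in mean cost if there exists a function $L: \mathcal{S} \to \mathbb{R}_+$ and positive constants $\alpha_1$, $\alpha_2$ and $\alpha_3$, such that

$$\alpha_1 c_\pi(s) \le L(s) \le \alpha_2 c_\pi(s) \quad \forall s \in \mathcal{S} \qquad (1)$$

$$\mathbb{E}_{s \sim \mu_\pi}\big(\mathbb{E}_{s' \sim P_\pi} L(s') - L(s)\big) \le -\alpha_3\, \mathbb{E}_{s \sim \mu_\pi} c_\pi(s) \qquad (2)$$

where

$$\mu_\pi(s) \doteq \lim_{N \to \infty} \frac{1}{N} \sum_{t=0}^{N} P(s|\rho, \pi, t)$$

is the (infinite) sampling distribution.

Proof. The existence of the sampling distribution $\mu_\pi(s)$ is guaranteed by the existence of $q_\pi(s)$. Since the sequence $\{P(s|\rho, \pi, t),\ t \in \mathbb{Z}_+\}$ converges to $q_\pi(s)$ as $t$ approaches $\infty$, by the Abelian theorem the sequence $\{\frac{1}{N}\sum_{t=0}^{N} P(s|\rho, \pi, t),\ N \in \mathbb{Z}_+\}$ also converges and $\mu_\pi(s) = q_\pi(s)$. Combined with the form of $\mu_\pi$, (2) implies that

$$\int_{\mathcal{S}} \lim_{N \to \infty} \frac{1}{N} \sum_{t=0}^{N} P(s|\rho, \pi, t)\big(\mathbb{E}_{P_\pi(s'|s)} L(s') - L(s)\big)\, \mathrm{d}s \le -\alpha_3\, \mathbb{E}_{s \sim q_\pi} c_\pi(s) \qquad (3)$$

First, on the left-hand side, $L(s) \le \alpha_2 c_\pi(s)$ for all $s \in \mathcal{S}$ according to (1). Considering that $\frac{1}{N}\sum_{t=0}^{N} P(s|\rho, \pi, t) \le 1$,

$$\frac{1}{N} \sum_{t=0}^{N} P(s|\rho, \pi, t)\, L(s) \le \alpha_2 c_\pi(s) \quad \forall s \in \mathcal{S},\ \forall N \in \mathbb{Z}_+$$

On the other hand, the sequence $\{\frac{1}{N}\sum_{t=0}^{N} P(s|\rho, \pi, t) L(s),\ N \in \mathbb{Z}_+\}$ converges point-wise to the function $q_\pi(s) L(s)$. According to Lebesgue's dominated convergence theorem [25], if a sequence $f_n(s)$ converges point-wise to a function $f$ and is dominated by some integrable function $g$ in the sense that $|f_n(s)| \le g(s)$ for all $s \in \mathcal{S}$ and all $n$, then

$$\lim_{n \to \infty} \int_{\mathcal{S}} f_n(s)\, \mathrm{d}s = \int_{\mathcal{S}} \lim_{n \to \infty} f_n(s)\, \mathrm{d}s$$

Thus the left-hand side of (3) can be rewritten as

$$\lim_{N \to \infty} \frac{1}{N}\Big(\sum_{t=1}^{N+1} \mathbb{E}_{P(s|\rho, \pi, t)} L(s) - \sum_{t=0}^{N} \mathbb{E}_{P(s|\rho, \pi, t)} L(s)\Big) = \lim_{N \to \infty} \frac{1}{N}\big(\mathbb{E}_{P(s|\rho, \pi, N+1)} L(s) - \mathbb{E}_{\rho(s)} L(s)\big)$$

Taking the relations above into consideration, (3) implies

$$\lim_{N \to \infty} \frac{1}{N}\big(\mathbb{E}_{P(s|\rho, \pi, N+1)} L(s) - \mathbb{E}_{\rho(s)} L(s)\big) \le -\alpha_3 \lim_{t \to \infty} \mathbb{E}_{P(s|\rho, \pi, t)} c_\pi(s) \qquad (4)$$

Since $\mathbb{E}_{\rho(s)} L(s)$ is a finite value and $L$ is semi-positive definite, it follows that

$$\lim_{t \to \infty} \mathbb{E}_{P(s|\rho, \pi, t)} c_\pi(s) \le \lim_{N \to \infty} \frac{1}{N}\Big(\frac{1}{\alpha_3} \mathbb{E}_{\rho(s)} L(s)\Big) = 0 \qquad (5)$$

Suppose that there exists a state $s_0 \in \{s_0 \mid c_\pi(s_0) \le b\}$ and a positive constant $d$ such that $\lim_{t \to \infty} \mathbb{E}_{P(s|s_0, \pi, t)} c_\pi(s) = d$, or $\lim_{t \to \infty} \mathbb{E}_{P(s|s_0, \pi, t)} c_\pi(s) = \infty$. Since $\rho(s_0) > 0$ for all starting states in $\{s_0 \mid c_\pi(s_0) \le b\}$ (Assumption 1), it follows that $\lim_{t \to \infty} \mathbb{E}_{P(s|\rho, \pi, t)} c_\pi(s) > 0$, which contradicts (5). Thus, for all $s_0 \in \{s_0 \mid c_\pi(s_0) \le b\}$, $\lim_{t \to \infty} \mathbb{E}_{P(s|s_0, \pi, t)} c_\pi(s) = 0$, and the system is stable in mean cost by Definition 1.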

Condition (1) guides the choice and construction of the Lyapunov function, the details of which are deferred to Section IV. Condition (2) is called the energy decreasing condition and is the major criterion for determining stability.
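The practical appeal of the energy decreasing condition is that it is a single inequality on an expectation, so it can be estimated from sampled transitions. The sketch below illustrates this on a hypothetical deterministic scalar system $s' = g\,s$, using $L(s) = |s|$ and, as a simplification for the illustration, the cost $|s|$ of the current state. The helper names, the gain values and $\alpha_3 = 0.1$ are all assumptions made for this example.

```python
def rollout(s0, gain, steps):
    """Collect (s, s') pairs from the toy system s' = gain * s."""
    pairs, s = [], s0
    for _ in range(steps):
        s_next = gain * s
        pairs.append((s, s_next))
        s = s_next
    return pairs


def energy_decrease_gap(transitions, lyapunov, cost, alpha3):
    """Empirical mean of L(s') - L(s) + alpha3 * c(s) over the samples.
    A non-positive value is consistent with the energy decreasing
    condition on this sample; a positive value violates it."""
    total = 0.0
    for s, s_next in transitions:
        total += lyapunov(s_next) - lyapunov(s) + alpha3 * cost(s)
    return total / len(transitions)


starts = [-1.0, -0.5, 0.25, 0.75]
stable = [p for s0 in starts for p in rollout(s0, 0.8, 50)]
unstable = [p for s0 in starts for p in rollout(s0, 1.2, 50)]

gap_stable = energy_decrease_gap(stable, abs, abs, alpha3=0.1)
gap_unstable = energy_decrease_gap(unstable, abs, abs, alpha3=0.1)
print(gap_stable < 0.0, gap_unstable > 0.0)   # True True
```

With gain 0.8 every sampled term equals roughly $-0.1|s|$, so the estimate is negative, whereas with gain 1.2 every term is roughly $0.3|s|$ and the condition fails. A finite sample only approximates the expectation in the theorem, so such a check is evidence rather than proof.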

Remark 1

This remark concerns the connection to previous results on the stability of stochastic systems. Stability conditions for Markov chains have been reported in [26, 27]; however, validating them requires verifying infinitely many inequalities on the state space if $\mathcal{S}$ is continuous. In contrast, our approach validates only one inequality, (2), related to the sampling distribution $\mu_\pi$, which enables data-based stability analysis and policy learning for the stochastic system.

IV Algorithm

In this section, we propose an actor-critic RL algorithm to learn stability guaranteed policies for the stochastic system. First we introduce the Lyapunov critic function and show how it is constructed. Then based on the maximum entropy actor-critic framework, we use the Lyapunov critic function in the policy gradient formulation.

IV-A Lyapunov Critic Function

In our framework, the Lyapunov critic $L_c$ plays a role in both stability analysis and the learning of the actor. To enable actor-critic learning, the Lyapunov critic is designed to depend on $s$ and $a$ and to satisfy $L(s) = \mathbb{E}_{a \sim \pi} L_c(s, a)$, so that it can be used to evaluate (2). In view of this requirement, $L_c$ should be a non-negative function of the state and action, $L_c: \mathcal{S} \times \mathcal{A} \to \mathbb{R}_+$. In this paper, we construct the Lyapunov critic with the following parameterization technique,

$$L_c(s, a) = f_\phi(s, a)^\top f_\phi(s, a) \qquad (6)$$

where $f_\phi$ is the output vector of a fully connected neural network with parameter $\phi$.
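The point of this parameterization is that the critic is non-negative by construction, whatever the network weights are, because it is the squared norm of the network output. A minimal standard-library sketch, assuming a hypothetical two-layer network with ReLU activation, zero biases and made-up layer sizes:

```python
import random


def make_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    weights = [[rng.gauss(0.0, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def affine(layer, x):
    """Compute W x + b for one layer."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def lyapunov_critic(params, state, action):
    """Squared norm of the network output for the pair (state, action)."""
    hidden = [max(0.0, v) for v in affine(params[0], list(state) + list(action))]
    f = affine(params[1], hidden)
    return sum(v * v for v in f)


rng = random.Random(0)
params = [make_layer(3, 8, rng), make_layer(8, 4, rng)]  # sizes are illustrative
value = lyapunov_critic(params, state=[0.2, -1.0], action=[0.5])
print(value >= 0.0)   # True
```

Non-negativity alone does not make the critic a Lyapunov function; the remaining properties have to come from training against a suitable target.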

During the learning process, $L_c$ is updated to approximate a designed Lyapunov candidate function. A Lyapunov candidate function is an ideal function that naturally satisfies the properties of a Lyapunov function, such as the norm of the state or the value function. However, Lyapunov candidate functions are not parameterized and are thus not directly applicable in an actor-critic learning process. We therefore use the Lyapunov candidate function as the supervision signal for training $L_c$ and update $\phi$ to minimize the following objective function,

$$J(\phi) = \mathbb{E}_{\mathcal{D}}\Big[\tfrac{1}{2}\big(L_c(s, a) - L_{\text{target}}(s, a)\big)^2\Big] \qquad (7)$$

where $L_{\text{target}}$ is the approximation target for $L_c$ and $\mathcal{D}$ is the set of collected transition pairs. In [14] and [17], the value function has been proved to be a valid Lyapunov candidate function, where the approximation target is

$$L_{\text{target}}(s, a) = c + \max_{a'} \gamma L'_c(s', a') \qquad (8)$$

where $L'_c$ is the target network parameterized by $\phi'$, as typically used in actor-critic methods [28, 29]. $L'_c$ has the same structure as $L_c$, but the parameter $\phi'$ is updated through an exponential moving average of the weights of $L_c$ controlled by a hyperparameter $\tau \in (0, 1)$, $\phi'_{k+1} \leftarrow \tau \phi_k + (1 - \tau) \phi'_k$.
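The exponential moving average used for the target network can be sketched in a few lines. The snippet below treats the parameters as flat lists of floats and assumes the common convention in which the hyperparameter (called `tau` here) weights the online parameters; both choices are assumptions of this illustration.

```python
def soft_update(target_params, online_params, tau):
    """Move each target parameter a fraction tau toward the online one."""
    return [tau * p + (1.0 - tau) * tp
            for tp, p in zip(target_params, online_params)]


target = [0.0, 1.0]
online = [1.0, 1.0]
for _ in range(3):
    target = soft_update(target, online, tau=0.5)
print(target)   # [0.875, 1.0]
```

A small `tau` makes the target change slowly, which keeps the regression target for the critic from chasing its own updates too closely.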

In addition to the value function, the sum of cost over a finite time horizon can also be employed as a Lyapunov candidate, as exploited in the model predictive control literature [30, 9] for stability analysis. In this case,

$$L_{\text{target}}(s_t, a_t) = \mathbb{E} \sum_{k=t}^{t+N} c(s_k, a_k) \qquad (9)$$

Here, the time horizon $N$ is a hyperparameter to be tuned, whose influence will be demonstrated in the experiments in Section V.
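For a recorded episode, finite-horizon sum-of-cost targets can be computed for every time step with a prefix sum. The sketch below is one possible convention, assumed for illustration: the target at step t sums the `horizon` costs starting at t and is truncated at the end of the episode. The function name and the example costs are hypothetical.

```python
def finite_horizon_targets(costs, horizon):
    """target[t] = sum of costs[t : t + horizon], truncated at episode end."""
    prefix = [0.0]
    for c in costs:
        prefix.append(prefix[-1] + c)
    n = len(costs)
    return [prefix[min(t + horizon, n)] - prefix[t] for t in range(n)]


episode_costs = [1.0, 2.0, 3.0, 4.0]
print(finite_horizon_targets(episode_costs, horizon=2))   # [3.0, 5.0, 7.0, 4.0]
```

Each target is a single-trajectory sample of the expectation, so longer horizons accumulate more randomness from the unknown dynamics.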

The choice of Lyapunov candidate function plays an important role in learning a policy. The value function evaluates the infinite time horizon and thus generally offers better performance, but it is rather difficult to approximate because of significant variance and bias [31]. On the other hand, the finite-horizon sum of cost provides an explicit target for learning a Lyapunov function, which inherently reduces the bias and enhances the learning process. However, as the model is unknown, predicting the future costs based on the current state and action inevitably introduces variance, which grows as the prediction horizon extends. In principle, for tasks with simple dynamics, the sum-of-cost choice enhances the convergence of learning and the robustness of the trained policies, while for complicated systems the value function generally produces better performance. In this paper, we use both the value function and the sum of cost as Lyapunov candidate functions. Later, in Section V, we show the influence of these choices on the performance and robustness of trained policies.

IV-B Lyapunov-based Actor Critic

In this subsection, we focus on how to learn the controller in a novel actor-critic framework called Lyapunov-based Actor Critic (LAC), such that the inequality (2) is satisfied. The policy learning problem is summarized as the following constrained optimization problem,

find $\theta$
s.t. $\mathbb{E}_{\mathcal{D}}\big(L_c(s', f_\theta(\epsilon, s')) - L_c(s, a) + \alpha_3 c\big) \le 0$ (10)
$-\mathbb{E}_{\mathcal{D}} \log \pi_\theta(a|s) \ge \mathcal{H}_t$ (11)

where the second constraint is the minimum entropy constraint borrowed from the maximum entropy RL framework to improve exploration in the action space [28], and $\mathcal{H}_t$ is the desired bound. Solving the above constrained optimization problem is equivalent to minimizing the following objective function,

$$J(\theta) = \mathbb{E}_{\mathcal{D}}\Big[\beta\big(\log \pi_\theta(f_\theta(\epsilon, s)|s) + \mathcal{H}_t\big) + \lambda\big(L_c(s', f_\theta(\epsilon, s')) - L_c(s, a) + \alpha_3 c\big)\Big] \qquad (12)$$

where $\beta$ and $\lambda$ are Lagrange multipliers that control the relative importance of the minimum entropy constraint and (2). The stochastic policy is parameterized by a deep neural network $f_\theta$ that depends on $s$ and a Gaussian noise $\epsilon$. (2) is estimated by the second term in (12). One may wonder why, in the second term of (12), only one Lyapunov critic is explicitly dependent on the stochastic policy, while the other depends on the sampled action. First, note that this estimator is also an unbiased estimate of (2), although the variance may be increased compared to replacing $L_c(s, a)$ with $L_c(s, f_\theta(\epsilon, s))$. From a more practical perspective, having the second Lyapunov critic explicitly dependent on $\theta$ would introduce a term in the policy gradient that updates $\theta$ to increase the value of $L_c(s, a)$, which contradicts our goal of stabilization.

In the actor-critic framework, the parameters of the policy network are updated through stochastic gradient descent on (12), whose gradient is approximated by

$$\nabla_\theta J(\theta) = \beta \nabla_\theta \log \pi_\theta(a|s) + \beta \nabla_a \log \pi_\theta(a|s) \nabla_\theta f_\theta(\epsilon, s) + \lambda \nabla_{a'} L_c(s', a') \nabla_\theta f_\theta(\epsilon, s') \qquad (13)$$

The values of the Lagrange multipliers $\beta$ and $\lambda$ are automatically adjusted by a gradient method maximizing the objective function (12) and are clipped to be positive. Pseudocode of the proposed algorithm is shown in Algorithm 1.
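The automatic adjustment of a Lagrange multiplier can be illustrated with a projected gradient ascent step: the multiplier grows while its constraint is violated and shrinks once the constraint is satisfied. The sketch below is an assumption-laden illustration rather than the authors' implementation; the function name, the learning rate and the constant constraint values are made up, and clipping at zero is used as the projection.

```python
def update_multiplier(multiplier, constraint_value, lr):
    """One projected gradient ascent step on a Lagrange multiplier.
    constraint_value > 0 means the constraint is currently violated."""
    return max(0.0, multiplier + lr * constraint_value)


lam = 1.0
# Constraint violated: the multiplier grows, weighting stability more heavily.
for _ in range(5):
    lam = update_multiplier(lam, constraint_value=0.2, lr=1.0)
print(lam > 1.0)    # True

# Constraint satisfied: the multiplier shrinks and is clipped at zero.
for _ in range(30):
    lam = update_multiplier(lam, constraint_value=-0.1, lr=1.0)
print(lam == 0.0)   # True
```

In an actual training loop the constraint value would be a minibatch estimate of the constrained quantity, so the multiplier trajectory is noisy rather than monotone.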

  repeat
     Sample $s_0$ according to $\rho$
     for each time step do
        Sample $a_t$ from $\pi_\theta(a|s_t)$ and step forward
        Observe $s_{t+1}$, $c_t$ and store $(s_t, a_t, c_t, s_{t+1})$ in $\mathcal{D}$
     end for
     for each update step do
        Sample minibatches of transitions from $\mathcal{D}$ and update $L_c$, $\pi_\theta$, Lagrange multipliers with (7), (13)
     end for
  until (2) is satisfied
Algorithm 1 Lyapunov-based Actor-Critic (LAC)
Fig. 1: Cumulative control performance comparison on (a) CartPole, (b) HalfCheetah, (c) FetchReach, (d) GRN and (e) CompGRN. The Y-axis indicates the total cost during one episode and the X-axis indicates the total time steps in thousands. The shaded region shows the 1-SD confidence interval over 10 random seeds. Across all training trials, LAC converges to a stabilizing solution with comparable or superior performance compared with SAC and SPPO.

V Experiment

In this section, we illustrate five simulated robotic control problems to demonstrate the general applicability of the proposed method. First of all, the classic control problem of CartPole balancing from control and RL literature [32] is illustrated. Then, we consider more complicated high-dimensional continuous control problem of 3D robots, e.g., HalfCheetah and FetchReach, using MuJoCo physics engine [33]. Last, we extend our approach to control autonomous systems in the cell, i.e., molecular synthetic biological gene regulatory networks (GRN). Specifically, we consider the problem of reference tracking for two GRNs [34].

The proposed method is evaluated for the following aspects:

  • Convergence: does the proposed training algorithm converge with random parameter initialization and does the stability condition (2) hold for the learned policies;

  • Performance: can the goal of the task be achieved or the cumulative cost be minimized;

  • Stability: if (2) holds, are the closed-loop systems indeed stable and do they generate stable state trajectories;

  • Robustness: how do the trained policies perform when faced with uncertainties unseen during training, such as parametric variation and external disturbances;

  • Generalization: can the trained policies generalize to follow reference signals that are different from the one seen during training.

We compare our approach with soft actor-critic (SAC) [28], one of the state-of-the-art actor-critic algorithms that outperform a series of RL methods such as DDPG [35], PPO [36] on the continuous control benchmarks. The variant of safe proximal policy optimization (SPPO) [15], a Lyapunov-based method, is also included in the comparison. The original SPPO is developed to deal with constrained MDP, where safety constraints exist. In our experiments, we modify it to apply the Lyapunov constraints on the MDP tasks and see whether it can achieve the same stability guarantee as LAC. In CartPole, we also compare with linear quadratic regulator (LQR), a classical model-based optimal control method for stabilization. For both algorithms, the hyperparameters are tuned to reach their best performance.

The outline of this section is as follows. In Section V-A, a brief introduction is given to the background and problem description of each example. In Sections V-B and V-C, the performance and convergence of the proposed method are demonstrated and compared with SAC. In Section V-D, the stability of the closed-loop systems is evaluated. In Section V-E, the generalization ability and robustness of the trained policies are evaluated and analyzed. Finally, in Section V-F, we show the influence of choosing different Lyapunov candidate functions on the performance and robustness of trained policies.

Training parameters of LAC and detailed experiment setup can be found in Appendix.

V-A Background and Problem Description

In this section, we will give a brief introduction to the examples considered in this paper.

V-A1 CartPole

The controller is to stabilize the pole vertically at a given position. The cost is determined by the norm of the angular position of the pole and the horizontal position of the cart. The control input is the horizontal force applied to the cart. If the angle between the pole and the vertical position exceeds a threshold, the agent is considered dead and the episode ends.

V-A2 HalfCheetah

The goal is to control a 17-dimensional 2-legged robot simulated in the MuJoCo simulator. The control task belongs to the reference tracking problem, i.e., to enable the robot to run at the speed of 1m/s in the X-axis direction. The cost is determined by the Euclidean difference between current speed and target speed. The control input is the torque implemented at each joint.

V-A3 FetchReach

The agent is to control a simulated manipulator to track a randomly generated goal position with its end effector. The cost is determined by the Euclidean distance between the end effector and the goal. The control input is the torque implemented at each joint. The manipulator is also simulated in the MuJoCo simulator.

V-A4 GRN and CompGRN

The GRN is a synthetic biology gene regulatory network with a ring structure pioneered in [34], in which each gene represses the other gene cyclically. The dynamics of temporal gene expression exhibit periodic oscillatory behavior. The dynamics of GRN can be quantitatively described by a set of discrete-time nonlinear difference equations consisting of six states, three mRNAs for transcription and three proteins for translation, based on biochemical kinetic laws. We also include a complicated GRN (CompGRN) with 4 genes to be controlled, which exhibits an unstable oscillation and is even harder to control.

The objective is to force the concentration of one protein to follow an a priori defined reference trajectory using partially observed states.

Fig. 2: Value of the Lagrange multiplier $\lambda$ during the training of LAC policies on (a) CartPole, (b) HalfCheetah, (c) FetchReach, (d) GRN and (e) CompGRN. The Y-axis indicates the value of $\lambda$ and the X-axis indicates the total time steps in thousands. The shaded region shows the 1-SD confidence interval over 10 random seeds. The value of $\lambda$ gradually drops and becomes zero at convergence, which implies that the stability condition is satisfied.

V-B Performance

In each task, LAC, SAC and SPPO are each trained 10 times with random initialization; the average total cost and its variance during training are shown in Figure 1. In the first three examples (see Figure 1(a)-(c)), SAC and LAC perform comparably in terms of the total cost at convergence and the speed of convergence, while SPPO converges in CartPole and FetchReach. In GRN and CompGRN (see Figure 1(d,e)), SAC is not always able to find a policy that completes the control objective, resulting in poor average performance. In contrast, LAC performs stably regardless of the random initialization.

V-C Convergence

As shown in Figure 1, LAC converges stably in all experiments. Moreover, the convergence and validity of the stability guarantee can also be checked by observing the value of the Lagrange multiplier $\lambda$. When (2) is satisfied, $\lambda$ continuously decreases until it becomes zero. Thus, by checking the value of $\lambda$, the satisfaction of the stability condition during training and at convergence can be validated. Figure 2 shows the value of $\lambda$ during training. Across all training trials in the experiments, $\lambda$ eventually converges to zero, which implies that the stability guarantee is valid.

V-D Evaluation on Stability

In this part, a further comparison is made between the stability-assured method (LAC) and the method without such a guarantee (SAC) by demonstrating the closed-loop system dynamics under the trained policies. A distinguishing feature of stability-assured policies is that they can drive the state or tracking error to zero and keep it there. This can be demonstrated intuitively by the state trajectories of the closed-loop system.

We evaluated the trained policies in the GRN and CompGRN and the results are shown in Figure 3. In our experiments, we found that the LAC agents stabilize the systems well. All the state trajectories converge to the reference signal eventually (see Figure 3 a and c). On the contrary, without stability guarantee, the state trajectories either diverge (see Figure 3 b), or continuously oscillate around the reference trajectory (see Figure 3 d).

Fig. 3: State trajectories over time under policies trained by LAC and SAC in the GRN and CompGRN; panels include (c) LAC-CompGRN and (d) SAC-CompGRN. In each experiment, the policies are tested over 20 random initial states and all the resulting trajectories are displayed. The X-axis indicates the time and the Y-axis shows the concentration of the target protein, Protein 1.

V-E Evaluation on Robustness and Generalization

It is well known that over-parameterized policies are prone to overfitting to a specific training environment. The ability to generalize is key to the successful deployment of RL agents in an uncertain real-world environment. In this part, we first evaluate the robustness of policies in the presence of parametric uncertainties and process noise. Then, we test the robustness of controllers against external disturbances. Finally, we evaluate whether the policy generalizes by setting different reference signals. To make a fair comparison, we removed the SAC policies that did not converge and only evaluate the ones that perform well during training. During testing, we found that SPPO is sensitive to variations in the environment; the corresponding evaluation results are given in Fig. 9 and Fig. 10 in the Appendix.

Fig. 4: LAC and SAC agents in the presence of dynamic uncertainties; panels include (a) LAC-CartPole and (b) SAC-CartPole. The solid line indicates the average trajectory and the shaded region the 1-SD confidence interval. In (a) and (b), the pole length is varied during inference. In (c) and (d), three parameters are selected to reflect the uncertainties in gene expression. The X-axis indicates the time and the Y-axis shows the angle of the pole in (a,b) and the concentration of the target protein in (c,d), respectively. The dashed line indicates the reference signal. The orange line indicates the dynamics in the original environment. For each curve, only the noted parameter differs from the original setting.

V-E1 Robustness to dynamic uncertainty

In this part, we vary the system parameters and introduce process noise in the model/simulator during inference to evaluate the algorithm's robustness. In CartPole, we vary the length of the pole. In GRN, we vary the promoter strength and dissociation rate. Due to the stochastic nature of gene expression, we also introduce uniformly distributed noise at different noise levels to the dynamics of GRN. The state trajectories of the closed-loop system under LAC and SAC agents in the varied environments are shown in Figure 4.

As shown in Figure 4 (a) and (c), the policies trained by LAC are very robust to the dynamic uncertainties and achieve high tracking precision in each case. On the other hand, though SAC performs well in the original environment (Figure 4 b and d), it fails in all of the varied environments.

V-E2 Robustness to disturbances

An inherent property of a stable system is the ability to recover from perturbations such as external forces and wind. To show this, we introduce periodic external disturbances with different magnitudes in each environment and observe the performance difference between policies trained by LAC and SAC. We also include LQR as the model-based baseline. In CartPole, the agent may fall over when perturbed by an external force, ending the episode early. Thus, in this task, we measure the robustness of the controller through the death rate, i.e., the probability of falling over after being disturbed. For the other tasks, where the episodes are always of the same length, we measure the robustness of the controller by the variation in the cumulative cost. Under each disturbance magnitude, the policies are tested for 100 trials and the performance is shown in Figure 5.

Fig. 5: Performance of LAC, SAC, SPPO and LQR in the presence of persistent disturbances with different magnitudes on (a) CartPole, (b) HalfCheetah, (c) FetchReach and (d) GRN. The X-axis indicates the magnitude of the applied disturbance. The Y-axis indicates the death rate in CartPole (a) and the cumulative cost in the other examples (b)-(d). All of the trained policies are evaluated for 100 trials in each setting.

As shown in Figure 5, the controllers trained by LAC outperform SAC and LQR by a large margin in CartPole and GRN (lower death rate and cumulative cost). In HalfCheetah, SAC and LAC are both robust to small external disturbances, while LAC is more reliable under larger ones. In FetchReach, SAC and LAC perform reliably across all of the external disturbances. In all of the experiments, SPPO agents can hardly withstand any external disturbance.

Fig. 6: State trajectories under policies trained by (a) LAC and (b) SAC when tracking different reference signals. The solid line indicates the average trajectory and the shaded region the 1-SD confidence interval. The X-axis indicates the time and the Y-axis shows the concentration of the protein to be controlled. Dashed lines in different colors are the different reference signals: sinusoid with period of 150 (brown); sinusoid with period of 200 (skyblue); sinusoid with period of 400 (blue); constant reference of 8 (red); constant reference of 16 (green).

V-E3 Generalization over different tracking references

In this part, we introduce four different reference signals that are unseen during training in the GRN: sinusoids with periods of 150 (brown) and 400 (blue), and the constant reference of 8 (red) and 16 (green). We also show the original reference signal used for training (skyblue) as a benchmark. Reference signals are indicated in Figure 6 by the dashed line in respective colors. All of the trained policies are tested for 10 times with each reference signal. The average dynamics of the target protein are shown in Figure 6 with the solid line, while the variance of dynamic is indicated by the shadowed area.

As shown in Figure 6, the policies trained by LAC generalize well to previously unseen reference signals and follow them with low variance (the dynamics are very close to the dashed lines), regardless of whether the signals share the same mathematical form as the one used for training. On the other hand, although SAC tracks the original reference signal well once the unconverged training trials are removed (see the skyblue lines), it is still unable to follow some of the reference signals (see the brown line) and exhibits larger variance than LAC.
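The reference signals and the mean and 1-SD band plotted in Figure 6 could be produced along the following lines. The periods (150, 200, 400) and constant levels (8, 16) come from the text; the sinusoid amplitude and offset are not stated here, so the defaults below are placeholders, and all function names are hypothetical.

```python
import math
import statistics


def sinusoid(period, amplitude=1.0, offset=0.0):
    """Sinusoidal reference; amplitude and offset are assumed placeholders."""
    return lambda t: offset + amplitude * math.sin(2.0 * math.pi * t / period)


def constant(level):
    """Constant reference signal."""
    return lambda t: level


REFERENCES = {
    "sin_150": sinusoid(150),   # unseen during training
    "sin_200": sinusoid(200),   # period used for training
    "sin_400": sinusoid(400),   # unseen during training
    "const_8": constant(8.0),   # unseen during training
    "const_16": constant(16.0), # unseen during training
}


def mean_and_band(trajectories):
    """Per-step mean and the mean +/- 1 SD band across repeated trials."""
    means, lower, upper = [], [], []
    for values in zip(*trajectories):
        m = statistics.fmean(values)
        s = statistics.pstdev(values)  # population SD over the trials
        means.append(m)
        lower.append(m - s)
        upper.append(m + s)
    return means, lower, upper


means, lower, upper = mean_and_band([[1.0, 2.0], [3.0, 4.0]])
```

Here each inner list is one trial's trajectory of the controlled protein concentration; ten such lists per reference signal would match the evaluation protocol described above.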

V-F Influence of Different Lyapunov Candidate Functions

As a matter of independent interest, we evaluate the influence of choosing different Lyapunov candidate functions in this part. First, we adopt candidates with different time horizons to train policies in the CartPole example, and compare their performance in terms of cumulative cost and robustness. Here, implies using the value function as the Lyapunov candidate. Both of the Lyapunov critics are parameterized as in (6). For the evaluation of robustness, we apply an impulsive force at instant and observe the death rate of the trained policies. The results are shown in Figure 7.
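To make the two kinds of candidate concrete, the sketch below computes regression targets for a Lyapunov critic from a recorded sequence of per-step costs: a sum of costs over a finite window, and a discounted cost-to-go as in a value function. This is an illustration under assumptions, not the paper's implementation: the window convention (the current step plus the following steps, truncated at the end of the episode) and the function names are hypothetical.

```python
def finite_horizon_targets(costs, horizon):
    """Sum of the next `horizon` costs from each step, truncated at episode end."""
    return [sum(costs[t:t + horizon]) for t in range(len(costs))]


def discounted_targets(costs, gamma):
    """Discounted cost-to-go from each step (value-function style candidate)."""
    targets = [0.0] * len(costs)
    running = 0.0
    # Accumulate backwards so each step reuses the tail sum of its successor.
    for t in reversed(range(len(costs))):
        running = costs[t] + gamma * running
        targets[t] = running
    return targets


short = finite_horizon_targets([1, 2, 3, 4], horizon=2)
long_run = discounted_targets([1.0, 1.0, 1.0], gamma=0.5)
```

A longer window makes the target depend on costs further into the future, which is the quantity varied in the comparison of Figure 7.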

(a) Horizon-Training
(b) Horizon-Robustness
Fig. 7: Influence of different Lyapunov candidate functions. In (a), the Y-axis indicates the cumulative cost during training and the X-axis indicates the total time steps in thousands. (b) shows the death rate of policies in the presence of an instantaneous impulsive force ranging from 80 to 150 Newtons.

As shown in Figure 7, both choices of Lyapunov candidate converge quickly and achieve comparable cumulative cost at convergence. However, in terms of robustness, the choice of plays an important role. As observed in Figure 7 (b), the robustness of the controller decreases as the time horizon increases. It is also interesting to observe that LQR is more robust than SAC when faced with an instantaneous impulsive disturbance.

VI Conclusions

In this paper, we proposed a model-free approach for analyzing the stability of discrete-time nonlinear stochastic systems modeled by a Markov decision process, by employing the Lyapunov function from control theory. Based on this theoretical result, we proposed a practical algorithm for designing stability-assured controllers for stabilization and tracking problems. We evaluated the proposed method on various examples and showed that it not only achieves comparable or superior performance compared with state-of-the-art RL algorithms but is also considerably more robust to uncertainties and disturbances.


I Further Experiment Setup

We set up the experiments using OpenAI Gym [37]. Snapshots of the environments can be found in Figure 8.

Fig. 8: Snapshot of environments using OpenAI Gym.

I-A CartPole

In this experiment, the controller is to sustain the pole vertically at a target position . This is a modified version of CartPole in [37] with a continuous action space. The action is the horizontal force applied on the cart (). and represent the maximum of position and angle, respectively, and . The controller dies if or and the episode ends in advance. Cost function . The episodes are of length 250. For robustness evaluation in Section V-E, we apply an impulsive disturbance force on the cart every 20 seconds, whose magnitude ranges from 80 to 150 and whose direction is opposite to that of the control input. In Section V-F, the impulsive disturbance has the same magnitude range and direction as in Section V-E, but is applied only once at instant .

I-B HalfCheetah

HalfCheetah is a modified version of that in Gym’s robotics environment [37]. The task is to control a HalfCheetah (a 2-legged simulated robot) to run at the speed of . The reward is where is the forward speed of the HalfCheetah. The control input is the torque applied on each joint, ranging from -1 to 1. The episodes are of length 200.

For robustness evaluation in Section V-E, we apply an impulsive disturbance torque on each joint every 20 seconds, of which the magnitude ranges from 0.2 to 2.0 and the direction is opposite to the direction of control input.

I-C FetchReach-v1

We modify the FetchReach in Gym’s robotics environment [37] to a cost version, where the controller is expected to control the manipulator’s end effector to reach a random goal position. The cost is designed as , where is the distance between the goal and the end effector. The control input is the torque applied on each joint, ranging from -1 to 1. The episodes are of length 200.

For robustness evaluation in Section V-E, we apply an impulsive disturbance torque on each joint every 20 seconds, of which the magnitude ranges from 0.2 to 2.0 and the direction is opposite to the direction of control input.
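The disturbance protocol shared by the three environments (an impulse applied periodically, or only once, with direction opposite to the control input) can be sketched as a small helper. Treating the interval of 20 as a count of simulation steps is an assumption made for this illustration, as are the function name and the `once_at` parameter; the instant used for the single impulse in Section V-F is not filled in.

```python
def impulsive_disturbance(step, control, magnitude, period=20, once_at=None):
    """Return a per-channel disturbance opposing the sign of each control input.

    `control` is a list with one entry per actuated channel (a single force
    for CartPole, one torque per joint for HalfCheetah and FetchReach).
    """
    if once_at is not None:
        active = step == once_at                    # single impulse
    else:
        active = step > 0 and step % period == 0    # periodic impulses
    if not active:
        return [0.0 for _ in control]
    return [-magnitude if u >= 0 else magnitude for u in control]


periodic = impulsive_disturbance(step=20, control=[0.3, -0.7], magnitude=0.2)
quiet = impulsive_disturbance(step=21, control=[0.3, -0.7], magnitude=0.2)
single = impulsive_disturbance(step=5, control=[1.0], magnitude=100.0, once_at=5)
```

The returned values would be added to the commanded force or torques before stepping the simulator.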

II Hyperparameters

CartPole FetchReach HalfCheetah GRN CompGRN
Time horizon 5 5
Minibatch size 256 256 256 256 256
Actor learning rate 1e-4 1e-4 1e-4 1e-4 1e-4
Critic learning rate 3e-4 3e-4 3e-4 3e-4 3e-4
Lyapunov learning rate 3e-4 3e-4 3e-4 3e-4 3e-4
Target entropy -1 -5 -6 -3 -4
Soft replacement() 0.005 0.005 0.005 0.005 0.005
Discount() NAN NAN 0.995 NAN NAN
1.0 1.0 1.0 1.0 1.0
Structure of (64,64,16) (64,64,16) (256,256,16) (256,256,16) (256,256,16)
TABLE I: Hyperparameters of LAC
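For readers who want to reuse these settings, the unambiguous rows of Table I can be written as a configuration mapping, as sketched below. The key names are hypothetical, `None` stands for the NAN entries, and the rows whose label or column assignment is not recoverable from the table as printed (the time horizon row and the unlabeled row of 1.0 values) are left out. The `soft_update` helper shows the standard soft target replacement that a rate of 0.005 usually refers to; whether the authors' code implements it exactly this way is an assumption.

```python
SHARED = {
    "minibatch_size": 256,
    "actor_lr": 1e-4,
    "critic_lr": 3e-4,
    "lyapunov_lr": 3e-4,
    "soft_replacement": 0.005,
}

# Per-environment entries: (target entropy, discount, network structure).
PER_ENV = {
    "CartPole": (-1, None, (64, 64, 16)),
    "FetchReach": (-5, None, (64, 64, 16)),
    "HalfCheetah": (-6, 0.995, (256, 256, 16)),
    "GRN": (-3, None, (256, 256, 16)),
    "CompGRN": (-4, None, (256, 256, 16)),
}

LAC_HYPERPARAMS = {
    env: dict(SHARED, target_entropy=ent, discount=disc, structure=struct)
    for env, (ent, disc, struct) in PER_ENV.items()
}


def soft_update(target, source, rate):
    """Move each target parameter a fraction `rate` toward the source value."""
    return [(1.0 - rate) * t + rate * s for t, s in zip(target, source)]


updated = soft_update([0.0, 2.0], [1.0, 2.0], SHARED["soft_replacement"])
```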

III Evaluation on Robustness and Generalization using SPPO

(a) Repressilator
Fig. 9: State trajectories under policies trained by SPPO when tracking different reference signals. The setting of the uncertainty is the same as in Section V-E3.
(a) CartPole
(b) Repressilator
Fig. 10: State trajectories over time under policies trained by SPPO and tested in the presence of parametric uncertainties and process noise, for CartPole and Repressilator. The setting of the uncertainty is the same as in Section V-E1.