I Introduction
Robotics problems generally involve nonlinear stochastic systems with high-dimensional states and actions, and are naturally phrased as problems of reinforcement learning [1]. In recent years, significant progress has been made by combining advances in deep learning with reinforcement learning. Impressive results have been obtained in a series of high-dimensional robotic control tasks where sophisticated and hard-to-engineer behaviors are achieved [2, 3, 4, 5]. However, the performance of an RL agent is by and large evaluated through trial and error, and RL can hardly provide any guarantee for the reliability of the learned control policy. Given a control system, regardless of which controller design method is used, the first and most important property that needs to be guaranteed is stability, because an unstable control system is typically useless and potentially dangerous [6]. A stable system is guaranteed to converge to the equilibrium or reference signal, and it can recover to these targets even in the presence of parametric uncertainties and disturbances [6]. Thus stability is closely related to the robustness, safety and reliability of robotic systems.
The most useful and general approach for studying the stability of robotic systems is the Lyapunov method [7], which is dominant in control engineering [8, 9]. In the Lyapunov method, a scalar "energy-like" function called the Lyapunov function is constructed to analyze the stability of the system. The controller is designed such that the difference of the Lyapunov function along the state trajectory is semi-negative definite, i.e., $V(s_{t+1}) - V(s_t) \le 0$ for all time instants $t$, so that the state goes in the direction of decreasing the value of the Lyapunov function and eventually converges to the equilibrium [10, 11]. In learning methods, the "energy decreasing" condition has to be verified by trying out all possible consecutive data pairs $(s_t, s_{t+1})$, i.e., by verifying infinitely many inequalities $V(s_{t+1}) - V(s_t) \le 0$. Obviously, this "infinity" requirement cannot be met, making a direct application of Lyapunov's method impossible.
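To make the verification issue concrete, here is a minimal sketch of our own (not the paper's algorithm): for an assumed toy linear system $s_{t+1} = A s_t$ with candidate $V(s) = \|s\|^2$, a finite batch of sampled transitions can be checked against the energy-decreasing condition, but no finite batch ever covers the continuous state space.

```python
import numpy as np

# Toy deterministic linear system s' = A @ s with Lyapunov candidate
# V(s) = ||s||^2. The matrix A and the sample count are illustrative.
rng = np.random.default_rng(0)
A = 0.8 * np.eye(2)                       # contracting toy dynamics (assumption)

states = rng.normal(size=(1000, 2))       # a finite batch of sampled states
next_states = states @ A.T                # corresponding next states

def V(s):
    """Candidate Lyapunov function V(s) = ||s||^2, applied row-wise."""
    return np.sum(s ** 2, axis=-1)

decrease = V(next_states) - V(states)     # should be <= 0 for every pair
all_decrease = bool(np.all(decrease <= 0))
```

Every sampled pair satisfies the decrease condition here, yet the check says nothing about the uncountably many states that were never sampled, which is exactly the obstacle the data-based analysis below circumvents.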
In this paper, we propose a data-based stability theorem and a stability-guaranteed reinforcement learning framework to jointly learn the controller or policy¹ (¹Controller and policy will be used interchangeably throughout the paper.) and a Lyapunov function, both of which are parameterized by deep neural networks, with a focus on stabilization and tracking problems in robotic systems. The contributions of our paper can be summarized as follows: 1) a novel data-based stability theorem in which only one inequality needs to be evaluated; 2) the stability condition proposed above is exploited as the critic, and an actor-critic algorithm is designed to search for the stability-guaranteed controller; 3) we show through experiments that the learned controller can stabilize the systems when interfered with by uncertainties such as unseen disturbances and system parametric variations to a certain extent. In our experiments, we show that the stability-guaranteed controller is more capable of handling uncertainties than those without such guarantees in nonlinear control problems, including the classic CartPole stabilization task, control of 3D legged robots and a manipulator, and reference tracking tasks for synthetic-biology gene regulatory networks.
I-A Related Works
In model-free reinforcement learning (RL), stability is rarely addressed due to the formidable challenge of analyzing and designing the closed-loop system dynamics in a model-free manner [12], and the associated stability theory in model-free RL remains an open problem [12, 13].
Recently, Lyapunov analysis has been used in model-free RL to solve control problems with safety constraints [14, 15]. In [14], a Lyapunov-based approach for solving constrained Markov decision processes is proposed, with a novel way of constructing the Lyapunov function through linear programming. In [15], the above results were further generalized to continuous control tasks. Even though Lyapunov-based methods were adopted in these works, neither of them addressed the stability of the system. Other interesting results on the stability of learning-based control systems have been reported in recent years. In [16], an initial result is proposed for the stability analysis of deterministic nonlinear systems with an optimal controller for infinite-horizon discounted cost, based on the assumption that the discount factor is sufficiently close to 1. In [17, 18], a learning model-based safe RL approach with a safety guarantee during exploration is introduced, but it is limited to Lipschitz continuous nonlinear systems such as Gaussian process models. In addition, the verification of the stability condition requires discretization of the state space, which limits its application to tasks with low-dimensional finite state spaces.
II Problem Statement
In this paper, we focus on stabilization and tracking problems in robotic systems modeled by a Markov decision process (MDP). The state of the robot and its environment at time $t$ is given by $s_t \in \mathcal{S}$, where $\mathcal{S}$ denotes the state space. The robot then takes an action $a_t \in \mathcal{A}$ according to a stochastic policy $\pi(a_t|s_t)$, resulting in the next state $s_{t+1}$. The transition of the state is modeled by the transition probability $P(s_{t+1}|s_t, a_t)$. In both stabilization and tracking problems, there is always a cost function $c(s_t, a_t)$ to measure how good or bad a state-action pair is. In stabilization problems, the goal is to find a policy $\pi$ such that the norm of the state $\|s_t\|$ goes to zero, where $\|\cdot\|$ denotes the Euclidean norm. In this case, the cost function is $c(s_t, a_t) = \|s_t\|$. In tracking problems, we divide the state $s$ into two vectors, $s_1$ and $s_2$, where $s_1$ is composed of the elements of $s$ that are aimed at tracking the reference signal $r$, while $s_2$ contains the rest. The reference signal could be a desired velocity, a path, or even the picture of grasping an object in a certain pose. For tracking problems, $c(s_t, a_t) = \|s_1 - r\|$. From a control perspective, both stabilization and tracking are related to the asymptotic stability of the closed-loop system (or error system) under $\pi$, i.e., starting from an initial point, the trajectories of the state always converge to the origin or the reference trajectory. Let $c_\pi(s_t) \triangleq \mathbb{E}_{a\sim\pi} c(s_t, a)$ denote the cost function under the policy $\pi$; the definition of stability studied in this paper is given as follows.
Definition 1
The stochastic system is said to be stable in mean cost if $\lim_{t\to\infty}\mathbb{E}_{s_t}[c_\pi(s_t)] = 0$ holds for any initial condition $s_0 \in \{s : \|s\| \le b\}$. If $b$ is arbitrarily large, then the stochastic system is globally stable in mean cost.
The stabilization and tracking problems can thus be collectively summarized as finding a policy $\pi$ such that the closed-loop system is stable in mean cost according to Definition 1.
Before proceeding, some notation is to be defined. $\rho(s)$ denotes the distribution of starting states. The closed-loop transition probability is denoted as $P_\pi(s'|s) \triangleq \int_{\mathcal{A}} \pi(a|s) P(s'|s,a)\,da$. We also introduce the closed-loop state distribution at a certain instant $t$ as $P(s|\rho,\pi,t)$, which can be defined iteratively: $P(s|\rho,\pi,0) = \rho(s)$ and $P(s'|\rho,\pi,t+1) = \int_{\mathcal{S}} P_\pi(s'|s)P(s|\rho,\pi,t)\,ds$.
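For a finite-state chain, the iterative definition of the closed-loop state distribution can be simulated directly. The sketch below (with an arbitrary three-state transition matrix of our own choosing, not from the paper) propagates the distribution and forms its running time-average, the quantity that later plays the role of the sampling distribution.

```python
import numpy as np

# Hypothetical 3-state closed-loop chain: P_pi[i, j] = P(s' = j | s = i).
P_pi = np.array([[0.9, 0.1, 0.0],
                 [0.2, 0.7, 0.1],
                 [0.0, 0.3, 0.7]])
rho = np.array([1.0, 0.0, 0.0])           # starting-state distribution

N = 500
dist = rho.copy()                         # P(s | rho, pi, t), starting at t = 0
avg = np.zeros_like(rho)
for t in range(N):
    avg += dist                           # accumulate the time average
    dist = dist @ P_pi                    # P(. | rho, pi, t+1) = P(. | rho, pi, t) P_pi
mu = avg / N                              # finite-N estimate of the averaged distribution
```

With $N$ large, `mu` approaches the stationary distribution of this ergodic chain, mirroring the limit used in the stability analysis of Section III.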
III Data-Based Stability Analysis
In this section, we propose the main assumptions and a new theorem for the stability analysis of stochastic systems. We assume that the Markov chain induced by the policy $\pi$ is ergodic with a unique stationary distribution $q_\pi$, as is commonly exploited in the RL literature [19, 20, 21, 22].
In Definition 1, stability is defined in relation to the set of starting states, which is also called the region of attraction (ROA). If the system starts within the ROA, its trajectory will surely be attracted to the equilibrium. To build a sample-based stability guarantee, we need to ensure that the states in the ROA are accessible for the stability analysis. Thus the following assumption is made to ensure that every state in the ROA has a chance to be sampled as a starting state.
Assumption 1
There exists a positive constant $b$ such that $\rho(s) > 0$ for all $s \in \{s : \|s\| \le b\}$.
Our approach is to construct/find a Lyapunov function whose difference along the state trajectory is semi-negative definite, so that the state goes in the direction of decreasing the value of the Lyapunov function and eventually converges to the origin. The Lyapunov method has long been used for stability analysis and controller design in control theory [23], but it is mostly exploited along with a known model, so that the energy-decreasing condition on the entire state space can be transformed into a single inequality on the model parameters [6, 24]. In the following, we show that, without a dynamic model, this "infinity" problem can be solved through sampling, and we give sufficient conditions for a stochastic system to be stable in mean cost.
Theorem 1
The stochastic system is stable in mean cost if there exists a function $L : \mathcal{S} \to \mathbb{R}_+$ and positive constants $\alpha_1$, $\alpha_2$ and $\alpha_3$, such that

$\alpha_1 c_\pi(s) \le L(s) \le \alpha_2 c_\pi(s)$  (1)

$\mathbb{E}_{s\sim\mu_\pi}\big[\mathbb{E}_{s'\sim P_\pi} L(s') - L(s)\big] \le -\alpha_3\,\mathbb{E}_{s\sim\mu_\pi} c_\pi(s)$  (2)

where $\mu_\pi(s) \triangleq \lim_{N\to\infty}\frac{1}{N+1}\sum_{t=0}^{N} P(s|\rho,\pi,t)$ is the (infinite) sampling distribution.
Proof: The existence of the sampling distribution $\mu_\pi$ is guaranteed by the existence of $q_\pi$. Since the sequence $\{P(s|\rho,\pi,t)\}$ converges to $q_\pi(s)$ as $t$ approaches $\infty$, by the Abelian theorem the sequence of averages $\{\frac{1}{N+1}\sum_{t=0}^{N} P(s|\rho,\pi,t)\}$ also converges, and $\mu_\pi = q_\pi$. Combined with the form of $\mu_\pi$, (2) implies that

$\lim_{N\to\infty}\frac{1}{N+1}\sum_{t=0}^{N}\int_{\mathcal{S}} P(s|\rho,\pi,t)\big(\mathbb{E}_{s'\sim P_\pi}L(s') - L(s)\big)\,ds \le -\alpha_3\,\mathbb{E}_{s\sim\mu_\pi} c_\pi(s).$  (3)

First, on the left-hand side, $L(s) \ge \alpha_1 c_\pi(s) \ge 0$ for all $s$ according to (1). Consider that $\int_{\mathcal{S}} P(s|\rho,\pi,t)\,\mathbb{E}_{s'\sim P_\pi}L(s')\,ds = \int_{\mathcal{S}} P(s|\rho,\pi,t+1)L(s)\,ds$, so the sum in (3) telescopes to $\frac{1}{N+1}\big(\int_{\mathcal{S}} P(s|\rho,\pi,N+1)L(s)\,ds - \int_{\mathcal{S}}\rho(s)L(s)\,ds\big)$.

On the other hand, the sequence $\{P(s|\rho,\pi,N)L(s)\}$ converges pointwise to the function $q_\pi(s)L(s)$. According to Lebesgue's dominated convergence theorem [25], if a sequence $\{f_N\}$ converges pointwise to a function $f$ and is dominated by some integrable function $g$ in the sense that $|f_N| \le g$, then $\lim_{N\to\infty}\int f_N = \int f$. Thus $\lim_{N\to\infty}\int_{\mathcal{S}} P(s|\rho,\pi,N+1)L(s)\,ds = \int_{\mathcal{S}} q_\pi(s)L(s)\,ds$, which is finite, and the left-hand side of (3) equals $\lim_{N\to\infty}\frac{1}{N+1}\big(\int_{\mathcal{S}} q_\pi(s)L(s)\,ds - \int_{\mathcal{S}}\rho(s)L(s)\,ds\big) = 0$.

Thus, taking the relations above into consideration, (3) implies

$\alpha_3\,\mathbb{E}_{s\sim\mu_\pi} c_\pi(s) \le 0.$  (4)

Since $\mathbb{E}_{s\sim\mu_\pi}c_\pi(s)$ is a finite value and $c_\pi$ is semi-positive definite, it follows that

$\mathbb{E}_{s\sim q_\pi} c_\pi(s) = 0.$  (5)

Suppose that there exist a state $s^*$ with $\rho(s^*) > 0$ and a positive constant $\epsilon$ such that $\lim_{t\to\infty}\mathbb{E}[c_\pi(s_t)] = \epsilon > 0$, or that the limit does not exist. Since $\rho(s) > 0$ for all starting states in $\{s : \|s\| \le b\}$ (Assumption 1), it follows that $\mathbb{E}_{s\sim q_\pi} c_\pi(s) > 0$, which is contradictory with (5). Thus $\lim_{t\to\infty}\mathbb{E}[c_\pi(s_t)] = 0$ for all $s_0 \in \{s : \|s\| \le b\}$, and the system is stable in mean cost by Definition 1. ∎
Condition (1) directs the choice and construction of the Lyapunov function, of which the details are deferred to Section IV. Condition (2) is called the energy-decreasing condition and is the major criterion for determining stability.
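In practice, the single inequality (2) can be estimated by a Monte-Carlo average over sampled transitions. The sketch below uses stand-in dynamics, cost, and Lyapunov function of our own choosing (not a learned critic) purely to illustrate the check.

```python
import numpy as np

# Stand-in quantities: contracting dynamics s' = 0.7 s, cost c(s) = ||s||^2,
# and L(s) = 2 ||s||^2, which satisfies condition (1) with alpha1 = alpha2 = 2.
rng = np.random.default_rng(1)
alpha3 = 0.1

s = rng.normal(size=(5000, 2))            # samples standing in for the sampling distribution
s_next = 0.7 * s                          # toy contracting next states
cost = np.sum(s ** 2, axis=-1)            # c(s) = ||s||^2

def L(x):
    return 2.0 * np.sum(x ** 2, axis=-1)

# Sample estimate of E[L(s') - L(s)] + alpha3 * E[c(s)], i.e., condition (2).
lhs = np.mean(L(s_next) - L(s)) + alpha3 * np.mean(cost)
condition_holds = bool(lhs <= 0.0)
```

A single scalar estimate replaces the infinitely many pointwise inequalities of the classical method, which is what makes the condition usable as a critic in the learning algorithm of Section IV.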
Remark 1
This remark concerns the connection to previous results on the stability of stochastic systems. It should be noted that stability conditions for Markov chains have been reported in [26, 27]; however, their validation requires verifying infinitely many inequalities on the state space if $\mathcal{S}$ is continuous. On the contrary, our approach only validates one inequality (2) related to the sampling distribution $\mu_\pi$, which further enables data-based stability analysis and policy learning for the stochastic system.
IV Algorithm
In this section, we propose an actor-critic RL algorithm to learn stability-guaranteed policies for the stochastic system. First, we introduce the Lyapunov critic function and show how it is constructed. Then, based on the maximum entropy actor-critic framework, we use the Lyapunov critic function in the policy gradient formulation.
IV-A Lyapunov Critic Function
In our framework, the Lyapunov critic plays a role in both the stability analysis and the learning of the actor. To enable actor-critic learning, the Lyapunov critic $L_c$ is designed to depend on both $s$ and $a$ and to satisfy $L(s) = \mathbb{E}_{a\sim\pi} L_c(s,a)$, so that it can be exploited in judging the value of (2). In view of the requirement above, $L_c$ should be a non-negative function of the state and action, $L_c : \mathcal{S}\times\mathcal{A} \to \mathbb{R}_+$. In this paper, we construct the Lyapunov critic with the following parameterization technique,
$L_c(s,a) = f_\theta(s,a)^\top f_\theta(s,a)$  (6)

where $f_\theta(s,a)$ is the output vector of a fully connected neural network with parameter $\theta$.
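A minimal sketch of this construction (with random placeholder weights and input/feature dimensions of our own choosing): the critic is the inner product of the network's output vector with itself, so non-negativity holds by design.

```python
import numpy as np

# f_theta: a small fully connected network mapping concat(s, a) to a feature
# vector; L_c(s, a) = f_theta(s, a)^T f_theta(s, a) >= 0 by construction.
rng = np.random.default_rng(2)
W1 = rng.normal(scale=0.1, size=(6, 16))  # input dim 6 = dim(s) + dim(a), both assumed
b1 = np.zeros(16)
W2 = rng.normal(scale=0.1, size=(16, 8))  # output feature dim 8 (illustrative)

def lyapunov_critic(s, a):
    x = np.concatenate([s, a])
    h = np.tanh(x @ W1 + b1)              # hidden layer
    f = h @ W2                            # feature vector f_theta(s, a)
    return float(f @ f)                   # squared norm, guaranteed >= 0

value = lyapunov_critic(np.ones(4), np.ones(2))
```

The squared-norm output makes the non-negativity requirement structural rather than something the training objective must enforce.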
During the learning process, $L_c$ is updated to approximate a designed Lyapunov candidate function. A Lyapunov candidate is an ideal function that naturally satisfies the properties of a Lyapunov function, such as the norm of the state or the value function. However, Lyapunov candidates are not parameterized and thus are not directly applicable in an actor-critic learning process. We therefore use the Lyapunov candidate as the supervision signal for the training of $L_c$ and update $\theta$ to minimize the following objective function,
$J(L_c) = \mathbb{E}_{(s,a)\sim\mathcal{D}}\big[\big(L_{\mathrm{target}}(s,a) - L_c(s,a)\big)^2\big]$  (7)

where $L_{\mathrm{target}}$ is the approximation target for $L_c$ and $\mathcal{D}$ is the set of collected transition pairs. In [14] and [17], the value function has been proved to be a valid Lyapunov candidate function, in which case the approximation target is

$L_{\mathrm{target}}(s,a) = c(s,a) + \gamma\,\mathbb{E}_{a'\sim\pi} L_{c'}(s',a')$  (8)
where $L_{c'}$ is the target network parameterized by $\theta'$, as typically used in actor-critic methods [28, 29]. $L_{c'}$ has the same structure as $L_c$, but the parameter $\theta'$ is updated through an exponentially moving average of the weights of $\theta$, controlled by a hyperparameter $\tau \in (0,1)$: $\theta' \leftarrow \tau\theta + (1-\tau)\theta'$. In addition to the value function, the sum of cost over a finite time horizon can also be employed as the Lyapunov candidate, as exploited in the model predictive control literature [30, 9] for stability analysis. In this case,
$L_{\mathrm{target}}(s_t, a_t) = \mathbb{E}\big[\textstyle\sum_{t'=t}^{t+N} c(s_{t'}, a_{t'})\big]$  (9)

Here, the time horizon $N$ is a hyperparameter to be tuned, of which the influence will be demonstrated in the experiments in Section V.
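Computing the sum-of-cost target (9) from a stored episode reduces to a sliding-window sum; a short sketch with synthetic per-step costs (the numbers are illustrative):

```python
import numpy as np

def sum_of_cost_targets(costs, N):
    """targets[t] = sum of costs over the next N steps, truncated at episode end."""
    costs = np.asarray(costs, dtype=float)
    T = len(costs)
    return np.array([costs[t:min(t + N, T)].sum() for t in range(T)])

# Synthetic episode costs; N is the horizon hyperparameter from (9).
targets = sum_of_cost_targets([1.0, 2.0, 3.0, 4.0], N=2)
# targets = [1+2, 2+3, 3+4, 4] = [3, 5, 7, 4]
```

Unlike the bootstrapped target (8), these targets are computed directly from observed costs, which is the source of the bias/variance trade-off discussed next.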
The choice of Lyapunov candidate function plays an important role in learning a policy. The value function evaluates an infinite time horizon and thus offers better performance in general, but it is rather difficult to approximate because of significant variance and bias [31]. On the other hand, the finite-horizon sum of cost provides an explicit target for learning a Lyapunov function, thus inherently reducing the bias and enhancing the learning process. However, as the model is unknown, predicting future costs based on the current state and action inevitably introduces variance, which grows as the prediction horizon extends. In principle, for tasks with simple dynamics, the sum-of-cost choice enhances the convergence of learning and the robustness of the trained policies, while for complicated systems the value function generally produces better performance. In this paper, we use both the value function and the sum of cost as Lyapunov candidate functions. Later, in Section V, we will show the influence of these different choices upon the performance and robustness of the trained policies.

IV-B Lyapunov-based Actor-Critic
In this subsection, we will focus on how to learn the controller in a novel actorcritic framework called Lyapunovbased Actor Critic (LAC), such that the inequality (2) is satisfied. The policy learning problem is summarized as the following constrained optimization problem,
find $\pi_\theta$  (10)

s.t. $\mathbb{E}_{\mathcal{D}}\big[L_c(s', f_\theta(\epsilon, s')) - L_c(s,a) + \alpha_3 c(s,a)\big] \le 0$ and $\mathbb{E}_{s\sim\mathcal{D}}\big[-\log \pi_\theta(a|s)\big] \ge \mathcal{H}_t$  (11)

where the second constraint is the minimum entropy constraint borrowed from the maximum entropy RL framework to improve exploration in the action space [28], and $\mathcal{H}_t$ is the desired bound. Solving the above constrained optimization problem is equivalent to minimizing the following objective function,

$J(\pi) = \mathbb{E}_{\mathcal{D}}\big[\beta \log \pi_\theta(f_\theta(\epsilon, s)|s) + \lambda\big(L_c(s', f_\theta(\epsilon, s')) - L_c(s,a) + \alpha_3 c(s,a)\big)\big]$  (12)
where $\beta$ and $\lambda$ are Lagrange multipliers which control the relative importance of the minimum entropy constraint and (2). The stochastic policy is parameterized by a deep neural network $f_\theta$ that depends on $s$ and a Gaussian noise $\epsilon$, i.e., $a = f_\theta(\epsilon, s)$. (2) is estimated by the second term in (12). One may be curious why, in the second term of (12), only one Lyapunov critic is explicitly dependent on the stochastic policy, while the other depends on the sampled action. First, note that this estimator is also an unbiased estimate of (2), although the variance may be increased compared to replacing $a$ with $f_\theta(\epsilon, s)$. From a more practical perspective, having the second Lyapunov critic explicitly depend on $f_\theta(\epsilon, s)$ would introduce a term in the policy gradient that updates $\theta$ to increase the value of $L_c$, which is contradictory to our goal of stabilization. In the actor-critic framework, the parameters of the policy network are updated through stochastic gradient descent of (12), which is approximated by the sampled gradient in (13).
The values of the Lagrange multipliers $\beta$ and $\lambda$ are automatically adjusted by the gradient method maximizing the objective function (12), and are clipped to be positive. Pseudo-code of the proposed algorithm is shown in Algorithm 1.
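The dual update for a multiplier can be sketched as plain gradient ascent on the Lagrangian with a positivity clip; the learning rate and constraint values below are illustrative, not the tuned ones.

```python
def update_multiplier(multiplier, constraint_value, lr=1e-2):
    """One gradient-ascent step on the Lagrangian w.r.t. a multiplier,
    clipped so the multiplier stays non-negative."""
    return max(0.0, multiplier + lr * constraint_value)

lam = 1.0
# If the Lyapunov constraint is satisfied (negative constraint value),
# repeated updates drive lambda to zero, signalling that (2) holds.
for _ in range(2000):
    lam = update_multiplier(lam, constraint_value=-0.1)
```

This is why, in the experiments, a multiplier converging to zero serves as a certificate that the corresponding constraint is satisfied at convergence.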
Figure 1: Cumulative control performance comparison. The Y-axis indicates the total cost during one episode and the X-axis indicates the total time steps in thousands. The shaded region shows the 1-SD confidence interval over 10 random seeds. Across all trials of training, LAC converges to a stabilizing solution with comparable or superior performance compared with SAC and SPPO.
V Experiment
In this section, we illustrate five simulated robotic control problems to demonstrate the general applicability of the proposed method. First, the classic control problem of CartPole balancing from the control and RL literature [32] is illustrated. Then, we consider more complicated high-dimensional continuous control problems for 3D robots, e.g., HalfCheetah and FetchReach, using the MuJoCo physics engine [33]. Last, we extend our approach to control autonomous systems in the cell, i.e., molecular synthetic-biology gene regulatory networks (GRN). Specifically, we consider the problem of reference tracking for two GRNs [34].
The proposed method is evaluated for the following aspects:

Convergence: does the proposed training algorithm converge with random parameter initialization and does the stability condition (2) hold for the learned policies;

Performance: can the goal of the task be achieved or the cumulative cost be minimized;

Stability: if (2) holds, are the closed-loop systems indeed stable, generating stable state trajectories;

Robustness: how do the trained policies perform when faced with uncertainties unseen during training, such as parametric variation and external disturbances;

Generalization: can the trained policies generalize to follow reference signals that are different from the one seen during training.
We compare our approach with soft actor-critic (SAC) [28], one of the state-of-the-art actor-critic algorithms, which outperforms a series of RL methods such as DDPG [35] and PPO [36] on continuous control benchmarks. A variant of safe proximal policy optimization (SPPO) [15], a Lyapunov-based method, is also included in the comparison. The original SPPO was developed to deal with constrained MDPs, where safety constraints exist. In our experiments, we modify it to apply the Lyapunov constraints on the MDP tasks and see whether it can achieve the same stability guarantee as LAC. In CartPole, we also compare with the linear quadratic regulator (LQR), a classical model-based optimal control method for stabilization. For all algorithms, the hyperparameters are tuned to reach their best performance.
The outline of this section is as follows. In Section V-A, a brief introduction is given to the background and problem description of each example. Then, in Section V-B, the convergence and performance of the proposed method are demonstrated and compared with SAC. In Section V-E, the generalization ability and robustness of the trained policies are evaluated and analyzed. Finally, in Section V-F, we show the influence of choosing different Lyapunov candidate functions upon the performance and robustness of the trained policies.
Training parameters of LAC and the detailed experiment setup can be found in the Appendix.
V-A Background and Problem Description
In this section, we will give a brief introduction to the examples considered in this paper.
V-A1 CartPole
The controller's task is to stabilize the pole vertically at a given position. The cost is determined by the norm of the angular position of the pole and the horizontal position of the cart. The control input is the horizontal force applied to the cart. The agent dies if the angle between the pole and the vertical position exceeds a threshold, in which case the episode ends.
V-A2 HalfCheetah
The goal is to control a 17-dimensional two-legged robot simulated in MuJoCo. The control task belongs to the reference tracking problem, i.e., to enable the robot to run at a speed of 1 m/s along the X-axis. The cost is determined by the Euclidean difference between the current speed and the target speed. The control input is the torque applied at each joint.
V-A3 FetchReach
The agent controls a simulated manipulator to track a randomly generated goal position with its end effector. The cost is determined by the Euclidean distance between the end effector and the goal. The control input is the torque applied at each joint. The manipulator is also simulated in MuJoCo.
V-A4 GRN and CompGRN
The GRN is a synthetic-biology gene regulatory network with a ring structure pioneered in [34], in which each gene cyclically represses the next. The dynamics of temporal gene expression exhibit periodic oscillatory behavior. The dynamics of the GRN can be quantitatively described by a set of discrete-time nonlinear difference equations with six states, three mRNAs for transcription and three proteins for translation, based on biochemical kinetic laws. We also include a more complicated GRN (CompGRN) with four genes to be controlled, which exhibits an unstable oscillation and is even harder to control.
The objective is to force one protein concentration to follow an a priori defined reference trajectory using partially observed states.
V-B Performance
In each task, LAC, SAC and SPPO are each trained 10 times with random initialization; the average total cost and its variance during training are demonstrated in Figure 1. In the first three examples (see Figure 1(a)-(c)), SAC and LAC perform comparably in terms of the total cost at convergence and the speed of convergence, while SPPO only converges in CartPole and FetchReach. In GRN and CompGRN (see Figure 1(d,e)), SAC is not always able to find a policy capable of completing the control objective, resulting in poor average performance. On the contrary, LAC performs stably regardless of the random initialization.
V-C Convergence
As shown in Figure 1, LAC converges stably in all experiments. Moreover, the convergence and the validity of the stability guarantee can also be checked by observing the value of the Lagrange multiplier $\lambda$. When (2) is satisfied, $\lambda$ continuously decreases until it becomes zero. Thus, by checking the value of $\lambda$, the satisfaction of the stability condition during training and at convergence can be validated. In Figure 2, the value of $\lambda$ during training is demonstrated. Across all training trials in the experiments, $\lambda$ eventually converges to zero, which implies that the stability guarantee is valid.
V-D Evaluation on Stability
In this part, a further comparison between the stability-assured method (LAC) and one without such a guarantee (SAC) is made, by demonstrating the closed-loop system dynamics under the trained policies. A distinguishing feature of stability-assured policies is that they can force and keep the state or tracking error at zero. This can be intuitively demonstrated by the state trajectories of the closed-loop system.
We evaluated the trained policies in the GRN and CompGRN; the results are shown in Figure 3. In our experiments, we found that the LAC agents stabilize the systems well: all the state trajectories eventually converge to the reference signal (see Figure 3(a) and (c)). On the contrary, without a stability guarantee, the state trajectories either diverge (see Figure 3(b)) or continuously oscillate around the reference trajectory (see Figure 3(d)).
V-E Evaluation on Robustness and Generalization
It is well known that overparameterized policies are prone to overfitting to a specific training environment. The ability to generalize is key to the successful deployment of RL agents in an uncertain real-world environment. In this part, we first evaluate the robustness of the policies in the presence of parametric uncertainties and process noise. Then, we test the robustness of the controllers against external disturbances. Finally, we evaluate whether the policy is generalizable by setting different reference signals. To make a fair comparison, we removed the SAC policies that did not converge and only evaluate the ones that perform well during training. During testing, we found that SPPO appears to be sensitive to variations in the environment; its evaluation results are thus deferred to Fig. 9 and Fig. 10 in the Appendix.
V-E1 Robustness to dynamic uncertainty
In this part, during inference, we vary the system parameters and introduce process noise in the model/simulator to evaluate the algorithm's robustness. In CartPole, we vary the length of the pole. In GRN, we vary the promoter strength and the dissociation rate. Due to the stochastic nature of gene expression, we also introduce uniformly distributed noise of varying range (we indicate the noise level by its range) into the dynamics of the GRN. The state trajectories of the closed-loop system under the LAC and SAC agents in the varied environments are demonstrated in Figure 4.

V-E2 Robustness to disturbances
An inherent property of a stable system is the ability to recover from perturbations such as external forces and wind. To show this, we introduce periodic external disturbances of different magnitudes in each environment and observe the performance difference between policies trained by LAC and SAC. We also include LQR as a model-based baseline. In CartPole, the agent may fall over when interfered with by an external force, ending the episode in advance. Thus, in this task, we measure the robustness of the controller through the death rate, i.e., the probability of falling over after being disturbed. For the other tasks, where the episodes are always of the same length, we measure the robustness of the controller by the variation in the cumulative cost. Under each disturbance magnitude, the policies are tested over repeated trials and the performance is shown in Figure 5.
As shown in Figure 5, the controllers trained by LAC outperform SAC and LQR by a large margin in CartPole and GRN (lower death rate and cumulative cost). In HalfCheetah, SAC and LAC are both robust to small external disturbances, while LAC is more reliable under larger ones. In FetchReach, SAC and LAC perform reliably across all of the external disturbances. In all of the experiments, the SPPO agents could hardly sustain any external disturbance.
V-E3 Generalization over different tracking references
In this part, we introduce four reference signals in the GRN that are unseen during training: sinusoids with periods of 150 (brown) and 400 (blue), and constant references of 8 (red) and 16 (green). We also show the original reference signal used for training (sky-blue) as a benchmark. The reference signals are indicated in Figure 6 by dashed lines in the respective colors. All of the trained policies are tested 10 times with each reference signal. The average dynamics of the target protein are shown in Figure 6 with solid lines, while the variance is indicated by the shaded area.
As shown in Figure 6, the policies trained by LAC generalize well to previously unseen reference signals with low variance (the dynamics are very close to the dashed lines), regardless of whether they share the same mathematical form as the one used for training. On the other hand, though SAC tracks the original reference signal well after the unconverged training trials are removed (see the sky-blue lines), it is still unable to follow some of the reference signals (see the brown line) and exhibits larger variance than LAC.
V-F Influence of Different Lyapunov Candidate Functions
As an independent interest, we evaluate the influence of choosing different Lyapunov candidate functions in this part. First, we adopt candidates with different time horizons $N$ to train policies in the CartPole example, and compare their performance in terms of cumulative cost and robustness. Here, $N=\infty$ implies using the value function as the Lyapunov candidate. Both of the Lyapunov critics are parameterized as in (6). For the evaluation of robustness, we apply an impulsive force at a fixed instant and observe the death rate of the trained policies. The results are demonstrated in Figure 7.
As shown in Figure 7, both choices of Lyapunov candidate converge fast and achieve comparable cumulative cost at convergence. However, in terms of robustness, the choice of $N$ plays an important role. As observed in Figure 7(b), the robustness of the controller decreases as the time horizon $N$ increases. Besides, it is interesting to observe that LQR is more robust than SAC when faced with an instant impulsive disturbance.
VI Conclusions
In this paper, we proposed a model-free approach for analyzing the stability of discrete-time nonlinear stochastic systems modeled by Markov decision processes, employing the Lyapunov function from control theory. Based on the theoretical result, a practical algorithm is developed for designing stability-assured controllers for stabilization and tracking problems. We evaluated the proposed method on various examples and showed that our method not only achieves comparable or superior performance compared with state-of-the-art RL algorithms, but also clearly outperforms them in terms of robustness to uncertainties and disturbances.
References
 [1] J. Kober, J. A. Bagnell, and J. Peters, “Reinforcement learning in robotics: A survey,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013.
 [2] J. Peters and S. Schaal, “Reinforcement learning of motor skills with policy gradients,” Neural networks, vol. 21, no. 4, pp. 682–697, 2008.
 [3] S. Löckel, J. Peters, and P. Van Vliet, “A probabilistic framework for imitating human race driver behavior,” IEEE Robotics and Automation Letters, 2020.
 [4] Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi, “Target-driven visual navigation in indoor scenes using deep reinforcement learning,” in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 3357–3364.
 [5] S. Gu, E. Holly, T. Lillicrap, and S. Levine, “Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates,” in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 3389–3396.
 [6] J.-J. E. Slotine, W. Li, et al., Applied Nonlinear Control. Englewood Cliffs, NJ: Prentice Hall, 1991, vol. 199, no. 1.
 [7] A. M. Lyapunov, The general problem of the stability of motion (in Russian). PhD Dissertation, Univ. Kharkov, 1892.
 [8] K. J. Åström and B. Wittenmark, Adaptive control. Courier Corporation, 1989.
 [9] D. Q. Mayne, J. B. Rawlings, C. V. Rao, and P. O. Scokaert, “Constrained model predictive control: Stability and optimality,” Automatica, vol. 36, no. 6, pp. 789–814, 2000.
 [10] M. Corless and G. Leitmann, “Continuous state feedback guaranteeing uniform ultimate boundedness for uncertain dynamic systems,” IEEE Transactions on Automatic Control, vol. 26, no. 5, pp. 1139–1144, 1981.
 [11] A. Thowsen, “Uniform ultimate boundedness of the solutions of uncertain dynamic delay systems with statedependent and memoryless feedback control,” International Journal of control, vol. 37, no. 5, pp. 1135–1143, 1983.
 [12] L. Buşoniu, T. de Bruin, D. Tolić, J. Kober, and I. Palunko, “Reinforcement learning for control: Performance, stability, and deep approximators,” Annual Reviews in Control, 2018.
 [13] D. Görges, “Relations between model predictive control and reinforcement learning,” IFAC-PapersOnLine, vol. 50, no. 1, pp. 4920–4928, 2017.
 [14] Y. Chow, O. Nachum, E. Duenez-Guzman, and M. Ghavamzadeh, “A Lyapunov-based approach to safe reinforcement learning,” arXiv preprint arXiv:1805.07708, 2018.
 [15] Y. Chow, O. Nachum, A. Faust, M. Ghavamzadeh, and E. Duenez-Guzman, “Lyapunov-based safe policy optimization for continuous control,” arXiv preprint arXiv:1901.10031, 2019.
 [16] R. Postoyan, L. Buşoniu, D. Nešić, and J. Daafouz, “Stability analysis of discrete-time infinite-horizon optimal control with discounted cost,” IEEE Transactions on Automatic Control, vol. 62, no. 6, pp. 2736–2749, 2017.
 [17] F. Berkenkamp, M. Turchetta, A. Schoellig, and A. Krause, “Safe model-based reinforcement learning with stability guarantees,” in Advances in Neural Information Processing Systems, 2017, pp. 908–918.
 [18] S. M. Richards, F. Berkenkamp, and A. Krause, “The Lyapunov neural network: Adaptive stability certification for safe learning of dynamical systems,” in Conference on Robot Learning, 2018, pp. 466–476.
 [19] R. S. Sutton, H. R. Maei, and C. Szepesvári, “A convergent temporal-difference algorithm for off-policy learning with linear function approximation,” in Advances in Neural Information Processing Systems, 2009, pp. 1609–1616.

 [20] N. Korda and P. La, “On TD(0) with function approximation: Concentration bounds and a centered variant with exponential convergence,” in International Conference on Machine Learning, 2015, pp. 626–634.
 [21] J. Bhandari, D. Russo, and R. Singal, “A finite time analysis of temporal difference learning with linear function approximation,” arXiv preprint arXiv:1806.02450, 2018.
 [22] S. Zou, T. Xu, and Y. Liang, “Finitesample analysis for sarsa with linear function approximation,” in Advances in Neural Information Processing Systems, 2019, pp. 8665–8675.
 [23] E. Boukas and Z. Liu, “Robust stability and H∞ control of discrete-time jump linear systems with time-delay: An LMI approach,” in Proceedings of the 39th IEEE Conference on Decision and Control, vol. 2. IEEE, 2000, pp. 1527–1532.
 [24] S. Sastry, Nonlinear systems: analysis, stability, and control. Springer Science & Business Media, 2013, vol. 10.
 [25] H. L. Royden, Real analysis. Krishna Prakashan Media, 1968.
 [26] L. Shaikhet, “Necessary and sufficient conditions of asymptotic mean square stability for stochastic linear difference equations,” Applied Mathematics Letters, vol. 10, no. 3, pp. 111–115, 1997.
 [27] S. P. Meyn and R. L. Tweedie, Markov chains and stochastic stability. Springer Science & Business Media, 2012.
 [28] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, et al., “Soft actor-critic algorithms and applications,” arXiv preprint arXiv:1812.05905, 2018.
 [29] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
 [30] D. Q. Mayne and H. Michalska, “Receding horizon control of nonlinear systems,” IEEE Transactions on Automatic Control, vol. 35, no. 7, pp. 814–824, 1990.
 [31] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, “Highdimensional continuous control using generalized advantage estimation,” arXiv preprint arXiv:1506.02438, 2015.
 [32] A. G. Barto, R. S. Sutton, and C. W. Anderson, “Neuronlike adaptive elements that can solve difficult learning control problems,” IEEE transactions on systems, man, and cybernetics, no. 5, pp. 834–846, 1983.
 [33] E. Todorov, T. Erez, and Y. Tassa, “MuJoCo: A physics engine for model-based control,” in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2012, pp. 5026–5033.
 [34] M. B. Elowitz and S. Leibler, “A synthetic oscillatory network of transcriptional regulators,” Nature, vol. 403, no. 6767, p. 335, 2000.
 [35] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
 [36] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
 [37] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “Openai gym,” arXiv preprint arXiv:1606.01540, 2016.
I Further Experiment Setup
I-A CartPole
In this experiment, the controller must sustain the pole vertically at a target cart position. The environment is a modified version of CartPole in [37] with a continuous action space; the action is the horizontal force applied to the cart. The episode terminates early if the cart position or the pole angle exceeds its maximum threshold. The cost function penalizes the deviation of the state from the target. The episodes are of length 250. For the robustness evaluation in Section V-E, we apply an impulsive disturbance force on the cart every 20 seconds, with magnitude ranging from 80 to 150 and direction opposite to that of the control input. In Section V-F, the impulsive disturbance has the same magnitude range and direction as in Section V-E, but is applied only once, at a fixed time instant.
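The early-termination rule and the periodic impulsive disturbance described above can be sketched in Python. This is a minimal illustration only: the threshold values, disturbance period, and magnitude below are assumptions, not the paper's exact settings.

```python
import numpy as np

# Illustrative thresholds; the paper's exact values are assumptions here.
X_THRESHOLD = 10.0                    # assumed maximum cart position
THETA_THRESHOLD = 20 * np.pi / 180.0  # assumed maximum pole angle (rad)

def is_terminal(x, theta):
    """Episode ends early when position or angle exceeds its threshold."""
    return abs(x) > X_THRESHOLD or abs(theta) > THETA_THRESHOLD

def impulsive_disturbance(t, action, period=20, magnitude=100.0):
    """Every `period` seconds, return an impulse opposing the control input."""
    if t > 0 and t % period == 0:
        return -np.sign(action) * magnitude
    return 0.0
```

The disturbance simply negates the sign of the current control input, matching the "direction opposite to the control input" description.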
I-B HalfCheetah
HalfCheetah is a modified version of the corresponding environment in Gym [37]. The task is to control the HalfCheetah (a two-legged simulated robot) to run at a target speed. The reward is a function of the difference between the forward speed of the HalfCheetah and the target speed. The control input is the torque applied to each joint, ranging from −1 to 1. The episodes are of length 200.
For the robustness evaluation in Section V-E, we apply an impulsive disturbance torque on each joint every 20 seconds, with magnitude ranging from 0.2 to 2.0 and direction opposite to that of the control input.
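The speed-tracking objective and the torque saturation can be illustrated with a short sketch. The quadratic penalty and the target value here are assumptions, since the exact reward formula is not reproduced in this section:

```python
import numpy as np

def speed_tracking_reward(forward_speed, target_speed=1.0):
    """Assumed quadratic penalty on the speed-tracking error."""
    return -(forward_speed - target_speed) ** 2

def clip_torque(action):
    """Joint torques are saturated to the admissible range [-1, 1]."""
    return np.clip(action, -1.0, 1.0)
```

The reward is maximal (zero) when the robot runs exactly at the target speed and decreases quadratically with the tracking error.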
I-C FetchReach-v1
We modify FetchReach in Gym’s robotics environment [37] into a cost version, where the controller is expected to drive the manipulator’s end-effector to a random goal position. The cost is a function of the distance between the goal and the end-effector. The control input is the torque applied to each joint, ranging from −1 to 1. The episodes are of length 200.
For the robustness evaluation in Section V-E, we apply an impulsive disturbance torque on each joint every 20 seconds, with magnitude ranging from 0.2 to 2.0 and direction opposite to that of the control input.
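A distance-based cost of this kind can be sketched as follows; the function name `reach_cost` and the use of the plain Euclidean distance are illustrative assumptions:

```python
import numpy as np

def reach_cost(end_effector_pos, goal_pos):
    """Cost as a function of the goal/end-effector distance (form assumed)."""
    d = np.linalg.norm(np.asarray(goal_pos, dtype=float)
                       - np.asarray(end_effector_pos, dtype=float))
    return d
```

The cost vanishes exactly when the end-effector reaches the goal, which makes the goal position the equilibrium the stability analysis targets.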
II Hyperparameters
Hyperparameters         | CartPole    | FetchReach  | HalfCheetah  | GRN          | CompGRN
Time horizon            | 5           | 5           |              |              |
Minibatch size          | 256         | 256         | 256          | 256          | 256
Actor learning rate     | 1e-4        | 1e-4        | 1e-4         | 1e-4         | 1e-4
Critic learning rate    | 3e-4        | 3e-4        | 3e-4         | 3e-4         | 3e-4
Lyapunov learning rate  | 3e-4        | 3e-4        | 3e-4         | 3e-4         | 3e-4
Target entropy          | -1          | -5          | -6           | -3           | -4
Soft replacement (τ)    | 0.005       | 0.005       | 0.005        | 0.005        | 0.005
Discount (γ)            | N/A         | N/A         | 0.995        | N/A          | N/A
                        | 1.0         | 1.0         | 1.0          | 1.0          | 1.0
Network structure       | (64,64,16)  | (64,64,16)  | (256,256,16) | (256,256,16) | (256,256,16)
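For reference, the CartPole column of the hyperparameter table can be collected into a plain configuration dictionary. The key names are hypothetical, and the negative signs on the learning rates and target entropy are assumed (the table renders them without a minus sign):

```python
# Hedged sketch: the CartPole column of the hyperparameter table as a dict.
# Key names are hypothetical; minus signs on the learning rates and the
# target entropy are assumptions about the intended values.
CARTPOLE_HYPERPARAMS = {
    "minibatch_size": 256,
    "actor_lr": 1e-4,
    "critic_lr": 3e-4,
    "lyapunov_lr": 3e-4,
    "target_entropy": -1,
    "tau": 0.005,                      # soft target replacement rate
    "network_structure": (64, 64, 16), # hidden layer sizes
}
```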
III Evaluation on Robustness and Generalization using SPPO

