As a powerful learning control paradigm, reinforcement learning (RL) is well suited to finding the optimal policy in tasks where the dynamics are either unknown or affected by severe uncertainty (Buşoniu et al., 2018). Its combination with deep neural networks has boosted applications in autonomous driving (Sallab et al., 2017), complicated robot locomotion (Hwangbo et al., 2019), and skilful games like Atari (Mnih et al., 2015) and Go (Silver et al., 2017). However, overparameterized policies are prone to overfitting to the specific training environment, which limits their generalization to other scenarios (Pinto et al., 2017). Additionally, RL agents trained in simulation, though cheap to obtain, are likely to suffer from the reality gap problem (Koos et al., 2010) when transferred from the virtual to the real world. To overcome these drawbacks, various efforts have been made to enhance the robustness of the policy (Jakobi et al., 1995; Tobin et al., 2017; Mordatch et al., 2015), since a robust policy has a greater chance of successful generalization and transfer.
Contribution. In this paper, we propose a unified framework for designing policies with both stability and robust performance guarantees against the various uncertainties in the environment. Without any specific domain knowledge, our method is able to find a policy that is robust to large exogenous disturbances and generalizes well to different test environments. First, a novel model-free method for analyzing the Lyapunov stability and performance of the closed-loop system is developed. Based on the theoretical results, we propose the Robust Lyapunov-based Actor-Critic (RLAC) algorithm to simultaneously find the Lyapunov function and the policy that can guarantee the robust stability of the closed-loop system. We evaluate RLAC on a simulated cartpole in the OpenAI gym (Brockman et al., 2016) environment and show that our approach is robust to: i) Large impulsive disturbances: the trained agent is able to recover when disturbed by adversarial impulses 4-6 times the maximum control input, while the baselines fail almost surely. ii) Parametric uncertainty: the learned policy generalizes better than the baselines to different test environment settings (e.g. different mass and structural values).
2.1 Markov Decision Process and Reinforcement Learning
A Markov decision process (MDP) is a tuple $(\mathcal{S}, \mathcal{A}, c, P, \rho)$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions, $c(s,a)$ is the cost function, $P(s'|s,a)$ is the transition probability function, and $\rho(s)$ is the starting state distribution. $\pi(a|s)$ is a stationary policy denoting the probability of selecting action $a$ in state $s$. In addition, the cost function under stationary policy $\pi$ is defined as $c_\pi(s) \doteq \mathbb{E}_{a\sim\pi}\, c(s,a)$.
In this paper, we divide the state $s$ into two vectors, $s^1$ and $s^2$, where $s^1$ is composed of the elements of $s$ that are aimed at tracking the reference signal $r$, while $s^2$ contains the rest. The cost function is defined as $c(s,a) \doteq \|s^1 - r\|$, where $\|\cdot\|$ denotes the Euclidean norm.
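As a small illustration of a tracking cost of this kind, the sketch below takes the cost to be the Euclidean norm of the difference between the tracked state elements and the reference signal. The function name `tracking_cost`, the index list `tracked_idx`, and the example numbers are hypothetical and only serve the illustration.

```python
import math

def tracking_cost(state, reference, tracked_idx):
    """Euclidean norm of the tracking error over the tracked state elements.

    Assumption: the cost depends only on the tracked elements of the state
    and on the reference signal, not on the action.
    """
    error = [state[i] - r for i, r in zip(tracked_idx, reference)]
    return math.sqrt(sum(e * e for e in error))

# Hypothetical 4-dimensional state; elements 0 and 2 track the reference.
state = [3.0, 0.5, 4.0, -0.2]
reference = [0.0, 0.0]
print(tracking_cost(state, reference, tracked_idx=[0, 2]))  # 5.0
```

The untracked elements (indices 1 and 3 here) do not enter the cost, which mirrors the split of the state into a tracked part and the rest.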
2.2 Robust Control Against Environment Uncertainty
Definition 1. The stochastic system is said to be stable in mean cost if holds for any initial condition . If is arbitrarily large then the stochastic system is globally stable in mean cost.
To address the performance of the agent in the presence of uncertainty, the following definition is needed.
Definition 2. The system is said to be mean square stable (MSS) with an gain less than or equal to , if the system is MSS when , and the following holds for all ,
where . is the error output of the system and is the uncertainty, which is composed of both environmental disturbance and modelling error.
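To make the gain notion concrete, the following sketch estimates an empirical gain from one recorded rollout. It assumes the gain in question is the usual $\ell_2$ gain, i.e. the ratio between the energy of the error output and the energy of the uncertainty, and it replaces the expectation by a single trajectory. The function names and the example data are hypothetical.

```python
def signal_energy(sequence):
    """Sum of squared Euclidean norms of a sequence of vectors."""
    return sum(sum(x * x for x in vec) for vec in sequence)

def empirical_l2_gain(errors, disturbances):
    """sqrt(sum ||z_t||^2 / sum ||w_t||^2) over one rollout (assumed form)."""
    w_energy = signal_energy(disturbances)
    if w_energy == 0.0:
        raise ValueError("gain is undefined for a zero disturbance sequence")
    return (signal_energy(errors) / w_energy) ** 0.5

def satisfies_gain_bound(errors, disturbances, eta):
    """Check sum ||z_t||^2 <= eta^2 * sum ||w_t||^2 on the recorded data."""
    return signal_energy(errors) <= eta ** 2 * signal_energy(disturbances)

# Hypothetical rollout: two-dimensional error output, scalar disturbance.
errors = [[1.0, 0.0], [0.0, 1.0]]
disturbances = [[2.0], [2.0]]
print(empirical_l2_gain(errors, disturbances))          # 0.5
print(satisfies_gain_bound(errors, disturbances, 1.0))  # True
```

A single rollout only gives a lower bound on the worst-case gain, since the definition requires the bound to hold for every admissible uncertainty.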
Requiring the robust performance guarantee (1) to hold for all is equivalent to guaranteeing the inequality for the worst case induced by , i.e.,
3 Main Results
3.1 Lyapunov-based Learning Control
In this section, we propose the main assumptions and a new theorem.
Assumption 1. The stationary distribution of state exists.
Assumption 2. There exists a positive constant such that .
The core theoretical results on analyzing the stability and robust performance of the closed-loop system with the help of a Lyapunov function and sampled data are presented here. Lyapunov functions are a class of continuously differentiable semi-positive definite functions . The general idea of exploiting a Lyapunov function is to ensure that its derivative along the state trajectory is semi-negative definite, so that the state moves in the direction of decreasing Lyapunov value and eventually converges to the set or point where the value is zero.
Theorem 1. If there exists a continuously differentiable function and positive constants , , such that
holds for all and , where is the sampling distribution, then the system is mean square stable and has gain no greater than . If the above holds for , then the system is globally mean square stable with finite gain.
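A sample-based condition of this type can be checked numerically on a batch of transitions. The exact inequality of the theorem is not reproduced here; purely for illustration, the sketch below assumes a generic dissipation-type condition in which the sample mean of $L(s') - L(s) + \alpha\, c(s) - \eta^2 \|w\|^2$ must be negative. The weights `alpha` and `eta`, the tuple layout, and the numbers are all hypothetical.

```python
def mean_dissipation_residual(samples, alpha, eta):
    """Sample mean of L(s') - L(s) + alpha * c - eta^2 * ||w||^2.

    Each sample is (L(s), L(s'), cost, squared disturbance norm).
    Assumed illustrative form, not the inequality of the theorem itself.
    """
    total = 0.0
    for l_s, l_next, cost, w_sq in samples:
        total += l_next - l_s + alpha * cost - eta ** 2 * w_sq
    return total / len(samples)

# Hypothetical batch of two transitions.
samples = [(2.0, 1.0, 0.5, 0.0), (1.0, 0.8, 0.2, 0.1)]
residual = mean_dissipation_residual(samples, alpha=1.0, eta=1.0)
print(residual < 0.0)  # True: the assumed condition holds on this batch
```

Because the condition is evaluated on data drawn from a sampling distribution, no model of the dynamics is needed, only recorded values of the Lyapunov candidate, the cost, and the disturbance.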
3.2 Learning the Adversarial Disturber
In our setting, in addition to the control policy , a disturber policy is introduced to actively select the worst disturbance for a given state. More specifically, the adversarial disturber seeks to find the disturbance input over which the system has the greatest gain, i.e. maximizing the following cost function,
where is the parameter of the disturber policy .
In this section, based on the theoretical results in Section 3, we propose an actor-critic style algorithm with robust stability guarantee (RLAC).
In this algorithm, we include a critic Lyapunov function to provide the policy gradient, which satisfies . Using the Lagrangian method, the objective function for is obtained as follows,
where is parameterized by a neural network and is an input vector consisting of Gaussian noise. In the above objective, and are the positive Lagrangian multipliers, whose values are adjusted automatically. The gradient of (6) with respect to the policy parameter is approximated by
The Lyapunov function is updated by minimizing the following objective
We use the sum of costs over a finite time horizon as the Lyapunov candidate, i.e.
which has long been exploited as the Lyapunov function in establishing the stability criteria for model predictive control (MPC) (Mayne et al., 2000). The pseudo-code of RLAC is presented in Algorithm 1.
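The finite-horizon sum of costs used as the Lyapunov candidate can be computed from a recorded cost sequence as sketched below. The horizon length and the handling of the episode end (the sum is simply truncated) are assumptions made for this illustration, and the function name is hypothetical.

```python
def finite_horizon_targets(costs, horizon):
    """targets[t] = costs[t] + ... + costs[t + horizon - 1].

    Assumption: near the end of the episode the window is truncated,
    so fewer than `horizon` costs are summed.
    """
    return [sum(costs[t:t + horizon]) for t in range(len(costs))]

costs = [1.0, 2.0, 3.0, 4.0]
print(finite_horizon_targets(costs, horizon=2))  # [3.0, 5.0, 7.0, 4.0]
```

Targets of this kind could serve as regression labels for the Lyapunov network, one per visited state.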
5 Experimental Results
In this section, we evaluate the robustness of RLAC against i) large impulsive disturbances and ii) parametric uncertainty. The experimental setup is described in Appendix C.
5.1 Robustness to Impulsive Disturbances
Figure 1 (b): Death rate under impulsive disturbances with different magnitudes. The trained policies are initialized with 10 random seeds, and the policies with different initializations are each evaluated for 500 episodes. The line indicates the average death rate of these policies and the shaded region shows the 1-SD confidence interval.
We evaluate the robustness of the agents trained by RLAC and the baselines against unseen exogenous disturbances. We measure the robust performance via the death rate, i.e., the probability of the pole falling after an impulsive disturbance. As observed in the figure, RLAC gives the most robust policy against the impulsive force. It maintains the lowest death rate throughout the experiment, far superior to SAC and RARL. Moreover, RLAC performs even better than MPC and LQR, which have access to the full model information.
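The death-rate statistics described above (an average over policies trained with different seeds, with a one-standard-deviation band) can be sketched as follows. Whether the sample or the population standard deviation was used is not stated, so the sample version is an assumption here, and the outcome data are made up for the illustration.

```python
import statistics

def death_rate(outcomes):
    """Fraction of episodes that ended in failure (True = pole fell)."""
    return sum(outcomes) / len(outcomes)

def summarize(per_seed_outcomes):
    """Mean and sample standard deviation of the per-seed death rates."""
    rates = [death_rate(outcomes) for outcomes in per_seed_outcomes]
    mean = statistics.mean(rates)
    sd = statistics.stdev(rates) if len(rates) > 1 else 0.0
    return mean, sd

# Hypothetical outcomes for three seeds with four episodes each.
per_seed = [
    [True, False, False, False],
    [False, False, False, False],
    [True, True, False, False],
]
mean, sd = summarize(per_seed)
print(mean - sd, mean, mean + sd)  # lower band, average, upper band
```

In the experiment itself the same computation would run over 10 seeds and 500 episodes per seed, once for each impulse magnitude.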
5.2 Robustness to Parametric Uncertainty
In this experiment, we evaluate the trained policies in environments with different parameter settings. In the training environment, the parameter length of pole and mass of cart , while during evaluation and are selected in a 2-D grid with and .
As shown in the heat maps in Figure 2, RLAC achieves the lowest death rate (zero for the majority of the parameter settings) and obtains a reasonable total cost (lower than 100). The total cost of RLAC is slightly higher than that of SAC and RARL, since its agents hardly ever die and therefore sustain longer episodes. Compared to SAC, RARL achieves a lower death rate and comparable total cost. LQR performs well in the region where the parameters are close to the nominal model but deteriorates quickly as the parameters vary. All of the model-free methods outperform LQR in terms of robustness to parametric uncertainty, except for the case of low pole length and cart mass (bottom left of the grid). This is potentially because the overparameterized policy does not generalize well to a model whose dynamics are more sensitive to the input than those of the model used for training.
- Buşoniu et al.  Lucian Buşoniu, Tim de Bruin, Domagoj Tolić, Jens Kober, and Ivana Palunko. Reinforcement learning for control: Performance, stability, and deep approximators. Annual Reviews in Control, 2018.
- Sallab et al.  Ahmad EL Sallab, Mohammed Abdou, Etienne Perot, and Senthil Yogamani. Deep reinforcement learning framework for autonomous driving. Electronic Imaging, 2017(19):70–76, 2017.
- Hwangbo et al.  Jemin Hwangbo, Joonho Lee, Alexey Dosovitskiy, Dario Bellicoso, Vassilios Tsounis, Vladlen Koltun, and Marco Hutter. Learning agile and dynamic motor skills for legged robots. Science Robotics, 4(26):eaau5872, 2019.
- Mnih et al.  Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
- Silver et al.  David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.
- Pinto et al.  Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta. Robust adversarial reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2817–2826. JMLR.org, 2017.
- Koos et al.  Sylvain Koos, Jean-Baptiste Mouret, and Stéphane Doncieux. Crossing the reality gap in evolutionary robotics by promoting transferable controllers. In Proceedings of the 12th annual conference on Genetic and evolutionary computation, pages 119–126. ACM, 2010.
- Jakobi et al.  Nick Jakobi, Phil Husbands, and Inman Harvey. Noise and the reality gap: The use of simulation in evolutionary robotics. In European Conference on Artificial Life, pages 704–720. Springer, 1995.
- Tobin et al.  Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 23–30. IEEE, 2017.
- Mordatch et al.  Igor Mordatch, Kendall Lowrey, and Emanuel Todorov. Ensemble-cio: Full-body dynamic motion planning that transfers to physical humanoids. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5307–5314. IEEE, 2015.
- Brockman et al.  Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
- Mayne et al.  David Q Mayne, James B Rawlings, Christopher V Rao, and Pierre OM Scokaert. Constrained model predictive control: Stability and optimality. Automatica, 36(6):789–814, 2000.
- Royden  Halsey Lawrence Royden. Real analysis. Krishna Prakashan Media, 1968.
- Haarnoja et al.  Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
Appendix A Proof of Theorem 1
The existence of the sampling distribution is guaranteed by the existence of (Assumption 1). Since the sequence converges to as approaches , then by the Abelian theorem, the sequence also converges and . Combined with the form of , Eq.(4) implies that
First, the stability of the system in mean cost will be proved. According to Eq.(3), for all and consider that ,
On the other hand, the sequence converges pointwise to the function . According to Lebesgue's dominated convergence theorem (Royden, 1968), if a sequence converges pointwise to a function and is dominated by some integrable function in the sense that,
Thus the left hand side of Eq.(10)
First, when , Eq.(10) implies
Since is a finite value and is semi-positive definite, it follows that
Suppose that there exists a state such that , or . Since for all starting states in (Assumption 2), it follows that , which contradicts Eq.(12). Hence , , and the system is stable in mean cost by Definition 1.
Next, the performance of the system will be proved. When , Eq.(10) implies
Since is finite and is semi-positive definite, one has
The above inequality holds only if ; thus the system has gain less than , which concludes the proof. ∎
Appendix B Algorithm
Estimate policy gradient:
Appendix C Experiment Setup
We compare with robust adversarial reinforcement learning (RARL) (Pinto et al., 2017), which is considered to be the model-free robust RL baseline. RARL demonstrated great robustness to both exogenous disturbance and parametric disturbance on a series of continuous control tasks. In the implementation of RARL by Pinto et al. (2017), TRPO was used as the policy optimizer, though other policy optimization methods are also compatible with the framework. Without loss of generality, we use soft actor-critic (SAC) (Haarnoja et al., 2018) as the policy optimizer for RARL, given that both the sample efficiency and the final performance of SAC exceed the state of the art on a series of continuous control benchmarks. We also include SAC as the generic RL baseline. The linear quadratic regulator (LQR) and MPC are included as the model-based baselines, which are guaranteed to find the optimal analytic solution given a well-established linear model and weighting matrices. Hyperparameters and other details of the implementation of RLAC are given in Appendix D.
We evaluate RLAC and the baselines on a simulated cartpole in the OpenAI gym environment (all the units are dimensionless); details of the environment are given in Appendix C. The agent is expected to sustain the pole vertically at . Both the RLAC and the baseline agents are trained under the static environment setting, i.e., with unchanged model parameters and dynamics, and evaluated in environment variants with unseen uncertainty. During training, both RLAC and RARL generate a disturbance input to affect the performance of the policy. The cost and the state at the next step are determined jointly by the action and the disturbance. The magnitude of the disturbances is held no larger than , while the actions are below .
Each algorithm is trained for the same amount of global time steps over random seeds with optimized hyperparameters. The total cost of rollouts during training is shown in Figure 3. As shown in the figure, all the algorithms converge eventually, achieving similar final performance in terms of return.
In the cartpole experiment, the agent is to sustain the pole vertically at the central position. This is a modified version of the cartpole in Brockman et al. (2016) with a continuous action space. The action is the horizontal force on the cart (). The position is limited to , . is the angle position with respect to the vertical direction, where . The cost function . The agent is initialized randomly between while other variables are initialized in . The maximum length of episodes is 250.
C.1 Robustness to Impulsive Disturbances
We evaluate the robustness of the agents trained by RLAC and the baselines against unseen exogenous disturbances. To show this, we apply an impulsive adversarial force to the cartpole, ranging from 80 to 120, which is far larger than the maximum action input and disturbance during training. The impulsive disturbance is analogous to a sudden hit or strong wind applied to the cart. The disturbance acts in the direction of pushing the cart away from the origin, as shown in Figure 1 (a), and only takes place at the th time step.
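The impulsive disturbance described above, a single-step force that pushes the cart away from the origin, can be sketched as a small helper. The step index `impulse_step` is a hypothetical parameter, and the tie-breaking rule at the origin (push in the positive direction) is an assumption.

```python
def impulsive_disturbance(step, impulse_step, cart_position, magnitude):
    """Force applied to the cart at `step`; zero except at `impulse_step`.

    The force points away from the origin: positive when the cart is at a
    non-negative position, negative otherwise (assumed tie-break at 0).
    """
    if step != impulse_step:
        return 0.0
    direction = 1.0 if cart_position >= 0.0 else -1.0
    return direction * magnitude

# Hypothetical rollout fragment: impulse of magnitude 100 at step 20.
for step, position in [(19, -0.3), (20, -0.3), (21, -0.5)]:
    print(step, impulsive_disturbance(step, 20, position, 100.0))
# 19 0.0
# 20 -100.0
# 21 0.0
```

In an evaluation loop, the returned value would be added to the agent's action before stepping the environment.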
We use the death rate instead of the total cost because a failure ends the episode early and results in a rather low total cost. Under different impulse magnitudes, the policies trained by RLAC and the baselines over different initializations are evaluated 500 times, and the results are shown in Figure 1 (b).
C.2 Robustness to Parametric Uncertainty
The trained agents with different initializations are evaluated with an equal number of episodes, and at each point of the parameter grid the agents are evaluated 100 times. For the same reason discussed in the previous subsection, the total cost together with the death rate is shown in Figure 2 to demonstrate the robustness of RLAC and the baselines.
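The grid evaluation described above can be sketched as a loop over all combinations of the two varied parameters. The grid values, the rollout function `toy_episode`, and its return convention `(died, total_cost)` are hypothetical stand-ins; a real evaluation would run the trained policy in the modified environment instead.

```python
import itertools
import random

def evaluate_on_grid(lengths, masses, run_episode, episodes_per_point=100, seed=0):
    """Return {(length, mass): (death_rate, mean_total_cost)} over the grid."""
    rng = random.Random(seed)
    results = {}
    for length, mass in itertools.product(lengths, masses):
        deaths = 0
        total_cost = 0.0
        for _ in range(episodes_per_point):
            died, cost = run_episode(length, mass, rng)
            deaths += int(died)
            total_cost += cost
        results[(length, mass)] = (deaths / episodes_per_point,
                                   total_cost / episodes_per_point)
    return results

def toy_episode(length, mass, rng):
    """Hypothetical stand-in for a real rollout; returns (died, total_cost)."""
    died = rng.random() < 0.1 * abs(length - 1.0)
    return died, 50.0 * mass

grid = evaluate_on_grid([0.5, 1.0, 1.5], [0.5, 1.0], toy_episode)
for point, (rate, cost) in sorted(grid.items()):
    print(point, rate, cost)
```

The resulting dictionary maps directly onto the two heat maps: one for the death rate and one for the total cost.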
Appendix D Hyperparameters
| Hyperparameter | Value |
|---|---|
| Actor learning rate | 1e-4 |
| Lyapunov learning rate | 3e-4 |
For the policy network, we use a fully-connected MLP with two hidden layers of 64 units and ReLU nonlinearities, outputting the mean and standard deviations of a Gaussian distribution. For the Lyapunov network, we likewise use a fully-connected MLP with two hidden layers of 64 units and ReLU nonlinearities, outputting the Lyapunov value. We apply the same invertible squashing function technique as Haarnoja et al. (2018) to the output layer of the policy network.
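The policy architecture described above can be sketched as a plain forward pass. This is only an illustration in pure Python: the state and action dimensions, the weight initialization, the log-standard-deviation parameterization with clamping, and the choice of tanh as the squashing function (as used in SAC) are assumptions, and all function names are hypothetical.

```python
import math
import random

def init_layer(n_in, n_out, rng):
    """Random weights (assumed uniform init) and zero biases."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def dense(x, layer):
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    return [max(0.0, v) for v in values]

def init_policy(state_dim, action_dim, rng, hidden=64):
    return {
        "h1": init_layer(state_dim, hidden, rng),
        "h2": init_layer(hidden, hidden, rng),
        "mean": init_layer(hidden, action_dim, rng),
        "log_std": init_layer(hidden, action_dim, rng),
    }

def policy_forward(state, params, rng):
    """Sample a squashed Gaussian action from a two-hidden-layer MLP."""
    h = relu(dense(state, params["h1"]))
    h = relu(dense(h, params["h2"]))
    mean = dense(h, params["mean"])
    # Assumption: the network outputs log std, clamped for numerical safety.
    std = [math.exp(max(-20.0, min(2.0, s)))
           for s in dense(h, params["log_std"])]
    raw = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]
    return [math.tanh(a) for a in raw]  # squash to [-1, 1]

rng = random.Random(0)
params = init_policy(state_dim=4, action_dim=1, rng=rng)
print(policy_forward([0.1, 0.0, -0.05, 0.0], params, rng))
```

The squashed action lies in [-1, 1] and would be rescaled to the admissible force range of the environment. A Lyapunov network of the same shape would replace the two output heads by a single scalar output.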