
Safe Model-based Reinforcement Learning with Stability Guarantees

by Felix Berkenkamp, et al.
ETH Zurich

Reinforcement learning is a powerful paradigm for learning optimal policies from experimental data. However, to find optimal policies, most reinforcement learning algorithms explore all possible actions, which may be harmful for real-world systems. As a consequence, learning algorithms are rarely applied on safety-critical systems in the real world. In this paper, we present a learning algorithm that explicitly considers safety, defined in terms of stability guarantees. Specifically, we extend control-theoretic results on Lyapunov stability verification and show how to use statistical models of the dynamics to obtain high-performance control policies with provable stability certificates. Moreover, under additional regularity assumptions in terms of a Gaussian process prior, we prove that one can effectively and safely collect data in order to learn about the dynamics and thus both improve control performance and expand the safe region of the state space. In our experiments, we show how the resulting algorithm can safely optimize a neural network policy on a simulated inverted pendulum, without the pendulum ever falling down.



1 Introduction

While reinforcement learning (RL, Sutton1998Reinforcement) algorithms have achieved impressive results in games, for example on the Atari platform (Mnih2015Humanlevel), they are rarely applied to real-world physical systems (e.g., robots) outside of academia. The main reason is that RL algorithms provide optimal policies only in the long-term, so that intermediate policies may be unsafe, break the system, or harm their environment. This is especially true in safety-critical systems that can affect human lives. Despite this, safety in RL has remained largely an open problem (Amodei2016Concrete).

Consider, for example, a self-driving car. While it is desirable for the algorithm that drives the car to improve over time (e.g., by adapting to driver preferences and changing environments), any policy applied to the system has to guarantee safe driving. Thus, it is not possible to learn about the system through random exploratory actions, which almost certainly lead to a crash. In order to avoid this problem, the learning algorithm needs to consider its ability to safely recover from exploratory actions. In particular, we want the car to be able to recover to a safe state, for example, driving at a reasonable speed in the middle of the lane. This ability to recover is known as asymptotic stability in control theory (Khalil1996Nonlinear). Specifically, we care about the region of attraction of the closed-loop system under a policy. This is a subset of the state space that is forward invariant so that any state trajectory that starts within this set stays within it for all times and converges to a goal state eventually.

In this paper, we present an RL algorithm for continuous state-action spaces that provides this kind of high-probability safety guarantee for policies. In particular, we show how, starting from an initial, safe policy, we can expand our estimate of the region of attraction by collecting data inside the safe region and adapt the policy to both increase the region of attraction and improve control performance.

Related work

Safety is an active research topic in RL, and different definitions of safety exist (Pecka2014Safe; Garcia2015Comprehensive). Discrete Markov decision processes (MDPs) are one class of tractable models that have been analyzed. In risk-sensitive RL, one specifies risk-aversion in the reward (Coraluppi1999Risksensitive). For example, (Geibel2005RiskSensitive) define risk as the probability of driving the agent to a set of known, undesirable states. Similarly, robust MDPs maximize rewards when transition probabilities are uncertain (Tamar2014Scaling; Wiesemann2012Robust). Both (Moldovan2012Safe) and (Turchetta2016Safe) introduce algorithms to safely explore MDPs so that the agent never gets stuck without safe actions. All these methods require an accurate probabilistic model of the system.

In continuous state-action spaces, model-free policy search algorithms have been successful. These update policies without a system model by repeatedly executing the same task (Peters2006Policy). In this setting, (Achiam2017Constrained) introduces safety guarantees in terms of constraint satisfaction that hold in expectation. High-probability worst-case safety guarantees are available for methods based on Bayesian optimization (Mockus1989Bayesian) together with Gaussian process models (GP; Rasmussen2006Gaussian) of the cost function. The algorithms in (Schreiter2015Safe) and (Sui2015Safe) provide high-probability safety guarantees for any parameter that is evaluated on the real system. These methods are used in (Berkenkamp2016Safe) to safely optimize a parametric control policy on a quadrotor. However, the resulting policies are task-specific and require the system to be reset.

In the model-based RL setting, research has focused on safety in terms of state constraints. In (Garcia2012Safe; Hans2008Safe), a priori known, safe global backup policies are used, while (Perkins2003Lyapunov) learns to switch between several safe policies. However, it is not clear how one may find these policies in the first place. Other approaches use model predictive control with constraints, a model-based technique where the control actions are optimized online. For example, (Sadigh2016Safe) models uncertain environmental constraints, while (Ostafew2016Robust) uses approximate uncertainty propagation of GP dynamics along trajectories. In this setting, robust feasibility and constraint satisfaction can be guaranteed for a learned model with bounded errors using robust model predictive control (Aswani2013Provably). The method in (Akametalu2014Reachability) uses reachability analysis to construct safe regions in the state space. The theoretical guarantees depend on the solution to a partial differential equation, which is approximated.

Theoretical stability guarantees exist for the more tractable problem of stability analysis and verification under a fixed control policy. In control, stability of a known system can be verified using a Lyapunov function (Bobiti2016Sampling). A similar approach is used by (Berkenkamp2016Lyapunov) for deterministic but unknown dynamics that are modeled as a GP, which allows for provably safe learning of regions of attraction for fixed policies. Similar results are shown in (Vinogradska2016Stability) for stochastic systems that are modeled as a GP. They use Bayesian quadrature to compute provably accurate estimates of the region of attraction. These approaches do not update the policy.

Our contributions

We introduce a novel algorithm that can safely optimize policies in continuous state-action spaces while providing high-probability safety guarantees in terms of stability. Moreover, we show that it is possible to exploit the regularity properties of the system in order to safely learn about the dynamics and thus improve the policy and increase the estimated safe region of attraction without ever leaving it. Specifically, starting from a policy that is known to stabilize the system locally, we gather data at informative, safe points and improve the policy safely based on the improved model of the system. We prove that any exploration algorithm that gathers data at these points reaches a natural notion of full exploration. We show how the theoretical results transfer to a practical algorithm with safety guarantees and apply it to a simulated inverted pendulum stabilization task.

2 Background and Assumptions

We consider a deterministic, discrete-time dynamic system

x_{t+1} = f(x_t, u_t) = h(x_t, u_t) + g(x_t, u_t),   (1)

with states x_t ∈ X, control actions u_t ∈ U, and a discrete time index t. The true dynamics f consist of two parts: h is a known, prior model that can be obtained from first principles, while g represents a priori unknown model errors. While the model errors are unknown, we can obtain noisy measurements of them by driving the system to the state x_t and taking action u_t. We want this system to behave in a certain way, e.g., the car driving on the road. To this end, we need to specify a control policy π: X → U that, given the current state, determines the appropriate control action that drives the system to some goal state, which we set as the origin without loss of generality (Khalil1996Nonlinear). We encode the performance requirements of how to drive the system to the origin through a positive cost r(x, u) that is associated with states and actions and has r(0, 0) = 0. The policy aims to minimize the cumulative, discounted costs for each starting state.
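As a toy illustration of this decomposition, the sketch below splits hypothetical one-dimensional dynamics into a known prior model and an unknown error term; all names, numbers, and the dynamics themselves are illustrative assumptions, not the system used in the paper.

```python
import math
import random

def h(x, u):
    """Known prior model, e.g. obtained from first principles (illustrative)."""
    return 0.9 * x + 0.1 * u

def g(x, u):
    """A-priori unknown model error (here a small made-up nonlinearity)."""
    return 0.05 * math.sin(x)

def f(x, u):
    """True dynamics: prior model plus unknown error, mirroring Eq. (1)."""
    return h(x, u) + g(x, u)

def measure_error(x, u, noise_std=0.01, rng=random.Random(0)):
    """Noisy measurement of the model error, obtained by driving the
    system to state x, applying action u, and subtracting the prior model."""
    return f(x, u) - h(x, u) + rng.gauss(0.0, noise_std)
```

In this sketch the learner only ever queries `measure_error`, matching the setting where h is known and only g must be inferred from data.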

The goal is to safely learn about the dynamics from measurements and adapt the policy for performance, without encountering system failures. Specifically, we define safety in terms of avoiding the state divergence that occurs when leaving the region of attraction. This means that adapting the policy is not allowed to decrease the region of attraction, and exploratory actions to learn about the dynamics g are not allowed to drive the system outside the region of attraction. The region of attraction is not known a priori, but is implicitly defined through the system dynamics and the choice of policy. Thus, the policy not only defines performance as in typical RL, but also determines safety and where we can obtain measurements.

Model assumptions

In general, this kind of safe learning is impossible without further assumptions. For example, in a discontinuous system even a slight change in the control policy can lead to drastically different behavior. Moreover, to expand the safe set we need to generalize learned knowledge about the dynamics to (potentially unsafe) states that we have not visited. To this end, we restrict ourselves to the general and practically relevant class of models that are Lipschitz continuous, a typical assumption in the control community (Khalil1996Nonlinear). Additionally, to ensure that the closed-loop system remains Lipschitz continuous when the control policy is applied, we restrict policies to the rich class of L_π-Lipschitz continuous functions Π_L, which also contains certain types of neural networks (Szegedy2014Intriguing).

Assumption 1 (continuity).

The dynamics h and g in Eq. (1) are L_h- and L_g-Lipschitz continuous with respect to the 1-norm. The considered control policies π lie in a set Π_L of functions that are L_π-Lipschitz continuous with respect to the 1-norm.

To enable safe learning, we require a reliable statistical model. While we commit to GPs for the exploration analysis, for safety any suitable, well-calibrated model is applicable.

Assumption 2 (well-calibrated model).

Let μ_n(·) and Σ_n(·) denote the posterior mean and covariance matrix functions of the statistical model of the dynamics in Eq. (1) conditioned on n noisy measurements. With σ_n(·) = trace(Σ_n^{1/2}(·)), there exists a β_n > 0 such that with probability at least (1 − δ) it holds for all n ≥ 0, x ∈ X, and u ∈ U that ‖f(x, u) − μ_n(x, u)‖_1 ≤ β_n σ_n(x, u).

This assumption ensures that we can build confidence intervals on the dynamics that, when scaled by an appropriate constant β_n, cover the true function with high probability. We introduce a specific statistical model that fulfills both assumptions under certain regularity assumptions in Sec. 3.
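A minimal sketch of what the well-calibrated-model assumption asks of the statistical model: scaled confidence intervals μ ± β·σ that contain the true dynamics value with high probability. The mean, standard deviation, and scaling below are stand-in numbers, not an actual GP posterior.

```python
def confidence_interval(mu, sigma, beta):
    """Interval [mu - beta*sigma, mu + beta*sigma] around the posterior mean."""
    return (mu - beta * sigma, mu + beta * sigma)

def well_calibrated(mu, sigma, beta, true_value):
    """Assumption 2 at a single point: does the scaled interval contain
    the true dynamics value?"""
    lo, hi = confidence_interval(mu, sigma, beta)
    return lo <= true_value <= hi
```

In the paper this containment must hold jointly over all state-action pairs with probability at least (1 − δ); the sketch only checks one point.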

Lyapunov function

To satisfy the specified safety constraints for safe learning, we require a tool to determine whether individual states and actions are safe. In control theory, this safety is defined through the region of attraction, which can be computed for a fixed policy using Lyapunov functions (Khalil1996Nonlinear). Lyapunov functions are continuously differentiable functions v: X → R≥0 with v(0) = 0 and v(x) > 0 for all x ∈ X \ {0}. The key idea behind using Lyapunov functions to show stability of the system in Eq. (1) is similar to that of gradient descent on strictly quasiconvex functions: if one can show that, given a policy π, applying the dynamics to the state maps it to strictly smaller values of the Lyapunov function ('going downhill'), then the state eventually converges to the equilibrium point at the origin (the minimum). In particular, the assumptions in Theorem 1 below imply that v is strictly quasiconvex within the region of attraction if the dynamics are Lipschitz continuous. As a result, the one-step decrease property for all states within a level set guarantees eventual convergence to the origin.

Theorem 1 ((Khalil1996Nonlinear)).

Let v be a Lyapunov function, f Lipschitz-continuous dynamics, and π a policy. If v(f(x, π(x))) < v(x) for all x within the level set V(c) = {x ∈ X \ {0} | v(x) ≤ c}, c > 0, then V(c) is a region of attraction, so that x_0 ∈ V(c) implies x_t ∈ V(c) for all t > 0 and lim_{t→∞} x_t = 0.

It is convenient to characterize the region of attraction through a level set of the Lyapunov function, since it replaces the challenging test for convergence with a one-step decrease condition on the Lyapunov function. For the theoretical analysis in this paper, we assume that a Lyapunov function is given to determine the region of attraction. For ease of notation, we also assume that the gradient of v does not vanish for all x ≠ 0, which ensures that level sets V(c) are connected if c > 0. Since Lyapunov functions are continuously differentiable, they are L_v-Lipschitz continuous over the compact set X.

In general, it is not easy to find suitable Lyapunov functions. However, for physical models, like the prior model h in Eq. (1), the energy of the system (e.g., kinetic and potential energy for mechanical systems) is a good candidate Lyapunov function. Moreover, it has recently been shown that it is possible to compute suitable Lyapunov functions (Li2016Computation; Giesl2015Review). In our experiments, we exploit the fact that value functions in RL are Lyapunov functions if the costs r(x, u) are strictly positive away from the origin. This follows directly from the definition of the value function, since the cost-to-go strictly decreases along trajectories of the closed-loop system. Thus, we can obtain Lyapunov candidates as a by-product of approximate dynamic programming.
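To make the one-step decrease condition concrete, the sketch below checks v(f(x, π(x))) < v(x) along a trajectory of a toy stable system; the quadratic v, linear policy, and dynamics are illustrative choices, not a value function or system from the paper.

```python
def v(x):
    """Quadratic Lyapunov candidate: v(0) = 0 and v(x) > 0 elsewhere."""
    return x * x

def pi(x):
    """A simple stabilizing linear policy (an assumption, not derived)."""
    return -0.5 * x

def f(x, u):
    """Toy dynamics; with pi the closed loop is x_next = 0.3 * x."""
    return 0.8 * x + u

def decreases_along_trajectory(x0, steps=20):
    """Verify the Lyapunov decrease v(f(x, pi(x))) < v(x) at each visited
    nonzero state, the condition used in Theorem 1."""
    x = x0
    for _ in range(steps):
        x_next = f(x, pi(x))
        if x != 0.0 and not v(x_next) < v(x):
            return False
        x = x_next
    return True
```

Since the closed loop contracts the state by a factor of 0.3 per step, the decrease condition holds everywhere and the trajectory converges to the origin.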

Initial safe policy

Lastly, we need to ensure that there exists a safe starting point for the learning process. Thus, we assume that we have an initial policy π_0 that renders the origin of the system in Eq. (1) asymptotically stable within some small set of states S_0^x. For example, this policy may be designed using the prior model h in Eq. (1), since most models are locally accurate but deteriorate in quality as the magnitude of the state increases. This policy is explicitly not assumed to be safe throughout the state space X.

3 Theory

In this section, we use these assumptions for safe reinforcement learning. We start by computing the region of attraction for a fixed policy under the statistical model. Next, we optimize the policy in order to expand the region of attraction. Lastly, we show that it is possible to safely learn about the dynamics and, under additional assumptions about the model and the system's reachability properties, that this approach expands the estimated region of attraction safely. We consider an idealized algorithm that is amenable to analysis, which we convert to a practical variant in Sec. 4. See Fig. 1 for an illustrative run of the algorithm and examples of the sets defined below.

Region of attraction

We start by computing the region of attraction for a fixed policy. This is an extension of the method in (Berkenkamp2016Lyapunov) to discrete-time systems. We want to use the Lyapunov decrease condition in Theorem 1 to guarantee safety for the statistical model of the dynamics. However, the posterior uncertainty in the statistical model of the dynamics means that one-step predictions about v(f(x, u)) are uncertain too. We account for this by constructing high-probability confidence intervals on v(f(x, u)): from Assumption 2 together with the Lipschitz property of v, we know that v(f(x, u)) is contained in the interval Q_n(x, u) := [v(μ_{n−1}(x, u)) ± L_v β_n σ_{n−1}(x, u)] with probability at least (1 − δ). For our exploration analysis, we need to ensure that safe state-action pairs cannot become unsafe; that is, an initial safe set S_0 (defined later) remains safe. To this end, we intersect the confidence intervals: C_n := C_{n−1} ∩ Q_n, where the set C_0(x, u) is initialized so that the pairs in S_0 fulfill the decrease condition, and to R otherwise. Note that v(f(x, u)) is contained in C_n(x, u) with the same probability as in Assumption 2. The upper and lower bounds on v(f(x, u)) are defined as u_n(x, u) := max C_n(x, u) and l_n(x, u) := min C_n(x, u).
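The interval-intersection step can be sketched directly: each new confidence interval is intersected with the running interval, so the certified bounds only ever tighten and a pair once deemed safe cannot become unsafe. The interval values below are arbitrary stand-ins.

```python
def intersect(c_prev, q_new):
    """Intersect two confidence intervals (lo, hi); the result can only
    shrink, which makes the certified safe set monotone."""
    return (max(c_prev[0], q_new[0]), min(c_prev[1], q_new[1]))

def refine(intervals):
    """Fold a sequence of per-iteration intervals into the tightest bound,
    mirroring C_n = C_{n-1} intersected with Q_n."""
    lo, hi = intervals[0]
    for l2, h2 in intervals[1:]:
        lo, hi = max(lo, l2), min(hi, h2)
    return (lo, hi)
```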

Given these high-probability confidence intervals, the system is stable according to Theorem 1 if u_n(x, π(x)) < v(x) for all x ∈ V(c). However, it is intractable to verify this condition directly on the continuous domain without additional, restrictive assumptions about the model. Instead, we consider a discretization X_τ of the state space X into cells, so that ‖x − [x]_τ‖_1 ≤ τ holds for all x ∈ X. Here, [x]_τ denotes the point in X_τ with the smallest distance to x. Given this discretization, we bound the decrease variation of the Lyapunov function for states in X_τ and use the Lipschitz continuity to generalize to the continuous state space X.

Theorem 2.

Under Assumptions 1 and 2 with L_Δv := L_v L_f (L_π + 1) + L_v, let X_τ be a discretization of X such that ‖x − [x]_τ‖_1 ≤ τ for all x ∈ X. If, for all x ∈ V(c) ∩ X_τ with c > 0, u = π(x), and for some n ≥ 0 it holds that u_n(x, u) < v(x) − L_Δv τ, then v(f(x, π(x))) < v(x) holds for all x ∈ V(c) with probability at least (1 − δ), and V(c) is a region of attraction for Eq. (1) under policy π.

The proof is given in the appendix. Theorem 2 states that, given confidence intervals on the statistical model of the dynamics, it is sufficient to check the stricter decrease condition of Theorem 2 on the discretized domain X_τ to guarantee the requirements for the region of attraction in the continuous domain in Theorem 1. The bound in Theorem 2 becomes tight as the discretization constant τ and the scaled model uncertainty β_n σ_n go to zero. Thus, the discretization constant trades off computation costs for accuracy, while u_n approaches v(f(x, u)) as we obtain more measurement data and the posterior model uncertainty about the dynamics decreases. The confidence intervals on v(f(x, π(x))) and the corresponding estimated region of attraction (red lines) can be seen in the bottom half of Fig. 1.
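The discretized check behind Theorem 2 can be sketched as a loop over grid points: verify a strict decrease margin only at grid states inside the level set, and let Lipschitz continuity carry the guarantee to the continuum. Everything here is illustrative: `u_upper` stands in for the model's upper confidence bound on the Lyapunov value at the next state, and the dynamics are a hypothetical contraction.

```python
def v(x):
    """Lyapunov candidate (toy choice)."""
    return abs(x)

def u_upper(x, contraction=0.5):
    """Hypothetical upper confidence bound on v(f(x, pi(x)))."""
    return contraction * abs(x)

def verify_on_grid(c, tau, L_dv, contraction=0.5, lo=-2.0, hi=2.0):
    """Check u_upper(x) - v(x) <= -L_dv * tau at every grid point inside
    the level set v(x) <= c. Grid points with v(x) <= L_dv * tau cannot
    satisfy a strict margin (v is nonnegative); in practice trajectories
    converge to this small neighbourhood of the origin instead, so the
    sketch skips them."""
    n = int(round((hi - lo) / tau)) + 1
    for i in range(n):
        x = lo + i * tau
        if L_dv * tau < v(x) <= c:
            if not (u_upper(x, contraction) - v(x) <= -L_dv * tau):
                return False
    return True
```

A strongly contracting bound passes the check, while a bound too close to v(x) fails it, reflecting how larger model uncertainty shrinks the certifiable level set.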

(a) Initial safe set (in red).
(b) Exploration: 15 data points.
(c) Final policy after 30 evaluations.
Figure 1: Example application of the algorithm. Due to input constraints, the system becomes unstable for large states. We start from an initial, local policy π_0 that has a small, safe region of attraction (red lines) in Fig. 1(a). The algorithm selects safe, informative state-action pairs within S_n (top, white shaded), which can be evaluated without leaving the region of attraction V(c_n) (red lines) of the current policy π_n. As we gather more data (blue crosses), the uncertainty in the model decreases (top, background), and we use Eq. (3) to update the policy so that it lies within D_n (top, red shaded) and fulfills the Lyapunov decrease condition. The algorithm converges to the largest safe set in Fig. 1(c). It improves the policy without evaluating unsafe state-action pairs and thereby without system failure.

Policy optimization

So far, we have focused on estimating the region of attraction for a fixed policy. Safety is a property of states under a fixed policy, which means that the policy directly determines which states are safe. Specifically, to form a region of attraction, all states in the discretization X_τ within a level set of the Lyapunov function need to fulfill the decrease condition of Theorem 2, which depends on the policy choice. The set of all state-action pairs that fulfill this decrease condition is given by

D_n = { (x, u) ∈ X_τ × U | u_n(x, u) − v(x) < −L_Δv τ },   (2)
see Fig. 1(c) (top, red shaded). In order to estimate the region of attraction based on this set, we need to commit to a policy. Specifically, we want to pick the policy that leads to the largest possible region of attraction according to Theorem 2. This requires that, for each discrete state in X_τ, the corresponding state-action pair under the policy must be in the set D_n. Thus, we optimize the policy according to

π_n, c_n = argmax_{π ∈ Π_L, c ∈ R>0} c,   such that for all x ∈ V(c) ∩ X_τ: (x, π(x)) ∈ D_n.   (3)
The region of attraction that corresponds to the optimized policy π_n according to Eq. (3) is given by V(c_n), see Fig. 1(b). It is the largest level set of the Lyapunov function for which all state-action pairs (x, π_n(x)) that correspond to discrete states within V(c_n) ∩ X_τ are contained in D_n. This means that these state-action pairs fulfill the requirements of Theorem 2, and V(c_n) is a region of attraction of the true system under policy π_n. The following theorem is thus a direct consequence of Theorem 2 and Eq. (3).

Theorem 3.

Let R_{π_n} be the true region of attraction of Eq. (1) under the policy π_n. For any δ ∈ (0, 1), we have with probability at least (1 − δ) that V(c_n) ⊆ R_{π_n} for all n > 0.

Thus, when we optimize the policy subject to the constraint in Eq. (3), the estimated region of attraction is always an inner approximation of the true region of attraction. However, solving the optimization problem in Eq. (3) is intractable in general. We approximate the policy update step in Sec. 4.
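The level-set commitment in the policy optimization step can be sketched as a simple search: after a policy update, keep the largest Lyapunov level whose discretized states all satisfy the decrease condition. The `safe` predicate below is a stand-in for membership in the certified set D_n, not the actual constraint from the paper.

```python
def largest_safe_level(grid, v, safe):
    """Return the largest level c = v(x) over the grid such that every
    grid point inside the level set {x : v(x) <= c} satisfies the
    (stand-in) decrease condition `safe`."""
    best = 0.0
    for c in sorted(v(x) for x in grid):
        if all(safe(x) for x in grid if v(x) <= c):
            best = c
        else:
            break
    return best
```

Sweeping levels from small to large mirrors how the estimated region of attraction grows only as far as every enclosed discretized state can be certified.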

Collecting measurements

Given these stability guarantees, it is natural to ask how one might obtain data points in order to improve the model of g and thus efficiently increase the region of attraction. This question is difficult to answer in general, since it depends on the properties of the statistical model. In particular, for general statistical models it is often not clear whether the confidence intervals contract sufficiently quickly. In the following, we make additional assumptions about the model and about reachability within the safe region in order to provide exploration guarantees. These assumptions allow us to highlight fundamental requirements for safe data acquisition and to show that safe exploration is possible.

We assume that the unknown model errors g have bounded norm in a reproducing kernel Hilbert space (RKHS; Scholkopf2002Learning) corresponding to a differentiable kernel k, ‖g‖_k ≤ B_g. RKHS functions form a class of well-behaved functions of the form g(z) = Σ_i α_i k(z_i, z), defined through representer points z_i and weights α_i that decay sufficiently fast with i. This assumption ensures that g satisfies the Lipschitz property in Assumption 1, see (Berkenkamp2016Lyapunov). Moreover, with an appropriately chosen scaling factor β_n, we can use GP models for the dynamics that fulfill Assumption 2 if the state is fully observable and the measurement noise is σ-sub-Gaussian (e.g., bounded in [−σ, σ]), see (Chowdhury2017Kernelized). The scaling factor depends on the information capacity γ_n, which corresponds to the amount of mutual information that can be obtained about g from n measurements, a measure of the size of the function class encoded by the model. The information capacity has a sublinear dependence on n for common kernels, and upper bounds can be computed efficiently (Srinivas2012Gaussian). More details about this model are given in the appendix.

In order to quantify the exploration properties of our algorithm, we consider a discrete action space U. We define exploration in terms of the number of state-action pairs in X_τ × U that we can safely learn about without leaving the true region of attraction. Note that, despite this discretization, the policy takes values on the continuous domain. Moreover, instead of using the confidence intervals directly as in Eq. (3), we consider an algorithm that uses the Lipschitz constants to slowly expand the safe set. We use this in our analysis to quantify the ability to generalize beyond the current safe set. In practice, nearby states are sufficiently correlated under the model to enable generalization via Eq. (2).

Suppose we are given a set S_0 of state-action pairs about which we can learn safely. Specifically, this means that we have a policy such that, for any state-action pair (x, u) in S_0, if we apply action u in state x and then apply actions according to the policy, the state converges to the origin. Such a set can be constructed using the initial policy from Sec. 2 as S_0 = {(x, π_0(x)) | x ∈ S_0^x}. Starting from this set, we want to update the policy to expand the region of attraction according to Theorem 2. To this end, we use the confidence intervals on v(f(x, u)) for states inside S_0 to determine state-action pairs that fulfill the decrease condition. We thus redefine D_n for the exploration analysis to

D_n = { (x, u) ∈ X_τ × U | ∃ (x', u') ∈ S_{n−1}: u_n(x', u') − v(x) + L_Δv ‖(x, u) − (x', u')‖_1 < −L_Δv τ }.   (4)
This formulation is equivalent to Eq. (2), except that it uses the Lipschitz constant to generalize safety. Given D_n, we can again find a region of attraction V(c_n) by committing to a policy according to Eq. (3). In order to expand this region of attraction effectively, we need to decrease the posterior model uncertainty about the dynamics of the GP by collecting measurements. However, to ensure safety as outlined in Sec. 2, we are not only restricted to states within V(c_n), but also need to ensure that the state after taking an action is safe; that is, the dynamics map the state back into the region of attraction V(c_n). We again use the Lipschitz continuity of v, which enters through the upper confidence bound u_n, in order to determine this set,

S_n = { (x, u) ∈ V(c_n) ∩ X_τ × U | u_n(x, u) ≤ c_n }.   (5)
The set S_n contains state-action pairs that we can safely evaluate under the current policy π_n without leaving the region of attraction, see Fig. 1 (top, white shaded).

What remains is to define a strategy for collecting data points within S_n to effectively decrease model uncertainty. We specifically focus on the high-level requirements for any exploration scheme, without committing to a specific method. In practice, any (model-based) exploration strategy that aims to decrease model uncertainty by driving the system to specific states may be used. Safety can be ensured by picking actions according to π_n whenever the exploration strategy reaches the boundary of the safe region V(c_n); that is, whenever the next state would leave the level set V(c_n). This way, we can use π_n as a backup policy for exploration.
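The backup-policy idea reduces to a simple switch: apply the exploratory action only while the state is safely inside the estimated region of attraction, and otherwise fall back to the current stabilizing policy. The predicate and policy below are placeholders, not the paper's implementation.

```python
def explore_step(x, u_explore, inside_safe_region, pi):
    """Use the exploratory action only while the state is inside the safe
    region; otherwise apply the stabilizing backup policy pi."""
    return u_explore if inside_safe_region(x) else pi(x)
```

For example, with a toy backup policy pi(x) = -0.5 * x and a safe region |x| < 1, exploration proceeds freely near the origin but hands control back to pi near the boundary.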

The high-level goal of the exploration strategy is to shrink the confidence intervals at state-action pairs in S_n in order to expand the safe region. Specifically, the exploration strategy should aim to visit state-action pairs in S_n at which we are the most uncertain about the dynamics; that is, where the confidence interval is the largest:

(x_n, u_n) = argmax_{(x, u) ∈ S_n} u_n(x, u) − l_n(x, u).   (6)
As we keep collecting data points according to Eq. (6), we decrease the uncertainty about the dynamics for different actions throughout the region of attraction and adapt the policy, until eventually we have gathered enough information to expand it. While Eq. (6) implicitly assumes that any state within the safe region can be reached by the exploration policy, it achieves the high-level goal of any exploration algorithm that aims to reduce model uncertainty. In practice, any safe exploration scheme is limited by unreachable parts of the state space.
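The measurement-selection rule amounts to an argmax of the confidence-interval width over the safe set. The bound functions below are arbitrary stand-ins for the model's upper and lower confidence bounds, not a real posterior.

```python
def most_uncertain(safe_pairs, lower, upper):
    """Return the safe (x, u) pair with the widest confidence interval,
    i.e. where the model is most uncertain about the dynamics."""
    return max(safe_pairs, key=lambda z: upper(z) - lower(z))
```

Evaluating the chosen pair shrinks its interval the most, which is what drives the safe set to grow over iterations.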

We compare the active learning scheme in Eq. (6) to an oracle baseline that starts from the same initial safe set S_0 and knows the dynamics up to ε accuracy within the safe set. The oracle also uses knowledge about the Lipschitz constants and the optimal policy in Π_L at each iteration. We denote the set that this baseline manages to determine as safe by R_ε(S_0) and provide a detailed definition in the appendix.

Theorem 4.

Assume σ-sub-Gaussian measurement noise and that the model error g in Eq. (1) has RKHS norm smaller than B_g. Under the assumptions of Theorem 2, with the scaling factor β_n chosen as above, and with measurements collected according to Eq. (6), let n* be the smallest positive integer such that n*/(β_{n*}^2 γ_{n*}) ≥ C/ε², where C is a constant that depends on the noise level σ. Let R_{π_n} be the true region of attraction of Eq. (1) under a policy π_n. For any ε > 0 and δ ∈ (0, 1), the following holds jointly with probability at least (1 − δ) for all n ≥ n*:

  • R_ε(S_0) ⊆ S_n; that is, the algorithm learns about every state-action pair that the ε-accurate oracle baseline could certify as safe.
  • x_t ∈ R_{π_n} for all t; that is, the system never leaves the true region of attraction during learning.