Safe Model-Based Reinforcement Learning Using Robust Control Barrier Functions

Reinforcement Learning (RL) is effective in many scenarios. However, it typically requires the exploration of a sufficiently large number of state-action pairs, some of which may be unsafe. Consequently, its application to safety-critical systems remains a challenge. Towards this end, an increasingly common approach to address safety involves the addition of a safety layer that projects the RL actions onto a safe set of actions. In turn, a challenge for such frameworks is how to effectively couple RL with the safety layer to improve the learning performance. In the context of leveraging control barrier functions for safe RL training, prior work focuses on a restricted class of barrier functions and utilizes an auxiliary neural net to account for the effects of the safety layer which inherently results in an approximation. In this paper, we frame safety as a differentiable robust-control-barrier-function layer in a model-based RL framework. As such, this approach both ensures safety and effectively guides exploration during training resulting in increased sample efficiency as demonstrated in the experiments.




I Introduction

Reinforcement Learning (RL) has been used successfully in a number of robotics applications [kober2013reinforcement, arulkumaran2017deep]. Typically, an RL agent must explore a sufficiently large number of states during training to learn a meaningful policy. However, this exploration may involve visiting unsafe states, making RL unsuitable for safety-critical systems where the cost of failure is high [9392290, cowen2020samba]. As such, safe reinforcement learning is a growing field of research aimed at tackling this challenge [garcia2015comprehensive] by combining control-theoretic techniques with RL. A large variety of safe-RL approaches have been proposed in the literature. These can be categorized as either approaches that focus solely on obtaining a safe policy at the end of training (e.g., [altman1998constrained, geibel2005risk, chow2017risk]) or approaches, including the work in this paper, that focus on safe exploration during training (e.g., [cheng2019end, berkenkamp2017safe, chow2018lyapunov]).

Safe training for RL inherently consists of two components: a constraint-based component, which ensures the agent’s safety, and a performance metric, typically captured by a reward or cost function, that the agent aims to optimize. In [chow2018lyapunov, chow2019lyapunov], the authors propose a Lyapunov-based approach for solving Constrained MDPs (CMDPs) [altman1999constrained] which guarantees the safety of the agent during training. A similar approach is the work in [berkenkamp2017safe], where the authors provide a model-based RL framework for safely expanding a region of attraction for the agent. This region of attraction is a set of states for which a stabilizing policy is known. However, the main drawback of these approaches is that they quickly become intractable for more complex environments. Specifically, the method in [berkenkamp2017safe] relies on a discretization of the state space to check states at different level sets of the Lyapunov function, an approach which cannot scale to high-dimensional states. Along similar lines, the approach in [chow2018lyapunov] relies on an optimization at each step with as many constraints as there are states, each of which involves an integral over the entire action space.

Another category of approaches leverages a model-based safety framework which serves to prevent the exploration of unsafe states (e.g., [fisac2018general, li2018safe]) by projecting the action taken by the RL agent onto a safe set of actions. In turn, a central challenge to these approaches is how to leverage the knowledge of the dynamics and the model-based safety component to effectively guide exploration during training. Towards this end, one such approach is that in [cheng2019end], where the authors assume knowledge of a Control Barrier Function (CBF) along with a prior on the dynamics model. During training, the uncertainty in the dynamics is learned using Gaussian Processes (GPs) and accounted for in the CBF-based safety layer. Moreover, a novelty of the work in [cheng2019end] is the addition of a supervised-learning component to the policy aimed at guiding the RL agent. Specifically, a neural net is used to learn the output of the CBF, so that the gradient updates for the RL policy can be taken with respect to the safe actions taken by the CBF layer.

However, the approach in [cheng2019end] suffers from three drawbacks. First, the discrete-time CBF formulation used is restrictive in that it is only amenable to real-time control synthesis via quadratic programming for affine CBFs. For example, in the context of collision avoidance, affine CBFs can only encode polytopic obstacles. Moreover, the supervised learning of the CBF layer fundamentally results in an approximation, which, in the worst case, could negatively affect the learning of the RL agent if inaccurate. Lastly, the approach utilizes model-free RL, thus neglecting the knowledge of the partially learned dynamics which can otherwise be leveraged to generate synthetic data to further improve the learning performance.

To address these challenges, the contribution of this paper is two-fold. First, by leveraging recent advances in differentiable optimization [amos2017optnet, agrawal2019differentiable], we introduce a differentiable Robust CBF (RCBF) [emam2019robust] based safety layer that is compatible with standard policy-gradient RL algorithms. RCBFs are amenable to real-time control synthesis, even if non-affine, and can encode a wide class of disturbances on the dynamics making them applicable to a large variety of systems. Moreover, the differentiable safety layer eliminates the need to learn the behavior of the CBF. Second, we use GPs to learn the disturbance, which are in turn utilized in the safety layer and to generate synthetic model rollouts to improve the sample efficiency of the training. The approach, pictorially depicted in Figure 1, is validated in two environments, and is shown to be more sample-efficient as compared to the framework from [cheng2019end].

II Background Material

In this section, we introduce CBFs and their robust counterpart from [emam2019robust] which can be embedded as a wrapper around any nominal controller to ensure the safety of a disturbed dynamical system. Then, we introduce Gaussian process regression which we leverage to learn the disturbance. Lastly, we discuss the underlying RL algorithm used as the nominal controller in our framework.

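We consider the following disturbed control-affine system

$$\dot{x}(t) = f(x(t)) + g(x(t))\,u(x(t)) + d(x(t)), \tag{1}$$

where $f: \mathbb{R}^n \to \mathbb{R}^n$ denotes the drift dynamics, $g: \mathbb{R}^n \to \mathbb{R}^{n \times m}$ is the input dynamics, $d: \mathbb{R}^n \to \mathbb{R}^n$ is an unknown deterministic disturbance and $u: \mathbb{R}^n \to \mathbb{R}^m$ is the input signal. We assume that $f$, $g$, and $d$ are continuous. We note that the work in this paper can be straightforwardly extended to the case where $d$ is stochastic since GPs are used to learn the disturbance.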

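II-A Control Barrier Functions

Control barrier functions [ames2014, xu2015, AmesBarriers, ogren2006autonomous] are formulated with respect to control-affine systems

$$\dot{x}(t) = f(x(t)) + g(x(t))\,u(x(t)), \quad x(0) = x_0. \tag{2}$$

A set $\mathcal{C} \subset \mathbb{R}^n$ is called forward invariant with respect to (2) if, given a (potentially non-unique) solution $x: [0, t_1] \to \mathbb{R}^n$ to (2), $x_0 \in \mathcal{C}$ implies $x(t) \in \mathcal{C}$ for all $t \in [0, t_1]$. Note that (2) is the unperturbed version of (1).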

Barrier functions guarantee forward invariance of a particular set that typically represents a constraint in a robotic system, such as collision avoidance or connectivity maintenance. Specifically, a barrier function is a continuously differentiable function $h: \mathbb{R}^n \to \mathbb{R}$ (sometimes referred to as a candidate barrier function), and the so-called safe set $\mathcal{C}$ is defined as the super-zero level set of $h$: $\mathcal{C} = \{x \in \mathbb{R}^n : h(x) \geq 0\}$. Now, the goal becomes to ensure the forward set invariance of $\mathcal{C}$, which can be done equivalently by guaranteeing positivity of $h$ along trajectories to (2).

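Positivity can be shown if there exists a locally Lipschitz extended class-$\mathcal{K}$ function $\gamma$ and a continuous function $u: \mathbb{R}^n \to \mathbb{R}^m$ such that

$$L_f h(x) + L_g h(x)\,u(x) \geq -\gamma(h(x)), \quad \forall x \in \mathbb{R}^n, \tag{3}$$

where $L_f h(x) = \nabla h(x)^\top f(x)$ and $L_g h(x) = \nabla h(x)^\top g(x)$ denote the Lie derivatives of $h$ in the directions $f$ and $g$ respectively. A function $\gamma: \mathbb{R} \to \mathbb{R}$ is extended class-$\mathcal{K}$ if it is continuous, strictly increasing, and $\gamma(0) = 0$. If the above conditions hold, then $h$ is called a valid CBF for (2) [ames2019control].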

II-B Robust Control Barrier Functions

Since we are interested in guaranteeing the safety of the disturbed dynamical system (1), we leverage RCBFs [emam2019robust], which generalize the notion of CBFs to systems obeying the following differential inclusion

$$\dot{x}(t) \in f(x(t)) + g(x(t))\,u(x(t)) + D(x(t)), \quad x(0) = x_0, \tag{4}$$

where $f$, $g$, $u$ are as in (1) and $D: \mathbb{R}^n \to 2^{\mathbb{R}^n}$ (the disturbance) is an upper semi-continuous set-valued map that takes nonempty, convex, and compact values. Note that $2^{\mathbb{R}^n}$ refers to the power set of $\mathbb{R}^n$ and that the assumptions made on the disturbance are conditions to guarantee the existence of solutions [cortes2008discontinuous].

Moreover, it was shown that for a specific form of $D$, we can recover a similar formulation of regular CBFs as in (3) with almost no additional computational cost, as stated in the following theorem. An important aspect of this theorem is that forward invariance is guaranteed for all trajectories of (4).

Theorem 1.

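[emam2019robust] Let $h: \mathbb{R}^n \to \mathbb{R}$ be a continuously differentiable function. Let $\psi_i: \mathbb{R}^n \to \mathbb{R}^n$, $i \in \{1, \dots, p\}$, be a set of $p > 0$ continuous functions, and define the disturbance $D: \mathbb{R}^n \to 2^{\mathbb{R}^n}$ as

$$D(x') = \operatorname{co}\{\psi_1(x'), \dots, \psi_p(x')\}, \quad \forall x' \in \mathbb{R}^n,$$

where $\operatorname{co}$ denotes the convex hull. If there exists a continuous function $u: \mathbb{R}^n \to \mathbb{R}^m$ and a locally Lipschitz extended class-$\mathcal{K}$ function $\gamma$ such that

$$L_f h(x') + L_g h(x')\,u(x') \geq -\gamma(h(x')) - \min_{i} \nabla h(x')^\top \psi_i(x'), \quad \forall x' \in \mathbb{R}^n,$$

then $h$ is a valid RCBF for (4).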

In other words, given the set $D$, RCBFs ensure safety of all trajectories of the disturbed system (4). Moreover, the result in Theorem 1 can be straightforwardly extended to the case where $D$ is a finite union of convex hulls as discussed in [emam2019robust], which can be leveraged to capture non-convex disturbances. As such, we can learn an estimate of the unknown disturbance $d$ from (1) and leverage RCBFs to ensure safety with respect to that estimate. Towards this end, in the next subsection, we briefly discuss GPs which we leverage to learn the disturbance during training.
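As a rough illustration of the condition in Theorem 1, the sketch below evaluates the robust barrier inequality at a single state when the disturbance is the convex hull of finitely many vertices. The function names (`dot`, `rcbf_margin`), the choice $\gamma(s) = s$, and the one-dimensional example are assumptions made for illustration and do not come from the paper.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def rcbf_margin(grad_h, f_x, g_x, u, h_x, psi, gamma=lambda s: s):
    """Left-hand side minus right-hand side of the robust barrier condition.

    grad_h: gradient of h at x (length n)
    f_x:    drift f(x) (length n)
    g_x:    input matrix g(x), stored as n rows of length m
    psi:    list of disturbance vertices psi_i(x), each of length n
    A non-negative return value means the condition holds at x for input u.
    """
    lf_h = dot(grad_h, f_x)
    # L_g h is a row vector: one entry per input channel.
    lg_h = [dot(grad_h, [row[j] for row in g_x]) for j in range(len(u))]
    # Because the disturbance set is a convex hull, its worst case is at a vertex.
    worst_case = min(dot(grad_h, p) for p in psi)
    return lf_h + dot(lg_h, u) + gamma(h_x) + worst_case


# Hypothetical 1-D example: x' = u + d, h(x) = x, evaluated at x = 1 with |d| <= 0.5.
margin_ok = rcbf_margin([1.0], [0.0], [[1.0]], [-0.4], 1.0, [[-0.5], [0.5]])
margin_bad = rcbf_margin([1.0], [0.0], [[1.0]], [-0.6], 1.0, [[-0.5], [0.5]])
```

In this toy case the input $u = -0.4$ satisfies the inequality for every admissible disturbance, whereas $u = -0.6$ does not.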

II-C Gaussian Processes

Gaussian Process Regression (GPR) [rasmussen2003gaussian] is a kernel-based regression model used for prediction in many applications such as robotics, reinforcement learning, and data visualisation [deisenroth2015distributed]. One of the main advantages of GPR compared to other regression methods is data efficiency, which allows us to obtain a good estimate of the disturbance using only a small number of data-points. We begin by introducing the general problem setup for GPR and then specialize the results for our application in subsection III-A.

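Given training data $\{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$ where $x^{(i)} \in \mathbb{R}^n$ and $y^{(i)} \in \mathbb{R}$, the objective is to approximate the unknown target function $\bar{f}: \mathbb{R}^n \to \mathbb{R}$ which maps an input $x$ to the target value $y$ given the model $y = \bar{f}(x) + w$, where $w \sim \mathcal{N}(0, \sigma_{\text{noise}}^2)$ is Gaussian noise having zero mean and variance $\sigma_{\text{noise}}^2$. Thus, the prior on the target values $\mathbf{y} = [y^{(1)}, \dots, y^{(N)}]^\top$ can be described by $\mathbf{y} \sim \mathcal{N}(\mu(X), K + \sigma_{\text{noise}}^2 I)$, where $\mu(X)$ is the prior mean of the data computed through a mean function as $\mu(X)_i = \mu(x^{(i)})$ and $K$ is the covariance matrix computed through a covariance function as $K_{ij} = k(x^{(i)}, x^{(j)})$; one commonly used choice of $k$ is the Gaussian kernel given by

$$k(x, x') = \sigma_f^2 \exp\left(-\tfrac{1}{2}(x - x')^\top \Lambda^{-1} (x - x')\right), \quad \Lambda = \operatorname{diag}(l_1^2, \dots, l_n^2),$$

where $\sigma_f^2$ and $l_1, \dots, l_n$ denote the signal variance and the kernel widths respectively. As such, the joint distribution of the training labels and predicted value $y_*$ for a query point $x_*$ is given by

$$\begin{bmatrix} \mathbf{y} \\ y_* \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} \mu(X) \\ \mu(x_*) \end{bmatrix}, \begin{bmatrix} K + \sigma_{\text{noise}}^2 I & k_* \\ k_*^\top & k_{**} \end{bmatrix}\right),$$

where $k_* = [k(x^{(1)}, x_*), \dots, k(x^{(N)}, x_*)]^\top$ and $k_{**} = k(x_*, x_*)$. The mean and variance of the predictions can then be obtained by conditioning on the training data as

$$\mathbb{E}[y_*] = \mu(x_*) + k_*^\top (K + \sigma_{\text{noise}}^2 I)^{-1} (\mathbf{y} - \mu(X)), \qquad \operatorname{Var}[y_*] = k_{**} - k_*^\top (K + \sigma_{\text{noise}}^2 I)^{-1} k_*.$$

Note that $\sigma_{\text{noise}}$, $\sigma_f$, and $l_1, \dots, l_n$ are hyperparameters which are typically optimized by maximizing the log-likelihood of the training data (e.g.,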
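The conditioning step described above can be sketched in a few lines of NumPy. This is a minimal illustration that assumes a zero prior mean, a single shared kernel width, and fixed (not optimized) hyperparameters; the names `gaussian_kernel` and `gp_predict` and the sine-wave data are hypothetical and not taken from the paper.

```python
import numpy as np


def gaussian_kernel(A, B, signal_var=1.0, width=1.0):
    """Gaussian kernel between the rows of A (N x d) and B (M x d)."""
    A = np.asarray(A, dtype=float) / width
    B = np.asarray(B, dtype=float) / width
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return signal_var * np.exp(-0.5 * np.maximum(sq, 0.0))


def gp_predict(X, y, X_query, noise_var=1e-4, signal_var=1.0, width=1.0):
    """Posterior mean and variance of a zero-mean GP at the query points."""
    K = gaussian_kernel(X, X, signal_var, width) + noise_var * np.eye(len(X))
    K_star = gaussian_kernel(X, X_query, signal_var, width)
    L = np.linalg.cholesky(K)                       # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = K_star.T @ alpha
    v = np.linalg.solve(L, K_star)
    prior_var = np.diag(gaussian_kernel(X_query, X_query, signal_var, width))
    return mean, np.maximum(prior_var - np.sum(v**2, axis=0), 0.0)


# Hypothetical data: noiseless samples of sin(x) on [0, 3].
X_train = np.linspace(0.0, 3.0, 8)[:, None]
y_train = np.sin(X_train[:, 0])
mean_near, var_near = gp_predict(X_train, y_train, X_train[2:3])
mean_far, var_far = gp_predict(X_train, y_train, np.array([[10.0]]))
```

Close to the data the predictive variance collapses, while far from the data it reverts to the prior signal variance, which is the behaviour the safety layer relies on when it inflates the disturbance set in unexplored regions.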


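II-D Soft Actor-Critic (SAC)

The RL problem considered is a policy search in an MDP defined by a tuple ($\mathcal{X}$, $\mathcal{U}$, $f$, $g$, $d$, $r$), where $\mathcal{X} \subset \mathbb{R}^n$ denotes the state space, $\mathcal{U} \subset \mathbb{R}^m$ is the input space, $f$, $g$, $d$ are as in (1), and $r: \mathcal{X} \times \mathcal{U} \to \mathbb{R}$ is the reward associated with each transition. Note that the state transitions for the MDP are obtained by discretizing the dynamics (1) as

$$x_{t+1} = x_t + \big(f(x_t) + g(x_t)\,u_t + d(x_t)\big)\Delta t, \tag{9}$$

where $x_t$ denotes the state at time step $t$ and $\Delta t$ is the time step size. We note that this approximation has no effect on safety, which is ensured using RCBFs as discussed in Section III-B.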
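The discretization above is a forward-Euler step of the control-affine dynamics. A minimal sketch, assuming the drift, input matrix, and disturbance are supplied as plain Python callables (the name `euler_step` and the scalar example are hypothetical):

```python
def euler_step(x, u, f, g, d, dt):
    """One forward-Euler step of x' = f(x) + g(x) u + d(x).

    x: state (length n), u: input (length m)
    f(x), d(x) return length-n lists; g(x) returns n rows of length m.
    """
    f_x, g_x, d_x = f(x), g(x), d(x)
    return [
        x[i] + dt * (f_x[i] + sum(g_x[i][j] * u[j] for j in range(len(u))) + d_x[i])
        for i in range(len(x))
    ]


# Hypothetical scalar system: x' = u + 0.1, starting at x = 0 with u = 1.
x_next = euler_step([0.0], [1.0], lambda x: [0.0], lambda x: [[1.0]], lambda x: [0.1], 0.1)
```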

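Subsequently, we choose to utilize SAC as the underlying RL algorithm since it is state-of-the-art in terms of sample efficiency [haarnoja2018soft] and thus aligns with the objective of this work. SAC maximizes an entropy-regularized objective which is given by

$$J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(x_t, u_t) \sim \rho_\pi}\big[r(x_t, u_t) + \alpha\,\mathcal{H}(\pi(\cdot \mid x_t))\big],$$

where $x_t$ and $u_t$ denote the state and action at timestep $t$ respectively. Moreover, $\rho_\pi$ denotes the distribution of states and actions induced by the policy $\pi$. Lastly, the term $\mathcal{H}(\pi(\cdot \mid x_t))$ is the entropy term which incentivizes exploration and $\alpha$ is a weighting parameter between the two terms in the objective.

The algorithm relies on an actor-critic approach with a fitted Q-function $Q_\theta$ parametrized by $\theta$ and a fitted actor $\pi_\phi$ parametrized by $\phi$. As such, the critic loss is given by

$$J_Q(\theta) = \mathbb{E}_{(x_t, u_t) \sim \mathcal{B}}\Big[\tfrac{1}{2}\big(Q_\theta(x_t, u_t) - \big(r(x_t, u_t) + \beta\,\mathbb{E}_{x_{t+1}}[V_{\bar{\theta}}(x_{t+1})]\big)\big)^2\Big], \tag{11}$$

where

$$V_{\bar{\theta}}(x_t) = \mathbb{E}_{u_t \sim \pi_\phi}\big[Q_{\bar{\theta}}(x_t, u_t) - \alpha \log \pi_\phi(u_t \mid x_t)\big],$$

$\beta$ is the discount factor, $\mathcal{B}$ is the replay buffer and $\bar{\theta}$ is the target Q-network parameters. Finally, the policy loss is given by

$$J_\pi(\phi) = \mathbb{E}_{x_t \sim \mathcal{B}}\Big[\mathbb{E}_{u_t \sim \pi_\phi}\big[\alpha \log \pi_\phi(u_t \mid x_t) - Q_\theta(x_t, u_t)\big]\Big]. \tag{14}$$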

III Main Approach

The main objective of this work is two-fold: combining RL with RCBFs for safety during training and improving the sample efficiency of the learning. We assume knowledge of $f$ and $g$ from (1) as well as an RCBF $h$ to keep the system safe. During training, the dynamics model is improved by collecting data during exploration to estimate the unknown disturbance function $d$. Subsequently, the RCBF is employed as a safety layer around the RL agent, minimally altering its proposed actions subject to safety constraints as pictorially depicted in Figure 1.

Figure 1: A diagram of the proposed framework. At time step $t$, the RL agent outputs the potentially unsafe control $u_{\mathrm{RL}}(x_t)$, which is rendered safe using the CBF controller. The safe control input $u(x_t)$ is then applied in the environment. Note that the dynamics model is used both by the CBF controller, to guarantee safety, and by the RL agent, to increase the sample efficiency of its learning.

III-A Disturbance Estimation

Although RCBFs are compatible with various data-driven methods, in this paper, we choose to focus on estimating the disturbance set $D$ using GPs as discussed in subsection II-C. This learning is achieved by obtaining a dataset with labels $y_t$ given by

$$y_t = \hat{\dot{x}}_t - f(x_t) - g(x_t)\,u_t,$$

where $\hat{\dot{x}}_t$ is the noisy measurement of the dynamics during exploration. Note that, in this case, $y_t \in \mathbb{R}^n$; therefore, we train one GP per dimension for a total of $n$ GP models.

In turn, we can obtain the disturbance estimate $\hat{D}(x_*)$ for a query point $x_*$ as

$$\hat{D}(x_*) = \{d \in \mathbb{R}^n : \mu_i(x_*) - k_c\,\sigma_i(x_*) \leq d_i \leq \mu_i(x_*) + k_c\,\sigma_i(x_*),\ i = 1, \dots, n\}, \tag{15}$$

where $\mu_i(x_*)$ and $\sigma_i(x_*)$ are the mean and standard-deviation predictions of the $i$th GP for query point $x_*$. The coefficient $k_c$ is a user-chosen confidence parameter (e.g., $k_c = 2$ corresponds to a confidence of roughly $95\%$ per dimension under a Gaussian prediction). Note that this is a different representation of the disturbance than the one used by Theorem 1, where the disturbance is the convex hull of a set of points. However, the desired form can be readily obtained by defining $\psi_i$ as the $2^n$ vectors generated by permuting the entries of $\hat{D}(x_*)$ from (15), i.e., the vertices of the box.
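The conversion from per-dimension confidence intervals to the vertex form required by Theorem 1 can be sketched as follows. The function name `disturbance_vertices` and the numbers in the example are illustrative assumptions; the inputs stand in for the GP mean and standard-deviation predictions at a query point.

```python
from itertools import product


def disturbance_vertices(mean, std, k_c=2.0):
    """Vertices of the box [mean - k_c*std, mean + k_c*std] in each dimension.

    mean, std: per-dimension GP predictions at the query point (length n).
    Returns the 2**n corner points, each as a list of length n.
    """
    bounds = [(m - k_c * s, m + k_c * s) for m, s in zip(mean, std)]
    return [list(vertex) for vertex in product(*bounds)]


# Hypothetical 2-D prediction: means (0, 1) and standard deviations (0.1, 0.5).
vertices = disturbance_vertices([0.0, 1.0], [0.1, 0.5])
```

The number of vertices grows as $2^n$, so this enumeration is only practical for low-dimensional disturbances such as those in the environments considered here.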

III-B RCBF-based Safety Layer

In this subsection, we present the minimally invasive RCBF Quadratic Program (RCBF-QP) based control synthesis framework which guarantees the safety of the RL agent with dynamics as in (4). As depicted in Figure 1, given the state, the disturbance estimate from the GPs and the action of the RL agent, this safety layer minimally alters the action so as to ensure safety. Specifically, the RCBF compensation term $u_{\mathrm{RCBF}}(x)$ is given by

$$\begin{aligned} (u_{\mathrm{RCBF}}(x), \epsilon) = \operatorname*{arg\,min}_{u_{\mathrm{RCBF}},\,\epsilon}\ & \|u_{\mathrm{RCBF}}\|^2 + k_\epsilon\,\epsilon^2 \\ \text{s.t.}\ & L_f h(x) + L_g h(x)\big(u_{\mathrm{RL}}(x) + u_{\mathrm{RCBF}}\big) \geq -\gamma(h(x)) - \min_{i} \nabla h(x)^\top \psi_i(x) - \epsilon, \end{aligned} \tag{16}$$

where $u_{\mathrm{RL}}(x)$ is the output of the RL policy at state $x$, $\epsilon$ is a slack variable that serves to ensure feasibility of the QP, and $k_\epsilon$ is a large weighting term to minimize safety violations. As depicted in Figure 1, the final safe action taken in the environment is given by

$$u(x) = u_{\mathrm{RL}}(x) + u_{\mathrm{RCBF}}(x).$$

Note that constraints can be modularly added to (16) to account for multiple RCBFs and actuator limits. Moreover, the need for a slack variable stems from cases where no control input can satisfy the control barrier certificate and a safety violation is inevitable. We refer the reader to [ames2019control, wang2017safety, squires2018constructive] for approaches to constructing CBFs that guarantee the existence of solutions under actuator constraints, which typically leverage a back-up safety maneuver.
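For intuition, when there is a single barrier constraint and no actuator limits, a QP of this shape has a closed-form solution, sketched below. This is a simplification for illustration and not the solver used in the paper: the constraint is written generically as $a^\top u_{\mathrm{RCBF}} + \epsilon \geq b$, where $a$ would play the role of $L_g h(x)$ and $b$ would collect the remaining terms of the barrier inequality; the name `rcbf_qp_single` and the default weight are hypothetical.

```python
def rcbf_qp_single(a, b, k_eps=1e4):
    """Solve  min ||u||^2 + k_eps * eps^2  s.t.  a^T u + eps >= b  in closed form.

    Returns (u, eps). If b <= 0 the constraint already holds at u = 0, so no
    compensation is needed. Otherwise the constraint is active and the KKT
    conditions give u = b * a / (||a||^2 + 1/k_eps).
    """
    if b <= 0.0:
        return [0.0] * len(a), 0.0
    denom = sum(ai * ai for ai in a) + 1.0 / k_eps
    u = [b * ai / denom for ai in a]
    eps = b / (k_eps * denom)
    return u, eps


# Hypothetical cases: an inactive constraint, an active one, and one where the
# input has no authority over h (a = 0), so only the slack can absorb b.
u_inactive, eps_inactive = rcbf_qp_single([1.0, 0.0], -1.0)
u_active, eps_active = rcbf_qp_single([2.0, 0.0], 1.0)
u_stuck, eps_stuck = rcbf_qp_single([0.0, 0.0], 1.0)
```

The third case shows why the slack variable is needed: when the input cannot influence the barrier function, the program stays feasible and the violation is reported through the slack instead.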

As discussed in Section II-B, RCBFs are formulated with respect to the continuous-time differential inclusion (4). However, as highlighted in Theorem  of [ames2016control], under certain Lipschitz-continuity assumptions, solving the RCBF-based optimization program at a sufficiently high frequency ensures safety of the system during inter-triggering intervals as well. As such, if $\Delta t$ from (9) is sufficiently small, we can query the RL policy and solve the RCBF-QP at the required rate for ensuring safety at all times. On the other hand, if $\Delta t$ is relatively large, safety can still be enforced by applying a zero-order hold on the RL action and subsequently solving the RCBF-QP at the necessary rate; however, in the interest of brevity, we leave these modifications to future work.

Algorithm 1 SAC-RCBF
Require: Dynamics prior $f$ and $g$, and RCBF $h$
for each iteration do
    Train GP models on the environment replay buffer $\mathcal{B}_{\text{env}}$
    for each environment step do
        Obtain action $u_{\mathrm{RL}}$ from $\pi_\phi$
        Render action safe using $\hat{D}$ and (16)
        Take safe action in environment
        Add transition to $\mathcal{B}_{\text{env}}$
    end for
    for each model rollout do
        Sample $x$ uniformly from $\mathcal{B}_{\text{env}}$
        for $k$ model steps do
            Obtain action $u_{\mathrm{RL}}$ from $\pi_\phi$
            Render action safe using $\hat{D}$ and (16)
            Generate synthetic transition using the dynamics prior and $\hat{D}$
            Add transition to the model replay buffer $\mathcal{B}_{\text{model}}$
        end for
    end for
    for each gradient step do
        Update agent parameters ($\theta$ and $\phi$) using (11), (14)
    end for
end for

III-C Coupling RL and RCBFs

Given the assumptions made so far, we propose two ways of improving the sample efficiency of the framework presented. Namely, we leverage the partially learned dynamics to generate synthetic data that can be added to the replay buffer, and we utilize a differentiable version of the safety layer which allows us to backpropagate through the QP. In what follows, we discuss both of these approaches; the final framework is presented in Algorithm 1.


III-C1 Differentiable Safety Layer

As discussed above, the final actions taken in the environment are the altered safe actions obtained from the RCBF layer. However, the updates in equations (11) and (14) are typically taken with respect to the potentially unsafe RL actions. This mismatch has also been highlighted in [cheng2019end] and can be exploited to further speed up the learning. Intuitively, the objective is to leverage the agent’s knowledge of the safety layer to aid its training.

To achieve this, we propose utilizing the differentiable optimization framework introduced in [amos2017optnet, agrawal2019differentiable], which enables the propagation of gradients through the QP as in (16) using a linear system formulated from the KKT conditions. As such, we can take the updates in (11) and (14) with respect to the safe action and backpropagate through the RCBF-QP in an end-to-end fashion. We note that, although this approach results in increased computation time because batches of QPs are being solved during the computation of the losses (11) and (14), the inference time remains the same.
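To make the idea of backpropagating through the safety layer concrete, the sketch below uses a deliberately simplified stand-in: a single barrier constraint $a^\top u + \epsilon \geq c$ on the final action with a quadratic slack penalty, for which the safe action and its Jacobian with respect to the RL action are available in closed form. The paper instead relies on the general differentiable-optimization layers of [amos2017optnet, agrawal2019differentiable]; the function names and numbers here are hypothetical.

```python
def safe_action(u_rl, a, c, k_eps=1e4):
    """Safe action u_RL + u_RCBF for the single constraint a^T u + eps >= c."""
    b = c - sum(ai * ui for ai, ui in zip(a, u_rl))
    if b <= 0.0:                      # RL action is already safe
        return list(u_rl)
    denom = sum(ai * ai for ai in a) + 1.0 / k_eps
    return [ui + b * ai / denom for ui, ai in zip(u_rl, a)]


def safe_action_jacobian(u_rl, a, c, k_eps=1e4):
    """Jacobian of the safe action with respect to the RL action."""
    m = len(u_rl)
    eye = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    b = c - sum(ai * ui for ai, ui in zip(a, u_rl))
    if b <= 0.0:                      # layer acts as the identity
        return eye
    denom = sum(ai * ai for ai in a) + 1.0 / k_eps
    # Active constraint: the component of the gradient along a is (almost) removed.
    return [[eye[i][j] - a[i] * a[j] / denom for j in range(m)] for i in range(m)]


# Hypothetical example: the constraint u_1 >= 1 is active at u_RL = (0, 0).
a_vec, c_val = [1.0, 0.0], 1.0
J_active = safe_action_jacobian([0.0, 0.0], a_vec, c_val)
J_inactive = safe_action_jacobian([2.0, 0.0], a_vec, c_val)
```

When the constraint is active, changes in the RL action along the constraint normal barely affect the action applied to the environment, and the Jacobian passes exactly that information back to the policy during the gradient update.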

III-C2 Model-Based RL

To increase the sample efficiency of the baseline RL-RCBF agent, we leverage the partially learned dynamics to generate short-horizon model rollouts as in [janner2019trust]. The use of short-horizon rollouts is motivated by the fact that model errors compound over long horizons, and thus the data collected from such rollouts would not benefit the learning and, in the worst case, can significantly impede the agent from learning an optimal policy.

Specifically, as highlighted in Algorithm 1, at each iteration, we sample initial states from transitions in the environment’s replay buffer and generate new, synthetic $k$-step rollouts using the dynamics prior and the disturbance estimates from the GPs. In turn, these newly generated transitions are added to another replay buffer which is used to train the agent.
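The rollout procedure described above can be sketched as follows. All names are hypothetical: `policy`, `safety_layer`, `model_step`, and `reward_fn` are placeholders for the SAC actor, the RCBF-QP, the learned dynamics (prior plus GP disturbance estimate), and the reward, and transitions are assumed to be stored as `(state, action, reward, next_state)` tuples.

```python
import random


def model_rollouts(env_buffer, policy, safety_layer, model_step, reward_fn,
                   n_rollouts, horizon, rng=random):
    """Generate short synthetic rollouts that branch from real visited states."""
    model_buffer = []
    for _ in range(n_rollouts):
        x = rng.choice(env_buffer)[0]          # start from a state seen in the environment
        for _ in range(horizon):
            u = safety_layer(x, policy(x))     # RL action rendered safe
            x_next = model_step(x, u)          # learned model, not the real environment
            model_buffer.append((x, u, reward_fn(x, u), x_next))
            x = x_next
    return model_buffer


# Hypothetical scalar example with a clipping "safety layer".
synthetic = model_rollouts(
    env_buffer=[(0.0, 0.0, 0.0, 0.1)],
    policy=lambda x: 1.0,
    safety_layer=lambda x, u: min(u, 0.5),
    model_step=lambda x, u: x + 0.1 * u,
    reward_fn=lambda x, u: -u * u,
    n_rollouts=3,
    horizon=2,
)
```

Keeping `horizon` small limits how far compounding model error can carry a rollout away from states the real system actually visited.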

IV Experiments

To validate the efficacy of the proposed approach, we compare it against a baseline as well as a modified version of the approach from [cheng2019end] in two custom environments (footnote 1: the code for the experiments can be found at …). Specifically, the baseline approach utilizes SAC with RCBFs without model rollouts or the differentiable QP layer, and the modified version of the approach from [cheng2019end] leverages RCBFs instead of the original discrete CBF formulation to permit the formulation of a safety constraint for non-polytopic obstacles. In both environments, the metric used for comparison is the agents’ performance (i.e., the cumulative sum of rewards over an episode) with respect to the number of episodes. Moreover, we note that no safety violations occurred throughout the experiments, thus validating the effectiveness of the RCBF-QP at keeping the system safe throughout training.

IV-A Unicycle Environment

The first environment, inspired by the Safety Gym environment [ray2019benchmarking], consists of a unicycle robot tasked with reaching a goal location while avoiding obstacles. We chose to build a custom environment because explicit knowledge of $f$ and $g$ from (1) is needed for the formulation of the RCBF constraint, which is not available in Safety Gym. Note that, in this section, the explicit dependence on time is dropped for brevity.

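The robot is modelled using disturbed unicycle dynamics as

$$\dot{x} = \begin{bmatrix} \cos\theta & 0 \\ \sin\theta & 0 \\ 0 & 1 \end{bmatrix} u + d,$$

where $x = [x_1, x_2, \theta]^\top$ is the state (planar position and heading), $u = [v, \omega]^\top$ is the control input with $v$ and $\omega$ denoting the linear and angular velocities of the robot respectively, and $d$ is an unknown disturbance aimed at simulating a sloped surface. In turn, from the agent’s perspective, the dynamics are modelled through the following differential inclusion

$$\dot{x} \in \begin{bmatrix} \cos\theta & 0 \\ \sin\theta & 0 \\ 0 & 1 \end{bmatrix} u + D(x),$$

where the disturbance set $D(x)$ is learned via GPs as described in subsection III-A.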

Figure 2: Snapshot of the Unicycle environment. The agent (blue) is tasked with reaching a desired location (green) while avoiding the obstacles (red).
Figure 3: Comparison of episode reward versus number of training episodes in the Unicycle environment for the proposed approach (green), the modified approach from [cheng2019end] (yellow) and the SAC-RCBF baseline (blue). Each plot is an average of ten experiments using different seeds, and the light shadowing denotes a confidence interval of two standard deviations.

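To permit the formulation of a collision-avoidance constraint, similarly to [emam2019robust], we introduce the following output of the state. This technique considers a point at a distance $l > 0$ ahead of the robot. In particular, with state $x = [x_1, x_2, \theta]^\top$, define the output $p$ (i.e., the point ahead of the robot) as

$$p(x) = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + l \begin{bmatrix} \cos\theta \\ \sin\theta \end{bmatrix}.$$

Differentiating $p(x)$ along the disturbed unicycle dynamics yields that

$$\dot{p} = \begin{bmatrix} \cos\theta & -l\sin\theta \\ \sin\theta & l\cos\theta \end{bmatrix} u + \begin{bmatrix} 1 & 0 & -l\sin\theta \\ 0 & 1 & l\cos\theta \end{bmatrix} d. \tag{24}$$

Note that (24) can then be rewritten in a form suitable for Theorem 1 as $\dot{p} \in f_p(x) + g_p(x)\,u + D_p(x)$ where

$$f_p(x) = 0, \quad g_p(x) = \begin{bmatrix} \cos\theta & -l\sin\theta \\ \sin\theta & l\cos\theta \end{bmatrix}, \quad D_p(x) = \begin{bmatrix} 1 & 0 & -l\sin\theta \\ 0 & 1 & l\cos\theta \end{bmatrix} D(x).$$

As such, we can encode a collision-avoidance constraint between the agent and a given obstacle using the following RCBF $h(x) = \tfrac{1}{2}\big(\|p(x) - p_{\text{obs}}\|^2 - \delta^2\big)$, where $\delta$ denotes the minimum desired distance between $p$ and the position of the obstacle $p_{\text{obs}}$.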
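A small sketch of the look-ahead construction, under stated assumptions: the state is taken to be $(x_1, x_2, \theta)$, the look-ahead distance is $l$, and the barrier is taken to be $h = \tfrac{1}{2}(\|p - p_{\text{obs}}\|^2 - \delta^2)$; the function names and numerical values are hypothetical.

```python
import math


def lookahead_point(x, l):
    """Point at distance l ahead of a unicycle with state x = (x1, x2, theta)."""
    x1, x2, theta = x
    return [x1 + l * math.cos(theta), x2 + l * math.sin(theta)]


def lookahead_input_matrix(theta, l):
    """Matrix mapping (v, omega) to the velocity of the look-ahead point."""
    return [[math.cos(theta), -l * math.sin(theta)],
            [math.sin(theta), l * math.cos(theta)]]


def collision_barrier(x, p_obs, delta, l):
    """h(x) = 0.5 * (||p(x) - p_obs||^2 - delta^2); non-negative when safe."""
    p = lookahead_point(x, l)
    dist_sq = (p[0] - p_obs[0]) ** 2 + (p[1] - p_obs[1]) ** 2
    return 0.5 * (dist_sq - delta ** 2)


# Hypothetical configuration: robot at the origin facing +x, obstacle at (1.1, 0).
state = [0.0, 0.0, 0.0]
p = lookahead_point(state, 0.1)
G = lookahead_input_matrix(state[2], 0.1)
det_G = G[0][0] * G[1][1] - G[0][1] * G[1][0]
h_val = collision_barrier(state, [1.1, 0.0], 0.5, 0.1)
```

The determinant of the input matrix equals $l$, so for any $l > 0$ both inputs act directly on the velocity of the look-ahead point, which is what allows a distance-based barrier to have relative degree one.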

Shown in Figure 3 is a plot of the episode rewards versus training episodes for the three approaches. We note that no safety violations occurred for any of the three methods. As shown in the figure, all three approaches converge to an optimal policy and the proposed approach performs best in terms of sample efficiency. In addition to the proposed approach leveraging model-based rollouts, this improvement in sample efficiency can also be attributed to the fact that the neural net used in the approach from [cheng2019end] is trained simultaneously with the RL policy, thus adding non-stationarity to the learning. In other words, the supervised learning component must first converge so that an optimal RL policy can be found with respect to it.

IV-B Car Following

The second environment is based on [cheng2019end] and involves a chain of five cars driving in a lane each modelled as a disturbed double integrator


where $p_i$, $v_i$, and $a_i$ are the position, velocity and input acceleration of car $i$ respectively. In addition, $d_i$ is a disturbance unknown to the agent that is non-zero for all cars but one.

The agent only controls the acceleration of the fourth car and can observe the positions, velocities and accelerations of all five cars. The leading car’s behavior is aimed at simulating traffic through a sinusoidal acceleration. Cars 2 and 3 behave as follows

where $v_{\text{des}}$ is the desired velocity and $k_v$ and $k_b$ are the velocity and braking gains respectively. Lastly, the last car in the chain (i.e., car 5) has the following control input

As such, the objective of the RL agent is to minimize the total control effort, which is captured through the reward function, while avoiding collisions. Moreover, two RCBFs, $h_1$ and $h_2$, are formulated which enforce a minimum distance $\delta$ between cars three and four, and between cars four and five, respectively. We note that since $h_1$ and $h_2$ have relative degree 2 with respect to the input acceleration, a cascaded RCBF formulation as in [notomista2020persistification] is leveraged to obtain the collision avoidance constraints. Specifically, we define cascaded barrier functions from $h_1$ and $h_2$ and enforce the forward invariance of their super-zero level sets through RCBFs using a similar procedure to the one described above. The theoretical particulars of cascaded RCBFs will be discussed in future work, and we present the results without proof here.

Figure 4: Comparison of episode reward versus number of training episodes in the Car Following environment for the proposed approach (green), the modified approach from [cheng2019end] (yellow) and the SAC-RCBF baseline (blue). Each plot is an average of ten experiments using different seeds, and the light shadowing denotes a confidence interval of two standard deviations.

Similarly to the previous environment, shown in Figure 4 is a plot of the episode rewards versus training episodes for the three different approaches. As shown in the figure, the proposed approach learns a slightly better policy compared to the modified approach from [cheng2019end] which did not fully converge at the end of the episodes, while being significantly more sample efficient.

On the other hand, the baseline fails to learn a meaningful policy altogether, which can be attributed to the fact that the behavior of the safety layer (i.e., the RCBF-QP) is not accounted for and is thus treated as part of the unknown transition function in the MDP with respect to the RL agent. As such, a large change in the QP’s output is equivalent to a sharp transition in the dynamics which is difficult to learn. This signifies that explicitly accounting for the safety layer’s behavior is not only critical for sample-efficient training but also for finding an optimal policy.

V Conclusion

In this paper, we introduce a novel framework that combines RL with an RCBF-based layer to enforce safety during training. Moreover, we empirically demonstrate that leveraging the dynamics prior and the learned disturbances to generate model rollouts, together with a differentiable version of the safety layer, improves both the sample efficiency and the steady-state performance during training.


The authors would like to thank Dr. Samuel Coogan, Dr. Gennaro Notomista and Andrew Szot for helpful discussions.