End-to-End Safe Reinforcement Learning through Barrier Functions for Safety-Critical Continuous Control Tasks

by   Richard Cheng, et al.

Reinforcement Learning (RL) algorithms have found limited success beyond simulated applications, and one main reason is the absence of safety guarantees during the learning process. Real world systems would realistically fail or break before an optimal controller can be learned. To address this issue, we propose a controller architecture that combines (1) a model-free RL-based controller with (2) model-based controllers utilizing control barrier functions (CBFs) and (3) on-line learning of the unknown system dynamics, in order to ensure safety during learning. Our general framework leverages the success of RL algorithms to learn high-performance controllers, while the CBF-based controllers both guarantee safety and guide the learning process by constraining the set of explorable polices. We utilize Gaussian Processes (GPs) to model the system dynamics and its uncertainties. Our novel controller synthesis algorithm, RL-CBF, guarantees safety with high probability during the learning process, regardless of the RL algorithm used, and demonstrates greater policy exploration efficiency. We test our algorithm on (1) control of an inverted pendulum and (2) autonomous car-following with wireless vehicle-to-vehicle communication, and show that our algorithm attains much greater sample efficiency in learning than other state-of-the-art algorithms and maintains safety during the entire learning process.



There are no comments yet.


page 1

page 2

page 3

page 4


Safe-Critical Modular Deep Reinforcement Learning with Temporal Logic through Gaussian Processes and Control Barrier Functions

Reinforcement learning (RL) is a promising approach and has limited succ...

Safety Verification of Model Based Reinforcement Learning Controllers

Model-based reinforcement learning (RL) has emerged as a promising tool ...

Learning Unmanned Aerial Vehicle Control for Autonomous Target Following

While deep reinforcement learning (RL) methods have achieved unprecedent...

Onboard Safety Guarantees for Racing Drones: High-speed Geofencing with Control Barrier Functions

This paper details the theory and implementation behind practically ensu...

Augmenting Learning Components for Safety in Resource Constrained Autonomous Robots

This paper deals with resource constrained autonomous robots commonly fo...

Safe Reinforcement Learning on Autonomous Vehicles

There have been numerous advances in reinforcement learning, but the typ...

Safety-aware Adaptive Reinforcement Learning with Applications to Brushbot Navigation

This paper presents a safety-aware learning framework that employs an ad...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.


Reinforcement learning (RL) focuses on finding an agent’s policy (i.e. controller) that maximizes a long-term reward. It does this by repeatedly observing the agent’s state, taking an action (according to a current policy), and receiving a reward. Over time, the agent modifies its policy to maximize its long-term reward. This method has been successfully applied to continuous control tasks [Duan et al.2016, Lillicrap et al.2015] where controllers have learned to stabilize complex robots (after many policy iterations).

However, since RL focuses on maximizing the long-term reward, it is likely to explore unsafe behaviors during the learning process. This feature is problematic for any RL algorithm that will be deployed on hardware, as unsafe learning policies could damage the hardware or bring harm to a human. As a result, most success in the use of RL for control of physical systems has been limited to simulations, where many failed iterations can occur before success.

Safe RL tries to learn a policy that maximizes the expected return, while also ensuring (or encouraging) the satisfaction of some safety constraints [García and Fernández2015]. Previous approaches to safe reinforcement learning include reward-shaping, policy optimization with constraints [Gaskett2003, Moldovan and Abbeel2012, Achiam et al.2017, Wachi et al.2018], or teacher advice [Abbeel and Ng2004, Abbeel, Coates, and Ng2010, Tang et al.2010]. However, these model-free approaches do not guarantee safety during learning – safety is only approximately guaranteed after a sufficient learning period. The fundamental issue is that without a model, safety must be learned through environmental interactions, which means it may be violated during initial learning interactions.

Model-based approaches have utilized Lyapunov-based methods or model predictive control to guarantee safety under system dynamics during learning [Wang, Theodorou, and Egerstedt2017, Berkenkamp et al.2017, Chow et al.2018, Ohnishi et al.2018, Koller et al.2018], but they do not address the issue of exploration and performance optimization. Other works guarantee safety by switching between backup controllers [Perkins and Barto2003, Mannucci et al.2018], though this overly constrains policy exploration.

We draw inspiration from recent work that has incorporated model information into model-free RL algorithms to ensure safety during exploration [Fisac et al.2018, Li, Kalabic, and Chu2018, Gillula and Tomlin2012]. However, these approaches utilize backup safety controllers that do not guide the learning process (limiting exploration efficiency).

This paper develops a framework for integrating existing model-free RL algorithms with control barrier functions (CBFs) to guarantee safety and improve exploration efficiency in RL, even with uncertain model information. The CBFs require a (potentially poor) nominal dynamics model, but can ensure online safety of nonlinear systems during the entire learning process and help the RL algorithm efficiently search the policy space. This methodology effectively constrains the policy exploration process to a set of safe polices defined by the CBF. An on-line process learns the governing dynamical system over time, which allows the CBF controller to adapt and become less conservative over time. This general framework allows us to utilize any model-free RL algorithm to learn a controller, with the CBF controller guiding policy exploration and ensuring safety.

Using this framework, we develop an efficient algorithm for controller synthesis, RL-CBF, with guarantees on safety (remaining within a safe set) and performance (reward-maximization). To test this approach, we integrated two model-free RL algorithms – trust region policy optimization (TRPO) [Schulman et al.2015] and deep deterministic policy gradients (DDPG) [Lillicrap et al.2015] – with the CBF controllers and dynamical model learning. We tested the algorithms on two nonlinear control problems: (1) balancing of an inverted pendulum, and (2) autonomous car following with wireless vehicle-to-vehicle communication. For both tasks, our algorithm efficiently learned a high-performance controller while maintaining safety throughout the learning process. Furthermore, it learned faster than comparable RL algorithms due to inclusion of a model learning process, which constrains the space of explorable policies and guides the exploration process.

Our main contributions are: (1) we develop the first algorithm that integrates CBF-based controllers with model-free RL to achieve end-to-end safe RL for nonlinear control systems, and (2) we show improved learning efficiency by guiding the policy exploration with barrier functions.


Consider an infinite-horizon discounted Markov decision process (MDP) with control-affine, deterministic dynamics (a good assumption when dealing with robotic systems), defined by the tuple

, where is a set of states, is a set of actions, is the nominal unactuated dynamics, is the nominal actuated dynamics, and is the unknown system dynamics. The time evolution of the system is given by


where , , and compose a known nominal model of the dynamics, and represents the unknown model. In practice, the nominal model may be quite bad (e.g. a robot model that ignores friction and compliance), and we must learn a much better dynamic model through data.

Furthermore is the reward function, is the distribution of the initial state , and is the discount factor.

Reinforcement Learning

Let denote a stochastic control policy that maps states to distributions over actions, and let denote the policy’s expected discounted reward:


Here is a trajectory where the actions are sampled from policy . We use the standard definitions for the value function , action-value function , and advantage function, below:


where actions are drawn from distribution .

Most policy optimization RL algorithms attempt to maximize long-term reward using (a) policy iteration methods [Bertsekas2005], (b) derivative-free optimization methods that optimize the return as a function of policy parameters [Fu, Glover, and April2005], or (c) policy gradient methods [Peters and Schaal2008, Silver et al.2014]. Any of these methods can be rendered end-to-end safe using the RL-CBF control framework proposed in this work. However, we will focus mainly on policy gradient methods, due to their good performance on continuous control problems.

Policy Gradient-Based RL

Policy gradient methods estimate the gradient of the expected return

with respect to the policy based on sampled trajectories. They then optimize the policy using gradient ascent, allowing modification of the control law at episodic intervals. The DDPG and TRPO algorithms are examples of policy gradient methods, which we will use as benchmarks.

DDPG is an off-policy actor-critic method that computes the policy gradient based on sampled trajectories and an estimate of the action-value function. It alternately updates the action-value function and the policy as it samples more and more trajectories.

TRPO is an on-policy policy gradient method that maximizes a surrogate loss function, which serves as an approximate lower bound on the true loss function. It also ensures that the next policy distribution is within a “trust region”. More precisely, it approximates the optimal policy update by iteratively solving the optimization problem:


such that the Kullback-Leibler divergence

. Here represents the discounted visitation frequency of state under policy , and is a constant defining the “trust region”.

Though both DDPG and TRPO have learned good controllers on several benchmark problems, there is no guarantee of safety in these algorithms, nor any other model-free RL algorithm. Therefore, our objective is to complement model-free RL controllers with model-based CBF controllers (using a potentially poor nominal model), which can both improve search efficiency and ensure safety.

Gaussian Processes

We use Gaussian process (GP) models to estimate the unknown system dynamics, , from data. A Gaussian process is a nonparametric regression method for estimating functions and their uncertain distribution from data [Rasmussen and Williams2006]. It describes the evolving model of the uncertain dynamics, , by a mean estimate, , and the uncertainty,

, which allows for high probability confidence intervals on the function:


with probability where is a design parameter that determines (e.g. confidence is achieved at ). Therefore, by learning and in tandem with the controller, we obtain high probability confidence intervals on the unknown dynamics, which adapt/shrink as we obtain more information (i.e. measurements) on the system.

A GP model is parameterized by a kernel function , which defines the similarity between any two states . In order to make inferences on the unknown function , we need measurements, , which are computed from measurements of () using the relation from Equation (1):

. Since any finite number of data points form a multivariate normal distribution, we can obtain the posterior distribution of

at any query state by conditioning on the past measurements. Given measurements subject to independent Gaussian noise , the mean

and variance

at the query state, , are calculated to be,


where is the kernel matrix, and . As we collect more data, becomes a better estimate of , and the uncertainty, , of the dynamics decreases.

We note that in applications with large amounts of data, training the GP becomes problematic since computing the matrix inverse in Equation (6) scales poorly ( in the number of data points). There are several methods to alleviate this issue, such as using sparse inducing inputs or local GPs [Snelson and Ghahramani2007, Nguyen-Tuong, Seeger, and Peters2009]

. In fact, our framework can use any model approximation method that provides quantifiable uncertainty bounds (e.g. neural networks with dropout). However, we bypass this issue in this work by batch training the GP model with only the latest batch of

data points.

Control Barrier Functions

Consider an arbitrary safe set, , defined by the super-level set of a continuously differentiable function ,


To maintain safety during the learning process, the system state must always remain within the safe set (i.e. the set is forward invariant). Examples include keeping a manipulator within a given workspace, or ensuring that a quadcopter avoids obstacles. Essentially, the learning algorithm should learn/explore only in set .

Control barrier functions utilize a Lyapunov-like argument to provide a sufficient condition for ensuring forward invariance of the safe set under controlled dynamics. Therefore, barrier functions are a natural tool to enforce safety throughout the learning process, and can be used to synthesize safe controllers for our systems.

Definition 1.

Given a set defined by (7), the continuously differentiable function is a discrete-time control barrier function (CBF) for dynamical system (1) if there exists such that for all ,


where represents how strongly the barrier function “pushes” the state inwards within the safe set (if , the barrier condition simplifies to the Lyapunov condition).

The existence of a CBF implies that there exists a deterministic controller such that the set is forward invariant for system (1) [Agrawal and Sreenath2017, Ames et al.2017]. In other words, if condition (8) is satisfied for all , then the set is rendered forward invariant. Our goal is to find a controller, , that satisfies condition (8), so that safety is certified.

For this paper, we restrict our attention to affine barrier functions of form , (), though our methodology could support more general barrier functions. This restriction means the set is composed of intersecting half spaces (i.e. polytopes).

Before we can formulate a tractable optimization problem that satisfies condition (8), we must have an estimate for . We use an updating GP model to estimate the mean and variance of the function, and , from measurement data. From equation (5), we know that with probability . Therefore, we can reformulate the CBF condition (8) into the following quadratic program (QP) that can be efficiently solved at each time step:


where is a slack variable in the safety condition, is a large constant that penalizes safety violations, and

denotes the element-wise absolute value of the vector

. The optimization is not sensitive to the parameter as long as it is very large (e.g. ), such that safety constraint violations are heavily penalized. The last constraint on encodes actuator constraints. The solution to this optimization problem (9) enforces the safety condition (8) as best as possible with minimum control effort, even with uncertain dynamics. Accounting for the dynamics uncertainty through GP models allows us to certify system safety, even with a poor nominal model.

Let us define the set . Then we can prove the following lemma.

Lemma 1.

For dynamical system (1), if there exists a solution to (9) for all with , then the controller derived from (9) renders set forward invariant with probability .

However, suppose there exists such that (9) has solution with . If for all , the solution to (9) satisfies , then the larger set is forward invariant with probability .


The first part of the lemma follows directly from Definition 1 and the probabilistic bounds on the uncertainty obtained from GPs shown in equation (5).

For the second part, the property of GPs in equation (5) implies that with probability , the following inequality is satisfied under the system dynamics (1):


Therefore, the constraint in problem (9) ensures that:


Define , so that (11) simplifies to


By Definition 1, the set defined by is forward invariant under system dynamics (1). ∎

The CBF controllers that solve (9) provide deterministic control laws, that naturally encode safety; they provide the minimal control intervention that maintains safety or provide graceful degradation (a small deviation from the safe set) when safety cannot be enforced (e.g. due to actuation constraints). Furthermore, even with dynamics uncertainty, we can make high-probability statements about system safety using GP models with CBFs.

Note that one can easily combine multiple CBF constraints in problem (9) to define polytopic safe regions.

CBF-Based Compensating Control with Reinforcement Learning

To illustrate our framework, we first propose the suboptimal controller in equation (13), which combines a model-free RL-based controller (parameterized by ) and a CBF-based controller in the architecture shown in Figure 0(a).


The concept is akin to shielded RL [Alshiekh et al.2017, Fisac et al.2018], since the CBF controller compensates for the RL controller to ensure safety, but does not guide exploration of the RL algorithm. The next section will extend the CBF controller to improve RL policy exploration.

Note that since the RL policy is stochastic (see Preliminaries section on RL), the controller represents the realization (i.e. sampled control action) of the stochastic policy after policy iteration .

Figure 1: Control architecture combining model-free RL controller with model-based CBF to guarantee safety. (a) Initial architecture that uses CBF to compensate for unsafe control actions, but does not guide learning and exploration. (b) Architecture that uses CBF to guide exploration and learning, as well as ensure safety.

The model-free RL controller, proposes a control action that attempts to optimize long-term reward, but may be unsafe. Before deploying the RL controller, a CBF controller filters the proposed control action and provides the minimum control intervention needed to ensure that the overall controller, , keeps the system state within the safe set. Essentially, the CBF controller, “projects” the RL controller into the set of safe policies. In the case of an autonomous car, this action may enforce a safe distance between nearby cars, regardless of the action proposed by the RL controller.

The CBF controller , which depends on the RL control, is defined by the following QP that can be efficiently solved at each time step:


The last constraint in (14) incorporates possible actuator limits of the system.

We must make clear the important distinction between the indexes and . Note that indexes timesteps within each policy iteration or trial, whereas indexes the policy iterations (which contain trajectories with several timesteps). The CBF controller updates throughout the task (computed at each time step, ), whereas the RL policy and GP model update at episodic policy iteration intervals indexed by .

Let represent the largest violation of the barrier condition (i.e. potential deviation from the safe set) for any . Lemma 1 extends to the modified optimization problem (14), implying that satisfies the barrier certificate inequality (up to ) that guarantees forward invariance of . Therefore, if there exists a solution to problem (14) such that , then controller (13) renders the safe set forward invariant with probability . However if , but for all , then the controller will render the set forward invariant with probability .

Intuitively, the RL controller provides a “feedforward control”, and the CBF controller compensates with the minimum control necessary to render the safe set forward invariant. If such a control does not exist (e.g. due to torque constraints), then the CBF controller provides the control that keeps the state as close as possible to the safe set.

However, a significant issue is that controller (13) ensures safety, but does not actively guide policy exploration of the overall controller. This is because the RL policy being updated around, , is not the policy deployed on the agent, . For example, suppose that in an autonomous driving task, the RL controller inadvertently proposes to collide with an obstacle. The CBF controller compensates to drive the car around the obstacle. The next learning iteration should update the policy around the safe deployed policy , rather than the unsafe policy (which would have led to an obstacle collision). However, the algorithm described in this section updates around the original policy, , as illustrated in Figure 2a.

CBF-Based Guiding Control with Reinforcement Learning

In order to achieve safe and efficient learning, we should learn from the deployed controller , since it operates in the safe region , rather than learning around , which may operate in an unsafe, irrelevant area of state space. The RL-CBF algorithm described below incorporates this goal.

Recall that represent the realized controllers sampled from stochastic policies . Consider an initial RL-based controller (for iteration ). The CBF controller is determined from (14) to obtain . For every following policy iteration, let us define the overall controller to incorporate all previous CBF controllers, as in equation (15).


The dependence of controller (15) on all prior CBF controllers (see Figure 0(b)) is critical to enhancing learning efficiency. Defining the controller in this fashion leads to policy updates around the previously deployed controller, which adds to the efficiency of the learning process by encouraging the policy to operate in desired areas of the state space. This idea is illustrated in Figure 2b.

The intuition is that at iteration , the RL policy proposed actions , but it took safe actions . To update the policy based on the safe actions, the effective RL controller at the next iteration () should be , which is then filtered by the CBF controller (i.e. is now part of the RL controller). Across multiple policy iterations, we can consider to be the guided RL controller (proposing potentially unsafe actions), which is rendered safe by .

Figure 2: Illustration of policy iteration process, where we try to learn the optimal safe policy, . (a) Policy optimization with barrier-compensating controller. Next policy is updated around the previous RL controller, ; (b) Policy optimization with barrier-guided controller. Next policy is updated around previous deployed controller, .

To ensure safety after incorporating all prior CBF controllers, they must be included into the governing QP:


The solution to (16) defines the CBF controller , which ensures safety by satisfying the barrier condition (8).

Let represent the largest violation of the barrier condition for any .

Theorem 2.

Using the control law from (15), if there exists a solution to problem (16) such that , then the safe set is forward invariant with probability . If , but the solution to problem (16) satisfies for all , then the controller will render the larger set forward invariant with probability .

Furthermore, if we use TRPO for the RL algorithm, then the control law from (15) achieves the performance guarantee , where and is chosen as in equation (4).


The first part of the theorem follows directly from Definition 1 and Lemma 1. The only difference from Lemma 1 is that the control includes the RL controller and all previous CBF controllers ().

The proof of the performance bound is given in the Appendix of this paper found at   https://rcheng805.github.io/files/aaai2019.pdf. ∎

RL-CBF provides high-probability safety guarantees during the learning process and can maintain the performance guarantees of TRPO. If we have no uncertainty in the dynamics, then safety is guaranteed with probability 1. Note that the performance guarantee in Theorem 2 is for control law , which is not the deployed controller, . However, this does not pose a significant issue, since rapidly decays to 0 as we iterate. This is because the guided RL controller quickly learns to operate in the safe region, so the CBF controller becomes inactive.

Computationally Efficient Algorithm

This section describes an efficient algorithm to implement the framework described above, since a naive approach would be too computationally expensive in many cases. To see this, recall the controller (15) we would ideally implement:

The first term may be represented by a neural network that is parameterized by , which has a standard implementation. The third term is just a quadratic program with dependencies on the other terms; it does not pose a computational burden. However, the summation in the 2nd term poses a challenge, since every term in depends on a different previous RL controller . Therefore, we would need to store neural networks corresponding to each previous RL controller. In addition, we would have to solve separate QPs in sequence to evaluate each CBF controller. Such a brute-force implementation would be impractical .

To overcome this issue, we approximate , where is a feedforward neural network (MLP) parameterized by . We chose a MLP since they have been shown to be powerful function approximators. Thus, at each policy iteration, we fit the MLP to data of collected from trajectories of the previous policy iteration. Then we obtain the controller:

Note that even with this approximation, safety with probability is still guaranteed. This is because the above approximation only affects the guided RL term . The CBF controller still solves (16), which provides the safety guarantees in Theorem 2 by satisfying the CBF condition (8). Furthermore, we now have to store only two NNs and solve one QP for the controller. The tradeoff is that the performance guarantee in Theorem 2 does not necessarily hold with this approximation. The algorithm is outlined in Algorithm 1.

1:Initialize RL Policy , state ,
2:         measurement array , action array
3:for  do
4:     Sample (but do not deploy) control
5:     Solve for from optimization problem (16)
6:     Deploy controller
7:     Store state-action pair in
8:     Observe and store in
9:Collect Episode Reward,
10:Update GP model using (6) and measurements
11:Set (representing policy iteration)
12:while Episodes do
13:     Do policy iteration using RL algorithm based on
14:      previously observed episode/rewards to obtain
15:     Train to approximate prior CBF controllers
16: using
17:     Initialize state
18:     for  do
19:         Sample control
20:         Solve for from problem (16)
21:         Deploy controller
23:         Store state-action pair () in
24:         Observe and store in      
25:     Collect Episode Reward,
26:     Update GP model using (6) and measurements
29:                                    Overall controller composed
30:                                         from all 3 subcomponents
Algorithm 1 RL-CBF algorithm


We implement two versions of the RL-CBF algorithm with existing model-free RL algorithms: TRPO-CBF, derived from TRPO [Schulman et al.2015], and DDPG-CBF, derived from DDPG [Lillicrap et al.2015]. The code for these examples can be found at: https://github.com/rcheng805/RL-CBF.

Inverted Pendulum

We first apply RL-CBF to the control of a simulated inverted pendulum from the OpenAI gym environment (pendulum-v0), which has mass and length, , and is actuated by torque, . We set the safe region to be radians, and define the reward function to learn a controller that keeps the pendulum upright. The true system dynamics are defined as follows,


with torque limits , and . To introduce model uncertainty, our nominal model assumes ( error in model parameters).

Figure 3 compares the accumulated reward achieved during each episode using TRPO, DDPG, TRPO-CBF, and DDPG-CBF. The two RL-CBF algorithms converge near the optimal solution very rapidly, and significantly outperform the corresponding baseline algorithms without the CBFs. We note that TRPO and DDPG sometimes converge on a high-performance controller (comparable to TRPO-CBF and DDPG-CBF), though this occurs less reliably and more slowly, resulting in the poorer learning curves. More importantly, the RL-CBF controllers maintain safety (i.e. never leave the safe region) throughout the learning process, as also seen in Figure 3. In contrast, TRPO and DDPG severely violate safety while learning the optimal policy.

Figure 4 shows the pendulum angle during a representative trial under the first policy versus the last learned policy deployed for TRPO-CBF and DDPG-CBF. For the first policy iteration, the pendulum angle is maintained near the edge of the safe region – the RL algorithm has proposed a poor controller so the CBF controller takes the minimal action necessary to keep the system safe. By the last iteration though, the CBF controller is completely inactive (), since the guided RL controller () is already safe.

Figure 3: (Top) Maximum angle (rad) of the pendulum throughout each episode. Values above the dashed black line represent exits from the safe set at some point during the episode. (Bottom) Comparison of accumulated reward from inverted pendulum problem using TRPO, DDPG, TRPO-CBF, and DDPG-CBF.
Figure 4: Representative pendulum trajectory (angle vs. time) using first policy vs last policy. The left plot and right plot show results from TRPO-CBF and DDPG-CBF, respectively. The trajectory for the first policy (blue) goes to edge of the safe region and stays there, while the trajectory for the last policy (red) quickly converges to the upright position.

Simulated Car Following

Consider a chain of five cars following each other on a straight road. We control the acceleration/deceleration of the car in the chain, and would like to train a policy to maximize fuel efficiency during traffic congestion while avoiding collisions. Each car utilizes the dynamics shown in equation (18), and we attempt to optimize the reward function (19). The car dynamics and reward function are inspired by previous work [He, Ge, and Orosz2018].


The first term in the reward optimizes fuel efficiency, while the other term encourages the car to maintain a 3 meter distance from the other cars (soft constraint). For the RL-CBF controllers, the CBF enforces a 2 meter safe distance between cars (hard constraint). The behavior of cars 1,2,3, and 5 is described in the Appendix.

The car has access to every other cars’ position, velocity, and acceleration, but it only has a crude model of its own dynamics () and an inaccurate model of the drivers behind and in front of it. In addition, we add Gaussian noise to the acceleration of each car. The idea is that the car can use its crude model to guarantee safety with high probability, and improve fuel efficiency by slowly building and leveraging an implicit model of the other drivers’ behaviors.

From Figure 5, we see that there were no safety violations between the cars during our simulated experiments when using either of the RL-CBF controllers. When using TRPO and DDPG alone without CBF safety, almost all trials had collisions, even in the later stages of learning. Furthermore, as seen in Figure 5, TRPO-CBF learns faster and outperforms TRPO (DDPG-CBF also outperforms DDPG though neither algorithm converged on a high-performance controller in our experiments). It is important to note that in some experiments, TRPO finds a comparable controller to TRPO-CBF, but this is often not the case due to randomness in seeds.

Although DDPG and DDPG-CBF failed to converge on a good policy, Figure 5 shows that DDPG-CBF (and TRPO-CBF) always maintained a safe controller. This is a crucial benefit of the RL-CBF approach, as it guarantees safety independent of the system’s learning performance.

Figure 5: (Top) Minimum headway between cars during each learning episode using DDPG, TRPO, DDPG-CBF, and TRPO-CBF. Values below the dashed black line represent exits from the safe set, and values below 0 represent collisions. The curve for DDPG has high negative values throughout learning, and is not seen. (Bottom) Comparison of reward over multiple episodes from car-following problem using TRPO, TRPO-CBF, and DDPG-CBF (DDPG is excluded because it exhibits very poor performance).


Adding even crude model information and CBFs into the model-free RL framework allows us to improve the exploration of model-free learning algorithms while ensuring end-to-end safety. Therefore, we proposed the safe RL-CBF framework, and developed an efficient controller synthesis algorithm that guarantees safety and improves exploration. These features will be crucial in deploying reinforcement learning on physical systems, where problems require online computation and efficient learning with safety guarantees.

This framework, which combines model-free RL-based control, model-based CBF control, and model learning has the additional advantages of being able to (1) easily integrate new RL algorithms (in place of TRPO/DDPG) as they are developed, and (2) incorporate better model information from measurements to online improve the CBF controller.

A significant assumption in this work is that we are given a valid safe set, , which can be rendered forward invariant. However, computing these valid safe sets is non-trivial and computationally intensive [Wang, Theodorou, and Egerstedt2017, Wabersich and Zeilinger2018, Fisac et al.2018]. If we are not given a valid safe set, we may reach states where it is not possible to remain safe (i.e. ). Although our controller achieves graceful degradation in these cases, in future work it will be important to learn the safe set in addition to the controller.


The authors would like to thank Hoang Le and Yisong Yue for helpful discussions.


Appendix A Appendix A: Proof of Theorem 2

Theorem 2.

Using the control law from (15), if there exists a solution to problem (16) such that , then the safe set is forward invariant with probability . If , but the solution to problem (16) satisfies for all , then the controller will render the set forward invariant with probability .

Furthermore, if we use TRPO for the RL algorithm, then the control law from (15) achieves the performance guarantee , where and is chosen as in equation (4).


To prove the performance bound in the second part of the theorem, we use the property of the advantage function from equation (20) below:


where . As derived in (Schulman et al. 2015), we can then obtain the following inequality:


where is the total variational distance between policies and , and