## I Introduction

Reinforcement Learning (RL) has been used successfully in a number of robotics applications [kober2013reinforcement, arulkumaran2017deep]. Typically, an RL agent must explore a sufficiently large number of states during training to learn a meaningful policy. However, this exploration may involve visiting unsafe states, making RL unsuitable for safety-critical systems where the cost of failure is high [9392290, cowen2020samba]. As such, safe reinforcement learning is a growing field of research aimed at tackling this challenge [garcia2015comprehensive] by combining control-theoretic techniques with RL. A large variety of safe-RL approaches have been proposed in the literature. These approaches can be categorized as either approaches that focus solely on obtaining a safe policy at the end of training (e.g., [altman1998constrained, geibel2005risk, chow2017risk]) or approaches, including the work in this paper, that focus on safe exploration during training (e.g., [cheng2019end, berkenkamp2017safe, chow2018lyapunov]).

Safe training for RL inherently consists of two components: a constraint-based component, which ensures the agent’s safety, and a performance metric, typically captured by a reward or cost function, that the agent aims to optimize. In [chow2018lyapunov, chow2019lyapunov], the authors propose a Lyapunov-based approach for solving Constrained MDPs (CMDPs) [altman1999constrained] which guarantees the safety of the agent during training. A similar approach is that of [berkenkamp2017safe], where the authors provide a model-based RL framework for safely expanding a region of attraction for the agent. This region of attraction is a set of states for which a stabilizing policy is known. However, the main drawback of these approaches is that they quickly become intractable for more complex environments. Specifically, the method in [berkenkamp2017safe] relies on a discretization of the state space to check states at different level sets of the Lyapunov function, an approach which cannot scale to high-dimensional states. Along similar lines, the approach in [chow2018lyapunov] relies on an optimization at each step with as many constraints as there are states, each of which involves an integral over the entire action space.

Another category of approaches leverages a model-based safety framework which serves to prevent the exploration of unsafe states (e.g., [fisac2018general, li2018safe]) by projecting the action taken by the RL agent onto a safe set of actions. In turn, a central challenge to these approaches is how to leverage the knowledge of the dynamics and the model-based safety component to effectively guide exploration during training. Towards this end, one such approach is that in [cheng2019end], where the authors assume knowledge of a Control Barrier Function (CBF) along with a prior on the dynamics model. During training, the uncertainty in the dynamics is learned using Gaussian Processes (GPs) and accounted for in the CBF-based safety layer. Moreover, a novelty of the work in [cheng2019end] is the addition of a supervised-learning component to the policy aimed at guiding the RL agent. Specifically, a neural net is used to learn the output of the CBF, so that the gradient updates for the RL policy can be taken with respect to the safe actions taken by the CBF layer.

However, the approach in [cheng2019end] suffers from three drawbacks. First, the discrete-time CBF formulation used is restrictive in that it is only amenable to real-time control synthesis via quadratic programming for affine CBFs. For example, in the context of collision avoidance, affine CBFs can only encode polytopic obstacles. Second, the supervised learning of the CBF layer fundamentally results in an approximation, which, in the worst case, could negatively affect the learning of the RL agent if inaccurate. Third, the approach utilizes model-free RL, thus neglecting the knowledge of the partially learned dynamics which can otherwise be leveraged to generate synthetic data to further improve the learning performance.

To address these challenges, the contribution of this paper is two-fold. First, by leveraging recent advances in differentiable optimization [amos2017optnet, agrawal2019differentiable], we introduce a differentiable Robust CBF (RCBF) [emam2019robust] based safety layer that is compatible with standard policy-gradient RL algorithms. RCBFs are amenable to real-time control synthesis, even if non-affine, and can encode a wide class of disturbances on the dynamics making them applicable to a large variety of systems. Moreover, the differentiable safety layer eliminates the need to learn the behavior of the CBF. Second, we use GPs to learn the disturbance, which are in turn utilized in the safety layer and to generate synthetic model rollouts to improve the sample efficiency of the training. The approach, pictorially depicted in Figure 1, is validated in two environments, and is shown to be more sample-efficient as compared to the framework from [cheng2019end].

## II Background Material

In this section, we introduce CBFs and their robust counterpart from [emam2019robust] which can be embedded as a wrapper around any nominal controller to ensure the safety of a disturbed dynamical system. Then, we introduce Gaussian process regression which we leverage to learn the disturbance. Lastly, we discuss the underlying RL algorithm used as the nominal controller in our framework.

We consider the following disturbed control-affine system

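$$\dot{x}(t) = f(x(t)) + g(x(t))\,u(x(t)) + d(x(t)), \tag{1}$$

where $f : \mathbb{R}^n \to \mathbb{R}^n$ denotes the drift dynamics, $g : \mathbb{R}^n \to \mathbb{R}^{n \times m}$ is the input dynamics, $d : \mathbb{R}^n \to \mathbb{R}^n$ is an unknown deterministic disturbance and $u : \mathbb{R}^n \to \mathbb{R}^m$ is the input signal. We assume that $f$, $g$, $d$ and $u$ are continuous. We note that the work in this paper can be straightforwardly extended to the case where $d$ is stochastic since GPs are used to learn the disturbance.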

### II-A Control Barrier Functions

Control barrier functions [ames2014, xu2015, AmesBarriers, ogren2006autonomous] are formulated with respect to control-affine systems

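$$\dot{x}(t) = f(x(t)) + g(x(t))\,u(x(t)), \quad x(0) = x_0. \tag{2}$$

A set $\mathcal{C} \subset \mathbb{R}^n$ is called forward invariant with respect to (2) if, given a (potentially non-unique) solution to (2) $x : [0, t_1] \to \mathbb{R}^n$, $x_0 \in \mathcal{C} \implies x(t) \in \mathcal{C}$ for all $t \in [0, t_1]$. Note that (2) is the unperturbed version of (1).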

Barrier functions guarantee forward invariance of a particular set that typically represents a constraint in a robotic system, such as collision avoidance or connectivity maintenance. Specifically, a barrier function is a continuously differentiable function $h : \mathbb{R}^n \to \mathbb{R}$ (sometimes referred to as a candidate barrier function), and the so-called safe set $\mathcal{C} \subset \mathbb{R}^n$ is defined as the super-zero level set of $h$: $\mathcal{C} = \{x \in \mathbb{R}^n : h(x) \geq 0\}$. Now, the goal becomes to ensure the forward set invariance of $\mathcal{C}$, which can be done equivalently by guaranteeing positivity of $h$ along trajectories to (2).

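Positivity can be shown if there exists a locally Lipschitz extended class-$\mathcal{K}$ function $\gamma : \mathbb{R} \to \mathbb{R}$ and a continuous function $u : \mathbb{R}^n \to \mathbb{R}^m$ such that

$$L_f h(x) + L_g h(x)\,u(x) \geq -\gamma(h(x)), \quad \forall x \in \mathbb{R}^n, \tag{3}$$

where $L_f h(x) = \nabla h(x)^\top f(x)$ and $L_g h(x) = \nabla h(x)^\top g(x)$ denote the Lie derivatives of $h$ in the directions $f$ and $g$ respectively. A function $\gamma : \mathbb{R} \to \mathbb{R}$ is extended class-$\mathcal{K}$ if it is continuous, strictly increasing, and $\gamma(0) = 0$. If the above conditions hold, then $h$ is called a valid CBF for (2) [ames2019control].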
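As a small worked illustration of the barrier condition (a hypothetical example that is not part of the original experiments), consider the scalar single integrator with zero drift, unit input dynamics, the candidate barrier $h(x) = 1 - x^2$, the choice $\gamma(r) = r$, and the controller $u(x) = -x$. All of these choices are assumptions made only for this sketch.

```latex
\begin{aligned}
& f(x) = 0, \qquad g(x) = 1, \qquad h(x) = 1 - x^2, \qquad \gamma(r) = r, \\
& L_f h(x) = \nabla h(x)\, f(x) = 0, \qquad L_g h(x) = \nabla h(x)\, g(x) = -2x, \\
& L_f h(x) + L_g h(x)\, u(x) = (-2x)(-x) = 2x^2, \\
& 2x^2 \;\geq\; -\gamma(h(x)) = -(1 - x^2)
  \;\iff\; 1 + x^2 \geq 0 \quad \text{for all } x \in \mathbb{R}.
\end{aligned}
```

Since the inequality holds everywhere, $h$ is a valid CBF for this toy system and the interval $[-1, 1]$ is forward invariant under $u(x) = -x$.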

### II-B Robust Control Barrier Functions

Since we are interested in guaranteeing the safety of the disturbed dynamical system (1), we leverage RCBFs [emam2019robust], which generalize the notion of CBFs to systems obeying the following differential inclusion

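$$\dot{x}(t) \in f(x(t)) + g(x(t))\,u(x(t)) + D(x(t)), \quad x(0) = x_0, \tag{4}$$

where $f$, $g$, $u$ are as in (1) and $D : \mathbb{R}^n \to 2^{\mathbb{R}^n}$ (the disturbance) is an upper semi-continuous set-valued map that takes nonempty, convex, and compact values. Note that $2^{\mathbb{R}^n}$ refers to the power set of $\mathbb{R}^n$ and that the assumptions made on the disturbance are conditions to guarantee the existence of solutions [cortes2008discontinuous].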

Moreover, it was shown that for a specific form of $D$, we can recover a formulation similar to that of regular CBFs in (3) with almost no additional computational cost, as stated in the following theorem. An important aspect of this theorem is that forward invariance is guaranteed for all trajectories of (4).

###### Theorem 1.

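[emam2019robust] Let $h : \mathbb{R}^n \to \mathbb{R}$ be a continuously differentiable function. Let $\psi_i : \mathbb{R}^n \to \mathbb{R}^n$, $i \in \{1, \ldots, p\}$, be a set of $p$ continuous functions, and define the disturbance $D : \mathbb{R}^n \to 2^{\mathbb{R}^n}$ as

$$D(x) = \operatorname{co} \Psi(x) = \operatorname{co}\{\psi_1(x), \ldots, \psi_p(x)\}, \quad \forall x \in \mathbb{R}^n, \tag{5}$$

where $\operatorname{co}$ denotes the convex hull. If there exists a continuous function $u : \mathbb{R}^n \to \mathbb{R}^m$ and a locally Lipschitz extended class-$\mathcal{K}$ function $\gamma : \mathbb{R} \to \mathbb{R}$ such that

$$L_f h(x) + L_g h(x)\,u(x) \geq -\gamma(h(x)) - \min_{i \in \{1, \ldots, p\}} \nabla h(x)^\top \psi_i(x), \quad \forall x \in \mathbb{R}^n, \tag{6}$$

then $h$ is a valid RCBF for (4).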
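The robust condition in Theorem 1 only requires evaluating the barrier gradient against each vertex of the disturbance hull, because a linear function attains its minimum over a convex hull at a vertex. The following is a minimal sketch of that check at a single state; the function name `rcbf_margin` and its argument layout are illustrative assumptions, not part of the original work.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equally sized vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def rcbf_margin(
    grad_h: Vector,               # gradient of h at the current state
    f_x: Vector,                  # drift f(x)
    g_x: Sequence[Vector],        # input matrix g(x), n rows and m columns
    u: Vector,                    # candidate control input
    vertices: Sequence[Vector],   # hull vertices psi_1(x), ..., psi_p(x)
    gamma_h: float,               # gamma(h(x)) for the chosen class-K function
) -> float:
    """Left-hand side minus right-hand side of the robust barrier condition.

    A non-negative value means the condition holds at this state.
    """
    lf_h = dot(grad_h, f_x)
    g_times_u = [dot(row, u) for row in g_x]
    lg_h_u = dot(grad_h, g_times_u)
    # Worst-case disturbance direction: minimum over the hull vertices.
    worst = min(dot(grad_h, psi) for psi in vertices)
    return lf_h + lg_h_u + gamma_h + worst


# Toy scalar example (assumed): h(x) = 1 - x^2 at x = 0.5, u = -x,
# disturbance hull co{-0.1, 0.1}, gamma(r) = r.
margin = rcbf_margin([-1.0], [0.0], [[1.0]], [-0.5], [[-0.1], [0.1]], 0.75)
```

With $p$ vertices the extra cost over a regular CBF check is $p$ inner products, which is consistent with the remark that the robust formulation adds almost no computational cost.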

In other words, given the disturbance set $D$, RCBFs ensure safety of all trajectories of the disturbed system (4). Moreover, the result in Theorem 1 can be straightforwardly extended to the case where $D$ is a finite union of convex hulls as discussed in [emam2019robust], which can be leveraged to capture non-convex disturbances. As such, we can learn an estimate of the unknown disturbance $d$ from (1) and leverage RCBFs to ensure safety with respect to that estimate. Towards this end, in the next subsection, we briefly discuss GPs which we leverage to learn the disturbance during training.

### II-C Gaussian Processes

Gaussian Process Regression (GPR) [rasmussen2003gaussian] is a kernel-based regression model used for prediction in many applications such as robotics, reinforcement learning, and data visualisation [deisenroth2015distributed]. One of the main advantages of GPR compared to other regression methods is data efficiency, which allows us to obtain a good estimate of the disturbance using only a small number of data-points. We begin by introducing the general problem setup for GPR and then specialize the results for our application in subsection III-A.

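Given training data $\{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$ where $x^{(i)} \in \mathbb{R}^{n}$ and $y^{(i)} \in \mathbb{R}$, the objective is to approximate the unknown target function $f_{\mathrm{GP}}$ which maps an input $x^{(i)}$ to the target value $y^{(i)}$ given the model $y^{(i)} = f_{\mathrm{GP}}(x^{(i)}) + \epsilon^{(i)}$, where $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma_\epsilon^2)$ is Gaussian noise having zero mean and variance $\sigma_\epsilon^2$. Thus, the prior on the target values can be described by $y \sim \mathcal{N}(m(X), K + \sigma_\epsilon^2 I)$, where $m(X)$ is the prior mean of the data computed through a mean function $m(\cdot)$ as $m(X)_i = m(x^{(i)})$ and $K$ is the covariance matrix computed through a covariance function $k(\cdot, \cdot)$ as $K_{ij} = k(x^{(i)}, x^{(j)})$; one commonly used choice of which is the Gaussian kernel given by

$$k(x^{(i)}, x^{(j)}) = \sigma_f^2 \exp\!\Big(-\tfrac{1}{2}\,(x^{(i)} - x^{(j)})^\top \Lambda^{-1} (x^{(i)} - x^{(j)})\Big), \tag{7--8}$$

where $\sigma_f^2$ and $\Lambda = \operatorname{diag}(\ell_1^2, \ldots, \ell_n^2)$ denote the signal variance and the kernel widths respectively. As such, the joint distribution of the training labels and predicted value for a query point $x^*$ is given by

$$\begin{bmatrix} y \\ f_{\mathrm{GP}}(x^*) \end{bmatrix} \sim \mathcal{N}\!\left( \begin{bmatrix} m(X) \\ m(x^*) \end{bmatrix}, \begin{bmatrix} K + \sigma_\epsilon^2 I & k_* \\ k_*^\top & k_{**} \end{bmatrix} \right),$$

where $k_* = [k(x^{(1)}, x^*), \ldots, k(x^{(N)}, x^*)]^\top$ and $k_{**} = k(x^*, x^*)$. The mean and variance of the predictions can then be obtained by conditioning on the training data as

$$\mu(x^*) = m(x^*) + k_*^\top (K + \sigma_\epsilon^2 I)^{-1} (y - m(X)), \qquad \sigma^2(x^*) = k_{**} - k_*^\top (K + \sigma_\epsilon^2 I)^{-1} k_*.$$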
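To make the conditioning step concrete, the following is a minimal sketch of GP prediction for scalar inputs with a zero prior mean and a Gaussian kernel. It uses only the standard library and a small hand-written linear solver, so it is meant for illustration on a handful of points rather than as a practical implementation. The function names and default hyperparameter values are assumptions.

```python
import math
from typing import List, Tuple


def rbf(a: float, b: float, signal_var: float, width: float) -> float:
    """Gaussian (squared-exponential) kernel for scalar inputs."""
    return signal_var * math.exp(-0.5 * (a - b) ** 2 / width ** 2)


def solve(A: List[List[float]], b: List[float]) -> List[float]:
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - tail) / M[r][r]
    return z


def gp_predict(
    xs: List[float],
    ys: List[float],
    x_query: float,
    signal_var: float = 1.0,
    width: float = 1.0,
    noise_var: float = 1e-6,
) -> Tuple[float, float]:
    """Zero-mean GP posterior mean and latent variance at a query point."""
    n = len(xs)
    # Covariance of the noisy training targets: K + noise_var * I.
    K = [[rbf(xs[i], xs[j], signal_var, width) + (noise_var if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k_star = [rbf(x, x_query, signal_var, width) for x in xs]
    mean = sum(k * a for k, a in zip(k_star, solve(K, ys)))
    var = rbf(x_query, x_query, signal_var, width) - sum(
        k * v for k, v in zip(k_star, solve(K, k_star)))
    return mean, var


near_mean, near_var = gp_predict([0.0, 1.0, 2.0], [0.0, 1.0, 0.0], 1.0)
far_mean, far_var = gp_predict([0.0, 1.0, 2.0], [0.0, 1.0, 0.0], 100.0)
```

At a training input the prediction closely matches the observed label with near-zero variance, while far from the data it reverts to the prior mean with variance equal to the signal variance.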

Note that the signal variance, the kernel widths and the noise variance are hyperparameters which are typically optimized by maximizing the log-likelihood of the training data (e.g., [wang2019exact]).

### II-D Soft Actor-Critic (SAC)

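The RL problem considered is a policy search in an MDP defined by a tuple $(\mathcal{X}, \mathcal{U}, f, g, d, r)$, where $\mathcal{X} \subset \mathbb{R}^n$ denotes the state space, $\mathcal{U} \subset \mathbb{R}^m$ is the input space, $f$, $g$, $d$ are as in (1), and $r : \mathcal{X} \times \mathcal{U} \to \mathbb{R}$ is the reward associated with each transition. Note that the state transitions for the MDP are obtained by discretizing the dynamics (1) as

$$x_{t+1} = x_t + \big(f(x_t) + g(x_t)\,u_t + d(x_t)\big)\,\Delta t, \tag{9}$$

where $x_t$ denotes the state at time step $t$ and $\Delta t$ is the time step size. We note that this approximation has no effect on safety, which is ensured using RCBFs as discussed in Section III-B.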
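The discretization described above is a forward-Euler step of the disturbed control-affine dynamics. A minimal sketch follows, assuming the drift, input matrix and disturbance are supplied as plain Python callables; the function name `euler_step` and the double-integrator example are illustrative assumptions.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def euler_step(
    x: Vector,
    u: Vector,
    f: Callable[[Vector], Vector],
    g: Callable[[Vector], Matrix],
    d: Callable[[Vector], Vector],
    dt: float,
) -> List[float]:
    """One forward-Euler step: x + dt * (f(x) + g(x) u + d(x))."""
    drift, gain, disturbance = f(x), g(x), d(x)
    return [
        x[i] + dt * (drift[i]
                     + sum(gain[i][j] * u[j] for j in range(len(u)))
                     + disturbance[i])
        for i in range(len(x))
    ]


# Assumed toy system: a double integrator [position, velocity] with a
# constant acceleration disturbance of 0.1.
def drift(x: Vector) -> List[float]:
    return [x[1], 0.0]


def input_matrix(x: Vector) -> List[List[float]]:
    return [[0.0], [1.0]]


def disturbance(x: Vector) -> List[float]:
    return [0.0, 0.1]


x_next = euler_step([0.0, 1.0], [1.0], drift, input_matrix, disturbance, dt=0.1)
```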

Subsequently, we chose to utilize SAC as the underlying RL algorithm since it is state-of-the-art in terms of sample efficiency [haarnoja2018soft] and thus aligns with the objective of this work. SAC maximizes an entropy objective which is given by the following

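$$J(\pi) = \sum_{t} \mathbb{E}_{(x_t, u_t) \sim \rho_\pi} \big[ r(x_t, u_t) + \alpha\, \mathcal{H}(\pi(\cdot \mid x_t)) \big], \tag{10}$$

where $x_t$ and $u_t$ denote the state and action at timestep $t$ respectively. Moreover, $\rho_\pi$ denotes the distribution of states and actions induced by the policy $\pi$. Lastly, the term $\mathcal{H}(\pi(\cdot \mid x_t))$ is the entropy term which incentivizes exploration and $\alpha$ is a weighting parameter between the two terms in the objective.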

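The algorithm relies on an actor-critic approach with a fitted Q-function $Q_\theta$ parametrized by $\theta$ and a fitted actor $\pi_\phi$ parametrized by $\phi$. As such, the critic loss is given by

$$J_Q(\theta) = \mathbb{E}_{(x_t, u_t) \sim \mathcal{D}} \Big[ \tfrac{1}{2} \Big( Q_\theta(x_t, u_t) - \big( r(x_t, u_t) + \gamma_d\, \mathbb{E}_{x_{t+1}} \big[ V_{\bar{\theta}}(x_{t+1}) \big] \big) \Big)^2 \Big], \tag{11--12}$$

where

$$V_{\bar{\theta}}(x_t) = \mathbb{E}_{u_t \sim \pi_\phi} \big[ Q_{\bar{\theta}}(x_t, u_t) - \alpha \log \pi_\phi(u_t \mid x_t) \big], \tag{13}$$

$\gamma_d \in (0, 1)$ is the discount factor, $\mathcal{D}$ is the replay buffer and $\bar{\theta}$ denotes the target Q-network parameters. Finally, the policy loss is given by

$$J_\pi(\phi) = \mathbb{E}_{x_t \sim \mathcal{D}} \Big[ \mathbb{E}_{u_t \sim \pi_\phi} \big[ \alpha \log \pi_\phi(u_t \mid x_t) - Q_\theta(x_t, u_t) \big] \Big]. \tag{14}$$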
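For intuition, the SAC losses can be evaluated on a single sampled transition, replacing each expectation by a one-sample estimate. The sketch below follows the standard SAC formulation; the function names and the numeric values are illustrative assumptions, and a real implementation would average over minibatches and use neural networks for the critic and actor.

```python
def soft_value(q_target: float, log_prob: float, alpha: float) -> float:
    """One-sample estimate of the soft value: Q_target - alpha * log pi."""
    return q_target - alpha * log_prob


def critic_loss(
    q_pred: float,
    reward: float,
    next_q_target: float,
    next_log_prob: float,
    alpha: float,
    discount: float,
) -> float:
    """Squared soft Bellman residual for a single transition."""
    target = reward + discount * soft_value(next_q_target, next_log_prob, alpha)
    return 0.5 * (q_pred - target) ** 2


def actor_loss(q_pred: float, log_prob: float, alpha: float) -> float:
    """One-sample policy loss: alpha * log pi - Q."""
    return alpha * log_prob - q_pred


# Assumed numbers, chosen only to exercise the formulas.
value = soft_value(q_target=2.0, log_prob=-1.0, alpha=0.2)
loss_q = critic_loss(q_pred=1.0, reward=0.5, next_q_target=2.0,
                     next_log_prob=-1.0, alpha=0.2, discount=0.9)
loss_pi = actor_loss(q_pred=1.0, log_prob=-1.0, alpha=0.2)
```

A lower log-probability (a more exploratory action) raises the soft value and lowers the policy loss, which is how the entropy term rewards exploration.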

## III Main Approach

The main objective of this work is two-fold: combining RL with RCBFs for safety during training and improving the sample efficiency of the learning. We assume knowledge of $f$ and $g$ from (1) as well as an RCBF $h$ to keep the system safe. During training, the dynamics model is improved by collecting data during exploration to estimate the unknown disturbance function $d$. Subsequently, the RCBF is employed as a safety layer around the RL agent, minimally altering its proposed actions subject to safety constraints as pictorially depicted in Figure 1.

### III-A Disturbance Estimation

Although RCBFs are compatible with various data-driven methods, in this paper, we choose to focus on estimating the disturbance set $D$ using GPs as discussed in subsection II-C. This learning is achieved by obtaining a dataset with labels $y$ given by

$$y = \dot{\hat{x}} - f(x) - g(x)\,u,$$

where $\dot{\hat{x}}$ is the noisy measurement of the dynamics during exploration. Note that, in this case, $y \in \mathbb{R}^n$; therefore, we train one GP per dimension for a total of $n$ GP models.

In turn, we can obtain the disturbance estimate $D(x^*)$ for a query point $x^*$ as

$$D(x^*) = \big\{ w \in \mathbb{R}^n : \mu_i(x^*) - k_c\,\sigma_i(x^*) \leq w_i \leq \mu_i(x^*) + k_c\,\sigma_i(x^*),\ i = 1, \ldots, n \big\}, \tag{15}$$

where $\mu_i(x^*)$ and $\sigma_i(x^*)$ are the mean and standard-deviation predictions of the $i$th GP for query point $x^*$. The coefficient $k_c$ is a user-chosen confidence parameter (e.g., $k_c = 2$ achieves a confidence of approximately $95.5\%$). Note that this is a different representation of $D$ than the one used by Theorem 1 where the disturbance is the convex hull of a set of points. However, the desired form can be readily obtained by defining the $\psi_i$ as the vertices of the box in (15), i.e., the vectors generated by all combinations of its per-dimension lower and upper bounds.

### III-B RCBF-based Safety Layer

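In this subsection, we present the minimally invasive RCBF Quadratic Program (RCBF-QP) based control synthesis framework which guarantees the safety of the RL agent with dynamics as in (4). As depicted in Figure 1, given the state, the disturbance estimate from the GPs and the action of the RL agent, this safety layer minimally alters the action so as to ensure safety. Specifically, the RCBF compensation term $u_{\mathrm{RCBF}}$ is given by

$$
\begin{aligned}
(u_{\mathrm{RCBF}}(x), \epsilon) = \underset{u,\, \epsilon}{\arg\min}\ & \|u\|^2 + l\,\epsilon^2 \\
\text{s.t.}\ & L_f h(x) + L_g h(x)\,\big(u_{\mathrm{RL}}(x) + u\big) \\
& \geq -\gamma(h(x)) - \min_{i} \nabla h(x)^\top \psi_i(x) - \epsilon,
\end{aligned} \tag{16--18}
$$

where $u_{\mathrm{RL}}(x)$ is the output of the RL policy at state $x$, $\epsilon$ is a slack variable that serves to ensure feasibility of the QP, and $l$ is a large weighting term to minimize safety violations. As depicted in Figure 1, the final safe action taken in the environment is given by

$$u(x) = u_{\mathrm{RL}}(x) + u_{\mathrm{RCBF}}(x). \tag{19}$$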

Note that constraints can be modularly added to (16) to account for multiple RCBFs and actuator limits. Moreover, the need for a slack variable stems from cases where no control input can satisfy the control barrier certificate and a safety violation is inevitable. We refer the reader to [ames2019control, wang2017safety, squires2018constructive] for approaches on how to construct CBFs that guarantee the existence of solutions under actuator constraints which typically leverage a back-up safety maneuver.
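When there is a single RCBF constraint and no actuator limits, a QP of this kind can be solved in closed form, which makes the behavior of the safety layer easy to inspect. The sketch below assumes the objective $\|u\|^2 + l\,\epsilon^2$ with one robust barrier constraint and a box-shaped disturbance estimate built from per-dimension GP means and standard deviations. The function names, the default slack weight, and the toy numbers are all assumptions; with several constraints or input bounds a numerical QP solver would be needed instead.

```python
from itertools import product
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(ai * bi for ai, bi in zip(a, b))


def box_vertices(mean: Vector, std: Vector, k_c: float) -> List[List[float]]:
    """All corners of the box mean_i +/- k_c * std_i (2^n vertices)."""
    bounds = [(m - k_c * s, m + k_c * s) for m, s in zip(mean, std)]
    return [list(corner) for corner in product(*bounds)]


def rcbf_qp(
    grad_h: Vector,
    f_x: Vector,
    g_x: Sequence[Vector],       # n rows, m columns
    u_rl: Vector,
    vertices: Sequence[Vector],
    gamma_h: float,              # gamma(h(x))
    slack_weight: float = 1e4,
) -> Tuple[List[float], float]:
    """Closed-form minimizer of ||u||^2 + slack_weight * eps^2
    subject to the single constraint a^T u + eps >= b."""
    n, m = len(g_x), len(u_rl)
    # a is the transpose of L_g h(x).
    a = [sum(grad_h[i] * g_x[i][j] for i in range(n)) for j in range(m)]
    worst = min(dot(grad_h, psi) for psi in vertices)
    b = -gamma_h - dot(grad_h, f_x) - dot(a, u_rl) - worst
    if b <= 0.0:
        # The RL action already satisfies the constraint: no correction.
        return [0.0] * m, 0.0
    denom = dot(a, a) + 1.0 / slack_weight
    u = [b * a_j / denom for a_j in a]
    eps = b / (slack_weight * denom)
    return u, eps


# Toy scalar example (assumed): h(x) = 1 - x^2 at x = 0.9, gamma(r) = r,
# GP estimate mean 0 and std 0.05 with k_c = 2, RL action pushing outward.
verts = box_vertices([0.0], [0.05], 2.0)
u_rcbf, slack = rcbf_qp([-1.8], [0.0], [[1.0]], [1.0], verts, 0.19)
u_safe = 1.0 + u_rcbf[0]
```

In this example the correction nearly cancels the outward RL action, and the slack stays small because of the large weight, illustrating the minimally invasive behavior described above.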

As discussed in Section II-B, RCBFs are formulated with respect to the continuous-time differential inclusion (4). However, as highlighted in a theorem of [ames2016control], under certain Lipschitz-continuity assumptions, solving the RCBF-based optimization program at a sufficiently high frequency ensures safety of the system during inter-triggering intervals as well. As such, if $\Delta t$ from (9) is sufficiently small, we can query the RL policy and solve the RCBF-QP at the required rate for ensuring safety at all times. On the other hand, if $\Delta t$ is relatively large, safety can still be enforced by applying a zero-order hold on the RL action and subsequently solving the RCBF-QP at the necessary rate; however, in the interest of brevity, we leave these modifications to future work.

### III-C Coupling RL and RCBFs

Given the assumptions made so far, we propose two ways of improving the sample efficiency of the framework presented. Namely, we leverage the partially learned dynamics to generate synthetic data that can be added to the replay buffer and utilize a differentiable version of the safety layer which allows us to backpropagate through the QP. In what follows, we discuss both of these approaches and the final framework is presented in Algorithm 1.

#### III-C1 Differentiable Safety Layer

As discussed above, the final actions taken in the environment are the altered safe actions obtained from the RCBF layer. However, the updates in equations (11) and (14) are typically taken with respect to the potentially unsafe RL actions. This mismatch has also been highlighted in [cheng2019end] and can be exploited to further speed up the learning. Intuitively, the objective is to leverage the agent’s knowledge of the safety layer to aid its training.

To achieve this, we propose utilizing the differentiable optimization framework introduced in [amos2017optnet, agrawal2019differentiable], which enables the propagation of gradients through the QP as in (16) using a linear system formulated from the KKT conditions. As such, we can take the updates in (11) and (14) with respect to the safe action and backpropagate through the RCBF-QP in an end-to-end fashion. We note that, although this approach results in increased computation time because batches of QPs are being solved during the computation of the losses (11) and (14), the inference time remains the same.
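To illustrate what propagating gradients through the QP means, the sketch below hand-derives the Jacobian of the safe action with respect to the RL action for the special case of a single constraint, by differentiating the KKT conditions of the QP $\min \|u\|^2 + l\,\epsilon^2$ subject to $a^\top (u_{\mathrm{RL}} + u) + \epsilon \geq c$. Here $a$ stands for the transpose of the barrier's input Lie derivative and $c$ collects the remaining terms of the constraint; both names, and the closed-form special case itself, are assumptions made for illustration. The cited differentiable-optimization frameworks handle general QPs numerically.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(ai * bi for ai, bi in zip(a, b))


def safe_action(u_rl: Vector, a: Vector, c: float,
                slack_weight: float = 1e4) -> List[float]:
    """u_rl plus the closed-form QP correction for one constraint."""
    b = c - dot(a, u_rl)
    if b <= 0.0:
        return list(u_rl)
    denom = dot(a, a) + 1.0 / slack_weight
    return [u_i + b * a_i / denom for u_i, a_i in zip(u_rl, a)]


def safe_action_jacobian(u_rl: Vector, a: Vector, c: float,
                         slack_weight: float = 1e4) -> List[List[float]]:
    """d(safe action) / d(u_rl) obtained from the KKT conditions.

    Inactive constraint: identity. Active constraint: I - a a^T / denom,
    which removes (most of) the action component along the constraint normal.
    """
    m = len(u_rl)
    eye = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    if c - dot(a, u_rl) <= 0.0:
        return eye
    denom = dot(a, a) + 1.0 / slack_weight
    return [[eye[i][j] - a[i] * a[j] / denom for j in range(m)]
            for i in range(m)]


# Assumed toy values with an active constraint.
a_vec, c_val, u0 = [1.0, 2.0], 1.0, [0.0, 0.0]
jac = safe_action_jacobian(u0, a_vec, c_val)
```

The Jacobian shows why taking the updates with respect to the safe action is informative: when the constraint is active, gradient components along the constraint normal are almost entirely suppressed, so the policy is not pushed further toward actions the safety layer would override.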

#### III-C2 Model-Based RL

To increase the sample efficiency of the baseline RL-RCBF agent, we leverage the partially learned dynamics to generate short-horizon model rollouts as in [janner2019trust]. The use of short-horizon rollouts is motivated by the fact that model errors compound over long horizons, and thus the data collected from such rollouts would not benefit the learning and, in the worst case, can significantly impede the agent from learning an optimal policy.

Specifically, as highlighted in Algorithm 1, at each iteration, we sample initial states from transitions in the environment’s replay buffer and generate new, synthetic $k$-step rollouts using the dynamics prior and the disturbance estimates from the GPs. In turn, these newly generated transitions are added to another replay buffer which is used to train the agent.
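The rollout procedure can be sketched as follows. The function `model_rollouts` and its arguments are hypothetical: `policy` is assumed to return the final action applied to the model, `dynamics_step` is assumed to combine the dynamics prior with the GP disturbance estimate, and transitions are stored as `(state, action, reward, next_state)` tuples.

```python
import random
from typing import Callable, List, Tuple

Transition = Tuple[float, float, float, float]


def model_rollouts(
    env_buffer: List[Transition],
    policy: Callable[[float], float],
    dynamics_step: Callable[[float, float], float],
    reward_fn: Callable[[float, float], float],
    horizon: int,
    num_starts: int,
    rng: random.Random,
) -> List[Transition]:
    """Generate synthetic transitions from short model rollouts.

    Each rollout starts from a state that was actually visited in the
    environment, which limits how far model errors can compound.
    """
    synthetic: List[Transition] = []
    for _ in range(num_starts):
        x = rng.choice(env_buffer)[0]
        for _ in range(horizon):
            u = policy(x)
            x_next = dynamics_step(x, u)
            synthetic.append((x, u, reward_fn(x, u), x_next))
            x = x_next
    return synthetic


# Assumed scalar toy model: x_next = x + 0.1 * u with policy u = -x.
real_buffer = [(1.0, 0.0, 0.0, 1.0)]
model_buffer = model_rollouts(
    real_buffer,
    policy=lambda x: -x,
    dynamics_step=lambda x, u: x + 0.1 * u,
    reward_fn=lambda x, u: -x * x,
    horizon=3,
    num_starts=2,
    rng=random.Random(0),
)
```

Keeping the synthetic transitions in a separate buffer, as in the text, makes it possible to discard them when the model is updated without losing real environment data.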

## IV Experiments

To validate the efficacy of the proposed approach, we compare it against a baseline as well as a modified version of the approach from [cheng2019end] in two custom environments (the code for the experiments can be found at github.com/yemam3/SAC-RCBF). Specifically, the baseline approach utilizes SAC with RCBFs with neither model rollouts nor the differentiable QP layer, and the modified version of the approach from [cheng2019end] leverages RCBFs instead of the original discrete CBF formulation to permit the formulation of a safety constraint for non-polytopic obstacles. In both environments, the metric used for comparison is the agents’ performance (i.e., the cumulative sum of rewards over an episode) with respect to the number of episodes. Moreover, we note that no safety violations occurred throughout the experiments, thus validating the effectiveness of the RCBF-QP at keeping the system safe throughout training.

### IV-A Unicycle Environment

The first environment, inspired by the Safety Gym environment [ray2019benchmarking], consists of a unicycle robot tasked with reaching a goal location while avoiding obstacles. We chose to build a custom environment because explicit knowledge of $f$ and $g$ from (1) is needed for the formulation of the RCBF constraint, which is not the case for Safety Gym. Note that, in this section, the explicit dependence on time is dropped for brevity.

The robot is modelled using disturbed unicycle dynamics as

(20) |

where is the control input with and denoting the linear and angular velocities of the robot respectively, and is an unknown disturbance aimed at simulating a sloped surface. In turn, from the agent’s perspective, the dynamics are modelled through the following differential inclusion

(21) |

where the disturbance set is learned via GPs and given by

(22) |

To permit the formulation of a collision-avoidance constraint, similarly to [emam2019robust], we introduce the following output of the state. This technique considers a point at a distance ahead of the robot. In particular, define the output (i.e., the point ahead of the robot) as

(23) |

Differentiating along the disturbed unicycle dynamics yields that

(24) |

where

(25) |

Note that (24) can then be rewritten in a form suitable for Theorem 1 as where

(26) |

As such, we can encode a collision-avoidance constraint between the agent and a given obstacle using the following RCBF , where denotes the minimum desired distance between and the position of the obstacle .
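The look-ahead construction can be sketched in code under explicitly assumed notation: the state is `(x, y, theta)`, the input is `(v, omega)`, `l_p` is the look-ahead distance, and the barrier is taken to be half the squared distance to the obstacle minus half the squared minimum distance. This specific barrier form and all names are assumptions for illustration, and the disturbance is omitted.

```python
import math
from typing import List, Sequence


def lookahead_point(state: Sequence[float], l_p: float) -> List[float]:
    """Point at distance l_p ahead of the robot along its heading."""
    x, y, theta = state
    return [x + l_p * math.cos(theta), y + l_p * math.sin(theta)]


def lookahead_velocity(state: Sequence[float], u: Sequence[float],
                       l_p: float) -> List[float]:
    """Time derivative of the look-ahead point for undisturbed unicycle
    dynamics. Both inputs appear, so the output has relative degree one."""
    _, _, theta = state
    v, omega = u
    return [
        v * math.cos(theta) - l_p * omega * math.sin(theta),
        v * math.sin(theta) + l_p * omega * math.cos(theta),
    ]


def barrier(state: Sequence[float], obstacle: Sequence[float],
            l_p: float, min_dist: float) -> float:
    """Assumed barrier: 0.5 * (||p - p_obs||^2 - min_dist^2)."""
    px, py = lookahead_point(state, l_p)
    dx, dy = px - obstacle[0], py - obstacle[1]
    return 0.5 * (dx * dx + dy * dy - min_dist ** 2)


h_value = barrier((0.0, 0.0, 0.0), (1.0, 0.0), l_p=0.1, min_dist=0.5)
```

Using the look-ahead point rather than the robot center is what lets the angular velocity enter the barrier derivative directly, so a single first-order constraint can steer the robot away from the obstacle.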

Shown in Figure 3 is a plot of the episode rewards versus training episodes for the three approaches. We note that no safety violations occurred for any of the three methods. As shown in the figure, all three approaches converge to an optimal policy and the proposed approach performs best in terms of sample efficiency. In addition to the proposed approach leveraging model-based rollouts, this improvement in sample efficiency can also be attributed to the fact that the neural net used in the approach from [cheng2019end] is trained simultaneously with RL training, thus adding non-stationarity to the learning. In other words, the supervised learning component must first converge so that an optimal RL policy can be found with respect to it.

### IV-B Car Following

The second environment is based on [cheng2019end] and involves a chain of five cars driving in a lane each modelled as a disturbed double integrator

(27) |

where are the position, velocity and input acceleration of car respectively. In addition, is a disturbance unknown to the agent that is non-zero for all cars except car (i.e., ).

The agent only controls the acceleration of the fourth car and can observe the positions, velocities and accelerations of all five cars. The leading car’s behavior is aimed at simulating traffic through a sinusoidal acceleration . Cars and behave as follows

where is the desired velocity and and are the velocity and braking gains respectively. Lastly, the last car in the chain (i.e., car ) has the following control input

As such, the objective of the RL agent is to minimize the total control effort, which is captured through the reward function , while avoiding collisions. Moreover, two RCBFs are formulated which enforce a minimum distance between the cars three and four, and four and five respectively where is the minimum distance. We note that since and have relative degree , a cascaded RCBF formulation as in [notomista2020persistification] is leveraged to obtain the collision avoidance constraints. Specifically, we define and enforce the forward invariance of the super-zero level sets of through RCBFs using a similar procedure to the one described above. The theoretical particulars of cascaded RCBFs will be discussed in future work, and we present the results without proof here.

Similarly to the previous environment, shown in Figure 4 is a plot of the episode rewards versus training episodes for the three different approaches. As shown in the figure, the proposed approach learns a slightly better policy compared to the modified approach from [cheng2019end] which did not fully converge at the end of the episodes, while being significantly more sample efficient.

On the other hand, the baseline fails to learn a meaningful policy altogether, which can be attributed to the fact that the behavior of the safety layer (i.e., the RCBF-QP) is not accounted for and is thus treated as part of the unknown transition function in the MDP with respect to the RL agent. As such, a large change in the QP’s output is equivalent to a sharp transition in the dynamics which is difficult to learn. This signifies that explicitly accounting for the safety layer’s behavior is not only critical for sample-efficient training but also for finding an optimal policy.

## V Conclusion

In this paper, we introduce a novel framework that combines RL with an RCBF-based layer to enforce safety during training. Moreover, we empirically demonstrate that leveraging the dynamics prior and the learned disturbances to generate model rollouts, as well as a differentiable version of the safety layer, improves both the sample efficiency and steady-state performance during training.

## Acknowledgment

The authors would like to thank Dr. Samuel Coogan, Dr. Gennaro Notomista and Andrew Szot for helpful discussions.
