1 Introduction
Researchers have demonstrated that reinforcement learning (RL) can solve complex tasks such as Atari games (mnih2015human), Go (silver2017mastering), dexterous manipulation tasks (akkaya2019solving), and many more robotics tasks in simulated environments (haarnoja2018soft). However, deploying RL algorithms to real-world problems still faces the hurdle that they require many unsafe environment interactions. For example, a robot's unsafe environment interactions include falling and hitting other objects, which incur physical damage that is costly to repair. Many recent deep RL works reduce the number of environment interactions significantly (e.g., see haarnoja2018soft; fujimoto2018addressing; janner2019trust; dong2020expressivity; luo2019algorithmic; chua2018deep and references therein), but the number of unsafe interactions is still prohibitive for safety-critical applications such as robotics, medicine, or autonomous vehicles (berkenkamp2017safe).
Reducing the number of safety violations may not be sufficient for these safety-critical applications; we may have to eliminate them. This paper explores the possibility of safe RL algorithms with zero safety violations at both training time and test time. We also consider the challenging setting where we are only given a safe but trivial-reward initial policy.
A recent line of work on safe RL designs novel actor-critic-based algorithms under the constrained policy optimization formulation (Thananjeyan et al., 2021; srinivasan2020learning; bharadhwaj2020conservative; yang2020accelerating; stooke2020responsive). These algorithms significantly reduce the number of training-time safety violations. However, they fundamentally learn the safety constraints by contrasting safe and unsafe trajectories. In other words, because the safety set is only specified through safety costs that are observed postmortem, the algorithms only learn the concept of safety through seeing unsafe trajectories. Therefore, these algorithms cannot achieve zero training-time violations. For example, even for the simple 2D inverted pendulum environment, these methods still require at least 80 unsafe trajectories (see Figure 2 in Section 6).
Another line of work utilizes ideas from control theory and model-based approaches (Cheng et al., 2019; berkenkamp2017safe; Taylor et al., 2019; Zeng et al., 2020). These works propose sufficient conditions involving certain Lyapunov functions or control barrier functions that can certify the safety of a subset of states or policies (Cheng et al., 2019). These conditions assume access to calibrated dynamical models. They can, in principle, permit safety guarantees without visiting any unsafe states because, with the calibrated dynamics model, we can foresee future danger. However, control barrier functions are often nontrivially handcrafted with prior knowledge of the environments (ames2019control; Nguyen and Sreenath, 2016).
This work aims to design model-based safe RL algorithms that empirically achieve zero training-time safety violations by learning the barrier certificates iteratively. We present the algorithm Co-trained Barrier Certificate for Safe RL (CRABS), which alternates between learning barrier certificates that certify the safety of larger regions of states, optimizing the policy, collecting more data within the certified states, and refining the learned dynamics model with data.
We note that our goal is not to provide end-to-end formal guarantees of safety, which might be extremely challenging: nonconvex minimax optimizations and uncertainty quantification for neural networks are used as sub-procedures, and it is challenging to have worst-case guarantees for them.
The work of richards2018lyapunov is a closely related prior result, which learns a Lyapunov function given a fixed dynamics model via discretization of the state space. Our work significantly extends it with three algorithmic innovations. First, we use adversarial training to learn the certificates, which avoids discretizing the state space and can potentially work with higher-dimensional state spaces than the two-dimensional problems in richards2018lyapunov. Second, we do not assume a given, globally accurate dynamics model; instead, we learn the dynamics model from safe explorations. We achieve this by co-learning the certificates, dynamics model, and policy to iteratively grow the certified region and improve the dynamics model while still maintaining zero violations. Third, richards2018lyapunov only certifies the safety of some states and does not involve learning a policy. In contrast, our work learns a policy and tailors the certificates to the learned policies. In particular, our certificates aim to certify only states near the trajectories of the current and past policies, which allows us to not waste the expressive power of the certificate parameterization on irrelevant low-reward states.
We evaluate our algorithms on a suite of tasks, including a few where achieving high rewards requires careful exploration near the safety boundary. For example, in the Swing environment, the goal is to swing a rod with the largest possible angle under the safety constraint that the angle is less than 90°. We show that our method reduces the number of safety violations from several hundred to zero on these tasks.
2 Setup and Preliminaries
2.1 Problem Setup
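We consider the standard RL setup with an infinite-horizon deterministic Markov decision process (MDP). An MDP is specified by a tuple $(\mathcal{S}, \mathcal{A}, r, \gamma, \mu, T)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, $0 \le \gamma < 1$ is the discount factor, $\mu$ is the distribution of the initial state, and $T: \mathcal{S} \times \mathcal{A} \to \mathcal{S}$ is the deterministic dynamics model. Let $\Delta(\mathcal{X})$ denote the family of distributions over a set $\mathcal{X}$. The expected discounted total reward of a policy $\pi: \mathcal{S} \to \Delta(\mathcal{A})$ is defined as
$$J(\pi) = \mathbb{E}\left[\sum_{i=0}^{\infty} \gamma^{i} r(s_i, a_i)\right],$$
where $s_0 \sim \mu$, $a_i \sim \pi(s_i)$, and $s_{i+1} = T(s_i, a_i)$ for $i \ge 0$. The goal is to find a policy $\pi$ which maximizes $J(\pi)$.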
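As a small illustration of the definition above, the following sketch estimates the discounted total reward of a deterministic policy by truncating the infinite sum at a finite horizon. The 1-D system, the policy, and the reward are hypothetical placeholders chosen only so that the result can be checked by hand.

```python
def rollout_return(dynamics, policy, reward, s0, gamma, horizon):
    """Truncated discounted return of one deterministic rollout from s0."""
    total, discount, s = 0.0, 1.0, s0
    for _ in range(horizon):
        a = policy(s)
        total += discount * reward(s, a)
        discount *= gamma
        s = dynamics(s, a)
    return total

# Hypothetical 1-D system: the policy halves the state at every step.
dynamics = lambda s, a: s + a
policy = lambda s: -0.5 * s
reward = lambda s, a: -abs(s)

value = rollout_return(dynamics, policy, reward, s0=1.0, gamma=0.5, horizon=3)
print(value)  # -(1 + 0.5 * 0.5 + 0.25 * 0.25) = -1.3125
```

Because the discount factor is below 1, the truncation error shrinks geometrically with the horizon.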
Let $\mathcal{S}_{\text{unsafe}} \subset \mathcal{S}$ be the set of unsafe states specified by the user. The user-specified safe set $\mathcal{S}_{\text{safe}}$ is defined as $\mathcal{S} \setminus \mathcal{S}_{\text{unsafe}}$. A state $s$ is (user-specified) safe if $s \in \mathcal{S}_{\text{safe}}$. A trajectory is safe if and only if all the states in the trajectory are safe. An initial state drawn from $\mu$ is assumed to be safe with probability 1. We say a deterministic policy $\pi$ is safe starting from state $s$ if the infinite-horizon trajectory obtained by executing $\pi$ starting from $s$ is safe. We also say a policy $\pi$ is safe if it is safe starting from an initial state drawn from $\mu$ with probability 1. A major challenge toward safe RL is the existence of irrecoverable states, which are currently safe but will eventually lead to unsafe states regardless of future actions. We define the notion formally as follows.
Definition 1. A state $s$ is viable iff there exists a policy $\pi$ such that $\pi$ is safe starting from $s$, that is, executing $\pi$ starting from $s$ for infinite steps never leads to an unsafe state. A user-specified safe state that is not viable is called an irrecoverable state.
We remark that unlike srinivasan2020learning; roderick2020provably, we do not assume all safe states are viable. We rely on the extrapolation and calibration of the dynamics model to foresee risks. A calibrated dynamics model $\hat{T}$ predicts a confidence region of states $\hat{T}(s, a) \subseteq \mathcal{S}$, such that for any state $s$ and action $a$, we have $T(s, a) \in \hat{T}(s, a)$.
2.2 Preliminaries on Barrier Certificate
Barrier certificates are powerful tools to certify the stability of a dynamical system. Barrier certificates are often applied to continuous-time dynamical systems, but here we describe the discrete-time version on which our work is based. We refer the readers to Prajna and Jadbabaie (2004); Prajna and Rantzer (2005) for more information about continuous-time barrier certificates.
Given a discrete-time dynamical system $s_{t+1} = f(s_t)$ without control starting from $s_0$, a function $h: \mathcal{S} \to \mathbb{R}$ is a barrier certificate if for any $s \in \mathcal{S}$ such that $h(s) \ge 0$, $h(f(s)) \ge 0$. Zeng et al. (2020) considers a more restrictive requirement: for any state $s \in \mathcal{S}$, $h(f(s)) \ge \alpha h(s)$ for a constant $0 \le \alpha < 1$.
It is easy to use a barrier certificate $h$ to show the stability of the dynamical system. Let $C_h = \{s : h(s) \ge 0\}$ be the superlevel set of $h$. The requirement of barrier certificates directly translates to the requirement that if $s \in C_h$, then $f(s) \in C_h$. This property of $C_h$, which is known as the forward-invariant property, is especially useful in safety-critical settings: given a barrier certificate $h$ such that $C_h$ does not contain unsafe states and contains the initial state $s_0$, it is guaranteed that $C_h$ contains the entire trajectory of states, which are therefore safe.
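The forward-invariance condition can be probed numerically on sampled states. The sketch below is only an illustration on a hypothetical 1-D system with a handcrafted candidate certificate; checking a finite grid can find counterexamples but does not prove that the condition holds everywhere.

```python
def violates_forward_invariance(h, f, states):
    """Return the sampled states s with h(s) >= 0 but h(f(s)) < 0."""
    return [s for s in states if h(s) >= 0 and h(f(s)) < 0]

h = lambda s: 1.0 - s * s          # superlevel set is the interval [-1, 1]
contracting = lambda s: 0.5 * s    # maps the superlevel set into itself
expanding = lambda s: 1.5 * s      # leaves the set from states with |s| > 2/3

grid = [i / 100.0 for i in range(-200, 201)]
print(len(violates_forward_invariance(h, contracting, grid)))   # 0
print(len(violates_forward_invariance(h, expanding, grid)) > 0)  # True
```

For the contracting map the candidate passes the sampled check, while for the expanding map every counterexample lies in the outer part of the interval.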
Finding barrier certificates requires a known dynamics model $f$, which often can only be approximated in practice. This issue can be resolved by using a well-calibrated dynamics model $\hat{f}$, which predicts a confidence interval containing the true output. When a calibrated dynamics model $\hat{f}$ is used, we require that for any $s \in C_h$, $\min_{s' \in \hat{f}(s)} h(s') \ge 0$.
Control barrier functions (ames2019control) are extensions of barrier certificates to the control setting. That is, control barrier functions are often used to find an action that meets the safety requirement instead of certifying the stability of a closed dynamical system. In this work, we simply use barrier certificates because in Section 3, we view the policy and the calibrated dynamics model as a whole closed dynamical system whose stability we are going to certify.
3 Learning Barrier Certificates via Adversarial Training
This section describes an algorithm that learns a barrier certificate for a fixed policy $\pi$ under a calibrated dynamics model $\hat{T}$. Concretely, to certify that a policy $\pi$ is safe, we aim to learn a (discrete-time) barrier certificate $h$ that satisfies the following three requirements.

R.1. For $s_0 \sim \mu$, $h(s_0) \ge 0$ with probability 1.
R.2. For every $s \in \mathcal{S}_{\text{unsafe}}$, $h(s) < 0$.
R.3. For any $s$ such that $h(s) \ge 0$, $\min_{s' \in \hat{T}(s, \pi(s))} h(s') \ge 0$.

Requirements 1 and 3 (R.1 and R.3) guarantee by simple induction that the policy $\pi$ will never leave the set $C_h = \{s : h(s) \ge 0\}$. Moreover, requirement 2 (R.2) guarantees that $C_h$ only contains safe states, and therefore the policy never visits unsafe states.
In the rest of the section, we aim to design and train such a barrier certificate $h_\phi$ parametrized by a neural network with parameters $\phi$.
$h_\phi$ parametrization.
The three requirements for a barrier certificate are challenging to enforce simultaneously with constrained optimization involving a neural network parameterization. Instead, we parametrize $h_\phi$ with requirements 1 and 2 built in, such that for any $\phi$, $h_\phi$ always satisfies requirements 1 and 2.
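We assume the initial state $s_0$ is deterministic. To capture the known user-specified safety set, we first handcraft a continuous function $\mathcal{B}_{\text{unsafe}}: \mathcal{S} \to \mathbb{R}_{\ge 0}$ satisfying $\mathcal{B}_{\text{unsafe}}(s) \approx 0$ for typical $s \in \mathcal{S}_{\text{safe}}$ and $\mathcal{B}_{\text{unsafe}}(s) > 1$ for any $s \in \mathcal{S}_{\text{unsafe}}$; it can be seen as a smoothed indicator of $\mathcal{S}_{\text{unsafe}}$. (The function $\mathcal{B}_{\text{unsafe}}$ is called a barrier function for the user-specified safe set in the optimization literature. Here we do not use this term to avoid confusion with the barrier certificate.) The construction of $\mathcal{B}_{\text{unsafe}}$ does not need prior knowledge of irrecoverable states, but only the user-specified safety set $\mathcal{S}_{\text{safe}}$. To further encode the user-specified safety set into $h_\phi$, we choose $h_\phi$ to be of the form $h_\phi(s) = 1 - \mathrm{Softplus}(f_\phi(s) - f_\phi(s_0)) - \mathcal{B}_{\text{unsafe}}(s)$, where $f_\phi$ is a neural network and $\mathrm{Softplus}(x) = \log(1 + e^{x})$.
Because $s_0$ is safe and $\mathcal{B}_{\text{unsafe}}(s_0) \approx 0$, $h_\phi(s_0) \approx 1 - \mathrm{Softplus}(0) > 0$. Therefore $h_\phi$ satisfies requirement 1. Moreover, for any $s \in \mathcal{S}_{\text{unsafe}}$, we have $h_\phi(s) < 1 - \mathcal{B}_{\text{unsafe}}(s) < 0$, so $h_\phi$ in our parametrization satisfies requirement 2 by design.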
The parameterization can also be extended to multiple initial states. For example, if the initial is sampled from a distribution that is supported on a bounded set, and suppose that we are given the indicator function for the support of (that is, for any , and otherwise). Then, the parametrization of can be . For simplicity, we focus on the case where there is a single initial state.
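The idea of building requirements 1 and 2 into the parametrization can be checked on a toy example. The sketch below assumes a softplus-based form, one minus a softplus of the network output relative to its value at the initial state, minus a smoothed indicator of the unsafe set. The 1-D unsafe set, the function `b_unsafe`, and the stand-ins for the neural network are hypothetical choices for illustration.

```python
import math

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def b_unsafe(s, limit=1.5):
    # Hypothetical smoothed indicator of the unsafe set {|s| >= limit}:
    # close to 0 well inside the safe set and above 1 on the unsafe set.
    return 1.01 * (abs(s) / limit) ** 8

def make_certificate(f, s0):
    # h(s) = 1 - softplus(f(s) - f(s0)) - b_unsafe(s); f stands in for a network.
    return lambda s: 1.0 - softplus(f(s) - f(s0)) - b_unsafe(s)

s0 = 0.0
networks = (lambda s: 3.0 * math.sin(5.0 * s) - 2.0 * s, lambda s: -100.0 * s * s)
for f in networks:
    h = make_certificate(f, s0)
    assert h(s0) > 0                                      # requirement 1
    assert all(h(s) < 0 for s in (-3.0, -1.5, 1.5, 2.0))  # requirement 2
```

Both checks pass for either stand-in network, which reflects that under this form the two requirements do not depend on the network parameters.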
Training barrier certificates.
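We now move on to training $\phi$ to satisfy requirement 3. Let
$$U(s, a, h) := \max_{s' \in \hat{T}(s, a)} -h(s'). \quad (1)$$
Then, requirement 3 requires $U(s, \pi(s), h_\phi) \le 0$ for any $s \in C_{h_\phi}$. The constraint in requirement 3 naturally leads us to formulate the problem as a min-max problem. Define our objective function to be
$$C^{*}(h_\phi, U, \pi) := \max_{s \in C_{h_\phi}} U(s, \pi(s), h_\phi), \quad (2)$$
and we want to minimize $C^{*}$ w.r.t. $\phi$:
$$\min_{\phi} C^{*}(h_\phi, U, \pi). \quad (3)$$
Our goal is to ensure the minimum value is less than 0. We use gradient descent to solve the optimization problem, hence we need an explicit form of $\nabla_\phi C^{*}$. Let $L(s, \lambda; \phi)$ be the Lagrangian for the constrained optimization problem in $C^{*}$, where $\lambda \ge 0$ is the Lagrangian multiplier of the constraint $s \in C_{h_\phi}$:
$$L(s, \lambda; \phi) = U(s, \pi(s), h_\phi) + \lambda h_\phi(s).$$
By the Envelope Theorem (see Section 6.1 in Carter (2001)), we have
$$\nabla_\phi C^{*} = \nabla_\phi U(s^{*}, \pi(s^{*}), h_\phi) + \lambda^{*} \nabla_\phi h_\phi(s^{*}),$$
where $s^{*}$ and $\lambda^{*}$ are the optimal solution for $\max_{s} \min_{\lambda \ge 0} L(s, \lambda; \phi)$. Once $s^{*}$ is known, the optimal Lagrangian multiplier $\lambda^{*}$ can be given by KKT conditions:
$$\lambda^{*} = \begin{cases} 0 & \text{if } h_\phi(s^{*}) > 0, \\ \dfrac{\lVert \nabla_s U(s^{*}, \pi(s^{*}), h_\phi) \rVert_2}{\lVert \nabla_s h_\phi(s^{*}) \rVert_2} & \text{if } h_\phi(s^{*}) = 0. \end{cases}$$
Now we move on to the efficient calculation of $s^{*}$.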
Computing the adversarial state.
Because the maximization problem with respect to the state is nonconcave, there could be multiple local maxima. In practice, we find that it is more efficient and reliable to compute the gradient at multiple local maxima and then average the gradients.
Solving for the adversarial state is highly nontrivial, as it is a nonconcave optimization problem with a constraint. To deal with the constraint, we introduce a Lagrangian multiplier and optimize the Lagrangian w.r.t. the state without any constraints. However, it is still very time-consuming to solve an optimization problem independently at each step. Based on the observation that the parameters of the barrier certificate do not change much after one gradient step, we can use the optimal solution from the last optimization problem as the initial solution for the next one, which naturally leads to the idea of maintaining a set of candidate adversarial states during training.
We use the Metropolis-adjusted Langevin algorithm (MALA) to maintain a set of candidates, which are supposed to be samples from a distribution that favors states with a large objective value. A temperature parameter controls how strongly we focus on such samples. Although the indicator function of the constraint always has zero gradient, it is still useful in the sense that MALA will reject candidates that violate the constraint. A detailed description of MALA is given in Appendix C.
We choose MALA over gradient descent because the maintained candidates are more diverse, approximate local maxima. If we use gradient descent to find the adversarial state, then multiple runs of GD likely arrive at the same local maximum, so that we lose the parallelism from simultaneously working with multiple local maxima. MALA avoids this issue by its intrinsic stochasticity, which can also be controlled by adjusting its hyperparameter.
We summarize our algorithm for training barrier certificates in Algorithm 1 (which contains optional regularization that will be discussed in Section 4.2). At Line 2, the initialization of the candidates is arbitrary, as long as it has some stochasticity.
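To make the role of MALA concrete, here is a minimal 1-D sketch of one MALA update and of maintaining a set of candidates across iterations. It is not the procedure of Appendix C: the target density, the temperature `tau`, the step size, and the constraint set `in_set` are hypothetical stand-ins chosen so that the objective has two local maxima.

```python
import math
import random

def mala_step(s, log_p, grad_log_p, in_set, step, rng):
    """One MALA step for the density exp(log_p(s)) restricted to in_set."""
    def drift(x):
        return x + step * grad_log_p(x)

    def log_q(y, x):
        # Log density (up to a constant) of proposing y from x.
        return -((y - drift(x)) ** 2) / (4.0 * step)

    proposal = drift(s) + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
    if not in_set(proposal):
        return s  # the indicator gives zero density, so always reject
    log_alpha = (log_p(proposal) + log_q(s, proposal)
                 - log_p(s) - log_q(proposal, s))
    accept = rng.random() < math.exp(min(0.0, log_alpha))
    return proposal if accept else s

# Hypothetical objective with two local maxima at s = -1 and s = 1.
tau = 5.0                                          # temperature
log_p = lambda s: -tau * (s * s - 1.0) ** 2
grad_log_p = lambda s: -4.0 * tau * s * (s * s - 1.0)
in_set = lambda s: abs(s) <= 1.2                   # stand-in for h(s) >= 0

rng = random.Random(0)
candidates = [rng.uniform(-0.5, 0.5) for _ in range(16)]
for _ in range(500):
    candidates = [mala_step(s, log_p, grad_log_p, in_set, 0.01, rng)
                  for s in candidates]
```

Because proposals outside the constraint set are always rejected, every candidate stays inside the set, while the injected noise keeps the candidates from collapsing onto a single maximum.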
4 CRABS: Co-trained Barrier Certificate for Safe RL
In this section, we present our main algorithm, Co-trained Barrier Certificate for Safe RL (CRABS), shown in Algorithm 2, which iteratively co-trains barrier certificates, the policy, and the dynamics model, using the algorithm in Section 3. In addition to parametrizing $h$ by $\phi$, we further parametrize the policy $\pi$ by $\theta$ and the calibrated dynamics model $\hat{T}$ by $\omega$. CRABS alternates between training a barrier certificate that certifies the policy $\pi_\theta$ w.r.t. a calibrated dynamics model $\hat{T}_\omega$ (Line 5), collecting data safely using the certified policy (Line 3, details in Section 4.1), learning a calibrated dynamics model (Line 4, details in Section 4.3), and training a policy with the constraint of staying in the superlevel set of the barrier function (Line 6, details in Section 4.4).
In the following subsections, we discuss how we implement each line in detail.
4.1 Safe Exploration with Certified Safeguard Policy
Safe exploration is challenging because it is difficult to detect irrecoverable states. The barrier certificate is designed to address this: a policy $\pi$ certified by some $h$ is guaranteed to stay within $C_h$ and therefore can be used for collecting data. However, we may need more diversity in the collected data beyond what can be offered by the deterministic certified policy $\pi$. Thanks to the contraction property (requirement 3), we in fact know that any exploration policy $\pi^{\text{expl}}$ within the superlevel set $C_h$ can be made safe with $\pi$ being a safeguard policy: we can first try actions from $\pi^{\text{expl}}$ and see if they stay within the viable subset $C_h$, and if none does, invoke the safeguard policy $\pi$. Algorithm 3 formally describes this simple procedure that makes any exploration policy safe. By a simple induction, one can see that the policy defined in Algorithm 3 maintains that all the visited states lie in $C_h$. The main idea of Algorithm 3 is also widely used in policy shielding (Alshiekh et al., 2018; Jansen et al., 2018; Anderson et al., 2020), as the policy $\pi$ shields the policy $\pi^{\text{expl}}$ in $C_h$.
The safeguard policy $\pi$ is supposed to safeguard the exploration. However, activating the safeguard too often is undesirable, as it only collects data from $\pi$, so there will be little exploration. To mitigate this issue, we often choose $\pi^{\text{expl}}$ to be a noisy version of $\pi$ so that $\pi^{\text{expl}}$ will be roughly safe by itself. Moreover, the safeguard policy $\pi$ will be trained by optimizing the reward function, as shown in the next subsections. Therefore, a noisy version of $\pi$ will explore the high-reward region and avoid unnecessary exploration.
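Following haarnoja2018soft, the policy $\pi$ is parametrized as $\pi(s) = \tanh(\mu_\theta(s))$, and the proposal exploration policy $\pi^{\text{expl}}$ is parametrized as $\pi^{\text{expl}}(s) = \tanh(\mu_\theta(s) + \sigma_\theta(s)\zeta)$ for $\zeta \sim \mathcal{N}(0, I)$, where $\mu_\theta$ and $\sigma_\theta$ are two neural networks. Here the $\tanh$ is applied to squash the outputs to the action set $[-1, 1]$.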
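The safeguarding procedure described in this subsection can be sketched in a few lines. This is an illustration in the spirit of Algorithm 3, not a transcription of it: the 1-D system, the certificate, the noise scale, the number of trials, and the three-point confidence region are hypothetical.

```python
import random

def safeguarded_action(s, explore, safeguard, predict, h, n_trials, rng):
    """Try exploratory actions; fall back to the certified policy if none is safe."""
    for _ in range(n_trials):
        a = explore(s, rng)
        # Accept only if every state in the confidence region stays certified.
        if all(h(s_next) >= 0 for s_next in predict(s, a)):
            return a
    return safeguard(s)

# Hypothetical 1-D system s' = s + a, known up to an error of 0.05.
h = lambda s: 1.0 - s * s
safeguard = lambda s: -0.5 * s
explore = lambda s, rng: safeguard(s) + rng.gauss(0.0, 0.5)
predict = lambda s, a: [s + a - 0.05, s + a, s + a + 0.05]

rng = random.Random(0)
s, visited = 0.0, []
for _ in range(200):
    a = safeguarded_action(s, explore, safeguard, predict, h, 5, rng)
    s = s + a            # true dynamics, inside the predicted region
    visited.append(s)
```

In this toy setting the safeguard maps any certified state to a certified state under every predicted outcome, so by induction all visited states remain in the superlevel set regardless of how noisy the exploration policy is.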
4.2 Regularizing Barrier Certificates
The quality of exploration is directly related to the quality of policy optimization. In our case, the exploration is only within the learned viable set $C_{h_\phi}$, and it will be hindered if $C_{h_\phi}$ is too small or does not grow during training. To ensure a large and growing viable subset $C_{h_\phi}$, we encourage the volume of $C_{h_\phi}$ to be large by adding a regularization term
$$\mathrm{Reg}(\phi; \hat{h}) = \mathbb{E}_{s \in C_{\hat{h}}}\left[\mathrm{relu}(-h_\phi(s))\right].$$
Here $\hat{h}$ is the barrier certificate obtained in the previous epoch. In the ideal case when $\mathrm{Reg}(\phi; \hat{h}) = 0$, we have $C_{\hat{h}} \subseteq C_{h_\phi}$, that is, the new viable subset $C_{h_\phi}$ is at least as large as the reference set $C_{\hat{h}}$ (which is the viable subset in the previous epoch). We compute the expectation over $C_{\hat{h}}$ approximately by using the set of candidate states maintained by MALA.
In summary, to learn in CRABS, we minimize the following objective (for a small positive constant ) over as shown in Algorithm 1:
(4) 
We remark that the regularization is not the only reason why the viable set can grow. When the dynamics model becomes more accurate as we collect more data, the viable set will also grow. This is because an inaccurate dynamics model will typically make the viable set smaller: it is harder to satisfy requirement 3 when the confidence region in the constraint contains many possible states. Conversely, shrinking the confidence region makes it easier to certify more states.
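A regularizer of this kind can be estimated from a finite set of candidate states. The sketch below assumes a hinge penalty on states that the previous certificate accepts but the new one rejects; the certificates and the candidate grid are hypothetical 1-D examples rather than learned networks or MALA samples.

```python
def containment_penalty(h_new, h_prev, candidates):
    """Average hinge penalty on states certified by h_prev but not by h_new."""
    inside_prev = [s for s in candidates if h_prev(s) >= 0]
    if not inside_prev:
        return 0.0
    return sum(max(0.0, -h_new(s)) for s in inside_prev) / len(inside_prev)

h_prev = lambda s: 1.0 - s * s       # previous viable set [-1, 1]
h_larger = lambda s: 4.0 - s * s     # contains the previous set
h_smaller = lambda s: 0.25 - s * s   # drops part of the previous set

candidates = [i / 10.0 for i in range(-20, 21)]
print(containment_penalty(h_larger, h_prev, candidates))       # 0.0
print(containment_penalty(h_smaller, h_prev, candidates) > 0)  # True
```

The penalty vanishes exactly when every sampled state of the previous set is still certified, which is the sampled analogue of the new viable set containing the old one.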
4.3 Learning a Calibrated Dynamics Model
It is a challenging open question to obtain a dynamics model (or any supervised learning model) that is theoretically well-calibrated, especially under domain shift (zhao2020individual). In practice, we heuristically approximate a calibrated dynamics model by learning an ensemble of probabilistic dynamics models, following common practice in RL (yu2020mopo; janner2019trust; chua2018deep). We learn $K$ probabilistic dynamics models $f_{\omega_1}, \dots, f_{\omega_K}$ using the data in the replay buffer. (Interestingly, prior work shows that an ensemble of probabilistic models can still capture the error of estimating a deterministic ground-truth dynamics model (janner2019trust; chua2018deep).) Each probabilistic dynamics model $f_{\omega_i}$ outputs a Gaussian distribution $\mathcal{N}(\mu_{\omega_i}(s, a), \mathrm{diag}(\sigma^2_{\omega_i}(s, a)))$ with diagonal covariances, where $\mu_{\omega_i}$ and $\sigma_{\omega_i}$ are parameterized by neural networks. Given a replay buffer $\mathcal{D}$, the objective for a probabilistic dynamics model $f_{\omega_i}$ is to minimize the negative log-likelihood:
$$\mathcal{L}(\omega_i) = -\mathbb{E}_{(s, a, s') \sim \mathcal{D}}\left[\log f_{\omega_i}(s' \mid s, a)\right]. \quad (5)$$
The only difference in the training procedure of these probabilistic models is the randomness in the initialization and mini-batches. We simply aggregate the means of all learned dynamics models as a coarse approximation of the confidence region, i.e., $\hat{T}(s, a) = \{\mu_{\omega_i}(s, a)\}_{i \in [K]}$.
We note that we implicitly rely on the neural networks for the dynamics model to extrapolate to unseen states. However, local extrapolation suffices. The dynamics models' accuracy affects the size of the viable set: the more accurate the model is, the larger the viable set is likely to be. In each epoch, we rely on the additional data collected and the model's extrapolation to reduce the errors of the learned dynamics model on unseen states that are near the seen states, so that the learned viable set can grow in the next epoch. Indeed, in Section 6 (Figure 3) we show that the viable set grows gradually as the error and uncertainty of the models decrease over epochs.
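The two ingredients of this subsection, a Gaussian negative log-likelihood with diagonal covariance and a confidence region formed from the ensemble means, can be sketched as follows. The ensemble members here are hypothetical hand-written functions returning a mean and a standard deviation, not trained networks.

```python
import math

def gaussian_nll(target, mean, std):
    """Negative log-likelihood of target under N(mean, diag(std**2))."""
    nll = 0.0
    for y, m, s in zip(target, mean, std):
        nll += 0.5 * math.log(2.0 * math.pi * s * s) + (y - m) ** 2 / (2.0 * s * s)
    return nll

def confidence_region(models, state, action):
    """Collect the predicted mean of every ensemble member."""
    return [model(state, action)[0] for model in models]

# Hypothetical ensemble: each member predicts s + a plus its own small bias.
models = [lambda s, a, b=b: ([s[0] + a[0] + b], [0.1]) for b in (-0.02, 0.0, 0.03)]

region = confidence_region(models, [1.0], [0.5])
mean, std = models[1]([1.0], [0.5])
print(len(region))                      # 3
print(gaussian_nll([1.5], mean, std))
```

Members that disagree more produce a wider region, which is the mechanism by which the learned certificate becomes more conservative in poorly explored parts of the state space.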
4.4 Policy Optimization
We describe our policy optimization algorithm in Algorithm 4. The desiderata here are that (1) the policy needs to be certified by the current barrier certificate and (2) the policy achieves as high a reward as possible. We break down our policy optimization algorithm into two components: first, we optimize the total rewards of the policy; second, we use adversarial training to guarantee that the optimized policy can be certified by the current barrier certificate. The modification of SAC is to some extent non-essential and mostly for the technical convenience of making SAC somewhat compatible with the constraint set. Instead, it is the adversarial step that fundamentally guarantees that the policy is certified by the current barrier certificate.
Adversarial training.
We use adversarial training to guarantee can be certified by . Similar to what we’ve done in training adversarially, the objective for training is to minimize . Unlike the case of , the gradient of w.r.t. is simply , as the constraint is unrelated to . We also use MALA to solve and plug it into the gradient term .
Optimizing .
We use a modified SAC (haarnoja2018soft) to optimize . As the modification is for safety concerns and is minor, we defer it to Appendix A. As a side note, although we only optimize here, is also optimized implicitly because simply outputs the mean of deterministically.
5 High-risk, High-reward Environments
We design four tasks, three of which are high-risk, high-reward tasks, to check the efficacy of our algorithm. Even though they are all based on the inverted pendulum or cart pole, we choose the reward function to be somewhat in conflict with the safety constraints. That is, the optimal policy needs to take a trajectory that is near the safety boundary. This makes the tasks particularly challenging and suitable for stress-testing our algorithm's capability of avoiding irrecoverable states.
These tasks have state dimensions between 2 and 4. We focus on relatively low-dimensional environments to avoid conflating the failure to learn accurate dynamics models from data with the failure to provide safety given a learned approximate dynamics model. Indeed, we identify that the major difficulty in scaling up to high-dimensional environments is that it requires significantly more data to learn a decent high-dimensional dynamics model that can predict long-horizon trajectories. We remark that we aim to have zero violations. This is very difficult to achieve, even if the environment is low-dimensional. As shown in Section 6, many existing algorithms fail to do so.
(a) Upright. The task is based on Pendulum-v0 in OpenAI Gym (gym), as shown in Figure 1(a). The agent can apply torque to control a pole. The environment involves the crucial quantity: the tilt angle, which is defined to be the angle between the pole and a vertical line. The safety requirement is that the pole does not fall below the horizontal line. Technically, the user-specified safety set is (note that the threshold is very close to $\pi/2$, which corresponds to 90°). The reward function is , so the optimal policy minimizes the angle and angular speed by keeping the pole upright. The horizon is 200 and the initial state .
(b) Tilt. The action set, dynamics model, horizon, and safety set are the same as in Upright. The reward function is different: . The optimal policy is supposed to stay tilting near the angle where is the largest angle at which the pendulum can stay balanced. The challenge is that during exploration, it is easy for the pole to overshoot and violate the safety constraints.
(c) Move. The task is based on a cart pole, and the goal is to move a cart (the yellow block) to control the pole (in teal), as shown in Figure 1(b). The cart has a position between and , and the pole also has an angle with the same meaning as in Upright and Tilt. The starting position is . We design the reward function to be . The user-specified safety set is where 0.2 corresponds to roughly 11°. Therefore, the optimal policy needs to move the cart and the pole slowly in one direction, preventing the pole from falling down and the cart from going too far. The horizon is set to 1000.
(d) Swing. This task is similar to Move, except for a few differences: the reward function is ; the user-specified safety set is . So the optimal policy will swing back and forth to some degree and needs to control the angles well so that it does not violate the safety requirement.
For all the tasks, once the safety constraint is violated, the episode will terminate immediately and the agent will receive a reward of 30 as a penalty. The number 30 is tuned by running SAC and choosing the one that SAC performs best with.
6 Experimental Results
Figure 2: Comparison between CRABS and baselines. CRABS can learn a policy without any safety violations, while the baselines incur many safety violations. We run each algorithm four times with independent randomness. The solid curves indicate the mean of four runs and the shaded areas indicate one standard deviation around the mean.
In this section, we conduct experiments to answer the following question: Can CRABS learn a reasonable policy without safety violations in the designed tasks?
Baselines.
We compare our algorithm CRABS against four baselines: (a) Soft Actor-Critic (SAC) (haarnoja2018soft), one of the state-of-the-art RL algorithms, (b) Constrained Policy Optimization (CPO) (achiam2017constrained), a safe RL algorithm which builds a trust region around the current policy and optimizes the policy in the trust region, (c) RecoveryRL (Thananjeyan et al., 2021), which leverages offline data to pretrain a risk-sensitive function and uses two policies to achieve two goals (being safe and obtaining high rewards), and (d) SQRL (srinivasan2020learning), which leverages offline data in an easier environment and fine-tunes the policy in a more difficult environment. SAC and CPO are given an initial safe policy for safe exploration, while RecoveryRL and SQRL are given offline data containing 40K steps from a mix of safe and unsafe trajectories, which are free and are not counted. CRABS collects more data at each iteration in Swing than in other tasks to learn a better dynamics model. For SAC, we use the default hyperparameters because we found that performance is not sensitive to them. For RecoveryRL and SQRL, the hyperparameters are tuned in the same way as in Thananjeyan et al. (2021). For CPO, we tune the step size and batch size. More details of the experimental setup and the implementation of baselines can be found in Appendix B.
Results.
Our main results are shown in Figure 2. From the perspective of total rewards, SAC achieves the best total rewards among all five algorithms in Move and Swing. In all tasks, CRABS achieves reasonable total rewards and learns faster at the beginning of training, and we hypothesize that this is directly due to its strong safety enforcement. RecoveryRL and SQRL learn faster than SAC in Move, but they suffer in Swing. RecoveryRL and SQRL are not capable of learning in Swing, although we observed that the average return during exploration at the late stages of training can be as high as 15. CPO is quite sample-inefficient and does not achieve reasonable total rewards either.
From the perspective of safety violations, CRABS surpasses all baselines without a single safety violation. The baseline algorithms always suffer from many safety violations. SAC, SQRL, and RecoveryRL have a similar number of unsafe trajectories in Upright, Tilt, Move, while in Swing, SAC has the fewest violations and RecoveryRL has the most violations. CPO has a lot of safety violations. We observe that for some random seeds, CPO does find a safe policy and once the policy is trained well, the safety violations become much less frequent, but for other random seeds, CPO keeps visiting unsafe trajectories before it reaches its computation budget.
Visualization of learned viable subset .
To demonstrate that the algorithms work as expected, we visualized the viable set in Figure 3. As shown in the figure, our algorithm CRABS succeeds in certifying more and more viable states and does not get stuck locally, which demonstrates the efficacy of the regularization at Section 4.2. We also visualized how confident the learned dynamics model is as training goes on. More specifically, the uncertainty of a calibrated dynamics model at state is defined as
(6) 
We can see from Figure 3 that the initial dynamics model is only locally confident around the initial policy, but becomes more and more confident after collecting more data.
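One simple way to quantify how confident an ensemble-based model is at a state is the spread of its confidence region. The sketch below assumes the uncertainty is measured as the largest Euclidean distance between two predicted next states; this diameter-style measure and the example region are assumptions for illustration.

```python
import math
from itertools import combinations

def uncertainty(region):
    """Largest Euclidean distance between two predicted next states."""
    return max((math.dist(p, q) for p, q in combinations(region, 2)), default=0.0)

print(uncertainty([(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]))  # 5.0
```

A region with a single prediction has zero spread under this measure, so it is only informative when the ensemble has at least two members.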
Handcrafted barrier function.
To demonstrate the advantage of learning a barrier function, we also conduct experiments on a variant of CRABS that uses a barrier certificate handcrafted by ourselves and does not train it, that is, Algorithm 2 without Line 5. The results show that this variant does not perform well: it does not achieve high rewards and has many safety violations. We hypothesize that the policy optimization is often burdened by adversarial training, and the safeguard policy sometimes cannot find an action that stays within the superlevel set of the handcrafted certificate.
7 Related Work
Prior works on safe RL take very different approaches. dalal2018safe adds an additional layer, which corrects the output of the policy locally. Some works use Lagrangian methods to solve the constrained MDP (CMDP), where the Lagrangian multiplier is controlled adaptively (tessler2018reward) or by a PID controller (stooke2020responsive). achiam2017constrained; yang2020accelerating build a trust region around the current policy. eysenbach2017leave learns a reset policy so that the policy only explores the states that can go back to the initial state. turchetta2020safe introduces a learnable teacher, which keeps the student safe and helps the student learn faster in a curriculum manner. srinivasan2020learning pretrains a policy in a simpler environment and fine-tunes it in a more difficult environment. bharadhwaj2020conservative learns conservative safety critics which underestimate how safe the policy is, and uses the conservative safety critics for safe exploration and policy optimization. Thananjeyan et al. (2021) makes use of existing offline data and co-trains a recovery policy.
Another line of work involves Lyapunov functions and barrier functions. Chow et al. (2018) studies the properties of Lyapunov functions and learns them via bootstrapping with a discrete action space. Building upon Chow et al. (2018), Sikchi et al. (2021) learns the policy with the Deterministic Policy Gradient theorem in environments with a continuous action space. Like TRPO (schulman2015trust), Sikchi et al. (2021) also builds a trust region of policies for optimization. donti2020enforcing constructs sets of stabilizing actions using a Lyapunov function and projects the action onto the set, while chow2019lyapunov projects the action or parameters to ensure the decrease of the Lyapunov function after a step. ohnishi2019barrier is similar to ours, but it constructs a barrier function manually instead of learning one. ames2019control gives an excellent overview of control barrier functions and how to design them. Perhaps the most related work to ours is Cheng et al. (2019), which also uses a barrier function to safeguard exploration and uses a reinforcement learning algorithm to learn a policy. However, the key difference is that we learn a barrier function, while Cheng et al. (2019) handcrafts one. The works on Lyapunov functions (berkenkamp2017safe; richards2018lyapunov) require discretizing the state space and thus only work for low-dimensional spaces.
Anderson et al. (2020) iteratively learns a neural policy which possibly has higher total rewards but is more unsafe, distills the learned neural policy into a symbolic policy which is simpler and safer, and uses automatic verification to certify the symbolic policy. The certification process is similar to constructing a barrier function. As the certification is done on a learned policy, the region of certified states also grows. However, it assumes a known calibrated dynamics model, while we also learn it. Also, it can only certify states where a piecewise-linear policy is safe, while potentially we can certify more states.
8 Conclusion
In this paper, we propose a novel algorithm, CRABS, for training-time safe RL. The key idea is that we co-train a barrier certificate together with the policy to certify viable states, and only explore in the learned viable subset. The empirical results show that CRABS can learn some tasks without a single safety violation. We consider using model-based policy optimization techniques to improve the total rewards and sample efficiency a promising direction for future work.
We focus on low-dimensional continuous state spaces in this paper because this is already a sufficiently challenging setting for zero training-time violations, and we leave high-dimensional state spaces as an important open question. We observed in our experiments that it becomes more challenging to learn a dynamics model in higher-dimensional state spaces that is sufficiently accurate and calibrated even under the training data distribution (the distribution of observed trajectories). Therefore, to extend our algorithms to high-dimensional state spaces, we suspect that either we need to learn better dynamics models or the algorithm needs to be more robust to errors in the dynamics model.
Acknowledgement
We thank Changliu Liu, Kefan Dong, and Garrett Thomas for their insightful comments. YL is supported by NSF, ONR, Simons Foundation, Schmidt Foundation, Amazon Research, DARPA and SRC. TM acknowledges support of Google Faculty Award, NSF IIS 2045685, and JD.com.
References

 Safe reinforcement learning via shielding. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §4.1.
 Neurosymbolic reinforcement learning with formally verified exploration. arXiv preprint arXiv:2009.12612. Cited by: §4.1, §7.
 Foundations of mathematical economics. MIT press. Cited by: §3.
 End-to-end safe reinforcement learning through barrier functions for safety-critical continuous control tasks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 3387–3395. Cited by: §1, §7.
 A Lyapunov-based approach to safe reinforcement learning. arXiv preprint arXiv:1805.07708. Cited by: §7.
 Safe reinforcement learning via probabilistic shields. arXiv preprint arXiv:1807.06096. Cited by: §4.1.
 Exponential control barrier functions for enforcing high relative-degree safety-critical constraints. In 2016 American Control Conference (ACC), pp. 322–328. Cited by: §1.
 PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 8024–8035. Cited by: Appendix B.
 Safety verification of hybrid systems using barrier certificates. In International Workshop on Hybrid Systems: Computation and Control, pp. 477–492. Cited by: §2.2.
 On the necessity of barrier certificates. IFAC Proceedings Volumes 38 (1), pp. 526–531. Cited by: §2.2.

 Searching for activation functions. arXiv preprint arXiv:1710.05941. Cited by: §B.2.
 Lyapunov barrier policy optimization. arXiv preprint arXiv:2103.09230. Cited by: §7.
 Episodic learning with control lyapunov functions for uncertain robotic systems. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 6878–6884. Cited by: §1.
 Recovery rl: safe reinforcement learning with learned recovery zones. IEEE Robotics and Automation Letters 6 (3), pp. 4915–4922. Cited by: §1, §6, §7.
 Safety-critical model predictive control with discrete-time control barrier function. arXiv preprint arXiv:2007.11718. Cited by: §1, §2.2.
Appendix A Reward Optimizing in CRABS
As in the original SAC, we maintain two Q functions and their target networks for , together with a learnable temperature . The objective for the policy is to minimize
(7) 
where if , otherwise for a large enough constant . The heuristic behind the design of is that we should lower the probability of proposing an action that may leave the superlevel set, which reduces how often the safeguard policy is invoked during exploration.
The temporal difference objective for the Q function is
(8) 
We remark that we reject all such that , as our safe exploration algorithm (Algorithm 3) will reject all of them eventually. The temperature is learned in the same way as in haarnoja2018soft:
(9) 
where is a hyperparameter, indicating the target entropy of the policy .
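As a rough illustration of the temperature update, the sketch below uses the standard SAC form of the temperature loss, in which the temperature multiplies the gap between the sampled log-probabilities and the target entropy. The function names, the log-parameterization of the temperature, and the plain gradient step are assumptions made for illustration and are not taken from the CRABS code.

```python
import math


def temperature_loss(alpha, log_probs, target_entropy):
    """Standard SAC temperature loss: mean of -alpha * (log pi(a|s) + target entropy)."""
    return sum(-alpha * (lp + target_entropy) for lp in log_probs) / len(log_probs)


def temperature_step(log_alpha, log_probs, target_entropy, lr=0.01):
    """One gradient-descent step on log(alpha) for the loss above (illustrative only)."""
    alpha = math.exp(log_alpha)
    mean_log_prob = sum(log_probs) / len(log_probs)
    # d/d(log_alpha) of -alpha * (mean_log_prob + target_entropy)
    grad = -alpha * (mean_log_prob + target_entropy)
    return log_alpha - lr * grad
```

With this form, the temperature grows when the sampled policy entropy (the negative mean log-probability) is below the target entropy and shrinks when it is above.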
Appendix B Experiment Details
Our code is implemented in PyTorch [Paszke et al., 2019] and runs on a single RTX 2080 GPU. Typically it takes 12 hours to run one seed for Upright, Tilt and Move, and around 60 hours for Swing. In a typical run of Swing, 33 hours are spent on learning barrier functions.
B.1 Environment
All the environments are based on OpenAI Gym [gym] where MuJoCo [todorov2012mujoco] serves as the underlying physics engine. We use discount .
The tasks Upright and Tilt are based on Pendulum-v0. The observation is where is the angle between the pole and a vertical line, and is the angular velocity. The agent can apply a torque to the pendulum.
The tasks Move and Swing are based on InvertedPendulum-v2 with observation . The agent can control how the cart moves.
All of the constraints are of the form and . For each type of constraint, we design to be
with . If there is no constraint of , we just take . One can easily check that is continuous and equals 1 at the boundary of the safety set.
B.2 Hyperparameters
Policy
We parametrize our policy using a feedforward neural network with ReLU activation and two hidden layers, each of which contains 256 hidden units. Similar to haarnoja2018soft, the output of the policy is squashed by a tanh function. The initial policy is obtained by running SAC for steps, checking the intermediate policy every steps and picking the first safe intermediate policy.
In all tasks, we optimize the policy for 2000 steps in a single epoch.
Dynamics Model
We use an ensemble of five learned dynamics models as the calibrated dynamics model. Each dynamics model contains 4 hidden layers with 400 hidden units and uses Swish as the activation function [Ramachandran et al., 2017]. Following chua2018deep, we also train learnable parameters to bound the output of . We use Adam [kingma2014adam] with learning rate 0.001, weight decay 0.000075 and batch size 256 to optimize the dynamics model.
In the Move and Swing experiments, the initial model is obtained by training for 20000 steps on data from 500 safe trajectories, obtained by adding different noises to the initial safe policy.
At each epoch, we optimize the dynamics models for 1000 steps.
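Two ingredients of the dynamics model described above can be sketched in a few lines: the Swish activation and a soft bound on a predicted log-variance. The bounding form below is the soft clamp commonly used in implementations that follow chua2018deep. Whether CRABS uses exactly this form is an assumption, and in this sketch `min_logvar` and `max_logvar` are plain floats rather than learnable parameters.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def swish(x):
    """Swish activation with beta = 1: x * sigmoid(x)."""
    return x * sigmoid(x)


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def bound_log_variance(raw, min_logvar, max_logvar):
    """Softly clamp a raw log-variance to approximately [min_logvar, max_logvar]."""
    upper = max_logvar - softplus(max_logvar - raw)
    return min_logvar + softplus(upper - min_logvar)
```

The clamp is smooth, so gradients still flow when the raw output saturates, and the result can exceed `max_logvar` by a very small margin.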
Barrier certificate
The barrier certificate is parametrized by a feedforward neural network with ReLU activation and two hidden layers, each of which contains 256 hidden units. The coefficient in Equation 4 is set to 0.001.
Collecting data
In Upright, Tilt and Move, Line 3 of Algorithm 2 collects a single episode. In Swing, Line 3 collects six episodes, two of which are from Algorithm 3 with a uniform random policy, another two are from the current policy, and the remaining two are from the current policy but with more noise. In Algorithm 3, we first draw Gaussian samples , and the sampled actions are , where and are the outputs of the exploration policy .
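The action-sampling step can be pictured as follows. This is a minimal sketch under stated assumptions, not the authors' Algorithm 3: it assumes candidate actions are a tanh-squashed mean plus scaled Gaussian noise, that the first candidate passing a certification check is used, and that the safeguard action is the fallback when every candidate is rejected. The names `is_certified` and `safeguard_action` are hypothetical placeholders, and the action is one-dimensional for brevity.

```python
import math
import random


def propose_actions(mu, sigma, n, rng):
    """Draw n candidates: tanh(mu + sigma * z) with z ~ N(0, 1) (assumed form)."""
    return [math.tanh(mu + sigma * rng.gauss(0.0, 1.0)) for _ in range(n)]


def safe_explore(mu, sigma, is_certified, safeguard_action, n, rng):
    """Return (action, fell_back): the first certified candidate, else the safeguard action."""
    for action in propose_actions(mu, sigma, n, rng):
        if is_certified(action):  # hypothetical check, e.g. staying in the certified set
            return action, False
    return safeguard_action, True
```

In the full algorithm the check would depend on the current state as well; here the state is assumed to be captured inside `is_certified`.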
B.3 Baselines
Recovery RL
We use the code in https://github.com/abalakrishna123/recoveryrl. We remark that when running experiments in Recovery RL, we do not add the violation penalty for an unsafe trajectory. We set (chosen from ) and discount factor (chosen from ). The offline dataset , which is used to pre-train the , contains 20K transitions from a random policy and another 20K transitions from the initial (safe) policy used by CRABS. The violations in the offline dataset are not counted when plotting.
Unfortunately, with the chosen hyperparameters, we do not observe reasonably high reward from the policy, but we do observe that after around 400 episodes, Recovery RL visits the high reward (1520) region in the Swing task and there are few violations from then on.
SAC
We implement SAC ourselves with a learned temperature , which we hypothesize is the reason for its superior performance over Recovery RL and SQRL. The violation penalty is chosen to be 30 from by tuning in the Swing and Move tasks. We found that with a violation penalty of 100, SAC has slightly fewer violations (around 167), but the total reward can be quite low (< 2) after samples, so we choose to show the result for a violation penalty of 30.
SQRL
We use the code provided by Recovery RL with the same offline data and hyperparameters. However, we found that the parameter (that is, the Lagrangian multiplier) is very important, so we tune it by choosing the optimal one from in Swing. The optimal is the same as that for SAC, which is 30. As SQRL and Recovery RL use a fixed temperature for SAC, we find it suboptimal in some cases, e.g., for Swing.
CPO
We use the code in https://github.com/jachiam/cpo. To make CPO more sample efficient and easier to compare, we reduce the batch size from 50000 to 5000 (for Move and Tilt) or 1000 (for Tilt and Upright). We tune the step size in but do not find a substantial difference, while tuning the batch size can significantly improve its sample efficiency, although it is still sample-inefficient.
Appendix C Metropolis-Adjusted Langevin Algorithm (MALA)
Given a probability density function on , the Metropolis-Adjusted Langevin Algorithm (MALA) obtains random samples when direct sampling is difficult. It is based on the Metropolis-Hastings algorithm, which generates a sequence of samples . The Metropolis-Hastings algorithm requires a proposal distribution . At step , the Metropolis-Hastings algorithm generates a new sample and accepts it with probability . If the sample is accepted, we set ; otherwise the old sample is used: . MALA considers a special proposal function . See Algorithm 5 for the pseudocode.
For our purpose, as we seek to compute , we maintain sequences of samples . Recall that involves a constrained optimization problem:
so for each , the sequence follows Algorithm 5 to sample with . The step size is chosen such that the acceptance rate is approximately 0.6. In practice, when , we do not use MALA, but instead use gradient descent to project it back to the set .
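For readers unfamiliar with MALA, the following is a minimal one-dimensional sketch of the generic algorithm: a Langevin proposal followed by a Metropolis-Hastings accept/reject step. It is not the paper's Algorithm 5. The function signature, the scalar state, and the fixed step size are assumptions made for illustration, and the target density is supplied by the caller through `log_p` and `grad_log_p`.

```python
import math
import random


def mala(log_p, grad_log_p, x0, step, n_steps, rng):
    """Run 1-D MALA; returns (samples, acceptance_rate)."""

    def log_q(x_to, x_from):
        # Log density (up to a constant) of the Langevin proposal
        # N(x_from + step * grad_log_p(x_from), 2 * step).
        mean = x_from + step * grad_log_p(x_from)
        return -((x_to - mean) ** 2) / (4.0 * step)

    x, samples, accepted = x0, [], 0
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        proposal = x + step * grad_log_p(x) + math.sqrt(2.0 * step) * noise
        # Metropolis-Hastings log acceptance ratio.
        log_ratio = (log_p(proposal) + log_q(x, proposal)
                     - log_p(x) - log_q(proposal, x))
        if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
            x = proposal
            accepted += 1
        samples.append(x)  # a rejected proposal repeats the old sample
    return samples, accepted / n_steps
```

The returned acceptance rate is the quantity one would monitor when tuning the step size, for example towards the value of roughly 0.6 mentioned above.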