## 1 Introduction

While reinforcement learning systems have demonstrated impressive potential in a variety of domains, including video games Mnih2013PlayingAW and robotic control in laboratory environments levine2015end, safety remains a critical bottleneck when deploying these systems for real-world problems. One natural approach to ensuring safety is to manually impose constraints on the learning process, preventing the agent from taking actions or entering states that are too risky. For example, this could be done by manually shaping a reward function or imposing constraints on the policy to avoid actions that may lead to unsafe outcomes 45450. However, the recent successes of machine learning techniques in a wide range of applications indicate advantages to avoiding manual specification and engineering, as such manual approaches will not generalize to new environments or robots robonet. This raises the question: why should we make an exception for safety? In this paper, we explore the following question: can we *learn* how to be safe, avoiding the need to manually specify states and actions that are risky?

At first, learning and safety may seem inherently at odds, since learning to be safe requires visiting unsafe states. However, when children learn to walk, they can leverage the positive and negative experience of getting up and falling down to gradually begin to stand, walk, and eventually run. Throughout this process, they learn how to move around in the world safely and adapt these skills as their bodies grow and are exposed to new and unseen environments, all while limiting how badly and how often they are injured. This way of exploring safely while learning applies not only to walking, but also to other areas where safety is critical for survival and generally helpful for successfully learning the task. For instance, an agent can more safely learn to drive a car if it already knows how to avoid collisions.

Motivated by how humans learn to acquire skills that become gradually harder and more risky, we believe that our reinforcement learning agents can similarly leverage previous experience to understand general safety precautions in safety-training environments, and use this understanding to avoid failures when learning *new behaviors*. This previous experience could come from non-safety-critical environments such as when a human is present, when a robot is moving slowly, in simulation, or from pre-collected offline data of safety incidents such as car accidents. To this end, we propose to learn task-agnostic models of safety, and then use these models to prevent unsafe behavior when learning new tasks via reinforcement. By doing so, we can ensure that the latter reinforcement learning process itself is safe, which, in principle, can enable real-world deployment.

The primary contribution of this work is a framework for safe reinforcement learning, by learning safety precautions from previous experience. Our approach, safety Q-functions for reinforcement learning (SQRL), learns a critic that evaluates whether a state, action pair will lead to unsafe behavior, under a policy that is constrained by the safety-critic itself. This is achieved by concurrently training the safety-critic and policy in separate pre-training and task fine-tuning phases, visualized in Figure 1. During pre-training, the agent is allowed to explore and learn about unsafe behaviors. In the fine-tuning phase, we train the policy on a new target task, while simultaneously constraining the policy’s updates and selected actions with the safety-critic. We evaluate our method in three challenging continuous control environments: 2D navigation, quadrupedal robot locomotion, and dexterous manipulation with a five-fingered hand. In comparison to prior state-of-the-art RL methods and approaches to safety, we find that SQRL provides consistent and substantial gains in terms of both the safety of learning and the learning efficiency on new tasks.

## 2 Related Work

Our method builds on a rich body of work on safe RL garcia2015comprehensive. Prior works define safety in a variety of ways, using constraints on the expected return or cumulative costs heger1994consideration; achiam2017constrained, using risk measures such as Conditional Value at Risk and percentile estimates Tamar2014PolicyGB; chow2014algorithms; duan2020distributional; ma2020distributional, and defining regions of the state space that result in catastrophic or near-catastrophic failures lipton2016combating; eysenbach2017leave; fisac2019bridging. We define safety as constraining the probability that a catastrophic failure will occur below a specified threshold, similar to how it is defined in prior work on MDPs with constrained probability of constraint violation (CPMDPs) geibel2006reinforcement. Researchers have identified a wide variety of challenges underlying safe RL, including lower bounds on off-policy learning and evaluation thomas2015high; dann2018policy, robustness to perturbations smirnova2019distributionally; pinto2017robust; xu2010distributionally, learning to reset to safe states moldovan2012safe; eysenbach2017leave, and verifying the safety of a final policy julian2019verifying. One common focus of prior works is to learn a safe policy by minimizing worst-case discounted cost using a conditional value at risk (CVaR) formulation heger1994consideration; chow2017risk; tang2019worst. However, these methods do not guarantee or offer ways to reason about the safety of policies attempted during the learning process. In contrast, we present a method of learning a safety layer that can adapt in parallel with a fine-tuned policy in an online fashion, making it better suited to the transfer learning setting where the environment or task has changed and the trained agent must safely adapt.

While there are many approaches to safety in RL, constrained MDPs are commonly used altman1999constrained; achiam2017constrained; Tessler2018RewardCP, where the agent optimizes the reward signal given constraints on the accumulated cost during the rollout. In work by chow2018lyapunov; Chow2019LyapunovbasedSP, Lyapunov state-action functions are constructed using a feasible baseline policy, to enable safe Q-learning, policy gradient, and actor gradient methods, which map a learned policy into a space of policies that satisfy the Lyapunov safety condition. Similar to our work, this approach instills safety into the training process by directly evaluating every state’s constraint cost. However, unlike these works, we do not assume a well-defined, actionable safety criterion, and instead only access a sparse safety label, which the agent uses to infer an actionable safety criterion and transfer it to new task instances.

There have been many approaches to handling risk and safety in RL that have varied at an architectural level, which bear some similarities to our work. In geibel2005risk, Q-learning is adapted with an additional risk value estimate based on unsafe states to learn a deterministic policy that is risk-averse. hans2008safe proposed controlling a plant with RL, while learning a “safety function” from data to estimate a state’s degree of safety, as well as a “backup policy” for returning the plant from a critical state to a safe regime. However, neither approach tries transferring learned safety to a different task/environment with modified rewards or dynamics. eysenbach2017leave showed a similar approach in a robotics domain, learning a “reset policy” to return a robot to its initial state distribution, which is triggered if a corresponding value function drops below a certain threshold.
Our model builds on this idea by directly learning to predict failures, using this predictor to *preemptively* filter unsafe actions, removing the need for an additional reset policy.
Additionally, we use our learned constraints to provide constrained updates to our policy during fine-tuning.

Finally, a number of works have learned models that predict the probability of future collisions for safe planning and control of mobile robots richter2017safe; kahn2017uncertainty. We instead aim to develop a general framework for many different modes of safety, and empirically observe the importance of several new design choices introduced in our algorithm on challenging continuous control problems.

## 3 Preliminaries

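For our method, we consider the standard RL problem on a Markov Decision Process (MDP), which is specified by the tuple $(\mathcal{S}, \mathcal{A}, \gamma, r, P, \mu)$. $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces respectively, $\gamma$ is a discounting factor, $r$ is the reward function, and $P$ and $\mu$ are the state transition dynamics and initial state distributions respectively. The aim of entropy-regularized reinforcement learning ziebart2010modeling is to learn a policy $\pi$ that maximizes the following objective:

$$J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[ r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot \mid s_t)) \right] \tag{1}$$

We use $\mathcal{H}(\pi(\cdot \mid s_t))$ to denote the policy’s action entropy and $\alpha$ as a tuning parameter.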
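As a small illustration of the entropy-regularized objective, the following sketch evaluates the per-trajectory quantity "reward plus weighted action entropy" for a discrete action space. The function names, the discrete-action assumption, and the toy numbers are illustrative assumptions and are not taken from the paper.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def maxent_return(rewards, action_dists, alpha):
    """Sum over time of r_t + alpha * H(pi(.|s_t)) for a single trajectory."""
    return sum(r + alpha * entropy(d) for r, d in zip(rewards, action_dists))

# Toy two-step trajectory: the first step uses a uniform policy over two
# actions, the second step uses a deterministic policy (zero entropy).
rewards = [1.0, 0.0]
action_dists = [[0.5, 0.5], [1.0, 0.0]]
value = maxent_return(rewards, action_dists, alpha=0.1)
```

A larger `alpha` places more weight on keeping the policy stochastic, which is the property the pre-training phase relies on for exploration.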

## 4 Problem Statement

In many robot learning settings, the conditions that lead to a catastrophic failure are difficult to fully specify before learning. While unsafe states are easily identifiable, such as the robot falling down or dropping an object, the task of formally specifying safety constraints that capture these states is non-trivial, and can be potentially biased or significantly hinder learning.

This is especially true if determining the agent’s final behavior is difficult due to uncertainty about the underlying reward function and safety constraints. However, for many problems, the conditions for a catastrophic failure during an episode are easier to determine (e.g. determining that the robot has fallen down, or the object was dropped). Therefore, the challenge is to learn an optimal policy for a task while minimizing the frequency of catastrophic failures during training. To achieve this, we include 1) a safe pre-training environment, in which failures can be tolerated, and 2) a safety-incident indicator, $\mathcal{I}(s)$, which indicates if a given state is unsafe or not. In both the pre-training and target task environments, we treat these failure states as terminal. This setup enables the agent to learn to be safe and then transfer to new tasks without accumulating additional failure costs.

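Under this framework, we propose learning in two phases: 1) learning an exploratory policy that solves a simpler/safer task in the pre-training environment, and 2) transferring the learned policy to a more safety-critical target task with safety guarantees. Our *safety-aware* MDPs for the pre-training and target tasks are respectively defined as $\mathcal{T}_{\text{pre}} = (\mathcal{S}, \mathcal{A}, \gamma, r_{\text{pre}}, P_{\text{pre}}, \mu_{\text{pre}}, \mathcal{I})$ and $\mathcal{T}_{\text{target}} = (\mathcal{S}, \mathcal{A}, \gamma, r_{\text{target}}, P_{\text{target}}, \mu_{\text{target}}, \mathcal{I})$.
After pre-training in a safety test-bed task ($\mathcal{T}_{\text{pre}}$), the agent must optimize its expected return in the target training task ($\mathcal{T}_{\text{target}}$) while minimizing visits to unsafe states:

$$\max_{\pi} \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[ r_{\text{target}}(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot \mid s_t)) \right] \quad \text{s.t.} \quad \mathbb{E}_{s_t \sim \rho_\pi}\left[ \mathcal{I}(s_t) \right] \le \epsilon$$

This objective implies that for the target task, the policy should stay outside of the unsafe region with probability at least $1 - \epsilon$. This also introduces the notion of a target safety threshold, $\epsilon$, which serves as an upper bound on the expected risk of a given policy. When used as a single-step constraint over actions sampled during policy rollouts, it can guarantee that the policy is safe up to that threshold probability under certain assumptions. Importantly, we aim to impose this safety constraint not just at convergence, but throughout the training process.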

## 5 Safety Q-functions for RL

To address the problem formulation described above, we introduce Safety Q-functions for Reinforcement Learning (SQRL), which simultaneously learns a policy and a notion of safety in the first phase, and later fine-tunes the policy to the target task using the learned safety precautions. In turn, the safety in the second phase of training can be ensured during the learning process itself. This is done by the safety-critic, $Q_{\text{safe}}$, which estimates the future failure probability of a safety-constrained policy given a state, action pair. The safety constraints learned by $Q_{\text{safe}}$ can induce the policy to be safe, even when the task has changed. In this section, we will describe our approach and analyze its performance.

### 5.1 Pre-Training Phase

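During pre-training, the goal is to learn a safety-critic $Q_{\text{safe}}$ and the optimal pre-trained policy $\pi^{*}_{\text{pre}}$, which will serve as the initialization for training on the target task. For our problem, the optimal safety-critic for the pre-training task $\mathcal{T}_{\text{pre}}$ estimates the following expectation for a given policy $\pi$:

$$Q_{\text{safe}}^{\pi}(s_t, a_t) = \mathcal{I}(s_t) + (1 - \mathcal{I}(s_t)) \sum_{t'=t+1}^{T} \mathbb{E}_{s_{t'} \sim P_{\text{pre}},\, a_{t'} \sim \pi}\left[ \gamma_{\text{safe}}^{\,t'-t}\, \mathcal{I}(s_{t'}) \right] \tag{2}$$

The target failure estimate $Q_{\text{safe}}^{\pi}$ estimates the true probability that $\pi$ will fail in the future if, starting at state $s_t$, it takes the action $a_t$. This raises three key questions: 1) how should such a model of safety be trained, 2) with what data do we train it, and 3) for what policy do we train it?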

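Regarding the first question, if the safety-critic were a binary classifier, supervised learning could be used to just estimate the safety labels for a single timestep. However, to learn the cumulative failure probability in the future, the safety-critic must also reason over future timesteps. Hence, the safety-critic is trained using dynamic programming, as in standard Q-learning, using a discount term, $\gamma_{\text{safe}}$, to limit how far in the past the failure signal is propagated. This cumulative discounted probability of failure is estimated by the Bellman equation:

$$\hat{Q}_{\text{safe}}^{\pi}(s, a) = \mathcal{I}(s) + (1 - \mathcal{I}(s))\, \mathbb{E}_{s' \sim P_{\text{pre}}(\cdot \mid s, a),\, a' \sim \pi(\cdot \mid s')}\left[ \gamma_{\text{safe}}\, \hat{Q}_{\text{safe}}^{\pi}(s', a') \right]$$

Parametrizing the safety Q-function $\hat{Q}_{\psi}$ as a neural network with parameters $\psi$ yields the following objective:

$$J_{\text{safe}}(\psi) = \mathbb{E}_{(s, a, s', a') \sim \rho_\pi}\left[ \left( \hat{Q}_{\psi}(s, a) - \left( \mathcal{I}(s) + (1 - \mathcal{I}(s))\, \gamma_{\text{safe}}\, \bar{Q}_{\bar{\psi}}(s', a') \right) \right)^2 \right]$$

where $\bar{Q}_{\bar{\psi}}$ corresponds to the delayed target network.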
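The dynamic-programming update for the safety-critic can be sketched as follows. This is a minimal sketch that assumes the target has the usual Q-learning shape with failure states treated as terminal: the failure indicator of the current state, plus the discounted target-network estimate at the next state-action pair otherwise. The function names and the plain-list interface are hypothetical.

```python
def safety_td_target(failed, next_q, gamma_safe):
    """Target for one transition: I(s) + (1 - I(s)) * gamma_safe * Q_target(s', a').

    `failed` is the safety-incident indicator of the current state and
    `next_q` is the delayed target network's estimate at (s', a').
    """
    indicator = 1.0 if failed else 0.0
    return indicator + (1.0 - indicator) * gamma_safe * next_q

def safety_critic_loss(q_preds, failed_flags, next_qs, gamma_safe):
    """Mean squared error between critic predictions and their TD targets."""
    errors = [
        (q - safety_td_target(f, nq, gamma_safe)) ** 2
        for q, f, nq in zip(q_preds, failed_flags, next_qs)
    ]
    return sum(errors) / len(errors)

# A non-failure state bootstraps from the next estimate; a failure state
# has a target of exactly 1 regardless of the bootstrap value.
targets = [safety_td_target(f, nq, 0.8) for f, nq in [(False, 0.5), (True, 0.9)]]
loss = safety_critic_loss([0.4, 1.0], [False, True], [0.5, 0.9], 0.8)
```

Because the indicator term is bounded by 1 and the discount is below 1, the targets stay in the unit interval, which is what allows the critic output to be read as a discounted failure probability.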

For the safety model to learn new tasks without failure, it must first explore a diverse set of state-action pairs, including unsafe state-action pairs, during pre-training. This motivates using maximum entropy reinforcement learning, which encourages exploration by maximizing both the reward and the entropy of the policy. In our implementation, we choose the soft actor-critic (SAC) algorithm haarnoja2018soft, though in principle, any off-policy algorithm can be used as long as it explores sufficiently.

Finally, we must determine which policy to optimize the safety-critic under. If we use the entirety of training experience from SAC, which contains data from a mixture of all policies encountered, this may result in a safety Q-function that is too pessimistic. For example, if the mixture of policies includes a random, unsafe policy from the start of training, then even a cautious action in one state may be considered unsafe, because risky actions are observed afterwards. Therefore, we are faced with a dilemma: while we need diverse data that includes a range of unsafe conditions, we also need the actions taken to be reflective of those taken by safer downstream policies constrained by the safety-critic, to avoid learning a pessimistic safety-critic. To address this conundrum, the safety-critic is optimized under the mixture of policies that are constrained by the safety-critic itself, which we denote as $\bar{\pi}_i$, where $i$ is the iteration. This mitigates pessimism, since for a safe state-action pair in the data, subsequent behavior from that point will be constrained as safe, hence producing a reliable target label.

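To this end, the safety-critic and a stochastic safety-constrained policy $\bar{\pi}$ are jointly optimized during pre-training. Let the safety policy $\bar{\pi}$ be a policy that has zero probability of sampling an action $a$ where $Q_{\text{safe}}(s, a) > \epsilon$. Let the set of all policies where this holds be $\Pi_{\text{safe}}$. Next, let $\Gamma_{\text{safe}}$ be the projection mapping any policy $\pi$ onto its closest policy in $\Pi_{\text{safe}}$. Consider a natural projection operator, using our safety-critic definition, that does the following:

$$\Gamma_{\text{safe}}(\pi)(a \mid s) \propto \begin{cases} \pi(a \mid s) & \text{if } Q_{\text{safe}}(s, a) \le \epsilon \\ 0 & \text{otherwise} \end{cases} \tag{3}$$

By masking the output distribution over actions from $\pi$ to only sample actions where this safety condition is met, the policy $\Gamma_{\text{safe}}(\pi)$ is ensured to be safe.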
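For a discrete action space, the masking projection can be sketched in a few lines. This is a sketch under the assumption that an action is allowed whenever its estimated failure probability does not exceed the threshold; `mask_policy` and its arguments are hypothetical names, and continuous-action implementations would instead mask sampled candidates.

```python
def mask_policy(action_probs, q_safe, epsilon):
    """Zero out actions whose estimated failure probability exceeds epsilon,
    then renormalize the remaining probability mass."""
    masked = [p if q <= epsilon else 0.0 for p, q in zip(action_probs, q_safe)]
    total = sum(masked)
    if total == 0.0:
        # The projection is undefined when no action passes the threshold.
        raise ValueError("no action satisfies the safety threshold")
    return [p / total for p in masked]

# Three actions; the second one is estimated to fail with probability 0.5.
safe_probs = mask_policy([0.5, 0.3, 0.2], [0.05, 0.5, 0.1], epsilon=0.2)
```

The relative preferences of the original policy among the surviving actions are preserved, which is one reading of mapping a policy onto its closest safe policy.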

Our pre-training process, summarized in Algorithm 1, proceeds as follows. Let $\epsilon$ define our target safety threshold, beyond which actions should be rejected for being too risky. Then, at each iteration of training, we collect data from our current policy with actions constrained by the current safety-critic (i.e. projecting our policy into $\Pi_{\text{safe}}$), add the data to our replay buffer, update our safety-critic under the mixture of policies represented in the replay buffer, and update our policy using a MaxEnt RL algorithm. Pre-training returns a safe policy that solves the pre-training task, $\pi_{\text{pre}}$, and a safety Q-function $Q_{\text{safe}}$.
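The loop structure described above can be illustrated with a tabular toy. This is only a sketch under strong simplifying assumptions: the state and action spaces are tiny and discrete, the policy is held fixed and uniform (the MaxEnt policy update is omitted), and the critic is updated by sweeping the whole buffer. All names (`pretrain_safety_critic`, `constrained_action`, `chain_step`) and hyperparameter values are hypothetical.

```python
import random

def constrained_action(probs, q_row, epsilon, rng):
    """Sample from `probs` restricted to actions with q <= epsilon."""
    weights = [p if q <= epsilon else 0.0 for p, q in zip(probs, q_row)]
    if sum(weights) == 0.0:
        # No action passes the threshold: fall back to the least risky one.
        return min(range(len(q_row)), key=lambda a: q_row[a])
    return rng.choices(range(len(probs)), weights=weights)[0]

def pretrain_safety_critic(step, is_failure, n_states, n_actions, start,
                           epsilon=0.3, gamma_safe=0.7, lr=0.5,
                           iterations=100, horizon=10, seed=0):
    rng = random.Random(seed)
    q_safe = [[0.0] * n_actions for _ in range(n_states)]
    # Fixed uniform policy; a full run would update it with a MaxEnt RL step.
    policy = [[1.0 / n_actions] * n_actions for _ in range(n_states)]
    buffer = []
    for _ in range(iterations):
        # 1) Collect data with actions constrained by the current critic.
        s = start
        for _ in range(horizon):
            a = constrained_action(policy[s], q_safe[s], epsilon, rng)
            s_next = step(s, a)
            buffer.append((s, a, s_next))
            if is_failure(s_next):
                break  # failure states are treated as terminal
            s = s_next
        # 2) Update the critic on replay data, bootstrapping from actions
        #    that the constrained policy would take at the next state.
        for s, a, s_next in buffer:
            if is_failure(s_next):
                next_q = 1.0
            else:
                a_next = constrained_action(policy[s_next], q_safe[s_next], epsilon, rng)
                next_q = q_safe[s_next][a_next]
            q_safe[s][a] += lr * (gamma_safe * next_q - q_safe[s][a])
    return q_safe

# Toy chain: state 0 is the failure state; action 0 moves left, action 1 right.
def chain_step(s, a):
    return s - 1 if a == 0 else min(s + 1, 3)

q = pretrain_safety_critic(chain_step, lambda s: s == 0,
                           n_states=4, n_actions=2, start=1)
```

In this toy, the left action at state 1 is the only one that leads directly to failure, so its estimate should rise above the threshold while the estimates for other state-action pairs stay low, because bootstrapping only considers actions the constrained policy would actually take.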

### 5.2 Fine-Tuning Phase

In the fine-tuning phase, SQRL initializes the policy to the safety-constrained pre-training policy $\bar{\pi}_{\text{pre}}$ and fine-tunes to a new safety-critical target task $\mathcal{T}_{\text{target}}$. To do this, we formulate a safety-constrained MDP using the pre-trained safety-critic $Q_{\text{safe}}$. While fine-tuning on the target task, all data is collected using policies that are constrained by $Q_{\text{safe}}$, following the data collection approach used in pre-training, and the policy is updated with respect to the target task reward function. We additionally add a safety constraint cost to the policy objective to encourage the unconstrained policy to sample actions that will fall within the distribution of $\Gamma_{\text{safe}}(\pi)$, while optimizing for expected return. This modifies the standard soft-actor policy objective to

$$J_{\text{target}}(\pi) = \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[ r_{\text{target}}(s_t, a_t) + \alpha\left( -\log \pi(a_t \mid s_t) - \bar{\mathcal{H}} \right) + \nu\left( \epsilon - Q_{\text{safe}}(s_t, a_t) \right) \right] \tag{4}$$

where $\alpha$ and $\nu$ are Lagrange multipliers for the entropy and safety constraints respectively, and where $\bar{\mathcal{H}}$ denotes the target entropy (a hyperparameter of SAC). Algorithm 2 summarizes the fine-tuning process.

### 5.3 Analysis

Next, we theoretically analyze the safety of SQRL, specifically considering the safety of the learning process itself.
For this analysis (but *not* in our experiments), we make the following assumptions:

###### Assumption 1.

The safety-critic $Q_{\text{safe}}$ is optimal such that, after pre-training, it can estimate the true expected future failure probability given by $Q_{\text{safe}}^{\pi}$ in Equation 2, for any experienced state-action pair.

###### Assumption 2.

The transition dynamics leading into failure states have a transition probability at least $\epsilon$. That is, for all unsafe states $s'$, $P(s' \mid s, a) \ge \epsilon$ or $P(s' \mid s, a) = 0$.

###### Assumption 3.

The support of the pre-training data for $Q_{\text{safe}}$ covers the states and actions observed during fine-tuning.

###### Assumption 4.

There always exists a “safe” action at every safe state that leads to another safe state (i.e. if ).

###### Lemma 1.

For any policy , the discounted probability of it failing in the future, given by , is less than or equal to , given a safety-critic .

Assuming that (Assumption 1), for the base case we can assume for a policy starting in state there is always an action for which and :

From recursion, it follows that when the step probability of failing is below if the policy masks the actions sampled by to those that are within the safety threshold, , has failure probability less than This is because there are actions in the support of the expectation of where or else that state’s cumulative failure probability would be From Assumptions 2 and 3, this means that these actions can be sampled by , and from the definition of the masking operator , actions that would violate the constraint are masked out. A visual representation of this concept is presented in Fig. 6 in Appendix A.
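As a worked version of the single-step bound used in this argument (a sketch that assumes discrete actions and writes $\bar{\pi}$ for the masked policy, $Q_{\text{safe}}$ for the safety-critic, and $\epsilon$ for the threshold), note that the masked policy only places probability on actions whose estimated failure probability is at most $\epsilon$, so its expected failure estimate at any state is also at most $\epsilon$:

$$\mathbb{E}_{a \sim \bar{\pi}(\cdot \mid s)}\left[ Q_{\text{safe}}(s, a) \right] = \sum_{a \,:\, Q_{\text{safe}}(s, a) \le \epsilon} \bar{\pi}(a \mid s)\, Q_{\text{safe}}(s, a) \;\le\; \epsilon \sum_{a} \bar{\pi}(a \mid s) = \epsilon$$

The bound requires the sum to be over a non-empty set of actions, which is where the assumption that a safe action exists at every safe state enters.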

###### Theorem 1.

Optimizing the policy learning objective in Eq. 4, given a safety-critic $Q_{\text{safe}}$ satisfying Assumptions 1-4, all new policies encountered during training will be in $\Pi_{\text{safe}}$ when trained on $\mathcal{T}_{\text{target}}$.

Following from Lemma 1, we can show that actions sampled according to the projected safety policy match the constraints that and therefore makes the problem equivalent to solving the CMDP , where the constraints are learned in the pre-training phase. A more detailed proof is included in the supplemental material.

## 6 Practical Implementation of SQRL

To train a safety-critic and policy with SQRL on high-dimensional control problems, we need to make several approximations in our implementation of the algorithm, which in practice do not harm overall performance. Our method is implemented as a layer above existing off-policy algorithms, which collect and store data in a replay buffer that is used throughout training for performing updates. For our method, we store trajectories in an offline replay buffer when collecting samples for training the actor and critic, and additionally keep a smaller, “on-policy” replay buffer, which is used to store trajectory samples from the latest policy to train the safety-critic after the policy has been trained offline for a number of steps. During pre-training, we use rejection sampling to find actions with a failure probability that is only just below the threshold $\epsilon$, scored using the current safety-critic. This equates to sampling safe but ‘risky’ actions that encourage the agent to explore the safety boundary in order to correctly distinguish between safe and unsafe actions for a given state. Alternatively, methods like the cross-entropy method, which perform importance-based sampling weighted by the safety-critic failure probability, could also be used.

Sampling actions in this way ensures that the data collected during policy evaluation gives the safety-critic the most information about how well it distinguishes safe and unsafe actions close to the threshold $\epsilon$. Since the behavior of the final safety policy is largely determined by the safety hyperparameters $\epsilon$ and $\gamma_{\text{safe}}$, which are task- and environment-specific, one additional step during pre-training is tuning both of these parameters to reach optimal pre-training performance, since it is unsafe to tune them while attempting to solve the target task.
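The pre-training rejection-sampling step described above might look roughly like the following. This is a sketch, not the authors' implementation: `sample_action` and `q_safe` are hypothetical callables standing in for the policy and the safety-critic, the candidate count is arbitrary, and the fallback to the least risky candidate when nothing passes the threshold is an assumption for pre-training.

```python
import random

def sample_risky_safe_action(sample_action, q_safe, state, epsilon,
                             n_samples=32, rng=None):
    """Draw candidate actions and keep the safe one whose estimated failure
    probability is closest to (but not above) the threshold epsilon."""
    rng = rng or random.Random(0)
    candidates = [sample_action(state, rng) for _ in range(n_samples)]
    scored = [(q_safe(state, a), a) for a in candidates]
    safe = [(q, a) for q, a in scored if q <= epsilon]
    if safe:
        # Highest-risk candidate that still satisfies the constraint.
        return max(safe, key=lambda qa: qa[0])[1]
    # Assumed fallback: no candidate is safe, so take the least risky one.
    return min(scored, key=lambda qa: qa[0])[1]

# Toy usage: actions are scalars in [-1, 1] and risk grows with magnitude.
def toy_policy(state, rng):
    return rng.uniform(-1.0, 1.0)

def toy_critic(state, action):
    return abs(action)

action = sample_risky_safe_action(toy_policy, toy_critic, state=None, epsilon=0.5)
```

Selecting the candidate nearest the threshold concentrates data near the decision boundary of the critic, which is the stated purpose of this sampling scheme.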

In the fine-tuning phase, the safe sampling strategy translates into selecting safe actions according to $\Gamma_{\text{safe}}(\pi)$ by, once again, sampling actions, masking out those that are unsafe, and then importance sampling the remaining options with probability proportional to their log probability under the original distribution output by $\pi$. In the event that no safe action is found, the safest action with lowest probability of failing is chosen. Qualitatively in our experiments, these are typically actions that stabilize the robot to avoid failure, and restrict the ability to accumulate reward for the remainder of the episode. In our experiments in Section 7, we show ablation results that confirm the benefit of the exploration induced by MaxEnt RL to improve the quality of safety exploration during pre-training, yielding a better quality estimate of failure probabilities when the goal task is attempted.
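The fine-tuning action selection can be sketched as below. The weighting is an interpretation: the text says the remaining candidates are resampled in proportion to their log probability, and this sketch assumes that means a softmax over log probabilities (that is, weights proportional to the original probabilities). The function name and argument layout are hypothetical.

```python
import math
import random

def select_safe_action(candidates, log_probs, q_values, epsilon, rng):
    """Mask unsafe candidates, then resample among the safe ones with weights
    proportional to their probability under the original policy."""
    safe = [i for i, q in enumerate(q_values) if q <= epsilon]
    if not safe:
        # No safe candidate: choose the one with the lowest failure estimate.
        safest = min(range(len(q_values)), key=lambda i: q_values[i])
        return candidates[safest]
    max_lp = max(log_probs[i] for i in safe)  # subtract max for stability
    weights = [math.exp(log_probs[i] - max_lp) for i in safe]
    return candidates[rng.choices(safe, weights=weights)[0]]

rng = random.Random(0)
picked = select_safe_action(
    candidates=["brake", "coast", "accelerate"],
    log_probs=[math.log(0.2), math.log(0.3), math.log(0.5)],
    q_values=[0.05, 0.10, 0.60],
    epsilon=0.2,
    rng=rng,
)
```

With the toy values shown, the third candidate is masked out, so `picked` is always one of the first two.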

## 7 Experiments

In our experimental evaluation, we aim to answer the following questions to evaluate how well our method transfers in this task-agnostic setting: (1) Does our approach enable substantially safer transfer learning compared to RL without a safety-critic? (2) How does our approach compare to prior safe RL methods? (3) How does SQRL affect the performance, stability, and speed of learning? (4) Does tuning the safety threshold allow the user to trade-off risk and performance?

Environments.

To answer these questions, we design our experiments in three distinct environments, shown in Figure 3. Inspired by kappen2005path, the drunk spider domain is a 2D navigation environment, where the agent must learn to navigate to a fixed goal position, with stochastic dynamics corresponding to action noise. There is a narrow bridge that goes directly from start to goal, with two lava pits on either side. Falling into either pit corresponds to a failure and episode termination. If the agent is unconfident about crossing the bridge without falling, it can instead choose to go around the bridge and lava pits entirely, which takes longer (i.e., costs more) but avoids the risk of falling. Our second environment is the minitaur quadruped locomotion environment from tan2018sim. Here, we consider the task of running at faster speeds, i.e. 0.4 m/s, after pre-training on a safer desired velocity of 0.3 m/s, with the goal of learning to run without falling. We also include a variant of this environment that increases the foot friction of the simulator during fine-tuning, showing that our method works even when there are changes in the environment dynamics. Our last environment corresponds to cube rotation with a five-fingered 24 DoF ShadowHand using the environment introduced by nagabandi2019deep. The goal is to rotate a cube from a random initial orientation to a desired goal orientation, where the fine-tuning task requires a larger orientation change than the pre-training phase. An unsafe state corresponds to the block falling off the hand. See Appendix C for details.

### 7.1 The Performance and Safety of Learning

To answer question (1), we perform an ablation of our approach where no safety-critic training is performed. This comparison directly corresponds to training standard SAC on the pre-training task, and fine-tuning the pre-trained policy on the target task, shown in Figure 4. This directly compares the role the safety-critic plays in both safety and performance. To address question (2), we compare SQRL to four prior approaches for safe reinforcement learning: worst-case policy gradients (WCPG) tang2019worst, which learns a distribution over Q-values and optimizes predicted worst-case performance, ensemble SAC (ESAC), which uses an ensemble of critics to produce a conservative, risk-averse estimate of the Q-value during learning eysenbach2017leave, constrained policy optimization (CPO), which adds a constraint-based penalty to the policy objective achiam2017constrained, and Risk-Sensitive Q-Learning (RS), which adds the risk estimate directly to the reward objective geibel2005risk. To compare to geibel2005risk, we re-use the pre-trained SQRL policy and penalize the critic loss with the safety-critic risk estimate scaled by a Lagrange multiplier. For all prior methods, we perform the same pre-training and fine-tuning procedure as our method to provide a fair comparison.

We compare all of the methods in fine-tuning performance in Figure 4. Overall, we observe that SQRL is substantially safer, with fewer safety constraint violations, than both SAC and WCPG, while transferring with better final performance in the target task. In the drunk spider environment, all approaches can acquire a high performing policy; however, only SQRL enables safety during the learning process, with only a 1% fall rate. In the Minitaur environment, SQRL enables the robot to walk at a faster speed, while falling around only 5% of the time during learning, while the unconstrained SAC policy fell about 3x as often. Finally, on the challenging cube rotation task, SQRL achieves good performance while dropping the cube the least of all approaches, although SAC performs better with more safety incidents.

### 7.2 The Effect of Safety on Learning

To answer question (3), we now aim to assess the stability and efficiency of learning when using the safety-critic. To do so, we present learning curves for SQRL and our comparisons in Figures 4 and 7 (in Appendix E). In these curves, we find that SQRL not only reduces the chances of safety incidents (Figure 7), but also significantly speeds up and stabilizes the learning process (Figure 4). In particular, we observe large performance drops during SAC training, primarily due to exploration of new policies, some of which are unsafe and lead to early termination and low reward. Our approach instead mitigates these issues by exploring largely in the space of safe behaviors, producing more stable and efficient learning. We also show that when fine-tuning for the cube rotation task, the safety policy can be overly cautious (see Figures 4 and 7 in the appendix), resulting in fewer failures but lower performance than the unconstrained SAC policy. These results indicate that even when the fine-tuning task is a more difficult version of the original training task, a task-agnostic safety Q-function can guide exploration to be safer while learning to solve the new task.

### 7.3 Trading Off Risk and Reward with $\epsilon$

Finally, we aim to answer question (4). We aim to evaluate whether the safety threshold, , can be used to control the amount of risk the agent is willing to take. To do so, we consider the drunk spider environment, evaluating two different values of our safety threshold, and , with action noise included in the environment with magnitude . In Figure 5, we observe that, when using the conservative safety threshold of , the agent chooses to cautiously move around the two pits of lava, taking the long but very safe path to the goal location in green. Further, when we increase the safety threshold to a maximum failure probability of , we find that the agent is willing to take a risk by navigating directly through the two pits of lava, despite the action noise persistent in the environment. We observe that these two behaviors are consistent across multiple trials. Overall, this experiment suggests that tuning the safety threshold does provide a means to control the amount of risk the agent is willing to take, which in turn corresponds to the value of the task reward and efficiency relative to the (negative) value of failure. This knob is important for practical applications where failures have varying costs.

## 8 Discussion

We introduced an approach for learning to be safe and then using the learned safety precautions to act safely while learning new tasks. Our approach trains a safety-critic to estimate the probability of failure in the future from a given state and action, which we can learn with dynamic programming. We impose safety by constraining the actions of future policies to limit the probability of failures. Our approach naturally yields guarantees of safety under standard assumptions, and can be combined in practice with expressive deep reinforcement learning methods. On a series of challenging control problems, our approach encounters substantially fewer failures during the learning process than prior methods, and even produces superior task performance on one task.

Our experimental results make strides towards increasing the safety of learning, compared to multiple prior methods. However, there are a number of potential avenues for future work, such as optimizing the pre-training or curriculum generation procedures. Relaxing some of the assumptions to guarantee that a constrained policy is safe may yield alternative ways of calculating risk and sampling safe actions under the Q-learning framework we propose. Motivated by our analysis, another area for study is handling out-of-distribution queries to the safety-critic. Finally, encouraged by the strong results in simulation, we aim to apply SQRL to real-world manipulation tasks in future work.

## Appendix A Guaranteeing Safety with the Safety Critic

In this section we will prove that constraining *any* policy using a safety Q-function guarantees safety. Theorem 1 from the main text is a special case of this result. We first clarify our notation and assumptions before stating and proving our result. Our proof will make use of the safety Q-function $Q_{\text{safe}}$ for a safety indicator function $\mathcal{I}(s)$, along with a safety threshold $\epsilon$.

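For an arbitrary policy $\pi$, define $\bar{\pi}$ (sometimes written as $\Gamma_{\text{safe}}(\pi)$) as the policy obtained by “masking” $\pi$ with $Q_{\text{safe}}$:

$$\bar{\pi}(a \mid s) = \frac{1}{Z(s)}\, \mathbb{1}\left( Q_{\text{safe}}(s, a) \le \epsilon \right) \pi(a \mid s) \tag{5}$$

where the normalizing constant $Z(s)$ is defined as

$$Z(s) = \int \mathbb{1}\left( Q_{\text{safe}}(s, a) \le \epsilon \right) \pi(a \mid s)\, da \tag{6}$$

We will denote the initial, pre-trained policy as $\pi_{\text{pre}}$ and its safety-constrained variant as $\bar{\pi}_{\text{pre}}$. Additionally, we define the safe set as $\mathcal{S}_{\text{safe}} = \{ s : \mathcal{I}(s) = 0 \}$, and the unsafe set as $\mathcal{S}_{\text{unsafe}} = \{ s : \mathcal{I}(s) = 1 \}$.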

We make the following simplifying assumptions:

###### Assumption 1.

The safety-critic $Q_{\text{safe}}$ is optimal such that, after pre-training, it can estimate the true expected future failure probability given by $Q_{\text{safe}}^{\pi}$ in Equation 2, for any experienced state-action pair.

###### Assumption 2.

The transition dynamics leading into failure states have a transition probability at least $\epsilon$. That is, for all unsafe states $s' \in \mathcal{S}_{\text{unsafe}}$, $P(s' \mid s, a) \ge \epsilon$ or $P(s' \mid s, a) = 0$.

###### Assumption 3.

At every safe state , there exists an action such that there is a non-zero probability of transitioning to another safe state , i.e. . This means that at any safe state that could transition to a failure state .

###### Assumption 4.

The support of the pre-training data for $Q_{\text{safe}}$ covers all states and actions observed during fine-tuning, and the policy distribution $\pi$ has full support over actions.

With these assumptions in place, we formally state our general result:

###### Theorem 1.

When optimizing the policy learning objective in Eq. 4, under the assumptions that the safety-critic is an optimal learner and the initial policy is -safe under , all new policies encountered during training will be in when trained on .

(7) |

###### Proof.

To start, recall that the safety critic , the (optimal) Q-function for the safety indicator , estimates the future discounted probability of failure under the optimal safety-constrained pre-trained policy:

(8) |

At a high level, we will show, by contradiction, that if we are masking actions for using this safety-critic, it cannot have a failure probability , since this would imply there is a state-action pair , where , and , breaking our construction of a -masked policy in Eq. 5.

First, we expand the equation for the failure probability for one time-step (letting be the state distribution induced by following ) to show that if the failure probability is larger than , this yields the contradiction.

(9) |

For the expected failure probability to exceed at some point , there must exist an action in the support of where the next state is unsafe. This leads the inner expectation (Eq. 9a) to yield a failure probability . This follows from Assumption 2, which also states the probability of such a transition is . In order for the cumulative discounted future probability , a failure state would therefore need to be reached after some step in the future. From Assumption 2, expanding the expected future failure probability under from yields

Next, let . Since actions that lead to unsafe states should have already been masked out by (from Equations 5, 8), if , then

this yields a contradiction, since by the definition of a masked policy (Eq. 5), .

Furthermore, under Assumption 3, we are guaranteed that a safe action always exists at every state in , which means that the safety-constrained policy can always find a safe alternate action where , where .

Therefore, under no conditions will the discounted future failure probability of being , assuming the policy does not start at an unsafe state , i.e. .

∎

This shows that following a policy with actions masked by the safety-critic limits the probability that the policy fails. The result holds even if the policy was trained separately from the safety-critic, since the masked policy will be at least as safe as the safety-critic’s actor. Lastly, our method of masking out unsafe actions offers a general way of incorporating safety constraints into any policy using the safety-critic (Fig. 6 illustrates this concept), and works as long as there are safe actions (i.e. ) in the support of that policy’s action distribution.
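For policies over continuous actions, one simple way to draw from a masked policy is rejection sampling: sample from the unconstrained policy and keep the first action the safety-critic accepts. The sketch below illustrates this idea under stated assumptions. The names `sample_safe_action`, `sample_action`, and `q_safe` are hypothetical, and the fallback to the least risky candidate after `max_tries` rejections is an assumption of this sketch rather than part of the analysis above.

```python
import random
from typing import Any, Callable, Optional


def sample_safe_action(
    sample_action: Callable[[Any, random.Random], Any],
    q_safe: Callable[[Any, Any], float],
    state: Any,
    eps: float,
    max_tries: int = 100,
    rng: Optional[random.Random] = None,
) -> Any:
    """Draw an action from the policy, rejecting those with q_safe >= eps."""
    if max_tries < 1:
        raise ValueError("max_tries must be at least 1")
    rng = rng or random.Random()
    best_action, best_q = None, float("inf")
    for _ in range(max_tries):
        action = sample_action(state, rng)   # proposal from the unconstrained policy
        risk = q_safe(state, action)         # estimated failure probability
        if risk < eps:
            return action                    # accepted: below the safety threshold
        if risk < best_q:
            best_action, best_q = action, risk
    # Assumed fallback: no proposal was accepted, so return the least risky one seen.
    return best_action
```

Accepted samples are distributed according to the policy restricted to the actions the critic deems safe, which is the masked policy up to normalization. The fallback branch is only reached when the policy places very little mass on safe actions.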

A corollary of Theorem 1 is that all iterates of the SQRL algorithm will be safe:

###### Corollary 1.1.

Let be the policies obtained by SQRL when using cost function , safety threshold , during the fine-tuning stage. All policy iterates have an expected safety cost of at most :

(10) |

###### Proof.

At each policy update step, the policies obtained by SQRL are returned from performing one-step updates on the previous step’s policy using the SQRL fine-tuning policy objective in Eq. 4. By construction, each policy iterate is obtained by masking via the current safety critic , which is computed by evaluating that :

(11) |

Applying Theorem 1, we obtain the desired result. ∎
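As an informal sanity check of the kind of bound stated in Corollary 1.1, the discounted failure probability of a policy can be computed exactly on a small tabular problem. The sketch below is only illustrative: the two-state MDP, the function name `failure_q`, and the particular recursion (failure counted when the next state is unsafe, with no further accumulation afterwards) are assumptions made for this example.

```python
from typing import List, Set, Tuple

# P[s][a] is a list of (probability, next_state) pairs.
Transitions = List[List[List[Tuple[float, int]]]]


def failure_q(P: Transitions, policy: List[List[float]], failure_states: Set[int],
              gamma: float, iters: int = 200) -> List[List[float]]:
    """Iteratively evaluate the discounted failure probability of `policy`."""
    n_states, n_actions = len(P), len(P[0])
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(iters):
        # Expected failure estimate at each state under the evaluated policy.
        v = [sum(policy[s][a] * q[s][a] for a in range(n_actions)) for s in range(n_states)]
        new_q = [[0.0] * n_actions for _ in range(n_states)]
        for s in range(n_states):
            for a in range(n_actions):
                total = 0.0
                for prob, s_next in P[s][a]:
                    if s_next in failure_states:
                        total += prob                      # failure reached on this step
                    else:
                        total += prob * gamma * v[s_next]  # discounted future failure
                new_q[s][a] = total
        q = new_q
    return q


# Toy MDP (hypothetical): state 0 is safe, state 1 is an absorbing failure state.
# Action 0 ("stay") remains in state 0; action 1 ("risky") fails half of the time.
P = [
    [[(1.0, 0)], [(0.5, 1), (0.5, 0)]],
    [[(1.0, 1)], [(1.0, 1)]],
]
FAILURE = {1}
uniform = [[0.5, 0.5], [0.5, 0.5]]
masked = [[1.0, 0.0], [1.0, 0.0]]  # "risky" removed at state 0

q_uniform = failure_q(P, uniform, FAILURE, gamma=0.5)
q_masked = failure_q(P, masked, FAILURE, gamma=0.5)
```

In this toy problem the uniform policy has failure estimates of 0.2 for "stay" and 0.6 for "risky" at state 0. With a threshold of 0.3, masking removes "risky", and the masked policy's failure estimate for its only remaining action drops to 0, which is below the threshold.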

## Appendix B Hyperparameters

In Table 2, we list the hyperparameters we used across environments, which were chosen via cross-validation.

General Parameters | |
---|---|
Learning rate | |
Number of pre-training steps | |
Number of fine-tuning steps | |
Layer size | 256 |
Number of hidden layers | 2 |
Hidden layer activation | ReLU |
Output activation | tanh |
Optimizer | Adam |
SQRL Parameters | |
Target Safety () | |
Safety Discount () | |
E-SAC Parameters | |
Number of Critics | |

DrunkSpider Parameters | |
---|---|
Episode length | |
Action scale | |
Action noise | |
Number of pre-training steps | |
Safety Discount () | |
Minitaur Parameters | |
Pre-training Goal Velocity | m/s |
Fine-tuning Goal Velocity | m/s |
Pre-training Goal Friction | |
Fine-tuning Goal Friction | |
Episode length | 500 |
Number of pre-training steps | |
CubeRotate Parameters | |
Fine-tuning Goal Rotation | ( quarter rotation) |
Episode length | 100 |
Number of pre-training steps | |

## Appendix C Environment Details

The MinitaurFriction and MinitaurVelocity tasks are identical in the pre-training phase, which involves reaching a desired target velocity of m/s at normal foot-friction. During fine-tuning, in the MinitaurFriction environment, the foot-friction is increased by , while in the MinitaurVelocity environment, the goal velocity is increased by (to m/s). The reward function used for this environment at each timestep is , where is the approximate joint acceleration, calculated as .

In the CubeRotate environment, the pre-training task is to reach one of four different rotated cube positions (up, down, left, and right) from a set of fixed starting positions, each of which corresponds to a single quarter turn (i.e. 90 degrees) from the starting position. We use a reward formulation identical to the one used in nagabandi2019deep. For the fine-tuning task, we offset the cube by an additional eighth turn in each of those directions, and evaluate how well it can get to the goal position.

## Appendix D Algorithm Implementation

In theory, our safety-critic is trained by estimating the Bellman target with respect to the safety-constrained policy, which is how we sample policy rollouts during safety-critic training. In practice, however, when computing the Bellman target we sample future actions from the unconstrained policy. This is because, at the beginning of training, the safety-critic is too pessimistic: it assumes most actions will fail and, as a result, falsely rejects most safe actions that would lead to increasing reward.
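The target computation described above can be sketched as follows. This is a minimal illustration, not our training code: the function and argument names are hypothetical, and the recursive form of the target (1 on failure, otherwise the discounted target-critic estimate at the next state) is an assumption of this sketch.

```python
import random
from typing import Any, Callable


def safety_bellman_target(
    failed: bool,
    next_state: Any,
    sample_unconstrained: Callable[[Any, random.Random], Any],
    target_q_safe: Callable[[Any, Any], float],
    gamma_safe: float,
    rng: random.Random,
) -> float:
    """Bellman target for the safety-critic on a single transition."""
    if failed:
        # The transition ended in a failure state: the target is certain failure.
        return 1.0
    # The next action comes from the *unconstrained* policy, not the masked one,
    # so an overly pessimistic critic does not restrict its own training targets.
    next_action = sample_unconstrained(next_state, rng)
    return gamma_safe * target_q_safe(next_state, next_action)
```

Rollouts are still collected with the safety-constrained policy; only the action used inside the target is drawn without masking.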

Another implementation detail is the use of a smaller “online” replay buffer, as mentioned in Algorithm 2. We use it so that the samples used to train the safety-critic are closer to on-policy, since in practice this appears to yield better estimates of the failure probabilities, which improves the performance of the safety-constrained policy.
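One way to realize such a buffer is a fixed-capacity queue that discards the oldest transitions, so that sampled batches come mostly from recent policies. The class below is a hedged sketch: the name `OnlineReplayBuffer`, its interface, and the choice of uniform sampling are assumptions and are not taken from Algorithm 2.

```python
import random
from collections import deque
from typing import Any, List


class OnlineReplayBuffer:
    """Small FIFO buffer that keeps only the most recent transitions."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        # A deque with maxlen drops the oldest item once capacity is reached.
        self._data: deque = deque(maxlen=capacity)

    def add(self, transition: Any) -> None:
        self._data.append(transition)

    def sample(self, batch_size: int, rng: random.Random) -> List[Any]:
        # Uniform sampling without replacement, capped at the buffer size.
        k = min(batch_size, len(self._data))
        return rng.sample(list(self._data), k)

    def __len__(self) -> int:
        return len(self._data)
```

A smaller capacity makes the sampled data closer to on-policy at the cost of less data reuse, so the capacity would need to be tuned per environment.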

## Appendix E Training Plots

In this section, we include additional training plots. We first show plots of the failures throughout fine-tuning in Figure 7, which illustrate how failures accumulate for different methods. We also include pre-training curves (Figs. 8, 9) for the three environments, along with an ablation plot (Fig. 9(a)) showing that learning during pre-training with our method is also more stable, especially as the safety-critic is trained to be pessimistic () or optimistic (), since it learns to explore safely while learning to perform the task. Each sharp drop in performance indicates that the agent experienced one or more early-reset conditions (due to unsafe behavior) during an episode of training.

We also include an ablation (Fig. 9(b)) showing how our method performs on the Minitaur environment during pre-training and fine-tuning (with a faster goal velocity) with and without online training of the safety-critic. During pre-training, the online failure rate is qualitatively more stable, with the agent falling less often on average, while during fine-tuning, the task reward is significantly greater, thanks to mitigating the issue of an overly pessimistic critic. Finally, as an additional environment-specific experiment, we demonstrate how a SQRL agent can learn safely with physics randomization added to the Minitaur environment tan2018sim, where foot friction and base and leg masses are randomized. This shows that our method can also safely transfer to environments where the dynamics have changed, suggesting future work where this approach can be applied to sim-to-real transfer.