Safety-aware Policy Optimisation for Autonomous Racing

10/14/2021 ∙ by Bingqing Chen, et al. ∙ Carnegie Mellon University ∙ University of California, San Diego

To be viable for safety-critical applications, such as autonomous driving and assistive robotics, autonomous agents should adhere to safety constraints throughout their interactions with the environment. Instead of learning about safety by collecting samples, including unsafe ones, methods such as Hamilton-Jacobi (HJ) reachability compute safe sets with theoretical guarantees using models of the system dynamics. However, HJ reachability is not scalable to high-dimensional systems, and the guarantees hinge on the quality of the model. In this work, we inject HJ reachability theory into the constrained Markov decision process (CMDP) framework, as a control-theoretical approach for safety analysis via model-free updates on state-action pairs. Furthermore, we demonstrate that the HJ safety value can be learned directly on vision context, the highest-dimensional problem studied via the method to-date. We evaluate our method on several benchmark tasks, including Safety Gym and Learn-to-Race (L2R), a recently-released high-fidelity autonomous racing environment. Our approach has significantly fewer constraint violations in comparison to other constrained RL baselines, and achieves new state-of-the-art results on the L2R benchmark task.


1 Introduction

Autonomous agents need to adhere to safe behaviours when interacting with their environment. In the context of safety-critical applications, such as autonomous driving and human-robot interaction, there is growing interest in learning policies that are simultaneously safe and performant. In the reinforcement learning (RL) literature, it is common to define safety as satisfying safety specifications (Ray et al., 2019a) under the constrained Markov decision process (CMDP) framework (Altman, 1999), which extends the Markov decision process (MDP) by incorporating constraints on expected cumulative costs. One challenge in solving a CMDP is the need to evaluate whether a policy will violate constraints (Achiam et al., 2017). Model-free methods depend on collecting diverse state-action pairs from the environment, including unsafe ones. As a result, safety is not guaranteed, most notably during the initial learning interactions (Cheng et al., 2019).

Given the practical limitations of learning about safety by collecting experiences of constraint violations or even failures, it may be favourable to leverage control theory and/or domain knowledge to bootstrap the learning process. Methods, such as Hamilton-Jacobi (HJ) reachability, compute safe sets with theoretical guarantees using models of the system dynamics. However, these guarantees hinge on the quality of the model, which may not capture the true dynamics. Due to scalability issues with HJ reachability, these models tend to be low-dimensional representations of the true system (up to 5D for offline computation, and 2D for online computation). Furthermore, existing works on HJ reachability exclusively study problems defined on physical states, e.g. poses, instead of high-dimensional sensory inputs, such as RGB images.

Building upon prior work on learning HJ safety value via model-free updates on state-action pairs (Fisac et al., 2019), we inject HJ reachability into the CMDP framework (Section 4). Since safety verification under HJ Reachability theory does not depend on the performance policy, we can bypass the challenges involved with solving a constrained optimisation problem with a neural policy, and naturally decompose the problem of learning under safety constraints into (a) optimising for performance, and (b) updating the safety value function. Given this intuition, we learn two policies that independently manage safety and performance (Figure 1): the performance policy focuses exclusively on optimising performance, while the safety critic verifies if the current state is safe and intervenes when necessary. Primarily focused on the application of autonomous racing, we refer to our approach as Safety-aware Policy Optimisation for Autonomous Racing (SPAR).

Aside from the problem formulation and corresponding framework, our key contributions are as follows. Firstly, we compare the HJ Bellman update rule (Fisac et al., 2019) to alternatives for learning a safety critic (Srinivasan et al., 2020; Bharadhwaj et al., 2020) on two classical control benchmarks, where safe vs. unsafe states are known analytically. Given the same off-policy samples, the HJ Bellman update rule learns safety more accurately and with greater sample efficiency. We also compare empirically how different implementations of the HJ Bellman update affect convergence.

Secondly, we demonstrate that the HJ safety value function can be learned directly on visual context, the highest-dimensional problem studied by HJ safety analysis to-date, thereby expanding the applications of HJ reachability to high-dimensional systems where a model may not be available.

Finally, we evaluate our methods on Safety Gym (Ray et al., 2019a) and Learn-to-Race (L2R) (Herman et al., 2021), a recently-released, high-fidelity autonomous racing environment, which challenges the agent to make safety-critical decisions in a complex and fast-changing environment. While SPAR is by no means free from failure, it has significantly fewer constraint violations compared to other constrained RL baselines in Safety Gym. We also report new state-of-the-art results on the L2R benchmark task, and show that incorporating a dynamically updating safety critic grounded in control theory boosts performance especially during the initial learning phase.

(a) SPAR Architecture
(b) Safety Critic
Figure 1: SPAR Overview. (a) By incorporating HJ reachability theory into the CMDP framework, we can decompose learning under safety constraints into optimising for performance and updating safety value function. Thus, SPAR consists of two policies, which are in charge of safety and performance independently. The safety controller only intervenes when the current state-action pair is deemed unsafe by the safety critic. (b) The safety critic is updated via HJ Bellman update and may optionally be warm-started via a nominal model.

2 Related Work

Constrained reinforcement learning. There is growing interest in enforcing some notion of safety in RL algorithms, e.g. satisfying safety constraints, avoiding worst-case outcomes, or being robust to environmental stochasticity (Garcıa and Fernández, 2015). We focus on the notion of safety as satisfying constraints. The CMDP (Altman, 1999) is a widely-used framework for studying RL under constraints, where the agent maximises cumulative rewards, subject to limits on cumulative costs characterising constraint violations. Solving a CMDP is challenging, because the policy needs to be optimised over the set of feasible policies. This requires off-policy evaluation of the constraint functions to determine whether a policy is feasible (Achiam et al., 2017). As a result, safety grows with experience, but this requires diverse state-action pairs, including unsafe ones (Srinivasan et al., 2020). Furthermore, one needs to solve a constrained optimisation problem with a non-convex neural policy. This may be implemented with techniques from convex optimisation, such as primal-dual updates (Bharadhwaj et al., 2020) and projection (Yang et al., 2020), or by upper-bounding the expected cost at each policy iteration (Achiam et al., 2017). Most relevant to our work are Bharadhwaj et al. (2020); Srinivasan et al. (2020); Thananjeyan et al. (2021), which also use a safety critic to verify whether a state is safe. We compare our control-theoretical learning rule with theirs in Section 5.1.

Guaranteed safe control. Guaranteeing the safety of general continuous nonlinear systems is challenging, but there are several approaches that have been successful. These methods typically rely on knowledge of the environment dynamics. Control barrier functions (CBFs) provide a measure of safety with gradients that inform the acceptable safe actions (Ames et al., 2019). For specific forms of dynamics, e.g. control-affine (Cheng et al., 2019), and unlimited actuation bounds, this approach can be scalable to higher-dimensional systems and can be paired with an efficient online quadratic program for computing the instantaneous control (Cheng et al., 2019). Unfortunately, finding a valid control barrier function for a general system is a nontrivial task. Lyapunov-based methods (Chow et al., 2018, 2019) suffer from the same limitation of requiring hand-crafted functions.

HJ reachability is a technique that uses continuous-time dynamic programming to directly compute a value function that captures the optimal safe control for a general nonlinear system (Bansal et al., 2017; Fisac et al., 2018). This method can provide hard safety guarantees for systems subject to bounded uncertainties and disturbances. There are two major drawbacks to HJ reachability. The first is that the technique suffers from the curse of dimensionality and scales exponentially with the number of states in the system. Because of this, the technique can only be used directly on systems of up to 4-5 dimensions. When using specific dynamics formulations and/or restricted controllers, this upper limit can be extended (Chen et al., 2018; Kousik et al., 2020). Second, because of this computational cost, the value function is typically computed offline based on assumed system dynamics and bounds on uncertainties. This can lead the safety analysis to be invalid or overly conservative.

There have been many attempts to inject some form of control theory into RL algorithms. In comparison to works that assume a specific problem structure (Cheng et al., 2019; Dean et al., 2019) or the existence of a nominal model (Cheng et al., 2019; Bastani, 2021), our proposed approach is applicable to general nonlinear systems and does not require a model. However, we do assume access to a distance metric defined on the state space. Our primary inspiration is recent work by Fisac et al. (2019) that connects HJ reachability with RL and introduces an HJ Bellman update, which can be applied to deep Q-learning for safety analysis. This method loses hard safety guarantees due to the neural approximation, but enables scalable learning of the safety value function. However, an agent trained using the method in Fisac et al. (2019) will focus exclusively on safety. Thus, we extend the method by formulating it within the CMDP framework, thereby enabling performance-driven learning.

Applications to autonomous racing. There is a large body of research on autonomous driving, predominantly focused on urban driving. However, racing presents its own unique set of challenges, e.g., making sub-second decisions under complex dynamics (Rhinehart et al., 2018). Existing open-source racing simulators, e.g., CarRacing-v0 (Brockman et al., 2016) and TORCS (39), lack realism both in terms of graphics and vehicular dynamics, which limits researchers' ability to effectively evaluate their algorithms. Florian et al. (2020) developed an interface to the Gran Turismo video game and trained their agents in it, but did not make their environment publicly available. Furthermore, their agents assume access to an unrealistic amount of privileged information. In this work, we report the first safe-learning results in the recently-introduced, high-fidelity, open-source Learn-to-Race autonomous racing environment (Herman et al., 2021). Apart from RL-based approaches, optimisation-based approaches have been used in works such as Liniger et al. (2015); Kabzan et al. (2019). We refer interested readers to Herman et al. (2021) for a recent and comprehensive review of autonomous racing.

3 Preliminaries

Constrained MDPs. The problem of RL with safety constraints is often formulated as a CMDP. On top of the MDP $(\mathcal{S}, \mathcal{A}, P, r)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P$ characterises the system dynamics, and $r$ is the reward function, a CMDP includes an additional set of cost functions, $c_1, \dots, c_m$, where each $c_i: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ maps state-action transitions to costs characterising constraint violations.

The objective of RL is to find a policy $\pi$ that maximises the expected cumulative reward, $J_r(\pi) = \mathbb{E}_{\tau \sim \pi}\big[\sum_{t} \gamma^t r(s_t, a_t)\big]$, where $\gamma \in (0, 1)$ is a temporal discount factor. Similarly, the expected cumulative costs are defined as $J_{c_i}(\pi) = \mathbb{E}_{\tau \sim \pi}\big[\sum_{t} \gamma^t c_i(s_t, a_t)\big]$. A CMDP requires the policy to be feasible by imposing a limit $d_i$ on each of the costs, i.e. $J_{c_i}(\pi) \le d_i$. Putting everything together, the RL problem in a CMDP is:

$$\max_{\pi} \; J_r(\pi) \quad \text{s.t.} \quad J_{c_i}(\pi) \le d_i, \quad i = 1, \dots, m \tag{1}$$
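For concreteness, a minimal sketch of how the discounted objective and cost in Eqn. 1 can be estimated from a single logged trajectory; the function names and the per-step reward/cost lists are illustrative assumptions, not part of any benchmark API.

```python
def discounted_sum(signal, gamma):
    """Discounted cumulative sum; used identically for rewards (J_r) and costs (J_c)."""
    total, discount = 0.0, 1.0
    for x in signal:
        total += discount * x
        discount *= gamma
    return total

def is_feasible(costs, gamma, budget):
    """A rollout satisfies the CMDP constraint when its discounted cost stays within the budget d_i."""
    return discounted_sum(costs, gamma) <= budget
```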

HJ Reachability. To generate the safety constraint, one can apply HJ reachability to a general nonlinear system model, denoted as $\dot{s} = f(s, u)$. Here $s \in \mathbb{R}^n$ is the state and $u$ is the control, contained within a compact set $\mathcal{U}$. The dynamics are assumed to be bounded and Lipschitz continuous. For discrete-time approximations, a time step $\Delta t$ is used.

We denote the set of all allowable states as $\mathcal{K}$, for which there exists a terminal reward $\ell(s)$ such that $s \in \mathcal{K} \Leftrightarrow \ell(s) \ge 0$. An $\ell(s)$ that satisfies this condition is the signed distance to the boundary of $\mathcal{K}$. Taking autonomous driving as an example, $\mathcal{K}$ is the drivable area and $\ell(s)$ is the shortest distance to the road boundary or an obstacle. This set is the complement of the failure set that must be avoided. The goal of this HJ reachability problem is to compute a safety value function that maps a state to its safety value with respect to $\mathcal{K}$ over time. This is done by capturing the minimum reward achieved over time by the system applying an optimal control policy:

$$V(s, t) = \sup_{u(\cdot)} \min_{\tau \in [t, T]} \ell\big(\xi^{u}_{s, t}(\tau)\big) \tag{2}$$

where $\xi^{u}_{s, t}(\cdot)$ is the state trajectory, $t$ is the initial time, and $T$ is the final time. To solve for this safety value function, a form of continuous dynamic programming is applied backwards in time from $T$ to $t$ using the Hamilton-Jacobi-Isaacs Variational Inequality (HJI-VI):

$$\min\Big\{ \ell(s) - V(s, t),\; \frac{\partial V(s, t)}{\partial t} + \max_{u \in \mathcal{U}} \nabla_s V(s, t) \cdot f(s, u) \Big\} = 0, \qquad V(s, T) = \ell(s) \tag{3}$$

The super-zero level set of this function, $\{ s : V(s, t) \ge 0 \}$, is called the reachable tube, and describes all states from which the system can remain outside of the failure set for the time horizon. For the infinite-time problem, if the limit exists, we define the converged value function as $V(s) := \lim_{T \to \infty} V(s, t)$.

Once the safety value function is computed, the optimal safe control can be found online by solving the Hamiltonian: $u^*(s) = \arg\max_{u \in \mathcal{U}} \nabla_s V(s) \cdot f(s, u)$. This safe control is typically applied in a least-restrictive way, wherein the safety controller becomes active only when the system approaches the boundary of the reachable tube, i.e., the system applies $u^*(s)$ when $V(s)$ is close to 0, and the task controller's action otherwise.

The recently introduced discounted safety Bellman equation (Fisac et al., 2019) modifies the HJI-VI in Eqn. 3 into a time-discounted formulation for discrete time:

$$V(s) = (1 - \gamma)\,\ell(s) + \gamma \min\Big\{ \ell(s),\; \max_{u \in \mathcal{U}} V(s') \Big\} \tag{4}$$

where $s'$ is the next state reached from $s$ under control $u$.

This formulation induces a contraction mapping, which enables convergence of the value function when applied to dynamic programming schemes commonly used in RL.
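To make the discrete-time backup concrete, here is a minimal sketch of the target in Eqn. 4; the function name and the array-based arguments are illustrative assumptions rather than code from the paper.

```python
import numpy as np

def discounted_safety_target(l_s, v_next_max, gamma):
    """Discounted safety Bellman backup (Eqn. 4).

    l_s        : l(s), signed distance of the current state to the unsafe boundary
    v_next_max : max over controls of the safety value at the next state
    gamma      : discount in (0, 1); annealing gamma -> 1 recovers the undiscounted value
    """
    return (1.0 - gamma) * l_s + gamma * np.minimum(l_s, v_next_max)
```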

4 SPAR: Safety-aware Policy Optimisation for Autonomous Racing

In this section, we describe our framework for safety-aware policy optimisation. We are inspired by guaranteed-safe methods, such as HJ reachability, which provides a systematic way to verify safety. Thus, we formulate our problem as a combination of constrained RL and HJ reachability theory, adopting a control-theoretical approach to learn safety. The need for an accurate model of the system dynamics can be restrictive. Building upon prior work on neural approximation of HJ Reachability (Fisac et al., 2019), we demonstrate that it is possible to directly update the safety value function on high-dimensional multimodal sensory input, thereby expanding the scope of applications to problems previously inaccessible. We highlight the notable aspects of our framework:

i) Injects control theory into RL. We incorporate HJ Reachability theory into the CMDP framework, thereby updating the safety critic in a control-theoretical manner. An unintended, but welcome outcome is that the original constrained optimisation problem is naturally decomposed into two unconstrained optimisation problems, making the problem more amenable to gradient-based learning.

ii) Scales to high-dimensional problems. Compared to standard HJ reachability methods, whose computational complexity scales exponentially with the state dimension, we update the safety value directly on a vision embedding using the neural approximation. This is the highest-dimensional problem studied via HJ reachability to-date.

Problem formulation. We inject HJ reachability theory into the CMDP framework. Starting with Eqn. 1, we can interpret the negative of a cost as a reward for safety and, without loss of generality, reverse the direction of the inequality constraint. Recall that the super-zero level set of the safety value function, i.e., $\{ s : V(s) \ge 0 \}$, designates all states from which the system can remain within the set of allowable states, $\mathcal{K}$, over an infinite time horizon. Thus, the safety value function derived from HJ reachability can be naturally embedded into the CMDP (Eqn. 5):

$$\max_{\pi} \; J_r(\pi) \quad \text{s.t.} \quad V(s_t) \ge \epsilon, \;\; \forall t \tag{5}$$

where $\epsilon \ge 0$ is a safety margin. A key difference from the original CMDP formulation (Eqn. 1) is that constraint satisfaction, $V(s_t) \ge \epsilon$, no longer depends on the policy, $\pi$. Thus, we can bypass the challenges of solving CMDPs (Section 2) and decompose learning under safety constraints into optimising for performance and updating the safety value estimate. While a number of works have a similar dual-policy architecture (Cheng et al., 2019; Bastani, 2021; Thananjeyan et al., 2021), our design is informed by HJ reachability theory. A downside of the formulation is that HJ reachability considers safety as absolute, and there isn't a mechanism to allow for some level of safety infractions.

Update of the Safety Value Function. For the update of the safety value function, we adopt the learning rule proposed by Fisac et al. (2019) (Eqn. 6). Note that $Q(s, a)$ is updated model-free using state-action transitions $(s, a, s')$, and only additionally requires $\ell(s)$, the shortest distance to the boundary of the allowable set $\mathcal{K}$.

$$Q(s, a) \leftarrow (1 - \gamma)\,\ell(s) + \gamma \min\Big\{ \ell(s),\; \max_{a'} Q(s', a') \Big\} \tag{6}$$

On top of the theoretical analysis in Fisac et al. (2019), we compare how common RL implementation techniques, including a delayed target network, clipped double Q-learning (Fujimoto et al., 2018), and baseline reduction (Schulman et al., 2015), affect convergence on two classical control benchmarks, the Double Integrator and Dubins' Car, and summarise the observations in Appendix A.2.

SPAR. We propose SPAR, which consists of a performance policy and a safety policy. The safety backup controller is applied in a least-restrictive way, only intervening when the RL agent is about to enter an unsafe state, i.e., the agent takes the safety action $\pi_{\text{safe}}(s)$ if $Q(s, \pi_{\text{perf}}(s)) \le \epsilon$ and the performance action $\pi_{\text{perf}}(s)$ otherwise. Thus, the agent enjoys the most freedom in safe exploration. The performance policy may be implemented with any RL algorithm. Since we expect the majority of samples to come from the performance policy, it is more appropriate to update the safety actor-critic with an off-policy algorithm. In this work, we base our implementation of the safety actor-critic on soft actor-critic (SAC) (Haarnoja et al., 2018). The safety critic is updated with Eqn. 6, where $a' \sim \pi_{\text{safe}}(\cdot \mid s')$. The safety actor is updated via the policy gradient through the safety critic, i.e., by maximising $\mathbb{E}_s\big[Q(s, \pi_{\text{safe}}(s))\big]$. The algorithm for SPAR is detailed in Appendix B.
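The sketch below illustrates the least-restrictive switching rule and one off-policy update of the safety actor-critic described above. It is a simplified PyTorch outline under our own naming and batch-layout assumptions, not the released SPAR implementation.

```python
import torch
import torch.nn.functional as F

def select_action(s, pi_perf, pi_safe, q_safe, eps):
    """Least-restrictive control: the safety actor overrides the performance
    action only when the safety critic deems the proposed action unsafe."""
    with torch.no_grad():
        u = pi_perf(s)                      # action proposed by the performance policy
        if q_safe(s, u).item() <= eps:      # predicted safety value below the margin
            u = pi_safe(s)                  # hand control to the safety actor
    return u

def update_safety_actor_critic(batch, q_safe, q_targ, pi_safe, pi_targ,
                               q_opt, pi_opt, gamma):
    """One update, assuming the replay batch stores (s, a, s', l(s)),
    with l(s) the signed distance to the boundary of the allowable set."""
    s, a, s_next, l_s = batch
    # Safety critic: HJ Bellman target (Eqn. 6), with a' drawn from the target safety actor.
    with torch.no_grad():
        y = (1 - gamma) * l_s + gamma * torch.minimum(l_s, q_targ(s_next, pi_targ(s_next)))
    critic_loss = F.mse_loss(q_safe(s, a), y)
    q_opt.zero_grad(); critic_loss.backward(); q_opt.step()
    # Safety actor: ascend the safety critic (deterministic policy gradient).
    actor_loss = -q_safe(s, pi_safe(s)).mean()
    pi_opt.zero_grad(); actor_loss.backward(); pi_opt.step()
```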

5 Experiments

We evaluate SPAR on three sets of benchmarks of increasing difficulty. While our intended application is autonomous racing, the first two sets of benchmarks can be considered abstractions of vehicles with the objective of avoiding obstacles and/or moving towards goals. Firstly, we evaluate on two classical control tasks where the safe vs. unsafe states are known analytically, and compare the HJ Bellman update used in SPAR to alternatives for learning safety critics in the literature. Secondly, we compare SPAR to constrained RL baselines in Safety Gym. Finally, we challenge SPAR in Learn-to-Race and conduct an ablation study to better understand how different components of SPAR contribute to its performance.

5.1 Experiment: Classical Control Benchmarks

(a) Double Integrator
(b) Dubins' Car
(c) Performance comparison of learning rules (averaged over 5 random seeds)
Figure 2: We use two classical control benchmarks, the double integrator and Dubins' car, to evaluate the performance of different learning rules for safety analysis. (a) shows the safety value function of the double integrator; the black line delineates the zero level set, within which the particle can remain within the allowable range on the x-axis. (b) shows the iso-surface of the safety value function at 0, i.e., $V(s) = 0$, for Dubins' Car, within which the car can reach a unit circle at the origin. The performance comparison is summarised in (c).

As mentioned earlier, safety critics have been trained in other works (Bharadhwaj et al., 2020; Srinivasan et al., 2020) with different learning rules. The objective here is to compare the HJ Bellman update with these alternatives. Thus, we focus on safety analysis with off-policy samples, and evaluate on two classical control benchmarks, the double integrator and Dubins' Car, where the safe / lively¹ states (Figures 2(a) and 2(b)) and the optimal safety controller are known analytically. [¹Liveness refers to the ability to reach the specified goal (Hsu et al., 2021).] The double integrator (Fisac et al., 2019) characterises a particle moving along the x-axis with velocity v. By controlling the acceleration, the objective is to keep the particle within a bounded range on the x-axis. Dubins' car (Bansal et al., 2017) is a simplified car model, where the car moves at a constant speed. By controlling the turning rate, the goal is to reach a unit circle regardless of heading. More information on the two tasks is provided in Appendix A.1.

In this experiment, we generate state-action pairs with a random policy, and evaluate the safety value function with respect to the optimal safety controller. In both Safety Q-functions for RL (SQRL) (Srinivasan et al., 2020) and Conservative Safety Critic (CSC) (Bharadhwaj et al., 2020), the safety critic is defined as the expected cumulative cost, i.e. $Q_c(s, a) = \mathbb{E}\big[\sum_t \gamma^t c(s_t)\big]$, where $c(s_t) = 1$ if a failure occurs at $s_t$ and 0 otherwise. In this case, both the environment and the optimal safety policy are deterministic. Thus, by definition, $Q_c$ should be 0 if $s$ is a safe state. SQRL uses the standard Bellman backup to propagate the failure signal. On top of that, CSC uses conservative Q-learning (CQL) (Kumar et al., 2020) to correct for the difference between the behaviour policy, i.e. the random policy, and the evaluation policy, i.e. the optimal safety policy, and overestimates $Q_c$ to err on the side of caution.
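For reference, a sketch of the cumulative-cost target that, in our reading of the cited works, underlies the SQRL/CSC safety critics; contrast it with the HJ backup in Eqn. 6, which propagates a signed distance rather than a binary failure flag. The function name and the termination-on-failure assumption are ours.

```python
def cumulative_cost_target(c, q_next, gamma_safe):
    """Bellman target for a failure-probability critic: c is a binary failure
    indicator at the current step, and episodes are assumed to end on failure,
    so the bootstrap term is masked out once c = 1."""
    return c + (1.0 - c) * gamma_safe * q_next
```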

Since the safe vs. unsafe states are known for these benchmark tasks, we can directly compare the performance of the safety critics learned with different learning rules (Figure 2(c)). While the theoretical cut-off for safe vs. unsafe states is 0, the performance of SQRL is very sensitive to the choice of the cut-off. Thus, we report AUROC instead. For both CQL and SQRL, we do a grid search around the hyperparameters used in the original papers and report the best results. The implementation details and additional results are included in Appendix A.3. Directly applying the Bellman update for safety analysis, as in SQRL, performs reasonably well on the double integrator, but does not on the more challenging Dubins' Car. In our experiment, CQL consistently under-performs SQRL. In comparison, the HJ Bellman update has an AUROC close to 1 on both tasks and has very small variance over different runs. It is worth noting that the result with the HJ Bellman update is achieved without explicitly addressing the distribution mismatch (Voloshin et al., 2019), which challenges off-policy evaluation problems. This experiment only compares the efficacy of the different learning rules for the safety critic given the same off-policy samples, and does not intend to compare other aspects of SQRL and CSC.

One caveat is that SQRL and CQL use a binary signal for failures, while the HJ Bellman update has access to the distance, $\ell(s)$. On the one hand, the HJ Bellman update does assume more information. On the other hand, it may be more practical to learn safety from distance measurements than from experiencing failures. Applied to autonomous driving, this translates to learning to avoid obstacles from distance measurements that are readily available on cars with assisted driving capabilities (BMW, 2021), rather than from experiencing collisions.

5.2 Experiment: Safety Gym

Figure 3: Performance of SPAR with comparison to baselines in the CarGoal1-v0 (top row) and PointGoal1-v0 (bottom row) benchmarks (averaged over 5 random seeds). In Goal tasks, agents must navigate to observed goal locations (indicated by the green regions), while avoiding obstacles (e.g., vases in cyan, and hazards in blue).

We additionally evaluate our proposed approach, SPAR, in Safety Gym (Ray et al., 2019b). Specifically, we evaluate on the standard CarGoal1-v0 and PointGoal1-v0 benchmarks, where the agent navigates to a goal while avoiding hazards. We compare SPAR against baselines including: Constrained Policy Optimisation (CPO) (Achiam et al., 2017), an unconstrained RL algorithm (Proximal Policy Optimisation (PPO) (Schulman et al., 2017)), and its Lagrangian variant (PPO-Lagrangian). By default, distance measurements from LiDAR are available in these benchmarks, and thus SPAR has direct access to $\ell(s)$. Episodic performance and cost curves are shown in Figure 3, and additional SPAR implementation details are included in Appendix C.
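As a rough sketch of how the LiDAR readings can be turned into the signed distance $\ell(s)$ required by the HJ Bellman update: Safety Gym's pseudo-LiDAR bins increase as hazards get closer, so one can invert the closest bin back to a metric distance. The helper below is an assumption for illustration; the exact conversion and scaling used in our experiments follow Appendix C.

```python
import numpy as np

def signed_distance_from_lidar(hazards_lidar, max_dist=3.0, hazard_radius=0.2):
    """Approximate l(s) from Safety Gym hazard LiDAR (bins read roughly
    max(0, 1 - dist / max_dist)); subtract the hazard radius so that
    l(s) <= 0 corresponds to entering a hazard."""
    closeness = float(np.max(hazards_lidar))   # closest hazard over all bins
    return (1.0 - closeness) * max_dist - hazard_radius
```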

PPO-SPAR has significantly fewer constraint violations compared to the other baselines, and the number of violations decreases over time. While CPO and PPO-Lagrangian take into account that a certain number of violations are permissible, there isn't a mechanism for that in SPAR, as HJ reachability theory defines safety in an absolute sense. The inability to allow for some level of safety infractions, unfortunately, compromises performance. The violations that do occur are the result of neural approximation error, and the number of violations decreases over time as the safety actor-critic gains experience, despite the constantly changing layout.

5.3 Experiment: Learn-to-Race

Task Overview. In this paper, we evaluate our approach using the Arrival Autonomous Racing Simulator, through the newly-introduced and OpenAI-gym compliant Learn-to-Race (L2R) task and evaluation framework (Herman et al., 2021). L2R provides multiple simulated racing tracks, modelled after real-world counterparts, such as Thruxton Circuit in the UK (Track01:Thruxton; see Figure 4). L2R provides access to RGB images from any specified location, semantic segmentation, and vehicle states (e.g., pose, velocity). In each episode, an agent is spawned on the selected track. At each time-step, it uses its observations to determine normalised steering angle and acceleration. All learning-based agents receive the reward specified by L2R, which is formulated as a weighted sum of reward for driving fast and penalty for leaving the drivable area; the main objective is to complete laps in as little time as possible. Additional metrics are defined to evaluate driving quality.

(a) Aerial
(b) Third-person
(c) Ego-view
Figure 4: We use the Learn-to-Race (L2R) framework (Herman et al., 2021) for evaluation; this environment provides simulated racing tracks that are modelled after real-world counterparts, such as the famed Thruxton Circuit in the UK (Track01:Thruxton, (a)). Here, learning-based agents can be trained and evaluated according to challenging metrics and realistic vehicle and environmental dynamics, making L2R a compelling target for safe reinforcement learning. Each track features challenging components for autonomous agents, such as sharp turns (shown in (b)), where SPAR only uses ego-camera views (shown in (c)) and speed.

Implementation Details. To characterise the performance of our approach, we report results on the Average Adjusted Track Speed (AATS) and the Episode Completion Percentage (ECP) metrics (Herman et al., 2021) as proxies for agent performance and safety, respectively. For reference, one lap in Track01:Thruxton is 3.8 km, whereas CARLA, the de facto environment for autonomous driving research, has a total of 4.3 km of drivable roads in the original benchmark (Codevilla et al., 2019). Thus, successfully completing an episode, i.e. a lap, is very challenging. Agents' results on other metrics of driving quality, as defined by the L2R environment, are presented in Appendix F.

We use Track01:Thruxton in L2R (Figure 4) for all stages of agent interaction with the environment. During training, the agent is spawned at random locations along the race track and uses a stochastic policy. During evaluation, the agent is spawned at a fixed location and uses a deterministic policy. The episode terminates when the agent successfully finishes a lap, leaves the drivable area, collides with obstacles, or does not progress for a number of steps. For each agent, we report averaged results across 5 random seeds, evaluated every 5000 steps over an episode, i.e., one lap. We use SAC as the performance policy, and all agents only have access to the ego-camera view (Figure 4(c)) and speed, unless specified otherwise. The implementation, including network architecture and hyperparameters, is detailed in Appendix E.

Static Safety Actor-Critic from a Nominal Model. To demonstrate the benefit of utilising domain knowledge in the form of a nominal model, and to compare with the learnable safety actor-critic in SPAR, we use the kinematic vehicle model (Kong et al., 2015) (see Figure 5(a)), which is a significant simplification of a realistic race car model (Kabzan et al., 2019), to compute the safety value and the corresponding 'optimal' (with respect to the nominal model only) safety controller. The dynamics and 'optimal' safety control are given in Eqn. 7, where the state is $s = [x, y, v, \phi]^\top$ and the action is $u = [a, \delta]^\top$. Here $(x, y)$, $v$, and $\phi$ are the vehicle's location, speed, and yaw angle; $a$ is the acceleration, $\delta$ is the steering angle, and $L$ is the car length in metres. Intuitively, the 'optimal' safety policy brakes and steers towards the centre of the track as much as possible. The derivation of the safety policy is provided in Appendix D. We calculated the backward reachable tube using the code from Giovanis et al. (2021). Figure 5(b) illustrates the resulting safety value function at slices of the state space, as the agent enters a sharp turn.

(a) Nominal model
(b) Safety value computed via the nominal model, where v = 12 m/s
Figure 5: (a) We compute the safety value function via a kinematic vehicle model. (b) We illustrate different views of the 4D state space, given fixed velocity and three different yaw angles, indicated by the blue arrows.

$$\dot{x} = v\cos\phi, \quad \dot{y} = v\sin\phi, \quad \dot{v} = a, \quad \dot{\phi} = \frac{v}{L}\tan\delta; \qquad a^* = \bar{a}\,\mathrm{sign}\!\left(\frac{\partial V}{\partial v}\right), \quad \delta^* = \bar{\delta}\,\mathrm{sign}\!\left(\frac{\partial V}{\partial \phi}\right) \tag{7}$$
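A small numerical sketch of the nominal model and the bang-bang safety action implied by Eqn. 7. The car length, actuation limits, and time step are placeholders rather than the values used in the paper, and the safety-value gradient is assumed to be read off the precomputed grid.

```python
import numpy as np

def kinematic_step(state, action, dt=0.01, car_length=3.0):
    """One Euler step of the kinematic vehicle model (state = [x, y, v, yaw],
    action = [acceleration, steering angle])."""
    x, y, v, yaw = state
    accel, steer = action
    return np.array([
        x + dt * v * np.cos(yaw),
        y + dt * v * np.sin(yaw),
        v + dt * accel,
        yaw + dt * (v / car_length) * np.tan(steer),
    ])

def nominal_safety_action(grad_V, a_max, steer_max):
    """'Optimal' safety control maximising the Hamiltonian: bang-bang in both
    inputs, i.e. full braking and maximum steering towards increasing safety
    value. grad_V is the gradient of V at the current state [x, y, v, yaw]."""
    accel = a_max * np.sign(grad_V[2])      # in practice: full brake
    steer = steer_max * np.sign(grad_V[3])  # steer towards the track centre
    return np.array([accel, steer])
```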

We assume the static actor-critic have access to vehicle poses in order to evaluate safety value and determine the safety action. We evaluate the performance of this static actor-critic by coupling a random agent with it (SafeRandom). We test SafeRandom on a series of safety margins to account for unmodelled dynamics; the performance averaged over 10 random seeds is summarised in Figure F.1. For instance, achieves 80+% ECP. This high safety performance, in comparison to 0.5% ECP by Random agent showcase the benefit of utilising domain knowledge.

Ablation Study. We conduct an ablation study to better understand how different components of SPAR contribute to its performance. We examine the effect of imposing safety constraints on performance and sample efficiency by comparing the SAC agent with an instance of itself coupled with the static safety actor-critic (SafeSAC). We set the safety margin to 4.2, based on empirical results from SafeRandom. We also compare the performance of the static safety actor-critic (SafeSAC) and a learnable one (SPAR). Since SPAR is expected to have a better characterisation of the safety value, the agent no longer depends on a large safety margin to remain safe; thus the SPAR agent uses a safety margin of 3.0 m, which accounts for the dimension of the vehicle (the HJ reachable tube is computed with respect to the back axle of the vehicle and does not account for its physical dimension).

Figure 6: Left: Episode percent completion and Right: speed evaluated every 5000 steps over an episode (a single lap) and averaged over 5 random seeds. Results reported based on Track01:Thruxton in L2R.

Results. The performance comparison between different agents is summarised in Figure 6.

The static safety actor-critic significantly boosts initial safety performance. With the help of the static safety actor-critic, SafeSAC can complete close to 80% of a lap, in comparison to slightly more than 5% with SAC. This, again, showcases that injecting domain knowledge in the form of a nominal model is extremely beneficial to safety performance, especially in the initial learning phase. However, there are two notable limitations with the static safety controller. Firstly, it is extremely conservative, braking whenever the vehicle becomes less safe. As a result, the SafeSAC agent has an initial speed of less than 10 km/h. Secondly, as the SAC learns to avoid activating the safety controller and to drive faster, the static safety controller is no longer able to recover the vehicle from marginally safe states. In fact, by applying the 'optimal' safety action from Eqn. 7, i.e., maximum brake and steer, the vehicle will lose traction and spin out of control. As a result, the ECP actually decreases over time for SafeSAC.

SPAR learns safety directly from vision context and can recover from marginally safe states more smoothly. Having a safety actor-critic dedicated to learning about safety significantly boosted the initial safety performance of SPAR in comparison to the SAC agent, even though the safety actor-critic is randomly initialised to show that the safety value function can be learned from scratch on the vision embedding. In practice, we envision the safety actor-critic being warm-started with the nominal model, and fine-tuned with observations from the environment. Furthermore, the learnable safety actor-critic can recover from marginally safe states more smoothly. A qualitative comparison of such behaviours is available at the video link. While SPAR outperforms other baselines, there is still a significant performance gap with humans, as the speed record at Thruxton Circuit is 237 km/h (average speed).

6 Conclusion

In this paper, we incorporate HJ reachability theory into the CMDP framework as a principled approach to learning about safety. As a result of the problem formulation, we effectively decompose the problem of learning under safety constraints into two more-tractable sub-tasks: optimising for performance and updating the safety value. We show on two classical control benchmarks that the HJ Bellman update is more effective than alternatives for learning the safety critic. Compared to constrained RL baselines in Safety Gym, SPAR has significantly fewer constraint violations. Finally, we report new state-of-the-art results on Learn-to-Race. We demonstrate that the HJ safety value can be learned directly on visual context, thereby expanding HJ reachability to broader applications.

Whereas our empirical results demonstrated that it is possible to learn a safety-aware and performant policy, SPAR is by no means free from failure. However, the method proposed in this paper represents a subtle shift away from constraint-satisfaction exclusively through model-free learning, as has become popular in recent literature. Rather than letting agents learn safe behaviours through experiencing failures, our approach provides potential avenues for online safety analysis, through the injection of domain knowledge (e.g. a nominal model), and by informing the learning rule with control theory.

References

  • J. Achiam, D. Held, A. Tamar, and P. Abbeel (2017) Constrained policy optimization. In International Conference on Machine Learning, pp. 22–31. Cited by: §1, §2, §5.2.
  • J. Achiam (2018) Spinning Up in Deep Reinforcement Learning. Cited by: Appendix E.
  • E. Altman (1999) Constrained markov decision processes. Vol. 7, CRC Press. Cited by: §1, §2.
  • A. D. Ames, S. Coogan, M. Egerstedt, G. Notomista, K. Sreenath, and P. Tabuada (2019) Control barrier functions: theory and applications. In 2019 18th European Control Conference (ECC), pp. 3420–3431. Cited by: §2.
  • S. Bansal, M. Chen, S. Herbert, and C. J. Tomlin (2017) Hamilton-jacobi reachability: a brief overview and recent advances. In 2017 IEEE 56th Annual Conference on Decision and Control (CDC), pp. 2242–2253. Cited by: §A.1, §2, §5.1.
  • O. Bastani (2021) Safe reinforcement learning with nonlinear dynamics via model predictive shielding. In 2021 American Control Conference (ACC), pp. 3488–3494. Cited by: §2, §4.
  • H. Bharadhwaj, A. Kumar, N. Rhinehart, S. Levine, F. Shkurti, and A. Garg (2020) Conservative safety critics for exploration. arXiv preprint arXiv:2010.14497. Cited by: §A.3, §A.3, §1, §2, §5.1, §5.1.
  • BMW (2021) Automotive sensors – the sense organs of driver assistance systems. BMW. External Links: Link Cited by: §5.1.
  • G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) OpenAI gym. External Links: 1606.01540 Cited by: §2.
  • M. Chen, S. L. Herbert, M. S. Vashishtha, S. Bansal, and C. J. Tomlin (2018) Decomposition of reachable sets and tubes for a class of nonlinear systems. IEEE Transactions on Automatic Control 63 (11), pp. 3675–3688. Cited by: §2.
  • R. Cheng, G. Orosz, R. M. Murray, and J. W. Burdick (2019) End-to-end safe reinforcement learning through barrier functions for safety-critical continuous control tasks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 3387–3395. Cited by: §1, §2, §2, §4.
  • Y. Chow, O. Nachum, E. Duenez-Guzman, and M. Ghavamzadeh (2018) A lyapunov-based approach to safe reinforcement learning. arXiv preprint arXiv:1805.07708. Cited by: §2.
  • Y. Chow, O. Nachum, A. Faust, E. Duenez-Guzman, and M. Ghavamzadeh (2019) Lyapunov-based safe policy optimization for continuous control. arXiv preprint arXiv:1901.10031. Cited by: §2.
  • F. Codevilla, E. Santana, A. M. López, and A. Gaidon (2019) Exploring the limitations of behavior cloning for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9329–9338. Cited by: §5.3.
  • S. Dean, S. Tu, N. Matni, and B. Recht (2019) Safely learning to control the constrained linear quadratic regulator. In 2019 American Control Conference (ACC), pp. 5582–5588. Cited by: §2.
  • J. F. Fisac, A. K. Akametalu, M. N. Zeilinger, S. Kaynama, J. Gillula, and C. J. Tomlin (2018) A general safety framework for learning-based control in uncertain robotic systems. IEEE Transactions on Automatic Control 64 (7), pp. 2737–2752. Cited by: §2.
  • J. F. Fisac, N. F. Lugovoy, V. Rubies-Royo, S. Ghosh, and C. J. Tomlin (2019) Bridging hamilton-jacobi safety analysis and reinforcement learning. In 2019 International Conference on Robotics and Automation (ICRA), pp. 8550–8556. Cited by: Appendix A, Appendix C, Appendix E, §1, §1, §2, §3, §4, §4, §5.1.
  • F. Florian, S. Yunlong, E. Kaufmann, D. Scaramuzza, and P. Duerr (2020) Super-human performance in gran turismo sport using deep reinforcement learning. External Links: 2008.07971 Cited by: §2.
  • S. Fujimoto, H. Hoof, and D. Meger (2018) Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pp. 1587–1596. Cited by: 1st item, 2nd item, §A.2, §4.
  • J. Garcıa and F. Fernández (2015) A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research 16 (1), pp. 1437–1480. Cited by: §2.
  • G. Giovanis, M. Lu, and M. Chen (2021) Optimizing dynamic programming-based algorithms. GitHub. Note: https://github.com/SFU-MARS/optimized_dp Cited by: §5.3.
  • T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm Sweden, pp. 1861–1870. External Links: Link Cited by: 2nd item.
  • T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pp. 1861–1870. Cited by: §4.
  • J. Herman, J. Francis, S. Ganju, B. Chen, A. Koul, A. Gupta, A. Skabelkin, I. Zhukov, M. Kumskoy, and E. Nyberg (2021) Learn-to-race: a multimodal control environment for autonomous racing. arXiv preprint arXiv:2103.11575. Cited by: Appendix E, Table F.1, Table F.2, Appendix F, §1, §2, Figure 4, §5.3, §5.3.
  • K. Hsu, V. Rubies-Royo, C. J. Tomlin, and J. F. Fisac (2021) Safety and liveness guarantees through reach-avoid reinforcement learning. Cited by: footnote 1.
  • J. Kabzan, L. Hewing, A. Liniger, and M. N. Zeilinger (2019) Learning-based model predictive control for autonomous racing. IEEE Robotics and Automation Letters 4 (4), pp. 3363–3370. Cited by: §2, §5.3.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §A.1, Appendix E.
  • J. Kong, M. Pfeiffer, G. Schildbach, and F. Borrelli (2015) Kinematic and dynamic vehicle models for autonomous driving control design. In 2015 IEEE Intelligent Vehicles Symposium (IV), pp. 1094–1099. Cited by: §5.3.
  • S. Kousik, S. Vaskov, F. Bu, M. Johnson-Roberson, and R. Vasudevan (2020) Bridging the gap between safety and real-time performance in receding-horizon trajectory design for mobile robots. The International Journal of Robotics Research 39 (12), pp. 1419–1469. Cited by: §2.
  • A. Kumar, A. Zhou, G. Tucker, and S. Levine (2020) Conservative q-learning for offline reinforcement learning. arXiv preprint arXiv:2006.04779. Cited by: §A.3, §5.1.
  • A. Liniger, A. Domahidi, and M. Morari (2015) Optimization-based autonomous racing of 1: 43 scale rc cars. Optimal Control Applications and Methods 36 (5), pp. 628–647. Cited by: §2.
  • A. Ray, J. Achiam, and D. Amodei (2019a) Benchmarking Safe Exploration in Deep Reinforcement Learning. Cited by: §1, §1.
  • A. Ray, J. Achiam, and D. Amodei (2019b) Benchmarking safe exploration in deep reinforcement learning. arXiv preprint arXiv:1910.01708. Cited by: §5.2.
  • N. Rhinehart, R. McAllister, and S. Levine (2018) Deep imitative models for flexible inference, planning, and control. arXiv preprint arXiv:1810.06544. Cited by: §2.
  • J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel (2015) High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438. Cited by: §4.
  • J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §5.2.
  • K. Srinivasan, B. Eysenbach, S. Ha, J. Tan, and C. Finn (2020) Learning to be safe: deep rl with a safety critic. arXiv preprint arXiv:2010.14603. Cited by: §A.3, §A.3, §1, §2, §5.1, §5.1.
  • B. Thananjeyan, A. Balakrishna, S. Nair, M. Luo, K. Srinivasan, M. Hwang, J. E. Gonzalez, J. Ibarz, C. Finn, and K. Goldberg (2021) Recovery rl: safe reinforcement learning with learned recovery zones. IEEE Robotics and Automation Letters 6 (3), pp. 4915–4922. Cited by: §A.3, §2, §4.
  • [39] TORCS, the open racing car simulator. Note: http://torcs.sourceforge.net/index.php?name=Sections&op=viewarticle&artid=19. Last accessed: 2021-01-30. Cited by: §2.
  • C. Voloshin, H. M. Le, N. Jiang, and Y. Yue (2019) Empirical study of off-policy policy evaluation for reinforcement learning. arXiv preprint arXiv:1911.06854. Cited by: §5.1.
  • T. Yang, J. Rosca, K. Narasimhan, and P. J. Ramadge (2020) Projection-based constrained policy optimization. arXiv preprint arXiv:2010.03152. Cited by: §2.

Appendix A Classical Control Benchmarks

The objective of this section is 1) to examine how different implementation details of the learning rule proposed by Fisac et al. (2019), i.e., $Q(s, a) \leftarrow (1 - \gamma)\ell(s) + \gamma\min\{\ell(s), \max_{a'}Q(s', a')\}$, affect convergence, and 2) to compare the learning rule with alternatives for learning the safety value function. We evaluate it on two classical control benchmarks, the Double Integrator and Dubins' Car, as described in Section A.1, where the analytical solutions for the safe states and optimal safe actions are known. Since the optimal safe actions are known, we implement the learning rule here with $a'$ set to the known optimal safe action. In comparison, in the general case, where the optimal safety policy is not known, we sample $a'$ from the safety actor, $\pi_{\theta}$, and update the safety actor via the policy gradient through the safety critic, i.e., by maximising $Q(s, \pi_{\theta}(s))$.

A.1 Model Dynamics

Double Integrator. The double integrator models a particle moving along the x-axis at velocity $v$. The control input is the acceleration $u$. The goal in this case is to keep the particle within a fixed boundary on the x-axis, subject to bounded acceleration $|u| \le \bar{u}$.

$$\dot{x} = v, \qquad \dot{v} = u \tag{A.1}$$

By solving the Hamiltonian, i.e., $u^* = \arg\max_{u} \nabla_s V(s) \cdot f(s, u)$, we can get the optimal safe control as:

$$u^* = \bar{u}\,\mathrm{sign}\!\left(\frac{\partial V}{\partial v}\right) \tag{A.2}$$

Dubins' Car. The Dubins' car models a vehicle moving at a constant speed $v$. Similar to the kinematic vehicle model, $(x, y, \theta)$ describes the position and heading of the vehicle, and the control input is the turning rate $\omega$, with $|\omega| \le \bar{\omega}$. The goal is to reach a unit circle centred at the origin.

$$\dot{x} = v\cos\theta, \qquad \dot{y} = v\sin\theta, \qquad \dot{\theta} = \omega \tag{A.3}$$

Note that Dubins' Car is a reach task, i.e. reaching a specified goal, instead of an avoid task, i.e. avoiding specified obstacles. The reach task can be implemented simply by taking the minimum, rather than the maximum, over controls in the Hamiltonian (Bansal et al., 2017). In other words, the optimal safe action for a given state is the one that minimises the distance to the goal. The corresponding optimal safe control is

$$\omega^* = -\bar{\omega}\,\mathrm{sign}\!\left(\frac{\partial V}{\partial \theta}\right) \tag{A.4}$$

The ground truth safety value functions are shown in Figures 2(a) and 2(b).

Implementation & Evaluation.

We use a neural network with hidden layers of size [16, 16] for the double integrator and [64, 64, 32] for Dubins' car. We use ADAM (Kingma and Ba, 2014) as the optimiser, with a learning rate of 0.001 and a batch size of 64. We update the safety value function over 25K steps for the Double Integrator and 50K steps for Dubins' Car, and report classification accuracy every 1000 steps, averaged over 5 random seeds. While the safety value is defined over a continuous state space, we evaluate the performance over a discrete mesh on the state space. By definition, the safety value at a given state is $V(s) = \max_{a} Q(s, a)$, where the maximum is taken over the discretised action set.
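A sketch of the mesh-based evaluation described above, assuming a discretised action set and a critic that takes (state, action) batches; the helper and its signature are illustrative, not the evaluation code used for the reported numbers.

```python
import numpy as np
import torch

def safe_set_accuracy(q_net, actions, mesh_states, ground_truth_safe):
    """Classify each mesh state as safe iff max_a Q(s, a) >= 0 and compare
    against the analytic safe set (a boolean array of the same length)."""
    with torch.no_grad():
        s = torch.as_tensor(mesh_states, dtype=torch.float32)   # (N, state_dim)
        a = torch.as_tensor(actions, dtype=torch.float32)       # (M, action_dim)
        q_values = []
        for a_i in a:                                  # evaluate each discrete action
            a_rep = a_i.repeat(len(s), 1)              # tile the action over all states
            q_values.append(q_net(s, a_rep).reshape(-1))
        v = torch.stack(q_values, dim=0).max(dim=0).values.numpy()
    return float(((v >= 0.0) == ground_truth_safe).mean())
```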

Qualitative Results. A qualitative comparison between the ground truth value and that learned via the HJ Bellman update is shown in Figures A.1 and A.2. As we can see, the neural approximation largely recovers the ground truth value, except for minute differences.

Figure A.1: A comparison between the ground truth safety value and that learned via the HJ Bellman update for the double integrator; the black line delineates the zero level set.
(a)
(b)
Figure A.2: A comparison between isosurface of the ground truth safety value (blue) and that learned via HJ Bellman update (green) for Dubins’ car

While we do not need to learn the safety actor in this case, we further demonstrate that the safety critic can indeed be used to update the safety actor. In Figure A.3, we compare the ground truth $\mathrm{sign}(\partial V / \partial v)$, which indicates the optimal safe action (Eqn. A.2), with the gradient through the safety critic, i.e., $\nabla_{a} Q(s, a)$. We can see that $\nabla_{a} Q(s, a)$ consistently points towards the correct optimal safe action within the safe set, i.e., the area delineated by the black line. The safety value outside the safe set is probably not learned well because the episodes terminate upon failure.

Figure A.3: The gradients through the safety critic, i.e., $\nabla_{a} Q(s, a)$, consistently point towards the correct optimal safe action, as indicated by $\mathrm{sign}(\partial V / \partial v)$ (Eqn. A.2), within the safe set (the area delineated by the black line) for the double integrator.

A.2 Comparison of Implementation Details

On a high level, we are interested in how common RL implementation techniques affect convergence. The specific questions we attempt to answer are:

  • Is a slow-moving target network necessary for convergence? It is common practice in RL algorithms to keep a copy of a slow-moving target network for stability (Fujimoto et al., 2018), as the circular dependency between the value estimate and the policy results in an accumulation of residual error and divergent updates. The target network is commonly updated with $\bar{\theta} \leftarrow \tau\theta + (1 - \tau)\bar{\theta}$, where $\tau$ is a small number in (0, 1] (a minimal Polyak-update sketch follows this list). We examine how different values of $\tau$ affect convergence.

  • Is the Clipped Double Q-learning technique conducive to learning the safety value? The Clipped Double Q-learning was popularised by TD3 (Fujimoto et al., 2018) and also adopted in SAC (Haarnoja et al., 2018). The technique addresses the overestimation of the value function by keeping two value networks, and computing the Bellman backup with the smaller estimate of the two. We examine if the same technique is conducive to learning the safety value.

  • Is the baseline technique helpful? Subtracting an action-independent baseline from the value function is commonly used for variance reduction in policy gradient algorithms. We expect that by choosing $\ell(s)$ as the baseline, the safety critic can focus on learning whether an action increases or decreases the distance to the obstacle / goal from the current state.
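For reference, the Polyak averaging referred to in the first question, as a minimal PyTorch sketch (the function name is ours):

```python
import torch

@torch.no_grad()
def polyak_update(net, target_net, tau):
    """Soft target update: theta_target <- tau * theta + (1 - tau) * theta_target.
    tau = 1 copies the online network, i.e. no slow-moving target at all."""
    for p, p_targ in zip(net.parameters(), target_net.parameters()):
        p_targ.mul_(1.0 - tau).add_(tau * p)
```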

A slow-moving target network is not necessary for convergence. We compare the performance at several values of $\tau$. As we can see in Figure A.4, convergence was barely affected by the choice of $\tau$, and the safety value converged without problem using $\tau = 1$, which is equivalent to not keeping a target network at all. We hypothesise that this is because the actions come from the behaviour policy, and are thus independent of the safety critic. This removes the circular dependency between value estimation and policy, and is consistent with the observation in Fujimoto et al. (2018). In practice, we also expect the majority of the actions to come from the performance policy.

Since one setting of $\tau$ does slightly better on Dubins' Car, we use it for the rest of the experiments.

Figure A.4: Performance comparison under different values of $\tau$

Clipped Double Q-learning is unnecessary. In Figure A.5, we provide a comparison of vanilla DQN vs. clipped double Q-learning, which uses the minimum of the two estimates from the Q-networks to calculate the target value. Aside from using the minimum of the two estimates as in the regular case, we also implement a version that uses the maximum of the estimates. In neither case did the clipped double Q-learning technique consistently improve performance. We hypothesise that the HJ Bellman backup already clips the target with $\ell(s)$, and thus overestimation error is no longer a concern.

Figure A.5: Performance comparison with clipped double Q-learning

The baseline technique does not have a significant impact on the accuracy of the safety critic. We consider updating the safety critic with the baselined version of the safety Q-value, i.e. $\tilde{Q}(s, a) := Q(s, a) - \ell(s)$. The corresponding update rule is given by Eqn. A.5. Contrary to our expectation, using the baseline technique does not have a significant impact on the accuracy of the safety critic (Figure A.6).

$$\tilde{Q}(s, a) \leftarrow (1 - \gamma)\,\ell(s) + \gamma\min\Big\{\ell(s),\; \max_{a'}\big[\tilde{Q}(s', a') + \ell(s')\big]\Big\} - \ell(s) \tag{A.5}$$
Figure A.6: Performance comparison with and without the baseline

To summarise our observations,

  • The target network for the safety critic may be updated more rapidly, as we expect the majority of actions to come from the performance policy, removing the circular dependency between value estimation and policy.

  • Clipped Double Q-learning is not conducive to learning the safety value function, as the HJ Bellman backup already clips the target with $\ell(s)$; therefore, overestimation error does not appear to be a concern.

  • Contrary to our expectation, subtracting an action-independent baseline from the value function does not have a significant impact on learning the safety critic.

A.3 Comparison of Learning Rules for Safety Critic

Firstly, we describe the approaches pertaining to learning the safety critic in Srinivasan et al. (2020) and Bharadhwaj et al. (2020). In both Safety Q-functions for RL (SQRL) (Srinivasan et al., 2020) and Conservative Safety Critic (CSC) (Bharadhwaj et al., 2020), the safety critic is defined as the expected cumulative cost, i.e. $Q_c(s, a) = \mathbb{E}\big[\sum_t \gamma^t c(s_t)\big]$, where $c(s_t) = 1$ if a failure occurs at $s_t$ and 0 otherwise. Both papers endow the safety critic with a probabilistic interpretation, i.e. the expected probability of failure.

SQRL. The safety critic is trained by propagating the failure signal using the standard Bellman backup, as in Eqn. A.6, where $\mathcal{D}$ denotes the replay memory, $\gamma$ is a time-discount parameter, and $\bar{Q}_c$ is the delayed target network. This approach for learning the safety critic is also adopted in Thananjeyan et al. (2021).

$$\mathcal{L}(Q_c) = \mathbb{E}_{(s, a, c, s') \sim \mathcal{D}}\Big[\big(Q_c(s, a) - c - (1 - c)\,\gamma\,\mathbb{E}_{a' \sim \pi}\big[\bar{Q}_c(s', a')\big]\big)^2\Big] \tag{A.6}$$

CSC. On top of using the Bellman backup to propagate the failure signals, CSC uses conservative Q-learning (CQL) (Kumar et al., 2020) to correct for the distribution mismatch between the behaviour policy and the evaluation policy, and overestimates $Q_c$ to err on the side of caution. The resulting objective is given in Eqn. A.7, where $\mathcal{B}$ is the Bellman operator and $\alpha$ is a hyperparameter that controls the extent of conservativeness. If $\alpha = 0$, the objective is the same as that of SQRL.

$$\mathcal{L}(Q_c) = \mathbb{E}_{\mathcal{D}}\Big[\big(Q_c(s, a) - \mathcal{B}\bar{Q}_c(s, a)\big)^2\Big] - \alpha\Big(\mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi}\big[Q_c(s, a)\big] - \mathbb{E}_{(s, a) \sim \mathcal{D}}\big[Q_c(s, a)\big]\Big) \tag{A.7}$$

Note that CSC reversed the sign in front of the CQL regulariser compared to the original implementation in CQL, so as to over-estimate $Q_c$. This learning objective does not guarantee point-wise conservativeness, but conservativeness in expectation, i.e. $\mathbb{E}_{a \sim \pi}\big[Q_c(s, a)\big] \ge \mathbb{E}_{a \sim \pi}\big[Q^{\pi}_c(s, a)\big]$.

Implementation Details. In Srinivasan et al. (2020), the authors specify a learning rate and a safety discount factor. Using the same learning rate, we did a grid search over the remaining hyperparameters and selected the combination with the best performance.

In Bharadhwaj et al. (2020), the authors specify a learning rate and a discount factor, and select $\alpha$ from 0.05, 0.5, and 5. Using the same learning rate and discount, we did a grid search over $\alpha$ and the remaining hyperparameters, and selected the best-performing combination.

Results. In Figure A.7, we show a qualitative comparison between the ground truth safety value and that learned via the different learning rules. In interpreting the results, note that both the environment and the optimal safety policy are deterministic. Thus, following the definition, $Q_c$ should be 0 if $s$ is a safe state. Due to the difference in definitions, i.e., larger values indicate safety for the HJ safety value whereas smaller values indicate safety in SQRL and CQL, we plot the negated safety critic for the latter, so that in Figure A.7 a larger value consistently indicates safety and the cut-off for safe vs. unsafe is 0.

SQRL largely captures the correct safe states, though the classification performance is highly dependent on picking an appropriate threshold. CQL does underestimate the level of safety (and overestimates $Q_c$) as intended, but the pattern of underestimation does not appear to have a relationship to safety. Instead, it corresponds well to the level of distribution mismatch between the behaviour policy, i.e., a random policy, and the evaluation policy, i.e., the optimal safety policy (refer to Figure A.3).

Figure A.7: Comparison between the ground truth safety value and the safety critics from different learning rules for the double integrator

Appendix B SPAR Algorithm

SPAR relies on a dual actor-critic structure. One of the actor-critic instances functions as a performance policy, while the other functions as a safety policy. This pairing of a safety- and performance-oriented control is important, as we are able to decompose the problem of learning under safety constraints into optimising for performance and updating the safety value function, separately.

We optimise the performance policy using SAC, but it may be switched for any other comparable RL algorithm. While we use the clipped double-Q technique for the performance critic, as in the standard SAC implementation, we do not use the technique for the safety critic, based on our observations in Section A.2. We still keep a slow-moving target network for the safety critic, although we use a $\tau$ that is an order of magnitude larger than that of the performance critic.

The safety policy is used least-restrictively, that is, it only intervenes when the RL agent is about to enter an unsafe state, thus allowing the performance policy maximum freedom to explore safely. Instead of using the optimal safe policy obtained by solving the Hamiltonian, the safety policy is updated via gradients through the safety critic, the same as in other actor-critic algorithms.

Initialise: performance critic and actor;
Initialise: safety critic and actor; target networks for the safety critic and actor;
Initialise: replay buffer;
for i = 0, 1, ..., # Episodes do
       s = env.reset()
       while not terminal do
              u = performance_actor(s);
              // The safety actor intervenes when the current state-action pair is deemed unsafe by the safety critic.
              if the safety critic's value for (s, u) falls below the safety margin then
                     u = safety_actor(s)
              else
                     keep u
              end if
              s', r, terminal = env.step(u)
              replay_buffer.store(s, u, r, s', terminal)
              s = s'
              Update the performance critic and actor with the preferred RL algorithm;
              Sample N transitions from the replay buffer;
              // Update the safety critic: compute the target value with the discounted safety Bellman update;
              // Update the safety actor with the deterministic policy gradient;
              // Update the target networks;
       end while
end for
Algorithm 1 SPAR: Safety-aware Policy Optimisation for Autonomous Racing
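To make the intervention rule and the safety-critic update concrete, the following is a minimal Python/PyTorch sketch. The function names, the safety-margin convention, and the exact form of the discounted safety Bellman target are our assumptions for illustration, not the authors' code.

import torch

def select_action(s, performance_actor, safety_actor, safety_critic, safety_margin):
    """Least-restrictive intervention: override the performance action only
    when the safety critic deems the proposed state-action pair unsafe."""
    u = performance_actor(s)                        # proposed (stochastic) action
    if safety_critic(s, u).item() < safety_margin:  # HJ convention: larger value = safer
        u = safety_actor(s)                         # fall back to the safety actor
    return u

def safety_critic_target(l_s, target_critic, target_actor, s_next, gamma):
    """Discounted safety Bellman backup in the spirit of Fisac et al. (2019):
    target = (1 - gamma) * l(s) + gamma * min(l(s), Q_target(s', pi_safe(s')));
    the convention used here is an assumption."""
    with torch.no_grad():
        q_next = target_critic(s_next, target_actor(s_next))
        return (1.0 - gamma) * l_s + gamma * torch.minimum(l_s, q_next)

def safety_actor_loss(safety_critic, safety_actor, s_batch):
    """Deterministic policy gradient: move the safety actor towards actions
    that maximise the safety critic."""
    return -safety_critic(s_batch, safety_actor(s_batch)).mean()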

Appendix C Additional Implementation Details for Safety Gym

Following the default CarGoal1-v0 and PointGoal1-v0 benchmarks in Safety Gym, all agents were given LiDAR observations of hazards, goals, and vases, with hazard avoidance as the safety constraint. Both environments were initialised with a total of 8 hazards and 1 vase. Agents are endowed with accelerometer, velocimeter, gyroscope, and magnetometer sensors; their LiDAR configuration uses 16 bins with a maximum distance of 3.

The baselines we considered, i.e., CPO, PPO, and PPO-Lagrangian, follow the default implementations that come with Safety Gym. PPO-SPAR wraps the proposed safety actor-critic around the PPO base agent. Although PPO is an on-policy algorithm, the SPAR safety critic was implemented with off-policy updates, using prioritised memory replay based on the TD error of the predicted safety value. Since the per-step cost signal is small in this environment, we scaled the cost by a factor of 100. For the safety actor-critic, we annealed the safety discount factor from 0.85 to 1 following Fisac et al. (2019), and used a critic learning rate of 0.001, an actor learning rate of 0.0003, and a regularisation term on the policy entropy. We used a safety margin mainly to account for the dimensions of the hazards (radius = 0.2).
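One simple way to realise the prioritised replay described above is proportional sampling keyed on the magnitude of the safety-value TD error. The following is an illustrative sketch; the class name and hyperparameters are ours, not the authors' implementation.

import numpy as np

class PrioritisedSafetyReplay:
    """Proportional prioritised replay; priorities track |safety TD error|."""
    def __init__(self, capacity, alpha=0.6, eps=1e-4):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.data, self.priorities = [], []

    def store(self, transition):
        # New transitions get the current maximum priority so they are sampled at least once.
        p = max(self.priorities, default=1.0)
        if len(self.data) >= self.capacity:
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append(p)

    def sample(self, n):
        # Sampling probability proportional to priority^alpha.
        probs = np.asarray(self.priorities) ** self.alpha
        probs /= probs.sum()
        idx = np.random.choice(len(self.data), size=n, p=probs)
        return idx, [self.data[i] for i in idx]

    def update_priorities(self, idx, td_errors):
        # Priority is the magnitude of the safety-value TD error.
        for i, err in zip(idx, np.abs(td_errors)):
            self.priorities[i] = float(err) + self.eps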

For each model, on each Safety Gym benchmark, results were reported as the average across 5 instances. All experiments in Safety Gym were run on an Intel(R) Core(TM) i9-9920X CPU @ 3.50GHz – with 1 CPU, 12 physical cores per CPU, and a total of 24 logical CPU units.

Appendix D Derivation of the Optimal Safety Controller

Recall that the nominal model is given by Eqn. D.1, where the state is $s = (x, y, v, \varphi)$ and the action is $u = (a, \delta)$. Here, $(x, y)$, $v$, and $\varphi$ are the vehicle's location, speed, and yaw angle; $a$ is the acceleration and $\delta$ is the steering angle; $L$ (in m) is the car length.

\begin{equation}
\dot{x} = v\cos\varphi, \qquad
\dot{y} = v\sin\varphi, \qquad
\dot{v} = a, \qquad
\dot{\varphi} = \frac{v\tan\delta}{L}
\tag{D.1}
\end{equation}
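For reference, here is a small Python sketch of the nominal dynamics above; the default car length is a placeholder value, not taken from the paper.

import numpy as np

def nominal_dynamics(state, action, car_length=3.0):
    """Kinematic single-track model of Eqn. D.1 (car_length is illustrative)."""
    x, y, v, phi = state              # position, speed, yaw angle
    a, delta = action                 # acceleration, steering angle
    return np.array([
        v * np.cos(phi),              # x_dot
        v * np.sin(phi),              # y_dot
        a,                            # v_dot
        v * np.tan(delta) / car_length,  # phi_dot
    ])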

The optimal safety control is obtained by maximising the Hamiltonian, given in Eqn. D.2a; by definition, the Hamiltonian is the inner product between the gradient of the safety value function and the system dynamics. Substituting the nominal model and simplifying yields Eqns. D.2b and D.2c. From Eqn. D.2c, it is clear that the optimal safety controller maximising the Hamiltonian is given by Eqn. D.3.
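For completeness, here is a sketch of how the maximisation proceeds under the kinematic model above; the symmetric control bounds $|a| \le \bar a$ and $|\delta| \le \bar\delta$ (with $\bar\delta < \pi/2$) are our notational assumptions.

% Sketch of the Hamiltonian maximisation for the nominal model in Eqn. D.1.
% Write p = \nabla V(s) = (p_x, p_y, p_v, p_\varphi) for the gradient of the
% safety value function. The Hamiltonian is the inner product of p with the
% dynamics, maximised over admissible controls:
\[
H(s, p) \;=\; \max_{|a|\le\bar a,\;|\delta|\le\bar\delta}
  \Big( p_x\, v\cos\varphi + p_y\, v\sin\varphi
        + p_v\, a + p_\varphi\, \tfrac{v\tan\delta}{L} \Big).
\]
% The first two terms do not depend on the control. Since the remaining terms
% are monotone in a and in \tan\delta (for v \ge 0), the maximiser is bang-bang:
\[
a^{*} \;=\; \bar a\,\operatorname{sign}(p_v),
\qquad
\delta^{*} \;=\; \bar\delta\,\operatorname{sign}(p_\varphi).
\]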

Appendix E Additional Implementation Details for Learn-to-Race

Training details. During training, the agent is spawned at random locations along the race track and uses a stochastic policy. During evaluation, the agent is spawned at a fixed location and uses a deterministic policy. An episode terminates when the agent successfully finishes a lap, leaves the drivable area, collides with obstacles, or fails to progress for a set number of steps. For each agent, we report results averaged across 5 random seeds, with an evaluation over one episode, i.e., one lap, every 5,000 steps. In total, we train each agent for 250,000 steps and evaluate it over 50 episodes, i.e., laps.

During its interaction with the environment, the agent receives an ego-camera view and its speed at each time step. The agent encodes the RGB image frame and its speed into a 40-dimensional feature representation, which is subsequently used as input to both actor-critic networks. We initialise the replay buffer with 2000 random transitions, following Achiam (2018). After 2000 steps, we perform a policy update at each time step. For the SafeSAC agent, we only save state-action transitions from the performance actor to the replay buffer; for the SPAR agent, we save all state-action transitions.
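The warm-up and storage rules above can be summarised with a small sketch; the helper names are ours, not the authors' code.

WARMUP_STEPS = 2000  # random transitions used to initialise the replay buffer

def should_store(agent_type, action_from_safety_actor):
    # SafeSAC only stores transitions generated by the performance actor,
    # whereas SPAR stores every transition, including safety interventions.
    return agent_type == "SPAR" or not action_from_safety_actor

def should_update(total_env_steps):
    # No policy updates during warm-up; afterwards, one update per environment step.
    return total_env_steps >= WARMUP_STEPS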

Operation — Parameters

Visual Encoder
       Conv2d (conv1): 3 → 32 channels; kernel (4,4), stride 2, padding 1, ReLU
       Conv2d (conv2): 32 → 64 channels; kernel (4,4), stride 2, padding 1, ReLU
       Conv2d (conv3): 64 → 128 channels; kernel (4,4), stride 2, padding 1, ReLU
       Conv2d (conv4): 128 → 256 channels; kernel (4,4), stride 2, padding 1, ReLU
       Flatten
Visual Encoder Bottleneck Representation
       Linear (mu)
       Linear (sigma)
Visual Decoder (only for pre-training the Visual Encoder)
       Unflatten
       ConvTranspose2d (convtranspose1): 256 → 128 channels; kernel (4,4), stride 2, padding 1, ReLU
       ConvTranspose2d (convtranspose2): 128 → 64 channels; kernel (4,4), stride 2, padding 1, ReLU
       ConvTranspose2d (convtranspose3): 64 → 32 channels; kernel (4,4), stride 2, padding 1, ReLU
       ConvTranspose2d (convtranspose4): 32 → 3 channels; kernel (4,4), stride 2, padding 1, Sigmoid
Safety Actor-Critic
       actor_network, q_function1, q_function2
Performance Actor-Critic
       actor_network, q_function1, q_function2
Actor Network (Policy): SquashedGaussianMLPActor
       Linear, ReLU (×3)
       Linear (projection: mu_layer)
       Linear (projection: log_std_layer)
Q function
       speed_encoder, regressor
Speed Encoder
       Linear, ReLU
       Linear, Identity
Regressor
       Linear, ReLU (×5)
       Linear, Identity
Table E.1: Network Architecture
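As a concrete reading of Table E.1, here is a minimal PyTorch sketch of the visual encoder; the kernel, stride, padding, and channel counts follow the table, while the latent width and the input resolution in the usage example are assumptions.

import torch
import torch.nn as nn

class VisualEncoder(nn.Module):
    """Convolutional encoder following Table E.1; latent_dim is illustrative."""
    def __init__(self, latent_dim=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
        )
        # Bottleneck representation: mean and log-std projections.
        self.mu = nn.LazyLinear(latent_dim)
        self.log_sigma = nn.LazyLinear(latent_dim)

    def forward(self, img):
        h = self.conv(img)
        return self.mu(h), self.log_sigma(h)

# Usage example (assumed 128x128 RGB input):
# enc = VisualEncoder()
# mu, log_sigma = enc(torch.zeros(1, 3, 128, 128))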

Implementation Details. We summarise all network architectures in Table E.1. For all experiments, we implemented the models using PyTorch 1.8.0. We optimised both the performance and safety actor-critics with Adam (Kingma and Ba, 2014), with a learning rate of 0.003. We used a fixed discount factor for the performance critic, and annealed the safety discount factor from 0.85 to 1 following Fisac et al. (2019). We used separate target-update coefficients for the performance and safety critics, with the latter an order of magnitude larger. For both the performance and safety actors, we include a policy entropy term. We used a batch size of 256 and a replay buffer size of 250,000.

Computing hardware. For rendering the simulator and performing local agent verification and analysis, we used a single-GPU machine with the following CPU specifications: Intel(R) Core(TM) i5-4690K CPU @ 3.50GHz; 1 CPU, 4 physical cores per CPU, a total of 4 logical CPU units. The machine includes a single GeForce GTX TITAN X GPU with 12.2GB of GPU memory. For generating multi-instance experimental results, we used a cluster of three multi-GPU machines with the following CPU specifications: 2x Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz; 80 total CPU cores using a Cascade Lake architecture; 512 GiB of DDR4 3200 MHz memory (16x32 GiB DIMMs). Each machine includes 8x NVIDIA GeForce RTX 2080 Ti GPUs, each with 11GB of GDDR6 GPU memory. Experiments were orchestrated on these machines using Kubernetes, an open-source container deployment and management system.

All experiments were conducted using version 0.7.0.182276 of the Arrival Racing Simulator. The simulator and Learn-to-Race framework (Herman et al., 2021) are available for academic-use, here: https://learn-to-race.org.

Appendix F Additional Results

Performance of the SafeRandom agent. Recall that the SafeRandom agent takes random actions and uses the safety value function precomputed from the nominal model. The optimal safety controller intervenes whenever the safety value of the current state falls below the safety margin. The safety margin is necessary because 1) the nominal model is a significant over-simplification of the vehicle dynamics, and 2) the HJ reachability computation does not take into account the physical dimensions of the vehicle.

The performance of the SafeRandom agent at different safety margins is summarised in Figure F.1. For safety margin , the SafeRandom agent can finish 80+% of the lap, and thus we use it as the safety margin for the SafeSAC agent. On the other hand, performance decreases drastically when the safety margin is reduced to 3.

Figure F.1: Performance of the SafeRandom agent at different safety margins (averaged over 10 random seeds)

Performance of SafeSAC with the same safety margin as SPAR. While we chose the safety margin based on the performance of the SafeRandom agent over a range of margins and our best engineering judgement, one may wonder whether the superior performance of SPAR over SafeSAC can be attributed to the use of different safety margins. Thus, in Figure F.2 we also show the performance of a SafeSAC agent with the same safety margin as SPAR. Given the smaller safety margin, the ECP is low initially, which is in line with the observation from SafeRandom. Furthermore, the ECP barely improves over time: as the performance agent learns to drive faster, it becomes increasingly difficult for the static safety controller to catch the vehicle in marginally safe states.

Figure F.2: Performance of SafeSAC with the same safety margin as SPAR, in comparison to SPAR

Learn-to-Race benchmark results. In tables F.1 and F.2, we follow (Herman et al., 2021) in reporting on all of their driving quality metrics, for the Learn-to-Race benchmark: Episode Completion Percentage (ECP), Episode Duration (ED), Average Adjusted Track Speed (AATS), Average Displacement Error (ADE), Trajectory Admissibility (TrA), Trajectory Efficiency (TrE), and Movement Smoothness (MS).

Agent ECP () ED* () AATS () ADE () TrA () TrE () MS ()
HUMAN
Random
MPC
Table F.1: Learn-to-Race task (Herman et al., 2021) results on Track01 (Thruxton Circuit), for learning-free agents, with respect to the task metrics: Episode Completion Percentage (ECP), Episode Duration (ED), Average Adjusted Track Speed (AATS), Average Displacement Error (ADE), Trajectory Admissibility (TrA), Trajectory Efficiency (TrE), and Movement Smoothness (MS). Arrows indicate directions of better performance across agents. Bold results in tables F.1 and F.2 are generally best; however, asterisks (*) indicate metrics that may be misleading for incomplete racing episodes.
Agent ECP () ED* () AATS () ADE () TrA () TrE () MS ()
SAC
SafeRandom (ours),
SafeRandom (ours),
SafeSAC (ours),
SafeSAC (ours),
SPAR (ours) 59.19 53.28 0.99
Table F.2: Learn-to-Race task (Herman et al., 2021) results on Track01 (Thruxton Circuit), for learning-based agents.

We highlight the fact that such metrics as TrA, TrE, and MS are most meaningful for agents that also have high ECP results. Taking TrA, for example, safe policies score higher ECP values but may spend more time in inadmissible positions (as defined by the task, i.e., with at least one wheel touching the edge of the drivable area), compared to policies without a safety backup controller that may quickly terminate episodes by driving out-of-bounds (thus spending less time in the inadmissible positions). On the other hand, policies that have low completion percentages also have low ED scores, due to more frequent failures and subsequent environment resets.

Our approach achieves new state-of-the-art performance across the driving quality metrics on the Learn-to-Race benchmark.