1 Introduction
Deep reinforcement learning algorithms (Sutton and Barto, 1998) are effective at learning, often from raw sensor inputs, control policies that optimize for a quantitative reward signal. Learning these policies can require experiencing millions of unsafe actions. Even if a safe policy is finally learned – which will happen only if the reward signal reflects all relevant safety priorities – providing a purely statistical guarantee that the optimal policy is safe requires an unrealistic amount of training data (Kalra and Paddock, 2016). The difficulty of establishing the safety of these algorithms makes it hard to justify the use of reinforcement learning in safety-critical domains where industry standards demand strong evidence of safety prior to deployment (ISO26262, 2011).
Formal verification provides a rigorous way of establishing safety for traditional control systems (Clarke et al., 2018). The problem of providing formal guarantees in RL is called formally constrained reinforcement learning (FCRL). Existing FCRL methods such as (Hasanbeig et al., 2018b, a, 2019, 2020; Hahn et al., 2019; Alshiekh et al., 2018; Fulton and Platzer, 2018; Phan et al., 2019; De Giacomo et al., 2019) combine the best of both worlds: they optimize for a reward function while safely exploring the environment.
Existing FCRL methods suffer from two significant disadvantages that detract from their real-world applicability: a) they enforce constraints over a completely symbolic state space that is assumed to be noiseless (e.g. the positions of the safety-relevant objects are extracted from a simulator’s internal state); b) they assume that the entire reward structure depends upon the same symbolic state space used to enforce formal constraints. The first assumption limits the applicability of FCRL in real-world settings where the system’s state must be inferred by an imperfect and perhaps even untrusted perception system. The second assumption implies a richer symbolic state that includes a symbolic representation of the reward, which we argue is unnecessary and may require more labelled data. Furthermore, this means these approaches may not generalize across different environments that have similar safety concerns, but completely different reward structures.
The goal of this paper is to safely learn a safe policy without assuming a perfect oracle that identifies the positions of all safety-relevant objects. That is, unlike all existing FCRL methods, we do not rely on the internal state of the simulator. Prior to reinforcement learning, we train an object detection system to extract the positions of safety-relevant objects up to a certain precision. The pre-trained object detection system is used during reinforcement learning to extract the positions of safety-relevant objects, and that information is then used to enforce formal safety constraints. Absolute safety in the presence of untrusted perception is epistemologically challenging, but our formal safety constraints do at least account for a type of noise commonly found in object detection systems. Finally, although our system (called Verifiably Safe Reinforcement Learning, or VSRL) uses a small amount of labeled data to pre-train the object detector, we still learn an end-to-end policy that may leverage the entire visual observation for reward optimization.
Prior work from the formal methods community has demonstrated that you can do safe RL when you have full symbolic characterization of the environment and you can precisely observe the entire state. However, this is not realistic for actual robotic systems which have to interact with the physical world and can only perceive it through an imperfect visual system. This paper demonstrates that techniques inspired by formal methods can provide value even in this situation. First, we show that by using existing vision techniques to bridge between the visual input and the symbolic representation, one can leverage formal techniques to achieve highly robust behavior. Second, we prove that under weak assumptions on this vision system, the new approach will safely converge to an optimal safe policy.
Our convergence result is the first of its kind for formally constrained reinforcement learning. Existing FCRL algorithms provide convergence guarantees only for an MDP that is defined over high-level symbolic features that are extracted from the internal state of a simulator. Instead, we establish optimality for policies that are learned from the low-level feature space (i.e., images). We prove that our method is capable of optimizing for reward even when significant aspects of the reward structure are not extracted as high-level features used for safety checking. Our experiments demonstrate that VSRL is capable of optimizing for reward structure related to objects whose positions we do not extract via supervised training. This is significant because it means that VSRL needs pre-trained object detectors only for objects that are safety-relevant.
Finally, we provide a novel benchmark suite for Safe Exploration in Reinforcement Learning that includes both environments where the reward signal is aligned with the safety objectives and environments where the reward-optimal policy is unsafe. Our motivation for the latter is that assuming reward-optimal policies respect hard safety constraints neglects one of the fundamental challenges of Safe RL: preventing “reward-hacking”. For example, it is fundamentally difficult to tune a reward signal so that it has the “correct” tradeoff between a pedestrian’s life and battery efficiency. We show that in the environments where the reward-optimal policy is safe (“reward-aligned”), VSRL learns a safe policy with convergence rates and final rewards which are competitive or even superior to the baseline method. More importantly, VSRL learns these policies with zero safety violations during training; i.e., it achieves perfectly safe exploration. In the environments where the reward-optimal policy is unsafe (“reward-misaligned”), VSRL both effectively optimizes for the subset of reward that can be achieved without violating safety constraints and successfully avoids “reward-hacking” that would require violating safety constraints.
In summary, this paper contributes: (1) VSRL, a new approach toward formally constrained reinforcement learning that does not make unrealistic assumptions about oracle access to symbolic features. This approach requires minimal supervision before reinforcement learning begins and explores safely while remaining competitive at optimizing for reward. (2) Theorems establishing that VSRL learns safely and maintains convergence properties of any underlying deep RL algorithm within the set of safe policies. (3) A novel benchmark suite for Safe Exploration in Reinforcement Learning that includes both properly specified and misspecified reward signals.
2 Problem Definition
A reinforcement learning (RL) system can be represented as a Markov Decision Process (MDP) $(S, A, T, R, \gamma)$, which includes a (possibly infinite) set $S$ of system states, an action space $A$, a transition function $T(s, a, s')$ which specifies the probability of the next system state being $s'$ after the agent executes action $a$ at state $s$, a reward function $R(s, a)$ that gives the reward for taking action $a$ in state $s$, and a discount factor $\gamma$ that indicates the system preference to earn reward as fast as possible. We denote the set of initial states as $S_0$.
In our setting, the states in $S$ are images and we are given a safety specification safe over a set $O$ of high-level observations, specifically, the positions (planar coordinates) of the safety-relevant objects in a 2D or 3D space. Since $S \neq O$, it is not trivial to learn a safe policy $\pi$ such that safe holds along every trajectory. We decompose this challenge into two well-formed and tractable subproblems:

(1) Pre-training a system that converts the visual inputs into symbolic states using synthetic data (without acting in the environment);
(2) Learning policies over the visual input space while enforcing safety in the symbolic state space.
This problem is not solvable without making some assumptions, so here we focus on the following:
Assumption 1.
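The symbolic mapping $\psi$ from images to symbolic states is correct up to $\epsilon$. More precisely, the true position of every object can be extracted from the image through the object detector so that the Euclidean distance between the actual and extracted positions is at most $\epsilon$, i.e. $\lVert p - \hat{p} \rVert_2 \le \epsilon$ for each object's actual position $p$ and extracted position $\hat{p}$. We assume that we know an upper bound on the number of objects whose positions are extracted.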
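As an illustration of how a bound like the one in Assumption 1 can be used downstream, the following sketch computes a conservative lower bound on the true distance between two detected objects. It is not taken from the paper's implementation: the function names and the `margin` parameter are hypothetical. The only fact used is the triangle inequality: if each extracted position is within ε of the actual one, the actual distance between two objects is at least the detected distance minus 2ε.

```python
import math


def worst_case_distance(det_a, det_b, eps):
    """Lower bound on the true distance between two objects whose detected
    positions are each within eps (Euclidean) of their true positions."""
    detected = math.dist(det_a, det_b)
    return max(0.0, detected - 2.0 * eps)


def is_certainly_clear(det_agent, det_hazards, eps, margin):
    """True only if every hazard is farther than `margin` even in the worst case.
    `margin` is a hypothetical clearance chosen by the user, not a paper constant."""
    return all(worst_case_distance(det_agent, h, eps) > margin for h in det_hazards)


agent = (0.0, 0.0)
hazards = [(3.0, 4.0), (10.0, 0.0)]
print(worst_case_distance(agent, hazards[0], eps=0.5))          # detected 5.0, bound 4.0
print(is_certainly_clear(agent, hazards, eps=0.5, margin=3.0))  # True
print(is_certainly_clear(agent, hazards, eps=0.5, margin=4.0))  # False: bound is not > 4.0
```

A check written this way errs on the side of caution: it may reject a situation that is in fact safe, but under the stated error bound it does not accept one that is unsafe.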
Assumption 2.
Initial states, described by a set of properties denoted as init, are safe, i.e. $\textit{init} \rightarrow \textit{safe}$. Moreover, every state we reach after taking only safe actions has at least one available safe action.
Assumption 3.
We are given a dynamical model of the safety-relevant dynamics in the environment, given as either a discrete-time dynamical system or a system of ordinary differential equations, denoted as plant. We assume that the model is correct up to simulation; i.e., if the transition probability $T(s, a, s') \neq 0$ for some action $a$, then the dynamical system plant maps the symbolic state $\psi(s)$ to a set of states that includes $\psi(s')$.
For example, the model may be a system of ODEs that describes how the acceleration and angle impact the future positions of a robot, as well as the potential dynamical behavior of some hazards in the environment. Note that this model only operates on $O$ (the symbolic state space), not $S$ (low-level features such as images or LiDAR).
Assumption 4.
We have an abstract model of the agent’s behavior, denoted as ctrl, that is correct up to simulation: if the transition probability $T(s, a, s') \neq 0$ for some action $a$, then $\psi(s')$ is one of the possible next states after $\psi(s)$ by ctrl.
An abstract model of the agent’s behavior describes, at a high level, safe controller behavior, disregarding the fine-grained details an actual controller needs to be efficient. An example is a model that brakes if it is too close to a hazard and can have any other type of behavior otherwise. Note that ctrl is very different from a safe policy $\pi$, since it only models the safety-related aspects of $\pi$ without considering reward optimization.
3 Background
The goal of an RL agent represented as an MDP is to find a policy $\pi$ that maximizes its expected total reward from an initial state $s_0$:
$$V^{\pi}(s_0) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right] \qquad (1)$$
where $r_t$ is the reward at step $t$ and $\gamma$ is the discount factor. In a deep RL setting, we can use the DNN parameters $\theta$ to parametrize $\pi$. One particularly effective implementation and extension of this idea is proximal policy optimization (PPO), which improves sample efficiency and stability by sampling data in batches and then optimizing a surrogate objective function that prevents overly large policy updates (Schulman et al., 2017). This enables end-to-end learning through gradient descent, which significantly reduces the dependency of the learning task on refined domain knowledge. Deep RL thus provides a key advantage over traditional approaches, which were bottlenecked by a manual, time-consuming, and often incomplete feature engineering process.
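The two quantities mentioned above can be made concrete with a small sketch. The first function evaluates the discounted sum inside Eq. (1) for one finite reward sequence; the second evaluates the per-sample clipped surrogate term commonly associated with PPO, $\min(\rho A, \mathrm{clip}(\rho, 1-c, 1+c)\,A)$, where $\rho$ is the probability ratio between the new and old policy and $A$ is an advantage estimate. The function names and the default clipping constant `clip_eps=0.2` are illustrative assumptions, not values taken from this paper.

```python
def discounted_return(rewards, gamma):
    """Finite-horizon version of the sum in Eq. (1): sum_t gamma^t * r_t."""
    total = 0.0
    for r in reversed(rewards):
        # Horner-style accumulation from the last reward backwards.
        total = r + gamma * total
    return total


def ppo_clipped_term(ratio, advantage, clip_eps=0.2):
    """Per-sample clipped surrogate: min(ratio * A, clip(ratio, 1-c, 1+c) * A)."""
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25
print(ppo_clipped_term(ratio=1.5, advantage=2.0))     # ratio is clipped to 1.2
print(ppo_clipped_term(ratio=1.5, advantage=-2.0))    # unclipped term is smaller
```

The clipping removes the incentive to move the probability ratio far from 1 when doing so would increase the objective, which is the sense in which the surrogate prevents overly large policy updates.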
To ensure formal guarantees we use differential Dynamic Logic (dL) (Platzer, 2008, 2010, 2012, 2017), a logic for specifying and proving reachability properties of hybrid dynamical systems, which combine both discrete-time dynamics (e.g. a robot that decides actions at discrete times) and continuous-time dynamics (e.g. an ODE describing the position of the robot at any time). Hybrid systems can be described with hybrid programs (HPs), for which we give an informal definition in tab:hps. Notably, besides the familiar program syntax, HPs are able to represent a nondeterministic choice between two programs $\alpha \cup \beta$, and a continuous evolution of a system of ODEs $\{x' = f(x) \,\&\, F\}$ for an arbitrary amount of time, given a domain constraint $F$ on the state space.
Formulas of dL are generated by the following grammar, where $\alpha$ ranges over HPs:
$$P, Q ::= f \sim g \mid \neg P \mid P \land Q \mid P \lor Q \mid P \rightarrow Q \mid \forall x\, P \mid \exists x\, P \mid [\alpha] P$$
where $f, g$ are polynomials over the state variables, $P$ and $Q$ are formulas of the state variables, and $\sim$ is one of $\{\le, <, =, >, \ge\}$. The formula $[\alpha] P$ means that the formula $P$ is true in every state that can be reached by executing the hybrid program $\alpha$.
Given a set of initial conditions init for the initial states, a discrete-time controller ctrl representing the abstract behaviour of the agent, a continuous-time system of ODEs plant representing the environment, and a safety property safe, we define the safety preservation problem as verifying that the following holds:
$$\textit{init} \rightarrow [\{\textit{ctrl}; \textit{plant}\}^{*}]\,\textit{safe} \qquad (2)$$
Intuitively, this formula means that if the system starts in an initial state that satisfies init, takes one of the (possibly infinite) set of control choices described by ctrl, and then follows the system of ordinary differential equations described by plant, repeating this loop any number of times, then the system will always remain in states where safe is true.
Example 1 (Hello, World).
Consider a 1D point-mass $x$ that must avoid colliding with a static obstacle ($o$) and has perception error bounded by $\epsilon$. The following model characterizes an infinite set of controllers that are all safe, in the sense that $x \neq o$ for all forward time and at every point throughout the entire flow of the ODE:
[Model listing: definitions of init, ctrl, and plant.]
Starting from any state that satisfies the formula init, the (abstract/nondeterministic) controller chooses any acceleration $a$ satisfying the SB constraint. After choosing any $a$ that satisfies SB, the system then follows the flow of the system of ODEs in plant for any positive amount of time less than $T$. The constraint $v \geq 0$ simply means that braking (i.e., choosing a negative acceleration) can bring the point-mass to a stop, but cannot cause it to move backwards.
The full formula says that no matter how many times we execute the controller and then follow the flow of the ODEs, it will always be the case – again, for an infinite set of permissible controllers – that $x \neq o$.
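To give a feel for what a safe-braking style constraint looks like operationally, here is a minimal simulation sketch of a 1D point-mass. It is a textbook-style stand-in and not the paper's SB constraint or model: the condition in `worst_case_gap`, the constants `A`, `B`, `T`, `eps`, the candidate accelerations, and the greedy policy are all assumptions made for illustration. An acceleration is accepted only if, after applying it for up to one control period and then braking at the maximum rate, the point-mass still stops short of the obstacle even when the perceived position underestimates the true one by `eps`. If nothing is accepted, the sketch falls back to full braking, which preserves the invariant $x + v^2/(2B) < o$ established by the previously accepted action. A simulation of this kind only tests the condition on sampled runs; it is not a proof, which is the role the theorem prover plays in the paper.

```python
import random


def worst_case_gap(x_hat, v, a, o, B, T, eps):
    """Distance left to the obstacle after applying `a` for up to T, then braking at B."""
    # If a < 0 would push v below zero within T, the point-mass stops early (v >= 0).
    t_end = T if v + a * T >= 0.0 else v / (-a)
    travelled = v * t_end + 0.5 * a * t_end ** 2
    v_end = max(0.0, v + a * t_end)
    braking = v_end ** 2 / (2.0 * B)
    # x_hat + eps over-approximates the true position under the perception bound.
    return o - (x_hat + eps + travelled + braking)


def action_is_safe(x_hat, v, a, o, A, B, T, eps):
    """Hypothetical monitor: accept `a` only if it is in range and leaves a positive gap."""
    return -B <= a <= A and worst_case_gap(x_hat, v, a, o, B, T, eps) > 0.0


def simulate(steps=500, seed=0, o=10.0, A=1.0, B=2.0, T=0.1, eps=0.05):
    """Greedy agent that always picks the largest accepted acceleration."""
    rng = random.Random(seed)
    x, v, max_x = 0.0, 0.0, 0.0
    candidates = [-B, -0.5 * B, 0.0, 0.5 * A, A]
    for _ in range(steps):
        x_hat = x + rng.uniform(-eps, eps)  # noisy position estimate
        allowed = [a for a in candidates if action_is_safe(x_hat, v, a, o, A, B, T, eps)]
        a = max(allowed) if allowed else -B  # fall back to full braking
        t_end = T if v + a * T >= 0.0 else v / (-a)
        x += v * t_end + 0.5 * a * t_end ** 2
        v = max(0.0, v + a * t_end)
        max_x = max(max_x, x)
    return max_x


closest = simulate()
print("largest position reached:", closest, "obstacle at 10.0")
```

The greedy policy is chosen deliberately: it pushes toward the obstacle as hard as the check allows, so the run exercises the boundary of the accepted set rather than staying far from it.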
Theorems of dL can be automatically proven in the KeYmaera X theorem prover (Fulton et al., 2015, 2017). Mitsch and Platzer (2016) explain how to synthesize action space guards from nondeterministic specifications of controllers (ctrl), and Fulton and Platzer (2018, 2019) explain how these action space guards are incorporated into reinforcement learning to ensure safe exploration. Additional details about how we synthesize monitoring conditions from dL models are available in (Mitsch and Platzer, 2016) and in Appendix A.
4 VSRL: Verifiably Safe RL on Visual Inputs
We present VSRL, a framework that can augment any deep RL algorithm to perform safe exploration on visual inputs. As discussed in Section 2, we decompose the general problem into two tasks:

(1) learning a mapping of visual inputs into a symbolic state for safety-relevant properties using only a few examples (described in Section 4.1 and shown in fig:combineda);
(2) learning policies over visual inputs, while enforcing safety in the symbolic state space (described in Section 4.2 and shown in fig:combinedc).
This latter task requires a controller monitor, which is a function $\varphi : O \times A \rightarrow \{0, 1\}$ that classifies each action $a \in A$ in each symbolic state $o \in O$ as “safe” or not. In this paper this monitor is synthesized and verified offline following (Fulton and Platzer, 2018, 2019). In particular, as discussed in the previous sections, the KeYmaera X theorem prover solves the safety preservation problem presented in Eq. (2) for a set of high-level reward-agnostic safety properties safe, a system of differential equations characterizing the relevant subset of environmental dynamics plant, an abstract description of a safe controller ctrl, and a set of initial conditions init (shown in fig:combinedb).
4.1 Object Detection
In order to remove the need to construct labelled datasets for each environment, we only assume that we are given a small set of images of each safety-critical object and a set of background images (in practice, we use 1 image per object and 1 background). We generate synthetic images by pasting the objects onto a background with different locations, rotations, and other augmentations. We then train a CenterNet-style object detector (Zhou et al., 2019) which performs multi-way classification for whether each pixel is the center of an object. For speed and due to the visual simplicity of the environments, the feature extraction CNN is a truncated ResNet-18 (He et al., 2016) which only keeps the first residual block. The loss function is the modified focal loss (Lin et al., 2017) from Law and Deng (2018). See Appendix D for full details on the object detector. Our current implementation does not optimize or dedicate hardware to the object detector, so detection adds some runtime overhead for all environments. However, this is an implementation detail rather than an actual limitation of the approach. There are many existing approaches that make it possible to run object detectors quickly enough for real-time control.
4.2 Enforcing Constraints
While VSRL can augment any existing deep RL algorithm, this paper extends PPO (Schulman et al., 2015). The algorithm performs RL as normal except that, whenever an action is attempted, the object detector and safety monitor are first used to check if the action is safe. If not, a safe action is sampled uniformly at random from the safe actions in the current state. This happens outside of the agent and can be seen as wrapping the environment with a safety check. Pseudocode for performing this wrapping is in alg:mname. The controller monitor is extracted from a verified model (see Page 3 of (Fulton and Platzer, 2018) for details). A full code listing that inlines alg:mname into a generic RL algorithm is provided in Appendix E.
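The wrapping idea described above can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the paper's algorithm listing: `LineWorld`, `detect`, `monitor`, and `SafetyWrapper` are hypothetical names, the environment is a one-dimensional string "image", the detector is a string lookup standing in for a learned object detector, and the monitor is a hand-written rule standing in for one extracted from a verified model. The interface (`reset`/`step` returning an observation, a reward, and an unsafe flag) is also a simplification chosen for this sketch.

```python
import random


class LineWorld:
    """Hypothetical toy environment: a 1-D 'image' with agent 'A' and hazard 'H'."""

    def __init__(self, size=7, hazard=5):
        self.size, self.hazard, self.pos = size, hazard, 0

    def _render(self):
        cells = ["."] * self.size
        cells[self.hazard] = "H"
        cells[self.pos] = "A"
        return "".join(cells)

    def reset(self):
        self.pos = 0
        return self._render()

    def step(self, action):
        self.pos = min(self.size - 1, max(0, self.pos + action))
        unsafe = self.pos == self.hazard
        return self._render(), (-1.0 if unsafe else 0.0), unsafe


def detect(obs):
    """Stand-in for the object detector: raw observation -> symbolic positions."""
    return obs.index("A"), obs.index("H")


def monitor(symbolic_state, action):
    """Stand-in for a verified controller monitor: never step onto the hazard."""
    agent, hazard = symbolic_state
    return agent + action != hazard


class SafetyWrapper:
    """Checks each proposed action and replaces unsafe ones before execution."""

    def __init__(self, env, detector, monitor, actions, seed=0):
        self.env, self.detector, self.monitor = env, detector, monitor
        self.actions = list(actions)
        self.rng = random.Random(seed)
        self.overrides = 0
        self.obs = None

    def reset(self):
        self.obs = self.env.reset()
        return self.obs

    def step(self, proposed):
        symbolic = self.detector(self.obs)
        action = proposed
        if not self.monitor(symbolic, proposed):
            # Sample uniformly among safe actions; Assumption 2 is what
            # guarantees that this list is non-empty.
            safe = [a for a in self.actions if self.monitor(symbolic, a)]
            action = self.rng.choice(safe)
            self.overrides += 1
        self.obs, reward, unsafe = self.env.step(action)
        return self.obs, reward, unsafe


env = SafetyWrapper(LineWorld(), detect, monitor, actions=(-1, 0, 1))
env.reset()
# A deliberately reckless policy that always walks toward the hazard.
violations = sum(env.step(+1)[2] for _ in range(50))
print("violations:", violations, "overrides:", env.overrides)
```

Because the check sits between the agent and the environment, the learning algorithm itself needs no modification; it simply observes the outcome of whichever action was actually executed.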
4.3 Safety and Convergence Results
We establish two theoretical properties about VSRL. First, we show that VSRL safely explores. Second, we show that if VSRL is used on top of an RL algorithm which converges (locally or globally) then VSRL will converge to the (locally or globally) optimal safe policy. All proofs are in the Appendix.
Theorem 1.
This result implies that any RL agent augmented with alg:mname is always safe during learning. Our second theorem states that any RL agent that is able to learn an optimal policy in an environment can be combined with alg:mname to learn a reward-optimal safe policy.
Theorem 2.
Let $E$ be an environment and $L$ a reinforcement learning algorithm.
If $L$ converges to a reward-optimal policy $\pi^{*}$ in $E$, then using alg:mname with $L$ converges to $\pi^{*}_{s}$, the safe policy with the highest reward (i.e. the reward-optimal safe policy).
5 Experimental Validation of VSRL
We evaluate VSRL on four environments: a discrete XO environment (Garnelo et al., 2016), an adaptive cruise control environment (ACC), a 2D goal-finding environment similar to the OpenAI Safety Gym Goal environment (Ray et al., 2019) but without a MuJoCo dependency (GF), and a pointmesses environment that emphasizes the problem of preventing reward hacking in safe exploration systems (PM). VSRL explores each environment without encountering any unsafe states.
The XO Environment is a simple setting introduced by Garnelo et al. (2016) for demonstrating symbolic reinforcement learning algorithms (the implementation by Garnelo et al. (2016) was unavailable, so we reimplemented this environment). The XO environment, visualized in fig:xo_illustration, contains three types of objects: X objects that must be collected (+1 reward), O objects that must be avoided (−1 reward), and the agent (marked by a +). There is also a small penalty (−0.01) at each step to encourage rapid collection of all Xs and completion of the episode. This environment provides a simple baseline for evaluating VSRL. It is also simple to modify and extend, which we use to evaluate the ability of VSRL to generalize safe policies to environments that deviate slightly from implicit modeling assumptions. The symbolic state space includes the position of the + and the O, but not the position of the Xs because they are not safety-relevant. The purpose of this environment is to provide a benchmark for safe exploration in a simple discrete setting.
The adaptive cruise control (ACC) environment has two objects: a follower and a leader. The follower must maintain a fixed distance from the leader without either running into the leader or following too far behind. We use the verified model from (Quesel et al., 2016) to constrain the agent’s dynamics.
The 2D goal-finding environment consists of an agent, a set of obstacles, and a goal state. The obstacles are the red circles and the goal state is the green circle. The agent must navigate from its (random) starting position to the goal state without encountering any of the obstacles. Unlike the OpenAI Safety Gym, the obstacles are hard safety constraints; i.e., the episode ends if the agent hits a hazard. We use the verified model from (Mitsch et al., 2013) to constrain the agent’s dynamics.
The 2D pointmesses environment consists of an agent, a set of obstacles, a goal state, and a set of pointmesses (blue Xs). The agent receives reward for picking up the pointmesses, and the episode ends when the agent picks up all messes and reaches the goal state. Unlike the 2D goalfinding environment, hitting an obstacle does not end the episode. Instead, the obstacle is removed from the environment and a random number of new pointmesses spawn in its place. Notice that this means that the agent may reward hack by taking an unsafe action (hitting an obstacle) and then cleaning up the resulting pointmesses. We consider this the incorrect behavior. We use the verified model from (Mitsch et al., 2013) to constrain the agent’s dynamics.
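The incentive for reward hacking in this environment can be seen with a small worked calculation. The numbers below are hypothetical and chosen only for illustration: the reward per mess, the fixed number of messes spawned per obstacle (the environment spawns a random number), and the collision penalty are not taken from the benchmark. The sketch compares the return of an agent that avoids all obstacles with one that deliberately hits them and then cleans up, under two penalty values.

```python
def episode_return(initial_messes, obstacles_hit, messes_per_obstacle,
                   mess_reward=1.0, hit_penalty=0.0):
    """Return for collecting every mess, including those spawned by collisions.
    All reward values here are hypothetical, not the benchmark's constants."""
    total_messes = initial_messes + obstacles_hit * messes_per_obstacle
    return total_messes * mess_reward - obstacles_hit * hit_penalty


safe = episode_return(initial_messes=3, obstacles_hit=0, messes_per_obstacle=2)
hack_mild = episode_return(initial_messes=3, obstacles_hit=4, messes_per_obstacle=2,
                           hit_penalty=1.5)
hack_strict = episode_return(initial_messes=3, obstacles_hit=4, messes_per_obstacle=2,
                             hit_penalty=2.5)
print(safe, hack_mild, hack_strict)  # 3.0 5.0 1.0
```

With the milder penalty the colliding agent earns more than the safe one, so a reward-maximizing learner is pushed toward collisions; only a penalty exceeding the reward spawned per collision reverses the ordering, and that threshold depends on quantities the designer may not know in advance.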
Method | XO R | XO U | ACC R | ACC U | GF R | GF U | PM R | PM U
PPO | 10.5 | 7500 | 529 | 13983 | 0.233 | 3733 | 0.25 | 3819
VSRL | 10.5 | 0 | 967 | 0 | 0.228 | 0 | 0.225 | 0
We compare VSRL to PPO using two metrics: the number of safety violations during training and the cumulative reward. These results are summarized in tab:results. VSRL is able to perfectly preserve safety in all environments from the beginning of training even with the bounded errors in extracting the symbolic features from the images. In contrast, vanilla PPO takes many unsafe actions while training and does not always converge to a policy that entirely avoids unsafe objects by the end of training.
In some environments, preserving safety specifications also substantially improves sample efficiency and policy performance early in the training process. In the ACC environment, in particular, it is very easy to learn a safe policy which is reward-optimal. In the GF and PM environments, both the baseline agent and the VSRL agent struggle to learn to perform the task well (note that these tasks are quite difficult because encountering an obstacle ends the episode). However, VSRL remains safe without losing much reward relative to the amount of uncertainty in both policies. See Appendix E for details on our experimental evaluation and implementation.
6 Related Work
Recently, there has been a growing interest in safe RL, especially in the context of safe exploration, where the agent has to be safe also during training. A naive approach to RL safety is reward shaping, in which one defines a penalty cost for unsafe actions. This approach has several drawbacks, e.g. the choice of the penalty is brittle, so a naive choice may not outweigh a shorter path to the reward, as shown by Dalal et al. (2018). Therefore, recent work on safe RL addresses the challenge of providing reward-agnostic safety guarantees for deep RL (Garcıa and Fernández, 2015; Xiang et al., 2018). Many recent safe exploration methods focus on safety guarantees that hold in expectation (e.g., (Schulman et al., 2015; Achiam et al., 2017)) or with high probability (e.g., (Berkenkamp et al., 2017; Dalal et al., 2018; Koller et al., 2018; Cheng et al., 2019)). Some of these approaches achieve impressive results by drawing upon techniques from control theory, such as Lyapunov functions (Berkenkamp et al., 2017) and control barrier certificates.
On the other hand, ensuring safety in expectation or with high probability is generally not sufficient in safety-critical settings where guarantees must hold always, even for rare and measure-zero events. Numerical testing alone cannot provide such guarantees in practice (Kalra and Paddock, 2016) or even in theory (Platzer and Clarke, 2007). The problem of providing formal guarantees in RL is called formally constrained reinforcement learning (FCRL). Existing FCRL methods such as (Hasanbeig et al., 2018b, a, 2019, 2020; Hahn et al., 2019; Alshiekh et al., 2018; Fulton and Platzer, 2018; Phan et al., 2019; De Giacomo et al., 2019) combine the best of both worlds: they optimize for a reward function while still providing formal safety guarantees. While most FCRL methods can only ensure safety in discrete-time environments known a priori, Fulton and Platzer (2018, 2019) introduce Justified Speculative Control, which exploits Differential Dynamic Logic (Platzer, 2015) to prove the safety of hybrid systems, i.e. systems that combine an agent’s discrete-time decisions with the continuous-time dynamics of the system.
A major drawback of current FCRL methods is that they only learn control policies over handcrafted symbolic state spaces. While many methods extract a symbolic mapping for RL from visual data, e.g. (Lyu et al., 2019; Yang et al., 2018, 2019; Lu et al., 2018; Garnelo et al., 2016; Li et al., 2018; Liang and Boularias, 2018; Goel et al., 2018), they all require that all of the reward-relevant features are explicitly represented in the symbolic space. As shown by the many successes of Deep RL, e.g. (Mnih et al., 2013), handcrafted features often miss important signals hidden in the raw data.
Our approach aims at combining the best of FCRL and end-to-end RL to ensure that exploration is always safe with formal guarantees, while allowing a deep RL algorithm to fully exploit the visual inputs for reward optimization.
7 Conclusion and Discussions
Safe exploration in the presence of hard safety constraints is a challenging problem in reinforcement learning. We contribute VSRL, an approach toward safe learning on visual inputs. Through theoretical analysis and experimental evaluation, this paper establishes that VSRL maintains perfect safety during exploration while obtaining comparable reward. Because VSRL separates safety-critical object detection from RL, next steps should include applying tools from adversarial robustness to the object detectors used by VSRL.
References
Constrained Policy Optimization. See Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6–11 August 2017, Precup and Teh, pp. 22–31. Cited by: §6.
Safe reinforcement learning via shielding. In AAAI Conference on Artificial Intelligence. Cited by: §1, §6.
Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6–11 July 2015. JMLR Workshop and Conference Proceedings, Vol. 37. Cited by: J. Schulman, S. Levine, P. Abbeel, M. I. Jordan, and P. Moritz (2015).
Safe model-based reinforcement learning with stability guarantees. In Advances in Neural Information Processing Systems, pp. 908–918. Cited by: §6.
End-to-end safe reinforcement learning through barrier functions for safety-critical continuous control tasks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 3387–3395. Cited by: §6.
Handbook of model checking. Springer. Cited by: §1.
Safe exploration in continuous action spaces. arXiv preprint arXiv:1801.08757. Cited by: §6.
Foundations for restraining bolts: reinforcement learning with LTLf/LDLf restraining specifications. In International Conference on Automated Planning and Scheduling (ICAPS 2019). Cited by: §1, §6.
IMPALA: scalable distributed deep-RL with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561. Cited by: Appendix E.
Automated Deduction - CADE-25 - 25th International Conference on Automated Deduction, Berlin, Germany, August 1–7, 2015, Proceedings. LNCS, Vol. 9195, Springer. Cited by: A. Platzer (2015).
Bellerophon: tactical theorem proving for hybrid systems. In International Conference on Interactive Theorem Proving. Cited by: §A.2, §3.
KeYmaera X: an axiomatic tactical theorem prover for hybrid systems. In Proceedings of the 25th International Conference on Automated Deduction (CADE-25), A. P. Felty and A. Middeldorp (Eds.), LNCS, Vol. 9195, pp. 527–538. Cited by: §A.2, §3.
Safe reinforcement learning via formal methods: toward safe control through proof and learning. In AAAI Conference on Artificial Intelligence. Cited by: §A.3, §1, §3, §4.2, §4, §6.
Verifiably safe off-model reinforcement learning. See Tools and Algorithms for the Construction and Analysis of Systems (TACAS 2019), Vojnar and Zhang, pp. 413–430. External Links: Document Cited by: §3, §4, §6.
 A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research. Cited by: §6.
 Towards deep symbolic reinforcement learning. arXiv preprint arXiv:1609.05518. Cited by: §5, §5, §6.
 Unsupervised video object segmentation for deep reinforcement learning. In Advances in Neural Information Processing Systems, Cited by: §6.
Omega-regular objectives in model-free reinforcement learning. In TACAS 2019. Cited by: §1, §6.
Logically-constrained reinforcement learning. arXiv preprint arXiv:1801.08099. Cited by: §1, §6.
Logically-correct reinforcement learning. CoRR abs/1801.08099. External Links: 1801.08099 Cited by: §1, §6.
Cautious reinforcement learning with logical constraints. arXiv preprint arXiv:2002.12156. Cited by: §1, §6.
Reinforcement Learning for Temporal Logic Control Synthesis with Probabilistic Satisfaction Guarantees. arXiv e-prints, pp. arXiv:1909.05304. External Links: 1909.05304 Cited by: §1, §6.
Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: Appendix D, §4.1.
International Organization for Standardization 26262 road vehicles – functional safety. Cited by: §1.
Driving to safety: how many miles of driving would it take to demonstrate autonomous vehicle reliability?. RAND Corporation. Cited by: §1, §6.
Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: Appendix D.
Learning-based model predictive control for safe exploration. In 2018 IEEE Conference on Decision and Control (CDC), pp. 6059–6066. Cited by: §6.
CornerNet: detecting objects as paired keypoints. In European Conference on Computer Vision. Cited by: Appendix D, §4.1.
Object-sensitive deep reinforcement learning. arXiv preprint arXiv:1809.06064. Cited by: §6.
Task-relevant object discovery and categorization for playing first-person shooter games. arXiv preprint arXiv:1806.06392. Cited by: §6.
Focal loss for dense object detection. In IEEE International Conference on Computer Vision. Cited by: Appendix D, §4.1.
Robot representing and reasoning with knowledge from reinforcement learning. arXiv preprint arXiv:1809.11074. Cited by: §6.
SDRL: interpretable and data-efficient deep reinforcement learning leveraging symbolic planning. In AAAI’19. Cited by: §6.
 On provably safe obstacle avoidance for autonomous robotic ground vehicles. In Robotics: Science and Systems, P. Newman, D. Fox, and D. Hsu (Eds.), Cited by: §5, §5.
 ModelPlex: verified runtime validation of verified cyberphysical system models. Form. Methods Syst. Des. 49 (1), pp. 33–74. Note: Special issue of selected papers from RV’14 Cited by: §3.

Playing atari with deep reinforcement learning.
In
NIPS Deep Learning Workshop
, Cited by: §6.  Neural simplex architecture. Cited by: §1, §6.
 The image computation problem in hybrid systems model checking. In HSCC, A. Bemporad, A. Bicchi, and G. Buttazzo (Eds.), LNCS, Vol. 4416, pp. 473–486. External Links: Document, ISBN 9783540714927 Cited by: §6.
 Differential dynamic logic for hybrid systems.. J. Autom. Reas. 41 (2), pp. 143–189. Cited by: Appendix A, §3.
 Logical analysis of hybrid systems: proving theorems for complex dynamics. Springer, Heidelberg. Cited by: Appendix A, §3.
 Logics of dynamical systems. In LICS, pp. 13–24. Cited by: Appendix A, §3.
 A uniform substitution calculus for differential dynamic logic. See Automated deduction - CADE-25 - 25th international conference on automated deduction, berlin, germany, august 1-7, 2015, proceedings, Felty and Middeldorp, pp. . Cited by: §A.1, Appendix A, §6.
 A complete uniform substitution calculus for differential dynamic logic. J. Autom. Reas. 59 (2), pp. 219–265. Cited by: Appendix A, §3.
 Proceedings of the 34th international conference on machine learning, ICML 2017, sydney, nsw, australia, 6-11 august 2017. Proceedings of Machine Learning Research, Vol. 70, PMLR. Cited by: J. Achiam, D. Held, A. Tamar, and P. Abbeel (2017).
 How to model and prove hybrid systems with KeYmaera: a tutorial on safety. STTT 18 (1), pp. 67–91. Cited by: §5.
 Benchmarking Safe Exploration in Deep Reinforcement Learning. Cited by: §5.
 Trust region policy optimization. See Proceedings of the 32nd international conference on machine learning, ICML 2015, lille, france, 6-11 july 2015, Bach and Blei, pp. 1889–1897. Cited by: §4.2, §6.
 Proximal policy optimization algorithms. External Links: Link, 1707.06347 Cited by: §3.

 Rlpyt: a research code base for deep reinforcement learning in pytorch. External Links: 1909.01500 Cited by: Appendix E.
 Reinforcement learning: an introduction. MIT Press, Cambridge, MA. Cited by: §1.
 Tools and algorithms for the construction and analysis of systems (TACAS 2019). Lecture Notes in Computer Science, Vol. 11427, Springer. External Links: Document, ISBN 9783030174613 Cited by: N. Fulton and A. Platzer (2019).

 Verification for machine learning, autonomy, and neural networks survey. arXiv. Cited by: §6.
 Program search for machine learning pipelines leveraging symbolic planning and reinforcement learning. In Genetic Programming Theory and Practice XVI, Cited by: §6.
 Peorl: integrating symbolic planning and hierarchical reinforcement learning for robust decision-making. arXiv preprint arXiv:1804.07779. Cited by: §6.
 Objects as points. arXiv preprint arXiv:1904.07850. Cited by: Appendix D, Appendix D, Appendix D, Appendix D, Appendix D, §4.1.
Supplementary material for: Verifiably Safe Exploration for End-to-End Reinforcement Learning
Appendix A Model Monitoring
We use differential dynamic logic ($d\mathcal{L}$) (Platzer, 2008, 2010, 2012, 2015, 2017) to specify safety constraints on the agent’s action space. $d\mathcal{L}$ is a logic for specifying and proving reachability properties of both discrete and continuous time dynamical systems.
In this section we expand on the definitions and provide some illustrative examples. In particular, we focus on the language of hybrid programs (HPs), their reachability logic ($d\mathcal{L}$), and monitor synthesis for $d\mathcal{L}$ formulas.
A.1 Hybrid Programs Overview
As shown succinctly in tab:hps, hybrid programs are a simple programming language that combines imperative programs with systems of differential equations. We expand the description from tab:hps and define the syntax and informal semantics of HPs as follows:

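$\alpha; \beta$ executes $\alpha$ and then executes $\beta$.

$\alpha \cup \beta$ executes either $\alpha$ or $\beta$ nondeterministically.

$\alpha^*$ repeats $\alpha$ zero or more times nondeterministically.

$x := \theta$ evaluates term $\theta$ and assigns result to $x$.

$x := *$ assigns an arbitrary real value to $x$.

$\{x_1' = \theta_1, \ldots, x_n' = \theta_n \ \&\ F\}$ is the continuous evolution of $x_i$ along the solution to the system constrained to a domain defined by $F$.

$?F$ aborts if formula $F$ is not true.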
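Hybrid programs have a denotational semantics that defines, for each program, the set of states that are reachable by executing the program from an initial state. A state is an assignment of variables to values. For example, the denotation of $x := \theta$ in a state $s$ is:
$$[\![x := \theta]\!](s) = \{\, s' \mid s'(x) = [\![\theta]\!](s) \text{ and } s'(y) = s(y) \text{ for all } y \neq x \,\}$$
Composite programs are given meaning by their constituent parts. For example, the meaning of $\alpha \cup \beta$ is:
$$[\![\alpha \cup \beta]\!](s) = [\![\alpha]\!](s) \cup [\![\beta]\!](s)$$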
A full definition of the denotational semantics corresponding to the informal meanings given above is provided by (Platzer, 2015).
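The reachable-set reading of hybrid programs can be illustrated with a small sketch. It covers only the discrete constructs (assignment, test, sequential composition, and choice); loops, nondeterministic assignment, and differential equations are left out because their reachable sets are generally infinite. All names here (`make_state`, `assign`, `guard`, `seq`, `choice`) and the example controller are illustrative assumptions, not part of any existing tool.

```python
from typing import Callable, Dict, FrozenSet, Tuple

# A state is a hashable snapshot of variable values: sorted (name, value) pairs.
State = Tuple[Tuple[str, float], ...]
# A program maps a start state to the set of states reachable by running it.
Program = Callable[[State], FrozenSet[State]]


def make_state(**values: float) -> State:
    return tuple(sorted(values.items()))


def assign(var: str, term: Callable[[Dict[str, float]], float]) -> Program:
    """x := theta: evaluate the term in the current state, then update x."""
    def run(s: State) -> FrozenSet[State]:
        env = dict(s)
        env[var] = term(env)
        return frozenset({make_state(**env)})
    return run


def guard(formula: Callable[[Dict[str, float]], bool]) -> Program:
    """?F: continue unchanged if F holds, otherwise abort (no reachable state)."""
    def run(s: State) -> FrozenSet[State]:
        return frozenset({s}) if formula(dict(s)) else frozenset()
    return run


def seq(first: Program, second: Program) -> Program:
    """alpha; beta: run beta from every state reachable by alpha."""
    def run(s: State) -> FrozenSet[State]:
        return frozenset(t for mid in first(s) for t in second(mid))
    return run


def choice(left: Program, right: Program) -> Program:
    """alpha ∪ beta: union of the states reachable by either branch."""
    def run(s: State) -> FrozenSet[State]:
        return left(s) | right(s)
    return run


# Hypothetical controller: accelerate only when slow, braking is always allowed.
ctrl = choice(
    seq(guard(lambda e: e["v"] < 2.0), assign("a", lambda e: 1.0)),
    seq(guard(lambda e: True), assign("a", lambda e: -1.0)),
)
slow = ctrl(make_state(v=1.0, a=0.0))  # both branches are enabled
fast = ctrl(make_state(v=3.0, a=0.0))  # only the braking branch is enabled
```

Representing each program as a function from a state to a set of states mirrors the denotational style described above: a failed test contributes the empty set, so a guarded branch simply disappears from the union.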
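A.2 Differential Dynamic Logic Overview
Formulas of $d\mathcal{L}$ are generated by the grammar:
$$\varphi, \psi ::= f \sim g \mid \varphi \land \psi \mid \varphi \lor \psi \mid \varphi \to \psi \mid \forall x\, \varphi \mid \exists x\, \varphi \mid [\alpha]\varphi$$
where $f, g$ are polynomials of real arithmetic, $\sim$ is one of $\{\le, <, =, >, \ge\}$, and the meaning of $[\alpha]\varphi$ is that $\varphi$ is true in every state that can be reached by executing the program $\alpha$. Formulas of $d\mathcal{L}$ can be stated and proven in the KeYmaera X theorem prover (Fulton et al., 2015, 2017).
The meaning of a formula is given by a denotational semantics that specifies the set of states in which a formula is true. For example,
$$[\![[\alpha]\varphi]\!] = \{\, s \mid [\![\alpha]\!](s) \subseteq [\![\varphi]\!] \,\}$$
We write $\models \varphi$ as an alternative notation for the fact that $\varphi$ is true in all states (i.e., $[\![\varphi]\!] = S$). We denote by $\vdash \varphi$ the fact that there is a proof of $\varphi$ in the proof calculus of $d\mathcal{L}$.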
A.3 Using Safe Controller Specifications to Constrain Reinforcement Learning
Given a hybrid program and proven safety specification, Fulton and Platzer (2018) explain how to construct safety monitors (which we also call safe action filters in this paper) for reinforcement learning algorithms over a symbolic state space. In this section, we summarize their algorithm.
As opposed to our approach, Fulton and Platzer (2018) employ both a controller monitor (that ensures the safety of the controller) and a model monitor (that ensures the adherence of the model to the actual system and checks for model mismatch).
The meaning of the controller monitor and model monitor are stated with respect to a specification with the syntactic form $\text{init} \to [\{\text{ctrl}; \text{plant}\}^*]\,\text{safe}$ where init is a formula specifying initial conditions, plant is a dynamical system expressed as a hybrid program that accurately encodes the dynamics of the environment, and safe is a postcondition. Fulton and Platzer (2018) assume that ctrl has the form $?P_1; a_1 \cup \cdots \cup ?P_n; a_n$, where the $a_i$ are discrete assignment programs that correspond to the action space of the RL agent. For example, an agent that can either accelerate or brake has action space $A = \{a_{\text{acc}}, a_{\text{brake}}\}$. The corresponding control program will be $?P_{\text{acc}}; a_{\text{acc}} \cup ?P_{\text{brake}}; a_{\text{brake}}$ where $P_{\text{acc}}$ is a formula characterizing when it is safe to accelerate and $P_{\text{brake}}$ is a formula characterizing when it is safe to brake.
Given such a formula, Fulton and Platzer (2018) define the controller and model monitors using the following conditions:
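Corollary 1 (Meaning of Controller Monitor).
Suppose $CM$ is a controller monitor for $\text{init} \to [\{\text{ctrl}; \text{plant}\}^*]\,\text{safe}$ and $s \in S$ and $u : S \to S$. Then $CM(u, s)$ implies $u(s) \in [\![\text{ctrl}]\!](s)$.
Corollary 2 (Meaning of Model Monitor).
Suppose $MM$ is a model monitor for $\text{init} \to [\{\text{ctrl}; \text{plant}\}^*]\,\text{safe}$, that $u$ is a sequence of actions, and that $s$ is a sequence of states. If $MM(s_{i-1}, u_{i-1}, s_i)$ for all $i$ then $s_i \models \text{safe}$, and also $u_i(s_i) \in [\![\text{ctrl}]\!](s_i)$ implies $s_{i+1} \in [\![\text{plant}]\!](u_i(s_i))$.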
Appendix B Proof of thm:safety
If the object detector produces an accurate mapping, then alg:mname will preserve the safety constraint associated with the monitor. We state this property formally in thm:correctness.
Theorem 1 (Safety Theorem).
Assume the following conditions hold along a trajectory with :
 A1

Initial states are safe: implies .
 A2

The model and symbolic mapping are correct up to simulation: If for some action then and .
Proof.
We begin the proof by pointing out that our assumption about how
was proven provides us with the following information about some formula :
Now, assume with is a trajectory generated by running an RL agent with actions selected by alg:mname and proceed by induction on the length of the sequence with the inductive hypothesis that .
If then by assumption. Therefore, by A1. We know by LI1 that . Therefore, by Modus Ponens and the soundness of the proof calculus.
Now, suppose . We know by induction. Furthermore, we know because otherwise this trajectory could not exist. By A2 and the denotation of the operator, we know . By LI3, we know Therefore, and implies by the denotation of the box modality and the soundness of .
We have now established that for all . By LI2, Modus Ponens, and soundness of the proof calculus, we finally conclude that . ∎
Note that if all actions along the trajectory are generated using alg:mname, and if the model is accurate, then the two assumptions in thm:correctness will hold.
Appendix C Proof of thm:policy_equivalence
In order to enforce safety, we wrap the original environment in a new one which has no unsafe actions. By not modifying the agent or training algorithm, any theoretical results (e.g. convergence) which the algorithm already has will still apply in our safety-wrapped environment. However, it is still necessary to show the relation between the (optimal) policies that may be found in the safe environment and the policies in the original environment. We show that 1) all safe policies in the original environment have the same transition probabilities and expected rewards in the wrapped environment and 2) all policies in the wrapped environment correspond to a policy in the original environment which has the same transition probabilities and expected rewards. This shows that the optimal policies in the wrapped environment are optimal among safe policies in the original environment (so no reward is lost except where required by safety).
Let the original environment be the MDP $E = (S, A, T, R)$. We define a safety checker $C : S \times A \to \{\text{True}, \text{False}\}$ to be a predicate such that $C(s, a)$ is True iff action $a$ is safe in state $s$ in $E$. When we refer to an action as safe or unsafe, we always mean in the original environment $E$. A policy $\pi$ in $E$ is safe iff
$$\forall s \in S,\ a \in A: \quad \pi(a \mid s) > 0 \implies C(s, a).$$
The safety-wrapped environment will be $E_{\text{safe}} = (S, A, T_{\text{safe}}, R_{\text{safe}})$ where the transition and reward functions will be modified to ensure there are no unsafe actions and expected rewards in $E_{\text{safe}}$ correspond with those from acting safely in $E$.
$T_{\text{safe}}$ is required to prevent the agent from taking unsafe actions; for any safe action, we keep this identical to $T$. When an unsafe action is attempted, we could either take a particular safe action deterministically (perhaps shared across states, if some action is always safe, or a state-specific safe action) or sample (probably uniformly) from the safe actions in a given state. We prefer the latter approach of sampling from the safe actions because this makes taking an unsafe action have higher variance, so the agent will probably learn to avoid such actions. If unsafe actions are deterministically mapped to some safe action(s), they become indistinguishable, so the agent has no reason to avoid unsafe actions (unless we tamper with the reward function). Thus we set
$$T_{\text{safe}}(s, a, s') = \begin{cases} T(s, a, s') & \text{if } C(s, a) \\ \frac{1}{|A_C(s)|} \sum_{a' \in A_C(s)} T(s, a', s') & \text{otherwise} \end{cases}$$
where $A_C(s) = \{a \in A \mid C(s, a)\}$ is the set of safe actions in state $s$. This simulates replacing unsafe actions with a safe action chosen uniformly at random.
$R_{\text{safe}}$ is defined similarly so that it simulates the reward achieved by replacing unsafe actions with safe ones uniformly at random:
$$R_{\text{safe}}(s, a) = \begin{cases} R(s, a) & \text{if } C(s, a) \\ \frac{1}{|A_C(s)|} \sum_{a' \in A_C(s)} R(s, a') & \text{otherwise} \end{cases}$$
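For a small tabular MDP, the wrapped transition and reward functions described above can be computed directly. The following is a minimal sketch under several assumptions: transitions are stored as an array of shape (states, actions, next states), rewards as expected rewards of shape (states, actions), and the safety checker as a boolean array of the same shape. The function name `wrap_environment` and the toy numbers are illustrative only.

```python
import numpy as np


def wrap_environment(T: np.ndarray, R: np.ndarray, safe: np.ndarray):
    """Replace each unsafe action by a uniform mixture over the safe actions.

    T:    (S, A, S) transition probabilities of the original environment.
    R:    (S, A) expected rewards of the original environment (assumed form).
    safe: (S, A) boolean array, True where the action is safe in that state.
    """
    n_safe = safe.sum(axis=1)
    if np.any(n_safe == 0):
        raise ValueError("every state needs at least one safe action")

    # Uniform weights over the safe actions of each state, zero elsewhere.
    weights = safe / n_safe[:, None]
    # Transition distribution and reward of "pick a safe action uniformly".
    T_uniform = np.einsum("sa,sat->st", weights, T)
    R_uniform = (weights * R).sum(axis=1)

    # Safe actions keep their original behaviour; unsafe ones get the mixture.
    T_safe = np.where(safe[:, :, None], T, T_uniform[:, None, :])
    R_safe = np.where(safe, R, R_uniform[:, None])
    return T_safe, R_safe


# Toy example: in state 0, action 1 is unsafe; in state 1, both are safe.
T = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[1.0, 5.0],
              [0.0, 2.0]])
safe = np.array([[True, False],
                 [True, True]])
T_safe, R_safe = wrap_environment(T, R, safe)
```

In the toy example the unsafe action in state 0 behaves exactly like the only safe action there, so the tempting reward of 5 is never observed by the agent.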
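Lemma 1.
For every safe policy $\pi$ in $E$, following that policy in $E_{\text{safe}}$ leads to the same transitions with the same probabilities and gives the same expected rewards.
Proof.
By definition of safety, $\pi(a \mid s)$ has zero probability for any $(s, a)$ where $C(s, a)$ isn’t true. Thus actions sampled from $\pi$ lead to transitions and rewards from the branch of $T_{\text{safe}}$ and $R_{\text{safe}}$ where they are identical to $T$ and $R$. ∎
Lemma 2.
For every policy $\pi$ in $E_{\text{safe}}$ there exists a safe policy $\pi'$ in $E$ such that $\pi'$ has the same transition probabilities and expected rewards in $E$ as $\pi$ does in $E_{\text{safe}}$.
Proof.
For any $\pi$ in $E_{\text{safe}}$, let $\pi'$ be defined such that
$$\pi'(a \mid s) = \begin{cases} \pi(a \mid s) + \frac{1}{|A_C(s)|} \sum_{a' \in A_{\neg C}(s)} \pi(a' \mid s) & \text{if } C(s, a) \\ 0 & \text{otherwise} \end{cases}$$
where $A_{\neg C}(s)$ is the set of unsafe actions in state $s$. This simulates evenly redistributing the probability that $\pi$ assigns to unsafe actions in $E_{\text{safe}}$ among the safe actions.
We show first that the transition probabilities of $\pi$ in $E_{\text{safe}}$ and $\pi'$ in $E$ are the same. Writing $P^{\pi}_{E}(s' \mid s)$ for the probability of moving from $s$ to $s'$ when following $\pi$ in $E$:
$$\begin{aligned} P^{\pi}_{E_{\text{safe}}}(s' \mid s) &= \sum_{a \in A} \pi(a \mid s)\, T_{\text{safe}}(s, a, s') \\ &= \sum_{a \in A_C(s)} \pi(a \mid s)\, T(s, a, s') + \sum_{a \in A_{\neg C}(s)} \pi(a \mid s)\, \frac{1}{|A_C(s)|} \sum_{a' \in A_C(s)} T(s, a', s') \\ &= \sum_{a \in A_C(s)} \Big( \pi(a \mid s) + \frac{1}{|A_C(s)|} \sum_{a' \in A_{\neg C}(s)} \pi(a' \mid s) \Big)\, T(s, a, s') \\ &= \sum_{a \in A} \pi'(a \mid s)\, T(s, a, s') = P^{\pi'}_{E}(s' \mid s) \end{aligned}$$
Let $R^{\pi}_{E}(s)$ be the expected reward of following the policy $\pi$ in environment $E$ at state $s$. The equality of the expected reward for $\pi$ in every state of $E_{\text{safe}}$ and $\pi'$ in every state of $E$ can be shown similarly:
$$\begin{aligned} R^{\pi}_{E_{\text{safe}}}(s) &= \sum_{a \in A} \pi(a \mid s)\, R_{\text{safe}}(s, a) \\ &= \sum_{a \in A_C(s)} \pi(a \mid s)\, R(s, a) + \sum_{a \in A_{\neg C}(s)} \pi(a \mid s)\, \frac{1}{|A_C(s)|} \sum_{a' \in A_C(s)} R(s, a') \\ &= \sum_{a \in A_C(s)} \pi'(a \mid s)\, R(s, a) = R^{\pi'}_{E}(s) \end{aligned}$$
∎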
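The correspondence claimed in Lemma 2 can be checked numerically on a random tabular MDP. The sketch below builds the wrapped transition function, redistributes the probability a policy places on unsafe actions evenly over the safe ones, and compares the resulting state-to-state transition probabilities. The array layout, the helper names `wrap_transitions` and `redistribute`, and the random test instance are assumptions made for illustration.

```python
import numpy as np


def wrap_transitions(T: np.ndarray, safe: np.ndarray) -> np.ndarray:
    """Unsafe actions behave like a uniformly chosen safe action."""
    weights = safe / safe.sum(axis=1, keepdims=True)
    T_uniform = np.einsum("sa,sat->st", weights, T)
    return np.where(safe[:, :, None], T, T_uniform[:, None, :])


def redistribute(pi: np.ndarray, safe: np.ndarray) -> np.ndarray:
    """Move the probability on unsafe actions evenly onto the safe actions."""
    unsafe_mass = (pi * ~safe).sum(axis=1, keepdims=True)
    n_safe = safe.sum(axis=1, keepdims=True)
    return np.where(safe, pi + unsafe_mass / n_safe, 0.0)


rng = np.random.default_rng(0)
n_states, n_actions = 4, 3
# Random transition function T[s, a, :] and random policy pi[s, :].
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
pi = rng.dirichlet(np.ones(n_actions), size=n_states)
# Arbitrary safety pattern with at least one safe action per state.
safe = np.array([[True, False, True],
                 [True, True, True],
                 [False, False, True],
                 [True, False, False]])

T_safe = wrap_transitions(T, safe)
pi_prime = redistribute(pi, safe)

# State-to-state transition probabilities under each policy/environment pair.
P_wrapped = np.einsum("sa,sat->st", pi, T_safe)       # pi in the wrapped env
P_original = np.einsum("sa,sat->st", pi_prime, T)     # pi' in the original env
```

The two matrices agree up to floating-point error, and the redistributed policy places no probability on unsafe actions, which is what the lemma requires of it.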
Theorem 2.
Let $E$ be an environment and $L$ a reinforcement learning algorithm. If $L$ converges to a reward-optimal policy $\pi^*$ in $E$, then using alg:mname with $L$ converges to $\pi_s$, the safe policy with the highest reward (i.e. the reward-optimal safe policy).
Proof.
We provide proof by contraposition. Let’s assume that $\pi_s$ is not optimal in $E_{\text{safe}}$. Then there must exist $\pi'$ in $E_{\text{safe}}$ that gets more reward. But, by Lemma 2, $\pi'$ corresponds to a safe policy $\pi''$ in $E$ which gets the same amount of reward, so $\pi''$ is better in $E$ than $\pi_s$. Hence, $\pi_s$ is not optimal among safe policies in $E$. ∎
A few notes regarding this theorem:

The intuitive approach to making an agent safe, if we know the set of safe actions in each state, might be to sample from the safe subset of the agent’s policy distribution (after renormalization). Because this is not actually sampling from the distribution the agent learned, this may interfere with training the agent.

While we keep the same $S$ in $E$ and $E_{\text{safe}}$, there may be states which become unreachable in $E_{\text{safe}}$ because only unsafe transitions in $E$ lead to them. Thus the effective size of $E_{\text{safe}}$’s state space may be smaller which could speed up learning effective safe policies.

Our approach can be viewed as transforming a constrained optimization problem (being safe in $E$; have to treat it as a CMDP) into an unconstrained one (being safe in $E_{\text{safe}}$).
Appendix D Object Detection Details
CenterNet (Zhou et al., 2019)
CenterNet-style object detectors take an image of size $H \times W \times C$ (height, width, and channels, respectively) as input and output an image $\hat{Y}$ of size $\frac{H}{R} \times \frac{W}{R} \times (K + 2)$ where $R$ is a downscaling amount to make the detection more efficient and $K$ is the number of classes to detect. For the first $K$ channels, $\hat{Y}_{x,y,k}$ is the probability that the pixel $(x, y)$ is the (downscaled) center of an object of the $k$th class. The final two channels of $\hat{Y}$ contain x and y offsets. The offsets account for the error in detecting locations in the original image because the predictions are downscaled: a downscaled detection at $(x, y)$ can be converted to a detection in the original image coordinates by combining it with the offsets predicted at that location. As the objects in our environments have constant sizes, we don’t predict the object sizes as is done in CenterNet.
As in Zhou et al. (2019), we set $R = 4$ and use max-pooling and thresholding to convert from the probability maps to a list of detections. In particular, there is a detection for object class $k$ at location $(x, y)$ if $\hat{Y}_{x,y,k} \ge \tau$ for a fixed threshold $\tau$ and $\hat{Y}_{x,y,k} = \text{maxpool}(\hat{Y})_{x,y,k}$ where maxpool is a 3x3 max-pooling operation centered at $(x, y)$ (with zero-padding). The detector then returns a list of tuples $(k, x, y)$ containing the class id ($k$) and center point ($(x, y)$) of each detection. These are used in evaluating the constraints wherever the formulas reference the location of an object of type $k$ (e.g. if a robot must avoid hazards, the constraint will be checked using the location of each hazard in the detections list).
We use ResNet-18 (He et al., 2016) truncated to the end of the first residual block. The first layer is also modified to have only a single input channel because we use grayscale images, as is common for inputs to RL agents. This already outputs an image which is downscaled 4x relative to the input, so we do the center-point classification and offset prediction directly from this image, removing the need for upscaling. We use one 1x1 convolutional layer for the offset prediction (two output channels) and one for the center point classification ($K$ output channels and sigmoid activation).
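The conversion from probability maps to a detection list can be sketched with plain NumPy. The example below keeps a location when it equals the maximum of its zero-padded 3x3 neighbourhood and reaches a threshold. The function name `extract_detections`, the (classes, height, width) array layout, and the threshold of 0.5 used in the example are assumptions for illustration; the offset channels are ignored here, so the returned coordinates are in the downscaled map.

```python
from typing import List, Tuple

import numpy as np


def extract_detections(heatmaps: np.ndarray,
                       threshold: float) -> List[Tuple[int, int, int]]:
    """Return (class_id, x, y) for every thresholded local maximum.

    heatmaps: array of shape (K, H, W) with per-class center probabilities.
    """
    num_classes, height, width = heatmaps.shape
    # Zero-pad the spatial dimensions so border pixels have a full 3x3 window.
    padded = np.pad(heatmaps, ((0, 0), (1, 1), (1, 1)), constant_values=0.0)
    # 3x3 max-pooling with stride 1: maximum over the nine shifted views.
    shifted = [padded[:, dy:dy + height, dx:dx + width]
               for dy in range(3) for dx in range(3)]
    pooled = np.max(np.stack(shifted), axis=0)

    is_peak = (heatmaps == pooled) & (heatmaps >= threshold)
    class_ids, ys, xs = np.nonzero(is_peak)
    return [(int(k), int(x), int(y)) for k, y, x in zip(class_ids, ys, xs)]


# Two classes on a 5x5 map.
maps = np.zeros((2, 5, 5))
maps[0, 2, 3] = 0.9   # peak for class 0 at x=3, y=2
maps[0, 2, 2] = 0.6   # neighbour of the peak, suppressed by max-pooling
maps[1, 0, 0] = 0.8   # peak for class 1 in the corner
maps[1, 4, 4] = 0.3   # local maximum but below the threshold
detections = extract_detections(maps, threshold=0.5)
```

Comparing each value with its pooled neighbourhood plays the role of non-maximum suppression: the weaker response next to a peak is dropped without any box overlap computation.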
Training
To avoid introducing a dependency on heavy annotations, we restricted ourselves to a single image for each safety-relevant object in an environment and a background image. We produce images for training by pasting the objects into random locations in the background image. We also use other standard augmentations such as left-right flips and rotations. New images are generated for every batch.
We use the label-generation and loss function from Zhou et al. (2019). Labels for each object class are generated by evaluating, at each pixel position, a Gaussian density on the distance from that position to the center of the nearest object of the given class (see alg:label_creation for details).
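A minimal sketch of this label generation is shown below. It evaluates an unnormalized Gaussian of the distance to each object center and keeps the per-pixel maximum over objects of the same class, which equals the Gaussian of the distance to the nearest such object. The function name `make_labels`, the (class, x, y) center format, and the default standard deviation `sigma` are assumptions; the exact procedure is the one in the referenced label-creation algorithm.

```python
from typing import Iterable, Tuple

import numpy as np


def make_labels(centers: Iterable[Tuple[int, int, int]],
                num_classes: int,
                height: int,
                width: int,
                sigma: float = 1.0) -> np.ndarray:
    """Build (K, H, W) target maps from object centers given as (class, x, y)."""
    ys, xs = np.mgrid[0:height, 0:width]
    labels = np.zeros((num_classes, height, width))
    for class_id, cx, cy in centers:
        squared_distance = (xs - cx) ** 2 + (ys - cy) ** 2
        bump = np.exp(-squared_distance / (2.0 * sigma ** 2))
        # The maximum over objects keeps the value of the nearest center.
        labels[class_id] = np.maximum(labels[class_id], bump)
    return labels


labels = make_labels([(0, 2, 1)], num_classes=2, height=4, width=5)
two_objects = make_labels([(0, 0, 0), (0, 4, 0)], num_classes=1, height=1, width=5)
```

Each object center receives a label of exactly 1, and nearby pixels receive values that decay smoothly with distance.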
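The loss function is a focal loss: a variant of cross-entropy that focuses more on difficult examples (where the predicted probabilities are farther from the true probabilities) (Lin et al., 2017). We use a modified focal loss as in (Law and Deng, 2018; Zhou et al., 2019):
$$L = -\frac{1}{N} \sum_{x=1}^{W} \sum_{y=1}^{H} \sum_{k=1}^{K} \begin{cases} (1 - \hat{Y}_{x,y,k})^{\alpha} \log(\hat{Y}_{x,y,k}) & \text{if } Y_{x,y,k} = 1 \\ (1 - Y_{x,y,k})^{\beta}\, (\hat{Y}_{x,y,k})^{\alpha} \log(1 - \hat{Y}_{x,y,k}) & \text{otherwise} \end{cases}$$
where $N$ is the number of objects in the image (of any type); $x \in \{1, \ldots, W\}$; $y \in \{1, \ldots, H\}$; $k \in \{1, \ldots, K\}$; $W, H$ are the width and height of the image; and $K$ is the number of object classes. $\hat{Y}_{x,y,k}$ is the predicted probability of an object of type $k$ being centered at position $(x, y)$ in the image and $Y_{x,y,k}$ is the “true” probability.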
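The modified focal loss can be written compactly in NumPy. The sketch below follows the form used by Law and Deng (2018) and Zhou et al. (2019), with the two exponents defaulting to the values 2 and 4 used in this work. The function name `focal_loss`, the clipping constant `eps`, and counting objects as the number of target entries equal to 1 are assumptions made for illustration.

```python
import numpy as np


def focal_loss(pred: np.ndarray, target: np.ndarray,
               alpha: float = 2.0, beta: float = 4.0,
               eps: float = 1e-12) -> float:
    """Penalty-reduced focal loss over arrays of identical shape.

    pred:   predicted center probabilities in [0, 1].
    target: Gaussian-smoothed labels, exactly 1 at object centers.
    """
    pred = np.clip(pred, eps, 1.0 - eps)  # keep both logarithms finite
    is_center = target == 1.0

    # Term used at object centers: down-weights confident correct predictions.
    center_term = (1.0 - pred) ** alpha * np.log(pred)
    # Term used elsewhere: pixels near a center are penalized less.
    other_term = (1.0 - target) ** beta * pred ** alpha * np.log(1.0 - pred)

    total = -np.where(is_center, center_term, other_term).sum()
    # Assumption: every object contributes exactly one target entry equal to 1.
    num_objects = int(is_center.sum())
    # Without objects the normalization is skipped, as described in the text.
    return float(total / num_objects) if num_objects > 0 else float(total)


target = np.zeros((1, 3, 3))
target[0, 1, 1] = 1.0
perfect = focal_loss(target.copy(), target)
uncertain = focal_loss(np.full((1, 3, 3), 0.5), target)
empty = focal_loss(np.full((1, 2, 2), 0.5), np.zeros((1, 2, 2)))
```

A prediction that matches a binary target gives a loss close to zero, while a uniformly uncertain prediction is penalized at every location.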
$\alpha$ and $\beta$ are hyperparameters that we set to 2 and 4, respectively, as done by (Law and Deng, 2018; Zhou et al., 2019). We remove the division by $N$ if an image has no objects present. The loss for the offsets is mean-squared error, and we weight the focal loss and offset loss equally. We use the Adam optimizer (Kingma and Ba, 2014) with learning rate 0.000125 and the remaining optimizer settings as in Zhou et al. (2019). We decrease the learning rate by a factor of 10 if the loss on a validation set of 5,000 new images doesn’t improve within 10 training epochs of 20,000 images. The batch size is 32 as in Zhou et al. (2019). We keep the model which had the best validation loss.
Appendix E Reinforcement Learning Details
In all of our experiments, we use PPO as the reinforcement learning algorithm. Our hyperparameter settings are listed in Table 2. We run several environments in parallel to increase training efficiency using the method and implementation from (Stooke and Abbeel, 2019).
We use grayscale images as inputs to the RL agent, and the CNN architecture from Espeholt et al. (2018).
| Hyperparameter | Value |
| --- | --- |
| Adam learning rate | |
| Num. epochs | 4 |
| Number of actors | 32 |
| Horizon (T) | 64 |
| Minibatch size | |
| Discount ($\gamma$) | 0.99 |
| GAE parameter ($\lambda$) | 0.98 |
| Clipping parameter | |
| Value function coeff. | 1 |
| Entropy coeff. | 0.01 |
| Gradient norm clip | 1 |