1 Introduction
In this paper, we consider the safety of data-trained memoryless feedback controllers (RL-trained feedforward neural networks, for example)
in the following context: we assume an autonomous vehicle described by the Kinematic Bicycle Model (KBM) dynamics (a good approximation for four-wheeled vehicles [1]), and a safety property that the vehicle should avoid a stationary, fixed-radius disk in the plane. In particular, we propose ShieldNN, an algorithm to design Rectified Linear Unit (ReLU) safety networks for this scenario: a ShieldNN network composed in series with any memoryless feedback controller makes the composition of the two controllers provably safe by the aforementioned criterion. This structure itself distinguishes ShieldNN from most other work on safe data-trained controllers: instead of designing a single safe controller, ShieldNN uses the KBM dynamics to design a controller-agnostic NN that corrects, in real time, unsafe control actions generated by any controller. In other words, ShieldNN designs a "safety filter" NN that takes as input the instantaneous control action generated by a controller (along with the state of the system) and outputs a safe control for the KBM dynamics. This safety-filter NN thus replaces unsafe controls generated by the original controller with safe controls, whereas safe controls generated by the original controller are passed through unaltered; i.e., unsafe controls are "filtered" out. A block diagram of this structure is depicted in Fig. 1. The benefits of this approach are manifest, especially for data-trained controllers. On the one hand, existing controllers that were designed without safety in mind can be made safe by merely incorporating the safety filter in the control loop. In this scenario, the safety filter can also be seen as a countervailing factor to controllers trained to mimic experts: the expert learning can be seen as a design for "performance," and the safety filter is added to correct unanticipated unsafe control behavior as needed.
On the other hand, the controller-agnostic nature of our proposed filter means that ShieldNN itself may be incorporated into training. In this way, the safety filter functions as a kind of "safety expert" during training, and this can potentially improve sample efficiency by eliminating training runs that end in unsafe states.
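The filter-in-the-loop composition described above can be sketched as follows. The function names and the toy clamp-style filter in the usage example are illustrative assumptions, not ShieldNN's actual construction:

```python
def shielded(controller, safety_filter):
    """Compose any memoryless controller with a safety filter, as in
    the block structure of Fig. 1.  The filter sees the current state
    and the proposed control, and returns a safe control: safe
    proposals pass through, unsafe ones are corrected."""
    def policy(state):
        u_proposed = controller(state)
        return safety_filter(state, u_proposed)
    return policy

# Hypothetical usage with a toy clamp-style filter on a scalar control:
clamp_filter = lambda state, u: max(-1.0, min(1.0, u))
safe_policy = shielded(lambda state: 5.0, clamp_filter)  # unsafe controller
```

Note that the original controller is untouched: only its outputs are intercepted, which is what makes the construction controller-agnostic.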
The main theoretical contribution of this paper is thus the development of the ShieldNN algorithm. The central pillar of ShieldNN is the notion of a barrier function, because a barrier function allows the safety problem for a feedback controller to be recast as a set-membership problem for the outputs of said controller. In particular, this recasting reduces the safety-filter design problem to one of designing a prediction-style NN whose outputs are constrained to lie in a specific set. As a prerequisite for ShieldNN, then, we propose a novel class of candidate barrier functions for the KBM dynamics that is characterized by three real-valued parameters (one of which is the safety radius). The necessity for a class of candidate barrier functions stems from the difficulty of analytically designing a barrier function for the KBM dynamics. Thus, ShieldNN is functionally divided into two parts: a verifier, which soundly verifies a particular choice of barrier function (from the aforementioned class), and a synthesizer, which designs the actual ShieldNN filter.
We also validated these theoretical results experimentally on four-wheeled vehicle models. In particular, we apply ShieldNN safety filters both before and after RL training for an autonomous vehicle simulated in CARLA [2]. Our results show that incorporating ShieldNN dramatically improved the safety of the vehicle: it reduced the number of obstacle collisions by 99.4% to 100% in our safety experiments. We also studied the effect of incorporating ShieldNN during training: for a constant number of episodes, 28% less reward was observed when ShieldNN was not used during training. This suggests that ShieldNN has the further property of improving sample efficiency during RL training.
Related Work.
Motivated by the lack of safety guarantees in Deep RL, recent works in the safe-RL literature have focused on designing new RL algorithms that take safety into account. The work in this area can be classified into three categories. Works in the first category focus on modifying the training algorithm to take safety constraints into account. Representative examples include reward shaping [3], Bayesian and robust regression [4, 5, 6], and policy optimization with constraints [7, 8, 9, 10]. Unfortunately, such approaches do not provide provable guarantees on the safety of the trained controller. The second category focuses on using ideas from control theory to augment the RL agent and provide safety guarantees. Examples include the use of Lyapunov methods [11, 12, 13], safe model predictive control [14], reachability analysis [15, 16, 17], barrier certificates [18, 19, 20, 21, 22, 23, 24, 25], and online learning of uncertainties [26]. Unfortunately, such methods are either computationally expensive, specific to certain controller structures or training algorithms, or reliant on particular assumptions about the system model. Finally, the third category focuses on applying formal verification techniques (e.g., model checking) to verify formal safety properties of pretrained RL agents. Representative examples are the use of SMT-like solvers [27, 28, 29] and hybrid-system verification [30, 31, 32]. These techniques only assess the safety of a given RL agent instead of designing a safe agent.
2 Problem Statement
As described above, we consider the kinematic bicycle model (KBM) as the dynamical model for our autonomous car. However, the usual KBM is defined in terms of the absolute Cartesian position of the bicycle, which is inconsistent with the sensing modalities typically available to an autonomous robot. Thus, we instead describe the bicycle kinematics in terms of relative positional variables that are directly measurable via LiDAR or visual sensors. In particular, the dynamics for the distance to the obstacle and the angle of the bicycle with respect to the obstacle comprise a system of ordinary differential equations. These quantities describe a state-space model that is given by:
(1) 
where the control inputs are the linear acceleration and the front-wheel steering angle (we assume the steering angle can be set instantaneously, so the dynamics of the steering rack can be ignored). For the sake of intuition, we note a few special cases of the orientation angle: at one extreme the bicycle is oriented tangentially to the obstacle, and at the other it points directly at or away from the obstacle (see Fig. 2). An intermediate quantity also appears in the dynamics; it is an invertible function of the steering angle.
We make the further assumption that the KBM has a constraint on the steering-angle input. To simplify notation, we will treat the intermediate quantity mentioned above directly as a control variable; this is without loss of generality, since it is in bijection with the actual steering angle, and it therefore inherits a corresponding constraint. Finally, we define the state and control vectors for the KBM, with the control vector confined to the set of admissible controls.

Problem 1.
Consider a KBM robot with a given maximum steering angle, length parameters, and maximum velocity (in our KBM model, enforcing a maximum velocity technically requires a feedback controller on the velocity, but this will not affect our results). Consider also a disk-shaped region of a given radius centered at the origin. Find a set of safe initial conditions and a ReLU NN:
(2) 
such that for any globally Lipschitz continuous controller, the state-feedback controller:
(3) 
is guaranteed to prevent the robot from entering the unsafe region if it is started from a state in the set of safe initial conditions. Equivalently, applying this feedback controller ensures that the robot remains outside the unsafe region for all time when the initial condition is chosen from the safe set.
3 Approach
The most important feature of Problem 1 is that the safety network is a memoryless function that must correct the output of a feedback controller instantaneously. The existence of such a corrective function is not a priori guaranteed for the KBM dynamics. However, the well-known theory of Barrier Functions (BFs) provides a mechanism for ensuring the safety of a dynamical system: in short, barrier functions are real-valued functions of the system state whose properties ensure that the value of the function remains greater than zero along trajectories of the system [33, 34]. Thus, if a barrier function is designed so that its zero superlevel set is contained inside the set of safe states, then that set is forward-invariant; i.e., if the system starts from a safe state, then it will stay safe for all future time. In this way, barrier functions can be used to convert safety properties into an instantaneous, albeit state-dependent, set-membership problem for control actions.
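As a concrete illustration of this set-membership view, consider a one-dimensional toy system (not the KBM; the dynamics, barrier, and class-K function below are all assumptions chosen for illustration): with state x, dynamics x' = u, and barrier h(x) = x - r encoding the safe set {x >= r}, a control is admissible exactly when the Lie derivative of h along the dynamics satisfies dh/dt >= -alpha(h(x)):

```python
# Toy 1-D example (not the KBM): dynamics x' = u, safe set {x >= r}
# encoded by the barrier h(x) = x - r, with a linear class-K alpha.
r = 1.0
h = lambda x: x - r
alpha = lambda s: 2.0 * s  # assumed class-K gain

def control_is_safe(x, u):
    # Barrier condition: along the dynamics, dh/dt = x' = u, and the
    # control is safe iff dh/dt >= -alpha(h(x)).
    return u >= -alpha(h(x))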
Thus, in the spirit of Problem 1, we employ the usual theory of autonomous barrier functions applied to control systems under state-feedback control, i.e., a control system in closed loop with a state-feedback controller. In this scenario, the feedback controller converts the control system into an autonomous one, described by the closed-loop vector field. Moreover, the conditions for a barrier function can be translated into a set-membership problem for the outputs of such a feedback controller. This is explained in the following corollary.
Corollary 1.
Let the control system be Lipschitz continuous in both of its arguments on a given set; furthermore, let a candidate barrier function on that set and a class-K function be given. If the set
(4) 
is nonempty for each state, and a feedback controller satisfies
(5) 
then the zero superlevel set of the barrier function is forward invariant for the closed-loop dynamics.
Proof.
This follows directly from an application of zeroing barrier functions [35, Theorem 1]. ∎
Corollary 1 is the foundation of ShieldNN: the only difference is that instead of designing a single controller, we will design a safe "combined" controller. In this usage, when a controller generates a control action that lies outside of the set of safe controls, the filter must map it to a control within that set.
Thus, Corollary 1 admits the following threestep framework for developing ShieldNN filters.
ShieldNN Framework:
(1) Design a Candidate Barrier Function. For a function to be a barrier function for a specific safety property, its zero superlevel set must be contained in the set of safe states.
(2) Verify the Existence of Safe Controls. (ShieldNN Verifier) Show that the set of safe controls is nonempty at each state. This establishes that a safe feedback controller may exist.
(3) Design a Safety Filter. (ShieldNN Synthesizer) If possible, design a network whose output at each state is a safe control; then obtain a safety filter as:
(6) 
ShieldNN thus hinges on the design of a barrier function, and then the design of two prediction-type NN functions: one that generates a safe control at each state, and one that overrides any unsafe control for a state with the associated safe control.
4 Barrier Function(s) for the KBM Dynamics: the Basis of ShieldNN
It is difficult to analytically derive a single barrier function for the KBM as a function of a particular robot and safety radius. Thus, we instead define a class of candidate barrier functions for a specific robot: this class is parameterized by a unitless scaling parameter and the safety radius, and it has the property that some parameter choices are guaranteed to result in a barrier function. However, since the analytically guaranteed parameter choices are impractically conservative, a ShieldNN verifier algorithm is needed to establish whether a particular (user-supplied) choice of barrier function parameters does indeed constitute a barrier function.
In particular, we propose the following class of candidate barrier functions to certify control actions so that the bicycle does not come within the safety radius of the origin (Problem 1):
(7) 
where the class includes an additional parameter whose function we shall describe subsequently. First note that the boundary equation has a unique solution in the distance variable for each value of the orientation angle:
(8) 
so the smallest such distance equals the safety radius. Thus, the candidate function satisfies requirement (1) of the ShieldNN framework: i.e., its zero superlevel set is entirely contained in the set of safe states prescribed by Problem 1, independent of the choice of the scaling parameter. See Fig. 2, which also depicts another crucial value of the distance.
Remark 1.
Note that the barrier function is independent of the velocity state. This will ultimately force ShieldNN filters to intervene only by altering the steering input.
A barrier function also requires a class-K function. For ShieldNN, we choose a linear function
(9) 
whose slope involves the assumed maximum linear velocity (see Problem 1) and a constant selected according to the following theorem.
Theorem 1.
Consider any fixed choice of the barrier parameters, and suppose the assumption specified in Problem 1 holds. If the constant in the class-K function is chosen such that:
(10) 
then the Lie derivative of the barrier function is a monotonically increasing function of the steering control, for all states of interest and for each fixed choice of the remaining state and control variables.
In particular, for all such states it is the case that:
(11) 
In addition to concretely defining our class of candidate barrier functions, Theorem 1 is the essential facilitator of the ShieldNN algorithm. In particular, note that
(12) 
by the bounds assumed on the state and control variables. Hence, the set of safe controls is independent of the velocity state, so (11) gives a sufficient condition for safe controls (step (2) of the framework) in terms of a single state variable, the orientation angle, and a single control variable, the steering input. This simplifies not only the ShieldNN verifier but also the ShieldNN synthesizer, as we shall demonstrate in the next section.
5 ShieldNN
ShieldNN Verifier: The overall ShieldNN algorithm has three inputs: the specifications of the KBM robot (its maximum steering angle, length parameters, and maximum velocity); the desired safety radius; and the barrier scaling parameter. From these inputs, the ShieldNN verifier first soundly verifies that these parameters lead to an actual barrier function for Problem 1. As per Theorem 1, it suffices to show that the set of safe controls is nonempty at each orientation state.
If the sets of safe controls have a complicated structure (both in themselves and relative to each other), then establishing this could in principle be quite difficult. However, the barrier functions under consideration actually appear to generate quite nice regions of safe controls. In particular, it appears to be the case that the set of safe steering angles in any particular orientation state is an interval clipped at the maximum/minimum steering inputs. That is, each such set can be written as an interval whose lower and upper endpoints are continuous functions of the orientation state. Even more helpfully, the lower-endpoint function generally appears to be concave, and the symmetry of the problem determines the upper endpoint from the lower one. See Fig. 2(a) for an example: the set of safe controls is shown in light green, and its endpoint functions are shown in dark green.
Of course, these observations about the endpoint functions are difficult to establish analytically, given the nature of the equations (c.f. (12)). Nevertheless, we can exhibit a sound algorithm to verify these claims for particular parameter values, and hence that the input parameters correspond to a legitimate barrier function. Due to space constraints, the details of this algorithm appear in the supplementary material.
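To give a flavor of how such a sound check might work (this is a sketch of the idea only; the actual verifier and its bounds are in the supplementary material), one can grid the orientation variable and, given a Lipschitz bound on the endpoint functions, require that the gap between the upper and lower safe-control bounds at each grid point exceeds the worst-case variation between samples:

```python
def verify_nonempty_on_grid(safe_lb, safe_ub, xi_grid, lipschitz, step):
    """Soundly certify that the interval [safe_lb(xi), safe_ub(xi)]
    is nonempty for every xi, given a Lipschitz bound on both
    endpoint functions: if the gap at each grid point exceeds the
    worst-case change 2*L*step between samples, the gap cannot
    close anywhere between grid points."""
    margin = 2.0 * lipschitz * step
    return all(safe_ub(xi) - safe_lb(xi) > margin for xi in xi_grid)
```

Because the check over-approximates the variation between samples, a True answer is sound (the safe-control sets really are nonempty), while a False answer is inconclusive and can be retried with a finer grid.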
ShieldNN Synthesizer: Given a verified barrier function, recall from (3) in Section 3 that synthesizing a ShieldNN filter requires two components: a safe-control network and an override map. That is, the network chooses a safe control for each state, and the override map replaces any unsafe controls with the network's output.
Design of the safe-control network. This task is much easier than it otherwise would be, since the ShieldNN verifier also verifies that the safe controls lie between two continuous endpoint functions, the lower of which is concave and mirrored by the upper. In particular, then, it is enough to design the network as any neural network such that
(13) 
This property can be achieved in several ways, including training against samples of the lower endpoint function, for example. However, we chose to synthesize the network directly in terms of tangent line segments to that function (and thus exploit its concavity). A portion of just such a function is illustrated by the orange line in Fig. 2(b).
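The tangent-line construction exploits a basic fact: tangent lines to a concave function lie on or above it everywhere. A piecewise function assembled from such tangents therefore never dips below the lower safe-control bound. The sketch below (with hypothetical names; the paper's actual construction is only analogous in spirit) illustrates the idea:

```python
def tangent_pwl(f, df, knots):
    """Piecewise-linear over-approximation of a concave function f,
    assembled from its tangent lines at the given knot points
    (df is the derivative of f).  Because f is concave, each tangent
    satisfies f(k) + df(k)*(x - k) >= f(x) for every x, so the
    result is a valid stand-in for a network that must stay >= f."""
    def g(x):
        k = min(knots, key=lambda kk: abs(kk - x))  # nearest knot
        return f(k) + df(k) * (x - k)
    return g
```

A ReLU network can represent any such piecewise-linear function exactly, which is why this construction is compatible with the ReLU architecture required by Problem 1.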
Design of the override map. Since the value of the safe-control network is designed to lie inside the interval of safe controls, the network can itself be used to decide when an unsafe control is supplied. In particular, using this property and the problem's symmetry, we can simply choose
(14) 
Note: in this construction, the closer the network approximates the lower endpoint function, the less intrusive the safety filter will be. Two constant slices of such an override map are shown in Fig. 2(b).
6 ShieldNN Evaluation
We conduct a series of experiments to evaluate ShieldNN's performance when applied to unsafe RL controllers. The CARLA simulator [2] is used as our RL environment, and we consider an RL agent whose goal is to drive a simulated vehicle while avoiding the obstacles in the environment. Video recordings and other details of the experiments are provided in the supplementary materials. The goals of the experiments are to assess the following:
(1) The effect of ShieldNN when applied during RL training (Experiment 1), in terms of the average collected reward, obstacle avoidance, etc.
(2) The safety of the RL agent when ShieldNN is applied after training (Experiment 2).
(3) The robustness of ShieldNN when applied in a different environment than that used in training (Experiment 3).
RL Task: The RL task is to drive a simulated fourwheeled vehicle from point A to point B on a curved road that is populated with obstacles. The obstacles are static CARLA pedestrian actors randomly spawned at different locations between the two points. We define unsafe states as those in which the vehicle hits an obstacle. As ShieldNN is designed for obstacle avoidance, we do not consider the states when the vehicle hits the sides of the roads to be unsafe with respect to ShieldNN. Technical details and graphical representations are included in the Supplementary Materials.
Reward function and termination criteria: If the vehicle reaches point B, the episode terminates and the RL agent receives a fixed reward. The episode also terminates, and the agent incurs a fixed penalty, in the following cases: when the vehicle (i) hits an obstacle; (ii) hits one of the sides of the road; (iii) has a speed lower than 1 KPH after 5 seconds from the beginning of the episode; or (iv) has a speed that exceeds the maximum speed (45 KPH). The reward function is a weighted sum of four terms, and the weights were tuned during training. The four terms are designed to incentivize the agent to keep the vehicle's speed between a minimum speed (35 KPH) and a target speed (40 KPH), maintain the desired trajectory, align the vehicle's heading with the direction of travel, and keep the vehicle away from obstacles. The reward function is defined formally in the Supplementary Materials.
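For concreteness, a reward of this weighted-sum shape could look like the sketch below. The individual term definitions and weights here are our own illustrative assumptions; the actual reward is defined in the Supplementary Materials:

```python
def reward(speed_kph, cross_track_err, heading_err, obstacle_dist,
           w=(1.0, 1.0, 1.0, 1.0)):
    """Illustrative weighted sum of the four incentive terms."""
    # (i) keep speed near the 35-40 KPH band (peak at the 40 KPH target)
    r_speed = 1.0 - min(abs(speed_kph - 40.0) / 40.0, 1.0)
    # (ii) maintain the desired trajectory (penalize cross-track error)
    r_track = -abs(cross_track_err)
    # (iii) align the heading with the direction of travel
    r_head = -abs(heading_err)
    # (iv) keep away from obstacles (saturating bonus with distance)
    r_obs = min(obstacle_dist, 10.0) / 10.0
    return w[0]*r_speed + w[1]*r_track + w[2]*r_head + w[3]*r_obs
```

The shape matters more than the exact constants: each term is maximized by the behavior it is meant to incentivize, and the weights trade the four objectives off against one another during tuning.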
Proximal Policy Optimization (PPO) [36] was used to train a neural network to perform the desired RL task. The network receives distance and orientation measurements, which are synthesized from CARLA position and orientation data to simulate LiDAR input. The network then outputs the new control actions: a throttle command and a steering angle. The steering angle is subsequently processed by the ShieldNN filter to produce a corrected "safe" steering angle, which is applied to the simulated car with the original throttle input generated by the PPO agent. The ShieldNN filter is synthesized according to Section 5; its barrier and KBM parameters, along with the full details of the architecture and signal processing for the agent, are provided in the supplementary materials.
Experiment 1: Effect of ShieldNN During RL Training The goal of this experiment is to study the effect of applying ShieldNN to an RL agent during training. We train three RL agents for 6000 episodes each in order to compare (i) the collected reward and (ii) the obstacle hit rate after an equal number of training episodes. The three agents are characterized as follows: Agent 1 is trained with no obstacles and without the ShieldNN filter in place (Obstacles OFF + Filter OFF); Agent 2 is trained with obstacles spawned at random but without ShieldNN in place (Obstacles ON + Filter OFF); and Agent 3 is trained with obstacles spawned at random and with the ShieldNN filter in place (Obstacles ON + Filter ON).
When obstacles are not present (Agent 1), the RL agent quickly learns how to drive the vehicle, as indicated by the rapid growth in the reward function shown in Fig. 3(a). When obstacles are present but ShieldNN is not used (Agent 2), the RL agent's ability to learn the task degrades, as indicated by a 30% reduction in collected reward. However, when obstacles are present and the ShieldNN filter is in place (Agent 3), the agent collects 28% more reward on average than Agent 2, and a similar amount of reward to Agent 1. This indicates that the ShieldNN filter improves the training of the system by reducing the number of episodes that are terminated early due to collisions.
Similar behavior can be observed in Fig. 3(b), which shows the obstacle collision rate (averaged across episodes). This figure shows that Agent 2 slowly learns how to avoid obstacles, since its average obstacle collision rate decreases from 80% to 47% over the 6000 episodes. However, Agent 3, which uses ShieldNN during training, has an obstacle collision rate of almost zero. In total, Agent 3 suffers only three collisions across all 6000 episodes. We believe that these three collisions are due to the discrepancy between the KBM and the dynamics of the vehicle used by the CARLA simulator.
Experiment 2: Safety Evaluation of ShieldNN The goal of this experiment is to validate the safety guarantees provided by ShieldNN when applied to non-safe controllers. To do this, we evaluate the three trained agents from Experiment 1 in the same environment they were trained in, with obstacles spawned randomly according to the same distribution used during training. With this setup, we consider two evaluation scenarios: (i) when the ShieldNN filter is in place (ShieldNN ON) and (ii) when the ShieldNN filter is not in place (ShieldNN OFF). Table 1 shows all six configurations of this experiment. For each configuration, we run 200 episodes and record three metrics: (i) the minimum distance between the center of the vehicle and the obstacles, (ii) the average percentage of track completion, and (iii) the obstacle hit rate across the 200 episodes.
Figs. 4(a) and 4(b) show the histograms of the minimum distance to obstacles for each configuration. The figures also show two vertical lines at 2.3 m and 4 m: the former is the minimum distance at which a collision can occur, given the length of the vehicle, and the latter is the value of the safe distance used to design the ShieldNN filter. Whenever ShieldNN was not used in the 200 testing episodes (ShieldNN OFF, Fig. 4(a)), the average of all the histograms is close to the 2.3 m line, indicating numerous obstacle collisions. The exact obstacle hit rates are reported in Table 1. Upon comparing the histograms in Fig. 4(a) with those in Fig. 4(b), we conclude that ShieldNN nevertheless renders all three agents safe: note that the center of mass of the histograms shifts above the safety radius used to design the ShieldNN filter. In particular, Agents 2 and 3 were able to avoid all the obstacles spawned in all 200 episodes, while Agent 1 hit only 0.5% of the obstacles spawned. Again, we believe this is due to the difference between the KBM used to design the filter and the actual dynamics of the vehicle. In general, the obstacle hit rate is dramatically reduced for all three agents (see Table 1).
Config | Obstacles (Training) | Filter (Training) | Filter (Testing) | Exp. 2 TC% | Exp. 2 OHR% | Exp. 3A TC% | Exp. 3A OHR%
1 | OFF | OFF | OFF | 7.59 | 99.5 | 27.53 | 79.5
2 | OFF | OFF | ON | 98.82 | 0.5 | 98.73 | 0.5
3 | ON | OFF | OFF | 94.82 | 8.5 | 71.88 | 34
4 | ON | OFF | ON | 100 | 0 | 100 | 0
5 | ON | ON | OFF | 62.43 | 44 | 50.03 | 60
6 | ON | ON | ON | 100 | 0 | 100 | 0
TC% := Track Completion %; OHR% := Obstacle Hit Rate %
Experiment 3: Robustness of ShieldNN in Different Environments The goal of this experiment is to test the robustness of ShieldNN when applied inside a different environment than the one used to train the RL agents. We split the experiment into two parts:
Part 3A:
We use the same setup and metrics as in Experiment 2, but we perturb the locations of the spawned obstacles by Gaussian noise in the lateral and longitudinal directions. Figs. 4(c) and 4(d) show that despite this obstacle perturbation, ShieldNN is still able to maintain a safe distance between the vehicle and the obstacles, whereas this is not the case when ShieldNN is OFF. Table 1 shows an overall increase in obstacle hit rate and a decrease in track completion rate when ShieldNN is OFF, compared to the previous experiment. This is expected, as the PPO algorithm was trained with obstacles spawned according to a different distribution than the one used in testing. However, ShieldNN continues to deliver its performance and safety guarantees, with a near-100% track completion rate and a near-zero obstacle hit rate (Table 1).

Part 3B:
We use a completely different environment and track than the ones used in training, but we spawn the obstacles at locations with the same distribution used in training. We first perform transfer learning and train the pretrained Agents 2 and 3 for 500 episodes in the new environment. In this case, ShieldNN still achieves the desired safety distance on average, with exactly zero obstacle hit rates in both cases; it also achieves high track completion rates for both agents. Implementation details and results for Experiment 3B are included in the Supplementary Material.

Side Effects of ShieldNN: In our experiments, applying ShieldNN during training had the side effect of a higher curb hitting rate during both training and testing, as compared to the case when the agent was trained with ShieldNN OFF. In particular, after training for 6000 episodes, the curb hitting rate decreased for both Agents 2 and 3, but remained higher for Agent 3 (trained with ShieldNN ON). This is because ShieldNN forces the vehicle to steer away from an obstacle it is facing, which, in turn, increases the probability of hitting one of the sides of the road. This side effect suggests future research in generalizing ShieldNN to provide safety guarantees against hitting environment boundaries as well.
References
 (1) J. Kong, M. Pfeiffer, G. Schildbach, and F. Borrelli, “Kinematic and dynamic vehicle models for autonomous driving control design,” in 2015 IEEE Intelligent Vehicles Symposium (IV), pp. 1094–1099, IEEE, 2015.
 (2) A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “CARLA: An open urban driving simulator,” in Proceedings of the 1st Annual Conference on Robot Learning, pp. 1–16, 2017.
 (3) W. Saunders, G. Sastry, A. Stuhlmueller, and O. Evans, “Trial without error: Towards safe reinforcement learning via human intervention,” in Proceedings of the 17th International Conference on Autonomous Agents and Multi-Agent Systems, pp. 2067–2069, International Foundation for Autonomous Agents and Multiagent Systems, 2018.
 (4) A. Liu, G. Shi, S.-J. Chung, A. Anandkumar, and Y. Yue, “Robust regression for safe exploration in control,” arXiv preprint arXiv:1906.05819, 2019.
 (5) F. Berkenkamp, A. Krause, and A. P. Schoellig, “Bayesian optimization with safety constraints: safe and automatic parameter tuning in robotics,” arXiv preprint arXiv:1602.04450, 2016.
 (6) P. Pauli, A. Koch, J. Berberich, and F. Allgöwer, “Training robust neural networks using Lipschitz bounds,” arXiv preprint arXiv:2005.02929, 2020.
 (7) C. Gaskett, “Reinforcement learning under circumstances beyond its control,” 2003.
 (8) T. M. Moldovan and P. Abbeel, “Safe exploration in Markov decision processes,” arXiv preprint arXiv:1205.4810, 2012.
 (9) M. Turchetta, F. Berkenkamp, and A. Krause, “Safe exploration in finite Markov decision processes with Gaussian processes,” in Advances in Neural Information Processing Systems, pp. 4312–4320, 2016.
 (10) L. Wen, J. Duan, S. E. Li, S. Xu, and H. Peng, “Safe reinforcement learning for autonomous vehicles through parallel constrained policy optimization,” arXiv preprint arXiv:2003.01303, 2020.
 (11) F. Berkenkamp, M. Turchetta, A. Schoellig, and A. Krause, “Safe model-based reinforcement learning with stability guarantees,” in Advances in Neural Information Processing Systems, pp. 908–918, 2017.
 (12) Y. Chow, O. Nachum, A. Faust, E. Duenez-Guzman, and M. Ghavamzadeh, “Lyapunov-based safe policy optimization for continuous control,” arXiv preprint arXiv:1901.10031, 2019.
 (13) Y. Chow, O. Nachum, E. Duenez-Guzman, and M. Ghavamzadeh, “A Lyapunov-based approach to safe reinforcement learning,” in Advances in Neural Information Processing Systems, pp. 8092–8101, 2018.
 (14) T. Koller, F. Berkenkamp, M. Turchetta, and A. Krause, “Learning-based model predictive control for safe exploration,” in 2018 IEEE Conference on Decision and Control (CDC), pp. 6059–6066, IEEE, 2018.
 (15) A. K. Akametalu, J. F. Fisac, J. H. Gillula, S. Kaynama, M. N. Zeilinger, and C. J. Tomlin, “Reachability-based safe learning with Gaussian processes,” in 53rd IEEE Conference on Decision and Control, pp. 1424–1431, IEEE, 2014.
 (16) V. Govindarajan, K. Driggs-Campbell, and R. Bajcsy, “Data-driven reachability analysis for human-in-the-loop systems,” in 2017 IEEE 56th Annual Conference on Decision and Control (CDC), pp. 2617–2622, IEEE, 2017.
 (17) J. F. Fisac, A. K. Akametalu, M. N. Zeilinger, S. Kaynama, J. Gillula, and C. J. Tomlin, “A general safety framework for learning-based control in uncertain robotic systems,” IEEE Transactions on Automatic Control, vol. 64, no. 7, pp. 2737–2752, 2018.
 (18) L. Wang, E. A. Theodorou, and M. Egerstedt, “Safe learning of quadrotor dynamics using barrier certificates,” in 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 2460–2465, IEEE, 2018.
 (19) R. Cheng, G. Orosz, R. M. Murray, and J. W. Burdick, “End-to-end safe reinforcement learning through barrier functions for safety-critical continuous control tasks,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3387–3395, 2019.
 (20) K. P. Wabersich and M. N. Zeilinger, “Scalable synthesis of safety certificates from data with application to learning-based control,” in 2018 European Control Conference (ECC), pp. 1691–1697, IEEE, 2018.
 (21) M. Srinivasan, A. Dabholkar, S. Coogan, and P. Vela, “Synthesis of control barrier functions using a supervised machine learning approach,” arXiv preprint arXiv:2003.04950, 2020.
 (22) A. J. Taylor, A. Singletary, Y. Yue, and A. D. Ames, “A control barrier perspective on episodic learning via projectiontostate safety,” arXiv preprint arXiv:2003.08028, 2020.
 (23) X. Li and C. Belta, “Temporal logic guided safe reinforcement learning using control barrier functions,” arXiv preprint arXiv:1903.09885, 2019.
 (24) R. Cheng, M. J. Khojasteh, A. D. Ames, and J. W. Burdick, “Safe multi-agent interaction through robust control barrier functions with learned uncertainties,” arXiv preprint arXiv:2004.05273, 2020.
 (25) A. Robey, H. Hu, L. Lindemann, H. Zhang, D. V. Dimarogonas, S. Tu, and N. Matni, “Learning control barrier functions from expert demonstrations,” arXiv preprint arXiv:2004.03315, 2020.
 (26) G. Shi, X. Shi, M. O’Connell, R. Yu, K. Azizzadenesheli, A. Anandkumar, Y. Yue, and S.-J. Chung, “Neural lander: Stable drone landing control using learned dynamics,” in 2019 International Conference on Robotics and Automation (ICRA), pp. 9784–9790, IEEE, 2019.
 (27) X. Sun, H. Khedr, and Y. Shoukry, “Formal verification of neural network controlled autonomous systems,” in Proceedings of the 22nd ACM International Conference on Hybrid Systems: Computation and Control, pp. 147–156, 2019.
 (28) S. Dutta, S. Jha, S. Sankaranarayanan, and A. Tiwari, “Output range analysis for deep feedforward neural networks,” in NASA Formal Methods Symposium, pp. 121–138, Springer, 2018.
 (29) C. Liu, T. Arnon, C. Lazarus, C. Barrett, and M. J. Kochenderfer, “Algorithms for verifying deep neural networks,” arXiv preprint arXiv:1903.06758, 2019.

 (30) M. Fazlyab, A. Robey, H. Hassani, M. Morari, and G. Pappas, “Efficient and accurate estimation of Lipschitz constants for deep neural networks,” in Advances in Neural Information Processing Systems, pp. 11423–11434, 2019.
 (31) W. Xiang, D. M. Lopez, P. Musau, and T. T. Johnson, “Reachable set estimation and verification for neural network models of nonlinear dynamic systems,” in Safe, Autonomous and Intelligent Vehicles, pp. 123–144, Springer, 2019.
 (32) R. Ivanov, J. Weimer, R. Alur, G. J. Pappas, and I. Lee, “Verisig: verifying safety properties of hybrid systems with neural network controllers,” in Proceedings of the 22nd ACM International Conference on Hybrid Systems: Computation and Control, pp. 169–178, 2019.
 (33) A. D. Ames, S. Coogan, M. Egerstedt, G. Notomista, K. Sreenath, and P. Tabuada, “Control barrier functions: Theory and applications,” in 2019 18th European Control Conference (ECC), pp. 3420–3431, IEEE, 2019.
 (34) A. D. Ames, X. Xu, J. W. Grizzle, and P. Tabuada, “Control barrier function based quadratic programs for safety critical systems,” IEEE Transactions on Automatic Control, vol. 62, no. 8, pp. 3861–3876, 2016.
 (35) X. Xu, P. Tabuada, J. W. Grizzle, and A. D. Ames, “Robustness of Control Barrier Functions for Safety Critical Control,” IFAC-PapersOnLine, vol. 48, no. 27, pp. 54–61, 2015.
 (36) J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
 (37) A. Raffin, A. Hill, K. R. Traoré, T. Lesort, N. Díaz-Rodríguez, and D. Filliat, “Decoupling feature extraction from policy learning: assessing benefits of state representation learning in goal based robotics,” arXiv preprint arXiv:1901.08651, 2019.
Additional Notation
Throughout the rest of this appendix we will use the following notation:
(15) 
where is the right-hand side of the ODE in (1) and the variable is merely a placeholder, since (15) does not depend on it at all. In particular, (15) has the following relationship with (12):
(16) 
Moreover, we define the following set:
(17) 
which is the subset of the zero-level set of that is compatible with our assumption that (see Assumption 1).
Proofs for Section 4
There are two claims from Section 4 that require proof.

First, we stated Theorem 1 without proof. Second, we claimed that at least one candidate in our class of barrier functions is in fact a barrier function for each instance of the KBM (Theorem 2).
We provide proofs for each of these claims in the next two subsections.
Proof of Theorem 1
We prove the first claim of Theorem 1 as the following Lemma.
Lemma 1.
Consider any fixed parameters , , and . Furthermore, define
(18) 
Now suppose that is as in (7), and is as in (9), with chosen such that .
Then for each , the function
(19) 
is increasing on its domain, .
Remark 2.
Note the relationship between the function in (19) and the function used to define in Corollary 1. That is the set that we are interested in characterizing in Theorem 1.
Proof.
We will show that when , each such function has a strictly positive derivative on its domain. In particular, differentiating gives:
(20) 
To ensure that this derivative is strictly positive, it suffices to choose such that
(21) 
For this, we consider two cases: and .
When , then for all . Thus it suffices to choose such that
(22) 
which is assured under the assumption that if
(23) 
Now, we have the prerequisites to prove Theorem 1.
Proof.
(Theorem 1) The first claim of Theorem 1 is proved as Lemma 1. Thus, it remains to show that for any with — that is, — we have that (11) holds. This follows from Lemma 1, as we now show.
In particular, choose an arbitrary , and choose an arbitrary ; as usual we will only need to concern ourselves with the steering control, . First, observe that by definition:
(25) 
However, the conclusion of this implication can be rewritten using the definition (19):
(26) 
We now invoke Lemma 1: since by construction, Lemma 1 indicates that is strictly increasing on the interval . Combining this conclusion with (26), we see that . Again using the definition of in (26), we conclude that
(27) 
Thus, we conclude that by the definition thereof (see the statement of Theorem 1). Finally, since and were chosen arbitrarily, we get the desired conclusion. ∎
Proof that a Barrier Function Exists for Each KBM Instance
For and to be a useful class of barrier functions, it should be the case that at least one of these candidates is in fact a barrier function for each instance of the KBM. We make this claim in the form of the following theorem.
Theorem 2.
Consider any KBM robot with length parameters ; maximum steering angle ; and maximum velocity . Furthermore, suppose that the following two conditions hold:

, or equivalently, ;

.
Then for every such that the set is nonempty. In particular, the feedback controller (interpreted as a function of only):
(28) 
is safe.
Remark 3.
Note that there is always a choice of and such that condition (ii) can be satisfied. In particular, it suffices for and to be chosen such that:
(29) 
Thus, by making large enough relative to , it is possible to choose a such that the inequality (29) holds, and (ii) is satisfied.
Proof.
The strategy of the proof will be to consider the control , and verify that for each such that , we have:
(30) 
The symmetry of the problem will allow us to make a similar conclusion for .
We proceed by partitioning the interval into the following three intervals:
and consider the cases that is in each such interval separately.
Case 1 : In this case, , and by assumption. It is direct to show that:
(31) 
and
(32) 
Hence, the term in (12) can be lower bounded by zero, and the first term in (12) is identically zero by (32). Thus, in this case, (12) is lower bounded as:
(33) 
which is greater than zero since with by assumption (i).
Case 2 : In this case, . Thus, for , we have that:
(34)  
(35)  
(36) 
Consequently, (30) is automatically satisfied, since all of the quantities in the Lie derivative are positive.
Case 3 : In this case, as in Case 2. However, the term is now negative:
(37) 
Thus, since the other two terms are positive on this interval, we need to have:
(38) 
This follows because on , and ; i.e. we substituted the lower and upper end points of , respectively. Noting that , we finally obtain:
(39) 
The preceding is just another form of (ii), so we have the desired conclusion in (30).
The conclusion of the theorem then follows from the combined consideration of Cases 1–3 and Theorem 1, as claimed above. ∎
Proofs for Section 5
Additional Notation
Throughout the rest of this appendix we will use the following notation:
(40) 
where is the right-hand side of the ODE in (1) and the variable is merely a placeholder, since (40) does not depend on it at all. In particular, (40) has the following relationship with (12):
(41) 
Moreover, we define the following set:
(42) 
which is the subset of the zero-level set of that is compatible with our assumption that (see Assumption 1).
Proofs
ShieldNN Verifier
Recall that the main function of the ShieldNN verifier is to soundly verify that
(43) 
for a concave function and with . The conclusion about follows directly from the symmetry of the problem, so we will focus on verifying the claims for .
As a foundation for the rest of this subsection, we make the following observation.
Proposition 1.
Suppose that (43) holds with . Then for any such that it is the case that
(44) 
Proof.
This follows directly from the definition of , and the fact that we are considering it on the barrier, i.e. for which implies that and hence that . ∎
This suggests that we should start from (44) in order to establish the claim in (43). To this end, let be real numbers, and define:
(45) 
with the appropriate modifications for other interval types , and . We also define a related quantity:
(46) 
We can thus develop a sound algorithm to verify (43) and the concavity of by soundly verifying the following three properties in sequence:
Property 1. Show that ; that is, intersects the lower control constraint at a single orientation angle, . And likewise by symmetry.
Property 2. Verify that is the graph of a function (likewise for by symmetry), and that . Thus, define according to .
Property 3. Verify that as defined in Property 2 is concave.
The ShieldNN verifier algorithm expresses each of these properties as the sound verification that a particular function is greater than zero on a subset of its domain. Naturally, the functions associated with these properties are either itself or else derived from it (i.e. obtained by differentiation), and so each is an analytic function in which the variables and appear only inside trigonometric functions. Thus, these surrogate verification problems are easily approachable by over-approximation and the Mean Value Theorem.
With this program in mind, the remainder of this appendix consists of one section for each of Properties 1–3, explaining how to express it as a minimum-verification problem. These are followed by a section that describes the main algorithmic component of the ShieldNN verifier, CertifyMin.
Verifying Property 1
To verify Property 1, we can start by using a numerical root-finding algorithm to find a zero of , viewed as a function of . However, there is no guarantee that this root, call it , is the only root on the set . Thus, the property to be verified in this case is stated in the assumptions of the following proposition.
Proposition 2.
Suppose that . Furthermore, suppose that there exists an such that:

;

and ;
