I Introduction
Guaranteeing safety is an open challenge in designing control policies for many autonomous robotic systems, ranging from consumer electronics to self-driving cars and aircraft. In recent years, the development of machine learning (ML) has created unprecedented opportunities to control modern autonomous systems of growing complexity. However, ML also poses great challenges for developing high-assurance autonomous systems that are provably dependable. While many learning-based approaches [33, 32, 38, 27] have been proposed to train controllers to accomplish complex tasks with improved empirical performance, the lack of safety certificates for the learning-enabled components has been a fundamental hurdle that blocks the massive deployment of the learned solutions.
For decades, mathematical control certificates have been used as proofs that the desired properties of a system are satisfied in closed loop with certain control policies. For example, Control Lyapunov Functions [20, 22] and Control Contraction Metrics [28, 29, 40, 45] ensure the existence of a controller under which the system converges to an equilibrium or a desired trajectory. Control Barrier Functions [3, 2, 5, 9, 11, 10, 34] ensure the existence of a controller that keeps the system inside a safe invariant set. Classical control synthesis methods based on Sum-of-Squares [4, 2, 30, 42]
and linear programs [23, 35] usually only work for systems with low-dimensional state spaces and simple dynamics. This is because they choose polynomials as the candidate certificate functions: low-order polynomials may not be valid certificates for complex dynamical systems, and high-order polynomials increase the number of decision variables in those optimization problems exponentially. Recent data-driven methods [34, 7, 40, 13] have shown significant progress in overcoming the limitations of the traditional synthesis methods. They can jointly learn the control certificates and policies as neural networks (NN). However, data-driven methods that can generate control certificates still require an explicit (or white-box) differentiable model of the system dynamics, since the derivatives of the dynamics are needed in the learning process. Finally, many works study the use of control certificates such as CBFs to provide safety guarantees during and after training the policies
[10, 14], but they more or less require an accurate model of the system dynamics, or a reasonable model of the system uncertainties, to build the certificates.
Many dynamical systems in the real world are black-box and lack accurate models. The most popular model-free approach to handle such black-box systems is safe reinforcement learning (RL). Safe RL methods enforce safety and performance by maximizing the expectation of the cumulative reward while constraining the expectation of the cost to be less than or equal to a given threshold. The biggest disadvantage of safe RL methods is the lack of a systematic or theoretically grounded way of designing the cost and reward functions, which instead relies heavily on empirical trial and error. The lack of explainable safety guarantees and low sampling efficiency also make it difficult for safe RL methods to exhibit satisfactory performance.
Instead of stressing the trade-off between the strong guarantees from control certificates and the practicability of model-free methods, in this work we propose SABLAS to achieve both. SABLAS is a general-purpose approach to learning safe control for black-box dynamical systems. SABLAS enjoys the guarantees provided by the safety certificate from CBF theory without requiring an accurate model of the dynamics. Instead, SABLAS only needs a nominal dynamics function that can be obtained through regression over simulation data. There is no need to model the error between the nominal dynamics and the real dynamics, since SABLAS redesigns the loss function in a novel way that backpropagates gradients to the controller even when the black-box dynamical system is non-differentiable. The resulting CBF (and the corresponding safety certificate) holds directly on the original black-box system if the training process converges. The proposed algorithm is easy to implement and follows almost the same procedure as learning CBFs for white-box systems, with minimal modification. SABLAS fundamentally solves the problem that control certificates cannot be learned directly on black-box systems, and opens the next chapter in the use of CBF theory for synthesizing safe controllers for black-box dynamical systems.
Experimental results demonstrate the advantages of SABLAS over leading learning-based safe control methods for black-box systems, including CPO [1], PPO-Safe [38] and TRPO-Safe [37, 41]. We evaluate SABLAS on two challenging tasks in simulation: drone control in a city and ship control in a valley (as shown in Fig. 1). The dynamics of the drone and ship are assumed unknown. In both tasks, the controlled agent should avoid collision with uncontrolled agents and other obstacles, and reach its goal before the testing episode ends. We also examine the generalization capability of SABLAS on testing scenarios that are not seen in training. Fig. 1 shows that SABLAS can reach a near 1.0 relative safety rate and task completion rate while using only a fraction of the training data required by existing safe RL methods, demonstrating a significant improvement. We also study the effect of the model error (between the nominal model and the actual dynamics) on the performance of the learned policy. We show that SABLAS tolerates large model errors while keeping a high safety rate. A detailed description of the results is presented in the experiment section. Video results can be found in the supplementary materials.
To summarize the strengths of SABLAS: 1. SABLAS can jointly find a safety certificate (i.e., a CBF) and the corresponding control policy on black-box dynamics; 2. Unlike RL-based methods that need tedious trial and error in designing the rewards, SABLAS provides a systematic way of learning a certified control policy, without parameters (other than the standard hyperparameters in NN training) that need fine-tuning; 3. Empirical results show that SABLAS achieves nearly perfect performance in terms of guaranteeing safety and goal-reaching, using far fewer samples than state-of-the-art safe RL methods.
II Related Work
There is a rich literature on controlling black-box systems and on safe RL. Due to the space limit, we only discuss a few directly related and commonly used techniques for black-box system control. We also skip the literature review for the large body of work on model-based safe control and trajectory planning, as the research problems solved there are very different from ours.
II-A Controller Synthesis for Black-box Systems
The proportional–integral–derivative (PID) controller is widely used in controlling black-box systems. Its advantage is that it does not rely on a model of the system and only requires measurements of the state. A drawback of the PID controller is that it does not guarantee safety or stability, and the system may overshoot or oscillate about the control setpoint. If the underlying black-box system is linear time-invariant, existing work [8] has presented a polynomial-time control algorithm that does not rely on any knowledge of the environment. For nonlinear black-box systems, the dynamics model can be approximated using system identification and controlled using model-based approaches [18], with PID handling the error part. The concept of practical relative degree [26] has also been proposed to enhance control performance on systems with heavy uncertainties. Recent advances in reinforcement learning [27, 38, 37, 21] also give us insight into treating the system as a pure black box and estimating gradients of the black-box functions in order to optimize the control variables. However, in safety-critical systems, these black-box control methods still lack formal safety guarantees or certificates.
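As a concrete illustration, a minimal discrete-time PID controller of the kind discussed above can be sketched as follows (our own sketch; the gains, timestep, and the toy plant in the usage note are arbitrary assumptions):

```python
# A minimal model-free PID controller: it needs only state measurements,
# not a model of the plant. Gains kp, ki, kd and timestep dt are arbitrary.
class PID:
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral, self.prev_error = 0.0, 0.0

    def control(self, setpoint, measurement):
        error = setpoint - measurement
        self.integral += error * self.dt                    # accumulated error
        derivative = (error - self.prev_error) / self.dt    # finite-difference rate
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Usage: drive a 1-D integrator plant x' = u to the setpoint 1.0.
pid = PID(kp=2.0, ki=0.5, kd=0.1, dt=0.01)
x = 0.0
for _ in range(5000):
    x += 0.01 * pid.control(1.0, x)
```

As the text notes, nothing in this loop prevents overshoot or guarantees safety; the controller only regulates the tracking error.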
Simulation-guided controller synthesis methods can also generate control certificates for black-box systems, and sometimes those certificates indicate how policies should be constructed [23, 42, 35]. However, most of these techniques use polynomial templates for the certificates, which limits their use on high-dimensional and complex systems. Another line of work studies the use of data-driven reachability analysis [15, 16], jointly with receding-horizon control, to construct optimal control policies. These methods rely on side information about the black-box systems (e.g., Lipschitz constants of the dynamics, monotonicity of the states, decoupling in the states' dynamics) to perform the reachability analysis, which is not needed in our method.
II-B Safe Reinforcement Learning
Safe RL [1, 41, 32, 46, 39] extends RL by adding constraints on the expectation of certain cost functions, which encode safety requirements or resource limits. CPO [1] derives a policy improvement step that increases the reward while satisfying the safety constraint. DCRL [32] imposes constraints on state density functions rather than cost value functions, and shows that density constraints have better expressiveness than cost value function-based constraints. RCPO [41] weights the cost using Lagrangian multipliers and adds it to the reward. FISAR [39] uses a meta-optimizer to achieve forward invariance in the safe set. A disadvantage of safe RL methods is that they either do not provide safety guarantees or their safety guarantees cannot be realized in practice. Low sampling efficiency and the sparsity of the cost also increase the difficulty of synthesizing safe controllers through RL.
II-C Safety Certificates and Control Barrier Functions
Mathematical certificates can serve as proofs that the desired property of the system is satisfied under the corresponding planning [44, 43] and control components. Such certificates can guide controller synthesis for dynamical systems in order to ensure safety. For example, Control Lyapunov Functions [20, 22, 13] ensure the existence of a controller so that the system converges to a desired behavior. Control Barrier Functions [3, 2, 5, 9, 11, 10, 34, 12] ensure the existence of a controller that keeps the system inside a safe invariant set. However, existing controller synthesis with safety certificates relies heavily on a white-box model of the system dynamics. For black-box systems or systems with large model uncertainty, these methods are not applicable. While recent work [6] proposes to learn the effect of model uncertainty on the CBF conditions, it still assumes that a hand-crafted CBF is given, which is not always available for complex nonlinear dynamical systems. Our approach represents a substantial improvement over existing CBF-based safe controller synthesis strategies. The proposed SABLAS framework simultaneously enjoys the safety certificate from CBF theory and effectiveness on black-box dynamical systems.
III Preliminaries
III-A Safety of Black-box Dynamical Systems
Definition 1 (Black-box Dynamical System).
A black-box dynamical system is represented by a tuple $\langle \mathcal{X}, \mathcal{U}, f \rangle$, where $\mathcal{X} \subseteq \mathbb{R}^n$ is the state space and $\mathcal{U} \subseteq \mathbb{R}^m$ is the control input space. $f: \mathcal{X} \times \mathcal{U} \to \mathbb{R}^n$ is the system dynamics $\dot{x} = f(x, u)$, which is unknown due to the black-box assumption.
Let $\mathcal{X}_0$, $\mathcal{X}_g$, $\mathcal{X}_s$ and $\mathcal{X}_d$ denote the sets of initial states, goal states, safe states and dangerous states respectively. The problem we aim to solve is formalized in Definition 2.
Definition 2 (Safe Control of Black-box Systems).
Given a black-box dynamical system modeled as in Definition 1, the safe control problem aims to find a controller $\pi: \mathcal{X} \to \mathcal{U}$ such that under control input $u = \pi(x)$ and the unknown dynamics $f$, the following is satisfied:
(1)  $x(0) \in \mathcal{X}_0 \;\Rightarrow\; x(t) \notin \mathcal{X}_d, \quad \forall t \ge 0$
The above definition requires that, starting from the initial set $\mathcal{X}_0$, the system should never enter the dangerous set $\mathcal{X}_d$ under controller $\pi$.
III-B Control Barrier Function as Safety Certificate
A common approach for guaranteeing the safety of dynamical systems is via control barrier functions (CBF) [3], which ensure that the state always stays in a safe invariant set. A control barrier function $h: \mathcal{X} \to \mathbb{R}$ satisfies:
(2)  $h(x) \ge 0,\ \forall x \in \mathcal{X}_0; \qquad h(x) < 0,\ \forall x \in \mathcal{X}_d; \qquad \dot{h}(x) + \alpha(h(x)) \ge 0,\ \forall x \in \{x \mid h(x) \ge 0\}$
where $\dot{h}(x) = \frac{\partial h}{\partial x} f(x, \pi(x))$, and $\alpha(\cdot)$ is a class-$\mathcal{K}$ function that is strictly increasing with $\alpha(0) = 0$. It is proven [3] that if there exists a CBF $h$ for a given controller $\pi$, the system controlled by $\pi$ starting from $\mathcal{X}_0$ will never enter the dangerous set $\mathcal{X}_d$. Besides the formal safety proof in [3], there is an informal but straightforward way to understand the safety guarantee. Whenever $h$ decreases to $0$, we have $\dot{h} \ge -\alpha(0) = 0$, which means $h$ no longer decreases and $h < 0$ will not occur. Thus the state will not enter the dangerous set, where $h < 0$.
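As a toy illustration of this invariance argument (our own example, not from the paper), consider a 1-D integrator $\dot x = u$ with $h(x) = 1 - x^2$ and $\alpha(h) = h$. A controller that minimally corrects a nominal input so that $\dot h + \alpha(h) \ge 0$ holds keeps the state inside the safe set $\{x : h(x) \ge 0\}$:

```python
# Toy 1-D example of the CBF condition: x' = u, h(x) = 1 - x^2, alpha(h) = h.
# Enforcing h_dot + alpha(h) >= 0 keeps the state inside the safe set {h >= 0}.

def safe_input(x, u_nominal):
    h = 1.0 - x * x           # barrier value
    grad_h = -2.0 * x         # dh/dx, so h_dot = grad_h * u
    if grad_h * u_nominal + h >= 0.0:
        return u_nominal      # nominal input already satisfies the CBF condition
    return -h / grad_h        # minimal correction onto the constraint boundary

x, dt = 0.9, 1e-3
for _ in range(20000):
    x += dt * safe_input(x, u_nominal=1.0)  # nominal input pushes at the boundary
```

Along this trajectory $h$ decays toward $0$ but never becomes negative, exactly as the informal argument above predicts.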
III-C Co-learning Controller and CBF for White-box Systems
For white-box dynamical systems where $f$ is known, we can jointly synthesize the controller and its safety certificate satisfying the CBF conditions (2) via a learning-based approach. We model $\pi$ and $h$ as neural networks with parameters $\theta$ and $\phi$. Given a dataset $\mathcal{D}$ of state samples in $\mathcal{X}$, the CBF conditions (2) can be translated into empirical loss functions:
(3)  $\mathcal{L}_{safe} = \frac{1}{|\mathcal{D}\cap\mathcal{X}_0|}\sum_{x\in\mathcal{D}\cap\mathcal{X}_0}\max(0, -h(x)), \quad \mathcal{L}_{dang} = \frac{1}{|\mathcal{D}\cap\mathcal{X}_d|}\sum_{x\in\mathcal{D}\cap\mathcal{X}_d}\max(0, h(x)), \quad \mathcal{L}_{deriv} = \frac{1}{|\mathcal{D}|}\sum_{x\in\mathcal{D}}\max(0, -\dot{h}(x) - \alpha(h(x)))$
Each of the loss functions in (3) corresponds to a CBF condition in (2). In addition to safety, we also account for goal-reaching by penalizing the difference between the safe controller $\pi$ and a goal-reaching nominal controller $\pi_{nom}$ in (4). The synthesis of $\pi_{nom}$ is well studied and is not a contribution of our work.
(4)  $\mathcal{L}_{goal} = \frac{1}{|\mathcal{D}|}\sum_{x\in\mathcal{D}} \|\pi(x) - \pi_{nom}(x)\|^2$
The total loss function is $\mathcal{L} = \mathcal{L}_{safe} + \mathcal{L}_{dang} + \mathcal{L}_{deriv} + \eta\,\mathcal{L}_{goal}$, where $\eta$ is a constant that balances the goal-reaching and safety objectives. The total loss function is minimized via stochastic gradient descent to find the parameters $\theta$ and $\phi$. The dataset $\mathcal{D}$ is not fixed during training and is periodically updated with new samples obtained by running the current controller. When the loss converges to $0$, the resulting controller and CBF will satisfy (2) on unseen testing samples with a generalization bound, as proven in [34]. Therefore, a safe controller and the corresponding CBF are found for the white-box system.
However, for black-box dynamical systems where $f$ is unknown, the co-learning method for the safe controller and CBF described above is no longer applicable. In Section IV, we propose an important and easy-to-implement modification to (3) such that we can leverage a similar co-learning framework to jointly synthesize the safe controller and its corresponding CBF as a safety certificate.
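This co-learning objective can be sketched in PyTorch as follows (our own sketch, not the paper's implementation; the hinge form of the losses, the choice $\alpha(h) = h$, and the network shapes are assumptions):

```python
import torch

# Sketch of the empirical CBF losses for a white-box system f
# (hinge losses, alpha(h) = h, and shapes are our assumptions).
def cbf_losses(h_net, u_net, f, x_init, x_dang, x_all):
    loss_safe = torch.relu(-h_net(x_init)).mean()   # h >= 0 on initial samples
    loss_dang = torch.relu(h_net(x_dang)).mean()    # h < 0 on dangerous samples
    # h_dot = (dh/dx) . f(x, pi(x)), computed with autograd since f is white-box.
    x_all = x_all.requires_grad_(True)
    h = h_net(x_all)
    dh_dx = torch.autograd.grad(h.sum(), x_all, create_graph=True)[0]
    h_dot = (dh_dx * f(x_all, u_net(x_all))).sum(dim=-1, keepdim=True)
    loss_deriv = torch.relu(-h_dot - h).mean()      # h_dot + alpha(h) >= 0
    return loss_safe, loss_dang, loss_deriv
```

In training, these three losses would be summed with the goal-reaching term and minimized over the parameters of both networks; note that only the derivative loss carries gradients back to the controller.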
IV Learning CBF on Black-box Systems
In this section, we first elaborate on why it is difficult to learn a safe controller and its CBF for black-box dynamical systems. We then propose an important and easy-to-implement reformulation of the optimization objective, which makes learning a safe controller for black-box systems as easy as for white-box systems.
IV-A Challenges in Black-box Dynamical Systems
Among the three loss functions in (3), $\mathcal{L}_{deriv}$ is the only one that can propagate gradients to the controller parameters $\theta$. The main challenge of training a safe controller with its CBF for black-box dynamical systems is that this gradient can no longer be backpropagated to $\theta$ when $f$ is unknown. Therefore, a safe controller cannot be trained by minimizing the loss functions in (3).
Given state samples $x(t)$ and $x(t+\Delta t)$ from the black-box system, where $\Delta t$ is a sufficiently small time interval, we can approximate $\dot{h}$ and compute $\mathcal{L}_{deriv}$ as:
(5)  $\dot{h}(x(t)) \approx \frac{h(x(t+\Delta t)) - h(x(t))}{\Delta t}$
The approximation (5) does give the value of $\dot{h}$, but its backward gradient flow to the controller is cut off by the black-box system, which is non-differentiable. Hence (5) can only be used to train the CBF $h$, not the safe controller $\pi$. Even worse, an $h$ obtained by minimizing the loss with (5) does not guarantee that a corresponding safe controller exists. If we had a differentiable expression of the dynamics and could replace $x(t+\Delta t)$ with $x(t) + \Delta t\, f(x(t), \pi(x(t)))$, the gradient flow could successfully reach and update the controller parameters. However, this is not immediately possible because $f$ is unknown by the black-box assumption.
A possible way to backpropagate gradients to $\theta$ is to use a differentiable nominal model $\tilde{f}$. There are many ways to obtain $\tilde{f}$, such as fitting a neural network using sampled data from the real black-box system. We do not require $\tilde{f}$ to perfectly match the real dynamics $f$, because there will always be an error between them. With $\tilde{f}$, we can approximate $\dot{h}$ as:
(6)  $\dot{h}(x) \approx \frac{\partial h}{\partial x} \tilde{f}(x, \pi(x))$
which is differentiable w.r.t. $\theta$ because both $h$ and $\tilde{f}$ are differentiable. The gradient of (6) can be backpropagated to the controller to update its parameters. However $\tilde{f}$ is obtained, though, there still exists an error between the real dynamics $f$ and $\tilde{f}$, which means (6) is not a good approximation of $\dot{h}$ and does not give its true value. Using (6), it is not guaranteed that the third CBF condition in (2) will be satisfied.
IV-B Learning Safe Control for Black-box Dynamics
We present a novel reformulation of $\dot{h}$ that makes learning a safe controller with a CBF for black-box dynamical systems as easy as for white-box systems. The proposed formulation possesses two features: it enables the gradient to backpropagate to the controller during training, and it offers an error-free approximation of $\dot{h}$.
Given state samples $x(t)$ and $x(t+\Delta t)$ from the trajectories of the real black-box dynamical system, where $\Delta t$ is a sufficiently small time interval, we define the residual $\Delta(x, u)$ as:
$\Delta(x, u) = \frac{x(t+\Delta t) - x(t)}{\Delta t} - \tilde{f}(x(t), u)$
then construct $\hat{x}(t+\Delta t)$ as:
$\hat{x}(t+\Delta t) = x(t) + \Delta t \left( \tilde{f}(x(t), u) + \mathrm{NG}[\Delta(x, u)] \right)$
where $\mathrm{NG}[\cdot]$ is an identity function but without gradient. That is, we pretend $\mathrm{NG}[\Delta(x, u)]$ is a constant, and in backpropagation the gradient on $\mathrm{NG}[\Delta(x, u)]$ cannot propagate to its argument $\Delta(x, u)$. In PyTorch [31], there is an off-the-shelf implementation of $\mathrm{NG}[\cdot]$ as detach(), which cuts off the gradient from $\mathrm{NG}[\Delta(x, u)]$ to $\Delta(x, u)$ in backpropagation. Then we approximate $\dot{h}$ using:
(7)  $\hat{\dot{h}}(x(t)) = \frac{h(\hat{x}(t+\Delta t)) - h(x(t))}{\Delta t}$
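In code, this construction reduces to a few lines of PyTorch (a minimal sketch; the function and variable names are ours):

```python
import torch

# Minimal sketch of the stop-gradient surrogate for h_dot.
# x, x_next: states sampled dt apart from the real black-box system;
# f_nom: differentiable nominal dynamics; u: controller output, which
# carries gradients back to the controller parameters.
def h_dot_sablas(h_net, f_nom, x, x_next, u, dt):
    # Residual between the observed derivative and the nominal prediction.
    resid = (x_next - x) / dt - f_nom(x, u)
    # detach() plays the role of NG[.]: keep the forward value, cut the gradient.
    x_hat = x + dt * (f_nom(x, u) + resid.detach())
    # Forward value equals (h(x_next) - h(x)) / dt; gradients reach u via f_nom.
    return (h_net(x_hat) - h_net(x)) / dt
```

In the forward pass the nominal terms cancel, so the value is exactly the finite difference measured on the real system; in the backward pass the detached residual is constant, so gradients flow to the controller only through the nominal model.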
Theorem 1.
$\frac{\partial \hat{\dot{h}}}{\partial \theta}$ exists, and $\hat{\dot{h}} \to \dot{h}$ as $\Delta t \to 0$. Namely, $\hat{\dot{h}}$ is differentiable w.r.t. the controller parameters $\theta$, and $\hat{\dot{h}}$ is an error-free approximation of $\dot{h}$ as $\Delta t \to 0$.
Proof.
Since $\tilde{f}$ is differentiable, $\tilde{f}(x, u)$ and $\hat{x}(t+\Delta t)$ are differentiable w.r.t. $u$, so $h(\hat{x}(t+\Delta t))$ and $\hat{\dot{h}}$ are also differentiable w.r.t. $u$ and its parameters $\theta$. Thus, $\frac{\partial \hat{\dot{h}}}{\partial \theta}$ exists. Furthermore, since $\tilde{f}(x(t), u) + \mathrm{NG}[\Delta(x, u)] = \frac{x(t+\Delta t) - x(t)}{\Delta t}$ in the forward pass, we have $\hat{x}(t+\Delta t) = x(t+\Delta t)$, so $\hat{\dot{h}} = \frac{h(x(t+\Delta t)) - h(x(t))}{\Delta t}$, which is an error-free approximation of the real $\dot{h}$ when $\Delta t \to 0$. Thus $\hat{\dot{h}}$ is also an error-free approximation of $\dot{h}$ when $\Delta t \to 0$. ∎
Note that Theorem 1 reveals why the proposed SABLAS method can jointly learn the CBF and the safe controller for black-box dynamical systems. First, since $\frac{\partial \hat{\dot{h}}}{\partial \theta}$ exists, the gradient from $\hat{\dot{h}}$ can be backpropagated to the controller parameters $\theta$ to learn a safe controller. In contrast, the finite-difference approximation (5) is not differentiable w.r.t. $\theta$, so it cannot be used to train the controller. Second, since $\hat{\dot{h}}$ is a good approximation of $\dot{h}$, minimizing the derivative loss computed with $\hat{\dot{h}}$ contributes to the minimization of $\mathcal{L}_{deriv}$ and the satisfaction of the third CBF condition in (2). In contrast, the nominal-model approximation (6) is an inaccurate approximation of $\dot{h}$, as we elaborated in Section IV-A. The construction of $\hat{\dot{h}}$ thus incorporates the advantages of (5) and (6) while avoiding their disadvantages. The computational graph of $\hat{\dot{h}}$ is illustrated in Fig. 2, which shows the forward pass and the backward gradient propagation from $\hat{\dot{h}}$ to the controller $\pi$.
Remark 1. One may argue that the gradient received by $\theta$ via minimizing the loss on $\hat{\dot{h}}$ is not exactly the gradient it would receive if we had a perfect differentiable model of the black-box system. Despite this, minimizing the loss on $\hat{\dot{h}}$ directly contributes to the satisfaction of the third CBF condition in (2). A safe controller and its CBF can thus be found to keep the system within the safe set.
Remark 2. Although the current formulation of $\hat{\dot{h}}$ leads to promising performance in simulation, as we will show in the experiments, it requires further consideration in hardware experiments. Directly using the future state $x(t+\Delta t)$ to calculate the time derivative of $x$ or $h$ is not always desirable, because noise can dominate the numerical differentiation. When the noise dominates the time derivative of $x$ or $h$, training will have convergence issues. A moderate noise is actually beneficial to training, because our optimization objective makes the CBF conditions hold even under noise disturbance, which increases the robustness of the trained CBF and controller. On physical robots where noise dominates the numerical differentiation, one can incorporate filtering techniques to mitigate the noise.
Combining $\hat{\dot{h}}$ with the loss functions $\mathcal{L}_{safe}$, $\mathcal{L}_{dang}$ and $\mathcal{L}_{goal}$ in (3) and (4), the total loss function can be formulated as:
(8)  $\mathcal{L} = \mathcal{L}_{safe} + \mathcal{L}_{dang} + \hat{\mathcal{L}}_{deriv} + \eta\,\mathcal{L}_{goal}$
where $\hat{\mathcal{L}}_{deriv}$ is $\mathcal{L}_{deriv}$ computed with $\hat{\dot{h}}$ in place of $\dot{h}$, and $\eta$ is a constant balancing the safety and goal-reaching objectives. $\mathcal{L}$ is minimized via stochastic gradient descent. Algorithm 1 summarizes the learning process for the safe controller and the corresponding safety certificate (CBF) for black-box dynamical systems. The Run function runs the black-box system using the controller $\pi$ from a sampled initial condition and returns the trajectory data. The Update function updates the parameters $\theta$ and $\phi$ by minimizing $\mathcal{L}$ on the state samples in $\mathcal{D}$ via gradient descent.
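A condensed sketch of this training loop might look as follows (our own simplification: the safe/dangerous-sample losses from (3) are omitted for brevity, `run_system` stands in for the Run function, and all names and hyperparameter values are placeholders):

```python
import torch

# Condensed sketch of the training loop (hypothetical names; hyperparameters
# are placeholders). run_system(controller) plays the role of Run: it executes
# the black-box simulator with the current controller and returns states
# sampled dt apart. u_nominal is the goal-reaching nominal controller.
def train_sablas(h_net, u_net, f_nom, run_system, u_nominal,
                 epochs=10, dt=0.1, eta=0.1, lr=1e-3):
    params = list(h_net.parameters()) + list(u_net.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        x, x_next = run_system(u_net)        # refresh dataset with current policy
        u = u_net(x)
        resid = (x_next - x) / dt - f_nom(x, u)
        x_hat = x + dt * (f_nom(x, u) + resid.detach())     # stop-gradient surrogate
        h = h_net(x)
        h_dot = (h_net(x_hat) - h) / dt
        loss_deriv = torch.relu(-h_dot - h).mean()          # third CBF condition
        loss_goal = ((u - u_nominal(x)) ** 2).mean()        # goal-reaching term
        loss = loss_deriv + eta * loss_goal                 # safe/dang losses omitted
        opt.zero_grad(); loss.backward(); opt.step()
    return h_net, u_net
```

The only difference from a white-box training loop is the detached residual; everything else follows the standard co-learning procedure of Section III-C.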
V Experiment
The primary objective of our experiments is to examine the effectiveness of the proposed method in terms of safety and goal-reaching when controlling black-box dynamical systems. We conduct comprehensive experiments on two simulation environments, illustrated in Fig. 1 (a) and (b), and compare with state-of-the-art learning-based control methods for black-box systems.
V-A Task Environment Description
Drone control in a city (CityEnv)
In our first case study, we consider a package delivery task in a city using drones, as illustrated in Fig. 1 (a). There is one controlled drone and 1024 non-player character (NPC) drones that are not controlled by our controller. In each simulation episode, each drone is assigned a sequence of randomly selected goals to visit. The aim of our controller is to make sure the controlled drone reaches its goals while avoiding collision with the NPC drones at all times. A reference trajectory is given, which sequentially connects the goals and avoids collision with buildings. The reference trajectory can be generated by any off-the-shelf single-agent path planning algorithm. We use FACTEST [17] in our implementation; other options such as RRT [25] are also suitable. The reference path planner does not need to consider dynamic obstacles, such as the moving NPCs in our experiment. A nominal controller $\pi_{nom}$ is also given, which outputs control commands that drive the drone to follow the reference trajectory. However, $\pi_{nom}$ is purely for goal-reaching and does not consider safety. CityEnv has two modes: static NPCs and moving NPCs. If the NPCs are static, they constantly stay at their initial locations. If the NPCs are moving, they follow pre-planned trajectories to sequentially visit their goals. The drone model's state includes its position, velocities, and the roll and pitch angles. The control inputs are the angular accelerations of the roll and pitch angles and the vertical thrust. The underlying dynamics model is from [34] and is assumed unknown to the controller and CBF in our experiment.
Ship control in a valley (ValleyEnv)
In our second case study, we consider the task of controlling a ship in a valley, illustrated in Fig. 1 (b). There is one controlled ship and 32 NPC ships. The number of NPCs in ValleyEnv is smaller than in CityEnv because ships have large size and inertia and are hard to maneuver in dense traffic. Also, unlike the 3D CityEnv, ValleyEnv is 2D, which means the agents have fewer degrees of freedom to avoid collision. The initial location and goal location of each ship are randomly initialized at the beginning of each episode. The aim of our controller is to ensure the controlled ship reaches its goal while avoiding collision with the NPC ships. Similar to CityEnv, a reference trajectory and a nominal controller are provided. ValleyEnv also has static and moving NPC modes, as in CityEnv. The ship model's state includes its position, the heading angle, the speeds in the ship body coordinates, and the angular velocity of the heading angle. The ship model is from Sec. 4.2 of [19] and is unknown to the controller and CBF.
V-B Evaluation Criteria
Three evaluation criteria are considered. Relative safety rate measures the improvement in safety compared to a nominal controller that only targets goal-reaching, not safety. To formally define the relative safety rate, we first consider the absolute safety rate $\sigma(\pi)$, which measures the proportion of time that the system stays outside the dangerous set. Given two control policies $\pi$ and $\pi_0$ with absolute safety rates $\sigma(\pi)$ and $\sigma(\pi_0)$, the relative safety rate of $\pi$ w.r.t. $\pi_0$ is defined as $\frac{\sigma(\pi) - \sigma(\pi_0)}{1 - \sigma(\pi_0)}$. If the relative safety rate is $0$, then control policy $\pi$ does not have any improvement over $\pi_0$ in terms of safety. If it is $1$, then $\pi$ completely guarantees the safety of the system. In our experiments, $\pi$ is the controller to be evaluated, and $\pi_0$ is the nominal controller that only accounts for goal-reaching without considering safety. Task completion rate is defined as the success rate of reaching the goal state before timeout. Tracking error is the average deviation of the system's state trajectory from the pre-planned reference trajectory. Note that we do not assume the reference trajectory always stays outside the dangerous set.
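The two safety metrics can be computed with a small plain-Python helper (our own code; since the text only pins down the endpoints of the relative rate, 0 for no improvement and 1 for fully safe, treat the exact normalization as our reading):

```python
# Safety metrics as we read Sec. V-B. in_danger holds one boolean per timestep;
# the relative rate rescales sigma(pi) against the goal-only nominal controller.
def absolute_safety_rate(in_danger):
    # Proportion of timesteps spent outside the dangerous set.
    return 1.0 - sum(in_danger) / len(in_danger)

def relative_safety_rate(sigma_pi, sigma_nominal):
    # 0 -> no improvement over the nominal controller; 1 -> fully safe.
    if sigma_nominal >= 1.0:
        return 1.0                      # nominal is already safe; nothing to improve
    return (sigma_pi - sigma_nominal) / (1.0 - sigma_nominal)
```

For example, a policy that is safe in every timestep gets relative rate 1 regardless of how unsafe the nominal controller was, which is what makes the metric comparable across environments of different difficulty.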
V-C Baseline Approaches
In terms of safe control for black-box systems, the most recent state-of-the-art approaches are safe reinforcement learning (safe RL) algorithms. We choose three safe RL algorithms for comparison. CPO [1] is a general-purpose policy optimization algorithm for black-box systems that maximizes the expected reward while satisfying safety constraints. PPO-Safe is a combination of PPO [38] and RCPO [41]: it uses PPO to maximize the expected cumulative reward while leveraging the Lagrangian multiplier update rule from RCPO to enforce the safety constraint. TRPO-Safe is a combination of TRPO [37] and RCPO [41]: the expected reward is maximized via TRPO and the safety constraints are imposed using the Lagrangian multiplier from RCPO.
V-D Implementation and Training
Both the controller $\pi$ and the CBF $h$ are multi-layer perceptrons (MLP) with architectures adopted from Sec. 4.2 of [34]. $\pi$ and $h$ take as input not only the state of the controlled agent but also the states of the 8 nearest NPCs that the controlled agent can observe. In Algorithm 1, we choose . The total number of state samples collected during training is . In Update of the algorithm, we use the Adam [24] optimizer with learning rate and batch size 1024. Gradient descent runs for 100 iterations in Update. The nominal model dynamics $\tilde{f}$ are fitted from trajectory data in simulation. We used state samples to fit a linear approximation of the drone dynamics, and samples to fit a nonlinear 3-layer MLP as the ship dynamics.
In training the safe RL methods, the reward in every step is the negative distance between the system's current state and the goal state, and the cost is 1 if the system is within the dangerous set and 0 otherwise. The threshold for the expected cost is set to 0, which means we wish the system to never enter the dangerous set (never reach a state with a positive cost). During training, the agent runs the system for timesteps in total and performs policy updates. In each policy update, 100 iterations of gradient descent are performed. The implementation of the safe RL methods is based on [36].
All methods are trained with static NPCs and tested with both static and moving NPCs. We believe this makes the testing more challenging and examines the generalization capability of the tested methods in different scenarios. All agents are assigned random initial and goal locations in every simulation episode, which prevents the learned controller from overfitting to a single configuration.
V-E Experimental Results
Safety and goalreaching performance
Results are shown in Fig. 3. Among the compared methods, ours is the only one that reaches a high task completion rate and a high relative safety rate at the same time. For other methods such as TRPO-Safe, when the controlled drone or ship is about to hit an NPC, the learned controller tends to brake and decelerate. Thus, the agent is less likely to reach its goal before the simulation episode ends. The task completion rate and safety rate are in opposition for CPO, PPO-Safe and TRPO-Safe. In contrast, the controller obtained by our method maneuvers smoothly among the NPCs without severe deceleration, which enables the controlled agent to reach the goal location on time. Our method also keeps a relatively low tracking error, which means the difference between the actual trajectories and the reference trajectories is small.
Generalization capability to unseen scenarios
Sampling efficiency
In Fig. 4 (a), we show the safety performance under different sizes of the training set. The results are averaged over the drone and ship control tasks. SABLAS only needs a small fraction of the samples required by the compared methods to achieve a nearly perfect relative safety rate. Note that SABLAS requires extra samples to fit the nominal dynamics, but this does not change the fact that the total number of samples needed by SABLAS is much smaller than for the baselines.
Effect of model error
We investigate the influence of the model error between the real dynamics $f$ and the nominal dynamics $\tilde{f}$ on the safety performance. We vary the modeling error of the drone model and test the learned controller on CityEnv with static NPCs. We also perform an ablation study where we use the nominal approximation in Eqn. (6) instead of the surrogate in Eqn. (7) as the loss function. The red curve in Fig. 4 (b) shows that SABLAS is tolerant to large model errors while exhibiting a promising safety rate. In our previous experiments, the model error is always less than this level, and we did not encounter any difficulty fitting a nominal model with such empirical error. The orange curve in Fig. 4 (b) shows that if we use Eqn. (6), the trained controller has worse performance in terms of safety rate. This is because Eqn. (6) only uses the nominal dynamics to calculate the loss, without leveraging data from the real black-box dynamics.
V-F Discussion on Limitations
The main limitation of the proposed approach is that it cannot guarantee the satisfaction of the CBF conditions (2) over the entire state space. Even if we minimize the losses to $0$ during training, the CBF conditions may still occasionally be violated during testing. After all, the training samples are finite and cannot cover the continuous state space. If the testing distribution and training distribution are the same, one can leverage the Rademacher complexity to bound the rate at which the CBF conditions are violated, as in Appendix B of [34]. If the testing distribution differs from training, however, it remains unclear how to derive the generalization error of the CBF conditions. To train a CBF and controller that provably satisfy the CBF conditions, one can also use verification tools to find counterexamples in the state space that violate the CBF conditions and add those counterexamples to the training set [13, 7]; the process finishes when no more counterexamples can be found. However, the time complexity of verification makes this approach inapplicable to large and expressive neural networks. Also, the error between the nominal and real dynamics has a negative impact on the safety performance. These limitations are left for future work.
VI Conclusion and Future Work
We presented SABLAS, a general-purpose safe controller learning approach for black-box systems. SABLAS is supported by the theoretical guarantees of control barrier function theory and, at the same time, is strengthened by a novel learning structure so that it can directly learn policies and barrier certificates for black-box dynamical systems. Simulation results show that SABLAS indeed provides a systematic way of learning safe control policies, with a great improvement over safe RL methods. For future work, we plan to study SABLAS on multi-agent systems, especially with adversarial players.
References
 [1] (2017) Constrained policy optimization. In International Conference on Machine Learning, pp. 22–31. Cited by: §I, §IIB, §VC.
 [2] (2019) Control barrier functions: theory and applications. In 2019 18th European Control Conference (ECC), pp. 3420–3431. Cited by: §I, §IIC.
 [3] (2014) Control barrier function based quadratic programs with application to adaptive cruise control. In Decision and Control (CDC), 2014 IEEE 53rd Annual Conference on, pp. 6271–6278. Cited by: §I, §IIC, §IIIB.
 [4] (2017) Control barrier function based quadratic programs for safety critical systems. IEEE Transactions on Automatic Control 62 (8), pp. 3861–3876. Cited by: §I.
 [5] (2015) Control barrier certificates for safe swarm behavior. IFACPapersOnLine 48 (27), pp. 68–73. Cited by: §I, §IIC.
 [6] (2021) Pointwise feasibility of gaussian processbased safetycritical control under model uncertainty. arXiv preprint arXiv:2106.07108. Cited by: §IIC.
 [7] (2019) Neural Lyapunov control. In Advances in Neural Information Processing Systems, pp. 3245–3254. Cited by: §I, §VF.
 [8] (2021) Black-box control for linear dynamical systems. In Proceedings of Thirty Fourth Conference on Learning Theory, M. Belkin and S. Kpotufe (Eds.), Proceedings of Machine Learning Research, Vol. 134, pp. 1114–1143. Cited by: §IIA.
 [9] (2020) Guaranteed obstacle avoidance for multi-robot operations with limited actuation: a control barrier function approach. IEEE Control Systems Letters 5 (1), pp. 127–132. Cited by: §I, §IIC.
 [10] (2020) Safe multi-agent interaction through robust control barrier functions with learned uncertainties. arXiv preprint arXiv:2004.05273. Cited by: §I, §IIC.
 [11] (2019) End-to-end safe reinforcement learning through barrier functions for safety-critical continuous control tasks. In AAAI Conference on Artificial Intelligence, Vol. 33, pp. 3387–3395. Cited by: §I, §IIC.
 [12] (2021) Robust control barrier-value functions for safety-critical control. arXiv preprint arXiv:2104.02808. Cited by: §IIC.
 [13] (2021) Lyapunov-stable neural-network control. Robotics: Science and Systems (RSS). Cited by: §I, §IIC, §VF.
 [14] (2020) Guaranteeing safety of learned perception modules via measurement-robust control barrier functions. In 2020 Conference on Robot Learning (CoRL), Cited by: §I.
 [15] (2021) On-the-fly control of unknown smooth systems from limited data. In 2021 American Control Conference (ACC), pp. 3656–3663. Cited by: §IIA.
 [16] (2021) On-the-fly, data-driven reachability analysis and control of unknown systems: an F-16 aircraft case study. In Proceedings of the 24th International Conference on Hybrid Systems: Computation and Control, pp. 1–2. Cited by: §IIA.
 [17] (2020) Fast and guaranteed safe controller synthesis for nonlinear vehicle models. In Computer Aided Verification, S. K. Lahiri and C. Wang (Eds.), Cham, pp. 629–652. External Links: ISBN 978-3-030-53288-8 Cited by: §VA.
 [18] (2006) Complex continuous nonlinear systems: their black box identification and their control. IFAC Proceedings Volumes 39 (1), pp. 416–421. Cited by: §IIA.
 [19] (2000) A survey on nonlinear ship control: from theory to practice. IFAC Proceedings Volumes 33 (21), pp. 1–16. Note: 5th IFAC Conference on Manoeuvring and Control of Marine Craft (MCMC 2000), Aalborg, Denmark, 23–25 August 2000 External Links: ISSN 1474-6670 Cited by: §VA.
 [20] (2008) Robust nonlinear control design: state-space and Lyapunov techniques. Springer Science & Business Media. Cited by: §I, §IIC.
 [21] (2017) Backpropagation through the void: optimizing control variates for black-box gradient estimation. arXiv preprint arXiv:1711.00123. Cited by: §IIA.
 [22] (1995) Nonlinear control systems. Vol. 3, Springer. Cited by: §I, §IIC.
 [23] (2014) Simulation-guided Lyapunov analysis for hybrid dynamical systems. In Proceedings of the 17th International Conference on Hybrid Systems: Computation and Control, pp. 133–142. Cited by: §I, §IIA.
 [24] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §VD.
 [25] (2001) Rapidly-exploring random trees: progress and prospects. Algorithmic and computational robotics: new directions 5, pp. 293–308. Cited by: §VA.
 [26] (2012) Practical relative degree in black-box control. In 2012 IEEE 51st IEEE Conference on Decision and Control (CDC), pp. 7101–7106. Cited by: §IIA.
 [27] (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §I, §IIA.
 [28] (1998) On contraction analysis for nonlinear systems. Automatica 34 (6), pp. 683–696. Cited by: §I.
 [29] (2017) Control contraction metrics: convex and intrinsic criteria for nonlinear feedback design. IEEE Transactions on Automatic Control. Cited by: §I.
 [30] (2005) A tutorial on sum of squares techniques for systems analysis. In Proceedings of the 2005, American Control Conference, 2005., pp. 2686–2700. Cited by: §I.
 [31] (2019) PyTorch: an imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32, pp. 8026–8037. Cited by: §IVB.
 [32] (2021) Density constrained reinforcement learning. In International Conference on Machine Learning, pp. 8682–8692. Cited by: §I, §IIB.
 [33] (2020) KETO: learning keypoint representations for tool manipulation. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 7278–7285. Cited by: §I.
 [34] (2021) Learning safe multi-agent control with decentralized neural barrier certificates. In International Conference on Learning Representations, Cited by: §I, §IIC, §IIIC, §VA, §VD, §VF.
 [35] (2019) Learning control Lyapunov functions from counterexamples and demonstrations. Autonomous Robots 43 (2), pp. 275–307. Cited by: §I, §IIA.
 [36] (2019) Benchmarking safe exploration in deep reinforcement learning. arXiv preprint arXiv:1910.01708 7. Cited by: §VD.
 [37] (2015) Trust region policy optimization. In International conference on machine learning, pp. 1889–1897. Cited by: §I, §IIA, §VC.
 [38] (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §I, §I, §IIA, §VC.
 [39] (2021) FISAR: forward invariant safe reinforcement learning with a deep neural network-based optimizer. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 10617–10624. Cited by: §IIB.
 [40] (2020) Learning certified control using contraction metric. In Conference on Robot Learning, Cited by: §I.
 [41] (2018) Reward constrained policy optimization. arXiv preprint arXiv:1805.11074. Cited by: §I, §IIB, §VC.
 [42] (2008) Local stability analysis using simulations and sum-of-squares programming. Automatica 44 (10), pp. 2669–2675. Cited by: §I, §IIA.
 [43] (2021) MADER: trajectory planner in multi-agent and dynamic environments. IEEE Transactions on Robotics. Cited by: §IIC.
 [44] (2019) Faster: fast and safe trajectory planner for flights in unknown environments. In 2019 IEEE/RSJ international conference on intelligent robots and systems (IROS), pp. 1934–1940. Cited by: §IIC.
 [45] (2020) Neural contraction metrics for robust estimation and control: a convex optimization approach. arXiv preprint arXiv:2006.04361. Cited by: §I.
 [46] (2020) Projection-based constrained policy optimization. arXiv preprint arXiv:2010.03152. Cited by: §IIB.