In the past few years, there has been much progress in designing reinforcement learning (RL) algorithms [29, 24, 25, 23, 26, 31, 18]. As a consequence, there has been much interest in using RL to design control policies for solving complex robotics tasks [21, 20, 7, 19]. In particular, learning-enabled controllers (LECs) have the potential to outperform optimization-based controllers. In addition, optimization-based controllers can often only be used under strong assumptions about the system dynamics, system constraints, and objective functions [13, 3], which limits their applicability to complex robotics tasks.
However, safety concerns prevent LECs from being widely used in real-world tasks, which are often safety-critical in nature. For example, there may be disturbances in the real world compared to the training environment. If the LEC is not robust to these disturbances, then using it may result in catastrophic consequences. Furthermore, unlike optimization-based controllers, it is typically infeasible to impose hard safety constraints on LECs.
Many methods in this area leverage optimization tools to prove that a learned neural network policy satisfies a given safety constraint [9, 10, 28, 33, 6, 4, 17]. A related approach is shielding, which verifies a backup controller, and then overrides the LEC using the backup controller when it can no longer ensure that using the LEC is safe [15, 1, 2, 5]. While these methods provide strong mathematical guarantees, they suffer from a number of shortcomings. For example, many of these methods do not scale well to high-dimensional systems. Those that do typically rely on overapproximating the reachable set of states, which can become very imprecise (e.g., to the point where every state is deemed reachable).
We build on a recently proposed idea called model predictive shielding (MPS), which has been used to ensure safety of learned control policies [32, 5], including extensions to the multi-agent setting. The basic idea is that rather than check whether a state is safe ahead of time, we can dynamically check whether we can maintain safety if we use the LEC, and only use the LEC if we can do so. However, existing approaches are limited either in that they consider nonlinear, but deterministic, dynamics [5, 35], or in that they consider nondeterministic, but linear, dynamics. Nonlinearity is important because many tasks where LECs have the most promise are nonlinear. Stochasticity is important for a number of reasons. For instance, there are often small perturbations in real-world dynamical systems. Similarly, it can be used to model estimation error in the robot’s state (e.g., uncertainty in its GPS position). Finally, LECs are often learned in simulation using a model of the dynamics; there are often errors in the model that need to be robustly accounted for.
We propose an approach, called robust MPS (RMPS), that bridges this gap by using robust nonlinear model-predictive control (NMPC) as the backup controller. The reason for using NMPC is that the goals of the backup controller are qualitatively different from the goals of the LEC. For example, consider the problem of building a robot that can run. The LEC tries to run as quickly as possible. It may be able to outperform the robust NMPC, since the robust NMPC treats the stochastic perturbations conservatively. However, because it is not robust, the LEC cannot guarantee safety. Thus, we want to use the LEC as often as possible, but override it using a backup controller if we are not sure whether it is safe to use the LEC. The NMPC is an effective choice for the backup controller, where the goal is to stop the system and bring it to an equilibrium point, after which a feedback controller can be used to stabilize it. Continuing our example, the NMPC might bring the robot to a halt (e.g., where it is standing).
Our approach builds on tube-based NMPC, where the idea is to compute a tube within which the NMPC is guaranteed to stay (i.e., the tube is the reachable set of the NMPC). This existing work proposes to use a sampling-based heuristic to estimate the tube. We propose to use results from statistical learning theory to obtain provable probabilistic guarantees on our estimates of the sizes of the tubes. We develop a practical algorithm based on these theoretical results.
Contributions. Our key contributions are: (i) an extension of the MPS algorithm to stochastic dynamical systems (Section III), (ii) a novel statistical algorithm for estimating tubes for RMPC with high-probability guarantees (Section III), and (iii) experiments demonstrating how our approach ensures safety for LECs for cart-pole and for a single particle with non-holonomic dynamics and random obstacles (Section IV).
Dynamics. We consider stochastic nonlinear dynamics
where is the time step, is the state, is the control, and is a zero-mean stochastic perturbation with known distribution.
Control policy. A control policy is a map . We use to denote the closed-loop dynamics. The trajectory generated using from initial state is , where and . Since the dynamics are stochastic, is a sequence of random states; we use to denote the distribution of trajectories using from initial state .
Objective. We consider a cost function and a discount factor . The cost of a policy is
Safety constraint. In addition, we consider a set of safe states , with the goal of ensuring that the system stays in states . We do not place any constraints on (e.g., it can be nonconvex), except that we can efficiently check whether . We say a trajectory is safe if for all .
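The setup above can be illustrated with a minimal Python sketch. It assumes additive noise (consistent with the "additive perturbation" mentioned later in the paper) and an expected discounted sum as the cost, of which one sampled realization is computed here. The names `rollout`, `discounted_cost`, `is_safe`, and the toy scalar system are hypothetical and are not taken from the paper.

```python
import numpy as np

def rollout(f, policy, x0, horizon, sample_noise, rng):
    """Simulate x_{t+1} = f(x_t, u_t) + w_t under a closed-loop policy.

    Assumption: the perturbation enters additively.
    """
    xs, us = [np.asarray(x0, dtype=float)], []
    for _ in range(horizon):
        u = policy(xs[-1])
        xs.append(f(xs[-1], u) + sample_noise(rng))
        us.append(u)
    return xs, us

def discounted_cost(cost, xs, us, gamma):
    """Discounted cost of one sampled trajectory (one draw of the expectation)."""
    return sum(gamma ** t * cost(x, u) for t, (x, u) in enumerate(zip(xs, us)))

def is_safe(xs, in_safe_set):
    """A trajectory is safe if every visited state passes the membership check."""
    return all(in_safe_set(x) for x in xs)

# Hypothetical toy system: scalar integrator with proportional feedback.
rng = np.random.default_rng(0)
xs, us = rollout(
    f=lambda x, u: x + u,
    policy=lambda x: -0.5 * x,
    x0=1.0,
    horizon=20,
    sample_noise=lambda rng: rng.uniform(-0.01, 0.01),
    rng=rng,
)
print(is_safe(xs, lambda x: abs(x) <= 2.0))
```

Only a membership test for the safe set is required here, which matches the assumption that the safe set may be nonconvex as long as membership can be checked efficiently.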
Shielding problem. Overall, our goal is to construct a policy that achieves low cost while satisfying the safety constraint. In general, since the dynamics are stochastic, it is impossible to guarantee safety. Instead, our goal is to try and ensure that safety holds with high probability. We establish a theoretical safety guarantee in Theorem 1; we discuss exactly how this theorem should be interpreted in Section III-D.
Our approach is based on shielding [15, 1, 2, 5]. This approach takes as input a policy that optimizes the cost function . The policy may not take into account the safety constraint (though often a soft penalty for violating safety is baked into ). We refer to as the learned policy, since a key motivation is the setting where is a neural network policy trained using reinforcement learning. For example, in our experiments, we use the deep deterministic policy gradient (DDPG) algorithm, an effective reinforcement learning algorithm for dynamical systems with continuous state and action spaces, to learn a neural network policy. However, we emphasize that our approach can be used in conjunction with any algorithm, including ones from both reinforcement learning and control theory.
Then, the shielding problem is to construct a policy that overrides as needed to ensure safety. The key challenge is minimizing how often overrides .
Notation. For , we use the notation . The set of positive semi-definite matrices of dimension is denoted by . Given and , we use the notation . Given two sets , we denote their Minkowski sum by and their Pontryagin difference by .
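For intuition, the two set operations have simple closed forms when both sets are axis-aligned boxes. The sketch below covers only that special case, with boxes represented as `(lower, upper)` pairs of NumPy arrays. The function names and the box representation are assumptions made for illustration, and the Pontryagin difference is taken in its standard sense, the set of points x such that x + b lies in A for every b in B.

```python
import numpy as np

def minkowski_sum(box_a, box_b):
    """A (+) B = {a + b : a in A, b in B} for axis-aligned boxes."""
    (a_lo, a_hi), (b_lo, b_hi) = box_a, box_b
    return np.add(a_lo, b_lo), np.add(a_hi, b_hi)

def pontryagin_difference(box_a, box_b):
    """A (-) B = {x : x + b in A for all b in B} for axis-aligned boxes.

    Returns None when the difference is empty.
    """
    (a_lo, a_hi), (b_lo, b_hi) = box_a, box_b
    lo, hi = np.subtract(a_lo, b_lo), np.subtract(a_hi, b_hi)
    if np.any(lo > hi):
        return None
    return lo, hi

A = (np.array([-2.0, -2.0]), np.array([2.0, 2.0]))
B = (np.array([-0.5, -1.0]), np.array([0.5, 1.0]))
print(minkowski_sum(A, B))          # the box A grown by B
print(pontryagin_difference(A, B))  # the box A shrunk by B
```

For boxes, shrinking A by B and then growing the result by B recovers A. That identity does not hold for general sets.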
III Robust Model Predictive Shielding
III-A Background on Model Predictive Shielding
Model predictive shielding (MPS) is a recently proposed approach for solving the shielding problem for systems with deterministic dynamics. The key idea behind MPS is to maintain an invariant that it can always use a recovery policy to safely transition to an equilibrium point . We say a state that satisfies this invariant is recoverable (denoted ). Near the equilibrium point, we assume that a feedback controller can be used to ensure safety for an infinite horizon. Thus, as long as the system remains in , MPS can guarantee safety. The combination of and is the backup controller used to override . The basic approach is illustrated in Fig. 1.
As an intuitive example, consider a driving robot. In this context, the idea is that is recoverable if the robot can safely apply the brakes to come to a stop. If is recoverable, but using the learned policy would risk breaking recoverability (i.e., ), then MPS uses instead. Since is recoverable, using is guaranteed to keep the system safe. Thus, using the MPS controller is guaranteed to ensure safety for an infinite horizon when starting from a recoverable state .
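The braking example can be sketched in a few lines of Python for the deterministic case. Everything in this block is a hypothetical toy: a one-dimensional car with unit time step, a wall at position `WALL`, a stand-in learned policy that always accelerates, and a recovery policy that always brakes. It is meant only to show the shape of the MPS decision rule, not the authors' implementation.

```python
WALL = 10.0  # hypothetical safety constraint: position must stay <= WALL

def step(state, accel):
    """Deterministic toy dynamics with unit time step; the car cannot reverse."""
    pos, vel = state
    return (pos + vel, max(vel + accel, 0.0))

def learned_policy(state):
    return 1.0   # stand-in for a learned policy: always accelerate

def recovery_policy(state):
    return -1.0  # brake

def is_recoverable(state, max_steps=100):
    """Simulate braking; recoverable if the car stops without passing the wall."""
    for _ in range(max_steps):
        pos, vel = state
        if pos > WALL:
            return False
        if vel == 0.0:   # stopped: an equilibrium under continued braking
            return True
        state = step(state, recovery_policy(state))
    return False

def mps_policy(state):
    """Use the learned action only if the resulting state is still recoverable."""
    u = learned_policy(state)
    if is_recoverable(step(state, u)):
        return u
    return recovery_policy(state)

def run(policy, state=(0.0, 0.0), n_steps=30):
    positions = [state[0]]
    for _ in range(n_steps):
        state = step(state, policy(state))
        positions.append(state[0])
    return positions

print(max(run(mps_policy)) <= WALL)      # shielded run stays behind the wall
print(max(run(learned_policy)) <= WALL)  # unshielded run does not
```

The recoverability check here is a single simulated rollout, which is exactly the step that stops being meaningful once the dynamics are stochastic.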
A key shortcoming of MPS is that it depends critically on the assumption that the dynamics are deterministic. In particular, it uses simulation to check whether is recoverable. However, for stochastic dynamics, each simulation will result in different realizations of the perturbations . Thus, we cannot check recoverability using simulation.
Our approach is to combine MPS with two ideas from robust control. First, we use tracking NMPC to try and transition the system from a given state to an equilibrium point . By using nonlinear feedback control, we can ensure that the system is very likely to reach its goal despite stochastic perturbations. Second, we check recoverability by estimating the reachable sets of . In particular, we use these reachable sets to ensure the trajectory generated using (i) is safe, and (ii) reaches an invariant set . A key innovation in our approach is that we use tools from statistical learning theory to obtain provable guarantees for our approach. In particular, we prove that this check guarantees recoverability with high probability.
III-B The Backup Controller
We use a standard robust NMPC as the backup controller . At a high level, this controller first computes a reference trajectory that transitions the system to an equilibrium point. Then, it uses NMPC to track this reference trajectory. Finally, once the trajectory has reached the invariant set around equilibrium point , it uses a feedback controller to stabilize the system within .
Stabilization near equilibrium points. We assume that we are given a mapping , where is an equilibrium point, i.e., . The intuition is that should return the equilibrium point that is “nearest” to . Then, tries to transition the system from to . Once it is near , we can use feedback control to ensure safety; e.g., we can continue using the robust NMPC near . We denote the stabilizing controller used near by .
In addition, we assume that we can compute a safe invariant set around . Our key assumption is that for any state , the trajectory generated using from is safe. Since the dynamics are stochastic, we typically cannot guarantee safety of using in with probability (unless the perturbations are bounded). Nevertheless, in our experiments, we find that is effective at ensuring safety and stability once inside . We discuss how we compute in Section III-E.
Reference trajectory. Denote the nominal dynamics by
where is the nominal state and is the nominal control input. Given an initial state and a time horizon , we compute a nominal trajectory to transition the system to an equilibrium point by solving the following:
where for some and . Furthermore, can be specified by the user to improve robustness; we describe heuristics for computing these sets in Section III-E. We denote the solution to (1) by . Since is a nominal equilibrium, the infinite horizon trajectory
where is concatenation, is safe for the nominal dynamics.
Tracking NMPC. Once we have a reference trajectory , we use NMPC to track this reference trajectory and try to reach the equilibrium . In particular, if we are at state after steps, this controller solves the following:
Backup controller. Given a state , an equilibrium point , and a time horizon , our backup controller first computes the reference trajectory using (1), with corresponding infinite horizon reference trajectory . Then, for each step , solves (3) for the current state to obtain , and chooses control input . Finally, for , it chooses control input .
This procedure for computing the backup controller is summarized in Algorithm 2. Note that actually needs to keep internal state consisting of the target equilibrium point , its corresponding invariant set , the reference trajectory to the equilibrium point, and the number of steps taken so far using the backup controller. This internal state is initialized in the context of a given state by the function call InitializeBackup().
III-C Checking Robust Recoverability via Sampling
In contrast to the MPS setting, where the dynamics are deterministic, we cannot use a single simulated trajectory to check whether a given state is recoverable. Instead, building on ideas from tube NMPC, we use Monte Carlo sampling to determine whether can safely reach the invariant set from a given state . Our key idea is to sample trajectories according to the (stochastic) dynamics. Then, we can fit boxes that cover all the states sampled on each given step . Intuitively, if we take the number of sampled trajectories to be sufficiently large, the realized trajectory will lie in at step with high probability. In contrast to prior work, we make this intuition precise using tools from statistical learning theory. Finally, to check if is recoverable, we check that it is robustly safe according to the uncertainty in these boxes, and furthermore that it robustly enters the invariant set .
Robust recoverability. Our goal is to ensure that can always transition the system safely from the current state to the invariant set around . Due to the random perturbation in the dynamics, we cannot make an absolute guarantee that this property holds. Following prior work [15, 1, 6], we instead aim to guarantee that this property holds with high probability.
Let be given. Given a state , let , and let be a (random) trajectory generated using from . We say is robustly recoverable if with probability at least (according to the randomness in ), (i) is safe for every , and (ii) .
In other words, safely transitions the system from to with probability at least . Then, given a state , Algorithm 3 checks whether is robustly recoverable. In contrast to prior work [15, 1, 6], which relies on thresholding the perturbation distribution and then using verification to obtain these kinds of bounds, we use a sampling-based approach. Since they need to threshold the distribution, they provide robust recoverability guarantees such as our own. However, they can guarantee that a given state is robustly recoverable with probability . In contrast, using our approach, there is an additional chance (for any given ) that our algorithm incorrectly concludes that is robustly recoverable when it is not. The difference is that the error in robust recoverability is due to the noise in the dynamics, whereas our error is due to noise in the sampled trajectories taken by our algorithm.
We believe this added potential for error is reasonable for two reasons. First, there is already an chance of error; practically speaking, the added error of does not really affect the kind of guarantee we ultimately obtain. Second, the dependence of the running time of our algorithm is logarithmic, so it is easy to use very small .
Estimating reachable sets. Our approach is to compute sets for such that the trajectory satisfies with probability at least , where is the reference trajectory used by —i.e.,
where the probability is taken over the randomness in the dynamics. To this end, a box constraint is a set
where denotes the closed interval from to . We use to denote the set of all possible boxes. Now, we have the following theoretical guarantee.
Let be a distribution over and be given. Consider i.i.d. samples , where
and let be any box satisfying for all . Then, with probability at least , we have
Intuitively, (6) says that at least a fraction of states (weighted by ) fall inside , and this guarantee holds with probability at least . The proof, based on tools from statistical learning theory, is given in Appendix A.
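As a rough illustration of this kind of guarantee, the sketch below fits the tightest axis-aligned box around i.i.d. samples and measures its coverage on fresh samples. The sample-size formula used is the textbook bound for tightest-fit axis-aligned boxes in d dimensions, n >= (2d/eps) ln(2d/delta). It is an assumption made for illustration and is not necessarily the expression used in Lemma 1, although it shares the logarithmic dependence on the failure probability. The function names are hypothetical.

```python
import math
import numpy as np

def box_sample_size(dim, eps, delta):
    """Textbook sample size for tightest-fit boxes (assumed, not from the paper)."""
    return math.ceil((2 * dim / eps) * math.log(2 * dim / delta))

def fit_box(samples):
    """Smallest axis-aligned box containing every sample (rows are points)."""
    samples = np.asarray(samples, dtype=float)
    return samples.min(axis=0), samples.max(axis=0)

def coverage(box, samples):
    """Fraction of the given points that fall inside the box."""
    lo, hi = box
    samples = np.asarray(samples, dtype=float)
    return float(np.all((samples >= lo) & (samples <= hi), axis=1).mean())

rng = np.random.default_rng(0)
eps, delta, dim = 0.1, 0.05, 2
n = box_sample_size(dim, eps, delta)
box = fit_box(rng.normal(size=(n, dim)))
# Empirical check on fresh draws; the guarantee targets coverage >= 1 - eps.
print(n, coverage(box, rng.normal(size=(100_000, dim))))
```

Because the required sample count grows only logarithmically in 1/delta under this bound, shrinking delta by several orders of magnitude adds relatively few samples.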
Let be the sequence of boxes returned by Algorithm 4. With probability at least , we have
As before, the probability is according to the samples used by our algorithm, whereas the probability is according to the randomness in the dynamics.
The sets computed using Algorithm 4 can be thought of as an estimate of a tube in which the trajectories are guaranteed to stay. In contrast to prior work, we have used results from statistical learning theory to obtain probabilistic guarantees on the correctness of these tubes. An example of an estimated tube is shown in Fig. 2.
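A sampling-based tube estimate of this kind might look as follows. The sketch assumes that each box is fit to the tracking error (realized state minus reference state) at a given step, and it uses a toy scalar closed loop in which the error contracts by half each step and is hit by bounded uniform noise. The names `estimate_tube` and `simulate_tracking`, the reference, and the toy dynamics are all hypothetical stand-ins for the tracking NMPC.

```python
import numpy as np

def estimate_tube(simulate_closed_loop, x0, reference, n_samples, rng):
    """Fit one axis-aligned box per time step around the tracking error.

    simulate_closed_loop(x0, rng) must return an array of shape (T + 1, d).
    Returns per-step lower and upper corners, each of shape (T + 1, d).
    """
    reference = np.asarray(reference, dtype=float)
    errors = np.stack(
        [simulate_closed_loop(x0, rng) - reference for _ in range(n_samples)]
    )
    return errors.min(axis=0), errors.max(axis=0)

# Hypothetical reference: move from 1 to 0 in five steps (shape (6, 1)).
REFERENCE = np.linspace(1.0, 0.0, 6)[:, None]

def simulate_tracking(x0, rng):
    """Toy stand-in for the tracking controller: error halves, plus noise."""
    xs = [np.asarray(x0, dtype=float)]
    for t in range(len(REFERENCE) - 1):
        w = rng.uniform(-0.1, 0.1, size=xs[-1].shape)
        xs.append(REFERENCE[t + 1] + 0.5 * (xs[-1] - REFERENCE[t]) + w)
    return np.stack(xs)

rng = np.random.default_rng(0)
tube_lo, tube_hi = estimate_tube(simulate_tracking, REFERENCE[0], REFERENCE, 200, rng)
print(np.hstack([tube_lo, tube_hi]))
```

In this toy system the error can never exceed 0.2 in magnitude, so the fitted boxes stay inside that bound and widen over the first few steps before levelling off.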
Checking recoverability. Given the boxes for , Algorithm 3 checks both properties required for robust recovery: (i) to check if with high probability, it checks if
which is equivalent to , and (ii) to check if with high probability, it checks if
which is equivalent to . These checks ensure robust recoverability because Corollary 2 ensures that with high probability for every . Thus, we have the following guarantee:
Given a state , if Algorithm 3 returns true, then is robustly recoverable with probability (according to the randomness in the algorithm).
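The two containment checks can be sketched for the special case where the safe set and the invariant set are themselves axis-aligned boxes. The paper allows more general (e.g., nonconvex) safe sets, so this simplification, along with the names `box_inside` and `check_recoverable` and the error-box convention for the tube, is an assumption made for illustration.

```python
import numpy as np

def box_inside(inner, outer):
    """True if the box `inner` is contained in the box `outer`."""
    (i_lo, i_hi), (o_lo, o_hi) = inner, outer
    return bool(np.all(i_lo >= o_lo) and np.all(i_hi <= o_hi))

def check_recoverable(reference, tube_lo, tube_hi, safe_box, invariant_box):
    """Check (i) reference + tube stays safe and (ii) the final box is invariant.

    reference, tube_lo, tube_hi: arrays of shape (T + 1, d).
    """
    last = len(reference) - 1
    for t in range(last + 1):
        reach = (reference[t] + tube_lo[t], reference[t] + tube_hi[t])
        if not box_inside(reach, safe_box):
            return False
    final = (reference[last] + tube_lo[last], reference[last] + tube_hi[last])
    return box_inside(final, invariant_box)

reference = np.array([[1.0], [0.5], [0.0]])
tube_lo = np.full((3, 1), -0.2)
tube_hi = np.full((3, 1), 0.2)
safe_box = (np.array([-2.0]), np.array([2.0]))
invariant_box = (np.array([-0.3]), np.array([0.3]))
print(check_recoverable(reference, tube_lo, tube_hi, safe_box, invariant_box))
```

For boxes, checking that the reference point plus the tube box lies in the safe set is the same as checking that the reference point lies in the safe set shrunk by the tube box, which is the Pontryagin-difference form of the test.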
III-D Robust Model Predictive Shielding
Our robust model predictive shielding (RMPS) algorithm is shown in Algorithm 1. At state , this algorithm computes a control input (denoted ) by checking in simulation whether the next state is robustly recoverable (with high probability). If so, it uses the learned policy; otherwise, it takes a step according to the backup controller. One subtlety is that if has already been initialized, it actually needs to check if is robustly recoverable. The issue is that robust recoverability is defined with respect to a freshly initialized backup policy, not the backup policy after it has taken some number of steps. We have the following guarantee:
Suppose that is robustly recoverable; then, is robustly recoverable with probability at least .
See Appendix V-B for a proof. A key shortcoming of this guarantee is that it does not ensure safety of the infinite horizon trajectory. Given our assumptions, a stronger guarantee is impossible, since on every step there is a chance that the additive perturbation is large, causing the system to leave . However, this guarantee is still useful since it helps guide the design of our algorithm. In practice, we find that the bounds can be tighter than the theory suggests, since the robust NMPC is actually conservatively overapproximating the reachable set. In other words, the robust NMPC ensures safety much more robustly than the probabilities in Theorem 1 would suggest.
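One plausible reading of the per-step decision rule is sketched below. It is not a transcription of Algorithm 1: the `BackupController` class, the `predict_next` callable (a nominal one-step prediction), and the `is_robustly_recoverable` callable are hypothetical placeholders, and the subtlety about re-checking an already initialized backup policy is omitted. The sketch only shows how a stateful backup controller and the recoverability check fit together.

```python
class BackupController:
    """Hypothetical stateful backup: tracks how many steps it has taken."""

    def __init__(self, control_fn):
        self.control_fn = control_fn  # maps (state, step index) to a control
        self.steps = None             # None means "not initialized"

    def initialize(self):
        self.steps = 0

    def reset(self):
        self.steps = None

    def control(self, x):
        u = self.control_fn(x, self.steps)
        self.steps += 1
        return u

def rmps_step(x, learned_policy, backup, predict_next, is_robustly_recoverable):
    """Return (control, source) for the current state x."""
    u = learned_policy(x)
    if is_robustly_recoverable(predict_next(x, u)):
        backup.reset()  # a later override starts from a fresh backup
        return u, "learned"
    if backup.steps is None:
        backup.initialize()
    return backup.control(x), "backup"

# Toy usage with scalar states and made-up callables.
backup = BackupController(lambda x, k: -1.0)
args = (lambda x: 1.0, backup, lambda x, u: x + u, lambda x: x <= 3.0)
print(rmps_step(0.0, *args))  # learned action is accepted
print(rmps_step(3.0, *args))  # learned action is overridden
```

Keeping the step counter inside the backup object mirrors the internal state described for the backup controller (target equilibrium, reference trajectory, and number of steps taken), reduced here to the counter alone.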
III-E Practical Modifications
We describe several practical modifications to our algorithm designed to improve either performance or computational tractability. These modifications may weaken our safety guarantees, but as we show in our experiments, they do not affect safety very much empirically.
Computing . We use a heuristic to compute from . In particular, we sample trajectories over a long horizon and estimate the reachable set the same way as in Algorithm 4. This approach does not provide any guarantees that the estimated set is actually invariant, but it works well in our experiments.
Using tighter constraints for NMPC. In the optimization problem (1) used to compute the reference trajectory, we noted that we can use tighter state constraints than needed. Doing so improves the robustness of the tracking NMPC. In particular, we use the “tightened set” .
Precomputing . Computing the sets (for ) on-the-fly can be prohibitively expensive, since we might need a large number of samples for Lemma 1 to apply. Instead, we precompute these sets from a fixed initial state . Then, we reuse the same sets rather than recomputing them at each step. Intuitively, this approach works well in practice since the dynamics of the tracking NMPC are usually fairly similar for different initial states.
We perform experiments to demonstrate how our system can ensure safety of stochastic systems with nonlinear dynamics and/or nonconvex constraints.
We perform experiments using three environments: (i) cart-pole, which has nonlinear dynamics and polytopic constraints (i.e., the pole should not fall below a certain height), (ii) a particle with holonomic dynamics and obstacles (which has linear dynamics but nonconvex constraints), and (iii) a particle with non-holonomic dynamics and obstacles (which has both nonlinear dynamics and nonconvex constraints).
For the cart-pole, the states are , where is the cart position, is the cart velocity, is the pole angle from upright position, and is the pole angular velocity, and the control inputs are , with the goal of reaching a target position . The safety constraint is that the pole should not fall down while moving the cart. We define the cost function to be
is a hyperparameter. Finally, disturbances are uniform noise for the velocity and angular velocity, and zero otherwise.
For the single particle with holonomic dynamics, the states are , where is position and is velocity, and the control inputs are , where is the acceleration. The system dynamics are , where
The cost function is
where is the goal, () are the obstacles, and , and is a hyperparameter. Disturbances are uniform noise for the velocity and angular velocity, and zero otherwise.
For the single particle with non-holonomic dynamics, the states are , where is the position, is the velocity, and is the heading, and the control inputs are , where is acceleration and is angular acceleration. The system dynamics are
where is the particle radius. Costs and disturbances are the same as for the holonomic particle.
We compare our algorithm with two other policies: (i) using the learned policy without any shielding, and (ii) using shielding without robust control (i.e., the MPS algorithm). For each experiment, we run 50 scenarios with 3 different random seeds and compute the safety as well as reach rates.
The safety and performance of the three algorithms are shown in Fig. 3. As can be seen, the learned policy achieves the highest performance in all but one of the environments (the one case where it does not perform the best is likely due to noise). However, it performs very poorly in terms of safety, demonstrating the need for shielding. Next, the non-robust MPS performs slightly better in safety, but still cannot guarantee that safety holds. Its performance is correspondingly worse as well. In contrast, our robust MPS algorithm achieves a 100% safety rate in each of the three environments. Its performance is slightly diminished: for the particle with non-holonomic dynamics (the hardest environment), the probability of reaching the goal drops by about 20%. Thus, our algorithm is much more suitable for safety-critical systems where safety must be guaranteed.
In Fig. 4, we show the frequency with which our robust MPS algorithm uses the learned policy compared to the backup controller . As can be seen, on the cart-pole and non-holonomic particle environments, which are more prone to being unsafe, robust MPS is less likely to use .
We have proposed a safe reinforcement learning algorithm ensuring safety of a learned control policy on stochastic nonlinear dynamical systems. We use a sampling-based approach to estimate the reachable set of the backup controller, and use results from statistical learning theory to provide theoretical guarantees on our estimates. We propose a number of modifications to enable a practical implementation of our approach. In our experiments, we show that our approach can ensure safety without sacrificing very much performance despite these modifications. Thus, our approach is a promising way to ensure safety in safety-critical systems.
V-A Proof of Lemma 1
Given , define , where , by —i.e., indicates whether is contained in . Note that
is a binary classifier, and we can consider the family of classifiers. Also, define the distribution on by ; i.e., all labels are . Thus, sampling is equivalent to sampling , where . Recall that we choose so all samples satisfy . Equivalently, , where for all , so for all . Thus, we can think of choosing as choosing such that the training error
on a set of i.i.d. samples . Thus, we can apply results from statistical learning theory to bound the test error
where the last probability is the one we are seeking to bound. In particular, it is straightforward to check that the VC dimension of for boxes is . By the VC dimension bound, for all , we have
with probability at least . The claim follows by setting equal to the left-hand side of (7).
V-B Proof of Theorem 1
If uses , then by Lemma 3, is robustly recoverable with probability , so the claim holds. Alternatively, suppose that uses . If it uses the already initialized version of , then by Lemma 3, is robustly recoverable with probability , so again the claim holds. By a union bound, both claims hold with probability
Finally, suppose that initializes and then uses it. Because is robustly recoverable, we have
holds with probability at least . Furthermore, the robust recoverability condition for is
In particular, note that , so
so is robustly recoverable. The claim follows.
[1] (2014) Reachability-based safe learning with Gaussian processes. In 53rd IEEE Conference on Decision and Control, pp. 1424–1431.
[2] Safe reinforcement learning via shielding. In Thirty-Second AAAI Conference on Artificial Intelligence.
[3] (2009) Model predictive control: theory and design.
[4] (2018) Verifiable reinforcement learning via policy extraction. In Advances in Neural Information Processing Systems, pp. 2494–2504.
[5] (2019) Safe reinforcement learning via online shielding. CoRR abs/1905.10691.
[6] (2017) Safe model-based reinforcement learning with stability guarantees. In Advances in Neural Information Processing Systems, pp. 908–918.
[7] (2016) End to end learning for self-driving cars. CoRR abs/1604.07316.
[8] (2016) OpenAI Gym. CoRR abs/1606.01540.
[9] (2019) Safety verification and robustness analysis of neural networks via quadratic constraints and semidefinite programming. CoRR abs/1903.01287.
[10] (2019) Efficient and accurate estimation of Lipschitz constants for deep neural networks. CoRR abs/1906.04893.
[11] (2019) Bridging Hamilton-Jacobi safety analysis and reinforcement learning. In 2019 International Conference on Robotics and Automation (ICRA), pp. 8550–8556.
[12] (2014) A tube-based robust nonlinear predictive control approach to semiautonomous ground vehicles. Vehicle System Dynamics 52 (6), pp. 802–823.
[13] (1989) Model predictive control: theory and practice - a survey. Automatica 25, pp. 335–348.
[14] (2016) Practice makes perfect: an optimization-based approach to controlling agile motions for a quadruped robot. IEEE Robotics & Automation Magazine, pp. 1–1.
[15] (2012) Guaranteed safe online learning via reachability: tracking a ground target using a quadrotor. In 2012 IEEE International Conference on Robotics and Automation, pp. 2723–2730.
[16] Generative adversarial imitation learning. CoRR abs/1606.03476.
[17] (2019) Verisig: verifying safety properties of hybrid systems with neural network controllers. In Proceedings of the 22nd ACM International Conference on Hybrid Systems: Computation and Control, pp. 169–178.
[18] (2019) Model-based reinforcement learning for Atari. CoRR abs/1903.00374.
[19] (2019) Learning safe unlabeled multi-robot planning with motion constraints. CoRR abs/1907.05300.
[20] (2019) Low level control of a quadrotor with deep model-based reinforcement learning. CoRR abs/1901.03737.
[21] (2018) Hierarchical policy design for sample-efficient learning of robot table tennis through self-play. CoRR abs/1811.12927.
[22] (2011) Tube-based robust nonlinear model predictive control. International Journal of Robust and Nonlinear Control 21 (11), pp. 1341–1353.
[23] (2016) Asynchronous methods for deep reinforcement learning. CoRR abs/1602.01783.
[24] (2015) Trust region policy optimization. CoRR abs/1502.05477.
[25] (2017) Proximal policy optimization algorithms. CoRR abs/1707.06347.
[26] (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529, pp. 484–503.
[27] Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, ICML’14, pp. I–387–I–395.
[28] (2019) Safety verification of cyber-physical systems with reinforcement learning control. EMSOFT 2019.
[29] (2015) Deep reinforcement learning with double Q-learning. CoRR abs/1509.06461.
[30] (2013) The nature of statistical learning theory. Springer Science & Business Media.
[31] (2017) StarCraft II: A new challenge for reinforcement learning. CoRR abs/1708.04782.
[32] (2018) Linear model predictive safety certification for learning-based control. CoRR abs/1803.08552.
[33] (2018) Reachable set estimation and verification for neural network models of nonlinear dynamic systems. CoRR abs/1802.03557.
[34] (2011) Tube MPC scheme based on robust control invariant set with application to Lipschitz nonlinear systems. In 2011 50th IEEE Conference on Decision and Control and European Control Conference, pp. 2650–2655.
[35] MAMPS: safe multi-agent reinforcement learning via model predictive shielding.