Robust Model Predictive Shielding for Safe Reinforcement Learning with Stochastic Dynamics

by Shuo Li, et al.

This paper proposes a framework for safe reinforcement learning that can handle stochastic nonlinear dynamical systems. We focus on the setting where the nominal dynamics are known, and are subject to additive stochastic disturbances with known distribution. Our goal is to ensure the safety of a control policy trained using reinforcement learning, e.g., in a simulated environment. We build on the idea of model predictive shielding (MPS), where a backup controller is used to override the learned policy as needed to ensure safety. The key challenge is how to compute a backup policy in the context of stochastic dynamics. We propose to use a tube-based robust NMPC controller as the backup controller. We estimate the tubes using sampled trajectories, leveraging ideas from statistical learning theory to obtain high-probability guarantees. We empirically demonstrate that our approach can ensure safety in stochastic systems, including cart-pole and a non-holonomic particle with random obstacles.








I Introduction

In the past few years, there has been much progress in designing reinforcement learning (RL) algorithms [29, 24, 25, 23, 26, 31, 18]. As a consequence, there has been much interest in using RL to design control policies for solving complex robotics tasks [21, 20, 7, 19]. In particular, learning-enabled controllers (LECs) have the potential to outperform optimization-based controllers [14]. In addition, optimization-based controllers can often only be used under strong assumptions about the system dynamics, system constraints, and objective functions [13, 3], which limits their applicability to complex robotics tasks.

However, safety concerns prevent LECs from being widely used in real-world tasks, which are often safety-critical in nature. For example, there may be disturbances in the real world compared to the training environment. If the LEC is not robust to these disturbances, then using it may result in catastrophic consequences [16]. Furthermore, unlike optimization-based controllers, it is typically infeasible to impose hard safety constraints on LECs.

As a consequence, safe reinforcement learning has become an increasingly important area of research [15, 1, 6, 4, 2, 11, 17]. Many methods in this area leverage optimization tools to prove that a learned neural network policy satisfies a given safety constraint [9, 10, 28, 33, 6, 4, 17]. A related approach is shielding, which verifies a backup controller, and then overrides the LEC using the backup controller when it can no longer ensure that using the LEC is safe [15, 1, 2, 5]. While these methods provide strong mathematical guarantees, they suffer from a number of shortcomings. For example, many of these methods do not scale well to high-dimensional systems. Those that do typically rely on overapproximating the reachable set of states, which can become very imprecise—e.g., leading to all states being reachable.

We build on a recently proposed idea called model predictive shielding (MPS), which has been used to ensure safety of learned control policies [32, 5], including extensions to the multi-agent setting [35]. The basic idea is that rather than check whether a state is safe ahead-of-time, we can dynamically check whether we can maintain safety if we use the LEC, and only use the LEC if we can do so. However, existing approaches are limited either in that they consider nonlinear, but deterministic, dynamics [5, 35], or in that they consider nondeterministic, but linear, dynamics [32]. Nonlinearity is important because many tasks where LECs have the most promise are nonlinear. Stochasticity is important for a number of reasons. For instance, there are often small perturbations in real-world dynamical systems. Similarly, it can be used to model estimation error in the robot’s state (e.g., uncertainty in its GPS position). Finally, LECs are often learned in simulation using a model of the dynamics; there are often errors in the model that need to be robustly accounted for.

We propose an approach, called robust MPS (RMPS), that bridges this gap by using robust nonlinear model-predictive control (NMPC) as the backup controller. The reason for using NMPC is that the goals of the backup controller are qualitatively different from the goals of the LEC. For example, consider the problem of building a robot that can run. The LEC tries to run as quickly as possible. It may be able to outperform the robust NMPC, since the robust NMPC treats the stochastic perturbations conservatively. However, because it is not robust, the LEC cannot guarantee safety. Thus, we want to use the LEC as often as possible, but override it using a backup controller if we are not sure whether it is safe to use the LEC. The NMPC is an effective choice for the backup controller, where the goal is to stop the system and bring it to an equilibrium point, after which a feedback controller can be used to stabilize it. Continuing our example, the NMPC might bring the robot to a halt (e.g., where it is standing).

To achieve our goals, we build on algorithms for robust NMPC [12, 34, 22]. In particular, we build closely on tube-based robust NMPC [22], where the idea is to compute a tube within which the NMPC trajectory is guaranteed to stay (i.e., the tube is the reachable set of the NMPC). This existing work proposes to use a sampling-based heuristic to estimate the tube. We propose to use results from statistical learning theory to obtain provable probabilistic guarantees on our estimates of the sizes of the tubes [30]. We develop a practical algorithm based on these theoretical results.

Contributions. Our key contributions are: (i) an extension of the MPS algorithm to stochastic dynamical systems (Section III), (ii) a novel statistical algorithm for estimating tubes for RMPC with high-probability guarantees (Section III), and (iii) experiments demonstrating how our approach ensures safety for LECs for cart-pole and for a single particle with non-holonomic dynamics and random obstacles (Section IV).

Fig. 1: An illustration of model predictive shielding. The black circle is an obstacle. The dashed orange line and the blue circles with dashed borders are the trajectory the robot follows if using the learned policy $\hat{\pi}$; this trajectory is unsafe. The solid red line and the blue circles with solid borders are the trajectory the robot follows if using the backup policy $\pi_{\text{backup}}$; this trajectory is safe.

II Preliminaries

Dynamics. We consider stochastic nonlinear dynamics

$$x_{t+1} = f(x_t, u_t) + w_t,$$

where $t \in \mathbb{N}$ is the time step, $x_t \in \mathcal{X} \subseteq \mathbb{R}^{n_x}$ is the state, $u_t \in \mathcal{U} \subseteq \mathbb{R}^{n_u}$ is the control, and $w_t \in \mathbb{R}^{n_x}$ is a zero-mean stochastic perturbation with known distribution.

Control policy. A control policy is a map $\pi : \mathcal{X} \to \mathcal{U}$. We use $f^{(\pi)}(x) = f(x, \pi(x)) + w$ to denote the closed-loop dynamics. The trajectory generated using $\pi$ from initial state $x_0$ is $x_0, x_1, x_2, \ldots$, where $x_{t+1} = f^{(\pi)}(x_t)$. Since the dynamics are stochastic, this is a sequence of random states; we use $D^{(\pi)}(x_0)$ to denote the distribution of trajectories generated using $\pi$ from initial state $x_0$.

Objective. We consider a cost function $c : \mathcal{X} \times \mathcal{U} \to \mathbb{R}$ and a discount factor $\gamma \in (0, 1)$. The cost of a policy $\pi$ is

$$J(\pi) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t c(x_t, \pi(x_t))\right],$$

where the expectation is taken over the trajectory distribution.
Safety constraint. In addition, we consider a set of safe states $\mathcal{X}_{\text{safe}} \subseteq \mathcal{X}$, with the goal of ensuring that the system stays in states $x \in \mathcal{X}_{\text{safe}}$. We do not place any constraints on $\mathcal{X}_{\text{safe}}$ (e.g., it can be nonconvex), except that we can efficiently check whether $x \in \mathcal{X}_{\text{safe}}$. We say a trajectory is safe if $x_t \in \mathcal{X}_{\text{safe}}$ for all $t \in \mathbb{N}$.
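To make these definitions concrete, the following is a minimal sketch of sampling a stochastic closed-loop trajectory and checking the safety constraint. The dynamics, policy, safe set, and all names below are illustrative stand-ins, not components of the paper.

```python
import random

DT = 0.1

def f(x, u):
    """Nominal dynamics: a 1-D position/velocity integrator (illustrative)."""
    p, v = x
    return (p + DT * v, v + DT * u)

def pi(x):
    """A simple proportional braking policy (stands in for a learned policy)."""
    p, v = x
    return -2.0 * p - 1.5 * v

def step(x, rng):
    """One step of the stochastic closed-loop dynamics x' = f(x, pi(x)) + w."""
    p, v = f(x, pi(x))
    # Zero-mean bounded perturbation on the velocity only, mirroring the
    # experiments' uniform noise on velocity components.
    w = rng.uniform(-0.01, 0.01)
    return (p, v + w)

def is_safe(x):
    """Safe set X_safe: positions within a corridor |p| <= 1 (illustrative)."""
    return abs(x[0]) <= 1.0

def rollout(x0, horizon, rng):
    """Sample one trajectory and report whether every state was safe."""
    x, traj = x0, [x0]
    for _ in range(horizon):
        x = step(x, rng)
        traj.append(x)
    return traj, all(is_safe(s) for s in traj)

traj, safe = rollout((0.5, 0.0), 100, random.Random(0))
```

Because the dynamics are stochastic, each call with a different seed yields a different realization; this is exactly why a single rollout cannot certify safety, motivating the sampling analysis in Section III-C.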

Shielding problem. Overall, our goal is to construct a policy that achieves low cost $J(\pi)$ while satisfying the safety constraint. In general, since the dynamics are stochastic, it is impossible to guarantee safety. Instead, our goal is to ensure that safety holds with high probability. We establish a theoretical safety guarantee in Theorem 1; we discuss exactly how this theorem should be interpreted in Section III-D.

Our approach is based on shielding [15, 1, 2, 5]. This approach takes as input a policy $\hat{\pi}$ that optimizes the cost function $J$. The policy $\hat{\pi}$ may not take into account the safety constraint (though often a soft penalty for violating safety is baked into $J$). We refer to $\hat{\pi}$ as the learned policy, since a key motivation is the setting where $\hat{\pi}$ is a neural network policy trained using reinforcement learning. For example, in our experiments, we use the deep deterministic policy gradient (DDPG) algorithm [27], an effective reinforcement learning algorithm for dynamical systems with continuous state and action spaces, to learn a neural network policy. However, we emphasize that our approach can be used in conjunction with any algorithm, including ones from both reinforcement learning and control theory.

Then, the shielding problem is to construct a policy $\pi$ that overrides $\hat{\pi}$ as needed to ensure safety. The key challenge is minimizing how often $\pi$ overrides $\hat{\pi}$.

Notation. For $N \in \mathbb{N}$, we use the notation $[N] = \{1, \ldots, N\}$. The set of positive semi-definite matrices of dimension $n \times n$ is denoted by $\mathbb{S}^n_+$. Given $x \in \mathbb{R}^n$ and $P \in \mathbb{S}^n_+$, we use the notation $\|x\|_P^2 = x^\top P x$. Given two sets $A, B \subseteq \mathbb{R}^n$, we denote their Minkowski sum by $A \oplus B = \{a + b \mid a \in A,\ b \in B\}$ and their Pontryagin difference by $A \ominus B = \{x \mid x + b \in A \ \forall b \in B\}$.
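Since the recoverability checks later in the paper reduce to set operations on axis-aligned boxes, it is worth noting that the Minkowski sum and Pontryagin difference have simple closed forms for boxes: both act per dimension on the interval endpoints. A sketch, with all names and numbers illustrative:

```python
# A box is a list of (lo, hi) intervals, one per dimension.

def minkowski_sum(A, B):
    """A (+) B = {a + b : a in A, b in B}, computed per dimension."""
    return [(al + bl, ah + bh) for (al, ah), (bl, bh) in zip(A, B)]

def pontryagin_diff(A, B):
    """A (-) B = {x : x + b in A for all b in B}; shrinks A by B per dimension."""
    return [(al - bl, ah - bh) for (al, ah), (bl, bh) in zip(A, B)]

def contains(A, B):
    """True if box B is a subset of box A."""
    return all(al <= bl and bh <= ah for (al, ah), (bl, bh) in zip(A, B))

safe = [(-10.0, 10.0), (-10.0, 10.0)]  # stand-in for a box-shaped safe set
tube = [(-2.0, 2.0), (-1.0, 1.0)]      # stand-in for an error box
tight = pontryagin_diff(safe, tube)    # tightened constraint set
```

The identity used later follows directly: the tube around a reference point stays inside the safe box exactly when the reference point lies inside the tightened box.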

III Robust Model Predictive Shielding

III-A Background on Model Predictive Shielding

Model predictive shielding (MPS) is a recently proposed approach for solving the shielding problem for systems with deterministic dynamics. The key idea behind MPS is to maintain an invariant that it can always use a recovery policy to safely transition to an equilibrium point [5]. We say a state $x$ that satisfies this invariant is recoverable (denoted $x \in \mathcal{X}_{\text{rec}}$). Near the equilibrium point, we assume that a feedback controller can be used to ensure safety for an infinite horizon. Thus, as long as the system remains in $\mathcal{X}_{\text{rec}}$, MPS can guarantee safety. The combination of the recovery policy and the feedback controller is the backup controller $\pi_{\text{backup}}$ used to override $\hat{\pi}$. The basic approach is illustrated in Fig. 1.

As an intuitive example, consider a driving robot. In this context, the idea is that a state $x$ is recoverable if the robot can safely apply the brakes to come to a stop. If $x$ is recoverable, but using the learned policy would risk breaking recoverability (i.e., $f(x, \hat{\pi}(x)) \notin \mathcal{X}_{\text{rec}}$), then MPS uses $\pi_{\text{backup}}$ instead. Since $x$ is recoverable, using $\pi_{\text{backup}}$ is guaranteed to keep the system safe. Thus, using the MPS controller is guaranteed to ensure safety for an infinite horizon when starting from a recoverable state $x_0$.

A key shortcoming of MPS is that it depends critically on the assumption that the dynamics are deterministic. In particular, it uses simulation to check whether a state $x$ is recoverable. However, for stochastic dynamics, each simulation will result in different realizations of the perturbations $w_t$. Thus, we cannot check recoverability using simulation.

Our approach is to combine MPS with two ideas from robust control. First, we use tracking NMPC to try to transition the system from a given state to an equilibrium point [22]. By using nonlinear feedback control, we can ensure that the system is very likely to reach its goal despite stochastic perturbations. Second, we check recoverability by estimating the reachable sets of $\pi_{\text{backup}}$. In particular, we use these reachable sets to ensure the trajectory generated using $\pi_{\text{backup}}$ (i) is safe, and (ii) reaches an invariant set $\mathcal{X}_{\text{inv}}$. A key innovation in our approach is that we use tools from statistical learning theory to obtain provable guarantees for our approach. In particular, we prove that this check guarantees recoverability with high probability.

RMPS($x$):
  if IsRecoverable($f(x, \hat{\pi}(x))$) then
    $u \leftarrow \hat{\pi}(x)$
  else
    $u \leftarrow$ Backup($x$)
  end if
  return $u$
Algorithm 1 Compute the RMPS controller for state $x$.
Backup($x$):
  if $t < T$ then
    Compute $u$ from $x$ using (3)
  else
    $u \leftarrow \pi_{\text{eq}}(x)$
  end if
  $t \leftarrow t + 1$; return $u$
InitializeBackup($x$):
  Compute target equilibrium point $x_{\text{eq}} = E(x)$
  Compute invariant set $\mathcal{X}_{\text{inv}}$ for $x_{\text{eq}}$
  Compute reference trajectory from $x$ using (1) and (2); $t \leftarrow 0$
Algorithm 2 Compute the backup controller $\pi_{\text{backup}}$ for state $x$. It keeps internal state $(x_{\text{eq}}, \mathcal{X}_{\text{inv}}, \bar{x}^*_{0:\infty}, t)$.

III-B The Backup Controller

We use a standard robust NMPC as the backup controller [22]. At a high level, this controller first computes a reference trajectory that transitions the system to an equilibrium point $x_{\text{eq}}$. Then, it uses NMPC to track this reference trajectory. Finally, once the trajectory has reached the invariant set $\mathcal{X}_{\text{inv}}$ around the equilibrium point $x_{\text{eq}}$, it uses a feedback controller to stabilize the system within $\mathcal{X}_{\text{inv}}$.

Stabilization near equilibrium points. We assume we are given a mapping $E : \mathcal{X} \to \mathcal{X}$, where $x_{\text{eq}} = E(x)$ is an equilibrium point (i.e., $f(x_{\text{eq}}, u_{\text{eq}}) = x_{\text{eq}}$ for some control $u_{\text{eq}}$). The intuition is that $E$ should return the equilibrium point that is “nearest” to $x$. Then, the backup controller tries to transition the system from $x$ to $x_{\text{eq}}$. Once it is near $x_{\text{eq}}$, we can use feedback control to ensure safety (e.g., we can continue using the robust NMPC near $x_{\text{eq}}$). We denote the stabilizing controller used near $x_{\text{eq}}$ by $\pi_{\text{eq}}$.

In addition, we assume that we can compute a safe invariant set $\mathcal{X}_{\text{inv}}$ around $x_{\text{eq}}$. Our key assumption is that for any state $x \in \mathcal{X}_{\text{inv}}$, the trajectory generated using $\pi_{\text{eq}}$ from $x$ is safe. Since the dynamics are stochastic, we typically cannot guarantee safety of using $\pi_{\text{eq}}$ in $\mathcal{X}_{\text{inv}}$ with probability $1$ (unless the perturbations are bounded). Nevertheless, in our experiments, we find that $\pi_{\text{eq}}$ is effective at ensuring safety and stability once inside $\mathcal{X}_{\text{inv}}$. We discuss how we compute $\mathcal{X}_{\text{inv}}$ in Section III-E.

Reference trajectory. Denote the nominal dynamics by

$$\bar{x}_{t+1} = f(\bar{x}_t, \bar{u}_t),$$

where $\bar{x}_t$ is the nominal state and $\bar{u}_t$ is the nominal control input. Given an initial state $x_0$ and a time horizon $T$, we compute a nominal trajectory to transition the system to an equilibrium point by solving the following:

$$\min_{\bar{x}_{0:T},\, \bar{u}_{0:T-1}} \ \sum_{t=0}^{T-1} \left( \|\bar{x}_t - x_{\text{eq}}\|_Q^2 + \|\bar{u}_t - u_{\text{eq}}\|_R^2 \right) \tag{1}$$

subj. to $\ \bar{x}_0 = x_0$, $\ \bar{x}_{t+1} = f(\bar{x}_t, \bar{u}_t)$, $\ \bar{x}_t \in \bar{\mathcal{X}}$, $\ \bar{u}_t \in \bar{\mathcal{U}}$, $\ \bar{x}_T = x_{\text{eq}}$,

for some cost matrices $Q$ and $R$. Furthermore, the constraint sets $\bar{\mathcal{X}} \subseteq \mathcal{X}_{\text{safe}}$ and $\bar{\mathcal{U}} \subseteq \mathcal{U}$ can be specified by the user to improve robustness; we describe heuristics for computing these sets in Section III-E. We denote the solution to (1) by $\bar{x}^*_{0:T}, \bar{u}^*_{0:T-1}$. Since $x_{\text{eq}}$ is a nominal equilibrium, the infinite-horizon trajectory

$$\bar{x}^*_{0:\infty} = \bar{x}^*_{0:T} \odot (x_{\text{eq}}, x_{\text{eq}}, \ldots), \tag{2}$$

where $\odot$ is concatenation, is safe for the nominal dynamics.

Tracking NMPC. Once we have a reference trajectory $\bar{x}^*_{0:\infty}$, we use NMPC to track this reference trajectory and try to reach the equilibrium $x_{\text{eq}}$. In particular, if we are at state $x_t$ after $t$ steps, this controller solves the following:

$$\min_{u_{t:t+H-1}} \ \sum_{k=t}^{t+H-1} \left( \|x_k - \bar{x}^*_k\|_Q^2 + \|u_k - \bar{u}^*_k\|_R^2 \right) + V(x_{t+H} - \bar{x}^*_{t+H}) \tag{3}$$

subj. to the nominal dynamics, where $V$ is the cost-to-go function of the LQR for the linearization of the nominal dynamics around $x_{\text{eq}}$ [22]. We let $u^*_{t:t+H-1}$ be the solution of (3).

Backup controller. Given a state $x_0$, an equilibrium point $x_{\text{eq}}$, and a time horizon $T$, our backup controller $\pi_{\text{backup}}$ first computes the reference trajectory using (1), with corresponding infinite-horizon reference trajectory (2). Then, for each step $t < T$, $\pi_{\text{backup}}$ solves (3) for the current state $x_t$ to obtain $u^*_{t:t+H-1}$, and chooses control input $u_t = u^*_t$. Finally, for $t \ge T$, it chooses control input $u_t = \pi_{\text{eq}}(x_t)$.

This procedure for computing the backup controller is summarized in Algorithm 2. Note that $\pi_{\text{backup}}$ actually needs to keep internal state consisting of the target equilibrium point $x_{\text{eq}}$, its corresponding invariant set $\mathcal{X}_{\text{inv}}$, the reference trajectory to the equilibrium point, and the number of steps $t$ taken so far using the backup controller. This internal state is initialized in the context of a given state $x$ by the function call InitializeBackup($x$).
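The terminal cost in the tracking NMPC is the LQR cost-to-go for the linearized nominal dynamics. As an illustration of how such a cost-to-go can be computed, the following iterates the scalar discrete-time Riccati recursion to its fixed point; the system and weight values are illustrative, not from the paper.

```python
def lqr_cost_to_go(A, B, Q, R, iters=10_000, tol=1e-12):
    """Fixed point of the scalar discrete-time Riccati recursion
    P <- Q + A*P*A - (A*P*B)^2 / (R + B*P*B), so V(x) = P*x^2."""
    P = Q
    for _ in range(iters):
        P_next = Q + A * P * A - (A * P * B) ** 2 / (R + B * P * B)
        if abs(P_next - P) < tol:
            return P_next
        P = P_next
    return P

# For A = B = Q = R = 1, the fixed point solves P = 1 + P - P^2/(1 + P),
# i.e. P^2 - P - 1 = 0, so P is the golden ratio.
P = lqr_cost_to_go(1.0, 1.0, 1.0, 1.0)
```

In the vector case the same recursion runs over matrices; the scalar version is enough to see why the fixed point exists and how it defines the terminal cost $V$.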

III-C Checking Robust Recoverability via Sampling

In contrast to the MPS setting, where the dynamics are deterministic, we cannot use a single simulated trajectory to check whether a given state is recoverable. Instead, building on ideas from tube NMPC [3], we use Monte Carlo sampling to determine whether $\pi_{\text{backup}}$ can safely reach the invariant set $\mathcal{X}_{\text{inv}}$ from a given state $x$. Our key idea is to sample $n$ trajectories according to the (stochastic) dynamics. Then, we can fit boxes $B_t$ that cover all the states sampled on each given step $t$. Intuitively, if we take the number of sampled trajectories to be sufficiently large, the realized trajectory will lie in $\bar{x}^*_t \oplus B_t$ at step $t$ with high probability. In contrast to prior work, we make this intuition precise using tools from statistical learning theory. Finally, to check if $x$ is recoverable, we check that the reference trajectory is robustly safe according to the uncertainty in these boxes, and furthermore that it robustly enters the invariant set $\mathcal{X}_{\text{inv}}$.

Robust recoverability. Our goal is to ensure that $\pi_{\text{backup}}$ can always transition the system safely from the current state $x$ to the invariant set $\mathcal{X}_{\text{inv}}$ around $x_{\text{eq}}$. Due to the random perturbations in the dynamics, we cannot make an absolute guarantee that this property holds. Following prior work [15, 1, 6], we instead aim to guarantee that this property holds with high probability.

Definition 1

Let $\epsilon \in (0, 1)$ be given. Given a state $x$, let $x_{\text{eq}} = E(x)$, and let $x_{0:\infty}$ be a (random) trajectory generated using $\pi_{\text{backup}}$ from $x_0 = x$. We say $x$ is robustly recoverable if, with probability at least $1 - \epsilon$ (according to the randomness in the dynamics), (i) $x_t$ is safe for every $t \in \{0, \ldots, T\}$, and (ii) $x_T \in \mathcal{X}_{\text{inv}}$.

In other words, $\pi_{\text{backup}}$ safely transitions the system from $x$ to $\mathcal{X}_{\text{inv}}$ with probability at least $1 - \epsilon$. Then, given a state $x$, Algorithm 3 checks whether $x$ is robustly recoverable. In contrast to prior work [15, 1, 6], which relies on thresholding the perturbation distribution and then using verification to obtain these kinds of bounds, we use a sampling-based approach. Since they threshold the distribution, these approaches provide robust recoverability guarantees such as our own; indeed, they can guarantee that a given state is robustly recoverable with probability $1$. In contrast, using our approach, there is an additional $\delta$ chance (for any given $\delta > 0$) that our algorithm incorrectly concludes that $x$ is robustly recoverable when it is not. The difference is that the $\epsilon$ error in robust recoverability is due to the noise in the dynamics, whereas our $\delta$ error is due to noise in the sampled trajectories taken by our algorithm.

We believe this added potential for error is reasonable for two reasons. First, there is already an $\epsilon$ chance of error; practically speaking, the added error of $\delta$ does not really affect the kind of guarantee we ultimately obtain. Second, the dependence of the running time of our algorithm on $1/\delta$ is logarithmic, so it is easy to use very small $\delta$.

IsRecoverable($x$):
  Let $\bar{x}^*_{0:\infty}$ be the reference trajectory of $\pi_{\text{backup}}$
  for $t \in \{0, \ldots, T\}$ do
    if $\bar{x}^*_t \oplus B_t \not\subseteq \mathcal{X}_{\text{safe}}$ then
      return false
    end if
  end for
  if $\bar{x}^*_T \oplus B_T \not\subseteq \mathcal{X}_{\text{inv}}$ then
    return false
  end if
  return true
Algorithm 3 Check if $x$ is robustly recoverable.
EstimateReachableSets($x_0$):
  Compute $n$ that satisfies (5)
  for $i \in [n]$ do
    Sample trajectory $x^{(i)}_{0:T}$ from $x_0$ using $\pi_{\text{backup}}$
  end for
  for $t \in \{0, \ldots, T\}$ do
    Fit $B_t$ to $\{x^{(i)}_t - \bar{x}^*_t \mid i \in [n]\}$
  end for
Algorithm 4 Estimate the reachable sets $B_t$ after $t$ steps using Monte Carlo sampling.
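A minimal sketch of the sampling step in Algorithm 4, using toy one-dimensional dynamics and an illustrative stand-in backup policy (the real algorithm uses the tracking NMPC and a sample size chosen via the condition (5)):

```python
import random

def simulate(x0, policy, horizon, rng):
    """One stochastic rollout of x' = x + policy(x) + w (toy 1-D dynamics)."""
    x, traj = x0, [x0]
    for _ in range(horizon):
        x = x + policy(x) + rng.uniform(-0.05, 0.05)
        traj.append(x)
    return traj

def estimate_tube(x0, policy, reference, n, rng):
    """Fit per-step boxes B_t = [lo_t, hi_t] around deviations x_t - x*_t."""
    horizon = len(reference) - 1
    samples = [simulate(x0, policy, horizon, rng) for _ in range(n)]
    boxes = []
    for t in range(horizon + 1):
        devs = [traj[t] - reference[t] for traj in samples]
        boxes.append((min(devs), max(devs)))
    return boxes

rng = random.Random(0)
brake = lambda x: -0.5 * x                 # stand-in backup policy
ref = [1.0 * 0.5 ** t for t in range(11)]  # noise-free nominal trajectory
boxes = estimate_tube(1.0, brake, ref, n=200, rng=rng)
```

Each box contains all sampled deviations at that step, so by Lemma 1 a fresh trajectory lands inside the box at that step with high probability once `n` is large enough.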

Estimating reachable sets. Our approach is to compute sets $B_t$ for $t \in \{0, \ldots, T\}$ such that the trajectory satisfies $x_t \in \bar{x}^*_t \oplus B_t$ with probability at least $1 - \epsilon$, where $\bar{x}^*_{0:\infty}$ is the reference trajectory used by $\pi_{\text{backup}}$, i.e.,

$$\Pr\left[x_t \in \bar{x}^*_t \oplus B_t \ \text{for all} \ t \in \{0, \ldots, T\}\right] \ge 1 - \epsilon, \tag{4}$$

where the probability is taken over the randomness in the dynamics. To this end, a box constraint is a set

$$B = [\ell_1, h_1] \times \cdots \times [\ell_{n_x}, h_{n_x}],$$

where $[\ell, h]$ denotes the closed interval from $\ell$ to $h$. We use $\mathcal{B}$ to denote the set of all possible boxes. Now, we have the following theoretical guarantee.

Lemma 1

Let $D$ be a distribution over $\mathbb{R}^{n_x}$ and $\epsilon, \delta \in (0, 1)$ be given. Consider $n$ i.i.d. samples $z_1, \ldots, z_n \sim D$, where $n$ is chosen large enough that the VC dimension bound (7) holds with error at most $\epsilon$ and confidence $1 - \delta$ (condition (5)), and let $B \in \mathcal{B}$ be any box satisfying $z_i \in B$ for all $i \in [n]$. Then, with probability at least $1 - \delta$, we have

$$\Pr_{z \sim D}[z \in B] \ge 1 - \epsilon. \tag{6}$$

Intuitively, (6) says that at least a $1 - \epsilon$ fraction of states (weighted by $D$) fall inside $B$, and this guarantee holds with probability at least $1 - \delta$. The proof, based on tools from statistical learning theory, is given in Appendix V-A.
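To illustrate how such a sample size can be derived, the sketch below numerically inverts the standard VC generalization bound; the paper's exact condition (5) may use different constants, so this is an assumption-labeled approximation rather than the paper's formula.

```python
import math

def vc_eps(n, d, delta):
    """Standard VC generalization-error bound (an assumed form; the paper's
    constants may differ): eps(n) = sqrt((d (ln(2n/d) + 1) + ln(4/delta)) / n)."""
    return math.sqrt((d * (math.log(2 * n / d) + 1) + math.log(4 / delta)) / n)

def sample_size(eps, delta, dim):
    """Smallest n with vc_eps(n, d, delta) <= eps, where d = 2*dim is the
    VC dimension of axis-aligned boxes in R^dim."""
    d = 2 * dim
    n = d
    while vc_eps(n, d, delta) > eps:  # double until the bound is satisfied
        n *= 2
    lo, hi = n // 2, n                # binary-search the crossover point
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if vc_eps(mid, d, delta) > eps:
            lo = mid
        else:
            hi = mid
    return hi

n = sample_size(eps=0.1, delta=1e-6, dim=4)
```

Note the logarithmic dependence on $1/\delta$: shrinking $\delta$ by orders of magnitude adds only an additive $\log(1/\delta)$ term, which is why very small failure probabilities are cheap.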

Algorithm 4 takes $n$ samples of the trajectory by simulating the dynamics, and fits a box $B_t$ to the sampled states on each step $t$. The following guarantee follows from Lemma 1 via a union bound:

Lemma 2

Let $B_0, \ldots, B_T$ be the sequence of boxes returned by Algorithm 4. With probability at least $1 - \delta$, we have

$$\Pr\left[x_t \in \bar{x}^*_t \oplus B_t \ \text{for all} \ t \in \{0, \ldots, T\}\right] \ge 1 - \epsilon.$$

As before, the probability $1 - \delta$ is according to the samples used by our algorithm, whereas the probability $1 - \epsilon$ is according to the randomness in the dynamics.

The sets $\bar{x}^*_t \oplus B_t$ computed using Algorithm 4 can be thought of as an estimate of a tube in which the trajectories are guaranteed to stay [22]. In contrast to prior work, we have used results from statistical learning theory to obtain probabilistic guarantees on the correctness of these tubes [30]. An example of an estimated tube is shown in Fig. 2.

Fig. 2: An example of a tube (red region) estimated using Algorithm 4 for the backup control policy. The estimate is based on sampling trajectories using this policy in simulation (solid colored lines). We guarantee that trajectories sampled in the future lie inside this tube with high probability.

Checking recoverability. Given the boxes $B_t$ for $t \in \{0, \ldots, T\}$, Algorithm 3 checks both properties required for robust recoverability: (i) to check if $x_t \in \mathcal{X}_{\text{safe}}$ with high probability, it checks if

$$\bar{x}^*_t \oplus B_t \subseteq \mathcal{X}_{\text{safe}},$$

which is equivalent to $\bar{x}^*_t \in \mathcal{X}_{\text{safe}} \ominus B_t$, and (ii) to check if $x_T \in \mathcal{X}_{\text{inv}}$ with high probability, it checks if

$$\bar{x}^*_T \oplus B_T \subseteq \mathcal{X}_{\text{inv}},$$

which is equivalent to $\bar{x}^*_T \in \mathcal{X}_{\text{inv}} \ominus B_T$. These checks ensure robust recoverability because Lemma 2 ensures that $x_t \in \bar{x}^*_t \oplus B_t$ with high probability for every $t$. Thus, we have the following guarantee:

Lemma 3

Given a state $x$, if Algorithm 3 returns true, then $x$ is robustly recoverable with probability at least $1 - \delta$ (according to the randomness in the algorithm).

III-D Robust Model Predictive Shielding

Our robust model predictive shielding (RMPS) algorithm is shown in Algorithm 1. At state $x$, this algorithm computes a control input $u$ by checking in simulation whether the next state under the learned policy $\hat{\pi}$ is robustly recoverable (with high probability); if so, it uses $\hat{\pi}$. Otherwise, it takes a step according to $\pi_{\text{backup}}$. One subtlety is that if $\pi_{\text{backup}}$ has already been initialized, it actually needs to check recoverability with respect to the already-initialized backup policy. The issue is that robust recoverability is defined with respect to a freshly initialized backup policy, not the backup policy after it has taken some number of steps. We have the following guarantee:

Theorem 1

Suppose that $x_t$ is robustly recoverable; then, $x_{t+1}$ is robustly recoverable with probability at least $1 - \epsilon - \delta$.

See Appendix V-B for a proof. A key shortcoming of this guarantee is that it does not ensure safety of the infinite horizon trajectory. Given our assumptions, a stronger guarantee is impossible, since on every step there is a chance that the additive perturbation is large, causing the system to leave . However, this guarantee is still useful since it helps guide the design of our algorithm. In practice, we find that the bounds can be tighter than the theory suggests, since the robust NMPC is actually conservatively overapproximating the reachable set. In other words, the robust NMPC ensures safety much more robustly than the probabilities in Theorem 1 would suggest.
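A minimal sketch of the overall shielding loop, mirroring the structure of Algorithm 1: use the learned policy whenever the next state passes the recoverability check, and fall back to the backup controller otherwise. Every component here (the 1-D dynamics, both policies, and the trivial recoverability check) is an illustrative stand-in, not the paper's implementation.

```python
def make_rmps(f, learned, backup, is_recoverable):
    """Return a shielded policy combining a learned and a backup policy."""
    def shielded(x):
        u = learned(x)
        # Simulate the nominal next state; use the learned action only if
        # the backup controller could still recover from there.
        if is_recoverable(f(x, u)):
            return u
        return backup(x)
    return shielded

# Toy instantiation: stay inside |x| <= 1; "recoverable" means the stand-in
# braking policy has room to stop before the boundary.
f = lambda x, u: x + u
learned = lambda x: 0.3            # aggressive: always push right
backup = lambda x: -0.5 * x        # brake toward the origin
is_recoverable = lambda x: abs(x) <= 0.8

rmps = make_rmps(f, learned, backup, is_recoverable)
x, used_backup = 0.0, 0
for _ in range(20):
    u = rmps(x)
    used_backup += (u != learned(x))
    x = f(x, u)
```

The loop exhibits the intended behavior: the learned action is used most of the time, and the backup overrides it only near the safety boundary, keeping the state inside the safe corridor.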

Fig. 3: Safety probabilities (left), and probability of reaching goal (right), for no shielding (brown), non-robust MPS (light green), and robust MPS (dark green).
Fig. 4: Percentage of time using the learned policy (blue) compared to the backup policy (orange).

III-E Practical Modifications

We describe several practical modifications to our algorithm designed to improve either performance or computational tractability. These modifications may weaken our safety guarantees, but as we show in our experiments, they do not affect safety very much empirically.

Computing $\mathcal{X}_{\text{inv}}$. We use a heuristic to compute $\mathcal{X}_{\text{inv}}$ from $x_{\text{eq}}$ [22]. In particular, we sample trajectories generated using $\pi_{\text{eq}}$ over a long horizon and estimate the reachable set the same way as in Algorithm 4. This approach does not provide any guarantees that the estimated set is actually invariant, but it works well in our experiments.

Using tighter constraints for NMPC. In the optimization problem (1) used to compute the reference trajectory, we noted that we can use tighter state constraints than needed; doing so improves the robustness of the tracking NMPC. In particular, we use the tightened set $\bar{\mathcal{X}} = \mathcal{X}_{\text{safe}} \ominus B$, where $B$ is a box accounting for the tracking error.

Precomputing the boxes $B_t$. Computing the sets $B_t$ (for $t \in \{0, \ldots, T\}$) on-the-fly can be prohibitively expensive, since we might need a large number of samples for Lemma 1 to apply. Instead, we precompute these sets from a fixed initial state $x_0$. Then, we reuse the same boxes rather than recomputing them at each step. Intuitively, this approach works well in practice since the dynamics of the tracking NMPC are usually fairly similar for different initial states.

IV Experiments

We perform experiments to demonstrate how our system can ensure safety of stochastic systems with nonlinear dynamics and/or nonconvex constraints.

IV-A Setting

We perform experiments using three environments: (i) cart-pole, which has nonlinear dynamics and polytopic constraints (i.e., the pole should not fall below a certain height); (ii) a particle with holonomic dynamics and obstacles, which has linear dynamics but nonconvex constraints; and (iii) a particle with non-holonomic dynamics and obstacles, which has both nonlinear dynamics and nonconvex constraints.

For the cart-pole, the states are $(x, v, \theta, \omega)$, where $x$ is the cart position, $v$ is the cart velocity, $\theta$ is the pole angle from the upright position, and $\omega$ is the pole angular velocity; the control input is the force applied to the cart, with the goal of reaching a target position [8]. The safety constraint is that the pole should not fall down while moving the cart. We define the cost function to penalize the distance to the target position, weighted by a hyperparameter. Finally, disturbances are uniform noise on the velocity and angular velocity, and zero otherwise.

For the single particle with holonomic dynamics, the states are $(p, v)$, where $p \in \mathbb{R}^2$ is position and $v \in \mathbb{R}^2$ is velocity, and the control input is the acceleration $a \in \mathbb{R}^2$. The system dynamics are linear (a double integrator). The cost function penalizes the distance to the goal position and proximity to the obstacles, with a hyperparameter trading off the two terms. Disturbances are uniform noise on the velocity components, and zero otherwise.

For the single particle with non-holonomic dynamics, the states are $(p, v, \theta)$, where $p \in \mathbb{R}^2$ is the position, $v$ is the velocity, and $\theta$ is the heading, and the control inputs are $(a, \alpha)$, where $a$ is acceleration and $\alpha$ is angular acceleration. The system dynamics are nonlinear, where the particle has radius $r$. Costs and disturbances are the same as for the holonomic particle.

We compare our algorithm with two other policies: (i) using the learned policy $\hat{\pi}$ without any shielding, and (ii) using shielding without robust control (i.e., the original MPS algorithm). For each experiment, we run 50 scenarios with 3 different random seeds and compute the safety rate as well as the rate of reaching the goal.

IV-B Results

The safety and performance of the three algorithms are shown in Fig. 3. As can be seen, the learned policy achieves the highest performance in all but one of the environments (the one case where it does not perform the best is likely due to noise). However, it performs very poorly in terms of safety, demonstrating the need for shielding. Next, the non-robust MPS performs slightly better in safety, but still cannot guarantee that safety holds. Its performance is correspondingly worse as well. In contrast, our robust MPS algorithm achieves 100% safety rate in each of the three environments. Its performance is slightly diminished—for the particle with non-holonomic dynamics (the hardest environment), the probability of reaching the goal drops by about 20%. Thus, our algorithm is much more suitable for safety-critical systems where safety must be guaranteed.

In Fig. 4, we show the frequency with which our robust MPS algorithm uses the learned policy $\hat{\pi}$ compared to the backup controller $\pi_{\text{backup}}$. As can be seen, on the cart-pole and non-holonomic particle environments, which are more prone to being unsafe, robust MPS is less likely to use $\hat{\pi}$.

V Conclusion

We have proposed a safe reinforcement learning algorithm that ensures safety of a learned control policy for stochastic nonlinear dynamical systems. We use a sampling-based approach to estimate the reachable set of the backup controller, and use results from statistical learning theory to provide theoretical guarantees on our estimates. We propose a number of modifications to enable a practical implementation of our approach. In our experiments, we show that our approach can ensure safety without sacrificing very much performance despite these modifications. Thus, our approach is a promising way to ensure safety in safety-critical systems.


V-A Proof of Lemma 1

Given a box $B \in \mathcal{B}$, define the classifier $g_B : \mathbb{R}^{n_x} \to \{0, 1\}$ by $g_B(z) = \mathbb{I}[z \in B]$, i.e., $g_B$ indicates whether $z$ is contained in $B$. Note that $g_B$ is a binary classifier, and we can consider the family of classifiers $\mathcal{G} = \{g_B \mid B \in \mathcal{B}\}$. Also, define the distribution $\tilde{D}$ over $\mathbb{R}^{n_x} \times \{0, 1\}$ by sampling $z \sim D$ and labeling it $1$, i.e., all labels are $1$. Thus, sampling $z \sim D$ is equivalent to sampling $(z, 1) \sim \tilde{D}$. Recall that we choose $B$ so all samples satisfy $z_i \in B$. Equivalently, $g_B(z_i) = 1$ for all $i \in [n]$, so we can think of choosing $B$ as choosing $g_B$ such that the training error

$$\hat{L}(g_B) = \frac{1}{n} \sum_{i=1}^n \mathbb{I}[g_B(z_i) \ne 1] = 0$$

on the set of i.i.d. samples $(z_1, 1), \ldots, (z_n, 1) \sim \tilde{D}$. Thus, we can apply results from statistical learning theory to bound the test error

$$L(g_B) = \Pr_{(z, 1) \sim \tilde{D}}[g_B(z) \ne 1] = \Pr_{z \sim D}[z \notin B],$$

where the last probability is the one we are seeking to bound. In particular, it is straightforward to check that the VC dimension of $\mathcal{G}$ for boxes in $\mathbb{R}^{n_x}$ is $d = 2 n_x$. By the VC dimension bound [30], for all $g \in \mathcal{G}$, we have

$$L(g) \le \hat{L}(g) + O\left(\sqrt{\frac{d \log n + \log(1/\delta)}{n}}\right) \tag{7}$$

with probability at least $1 - \delta$. The claim follows by choosing $n$ according to (5) so that the right-hand side of (7) is at most $\epsilon$.

V-B Proof of Theorem 1

If RMPS uses $\hat{\pi}$, then by Lemma 3, $x_{t+1}$ is robustly recoverable with probability at least $1 - \delta$, so the claim holds. Alternatively, suppose that RMPS uses $\pi_{\text{backup}}$. If it uses the already-initialized version of $\pi_{\text{backup}}$, then by Lemma 3, $x_{t+1}$ is robustly recoverable with probability at least $1 - \delta$, so again the claim holds. By a union bound, both claims hold with probability at least $1 - \epsilon - \delta$.

Finally, suppose that RMPS initializes $\pi_{\text{backup}}$ and then uses it. Because $x_t$ is robustly recoverable, the trajectory generated using $\pi_{\text{backup}}$ from $x_t$ is safe and reaches $\mathcal{X}_{\text{inv}}$ with probability at least $1 - \epsilon$. Furthermore, the robust recoverability condition for $x_{t+1}$ concerns the suffix of this trajectory starting at $x_{t+1}$. In particular, note that any trajectory satisfying the condition from $x_t$ also satisfies it from $x_{t+1}$, so $x_{t+1}$ is robustly recoverable. The claim follows.


  • [1] A. K. Akametalu, J. F. Fisac, J. H. Gillula, S. Kaynama, M. N. Zeilinger, and C. J. Tomlin (2014) Reachability-based safe learning with gaussian processes. In 53rd IEEE Conference on Decision and Control, pp. 1424–1431. Cited by: §I, §II, §III-C, §III-C.
  • [2] M. Alshiekh, R. Bloem, R. Ehlers, B. Könighofer, S. Niekum, and U. Topcu (2018) Safe reinforcement learning via shielding. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §I, §II.
  • [3] J. B. Rawlings and D. Q. Mayne (2009) Model predictive control: theory and design. Cited by: §I, §III-C.
  • [4] O. Bastani, Y. Pu, and A. Solar-Lezama (2018) Verifiable reinforcement learning via policy extraction. In Advances in Neural Information Processing Systems, pp. 2494–2504. Cited by: §I.
  • [5] O. Bastani (2019) Safe reinforcement learning via online shielding. CoRR abs/1905.10691. External Links: Link, 1905.10691 Cited by: §I, §I, §II, §III-A.
  • [6] F. Berkenkamp, M. Turchetta, A. Schoellig, and A. Krause (2017) Safe model-based reinforcement learning with stability guarantees. In Advances in neural information processing systems, pp. 908–918. Cited by: §I, §III-C, §III-C.
  • [7] M. Bojarski, D. D. Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Zieba (2016) End to end learning for self-driving cars. CoRR abs/1604.07316. External Links: Link, 1604.07316 Cited by: §I.
  • [8] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) OpenAI gym. CoRR abs/1606.01540. External Links: Link, 1606.01540 Cited by: §IV-A.
  • [9] M. Fazlyab, M. Morari, and G. J. Pappas (2019) Safety verification and robustness analysis of neural networks via quadratic constraints and semidefinite programming. CoRR abs/1903.01287. External Links: Link, 1903.01287 Cited by: §I.
  • [10] M. Fazlyab, A. Robey, H. Hassani, M. Morari, and G. J. Pappas (2019) Efficient and accurate estimation of lipschitz constants for deep neural networks. CoRR abs/1906.04893. External Links: Link, 1906.04893 Cited by: §I.
  • [11] J. F. Fisac, N. F. Lugovoy, V. Rubies-Royo, S. Ghosh, and C. J. Tomlin (2019) Bridging Hamilton-Jacobi safety analysis and reinforcement learning. In 2019 International Conference on Robotics and Automation (ICRA), pp. 8550–8556. Cited by: §I.
  • [12] Y. Gao, A. Gray, H. E. Tseng, and F. Borrelli (2014) A tube-based robust nonlinear predictive control approach to semiautonomous ground vehicles. Vehicle System Dynamics 52 (6), pp. 802–823. External Links: Document, Link, Cited by: §I.
  • [13] C. E. Garcia, D. M. Prett, and M. Morari (1989) Model predictive control: theory and practice - a survey. Automatica 25, pp. 335–348. Cited by: §I.
  • [14] C. Gehring, S. Coros, M. Hutter, D. Bellicoso, H. Heijnen, R. Diethelm, M. Bloesch, P. Fankhauser, J. Hwangbo, M. Hoepflinger, and R. Siegwart (2016) Practice makes perfect: an optimization-based approach to controlling agile motions for a quadruped robot. IEEE Robotics & Automation Magazine. External Links: Document Cited by: §I.
  • [15] J. H. Gillula and C. J. Tomlin (2012) Guaranteed safe online learning via reachability: tracking a ground target using a quadrotor. In 2012 IEEE International Conference on Robotics and Automation, pp. 2723–2730. Cited by: §I, §II, §III-C, §III-C.
  • [16] J. Ho and S. Ermon (2016) Generative adversarial imitation learning. CoRR abs/1606.03476. External Links: Link, 1606.03476 Cited by: §I.
  • [17] R. Ivanov, J. Weimer, R. Alur, G. J. Pappas, and I. Lee (2019) Verisig: verifying safety properties of hybrid systems with neural network controllers. In Proceedings of the 22nd ACM International Conference on Hybrid Systems: Computation and Control, pp. 169–178. Cited by: §I.
  • [18] L. Kaiser, M. Babaeizadeh, P. Milos, B. Osinski, R. H. Campbell, K. Czechowski, D. Erhan, C. Finn, P. Kozakowski, S. Levine, R. Sepassi, G. Tucker, and H. Michalewski (2019) Model-based reinforcement learning for Atari. CoRR abs/1903.00374. External Links: Link, 1903.00374 Cited by: §I.
  • [19] A. Khan, C. Zhang, S. Li, J. Wu, B. Schlotfeldt, S. Y. Tang, A. Ribeiro, O. Bastani, and V. Kumar (2019) Learning safe unlabeled multi-robot planning with motion constraints. CoRR abs/1907.05300. External Links: Link, 1907.05300 Cited by: §I.
  • [20] N. O. Lambert, D. S. Drew, J. Yaconelli, R. Calandra, S. Levine, and K. S. J. Pister (2019) Low level control of a quadrotor with deep model-based reinforcement learning. CoRR abs/1901.03737. External Links: Link, 1901.03737 Cited by: §I.
  • [21] R. Mahjourian, N. Jaitly, N. Lazic, S. Levine, and R. Miikkulainen (2018) Hierarchical policy design for sample-efficient learning of robot table tennis through self-play. CoRR abs/1811.12927. External Links: Link, 1811.12927 Cited by: §I.
  • [22] D. Q. Mayne, E. C. Kerrigan, E. J. van Wyk, and P. Falugi (2011) Tube-based robust nonlinear model predictive control. International Journal of Robust and Nonlinear Control 21 (11), pp. 1341–1353. External Links: Document, Link, Cited by: §I, §III-A, §III-B, §III-B, §III-C, §III-E.
  • [23] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. CoRR abs/1602.01783. External Links: Link, 1602.01783 Cited by: §I.
  • [24] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel (2015) Trust region policy optimization. CoRR abs/1502.05477. External Links: Link, 1502.05477 Cited by: §I.
  • [25] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. CoRR abs/1707.06347. External Links: Link, 1707.06347 Cited by: §I.
  • [26] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529, pp. 484–503. External Links: Link Cited by: §I.
  • [27] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller (2014) Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning - Volume 32, ICML'14, pp. I-387–I-395. External Links: Link Cited by: §II.
  • [28] D. Tran (2019) Safety verification of cyber-physical systems with reinforcement learning control. In EMSOFT 2019. Cited by: §I.
  • [29] H. van Hasselt, A. Guez, and D. Silver (2015) Deep reinforcement learning with double Q-learning. CoRR abs/1509.06461. External Links: Link, 1509.06461 Cited by: §I.
  • [30] V. Vapnik (2013) The nature of statistical learning theory. Springer Science & Business Media. Cited by: §I, §III-C, §V-A.
  • [31] O. Vinyals, T. Ewalds, S. Bartunov, P. Georgiev, A. S. Vezhnevets, M. Yeo, A. Makhzani, H. Küttler, J. Agapiou, J. Schrittwieser, J. Quan, S. Gaffney, S. Petersen, K. Simonyan, T. Schaul, H. van Hasselt, D. Silver, T. P. Lillicrap, K. Calderone, P. Keet, A. Brunasso, D. Lawrence, A. Ekermo, J. Repp, and R. Tsing (2017) StarCraft II: A new challenge for reinforcement learning. CoRR abs/1708.04782. External Links: Link, 1708.04782 Cited by: §I.
  • [32] K. P. Wabersich and M. N. Zeilinger (2018) Linear model predictive safety certification for learning-based control. CoRR abs/1803.08552. External Links: Link, 1803.08552 Cited by: §I.
  • [33] W. Xiang, D. M. Lopez, P. Musau, and T. T. Johnson (2018) Reachable set estimation and verification for neural network models of nonlinear dynamic systems. CoRR abs/1802.03557. External Links: Link, 1802.03557 Cited by: §I.
  • [34] S. Yu, H. Chen, and F. Allgöwer (2011) Tube MPC scheme based on robust control invariant set with application to Lipschitz nonlinear systems. In 2011 50th IEEE Conference on Decision and Control and European Control Conference, pp. 2650–2655. External Links: Document, ISSN 0191-2216 Cited by: §I.
  • [35] W. Zhang and O. Bastani MAMPS: safe multi-agent reinforcement learning via model predictive shielding. External Links: Link Cited by: §I.