I. Introduction
Learning-based control introduces new approaches to controller synthesis based on large-scale databases providing accumulated system knowledge. This allows previously labor-intensive tasks, such as system modeling and controller tuning, to eventually be fully automated. Deep reinforcement learning, for example, provides prominent results, with applications including control of humanoid robots in complex environments
[1] and playing Atari arcade video games [2]. Despite the advances in research-driven applications, the results can often not be transferred to industrial systems that are safety-critical, i.e., systems that must be guaranteed to operate within a given range of physical and safety constraints. This is due to the often complex inner workings of learning-based methods, which render their systematic analysis difficult.
By introducing a model predictive safety certification (MPSC) mechanism for any learning-based controller, we aim at bridging this gap for linear systems with additive uncertainties that can, e.g., result from a belief representation of an unknown nonlinear system. The proposed MPSC scheme estimates safety of a proposed learning-based input in real-time by searching for a safe backup trajectory for the next time step, in the form of generating a feasible trajectory towards a known safe set. Allowing the MPSC scheme to modify the potentially unsafe learning-based input, if necessary, provides safety for all future times. The result can be seen as a 'safety filter', since it only filters proposed inputs that drive the system out of what we call the safe set. The resulting online optimization problem can be efficiently solved in real-time using established model predictive control (MPC) solvers. Partially unknown larger-scale systems can therefore be efficiently enhanced with safety certificates during learning.
Contributions: We consider linear systems with additive disturbances, described in Section II, that encode the current, possibly data-driven, belief about a safety-critical system to which a potentially unsafe learning-based controller should be applied. A model predictive safety certification scheme is proposed in Section III, which allows for enhancing any learning-based controller with safety guarantees (even human inputs can be enhanced by the safety certification scheme, which relates, e.g., to the concept of electronic stability control from automotive engineering). The concept of the proposed scheme is comparable to the safety frameworks presented in [3, 4] by providing an implicit safe set together with a safe backup controller that can be applied if the system would leave the safe set under the proposed learning input. A distinctive advantage compared to existing methods is that the MPSC scheme can build on any system behavior that is known to be safe, i.e. a known set of safe system states can easily be incorporated in our scheme such that it only analyzes safety outside of the provided safe set.
The approach relies on scalable offline computations and online optimization of a robust MPC problem at every sampling time, which can be performed using available real-time capable solvers that can deal with large-scale systems (see e.g. [5]). While we relate the required assumptions and design steps to tube-based MPC in Section IV, we present an automated, parametrization-free, and data-driven design procedure that is tailored to the context of learning the system dynamics. The design procedure and MPSC scheme are illustrated in Section V using numerical examples.
Related work: Making the relevant class of safety-critical systems accessible to learning-based control methods has gained significant attention in recent years. In addition to the individual construction of safety certificates for specific learning-based control methods subject to different notions of safety, see e.g. the survey [6], a discussion of advances in safe learning-based control subject to state and input constraints can be found in [7]. A promising direction that emerged from recent research focuses on what is called a 'safety framework' [3, 7, 4, 8], which consists of a safe set in the state space and a safety controller. While the system state is contained in the safe set, any feasible input (including learning-based controllers) can be applied to the system. However, if such an input would cause the system to leave the safe set, the safety controller interferes. Since this strategy is compatible with any learning-based control algorithm, it serves as a universal safety certification concept. The techniques proposed in [3, 7] are based on a differential game formulation that results in solving a min-max optimal control problem, which can provide the largest possible safe set but offers very limited scalability. The approach described in [4] uses convex approximation techniques that scale well to larger-scale systems at the cost of a potentially conservative safe set. While these results explicitly consider nonlinear systems, we focus on linear model approximations allowing for various improvements. We introduce a new mechanism for generating the safe set and controller using ideas related to tube-based MPC, which enables scalability with respect to the state dimension while being less conservative than e.g. [4].
There is a methodological similarity to learning-based MPC approaches, as e.g. proposed in [9], or more recently in [10] considering nonlinear Gaussian process models. While such methods are limited to an MPC strategy based on the learned system model, this paper provides a concept that can enhance any learning-based controller with safety guarantees. This allows, e.g., for maximizing black-box reward functions (where the reward of a sequence of actions is only available through measurements) for complex tasks, see e.g. [11], which would not be possible within an MPC framework, or for focusing on exploration in order to collect informative data about the system, as described in Section V.
Notation: The set of symmetric matrices of dimension $n$ is $\mathbb{S}^n$; the set of positive (semi-)definite matrices is $\mathbb{S}^n_{++}$ ($\mathbb{S}^n_{+}$). The set of integers in the interval $[a, b] \subset \mathbb{R}$ is $\mathbb{I}_{[a,b]}$, and the set of integers in the interval $[a, \infty) \subset \mathbb{R}$ is $\mathbb{I}_{\geq a}$. The Minkowski sum of two sets $\mathcal{A}, \mathcal{B} \subseteq \mathbb{R}^n$ is denoted by $\mathcal{A} \oplus \mathcal{B}$ and the Pontryagin set difference by $\mathcal{A} \ominus \mathcal{B}$. The $i$-th row and $j$-th column of a matrix $A$ are denoted by $[A]_{i,\cdot}$ and $[A]_{\cdot,j}$.
II. Problem description
We consider dynamical systems that can be described by linear systems with additive disturbances of the form
$x(k+1) = Ax(k) + Bu(k) + w(k)$  (1)
with initial condition $x(0) = x_0$, where $w(k) \in \mathcal{W}$ and $\mathcal{W} \subset \mathbb{R}^n$ is a compact set. The system is subject to polytopic state constraints $\mathcal{X} = \{x \in \mathbb{R}^n : A_x x \leq b_x\}$ and polytopic input constraints $\mathcal{U} = \{u \in \mathbb{R}^m : A_u u \leq b_u\}$. We assume that the origin is contained in $\mathcal{X}$, that $(A, B)$ is stabilizable, and that the system state is fully observable. Note that the system class (1) allows for modeling nonlinear time-varying systems $x(k+1) = f(x(k), u(k), k)$ if $f(x, u, k) - Ax - Bu \in \mathcal{W}$ for all $x \in \mathcal{X}$, $u \in \mathcal{U}$, and $k \geq 0$.
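As a small illustration of this remark, the mismatch between a nonlinear system and a linear model can be sampled and bounded empirically to obtain a candidate disturbance set; the dynamics, model, and ranges below are purely hypothetical, not taken from the paper.

```python
import math

# Hypothetical 1D example: true dynamics x+ = x + 0.1*sin(x) + 0.2*u,
# linear model x+ = A*x + B*u with A = 1.0, B = 0.2 (assumed values).
A, B = 1.0, 0.2
def f_true(x, u):
    return x + 0.1 * math.sin(x) + 0.2 * u

# Residual w = f(x,u) - (A*x + B*u); sampling it over the operating range
# gives an empirical bound for a disturbance set W = [-w_max, w_max].
samples = [(-2.0 + 0.1 * i, -1.0 + 0.05 * j) for i in range(41) for j in range(41)]
w_max = max(abs(f_true(x, u) - (A * x + B * u)) for x, u in samples)
print(f"empirical disturbance bound: |w| <= {w_max:.4f}")
```

Here the residual is exactly 0.1·sin(x), so the empirical bound approaches 0.1; in practice one would add a margin, since sampled residuals only lower-bound the true worst case.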
We aim at providing a safety certificate for arbitrary control signals in terms of a safe set and a safe control law. Given the system description (1) and a potentially unsafe learning-based controller $u_L : \mathbb{R}^n \to \mathbb{R}^m$, we search for a set of states $\mathcal{S}$ for which we know a feasible backup control strategy such that input and state constraints will be fulfilled for all future times. Therefore, $u_L$ can be applied as long as it does not cause the system to leave $\mathcal{S}$ or violate input constraints. Otherwise, a safety controller is allowed to modify the learning input based on the backup controller in order to keep the system safe. Formally, this is captured by the following definition of a safe set and controller.
Definition II.1.
A set $\mathcal{S} \subseteq \mathcal{X}$ is called a safe set for system (1) if a safe backup control law $u_{\mathcal{S}} : \mathcal{S} \to \mathcal{U}$ is available such that for an arbitrary (learning-based) policy $u_L$, the application of the safety control law
$u(x) = u_L(x)$ if $\{Ax + Bu_L(x)\} \oplus \mathcal{W} \subseteq \mathcal{S}$ and $u_L(x) \in \mathcal{U}$, and $u(x) = u_{\mathcal{S}}(x)$ otherwise,
guarantees that the system state is contained in $\mathcal{X}$ for all $k \geq 0$ if $x(0) \in \mathcal{S}$.
While this framework is conceptually similar to those in [3, 7, 4, 8], we do not require the safe set to be robust controlled invariant as in [4, Definition II.4], [7, Definition 2], [8, Section 2.2] or [3, Section II.A]. The presented approach is thereby capable of enlarging any given safe set, and can be combined with any of the previously proposed methods.
III. Model predictive safety certification
The starting point for the derivation of the proposed safety concept is Definition II.1. The essential requirement is that in the safe set $\mathcal{S}$, we always need to know a feasible backup controller that ensures constraint satisfaction in the face of uncertainty for all future times. The idea for constructing such a controller is based on MPC [12]. Given the current system state, we calculate a safe, finite-horizon backup controller towards some conservative target set $\mathcal{S}_f$, which is known to be a safe set and therefore provides 'infinite safety' after applying the finite-time controller.
The concept is illustrated in Figure 1. Consider a current system state $x(k)$, together with a proposed learning input $u_L(x(k))$. In order to analyze safety of $u_L(x(k))$, we test if it will lead to a state $x(k+1)$ for which we can construct a safe backup controller in the form of a feasible input sequence that drives the system to the safe terminal set $\mathcal{S}_f$ in a given finite number of steps. If the test is successful, $x(k+1)$ is a safe state and $u_L(x(k))$ can be applied. At the next time step, we repeat the calculations for $x(k+1)$ and $u_L(x(k+1))$. If they are again successful, we can again apply the learning input; otherwise, we can simply use the previously calculated backup controller from time $k$. This strategy yields a safe set that is defined by the feasible set of the corresponding optimization problem for planning a trajectory towards the target set.
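The test described above can be sketched for a hypothetical scalar system, with a greedy search standing in for the trajectory optimization (all constants below are illustrative assumptions, not the paper's formulation):

```python
# Minimal sketch of the certification test for a hypothetical scalar system
# x(k+1) = x(k) + u(k), |u| <= 1, state constraint |x| <= 5, terminal safe
# set |x| <= 0.5, horizon N = 5 (all values are illustrative assumptions).
U_MAX, X_MAX, S_F, N = 1.0, 5.0, 0.5, 5

def backup_exists(x, n=N):
    """Greedy stand-in for the trajectory search: can n full-authority
    steps drive x into the terminal set without leaving the constraints?"""
    for _ in range(n + 1):
        if abs(x) <= S_F:
            return True
        if abs(x) > X_MAX:
            return False
        x -= min(U_MAX, abs(x)) * (1 if x > 0 else -1)  # steer toward 0
    return abs(x) <= S_F

def certify(x, u_learn):
    """Pass u_learn through only if the successor state admits a backup plan."""
    x_next = x + u_learn
    if abs(x_next) <= X_MAX and backup_exists(x_next):
        return u_learn                                   # certified
    return -min(U_MAX, abs(x)) * (1 if x > 0 else -1)    # backup step instead

print(certify(3.0, 0.5))   # successor 3.5 can reach |x| <= 0.5 in time
print(certify(4.8, 0.5))   # successor 5.3 violates |x| <= 5 -> input replaced
```

The real scheme replaces the greedy search with the robust optimization problem introduced below, but the filtering logic is the same: the learning input passes unchanged exactly when a feasible backup trajectory exists from its successor state.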
As the true system dynamics model is often unknown in the context of learning-based control, we employ mechanisms from tube-based MPC to design a safe backup controller for uncertain system dynamics of the form (1).
III-A Model predictive safety certification scheme
Similarly as in tube-based MPC, see e.g. [12], a nominal backup trajectory is computed such that a stabilizing auxiliary controller is able to track it for the real system within a 'tube' towards the safe terminal set. We first define the main components and assumptions of the tube-based MPC controller, in order to then introduce the model predictive safety certification (MPSC) scheme, consisting of the MPSC problem and the proposed safety controller. Define with $\bar{x}(k) \in \mathbb{R}^n$ and $\bar{u}(k) \in \mathbb{R}^m$ the nominal system states and inputs, as well as the nominal dynamics
$\bar{x}(k+1) = A\bar{x}(k) + B\bar{u}(k)$  (2)
with initial condition $\bar{x}(0) = \bar{x}_0$. Denote by $e(k) = x(k) - \bar{x}(k)$ the error (deviation) between the system state (1) and the nominal system state (2). The controller is then defined by augmenting the nominal input with an auxiliary feedback on the error, i.e.
$u(k) = \bar{u}(k) + K\big(x(k) - \bar{x}(k)\big)$  (3)
which keeps the real system state close to the nominal system state if $K$ is chosen such that it robustly stabilizes the error $e(k)$ with dynamics
$e(k+1) = (A + BK)e(k) + w(k)$  (4)
resulting from application of (3) to the real system.
Assumption III.1.
There exists a linear state feedback matrix $K$ that yields a stable error system (4).
Stability of the autonomous error dynamics (4) implies the existence of a corresponding robust positively invariant set according to the following definition.
Definition III.2.
A set $\mathcal{E} \subseteq \mathbb{R}^n$ is a robust positively invariant (RPI) set for the error dynamics (4) if $(A + BK)\mathcal{E} \oplus \mathcal{W} \subseteq \mathcal{E}$.
In order to guarantee that $x(k) \in \mathcal{X}$ and $u(k) \in \mathcal{U}$ under application of (3), the state and input constraints are tightened for the nominal system (2), as described e.g. in [12], to $\bar{\mathcal{X}} = \mathcal{X} \ominus \mathcal{E}$ and $\bar{\mathcal{U}} = \mathcal{U} \ominus K\mathcal{E}$. There exist various methods in the literature which can be used in order to calculate a controller $K$ and the corresponding RPI set $\mathcal{E}$ according to Definition III.2, see e.g. [13].
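For interval (one-dimensional) sets the Pontryagin difference has a simple closed form, which makes the tightening step easy to illustrate; the sets and the gain below are illustrative assumptions.

```python
def pontryagin_diff(X, E):
    """X ⊖ E for intervals X = (x_lo, x_hi), E = (e_lo, e_hi):
    the set of x with x + e in X for every e in E."""
    lo, hi = X[0] - E[0], X[1] - E[1]
    if lo > hi:
        raise ValueError("E too large: the tightened set is empty")
    return (lo, hi)

# Illustrative numbers: state constraint X = [-5, 5], RPI error set E = [-0.4, 0.4].
X_bar = pontryagin_diff((-5.0, 5.0), (-0.4, 0.4))
print(X_bar)  # nominal states must keep a 0.4 margin on each side

# Input tightening U ⊖ KE with a scalar gain K = -0.5: KE = [-0.2, 0.2].
K = -0.5
KE = (-abs(K) * 0.4, abs(K) * 0.4)
U_bar = pontryagin_diff((-1.0, 1.0), KE)
print(U_bar)
```

The nominal problem then plans inside these shrunken sets, so that the real trajectory, which stays within the tube of radius given by the error set, satisfies the original constraints.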
Different from standard tube-based MPC, the model predictive safety certification (MPSC) uses a terminal set $\mathcal{S}_f$ that is only required to be itself a safe set according to Definition II.1, which is conceptually similar to the safe terminal set used in [10]. This allows not only for enlarging any potentially conservative initial safe set, but also for recursively improving the safe set, as will be shown in Section IV-B.
Assumption III.3.
There exists a safe set $\mathcal{S}_f$ and a safe control law $u_{\mathcal{S}_f}$ according to Definition II.1 such that $\mathcal{S}_f \subseteq \bar{\mathcal{X}}$.
Remark III.4.
Based on these components, the proposed safe backup controller utilizes the following MPSC problem for a given measured state $x(k)$ and proposed learning input $u_L(x(k))$:
$\min_{\bar{u}_{\cdot|k}, \bar{x}_{\cdot|k}} \ \| u_L(x(k)) - u_{0|k} \|$  (5a)
s.t. $\bar{x}_{i+1|k} = A\bar{x}_{i|k} + B\bar{u}_{i|k}, \quad i \in \mathbb{I}_{[0, N-1]}$,  (5b)
$\bar{x}_{i|k} \in \bar{\mathcal{X}}, \ \bar{u}_{i|k} \in \bar{\mathcal{U}}, \quad i \in \mathbb{I}_{[0, N-1]}$,  (5c)
$\bar{x}_{N|k} \in \mathcal{S}_f$,  (5d)
$x(k) - \bar{x}_{0|k} \in \mathcal{E}$,  (5e)
$u_{0|k} = \bar{u}_{0|k} + K\big(x(k) - \bar{x}_{0|k}\big)$,  (5f)
where we denote the planning horizon by $N$ and the predicted nominal system states and inputs by $\bar{x}_{i|k}$ and $\bar{u}_{i|k}$. Let the feasible set of (5) be denoted by
$\mathcal{X}_{\mathrm{MPSC}} = \{ x(k) \in \mathbb{R}^n : \exists\, \bar{x}_{\cdot|k}, \bar{u}_{\cdot|k} \ \text{satisfying (5b)–(5f)} \}.$  (6)
Problem (5) introduces the auxiliary variable $u_{0|k}$, which includes the auxiliary feedback (5f), ensuring safety of the control input $u_{0|k}$, as we will show in the proof of Theorem III.5. The cost (5a) is chosen such that, if possible, $u_{0|k}$ is equal to $u_L(x(k))$, in which case safety of $u_L(x(k))$ is certified. The controller resulting from applying (5) in a receding horizon fashion is given by
$u_{\mathrm{MPSC}}(x(k)) = \bar{u}^*_{0|k} + K\big(x(k) - \bar{x}^*_{0|k}\big),$  (7)
where $\bar{u}^*_{0|k}$ and $\bar{x}^*_{0|k}$ are taken from the optimal solution of (5) at state $x(k)$.
It is important to note that (5) may not be recursively feasible for general safe sets $\mathcal{S}_f$. This is due to the fact that the terminal safe set is itself not necessarily invariant or a subset of the feasible set $\mathcal{X}_{\mathrm{MPSC}}$.
To this end, we propose Algorithm 1, which implements a safety controller based on (5). As long as (5) is feasible, we can directly apply the resulting input $u_{\mathrm{MPSC}}(x(k))$ (Algorithm 1, line 4). If (5) becomes infeasible at a subsequent time $\bar{k}$, then we know via (5) a finite-time safe backup controller towards $\mathcal{S}_f$ using (3), based on the trajectory computed to certify the last feasible state (Algorithm 1, line 9); compare also with Figure 1. By Assumption III.3, we can extend this finite-time backup controller after $N$ steps with $u_{\mathcal{S}_f}$ (Algorithm 1, line 11) in order to obtain a safe backup controller that satisfies the constraints at all times in the future. In the case that $x(0) \in \mathcal{S}_f \setminus \mathcal{X}_{\mathrm{MPSC}}$, (5) can be initially infeasible. This case can be easily treated by directly applying $u_{\mathcal{S}_f}$ (Algorithm 1, lines 1 and 11), which ensures safety for all future times. Formalization of the above yields our main result.
Theorem III.5.
Let Assumptions III.1 and III.3 hold. Then the set $\mathcal{X}_{\mathrm{MPSC}} \cup \mathcal{S}_f$ is a safe set according to Definition II.1 for system (1), with the safety control law given by Algorithm 1.
Proof.
If $x(0) \in \mathcal{S}_f \setminus \mathcal{X}_{\mathrm{MPSC}}$, the terminal safety controller $u_{\mathcal{S}_f}$ is applied, since the backup counter is initialized accordingly (Algorithm 1, line 1), which keeps the system safe for all times. We show that $\mathcal{X}_{\mathrm{MPSC}}$ is a safe set by first investigating the case that (5) is feasible for all $k \geq 0$ and then extending the analysis to cases in which (5) is infeasible for arbitrarily many time steps.
Let $x(0) \in \mathcal{X}_{\mathrm{MPSC}}$ and let (5) be feasible for all $k \geq 0$ (Algorithm 1, line 4). Condition (5e) implies by Assumption III.1 that the error $x(k) - \bar{x}^*_{0|k}$ remains in $\mathcal{E}$ and therefore, together with (5f) and the tightened input constraints in (5c), that $u_{\mathrm{MPSC}}(x(k)) \in \mathcal{U}$, which implies by the tightened constraints on the nominal state (5c) that $x(k) \in \mathcal{X}$. Therefore $\mathcal{X}_{\mathrm{MPSC}}$ is a safe set under the safe backup controller (7).
Now, consider an arbitrary time $\bar{k}$ at which (5) was feasible for the last time, i.e. (5) is infeasible for all $k$ with $\bar{k} < k \leq \bar{k} + j$ for some $j \geq 1$. Because of (5e) and (5f), the system state remains in the tube around the nominal backup trajectory, and therefore Assumption III.3 together with (5d) allows for explicitly stating a safe backup control law that keeps the system within the constraints for all future times:
$u(k) = \bar{u}^*_{k-\bar{k}|\bar{k}} + K\big(x(k) - \bar{x}^*_{k-\bar{k}|\bar{k}}\big)$ for $\bar{k} < k \leq \bar{k} + N$, and $u(k) = u_{\mathcal{S}_f}(x(k))$ for $k > \bar{k} + N$.
Since (5) was feasible at time $\bar{k}$, the corresponding $\bar{u}^*_{i|\bar{k}}$, $\bar{x}^*_{i|\bar{k}}$ exist. Therefore, in the case $\bar{k} < k \leq \bar{k} + N$ (Algorithm 1, line 9), it follows from $x(\bar{k}) - \bar{x}^*_{0|\bar{k}} \in \mathcal{E}$ by Assumption III.1 that $x(k) \in \bar{x}^*_{k-\bar{k}|\bar{k}} \oplus \mathcal{E}$ for all $\bar{k} < k \leq \bar{k} + N$.
The last remaining case follows from the observation that by (5d) the state is driven into the terminal safe set, for which we know the safe control law $u_{\mathcal{S}_f}$ for all future times by Assumption III.3. Once a feasible solution of (5) is found again, the counter is reset to zero. Consequently, we have investigated all possible cases in Algorithm 1 and proved that it always provides a safe control input if $x(0) \in \mathcal{X}_{\mathrm{MPSC}} \cup \mathcal{S}_f$, showing the result. ∎
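The control flow of Algorithm 1 can be sketched for a hypothetical scalar system; a greedy search stands in for problem (5), and all constants are illustrative assumptions rather than the paper's design.

```python
U_MAX, X_MAX, S_F, N = 1.0, 5.0, 0.5, 5   # illustrative constants

def solve_mpsc(x, u_learn):
    """Stand-in for problem (5): plan u_learn first, then steer greedily
    to the terminal set |x| <= S_F; returns the input sequence or None."""
    seq, xs = [u_learn], x + u_learn
    if abs(xs) > X_MAX:
        return None
    for _ in range(N - 1):
        if abs(xs) <= S_F:
            return seq
        u = -min(U_MAX, abs(xs)) * (1 if xs > 0 else -1)
        seq.append(u)
        xs += u
    return seq if abs(xs) <= S_F else None

class SafetyController:
    """Mimics Algorithm 1: apply the learning input while (5) is feasible,
    otherwise fall back on the stored backup plan (then a terminal law)."""
    def __init__(self):
        self.backup = []              # remaining inputs of the last plan

    def __call__(self, x, u_learn):
        plan = solve_mpsc(x, u_learn)
        if plan is not None:          # feasible: certify the learning input
            self.backup = plan[1:]
            return u_learn
        if self.backup:               # infeasible: continue the backup plan
            return self.backup.pop(0)
        return 0.0                    # terminal safe-set controller (here: u = 0)

ctrl = SafetyController()
print(ctrl(3.0, 0.5))   # certified
print(ctrl(4.9, 1.0))   # infeasible successor -> backup input applied
```

The stored plan plays the role of the finite-time backup trajectory in the proof above: it is only consumed while the optimization remains infeasible, and is refreshed whenever a feasible solution is found again.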
III-B A recursively feasible MPSC scheme
By modifying Assumption III.3 and requiring the terminal safe set to be invariant for the nominal system, which is the standard assumption in tube-based MPC, we obtain recursive feasibility of (5) and can thus directly apply the time-invariant control law (7) to system (1) without the need for Algorithm 1. In other words, (7) directly becomes the safety controller according to Definition II.1.
Assumption III.6.
There exists a set $\bar{\mathcal{S}}_f \subseteq \bar{\mathcal{X}}$ and a corresponding control law $\bar{u}_f : \bar{\mathcal{S}}_f \to \bar{\mathcal{U}}$ such that for all $\bar{x} \in \bar{\mathcal{S}}_f$, $A\bar{x} + B\bar{u}_f(\bar{x}) \in \bar{\mathcal{S}}_f$.
Theorem III.7.
Let Assumptions III.1 and III.6 hold, with $\bar{\mathcal{S}}_f$ used as terminal set in (5d). Then $\mathcal{X}_{\mathrm{MPSC}}$ is a safe set according to Definition II.1 for system (1), with the safety control law (7).
Proof.
We begin with showing recursive feasibility under (7). Let (5) be feasible at time $k$. It follows that $x(k+1) - \bar{x}^*_{1|k} \in \mathcal{E}$ because of (5e) and (5f). From here, recursive feasibility follows as in standard tube-based MPC by induction, see e.g. [12]. Along the lines of the proof of Theorem III.5, recursive feasibility implies that $\mathcal{X}_{\mathrm{MPSC}}$ is a safe set. ∎
IV. Design of $\mathcal{E}$ and $\mathcal{S}_f$ from data
The proposed MPSC scheme is based on two main design components: the robust positively invariant set $\mathcal{E}$, which determines the tube, and the terminal safe set $\mathcal{S}_f$.
While $\mathcal{S}_f$ can be chosen more generally according to Assumption III.3, we note from Theorem III.7 that we can in principle also use the same design methods as proposed for linear tube-based MPC. The computation of the robust invariant set and the nominal terminal set has been widely studied in the literature, see e.g. [12, 14] and references therein.
This section presents a different option for the approximation of a tube and safe terminal set that is tailored to the learning context and aims at a minimal amount of tuning 'by hand'. We propose to infer a robust positively invariant set either directly from data or from a probabilistic model via scenario-based optimization. Secondly, starting from any terminal safe set, e.g. the trivial choice $\mathcal{S}_f = \{0\}$, we show how to enlarge this terminal set iteratively by utilizing feasible solutions of (5) over time.
IV-A Scenario-based calculation of $\mathcal{E}$ from data
Let $\{\delta^{(i)}\}_{i=1}^{N_s} \subset \Delta$ be a set of so-called 'scenarios', either sampled from a probabilistic belief about the system dynamics (1) or collected from measurements. We restrict ourselves to ellipsoidal robust positively invariant sets $\mathcal{E} = \{e \in \mathbb{R}^n : e^\top P e \leq 1\}$ with $P \in \mathbb{S}^n_{++}$, in order to enable scalability of the resulting design optimization problems to larger-scale systems. The corresponding robust scenario-based design problem for computation of the set $\mathcal{E}$ is given by
$\min_{P \in \mathbb{S}^n_{++}} \ -\log\det P$  (8a)
s.t. $\big(A^{(i)} + B^{(i)}K\big)\mathcal{E} \oplus \{w^{(i)}\} \subseteq \mathcal{E}, \quad i = 1, \dots, N_s$,  (8b)
where $\delta^{(i)} = (A^{(i)}, B^{(i)}, w^{(i)})$. Problem (8) defines a robust positively invariant set for the error system (4) if condition (8b) is enforced for all $\delta \in \Delta$, see e.g. [14]. The objective (8a) is chosen such that a possibly small RPI set $\mathcal{E}$ is obtained, which by definition of $\bar{\mathcal{X}}$ and $\bar{\mathcal{U}}$ increases the size of the feasible region of (5), and therefore the size of the safe set. A stabilizing linear state feedback matrix $K$ according to Assumption III.1 needs to be chosen beforehand, e.g. using LQR or robust controller design methods.
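In a scalar setting the per-scenario invariance condition reduces to a simple inequality, which allows a toy version of the scenario-based design to be sketched; the sampling distributions and candidate radii below are illustrative assumptions.

```python
import random
random.seed(0)

# Illustrative scalar error dynamics e+ = phi*e + w with sampled closed-loop
# gains phi (model belief) and disturbances w; candidate RPI sets are
# intervals E = [-r, r]. For a scenario (phi, w), invariance of E holds
# iff |phi|*r + |w| <= r, checked at the boundary points e = +/- r.
scenarios = [(random.uniform(0.5, 0.7), random.uniform(-0.1, 0.1))
             for _ in range(500)]

def is_invariant(r, scenarios):
    return all(abs(phi) * r + abs(w) <= r for phi, w in scenarios)

# Mimic the objective (smallest invariant set) by scanning candidate radii.
candidates = [0.05 * i for i in range(1, 41)]
r_star = min(r for r in candidates if is_invariant(r, scenarios))
print(f"smallest scenario-invariant radius: {r_star:.2f}")
```

For these distributions the worst case over all scenarios requires r ≥ |w|/(1 − φ) ≈ 1/3, so the scan settles near that value; Proposition IV.1 below quantifies how well such a sampled design generalizes to unseen scenarios.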
Proposition IV.1.
Consider system (4) and let $\varepsilon, \beta \in (0, 1)$. If (8) attains a solution and the number of scenarios $N_s$ is chosen sufficiently large with respect to $\varepsilon$ and $\beta$ (see [15, 16]), then with probability at least $1 - \beta$, the solution is $\varepsilon$-level robustly feasible for the corresponding robust problem imposing (8b) for all $\delta \in \Delta$, i.e., the probability that there exists a $\delta \in \Delta$ for which (8b) is violated is less than or equal to $\varepsilon$.

IV-B Iterative enlargement of the terminal safe set
In this section we show how to enlarge the terminal safe set $\mathcal{S}_f$ based on previously calculated solutions of (5), which is conceptually similar to the data-based terminal set proposed in [17]. Note that a larger terminal set according to Assumption III.3 or Assumption III.6 typically also leads to a larger feasible set $\mathcal{X}_{\mathrm{MPSC}}$, and therefore to a larger overall safe set according to Theorems III.5 and III.7.
The main idea is to define a safe set based on successfully solved instances of (5) for measured system states:
$\mathcal{S}^{\mathcal{I}(k)} = \operatorname{conv}\big(\mathcal{S}_f \cup \{x(j) : j \in \mathcal{I}(k)\}\big),$  (9)
where $\mathcal{I}(k)$ is an index set representing the time instances for which the system state was feasible in terms of (5) during application of Algorithm 1 up to time $k$.
Theorem IV.2.
Let Assumptions III.1 and III.3 hold. Then the set $\mathcal{S}^{\mathcal{I}(k)}$ defined in (9) is a safe set according to Definition II.1 and can be used as terminal set $\mathcal{S}_f$ in (5d).
Proof.
If $\mathcal{S}_f$ is convex, convexity of the new terminal set (9) is ensured and we can iteratively enlarge the initial terminal set $\mathcal{S}_f$. ∎
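In two dimensions the convex-hull update in (9) can be sketched with elementary geometry; the initial box and the 'certified' states below are illustrative assumptions.

```python
def convex_hull(pts):
    """Andrew's monotone chain; returns hull vertices in CCW order."""
    def cross(o, a, b):
        return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])
    pts = sorted(set(pts))
    if len(pts) <= 2:
        return pts
    def half(points):
        h = []
        for p in points:
            while len(h) >= 2 and cross(h[-2], h[-1], p) <= 0:
                h.pop()
            h.append(p)
        return h
    lower, upper = half(pts), half(list(reversed(pts)))
    return lower[:-1] + upper[:-1]

def contains(hull, p):
    """Point-in-convex-polygon: p must lie left of (or on) every CCW edge."""
    n = len(hull)
    for i in range(n):
        o, a = hull[i], hull[(i+1) % n]
        if (a[0]-o[0])*(p[1]-o[1]) - (a[1]-o[1])*(p[0]-o[0]) < 0:
            return False
    return True

# Initial terminal set: corners of a small box; certified states enlarge it.
S_f = [(-0.5, -0.5), (-0.5, 0.5), (0.5, -0.5), (0.5, 0.5)]
certified = [(2.0, 1.0), (-1.5, 2.0), (0.0, -2.5)]
hull = convex_hull(S_f + certified)
print(contains(hull, (1.0, 0.5)))   # inside the enlarged set
print(contains(hull, (3.0, 3.0)))   # outside
```

Each newly certified state can only grow the hull, which mirrors the monotone enlargement of the terminal safe set over time.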
Remark IV.3.
In order to provide a similar result with respect to Theorem III.7, consider the nominal set $\bar{\mathcal{S}}^{\mathcal{I}(k)} = \operatorname{conv}\big(\bar{\mathcal{S}}_f \cup \{\bar{x}^*_{0|j} : j \in \mathcal{I}(k)\}\big)$ with the index set $\mathcal{I}(k)$ as defined above.
Corollary IV.4.
Let Assumptions III.1 and III.6 hold. Then the enlarged nominal terminal set from Remark IV.3 again satisfies Assumption III.6, and Theorem III.7 continues to hold with this terminal set in (5d).
Proof.
Follows similarly to the proof of Theorem IV.2. ∎
Using Theorem III.7, we obtain a practical procedure similar to Remark IV.3 by initializing the nominal terminal set, e.g. with $\bar{\mathcal{S}}_f = \{0\}$, and choosing its iterative enlargement as the terminal safe set.
Remark IV.5.
Theorem IV.2 also provides an explicit approximation of the safe set, given by (9), which is generally only implicitly defined. Such a representation can be used to 'inform' the learning-based controller about the safety boundary, e.g. in the form of a feature using a barrier function, in order to avoid chattering behavior, as proposed in [3].
V. Application to numerical examples
We consider the problem of safely acquiring information about the partially unknown system dynamics of a discretized mass-spring-damper system. Assume that an approximate model with mass, spring, and damper parameters is available, where the parameters have a 20% error with respect to the true parameters. We use the results from Section IV in order to calculate $\mathcal{E}$ without first deriving a suitable representation (1), i.e. a suitable $\mathcal{W}$. Using the approximate model and LQR design, we choose the feedback matrix $K$. Based on uniformly sampled measurements from the real (but unknown) system, we generate the robust scenario design problem (8). Solving (8) yields a set $\mathcal{E}$ that fulfills (8b) for all possible $\delta \in \Delta$ with high probability according to Proposition IV.1. For the MPSC scheme, we use a finite planning horizon $N$ and the terminal safe set as described in Remark IV.3.
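A discretization of this kind can be sketched as follows; the parameter values, step size, and feedback gain are placeholders chosen for illustration and are not the ones used in the experiment.

```python
# Hypothetical mass-spring-damper m*x'' = -k_s*x - c*x' + u,
# Euler-discretized with step dt; state x = (position, velocity).
m, k_s, c, dt = 1.0, 1.0, 0.5, 0.1      # placeholder parameters

A = [[1.0, dt],
     [-dt * k_s / m, 1.0 - dt * c / m]]
B = [0.0, dt / m]

def step(x, u):
    """One step of x+ = A x + B u (2x2, hand-rolled)."""
    return [A[0][0]*x[0] + A[0][1]*x[1] + B[0]*u,
            A[1][0]*x[0] + A[1][1]*x[1] + B[1]*u]

# Simulate under a simple stabilizing feedback u = -K x (K picked by hand
# here; the paper's example uses an LQR design on the approximate model).
K = [0.5, 1.0]
x = [1.0, 0.0]
for _ in range(100):
    x = step(x, -(K[0]*x[0] + K[1]*x[1]))
print([round(v, 4) for v in x])  # state has decayed toward the origin
```

The 20% parameter error enters by building A and B from perturbed values of m, k_s, and c while simulating measurements with the true ones; the residual between the two then plays the role of the disturbance covered by the scenario design.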
As learning signal, we use an excitation input with the goal of generating informative measurements according to [19].
A closed-loop simulation with the chosen initial condition under application of Algorithm 1 is illustrated in Figure 2 together with the corresponding safe set. As desired, the safety controller modifies the proposed input signal only when the system state approaches a neighborhood of the safe set boundary where the next state would leave the safe set (indicated in red). The pure learning-based trajectory (dotted line in Figure 2), in contrast, would have violated the state constraints already within the first time steps.
Using a similar configuration with the same planning horizon, we now iteratively enlarge the safe set based on previously calculated nominal state trajectories at each time step by following Corollary IV.4. Samples of the nominal and overall terminal set at different time steps are shown in Figure 3. After a number of time steps, a significant portion of the state space is already covered by the safe terminal set.
VI. Conclusion
The paper has addressed the problem of safe learning-based control by means of a model predictive safety certification scheme. The proposed scheme allows for enhancing any potentially unsafe learning-based control strategy with safety guarantees and can be combined with any known safe set. By relying on robust MPC methods, the presented concept is amenable to application to large-scale systems with an offline computational complexity similar to e.g. ellipsoidal safe set approximations. Using a parametrization-free, scenario-based design procedure, it was illustrated how the design steps can be performed based on available data and how the conservatism of the MPSC scheme can be reduced over time by making use of generated closed-loop data.
References
 [1] J. Merel, Y. Tassa, S. Srinivasan, J. Lemmon, Z. Wang, G. Wayne, and N. Heess, “Learning human behaviors from motion capture by adversarial imitation,” arXiv preprint arXiv:1707.02201, 2017.
 [2] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
 [3] A. K. Akametalu, J. F. Fisac, J. H. Gillula, S. Kaynama, M. N. Zeilinger, and C. J. Tomlin, “Reachability-based safe learning with Gaussian processes,” in 53rd IEEE Conference on Decision and Control, Dec 2014, pp. 1424–1431.
 [4] K. P. Wabersich and M. N. Zeilinger, “Scalable synthesis of safety certificates from data with application to learningbased control,” arXiv preprint arXiv:1711.11417, 2017.
 [5] A. Domahidi, A. U. Zgraggen, M. N. Zeilinger, M. Morari, and C. N. Jones, “Efficient interior point methods for multistage problems arising in receding horizon control,” in 51st IEEE Conference on Decision and Control (CDC), Dec 2012, pp. 668–674.

 [6] J. García and F. Fernández, “A comprehensive survey on safe reinforcement learning,” Journal of Machine Learning Research, vol. 16, pp. 1437–1480, 2015.
 [7] J. F. Fisac, A. K. Akametalu, M. N. Zeilinger, S. Kaynama, J. Gillula, and C. J. Tomlin, “A general safety framework for learning-based control in uncertain robotic systems,” arXiv preprint arXiv:1705.01292, 2017.
 [8] R. B. Larsen, A. Carron, and M. N. Zeilinger, “Safe learning for distributed systems with bounded uncertainties,” 20th IFAC World Congress, vol. 50, no. 1, pp. 2536–2542, 2017.
 [9] A. Aswani, H. Gonzalez, S. S. Sastry, and C. Tomlin, “Provably safe and robust learning-based model predictive control,” Automatica, vol. 49, no. 5, pp. 1216–1226, 2013.
 [10] T. Koller, F. Berkenkamp, M. Turchetta, and A. Krause, “Learning-based model predictive control for safe exploration and reinforcement learning,” arXiv preprint arXiv:1803.08287, 2018.
 [11] H. Mania, A. Guy, and B. Recht, “Simple random search provides a competitive approach to reinforcement learning,” arXiv preprint arXiv:1803.07055, 2018.
 [12] J. B. Rawlings and D. Q. Mayne, Model predictive control: Theory and design. Nob Hill Pub., 2009.
 [13] S. V. Rakovic, E. C. Kerrigan, K. I. Kouramas, and D. Q. Mayne, “Invariant approximations of the minimal robust positively invariant set,” IEEE Transactions on Automatic Control, vol. 50, no. 3, pp. 406–410, 2005.
 [14] F. Blanchini, “Set invariance in control,” Automatica, vol. 35, no. 11, pp. 1747–1767, 1999.
 [15] M. C. Campi and S. Garatti, “The exact feasibility of randomized solutions of uncertain convex programs,” SIAM Journal on Optimization, vol. 19, no. 3, pp. 1211–1230, 2008.
 [16] G. C. Calafiore and M. C. Campi, “The scenario approach to robust control design,” IEEE Transactions on Automatic Control, vol. 51, no. 5, pp. 742–753, 2006.
 [17] U. Rosolia and F. Borrelli, “Learning model predictive control for iterative tasks: a computationally efficient approach for linear system,” IFAC-PapersOnLine, vol. 50, no. 1, pp. 3142–3147, 2017.
 [18] S. Boyd and L. Vandenberghe, Convex optimization. Cambridge university press, 2004.
 [19] L. Ljung, “System identification,” in Signal analysis and prediction. Springer, 1998, pp. 163–173.