Introduction
Reinforcement learning (RL) focuses on finding an agent’s policy (i.e. controller) that maximizes a long-term reward. It does this by repeatedly observing the agent’s state, taking an action (according to the current policy), and receiving a reward. Over time, the agent modifies its policy to maximize its long-term reward. This method has been successfully applied to continuous control tasks [Duan et al.2016, Lillicrap et al.2015], where controllers have learned to stabilize complex robots (after many policy iterations).
However, since RL focuses on maximizing the long-term reward, it is likely to explore unsafe behaviors during the learning process. This is problematic for any RL algorithm that will be deployed on hardware, as unsafe learning policies could damage the hardware or harm a human. As a result, most success in the use of RL for control of physical systems has been limited to simulations, where many failed iterations can occur before success.
Safe RL tries to learn a policy that maximizes the expected return, while also ensuring (or encouraging) the satisfaction of some safety constraints [García and Fernández2015]. Previous approaches to safe reinforcement learning include reward-shaping, policy optimization with constraints [Gaskett2003, Moldovan and Abbeel2012, Achiam et al.2017, Wachi et al.2018], or teacher advice [Abbeel and Ng2004, Abbeel, Coates, and Ng2010, Tang et al.2010]. However, these model-free approaches do not guarantee safety during learning; safety is only approximately guaranteed after a sufficient learning period. The fundamental issue is that without a model, safety must be learned through environmental interactions, which means it may be violated during initial learning interactions.
Model-based approaches have utilized Lyapunov-based methods or model predictive control to guarantee safety under system dynamics during learning [Wang, Theodorou, and Egerstedt2017, Berkenkamp et al.2017, Chow et al.2018, Ohnishi et al.2018, Koller et al.2018], but they do not address the issue of exploration and performance optimization. Other works guarantee safety by switching between backup controllers [Perkins and Barto2003, Mannucci et al.2018], though this overly constrains policy exploration.
We draw inspiration from recent work that has incorporated model information into model-free RL algorithms to ensure safety during exploration [Fisac et al.2018, Li, Kalabic, and Chu2018, Gillula and Tomlin2012]. However, these approaches utilize backup safety controllers that do not guide the learning process (limiting exploration efficiency).
This paper develops a framework for integrating existing model-free RL algorithms with control barrier functions (CBFs) to guarantee safety and improve exploration efficiency in RL, even with uncertain model information. The CBFs require a (potentially poor) nominal dynamics model, but can ensure online safety of nonlinear systems during the entire learning process and help the RL algorithm efficiently search the policy space. This methodology effectively constrains the policy exploration process to a set of safe policies defined by the CBF. An online process learns the governing dynamical system over time, which allows the CBF controller to adapt and become less conservative. This general framework allows us to utilize any model-free RL algorithm to learn a controller, with the CBF controller guiding policy exploration and ensuring safety.
Using this framework, we develop an efficient algorithm for controller synthesis, RL-CBF, with guarantees on safety (remaining within a safe set) and performance (reward maximization). To test this approach, we integrated two model-free RL algorithms, trust region policy optimization (TRPO) [Schulman et al.2015] and deep deterministic policy gradients (DDPG) [Lillicrap et al.2015], with the CBF controllers and dynamical model learning. We tested the algorithms on two nonlinear control problems: (1) balancing of an inverted pendulum, and (2) autonomous car following with wireless vehicle-to-vehicle communication. For both tasks, our algorithm efficiently learned a high-performance controller while maintaining safety throughout the learning process. Furthermore, it learned faster than comparable RL algorithms due to inclusion of a model learning process, which constrains the space of explorable policies and guides the exploration process.
Our main contributions are: (1) we develop the first algorithm that integrates CBF-based controllers with model-free RL to achieve end-to-end safe RL for nonlinear control systems, and (2) we show improved learning efficiency by guiding the policy exploration with barrier functions.
Preliminaries
Consider an infinite-horizon discounted Markov decision process (MDP) with control-affine, deterministic dynamics (a good assumption when dealing with robotic systems), defined by the tuple $(S, A, f, g, d, r, \rho_0, \gamma)$, where $S$ is a set of states, $A$ is a set of actions, $f: S \to S$ is the nominal unactuated dynamics, $g: S \to \mathbb{R}^{n \times m}$ is the nominal actuated dynamics, and $d: S \to S$ is the unknown system dynamics. The time evolution of the system is given by
$$s_{t+1} = f(s_t) + g(s_t)\,a_t + d(s_t), \tag{1}$$
where $s_t \in S$, $a_t \in A$, $f$ and $g$ compose a known nominal model of the dynamics, and $d$ represents the unknown model. In practice, the nominal model may be quite bad (e.g. a robot model that ignores friction and compliance), and we must learn a much better dynamic model through data.
Furthermore, $r: S \times A \to \mathbb{R}$ is the reward function, $\rho_0$ is the distribution of the initial state $s_0$, and $\gamma \in (0, 1)$ is the discount factor.
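As an illustration of the control-affine structure, the following minimal sketch steps a system of the form $s_{t+1} = f(s_t) + g(s_t) a_t + d(s_t)$. The double-integrator model, the time step `DT`, and the friction term standing in for the unknown dynamics are hypothetical choices made for this example only; they are not taken from the experiments in this paper.

```python
from typing import Callable, List

Vector = List[float]
Matrix = List[List[float]]


def step(s: Vector, a: Vector,
         f: Callable[[Vector], Vector],
         g: Callable[[Vector], Matrix],
         d: Callable[[Vector], Vector]) -> Vector:
    """One step of the control-affine dynamics s' = f(s) + g(s) a + d(s)."""
    fs, gs, ds = f(s), g(s), d(s)
    return [fs[i] + sum(gs[i][j] * a[j] for j in range(len(a))) + ds[i]
            for i in range(len(s))]


DT = 0.1  # hypothetical time step


def f_nominal(s: Vector) -> Vector:
    # Nominal unactuated dynamics of a double integrator: state = [position, velocity].
    return [s[0] + DT * s[1], s[1]]


def g_nominal(s: Vector) -> Matrix:
    # Nominal actuated dynamics: the single action accelerates the velocity.
    return [[0.0], [DT]]


def d_unknown(s: Vector) -> Vector:
    # Hypothetical unmodeled friction that the nominal model ignores.
    return [0.0, -0.05 * s[1]]


s_next = step([0.0, 1.0], [2.0], f_nominal, g_nominal, d_unknown)
```

In a learning setting only `f_nominal` and `g_nominal` would be available to the controller; the residual `d_unknown` is what has to be estimated from data.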
Reinforcement Learning
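Let $\pi(a|s)$ denote a stochastic control policy that maps states to distributions over actions, and let $J(\pi)$ denote the policy’s expected discounted reward:
$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\Big]. \tag{2}$$
Here $\tau \sim \pi$ is a trajectory where the actions are sampled from policy $\pi(a|s)$. We use the standard definitions for the value function $V_\pi$, action-value function $Q_\pi$, and advantage function $A_\pi$, below:
$$\begin{aligned}
Q_\pi(s_t, a_t) &= \mathbb{E}_{s_{t+1}, a_{t+1}, \dots}\Big[\sum_{l=0}^{\infty} \gamma^{l}\, r(s_{t+l}, a_{t+l})\Big], \\
V_\pi(s_t) &= \mathbb{E}_{a_t, s_{t+1}, a_{t+1}, \dots}\Big[\sum_{l=0}^{\infty} \gamma^{l}\, r(s_{t+l}, a_{t+l})\Big], \\
A_\pi(s_t, a_t) &= Q_\pi(s_t, a_t) - V_\pi(s_t),
\end{aligned} \tag{3}$$
where actions $a_i$ are drawn from distribution $a_i \sim \pi(a|s_i)$.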
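The discounted sums in these definitions can be estimated from a single sampled trajectory. The sketch below is a minimal illustration, assuming a finite trajectory (so the infinite sum is truncated) and a hypothetical list `values` of baseline estimates of the value function at each visited state; the Monte Carlo return stands in for the action-value function.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """Return G_t = sum_l gamma^l * r_{t+l} for every step of a finite trajectory."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def advantage_estimates(rewards: List[float], values: List[float],
                        gamma: float) -> List[float]:
    """Monte Carlo advantage estimate: sampled return minus a value baseline."""
    returns = discounted_returns(rewards, gamma)
    return [g - v for g, v in zip(returns, values)]


returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
advantages = advantage_estimates([1.0, 1.0, 1.0], [1.0, 1.0, 1.0], gamma=0.5)
```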
Most policy optimization RL algorithms attempt to maximize long-term reward using (a) policy iteration methods [Bertsekas2005], (b) derivative-free optimization methods that optimize the return as a function of policy parameters [Fu, Glover, and April2005], or (c) policy gradient methods [Peters and Schaal2008, Silver et al.2014]. Any of these methods can be rendered end-to-end safe using the RL-CBF control framework proposed in this work. However, we will focus mainly on policy gradient methods, due to their good performance on continuous control problems.
Policy Gradient-Based RL
Policy gradient methods estimate the gradient of the expected return $J(\pi)$ with respect to the policy based on sampled trajectories. They then optimize the policy using gradient ascent, allowing modification of the control law at episodic intervals. The DDPG and TRPO algorithms are examples of policy gradient methods, which we will use as benchmarks.
DDPG is an off-policy actor-critic method that computes the policy gradient based on sampled trajectories and an estimate of the action-value function. It alternately updates the action-value function and the policy as it samples more and more trajectories.
TRPO is an on-policy policy gradient method that maximizes a surrogate loss function, which serves as an approximate lower bound on the true loss function. It also ensures that the next policy distribution is within a “trust region”. More precisely, it approximates the optimal policy update by iteratively solving the optimization problem:
$$\pi_{i+1} = \arg\max_{\pi} \sum_{s} \rho_{\pi_i}(s) \sum_{a} \pi(a|s)\, A_{\pi_i}(s, a) \tag{4}$$
such that the Kullback-Leibler divergence $D_{KL}(\pi, \pi_i) \le \delta_\pi$. Here $\rho_{\pi_i}(s)$ represents the discounted visitation frequency of state $s$ under policy $\pi_i$, and $\delta_\pi$ is a constant defining the “trust region”.
Though both DDPG and TRPO have learned good controllers on several benchmark problems, there is no guarantee of safety in these algorithms, nor in any other model-free RL algorithm. Therefore, our objective is to complement model-free RL controllers with model-based CBF controllers (using a potentially poor nominal model), which can both improve search efficiency and ensure safety.
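To make the trust-region constraint concrete, the sketch below checks a Kullback-Leibler bound for a discrete action distribution at a single state and backtracks toward the old policy until the bound holds. This is a simplified illustration, not the TRPO update itself: the interpolation in probability space, the direction of the divergence (old relative to new), and the names `delta`, `shrink` and `max_steps` are assumptions made for this example.

```python
import math
from typing import List


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) for two discrete distributions over the same actions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def backtrack_to_trust_region(old: List[float], proposed: List[float],
                              delta: float, shrink: float = 0.5,
                              max_steps: int = 20) -> List[float]:
    """Shrink the step from `old` toward `proposed` until KL(old || new) <= delta."""
    step = 1.0
    for _ in range(max_steps):
        new = [(1.0 - step) * o + step * p for o, p in zip(old, proposed)]
        if kl_divergence(old, new) <= delta:
            return new
        step *= shrink
    return list(old)  # no acceptable step found: keep the old policy


old_policy = [0.5, 0.5]
new_policy = backtrack_to_trust_region(old_policy, [0.9, 0.1], delta=0.01)
```

A smaller `delta` keeps successive policies closer together, which is the mechanism the trust region uses to avoid destructive updates.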
Gaussian Processes
We use Gaussian process (GP) models to estimate the unknown system dynamics, $d(s)$, from data. A Gaussian process is a nonparametric regression method for estimating functions and their uncertain distribution from data [Rasmussen and Williams2006]. It describes the evolving model of the uncertain dynamics, $d(s)$, by a mean estimate, $\mu_d(s)$, and the uncertainty, $\sigma_d^2(s)$, which allows for high-probability confidence intervals on the function:
$$\mu_d(s) - k_\delta\, \sigma_d(s) \;\le\; d(s) \;\le\; \mu_d(s) + k_\delta\, \sigma_d(s), \tag{5}$$
with probability $(1-\delta)$, where $k_\delta$ is a design parameter that determines $\delta$ (e.g. approximately 95% confidence is achieved at $k_\delta = 2$). Therefore, by learning $\mu_d(s)$ and $\sigma_d(s)$ in tandem with the controller, we obtain high-probability confidence intervals on the unknown dynamics, which adapt/shrink as we obtain more information (i.e. measurements) on the system.
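A GP model is parameterized by a kernel function $k(s, s')$, which defines the similarity between any two states $s, s' \in S$. In order to make inferences on the unknown function $d(s)$, we need measurements, $\hat{d}(s)$, which are computed from measurements of ($s_t, a_t, s_{t+1}$) using the relation from Equation (1): $\hat{d}(s_t) = s_{t+1} - f(s_t) - g(s_t)\,a_t$. Since any finite number of data points form a multivariate normal distribution, we can obtain the posterior distribution of $d(s_*)$ at any query state $s_* \in S$ by conditioning on the past measurements. Given $n$ measurements $\hat{d}_n = [\hat{d}(s_1), \dots, \hat{d}(s_n)]$ subject to independent Gaussian noise $\nu_{noise} \sim \mathcal{N}(0, \sigma_{noise}^2)$, the mean $\mu_d(s_*)$ and variance $\sigma_d^2(s_*)$ at the query state, $s_*$, are calculated to be
$$\begin{aligned}
\mu_d(s_*) &= k_*^T \big(K + \sigma_{noise}^2 I\big)^{-1} \hat{d}_n, \\
\sigma_d^2(s_*) &= k(s_*, s_*) - k_*^T \big(K + \sigma_{noise}^2 I\big)^{-1} k_*,
\end{aligned} \tag{6}$$
where $[K]_{(i,j)} = k(s_i, s_j)$ is the kernel matrix, and $k_* = [k(s_1, s_*), \dots, k(s_n, s_*)]$. As we collect more data, $\mu_d(s)$ becomes a better estimate of $d(s)$, and the uncertainty, $\sigma_d^2(s)$, of the dynamics decreases.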
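The posterior mean and variance described above can be computed directly for small data sets. The following dependency-free sketch assumes scalar states, a squared-exponential kernel with hypothetical hyperparameters, and a plain Gaussian-elimination solver in place of an explicit matrix inverse; none of these choices are prescribed by the paper.

```python
import math
from typing import List, Tuple


def rbf(x: float, y: float, length_scale: float = 1.0) -> float:
    """Squared-exponential kernel k(x, y) with unit signal variance."""
    return math.exp(-0.5 * ((x - y) / length_scale) ** 2)


def solve(A: List[List[float]], b: List[float]) -> List[float]:
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def gp_posterior(xs: List[float], ys: List[float], x_star: float,
                 noise_var: float = 1e-4) -> Tuple[float, float]:
    """Posterior mean and variance of a zero-mean GP at the query point x_star."""
    n = len(xs)
    K = [[rbf(xs[i], xs[j]) + (noise_var if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    k_star = [rbf(x, x_star) for x in xs]
    alpha = solve(K, ys)       # (K + noise_var * I)^{-1} y
    beta = solve(K, k_star)    # (K + noise_var * I)^{-1} k_*
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = rbf(x_star, x_star) - sum(k * b for k, b in zip(k_star, beta))
    return mean, max(var, 0.0)


def confidence_interval(mean: float, var: float, k_delta: float = 2.0) -> Tuple[float, float]:
    """Interval mean +/- k_delta * std, as used for the bound on the unknown dynamics."""
    std = math.sqrt(var)
    return mean - k_delta * std, mean + k_delta * std


mean_near, var_near = gp_posterior([0.0, 1.0, 2.0], [0.0, 0.5, 1.0], x_star=1.0)
mean_far, var_far = gp_posterior([0.0, 1.0, 2.0], [0.0, 0.5, 1.0], x_star=10.0)
```

Near the training data the interval is tight, while far from it the variance returns to the prior value, which is what makes a controller built on these bounds conservative in unexplored regions.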
We note that in applications with large amounts of data, training the GP becomes problematic since computing the matrix inverse in Equation (6) scales poorly ($O(N^3)$ in the number of data points $N$). There are several methods to alleviate this issue, such as using sparse inducing inputs or local GPs [Snelson and Ghahramani2007, NguyenTuong, Seeger, and Peters2009]. In fact, our framework can use any model approximation method that provides quantifiable uncertainty bounds (e.g. neural networks with dropout). However, we bypass this issue in this work by batch training the GP model with only the latest batch of data points.
Control Barrier Functions
Consider an arbitrary safe set, $C$, defined by the super-level set of a continuously differentiable function $h: \mathbb{R}^n \to \mathbb{R}$,
$$C = \{ s \in \mathbb{R}^n : h(s) \ge 0 \}. \tag{7}$$
To maintain safety during the learning process, the system state must always remain within the safe set $C$ (i.e. the set $C$ is forward invariant). Examples include keeping a manipulator within a given workspace, or ensuring that a quadcopter avoids obstacles. Essentially, the learning algorithm should learn/explore only in set $C$.
Control barrier functions utilize a Lyapunov-like argument to provide a sufficient condition for ensuring forward invariance of the safe set $C$ under controlled dynamics. Therefore, barrier functions are a natural tool to enforce safety throughout the learning process, and can be used to synthesize safe controllers for our systems.
Definition 1.
Given a set $C$ defined by (7), the continuously differentiable function $h$ is a discrete-time control barrier function (CBF) for dynamical system (1) if there exists $\eta \in [0, 1]$ such that for all $s_t \in C$,
$$\sup_{a_t \in A} \Big[ h\big(f(s_t) + g(s_t)\,a_t + d(s_t)\big) + (\eta - 1)\, h(s_t) \Big] \ge 0, \tag{8}$$
where $\eta$ represents how strongly the barrier function “pushes” the state inwards within the safe set (if $\eta = 1$, the barrier condition simplifies to the Lyapunov condition).
The existence of a CBF implies that there exists a deterministic controller $u^{CBF}: S \to A$ such that the set $C$ is forward invariant for system (1) [Agrawal and Sreenath2017, Ames et al.2017]. In other words, if condition (8) is satisfied for all $s \in C$, then the set $C$ is rendered forward invariant. Our goal is to find a controller, $u^{CBF}$, that satisfies condition (8), so that safety is certified.
For this paper, we restrict our attention to affine barrier functions of the form $h(s) = p^T s + q$, ($p \in \mathbb{R}^n$, $q \in \mathbb{R}$), though our methodology could support more general barrier functions. This restriction means the set $C$ is composed of intersecting half spaces (i.e. polytopes).
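The sketch below illustrates an affine barrier function $h(s) = p^T s + q$, a polytopic safe set built from several such functions, and a check of the discrete-time barrier condition written as $h(s_{t+1}) \ge (1 - \eta) h(s_t)$. The two half spaces bounding the first state coordinate to $[-1, 1]$ and the value of `eta` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Vector = List[float]
HalfSpace = Tuple[Vector, float]  # (p, q) such that h(s) = p^T s + q


def h(s: Vector, p: Vector, q: float) -> float:
    """Affine barrier function h(s) = p^T s + q."""
    return sum(pi * si for pi, si in zip(p, s)) + q


def in_safe_set(s: Vector, constraints: List[HalfSpace]) -> bool:
    """The safe set is the intersection of the half spaces h_i(s) >= 0."""
    return all(h(s, p, q) >= 0.0 for p, q in constraints)


def barrier_condition_holds(s: Vector, s_next: Vector, p: Vector, q: float,
                            eta: float) -> bool:
    """Discrete-time barrier condition h(s_next) >= (1 - eta) * h(s)."""
    return h(s_next, p, q) >= (1.0 - eta) * h(s, p, q)


# Hypothetical safe set: first coordinate must stay within [-1, 1].
upper = ([-1.0, 0.0], 1.0)   # 1 - s[0] >= 0
lower = ([1.0, 0.0], 1.0)    # 1 + s[0] >= 0
safe_set = [upper, lower]
```

With `eta = 0.5`, a step from `[0.5, 0.0]` to `[0.7, 0.0]` satisfies the condition for the upper bound, while a step to `[0.8, 0.0]` approaches the boundary too quickly and does not, even though both successor states are still inside the safe set.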
Before we can formulate a tractable optimization problem that satisfies condition (8), we must have an estimate for $d(s)$. We use an updating GP model to estimate the mean and variance of the function, $\mu_d(s)$ and $\sigma_d^2(s)$, from measurement data. From equation (5), we know that $|\mu_d(s) - d(s)| \le k_\delta\, \sigma_d(s)$ with probability $(1-\delta)$. Therefore, we can reformulate the CBF condition (8) into the following quadratic program (QP) that can be efficiently solved at each time step:
$$\begin{aligned}
(a_t, \epsilon) = \arg\min_{a_t,\, \epsilon}\;\; & \|a_t\|_2 + K_\epsilon\, \epsilon \\
\text{s.t.}\;\; & p^T f(s_t) + p^T g(s_t)\, a_t + p^T \mu_d(s_t) - k_\delta\, |p|^T \sigma_d(s_t) + q \;\ge\; (1-\eta)\, h(s_t) - \epsilon, \\
& a^i_{low} \le a^i_t \le a^i_{high} \quad \text{for } i = 1, \dots, M,
\end{aligned} \tag{9}$$
where $\epsilon$ is a slack variable in the safety condition, $K_\epsilon$ is a large constant that penalizes safety violations, and $|p|$ denotes the element-wise absolute value of the vector $p$. The optimization is not sensitive to the parameter $K_\epsilon$ as long as it is very large, such that safety constraint violations are heavily penalized. The last constraint on $a_t$ encodes actuator constraints. The solution to this optimization problem (9) enforces the safety condition (8) as best as possible with minimum control effort, even with uncertain dynamics. Accounting for the dynamics uncertainty through GP models allows us to certify system safety, even with a poor nominal model.
Let us define the set $C_\epsilon = \{ s \in \mathbb{R}^n : h(s) \ge -\epsilon / \eta \}$. Then we can prove the following lemma.
Lemma 1.
Proof.
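The first part of the lemma follows directly from Definition 1 and the probabilistic bounds on the uncertainty obtained from GPs shown in equation (5).
For the second part, the property of GPs in equation (5) implies that with probability $(1-\delta)$, the following inequality is satisfied under the system dynamics (1):
$$h(s_{t+1}) = p^T\big(f(s_t) + g(s_t)\,a_t + d(s_t)\big) + q \;\ge\; p^T f(s_t) + p^T g(s_t)\, a_t + p^T \mu_d(s_t) - k_\delta\, |p|^T \sigma_d(s_t) + q. \tag{10}$$
Therefore, the constraint in problem (9) ensures that:
$$h(s_{t+1}) \ge (1-\eta)\, h(s_t) - \epsilon. \tag{11}$$
Define $h_\epsilon(s) = h(s) + \epsilon / \eta$, so that (11) simplifies to
$$h_\epsilon(s_{t+1}) \ge (1-\eta)\, h_\epsilon(s_t). \tag{12}$$
Hence $h_\epsilon$ satisfies the barrier condition, and its super-level set $\{ s : h_\epsilon(s) \ge 0 \}$ is forward invariant with probability $(1-\delta)$.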
The CBF controllers that solve (9) provide deterministic control laws that naturally encode safety; they provide the minimal control intervention that maintains safety or provide graceful degradation (a small deviation from the safe set) when safety cannot be enforced (e.g. due to actuation constraints). Furthermore, even with dynamics uncertainty, we can make high-probability statements about system safety using GP models with CBFs.
Note that one can easily combine multiple CBF constraints in problem (9) to define polytopic safe regions.
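For a single scalar action, a QP of this kind can be solved in closed form, which makes its behavior easy to inspect. The sketch below is an illustration under stated assumptions rather than the solver used in this work: it assumes one affine barrier, a scalar action, a non-negative slack, and a penalty weight large enough (larger than $1/|p^T g|$) that slack is used only when the actuator limits make the constraint infeasible. The function names are hypothetical.

```python
from typing import List, Tuple


def dot(x: List[float], y: List[float]) -> float:
    return sum(xi * yi for xi, yi in zip(x, y))


def constraint_terms(p: List[float], q: float, f_s: List[float], g_s: List[float],
                     mu_d: List[float], sigma_d: List[float], h_s: float,
                     eta: float, k_delta: float) -> Tuple[float, float]:
    """Return (c, b) so that the safety constraint reads c * a + b >= -eps."""
    c = dot(p, g_s)
    # Worst case of p^T d(s) over the GP confidence interval.
    worst_case_d = dot(p, mu_d) - k_delta * dot([abs(pi) for pi in p], sigma_d)
    b = dot(p, f_s) + worst_case_d + q - (1.0 - eta) * h_s
    return c, b


def solve_scalar_cbf_qp(c: float, b: float, a_low: float,
                        a_high: float) -> Tuple[float, float]:
    """Minimise |a| + K * eps s.t. c*a + b >= -eps, eps >= 0, a_low <= a <= a_high.

    Exact when K * |c| > 1; returns (a, eps).
    """
    if c == 0.0:
        # The action cannot influence the barrier: apply no intervention.
        return min(max(0.0, a_low), a_high), max(0.0, -b)
    if c > 0.0:
        left, right = max(a_low, -b / c), a_high
    else:
        left, right = a_low, min(a_high, -b / c)
    if left <= right:
        # Feasible without slack: project a = 0 onto the feasible interval.
        return min(max(0.0, left), right), 0.0
    # Actuator limits prevent safety: push as far as possible, report violation.
    a = a_high if c > 0.0 else a_low
    return a, -(c * a + b)


# Hypothetical 1-D example: s' = s + 0.1 * a, barrier h(s) = 1 - s, s = 0.9.
c, b = constraint_terms(p=[-1.0], q=1.0, f_s=[0.9], g_s=[0.1], mu_d=[0.0],
                        sigma_d=[0.01], h_s=0.1, eta=0.5, k_delta=2.0)
action, slack = solve_scalar_cbf_qp(c, b, a_low=-5.0, a_high=5.0)
```

Several barrier constraints would each contribute one such inequality; with more than one action dimension or constraint, a general QP solver is needed instead of this closed form.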
CBF-Based Compensating Control with Reinforcement Learning
To illustrate our framework, we first propose the suboptimal controller in equation (13), which combines a model-free RL-based controller (parameterized by $\theta_k$) and a CBF-based controller in the architecture shown in Figure 1(a).
$$u_k(s) = u^{RL}_{\theta_k}(s) + u^{CBF}_k\big(s, u^{RL}_{\theta_k}\big) \tag{13}$$
The concept is akin to shielded RL [Alshiekh et al.2017, Fisac et al.2018], since the CBF controller compensates for the RL controller to ensure safety, but does not guide exploration of the RL algorithm. The next section will extend the CBF controller to improve RL policy exploration.
Note that since the RL policy is stochastic (see Preliminaries section on RL), the controller $u^{RL}_{\theta_k}(s)$ represents the realization (i.e. sampled control action) of the stochastic policy $\pi^{RL}_{\theta_k}(a|s)$ after policy iteration $k$.
The model-free RL controller, $u^{RL}_{\theta_k}(s)$, proposes a control action that attempts to optimize long-term reward, but may be unsafe. Before deploying the RL controller, a CBF controller filters the proposed control action and provides the minimum control intervention needed to ensure that the overall controller, $u_k(s)$, keeps the system state within the safe set. Essentially, the CBF controller, $u^{CBF}_k$, “projects” the RL controller into the set of safe policies. In the case of an autonomous car, this action may enforce a safe distance between nearby cars, regardless of the action proposed by the RL controller.
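The CBF controller $u^{CBF}_k(s, u^{RL}_{\theta_k})$, which depends on the RL control, is defined by the following QP that can be efficiently solved at each time step:
$$\begin{aligned}
(a_t, \epsilon) = \arg\min_{a_t,\, \epsilon}\;\; & \|a_t\|_2 + K_\epsilon\, \epsilon \\
\text{s.t.}\;\; & p^T f(s_t) + p^T g(s_t)\big(u^{RL}_{\theta_k}(s_t) + a_t\big) + p^T \mu_d(s_t) - k_\delta\, |p|^T \sigma_d(s_t) + q \;\ge\; (1-\eta)\, h(s_t) - \epsilon, \\
& a^i_{low} \le a^i_t + u^{RL(i)}_{\theta_k}(s_t) \le a^i_{high} \quad \text{for } i = 1, \dots, M.
\end{aligned} \tag{14}$$
The last constraint in (14) incorporates possible actuator limits of the system.
We must make clear the important distinction between the indexes $t$ and $k$. Note that $t$ indexes timesteps within each policy iteration or trial, whereas $k$ indexes the policy iterations (which contain trajectories with several timesteps). The CBF controller updates throughout the task (computed at each time step, $t$), whereas the RL policy and GP model update at episodic policy iteration intervals indexed by $k$.
Let $\epsilon_{max}$ represent the largest violation of the barrier condition (i.e. potential deviation from the safe set) for any $s \in C$. Lemma 1 extends to the modified optimization problem (14), implying that $u_k(s) = u^{RL}_{\theta_k}(s) + u^{CBF}_k(s, u^{RL}_{\theta_k})$ satisfies the barrier certificate inequality (up to $\epsilon$) that guarantees forward invariance of $C$. Therefore, if there exists a solution to problem (14) such that $\epsilon = 0$, then controller (13) renders the safe set $C$ forward invariant with probability $(1-\delta)$. However, if $\epsilon_{max} > 0$, but $\epsilon \le \epsilon_{max}$ for all $s \in C_\epsilon$, then the controller will render the set $C_\epsilon$ forward invariant with probability $(1-\delta)$.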
Intuitively, the RL controller provides a “feedforward control”, and the CBF controller compensates with the minimum control necessary to render the safe set forward invariant. If such a control does not exist (e.g. due to torque constraints), then the CBF controller provides the control that keeps the state as close as possible to the safe set.
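The compensating architecture can be sketched on a one-dimensional system. In the example below, a constant and deliberately unsafe proposal stands in for the RL controller, the dynamics are $s_{t+1} = s_t + \Delta t \cdot u$ with no model uncertainty, and the safe set is $s \le 1$; all numerical values and function names are hypothetical. The CBF correction is the smallest change to the proposed action that satisfies the barrier condition, computed in closed form for this scalar case.

```python
from typing import List, Tuple

DT, ETA, S_MAX, U_LIM = 0.1, 0.5, 1.0, 10.0  # hypothetical values


def min_intervention(c: float, b: float, lo: float, hi: float) -> Tuple[float, float]:
    """Smallest |a| in [lo, hi] with c*a + b >= 0 (c != 0); returns (a, eps)."""
    left, right = (max(lo, -b / c), hi) if c > 0 else (lo, min(hi, -b / c))
    if left <= right:
        return min(max(0.0, left), right), 0.0
    a = hi if c > 0 else lo          # limits prevent safety: minimise violation
    return a, -(c * a + b)


def compensate(s: float, u_rl: float) -> Tuple[float, float, float]:
    """Return (deployed action, CBF correction, slack) for s' = s + DT * u."""
    h_s = S_MAX - s                  # affine barrier h(s) = -s + S_MAX
    c = -DT                          # p^T g(s) with p = -1
    b = ETA * h_s - DT * u_rl        # h(f(s) + g(s) u_rl) - (1 - ETA) h(s)
    u_cbf, eps = min_intervention(c, b, -U_LIM - u_rl, U_LIM - u_rl)
    return u_rl + u_cbf, u_cbf, eps


def rollout(filtered: bool, steps: int = 20) -> List[float]:
    s, states = 0.0, [0.0]
    for _ in range(steps):
        u_rl = 5.0                   # stand-in for an unsafe RL proposal
        u = compensate(s, u_rl)[0] if filtered else u_rl
        s = s + DT * u
        states.append(s)
    return states


safe_states = rollout(filtered=True)
unsafe_states = rollout(filtered=False)
```

Far from the boundary the correction is zero and the proposal passes through unchanged; close to it, the correction grows just enough to keep the state inside the safe set.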
However, a significant issue is that controller (13) ensures safety, but does not actively guide policy exploration of the overall controller. This is because the RL policy being updated around, $\pi^{RL}_{\theta_k}$, is not the policy deployed on the agent, $\pi_k$. For example, suppose that in an autonomous driving task, the RL controller inadvertently proposes to collide with an obstacle. The CBF controller compensates to drive the car around the obstacle. The next learning iteration should update the policy around the safe deployed policy $\pi_k$, rather than the unsafe policy $\pi^{RL}_{\theta_k}$ (which would have led to an obstacle collision). However, the algorithm described in this section updates around the original policy, $\pi^{RL}_{\theta_k}$, as illustrated in Figure 2a.
CBF-Based Guiding Control with Reinforcement Learning
In order to achieve safe and efficient learning, we should learn from the deployed controller $u_k$, since it operates in the safe region $C$, rather than learning around $u^{RL}_{\theta_k}$, which may operate in an unsafe, irrelevant area of state space. The RL-CBF algorithm described below incorporates this goal.
Recall that $u^{RL}_{\theta_k}$ represent the realized controllers sampled from stochastic policies $\pi^{RL}_{\theta_k}$. Consider an initial RL-based controller $u^{RL}_{\theta_0}(s)$ (for iteration $k = 0$). The CBF controller $u^{CBF}_0(s, u^{RL}_{\theta_0})$ is determined from (14) to obtain $u_0(s) = u^{RL}_{\theta_0}(s) + u^{CBF}_0(s, u^{RL}_{\theta_0})$. For every following policy iteration, let us define the overall controller to incorporate all previous CBF controllers, as in equation (15).
$$u_k(s) = u^{RL}_{\theta_k}(s) + \sum_{j=0}^{k-1} u^{CBF}_j(s) + u^{CBF}_k\Big(s,\; u^{RL}_{\theta_k} + \sum_{j=0}^{k-1} u^{CBF}_j\Big) \tag{15}$$
The dependence of controller (15) on all prior CBF controllers (see Figure 1(b)) is critical to enhancing learning efficiency. Defining the controller in this fashion leads to policy updates around the previously deployed controller, which adds to the efficiency of the learning process by encouraging the policy to operate in desired areas of the state space. This idea is illustrated in Figure 2b.
The intuition is that at iteration $k = 0$, the RL policy proposed actions $u^{RL}_{\theta_0}(s)$, but it took safe actions $u^{RL}_{\theta_0}(s) + u^{CBF}_0(s)$. To update the policy based on the safe actions, the effective RL controller at the next iteration ($k = 1$) should be $u^{RL}_{\theta_1}(s) + u^{CBF}_0(s)$, which is then filtered by the CBF controller (i.e. $u^{CBF}_0(s)$ is now part of the RL controller). Across multiple policy iterations, we can consider $u^{RL}_{\theta_k}(s) + \sum_{j=0}^{k-1} u^{CBF}_j(s)$ to be the guided RL controller (proposing potentially unsafe actions), which is rendered safe by $u^{CBF}_k(s)$.
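The accumulation of prior CBF corrections into a guided controller can be sketched as follows. This is an illustrative simplification: the closed-form `cbf_correction` stands in for the QP on a hypothetical one-dimensional system ($s_{t+1} = s_t + \Delta t \cdot u$, safe set $s \le 1$, no actuator limits), and the class and method names are assumptions made for this example.

```python
from typing import Callable, List, Tuple

Policy = Callable[[float], float]


def cbf_correction(s: float, u_nominal: float, dt: float = 0.1,
                   eta: float = 0.5, s_max: float = 1.0) -> float:
    """Minimal correction so that s' = s + dt * u satisfies the barrier condition."""
    u_max_safe = eta * (s_max - s) / dt
    return min(0.0, u_max_safe - u_nominal)


class GuidedController:
    """Keeps every earlier CBF correction as part of the nominal (guided) action."""

    def __init__(self) -> None:
        self.prior: List[Policy] = []   # frozen corrections u_j^CBF(s), j < k

    def act(self, s: float, u_rl: Policy) -> Tuple[float, float]:
        guided = u_rl(s) + sum(c(s) for c in self.prior)
        correction = cbf_correction(s, guided)   # current-iteration CBF term
        return guided + correction, correction

    def end_iteration(self, u_rl: Policy) -> None:
        """Freeze this iteration's correction as a function of state."""
        prior = list(self.prior)

        def frozen(s: float) -> float:
            guided = u_rl(s) + sum(c(s) for c in prior)
            return cbf_correction(s, guided)

        self.prior.append(frozen)


controller = GuidedController()
unsafe_policy: Policy = lambda s: 5.0            # stand-in RL proposal
first_action, first_correction = controller.act(0.5, unsafe_policy)
controller.end_iteration(unsafe_policy)
second_action, second_correction = controller.act(0.5, unsafe_policy)
```

In the second iteration the guided action already incorporates the earlier correction, so the current CBF term is inactive. The sketch also shows the cost of the naive approach: each frozen correction keeps a reference to its own earlier policy.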
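To ensure safety after incorporating all prior CBF controllers, they must be included into the governing QP:
$$\begin{aligned}
(a_t, \epsilon) = \arg\min_{a_t,\, \epsilon}\;\; & \|a_t\|_2 + K_\epsilon\, \epsilon \\
\text{s.t.}\;\; & p^T f(s_t) + p^T g(s_t)\Big(u^{RL}_{\theta_k}(s_t) + \sum_{j=0}^{k-1} u^{CBF}_j(s_t) + a_t\Big) + p^T \mu_d(s_t) - k_\delta\, |p|^T \sigma_d(s_t) + q \;\ge\; (1-\eta)\, h(s_t) - \epsilon, \\
& a^i_{low} \le a^i_t + u^{RL(i)}_{\theta_k}(s_t) + \sum_{j=0}^{k-1} u^{CBF(i)}_j(s_t) \le a^i_{high} \quad \text{for } i = 1, \dots, M.
\end{aligned} \tag{16}$$
The solution to (16) defines the CBF controller $u^{CBF}_k(s)$, which ensures safety by satisfying the barrier condition (8).
Let $\epsilon_{max}$ represent the largest violation of the barrier condition for any $s \in C$.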
Theorem 2.
Proof.
The first part of the theorem follows directly from Definition 1 and Lemma 1. The only difference from Lemma 1 is that the control includes the RL controller and all previous CBF controllers ($u^{RL}_{\theta_k}(s) + \sum_{j=0}^{k-1} u^{CBF}_j(s)$).
The proof of the performance bound is given in the Appendix of this paper found at https://rcheng805.github.io/files/aaai2019.pdf. ∎
RL-CBF provides high-probability safety guarantees during the learning process and can maintain the performance guarantees of TRPO. If we have no uncertainty in the dynamics, then safety is guaranteed with probability 1. Note that the performance guarantee in Theorem 2 is for the guided RL control law $u^{RL}_{\theta_k}(s) + \sum_{j=0}^{k-1} u^{CBF}_j(s)$, which is not the deployed controller, $u_k(s)$. However, this does not pose a significant issue, since $u^{CBF}_k$ rapidly decays to 0 as we iterate. This is because the guided RL controller quickly learns to operate in the safe region, so the CBF controller becomes inactive.
Computationally Efficient Algorithm
This section describes an efficient algorithm to implement the framework described above, since a naive approach would be too computationally expensive in many cases. To see this, recall the controller (15) we would ideally implement:
$$u_k(s) = u^{RL}_{\theta_k}(s) + \sum_{j=0}^{k-1} u^{CBF}_j(s) + u^{CBF}_k\Big(s,\; u^{RL}_{\theta_k} + \sum_{j=0}^{k-1} u^{CBF}_j\Big).$$
The first term may be represented by a neural network that is parameterized by $\theta_k$, which has a standard implementation. The third term is just a quadratic program with dependencies on the other terms; it does not pose a computational burden. However, the summation in the second term poses a challenge, since every term in the sum depends on a different previous RL controller $u^{RL}_{\theta_j}$. Therefore, we would need to store neural networks corresponding to each previous RL controller. In addition, we would have to solve separate QPs in sequence to evaluate each CBF controller. Such a brute-force implementation would be impractical.
To overcome this issue, we approximate $\sum_{j=0}^{k-1} u^{CBF}_j(s) \approx u^{bar}_{\phi_k}(s)$, where $u^{bar}_{\phi_k}$ is a feed-forward neural network (MLP) parameterized by $\phi_k$. We chose an MLP since they have been shown to be powerful function approximators. Thus, at each policy iteration, we fit the MLP to data of the accumulated CBF control collected from trajectories of the previous policy iteration. Then we obtain the controller:
$$u_k(s) = u^{RL}_{\theta_k}(s) + u^{bar}_{\phi_k}(s) + u^{CBF}_k\big(s,\; u^{RL}_{\theta_k} + u^{bar}_{\phi_k}\big).$$
Note that even with this approximation, safety with probability $(1-\delta)$ is still guaranteed. This is because the above approximation only affects the guided RL term $u^{RL}_{\theta_k}(s) + u^{bar}_{\phi_k}(s)$. The CBF controller $u^{CBF}_k$ still solves (16), which provides the safety guarantees in Theorem 2 by satisfying the CBF condition (8). Furthermore, we now have to store only two NNs and solve one QP for the controller. The tradeoff is that the performance guarantee in Theorem 2 does not necessarily hold with this approximation. The algorithm is outlined in Algorithm 1.
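The idea of replacing the stored history of CBF controllers with a single fitted approximator can be sketched as follows. To keep the example dependency-free, a one-dimensional least-squares line is swapped in for the MLP; this is a stand-in, not the approximator used in this work. The system, the closed-form `cbf_correction`, and all names are hypothetical.

```python
from typing import Callable, List, Tuple

Policy = Callable[[float], float]


def cbf_correction(s: float, u_nominal: float, dt: float = 0.1,
                   eta: float = 0.5, s_max: float = 1.0) -> float:
    """Closed-form stand-in for the CBF QP on s' = s + dt * u with safe set s <= s_max."""
    return min(0.0, eta * (s_max - s) / dt - u_nominal)


def fit_linear(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Least-squares fit y ~ w * x + b (stand-in for training the MLP)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = sxy / sxx
    return w, mean_y - w * mean_x


def make_controller(u_rl: Policy, u_bar: Policy) -> Policy:
    """Deployed controller: guided action plus the current CBF correction."""
    def controller(s: float) -> float:
        guided = u_rl(s) + u_bar(s)
        return guided + cbf_correction(s, guided)   # safety filter still applied
    return controller


# Data from the previous iteration: visited states and the CBF corrections applied there.
u_rl: Policy = lambda s: 5.0
states = [0.5, 0.6, 0.7, 0.8]
corrections = [cbf_correction(s, u_rl(s)) for s in states]

w, b = fit_linear(states, corrections)
u_bar: Policy = lambda s: w * s + b
deployed = make_controller(u_rl, u_bar)
```

Because the filter is applied after the approximation, an inaccurate fit can only change how much correction the current CBF term has to supply; it does not remove the barrier constraint.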
Experiments
We implement two versions of the RL-CBF algorithm with existing model-free RL algorithms: TRPO-CBF, derived from TRPO [Schulman et al.2015], and DDPG-CBF, derived from DDPG [Lillicrap et al.2015]. The code for these examples can be found at: https://github.com/rcheng805/RLCBF.
Inverted Pendulum
We first apply RLCBF to the control of a simulated inverted pendulum from the OpenAI gym environment (pendulumv0), which has mass and length, , and is actuated by torque, . We set the safe region to be radians, and define the reward function to learn a controller that keeps the pendulum upright. The true system dynamics are defined as follows,
(17) 
with torque limits , and . To introduce model uncertainty, our nominal model assumes ( error in model parameters).
Figure 3 compares the accumulated reward achieved during each episode using TRPO, DDPG, TRPO-CBF, and DDPG-CBF. The two RL-CBF algorithms converge near the optimal solution very rapidly, and significantly outperform the corresponding baseline algorithms without the CBFs. We note that TRPO and DDPG sometimes converge on a high-performance controller (comparable to TRPO-CBF and DDPG-CBF), though this occurs less reliably and more slowly, resulting in the poorer learning curves. More importantly, the RL-CBF controllers maintain safety (i.e. never leave the safe region) throughout the learning process, as also seen in Figure 3. In contrast, TRPO and DDPG severely violate safety while learning the optimal policy.
Figure 4 shows the pendulum angle during a representative trial under the first policy versus the last learned policy deployed for TRPO-CBF and DDPG-CBF. For the first policy iteration, the pendulum angle is maintained near the edge of the safe region: the RL algorithm has proposed a poor controller, so the CBF controller takes the minimal action necessary to keep the system safe. By the last iteration though, the CBF controller is completely inactive ($u^{CBF}_k = 0$), since the guided RL controller is already safe.
Simulated Car Following
Consider a chain of five cars following each other on a straight road. We control the acceleration/deceleration of the 4th car in the chain, and would like to train a policy to maximize fuel efficiency during traffic congestion while avoiding collisions. Each car utilizes the dynamics shown in equation (18), and we attempt to optimize the reward function (19). The car dynamics and reward function are inspired by previous work [He, Ge, and Orosz2018].
(18) 
(19) 
The first term in the reward optimizes fuel efficiency, while the other term encourages the car to maintain a 3 meter distance from the other cars (soft constraint). For the RL-CBF controllers, the CBF enforces a 2 meter safe distance between cars (hard constraint). The behavior of cars 1, 2, 3, and 5 is described in the Appendix.
The 4th car has access to every other car’s position, velocity, and acceleration, but it only has a crude model of its own dynamics ($f$, $g$) and an inaccurate model of the drivers behind and in front of it. In addition, we add Gaussian noise to the acceleration of each car. The idea is that the 4th car can use its crude model to guarantee safety with high probability, and improve fuel efficiency by slowly building and leveraging an implicit model of the other drivers’ behaviors.
From Figure 5, we see that there were no safety violations between the cars during our simulated experiments when using either of the RL-CBF controllers. When using TRPO and DDPG alone without CBF safety, almost all trials had collisions, even in the later stages of learning. Furthermore, as seen in Figure 5, TRPO-CBF learns faster and outperforms TRPO (DDPG-CBF also outperforms DDPG, though neither algorithm converged on a high-performance controller in our experiments). It is important to note that in some experiments, TRPO finds a comparable controller to TRPO-CBF, but this is often not the case due to randomness in seeds.
Although DDPG and DDPG-CBF failed to converge on a good policy, Figure 5 shows that DDPG-CBF (and TRPO-CBF) always maintained a safe controller. This is a crucial benefit of the RL-CBF approach, as it guarantees safety independent of the system’s learning performance.
Conclusion
Adding even crude model information and CBFs into the model-free RL framework allows us to improve the exploration of model-free learning algorithms while ensuring end-to-end safety. Therefore, we proposed the safe RL-CBF framework, and developed an efficient controller synthesis algorithm that guarantees safety and improves exploration. These features will be crucial in deploying reinforcement learning on physical systems, where problems require online computation and efficient learning with safety guarantees.
This framework, which combines model-free RL-based control, model-based CBF control, and model learning, has the additional advantages of being able to (1) easily integrate new RL algorithms (in place of TRPO/DDPG) as they are developed, and (2) incorporate better model information from measurements to improve the CBF controller online.
A significant assumption in this work is that we are given a valid safe set, $C$, which can be rendered forward invariant. However, computing these valid safe sets is non-trivial and computationally intensive [Wang, Theodorou, and Egerstedt2017, Wabersich and Zeilinger2018, Fisac et al.2018]. If we are not given a valid safe set, we may reach states where it is not possible to remain safe (i.e. $\epsilon > 0$). Although our controller achieves graceful degradation in these cases, in future work it will be important to learn the safe set in addition to the controller.
Acknowledgment
The authors would like to thank Hoang Le and Yisong Yue for helpful discussions.
References

 [Abbeel and Ng2004] Abbeel, P., and Ng, A. Y. 2004. Apprenticeship learning via inverse reinforcement learning. In Twenty-first international conference on Machine learning - ICML ’04.
 [Abbeel, Coates, and Ng2010] Abbeel, P.; Coates, A.; and Ng, A. Y. 2010. Autonomous helicopter aerobatics through apprenticeship learning. International Journal of Robotics Research.
 [Achiam et al.2017] Achiam, J.; Held, D.; Tamar, A.; and Abbeel, P. 2017. Constrained Policy Optimization. arXiv preprint arXiv:1705.10528.
 [Agrawal and Sreenath2017] Agrawal, A., and Sreenath, K. 2017. Discrete Control Barrier Functions for SafetyCritical Control of Discrete Systems with Application to Bipedal Robot Navigation. Robotics science and systems (RSS).
 [Alshiekh et al.2017] Alshiekh, M.; Bloem, R.; Ehlers, R.; Könighofer, B.; Niekum, S.; and Topcu, U. 2017. Safe Reinforcement Learning via Shielding. arXiv preprint arXiv:1708.08611.
 [Ames et al.2017] Ames, A. D.; Xu, X.; Grizzle, J. W.; and Tabuada, P. 2017. Control Barrier Function Based Quadratic Programs for Safety Critical Systems. IEEE Transactions on Automatic Control.
 [Berkenkamp et al.2017] Berkenkamp, F.; Turchetta, M.; Schoellig, A. P.; and Krause, A. 2017. Safe Modelbased Reinforcement Learning with Stability Guarantees. In Neural Information Processing Systems.
 [Bertsekas2005] Bertsekas, D. 2005. Dynamic Programming and Optimal Control.
 [Chow et al.2018] Chow, Y.; Nachum, O.; DuenezGuzman, E.; and Ghavamzadeh, M. 2018. A Lyapunovbased Approach to Safe Reinforcement Learning. arXiv preprint arXiv:1805.07708.
 [Duan et al.2016] Duan, Y.; Chen, X.; Schulman, J.; and Abbeel, P. 2016. Benchmarking Deep Reinforcement Learning for Continuous Control. arXiv.
 [Fisac et al.2018] Fisac, J. F.; Akametalu, A. K.; Zeilinger, M. N.; Kaynama, S.; Gillula, J.; and Tomlin, C. J. 2018. A General Safety Framework for LearningBased Control in Uncertain Robotic Systems. arXiv preprint arXiv:1705.01292.
 [Fu, Glover, and April2005] Fu, M.; Glover, F.; and April, J. 2005. Simulation optimization: a review, new developments, and applications. Proceedings of the Winter Simulation Conference, 2005.
 [García and Fernández2015] García, J., and Fernández, F. 2015. A Comprehensive Survey on Safe Reinforcement Learning. Journal of Machine Learning Research.
 [Gaskett2003] Gaskett, C. 2003. Reinforcement Learning in Circumstances Beyond its Control. In CIMCA.
 [Gillula and Tomlin2012] Gillula, J. H., and Tomlin, C. J. 2012. Guaranteed safe online learning via reachability: Tracking a ground target using a quadrotor. In Proceedings  IEEE International Conference on Robotics and Automation.
 [He, Ge, and Orosz2018] He, C. R.; Ge, J. I.; and Orosz, G. 2018. Databased fueleconomy optimization of connected automated trucks in traffic. Annual American Control Conference (ACC).
 [Koller et al.2018] Koller, T.; Berkenkamp, F.; Turchetta, M.; and Krause, A. 2018. Learningbased Model Predictive Control for Safe Exploration and Reinforcement Learning. arXiv preprint arXiv:1803.08287.
 [Li, Kalabic, and Chu2018] Li, Z.; Kalabic, U.; and Chu, T. 2018. Safe Reinforcement Learning: Learning with Supervision Using a ConstraintAdmissible Set. In Annual American Control Conference.
 [Lillicrap et al.2015] Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
 [Mannucci et al.2018] Mannucci, T.; Van Kampen, E. J.; De Visser, C.; and Chu, Q. 2018. Safe Exploration Algorithms for Reinforcement Learning Controllers. IEEE Transactions on Neural Networks and Learning Systems.
 [Moldovan and Abbeel2012] Moldovan, T. M., and Abbeel, P. 2012. Safe Exploration in Markov Decision Processes. arXiv preprint arXiv:1205.4810.
 [NguyenTuong, Seeger, and Peters2009] NguyenTuong, D.; Seeger, M.; and Peters, J. 2009. Local Gaussian Process Regression for Real Time Online Model Learning and Control. In Advances in neural information processing systems.
 [Ohnishi et al.2018] Ohnishi, M.; Wang, L.; Notomista, G.; and Egerstedt, M. 2018. Safetyaware Adaptive Reinforcement Learning with Applications to Brushbot Navigation. arXiv preprint arXiv:1801.09627.
 [Perkins and Barto2003] Perkins, T. J., and Barto, A. G. 2003. Lyapunov design for safe reinforcement learning. Journal of Machine Learning Research.
 [Peters and Schaal2008] Peters, J., and Schaal, S. 2008. Reinforcement learning of motor skills with policy gradients. Neural Networks.
 [Rasmussen and Williams2006] Rasmussen, C. E., and Williams, C. K. 2006. Gaussian Processes for Machine Learning.
 [Schulman et al.2015] Schulman, J.; Levine, S.; Moritz, P.; Jordan, M.; and Abbeel, P. 2015. Trust Region Policy Optimization. In International Conference on Machine Learning (ICML).
 [Silver et al.2014] Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; and Riedmiller, M. 2014. Deterministic Policy Gradient Algorithms. Proceedings of the 31st International Conference on Machine Learning (ICML14).

 [Snelson and Ghahramani2007] Snelson, E., and Ghahramani, Z. 2007. Local and global sparse Gaussian process approximations. Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS).
 [Tang et al.2010] Tang, J.; Singh, A.; Goehausen, N.; and Abbeel, P. 2010. Parameterized maneuver learning for autonomous helicopter flight. In Proceedings - IEEE International Conference on Robotics and Automation.
 [Wabersich and Zeilinger2018] Wabersich, K. P., and Zeilinger, M. N. 2018. Scalable synthesis of safety certificates from data with applications to learningbased control. arXiv preprint arXiv:1711.11417.
 [Wachi et al.2018] Wachi, A.; Sui, Y.; Yue, Y.; and Ono, M. 2018. Safe Exploration and Optimization of Constrained MDPs using Gaussian Processes. 32nd AAAI conference on Artificial Intelligence (AAAI).
 [Wang, Theodorou, and Egerstedt2017] Wang, L.; Theodorou, E. A.; and Egerstedt, M. 2017. Safe Learning of Quadrotor Dynamics Using Barrier Certificates. arXiv preprint arXiv:1710.05472.
Appendix A: Proof of Theorem 2
Theorem 2.
Using the control law from (15), if there exists a solution to problem (16) such that $\epsilon = 0$, then the safe set $C$ is forward invariant with probability $(1-\delta)$. If $\epsilon_{max} > 0$, but the solution to problem (16) satisfies $\epsilon \le \epsilon_{max}$ for all $s \in C_\epsilon$, then the controller will render the set $C_\epsilon$ forward invariant with probability $(1-\delta)$.
Furthermore, if we use TRPO for the RL algorithm, then the control law from (15) achieves the performance guarantee , where and is chosen as in equation (4).
Proof.
To prove the performance bound in the second part of the theorem, we use the property of the advantage function from equation (20) below:
(20) 
where . As derived in (Schulman et al. 2015), we can then obtain the following inequality:
(21) 
where is the total variational distance between policies and , and