Reinforcement learning (RL) focuses on finding an agent’s policy (i.e. controller) that maximizes a long-term reward. It does this by repeatedly observing the agent’s state, taking an action (according to a current policy), and receiving a reward. Over time, the agent modifies its policy to maximize its long-term reward. This method has been successfully applied to continuous control tasks [Duan et al.2016, Lillicrap et al.2015] where controllers have learned to stabilize complex robots (after many policy iterations).
However, since RL focuses on maximizing the long-term reward, it is likely to explore unsafe behaviors during the learning process. This feature is problematic for any RL algorithm that will be deployed on hardware, as unsafe learning policies could damage the hardware or bring harm to a human. As a result, most success in the use of RL for control of physical systems has been limited to simulations, where many failed iterations can occur before success.
Safe RL tries to learn a policy that maximizes the expected return, while also ensuring (or encouraging) the satisfaction of some safety constraints [García and Fernández2015]. Previous approaches to safe reinforcement learning include reward-shaping, policy optimization with constraints [Gaskett2003, Moldovan and Abbeel2012, Achiam et al.2017, Wachi et al.2018], or teacher advice [Abbeel and Ng2004, Abbeel, Coates, and Ng2010, Tang et al.2010]. However, these model-free approaches do not guarantee safety during learning – safety is only approximately guaranteed after a sufficient learning period. The fundamental issue is that without a model, safety must be learned through environmental interactions, which means it may be violated during initial learning interactions.
Model-based approaches have utilized Lyapunov-based methods or model predictive control to guarantee safety under system dynamics during learning [Wang, Theodorou, and Egerstedt2017, Berkenkamp et al.2017, Chow et al.2018, Ohnishi et al.2018, Koller et al.2018], but they do not address the issue of exploration and performance optimization. Other works guarantee safety by switching between backup controllers [Perkins and Barto2003, Mannucci et al.2018], though this overly constrains policy exploration.
We draw inspiration from recent work that has incorporated model information into model-free RL algorithms to ensure safety during exploration [Fisac et al.2018, Li, Kalabic, and Chu2018, Gillula and Tomlin2012]. However, these approaches utilize backup safety controllers that do not guide the learning process (limiting exploration efficiency).
This paper develops a framework for integrating existing model-free RL algorithms with control barrier functions (CBFs) to guarantee safety and improve exploration efficiency in RL, even with uncertain model information. The CBFs require a (potentially poor) nominal dynamics model, but can ensure online safety of nonlinear systems during the entire learning process and help the RL algorithm efficiently search the policy space. This methodology effectively constrains the policy exploration process to a set of safe polices defined by the CBF. An on-line process learns the governing dynamical system over time, which allows the CBF controller to adapt and become less conservative over time. This general framework allows us to utilize any model-free RL algorithm to learn a controller, with the CBF controller guiding policy exploration and ensuring safety.
Using this framework, we develop an efficient algorithm for controller synthesis, RL-CBF, with guarantees on safety (remaining within a safe set) and performance (reward-maximization). To test this approach, we integrated two model-free RL algorithms – trust region policy optimization (TRPO) [Schulman et al.2015] and deep deterministic policy gradients (DDPG) [Lillicrap et al.2015] – with the CBF controllers and dynamical model learning. We tested the algorithms on two nonlinear control problems: (1) balancing of an inverted pendulum, and (2) autonomous car following with wireless vehicle-to-vehicle communication. For both tasks, our algorithm efficiently learned a high-performance controller while maintaining safety throughout the learning process. Furthermore, it learned faster than comparable RL algorithms due to inclusion of a model learning process, which constrains the space of explorable policies and guides the exploration process.
Our main contributions are: (1) we develop the first algorithm that integrates CBF-based controllers with model-free RL to achieve end-to-end safe RL for nonlinear control systems, and (2) we show improved learning efficiency by guiding the policy exploration with barrier functions.
Consider an infinite-horizon discounted Markov decision process (MDP) with control-affine, deterministic dynamics (a good assumption when dealing with robotic systems), defined by the tuple, where is a set of states, is a set of actions, is the nominal unactuated dynamics, is the nominal actuated dynamics, and is the unknown system dynamics. The time evolution of the system is given by
where , , and compose a known nominal model of the dynamics, and represents the unknown model. In practice, the nominal model may be quite bad (e.g. a robot model that ignores friction and compliance), and we must learn a much better dynamic model through data.
Furthermore is the reward function, is the distribution of the initial state , and is the discount factor.
Let denote a stochastic control policy that maps states to distributions over actions, and let denote the policy’s expected discounted reward:
Here is a trajectory where the actions are sampled from policy . We use the standard definitions for the value function , action-value function , and advantage function, below:
where actions are drawn from distribution .
Most policy optimization RL algorithms attempt to maximize long-term reward using (a) policy iteration methods [Bertsekas2005], (b) derivative-free optimization methods that optimize the return as a function of policy parameters [Fu, Glover, and April2005], or (c) policy gradient methods [Peters and Schaal2008, Silver et al.2014]. Any of these methods can be rendered end-to-end safe using the RL-CBF control framework proposed in this work. However, we will focus mainly on policy gradient methods, due to their good performance on continuous control problems.
Policy Gradient-Based RL
Policy gradient methods estimate the gradient of the expected returnwith respect to the policy based on sampled trajectories. They then optimize the policy using gradient ascent, allowing modification of the control law at episodic intervals. The DDPG and TRPO algorithms are examples of policy gradient methods, which we will use as benchmarks.
DDPG is an off-policy actor-critic method that computes the policy gradient based on sampled trajectories and an estimate of the action-value function. It alternately updates the action-value function and the policy as it samples more and more trajectories.
TRPO is an on-policy policy gradient method that maximizes a surrogate loss function, which serves as an approximate lower bound on the true loss function. It also ensures that the next policy distribution is within a “trust region”. More precisely, it approximates the optimal policy update by iteratively solving the optimization problem:
such that the Kullback-Leibler divergence. Here represents the discounted visitation frequency of state under policy , and is a constant defining the “trust region”.
Though both DDPG and TRPO have learned good controllers on several benchmark problems, there is no guarantee of safety in these algorithms, nor any other model-free RL algorithm. Therefore, our objective is to complement model-free RL controllers with model-based CBF controllers (using a potentially poor nominal model), which can both improve search efficiency and ensure safety.
We use Gaussian process (GP) models to estimate the unknown system dynamics, , from data. A Gaussian process is a nonparametric regression method for estimating functions and their uncertain distribution from data [Rasmussen and Williams2006]. It describes the evolving model of the uncertain dynamics, , by a mean estimate, , and the uncertainty,
, which allows for high probability confidence intervals on the function:
with probability where is a design parameter that determines (e.g. confidence is achieved at ). Therefore, by learning and in tandem with the controller, we obtain high probability confidence intervals on the unknown dynamics, which adapt/shrink as we obtain more information (i.e. measurements) on the system.
A GP model is parameterized by a kernel function , which defines the similarity between any two states . In order to make inferences on the unknown function , we need measurements, , which are computed from measurements of () using the relation from Equation (1):
. Since any finite number of data points form a multivariate normal distribution, we can obtain the posterior distribution ofat any query state by conditioning on the past measurements. Given measurements subject to independent Gaussian noise , the mean
and varianceat the query state, , are calculated to be,
where is the kernel matrix, and . As we collect more data, becomes a better estimate of , and the uncertainty, , of the dynamics decreases.
We note that in applications with large amounts of data, training the GP becomes problematic since computing the matrix inverse in Equation (6) scales poorly ( in the number of data points). There are several methods to alleviate this issue, such as using sparse inducing inputs or local GPs [Snelson and Ghahramani2007, Nguyen-Tuong, Seeger, and Peters2009]
. In fact, our framework can use any model approximation method that provides quantifiable uncertainty bounds (e.g. neural networks with dropout). However, we bypass this issue in this work by batch training the GP model with only the latest batch ofdata points.
Control Barrier Functions
Consider an arbitrary safe set, , defined by the super-level set of a continuously differentiable function ,
To maintain safety during the learning process, the system state must always remain within the safe set (i.e. the set is forward invariant). Examples include keeping a manipulator within a given workspace, or ensuring that a quadcopter avoids obstacles. Essentially, the learning algorithm should learn/explore only in set .
Control barrier functions utilize a Lyapunov-like argument to provide a sufficient condition for ensuring forward invariance of the safe set under controlled dynamics. Therefore, barrier functions are a natural tool to enforce safety throughout the learning process, and can be used to synthesize safe controllers for our systems.
where represents how strongly the barrier function “pushes” the state inwards within the safe set (if , the barrier condition simplifies to the Lyapunov condition).
The existence of a CBF implies that there exists a deterministic controller such that the set is forward invariant for system (1) [Agrawal and Sreenath2017, Ames et al.2017]. In other words, if condition (8) is satisfied for all , then the set is rendered forward invariant. Our goal is to find a controller, , that satisfies condition (8), so that safety is certified.
For this paper, we restrict our attention to affine barrier functions of form , (), though our methodology could support more general barrier functions. This restriction means the set is composed of intersecting half spaces (i.e. polytopes).
Before we can formulate a tractable optimization problem that satisfies condition (8), we must have an estimate for . We use an updating GP model to estimate the mean and variance of the function, and , from measurement data. From equation (5), we know that with probability . Therefore, we can reformulate the CBF condition (8) into the following quadratic program (QP) that can be efficiently solved at each time step:
where is a slack variable in the safety condition, is a large constant that penalizes safety violations, and
denotes the element-wise absolute value of the vector. The optimization is not sensitive to the parameter as long as it is very large (e.g. ), such that safety constraint violations are heavily penalized. The last constraint on encodes actuator constraints. The solution to this optimization problem (9) enforces the safety condition (8) as best as possible with minimum control effort, even with uncertain dynamics. Accounting for the dynamics uncertainty through GP models allows us to certify system safety, even with a poor nominal model.
Let us define the set . Then we can prove the following lemma.
Therefore, the constraint in problem (9) ensures that:
Define , so that (11) simplifies to
The CBF controllers that solve (9) provide deterministic control laws, that naturally encode safety; they provide the minimal control intervention that maintains safety or provide graceful degradation (a small deviation from the safe set) when safety cannot be enforced (e.g. due to actuation constraints). Furthermore, even with dynamics uncertainty, we can make high-probability statements about system safety using GP models with CBFs.
Note that one can easily combine multiple CBF constraints in problem (9) to define polytopic safe regions.
CBF-Based Compensating Control with Reinforcement Learning
To illustrate our framework, we first propose the suboptimal controller in equation (13), which combines a model-free RL-based controller (parameterized by ) and a CBF-based controller in the architecture shown in Figure 0(a).
The concept is akin to shielded RL [Alshiekh et al.2017, Fisac et al.2018], since the CBF controller compensates for the RL controller to ensure safety, but does not guide exploration of the RL algorithm. The next section will extend the CBF controller to improve RL policy exploration.
Note that since the RL policy is stochastic (see Preliminaries section on RL), the controller represents the realization (i.e. sampled control action) of the stochastic policy after policy iteration .
The model-free RL controller, proposes a control action that attempts to optimize long-term reward, but may be unsafe. Before deploying the RL controller, a CBF controller filters the proposed control action and provides the minimum control intervention needed to ensure that the overall controller, , keeps the system state within the safe set. Essentially, the CBF controller, “projects” the RL controller into the set of safe policies. In the case of an autonomous car, this action may enforce a safe distance between nearby cars, regardless of the action proposed by the RL controller.
The CBF controller , which depends on the RL control, is defined by the following QP that can be efficiently solved at each time step:
The last constraint in (14) incorporates possible actuator limits of the system.
We must make clear the important distinction between the indexes and . Note that indexes timesteps within each policy iteration or trial, whereas indexes the policy iterations (which contain trajectories with several timesteps). The CBF controller updates throughout the task (computed at each time step, ), whereas the RL policy and GP model update at episodic policy iteration intervals indexed by .
Let represent the largest violation of the barrier condition (i.e. potential deviation from the safe set) for any . Lemma 1 extends to the modified optimization problem (14), implying that satisfies the barrier certificate inequality (up to ) that guarantees forward invariance of . Therefore, if there exists a solution to problem (14) such that , then controller (13) renders the safe set forward invariant with probability . However if , but for all , then the controller will render the set forward invariant with probability .
Intuitively, the RL controller provides a “feedforward control”, and the CBF controller compensates with the minimum control necessary to render the safe set forward invariant. If such a control does not exist (e.g. due to torque constraints), then the CBF controller provides the control that keeps the state as close as possible to the safe set.
However, a significant issue is that controller (13) ensures safety, but does not actively guide policy exploration of the overall controller. This is because the RL policy being updated around, , is not the policy deployed on the agent, . For example, suppose that in an autonomous driving task, the RL controller inadvertently proposes to collide with an obstacle. The CBF controller compensates to drive the car around the obstacle. The next learning iteration should update the policy around the safe deployed policy , rather than the unsafe policy (which would have led to an obstacle collision). However, the algorithm described in this section updates around the original policy, , as illustrated in Figure 2a.
CBF-Based Guiding Control with Reinforcement Learning
In order to achieve safe and efficient learning, we should learn from the deployed controller , since it operates in the safe region , rather than learning around , which may operate in an unsafe, irrelevant area of state space. The RL-CBF algorithm described below incorporates this goal.
Recall that represent the realized controllers sampled from stochastic policies . Consider an initial RL-based controller (for iteration ). The CBF controller is determined from (14) to obtain . For every following policy iteration, let us define the overall controller to incorporate all previous CBF controllers, as in equation (15).
The dependence of controller (15) on all prior CBF controllers (see Figure 0(b)) is critical to enhancing learning efficiency. Defining the controller in this fashion leads to policy updates around the previously deployed controller, which adds to the efficiency of the learning process by encouraging the policy to operate in desired areas of the state space. This idea is illustrated in Figure 2b.
The intuition is that at iteration , the RL policy proposed actions , but it took safe actions . To update the policy based on the safe actions, the effective RL controller at the next iteration () should be , which is then filtered by the CBF controller (i.e. is now part of the RL controller). Across multiple policy iterations, we can consider to be the guided RL controller (proposing potentially unsafe actions), which is rendered safe by .
To ensure safety after incorporating all prior CBF controllers, they must be included into the governing QP:
Let represent the largest violation of the barrier condition for any .
Using the control law from (15), if there exists a solution to problem (16) such that , then the safe set is forward invariant with probability . If , but the solution to problem (16) satisfies for all , then the controller will render the larger set forward invariant with probability .
The proof of the performance bound is given in the Appendix of this paper found at https://rcheng805.github.io/files/aaai2019.pdf. ∎
RL-CBF provides high-probability safety guarantees during the learning process and can maintain the performance guarantees of TRPO. If we have no uncertainty in the dynamics, then safety is guaranteed with probability 1. Note that the performance guarantee in Theorem 2 is for control law , which is not the deployed controller, . However, this does not pose a significant issue, since rapidly decays to 0 as we iterate. This is because the guided RL controller quickly learns to operate in the safe region, so the CBF controller becomes inactive.
Computationally Efficient Algorithm
This section describes an efficient algorithm to implement the framework described above, since a naive approach would be too computationally expensive in many cases. To see this, recall the controller (15) we would ideally implement:
The first term may be represented by a neural network that is parameterized by , which has a standard implementation. The third term is just a quadratic program with dependencies on the other terms; it does not pose a computational burden. However, the summation in the 2nd term poses a challenge, since every term in depends on a different previous RL controller . Therefore, we would need to store neural networks corresponding to each previous RL controller. In addition, we would have to solve separate QPs in sequence to evaluate each CBF controller. Such a brute-force implementation would be impractical .
To overcome this issue, we approximate , where is a feedforward neural network (MLP) parameterized by . We chose a MLP since they have been shown to be powerful function approximators. Thus, at each policy iteration, we fit the MLP to data of collected from trajectories of the previous policy iteration. Then we obtain the controller:
Note that even with this approximation, safety with probability is still guaranteed. This is because the above approximation only affects the guided RL term . The CBF controller still solves (16), which provides the safety guarantees in Theorem 2 by satisfying the CBF condition (8). Furthermore, we now have to store only two NNs and solve one QP for the controller. The tradeoff is that the performance guarantee in Theorem 2 does not necessarily hold with this approximation. The algorithm is outlined in Algorithm 1.
We implement two versions of the RL-CBF algorithm with existing model-free RL algorithms: TRPO-CBF, derived from TRPO [Schulman et al.2015], and DDPG-CBF, derived from DDPG [Lillicrap et al.2015]. The code for these examples can be found at: https://github.com/rcheng805/RL-CBF.
We first apply RL-CBF to the control of a simulated inverted pendulum from the OpenAI gym environment (pendulum-v0), which has mass and length, , and is actuated by torque, . We set the safe region to be radians, and define the reward function to learn a controller that keeps the pendulum upright. The true system dynamics are defined as follows,
with torque limits , and . To introduce model uncertainty, our nominal model assumes ( error in model parameters).
Figure 3 compares the accumulated reward achieved during each episode using TRPO, DDPG, TRPO-CBF, and DDPG-CBF. The two RL-CBF algorithms converge near the optimal solution very rapidly, and significantly outperform the corresponding baseline algorithms without the CBFs. We note that TRPO and DDPG sometimes converge on a high-performance controller (comparable to TRPO-CBF and DDPG-CBF), though this occurs less reliably and more slowly, resulting in the poorer learning curves. More importantly, the RL-CBF controllers maintain safety (i.e. never leave the safe region) throughout the learning process, as also seen in Figure 3. In contrast, TRPO and DDPG severely violate safety while learning the optimal policy.
Figure 4 shows the pendulum angle during a representative trial under the first policy versus the last learned policy deployed for TRPO-CBF and DDPG-CBF. For the first policy iteration, the pendulum angle is maintained near the edge of the safe region – the RL algorithm has proposed a poor controller so the CBF controller takes the minimal action necessary to keep the system safe. By the last iteration though, the CBF controller is completely inactive (), since the guided RL controller () is already safe.
Simulated Car Following
Consider a chain of five cars following each other on a straight road. We control the acceleration/deceleration of the car in the chain, and would like to train a policy to maximize fuel efficiency during traffic congestion while avoiding collisions. Each car utilizes the dynamics shown in equation (18), and we attempt to optimize the reward function (19). The car dynamics and reward function are inspired by previous work [He, Ge, and Orosz2018].
The first term in the reward optimizes fuel efficiency, while the other term encourages the car to maintain a 3 meter distance from the other cars (soft constraint). For the RL-CBF controllers, the CBF enforces a 2 meter safe distance between cars (hard constraint). The behavior of cars 1,2,3, and 5 is described in the Appendix.
The car has access to every other cars’ position, velocity, and acceleration, but it only has a crude model of its own dynamics () and an inaccurate model of the drivers behind and in front of it. In addition, we add Gaussian noise to the acceleration of each car. The idea is that the car can use its crude model to guarantee safety with high probability, and improve fuel efficiency by slowly building and leveraging an implicit model of the other drivers’ behaviors.
From Figure 5, we see that there were no safety violations between the cars during our simulated experiments when using either of the RL-CBF controllers. When using TRPO and DDPG alone without CBF safety, almost all trials had collisions, even in the later stages of learning. Furthermore, as seen in Figure 5, TRPO-CBF learns faster and outperforms TRPO (DDPG-CBF also outperforms DDPG though neither algorithm converged on a high-performance controller in our experiments). It is important to note that in some experiments, TRPO finds a comparable controller to TRPO-CBF, but this is often not the case due to randomness in seeds.
Although DDPG and DDPG-CBF failed to converge on a good policy, Figure 5 shows that DDPG-CBF (and TRPO-CBF) always maintained a safe controller. This is a crucial benefit of the RL-CBF approach, as it guarantees safety independent of the system’s learning performance.
Adding even crude model information and CBFs into the model-free RL framework allows us to improve the exploration of model-free learning algorithms while ensuring end-to-end safety. Therefore, we proposed the safe RL-CBF framework, and developed an efficient controller synthesis algorithm that guarantees safety and improves exploration. These features will be crucial in deploying reinforcement learning on physical systems, where problems require online computation and efficient learning with safety guarantees.
This framework, which combines model-free RL-based control, model-based CBF control, and model learning has the additional advantages of being able to (1) easily integrate new RL algorithms (in place of TRPO/DDPG) as they are developed, and (2) incorporate better model information from measurements to online improve the CBF controller.
A significant assumption in this work is that we are given a valid safe set, , which can be rendered forward invariant. However, computing these valid safe sets is non-trivial and computationally intensive [Wang, Theodorou, and Egerstedt2017, Wabersich and Zeilinger2018, Fisac et al.2018]. If we are not given a valid safe set, we may reach states where it is not possible to remain safe (i.e. ). Although our controller achieves graceful degradation in these cases, in future work it will be important to learn the safe set in addition to the controller.
The authors would like to thank Hoang Le and Yisong Yue for helpful discussions.
[Abbeel and Ng2004]
Abbeel, P., and Ng, A. Y.
Apprenticeship learning via inverse reinforcement learning.
Twenty-first international conference on Machine learning - ICML ’04.
- [Abbeel, Coates, and Ng2010] Abbeel, P.; Coates, A.; and Ng, A. Y. 2010. Autonomous helicopter aerobatics through apprenticeship learning. International Journal of Robotics Research.
- [Achiam et al.2017] Achiam, J.; Held, D.; Tamar, A.; and Abbeel, P. 2017. Constrained Policy Optimization. arXiv preprint arXiv:1705:10528.
- [Agrawal and Sreenath2017] Agrawal, A., and Sreenath, K. 2017. Discrete Control Barrier Functions for Safety-Critical Control of Discrete Systems with Application to Bipedal Robot Navigation. Robotics science and systems (RSS).
- [Alshiekh et al.2017] Alshiekh, M.; Bloem, R.; Ehlers, R.; Könighofer, B.; Niekum, S.; and Topcu, U. 2017. Safe Reinforcement Learning via Shielding. arXiv preprint arXiv:1708.08611.
- [Ames et al.2017] Ames, A. D.; Xu, X.; Grizzle, J. W.; and Tabuada, P. 2017. Control Barrier Function Based Quadratic Programs for Safety Critical Systems. IEEE Transactions on Automatic Control.
- [Berkenkamp et al.2017] Berkenkamp, F.; Turchetta, M.; Schoellig, A. P.; and Krause, A. 2017. Safe Model-based Reinforcement Learning with Stability Guarantees. In Neural Information Processing Systems.
- [Bertsekas2005] Bertsekas, D. 2005. Dynamic Programming and Optimal Control.
- [Chow et al.2018] Chow, Y.; Nachum, O.; Duenez-Guzman, E.; and Ghavamzadeh, M. 2018. A Lyapunov-based Approach to Safe Reinforcement Learning. arXiv preprint arXiv:1805.07708.
- [Duan et al.2016] Duan, Y.; Chen, X.; Schulman, J.; and Abbeel, P. 2016. Benchmarking Deep Reinforcement Learning for Continuous Control. arXiv.
- [Fisac et al.2018] Fisac, J. F.; Akametalu, A. K.; Zeilinger, M. N.; Kaynama, S.; Gillula, J.; and Tomlin, C. J. 2018. A General Safety Framework for Learning-Based Control in Uncertain Robotic Systems. arXiv preprint arXiv:1705.01292.
- [Fu, Glover, and April2005] Fu, M.; Glover, F.; and April, J. 2005. Simulation optimization: a review, new developments, and applications. Proceedings of the Winter Simulation Conference, 2005.
- [García and Fernández2015] García, J., and Fernández, F. 2015. A Comprehensive Survey on Safe Reinforcement Learning. Journal of Machine Learning Research.
- [Gaskett2003] Gaskett, C. 2003. Reinforcement Learning in Circumstances Beyond its Control. In CIMCA.
- [Gillula and Tomlin2012] Gillula, J. H., and Tomlin, C. J. 2012. Guaranteed safe online learning via reachability: Tracking a ground target using a quadrotor. In Proceedings - IEEE International Conference on Robotics and Automation.
- [He, Ge, and Orosz2018] He, C. R.; Ge, J. I.; and Orosz, G. 2018. Data-based fuel-economy optimization of connected automated trucks in traffic. Annual American Control Conference (ACC).
- [Koller et al.2018] Koller, T.; Berkenkamp, F.; Turchetta, M.; and Krause, A. 2018. Learning-based Model Predictive Control for Safe Exploration and Reinforcement Learning. arXiv preprint arXiv:1803.08287.
- [Li, Kalabic, and Chu2018] Li, Z.; Kalabic, U.; and Chu, T. 2018. Safe Reinforcement Learning: Learning with Supervision Using a Constraint-Admissible Set. In Annual American Control Conference.
- [Lillicrap et al.2015] Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
- [Mannucci et al.2018] Mannucci, T.; Van Kampen, E. J.; De Visser, C.; and Chu, Q. 2018. Safe Exploration Algorithms for Reinforcement Learning Controllers. IEEE Transactions on Neural Networks and Learning Systems.
- [Moldovan and Abbeel2012] Moldovan, T. M., and Abbeel, P. 2012. Safe Exploration in Markov Decision Processes. arXiv preprint arXiv:1205.4810.
- [Nguyen-Tuong, Seeger, and Peters2009] Nguyen-Tuong, D.; Seeger, M.; and Peters, J. 2009. Local Gaussian Process Regression for Real Time Online Model Learning and Control. In Advances in neural information processing systems.
- [Ohnishi et al.2018] Ohnishi, M.; Wang, L.; Notomista, G.; and Egerstedt, M. 2018. Safety-aware Adaptive Reinforcement Learning with Applications to Brushbot Navigation. arXiv preprint arXiv:1801.09627.
- [Perkins and Barto2003] Perkins, T. J., and Barto, A. G. 2003. Lyapunov design for safe reinforcement learning. Journal of Machine Learning Research.
- [Peters and Schaal2008] Peters, J., and Schaal, S. 2008. Reinforcement learning of motor skills with policy gradients. Neural Networks.
- [Rasmussen and Williams2006] Rasmussen, C. E., and Williams, C. K. 2006. Gaussian Processes for Machine Learning.
- [Schulman et al.2015] Schulman, J.; Levine, S.; Moritz, P.; Jordan, M.; and Abbeel, P. 2015. Trust Region Policy Optimization. In International Conference on Machine Learning (ICML).
- [Silver et al.2014] Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; and Riedmiller, M. 2014. Deterministic Policy Gradient Algorithms. Proceedings of the 31st International Conference on Machine Learning (ICML-14).
[Snelson and Ghahramani2007]
Snelson, E., and Ghahramani, Z.
Local and global sparse Gaussian process approximations.
Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS).
- [Tang et al.2010] Tang, J.; Singh, A.; Goehausen, N.; and Abbeel, P. 2010. Parameterized maneuver learning for autonomous helicopter flight. In Proceedings - IEEE International Conference on Robotics and Automation.
- [Wabersich and Zeilinger2018] Wabersich, K. P., and Zeilinger, M. N. 2018. Scalable synthesis of safety certificates from data with applications to learning-based control. arXiv preprint arXiv:1711.11417.
- [Wachi et al.2018] Wachi, A.; Sui, Y.; Yue, Y.; and Ono, M. 2018. Safe Exploration and Optimization of Constrained MDPs using Gaussian Processes. 32nd AAAI conference on Artificial Intelligence (AAAI).
- [Wang, Theodorou, and Egerstedt2017] Wang, L.; Theodorou, E. A.; and Egerstedt, M. 2017. Safe Learning of Quadrotor Dynamics Using Barrier Certificates. arXiv preprint arXiv:1710:05472.
Appendix A Appendix A: Proof of Theorem 2
Using the control law from (15), if there exists a solution to problem (16) such that , then the safe set is forward invariant with probability . If , but the solution to problem (16) satisfies for all , then the controller will render the set forward invariant with probability .
Furthermore, if we use TRPO for the RL algorithm, then the control law from (15) achieves the performance guarantee , where and is chosen as in equation (4).
To prove the performance bound in the second part of the theorem, we use the property of the advantage function from equation (20) below:
where . As derived in (Schulman et al. 2015), we can then obtain the following inequality:
where is the total variational distance between policies and , and