1 Introduction
Reinforcement learning (RL) has been extensively studied in the context of closed environments, where it has gained popularity for its success in mastering games such as Atari and Go (Sutton+Barto:1998; Mnih2015HumanlevelCT; silver2017mastering). A pressing need to deploy autonomous agents in the physical world has brought forth a new set of challenges: agents need to be able to interact with their physical environment in a safe and comprehensible manner. This is especially critical in industrial settings (dalal2018safe).
For safety-critical tasks, the trial-and-error nature of exploration in RL often prevents agent deployment in the real world during training, motivating the use of simulators. However, when dealing with complex environments, simulators may fail to sufficiently model the complexity of the environment (Achiam2019BenchmarkingSE). Furthermore, reward functions may be unknown a priori, making learning in simulation impossible. This is where methods that guarantee safe exploration during training offer a substantial advantage.
Our work employs policy gradients and model predictive control (MPC) as its primary building blocks to address the safe RL problem. Policy gradient methods learn a parameterized policy to maximize long-term expected rewards using gradient ascent and play a central role in reinforcement learning due to their ability to handle stochasticity, their favorable convergence properties and training stability, and their efficacy in high-dimensional action spaces (Sutton+Barto:1998). This family of algorithms is also model-free, relying solely on reward signals from the environment without modeling any dynamics. Policy gradient variations have since proliferated under the deep learning paradigm, notably including “natural” policy gradients and actor-critic methods in addition to techniques such as experience replay and importance sampling for better sample efficiency (PETERS2008682; wang2017sample).
Model predictive control is a flexible optimal control framework that has seen success across a wide variety of settings, including process control in chemical plants and oil refineries, power electronics and power system balancing, autonomous vehicles and drones, and building control (qin2003survey; rawlings2009model). It is model-based, requiring the system dynamics to be identified either a priori or through learning (koller2018learningbased). Its interpretability lends itself to robust extensions, where system uncertainties and disturbances can be incorporated to probabilistically guarantee agent safety (koller2018learningbased).
1.1 Related Work

Safety filters are the closest line of work to our proposed algorithm (wabersich2019linear; wabersich2021predictive). This is a decoupled method that takes sampled actions from any base policy and uses an MPC controller as the “safety filter” to correct unsafe behaviors. However, these two components function independently, which may lead to conflicting and potentially oscillatory behaviour between the MPC and RL objectives. The computationally taxing safety filter must also be used at both training and test times, making the technique ill-suited for real-world deployment on constrained hardware.

Constrained reinforcement learning (CRL) aims to formalize the reliability and safety requirements of an agent by encoding these explicitly as constraints within the RL optimization problem. DBLP:journals/corr/AchiamHTA17 proposes a trust-region based policy search algorithm for CRL with guarantees, under some policy regularity assumptions, that the policy stays within the constraints in expectation. This approach cannot be used in applications where safety must be ensured at all visited states. dalal2018safe addresses the CRL problem by adding a safety layer to the policy that analytically solves an action correction formulation for each state. While this approach guarantees constraint satisfaction, it does not yield a safe policy at the end of training. In tessler2018reward, the constraints are embedded as a penalty signal into the reward function, guiding the policy towards a constraint-satisfying solution. Similar to DBLP:journals/corr/AchiamHTA17, safety is not ensured at each state.

Model-based RL methods generally offer higher sample efficiency than their model-free counterparts and can be applied in safety-critical settings with more interpretable safety constraints. This line of work includes learning-based robust MPC (koller2018learningbased). berkenkamp2017safe proposes an algorithm that considers safety in terms of Lyapunov stability guarantees. More specifically, the approach demonstrates how, starting from an initial safe policy, the safe region of attraction can be expanded by collecting data within the safe region and adapting the policy.
1.2 Paper Contributions
Our approach wraps a policy gradient base policy with an MPC-based safety guide that corrects any potentially unsafe actions. The base policy learns to optimize the agent’s long-term behaviour, while the MPC component accounts for state-space safety constraints. By optimizing over an action distribution in the safety guide, we show that adding a safety penalty to the policy gradient loss yields a provably safe test-time base policy. This resolves the tension between the base policy and the safety guide and allows us to remove the computationally expensive safety guide after training.
2 Background
2.1 Notation
Throughout this work, we let $s_t$, $a_t$, and $r_t$ refer to the state, action, and reward at time $t$. A sequence of states and actions is termed a trajectory and denoted by $\tau$, and the sum of rewards over a trajectory is denoted $R(\tau)$. We focus on the setting where $s_t \in \mathbb{R}^n$ and $a_t \in \mathbb{R}^m$. Since our action space is continuous, we represent a stochastic policy as $\pi(\cdot \mid s)$, where $\pi(\cdot \mid s)$ is a Gaussian distribution over actions. More specifically, we can write $\pi(\cdot \mid s) = \mathcal{N}(\mu(s), \Sigma(s))$ for some Gaussian mean $\mu(s)$ and covariance $\Sigma(s)$. The space of such policies is denoted as $\Pi$. When such a policy is parameterized by a vector $\theta$, we use the notation $\pi_\theta$. With some abuse of notation, we write $\tau \sim \pi_\theta$ to denote sampling a trajectory from the policy $\pi_\theta$; similarly, $(s, a) \sim \pi_\theta$ denotes sampling a state from the stationary distribution induced by $\pi_\theta$ and then sampling $a \sim \pi_\theta(\cdot \mid s)$. Furthermore, $\|\cdot\|_2$ denotes the norm within $\mathbb{R}^n$. The symbol $\mathbf{1}_n$ defines an $n$-dimensional column vector of ones, and $\operatorname{tr}(A)$ denotes the trace of the matrix $A$. $\mathbb{E}_p$ is the expectation operator with respect to the probability distribution $p$.
2.2 Policy Gradient
Policy gradient methods attempt to find the optimal parameters $\theta^\star$ for the objective
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau) \right] \qquad (1)$$
The vanilla policy gradient approach performs gradient ascent to maximize this objective (Williams:92).
The gradient can be approximated with the Monte Carlo estimator
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big) \sum_{t'=t}^{T-1} \gamma^{t'} r_{t'}^{(i)} \qquad (2)$$
with $\gamma \in (0, 1)$ a discount factor. While many variance-reduction techniques can be used to improve (2), for simplicity of exposition we employ this basic formulation.
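As a concrete illustration, the following is a minimal numpy sketch of the estimator (2) for a scalar linear-Gaussian policy on a toy system; the dynamics, reward, and all hyperparameters here are illustrative assumptions, not part of our method.

```python
import numpy as np

def sample_trajectory(theta, T=20, sigma=0.1, rng=None):
    """Roll out a scalar linear-Gaussian policy a ~ N(theta * s, sigma^2)
    on the toy system s' = s + a with reward r = -s^2."""
    rng = rng if rng is not None else np.random.default_rng()
    s, traj = 1.0, []
    for _ in range(T):
        a = theta * s + sigma * rng.standard_normal()
        traj.append((s, a, -s**2))
        s = s + a
    return traj

def reinforce_gradient(theta, num_traj=100, gamma=0.99, sigma=0.1):
    """Monte Carlo estimator (2): grad-log-prob weighted by discounted reward-to-go."""
    grad = 0.0
    for _ in range(num_traj):
        traj = sample_trajectory(theta, sigma=sigma)
        rewards = [r for (_, _, r) in traj]
        for t, (s, a, _) in enumerate(traj):
            # For a Gaussian policy: grad_theta log N(a; theta*s, sigma^2) = (a - theta*s) * s / sigma^2.
            glogp = (a - theta * s) * s / sigma**2
            rtg = sum(gamma**k * rewards[k] for k in range(t, len(rewards)))
            grad += glogp * rtg
    return grad / num_traj

theta = -0.1
for _ in range(50):                      # vanilla gradient ascent on J(theta)
    theta += 1e-3 * reinforce_gradient(theta)
```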
2.3 Model Predictive Control
Model predictive control is a purely optimization-based planning framework. Given a dynamics model and a set of state and action constraints (safety requirements, physical limitations, etc.), the finite-horizon MPC problem computes the near-optimal open-loop action sequence that minimizes a specified cumulative cost function. The first of these actions is executed, and the entire optimization repeats on the next time step. While the MPC framework offers concreteness in its constraints, it requires a pre-specified reward function and is incapable of forming reward-maximizing plans beyond its horizon.
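To make the receding-horizon mechanics concrete, here is a minimal sketch of an MPC loop using cvxpy; the double-integrator model, cost, and bounds are placeholder assumptions.

```python
import cvxpy as cp
import numpy as np

# Assumed double-integrator dynamics: position/velocity state, force input.
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])

def mpc_step(s0, horizon=10):
    """Solve the finite-horizon MPC problem and return the first planned action."""
    s = cp.Variable((2, horizon + 1))
    a = cp.Variable((1, horizon))
    cost, constraints = 0, [s[:, 0] == s0]
    for t in range(horizon):
        cost += cp.sum_squares(s[:, t + 1]) + 0.1 * cp.sum_squares(a[:, t])
        constraints += [
            s[:, t + 1] == A @ s[:, t] + B @ a[:, t],  # dynamics model
            cp.abs(s[0, t + 1]) <= 1.0,                # state constraint
            cp.abs(a[:, t]) <= 2.0,                    # input limit
        ]
    cp.Problem(cp.Minimize(cost), constraints).solve()
    return a.value[:, 0]  # execute only the first action

s = np.array([0.8, 0.0])
for _ in range(30):           # receding-horizon execution
    u = mpc_step(s)
    s = A @ s + B @ u         # apply to the (here, perfectly known) system
```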
2.4 Problem Setting
We represent the environment dynamics as a known linear time-invariant system
$$s_{t+1} = A s_t + B a_t \qquad (3)$$
with initial state $s_0$, dynamics matrix $A \in \mathbb{R}^{n \times n}$, and input matrix $B \in \mathbb{R}^{n \times m}$. The safety requirements are captured by a polyhedral state safe set $\mathcal{S} \subset \mathbb{R}^n$. The goal is to learn a policy which maximizes the cumulative reward signal while ensuring that exploration during training is safe at all times, i.e., $s_t \in \mathcal{S}$ for all $t$.
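As a concrete instance of this setting, the following sketch encodes an assumed double-integrator system in the form (3) together with a polyhedral safe set $\mathcal{S} = \{s : Hs \leq b\}$; all matrices and bounds are illustrative.

```python
import numpy as np

dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])   # dynamics matrix
B = np.array([[0.0], [dt]])             # input matrix

# Polyhedral safe set S = {s : H s <= b}: position in [-1, 1], speed in [-2, 2].
H = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.array([1.0, 1.0, 2.0, 2.0])

def is_safe(s):
    return bool(np.all(H @ s <= b))

s = np.array([0.5, 0.0])
a = np.array([1.0])
s_next = A @ s + B @ a   # one step of s_{t+1} = A s_t + B a_t
assert is_safe(s_next)
```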
3 Method
A high-level overview of our method combining policy gradient learning and model predictive control is displayed in Figure 1. We first outline the construction of the safety guide, which solves a chance-constrained MPC optimization to enforce the safety of actions proposed by the underlying base policy. This allows for guaranteed safety during training time with arbitrarily high probability. Section 3.2 discusses how the safety guide is incorporated into the overarching policy optimization.
3.1 Safety Guide Design
The safety guide solves a convex MPC problem at each time step during training to ensure system safety. The safety guide is no longer needed at test time, a claim we justify theoretically in Section 4. We begin by making the following assumption.
Assumption 1
There exists a polyhedral terminal safe set $\mathcal{S}_f \subseteq \mathcal{S}$ that is invariant, meaning that for any state $s \in \mathcal{S}_f$, there exists a sequence of control inputs that keeps the system in $\mathcal{S}_f$ for all subsequent time steps.
The construction of invariant sets has a well-established theory due to its applications in systems and control. For linear systems, several recursive algorithms have been proposed to construct polyhedral invariant sets (83532; 1470058), with nonlinear systems considered in (7084969; korda2013convex).
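As a point of reference, a generic recursion of this kind (a standard backward-reachability iteration, not necessarily the exact scheme of the cited works) computes the maximal control-invariant subset of $\mathcal{S}$ by iterating
$$\Omega_0 = \mathcal{S}, \qquad \Omega_{k+1} = \left\{ s \in \Omega_k \;:\; \exists\, a \ \text{such that} \ As + Ba \in \Omega_k \right\},$$
terminating when $\Omega_{k+1} = \Omega_k$. For linear dynamics and polyhedral $\Omega_k$, each iterate remains polyhedral, and the iteration step reduces to a polyhedral projection.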
Algorithm 1 specifies the safety guide optimization problem. Intuitively, the safety guide attempts to find an action distribution that is as close as possible to the one output by the base policy, subject to safety constraints. Taking inspiration from techniques in the obstacle avoidance literature (blackmore2011chance), we formulate this in a chance-constrained model predictive fashion.
3.1.1 Variables
The optimization variables consist of a sequence of state means and open-loop control actions over a planning horizon of length $T$, with the first action containing some uncertainty represented by $\bar{\Sigma}_a$. The bar over any $\bar{\Sigma}$ denotes that this matrix is related to the relevant covariance matrix via $\Sigma = \bar{\Sigma}\bar{\Sigma}^\top$. This decomposition allows for subsequent chance constraints to be expressed as closed-form convex constraints. Since we are interested in allowing the base policy to have as much freedom as possible, we avoid the additional conservatism that would result from incorporating uncertainty over future actions and allow these to be chosen deterministically.
3.1.2 Objective
The safety guide objective minimizes the KL divergence between the base policy action distribution and the distribution of the MPC’s first action. If the base policy distribution allows for subsequent actions that maintain safety, the objective vanishes and the returned safe distribution is simply the original distribution specified by the base policy. The KL divergence is not symmetric; we choose this argument order to make the objective a convex function of the variables $\mu_a$ and $\bar{\Sigma}_a$. To see this, consider the following form for the KL divergence, dropping references to $s$ for notational convenience:
$$D_{\mathrm{KL}}\big(\mathcal{N}(\mu_a, \Sigma_a) \,\|\, \mathcal{N}(\mu_b, \Sigma_b)\big) = \frac{1}{2}\left[ \ln\det\Sigma_b - \ln\det\Sigma_a - m + \operatorname{tr}\!\left(\Sigma_b^{-1}\Sigma_a\right) + (\mu_b - \mu_a)^\top \Sigma_b^{-1} (\mu_b - \mu_a) \right].$$
Recall that symbols subscripted by $b$ are constants in the optimization, while symbols subscripted by $a$ are our optimization variables. Therefore we can disregard the constant terms $\ln\det\Sigma_b$ and $-m$. Convexity of $-\ln\det\Sigma_a = -2\ln\det\bar{\Sigma}_a$ follows from multiplicative properties of the determinant and concavity of the $\operatorname{logdet}$ operator. For the remaining terms, we assume that $\Sigma_b$ is positive definite, a practically satisfied assumption. The fourth term can then be rewritten as $\|M\bar{\Sigma}_a\|_F^2$ with $M = \Sigma_b^{-1/2}$, which is a convex function composed with a linear function and is therefore convex. Finally, the last term is simply a quadratic form with a positive definite matrix and is therefore convex in $\mu_a$.
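The resulting convex program can be sketched directly; below is a minimal cvxpy rendering of the non-constant KL terms above, with illustrative base-policy parameters and with the chance constraints of Section 3.1.4 omitted.

```python
import cvxpy as cp
import numpy as np

m = 2                                      # action dimension (illustrative)
mu_b = np.array([0.5, -0.2])               # base policy mean (constant)
Sigma_b = np.array([[0.2, 0.05], [0.05, 0.1]])
w, V = np.linalg.eigh(Sigma_b)
M = V @ np.diag(w ** -0.5) @ V.T           # M = Sigma_b^{-1/2}

mu_a = cp.Variable(m)                      # mean of the returned safe distribution
Sbar_a = cp.Variable((m, m))               # covariance factor, Sigma_a = Sbar_a Sbar_a^T

# The non-constant KL terms, matching the expansion above:
kl = 0.5 * (
    cp.sum_squares(M @ Sbar_a)             # tr(Sigma_b^{-1} Sigma_a) = ||M Sbar_a||_F^2
    + cp.sum_squares(M @ (mu_b - mu_a))    # quadratic form in mu_a
    - 2 * cp.sum(cp.log(cp.diag(Sbar_a)))  # -log det Sigma_a (Sbar_a lower triangular)
)
# Force Sbar_a lower triangular so its diagonal determines the determinant.
constraints = [Sbar_a[i, j] == 0 for i in range(m) for j in range(i + 1, m)]
# The chance constraints of Section 3.1.4 would be appended here.
cp.Problem(cp.Minimize(kl), constraints).solve()
print(mu_a.value)                          # recovers mu_b when unconstrained
```

Absent the safety constraints, the optimum recovers the base distribution, matching the observation that the objective vanishes when the base policy is already safe.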
3.1.3 Dynamics
The state propagation equations follow from known properties of linear transformations of Gaussian random variables (liu2019linear). Since actions after index $0$ are entirely deterministic, we can express state uncertainty at future time steps directly as linear functions of the initial action uncertainty $\bar{\Sigma}_a$. This parallels results in the chance-constrained path planning literature (blackmore2011chance).
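A small numpy sketch of this propagation, reusing the illustrative double-integrator model from earlier: the factor of the state covariance is simply mapped forward by the dynamics.

```python
import numpy as np

dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])   # illustrative double integrator
B = np.array([[0.0], [dt]])

s0 = np.array([0.5, 0.0])
mu_a = np.array([0.1])                  # first-action mean
Sbar_a = np.array([[0.3]])              # first-action covariance factor

# After the stochastic first action: s_1 ~ N(A s0 + B mu_a, (B Sbar_a)(B Sbar_a)^T).
mu = A @ s0 + B @ mu_a
Sigma_bar = B @ Sbar_a                  # factor of Cov(s_1)

# Deterministic later actions shift the mean; uncertainty is mapped linearly by A.
for a in [np.array([0.0])] * 4:
    mu = A @ mu + B @ a
    Sigma_bar = A @ Sigma_bar           # so Cov(s_k) = Sigma_bar Sigma_bar^T

print(Sigma_bar @ Sigma_bar.T)          # state covariance at the horizon
```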
3.1.4 Safety Constraint
The safety constraints for $\mathcal{S}$ and $\mathcal{S}_f$ can be handled similarly. Consider the chance constraint $\Pr(s \notin \mathcal{S}) \leq \epsilon$, with $s \sim \mathcal{N}(\mu, \bar{\Sigma}\bar{\Sigma}^\top)$. Evaluating such a constraint using sampling would require prohibitively many samples for small $\epsilon$ and result in a nonconvex optimization problem. We instead leverage techniques from chance-constrained optimization to represent this constraint deterministically. Let the polyhedral safe set be defined by $K$ linear inequalities as
$$\mathcal{S} = \left\{ s : h_i^\top s \leq b_i, \; i = 1, \dots, K \right\}.$$
Deriving a tight closed-form expression for a joint constraint over multiple linear inequalities is a nontrivial problem that is typically handled by an approximation scheme (cheng2012second). We conservatively bound the probability of violating each inequality by $\epsilon / K$, noting that by the union bound this implies
$$\Pr(s \notin \mathcal{S}) \leq \sum_{i=1}^{K} \Pr\!\left(h_i^\top s > b_i\right) \leq \epsilon.$$
We now aim to derive a closed-form counterpart for constraints of the form
$$\Pr\!\left(h^\top s > b\right) \leq \epsilon / K.$$
Since $s$ is normally distributed, $h^\top s$ is a scalar Gaussian, and this constraint is equivalent to the deterministic constraint
$$h^\top \mu + \Phi^{-1}(1 - \epsilon/K)\, \big\|\bar{\Sigma}^\top h\big\|_2 \leq b,$$
where $\Phi$ is the standard Gaussian CDF (duchilecturenotes). Each of our constraints now becomes a second-order cone constraint and can be handled by conventional convex optimization solvers. If the original problem is infeasible, we relax these constraints with slack variables which we linearly penalize in the objective.
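This reformulation is easy to verify numerically; the sketch below evaluates the deterministic equivalent with scipy's inverse Gaussian CDF and checks it against Monte Carlo sampling (all numbers illustrative). Inside the safety guide, the same expression appears as a second-order cone constraint on the optimization variables.

```python
import numpy as np
from scipy.stats import norm

def chance_constraint_ok(h, b, mu, Sigma_bar, eps):
    """Deterministic equivalent of Pr(h^T s > b) <= eps for s ~ N(mu, Sigma_bar Sigma_bar^T)."""
    return h @ mu + norm.ppf(1 - eps) * np.linalg.norm(Sigma_bar.T @ h) <= b

# Example: require position <= 1 with violation probability at most 0.01.
h, b_lim, eps = np.array([1.0, 0.0]), 1.0, 0.01
mu = np.array([0.6, 0.1])
Sigma_bar = np.array([[0.1, 0.0], [0.0, 0.05]])
print(chance_constraint_ok(h, b_lim, mu, Sigma_bar, eps))   # True here

# Monte Carlo check of the same violation probability:
samples = np.random.default_rng(0).multivariate_normal(mu, Sigma_bar @ Sigma_bar.T, 100_000)
print(np.mean(samples @ h > b_lim))                         # well below eps
```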
3.2 Policy Gradient with Safety Penalty
We now modify the standard policy gradient formulation (1) to include a term penalizing corrections by the safety guide, effectively training the base policy to behave safely. Letting $\bar{\pi}_\theta$ denote the safe policy obtained by passing $\pi_\theta$ through the safety guide, our objective becomes
$$J_{\mathrm{safe}}(\theta) = \mathbb{E}_{\tau \sim \bar{\pi}_\theta}\left[ R(\tau) \right] - \lambda\, \mathbb{E}_{(s, a) \sim \bar{\pi}_\theta}\left[ D\big(\pi_\theta(\cdot \mid s),\, \bar{\pi}_\theta(\cdot \mid s)\big) \right] \qquad (4)$$
where $D$ is a positive definite statistical distance which is continuous in $s$ for every $\theta$, and $\lambda > 0$ is a regularization parameter. For notational convenience, the expectation $(s, a) \sim \bar{\pi}_\theta$ draws from the stationary state distribution induced by $\bar{\pi}_\theta$ and the associated action distribution. We show in Section 4.2 that any positive definite, continuous $D$ results in a safe base policy after training. We choose the squared parameter distance for its numerical properties:
$$D\big(\pi_\theta(\cdot \mid s),\, \bar{\pi}_\theta(\cdot \mid s)\big) = \big\|\mu_{\pi_\theta}(s) - \mu_{\bar{\pi}_\theta}(s)\big\|_2^2 + \big\|\Sigma_{\pi_\theta}(s) - \Sigma_{\bar{\pi}_\theta}(s)\big\|_F^2.$$
We can now obtain our optimal parameters using gradient ascent on a Monte Carlo estimator similar to (2) with an added term for the safety penalty.
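For intuition, here is a minimal sketch of the penalized update for the scalar linear-Gaussian policy of the earlier sketch, where the squared parameter distance reduces to a squared difference of action means (fixed variance assumed); the safe means would be those returned by the safety guide at the visited states.

```python
import numpy as np

def safety_penalty_grad(theta, states, safe_means):
    """Gradient of the penalty in (4) for a scalar linear-Gaussian policy with
    mean theta * s and fixed variance, where D = (theta*s - safe_mean)^2."""
    return np.mean([2.0 * (theta * s - m) * s for s, m in zip(states, safe_means)])

def safe_pg_step(theta, reward_grad, states, safe_means, lr=1e-3, lam=1.0):
    """One ascent step on (4): reward gradient minus lam times the penalty gradient."""
    return theta + lr * (reward_grad - lam * safety_penalty_grad(theta, states, safe_means))

# Usage with any Monte Carlo reward-gradient estimate such as (2):
states = np.array([0.4, -0.3, 0.8])
safe_means = np.array([0.1, 0.0, -0.2])   # returned by the safety guide
theta = safe_pg_step(theta=0.5, reward_grad=0.7, states=states, safe_means=safe_means)
```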
4 Theoretical Analysis
We show that our policy leads to safe exploration at training time with arbitrarily high probability. We then prove that coupling reward maximization with a safety penalty in (4) leads to a safe optimal base policy. This is highly desirable as it eliminates conflict between the base policy and the safety guide, mitigates distributional shift, and reduces the computational burden on the agent at test time.
4.1 Training Time Safety
Consider a standard episodic training setting where an episode terminates after a set number of time steps or upon violation of the state safety constraints.
Consider an arbitrary natural number $N$ and safety tolerance $\epsilon$ from Algorithm 1. Then over $N$ training steps, the expected number of states $s_t$ such that $s_t \notin \mathcal{S}$ is at most $\epsilon N$.
This follows directly from the constraints on the optimization problem in Algorithm 1. Specifically, there is at most an $\epsilon$ chance of sampling an action from the safe distribution that leads to an unsafe state, in which case the episode ends in at most one time step. Assumption 1 guarantees that with probability $1 - \epsilon$ the action sampled will be safe and subsequent optimizations will remain feasible.
Since $\epsilon$ is a design parameter, this expectation can be driven to be arbitrarily small, at the cost of imposing additional conservatism on the exploration process. In practice, this quantity can be effectively set to zero by a small concession on the size of the safe sets $\mathcal{S}$ and $\mathcal{S}_f$. Shrinking these by some factor gives the safe policy a buffer to the true unsafe region, allowing it to recover from unsafe actions by softening the chance inequality constraints in Algorithm 1. Our experiments in Section 5 use this technique to maintain perfect safety over the course of a million training steps.
4.2 Base Policy Safety
In order to derive theoretical guarantees for the optimal policy of (4), we introduce two assumptions.
Assumption 2
The parameterized base policy class is a universal approximator. Namely, for every policy $\pi \in \Pi$ and desired accuracy $\delta > 0$, there exists a parameterized $\pi_\theta$ such that
$$\sup_{s, a}\, \big| \pi_\theta(a \mid s) - \pi(a \mid s) \big| \leq \delta.$$
Assumption 3
The reward $R(\tau)$ is bounded over all trajectories $\tau$.
Assumption 2 parallels a standard assumption in the deep learning literature that a richly parameterized network is arbitrarily expressive. Assumption 3 is similarly benign, and is immediately satisfied in a typical setting where rewards are bounded and trajectories are finite.
Lemma. For every $\pi \in \Pi$ and $\varepsilon > 0$, there exists a learned parameterization $\pi_\theta$ such that
$$\big| J_{\mathrm{RL}}(\pi) - J_{\mathrm{RL}}(\pi_\theta) \big| \leq \varepsilon,$$
where $J_{\mathrm{RL}}(\pi) = \mathbb{E}_{\tau \sim \pi}[R(\tau)]$ is the standard reinforcement learning objective. Let $\rho_0$ be the initial state distribution and $p(s_{t+1} \mid s_t, a_t)$ represent the environment transition dynamics. For notational simplicity, we define the trajectory density
$$p_\pi(\tau) = \rho_0(s_0) \prod_{t=0}^{T-1} \pi(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t).$$
Then we can write
$$J_{\mathrm{RL}}(\pi) - J_{\mathrm{RL}}(\pi_\theta) = \int R(\tau) \left[ p_\pi(\tau) - p_{\pi_\theta}(\tau) \right] d\tau, \qquad (5)$$
where the integral is taken over all trajectories. Assumption 2 implies that there exists a parameter vector $\theta$ such that $|p_\pi(\tau) - p_{\pi_\theta}(\tau)|$ can be bounded for all $\tau$ by an arbitrarily small quantity. Since $R(\tau)$ in (5) is bounded by Assumption 3 and probability distributions integrate to $1$, the difference $|J_{\mathrm{RL}}(\pi) - J_{\mathrm{RL}}(\pi_\theta)|$ can be driven arbitrarily close to zero.
Lemma 4.2 relates the universal approximation properties from Assumption 2 to the reward incurred by the policy. We now proceed with the main theoretical result.
Theorem. The optimal parameter vector $\theta^\star$ which maximizes (4) is such that the base policy $\pi_{\theta^\star}$ is safe; i.e., the equality $\pi_{\theta^\star}(\cdot \mid s) = \bar{\pi}_{\theta^\star}(\cdot \mid s)$ holds except on a set of measure zero with respect to the stationary state density function induced by $\bar{\pi}_{\theta^\star}$ in a Radon-Nikodym sense. To prove by contradiction, assume that $\pi_{\theta^\star}$ is not safe. Define the set of states where the policy diverges from its safe representation as
$$\mathcal{U} = \left\{ s : D\big(\pi_{\theta^\star}(\cdot \mid s),\, \bar{\pi}_{\theta^\star}(\cdot \mid s)\big) > 0 \right\},$$
with some abuse of notation. Now, consider the measure $\nu$ induced by the stationary state density function of $\bar{\pi}_{\theta^\star}$. Since probability distributions integrate to one, $\nu$ is finite on compact sets; Euclidean space is also locally compact Hausdorff and second countable, and hence we have $\nu$ regular (Theorem 7.8 in folland1999real).
Since $D$ is continuous in $s$ by assumption, $\mathcal{U}$ is the inverse image of an open set under a continuous function and is therefore open. Regularity of $\nu$ and $\nu(\mathcal{U}) > 0$ (by assumption) imply that there exists a compact set $C \subseteq \mathcal{U}$ such that $\nu(C) > 0$. Since continuous functions attain their minimum over compact sets, we have that $D\big(\pi_{\theta^\star}(\cdot \mid s),\, \bar{\pi}_{\theta^\star}(\cdot \mid s)\big) \geq d$ for all $s \in C$ for some $d > 0$.
We now show that the difference in objectives between the safe and base policies is given by
$$J_{\mathrm{safe}}\big(\bar{\pi}_{\theta^\star}\big) - J_{\mathrm{safe}}(\theta^\star) = \lambda\, \mathbb{E}_{(s, a) \sim \bar{\pi}_{\theta^\star}}\left[ D\big(\pi_{\theta^\star}(\cdot \mid s),\, \bar{\pi}_{\theta^\star}(\cdot \mid s)\big) \right] \geq \lambda\, d\, \nu(C) > 0.$$
Observe that the state-action marginal in the expectation (4) is always taken with respect to $\bar{\pi}_{\theta^\star}$; therefore, the reward terms vanish and the safety penalty is the only remaining term. By the previous discussion, this is at least $\lambda\, d\, \nu(C)$, providing the desired expression.
Finally, we invoke Lemma 4.2 to construct a policy $\pi_{\theta'}$ such that $\big| J_{\mathrm{RL}}(\pi_{\theta'}) - J_{\mathrm{RL}}(\bar{\pi}_{\theta^\star}) \big| < \lambda\, d\, \nu(C)$, noting that the safety penalty in (4) can be driven arbitrarily close to zero by Assumption 2 and continuity of $D$. This implies $J_{\mathrm{safe}}(\theta') > J_{\mathrm{safe}}(\theta^\star)$, which is a contradiction.
Theorem 4.2 shows that the optimal parameters for our objective (4) produce a safe base policy $\pi_{\theta^\star}$. Provided that gradient ascent effectively maximizes (4), we can be confident that the policy has learned to behave safely and no longer requires the safety guide. This has three key advantages.

Harmony between the base policy and safety guide. Without a safety penalty, there is limited incentive for the base policy to learn to correct its own unsafe actions; the executed actions and ensuing rewards are always drawn from the action distribution of the safety guide. As noted in koller2018learningbased, this decoupling can lead to a perpetual conflict between the base policy and the safety guide, with the base policy constantly approaching the boundaries of the safe set and the guide constantly correcting. Theorem 4.2 shows that our method resolves this issue.

Mitigation of distributional shift. One potential concern with this method involves distributional shift; our policy gradient step updates the base policy, while rewards are sampled using the safe policy. Theorem 4.2 implies that as training progresses, the distributional shift between these two policies decays to zero.

Reduction of computational burden. Solving the safety guide optimization problem requires significant computational effort. Theorem 4.2 shows that the safety guide can be removed at test time without compromising safety. This can free up agent resources for other tasks.
We note that in the setting where (4) is not completely maximized, the safety penalty can still be concretely evaluated in any region of the state space. This provides the designer of the system with a quantitative measure of the level of safety of the base policy, as well as insights into which regions of the state space are most dangerous.
5 Numerical Experiments
[Figure: (b) Test-time performance of a policy gradient agent trained with and without the safety guide on the double integrator task; thick lines show mean performance over five runs, with shaded areas representing the standard deviation. (c) Test-time average episode length; the policy trained with the safety guide achieves the maximum episode length even when the safety guide is removed, indicating that the base policy has learned to behave safely.]
Consider a two-dimensional quadrotor with state $s = [x, z, \phi, \dot{x}, \dot{z}, \dot{\phi}]^\top$, where $(x, z)$ is the quadrotor position and $\phi$ is the counterclockwise angle to the vertical. The episode terminates if the quadrotor hits the ground or tilts more than a fixed threshold angle. For early termination, the reward penalizes impact speed for hitting the ground or rotational speed for excessive tilt. Otherwise, the quadrotor is incentivized to hover close to the ground while remaining centered horizontally. The control inputs are $a = [u_1, u_2]^\top$, with $u_1$ the vertical thrust and $u_2$ the torque. Using a fixed time step, we simulate the system using the following linearized dynamics about the hovering equilibrium:
$$\ddot{x} = -g\phi, \qquad \ddot{z} = \frac{u_1}{m} - g, \qquad \ddot{\phi} = \frac{u_2}{I},$$
for mass $m$, inertia $I$, and gravity $g$. We design our safety set $\mathcal{S}$ with bounds on the quadrotor's position and tilt. Our terminal safe set $\mathcal{S}_f$ consists of the same position bounds as well as a position-dependent velocity bound that captures the maximum velocity that can be brought to zero by the end of the corresponding safe set interval. Since this curve scales with the square root of distance, we inner-approximate it by a polytope (Figure 2). The safety tolerance $\epsilon$, planning horizon $T$, and safety penalty coefficient $\lambda$ are fixed hyperparameters. Our policy network consists of two hidden layers with nonlinear activations. We collect a fixed number of steps per batch with a capped episode length, training with a constant learning rate and discount factor $\gamma$. The safety guide optimization problem is solved using MOSEK (mosek).
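For reference, the planar-quadrotor linearization above can be discretized with a zero-order hold as sketched below; the mass, inertia, and time step are placeholder values, not those used in our experiments.

```python
import numpy as np
from scipy.signal import cont2discrete

m, I, g, dt = 1.0, 0.1, 9.81, 0.05        # placeholder physical parameters
# State [x, z, phi, xdot, zdot, phidot]; inputs are deviations of thrust and torque from hover.
Ac = np.zeros((6, 6))
Ac[:3, 3:] = np.eye(3)                    # positions integrate velocities
Ac[3, 2] = -g                             # xddot = -g * phi
Bc = np.zeros((6, 2))
Bc[4, 0] = 1.0 / m                        # zddot = u1 / m
Bc[5, 1] = 1.0 / I                        # phiddot = u2 / I
A, B, *_ = cont2discrete((Ac, Bc, np.eye(6), np.zeros((6, 2))), dt)
```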
Our training approach achieved perfect safety over a training run of a million steps without compromising performance (Figure 2). Furthermore, Figure 2 shows that the safety-guide-trained policy rapidly achieves the optimal average episode length even when the safety guide is removed. This suggests that the safety penalty effectively induces the base policy to behave safely without having to try unsafe actions.
6 Conclusion
This work addresses the challenge of safe RL using a novel approach that combines a policy gradient agent with a chanceconstrained MPC safety guide. The safety guide receives as input the proposed action distribution from the base policy and imposes additional safety requirements. By design, the safety guide intervenes minimally and modifies the base policy’s proposed action distribution only if it inevitably leads towards an unsafe region of the state space. An additional safety penalty on these corrections in the overall objective allows us to provide theoretical guarantees that our base policy learns to behave safely without having to explore unsafe actions. We empirically justify our proposed method through numerical experiments on a quadrotor control task.