Reinforcement Learning (RL) is a tool for tackling optimal control from data. RL methods seek to increase the closed-loop performance of the control policy deployed on the system as observations are collected. RL methods often rely on Deep Neural Networks (DNN) to support the policy approximation. Control policies based on DNNs provide limited opportunities for formal verification of the resulting closed-loop behavior, and for imposing hard constraints on the evolution of the state of the real system. The development of safe RL methods is currently an open field of research (Garcia2015).
In order to tackle safety issues in RL, it has recently been proposed, see (Wabersich2019) and references therein, to project the inputs delivered by the RL policy into a safe set, which is known by construction to ensure the safety of the system. The construction of the safe set can, e.g., rely on specific knowledge of the system, or on robust model predictive control techniques. The projection then operates as a safeguard that prevents RL from taking unsafe decisions, adopting the safe decision that is closest to the RL policy whenever the RL policy is unsafe.
In this paper, we investigate the interaction between these safe policy projections and the learning process deployed by RL. We show that because the projection modifies the policy delivered by RL, it can disrupt the learning process such that the learned policy becomes suboptimal. The problem occurs both in the context of Q-learning and in policy gradient approaches using actor-critic methods. We then propose simple techniques to alleviate the problem. In the context of Q-learning, we show that the projection technique in general jeopardizes optimality, as it is the projection of a (possibly) optimal policy on a set, and that the problem is best alleviated by relying on a direct minimization of the Q function learned by RL, under the safety constraint that the inputs must belong to the safe set, as proposed in (Zanon2019b). In the context of deterministic policy gradient approaches, we show that, in order to prevent the projection from biasing the policy gradient estimations, the actor-critic method must be modified with a correction that is simple to deploy. In the context of stochastic policy gradient methods, we show that the actor-critic scheme must be constructed in a particular way to prevent the projection from biasing the policy gradient estimations. We finally show that these results extend to the case of a projection performed via robust Model Predictive Control (MPC) techniques.
The paper is structured as follows. Section 2 provides some background material. Section 3 details the projection approach in the context of Q-learning, and proposes an approach to address the resulting difficulties. Section 4 details the projection approach for policy gradient methods, both deterministic and stochastic, and proposes simple actor-critic formulations that prevent the projection from biasing the policy gradient estimations. Section 5 extends the results to the case in which the projection is performed via robust MPC. Section 6 proposes a simple simulation example using robust linear MPC in the stochastic policy gradient case, and Section 7 provides conclusions.
In the following, we will consider that the dynamics of the real system are possibly stochastic, evolving on continuous state-input spaces. We will furthermore consider stochastic policies, taking the form of conditional probability densities denoting the probability density of selecting a given input when the system is in a given state. We will also consider deterministic policies delivering the input directly as a function of the state. For a given stage cost and a discount factor, the performance of a policy is assessed via the total discounted expected cost
where the expected value is taken over the closed-loop trajectories of the system under the policy, including the initial conditions.
In the deterministic policy case, the policy density in (1) takes the form of a Dirac distribution centered at the deterministic policy. The optimal policy associated with the state transition, the stage cost and the discount factor is deterministic and given by
Reinforcement Learning seeks to find the parameters such that the parametrized policies, whether stochastic or deterministic, closely approximate the optimal policy, using observed state transitions. Q-learning methods build the optimal policy approximation indirectly, as the minimizer (Sutton2018):
where the parametrized action-value function is an approximation of the true optimal action-value function, solution of the Bellman equations (Bertsekas2007):
The approximation is built using Temporal-Difference or Monte-Carlo techniques.
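As an illustration, a minimal Temporal-Difference sketch for a linear Q function approximation could look as follows; the feature map, learning rate and discount factor are illustrative assumptions, not quantities from this paper.

```python
import numpy as np

# Hedged sketch: TD(0) update for a linear Q approximation
# Q_theta(s, a) = theta^T phi(s, a). The features phi and all
# numerical values below are illustrative assumptions.
def td_update(theta, phi_sa, cost, phi_next_best, gamma=0.9, alpha=0.1):
    """One semi-gradient TD(0) step toward the Bellman target."""
    q = theta @ phi_sa                               # current estimate Q_theta(s, a)
    target = cost + gamma * (theta @ phi_next_best)  # bootstrapped Bellman target
    return theta + alpha * (target - q) * phi_sa     # move Q toward the target
```

Repeating such updates over observed state transitions (approximately) solves the least-squares fitting of the Bellman residual.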
In contrast, policy gradient techniques manipulate the policy parameters directly, according to the policy gradient (Sutton1999). Actor-critic techniques evaluate the policy gradient resulting from a stochastic policy as (Sutton1999)
where the advantage function associated with the policy is defined as
where the value and action-value functions are those associated with the policy under consideration.
Similarly, the policy gradient associated with a deterministic policy reads as (Silver2014)
where the advantage function is defined by (6)-(7), taken over a Dirac-like policy density corresponding to the deterministic policy. The advantage functions can be estimated using Temporal-Difference or Monte-Carlo techniques.
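For concreteness, a single-sample Temporal-Difference estimate of the advantage can be sketched as follows; the learned value function and the numerical values are illustrative assumptions.

```python
# Hedged sketch: one-sample TD estimate of the advantage,
# A(s, a) ~ c(s, a) + gamma * V(s') - V(s), with V a learned critic.
def td_advantage(cost, v_state, v_next_state, gamma=0.9):
    """TD residual as an unbiased single-sample advantage estimate."""
    return cost + gamma * v_next_state - v_state
```

Averaging such residuals over observed transitions yields the advantage estimates used in the policy gradient expressions above.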
In the context of Reinforcement Learning, enforcing the safety of the inputs generated by a policy is not trivial (Garcia2015). Indeed, for safety-critical systems, discovering unsafe inputs from experiments is overly costly, and such inputs are typically rather identified via extensive simulation campaigns. As an alternative, recent publications have proposed to approach the safety problem underlying RL by adding a safety layer to the RL process, which serves as a safeguard to the policy, see (Wabersich2019) and references therein. We detail that approach next.
2.1 Safe Policy
In this paper, we consider Reinforcement Learning subject to safety limitations. More specifically, we will consider constraints of the form:
that must be respected at all time in order for the system safety to be ensured. Moreover, we will consider a (possibly) state-dependent safe set such that
entails that (9) is satisfied at all times. We ought to stress here the difference between (9) and (10). Satisfying (9) at a given time entails that the system is safe at that time, while the safe set is such that enforcing (10) at a given time entails that the system safety can be guaranteed at all times in the future. In the following, we will assume that the safe set can be described via inequality constraints on the state and input, typically different from (9):
The safe set can be complex and non-convex. Let us additionally label the set of states for which the safe set is non-empty. In some applications, the safe set can be computed explicitly using reachability analysis, but that can be prohibitively difficult in general. Inner convex approximations can then be needed. An approach based on an implicit representation has been the object of recent publications (Zanon2019b; Gros2020a).
Assuming that a safe set is available, a natural approach to ensure the feasibility of a policy learned via Reinforcement Learning techniques is to perform a projection into the safe set, i.e., to solve online the problem:
hence seeking the closest safe input to the RL policy under the Euclidean norm. While (12) imposes safety by construction, the optimality of the projected policy is, in general, not guaranteed if the policy is obtained via RL techniques that disregard the fact that the projection (12) takes place. The resulting optimality loss is arguably problem-dependent, and not investigated here. In this paper, we will focus on how (12) can be combined with RL such that optimality of the projected policy is achieved.
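For a polytopic safe set, the projection (12) reduces to a small quadratic program. A minimal sketch, assuming the safe set is given as the polytope of inputs u with A u <= b (the data A and b are hypothetical inputs here), could read:

```python
import numpy as np
from scipy.optimize import minimize

# Hedged sketch of the safety projection (12): find the input closest
# (in the Euclidean norm) to the RL action inside the polytope A u <= b.
# The polytope data A, b are assumed to be available.
def project_to_safe_set(u_rl, A, b):
    res = minimize(
        lambda u: 0.5 * np.sum((u - u_rl) ** 2),                   # squared distance to u_rl
        x0=np.zeros_like(u_rl),
        constraints={"type": "ineq", "fun": lambda u: b - A @ u},  # enforces A u <= b
    )
    return res.x
```

For instance, with the safe interval [-1, 1] encoded as two inequalities, an unsafe RL action of 2 is mapped back to the boundary at 1.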
3 Safe Q-learning via projection
In this section we consider the deployment of the Q-learning technique under the safety limitation (11). The minimization in (4a) is then restricted to the safe set. In the context of Q-learning, one seeks to adjust the parameters supporting the Q function approximation such that it matches the true optimal action-value function in some sense. The parameters are typically adjusted using Temporal-Difference (TD) or Monte-Carlo techniques, aimed at (approximately) solving the least-squares problem
In a safe-learning context, the expected value in (13) is restricted to the safe state-input set, such that the match between the Q function approximation and the true optimal action-value function may only hold on that set. The RL policy is then selected according to (3). Let us then investigate the effect of applying the projection (12) to the policy obtained from (3). To that end, let us introduce a trivial but useful result.
By contradiction. Let us assume there is a safe policy that achieves a better closed-loop performance than the proposed policy on the safe set. Because that policy is safe, it follows that
If it achieves a better closed-loop performance, and since the Q function approximation matches the true optimal action-value function over the safe state-input set, then there is a state-input pair such that:
Note that because the match between the Q function approximation and the true optimal action-value function may not hold outside of the safe state-input set, the approximation may take its minimum outside of the safe set for some states. As a result, constraint (14b) is required in order to generate a safe policy.
3.1 Projection Approach for Q-Learning
Consider the projection (12) of the policy (3) obtained via Q-learning. We ought to first observe that if the Q function approximation matches the true optimal action-value function over the safe state-input set, the projected policy is optimal whenever the learned policy is itself safe. Unfortunately, this observation does not necessarily extend to the situation where the learned policy is unsafe. In order to support this observation, let us consider the trivial example displayed in Fig. 1. This shows that the optimality of the projected policy does not hold in general.
However, Lemma 1 readily delivers a way to alleviate this problem: assuming that a Q function approximation over the safe state-input set has been learned, a safe policy can be devised from it using the constrained minimization (14) as opposed to the generic projection (12). One must then be careful to include the input restriction in the evaluation of the TD error underlying the Q-learning. When using SARSA, no special care needs to be taken in the learning process, as (14) generates only inputs in the safe set. An approach to formulate (14) via robust MPC is presented in (Zanon2019b).
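A minimal sketch of this direct minimization, assuming again a polytopic safe set of inputs u with A u <= b and using a stand-in quadratic model for the learned Q function (both are illustrative assumptions), might read:

```python
import numpy as np
from scipy.optimize import minimize

# Hedged sketch of (14): minimize the learned Q_theta(s, .) directly over
# the safe set, instead of projecting the unconstrained minimizer (3).
# q_fun stands in for the learned Q function at a given state.
def safe_greedy_input(q_fun, u0, A, b):
    res = minimize(
        q_fun, x0=u0,
        constraints={"type": "ineq", "fun": lambda u: b - A @ u},  # enforces A u <= b
    )
    return res.x
```

With the stand-in model Q(u) = (u - 2)^2 and the safe interval [-1, 1], the unconstrained minimizer 2 is unsafe and the safe minimizer sits on the boundary at 1.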
This section shows that the direct minimization (14) of the Q function approximation under the safety constraints is arguably better suited than the two-step approach: (3) followed by (12). We ought to extend the discussion to the context of policy gradient methods using actor-critic techniques. This discussion is more technical, and is the object of the next section.
4 Safe policy gradient via projection
Policy gradient methods are often preferred over Q-learning because they alleviate the known issue that solving the least-squares problem (13) does not necessarily imply that one has found the parameters yielding the best closed-loop performance of the policy (3). Indeed, policy gradient methods seek a direct minimization of the closed-loop cost (1) via gradient steps over (1), and therefore yield (at least locally) optimal policy parameters. Similarly to the discussion of Section 3, when deploying policy gradient techniques jointly with the projection on the safe set (12), the optimality of the resulting policy is unclear. As a matter of fact, we will show in this section that the learning process ought to be corrected in order for the estimation of the gradient of (1) to be unbiased. Subsection 4.1 will cover the deterministic policy gradient case, while Subsection 4.2 will cover the stochastic policy gradient case.
4.1 Projected Policy and Deterministic Policy Gradient
In the context of deterministic policies, we will show next that a correction must be applied in the policy gradient computation to account for the safe projection (12). This correction is provided in the following Proposition.
Consider the projection (12), where the distance is measured in the Euclidean norm, and assume that the constraints (12b) satisfy the Linear Independence Constraint Qualification (LICQ) and the strict Second-Order Sufficient Conditions (SOSC). The gradient of the projected policy with respect to the policy parameters then reads as:
where the matrix is a state-dependent orthonormal basis of the null space of the gradients of the strictly active constraints, i.e.:
with the index set gathering the strictly active constraints, and with the Hessian associated with (12).
The solution to (12) satisfies the KKT conditions:
where the multipliers are those associated with the constraints (12b). The Implicit Function Theorem guarantees that, if LICQ and SOSC hold, the gradient of the projected policy reads as:
We then observe that the strict activity of the constraints entails that the sensitivity of the projected input lies in the span of the orthonormal null-space basis, i.e., it reads as that basis multiplied by some matrix of coefficients. We further observe that, upon left-multiplying the linearized KKT conditions by the transpose of that basis, the multiplier terms cancel by orthonormality, which entails the claimed expression and completes the proof.
Hence the gradient of the projected policy is a form of projection of the gradient of the original policy onto the null space of the safety constraints. We will define the basis matrix as the identity for all states at which all constraints are strictly inactive, and as zero for all states where the active constraints fully block the inputs. We observe that the set of states where some constraints are weakly active, such that the gradient of the policy is only defined in the sense of its sub-gradients, is of zero measure and can therefore be disregarded in the context discussed here. In the particular case of a safe set described as a polytope, such that the constraints are affine, the Hessian of (12) reduces to the identity and the projection matrix simplifies accordingly.
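In the affine (polytopic) case, this correction can be sketched numerically as follows; the stacked gradients of the strictly active constraints are an assumed input, and the gradient is projected onto their null space:

```python
import numpy as np
from scipy.linalg import null_space

# Hedged sketch for the polytopic case: N is an orthonormal basis of the
# null space of the strictly active constraint gradients A_act, and the
# corrected gradient is N N^T times the unprojected one.
def project_gradient(grad, A_act):
    if A_act.size == 0:        # no active constraints: gradient unchanged
        return grad
    N = null_space(A_act)      # columns form an orthonormal null-space basis
    return N @ (N.T @ grad)    # projection onto the active-constraint null space
```

For example, with a single active constraint blocking the first input direction, only the component of the gradient along the remaining direction survives the projection.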
We can then form the following Corollary to Proposition 1, providing a correct policy gradient evaluation.
Let us assume that (12) fulfills LICQ and SOSC. Then the policy gradient associated with the safe policy reads as:
where the advantage function is the one associated with the projected policy. All terms in (24) are evaluated at states and inputs distributed according to the probability density of the states in closed-loop under the projected policy.
We observe that for any state at which no constraint is weakly active, the equality
holds. If (12) fulfills the LICQ condition, the set of states where some constraints are weakly active is of zero measure, such that the equality