Constrained Policy Gradient Method for Safe and Fast Reinforcement Learning: a Neural Tangent Kernel Based Approach

07/19/2021 ∙ by Balázs Varga, et al. ∙ Chalmers University of Technology

This paper presents a constrained policy gradient algorithm. We introduce constraints for safe learning in the following steps. First, learning is slowed down (lazy learning) so that the episodic policy change can be computed with the help of the policy gradient theorem and the neural tangent kernel. This, in turn, enables the evaluation of the policy at arbitrary states. In the same spirit, learning can be guided toward safety by augmenting episode batches with states where the desired action probabilities are prescribed. Finally, exogenous discounted sums of future rewards (returns) can be computed at these specific state-action pairs such that the policy network satisfies the constraints. Computing the returns is based on solving a system of linear equations (equality constraints) or a constrained quadratic program (inequality constraints). Simulation results suggest that adding constraints (external information) can improve learning in terms of both speed and safety, provided the constraints are appropriately selected. The efficiency of the constrained learning was demonstrated with a shallow and wide ReLU network in the Cartpole and Lunar Lander OpenAI gym environments. The main novelty of the paper is a practical use of the neural tangent kernel in reinforcement learning.




1 Introduction

In reinforcement learning (RL), the agent learns in a trial-and-error way. In a real setting, this can lead to undesirable situations which may result in damage to or injury of the agent or its environment. In addition, the agent might waste a significant amount of time exploring irrelevant regions of the state and action spaces. Safe RL can be defined as the process of learning policies that maximize the expectation of the return in problems under safety constraints. Thus, safe exploration often includes some prior knowledge of the environment (e.g. a model) or has a risk metric García and Fernández (2015). In RL, safety can be guaranteed in a probabilistic way. Learning in this context aims to strike a balance between exploration and exploitation so that the system remains within a safe set. On the other hand, in a safety-critical setting, exploration cannot be done blindly.

Recently, several different approaches have been conceived to tackle safe learning. Wang et al. (2019) deals with balancing between exploitation and exploration by including the exploration explicitly in the objective function. Berkenkamp et al. (2017) proposed a safe RL algorithm with stability guarantees that gradually expands an initial safe set. The proposed learning algorithm safely optimizes the policy in an asymptotically stable way. The authors exploit the fact that value functions in RL are Lyapunov functions if the costs are strictly positive away from the origin. In Zimmer et al. (2018), safe active learning of time-series models is presented for Gaussian Processes (GPs). The data is sequentially selected for labeling such that information content is maximized. Then, an additional risk model is approximated by a GP. Only trajectories that are deemed safe are applied for the learning. A novel safe Bayesian optimization algorithm is introduced in Sui et al. (2018). It addresses the challenge of efficiently identifying the total safe region and optimizing the utility function within the safe region for Gaussian Processes. The safe optimization problem is solved in two stages: an exploration phase where the safe region is iteratively expanded, followed by an optimization phase in which Bayesian optimization is performed within the safe region. Turchetta et al. (2020) formulates a teacher-student safe-learning algorithm: the agent (student) learns to perform a task in an environment while the teacher intervenes if it reaches an unsafe region, affecting the student’s reward. The teacher learns an intervention policy too. The teacher has only limited information on the environment and the student. Thus, the adaptive teacher is an agent operating in a partially observable Markov Decision Process.

Deisenroth and Rasmussen (2011) takes a different approach and creates an RL algorithm that can solve problems within a few episodes based on Gaussian Processes (GPs). Some works aim at connecting RL with robust control (i.e. achieving the best performance under worst-case disturbance), which can be thought of as a form of safe learning Morimoto and Doya (2005), Pinto et al. (2017). Constrained learning is an intensively studied topic, having close ties to safe learning Yang (2019). Han et al. (2008) argues that constrained learning has better generalization performance and a faster convergence rate. Tessler et al. (2018) presents a constrained policy optimization, which uses an alternative penalty signal to guide the policy. Uchibe and Doya (2007) proposes an actor-critic method with constraints that define the feasible policy space.

In this paper, we develop a deterministic policy gradient algorithm augmented with equality and inequality constraints via shaping the rewards. Compared to Tessler et al. (2018), training does not rely on non-convex optimization or additional heuristics. In policy gradient methods, the function approximator predicts action values or probabilities from which the policy can be derived directly. Finding the weights is done via optimization (gradient ascent) following the policy gradient theorem Sutton and Barto (2018). Policy gradient methods have many variants and extensions to improve their learning performance Zhang et al. (2021). For example, in Kakade (2001) a covariant gradient is defined based on the underlying structure of the policy. Ciosek and Whiteson (2018) deduced expected policy gradients for actor-critic methods. Cheng et al. (2020) deals with the reduction of the variance in Monte-Carlo (MC) policy gradient methods. In this work, we employ the REINFORCE algorithm (Monte-Carlo policy gradient) Williams (1987). Under a small learning rate, and with the help of the recently introduced Neural Tangent Kernel (NTK) Jacot et al. (2018), the evolution of the policy during training by gradient descent can be described. Earlier, policy iteration has been used in conjunction with the NTK for learning the value function Goumiri et al. (2020). On the other hand, the NTK was used in its analytical form as the covariance kernel of a GP (Rasmussen et al. (2003), Novak et al. (2019), Yang and Salman (2019)). In this work, we directly use the NTK to project a one-step policy change in conjunction with the policy gradient theorem. Then, safety is incorporated via constraints. It is assumed that there are states where the agent’s desired behaviour is known. At these "safe" states, action probabilities are prescribed as constraints for the policy. Finally, returns are computed in such a way that the agent satisfies the constraints. This assumption is mild when considering physical systems: limits of a controlled system (saturation) or desired behavior at certain states are usually known (i.e. the environment is considered a gray-box). The proposed algorithm is developed for continuous state spaces and discrete action spaces. According to the categorization proposed in García and Fernández (2015), the algorithm falls into constrained optimization with external knowledge.

The contribution of the paper is twofold. First, we analytically give the policy evolution under gradient flow, using the NTK. Second, we extend the REINFORCE algorithm with constraints. Our version of the REINFORCE algorithm converges within a few episodes if constraints are set up correctly. The constrained extension relies on computing extra returns via convex optimization. In conclusion, the paper provides a practical use of the neural tangent kernel in reinforcement learning.

The paper is organized as follows. First, we present the episode-by-episode policy change of the REINFORCE algorithm (Section 2.1). Then, relying on the NTK, we deduce the policy change at unvisited states (Section 2.2). Using the results of Section 2.2, we compute returns at arbitrary states in Section 3. We introduce equality constraints for the policy by computing "safe" returns through solving a system of linear equations (Section 3.1). In the same context, we can enforce inequality constraints by solving a constrained quadratic program (Section 3.2). We investigate the proposed learning algorithm in two OpenAI gym environments: Cartpole (Section 4.1) and Lunar Lander (Section 4.2). Finally, we summarize the findings of this paper in Section 5.

2 Kernel-based analysis of the REINFORCE algorithm

In this section, the episode-by-episode learning dynamics of a policy network is analyzed and controlled in a constrained way. To this end, first, the REINFORCE algorithm is described. Then, the learning dynamics of a wide and shallow neural network is analytically given. Finally, returns are calculated that force the policy to obey equality and inequality constraints at specific states.

2.1 Reformulating the learning dynamics of the REINFORCE algorithm

Policy gradient methods learn by applying the policy gradient theorem. An agent’s training means tuning the weights of a function approximator episode-by-episode. In most cases, the function approximator is a neural network Silver et al. (2014), Arulkumaran et al. (2017). If the training is successful, then the function approximator predicts values or probabilities from which the policy can be derived. More specifically, the output of the policy gradient algorithm is a probability distribution, which is characterized by the function approximator’s weights. In the reinforcement learning setup, the goal is maximizing the (discounted sum of future) rewards. Finding the optimal policy is based on updating the weights of the policy network using gradient ascent. The simplest of the policy gradient methods is the REINFORCE algorithm Williams (1987), Szepesvári (2010). The agent learns the policy directly by updating its weights using Monte-Carlo episode samples. Thus, one realization is generated with the current policy (at episode $e$) $\pi(a \mid s, \theta(e))$. Assuming continuous state space and discrete action space, the episode batch (with length $n_B$) is $\{s_k, a_k, r_k\}$, for all $k = 1, 2, \dots, n_B$, where $s_k \in \mathbb{R}^{n_S}$ is the $n_S$ dimensional state vector in the $k$-th step of the MC trajectory, $a_k$ is the action taken, and $r_k$ is the reward in step $k$. For convenience, the states, actions, and rewards in batch $e$ are organized into columns $s_B(e)$, $a_B(e)$, $r_B(e)$, respectively. The REINFORCE algorithm learns as summarized in Algorithm 1. The update rule is based on the policy gradient theorem Sutton and Barto (2018) and for the whole episode it can be written as the sum of gradients induced by the batch:

$$\theta(e+1) = \theta(e) + \alpha \sum_{k=1}^{n_B} G_k(e) \left( \frac{\partial \log \pi(a_k \mid s_k, \theta(e))}{\partial \theta} \right)^T, \tag{1}$$

where $\alpha$ is the learning rate and $G_k(e)$ is the return at step $k$.

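Initialize $e = 1$.
Initialize the policy network $\pi(a \mid s, \theta)$ with random weights.
while not converged do
      Generate a Monte-Carlo trajectory $\{s_k, a_k, r_k\}$, $k = 1, 2, \dots, n_B$, with the current policy $\pi(a \mid s, \theta(e))$.
      for the whole MC trajectory ($k = 1, 2, \dots, n_B$) do
            Estimate the return as
            $$G_k(e) = \sum_{\kappa=k}^{n_B} \gamma^{\kappa-k} r_\kappa \tag{2}$$
            with $\gamma \in [0, 1]$ being the discount factor.
      Update policy parameters with gradient ascent (Eq. (1)).
      Increment $e$.
Algorithm 1 The REINFORCE algorithm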
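
The return estimation step of Algorithm 1 can be illustrated with a short sketch. It assumes the standard discounted return, in which the return at step k is the reward at step k plus the discounted return of the following step; the function name `discounted_returns` is hypothetical and the exact indexing convention of Eq. (2) may differ.

```python
def discounted_returns(rewards, gamma):
    """Discounted return for every step of one Monte-Carlo trajectory.

    Assumed convention: G_k = r_k + gamma * G_{k+1}, with G = 0 after the
    last step. The indexing used in the paper's Eq. (2) may differ.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk the trajectory backwards so each return reuses the next one.
    for k in reversed(range(len(rewards))):
        running = rewards[k] + gamma * running
        returns[k] = running
    return returns


# Three steps with unit reward and discount factor 0.5.
example_returns = discounted_returns([1.0, 1.0, 1.0], 0.5)
print(example_returns)  # [1.75, 1.5, 1.0]
```

The backward pass keeps the cost linear in the trajectory length, instead of re-summing the tail of the trajectory for every step.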

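Taking Algorithm 1 further, it is possible to compute the episodic policy change due to batch $\{s_B(e), a_B(e), r_B(e)\}$, assuming gradient flow ($\alpha \to 0$).

Theorem 1.

Given batch $\{s_B(e), a_B(e), r_B(e)\}$, and assuming gradient flow, the episodic policy change with the REINFORCE algorithm at the batch state-action pairs is

$$\frac{\partial \pi(a_B \mid s_B, \theta(e))}{\partial e} = \alpha\, \Theta(s_B, s_B)\, \Pi^I(s_B, a_B, \theta(e))\, G_B(e), \tag{3}$$

where $\Theta(s_B, s_B) \in \mathbb{R}^{n_B \times n_B}$ is the neural tangent kernel, $\Pi^I(s_B, a_B, \theta(e)) \in \mathbb{R}^{n_B \times n_B}$ is a diagonal matrix containing the inverse policies (if they exist) at state-action pairs of batch $e$, and $G_B(e) \in \mathbb{R}^{n_B}$ is the vector of returns.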


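Proof. Assuming a very small learning rate $\alpha$, the update algorithm (gradient ascent) can be written in continuous form (gradient flow), Parikh and Boyd (2014):

$$\frac{d\theta(e)}{de} = \alpha \sum_{k=1}^{n_B} G_k(e) \left( \frac{\partial \log \pi(a_k \mid s_k, \theta(e))}{\partial \theta} \right)^T. \tag{4}$$

Furthermore, to avoid division by zero, it is assumed that the evaluated policy is strictly positive. The derivative on the left-hand side is a column vector with size $n_\theta$. Denote $\pi_k = \pi(a_k \mid s_k, \theta(e))$, $\pi_B = [\pi_1, \dots, \pi_{n_B}]^T$, and rewrite the differential equation in vector form as

$$\frac{d\theta(e)}{de} = \alpha \left( \frac{\partial \log \pi_B}{\partial \theta} \right)^T G_B(e), \tag{5}$$

where $G_B(e) = [G_1(e), \dots, G_{n_B}(e)]^T \in \mathbb{R}^{n_B}$ is the vector of returns in episode $e$ (Eq. (2)). The matrix of partial log policy derivatives ($\in \mathbb{R}^{n_B \times n_\theta}$) is

$$\frac{\partial \log \pi_B}{\partial \theta} = \begin{bmatrix} \frac{\partial \log \pi_1}{\partial \theta_1} & \cdots & \frac{\partial \log \pi_1}{\partial \theta_{n_\theta}} \\ \vdots & \ddots & \vdots \\ \frac{\partial \log \pi_{n_B}}{\partial \theta_1} & \cdots & \frac{\partial \log \pi_{n_B}}{\partial \theta_{n_\theta}} \end{bmatrix}, \tag{6}$$

where subscripts of $\theta$ denote weights and biases of the policy network. By using an element-wise transformation $\frac{\partial \log \pi_k}{\partial \theta} = \frac{1}{\pi_k} \frac{\partial \pi_k}{\partial \theta}$, $\left( \frac{\partial \log \pi_B}{\partial \theta} \right)^T$ can be rewritten as a product:

$$\left( \frac{\partial \log \pi_B}{\partial \theta} \right)^T = \left( \frac{\partial \pi_B}{\partial \theta} \right)^T \mathrm{diag}\!\left( \frac{1}{\pi_1}, \dots, \frac{1}{\pi_{n_B}} \right), \tag{7}$$

where the matrix on the left of the product is a transposed Jacobian, i.e. $\left( \frac{\partial \pi_B}{\partial \theta} \right)^T \in \mathbb{R}^{n_\theta \times n_B}$. Denote the inverse policies with $\pi_k^I = 1/\pi_k$ and $\Pi^I = \mathrm{diag}(\pi_1^I, \dots, \pi_{n_B}^I)$. The change of the agent weights based on batch $e$ is

$$\frac{d\theta(e)}{de} = \alpha \left( \frac{\partial \pi_B}{\partial \theta} \right)^T \Pi^I G_B(e). \tag{8}$$

Next, write the learning dynamics of the policy, using the chain rule:

$$\frac{\partial \pi_B}{\partial e} = \frac{\partial \pi_B}{\partial \theta} \frac{d\theta(e)}{de}. \tag{9}$$

First, substitute $\frac{d\theta(e)}{de}$ from Eq. (8):

$$\frac{\partial \pi_B}{\partial e} = \alpha \frac{\partial \pi_B}{\partial \theta} \left( \frac{\partial \pi_B}{\partial \theta} \right)^T \Pi^I G_B(e). \tag{10}$$

Note that $\frac{\partial \pi_B}{\partial \theta} \left( \frac{\partial \pi_B}{\partial \theta} \right)^T \in \mathbb{R}^{n_B \times n_B}$ is the Neural Tangent Kernel (NTK), as defined in Jacot et al. (2018). Denote it with $\Theta(s_B, s_B)$. Finally, the policy update due to episode batch $e$ at states $s_B$ for actions $a_B$ becomes:

$$\frac{\partial \pi_B}{\partial e} = \alpha\, \Theta(s_B, s_B)\, \Pi^I G_B(e). \tag{11}$$
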
Similarly, by defining we can write Eq. (12) as


too. ∎


From Eq. (11), the learning dynamics of the REINFORCE algorithm is a nonlinear differential equation system. If the same data batch is shown to the neural network over and over again, the policy evolves as .
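
The statement of Theorem 1 can be checked numerically on a toy problem. The sketch below assumes the policy change has the form (learning rate) × (NTK) × (diagonal inverse policies) × (returns), as described in the theorem, and swaps in a linear softmax policy instead of the wide ReLU network used in the paper so that the Jacobian is available in closed form. All function names and numbers are hypothetical.

```python
import math


def policy(W, s):
    """Softmax policy with linear logits; W has one row per action."""
    logits = [sum(w * x for w, x in zip(row, s)) for row in W]
    m = max(logits)
    ex = [math.exp(v - m) for v in logits]
    total = sum(ex)
    return [v / total for v in ex]


def jacobian_row(W, s, a):
    """d pi(a|s) / d W, flattened: pi_a * (delta_ab - pi_b) * s_j."""
    p = policy(W, s)
    return [p[a] * ((1.0 if b == a else 0.0) - p[b]) * x
            for b in range(len(W)) for x in s]


def predicted_change(W, states, actions, returns, alpha):
    """alpha * Theta * Pi^I * G at the batch state-action pairs."""
    J = [jacobian_row(W, s, a) for s, a in zip(states, actions)]
    inv_pi = [1.0 / policy(W, s)[a] for s, a in zip(states, actions)]
    n = len(states)
    # Empirical NTK: inner products of the policy Jacobian rows.
    theta = [[sum(x * y for x, y in zip(J[i], J[j])) for j in range(n)]
             for i in range(n)]
    return [alpha * sum(theta[i][j] * inv_pi[j] * returns[j] for j in range(n))
            for i in range(n)]


def reinforce_step(W, states, actions, returns, alpha):
    """One gradient-ascent step on sum_k G_k * log pi(a_k | s_k)."""
    new_W = [row[:] for row in W]
    for s, a, g in zip(states, actions, returns):
        p = policy(W, s)
        for b in range(len(W)):
            coeff = g * ((1.0 if b == a else 0.0) - p[b])  # d log pi_a / d logit_b
            for j in range(len(s)):
                new_W[b][j] += alpha * coeff * s[j]
    return new_W


# Toy batch: 3 steps, 2 state dimensions, 2 actions, small learning rate.
W0 = [[0.1, -0.2], [0.3, 0.05]]
states = [[1.0, 0.5], [-0.3, 0.8], [0.2, -1.0]]
actions = [0, 1, 0]
returns = [1.0, 0.5, -0.2]
alpha = 1e-4

pred = predicted_change(W0, states, actions, returns, alpha)
W1 = reinforce_step(W0, states, actions, returns, alpha)
actual = [policy(W1, s)[a] - policy(W0, s)[a] for s, a in zip(states, actions)]
print(max(abs(p - q) for p, q in zip(pred, actual)))  # second-order small
```

For a small learning rate the kernel-based prediction and the actual change after one update agree up to second-order terms, which is the sense in which the gradient flow assumption is used.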

2.2 Evaluating the policy change for arbitrary states and actions

In this section, we describe the policy change for any state and any action if the agent learns from batch $\{s_B(e), a_B(e), r_B(e)\}$. In most cases, the learning agent can perform multiple actions. Assume the agent can take $n_A$ actions (the policy net has $n_A$ output channels). Previous works that deal with the NTK all consider one output channel in their examples (e.g. Bradbury et al. (2018), Jacot et al. (2018), Yang and Salman (2019)), but state that it works for multiple outputs too. Jacot et al. (2018) claim that a network with $n_A$ outputs can be handled as $n_A$ independent neural networks. On the other hand, it is possible to fit every output channel into one equation by generalizing Theorem 1. First, reconstruct the return vector as follows:

$$G_B(a_B, e) = \begin{bmatrix} G_1(a_1, e) \\ \vdots \\ G_{n_B}(a_{n_B}, e) \end{bmatrix}, \qquad \left[ G_k(a_k, e) \right]_i = \begin{cases} G_k(e) & \text{if } i = a_k, \\ 0 & \text{otherwise}, \end{cases} \tag{14}$$

with $G_k(a_k, e) \in \mathbb{R}^{n_A}$ and $G_B(a_B, e) \in \mathbb{R}^{n_A n_B}$. In other words, $G_B(a_B, e)$ consists of $n_A$ sized blocks with zeros at action indexes which are not taken at $s_k$ ($n_A - 1$ zeros) and the original return (Eq. (2)) at the position of the taken action. Note that the action dependency of $G_B(a_B, e)$ stems from this assumption.
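
The block structure of the multi-output return vector can be sketched as follows. The helper name `expand_returns` is hypothetical, and actions are assumed to be zero-based integer indexes.

```python
def expand_returns(actions, returns, n_actions):
    """Stack one block of size n_actions per step.

    Each block is zero except at the index of the action actually taken,
    where it holds the return of that step.
    """
    expanded = []
    for action, ret in zip(actions, returns):
        block = [0.0] * n_actions
        block[action] = ret
        expanded.extend(block)
    return expanded


# Three steps, three possible actions.
print(expand_returns([0, 2, 1], [1.75, 1.5, 1.0], 3))
# [1.75, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0, 1.0, 0.0]
```

The resulting vector has one entry per (step, action) pair, so it can multiply a kernel that is evaluated for every output channel.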

Theorem 2.

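Given batch $\{s_B(e), a_B(e), r_B(e)\}$, and assuming gradient flow, the episodic policy change with the REINFORCE algorithm at the batch states for an $n_A$ output policy network is

$$\frac{\partial \pi(a \mid s_B, \theta(e))}{\partial e} = \alpha\, \Theta(s_B, s_B)\, \Pi^I(s_B, \theta(e))\, G_B(a_B, e),$$

where $\Theta(s_B, s_B) \in \mathbb{R}^{n_A n_B \times n_A n_B}$ and $\Pi^I(s_B, \theta(e)) \in \mathbb{R}^{n_A n_B \times n_A n_B}$.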


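Proof. We can deduce the multi-output case by rewriting Eq. (4). For simplicity, we keep the notations, but the matrix and vector sizes are redefined for the vector output case. Therefore, the log policy derivatives are evaluated for every possible action at the states contained in a batch:

$$\frac{d\theta(e)}{de} = \alpha \left( \frac{\partial \log \pi(a \mid s_B, \theta(e))}{\partial \theta} \right)^T G_B(a_B, e), \qquad \frac{\partial \log \pi(a \mid s_B, \theta(e))}{\partial \theta} \in \mathbb{R}^{n_A n_B \times n_\theta}.$$

The zero elements in $G_B(a_B, e)$ will cancel out log probabilities of actions that are not taken in episode $e$ (Eq. (14)). Therefore, the final output does not change. Note that the action dependency is moved from $\pi$ to $G_B$. That is because the policy is evaluated for every output channel of the policy network, but a nonzero return is given only if an action is actually taken in the MC trajectory. Continue by separating $\frac{\partial \log \pi(a \mid s_B, \theta(e))}{\partial \theta}$ into two matrices, in the same way as in Eq. (7):

$$\left( \frac{\partial \log \pi(a \mid s_B, \theta(e))}{\partial \theta} \right)^T = \left( \frac{\partial \pi(a \mid s_B, \theta(e))}{\partial \theta} \right)^T \Pi^I(s_B, \theta(e)),$$

where the diagonalized inverse policies are denoted with $\Pi^I(s_B, \theta(e)) = \mathrm{diag}\!\left( 1 / \pi(a_i \mid s_k, \theta(e)) \right)$ with $i = 1, \dots, n_A$, $k = 1, \dots, n_B$. The weight change can be written as

$$\frac{d\theta(e)}{de} = \alpha \left( \frac{\partial \pi(a \mid s_B, \theta(e))}{\partial \theta} \right)^T \Pi^I(s_B, \theta(e))\, G_B(a_B, e).$$

Following the same steps as for the proof of Theorem 1, the policy change for every output channel is

$$\frac{\partial \pi(a \mid s_B, \theta(e))}{\partial e} = \alpha\, \Theta(s_B, s_B)\, \Pi^I(s_B, \theta(e))\, G_B(a_B, e). \tag{18}$$
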
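It is possible to write the average change of each policy output channel over a batch (with superscript $\mathrm{avg}$) as

$$\frac{\partial \pi^{\mathrm{avg}}(a \mid s_B, \theta(e))}{\partial e} = \frac{\alpha}{n_B}\, \Sigma\, \Theta(s_B, s_B)\, \Pi^I(s_B, \theta(e))\, G_B(a_B, e),$$

where $\Sigma = [I_{n_A}, \dots, I_{n_A}]$ is an $n_A \times n_A n_B$ matrix consisting of $n_B$ identity matrices of size $n_A \times n_A$. $\Sigma$ is used to sum elements that correspond to the same output channel. The result of the equation above is the policy change at an averaged state of $s_B$.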

With Theorem 2, it is possible to evaluate how the policy will change at the batch states $s_B$ when learning on batch data, provided the learning rate is small. By manipulating Eq. (18), the policy change can also be evaluated at states that are not part of the batch.

Theorem 3.

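Given batch $\{s_B(e), a_B(e), r_B(e)\}$, and assuming gradient flow, the episodic policy change with the REINFORCE algorithm at any state $s_0$ is

$$\frac{\partial \pi(a \mid s_0, \theta(e))}{\partial e} = \alpha\, \Theta(s_0, s_B)\, \Pi^I(s_B, \theta(e))\, G_B(a_B, e),$$

where $\Theta(s_0, s_B) \in \mathbb{R}^{n_A \times n_A n_B}$ is the neural tangent kernel evaluated for all $(s_0, s_B)$ pairs.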


Proof. First, we concatenate the states and returns as $[s_B^T, s_0^T]^T$ and $[G_B^T, G_0^T]^T$, respectively, and solve Eq. (18) for $s_0$. For more insight, we illustrate the matrix multiplication in Eq. (18) with the concatenated state in Figure 1.

Figure 1: Matrix compatibility with the augmented batch. The blue section indicates the original Eq. (18), while the green blocks stem from the augmentation.

The NTK is based on the partial derivatives of the policy network and can be evaluated anywhere. Therefore, $\Theta(s_0, s_B)$ can be computed for any state. The augmented kernel consists of symmetric blocks. Since $s_0$ is not included in the learning, it does not affect the policy change: its return is zero for every action, $G_0 = 0$. The zero return cancels out the term $\Theta(s_0, s_0)\, \Pi^I(s_0, \theta(e))$. Therefore,

$$\frac{\partial \pi(a \mid s_0, \theta(e))}{\partial e} = \alpha\, \Theta(s_0, s_B)\, \Pi^I(s_B, \theta(e))\, G_B(a_B, e).$$

The numerical accuracy of the computed policy change is further discussed in Appendix A.

3 The NTK-based constrained REINFORCE algorithm

Every physical system has some limits (saturation). Intuitively, the agent should not increase the control action towards the limit if the system is already in saturation (for example, if a robot manipulator is at an end position, increasing the force in the direction of that end position is pointless). Assuming we have some idea how the agent should behave (which actions to take) at specific states $s_S$, equality and inequality constraints can be prescribed for the policy. Define equality and inequality constraints as $\pi(a_S \mid s_S, \theta(e)) = \pi_{\mathrm{ref}}$ and $\pi(a_S \mid s_S, \theta(e)) \geq \pi_{\mathrm{ref}}$, where $\pi_{\mathrm{ref}}$ is a vector of constant probabilities.

In the sequel, relying on Theorem 3, we provide the mathematical deduction on how to enforce constraints during learning.

3.1 Equality constraints

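To deal with constraints, we augment Eq. (18) with the constrained state-action pairs. Visually, the policy change at the augmented batch states is shown in Figure 2. Since the desired policy change at the safe states can be given as

$$\frac{\partial \pi(a_S \mid s_S, \theta(e))}{\partial e} = \pi_{\mathrm{ref}} - \pi(a_S \mid s_S, \theta(e)),$$

the only unknowns are the returns $G_S(e)$ for the safe actions at the safe states. Note that the upper block of Figure 2 contains differential equations, while the lower block consists of algebraic equations. It is sufficient to solve the algebraic part. With the lower blocks of Figure 2 we can write the linear equation system

$$\pi_{\mathrm{ref}} - \pi(a_S \mid s_S, \theta(e)) = \alpha \left( \Theta(s_S, s_B)\, \Pi^I(s_B, \theta(e))\, G_B(a_B, e) + \Theta(s_S, s_S)\, \Pi^I(s_S, \theta(e))\, G_S(e) \right),$$

where the kernel blocks and inverse policies of the safe states are evaluated at the constrained state-action pairs $(s_S, a_S)$.

This system has a unique solution, provided its coefficient matrix is nonsingular, since the number of unknown returns equals the number of equations. It can be solved with e.g. the DGESV routine Haidar et al. (2018).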

Figure 2: Matrix compatibility of the constrained policy change evaluation (concatenated batch)

In order to obey the constraints, a safe data batch is constructed. We concatenate the safe states, actions, and computed returns with the episode batch as $\{[s_B^T, s_S^T]^T, [a_B^T, a_S^T]^T, [G_B^T, G_S^T]^T\}$. Then, the agent’s weights are updated with the appended batch using gradient ascent, Eq. (1).
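
As an illustration of the equality-constrained step, the sketch below solves for the safe returns under the assumption that the algebraic block reads: (reference policy − current policy at the safe pairs) = learning rate × (batch kernel block × inverse batch policies × batch returns + safe kernel block × inverse safe policies × safe returns). The helper names, the argument layout, and the toy numbers are hypothetical, and a plain Gaussian elimination with partial pivoting stands in for a LAPACK call.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("coefficient matrix is (numerically) singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def safe_returns(theta_sb, theta_ss, inv_pi_b, inv_pi_s, g_b, pi_ref, pi_s, alpha):
    """Returns at the safe pairs that realize the desired policy change."""
    n_s, n_b = len(pi_s), len(g_b)
    # Effect of the collected batch on the policy at the safe pairs.
    batch_effect = [sum(theta_sb[i][j] * inv_pi_b[j] * g_b[j] for j in range(n_b))
                    for i in range(n_s)]
    rhs = [(pi_ref[i] - pi_s[i]) / alpha - batch_effect[i] for i in range(n_s)]
    A = [[theta_ss[i][j] * inv_pi_s[j] for j in range(n_s)] for i in range(n_s)]
    return solve_linear(A, rhs)


# Toy numbers: two safe state-action pairs, three batch steps.
theta_ss = [[0.5, 0.1], [0.1, 0.4]]
theta_sb = [[0.2, 0.05, 0.1], [0.0, 0.3, 0.1]]
inv_pi_b = [2.0, 2.5, 1.6]
inv_pi_s = [2.0, 4.0]
g_b = [1.0, 0.5, -0.2]
pi_ref, pi_s, alpha = [0.9, 0.1], [0.5, 0.25], 0.01

g_s = safe_returns(theta_sb, theta_ss, inv_pi_b, inv_pi_s, g_b, pi_ref, pi_s, alpha)
print(g_s)
```

Because the right-hand side is divided by the learning rate, a large gap between the reference and the current policy translates into large safe returns.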

In the initial stages of learning, the difference between the reference policies and the actual ones will be large. Therefore, large returns are needed to eliminate this difference. This also means that the effect of the collected batch on the weights is minor compared to the safe state-action-return tuples. In addition, large returns might cause loss of stability during the learning. The policy change computed from the linearized dynamics might differ from the actual one, especially if large steps are taken (i.e. large returns are applied). On the other hand, when the policy obeys the constraints, the computed returns will be small and will only compensate for the effect of the actual batch. In the special case when the return is zero, the result is the unconstrained policy change at the specific state as in Section 2.2. In addition, if the policy is smooth, the action probabilities near the constrained points will be similar. In continuous state space, this implies that defining grid-based (finite) constraints is sufficient.


Time complexity: The critical operations are kernel evaluations and solving the linear equation system. The DGESV algorithm used for solving the linear equation system has time complexity Haidar et al. (2018). The time complexity of kernel evaluations is . If the kernel is computed for every output channel at the batch states, time complexity increases to .

3.2 Inequality constraints

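In the same way, inequality constraints can be prescribed too. Assume that there are some states of the environment where an action shall be taken with at least a constant probability: $\pi(a_S \mid s_S, \theta(e)) \geq \pi_{\mathrm{ref}}$. Then, similar to the linear equation system in Section 3.1, the inequality constraints can be written as

$$\alpha \left( \Theta(s_S, s_B)\, \Pi^I(s_B, \theta(e))\, G_B(a_B, e) + \Theta(s_S, s_S)\, \Pi^I(s_S, \theta(e))\, G_S(e) \right) \geq \pi_{\mathrm{ref}} - \pi(a_S \mid s_S, \theta(e)). \tag{24}$$

Solving this system of inequalities can be turned into a convex quadratic programming problem. Since the original goal of reinforcement learning is learning on the collected episode batch data, the influence of the constraints on the learning (i.e. the magnitude of $G_S(e)$) should be as small as possible. Therefore, the quadratic program is

$$\min_{G_S(e)} \; G_S(e)^T G_S(e)$$

subject to Eq. (24).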

Note that the quadratic cost function is needed to penalize positive and negative returns similarly. Quadratic programming with interior point methods has polynomial time complexity (Ye and Tse (1989)) and has to be solved after every episode. In practical applications, numerical errors are more of an issue than time complexity (i.e. when solving the optimization with thousands of constraints).
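
A minimal sketch of the inequality-constrained step follows. It assumes the constraints have already been brought to the generic form A g ≥ b, where g stands for the unknown safe returns; how A and b are assembled from the kernel blocks is not shown. Instead of an interior point method, the sketch swaps in Hildreth's dual coordinate ascent, which is easy to write without external libraries; the function name `min_norm_returns` and the sweep count are hypothetical choices.

```python
def min_norm_returns(A, b, sweeps=200):
    """Minimize 0.5 * ||g||^2 subject to A g >= b.

    Hildreth's procedure: coordinate ascent on the dual variables
    lam >= 0, with the primal solution recovered as g = A^T lam.
    Assumes the constraint set is feasible.
    """
    m, n = len(A), len(A[0])
    # H = A A^T is the Hessian of the (negated) dual objective.
    H = [[sum(A[i][k] * A[j][k] for k in range(n)) for j in range(m)]
         for i in range(m)]
    lam = [0.0] * m
    for _ in range(sweeps):
        for i in range(m):
            if H[i][i] == 0.0:
                continue  # all-zero constraint row, nothing to update
            rest = sum(H[i][j] * lam[j] for j in range(m) if j != i)
            lam[i] = max(0.0, (b[i] - rest) / H[i][i])
    return [sum(A[i][k] * lam[i] for i in range(m)) for k in range(n)]


# Two constraints: g1 >= 1 and g1 + g2 >= 3.
g = min_norm_returns([[1.0, 0.0], [1.0, 1.0]], [1.0, 3.0])
print(g)  # close to [1.5, 1.5]; only the second constraint is active
```

When a constraint is already satisfied by the zero vector, its dual variable stays at zero and the corresponding returns are not inflated, which matches the goal of keeping the influence of the constraints small.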

We summarize the NTK-based constrained REINFORCE algorithm in Algorithm 2.

Initialize .
Define equality constraints .
Define inequality constraints