In reinforcement learning (RL), the agent learns by trial and error. In a real-world setting, this can lead to undesirable situations that may damage the agent or the environment. In addition, the agent might waste a significant amount of time exploring irrelevant regions of the state and action spaces. Safe RL can be defined as the process of learning policies that maximize the expected return under safety constraints. Thus, safe exploration often relies on prior knowledge of the environment (e.g. a model) or on a risk metric García and Fernández (2015). In RL, safety can be guaranteed in a probabilistic way. Learning in this context aims to strike a balance between exploration and exploitation such that the system remains within a safe set; in a safety-critical setting, exploration cannot be done blindly.
Recently, several different approaches have been conceived to tackle safe learning. Wang et al. (2019) deals with balancing exploitation and exploration by including exploration explicitly in the objective function. Berkenkamp et al. (2017) proposed a safe RL algorithm with stability guarantees that gradually expands an initial safe set. The proposed algorithm safely optimizes the policy in an asymptotically stable way, exploiting the fact that value functions in RL are Lyapunov functions if the costs are strictly positive away from the origin. In Zimmer et al. (2018), safe active learning of time-series models is presented for Gaussian Processes (GPs). The data is sequentially selected for labeling such that the information content is maximized, an additional risk model is approximated by a GP, and only trajectories deemed safe are used for learning. A novel safe Bayesian optimization algorithm is introduced in Sui et al. (2018). It addresses the challenge of efficiently identifying the total safe region and optimizing the utility function within it for Gaussian Processes. The safe optimization problem is solved in two stages: an exploration phase, in which the safe region is iteratively expanded, followed by an optimization phase, in which Bayesian optimization is performed within the safe region. Turchetta et al. (2020) formulates a teacher-student safe-learning algorithm: the agent (student) learns to perform a task in an environment while the teacher intervenes if it reaches an unsafe region, affecting the student's reward. The teacher learns an intervention policy too, but has only limited information about the environment and the student; thus, the adaptive teacher is an agent operating in a partially observable Markov Decision Process. Deisenroth and Rasmussen (2011) takes a different approach and creates an RL algorithm, based on Gaussian Processes, that can solve problems within a few episodes. Some works aim at connecting RL with robust control (i.e. achieving the best performance under worst-case disturbance), which can be thought of as a form of safe learning Morimoto and Doya (2005), Pinto et al. (2017). Constrained learning is an intensively studied topic with close ties to safe learning Yang (2019). Han et al. (2008) argues that constrained learning has better generalization performance and a faster convergence rate. Tessler et al. (2018) presents a constrained policy optimization method, which uses an alternative penalty signal to guide the policy. Uchibe and Doya (2007) proposes an actor-critic method with constraints that define the feasible policy space.
In this paper, we develop a deterministic policy gradient algorithm augmented with equality and inequality constraints via reward shaping. Compared to Tessler et al. (2018), training does not rely on non-convex optimization or additional heuristics. In policy gradient methods, the function approximator predicts action values or probabilities from which the policy can be derived directly. Finding the weights is done via optimization (gradient ascent) following the policy gradient theorem Sutton and Barto (2018). Policy gradient methods have many variants and extensions that improve their learning performance Zhang et al. (2021). For example, in Kakade (2001) a covariant gradient is defined based on the underlying structure of the policy. Ciosek and Whiteson (2018) deduced expected policy gradients for actor-critic methods. Cheng et al. (2020) deals with the reduction of variance in Monte-Carlo (MC) policy gradient methods. In this work, we employ the REINFORCE algorithm (Monte-Carlo policy gradient) Williams (1987). Under a small learning rate, the evolution of the policy during training by gradient descent can be described with the help of the recently introduced Neural Tangent Kernel (NTK) Jacot et al. (2018). Earlier, policy iteration was used in conjunction with the NTK to learn the value function Goumiri et al. (2020), and the NTK has been used in its analytical form as the covariance kernel of a GP (Rasmussen et al. (2003), Novak et al. (2019), Yang and Salman (2019)). In this work, we directly use the NTK to project a one-step policy change in conjunction with the policy gradient theorem. Then, safety is incorporated via constraints. It is assumed that there are states where the agent's desired behavior is known. At these "safe" states, action probabilities are prescribed as constraints for the policy. Finally, returns are computed in such a way that the agent satisfies the constraints. This assumption is mild when considering physical systems: the limits of a controlled system (saturation) or the desired behavior at certain states are usually known (i.e. the environment is considered a gray-box). The proposed algorithm is developed for continuous state spaces and discrete action spaces. According to the categorization proposed in García and Fernández (2015), it falls into constrained optimization with external knowledge.
The contribution of the paper is twofold. First, we analytically give the policy evolution under gradient flow, using the NTK. Second, we extend the REINFORCE algorithm with constraints. Our version of the REINFORCE algorithm converges within a few episodes if the constraints are set up correctly. The constrained extension relies on computing extra returns via convex optimization. Overall, the paper provides a practical use of the neural tangent kernel in reinforcement learning.
The paper is organized as follows. First, we present the episode-by-episode policy change of the REINFORCE algorithm (Section 2.1). Then, relying on the NTK, we deduce the policy change at unvisited states (Section 2.2). Using the results of Section 2.2, we compute returns at arbitrary states in Section 3: we introduce equality constraints for the policy by computing "safe" returns through a system of linear equations (Section 3.1), and in the same context we enforce inequality constraints by solving a constrained quadratic program (Section 3.2). We investigate the proposed learning algorithm in two OpenAI Gym environments: Cartpole (Section 4.1) and Lunar Lander (Section 4.2). Finally, we summarize the findings of the paper in Section 5.
2 Kernel-based analysis of the REINFORCE algorithm
In this section, the episode-by-episode learning dynamics of a policy network are analyzed and controlled in a constrained way. To this end, first, the REINFORCE algorithm is described. Then, the learning dynamics of a wide and shallow neural network are given analytically. Finally, returns are calculated that force the policy to obey equality and inequality constraints at specific states.
2.1 Reformulating the learning dynamics of the REINFORCE algorithm
Policy gradient methods learn by applying the policy gradient theorem. Training an agent means tuning the weights of a function approximator episode by episode. In most cases, the function approximator is a neural network Silver et al. (2014), Arulkumaran et al. (2017). If the training is successful, the function approximator predicts values or probabilities from which the policy can be derived. More specifically, the output of the policy gradient algorithm is a probability distribution, which is characterized by the function approximator's weights. In the reinforcement learning setup, the goal is to maximize the (discounted sum of future) rewards. Finding the optimal policy is based on updating the weights of the policy network using gradient ascent. The simplest of the policy gradient methods is the REINFORCE algorithm Williams (1987), Szepesvári (2010). The agent learns the policy directly by updating its weights using Monte-Carlo episode samples. Thus, one realization is generated with the current policy at episode $e$. Assuming a continuous state space and a discrete action space, the episode batch (with length $T$) is $B_e = \{(s_t, a_t, r_t)\}$ for all $t = 1, \dots, T$, where $s_t$ is the $n$-dimensional state vector in the $t$-th step of the MC trajectory, $a_t$ is the action taken, and $r_t$ is the reward in step $t$. For convenience, the states, actions, and rewards in batch $e$ are organized into columns $S_e$, $A_e$, $R_e$, respectively. The REINFORCE algorithm learns as summarized in Algorithm 1. The update rule is based on the policy gradient theorem Sutton and Barto (2018) and, for the whole episode, it can be written as the sum of gradients induced by the batch:
Estimate the return as the discounted sum of future rewards, $G_t = \sum_{k=t+1}^{T} \gamma^{k-t-1} r_k$.
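As a concrete illustration of Algorithm 1, the return estimate and the Monte-Carlo policy gradient step can be sketched for a linear-softmax policy. This is a minimal NumPy sketch under our own naming and parameterization assumptions, not the paper's implementation:

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Discounted return G_t = sum_{k>=t} gamma^(k-t) * r_k, computed backwards."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return np.array(out[::-1])

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reinforce_update(theta, states, actions, rewards, gamma=0.99, alpha=0.1):
    """One REINFORCE gradient-ascent step for a linear-softmax policy.

    theta: (n, m) weight matrix, pi(.|s) = softmax(theta.T @ s)."""
    G = discounted_returns(rewards, gamma)
    grad = np.zeros_like(theta)
    for s, a, g in zip(states, actions, G):
        pi = softmax(theta.T @ s)
        onehot = np.zeros(theta.shape[1])
        onehot[a] = 1.0
        # grad_theta log pi(a|s) for a linear-softmax policy: s (e_a - pi)^T
        grad += g * np.outer(s, onehot - pi)
    return theta + alpha * grad
```

For instance, a single rewarded action in a one-state episode increases that action's probability after the update, as expected from the policy gradient theorem.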
Taking Algorithm 1 further, it is possible to compute the episodic policy change due to batch $B_e$, assuming gradient flow (learning rate $\alpha \to 0$).
Given batch $B_e$, and assuming gradient flow, the episodic policy change of the REINFORCE algorithm at the batch state-action pairs is
where $\Theta$ is the neural tangent kernel, $\Pi^{-1}$ is a diagonal matrix containing the inverse policies (if they exist) at the state-action pairs of batch $B_e$, and $G_e$ is the vector of returns.
Assuming a very small learning rate $\alpha$, the update algorithm (gradient ascent) can be written in continuous form (gradient flow) Parikh and Boyd (2014):
Furthermore, to avoid division by zero, it is assumed that the evaluated policy is strictly positive. The derivative on the left-hand side is a column vector of size $T$. Collecting the policies, log-policy gradients, and returns of the batch into vectors, the differential equation can be rewritten in vector form as
where $G_e$ is the vector of returns in episode $e$ (Eq. (2)). The matrix of partial log-policy derivatives is
where the subscripts denote the weights and biases of the policy network. Using the element-wise identity $\partial_\theta \ln \pi = (\partial_\theta \pi)/\pi$, this matrix can be rewritten as a product:
where the matrix on the left is a transposed Jacobian, i.e. the derivative of the policy network outputs with respect to the weights. Denote the matrix of inverse policies with $\Pi^{-1}$. The change of the agent's weights based on batch $B_e$ is
Next, write the learning dynamics of the policy using the chain rule:
First, extract the product of the Jacobians as in Eq. (8):
Note that this Jacobian product is the Neural Tangent Kernel (NTK), as defined in Jacot et al. (2018); denote it with $\Theta$. Finally, the policy update due to episode batch $B_e$ at states $S_e$ for actions $A_e$ becomes:
Similarly, by introducing a shorthand for the product of the kernel and the inverse policies, we can write Eq. (12) as
From Eq. (11), the learning dynamics of the REINFORCE algorithm form a nonlinear differential equation system. If the same data batch is shown to the neural network over and over again, the policy evolves according to this system.
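The kernel at the heart of these dynamics can be computed numerically. The sketch below evaluates the empirical NTK of a small one-hidden-layer network as the Gram matrix of its parameter gradients, $\Theta = J J^\top$; the architecture, initialization, and function names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def init_params(n_in, width, seed=0):
    """Small one-hidden-layer tanh network with scalar output."""
    rng = np.random.default_rng(seed)
    return {"W1": rng.normal(size=(width, n_in)) / np.sqrt(n_in),
            "b1": np.zeros(width),
            "w2": rng.normal(size=width) / np.sqrt(width),
            "b2": 0.0}

def forward(p, x):
    return p["w2"] @ np.tanh(p["W1"] @ x + p["b1"]) + p["b2"]

def param_gradient(p, x):
    """Flattened gradient of the scalar network output w.r.t. all weights."""
    h = np.tanh(p["W1"] @ x + p["b1"])
    dh = p["w2"] * (1.0 - h**2)                      # backprop through tanh
    return np.concatenate([np.outer(dh, x).ravel(),  # d/dW1
                           dh,                       # d/db1
                           h,                        # d/dw2
                           [1.0]])                   # d/db2

def empirical_ntk(p, X):
    """NTK Gram matrix Theta(X, X) = J J^T for a batch of states X (rows)."""
    J = np.stack([param_gradient(p, x) for x in X])
    return J @ J.T
```

As a Gram matrix, the empirical NTK is symmetric and positive semi-definite by construction, which can be checked on a random batch of states.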
2.2 Evaluating the policy change for arbitrary states and actions
In this section, we describe the policy change for any state and any action if the agent learns from batch $B_e$. In most cases, the learning agent can perform multiple actions; assume the agent can take $m$ actions (i.e. the policy network has $m$ output channels). Previous works dealing with the NTK all consider a single output channel in their examples (e.g. Bradbury et al. (2018), Jacot et al. (2018), Yang and Salman (2019)), but state that the theory holds for multiple outputs too. Jacot et al. (2018) claim that a network with $m$ outputs can be handled as $m$ independent neural networks. On the other hand, it is possible to fit every output channel into one equation by generalizing Theorem 1. First, reconstruct the return vector as follows.
In other words, the reconstructed return vector consists of blocks of size $m$, with zeros at the indices of actions not taken at $s_t$ ($m-1$ zeros) and the original return (Eq. (2)) at the position of the taken action. Note that the action dependency of the return vector stems from this construction.
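The block construction of the reconstructed return vector can be sketched as follows (a hypothetical helper, assuming actions are indexed from zero):

```python
import numpy as np

def reconstruct_returns(actions, returns, m):
    """Block return vector: one m-sized block per step, zero everywhere
    except at the index of the action actually taken."""
    out = np.zeros(len(actions) * m)
    for t, (a, g) in enumerate(zip(actions, returns)):
        out[t * m + a] = g
    return out
```

For a two-step episode with $m = 3$ actions, taking actions 1 and 0 with returns 2 and 3 yields the block vector $(0, 2, 0,\; 3, 0, 0)$.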
Given batch $B_e$, and assuming gradient flow, the episodic policy change of the REINFORCE algorithm at the batch states for an $m$-output policy network is
where the kernel and the inverse-policy matrix are extended to all $m$ output channels.
We can deduce the multi-output case by rewriting Eq. (4). For simplicity, we keep the notation, but the matrix and vector sizes are redefined for the vector-output case. Therefore, the log-policy derivatives are evaluated for every possible action at the states contained in the batch:
The zero elements of the reconstructed return vector cancel the log-probabilities of the actions that are not taken in episode $e$ (Eq. (14)); therefore, the final output does not change. Note that the action dependency has moved from the kernel to the return vector: the policy is evaluated for every output channel of the policy network, but a nonzero return is given only if an action is actually taken in the MC trajectory. Continue by separating the derivative matrix into two factors, in the same way as in Eq. (7):
where the diagonalized inverse policies are denoted with $\Pi^{-1}$. The weight change can be written as
Following the same steps as for the proof of Theorem 1, the policy change for every output channel is
It is possible to write the average change of each policy output channel over a batch as
where the averaging matrix consists of identity blocks of size $m$ and is used to sum the elements that correspond to the same output channel. The result of Eq. (2.2) is the policy change at an averaged state of the batch.
With Theorem 2, it is possible to evaluate how the policy changes at the batch states when learning on the batch data, provided the learning rate is small. By manipulating Eq. (18), the policy change can be evaluated at states that are not part of batch $B_e$, too.
Given batch $B_e$, and assuming gradient flow, the episodic policy change of the REINFORCE algorithm at an arbitrary state $s^*$ is
where the neural tangent kernel is evaluated for all pairs formed by $s^*$ and the batch states.
The NTK is based on the partial derivatives of the policy network and can be evaluated anywhere; therefore, the kernel can be computed for any state, and it consists of symmetric blocks. Since $s^*$ is not included in the learning, it does not affect the policy change: its return is zero for every action. The zero return cancels the corresponding term; therefore,
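This first-order prediction can be verified numerically for a generic scalar network output standing in for one policy channel: after a small gradient step driven by a weighting vector $v$ (playing the role of $\Pi^{-1} G_e$), the output change at an unvisited state is approximately $\alpha\,\Theta(s^*, S_e)\,v$. All names and the tiny architecture below are illustrative assumptions:

```python
import numpy as np

def forward(p, x):
    return p["w2"] @ np.tanh(p["W1"] @ x + p["b1"])

def grad(p, x):
    """Flattened gradient of the network output w.r.t. all weights."""
    h = np.tanh(p["W1"] @ x + p["b1"])
    dh = p["w2"] * (1.0 - h**2)
    return np.concatenate([np.outer(dh, x).ravel(), dh, h])

def step(p, X, v, alpha):
    """Gradient step theta <- theta + alpha * J(X)^T v."""
    g = alpha * sum(vi * grad(p, xi) for vi, xi in zip(v, X))
    n1 = p["W1"].size
    n2 = n1 + p["b1"].size
    return {"W1": p["W1"] + g[:n1].reshape(p["W1"].shape),
            "b1": p["b1"] + g[n1:n2],
            "w2": p["w2"] + g[n2:]}

rng = np.random.default_rng(1)
p = {"W1": rng.normal(size=(16, 2)) * 0.5,
     "b1": np.zeros(16),
     "w2": rng.normal(size=16) / 4.0}
X = rng.normal(size=(5, 2))      # visited "batch" states
v = rng.normal(size=5)           # stand-in for Pi^{-1} G_e
x_star = rng.normal(size=2)      # unvisited query state
alpha = 1e-4

# First-order prediction: alpha * Theta(x*, X) @ v, with Theta(x*, X) the NTK row
kernel_row = np.array([grad(p, x_star) @ grad(p, xi) for xi in X])
predicted = alpha * kernel_row @ v
actual = forward(step(p, X, v, alpha), x_star) - forward(p, x_star)
```

With a small learning rate, the predicted and actual output changes agree to within the linearization error, mirroring the gradient-flow assumption of the theorems.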
The numerical accuracy of the computed policy change is further discussed in Appendix A.
3 The NTK-based constrained REINFORCE algorithm
Every physical system has limits (saturation). Intuitively, the agent should not increase the control action toward a limit if the system is already in saturation (for example, if a robot manipulator is at an end position, increasing the force in the direction of that end position is pointless). Assuming we have some idea of how the agent should behave (which actions to take) at specific states, equality and inequality constraints can be prescribed for the policy: the action probabilities at the safe states are required to equal, or to be bounded by, a vector of constant probabilities.
In the sequel, relying on Theorem 3, we provide the mathematical deduction on how to enforce constraints during learning.
3.1 Equality constraints
To deal with constraints, we augment Eq. (18) with the constrained state-action pairs. The policy change at the augmented batch states is shown in Figure 2. The desired policy change at the safe states can be given as
The only unknowns are the returns for the safe actions at the safe states. Note that the upper block of Figure 2 contains differential equations, while the lower block consists of algebraic equations; it is sufficient to solve the algebraic part. With the lower blocks of Figure 2, we can write the linear equation system
This system has a unique solution, as there are as many equations as unknown returns, and it can be solved with e.g. the DGESV algorithm Haidar et al. (2018).
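A minimal sketch of the algebraic solve follows, with hypothetical kernel blocks and, for simplicity, the inverse-policy factor absorbed into the returns; the block names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical NTK for 4 batch states followed by 2 safe states,
# built as a Gram matrix of stacked parameter gradients
J = rng.normal(size=(6, 20))
K = J @ J.T
K_sb, K_ss = K[4:, :4], K[4:, 4:]    # safe-batch and safe-safe blocks

G_batch = rng.normal(size=4)         # returns collected in the episode
d_safe = np.array([0.2, -0.1])       # desired policy change at the safe states

# Algebraic block: K_ss @ G_safe = d_safe - K_sb @ G_batch
# (np.linalg.solve calls LAPACK's *gesv routine under the hood)
G_safe = np.linalg.solve(K_ss, d_safe - K_sb @ G_batch)
```

The computed safe returns exactly compensate for the effect of the episode batch at the safe states, so the combined update produces the desired policy change there.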
In order to obey the constraints, a safe data batch is constructed: the safe states, actions, and computed returns are concatenated with the episode batch. Then, the agent's weights are updated with the appended batch via gradient ascent, Eq. (1).
In the initial stages of learning, the difference between the reference policies and the actual ones will be large; therefore, high returns are needed to eliminate this difference. This also means that the effect of the collected batch on the weights is minor compared to that of the safe state-action-return tuples. In addition, large returns might cause a loss of stability during learning: the returns computed from the linearized policy change might differ from the actual ones, especially if large steps are taken (i.e. large returns are applied). On the other hand, once the policy obeys the constraints, the computed returns will be small and will only compensate for the effect of the actual batch. In the special case where the computed return is zero, the result is the unconstrained policy change at the specific state, as in Section 2.2. In addition, if the policy is smooth, the action probabilities near the constrained points will be similar; in a continuous state space, this implies that defining grid-based (finite) constraints is sufficient.
Time complexity: The critical operations are the kernel evaluations and solving the linear equation system. The DGESV algorithm used for solving the linear equation system has cubic time complexity in the number of constraints Haidar et al. (2018). The cost of the kernel evaluations grows quadratically with the number of evaluated states, and if the kernel is computed for every output channel at the batch states, the time complexity increases further with the number of output channels.
3.2 Inequality constraints
In the same way, inequality constraints can be prescribed. Assume there are some states of the environment where an action shall be taken with at least a constant probability. Then, similarly to Eq. (3.1), the inequality constraints can be written as
Solving this system of inequalities can be turned into a convex quadratic programming problem. Since the original goal of reinforcement learning is to learn from the collected episode batch, the influence of the constraints on the learning (i.e. the magnitude of the safe returns) should be as small as possible. Therefore, the quadratic program is
subject to Eq. (24).
Note that the quadratic cost function is needed to penalize positive and negative returns equally. Quadratic programming with interior-point methods has polynomial time complexity Ye and Tse (1989) and has to be run after every episode. In practical applications, numerical errors are more of an issue than time complexity (i.e. solving the optimization with thousands of constraints).
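For illustration, a minimum-norm QP of this form can be solved even without a dedicated interior-point package. The sketch below uses Hildreth's dual coordinate iteration, chosen here only as a dependency-free example and not the solver used in the paper:

```python
import numpy as np

def min_norm_qp(A, b, iters=500):
    """Minimize ||x||^2 subject to A x >= b via Hildreth's dual iteration.

    Dual variables lam >= 0 are updated one constraint at a time;
    the primal solution is recovered as x = A^T lam."""
    lam = np.zeros(len(b))
    norms = (A * A).sum(axis=1)          # squared row norms of A
    for _ in range(iters):
        for i in range(len(b)):
            x = A.T @ lam
            lam[i] = max(0.0, lam[i] + (b[i] - A[i] @ x) / norms[i])
    return A.T @ lam
```

For example, minimizing $\|x\|^2$ subject to $x_1 + x_2 \ge 2$ yields $x = (1, 1)$, while inactive constraints leave the unconstrained minimum $x = 0$ untouched.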
We summarize the NTK-based constrained REINFORCE algorithm in Algorithm 2.