Accelerated Primal-Dual Policy Optimization for Safe Reinforcement Learning

02/19/2018 ∙ by Qingkai Liang, et al. ∙ MIT, Boston College

Constrained Markov Decision Process (CMDP) is a natural framework for reinforcement learning tasks with safety constraints, where agents learn a policy that maximizes the long-term reward while satisfying the constraints on the long-term cost. A canonical approach for solving CMDPs is the primal-dual method which updates parameters in primal and dual spaces in turn. Existing methods for CMDPs only use on-policy data for dual updates, which results in sample inefficiency and slow convergence. In this paper, we propose a policy search method for CMDPs called Accelerated Primal-Dual Optimization (APDO), which incorporates an off-policy trained dual variable in the dual update procedure while updating the policy in primal space with on-policy likelihood ratio gradient. Experimental results on a simulated robot locomotion task show that APDO achieves better sample efficiency and faster convergence than state-of-the-art approaches for CMDPs.




1 Introduction

In reinforcement learning (RL), agents learn to act by trial and error in an unknown environment. The majority of RL algorithms allow agents to freely explore the environment and exploit any actions that might improve the reward. However, actions that lead to high rewards usually come with high risks. In a safety-critical environment, it is important to enforce safety in the RL algorithm, and a natural way to enforce safety is to incorporate constraints. A standard formulation for RL with safety constraints is the constrained Markov Decision Process (CMDP) framework CMDP , where the agents need to maximize the long-term reward while satisfying the constraints on the long-term cost. Applications of CMDPs include windmill control wind where we need to maximize the average reward (e.g., generated power) while bounding the long-term wear-and-tear cost on critical components (e.g., wind turbine). Another important example is communication network control where we need to maximize network utility while bounding the long-term arrival rate below the long-term service rate in order to maintain network stability (Chapter 1.1 in CMDP ).

While optimal policies for finite CMDPs with known models can be obtained by linear programming, this approach cannot scale to high-dimensional continuous control tasks due to the curse of dimensionality. Recently, RL algorithms that work for high-dimensional CMDPs have been developed based on advances in policy search algorithms TRPO ; A3C . In particular, two constrained policy search algorithms enjoy state-of-the-art performance for CMDPs: Primal-Dual Optimization (PDO) PDO and Constrained Policy Optimization (CPO) CPO . PDO is based on Lagrangian relaxation and updates parameters in primal and dual spaces in turn. Specifically, the primal policy update uses policy gradient descent while the dual variable update uses dual gradient ascent. By comparison, CPO differs from PDO in the dual update procedure: the dual variable is obtained from scratch by solving a carefully-designed optimization problem in each iteration, in order to enforce the safety constraints throughout training. Besides PDO and CPO, there exist other methods for solving CMDPs uchibe2007constrained ; ammar2015safe ; held2017probabilistically , but these approaches are usually computationally intensive or apply only to specific CMDP models and domains.

A notable feature of existing constrained policy search approaches (e.g., PDO and CPO) is that they only use on-policy samples (on-policy samples refer to those generated by the currently-used policy, while off-policy samples are generated by other, possibly unknown, policies), which ensures that the information used for dual updates is unbiased and leads to stable performance improvement. However, such an on-policy dual update is sample-inefficient since historical samples are discarded. Moreover, due to the on-policy nature, dual updates are incremental and suffer from slow convergence, since a (potentially large) batch of on-policy samples has to be obtained before a dual update can be made.

In this paper, we propose a policy search method for CMDPs called Accelerated Primal-Dual Optimization (APDO), which incorporates an off-policy trained dual variable in the dual update procedure while updating the policy in primal space with an on-policy likelihood ratio gradient. Specifically, APDO is similar to PDO except that we perform a one-time adjustment of the dual variable, using a nearly optimal dual variable trained with off-policy data, after a certain number of iterations. Such a one-time adjustment incurs negligible amortized overhead in the long term but greatly improves the sample efficiency and the convergence rate over existing methods. We demonstrate the effectiveness of APDO on a simulated robot locomotion task where the agent must satisfy constraints motivated by safety. The experimental results show that APDO achieves better sample efficiency and faster convergence than state-of-the-art approaches for CMDPs (e.g., PDO and CPO).

Another line of work considers merging the on-policy and off-policy policy gradient updates to improve sample efficiency. Examples of these approaches include Q-Prop Q-prop , IPG IPG , etc. These approaches are designed for unconstrained MDPs and can be applied to the primal policy update. In contrast, APDO leverages off-policy samples for dual updates and is complementary to these efforts on merging on-policy and off-policy policy gradients.

2 Constrained Markov Decision Process

A Markov Decision Process (MDP) is represented by a tuple $(S, A, R, P, \mu)$, where $S$ is the set of states, $A$ is the set of actions, $R: S \times A \times S \to \mathbb{R}$ is the reward function, $P: S \times A \times S \to [0,1]$ is the transition probability function (where $P(s'|s,a)$ is the transition probability from state $s$ to state $s'$ given action $a$), and $\mu: S \to [0,1]$ is the initial state distribution. A stationary policy $\pi: S \to \mathcal{P}(A)$ corresponds to a mapping from states to a probability distribution over actions. Specifically, $\pi(a|s)$ is the probability of selecting action $a$ in state $s$. The set of all stationary policies is denoted by $\Pi$. In this paper, we search for a policy within a parametrized stationary policy class $\Pi_\theta$ (e.g., a neural network policy class with weights $\theta$). We may write a policy as $\pi_\theta$ to emphasize its dependence on the parameter $\theta$. The long-term discounted reward under policy $\pi$ is denoted as $J(\pi) = \mathbb{E}_{\tau \sim \pi}\big[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, s_{t+1})\big]$, where $\gamma \in (0,1)$ is the discount factor, $\tau = (s_0, a_0, s_1, \dots)$ denotes a trajectory, and $\tau \sim \pi$ means that the distribution over trajectories is determined by policy $\pi$, i.e., $s_0 \sim \mu$, $a_t \sim \pi(\cdot|s_t)$, $s_{t+1} \sim P(\cdot|s_t, a_t)$.

A constrained Markov Decision Process (CMDP) is an MDP augmented with constraints on long-term discounted costs. Specifically, we augment the ordinary MDP with cost functions $C_1, \dots, C_m$, where each cost function $C_i: S \times A \times S \to \mathbb{R}$ is a mapping from transition tuples to costs. The long-term discounted cost under policy $\pi$ is similarly defined as $J_{C_i}(\pi) = \mathbb{E}_{\tau \sim \pi}\big[\sum_{t=0}^{\infty} \gamma^t C_i(s_t, a_t, s_{t+1})\big]$, and the corresponding limit is $d_i$. In a CMDP, we aim to select a policy that maximizes the long-term reward $J(\pi)$ while satisfying the constraints on the long-term costs $J_{C_i}(\pi)$, i.e.,

$$\max_{\pi \in \Pi_\theta} J(\pi) \quad \text{s.t.} \quad J_{C_i}(\pi) \leq d_i, \; i = 1, \dots, m. \qquad (1)$$
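To make the discounted quantities concrete, the following Python sketch estimates $J(\pi)$ and $J_{C}(\pi)$ by Monte Carlo from sampled trajectories. This is illustrative only and not the paper's code; the trajectory structure (dicts with `rewards` and `costs` lists) is a hypothetical convention.

```python
def discounted_sum(values, gamma):
    """Return sum_t gamma^t * values[t] for one trajectory."""
    total = 0.0
    for t, v in enumerate(values):
        total += (gamma ** t) * v
    return total

def estimate_objectives(trajectories, gamma):
    """Monte Carlo estimates of the long-term reward J and cost J_C.

    Each trajectory is a dict with 'rewards' and 'costs' lists
    (a hypothetical structure, for illustration only).
    """
    n = len(trajectories)
    J = sum(discounted_sum(tr["rewards"], gamma) for tr in trajectories) / n
    J_C = sum(discounted_sum(tr["costs"], gamma) for tr in trajectories) / n
    return J, J_C
```

In practice the infinite horizon is truncated at the rollout length, so these are finite-horizon approximations of the discounted objectives.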


3 Algorithm

To solve CMDPs, we employ the Lagrangian relaxation procedure (Chapter 3 in lagrange ). Specifically, the Lagrangian function for the CMDP problem (1) is

$$L(\theta, \lambda) = J(\pi_\theta) - \sum_{i=1}^{m} \lambda_i \big( J_{C_i}(\pi_\theta) - d_i \big), \qquad (2)$$

where $\lambda = (\lambda_1, \dots, \lambda_m) \geq 0$ is the vector of Lagrange multipliers. Then the constrained problem (1) can be converted to the following unconstrained problem:

$$(\theta^*, \lambda^*) = \arg\min_{\lambda \geq 0} \max_{\theta} L(\theta, \lambda). \qquad (3)$$
To solve the unconstrained minimax problem (3), a canonical approach is the iterative primal-dual method, where in each iteration we update the primal policy and the dual variable in turn. The primal-dual update procedures at iteration $k$ are as follows:

• Fix $\lambda_k$ and perform the policy gradient update $\theta_{k+1} = \theta_k + \alpha_k \nabla_\theta L(\theta_k, \lambda_k)$, where $\alpha_k$ is the step size. The policy gradient could be an on-policy likelihood ratio policy gradient (e.g., REINFORCE REINFORCE and TRPO TRPO ) or an off-policy deterministic policy gradient (e.g., DDPG DDPG ).

• Fix $\theta_{k+1}$ and perform the dual update $\lambda_{k+1} = \Gamma(\lambda_k)$. Existing methods for CMDPs, such as PDO and CPO, differ in the choice of the dual update procedure $\Gamma$. For example, PDO uses simple dual gradient ascent $\lambda_{k+1} = \big[\lambda_k + \beta_k \big(J_C(\pi_{\theta_{k+1}}) - d\big)\big]_+$, where $\beta_k$ is the step size and $[\cdot]_+$ is the projection onto the dual space $\{\lambda : \lambda \geq 0\}$. By comparison, CPO derives the dual variable by solving an optimization problem from scratch in order to enforce the constraints in every iteration.
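The PDO-style dual step above can be sketched as a one-line function (a minimal sketch with scalar stand-ins for the cost estimate and limit):

```python
def dual_ascent_step(lam, J_C, d, beta):
    """One PDO dual update: move lambda along the constraint violation
    J_C - d, then project back onto the nonnegative orthant."""
    return max(0.0, lam + beta * (J_C - d))
```

When the estimated cost exceeds the limit the multiplier grows, penalizing the constraint more heavily in the next primal update; when the constraint is slack the multiplier shrinks toward zero.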

However, the dual update procedures used in existing methods (e.g., PDO and CPO) are incremental and only use on-policy samples, resulting in sample inefficiency and slow convergence to the optimal primal-dual solution $(\theta^*, \lambda^*)$. In this paper, we propose to incorporate an off-policy trained dual variable in the dual update procedure in order to improve sample efficiency and speed up the search for the optimal dual variable $\lambda^*$. The algorithm is called Accelerated Primal-Dual Optimization (APDO) and is described in Algorithm 1. APDO is similar to PDO in that in most iterations the dual variable is updated according to simple dual gradient ascent (step 6), but the key innovation of APDO is a one-time dual adjustment with an off-policy trained dual variable $\bar{\lambda}$ after $k_0$ iterations (steps 7-10). The off-policy trained $\bar{\lambda}$ is obtained by running an off-policy algorithm for CMDPs on the historical data stored in the replay buffer. We provide a primal-dual version of the DDPG algorithm in the supplementary material for training $\bar{\lambda}$. Although the off-policy trained dual variable could be biased, it provides a nearly optimal starting point for further fine-tuning of the dual variable using new on-policy data.

The improvement of sample efficiency in APDO is due to the fact that off-policy training can repeatedly exploit historical data while on-policy update only uses each sample once; the acceleration effect of APDO is due to the fact that off-policy training directly solves for the optimal dual variable offline, thus avoiding the slow on-policy dual update as in the existing approaches where only one dual update can be taken after a large batch of samples are obtained.

Note that the adjustment epoch $k_0$ is an important parameter in APDO. Using a small $k_0$ avoids the slow incremental dual update early on, but the dual estimate $\bar{\lambda}$ could be highly biased and inaccurate due to an insufficient amount of data. On the other hand, using a larger $k_0$ provides a more accurate dual estimate at the expense of a delayed adjustment.

1:  Initialize policy $\theta_0$, dual variable $\lambda_0$, replay buffer $\mathcal{R} \leftarrow \emptyset$
2:  for $k = 0, 1, 2, \dots$ do
3:     Sample a set of trajectories under the current policy $\pi_{\theta_k}$ (containing $N$ samples)
4:     Add the sampled data to the replay buffer $\mathcal{R}$
5:     Update the primal policy $\theta_{k+1}$ with any on-policy likelihood ratio gradient method (e.g., TRPO) using the sampled on-policy trajectories and the current dual variable $\lambda_k$
6:     Update the dual variable with dual gradient ascent: $\lambda_{k+1} = \big[\lambda_k + \beta_k \big(J_C(\pi_{\theta_{k+1}}) - d\big)\big]_+$
7:     if $k = k_0$ then
8:        Compute the off-policy trained $\bar{\lambda}$ with the replay buffer (e.g., using the primal-dual DDPG in the supplementary material)
9:        Set $\lambda_{k+1} = \bar{\lambda}$
10:     end if
11:  end for
Algorithm 1 Accelerated Primal-Dual Policy Optimization (APDO)
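To illustrate the structure of Algorithm 1, the following toy Python sketch runs the primal-dual iteration on a one-dimensional analytical problem (maximize $-(\theta-2)^2$ subject to $\theta \leq 1$) instead of a real CMDP. The "off-policy trained" dual variable at epoch $k_0$ is simulated by the known optimal multiplier of this toy problem; that stand-in, and all the step sizes, are assumptions for illustration only.

```python
def apdo_toy(k0=100, iters=2000, alpha=0.05, beta=0.05):
    """APDO-style loop on:  maximize -(theta - 2)^2  s.t.  theta <= 1.

    Lagrangian: L = -(theta - 2)^2 - lam * (theta - 1).
    Analytical optimum: theta* = 1, lam* = 2.
    """
    theta, lam = 0.0, 0.0
    for k in range(iters):
        # primal gradient ascent on L (plays the role of the policy update)
        theta += alpha * (-2.0 * (theta - 2.0) - lam)
        # dual gradient ascent on the constraint violation, projected to >= 0
        lam = max(0.0, lam + beta * (theta - 1.0))
        if k == k0:
            # one-time dual adjustment; stands in for the off-policy
            # trained dual variable (here: the known optimum, an assumption)
            lam = 2.0
    return theta, lam
```

The jump at `k0` mirrors steps 7-10 of Algorithm 1: it moves the multiplier near its optimal value at once, after which the incremental dual ascent only fine-tunes.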

4 Experiments

We evaluate APDO against two state-of-the-art algorithms for solving CMDPs (i.e., CPO and PDO) on a simple point-gather control task in MuJoCo mujoco with an additional safety constraint, as used in CPO . All experiments are implemented in rllab rllab . The detailed task description and experiment parameters are provided in the supplementary material. In particular, the adjustment epoch $k_0$ used for APDO, along with additional experimental results regarding its effect, is given in the supplementary material.

Figure 1: Performance comparison among APDO, PDO and CPO. (a) Average Return; (b) Average Cost (limit is 0.2); (c) Dual Variable.

Figure 1 shows the learning curves for APDO, CPO and PDO under cost constraints. It can be observed from Figure 1(b) that APDO successfully enforces the constraint, driving the cost to the limit value at approximately the same speed as CPO. More importantly, APDO generally outperforms CPO on reward performance without compromising constraint stabilization, thus achieving better sample efficiency. For example, CPO takes 90 epochs to achieve an average reward of 11 while satisfying the safety constraint. By comparison, APDO takes only 45 epochs to reach the same point, which corresponds to a 2x improvement in sample efficiency over CPO in this task. In addition, PDO fails to enforce the safety constraint during the first 150 epochs due to its slow convergence. Using a larger step size may help speed up the convergence, but in this case PDO will over-correct in response to constraint violations and behave too conservatively. We provide additional discussion on the choice of step size for PDO and APDO in the supplementary material.

Figure 1(c) illustrates the learning trajectory of the dual variable under PDO and APDO (the dual variable for CPO is not shown since CPO has a sophisticated recovery scheme to enforce constraints, in which the dual variable may not be easily obtained). We find that APDO converges to the optimal dual variable significantly faster than PDO. In particular, there is a "jump" of the dual variable after several epochs in APDO, due to the dual adjustment with the off-policy trained $\bar{\lambda}$. By comparison, PDO has to adjust its dual variable incrementally with on-policy data.

5 Future Work

Since the adjustment epoch $k_0$ is an important parameter in APDO, one important direction for future work is to provide theoretical guidance on the setting of $k_0$. It is also very interesting (yet challenging) to provide theoretical justification for the acceleration effects of APDO. Moreover, as we observed in the experiments, the training trajectory generated by APDO strives for the best tradeoff between improving rewards and enforcing cost constraints. Another direction is to incorporate a safety parameter that controls the degree of safety awareness. By tuning this parameter, the RL algorithm should be able to make both risk-averse actions (which enforce safety constraints as soon as possible) and risk-neutral actions (which give priority to improving rewards).


This work was supported by NSF Grant CNS-1524317 and by DARPA I2O and Raytheon BBN Technologies under Contract No. HR0011-15-C-0097. The authors would also like to acknowledge Chengtao Li, who provided valuable feedback on this work.


Supplementary Materials

Appendix A Primal-Dual DDPG for CMDPs

In this appendix, we provide a primal-dual version of the DDPG algorithm for solving CMDPs. The primal policy update and the dual variable update in this algorithm use only the off-policy data stored in the replay buffer, which makes it suitable for training $\bar{\lambda}$ in our APDO algorithm. For simplicity, we only present the algorithm for CMDPs with a single constraint; the multiple-constraint case can be obtained straightforwardly. The primal-dual DDPG algorithm maintains the following neural networks.

  • Reward critic Q-network $Q_R(s, a \mid \omega_R)$ and reward target Q-network $Q_R'(s, a \mid \omega_R')$

  • Cost critic Q-network $Q_C(s, a \mid \omega_C)$ and cost target Q-network $Q_C'(s, a \mid \omega_C')$

  • Actor policy network $\mu(s \mid \phi)$ and actor target policy network $\mu'(s \mid \phi')$

The target networks are used to slowly track the learned networks.
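This slow tracking is the standard Polyak (soft) update used in DDPG; a minimal sketch, where flat parameter lists are stand-ins for the actual network weights:

```python
def soft_update(learned, target, tau):
    """Move each target parameter a small step (tau) toward the
    corresponding learned parameter: w' <- tau * w + (1 - tau) * w'."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(learned, target)]
```

A small `tau` keeps the bootstrap targets slowly-moving, which stabilizes critic training.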

1:  Randomly initialize reward critic Q-network $Q_R(s, a \mid \omega_R)$, cost critic Q-network $Q_C(s, a \mid \omega_C)$ and actor network $\mu(s \mid \phi)$
2:  Initialize target networks: $\omega_R' \leftarrow \omega_R$, $\omega_C' \leftarrow \omega_C$, $\phi' \leftarrow \phi$
3:  Initialize replay buffer $\mathcal{R}$ and dual variable $\lambda \geq 0$
4:  for episode $= 1, \dots, M$ do
5:     Initialize a random process $\mathcal{N}$ for action exploration
6:     Receive initial state $s_1$
7:     for $t = 1, \dots, T$ do
8:        Select action $a_t = \mu(s_t \mid \phi) + \mathcal{N}_t$
9:        Execute action $a_t$ and observe reward $r_t$, cost $c_t$ and next state $s_{t+1}$
10:        Store transition $(s_t, a_t, r_t, c_t, s_{t+1})$ in the replay buffer $\mathcal{R}$
11:        Sample a random batch of $N$ transitions $(s_i, a_i, r_i, c_i, s_{i+1})$ from the replay buffer
12:        Set $y_i^R = r_i + \gamma Q_R'(s_{i+1}, \mu'(s_{i+1} \mid \phi') \mid \omega_R')$ and $y_i^C = c_i + \gamma Q_C'(s_{i+1}, \mu'(s_{i+1} \mid \phi') \mid \omega_C')$
13:        Update the reward critic by minimizing $\frac{1}{N}\sum_i \big(y_i^R - Q_R(s_i, a_i \mid \omega_R)\big)^2$ and update the cost critic by minimizing $\frac{1}{N}\sum_i \big(y_i^C - Q_C(s_i, a_i \mid \omega_C)\big)^2$
14:        Update the actor policy using the sampled policy gradient $\frac{1}{N}\sum_i \nabla_a \big[Q_R(s, a \mid \omega_R) - \lambda Q_C(s, a \mid \omega_C)\big]\big|_{s=s_i, a=\mu(s_i)} \nabla_\phi \mu(s \mid \phi)\big|_{s=s_i}$
15:        Update the dual variable using the sampled dual gradient: $\lambda \leftarrow \big[\lambda + \beta \big(\frac{1}{N}\sum_i Q_C(s_i, \mu(s_i) \mid \omega_C) - d\big)\big]_+$
16:        Update target networks: $\omega_R' \leftarrow \tau \omega_R + (1-\tau)\omega_R'$, $\omega_C' \leftarrow \tau \omega_C + (1-\tau)\omega_C'$, $\phi' \leftarrow \tau \phi + (1-\tau)\phi'$
17:     end for
18:  end for
Algorithm 2 Primal-Dual DDPG
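Step 12 of Algorithm 2 builds separate one-step bootstrap targets for the reward and cost critics. A minimal sketch, with scalars standing in for the target-network evaluations at the next state:

```python
def critic_targets(r, c, gamma, q_r_next, q_c_next):
    """One-step TD targets for the reward critic (y_R) and the cost
    critic (y_C), given target-network values at the next state."""
    y_r = r + gamma * q_r_next
    y_c = c + gamma * q_c_next
    return y_r, y_c
```

Both critics share the same structure; only the signal (reward vs. cost) differs, so the cost critic learns the discounted cost-to-go that the dual update in step 15 compares against the limit $d$.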

Appendix B Experiment Details

Task description. Specifically, a point mass receives a reward of 10 for collecting an apple, and a cost of 1 for collecting a bomb. The agent is constrained to incur no more than 0.2 cost in the long term. Two apples and eight bombs spawn on the map at the start of each episode.

Parameters for primal policy update. For all experiments, we use neural network policies with two hidden layers and tanh non-linearity, and all of the schemes (PDO, CPO, APDO) use TRPO to update the primal policy, with a batch size of 50000 and a KL-divergence step size of 0.01. The discount factor is 0.995 and the rollout length is 15. We use GAE-$\lambda$ GAE for estimating the regular advantages.
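The GAE-$\lambda$ advantage estimator used in the primal update can be sketched as follows (a generic implementation of the estimator from GAE , not the paper's code):

```python
def gae_advantages(rewards, values, gamma, lam):
    """Generalized Advantage Estimation for one trajectory.

    rewards: list of length T; values: list of length T+1, where values[T]
    bootstraps the tail (0 for a terminal state).
    A_t = sum_l (gamma*lam)^l * delta_{t+l}, with
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    """
    T = len(rewards)
    adv = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv
```

With $\lambda = 1$ this reduces to the Monte Carlo return minus the baseline; smaller $\lambda$ trades variance for bias.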

Parameters for dual variable update. For dual updates, PDO and APDO both use dual gradient ascent. Note that the step size for dual gradient ascent is important in PDO: if it is set too small, the dual variable will not update quickly enough to meaningfully enforce the constraint; if it is set too high, the algorithm will over-correct in response to constraint violations and behave too conservatively CPO . As a result, picking a proper step size is critical and difficult in PDO. We experiment with different step sizes and find that 0.1 works best for PDO; the reported results of PDO are under step size 0.1. By comparison, selecting the step size in APDO is much easier, since the one-time off-policy dual adjustment directly boosts the dual variable to a "nearly optimal" point and we only need a relatively small step size for fine-tuning after the adjustment. For the reported experimental results, we also set the step size to 0.1 for APDO for fairness of comparison. As for CPO, we adopt the same set of parameters as in the original CPO paper CPO (specifically, the parameters used in the point-gather task).

Parameters for training $\bar{\lambda}$. We use primal-dual DDPG to train $\bar{\lambda}$. The reward critic network and the cost critic network are each parametrized by a neural network with two hidden layers and tanh nonlinearity. The actor policy network is also represented by a neural network with two hidden layers and tanh nonlinearity. The reward/cost critic Q-networks and the actor policy network are updated with Adam adam . The update for the dual variable in primal-dual DDPG employs simple dual gradient ascent. We also use soft target networks with a small tracking rate $\tau$. Since off-policy algorithms like DDPG are usually unstable, we set $\bar{\lambda}$ to be the average of all historical dual variables throughout the off-policy training trajectory.

Effect of $k_0$. Figure 2 shows the effect of the adjustment epoch $k_0$ on the performance of APDO, where we experiment with several values of $k_0$. We observe that using a smaller $k_0$ avoids the slow incremental dual update earlier, but due to the limited amount of available samples in the replay buffer, the off-policy dual estimate $\bar{\lambda}$ could be highly biased and inaccurate. On the other hand, using a larger $k_0$ provides a more accurate dual estimate at the expense of a delayed adjustment.

Figure 2: Effect of adjustment epoch $k_0$. (a) Average Return; (b) Average Cost (limit is 0.2); (c) Dual Variable.