1 Introduction
In reinforcement learning (RL), agents learn to act by trial and error in an unknown environment. Most RL algorithms allow agents to freely explore the environment and exploit any actions that might improve the reward. However, actions that lead to high rewards usually come with high risks. In a safety-critical environment, it is important to enforce safety in the RL algorithm, and a natural way to enforce safety is to incorporate constraints. A standard formulation for RL with safety constraints is the constrained Markov decision process (CMDP) framework [CMDP], where the agent needs to maximize the long-term reward while satisfying constraints on the long-term cost. Applications of CMDPs include windmill control [wind], where we need to maximize the average reward (e.g., generated power) while bounding the long-term wear-and-tear cost on critical components (e.g., the wind turbine). Another important example is communication network control, where we need to maximize network utility while keeping the long-term arrival rate below the long-term service rate in order to maintain network stability (Chapter 1.1 in [CMDP]).
While optimal policies for finite CMDPs with known models can be obtained by linear programming [LP1], this approach cannot scale to high-dimensional continuous control tasks due to the curse of dimensionality. Recently, RL algorithms that work for high-dimensional CMDPs have been developed based on advances in policy search algorithms [TRPO; A3C]. In particular, two constrained policy search algorithms enjoy state-of-the-art performance for CMDPs: Primal-Dual Optimization (PDO) [PDO] and Constrained Policy Optimization (CPO) [CPO]. PDO is based on Lagrangian relaxation and updates parameters in the primal and dual spaces in turn. Specifically, the primal policy update uses policy gradient descent while the dual variable update uses dual gradient ascent. By comparison, CPO differs from PDO in the dual update procedure: the dual variable is obtained from scratch by solving a carefully designed optimization problem in each iteration, in order to enforce safety constraints throughout training. Besides PDO and CPO, there exist other methods for solving CMDPs [uchibe2007constrained; ammar2015safe; held2017probabilistically], but these approaches are usually computationally intensive or only apply to specific CMDP models and domains.

A notable feature of existing constrained policy search approaches (e.g., PDO and CPO) is that they use only on-policy samples (i.e., samples generated by the currently used policy, as opposed to off-policy samples generated by other, unknown policies), which ensures that the information used for dual updates is unbiased and leads to stable performance improvement. However, such an on-policy dual update is sample-inefficient since historical samples are discarded. Moreover, due to the on-policy nature, dual updates are incremental and suffer from slow convergence, since a (potentially large) batch of on-policy samples has to be obtained before a dual update can be made.
In this paper, we propose a policy search method for CMDPs called Accelerated Primal-Dual Optimization (APDO), which incorporates an off-policy trained dual variable in the dual update procedure while updating the policy in the primal space with the on-policy likelihood ratio gradient. Specifically, APDO is similar to PDO except that we perform a one-time adjustment of the dual variable, replacing it with a nearly optimal dual variable trained with off-policy data after a certain number of iterations. This one-time adjustment incurs negligible amortized overhead in the long term but greatly improves the sample efficiency and the convergence rate over existing methods. We demonstrate the effectiveness of APDO on a simulated robot locomotion task where the agent must satisfy constraints motivated by safety. The experimental results show that APDO achieves better sample efficiency and faster convergence than state-of-the-art approaches for CMDPs (e.g., PDO and CPO).
Another line of work considers merging on-policy and off-policy policy gradient updates to improve sample efficiency; examples include Q-Prop [Qprop] and IPG [IPG]. These approaches are designed for unconstrained MDPs and can be applied to the primal policy update. In contrast, APDO leverages off-policy samples for dual updates and is therefore complementary to these efforts on merging on-policy and off-policy policy gradients.
2 Constrained Markov Decision Process
A Markov decision process (MDP) is represented by a tuple $(\mathcal{S}, \mathcal{A}, R, P, \mu)$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions, $R: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is the reward function, $P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0,1]$ is the transition probability function (where $P(s'|s,a)$ is the transition probability from state $s$ to state $s'$ given action $a$), and $\mu: \mathcal{S} \to [0,1]$ is the initial state distribution. A stationary policy $\pi: \mathcal{S} \to \mathcal{P}(\mathcal{A})$ corresponds to a mapping from states to probability distributions over actions. Specifically, $\pi(a|s)$ is the probability of selecting action $a$ in state $s$. The set of all stationary policies is denoted by $\Pi$. In this paper, we search for a policy within a parametrized stationary policy class $\Pi_\theta$ (e.g., a neural network policy class with weights $\theta$). We may write a policy as $\pi_\theta$ to emphasize its dependence on the parameter $\theta$. The long-term discounted reward under policy $\pi$ is denoted by $J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, s_{t+1})\right]$, where $\gamma \in (0,1)$ is the discount factor, $\tau = (s_0, a_0, s_1, \ldots)$ denotes a trajectory, and $\tau \sim \pi$ means that the distribution over trajectories is determined by policy $\pi$, i.e., $s_0 \sim \mu$, $a_t \sim \pi(\cdot|s_t)$, $s_{t+1} \sim P(\cdot|s_t, a_t)$.

A constrained Markov decision process (CMDP) is an MDP augmented with constraints on long-term discounted costs. Specifically, we augment the ordinary MDP with cost functions $C_1, \ldots, C_m$, where each cost function $C_i: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is a mapping from transition tuples to costs. The long-term discounted cost under policy $\pi$ is similarly defined as $J_{C_i}(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t C_i(s_t, a_t, s_{t+1})\right]$, and the corresponding constraint limit is $d_i$. In a CMDP, we aim to select a policy that maximizes the long-term reward while satisfying the constraints on the long-term costs, i.e.,

$$\max_{\theta} \; J(\pi_\theta) \quad \text{s.t.} \quad J_{C_i}(\pi_\theta) \le d_i, \quad i = 1, \ldots, m. \qquad (1)$$
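To make the objective and constraint in (1) concrete, the following is a minimal sketch (our own illustration, not from the paper) of how the discounted return $J(\pi)$ and discounted cost $J_C(\pi)$ can be estimated by Monte Carlo from sampled trajectories; the trajectory format and helper names are hypothetical.

```python
import numpy as np

def discounted_sum(values, gamma):
    """Compute sum_t gamma^t * values[t] for one trajectory."""
    return sum((gamma ** t) * v for t, v in enumerate(values))

def estimate_objectives(trajectories, gamma=0.995):
    """Monte Carlo estimates of J(pi) and J_C(pi) from sampled trajectories.

    Each trajectory is assumed to be a dict with per-step 'rewards' and 'costs' lists.
    """
    returns = [discounted_sum(traj["rewards"], gamma) for traj in trajectories]
    costs = [discounted_sum(traj["costs"], gamma) for traj in trajectories]
    return np.mean(returns), np.mean(costs)
```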
3 Algorithm
To solve CMDPs, we employ the Lagrangian relaxation procedure (Chapter 3 in [lagrange]). Specifically, the Lagrangian function for the CMDP problem (1) is

$$L(\theta, \lambda) = -J(\pi_\theta) + \sum_{i=1}^{m} \lambda_i \left( J_{C_i}(\pi_\theta) - d_i \right), \qquad (2)$$
where $\lambda = (\lambda_1, \ldots, \lambda_m) \succeq 0$ is the vector of Lagrange multipliers. Then the constrained problem (1) can be converted to the following unconstrained problem:

$$\min_{\theta} \max_{\lambda \succeq 0} \; L(\theta, \lambda). \qquad (3)$$
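For reference, under the Lagrangian (2) the partial derivative of $L$ with respect to each multiplier is simply the constraint violation, which is the quantity used by the dual gradient ascent step described below (a standard property of the relaxation, stated here in the notation above):

$$\nabla_{\lambda_i} L(\theta, \lambda) = J_{C_i}(\pi_\theta) - d_i, \qquad i = 1, \ldots, m.$$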
To solve the unconstrained minimax problem (3), a canonical approach is the iterative primal-dual method, where in each iteration we update the primal policy and the dual variable in turn. The primal-dual update procedures at iteration $k$ are as follows:
Fix $\lambda^k$ and perform the policy gradient update $\theta^{k+1} = \theta^k - \eta_k \nabla_\theta L(\theta^k, \lambda^k)$, where $\eta_k$ is the step size. The policy gradient could be the on-policy likelihood ratio policy gradient (e.g., REINFORCE [REINFORCE] and TRPO [TRPO]) or the off-policy deterministic policy gradient (e.g., DDPG [DDPG]).
Fix $\theta^{k+1}$ and perform the dual update $\lambda^{k+1} = f(\lambda^k, \theta^{k+1})$. Existing methods for CMDPs, such as PDO and CPO, differ in the choice of the dual update procedure $f$. For example, PDO uses simple dual gradient ascent, $\lambda_i^{k+1} = \left[ \lambda_i^k + \beta_k \big( J_{C_i}(\pi_{\theta^{k+1}}) - d_i \big) \right]_+$, where $\beta_k$ is the step size and $[\cdot]_+$ is the projection onto the dual space $[0, \infty)$. By comparison, CPO derives the dual variable by solving an optimization problem from scratch in order to enforce the constraints in every iteration.
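As an illustration, here is a minimal sketch (our own, not the authors' implementation) of one primal-dual iteration with a PDO-style dual gradient ascent step for a single-constraint CMDP; `policy_gradient_step`, `sample_trajectories`, and `estimate_objectives` are hypothetical helpers standing in for the actual policy optimizer (e.g., TRPO) and the Monte Carlo estimators sketched earlier.

```python
def primal_dual_iteration(theta, lam, sample_trajectories,
                          policy_gradient_step, estimate_objectives,
                          cost_limit_d, dual_step_size=0.1):
    """One iteration of the generic primal-dual method for a single-constraint CMDP."""
    # Primal step: improve the policy on the Lagrangian with the current multiplier fixed.
    trajectories = sample_trajectories(theta)
    theta = policy_gradient_step(theta, trajectories, lam)

    # Dual step (PDO-style): gradient ascent on the estimated constraint violation,
    # projected onto the nonnegative orthant.
    _, cost_estimate = estimate_objectives(trajectories)
    lam = max(0.0, lam + dual_step_size * (cost_estimate - cost_limit_d))
    return theta, lam
```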
However, the dual update procedures used in existing methods (e.g., PDO and CPO) are incremental and only use on-policy samples, resulting in sample inefficiency and slow convergence to the optimal primal-dual solution. In this paper, we propose to incorporate an off-policy trained dual variable in the dual update procedure in order to improve sample efficiency and speed up the search for the optimal dual variable. The resulting algorithm, called Accelerated Primal-Dual Optimization (APDO), is described in Algorithm 1. APDO is similar to PDO in that, in most iterations, the dual variable is updated by simple dual gradient ascent (step 6); the key innovation of APDO is a one-time dual adjustment using an off-policy trained dual variable after a pre-specified number of iterations, the adjustment epoch (steps 7-10). The off-policy trained dual variable is obtained by running an off-policy algorithm for CMDPs on the historical data stored in the replay buffer; we provide a primal-dual version of the DDPG algorithm for this purpose in the supplementary material. Although the off-policy trained dual variable could be biased, it provides a nearly optimal starting point for further fine-tuning of the dual variable using new on-policy data.
The improvement in sample efficiency is due to the fact that off-policy training can repeatedly exploit historical data, whereas on-policy updates use each sample only once. The acceleration effect is due to the fact that off-policy training directly solves for the optimal dual variable offline, thus avoiding the slow on-policy dual updates of existing approaches, where only one dual update can be taken after a large batch of samples is obtained.
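The following is a minimal sketch of the APDO training loop under these assumptions, again for a single constraint; `train_dual_off_policy` is a hypothetical stand-in for the primal-dual DDPG procedure described in the supplementary material, and the other helpers are the hypothetical ones used above.

```python
def apdo(theta, lam, num_epochs, adjustment_epoch, sample_trajectories,
         policy_gradient_step, estimate_objectives, train_dual_off_policy,
         cost_limit_d, dual_step_size=0.1):
    """Accelerated Primal-Dual Optimization (sketch): PDO plus a one-time
    dual adjustment using an off-policy trained dual variable."""
    replay_buffer = []
    for epoch in range(num_epochs):
        # On-policy primal update and incremental dual gradient ascent (as in PDO).
        trajectories = sample_trajectories(theta)
        replay_buffer.extend(trajectories)   # keep history for off-policy training
        theta = policy_gradient_step(theta, trajectories, lam)
        _, cost_estimate = estimate_objectives(trajectories)
        lam = max(0.0, lam + dual_step_size * (cost_estimate - cost_limit_d))

        # One-time adjustment: replace the incrementally updated multiplier with a
        # nearly optimal dual variable trained off-policy on the replay buffer.
        if epoch == adjustment_epoch:
            lam = train_dual_off_policy(replay_buffer, cost_limit_d)
    return theta, lam
```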
Note that the adjustment epoch is an important parameter in APDO. Using a small adjustment epoch avoids slow incremental dual updates early on, but the off-policy dual estimate could be highly biased and inaccurate due to an insufficient amount of data. On the other hand, using a larger adjustment epoch provides a more accurate dual estimate at the expense of a delayed adjustment.

4 Experiments
We evaluate APDO against two state-of-the-art algorithms for solving CMDPs (CPO and PDO) on a simple point-gather control task in MuJoCo [mujoco], with an additional safety constraint as used in [CPO]. All experiments are implemented in rllab [rllab]. The detailed task description and experiment parameters are provided in the supplementary material. In particular, the adjustment epoch used for APDO, together with additional experimental results on its effect, is also given in the supplementary material.
Figure 1 shows the learning curves for APDO, CPO, and PDO under the cost constraint. It can be observed from Figure 1(b) that APDO enforces the constraint, driving the cost to the limit value at approximately the same speed as CPO. More importantly, APDO generally outperforms CPO in reward performance without compromising constraint stabilization, thus achieving better sample efficiency. For example, CPO takes 90 epochs to achieve an average reward of 11 while satisfying the safety constraint; APDO takes only 45 epochs to reach the same point, a 2x improvement in sample efficiency on this task. In addition, PDO fails to enforce the safety constraint during the first 150 epochs due to its slow convergence. Using a larger step size may help speed up convergence, but in that case PDO overcorrects in response to constraint violations and behaves too conservatively. We provide additional discussion on the choice of step size for PDO and APDO in the supplementary material.
Figure 1(c) illustrates the learning trajectory of the dual variable under PDO and APDO (the dual variable for CPO is not shown since CPO uses a sophisticated recovery scheme to enforce constraints, in which the dual variable cannot be easily extracted). We find that APDO converges to the optimal dual variable significantly faster than PDO. In particular, there is a "jump" of the dual variable after several epochs in APDO, due to the dual adjustment with the off-policy trained dual variable. By comparison, PDO has to adjust its dual variable incrementally with on-policy data.
5 Future Work
Since the adjustment epoch is an important parameter in APDO, one important direction for future work is to provide theoretical guidance on how to set it. It would also be very interesting (yet challenging) to provide theoretical justification for the acceleration effect of APDO. Moreover, as observed in the experiments, the training trajectory generated by APDO strives for the best trade-off between improving rewards and enforcing cost constraints. Another direction is to incorporate a safety parameter that controls the degree of safety awareness: by tuning this parameter, the RL algorithm should be able to produce both risk-averse behavior (which enforces safety constraints as soon as possible) and risk-neutral behavior (which gives priority to improving rewards).
Acknowledgment
This work was supported by NSF Grant CNS-1524317 and by DARPA I2O and Raytheon BBN Technologies under Contract No. HR0011-15-C-0097. The authors would also like to acknowledge Chengtao Li, who provided valuable feedback on this work.
References
 (1) S. Shalev-Shwartz, S. Shammah, and A. Shashua, "Safe, multi-agent, reinforcement learning for autonomous driving," arXiv preprint arXiv:1610.03295, 2016.
 (2) E. Altman, Constrained Markov decision processes. CRC Press, 1999, vol. 7.
 (3) P. Tavner, J. Xiang, and F. Spinato, “Reliability analysis for wind turbines,” Wind Energy, vol. 10, no. 1, pp. 1–18, 2007.
 (4) E. A. Feinberg and A. Shwartz, “Constrained dynamic programming with two discount factors: Applications and an algorithm,” IEEE Transactions on Automatic Control, vol. 44, no. 3, pp. 628–631, 1999.

 (5) J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust region policy optimization," in Proceedings of the 32nd International Conference on Machine Learning (ICML15), 2015, pp. 1889–1897.
 (6) V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, "Asynchronous methods for deep reinforcement learning," in International Conference on Machine Learning, 2016, pp. 1928–1937.
 (7) Y. Chow, M. Ghavamzadeh, L. Janson, and M. Pavone, "Risk-constrained reinforcement learning with percentile risk criteria," arXiv preprint arXiv:1512.01629, 2015.
 (8) J. Achiam, D. Held, A. Tamar, and P. Abbeel, "Constrained policy optimization," in Proceedings of the 34th International Conference on Machine Learning (ICML17), 2017.
 (9) E. Uchibe and K. Doya, “Constrained reinforcement learning from intrinsic and extrinsic rewards,” in Development and Learning, 2007. ICDL 2007. IEEE 6th International Conference on. IEEE, 2007, pp. 163–168.
 (10) H. B. Ammar, R. Tutunov, and E. Eaton, “Safe policy search for lifelong reinforcement learning with sublinear regret,” in Proceedings of the 32nd International Conference on Machine Learning (ICML15), 2015, pp. 2361–2369.
 (11) D. Held, Z. McCarthy, M. Zhang, F. Shentu, and P. Abbeel, “Probabilistically safe policy transfer,” arXiv preprint arXiv:1705.05394, 2017.
 (12) S. Gu, T. Lillicrap, Z. Ghahramani, R. E. Turner, and S. Levine, "Q-Prop: Sample-efficient policy gradient with an off-policy critic," International Conference on Learning Representations (ICLR17), 2017.
 (13) S. Gu, T. Lillicrap, Z. Ghahramani, R. E. Turner, B. Schölkopf, and S. Levine, "Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning," arXiv preprint arXiv:1706.00387, 2017.
 (14) D. P. Bertsekas, Nonlinear programming. Athena scientific Belmont, 1999.
 (15) R. J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Machine Learning, vol. 8, no. 3–4, pp. 229–256, 1992.
 (16) T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” International Conference on Learning Representations (ICLR16), 2016.
 (17) E. Todorov, T. Erez, and Y. Tassa, "MuJoCo: A physics engine for model-based control," in Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on. IEEE, 2012, pp. 5026–5033.
 (18) Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel, “Benchmarking deep reinforcement learning for continuous control,” in International Conference on Machine Learning, 2016, pp. 1329–1338.
 (19) J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, "High-dimensional continuous control using generalized advantage estimation," arXiv preprint arXiv:1506.02438, 2015.
 (20) D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
Supplementary Materials
Appendix A Primal-Dual DDPG for CMDPs
In this appendix, we provide a primal-dual version of the DDPG algorithm for solving CMDPs. The primal policy update and the dual variable update in this algorithm use only the off-policy data stored in the replay buffer, so it can be used to obtain the off-policy trained dual variable for our APDO algorithm. For simplicity, we only present the algorithm for CMDPs with a single constraint; the multiple-constraint case can be obtained easily. The primal-dual DDPG algorithm uses the following neural networks.

Reward critic Q-network and reward target Q-network;
Cost critic Q-network and cost target Q-network;
Actor policy network and actor target policy network.

The target networks are used to slowly track the learned networks.
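To make this structure concrete, below is a minimal PyTorch sketch (our own illustration, not the paper's implementation) of a single primal-dual DDPG update for a one-constraint CMDP; the network interfaces, replay-buffer batch format, and hyperparameter values are assumptions rather than the reported settings.

```python
import torch
import torch.nn.functional as F

def primal_dual_ddpg_step(batch, actor, actor_targ, q_reward, q_reward_targ,
                          q_cost, q_cost_targ, opt_actor, opt_qr, opt_qc,
                          lam, cost_limit_d, gamma=0.995, tau=0.001, dual_lr=1e-3):
    """One off-policy update of the critics, actor, and dual variable (sketch)."""
    s, a, r, c, s_next = batch  # tensors sampled from the replay buffer

    # Critic targets from the (slowly tracking) target networks.
    with torch.no_grad():
        a_next = actor_targ(s_next)
        y_r = r + gamma * q_reward_targ(s_next, a_next)
        y_c = c + gamma * q_cost_targ(s_next, a_next)

    # Fit the reward and cost critics.
    opt_qr.zero_grad(); F.mse_loss(q_reward(s, a), y_r).backward(); opt_qr.step()
    opt_qc.zero_grad(); F.mse_loss(q_cost(s, a), y_c).backward(); opt_qc.step()

    # Actor ascends on the Lagrangian value Q_r - lam * Q_c.
    opt_actor.zero_grad()
    a_pi = actor(s)
    actor_loss = -(q_reward(s, a_pi) - lam * q_cost(s, a_pi)).mean()
    actor_loss.backward(); opt_actor.step()

    # Dual gradient ascent on the estimated constraint violation, projected to [0, inf).
    with torch.no_grad():
        violation = (q_cost(s, actor(s)).mean() - cost_limit_d).item()
    lam = max(0.0, lam + dual_lr * violation)

    # Soft target updates: targets slowly track the learned networks.
    for net, targ in [(actor, actor_targ), (q_reward, q_reward_targ), (q_cost, q_cost_targ)]:
        for p, p_t in zip(net.parameters(), targ.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)
    return lam
```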
Appendix B Experiment Details
Task description. In the point-gather task, a point mass receives a reward of 10 for collecting an apple and a cost of 1 for collecting a bomb. The agent is constrained to incur no more than 0.2 cost in the long term. Two apples and eight bombs spawn on the map at the start of each episode.
Parameters for primal policy update. For all experiments, we use neural network policies with two hidden layers and tanh nonlinearities, and all of the schemes (PDO, CPO, APDO) use TRPO to update the primal policy, with a batch size of 50000 and a KL-divergence step size of 0.01. The discount factor is 0.995 and the rollout length is 15. We use GAE [GAE] for estimating the advantages.
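For reference, a minimal sketch (our own) of generalized advantage estimation as used for the primal policy update is given below; `values` is assumed to contain the critic's value estimates for the rollout states including a bootstrap value for the final state, and the GAE parameter value shown is a placeholder, since the paper's setting is not recoverable here.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.995, lam=0.95):
    """Generalized advantage estimation for a single rollout.

    values has length len(rewards) + 1 (it includes the bootstrap value
    for the state after the final step).
    """
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD residual
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```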
Parameters for dual variable update. For dual updates, PDO and APDO both use dual gradient ascent. Note that the step size for dual gradient ascent is important in PDO: if it is set too small, the dual variable will not update quickly enough to meaningfully enforce the constraint; if it is too high, the algorithm will overcorrect in response to constraint violations and behave too conservatively [CPO]. As a result, picking a proper step size is both critical and difficult in PDO. We experiment with different step sizes and find that 0.1 works best for PDO, and the reported results for PDO use this step size. By comparison, selecting the step size in APDO is much easier, since the one-time off-policy dual adjustment directly boosts the dual variable to a nearly optimal point and we only need to choose a relatively small step size for fine-tuning after the adjustment. For the reported experimental results, we also set the step size to 0.1 for APDO for fairness of comparison. For CPO, we adopt the same set of parameters as in the original CPO paper [CPO] (specifically, the parameters used in the point-gather task).
Parameters for training the off-policy dual variable. We use primal-dual DDPG to train the off-policy dual variable. The reward critic Q-network and the cost critic Q-network are each parametrized by a neural network with two hidden layers and tanh nonlinearities, and the actor policy network is likewise a neural network with two hidden layers and tanh nonlinearities. The reward/cost critic Q-networks and the actor policy network all use the same learning rate and are updated with Adam [adam]. The dual variable in primal-dual DDPG is updated by simple dual gradient ascent with a fixed step size, using minibatches sampled from a replay buffer of fixed maximum size. We also use soft target network updates, and the off-policy training is executed for a fixed number of primal-dual iterations. Since off-policy algorithms like DDPG are usually unstable, we set the off-policy trained dual variable to be the average of all historical dual variables over the off-policy training trajectory.
Effect of the adjustment epoch. Figure 2 shows the effect of the adjustment epoch on the performance of APDO, where we experiment with several values. We observe that a smaller adjustment epoch avoids slow incremental dual updates earlier, but the off-policy dual estimate could be highly biased and inaccurate due to the limited amount of data in the replay buffer. On the other hand, a larger adjustment epoch provides a more accurate dual estimate at the expense of a delayed adjustment.