In reinforcement learning (RL), agents learn to act by trial and error in an unknown environment. The majority of RL algorithms allow agents to freely explore the environment and exploit any actions that might improve the reward. However, actions that lead to high rewards usually come with high risks. In a safety-critical environment, it is important to enforce safety in the RL algorithm, and a natural way to enforce safety is to incorporate constraints. A standard formulation for RL with safety constraints is the constrained Markov Decision Process (CMDP) framework CMDP , where the agents need to maximize the long-term reward while satisfying the constraints on the long-term cost. Applications of CMDPs include windmill control wind where we need to maximize the average reward (e.g., generated power) while bounding the long-term wear-and-tear cost on critical components (e.g., wind turbine). Another important example is communication network control where we need to maximize network utility while bounding the long-term arrival rate below the long-term service rate in order to maintain network stability (Chapter 1.1 in CMDP ).
While optimal policies for finite CMDPs with known models can be obtained by linear programming LP1 , this approach does not scale to high-dimensional continuous control tasks due to the curse of dimensionality. Recently, RL algorithms that work for high-dimensional CMDPs have been developed, building on advances in policy search algorithms TRPO ; A3C . In particular, two constrained policy search algorithms enjoy state-of-the-art performance for CMDPs: Primal-Dual Optimization (PDO) PDO and Constrained Policy Optimization (CPO) CPO . PDO is based on Lagrangian relaxation and updates parameters in the primal and dual spaces in turn. Specifically, the primal policy update uses a policy gradient step while the dual variable update uses dual gradient ascent. By comparison, CPO differs from PDO in the dual update procedure, where the dual variable is obtained from scratch by solving a carefully designed optimization problem in each iteration, in order to enforce safety constraints throughout training. Besides PDO and CPO, there exist other methods for solving CMDPs uchibe2007constrained ; ammar2015safe ; held2017probabilistically , but these approaches are usually computationally intensive or only apply to some specific CMDP models and domains.
A notable feature of existing constrained policy search approaches (e.g., PDO and CPO) is that they only use on-policy samples (on-policy samples are those generated by the currently used policy, while off-policy samples are generated by other, unknown policies), which ensures that the information used for dual updates is unbiased and leads to stable performance improvement. However, such an on-policy dual update is sample-inefficient since historical samples are discarded. Moreover, due to the on-policy nature, dual updates are incremental and suffer from slow convergence, since a (potentially large) batch of on-policy samples has to be obtained before a dual update can be made.
In this paper, we propose a policy search method for CMDPs called Accelerated Primal-Dual Optimization (APDO), which incorporates an off-policy trained dual variable in the dual update procedure while updating the policy in the primal space with an on-policy likelihood ratio gradient. Specifically, APDO is similar to PDO except that, after a certain number of iterations, we perform a one-time adjustment that replaces the dual variable with a nearly optimal dual variable trained with off-policy data. Such a one-time adjustment incurs negligible amortized overhead in the long term but greatly improves the sample efficiency and the convergence rate over existing methods. We demonstrate the effectiveness of APDO on a simulated robot locomotion task where the agent must satisfy constraints motivated by safety. The experimental results show that APDO achieves better sample efficiency and faster convergence than state-of-the-art approaches for CMDPs (e.g., PDO and CPO).
Another line of work considers merging the on-policy and off-policy policy gradient updates to improve sample efficiency. Examples of these approaches include Q-Prop Q-prop , IPG IPG , etc. These approaches are designed for unconstrained MDPs and can be applied to the primal policy update. In contrast, APDO leverages off-policy samples for dual updates and is complementary to these efforts on merging on-policy and off-policy policy gradients.
2 Constrained Markov Decision Process
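A Markov Decision Process (MDP) is represented by a tuple $(\mathcal{S}, \mathcal{A}, R, P, \mu)$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions, $R: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is the reward function, $P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0,1]$ is the transition probability function (where $P(s' \mid s, a)$ is the transition probability from state $s$ to state $s'$ given action $a$), and $\mu: \mathcal{S} \to [0,1]$ is the initial state distribution. A stationary policy $\pi: \mathcal{S} \to \mathcal{P}(\mathcal{A})$ corresponds to a mapping from states to a probability distribution over actions. Specifically, $\pi(a \mid s)$ is the probability of selecting action $a$ in state $s$. The set of all stationary policies is denoted by $\Pi$. In this paper, we search for a policy within a parametrized stationary policy class $\Pi_\theta \subseteq \Pi$ (e.g., a neural network policy class with weight $\theta$). We may write a policy $\pi$ as $\pi_\theta$ to emphasize its dependence on the parameter $\theta$. The long-term discounted reward under policy $\pi$ is denoted as $R(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, s_{t+1})\right]$, where $\gamma \in [0,1)$ is the discount factor, $\tau = (s_0, a_0, s_1, \dots)$ denotes a trajectory, and $\tau \sim \pi$ means that the distribution over trajectories is determined by policy $\pi$, i.e., $s_0 \sim \mu$, $a_t \sim \pi(\cdot \mid s_t)$, $s_{t+1} \sim P(\cdot \mid s_t, a_t)$.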
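As a rough illustration of the long-term discounted reward defined above, the following sketch estimates it by Monte Carlo from sampled trajectories. It assumes finite (truncated) trajectories given as lists of per-step values; the function names `discounted_sum` and `monte_carlo_estimate` and the sample numbers are hypothetical and not part of the paper. The same routine applies to any per-step signal, such as a cost.

```python
from typing import Sequence


def discounted_sum(values: Sequence[float], gamma: float) -> float:
    """Return sum_t gamma^t * values[t] for one finite trajectory."""
    total = 0.0
    # Iterate backwards so each step is a single multiply-add.
    for v in reversed(values):
        total = v + gamma * total
    return total


def monte_carlo_estimate(trajectories: Sequence[Sequence[float]],
                         gamma: float) -> float:
    """Average the discounted sums over the sampled trajectories."""
    sums = [discounted_sum(traj, gamma) for traj in trajectories]
    return sum(sums) / len(sums)


# Made-up per-step rewards for two short trajectories.
rewards = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
estimate = monte_carlo_estimate(rewards, gamma=0.5)
print(estimate)
```

Truncating the infinite sum is an approximation; its error shrinks geometrically with the horizon because of the discount factor.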
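A constrained Markov Decision Process (CMDP) is an MDP augmented with constraints on long-term discounted costs. Specifically, we augment the ordinary MDP with $m$ cost functions $C_1, \dots, C_m$, where each cost function $C_i: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is a mapping from transition tuples to costs. The long-term discounted cost under policy $\pi$ is similarly defined as $C_i(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t C_i(s_t, a_t, s_{t+1})\right]$, and the corresponding limit is $d_i$. In a CMDP, we aim to select a policy $\pi$ that maximizes the long-term reward $R(\pi)$ while satisfying the constraints on the long-term costs $C_i(\pi) \le d_i$, i.e.,
$$\max_{\pi \in \Pi_\theta} R(\pi) \quad \text{s.t.} \quad C_i(\pi) \le d_i, \quad i = 1, \dots, m. \tag{1}$$
The Lagrangian of problem (1) is
$$L(\pi, \lambda) = R(\pi) - \sum_{i=1}^{m} \lambda_i \bigl(C_i(\pi) - d_i\bigr), \tag{2}$$
where $\lambda = (\lambda_1, \dots, \lambda_m)$ is the Lagrangian multiplier. Then the constrained problem (1) can be converted to the following unconstrained problem:
$$(\pi^*, \lambda^*) = \arg\min_{\lambda \ge 0} \max_{\pi \in \Pi_\theta} L(\pi, \lambda). \tag{3}$$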
To solve the unconstrained minimax problem (3), a canonical approach is to use the iterative primal-dual method, where in each iteration we update the primal policy and the dual variable in turn. The primal-dual update procedures at iteration $k$ are as follows:
- Fix $\lambda = \lambda^{(k)}$ and perform the policy gradient update $\theta_{k+1} = \theta_k + \alpha_k \nabla_\theta L(\pi_\theta, \lambda^{(k)}) \big|_{\theta = \theta_k}$, where $\alpha_k$ is the step size. The policy gradient could be an on-policy likelihood ratio policy gradient (e.g., REINFORCE REINFORCE and TRPO TRPO ) or an off-policy deterministic policy gradient (e.g., DDPG DDPG ).
- Fix $\pi = \pi_k$ and perform the dual update $\lambda^{(k+1)} = f_k(\lambda^{(k)}, \pi_k)$. Existing methods for CMDPs, such as PDO and CPO, differ in the choice of the dual update procedure $f_k(\cdot)$. For example, PDO uses the simple dual gradient ascent $\lambda^{(k+1)} = \bigl[\lambda^{(k)} + \beta_k \bigl(C(\pi_k) - d\bigr)\bigr]_+$, where $\beta_k$ is the step size, $C(\pi_k)$ and $d$ are the vectors of long-term costs and limits, and $[\cdot]_+$ is the projection onto the dual space $\lambda \ge 0$. By comparison, CPO derives the dual variable by solving an optimization problem from scratch in order to enforce the constraints in every iteration.
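To make the form of the PDO dual step concrete, the short derivation below assumes the standard Lagrangian $L(\pi, \lambda) = R(\pi) - \sum_i \lambda_i (C_i(\pi) - d_i)$ that is minimized over $\lambda \ge 0$; the symbols $\beta_k$, $C_i$, $d_i$ and $\pi_k$ follow common CMDP notation and should be read as assumptions wherever the surrounding text does not spell them out.

```latex
\begin{align*}
\frac{\partial L(\pi_k, \lambda)}{\partial \lambda_i}
  &= -\bigl(C_i(\pi_k) - d_i\bigr), \\
\lambda_i^{(k+1)}
  &= \Bigl[\lambda_i^{(k)} - \beta_k \,\frac{\partial L(\pi_k, \lambda)}{\partial \lambda_i}\Bigr]_+
   = \Bigl[\lambda_i^{(k)} + \beta_k \bigl(C_i(\pi_k) - d_i\bigr)\Bigr]_+ .
\end{align*}
```

The multiplier therefore grows while the constraint is violated and shrinks otherwise. As an illustration with made-up numbers, $\lambda^{(k)} = 0.5$, $\beta_k = 0.1$, $C(\pi_k) = 0.6$ and $d = 0.2$ give $\lambda^{(k+1)} = 0.5 + 0.1 \times 0.4 = 0.54$, while $\lambda^{(k)} = 0.01$ with $C(\pi_k) = 0$ gives $0.01 - 0.02 < 0$, which the projection clips to $0$.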
However, the dual update procedures used in existing methods (e.g., PDO and CPO) are incremental and only use on-policy samples, resulting in sample inefficiency and slow convergence to the optimal primal-dual solution $(\pi^*, \lambda^*)$. In this paper, we propose to incorporate an off-policy trained dual variable in the dual update procedure in order to improve sample efficiency and speed up the search for the optimal dual variable $\lambda^*$. The algorithm is called Accelerated Primal-Dual Optimization (APDO) and is described in Algorithm 1. APDO is similar to PDO in that, in most iterations, the dual variable is updated according to the simple dual gradient ascent (step 6), but the key innovation of APDO is that there is a one-time dual adjustment with an off-policy trained dual variable $\lambda^{\text{OFF}}$ after $K$ iterations (steps 7-10). The off-policy trained $\lambda^{\text{OFF}}$ is obtained by running an off-policy algorithm for CMDPs with the historical data stored in the replay buffer. We provide a primal-dual version of the DDPG algorithm in the supplementary material for training $\lambda^{\text{OFF}}$. Although the off-policy trained dual variable could be biased, it provides a nearly optimal starting point for further fine-tuning of the dual variable using new on-policy data.
The improvement in sample efficiency in APDO is due to the fact that off-policy training can repeatedly exploit historical data, while an on-policy update only uses each sample once. The acceleration effect of APDO is due to the fact that off-policy training directly solves for the optimal dual variable offline, thus avoiding the slow on-policy dual update of existing approaches, where only one dual update can be taken after a large batch of samples is obtained.
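The control flow described above can be sketched on a toy problem. This is only an illustration under strong assumptions, not the paper's Algorithm 1: the reward $-(\theta - 2)^2$, the cost $\theta$, the limit $1$, the step sizes, the adjustment iteration and the value passed as `lam_off` are all made up, and exact gradients stand in for sampled policy gradients. For this toy problem the saddle point is $\theta^* = 1$, $\lambda^* = 2$.

```python
def run_primal_dual(num_iters, adjust_at=None, lam_off=None,
                    alpha=0.1, beta=0.01, limit=1.0):
    """Primal-dual iterations on a toy scalar problem.

    Toy stand-ins (assumptions, not the paper's task):
      reward R(theta) = -(theta - 2)^2, cost C(theta) = theta, limit d = 1.
    """
    theta, lam = 0.0, 0.0
    lam_history = []
    for k in range(1, num_iters + 1):
        # Primal step: gradient ascent on L(theta, lam) = R - lam * (C - d).
        grad_reward = -2.0 * (theta - 2.0)
        grad_cost = 1.0
        theta += alpha * (grad_reward - lam * grad_cost)
        # Dual step: projected dual gradient ascent (the PDO-style update).
        lam = max(0.0, lam + beta * (theta - limit))
        # APDO-style one-time adjustment: overwrite the dual variable once.
        if adjust_at is not None and k == adjust_at:
            lam = lam_off
        lam_history.append(lam)
    return theta, lam_history


_, pdo = run_primal_dual(50)
# lam_off = 1.8 plays the role of a biased off-policy estimate of 2.0.
_, apdo = run_primal_dual(50, adjust_at=5, lam_off=1.8)
print(f"PDO  dual error after 50 iterations: {abs(pdo[-1] - 2.0):.3f}")
print(f"APDO dual error after 50 iterations: {abs(apdo[-1] - 2.0):.3f}")
```

With a small dual step size, the purely incremental run is still far from the optimal multiplier after 50 iterations, whereas the run with the one-time adjustment only needs to fine-tune a nearly optimal value.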
Note that the adjustment epoch $K$ is an important parameter in APDO. Using a small $K$ avoids the slow incremental dual updates early on, but the dual estimate $\lambda^{\text{OFF}}$ could be highly biased and inaccurate due to an insufficient amount of data. On the other hand, using a larger $K$ provides a more accurate dual estimate at the expense of delayed adjustment.
We evaluate APDO against two state-of-the-art algorithms for solving CMDPs (i.e., CPO and PDO) on a simple point-gather control task in MuJoCo mujoco with an additional safety constraint, as used in CPO . All experiments are implemented in rllab rllab . The detailed task description and experiment parameters are provided in the supplementary material. In particular, for APDO we use a fixed adjustment epoch $K$, and additional experimental results regarding the effect of $K$ are also given in the supplementary material.
Figure 1 shows the learning curves for APDO, CPO and PDO under cost constraints. It can be observed from Figure 1(b) that APDO successfully enforces the constraint to the limit value at approximately the same speed as CPO. More importantly, APDO generally outperforms CPO on reward performance without compromising constraint stabilization, thus achieving better sample efficiency. For example, CPO takes 90 epochs to achieve an average reward of 11 while satisfying the safety constraint. By comparison, APDO only takes 45 epochs to reach the same point, which corresponds to a 2x improvement in sample efficiency over CPO in this task. In addition, PDO fails to enforce the safety constraint during the first 150 epochs due to its slow convergence. Using a larger step size may help speed up the convergence, but in this case PDO will over-correct in response to constraint violations and behave too conservatively. We provide additional discussion on the choice of step size for PDO and APDO in the supplementary material.
Figure 1(c) illustrates the learning trajectory of the dual variable under PDO and APDO (note that the dual variable for CPO is not illustrated since CPO has a sophisticated recovery scheme to enforce constraints, where the dual variable may not be easily obtained). We find that APDO converges to the optimal dual variable significantly faster than PDO. In particular, there is a "jump" of the dual variable after several epochs in APDO, due to the dual adjustment with the off-policy trained $\lambda^{\text{OFF}}$. By comparison, PDO has to adjust its dual variable incrementally with on-policy data.
5 Future Work
Since the adjustment epoch $K$ is an important parameter in APDO, one important direction for future work is to provide theoretical guidance on the setting of $K$. It is also very interesting (yet challenging) to provide theoretical justification for the acceleration effect of APDO. Moreover, as we observed in the experiments, the training trajectory generated by APDO strives for the best tradeoff between improving rewards and enforcing cost constraints. Another direction is to incorporate a safety parameter that controls the degree of safety awareness. By tuning this parameter, the RL algorithm should be able to take both risk-averse actions (which enforce safety constraints as soon as possible) and risk-neutral actions (which give priority to improving rewards).
This work was supported by NSF Grant CNS-1524317 and by DARPA I2O and Raytheon BBN Technologies under Contract No. HR0011-15-C-0097. The authors would also like to acknowledge Chengtao Li, who provided valuable feedback on this work.
- (1) S. Shalev-Shwartz, S. Shammah, and A. Shashua, “Safe, multi-agent, reinforcement learning for autonomous driving,” arXiv preprint arXiv:1610.03295, 2016.
- (2) E. Altman, Constrained Markov decision processes. CRC Press, 1999, vol. 7.
- (3) P. Tavner, J. Xiang, and F. Spinato, “Reliability analysis for wind turbines,” Wind Energy, vol. 10, no. 1, pp. 1–18, 2007.
- (4) E. A. Feinberg and A. Shwartz, “Constrained dynamic programming with two discount factors: Applications and an algorithm,” IEEE Transactions on Automatic Control, vol. 44, no. 3, pp. 628–631, 1999.
- (5) J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in Proceedings of the 32nd International Conference on Machine Learning (ICML-15), 2015, pp. 1889–1897.
- (6) V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International Conference on Machine Learning, 2016, pp. 1928–1937.
- (7) Y. Chow, M. Ghavamzadeh, L. Janson, and M. Pavone, “Risk-constrained reinforcement learning with percentile risk criteria,” arXiv preprint arXiv:1512.01629, 2015.
- (8) J. Achiam, D. Held, A. Tamar, and P. Abbeel, “Constrained policy optimization,” in Proceedings of the 34th International Conference on Machine Learning (ICML-17), 2017.
- (9) E. Uchibe and K. Doya, “Constrained reinforcement learning from intrinsic and extrinsic rewards,” in Development and Learning, 2007. ICDL 2007. IEEE 6th International Conference on. IEEE, 2007, pp. 163–168.
- (10) H. B. Ammar, R. Tutunov, and E. Eaton, “Safe policy search for lifelong reinforcement learning with sublinear regret,” in Proceedings of the 32nd International Conference on Machine Learning (ICML-15), 2015, pp. 2361–2369.
- (11) D. Held, Z. McCarthy, M. Zhang, F. Shentu, and P. Abbeel, “Probabilistically safe policy transfer,” arXiv preprint arXiv:1705.05394, 2017.
- (12) S. Gu, T. Lillicrap, Z. Ghahramani, R. E. Turner, and S. Levine, “Q-prop: Sample-efficient policy gradient with an off-policy critic,” International Conference on Learning Representations (ICLR-17), 2017.
- (13) S. Gu, T. Lillicrap, Z. Ghahramani, R. E. Turner, B. Schölkopf, and S. Levine, “Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning,” arXiv preprint arXiv:1706.00387, 2017.
- (14) D. P. Bertsekas, Nonlinear programming. Athena Scientific, Belmont, 1999.
- (15) R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine learning, vol. 8, no. 3-4, pp. 229–256, 1992.
- (16) T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” International Conference on Learning Representations (ICLR-16), 2016.
- (17) E. Todorov, T. Erez, and Y. Tassa, “Mujoco: A physics engine for model-based control,” in Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on. IEEE, 2012, pp. 5026–5033.
- (18) Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel, “Benchmarking deep reinforcement learning for continuous control,” in International Conference on Machine Learning, 2016, pp. 1329–1338.
- (19) J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, “High-dimensional continuous control using generalized advantage estimation,” arXiv preprint arXiv:1506.02438, 2015.
- (20) D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
Appendix A Primal-Dual DDPG for CMDPs
In this appendix, we provide a primal-dual version of the DDPG algorithm for solving CMDPs. The primal policy update and the dual variable update in this algorithm only use the off-policy data stored in the replay buffer, which can be used to fit $\lambda^{\text{OFF}}$ for our APDO algorithm. For simplicity, we only present the algorithm for CMDPs with a single constraint; the extension to multiple constraints is straightforward. The primal-dual DDPG algorithm uses the following neural networks.
- Reward critic Q-network and reward target Q-network
- Cost critic Q-network and cost target Q-network
- Actor policy network and actor target policy network
The target networks are used to slowly track the learned networks.
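One common way to let a target network slowly track a learned network is the soft (Polyak) update used in DDPG. The sketch below assumes that this is the tracking rule intended here; the function name `soft_update`, the flat weight lists and the mixing rate `tau=0.5` are hypothetical choices made only to keep the example small (in practice the rate is much closer to zero).

```python
def soft_update(target, online, tau):
    """Move each target weight a fraction `tau` toward the online weight."""
    return [(1.0 - tau) * t + tau * w for t, w in zip(target, online)]


online = [1.0, -2.0]   # weights of a learned network (made-up values)
target = [0.0, 0.0]    # weights of the corresponding target network
for _ in range(3):
    target = soft_update(target, online, tau=0.5)
print(target)
```

The same rule would be applied separately to the reward critic, the cost critic and the actor, each paired with its own target network.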
Appendix B Experiment Details
Task description. In the point-gather task, a point mass receives a reward of 10 for collecting an apple and incurs a cost of 1 for collecting a bomb. The agent is constrained to incur no more than 0.2 cost in the long term. Two apples and eight bombs spawn on the map at the start of each episode.
Parameters for primal policy update. For all experiments, we use neural network policies with two hidden layers of sizes with tanh non-linearity, and all of the schemes (PDO, CPO, APDO) use TRPO to update the primal policy, with a batch size of 50000 and a KL-divergence step size of 0.01. The discount factor is 0.995 and the rollout length is 15. We use GAE-$\lambda$ GAE for estimating the regular advantages with .
Parameters for dual variable update. For dual updates, PDO and APDO both use dual gradient ascent. Note that the step size for dual gradient ascent is important in PDO: if it is too small, the dual variable will not update quickly enough to meaningfully enforce the constraint; if it is too large, the algorithm will over-correct in response to constraint violations and behave too conservatively CPO . As a result, picking a proper step size is critical and difficult in PDO. We experiment with different step sizes and find that 0.1 works best for PDO, and the reported results for PDO use this step size. By comparison, selecting the step size in APDO is much easier, since the one-time off-policy dual adjustment directly moves the dual variable to a "nearly optimal" point and we only need a relatively small step size for fine-tuning after the adjustment. For the reported experimental results, we also set the step size to 0.1 for APDO for fairness of comparison. As for CPO, we adopt the same set of parameters as in the original CPO paper CPO (specifically, the parameters used in the point-gather task).
Parameters for training $\lambda^{\text{OFF}}$. We use primal-dual DDPG to train $\lambda^{\text{OFF}}$. The reward critic network and the cost critic network are each parametrized by a neural network with two hidden layers of sizes with tanh nonlinearity. The actor policy network is represented by a neural network with two hidden layers of sizes with tanh nonlinearity. The learning rates for the reward/cost critic Q-networks and the actor policy network are all and these networks are updated with Adam adam . The update for the dual variable in primal-dual DDPG employs simple dual gradient ascent, and the step size for updating the dual variable in primal-dual DDPG is set to be . The mini-batch size is . We also use soft target networks with . The off-policy training is executed for primal-dual iterations. Since off-policy algorithms like DDPG are usually unstable, we set $\lambda^{\text{OFF}}$ to be the average of all historical dual variables throughout the off-policy training trajectory. The max replay buffer size is .
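The averaging of historical dual variables mentioned above can be maintained incrementally, without storing the whole trajectory. The sketch below is one possible way to do this; the class name `RunningMean` and the sequence of dual iterates are made up for illustration and are not taken from the experiments.

```python
class RunningMean:
    """Incrementally track the mean of a stream of scalar values."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        # Standard incremental mean: shift by the new value's share.
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


# Made-up, oscillating dual iterates from an unstable off-policy run.
dual_iterates = [0.0, 1.0, 3.0, 2.0, 4.0]
average = RunningMean()
for lam in dual_iterates:
    average.update(lam)
lam_off = average.mean  # averaged dual variable used for the adjustment
print(lam_off)
```

Averaging damps the oscillation of individual iterates, which is the motivation the text gives for not simply taking the final dual variable.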
Effect of $K$. Figure 2 shows the effect of the adjustment epoch $K$ on the performance of APDO, where we experiment with different values of $K$. It is observed that using a smaller $K$ avoids the slow incremental dual updates earlier, but due to the limited number of samples available in the replay buffer, the off-policy dual estimate could be highly biased and inaccurate. On the other hand, using a larger $K$ provides a more accurate dual estimate at the expense of delayed adjustment.