1 Introduction
Model-free RL aims to acquire an effective behavior policy through trial-and-error interaction with a black-box environment. The goal is to optimize the quality of an agent’s behavior policy in terms of the total expected discounted reward. Model-free RL has a myriad of applications in games atarinature ; tesauro1995 , robotics kober2013 ; levine2016end , and marketing li2010 ; theocharous2015 , to name a few. Recently, the impact of model-free RL has been expanded through the use of deep neural networks, which promise to replace manual feature engineering with end-to-end learning of value and policy representations. Unfortunately, a key challenge remains: how best to combine the advantages of value and policy based RL approaches in the presence of deep function approximators, while mitigating their shortcomings. Although recent progress has been made in combining value and policy based methods, this issue is not yet settled, and the intricacies of each perspective are exacerbated by deep models.
The primary advantage of policy based approaches, such as REINFORCE williams92 , is that they directly optimize the quantity of interest while remaining stable under function approximation (given a sufficiently small learning rate). Their biggest drawback is sample inefficiency: since policy gradients are estimated from rollouts, the variance is often extreme. Although policy updates can be improved by the use of appropriate geometry kakade01 ; petersetal10 ; trpo2015 , the need for variance reduction remains paramount. Actor-critic methods have thus become popular schulmaniclr2016 ; silver14ddpg ; sutton1999policy , because they use value approximators to replace rollout estimates and reduce variance, at the cost of some bias. Nevertheless, on-policy learning remains inherently sample inefficient guetal17 ; by estimating quantities defined by the current policy, either on-policy data must be used, or updating must be sufficiently slow to avoid significant bias. Naive importance correction is hardly able to overcome these shortcomings in practice precup2000eligibility ; precup2001off .
By contrast, value based methods, such as Q-learning watkins1992q ; atarinature ; pdqn ; wangetal16 ; mnih2016asynchronous , can learn from any trajectory sampled from the same environment. Such “off-policy” methods are able to exploit data from other sources, such as experts, making them inherently more sample efficient than on-policy methods guetal17 . Their key drawback is that off-policy learning does not stably interact with function approximation (suttonbook_2nd_ed, Chap. 11). The practical consequence is that extensive hyperparameter tuning can be required to obtain stable behavior. Despite practical success atarinature , there is also little theoretical understanding of how deep Q-learning might obtain near-optimal objective values.
Ideally, one would like to combine the unbiasedness and stability of on-policy training with the data efficiency of off-policy approaches. This desire has motivated substantial recent work on off-policy actor-critic methods, where the data efficiency of policy gradient is improved by training an off-policy critic lillicrap2015continuous ; mnih2016asynchronous ; guetal17 . Although such methods have demonstrated improvements over on-policy actor-critic approaches, they have not resolved the theoretical difficulty associated with off-policy learning under function approximation. Hence, current methods remain potentially unstable and require specialized algorithmic and theoretical development as well as delicate tuning to be effective in practice guetal17 ; acer ; reactor .
In this paper, we exploit a relationship between policy optimization under entropy regularization and softmax value consistency to obtain a new form of stable off-policy learning. Entropy regularized policy optimization is a well studied topic in RL williams1991function ; todorov2006linearly ; todorov10 ; ziebart2010modeling ; azar ; azaretal11 ; azaretal12 ; fox (in fact, one that has been attracting renewed interest from concurrent work pgq2017 ; haarnojaetal17 ). Nevertheless, we contribute new observations to this study that are essential for the methods we propose: first, we identify a strong form of path consistency that relates optimal policy probabilities under entropy regularization to softmax consistent state values for any action sequence; second, we use this result to formulate a novel optimization objective that allows for a stable form of off-policy actor-critic learning; finally, we observe that under this objective the actor and critic can be unified in a single model that coherently fulfills both roles.
2 Notation & Background
We model an agent’s behavior by a parametric distribution $\pi_\theta(a \mid s)$ defined by a neural network over a finite set of actions. At iteration $t$, the agent encounters a state $s_t$ and performs an action $a_t$ sampled from $\pi_\theta(a \mid s_t)$. The environment then returns a scalar reward $r_t$ and transitions to the next state $s_{t+1}$.
Note: Our main results identify specific properties that hold for arbitrary action sequences. To keep the presentation clear and focus attention on the key properties, we provide a simplified presentation in the main body of this paper by assuming deterministic state dynamics. This restriction is not necessary, and in Appendix C we provide a full treatment of the same concepts generalized to stochastic state dynamics. All of the desired properties continue to hold in the general case and the algorithms proposed remain unaffected.
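For simplicity, we assume the per-step reward $r_t$ and the next state $s_{t+1}$ are given by functions $r_t = r(s_t, a_t)$ and $s_{t+1} = f(s_t, a_t)$ specified by the environment. We begin the formulation by reviewing the key elements of Q-learning qlearning ; watkins1992q , which uses a notion of hard-max Bellman backup to enable off-policy TD control. First, observe that the expected discounted reward objective, $O_{ER}(s, \pi)$, can be recursively expressed as,
$$O_{ER}(s, \pi) = \sum_a \pi(a \mid s)\,\big[\, r(s,a) + \gamma\, O_{ER}(s', \pi) \,\big], \quad \text{where } s' = f(s,a). \tag{1}$$
Let $V^\circ(s)$ denote the optimal state value at a state $s$ given by the maximum value of $O_{ER}(s, \pi)$ over policies, i.e., $V^\circ(s) = \max_\pi O_{ER}(s, \pi)$. Accordingly, let $\pi^\circ$ denote the optimal policy that results in $V^\circ(s)$ (for simplicity, assume there is one unique optimal policy), i.e., $\pi^\circ = \arg\max_\pi O_{ER}(s, \pi)$. Such an optimal policy is a one-hot distribution that assigns a probability of $1$ to an action with maximal return and $0$ elsewhere. Thus we have
$$V^\circ(s) = O_{ER}(s, \pi^\circ) = \max_a \big(\, r(s,a) + \gamma\, V^\circ(s') \,\big). \tag{2}$$
This is the well-known hard-max Bellman temporal consistency. Instead of state values, one can equivalently (and more commonly) express this consistency in terms of optimal action values, $Q^\circ$:
$$Q^\circ(s,a) = r(s,a) + \gamma \max_{a'} Q^\circ(s', a'). \tag{3}$$
Q-learning relies on a value iteration algorithm based on (3), where $Q(s,a)$ is bootstrapped based on successor action values $Q(s', a')$.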
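As an illustration of the hard-max backup just described, the following minimal sketch repeatedly applies a one-step backup of the form $Q(s,a) \leftarrow r(s,a) + \gamma \max_{a'} Q(s',a')$ on a tiny deterministic problem. The environment `MDP`, the discount `GAMMA`, and the function name `hardmax_q_iteration` are hypothetical and chosen only for this example; they are not taken from the paper.

```python
# Hypothetical deterministic environment: state -> action -> (reward, next_state).
# State 3 is terminal (it has no outgoing actions and value 0).
MDP = {
    0: {0: (1.0, 1), 1: (0.0, 2)},
    1: {0: (0.0, 3), 1: (2.0, 3)},
    2: {0: (5.0, 3), 1: (0.0, 3)},
}
GAMMA = 0.9  # assumed discount factor


def hardmax_q_iteration(mdp, gamma, sweeps=50):
    """Tabular value iteration on action values using the hard-max backup."""
    q = {s: {a: 0.0 for a in acts} for s, acts in mdp.items()}
    for _ in range(sweeps):
        for s, acts in mdp.items():
            for a, (r, s_next) in acts.items():
                # Bootstrap from the best successor action value (0 if terminal).
                v_next = max(q[s_next].values()) if s_next in q else 0.0
                q[s][a] = r + gamma * v_next
    return q


if __name__ == "__main__":
    q = hardmax_q_iteration(MDP, GAMMA)
    for s in sorted(q):
        print(s, q[s])
```

Because the backup only uses the maximum over successor action values, it does not depend on which policy generated the transition, which is what makes this style of update off-policy.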
3 Softmax Temporal Consistency
In this paper, we study the optimal state and action values for a softmax form of temporal consistency ziebart2008maximum ; ziebart2010modeling ; fox , which arises by augmenting the standard expected reward objective with a discounted entropy regularizer. Entropy regularization williams1991function encourages exploration and helps prevent early convergence to sub-optimal policies, as has been confirmed in practice (e.g., mnih2016asynchronous ; urex ). In this case, one can express regularized expected reward as a sum of the expected reward and a discounted entropy term,
$$O_{ENT}(s, \pi) = O_{ER}(s, \pi) + \tau\, \mathbb{H}(s, \pi), \tag{4}$$
where $\tau \ge 0$ is a user-specified temperature parameter that controls the degree of entropy regularization, and the discounted entropy $\mathbb{H}(s, \pi)$ is recursively defined as
$$\mathbb{H}(s, \pi) = \sum_a \pi(a \mid s)\,\big[ -\log \pi(a \mid s) + \gamma\, \mathbb{H}(s', \pi) \,\big]. \tag{5}$$
The objective $O_{ENT}(s, \pi)$ can then be re-expressed recursively as,
$$O_{ENT}(s, \pi) = \sum_a \pi(a \mid s)\,\big[\, r(s,a) - \tau \log \pi(a \mid s) + \gamma\, O_{ENT}(s', \pi) \,\big]. \tag{6}$$
Note that when $\gamma = 1$ this is equivalent to the entropy regularized objective proposed in williams1991function .
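Let $V^*(s) = \max_\pi O_{ENT}(s, \pi)$ denote the soft optimal state value at a state $s$ and let $\pi^*(a \mid s)$ denote the optimal policy at $s$ that attains the maximum of $O_{ENT}(s, \pi)$. When $\tau > 0$, the optimal policy is no longer a one-hot distribution, since the entropy term prefers the use of policies with more uncertainty. We characterize the optimal policy $\pi^*(a \mid s)$ in terms of the $O_{ENT}$-optimal state values of successor states $V^*(s')$ as a Boltzmann distribution of the form,
$$\pi^*(a \mid s) \propto \exp\{ (r(s,a) + \gamma V^*(s')) / \tau \}. \tag{7}$$
It can be verified that this is the solution by noting that the $O_{ENT}(s, \pi)$ objective is simply a $\tau$-scaled constant-shifted KL-divergence between $\pi$ and $\pi^*$, hence the optimum is achieved when $\pi = \pi^*$.
To derive $V^*(s)$ in terms of $V^*(s')$, the policy $\pi^*(a \mid s)$ can be substituted into (6), which after some manipulation yields the intuitive definition of optimal state value in terms of a softmax (i.e., log-sum-exp) backup,
$$V^*(s) = O_{ENT}(s, \pi^*) = \tau \log \sum_a \exp\{ (r(s,a) + \gamma V^*(s')) / \tau \}. \tag{8}$$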
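The manipulation that leads to the log-sum-exp backup can be spelled out as follows. This is a sketch assuming the Boltzmann form of the optimal policy and the recursive form of the regularized objective described above; the shorthand $Z(s)$ for the normalizer is introduced here only for illustration and does not appear in the original text.

```latex
% Shorthand (hypothetical): Z(s) is the normalizer of the Boltzmann policy.
\begin{align*}
Z(s) &= \sum_{a} \exp\{ (r(s,a) + \gamma V^*(s')) / \tau \},
\qquad
\pi^*(a \mid s) = \frac{\exp\{ (r(s,a) + \gamma V^*(s')) / \tau \}}{Z(s)} \\
-\tau \log \pi^*(a \mid s) &= -\big( r(s,a) + \gamma V^*(s') \big) + \tau \log Z(s) \\
V^*(s) &= \sum_{a} \pi^*(a \mid s) \big[ r(s,a) - \tau \log \pi^*(a \mid s) + \gamma V^*(s') \big] \\
       &= \sum_{a} \pi^*(a \mid s) \, \tau \log Z(s)
        = \tau \log Z(s)
\end{align*}
```

The bracketed term is the same for every action, so the expectation over $\pi^*$ collapses to $\tau \log Z(s)$, which is exactly the softmax backup.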
Note that in the limit of $\tau \to 0$ one recovers the hard-max state values defined in (2). Therefore we can equivalently state softmax temporal consistency in terms of optimal action values $Q^*(s,a)$ as,
$$Q^*(s,a) = r(s,a) + \gamma V^*(s') = r(s,a) + \gamma \tau \log \sum_{a'} \exp\{ Q^*(s', a') / \tau \}. \tag{9}$$
Now, much like Q-learning, the consistency equation (9) can be used to perform one-step backups to asynchronously bootstrap $Q^*(s,a)$ based on $Q^*(s', a')$. In Appendix C we prove that such a procedure, in the tabular case, converges to a unique fixed point representing the optimal values.
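The one-step softmax backup can be sketched in the tabular case as below. This is an illustrative example only: the environment `MDP`, the discount `GAMMA`, and all function names are hypothetical, and the backup is assumed to take the form $Q(s,a) \leftarrow r(s,a) + \gamma \tau \log \sum_{a'} \exp\{Q(s',a')/\tau\}$. Running it with a small temperature gives values close to the hard-max ones, in line with the $\tau \to 0$ limit noted above.

```python
import math

# Hypothetical deterministic environment: state -> action -> (reward, next_state).
# State 3 is terminal (value 0).
MDP = {
    0: {0: (1.0, 1), 1: (0.0, 2)},
    1: {0: (0.0, 3), 1: (2.0, 3)},
    2: {0: (5.0, 3), 1: (0.0, 3)},
}
GAMMA = 0.9  # assumed discount factor


def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def soft_state_value(q_values, tau):
    """Softmax (log-sum-exp) state value computed from a list of action values."""
    return tau * logsumexp([q / tau for q in q_values])


def soft_q_iteration(mdp, gamma, tau, sweeps=50):
    """Tabular iteration of the one-step softmax backup on action values."""
    q = {s: {a: 0.0 for a in acts} for s, acts in mdp.items()}
    for _ in range(sweeps):
        for s, acts in mdp.items():
            for a, (r, s_next) in acts.items():
                if s_next in q:
                    v_next = soft_state_value(list(q[s_next].values()), tau)
                else:
                    v_next = 0.0  # terminal state
                q[s][a] = r + gamma * v_next
    return q


if __name__ == "__main__":
    for tau in (1.0, 0.1, 0.01):
        q = soft_q_iteration(MDP, GAMMA, tau)
        print(tau, soft_state_value(list(q[0].values()), tau))
```

The soft values are never smaller than the hard-max values for the same rewards, since log-sum-exp upper-bounds the maximum.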
We point out that the notion of softmax Q-values has been studied in previous work (e.g., ziebart2010modeling ; ziebart2008maximum ; huang2015approximate ; azar ; mellowmax ; fox ). Concurrently to our work, haarnojaetal17 has also proposed a soft Q-learning algorithm for continuous control that is based on a similar notion of softmax temporal consistency. However, we contribute new observations below that lead to the novel training principles we explore.
4 Consistency Between Optimal Value & Policy
We now describe the main technical contributions of this paper, which lead to the development of two novel off-policy RL algorithms in Section 5. The first key observation is that, for the softmax value function $V^*$ in (8), the quantity $\exp\{V^*(s)/\tau\}$ also serves as the normalization factor of the optimal policy $\pi^*(a \mid s)$ in (7); that is,
$$\pi^*(a \mid s) = \frac{\exp\{ (r(s,a) + \gamma V^*(s')) / \tau \}}{\exp\{ V^*(s) / \tau \}}. \tag{10}$$
Manipulation of (10) by taking the $\log$ of both sides then reveals an important connection between the optimal state value $V^*(s)$, the value $V^*(s')$ of the successor state $s'$ reached from any action $a$ taken in $s$, and the corresponding action probability under the optimal log-policy, $\log \pi^*(a \mid s)$.
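The manipulation mentioned above is short enough to write out. The following is a sketch, assuming that (10) expresses $\pi^*(a \mid s)$ as the ratio of $\exp\{(r(s,a) + \gamma V^*(s'))/\tau\}$ to $\exp\{V^*(s)/\tau\}$, as the surrounding text describes:

```latex
\begin{align*}
\log \pi^*(a \mid s) &= \frac{r(s,a) + \gamma V^*(s') - V^*(s)}{\tau} \\
\tau \log \pi^*(a \mid s) &= r(s,a) + \gamma V^*(s') - V^*(s) \\
V^*(s) - \gamma V^*(s') &= r(s,a) - \tau \log \pi^*(a \mid s)
\end{align*}
```

The last line holds for every action $a$, not only for actions the optimal policy is likely to take, which is the property the theorem below formalizes.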
Theorem 1.
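For $\tau > 0$, the policy $\pi^*$ that maximizes $O_{ENT}$ and state values $V^*(s) = \max_\pi O_{ENT}(s, \pi)$ satisfy the following temporal consistency property for any state $s$ and action $a$ (where $s' = f(s,a)$),
$$V^*(s) - \gamma V^*(s') = r(s,a) - \tau \log \pi^*(a \mid s). \tag{11}$$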
Proof.
Note that one can also characterize $\pi^*$ in terms of $Q^*$ as
$$\pi^*(a \mid s) = \exp\{ (Q^*(s,a) - V^*(s)) / \tau \}. \tag{12}$$
An important property of the one-step softmax consistency established in (11) is that it can be extended to a multi-step consistency defined on any action sequence from any given state. That is, the softmax optimal state values at the beginning and end of any action sequence can be related to the rewards and optimal log-probabilities observed along the trajectory.
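The multi-step property can be checked numerically on a small example. The sketch below is illustrative only: the cyclic environment `MDP`, the constants `GAMMA` and `TAU`, and all function names are hypothetical. It solves for soft values by iterating a log-sum-exp backup, builds the Boltzmann policy by explicit normalization, and then evaluates, for an arbitrary action sequence, the difference between the discounted value gap at the two ends of the path and the discounted sum of $r - \tau \log \pi$ along it.

```python
import math
import random

# Hypothetical cyclic deterministic environment: state -> action -> (reward, next_state).
MDP = {
    0: {0: (1.0, 0), 1: (0.0, 1)},
    1: {0: (2.0, 0), 1: (-1.0, 1)},
}
GAMMA, TAU = 0.9, 0.5  # assumed discount and temperature


def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def solve_soft_values(mdp, gamma, tau, sweeps=2000):
    """Iterate the softmax backup on state values until (numerical) convergence."""
    v = {s: 0.0 for s in mdp}
    for _ in range(sweeps):
        for s, acts in mdp.items():
            v[s] = tau * logsumexp([(r + gamma * v[s2]) / tau for r, s2 in acts.values()])
    return v


def boltzmann_policy(mdp, v, gamma, tau, s):
    """Policy proportional to exp((r + gamma * V(s')) / tau), normalized explicitly."""
    prefs = {a: (r + gamma * v[s2]) / tau for a, (r, s2) in mdp[s].items()}
    norm = logsumexp(list(prefs.values()))
    return {a: math.exp(p - norm) for a, p in prefs.items()}


def path_residual(mdp, v, gamma, tau, start, actions):
    """Value gap across the path minus the discounted sum of (r - tau * log pi)."""
    s, discount, total = start, 1.0, 0.0
    for a in actions:
        r, s_next = mdp[s][a]
        log_pi = math.log(boltzmann_policy(mdp, v, gamma, tau, s)[a])
        total += discount * (r - tau * log_pi)
        discount *= gamma
        s = s_next
    return v[start] - discount * v[s] - total


if __name__ == "__main__":
    v = solve_soft_values(MDP, GAMMA, TAU)
    rng = random.Random(0)
    actions = [rng.choice([0, 1]) for _ in range(10)]
    print(actions, path_residual(MDP, v, GAMMA, TAU, 0, actions))
```

The action sequence here is drawn uniformly at random rather than from the optimal policy; the residual is expected to vanish up to floating-point error regardless, which is what makes the consistency usable with off-policy data.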
Corollary 2.
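For $\tau > 0$, the optimal policy $\pi^*$ and optimal state values $V^*$ satisfy the following extended temporal consistency property, for any state $s_1$ and any action sequence $a_1, \ldots, a_{t-1}$ (where $s_{i+1} = f(s_i, a_i)$):
$$V^*(s_1) - \gamma^{t-1} V^*(s_t) = \sum_{i=1}^{t-1} \gamma^{i-1} \big[\, r(s_i, a_i) - \tau \log \pi^*(a_i \mid s_i) \,\big]. \tag{13}$$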
Proof.
Theorem 3.
5 Path Consistency Learning (PCL)
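The temporal consistency properties between the optimal policy and optimal state values developed above lead to a natural path-wise objective for training a policy $\pi_\theta$, parameterized by $\theta$, and a state value function $V_\phi$, parameterized by $\phi$, via the minimization of a soft consistency error. Based on (13), we first define a notion of soft consistency for a $d$-length sub-trajectory $s_{i:i+d} \equiv (s_i, a_i, \ldots, s_{i+d-1}, a_{i+d-1}, s_{i+d})$ as a function of $\theta$ and $\phi$:
$$C(s_{i:i+d}, \theta, \phi) = -V_\phi(s_i) + \gamma^d V_\phi(s_{i+d}) + \sum_{j=0}^{d-1} \gamma^j \big[\, r(s_{i+j}, a_{i+j}) - \tau \log \pi_\theta(a_{i+j} \mid s_{i+j}) \,\big]. \tag{14}$$
The goal of a learning algorithm can then be to find $V_\phi$ and $\pi_\theta$ such that $C(s_{i:i+d}, \theta, \phi)$ is as close to $0$ as possible for all sub-trajectories $s_{i:i+d}$. Accordingly, we propose a new learning algorithm, called Path Consistency Learning (PCL), that attempts to minimize the squared soft consistency error over a set of sub-trajectories $E$,
$$O_{PCL}(\theta, \phi) = \sum_{s_{i:i+d} \in E} \tfrac{1}{2}\, C(s_{i:i+d}, \theta, \phi)^2. \tag{15}$$
The PCL update rules for $\theta$ and $\phi$ are derived by calculating the gradient of (15). For a given trajectory $s_{i:i+d}$ these take the form,
$$\Delta\theta = \eta_\pi\, C(s_{i:i+d}, \theta, \phi) \sum_{j=0}^{d-1} \gamma^j \nabla_\theta \log \pi_\theta(a_{i+j} \mid s_{i+j}), \tag{16}$$
$$\Delta\phi = \eta_v\, C(s_{i:i+d}, \theta, \phi) \big( \nabla_\phi V_\phi(s_i) - \gamma^d \nabla_\phi V_\phi(s_{i+d}) \big), \tag{17}$$
where $\eta_v$ and $\eta_\pi$ denote the value and policy learning rates respectively. Given that the consistency property must hold on any path, the PCL algorithm applies the updates (16) and (17) both to trajectories sampled on-policy from $\pi_\theta$ as well as trajectories sampled from a replay buffer. The union of these trajectories comprises the set $E$ used in (15) to define $O_{PCL}$.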
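To make the update rules concrete, here is a minimal tabular sketch. It assumes a soft consistency error of the form $C = -V(s_i) + \gamma^d V(s_{i+d}) + \sum_j \gamma^j [r_{i+j} - \tau \log \pi(a_{i+j} \mid s_{i+j})]$, a softmax policy with one logit per state-action pair, and one value parameter per state. All names (`consistency_error`, `pcl_update`, `theta`, `phi`) and the toy trajectory are hypothetical, and this is not the authors' implementation.

```python
import math


def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def consistency_error(theta, phi, states, actions, rewards, gamma, tau):
    """Soft consistency error for one sub-trajectory of length d = len(actions)."""
    d = len(actions)
    c = -phi[states[0]] + gamma ** d * phi[states[d]]
    for j in range(d):
        log_pi = log_softmax(theta[states[j]])[actions[j]]
        c += gamma ** j * (rewards[j] - tau * log_pi)
    return c


def pcl_update(theta, phi, states, actions, rewards, gamma, tau, lr_pi, lr_v):
    """Apply one policy update and one value update; returns the error before the update."""
    d = len(actions)
    c = consistency_error(theta, phi, states, actions, rewards, gamma, tau)
    # Accumulate sum_j gamma^j * grad log pi(a_j | s_j) before touching theta.
    grads = {}
    for j in range(d):
        s, a = states[j], actions[j]
        probs = [math.exp(lp) for lp in log_softmax(theta[s])]
        g = grads.setdefault(s, [0.0] * len(probs))
        for b, p in enumerate(probs):
            g[b] += gamma ** j * ((1.0 if b == a else 0.0) - p)
    for s, g in grads.items():
        for b, gb in enumerate(g):
            theta[s][b] += lr_pi * c * gb
    # Tabular values: grad V(s_i) is an indicator on s_i, likewise for s_{i+d}.
    phi[states[0]] += lr_v * c
    phi[states[d]] -= lr_v * c * gamma ** d
    return c


if __name__ == "__main__":
    theta = {0: [0.0, 0.0], 1: [0.0, 0.0]}  # policy logits per state
    phi = {0: 0.0, 1: 0.0, 2: 0.0}          # state values
    traj = ([0, 1, 2], [1, 0], [1.0, 2.0])  # states, actions, rewards (d = 2)
    for step in range(5):
        print(step, pcl_update(theta, phi, *traj, gamma=0.9, tau=0.1, lr_pi=0.1, lr_v=0.1))
```

Nothing in `pcl_update` depends on how the sub-trajectory was generated, so the same routine can be applied to on-policy rollouts and to replayed episodes.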
Specifically, given a fixed rollout parameter $d$, at each iteration, PCL samples a batch of on-policy trajectories and computes the corresponding parameter updates for each sub-trajectory of length $d$. Then PCL exploits off-policy trajectories by maintaining a replay buffer and applying additional updates based on a batch of episodes sampled from the buffer at each iteration. We have found it beneficial to sample replay episodes proportionally to exponentiated reward, mixed with a uniform distribution, although we did not exhaustively experiment with this sampling procedure. In particular, we sample a full episode $s_{0:T}$ from the replay buffer of size $B$ with a probability that mixes a uniform term over the buffer with a term proportional to $\exp\big(\alpha \sum_{i=0}^{T-1} r(s_i, a_i)\big) / Z$, where we use no discounting on the sum of rewards, $Z$ is a normalization factor, and $\alpha$ is a hyperparameter. Pseudocode of PCL is provided in the Appendix.
We note that in stochastic settings, our squared inconsistency objective approximated by Monte Carlo samples is a biased estimate of the true squared inconsistency (in which an expectation over stochastic dynamics occurs inside rather than outside the square). This issue arises in Q-learning as well, and others have proposed possible remedies which can also be applied to PCL antos2008learning .
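The replay sampling scheme described above (exponentiated undiscounted episode reward mixed with a uniform distribution) can be sketched as follows. The function names and the mixing weight `uniform_mix` are hypothetical; in particular, the mixing weight is an assumption of this example and its value is not taken from the paper.

```python
import math
import random


def replay_probabilities(episode_returns, alpha, uniform_mix):
    """Per-episode sampling probabilities for a replay buffer.

    episode_returns: undiscounted sum of rewards for each stored episode.
    alpha: scale applied to the returns before exponentiation.
    uniform_mix: weight in [0, 1] given to the uniform component (assumed).
    """
    size = len(episode_returns)
    # Subtracting the maximum leaves the normalized weights unchanged
    # but avoids overflow in exp.
    shift = max(alpha * g for g in episode_returns)
    weights = [math.exp(alpha * g - shift) for g in episode_returns]
    z = sum(weights)  # normalization factor
    return [uniform_mix / size + (1.0 - uniform_mix) * w / z for w in weights]


def sample_episode_index(episode_returns, alpha, uniform_mix, rng=random):
    """Draw the index of one episode according to the mixed distribution."""
    probs = replay_probabilities(episode_returns, alpha, uniform_mix)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    returns = [1.0, 2.0, 3.0]
    print(replay_probabilities(returns, alpha=0.5, uniform_mix=0.1))
    print(sample_episode_index(returns, alpha=0.5, uniform_mix=0.1, rng=random.Random(0)))
```

The uniform component keeps every stored episode reachable, while the exponentiated term biases sampling toward high-reward episodes.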
5.1 Unified Path Consistency Learning (Unified PCL)
The PCL algorithm maintains a separate model for the policy and the state value approximation. However, given the soft consistency between the state and action value functions (e.g., in (9)), one can express the soft consistency errors strictly in terms of Q-values. Let $Q_\rho$ denote a model of action values parameterized by $\rho$, based on which one can estimate both the state values and the policy as,
$$V_\rho(s) = \tau \log \sum_a \exp\{ Q_\rho(s,a) / \tau \}, \tag{18}$$
$$\pi_\rho(a \mid s) = \exp\{ (Q_\rho(s,a) - V_\rho(s)) / \tau \}. \tag{19}$$
Given this unified parameterization of policy and value, we can formulate an alternative algorithm, called Unified Path Consistency Learning (Unified PCL), which optimizes the same objective (i.e., (15)) as PCL but differs by combining the policy and value function into a single model. Merging the policy and value function models in this way is significant because it presents a new actor-critic paradigm where the policy (actor) is not distinct from the values (critic). We note that in practice, we have found it beneficial to apply updates to $\rho$ from $V_\rho$ and $\pi_\rho$ using different learning rates, very much like PCL. Accordingly, the update rule for $\rho$ takes the form,
$$\Delta\rho = \eta_\pi\, C(s_{i:i+d}, \rho) \sum_{j=0}^{d-1} \gamma^j \nabla_\rho \log \pi_\rho(a_{i+j} \mid s_{i+j}) + \eta_v\, C(s_{i:i+d}, \rho) \big( \nabla_\rho V_\rho(s_i) - \gamma^d \nabla_\rho V_\rho(s_{i+d}) \big). \tag{21}$$
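The unified parameterization can be illustrated with a few lines of code. This is a sketch assuming that the state value is the temperature-scaled log-sum-exp of the action values and that the policy is $\exp\{(Q - V)/\tau\}$; the function names are hypothetical.

```python
import math


def soft_value_from_q(q_values, tau):
    """State value as tau * log(sum(exp(q / tau))), computed stably."""
    m = max(q_values)
    return m + tau * math.log(sum(math.exp((q - m) / tau) for q in q_values))


def policy_from_q(q_values, tau):
    """Action probabilities exp((Q(s, a) - V(s)) / tau) derived from the same Q-values."""
    v = soft_value_from_q(q_values, tau)
    return [math.exp((q - v) / tau) for q in q_values]


if __name__ == "__main__":
    q = [1.0, 2.0, 0.5]
    for tau in (1.0, 0.1):
        print(tau, soft_value_from_q(q, tau), policy_from_q(q, tau))
```

Because the value is exactly the log-normalizer of the action values, the derived policy is automatically a valid distribution, so a single set of parameters yields both the actor and the critic.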
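5.2 Connections to Actor-Critic and Q-learning
To those familiar with advantage-actor-critic methods (mnih2016asynchronous, ) (A2C and its asynchronous analogue A3C), PCL’s update rules might appear to be similar. In particular, advantage-actor-critic is an on-policy method that exploits the expected value function,
$$V^\pi(s) = \sum_a \pi(a \mid s)\,\big[\, r(s,a) + \gamma V^\pi(s') \,\big], \tag{22}$$
to reduce the variance of policy gradient, in service of maximizing the expected reward. As in PCL, two models are trained concurrently: an actor $\pi_\theta$ that determines the policy, and a critic $V_\phi$ that is trained to estimate $V^{\pi_\theta}$. A fixed rollout parameter $d$ is chosen, and the advantage of an on-policy trajectory $s_{i:i+d}$ is estimated by
$$A(s_{i:i+d}, \phi) = -V_\phi(s_i) + \gamma^d V_\phi(s_{i+d}) + \sum_{j=0}^{d-1} \gamma^j r(s_{i+j}, a_{i+j}). \tag{23}$$
The advantage-actor-critic updates for $\theta$ and $\phi$ can then be written as,
$$\Delta\theta = \eta_\pi\, \mathbb{E}_{s_{i:i+d} \mid \theta} \big[ A(s_{i:i+d}, \phi)\, \nabla_\theta \log \pi_\theta(a_i \mid s_i) \big], \tag{24}$$
$$\Delta\phi = \eta_v\, \mathbb{E}_{s_{i:i+d} \mid \theta} \big[ A(s_{i:i+d}, \phi)\, \nabla_\phi V_\phi(s_i) \big], \tag{25}$$
where the expectation $\mathbb{E}_{s_{i:i+d} \mid \theta}$ denotes sampling from the current policy $\pi_\theta$. These updates exhibit a striking similarity to the updates expressed in (16) and (17). In fact, if one takes PCL with $\tau \to 0$ and omits the replay buffer, a slight variation of A2C is recovered. In this sense, one can interpret PCL as a generalization of A2C. Moreover, while A2C is restricted to on-policy samples, PCL minimizes an inconsistency measure that is defined on any path, hence it can exploit replay data to enhance its efficiency via off-policy learning.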
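The relationship between the A2C advantage estimate and the PCL consistency error can be made explicit in code. The sketch below assumes the $d$-step advantage has the form $-V(s_i) + \gamma^d V(s_{i+d}) + \sum_j \gamma^j r_{i+j}$ and that the soft consistency error adds a discounted $-\tau \log \pi$ term per step; the function names and the toy numbers are hypothetical.

```python
import math


def d_step_advantage(v_start, v_end, rewards, gamma):
    """d-step advantage estimate from the values at both ends of a rollout."""
    d = len(rewards)
    return -v_start + gamma ** d * v_end + sum(gamma ** j * r for j, r in enumerate(rewards))


def soft_consistency(v_start, v_end, rewards, log_probs, gamma, tau):
    """Soft consistency error: the advantage plus a discounted entropy-like correction."""
    correction = sum(gamma ** j * lp for j, lp in enumerate(log_probs))
    return d_step_advantage(v_start, v_end, rewards, gamma) - tau * correction


if __name__ == "__main__":
    rewards = [1.0, 0.0, 2.0]
    log_probs = [math.log(0.5), math.log(0.25), math.log(0.8)]
    for tau in (0.5, 0.1, 0.0):
        print(tau, soft_consistency(0.3, 1.2, rewards, log_probs, gamma=0.9, tau=tau))
    print("advantage", d_step_advantage(0.3, 1.2, rewards, gamma=0.9))
```

As the temperature shrinks, the correction term vanishes and the consistency error coincides with the advantage estimate, which is the sense in which the $\tau \to 0$ limit relates the two methods.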
It is also important to note that for A2C, it is essential that $V_\phi$ tracks the non-stationary target $V^{\pi_\theta}$ to ensure suitable variance reduction. In PCL, no such tracking is required. This difference is more dramatic in Unified PCL, where a single model is trained both as an actor and a critic. That is, it is not necessary to have a separate actor and critic; the actor itself can serve as its own critic.
One can also compare PCL to hard-max temporal consistency RL algorithms, such as Q-learning (qlearning, ). In fact, setting the rollout to $d = 1$ in Unified PCL leads to a form of soft Q-learning, with the degree of softness determined by $\tau$. We therefore conclude that the path consistency-based algorithms developed in this paper also generalize Q-learning. Importantly, PCL and Unified PCL are not restricted to single-step consistencies, which is a major limitation of Q-learning. While some have proposed using multi-step backups for hard-max Q-learning (peng1996incremental, ; mnih2016asynchronous, ), such an approach is not theoretically sound, since the rewards received after a non-optimal action do not relate to the hard-max Q-values $Q^\circ$. Therefore, one can interpret the notion of temporal consistency proposed in this paper as a sound generalization of the one-step temporal consistency given by hard-max Q-values.
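The reduction to soft Q-learning at a rollout of one step can be checked directly. The sketch below assumes the unified parameterization in which the state value is the log-sum-exp of the Q-values and the log-policy is $(Q - V)/\tau$; under that assumption the one-step consistency error equals the soft Bellman residual $r + \gamma V(s') - Q(s,a)$. Function names and numbers are hypothetical.

```python
import math


def soft_value(q_values, tau):
    m = max(q_values)
    return m + tau * math.log(sum(math.exp((q - m) / tau) for q in q_values))


def unified_one_step_error(q_s, action, reward, q_next, gamma, tau):
    """One-step consistency error with V and log pi both derived from Q."""
    v_s = soft_value(q_s, tau)
    v_next = soft_value(q_next, tau)
    log_pi = (q_s[action] - v_s) / tau
    return -v_s + gamma * v_next + reward - tau * log_pi


def soft_q_residual(q_s, action, reward, q_next, gamma, tau):
    """Soft Q-learning residual: r + gamma * softmax-backup(Q(s', .)) - Q(s, a)."""
    return reward + gamma * soft_value(q_next, tau) - q_s[action]


if __name__ == "__main__":
    q_s, q_next = [0.2, 1.5, -0.3], [1.0, 0.4, 0.9]
    for a in range(3):
        print(a,
              unified_one_step_error(q_s, a, 1.0, q_next, 0.9, 0.5),
              soft_q_residual(q_s, a, 1.0, q_next, 0.9, 0.5))
```

The state value at the current state cancels between the value term and the log-policy term, leaving only the Q-value of the action actually taken.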
6 Related Work
Connections between softmax Q-values and optimal entropy-regularized policies have been previously noted. In some cases entropy regularization is expressed in the form of relative entropy (azaretal11, ; azaretal12, ; fox, ; schulmanetal17, ), and in other cases it is the standard entropy (ziebart2010modeling, ). While these papers derive similar relationships to (7) and (8), they stop short of stating the single- and multi-step consistencies over all action choices we highlight. Moreover, the algorithms proposed in those works are essentially single-step Q-learning variants, which suffer from the limitation of using single-step backups. Another recent work (pgq2017, ) uses the softmax relationship in the limit of $\tau \to 0$ and proposes to augment an actor-critic algorithm with offline updates that minimize a set of single-step hard-max Bellman errors. Again, the methods we propose are differentiated by the multi-step path-wise consistencies which allow the resulting algorithms to utilize multi-step trajectories from off-policy samples in addition to on-policy samples.
The proposed PCL and Unified PCL algorithms bear some similarity to multi-step Q-learning (peng1996incremental, ), which rather than minimizing one-step hard-max Bellman error, optimizes a Q-value function approximator by unrolling the trajectory for some number of steps before using a hard-max backup. While this method has shown some empirical success mnih2016asynchronous , its theoretical justification is lacking, since rewards received after a non-optimal action no longer relate to the hard-max Q-values $Q^\circ$. In contrast, the algorithms we propose incorporate the log-probabilities of the actions on a multi-step rollout, which is crucial for the version of softmax consistency we consider.
Other notions of temporal consistency similar to softmax consistency have been discussed in the RL literature. Previous work has used a Boltzmann weighted average operator littman ; azar . In particular, this operator has been used by azar to propose an iterative algorithm converging to the optimal maximum reward policy inspired by the work of kappen2005path ; todorov2006linearly . While they use the Boltzmann weighted average, they briefly mention that a softmax (log-sum-exp) operator would have similar theoretical properties. More recently mellowmax proposed a mellowmax operator, defined as log-average-exp. These log-average-exp operators share a similar non-expansion property, and the proofs of non-expansion are related. Additionally it is possible to show that when restricted to an infinite horizon setting, the fixed point of the mellowmax operator is a constant shift of the $Q^*$ investigated here. In all these cases, the suggested training algorithm optimizes a single-step consistency, unlike PCL and Unified PCL, which optimize a multi-step consistency. Moreover, these papers do not present a clear relationship between the action values at the fixed point and the entropy regularized expected reward objective, which was key to the formulation and algorithmic development in this paper.
Finally, there has been a considerable amount of work in reinforcement learning using off-policy data to design more sample efficient algorithms. Broadly speaking, these methods can be understood as trading off bias (sutton1999policy, ; silver14ddpg, ; lillicrap2015continuous, ; gu2016deep, ) and variance (precup2000eligibility, ; munos2016safe, ). Previous work that has considered multi-step off-policy learning has typically used a correction (e.g., via importance sampling (precup2001off, ), truncated importance sampling with bias correction (munos2016safe, ), or eligibility traces (precup2000eligibility, )). By contrast, our method defines an unbiased consistency for an entire trajectory applicable to on- and off-policy data. An empirical comparison with all of these methods remains, however, an interesting avenue for future work.
7 Experiments
We evaluate the proposed algorithms, namely PCL and Unified PCL, across several different tasks and compare them to an A3C implementation, based on mnih2016asynchronous , and an implementation of double Q-learning with prioritized experience replay, based on pdqn . We find that PCL can consistently match or beat the performance of these baselines. We also provide a comparison between PCL and Unified PCL and find that the use of a single unified model for both values and policy can be competitive with PCL.
These new algorithms readily accommodate expert trajectories. Thus, for the more difficult tasks we also experiment with seeding the replay buffer with randomly sampled expert trajectories. During training we ensure that these trajectories are not removed from the replay buffer and always have a maximal priority.
The details of the tasks and the experimental setup are provided in the Appendix.
Figure 1: Results of PCL against the A3C and DQN baselines on Synthetic Tree, Copy, DuplicatedInput, RepeatCopy, Reverse, ReversedAddition, ReversedAddition3, and Hard ReversedAddition. Each plot shows average reward across five random training runs (ten for Synthetic Tree) after choosing best hyperparameters. We also show a single standard deviation bar clipped at the min and max. The x-axis is number of training iterations. PCL exhibits comparable performance to A3C in some tasks, but clearly outperforms A3C on the more challenging tasks. Across all tasks, the performance of DQN is worse than PCL.
7.1 Results
We present the results of each of the variants PCL, A3C, and DQN in Figure 1. After finding the best hyperparameters (see Section B.3), we plot the average reward over training iterations for five randomly seeded runs. For the Synthetic Tree environment, the same protocol is performed but with ten seeds instead.
The gap between PCL and A3C is hard to discern in some of the simpler tasks such as Copy, Reverse, and RepeatCopy. However, a noticeable gap is observed in the Synthetic Tree and DuplicatedInput results, and more significant gaps are clear in the harder tasks, including ReversedAddition, ReversedAddition3, and Hard ReversedAddition. Across all of the experiments, it is clear that the prioritized DQN performs worse than PCL. These results suggest that PCL is a competitive RL algorithm, which in some cases significantly outperforms strong baselines.
Figure 2: PCL versus Unified PCL on Synthetic Tree, Copy, DuplicatedInput, RepeatCopy, Reverse, ReversedAddition, ReversedAddition3, and Hard ReversedAddition.
We compare PCL to Unified PCL in Figure 2. The same protocol is performed to find the best hyperparameters and plot the average reward over several training iterations. We find that using a single model for both values and policy in Unified PCL is slightly detrimental on the simpler tasks, but on the more difficult tasks Unified PCL is competitive with or even better than PCL.
Figure 3: PCL versus PCL augmented with expert trajectories on Reverse, ReversedAddition, ReversedAddition3, and Hard ReversedAddition.
We present the results of PCL along with PCL augmented with expert trajectories in Figure 3. We observe that the incorporation of expert trajectories helps a considerable amount. Despite only using a small number of expert trajectories relative to the minibatch size, the inclusion of expert trajectories in the training process significantly improves the agent’s performance. We performed similar experiments with Unified PCL and observed a similar lift from using expert trajectories. Incorporating expert trajectories in PCL is relatively trivial compared to the specialized methods developed for other policy based algorithms abbeel2004apprenticeship ; ho2016generative . While we did not compare to other algorithms that take advantage of expert trajectories, this success shows the promise of using path-wise consistencies. Importantly, the ability of PCL to incorporate expert trajectories without requiring adjustment or correction is a desirable property in real-world applications such as robotics.
8 Conclusion
We study the characteristics of the optimal policy and state values for a maximum expected reward objective in the presence of discounted entropy regularization. The introduction of an entropy regularizer induces an interesting softmax consistency between the optimal policy and optimal state values, which may be expressed as either a single-step or multi-step consistency. This softmax consistency leads to the development of Path Consistency Learning (PCL), an RL algorithm that resembles actor-critic in that it maintains and jointly learns a model of the state values and a model of the policy, and is similar to Q-learning in that it minimizes a measure of temporal consistency error. We also propose the variant Unified PCL which maintains a single model for both the policy and the values, thus upending the actor-critic paradigm of separating the actor from the critic. Unlike standard policy based RL algorithms, PCL and Unified PCL apply to both on-policy and off-policy trajectory samples. Further, unlike value based RL algorithms, PCL and Unified PCL can take advantage of multi-step consistencies. Empirically, PCL and Unified PCL exhibit a significant improvement over baseline methods across several algorithmic benchmarks.
9 Acknowledgment
We thank Rafael Cosman, Brendan O’Donoghue, Volodymyr Mnih, George Tucker, Irwan Bello, and the Google Brain team for insightful comments and discussions.
References
 (1) M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: A system for largescale machine learning. arXiv:1605.08695, 2016.

(2)
P. Abbeel and A. Y. Ng.
Apprenticeship learning via inverse reinforcement learning.
In
Proceedings of the twentyfirst international conference on Machine learning
, page 1. ACM, 2004.  (3) A. Antos, C. Szepesvári, and R. Munos. Learning nearoptimal policies with bellmanresidual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89–129, 2008.
 (4) K. Asadi and M. L. Littman. A new softmax operator for reinforcement learning. arXiv:1612.05628, 2016.
 (5) M. G. Azar, V. Gómez, and H. J. Kappen. Dynamic policy programming with function approximation. AISTATS, 2011.
 (6) M. G. Azar, V. Gómez, and H. J. Kappen. Dynamic policy programming. JMLR, 13(Nov), 2012.
 (7) M. G. Azar, V. Gómez, and H. J. Kappen. Optimal control as a graphical model inference problem. Mach. Learn. J., 87, 2012.
 (8) D. P. Bertsekas. Dynamic Programming and Optimal Control, volume 2. Athena Scientific, 1995.
 (9) J. Borwein and A. Lewis. Convex Analysis and Nonlinear Optimization. Springer, 2000.
 (10) G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym. arXiv:1606.01540, 2016.
 (11) R. Fox, A. Pakman, and N. Tishby. Glearning: Taming the noise in reinforcement learning via soft updates. UAI, 2016.
 (12) A. Gruslys, M. G. Azar, M. G. Bellemare, and R. Munos. The reactor: A sampleefficient actorcritic architecture. arXiv preprint arXiv:1704.04651, 2017.
 (13) S. Gu, E. Holly, T. Lillicrap, and S. Levine. Deep reinforcement learning for robotic manipulation with asynchronous offpolicy updates. ICRA, 2016.
 (14) S. Gu, T. Lillicrap, Z. Ghahramani, R. E. Turner, and S. Levine. QProp: Sampleefficient policy gradient with an offpolicy critic. ICLR, 2017.
 (15) T. Haarnoja, H. Tang, P. Abbeel, and S. Levine. Reinforcement learning with deep energybased policies. arXiv:1702.08165, 2017.

(16)
J. Ho and S. Ermon.
Generative adversarial imitation learning.
In Advances in Neural Information Processing Systems, pages 4565–4573, 2016.  (17) S. Hochreiter and J. Schmidhuber. Long shortterm memory. Neural Comput., 1997.
 (18) D.A. Huang, A.m. Farahmand, K. M. Kitani, and J. A. Bagnell. Approximate maxent inverse optimal control and its application for mental simulation of human interactions. 2015.
 (19) S. Kakade. A natural policy gradient. NIPS, 2001.
 (20) H. J. Kappen. Path integrals and symmetry breaking for optimal control theory. Journal of statistical mechanics: theory and experiment, 2005(11):P11011, 2005.
 (21) D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. ICLR, 2015.
 (22) J. Kober, J. A. Bagnell, and J. Peters. Reinforcement learning in robotics: A survey. IJRR, 2013.
 (23) S. Levine, C. Finn, T. Darrell, and P. Abbeel. Endtoend training of deep visuomotor policies. JMLR, 17(39), 2016.
 (24) L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextualbandit approach to personalized news article recommendation. 2010.
 (25) T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. ICLR, 2016.
 (26) M. L. Littman. Algorithms for sequential decision making. PhD thesis, Brown University, 1996.
 (27) V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. ICML, 2016.
 (28) V. Mnih, K. Kavukcuoglu, D. Silver, et al. Humanlevel control through deep reinforcement learning. Nature, 2015.
 (29) R. Munos, T. Stepleton, A. Harutyunyan, and M. Bellemare. Safe and efficient offpolicy reinforcement learning. NIPS, 2016.
 (30) O. Nachum, M. Norouzi, and D. Schuurmans. Improving policy gradient by exploring underappreciated rewards. ICLR, 2017.
 (31) B. O’Donoghue, R. Munos, K. Kavukcuoglu, and V. Mnih. PGQ: Combining policy gradient and Qlearning. ICLR, 2017.
 (32) J. Peng and R. J. Williams. Incremental multistep Qlearning. Machine learning, 22(13):283–290, 1996.
 (33) J. Peters, K. Müling, and Y. Altun. Relative entropy policy search. AAAI, 2010.
 (34) D. Precup. Eligibility traces for offpolicy policy evaluation. Computer Science Department Faculty Publication Series, page 80, 2000.
 (35) D. Precup, R. S. Sutton, and S. Dasgupta. Offpolicy temporaldifference learning with function approximation. 2001.
 (36) T. Schaul, J. Quan, I. Antonoglou, and D. Silver. Prioritized experience replay. ICLR, 2016.
 (37) J. Schulman, X. Chen, and P. Abbeel. Equivalence between policy gradients and soft Qlearning. arXiv:1704.06440, 2017.
 (38) J. Schulman, S. Levine, P. Moritz, M. Jordan, and P. Abbeel. Trust region policy optimization. ICML, 2015.
 (39) J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. Highdimensional continuous control using generalized advantage estimation. ICLR, 2016.
 (40) D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller. Deterministic policy gradient algorithms. ICML, 2014.
 (41) R. S. Sutton and A. G. Barto. Introduction to Reinforcement Learning. MIT Press, 2nd edition, 2017. Preliminary Draft.
 (42) R. S. Sutton, D. A. McAllester, S. P. Singh, Y. Mansour, et al. Policy gradient methods for reinforcement learning with function approximation. NIPS, 1999.
 (43) G. Tesauro. Temporal difference learning and TDgammon. CACM, 1995.
 (44) G. Theocharous, P. S. Thomas, and M. Ghavamzadeh. Personalized ad recommendation systems for lifetime value optimization with guarantees. IJCAI, 2015.
 (45) E. Todorov. Linearlysolvable Markov decision problems. NIPS, 2006.
 (46) E. Todorov. Policy gradients in linearlysolvable MDPs. NIPS, 2010.
 (47) J. N. Tsitsiklis and B. Van Roy. An analysis of temporaldifference learning with function approximation. IEEE Transactions on Automatic Control, 42(5), 1997.
 (48) Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu, and N. de Freitas. Sample efficient actorcritic with experience replay. ICLR, 2017.
 (49) Z. Wang, N. de Freitas, and M. Lanctot. Dueling network architectures for deep reinforcement learning. ICLR, 2016.
 (50) C. J. Watkins. Learning from delayed rewards. PhD thesis, University of Cambridge England, 1989.
 (51) C. J. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
 (52) R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. J., 1992.
 (53) R. J. Williams and J. Peng. Function optimization using connectionist reinforcement learning algorithms. Connection Science, 1991.
 (54) B. D. Ziebart. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. PhD thesis, CMU, 2010.
 (55) B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey. Maximum entropy inverse reinforcement learning. AAAI, 2008.
Appendix A Pseudocode
Pseudocode for PCL is presented in Algorithm 1.
Appendix B Experimental Details
We describe the tasks we experimented on as well as details of the experimental setup.
B.1 Synthetic Tree
As an initial testbed, we developed a simple synthetic environment. The environment is defined by a binary decision tree of depth 20. For each training run, the reward on each edge is sampled uniformly from
and subsequently normalized so that the maximal reward trajectory has total reward 20. We trained using a fully-parameterized model: for each node $s$ in the decision tree there are two parameters to determine the logits of the policy $\pi_\theta(\cdot \mid s)$ and one parameter to determine the value $V_\phi(s)$. In the Q-learning and Unified PCL implementations only two parameters per node are needed to determine the Q-values.
B.2 Algorithmic Tasks
For more complex environments, we evaluated PCL, Unified PCL, and the two baselines on the algorithmic tasks from the OpenAI Gym library [10]. This library provides six tasks, in rough order of difficulty: Copy, DuplicatedInput, RepeatCopy, Reverse, ReversedAddition, and ReversedAddition3. In each of these tasks, an agent operates on a grid of characters or digits, observing one character or digit at a time. At each time step, the agent may move one step in any direction and optionally write a character or digit to output. A reward is received on each correct emission. The agent’s goal for each task is:

Copy: Copy a sequence of characters to output.

DuplicatedInput: Deduplicate a sequence of characters.

RepeatCopy: Copy a sequence of characters first in forward order, then reverse, and finally forward again.

Reverse: Copy a sequence of characters in reverse order.

ReversedAddition: Observe two ternary numbers in little-endian order via a grid and output their sum.

ReversedAddition3: Observe three ternary numbers in little-endian order via a grid and output their sum.
These environments have an implicit curriculum associated with them. To observe the performance of our algorithm without curriculum, we also include a task “Hard ReversedAddition” which has the same goal as ReversedAddition but does not utilize curriculum.
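The task goals listed above can be illustrated by the target output each one asks for. The following is a minimal sketch of those targets only, not of the OpenAI Gym environments or their API; all function names are hypothetical, and it assumes that in DuplicatedInput every symbol appears exactly twice in a row and that digits are given least-significant first.

```python
def copy_target(xs):
    """Copy: emit the input sequence unchanged."""
    return list(xs)

def deduplicate_target(xs):
    """DuplicatedInput: assumes each symbol appears twice in a row; keep one of each pair."""
    return list(xs[::2])

def repeat_copy_target(xs):
    """RepeatCopy: forward, then reversed, then forward again."""
    return list(xs) + list(reversed(xs)) + list(xs)

def reverse_target(xs):
    """Reverse: emit the input sequence in reverse order."""
    return list(reversed(xs))

def reversed_addition_target(rows, base=3):
    """ReversedAddition(3): add little-endian digit rows; return little-endian digits."""
    digits, carry = [], 0
    for column in zip(*rows):
        carry, digit = divmod(sum(column) + carry, base)
        digits.append(digit)
    while carry:
        carry, digit = divmod(carry, base)
        digits.append(digit)
    return digits

# 5 (little-endian ternary [2, 1]) plus 8 ([2, 2]) is 13, i.e. [1, 1, 1].
example_sum = reversed_addition_target([[2, 1], [2, 2]])
```

Passing three rows instead of two gives the ReversedAddition3 target with the same function.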
For these environments, we parameterized the agent by a recurrent neural network with LSTM [17] cells of hidden dimension 128.
B.3 Implementation Details
For our hyperparameter search, we found it simple to parameterize the critic learning rate in terms of the actor learning rate as $\eta_v = C \eta_\pi$, where $C$ is the critic weight.
For the Synthetic Tree environment we used a batch size of 10, rollout of , discount of , and a replay buffer capacity of 10,000. We fixed the parameter for PCL’s replay buffer to 1 and used for DQN. To find the optimal hyperparameters, we performed an extensive grid search over actor learning rate ; critic weight ; entropy regularizer for A3C, PCL, Unified PCL; and for DQN replay buffer parameters. We used standard gradient descent for optimization.
For the algorithmic tasks we used a batch size of 400, rollout of , a replay buffer of capacity 100,000, ran using distributed training with 4 workers, and fixed the actor learning rate to 0.005, which we found to work well across all variants. To find the optimal hyperparameters, we performed an extensive grid search over discount , for PCL’s replay buffer; critic weight ; entropy regularizer ; , for the prioritized DQN replay buffer; and also experimented with exploration rates and copy frequencies for the target DQN, . In these experiments, we used the Adam optimizer [21].
All experiments were implemented using TensorFlow [1].
Appendix C Proofs
In this section, we provide a general theoretical foundation for this work, including proofs of the main path consistency results. We first establish the basic results for a simple one-shot decision making setting. These initial results will be useful in the proof of the general infinite horizon setting.
Although the main paper expresses the main claims under an assumption of deterministic dynamics, this assumption is not necessary: we restricted attention to the deterministic case in the main body merely for clarity and ease of explanation. Given that in this appendix we provide the general foundations for this work, we consider the more general stochastic setting throughout the later sections.
In particular, for the general stochastic, infinite horizon setting, we introduce and discuss the entropy regularized expected return $O_{\text{ENT}}$ and define a “softmax” operator $\mathcal{B}^*$ (analogous to the Bellman operator for hard-max Q-values). We then show the existence of a unique fixed point $V^*$ of $\mathcal{B}^*$, by establishing that the softmax Bellman operator ($\mathcal{B}^*$) is a contraction under the infinity norm. We then relate $V^*$ to the optimal value of the entropy regularized expected reward objective $O_{\text{ENT}}$, which we term $V^\dagger$. We are able to show that $V^* = V^\dagger$, as expected. Subsequently, we present a policy $\pi^*$ determined by $V^*$ that satisfies $V^*(s) = O_{\text{ENT}}(s, \pi^*)$. Then given the characterization of $\pi^*$ in terms of $V^*$, we establish the consistency property stated in Theorem 1 of the main text. Finally, we show that a consistent solution is optimal by satisfying the KKT conditions of the constrained optimization problem (establishing Theorem 4 of the main text).
C.1 Basic results for one-shot entropy regularized optimization
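For $\tau > 0$ and any vector $q \in \mathbb{R}^n$, $n < \infty$, define the scalar valued function $F_\tau$ (the “softmax”) by
$$F_\tau(q) = \tau \log \Big( \sum_{a=1}^{n} e^{q_a/\tau} \Big) \tag{26}$$
and define the vector valued function $f_\tau$ (the “soft indmax”) by
$$f_\tau(q) = \frac{e^{q/\tau}}{\sum_{a=1}^{n} e^{q_a/\tau}} = e^{(q - F_\tau(q))/\tau}, \tag{27}$$
where the exponentiation is component-wise. It is easy to verify that $f_\tau = \nabla F_\tau$. Note that $f_\tau$ maps any real valued vector to a probability vector. We denote the probability simplex by $\Delta = \{\pi : \pi \ge 0,\ \mathbf{1} \cdot \pi = 1\}$, and denote the entropy function by $H(\pi) = -\pi \cdot \log \pi$.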
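As a concrete illustration of these two definitions, the following minimal sketch computes the softmax value and the soft indmax in plain Python and checks numerically that the latter matches the gradient of the former. The function names are hypothetical, and subtracting the maximum entry before exponentiating is a standard numerical-stability device rather than something prescribed by the text.

```python
import math

def softmax_value(q, tau):
    """F_tau(q) = tau * log(sum_a exp(q_a / tau)), shifted by max(q) for stability."""
    m = max(q)
    return m + tau * math.log(sum(math.exp((x - m) / tau) for x in q))

def soft_indmax(q, tau):
    """f_tau(q) = exp((q - F_tau(q)) / tau), applied component-wise."""
    value = softmax_value(q, tau)
    return [math.exp((x - value) / tau) for x in q]

def numeric_gradient(q, tau, eps=1e-6):
    """Central finite-difference estimate of the gradient of F_tau at q."""
    grad = []
    for i in range(len(q)):
        up = q[:i] + [q[i] + eps] + q[i + 1:]
        down = q[:i] + [q[i] - eps] + q[i + 1:]
        grad.append((softmax_value(up, tau) - softmax_value(down, tau)) / (2 * eps))
    return grad

q, tau = [1.0, 2.0, 0.5], 0.7
p = soft_indmax(q, tau)        # a probability vector
g = numeric_gradient(q, tau)   # should agree with p up to finite-difference error
```

As $\tau$ shrinks, `softmax_value` approaches `max(q)` and `soft_indmax` concentrates on the largest entry, which is the sense in which these are smoothed versions of max and indmax.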
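Lemma 4.
$$F_\tau(q) = \max_{\pi \in \Delta} \{\pi \cdot q + \tau H(\pi)\} \tag{28}$$
$$\phantom{F_\tau(q)} = f_\tau(q) \cdot q + \tau H(f_\tau(q)) \tag{29}$$
Proof.
First consider the constrained optimization problem on the right hand side of (28). The Lagrangian is given by $L = \pi \cdot q + \tau H(\pi) + \lambda (1 - \mathbf{1} \cdot \pi)$, hence $\nabla_\pi L = q - \tau \log \pi - (\tau + \lambda)\mathbf{1}$. The KKT conditions for this optimization problem are the following system of equations
$$\mathbf{1} \cdot \pi = 1 \tag{30}$$
$$\tau \log \pi = q - v \tag{31}$$
for the unknowns, $\pi$ and $v$, where $v = \lambda + \tau$. Note that for any $v$, satisfying (31) requires the unique assignment $\pi = \exp((q - v)/\tau)$, which also ensures $\pi > 0$. To subsequently satisfy (30), the equation $1 = \sum_a \exp((q_a - v)/\tau)$ must be solved for $v$; since the right hand side is strictly decreasing in $v$, the solution is also unique and in this case given by $v = F_\tau(q)$. Therefore $\pi = f_\tau(q)$ and $v = F_\tau(q)$ provide the unique solution to the KKT conditions (30)-(31). Since the objective is strictly concave, $\pi$ must be the unique global maximizer, establishing (29). It is then easy to show $F_\tau(q) = f_\tau(q) \cdot q + \tau H(f_\tau(q))$ by algebraic manipulation, which establishes (28). ∎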
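The closing algebraic manipulation can be spelled out as follows. This is a sketch of one way to do it, assuming the definitions $F_\tau(q) = \tau \log \sum_a e^{q_a/\tau}$, $f_\tau(q) = e^{(q - F_\tau(q))/\tau}$ and $H(\pi) = -\pi \cdot \log \pi$, so that $\log f_\tau(q)_a = (q_a - F_\tau(q))/\tau$ and $\sum_a f_\tau(q)_a = 1$:

```latex
\begin{align*}
\tau H(f_\tau(q))
  &= -\tau \sum_a f_\tau(q)_a \log f_\tau(q)_a \\
  &= -\sum_a f_\tau(q)_a \,\bigl(q_a - F_\tau(q)\bigr) \\
  &= F_\tau(q) - f_\tau(q) \cdot q .
\end{align*}
```

Rearranging gives $F_\tau(q) = f_\tau(q) \cdot q + \tau H(f_\tau(q))$.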
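Corollary 5 (Optimality Implies Consistency).
If $v = F_\tau(q) = \max_{\pi \in \Delta} \{\pi \cdot q + \tau H(\pi)\}$ then
$$v = q_a - \tau \log \pi_a \quad \text{for all } a, \tag{32}$$
where $\pi = f_\tau(q)$.
Proof.
From Lemma 4 we know $v = F_\tau(q) = \pi \cdot q + \tau H(\pi)$ where $\pi = f_\tau(q)$. From the definition of $f_\tau$ it also follows that $\log \pi_a = (q_a - F_\tau(q))/\tau$ for all $a$, hence $v = F_\tau(q) = q_a - \tau \log \pi_a$ for all $a$. ∎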
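Corollary 6 (Consistency Implies Optimality).
If $v \in \mathbb{R}$ and $\pi \in \Delta$ jointly satisfy
$$v = q_a - \tau \log \pi_a \quad \text{for all } a, \tag{33}$$
then $v = F_\tau(q)$ and $\pi = f_\tau(q)$; that is, $\pi$ must be an optimizer for (28) and $v$ is its corresponding optimal value.
Proof.
Any $v$ and $\pi \in \Delta$ that jointly satisfy (33) also satisfy the KKT conditions (30) and (31), whose solution is unique; hence $\pi$ must be the unique maximizer for (28) and $v$ its corresponding objective value. ∎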
Although these results are elementary, they reveal a strong connection between optimal state values ($v$), optimal action values ($q$) and optimal policies ($\pi$) under the softmax operators. In particular, Lemma 4 states that, if $q$ is an optimal action value at some current state, the optimal state value must be $v = F_\tau(q)$, which is simply the entropy regularized value of the optimal policy, $\pi = f_\tau(q)$, at the current state.
Corollaries 5 and 6 then make the stronger observation that this mutual consistency between the optimal state value, optimal action values and optimal policy probabilities must hold for every action, not just in expectation over actions sampled from $\pi$; and furthermore that achieving mutual consistency in this form is equivalent to achieving optimality.
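The per-action nature of this consistency can be checked numerically. The sketch below is illustrative only and assumes the softmax value $F_\tau(q) = \tau \log \sum_a e^{q_a/\tau}$ with policy $\pi_a = e^{(q_a - F_\tau(q))/\tau}$; the function name and the example numbers are hypothetical. It shows that $q_a - \tau \log \pi_a$ equals the same value for every action under the soft indmax policy, while a different policy (here, uniform) yields action-dependent values and so cannot be consistent with any single state value.

```python
import math

def softmax_value(q, tau):
    """F_tau(q) = tau * log(sum_a exp(q_a / tau)), shifted by max(q) for stability."""
    m = max(q)
    return m + tau * math.log(sum(math.exp((x - m) / tau) for x in q))

q, tau = [0.3, -1.2, 2.0, 0.0], 0.5
v = softmax_value(q, tau)
pi = [math.exp((x - v) / tau) for x in q]

# Consistency residual v - (q_a - tau * log pi_a), one entry per action.
residuals = [v - (x - tau * math.log(p)) for x, p in zip(q, pi)]

# Under a uniform policy the quantity q_a - tau * log pi_a varies with the action.
uniform = [1.0 / len(q)] * len(q)
per_action = [x - tau * math.log(p) for x, p in zip(q, uniform)]
spread = max(per_action) - min(per_action)
```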
Below we will also need to make use of the following properties of $F_\tau$.
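Lemma 7.
For any vector $q$,
$$F_\tau(q) = \sup_{p \in \Delta} \{p \cdot q - \tau\, p \cdot \log p\} \tag{34}$$
Proof.
Let $F_\tau^*$ denote the conjugate of $F_\tau$, which is given by
$$F_\tau^*(p) = \sup_{q} \{q \cdot p - F_\tau(q)\} = \tau\, p \cdot \log p \tag{35}$$
for $p \in \operatorname{dom}(F_\tau^*) = \Delta$. Since $F_\tau$ is closed and convex, we also have that $F_\tau^{**} = F_\tau$ [9, Section 4.2]; hence
$$F_\tau(q) = \sup_{p \in \Delta} \{q \cdot p - F_\tau^*(p)\} \tag{36}$$
∎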
Lemma 8.
For any two vectors $q^{(1)}$ and $q^{(2)}$,
$$F_\tau(q^{(1)}) - F_\tau(q^{(2)}) \le \max_a \{q^{(1)}_a - q^{(2)}_a\} \tag{37}$$
Proof.
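One way to carry out the argument is sketched below. It is a reconstruction rather than necessarily the authors' exact derivation, and it assumes the conjugate representation of Lemma 7, namely $F_\tau(q) = \sup_{p \in \Delta} \{p \cdot q - F_\tau^*(p)\}$ with $F_\tau^*(p) = \tau\, p \cdot \log p$, where $q^{(1)}$ and $q^{(2)}$ denote the two vectors being compared:

```latex
\begin{align*}
F_\tau(q^{(1)}) - F_\tau(q^{(2)})
  &= \sup_{p \in \Delta} \bigl\{ p \cdot q^{(1)} - F_\tau^*(p) \bigr\}
   - \sup_{p' \in \Delta} \bigl\{ p' \cdot q^{(2)} - F_\tau^*(p') \bigr\} \\
  &= \sup_{p \in \Delta} \inf_{p' \in \Delta}
     \bigl\{ p \cdot q^{(1)} - p' \cdot q^{(2)} - F_\tau^*(p) + F_\tau^*(p') \bigr\} \\
  &\le \sup_{p \in \Delta} \bigl\{ p \cdot (q^{(1)} - q^{(2)}) \bigr\}
     \qquad \text{(choosing } p' = p \text{)} \\
  &\le \max_a \bigl\{ q^{(1)}_a - q^{(2)}_a \bigr\}.
\end{align*}
```

The last step uses that $p$ is a probability vector, so $p \cdot d$ is a weighted average of the entries of $d$ and cannot exceed the largest one.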
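Corollary 9.
$F_\tau$ is an $\infty$-norm contraction; that is, for any two vectors $q^{(1)}$ and $q^{(2)}$,
$$|F_\tau(q^{(1)}) - F_\tau(q^{(2)})| \le \|q^{(1)} - q^{(2)}\|_\infty \tag{42}$$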
Proof.
Immediate from Lemma 8. ∎
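The bound can also be probed empirically. The following sketch, which assumes $F_\tau(q) = \tau \log \sum_a e^{q_a/\tau}$ and uses hypothetical names and arbitrary sampling ranges, draws random pairs of vectors and temperatures and records the largest observed ratio between the change in the softmax value and the $\infty$-norm distance of the inputs. A ratio above 1 would contradict the corollary; this is a sanity check, not a proof.

```python
import math
import random

def softmax_value(q, tau):
    """F_tau(q) = tau * log(sum_a exp(q_a / tau)), shifted by max(q) for stability."""
    m = max(q)
    return m + tau * math.log(sum(math.exp((x - m) / tau) for x in q))

random.seed(0)
worst_ratio = 0.0  # largest observed |F(q1) - F(q2)| / ||q1 - q2||_inf
for _ in range(1000):
    n = random.randint(2, 6)
    tau = random.uniform(0.05, 5.0)
    q1 = [random.uniform(-10.0, 10.0) for _ in range(n)]
    q2 = [random.uniform(-10.0, 10.0) for _ in range(n)]
    gap = abs(softmax_value(q1, tau) - softmax_value(q2, tau))
    dist = max(abs(a - b) for a, b in zip(q1, q2))
    worst_ratio = max(worst_ratio, gap / dist)
```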
C.2 Background results for on-policy entropy regularized updates
Although the results in the main body of the paper are expressed in terms of deterministic problems, we will prove that all the desired properties hold for the more general stochastic case, where there is a stochastic transition determined by the environment. Given the characterization for this general case, the application to the deterministic case is immediate. We continue to assume that the action space is finite, and that the state space is discrete.
For any policy , define the entropy regularized expected return by
(43) 
where the expectation is taken with respect to the policy and with respect to the stochastic state transitions determined by the environment. We will find it convenient to also work with the onpolicy Bellman operator defined by
(44)  
(45)  
(46)  
(47) 
for each state and action . Note that in (46) we are using to denote a vector values over choices of for a given , and to denote the vector of conditional action probabilities specified by at state .
Lemma 10.
For any policy and state , satisfies the recurrence
(48)  
(49)  
(50) 
Moreover, is a contraction mapping.
Proof.
Note that this lemma shows that the entropy regularized expected return of the policy is a fixed point of the corresponding on-policy Bellman operator. Next, we characterize how quickly convergence to a fixed point is achieved by repeated application of the operator.
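The fixed-point and contraction behaviour can be illustrated on a toy problem. The sketch below is an assumption-laden example, not the paper's implementation: it assumes the on-policy operator takes the form $(\mathcal{B}^\pi V)(s) = \sum_a \pi(a \mid s)\,[\,r(s,a) - \tau \log \pi(a \mid s) + \gamma \sum_{s'} P(s' \mid s, a) V(s')\,]$, and the two-state MDP, the policy, and all names are made up for illustration.

```python
import math

GAMMA, TAU = 0.9, 0.5
# Hypothetical MDP: P[s][a][s2] is the probability of moving from s to s2 under action a.
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.0, 1.0]]]
R = [[1.0, 0.0], [0.5, 2.0]]    # r(s, a)
PI = [[0.6, 0.4], [0.3, 0.7]]   # fixed stochastic policy pi(a | s), all entries positive

def bellman_on_policy(V):
    """Apply the assumed entropy regularized on-policy operator once to state values V."""
    result = []
    for s in range(len(V)):
        total = 0.0
        for a, p_a in enumerate(PI[s]):
            expected_next = sum(p * V[s2] for s2, p in enumerate(P[s][a]))
            total += p_a * (R[s][a] - TAU * math.log(p_a) + GAMMA * expected_next)
        result.append(total)
    return result

def sup_distance(U, V):
    """Infinity-norm distance between two value vectors."""
    return max(abs(u - v) for u, v in zip(U, V))

V = [0.0, 0.0]
gaps = []  # distance between successive iterates
for _ in range(200):
    V_next = bellman_on_policy(V)
    gaps.append(sup_distance(V_next, V))
    V = V_next
```

Under this assumed form, successive gaps shrink by at least a factor of `GAMMA` per application, so the iterates converge geometrically to the fixed point.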
Lemma 11.
For any and any ,
for all states , and for all it holds that:
.
Proof.
Consider an arbitrary state . We use an induction on . For the base case, consider and observe that the claim follows trivially, since