1 Introduction
Reinforcement Learning (RL) refers to techniques where an agent learns a policy that optimizes a given performance metric from a sequence of interactions with an environment. There are two main types of algorithms in reinforcement learning. In the first type, called on-policy algorithms, the agent draws a batch of data using its current policy. The second type, known as off-policy algorithms, reuses data from old policies to update the current policy. Off-policy algorithms such as Deep Q-Networks (Mnih et al., 2015, 2013) and Deep Deterministic Policy Gradients (DDPG) (Lillicrap et al., 2015) are biased (Gu et al., 2017) because the behavior of past policies may be very different from that of the current policy, and hence old data may not be a good candidate to inform updates of the current policy. Therefore, although off-policy algorithms are data efficient, the bias makes them unstable and difficult to tune (Fujimoto et al., 2018). On-policy algorithms do not usually incur a bias¹ and are typically easier to tune (Schulman et al., 2017), with the caveat that since they look at each data sample only once, they have poor sample efficiency. Further, they tend to have high-variance gradient estimates, which necessitates a large number of online samples and highly distributed training (Ilyas et al., 2018; Mnih et al., 2016).

¹Implementations of RL algorithms typically use the undiscounted state distribution instead of the discounted distribution, which results in a bias. However, as Thomas (2014) shows, being unbiased is not necessarily good and may even hurt performance.

Efforts to combine the ease of use of on-policy algorithms with the sample efficiency of off-policy algorithms have been fruitful (Gu et al., 2016; O’Donoghue et al., 2016b; Wang et al., 2016; Gu et al., 2017; Nachum et al., 2017; Degris et al., 2012). These algorithms merge on-policy and off-policy updates to trade off the variance of the former against the bias of the latter. Implementing these algorithms in practice is, however, challenging: RL algorithms already have a large number of hyperparameters (Henderson et al., 2018) and such a combination exacerbates this further. This paper seeks to improve the state of affairs.
We introduce the Policy-on Policy-off Policy Optimization (P3O) algorithm in this paper. It performs gradient ascent using the gradient
$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{s \sim \rho^{\pi_\theta},\, a \sim \pi_\theta}\big[\hat{A}(s,a)\, \nabla_\theta \log \pi_\theta(a \mid s)\big] + \mathbb{E}_{s \sim \rho^{\mu},\, a \sim \mu}\big[\min\big(\rho(s,a),\, c\big)\, \hat{A}(s,a)\, \nabla_\theta \log \pi_\theta(a \mid s)\big] - \lambda\, \nabla_\theta\, \mathbb{E}_{s \sim \rho^{\mu}}\big[\mathrm{KL}\big(\mu(\cdot \mid s)\, \|\, \pi_\theta(\cdot \mid s)\big)\big] \tag{1}$$
where the first term is the on-policy policy gradient, the second term is the off-policy policy gradient corrected by an importance sampling (IS) ratio $\rho(s,a) = \pi_\theta(a \mid s)/\mu(a \mid s)$ clipped at a threshold $c$, and the third term is a constraint that keeps the state distribution of the target policy $\pi_\theta$ close to that of the behavior policy $\mu$. Here $\hat{A}$ is an estimate of the advantage function, $\rho^{\pi_\theta}$ and $\rho^{\mu}$ are the discounted state distributions of the two policies, and $\lambda$ is a regularization coefficient. Our key contributions are:

we automatically tune the IS clipping threshold $c$ and the regularization coefficient $\lambda$ using the normalized effective sample size (ESS), and

we control changes to the target policy using samples from the replay buffer via an explicit Kullback-Leibler constraint.
The normalized ESS measures how efficient off-policy data is for estimating the on-policy gradient. We set the regularization coefficient to $\lambda = 1 - \mathrm{ESS}$ and the clipping threshold to $c = \mathrm{ESS}$.
We show in Section 4 that this simple technique leads to consistently improved performance over competitive baselines on discrete-action tasks from the Atari-2600 benchmark suite (Bellemare et al., 2013) and continuous-action tasks from the MuJoCo benchmark (Todorov et al., 2012).
2 Background
Consider a discrete-time agent that interacts with the environment. The agent picks an action $a_t$ given the current state $s_t$ using a policy $\pi$. It receives a reward $r(s_t, a_t)$ after this interaction, and its objective is to maximize the discounted sum of rewards $\sum_{t \ge 0} \gamma^t\, r(s_t, a_t)$, where $\gamma \in [0, 1)$ is a scalar constant that discounts future rewards. The quantity $R_t = \sum_{k \ge 0} \gamma^k\, r(s_{t+k}, a_{t+k})$ is called the return. We shorten $r(s_t, a_t)$ to $r_t$ to simplify notation.
If the initial state $s_0$ is drawn from a distribution $p_0$ and the agent follows the policy $\pi$ thereafter, the action-value function and the state-only value function are
$$Q^{\pi}(s, a) = \mathbb{E}\Big[\sum_{t \ge 0} \gamma^t\, r_t \,\Big|\, s_0 = s,\ a_0 = a,\ \pi\Big] \quad \text{and} \quad V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[Q^{\pi}(s, a)\big], \tag{2}$$
respectively.
respectively. The best policy maximizes the expected value of the returns where
(3) 
2.1 Policy Gradients
We denote by $\pi_\theta$ a policy that is parameterized by parameters $\theta$. This induces a parameterization of the state-action and state-only value functions, which we denote by $Q^{\pi_\theta}$ and $V^{\pi_\theta}$ respectively. Monte-Carlo policy gradient methods such as REINFORCE (Williams, 1992) solve for the best policy $\pi^*$, typically using first-order optimization, using the likelihood-ratio trick to compute the gradient of the objective. The policy gradient of Eq. 3 is given by
$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{s \sim \rho^{\pi_\theta},\, a \sim \pi_\theta}\big[Q^{\pi_\theta}(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s)\big] \tag{4}$$
where $\rho^{\pi_\theta}(s) = \sum_{t \ge 0} \gamma^t\, \Pr(s_t = s \mid \pi_\theta)$ is the unnormalized discounted state visitation frequency.
Remark 1 (Variance reduction).
The integrand in Eq. 4 is estimated in a Monte-Carlo fashion using sample trajectories drawn using the current policy $\pi_\theta$. The action-value function $Q^{\pi_\theta}(s_t, a_t)$ is typically replaced by the return $R_t$. Both of these approximations entail a large variance for policy gradients (Kakade and Langford, 2002; Baxter and Bartlett, 2001), and a number of techniques exist to mitigate the variance. The most common one is to subtract a state-dependent control variate (baseline) $V^{\pi_\theta}(s_t)$ from $R_t$. This leads to the Monte-Carlo estimate of the advantage function (Konda and Tsitsiklis, 2000)
$$\hat{A}(s_t, a_t) = R_t - V^{\pi_\theta}(s_t),$$
which is used in place of $Q^{\pi_\theta}$ in Eq. 4. Let us note that more general state-action dependent baselines can also be used (Liu et al., 2017). We denote the baselined policy gradient integrand in short by $\hat{g}(s,a) = \hat{A}(s,a)\, \nabla_\theta \log \pi_\theta(a \mid s)$ to rewrite Eq. 4 as
$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{s \sim \rho^{\pi_\theta},\, a \sim \pi_\theta}\big[\hat{g}(s,a)\big]. \tag{5}$$
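As a concrete illustration, the integrand in Eq. 5 has a closed form for a tabular softmax policy. The sketch below (numpy, with hypothetical helper names) forms a one-sample Monte-Carlo estimate of the baselined policy gradient from a single sampled action:

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())   # subtract max for numerical stability
    return z / z.sum()

def grad_log_pi(logits, a):
    """Gradient of log pi(a) w.r.t. the logits of a softmax policy:
    the one-hot vector e_a minus pi, i.e. the usual score function."""
    g = -softmax(logits)
    g[a] += 1.0
    return g

# One-sample estimate of the integrand in Eq. 5: the advantage estimate
# times the score function of the sampled action.
logits = np.zeros(3)                    # uniform policy over 3 actions
a, advantage = 1, 2.5
g_hat = advantage * grad_log_pi(logits, a)
```

Note that the components of `g_hat` sum to zero for a softmax policy: increasing the probability of the sampled action necessarily decreases the others.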
2.2 Off-policy policy gradient
The expression in Eq. 5 is an expectation over data collected from the current policy $\pi_\theta$. Vanilla policy gradient methods use each datum only once to update the policy, which makes them sample inefficient. A solution to this problem is to use an experience replay buffer (Lin, 1992) to store previous data and reuse these experiences to update the current policy using importance sampling. For a minibatch of size $n$ consisting of transitions $(s_k, a_k, r_k)$ with actions $a_k \sim \mu(\cdot \mid s_k)$ drawn using a behavior policy $\mu$, the integrand in Eq. 5 becomes $\rho(s_k, a_k)\, \hat{A}(s_k, a_k)\, \nabla_\theta \log \pi_\theta(a_k \mid s_k)$, where the importance sampling (IS) ratio
$$\rho(s, a) = \frac{\pi_\theta(a \mid s)}{\mu(a \mid s)} \tag{6}$$
governs the relative probability of the candidate policy $\pi_\theta$ with respect to $\mu$. Degris et al. (2012) employed marginal value functions to approximate the above gradient and obtained the expression
$$\nabla_\theta J(\pi_\theta) \approx \mathbb{E}_{s \sim \rho^{\mu},\, a \sim \mu}\big[\rho(s, a)\, Q^{\pi_\theta}(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s)\big] \tag{7}$$
for the off-policy policy gradient. Note that states are sampled from $\rho^{\mu}$, which is the discounted state distribution of the behavior policy $\mu$. Further, the expectation is taken over actions drawn from the policy $\mu$, while the action-value function is that of the target policy $\pi_\theta$. This is important because, in order to use the off-policy policy gradient above, one still needs to estimate $Q^{\pi_\theta}$. The authors in Wang et al. (2016) estimate $Q^{\pi_\theta}$ using the Retrace($\lambda$) estimator (Munos et al., 2016). If $\pi_\theta$ and $\mu$ are very different from each other, (i) the importance ratio may vary over a large range, and (ii) the estimate of $Q^{\pi_\theta}$ may be erroneous. This leads to difficulties in estimating the off-policy policy gradient in practice. An effective way to mitigate (i) is to clip $\rho$ at some threshold $c$. We will use this clipped importance ratio often and denote it as $\bar{\rho} = \min(\rho, c)$. This helps us shorten the notation for the off-policy policy gradient to
$$\widehat{\nabla}^{\text{off}} J(\pi_\theta) = \mathbb{E}_{s \sim \rho^{\mu},\, a \sim \mu}\big[\bar{\rho}(s, a)\, \hat{A}(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s)\big]. \tag{8}$$
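The clipped ratio is a one-line operation in practice; a minimal sketch (numpy; the threshold value and probabilities below are illustrative):

```python
import numpy as np

def clipped_is_ratio(pi_prob, mu_prob, c):
    """rho-bar = min(pi(a|s) / mu(a|s), c): truncating the IS ratio bounds
    the variance of the off-policy gradient estimate, at the cost of bias."""
    return np.minimum(pi_prob / mu_prob, c)

pi_prob = np.array([0.10, 0.50, 0.90])   # target-policy probabilities of sampled actions
mu_prob = np.array([0.50, 0.50, 0.10])   # behavior-policy probabilities of the same actions
rho_bar = clipped_is_ratio(pi_prob, mu_prob, c=2.0)
# The raw ratios are [0.2, 1.0, 9.0]; the last one is truncated to c = 2.0.
```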
2.3 Covariate Shift
Consider the supervised learning setting where we observe i.i.d. data $X = \{x_1, \dots, x_n\}$ from a distribution $q(x)$, say the training dataset. We would however like to minimize the loss on data from another distribution $p(x)$, say the test data. This amounts to minimizing
$$\mathbb{E}_{x \sim p}\big[\ell(y, f_w(x))\big] = \mathbb{E}_{x \sim q}\Big[\frac{p(x)}{q(x)}\, \ell(y, f_w(x))\Big] \approx \frac{1}{n} \sum_{i=1}^{n} \beta(x_i)\, \ell\big(y_i, f_w(x_i)\big). \tag{9}$$
Here $y$ are the labels associated with the draws $x$, and $\ell(y, f_w(x))$ is the loss of the predictor $f_w$. The importance ratio
$$\beta(x) = \frac{p(x)}{q(x)} \tag{10}$$
is the Radon-Nikodym derivative of the two densities (Resnick, 2013), and it re-balances the data to put more weight on samples that are unlikely under $q$ but likely under the test distribution $p$. If the two distributions are the same, the importance ratio is 1 and re-weighting is unnecessary. When the two distributions are not the same, we have an instance of covariate shift and need to use the trick in Eq. 9.
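To make the trick in Eq. 9 concrete, the sketch below (numpy; the particular pair of Gaussians is illustrative) estimates a test-distribution quantity purely from training-distribution samples by re-weighting with $\beta(x)$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Training distribution q = N(0, 1); test distribution p = N(1, 1).
x = rng.normal(0.0, 1.0, size=100_000)   # i.i.d. draws from q only
loss = x ** 2                             # loss of a fixed predictor f(x) = 0

# Importance ratio beta(x) = p(x) / q(x); for two unit-variance Gaussians
# with means 1 and 0 this simplifies to exp(x - 1/2).
beta = np.exp(x - 0.5)

est = np.mean(beta * loss)                # IS estimate of E_p[loss]
# The exact value is E_p[x^2] = mean^2 + variance = 2, so est should be close to 2.
```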
Definition 2 (Effective sample size).
Given a dataset $X = \{x_1, \dots, x_n\}$ drawn from $q$ and two densities $p$ and $q$ with $p$ being absolutely continuous with respect to $q$, the effective sample size is defined as the number of samples drawn directly from $p$ that would provide an estimator with performance equal to that of the importance sampling (IS) estimator in Eq. 9 computed with the $n$ samples from $q$ (Kong, 1992). For our purposes, we will use the normalized effective sample size
$$\mathrm{ESS} = \frac{1}{n}\, \frac{\|w\|_1^2}{\|w\|_2^2} \tag{11}$$
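Eq. 11 is cheap to evaluate on a minibatch of importance ratios; a minimal sketch (numpy, with a hypothetical function name):

```python
import numpy as np

def normalized_ess(w):
    """Normalized effective sample size of a vector of importance ratios:
    ESS = ||w||_1^2 / (n * ||w||_2^2), which lies in (0, 1]."""
    w = np.asarray(w, dtype=np.float64)
    return (np.abs(w).sum() ** 2) / (len(w) * (w ** 2).sum())

e_uniform = normalized_ess([1.0, 1.0, 1.0, 1.0])       # 1.0: as good as on-policy data
e_skewed = normalized_ess([100.0, 0.01, 0.01, 0.01])   # close to 1/n: one sample dominates
```

Uniform weights give an ESS of exactly 1, while a single dominant weight drives the ESS toward $1/n$.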
Here $w = \big(\beta(x_1), \dots, \beta(x_n)\big)$ is the vector of importance ratios evaluated at the samples. This expression is a good rule of thumb and occurs, for instance, for a weighted average of Gaussian random variables (Quionero-Candela et al., 2009) or in particle filtering (Smith, 2013). We have normalized the ESS by the size of the dataset $n$, which makes $\mathrm{ESS} \in (0, 1]$.

Note that estimating the importance ratio requires knowledge of both $p$ and $q$. While this is not usually the case in machine learning, reinforcement learning allows us easy access to both off-policy data and on-policy data. We can therefore estimate the importance ratio easily in RL and can use the ESS as an indicator of the efficacy of updates to $\pi_\theta$ computed with samples drawn from the behavior policy $\mu$. If the ESS is large, the two policies predict similar actions given the state, and we can confidently use data from $\mu$ to update $\pi_\theta$.

3 Approach
This section discusses the P3O algorithm. We first identify key characteristics of merging off-policy and on-policy updates, and then discuss the details of the algorithm and provide insight into its behavior using ablation experiments.
3.1 Combining on-policy and off-policy gradients
We can combine the on-policy update of Eq. 5 with the off-policy update of Eq. 8, after bias-correction on the former, as
$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{s \sim \rho^{\mu}}\Big[\, \mathbb{E}_{a \sim \pi_\theta}\big[\big(1 - c/\rho(s,a)\big)_{+}\, \hat{A}(s,a)\, \nabla_\theta \log \pi_\theta(a \mid s)\big] + \mathbb{E}_{a \sim \mu}\big[\bar{\rho}(s,a)\, \hat{A}(s,a)\, \nabla_\theta \log \pi_\theta(a \mid s)\big] \Big] \tag{12}$$
where $(x)_{+} = \max(0, x)$. This is similar to the off-policy actor-critic gradient (Degris et al., 2012) and the ACER gradient (Wang et al., 2016), except that the authors in Wang et al. (2016) use the Retrace($\lambda$) estimator to estimate $Q^{\pi_\theta}$ in Eq. 8. The expectation in the second term is computed over actions that were sampled by $\mu$, whereas the expectation in the first term is computed over all actions, weighted by the probability $\pi_\theta(a \mid s)$ of taking them. The clipping constant $c$ in Eq. 12 trades off off-policy updates against on-policy updates. As $c \to \infty$, ACER performs a completely off-policy update, while we have a completely on-policy update as $c \to 0$. In practice, it is difficult to pick a value for $c$ that works well across different environments, as we elaborate upon in the following remark. This difficulty in choosing $c$ is a major motivation for the present paper.
Remark 3 (How much on-policy updating does ACER do?).
We would like to study the fraction of weight updates coming from on-policy data as compared to those coming from off-policy data in Eq. 12. We took a standard implementation of ACER² with published hyperparameters from the original authors and plotted the on-policy part of the loss (the first term in Eq. 12) as training progresses in Fig. 1. The on-policy loss is zero throughout training. This suggests that the performance of ACER (Wang et al., 2016) should be attributed predominantly to off-policy updates and the Retrace($\lambda$) estimator, rather than to the combination of off-policy and on-policy updates. This experiment demonstrates the importance of hyperparameters when combining off-policy and on-policy updates: it is difficult to tune hyperparameters that combine the two and work in practice.

²OpenAI baselines: https://github.com/openai/baselines
3.2 Combining on-policy and off-policy data with control variates
Another way to leverage off-policy data is to use it to learn a control variate, typically the action-value function $Q_w$. This has been the subject of a number of papers; recent ones include Q-Prop (Gu et al., 2016), which combines Bellman updates with policy gradients, and Interpolated Policy Gradients (IPG) (Gu et al., 2017), which directly interpolates between the on-policy gradient and the off-policy deterministic policy gradient (the DPG and DDPG algorithms (Silver et al., 2014; Lillicrap et al., 2016)) using a hyperparameter $\nu$. To contrast with the ACER gradient in Eq. 12, the IPG is
$$\nabla_\theta J(\pi_\theta) = (1 - \nu)\, \mathbb{E}_{s \sim \rho^{\pi_\theta},\, a \sim \pi_\theta}\big[\hat{A}(s,a)\, \nabla_\theta \log \pi_\theta(a \mid s)\big] + \nu\, \mathbb{E}_{s \sim \rho^{\mu}}\Big[\nabla_\theta\, \mathbb{E}_{a \sim \pi_\theta}\big[Q_w(s, a)\big]\Big] \tag{13}$$
where $Q_w$ is an off-policy fitted critic. Notice that since the policy $\pi_\theta$ is stochastic, the above expression uses $\mathbb{E}_{a \sim \pi_\theta}[Q_w(s,a)]$ for the off-policy part instead of the DPG of a deterministic policy. This avoids training a separate deterministic policy for the off-policy part (unlike Q-Prop), and encourages on-policy exploration and an implicit trust-region update. The parameter $\nu$ explicitly controls the trade-off between the bias of the off-policy gradient and the variance of the on-policy gradient. However, we have found that it is difficult to pick this parameter in practice; this is also seen in the results of Gu et al. (2017), which show sub-par performance on MuJoCo (Todorov et al., 2012) benchmarks; for instance, compare these results to similar experiments in Fujimoto et al. (2018) for the Twin Delayed DDPG (TD3) algorithm.
3.3 P3O: Policy-on Policy-off Policy Optimization
Our proposed approach, named Policy-on Policy-off Policy Optimization (P3O), explicitly controls the deviation of the target policy from the behavior policy. The gradient of P3O is given by
$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{s \sim \rho^{\pi_\theta},\, a \sim \pi_\theta}\big[\hat{A}(s,a)\, \nabla_\theta \log \pi_\theta(a \mid s)\big] + \mathbb{E}_{s \sim \rho^{\mu},\, a \sim \mu}\big[\bar{\rho}(s,a)\, \hat{A}(s,a)\, \nabla_\theta \log \pi_\theta(a \mid s)\big] - \lambda\, \nabla_\theta\, \mathbb{E}_{s \sim \rho^{\mu}}\big[\mathrm{KL}\big(\mu(\cdot \mid s)\, \|\, \pi_\theta(\cdot \mid s)\big)\big]. \tag{14}$$
The first term above is the standard on-policy gradient. The second term is the off-policy policy gradient with the IS ratio truncated at a constant $c$, i.e., $\bar{\rho} = \min(\rho, c)$, while the third term allows explicit control of the deviation of the target policy $\pi_\theta$ from the behavior policy $\mu$. We do not perform bias-correction in the first term, so it is missing the factor $(1 - c/\rho)_{+}$ from the ACER gradient in Eq. 12. As we noted in Remark 3, it may be difficult to pick a value of $c$ which keeps this factor non-zero. Even if this term is zero, the above gradient is a biased estimate of the on-policy policy gradient. Further, since $\mathrm{KL}\big(\mu(\cdot \mid s)\, \|\, \pi_\theta(\cdot \mid s)\big) = \mathbb{E}_{a \sim \mu}\big[-\log \rho(s,a)\big]$, the divergence term operates on the importance ratio over the entire replay buffer $\mathcal{D}$. There are two hyperparameters in the P3O gradient: the IS ratio threshold $c$ and the regularization coefficient $\lambda$. We use the following reasoning to pick them.
If the behavior and target policies are far from each other, we would like the regularization coefficient $\lambda$ to be large so as to push them closer together. If they are too similar to each other, it indicates that we could have performed more exploration; in this scenario, we desire a smaller regularization coefficient $\lambda$. We set
$$\lambda = 1 - \mathrm{ESS} \tag{15}$$
where the ESS in Eq. 11 is computed using the current minibatch sampled from the replay buffer $\mathcal{D}$.
The truncation threshold $c$ is chosen to keep the variance of the second term small. The smaller the threshold $c$, the less efficient the off-policy update; the larger the threshold, the higher the variance of this update. We set
$$c = \mathrm{ESS}. \tag{16}$$
This is a very natural way to threshold the IS factor because $\mathrm{ESS} \in (0, 1]$. It ensures an adaptive trade-off between the reduced variance of the gradient estimate and the inefficiency of a small IS ratio. Note that the ESS is computed on a minibatch of transitions and their respective IS factors, and hence clipping an individual $\rho(s_k, a_k)$ using the ESS tunes the threshold automatically to the minibatch.
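Putting Eq. 15 and Eq. 16 together, both hyperparameters fall out of a single ESS computation on the minibatch; a sketch (numpy, with a hypothetical function name):

```python
import numpy as np

def p3o_coefficients(rho):
    """Given the IS ratios of a minibatch, return the clipping threshold
    c = ESS (Eq. 16), the KL coefficient lambda = 1 - ESS (Eq. 15), and the
    clipped ratios min(rho, c) used in the off-policy term."""
    rho = np.asarray(rho, dtype=np.float64)
    ess = (rho.sum() ** 2) / (len(rho) * (rho ** 2).sum())  # Eq. 11
    c, lam = ess, 1.0 - ess
    return c, lam, np.minimum(rho, c)

rho = np.array([0.2, 1.1, 3.0, 0.9])
c, lam, rho_bar = p3o_coefficients(rho)
# The more the minibatch deviates from on-policy data, the smaller c
# (stronger truncation) and the larger lambda (stronger KL penalty).
```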
The gradient of P3O in Eq. 14 is motivated by the following observation: explicitly controlling the divergence between the target and behavior policies encourages them to have similar state-visitation frequencies. This is elaborated upon in Lemma 4, which follows from the time-dependent state distribution bound proved in Schulman et al. (2015a); Kahn et al. (2017).
Lemma 4 (Gap in discounted state distributions).
The gap between the discounted state distributions $\rho^{\pi_\theta}$ and $\rho^{\mu}$ is bounded as
$$\big\| \rho^{\pi_\theta} - \rho^{\mu} \big\|_1 \;\le\; \frac{2\gamma}{1-\gamma}\; \mathbb{E}_{s \sim \rho^{\mu}}\big[ \mathrm{D_{TV}}\big(\mu(\cdot \mid s)\, \|\, \pi_\theta(\cdot \mid s)\big) \big]. \tag{17}$$
The divergence penalty in Eq. 14 is directly motivated by the above lemma; we however use the KL divergence $\mathrm{KL}(\mu \,\|\, \pi_\theta)$, which is easier to estimate.
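For discrete actions, the penalty $\mathrm{KL}(\mu \,\|\, \pi_\theta)$ can be estimated directly from behavior distributions stored alongside the transitions; a sketch (numpy, with hypothetical names):

```python
import numpy as np

def kl_penalty(mu_probs, pi_probs):
    """Monte-Carlo estimate of E_{s ~ buffer}[ KL(mu(.|s) || pi(.|s)) ] for
    discrete action distributions, one row per sampled state."""
    mu = np.asarray(mu_probs, dtype=np.float64)
    pi = np.asarray(pi_probs, dtype=np.float64)
    return np.mean(np.sum(mu * (np.log(mu) - np.log(pi)), axis=-1))

mu = np.array([[0.9, 0.1], [0.5, 0.5]])   # stored behavior distributions
pi = np.array([[0.5, 0.5], [0.5, 0.5]])   # current target distributions
gap = kl_penalty(mu, pi)                   # positive: the policies disagree on the first state
zero = kl_penalty(pi, pi)                  # zero when the two policies coincide
```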
Remark 5 (Effect of $\lambda$).
Fig. 2 shows the effect of picking a good value of $\lambda$ on the training performance. We picked two Atari games for this experiment: BeamRider, which is an easy exploration task, and Qbert, which is a hard exploration task (Bellemare et al., 2016). As the figure and the adjoining caption show, picking the correct value of $\lambda$ is critical to achieving good sample complexity. The ideal $\lambda$ also changes as training progresses, because policies are highly entropic at initialization, which makes exploration easier. It is difficult to tune $\lambda$ using annealing schedules; this has also been mentioned by the authors in Schulman et al. (2017) in a similar context. Our choice $\lambda = 1 - \mathrm{ESS}$ adapts the level of regularization automatically.
Remark 6 (P3O adapts the bias in policy gradients).
There are two sources of bias in the P3O gradient. First, we do not perform bias-correction of the on-policy term as in Eq. 12. Second, the KL term further modifies the descent direction by averaging the target policy’s entropy over the replay buffer. If $\rho(s,a) \ge c$ for all transitions in the replay buffer, the bias in the P3O update is
(18) 
The above expression suggests a very useful feature. If the ESS is close to 1, i.e., if the target policy is close to the behavior policy, P3O takes a heavily biased gradient step with no entropic regularization. On the other hand, if the ESS is zero, the entire expression above evaluates to zero. The choice $\lambda = 1 - \mathrm{ESS}$ therefore tunes the bias in the P3O updates adaptively. Roughly speaking, if the target policy is close to the behavior policy, the algorithm is confident and moves ahead even with a large bias. It is difficult to control the bias coming from the behavior policy; the ESS allows us to do so naturally.
A number of implementations of RL algorithms such as Q-Prop and IPG often have subtle, unintentional biases (Tucker et al., 2018). However, the improved performance of these algorithms, as also that of P3O, suggests that biased policy gradients might be a fruitful direction for further investigation.
3.4 Discussion on the KL penalty
The divergence penalty in P3O is reminiscent of trust-region methods. These are a popular way of making monotonic improvements to the policy and avoiding premature moves; see, e.g., the TRPO algorithm by Schulman et al. (2015a). The theory in TRPO suggests optimizing a surrogate objective where the hard divergence constraint is replaced by a penalty in the objective. In our setting, this amounts to the penalty $\mathrm{KL}(\mu \,\|\, \pi_\theta)$. Note that the behavior policy $\mu$ is a mixture of previous policies, and this therefore amounts to a penalty that keeps $\pi_\theta$ close to all policies in the replay buffer $\mathcal{D}$. This is also done by the authors in Wang et al. (2016) to stabilize the high variance of actor-critic methods.
A penalty with respect to all past policies slows down optimization. This can be seen abstractly as follows. For an optimization problem $\min_x f(x)$, the gradient update can be written as
$$x^{k+1} = x^k - \eta\, \nabla f(x^k) = \operatorname*{argmin}_x \Big\{ \big\langle \nabla f(x^k),\, x \big\rangle + \frac{1}{2\eta}\, \|x - x^k\|_2^2 \Big\}$$
if the minimizer is unique; here $x^k$ is the iterate and $\eta$ is the step-size at the $k$-th iteration. A penalty with respect to all previous iterates can be modeled as
$$x^{k+1} = \operatorname*{argmin}_x \Big\{ \big\langle \nabla f(x^k),\, x \big\rangle + \frac{1}{2\eta}\, \sum_{i \le k} \|x - x^i\|_2^2 \Big\} \tag{19}$$
which leads to the update equation
$$x^{k+1} = \frac{1}{k} \sum_{i \le k} x^i \;-\; \frac{\eta}{k}\, \nabla f(x^k),$$
which has a vanishing step-size $\eta/k$ as $k \to \infty$ if the schedule for $\eta$ is left unchanged. We would expect such a vanishing step-size of the policy updates to hurt performance.
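The slow-down can also be seen numerically. The toy loop below (pure Python, purely illustrative) applies the update derived from Eq. 19 to $f(x) = x^2/2$; the per-iteration movement of the iterate collapses with the effective step-size $\eta/k$, long before $f$ is minimized:

```python
# Update from Eq. 19 for f(x) = x^2/2 (so f'(x) = x): the next iterate is
# the average of all past iterates minus (eta / k) times the gradient.
eta = 0.5
xs = [1.0]                        # iterate history, starting at x^1 = 1
for k in range(1, 6):
    avg = sum(xs) / len(xs)       # (1/k) * sum of past iterates
    grad = xs[-1]                 # f'(x^k)
    xs.append(avg - (eta / k) * grad)

steps = [abs(b - a) for a, b in zip(xs, xs[1:])]
# steps shrinks rapidly even though the iterate is still far from the
# minimizer x = 0: the proximity penalty to all past iterates stalls progress.
```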
The above observation is at odds with the performance of both ACER and P3O; see Section 4, which shows that both algorithms perform strongly on the Atari benchmark suite. However, Fig. 3 helps reconcile this issue. As the target policy is trained, its entropy decreases, while older policies in the replay buffer are highly entropic and have more exploratory power. A penalty that keeps $\pi_\theta$ close to $\mu$ therefore encourages $\pi_\theta$ to explore. This exploration compensates for the decreased magnitude of the on-policy policy gradient seen in Eq. 19.

3.5 Algorithmic details
The pseudo-code for P3O is given in Algorithm 1. At each iteration, it rolls out a batch of trajectories using the current policy and appends them to the replay buffer $\mathcal{D}$. In order to be able to compute the divergence term $\mathrm{KL}(\mu \,\|\, \pi_\theta)$, we store the behavior policy’s action distribution $\mu(\cdot \mid s)$, in addition to the action, for all states.
P3O performs sequential updates on the on-policy data and the off-policy data. In particular, Line 5 in Algorithm 1 samples a Poisson random variable that governs the number of off-policy updates for each on-policy update in P3O. This is also commonly done in the literature (Wang et al., 2016). We use Generalized Advantage Estimation (GAE) (Schulman et al., 2015b) to estimate the advantage function in P3O. We have noticed significantly improved results with GAE as compared to without it, as Fig. 4 shows.
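GAE is typically computed with a single backward pass over a trajectory; a sketch (numpy; the discount values are illustrative, and the GAE parameter here is distinct from the KL coefficient $\lambda$ of Eq. 15):

```python
import numpy as np

def gae_advantages(rewards, values, last_value=0.0, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation (Schulman et al., 2015b):
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t),
    A_t = sum_k (gamma * lam)^k * delta_{t+k}, via a backward recursion."""
    v = np.append(np.asarray(values, dtype=np.float64), last_value)
    adv = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * v[t + 1] - v[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

# With gamma = lam = 1 and a zero value function, A_t reduces to the
# undiscounted reward-to-go.
adv = gae_advantages([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], gamma=1.0, lam=1.0)
```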
4 Experimental Validation
This section demonstrates empirically that P3O, with the ESS-based hyperparameter choices from Section 3, achieves on average comparable performance to state-of-the-art algorithms. We evaluate the P3O algorithm against competitive baselines on the Atari-2600 benchmarks and MuJoCo continuous-control benchmarks.
4.1 Setup
We compare P3O against three competitive baselines: the synchronous actor-critic architecture (A2C) (Mnih et al., 2016), proximal policy optimization (PPO) (Schulman et al., 2017), and actor-critic with experience replay (ACER) (Wang et al., 2016). The first, A2C, is a standard baseline, while PPO is a completely on-policy algorithm that is robust and has demonstrated good empirical performance. ACER combines on-policy updates with off-policy updates and is closest to P3O. We use the same network as that of Mnih et al. (2015) for the Atari-2600 benchmark and a two-layer fully-connected network for the MuJoCo tasks. The hyperparameters are the same as those of the original authors of the above papers, in order to be consistent and comparable with the existing literature. We use implementations from OpenAI Baselines³. We follow the evaluation protocol proposed by Machado et al. (2017) and report training returns for all experiments. More details are provided in the Supplementary Material.

³https://github.com/openai/baselines
4.2 Results
Atari-2600 benchmark. Table 1 shows a comparison of P3O against the three baselines averaged over all the games in the Atari-2600 benchmark suite. We measure performance in two ways: (i) in terms of the final reward for each algorithm, averaged over the last 100 episodes after 28M time-steps (112M frames of the game), and (ii) in terms of the reward at 40% and at 80% of training time, averaged over 100 episodes. The latter compares different algorithms in terms of their sample efficiency. These results suggest that P3O is an efficient algorithm that improves upon competitive baselines both in terms of the final reward at the end of training and the reward obtained after a fixed number of samples. Fig. 7 shows the reward curves for some of the games; rewards and training curves for all games are provided in the Supplementary Material.
Algorithm  Won  Won @ 40% training time  Won @ 80% training time
A2C  0  0  0
ACER  13  9  11
PPO  9  8  10
P3O  27  32  28
Completely off-policy algorithms are a strong benchmark on Atari games. We therefore compare P3O with a few state-of-the-art off-policy algorithms using results published by the original authors. P3O wins 32 games vs. 17 games won by DDQN (Van Hasselt et al., 2016). P3O wins 18 games vs. 30 games won by C51 (Bellemare et al., 2017). P3O wins 26 games vs. 22 games won by SIL (Oh et al., 2018). These off-policy algorithms use 200M frames, and P3O’s performance with 112M frames is comparable to them.
MuJoCo continuous-control tasks. In addition to A2C and PPO, we also show a comparison to Q-Prop (Gu et al., 2016) and Interpolated Policy Gradients (IPG) (Gu et al., 2017); the returns for the latter two are taken from the training curves in the original papers; they use 10M time-steps and 3 random seeds. The code of the original authors of ACER for MuJoCo is unavailable, and we, like others, were unsuccessful in getting ACER to train for continuous-control tasks. Table 2 shows that P3O achieves better performance on continuous-control tasks than strong baselines such as A2C and PPO. It is also better on average than algorithms such as Q-Prop and IPG that are designed to combine off-policy and on-policy data. Note that Q-Prop and IPG were tuned by the original authors specifically for each task. In contrast, all hyperparameters for P3O are fixed across the MuJoCo benchmarks. Training curves and results for more environments are in the Supplementary Material.
Task  A2C  PPO  QProp  IPG  P3O
HalfCheetah  1907  2022  4178  4216  5052
Walker  2015  2728  2832  1896  3771
Hopper  1708  2245  2957  –  2334
Ant  1811  1616  3374  3943  4727
Humanoid  720  530  1423  1651  2057
5 Related Work
This work builds upon recent techniques that combine off-policy and on-policy updates in reinforcement learning. The closest to our approach is the ACER algorithm (Wang et al., 2016). It builds upon the off-policy actor-critic method (Degris et al., 2012), uses the Retrace operator (Munos et al., 2016) to estimate an off-policy action-value function, and constrains the candidate policy to be close to the running average of past policies using a linearized divergence penalty. P3O uses a biased variant of the ACER gradient and incorporates an explicit KL penalty in the objective.
The PGQL algorithm (O’Donoghue et al., 2016a) uses an estimate of the action-value function of the target policy to combine on-policy updates with those obtained by minimizing the Bellman error. Q-Prop (Gu et al., 2016) learns an action-value function from off-policy data, which is then used as a control variate for on-policy updates. The authors in Gu et al. (2017) propose the interpolated policy gradient (IPG), which takes a unified view of these algorithms. It directly combines on-policy and off-policy updates using a hyperparameter and shows that, although such updates may be biased, the bias is bounded.
The key characteristic of the above algorithms is that they use hyperparameters to combine off-policy data with on-policy data. This is fragile in practice because different environments require different hyperparameters. Moreover, the ideal hyperparameters for combining data may change as training progresses; see Fig. 2. For instance, the authors in Oh et al. (2018) report poorer empirical results with ACER and prioritized replay as compared to vanilla actor-critic methods (A2C). The effective sample size (ESS) heuristic in P3O is a completely automatic, parameter-free way of combining off-policy data with on-policy data.
Policy gradient algorithms with off-policy data are not new. The importance sampling ratio has been commonly used by a number of authors, such as Cao (2005) and Levine and Koltun (2013). The effective sample size is popularly used to measure the quality of importance sampling and to restrict the search space for parameter updates (Jie and Abbeel, 2010; Peshkin and Shelton, 2002). We exploit the ESS to a similar end: it is an effective way to control both the contribution of off-policy data and the deviation of the target policy from the behavior policy. Let us note that there are a number of works that learn action-value functions using off-policy data, e.g., Wang et al. (2013); Hausknecht and Stone (2016); Lehnert and Precup (2015), that achieve varying degrees of success on reinforcement learning benchmarks.
Covariate shift and the effective sample size have been studied extensively in the machine learning literature; see Robert and Casella (2013); Quionero-Candela et al. (2009) for an elaborate treatment. These ideas have also been employed in reinforcement learning (Kang et al., 2007; Bang and Robins, 2005; Dudík et al., 2011). To the best of our knowledge, this paper is the first to use the ESS for combining on-policy updates with off-policy updates.
6 Discussion
Sample complexity is the key inhibitor to translating the empirical performance of reinforcement learning algorithms from simulation to the real world. Exploiting past, off-policy data to offset the high sample complexity of on-policy methods may be the key to doing so. Current approaches that combine the two using hyperparameters are fragile. P3O is a simple, effective algorithm that uses the effective sample size (ESS) to automatically govern this combination, and it demonstrates strong empirical performance across a variety of benchmarks. More generally, the discrepancy between the distribution of past data used to fit control variates and the data being gathered by the new policy lies at the heart of modern RL algorithms. The analysis of RL algorithms has not delved into this phenomenon. We believe this to be a promising avenue for future research.
7 Acknowledgements
The authors would like to acknowledge the support of Hang Zhang and Tong He from Amazon Web Services for the opensource implementation of P3O.
References
 Bang and Robins (2005) Bang, H. and Robins, J. M. (2005). Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4):962–973.
 Baxter and Bartlett (2001) Baxter, J. and Bartlett, P. L. (2001). Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319–350.
 Bellemare et al. (2016) Bellemare, M., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., and Munos, R. (2016). Unifying count-based exploration and intrinsic motivation. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R., editors, NIPS, pages 1471–1479.
 Bellemare et al. (2017) Bellemare, M. G., Dabney, W., and Munos, R. (2017). A distributional perspective on reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 449–458. JMLR.org.
 Bellemare et al. (2013) Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. (2013). The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279.
 Cao (2005) Cao, X.R. (2005). A basic formula for online policy gradient algorithms. IEEE Transactions on Automatic Control, 50(5):696–699.
 Degris et al. (2012) Degris, T., White, M., and Sutton, R. S. (2012). Off-policy actor-critic. arXiv:1205.4839.
 Dudík et al. (2011) Dudík, M., Langford, J., and Li, L. (2011). Doubly robust policy evaluation and learning. arXiv:1103.4601.
 Fujimoto et al. (2018) Fujimoto, S., van Hoof, H., and Meger, D. (2018). Addressing function approximation error in actor-critic methods. arXiv:1802.09477.
 Gu et al. (2016) Gu, S., Lillicrap, T., Ghahramani, Z., Turner, R. E., and Levine, S. (2016). Q-Prop: Sample-efficient policy gradient with an off-policy critic. arXiv:1611.02247.
 Gu et al. (2017) Gu, S. S., Lillicrap, T., Turner, R. E., Ghahramani, Z., Schölkopf, B., and Levine, S. (2017). Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. In NIPS, pages 3846–3855.
 Hausknecht and Stone (2016) Hausknecht, M. and Stone, P. (2016). On-policy vs. off-policy updates for deep reinforcement learning. In Deep Reinforcement Learning: Frontiers and Challenges, IJCAI 2016 Workshop.
 Henderson et al. (2018) Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and Meger, D. (2018). Deep reinforcement learning that matters. In ThirtySecond AAAI Conference on Artificial Intelligence.
 Ilyas et al. (2018) Ilyas, A., Engstrom, L., Santurkar, S., Tsipras, D., Janoos, F., Rudolph, L., and Madry, A. (2018). Are deep policy gradient algorithms truly policy gradient algorithms? arXiv:1811.02553.
 Jie and Abbeel (2010) Jie, T. and Abbeel, P. (2010). On a connection between importance sampling and the likelihood ratio policy gradient. In Advances in Neural Information Processing Systems, pages 1000–1008.
 Kahn et al. (2017) Kahn, G., Zhang, T., Levine, S., and Abbeel, P. (2017). Plato: Policy learning using adaptive trajectory optimization. In ICRA, pages 3342–3349. IEEE.
 Kakade and Langford (2002) Kakade, S. and Langford, J. (2002). Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, volume 2, pages 267–274.
 Kang et al. (2007) Kang, J. D., Schafer, J. L., et al. (2007). Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical science, 22(4):523–539.
 Konda and Tsitsiklis (2000) Konda, V. R. and Tsitsiklis, J. N. (2000). Actor-critic algorithms. In Solla, S. A., Leen, T. K., and Müller, K., editors, NIPS, pages 1008–1014. MIT Press.
 Kong (1992) Kong, A. (1992). A note on importance sampling using standardized weights. Technical Report 348.
 Lehnert and Precup (2015) Lehnert, L. and Precup, D. (2015). Policy gradient methods for off-policy control. arXiv:1512.04105.
 Levine and Koltun (2013) Levine, S. and Koltun, V. (2013). Guided policy search. In International Conference on Machine Learning, pages 1–9.
 Lillicrap et al. (2015) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv:1509.02971.
 Lillicrap et al. (2016) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. (2016). Continuous control with deep reinforcement learning. CoRR, abs/1509.02971.
 Lin (1992) Lin, L.-J. (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3-4):293–321.
 Liu et al. (2017) Liu, H., Feng, Y., Mao, Y., Zhou, D., Peng, J., and Liu, Q. (2017). Action-dependent control variates for policy optimization via Stein’s identity. arXiv:1710.11198.
 Machado et al. (2017) Machado, M. C., Bellemare, M. G., Talvitie, E., Veness, J., Hausknecht, M. J., and Bowling, M. (2017). Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. CoRR, abs/1709.06009.
 Mnih et al. (2016) Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In ICML, pages 1928–1937.
 Mnih et al. (2013) Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing atari with deep reinforcement learning. arXiv:1312.5602.
 Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529.
 Munos et al. (2016) Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. (2016). Safe and efficient off-policy reinforcement learning. In NIPS, pages 1054–1062.
 Nachum et al. (2017) Nachum, O., Norouzi, M., Xu, K., and Schuurmans, D. (2017). Bridging the gap between value and policy based reinforcement learning. In NIPS.
 O’Donoghue et al. (2016a) O’Donoghue, B., Munos, R., Kavukcuoglu, K., and Mnih, V. (2016a). Combining policy gradient and Q-learning. arXiv:1611.01626.
 O’Donoghue et al. (2016b) O’Donoghue, B., Munos, R., Kavukcuoglu, K., and Mnih, V. (2016b). PGQ: Combining policy gradient and Q-learning. arXiv:1611.01626.
 Oh et al. (2018) Oh, J., Guo, Y., Singh, S., and Lee, H. (2018). Self-imitation learning. arXiv:1806.05635.
 Peshkin and Shelton (2002) Peshkin, L. and Shelton, C. R. (2002). Learning from scarce experience. arXiv:cs/0204043.
 Quiñonero-Candela et al. (2009) Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N. D. (2009). Dataset shift in machine learning. The MIT Press.
 Resnick (2013) Resnick, S. I. (2013). A probability path. Springer Science & Business Media.
 Robert and Casella (2013) Robert, C. and Casella, G. (2013). Monte Carlo statistical methods. Springer Science & Business Media.
 Schulman et al. (2015a) Schulman, J., Levine, S., Abbeel, P., Jordan, M. I., and Moritz, P. (2015a). Trust region policy optimization. In International Conference on Machine Learning, volume 37, pages 1889–1897.
 Schulman et al. (2015b) Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. (2015b). High-dimensional continuous control using generalized advantage estimation. arXiv:1506.02438.
 Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv:1707.06347.
 Silver et al. (2014) Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. (2014). Deterministic policy gradient algorithms. In International Conference on Machine Learning.
 Smith (2013) Smith, A. (2013). Sequential Monte Carlo methods in practice. Springer Science & Business Media.
 Thomas (2014) Thomas, P. (2014). Bias in natural actorcritic algorithms. In Proceedings of the 31st International Conference on Machine Learning, volume 32, pages 441–448. PMLR.
 Todorov et al. (2012) Todorov, E., Erez, T., and Tassa, Y. (2012). MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE.
 Tucker et al. (2018) Tucker, G., Bhupatiraju, S., Gu, S., Turner, R. E., Ghahramani, Z., and Levine, S. (2018). The mirage of action-dependent baselines in reinforcement learning. arXiv:1802.10031.
 Van Hasselt et al. (2016) Van Hasselt, H., Guez, A., and Silver, D. (2016). Deep reinforcement learning with double Q-learning. In Thirtieth AAAI Conference on Artificial Intelligence.
 Wang et al. (2013) Wang, Y.-H., Li, T.-H. S., and Lin, C.-J. (2013). Backward Q-learning: The combination of Sarsa algorithm and Q-learning. Engineering Applications of Artificial Intelligence, 26(9):2184–2193.
 Wang et al. (2016) Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K., and de Freitas, N. (2016). Sample efficient actorcritic with experience replay. arXiv:1611.01224.
 Williams (1992) Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256.
Appendix
Appendix A Hyperparameters for all experiments
A2C (Atari)

Hyperparameter  Value
Architecture  conv (), conv (), conv (), FC ()
Learning rate
Number of environments  16
Number of steps per iteration  5
Entropy regularization  0.01
Discount factor
Value loss coefficient
Gradient norm clipping coefficient
Random seeds
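Several coefficients in the table above (entropy regularization, value loss coefficient, gradient norm clipping) enter an A2C-style update in a standard way. The sketch below shows that composition in NumPy; the function names and default coefficients are illustrative common choices, not values recovered from the paper:

```python
import numpy as np

def a2c_loss(policy_loss, value_loss, entropy, value_coef=0.5, ent_coef=0.01):
    """Standard A2C objective: weighted sum of the three loss terms.

    value_coef and ent_coef play the roles of the "Value loss coefficient"
    and "Entropy regularization" rows above; the defaults here are common
    choices, not the paper's settings.
    """
    # Entropy is subtracted: higher policy entropy lowers the loss,
    # which encourages exploration.
    return policy_loss + value_coef * value_loss - ent_coef * entropy

def clip_grad_norm(grads, max_norm):
    """Global gradient-norm clipping: rescale all gradients jointly
    whenever their combined L2 norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-8))
    return [g * scale for g in grads]
```

Clipping by the *global* norm (rather than per-parameter) preserves the direction of the overall gradient while bounding its magnitude.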
ACER (Atari)

Hyperparameter  Value
Architecture  Same as A2C
Replay buffer size
Learning rate
Number of environments  16
Number of steps per iteration  20
Entropy regularization  0.01
Number of training epochs per update  4
Discount factor
Value loss coefficient
Importance weight clipping factor
Gradient norm clipping coefficient
Momentum factor in Polyak averaging
Max. KL between old and updated policy
Use trust region  True
Random seeds
PPO (Atari)

Hyperparameter  Value
Architecture  Same as A2C
Learning rate
Number of environments  8
Number of steps per iteration  128
Entropy regularization  0.01
Number of training epochs per update  4
Discount factor
Value loss coefficient
Gradient norm clipping coefficient
Advantage estimation discounting factor
Random seeds
P3O (Atari)

Hyperparameter  Value
Architecture  Same as A2C
Learning rate
Replay buffer size
Number of environments  16
Number of steps per iteration  16
Entropy regularization  0.01
Off-policy updates per iteration  Poisson(2)
Burn-in period
Number of samples from replay buffer
Discount factor
Value loss coefficient
Gradient norm clipping coefficient
Advantage estimation discounting factor
Random seeds
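The replay-related rows above describe an update schedule in which, after a burn-in period, each iteration performs one on-policy update followed by a Poisson(2)-distributed number of off-policy updates on minibatches sampled from the replay buffer. A minimal sketch of such a loop follows; the function and variable names are illustrative, not the paper's implementation:

```python
from collections import deque
import numpy as np

def train(num_iterations, collect_rollout, on_policy_update, off_policy_update,
          burn_in=2, buffer_size=10_000, batch_size=64, nu_mean=2.0, seed=0):
    """Interleave one on-policy step with a Poisson(nu_mean) number of
    replay (off-policy) steps per iteration, as in the schedule above."""
    rng = np.random.default_rng(seed)
    replay = deque(maxlen=buffer_size)          # FIFO replay buffer
    for it in range(num_iterations):
        rollout = collect_rollout()             # fresh data from current policy
        replay.extend(rollout)
        on_policy_update(rollout)               # unbiased on-policy update
        if it >= burn_in:                       # replay only after burn-in
            for _ in range(rng.poisson(nu_mean)):
                # sample a minibatch without replacement from the buffer
                idx = rng.choice(len(replay),
                                 size=min(batch_size, len(replay)),
                                 replace=False)
                off_policy_update([replay[i] for i in idx])
```

Drawing the number of replay updates from a Poisson distribution (rather than fixing it) randomizes how much off-policy data each iteration consumes while keeping the average ratio of off- to on-policy updates at `nu_mean`.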
P3O (MuJoCo)

Hyperparameter  Value
Architecture  FC(100), FC(100)
Learning rate
Replay buffer size
Number of environments  2
Number of steps per iteration  64
Entropy regularization  0.0
Off-policy updates per iteration  Poisson(3)
Burn-in period  2500
Number of samples from replay buffer
Discount factor
Value loss coefficient
Gradient norm clipping coefficient
Advantage estimation discounting factor
Random seeds
A2C (MuJoCo)

Hyperparameter  Value
Architecture  FC(64), FC(64)
Learning rate
Number of environments  8
Number of steps per iteration  32
Entropy regularization  0.0
Discount factor
Value loss coefficient
Gradient norm clipping coefficient
Random seeds
PPO (MuJoCo)

Hyperparameter  Value
Architecture  FC(64), FC(64)
Learning rate
Number of environments  1
Number of steps per iteration  2048
Entropy regularization  0.0
Number of training epochs per update  10
Discount factor
Value loss coefficient
Gradient norm clipping coefficient
Advantage estimation discounting factor
Random seeds
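The multiple training epochs per update in the PPO tables rely on the clipped surrogate objective of Schulman et al. (2017) to keep the updated policy close to the policy that collected the data. A NumPy sketch of that objective; the 0.2 clipping range is the common default, not a value recovered from the tables:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate objective of Schulman et al. (2017).

    ratio:     pi_new(a|s) / pi_old(a|s) for each sample
    advantage: advantage estimates (e.g. generalized advantage estimation)
    eps:       clipping range; 0.2 is a common default, assumed here
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Pessimistic minimum: the objective never improves by moving the
    # probability ratio outside [1 - eps, 1 + eps], so several epochs of
    # gradient ascent on the same batch remain safe.
    return np.minimum(unclipped, clipped).mean()
```

This quantity is maximized (or its negative minimized) over the epochs-per-update listed in the tables.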
Appendix B Comparisons with baseline algorithms
Task  A2CG  A2C  PPO  P3O 

HalfCheetah  181.46  1907.42  2022.14  5051.58 
Walker  855.62  2015.15  2727.93  3770.86 
Hopper  1377.07  1708.22  2245.03  2334.32 
Swimmer  33.33  45.27  101.71  116.87 
Inverted Double Pendulum  90.09  5510.71  4750.69  8114.05 
Inverted Pendulum  733.34  889.61  414.49  985.14 
Ant  253.54  1811.29  1615.55  4727.34 
Humanoid  530.12  720.38  530.13  2057.17 
Games  A2C  ACER  PPO  P3O 

Alien  
Amidar  
Assault  
Asterix  
Asteroids  
Atlantis  
BankHeist  
BattleZone  
BeamRider  
Bowling  
Boxing  
Breakout  
Centipede  
ChopperCommand  
CrazyClimber  
DemonAttack  
DoubleDunk  
Enduro  
FishingDerby  
Freeway  
Frostbite  
Gopher  
Gravitar  
IceHockey  
Jamesbond  
Kangaroo  
Krull  
KungFuMaster  
MontezumaRevenge  
MsPacman  
NameThisGame  
Pitfall  
Pong  
PrivateEye  
Qbert  
Riverraid  
RoadRunner  
Robotank  
Seaquest  
SpaceInvaders  
StarGunner  
Tennis  
TimePilot  
Tutankham  
UpNDown  
Venture  
VideoPinball  
WizardOfWor  
Zaxxon 