P3O: Policy-on Policy-off Policy Optimization

05/05/2019, by Rasool Fakoor et al., Amazon

On-policy reinforcement learning (RL) algorithms have high sample complexity, while off-policy algorithms are difficult to tune. Merging the two holds the promise of developing efficient algorithms that generalize across diverse environments. It is however challenging in practice to find suitable hyper-parameters that govern this trade-off. This paper develops a simple algorithm named P3O that interleaves off-policy updates with on-policy updates. P3O uses the effective sample size between the behavior policy and the target policy to control how far they can be from each other, and does not introduce any additional hyper-parameters. Extensive experiments on the Atari-2600 and MuJoCo benchmark suites show that this simple technique is highly effective in reducing the sample complexity of state-of-the-art algorithms.







1 Introduction

Reinforcement Learning (RL) refers to techniques where an agent learns a policy that optimizes a given performance metric from a sequence of interactions with an environment. There are two main types of algorithms in reinforcement learning. In the first type, called on-policy algorithms, the agent draws a batch of data using its current policy. The second type, known as off-policy algorithms, reuses data from old policies to update the current policy. Off-policy algorithms such as Deep Q-Network (Mnih et al., 2015, 2013) and Deep Deterministic Policy Gradients (DDPG) (Lillicrap et al., 2015) are biased (Gu et al., 2017) because the behavior of past policies may be very different from that of the current policy, and hence old data may not be a good candidate to inform updates of the current policy. Therefore, although off-policy algorithms are data efficient, the bias makes them unstable and difficult to tune (Fujimoto et al., 2018). On-policy algorithms do not usually incur a bias¹; they are typically easier to tune (Schulman et al., 2017), with the caveat that since they look at each data sample only once, they have poor sample efficiency. Further, they tend to have high-variance gradient estimates, which necessitates a large number of online samples and highly distributed training (Ilyas et al., 2018; Mnih et al., 2016).

¹ Implementations of RL algorithms typically use the undiscounted state distribution instead of the discounted distribution, which results in a bias. However, as Thomas (2014) shows, being unbiased is not necessarily good and may even hurt performance.

Efforts to combine the ease-of-use of on-policy algorithms with the sample efficiency of off-policy algorithms have been fruitful (Gu et al., 2016; O’Donoghue et al., 2016b; Wang et al., 2016; Gu et al., 2017; Nachum et al., 2017; Degris et al., 2012). These algorithms merge on-policy and off-policy updates to trade-off the variance of the former against the bias of the latter. Implementing these algorithms in practice is however challenging: RL algorithms already have a lot of hyper-parameters (Henderson et al., 2018) and such a combination further exacerbates this. This paper seeks to improve the state of affairs.

We introduce the Policy-on Policy-off Policy Optimization (P3O) algorithm in this paper. It performs gradient ascent using the gradient

ĝ = E_{s∼d^{π_θ}, a∼π_θ}[∇_θ log π_θ(a|s) Â(s,a)] + E_{s∼d^μ, a∼μ}[ρ̄(s,a) ∇_θ log π_θ(a|s) Â(s,a)] − λ ∇_θ E_{s∼d^μ}[KL(μ(·|s) ∥ π_θ(·|s))],

where the first term is the on-policy policy gradient, the second term is the off-policy policy gradient corrected by a clipped importance sampling (IS) ratio ρ̄, and the third term is a constraint that keeps the state distribution of the target policy π_θ close to that of the behavior policy μ. Our key contributions are:

  1. we automatically tune the IS clipping threshold c and the regularization coefficient λ using the normalized effective sample size (ESS), and

  2. we control changes to the target policy using samples from replay buffer via an explicit Kullback-Leibler constraint.

The normalized ESS measures how useful off-policy data is for estimating the on-policy gradient. We set

λ = 1 − ÊSS   and   c = ÊSS,

where ÊSS is the normalized effective sample size computed on each mini-batch.

We show in Section 4 that this simple technique leads to consistently improved performance over competitive baselines on discrete action tasks from the Atari-2600 benchmark suite (Bellemare et al., 2013) and continuous action tasks from MuJoCo benchmark (Todorov et al., 2012).

2 Background

Consider a discrete-time agent that interacts with the environment. The agent picks an action a_t given the current state s_t using a policy π. It receives a reward r_t after this interaction, and its objective is to maximize the discounted sum of rewards E[Σ_{t≥0} γ^t r_t], where γ ∈ [0, 1) is a scalar constant that discounts future rewards. The quantity R_t = Σ_{k≥0} γ^k r_{t+k} is called the return. We shorten π(a_t|s_t) to π to simplify notation.

If the initial state s_0 is drawn from a distribution p_0 and the agent follows the policy π thereafter, the action-value function and the state-only value function are

q^π(s, a) = E[R_0 | s_0 = s, a_0 = a]   and   v^π(s) = E_{a∼π(·|s)}[q^π(s, a)],

respectively. The best policy maximizes the expected value of the returns,

π* ∈ argmax_π J(π),   where J(π) = E_{s_0∼p_0}[v^π(s_0)].   (3)
2.1 Policy Gradients

We denote by π_θ a policy parameterized by parameters θ. This induces a parameterization of the state-action and state-only value functions, which we denote by q_θ and v_θ respectively. Monte-Carlo policy gradient methods such as REINFORCE (Williams, 1992) solve for the best policy, typically using first-order optimization, using the likelihood-ratio trick to compute the gradient of the objective. Such a policy gradient of Eq. 3 is given by

∇_θ J(θ) = E_{s∼d^{π_θ}, a∼π_θ(·|s)}[∇_θ log π_θ(a|s) q^{π_θ}(s, a)],   (4)

where d^{π_θ}(s) = Σ_{t≥0} γ^t P(s_t = s) is the unnormalized discounted state visitation frequency.

Remark 1 (Variance reduction).

The integrand in Eq. 4 is estimated in a Monte-Carlo fashion using sample trajectories drawn using the current policy π_θ. The action-value function q^{π_θ}(s_t, a_t) is typically replaced by the sampled return R_t. Both of these approximations entail a large variance for policy gradients (Kakade and Langford, 2002; Baxter and Bartlett, 2001), and a number of techniques exist to mitigate the variance. The most common one is to subtract a state-dependent control variate (baseline) v_θ(s_t) from R_t. This leads to the Monte-Carlo estimate of the advantage function (Konda and Tsitsiklis, 2000)

Â(s_t, a_t) = R_t − v_θ(s_t),

which is used in place of q^{π_θ} in Eq. 4. Let us note that more general state-action dependent baselines can also be used (Liu et al., 2017). We denote the baselined policy gradient integrand in short by ĝ(s, a) = ∇_θ log π_θ(a|s) Â(s, a) to rewrite Eq. 4 as

∇_θ J(θ) = E_{s∼d^{π_θ}, a∼π_θ(·|s)}[ĝ(s, a)].   (5)

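The baselined estimator of Eq. 5 can be sketched numerically. The snippet below is a minimal illustration for a tabular softmax policy; the helper names (`softmax_policy`, `reinforce_baseline_grad`) and the trajectory layout are ours, not the paper's.

```python
import numpy as np

def softmax_policy(theta, s):
    """Tabular softmax policy; theta has shape (n_states, n_actions)."""
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def reinforce_baseline_grad(theta, trajectory, values, gamma=0.99):
    """Monte-Carlo estimate of Eq. 5: sum of grad log pi(a|s) * (R - v(s)).

    trajectory is a list of (state, action, reward) tuples; values is a
    per-state baseline v(s), so R - v(s) is the advantage estimate.
    """
    grad = np.zeros_like(theta)
    # compute discounted returns R_t backwards along the trajectory
    returns, G = [], 0.0
    for (_, _, r) in reversed(trajectory):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    for (s, a, _), R in zip(trajectory, returns):
        advantage = R - values[s]
        dlog = -softmax_policy(theta, s)
        dlog[a] += 1.0          # gradient of log-softmax w.r.t. the logits
        grad[s] += dlog * advantage
    return grad
```

A perfect baseline (values equal to the returns) zeroes the estimate, which is exactly the variance-reduction effect the remark describes.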
2.2 Off-policy policy gradient

The expression in Eq. 5 is an expectation over data collected from the current policy π_θ. Vanilla policy gradient methods use each datum only once to update the policy, which makes them sample inefficient. A solution to this problem is to use an experience replay buffer (Lin, 1992) to store previous data and reuse these experiences to update the current policy using importance sampling. For a mini-batch of transitions (s, a, r) with s ∼ d^μ and a ∼ μ(·|s), the integrand in Eq. 5 becomes

ρ(s, a) ∇_θ log π_θ(a|s) Â(s, a),

where the importance sampling (IS) ratio

ρ(s, a) = π_θ(a|s) / μ(a|s)

governs the relative probability of the candidate policy π_θ with respect to the behavior policy μ.

Degris et al. (2012) employed marginal value functions to approximate the above gradient and obtained the expression

E_{s∼d^μ, a∼μ}[ρ(s, a) ∇_θ log π_θ(a|s) q^{π_θ}(s, a)]

for the off-policy policy gradient. Note that states are sampled from d^μ, which is the discounted state distribution of μ. Further, the expectation over actions is taken with respect to the policy μ, while the action-value function is that of the target policy π_θ. This is important because, in order to use the off-policy policy gradient above, one still needs to estimate q^{π_θ}. The authors in Wang et al. (2016) estimate q^{π_θ} using the Retrace(λ) estimator (Munos et al., 2016). If μ and π_θ are very different from each other, (i) the importance ratio ρ may vary over a large range, and (ii) the estimate of q^{π_θ} may be erroneous. This leads to difficulties in estimating the off-policy policy gradient in practice. An effective way to mitigate (i) is to clip ρ at some threshold c. We will use this clipped importance ratio often and denote it ρ̄ = min(c, ρ). This helps us shorten the notation for the off-policy policy gradient to

ĝ_off = E_{s∼d^μ, a∼μ}[ρ̄(s, a) ∇_θ log π_θ(a|s) q^{π_θ}(s, a)].   (8)
2.3 Covariate Shift

Consider supervised learning, where we observe iid data from a distribution p, say the training dataset. We would however like to minimize the loss on data from another distribution q, say the test data. This amounts to minimizing

E_{x∼q}[ℓ(y(x), ŷ(x))] = E_{x∼p}[β(x) ℓ(y(x), ŷ(x))].   (9)

Here y(x) are the labels associated with the draws x, and ℓ is the loss of the predictor ŷ(x). The importance ratio

β(x) = dq/dp (x)

is the Radon-Nikodym derivative of the two densities (Resnick, 2013), and it re-balances the data to put more weight on samples that are unlikely under p but likely under the test distribution q. If the two distributions are the same, the importance ratio is 1 and re-weighting is unnecessary. When the two distributions are not the same, we have an instance of covariate shift and need to use the trick in Eq. 9.
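As a concrete, hypothetical instance of Eq. 9, the snippet below estimates the mean of a "test" Gaussian q = N(0.5, 1) using only samples from a "train" Gaussian p = N(0, 1), reweighted by β = q/p; the setup is ours, for illustration only.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=200_000)                         # draws from p = N(0, 1)
beta = gaussian_pdf(x, 0.5, 1.0) / gaussian_pdf(x, 0.0, 1.0)   # importance ratio q / p

# importance-weighted estimate of E_q[x]; the true value is 0.5
est = np.mean(beta * x)
```

The weights average to 1 and shift the estimate from 0 (the mean under p) to the mean under q.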

Definition 2 (Effective sample size).

Given a dataset X = {x_1, …, x_N} drawn from p, and two densities p and q with q absolutely continuous with respect to p, the effective sample size is defined as the number of samples from q that would provide an estimator with a performance equal to that of the importance sampling (IS) estimator in Eq. 9 computed with the N samples from p (Kong, 1992). For our purposes, we will use the normalized effective sample size

ÊSS = ‖β‖₁² / (N ‖β‖₂²) = (Σ_i β(x_i))² / (N Σ_i β(x_i)²),   (11)

where β = (β(x_1), …, β(x_N)) is the vector of importance ratios evaluated at the samples. This expression is a good rule of thumb and occurs, for instance, for a weighted average of Gaussian random variables (Quionero-Candela et al., 2009) or in particle filtering (Smith, 2013). We have normalized the ESS by the size of the dataset, which makes ÊSS ∈ (0, 1].
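In code, the normalized ESS of Eq. 11 is a one-line computation; the helper name below is ours.

```python
import numpy as np

def normalized_ess(beta):
    """Normalized effective sample size: (sum beta)^2 / (N * sum beta^2)."""
    beta = np.asarray(beta, dtype=float)
    return beta.sum() ** 2 / (len(beta) * np.sum(beta ** 2))
```

Uniform weights give 1 (the off-policy samples are as informative as on-policy ones), while a single dominant weight gives 1/N.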

Note that estimating the importance ratio β requires knowledge of both p and q. While this is not usually the case in machine learning, reinforcement learning allows us easy access to both off-policy and on-policy data: the densities μ(a|s) and π_θ(a|s) are both known, so β can be estimated easily in RL. We can use the ESS as an indicator of the efficacy of updates to π_θ computed with samples drawn from the behavior policy μ. If the ESS is large, the two policies predict similar actions given the state, and we can confidently use data from μ to update π_θ.

3 Approach

This section discusses the P3O algorithm. We first identify key characteristics of merging off-policy and on-policy updates and then discuss the details of the algorithm and provide insight into its behavior using ablation experiments.

3.1 Combining on-policy and off-policy gradients

We can combine the on-policy update Eq. 5 with the off-policy update Eq. 8, after bias-correction on the former, as

ĝ = E_{s∼d^μ} [ E_{a∼π_θ}[(1 − c/ρ(s, a))₊ ∇_θ log π_θ(a|s) q^{π_θ}(s, a)] + E_{a∼μ}[ρ̄(s, a) ∇_θ log π_θ(a|s) q^{π_θ}(s, a)] ],   (12)

where (x)₊ = max(0, x). This is similar to the off-policy actor-critic (Degris et al., 2012) and ACER (Wang et al., 2016) gradients, except that the authors in Wang et al. (2016) use the Retrace(λ) estimator to estimate q^{π_θ} in Eq. 8. The expectation in the second term is computed over actions that were sampled by μ, whereas the expectation in the first term is computed over all actions weighted by the probability π_θ(a|s) of taking them. The clipping constant c in Eq. 12 controls the balance between off-policy and on-policy updates. As c → ∞, ACER does a completely off-policy update, while we have a completely on-policy update as c → 0. In practice, it is difficult to pick a value of c that works well across different environments, as we elaborate upon in the following remark. This difficulty in choosing c is a major motivation for the present paper.

Remark 3 (How much on-policy update does ACER do?).

We would like to study the fraction of weight updates coming from on-policy data as compared to those coming from off-policy data in Eq. 12. We took a standard implementation of ACER (OpenAI Baselines: https://github.com/openai/baselines) with the published hyper-parameters of the original authors and plot the on-policy part of the loss (first term in Eq. 12) as training progresses in Fig. 1. The on-policy loss is zero throughout training. This suggests that the performance of ACER (Wang et al., 2016) should be attributed predominantly to off-policy updates and the Retrace(λ) estimator, rather than to the combination of off-policy and on-policy updates. This experiment demonstrates the importance of hyper-parameters when combining off-policy and on-policy updates: it is difficult to tune hyper-parameters that combine the two and work well in practice.
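The thresholding effect behind Remark 3 is easy to reproduce numerically. Assuming ACER's truncation-with-bias-correction form, the weight on the on-policy term is (1 − c/ρ)₊, which vanishes whenever ρ ≤ c; with a large threshold such as c = 10 (an assumption here, chosen for illustration) and importance ratios concentrated near 1, it is zero for essentially every sample:

```python
import numpy as np

def acer_onpolicy_weight(rho, c=10.0):
    """Bias-correction weight (1 - c/rho)_+ on the on-policy term of Eq. 12."""
    return np.maximum(0.0, 1.0 - c / rho)

# log-normal importance ratios concentrated near 1, as when mu and pi_theta are close
rho = np.exp(np.random.default_rng(1).normal(0.0, 0.3, size=10_000))
w = acer_onpolicy_weight(rho)   # every entry is exactly zero for this batch
```

Only ratios larger than c receive any on-policy weight at all, which matches the observation that the on-policy loss stays at zero.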

Figure 1: On-policy loss for ACER is zero all through training due to aggressive importance ratio thresholding. ACER had the highest reward from among A2C, PPO and P3O in 3 out of these 5 games (Assault, RiverRaid and BreakOut; see the Supplementary Material for more details). In spite of the on-policy loss being zero for all Atari games, ACER receives good rewards across the benchmark.

3.2 Combining on-policy and off-policy data with control variates

Another way to leverage off-policy data is to use it to learn a control variate, typically the action-value function. This has been the subject of a number of papers; recent ones include Q-Prop (Gu et al., 2016), which combines Bellman updates with policy gradients, and Interpolated Policy Gradients (IPG) (Gu et al., 2017), which directly interpolates between the on-policy gradient and an off-policy deterministic gradient (cf. the DPG and DDPG algorithms; Silver et al., 2014; Lillicrap et al., 2016) using a hyper-parameter. To contrast with the ACER gradient in Eq. 12, the IPG is

ĝ_ipg = (1 − ν) E_{s∼d^{π_θ}, a∼π_θ}[∇_θ log π_θ(a|s) Â(s, a)] + ν E_{s∼d^μ}[∇_θ E_{a∼π_θ(·|s)}[q_w(s, a)]],   (13)

where q_w is an off-policy fitted critic. Notice that since the policy π_θ is stochastic, the above expression uses E_{a∼π_θ}[q_w(s, a)] for the off-policy part instead of the DPG term q_w(s, π_θ(s)) for a deterministic policy. This avoids training a separate deterministic policy (unlike Q-Prop) for the off-policy part and encourages on-policy exploration and an implicit trust-region update. The parameter ν explicitly controls the trade-off between the bias and the variance of the off-policy and on-policy gradients respectively. However, we have found that it is difficult to pick this parameter in practice; this is also seen in the results of Gu et al. (2017), which show sub-par performance on MuJoCo (Todorov et al., 2012) benchmarks; for instance, compare these results to similar experiments in Fujimoto et al. (2018) for the Twin Delayed DDPG (TD3) algorithm.

3.3 P3O: Policy-on Policy-off Policy optimization

Our proposed approach, named Policy-on Policy-off Policy Optimization (P3O), explicitly controls the deviation of the target policy from the behavior policy. The gradient of P3O is given by

ĝ_p3o = E_{s∼d^{π_θ}, a∼π_θ}[∇_θ log π_θ(a|s) Â(s, a)] + E_{s∼d^μ, a∼μ}[ρ̄(s, a) ∇_θ log π_θ(a|s) Â(s, a)] − λ ∇_θ E_{s∼d^μ}[KL(μ(·|s) ∥ π_θ(·|s))].   (14)

The first term above is the standard on-policy gradient. The second term is the off-policy policy gradient with the IS ratio truncated at a constant c, while the third term allows explicit control of the deviation of the target policy π_θ from μ. We do not perform bias correction in the first term, so it is missing the (1 − c/ρ)₊ factor from the ACER gradient Eq. 12. As we noted in Remark 3, it may be difficult to pick a value of c which keeps this factor non-zero. Even if the λ-term is zero, the above gradient is a biased estimate of the on-policy policy gradient. Further, the KL-divergence term can be rewritten, up to an additive constant, as E_{s∼d^μ, a∼μ}[−log ρ(s, a)], and therefore regularizes the importance ratio over the entire replay buffer D. There are two hyper-parameters in the P3O gradient: the IS ratio threshold c and the regularization coefficient λ. We use the following reasoning to pick them.

If the behavior and target policies are far from each other, we would like λ to be large so as to push them closer. If they are too similar to each other, it entails that we could have performed more exploration; in this scenario, we desire a smaller regularization coefficient λ. We set

λ = 1 − ÊSS,

where the ÊSS in Eq. 11 is computed using the current mini-batch sampled from the replay buffer D.

The truncation threshold c is chosen to keep the variance of the second term small. The smaller the c, the less efficient the off-policy update; the larger the c, the higher the variance of this update. We set

c = ÊSS.

This is a natural way to threshold the IS factor because the normalized ÊSS lies in (0, 1]. It ensures an adaptive trade-off between the reduced variance of the gradient estimate and the inefficiency of a small IS ratio. Note that the ESS is computed on a mini-batch of transitions and their respective IS factors, and hence clipping an individual ρ using the ESS tunes c automatically to the mini-batch.
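The two adaptive choices can be sketched together. This is a minimal sketch assuming the settings λ = 1 − ÊSS and c = ÊSS described above; the function names are ours.

```python
import numpy as np

def p3o_coefficients(beta):
    """Adaptive P3O coefficients from a mini-batch of IS ratios beta."""
    beta = np.asarray(beta, dtype=float)
    ess = beta.sum() ** 2 / (len(beta) * np.sum(beta ** 2))  # normalized ESS, Eq. 11
    return 1.0 - ess, ess    # (lambda, c): KL coefficient and clipping threshold

def clipped_ratio(rho, c):
    """Truncated importance ratio rho_bar = min(c, rho) used in Eq. 14."""
    return np.minimum(c, rho)
```

With identical policies (β ≡ 1) this gives λ = 0, c = 1: full off-policy reuse and no KL pull; as the weights degenerate, λ grows toward 1 and c shrinks toward 0, shutting the off-policy term down.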

The gradient of P3O in Eq. 14 is motivated by the following observation: explicitly controlling the KL-divergence between the target and the behavior policy encourages them to have similar state visitation frequencies. This is elaborated upon by Lemma 4, which follows from the time-dependent state distribution bound proved in Schulman et al. (2015a) and Kahn et al. (2017).

(a) BeamRider
(b) Qbert
Figure 2: Effect of λ on performance. First, a non-zero value of λ trains much faster than training without the regularization term, because the target policy is constrained to be close to the entropic behavior policy μ. Second, for hard exploration games like Qbert, a smaller value of λ works much better, while the trend is somewhat reversed for easy exploration games such as BeamRider. The ideal value of λ thus depends on the environment and is difficult to pick beforehand. Setting λ = 1 − ÊSS tunes the regularization adaptively depending upon the particular mini-batch; it works significantly better for easy exploration and also leads to gains in hard exploration tasks.
Lemma 4 (Gap in discounted state distributions).

The gap between the discounted state distributions d^{π_θ} and d^μ is bounded as

‖d^{π_θ} − d^μ‖₁ ≤ (2γ/(1−γ)) max_s D_TV(π_θ(·|s), μ(·|s)) ≤ (2γ/(1−γ)) max_s √(KL(μ(·|s) ∥ π_θ(·|s)) / 2),

where the second inequality is Pinsker's inequality.

The KL-divergence penalty in Eq. 14 is directly motivated by the above lemma; we however use the expectation E_{s∼d^μ}[KL(μ(·|s) ∥ π_θ(·|s))], which is easier to estimate from samples than the maximum over states.

Remark 5 (Effect of λ).

Fig. 2 shows the effect of picking a good value of λ on the training performance. We picked two games in Atari for this experiment: BeamRider, which is an easy exploration task, and Qbert, which is a hard exploration task (Bellemare et al., 2016). As the figure and the adjoining caption show, picking the correct value of λ is critical to achieving good sample complexity. The ideal λ also changes as training progresses, because policies are highly entropic at initialization, which makes exploration easier. It is difficult to tune λ using annealing schedules; this has also been mentioned by the authors in Schulman et al. (2017) in a similar context. Our choice λ = 1 − ÊSS adapts the level of regularization automatically.

Remark 6 (P3O adapts the bias in policy gradients).

There are two sources of bias in the P3O gradient. First, we do not perform the bias correction of the on-policy term used in Eq. 12. Second, the KL term further modifies the descent direction by averaging the target policy's log-likelihood over the replay buffer. If ρ(s, a) ≤ c for all transitions in the replay buffer, the truncation is inactive and the bias in the P3O update is the deviation of

E_{s∼d^μ, a∼μ}[ρ(s, a) ∇_θ log π_θ(a|s) Â(s, a)] − λ ∇_θ E_{s∼d^μ}[KL(μ(·|s) ∥ π_θ(·|s))]

from the on-policy gradient of Eq. 5. The above expression suggests a very useful feature. If the ÊSS is close to 1, i.e., if the target policy is close to the behavior policy, P3O takes a heavily biased gradient step with no entropic regularization (λ ≈ 0). On the other hand, if the ESS is zero, the entire expression above evaluates to zero. The choice λ = 1 − ÊSS therefore tunes the bias in the P3O updates adaptively. Roughly speaking, if the target policy is close to the behavior policy, the algorithm is confident and moves on even with a large bias. It is difficult to control the bias coming from the behavior policy; the ESS allows us to do so naturally.

A number of implementations of RL algorithms such as Q-Prop and IPG often have subtle, unintentional biases (Tucker et al., 2018). However, the improved performance of these algorithms, as also that of P3O, suggests that biased policy gradients might be a fruitful direction for further investigation.

(a) ESS
(b) KL term
(c) Entropy of π_θ
Figure 3: Evolution of the ESS, the KL penalty and the entropy of π_θ as training progresses. Fig. 3(a) shows the evolution of the normalized ESS. A large value of the ESS indicates that the target policy is close to μ in its state distribution. The ESS stays at an intermediate value for a large fraction of the training, which suggests a good trade-off between exploration and exploitation. The KL term in Fig. 3(b) is relatively constant during the course of training because its coefficient is adapted by the ESS. This enables the target policy to be exploratory while still being able to leverage off-policy data from the behavior policy. Fig. 3(c) shows the evolution of the entropy of π_θ normalized by the number of actions. Note that using a fixed value of λ results in the target policy having a smaller entropy than standard P3O; this reduces its exploratory behavior, and the latter indeed achieves a higher reward, as seen in Fig. 2.
(a) Ms. Pac-Man
(b) Gravitar
Figure 4: Effect of roll-out length and GAE. Figs. 4(a) and 4(b) show the progress of P3O with and without generalized advantage estimation. GAE leads to significant improvements in performance. The above figures also show the effect of changing the number of time-steps from the environment used in on-policy updates: longer time horizons help in games with sparse rewards, although the benefit diminishes across the suite beyond a moderate number of steps.

3.4 Discussion on the KL penalty

The KL-divergence penalty in P3O is reminiscent of trust-region methods. These are a popular way of making monotonic improvements to the policy and avoiding premature moves; e.g., see the TRPO algorithm by Schulman et al. (2015a). The theory in TRPO suggests optimizing a surrogate objective where the hard divergence constraint is replaced by a penalty in the objective. In our setting, this amounts to the penalty KL(μ(·|s) ∥ π_θ(·|s)). Note that the behavior policy μ is a mixture of previous policies, and this therefore amounts to a penalty that keeps π_θ close to all policies in the replay buffer D. This is also done by the authors in Wang et al. (2016) to stabilize the high variance of actor-critic methods.

A penalty with respect to all past policies slows down optimization. This can be seen abstractly as follows. For an optimization problem min_x f(x), the gradient update can be written as

x_{t+1} = x_t − η_t ∇f(x_t) = argmin_x { ⟨∇f(x_t), x⟩ + (1/(2 η_t)) ‖x − x_t‖² }

if the argmin is unique; here x_t is the iterate and η_t is the step-size at the t-th iteration. A penalty with respect to all previous iterates can be modeled as

x_{t+1} = argmin_x { ⟨∇f(x_t), x⟩ + (1/(2 η_t)) Σ_{k≤t} ‖x − x_k‖² },

which leads to the update equation

x_{t+1} = (1/(t+1)) Σ_{k≤t} x_k − (η_t/(t+1)) ∇f(x_t),   (19)

which has a vanishing effective step-size η_t/(t+1) as t → ∞ if the schedule η_t is left unchanged. We would expect such a vanishing step-size of the policy updates to hurt performance.
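A toy simulation makes the vanishing step-size concrete. For a linearized objective ⟨g, x⟩ the penalized update has the closed form "mean of past iterates minus a shrinking gradient step"; with a constant gradient, plain gradient descent drifts by η per step while the penalized iterates stall almost immediately. The setup below is ours, purely for illustration.

```python
import numpy as np

def penalized_step(xs, grad, eta):
    """Exact minimizer of <grad, x> + (1/(2*eta)) * sum_k ||x - x_k||^2
    over the past iterates xs: mean(xs) - (eta / len(xs)) * grad."""
    return np.mean(xs, axis=0) - (eta / len(xs)) * grad

eta, g = 0.1, np.array([1.0])
xs = [np.array([0.0])]
for _ in range(50):
    xs.append(penalized_step(xs, g, eta))

x_plain = -eta * 50 * g   # plain gradient descent moves eta per step
```

After 50 steps the penalized iterates are stuck near −η = −0.1, while plain descent has reached −5.0.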

The above observation is at odds with the performance of both ACER and P3O; see Section 4, which shows that both algorithms perform strongly on the Atari benchmark suite. Fig. 3 helps reconcile this issue. As the target policy π_θ is trained, its entropy decreases, while older policies in the replay buffer are highly entropic and have more exploratory power. A penalty that keeps π_θ close to μ thus encourages π_θ to explore. This exploration compensates for the decreased magnitude of the on-policy policy gradient seen in Eq. 19.

3.5 Algorithmic details

The pseudo-code for P3O is given in Algorithm 1. At each iteration, it rolls out trajectories of T time-steps each using the current policy π_θ and appends them to the replay buffer D. In order to be able to compute the KL-divergence term, we store the policy probabilities π_θ(·|s) in addition to the action for all states.

P3O performs sequential updates on the on-policy data and the off-policy data. In particular, Algorithm 1 samples a Poisson random variable m that governs the number of off-policy updates for each on-policy update in P3O. This is also commonly done in the literature (Wang et al., 2016). We use Generalized Advantage Estimation (GAE) (Schulman et al., 2015b) to estimate the advantage function in P3O. We have noticed significantly improved results with GAE as compared to without it, as Fig. 4 shows.

Input: policy π_θ, baseline v_θ, replay buffer D
1: Roll out trajectories of T time-steps each using π_θ
2: Compute the returns and store the policy π_θ(·|s) with each transition in D
3: On-policy update of π_θ using the rolled-out trajectories; see Eq. 14
4: m ∼ Poisson(κ)
5: for i = 1, …, m do
6:     Sample a mini-batch B from D
7:     Estimate the ESS and the KL-divergence term using B and the stored policies
8:     Off-policy and KL-regularizer update of π_θ using B; see Eq. 14
9: end for
Algorithm 1: One iteration of Policy-on Policy-off Policy Optimization (P3O)
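The advantage estimates used in the updates of Algorithm 1 can be computed with the standard GAE recursion (Schulman et al., 2015b). The implementation below is a generic sketch, not the paper's code.

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation.

    values must contain one extra entry (the bootstrap value of the state
    after the last step); adv[t] = sum_k (gamma*lam)^k * delta_{t+k} with
    delta_t = r_t + gamma * v(s_{t+1}) - v(s_t).
    """
    T = len(rewards)
    adv, last = np.zeros(T), 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv
```

Setting lam = 1 recovers Monte-Carlo advantages, while lam = 0 gives the one-step TD residual; intermediate values trade bias against variance.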

4 Experimental Validation

This section demonstrates empirically that P3O, with the ESS-based hyper-parameter choices from Section 3, achieves on average comparable or better performance than state-of-the-art algorithms. We evaluate the P3O algorithm against competitive baselines on the Atari-2600 benchmarks and MuJoCo continuous-control benchmarks.

Figure 5: Training curves for A2C (blue), ACER (red), PPO (green) and P3O (orange) on some Atari games. See the Supplementary Material for similar plots on all Atari games.

4.1 Setup

We compare P3O against three competitive baselines: the synchronous actor-critic architecture (A2C) (Mnih et al., 2016), proximal policy optimization (PPO) (Schulman et al., 2017), and actor-critic with experience replay (ACER) (Wang et al., 2016). The first, A2C, is a standard baseline, while PPO is a completely on-policy algorithm that is robust and has demonstrated good empirical performance. ACER combines on-policy updates with off-policy updates and is closest to P3O. We use the same network as that of Mnih et al. (2015) for the Atari-2600 benchmark and a two-layer fully-connected network for MuJoCo tasks. The hyper-parameters are the same as those of the original authors of the above papers, in order to be consistent and comparable with the existing literature. We use implementations from OpenAI Baselines (https://github.com/openai/baselines). We follow the evaluation protocol proposed by Machado et al. (2017) and report the training returns for all experiments. More details are provided in the Supplementary Material.

4.2 Results

Atari-2600 benchmark. Table 1 shows a comparison of P3O against the three baselines averaged over all the games in the Atari-2600 benchmark suite. We measure performance in two ways: (i) in terms of the final reward for each algorithm averaged over the last 100 episodes after 28M time-steps (112M frames of the game), and (ii) in terms of the reward at 40% and at 80% of training time, averaged over 100 episodes. The latter compares different algorithms in terms of their sample efficiency. These results suggest that P3O is an efficient algorithm that improves upon competitive baselines both in terms of the final reward at the end of training and the reward obtained after a fixed number of samples. Fig. 5 shows the reward curves for some of the games; rewards and training curves for all games are provided in the Supplementary Material.

Algorithm Won Won @ 40% Won @ 80%
training time training time
A2C 0 0 0
ACER 13 9 11
PPO 9 8 10
P3O 27 32 28
Table 1: Number of Atari games “won” by each algorithm measured by the average return over 100 episodes across three random seeds.

Completely off-policy algorithms are a strong benchmark on Atari games. We therefore compare P3O with a few state-of-the-art off-policy algorithms using published results by the original authors. P3O wins 32 games vs. 17 games won by DDQN (Van Hasselt et al., 2016). P3O wins 18 games vs. 30 games won by C51 (Bellemare et al., 2017). P3O wins 26 games vs. 22 games won by SIL (Oh et al., 2018). These off-policy algorithms use 200M frames, and P3O's performance with 112M frames is comparable to them.

MuJoCo continuous-control tasks. In addition to A2C and PPO, we also show a comparison to Q-Prop (Gu et al., 2016) and Interpolated Policy Gradients (IPG) (Gu et al., 2017); the returns for the latter two are taken from the training curves in the original papers, which use 10M time-steps and 3 random seeds. The code of the original authors of ACER for MuJoCo is unavailable, and we, like others, were unsuccessful in getting ACER to train on continuous-control tasks. Table 2 shows that P3O achieves better performance than strong baselines for continuous-control tasks such as A2C and PPO. It is also better on average than algorithms such as Q-Prop and IPG that are designed to combine off-policy and on-policy data. Note that Q-Prop and IPG were tuned by the original authors specifically for each task. In contrast, all hyper-parameters for P3O are fixed across the MuJoCo benchmarks. Training curves and results for more environments are in the Supplementary Material.

Task A2C PPO Q-Prop IPG P3O
Half-Cheetah 1907 2022 4178 4216 5052
Walker 2015 2728 2832 1896 3771
Hopper 1708 2245 2957 - 2334
Ant 1811 1616 3374 3943 4727
Humanoid 720 530 1423 1651 2057
Table 2: Average return on MuJoCo continuous-control tasks after 3M time-steps of training on 10 seeds.

5 Related Work

This work builds upon recent techniques that combine off-policy and on-policy updates in reinforcement learning. The closest to our approach is the ACER algorithm (Wang et al., 2016). It builds upon the off-policy actor-critic method (Degris et al., 2012), uses the Retrace operator (Munos et al., 2016) to estimate an off-policy action-value function, and constrains the candidate policy to be close to the running average of past policies using a linearized KL-divergence penalty. P3O uses a biased variant of the ACER gradient and incorporates an explicit KL penalty in the objective.

The PGQL algorithm (O’Donoghue et al., 2016a) uses an estimate of the action-value function of the target policy to combine on-policy updates with those obtained by minimizing the Bellman error. QProp (Gu et al., 2016) learns the action-value function using off-policy data which is used as a control variate for on-policy updates. The authors in Gu et al. (2017) propose the interpolated policy gradient (IPG) which takes a unified view of these algorithms. It directly combines on-policy and off-policy updates using a hyper-parameter and shows that, although such updates may be biased, the bias is bounded.

The key characteristic of the above algorithms is that they use hyper-parameters to combine off-policy data with on-policy data. This is fragile in practice because different environments require different hyper-parameters. Moreover, the ideal hyper-parameters for combining data may change as training progresses; see Fig. 2. For instance, the authors in Oh et al. (2018) report poorer empirical results with ACER and prioritized replay as compared to vanilla actor-critic methods (A2C). The effective sample size (ESS) heuristic in P3O is a completely automatic, parameter-free way of combining off-policy data with on-policy data.

Policy gradient algorithms with off-policy data are not new. The importance sampling ratio has been commonly used by a number of authors, such as Cao (2005) and Levine and Koltun (2013). The effective sample size is popularly used to measure the quality of importance sampling and to restrict the search space for parameter updates (Jie and Abbeel, 2010; Peshkin and Shelton, 2002). We exploit the ESS to a similar end; it is an effective way to control both the contribution of the off-policy data and the deviation of the target policy from the behavior policy. Let us note that there are a number of works that learn action-value functions using off-policy data, e.g., Wang et al. (2013); Hausknecht and Stone (2016); Lehnert and Precup (2015), that achieve varying degrees of success on reinforcement learning benchmarks.

Covariate shift and effective sample size have been studied extensively in the machine learning literature; see Robert and Casella (2013); Quionero-Candela et al. (2009) for an elaborate treatment. These ideas have also been employed in reinforcement learning (Kang et al., 2007; Bang and Robins, 2005; Dudík et al., 2011). To the best of our knowledge, this paper is the first to use ESS for combining on-policy updates with off-policy updates.

6 Discussion

Sample complexity is the key inhibitor to translating the empirical performance of reinforcement learning algorithms from simulation to the real-world. Exploiting past, off-policy data to offset the high sample complexity of on-policy methods may be the key to doing so. Current approaches to combine the two using hyper-parameters are fragile. P3O is a simple, effective algorithm that uses the effective sample size (ESS) to automatically govern this combination. It demonstrates strong empirical performance across a variety of benchmarks. More generally, the discrepancy between the distribution of past data used to fit control variates and the data being gathered by the new policy lies at the heart of modern RL algorithms. The analysis of RL algorithms has not delved into this phenomenon. We believe this to be a promising avenue for future research.

7 Acknowledgements

The authors would like to acknowledge the support of Hang Zhang and Tong He from Amazon Web Services for the open-source implementation of P3O.



Appendix A Hyper-parameters for all experiments

Hyper-parameters Value
Architecture conv (8×8, 32 filters, stride 4)
conv (4×4, 64 filters, stride 2)
conv (3×3, 64 filters, stride 1)
FC (512)
Learning rate
Number of environments 16
Number of steps per iteration 5
Entropy regularization () 0.01
Discount factor ()
Value loss Coefficient
Gradient norm clipping coefficient
Random Seeds
Table 3: A2C hyper-parameters on Atari benchmark
Hyper-parameters Value
Architecture Same as A2C
Replay Buffer size
Learning rate
Number of environments 16
Number of steps per iteration 20
Entropy regularization () 0.01

Number of training epochs per update

Discount factor ()
Value loss Coefficient
importance weight clipping factor
Gradient norm clipping coefficient
Momentum factor in the Polyak
Max. KL between old & updated policy
Use Trust region True
Random Seeds
Table 4: ACER hyper-parameters on Atari benchmark
Hyper-parameters Value
Architecture Same as A2C
Learning rate
Number of environments 8
Number of steps per iteration 128
Entropy regularization () 0.01
Number of training epochs per update 4
Discount factor ()
Value loss Coefficient
Gradient norm clipping coefficient
Advantage estimation discounting factor ()
Random Seeds
Table 5: PPO hyper-parameters on Atari benchmark
Hyper-parameters Value
Architecture Same as A2C
Learning rate
Replay Buffer size
Number of environments 16
Number of steps per iteration 16
Entropy regularization () 0.01
Off policy updates per iteration () Poisson(2)
Burn-in period
Samples from replay buffer
Discount factor ()
Value loss Coefficient
Gradient norm clipping coefficient
Advantage estimation discounting factor ()
Random Seeds
Table 6: P3O hyper-parameters on Atari benchmark
Hyper-parameters Value
Architecture FC(100) - FC(100)
Learning rate
Replay Buffer size
Number of environments 2
Number of steps per iteration 64
Entropy regularization () 0.0
Off policy updates per iteration () Poisson(3)
Burn-in period 2500
Number of samples from replay buffer
Discount factor ()
Value loss Coefficient
Gradient norm clipping coefficient
Advantage estimation discounting factor ()
Random Seeds
Table 7: P3O hyper-parameters for MuJoCo tasks
Hyper-parameters Value
Architecture FC(64) - FC(64)
Learning rate
Number of environments 8
Number of steps per iteration 32
Entropy regularization () 0.0
Discount factor ()
Value loss Coefficient
Gradient norm clipping coefficient
Random Seeds
Table 8: A2C (and A2C with GAE) hyper-parameters on MuJoCo tasks
Hyper-parameters Value
Architecture FC(64) - FC(64)
Learning rate
Number of environments 1
Number of steps per iteration 2048
Entropy regularization () 0.0
Number of training epochs per update 10
Discount factor ()
Value loss Coefficient
Gradient norm clipping coefficient
Advantage estimation discounting factor ()
Random Seeds
Table 9: PPO hyper-parameters on MuJoCo tasks

Appendix B Comparisons with baseline algorithms

Figure 6: Training curves of A2C (blue), A2CG [A2C with GAE] (magenta), PPO (green) and P3O (orange) on 8 MuJoCo environments.
Games A2CG A2C PPO P3O
Half-Cheetah 181.46 1907.42 2022.14 5051.58
Walker 855.62 2015.15 2727.93 3770.86
Hopper 1377.07 1708.22 2245.03 2334.32
Swimmer 33.33 45.27 101.71 116.87
Inverted Double Pendulum 90.09 5510.71 4750.69 8114.05
Inverted Pendulum 733.34 889.61 414.49 985.14
Ant -253.54 1811.29 1615.55 4727.34
Humanoid 530.12 720.38 530.13 2057.17
Table 10: Returns on MuJoCo continuous-control tasks after 3M time-steps of training and 10 random seeds.
Table 11: Returns of agents on 49 Atari-2600 games after 28M timesteps (112M frames) of training.
Figure 7: Training curves of A2C (blue), ACER (red), PPO (green) and P3O (orange) on all 49 Atari games.