1 Introduction
Proximal Policy Optimization (PPO) (Schulman et al., 2017) is one of the most sample-efficient on-policy algorithms. However, it relies on a synchronous architecture for collecting experiences, which is closely tied to its trust region optimization objective. Other architectures such as IMPALA can achieve much higher throughput due to the asynchronous collection of samples from workers. Yet IMPALA suffers from reduced sample efficiency, since it cannot safely take multiple SGD steps per batch as PPO can. The new agent, Importance Weighted Asynchronous Architectures with Clipped Target Networks (IMPACT), mitigates this inherent mismatch. Not only is the algorithm highly sample efficient, it also learns quickly, training 30 percent faster than IMPALA. At the same time, we propose a novel method to stabilize agents in distributed asynchronous setups and, through our ablation studies, show how the agent can learn in both a time- and sample-efficient manner.
In this paper, we show that IMPACT realizes greater gains by striking a balance between high sample throughput and sample efficiency. In our experiments, we demonstrate that IMPACT exceeds state-of-the-art agents in training time (on the same hardware) while maintaining sample efficiency similar to PPO's. The contributions of this paper are as follows:

We show that when collecting experiences asynchronously, introducing a target network allows for a stabilized surrogate objective and multiple SGD steps per batch (Section 3.1).

We show that using a circular buffer for storing asynchronously collected experiences allows for a smooth tradeoff between real-time performance and sample efficiency (Section 3.2).

We show that IMPACT, when evaluated using identical hardware and neural network models, improves in both real-time and timestep efficiency over both synchronous PPO and IMPALA (Section 4).
2 Background
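Reinforcement learning assumes a Markov Decision Process (MDP) setup defined by the tuple $(\mathcal{S}, \mathcal{A}, p, \gamma, r)$, where $\mathcal{S}$ and $\mathcal{A}$ represent the state and action space, $\gamma$ is the discount factor, and $p(s_{t+1} \mid s_t, a_t)$ and $r(s_t, a_t)$ are the transition dynamics and reward function that model an environment.
Let $\pi(a_t \mid s_t)$ denote a stochastic policy mapping that returns an action distribution given state $s_t \in \mathcal{S}$. Rolling out policy $\pi$ in the environment is equivalent to sampling a trajectory $\tau \sim P(\cdot \mid \pi)$, where $\tau := (s_0, a_0, \dots, s_T, a_T)$. We can compactly define the state and state-action marginals of the trajectory distribution, $p_\pi(s_t)$ and $p_\pi(s_t, a_t)$, induced by the policy $\pi$. The goal of reinforcement learning is to maximize the following objective:
$$J(\theta) = \mathbb{E}_{(s_t, a_t) \sim p_\pi} \Big[ \sum_{t=0}^{T} \gamma^t r(s_t, a_t) \Big].$$
When $\theta$ parameterizes $\pi$, the policy is updated according to the Policy Gradient Theorem (Sutton et al., 2000):
$$\nabla_\theta J(\theta) = \mathbb{E}_{(s_t, a_t) \sim p_{\pi_\theta}} \big[ \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, \hat{A}_{\pi_\theta}(s_t, a_t) \big],$$
where $\hat{A}_{\pi_\theta}(s_t, a_t)$ is an estimator of the advantage function. The advantage estimator is usually defined as the 1-step TD error,
$$\hat{A}_{\pi_\theta}(s_t, a_t) = r(s_t, a_t) + \gamma V(s_{t+1}) - V(s_t),$$
where $V(s_t)$ is an estimate of the value function. Policy gradients, however, suffer from high variance and large update-step sizes, oftentimes leading to sudden drops in performance.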
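As a rough illustration of the 1-step TD advantage estimator described above, the following sketch computes it for a single trajectory. The function name `td_advantages`, its argument layout, and the default discount are assumptions made for this example and are not taken from the paper.

```python
def td_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """1-step TD errors r_t + gamma * V(s_{t+1}) - V(s_t) for one trajectory.

    rewards[t] and values[t] refer to step t; bootstrap_value is the value
    estimate for the state after the final step (0.0 if the episode ended).
    All names here are hypothetical and chosen for illustration.
    """
    next_values = list(values[1:]) + [bootstrap_value]
    return [
        r + gamma * v_next - v
        for r, v, v_next in zip(rewards, values, next_values)
    ]
```

Each entry depends only on one reward and two neighboring value estimates, which is what keeps the estimator cheap to compute.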
2.1 Distributed PPO
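Per iteration, Proximal Policy Optimization (PPO) optimizes policy $\pi_\theta$ from target $\pi_{\theta_{\text{old}}}$ via the following objective function:
$$L(\theta) = \mathbb{E}_{(s_t, a_t) \sim p_{\pi_{\theta_{\text{old}}}}} \Big[ \min\big( r_t(\theta) \hat{A}_t, \ \operatorname{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \, \hat{A}_t \big) \Big],$$
where $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ and $\epsilon$ is the clipping hyperparameter. In addition, many PPO implementations use GAE-$\lambda$ as a low-bias, low-variance advantage estimator for $\hat{A}_t$ (Schulman et al., 2015b). PPO's surrogate objective contains the importance sampling ratio $r_t(\theta)$, which can potentially explode if $\pi_{\theta_{\text{old}}}$ is too far from $\pi_\theta$ (Han and Sung, 2017). PPO's surrogate loss mitigates this with the clipping function, which ensures that the agent makes reasonable steps. Alternatively, PPO can also be seen as an adaptive version of the trust region introduced in TRPO (Schulman et al., 2015a).
In Figure 0(a), distributed PPO agents implement a synchronous data-gathering scheme. Before data collection, workers are updated to $\pi_{\theta_{\text{old}}}$; the worker batches are then aggregated into a single training batch. The learner performs many minibatch gradient steps on this training batch. Once the learner is done, learner weights are broadcast to all workers, which start sampling again.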
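The clipped surrogate described in this section can be sketched in a few lines. This is a minimal pure-Python illustration under stated assumptions: the function name `ppo_clip_surrogate`, the use of per-sample log-probabilities as inputs, and the default `eps=0.2` are choices made for this example, not details from the paper.

```python
import math


def ppo_clip_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Mean clipped surrogate objective over a batch (to be maximized).

    logp_new[t] / logp_old[t] are log-probabilities of the sampled action
    under the current and the old policy (hypothetical input layout).
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)                # importance ratio
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))  # clip to [1-eps, 1+eps]
        total += min(ratio * adv, clipped * adv)         # pessimistic bound
    return total / len(advantages)
```

Taking the minimum of the clipped and unclipped terms means a large ratio cannot inflate the objective for a positive advantage, while a bad step with a negative advantage is still fully penalized.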
2.2 Importance Weighted ActorLearner Architectures
In Figure 0(b), IMPALA decouples acting and learning by having the learner threads send actions, observations, and values while the master thread computes and applies the gradients from a queue of learner experience (Espeholt et al., 2018). This maximizes GPU utilization and allows for increased sample throughput, leading to high training speeds on easier environments such as Pong. As the number of learners grows, worker policies begin to diverge from the learner policy, resulting in stale policy gradients. To correct this distributional shift, the IMPALA paper uses V-trace:
$$v_t = V_\phi(s_t) + \sum_{i=t}^{t+n-1} \gamma^{i-t} \Big( \prod_{j=t}^{i-1} c_j \Big) \rho_i \big( r(s_i, a_i) + \gamma V_\phi(s_{i+1}) - V_\phi(s_i) \big),$$
where $V_\phi$ is the value network, $\pi_\theta$ is the policy network of the master thread, $\mu_{\theta'}$ is the policy network of the learner thread, and $c_j = \min\big(\bar{c}, \pi_\theta(a_j \mid s_j) / \mu_{\theta'}(a_j \mid s_j)\big)$ and $\rho_i = \min\big(\bar{\rho}, \pi_\theta(a_i \mid s_i) / \mu_{\theta'}(a_i \mid s_i)\big)$ are clipped IS ratios.
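For readers who prefer code, the V-trace correction can be sketched with a backward recursion over one trajectory. This is an illustrative sketch only: the function name `vtrace_targets`, the list-based inputs, and the thresholds `rho_bar` and `c_bar` defaulting to 1.0 are assumptions of this example, and the trace is run to the end of the trajectory rather than over a fixed horizon.

```python
def vtrace_targets(rewards, values, bootstrap_value, ratios,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """V-trace value targets for one trajectory.

    ratios[t] is the importance ratio between the policy being trained and
    the policy that generated the data, at step t (hypothetical layout).
    """
    num_steps = len(rewards)
    next_values = list(values[1:]) + [bootstrap_value]
    targets = [0.0] * num_steps
    carry = 0.0  # holds v_{t+1} - V(s_{t+1}) while iterating backwards
    for t in reversed(range(num_steps)):
        rho = min(rho_bar, ratios[t])  # clipped ratio on the TD error
        c = min(c_bar, ratios[t])      # clipped ratio on the trace
        delta = rho * (rewards[t] + gamma * next_values[t] - values[t])
        carry = delta + gamma * c * carry
        targets[t] = values[t] + carry
    return targets
```

With all ratios equal to one the targets reduce to ordinary n-step returns, which is a convenient sanity check when wiring this into a learner.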
3 IMPACT Algorithm
Like IMPALA, IMPACT separates sampling workers from learner workers. Algorithm 1 and Figure 0(c) describe the main training loop and architecture of IMPACT. In the beginning, each worker copies weights from the master network. Then, each worker uses its own policy to collect trajectories and sends the data to the circular buffer. Simultaneously, workers asynchronously pull policy weights from the master learner. In the meantime, the target network periodically syncs with the master learner at a fixed iteration interval. The master learner then repeatedly draws experience from the circular buffer. Each sample is weighted by the importance ratio $\pi_\theta / \pi_{\text{worker}_i}$ of the master policy to worker $i$'s policy and clipped with the target network ratio $\pi_{\text{worker}_i} / \pi_{\text{target}}$. The target network is used to provide a stable trust region (Figure 2), allowing multiple steps per batch (i.e., like PPO) even in the asynchronous setting (i.e., with the IMPALA architecture). In the next section, we describe the design of this improved objective.
3.1 Maximal TargetWorker Clipping
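PPO gathers experience from the previous iteration's policy $\pi_{\theta_{\text{old}}}$, and the current policy $\pi_\theta$ trains by importance sampling off-policy experience with respect to $\pi_{\theta_{\text{old}}}$. In the asynchronous setting, worker $i$'s policy, denoted as $\pi_{\text{worker}_i}$, generates experience for the policy network $\pi_\theta$. The probability that a batch comes from worker $i$ can be parameterized as a categorical distribution $i \sim D(\alpha_1, \dots, \alpha_n)$. We include this by adding an extra expectation to the importance-sampled policy gradient objective (ISPG) (Jie and Abbeel, 2010):
$$J_{\text{IS}}(\theta) = \mathbb{E}_{i \sim D(\alpha)} \Big[ \mathbb{E}_{(s_t, a_t) \sim \pi_{\text{worker}_i}} \Big[ \frac{\pi_\theta}{\pi_{\text{worker}_i}} \hat{A}_t \Big] \Big].$$
Since each worker contains a different policy, the agent introduces a target network for stability (Figure 2). Off-policy agents such as DDPG and DQN update target networks with a moving average. For IMPACT, we periodically update the target network with the master network. However, training with the importance-weighted ratio can lead to numerical instability, as shown in Figure 3. To prevent this, we clip the importance sampling ratio from the worker policy, $\pi_{\text{worker}_i}$, to the target policy, $\pi_{\text{target}}$:
$$L(\theta) = \mathbb{E}_{i \sim D(\alpha)} \Big[ \mathbb{E}_{(s_t, a_t) \sim \pi_{\text{worker}_i}} \Big[ \min\Big( \frac{\pi_{\text{worker}_i}}{\pi_{\text{target}}}, \rho \Big) \frac{\pi_\theta}{\pi_{\text{worker}_i}} \hat{A}_t \Big] \Big],$$
where $\rho$ is the clipping threshold. In the experiments, we treat $\rho$ as a hyperparameter (see Appendix B).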
To see why clipping is necessary, consider the case where the master network's action distribution changes significantly over a few training iterations: worker $i$'s policy, $\pi_{\text{worker}_i}$, samples data outside that of the target policy, $\pi_{\text{target}}$, leading to large likelihood ratios, $\pi_{\text{worker}_i} / \pi_{\text{target}}$. The clipping function pulls large IS ratios back to $\rho$. Figure 10 in Appendix E provides additional intuition behind the target clipping objective. We show that the target network clipping is a lower bound of the ISPG objective.
For $\rho > 1$, the clipped target ratio is larger and serves to augment the advantage estimator $\hat{A}_t$. This incentivizes the agent toward good actions while avoiding bad actions. Thus, higher values of $\rho$ encourage the agent to learn faster at the cost of instability.
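The target-worker clipping described in this section can be sketched as follows. This is a hedged illustration, not the authors' implementation: the function name `impact_surrogate` and its log-probability inputs are hypothetical, the defaults `rho=2.0` and `eps=0.3` mirror the discrete-environment values in Appendix B, and the outer PPO-style clip follows the description in Section 4.1.

```python
import math


def impact_surrogate(logp_learner, logp_worker, logp_target, advantages,
                     rho=2.0, eps=0.3):
    """Mean target-clipped surrogate over a batch (to be maximized).

    The three log-probability lists give the sampled action's log-probability
    under the learner, the sampling worker, and the target network.
    """
    total = 0.0
    for lp, lw, lt, adv in zip(logp_learner, logp_worker, logp_target, advantages):
        # Worker-to-target ratio, clipped from above at rho.
        worker_over_target = min(math.exp(lw - lt), rho)
        # Multiply by the learner-to-worker importance ratio.
        ratio = worker_over_target * math.exp(lp - lw)
        # PPO-style clip applied on top of the combined ratio.
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

When the worker-to-target ratio stays below `rho`, the worker terms cancel and the combined ratio is simply learner over target, so the target network acts as the anchor of the trust region; the clip only changes the result for samples the target considers unlikely.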
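We use GAE with V-trace (Han and Sung, 2019). The V-trace GAE modifies the advantage function by adding clipped importance sampling terms to the summation of TD errors:
$$\hat{A}_t = \sum_{i=t}^{t+n-1} (\lambda \gamma)^{i-t} \Big( \prod_{j=t}^{i-1} c_j \Big) \delta_i V,$$
where $c_j$ is the clipped importance sampling ratio (we use the convention $\prod_{j=t}^{t-1} c_j = 1$) and $\delta_i V$ is the importance-sampled 1-step TD error introduced in V-trace.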
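A sketch of a V-trace-style GAE computation is given below to make the summation concrete. It assumes the advantage is the discounted sum of importance-weighted TD errors, with clipped ratios multiplying the trace; the function name `vtrace_gae`, which policies the ratio compares, and the clipping thresholds are all assumptions of this example rather than details confirmed by the text.

```python
def vtrace_gae(rewards, values, bootstrap_value, ratios,
               gamma=0.99, lam=0.995, rho_bar=1.0, c_bar=1.0):
    """GAE with clipped importance-sampling terms, for one trajectory.

    ratios[t] is an importance ratio for step t; which policies it compares
    is left to the caller (hypothetical interface).
    """
    num_steps = len(rewards)
    next_values = list(values[1:]) + [bootstrap_value]
    advantages = [0.0] * num_steps
    carry = 0.0  # advantage of the following step
    for t in reversed(range(num_steps)):
        rho = min(rho_bar, ratios[t])
        c = min(c_bar, ratios[t])
        # Importance-sampled 1-step TD error.
        delta = rho * (rewards[t] + gamma * next_values[t] - values[t])
        # Later TD errors are discounted by gamma * lambda and the clipped ratio.
        carry = delta + gamma * lam * c * carry
        advantages[t] = carry
    return advantages
```

Setting every ratio to one recovers ordinary GAE, and a ratio of zero at some step cuts the trace there, so off-policy samples contribute less to earlier advantages.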
3.2 Circular Buffer
IMPACT uses a circular buffer (Figure 4) to emulate the minibatch SGD used by standard PPO. The circular buffer stores $N$ batches that can each be traversed at most $K$ times. Upon being traversed $K$ times, a batch is discarded and replaced by a new worker batch.
For motivation, the circular buffer and the target network are analogous to minibatching from experience in PPO. When the target network's update frequency is $N \cdot K$, the circular buffer is equivalent to distributed PPO's training batch when the learner samples $N$ minibatches for $K$ SGD iterations.
This is in contrast to standard replay buffers, such as in ACER and Ape-X, where transitions are either uniformly sampled or sampled based on priority, and, when the buffer is full, the oldest transitions are discarded (Wang et al., 2016; Horgan et al., 2018).
Figure 4 illustrates an empirical example where tuning $K$ can increase training sample efficiency and decrease training wall-clock time.
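The buffer behavior described in this section can be sketched as a small class. This is a simplified, single-threaded illustration: the class name `CircularBuffer`, the method names, and the round-robin traversal order are assumptions of this example, and a real asynchronous implementation would also need synchronization between workers and the learner.

```python
class CircularBuffer:
    """Fixed number of slots; each stored batch is served a limited number of times."""

    def __init__(self, num_slots, max_traversals):
        self.slots = [None] * num_slots      # each entry: [batch, reads] or None
        self.max_traversals = max_traversals
        self.cursor = 0

    def add(self, batch):
        """Place a worker batch into a free slot; return False if the buffer is full."""
        for i, slot in enumerate(self.slots):
            if slot is None:
                self.slots[i] = [batch, 0]
                return True
        return False

    def sample(self):
        """Return the next stored batch in round-robin order, or None if empty."""
        n = len(self.slots)
        for _ in range(n):
            idx = self.cursor
            self.cursor = (self.cursor + 1) % n
            slot = self.slots[idx]
            if slot is None:
                continue
            slot[1] += 1
            if slot[1] >= self.max_traversals:
                self.slots[idx] = None       # discarded; slot is free for a new batch
            return slot[0]
        return None
```

For example, with two slots and at most two traversals, batches `"a"` and `"b"` are each served twice before their slots free up for new worker batches. Unlike a standard replay buffer, eviction here depends on how often a batch has been used rather than on its age or priority.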
4 Evaluation
In our evaluation we seek to answer the following questions:

How does the target-clipping objective affect the performance of the agents compared to prior work? (Section 4.1)

How does the IMPACT circular buffer affect sample efficiency and training wall-clock time? (Section 4.2)

How does IMPACT compare to PPO and IMPALA baselines in terms of sample and real-time performance? (Section 4.3)

How does IMPACT scale with respect to the number of workers? (Section 4.4)
4.1 Target Clipping Performance
We investigate the performance of the clipped-target objective relative to prior work, which includes PPO- and ISPG-based objectives. Specifically, we consider the following ratios:
$$R_1 = \frac{\pi_\theta}{\pi_{\text{target}}}, \qquad R_2 = \frac{\pi_\theta}{\pi_{\text{worker}_i}}, \qquad R_3 = \min\Big( \frac{\pi_{\text{worker}_i}}{\pi_{\text{target}}}, \rho \Big) \frac{\pi_\theta}{\pi_{\text{worker}_i}}.$$
For all three experiments, we truncate all three ratios with PPO's clipping function, $\operatorname{clip}(R, 1 - \epsilon, 1 + \epsilon)$, and train in an asynchronous setting. Figure 4(a) reveals two important takeaways: first, $R_1$ suffers from sudden drops in performance midway through training. Next, $R_2$ trains stably but does not achieve good performance.
We theorize that $R_1$ fails due to the target and worker network mismatch. During periods of training where the master learner undergoes drastic changes, worker action outputs vastly differ from the learner outputs, resulting in small action probabilities. This creates large ratios and destabilizes training. We hypothesize that $R_2$ fails due to different workers pushing and pulling the learner in multiple directions. The learner moves forward with the most recent worker's suggestions without developing a proper trust region, resulting in many workers' suggestions conflicting with each other.
The loss function with $R_3$ shows that clipping is necessary and can help facilitate training. By clipping the target-worker ratio, we make sure that the ratio does not explode and destabilize training. Furthermore, we prevent workers from making mutually destructive suggestions by having a target network provide singular guidance.
4.1.1 Target Network Update Frequency
In Section 3.2, an analogy was drawn between PPO's minibatching mechanism and the circular buffer. Our primary benchmark for the target update frequency is $N \cdot K$, where $N$ is the circular buffer size and $K$ is the maximum replay coefficient. This is the case in which PPO is equivalent to IMPACT.
In Figure 4(b), we test update frequencies spanning several orders of magnitude. In general, we find that agent performance is robust to vastly differing frequencies. However, when the target network is updated too frequently, the agent does not learn. Based on empirical results, we theorize that the agent is able to train as long as a stable trust region can be formed. On the other hand, if the update frequency is too low, the agent is stranded for many iterations in the same trust region, which impairs learning speed.
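To make the benchmark frequency concrete, take it to be the product of the circular buffer size and the maximum replay coefficient. With the discrete-environment settings in Appendix B (buffer size 4, 2 SGD iterations), the target network would then sync every 4 x 2 = 8 learner steps. The helper below is a hypothetical illustration; counting the frequency in learner minibatch steps is an assumption of this sketch.

```python
def target_sync_steps(num_learner_steps, buffer_size, replay_coeff):
    """Learner step indices (1-based) at which the target network is synced.

    Assumes the sync period equals buffer_size * replay_coeff, counted in
    learner minibatch steps (an assumption made for this example).
    """
    period = buffer_size * replay_coeff
    return [step for step in range(1, num_learner_steps + 1) if step % period == 0]


# Discrete-environment settings from Appendix B: N = 4, K = 2.
print(target_sync_steps(20, buffer_size=4, replay_coeff=2))
```

Scaling the period up or down from this product is one way to reproduce the kind of frequency sweep discussed above.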
4.2 Time and Sample Efficiency with Circular Buffer
Counter to intuition, increasing the maximum replay coefficient $K$ does not necessarily force a tradeoff between time and sample efficiency. In Figure 3(b) and 3(c), we show that IMPACT realizes greater gains by striking a balance between high sample throughput and sample efficiency. At the best-performing value of $K$ in these figures, IMPACT leads in both time and sample efficiency. Our results reveal that wall-clock time and sample efficiency can both be optimized by tuning $K$ in the circular buffer.
4.3 Comparison with Baselines


We investigate how IMPACT attains greater performance in wall-clock time and sample efficiency compared with PPO and IMPALA across six different continuous control and discrete action tasks.
We tested the agent on three continuous environments (Figure 5): HalfCheetah, Hopper, and Humanoid, on 16 CPUs and 1 GPU. The policy networks consist of two fully-connected layers of 256 units with tanh activations. The critic network shares the same architecture as the policy network. For consistency, the same network architectures were employed across PPO, IMPALA, and IMPACT.
For the discrete environments (Figure 6), Pong, SpaceInvaders, and Breakout were chosen as common benchmarks used in popular distributed RL libraries (Caspi et al., 2017; Liang et al., 2018). Additional experiments for discrete environments are in the Appendix. These experiments were run on 32 CPUs and 1 GPU. The policy network consists of three 4x4 and one 11x11 conv layer, with ReLU activations. The critic network shares weights with the policy network. The input of the network is a stack of four 42x42 downsampled images of the Atari environment. The hyperparameters for continuous and discrete environments are listed in Appendix B, Tables 1 and 2, respectively.
Figures 5 and 6 show the total average return on evaluation rollouts for IMPACT, IMPALA, and PPO. We train each algorithm with three different random seeds on each environment for a total time of three hours. According to the experiments, IMPACT is able to train much faster than PPO and IMPALA in both discrete and continuous domains, while preserving the same or better sample efficiency than PPO.
Our results reveal that continuous control tasks for IMPACT are sensitive to the tuple $(N, K)$ of buffer size and replay coefficient for the circular buffer; the values used for continuous control are listed in Appendix B. Although higher $K$ inhibits workers' sample throughput, increased sample efficiency from replaying experiences results in an overall reduction in training wall-clock time and higher reward. The values used for discrete tasks are likewise listed in Appendix B. Empirically, agents learn faster from new experience than from replaying old experience, showing how exploration is crucial to achieving high asymptotic performance in discrete environments.


4.4 IMPACT Scalability
Figure 7 shows how IMPACT's performance scales relative to the number of workers. More workers means increased sample throughput, which in turn increases training throughput (the rate at which the learner consumes batches). With the learner consuming more worker data per second, IMPACT can attain better performance in less time. However, as the number of workers increases, the observed gains in performance begin to diminish.
5 Related Work
Distributed RL architectures are often used to accelerate training. Gorila (Nair et al., 2015) and A3C (Mnih et al., 2016) use workers to compute gradients to be sent to the learner. A2C (Mnih et al., 2016) and IMPALA (Espeholt et al., 2018) send experience tuples to the learner. Distributed replay buffers, introduced in ACER (Wang et al., 2016) and Ape-X (Horgan et al., 2018), gather worker-collected experience and define an overarching heuristic for learner batch selection. IMPACT is the first to fully incorporate the sample-efficiency benefits of PPO in an asynchronous setting.
Surreal PPO (Fan et al., 2018) also studies training with PPO in the asynchronous setting, but does not consider adaptation of the surrogate objective or IS correction. Its use of a target network for broadcasting weights to workers is also entirely different from IMPACT's. Consequently, IMPACT is able to achieve better results in both real-time and sample efficiency.
Off-policy methods, including DDPG and Q-Prop, utilize target networks to stabilize learning the Q function (Lillicrap et al., 2015; Gu et al., 2016). This use of a target network is related to but different from IMPACT's, which uses the network to define a stable trust region for the PPO surrogate objective.
6 Conclusion
In conclusion, we introduce IMPACT, which extends PPO with a stabilized surrogate objective for asynchronous optimization, enabling greater real-time performance without sacrificing timestep efficiency. We show the importance of the IMPACT objective to stable training, and show that it can outperform tuned PPO and IMPALA baselines in both real-time and timestep metrics.
References
Achiam (2018). OpenAI Spinning Up. GitHub. https://spinningup.openai.com/en/latest/spinningup/bench.html
Caspi et al. (2017). Reinforcement Learning Coach.
Espeholt et al. (2018). IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures. arXiv preprint arXiv:1802.01561.
Fan et al. (2018). SURREAL: Open-Source Reinforcement Learning Framework and Robot Manipulation Benchmark. In Conference on Robot Learning, pp. 767–782.
Gu et al. (2016). Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic. arXiv preprint arXiv:1611.02247.
Han and Sung (2017). AMBER: Adaptive Multi-Batch Experience Replay for Continuous Action Control. arXiv preprint arXiv:1710.04423.
Han and Sung (2019). Dimension-Wise Importance Sampling Weight Clipping for Sample-Efficient Reinforcement Learning. arXiv preprint arXiv:1905.02363.
Horgan et al. (2018). Distributed Prioritized Experience Replay. arXiv preprint arXiv:1803.00933.
Jie and Abbeel (2010). On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient. pp. 1000–1008.
Liang et al. (2018). RLlib: Abstractions for Distributed Reinforcement Learning. In International Conference on Machine Learning (ICML).
Lillicrap et al. (2015). Continuous Control with Deep Reinforcement Learning. arXiv preprint arXiv:1509.02971.
Mnih et al. (2016). Asynchronous Methods for Deep Reinforcement Learning. In International Conference on Machine Learning, pp. 1928–1937.
Nair et al. (2015). Massively Parallel Methods for Deep Reinforcement Learning. arXiv preprint arXiv:1507.04296.
Schulman et al. (2015a). Trust Region Policy Optimization. In International Conference on Machine Learning, pp. 1889–1897.
Schulman et al. (2015b). High-Dimensional Continuous Control Using Generalized Advantage Estimation. arXiv preprint arXiv:1506.02438.
Schulman et al. (2017). Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.
Sutton et al. (2000). Policy Gradient Methods for Reinforcement Learning with Function Approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063.
Tassa et al. (2018). DeepMind Control Suite. arXiv preprint arXiv:1801.00690.
Wang et al. (2016). Sample Efficient Actor-Critic with Experience Replay. arXiv preprint arXiv:1611.01224.
Appendix A Additional Experiments


Appendix B Hyperparameters for All Environments
B.1 Discrete Environments
| Hyperparameters | IMPACT | IMPALA | PPO |
|---|---|---|---|
| Clip Parameter | 0.3 | - | 0.1 |
| Entropy Coeff | 0.01 | 0.01 | 0.01 |
| Grad Clipping | 10.0 | 40.0 | - |
| Discount ($\gamma$) | 0.99 | 0.99 | 0.99 |
| Lambda ($\lambda$) | 0.995 | - | 0.995 |
| Learning Rate | | | |
| Minibatch Buffer Size (N) | 4 | - | - |
| Num SGD Iterations (K) | 2 | - | 2 |
| Sample Batch Size | 50 | 50 | 100 |
| Train Batch Size | 500 | 500 | 5000 |
| SGD Minibatch Size | - | - | 500 |
| KL Coeff | 0.0 | - | 0.5 |
| KL Target | 0.01 | - | 0.01 |
| Value Function Coeff | 1.0 | 0.5 | 1.0 |
| Target-Worker Clipping ($\rho$) | 2.0 | - | - |
B.2 Continuous Environments
| Hyperparameters | IMPACT | IMPALA | PPO |
|---|---|---|---|
| Clip Parameter | 0.4 | - | 0.3 |
| Entropy Coeff | 0.0 | 0.0 | 0.0 |
| Grad Clipping | 0.5 | 0.5 | - |
| Discount ($\gamma$) | 0.995 | 0.995 | 0.99 |
| Lambda ($\lambda$) | 0.995 | - | 0.995 |
| Learning Rate | | | |
| Minibatch Buffer Size (N) | 16 | - | - |
| Num SGD Iterations (K) [1] | 20 | - | 20 |
| Sample Batch Size | 1024 | 1024 | 1024 |
| Train Batch Size | 32768 | 32768 | 163840 |
| SGD Minibatch Size | - | - | 32768 |
| KL Coeff | 1.0 | - | 1.0 |
| KL Target | 0.04 | - | 0.01 |
| Value Function Coeff [2] | 1.0 | 0.5 | 1.0 |
| Target-Worker Clipping ($\rho$) | 2.0 | - | - |

[1] For HalfCheetah-v2, IMPACT and PPO Num SGD Iterations (K) is 32.
[2] For HalfCheetah-v2, IMPACT Value Function Coeff is 0.5.
B.3 Hyperparameter Budget
Listed below is the grid search we used for each algorithm to obtain optimal hyperparameters. Optimal values were found by grid searching on each hyperparameter separately. We found that IMPACT's optimal hyperparameter values tend to hover close to either IMPALA's or PPO's, which greatly reduced IMPACT's tuning budget.
B.3.1 Discrete Environment Search
B.4 Continuous Environment Search
Appendix C IMPALA to IMPACT
In Figure 9, we gradually add components to IMPALA until the agent is equivalent to IMPACT. Starting from IMPALA, we add PPO's objective function, the circular replay buffer, and target-worker clipping. In particular, IMPALA with PPO's objective function and circular replay buffer is equivalent to an asynchronous variant of PPO (APPO). APPO fails to perform as well as synchronous distributed PPO, since PPO is an on-policy algorithm.
Appendix D IMPALA in Continuous Environments
In Figure 6, IMPALA performs substantially worse than other agents in continuous environments. We postulate that IMPALA suffers from low asymptotic performance here since its objective is an importance-sampled version of the Vanilla Policy Gradient (VPG) objective, which is known to suffer from high variance and large update-step sizes. We found that for VPG, higher learning rates encourage faster learning in the beginning, but performance drops to negative return later in training. In Appendix B, for IMPALA, we heavily tuned the learning rate, finding that small learning rates stabilize learning at the cost of low asymptotic performance. Prior work also reveals that agents using VPG fail to attain good performance in non-trivial continuous tasks (Achiam, 2018). Our results with IMPALA reach performance similar to that of other VPG-based algorithms. The closest neighbor to IMPALA, A3C, uses workers to compute gradients from the VPG objective to send to the learner thread. A3C performs well in InvertedPendulum yet flounders in continuous environments (Tassa et al., 2018).
Appendix E The intuition of the objective
The following ratios represent the objective functions for different ablation studies. In the plots (Figure 10), we set the advantage function to be one, i.e. $\hat{A}_t = 1$.

IS ratio:

IMPACT target:

PPO clip:

IMPACT target clip:
According to Figure 10, the IS ratio is large when the worker policy assigns low probability. The IMPACT target clip is a lower bound of the PPO clip. In a distributed asynchronous setting, the trust region suffers from larger variance stemming from off-policy data. The IMPACT target clip ratio mitigates this by encouraging conservative and reasonable policy-gradient steps.