Proximal Policy Optimization (Schulman et al., 2017) is one of the most sample-efficient on-policy algorithms. However, it relies on a synchronous architecture for collecting experiences, which is closely tied to its trust region optimization objective. Other architectures such as IMPALA can achieve much higher throughputs due to the asynchronous collection of samples from workers. Yet, IMPALA suffers from reduced sample efficiency since it cannot safely take multiple SGD steps per batch as PPO can. The new agent, Importance Weighted Asynchronous Architectures with Clipped Target Networks (IMPACT), mitigates this inherent mismatch. Not only is the algorithm highly sample efficient, it can learn quickly, training 30 percent faster than IMPALA. At the same time, we propose a novel method to stabilize agents in distributed asynchronous setups and, through our ablation studies, show how the agent can learn in both a time and sample efficient manner.
In this paper, we show that IMPACT realizes greater gains by striking a balance between high sample throughput and sample efficiency. Our experiments demonstrate that IMPACT exceeds state-of-the-art agents in training time (on the same hardware) while maintaining sample efficiency similar to PPO’s. The contributions of this paper are as follows:
We show that when collecting experiences asynchronously, introducing a target network allows for a stabilized surrogate objective and multiple SGD steps per batch (Section 3.1).
We show that using a circular buffer for storing asynchronously collected experiences allows for smooth trade-off between real-time performance and sample efficiency (Section 3.2).
We show that IMPACT, when evaluated using identical hardware and neural network models, improves both in real-time and timestep efficiency over both synchronous PPO and IMPALA (Section 4).
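Reinforcement Learning assumes a Markov Decision Process (MDP) setup defined by the tuple $(\mathcal{S}, \mathcal{A}, p, \gamma, r)$, where $\mathcal{S}$ and $\mathcal{A}$ represent the state and action spaces, $\gamma$ is the discount factor, and $p: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ and $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ are the transition dynamics and reward function that model an environment.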
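Let $\pi(a|s): \mathcal{S} \times \mathcal{A} \to [0, 1]$ denote a stochastic policy mapping that returns an action distribution given state $s \in \mathcal{S}$. Rolling out policy $\pi(a|s)$ in the environment is equivalent to sampling a trajectory $\tau \sim P(\cdot|\pi)$, where $\tau := (s_0, a_0, \ldots, a_{T-1}, s_T, a_T)$. We can compactly define the state and state-action marginals of the trajectory distribution, $p_\pi(s_t)$ and $p_\pi(s_t, a_t)$, induced by the policy $\pi(a|s)$. The goal of reinforcement learning is to maximize the following objective: $J(\pi) = \mathbb{E}_{(s_t, a_t) \sim p_\pi}\left[\sum_{t=0}^{T} \gamma^t r(s_t, a_t)\right]$.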
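When $\theta$ parameterizes $\pi(a|s)$, the policy is updated according to the Policy Gradient Theorem (Sutton et al., 2000):

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{(s_t, a_t) \sim p_{\pi_\theta}}\left[\nabla_\theta \log \pi_\theta(a_t|s_t)\, \hat{A}_t\right],$$

where $\hat{A}_t$ is an estimator of the advantage function. The advantage estimator is usually defined as the 1-step TD error, $\hat{A}_t = r(s_t, a_t) + \gamma \hat{V}(s_{t+1}) - \hat{V}(s_t)$, where $\hat{V}(s)$ is an estimate of the value function. Policy gradients, however, suffer from high variance and large update-step sizes, often leading to sudden drops in performance.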
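As a small illustration of the 1-step TD advantage described above, the following NumPy sketch computes $r_t + \gamma V(s_{t+1}) - V(s_t)$ along one trajectory. The function name `td_advantages` and the toy numbers are illustrative assumptions, not part of the paper.

```python
import numpy as np

def td_advantages(rewards, values, gamma=0.99):
    """1-step TD error: A_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    `values` holds V(s_0), ..., V(s_T), i.e. one more entry than `rewards`.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    return rewards + gamma * values[1:] - values[:-1]

# Toy trajectory with three transitions (hypothetical numbers).
adv = td_advantages(rewards=[1.0, 0.0, 2.0],
                    values=[0.5, 1.0, 0.0, 0.0],
                    gamma=0.9)
```

A positive entry means the sampled action did better than the value estimate predicted; the policy gradient then increases that action's log-probability.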
2.1 Distributed PPO
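Per iteration, Proximal Policy Optimization (PPO) optimizes policy $\pi_\theta$ from target $\pi_{\theta_{old}}$ via the following objective function:

$$L^{PPO}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \text{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right)\hat{A}_t\right)\right],$$

where $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$ and $\epsilon$ is the clipping hyperparameter. In addition, many PPO implementations use GAE-$\lambda$ as a low bias, low variance advantage estimator for $\hat{A}_t$ (Schulman et al., 2015b). PPO’s surrogate objective contains the importance sampling ratio $r_t(\theta)$, which can potentially explode if $\pi_{\theta_{old}}$ is too far from $\pi_\theta$ (Han and Sung, 2017). PPO’s surrogate loss mitigates this with the clipping function, which ensures that the agent makes reasonable steps. Alternatively, PPO can also be seen as an adaptive version of the trust region introduced in TRPO (Schulman et al., 2015a).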
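The clipped surrogate can be sketched in a few lines of NumPy. This is a minimal illustration that assumes per-sample log-probabilities are already available; `ppo_clip_objective` and the example values are hypothetical names and numbers, not taken from the paper.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Mean PPO clipped surrogate (to be maximized)."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    advantages = np.asarray(advantages, dtype=np.float64)
    # Element-wise minimum of the unclipped and clipped terms.
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))

# Three samples with ratios 1.5, 0.5 and 1.0 (old log-probs set to zero).
ratios = np.array([1.5, 0.5, 1.0])
obj = ppo_clip_objective(logp_new=np.log(ratios),
                         logp_old=np.zeros(3),
                         advantages=[1.0, 1.0, -1.0],
                         eps=0.2)
```

With a positive advantage, the ratio 1.5 is capped at 1.2, so there is no incentive to move the policy further in that direction; the ratio 0.5 is left unclipped because the minimum picks the more pessimistic term.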
In Figure 0(a), distributed PPO agents implement a synchronous data-gathering scheme. Before data collection, workers are updated to $\pi_{\theta_{old}}$, and worker batches are aggregated into a training batch $D_{train}$. The learner performs many mini-batch gradient steps on $D_{train}$. Once the learner is done, learner weights are broadcast to all workers, who start sampling again.
2.2 Importance Weighted Actor-Learner Architectures
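In Figure 0(b), IMPALA decouples acting and learning by having the learner threads send actions, observations, and values while the master thread computes and applies the gradients from a queue of learners’ experience (Espeholt et al., 2018). This maximizes GPU utilization and allows for increased sample throughput, leading to high training speeds on easier environments such as Pong. As the number of learners grows, worker policies begin to diverge from the learner policy, resulting in stale policy gradients. To correct this, the IMPALA paper utilizes V-trace to correct the distributional shift:

$$v_t = V_\phi(s_t) + \sum_{i=t}^{t+n-1} \gamma^{i-t}\left(\prod_{j=t}^{i-1} c_j\right)\delta_i V, \qquad \delta_i V = \rho_i\left(r_i + \gamma V_\phi(s_{i+1}) - V_\phi(s_i)\right),$$

where $V_\phi$ is the value network, $\pi_\theta$ is the policy network of the master thread, $\pi_{\theta'}$ is the policy network of the learner thread, and $c_j = \min\left(\bar{c}, \frac{\pi_\theta(a_j|s_j)}{\pi_{\theta'}(a_j|s_j)}\right)$ and $\rho_i = \min\left(\bar{\rho}, \frac{\pi_\theta(a_i|s_i)}{\pi_{\theta'}(a_i|s_i)}\right)$ are clipped IS ratios.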
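The V-trace correction can be computed with a single backward pass over a trajectory. The sketch below follows the standard V-trace recursion from Espeholt et al. (2018); the function name, argument layout, and toy inputs are assumptions made for illustration.

```python
import numpy as np

def vtrace_targets(rewards, values, bootstrap_value, log_rhos,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """V-trace value targets v_t for one trajectory of length n.

    log_rhos[t] = log pi(a_t|s_t) - log mu(a_t|s_t), where pi is the policy
    being trained and mu is the behaviour policy that generated the data.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    is_ratios = np.exp(np.asarray(log_rhos, dtype=np.float64))
    rhos = np.minimum(rho_bar, is_ratios)  # clipped rho_t
    cs = np.minimum(c_bar, is_ratios)      # clipped c_t
    next_values = np.append(values[1:], bootstrap_value)
    deltas = rhos * (rewards + gamma * next_values - values)

    # Backward recursion:
    # v_t - V(s_t) = delta_t + gamma * c_t * (v_{t+1} - V(s_{t+1})).
    corrections = np.zeros_like(values)
    acc = 0.0
    for t in reversed(range(len(rewards))):
        acc = deltas[t] + gamma * cs[t] * acc
        corrections[t] = acc
    return values + corrections

# On-policy data (all log-ratios zero) recovers the n-step bootstrapped return.
targets = vtrace_targets(rewards=[1.0, 1.0, 1.0], values=[0.2, 0.4, 0.6],
                         bootstrap_value=0.5, log_rhos=[0.0, 0.0, 0.0],
                         gamma=0.9)
```

When the behaviour policy assigns a much higher probability to an action than the trained policy does, the ratio drops below one and the corresponding TD error is down-weighted.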
3 IMPACT Algorithm
Like IMPALA, IMPACT separates sampling workers from learner workers. Algorithm 1 and Figure 0(c) describe the main training loop and architecture of IMPACT. In the beginning, each worker copies weights from the master network. Then, each worker uses its own policy to collect trajectories and sends the data to the circular buffer. Simultaneously, workers also asynchronously pull policy weights from the master learner. In the meantime, the target network syncs with the master learner every $t_{target}$ iterations. The master learner then repeatedly draws experience from the circular buffer. Each sample is weighted by the importance ratio $\frac{\pi_\theta}{\pi_{worker_i}}$ and clipped with the target network ratio $\frac{\pi_{worker_i}}{\pi_{target}}$. The target network is used to provide a stable trust region (Figure 2), allowing multiple steps per batch (i.e., like PPO) even in the asynchronous setting (i.e., with the IMPALA architecture). In the next section, we describe the design of this improved objective.
3.1 Maximal Target-Worker Clipping
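PPO gathers experience from the previous iteration’s policy $\pi_{\theta_{old}}$, and the current policy $\pi_\theta$ trains by importance sampling off-policy experience with respect to $\pi_{\theta_{old}}$. In the asynchronous setting, worker $i$’s policy, denoted as $\pi_{worker_i}$, generates experience for the policy network $\pi_\theta$. The probability that batch $B$ comes from worker $i$ can be parameterized as a categorical distribution $i \sim D(\alpha_1, \ldots, \alpha_n)$. We include this by adding an extra expectation to the importance-sampled policy gradient objective (IS-PG) (Jie and Abbeel, 2010):

$$J_{IS}(\theta) = \mathbb{E}_{i \sim D(\alpha)}\left[\mathbb{E}_{(s_t, a_t) \sim \pi_{worker_i}}\left[\frac{\pi_\theta}{\pi_{worker_i}}\hat{A}_t\right]\right].$$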
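Since each worker contains a different policy, the agent introduces a target network for stability (Figure 2). Off-policy agents such as DDPG and DQN update target networks with a moving average. For IMPACT, we periodically update the target network with the master network. However, training with the importance weighted ratio can lead to numerical instability, as shown in Figure 3. To prevent this, we clip the importance sampling ratio from worker policy, $\pi_{worker_i}$, to target policy, $\pi_{target}$:

$$J(\theta) = \mathbb{E}_{i \sim D(\alpha)}\left[\mathbb{E}_{(s_t, a_t) \sim \pi_{worker_i}}\left[\min\left(\frac{\pi_{worker_i}}{\pi_{target}}, \rho\right)\frac{\pi_\theta}{\pi_{worker_i}}\hat{A}_t\right]\right] = \mathbb{E}_{i \sim D(\alpha)}\left[\mathbb{E}_{(s_t, a_t) \sim \pi_{worker_i}}\left[\frac{\pi_\theta}{\max\left(\pi_{target}, \beta\pi_{worker_i}\right)}\hat{A}_t\right]\right],$$

where $\beta = \frac{1}{\rho}$. In the experiments, we set $\rho$ as a hyperparameter with $\rho \geq 1$ and $\beta \leq 1$.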
To see why clipping is necessary, consider the case where the master network’s action distribution changes significantly over a few training iterations: worker $i$’s policy, $\pi_{worker_i}$, samples data outside that of the target policy, $\pi_{target}$, leading to large likelihood ratios, $\frac{\pi_{worker_i}}{\pi_{target}}$. The clipping function pulls back large IS ratios to $\rho$. Figure 10 in Appendix E provides additional intuition behind the target clipping objective. We show that the target network clipping is a lower bound of the IS-PG objective.
For $\rho > 1$, the clipped target ratio is larger and serves to augment advantage estimator $\hat{A}_t$. This incentivizes the agent toward good actions while avoiding bad actions. Thus, higher values of $\rho$ encourage the agent to learn faster at the cost of instability.
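To make the clipping concrete, the sketch below evaluates the target-worker clipped ratio on two hand-picked samples. It assumes the ratio takes the form $\min\left(\frac{\pi_{worker_i}}{\pi_{target}}, \rho\right)\frac{\pi_\theta}{\pi_{worker_i}}$, which is one reading of the text above; the function names and probabilities are hypothetical.

```python
import numpy as np

def impact_ratio(p_theta, p_worker, p_target, rho=2.0):
    """min(p_worker / p_target, rho) * p_theta / p_worker, element-wise."""
    p_theta = np.asarray(p_theta, dtype=np.float64)
    p_worker = np.asarray(p_worker, dtype=np.float64)
    p_target = np.asarray(p_target, dtype=np.float64)
    target_worker = np.minimum(p_worker / p_target, rho)
    return target_worker * p_theta / p_worker

def impact_ratio_closed_form(p_theta, p_worker, p_target, rho=2.0):
    """Algebraically equivalent form: p_theta / max(p_target, p_worker / rho)."""
    p_theta = np.asarray(p_theta, dtype=np.float64)
    p_worker = np.asarray(p_worker, dtype=np.float64)
    p_target = np.asarray(p_target, dtype=np.float64)
    return p_theta / np.maximum(p_target, p_worker / rho)

# Sample 0: the target assigns tiny probability to the worker's action.
# Sample 1: target and worker roughly agree, so no clipping occurs.
p_theta = np.array([0.4, 0.6])
p_worker = np.array([0.5, 0.5])
p_target = np.array([0.01, 0.4])

clipped = impact_ratio(p_theta, p_worker, p_target, rho=2.0)
unclipped = p_theta / p_target  # ratio against the target without clipping
```

In the first sample the unclipped ratio against the target is 40, while the clipped version stays at 1.6; in the second sample both coincide at 1.5.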
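We use GAE-$\lambda$ with V-trace (Han and Sung, 2019). The V-trace GAE-$\lambda$ modifies the advantage function by adding clipped importance sampling terms to the summation of TD errors:

$$\hat{A}_{V\text{-}GAE} = \sum_{i=t}^{t+n-1} (\lambda\gamma)^{i-t}\left(\prod_{j=t}^{i-1} c_j\right)\delta_i V,$$

where $c_j = \min\left(\bar{c}, \frac{\pi_{target}(a_j|s_j)}{\pi_{worker_i}(a_j|s_j)}\right)$ (we use the convention $\prod_{j=t}^{t-1} c_j = 1$) and $\delta_i V$ is the importance sampled 1-step TD error introduced in V-trace.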
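A minimal sketch of a V-trace-weighted GAE-$\lambda$ estimator is shown below, using the recursion $\hat{A}_t = \delta_t V + \gamma\lambda c_t \hat{A}_{t+1}$. The function name `vtrace_gae`, the clipping thresholds, and the toy inputs are assumptions for illustration, and which policies form the ratio is left to the caller.

```python
import numpy as np

def vtrace_gae(rewards, values, bootstrap_value, is_ratios,
               gamma=0.99, lam=0.95, rho_bar=1.0, c_bar=1.0):
    """GAE-lambda with clipped importance weights on each TD error."""
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    is_ratios = np.asarray(is_ratios, dtype=np.float64)
    rhos = np.minimum(rho_bar, is_ratios)  # weights the 1-step TD error
    cs = np.minimum(c_bar, is_ratios)      # weights the trace
    next_values = np.append(values[1:], bootstrap_value)
    deltas = rhos * (rewards + gamma * next_values - values)

    advantages = np.zeros_like(values)
    acc = 0.0
    for t in reversed(range(len(rewards))):
        # An empty product of c's equals 1, so the first term is delta_t itself.
        acc = deltas[t] + gamma * lam * cs[t] * acc
        advantages[t] = acc
    return advantages

rewards, values = [1.0, 1.0, 1.0], [0.2, 0.4, 0.6]
adv_td = vtrace_gae(rewards, values, 0.5, [1.0, 1.0, 1.0], gamma=0.9, lam=0.0)
adv_mc = vtrace_gae(rewards, values, 0.5, [1.0, 1.0, 1.0], gamma=0.9, lam=1.0)
adv_off = vtrace_gae(rewards, values, 0.5, [0.5, 1.0, 1.0], gamma=0.9, lam=1.0)
```

With unit ratios this reduces to ordinary GAE-$\lambda$: `lam=0.0` gives the 1-step TD errors and `lam=1.0` gives the n-step return minus the value baseline.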
3.2 Circular Buffer
IMPACT uses a circular buffer (Figure 4) to emulate the mini-batch SGD used by standard PPO. The circular buffer stores $N$ batches, each of which can be traversed at most $K$ times. Upon being traversed $K$ times, a batch is discarded and replaced by a new worker batch.
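The buffer behaviour described above can be sketched as a fixed ring of slots with a per-slot traversal counter. This is an illustrative single-threaded sketch, not the authors' implementation; the class name `CircularBuffer` and its `put`/`get` methods are hypothetical.

```python
class CircularBuffer:
    """Ring of `num_slots` batches, each readable at most `max_traversals` times."""

    def __init__(self, num_slots, max_traversals):
        self.batches = [None] * num_slots   # None marks a free slot
        self.counts = [0] * num_slots       # traversals so far per slot
        self.max_traversals = max_traversals
        self.read_idx = 0

    def put(self, batch):
        """Store a new worker batch in the first free slot; False if full."""
        for i, existing in enumerate(self.batches):
            if existing is None:
                self.batches[i] = batch
                self.counts[i] = 0
                return True
        return False

    def get(self):
        """Return the next stored batch in ring order, or None if empty."""
        n = len(self.batches)
        for offset in range(n):
            i = (self.read_idx + offset) % n
            if self.batches[i] is not None:
                batch = self.batches[i]
                self.counts[i] += 1
                if self.counts[i] >= self.max_traversals:
                    self.batches[i] = None  # discard after the last traversal
                self.read_idx = (i + 1) % n
                return batch
        return None

# Two slots, each batch replayed twice before it is dropped.
buf = CircularBuffer(num_slots=2, max_traversals=2)
buf.put("batch_a")
buf.put("batch_b")
order = [buf.get() for _ in range(5)]
```

In a real asynchronous setup the buffer would need locking or a queue in front of it, since workers call `put` while the learner calls `get`.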
For motivation, the circular buffer and the target network are analogous to mini-batching from experience in PPO. When the target network’s update frequency is $t_{target} = N \cdot K$, the circular buffer is equivalent to distributed PPO’s training batch when the learner samples $N$ minibatches for $K$ SGD iterations.
This is in contrast to standard replay buffers, such as in ACER and APE-X, where transitions are either uniformly sampled or sampled based on priority, and, when the buffer is full, the oldest transitions are discarded (Wang et al., 2016; Horgan et al., 2018).
Figure 4 illustrates an empirical example where tuning $K$ can increase training sample efficiency and decrease training wall-clock time.
In our evaluation we seek to answer the following questions:
How does the target-clipping objective affect the performance of the agents compared to prior work? (Section 4.1)
How does the IMPACT circular buffer affect sample efficiency and training wall-clock time? (Section 4.2)
How does IMPACT compare to PPO and IMPALA baselines in terms of sample and real-time performance? (Section 4.3)
How does IMPACT scale with respect to the number of workers? (Section 4.4)
4.1 Target Clipping Performance
We investigate the performance of the clipped-target objective relative to prior work, which includes PPO and IS-PG based objectives. Specifically, we consider the following ratios:

$$R_1 = \frac{\pi_\theta}{\pi_{target}}, \qquad R_2 = \frac{\pi_\theta}{\pi_{worker_i}}, \qquad R_3 = \frac{\pi_\theta}{\max\left(\pi_{target}, \beta\pi_{worker_i}\right)}.$$

For all three experiments, we truncate all three ratios with PPO’s clipping function, $c(R) = \text{clip}(R, 1-\epsilon, 1+\epsilon)$, and train in an asynchronous setting. Figure 4(a) reveals two important takeaways: first, $R_1$ suffers from sudden drops in performance midway through training. Next, $R_2$ trains stably but does not achieve good performance.
We theorize that $R_1$ fails due to the target and worker network mismatch. During periods of training where the master learner undergoes drastic changes, worker action outputs vastly differ from the learner outputs, resulting in small action probabilities. This creates large ratios and destabilizes training. We hypothesize that $R_2$ fails due to different workers pushing and pulling the learner in multiple directions. The learner moves forward with the most recent worker’s suggestions without developing a proper trust region, resulting in many workers’ suggestions conflicting with each other.
The loss function with $R_3$ shows that clipping is necessary and can help facilitate training. By clipping the target-worker ratio, we make sure that the ratio does not explode and destabilize training. Furthermore, we prevent workers from making mutually destructive suggestions by having a target network provide singular guidance.
4.1.1 Target Network Update Frequency
In Section 3.2, an analogy was drawn between PPO’s mini-batching mechanism and the circular buffer. Our primary benchmark for target update frequency is $t_{target} = N \cdot K$, where $N$ is the circular buffer size and $K$ is the maximum replay coefficient. This is the case when PPO is equivalent to IMPACT.
In Figure 4(b), we test the frequency of updates with varying orders of magnitude of $t_{target}$. In general, we find that agent performance is robust to vastly differing frequencies. However, when , the agent does not learn. Based on empirical results, we theorize that the agent is able to train as long as a stable trust region can be formed. On the other hand, if the update frequency is too low, the agent is stranded for many iterations in the same trust region, which impairs learning speed.
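The periodic target update discussed above can be sketched as a hard copy of the learner weights every fixed number of learner steps. This is an illustrative sketch only: the class `TargetNetwork`, the dictionary-of-arrays weight format, and the period of $N \cdot K = 4$ are assumptions chosen for the example.

```python
import numpy as np

class TargetNetwork:
    """Holds a frozen copy of learner weights, refreshed every `update_period` steps."""

    def __init__(self, learner_weights, update_period):
        self.weights = {k: v.copy() for k, v in learner_weights.items()}
        self.update_period = update_period
        self.steps_since_sync = 0

    def maybe_sync(self, learner_weights):
        """Call once per learner step; returns True when a sync happened."""
        self.steps_since_sync += 1
        if self.steps_since_sync >= self.update_period:
            # Hard update: copy the weights instead of using a moving average.
            self.weights = {k: v.copy() for k, v in learner_weights.items()}
            self.steps_since_sync = 0
            return True
        return False

# Hypothetical setting: buffer size N = 2 and replay coefficient K = 2.
N, K = 2, 2
learner = {"w": np.zeros(2)}
target = TargetNetwork(learner, update_period=N * K)

sync_steps = []
for step in range(1, 9):
    learner["w"] = learner["w"] + 1.0  # stand-in for one SGD step
    if target.maybe_sync(learner):
        sync_steps.append(step)
```

Between syncs the target weights stay fixed, so every learner step in that window is measured against the same reference policy.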
4.2 Time and Sample Efficiency with Circular Buffer
Counterintuitively, increasing $K$ does not necessarily trade wall-clock time for sample efficiency. In Figure 3(b) and 3(c), we show that IMPACT realizes greater gains by striking a balance between high sample throughput and sample efficiency. When , IMPACT performs the best in both time and sample efficiency. Our results reveal that wall-clock time and sample efficiency can be optimized by tuning the value of $K$ in the circular buffer.
4.3 Comparison with Baselines
We investigate how IMPACT attains greater performance in wall-clock time and sample efficiency compared with PPO and IMPALA across six different continuous control and discrete action tasks.
We tested the agent on three continuous environments (Figure 5): HalfCheetah, Hopper, and Humanoid, on 16 CPUs and 1 GPU. The policy networks consist of two fully-connected layers of 256 units with tanh activations. The critic network shares the same architecture as the policy network. For consistency, the same network architectures were employed across PPO, IMPALA, and IMPACT.
Additional experiments for discrete environments are in the Appendix. These experiments were run on 32 CPUs and 1 GPU. The policy network consists of three 4x4 convolutional layers and one 11x11 convolutional layer, with ReLU activations. The critic network shares weights with the policy network. The input of the network is a stack of four 42x42 down-sampled images of the Atari environment. The hyperparameters for continuous and discrete environments are listed in Appendix B, Tables 1 and 2, respectively.
Figures 5 and 6 show the total average return on evaluation rollouts for IMPACT, IMPALA and PPO. We train each algorithm with three different random seeds on each environment for a total time of three hours. According to the experiments, IMPACT is able to train much faster than PPO and IMPALA in both discrete and continuous domains, while matching or exceeding PPO’s sample efficiency.
Our results reveal that continuous control tasks for IMPACT are sensitive to the tuple $(N, K)$ for the circular buffer. and is a robust choice for continuous control. Although higher $K$ inhibits workers’ sample throughput, increased sample efficiency from replaying experiences results in an overall reduction in training wall-clock time and higher reward. For discrete tasks, and works best. Empirically, agents learn faster from new experience than from replaying old experience, showing how exploration is crucial to achieving high asymptotic performance in discrete environments.
4.4 IMPACT Scalability
Figure 7 shows how IMPACT’s performance scales relative to the number of workers. More workers mean increased sample throughput, which in turn increases training throughput (the rate at which the learner consumes batches). With the learner consuming more worker data per second, IMPACT can attain better performance in less time. However, as the number of workers increases, the observed gains in performance diminish.
5 Related Work
Distributed RL architectures are often used to accelerate training. Gorila (Nair et al., 2015) and A3C (Mnih et al., 2016) use workers to compute gradients to be sent to the learner. A2C (Mnih et al., 2016) and IMPALA (Espeholt et al., 2018) send experience tuples to the learner. Distributed replay buffers, introduced in ACER (Wang et al., 2016) and Ape-X (Horgan et al., 2018), collect worker-collected experience and define an overarching heuristic for learner batch selection. IMPACT is the first to fully incorporate the sample-efficiency benefits of PPO in an asynchronous setting.
Surreal PPO (Fan et al., 2018) also studies training with PPO in the asynchronous setting, but does not consider adaptation of the surrogate objective or IS-correction. Its use of a target network for broadcasting weights to workers is also entirely different from IMPACT’s. Consequently, IMPACT is able to achieve better results in both real-time and sample efficiency.
Off-policy methods, including DDPG and Q-Prop, utilize target networks to stabilize learning the Q function (Lillicrap et al., 2015; Gu et al., 2016). This use of a target network is related to but different from IMPACT’s, which uses the network to define a stable trust region for the PPO surrogate objective.
In conclusion, we introduce IMPACT, which extends PPO with a stabilized surrogate objective for asynchronous optimization, enabling greater real-time performance without sacrificing timestep efficiency. We show the importance of the IMPACT objective to stable training, and show it can outperform tuned PPO and IMPALA baselines in both real-time and timestep metrics.
- Achiam (2018). OpenAI Spinning Up. GitHub. https://spinningup.openai.com/en/latest/spinningup/bench.html
- Caspi et al. (2017). Reinforcement Learning Coach.
- Espeholt et al. (2018). IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures. arXiv preprint arXiv:1802.01561.
- Fan et al. (2018). SURREAL: Open-Source Reinforcement Learning Framework and Robot Manipulation Benchmark. In Conference on Robot Learning, pp. 767–782.
- Gu et al. (2016). Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic. arXiv preprint arXiv:1611.02247.
- Han and Sung (2017). AMBER: Adaptive Multi-Batch Experience Replay for Continuous Action Control. arXiv preprint arXiv:1710.04423.
- Han and Sung (2019). Dimension-Wise Importance Sampling Weight Clipping for Sample-Efficient Reinforcement Learning. arXiv preprint arXiv:1905.02363.
- Horgan et al. (2018). Distributed Prioritized Experience Replay. arXiv preprint arXiv:1803.00933.
- Jie and Abbeel (2010). On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient. pp. 1000–1008.
- Liang et al. (2018). RLlib: Abstractions for Distributed Reinforcement Learning. International Conference on Machine Learning (ICML).
- Lillicrap et al. (2015). Continuous Control with Deep Reinforcement Learning. arXiv preprint arXiv:1509.02971.
- Mnih et al. (2016). Asynchronous Methods for Deep Reinforcement Learning. In International Conference on Machine Learning, pp. 1928–1937.
- Nair et al. (2015). Massively Parallel Methods for Deep Reinforcement Learning. arXiv preprint arXiv:1507.04296.
- Schulman et al. (2015a). Trust Region Policy Optimization. In International Conference on Machine Learning, pp. 1889–1897.
- Schulman et al. (2015b). High-Dimensional Continuous Control using Generalized Advantage Estimation. arXiv preprint arXiv:1506.02438.
- Schulman et al. (2017). Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.
- Sutton et al. (2000). Policy Gradient Methods for Reinforcement Learning with Function Approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063.
- Tassa et al. (2018). DeepMind Control Suite. arXiv preprint arXiv:1801.00690.
- Wang et al. (2016). Sample Efficient Actor-Critic with Experience Replay. arXiv preprint arXiv:1611.01224.
Appendix A Additional Experiments
Appendix B Hyperparameters for All Environments
B.1 Discrete Environments
| Hyperparameter | IMPACT | IMPALA | PPO |
|---|---|---|---|
| Minibatch Buffer Size (N) | 4 | - | - |
| Num SGD Iterations (K) | 2 | - | 2 |
| Sample Batch Size | 50 | 50 | 100 |
| Train Batch Size | 500 | 500 | 5000 |
| SGD Minibatch Size | - | - | 500 |
| Value Function Coeff | 1.0 | 0.5 | 1.0 |
| Target-Worker Clipping ($\rho$) | 2.0 | - | - |
B.2 Continuous Environments
| Hyperparameter | IMPACT | IMPALA | PPO |
|---|---|---|---|
| Minibatch Buffer Size (N) | 16 | - | - |
| Num SGD Iterations (K) [1] | 20 | - | 20 |
| Sample Batch Size | 1024 | 1024 | 1024 |
| Train Batch Size | 32768 | 32768 | 163840 |
| SGD Minibatch Size | - | - | 32768 |
| Value Function Coeff [2] | 1.0 | 0.5 | 1.0 |
| Target-Worker Clipping ($\rho$) | 2.0 | - | - |

[1] For HalfCheetah-v2, IMPACT and PPO Num SGD Iterations (K) is 32.
[2] For HalfCheetah-v2, IMPACT Value Function Coeff is 0.5.
B.3 Hyperparameter Budget
Listed below is the grid search we used for each algorithm to obtain optimal hyperparameters. Optimal values were found by grid searching over each hyperparameter separately. We found that IMPACT’s optimal hyperparameter values tend to hover close to either IMPALA’s or PPO’s, which greatly reduced IMPACT’s search budget.
B.3.1 Discrete Environment Search
B.3.2 Continuous Environment Search
Appendix C IMPALA to IMPACT
In Figure 9, we gradually add components to IMPALA until the agent is equivalent to IMPACT. Starting from IMPALA, we gradually add PPO’s objective function, the circular replay buffer, and target-worker clipping. In particular, IMPALA with PPO’s objective function and circular replay buffer is equivalent to an asynchronous variant of PPO (APPO). APPO fails to perform as well as synchronous distributed PPO, since PPO is an on-policy algorithm.
Appendix D IMPALA in Continuous Environments
In Figure 6, IMPALA performs substantially worse than other agents in continuous environments. We postulate that IMPALA suffers from low asymptotic performance here since its objective is an importance-sampled version of the Vanilla Policy Gradient (VPG) objective, which is known to suffer from high variance and large update-step sizes. We found that for VPG, higher learning rates encourage faster learning in the beginning, but performance drops to negative return later in training. In Appendix B, for IMPALA, we heavily tuned the learning rate, finding that small learning rates stabilize learning at the cost of low asymptotic performance. Prior work also reveals that agents that use VPG fail to attain good performance in non-trivial continuous tasks (Achiam, 2018). Our results with IMPALA reach performance similar to that of other VPG-based algorithms. The closest neighbor to IMPALA, A3C, uses workers to compute gradients from the VPG objective to send to the learner thread. A3C performs well in InvertedPendulum yet flounders in other continuous environments (Tassa et al., 2018).
Appendix E The intuition of the objective
The following ratios represent the objective functions for different ablation studies. In the plots (Figure 10), we set the advantage function to be one, i.e., $\hat{A}_t = 1$.
IMPACT target $\epsilon$-clip: $\min\left(R, \text{clip}(R, 1-\epsilon, 1+\epsilon)\right)$ with $R = \min\left(\frac{\pi_{worker_i}}{\pi_{target}}, \rho\right)\frac{\pi_\theta}{\pi_{worker_i}}$.
According to Figure 10, the IS ratio is large when $\pi_{worker_i}$ assigns low probability. IMPACT target $\epsilon$-clip is a lower bound of the PPO $\epsilon$-clip. In a distributed asynchronous setting, the trust region suffers from larger variance stemming from off-policy data. The IMPACT target $\epsilon$-clip ratio mitigates this by encouraging conservative and reasonable policy-gradient steps.