IMPACT: Importance Weighted Asynchronous Architectures with Clipped Target Networks

11/30/2019, by Michael Luo, et al.

The practical usage of reinforcement learning agents is often bottlenecked by the duration of training time. To accelerate training, practitioners often turn to distributed reinforcement learning architectures to parallelize and accelerate the training process. However, modern methods for scalable reinforcement learning (RL) often trade off between the throughput of samples that an RL agent can learn from (sample throughput) and the quality of learning from each sample (sample efficiency). In these scalable RL architectures, as one increases sample throughput (i.e., increasing parallelization in IMPALA), sample efficiency drops significantly. To address this, we propose a new distributed reinforcement learning algorithm, IMPACT. IMPACT extends IMPALA with three changes: a target network for stabilizing the surrogate objective, a circular buffer, and truncated importance sampling. In discrete action-space environments, we show that IMPACT attains higher reward and, simultaneously, achieves up to a 30% reduction in training wall-clock time compared to IMPALA. In continuous control environments, IMPACT trains faster than existing scalable agents while preserving the sample efficiency of synchronous PPO.


1 Introduction

Proximal Policy Optimization (Schulman et al., 2017) is one of the most sample-efficient on-policy algorithms. However, it relies on a synchronous architecture for collecting experiences, which is closely tied to its trust region optimization objective. Other architectures such as IMPALA can achieve much higher throughputs due to the asynchronous collection of samples from workers. Yet, IMPALA suffers from reduced sample efficiency since it cannot safely take multiple SGD steps per batch as PPO can. Our new agent, Importance Weighted Asynchronous Architectures with Clipped Target Networks (IMPACT), mitigates this inherent mismatch. Not only is the algorithm highly sample efficient, but it also learns quickly, training 30 percent faster than IMPALA. At the same time, we propose a novel method to stabilize agents in distributed asynchronous setups and, through our ablation studies, show how the agent can learn in both a time- and sample-efficient manner.

In this paper, we show that IMPACT realizes greater gains by striking a balance between high sample throughput and sample efficiency. In our experiments, we demonstrate that IMPACT exceeds state-of-the-art agents in training time (with the same hardware) while maintaining sample efficiency similar to PPO's. The contributions of this paper are as follows:

  1. We show that when collecting experiences asynchronously, introducing a target network allows for a stabilized surrogate objective and multiple SGD steps per batch (Section 3.1).

  2. We show that using a circular buffer for storing asynchronously collected experiences allows for smooth trade-off between real-time performance and sample efficiency (Section 3.2).

  3. We show that IMPACT, when evaluated using identical hardware and neural network models, improves both in real-time and timestep efficiency over both synchronous PPO and IMPALA (Section 4).

(a) PPO
(b) IMPALA
(c) IMPACT
Figure 1: Architecture schemes for distributed PPO, IMPALA, and IMPACT. PPO aggregates worker batches into a large training batch and the learner performs minibatch SGD. IMPALA workers asynchronously generate data. IMPACT consists of a batch buffer that takes in worker experience and a target’s evaluation on the experience. The learner samples from the buffer.

2 Background

Reinforcement Learning assumes a Markov Decision Process (MDP) setup defined by the tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ represent the state and action space, $\gamma$ is the discount factor, and $P$ and $r$ are the transition dynamics and reward function that model the environment.

Let $\pi(a \mid s)$ denote a stochastic policy mapping that returns an action distribution given state $s$. Rolling out policy $\pi$ in the environment is equivalent to sampling a trajectory $\tau = (s_0, a_0, s_1, a_1, \ldots)$, where $a_t \sim \pi(\cdot \mid s_t)$ and $s_{t+1} \sim P(\cdot \mid s_t, a_t)$. We can compactly define the state and state-action marginals of the trajectory distribution induced by the policy $\pi$. The goal of reinforcement learning is to maximize the expected discounted return: $J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t} \gamma^{t} r(s_t, a_t)\right]$.

When $\theta$ parameterizes $\pi$, the policy is updated according to the Policy Gradient Theorem (Sutton et al., 2000):

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}(s_t, a_t)\right],$$

where $\hat{A}(s_t, a_t)$ is an estimator of the advantage function. The advantage estimator is usually defined as the 1-step TD error, $\hat{A}(s_t, a_t) = r(s_t, a_t) + \gamma V(s_{t+1}) - V(s_t)$, where $V(s)$ is an estimate of the value function. Policy gradients, however, suffer from high variance and large update-step sizes, oftentimes leading to sudden drops in performance.
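As a concrete illustration, the sketch below computes the 1-step TD-error advantage estimate described above with NumPy; the value estimates are stand-in numbers rather than outputs of a trained critic.

```python
import numpy as np

def one_step_td_advantage(rewards, values, next_values, gamma=0.99):
    """1-step TD-error advantage: A(s_t, a_t) = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    return rewards + gamma * next_values - values

# Toy example: three transitions with reward 1.0; the terminal state bootstraps to 0.
rewards = np.array([1.0, 1.0, 1.0])
values = np.array([4.0, 3.5, 3.0])        # V(s_t) from a (hypothetical) critic
next_values = np.array([3.5, 3.0, 0.0])   # V(s_{t+1})
print(one_step_td_advantage(rewards, values, next_values))
```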

2.1 Distributed PPO

Per iteration, Proximal Policy Optimization (PPO) optimizes the policy $\pi_\theta$ against the previous iteration's policy $\pi_{\theta_{old}}$ via the following objective function:

$$L(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right],$$

where $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$ and $\epsilon$ is the clipping hyperparameter. In addition, many PPO implementations use GAE-$\lambda$ as a low-bias, low-variance advantage estimator for $\hat{A}_t$ (Schulman et al., 2015b). PPO's surrogate objective contains the importance sampling ratio $r_t(\theta)$, which can potentially explode if $\pi_{\theta_{old}}$ is too far from $\pi_\theta$ (Han and Sung, 2017). PPO's surrogate loss mitigates this with the clipping function, which ensures that the agent makes reasonable steps. Alternatively, PPO can also be seen as an adaptive trust region introduced in TRPO (Schulman et al., 2015a).
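For reference, a minimal PyTorch sketch of this clipped surrogate objective is shown below; it is an illustrative implementation of the standard PPO loss, not the authors' code, and assumes log-probabilities and advantages are provided as tensors.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, epsilon=0.2):
    """Negated PPO clipped surrogate objective (so it can be minimized).

    logp_new:   log pi_theta(a_t | s_t) under the current policy
    logp_old:   log pi_theta_old(a_t | s_t) under the data-collecting policy
    advantages: advantage estimates A_hat_t (e.g., GAE-lambda)
    """
    ratio = torch.exp(logp_new - logp_old)                      # r_t(theta)
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon)  # clip(r_t, 1-eps, 1+eps)
    surrogate = torch.minimum(ratio * advantages, clipped * advantages)
    return -surrogate.mean()
```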

In Figure 1(a), distributed PPO agents implement a synchronous data-gathering scheme. Before data collection, workers are updated to the learner's latest policy, and worker batches are aggregated into a single training batch. The learner performs many mini-batch SGD steps on this batch. Once the learner is done, the learner's weights are broadcast to all workers, which begin sampling again.

2.2 Importance Weighted Actor-Learner Architectures

In Figure 1(b), IMPALA decouples acting and learning: worker threads send trajectories of actions, observations, and values to the learner thread, which computes and applies gradients from a queue of worker experience (Espeholt et al., 2018). This maximizes GPU utilization and allows for increased sample throughput, leading to high training speeds on easier environments such as Pong. As the number of workers grows, worker policies begin to diverge from the learner policy, resulting in stale policy gradients. To correct this, the IMPALA paper utilizes V-trace to correct the distributional shift:

$$v_t = V(s_t) + \sum_{k=t}^{t+n-1} \gamma^{k-t}\left(\prod_{i=t}^{k-1} c_i\right)\delta_k V, \qquad \delta_k V = \rho_k\left(r_k + \gamma V(s_{k+1}) - V(s_k)\right),$$

where $V$ is the value network, $\pi$ is the learner's policy network, $\mu$ is the worker's behavior policy, and $\rho_k = \min\left(\bar{\rho}, \frac{\pi(a_k \mid s_k)}{\mu(a_k \mid s_k)}\right)$ and $c_i = \min\left(\bar{c}, \frac{\pi(a_i \mid s_i)}{\mu(a_i \mid s_i)}\right)$ are clipped IS ratios.
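The following NumPy sketch computes V-trace value targets for a single trajectory under the definitions above; it assumes no episode terminations within the trajectory (a full implementation would also carry per-step discount masks).

```python
import numpy as np

def vtrace_targets(rewards, values, bootstrap_value, rhos,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """V-trace value targets v_t for one trajectory of length T.

    rewards[t], values[t] : r_t and V(s_t) for t = 0..T-1
    bootstrap_value       : V(s_T) used to bootstrap the final step
    rhos[t]               : unclipped IS ratios pi(a_t|s_t) / mu(a_t|s_t)
    """
    clipped_rhos = np.minimum(rho_bar, rhos)
    clipped_cs = np.minimum(c_bar, rhos)
    values_tp1 = np.append(values[1:], bootstrap_value)
    deltas = clipped_rhos * (rewards + gamma * values_tp1 - values)

    # Backward recursion: (v_t - V(s_t)) = delta_t + gamma * c_t * (v_{t+1} - V(s_{t+1}))
    acc = 0.0
    vs_minus_v = np.zeros(len(rewards))
    for t in reversed(range(len(rewards))):
        acc = deltas[t] + gamma * clipped_cs[t] * acc
        vs_minus_v[t] = acc
    return values + vs_minus_v
```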

0:  Batch size M, number of workers W, circular buffer size N, replay coefficient K, target update frequency t_target, weight broadcast frequency t_broadcast, learning rates α and β
1:  Randomly initialize network weights (θ, w)
2:  Initialize target network (θ_target, w_target) ← (θ, w)
3:  Create W workers and duplicate (θ, w) to each worker
4:  Initialize circular buffer C(N, K)
5:  for t = 1, ..., T do
6:     Obtain batch B of size M, traversed n ≤ K times, from C(N, K)
7:     If n = 0, evaluate B on the target network π_{θ_target} and append the target output to B
8:     Compute policy and value network gradients
9:     Update policy and value network weights θ ← θ + α∇_θ J, w ← w − β∇_w L
10:    If n = K, discard batch B from C(N, K)
11:    If t ≡ 0 (mod t_target), update target network (θ_target, w_target) ← (θ, w)
12:    If t ≡ 0 (mod t_broadcast), broadcast weights θ to the workers
13:  end for

  Worker-i  

0:  Worker sample batch size S
1:  repeat
2:     B_i ← ∅
3:     for t = 1, ..., S do
4:        Store (s_t, a_t, r_t, s_{t+1}) generated by π_{worker_i} in batch B_i
5:     end for
6:     Send B_i to C(N, K)
7:     If broadcasted weights exist, set π_{worker_i} ← π_θ
8:  until learner finishes
Algorithm 1 IMPACT
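To make Algorithm 1's control flow concrete, here is a minimal single-process Python sketch of the learner loop. It is not the authors' distributed implementation: the buffer, policies, optimizer, and loss function are duck-typed placeholders passed in by the caller, and the buffer is assumed to track replay counts and discard a batch after K traversals (see the circular-buffer sketch in Section 3.2).

```python
import copy

def impact_learner_loop(policy, workers, buffer, compute_loss, optimizer,
                        num_iters, target_update_freq, broadcast_freq):
    """Sketch of Algorithm 1's learner loop.

    buffer.sample() returns (batch, n_traversals) and internally evicts a batch
    once it has been replayed K times; `policy` exposes logp(), state_dict(),
    and load_state_dict(); `compute_loss` builds the surrogate loss.
    """
    target_policy = copy.deepcopy(policy)                  # Algorithm 1, line 2
    for t in range(1, num_iters + 1):
        batch, n_traversals = buffer.sample()              # line 6
        if n_traversals == 0:                              # line 7: cache target log-probs
            batch["target_logp"] = target_policy.logp(batch["obs"], batch["actions"])
        loss = compute_loss(policy, target_policy, batch)  # line 8
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                                   # line 9
        if t % target_update_freq == 0:                    # line 11
            target_policy.load_state_dict(policy.state_dict())
        if t % broadcast_freq == 0:                        # line 12
            for w in workers:
                w.set_weights(policy.state_dict())
```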

3 IMPACT Algorithm

Figure 2: In asynchronous PPO, there are multiple candidate policies from which the trust region can be defined: (1) $\pi_{worker_i}$, the policy of the worker process that produced the batch of experiences, (2) $\pi_\theta$, the current policy of the learner process, and (3) $\pi_{\theta_{target}}$, the policy of a target network. Introducing the target network allows for both a stable trust region and multiple SGD steps per batch of experience collected asynchronously from workers, improving sample efficiency. Since workers can generate experiences asynchronously from their copy of the master policy, this also allows for good real-time efficiency.

Like IMPALA, IMPACT separates sampling workers from learner workers. Algorithm 1 and Figure 1(c) describe the main training loop and architecture of IMPACT. In the beginning, each worker copies weights from the master network. Then, each worker uses its own policy to collect trajectories and sends the data to the circular buffer. Simultaneously, workers asynchronously pull policy weights from the master learner. In the meantime, the target network syncs with the master learner every $t_{target}$ iterations. The master learner repeatedly draws experience from the circular buffer. Each batch is weighted by an importance sampling ratio with respect to the policy that generated it and clipped using the target network (Section 3.1). The target network is used to provide a stable trust region (Figure 2), allowing multiple steps per batch (i.e., like PPO) even in the asynchronous setting (i.e., with the IMPALA architecture). In the next section, we describe the design of this improved objective.

3.1 Maximal Target-Worker Clipping

PPO gathers experience from the previous iteration's policy $\pi_{\theta_{old}}$, and the current policy $\pi_\theta$ trains by importance sampling this off-policy experience. In the asynchronous setting, worker $i$'s policy, denoted $\pi_{worker_i}$, generates experience for the policy network $\pi_\theta$. The probability that a batch comes from worker $i$ can be parameterized as a categorical distribution $i \sim D$. We include this by adding an extra expectation to the importance-sampled policy gradient objective (IS-PG) (Jie and Abbeel, 2010):

$$\mathbb{E}_{i \sim D}\,\mathbb{E}_{(s_t, a_t) \sim \pi_{worker_i}}\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{worker_i}(a_t \mid s_t)}\,\hat{A}_t\right].$$

Since each worker contains a different policy, the agent introduces a target network for stability (Figure 2). Off-policy agents such as DDPG and DQN update target networks with a moving average. For IMPACT, we instead periodically copy the master network into the target network. However, training with the importance weighted ratio can lead to numerical instability, as shown in Figure 3. To prevent this, we clip the importance sampling ratio from the worker policy, $\pi_{worker_i}$, to the target policy, $\pi_{\theta_{target}}$:

$$r_t = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{target}}(a_t \mid s_t)}\,\min\!\left(\rho,\ \frac{\pi_{\theta_{target}}(a_t \mid s_t)}{\pi_{worker_i}(a_t \mid s_t)}\right),$$

where $\rho \ge 1$ is the target-worker clipping hyperparameter. In the experiments, we set $\rho = 2.0$ for both the discrete and continuous benchmarks (Appendix B).

To see why clipping is necessary, consider the case where the master network's action distribution changes significantly over a few training iterations: worker $i$'s policy, $\pi_{worker_i}$, samples data outside that of the target policy, $\pi_{\theta_{target}}$, leading to large likelihood ratios $\frac{\pi_{\theta_{target}}(a_t \mid s_t)}{\pi_{worker_i}(a_t \mid s_t)}$. The clipping function pulls these large IS ratios back to $\rho$. Figure 10 in Appendix E provides additional intuition behind the target clipping objective. We show that the target network clipping is a lower bound of the IS-PG objective.

For $\rho \ge 1$, the clipped target ratio is larger and serves to augment the advantage estimator $\hat{A}_t$. This incentivizes the agent toward good actions while steering it away from bad actions. Thus, higher values of $\rho$ encourage the agent to learn faster, at the cost of stability.
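The snippet below sketches one plausible reading of the clipped target-worker ratio described above, in which the target-to-worker ratio is capped at $\rho$ before being combined with the current-to-target ratio; this is an illustrative reconstruction under that assumption, not the paper's verbatim formula.

```python
import torch

def target_clipped_ratio(logp_cur, logp_target, logp_worker, rho=2.0):
    """Clipped target-worker IS ratio (sketch of the assumed form):

        r_t = (pi_theta / pi_target) * min(rho, pi_target / pi_worker_i)

    Capping the target-to-worker ratio at rho keeps r_t bounded even when the
    worker policy assigns a very small probability to the sampled action.
    """
    cur_to_target = torch.exp(logp_cur - logp_target)
    target_to_worker = torch.exp(logp_target - logp_worker)
    return cur_to_target * torch.clamp(target_to_worker, max=rho)
```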

We use GAE-$\lambda$ with V-trace (Han and Sung, 2019). The V-trace GAE-$\lambda$ modifies the advantage function by adding clipped importance sampling terms to the summation of TD errors:

$$\hat{A}_t = \sum_{k=t}^{T-1} (\gamma\lambda)^{k-t}\left(\prod_{i=t}^{k-1} c_i\right)\delta_k V,$$

where $c_i = \min\left(\bar{c}, \frac{\pi_\theta(a_i \mid s_i)}{\pi_{worker_i}(a_i \mid s_i)}\right)$ (we use the convention $\prod_{i=t}^{t-1} c_i = 1$) and $\delta_k V = \rho_k\left(r_k + \gamma V(s_{k+1}) - V(s_k)\right)$ is the importance sampled 1-step TD error introduced in V-trace.
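A NumPy sketch of this V-trace-corrected GAE-$\lambda$ recursion is shown below for a single trajectory without terminations; the clipping thresholds and $\lambda = 0.995$ (Appendix B) are illustrative defaults.

```python
import numpy as np

def vtrace_gae_advantages(rewards, values, bootstrap_value, rhos,
                          gamma=0.99, lam=0.995, rho_bar=1.0, c_bar=1.0):
    """GAE-lambda with V-trace corrections for one trajectory (sketch).

    Each TD error is weighted by a clipped rho_t, and the recursion carries the
    clipped trace coefficient c_t:  A_t = delta_t + gamma * lambda * c_t * A_{t+1}.
    """
    clipped_rhos = np.minimum(rho_bar, rhos)
    clipped_cs = np.minimum(c_bar, rhos)
    values_tp1 = np.append(values[1:], bootstrap_value)
    deltas = clipped_rhos * (rewards + gamma * values_tp1 - values)

    advantages = np.zeros(len(rewards))
    acc = 0.0
    for t in reversed(range(len(rewards))):
        acc = deltas[t] + gamma * lam * clipped_cs[t] * acc
        advantages[t] = acc
    return advantages
```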

(a) Ratio ablation study.
(b) Target update frequency study.
Figure 3: Training curves of the ablation study on continuous control benchmarks. In (a), the IMPACT objective outperforms the other possible ratio choices for the surrogate loss, $\pi_\theta / \pi_{worker_i}$ and $\pi_\theta / \pi_{\theta_{target}}$. In (b), we show that the target network update frequency is robust to a range of choices. We try target network update frequencies equal to multiples (ranging from 1/16 to 16) of $N \cdot K$, the product of the circular buffer size and the number of times each batch in the buffer is replayed.
(a) Circular Buffer.
(b) Wall Clock-time vs.
(c) Sample Efficiency vs.
Figure 4: (a): The circular buffer in a nutshell: $N$ and $K$ correspond to the buffer size and the maximum number of times a batch can be traversed. Old batches are replaced by worker-generated batches. (b): The performance of IMPACT with different $K$ in terms of time. (c): The performance of IMPACT with different $K$ in terms of timesteps. IMPACT can achieve greater timestep as well as time efficiency by tuning $K$; an intermediate setting of $K$ outperforms the other settings in time and is more sample efficient than smaller settings.

3.2 Circular Buffer

IMPACT uses a circular buffer (Figure 4) to emulate the mini-batch SGD used by standard PPO. The circular buffer stores batches that can each be traversed at most $K$ times. Upon being traversed $K$ times, a batch is discarded and replaced by a new worker batch.

For motivation, the circular buffer and the target network together are analogous to mini-batching over a fixed set of experience in PPO. When the target network's update frequency equals $N \cdot K$, the circular buffer is equivalent to distributed PPO's training batch, with the learner sampling $N$ mini-batches for $K$ SGD iterations each.

This is in contrast to standard replay buffers, such as in ACER and APE-X, where transitions are either uniformly sampled or sampled based on priority, and, when the buffer is full, the oldest transitions are discarded (Wang et al., 2016; Horgan et al., 2018).

Figure 4 illustrates an empirical example where tuning $K$ can increase training sample efficiency and decrease training wall-clock time.
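A minimal Python sketch of such a circular buffer is shown below. Algorithm 1 does not specify which stored batch the learner obtains, so uniform sampling over stored batches is an assumption here; eviction after $K$ traversals and replacement of old batches follow the description above.

```python
import random
from collections import deque

class CircularBuffer:
    """Circular buffer sketch: holds up to N batches, each replayed at most K times.

    When the buffer is full, appending a new worker batch evicts the oldest one;
    a batch is also dropped once it has been traversed K times.
    """
    def __init__(self, capacity_n, max_replays_k):
        self.k = max_replays_k
        self.entries = deque(maxlen=capacity_n)  # oldest batch falls off when full

    def add(self, batch):
        self.entries.append({"batch": batch, "traversals": 0})

    def sample(self):
        # Assumes the buffer is non-empty; uniform sampling is an assumption.
        entry = random.choice(self.entries)
        n = entry["traversals"]
        entry["traversals"] += 1
        if entry["traversals"] >= self.k:        # replayed K times -> discard
            self.entries.remove(entry)
        return entry["batch"], n
```

In this sketch, setting $K = 1$ recovers IMPALA-style single-use batches, while larger $K$ replays each batch more, mirroring PPO's multiple SGD epochs.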

4 Evaluation

In our evaluation we seek to answer the following questions:

  1. How does the target-clipping objective affect the performance of the agents compared to prior work? (Section 4.1)

  2. How does the IMPACT circular buffer affect sample efficiency and training wall-clock time? (Section 4.2)

  3. How does IMPACT compare to PPO and IMPALA baselines in terms of sample and real-time performance? (Section 4.3)

  4. How does IMPACT scale with respect to the number of workers? (Section 4.4)

4.1 Target Clipping Performance

We investigate the performance of the clipped-target objective relative to prior work, which includes PPO- and IS-PG-based objectives. Specifically, we consider the following ratios: the IS ratio $\pi_\theta / \pi_{worker_i}$, the target ratio $\pi_\theta / \pi_{\theta_{target}}$, and IMPACT's clipped target ratio (Section 3.1).

For all three experiments, we truncate the ratios with PPO's clipping function, $\operatorname{clip}(r, 1-\epsilon, 1+\epsilon)$, and train in an asynchronous setting. Figure 3(a) reveals two important takeaways: first, $\pi_\theta / \pi_{\theta_{target}}$ suffers from sudden drops in performance midway through training. Second, $\pi_\theta / \pi_{worker_i}$ trains stably but does not achieve good performance.

We theorize that $\pi_\theta / \pi_{\theta_{target}}$ fails due to the target and worker network mismatch. During periods of training where the master learner undergoes drastic changes, worker action outputs vastly differ from the learner's outputs, resulting in small action probabilities. This creates large ratios in training and destabilizes it. We hypothesize that $\pi_\theta / \pi_{worker_i}$ fails due to different workers pushing and pulling the learner in multiple directions: the learner follows the most recent worker's suggestions without developing a proper trust region, so many workers' suggestions conflict with each other.

These results show that the clipped target objective from Section 3.1 is necessary and helps facilitate training. By clipping the target-worker ratio, we ensure that the ratio does not explode and destabilize training. Furthermore, we prevent workers from making mutually destructive suggestions by having a single target network provide guidance.

4.1.1 Target Network Update Frequency

In Section 3.2, an analogy was drawn between PPO's mini-batching mechanism and the circular buffer. Our primary benchmark for the target update frequency is $t_{target} = N \cdot K$, where $N$ is the circular buffer size and $K$ is the maximum replay coefficient. This is the setting in which IMPACT is equivalent to PPO.

In Figure 3(b), we test update frequencies across several orders of magnitude of $N \cdot K$. In general, we find that agent performance is robust to vastly differing frequencies. However, when the target network is updated too frequently, the agent does not learn. Based on these empirical results, we theorize that the agent can train as long as a stable trust region can be formed. On the other hand, if the update frequency is too low, the agent is stranded for many iterations in the same trust region, which impairs learning speed.

4.2 Time and Sample Efficiency with Circular Buffer

Counter to intuition, increasing $K$ does not simply trade wall-clock time for sample efficiency. In Figures 4(b) and 4(c), we show that IMPACT realizes greater gains by striking a balance between high sample throughput and sample efficiency: an intermediate setting of $K$ performs the best in both time and sample efficiency. Our results reveal that wall-clock time and sample efficiency can be jointly optimized by tuning $K$ in the circular buffer.

4.3 Comparison with Baselines

Time
Timesteps
Figure 5: IMPACT outperforms baselines in both sample and time efficiency for Continuous Control Domains: Hopper, Humanoid, HalfCheetah.

We investigate how IMPACT attains greater performance in wall-clock time and sample efficiency compared with PPO and IMPALA across six different continuous control and discrete action tasks.

We tested the agent on three continuous environments (Figure 5): HalfCheetah, Hopper, and Humanoid, using 16 CPUs and 1 GPU. The policy network consists of two fully-connected layers of 256 units with tanh activations. The critic network shares the same architecture as the policy network. For consistency, the same network architectures were employed across PPO, IMPALA, and IMPACT.

For the discrete environments (Figure 6), Pong, SpaceInvaders, and Breakout were chosen as common benchmarks used in popular distributed RL libraries (Caspi et al., 2017; Liang et al., 2018). Additional experiments on discrete environments are in the Appendix. These experiments were run on 32 CPUs and 1 GPU. The policy network consists of three 4x4 conv layers and one 11x11 conv layer, with ReLU activations. The critic network shares weights with the policy network. The input to the network is a stack of four 42x42 down-sampled frames of the Atari environment. The hyperparameters for the discrete and continuous environments are listed in Appendix B, Tables 1 and 2 respectively.

Figures 5 and 6 show the total average return on evaluation rollouts for IMPACT, IMPALA, and PPO. We train each algorithm with three different random seeds on each environment for a total of three hours. In these experiments, IMPACT trains much faster than PPO and IMPALA in both discrete and continuous domains, while preserving the same or better sample efficiency than PPO.

Our results reveal that IMPACT's performance on continuous control tasks is sensitive to the circular buffer tuple $(N, K)$: $N = 16$ and $K = 20$ is a robust choice for continuous control. Although higher $K$ inhibits workers' sample throughput, the increased sample efficiency from replaying experience results in an overall reduction in training wall-clock time and higher reward. For discrete tasks, $N = 4$ and $K = 2$ work best. Empirically, agents learn faster from new experience than from replaying old experience, showing how exploration is crucial to achieving high asymptotic performance in discrete environments.

Time
Timesteps
Figure 6: IMPACT outperforms PPO and IMPALA in both real-time and sample efficiency for Discrete Control Domains: Breakout, SpaceInvaders, and Pong.

4.4 IMPACT Scalability

Figure 7 shows how IMPACT's performance scales relative to the number of workers. More workers means increased sample throughput, which in turn increases training throughput (the rate at which the learner consumes batches). With the learner consuming more worker data per second, IMPACT can attain better performance in less time. However, as the number of workers increases, the marginal performance gains diminish.

(a) Continuous environment.
(b) Discrete environment.
Figure 7: Performance of IMPACT with respect to the number of workers in both continuous and discrete control tasks

5 Related Work

Distributed RL architectures are often used to accelerate training. Gorila (Nair et al., 2015) and A3C (Mnih et al., 2016) use workers to compute gradients that are sent to the learner. A2C (Mnih et al., 2016) and IMPALA (Espeholt et al., 2018) send experience tuples to the learner. Distributed replay buffers, introduced in ACER (Wang et al., 2016) and Ape-X (Horgan et al., 2018), collect worker experience and define an overarching heuristic for learner batch selection. IMPACT is the first to fully incorporate the sample-efficiency benefits of PPO in an asynchronous setting.

Surreal PPO (Fan et al., 2018) also studies training with PPO in the asynchronous setting, but it considers neither adaptation of the surrogate objective nor IS correction. Its use of a target network for broadcasting weights to workers is also entirely different from IMPACT's. Consequently, IMPACT is able to achieve better results in both real-time and sample efficiency.

Off-policy methods, including DDPG and Q-Prop, utilize target networks to stabilize learning the Q function (Lillicrap et al., 2015; Gu et al., 2016). This use of a target network is related to, but different from, IMPACT's, which uses the network to define a stable trust region for the PPO surrogate objective.

6 Conclusion

In conclusion, we introduce IMPACT, which extends PPO with a stabilized surrogate objective for asynchronous optimization, enabling greater real-time performance without sacrificing timestep efficiency. We show the importance of the IMPACT objective to stable training, and show it can outperform tuned PPO and IMPALA baselines in both real-time and timestep metrics.

References

  • J. Achiam (2018) OpenAI Spinning Up. GitHub. Note: https://spinningup.openai.com/en/latest/spinningup/bench.html Cited by: Appendix D.
  • I. Caspi, G. Leibovich, G. Novik, and S. Endrawis (2017) Reinforcement Learning Coach. External Links: Document, Link Cited by: §4.3.
  • L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, et al. (2018) IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures. arXiv preprint arXiv:1802.01561. Cited by: IMPACT: Importance Weighted Asynchronous Architectures with Clipped Target Networks, §2.2, §5.
  • L. Fan, Y. Zhu, J. Zhu, Z. Liu, O. Zeng, A. Gupta, J. Creus-Costa, S. Savarese, and L. Fei-Fei (2018) SURREAL: Open-Source Reinforcement Learning Framework and Robot Manipulation Benchmark. In Conference on Robot Learning, pp. 767–782. Cited by: §5.
  • S. Gu, T. Lillicrap, Z. Ghahramani, R. E. Turner, and S. Levine (2016) Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic. arXiv preprint arXiv:1611.02247. Cited by: §5.
  • S. Han and Y. Sung (2017) AMBER: Adaptive Multi-Batch Experience Replay for Continuous Action Control. arXiv preprint arXiv:1710.04423. Cited by: §2.1.
  • S. Han and Y. Sung (2019) Dimension-Wise Importance Sampling Weight Clipping for Sample-Efficient Reinforcement Learning. arXiv preprint arXiv:1905.02363. Cited by: §3.1.
  • D. Horgan, J. Quan, D. Budden, G. Barth-Maron, M. Hessel, H. Van Hasselt, and D. Silver (2018) Distributed Prioritized Experience Replay. arXiv preprint arXiv:1803.00933. Cited by: §3.2, §5.
  • T. Jie and P. Abbeel (2010) On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient. pp. 1000–1008. Cited by: §3.1.
  • E. Liang, R. Liaw, R. Nishihara, P. Moritz, R. Fox, K. Goldberg, J. E. Gonzalez, M. I. Jordan, and I. Stoica (2018) RLlib: Abstractions for Distributed Reinforcement Learning. In International Conference on Machine Learning (ICML). Cited by: §4.3.
  • T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2015) Continuous Control with Deep Reinforcement Learning. arXiv preprint arXiv:1509.02971. Cited by: §5.
  • V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous Methods for Deep Reinforcement Learning. In International conference on machine learning, pp. 1928–1937. Cited by: §5.
  • A. Nair, P. Srinivasan, S. Blackwell, C. Alcicek, R. Fearon, A. De Maria, V. Panneershelvam, M. Suleyman, C. Beattie, S. Petersen, et al. (2015) Massively Parallel Methods for Deep Reinforcement Learning. arXiv preprint arXiv:1507.04296. Cited by: §5.
  • J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz (2015a) Trust Region Policy Optimization. In International conference on machine learning, pp. 1889–1897. Cited by: §2.1.
  • J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel (2015b) High-Dimensional Continuous Control using Generalized Advantage Estimation. arXiv preprint arXiv:1506.02438. Cited by: §2.1.
  • J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347. Cited by: §1.
  • R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour (2000) Policy Gradient Methods for Reinforcement Learning with Function Approximation. In Advances in neural information processing systems, pp. 1057–1063. Cited by: §2.
  • Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. d. L. Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq, et al. (2018) Deepmind Control Suite. arXiv preprint arXiv:1801.00690. Cited by: Appendix D.
  • Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu, and N. de Freitas (2016) Sample Efficient Actor-Critic with Experience Replay. arXiv preprint arXiv:1611.01224. Cited by: §3.2, §5.

Appendix A Additional Experiments

(a) Time
(b) Timesteps
Figure 8: IMPACT, PPO and IMPALA wallclock time and sample efficiency for Discrete Control Domains: Qbert, BeamRider, and Gravitar.

Appendix B Hyper parameters for All Environments

b.1 Discrete Environments

Hyperparameters IMPACT IMPALA PPO
Clip Parameter 0.3 0.1
Entropy Coeff 0.01 0.01 0.01
Grad Clipping 10.0 40.0
Discount (γ) 0.99 0.99 0.99
Lambda (λ) 0.995 0.995
Learning Rate
Minibatch Buffer Size (N) 4
Num SGD Iterations (K) 2 2
Sample Batch Size 50 50 100
Train Batch Size 500 500 5000
SGD Minibatch Size 500
KL Coeff 0.0 0.5
KL Target 0.01 0.01
Value Function Coeff 1.0 0.5 1.0
Target-Worker Clipping (ρ) 2.0
Table 1: Hyperparameters for Discrete Environments.

b.2 Continuous Environments

Hyperparameters IMPACT IMPALA PPO
Clip Parameter 0.4 0.3
Entropy Coeff 0.0 0.0 0.0
Grad Clipping 0.5 0.5
Discount (γ) 0.995 0.995 0.99
Lambda (λ) 0.995 0.995
Learning Rate
Minibatch Buffer Size (N) 16
Num SGD Iterations (K) 20 20 (For HalfCheetah-v2, IMPACT and PPO use K = 32.)
Sample Batch Size 1024 1024 1024
Train Batch Size 32768 32768 163840
SGD Minibatch Size 32768
KL Coeff 1.0 1.0
KL Target 0.04 0.01
Value Function Coeff 1.0 0.5 1.0 (For HalfCheetah-v2, IMPACT's Value Function Coeff is 0.5.)
Target-Worker Clipping (ρ) 2.0
Table 2: Hyperparameters for Continuous Control Environments

b.3 Hyperparameter Budget

Listed below is the grid search we used for each algorithm to obtain hyperparameters. Optimal values were found by grid searching over each hyperparameter separately. We found that IMPACT's optimal hyperparameter values tend to hover close to either IMPALA's or PPO's, which greatly reduced IMPACT's search budget.

b.3.1 Discrete Environment Search

Hyperparameters IMPACT IMPALA PPO
Clip Parameter [0.1, 0.2, 0.3] [0.1, 0.2, 0.3, 0.4]
Grad Clipping [10, 20, 40] [2.5, 5, 10, 20, 40, 80]
Learning Rate [0.5, 1.0, 3.0] [0.1, 0.3, 0.5, 0.8, 1.0, 3.0, 5.0] [0.5, 1.0, 3.0, 5.0, 8.0]
Minibatch Buffer Size (N) [2, 4, 8, 16]
Num SGD Iterations (K) [1, 2, 4] [1, 2, 4, 8]
Train Batch Size [1000, 2500, 5000, 10000]
Value Function Coeff [0.5, 1.0, 2.0] [0.25, 0.5, 1.0, 2.0] [0.25, 0.5, 1.0, 2.0]
# of Runs 19 17 21

Table 3: Hyperparameter Search for Discrete Environments

b.4 Continuous Environment Search

Hyperparameters IMPACT IMPALA PPO
Clip Parameter [0.2, 0.3, 0.4] [0.1, 0.2, 0.3, 0.4]
Grad Clipping [0.5, 1.0, 5.0] [0.1, 0.25, 0.5, 1.0, 5.0, 10.0]
Learning Rate [1.0, 3.0, 5.0] [0.1, 0.15, 0.3, 0.5, 0.8, 1.0, 3.0, 5.0] [1.0, 3.0, 5.0]
Minibatch Buffer Size (N) [4, 8, 16]
Num SGD Iterations (K) [20, 26, 32] [20, 26, 32]
Train Batch Size [65536, 98304, 131072, 163840]
KL Target [0.01, 0.02, 0.04] [0.01, 0.02, 0.04]
Value Function Coeff [0.5, 1.0, 2.0] [0.5, 1.0, 2.0] [0.5, 1.0, 2.0]
# of Runs 21 17 20

Note: IMPALA was difficult to fine-tune due to unstable runs.

Table 4: Hyperparameter Search for Continuous Environments

Appendix C IMPALA to IMPACT

Figure 9: IMPALA to IMPACT: Incrementally Adding PPO Objective, Replay, and Target-Worker Clipping to IMPALA. The experiments are done on the HalfCheetah-v2 gym environment.

In Figure 9, we gradually add components to IMPALA until the agent is equivalent to IMPACT. Starting from IMPALA, we add PPO's objective function, the circular replay buffer, and target-worker clipping. In particular, IMPALA with PPO's objective function and the circular replay buffer is equivalent to an asynchronous variant of PPO (APPO). APPO fails to perform as well as synchronous distributed PPO, since PPO is an on-policy algorithm.

Appendix D IMPALA in Continuous Environments

In Figure 5, IMPALA performs substantially worse than the other agents in continuous environments. We postulate that IMPALA suffers from low asymptotic performance here because its objective is an importance-sampled version of the Vanilla Policy Gradient (VPG) objective, which is known to suffer from high variance and large update-step sizes. We found that for VPG, higher learning rates encourage faster learning in the beginning, but performance drops to negative return later in training. In Appendix B, for IMPALA, we heavily tuned the learning rate, finding that small learning rates stabilize learning at the cost of low asymptotic performance. Prior work also reveals that agents using VPG fail to attain good performance in non-trivial continuous tasks (Achiam, 2018). Our results with IMPALA reach performance similar to other VPG-based algorithms. A3C, the closest neighbor to IMPALA, uses workers to compute gradients from the VPG objective and sends them to the learner thread. A3C performs well on InvertedPendulum yet flounders in other continuous environments (Tassa et al., 2018).

(a) Action Distributions. (b) Likelihood Ratios w.r.t. Different Objectives.
Figure 10: Likelihood ratios for different objective functions, including PPO's. We assume a diagonal Gaussian policy. Left: corresponding one-dimensional action distributions for worker i, the target, and the master learner. Right: ratio values graphed as a function of possible action values. IMPACT with PPO clipping is a lower bound of PPO.

Appendix E The intuition of the objective

The following ratios represent the objective functions for the different ablation studies. In the plots (Figure 10), we set the advantage function to one, i.e., $\hat{A}_t = 1$:

  • IS ratio: $\frac{\pi_\theta(a_t \mid s_t)}{\pi_{worker_i}(a_t \mid s_t)}$

  • IMPACT target: $\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{target}}(a_t \mid s_t)}\min\!\left(\rho,\ \frac{\pi_{\theta_{target}}(a_t \mid s_t)}{\pi_{worker_i}(a_t \mid s_t)}\right)$

  • PPO $\epsilon$-clip: $\operatorname{clip}\!\left(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{worker_i}(a_t \mid s_t)},\, 1-\epsilon,\, 1+\epsilon\right)$

  • IMPACT target $\epsilon$-clip: $\operatorname{clip}\!\left(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{target}}(a_t \mid s_t)}\min\!\left(\rho,\ \frac{\pi_{\theta_{target}}(a_t \mid s_t)}{\pi_{worker_i}(a_t \mid s_t)}\right),\, 1-\epsilon,\, 1+\epsilon\right)$

According to Figure 10, the IS ratio is large where the worker policy $\pi_{worker_i}$ assigns low probability. The IMPACT target $\epsilon$-clip is a lower bound of the PPO $\epsilon$-clip. In a distributed asynchronous setting, the trust region suffers from larger variance stemming from off-policy data. The IMPACT target $\epsilon$-clip ratio mitigates this by encouraging conservative and reasonable policy-gradient steps.
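The short script below reproduces the spirit of Figure 10 numerically: it evaluates the four ratios for hypothetical one-dimensional Gaussian learner, target, and worker policies, using the clipped-ratio form reconstructed in Section 3.1 (an assumption) and $\epsilon = 0.2$, $\rho = 2.0$.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1-D Gaussian action distributions for the learner (pi_theta),
# target (pi_target), and worker i (pi_worker), loosely in the spirit of Figure 10.
actions = np.linspace(-4.0, 4.0, 401)
p_learner = norm.pdf(actions, loc=0.5, scale=1.0)
p_target = norm.pdf(actions, loc=0.0, scale=1.0)
p_worker = norm.pdf(actions, loc=-1.0, scale=1.0)

eps, rho = 0.2, 2.0
is_ratio = p_learner / p_worker                                       # IS ratio
impact_target = (p_learner / p_target) * np.minimum(rho, p_target / p_worker)
ppo_clip = np.clip(is_ratio, 1.0 - eps, 1.0 + eps)                    # PPO eps-clip
impact_clip = np.clip(impact_target, 1.0 - eps, 1.0 + eps)            # IMPACT target eps-clip

# The raw IS ratio explodes where the worker assigns little probability,
# while the clipped variants stay bounded.
print(is_ratio.max(), impact_target.max(), ppo_clip.max(), impact_clip.max())
```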