Distributed Prioritized Experience Replay

by Dan Horgan, et al.

We propose a distributed architecture for deep reinforcement learning at scale, that enables agents to learn effectively from orders of magnitude more data than previously possible. The algorithm decouples acting from learning: the actors interact with their own instances of the environment by selecting actions according to a shared neural network, and accumulate the resulting experience in a shared experience replay memory; the learner replays samples of experience and updates the neural network. The architecture relies on prioritized experience replay to focus only on the most significant data generated by the actors. Our architecture substantially improves the state of the art on the Arcade Learning Environment, achieving better final performance in a fraction of the wall-clock training time.



1 Introduction

A broad trend in deep learning is that combining more computation (Dean et al., 2012) with more powerful models (Kaiser et al., 2017) and larger datasets (Deng et al., 2009) yields more impressive results. It is reasonable to hope that a similar principle holds for deep reinforcement learning. There are a growing number of examples to justify this optimism: effective use of greater computational resources has been a critical factor in the success of such algorithms as Gorila (Nair et al., 2015), A3C (Mnih et al., 2016), GPU Advantage Actor Critic (Babaeizadeh et al., 2017), Distributed PPO (Heess et al., 2017) and AlphaGo (Silver et al., 2016).

Deep learning frameworks such as TensorFlow (Abadi et al., 2016) support distributed training, making large scale machine learning systems easier to implement and deploy. Despite this, much current research in deep reinforcement learning concerns itself with improving performance within the computational budget of a single machine, and the question of how to best harness more resources is comparatively underexplored.

In this paper we describe an approach to scaling up deep reinforcement learning by generating more data and selecting from it in a prioritized fashion (Schaul et al., 2016). Standard approaches to distributed training of neural networks focus on parallelizing the computation of gradients, to more rapidly optimize the parameters (Dean et al., 2012). In contrast, we distribute the generation and selection of experience data, and find that this alone suffices to improve results. This is complementary to distributing gradient computation, and the two approaches can be combined, but in this work we focus purely on data-generation.

We use this distributed architecture to scale up variants of Deep Q-Networks (DQN) and Deep Deterministic Policy Gradient (DDPG), and we evaluate these on the Arcade Learning Environment benchmark (Bellemare et al., 2013), and on a range of continuous control tasks. Our architecture achieves a new state of the art performance on Atari games, using a fraction of the wall-clock time compared to the previous state of the art, and without per-game hyperparameter tuning.

We empirically investigate the scalability of our framework, analysing how prioritization affects performance as we increase the number of data-generating workers. Our experiments include an analysis of factors such as the replay capacity, the recency of the experience, and the use of different data-generating policies for different workers. Finally, we discuss implications for deep reinforcement learning agents that may apply beyond our distributed framework.

2 Background

Distributed Stochastic Gradient Descent

Distributed stochastic gradient descent is widely used in supervised learning to speed up training of deep neural networks, by parallelizing the computation of the gradients used to update their parameters. The resulting parameter updates may be applied synchronously (Krizhevsky, 2014) or asynchronously (Dean et al., 2012). Both approaches have proven effective and are an increasingly standard part of the deep learning toolbox. Inspired by this, Nair et al. (2015) applied distributed asynchronous parameter updates and distributed data generation to deep reinforcement learning. Asynchronous parameter updates and parallel data generation have also been successfully used within a single machine, in a multi-threaded rather than a distributed context (Mnih et al., 2016). GPU Asynchronous Actor-Critic (GA3C; Babaeizadeh et al., 2017) and Parallel Advantage Actor-Critic (PAAC; Clemente et al., 2017) adapt this approach to make efficient use of GPUs.

Distributed Importance Sampling

A complementary family of techniques for speeding up training is based on variance reduction by means of importance sampling (cf. Hastings, 1970). This has been shown to be useful in the context of neural networks (Hinton, 2007). Sampling non-uniformly from a dataset and weighting updates according to the sampling probability in order to counteract the bias thereby introduced can increase the speed of convergence by reducing the variance of the gradients. One way of doing this is to select samples with probability proportional to the norm of the corresponding gradients. In supervised learning, this approach has been successfully extended to the distributed setting (Alain et al., 2015). An alternative is to rank samples according to their latest known loss value and make the sampling probability a function of the rank rather than of the loss itself (Loshchilov & Hutter, 2015).

Prioritized Experience Replay

Experience replay (Lin, 1992) has long been used in reinforcement learning to improve data efficiency. It is particularly useful when training neural network function approximators with stochastic gradient descent algorithms, as in Neural Fitted Q-Iteration (Riedmiller, 2005) and Deep Q-Learning (Mnih et al., 2015). Experience replay may also help to prevent overfitting by allowing the agent to learn from data generated by previous versions of the policy. Prioritized experience replay (Schaul et al., 2016) extends classic prioritized sweeping ideas (Moore & Atkeson, 1993) to work with deep neural network function approximators. The approach is strongly related to the importance sampling techniques discussed in the previous section, but using a more general class of biased sampling procedures that focus learning on the most ‘surprising’ experiences. Biased sampling can be particularly helpful in reinforcement learning, since the reward signal may be sparse and the data distribution depends on the agent’s policy. As a result, prioritized experience replay is used in many agents, such as Prioritized Dueling DQN (Wang et al., 2016), UNREAL (Jaderberg et al., 2017), DQfD (Hester et al., 2017), and Rainbow (Hessel et al., 2017). In an ablation study conducted to investigate the relative importance of several algorithmic ingredients (Hessel et al., 2017), prioritization was found to be the most important ingredient contributing to the agent’s performance.

3 Our Contribution: Distributed Prioritized Experience Replay

In this paper we extend prioritized experience replay to the distributed setting and show that this is a highly scalable approach to deep reinforcement learning. We introduce a few key modifications that enable this scalability, and we refer to our approach as Ape-X.

As in Gorila (Nair et al., 2015), we decompose the standard deep reinforcement learning algorithm into two parts, which run concurrently with no high-level synchronization. The first part consists of stepping through an environment, evaluating a policy implemented as a deep neural network, and storing the observed data in a replay memory. We refer to this as acting. The second part consists of sampling batches of data from the memory to update the policy parameters. We term this learning.

Figure 1: The Ape-X architecture in a nutshell: multiple actors, each with its own instance of the environment, generate experience, add it to a shared experience replay memory, and compute initial priorities for the data. The (single) learner samples from this memory and updates the network and the priorities of the experience in the memory. The actors’ networks are periodically updated with the latest network parameters from the learner.

Algorithm 1 Actor

procedure Actor($B$, $T$)    ▷ Run agent in environment instance, storing experiences.
    $\theta_0 \leftarrow$ Learner.Parameters()    ▷ Remote call to obtain latest network parameters.
    $s_0 \leftarrow$ Environment.Initialize()    ▷ Get initial state from environment.
    for $t = 1$ to $T$ do
        $a_{t-1} \leftarrow \pi_{\theta_{t-1}}(s_{t-1})$    ▷ Select an action using the current policy.
        $(r_t, \gamma_t, s_t) \leftarrow$ Environment.Step($a_{t-1}$)    ▷ Apply the action in the environment.
        LocalBuffer.Add($(s_{t-1}, a_{t-1}, r_t, \gamma_t)$)    ▷ Add data to local buffer.
        if LocalBuffer.Size() $\geq B$ then    ▷ In a background thread, periodically send data to replay.
            $\tau \leftarrow$ LocalBuffer.Get($B$)    ▷ Get buffered data (e.g. batch of multi-step transitions).
            $p \leftarrow$ ComputePriorities($\tau$)    ▷ Calculate priorities for experience (e.g. absolute TD error).
            Replay.Add($\tau$, $p$)    ▷ Remote call to add experience to replay memory.
        end if
        Periodically($\theta_t \leftarrow$ Learner.Parameters())    ▷ Obtain latest network parameters.
    end for
end procedure

Algorithm 2 Learner

procedure Learner($T$)    ▷ Update network using batches sampled from memory.
    $\theta_0 \leftarrow$ InitializeNetwork()
    for $t = 1$ to $T$ do    ▷ Update the parameters $T$ times.
        $id, \tau \leftarrow$ Replay.Sample()    ▷ Sample a prioritized batch of transitions (in a background thread).
        $l_t \leftarrow$ ComputeLoss($\tau; \theta_t$)    ▷ Apply learning rule; e.g. double Q-learning or DDPG.
        $\theta_{t+1} \leftarrow$ UpdateParameters($l_t; \theta_t$)
        $p \leftarrow$ ComputePriorities()    ▷ Calculate priorities for experience (e.g. absolute TD error).
        Replay.SetPriority($id$, $p$)    ▷ Remote call to update priorities.
        Periodically(Replay.RemoveToFit())    ▷ Remove old experience from replay memory.
    end for
end procedure

In principle, both acting and learning may be distributed across multiple workers. In our experiments, hundreds of actors run on CPUs to generate data, and a single learner running on a GPU samples the most useful experiences (Figure 1). Pseudocode for the actors and learners is shown in Algorithms 1 and 2. Updated network parameters are periodically communicated to the actors from the learner.

In contrast to Nair et al. (2015), we use a shared, centralized replay memory, and instead of sampling uniformly, we prioritize, to sample the most useful data more often. Since priorities are shared, high priority data discovered by any actor can benefit the whole system. Priorities can be defined in various ways, depending on the learning algorithm; two instances are described in the next sections.

In Prioritized DQN (Schaul et al., 2016) priorities for new transitions were initialized to the maximum priority seen so far, and only updated once they were sampled. This does not scale well: due to the large number of actors in our architecture, waiting for the learner to update priorities would result in a myopic focus on the most recent data, which has maximum priority by construction. Instead, we take advantage of the computation the actors in Ape-X are already doing to evaluate their local copies of the policy, by making them also compute suitable priorities for new transitions online. This ensures that data entering the replay has more accurate priorities, at no extra cost.
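The actor-side priority computation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes one-step transitions, a priority equal to the absolute TD error, and a hypothetical `q_values` callable that stands in for the actor's local copy of the network.

```python
from typing import Callable, List, Sequence, Tuple

# A one-step transition: (state, action, reward, discount, next_state).
Transition = Tuple[int, int, float, float, int]


def initial_priorities(
    transitions: Sequence[Transition],
    q_values: Callable[[int], List[float]],
) -> List[float]:
    """Return the absolute TD error of each transition under the local network copy."""
    priorities = []
    for state, action, reward, discount, next_state in transitions:
        bootstrap = max(q_values(next_state))
        td_error = reward + discount * bootstrap - q_values(state)[action]
        priorities.append(abs(td_error))
    return priorities


# Toy tabular "network" with two states and two actions (hypothetical values).
table = {0: [1.0, 2.0], 1: [0.5, 3.0]}
batch = [(0, 1, 1.0, 0.9, 1), (1, 0, 0.0, 0.0, 0)]
priorities = initial_priorities(batch, lambda s: table[s])
```

Because the actor has already evaluated its network to choose actions, these values add little extra computation, and new data enters the replay with an informative priority rather than the maximum one.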

Sharing experiences has certain advantages compared to sharing gradients. Low latency communication is not as important as in distributed SGD, because experience data becomes outdated less rapidly than gradients, provided the learning algorithm is robust to off-policy data. Across the system, we take advantage of this by batching all communications with the centralized replay, increasing the efficiency and throughput at the cost of some latency. With this approach it is even possible for actors and learners to run in different data-centers without limiting performance.
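The batching of communication with the replay might look like the sketch below. The `LocalBuffer` class and its `send` callback are hypothetical names; `send` stands in for whatever remote call an implementation uses to add experience to the replay.

```python
from typing import Any, Callable, List


class LocalBuffer:
    """Accumulates transitions and flushes them to the replay in fixed-size batches."""

    def __init__(self, batch_size: int, send: Callable[[List[Any]], None]) -> None:
        self.batch_size = batch_size
        self.send = send  # stands in for the remote call to the replay memory
        self.items: List[Any] = []

    def add(self, transition: Any) -> None:
        self.items.append(transition)
        if len(self.items) >= self.batch_size:
            # Ship one full batch and keep any remainder locally.
            batch = self.items[: self.batch_size]
            self.items = self.items[self.batch_size:]
            self.send(batch)


sent_batches: List[List[int]] = []
buffer = LocalBuffer(batch_size=4, send=sent_batches.append)
for step in range(10):
    buffer.add(step)
```

After ten steps with a batch size of four, two batches have been sent and two transitions remain buffered. This trades a small amount of latency for fewer, larger messages.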

Finally, by learning off-policy (cf. Sutton & Barto, 1998, 2017), we can further take advantage of Ape-X’s ability to combine data from many distributed actors, by giving the different actors different exploration policies, broadening the diversity of the experience they jointly encounter. As we will see in the results, this can be sufficient to make progress on difficult exploration problems.

3.1 Ape-X DQN

The general framework we have described may be combined with different learning algorithms. First, we combined it with a variant of DQN (Mnih et al., 2015) with some of the components of Rainbow (Hessel et al., 2017). More specifically, we used double Q-learning (van Hasselt, 2010; van Hasselt et al., 2016) with multi-step bootstrap targets (cf. Sutton, 1988; Sutton & Barto, 1998, 2017; Mnih et al., 2016) as the learning algorithm, and a dueling network architecture (Wang et al., 2016) as the function approximator $q(\cdot, \cdot, \theta)$.

This results in computing for all elements in the batch the loss $l_t(\theta) = \frac{1}{2}\left(G_t - q(S_t, A_t, \theta)\right)^2$ with

$$G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n}\, q\left(S_{t+n}, \operatorname{argmax}_a q(S_{t+n}, a, \theta), \theta^-\right),$$

where $t$ is a time index for an experience sampled from the replay starting with state $S_t$ and action $A_t$, and $\theta^-$ denotes parameters of the target network (Mnih et al., 2015), a slow moving copy of the online parameters. Multi-step returns are truncated if the episode ends in fewer than $n$ steps.
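As an illustration of the multi-step double Q-learning target, the sketch below computes the return for one sampled experience. It assumes the standard form of the target: discounted rewards for up to n steps plus a bootstrap in which the online network selects the action and the target network evaluates it. The function and argument names are hypothetical.

```python
from typing import Sequence


def multi_step_double_q_target(
    rewards: Sequence[float],
    gamma: float,
    q_online_next: Sequence[float],
    q_target_next: Sequence[float],
    terminal: bool,
) -> float:
    """n-step return with a double-Q bootstrap.

    `rewards` holds the rewards following the sampled state; it is shorter
    than n when the episode ended early (truncated return).
    """
    ret = 0.0
    for k, reward in enumerate(rewards):
        ret += (gamma ** k) * reward
    if not terminal:
        # The online network picks the action, the target network evaluates it.
        best = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
        ret += (gamma ** len(rewards)) * q_target_next[best]
    return ret


def squared_td_loss(target: float, q_sa: float) -> float:
    """Half the squared difference between the target and the current estimate."""
    return 0.5 * (target - q_sa) ** 2


target = multi_step_double_q_target(
    rewards=[1.0, 0.0, 2.0],
    gamma=0.5,
    q_online_next=[0.1, 0.9],
    q_target_next=[4.0, 8.0],
    terminal=False,
)
loss = squared_td_loss(target, q_sa=1.5)
```

With these toy numbers the reward part is 1 + 0 + 0.25 · 2 = 1.5 and the bootstrap part is 0.125 · 8 = 1, so the target is 2.5 and the loss is 0.5.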

In principle, Q-learning variants are off-policy methods, so we are free to choose the policies we use to generate data. However, in practice, the choice of behaviour policy does affect both exploration and the quality of function approximation. Furthermore, we are using a multi-step return with no off-policy correction, which in theory could adversely affect the value estimation. Nonetheless, in Ape-X DQN, each actor executes a different policy, and this allows experience to be generated from a variety of strategies, relying on the prioritization mechanism to pick out the most effective experiences. In our experiments, the actors use $\epsilon$-greedy policies with different values of $\epsilon$. Low $\epsilon$ policies allow exploring deeper in the environment, while high $\epsilon$ policies prevent over-specialization.

3.2 Ape-X DPG

To test the generality of the framework we also combined it with a continuous-action policy gradient system based on DDPG (Lillicrap et al., 2016), an implementation of deterministic policy gradients (Silver et al., 2014) that is also similar to older methods (Werbos, 1990; Prokhorov & Wunsch, 1997), and tested it on continuous control tasks from the DeepMind Control Suite (Tassa et al., 2018).

The Ape-X DPG setup is similar to Ape-X DQN, but the actor’s policy is now represented explicitly by a separate policy network, in addition to the Q-network. The two networks are optimized separately, by minimizing different losses on the sampled experience. We denote the policy and Q-network parameters by $\phi$ and $\psi$ respectively, and adopt the same convention as above to denote target networks. The Q-network outputs an action-value estimate $q(s, a, \psi)$ for a given state $s$, and multi-dimensional action $a \in \mathbb{R}^m$. It is updated using temporal-difference learning with a multi-step bootstrap target. The Q-network loss can be written as $l_t(\psi) = \frac{1}{2}\left(G_t - q(S_t, A_t, \psi)\right)^2$, where

$$G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n}\, q\left(S_{t+n}, \pi(S_{t+n}, \phi^-), \psi^-\right).$$

The policy network outputs an action $A_t = \pi(S_t, \phi) \in \mathbb{R}^m$. The policy parameters are updated using policy gradient ascent on the estimated Q-value, using gradient $\nabla_\phi q(S_t, \pi(S_t, \phi), \psi)$; note that this depends on the policy parameters $\phi$ only through the action $A_t = \pi(S_t, \phi)$ that is input to the critic network. Further details of the Ape-X DPG algorithm are available in the appendix.

4 Experiments

4.1 Atari

Figure 2: Left: Atari results aggregated across 57 games, evaluated from random no-op starts. Right: Atari training curves for selected games, against baselines. Blue: Ape-X DQN with 360 actors; Orange: A3C; Purple: Rainbow; Green: DQN. See appendix for longer runs over all games.

In our first set of experiments we evaluate Ape-X DQN on Atari, and show state of the art results on this standard reinforcement learning benchmark. We use 360 actor machines (each using one CPU core) to feed data into the replay memory as fast as they can generate it; approximately 139 frames per second (FPS) each, for a total of 50K FPS, which corresponds to 12.5K transitions per second (because of a fixed action repeat of 4). The actors batch experience data locally before sending it to the replay: up to 100 transitions may be buffered at a time, which are then sent asynchronously in batches of $B = 50$. The learner asynchronously prefetches up to 16 batches of 512 transitions, and computes updates for 19 such batches each second, meaning that gradients are computed for 9.7K transitions per second on average. To reduce memory and bandwidth requirements, observation data is compressed using a PNG codec when sent and when stored in the replay. The learner decompresses data as it prefetches it, in parallel with computing and applying gradients. The learner also asynchronously handles any requests for parameters from actors.
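The throughput figures quoted above follow from simple arithmetic, reproduced here as a check. The final ratio is not stated in the text; it is derived under the simplifying assumption that generation and learning rates are constant.

```python
ACTORS = 360
FPS_PER_ACTOR = 139
ACTION_REPEAT = 4
BATCHES_PER_SECOND = 19
BATCH_SIZE = 512

# Frames generated per second across all actors (about 50K).
total_fps = ACTORS * FPS_PER_ACTOR

# One transition is produced every ACTION_REPEAT frames (about 12.5K per second).
transitions_generated = total_fps / ACTION_REPEAT

# Transitions consumed by the learner per second (about 9.7K).
transitions_learned = BATCHES_PER_SECOND * BATCH_SIZE

# Average number of times a generated transition is sampled (derived, not from the text).
replay_ratio = transitions_learned / transitions_generated
```

Under these numbers the learner samples somewhat fewer transitions than the actors generate, which is one reason prioritization matters: not every transition can be replayed.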

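Actors copy the network parameters from the learner every 400 frames (2.8 seconds). Each actor $i \in \{0, \dots, N-1\}$ executes an $\epsilon_i$-greedy policy where $\epsilon_i = \epsilon^{1 + \frac{i}{N-1}\alpha}$ with $\epsilon = 0.4$, $\alpha = 7$. Each $\epsilon_i$ is held constant throughout training. The episode length is limited to 50000 frames during training.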
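A per-actor exploration schedule can be sketched as below. It assumes the schedule takes the form ε_i = ε^(1 + α·i/(N−1)) with base ε = 0.4 and α = 7; the function name and the single-actor fallback are illustrative choices, not taken from the text.

```python
from typing import List


def actor_epsilons(num_actors: int, base: float = 0.4, alpha: float = 7.0) -> List[float]:
    """Return one fixed epsilon per actor, spaced geometrically between base and base ** (1 + alpha)."""
    if num_actors == 1:
        # Assumption: a lone actor simply uses the base value.
        return [base]
    return [
        base ** (1.0 + alpha * i / (num_actors - 1))
        for i in range(num_actors)
    ]


epsilons = actor_epsilons(360)
```

Under this assumed form, the first actor explores with ε = 0.4 and the last, greediest actor uses 0.4^8, with the remaining actors spread between the two.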

The capacity of the shared experience replay memory is soft-limited to 2 million transitions: adding new data is always permitted, to not slow down the actors, but every 100 learning steps any excess data above this capacity threshold is removed en masse, in FIFO order. The median actual size of the memory is 2035050. Data is sampled according to proportional prioritization, with a priority exponent of 0.6 and an importance sampling exponent set to 0.4.
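The sampling and eviction rules in this paragraph can be illustrated with a small list-based sketch. It assumes proportional prioritization in the style of Schaul et al. (2016), with importance weights normalised by the largest weight in the batch; a real replay would use a more efficient data structure, and all names here are hypothetical.

```python
import random
from typing import List, Optional, Sequence, Tuple


def sample_batch(
    priorities: Sequence[float],
    batch_size: int,
    priority_exponent: float = 0.6,
    is_exponent: float = 0.4,
    rng: Optional[random.Random] = None,
) -> Tuple[List[int], List[float]]:
    """Sample indices with probability proportional to priority ** priority_exponent."""
    rng = rng or random.Random()
    scaled = [p ** priority_exponent for p in priorities]
    total = sum(scaled)
    probs = [s / total for s in scaled]
    indices = rng.choices(range(len(probs)), weights=probs, k=batch_size)
    # Importance weights (N * P(i)) ** -is_exponent, normalised by the batch maximum.
    n = len(probs)
    weights = [(n * probs[i]) ** (-is_exponent) for i in indices]
    max_weight = max(weights)
    return indices, [w / max_weight for w in weights]


def evict_fifo(replay: list, capacity: int) -> list:
    """Drop the oldest items so that at most `capacity` remain (soft limit)."""
    excess = max(0, len(replay) - capacity)
    return replay[excess:]


indices, weights = sample_batch([0.1, 2.0, 0.5, 1.0], batch_size=8, rng=random.Random(0))
trimmed = evict_fifo(list(range(10)), capacity=6)
```

Calling `evict_fifo` only every 100 learning steps, as described above, lets the memory temporarily exceed its capacity so that actors are never blocked on insertion.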

In Figure 2, on the left, we compare the median human normalized score across all 57 games to several baselines: DQN, Prioritized DQN, Distributional DQN (Bellemare et al., 2017), Rainbow, and Gorila. In all cases the performance is measured at the end of training under the no-op starts testing regime (Mnih et al., 2015). On the right, we show initial learning curves (taken from the greediest actor) for a selection of 6 games (full learning curves for all games are in the appendix). Given that Ape-X can harness substantially more computation than most baselines, one might expect it to train faster. Figure 2 shows that this was indeed the case. Perhaps more surprisingly, our agent achieved a substantially higher final performance.

In Table 1 we compare the median human-normalized performance of Ape-X DQN on the Atari benchmark to corresponding metrics as reported for other baseline agents in their respective publications. Whenever available we report results both for no-op starts and for human starts. The human-starts regime (Nair et al., 2015) corresponds to a more challenging generalization test, as the agent is initialized from random starts drawn from games played by human experts. Ape-X’s performance is higher than the performance of any of the baselines according to both metrics.

| Algorithm | Training Time | Environment Frames | Resources (per game) | Median (no-op starts) | Median (human starts) |
|---|---|---|---|---|---|
| Ape-X DQN | 5 days | 22800M | 376 cores, 1 GPU (a) | 434% | 358% |
| Rainbow | 10 days | 200M | 1 GPU | 223% | 153% |
| Distributional (C51) | 10 days | 200M | 1 GPU | 178% | 125% |
| A3C | 4 days | - | 16 cores | - | 117% |
| Prioritized Dueling | 9.5 days | 200M | 1 GPU | 172% | 115% |
| DQN | 9.5 days | 200M | 1 GPU | 79% | 68% |
| Gorila DQN (c) | 4 days | - | unknown (b) | 96% | 78% |
| UNREAL (d) | - | 250M | 16 cores | 331% (d) | 250% (d) |

Table 1: Median normalized scores across 57 Atari games. (a) Tesla P100. (b) 100 CPUs, with a mixed number of cores per CPU machine. (c) Only evaluated on 49 games. (d) Hyper-parameters were tuned per game.

4.2 Continuous Control

In a second set of experiments we evaluated Ape-X DPG on four continuous control tasks. In the manipulator domain the agent must learn to bring a ball to a specified location. In the humanoid domain the agent must learn to control a humanoid body to solve three distinct tasks of increasing complexity: Standing, Walking and Running. Since here we learn from features, rather than from pixels, the observation space is much smaller than it is in the Atari domain. We therefore use small, fully-connected networks (details in the appendix). With 64 actors on this domain, we obtain 14K total FPS (the same number of transitions per second; here we do not use action repeats). We process 86 batches of 256 transitions per second, or 22K transitions processed per second.

Figure 3 shows that Ape-X DPG achieved very good performance on all four tasks. The figure shows the performance of Ape-X DPG for different numbers of actors: as the number of actors increases our agent becomes increasingly effective at solving these problems rapidly and reliably, outperforming a standard DDPG baseline trained for over 10 times longer. A parallel paper (Barth-Maron et al., 2018) builds on this work by combining Ape-X DPG with distributional value functions, and the resulting algorithm is successfully applied to further continuous control tasks.

Figure 3: Performance of Ape-X DPG on four continuous control tasks, as a function of wall clock time. Performance improves as we increase the numbers of actors. The black dashed line indicates the maximum performance reached by a standard DDPG baseline over 5 days of training.

5 Analysis

Figure 4: Scaling the number of actors. Performance consistently improves as we scale the number of actors from 8 to 256; note that the number of learning updates performed does not depend on the number of actors.

In this section we describe additional Ape-X DQN experiments on Atari that helped improve our understanding of the framework, and we investigate the contribution of different components.

First, we investigated how the performance scales with the number of actors. We trained our agent with different numbers of actors (8, 16, 32, 64, 128 and 256) for 35 hours on a subset of 6 Atari games. In all experiments we kept the size of the shared experience replay memory fixed at 1 million transitions. Figure 4 shows that the performance consistently improved as the number of actors increased. The appendix contains learning curves for additional games, and a comparison of the scalability of the algorithm with and without prioritized replay. It is perhaps surprising that performance improved so substantially purely by increasing the number of actors, without changing the rate at which the network parameters are updated, the structure of the network, or the update rule. We hypothesize that the proposed architecture helps with a common deep reinforcement learning failure mode, in which the policy discovered is a local optimum in the parameter space, but not a global one, e.g., due to insufficient exploration. Using a large number of actors with varying amounts of exploration helps to discover promising new courses of action, and prioritized replay ensures that when this happens, the learning algorithm focuses its efforts on this important information.

Next, we investigated varying the capacity of the replay memory (see Figure 5). We used a setup with 256 actors, for a median of 37K total environment frames per second (approximately 9K transitions). With such a large number of actors, the contents of the memory are replaced much faster than in most DQN-like agents. We observed a small benefit to using a larger replay capacity. We hypothesize this is due to the value of keeping some high priority experiences around for longer and replaying them. As above, a single learner machine trained the network with a median of 19 batches per second, each of 512 transitions, for a median of 9.7K transitions processed per second.

Figure 5: Varying the capacity of the replay. Agents with larger replay memories perform better on most games. Each curve corresponds to a single run, smoothed over 20 points. The curve for Wizard Of Wor with replay size 250K is incomplete because training diverged; we did not observe this with the other replay sizes.

Finally, we ran additional experiments to disentangle potential effects of two confounding factors in our scalability analysis: recency of the experience data in the replay memory, and diversity of the data-generating policies. The full description of these experiments is confined to the appendix; to summarize, neither factor alone is sufficient to explain the performance we see. We therefore conclude that the results are due substantially to the positive effects of gathering more experience data; namely better exploration of the environment and better avoidance of overfitting.

6 Conclusion

We have designed, implemented, and analyzed a distributed framework for prioritized replay in deep reinforcement learning. This architecture achieved state of the art results in a wide range of discrete and continuous tasks, both in terms of wall-clock learning speed and final performance.

In this paper we focused on applying the Ape-X framework to DQN and DPG, but it could also be combined with any other off-policy reinforcement learning update. For methods that use temporally extended sequences (e.g., Mnih et al., 2016; Wang et al., 2017), the Ape-X framework may be adapted to prioritize sequences of past experiences instead of individual transitions.

Ape-X is designed for regimes in which it is possible to generate large quantities of data in parallel. This includes simulated environments but also a variety of real-world applications, such as robotic arm farms, self-driving cars, online recommender systems, or other multi-user systems in which data is generated by many instances of the same environment (cf. Silver et al., 2013). In applications where data is costly to obtain, our approach will not be directly applicable. With powerful function approximators, overfitting is an issue: generating more training data is the simplest way of addressing it, but may also provide guidance towards data-efficient solutions.

Many deep reinforcement learning algorithms are fundamentally limited by their ability to explore effectively in large domains. Ape-X uses a naive yet effective mechanism to address this issue: generating a diverse set of experiences and then identifying and learning from the most useful events. The success of this approach suggests that simple and direct approaches to exploration may be feasible, even for synchronous agents.

Our architecture illustrates that distributed systems are now practical both for research and, potentially, large-scale applications of deep reinforcement learning. We hope that the algorithms, architecture, and analysis we have presented will help to accelerate future efforts in this direction.


We would like to acknowledge the contributions of our colleagues at DeepMind, whose input and support have been vital to the success of this work. Thanks in particular to Tom Schaul, Joseph Modayil, Sriram Srinivasan, Georg Ostrovski, Josh Abramson, Todd Hester, Jean-Baptiste Lespiau, Alban Rrustemi and Dan Belov.


Appendix A Recency of Experience

Figure 6: Testing whether improved performance is caused by recency alone: $n$ denotes the number of actors, $k$ the number of times each transition is replicated in the replay. The data in the run with $n = 32$, $k = 8$ is therefore as recent as the data in the run with $n = 256$, $k = 1$, but performance is not as good.

Figure 7: Varying the data-generating policies: Red: fixed set of 6 values for $\epsilon$. Blue: full range of values for $\epsilon$. In both cases, the curve plotted is from a separate actor that does not add data to the replay memory, and which follows an $\epsilon$-greedy policy with $\epsilon = 0.00164$.

In our main experiments we do not change the size of the replay memory in proportion to the number of actors, so by changing the number of actors we also increased the rate at which the contents of the replay memory are replaced. This means that in the experiments with more actors, transitions in the replay memory are more recent: they are generated by following policies whose parameters are closer to the version of the parameters being optimized by the learner, and in this sense they are more on-policy. Could this alone be sufficient to explain the improved performance? If so, we might be able to recover the results without needing a large number of actor machines. To test this, we constructed an experiment wherein we replicate the rate at which the contents of the replay memory are replaced in the 256-actor experiments, but instead of actually using 256 actors, we use 32 actors but add each transition they generate to the replay memory 8 times over. In this setup, the contents of the replay memory are similarly generated by policies with a recent version of the network parameters: the only difference is that the data is not as diverse as in the 256-actor case. We observe (see Figure 6) that this does not recover the same performance, and therefore conclude that the recency of the experience alone is not sufficient to explain the performance of our method. Indeed, we see that adding the same data multiple times can sometimes harm performance, since although it increases recency this comes at the expense of diversity.

Note: in principle, duplicating the added data in this fashion has a similar effect to reducing the capacity of the replay memory, and indeed, our results with a smaller replay memory in Figure 5 do corroborate the finding. However, we test also by duplicating the data primarily in order to exclude any effects arising from the implementation. In particular, in contrast to simply reducing the replay capacity, duplicating each data point means that the computational demands on the replay server in these runs are the same as when we use the corresponding number of real actors.

Appendix B Varying the Data-Generating Policies

Another factor that could conceivably contribute to the scalability of our algorithm is the fact that each actor has a different $\epsilon$. To determine the extent to which this impacts upon the performance, we ran an experiment (see Figure 7) with some simple variations on the mechanism we use to choose the policies that generate the data we train on. The first alternative we tested is to choose a small fixed set of 6 values for $\epsilon$, instead of the full range that we typically use. In this test, we use prioritized replay as normal, and we find that the results with the full range of $\epsilon$ are overall slightly better. However, the full range is not essential for achieving good results within our distributed framework.

Appendix C Atari: Additional Details

The frames received from the environment are preprocessed on the actor side with the standard transformations introduced by DQN. This includes greyscaling, frame stacking, repeating actions 4 times, and clipping rewards to $[-1, 1]$.

The learner waits for at least 50000 transitions to be accumulated in the replay before starting learning. We use a Centered RMSProp optimizer with a learning rate of 0.00025 / 4, decay of 0.95, epsilon of 1.5e-7, and no momentum to minimize the multi-step loss (with $n = 3$). Gradient norms are clipped to 40. The target network used in the loss calculation is copied from the online network every 2500 training batches. We use the same network as in the Dueling DDQN agent.
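For reference, the Atari learner settings stated in this document can be collected into a single configuration object. The class and field names below are hypothetical; the values are those given in this appendix and in Section 4.1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AtariLearnerConfig:
    """Learner hyperparameters for Ape-X DQN on Atari, as listed in the text."""

    min_replay_size: int = 50_000          # transitions required before learning starts
    learning_rate: float = 0.00025 / 4     # Centered RMSProp
    rmsprop_decay: float = 0.95
    rmsprop_epsilon: float = 1.5e-7
    rmsprop_momentum: float = 0.0
    rmsprop_centered: bool = True
    max_gradient_norm: float = 40.0
    target_update_period: int = 2_500      # training batches between target copies
    batch_size: int = 512
    replay_soft_capacity: int = 2_000_000  # transitions
    priority_exponent: float = 0.6
    importance_sampling_exponent: float = 0.4


config = AtariLearnerConfig()
```

Freezing the dataclass keeps the settings immutable, which matches the statement that there is no per-game hyperparameter tuning.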

Appendix D Continuous Control: Additional Details

The critic network has a layer with 400 units, followed by a tanh activation, followed by another layer of 300 units. The actor network has a layer with 300 units, followed by a tanh activation, followed by another layer of 200 units. The gradient used to update the actor network is clipped to $[-1, 1]$, element-wise. Training uses the Adam optimizer (Kingma & Ba, 2014) with learning rate of $10^{-4}$. The target network used in the loss calculation is copied from the online network every 100 training batches.

Replay sampling priorities are set according to the absolute TD error as given by the critic, and are sampled by the learner using proportional prioritized sampling (see appendix F) with priority exponent . To maintain a fixed replay capacity of , transitions are periodically evicted using proportional prioritized sampling, with priority exponent . This is a different strategy for removing data than in the Atari experiments, which simply removed the oldest data first - it remains to be seen which is superior.

Unlike the original DPG algorithm, which applies autocorrelated noise sampled from an Ornstein-Uhlenbeck process (Uhlenbeck & Ornstein (1930)), we apply exploration noise to each action sampled from a normal distribution with

. Evaluation is performed using the noiseless deterministic policy. Hyperparameters are otherwise as per DQN.

Benchmarking was performed in two continuous control domains ((a) Humanoid and (b) Manipulator, see Figure 8) implemented in the MuJoCo physics simulator (Todorov et al. (2012)). Humanoid is a humanoid walker with action, state and observation dimensionalities , and respectively. Three Humanoid tasks were considered: walk (reward for exceeding a minimum velocity), run (reward proportional to movement speed) and stand (reward proportional to standing height). Manipulator is a 2-dimensional planar arm with , and , which receives reward for catching a randomly-initialized moving ball.

(a) Humanoid domain.
(b) Manipulator domain.
Figure 8: Continuous control domains considered for benchmarking Ape-X DPG: (a) Humanoid, and (b) Manipulator. All tasks simulated in the MuJoCo physics simulator (Todorov et al. (2012)).

Appendix E Tuning

On Atari, we performed some limited tuning of the learning rate and batch size. We found that larger batch sizes contribute significantly to performance when using many actors. We tried batch sizes from {32, 128, 256, 512, 1024}, seeing clear benefits up to 512. We attempted increasing the learning rate to 0.00025 with the larger batch sizes, but this destabilized training on some games. We also tried a lower learning rate of 0.00025 / 8, but this did not reliably improve results.

Likewise for continuous control, we experimented with batch sizes {32, 128, 256, 512, 1024} and learning rates from to . We also experimented with the prioritization exponents from to , with results proving essentially consistent within the range [0.3, 0.7] (beyond 0.7, training would sometimes become unstable and diverge).

For the experiments with many actors, we set the period for updating network parameters on the actors to be high enough that the learner was not overloaded with requests. Likewise, we set the number of transitions that are locally accumulated on each actor to be high enough that the replay server would not be overloaded with network traffic. We did not otherwise tune these parameters and have not observed them to have a significant impact on the learning dynamics.

Appendix F Implementation

The following section makes explicit some of the more practical details that may be of interest to anyone wishing to implement a similar system.

Data Storage

The algorithm is implemented using TensorFlow (Abadi et al., 2016). Replay data is kept in a distributed in-memory key-value store implemented using custom TensorFlow ops, similar to the lookup ops available in core TensorFlow. The ops allow adding, reading, and removing batches of Tensor data efficiently.

Sampling Data

We also implemented ops for efficiently maintaining and sampling from a prioritized distribution over the keys, using the algorithm for proportional prioritization described in Schaul et al. (2016). The probability of sampling a transition is $P(k) = p_k^\alpha / \sum_j p_j^\alpha$, where $p_k$ is the priority of the transition with key $k$. The exponent $\alpha$ controls the amount of prioritization, and when $\alpha = 0$ uniform sampling is recovered. The proportional variant sets priority $p_k = |\delta_k|$, where $\delta_k$ is the TD error for transition $k$. Whenever a batch of data is added to or removed from the store, or is processed by the learner, this distribution is correspondingly updated, recording any change to the set of valid keys and the priorities associated with them.
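The proportional sampling rule can be illustrated with a short Python sketch. It recomputes the whole distribution on every call, which is only suitable for small examples; an efficient implementation would maintain the distribution incrementally. The function names are hypothetical, and the sketch assumes at least one key has a positive priority.

```python
import random


def sampling_probabilities(priorities, alpha):
    """Map each key to its sampling probability under proportional prioritization.

    `priorities` is a dict from key to a non-negative priority. Each
    probability is the priority raised to `alpha`, divided by the sum of
    all such powers.
    """
    scaled = {key: p ** alpha for key, p in priorities.items()}
    total = sum(scaled.values())
    return {key: value / total for key, value in scaled.items()}


def sample_keys(priorities, alpha, batch_size, rng=random):
    """Draw `batch_size` keys (with replacement) from the prioritized distribution."""
    probs = sampling_probabilities(priorities, alpha)
    keys = list(probs)
    return rng.choices(keys, weights=[probs[k] for k in keys], k=batch_size)
```

With an exponent of 0 every key receives the same probability, which matches the uniform-sampling case described above.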

A background thread on the learner fetches batches of sampled data from the remote replay and decompresses it using the learner’s CPU, in parallel with the gradients being computed on the GPU. The fetched data is buffered in a TensorFlow queue, so that the GPU always has data available to train on.
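The prefetching pattern described here can be sketched with a standard-library thread and bounded queue. This is an illustration only: `fetch_compressed_batch` is a hypothetical callable standing in for the remote replay request, zlib is an assumed compression format, and a Python `queue.Queue` stands in for the TensorFlow queue.

```python
import queue
import threading
import zlib


def start_prefetcher(fetch_compressed_batch, buffer_size=4):
    """Start a background thread that fetches and decompresses batches.

    `fetch_compressed_batch` is a hypothetical callable returning compressed
    bytes, or None when no more data will arrive. Returns the queue from
    which the training loop reads decompressed batches; a final None marks
    the end of the stream.
    """
    buffer = queue.Queue(maxsize=buffer_size)

    def worker():
        while True:
            payload = fetch_compressed_batch()
            if payload is None:
                buffer.put(None)
                return
            # Decompression happens on the CPU, off the training thread.
            buffer.put(zlib.decompress(payload))

    threading.Thread(target=worker, daemon=True).start()
    return buffer
```

Because the queue is bounded, the worker blocks once `buffer_size` batches are waiting, so memory use stays limited while the consumer rarely has to wait for data.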

Adding Data

In order to efficiently construct $n$-step transition data, each actor maintains a circular buffer of capacity $n$ containing tuples $(S_t, A_t, R_{t:t+B}, \gamma_{t:t+B}, q(S_t, \cdot))$, where $B$ is the current size of the buffer. With each step, the new data is appended and the accumulated per-step discounts and partial returns for all entries in the buffer are updated. If the buffer has reached its capacity, $n$, then its first element may be combined with the latest state and value estimates to produce a valid $n$-step transition (with accompanying Q-values).
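A minimal Python sketch of such a buffer is given below. The class, method, and field names are hypothetical, a `deque` stands in for the circular buffer, and flushing partially filled entries at episode boundaries is not handled.

```python
from collections import deque


class NStepBuffer:
    """Actor-side buffer that accumulates partial returns and discounts."""

    def __init__(self, n):
        self.n = n
        self.entries = deque()

    def append(self, state, action, reward, discount, q_values):
        """Record one step; `reward` and `discount` follow `action` in `state`."""
        # Fold the new reward into every pending partial return, then
        # extend each entry's accumulated discount.
        for entry in self.entries:
            entry["partial_return"] += entry["discount"] * reward
            entry["discount"] *= discount
        self.entries.append({
            "state": state,
            "action": action,
            "partial_return": reward,
            "discount": discount,
            "q_values": q_values,
        })

    def pop_transition(self, latest_state, latest_q_values):
        """Return a full multi-step transition once n steps are buffered, else None."""
        if len(self.entries) < self.n:
            return None
        first = self.entries.popleft()
        return {
            "state": first["state"],
            "action": first["action"],
            "n_step_return": first["partial_return"],
            "discount": first["discount"],
            "q_values": first["q_values"],
            "next_state": latest_state,
            "next_q_values": latest_q_values,
        }
```

In this sketch the actor calls `pop_transition` with each newly observed state and its Q-values before acting, and then calls `append` once the reward and discount for that action are known.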

However, instead of being directly added to the remote replay memory on each step, the constructed transitions are first stored in a local TensorFlow queue, in order to reduce the number of requests to the replay server. The queue is periodically flushed, at which stage the absolute $n$-step TD-errors (and thus the initial priorities) for the queued transitions are computed in batch, using the buffered Q-values to avoid recomputation. The Q-value estimates from which the initial priorities are derived are therefore based on the actor’s copy of the network parameters at the time the corresponding state was obtained from the environment, rather than the latest version on the learner. These Q-values need not be stored after this, since the learner does not require them, although they can be helpful for debugging.

A unique key is assigned to each transition, which records which actor and environment step it came from, and the dequeued transition tuples are stored in the remote replay memory. As mentioned in the previous section, the remote sampling distribution is immediately updated with the newly added keys and the corresponding initial priorities computed by the actor. Note that, since we store both the start and the end state with each transition, we are storing some data twice: this costs more RAM, but simplifies the code.
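The local batching, key assignment, and initial priority computation can be sketched as follows. All names are hypothetical: `send_batch` stands in for the request to the remote replay, a Python list stands in for the local TensorFlow queue, and the TD error uses a plain max over the buffered next-state Q-values, which is a simplifying assumption rather than the exact target used in the experiments.

```python
def initial_priority(transition):
    """Absolute multi-step TD error computed from buffered Q-values.

    Assumption: the bootstrap value is the max over the buffered
    next-state Q-values.
    """
    target = (transition["n_step_return"]
              + transition["discount"] * max(transition["next_q_values"]))
    return abs(target - transition["q_values"][transition["action"]])


class LocalTransitionQueue:
    """Accumulates transitions on the actor and sends them in batches."""

    def __init__(self, actor_id, flush_size, send_batch):
        self.actor_id = actor_id
        self.flush_size = flush_size
        self.send_batch = send_batch  # hypothetical remote replay call
        self.pending = []

    def add(self, env_step, transition):
        # The key records which actor and environment step produced the data.
        self.pending.append(((self.actor_id, env_step), transition))
        if len(self.pending) >= self.flush_size:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        keys = [key for key, _ in self.pending]
        transitions = [t for _, t in self.pending]
        # Priorities are computed in one pass from the buffered Q-values.
        priorities = [initial_priority(t) for t in transitions]
        self.send_batch(keys, transitions, priorities)
        self.pending = []
```

A larger `flush_size` means fewer, bigger requests to the replay server, at the cost of a longer delay before new data becomes available to the learner.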


It is important that the replay server be able to handle all requests in a timely fashion, in order to avoid slowing down the whole system. Possible bottlenecks include CPU, network bandwidth, and any locks protecting the shared data. In our experiments we found CPU to be the main bottleneck, but this was resolved by ensuring all requests and responses use sufficiently large batches. Nonetheless, it is advisable to consider all of these potential performance concerns when designing such systems.


In our framework, since acting and learning proceed with no synchronization, and performance depends on both, it can be misleading to consider performance with reference to only one of these. For example, the results after a given total number of environment frames have been experienced are highly dependent on the number of updates the learner has performed in that time. For this reason it is important to monitor and report the speeds of all parts of the system and to consider them when analyzing results.

Failure Tolerance

In distributed systems with many workers, it is inevitable that interruptions or failures will occur, either due to occasional hardware issues or because shared resources are needed by higher priority jobs. All stateful parts of the system therefore must periodically save their work and be able to resume where they left off when restarted. In our system, actors may be interrupted at any time and this will not prevent continued learning, albeit with a temporarily reduced rate of new data entering the replay memory. If the replay server is interrupted, the data it contains is discarded, and upon resuming, the memory is refilled quickly by the actors. In this event, to avoid overfitting, the learner will pause training briefly, until the minimum amount of data has once again been accumulated. If the learner is interrupted, progress will stall until it resumes.
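Two of the behaviours described above, periodic saving with resumption and pausing the learner until enough data is available, can be sketched in Python. The file format (JSON), the helper names, and the write-then-rename approach are assumptions made for illustration; the 50000-transition threshold is the value used for the Atari learner.

```python
import json
import os
import tempfile

MIN_REPLAY_SIZE = 50000  # transitions required before the learner trains


def save_checkpoint(path, state):
    """Write `state` to `path` so that an interruption cannot leave a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    # Renaming within one directory replaces the old checkpoint in one step.
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the saved state, or `default` when no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def learner_may_train(replay_size, min_replay_size=MIN_REPLAY_SIZE):
    """The learner pauses while the replay holds too little data."""
    return replay_size >= min_replay_size
```

After a replay server restart, `learner_may_train` would return False until the actors have refilled the memory to the minimum size.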

Figure 9: Training curves for 57 Atari games (performance against wall clock time). Green: DQN baseline. Purple: Rainbow baseline. Orange: A3C baseline. Blue: Ape-X DQN with 360 actors, 1 replay server and 1 Tesla P100 GPU learner. The anomaly in Riverraid is due to an infrastructure error.
Figure 10: Training curves for 57 Atari games (performance against environment frames). Only the first billion frames are shown, corresponding to 5-6 hours of training for Ape-X. Green: DQN baseline. Purple: Rainbow baseline. Blue: Ape-X DQN with 360 actors, 1 replay server and 1 Tesla P100 GPU learner.
Figure 11: Speed of data generation scales linearly with the number of actors.
Figure 12: Training curves showing performance against wall clock time for various numbers of actors on a selection of Atari games. Blue: prioritized replay, with learning rate 0.00025 / 4. Red: uniform replay, with learning rate 0.00025. For both prioritized and uniform, we tried both of these learning rates and selected the best. Both variants benefit from larger numbers of actors, but prioritized can better take advantage of the increased amount of data. In the 256-actor run, prioritized is equal or better in 7 of 9 games.
Game No-op starts Human starts
alien 40,804.9 17,731.5
amidar 8,659.2 1,047.3
assault 24,559.4 24,404.6
asterix 313,305.0 283,179.5
asteroids 155,495.1 117,303.4
atlantis 944,497.5 918,714.5
bank_heist 1,716.4 1,200.8
battle_zone 98,895.0 92,275.0
beam_rider 63,305.2 72,233.7
berzerk 57,196.7 55,598.9
bowling 17.6 30.2
boxing 100.0 80.9
breakout 800.9 756.5
centipede 12,974.0 5,711.6
chopper_command 721,851.0 576,601.5
crazy_climber 320,426.0 263,953.5
defender 411,943.5 399,865.3
demon_attack 133,086.4 133,002.1
double_dunk 23.5 22.3
enduro 2,177.4 2,042.4
fishing_derby 44.4 22.4
freeway 33.7 29.0
frostbite 9,328.6 6,511.5
gopher 120,500.9 121,168.2
gravitar 1,598.5 662.0
hero 31,655.9 26,345.3
ice_hockey 33.0 24.0
jamesbond 21,322.5 18,992.3
kangaroo 1,416.0 577.5
krull 11,741.4 8,592.0
kung_fu_master 97,829.5 72,068.0
montezuma_revenge 2,500.0 1,079.0
ms_pacman 11,255.2 6,135.4
name_this_game 25,783.3 23,829.9
phoenix 224,491.1 188,788.5
pitfall -0.6 -273.3
pong 20.9 18.7
private_eye 49.8 864.7
qbert 302,391.3 380,152.1
riverraid 63,864.4 49,982.8
road_runner 222,234.5 127,111.5
robotank 73.8 68.5
seaquest 392,952.3 377,179.8
skiing -10,789.9 -11,359.3
solaris 2,892.9 3,115.9
space_invaders 54,681.0 50,699.3
star_gunner 434,342.5 432,958.0
surround 7.1 5.5
tennis 23.9 23.0
time_pilot 87,085.0 71,543.0
tutankham 272.6 127.7
up_n_down 401,884.3 347,912.2
venture 1,813.0 935.5
video_pinball 565,163.2 873,988.5
wizard_of_wor 46,204.0 46,897.0
yars_revenge 148,594.8 131,701.1
zaxxon 42,285.5 37,672.0
Table 2: Scores obtained by Ape-X DQN in final evaluation, under the standard no-op starts and human starts regimes. In some games the scores are higher than in the training curves: this is because the maximum episode length is shorter during training.