TorchBeast: A PyTorch Platform for Distributed RL

10/08/2019 ∙ by Heinrich Küttler, et al. ∙ 26

TorchBeast is a platform for reinforcement learning (RL) research in PyTorch. It implements a version of the popular IMPALA algorithm for fast, asynchronous, parallel training of RL agents. Additionally, TorchBeast has simplicity as an explicit design goal: We provide both a pure-Python implementation ("MonoBeast") as well as a multi-machine high-performance version ("PolyBeast"). In the latter, parts of the implementation are written in C++, but all parts pertaining to machine learning are kept in simple Python using PyTorch, with the environments provided using the OpenAI Gym interface. This enables researchers to conduct scalable RL research using TorchBeast without any programming knowledge beyond Python and PyTorch. In this paper, we describe the TorchBeast design principles and implementation and demonstrate that it performs on-par with IMPALA on Atari. TorchBeast is released as an open-source package under the Apache 2.0 license and is available at <>.



There are no comments yet.


page 1

page 2

page 3

page 4

Code Repositories


A PyTorch Platform for Distributed RL

view repo
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Reinforcement learning has recently had a surge of interest thanks to the rise of deep learning and new GPU hardware, conquering important challenges such as chess, Go, and other board games 

(go, ; chessshogi, ), demonstrating the ability to learn policies on visual inputs (mnih2015human, ; levinerobot, ), and tackle strategically complex environments (gehring2018high, ; alphastarblog, ; OpenAI_dota, ) as well as multi-agent settings (foerster2017stabilising, ; lowe2017multi, ). However a lack of well-written, high-performance, scalable implementations of distributed RL architectures has hindered the reproduction of published work, and largely restricted the development of new work to a few organizations with the required know-how. For model-free reinforcement learning with discrete action spaces, approaches built on top of the IMPALA agent DBLP:journals/corr/abs-1802-01561 have achieved prominence for domains like StarCraft II alphastarblog or first-person shooter games Jaderberg859

. While an authoritative implementation of the IMPALA agent built on TensorFlow 

tensorflow2015-whitepaper has been released as open source software, researchers preferring PyTorch had fewer options. TorchBeast aims to help leveling the playing field by being a simple and readable PyTorch implementation of IMPALA, designed from the ground-up to be easy-to-use, scalable, and fast.

2 Background

We consider the case of a Markov Decision Process (MDP) with a state space

, a set of actions and a transition function

specifying the probability distribution over next states given a state and action. The agent receives rewards

and attempts to maximize the cumulative expected return within one episode for a discount factor .

IMPALA (DBLP:journals/corr/abs-1802-01561, ) uses an actor-critic variant to update parameters of a policy

as well as a value-function estimate

with parameters . Importantly, IMPALA is an off-policy method, maintaining a behavior policy that acts in the environment and collects traces of experience, as well as the current policy that we aim to update. This update is achieved via calculating the policy gradient using an importance weight between the behavior and current policy, thereby correcting for the off-policiness of the behavior policy with respect to the current policy. The specific off-policy correction method used by IMPALA, V-trace, is more stable than to other such methods for actor-critic agents. We refer to (DBLP:journals/corr/abs-1802-01561, ) for details).

Actors, learner and rollouts.

The IMPALA architecture consists of a single learner and several actors. Each actor produces rollouts in an indefinite loop. A rollout consists of unroll_length

many environment-agent interactions. The learner then consumes batches of these rollouts. A typical learner input might be a Python dictionary of the form python "observation": tensor(T, B, *obs_shape, dtype=torch.uint8), "reward": tensor(T, B), "done": tensor(T, B, dtype=torch.uint8), "policy_logits": tensor(T, B, num_actions), "baseline": tensor(T, B), "action": tensor(T, B, dtype=torch.int64), Here,

tensor(T, B) is a tensor of shape (T, B), where T is the unroll length, B is the batch size, obs_shape is a tuple of the observation shape (e.g., in the case of Atari with a frame stacking of the last frames) and num_actions is the number of discrete actions of the environment.

In order to facilitate fast experiment turnaround times, the number of actors should be large enough as to saturate the learner infeed, i.e., batches should be generated fast enough for the learner GPU to be fully utilized. Speed is an important design goal of TorchBeast, but not its only one. We will give an overview on our design principles in the next section.

3 TorchBeast Design Principles

Ideally, researchers should be able to prototype their ideas quickly without the mental overhead of low-level languages, but also without the computational overhead of Python in places where it would drastically impact performance. There is a tension between those two goals. Building frameworks with performance in mind can result in rigid constrains that reduce how fast researchers can implement their ideas or even impair their research directions. While TorchBeast necessarily relies on engineering assumptions as well, we employ a few design principles meant to offer researchers maximal leverage when implementing new ideas:

TorchBeast is not a framework.

The TorchBeast repository implements a certain type of agent and environment using the IMPALA architecture. It is not meant to be imported as a dependency but to be forked and modified in whatever way necessary for a specific research goal. Compared to traditional software engineering, the short half-life of research code makes this approach more natural in the domain of deep reinforcement learning.

All machine learning code is in Python.

Although the PolyBeast variant of TorchBeast uses C++ components for its queuing and batching logic, researchers should generally have no need to touch those parts. In the special cases where they do, the changes necessary should not involve digging through many layers of abstractions in the codebase.

One file to rule them all.

While not strictly packaged as a single file, TorchBeast tries to stay close to a “one file only” ideal. To take PolyBeast as an example, all agent code lives in while the environment code lives in

. No other files need to be touched to swap out either the agent neural network model or the specific environment used for training.

Adapting TorchBeast to research needs

As a simple example of how to adapt PolyBeast to specific research needs, we show pseudocode for changing the environment from Atari to the MinAtar task suite young19minatar , a set of grid-world Atari-like environments. A user would fork TorchBeast, and modify the following two parts only: (1) In, change the create_env function to return a MinAtar environment, see Figure 1. (2) In, change the neural network model to use a smaller ConvNet, see Figure 2.

python def create_env(flags): return atari_wrappers.wrap_pytorch( atari_wrappers.wrap_deepmind( atari_wrappers.make_atari(flags.env), clip_rewards=False, frame_stack=True, scale=False))

(a) In default

python def create_env(flags): env = minatar.Environment(flags.env) env = MinAtarEnv(env) # Gym wrapper. return env

(b) In MinAtar fork of
Figure 1: Changing TorchBeast to use MinAtar: Changes in

python class MinAtarNet(nn.Module): def __init__(self, num_actions, use_rnn=False): # … self.conv = nn.Conv2d(4, 16, kernel_size=3, stride=1) self.core = nn.Linear(num_linear_units, 128) self.policy = nn.Linear(128, num_actions) self.baseline = nn.Linear(128, 1)

def forward(self, inputs, core_state=()): # … return (action, policy_logits, baseline), core_state

Figure 2: Changing TorchBeast to use MinAtar: New model in
More complex changes

Some research directions have more specific needs. For example, when using TorchBeast to train RL agents for network congestion control mvfst-rl , the roles of clients and servers in TorchBeast needed to be reversed due to technical limitations of the simulator used as an environment. This was easily achieved by forking the TorchBeast repository and modifying the “actor pool” logic. Another example of a larger extension involving changing logic in C++

 would be moving the rollout logic from doing cross-episode batches to doing padding (by guaranteeing that each batch contains data from at most one episode, this makes using certain models such as attention easier). However, while such changes are straight-forward to do, we believe that most research needs do not fall into this category and can readily use TorchBeast by changing the agent parameterization and environment.

4 Experiments

We test TorchBeast on the classic Atari suite bellemare13arcade

. The hyperparameters, network architecture, and optimization procedure are taken from the IMPALA paper 

(DBLP:journals/corr/abs-1802-01561, , Table G.1). We run PolyBeast on a single Nvidia Quadro GP100 GPU using 25 CPU cores for the 48 environments. For the agent model, we use a version of the “deep network” without an LSTM from the IMPALA paper and train for 200 million frames per environment and run (corresponding to 50 million “agent steps” due to action repetitions).

Figure 3: TorchBeast (blue) and TensorFlow IMPALA (red) runs on several Atari levels (first set).

Figure 4: TorchBeast (blue) and TensorFlow IMPALA (red) runs on several Atari levels (second set).

For the environments, we use the package and API made available in OpenAI Gym 1606.01540 111We exclude the Atari game Defender due to bugs in the Python 3.6 version of its OpenAI Gym environment; Python 3.6 is the latest Python version for which TensorFlow version 1.9.0, required by the open source version of TensorFlow IMPALA, is available. with a set of default “environment wrappers” provided by OpenAI (see openai/baselines, , file baselines/common/

. These wrappers provide preprocessing functionality common for doing RL on Atari. They include action repetitions, frame stacking, frame warping (resizing), frame “max-pool-and-skip”, random no-ops at the beginning of the game as well as functionality to end the RL episode on life loss. We note that researchers often report total episodic returns according to the original definition of episodes in Atari but train with the end-of-life definition of episodes. This makes sense when viewing the end-of-life episodes as a training detail but also increases the complexity of the training and logging setup. For simplicity, we train and report returns according to the same end-of-life episode definition in our experiments. This reduces the total episode returns reported by an average factor of the number of lives in each Atari game.

We also took DeepMind’s open source TensorFlow IMPALA implementation from GitHub and modified it slightly to be compatible with Python 3 and operate with the Gym API.222The slightly modified code can be found at The results of TorchBeast and TensorFlow IMPALA are reported in Figures 3 and 4 and demonstrate that both implementations are on par for these tasks. We note that the scores obtained by either implementation for many environments fall short of the numbers reported in (DBLP:journals/corr/abs-1802-01561, , Table C.1). We believe this is due to the question of episode definitions mentioned above and possibly due to different environment preprocessing more generally.333Anecdotally, even the resizing/downscaling method can impact the performance of RL agents. Our results are equivalent to DeepMind’s IMPALA implementation when using the same preprocessing and episode definitions.

Besides training performance, PolyBeast is also on par with TensorFlow IMPALA when it comes to throughput (measured in consumed frames per second) and thus experiment duration in wallclock time.

5 Implementation

TorchBeast comes in two variants, dubbed “MonoBeast” and “PolyBeast”. The main purpose of the MonoBeast variant is to be easy to setup and get started with (no major dependencies besides Python and PyTorch are required). PolyBeast, on the other hand, utilizes Google’s gRPC library grpc for inter-process and transparent cross-machine communication. It also implements the heavy-duty operations such as batching as a Python extension module written in C++. This allows us to implement advanced features such as dynamic batching at the cost of a more complex installation procedure. Either version uses multiple processes to work around technical limitations of multithreaded Python programs, see Section 5.3 below for details.

5.1 MonoBeast

Since its earliest versions, PyTorch has support for moving tensors to shared memory. In MonoBeast, we utilize this feature in an algorithm that is roughly described as:

  • Create num_buffers sets of rollout buffers, each of them containing shared-memory tensors without a batch dimension, e.g., python buffers[0][’frame’] = torch.empty(T, *obs_shape, torch.uint8)

  • Create two shared queues, free_queue and full_queue. These queues will communicate integers using unix pipes.

  • Start num_actors many actor processes, each with a copy of the environment. Each actor dequeues an index from free_queue and writes a batch-slice with rollout data into buffers[index], then enqueues index to full_queue and dequeues the next index.

  • The main process has several learner threads, each of which

    1. Dequeues batch_size-many indices from full_queue, stacks them together into a batch and moves them to the GPU, puts the indices back into free_queue; then

    2. sends that batch through the model, computes losses, does a backward pass, and hogwild-updates the weights.

While simple, MonoBeast requires a relatively large amount of constantly allocated shared memory, does model evaluations on the actors on CPU instead of GPU, and involves a number of tensor copies not strictly necessary. It is also limited to a single machine. To overcome these downsides, we developed PolyBeast.

5.2 PolyBeast

There are two kinds of PolyBeast processes: A number of environment servers and a single learner process.

Environment servers, once running, wait for incoming gRPC connections and when a client learner process connects, create a new copy of the environment to serve to the client while the bidirectional streaming connection lasts. In this bidirectional stream, an environment server sends out observations, rewards and some book-keeping data like a tensor indicating whether the current episode ended. The client in turn responds with actions. In our implementation, the environment servers load environments via a Python function that returns environment objects compatible with the OpenAI Gym interface. In order to not suffer from GIL contention (see section 5.3), users should therefore limit the number of parallel connections per server.

The learner process starts a number of actor threads (in C++) to connect to the environment servers. To facilitate “inference” evaluations of the received observations on the GPU, these observations should be dynamically batched. TorchBeast implements a version of the dynamic batching functionality present in DeepMind’s IMPALA implementation444See each actor thread appends the environment output data to a queue, the inference queue. Another part of the system is responsible for reading from this queue, evaluating a model (or otherwise creating a minibatch of actions for a given minibatch of observations) and setting the result. After unroll_length many interactions, the actor thread will concatenate the data and enqueue it to another queue, the learner queue. Another part of the system is responsible for dequeing from this queue and updating the model based on this batched rollout.

By using gRPC, PolyBeast transparently runs using either a single-machine or a distributed setup. Distributing the environments over several machines is necessary for large-scale experiments with computationally costly games like StarCraft II. For more economical environments, the bottleneck tends to be on the agent side (e.g., neural network evaluations, backward passes, or memory constrains).

The parts of PolyBeast responsible for reading from the inference and learner queues are precisely those involving machine learning logic and are written in Python for easy accessibility by researchers. In Python-like pseudocode, the PolyBeast agent process looks like this: python def main(): model = Model() optimizer = Optimizer()

inference_queue = DynamicBatcher(batch_dim=1) learner_queue = BatchingQueue(FLAGS.batch_size, batch_dim=1) actors = ActorPool(learner_queue, inference_queue, FLAGS.unroll_length, FLAGS.server_addresses)

inference_thread = threading.Thread(target=infer, args=(model, inference_queue)) inference_thread.start() # Starts threads connecting to environments, fills queues.

for env_outputs, actor_outputs in learner_queue: learner_outputs = model(env_outputs) loss = compute_loss(learner_outputs, actor_outputs, env_outputs) loss.backward() optimizer.step() print("One gradient descent step, loss was", loss)

if learning_done(): break

actors.stop() inference_queue.close() learner_queue.close() inference_thread.join()

def infer(model, inference_queue): for batch in inference_queue: env_outputs = batch.get_inputs() actor_outputs = model(env_outputs) batch.set_outputs(actor_outputs)

def compute_loss(learner_outputs, actor_outputs, env_outputs): … # Compute vtrace (or some other loss)

main() # Start program.

5.3 A Note on Python’s Global Interpreter Lock

The engineering decisions around TorchBeast and many other parallel RL architecture are heavily influenced by an implementation detail of CPython, the Python programming language’s widely used reference implementation. In CPython, the global interpreter lock (“GIL”) is a mutex protecting access to Python objects. Any thread executing Python bytecode needs to hold the GIL, hence only a single thread will do so at any point in time. This has been a core part of CPython’s design ever since its first multithreading support in 1997. While suggestions have been made to change CPython’s design in this regard, none have proved successful or seem likely to be adopted in the future.

Since the development of asynchronous actor-critic methods DBLP:journals/corr/MnihBMGLHSK16 , running multiple environments in parallel has been the norm in scalable RL architectures and Python implementations of any such algorithm had to contend with the GIL. Any naive multithreaded implementation of asynchronous RL methods will fail to scale beyond more than a few parallel environments. Since Python is the most popular language for machine learning by a wide margin, several workarounds have been proposed and used by the community. An obvious candidate is to use separate processes and inter-process communication based on unix sockets, network communication, shared file handles or /dev/shm-like shared-memory buffers. Another possibility is to use multiple Python sub-interpreters in a single process. Another approach is to in essence leave Python and move to “in-graph” environments in a system like TensorFlow either via explicitly defined graphs or jit-compiled operations. From an engineering perspective, all of these workarounds constitute a sizable increase in complexity as well as a likely drop in efficiency due to an increased number of copy operations and other overhead.

In the case of TorchBeast, we opted for multi-processing using (1) sockets and shared-memory for MonoBeast, and (2) RPC calls and C++ threads for PolyBeast.

6 Conclusion

We open-source TorchBeast, a platform for reinforcement learning research implementing the popular IMPALA agent. Our design aims are to be simple, fast and amenable to new research needs. We provide an implementation in plain Python, MonoBeast, as well as a high-performance version, PolyBeast, capable of large-scale experiments across several machines. In either version, all machine learning parts are implemented in Python using PyTorch.

We evaluated TorchBeast on the Atari task suite and compared its performance to the TensorFlow IMPALA implementation published by its authors. When using the same environment preprocessing, both implementations achieve equivalent performance in terms of throughput, data efficiency, stability, as well as final performance.

We discussed some engineering aspects for reinforcement learning agents as well as examples of particular research directions and the modifications of TorchBeast necessary to pursue them. We believe TorchBeast provides a promising basis for reinforcement learning research without the rigidity of static frameworks or complex libraries.


We would like to thank Soumith Chintala, Joe Spisak and the whole PyTorch team for their support, and Alexander Miller, Pierre-Emmanuel Mazaré, Roberta Raileanu, Victor Yuan Zhong and especially Jeremy Reizenstein for their comments and insightful discussions.


  • [1] Lasse Espeholt, Hubert Soyer, Rémi Munos, Karen Simonyan, Volodymyr Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, and Koray Kavukcuoglu. IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures. CoRR, abs/1802.01561, 2018.
  • [2] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
  • [3] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.
  • [4] David Silver, Aja Huang, Christopher Maddison, Arthur Guez, Laurent Sifre, George Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529:484–489, 01 2016.
  • [5] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144, 2018.
  • [6] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
  • [7] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
  • [8] Jonas Gehring, Da Ju, Vegard Mella, Daniel Gant, Nicolas Usunier, and Gabriel Synnaeve. High-level strategy selection under partial observability in starcraft: Brood war. arXiv preprint arXiv:1811.08568, 2018.
  • [9] Oriol Vinyals, Igor Babuschkin, Junyoung Chung, Michael Mathieu, Max Jaderberg, Wojtek Czarnecki, Andrew Dudzik, Aja Huang, Petko Georgiev, Richard Powell, Timo Ewalds, Dan Horgan, Manuel Kroiss, Ivo Danihelka, John Agapiou, Junhyuk Oh, Valentin Dalibard, David Choi, Laurent Sifre, Yury Sulsky, Sasha Vezhnevets, James Molloy, Trevor Cai, David Budden, Tom Paine, Caglar Gulcehre, Ziyu Wang, Tobias Pfaff, Toby Pohlen, Dani Yogatama, Julia Cohen, Katrina McKinney, Oliver Smith, Tom Schaul, Timothy Lillicrap, Chris Apps, Koray Kavukcuoglu, Demis Hassabis, and David Silver. AlphaStar: Mastering the Real-Time Strategy Game StarCraft II., 2019.
  • [10] OpenAI. Openai five., 2018.
  • [11] Jakob Foerster, Nantas Nardelli, Gregory Farquhar, Triantafyllos Afouras, Philip HS Torr, Pushmeet Kohli, and Shimon Whiteson. Stabilising experience replay for deep multi-agent reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1146–1155. JMLR. org, 2017.
  • [12] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, OpenAI Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pages 6379–6390, 2017.
  • [13] Max Jaderberg, Wojciech M. Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia Castañeda, Charles Beattie, Neil C. Rabinowitz, Ari S. Morcos, Avraham Ruderman, Nicolas Sonnerat, Tim Green, Louise Deason, Joel Z. Leibo, David Silver, Demis Hassabis, Koray Kavukcuoglu, and Thore Graepel. Human-level performance in 3d multiplayer games with population-based reinforcement learning. Science, 364(6443):859–865, 2019.
  • [14] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems, 2015. Software available from
  • [15] Kenny Young and Tian Tian. Minatar: An atari-inspired testbed for thorough and reproducible reinforcement learning experiments. arXiv preprint arXiv:1903.03176, 2019.
  • [16] Viswanath. Sivakumar, Tim Rocktäschel, Alexander H. Miller, Heinrich Küttler, Nantas Nardelli, Mike Rabbat, Joelle Pineau, and Sebastian Riedel. mvfst-rl: An Asynchronous RL Framework for Congestion Control with Delayed Actions. 2019.
  • [17] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents.

    Journal of Artificial Intelligence Research

    , 47:253–279, Jun 2013.
  • [18] Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. OpenAI Baselines., 2017.
  • [19] Google Inc. gRPC: A high performance, open-source universal RPC framework., 2015.
  • [20] Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous Methods for Deep Reinforcement Learning. CoRR, abs/1602.01783, 2016.