Since the advent of deep reinforcement learning for game play in 2013 DQN and simulated robotic control shortly after (e.g. trpo ), a multitude of new algorithms have flourished. Most are model-free algorithms which can be categorized into three families: deep Q-learning, policy gradients, and Q-value policy gradients. These have developed along separate lines of research, such that few, if any, code bases incorporate all three kinds. In fact, many of the original implementations remain unreleased. As a result, practitioners often must develop from different starting points and potentially learn a new code base for each algorithm of interest or baseline comparison. RL researchers often reimplement algorithms–perhaps a valuable individual exercise, but one that incurs redundant effort across the community, or worse, one that presents a barrier to entry. Yet these algorithms share a great depth of common deep reinforcement learning machinery. We are pleased to share rlpyt, which implements all three algorithm families built on a shared, optimized infrastructure, in a single repository. rlpyt contains modular implementations of many common deep RL algorithms in Python using PyTorch pytorch , a leading deep learning library. Among numerous existing implementations, rlpyt is a more comprehensive open-source resource for researchers. rlpyt is available at https://github.com/astooke/rlpyt.
rlpyt is designed as a high-throughput code base for small- to medium-scale research in deep RL (large-scale being DeepMind AlphaStar alphastar or OpenAI Five OpenAI_dota , for example). This white paper summarizes its features, algorithms implemented, and relation to prior work. A small selection of learning curves is provided to verify learning performance for some standard RL environments in discrete and continuous control. Notably, rlpyt reproduces record-setting results in the Atari domain from “Recurrent Experience Replay in Distributed Reinforcement Learning” (R2D2) r2d2 . This benchmark requires on the order of 30 billion frames of game play and 1 million network updates, which rlpyt achieves in reasonable time without the use of distributed compute infrastructure. Compatibility with the OpenAI Gym interface provides access to many existing learning environments and allows new ones to be freely customized. This paper also introduces the "namedarraytuple", a new data structure for handling collections of arrays, which may be of outside interest. Finally, more detailed implementation and usage notes are provided.
1.1 Key Features and Algorithms
Key capabilities and features include:
Run experiments in serial mode (helpful for debugging, sufficient for some experiments).
Run experiments parallelized, with options for parallel sampling and/or multi-GPU optimization.
Sampling and optimization synchronous or asynchronous (via replay buffer).
Use CPU or GPU for training and/or batched action selection during environment sampling.
Full support for recurrent agents.
Online or offline evaluation and logging of agent diagnostics during training.
Includes launching utilities for stacking / queueing sets of experiments on a local computer.
Modularity for easy modification and re-use of existing components.
Compatible with OpenAI Gym openai_gym environment interface (see implementation details for the required modification).
Implemented algorithms include the following (check the repository for possible additions):
Replay buffers support both the DQN and Q-function policy gradient algorithms and include the following options: n-step returns; sequence replay (for recurrence); periodic storage of recurrent state (to save memory); prioritized replay (sum tree) prioritized ; frame-based buffer, to save memory e.g. by storing only unique Atari frames.
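As an illustration of the n-step return option, the following minimal sketch computes n-step discounted returns from a stored reward sequence, truncating each sum at an episode boundary. The function name `n_step_returns` and its arguments are hypothetical and do not reflect rlpyt's replay buffer API; bootstrapping from a value estimate at step t+n is left to the caller.

```python
def n_step_returns(rewards, dones, n, discount):
    """Compute n-step returns for every start index with a full window.

    returns[t] = sum_{k=0}^{n-1} discount**k * rewards[t + k], where the sum
    stops at the first step whose done flag is set.
    done_n[t] is True if an episode ended inside the window starting at t.
    """
    returns, done_n = [], []
    for t in range(len(rewards) - n + 1):
        ret, ended = 0.0, False
        for k in range(n):
            ret += (discount ** k) * rewards[t + k]
            if dones[t + k]:
                ended = True  # do not accumulate rewards past the episode end
                break
        returns.append(ret)
        done_n.append(ended)
    return returns, done_n


# Toy data (assumed for illustration): the episode ends at index 3.
rewards = [1.0, 0.0, 2.0, 1.0, 3.0]
dones = [False, False, False, True, False]
returns, done_n = n_step_returns(rewards, dones, n=2, discount=0.5)
print(returns, done_n)
```

When `done_n[t]` is set, an algorithm would typically skip the bootstrap term for that sample.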
2 Parallel Computing Infrastructure for Faster Experimentation
The two phases of model-free RL–sampling environment interactions and training the agent–can be parallelized differently. rlpyt addresses both, as described here. In all arrangements, system shared memory underlies inter-process communication of training data and model parameters, minimizing data transfer time and memory footprint.
For sampling, rlpyt offers the following configurations, also depicted in Figure 1.
Serial. Sampling occurs in the master process and can run one or more environment instances. The built-in agent uses the same model for sampling and for optimization, so if optimizing on the GPU, action-selection during sampling also uses the GPU, batched over all environments. (If running many time steps per batch with GPU optimization, it may be faster to use a separate CPU model for action-selection.)
Parallel-CPU. The sampler launches worker processes to run environments and perform action selection. If optimizing on the GPU, model parameters are copied to shared memory for CPU action selection in workers. Synchronization across workers only occurs per sampling batch.
Parallel-GPU. The sampler launches worker processes to run environments, and observations are communicated back to the master process for action selection, which will use the GPU if optimizing on GPU. All the environments’ observations are batched together for one call to the agent. Step-wise communication happens via another shared memory buffer, and light-weight semaphores enforce synchronization across workers at every simulation batch-step.
Alternating-GPU. Like parallel-GPU sampling but with two groups of workers; one group steps environments while the other group awaits action-selection. May provide speedups when the action-selection time is similar to but shorter than the batch environment simulation time.
Synchronous multi-GPU optimization is implemented using PyTorch’s DistributedDataParallel to wrap the model. A separate Python process drives each GPU. As provided by PyTorch, NCCL is used to all-reduce every gradient, which can occur in chunks concurrently with backpropagation, for better scaling on large models. The same applies for multi-CPU optimization, using DistributedDataParallelCPU and the “gloo” backend (may be faster than MKL threading for multiple CPU cores). The arrangement is shown in Figure 2. The entire sampling-training stack is replicated in each process and no training data is shared among them. Any of the serial or parallel samplers can be used.
2.3 Asynchronous Sampling-Optimization
In the configurations depicted so far, the sampler and optimizer operate sequentially in the same Python process. In some cases, however, running optimization and sampling asynchronously achieves better hardware utilization, by allowing both to run continuously. In asynchronous mode, separate Python processes run the training and sampling, tied together by a replay buffer built on shared memory. Sampling runs uninterrupted by the use of a double buffer for data batches, which yet another Python process copies into the main buffer, under a read-write lock. This is shown in Figure 3. The optimizer and sampler may be parallelized independently, perhaps each using a different number of GPUs, to achieve best overall utilization and speed.
Some level of control between the processes is maintained. A desired maximum replay ratio can be specified (rate of consumption divided by rate of generation of training data), and the optimizer will be throttled not to exceed this value. The sampler batch size (time-steps) determines rate of actor model update, if new parameters are available from the optimizer. All actors use the same parameters.
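The throttling rule can be sketched as follows, assuming the replay ratio is measured in samples (updates times training batch size, divided by samples generated). The function names and the counting convention are illustrative assumptions, not rlpyt's implementation.

```python
def replay_ratio(samples_consumed, samples_generated):
    """Rate of consumption divided by rate of generation of training data."""
    return samples_consumed / max(samples_generated, 1)


def optimizer_may_update(updates_done, batch_size, samples_generated,
                         max_replay_ratio):
    """True if one more update keeps the replay ratio within the maximum."""
    consumed_after = (updates_done + 1) * batch_size
    return consumed_after <= max_replay_ratio * samples_generated


# Toy numbers (assumed): the sampler has produced 100 samples so far and the
# optimizer trains on batches of 32 with a maximum replay ratio of 1.
updates = 0
while optimizer_may_update(updates, batch_size=32, samples_generated=100,
                           max_replay_ratio=1.0):
    updates += 1  # a real optimizer would perform a gradient step here

print(updates, replay_ratio(updates * 32, 100))
```

In an asynchronous setting the optimizer would wait for the sampler to generate more data once the check fails, rather than exit the loop.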
2.4 Which Configuration is Best?
When creating or modifying agents, models, algorithms, and environments, serial mode will be the easiest for debugging. Once that runs smoothly, it is straightforward to explore the more sophisticated infrastructures for parallel sampling, multi-GPU optimization, and asynchronous sampling, since they are built on largely the same interfaces. Of course, deviations from the standard RL work-flow (i.e. the runner) may require more care to parallelize–again it is recommended to start with the serial case. The optimal configuration may depend on the problem, available compute hardware, and the number of experiments to run. Currently, rlpyt implements only single-node parallelism, but its components could form building blocks for a distributed framework.
3 Learning Performance
This section presents learning curves which verify the performance of the implementations against published values. A subset of standard Atari games bellemare2013arcade and Mujoco mujoco continuous control environments are shown. This is neither a comprehensive benchmark nor a guide to scaling, but merely an exercise of each algorithm and infrastructure component. For Atari scaling guidelines, see e.g. accel_rl ; for Mujoco, d4pg is a likely starting point.
3.1 Mujoco: Continuous Control from State
Here we present reinforcement learning algorithms applied to continuous control from state on a selection of Mujoco (mujoco200) tasks in OpenAI Gym. For each algorithm, we used the same published hyperparameters across all environments and ran serial implementations.
3.2 Atari: Discrete Control from Vision
R2D1. We highlight rlpyt’s reproduction of the state-of-the-art performance of R2D2 r2d2 , which was previously only feasible using distributed computing. This benchmark includes a recurrent agent trained from a replay buffer for on the order of 10 billion samples (40 billion frames). R2D1 (non-distributed R2D2) exercises several of rlpyt’s more advanced infrastructure components to achieve this, namely multi-GPU asynchronous sampling mode with the alternating sampler. In Figure 7 we reproduce several learning curves which surpass any previous algorithm. Some slight differences in performance against published values most likely resulted from a difference in the prioritization for new samples, which affected some games more than others, and a slightly lower replay ratio.
Regarding prioritization: most curves used 1-step TD errors for prioritizing new samples and had unintentionally swapped the two replay priority coefficients. Furthermore, since collection ran in 40 time-step batches but training used 80-step sequences, we used only half the training segment to compute new priorities. Gravitar was especially sensitive and improved when we corrected to use 5-step TD initial priorities and by using the second half-batch, yet this run still plateaued at a low score, below 6,000. Work to remedy this continues.
Regarding the replay ratio: we used a replay ratio of 1, including the warmup samples, whereas the original authors ran a replay ratio near 0.8 counting only the training samples; by their counting we ran at 0.67. Given the low replay ratio, initial priorities are very important in some games.
The original, distributed implementation of R2D2 quoted about 66,000 steps per second (SPS) using 256 CPUs for sampling and 1 GPU for training. rlpyt achieves over 16,000 SPS when using only 24 CPUs (2x Intel Xeon Gold 6126, circa 2017) and 3 Titan-Xp GPUs in a single workstation (one GPU for training, two for action-serving in the alternating sampler). This may be enough to enable experimentation without access to distributed infrastructure. One possibility for future research is to increase the replay ratio (here set to 1) for faster learning using multi-GPU optimization. Figure 8 shows the same learning curve over three different measures: environment steps (i.e. 1 step = 4 frames), model updates, and time. This run reached 8 billion steps and 1 million updates in less than 138 hours.
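As a quick consistency check of these throughput figures, the short calculation below relates steps, frames, steps per second, and wall-clock time. It uses only the numbers quoted above; the variable names are illustrative.

```python
env_steps = 8e9        # environment steps reached in the run
frames_per_step = 4    # 1 step = 4 frames
rlpyt_sps = 16_000     # "over 16,000 SPS"
r2d2_sps = 66_000      # quoted for the distributed implementation
hour_limit = 138       # "less than 138 hours"

frames = env_steps * frames_per_step                  # total frames of game play
hours_at_16k = env_steps / rlpyt_sps / 3600           # time at exactly 16,000 SPS
min_sps_for_limit = env_steps / (hour_limit * 3600)   # SPS needed for 138 hours
throughput_ratio = r2d2_sps / rlpyt_sps

print(f"frames: {frames:.2e}")
print(f"hours at 16,000 SPS: {hours_at_16k:.1f}")
print(f"SPS needed for {hour_limit} hours: {min_sps_for_limit:.0f}")
print(f"R2D2 / rlpyt throughput: {throughput_ratio:.3f}")
```

At exactly 16,000 SPS the run would take just under 139 hours, so finishing in less than 138 hours implies an average slightly above 16,100 SPS, in line with "over 16,000". The 8 billion steps correspond to 32 billion frames, matching the "order of 30 billion frames" scale of the benchmark.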
4 New Data Structure: namedarraytuple
rlpyt introduces new object classes "namedarraytuples" for easier organization of collections of numpy arrays or torch tensors. A namedarraytuple is essentially a namedtuple which exposes indexed or sliced read/writes into the structure. Consider writing into a (possibly nested) dictionary of arrays which share some common dimensions for addressing: the code must loop over the keys, recurse into any nested dictionaries, and perform the indexed write separately on each array.
With namedarraytuples, this code is replaced by a single indexed assignment of the form dest[index] = src.
Importantly, the syntax is the same whether dest and src are individual numpy arrays or arbitrarily-structured collections of arrays. The structures of dest and src must match, or src can be a single value to apply to all fields, and None is a special placeholder value for fields to ignore. rlpyt uses this data structure extensively–different elements of training data are organized with the same leading dimensions, making it easy to interact with desired time- or batch-dimensions.
This is also intended to support environments with multi-modal observations or actions. Rather than flattening and merging, say, camera images and joint-angles into one observation vector, the environment can store them as-is into a namedarraytuple for the observation. In the forward method of the model, observation.joint and observation.image can be fed into the desired layers, without changing intermediate infrastructure code. For more details, see the code and documentation for namedarraytuples in rlpyt/utils/collections.py.
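To make the idea concrete, here is a minimal, dependency-free sketch of a namedarraytuple-like container next to the dictionary recursion it replaces. It is not rlpyt's implementation: the factory `array_tuple`, the helper `write_nested`, and the field names are hypothetical, plain Python lists stand in for numpy arrays, and the sketch assumes Python 3.8 or later (where namedtuple field access does not go through `__getitem__`).

```python
from collections import namedtuple


def write_nested(dest, src, index):
    """Dictionary baseline: recurse by hand to write src into dest at index."""
    for key, value in src.items():
        if isinstance(dest[key], dict):
            write_nested(dest[key], value, index)
        else:
            dest[key][index] = value


def array_tuple(typename, field_names):
    """Return a namedtuple subclass whose [] reads/writes apply to all fields."""
    base = namedtuple(typename, field_names)

    class ArrayTuple(base):
        __slots__ = ()

        def __getitem__(self, index):
            # Index every field and rebuild the same structure.
            return type(self)(*(field[index] for field in self))

        def __setitem__(self, index, value):
            if not isinstance(value, type(self)):
                # A single value is applied to all fields.
                value = (value,) * len(self._fields)
            for field, val in zip(self, value):
                if val is not None:  # None marks a field to ignore
                    field[index] = val  # recurses if field is an ArrayTuple

    ArrayTuple.__name__ = typename
    return ArrayTuple


# Multi-modal observation nested inside a samples structure (names assumed).
Obs = array_tuple("Obs", ["image", "joint"])
Samples = array_tuple("Samples", ["observation", "reward"])

dest = Samples(Obs([0, 0, 0, 0], [0.0, 0.0, 0.0, 0.0]), [0.0, 0.0, 0.0, 0.0])
src = Samples(Obs([7, 8], [0.1, 0.2]), [1.0, 2.0])

dest[1:3] = src                 # one statement writes every nested field
step = dest[2]                  # indexed read keeps the structure
dest[0] = Samples(None, 5.0)    # None leaves the observation untouched

# The same slice write on nested dictionaries needs the explicit recursion.
nested = {"observation": {"image": [0, 0, 0, 0], "joint": [0.0] * 4},
          "reward": [0.0] * 4}
write_nested(nested, {"observation": {"image": [7, 8], "joint": [0.1, 0.2]},
                      "reward": [1.0, 2.0]}, slice(1, 3))
```

In the model, `step.observation.image` and `step.observation.joint` could then be routed to different layers, as described above.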
The use of namedtuples and namedarraytuples may incur some programming overhead during setup or modification of agents, algorithms, and environments. For example, for serialization they must be defined at the module level, which can be accomplished dynamically via the use of a global variable (see the Gym wrappers). (The only built-in use of serialization for samples data is the option of dropping into a subprocess while generating initial examples for buffer allocation. Model forward execution triggers MKL OpenMP threading initialization which can affect subprocesses thereafter. For example, parallel-CPU sampler agents should be initialized with 1 MKL thread if on 1 CPU core, whereas the optimizer might use multiple cores and threads. Incidentally, most rlpyt subprocesses call torch.set_num_threads(1) to avoid hanging on MKL, which might not be fork-safe.) A benefit of these explicitly-defined interfaces is that they reduce the chance of mistake by omission or replacement of a shared-memory buffer element by local memory.
5 Related Work
For newcomers to deep RL, other resources may be better for familiarization with algorithms, such as OpenAI Spinning Up spinningup (https://spinningup.openai.com/en/latest/index.html, https://github.com/openai/spinningup). rlpyt is a revision and extension of the accel_rl codebase (https://github.com/astooke/accel_rl), which explored scaling RL in the Atari domain using Theano theano ; see accel_rl for results. For a further study of scaling in deep learning including RL, see mccandlish2018empirical . rlpyt and accel_rl were originally inspired by rllab rllab (https://github.com/rll/rllab; for example, the logger remains nearly a direct copy). Other published research code bases include OpenAI Baselines and Dopamine, both of which are implemented in Tensorflow tensorflow , and neither of which is optimized to the extent of rlpyt or contains all three algorithm families. Rllib rllib , built on top of Ray ray , focuses on distributed computing, possibly complicating small experiments. Facebook Horizon horizon offers a subset of algorithms and focuses on applications toward production at scale. In sum, rlpyt provides modular implementations of more algorithms and modular infrastructure for parallelism, making it a distinct tool set supporting a wide range of research uses.
6 Implementation and Usage Details
To get started, it is recommended to follow the example scripts provided in the repository and read the notes therein. The following is a conceptual overview without code.
6.1 Code Structure
The following tree and descriptions summarize the structure of classes and interfaces.
Runner - Connects the sampler, agent, and algorithm; manages the training loop and logging of diagnostics.
Sampler - Manages agent-environment interaction to collect training data; can initialize parallel workers.
Collector - Steps environments (and maybe operates agent) and records samples.
Environment - The task to be learned. As in previous implementations, at each step outputs: (observation, reward, done, env_info).
Observation/Action Space - Interface specifications from environment to agent.
TrajectoryInfo - Diagnostics logged on a per-trajectory basis.
Agent - Chooses control action to the environment in sampler; trained by the algorithm; interface to model; holds model recurrent state during sampling. As in previous implementations, at each step outputs (action, agent_info).
Model - PyTorch neural network module accepting (observation, prev_action, prev_reward) and possibly initial_rnn_state arguments.
Distribution - Samples actions for stochastic agents; defines related formulas for loss functions.
Algorithm - Uses gathered samples to train the agent, e.g. defines a loss function and performs gradient descent.
Optimizer - Training update rule (e.g. Adam) for model parameters.
OptimizationInfo - Diagnostics logged on a per-training batch basis.
Logger - Available throughout all processes and classes for recording printed statements and/or tabular values.
6.2 No Asynchronous Optimization
Recent projects in large-scale RL, such as OpenAI Five OpenAI_dota and DeepMind AlphaStar alphastar , have succeeded using synchronous multi-device optimization (meaning every gradient is all-reduced across devices, which hold the same parameter values). Previous experience in accel_rl found good scaling of asynchronous, multi-GPU A3C and PPO on Atari using a CPU parameter store, but this technique did not scale as well to larger networks with more training updates, such as in DQN. Therefore, rlpyt currently does not include asynchronous optimization schemes such as those in A3C ; hogwild .
6.3 Recurrent Agents
All agents receive the (observation, previous_action, previous_reward) inputs (see e.g. A3C ), although standard feedforward agents might use only the observation. The recurrent state is organized into its own namedarraytuple and can be customized.
Sampling. The agent handles the recurrent state during environment sampling. This functionality is provided in an optimized fashion according to the CuDNN cudnn interface, agnostic to the structure of that state. Separate mixin classes for custom agents are included for regular sampling and alternating sampling. Recurrent state is recorded under agent_info.
Training. Training data is organized with leading dimensions of [Time, Batch], matching the PyTorch/CuDNN implementations of recurrence. For CuDNN, the initial recurrent state must be re-organized into [Num_Layers, Batch, Hidden_Size] dimensions and made contiguous, as shown in the included recurrent agent classes.
6.4 Data Organization Inferred in Model Forward Method
The same model can be used with different leading dimensions: a single input (no leading dims), a batch [Batch, ..], or a time-batch [Time, Batch, ..]. In the model’s forward method, leading dimensions are inferred according to known dimension of the observation, for example. Inputs are reshaped accordingly for feed-forward or recurrent layers, and finally the outputs have their leading dimensions restored according to what was input. This way, the same model can be used for action-selection during sampling, for training, and for extracting single examples for constructing buffers. See any of the included models for a template of this pattern which should be followed in any custom models.
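The inference of leading dimensions can be sketched on shapes alone. The helpers below are hypothetical illustrations written for this example (they operate on shape tuples rather than tensors and are not rlpyt's utilities); they assume the number of dimensions of a single observation is known in advance.

```python
def split_leading_dims(shape, data_ndim):
    """Infer leading dimensions from an input shape.

    data_ndim is the known number of dimensions of one example,
    e.g. 3 for an image observation [C, H, W].
    Returns (lead_dim, T, B, data_shape).
    """
    lead_dim = len(shape) - data_ndim
    if lead_dim not in (0, 1, 2):
        raise ValueError(f"expected 0, 1, or 2 leading dims, got {lead_dim}")
    if lead_dim == 2:
        T, B = shape[0], shape[1]      # [Time, Batch, ...]
    elif lead_dim == 1:
        T, B = 1, shape[0]             # [Batch, ...]
    else:
        T, B = 1, 1                    # single example, no leading dims
    return lead_dim, T, B, tuple(shape[lead_dim:])


def restore_output_shape(flat_shape, lead_dim, T, B):
    """Map an output of shape [T * B, ...] back to the input's leading dims."""
    if flat_shape[0] != T * B:
        raise ValueError("first dimension must equal T * B")
    trailing = tuple(flat_shape[1:])
    if lead_dim == 2:
        return (T, B) + trailing
    if lead_dim == 1:
        return (B,) + trailing
    return trailing


image_ndim = 3  # assumed observation layout [C, H, W]

# Training input with [Time, Batch] leading dims.
lead_dim, T, B, data_shape = split_leading_dims((20, 32, 4, 84, 84), image_ndim)
flat_input_shape = (T * B,) + data_shape            # fed to feed-forward layers
output_shape = restore_output_shape((T * B, 6), lead_dim, T, B)

# Single example, e.g. when building buffers.
single = split_leading_dims((4, 84, 84), image_ndim)
single_output_shape = restore_output_shape((1, 6), *single[:3])
```

With tensors, the same bookkeeping would be applied through reshape or view calls before and after the network layers.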
6.5 OpenAI Gym Interface
The use of preallocated buffers requires one modification to the Gym environment interface: the env_info dictionary must provide the same keys/fields at every step. A Gym-style wrapper is included, which converts the env_info into a namedtuple for easy writing into the samples buffer. An additional wrapper component is provided as one way to ensure all keys are present at every step. A wrapper is also provided for Gym spaces to convert them to the corresponding rlpyt space (notably the multi-modal Gym Dictionary space becomes the rlpyt Composite space.)
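One way to guarantee a fixed set of env_info fields is sketched below. The wrapper class `FixedInfoEnv`, the stand-in `ToyEnv`, and the default values are all hypothetical and written for this example; they are not the wrappers shipped with rlpyt, and the sketch ignores the module-level definition needed for serialization.

```python
from collections import namedtuple


class ToyEnv:
    """Stand-in Gym-style environment: reports 'timeout' only on the last step."""

    def __init__(self, horizon=2):
        self.horizon, self.t = horizon, 0

    def reset(self):
        self.t = 0
        return 0

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        info = {"timeout": True} if done else {}  # keys vary between steps
        return self.t, 1.0, done, info


class FixedInfoEnv:
    """Wrap an env so env_info is a namedtuple with the same fields every step."""

    def __init__(self, env, info_defaults):
        self.env = env
        self._defaults = dict(info_defaults)
        self._info_cls = namedtuple("EnvInfo", list(self._defaults))

    def reset(self):
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # Fill missing keys with defaults; keys without a default are dropped.
        filled = {k: info.get(k, d) for k, d in self._defaults.items()}
        return obs, reward, done, self._info_cls(**filled)


env = FixedInfoEnv(ToyEnv(horizon=2), {"timeout": False})
env.reset()
_, _, _, info_a = env.step(0)
_, _, done, info_b = env.step(0)
print(info_a, info_b)
```

Because every step yields the same fields, each field can be written into a preallocated array of the samples buffer.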
6.6 Launching Utilities
Launching utilities are included for building variants and stacking / queueing experiments on given local hardware resources. For example, on an 8-GPU, 40-CPU machine, one may want to run some number of variants (say, 30 different settings/seeds), each using 2 GPUs; the launcher will launch 4 experiments on non-overlapping resources (each with 2 GPUs and 10 CPUs), and as those finish, it will launch the next in their places until all are complete. Results are recorded into a file structure which matches that of the variants generated (see the example scripts). Other scripting patterns may be preferable for widely parallelized launching into the cloud.
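The resource arithmetic in this example can be sketched as follows. The functions `make_slots` and `assign_round_robin` are hypothetical; the round-robin assignment assumes all runs take equal time, which is a simplification of launching the next experiment whenever a slot frees up.

```python
def make_slots(n_gpu, n_cpu, gpu_per_run):
    """Partition GPUs and CPUs into non-overlapping slots, one per concurrent run."""
    n_slots = n_gpu // gpu_per_run
    if n_slots == 0:
        raise ValueError("not enough GPUs for a single run")
    cpu_per_run = n_cpu // n_slots
    return [
        {
            "gpus": list(range(i * gpu_per_run, (i + 1) * gpu_per_run)),
            "cpus": list(range(i * cpu_per_run, (i + 1) * cpu_per_run)),
        }
        for i in range(n_slots)
    ]


def assign_round_robin(n_variants, n_slots):
    """Slot index for each variant if every run took the same time."""
    return [i % n_slots for i in range(n_variants)]


# The machine from the text: 8 GPUs, 40 CPUs, 2 GPUs per experiment, 30 variants.
slots = make_slots(n_gpu=8, n_cpu=40, gpu_per_run=2)
assignment = assign_round_robin(30, len(slots))
runs_per_slot = [assignment.count(s) for s in range(len(slots))]
print(len(slots), slots[0], runs_per_slot)
```

A real launcher would start each experiment as a subprocess pinned to its slot's devices and poll for completion before reusing the slot.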
We hope that rlpyt can facilitate adoption of existing deep RL techniques and serve as a launching point for research into new ones. For example, the more advanced topics of meta-learning, model-based, and multi-agent RL are not explicitly addressed in rlpyt, but applicable code components may still be helpful in accelerating their development. We expect the offering of algorithms to grow over time as the field matures.
First we acknowledge the original authors of all algorithms and supporting libraries, as listed in the references. Thanks to Steven Kapturowski for clarification of several implementation details of R2D2, and to Josh Achiam and Wilson Yan for help debugging SAC. Adam Stooke gratefully acknowledges the support of the Fannie and John Hertz Foundation and the NVIDIA Corporation.
-  Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
-  John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International conference on machine learning, pages 1889–1897, 2015.
-  Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.
-  DeepMind. Alphastar. https://deepmind.com/blog/article/alphastar-mastering-real-time-strategy-game-starcraft-ii, 2019.
-  OpenAI. Openai five. https://blog.openai.com/openai-five/, 2018.
-  Steven Kapturowski, Georg Ostrovski, John Quan, Remi Munos, and Will Dabney. Recurrent experience replay in distributed reinforcement learning. 2018.
-  Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym, 2016.
-  Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937, 2016.
-  John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
-  Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. In Thirtieth AAAI conference on artificial intelligence, 2016.
-  Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Van Hasselt, Marc Lanctot, and Nando De Freitas. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581, 2015.
-  Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 449–458. JMLR. org, 2017.
-  Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
-  Dan Horgan, John Quan, David Budden, Gabriel Barth-Maron, Matteo Hessel, Hado Van Hasselt, and David Silver. Distributed prioritized experience replay. arXiv preprint arXiv:1803.00933, 2018.
-  Will Dabney, Georg Ostrovski, David Silver, and Rémi Munos. Implicit quantile networks for distributional reinforcement learning. arXiv preprint arXiv:1806.06923, 2018.
-  Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, and Sergey Levine. Continuous deep q-learning with model-based acceleration. In International Conference on Machine Learning, pages 2829–2838, 2016.
-  Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477, 2018.
-  Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.
-  Gabriel Barth-Maron, Matthew W Hoffman, David Budden, Will Dabney, Dan Horgan, Alistair Muldal, Nicolas Heess, and Timothy Lillicrap. Distributed distributional deterministic policy gradients. arXiv preprint arXiv:1804.08617, 2018.
-  Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.
-  Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
-  Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012.
-  Adam Stooke and Pieter Abbeel. Accelerated methods for deep reinforcement learning. arXiv preprint arXiv:1803.02811, 2018.
-  Joshua Achiam. Openai spinning up. GitHub, GitHub repository, 2018.
-  Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian Goodfellow, Arnaud Bergeron, Nicolas Bouchard, David Warde-Farley, and Yoshua Bengio. Theano: new features and speed improvements. arXiv preprint arXiv:1211.5590, 2012.
-  Sam McCandlish, Jared Kaplan, Dario Amodei, and OpenAI Dota Team. An empirical model of large-batch training. arXiv preprint arXiv:1812.06162, 2018.
-  Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pages 1329–1338, 2016.
-  Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. Openai baselines. GitHub, GitHub repository, 2017.
-  Pablo Samuel Castro, Subhodeep Moitra, Carles Gelada, Saurabh Kumar, and Marc G Bellemare. Dopamine: A research framework for deep reinforcement learning. arXiv preprint arXiv:1812.06110, 2018.
-  Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
-  Eric Liang, Richard Liaw, Robert Nishihara, Philipp Moritz, Roy Fox, Joseph Gonzalez, Ken Goldberg, and Ion Stoica. Ray rllib: A composable and scalable reinforcement learning library. arXiv preprint arXiv:1712.09381, 2017.
-  Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I Jordan, et al. Ray: A distributed framework for emerging AI applications. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 561–577, 2018.
-  Jason Gauci, Edoardo Conti, Yitao Liang, Kittipat Virochsiri, Yuchen He, Zachary Kaden, Vivek Narayanan, and Xiaohui Ye. Horizon: Facebook’s open source applied reinforcement learning platform. arXiv preprint arXiv:1811.00260, 2018.
-  Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems, pages 693–701, 2011.
-  Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cudnn: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759, 2014.