Tonic: A Deep Reinforcement Learning Library for Fast Prototyping and Benchmarking

11/15/2020 ∙ by Fabio Pardo, et al. ∙ Imperial College London

Deep reinforcement learning has been one of the fastest growing fields of machine learning in recent years and numerous libraries have been open sourced to support research. However, most codebases have a steep learning curve or limited flexibility, which does not satisfy the need for fast prototyping in fundamental research. This paper introduces Tonic, a Python library allowing researchers to quickly implement new ideas and measure their importance by providing: 1) a collection of configurable modules such as exploration strategies, replays, neural networks, and updaters; 2) a collection of baseline agents (A2C, TRPO, PPO, MPO, DDPG, D4PG, TD3 and SAC) built with these modules; 3) support for the two most popular deep learning frameworks, TensorFlow 2 and PyTorch; 4) support for the three most popular sets of continuous-control environments: OpenAI Gym, DeepMind Control Suite and PyBullet; 5) a large-scale benchmark of the baseline agents on 70 continuous-control tasks; and 6) scripts to experiment in a reproducible way, plot results, and play with trained agents.


1 Introduction

Supported by the deep learning revolution (LeCun et al., 2015; Goodfellow et al., 2016), reinforcement learning (RL) (Sutton and Barto, 2018) has grown in popularity and been at the heart of many of the recent milestones in artificial intelligence. Programs are now able to surpass the best humans at ancient board games (Silver et al., 2017) and video games (Mnih et al., 2015; Vinyals et al., 2019; Berner et al., 2019). While those achievements are impressive, they often rely on a number of simple fundamental research ideas that were originally developed independently, such as Q-learning (Watkins and Dayan, 1992), policy gradient (Sutton et al., 2000) or Monte Carlo tree search (Coulom, 2006; Kocsis and Szepesvári, 2006).

An almost systematic pattern in deep RL research is: 1) the incorporation of novel general-purpose ideas into agents; 2) performance comparison with known baseline agents on simulated environments. While writing code from scratch has formative qualities, it is usually desirable to use a simple and flexible codebase. A large number of libraries exist with diverse goals such as large-scale heavily distributed training (e.g. Liang et al., 2017), simple and pedagogical code (e.g. Achiam, 2018), fundamental research in pixel-based domains (e.g. Castro et al., 2018), or support for specific deep learning frameworks such as Keras (e.g. Plappert, 2016), TensorFlow (e.g. Dhariwal et al., 2017; Guadarrama et al., 2018), and PyTorch (e.g. Stooke and Abbeel, 2019; D’Eramo et al., 2020). While much effort has been put into building those libraries, we found that there was a need for a simple yet modular codebase designed to quickly try fundamental research ideas and evaluate them in a controlled and fair way, in particular in continuous-control domains.

In this article, we introduce Tonic, a library for deep reinforcement learning research, written in Python and supporting both TensorFlow 2 (Abadi et al., 2016) and PyTorch (Paszke et al., 2019). Tonic includes modules such as deep learning models, replay buffers and exploration strategies. Those modules are written to be easily configured and plugged into compatible agents. Furthermore, Tonic implements a number of popular continuous-control baseline agents: A2C (Mnih et al., 2016), TRPO (Schulman et al., 2015), PPO (Schulman et al., 2017), MPO (Abdolmaleki et al., 2018), DDPG (Lillicrap et al., 2016), D4PG (Barth-Maron et al., 2018), TD3 (Fujimoto et al., 2018) and SAC (Haarnoja et al., 2018). Those agents are written with minimal abstractions to simplify readability and modification, emphasizing core ideas while moving other details into modules and sharing general improvements such as non-terminal timeouts (Pardo et al., 2018) and observation normalization. Tonic also includes three essential scripts to 1) train and test agents in a controlled way, 2) plot results against baselines and 3) play with trained policies. Finally, Tonic includes a large-scale benchmark with training logs and model weights of the baseline agents for 10 seeds on popular environments from OpenAI Gym (Brockman et al., 2016), DeepMind Control Suite (Tassa et al., 2018) and PyBullet (Coumans and Bai, 2016), representing a large and diverse set of domains based on the Box2D (Catto, 2011), MuJoCo (Todorov et al., 2012) and Bullet (Coumans, 2010) physics engines.

The paper is organized as follows: Section 2 presents the configuration-based philosophy underlying most of Tonic’s components, Section 3 presents the training pipeline, Section 4 describes the different agents implemented using the previously introduced modules, Section 5 describes the supported environments and how they are adapted to work with the Tonic agents, Section 6 presents the benchmark and some of the results and Section 7 presents the three essential scripts provided with Tonic to simplify running experiments, interpret the results and play with the trained policies.

Before diving into the description of the library and the results, here is a minimal usage example:

import tonic.tensorflow  # or import tonic.torch
agent = tonic.tensorflow.agents.MPO()
environment_func = lambda: tonic.environments.ControlSuite('walker-run')
environment = tonic.environments.distribute(environment_func)
test_environment = tonic.environments.distribute(environment_func)
tonic.logger.initialize('walker-run/MPO/42')
trainer = tonic.Trainer()
trainer.initialize(agent, environment, test_environment, 42)
trainer.run()

2 Library of modules

Figure 2: A hierarchy of configured modules is used to specify experiments in Tonic. Modules written for both TensorFlow 2 and PyTorch are shown with their respective logos.

Many libraries tend to put all the parameters of an experiment at the same level, calling an agent function with an environment name and all the relevant parameters. This hinders modularity and readability. Tonic tries as much as possible to move the configurable parts into modules. This has a number of advantages: 1) it clarifies which module each parameter targets, avoiding confusingly long lists of input parameters; 2) different compatible modules with their own specific parameters can be used; 3) new capabilities can be added without modifying agents.

In Tonic, the configuration of modules happens in two stages. First, when an experiment is described, agent modules are configured with general-purpose parameters such as hidden layer sizes, exploration noise scale and the trace decay λ used in λ-returns. Then, when the environment is selected, some specific values such as the observation and action space sizes are known and the modules are finally initialized. An illustration of the hierarchy of parameterized modules corresponding to the previous code snippet is shown in Figure 2.

Before describing the different modules, here is a minimal usage example showing how modules can be configured and initialized:

model = models.ActorCritic(
  actor=models.Actor(
    encoder=models.ObservationEncoder(),
    torso=models.MLP((64, 64), 'tanh'),
    head=models.DetachedScaleGaussianPolicyHead()),
  critic=models.Critic(
    encoder=models.ObservationEncoder(),
    torso=models.MLP((64, 64), 'tanh'),
    head=models.ValueHead()),
  observation_normalizer=normalizers.MeanStd())
actor_updater = updaters.ClippedRatio(
  optimizer=optimizers.Adam(3e-4, epsilon=1e-8), ratio_clip=0.2, kl_threshold=0.015,
  entropy_coeff=0.01, gradient_clip=40)
model.initialize(environment.observation_space, environment.action_space)
actor_updater.initialize(model)

Models

TensorFlow 2 and PyTorch models are assembled from smaller modules. For example, an actor-critic accepts an actor and a critic network. Actors and critics are built with an encoder, a torso and a head module. An encoder processes inputs, for example concatenating observations and actions for an action-dependent critic or normalizing observations using running statistics of the observations seen so far. A torso is typically a multilayer perceptron (MLP) or a recurrent network. A head produces the outputs, such as values or distributions.
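
To make this composition concrete, here is a minimal PyTorch sketch of a torso and a policy head; the class names and the initialize convention are illustrative assumptions, not Tonic's exact interface:

# Minimal sketch of a torso and a policy head in PyTorch.
# Class names and the initialize() convention are assumptions,
# not Tonic's exact interface.
import torch

class MLPTorso(torch.nn.Module):
  def __init__(self, sizes=(64, 64), activation=torch.nn.Tanh):
    super().__init__()
    self.sizes, self.activation = sizes, activation

  def initialize(self, input_size):
    layers, last_size = [], input_size
    for size in self.sizes:
      layers += [torch.nn.Linear(last_size, size), self.activation()]
      last_size = size
    self.net = torch.nn.Sequential(*layers)
    return last_size  # output size, consumed by the head

  def forward(self, inputs):
    return self.net(inputs)

class GaussianPolicyHead(torch.nn.Module):
  def initialize(self, input_size, action_size):
    self.mean_layer = torch.nn.Linear(input_size, action_size)
    self.log_std = torch.nn.Parameter(torch.zeros(action_size))

  def forward(self, features):
    return torch.distributions.Normal(self.mean_layer(features), self.log_std.exp())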

Replays

Different kinds of replays can be used for different types of agents. For example, a traditional Buffer can be used to randomly sample past transitions for off-policy training and a Segment can be used to store contiguous transitions for an on-policy agent. Since those replays are configurable modules, they hold parameters like the discount factor or trace decay and are in charge of producing training batches.
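
As an illustration, the two styles described above can be sketched in a few lines of plain Python; the class and method names below mirror the text but are assumptions, not Tonic's exact API:

# Minimal sketch of the two replay styles described above; names and
# constructor arguments are assumptions, not Tonic's exact API.
import random
from collections import deque

class Buffer:
  """Samples past transitions uniformly for off-policy training."""
  def __init__(self, size=10**6, batch_size=256):
    self.transitions = deque(maxlen=size)
    self.batch_size = batch_size

  def store(self, transition):
    self.transitions.append(transition)

  def sample(self):
    return random.sample(self.transitions, self.batch_size)

class Segment:
  """Stores contiguous transitions for on-policy training."""
  def __init__(self, size=4096):
    self.size, self.transitions = size, []

  def store(self, transition):
    self.transitions.append(transition)

  def ready(self):
    return len(self.transitions) >= self.size

  def flush(self):
    batch, self.transitions = self.transitions, []
    return batch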

Explorations

Different exploration strategies can be used with deterministic actors. Tonic currently includes Uniform and Normal for temporally-uncorrelated action noise and OrnsteinUhlenbeck for temporally-correlated action noise.
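
For illustration, here is a NumPy sketch contrasting the two kinds of noise; the parameter values are only indicative defaults:

# NumPy sketch of temporally-uncorrelated versus temporally-correlated
# action noise; parameter values are only indicative.
import numpy as np

def normal_noise(action_size, scale=0.1):
  """Uncorrelated Gaussian noise, redrawn independently at every step."""
  return np.random.normal(0.0, scale, action_size)

class OrnsteinUhlenbeckNoise:
  """Correlated noise: each sample drifts smoothly from the previous one."""
  def __init__(self, action_size, scale=0.1, theta=0.15, dt=1e-2):
    self.scale, self.theta, self.dt = scale, theta, dt
    self.state = np.zeros(action_size)

  def __call__(self):
    drift = -self.theta * self.state * self.dt
    diffusion = self.scale * np.sqrt(self.dt) * np.random.normal(size=self.state.shape)
    self.state = self.state + drift + diffusion
    return self.state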

Updaters

Different agents have different ways to update the parameters of their models. An updater typically takes batches of values as input, generates a loss, computes the gradient of this loss with respect to some model parameters and updates those parameters. Some updaters even create new sets of parameters used during optimization, such as the dual variables in MaximumAPosterioriPolicyOptimization.
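
To make this pipeline concrete, here is a PyTorch sketch of a clipped-ratio actor updater in the spirit of the ClippedRatio module configured earlier; the method names and arguments are assumptions, not Tonic's exact implementation:

# PyTorch sketch of a clipped-ratio (PPO-style) actor updater; method
# names and arguments are assumptions, not Tonic's exact implementation.
import torch

class ClippedRatioUpdater:
  def __init__(self, ratio_clip=0.2, entropy_coeff=0.01, gradient_clip=40, learning_rate=3e-4):
    self.ratio_clip, self.entropy_coeff = ratio_clip, entropy_coeff
    self.gradient_clip, self.learning_rate = gradient_clip, learning_rate

  def initialize(self, actor):
    self.actor = actor
    self.optimizer = torch.optim.Adam(actor.parameters(), lr=self.learning_rate, eps=1e-8)

  def update(self, observations, actions, advantages, old_log_probs):
    distributions = self.actor(observations)
    log_probs = distributions.log_prob(actions).sum(-1)
    ratios = torch.exp(log_probs - old_log_probs)
    clipped_ratios = torch.clamp(ratios, 1 - self.ratio_clip, 1 + self.ratio_clip)
    # Clipped surrogate objective plus an entropy bonus.
    loss = -(torch.min(ratios * advantages, clipped_ratios * advantages).mean()
             + self.entropy_coeff * distributions.entropy().sum(-1).mean())
    self.optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(self.actor.parameters(), self.gradient_clip)
    self.optimizer.step()
    return dict(loss=loss.detach())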

Logger

The different values generated by the interaction of the agent with the environments and the values generated by the updaters are written in a csv file by the logger after each training epoch. When writing those values, outputs are also printed on the terminal in a readable table while a progress bar indicates the remaining time for the current epoch and the overall training. The path to the folder containing the logs is usually of the form 'environment/agent/seed/'. When starting an experiment, the logger can also save the launch script and arguments for future reference and reloading.
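
Since the logs are plain csv files, they can be inspected with standard tools; in the snippet below, the exact file name ('log.csv') is an assumption and may differ:

# Loading a Tonic training log with pandas; the file name 'log.csv'
# is an assumption and may differ in practice.
import pandas as pd

log = pd.read_csv('walker-run/MPO/42/log.csv')
print(log.columns.tolist())  # one column per logged quantity
print(log.tail())            # values from the last training epochs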

3 Trainer

Figure 3: Synchronous training loop. At every step, a batch of observations is provided to the agent, which returns a batch of actions. Those actions are passed to the environment, which returns the batch of transition values used to update the models (next observations, rewards, terminations and resets) plus the batch of observations used to select the next actions.

The tonic.Trainer module is in charge of the training loop in Tonic. It takes care of the communication between the agent and the environment, testing the agent on the test environment, logging data via the logger and periodically saving the model parameters in checkpoints for future reload.

Distributed training has been shown to greatly accelerate the training of RL agents with respect to wall clock time (Mnih et al., 2016; Espeholt et al., 2018). Instead of interacting with a single environment at a time, the agent interacts with a set of differently seeded copies of the environment to diversify experience and increase throughput. For simplicity and to ensure reproducibility, Tonic uses the synchronous training loop illustrated in Figure 3. At each training step, the trainer first sends a batch of observations to the agent via the agent.step function, which returns a batch of actions and keeps track of some values such as predicted values or the log probabilities of the actions. The actions are transmitted to the environment module via the environment.step function, which returns multiple values. First, the ones describing the current transitions caused by the actions: the batch of next observations and the vectors of rewards, terminations and resets. The terminations indicate true environmental terminations, for example those caused by falling on the floor in a locomotion task or reaching a target state; agents can use them to know when bootstrapping is possible. The resets vector signals the ends of episodes, from terminations and timeouts, and can be used by agents to know the boundaries of episodes, for example for λ-return calculations. When using non-terminal timeouts, partial-episode bootstrapping (Pardo et al., 2018) is used to bootstrap from the values of the next observations, and it is therefore important to know that a reset happened without an environmental termination. When an environment resets, a new observation is generated and has to be used to select the next action; therefore, the environment also returns the batch of observations to use next, which for a given sub-environment is simply the next observation from the transition if no reset happened. Finally, the transition values are given to the agent via the agent.update function, which takes care of registering the transitions in a replay and performing updates, while the new observations are used to generate the actions at the next step.
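
The loop can be summarized by the following simplified sketch, which assumes the agent.step, agent.update and environment.step interface described above and omits the testing, logging and checkpointing handled by the actual Trainer:

# Simplified sketch of the synchronous training loop; it assumes the
# interface described above and omits testing, logging and checkpointing.
def train_loop(agent, environment, steps):
  observations = environment.start()  # assumed reset method returning a batch
  for step in range(steps):
    # The agent acts on a batch of observations, one per sub-environment.
    actions = agent.step(observations)
    # The environment returns the transition values plus the observations
    # to use for the next action selection (fresh ones after a reset).
    next_observations, rewards, terminations, resets, observations = \
        environment.step(actions)
    # The agent registers the transitions and possibly performs updates.
    agent.update(next_observations, rewards, terminations, resets)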

4 Agents

A number of reinforcement learning agents have been proposed over the years. Tonic contains 8 popular baseline agents: some are simple and foundational, while others are more complicated, state-of-the-art algorithms.

Basic agents

Especially useful for debugging, the simple non-parametric agents are NormalRandom, UniformRandom, OrnsteinUhlenbeck, and Constant.

Advantage Actor-Critic (A2C)

This agent, also called Vanilla Policy Gradient (VPG) in some libraries, uses advantages from λ-returns and a learned value function to update a stochastic policy via policy gradient (Sutton et al., 2000; Schulman et al., 2016; Mnih et al., 2016). It is stable but learns slowly because it can use the latest collected transitions only once to update its actor.
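
As a reference point, here is a NumPy sketch of the advantage computation from λ-returns over a single rollout (generalized advantage estimation, Schulman et al., 2016); this is the generic formulation, not Tonic's exact code:

# NumPy sketch of advantages from lambda-returns (generalized advantage
# estimation) over a single rollout; generic formulation, not Tonic's code.
import numpy as np

def lambda_advantages(rewards, values, next_values, terminations,
                      discount=0.99, trace_decay=0.97):
  advantages = np.zeros(len(rewards))
  running_advantage = 0.0
  for t in reversed(range(len(rewards))):
    not_terminal = 1.0 - terminations[t]
    # One-step temporal-difference error.
    delta = rewards[t] + discount * next_values[t] * not_terminal - values[t]
    # Exponentially-weighted sum of future TD errors.
    running_advantage = delta + discount * trace_decay * not_terminal * running_advantage
    advantages[t] = running_advantage
  return advantages  # the lambda-returns are advantages + values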

Trust Region Policy Optimization (TRPO)

This agent uses a conjugate gradient optimizer to take a large update step of policy gradient while satisfying a KL constraint between the new and previous policies (Schulman et al., 2015).

Proximal Policy Optimization (PPO)

This agent approximates TRPO by using clipped ratios between the old policy which generated the latest transitions and the currently optimized policy (Schulman et al., 2017).

Maximum a Posteriori Policy Optimisation (MPO)

This agent uses a complex relative-entropy objective taking advantage of the duality between control and estimation (Abdolmaleki et al., 2018). It can be very powerful if carefully tuned but its complexity made it very challenging to implement, and Acme’s code (Hoffman et al., 2020) was the only reliable source when Tonic was created.

Deep Deterministic Policy Gradient (DDPG)

This agent uses a deterministic actor trained via deterministic policy gradient (Silver et al., 2014; Lillicrap et al., 2016). It is data-efficient because it learns, off-policy, an approximation of the optimal value function which is then used to locally optimize the actor.

Distributed Distributional Deep Deterministic Policy Gradient (D4PG)

This agent uses a distributional critic head, n-step returns and prioritized experience replay (Barth-Maron et al., 2018). Tonic does not currently include a prioritized replay buffer but, as pointed out in the original paper, prioritization is a less critical component and can even lead to unstable updates.

Twin Delayed Deep Deterministic Policy Gradient (TD3)

This agent stabilizes DDPG using a pair of critics, action noise in the target actor and a delay to update the actor network less often (Fujimoto et al., 2018).
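
For illustration, the target-value computation combining two of these ingredients (twin critics and smoothed target actions) can be sketched in PyTorch as follows; variable names are illustrative and the delayed actor update is handled elsewhere:

# PyTorch sketch of the TD3 target computation: clipped noise on the target
# actions and the minimum over twin target critics; names are illustrative.
import torch

def td3_targets(rewards, discounts, next_observations, target_actor,
                target_critic_1, target_critic_2, noise_scale=0.2, noise_clip=0.5):
  with torch.no_grad():
    next_actions = target_actor(next_observations)
    # Target policy smoothing: clipped Gaussian noise on the target actions.
    noise = torch.clamp(noise_scale * torch.randn_like(next_actions),
                        -noise_clip, noise_clip)
    next_actions = torch.clamp(next_actions + noise, -1.0, 1.0)
    # Pessimistic value estimate: minimum of the two target critics.
    next_values = torch.min(target_critic_1(next_observations, next_actions),
                            target_critic_2(next_observations, next_actions))
    return rewards + discounts * next_values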

Soft Actor-Critic (SAC)

This agent uses an entropy-based reward augmentation, a squashed Gaussian policy and a pair of critics (Haarnoja et al., 2018).
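
A PyTorch sketch of the corresponding entropy-augmented (soft) value target is given below; the variable names and the fixed entropy coefficient are illustrative assumptions, not Tonic's code:

# PyTorch sketch of the entropy-augmented value target used by SAC-style
# agents; names and the fixed entropy coefficient are illustrative.
import torch

def soft_targets(rewards, discounts, next_observations, actor,
                 target_critic_1, target_critic_2, entropy_coeff=0.2):
  with torch.no_grad():
    distributions = actor(next_observations)  # squashed Gaussian policy
    next_actions = distributions.rsample()
    log_probs = distributions.log_prob(next_actions).sum(-1)
    next_values = torch.min(target_critic_1(next_observations, next_actions),
                            target_critic_2(next_observations, next_actions))
    # Reward augmentation: penalize the (scaled) log-probability of the action.
    return rewards + discounts * (next_values - entropy_coeff * log_probs)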

5 Environments

Figure 4: Subset of the benchmark results. For each agent, 5 test episodes are collected after each training epoch and averaged. The solid lines represent the average over 10 runs for each agent. The [minimum, maximum] range is shown with transparent areas and a sliding window of size 5 is used for smoothing. A large palette of environments is represented across the supported domains. The best performing agents are mostly TD3, SAC, MPO and D4PG (Control Suite tasks) but significant variations exist for each agent.

OpenAI Gym, PyBullet and DeepMind Control Suite

Tonic includes builders for continuous-control environments from OpenAI Gym (Brockman et al., 2016), DeepMind Control Suite (Tassa et al., 2018) and PyBullet (Coumans and Bai, 2016), representing a large and diverse set of domains based on the Box2D (Catto, 2011), MuJoCo (Todorov et al., 2012) and Bullet (Coumans, 2010) physics engines. For simplicity, and to match Gym and PyBullet environments, dictionary observations are flattened and concatenated into vectors.

Non-terminal timeouts

All of these environments are wrapped to enable the synchronous interaction described in Section 3. The TimeLimit wrapper is removed from the Gym and PyBullet environments, while in the case of Control Suite environments, task terminations are detected from task.get_termination(physics). Moreover, when terminal_timeouts is set to True, it is recommended to also set time_feature to True to use a tonic.environments.TimeFeature wrapper, which adds a representation of the remaining time to the observations. This is known as time-awareness (Pardo et al., 2018) and allows the environments to stay Markovian.
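
A Gym-style sketch of such a time-awareness wrapper is shown below; it is a generic illustration of appending the normalized remaining time to the observation, not the actual tonic.environments.TimeFeature code:

# Gym-style sketch of a time-awareness wrapper appending the normalized
# remaining time to observations; a generic illustration, not Tonic's code.
import gym
import numpy as np

class TimeFeature(gym.Wrapper):
  def __init__(self, env, max_steps=1000):
    super().__init__(env)
    self.max_steps, self.steps = max_steps, 0
    low = np.append(env.observation_space.low, 0.0)
    high = np.append(env.observation_space.high, 1.0)
    self.observation_space = gym.spaces.Box(low, high, dtype=np.float32)

  def reset(self, **kwargs):
    self.steps = 0
    return self._add_time(self.env.reset(**kwargs))

  def step(self, action):
    observation, reward, done, info = self.env.step(action)
    self.steps += 1
    return self._add_time(observation), reward, done, info

  def _add_time(self, observation):
    remaining = 1.0 - self.steps / self.max_steps
    return np.append(observation, remaining).astype(np.float32)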

Action scaling

Agents are all expected to act in a [-1, 1]^n action space, where n is the number of action dimensions. This facilitates action noise scaling and learning for agents relying on deterministic policies. Environments use a tonic.environments.ActionRescaler wrapper by default.
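
The corresponding rescaling can be sketched in one line of NumPy, mapping actions from [-1, 1]^n to the environment's native bounds; this is a generic illustration, not Tonic's wrapper:

# Rescaling agent actions from [-1, 1]^n to the environment's native action
# bounds; a generic illustration, not Tonic's ActionRescaler wrapper.
import numpy as np

def rescale_action(action, low, high):
  # action is expected in [-1, 1]; low and high are the environment bounds.
  return low + (np.clip(action, -1.0, 1.0) + 1.0) * (high - low) / 2.0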

Distributed training

Finally, for distributed training, the set of environment copies is maintained in parallel groups of sequential workers. Each parallel group is allocated to a different process and communication is done via pipes. Since this communication method adds some time overhead, using multiple sequential environments in each group can increase throughput.
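
The structure can be sketched with Python's multiprocessing pipes as below; this is a simplified illustration of parallel groups of sequential workers, not Tonic's actual distribute implementation (it notably ignores the reset and timeout handling described earlier):

# Simplified sketch of parallel groups of sequential environment workers
# communicating through pipes; not Tonic's actual implementation.
import multiprocessing as mp
import numpy as np

def worker(pipe, environment_funcs):
  environments = [func() for func in environment_funcs]  # sequential group
  observations = [env.reset() for env in environments]
  pipe.send(np.stack(observations))
  while True:
    actions = pipe.recv()
    results = []
    for env, action in zip(environments, actions):
      observation, reward, done, _ = env.step(action)
      if done:
        observation = env.reset()
      results.append((observation, reward, done))
    observations, rewards, dones = map(np.stack, zip(*results))
    pipe.send((observations, rewards, dones))

def parallelize(environment_func, parallel=2, sequential=2):
  pipes = []
  for _ in range(parallel):
    parent_pipe, child_pipe = mp.Pipe()
    process = mp.Process(target=worker,
                         args=(child_pipe, [environment_func] * sequential))
    process.daemon = True
    process.start()
    pipes.append(parent_pipe)
  return pipes  # the trainer scatters actions to and gathers results from the pipes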

6 Benchmark

Figure 5: Efficiency of observation normalization and non-terminal timeouts. PPO uses the default configuration with observation normalization and non-terminal timeouts. PPO-ob-norm is identical but without observation-normalization. PPO-terminal-timeouts is identical to PPO but with environmental termination at timeouts as is the case originally for the supported environments. PPO-time-aware is identical to PPO-terminal-timeouts but adds the remaining time as an observed feature. Observation normalization seems to accelerate learning. Non-terminal timeouts are best while time features are needed to account for terminal timeouts.

When evaluating novel ideas in the literature, it is sometimes difficult to measure the significance of results as baselines can be poorly tuned or evaluated in unfair conditions. Benchmarks of popular RL agents on popular RL environments (e.g. Duan et al., 2016; Huang et al., 2020) evaluated in identical conditions are essential to provide reliable lower bounds in fundamental research.

Methods

Tonic contains a large-scale benchmark of the 8 provided deep RL agents on 70 popular continuous-control tasks: 17 from OpenAI Gym (2 classic control, 3 Box2D, 12 MuJoCo), 10 from PyBullet and 43 from the benchmark subset in DeepMind Control Suite. The exact same 10 seeds (0, 1, 2, …, 9) are used for all agents with default parameters on all environments, with single-worker training (not distributed). D4PG was only run on DeepMind Control Suite environments because known reward boundaries are required for distributional value functions. Therefore, the total number of runs contained in the benchmark is 7 × 70 × 10 + 1 × 43 × 10 = 5330. These runs were all generated with tonic.tensorflow, which was significantly faster with off-policy agents than tonic.torch. This difference could be due to a more efficient graph tracing mechanism provided by TensorFlow’s tf.function decorator. A speed comparison is provided in Appendix Figure 6.

Results

Some of the results can be seen in Figure 4 while the full benchmark plots can be found in Appendix Figure 7. Environments from OpenAI Gym (names starting with an uppercase letter) and PyBullet (names containing “PybulletEnv”) are mostly “solved”, with the best agents getting scores similar to the best performances reported in the literature. However, many environments from DeepMind Control Suite seem much harder to learn, and most results reported for those environments in the literature were obtained with distributed training and many more training steps. Nevertheless, it is important to note that better hyperparameters could certainly be found for those agents, and especially better ones for each environment specifically.

Comparison to Spinning Up in Deep RL

To prove that the results for A2C, TRPO, PPO, DDPG, TD3 and SAC can be used as valid baselines, another benchmark was generated with the TensorFlow 1 implementations of those agents from Spinning Up in Deep RL (Achiam, 2018). The library was slightly modified to use a test environment for each agent, with a test frequency, number of test episodes and seeds identical to the ones used in the Tonic benchmark, and VPG was renamed A2C. Results on the original 5 environments used in the benchmark of this library can be found in Appendix Figure 8. The results from Spinning Up in Deep RL are compatible with the ones found on the website. The agents from Tonic perform significantly better on four of the five environments. This difference can be explained by some of the improvements in Tonic, such as observation normalization, non-terminal timeouts and action scaling, even though Spinning Up in Deep RL partially implements non-terminal timeouts for the off-policy methods by ignoring environmental terminations at timeout.

Ablations and variants

To measure the effectiveness of non-terminal timeouts and observation normalization in particular, PPO was trained with different variants. The results shown in Figure 5 demonstrate that these modifications indeed improve the performance of PPO. Finally, supplementary experiments validated the effectiveness of these improvements for the other agents and environments.

7 Scripts

The modules and agents described above can easily be used in a standalone experiment Python script or integrated in another codebase. However, for convenience, Tonic includes three essential scripts to take care of the most important things: training, plotting and playing.

tonic.train

A script used to launch training experiments. Since any agent could be configured with any compatible modules and launched on any configured environment, a simple list of parsed parameters would not give enough flexibility. Therefore, Tonic uses the interpreted nature of the Python language to directly evaluate Python snippets describing the agent, the environment and the trainer configurations. The script saves the experiment script and arguments and automatically configures the logger to use a path of the form 'environment/agent/seed/' which will be recognized by the two other scripts. Usage example:

python3 -m tonic.train \
--header 'import tonic.torch' \
--agent 'tonic.torch.agents.PPO()' \
--environment 'tonic.environments.Gym("BipedalWalker-v3")' \
--seed 0
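
The snippet-evaluation mechanism itself can be sketched in a few lines; this is a simplified illustration of the idea, not the script's actual code, and the argument names follow the usage example above:

# Simplified illustration of turning command-line snippets into objects;
# not the actual tonic.train implementation.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--header')
parser.add_argument('--agent')
parser.add_argument('--environment')
parser.add_argument('--seed', type=int, default=0)
arguments = parser.parse_args()

if arguments.header:
  exec(arguments.header)                   # e.g. 'import tonic.torch'
agent = eval(arguments.agent)              # e.g. 'tonic.torch.agents.PPO()'
environment = eval(arguments.environment)  # e.g. 'tonic.environments.Gym(...)'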

tonic.plot

A script to load and display results from multiple experiments together. The script expects a list of csv or pkl files to load data from. Regular expressions like BipedalWalker-v3/PPO-X/, BipedalWalker-v3/{PPO*,DDPG*} or *Bullet* can be used to point to different sets of logs. Multiple sub-figures are generated, one per environment, aggregating the results of agents across runs. The script can be configured in many ways. For example, the figure can be saved in different file formats such as PDF and PNG. A non-GUI backend such as agg can be used. If the --seconds argument is used, plotting is performed regularly in real time. The --baselines argument can be used to load logs from the benchmark saved in the /data/logs folder at the root of Tonic. For example, --baselines all uses all agents while --baselines A2C PPO TRPO will use logs from A2C, PPO and TRPO. Different parameters allow the user to customize the x and y axes, change the smoothing window size, specify the type of interval shown, display individual runs, select the minimum and maximum values of the x axis, and do many other things. Finally, the legend is shown at the bottom of the figure, regrouping all agents across environments, with a mechanism to automatically detect the ideal number of legend columns to use. Usage example:

python3 -m tonic.plot --path BipedalWalker-v3 --baselines all

tonic.play

A script to reinstantiate the environment and agent from an experiment folder, reloading weights from a checkpoint and rendering the policy acting in the environment. The path to the experiment must be specified and a particular checkpoint can be chosen. While rendering the policy interacting with the environment, the episode lengths, scores, min and max rewards are printed on the terminal. Gym environments are simply rendered while PyBullet and DeepMind Control Suite viewers allow users to add perturbations to the bodies in the simulation. Usage example:

python3 -m tonic.play --path BipedalWalker-v3/PPO/

Adding new modules, agents and environments

When using tonic.train, new components can be added using the --header argument, which is evaluated first. If the components are installed, they can directly be imported; otherwise, the path to the Python scripts has to be added to sys.path first. New environments can be registered and then built normally. Usage example:

python3 -m tonic.train \
--header \
"import sys; sys.path.append('path/to/my/modules'); \
import my_env, my_agent, tonic.torch; \
from dm_control import suite; suite._DOMAINS['my_env'] = my_env" \
--environment "tonic.environments.ControlSuite('my_env-task')" \
--agent "my_agent.MyAgent()" \
--seed 0

8 Conclusion and future work

This paper introduced Tonic, a library designed for fast prototyping and benchmarking of deep RL algorithms. It contains a number of configurable modules, agents, supported environments, three essential scripts and a large-scale continuous-control benchmark. Future work will include support for discrete action spaces and pixel-based observations, better handling of dictionary-based observations, benchmark results with improved hyperparameters, and new modules and agents. In particular, some of the new agents will rely on discretization of continuous action spaces, as this alternative has proven to be competitive with continuous-control methods (Metz et al., 2017; Tavakoli et al., 2018; Van de Wiele et al., 2020). Hopefully, researchers will use Tonic, contribute to it, and find it easier to release the source code of their papers.

Broader Impact

Tonic was designed to help researchers quickly try ideas and measure the significance of their results. Open-source libraries like this one help reduce the disparity between researchers by giving them access to the same tools and data to experiment with. However, free access to Tonic also means its usage could be diverted to endanger people, for example by training agents to control autonomous weapon systems. We strongly condemn any unethical use of this library and we believe it is our duty to look for signs of undesirable results produced by its usage.

Acknowledgments

We thank Arash Tavakoli, Nemanja Rakicevic and Digby Chappell for comments on the manuscript. Tonic was inspired by a number of other deep RL codebases. In particular, we acknowledge OpenAI Baselines, Spinning Up in Deep RL and Acme. The research presented in this paper has been supported by Dyson Technology Ltd.

References

  • M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng (2016) TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283. Cited by: §1.
  • A. Abdolmaleki, J. T. Springenberg, Y. Tassa, R. Munos, N. Heess, and M. Riedmiller (2018) Maximum a posteriori policy optimisation. In International Conference on Learning Representations, Cited by: §1, §4.
  • J. Achiam (2018) Spinning Up in Deep Reinforcement Learning. GitHub repository. External Links: Link Cited by: §1, §6.
  • G. Barth-Maron, M. W. Hoffman, D. Budden, W. Dabney, D. Horgan, D. Tb, A. Muldal, N. Heess, and T. Lillicrap (2018) Distributed distributional deterministic policy gradients. arXiv preprint arXiv:1804.08617. Cited by: §1, §4.
  • C. Berner, G. Brockman, B. Chan, V. Cheung, P. Dębiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse, et al. (2019) Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680. Cited by: §1.
  • G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) OpenAI Gym. arXiv preprint arXiv:1606.01540. Cited by: §1, §5.
  • P. S. Castro, S. Moitra, C. Gelada, S. Kumar, and M. G. Bellemare (2018) Dopamine: a research framework for deep reinforcement learning. arXiv preprint arXiv:1812.06110. Cited by: §1.
  • E. Catto (2011) Box2D: a 2D physics engine for games. GitHub repository. External Links: Link Cited by: §1, §5.
  • R. Coulom (2006) Efficient selectivity and backup operators in monte-carlo tree search. In International conference on computers and games, pp. 72–83. Cited by: §1.
  • E. Coumans and Y. Bai (2016) PyBullet, a Python module for physics simulation for games, robotics and machine learning. External Links: Link Cited by: §1, §5.
  • E. Coumans (2010) Bullet physics engine. External Links: Link Cited by: §1, §5.
  • C. D’Eramo, D. Tateo, A. Bonarini, M. Restelli, and J. Peters (2020) MushroomRL: simplifying reinforcement learning research. arXiv preprint arXiv:2001.01102. External Links: Link Cited by: §1.
  • P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, Y. Wu, and P. Zhokhov (2017) OpenAI baselines. GitHub repository. External Links: Link Cited by: §1.
  • Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel (2016) Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pp. 1329–1338. Cited by: §6.
  • L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, et al. (2018) Impala: scalable distributed deep-rl with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561. Cited by: §3.
  • S. Fujimoto, H. Van Hoof, and D. Meger (2018) Addressing function approximation error in actor-critic methods. J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, pp. 1587–1596. Cited by: §1, §4.
  • I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. MIT press. Cited by: §1.
  • S. Guadarrama, A. Korattikara, O. Ramirez, P. Castro, E. Holly, S. Fishman, K. Wang, E. Gonina, N. Wu, C. Harris, et al. (2018) TF-Agents: a library for reinforcement learning in tensorflow. GitHub repository. External Links: Link Cited by: §1.
  • T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290. Cited by: §1, §4.
  • M. Hoffman, B. Shahriari, J. Aslanides, G. Barth-Maron, F. Behbahani, T. Norman, A. Abdolmaleki, A. Cassirer, F. Yang, K. Baumli, et al. (2020) Acme: a research framework for distributed reinforcement learning. arXiv preprint arXiv:2006.00979. External Links: Link Cited by: §4.
  • S. Huang, R. Dossa, and C. Ye (2020) CleanRL: high-quality single-file implementation of deep reinforcement learning algorithms. GitHub repository. External Links: Link Cited by: §6.
  • L. Kocsis and C. Szepesvári (2006) Bandit based monte-carlo planning. In European conference on machine learning, pp. 282–293. Cited by: §1.
  • Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. nature 521 (7553), pp. 436–444. Cited by: §1.
  • E. Liang, R. Liaw, P. Moritz, R. Nishihara, R. Fox, K. Goldberg, J. E. Gonzalez, M. I. Jordan, and I. Stoica (2017) RLlib: abstractions for distributed reinforcement learning. arXiv preprint arXiv:1712.09381. Cited by: §1.
  • T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2016) Continuous control with deep reinforcement learning. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §1, §4.
  • L. Metz, J. Ibarz, N. Jaitly, and J. Davidson (2017) Discrete sequential prediction of continuous actions for deep rl. arXiv preprint arXiv:1705.05035. Cited by: §8.
  • V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928–1937. Cited by: §1, §3, §4.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. Cited by: §1.
  • F. Pardo, A. Tavakoli, V. Levdik, and P. Kormushev (2018) Time limits in reinforcement learning. J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, pp. 4045–4054. Cited by: §1, §3, §5.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035. Cited by: §1.
  • M. Plappert (2016) Keras-RL. GitHub repository. External Links: Link Cited by: §1.
  • J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz (2015) Trust region policy optimization. In International conference on machine learning, pp. 1889–1897. Cited by: §1, §4.
  • J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel (2016) High-dimensional continuous control using generalized advantage estimation. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §4.
  • J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §1, §4.
  • D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller (2014) Deterministic policy gradient algorithms. E. P. Xing and T. Jebara (Eds.), Proceedings of Machine Learning Research, Vol. 32, pp. 387–395. Cited by: §4.
  • D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. (2017) Mastering the game of go without human knowledge. Nature 550 (7676), pp. 354–359. Cited by: §1.
  • A. Stooke and P. Abbeel (2019) Rlpyt: a research code base for deep reinforcement learning in pytorch. arXiv preprint arXiv:1909.01500. Cited by: §1.
  • R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT press. Cited by: §1.
  • R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour (2000) Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pp. 1057–1063. Cited by: §1, §4.
  • Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. d. L. Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq, et al. (2018) DeepMind Control Suite. arXiv preprint arXiv:1801.00690. Cited by: §1, §5.
  • A. Tavakoli, F. Pardo, and P. Kormushev (2018) Action branching architectures for deep reinforcement learning. In AAAI Conference on Artificial Intelligence, pp. 4131–4138. Cited by: §8.
  • E. Todorov, T. Erez, and Y. Tassa (2012) MuJoCo: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: §1, §5.
  • T. Van de Wiele, D. Warde-Farley, A. Mnih, and V. Mnih (2020) Q-learning in enormous action spaces via amortized approximate maximization. arXiv preprint arXiv:2001.08116. Cited by: §8.
  • O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, et al. (2019) Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575 (7782), pp. 350–354. Cited by: §1.
  • C. J. Watkins and P. Dayan (1992) Q-learning. Machine learning 8 (3-4), pp. 279–292. Cited by: §1.

Appendix

Appendix A TensorFlow 2 vs PyTorch

Figure 6: Speed comparison between TensorFlow 2 and PyTorch agents trained on walker-walk. Agents were trained for 1 million steps, using the same parameters as for the benchmark. The time spent to run the last 250,000 steps is used to measure the average number of steps per second indicated by bars. The agents were fully trained in turn on the same 6-core processor running at 3.8 GHz without GPU. The difference between the two frameworks might be due to a better optimization mechanism provided by TensorFlow’s tf.function decorator.

Appendix B Benchmark results

Figure 7: Full benchmark results on 70 tasks, for 5 million time steps over 10 seeds. Intervals indicate minimum and maximum values. Curves are smoothed with a sliding window of size 5. A few runs diverged catastrophically, especially on Striker-v2 and Thrower-v2.

Appendix C Comparison with Spinning Up in Deep RL

Figure 8: Comparison with Spinning Up in Deep RL using the same training, evaluation and agent parameters.