In recent years, significant strides have been made in numerous complex sequential decision-making problems, including game playing (dqn, ; alphazero, ) and robotic manipulation (Levine2016, ; Kalashnikov2018, ). These advances have been enabled by Deep Reinforcement Learning (DRL), which has undergone tremendous progress since its resurgence in 2013 with the introduction of deep Q-networks (dqn2013, ). Since then, the body of literature on DRL algorithms has rapidly grown. This growing body of algorithms, coupled with the emergence of common benchmark tasks (ale, ; mujoco, ), begets the need for comprehensive libraries, tools, and implementations that can aid RL-based research and development.
While DRL has demonstrated several successes, as a field it still faces significant impediments. DRL algorithms are often sensitive to hyperparameter settings and implementation details, which often go unreported in published work, raising concerns about the state of reproducibility in the field (deeprlmatters, ). These seemingly minor implementation details have striking effects in DRL because, unlike in the typical supervised learning setting, the data collection process is closely tied to the parameter updates. Such issues make reproducing published results challenging when the original implementation is not open-sourced.
Many open-source software packages have been developed to alleviate these issues by providing reference algorithm implementations. However, as the state of the art rapidly advances, it is difficult for DRL libraries to keep pace. Even when open-source packages make re-implementations available, they seldom provide comprehensive benchmarks or implementations that faithfully replicate the original settings of published results.
In this paper, we introduce ChainerRL, an open-source DRL library built in Python on top of the Chainer (chainer, ) deep learning framework. ChainerRL has the following characteristics:
- Comprehensiveness
ChainerRL aims to be a comprehensive DRL library. It spans algorithms for both discrete-action and continuous-action tasks, from foundational algorithms such as DQN (dqn, ) and DDPG (ddpg, ) to state-of-the-art algorithms such as IQN (Dabney2018, ), Rainbow (rainbow, ), Soft Actor-Critic (sac, ), and TD3 (Fujimoto2018a, ). Moreover, ChainerRL supports multiple training paradigms: standard serial training (e.g., DQN), asynchronous parallel training (e.g., A3C (a3c, )), and synchronous parallel training (e.g., A2C (a2c, )).
- Faithful reproduction
To provide reliable baselines as a starting point for future research, ChainerRL emphasizes the faithfulness of its algorithm implementations to their corresponding original papers or implementations, if any. For several state-of-the-art algorithms, we replicate the published training and evaluation details as closely as possible, successfully reproducing the results from the original paper, both in the Atari 2600 (ale, ) and OpenAI Gym (gym, ) MuJoCo benchmarks. We provide training scripts with full reported scores and training times as well as comparisons against the original implementations for each reproduction.
- Visualization
ChainerRL is accompanied by the ChainerRL Visualizer, an agent visualization tool that enables users to effortlessly visualize the behavior of trained agents, both discrete-action and continuous-action. For tasks with image-based observations, it can show saliency maps (visualizingatari, ) that indicate which parts of the image the neural network is attending to.
Individually, these characteristics may not be entirely novel, but achieving all of them within a single, unified open-source platform gives ChainerRL a unique value that sets it apart from existing libraries.
This paper is organized as follows. In Section 2, we review related work on reproducibility in RL and existing DRL libraries. In Section 3, we explain the design and functionalities of ChainerRL and introduce the ChainerRL Visualizer, our companion visualization tool. In Section 4, we describe our efforts to reproduce published benchmark results, and provide concluding remarks in Section 5.
2 Related work
Concerns about reproducibility in RL have been raised by several recent studies (deeprlmatters, ). Minor differences in hyperparameters, architectures, reward scales, etc. can have dramatic effects on an agent's performance, and even minor implementation differences can cause significant discrepancies between open-source implementations of the same algorithm. It is also known that performance can vary substantially across training runs that are identical except for their random seeds, necessitating evaluation with multiple random seeds and statistical testing (Colas2018b, ).
Many DRL libraries have been released to provide reference implementations of algorithms with different focuses. rllab (rllab, ) and its successor, garage, provide a thorough set of continuous-action algorithms and their own benchmark environments for systematic benchmarking. Dopamine (dopamine, ) primarily focuses on DQN and its extensions for discrete-action environments. rlpyt (rlpyt, ) is comprehensive, supporting both discrete- and continuous-action algorithms from three classes: policy gradient (with V-functions), deep Q-learning, and policy gradient with Q-functions. Several other libraries also support diverse sets of algorithms (baselines, ; coach, ; stablebaselines, ; rayrllib, ). While it is difficult to compare the comprehensiveness of actively developed libraries, ChainerRL's support for a wide range of algorithms and functionality provides a competitive edge over other libraries, as we detail in Section 3. Catalyst.RL (catalyst_rl, ) aims to address reproducibility issues in RL via deterministic evaluations and by tracking code changes for continuous-action algorithms. ChainerRL addresses the reproducibility challenge by providing implementations that closely follow the original papers' descriptions and experimental settings, and by extensively benchmarking these implementations to best reproduce the originally reported scores.
3 Design of ChainerRL
In ChainerRL, each DRL algorithm is written as a class that implements the Agent interface. The Agent interface provides a mechanism through which an agent interacts with an environment, e.g. through an abstract method Agent.act_and_train(obs, reward, done) that takes the current observation, the previous step’s immediate reward, and a flag for episode termination, and returns the agent’s action to execute in the environment. By implementing such methods, both the update rule and the action-selection procedure are specified for an algorithm.
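As an illustration, the interface described above might be sketched as follows. This is a simplified illustration rather than ChainerRL's actual class: only the act_and_train(obs, reward, done) signature comes from the description above, while the Agent.act method and the RandomAgent example are hypothetical.

```python
from abc import ABC, abstractmethod
import random

class Agent(ABC):
    """Sketch of an agent interface: algorithms implement these methods."""

    @abstractmethod
    def act_and_train(self, obs, reward, done):
        """Consume (obs, reward, done), update internals, return an action."""

    @abstractmethod
    def act(self, obs):
        """Return an action without training side effects (for evaluation)."""

class RandomAgent(Agent):
    """Trivial example: ignores observations, picks among n_actions uniformly."""

    def __init__(self, n_actions, seed=0):
        self.n_actions = n_actions
        self.rng = random.Random(seed)

    def act_and_train(self, obs, reward, done):
        # A learning agent would store the transition and update parameters here.
        return self.act(obs)

    def act(self, obs):
        return self.rng.randrange(self.n_actions)
```

Implementing act_and_train fixes both the update rule and the action-selection procedure in one place, which is what allows generic training utilities to drive any agent.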
An agent’s internals consist of any model parameters needed for decision-making and model updating.
ChainerRL includes several built-in agents that implement popular algorithms including DQN (dqn, ), IQN (Dabney2018, ), Rainbow (rainbow, ), A2C (a2c, ), A3C (a3c, ), ACER (acer, ), DDPG (ddpg, ), PPO (Schulman2017b, ), TRPO (trpo, ), TD3 (Fujimoto2018a, ), and SAC (sac, ).
Within the Deep Q-networks (DQN) (dqn, ) family of algorithms, ChainerRL has: DQN (dqn, ), Double DQN (ddqn, ), Categorical DQN (distributionaldqn, ), Rainbow (rainbow, ), Implicit Quantile Networks (IQN) (Dabney2018, ), Off-policy SARSA, and (Persistent) Advantage Learning (Bellemare2016c, ). Within policy gradient methods, ChainerRL has: (Asynchronous) Advantage Actor-Critic (A2C (a2c, ) and A3C (a3c, )), Actor-Critic with Experience Replay (ACER) (acer, ), Deep Deterministic Policy Gradient (DDPG) (ddpg, ), Twin Delayed DDPG (TD3) (Fujimoto2018a, ), Proximal Policy Optimization (PPO) (Schulman2017b, ), REINFORCE (Williams1992, ), Trust Region Policy Optimization (TRPO) (trpo, ), and Soft Actor-Critic (SAC) (sac, ).
While users can directly use agents via the interface for maximum flexibility, ChainerRL provides an experiments module that manages the interactions between the agent and the environment as well as the training and evaluation schedule. The module supports any environment that is compatible with the Env class of OpenAI Gym (gym, ). An experiment takes as input an agent and an environment, queries the agent for actions, executes them in the environment, and feeds the agent the rewards for training updates. Moreover, an experiment can periodically perform evaluations, possibly in a separate evaluation environment, storing relevant statistics regarding the agent’s performance.
Through the experiments module, ChainerRL supports batch and asynchronous training, enabling agents to act and train synchronously or asynchronously in several environments in parallel. Asynchronous training, where multiple agents interact with multiple environments while sharing the model parameters, is supported for A3C, ACER, and n-step Q-learning (a3c, ). Synchronous parallel training, where a single agent interacts with multiple environments synchronously, popularized by A2C (a2c, ), is supported to leverage GPUs not only for A2C but also for the majority of agents, including IQN (Dabney2018, ) and Soft Actor-Critic (sac, ).
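The synchronous paradigm can be illustrated with a toy sketch. ToyEnv and batch_step are hypothetical stand-ins, not ChainerRL APIs: several environment copies are stepped in lockstep with one batch of actions, and finished episodes are reset in place so training never stalls.

```python
import random

class ToyEnv:
    """Tiny stand-in for a Gym-style environment: episodes end after 3 steps."""

    def reset(self):
        self.t = 0
        return 0  # initial observation

    def step(self, action):
        self.t += 1
        done = self.t >= 3
        return self.t, 1.0, done, {}  # obs, reward, done, info

def batch_step(envs, actions):
    """Advance several environments in lockstep, resetting finished ones."""
    obss, rewards, dones = [], [], []
    for env, a in zip(envs, actions):
        obs, r, done, _ = env.step(a)
        if done:
            obs = env.reset()  # restart the episode in place
        obss.append(obs)
        rewards.append(r)
        dones.append(done)
    return obss, rewards, dones

envs = [ToyEnv() for _ in range(4)]
obss = [env.reset() for env in envs]
for _ in range(3):
    actions = [random.randrange(2) for _ in envs]  # one action per environment
    obss, rewards, dones = batch_step(envs, actions)
```

In the asynchronous paradigm, by contrast, each environment runs in its own process at its own pace, with only the model parameters shared.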
3.3 Developing a new agent
The Agent interface is deliberately abstract and flexible so that users can easily implement new algorithms while leveraging the experiments utility and parallel training infrastructure. The general flow for developing and evaluating a new agent is as follows. First, create a class that inherits from Agent. Next, implement the algorithm's learning update rule and action-selection mechanism, employing the many building blocks that ChainerRL provides for building agents (see Section 3.4). Once the agent is created, one can combine any Gym-like environment with the experimental and evaluation utilities in experiments to easily train and evaluate the agent in the specified environment.
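To make this flow concrete, here is a toy tabular Q-learning agent written against the act_and_train convention described above. TabularQAgent is hypothetical and not a ChainerRL built-in; it also simplifies episode handling, which a real agent would treat separately.

```python
import random

class TabularQAgent:
    """Toy Q-learning agent: act_and_train both selects actions and updates Q."""

    def __init__(self, n_states, n_actions, lr=0.5, gamma=0.9, eps=0.1, seed=0):
        self.q = [[0.0] * n_actions for _ in range(n_states)]
        self.lr, self.gamma, self.eps = lr, gamma, eps
        self.rng = random.Random(seed)
        self.last = None  # (state, action) of the previous step

    def act_and_train(self, obs, reward, done):
        if self.last is not None:
            # Q-learning update for the transition ending at this step.
            s, a = self.last
            target = reward + (0.0 if done else self.gamma * max(self.q[obs]))
            self.q[s][a] += self.lr * (target - self.q[s][a])
        # Epsilon-greedy action selection.
        if self.rng.random() < self.eps:
            action = self.rng.randrange(len(self.q[obs]))
        else:
            action = max(range(len(self.q[obs])), key=lambda i: self.q[obs][i])
        self.last = None if done else (obs, action)
        return action
```

Because the update rule and action selection live behind the same method, such an agent can be dropped into any generic training loop that calls act_and_train.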
3.4 Agent building blocks
We have described at a high level how agents interact with the environment in ChainerRL, as well as some of the built-in agents and experimental utilities offered. However, these built-in agents are typically built with a set of reusable components that ChainerRL offers. While a comprehensive treatment of the features built into ChainerRL is beyond the scope of this paper, we highlight here some of the building blocks that demonstrate the flexibility and reusability of ChainerRL.
- Explorers
To easily develop an agent's action-selection mechanisms during training, ChainerRL has several built-in Explorers, including ε-greedy, Boltzmann exploration, additive Gaussian noise, and additive Ornstein-Uhlenbeck noise (ddpg, ).
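Two of these exploration strategies can be sketched as follows, under our own illustrative names rather than ChainerRL's Explorer classes: ε-greedy for discrete actions, and an Ornstein-Uhlenbeck process producing temporally correlated noise for continuous actions, as used in DDPG.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])

class OUNoise:
    """Ornstein-Uhlenbeck process: mean-reverting, temporally correlated noise
    added to a deterministic policy's actions for exploration."""

    def __init__(self, mu=0.0, theta=0.15, sigma=0.2, seed=0):
        self.mu, self.theta, self.sigma = mu, theta, sigma
        self.x = mu
        self.rng = random.Random(seed)

    def sample(self):
        # dx = theta * (mu - x) + sigma * N(0, 1)
        self.x += self.theta * (self.mu - self.x) + self.sigma * self.rng.gauss(0, 1)
        return self.x
```

The correlated noise matters for physical control tasks: consecutive perturbations push in similar directions, producing coherent exploratory motion instead of jitter.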
- Replay buffers
Replay buffers (Lin1992, ; dqn, ) have become standard tools in off-policy DRL. In addition to the standard replay buffer that uniformly samples transitions, ChainerRL supports episodic buffers that sample past (sub-)episodes for recurrent models, and prioritized buffers that implement prioritized experience replay (prioritizeddqn, ). ChainerRL also supports sampling n steps of transitions, allowing for the easy implementation of algorithm variants based on n-step returns.
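The n-step idea can be sketched with a simplified buffer under hypothetical names, not ChainerRL's implementation: consecutive transitions are collapsed into a single stored transition whose reward is the discounted n-step return. For brevity this sketch drops the shorter-than-n tails at episode ends, which a full implementation would also emit.

```python
import random
from collections import deque

class NStepReplayBuffer:
    """Uniform replay buffer storing n-step transitions (simplified sketch)."""

    def __init__(self, capacity, n_step, gamma):
        self.buffer = deque(maxlen=capacity)
        self.n_step, self.gamma = n_step, gamma
        self.pending = deque()  # last up-to-n raw transitions

    def append(self, obs, action, reward, next_obs, done):
        self.pending.append((obs, action, reward, next_obs, done))
        if len(self.pending) == self.n_step or done:
            # Collapse the pending window into one n-step transition.
            ret = sum((self.gamma ** i) * r
                      for i, (_, _, r, _, _) in enumerate(self.pending))
            first_obs, first_action = self.pending[0][0], self.pending[0][1]
            self.buffer.append((first_obs, first_action, ret, next_obs, done))
            self.pending.popleft()
        if done:
            self.pending.clear()  # simplification: drop the episode's tail

    def sample(self, k):
        return random.sample(list(self.buffer), k)
```

An agent built on this buffer trains on targets of the form r_t + γ r_{t+1} + … + γ^{n-1} r_{t+n-1} + γ^n max_a Q(s_{t+n}, a), which is how n-step variants of DQN-family algorithms arise.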
- Neural networks
While ChainerRL supports any neural network model that is implemented in Chainer (chainer, ) as a chainer.Link, it has several pre-defined architectures, including DQN architectures, dueling network architectures (duelingdqn, ), noisy networks (noisynetworks, ), and multi-layer perceptrons. Recurrent models are supported for many algorithms, including DQN and IQN.
- Distributions
Distributions are parameterized objects used to model action distributions. A neural network model that returns a Distribution object is treated as a policy. Supported policies include Gaussian, softmax, Mellowmax (Asadi2016a, ), and deterministic policies.
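A minimal sketch of such a Distribution object for the softmax case follows. SoftmaxDistribution here is our own illustrative name, and in a real library the logits would come from a network's output layer rather than a plain list.

```python
import math
import random

class SoftmaxDistribution:
    """Sketch of a Distribution: wraps logits from a policy head and exposes
    probabilities, sampling, and the most probable action."""

    def __init__(self, logits):
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - m) for l in logits]
        total = sum(exps)
        self.probs = [e / total for e in exps]

    def sample(self, rng=random):
        u, cum = rng.random(), 0.0
        for action, p in enumerate(self.probs):
            cum += p
            if u < cum:
                return action
        return len(self.probs) - 1

    def most_probable(self):
        return max(range(len(self.probs)), key=lambda a: self.probs[a])
```

Separating the distribution object from the network lets the same agent code sample actions during training and act greedily (most_probable) during evaluation.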
- Action values
Similarly to Distributions, ActionValue objects parameterizing the values of actions are used as outputs of neural networks to model Q-functions. Supported Q-functions include the standard one that evaluates discrete actions, as in DQN, as well as categorical (distributionaldqn, ) and quantile (Dabney2018, ) Q-functions for distributional RL. For continuous action spaces, quadratic Q-functions known as Normalized Advantage Functions (NAFs) (Gu2016b, ) are also supported.
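The role of an ActionValue can be sketched for the discrete case as follows; this is an illustrative wrapper with hypothetical attribute names, not ChainerRL's exact class.

```python
class DiscreteActionValue:
    """Sketch of an ActionValue: wraps per-action Q estimates and exposes the
    quantities agents need (greedy action, max Q, Q of a chosen action)."""

    def __init__(self, q_values):
        self.q_values = q_values

    @property
    def greedy_action(self):
        # Index of the action with the highest estimated value.
        return max(range(len(self.q_values)), key=lambda a: self.q_values[a])

    @property
    def max(self):
        # Used for bootstrapped targets: r + gamma * max_a Q(s', a).
        return max(self.q_values)

    def evaluate_action(self, action):
        # Q(s, a) for the action actually taken, used in the TD error.
        return self.q_values[action]
```

Because categorical and quantile Q-functions can expose the same interface, the same DQN-style update code works unchanged across the distributional variants.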
The set of algorithms that can be developed by combining the agent building blocks of ChainerRL is large. One notable example is Rainbow (rainbow, ), which combines double Q-learning updates (ddqn, ), prioritized replay (prioritizeddqn, ), n-step learning, dueling architectures (duelingdqn, ), noisy networks (noisynetworks, ), and Categorical DQN (distributionaldqn, ) into a single agent. The following code depicts the simplicity of creating and training a Rainbow agent with ChainerRL.
```python
import chainerrl as crl
import gym

q_func = crl.q_functions.DistributionalDuelingDQN(...)  # dueling
crl.links.to_factorized_noisy(q_func)  # noisy networks
# Prioritized experience replay buffer with a 3-step reward
per = crl.replay_buffers.PrioritizedReplayBuffer(num_step_return=3, ...)
# Create a Rainbow agent
rainbow = crl.agents.CategoricalDoubleDQN(per, q_func, ...)
num_envs = 5  # train in five environments
env = crl.envs.MultiprocessVectorEnv(
    [gym.make("Breakout") for _ in range(num_envs)])
# Train the agent and collect evaluation statistics
crl.experiments.train_agent_batch_with_evaluation(rainbow, env, steps=...)
```
We first create a distributional dueling Q-function and then, in a single line, convert it to a noisy network. We then initialize a prioritized replay buffer configured to use 3-step rewards. We pass this replay buffer to a CategoricalDoubleDQN agent, a built-in ChainerRL agent, to produce a Rainbow agent. Moreover, with ChainerRL, users can easily specify the number of environments in which to train the Rainbow agent in parallel processes, and the experiments module will automatically manage the training loops, evaluation statistics, logging, and saving of the agent.
ChainerRL is accompanied by a visualizer, the ChainerRL Visualizer, which takes as input an environment and an agent and allows users to easily inspect their agents from a browser UI. Figure 2 depicts some of the key features available in the ChainerRL Visualizer. The top of the figure depicts a trained A3C agent in the Atari game Breakout. With the visualizer, one can visualize the portions of the pixel input that the agent is attending to as a saliency map (visualizingatari, ). Additionally, users can perform careful, controlled investigations of agents by manually stepping through an episode, or can alternatively view rollouts of agents. Since A3C is an actor-critic agent, we can view the probabilities with which the agent will perform each action, as well as the agent's predicted state values. If the agent learns Q-values or a distribution of Q-values, the predicted Q-value or Q-value distribution for each action can be displayed, as shown at the bottom of Figure 2.
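The perturbation-based saliency idea of (visualizingatari, ) can be sketched in a few lines: occlude each patch of the input, re-run the network, and record how much its output changes. The zero-valued occlusion, patch size, and the toy value_fn below are illustrative simplifications of our own; the original work perturbs with a localized Gaussian blur.

```python
def perturbation_saliency(value_fn, image, patch=2):
    """Occlude each patch of the input and measure how much the network's
    scalar output changes; large changes mark salient regions."""
    h, w = len(image), len(image[0])
    base = value_fn(image)
    saliency = [[0.0] * w for _ in range(h)]
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            perturbed = [row[:] for row in image]
            for y in range(i, min(i + patch, h)):
                for x in range(j, min(j + patch, w)):
                    perturbed[y][x] = 0.0  # zero out one patch
            diff = abs(base - value_fn(perturbed))
            for y in range(i, min(i + patch, h)):
                for x in range(j, min(j + patch, w)):
                    saliency[y][x] = diff
    return saliency

# Toy "network" that only looks at the top-left pixel, so only that patch
# should light up in the saliency map.
value_fn = lambda img: img[0][0]
sal = perturbation_saliency(value_fn, [[1.0] * 4 for _ in range(4)])
```

Overlaying such a map on the game frame is what produces the attention visualizations shown in the visualizer.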
Table 1 (excerpt): scores obtained by ChainerRL's reproducibility scripts on the Atari benchmark, compared against published scores.

| Game | DQN (ours) | DQN (published) | IQN (ours) | IQN (published) | Rainbow (ours) | Rainbow (published) | A3C (ours) | A3C (published) |
|---|---|---|---|---|---|---|---|---|
| Kung Fu Master | 27616.7 | 23270 | 87566.3 | 73512 | 22820.5 | 52181.0 | 37750.0 | 37422 |
| Name This Game | 7046.5 | 7257 | 23151.3 | 22682 | 14035.1 | 13136.0 | 11301.1 | 7168 |
| Up N Down | 11821.5 | 8456 | 83140.0 | 88148 | 284465.6 | - | 78790.0 | 89067 |
| Wizard Of Wor | 1957.3 | 3393 | 20892.8 | 31190 | 19796.5 | 17862.5 | 2488.4 | 8953 |
| # Higher scores | 22 | 26 | 27 | 25 | 30 | 20 | 26 | 25 |
Table 2: evaluation protocol used for each Atari reproduction, following the cited source paper.

| | DQN (dqn, ) | IQN (Dabney2018, ) | Rainbow (rainbow, ) | A3C (noisynetworks, ) |
|---|---|---|---|---|
| Evaluation frequency (timesteps) | 250K | 250K | 250K | 250K |
| Evaluation phase (timesteps) | 125K | 125K | 125K | 125K |
| Evaluation episode length (time) | 5 min | 30 min | 30 min | 30 min |
| Evaluation episode policy | N/A | | | |
Table 3: results of ChainerRL's reproducibility scripts on OpenAI Gym MuJoCo tasks, compared against published scores (mean ± standard deviation where reported).

| Environment | DDPG (ours) | DDPG (Fujimoto2018a, ) | TD3 (ours) | TD3 (Fujimoto2018a, ) |
|---|---|---|---|---|
| HalfCheetah-v2 | 10325.45 | 8577.29 | 10248.51 ± 1063.48 | 9636.95 ± 859.065 |
| Hopper-v2 | 3565.60 | 1860.02 | 3662.85 ± 144.98 | 3564.07 ± 114.74 |
| Walker2d-v2 | 3594.26 | 3098.11 | 4978.32 ± 517.44 | 4682.82 ± 539.64 |
| Ant-v2 | 774.46 | 888.77 | 4626.25 ± 1020.70 | 4372.44 ± 1000.33 |
| Reacher-v2 | -2.92 | -4.01 | -2.55 ± 0.19 | -3.60 ± 0.56 |
| InvertedPendulum-v2 | 902.25 | 1000.00 | 1000.00 ± 0.0 | 1000.00 ± 0.0 |
| InvertedDoublePendulum-v2 | 7495.56 | 8369.95 | 8435.33 ± 2771.39 | 9337.47 ± 14.96 |

| Environment | TRPO (ours) | TRPO (deeprlmatters, ) | PPO (ours) | PPO (deeprlmatters, ) | SAC (ours) | SAC (sac, ) |
|---|---|---|---|---|---|---|
| HalfCheetah-v2 | 1474 ± 112 | 205 ± 256 | 2404 ± 185 | 2201 ± 323 | 14850.54 | ~15000 |
| Hopper-v2 | 3056 ± 44 | 2828 ± 70 | 2719 ± 67 | 2790 ± 62 | 2911.89 | ~3300 |
| Walker2d-v2 | 3073 ± 59 | - | 2994 ± 113 | - | 5282.61 | ~5600 |
| Swimmer-v2 | 200 ± 25 | - | 111 ± 4 | - | - | - |
Many open-source DRL libraries offer high-quality implementations of algorithms but often deviate from the original papers' implementation details. We provide a set of compact, single-file paper implementations written with ChainerRL that users can run to reproduce the results of the original research paper. These “reproducibility scripts” are carefully written to replicate as closely as possible the original paper's (or, in some cases, another published paper's) implementation and evaluation details. For each of these scripts, we provide the training times (in our repository), full tables of our achieved scores, and comparisons of these scores against those reported in the literature.
Though ChainerRL has high-quality implementations of dozens of algorithms, we have so far created such “reproducibility scripts” for nine algorithms. On the Atari benchmark (ale, ), we have successfully reproduced DQN, IQN, Rainbow, and A3C. On the OpenAI Gym MuJoCo benchmark tasks, we have successfully reproduced DDPG, TRPO, PPO, TD3, and SAC.
The reproducibility scripts emphasize correctly reproducing evaluation protocols, which are particularly relevant when evaluating Atari agents. During a typical Atari agent's 50 million timesteps of training, it is periodically evaluated in an offline evaluation phase for a specified number of timesteps before resuming training. Since the agent typically follows some form of exploratory policy during training, it often switches to a different policy specifically for evaluation. Unfortunately, evaluation protocols vary, and results are consequently often reported inconsistently across the literature (revisitingale, ), which can significantly impact comparisons. The critical details of standard Atari evaluation protocols are as follows:
- Evaluation frequency
the frequency (in timesteps) with which the evaluation phase occurs
- Evaluation phase length
the number of timesteps in the offline evaluation
- Evaluation episode length
the maximum duration of an evaluation episode
- Evaluation policy
the policy followed during an evaluation episode
- Reporting protocol
Each intermediate evaluation phase outputs some score, representing the mean score of all evaluation episodes during that evaluation phase. Papers typically report scores according to one of the following reporting protocols:
best-eval: Papers using the best-eval protocol report the highest mean score across all intermediate evaluation phases.
re-eval: Papers using the re-eval protocol report the score of a re-evaluation of the network parameters that produced the best-eval.
Each of these details, especially the reporting protocols, can significantly influence the results, and thus are critical details to hold consistent for a fair comparison between algorithms.
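The two reporting protocols can be expressed directly; the function names below are ours, and under re-eval the snapshot saved at the returned phase index would be re-run to produce the reported score.

```python
def best_eval(phase_means):
    """best-eval protocol: highest mean score over all intermediate
    evaluation phases."""
    return max(phase_means)

def best_eval_phase_index(phase_means):
    """Index of the evaluation phase whose network snapshot would be
    re-evaluated under the re-eval protocol."""
    return max(range(len(phase_means)), key=lambda i: phase_means[i])
```

Because best-eval takes a maximum over noisy per-phase means, it is optimistically biased relative to re-eval, which is one reason mixing the two protocols makes comparisons unfair.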
Table 1 lists the results obtained by ChainerRL's reproducibility scripts for DQN, IQN, Rainbow, and A3C on the Atari benchmark, with comparisons against a published result. Table 2 depicts the evaluation protocol used for each algorithm, with a citation of the source paper whose results we compare against. Note that the results for the A3C algorithm (a3c, ) do not come from the original A3C paper, but from another published paper (noisynetworks, ). For continuous-action algorithms, the results on OpenAI Gym MuJoCo tasks for DDPG (ddpg, ), TRPO (trpo, ), PPO (Schulman2017b, ), TD3 (Fujimoto2018a, ), and SAC (sac, ) are reported in Table 3.
The reproducibility scripts are produced through a combination of reading released source code and studying published hyperparameters, implementation details, and evaluation protocols. We also corresponded extensively with authors by email to clarify ambiguities, omitted details, and inconsistencies in papers.
As seen in both the Atari and MuJoCo reproducibility results, sometimes a reproduction effort cannot be directly compared against the original paper’s reported results. For example, the reported scores in the original paper introducing the A3C algorithm (a3c, ) utilize demonstrations that are not publicly available, making it impossible to accurately compare a re-implementation’s scores to the original paper. In such scenarios, we seek out high-quality published research (noisynetworks, ; deeprlmatters, ; Fujimoto2018a, ) from which faithful reproductions are indeed possible, and compare against these.
In this paper, we introduced a reproducibility-focused deep reinforcement learning library, ChainerRL, and its companion visualizer, the ChainerRL Visualizer. We hope that ChainerRL's comprehensive suite of algorithms, flexible APIs, visualization features, and faithful reproductions can accelerate research in DRL as well as foster its application to a wide range of new and interesting sequential decision-making problems.
ChainerRL has been in active development since 2017, with extensive plans laid out for continued expansion and improvement with novel algorithms, functionality, and paper reproductions. We are currently in the process of releasing a significant number of trained agent models for users to utilize in research and development. We are also adding functionality to enable large-scale distributed RL. Lastly, we plan to expand beyond pure reinforcement learning approaches to sequential decision-making by including algorithms that can learn from demonstrations.
We look forward to continuing our collaboration with the open-source community in developing ChainerRL and accelerating RL research.
We thank Avinash Ummadisingu, Mario Ynocente Castro, Keisuke Nakata, Lester James V. Miranda, and all the open source contributors for their contributions to the development of ChainerRL. We thank Kohei Hayashi and Jason Naradowsky for useful comments on how to improve the paper. We thank the many authors who fielded our questions when reproducing their papers, especially George Ostrovski.
-  Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
-  David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419):1140–1144, 2018.
-  Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-End Training of Deep Visuomotor Policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
-  Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, and Sergey Levine. QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation. In CoRL, 2018.
-  Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with Deep Reinforcement Learning. In NIPS Deep Learning Workshop, 2013.
-  Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The Arcade Learning Environment: An Evaluation Platform for General Agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
-  Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In IROS, 2012.
-  Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep Reinforcement Learning that Matters. In AAAI, 2018.
-  Seiya Tokui, Ryosuke Okuta, Takuya Akiba, Yusuke Niitani, Toru Ogawa, Shunta Saito, Shuji Suzuki, Kota Uenishi, Brian Vogel, and Hiroyuki Yamazaki Vincent. Chainer: A Deep Learning Framework for Accelerating the Research Cycle. In KDD, 2019.
-  Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In ICLR, 2016.
-  Will Dabney, Georg Ostrovski, David Silver, and Rémi Munos. Implicit Quantile Networks for Distributional Reinforcement Learning. In ICML, 2018.
-  Matteo Hessel, Joseph Modayil, Hado van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining Improvements in Deep Reinforcement Learning. In AAAI, 2018.
-  Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, and Sergey Levine. Soft Actor-Critic Algorithms and Applications. arXiv preprint arXiv:1812.05905, 2018.
-  Scott Fujimoto, Herke van Hoof, and David Meger. Addressing Function Approximation Error in Actor-Critic Methods. In ICML, 2018.
-  Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous Methods for Deep Reinforcement Learning. In ICML, 2016.
-  OpenAI Baselines: ACKTR & A2C. https://openai.com/blog/baselines-acktr-a2c/.
-  Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
-  Sam Greydanus, Anurag Koul, Jonathan Dodge, and Alan Fern. Visualizing and Understanding Atari Agents. In ICML, 2018.
-  Riashat Islam, Peter Henderson, Gomrokchi Maziar, and Doina Precup. Reproducibility of Benchmarked Deep Reinforcement Learning Tasks for Continuous Control. In ICML Reproducibility in Machine Learning Workshop, 2017.
-  Cédric Colas, Olivier Sigaud, and Pierre-Yves Oudeyer. How Many Random Seeds? Statistical Power Analysis in Deep Reinforcement Learning Experiments. arXiv preprint arXiv:1806.08295, 2018.
-  Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking Deep Reinforcement Learning for Continuous Control. In ICML, 2016.
-  Pablo Samuel Castro, Subhodeep Moitra, Carles Gelada, Saurabh Kumar, and Marc G. Bellemare. Dopamine: A Research Framework for Deep Reinforcement Learning. arXiv preprint arXiv:1812.06110, 2018.
-  Adam Stooke and Pieter Abbeel. rlpyt: A Research Code Base for Deep Reinforcement Learning in PyTorch. arXiv preprint arXiv:1909.01500, 2019.
-  Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. OpenAI Baselines. https://github.com/openai/baselines, 2017.
-  Itai Caspi, Gal Leibovich, Gal Novik, and Shadi Endrawis. Reinforcement learning coach, December 2017.
-  Ashley Hill, Antonin Raffin, Maximilian Ernestus, Rene Traore, Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, and Yuhuai Wu. Stable baselines. https://github.com/hill-a/stable-baselines, 2018.
-  Eric Liang, Richard Liaw, Robert Nishihara, Philipp Moritz, Roy Fox, Ken Goldberg, Joseph Gonzalez, Michael Jordan, and Ion Stoica. RLlib: Abstractions for Distributed Reinforcement Learning. In ICML, 2018.
-  Sergey Kolesnikov and Oleksii Hrinchuk. Catalyst.RL: A Distributed Framework for Reproducible RL Research. arXiv preprint arXiv:1903.00027, 2019.
-  Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, and Nando de Freitas. Sample Efficient Actor-Critic with Experience Replay. In ICLR, 2017.
-  John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347, 2017.
-  John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust Region Policy Optimization. In ICML, 2015.
-  Hado van Hasselt, Arthur Guez, and David Silver. Deep Reinforcement Learning with Double Q-learning. In AAAI, 2016.
-  Marc G. Bellemare, Will Dabney, and Rémi Munos. A Distributional Perspective on Reinforcement Learning. In ICML, 2017.
-  Marc G. Bellemare, Georg Ostrovski, Arthur Guez, Philip S. Thomas, and Rémi Munos. Increasing the Action Gap: New Operators for Reinforcement Learning. In AAAI, 2016.
-  RJ Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8(3-4):229–256, 1992.
-  Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3-4):293–321, 1992.
-  Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized Experience Replay. In ICLR, 2016.
-  Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, and Nando de Freitas. Dueling Network Architectures for Deep Reinforcement Learning. In ICML, 2016.
-  Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alex Graves, Vlad Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, Charles Blundell, and Shane Legg. Noisy Networks for Exploration. In ICLR, 2018.
-  Kavosh Asadi and Michael L. Littman. An Alternative Softmax Operator for Reinforcement Learning. In ICML, 2017.
-  Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, and Sergey Levine. Continuous Deep Q-Learning with Model-based Acceleration. In ICML, 2016.
-  Marlos C Machado, Marc G Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht, and Michael Bowling. Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents. Journal of Artificial Intelligence Research, 61:523–562, 2018.