ChainerRL: A Deep Reinforcement Learning Library

12/09/2019 ∙ by Yasuhiro Fujita, et al. ∙ Preferred Infrastructure, The University of Tokyo

In this paper, we introduce ChainerRL, an open-source Deep Reinforcement Learning (DRL) library built using Python and the Chainer deep learning framework. ChainerRL implements a comprehensive set of DRL algorithms and techniques drawn from the state-of-the-art research in the field. To foster reproducible research, and for instructional purposes, ChainerRL provides scripts that closely replicate the original papers' experimental settings and reproduce published benchmark results for several algorithms. Lastly, ChainerRL offers a visualization tool that enables the qualitative inspection of trained agents. The ChainerRL source code can be found on GitHub: https://github.com/chainer/chainerrl .


1 Introduction

In recent years, significant strides have been made in numerous complex sequential decision-making problems including game-playing (dqn, ; alphazero, ) and robotic manipulation (Levine2016, ; Kalashnikov2018, ). These advances have been enabled through Deep Reinforcement Learning (DRL), which has undergone tremendous progress since its resurgence in 2013, with the introduction of deep Q-networks (dqn2013, ). Since then, the body of literature on DRL algorithms has rapidly grown. This growing body of algorithms, coupled with the emergence of common benchmark tasks (ale, ; mujoco, ), begets the need for comprehensive libraries, tools, and implementations that can aid RL-based research and development.

While DRL has demonstrated several successes, as a field it still faces significant impediments. DRL algorithms are often sensitive to hyperparameter settings and implementation details, which often go unreported in published work, raising concerns about the state of reproducibility in the field (deeprlmatters, ). These seemingly minor implementation details have striking effects in DRL algorithms since the data collection process is closely tied to the parameter updates, as opposed to the typical supervised learning setting. Such issues make the task of reproducing published results challenging when the original implementation is not open-sourced.

Many open-source software packages have been developed to alleviate these issues by providing reference algorithm implementations. However, as the state-of-the-art rapidly advances, it is difficult for DRL libraries to keep pace. Even when open-source software packages make re-implementations available, they seldom provide comprehensive benchmarks or implementations that faithfully replicate the original settings of published results.

In this paper, we introduce ChainerRL, an open-source DRL library, built using Python and the Chainer (chainer, ) deep learning framework. ChainerRL has the following features/characteristics:

Comprehensive

ChainerRL aims for comprehensiveness as a DRL library. It spans algorithms for both discrete-action and continuous-action tasks, from foundational algorithms such as DQN (dqn, ) and DDPG (ddpg, ) to state-of-the-art algorithms such as IQN (Dabney2018, ), Rainbow (rainbow, ), Soft-Actor-Critic (sac, ), and TD3 (Fujimoto2018a, ). Moreover, ChainerRL supports multiple training paradigms, including standard serial algorithms (e.g. DQN), asynchronous parallel algorithms (e.g. A3C (a3c, )), and synchronous parallel training (e.g. A2C (a2c, )).

Faithful reproduction

To provide reliable baselines as a starting point for future research, ChainerRL emphasizes the faithfulness of its algorithm implementations to their corresponding original papers or implementations, if any. For several state-of-the-art algorithms, we replicate the published training and evaluation details as closely as possible, successfully reproducing the results from the original paper, both in the Atari 2600 (ale, ) and OpenAI Gym (gym, ) MuJoCo benchmarks. We provide training scripts with full reported scores and training times as well as comparisons against the original implementations for each reproduction.

Visualizer

ChainerRL is accompanied by the ChainerRL Visualizer, an agent visualization tool that enables users to effortlessly visualize the behavior of trained agents, both discrete-action and continuous-action. For tasks with image-based observations, it can show saliency maps (visualizingatari, ) to indicate which part of the image the neural network is attending to.

Individually, these features/characteristics may not be entirely novel, but achieving all of them within a single, unified open-source platform gives ChainerRL a unique value that sets it apart from existing libraries.

This paper is organized as follows. In Section 2, we review related work on reproducibility in RL and existing DRL libraries. In Section 3, we explain the design and functionalities of ChainerRL and introduce the ChainerRL Visualizer, our companion visualization tool. In Section 4, we describe our efforts to reproduce published benchmark results, and provide concluding remarks in Section 5.

2 Related work

Recent work has raised a number of concerns about the state of reproducibility in the field of DRL (Islam2017, ; deeprlmatters, ). Minor differences in hyperparameters, architectures, reward scales, etc. can have dramatic effects on the performance of an agent. Even minor implementational differences can cause significant differences between different open-source implementations of the same algorithm. It is also known that there can be high variance in performance among training trials even with identical settings save for the random seeds, necessitating evaluation with multiple random seeds and statistical testing (Colas2018b, ).

Many DRL libraries have been released to provide reference implementations of algorithms with different focuses. rllab (rllab, ) and its successor, garage, provide a thorough set of continuous-action algorithms and their own benchmark environments for systematic benchmarking. Dopamine (dopamine, ) primarily focuses on DQN and its extensions for discrete-action environments. rlpyt (rlpyt, ) is comprehensive as it supports both discrete and continuous-action algorithms from the three classes: policy gradient (with V-functions), deep Q-learning, and policy gradient with Q-functions. Several other libraries also support diverse sets of algorithms (baselines, ; coach, ; stablebaselines, ; rayrllib, ). While it is difficult to compare the comprehensiveness of libraries that are actively developed, ChainerRL’s support of a wide range of algorithms and functionality provides a competitive edge over the other libraries, as we detail in Section 3. catalyst.RL (catalyst_rl, ) aims to address reproducibility issues in RL via deterministic evaluations and by tracking code changes for continuous-action algorithms. ChainerRL addresses the reproducibility challenge by providing implementations that closely follow the original papers’ descriptions and experimental settings. These implementations are then extensively benchmarked to best reproduce the original reported scores.

3 Design of ChainerRL

Figure 1: A depiction of ChainerRL. In ChainerRL, DRL algorithms, called agents, are written by implementing the abstract Agent interface, typically using the offered building blocks. Such agents can be trained with the experiment utilities and inspected with the ChainerRL Visualizer. ChainerRL also provides a set of scripts for reproducing published results for agents.

3.1 Agents

In ChainerRL, each DRL algorithm is written as a class that implements the Agent interface. The Agent interface provides a mechanism through which an agent interacts with an environment, e.g. through an abstract method Agent.act_and_train(obs, reward, done) that takes the current observation, the previous step’s immediate reward, and a flag for episode termination, and returns the agent’s action to execute in the environment. By implementing such methods, both the update rule and the action-selection procedure are specified for an algorithm.
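To make the interaction pattern concrete, the following is a minimal sketch in plain Python rather than ChainerRL's actual classes. It assumes the `act_and_train(obs, reward, done)` signature described above; the `Agent` base class, `RandomAgent`, and `CountdownEnv` are hypothetical stand-ins introduced only for illustration.

```python
import random
from abc import ABC, abstractmethod


class Agent(ABC):
    """Hypothetical stand-in for an agent interface (not ChainerRL's actual class)."""

    @abstractmethod
    def act_and_train(self, obs, reward, done):
        """Observe the latest transition, update the agent, and return an action."""


class RandomAgent(Agent):
    """Picks uniformly random actions and only accumulates the rewards it is fed."""

    def __init__(self, n_actions, seed=0):
        self.n_actions = n_actions
        self.rng = random.Random(seed)
        self.total_reward = 0.0

    def act_and_train(self, obs, reward, done):
        # A learning agent would update its model parameters here.
        self.total_reward += reward
        return self.rng.randrange(self.n_actions)


class CountdownEnv:
    """Toy Gym-like environment (hypothetical): `length` steps, reward 1 for action 0."""

    def __init__(self, length=5):
        self.length = length
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 0 else 0.0
        return self.t, reward, self.t >= self.length, {}


env = CountdownEnv(length=5)
agent = RandomAgent(n_actions=2)
obs, reward, done = env.reset(), 0.0, False
steps = 0
while not done:
    action = agent.act_and_train(obs, reward, done)
    obs, reward, done, _ = env.step(action)
    steps += 1
# Feed the final transition so that the last reward is also observed.
agent.act_and_train(obs, reward, done)
```

Because the update rule and the action selection live behind a single method, the loop above does not need to know which algorithm it is driving.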

An agent’s internals consist of any model parameters needed for decision-making and model updating. ChainerRL includes several built-in agents that implement popular algorithms including DQN (dqn, ), IQN (Dabney2018, ), Rainbow (rainbow, ), A2C (a2c, ), A3C (a3c, ), ACER (acer, ), DDPG (ddpg, ), PPO (Schulman2017b, ), TRPO (trpo, ), TD3 (Fujimoto2018a, ), and SAC (sac, ).¹

¹ Within the Deep Q-networks (DQN) (dqn, ) family of algorithms, ChainerRL has: DQN (dqn, ), Double DQN (ddqn, ), Categorical DQN (distributionaldqn, ), Rainbow (rainbow, ), Implicit Quantile Networks (IQN) (Dabney2018, ), Off-policy SARSA, and (Persistent) Advantage Learning (Bellemare2016c, ). Within policy gradient methods, ChainerRL has: (Asynchronous) Advantage Actor-Critic (A2C (a2c, ) and A3C (a3c, )), Actor-Critic with Experience Replay (ACER) (acer, ), Deep Deterministic Policy Gradients (DDPG) (ddpg, ), Twin-delayed double DDPG (TD3) (Fujimoto2018a, ), Proximal Policy Optimization (PPO) (Schulman2017b, ), REINFORCE (Williams1992, ), Trust Region Policy Optimization (TRPO) (trpo, ), and Soft Actor-Critic (SAC) (sac, ).

3.2 Experiments

While users can directly use agents via the interface for maximum flexibility, ChainerRL provides an experiments module that manages the interactions between the agent and the environment as well as the training and evaluation schedule. The module supports any environment that is compatible with the Env class of OpenAI Gym (gym, ). An experiment takes as input an agent and an environment, queries the agent for actions, executes them in the environment, and feeds the agent the rewards for training updates. Moreover, an experiment can periodically perform evaluations, possibly in a separate evaluation environment, storing relevant statistics regarding the agent’s performance.
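The loop that an experiment manages can be sketched as follows. This is an illustrative plain-Python approximation, not the experiments module itself: `CoinEnv`, `ConstantAgent`, `evaluate`, and `train_with_evaluation` are hypothetical names, and the environment only mimics the `reset`/`step` shape of a Gym `Env`.

```python
import random


class CoinEnv:
    """Toy Gym-like environment (hypothetical): guess a coin flip for 10 steps."""

    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return 0  # dummy observation

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == self.rng.randrange(2) else 0.0
        return 0, reward, self.t >= 10, {}


class ConstantAgent:
    """Placeholder agent that always plays action 0 and counts training calls."""

    def __init__(self):
        self.updates = 0

    def act_and_train(self, obs, reward, done):
        self.updates += 1
        return 0

    def act(self, obs):
        return 0


def evaluate(agent, env, n_episodes):
    """Return the mean undiscounted return over `n_episodes` evaluation episodes."""
    returns = []
    for _ in range(n_episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            obs, reward, done, _ = env.step(agent.act(obs))
            total += reward
        returns.append(total)
    return sum(returns) / len(returns)


def train_with_evaluation(agent, env, eval_env, steps, eval_interval, eval_episodes):
    """Run `steps` training steps, evaluating every `eval_interval` steps."""
    history = []  # list of (training step, mean evaluation return)
    obs, reward, done = env.reset(), 0.0, False
    for t in range(1, steps + 1):
        action = agent.act_and_train(obs, reward, done)
        obs, reward, done, _ = env.step(action)
        if done:
            agent.act_and_train(obs, reward, done)  # feed the final transition
            obs, reward, done = env.reset(), 0.0, False
        if t % eval_interval == 0:
            history.append((t, evaluate(agent, eval_env, eval_episodes)))
    return history


history = train_with_evaluation(
    ConstantAgent(), CoinEnv(seed=0), CoinEnv(seed=1),
    steps=100, eval_interval=25, eval_episodes=3,
)
```

Using a separate evaluation environment, as in this sketch, keeps evaluation episodes from disturbing the state of the training episode in progress.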

Through the experiments module, ChainerRL supports batch or asynchronous training, enabling agents to act and train synchronously or asynchronously in several environments in parallel. Asynchronous training, where multiple agents interact with multiple environments while sharing the model parameters, is supported for A3C, ACER, and n-step Q-learning (a3c, ). Synchronous parallel training, where a single agent interacts with multiple environments synchronously, known to practitioners for A2C (a2c, ), is supported to leverage GPUs not only for A2C but also for the majority of agents including IQN (Dabney2018, ) and Soft Actor-Critic (sac, ).
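The synchronous case can be illustrated with a small sketch in which one policy acts on a batch of observations from several environments stepped in lockstep. Everything here is an assumption made for illustration: `CounterEnv`, `LockstepVectorEnv`, and `batch_act` are hypothetical, and the environments are stepped sequentially in one process rather than in parallel processes.

```python
class CounterEnv:
    """Toy Gym-like environment (hypothetical): an episode ends after `length` steps."""

    def __init__(self, length):
        self.length = length
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        return self.t, float(action), self.t >= self.length, {}


class LockstepVectorEnv:
    """Steps several environments in lockstep, resetting any that finish."""

    def __init__(self, envs):
        self.envs = envs

    def reset(self):
        return [env.reset() for env in self.envs]

    def step(self, actions):
        results = []
        for env, action in zip(self.envs, actions):
            obs, reward, done, _ = env.step(action)
            if done:
                obs = env.reset()  # start the next episode immediately
            results.append((obs, reward, done))
        batch_obs, batch_reward, batch_done = map(list, zip(*results))
        return batch_obs, batch_reward, batch_done


def batch_act(batch_obs):
    """One policy evaluated on the whole batch (here: play 1 on even observations)."""
    return [1 if obs % 2 == 0 else 0 for obs in batch_obs]


vec_env = LockstepVectorEnv([CounterEnv(length=n) for n in (2, 3, 4)])
batch_obs = vec_env.reset()
total_rewards = [0.0, 0.0, 0.0]
episodes_finished = [0, 0, 0]
for _ in range(12):
    actions = batch_act(batch_obs)
    batch_obs, rewards, dones = vec_env.step(actions)
    for i, (reward, done) in enumerate(zip(rewards, dones)):
        total_rewards[i] += reward
        episodes_finished[i] += int(done)
```

Evaluating the policy once per batch, rather than once per environment, is what lets a single forward pass on a GPU serve all environments.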

3.3 Developing a new agent

The Agent interface is defined very abstractly and flexibly so that users can easily implement new algorithms while leveraging the experiments utility and parallel training infrastructure. The general flow for developing and evaluating a new agent is as follows. First, a class that inherits Agent is created. Next, the learning update rules and the algorithm’s action-selection mechanisms are implemented, employing the many building blocks that ChainerRL provides for building agents (see Section 3.4). Once an agent is created, one can use any Gym-like environment combined with our experimental and evaluation utilities in experiments to easily train and evaluate the agent within the specified environment.

3.4 Agent building blocks

We have described at a high level how agents interact with the environment in ChainerRL, as well as some of the built-in agents and experimental utilities offered. However, these built-in agents are typically built with a set of reusable components that ChainerRL offers. While a comprehensive treatment of the features built into ChainerRL is beyond the scope of this paper, we highlight here some of the building blocks that demonstrate the flexibility and reusability of ChainerRL.

Explorers

To easily develop an agent’s action-selection mechanisms during training, ChainerRL has several built-in Explorers including ε-greedy, Boltzmann exploration, additive Gaussian noise, and additive Ornstein-Uhlenbeck noise (ddpg, ).
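As a rough illustration of two of these exploration strategies, the sketch below implements ε-greedy and Boltzmann action selection over a list of Q-values using only the standard library. The function names and signatures are hypothetical and do not correspond to ChainerRL's Explorer classes.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def boltzmann(q_values, temperature, rng):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    m = max(q_values)  # subtract the maximum for numerical stability
    weights = [math.exp((q - m) / temperature) for q in q_values]
    return rng.choices(range(len(q_values)), weights=weights)[0]


rng = random.Random(0)
q_values = [0.1, 0.5, 0.2]
greedy_action = epsilon_greedy(q_values, epsilon=0.0, rng=rng)
explored = [epsilon_greedy(q_values, epsilon=1.0, rng=rng) for _ in range(200)]
sampled = [boltzmann(q_values, temperature=0.1, rng=rng) for _ in range(200)]
```

With `epsilon=0.0` the selection is purely greedy, while a low Boltzmann temperature concentrates the samples on the highest-valued action without excluding the others.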

Replay buffers

Replay buffers (Lin1992, ; dqn, ) have become standard tools in off-policy DRL. In addition to the standard replay buffer that uniformly samples transitions, ChainerRL supports episodic buffers that sample past (sub-)episodes for recurrent models, and prioritized buffers that implement prioritized experience replay (prioritizeddqn, ). ChainerRL also supports sampling n steps of transitions, allowing for the easy implementation of algorithm variants based on n-step returns.
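One way to combine uniform sampling with n-step returns is sketched below. This is a simplified illustration under our own assumptions, not ChainerRL's buffer implementation: `NStepReplayBuffer` and its tuple layout `(state, action, reward, next_state, done)` are hypothetical.

```python
import random
from collections import deque


class NStepReplayBuffer:
    """Uniform replay buffer storing n-step transitions (illustrative sketch)."""

    def __init__(self, capacity, n_steps, gamma):
        self.memory = deque(maxlen=capacity)
        self.pending = deque()  # recent transitions not yet turned into n-step ones
        self.n_steps = n_steps
        self.gamma = gamma

    def _flush_oldest(self):
        # Discounted sum of the pending rewards, starting from the oldest transition.
        n_step_return = sum(
            (self.gamma ** i) * tr[2] for i, tr in enumerate(self.pending)
        )
        state, action = self.pending[0][0], self.pending[0][1]
        next_state, done = self.pending[-1][3], self.pending[-1][4]
        self.memory.append((state, action, n_step_return, next_state, done))
        self.pending.popleft()

    def append(self, state, action, reward, next_state, done):
        self.pending.append((state, action, reward, next_state, done))
        if len(self.pending) == self.n_steps:
            self._flush_oldest()
        if done:
            # At the end of an episode, emit the remaining shorter transitions.
            while self.pending:
                self._flush_oldest()

    def sample(self, batch_size, rng=random):
        return rng.sample(list(self.memory), batch_size)


buffer = NStepReplayBuffer(capacity=100, n_steps=3, gamma=0.5)
for t in range(4):  # one 4-step episode with a reward of 1 at every step
    buffer.append(state=t, action=0, reward=1.0, next_state=t + 1, done=(t == 3))
stored_returns = [transition[2] for transition in buffer.memory]
batch = buffer.sample(2, rng=random.Random(0))
```

In this example the first stored transition has return 1 + 0.5 + 0.25 = 1.75, and the transitions near the episode end are truncated to fewer than three steps.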

Neural networks

While ChainerRL supports any neural network model that is implemented in Chainer (chainer, ) as chainer.Link, it has several pre-defined architectures, including DQN architectures, dueling network architectures (duelingdqn, ), noisy networks (noisynetworks, ), and multi-layer perceptrons. Recurrent models are supported for many algorithms, including DQN and IQN.

Distributions

Distributions are parameterized objects used to model action distributions. Neural network models that return a Distribution object are considered a policy. Supported policies include Gaussian, Softmax, Mellowmax (Asadi2016a, ), and deterministic policies.
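The idea that a policy is simply a model returning a distribution object can be sketched as follows. The `SoftmaxDistribution` class and the fixed linear `policy` function are hypothetical illustrations written in plain Python; they are not ChainerRL's Distribution classes.

```python
import math
import random


class SoftmaxDistribution:
    """Categorical action distribution parameterized by logits (illustrative only)."""

    def __init__(self, logits):
        m = max(logits)  # subtract the maximum for numerical stability
        exps = [math.exp(x - m) for x in logits]
        total = sum(exps)
        self.probs = [e / total for e in exps]

    def sample(self, rng):
        return rng.choices(range(len(self.probs)), weights=self.probs)[0]

    def log_prob(self, action):
        return math.log(self.probs[action])

    @property
    def entropy(self):
        return -sum(p * math.log(p) for p in self.probs if p > 0.0)

    @property
    def most_probable(self):
        return max(range(len(self.probs)), key=lambda a: self.probs[a])


def policy(obs):
    """Hypothetical policy: a fixed linear map from a 2-d observation to 3 logits."""
    weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
    logits = [sum(w * o for w, o in zip(row, obs)) for row in weights]
    return SoftmaxDistribution(logits)


dist = policy([2.0, 0.5])
action = dist.sample(random.Random(0))
```

Returning an object rather than a raw action lets the same model serve both action selection (`sample`) and quantities needed by policy gradient updates (`log_prob`, `entropy`).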

Action values

Similarly to Distributions, ActionValues parameterizing the values of actions are used as outputs of neural networks to model Q-functions. Supported Q-functions include the one that evaluates discrete actions typical for DQN as well as categorical (distributionaldqn, ) and quantile (Dabney2018, ) Q-functions for distributional RL. For continuous action spaces, quadratic Q-functions called Normalized Advantage Functions (NAFs) (Gu2016b, ) are also supported.
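A minimal sketch of this idea for discrete actions is given below, including a categorical variant in which each action's Q-value is the expectation of a return distribution over fixed atoms. The class names mirror the concept only; they are hypothetical and are not ChainerRL's ActionValue classes.

```python
class DiscreteActionValue:
    """Q-values for a finite set of actions (illustrative, not ChainerRL's class)."""

    def __init__(self, q_values):
        self.q_values = list(q_values)

    @property
    def greedy_action(self):
        return max(range(len(self.q_values)), key=lambda a: self.q_values[a])

    @property
    def max_value(self):
        return max(self.q_values)


class CategoricalActionValue(DiscreteActionValue):
    """Distributional Q-function: per-action probabilities over fixed return atoms."""

    def __init__(self, atom_probs, atoms):
        self.atom_probs = atom_probs
        self.atoms = atoms
        # The Q-value of an action is the expectation of its return distribution.
        super().__init__(
            sum(p * z for p, z in zip(probs, atoms)) for probs in atom_probs
        )


atoms = [-1.0, 0.0, 1.0]
qout = CategoricalActionValue(
    atom_probs=[
        [0.5, 0.0, 0.5],    # action 0: expected return 0.0
        [0.0, 0.25, 0.75],  # action 1: expected return 0.75
    ],
    atoms=atoms,
)
```

Because both classes expose the same `greedy_action` and `max_value` properties, code that selects actions or forms bootstrap targets can stay agnostic to how the Q-values are parameterized.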

The set of algorithms that can be developed by combining the agent building blocks of ChainerRL is large. One notable example is Rainbow (rainbow, ), which combines double updating (ddqn, ), prioritized replay (prioritizeddqn, ), n-step learning, dueling architectures (duelingdqn, ), and Categorical DQN (distributionaldqn, ) into a single agent. The following pseudocode depicts the simplicity of creating and training a Rainbow agent with ChainerRL.

    import chainerrl as crl
    import gym

    q_func = crl.q_functions.DistributionalDuelingDQN(…)  # dueling
    crl.links.to_factorized_noisy(q_func)  # noisy networks
    # Prioritized Experience Replay Buffer with a 3-step reward
    per = crl.replay_buffers.PrioritizedReplayBuffer(num_step_return=3, …)
    # Create a rainbow agent
    rainbow = crl.agents.CategoricalDoubleDQN(per, q_func, …)
    num_envs = 5  # Train in five environments
    env = crl.envs.MultiprocessVectorEnv(
        [gym.make("Breakout") for _ in range(num_envs)])

    # Train the agent and collect evaluation statistics
    crl.experiments.train_agent_batch_with_evaluation(rainbow, env, steps=…)

We first create a distributional dueling Q-function and then, in a single line, convert it to a noisy network. We then initialize a prioritized replay buffer configured to use n-step rewards (here, n = 3). We pass this replay buffer to a CategoricalDoubleDQN agent, which is a built-in ChainerRL agent, to produce a Rainbow agent. Moreover, with ChainerRL, users can easily specify the number of environments in which to train the Rainbow agent in parallel processes, and the experiments module will automatically manage the training loops, evaluation statistics, logging, and saving of the agent.

3.5 Visualization

Figure 2: The ChainerRL Visualizer. With the ChainerRL Visualizer, the user can closely investigate an agent’s behaviors within a browser window. The top image is the visualization of a trained A3C agent on Breakout, while the bottom one is that of a C51 agent trained on Seaquest.

ChainerRL is accompanied by a visualizer: ChainerRL Visualizer, which takes as input an environment and an agent, and allows users to inspect their agents from a browser UI easily. Figure 2 depicts some of the key features available in the ChainerRL Visualizer. The top of the figure depicts a trained A3C agent in the Atari game Breakout. With the visualizer, one can visualize the portions of the pixel input that the agent is attending to as a saliency map (visualizingatari, ). Additionally, users can perform careful, controlled investigations of agents by manually stepping through an episode, or can alternatively view rollouts of agents. Since A3C is an actor-critic agent, we can view the probabilities with which the agent will perform a specific action, as well as the agent’s predicted state values. If the agent learns Q-values or a distribution of Q-values, the predicted Q-value or Q-value distribution for each action can be displayed, as shown in the bottom of Figure 2.

4 Reproducibility

Game  DQN (CRL)  DQN (Published)  IQN (CRL)  IQN (Published)  Rainbow (CRL)  Rainbow (Published)  A3C (CRL)  A3C (Published)
Air Raid 6450.5 - 9672.1 - 6500.9 - 3767.8 -
Alien 1713.1 3069 12484.3 7022 9409.1 9491.7 1600.7 2027
Amidar 986.7 739.5 2392.3 2946 3252.7 5131.2 873.1 904
Assault 3317.2 3359 24731.9 29091 15245.5 14198.5 4819.8 2879
Asterix 5936.7 6012 454846.7 342016 353258.5 428200.3 10792.4 6822
Asteroids 1584.5 1629 3885.9 2898 2792.3 2712.8 2691.2 2544
Atlantis 96456.0 85641 946912.5 978200 894708.5 826659.5 806650.0 422700
Bank Heist 645.0 429.7 1326.3 1416 1734.8 1358.0 1327.9 1296
Battle Zone 5313.3 26300 69316.2 42244 90625.0 62010.0 4208.8 16411
Beam Rider 7042.9 6846 38111.4 42776 27959.5 16850.2 8946.9 9214
Berzerk 707.2 - 138167.9 1053 26704.2 2545.6 1527.2 1022
Bowling 52.3 42.4 84.3 86.5 67.1 30.0 31.7 37
Boxing 89.6 71.8 99.9 99.8 99.8 99.6 99.0 91
Breakout 364.9 401.2 658.6 734 340.8 417.5 575.9 496
Carnival 5222.0 - 5267.2 - 5530.3 - 5121.9 -
Centipede 5112.6 8309 11265.2 11561 7718.1 8167.3 5647.5 5350
Chopper Command 6170.0 6687 43466.9 16836 303480.5 16654.0 5916.3 5285
Crazy Climber 108472.7 114103 178111.6 179082 165370.0 168788.5 120583.3 134783
Demon Attack 9044.3 9711 134637.5 128580 110028.0 111185.2 112456.3 37085
Double Dunk -9.7 -18.1 8.3 5.6 -0.1 -0.3 1.5 3
Enduro 298.2 301.8 2363.3 2359 2273.8 2125.9 0.0 0
Fishing Derby 11.6 -0.8 39.3 33.8 45.3 31.3 37.7 -7
Freeway 8.1 30.3 34.0 34.0 33.7 34.0 0.0 0
Frostbite 1093.9 328.3 8531.3 4342 10432.3 9590.5 312.6 288
Gopher 8370.0 8520 116037.5 118365 76662.9 70354.6 10608.9 7992
Gravitar 445.7 306.7 1010.8 911 1819.5 1419.3 250.5 379
Hero 20538.7 19950 27639.9 28386 12590.5 55887.4 36264.3 30791
Ice Hockey -2.4 -1.6 -0.3 0.2 5.1 1.1 -4.5 -2
Jamesbond 851.7 576.7 27959.5 35108 31392.0 - 373.7 509
Journey Escape -1894.0 - -685.6 - 0.0 - -1026.5 -
Kangaroo 8831.3 6740 15517.7 15487 14462.5 14637.5 107.0 1166
Krull 6215.0 3805 9809.3 10707 7989.0 8741.5 9260.2 9422
Kung Fu Master 27616.7 23270 87566.3 73512 22820.5 52181.0 37750.0 37422
Montezuma Revenge 0.0 0.0 0.6 0.0 4.0 384.0 2.6 14
Ms Pacman 2526.6 2311 5786.5 6349 6153.4 5380.4 2851.4 2436
Name This Game 7046.5 7257 23151.3 22682 14035.1 13136.0 11301.1 7168
Phoenix 7054.4 - 145318.8 56599 5169.6 108528.6 38671.4 9476
Pitfall -28.3 - 0.0 0.0 0.0 0.0 -2.0 0
Pong 20.1 18.9 21.0 21.0 20.9 20.9 20.9 7
Pooyan 3118.7 - 28041.5 - 7793.1 - 4328.9 -
Private Eye 1538.3 1788 289.9 200 100.0 4234.0 725.3 3781
Qbert 10516.0 10596 24950.3 25750 42481.1 33817.5 19831.0 18586
Riverraid 7784.1 8316 20716.1 17765 26114.0 - 13172.8 -
Road Runner 37092.0 18257 63523.6 57900 64306.0 62041.0 40348.1 45315
Robotank 47.4 51.6 77.1 62.5 74.4 61.4 3.0 6
Seaquest 6075.7 5286 27045.5 30140 4286.8 15898.9 1789.5 1744
Skiing -13030.2 - -9354.7 -9289 -9441.0 -12957.8 -15820.1 -12972
Solaris 1565.1 - 7423.3 8007 7902.2 3560.3 3395.6 12380
Space Invaders 1583.2 1976 27810.9 28888 2838.0 18789.0 1739.5 1034
Star Gunner 56685.3 57997 189208.0 74677 181192.5 127029.0 60591.7 49156
Tennis -5.4 -2.5 23.8 23.6 -0.1 0.0 -13.1 -6
Time Pilot 5738.7 5947 12758.3 12236 25582.0 12926.0 4077.5 10294
Tutankham 141.9 186.7 337.4 293 251.9 241.0 274.5 213
Up N Down 11821.5 8456 83140.0 88148 284465.6 - 78790.0 89067
Venture 656.7 380.0 289.0 1318 1499.0 5.5 0.0 0
Video Pinball 9194.5 42684 664013.5 698045 492071.8 533936.5 518840.8 229402
Wizard Of Wor 1957.3 3393 20892.8 31190 19796.5 17862.5 2488.4 8953
Yars Revenge 4397.3 - 30385.0 28379 80817.2 102557.0 14217.7 21596
Zaxxon 5698.7 4977 14754.4 21772 26827.5 22209.5 86.8 16544
# Higher scores 22 26 27 25 30 20 26 25
# Ties (one count per algorithm: DQN, IQN, Rainbow, A3C) 1 3 2 3
# Seeds 5 1 2 1 1 1 1 3
Table 1: The performance of ChainerRL against published DQN, IQN, Rainbow, and A3C results on Atari benchmarks. For each algorithm, we compare the number of domains for which ChainerRL scores higher or published paper scores higher. See Table 2 for the evaluation protocols used to obtain the scores.
Protocol  DQN (dqn, )  IQN (Dabney2018, )  Rainbow (rainbow, )  A3C (noisynetworks, )
Evaluation Frequency (timesteps) 250K 250K 250K 250K
Evaluation Phase (timesteps) 125K 125K 125K 125K
Evaluation Episode Length (time) 5 min 30 min 30 min 30 min
Evaluation Episode Policy N/A
Reporting Protocol re-eval best-eval re-eval best-eval
Table 2: Evaluation protocols used for the Atari reproductions. These evaluation protocols match the evaluation protocol of the papers referenced in the first row. An evaluation episode policy with an ε indicates that the agent performs an ε-greedy evaluation.
Algorithm (source)           DDPG (Fujimoto2018a, )     TD3 (Fujimoto2018a, )
Environment                  CRL         Published      CRL                   Published
HalfCheetah-v2               10325.45    8577.29        10248.51 ± 1063.48    9636.95 ± 859.065
Hopper-v2                    3565.60     1860.02        3662.85 ± 144.98      3564.07 ± 114.74
Walker2d-v2                  3594.26     3098.11        4978.32 ± 517.44      4682.82 ± 539.64
Ant-v2                       774.46      888.77         4626.25 ± 1020.70     4372.44 ± 1000.33
Reacher-v2                   -2.92       -4.01          -2.55 ± 0.19          -3.60 ± 0.56
InvertedPendulum-v2          902.25      1000.00        1000.00 ± 0.0         1000.00 ± 0.0
InvertedDoublePendulum-v2    7495.56     8369.95        8435.33 ± 2771.39     9337.47 ± 14.96

Algorithm (source)  TRPO (deeprlmatters, )      PPO (deeprlmatters, )       SAC (sac, )
Environment         CRL          Published      CRL          Published      CRL         Published
HalfCheetah-v2      1474 ± 112   205 ± 256      2404 ± 185   2201 ± 323     14850.54    ~15000
Hopper-v2           3056 ± 44    2828 ± 70      2719 ± 67    2790 ± 62      2911.89     ~3300
Walker2d-v2         3073 ± 59    -              2994 ± 113   -              5282.61     ~5600
Ant-v2              -            -              -            -              5925.63     ~5800
Swimmer-v2          200 ± 25     -              111 ± 4      -              -           -
Humanoid-v2         -            -              -            -              7772.08     ~8000
Table 3: The performance of ChainerRL against published baselines on OpenAI Gym MuJoCo benchmarks. For DDPG and TD3, each ChainerRL score represents the maximum evaluation score during 1M-step training, averaged over 10 trials with different random seeds, where each evaluation phase of ten episodes is run after every 5000 steps. For PPO and TRPO, each ChainerRL score represents the final evaluation of 100 episodes after 2M-step training, averaged over 10 trials with different random seeds. For SAC, each ChainerRL score reports the final evaluation of 10 episodes after training for 1M (Hopper-v2), 3M (HalfCheetah-v2, Walker2d-v2, and Ant-v2), or 10M (Humanoid-v2) steps, averaged over 10 trials with different random seeds. Since the original paper  (sac, ) provides learning curves only, the published scores are approximated visually from the learning curve. The sources of the published scores are cited with each algorithm. We used the v2 environments, whereas some published papers evaluated on the now-deprecated v1 environments.
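The two aggregation schemes described in the caption (maximum evaluation score per trial versus final evaluation score per trial, each averaged over trials) can be sketched as follows. The evaluation curves are made-up numbers used purely for illustration, and the use of the sample standard deviation as the spread measure is our assumption rather than something stated in the caption.

```python
import statistics

# Hypothetical evaluation curves: one list of periodic evaluation scores per trial.
# These numbers are invented for illustration and are not results from the paper.
trials = [
    [100.0, 250.0, 240.0],
    [90.0, 200.0, 260.0],
    [120.0, 230.0, 225.0],
]

# Scheme used for DDPG and TD3: maximum evaluation score during training,
# averaged over trials with different random seeds.
per_trial_max = [max(curve) for curve in trials]
mean_of_max = statistics.mean(per_trial_max)
spread_of_max = statistics.stdev(per_trial_max)  # assumed spread measure

# Scheme used for PPO, TRPO, and SAC: final evaluation score, averaged over trials.
per_trial_final = [curve[-1] for curve in trials]
mean_final = statistics.mean(per_trial_final)
```

Since the maximum over a curve is never below its final value, the first scheme cannot yield a lower average than the second on the same curves, which is one reason scores obtained under different schemes are not directly comparable.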

Many open-source DRL libraries offer high-quality implementations of algorithms but often deviate from the original paper’s implementation details. We currently provide a set of compact examples, i.e., single files, of paper implementations written with ChainerRL that users can run to reproduce the results of the original research paper. These “reproducibility scripts” are carefully written to replicate as closely as possible the original paper’s (or in some cases, another published paper’s) implementation and evaluation details. For each of these scripts, we provide the training times of the script (in our repository), full tables of our achieved scores, and comparisons of these scores against those reported in the literature.

Though ChainerRL has high-quality implementations of dozens of algorithms, we have so far created such “reproducibility scripts” for 9 algorithms. In the Atari benchmark (ale, ), we have successfully reproduced DQN, IQN, Rainbow, and A3C. For the OpenAI Gym MuJoCo benchmark tasks, we have successfully reproduced DDPG, TRPO, PPO, TD3, and SAC.

The reproducibility scripts emphasize correctly reproducing evaluation protocols, which are particularly relevant when evaluating Atari agents. During a typical Atari agent’s 50 million timesteps of training, it is evaluated periodically in an offline evaluation phase for a specified number of timesteps before resuming training. Because the agent typically follows some form of exploratory policy during training, it often switches to a different policy specifically for evaluations. Unfortunately, evaluation protocols tend to vary, and consequently results are often reported inconsistently across the literature (revisitingale, ), which significantly affects the reported scores. The critical details of standard Atari evaluation protocols are as follows:

Evaluation frequency

the frequency (in timesteps) with which the evaluation phase occurs

Evaluation phase length

the number of timesteps in the offline evaluation

Evaluation episode length

the maximum duration of an evaluation episode

Evaluation policy

The policy to follow during an evaluation episode

Reporting protocol

Each intermediate evaluation phase outputs some score, representing the mean score of all evaluation episodes during that evaluation phase. Papers typically report scores according to one of the following reporting protocols:

  1. best-eval: Papers using the best-eval protocol report the highest mean score across all intermediate evaluation phases.

  2. re-eval: Papers using the re-eval protocol report the score of a re-evaluation of the network parameters that produced the best-eval.

Each of these details, especially the reporting protocols, can significantly influence the results, and thus are critical details to hold consistent for a fair comparison between algorithms.
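The difference between the two reporting protocols can be sketched with a small simulation. The checkpoint qualities and the Gaussian evaluation noise below are hypothetical, and `run_evaluation_phase` is an illustrative helper rather than part of ChainerRL.

```python
import random


def run_evaluation_phase(policy_quality, n_episodes, rng):
    """Mean score over noisy evaluation episodes of a policy with the given quality."""
    scores = [policy_quality + rng.gauss(0.0, 10.0) for _ in range(n_episodes)]
    return sum(scores) / len(scores)


rng = random.Random(0)
# Hypothetical "true" quality of the parameters at each intermediate evaluation phase.
checkpoint_quality = [10.0, 40.0, 55.0, 50.0, 52.0]
phase_scores = [
    run_evaluation_phase(q, n_episodes=5, rng=rng) for q in checkpoint_quality
]

# best-eval: report the highest mean score across all intermediate evaluation phases.
best_phase = max(range(len(phase_scores)), key=lambda i: phase_scores[i])
best_eval = phase_scores[best_phase]

# re-eval: re-evaluate the parameters that produced the best-eval and report that score.
re_eval = run_evaluation_phase(checkpoint_quality[best_phase], n_episodes=5, rng=rng)
```

Because best-eval takes a maximum over noisy estimates, it tends to be optimistic, whereas re-eval draws fresh episodes for the selected parameters; this is one way the choice of reporting protocol alone can shift a reported score.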

Table 1 lists the results obtained by ChainerRL’s reproducibility scripts for DQN, IQN, Rainbow, and A3C on the Atari benchmark, with comparisons against a published result. Table 2 depicts the evaluation protocol used for each algorithm, with a citation of the source paper whose results we compare against. Note that the results for the A3C (a3c, ) algorithm do not come from the original A3C paper, but from another paper (noisynetworks, ). For continuous-action algorithms, the results on OpenAI Gym MuJoCo tasks for DDPG (ddpg, ), TRPO (trpo, ), PPO (Schulman2017b, ), TD3 (Fujimoto2018a, ), and SAC (sac, ) are reported in Table 3.
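The per-algorithm summary rows of Table 1 (number of games where each side scores higher, and ties) can be computed as sketched below. The rows are a small subset of the DQN columns of Table 1, with `None` standing for a missing published score; the helper name `compare_scores` is hypothetical.

```python
def compare_scores(rows):
    """Count games where ChainerRL is higher, the published score is higher, or tied.

    Games without both scores are skipped, since no comparison is possible.
    """
    crl_higher = published_higher = ties = 0
    for _game, crl, published in rows:
        if crl is None or published is None:
            continue
        if crl > published:
            crl_higher += 1
        elif crl < published:
            published_higher += 1
        else:
            ties += 1
    return crl_higher, published_higher, ties


# A subset of the DQN columns of Table 1: (game, ChainerRL score, published score).
dqn_rows = [
    ("Air Raid", 6450.5, None),
    ("Alien", 1713.1, 3069.0),
    ("Amidar", 986.7, 739.5),
    ("Breakout", 364.9, 401.2),
    ("Montezuma Revenge", 0.0, 0.0),
    ("Pong", 20.1, 18.9),
]
crl_higher, published_higher, ties = compare_scores(dqn_rows)
```

On this subset the counts are 2 games where ChainerRL is higher, 2 where the published score is higher, and 1 tie, with Air Raid excluded for lack of a published score.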

The reproducibility scripts are produced through a combination of reading released source code and studying published hyperparameters, implementation details, and evaluation protocols. We also corresponded extensively by email with authors to clarify ambiguities, omitted details, or inconsistencies that may exist in papers.

As seen in both the Atari and MuJoCo reproducibility results, sometimes a reproduction effort cannot be directly compared against the original paper’s reported results. For example, the reported scores in the original paper introducing the A3C algorithm (a3c, ) utilize demonstrations that are not publicly available, making it impossible to accurately compare a re-implementation’s scores to the original paper. In such scenarios, we seek out high-quality published research (noisynetworks, ; deeprlmatters, ; Fujimoto2018a, ) from which faithful reproductions are indeed possible, and compare against these.

5 Conclusion

In this paper, we introduced a reproducibility-focused deep reinforcement learning library, ChainerRL, and its companion visualizer, ChainerRL Visualizer. We hope that ChainerRL’s comprehensive suite of algorithms, flexible APIs, visualization features, and faithful reproductions can accelerate research in DRL as well as foster its application to a wide range of new and interesting sequential decision-making problems.

ChainerRL has been in active development since 2017 with extensive plans laid out for continued expansion and improvement with novel algorithms, functionality, and paper reproductions. We are currently in the process of releasing a significant number of trained agent models for users to utilize in research and development. We are also adding functionality to enable large scale distributed RL. Lastly, we plan to expand beyond pure reinforcement learning approaches for sequential decision-making by including algorithms that can learn from demonstrations.

We look forward to continuing our collaboration with the open-source community in developing ChainerRL and accelerating RL research.

Acknowledgments

We thank Avinash Ummadisingu, Mario Ynocente Castro, Keisuke Nakata, Lester James V. Miranda, and all the open source contributors for their contributions to the development of ChainerRL. We thank Kohei Hayashi and Jason Naradowsky for useful comments on how to improve the paper. We thank the many authors who fielded our questions when reproducing their papers, especially Georg Ostrovski.

References

  • [1] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • [2] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419):1140–1144, 2018.
  • [3] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-End Training of Deep Visuomotor Policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
  • [4] Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, and Sergey Levine. QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation. In CoRL, 2018.
  • [5] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. In NIPS Deep Learning Workshop, 2013.
  • [6] Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The Arcade Learning Environment: An Evaluation Platform for General Agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
  • [7] Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In IROS, 2012.
  • [8] Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep Reinforcement Learning that Matters. In AAAI, 2018.
  • [9] Seiya Tokui, Ryosuke Okuta, Takuya Akiba, Yusuke Niitani, Toru Ogawa, Shunta Saito, Shuji Suzuki, Kota Uenishi, Brian Vogel, and Hiroyuki Yamazaki Vincent. Chainer: A Deep Learning Framework for Accelerating the Research Cycle. In KDD, 2019.
  • [10] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In ICLR, 2016.
  • [11] Will Dabney, Georg Ostrovski, David Silver, and Rémi Munos. Implicit Quantile Networks for Distributional Reinforcement Learning. In ICML, 2018.
  • [12] Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining Improvements in Deep Reinforcement Learning. In AAAI, 2017.
  • [13] Tuomas Haarnoja, Henry Zhu, George Tucker, and Pieter Abbeel. Soft Actor-Critic Algorithms and Applications. arXiv preprint arXiv:1812.05905, 2018.
  • [14] Scott Fujimoto, Herke van Hoof, and Dave Meger. Addressing Function Approximation Error in Actor-Critic Methods. In ICML, 2018.
  • [15] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous Methods for Deep Reinforcement Learning. In ICML, 2016.
  • [16] OpenAI Baselines: ACKTR & A2C. https://openai.com/blog/baselines-acktr-a2c/.
  • [17] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
  • [18] Sam Greydanus, Anurag Koul, Jonathan Dodge, and Alan Fern. Visualizing and Understanding Atari Agents. In ICML, 2018.
  • [19] Riashat Islam, Peter Henderson, Maziar Gomrokchi, and Doina Precup. Reproducibility of Benchmarked Deep Reinforcement Learning Tasks for Continuous Control. In ICML Reproducibility in Machine Learning Workshop, 2017.
  • [20] Cédric Colas, Olivier Sigaud, and Pierre-Yves Oudeyer. How Many Random Seeds? Statistical Power Analysis in Deep Reinforcement Learning Experiments. arXiv preprint arXiv:1806.08295, 2018.
  • [21] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking Deep Reinforcement Learning for Continuous Control. In ICML, 2016.
  • [22] Pablo Samuel Castro, Subhodeep Moitra, Carles Gelada, Saurabh Kumar, and Marc G. Bellemare. Dopamine: A Research Framework for Deep Reinforcement Learning. 2018.
  • [23] Adam Stooke and Pieter Abbeel. rlpyt: A Research Code Base for Deep Reinforcement Learning in PyTorch. arXiv preprint arXiv:1909.01500, 2019.
  • [24] Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. OpenAI Baselines. https://github.com/openai/baselines, 2017.
  • [25] Itai Caspi, Gal Leibovich, Gal Novik, and Shadi Endrawis. Reinforcement Learning Coach, December 2017.
  • [26] Ashley Hill, Antonin Raffin, Maximilian Ernestus, Rene Traore, Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, and Yuhuai Wu. Stable Baselines. https://github.com/hill-a/stable-baselines, 2018.
  • [27] Eric Liang, Richard Liaw, Robert Nishihara, Philipp Moritz, Roy Fox, Ken Goldberg, Joseph Gonzalez, Michael Jordan, and Ion Stoica. RLlib: Abstractions for Distributed Reinforcement Learning. In ICML, 2018.
  • [28] Sergey Kolesnikov and Oleksii Hrinchuk. Catalyst.RL: A Distributed Framework for Reproducible RL Research. arXiv preprint arXiv:1903.00027, 2019.
  • [29] Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, and Nando de Freitas. Sample Efficient Actor-Critic with Experience Replay. In ICLR, 2017.
  • [30] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • [31] John Schulman, Sergey Levine, Philipp Moritz, Michael Jordan, and Pieter Abbeel. Trust Region Policy Optimization. In ICML, 2015.
  • [32] Hado Van Hasselt, Arthur Guez, and David Silver. Deep Reinforcement Learning with Double Q-learning. In AAAI, 2016.
  • [33] Marc G. Bellemare, Will Dabney, and Rémi Munos. A Distributional Perspective on Reinforcement Learning. In ICML, 2017.
  • [34] Marc G. Bellemare, Georg Ostrovski, Arthur Guez, Philip S. Thomas, and Rémi Munos. Increasing the Action Gap: New Operators for Reinforcement Learning. In AAAI, 2016.
  • [35] Ronald J. Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8(3-4):229–256, 1992.
  • [36] Long-Ji Lin. Self-Improving Reactive Agents Based on Reinforcement Learning, Planning and Teaching. Machine Learning, 8(3-4):293–321, 1992.
  • [37] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized Experience Replay. In ICLR, 2016.
  • [38] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, and Nando de Freitas. Dueling Network Architectures for Deep Reinforcement Learning. In ICML, 2016.
  • [39] Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alex Graves, Vlad Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, Charles Blundell, and Shane Legg. Noisy Networks for Exploration. In ICLR, 2018.
  • [40] Kavosh Asadi and Michael L. Littman. An Alternative Softmax Operator for Reinforcement Learning. In ICML, 2017.
  • [41] Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, and Sergey Levine. Continuous Deep Q-Learning with Model-based Acceleration. In ICML, 2016.
  • [42] Marlos C Machado, Marc G Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht, and Michael Bowling. Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents. Journal of Artificial Intelligence Research, 61:523–562, 2018.