Reinforcement learning (RL) has achieved remarkable results in recent years silver2017mastering; vinyals2019grandmaster; andrychowicz2020learning; mnih2015human. Consequently, a relatively large number of reinforcement learning libraries has been proposed to help with the challenges of implementing, training and testing the set of existing RL algorithms and their constantly increasing number of methodological improvements pytorchrl; gauci2018horizon; castro2018dopamine; loon2019slm; Schaarschmidt2019; baselines; caspi_itai_2017_1134899; tensorforce.
Other libraries commit to providing modules and programmable components, which serves learning purposes better and makes algorithmic research easier, as they are easy to extend to include and test new ideas. One such library that has had tremendous impact is OpenAI baselines baselines. However, OpenAI baselines has several limitations: in particular, it provides only basic synchronous PPO training and works within a single machine with a single GPU. Indeed, organisations behind recent important contributions, such as OpenAI OpenAI_dota or DeepMind silver2018general, use their own internal, scalable RL frameworks, which are all distributed and offer the possibility to separate sampling, gradient computations and policy updates both in time and hardware. Unfortunately, these libraries are largely proprietary.
Here we present NAPPO, a modular PyTorch-based RL framework designed to be easily understandable and extendable, while making it possible to leverage the compute power of multiple machines to accelerate training processes. We have based the algorithmic component of our library on a simple single-threaded PPO implementation pytorchrl and engineered multiple strategies to scale training using Ray moritz2018ray to coordinate task distribution among different concurrently executed units in a cluster. We believe our design choices make NAPPO easy to grasp and extend, turning it into a competitive tool to quickly test new ideas. Its distributed training implementations enable increased iteration throughput, accelerating the pace at which experiments are tested and making it possible to scale up to more difficult RL problems. In order to keep our release as simple and compact as possible, we have limited complexity by focusing on the proximal policy optimization (PPO) schulman2017proximal algorithm. NAPPO contains implementations of synchronous DPPO heess2017emergence, asynchronous DPPO recht2011hogwild, DDPPO wijmans2019dd, OpenAI RAPID OpenAI_dota, an asynchronous version of OpenAI RAPID and a version of IMPALA espeholt2018impala adapted to PPO.
The remainder of the paper is organized as follows. An overview of the most relevant related work is presented in Section 2. Section 3 analyses efficiency limitations of single-threaded implementations for RL policy gradient algorithms and introduces the main ideas and distributed training schemes behind NAPPO. Our experimental results are presented in Section 4. More specifically, subsection 4.1 compares NAPPO’s performance with previous results on MuJoCo and Atari environments, subsection 4.2 compares the performance of multiple distributed training schemes in increasingly large clusters and subsection 4.3 showcases the results obtained using NAPPO on the Obstacle Tower Unity3D challenge environment juliani2019obstacle. Finally, the paper ends by summarizing our main contributions.
2 Related work
Many reinforcement learning libraries have been open sourced in recent years. Some of them focus on single-threaded implementations and do not consider distributed training baselines; castro2018dopamine; caspi_itai_2017_1134899; tensorforce. Instead, NAPPO training implementations are designed to be executed at scale in arbitrarily large clusters.
Other available RL software options do offer scalability. However, they often rely on communication between independent program replicas, and require specific additional code to coordinate them Schaarschmidt2019; loon2019slm; espeholt2018impala; falcon2019pytorch. While this programming model is efficient for scaling supervised and unsupervised deep learning, where training has a relatively constant compute pattern bounded only by GPU power, it is not ideal for RL. In RL, different operations can have very diverse resource and time requirements, often resulting in CPU and GPU capacity being underexploited during alternating periods of time. Conversely, NAPPO breaks down the training process into sub-tasks and uses independent compute units called actors to execute them. Actors have access to separate resources within the same cluster and define hierarchical communication patterns among them, enabling coordination from a logically centralised script. This approach, while very flexible, requires a programming model that makes it easy to create and handle actors with defined resource boundaries. We currently use Ray moritz2018ray for that purpose.
Other libraries, such as RLlib liang2017rllib and RLgraph Schaarschmidt2019, also use Ray to obtain highly efficient distributed reinforcement learning implementations, with logically centralized control, task parallelism and resource encapsulation. However, these libraries focus more on high-level implementations of a wide range of algorithms and offer compatibility with both the Tensorflow tensorflow2015-whitepaper and PyTorch paszke2019pytorch deep learning libraries. These design choices allow for a more general use of their RL APIs but make the code harder to understand and to experiment with. On the other hand, we consider code simplicity a key feature for carrying out fast-paced research, and we focus on developing scalable implementations while keeping a modular and minimalist codebase, avoiding nonessential features that could impose increased complexity.
3 NAPPO proximal policy optimization implementation
Deep RL policy gradient algorithms are generally based on the repeated execution of three sequentially ordered operations: rollout collection (R), gradient computation (G) and policy update (U). In single-threaded implementations, all operations are executed within the same process and training speed is limited by the performance that the slowest operation can achieve with the resources available on a single machine. Furthermore, these algorithms do not have regular computation patterns (e.g. while rollout collection is generally limited by CPU capacity, gradient computation is often GPU-bound), causing an inefficient use of the available resources.
To alleviate computational bottlenecks, a more convenient programming model consists of breaking down training into multiple independent and concurrent computation units called actors, with access to separate resources and coordinated from a higher-level script. Even within the computational budget of a single machine, this solution enables a more efficient use of compute resources at the cost of slightly asynchronous operations. Furthermore, if actors can communicate across a distributed cluster, this approach makes it possible to leverage the combined computational power of multiple machines. We currently handle creation and coordination of actors using the Ray moritz2018ray distributed framework.
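The decoupling idea can be illustrated with a minimal sketch that uses only the Python standard library. It is not NAPPO code and does not use Ray: two threads stand in for actors, a bounded queue stands in for inter-actor communication, and the names `run_decoupled`, `collector` and `learner` are hypothetical. The sketch only tracks policy versions, so it shows how decoupling rollout collection from gradient computation introduces a delay between the policy that collected the data and the policy being updated.

```python
import queue
import threading


def run_decoupled(num_rollouts=20, queue_size=4):
    """Run a toy collector and learner concurrently.

    Returns the number of policy updates and the per-rollout delay
    (policy updates elapsed between collection and gradient computation).
    """
    rollouts = queue.Queue(maxsize=queue_size)  # bounded buffer between the two units
    state = {"version": 0}                      # policy version, bumped on every update
    delays = []

    def collector():
        for _ in range(num_rollouts):
            # A real collector would step environments here (typically CPU-bound).
            rollouts.put({"collected_with": state["version"]})
        rollouts.put(None)                      # sentinel: no more data

    def learner():
        while True:
            batch = rollouts.get()
            if batch is None:
                break
            # A real learner would compute gradients here (typically GPU-bound).
            delays.append(state["version"] - batch["collected_with"])
            state["version"] += 1               # policy update

    threads = [threading.Thread(target=collector), threading.Thread(target=learner)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["version"], delays


if __name__ == "__main__":
    updates, delays = run_decoupled()
    print(updates, sum(delays) / len(delays))
```

The measured delays depend on thread scheduling and vary between runs, which mirrors the "slightly asynchronous operations" mentioned above.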
An actor-based software solution offers three main design possibilities that define implementable training schemes: 1) any two consecutive operations can be decoupled by running them in different actors; 2) similarly, any operation or group of operations can be parallelized, executing them simultaneously across multiple actor replicas; 3) finally, coordination between consecutive operations executed by independent actors can be synchronous or asynchronous. In other words, it is not necessary to prevent an operation from occurring until all actors executing the preceding one have finished (although this might sometimes be desirable, for example if we want to parallelize gradient computation but update the policy with the averaged values). Thus, specifying which operations are decoupled, which are parallelized and which are synchronous defines the training schemes that we can implement. Note that single-threaded implementations are nothing but a particular case in which no operation pair is decoupled and consequently training is centralised in a single actor. NAPPO includes one such implementation of the PPO algorithm schulman2017proximal.
NAPPO contains four distributed implementations, two of which have synchronous and asynchronous policy update variants. More specifically, the library contains functional versions of synchronous DPPO heess2017emergence and asynchronous DPPO, where the latter can be understood as a PPO version of A3C, extending Hogwild! mnih2016asynchronous from one machine to a whole cluster. NAPPO also includes an implementation of DDPPO wijmans2019dd, a PPO-based version of IMPALA espeholt2018impala and an implementation of OpenAI’s RAPID OpenAI_dota scalable training approach. The RAPID implementation also has an asynchronous variant. Figure 1 shows how different distributed training schemes can be identified by means of the features described above.
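The difference between synchronous and asynchronous coordination can be made concrete with a toy example. The sketch below is an illustration under simplifying assumptions, not the library's implementation: the policy is a single scalar `theta`, each worker's loss is the quadratic 0.5 * (theta - target)^2, and the function names are hypothetical. In the synchronous variant all gradients are averaged before one update; in the asynchronous variant each gradient is applied as soon as it arrives, even though it was computed on a policy that may already have changed.

```python
def grad(theta, target):
    # Gradient of the toy loss 0.5 * (theta - target) ** 2 with respect to theta.
    return theta - target


def synchronous_step(theta, targets, lr):
    """Wait for every worker, average the gradients, apply one update."""
    grads = [grad(theta, t) for t in targets]
    return theta - lr * sum(grads) / len(grads)


def asynchronous_step(theta, targets, lr):
    """Apply each worker's gradient on arrival.

    Every gradient is computed on the same snapshot of the policy, so the
    i-th gradient is applied to a policy that is already i updates newer.
    Returns the new parameter and the list of those per-gradient delays.
    """
    snapshot = theta
    delays = []
    for i, t in enumerate(targets):
        theta = theta - lr * grad(snapshot, t)
        delays.append(i)
    return theta, delays


if __name__ == "__main__":
    print(synchronous_step(0.0, [1.0, 3.0], lr=0.5))   # one averaged update
    print(asynchronous_step(0.0, [1.0, 3.0], lr=0.5))  # two stale updates
```

With these inputs the synchronous step moves `theta` to 1.0, while the asynchronous step applies two separate updates and ends at 2.0, showing that the two coordination modes do not solve exactly the same optimization problem with the same hyperparameters.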
To implement the aforementioned training schemes we adopt a modular approach. Modular code is easier to read, understand and modify. Furthermore, it offers more flexibility to compose variants and extensions of already defined algorithms, and allows component reuse for the development of new ideas. Our whole library is formulated around only seven components and divided in two sub-packages. The first sub-package is called and contains the three classes at the center of any Deep RL algorithm:
: Implements the deep neural networks used as function approximators.
: Manages loss and gradient computation. We currently compute the loss function using the PPO formula, but it would be straightforward to adapt this component to implement alternative algorithms.
RolloutStorage: Handles data storage, processing and retrieval.
A second sub-package, called , contains the modules that instantiate and manage core components and that allow for distributed training. It is organised around four base classes, from which components of specific training schemes inherit:
Worker: Manages interactions with the environment, rollout collection and, depending on the training scheme, gradient computation. It is the key component for scalability, allowing task parallelization through multiple instantiations.
WorkerSet: Groups sets of workers performing the same operations (and if necessary a ParameterServer) into a single class for simpler handling.
ParameterServer: Coordinates centralised policy updates.
Learner: Monitors the training process.
All deep learning elements in NAPPO are implemented using PyTorch paszke2019pytorch. Components can be instantiated and combined in different actors to perform RL operations. Figure 2 contains module-based sketches that faithfully represent the real implementations, composed by assembling the seven classes discussed above plus environment instances.
Furthermore, hierarchical communication patterns between processes make it possible to centralise algorithm composition and training logic in a single script with a very limited number of lines of code. Listing 1 shows how a script to train a policy on the Obstacle Tower Unity3D challenge environment using synchronous DPPO can be coded in no more than 22 lines.
4.1 Comparison to previous benchmarks in continuous and discrete domains
We first use NAPPO to run several experiments on a number of MuJoCo todorov2012mujoco environments and Atari 2600 games from the Arcade Learning Environment bellemare2013arcade. We fix all hyperparameters to be equal to those of the experiments presented in schulman2017proximal and use the same policy network topologies. Our idea is to compare the learning curves obtained with our implementations to a known baseline. For Atari 2600, we process the environment observations following mnih2015human.
We train all implementations contained in NAPPO, including single-node PPO, three times on each environment under consideration, with randomly initialized policy weights every time. For MuJoCo environments we use a different seed in each experiment. For Atari 2600 experiments, the authors of schulman2017proximal used stacks of 8 environments to collect multiple rollouts in parallel. We do the same and use a different seed for each environment, but do not change seeds between runs. We present our results as the per-step average of the three independent runs in Figure 3.
As the original experiments were conducted using a single-threaded implementation (equivalent to our PPO approach), we opt for not parallelizing operations in any training scheme. That makes DPPO and DDPPO equivalent to PPO in all regards, and therefore we expect these training strategies to yield training curves similar to those of the original experiments. On the other hand, both RAPID and IMPALA inevitably introduce time delays between rollout collection and gradient computation, which means that sometimes the version of the policy used for data collection is older than the one used for gradient computation. This task decoupling can slightly alter the optimization problem, offering no guarantee that the hyperparameters used are optimal. Additionally, using the v-trace value correction technique in IMPALA further modifies the loss values being optimized. In practice, we find that this affects performance on some of the environments. Interestingly, when training IMPALA with and without v-trace correction, we experimentally observe that using v-trace yields superior results on Atari 2600 environments, but is detrimental in MuJoCo environments, leading in some cases to unstable training. Whether this depends on the environment itself, on the policy architecture or on the chosen hyperparameters (or on a combination of all factors) is unclear. Numerical results of our experiments are compared with the baselines in Table 1. Note that schulman2017proximal does not provide numerical results for MuJoCo experiments.
4.2 Scaling to solve computationally demanding environments
The rest of our experiments were conducted in the Obstacle Tower Unity3D challenge environment juliani2019obstacle, a procedurally generated 3D world where an agent must learn to navigate an increasingly difficult set of floors (up to 100) with varying lighting conditions and textures. Each floor can contain multiple rooms, and each room can contain puzzles, obstacles or keys that unlock doors. The number and arrangement of rooms in each floor for a given episode can be fixed with a seed parameter.
We design our second experiment to test the capacity of NAPPO’s implementations to accelerate training processes in increasingly large clusters. With that aim, we define three metrics of interest:
FPS (Frames Per Second): environment frames being processed, on average, in one second of clock time. This metric can depend on the environment used and the policy architecture, and therefore valid comparisons require fixing these variables.
CGD (Collection-to-Gradient Delay): average difference in the number of policy updates between the version of the policy used for rollout collection and the version of the policy used for gradient computations. In PPO-based training schemes, the number of minibatches (m) generated from a rollout set in every iteration and the number of times or epochs (e) rollouts are reused define the minimum possible CGD as (m · e − 1) / 2. Note that only schemes decoupling collection and gradient computations (i.e. IMPALA and RAPID) will have CGD above this value.
GUD (Gradient-to-Update Delay): average difference in the number of policy updates between the version of the policy used for gradient computation and the version of the policy to which the gradients are applied. Note that only training schemes that allow for an asynchronous policy update (i.e. DPPO async and RAPID async) will have GUD values greater than 0.
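One way to read the lower bound on CGD is the following sketch, which assumes that every minibatch gradient is applied to the policy before the next gradient is computed and that a fresh rollout set is collected only after all m · e updates. Under those assumptions the k-th gradient of an iteration is computed with a policy that is k updates newer than the one that collected the data, so the average delay is (0 + 1 + ... + (m · e − 1)) / (m · e) = (m · e − 1) / 2. The function names below are hypothetical and the simulation only tracks policy version counters.

```python
def min_cgd(num_minibatches, num_epochs):
    """Closed form (m * e - 1) / 2 under the assumptions stated above."""
    return (num_minibatches * num_epochs - 1) / 2


def simulated_cgd(num_minibatches, num_epochs, iterations=3):
    """Measure the average collection-to-gradient delay of a coupled PPO loop."""
    version = 0
    delays = []
    for _ in range(iterations):
        collected_with = version  # rollouts are collected with the current policy
        for _ in range(num_epochs):
            for _ in range(num_minibatches):
                # Gradient computed with the current policy on older data.
                delays.append(version - collected_with)
                version += 1      # the update is applied immediately
    return sum(delays) / len(delays)


if __name__ == "__main__":
    print(min_cgd(8, 2), simulated_cgd(8, 2))
```

Both functions agree for any m and e, for example 7.5 for m = 8 and e = 2. A scheme that decouples collection from gradient computation would add further delay on top of this value.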
We benchmark these metrics when training in clusters composed of 1, 2, 3, 4 and 5 machines. Common hyperparameters across different training implementations are held constant throughout the experiment. We select the hyperparameters specific to each scheme which maximise the training speed. We used machines with 32 CPUs and 3 GPUs, model GeForce RTX 2080 Ti. We could use two GPUs to obtain similar results if the environment instances could be executed in an arbitrarily specified device. However, currently Obstacle Tower Unity3D challenge instances run by default in the primary GPU device, and thus we decide to devote it exclusively to this task. Our results are plotted in Figure 4.
In our last experiment we test the capacity of a policy based on a single deep neural network trained with NAPPO to generalise to Obstacle Tower Unity3D environment seeds never observed during training. We conduct training on a 2-machine cluster. We use the network architecture proposed in espeholt2018impala but we initialize its weights according to Fixup zhang2019fixup. We end our network with a gated recurrent unit (GRU) gru with a hidden layer of 256 neurons. Gradients are applied using the Adam optimizer kingma2014adam, with a starting learning rate of 4e-4 decayed by a factor of 0.25 both after 100 million steps and 400 million steps. The value coefficient, the entropy coefficient and the clipping parameter of the PPO algorithm are set to 0.2, 0.01 and 0.15 respectively. We use a discount factor gamma of 0.99. Furthermore, the horizon parameter is set to 800 steps and rollout collections are parallelized using environment vectors of size 16. Gradients are computed in minibatches of size 1600 for 2 epochs. Finally, we also use generalized advantage estimation gae with lambda 0.95.
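Two of the hyperparameter choices above can be sketched in plain Python. This is an illustrative sketch, not the released code: `learning_rate` assumes that "decayed by a factor of 0.25" means the learning rate is multiplied by 0.25 at each milestone, and `gae` is a standard formulation of generalized advantage estimation with the gamma and lambda values quoted in the text. Both function names are hypothetical.

```python
def learning_rate(step, base_lr=4e-4, factor=0.25,
                  milestones=(100_000_000, 400_000_000)):
    """Step schedule: multiply the rate by `factor` at every milestone passed."""
    lr = base_lr
    for milestone in milestones:
        if step >= milestone:
            lr *= factor
    return lr


def gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one rollout.

    rewards[t], values[t]: reward received and value estimate at step t.
    last_value: value estimate of the state following the final step.
    dones[t]: True if the episode ended after step t (no bootstrapping).
    """
    advantages = [0.0] * len(rewards)
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step temporal-difference error.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * not_done * running
        advantages[t] = running
        next_value = values[t]
    return advantages


if __name__ == "__main__":
    print(learning_rate(0), learning_rate(150_000_000), learning_rate(500_000_000))
    print(gae([0.0, 1.0], [0.0, 0.0], 0.0, [False, True]))
```

Under the multiplicative reading, the learning rate would be 4e-4 up to 100 million steps, 1e-4 up to 400 million steps and 2.5e-5 afterwards.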
Our preliminary results on the Obstacle Tower Unity3D environment show remarkably similar training curves when training with different distributed schemes with equal hyperparameters (see supplementary material), suggesting low sensitivity to CGD and GUD increases for the proposed policy architecture on this specific environment and further demonstrating the correctness of the implementations. Nonetheless, we deem it preferable to limit CGD and GUD if possible, and we decided to use the synchronous version of RAPID as it offers high training speed while keeping these metrics low and stable.
The reward received from the environment upon the agent completing a floor is +1, and +0.1 is provided for opening doors, solving puzzles, or picking up keys. During the real challenge we ranked 2nd with a very simple reward shaping method run on vanilla PPO and a final score of 16 floors. We additionally reward the agent with an extra +1 for picking up keys, +0.002 for detecting boxes, +0.001 for finding the intended box locations, +1.5 for placing boxes at their target locations and +0.002 for any event that increases the remaining time. We also reduce the action set from the initial 54 actions to 6 (rotate camera in both directions, go forward, and go forward while turning left, turning right or jumping). We use frame skip 2 and frame stack 4.
We train our agent on a fixed set of seeds [0, 100) for approximately 11 days and test its behaviour on seeds 1001 to 1005, a procedure designed by the authors of the challenge to evaluate weak generalization capacities of RL agents juliani2019obstacle. During training we restart each new episode at a randomly selected floor between 0 and the highest floor reached in the previous episode, and we regularly save policy checkpoints to evaluate the progression of test performance. Test performance is measured as the highest averaged score on the five test seeds obtained after 5 attempts, due to some intrinsic randomness in the environment. Our maximum average test score is 23.6, which represents a significant improvement with respect to 19.4, the previous state of the art obtained by the winner of the competition. Our final results are presented in Figure 5, showing that we are also consistently above 19.4. The source code, including the reward shaping strategy, and the resulting model are available in the GitHub repository.
NAPPO presents a minimalist codebase for deep RL research at scale. In the present paper, we demonstrate that the implementations contained in it are reliable and can run in clusters of large sizes, enabling accelerated RL research. NAPPO’s competitive performance is further highlighted by achieving the highest to-date test score on the Obstacle Tower Unity3D challenge environment. We currently focus on PPO-based, single-agent implementations for distributed training.
We expect NAPPO to be a valuable library for training at scale and, at the same time, for going beyond the state of the art. It is of general interest to people developing applications using RL as well as to scientists developing new methods. We are making the entire source code, trained networks and examples available on GitHub after acceptance. We do not think that this research puts anybody at a disadvantage or that there are societal consequences of failure, nor that bias problems apply here.
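The additive bonuses listed above can be summarised in a small sketch. The text does not describe how events such as "box detected" are recognised, so the event names and the `shaped_reward` function below are hypothetical placeholders rather than the strategy shipped in the repository; only the bonus values are taken from the text.

```python
# Extra reward per event, on top of the reward returned by the environment.
# Event names are illustrative assumptions; values come from the text.
SHAPING_BONUS = {
    "key_picked": 1.0,
    "box_detected": 0.002,
    "box_target_found": 0.001,
    "box_placed": 1.5,
    "time_increased": 0.002,
}


def shaped_reward(env_reward, events):
    """Add the bonus of every event observed in this step to the raw reward."""
    return env_reward + sum(SHAPING_BONUS[event] for event in events)


if __name__ == "__main__":
    # Picking up a key: the environment's +0.1 plus the extra +1 bonus.
    print(shaped_reward(0.1, ["key_picked"]))
```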