Increasing computation underlies many recent advances in machine learning [amodei_hernandez_2018]. More and more algorithms exploit parallelism and rely on distributed training to process enormous amounts of data, both in the conventional supervised context and in reinforcement learning (RL) [espeholt2018impala, mnih2016asynchronous, ecoffet2019go, nair2015massively, horgan2018distributed, jaderberg2018human, sutton2018reinforcement] and population-based methods [salimans2017evolution, jaderberg2017population, such2017deep, conti2018improving, wang2019paired, stanley2019designing, cully2015robots]. Many frameworks support distributed training (e.g. TensorFlow [abadi2016tensorflow], PyTorch [paszke2017pytorch] and Horovod [sergeev2018horovod] for deep learning applications, and Spark MLlib [meng2016mllib] outside deep learning). However, RL and population-based methods pose unique challenges for reliability, efficiency, and flexibility that frameworks designed for supervised learning fall short of satisfying.
First, RL and population-based methods are typically applied in a setup that requires frequent interaction with simulators to evaluate policies and collect experiences, such as ALE [bellemare2013arcade], Gym [brockman2016openai], and MuJoCo [todorov2012mujoco]. While neural network computation can leverage specialized hardware (e.g. GPUs and TPUs), the dominant computing workloads often come from simulations that only run on CPUs, and different simulation rollouts can take significantly different lengths of time to finish. This setup requires a distributed computing framework not only to leverage a large amount of computation, as its counterparts in supervised learning do, but also to handle heterogeneity in resource usage.
Second, an ideal distributed computing framework should dynamically allocate resources for workloads whenever possible to ensure the maximum job throughput given a finite-size pool of computing resources. A naive but natural choice could be to allocate computing resources according to the peak resource needed among all stages of computation. However, for some RL and population-based methods, a better, more fine-grained dynamic scaling strategy is required to address variable computation needs at different phases of an algorithm. For example, Go-Explore [ecoffet2019go] requires only CPUs during its exploration phase, but relies on GPUs later in the robustification phase. Another example is POET [wang2019paired], whose execution could benefit from gradually scaling up resources according to the increasing size of active populations in the open-ended search.
Third, such a distributed computing framework should keep a unified user interface consistent across a wide variety of backends, enabling practitioners to effortlessly turn a prototype algorithm that runs on a laptop into a high-performance distributed application that runs efficiently on a multi-core workstation, over a cluster of machines, or even on a public cloud, all with relatively few additional lines of code. This flexibility would maximize developmental efficiency and allow users to best utilize all the computing resources available to them. Moreover, the user interface should ideally be kept close to a familiar Python interface, which helps reduce adoption overhead. A good analogy in deep learning frameworks is PyTorch [paszke2017pytorch], which keeps many of its APIs close to those in NumPy [oliphant2006guide].
Among existing machine learning frameworks, those for supervised learning are not designed to address these new kinds of requirements, as they do not naturally support simulation. Therefore, researchers and practitioners in RL have traditionally resorted to building one-off systems [nair2015massively, silver2016mastering, tian2017elf, espeholt2018impala] for their specialized use cases, which imposes a prohibitive systems engineering burden. More recently, RL frameworks have been developed directly on top of specific deep learning frameworks, emphasizing support for either applications or research. For example, PyTorch-based Horizon [gauci2018horizon] provides an end-to-end RL flow optimized for high performance on production data, targeting real-life applications that do not require the flexibility for quick iteration needed in academic algorithm development. TensorFlow-based Dopamine [castro2018dopamine] instead retains enough flexibility for RL research, but lacks support for distributed training.
Ray [moritz2018ray] tries to provide a generalized solution for emerging AI applications including RL. It provides end-to-end distributed model training and serving. However, its design is influenced by Apache Spark [zaharia2016apache] and adopts some memory-intensive approaches (e.g. task dependency graphs, control and object stores) to support an internal task scheduler. This approach can be heavy-weight for algorithm development. Also, installing and deploying Ray on different backend platforms puts the burden of different customization on its users [Ray_doc]. Beyond machine-learning-focused solutions, general-purpose parallel computing systems such as IPyParallel [perez2007ipython] and OpenMPI [boku2004openmpi] provide little direct support and no dynamic scaling for distributed training for RL and population-based methods. Ad-hoc solutions built on them are often not portable from one platform to another and are difficult to scale, significantly limiting the efficiency of algorithm development.
To address these challenges, in this paper we introduce Fiber, a scalable distributed computing framework for RL that aims to achieve both flexibility and efficiency while supporting both research and practical applications:
(1) Fiber is built on a classic master-worker programming model, and introduces a task pool as a lightweight but effective strategy to handle the scheduling of tasks.
(2) Fiber leverages cluster management software (e.g. Kubernetes [burns2016borg]) for job scheduling/tracking.
(3) Fiber does not require pre-allocating resources and can dynamically scale up and down on the fly, ensuring maximal flexibility.
(4) Fiber is designed with maximizing development efficiency in mind so the user can effortlessly migrate from multiprocessing on one machine to complete distributed training across multiple machines. The architecture of Fiber and experiments demonstrating its advantages are presented in this paper.
This section briefly reviews RL and population-based methods, the targeted applications of Fiber, followed by a review of core concepts and components behind Fiber, e.g. multiprocessing, containers, and cluster management systems.
Confusingly, the words “reinforcement learning” (RL) can refer to a class of machine learning problems, the field of research about solving those problems, and a specific subset of algorithms that can solve those problems. The class of problems concerns how agents take actions in an (often simulated) environment to maximize a notion of cumulative reward. Without labelled input/output pairs, the focus of RL is to continually find a balance between exploration (of the space of possible actions) and exploitation (of the data currently gathered) within an uncertain environment based on delayed and infrequent feedback. RL algorithms are typically based on temporal-difference learning, and include the Q-learning and policy gradient families of algorithms [sutton2018reinforcement]. Population-based methods are an additional class of search algorithms that can solve RL problems. They maintain a population of candidate solutions wherein encouraging behavioral diversity is a central drive. They have produced state-of-the-art results in robotics [cully2015robots, salimans2017evolution] and some hard-exploration RL problems [ecoffet2019go]. Representative examples include novelty search [lehman2011abandoning] and Quality-Diversity algorithms [lehman2011evolving, mouret2015illuminating, cully2015robots, pugh2016quality, wang2019paired, nguyen2016understanding, huizinga2018evolving]. RL and population-based methods pose unique challenges to distributed training that Fiber aims to address.
Multiprocessing is a module in the Python standard library for parallel computing. It makes it easy for users to create new processes and to create a pool of workers across which tasks can be distributed. It is designed to leverage the computational power of modern multi-core CPUs on a single machine. Fiber follows the same interface as multiprocessing while extending many multiprocessing components to make them work in a distributed environment (i.e. across many computers in a computer cluster).
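As a minimal illustration of the interface described above, the sketch below uses only the standard multiprocessing module to distribute a toy function over a pool of local worker processes. The function name simulate and the pool size of 4 are assumptions made for this example, not part of Fiber.

```python
import multiprocessing as mp


def simulate(seed):
    """Toy stand-in for an expensive task; returns a deterministic value."""
    return (seed * seed) % 7


def main():
    # A pool of 4 local worker processes (the size is arbitrary here).
    with mp.Pool(processes=4) as pool:
        # map() spreads the inputs over the workers and keeps input order.
        results = pool.map(simulate, range(8))
    print(results)


if __name__ == "__main__":
    main()
```

The worker function lives at module level so that it can be pickled and sent to the worker processes.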
Cluster management systems manage computer clusters and many machines simultaneously. They are the “operating system” layer on top of computer clusters and allow other applications to run on top of them. Examples include Apache Mesos [hindman2011mesos], Kubernetes [burns2016borg], Uber Peloton [uberpeloton] and Slurm [yoo2003slurm].
Containers are a method of virtualization that packages an application’s code and dependencies into a single object. The aim is to allow an application to run reliably and consistently from one environment to another. Containers are often used together with cluster management systems.
Fiber provides users the ability to write applications for a large computer cluster with a standard and familiar library interface. This section covers Fiber’s design and applications.
Fiber bridges the classical multiprocessing API with a flexible selection of backends that can run on different cluster management systems. To achieve this integration, Fiber is split into 3 different layers: the API layer, backend layer and cluster layer. The API layer provides basic building blocks for Fiber like processes, queues, pools and managers. They have the same semantics as in multiprocessing, but are extended to work in distributed environments. The backend layer handles tasks like creating or terminating jobs on different cluster managers. When a new backend is added, all the other Fiber components (queues, pools, etc.) do not need to be changed. Finally, the cluster layer consists of different cluster managers. Although they are not a part of Fiber itself, they help Fiber to manage resources and keep track of different jobs, thereby reducing the number of items that Fiber needs to track. This overall architecture is summarized in figure 1.
Fiber introduces a new concept called job-backed processes. It is similar to the process in Python’s multiprocessing library, but more flexible: while a process in multiprocessing only runs on a local machine, a Fiber process can run remotely on a different machine or locally on the same machine. When starting a new Fiber process, Fiber creates a new job with the proper Fiber backend on the current computer cluster. Fiber uses containers to encapsulate the running environment of current processes, including all the required files, input data, and other dependent program packages, etc., to ensure everything is self-contained. All the child processes are started with the same container image as the parent process to guarantee a consistent running environment. Because each process is a cluster job, its life cycle is the same as any job on the cluster. To make it easy for users, Fiber is designed to directly interact with computer cluster managers. Because of this, Fiber doesn’t need to be set up on multiple machines or bootstrapped by any other mechanisms, unlike Spark or IPyParallel. It only needs to be installed on a single machine as a normal Python pip package.
Fiber implements most multiprocessing APIs on top of Fiber processes including pipes, queues, pools and managers.
These components are critical for implementing RL algorithms and population-based methods. Example code using the Fiber API is listed in code example 1.
Supported multiprocessing components. Queues and pipes in Fiber behave the same as in multiprocessing. The difference is that queues and pipes are now shared by multiple processes running on different machines. Two processes can read from and write to the same pipe. Furthermore, queues can be shared between many processes on different machines, and each process can send to or receive from the same queue at the same time. Fiber’s queue is implemented with Nanomsg (https://nanomsg.org/), a high-performance asynchronous message queue system. Pools are also supported by Fiber. They allow the user to manage a pool of worker processes. Fiber extends pools with job-backed processes so that a pool can manage thousands of (remote) workers. Users can also create multiple pools at the same time. Managers and proxy objects enable multiprocessing to support shared storage, which is critical to distributed systems. Usually, this function is handled by external storage like Cassandra [lakshman2010cassandra], Redis [carlson2013redis], etc. Fiber instead provides built-in in-memory storage for applications to use. The interface is the same as multiprocessing’s Manager type. In this way, Fiber provides shared storage that is both convenient to use and high-performance.
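The following sketch shows the queue and manager semantics described above, using the standard multiprocessing module on one machine. It assumes, as the text states, that Fiber exposes the same interface; the names worker and square and the sentinel protocol are choices made for this example only.

```python
import multiprocessing as mp


def square(x):
    return x * x


def worker(tasks, shared):
    """Pull items from a shared queue until a None sentinel arrives."""
    while True:
        item = tasks.get()
        if item is None:
            break
        # Writes go through the manager proxy, so the parent can read them.
        shared[item] = square(item)


def main():
    with mp.Manager() as manager:
        shared = manager.dict()  # in-memory storage shared via a proxy
        tasks = mp.Queue()       # one queue read by several processes
        procs = [mp.Process(target=worker, args=(tasks, shared))
                 for _ in range(2)]
        for proc in procs:
            proc.start()
        for item in range(6):
            tasks.put(item)
        for _ in procs:
            tasks.put(None)      # one sentinel per worker
        for proc in procs:
            proc.join()
        print(sorted(dict(shared).items()))


if __name__ == "__main__":
    main()
```

Each key is written by exactly one worker, so the example needs no locks, which is consistent with the set of components listed as supported.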
Unsupported multiprocessing components. Shared memory is used heavily by frameworks like PyTorch [paszke2017pytorch] and Ray [moritz2018ray]. In general, it can improve the performance of inter-process communication on the same machine. However, it is not available when communicating over a computer network, which is common for distributed systems. Thus Fiber provides managers and proxy objects as the primary means to share data instead. Locks can be very important for coordinating between different processes and preventing race conditions. However, in a distributed environment, they can waste large amounts of computational resources. Therefore, we excluded locks from the supported APIs, as they are not needed by most RL and population-based methods.
RL and population-based methods are two major applications for Fiber. These algorithms generally require frequent interactions between policies (usually represented by a neural network) and environments (usually represented by a simulator like ALE [bellemare2013arcade], OpenAI Gym [brockman2016openai], and MuJoCo [todorov2012mujoco]). The communication pattern for distributed RL and population-based methods usually involves sending different types of data between machines: actions, neural network parameters, gradients, per-step/episode observations and rewards, etc. Actions can be either discrete (represented by an integer) or continuous (represented by a floating-point number). The number of actions that need to be transmitted is usually less than a thousand. Observations and neural network parameters can be larger than actions. Depending on the neural network used, the size of parameters (and gradients) can range from bytes [mnih2015human] to megabytes [OpenAI_dota].
Fiber implements pipes and pools to transmit these data. Under the hood, pools are normal Unix sockets, providing near line-speed communication for the applications using Fiber. Modern computer networking usually has bandwidth as high as hundreds of gigabits per second. Transmitting small amounts of data over a network is usually fast [dean2007software]. Fiber can run each simulator in a single process and communicate actions, observations, rewards, parameters or gradients via pipes between different processes. If the size of parameters or gradients is too large, Fiber can be used together with Horovod [sergeev2018horovod], which leverages GPU-to-GPU communication for faster transfer. Additionally, the inter-process communication latency does not increase much when many different processes send data to one process, because data transfer can happen in parallel. This fact makes Fiber’s pools a suitable foundation for many RL and population-based learning algorithms, because simulators can run in each pool worker process and the results can be transmitted back in parallel.
The distinction between pool- and pipe-based communication is that pools usually ignore task order while pipes keep order. For pools, each task (neural network evaluation, etc.) can be mapped to any of the worker processes. This is suitable for algorithms like ES [salimans2017evolution] or POET [wang2019paired], where each task is stateless. Each evaluation can run on any of the pool worker processes and only the (per-episode) end results need to be collected back in parallel. An example of the ES algorithm implemented with Fiber is listed in code example 2. On the other hand, pipes can maintain the order of each task. Each simulator is mapped to a fixed process so that worker processes can maintain their internal state after each step. At each step, each worker process only needs to accept actions that are for that specific worker and send the results (instantaneous rewards, state transitions, etc.) back. This makes pipes suitable for RL algorithms like A3C [mnih2016asynchronous], PPO [schulman2017proximal], etc. An example of RL implemented in Fiber is listed in code example 3.
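To make the pool-based pattern concrete, here is a minimal, independent sketch of an ES-style update on a toy objective. It is not the paper's listing: the objective, the TARGET vector, and all hyperparameters are hypothetical, and the standard multiprocessing Pool stands in for a Fiber pool on the assumption that both expose the same map interface.

```python
import random
from multiprocessing import Pool

TARGET = [0.5, -0.3, 0.8]  # hypothetical optimum of the toy objective


def evaluate(candidate):
    """Stateless task: score one parameter vector (higher is better)."""
    return -sum((c - t) ** 2 for c, t in zip(candidate, TARGET))


def es_step(theta, map_fn, rng, pop_size=50, sigma=0.1, lr=0.05):
    """One ES update; map_fn can be the builtin map or pool.map."""
    noises = [[rng.gauss(0.0, 1.0) for _ in theta] for _ in range(pop_size)]
    candidates = [[t + sigma * e for t, e in zip(theta, eps)]
                  for eps in noises]
    # Any worker may score any candidate, so task placement does not matter.
    scores = list(map_fn(evaluate, candidates))
    baseline = sum(scores) / pop_size
    # Score-weighted average of the noise vectors estimates the gradient.
    grad = [
        sum((s - baseline) * eps[i] for s, eps in zip(scores, noises))
        / (pop_size * sigma)
        for i in range(len(theta))
    ]
    return [t + lr * g for t, g in zip(theta, grad)]


def main():
    rng = random.Random(0)
    theta = [0.0, 0.0, 0.0]
    with Pool(processes=4) as pool:
        for _ in range(100):
            theta = es_step(theta, pool.map, rng)
    print(evaluate(theta))


if __name__ == "__main__":
    main()
```

Passing the map function as an argument keeps the update rule independent of where the evaluations run, so the same code can be checked with the builtin map before it is handed a pool.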
There are two key considerations on scalability: (1) how many resources a framework can manage and (2) how many resources an algorithm can use. Because Fiber relies on the cluster scheduler to manage resources, including CPU cores, memory and GPUs, Fiber itself has little to do beyond tracking started processes and properly terminating them when computation is completed. Fiber schedules each task at most once. When batching is enabled, multiple tasks can be scheduled at the same time to improve efficiency. It is also possible to run Fiber across multiple clusters, but the network communication cost could make Fiber less efficient. Because Fiber does not require pre-allocating resources, it can scale up and down with the algorithm it runs. In contrast to static allocation, Fiber can return unused resources to the cluster when they are not needed. Furthermore, when it needs more resources, it can ask the cluster manager for them. This approach makes it suitable for algorithms that run heterogeneous tasks in different stages.
Fiber implements pool-based error handling (Figure 2). When a new pool is created, an associated task queue, result queue and pending table are also created. Newly created tasks are then added to the task queue, which is shared between the master process and worker processes. Each of the workers fetches a single task from the task queue, and then runs task functions within that task. Each time a task is removed from the task queue, an entry in the pending table is added. Once the worker finishes that task, it puts its results in the result queue. The entry associated with that task is then removed from the pending table.
If a pool worker process fails in the middle of processing, that failure is detected by the parent pool, which serves as the process manager of all the worker processes. If the failed process had a pending task, the parent pool puts that task from the pending table back into the task queue. Next, it starts a new worker process to replace the failed one and binds the newly created worker process to the task queue and the result queue.
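The bookkeeping described above can be sketched as a small single-process simulation. This is only an illustration of the task queue, pending table, and result queue interacting, not Fiber's implementation; the function run_pool, the crash_once parameter, and the doubling "work" are all hypothetical.

```python
from collections import deque


def run_pool(tasks, crash_once):
    """Simulate pool-based error handling.

    `crash_once` is a set of tasks whose first attempt kills the worker.
    Returns the collected results and whatever is left in the pending table.
    """
    task_queue = deque(tasks)
    pending = {}      # worker id -> task that worker is currently running
    results = []      # stands in for the result queue
    crashed = set()
    worker_id = 0

    while task_queue:
        task = task_queue.popleft()
        pending[worker_id] = task        # entry added when the task is fetched
        if task in crash_once and task not in crashed:
            crashed.add(task)
            # The pool notices the dead worker, requeues its pending task,
            # and starts a replacement worker (modelled as a new id).
            task_queue.append(pending.pop(worker_id))
            worker_id += 1
            continue
        results.append((task, task * 2))  # worker finished: publish result
        del pending[worker_id]            # then clear the pending entry
    return results, pending


print(run_pool([1, 2, 3], crash_once={2}))
# -> ([(1, 2), (3, 6), (2, 4)], {})
```

The task that crashed its worker is completed on a later attempt, and the pending table ends up empty once all results have been collected.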
Fiber is designed to scale the computation of algorithms like RL and population-based methods easily. In this section, we evaluate Fiber on three different tasks to show its benefits: framework overhead is tested on a dummy workload, Evolution Strategies (ES) experiments show its potential for population-based training, and Proximal Policy Optimization (PPO) experiments test the same for RL. Results show that Fiber has low overhead and can easily scale ES to thousands of CPU workers. This is a big improvement compared to IPyParallel, which can only scale to hundreds of CPU workers. Also, Fiber can easily reuse existing code like OpenAI Baselines [dhariwal2017openai] and seamlessly expand PPO to use hundreds of distributed environment workers. OpenAI Baselines does not support computation at such a scale. In addition, the conversion requires only a few lines of changed code.
The aim of this test is to probe how much overhead the framework adds to the workload. For this purpose, we compare Fiber, the Python multiprocessing library, Spark, and IPyParallel. The testing procedure is to create a batch of tasks that takes a fixed amount of time in total to finish. The duration of each single task ranges from 1 second to 1 millisecond. We run five workers for each framework locally and adjust the batch size to make sure the total finish time for each framework is roughly 1 second (e.g. for a 1 millisecond task duration, we run 5,000 tasks). Results are in figure 2(a).
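The batch sizes follow from simple arithmetic: with five workers and a one-second budget, the number of tasks is the number of workers times the budget divided by the task duration. The helper below is a sketch of that calculation; the function name and the intermediate durations of 0.1 and 0.01 seconds are assumptions for illustration.

```python
def batch_size(task_seconds, workers=5, target_seconds=1.0):
    """Tasks needed so `workers` parallel workers finish in `target_seconds`,
    assuming the framework itself adds no overhead."""
    return round(workers * target_seconds / task_seconds)


for duration in (1.0, 0.1, 0.01, 0.001):
    print(f"{duration:>6} s per task -> {batch_size(duration)} tasks")
```

Any time a framework takes beyond the one-second budget on such a batch can then be attributed to its own overhead.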
We use multiprocessing as a reference because it is very lightweight and does not implement any additional features beyond creating new processes and running tasks in parallel. Additionally, it exploits communication mechanisms only available locally (e.g. shared memory, Unix domain sockets, etc.), making it difficult to surpass for other frameworks that support distributed resource management across multiple machines and cannot exploit similar mechanisms. It thus serves as a good reference for the performance that can be expected. Fiber shows almost no difference when task durations are 100ms or greater, and is much closer to multiprocessing than the other frameworks as the task duration drops to 10 or 1ms. The small difference in performance is a reasonable cost for gaining the ability to run on multiple machines and scale to the whole computer cluster. Compared to Fiber, IPyParallel and Spark fall well behind at each task duration. When the task duration is 1 millisecond, IPyParallel takes almost 8 times longer than Fiber, and Spark takes 14 times longer. This result highlights that both IPyParallel and Spark introduce considerable overhead when the task duration is short, and are not as suitable as Fiber for RL and population-based methods, where a simulator is used and the response time is a couple of milliseconds.
Evolution Strategies (ES)
To probe the scalability and efficiency of Fiber, we compare it here exclusively with IPyParallel because Spark is slower than IPyParallel as shown above, and multiprocessing does not scale beyond one machine. We evaluate both frameworks on the time it takes to run 50 iterations of ES (figure 2(b)). With the same workload, we expect Fiber to finish faster because it has much less overhead than IPyParallel, as shown in the previous test. For both Fiber and IPyParallel, the population size is 2,048, so that the total computation is fixed regardless of the number of workers. The same shared noise table trick mentioned in [salimans2017evolution] is also implemented in both. Every 8 workers share one noise table. The experimental domain in this work is a modified version of the “Bipedal Walker Hardcore” environment of the OpenAI Gym [brockman2016openai] with modifications described in [POET_bipedal].
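The idea behind a shared noise table can be sketched as follows: a block of Gaussian noise is generated once from a fixed seed, and a perturbation is then identified by an integer offset into that block, so only the offset and the resulting score need to be communicated. The table size, seed, and helper names below are assumptions for illustration and do not reflect the settings used in the experiments.

```python
import random


def make_noise_table(size, seed):
    """Build a reproducible table of standard Gaussian noise."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


def sample_offset(rng, table_size, dim):
    """Pick a start index that leaves room for `dim` consecutive values."""
    return rng.randrange(table_size - dim + 1)


def perturbation(table, offset, dim):
    """Recover a perturbation vector from its offset alone."""
    return table[offset:offset + dim]


# Two parties that build the table from the same seed agree on every entry.
worker_table = make_noise_table(10_000, seed=123)
master_table = make_noise_table(10_000, seed=123)

offset = sample_offset(random.Random(7), len(worker_table), dim=4)
# The worker would send back only (offset, score); the master rebuilds the
# same noise vector locally instead of receiving it over the network.
assert perturbation(worker_table, offset, 4) == perturbation(master_table, offset, 4)
```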
The main result is that Fiber scales much better than IPyParallel and finishes each test significantly faster. The time Fiber takes to run gradually decreases as the number of workers increases from 32 to 1,024. In contrast, the time for IPyParallel to finish increases from 256 to 512 workers. IPyParallel does not finish the run at 1,024 workers due to communication errors between its processes (hence the red X in figure 2(b)). This unexpected failure undermines the ability of IPyParallel to run large-scale parallel computation. Overall, Fiber’s performance exceeds that of IPyParallel for all numbers of workers tested. Additionally, unlike IPyParallel, Fiber also finishes the run with 1,024 workers. This result highlights Fiber’s better scalability compared to IPyParallel.
Proximal Policy Optimization
To assess Fiber’s suitability for RL, we want to see how difficult it is to run a typical RL algorithm in a distributed setup. It is well known that parallelizing a single-machine multiprocessing implementation of an RL algorithm requires significant engineering effort [heess2017emergence]. However, Fiber makes it as simple as changing one line of code. No other platform, as far as we know, offers this capability. To demonstrate this simplicity, we chose a widely-used multiprocessing implementation of the popular PPO algorithm [schulman2017proximal] from OpenAI Baselines [dhariwal2017openai], and converted it to code that can run over hundreds of machines by simply replacing "import multiprocessing as mp" with "import fiber as mp".
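The sketch below shows the kind of pipe-based worker code to which such a one-line change would apply. It is not taken from OpenAI Baselines: CounterEnv, env_worker, and the command names are hypothetical, and the example runs with the standard multiprocessing module. Per the description above, the import line is the only one that would change when moving to Fiber.

```python
import multiprocessing as mp  # the swap described above would change only this line


class CounterEnv:
    """Hypothetical toy environment: the state is a running sum of actions."""

    def __init__(self):
        self.total = 0

    def step(self, action):
        self.total += action
        reward = 1.0 if self.total % 2 == 0 else 0.0
        return self.total, reward


def env_worker(conn):
    """Own one environment and serve step requests until told to close."""
    env = CounterEnv()
    while True:
        cmd, payload = conn.recv()
        if cmd == "step":
            conn.send(env.step(payload))
        elif cmd == "close":
            conn.close()
            break


def main():
    pipes, procs = [], []
    for _ in range(2):
        parent_conn, child_conn = mp.Pipe()
        proc = mp.Process(target=env_worker, args=(child_conn,))
        proc.start()
        pipes.append(parent_conn)
        procs.append(proc)
    for step in range(3):
        # Each environment keeps its own state because it is pinned to one
        # process and addressed through its own pipe.
        for i, conn in enumerate(pipes):
            conn.send(("step", i + 1))
        print(step, [conn.recv() for conn in pipes])
    for conn in pipes:
        conn.send(("close", None))
    for proc in procs:
        proc.join()


if __name__ == "__main__":
    main()
```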
We then compare the performance of the distributed version of PPO enabled by Fiber with its original multiprocessing implementation on Breakout in the Atari benchmark [bellemare2013arcade] with a total of 10 million frames for training. The test runs on one 1080 Ti GPU for the neural network policy and a variable number of CPU workers running OpenAI Gym [brockman2016openai] environments. We run 8 to 32 workers (32 being the maximum number of CPU cores available on our test machine) for multiprocessing and 8 to 256 workers for Fiber. As shown in figure 2(c), Fiber scales beyond 32 workers. When running 64 or more workers, its performance beats the best result multiprocessing can get from a single machine. With 256 workers, the total time taken by Fiber is less than half of that with 8 workers. These results show that Fiber can scale RL beyond local machines. Additionally, when running a small number of workers, Fiber virtually matches the performance of multiprocessing because Fiber has low overhead. There is only a 1% to 3% difference between Fiber and multiprocessing. This observation is significant because multiprocessing leverages optimizations only available locally, as noted previously. Finally, the PPO implementation in OpenAI Baselines has two major time-consuming parts: the environment step and the model step. We noticed sub-linear speedup on both multiprocessing and Fiber because only the environment step benefits from adding more workers. This is a limitation of the current OpenAI Baselines implementation.
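The sub-linear speedup is what Amdahl's law predicts when only part of the runtime scales with the number of workers. The sketch below evaluates that formula; the parallel fraction of 0.6 is a purely hypothetical value chosen for illustration and was not measured in these experiments.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Speedup when only `parallel_fraction` of the runtime scales with
    the number of workers and the rest stays serial."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Hypothetical split: 60% environment step (parallel), 40% model step (serial).
for workers in (8, 32, 256):
    print(workers, round(amdahl_speedup(0.6, workers), 2))
```

Under this assumption the speedup can never exceed 1 / 0.4 = 2.5, however many environment workers are added.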
In this work, we presented a new distributed framework that allows efficient development and scalable training. Experiments highlight that Fiber achieves many goals, including efficiently leveraging a large amount of heterogeneous computing hardware, dynamically scaling algorithms to improve resource usage efficiency, reducing the engineering burden required to make RL and population-based algorithms work on computer clusters, and quickly adapting to different computing environments to improve research efficiency. At the same time, Fiber outperforms existing frameworks like IPyParallel and Spark. Finally, while Fiber is designed for RL and population-based learning, its general API in principle allows it to be applied in much broader contexts. We expect it will further enable progress in solving hard RL problems with RL algorithms and population-based methods by making it easier to develop these methods and train them at the scales necessary to truly see them shine [clune2019ai].