EnvPool: A Highly Parallel Reinforcement Learning Environment Execution Engine

06/21/2022
by   Jiayi Weng, et al.

There has been significant progress in developing reinforcement learning (RL) training systems. Past works such as IMPALA, Apex, Seed RL, Sample Factory, and others aim to improve the system's overall throughput. In this paper, we address a common bottleneck in the RL training system, i.e., parallel environment execution, which is often the slowest part of the whole system but receives little attention. With a curated design for parallelizing RL environments, we have improved the RL environment simulation speed across different hardware setups, ranging from a laptop and a modest workstation to a high-end machine like the NVIDIA DGX-A100. On a high-end machine, EnvPool achieves 1 million frames per second for environment execution on Atari environments and 3 million frames per second on MuJoCo environments. When running on a laptop, EnvPool is 2.8 times faster than the Python subprocess implementation. Moreover, great compatibility with existing RL training libraries has been demonstrated in the open-source community, including CleanRL, rl_games, DeepMind Acme, etc. Finally, EnvPool allows researchers to iterate on their ideas at a much faster pace and has great potential to become the de facto RL environment execution engine. Example runs show that it takes only 5 minutes to train Atari Pong and MuJoCo Ant, both on a laptop. EnvPool has already been open-sourced at https://github.com/sail-sg/envpool.


1 Introduction

Deep Reinforcement Learning (RL) has made remarkable progress in the past years. Notable achievements include Deep Q-Network (DQN) DQN, AlphaGo muzero; alphago; alphazero; alphago_zero, AlphaStar alphastar, OpenAI Five openai_five, etc. Apart from algorithmic innovations, arguably the most significant improvements have come from increasing the training throughput of RL agents, leveraging the computation power of large-scale distributed systems and advanced AI chips like TPUs TPU.

Meanwhile, academic research has been accelerated dramatically by shortened training time. For example, it originally took eight days and 200 million frames to train an agent to play a single Atari game DQN, while IMPALA IMPALA shortens the process to a few hours and Seed RL SeedRL continues to push the boundary of training throughput. This allows researchers to iterate on their ideas at a much faster pace and benefits the research progress of the whole RL community.

Since training RL agents with high throughput offers important benefits, in this paper we focus on tackling a common bottleneck in the RL training system: parallel environment execution. To the best of our knowledge, it is often the slowest part of the whole system but has received little attention in the previous literature. Inference and learning with the agent policy network can easily leverage the experience and performance optimization techniques from other areas where deep learning has been applied, like computer vision and natural language processing, and are often conducted with accelerators like GPUs and TPUs. The unique technical difficulty in RL systems is the interaction between the agents and the environments. Unlike the typical supervised learning setup on a fixed dataset, RL systems have to generate environment experience at very high speed to fully leverage the highly parallel computation power of accelerators.

Our contribution is to optimize environment execution for general RL environments, including video games and various applications such as financial trading and recommendation systems. The current way to run parallel environments is to execute the environments and pre-process the observations under Python multiprocessing. We accelerate environment execution by implementing a general C++ threadpool-based executor engine that runs multiple environments in parallel. The well-established Python wrappers are optimized on the C++ side as well. The interactions between the agent and the environment are exposed through straightforward Python APIs, shown below. For comprehensive user APIs, please see Appendix A.

import envpool
import numpy as np

# make a gym-style vectorized env with 100 environments
env = envpool.make("Pong-v5", env_type="gym", num_envs=100)
obs = env.reset()  # obs shape should be (100, 4, 84, 84)
act = np.zeros(100, dtype=int)
obs, rew, done, info = env.step(act, env_id=np.arange(100))
# the env_id of the next batch can be obtained through info["env_id"]

The system is called EnvPool, a highly parallel reinforcement learning environment execution engine, and it supports both the OpenAI gym APIs and the DeepMind dm_env APIs. EnvPool has both a synchronous execution mode and an asynchronous execution mode; the latter is rarely explored in mainstream RL system implementations despite its huge potential. The environments currently supported by EnvPool include Atari ALE, MuJoCo mujoco, DeepMind Control Suite dmc, ViZDoom VizDoom, classic RL environments like mountain car and cartpole RLbook, etc.

Performance highlights of the EnvPool system include:

  • With 256 CPU cores on an NVIDIA DGX-A100 machine, EnvPool achieves a simulation throughput of 1 million frames per second on Atari and 3 million physics steps per second on MuJoCo environments, a 14.9x / 19.2x improvement over the currently popular Python implementation gym (i.e., 72K frames per second / 163K physics steps per second on the same hardware setup);

  • On a laptop with 12 CPU cores, EnvPool achieves 2.8x the speed of the Python implementation;

  • When integrated with existing RL training libraries, example runs show that we can train Atari Pong and MuJoCo Ant in 5 minutes, both on a laptop;

  • No sacrifice of sample efficiency is observed when replacing OpenAI gym with EnvPool while keeping the same experiment configuration; it is a pure speedup at no cost.

2 Related Works

In this section, we review the existing RL environment execution components in the previous literature. Most implementations in RL systems use Python-level parallelization, e.g., a for-loop or subprocess gym, with which we can easily run multiple environments and obtain the interaction experience in batches. While these simple Python approaches are easy to plug into existing Python libraries and thus widely adopted, they are quite computationally inefficient compared to using a C++-level thread pool to execute the environments. The direct consequence of using inefficient environment parallelization is that many more machines have to be used just for the environment execution part. Researchers have built distributed systems like Ray Ray, which allows easy distributed remote execution of the environments. Unfortunately, multiple third parties report inefficient scale-up experiences with Ray RLlib RLlib; Ray (cf. Figure 3 in sample-factory). This issue might arise because, in a distributed setup, Ray and RLlib have to trade off communication costs with other components and do not specifically optimize the environment execution part.

Sample Factory sample-factory puts a strong focus on optimizing the whole RL system for a single-machine setup instead of going to a distributed computing architecture. To achieve high throughput in the action sampling component, they introduce a sophisticated, fully asynchronous approach called Double-Buffered Sampling, which allows network forwarding and environment execution to run in parallel but on different subsets of the environments. Though it improves the overall throughput dramatically over other systems, its implementation complexity is high, and it is not a standalone component that can be plugged into other RL systems. Furthermore, Sample Factory sacrifices compatibility with the family of RL algorithms that can only work in synchronous mode in order to achieve high throughput. In contrast, EnvPool offers both high throughput and great compatibility with existing APIs and RL algorithms.

A few recent works, e.g., Brax brax, Isaac Gym isaac-gym, and WarpDrive warpdrive, use accelerators like GPUs and TPUs as the environment engine. Due to the highly parallel nature of accelerators, numerous environments can be executed simultaneously. The intrinsic drawback of this approach is that the environments have to be purely compute-based, i.e., matrix operations, so that they can be easily accelerated on GPUs and TPUs. They cannot handle general environments like common video games, e.g., Atari ALE, ViZDoom VizDoom, StarCraft II alphastar, and Dota 2 openai_five. Moreover, in real-world applications, most scenarios can hardly be converted into a purely compute-based simulation. This key drawback limits the approach to a very narrow spectrum of applications.

The most relevant work to ours is the PodRacer architecture podracer, which also implements a C++ batched environment interface and can be used to run general environments. However, their implementation only supports synchronous execution mode, where PodRacer operates on the whole set of environments at each timestep. Each step waits for the results of all environments and is thus slowed down significantly by the slowest single environment instance. The description of the PodRacer architecture is very specific to the TPU use case. PodRacer is not open-sourced, and we cannot find many details on the concrete implementation. In contrast, EnvPool uses asynchronous execution mode by default to avoid being slowed down by any single environment instance. Moreover, it is not tied to any specific computing architecture. We have run EnvPool on both GPU machines and Cloud TPUs.

3 Methods

EnvPool contains three key components optimized in C++: the ActionBufferQueue, the ThreadPool, and the StateBufferQueue. It uses pybind11 pybind11 to expose the user interface to Python. In this section, we start by providing an overview of the overall system architecture, then illustrate the optimizations made in the individual components, and finally briefly describe how to add new RL environments to EnvPool. Complete Python user APIs can be found in Appendix A.

3.1 Overview

Figure 1: EnvPool System Overview

As can be seen from the APIs of gym and dm_env, the central interface for interacting with RL environments is the step function. The RL agent sends an action to the environment and gets an observation in return. To increase the throughput of this interaction, the usual approach is to replicate the interaction across multiple threads/processes. However, experience from other throughput-focused systems (e.g., web services) shows that an asynchronous, event-driven pattern often achieves better overall throughput because it saves the context-switching cost incurred in a simple multi-threading setting.

EnvPool follows the asynchronous event-driven pattern visualized in Figure 1. Instead of providing a synchronous step function, in each interaction EnvPool receives a batched action through the send function. The send function only puts these actions into the ActionBufferQueue; it returns immediately without waiting for environment execution. Independently, threads in the ThreadPool take actions from the ActionBufferQueue and perform the corresponding environment execution. The execution results are then added to the StateBufferQueue, which is pre-allocated as blocks. A block in the StateBufferQueue contains a fixed number (batch_size in the next section) of states. Once a block in the StateBufferQueue is filled with data, EnvPool packs it into NumPy numpy arrays. The RL algorithm gets back a batch of states by taking from the StateBufferQueue via the recv function. Details on the ActionBufferQueue and StateBufferQueue can be found in Appendix D.

A traditional step can be seen as consecutive calls to send and recv on a single environment. However, separating step into send/recv leaves more flexibility and opportunity for further optimization, e.g., they can be executed in different Python threads.
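To make this relationship concrete, the following sketch reuses the gym-style API from the introduction and shows a step next to its send/recv decomposition; the task name and batch size are placeholders, not a recommended configuration.

import numpy as np
import envpool

env = envpool.make("Pong-v5", env_type="gym", num_envs=4)
obs = env.reset()
act = np.zeros(4, dtype=int)
env_id = np.arange(4)

# one synchronous step ...
obs, rew, done, info = env.step(act, env_id)

# ... behaves like a send immediately followed by a recv, except that
# useful work (e.g., policy inference on a previously received batch)
# can be scheduled between the two calls
env.send(act, env_id)
obs, rew, done, info = env.recv()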

3.2 Synchronous vs. Asynchronous

A popular implementation of vectorized environments, gym.vector_env gym, executes all environments synchronously in the worker threads. We denote the number of environments num_envs as N. In each iteration, the N input actions are first distributed to the corresponding environments, and the executor then waits for all environments to finish their executions. The RL agent receives the N observation arrays and predicts the next N actions via a forward pass. As shown in Figure 2(a), the performance of the synchronous step is determined by the slowest environment execution time, and hence it does not scale efficiently.

Here we introduce a new concept, batch_size, for EnvPool's asynchronous send/recv execution. This idea was first proposed by Tianshou tianshou. batch_size (denoted as M) is the number of environment outputs expected to be received by each recv. As such, the batch_size M cannot be greater than num_envs N.

In each iteration, EnvPool only waits for the outputs of the first M environment steps that finish and lets the other (unfinished) thread executions continue in the backend. Figure 2(b) demonstrates this process with M < N environments executed in 4 threads. Compared with the synchronous step, asynchronous send/recv has a huge advantage when the environment execution time has a large variance, which is very common when N is large.

EnvPool can switch between synchronous and asynchronous modes simply by specifying different values of num_envs and batch_size. In asynchronous mode, batch_size < num_envs, and the throughput is maximized. To switch to synchronous mode, we only need to set num_envs = batch_size; then consecutive calls to send/recv are equivalent to synchronously stepping all the environments.
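As a minimal sketch of the two modes (using the APIs shown earlier; the task name and the sizes 16/8 are illustrative placeholders):

import numpy as np
import envpool

# Synchronous mode: batch_size defaults to num_envs, so every recv (or step)
# waits for all 16 environments to finish.
sync_env = envpool.make("Pong-v5", env_type="gym", num_envs=16)

# Asynchronous mode: each recv returns as soon as the 8 fastest environments
# of this round have finished stepping; the others keep running in the backend.
async_env = envpool.make("Pong-v5", env_type="gym", num_envs=16, batch_size=8)
async_env.async_reset()
obs, rew, done, info = async_env.recv()
act = np.zeros(len(info["env_id"]), dtype=int)
async_env.send(act, info["env_id"])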

Figure 2: Synchronous step vs asynchronous step in EnvPool.

3.3 ThreadPool

ThreadPool threadpool is a multi-thread executor implemented with std::thread. It maintains a fixed number of threads waiting for task execution without creating or destroying threads for short-term tasks. The ThreadPool is designed with the following considerations:

  • To minimize context switch overhead, the number of threads in ThreadPool is usually limited by the number of CPU cores.

  • To further speed up ThreadPool execution, we can pin each thread to a pre-determined CPU core. This further reduces context switching and improves cache efficiency.

  • We recommend setting num_envs to be 2-3x greater than the number of threads to keep the threads fully loaded while part of the environments are waiting to be consumed by the RL algorithm (see the sizing sketch below). On the one hand, if we treat the environment execution time as a distribution, taking the environments with the shortest execution times effectively avoids the long-tail problem; on the other hand, adding too many environments while keeping batch_size unchanged may cause sample inefficiency or low utilization of computational resources.
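The sketch below illustrates this sizing heuristic; the 3x over-subscription ratio and the task name are illustrative assumptions, and the best values depend on the workload.

import multiprocessing
import envpool

num_cores = multiprocessing.cpu_count()
batch_size = num_cores        # roughly one received state per core per round
num_envs = 3 * num_cores      # 2-3x over-subscription keeps the threads busy

env = envpool.make("Pong-v5", env_type="gym",
                   num_envs=num_envs, batch_size=batch_size)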

3.4 Adding New RL Environments

EnvPool is highly extensible and developer-friendly. The process of adding new RL environments is well-documented111https://envpool.readthedocs.io/en/latest/content/new_env.html. First, developers implement the RL environment in C++ header files, defining the EnvSpec and the environment interface, including methods like Reset, Step, and IsDone. Then they write a Bazel BUILD file to manage dependencies, use the C++ source file to generate a dynamically linked binary, and instantiate it in Python via pybind11. Finally, they register the environment and write rigorous unit tests for debugging. Note that adding new RL environments does not require understanding or even knowing the details of the core EnvPool infrastructure described above.
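Once registered, a new environment is exercised from Python exactly like the built-in tasks. A minimal smoke test might look like the following sketch, where "MyTask-v0" is a hypothetical task id standing in for the newly registered environment:

import numpy as np
import envpool

# "MyTask-v0" is a placeholder for the newly registered environment id
env = envpool.make("MyTask-v0", env_type="gym", num_envs=4)
obs = env.reset()
assert obs.shape[0] == 4  # batched observations, one row per environment

act = np.zeros(4, dtype=int)
obs, rew, done, info = env.step(act, env_id=np.arange(4))
assert rew.shape == (4,) and done.shape == (4,)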

4 Experiments

We divide our experiments into two parts. The first part evaluates the simulation performance of RL environment execution engines, with randomly sampled actions as inputs. This isolated benchmark, containing only the execution part, cleanly measures the engine component alone, without introducing the complexity of agent policy network inference and learning. The second part tests the effect of using the EnvPool component with existing RL training frameworks, i.e., CleanRL huang2021cleanrl, rl_games rl-games2022, and DeepMind Acme acme, to showcase how EnvPool improves overall training performance.

4.1 Pure Environment Simulation

Figure 3: Simulation throughput on three machines with Atari and MuJoCo tasks.

We first evaluate the performance of EnvPool against a set of established baselines on the RL environment execution component. There are three hardware setups used for the benchmark: the Laptop setting, the Workstation setting, and the NVIDIA DGX-A100 setting. Detailed CPU types and specifications can be found in Appendix B.

The laptop has 12 CPU cores, and the workstation has 32 CPU cores. Evaluating EnvPool on these two configurations can demonstrate its effectiveness with small-scale experiments. An NVIDIA DGX-A100 has 256 CPU cores with 8 NUMA nodes. Note that running multi-processing on each NUMA node not only makes the memory closer to the processor but also reduces the thread contention on the ActionBufferQueue.

As for the RL environments, we experiment mainly on two of the most commonly used RL benchmarks, namely Atari ALE with Pong and MuJoCo mujoco with Ant. In the pure environment simulation experiments, we draw randomly sampled actions based on the action space definition of the game and send the actions to the environment executors. The number of frames per second is measured as a mean over 50K iterations, where the Atari frame counting follows the practice of IMPALA IMPALA and Seed RL SeedRL with frameskip set to 4, and the MuJoCo sub-step number is set to 5.
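As a rough illustration (not the exact benchmark script), the asynchronous measurement loop could look like the sketch below; the environment counts and the use of env.action_space for sampling are assumptions layered on top of the APIs described in this paper.

import time
import numpy as np
import envpool

num_envs, batch_size, iterations = 64, 32, 50_000

env = envpool.make("Pong-v5", env_type="gym",
                   num_envs=num_envs, batch_size=batch_size)
env.async_reset()

start = time.time()
for _ in range(iterations):
    obs, rew, done, info = env.recv()
    env_id = info["env_id"]
    # randomly sampled actions based on the action space definition
    act = np.random.randint(env.action_space.n, size=len(env_id))
    env.send(act, env_id)
elapsed = time.time() - start

# each Atari step covers 4 frames (frameskip), following IMPALA / Seed RL
print("FPS:", iterations * batch_size * 4 / elapsed)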

We compare several concrete implementations extensively, described below. Among them, Subprocess is currently the most popular implementation; to the best of our knowledge, Sample Factory is the best-performing general RL environment execution engine at the time of submission.

  • For-loop: execute all environment steps synchronously within only one thread;

  • Subprocess gym: execute all environment steps synchronously with shared memory and multiple processes;

  • Sample Factory sample-factory: pure async step with a given number of worker threads; we pick the best performance over various num_envs per worker;

  • EnvPool (sync): synchronous step execution in EnvPool;

  • EnvPool (async): asynchronous step execution in EnvPool; given a number of worker threads for batch_size, pick the best performance over various num_envs;

  • EnvPool (numa+async): use all NUMA nodes, each launching EnvPool individually with asynchronous execution, to see the best performance of EnvPool.

To demonstrate the scalability of the above methods, we conduct experiments using various numbers of workers for the RL environment execution. The experiment setup ranges from a couple of workers (e.g., 4 cores) to using all the CPU cores in the machine (i.e., 256 cores).

System Configuration Laptop Workstation DGX-A100
Method \ Env (FPS) Atari MuJoCo Atari MuJoCo Atari MuJoCo
For-loop 4,893 12,861 7,914 20,298 4,640 11,569
Subprocess 15,863 36,586 47,699 105,432 71,943 163,656
Sample-Factory 28,216 62,510 138,847 309,264 707,494 1,573,262
EnvPool (sync) 37,396 66,622 133,824 380,950 427,851 949,787
EnvPool (async) 49,439 105,126 200,428 582,446 891,286 2,363,864
EnvPool (numa+async) / / / / 1,069,922 3,134,287
Table 1: Numeric results for benchmarking.

As we can see from Figure 3 and Table 1, our EnvPool system outperforms all of the strong baselines by significant margins on all hardware setups: Laptop, Workstation, and DGX-A100. The most popular Subprocess implementation has extremely poor scalability, with an almost flat curve, indicating barely any improvement in throughput as the number of workers and CPUs increases. The poor scaling performance of Python-based parallel execution confirms the motivation of our proposed solution.

The second important conclusion is that even when using a single environment, EnvPool provides a free speedup of up to 2x. Complete benchmarks on Atari, MuJoCo, and DeepMind Control can be found in Appendix C.
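Even with a single environment the interface stays batched, so this speedup requires no code change beyond construction; a minimal sketch:

import numpy as np
import envpool

env = envpool.make("Pong-v5", env_type="gym", num_envs=1)
obs = env.reset()  # observation batch of shape (1, 4, 84, 84)
act = np.zeros(1, dtype=int)
obs, rew, done, info = env.step(act, env_id=np.arange(1))  # still a batch of size 1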

The third observation is that the synchronous modes have significant performance disadvantages against the asynchronous systems. This is because the throughput of the synchronous mode execution is determined by the slowest single environment instance, where the stepping time for each environment instance may vary a lot, especially when the number of environments is large.

4.2 End-to-end Agent Training

In this work, we demonstrate the successful integration of EnvPool into three different open-source RL libraries. EnvPool can serve as a drop-in replacement for the vectorized environments in many deep RL libraries and reduces training wall time at no cost in sample efficiency. Integration with training libraries has been straightforward thanks to compatibility with existing environment APIs. These example runs were performed by practitioners and researchers themselves, reflecting realistic use cases in the community (i.e., using their own machines and their preferred training libraries).

The full results cover a wide range of combinations to demonstrate the general improvement across different setups, including different training libraries (e.g., PyTorch-based, JAX-based), RL environments (e.g., Atari, MuJoCo), and machines (e.g., laptops, TPU VMs). We present the main findings in the following paragraphs; the complete results can be found in Appendix E. Note that the hardware specifications of these experiments differ, so readers should not compare training speeds across different training libraries. The experiments in the main paper were all performed on laptops with a GPU.

Using the same number of parallel environments. With the same environment setup, especially the same number of parallel environments, EnvPool boosts the overall training speed significantly without loss of sample efficiency. For example, CleanRL's PPO implementation on Atari runs about 3x faster with EnvPool huang2021cleanrl when all other settings are controlled to be the same, as shown in Figure 4.

(a) Episodic Return w.r.t. Learning Frames
(b) Episodic Return w.r.t. Training Time
Figure 4: CleanRL example runs with Python vectorized environments and with EnvPool, using the same number of parallel environments .

Comparison with Ray. In Figure 5, we compare with another popular parallel solution, Ray Ray. With the same training library (rl_games rl-games2022) and the same number of parallel environments, we observe a multiple-fold improvement in wall-time training speed. The result confirms the advantage of EnvPool over existing popular parallel environment executors.

High-throughput training. Additionally, we can search for an alternative set of hyperparameters that better leverages EnvPool's throughput. For example, PPO PPO by default uses a single simulation environment and re-uses the same simulation data for 10 epochs. With EnvPool, however, we can utilize a much larger number of parallel environments and reduce updates on stale data. Better leveraging a larger number of environments with a different set of hyperparameters has helped us significantly reduce training wall time. For example, in the Figure 6 example runs, rl_games PPO can solve Ant in 5 minutes of training time, while OpenAI baselines' PPO only reaches a score of 2,000 in 20 minutes. Such a significant speedup on a laptop-level machine greatly benefits researchers through quick turnaround of their experiments. We note that a drop in sample efficiency is observed in these runs. A similar training speedup can be observed in the example run in Figure 7: OpenAI baselines' PPO requires 100 minutes of training to solve Atari Pong, while rl_games tackles it in a fraction of that time, i.e., 5 minutes.

(a) Episodic Return w.r.t. Learning Frames
(b) Episodic Return w.r.t. Training Time
Figure 5: rl_games example runs with Ray and with EnvPool, using the same number of parallel environments.
(a) Episodic Return w.r.t. Learning Frames
(b) Episodic Return w.r.t. Training Time
Figure 6: rl_games and CleanRL example runs with more parallel environments and tuned parameters, compared to openai/baselines' PPO with its default number of environments for MuJoCo experiments PPO.
(a) Episodic Return w.r.t. Learning Frames
(b) Episodic Return w.r.t. Training Time
Figure 7: Example runs with a larger number of parallel environments and tuned parameters, compared to openai/baselines' PPO with its default number of environments for Atari experiments PPO; shengyi2022the37implementation.

5 Future Work

Completeness: In this submission, we have only included RL environments from Atari ALE, MuJoCo mujoco, DeepMind Control Suite dmc, ViZDoom VizDoom, and classic ones like mountain car and cartpole. We intend to expand our pool of supported environments to cover more research use cases, e.g., grid worlds that are easily customized for research minigrid. For multi-agent environments, we have implemented ViZDoom VizDoom and welcome the community to add even more environments, including Google Research Football, the MuJoCo Soccer Environment, etc.

Distributed Use Case: The EnvPool experiments in this paper were performed on single machines. The same APIs can be extended to a distributed use case with remote execution of the environments using techniques like gRPC. The logic behind the environment execution remains hidden from researchers; only the pool of machines used to collect data operates at a much larger scale.

Research Directions: With such a high throughput of data generation, the research paradigm can shift toward large-batch training to better leverage the large amount of generated data. There is not yet a counterpart as successful as in the computer vision and natural language processing fields, where large-batch training leads to stable and faster convergence. One issue induced by faster environment execution is more severe off-policyness; better off-policy RL algorithms are required to unveil the full power of the system. Our proposed system also brings many new opportunities. For example, more accurate value estimation may be achieved by applying a large amount of parallel sampling, rollouts, and search.

6 Conclusion

In this work, we have introduced EnvPool, a highly parallel reinforcement learning environment execution engine, which significantly outperforms existing environment executors. With a curated design dedicated to the RL use case, we leverage a general asynchronous execution model implemented with a C++ thread pool for environment execution. For organizing data and outputting batch-wise observations, we design a BufferQueue tailored to RL environments. We conduct an extensive study with various setups to demonstrate the scaling ability of the proposed system and compare it with both the most popular implementation gym and the highly optimized system Sample Factory. The conclusions hold for both Atari and MuJoCo, two of the most popular RL benchmark environments. In addition, we have demonstrated significant improvements in the training speed of existing RL training libraries when integrated with EnvPool, across a wide variety of setups including different machines, RL environments, and RL algorithms. On laptops with a GPU, we manage to train Atari Pong and MuJoCo Ant in 5 minutes, accelerating development and iteration for researchers and practitioners. A limitation remains that EnvPool cannot speed up RL environments originally written in Python; developers have to translate each existing environment into C++. We hope that EnvPool can become a core component of modern RL infrastructure, providing easy speedup and high-throughput environment experiences for RL training systems.

References

Checklist

  1. For all authors…

    1. Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?

    2. Did you describe the limitations of your work?

    3. Did you discuss any potential negative societal impacts of your work?

    4. Have you read the ethics review guidelines and ensured that your paper conforms to them?

  2. If you are including theoretical results…

    1. Did you state the full set of assumptions of all theoretical results?

    2. Did you include complete proofs of all theoretical results?

  3. If you ran experiments (e.g. for benchmarks)…

    1. Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)?

    2. Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)?

    3. Did you report error bars (e.g., concerning the random seed after running experiments multiple times)?

    4. Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)?

  4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

    1. If your work uses existing assets, did you cite the creators?

    2. Did you mention the license of the assets?

    3. Did you include any new assets either in the supplemental material or as a URL?

    4. Did you discuss whether and how consent was obtained from people whose data you’re using/curating?

    5. Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content?

  5. If you used crowdsourcing or conducted research with human subjects…

    1. Did you include the full text of instructions given to participants and screenshots, if applicable?

    2. Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?

    3. Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?

Appendix A Usage of EnvPool

In this section, we include comprehensive examples of the Python user APIs for EnvPool usage, including both synchronous and asynchronous execution modes, and for both OpenAI gym and dm_env APIs.

a.1 Synchronous Execution, OpenAI gym APIs

import numpy as np
import envpool

# make gym env
env = envpool.make("Pong-v5", env_type="gym", num_envs=100)
obs = env.reset()  # obs shape should be (100, 4, 84, 84)
act = np.zeros(100, dtype=int)
obs, rew, done, info = env.step(act, env_id=np.arange(100))
# env_id = info["env_id"]

a.2 Synchronous Execution, DeepMind dm_env APIs

import numpy as np
import envpool

# make dm_env
env = envpool.make("Pong-v5", env_type="dm", num_envs=100)
obs = env.reset().observation.obs  # shape should be (100, 4, 84, 84)
act = np.zeros(100, dtype=int)
timestep = env.step(act, env_id=np.arange(100))
# timestep.observation.obs, timestep.observation.env_id,
# timestep.reward, timestep.discount, timestep.step_type

a.3 Asynchronous Execution

For maximizing the throughput of the environment execution, users may use the asynchronous execution mode. Both the typical step API and the lower-level recv/send APIs are provided.

import numpy as np
import envpool

# async via the original step API
env = envpool.make_dm("Pong-v5", num_envs=10, batch_size=9)
action_num = env.action_spec().num_values
timestep = env.reset()
env_id = timestep.observation.env_id
while True:
    action = np.random.randint(action_num, size=len(env_id))
    timestep = env.step(action, env_id)
    env_id = timestep.observation.env_id

# or use the low-level API, faster than the previous version
env = envpool.make_dm("Pong-v5", num_envs=10, batch_size=9)
action_num = env.action_spec().num_values
env.async_reset()  # this can only be called once at the beginning
while True:
    timestep = env.recv()
    env_id = timestep.observation.env_id
    action = np.random.randint(action_num, size=len(env_id))
    env.send(action, env_id)

import numpy as np
import envpool

# asynchronous execution with the gym API
num_envs = 10
batch_size = 9
env = envpool.make("Pong-v5", env_type="gym",
                   num_envs=num_envs, batch_size=batch_size)
env.async_reset()
while True:
    obs, rew, done, info = env.recv()
    env_id = info["env_id"]
    # sample valid actions from the discrete action space
    action = np.random.randint(env.action_space.n, size=len(env_id))
    env.send(action, env_id)

Appendix B CPU Specifications for Pure Environment Simulation

This section lists the detailed CPU specifications for the pure environment simulation experiments presented in the main paper.

The laptop has 12 Intel CPU cores (Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz), and the workstation has 32 AMD CPU cores (AMD Ryzen 9 5950X 16-Core Processor). Evaluating EnvPool on these two configurations demonstrates its effectiveness in small-scale experiments.

An NVIDIA DGX-A100 has 256 CPU cores with AMD EPYC 7742 64-Core Processor and 8 NUMA nodes. Note that running multi-processing on each NUMA node not only makes the memory closer to the processor but also reduces the thread contention on the ActionBufferQueue.

Appendix C Speed Improvements on Single Environment

We present experiments with a single environment (i.e., num_envs = 1) in Table 2, where EnvPool reduces overhead compared to the Python counterpart and achieves a considerable speedup.

System Method Atari Pong-v5 MuJoCo Ant-v3 dm_control cheetah run
Laptop Python 4,891 12,325 6,235
Laptop EnvPool 7,887 15,641 11,636
Laptop Speedup 1.61x 1.27x 1.87x
Workstation Python 7,739 19,472 9,042
Workstation EnvPool 12,623 25,725 16,691
Workstation Speedup 1.63x 1.32x 1.85x
DGX-A100 Python 4,449 11,018 5,024
DGX-A100 EnvPool 7,723 16,024 10,415
DGX-A100 Speedup 1.74x 1.45x 2.07x
Table 2: Single environment simulation speed on different hardware setups. The speed is in frames per second.

Appendix D ActionBufferQueue and StateBufferQueue

d.1 ActionBufferQueue

ActionBufferQueue is the queue that caches the actions from the send function, waiting to be consumed by the ThreadPool. Many open-source, general-purpose thread-safe event queues could be used for this purpose. In this work, we observe that in our case the total number of environments, the batch_size, and the number of threads are all pre-determined at the construction of EnvPool. The ActionBufferQueue can thus be tailored to our specific case for optimal performance.

We implemented the ActionBufferQueue as a lock-free circular buffer. A buffer large enough to hold all pending actions is allocated. We use two atomic counters to keep track of the head and tail of the queue; the counters modulo the buffer size serve as indices, making the buffer circular. We use a semaphore to coordinate enqueue and dequeue operations and to make the threads wait when there is no action in the queue.
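For illustration only, the following is a simplified Python analogue of this design; the actual EnvPool queue is lock-free C++ with atomic counters, whereas here a lock stands in for the atomics and a semaphore blocks consumers when the queue is empty.

import threading
from typing import Any, List

class CircularActionQueue:
    """Simplified Python analogue of the ActionBufferQueue (illustrative only)."""

    def __init__(self, capacity: int):
        self.buffer: List[Any] = [None] * capacity
        self.capacity = capacity
        self.head = 0  # next slot to read
        self.tail = 0  # next slot to write
        self.lock = threading.Lock()          # stands in for the atomic counters
        self.items = threading.Semaphore(0)   # counts queued actions

    def enqueue(self, action: Any) -> None:
        with self.lock:
            self.buffer[self.tail % self.capacity] = action
            self.tail += 1
        self.items.release()  # wake up one waiting worker thread

    def dequeue(self) -> Any:
        self.items.acquire()  # block while the queue is empty
        with self.lock:
            action = self.buffer[self.head % self.capacity]
            self.head += 1
            return action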

d.2 StateBufferQueue

StateBufferQueue is in charge of receiving the data produced by each environment. Like the ActionBufferQueue, it is also tailored exactly to RL environments. StateBufferQueue is a lock-free circular buffer of memory blocks; each block contains a fixed number of slots equal to batch_size, and each slot stores the data generated by a single environment.

When an environment finishes its step inside the ThreadPool, the corresponding thread acquires a slot in the StateBufferQueue to write the data. When all slots are written, the block is marked as ready (see the yellow slots in Figure 1). By pre-allocating memory blocks, each block in the StateBufferQueue can accommodate a batch of states. Environments use the slots of the pre-allocated space in a first-come, first-served manner. When a block is full, it can be taken directly as a batch of data, saving the overhead of batching. Both the allocation position and the write count of a block are tracked by atomic counters. When a block is ready, the consumer is notified via a semaphore. Therefore, the StateBufferQueue is also lock-free and highly performant.
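Again for illustration only, a simplified Python analogue of a single block's slot protocol is sketched below; the real StateBufferQueue is lock-free C++, and the observation shape and dtype here are placeholders.

import threading
import numpy as np

class StateBlock:
    """One pre-allocated block holding a batch of states (illustrative only)."""

    def __init__(self, batch_size: int, obs_shape: tuple):
        self.obs = np.zeros((batch_size, *obs_shape), dtype=np.uint8)
        self.env_id = np.zeros(batch_size, dtype=np.int32)
        self._next_slot = 0                    # allocation position
        self._written = 0                      # write count
        self._lock = threading.Lock()          # stands in for the atomic counters
        self._ready = threading.Semaphore(0)   # signals "block is ready"

    def write(self, env_id: int, obs: np.ndarray) -> None:
        with self._lock:
            slot = self._next_slot             # claim a slot, first come first served
            self._next_slot += 1
        self.obs[slot] = obs                   # write directly into the batch memory
        self.env_id[slot] = env_id
        with self._lock:
            self._written += 1
            if self._written == len(self.env_id):
                self._ready.release()          # the last writer marks the block ready

    def take(self):
        self._ready.acquire()                  # wait until every slot is written
        return self.obs, self.env_id           # already batched, no extra copy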

Data Movement. The popular Python vectorized environment executor performs memory copies at several places that are avoided in EnvPool: one inter-process copy for collecting the states from the worker processes, and one copy for batching the collected states. In EnvPool, these copies are avoided thanks to the StateBufferQueue because:

  • We pre-allocate memory for a batch of states, the pointer to the target slot of memory is directly passed to the environment execution and written from the worker thread.

  • The ownership of the block of memory is directly transferred to Python and converted into NumPy arrays via pybind11 when the block of memory is marked as ready.

Appendix E Complete Results of End-to-end Agent Training

e.1 CleanRL Training Results

This section presents the complete training results using CleanRL’s PPO and EnvPool. CleanRL’s PPO closely matches the performance and implementation details of openai/baselines’ PPO [shengyi2022the37implementation]. The source code is made available publicly222See https://github.com/vwxyzjn/envpool-cleanrl. The hardware specifications for conducting the CleanRL’s experiments are as follows:

  • OS: Pop!_OS 21.10 x86_64

  • Kernel: 5.17.5-76051705-generic

  • CPU: AMD Ryzen 9 3900X (24) @ 3.800GHz

  • GPU: NVIDIA GeForce RTX 2060 Rev. A

  • Memory: 64237MiB

The hyperparameters and learning curves for CleanRL's Atari experiments can be found in Table 3 and Figure 8, and those for CleanRL's MuJoCo experiments in Table 5 and Figure 9. The tuned hyperparameters for CleanRL's Pong experiment can be found in Table 4.

Note that CleanRL's EnvPool experiments with MuJoCo use the v4 environments, while the gym vecenv experiments use the v2 environments. There are subtle differences between the v2 and v4 environments333See https://github.com/openai/gym/pull/2762#issuecomment-1135362092.

Parameter Names Parameter Values
Total Time Steps 10,000,000
Learning Rate 0.00025 Linearly Decreased to 0
Number of Environments 8
Number of Steps per Environment 128
Discount Factor (gamma) 0.99
GAE Lambda 0.95
Number of Mini-batches 4
Number of PPO Update Iterations per Epoch 4
PPO Clipping Coefficient 0.1
Value Function Coefficient 0.5
Entropy Coefficient 0.01
Gradient Norm Threshold 0.5
Value Function Loss Clipping (see “Value Function Loss Clipping” in [shengyi2022the37implementation]) True
Table 3: PPO hyperparameters used for CleanRL’s Atari experiments (i.e., ppo_atari.py and ppo_atari_envpool.py). The hyperparameters used are aligned with [PPO].
Parameter Names Parameter Values
Total Time Steps 7,000,000
Learning Rate 0.002
Number of Environments 64
Number of Steps per Environment 128
Discount Factor (gamma) 0.99
GAE Lambda 0.95
Number of Mini-batches 4
Number of PPO Update Iterations per Epoch 4
PPO Clipping Coefficient 0.1
Value Function Coefficient 2.24
Entropy Coefficient 0.0
Gradient Norm Threshold 1.13
Value Function Loss Clipping False
Table 4: PPO tuned hyperparameters used for CleanRL’s Pong experiments in Figure 7.
Figure 8: CleanRL example runs with Python vectorized environments and with EnvPool, using the same number of parallel environments .
Parameter Names Parameter Values
Total Time Steps 10,000,000
Learning Rate 0.00295 Linearly Decreased to 0
Number of Environments 64
Number of Steps per Environment 64
Discount Factor (gamma) 0.99
GAE Lambda 0.95
Number of Mini-batches 4
Number of PPO Update Iterations per Epoch 2
PPO Clipping Coefficient 0.2
Value Function Coefficient 1.3
Entropy Coefficient 0.0
Gradient Norm Threshold 3.5
Value Function Loss Clipping False
Table 5: PPO hyperparameters used for CleanRL’s MuJoCo experiments (i.e., ppo_continuous_action.py and ppo_continuous_action_envpool.py). Note that [PPO] uses a different default configuration, so we needed to find an alternative set of hyperparameters.
Figure 9: CleanRL example runs with Python vectorized environments and with EnvPool, using the same number of parallel environments .

e.2 rl_games Training Results

This section presents the complete training results using rl_games' PPO and EnvPool. The hyperparameter configurations can be found in the rl_games repository.555See the files postfixed with envpool in https://github.com/Denys88/rl_games/tree/master/rl_games/configs/atari. The hardware specifications for the rl_games experiments are as follows:

  • OS: Ubuntu 21.10 x86_64

  • Kernel: 5.13.0-48-generic

  • CPU: 11th Gen Intel i9-11980HK (16) @ 4.900GHz

  • GPU: NVIDIA GeForce RTX 3080 Mobile / Max-Q 8GB/16GB

  • Memory: 64012MiB

Parameter Names Parameter Values
Number of Environments 64
Number of Steps per Environment 256 for HalfCheetah
64 for Ant
128 for Humanoid
Discount Factor (gamma) 0.99
GAE Lambda 0.95
Number of Mini-batches 2
Number of PPO Update Iterations per Epoch 4
PPO Clipping Coefficient 0.2
Value Function Coefficient 2.0
Entropy Coefficient 0.0
Gradient Norm Threshold 1.0
Learning Rate 0.0003, dynamically adapted based on the KL divergence
KL Divergence Threshold (for the adaptive learning rate) 0.008
Value Function Loss Clipping True
Value Bootstrap on Terminal States True
Reward Scale 0.1
Smooth Ratio Clamp True
Observation Normalization True
Value Normalization True
MLP Sizes for HalfCheetah
for Ant
for Humanoid
MLP Activation Elu
Shared actor critic network True
MLP Layer Initializer Xavier
Table 6: PPO baseline hyperparameters used for rl_games’s MuJoCo experiments. Some environments use different neural network architectures.

We compare the training performance of rl_games using EnvPool against using Ray [Ray]'s parallel environments. In Figure 11, we observe that EnvPool speeds up the training system several-fold compared to the Ray integration.

For example, using well-tuned hyperparameters, we can train the Atari Pong game in under 2 minutes to an 18+ training score and a 20+ evaluation score on a laptop.

Well-established implementations of SAC [huang2021cleanrl, tianshou] can only train Humanoid to a score of 5,000 in three to four hours. It is worth highlighting that we can now train Humanoid to a score of over 10,000 in just 15 minutes on a laptop.

The learning curves for rl_games' Atari experiments can be found in Figure 10, and those for the MuJoCo experiments in Figure 11. Tables 6, 7, and 8 list the hyperparameters.

Parameter Names Parameter Values
Number of Environments 64
Number of Steps per Environment 128
Discount Factor (gamma) 0.999
GAE Lambda 0.95
Number of Mini-batches 4
Number of PPO Update Iterations per Epoch 2
PPO Clipping Coefficient 0.2
Value Function Coefficient 2.0
Entropy Coefficient 0.01
Gradient Norm Threshold 1.0
Learning Rate 0.0008
Value Function Loss Clipping False
Observation Normalization False
Value Normalization True
Neural network Nature CNN
Activation Relu
Shared actor critic network True
Layer Initializer Orthogonal
Table 7: PPO hyperparameters used for rl_games’s Atari Breakout experiments.
Parameter Names Parameter Values
Total Time Steps 8,000,000
Number of Environments 64
Number of Steps per Environment 128
Discount Factor (gamma) 0.999
GAE Lambda 0.95
Number of Mini-batches 8
Number of PPO Update Iterations per Epoch 2
PPO Clipping Coefficient 0.2
Value Function Coefficient 2.0
Entropy Coefficient 0.01
Gradient Norm Threshold 1.0
Learning Rate 0.0003, dynamically adapted based on the KL divergence
KL Divergence Threshold (for the adaptive learning rate) 0.01
Value Function Loss Clipping True
Observation Normalization True
Value Normalization True
Neural network Nature CNN
Activation Elu
Shared actor critic network True
Layer Initializer Orthogonal
Table 8: PPO hyperparameters used for rl_games’s Atari Pong experiments.
Figure 10: rl_games example runs with Ray environments and with EnvPool, using the same number of parallel environments .
Figure 11: rl_games example runs with Ray environments and with EnvPool, using the same number of parallel environments .

e.3 Acme-based Training Results

We integrate EnvPool with Acme [acme] for experiments with PPO [PPO] on MuJoCo tasks, to show EnvPool's efficiency with different values of num_envs and its advantage over other vectorized environments such as Stable Baselines3's DummyVecEnv [stable-baselines3]. The code and hyperparameters can be found in our open-sourced codebase.

All the experiments were performed on a standard TPUv3-8 machine on Google Cloud with the following hardware specifications:

  • OS: Ubuntu 20.04 x86_64

  • Kernel: 5.4.0-1043-gcp

  • CPU: Intel(R) Xeon(R) CPU @ 2.00GHz

  • TPU: v3-8 with v2-alpha software

  • Memory: 342605MiB

Figure 12: Training curves of Acme’s PPO implementation on MuJoCo HalfCheetah-v3 environment using EnvPool of different number of parallel environments.

Figure 12 shows that, under the same environment interaction budget, tuning num_envs can greatly reduce the training time while maintaining similar sample efficiency. We note that the key to maintaining sample efficiency is to keep the same number of environment steps under the same set of policy parameters; in our case, we simply keep this quantity constant.

Figure 13: Comparison of EnvPool and DummyVecEnv using Acme’s PPO implementation on MuJoCo HalfCheetah-v3 environment. Both settings use num_envs of 32.

In Figure 13, we compare EnvPool with another popular batched environment executor, DummyVecEnv, which is argued to be more efficient than its alternative SubprocVecEnv for lightweight environments; both are provided by Stable Baselines3 [stable-baselines3]. Using the same number of environments, EnvPool spends less than half the wall time of DummyVecEnv to achieve a similar final episode return, demonstrating its efficiency.

Appendix F License of EnvPool and RL Environments

EnvPool is under Apache2 license. Other third-party source codes and data are under their corresponding licenses.

Appendix G Data Collection Process and Broader Social Impact

All data output by EnvPool is generated by the underlying simulators and game engines. Wrappers (e.g., transformations of the inputs) implemented in EnvPool perform data pre-processing for the learning agents. EnvPool provides an effective way to execute RL environments in parallel and does not involve the data labelling or collection processes typical of supervised learning.

EnvPool is an infrastructure component to improve the throughput of generating RL experiences and the training system. EnvPool does not change the nature of the underlying RL environments or the training systems. Thus, to the best of the authors’ knowledge, EnvPool does not introduce extra social impact to the field of RL and AI apart from our technical contributions.

Appendix H Author Contributions

Jiayi Weng and Min Lin designed and implemented the core infrastructure of EnvPool.

Shengyi Huang originally demonstrated the effectiveness of end-to-end agent training with Atari Pong and Breakout using EnvPool.

Jiayi Weng conducted pure environment simulation experiments.

Bo Liu conducted environment alignment test and contributed to EnvPool bug reports and debugging.

Shengyi Huang, Denys Makoviichuk, Viktor Makoviychuk, and Zichen Liu conducted end-to-end agent training experiments with CleanRL, rl_games, Ray, and Acme with Atari and MuJoCo environments.

Jiayi Weng developed Atari, ViZDoom, and OpenAI Gym Classic Control and Toy Text environments in EnvPool.

Jiayi Weng, Bo Liu, and Yufan Song developed OpenAI Gym MuJoCo and DeepMind Control Suite benchmark environments in EnvPool.

Jiayi Weng and Ting Luo developed OpenAI Gym Box2D environments in EnvPool.

Yukun Jiang developed OpenAI Procgen environments in EnvPool.

Shengyi Huang implemented CleanRL’s PPO integration with EnvPool.

Denys Makoviichuk and Viktor Makoviychuk implemented rl_games integration with EnvPool.

Zichen Liu contributed to EnvPool Acme integration.

Min Lin and Zhongwen Xu led the project from its inception.

Shuicheng Yan advised and supported the project.

Jiayi Weng, Zhongwen Xu, Min Lin, Shengyi Huang, Bo Liu, Denys Makoviichuk, and Viktor Makoviychuk wrote the paper.