
MetaDrive: Composing Diverse Driving Scenarios for Generalizable Reinforcement Learning

by   Quanyi Li, et al.

Driving safely requires multiple capabilities from human and intelligent agents, such as the generalizability to unseen environments, the decision making in complex multi-agent settings, and the safety awareness of the surrounding traffic. Despite the great success of reinforcement learning, most of the RL research studies each capability separately due to the lack of the integrated interactive environments. In this work, we develop a new driving simulation platform called MetaDrive for the study of generalizable reinforcement learning algorithms. MetaDrive is highly compositional, which can generate an infinite number of diverse driving scenarios from both the procedural generation and the real traffic data replay. Based on MetaDrive, we construct a variety of RL tasks and baselines in both single-agent and multi-agent settings, including benchmarking generalizability across unseen scenes, safe exploration, and learning multi-agent traffic. We open-source this simulator and maintain its development at:





1 Introduction

Great progress has been made in reinforcement learning (RL), ranging from super-human Go playing silver2016mastering to delicate dexterous in-hand manipulation andrychowicz2020learning. However, generalization remains one of the fundamental challenges in RL for its real-world applications. Even for a simple driving task, an agent that has learned to drive in one town often fails to drive in another town Dosovitskiy17. There is the critical issue of model overfitting due to the lack of diversity in the existing RL benchmarks. Many on-going efforts have been made to increase the diversity of the data produced by the simulators, such as procedural generation in gaming environments cobbe2019procgen and domain randomization in indoor navigation li2021igibson. In the context of autonomous driving (AD) research, many realistic driving simulators Dosovitskiy17; martinez2017beyond; serban2019chrono; cai2020summit; zhou2020smarts have been developed with their respective successes. Though these simulators address many essential challenges in AD, such as the realistic rendering of the surroundings in Carla Dosovitskiy17 and the scalable multi-agent simulation in SMARTS zhou2020smarts, they do not successfully address the aforementioned generalization problem in RL, especially the generalization across different scenarios. Since the existing simulators mostly adopt fixed assets and hand-crafted traffic maps, the scenarios available for training and test are far from enough to catch up with the complexity of the real world.

To better benchmark the generalizability of RL algorithms in the autonomous driving domain, we develop MetaDrive, a driving simulation platform that can compose a wide range of scenarios with various road networks and traffic flows. The key feature of MetaDrive is compositionality. Every asset in MetaDrive, such as a vehicle, an obstacle, or a road structure, is defined as an interactive object with many configurable settings, which can be easily interconnected or actuated over time. As shown in Fig. 1, through procedural generation or the import of real traffic data, diverse scenarios that are both executable and interactive can be generated and used to train and test learning-based driving systems. In MetaDrive, the perception part of the driving pipeline is abstracted and simplified. As a trade-off, the simulator can run as fast as 300 FPS on a standard PC. Furthermore, MetaDrive can be installed with a single command and is accessible through the OpenAI Gym API in a Python environment. With these features, MetaDrive aims to facilitate the quick development and prototyping of generalizable RL algorithms.

MetaDrive supports a variety of reinforcement learning tasks. At the current stage of development, we construct three standard RL tasks and their baselines in the context of autonomous driving, as listed below. The first two tasks are in the single-agent setting, while the third is in the multi-agent setting:

  • Generalization to unseen scenarios. Based on procedural generation, our simulator composes a large number of diverse driving scenarios from many elementary blocks and traffic vehicles. Those maps are further split into training and test sets, so that we can conduct baseline experiments to evaluate the generalizability of different RL methods with respect to road structures and traffic flows.

  • Safe exploration. We study the safe exploration problem in RL, where the agent learns to drive under a safety constraint. Obstacles are randomly added on the road and a cost is yielded if a collision happens. We benchmark several constrained RL algorithms to measure their applicability to the safety-critical driving application.

  • Multi-agent traffic simulation. In five traffic scenarios such as roundabout and intersection, we study the problem of multi-agent RL for dense traffic simulation. There are 20 to 40 agents in a scene, where each vehicle is actuated by a continuous control policy. Collective motions are expected to emerge.

MetaDrive is open-source under Apache License 2.0 to the community. More reinforcement learning tasks and baselines are being added.

Figure 1: A. A road map procedurally generated from elementary road blocks. B. MetaDrive supports importing road structure and traffic flow from real-world dataset. C. Multi-modal observations provided by MetaDrive, including Lidar-like cloud points, RGB / depth camera, bird-view semantic map and scalar sensory data. D. Interface of the MetaDrive simulator.

2 Related Work

RL Environments. A large number of RL environments have been developed to benchmark progress on different RL problems. The Arcade Learning Environment bellemare13arcade and the MuJoCo simulator todorov2012mujoco are widely studied in single-agent RL tasks, and in multi-agent RL the Particle World Environment mordatch2017emergence and SMAC samvelyan2019starcraft have become the standard testing grounds. The high sample complexity of traditional RL algorithms has motivated research in meta RL, where Meta-World yu2020meta is a popular environment for evaluation. To apply RL algorithms in the real world, one has to consider their safety and robustness in previously unseen scenarios. Safety Gym safety_gym_Ray2019 and ProcGen cobbe2019procgen were proposed to benchmark the safety and generalizability of RL algorithms, respectively. Similar to ProcGen cobbe2019procgen, which uses procedural generation to generate distinct levels of video games, our MetaDrive simulator can also procedurally generate an infinite number of driving scenes. As shown in the following sections, most of the aforementioned RL tasks can be implemented and studied in MetaDrive because it is easy to extend and customize.

Driving Simulators. Because of public safety concerns, it is difficult to train and test driving policies in the physical world on a large scale. Therefore, simulators have been used extensively to prototype and validate autonomous driving research. With remarkable success, the simulators CARLA Dosovitskiy17, GTA V martinez2017beyond, and SUMMIT cai2020summit realistically preserve the appearance of the physical world, rendering the environment under different lighting conditions, weather, and times of day. Thus visuomotor driving agents can be trained and evaluated more thoroughly. Rather than rendering a realistic visual appearance, MetaDrive focuses on decision making and therefore offers high flexibility for generating diverse scenarios for continuous control problems in RL. Other simulators such as Flow vinitsky2018benchmarks, CityFlow zhang2019cityflow, TORCS wymann2000torcs, Duckietown gym_duckietown and Highway-env highway-env abstract the driving problem to a higher level or provide simplistic scenarios as the environments for the agent to interact with. The newly developed SMARTS zhou2020smarts provides an excellent testbed for the interaction of RL agents and social vehicles in atomic traffic scenes. MetaDrive also provides five MARL environments that cover complex scenarios such as tollgate and parking lot, with a dense population of 40+ agents. MetaDrive differs from SMARTS in its capacity to procedurally generate various driving scenarios and to accommodate other important RL tasks apart from MARL. Compared to the existing simulators, MetaDrive holds the key feature of compositionality and aims at facilitating research on generalizable reinforcement learning.

3 MetaDrive: A Compositional Driving Simulator

MetaDrive is a compositional driving simulator that supports various RL tasks. Its core feature is compositionality, which is twofold: the abstraction of the low-level system implementation and the high-level aggregation of elementary objects into traffic scenarios. We encapsulate the back-end implementation of the objects and their communication with the simulation engine behind a Python API. With such high-level APIs, we implement various methods to generate traffic flows and diverse road networks. The wrapping of the back-end simulation makes it convenient and flexible for RL researchers to develop new scenarios and tasks and to prototype their methods in the autonomous driving domain. The diverse driving scenarios in MetaDrive also serve as standard environments to benchmark various RL algorithms. In this section, we first introduce the key designs in MetaDrive and the abstraction of the back-end implementation into high-level classes. We then describe how diverse scenarios are composed from the elementary objects.

3.1 System Designs

Object. In MetaDrive, we abstract the object as the intermediate connector between the simulation engine and the Python environment. An object is the basic entity in a driving scenario, such as a vehicle, an obstacle or a road structure. In the back-end simulation, an object is a proxy to two internal models: the physical model and the rendering model. Powered by the Bullet engine, the physical model is a rigid body, available in various shapes, that participates in collision detection and motion calculation. The rendering model, on the other hand, provides fine-grained rendering effects such as light reflection, texturing and shading.

In the Python environment, the object class wraps the aforementioned details, handles routine chores such as garbage collection and simulation stepping, and only exposes high-level APIs to manipulate the object directly. Each class of object has a parameter space that bounds the possible configurations of the instantiated object, which enables randomizing and diversifying objects. For instance, a vehicle has controllable parameters such as the wheel friction, suspension stiffness, wheel damping rate and so on. The parameters can be determined directly from a user-specified configuration or by random sampling in the parameter space. The object automatically assigns the determined parameters to the internal models. With this design, a developer who wishes to manipulate an object, such as setting the location of a vehicle, increasing the width of a lane or getting nearby objects, can simply call the APIs without touching the back-end simulation system.
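The parameter-space idea can be sketched as follows. This is a minimal illustration rather than MetaDrive's implementation: the class name `ParameterSpace`, the parameter names, and the numeric ranges are all hypothetical.

```python
import random


class ParameterSpace:
    """Hypothetical box-shaped parameter space: name -> (low, high)."""

    def __init__(self, bounds):
        self.bounds = dict(bounds)

    def sample(self, rng):
        # Draw every parameter uniformly from its allowed interval.
        return {k: rng.uniform(lo, hi) for k, (lo, hi) in self.bounds.items()}

    def resolve(self, user_config, rng):
        # User-specified values win; everything else is randomly sampled.
        params = self.sample(rng)
        for key, value in user_config.items():
            lo, hi = self.bounds[key]  # KeyError for unknown parameter names
            if not lo <= value <= hi:
                raise ValueError(f"{key}={value} is outside [{lo}, {hi}]")
            params[key] = value
        return params


# Illustrative ranges only; these are not MetaDrive's actual values.
VEHICLE_SPACE = ParameterSpace({
    "wheel_friction": (0.6, 1.2),
    "suspension_stiffness": (20.0, 60.0),
    "wheel_damping_rate": (0.1, 0.9),
})

vehicle_params = VEHICLE_SPACE.resolve({"wheel_friction": 0.9}, random.Random(0))
```

Keeping the bounds in one place lets the same object class serve both fixed, user-specified configurations and randomized ones.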

Policy. A policy is a function that takes the object and environmental states as input and determines the action or new states of the object. For vehicles, we implement a rule-based policy that mixes cruising, lane changing and emergency stopping behaviors with driving models such as IDM and MOBIL kesting2007general. We also define policies that accept commands from external controllers, such as an RL agent or a human subject, to steer the vehicle. Furthermore, a policy can control the infrastructure facilities on the road. For instance, in the MARL Tollgate environment (Sec. 4.4), the tollgate is controlled by the Tollgate Policy, which releases a vehicle after it has waited inside the gate for a few seconds.

Manager. Objects of the same class might have different roles, and different roles require different data processing pipelines. For example, in a single-agent RL environment, though the ego vehicle and the traffic vehicles are all vehicle objects, they have different roles and thus different policies. The ego vehicle requires the environment to provide fine-grained surrounding information and is controlled by the RL agent. The traffic vehicles instead rely on the rule-based policy and do not demand those detailed observations. MetaDrive manages objects of different roles with different managers. The managers determine the spawning, stepping and recycling of the objects as well as their control policies.
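The relationship between objects, policies and managers can be illustrated with a toy one-dimensional example. All class names and the simple kinematics below are hypothetical and are not taken from MetaDrive's code base; the sketch only shows how two managers can drive objects of the same class with different policies.

```python
class Vehicle:
    """Toy object: a point moving along a 1-D road."""

    def __init__(self, name):
        self.name = name
        self.position = 0.0
        self.speed = 0.0


class CruisePolicy:
    """Rule-based stand-in: accelerate until a target speed is reached."""

    def __init__(self, target_speed):
        self.target_speed = target_speed

    def act(self, vehicle):
        return 1.0 if vehicle.speed < self.target_speed else 0.0


class ExternalPolicy:
    """Forwards the command supplied by an external controller, e.g. an RL agent."""

    def __init__(self):
        self.command = 0.0

    def act(self, vehicle):
        return self.command


class Manager:
    """Owns the objects that share one role, together with their policies."""

    def __init__(self):
        self.policies = {}

    def spawn(self, vehicle, policy):
        self.policies[vehicle] = policy

    def step(self, dt):
        for vehicle, policy in self.policies.items():
            acceleration = policy.act(vehicle)
            vehicle.speed += acceleration * dt
            vehicle.position += vehicle.speed * dt


traffic_manager, agent_manager = Manager(), Manager()
traffic_car, ego_car = Vehicle("traffic"), Vehicle("ego")
ego_policy = ExternalPolicy()
traffic_manager.spawn(traffic_car, CruisePolicy(target_speed=5.0))
agent_manager.spawn(ego_car, ego_policy)

ego_policy.command = 2.0  # action chosen by the external controller
for manager in (traffic_manager, agent_manager):
    manager.step(dt=0.1)
```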

3.2 Scenario Generation

In MetaDrive, a driving scenario consists of four parts: (1) the map, which is composed of a set of road blocks, (2) the traffic flow, which contains a set of traffic vehicles cruising in the scene and navigating to their given destinations, (3) the obstacles that are randomly scattered in the map, and (4) the target vehicles that are actuated by external policies, e.g. RL agents. We implement the Map Manager, Traffic Manager, Obstacle Manager and Agent Manager to manage each part, respectively.

Figure 2: Two simulation scenarios are replicated from the real traffic data of Argoverse dataset chang2019argoverse.

Map Generation. A road network in MetaDrive is composed of a set of road blocks sampled from typical types: Straight, Ramp, Fork, Roundabout, Curve, Intersection, Merge, Split, Parking Lot, Tollgate, etc. Each block holds properties such as lanes, spawn points for locating new vehicles, and sockets, namely the exits and entrances of a block that can be used to connect it to other blocks. Based on the road blocks, MetaDrive supports three pipelines to generate road networks: procedural generation, import from real datasets, and manual specification by users.

The most important one is the Procedural Generation (PG) pipeline. We propose a search-based PG algorithm, Block Incremental Generation (BIG), which recursively appends a block to the existing road network if feasible and reverts the last block otherwise. When adding a new block, BIG first uniformly chooses a road block type and instantiates a block with random parameters sampled from the block-specific parameter space. BIG rotates the new block so that its socket can dock into one socket of the existing network. We then test whether any edge of the new block crosses the existing network. If a crossover exists, we discard the new block and instantiate another one. At most T trials are conducted. If all trials fail, we remove the latest block and revert to the previous road network. The stop criterion of BIG is the number of blocks. The details of the BIG algorithm and plots of generated maps are included in the Appendix.
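The search-and-revert loop described above can be sketched on a toy grid world. This is a simplified illustration under stated assumptions, not the BIG implementation from the Appendix: blocks are reduced to three hypothetical types (`straight`, `left`, `right`), each block occupies grid cells instead of lane edges, a socket is a cell plus a heading, and the crossover test is a cell-overlap check. The `max_steps` guard is also an addition to keep the sketch from looping forever.

```python
import random

TURNS = {"straight": 0, "left": 1, "right": -1}  # quarter turns applied on docking
HEADINGS = [(1, 0), (0, 1), (-1, 0), (0, -1)]


def make_block(socket, block_type, length):
    """Dock a block at `socket` and return (occupied cells, exit socket)."""
    x, y, heading = socket
    heading = (heading + TURNS[block_type]) % 4
    dx, dy = HEADINGS[heading]
    cells = [(x + dx * i, y + dy * i) for i in range(1, length + 1)]
    return cells, (cells[-1][0], cells[-1][1], heading)


def big(num_blocks, max_trials, rng, max_steps=10_000):
    """Append blocks one by one; revert the latest block if all trials fail."""
    # Entry 0 is the starting cell; every later entry is one generated block.
    network = [([(0, 0)], (0, 0, 0))]
    for _ in range(max_steps):
        if len(network) - 1 == num_blocks:  # stop criterion: number of blocks
            return network[1:]
        occupied = {cell for cells, _ in network for cell in cells}
        socket = network[-1][1]
        for _ in range(max_trials):
            block_type = rng.choice(sorted(TURNS))  # uniform over block types
            length = rng.randint(2, 5)              # random block parameter
            cells, exit_socket = make_block(socket, block_type, length)
            if occupied.isdisjoint(cells):          # no crossover with the network
                network.append((cells, exit_socket))
                break
        else:
            # All trials failed: remove the latest block and try again from there.
            if len(network) > 1:
                network.pop()
    raise RuntimeError("no feasible road network found within max_steps")


roads = big(num_blocks=6, max_trials=5, rng=random.Random(0))
```

Because a failed block only removes its immediate predecessor, the search stays cheap, and repeated failures unwind the network one block at a time.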

Apart from the procedural generation of new scenarios, MetaDrive can also import real traffic data from autonomous driving datasets. The road network data usually consists of a set of lane line center points, such as Argoverse dataset chang2019argoverse, Waymo dataset sun2020scalability and OpenStreetMap haklay2008openstreetmap. Benefiting from the unified data structure to represent road networks, MetaDrive can seamlessly incorporate real world data through converting way points in the dataset to the lanes and then build functionalities at those lanes, such as the localization of vehicles. As demonstrated in Fig. 2, two simulation scenarios imported from the real traffic scenes in Argoverse dataset are plotted.

Figure 3: User can customize the environment easily by passing configuration to MetaDrive.

MetaDrive can also generate a road network according to the user specification in the config system. When creating the environment, a user can pass a config, namely a dictionary, into the environment MetaDriveEnv(config) and specify the road network by providing overrides to the default settings of the map. For instance, as shown in Fig. 3, a user can generate a variety of environments, such as a wide straight road, by modifying the config dict with a few lines of code.
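The override mechanism can be sketched with plain dictionaries. The keys and default values below are hypothetical placeholders and do not reproduce MetaDrive's real config schema; the sketch only shows how user overrides can be merged into defaults while rejecting unknown keys.

```python
import copy

# Hypothetical defaults for illustration; not MetaDrive's actual config keys.
DEFAULT_CONFIG = {
    "map": {"block_sequence": "SCS", "lane_num": 3, "lane_width": 3.5},
    "traffic_density": 0.1,
}


def merge_config(default, overrides):
    """Recursively apply user overrides, rejecting keys the defaults lack."""
    merged = copy.deepcopy(default)
    for key, value in overrides.items():
        if key not in merged:
            raise KeyError(f"unknown config key: {key!r}")
        if isinstance(merged[key], dict) and isinstance(value, dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged


# A wide straight road expressed as a small set of overrides.
wide_straight = merge_config(
    DEFAULT_CONFIG,
    {"map": {"block_sequence": "S", "lane_num": 4, "lane_width": 4.0}},
)
```

Rejecting unknown keys early is one way to surface typos in a config before the environment is built.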

Traffic Generation. MetaDrive maintains the traffic through Traffic Manager. Traffic Manager decides when and where to generate or recycle traffic vehicles and also assigns policies to vehicles.

Currently, MetaDrive provides two built-in traffic modes: Respawn mode and Trigger mode. Respawn mode is designed to maintain the traffic flow density. In Respawn mode, the Traffic Manager assigns traffic vehicles to random spawn points on the map. The traffic vehicles start driving toward their destinations immediately after spawning. When a traffic vehicle terminates, it is re-positioned to an available spawn point. In contrast, Trigger mode is designed to maximize the interaction between target vehicles and traffic vehicles. The traffic vehicles stay still at the spawn points until the target vehicle enters the trigger zone of the block. Taking the Intersection block as an example, the traffic vehicles inside the intersection are triggered and start moving only when the target vehicle enters the intersection.
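The difference between the two modes can be made concrete with a small sketch. The class and function names are hypothetical, and the trigger zone is reduced to a block name for simplicity.

```python
import random


class TrafficVehicle:
    def __init__(self, spawn_point, block):
        self.spawn_point = spawn_point
        self.block = block
        self.moving = False


def respawn(vehicle, free_spawn_points, rng):
    """Respawn mode: a terminated vehicle restarts from an available spawn point."""
    vehicle.spawn_point = rng.choice(free_spawn_points)
    vehicle.moving = True  # vehicles drive off immediately after spawning


def trigger(vehicles, target_block):
    """Trigger mode: vehicles wait until the target vehicle enters their block."""
    for vehicle in vehicles:
        if vehicle.block == target_block:
            vehicle.moving = True


waiting = [TrafficVehicle("p0", "intersection"), TrafficVehicle("p1", "ramp")]
trigger(waiting, target_block="intersection")

recycled = TrafficVehicle("p2", "straight")
respawn(recycled, free_spawn_points=["p3", "p4"], rng=random.Random(0))
```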

The traffic vehicles are actuated by our rule-based policy, so they are responsive to the target vehicles. MetaDrive also supports replaying trajectories recorded in the real world, such as the data in the Argoverse dataset. Fig. 2 shows the replayed traffic vehicles, whose trajectories are drawn directly from recorded real data. More demonstrations of the replayed driving scenes are presented in the Appendix.

Scattering Obstacles. MetaDrive scatters obstacles such as cones, warning triangles and broken-down vehicles on the road, as shown in Fig. 4. The density of the obstacles determines the difficulty of the task. A collision with an obstacle yields a cost for the ego vehicle, which is used in the safe RL environments.

Target Vehicles Management. The Agent Manager is designed to register and maintain the controllable vehicles in the environment. This is important in MARL environments, since the number of active target vehicles varies and new vehicles spawn immediately once old ones terminate.

Workflow. Scenario composition in MetaDrive is conducted in a hierarchical manner: the MetaDrive environment creates a set of managers, then the managers spawn objects and assign policies to those objects if applicable. After the initialization stage, all objects run automatically in the environment following their policies, while the managers monitor the states of the objects, spawn new objects and recycle terminated ones.

3.3 Implementation Details

MetaDrive provides various kinds of sensory input, as illustrated in Fig.1C. For low-level sensors, RGB cameras, depth cameras and Lidar can be placed anywhere in the scene with adjustable parameters such as view field and the laser number. Meanwhile, the high-level scene information including the road information such as the bending angle, length and direction, and nearby vehicles’ information like velocity, heading and profile, can also be provided as input to the learning policy. Note that MetaDrive aims at providing an efficient platform to benchmark RL research, therefore we improve the simulation efficiency at the cost of photorealistic rendering effect. As a result, MetaDrive can run at 300 FPS in single-agent environment with 10 rule-based traffic vehicles and 60 FPS in multi-agent environment with 40 RL agents.

MetaDrive is implemented based on Panda3D goslin2004panda3d and Bullet Engine. The well designed rendering system of Panda3D enables MetaDrive to construct realistic monitoring and observational data. Bullet Engine empowers accurate and efficient physics simulation in MetaDrive.

4 Benchmarking Reinforcement Learning Tasks

Figure 4: Illustrations of the safe exploration and multi-agent environments in MetaDrive.

Based on MetaDrive, we construct three driving tasks corresponding to different reinforcement learning problems. The first two are in the single-agent setting, where the traffic vehicles are actuated by rule-based models and a target vehicle is controlled by an external RL agent. The third task is in the multi-agent setting, where a population of agents learns to simulate a traffic flow and each vehicle is actuated by a continuous control policy.

4.1 Experimental Setting

In all tasks, the objective of RL agents is to steer the target vehicles with low-level continuous control actions, namely acceleration, brake and steering. We attempt to unify all tasks with a general setting of observation, reward function, and evaluation metrics.

Observation. The observation of RL agents is as follows:

  • A 240-dimensional vector denoting the Lidar-like cloud points within the maximum detecting distance, centered at the target vehicle. Each entry represents the relative distance of the nearest obstacle in the specified direction.

  • A vector containing the data that summarizes the target vehicle’s state such as the steering, heading, velocity and relative distance to the left and right boundaries.

  • The navigation information that guides the target vehicle toward the destination. We densely spread a set of checkpoints in the route and use the relative positions toward future checkpoints as additional observation to the target vehicle.
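A rough sketch of how such an observation could be assembled is given below. It makes several assumptions that are not stated in the text: obstacles are modeled as circles, readings are normalized by dividing by the maximum distance, the default `max_distance=50.0` is a placeholder, and the ego-state and navigation vectors are passed in as plain lists of arbitrary length.

```python
import math


def lidar_scan(ego_xy, obstacles, num_lasers=240, max_distance=50.0):
    """Normalized distance to the nearest circular obstacle along each laser.

    obstacles: iterable of (x, y, radius). A reading of 1.0 means that
    nothing was hit within max_distance (placeholder value, an assumption).
    """
    ex, ey = ego_xy
    readings = []
    for i in range(num_lasers):
        angle = 2.0 * math.pi * i / num_lasers
        dx, dy = math.cos(angle), math.sin(angle)
        nearest = max_distance
        for ox, oy, radius in obstacles:
            # Solve |t * direction - centre|^2 = radius^2 for the smaller root t.
            cx, cy = ox - ex, oy - ey
            proj = cx * dx + cy * dy
            disc = proj * proj - (cx * cx + cy * cy - radius * radius)
            if disc < 0.0:
                continue  # the ray misses this obstacle
            t = proj - math.sqrt(disc)
            if 0.0 <= t < nearest:
                nearest = t
        readings.append(nearest / max_distance)
    return readings


def build_observation(lidar, ego_state, navigation):
    """Concatenate the three observation parts into one flat vector."""
    return list(lidar) + list(ego_state) + list(navigation)


scan = lidar_scan((0.0, 0.0), obstacles=[(10.0, 0.0, 1.0)])
observation = build_observation(scan, ego_state=[0.0, 0.1, 0.5, 0.4, 0.6],
                                navigation=[0.2, 0.0])
```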

Reward and Cost Scheme. The reward function is composed of several parts as follows:

$$R = c_1 R_{disp} + c_2 R_{speed} + R_{term}. \tag{1}$$

The displacement reward $R_{disp} = d_t - d_{t-1}$, wherein $d_t$ and $d_{t-1}$ denote the longitudinal coordinates of the target vehicle in the current lane at two consecutive time steps, provides a dense reward that encourages the agent to move forward. The speed reward $R_{speed} = v_t / v_{max}$ incentivizes the agent to drive fast. $v_t$ and $v_{max}$ denote the current velocity and the maximum velocity, respectively. We also define a sparse terminal reward $R_{term}$, which is non-zero only at the last time step. At that step, we set $R_{disp} = R_{speed} = 0$ and assign $R_{term}$ according to the terminal state. $R_{term}$ takes one constant value if the vehicle reaches the destination and another for crashing into others or violating the traffic rule. The coefficients $c_1$ and $c_2$ are fixed constants. Sophisticated reward engineering may provide a better reward function, which we leave for future work. For benchmarking Safe RL algorithms, a collision with vehicles, obstacles, sidewalks or buildings raises a cost at each time step.
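The reward scheme can be sketched as a small function, reading it as a weighted sum of a displacement term and a speed term plus a sparse terminal term that replaces both at the last step. The numeric defaults below (`c1`, `c2`, `success_reward`, `failure_penalty`) are placeholders chosen for illustration, and the terminal labels are hypothetical names.

```python
def step_reward(d_prev, d_now, speed, max_speed, terminal=None,
                c1=1.0, c2=0.1, success_reward=10.0, failure_penalty=5.0):
    """Per-step reward; all default constants are illustrative placeholders.

    terminal: None for a regular step, or one of the hypothetical labels
    "success", "crash", "out_of_road" for the last step of an episode.
    """
    if terminal == "success":
        return success_reward
    if terminal in ("crash", "out_of_road"):
        return -failure_penalty
    if terminal is not None:
        raise ValueError(f"unknown terminal state: {terminal!r}")
    r_disp = d_now - d_prev      # longitudinal progress in the current lane
    r_speed = speed / max_speed  # fraction of the maximum velocity
    return c1 * r_disp + c2 * r_speed


regular = step_reward(d_prev=1.0, d_now=1.5, speed=10.0, max_speed=20.0)
arrived = step_reward(0.0, 0.0, 0.0, 20.0, terminal="success")
crashed = step_reward(0.0, 0.0, 0.0, 20.0, terminal="crash")
```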

Evaluation Metrics. We evaluate a given driving agent over multiple episodes and define the success rate as the ratio of episodes in which the agent arrives at the destination. The traffic rule violation rate (namely driving out of the road) and the crash rate (crashing into other vehicles) are defined in the same way. Compared to the episodic reward, the success rate is a more suitable measurement when evaluating generalization, because we have a large number of scenes with different properties, such as the road length and the traffic density, which cause the reward to vary drastically across scenes.

We evaluate the performance of the trained agent with the success rate. After each training iteration, we roll out the learning agent in the test environments and record the percentage of successful episodes over 30 held-out test episodes. The traffic rule violation rate and the crash rate are also used to measure the performance of agents.
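The three rates can be computed from per-episode outcomes as in the sketch below. The outcome labels are hypothetical names for the terminal states described above, and the example episode counts are made up for illustration.

```python
from collections import Counter


def evaluation_rates(outcomes):
    """Fraction of episodes ending in each terminal state.

    outcomes: one label per episode; the labels "success", "out_of_road"
    and "crash" are hypothetical names used only in this sketch.
    """
    if not outcomes:
        raise ValueError("at least one episode is required")
    counts = Counter(outcomes)
    total = len(outcomes)
    return {name: counts[name] / total
            for name in ("success", "out_of_road", "crash")}


# 30 held-out test episodes with made-up outcomes.
episodes = ["success"] * 24 + ["out_of_road"] * 3 + ["crash"] * 3
rates = evaluation_rates(episodes)
```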

We conduct experiments on MetaDrive with algorithms mostly implemented in RLLib liang2018rllib. Specifically, we host 8 concurrent trials in an Nvidia GeForce RTX 2080 Ti GPU. Each trial consumes 2 CPUs with 8 parallel rollout workers. All experiments are repeated 5 times with different random seeds. Information about other hyper-parameters is given in Appendix.

Figure 5: The generalization results of the agents trained with the off-policy RL algorithm Soft Actor-Critic (SAC) haarnoja2018soft and the on-policy RL algorithm PPO schulman2017proximal. Increasing the number of training scenarios leads to a higher test success rate and lower traffic rule violation and crash probability, which indicates that the agent's generalization is significantly improved. Compared to PPO, the SAC algorithm brings more stable training performance. The shaded area around each curve indicates the standard deviation.

4.2 Generalization to Unseen Scenes

To benchmark the generalizability of a driving policy, we develop an RL environment that can generate an unlimited number of diverse driving scenarios through the aforementioned procedural generation algorithm. We split the generated scenes into two sets: the training set and the test set. We train the RL agents only in the training set and evaluate them in the held-out test set. The generalizability of a trained agent is therefore measured by its test performance. The objective of this task is to show how the diversity of training scenarios affects the generalizability of the learned policy.

We train the agents with two popular RL algorithms, PPO schulman2017proximal and SAC haarnoja2018soft. As shown in Fig. 5, improved generalization is observed in the agents trained with both algorithms. First, overfitting happens if the agent is not trained with a sufficiently large training set. When $N = 1$, where $N$ is the number of training scenes and the agent is thus trained in a single map, we can clearly see a significant performance gap of the learned policies between the training set and the test set. Second, the generalization ability can be greatly strengthened if the agents are trained in more environments. As the number of training scenes increases, the final test success rate keeps increasing while the rule violation and crash rates decrease drastically. The overfitting is alleviated and the test performance can match the training performance when $N$ is higher. The experimental results clearly show that increasing the diversity of training environments can significantly improve the generalization of RL agents. They also validate the strength that the compositional MetaDrive simulator brings to more generalizable reinforcement learning.

Figure 6: The test performance of MetaDrive-trained policies in the real-world scenarios.

To verify the generalizability of the trained policies, we transfer the agents trained in the above generalization experiments to the real-world scenarios replayed from the Argoverse dataset. The yellow line in Fig. 6 shows the test performance of those agents on the Argoverse dataset. Surprisingly, we find that MetaDrive-trained agents generalize well to the real scenarios.

We hypothesize this is because: (1) In the generalization experiments, each scene contains rich traffic and a complex road network (3 road blocks in each scene). Therefore a MetaDrive map is actually more difficult than the driving scenarios collected in the Argoverse dataset. (2) In a driving simulator the agent can freely explore the environment without any safety constraints, while the data collection process in the real world imposes strong constraints. This biases the data, since only conservative driving trajectories and safe scenes were recorded. Using procedural generation to produce realistic virtual environments that improve the generalization of trained policies in real-world scenarios is an extremely important topic. MetaDrive provides the flexibility to conduct further exploration.

We also provide other experiments testing various factors that impact the agents' generalizability in the Appendix.

4.3 Safe Exploration

Safety is a major concern given the trial-and-error nature of RL. Many safe RL methods have been developed to address the safe exploration problem safety_gym_Ray2019. As driving itself is a safety-critical application, it is important to evaluate constrained optimization methods in the domain of autonomous driving. We define a new suite of environments to benchmark safe exploration in RL. As shown in Fig. 4, we randomly place static and movable obstacles on the road. Different from the generalization task, we do not terminate the agent if a collision with those obstacles or with traffic vehicles happens. Instead, we allow the agent to continue driving but flag the crash with a cost. Thus, in the safe exploration task, the learning agent is required to balance the reward and the cost to solve the constrained optimization problem. We also evaluate the agents on unseen maps to show their generalization ability when encountering unfamiliar dangerous states.

We evaluate the reward shaping (RS) variants and the Lagrangian (Lag) variants of PPO schulman2017proximal and SAC haarnoja2018soft, as well as Constrained Policy Optimization (CPO) achiam2017constrained. The RS method treats the negative cost as an auxiliary reward, while the Lagrangian method safety_gym_Ray2019 considers the learning objective $\max_\theta \min_{\lambda \ge 0} \mathbb{E}_\tau [R_\theta(\tau) - \lambda (C_\theta(\tau) - d)]$, wherein $R_\theta(\tau)$ and $C_\theta(\tau)$ are the episodic reward and the episodic cost, respectively, $\theta$ denotes the policy parameters, $\lambda$ is the Lagrangian multiplier, and $d$ is a given cost threshold. Different from existing work, which applies the Lagrangian method to SAC directly ha2020learning, we additionally equip SAC-Lag with a PID controller that updates the multiplier to alleviate the oscillation in training stooke2020responsive. All algorithms are trained for 500,000 steps.
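A PID-controlled multiplier update in the spirit of stooke2020responsive can be sketched as follows. This is a simplified illustration and not the SAC-Lag implementation used in the experiments: the gains and the cost limit are hypothetical, and the smoothing of the cost signal used in practice is omitted.

```python
class PIDLagrangeMultiplier:
    """PID update of a non-negative Lagrangian multiplier (simplified sketch)."""

    def __init__(self, kp, ki, kd, cost_limit):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.cost_limit = cost_limit
        self.integral = 0.0
        self.prev_cost = None

    def update(self, episodic_cost):
        error = episodic_cost - self.cost_limit
        # Integral term: accumulated constraint violation, kept non-negative.
        self.integral = max(0.0, self.integral + error)
        # Derivative term: reacts only when the cost is rising.
        if self.prev_cost is None:
            derivative = 0.0
        else:
            derivative = max(0.0, episodic_cost - self.prev_cost)
        self.prev_cost = episodic_cost
        raw = self.kp * error + self.ki * self.integral + self.kd * derivative
        return max(0.0, raw)  # the multiplier must stay non-negative


# Hypothetical gains and cost limit, for illustration only.
controller = PIDLagrangeMultiplier(kp=0.1, ki=0.01, kd=0.0, cost_limit=1.0)
multiplier_after_violation = controller.update(episodic_cost=3.0)
multiplier_after_safe_episode = controller.update(episodic_cost=0.0)
```

The proportional term lets the multiplier react to the current violation immediately instead of waiting for the integral to build up, which is one way such a controller can damp oscillation.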

As shown in Table 1 and Fig. 7, SAC-RS shows superior task performance but causes more safety violations. In contrast, the Lagrangian SAC achieves a lower cumulative cost while giving up only marginal task performance. Meanwhile, PPO produces sub-optimal policies compared to the SAC baselines. Fig. 7 presents the learning dynamics of different safe RL algorithms. The result suggests that SAC-RS has the best sample efficiency, but a peak of episodic cost occurs when it learns most efficiently.

We also conduct the safety generalization experiment to verify the impact of different training set in terms of safety performance.

Following the above safe exploration experimental setting, we randomly place static and movable obstacles in the traffic. We do not terminate the episode if the target vehicle crashes into those objects or into traffic vehicles. Instead, we allow the agent to continue driving but record the crash with a cost. The driving task is thus further formulated as a constrained MDP achiam2017constrained; safety_gym_Ray2019. We test the reward shaping method, which assigns the crash penalty as a negative reward following the original reward scheme, as well as the Lagrangian method safety_gym_Ray2019:

$$\max_\theta \min_{\lambda \ge 0} \; J_R(\theta) - \lambda \left( J_C(\theta) - d \right), \tag{2}$$

wherein $J_R(\theta)$ and $J_C(\theta)$ are the reward objective and the cost objective, respectively, $\theta$ denotes the policy parameters, and $\lambda$ is the Lagrangian multiplier. $d$ is a given cost threshold. We use PPO for both methods. For the Lagrangian method, an extra cost critic network is used to approximate the cost advantage and compute the cost objective in Eq. 2. The Lagrangian multiplier is updated independently before the policy is updated at each training iteration.
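The two methods can be contrasted with a short sketch. It assumes the standard Lagrangian relaxation of a constrained MDP, in which the policy maximizes the reward objective minus a multiplier times the constraint violation while the multiplier takes a projected gradient step; the function names, the penalty weight and the learning rate are hypothetical.

```python
def shaped_reward(reward, cost, penalty=1.0):
    """Reward shaping: the cost is folded into the reward as a fixed penalty."""
    return reward - penalty * cost


def lagrangian_objective(reward_obj, cost_obj, multiplier, cost_threshold):
    """Objective maximized by the policy for a fixed multiplier."""
    return reward_obj - multiplier * (cost_obj - cost_threshold)


def update_multiplier(multiplier, cost_obj, cost_threshold, lr=0.05):
    """One projected gradient step on the multiplier, kept non-negative.

    The multiplier grows while the cost objective exceeds the threshold
    and shrinks toward zero once the constraint is satisfied.
    """
    return max(0.0, multiplier + lr * (cost_obj - cost_threshold))


multiplier = 0.0
for measured_cost in (4.0, 3.0, 0.5):  # made-up cost objectives per iteration
    multiplier = update_multiplier(multiplier, measured_cost, cost_threshold=1.0)
```

Unlike reward shaping, where the penalty weight is fixed in advance, the multiplier adapts to how far the measured cost is from the threshold.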

We train both methods with different numbers of training scenes and report the training and test episode cost in Fig. 8. The success rate follows the same trend as in Fig. 5, so we only plot the cost against the number of training scenes here. When trained with few environments, both methods incur a high test cost even though the training cost is low; safety generalization improves as training diversity increases. The Lagrangian method outperforms the vanilla reward shaping method and reduces the test cost by around . This result reveals the overfitting of safety, which is a critical research topic if we want to apply end-to-end driving systems in the real world.

| Category | Method | Cumulative Reward | Cumulative Cost | Success Rate |
|---|---|---|---|---|
| RL | SAC-RS | 327.13 ± 7.28 | 3.38 ± 0.60 | 0.801 ± 0.040 |
| RL | PPO-RS | 197.27 ± 16.24 | 3.33 ± 0.68 | 0.207 ± 0.052 |
| Safe RL | SAC-Lag | 324.23 ± 14.45 | 1.90 ± 0.44 | 0.714 ± 0.103 |
| Safe RL | PPO-Lag | 269.51 ± 22.54 | 1.82 ± 0.33 | 0.477 ± 0.114 |
| Safe RL | CPO | 194.06 ± 108.86 | 1.71 ± 1.02 | 0.210 ± 0.290 |

Table 1: The test performance of different approaches in Safe RL benchmarks.
Figure 7: The learning progress of different Safe RL methods. Although it achieves superior sample efficiency, the reward shaping version of SAC induces a peak in the training cost, while the Lagrangian SAC improves the policy while satisfying the safety constraint.
Figure 8: The episode cost of the trained policies in the safety generalization experiment. We observe overfitting and poor safety performance at test time when policies are trained with few training scenes.

4.4 Mixed Motive Multi-agent Reinforcement Learning

We provide six multi-agent RL environments to benchmark MARL algorithms in dense traffic scenarios. Depending on the environment, 10 to 40 agents run concurrently. Such dense multi-agent traffic has rarely been studied because previous simulators are not efficient enough. MetaDrive achieves 60 FPS with 40 controllable agents running simultaneously in a shared environment.

As shown in Fig. 4, apart from the Roundabout (40 concurrent agents) and Intersection (30 agents) blocks used in previous single-agent tasks, we additionally construct three single-block environments to study the coordination problem in MARL:

Tollgate: The Tollgate scene includes narrow roads where agents spawn and ample space in the middle with multiple tollgates. The tollgates are static obstacles that agents must not crash into. Agents are required to stop within a tollgate for 3 s; an agent fails if it exits the tollgate before being allowed to pass. 40 vehicles are initialized. Complex behaviors such as deceleration and queuing are expected. Additional states, such as whether the vehicle is in a tollgate and whether the tollgate is blocked, are given.

Bottleneck: Complementary to Tollgate, Bottleneck contains a narrow bottleneck lane in the middle that forces the vehicles to yield to others. We initialize 20 vehicles in this scene.

Parking Lot: A compact environment with 8 parking slots. Spawn points are scattered both in the parking lots and on the external roads. 10 vehicles spawn initially and need to navigate toward external roads or enter parking lots. In this environment, we allow agents to reverse to make room for others. Good maneuvering and yielding are the key to solving this task.

We reuse the procedurally generated scenarios from the generalization environment and replace the traffic vehicles with controllable target vehicles, forming the PGMA (Procedural Generation Multi-Agent) environment. These environments contain rich interactions between agents and complex road structures. This multi-agent environment introduces new challenges under the mixed-motive RL setting. Each constituent agent in this traffic system is self-interested and the relationship between agents is constantly changing.

We benchmark several applicable MARL algorithms, including the Independent Policy Optimization (IPPO) method schroeder2020independent, which uses PPO schulman2017proximal as the individual learner. We also evaluate centralized critic methods, which encode the states of nearby agents into the input of the value function. We test two variants of centralized critic PPO (CCPPO). The first is Mean Field (MF) CCPPO, which averages the states of nearby vehicles within 10 meters and feeds the mean state to the value network yang2018mean. The second variant concatenates the states of the K nearest vehicles (K=4 in our experiment) into a long vector that is fed as extra information to the value network pal2020emergent. Table 2 shows the main results of all the evaluated methods. MF-CCPPO outperforms the independent learner in several environments, whereas concatenating the states of the K nearest vehicles hurts the performance of CCPPO. We hypothesize that this is because the neighborhood of the ego vehicle changes all the time in driving tasks, while concatenating states greatly expands the input dimension and thus makes learning harder.
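The two ways of building the extra critic input can be sketched in a few lines. This is an illustration only: the function names, the `(position, state)` tuple layout, and the zero-padding used when fewer than K neighbours exist are assumptions, not details given in the paper.

```python
import math
from typing import List, Sequence, Tuple

Vehicle = Tuple[Tuple[float, float], List[float]]  # (xy position, state vector)


def mean_field_state(ego_pos: Sequence[float], others: List[Vehicle],
                     state_dim: int, radius: float = 10.0) -> List[float]:
    """Average the states of all vehicles within `radius` metres of the ego vehicle."""
    near = [state for pos, state in others if math.dist(ego_pos, pos) <= radius]
    if not near:
        return [0.0] * state_dim  # assumed fallback when no neighbour is in range
    return [sum(s[i] for s in near) / len(near) for i in range(state_dim)]


def concat_knn_state(ego_pos: Sequence[float], others: List[Vehicle],
                     state_dim: int, k: int = 4) -> List[float]:
    """Concatenate the states of the k nearest vehicles into one long vector."""
    nearest = sorted(others, key=lambda v: math.dist(ego_pos, v[0]))[:k]
    flat = [x for _, state in nearest for x in state]
    flat += [0.0] * (k * state_dim - len(flat))  # zero-pad if fewer than k neighbours
    return flat


others = [((3.0, 4.0), [1.0, 2.0]), ((6.0, 7.0), [3.0, 4.0]), ((30.0, 40.0), [5.0, 6.0])]
mf_input = mean_field_state((0.0, 0.0), others, state_dim=2)       # fixed size: state_dim
knn_input = concat_knn_state((0.0, 0.0), others, state_dim=2, k=4)  # size: k * state_dim
```

The mean-field input keeps a fixed, small dimension regardless of how many neighbours are present, while the concatenated input grows with K and changes meaning whenever the ordering of neighbours changes, which is consistent with the hypothesis stated above.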

| Method | Roundabout | Intersection | Tollgate | Bottleneck | Parking Lot | PGMA |
|---|---|---|---|---|---|---|
| IPPO | 72.05 ± 4.54 | 65.02 ± 6.75 | 86.53 ± 2.03 | 71.58 ± 5.26 | 58.76 ± 2.82 | 69.89 ± 8.09 |
| MF-CCPPO | 71.78 ± 5.08 | 71.86 ± 6.64 | 83.26 ± 4.48 | 72.82 ± 2.56 | 59.91 ± 1.98 | 73.43 ± 2.57 |
| Concat-CCPPO | 71.59 ± 4.58 | 66.58 ± 5.81 | 76.80 ± 1.84 | 63.32 ± 11.90 | 55.12 ± 2.04 | 53.12 ± 5.72 |

Table 2: Success rate of different approaches in Multi-agent RL benchmarks.

5 Conclusion

We develop an open-source, highly efficient and flexible driving simulator MetaDrive to facilitate the research of generalizable reinforcement learning. MetaDrive holds the core feature of compositionality, where an infinite number of diverse driving scenarios can be composed through both the procedural generation and the real traffic data replay. We construct a variety of RL tasks and baselines in both single-agent and multi-agent settings, including benchmarking generalizability across unseen scenes, safe exploration, and learning multi-agent traffic.

MetaDrive is limited in the following aspects. First, the image rendering is not as realistic as that of other driving simulators with photorealistic rendering. We improve the efficiency and compositionality of MetaDrive at the cost of its visual appearance, because MetaDrive specifically focuses on compositionality and the flexible augmentation of the available driving scenarios. As shown in the experimental section, such compositionality opens the door to many interesting research problems that are either difficult or impossible to study in previous simulators. Second, the driving scenes generated by MetaDrive could include more realistic components, such as pedestrians and traffic lights, to increase their complexity in the future. Third, a systematic driving case description protocol is required to enable large-scale generation of corner cases, such as near-accident scenes, for prototyping safe autonomous driving systems.



Appendix A Comparison to other simulators

Table 3: Comparison of representative driving simulators. Simulators compared: CARLA [Dosovitskiy17], GTA V [martinez2017beyond], Highway-env [highway-env], TORCS [wymann2000torcs], Flow [vinitsky2018benchmarks], Sim4CV [muller2018sim4cv], Duckietown [gym_duckietown], SMARTS [zhou2020smarts], AIRSIM [airsim2017fsr], SUMO [zhou2020smarts], MADRaS [santara2021madras], Udacity [udacity], DeepDrive [deepdrive]. Surviving column-header fragments: "Lidar or", "Real data", "In Active".

Apart from MetaDrive, many existing driving simulators support RL research. Table 3 presents a comprehensive comparison between different simulators.

The simulators GTA V [martinez2017beyond], Sim4CV [muller2018sim4cv] (note that Sim4CV is not open-sourced), AIRSIM [airsim2017fsr], CARLA [Dosovitskiy17] and its derived projects SUMMIT [cai2020summit] and MACAD [palanisamy2020multi] realistically preserve the delicate appearance of the physical world. For example, CARLA not only provides perception, localization, planning, control modules, and sophisticated kinematics models, but also renders the environment under different lighting conditions, weather and times of day. Thus the perception ability of driving agents can be trained and evaluated more thoroughly.

Other simulators such as Udacity [udacity], TORCS [wymann2000torcs], DeepDrive [deepdrive], Duckietown [gym_duckietown] and Highway-env [highway-env] simplify the driving problem to a higher level or provide simplistic scenarios as the environments for the agent to interact with. Take Highway-env as an example. Highway-env builds a learning system based on the states of surrounding objects, which avoids the sophisticated process of extracting high-level state information such as velocity and yaw rate from raw sensor inputs.

The aforementioned simulators are mainly designed for the single-agent scenario, wherein the traffic vehicles are controlled by predefined models or heuristics. In the MARL context, CityFlow [zhang2019cityflow] and FLOW [wu2017flow] are two macroscopic traffic simulators based on SUMO [SUMO2018]. However, since these two simulators focus on a different aspect of simulating the traffic system, they are not suitable for investigating the detailed behaviors of individual learning-based agents. MACAD [palanisamy2020multi] extends CARLA for MARL behavioral research. For users with limited computational resources, MADRaS [santara2021madras], which is built on TORCS [wymann2000torcs], is an alternative. Compared to MADRaS, SMARTS [zhou2020smarts] offers realistic features such as intersection scenarios and complex traffic behaviors. MetaDrive also contains those useful features for MARL research but with more complex scenes that include rich driving circumstances. For example, in the Tollgate environment, the agents need to learn to interact not only with others but also with the road infrastructure, the tollgates. They need to learn to queue and wait patiently in the tollgate until being allowed to pass.

It is noteworthy that most relevant simulators suffer from low sample efficiency, so that training driving agents takes many hours or even several days. Take SMARTS and CARLA as examples. According to the SMARTS efficiency test, 25 FPS is achieved when 10 agents are running with only the scalar state observation as input. In contrast, in our 10-agent Parking Lot environment, our simulator achieves 165 FPS on a single PC even when Lidar-like observations are fed to each agent. The same problem also exists in CARLA. In synchronous mode, CARLA achieves at most 70 FPS without traffic on a powerful workstation. In SUMO co-simulation mode, the efficiency drops to 30 FPS. In contrast, a single instance of MetaDrive can collect more than 300 or 200 RL steps per second in scenes containing 10 IDM vehicles with Lidar-based or image-based observation, respectively.

Appendix B Ethical Issues

What is the potential negative societal impact?

MetaDrive provides a virtual simulation environment for RL research. Though MetaDrive itself has marginal negative societal impact, there are two possible cases that could cause harm. First, if a user deploys an agent trained in MetaDrive on a real vehicle, the vehicle may cause accidents due to the domain gap or uncertainty in the neural network or action distribution. Second, since MetaDrive supports human control of the vehicles via keyboard or joystick, the player may experience, though the chance is very low, discomfort or pain due to exposure to the MetaDrive 3D visualization or fast driving.

What are the licenses of the assets?

MetaDrive is under Apache License 2.0. Panda3D, used in MetaDrive as the rendering system for visualization, is under the so-called “Modified BSD license,” which is a free software license with very few restrictions on usage. The Bullet engine is under the Zlib license. The vehicle models are collected from Sketchfab under CC BY 4.0 or CC BY-NC 4.0. The sky box images are under CC0 1.0 Universal (CC0 1.0). In our real data importing experiment, we use the Argoverse dataset [chang2019argoverse] and its APIs. Argoverse is provided free of charge under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public license. Argoverse code and APIs are provided under the MIT license.

How was consent obtained from people whose data you’re using?

We use open-sourced code bases or public domain assets in MetaDrive.

Does the data you are using contain personally identifiable information or offensive content?

No, the code, models and data we use do not contain personally identifiable information or offensive content.

Appendix C Supplementary Generalization Experiments

C.1 Impact of Traffic Density

We train two agents on the same training set of 100 maps but with different traffic densities. The first agent, called the “Fixed” agent, is trained with traffic density 0.1. The second agent, called the “Uniform” agent, is trained with the traffic density sampled uniformly from 0 to 0.4. On average, the Fixed agent encounters 10 vehicles, while the Uniform agent may meet 0 to 40 vehicles in training. We then evaluate the two trained agents in 7 sets of test environments, which share the same group of unseen maps but have different fixed traffic densities from 0 to 0.6. Fig. 9 shows that the Uniform agent maintains a relatively high performance compared to the Fixed agent, which fails completely in dense traffic scenes. Moreover, even in environments with high traffic densities that are unseen during training, namely 0.5 and 0.6, the Uniform agent still outperforms the Fixed agent. Therefore, training agents in environments with more diverse traffic conditions can alleviate overfitting and improve the safety and reliability of end-to-end driving.

Figure 9: Agents demonstrate overfitting to the traffic density. The “Fixed” agent is trained with the density set to 0.1, while the “Uniform” agent is trained with the traffic density varying from 0 to 0.4.

C.2 Impact of Friction Coefficient

Fig. 10 shows the impact of the friction coefficient between the vehicle’s wheels and the ground, which is an important parameter for driving. We consider two agents: one is trained on ground with a fixed friction coefficient of 1.0, where the vehicle is easy to maneuver, and the other with a friction coefficient of 0.6. The results show that the agent trained on the slippery, low-friction terrain generalizes better and drives better on road surfaces with different friction coefficients than the agent trained in the easy environment. Therefore, MetaDrive lets us train more robust agents by configuring environment parameters that cannot be tuned in some other simulators.

Figure 10: Agents trained with a wheel friction coefficient of 0.6 generalize better than those trained with a friction coefficient of 1.0 when evaluated in test environments whose friction coefficient is set to specified values.

C.3 Impact of Procedural Generalization

We further conduct an experiment to show that an agent specialized in solving each type of block separately cannot solve a complex driving map composed of multiple block types. We compare two agents: 1) the PG Agent, trained in 100 environments where each environment has 3 blocks, and 2) the Single-block Agent, trained in 300 environments where each environment contains only 1 block. We evaluate them on the same test set of environments with 3 blocks each. Fig. 11 shows that both agents can solve their training tasks but differ in test performance: the agent trained on maps generated by PG performs better than the agent trained on separate blocks. The results indicate that training agents in separate scenarios does not lead to good performance in complex scenarios. Procedurally generated maps that contain multiple block types lead to better generalization of end-to-end driving.

Figure 11: Compared to the agents trained in multi-block environments (called PG Agent), agents trained in single-block environments cannot generalize to complex environments.

Appendix D Scenario generation with Procedural Generation

D.1 Road Block

As shown in Fig. 12, we define several typical types of road block. We represent a road block using an undirected graph with additional features: $G = \{V, E, S, P, \Theta, T\}$, with nodes $V$ denoting the joints in the road network and edges $E$ denoting the lanes that interconnect the nodes. At least one node is assigned as the socket $S$. The socket is usually a straight road at the end of lanes, which serves as the anchor to connect to the socket of another block. A block can have several sockets. For instance, Roundabout and T-Intersection have 4 sockets and 3 sockets, respectively. There are also spawn points $P$ distributed uniformly in the lanes for the allocation of traffic vehicles. Apart from the above properties, a block type-specific parameter space $\Theta$ is defined to bound the possible parameters of the block, such as the number of lanes, the lane width, and the road curvature. $T$ is the block type, chosen from the predefined types. The road block is the elementary traffic component that can be assembled into a complete road network by the procedural generation algorithm. As shown in Fig. 12, the details of the typical block types are summarized as follows:

Straight: A straight road is configured by the number of lanes, the length, the width, and the lane line types, namely whether a line is a broken white line or a solid white line.

Ramp: A ramp is a road with an entry or exit in the rightmost lane. An acceleration lane and a deceleration lane are attached to the main road to guide the traffic vehicles to their destinations.

Fork: A structure used to merge or split additional lanes from the main road and change the number of lanes.

Roundabout: A circular junction with four exits (sockets) and a configurable radius. Roundabout, ramp and fork all aim to provide diverse merge scenarios.

Curve: A curve block consists of circular shape or clothoid shape lanes with configurable curvature.

T-Intersection: An intersection that can be entered and exited in three ways and thus has three sockets. The turning radius is configurable.

Intersection: A four-way intersection that allows bi-directional traffic. It is designed to support research on unprotected intersections.

Figure 12: Several types of road blocks and their parameters. L, R, X indicate the road length, the road curvature and the number of lanes, respectively.
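The block representation described in this section can be sketched as a small data structure. This is an illustration only and not MetaDrive's internal class: the name `RoadBlock`, the field names, and the example parameter ranges are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class RoadBlock:
    """A road block as an undirected graph plus sockets, spawn points and parameters."""

    block_type: str                      # e.g. "T-Intersection"
    nodes: List[str]                     # joints in the road network
    edges: List[Tuple[str, str]]         # lanes connecting pairs of joints
    sockets: List[str]                   # nodes used to dock onto other blocks
    spawn_points: List[Tuple[float, float]] = field(default_factory=list)
    # Parameter name -> (lower bound, upper bound); bounds here are hypothetical.
    param_space: Dict[str, Tuple[float, float]] = field(default_factory=dict)

    def __post_init__(self) -> None:
        node_set = set(self.nodes)
        if not self.sockets:
            raise ValueError("a block needs at least one socket")
        if not set(self.sockets) <= node_set:
            raise ValueError("every socket must be a node of the block")
        for a, b in self.edges:
            if a not in node_set or b not in node_set:
                raise ValueError("every edge must connect two nodes of the block")


t_intersection = RoadBlock(
    block_type="T-Intersection",
    nodes=["center", "west", "east", "south"],
    edges=[("center", "west"), ("center", "east"), ("center", "south")],
    sockets=["west", "east", "south"],          # three sockets, as stated in the text
    param_space={"turning_radius": (5.0, 20.0)},
)
```

Validating sockets and edges at construction time keeps malformed blocks out of the assembly step, where docking relies on every socket being a real node.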

D.2 Procedural Generation

The following is the proposed procedural generation algorithm, which automatically selects and assembles blocks into driving scenes. We recap the generation process in detail as follows:

As illustrated in Algorithm 1, we propose a search-based PG algorithm, Block Incremental Generation (BIG), which recursively appends a block to the existing road network if feasible and reverts the last block otherwise. When adding a new block, BIG first uniformly chooses a road block type and instantiates a block with random parameters (the function GetNewBlock()). After rotating the new block so that its socket can dock into one socket of the existing network (Lines 18, 19), BIG verifies whether the new block intersects with existing blocks (Line 20). We test the crossover of all edges of the new block and the existing network. If crossovers exist, we discard the new block and try a new one. At most T trials are conducted. If all of them fail, we remove the latest block and revert to the previous road network (Line 26).

We set the stop criterion of BIG to the number of blocks. After a road network with n blocks is generated, the traffic manager attaches the initial traffic flow to the static road network (Line 8) to complete the scene generation. The traffic manager creates traffic vehicles by randomly selecting the vehicle type, kinematics coefficients, target speed, behavior (aggressive or conservative), spawn points and destinations. The traffic density is defined as the number of traffic vehicles per lane per 10 meters and is taken into account at this stage. We randomly select $\rho \cdot L \cdot \bar{n} / 10$ spawn points in the whole map to allocate the traffic vehicles, wherein $\rho$ is the given traffic density, $L$ is the total length of the road network in meters, and $\bar{n}$ is the average number of lanes.
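As a worked example of the density definition above (vehicles per lane per 10 meters), the number of spawn points can be computed and sampled as follows. The formula is derived from that definition and is an assumption about the exact expression; the function names, the 500 m road length and the 2-lane average are hypothetical values chosen for illustration.

```python
import random
from typing import List, Tuple


def num_spawn_points(density: float, total_length_m: float, avg_lanes: float) -> int:
    """Vehicles to spawn when density is counted per lane per 10 metres of road."""
    return round(density * total_length_m / 10.0 * avg_lanes)


def sample_spawn_points(candidates: List[Tuple[float, float]], density: float,
                        total_length_m: float, avg_lanes: float,
                        rng: random.Random) -> List[Tuple[float, float]]:
    """Pick spawn points uniformly at random without replacement."""
    k = min(len(candidates), num_spawn_points(density, total_length_m, avg_lanes))
    return rng.sample(candidates, k)


# Hypothetical map: 500 m of road with 2 lanes on average.
fixed = num_spawn_points(0.1, 500.0, 2)   # density used for the "Fixed" agent
dense = num_spawn_points(0.4, 500.0, 2)   # upper end of the "Uniform" range
candidates = [(float(i) * 10.0, 0.0) for i in range(50)]
chosen = sample_spawn_points(candidates, 0.1, 500.0, 2, random.Random(0))
```

With these hypothetical map dimensions, densities of 0.1 and 0.4 give 10 and 40 vehicles, which matches the vehicle counts quoted for the traffic density experiment in Appendix C.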

Input: Maximum tries in one block T; Number of blocks in each map n; Number of required maps N
Result: A set of maps M
1 # Define the main function to generate a list of maps
2 Function main(T, n)
3       Initialize an empty list M to store maps
4       while M does not contain N maps do
5             Initialize an empty road network G
6             G, success = BIG(T, G, n)
7             if success is True then
8                   Initialize traffic vehicles in some spawn points of G
9                   Append G to M
11      Return M
12 # Define the Block Incremental Generation (BIG) helper function that appends one block to the current map if feasible and returns the current map with a success flag
13 Function BIG(T, G, n)
14       if G has n blocks then
15             Return G, True
16      for 1, ..., T do
17             Create new block B = GetNewBlock()
18             Find the sockets for the new block and the old blocks: s_B, s_G
19             Rotate B so that s_B, s_G have supplementary heading
20             if B does not intersect with G then
21                   Append B to G
22                   G, success = BIG(T, G, n)
23                   if success is True then
24                         Return G, True
25                   else
26                         Remove the last block from G
29      Return G, False
30 # Randomly create a block
31 Function GetNewBlock()
32       Randomly choose a road block type
33       Instantiate a block B and randomize the parameters
34       Return B
Algorithm 1 Procedural Generation of Driving Scenes
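The backtracking structure of BIG can be illustrated with a toy version on a grid. This sketch is not MetaDrive's implementation: each block occupies one grid cell, the three block types only change the heading, the intersection test is a simple cell-occupancy check, and traffic initialization is omitted. All names in the code are hypothetical.

```python
import random
from typing import List, Tuple

Block = Tuple[Tuple[int, int], int, str]  # (grid cell, exit heading index, block type)

HEADINGS = [(1, 0), (0, 1), (-1, 0), (0, -1)]          # east, north, west, south
TURNS = {"straight": 0, "left": 1, "right": -1}        # toy stand-ins for block types


def get_new_block(rng: random.Random) -> str:
    """Uniformly choose a (toy) block type."""
    return rng.choice(list(TURNS))


def big(network: List[Block], n: int, max_tries: int, rng: random.Random) -> bool:
    """Recursively append blocks until the network has n blocks; backtrack on failure."""
    if len(network) >= n:
        return True
    cell, heading, _ = network[-1]                      # socket of the last block
    for _ in range(max_tries):
        block_type = get_new_block(rng)
        dx, dy = HEADINGS[heading]
        new_cell = (cell[0] + dx, cell[1] + dy)         # dock onto the socket
        new_heading = (heading + TURNS[block_type]) % 4
        if any(c == new_cell for c, _, _ in network):   # toy intersection test
            continue
        network.append((new_cell, new_heading, block_type))
        if big(network, n, max_tries, rng):
            return True
        network.pop()                                   # revert the last block
    return False


def generate_maps(num_maps: int, n: int, max_tries: int, seed: int = 0) -> List[List[Block]]:
    """Generate num_maps road networks with n blocks each."""
    rng = random.Random(seed)
    maps: List[List[Block]] = []
    while len(maps) < num_maps:
        network: List[Block] = [((0, 0), 0, "straight")]  # first block at the origin
        if big(network, n, max_tries, rng):
            maps.append(network)
    return maps


maps = generate_maps(num_maps=3, n=5, max_tries=4)
```

In this toy setting a dead end arises, for example, after three consecutive left turns, when the next cell is the already occupied origin; the search then pops the last block and tries a different type, mirroring the revert step of Algorithm 1.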

D.3 Demonstration of Generated Maps

We show samples of the generated maps in Fig. 13.

Figure 13: Procedurally generated maps with different number of road blocks.

Appendix E Scenario generation from configurations

Scenarios in MetaDrive can be generated with user-specified configurations. Recall that the environment is initialized through the Gym API: env=gym.make("MetaDrive-v0", config). The config is a dict that contains the settings of the environment. As shown in Fig. 14, abundant scenarios can be generated by passing simple map configurations to the environment. In the showcases, we first specify the method used to generate maps. config["map_config"]["type"] = "block_sequence" generates a map containing a given sequence of block types, while block_num requests the BIG algorithm to produce a map containing a given number of blocks with randomized block types. In the config entry, the user specifies the concrete sequence of block types (if using type:block_sequence) or the number of blocks (if using type:block_num), e.g. config["map_config"]["config"] = "SSS" or 3. Detailed annotations of the different blocks are given in the MetaDrive documentation.
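The two map configurations described above can be written out as plain dictionaries. The keys `map_config`, `type` and `config` and the values `block_sequence`, `block_num`, `"SSS"` and `3` come from the text; the helper `make_map_config` is a hypothetical convenience added for this example, and the environment call is left as a comment because it requires MetaDrive to be installed.

```python
from typing import Dict, Union


def make_map_config(map_type: str, value: Union[str, int]) -> Dict[str, dict]:
    """Build the map part of an environment config (hypothetical helper)."""
    if map_type == "block_sequence":
        if not isinstance(value, str) or not value:
            raise ValueError("block_sequence expects a non-empty string of block types")
    elif map_type == "block_num":
        if not isinstance(value, int) or value < 1:
            raise ValueError("block_num expects a positive number of blocks")
    else:
        raise ValueError("unknown map type: " + map_type)
    return {"map_config": {"type": map_type, "config": value}}


# A map made of a fixed sequence of blocks, using the "SSS" value from the text.
sequence_config = make_map_config("block_sequence", "SSS")

# A map with three blocks whose types are randomized by the BIG algorithm.
random_config = make_map_config("block_num", 3)

# The dict is then passed to the environment as written in the text above:
# env = gym.make("MetaDrive-v0", config)
```

Validating the type and value together catches the most likely mistake, passing a block count where a sequence string is expected or the reverse, before the environment is constructed.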

Figure 14: MetaDrive can derive diverse scenarios with simple modification in the input config.

Appendix F Scenarios Replayed from Real-world Dataset

In Fig. 15, we demonstrate several replayed scenarios from Argoverse dataset [chang2019argoverse]. Other datasets will be supported soon.

Figure 15: MetaDrive can load real scenarios from Argoverse dataset.

Appendix G Hyper-parameters

The following tables describe detailed hyperparameter settings in the experiments of this paper.

| Hyper-parameter | Value |
|---|---|
| Discounted Factor | 0.99 |
| τ for target network update | 0.005 |
| Learning Rate | 0.0001 |
| Environmental horizon | 1500 |
| Steps before Learning start | 10000 |
| Buffer Size | 1,000,000 |
| Prioritized Buffer | True |
| Train Batch Size | 256 |
| Initial Alpha | 1.0 |
| Penalty Learning Rate | 0.01 |
| Cost Limit for SAC-Lag | 1 |

Table 4: SAC/SAC-Lag

| Hyper-parameter | Value |
|---|---|
| KL Coefficient | 0.2 |
| λ for GAE [schulman2018highdimensional] | 0.95 |
| Discounted Factor | 0.99 |
| Number of SGD epochs | |
| Train Batch Size | 4000 |
| SGD mini batch size | 100 |
| Learning Rate | 0.00005 |
| Clip Parameter | 0.2 |
| Penalty Learning Rate | 0.01 |
| Target KL Divergence for CPO | 0.01 |
| Cost Limit for PPO-Lag/CPO | 1 |

Table 5: PPO/PPO-Lag/CPO

| Hyper-parameter | Value |
|---|---|
| KL Coefficient | 1.0 |
| λ for GAE [schulman2018highdimensional] | 0.95 |
| for global value estimation | |
| for individual / neighborhood value estimation | 0.99 |
| Environmental steps per training batch | 1024 |
| Number of SGD epochs | 5 |
| SGD mini batch size | 512 |
| Learning Rate | 0.0003 |
| Environmental horizon | 1000 |
| Neighborhood radius | 10 meters |
| Number of random seeds | 8 |
| Maximal environment steps for each trial | 1M |