A Reinforcement Learning Benchmark for Autonomous Driving in Intersection Scenarios

by Yuqi Liu, et al.

In recent years, control in urban intersection scenarios has become an emerging research topic. In such scenarios, the autonomous vehicle faces complicated situations, since it must interact with social vehicles in a timely manner while obeying the traffic rules. Generally, the autonomous vehicle is supposed to avoid collisions while pursuing better efficiency. Existing work fails to provide a framework that emphasizes the integrity of the scenarios while supporting the deployment and testing of reinforcement learning (RL) methods. Specifically, we propose a benchmark for training and testing RL-based autonomous driving agents in complex intersection scenarios, called RL-CIS. We then deploy a set of baselines consisting of various algorithms. The test benchmark and baselines provide a fair and comprehensive training and testing platform for the study of RL for autonomous driving in intersection scenarios, advancing the progress of RL-based methods for intersection autonomous driving control. The code of our proposed framework can be found at https://github.com/liuyuqi123/ComplexUrbanScenarios.




I Introduction

Driving in urban scenarios, especially intersection scenarios, is one of the most challenging problems for an autonomous driving (AD) system. According to [1], a general AD system is composed of several subsystems, including sensing, navigation, decision-making, planning, and control. The key challenge of the intersection scenario is the interaction between the autonomous driving vehicle (ADV) and social vehicles, which mainly poses challenges for the decision-making and control modules [2]. Since the behavioral intention of social vehicles is uncertain, the ADV is forced to negotiate and make decisions quickly under strong interactions; otherwise, traffic accidents are very likely to occur. The intersection scenario discussed in this paper mainly refers to the ADV passing through a cross-junction while interacting with social vehicles.

Reinforcement learning (RL) methods learn an optimal policy through trial and error: an RL agent improves its performance by updating its policy repeatedly. In recent years, RL has become a key technology in the field of AD decision-making and control [3, 4, 5]. In [6], a method combining deep learning (DL) and RL is proposed to address the lane-keeping task with visual input for an ADV on a highway track. In [7], an RL method with rule-based constraints is proposed to train the ADV for the lane-changing task in the highway scenario.

Besides, some RL-based methods have been proposed to deal with the decision-making and control problems in intersection scenarios. In [8], an end-to-end framework is proposed for ADV control, in which the RL agent takes raw data as input and directly outputs the control command of the vehicle. The proposed scenario mixes junctions and roundabouts, while the traffic flow in the scenario is not adjustable. In [9], a multi-agent RL framework is proposed for behavior model design in intersection scenarios, in which both rule-based and RL-based baselines are provided. The proposed simulator is similar to the one proposed in [10]; neither of them provides a detailed dynamic model of vehicles or a high-resolution simulator. In [11], a simulator based on the CARLA simulator [12] is proposed, which provides a set of real-world road maps as an intersection AD benchmark. Though a partially observable Markov decision process (POMDP) agent is proposed, no RL interface is provided for further research. In [13], an RL framework named ULTRA is proposed with a delicate scenario design. In that work, the behavior of social vehicles is modeled by the SMARTS simulator [14], and the dynamics of the simulation are idealized as well. We compare the features of some existing frameworks in TABLE I.

As discussed above, previous work has not been able to integrate a tunable intersection scenario benchmark with RL-based baselines in a high-resolution simulator. In this paper, we propose an RL training and testing framework that addresses complex intersection scenarios for the AD problem. We call our benchmark Reinforcement Learning Complex Intersection Scenario, or RL-CIS for short. As for the original contributions, this paper:

  • proposes a benchmark called RL-CIS for the evaluation of RL-based AD agents, which includes both stochastic and deterministic tests as well as a training environment for the RL agent,

  • develops an RL training environment for intersection scenarios and proposes a traffic flow generation method based on a stochastic process,

  • provides some basic performance metrics for the AD agent, together with a set of baselines covering both RL and rule-based methods. The experimental results show that our RL agent outperforms some rule-based methods in intersection scenarios.

Features Vehicle dynamics RL Interface Scenario diversity Scenario randomness Evaluation benchmark
CARLA scenario runner
Highway env
interp e2e driving
TABLE I: Comparison on Framework of Intersection Autonomous Driving

II Intersection Scenarios

II-A Design of Test Scenarios

In order to evaluate the AD agent, we design a set of intersection scenarios based on [15], which was proposed for ADV field test evaluation. Inspired by [15] and [16], a scenario is defined through a 3-level structure. The first level is the functional scenario, which defines the road network structure, the target route, the behavior of the traffic flow, and other necessary features. The second level is the logical scenario. During this phase of scenario development, the range of every parameter is given, which constitutes the set of all available scenarios. The final level is the concrete scenario: a specific group of parameters is selected to instantiate a single concrete scenario, which is rendered for the ADV test.

II-A1 Variability

Variability is the most critical feature of the proposed benchmark. It is composed of the diversity of the functional scenarios and the randomness of the logical scenarios, which are the two major rules in scenario design.


Diversity is emphasized in the design of functional scenarios. In this phase, diversity refers to the variety of the driving tasks and of the behavior of social vehicles, which leads to diverse interactions between the ego vehicle and the social vehicles. In [8] and [11], though the traffic flow is random, the diversity of interaction between the ADV and social vehicles is not guaranteed, which means that some types of interaction may never occur in the whole test set.

Specifically, in a cross-intersection, the ADV has 3 potential passing directions, which are turning left, turning right, and going straight. For each driving task, the ego vehicle may encounter interacting traffic flow from different directions. The variation of the interaction is the major diversity of intersection scenarios.


Randomness is the critical rule in the design of logical scenarios. In intersection scenarios, randomness mainly covers two aspects. The first is the randomness of the behavior model of social vehicles, which is defined as how a social vehicle reacts to potential traffic conflicts. The second is the range of critical kinetic parameters, such as the target speed and the minimum brake distance for some rule-based methods. Designing the range of the tunable parameters involves a trade-off: the range should be as broad as possible to generalize the scene, but it should not exceed reasonable limits.

Fig. 1: Intersection scenario in CARLA simulator

II-A2 Deterministic Test Scenarios

The un-signalized intersection scenario is one of the most challenging scenarios for the ADV and is the main focus of this paper. In a signalized intersection, the interaction between vehicles is taken over by the traffic lights, assuming that all traffic participants obey the traffic rules. Although both signalized and un-signalized junctions are common in real-world traffic, the un-signalized intersection is selected intentionally to stress the interaction problem in the intersection scenario.

In this paper, we select the cross-intersection scene, which is the most typical road-network structure among intersection scenarios. We use the Town03 map in the CARLA simulator to configure the scenario, as shown in Fig. 1. The blue lines refer to the planned turning routes in the intersection. In a cross-intersection, three typical tasks are mainly considered: turning left, turning right, and going straight. For each task, there are multiple potential interacting traffic flows, which are determined by the road network structure. To further decompose the scenarios, each functional scenario extracts only a single interacting traffic flow.

Therefore, we propose five functional scenarios, as shown in Fig. 2. Functional scenarios (a) and (b) refer to the left-turning task, interacting with a going-straight and a right-turning traffic flow, respectively, both coming from the opposite direction. Scenario (c) refers to the right-turning task, interacting with a going-straight traffic flow coming from the left. Scenarios (d) and (e) refer to the going-straight task, interacting with a going-straight traffic flow coming from the left and a left-turning traffic flow coming from the opposite direction, respectively. The traffic flow in each scenario is composed of a continuous sequence of social vehicles, and each vehicle in the traffic flow has a predefined route, as illustrated in Fig. 1.

The behavior model of all social vehicles is determined by two rules. The first is speed tracking: the social vehicle accelerates until it reaches the target speed and then maintains it. The second rule specifies how the social vehicle reacts to a potential conflict, for which we employ the autonomous emergency braking (AEB) method. Generally, the AEB model monitors a certain range in front of the vehicle. If any obstacle is detected, the vehicle brakes until the detection area is clear, after which it resumes pursuing the target speed.
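The two rules can be illustrated with a minimal Python sketch of a per-step speed update. The function name, the time step, and the acceleration and braking values are hypothetical and are not taken from the RL-CIS code base; obstacle detection is reduced to a Boolean flag.

```python
def social_vehicle_step(speed, target_speed, obstacle_detected,
                        dt=0.1, accel=2.0, brake=6.0):
    """Advance the speed (m/s) of one social vehicle by one time step.

    dt, accel and brake are illustrative values, not RL-CIS settings.
    """
    if obstacle_detected:
        # AEB rule: brake (never below standstill) while the area ahead is blocked.
        return max(0.0, speed - brake * dt)
    # Speed-tracking rule: accelerate up to the target speed, then hold it.
    return min(target_speed, speed + accel * dt)


# Short rollout: free driving, a blocked interval, then recovery.
speed = 0.0
for step in range(100):
    blocked = 60 <= step < 70
    speed = social_vehicle_step(speed, 8.0, blocked)
```

In this rollout the vehicle reaches the target speed, slows down while the flag is set, and returns to the target speed once the area ahead is clear again.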

Besides the design of functional scenarios, the logical scenario is instantiated with a set of determined kinetic parameters. In our proposed intersection scenarios, the adjustable kinetic parameters are the target speed of each vehicle in the traffic flow and the gap distance between adjacent vehicles, as shown in Fig. 2. The gap distance is fixed in each concrete scenario instance to guarantee the stability of the test. Both parameters are determined by discretizing a fixed range: the target speed (in km/h) and the gap distance (in m) are each sampled uniformly from their range with a step length of 2.
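As an illustration of how a logical scenario expands into concrete scenarios, the following Python sketch enumerates every combination of functional scenario, target speed, and gap distance on a grid with step 2. The class and function names are hypothetical, and the numeric ranges are placeholders, not the ranges used in RL-CIS.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ConcreteScenario:
    functional_scenario: str   # label of a functional scenario in Fig. 2
    target_speed_kmh: float    # target speed of the traffic flow
    gap_distance_m: float      # gap between adjacent social vehicles


def discretize(low, high, step):
    """Evenly spaced values from low to high (inclusive) with the given step."""
    count = int(round((high - low) / step)) + 1
    return [low + i * step for i in range(count)]


def enumerate_concrete_scenarios(functional_ids, speed_range, gap_range, step=2.0):
    """Instantiate one concrete scenario per grid point of the logical scenario."""
    speeds = discretize(speed_range[0], speed_range[1], step)
    gaps = discretize(gap_range[0], gap_range[1], step)
    return [ConcreteScenario(f, v, g)
            for f, v, g in product(functional_ids, speeds, gaps)]


# Placeholder ranges (assumptions): 20-40 km/h and 10-20 m.
scenarios = enumerate_concrete_scenarios("abcde", (20.0, 40.0), (10.0, 20.0))
```

With these placeholder ranges, each functional scenario yields 11 speed values times 6 gap values.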

II-A3 Stochastic Test Scenarios

Besides the deterministic test, we propose a stochastic test set for the evaluation of RL-based agents. In the stochastic test, a comparatively more random traffic flow is provided. Specifically, the behavior model of social vehicles is determined by CARLA's built-in Autopilot function. CARLA Autopilot is a rule-based AD framework that includes navigation, planning, and control modules. A social vehicle controlled by CARLA Autopilot randomly plans its route through the junction. As for the driving behavior, a CARLA Autopilot agent has a Boolean switch for collision avoidance with specific vehicles. In our experiment, we switch off the collision avoidance of all social vehicles against the RL-based ADV. Because the RL agent learns its policy through trial and error, switching off collision avoidance helps the RL agent explore and learn policies more efficiently.

(a) Turning left facing a going straight traffic flow
(b) Turning left facing a right turning traffic flow
(c) Turning right facing a going straight traffic flow
(d) Going straight facing a going straight traffic flow
(e) Going straight facing a left turning traffic flow
Fig. 2: Functional scenarios of intersection scene

II-B Design of Training Scenarios

Generally, the training scenario set is supposed to cover the test scenario set as much as possible while retaining some variability. The traffic flow setting in the training scenarios is similar to the one used in the test scenarios. We deploy a rule-based traffic flow with adjustable kinetic parameters, in which each traffic flow holds a fixed route. The kinetic parameters are the target speed and the gap distance, the same as in the deterministic test. However, in the training scenarios, the two parameters vary from vehicle to vehicle within the same traffic flow. The parameters are sampled from the same interval as defined in the deterministic test whenever a new social vehicle is spawned. The behavior model of all social vehicles in the training scenarios uses the same assumption as in the deterministic test, which combines speed tracking and the AEB model.

Therefore, we deploy three agents for the turning left, turning right, and going straight tasks, respectively. In each training procedure, all functional scenarios that share the same task route are trained at the same time. For example, in the left-turning task, the RL agent confronts the traffic flows shown in Fig. 2 (a) and (b) at the same time. The traffic flows on other roads are not activated.

II-B1 OU-process Parameter Generation

In the training phase, the RL agent is supposed to fully explore all available traffic situations. As discussed above, the randomness of scenarios is highly affected by the distribution of the kinetic parameters of the social vehicles, which are the target speed and the gap distance in our proposed test scenarios. Therefore, we deploy a kinetic parameter generation method based on the Ornstein–Uhlenbeck (OU) process [17]. The OU process generates a sequence of kinetic parameters for the whole traffic flow, and each social vehicle of the traffic flow is assigned a set of parameters sequentially.

For kinetic parameter generation, the OU process has two significant advantages. The first is that the stationary distribution of the OU process is Gaussian. The second is that the OU process is mean-reverting, which limits the difference between two consecutive samples. The stochastic differential equation (SDE) of the OU process is


in which , refers to the expectation of the variable, denotes the damping factor of the OU process, refers to the target speed of the social vehicle,

denotes the Wiener process. The ordinary differential equation(ODE) of the OU process is


Since the kinetic parameters of vehicles are bounded by an interval, inspired by [18] we deploy a clipped OU process to avoid over-accumulation of samples on the interval boundary. The clipping process, which operates on the interval used for parameter sampling, is shown in Algorithm 1.

  Initialize the sample with equation (2)
  while the sample lies outside the sampling interval do
     Resample from the OU process using equation (2)
  end while
  Use the sample as the kinetic parameter for the next vehicle
Algorithm 1 Clipped OU process for traffic flow parameter generation
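The resampling loop of Algorithm 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: it uses the standard OU form $dx_t = \theta(\mu - x_t)\,dt + \sigma\,dW_t$ with an Euler-Maruyama step, takes the midpoint of the interval as the mean, and the parameter values and function names are illustrative only and are not taken from the RL-CIS code base.

```python
import math
import random


def ou_step(x, mu, theta, sigma, dt, rng):
    """One Euler-Maruyama step of dx = theta * (mu - x) dt + sigma dW."""
    noise = rng.gauss(0.0, 1.0)
    return x + theta * (mu - x) * dt + sigma * math.sqrt(dt) * noise


def clipped_ou_sequence(n, low, high, theta=0.5, sigma=4.0, dt=1.0, seed=0):
    """Generate n parameters in [low, high], one per spawned social vehicle."""
    rng = random.Random(seed)
    mu = 0.5 * (low + high)   # assumption: mean at the middle of the interval
    x = mu
    values = []
    for _ in range(n):
        candidate = ou_step(x, mu, theta, sigma, dt, rng)
        # Reject and resample instead of clamping, so that samples do not
        # pile up on the interval boundary.
        while not (low <= candidate <= high):
            candidate = ou_step(x, mu, theta, sigma, dt, rng)
        x = candidate
        values.append(x)
    return values


# Example: target speeds for 200 consecutive vehicles in a placeholder range.
target_speeds = clipped_ou_sequence(200, 20.0, 40.0)
```

Rejecting out-of-range candidates, rather than clamping them to the boundary, keeps the sequence mean-reverting while avoiding a spike of samples exactly at the interval limits.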

As for the gap distance between social vehicles, the value is sampled from a truncated Gaussian distribution


in which

refers to the normal distribution,

refers to the value interval of the gap distance,

denotes the variance of the truncated gaussian distribution. In this paper is determined through


where is a tunable parameter. It is used to adjust the concentration of the parameters relative to the mean value. According to our proposed sampling method, the gap distance is a linear mapping of target speed over the sampling interval.
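A possible reading of this sampling scheme is sketched below in Python. The sketch makes several assumptions that are not confirmed by the text: the mean of the truncated Gaussian is taken as the linear mapping of the target speed from the speed interval onto the gap interval, the standard deviation is taken as the interval width divided by a tunable parameter `k`, and truncation is implemented by rejection. All names and numeric values are hypothetical.

```python
import random


def linear_map(value, src, dst):
    """Map value linearly from the interval src onto the interval dst."""
    ratio = (value - src[0]) / (src[1] - src[0])
    return dst[0] + ratio * (dst[1] - dst[0])


def sample_gap(target_speed, speed_range, gap_range, k=6.0, rng=random):
    """Sample a gap distance from a Gaussian truncated to gap_range."""
    mean = linear_map(target_speed, speed_range, gap_range)
    # Assumed form: a larger k concentrates the samples around the mean.
    std = (gap_range[1] - gap_range[0]) / k
    while True:
        gap = rng.gauss(mean, std)
        if gap_range[0] <= gap <= gap_range[1]:
            return gap


# Example with placeholder intervals: 20-40 km/h mapped onto 10-20 m.
gap = sample_gap(30.0, (20.0, 40.0), (10.0, 20.0), rng=random.Random(1))
```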

III Baseline Methods

III-A Baselines for Intersection Scenarios

III-A1 Rule-based Methods

In this part, we select several classic rule-based methods and an RL algorithm as our proposed baseline methods for the intersection scenarios.

Intelligent Driver Model(IDM)

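The intelligent driver model (IDM) [19] is one of the most popular rule-based baselines for the ADV. The IDM is designed around car-following behavior and is defined by the following equations

$$\dot{v} = a\left[1 - \left(\frac{v}{v_0}\right)^{\delta} - \left(\frac{s^{*}(v, \Delta v)}{s}\right)^{2}\right], \qquad s^{*}(v, \Delta v) = s_0 + vT + \frac{v\,\Delta v}{2\sqrt{ab}} \tag{5}$$

where $v$ is the current velocity of the vehicle, $s$ is the net distance to the vehicle in front, $\Delta v$ is the velocity difference to the vehicle in front, $v_0$ refers to the desired velocity that the vehicle would drive at in free traffic, $s_0$ refers to the minimum desired net distance to the car in front, $T$ refers to the minimum possible time to the vehicle in front, $a$ refers to the maximum vehicle acceleration, $b$ refers to a comfortable braking deceleration, and the exponent $\delta$ usually takes a value of 4.

Besides, the acceleration of the vehicle can be separated into a free road term and an interaction term

$$\dot{v}_{\text{free}} = a\left[1 - \left(\frac{v}{v_0}\right)^{\delta}\right], \qquad \dot{v}_{\text{int}} = -a\left(\frac{s^{*}(v, \Delta v)}{s}\right)^{2} \tag{6}$$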
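For reference, the standard IDM acceleration from [19], split into its free road term and its interaction term, can be sketched in Python as follows. The default parameter values are illustrative placeholders and are not the settings used in our CARLA deployment; `dv` denotes the approach rate, i.e. the ego velocity minus the velocity of the vehicle in front.

```python
import math


def idm_acceleration(v, gap, dv, v0=10.0, s0=2.0, T=1.5, a=1.5, b=2.0, delta=4.0):
    """Standard IDM acceleration (m/s^2).

    v:   ego velocity (m/s)
    gap: net distance to the vehicle in front (m), must be positive
    dv:  approach rate, ego velocity minus leader velocity (m/s)
    The remaining arguments are the usual IDM parameters (placeholder values).
    """
    # Free road term: drives the velocity towards the desired velocity v0.
    free_road = a * (1.0 - (v / v0) ** delta)
    # Desired dynamic gap to the vehicle in front.
    s_star = s0 + v * T + v * dv / (2.0 * math.sqrt(a * b))
    # Interaction term: brakes when the actual gap falls below the desired gap.
    interaction = -a * (s_star / gap) ** 2
    return free_road + interaction


# Free road: standing start with no vehicle nearby gives roughly +a.
print(idm_acceleration(v=0.0, gap=1e9, dv=0.0))
# Closing in on a slower leader at short range gives a strong deceleration.
print(idm_acceleration(v=10.0, gap=5.0, dv=5.0))
```

When the gap is very large, the interaction term vanishes and the model reduces to pure speed tracking, which corresponds to the free road behavior.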

Autonomous Emergency Braking(AEB) Model

The AEB method is a widely used technique in level-2 ADVs. In real-world deployment, the AEB method processes sensor data with a deterministic algorithm and performs a braking action if a potential collision is detected. In the CARLA simulator, we directly use ground-truth values as the input of the AEB method.

In our work, the AEB method is determined by two parameters. The first is the longitudinal range of detection, and the second is the expansion factor of the bounding box of each social vehicle. During the driving process, the ego vehicle scans along its longitudinal direction over the length of the detection range. Meanwhile, the bounding box of each social vehicle is expanded from its original physical model according to the expansion factor; that is, the original size of the social vehicle's physical model is scaled by the expansion factor. If the expanded bounding box of any social vehicle penetrates the front area of the ego vehicle, the AEB model performs a maximum braking action until the detection area is clear.
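The detection rule can be illustrated with a simplified Python sketch. It assumes that all positions are already expressed in the ego vehicle's frame (x forward, y to the left), that the bounding boxes are axis-aligned in that frame, and that the expansion multiplies both box dimensions by the expansion factor. The class name, function names, and numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Box:
    x: float       # longitudinal position of the box center in the ego frame (m)
    y: float       # lateral position of the box center in the ego frame (m)
    length: float  # extent along x (m)
    width: float   # extent along y (m)


def expand(box, factor):
    """Scale the box dimensions by the expansion factor, keeping its center."""
    return Box(box.x, box.y, box.length * factor, box.width * factor)


def aeb_should_brake(vehicles, detection_range, ego_width, factor):
    """Return True if any expanded box overlaps the area in front of the ego."""
    for vehicle in vehicles:
        b = expand(vehicle, factor)
        # Longitudinal overlap with the interval [0, detection_range].
        overlaps_x = (b.x - b.length / 2) <= detection_range and (b.x + b.length / 2) >= 0.0
        # Lateral overlap with the corridor swept by the ego vehicle.
        overlaps_y = abs(b.y) <= (ego_width + b.width) / 2
        if overlaps_x and overlaps_y:
            return True
    return False


# A vehicle just beside the ego corridor triggers braking only once expanded.
neighbor = Box(x=5.0, y=2.15, length=4.5, width=2.0)
print(aeb_should_brake([neighbor], detection_range=10.0, ego_width=2.0, factor=1.0))
print(aeb_should_brake([neighbor], detection_range=10.0, ego_width=2.0, factor=1.2))
```

The expansion factor therefore acts as a safety margin: a larger factor makes the agent brake for vehicles that only pass close to its corridor.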

III-A2 Reinforcement Learning

State Representation

For the intersection scenario, inspired by [20], we use the ground-truth kinetic information of the ego vehicle and the social vehicles for state representation. Such a choice is common in autonomous driving system design, since the major challenge of the intersection scenario comes from the interaction between the ego vehicle and social vehicles. For the ego vehicle, the state vector contains the speed of the ego vehicle and a 3-dimensional one-hot vector that indicates the ego vehicle's current position. For each social vehicle, the state vector contains its two-dimensional velocity, its Cartesian coordinates, and its heading angle in the ego vehicle's coordinate system. The total state representation combines the ego vehicle and the 5 nearest social vehicles: all six state vectors are concatenated into a 33-dimensional vector, which is the input state vector of the RL agent.

Action Space

The action space is constructed as a 2-dimensional continuous vector. The vector is transformed for speed tracking, and the result is scaled to a target speed (in m/s) of the ego vehicle for longitudinal control.

Reward Design

Inspired by [21], the reward function is defined through events. More specifically, the reward function consists of two parts: the first is the reward for each timestep, and the second is the final reward at the end of an episode. The complete reward function can be written as follows


where refers to the maximum time limit of one episode, by which common sub-task reward encourages ego vehicle to improve traffic efficiency.

RL Algorithms

For the intersection scenario, we deploy the TD3 algorithm [22] as our RL baseline. For the neural network design, the state vector is split into its ego and social vehicle components, and each component is fed to an encoder network formed by fully-connected (FC) layers. The outputs of the encoders are concatenated and followed by an FC layer. The actor and critic networks of the TD3 algorithm share the same network structure in our experiments.

IV Simulation Experiments

IV-A Evaluation Metrics

IV-A1 Intersection Scenarios

Many metrics can be used to measure the behavior of agents [23]. For an AD system, safety and efficiency are the performance indices of greatest concern. In our framework, success rate and average passing time are the general metrics for performance evaluation. The success rate directly indicates how well the RL agent performs the specified task. In this paper, the success rate is defined as


Secondly, efficiency is measured through the average duration of a single testing episode. Since functional scenarios that target the same turning task share the same route, the average passing time is compared by route; that is to say, scenarios are divided into three groups for the comparison of average passing time. It is important to note that we only count the time of successful tests.
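The two metrics can be computed as in the following Python sketch. It assumes that the success rate is the percentage of successful episodes among all test episodes of a route, and that each episode is logged as a `(route, success, duration_s)` tuple; this record format and the function name are hypothetical.

```python
from collections import defaultdict


def summarize(episodes):
    """Per-route success rate (%) and average passing time (s).

    episodes: iterable of (route, success, duration_s) tuples.
    Only successful episodes contribute to the average passing time.
    """
    counts = defaultdict(lambda: {"total": 0, "success": 0, "time": 0.0})
    for route, success, duration_s in episodes:
        entry = counts[route]
        entry["total"] += 1
        if success:
            entry["success"] += 1
            entry["time"] += duration_s
    summary = {}
    for route, entry in counts.items():
        rate = 100.0 * entry["success"] / entry["total"]
        # The average time is undefined when no episode of the route succeeded.
        avg = entry["time"] / entry["success"] if entry["success"] else None
        summary[route] = (rate, avg)
    return summary


# Toy log, not experimental data.
episodes = [("left", True, 6.0), ("left", False, 2.0),
            ("left", True, 8.0), ("right", False, 3.0)]
result = summarize(episodes)
```

Excluding failed episodes from the average prevents early collisions, which end an episode quickly, from being counted as efficient driving.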

IV-B Results and Analysis

IV-B1 Training Process

The learning curves of the RL baselines in the intersection training scenarios are depicted in Fig. 3. The figure shows that the RL agent for each task improves quickly within the first 2000 episodes, and the learning curve of each route task reaches its maximum level within 5000 episodes. The left-turning and right-turning agents then maintain relatively high and stable performance, while the going-straight agent fluctuates slightly. The main reason is that the agent in the going-straight task must interact with the left-turning traffic flow coming from the opposite direction, which poses a greater challenge than the other tasks.

(a) Reward
(b) Success rate
Fig. 3: Intersection training curve

In the training process of each task, the interacting traffic flows are activated according to the definition of the logical scenarios, as described in the experiment settings. The kinetic parameters generated for each traffic flow are recorded; the sampling and distribution in the training of the left-turning task are shown in Fig. 4. The distribution of the sampled parameters is approximately Gaussian, while the sampled sequence is rather smooth in the time domain.

(a) Going straight traffic flow in left turning task
(b) Turning right traffic flow in left turning task
Fig. 4: OU process-based kinetic parameters sampling and distribution

IV-B2 Deterministic Test

In the deterministic test, we deploy a TD3 agent as the RL baseline. Besides, two rule-based methods, the IDM and the AEB model, are deployed for comparison. In our deployment in CARLA, the IDM agent monitors a certain distance along the driving route and pursues either speed tracking or car-following behavior, as shown in (5) and (6). Compared to the AEB method, the IDM agent adjusts its velocity rather smoothly. The experimental results are shown in TABLE II. We evaluate the TD3 agent and the rule-based agents in all five functional scenarios. Since the rule-based agents perform relatively poorly, their statistics are aggregated by task route. In the turning left and turning right experiments, the RL agent reaches a success rate close to or above 90% and exceeds the rule-based agents in both success rate and average time. According to the results, going straight is the most challenging task, since the left-turning social vehicles are rather fast and hard for the ego vehicle to negotiate with.

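Functional scenario | (a) | (b) | (c) | (d) | (e)
Ego route | Turning left | Turning left | Turning right | Going straight | Going straight
Success rate (%) | 94.8 | 93.8 | 89.24 | 99.0 | 80.0
Average time (s) | 6.83 | 6.65 | 7.04 | 6.94 | 4.79

Ego route | Turning left | Turning right | Going straight
Success rate (%) | 67.7 | 62.8 | 47.4
Average time (s) | 8.40 | 8.23 | 8.59

Ego route | Turning left | Turning right | Going straight
Success rate (%) | 72.74 | 50.0 | 48.96
Average time (s) | 8.66 | 7.21 | 7.18

TABLE II: Intersection Deterministic Test Results.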

IV-B3 Stochastic Test

The experimental results of the stochastic test are shown in TABLE III. In the stochastic test, the kinetic parameters of the traffic flows are uniformly sampled from an interval, so the traffic flows are not extremely dense. Therefore, the RL agent reaches a higher success rate for each task than in the deterministic test.

In this part, the RL method is compared with the rule-based methods as well. As the table shows, the RL agent achieves the highest success rate in every task, and the success rate of the TD3 agent is above 90% throughout. Although the AEB agent has a shorter average time in the turning left and going straight tasks, the rule-based agents are at a definite disadvantage in terms of safety. We believe such results occur because the IDM and AEB methods have a limited input, which makes them incapable of detecting potential conflicts with social vehicles coming from the cross direction. In conclusion, the model-free RL agent dominates our proposed benchmark of the intersection scene.

Methods | Driving task | Success rate (%) | Average time (s)
TD3 (ours) | Left | 95.9 | 9.78
TD3 (ours) | Right | 96.9 | 7.64
TD3 (ours) | Straight | 91.9 | 9.20
IDM | Left | 68.3 | 11.21
IDM | Right | 72.7 | 11.05
IDM | Straight | 33.7 | 22.6
AEB | Left | 71.3 | 9.06
AEB | Right | 88.7 | 8.86
AEB | Straight | 58.3 | 8.83
TABLE III: Intersection Stochastic Test Results

V Conclusion

In this paper, we propose RL-CIS as a framework to train and test RL-based AD agents in intersection scenarios. Firstly, a group of un-signalized intersection functional scenarios is designed. Then the behavioral model of social vehicles is determined with two critical parameters, whose ranges span the whole logical scenario set. The concrete scenarios are obtained by discretizing the logical scenarios and form the deterministic test set of our proposed framework. Besides, we deploy a set of stochastic tests to further evaluate the RL-based AD agent. Meanwhile, a training environment for the RL agent is developed, in which a sampling method based on a stochastic process is deployed for the generation of traffic flow parameters. Both the training and test sets are built in the CARLA simulator. In addition, we offer a set of baselines for the intersection AD benchmark, including the TD3, IDM, and AEB methods. According to the experimental results, the RL agent shows significant superiority over the rule-based methods.