I Introduction
Autonomous driving is regarded as one of the most potential applications of artificial intelligence, which can provide numerous benefits from increasing efficiency to reducing accidents. Technically, decisionmaking is the core and challenging element to achieve highlevel intelligence of the automated vehicles
[pendleton2017perception]. To that end, reinforcement learning (RL) has been widely adopted to learn a reasonable driving policy inspired by its successful application in Go game and robotics[silver2017mastering, rowland2018distributional_ana]. Generally, RL can be viewed as a trialanderror learning by interacting with the environment[kiran2021deep]. While early RL applications were mainly deployed in the highway scenes to learn some simple behaviors such as lane keeping[sallab2016end], lane changing[wang2018reinforcement] or overtaking [ngai2011multiple], it is currently promising to conquer more complicated urban scenarios. Especially, intersection is the typical representative of urban roads due to the highly dynamic and stochastic characteristics caused by the interaction of signal lights and distinct traffic actors including pedestrians, cyclists and vehicles[wu2013safety]. In that situation, assuring driving safety has always been the principal consideration and challenge to the application of RL.The intuitive approach for safety is introducing the constraints and updating the driving policy towards the direction of satisfying these constraints. For that purpose, Isele et.al (2018) firstly considered the safe exploration during the learning process by using the prediction as a constraint and claimed that it can be used to safely learn intersection handling behaviors on an autonomous vehicle [isele2018navigating]. Wen et. al (2020) considered a risk function and bounded the expected risk within predefined hard constraints, showing that this could assure the driving safety at the simple twovehicle interacting at an intersection[wen2020safe]. Besides, some other works also have explored different constraints formulation for driving tasks such as control barrier function[ma2021model], chance constraint[Peng2020ModelBasedAW] and constraints in continuous time[DUAN2022128]. Inspired by these constraint optimizations, Guan et.al (2021) recently proposed the integrated decision and control (IDC) framework for autonomous driving which intuitively adopted RL to solve the constrained optimal control problems (COCP) based on multiple candidate paths generated by static path planner[guan2021integrated]. The dynamic model of the ego vehicle and the empirical model of surrounding participants are built separately to construct the distance constraints. Besides, both simulation and real vehicle experiments have proven that IDC shows more flexibility and intelligence at complex urban roads compared with the stateoftheart methods[jiang2021integrated]. However, the IDC framework suffers two major issues when extending to more signalized intersections. One is the static path planner is hardly to generalized to diversified intersections with different shape and topology. The other is the policy trained on empirical models is too idealized to accommodate the abnormal behaviors from real driving environments.
The first issue says that the static path planner of IDC currently focuses on a certain intersection to design the candidate paths individually, whereas for other crossroads with different configuration, it should determine the reference paths one by one according to the realistic map of this intersection. Concretely, bizer curves [chen2017accurate] is adopted to generate reference path with the guidance of control points, whose choices become the core procedure as they will determine totally the shape of reference paths. This calls for a general approach to the automatic determination of control points which could be adaptive to the intersections of all shapes. Also, this path generation only concerns the green light, and the following training process simulates red lights by adding virtual vehicle in front of the stop line, and treating the yellow light as the red one to force a hard deceleration at any time[ren2021encoding]. This will lead to a rather conservative dealing with yellow light dilemma zones, where the ego vehicle must choose to stop or pass without hesitation according to traffic condition[liu2012empirical]. To deal with signal lights, Zhou et al. (2019) utilized the predefined rules to replan the trajectory online considering the ego vehicle’s position, velocity, and the constraints of surrounding vehicles [zhou2019development]. Chen et al. (2021) employed model predictive control (MPC) method to optimize the acceleration to drive the vehicle to pass the intersection at the current or the next green light phase[chen2021mixed]. Although the embedding of human experience will be helpful for handling traffic lights, this online optimization is difficult to be applied in practice because of the high computational burden, especially in complicated scenarios.
As for the second issue, IDC mainly focus on the safety guards against common situations by constraining the prior known environment models, while this kind of empirical models hardly reflects the randomness of driving tasks. This gap will bring disabled safety guarantees against the abnormal behaviors such as overspeeding or direction change contrary to the traffic rules. To improve the resistance to environmental disturbances, Pinto et. al (2017) extended adversarial learning to RL fields wherein the protagonist policy tried to learn the control law by maximizing the accumulated reward while an adversary agent aimed to make the worst possible destruction by minimizing the same objective [pinto2017robust]. These two policies were trained alternatively, with one being fixed whilst the other adapted. After that, Pan et.al
(2019) explicitly used an ensemble of policy networks to model risk as the variance of value functions, and then train a riskaverse protagonist and a riskseeking adversary
[pan_rararl]. They built the round way racing track in TORCS[wymann2000torcs] on a selfdriving vehicle controller and showed that it could handle substantially fewer catastrophic events caused by adversarial attacks. Similarly, we demonstrated that adversarial learning could be helpful for the safe decisionmaking at intersections, wherein we used the protagonist policy to control the ego vehicle, the adversarial policy to control the two surrounding vehicles respectively[renitsc2020]. However, these works largely simplify the driving conditions of intersections, only considering the sparse and homogeneous traffic flows. Besides, their adversarial training process mainly focuses on the performance at worstcase situation, and lack of the guarantee of driving safety with the absence of distance constraints.This paper aims to learn robust and intelligent driving skills at signalized intersections, which could be scalable to diversified urban crossroads and can handle the abnormal behaviors from surroundings. Based on the IDC framework, the contributions and novelty of this paper are summarized as follows:

A static path planner for urban intersections is developed to produce candidate paths, which is proved to be generalized to various crossroads. Specifically, our path generation only considers static road information and consists of the route planning and the velocity planning. The former is capable of providing reference position for the automated vehicle and we design a automatic static route generation method using bizer curves where the control points can be adaptive to the size and topology of intersections. The latter aims to deliver reference velocity to adjust the driving behaviors encountering different phases of the signal light, wherein we design two distinct modes, i.e., the stop velocity to encourage yielding or waiting behaviors at red lights and the pass velocity to inspire a quick ride at green lights.

Based on the candidate paths, we construct a COCP wherein the bounded uncertainty of dynamic models is considered to capture the randomness of driving environment and the constraints are introduced to assure driving safety. Novelly, we propose adversarial policy gradient (APG) to solve this COCP under the worst constraint, wherein the driving policy learns to realize reasonable operations and the adversarial policy aims to provide disturbances. Formally, the minimax formulation is established based on the driving performance of tracking cost and safety constraints, with which the driving policy intends to minimize and the adversarial policy attempts to violate the safety constraints by seeking for the most severe uncertainty. APG optimizes these two policies alternately, and the competition will provably make the driving policy obtain a greater ability to cope with external variations.

We establish a comprehensive driving environment wherein the typical intersections are built and the perception module is introduced to simulate the sensor characteristics. Besides, we incorporate the human knowledge into the online application of the trained networks to solve the yellow light dilemma. Simulations show that the trained policy can handle the switch of signal lights flexibly which is comparable to human drivers, meanwhile realizing the smooth and efficient passing at different intersections. Moreover, the generation tests are conducted to verify that APG improves the safety and robustness of the driving policy to resist abnormal behaviors of traffic participants.
The paper is organized as follows. In Section II, we introduce the key notations and related works. Section III describes the mathematical details of the static path planner and APG. Section IV presents driving environment construction and the implementation for RL algorithms. Section V verifies the effectiveness of the trained policy and Section VI concludes this paper.
Ii Preliminaries
This section will first introduce the basic principle of RL and the details of IDC, then the related works on adversarial training will be summarized.
Iia Principles of RL
Formally, we describe the driving process as a standard RL setting wherein the automated vehicle interacts with the driving environment in a realtime manner. At current state , the vehicle will take action according to the driving policy and then the environment will return next state according to the environment model , i.e., , and a scalar reward or utility function . Overall, RL seeks for optimizing the driving policy which maps to the action distribution, i.e., . To that end, value function is introduced to evaluate the performance of policies, which is concretely defined as the accumulated utility expectation under obtained from the state , i.e., . Typically, most RL algorithms can attribute to the classic actorcritic architecture[sutton2018reinforcement]. The critic is designed to update to evaluate the current policy by minimizing its distance from the target generated from interaction with environment. Actor intends to find another better policy by minimizing the calculated value function, i.e., , where is the state distribution determined by policy . Meanwhile, the policy and value function are usually parameterized as and
by neural networks (NNs) to handle continuous and high dimension tasks, in which
and are parameters to be optimized. In that case, the gradientbased optimizations are usually employed to iteratively update these NNs to approximate the optimal policy and value function .IiB Adversarial Learning
Adversarial learning is usually modelled as the minimax formulation, wherein there exists another adversary policy with as its parameters, to provide a competitor for the protagonist policy . Given the current state , will take action , will take action , and then the next state will be reached. Whereas they will obtain distinct utility function: the protagonist gets a utility while the adversary gets a utility at each time step. Under this scheme, the value function will be determined by these two policies together, and a zerosum twoplayer game can be seen as the adversary maximizing the value function while the protagonist attempts to minimizing it:
This optimizes both agents using an alternating procedure. Firstly, we learn the protagonist’s policy while holding the adversary’s policy fixed. Next, the protagonist’s policy is held constant and the adversary’s policy is learned. These two steps will be repeated until convergence. Intuitively, the adversary is introduced to apply disturbance forces to the system and is reinforced–that is, it learns an optimal policy to thwart the original agent’s goal. This learning paradigm jointly trains a pair of agents, a protagonist and an adversary, where the protagonist learns to fulfil the original task goals while being robust to the disruptions generated by its adversary[pinto2017robust].
IiC Integrated decision and control (IDC)
IDC serves as an effective framework of deploying RL to conquer the decision and control of autonomous vehicles, which consists of two essential modules: static path planner and dynamic optimal tracker[guan2021integrated]. The former aims to generate the candidate path set, denoted as , which builds all feasible paths considering static traffic information. The number of candidate paths can be set priorly with consideration of the potential target lanes and the traffic density of intersections. Dynamic optimal tracker considers tracking all candidate paths as well as assuring the safety against surrounding environment by meeting the constraints. Given an arbitrary path , its COCP involves the tracking performance, denoted as , and the constraints for different participants, denoted as to assure safety:
(1)  
where is the prediction horizon and is the driving state at the prediction step , starting from the current time step . We should emphasize the indicates that the initial state is sampled from the driving environment and its following transition can be derived through the simulation model based on the corresponding driving action . The stochastic driving policy aims to build the mapping from the driving states to the distribution of the actions, i.e., , the latter will be delivered to control the automated vehicles. denotes the distribution of
, which usually is designed as a joint distribution of reference path
, the state of the ego vehicle and surrounding participants, the road information, etc. denotes the utility of tracking and denotes all the constraints on , such as the distance to surrounding participants, road edge.Besides, the value function is trained to evaluate the tracking cost by predicting the track performance of the given path . After training, the optimal counterpart and will be obtained and then applied online to implement decision and control functions. firstly tells the optimal path from which has smaller tracking cost and less potential to collide with other surrounding vehicles. And then takes this path to construct the corresponding driving state and consequently outputs the control commands for the automated vehicle. Comparatively, IDC owns high online computing efficiency because of the fast forward propagation of NNs, which could calculate the control commands within 10ms at one complex intersection [guan2021integrated, jiang2021integrated]. With the power of RL, it can also cast off the tedious human designs and improve the safety and intelligence level of decision and control.
Iii Methodology
This section introduces the methodology to improve the generalization ability of RLenabled algorithms under the scheme of IDC from two aspects. One is the static path planner which is scalable to different intersections; the other is the APG algorithm, which could improve the resistance of the driving policy to the abnormal behaviors of traffic participants.
Iiia General Static Path Planner
Static path planner aims to generate trackable paths with consideration of static road information. Considering the diversity and complexity of urban roads, this module should be scalable to intersections of different shapes and sizes, meanwhile incorporated with various traffic rules, such as the speed limits or pass restrictions. To that end, our general static path planner consists of two essential parts: route planning and velocity planning. For the route planning, the basic idea is to generate smooth and continuous curves by fitting and splicing, as shown in Fig. 1. Each candidate path, numbered as , will be determined by the six control points , which divide this candidate path into three important segments: the entrance route controlled by , the curve route controlled by and the exit route controlled by . As the routes outside intersections possess definite lanes, we can directly choose the lane center as the entrance and exit route, indicating that and can be constructed directly along the direction of lane center line. For the curve route, the thirdorder bezier curve [han2008novel] is adopted to generate a reasonable route inside intersections, which also smoothly connects the entrance route and the exit route. Therefore, the key to generating such a curve lies in the choice of four control points . To guarantee the continuity of the whole route, should be the end point of the entrance lane and is the start point of the exit lane. Thus, the curve route is mainly determined by these two middle control points .
To assure the scalability to various intersections, we propose the choice standard of the control points , which can be adaptive according to the topology of the crossroads. Without loss of generality, we assume there exists an included angle between the entrance route and the exit route with the horizontal direction, denoted as and respectively, also shown in Fig. 1. The control points can be decided with such a geometric method: firstly, given and , we connect them directly, thereafter two bisection points, and are built, which starts from and respectively. Note that
is a hyperparameter and can be determined in experience. Then, we extend the entrance route and exit route along the inside of intersection, towards which we draw perpendicular line from
and respectively. Finally, the vertical feet are determined as and respectively.Based on the geometric relation, we can derive the specific calculation formula for respectively:
(2)  
After obtaining four control points , the curve route can be generated according to the bizer formula:
where is the point coordination on the bizer curve and reflects its relative position.
After concatenating the entrance route, curve route and exit route together, one feasible route is completely generated and similarly we can construct more potential routes according to the road structure and passing connection. To show the simplicity and high operability of this route planning, here we choose 6 typical intersections to generate the candidate routes, as shown in Fig. 2
. These intersections are picked carefully from the open source street map to represent common scenarios, covering the single lane and multilane intersections, regular and irregular intersections, as well as the intersections equipped with green belts. We suppose the ego vehicle departs from the downside and aims to finish the leftturn, rightturn and straightgoing tasks. The number of candidate route is determined by that of target lane and
is set to 0.6 for all intersections. For instance, in Fig. 2(f), 3 candidate routes are constructed for leftturn tasks because there exit 3 target lanes to pass through the intersection, while the straightgoing task only has 2 candidate routes. Overall, Fig. 2 indicates that the general static path planner can generate smooth and seemingly reasonable paths for diversified intersections with different topology and we will further verify the trackability of them in VB.Another essential part of static path planner is the velocity planning, as shown in Fig. 3. Considering the existence of traffic lights and rules at intersections, we design two expected velocity profiles with respect to the spatial position, i.e., the pass velocity mode and the stop velocity mode. The former corresponds to passing the intersections quickly while the latter means to wait if encountering a red light or congestion inside the intersection. For the pass velocity, its value outside the intersection is set to be , where shows the speed limit of the entrance lane, usually determined by traffic rules or the human experience. The latter is set to be the lower one between and as Chinese traffic regulations usually limit the speed inside intersections to be less than . For the stop velocity, the ego vehicle is required to decelerate uniformly to the stop line within from the speed of to , then the expected velocity will keep inside the intersection and turn to again outside the intersection. Note that here we design these two velocity modes mainly considering the responses of the automated vehicle to green and red lights and both of them will be utilized to train the driving policy. After that, we further incorporate human knowledge in IVE to apply the driving policy to deal with some unusual situations such as yellow lights. In that case, the appropriate velocity mode will be chosen in a realtime manner based on the the vehicle state and the remaining time of current light phase.
Combining the route planning and velocity planning, the general static path planner is capable of generating candidate paths to provide the reference position and speed for automated vehicles. Note that our method only depends on the highprecision map and has no regard of the dynamic traffic participants in path production, and the subsequent training will further consider to track these paths while satisfying the safe constraints against the surrounding traffic actors.
IiiB Adversarial Policy Gradient (APG)
As for the construction of COCPs, it always has assumed a precise environment model, i.e., , where the current state and action will definitely lead to a deterministic next state . However, this is almost nontrivial for driving tasks due to the behavior randomness of surrounding participants and the perception noises. Therefore, here we construct a stochastic dynamic model to consider the uncertainty of driving environment, i.e., , and is the bounded random noise satisfying , where and represents the lower and upper bound of the randomness respectively. Consequently,
shall be a random variable and its corresponding constraint
must be held over the randomness of environment model to assure the driving safety. Nevertheless, this might be unavailing for the optimization process as can be captured as any value from the continuous interval . Intuitively, we aim to seek for the ”worst case” of this safety constraint to guarantee it can work in the whole interval of the random noise, i.e., , which means that the most aggressive constraint w.r.t. will be satisfied and thus any other value of the random noise would be covered automatically. To sum up, the problem formulation of our COCP can be formulized as:(3)  
This constrained optimal problem needs to be further transformed so that it can be solved by RL algorithms. Here, we adopt the generalized exterior point method [guan_itsc] to convert it as the unconstrained one, wherein is the penalty function of the state constraint , i.e.,
Obviously, characterizes the degree to which the constraints are satisfied, in which if holds, the penalty will exactly be 0 otherwise it will become a square growth to the constraint violation. Based on that, we can reformulate (3) and construct the total policy cost as follows:
(4)  
where is the penalty factor of the penalty function, which determines the importance level of the tracking cost and safe cost . Differently, solving this problem in (4) needs to solve the inner maximization operator firstly to find a sequence of the worst noise for all the predictive states, i.e., , which is rather costly for the nonlinear and highdimension optimization. Inspired by the adversarial training, we introduce the adversarial policy to map the state over the distribution of random noise , i.e., . Note that also needs to be trained progressively to reach its optimal counterpart, which can output the worst noise after receiving an arbitrary state. Thus, we can construct the minimax formulation problem:
(5)  
Up to now, we can employ the policy gradient method to optimize the driving policy and adversarial policy simultaneously, wherein the former optimization aims to decrease both the tracking and safety cost, and the latter intends to seek for the worst noise to disturb the learning process. Concretely, conducts the minimization operation by gradient descent method, whose gradient can be derived as
(6)  
Similarly, the gradient of can be extracted as
(7)  
As for the training of value network , its objective, denoted as , aims to minimize the mean square error between the prediction of and the real value calculated by the environment model given a random path and the initial state :
(8)  
And its gradient w.r.t. the network parameter can be calculated as
(9) 
Accordingly, APG is proposed to obtain a generalized driving policy which is capable of handling the uncertainty from environment model, and its implementation is based on (6), (7) and (9) to update the adversarial policy, the driving policy and the value function iteratively. Once their optimal counterparts are acquired, the automated vehicle can firstly select out the optimal path by comparing their path values, and based on that the driving state will be constructed and fed to the optimal policy to generate the optimal control action , i.e.,
(10)  
The details of APG are shown as Algorithm 1. Note that we introduce the update interval to adjust the update frequency of adversarial policy and driving policy, which is an important technique to stabilize the training process as suggested in [JinNJ20, HinzFWW21].
Iv Training Environment Construction
This section aims to construct the training environment of signalized intersections, and demonstrates the implement details of APG.
Iva Traffic Configuration
The training environment is constructed based on six typical signalized intersections demonstrated in Fig. 2. The candidate path set is generated with the proposed method in IIIA and the mixed traffic flow, i.e., the vehicles, cyclists and pedestrians, will be deployed on these intersections with the support of SUMO software[SUMO2018]. Specifically, we set the density of pedestrians, cyclists and vehicles to 100, 100, 400 per hour for each entrance lane to simulate a dense traffic flow. The pedestrians are controlled by the stripe model of SUMO, while the surrounding vehicles and cyclists are controlled by the the carfollowing and lanechanging models. All these surrounding participants are randomly initialized at the beginning of each experiment and ride along the predefined destination. Moreover, the signal light systems are also designed to regulate the passing of largescale traffic flow, where the light controlling the rightturn keeps green, and that dominating the leftturn and straightgoing keeps synchronous, which will lead to more complex driving operations for the automated vehicle, i.e., the unprotected leftturn. Empirically, the phase time of red, yellow and green are set to 40s, 3s, and 60s respectively. Note that we only care about the red and green light during offline training, that is, the ego vehicle will attempt to track the pass velocity mode in Fig. 3 if the light is green and the stop mode if it is red. The yellow lights will be further considered in the online application in IVE. Note that the is set to for the velocity planning. Based on the above settings, the ego vehicle is initialized outside of the intersection and aims to complete three different tasks, i.e., turning left, going straight and turning right, to pass this intersection with guaranteeing driving safety, efficiency and comfort simultaneously.
IvB Perception system
To make a more realistic training system, we equip the perception system for the ego vehicle including one camera, one lidar and three radars, whose specifications are referred to the real sensor products in market such as Mobileye camera, DELPHI ESR (middle range), and HDL32E [cao2020novel]. As shown in Table I, the effective perception ranges of the camera, radar and lidar are set as 80m, 60m and 70m respectively, and the horizontal field of view of them are set as , , respectively. During riding, only the surrounding participants entering into the perception range can be captured to construct the driving states, forming the interested participant set .
Type  Reference Product  Detect range  Detect angle 

Lidar  VelodyneHDL32E  70m  
Camera  Mobileye camera  80m  
Radar  DELPHI ESR  60m 
Besides, we model the perception noise by adding its statistical counterpart to the true value given by SUMO. These noises obey Gaussian distribution and their parameters, i.e., the mean and variance, come from statistics of real sensor data. Take the lidar as an example, we firstly collect the pairs of true value and observation based on the opensource dataset
[yin2021center], the difference between which can be seen as the sensor error of lidars. Then we draw the statistical histogram of the sensor error of different traffic participants and estimate the distribution parameters using maximum likelihood estimation. Table
II shows the distribution parameters of different traffic participants of lidars. For each surrounding participant, numbered , i.e., , the observation consists of six variables: relative lateral position , relative longitudinal position , relative speed , heading angle , length and width . We can see that almost all variables possess the mean close to zero and the main difference lies in their variance. For example, the variances of for vehicles, cyclists and pedestrians demonstrate the decrease in order as the speed of vehicles varies over a more wide range. Besides, of pedestrians is much larger because their small sizes will cause difficulty to the shape detection of lidars.Type  Vehicle  Cyclist  Pedestrian 

IvC State, action and utility
As a typical RL algorithm, the training of APG mainly involves the design of state, action and utility function. State is the extraction of original observation given by sensors, which consists of the information of ego vehicle , tracking error and surrounding participants , i.e., . Concretely, is represented with a dynamic model which contains longitudinal position and lateral position , longitudinal speed and lateral speed , heading angle , yaw rate , front wheel angle and acceleration , i.e., . is constructed based on and the given reference path , including longitudinal distance error , lateral distance error , speed error and heading angle error , i.e.,
(11) 
As for , we will sort different traffic actors by their distance to ego vehicle based on the observation from the perception system. Then the closest 8 vehicles, 4 cyclists and 4 pedestrians of will be chosen to construct . For each item, the description information, as shown in Table II, can be summarized as .
To realize smooth lateral and longitudinal control, we utilize the derivatives of and as actions, i.e., . Considering the actuator saturation in reality, the action shall be limited to a certain range wherein we assume rad/s, m/. Accordingly, and are also limited into a reasonable range to prevent some unreasonable driving operations, i.e., rad and m/. The utility is mainly related to the precision, stability and energysaving performance of path tracking and thus we construct a classic quadratic formulation
For safety constraint design, we adopt the same method with [ren2021encoding], where they utilize two circles to represent the cyclists and vehicles, one circle to represent pedestrian, and the constraints between the ego vehicle and surrounding participants can be constructed by comparing the distance between their circle centers and the given safety threshold.
IvD Environment Model and Uncertainty
Environment model is established to predict the transition of driving environment within the predictive horizon in (3), which typically consists of the model of ego vehicle and that of the surrounding participants . The former controls the motion of the automated vehicle, and thus the precise dynamic model has been developed and verified completely [ge2021numerically], wherein the state space equation can be written as
Specially, the key parameters of the ego are welldesigned in accordance with a physical vehicle, as shown in TABLE III.
Parameter  Meaning  Value 

Front wheel cornering stiffness  155495 [N/rad]  
Rear wheel cornering stiffness  155495 [N/rad]  
Distance from CG to front axle  1.19 [m]  
Distance from CG to rear axle  1.46 [m]  
Mass  1520 [kg]  
Polar moment of inertia at CG 
2642 [kg]  
Discrete time  0.1 [s] 
takes charge of predicting the motion of surrounding traffic participants, which involves the calculation of safety requirements. Usually, the kinematics models are adopted as its simplicity and good performance on structured roads[hou2019interactive]. However, this may be unable to describe the randomness of intersections, so we add the additional uncertainty terms to to the corresponding states of this basic kinematics model and construct the following stochastic prediction model:
(12) 
in which indicates the turning radius determined by the route and position of the corresponding participant. And will be generated by the adversarial policy , i.e., , to learn the aggressive noise to violate the safety constraints.
IvE Dealing with traffic lights
After offline training, we have obtained for path selecting and for path tracking. As for the subsequent online application, we can further incorporate more human experience with the trained functions to handle some special traffic conditions at signalized intersections such as the yellow light or traffic congestion. Therefore, a series of rules are established in terms of the driving states of ego vehicle and the phase of signal lights to select the appropriate velocity mode, i.e., the pass mode or the stop mode in Fig. 4.
We firstly judge whether there exists traffic congestion ahead, i.e., the front vehicle keeps stopping over 3s. When the result is yes, the stop mode in Fig. 3 will be chosen to avoid further deterioration of traffic efficiency, otherwise we will judge whether the ego vehicle has passed the stop line based on the vehicle’s position and the road map. If the answer is yes, the pass mode will be selected as it has entered into the intersections. Otherwise, the final judgement will be shown up to further distinguish situations, wherein the result will be jointly determined by two conditions, denoted as respectively:

denotes the current light phase, where indicate the red, green and yellow lights respectively.

denotes whether the ego can stop in front of the stop line before the red light comes when the current light is yellow. Note that are the abbreviations of TRUE and FALSE respectively.
Therefore, the ego vehicle will choose the stop mode undoubtedly when encountering a red light, or there is enough time and distance to decelerate at the yellow light. As for other situations, for example, when the green light or yellow light appears but it is hard to stop by braking, the ego will directly pass the intersection with the guidance of pass velocity mode.
Formally, can be captured directly by the vehicle sensors or the communication devices bound to the signal lights. can be derived based on the current speed of the vehicle and the constant acceleration model. At each time step during the yellow light phase, the ego vehicle is assumed to brake with the 80% of the maximum deceleration, i.e., . Based on that, the longest deceleration distance and time can be deduced by
At this moment, the distance of the ego vehicle to the stop line can be read from high precision map. Referred to Chinese traffic rules, the duration of yellow light is always 3s and thus we can obtain the remaining time, denoted as , by counting from the initial stamp of yellow light. Based on these conditions, holds if and are satisfied simultaneously, otherwise is concluded to indicate that the ego can not stop in time at this constant and then the passing velocity will be given to encourage rushing before the red lights.
Note that this choice procedure is conducted at each step of the driving process, therefore the output mode will provably be changing with the state of vehicle and light duration. Besides, the stop or pass mode only determines the speed error in the track error of (IVC). After that, each candidate route will be further adopted to construct other elements respectively and will select the optimal one to track.
V Simulation results
This section mainly tells the training parameters, visualizes the training curves and tests the generalization of the driving policy.
Va Training settings
The NNs of driving policy and adversarial policy employ similar architecture, wherein the multilayer perceptron (MLP) is adopted as the approximate function, containing 5 hidden layers, consisting of 256 units per layer. Considering the stochastic property of both policies, the output of MLPs are the mean and covariance of the Gaussian distribution where the covariance matrix is diagonal. In this case, each policy maps the input states to the mean and logarithm of standard deviation of the Gaussian distribution. Moreover, all hidden layers take GeLU as the activation function while the output layer uses the hyperbolic tangent to bound the output into the predefined range as shown in Table
IV. For the value network, it adopts almost the same architecture with the policy, except that the activation function of the output is the ReLU to output a positive scalar to indicate the path value. The Adam method
[Diederik2015Adam] with a cosine annealing learning rate is adopted to update all networks. The predictive horizon is set as 25 and the penalty factor is 15. We also include the parallel training for RL as in [guan2021mixed] to accelerate the training process, which contains 10 workers to sample from driving environment, 10 buffers to store samples and 18 learners to update the networks. Table IV provides more detailed hyperparameters of our training.Then, we compare the training performance of APG with the baseline called deterministic policy gradient (DPG), wherein the only difference lies in the consideration of the randomness in the surroundings model . For DPG, it assumes the empirical model is accurate enough and thus the uncertainty is ignored to train the single driving policy. Whereas, APG introduces the adversarial policy to simulate the despiteful noise and the driving policy must learn to handle such tough cases to make reasonable decisions. During the training process, we record the policy performance and value loss every 1000 iterations and visualize the learning curves in Fig. 5, including the value loss , the track loss , the safety loss and their summation, i.e., the total policy loss . Note that all these performances only involve the training based on the environment model established in IVD, not the driving performance in the real driving environment. To do that, we also test the temporary policies by applying them to control the automated vehicle to pass the intersection for 10 runs in the real environment every 1000 iterations. At each run, the vehicle will be initialized outside of the intersections and move forward 120 steps during which the utility function will be accumulated. We use the total average return to reflect its real performance, i.e.,
And Fig. 6 visualizes the performance during the training process.
We can see from Fig. 5 that all losses of these two training paradigms show a remarkable decline with the increasing of iteration, indicating the driving policy has been improved progressively with the guidance of their corresponding simulation models. In addition, the tracking loss and safety loss almost keep the synchronous descent until they all drop to the same level, meaning that our driving policy can balance the optimization of tracking performance and safety requirements wisely. However, these exists a little difference between APG and DPG. The latter seems to obtain better policy performance as shown in Fig. (a)a because it accesses a lower track cost in Fig. (c)c. That is, APG has to sacrifice some tracking performance to guarantee a fair safety cost, which is explicable that the adversary policy provides a strong disruption to the learning of driving policy. In this case, the automated vehicle has learned to avoid the potential collision by decelerating or diversion encountering the despiteful surrounding vehicles, which will lead to the decreasing of tracking performance.
Training performance. The solid lines correspond to the mean and the shaded regions correspond to 95% confidence interval over 5 runs.
As for the test performance of these two driving policies shown in Fig. 6, we can see both of them indeed behave better and better at the real intersection, but APG has improved TAR with a large margin. Besides, DPG suffers a little higher fluctuations than APG shown from the confidence interval because it will be likely to fail encountering some unusual conditions. This result is very inspiring, which says although we can realize the training convergence based on the constructed simulation model, but it might hardly work in real environment if there exists a large gap between simulation model and the real driving environment. Thus, for the modelbased RL or classical control methods like Model Predictive Control (MPC), the model uncertainty should be considered especially at complex urban roads. Obviously, the kinematic model cannot describe the randomness of the intersections in our simulation, as there exists complex interactions between different traffic participants.
VB Visualization
To verify the rationality of the paths generated by the general static path planner, we firstly test the path tracking function with the trained policy where we remove all traffic flows at the intersections plotted in Fig. 2. We conduct 200 runs totally based on the six distinct intersections and record the tracking error and vehicle state of each step. We randomly sample the route and velocity mode for each run and always initialize the automated vehicle outside the intersection, then the trained policy will drive it to track the given path. The distribution of the tracking error and vehicle state as shown in Fig 7. Note that the distance error is defined by and the longitudinal error , i.e., . We can see that the tracking error is close to zero in most cases with the maximum as 0.2m. And the maximum speed error is less than 3m/s, and also mainly concentrates around 0. Fig. (b)b demonstrate that steer angle and acceleration are close to zero in most cases, of which the steering angle keeps less than and the absolute value of acceleration keeps less than . In addition, there is no situation where the two variables are large simultaneously, which means the vehicle stability can be guaranteed. To sum up, the general static path planner can generate reasonable candidate paths for diversified intersections, which are easily trackable for the automated vehicle with an acceptable energy consumption.
Next, we visualize a typical driving process of the unprotected leftturn task and the curves of some important vehicle states in Fig. 8 and Fig. 9, which is considered as one of the most difficult tasks at signalized intersections. The automated vehicle starts from the leftturn lane with the green light shown in Fig. (a)a, wherein the pass velocity in Fig. (a)a and the bottommost route will be chosen to drive towards the intersection as fast as possible. Then the signal light goes into the yellow phase and immediately the stop velocity mode works to decelerate the ego as there is a long distance from the stop line and also a front vehicle attempts to stop as shown in Fig. (b)b. After that, the ego vehicle continues deceleration and stops to wait the red light in Fig. (c)c and Fig. (b)b. Next, although the green light appears at 60s in Fig. (a)a, the stop velocity still makes the difference because the front vehicle yields to other vehicles stuck inside the intersection. Later, the ego vehicle will choose the pass mode and accelerate to a high speed as shown in Fig. (d)d and it shall detour the sluggish front vehicle by choosing the topmost path as shown in Fig. (d)d, which needs to steer right for a while as shown in Fig. (c)c. After that it will quickly switch to the middle path in Fig. (e)e to rush before the opposite straightgoing vehicle. Soon, our automated vehicle drives close to the sidewalk in Fig. (g)g through which some pedestrians are walking. Thus, it will decelerate priorly and yield to pedestrians as shown in Fig. (f)f and Fig. (b)b. Finally, the automated vehicle starts to accelerate again at 86s and pass this intersection successfully as demonstrated in Fig. (h)h and Fig. (d)d, during which it prefers the middle path to track because the sparse participants in this lane will reduce the potential collisions.
Besides, we also visualize the straightgoing and rightturn tasks at two other intersections. In Fig. 10, the ego vehicle starts initially close to the stop line at green light in Fig. (a)a and thus it will disregard the yellow light by choosing the pass velocity in Fig. (b)b. After entering into the intersection, it will steer left to avoid the opposite turningleft vehicle by selecting the leftmost route in Fig. (c)c and finally finish the safe navigation at this intersection shown as Fig. (d)d. In Fig. 11, the ego vehicle is initialized in the rightturning lane in Fig. (a)a with the constant green lights. Then it will acceleration to pass before the upcoming pedestrians in Fig. (b)b and attempt to avoid the front vehicle by choosing the middle path in Fig. (c)c. Eventually, it will insert the straightgoing traffic flow to pass successfully in Fig. (d)d.
VC Generalization ability
Compared with the performance during training process, we also concern more about that on situations distinct from the training environment, for example, these may exit some abnormal behaviors of the surrounding participants which rarely encounters during training. Here, we design two typical conditions wherein the part of surrounding vehicles are endowed with some abnormal driving behaviors to test the generalization ability. Our tests will focus on the unprotected leftturn task with green light set as green. In the first case shown in Fig. (a)a, 15% vehicles in the opposite straight lane will overspeed to rush through the intersections as quickly as possible. That is, the speed limits of the surrounding vehicles are set as 25km/h for the roads inside of the training intersection set and 30km/h for the outsides. However, during the testing, part of surrounding vehicles will exceed the speed limits by 10%, 20% and 50% respectively. As for the second case shown in Fig. (b)b, a proportion of vehicles of the traffic flow will violate the road connection, and attempt to round and enter into the inside lane wherein the middle lane is only allowed to ride during training. The proportion will be set to 10%, 20% and 50% respectively. Two crucial indicators including passing rate and travel time will be introduced to evaluate the driving performance of the trained policy at this intersection. The passing rate is defined as the ratio of the successful pass time to the total 200 runs, wherein one successful pass means the automated vehicle can ride out of the intersection without collision with other participant, roads and the traffic lights violation. And the travel time is evaluated by the average time used to pass the intersection, starting from entering the intersection at the stop line.
The test results of the speeding case are shown in Table V. We can see that the driving policy of APG shows a stronger resistance to the overspeeding behavior of surrounding vehicles while that of DPG suffers a sharp drop on the passing rate. APG can handle totally the speeding by 10% and the passing rate maintains steady when the speeding degree changes from 20% to 50%. From the travel time, APG has learned to yield to the aggressive vehicles by decelerating and waiting them to pass firstly, which needs to sacrifice the passing efficiency. This driving strategy provably seems reasonable because it can assure safety of the automated vehicle as much as possible.
Methods  Passing rate  Travel time (s)  

Speeding by 10%  APG  100%  13.08±2.35 
DPG  90%  12.30±3.96  
Speeding by 20%  APG  90%  13.38±4.98 
DPG  82%  12.08±1.71  
Speeding by 50%  APG  90%  16.18±4.85 
APG  75%  10.33±3.28 
In addition, the test results of rounding cases are shown in Table VI. We can conclude in this case the aggressive vehicles have greater influence on the ego because both policies demonstrate reduction with the increasing proportion of rounding vehicles. With the existence of adversarial training, the driving policy of APG has experienced the abnormal behavior more often during training, as the noise added on position, i.e., , might push the surrounding vehicle to round within the predictive horizon. Therefore, it can still acquire a fair performance compared with the policy trained by DPG, which only possesses the 34% passing rate when 50% of vehicles attempt to behave without rules.
Methods  Passing rate  Travel time (s)  

10% vehicles’ rounding  APG  96%  12.83±3.07 
DPG  80%  12.71±2.88  
20% vehicles’ rounding  APG  94%  13.38±3.91 
DPG  72%  12.30±3.96  
50% vehicles’ rounding  APG  84%  13.60±3.58 
DPG  34%  10.50±4.94 
To sum up, we consider the uncertainty of the simulation model and introduce the adversarial training to enhance the driving performance, indicating that this will not only bring promotion at the training environment, but also access more resistances to disturbs from environment.
Vi Conclusion
This paper focuses on the decisionmaking and control for signalized intersections, which are crucial and challenging for the popularity of autonomous driving. To that end, we firstly design a general static path planner for the intersection scenarios, which consists of the route planning and velocity planning. It can generate smooth and trackable paths for the diversified intersections and feature high efficiency. Secondly, we construct a constrained OCP for integrated decision and control wherein the bounded uncertainty of dynamic models is considered to capture the randomness of driving environment. APG is proposed to solve this OCP in which the adversarial policy is introduced to provide disturbances by seeking for the most severe uncertainty and the driving policy learns to handle this situation by competition. Finally, a comprehensive system is established to conduct training and generalization test wherein the perception module is introduced and the human experience is incorporated to to solve the yellow light dilemma. Results indicate that the trained policy can handle the yellow signal lights flexibly and realize smooth and efficient passing with a humanoid paradigm. Besides, the proposed APG enables the largemargin improvement of the resistance to the abnormal behavior of traffic participants and can ensure a high safety level for the autonomous driving. About the future work, we will build largescale intersections to make more powerful driving policy, and conduct real vehicle experiment to further verify its effectiveness.