## I Introduction

The increasing demand for mobility poses great challenges to road transport. Connected and automated vehicles are attracting extensive attention due to their potential to benefit traffic safety, efficiency, and economy [li2015eco]. The most widely studied, though simplified, form of connected-vehicle cooperation is platoon control on highways. Platoon control aims to ensure that a group of connected vehicles in the same lane moves at a harmonized longitudinal speed while maintaining the desired inter-vehicle spacing [wu2019distributed, gao2018distributed, li2017robustness]. Intersections, a typical urban scenario, are more complex and challenging for multi-vehicle coordination than highways. Vehicles enter from different entrances, follow their specific trajectories across the intersection zone, and leave at different exits. The complex conflict relationships between vehicles complicate the decisions needed to avoid collisions, so careful design is required to guarantee traffic safety while improving traffic efficiency. To coordinate multiple vehicles at intersections, several studies focus on traffic signal design. Goodall et al. (2013) developed a decentralized, fully adaptive traffic control algorithm to optimize traffic signal timing [goodall2013traffic]. Feng et al. (2015) presented a real-time adaptive signal phase allocation algorithm using connected vehicle data, which optimizes the phase sequence and duration by solving a two-level optimization problem [feng2015real]. These signal-based coordination methods can ensure traffic safety, but they may manage the intersection inefficiently. Hence, researchers have started to focus on signal-free methods for intersection coordination.

Currently, there are two main types of methods for coordination at unsignalized intersections: centralized and distributed. Centralized coordination methods use global information about the whole intersection to control every vehicle centrally. Dresner and Stone (2008) treated drivers and intersections as autonomous agents in a multi-agent system and built a reservation-based approach around a detailed communication protocol [dresner2008multiagent]. Lee and Park (2012) eliminated potential overlaps between the trajectories of vehicles coming from all conflicting approaches, then sought a safe maneuver for every approaching vehicle and controlled each of them accordingly [lee2012development]. Dai et al. (2016) formulated an intersection control model and transformed it into a convex optimization problem that accounts for safety and efficiency [dai2016quality]. However, these centralized methods have a heavy computational requirement, since a single controller coordinates the approaching vehicles by optimizing all of their trajectories.

In distributed coordination methods, there is no central controller. Instead, a controller in each approaching vehicle optimizes that vehicle's own trajectory, taking into account the motion of its neighboring vehicles and its conflict relationships with them. Ahmane et al. proposed a model based on Timed Petri Nets with Multipliers (TPNM) and used it to design the control policy through structural analysis [ahmane2013modeling]. Xu et al. proposed a conflict-free geometry topology and a communication topology that transform the two-dimensional vehicle cluster at the intersection into a one-dimensional virtual platoon, and then designed a distributed feedback controller [xu2018distributed]. Distributed coordination methods meet the heavy computational requirement by distributing the computation; however, they require careful design of a sophisticated dynamic model and complicated communication relationships.

One of the most fundamental goals in artificial intelligence is learning new skills, especially from high-dimensional sensor input. Reinforcement learning (RL) gradually learns a better policy through trial-and-error interaction with the environment, which closely resembles human learning and has the potential to address a large number of complex problems [sutton2018reinforcement]. Recently, significant progress has been made on a variety of problems by combining advances in deep learning and RL. Mnih et al. (2015) proposed the Deep Q-Network (DQN) and attained human-level performance on Atari video games with raw pixels as input [Mnih2015]. Silver et al. combined RL with tree search to master the game of Go and produced two famous programs, AlphaGo and AlphaZero, which defeated top human champions [silver2016mastering, silver2017mastering]. Because DQN is only suitable for problems with discrete action spaces, the Deep Deterministic Policy Gradient (DDPG) algorithm was proposed to solve continuous control problems [lillicrap2015continuous]. Because vanilla policy gradient methods suffer from poor data efficiency and robustness, Trust Region Policy Optimization (TRPO) was proposed [schulman2015trust]. However, TRPO is not compatible with architectures that include noise or that share parameters between the policy and value function. Proximal Policy Optimization (PPO) was proposed as an updated version of TRPO; it alternates between sampling data through interaction and optimizing a “surrogate” objective function using stochastic gradient ascent [schulman2017proximal].

RL is a promising approach to autonomous driving because of its potential for super-human performance. Existing RL research on autonomous driving mostly focuses on the intelligence of a single vehicle driving in relatively simple traffic scenarios. DQN was initially used to control high-frequency discrete steering actions of a vehicle [wolf2017learning, ruiming2018end]. After the Asynchronous Advantage Actor-Critic (A3C) method was proposed, some studies adopted this framework to accelerate learning and maintain training stability [mnih2016asynchronous, jaritz2018end, dosovitskiy2017carla]. Owing to its advantage in long-term credit assignment, hierarchical reinforcement learning has been used for both high-level maneuver selection and low-level motion control in the decision making of self-driving cars [duan2019hierarchical]. In addition, other researchers have successfully applied DDPG to autonomous driving, achieving control of continuous acceleration, steering, and braking actions [Kendall2018, xiong2016combining].

In this paper, we employ RL for centralized control of multiple connected vehicles to achieve autonomous, collision-free passing at unsignalized intersections. We first formulate the state space, action space, and reward function in the RL framework, and then train the policy with a distributed PPO algorithm. In addition, to enhance sample efficiency and accelerate training, we incorporate a prior model into the PPO algorithm. Because we train a centralized controller, there is no need to elaborately design the complex components used by distributed controllers. Moreover, RL trains offline and infers online, which naturally relieves the online computation burden. Our results show that the learned policy is able to increase driving safety and traffic efficiency at intersections.

The rest of this paper is organized as follows. Section II introduces preliminaries of Markov decision processes and policy gradient methods. Section III proposes model accelerated PPO (MA-PPO), an improvement based on the original PPO and model-based RL methods. Section IV presents our problem statement and methodology, and Section V describes the experimental settings and results. Finally, Section VI summarizes this work.

## II Preliminaries of RL

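Consider an infinite-horizon discounted Markov decision process (MDP), defined by the tuple $(\mathcal{S}, \mathcal{A}, p, r, d_0, \gamma)$, where $\mathcal{S}$ is a finite set of states, $\mathcal{A}$ is a finite set of actions, $p: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is the transition probability distribution, $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, $d_0: \mathcal{S} \to \mathbb{R}$ is the distribution of the initial state $s_0$, and $\gamma \in (0, 1)$ is the discount factor.

Let $\pi$ denote a stochastic policy $\pi: \mathcal{S} \times \mathcal{A} \to [0, 1]$. We seek to learn the optimal policy $\pi^*$ that has the maximum value function $v_\pi(s)$ for all $s \in \mathcal{S}$, where the value function is the expected sum of discounted rewards from a state when following policy $\pi$:

$$v_\pi(s) = \mathbb{E}_{a_t, s_{t+1}, \ldots}\left\{ \sum_{l=0}^{\infty} \gamma^l r_{t+l} \,\middle|\, s_t = s \right\} \tag{1}$$

where $a_t \sim \pi(a_t|s_t)$, $s_{t+1} \sim p(s_{t+1}|s_t, a_t)$, and $r_t := r(s_t, a_t)$ for short. Similarly, we use the following standard definition of the state-action value function $q_\pi(s, a)$:

$$q_\pi(s, a) = \mathbb{E}_{s_{t+1}, a_{t+1}, \ldots}\left\{ \sum_{l=0}^{\infty} \gamma^l r_{t+l} \,\middle|\, s_t = s, a_t = a \right\} \tag{2}$$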
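As a small illustration of the discounted sum inside (1) and (2), the following sketch computes the return of a finite reward sequence and averages it over sampled episodes. The function names and the truncation to finite episodes are assumptions made for this example, not part of the original formulation.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**l * r_{t+l} over a finite reward sequence."""
    total = 0.0
    # Work backwards so that each step applies G = r + gamma * G_next.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def monte_carlo_value(episodes, gamma):
    """Average discounted return over episodes that start in the same state.

    Each episode is a list of rewards; this is a sample-based stand-in
    for the expectation in the value function definition.
    """
    returns = [discounted_return(ep, gamma) for ep in episodes]
    return sum(returns) / len(returns)
```

With an infinite horizon the sum is not truncated; the finite version above is only meaningful for episodic tasks or as an approximation.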

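### II-A Vanilla policy gradient

In practice, finding the optimal policy for every state is impractical for large state spaces, so we consider parameterized policies $\pi_\theta(a|s)$ with parameter vector $\theta$. For the same reason, the state-value function is parameterized as $V_w(s)$ with parameter vector $w$. Policy optimization methods seek the optimal $\theta^*$ that maximizes the average performance of policy $\pi_\theta$:

$$J(\theta) = \mathbb{E}_{s_0, a_0, s_1, \ldots}\left\{ \sum_{t=0}^{\infty} \gamma^t r_t \right\} \tag{3}$$

where $s_0 \sim d_0(s_0)$, $a_t \sim \pi_\theta(a_t|s_t)$, and $s_{t+1} \sim p(s_{t+1}|s_t, a_t)$.

Vanilla methods optimize (3) by stochastic policy gradient [sutton2000policy]. The gradient is shown in (4):

$$\nabla_\theta J(\theta) = \sum_{s} d_\gamma(s) \sum_{a} \nabla_\theta \pi_\theta(a|s)\, q_{\pi_\theta}(s, a) \tag{4}$$

$$d_\gamma(s) = \sum_{t=0}^{\infty} \gamma^t d^t(s) \tag{5}$$

where $d^t(s)$ is the state distribution at time $t$ and $d_\gamma(s)$ is called the discounted visiting frequency, which in practice is usually replaced with the stationary state distribution under $\pi_\theta$, denoted by $d_{\pi_\theta}(s)$ [thomas2014bias]. Combined with the likelihood ratio and baseline techniques [sutton2018reinforcement], (4) can be written in expectation form:

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d_{\pi_\theta},\, a \sim \pi_\theta}\left\{ \nabla_\theta \log \pi_\theta(a|s)\, A_{\pi_\theta}(s, a) \right\} \tag{6}$$

where $A_{\pi_\theta}(s, a) = q_{\pi_\theta}(s, a) - v_{\pi_\theta}(s)$ is the advantage function, which can be estimated by several methods [schulman2015high].

### II-B Trust region method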

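While the vanilla policy gradient is simple to implement, it often leads to destructively large policy updates. TRPO optimizes a lower bound of (3), i.e. equation (7), to guarantee performance improvement:

$$\max_\theta \; \mathbb{E}_{s,a}\left\{ \frac{\pi_\theta(a|s)}{\pi_{\theta_\text{old}}(a|s)} A_{\theta_\text{old}}(s, a) - \beta\, D_\text{KL}\big(\pi_{\theta_\text{old}}(\cdot|s) \,\|\, \pi_\theta(\cdot|s)\big) \right\} \tag{7}$$

However, because it is hard to choose a single value of $\beta$ that performs well across different problems, TRPO uses the constrained form shown in (8) instead:

$$\begin{aligned} \max_\theta \quad & \mathbb{E}_{s,a}\left\{ \frac{\pi_\theta(a|s)}{\pi_{\theta_\text{old}}(a|s)} A_{\theta_\text{old}}(s, a) \right\} \\ \text{s.t.} \quad & \mathbb{E}_{s}\left\{ D_\text{KL}\big(\pi_{\theta_\text{old}}(\cdot|s) \,\|\, \pi_\theta(\cdot|s)\big) \right\} \le \delta \end{aligned} \tag{8}$$

where we denote $\mathbb{E}_{s,a} := \mathbb{E}_{s \sim d_{\pi_{\theta_\text{old}}},\, a \sim \pi_{\theta_\text{old}}}$ and $A_{\theta_\text{old}} := A_{\pi_{\theta_\text{old}}}$. TRPO can be regarded as a natural policy gradient method [kakade2002natural]. It finds the steepest policy gradient direction in the space normed by the Fisher information matrix rather than in Euclidean space, which reduces the impact of the policy parameterization on the gradient and stabilizes learning.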

## III Model accelerated PPO

### III-A Proximal policy optimization

In this paper, the PPO algorithm is employed as our baseline. It is inspired by TRPO and differs from it in two main ways: an unconstrained surrogate objective function and generalized advantage estimation.

#### III-A1 Unconstrained surrogate objective function

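Observing that a stable policy update requires penalizing policy deviation in the unconstrained optimization (9) derived from the theory of TRPO,

$$\max_\theta \; \mathbb{E}_{s,a}\left\{ \rho(\theta)\, A_{\theta_\text{old}}(s, a) \right\} \tag{9}$$

PPO instead constructs a lower bound of (9) that directly removes the incentive for the policy distribution to deviate too much. Its objective is (10):

$$J^\text{CLIP}(\theta) = \mathbb{E}_{s,a}\left\{ \min\big( \rho(\theta)\, A_{\theta_\text{old}}(s, a),\; \text{clip}(\rho(\theta), 1-\epsilon, 1+\epsilon)\, A_{\theta_\text{old}}(s, a) \big) \right\} \tag{10}$$

where $\rho(\theta) = \pi_\theta(a|s) / \pi_{\theta_\text{old}}(a|s)$ is the probability ratio between the new policy and the policy $\pi_{\theta_\text{old}}$ that collected the data. When $A_{\theta_\text{old}}(s, a) > 0$, objective (9) would push $\rho(\theta)$ far above 1 to make the objective as large as possible, which leads to unstable learning. The PPO objective (10) removes this incentive by clipping $\rho(\theta)$ to $[1-\epsilon, 1+\epsilon]$ and taking the minimum of the unclipped objective (9) and the clipped one. The same holds for $A_{\theta_\text{old}}(s, a) < 0$.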
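The effect of the clipping in (10) can be checked numerically. The sketch below evaluates the clipped surrogate on a batch of precomputed probability ratios and advantage estimates; the function name and the plain-list interface are assumptions for illustration, and no gradient is computed.

```python
def clipped_surrogate(ratios, advantages, epsilon=0.2):
    """Sample mean of min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)."""
    terms = []
    for rho, adv in zip(ratios, advantages):
        # Restrict the probability ratio to [1 - eps, 1 + eps].
        clipped_rho = max(1.0 - epsilon, min(rho, 1.0 + epsilon))
        # The minimum keeps the more pessimistic of the two estimates.
        terms.append(min(rho * adv, clipped_rho * adv))
    return sum(terms) / len(terms)
```

For a positive advantage, a ratio of 1.5 contributes only as much as a ratio of 1.2 would, so increasing the ratio beyond the clip range yields no further gain. For a negative advantage, a ratio of 0.5 is treated like 0.8 for the same reason.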

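#### III-A2 Generalized advantage estimation

The advantage function is necessary for computing the policy gradient, and it can be estimated by (11):

$$\hat{A}(s_t, a_t) = \hat{Q}(s_t, a_t) - V_w(s_t) \tag{11}$$

where $\hat{Q}(s_t, a_t)$ is the action-value function estimated from samples of $\pi_{\theta_\text{old}}$ and $V_w(s_t)$ is the state-value function approximation. TRPO and A3C use the Monte Carlo method to construct $\hat{Q}(s_t, a_t)$ as in (12):

$$\hat{Q}(s_t, a_t) = \sum_{l=0}^{\infty} \gamma^l r_{t+l} \tag{12}$$

It is an unbiased estimate of $q_{\pi_{\theta_\text{old}}}(s_t, a_t)$, but suffers from high variance. Actor-critic methods use one-step TD to form $\hat{Q}(s_t, a_t)$ as in (13):

$$\hat{Q}(s_t, a_t) = r_t + \gamma V_w(s_{t+1}) \tag{13}$$

which has low variance but is biased. Generalized advantage estimation is essentially the same as TD($\lambda$), except that it uses a linear combination of n-step TD estimates to estimate the advantage function rather than the value function. The backward view of TD($\lambda$) is shown in (14):

$$\hat{A}(s_t, a_t) = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l} \tag{14}$$

where $\delta_t$ is the TD error,

$$\delta_t = r_t + \gamma V_w(s_{t+1}) - V_w(s_t) \tag{15}$$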
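The sum of discounted TD errors in (14) and (15) is usually computed with a backward recursion over a collected trajectory. The following is a minimal sketch under the assumption of a single trajectory segment with no termination inside it; `last_value` is a hypothetical argument holding the value estimate of the state after the final step (zero if that state is terminal).

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimates via A_t = delta_t + gamma * lam * A_{t+1}."""
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # TD error: r_t + gamma * V(s_{t+1}) - V(s_t).
        delta = rewards[t] + gamma * next_value - values[t]
        # Accumulate the (gamma * lam)-discounted sum of later TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages
```

Setting `lam=0` recovers the one-step TD estimate, while `lam=1` recovers the Monte Carlo return minus the value baseline, which are the two extremes discussed above.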

Compared to TRPO, PPO is much simpler and faster to implement because it only involves first-order optimization, and it converges better owing to generalized advantage estimation. However, PPO is an on-policy method and inevitably has high sample complexity.

### III-B Model-based RL

Recent model-free reinforcement learning algorithms have proposed incorporating given or learned dynamics models as a source of additional data, with the intention of reducing sample complexity. Generally, there are two ways to use a model: value gradient methods and imagination rollouts.

Value gradient methods link together the policy, model, and reward function to compute an analytic policy gradient by backpropagating the reward along a trajectory [deisenroth2011pilco, grondman2015online, heess2015learning]. A major limitation of this approach is that the dynamic model can only be used to retrieve information already present in the observed data; although the variance is lower, the actual improvement in efficiency is relatively small. Alternatively, the given or learned model can be used for imagination rollouts. This usage can be incorporated naturally into the model-free RL framework; however, learned models tend to overfit the experience data, which leads to large errors over long horizons [kurutach2018model, feinberg2018model].

### III-C Model accelerated PPO (MA-PPO)

PPO is a model-free, on-policy RL algorithm. Model-free means that it knows nothing about the environment and can only learn from interactions with it. As a result, it inevitably requires a large amount of experience data, despite its excellent final performance. In addition, the training speed is limited by the speed of interaction with the real world or a simulator. Even worse, being on-policy makes experiences produced by previously trained policies useless, which aggravates the sample inefficiency. This is our motivation for accelerating PPO.

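Basically, there are two ways to reduce sample complexity. The first is to incorporate off-policy data in the learning process, i.e. to reuse experiences collected earlier in training. The second is to provide or learn a dynamic model. We claim that off-policy data cannot be used because of state distribution mismatch. Assuming that the off-policy data are generated by another policy $b$, we rewrite the PPO objective (10) as (16):

$$\begin{aligned} J^\text{CLIP}(\theta) &= \mathbb{E}_{s \sim d_{\pi_{\theta_\text{old}}},\, a \sim \pi_{\theta_\text{old}}}\left\{ \min\big( \rho(\theta) A_{\theta_\text{old}},\; \text{clip}(\rho(\theta), 1-\epsilon, 1+\epsilon) A_{\theta_\text{old}} \big) \right\} \\ &= \mathbb{E}_{s \sim d_b,\, a \sim b}\left\{ \frac{d_{\pi_{\theta_\text{old}}}(s)}{d_b(s)} \frac{\pi_{\theta_\text{old}}(a|s)}{b(a|s)} \min\big( \rho(\theta) A_{\theta_\text{old}},\; \text{clip}(\rho(\theta), 1-\epsilon, 1+\epsilon) A_{\theta_\text{old}} \big) \right\} \end{aligned} \tag{16}$$

To obtain the correct gradient from off-policy data, we need to correct not only the action distribution, through the action probability ratio $\pi_{\theta_\text{old}}(a|s) / b(a|s)$, but also the state distribution, through the stationary probability ratio $d_{\pi_{\theta_\text{old}}}(s) / d_b(s)$. However, the stationary probability ratio is hard to estimate, which leads to distribution mismatch and hinders the use of off-policy data both in theory and in practice. As a result, we only employ a model to accelerate PPO.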

For centralized control at intersections, a dynamic model is available from human prior knowledge, so we construct a model ourselves rather than using a learned one. To combine the model with the PPO algorithm naturally, we employ the second type of model usage from Section III-B, i.e. imagination rollouts. MA-PPO is shown in Algorithm 1.

## IV Problem statement and formulation

### IV-A Problem statement

In this paper, we focus on a typical 4-direction intersection, shown in Fig. 1. Each direction is denoted by its location in the figure, i.e. up (U), down (D), left (L), and right (R). We only consider vehicles within a certain distance of the intersection center. The intersection is unsignalized, and each entrance and exit is assumed to have only one lane, so there are 4 entrances in total. A vehicle at each entrance is allowed to turn right, go straight, or turn left. Thus there are 12 types of vehicles, denoted by their entrance and exit, i.e. DR, DU, DL, RU, RL, RD, LD, LR, LU, UL, UD, and UR. Their numbers and meanings are listed in Table I. All possible conflict relations are also illustrated in the figure; they fall into three classes: crossing conflicts (red dots), converging conflicts (purple dots), and diverging conflicts (pink dots). To simplify the problem, we choose 8 of the 12 vehicle modes, which cover the main conflict modes, for our experiment. The 8 modes are DR, DL, RU, RL, LD, LU, UL, and UD, as shown in Fig. 2. The distinct types of conflict they contain are summarized in Fig. 3.

Type | Number | Meaning |
---|---|---|
DR | 1 | From ‘Down’ turn right to ‘Right’ |
DU | 2 | From ‘Down’ go straight to ‘Up’ |
DL | 3 | From ‘Down’ turn left to ‘Left’ |
RU | 4 | From ‘Right’ turn right to ‘Up’ |
RL | 5 | From ‘Right’ go straight to ‘Left’ |
RD | 6 | From ‘Right’ turn left to ‘Down’ |
LD | 7 | From ‘Left’ turn right to ‘Down’ |
LR | 8 | From ‘Left’ go straight to ‘Right’ |
LU | 9 | From ‘Left’ turn left to ‘Up’ |
UL | 10 | From ‘Up’ turn right to ‘Left’ |
UD | 11 | From ‘Up’ go straight to ‘Down’ |
UR | 12 | From ‘Up’ turn left to ‘Right’ |

We adopt the following assumptions. First, all vehicles are equipped with positioning and velocity-measuring devices, so that we can gather their location and movement information once they enter the zone of interest around the intersection. Second, all approaching vehicles are assumed to be automated, so that they can strictly follow the desired acceleration, control their speed, and pass the intersection automatically. Third, there is at most one vehicle of each type in each entrance lane, but the order of the different types is stochastic.

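### IV-B RL formulation

We are now ready to cast our problem as an RL problem by defining the state space, action space, and reward function, which are the basic elements of RL.

#### IV-B1 State and action space

By our assumptions, we need to control at most 8 vehicles at a time, i.e. 2 vehicles of different types at each entrance. Both the state and the action are vectors, formed by concatenating each vehicle's state and control, respectively, in a fixed order, as shown in (17):

$$\begin{aligned} s &= [s_\text{DR}, s_\text{DL}, s_\text{RU}, s_\text{RL}, s_\text{LD}, s_\text{LU}, s_\text{UL}, s_\text{UD}] \\ a &= [a_\text{DR}, a_\text{DL}, a_\text{RU}, a_\text{RL}, a_\text{LD}, a_\text{LU}, a_\text{UL}, a_\text{UD}] \end{aligned} \tag{17}$$

where $s_*$ and $a_*$ denote the state and action of vehicle type $*$.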

The state of each type should contain the position and velocity information of the vehicle. Intuitively, we could form the state as a tuple of coordinates and velocity, i.e. $(x, y, v)$, where $(x, y)$ is the coordinate of its position and $v$ is its velocity. However, in our task formulation every vehicle has a fixed path determined by its type, so this formulation would carry redundant information. Moreover, for continuous states it is necessary to reduce the dimension of the state space to speed up learning and enhance stability. Observing that all paths cross the intersection, we further compress the state of each vehicle to $(d, v)$, where $d$ is the distance between the vehicle and the center of its path. Note that $d$ is positive when the vehicle is heading toward the center and negative when it is leaving. The state formulation is shown in Fig. 4.

For the action space, we choose the acceleration of each vehicle. In total, a 16-dimensional state space and an 8-dimensional action space are constructed.
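The concatenation described above can be made concrete with a short sketch. The ordering of the 8 vehicle types follows the list of selected modes; the dictionary interface and the function name `build_state` are assumptions introduced only for this example.

```python
# Fixed order of the 8 selected vehicle types (assumed to match the listing order).
VEHICLE_TYPES = ["DR", "DL", "RU", "RL", "LD", "LU", "UL", "UD"]


def build_state(per_vehicle):
    """Concatenate (distance, velocity) pairs into one flat state vector.

    per_vehicle maps a type name to a (d, v) tuple, where d is the signed
    distance to the center of the vehicle's path and v is its velocity.
    """
    state = []
    for vtype in VEHICLE_TYPES:
        d, v = per_vehicle[vtype]
        state.extend([d, v])
    return state
```

With 8 vehicles and 2 values each, the resulting vector has 16 entries, and an action vector with one acceleration per vehicle has 8.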

#### IV-B2 Reward settings

The reward function is designed with safety, efficiency, and task completion in mind. First, the task is episodic, with two types of termination: a collision, or all vehicles passing the intersection. To avoid collisions, a large negative reward is given if one happens. To enhance efficiency, a small negative step reward is given at every time step. To encourage task completion, a positive reward is given whenever a vehicle passes the intersection, and a large positive reward is given when all vehicles have passed. All reward settings are listed in Table II.

Reward items | Reward |
---|---|
Collision | -50 |
Step reward | -1 |
Some vehicle passes | 10 |
All vehicles pass | 50 |
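The reward settings in Table II could be combined into a per-step reward along the following lines. The four values come from the table, but how they are combined is an assumption of this sketch: the terms are summed, the passing bonus is granted once per step in which at least one vehicle passes, and a collision ends the episode without any passing bonus.

```python
# Reward values copied from Table II.
REWARDS = {
    "collision": -50.0,
    "step": -1.0,
    "vehicle_passed": 10.0,
    "all_passed": 50.0,
}


def step_reward(collided, some_vehicle_passed, all_passed):
    """Return (reward, episode_done) for one time step.

    Assumption: rewards are additive and a collision overrides the
    passing bonuses. The table itself only lists the four values.
    """
    reward = REWARDS["step"]  # small penalty at every time step
    if collided:
        return reward + REWARDS["collision"], True
    if some_vehicle_passed:
        reward += REWARDS["vehicle_passed"]
    if all_passed:
        reward += REWARDS["all_passed"]
    return reward, all_passed
```

Both termination types described above appear as `episode_done` being true: either a collision occurred or all vehicles have passed.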

### IV-C Algorithm architecture

In this section, we illustrate how to apply the MA-PPO algorithm to this centralized control problem.

#### IV-C1 Model construction

MA-PPO learns from data that come from both the simulation and the model. The simulation incorporates the true dynamics of the environment, i.e. a kinematics module with noise, but interacting with it takes too much time. MA-PPO accelerates learning by incorporating a prior model that generates additional data for learning. The model is a simple kinematics model: given the current position, velocity, and expected acceleration of each vehicle, it infers their next position and velocity.
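A noise-free kinematics model of this kind might look like the sketch below, which advances each vehicle's (distance, velocity) pair by one time step under constant acceleration. The step length `dt`, the flat state layout, and the function names are assumptions for illustration; the source does not specify the exact model equations.

```python
def kinematic_step(d, v, accel, dt=0.1):
    """Advance one vehicle by dt seconds under constant acceleration.

    d is the signed distance to the center of the path (positive while
    approaching), so travelled distance is subtracted from it.
    """
    travelled = v * dt + 0.5 * accel * dt * dt
    return d - travelled, v + accel * dt


def model_step(state, action, dt=0.1):
    """Apply kinematic_step to every vehicle in a flat [d1, v1, d2, v2, ...] state."""
    next_state = []
    for i, accel in enumerate(action):
        d, v = state[2 * i], state[2 * i + 1]
        d_next, v_next = kinematic_step(d, v, accel, dt)
        next_state.extend([d_next, v_next])
    return next_state
```

Because the model is a closed-form update, generating transitions from it is far cheaper than stepping a simulator, which is the property the acceleration relies on.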

#### IV-C2 Overall architecture

The learning algorithm for this RL problem consists of two main parts: the MA-PPO learner and the worker. The worker obtains the updated policy from the learner and uses it to collect experience data from the simulation or the kinematics model. The MA-PPO learner then uses the experience data from the worker to update the value and policy networks by backpropagation, and finally sends the updated policy to the worker for the next iteration. The overall architecture is shown in Fig. 5.

## V Experiments

### V-A Experimental settings

In this section, we train and test MA-PPO and the original PPO on the set of vehicles described above, in which there are two vehicles of different types at each entrance of the intersection, giving 8 vehicles in total. These vehicles are chosen to cover all conflict modes shown in Fig. 3. The initial positions of all vehicles are random. The vehicles enter the intersection from different entrances, follow their trajectories through the intersection zone, and leave at different exits. The central controller controls the acceleration of all vehicles, adjusting their speeds and positions to ensure traffic safety and efficiency, i.e. all vehicles pass through the intersection as quickly as possible without collision. We show and compare the training processes of MA-PPO and PPO to illustrate our improvement over PPO, and we visualize the behavior of the policy in simulation at the start and at the end of training to show what the trained policy has learned.

### V-B Implementation details

We employ multilayer perceptrons with two hidden layers as the function approximators for the actor and the critic. Both have 128 units in each hidden layer. The actor has 16 output units, which parameterize a Gaussian distribution for each vehicle (mean and standard deviation), while the critic has a single output unit for the state value. The actor and critic networks share no parameters. We use Adam as the optimizer. For MA-PPO, the number of transitions collected per iteration, the minibatch size, and the number of update epochs are listed in Table III, as is the number of inner iterations of the model simulation loop. In addition, we train both PPO and MA-PPO with 5 seeds to reduce the impact of randomness. The complete parameter setting is listed in Table III.

Parameter | Value |
---|---|
Discount factor | 0.99 |
$\lambda$ (GAE) | 0.95 |
Clip range | 0.2 |
Total timesteps | |
Inner iteration | 1 |
Seed number | 5 |
Batch size | 2048 |
Minibatch size | 64 |
Epoch | 10 |
Learning rate | 0.0003 → 0 |
Hidden layer number | 2 |
Hidden units number | 128 |
Optimizer | Adam |

We use parallel workers to improve exploration and stabilize the policy gradient, and thus speed up learning. Concretely, 16 parallel workers learn simultaneously. In each iteration, every worker interacts with its own environment and collects a batch of 2048 timesteps, then takes the first minibatch to compute a local gradient. The global gradient is obtained by averaging the local gradients of all workers. Each worker then updates its parameters with the global gradient using Adam, takes the next minibatch, and repeats the process.
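The synchronous update described above can be sketched in two small pieces: splitting a batch into minibatches and averaging the local gradients of all workers. Gradients are represented as flat lists of floats and the function names are hypothetical; the Adam update that each worker would apply to the averaged gradient is omitted.

```python
def minibatch_slices(batch_size=2048, minibatch_size=64):
    """Return (start, end) index pairs covering one pass over the batch."""
    return [
        (start, min(start + minibatch_size, batch_size))
        for start in range(0, batch_size, minibatch_size)
    ]


def average_gradients(local_grads):
    """Element-wise mean of the workers' local gradient vectors."""
    n_workers = len(local_grads)
    n_params = len(local_grads[0])
    return [
        sum(grad[i] for grad in local_grads) / n_workers
        for i in range(n_params)
    ]
```

With a batch of 2048 timesteps and minibatches of 64, each pass over the batch consists of 32 such averaged updates.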

### V-C Results and discussion

In this section, we show the performance of our algorithm at the intersection and analyze the empirical results.

Fig. 6(a) shows the mean episode reward of MA-PPO and PPO during training. Both MA-PPO and PPO reach a highest reward of about 50, which means that all 8 vehicles have passed through the intersection successfully. MA-PPO converges at around 500 iterations, while PPO needs nearly 1000 iterations, which shows that MA-PPO converges about twice as fast as PPO.

Fig. 6(b) shows the change in mean episode length during training. For both MA-PPO and PPO, the episode length first increases rapidly and then decreases to the same value. The explanation is that, at the beginning, the policy mainly focuses on avoiding collisions, because a collision incurs a large negative reward. At that stage, one reasonable policy is to let vehicles with non-conflicting trajectories pass through the intersection together, such as RL and LD, or DR and UD. Meanwhile, the other vehicles have to wait for the next collision-free opportunity, which leads to long episodes. However, such a policy is too conservative and inefficient, because every step incurs a reward of -1. Subsequent policies therefore optimize this process to avoid long waiting times, which decreases the mean episode length. In addition, MA-PPO converges faster than the original PPO in terms of mean episode length, the same trend as in Fig. 6(a).

Fig. 7 visualizes one episode from the 20th iteration of training. At this point, VEH5 (mode: LD) has passed through the intersection successfully; however, VEH4 (mode: RL) and VEH8 (mode: UD) collided at the last step of the episode. From Fig. 7(b) we can see that all 8 vehicles are approaching the center of the intersection, yet almost none of them, except VEH2, decelerates to avoid a collision. During the last few steps before the collision, the velocities of VEH4 and VEH8 maintain their trend without significant change. In contrast, VEH2 anticipates the collision ahead, and our policy begins to control its acceleration to avoid another collision. On this occasion, the final time step received a reward of 10 because of the successful pass of VEH2. We can conclude that at this point the learned policy cannot yet coordinate all vehicles successfully, and some agents, such as VEH4 and VEH8, have not learned an effective policy for this intersection traffic situation.

Fig. 8 shows a successful example in which the central decision agent has learned a good policy after 1000 iterations of training. At this point, VEH3 (mode: RU) has passed through the intersection successfully. VEH8 slows down from the beginning until step 26 to wait for VEH7 to turn right. VEH2 (mode: DL) also has to wait for VEH7, which is closer to the center of the intersection, to pass. Likewise, VEH4 (mode: RL) has to wait for VEH2 to turn left and therefore decreases its acceleration. The agent has learned a human-like policy: it detects potential collisions from the distances to the center of the intersection and assigns the order in which vehicles pass. One reasonable explanation for why VEH8 has to wait and pass last is that it is farther from the center of the intersection than any other vehicle, as shown in Fig. 8(b). After step 26, once VEH7, VEH2, and VEH4 have passed the central area of the intersection, VEH8 starts to speed up and then passes the intersection successfully. As shown in Fig. 8(d), VEH8 remains stable between -2 and -1 from the start until time step 26 and then accelerates rapidly, which also illustrates that VEH8 has learned a waiting policy to avoid collisions.

VEH5 and VEH6 have similar velocity curves, both of which show large changes in velocity. At the beginning, they slow down and keep a low velocity until VEH2, VEH3, and VEH4 have passed the intersection. After time step 25, both begin to speed up and pass the intersection: with no potential collision remaining around the intersection area, a larger acceleration reduces the negative step reward accumulated while driving. On the other hand, the velocity curves of VEH2, VEH3, and VEH4 show that they have learned to speed up so that they can pass the intersection quickly. In Fig. 8(b), compared with the other vehicles, the distances of VEH5 and VEH6 decrease slowly at first and then, after time step 25, their distance curves become steeper, which also indicates that VEH5 and VEH6 have learned a waiting policy to avoid collisions.

In conclusion, the results show that RL-based control can handle the intersection scenario with multiple vehicles, not only avoiding collisions but also improving passing efficiency. Unlike hand-crafted rules for intersection control, our algorithm coordinates vehicles from different directions according to their velocities and distances to the center of the intersection. Our RL-based method is likely to show greater advantages when there are more vehicles, a setting in which hand-crafted rules may not work or may struggle to find the optimal solution for coordinating all vehicles. In addition, we use a model to accelerate learning and obtain a good speed-up, which shows the importance of a prior model in the learning algorithm.

## VI Conclusion

In this paper, we employ a reinforcement learning method to solve centralized conflict-free cooperation for connected and automated vehicles at intersections, which has long been regarded as a challenging problem because of its large scale and high dimensionality. We use the PPO algorithm as our baseline, which has state-of-the-art performance on several benchmarks, and we propose MA-PPO to enhance sample efficiency and speed up learning. A typical 4-direction intersection with 8 different vehicle modes is studied. We find that our method is more efficient than PPO and that the learned driving policy exhibits intelligent behaviors that increase driving safety and traffic efficiency, which indicates that RL is promising for centralized cooperative driving at intersections.

## VII Acknowledgments

This work is partially supported by the International Science and Technology Cooperation Program of China under 2016YFE0102200. Special thanks go to TOYOTA for funding this study. We would also like to thank Mr. Jingliang Duan and Mr. Zhengyu Liu for their valuable suggestions throughout this research.
