V2I Connectivity-Based Dynamic Queue-Jumper Lane for Emergency Vehicles: An Approximate Dynamic Programming Approach

03/02/2020 ∙ by Haoran Su, et al. ∙ NYU college

Emergency vehicle (EV) service is a key function of cities and is exceedingly challenging due to urban traffic congestion. A key contributor to EV service delay is the lack of communication and cooperation between the vehicles blocking EVs. In this paper, we study the improvement of EV service using vehicle-to-vehicle connectivity. We consider the establishment of dynamic queue jumper lanes (DQJLs) based on real-time coordination of connected vehicles. We develop a novel stochastic dynamic programming formulation for the DQJL problem, which explicitly accounts for the uncertainty of drivers' reactions to approaching EVs. We propose a deep neural network-based approximate dynamic programming (ADP) algorithm that efficiently computes the optimal coordination instructions. We also validate our approach on a micro-simulation testbed using Simulation of Urban Mobility (SUMO).




1 Introduction

Increasing population and urbanization have made it exceedingly challenging to operate urban emergency services efficiently. For example, historical data from New York City (NYC), USA [1] shows that the number of emergency vehicle (EV) incidents has grown from 1,114,693 in 2004 to 1,352,766 in 2014, with corresponding average response times of 7:53 min and 9:23 min, respectively [2]. This means an approximately 20% increase in response times in ten years. In the case of cardiac arrest, every minute until defibrillation reduces survival chances by 7% to 10%, and after 8 minutes there is little chance for survival [3]. Cities are less resilient with worsening response times from EVs (ambulances, fire trucks, police cars), mainly due to traffic congestion.

The performance of these EV service systems in congested traffic can be improved with technology. As a core component of modern intelligent transportation systems (ITSs), wireless vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) connectivity provides significant opportunities for improving urban emergency response. On the one hand, wireless connectivity provides EVs with the traffic conditions on possible routes between the station (hospital, fire station, police station, etc.) and the call, which enables more efficient dispatch and routing. On the other hand, through V2V/V2I communications, traffic managers can broadcast the planned route of EVs to non-EVs that may be affected, and non-EVs can cooperate to form dynamic queue-jump lanes (QJLs) for approaching EVs.

In response to these challenges, this paper develops a methodology for utilizing V2V/V2I connectivity to improve EV services. We design link-level coordination strategies for non-EVs to quickly establish dynamic queue jumper lanes (DQJLs) for EVs while maintaining safety. We incorporate state-of-the-art deep learning methods to deal specifically with the randomness of driver behavior and devise a scalable solution to the DQJL problem. The models are implemented and the results are validated in traffic simulation software.

Although QJLs are a relatively new technology, the literature documents their positive effect in reducing travel time variability, especially when used in conjunction with transit signal priority (TSP). However, these studies are all based on moving-bottleneck models for buses [4, 5, 6]; such models do not directly apply to our setting, since EVs typically move faster than non-EVs and since EVs can “preempt” non-EV traffic because of their priority. In addition, QJLs have not been studied as a dynamic control strategy. The establishment of DQJLs involves real-time motion planning for both EVs and non-EVs, which has been a focus of robotics in both deterministic and stochastic settings [7]. However, although robotic motion planning algorithms provide useful insights, they do not directly apply to EVs, since human drivers cannot follow complex paths or react instantaneously as robots do. Furthermore, coordination algorithms for multiple robots are hardly applicable to traffic management because of the high randomness in drivers’ reactions to coordination instructions. Instead, human drivers need driving strategies that are easy to interpret and implement and that preferably depend only on the movement of neighboring vehicles; see [8]. [9] illustrates the use of dynamic programming to prevent vehicle-pedestrian collisions, and [10] shows how deep learning methods can be used to ensure road safety. Mixed integer programming has been utilized in routing problems for multiple vehicles in different tasks, as in [11]. In particular, [12] considered an integer linear program formulation for the DQJL problem in the centralized and deterministic setting, which provides a baseline but does not account for the randomness of driver behavior.

In this paper, we model the DQJL problem as a Markov decision process to cope with the uncertainty in drivers’ behavior. We also introduce approximate dynamic programming (ADP), using a deep neural network, to address the complexity of this framework and eventually solve the DQJL problem. We validate our results in traffic simulation software against a benchmark system.

Our results indicate that, by using ADP, the coordinated system can establish a DQJL faster than the benchmark/decentralized system in an urban environment. With our ADP algorithm, the coordinated system needs about 12% less time than the benchmark system, creating a critical time window for emergency vehicles to complete their tasks. Our ADP algorithm is capable of dealing with more sophisticated scenarios such as longer road segments, mixed vehicle types, and different congestion levels.

The rest of this paper is organized as follows. In Section 2, we model the establishment of a DQJL in a discretized road environment and describe the uncertainty with a geometric distribution. Then, in Section 3, we propose our ADP algorithm to solve this extended DQJL problem. In Section 4, the results are validated through a simulation analysis against a benchmark system, and the insights behind the results are discussed.

2 Modeling a DQJL problem with uncertain driver behavior

In this section, we describe how we formulate and model the DQJL problem in an urban road environment.

To model the establishment of a dynamic queue-jumper lane, i.e. the path clearance process, for an emergency vehicle (EV), we consider a typical urban road segment consisting of two lanes in the same direction. When an EV requests to pass this road segment, the centralized/coordinated vehicle-to-vehicle system sends real-time instructions to all non-EVs on the segment. We assume the EV always travels on one lane. When the EV approaches the road section, all non-EVs on the other lane immediately freeze. All non-EVs in front of the EV are instructed to cruise forward or pull over to clear a path for the EV. If a non-EV cannot find a suitable pull-over space, it can exit at the end of this road segment. However, the pull-over response time of each non-EV is uncertain, and the centralized system needs to address this uncertainty during the process.

Assuming that the speed of an EV on a mission is much higher than the cruising speed of non-EVs, the EV is positioned immediately behind the last vehicle that has not pulled over or exited the road segment. When no vehicle remains in front of the EV, the dynamic queue-jumper lane has been established for this EV and the process is complete.

2.1 Problem Statement

Given a two-lane directed link segment of known length carrying $n$ non-EVs, how should the centralized system instruct all non-EVs, whose pull-over response times are uncertain, so that the time to establish the dynamic queue-jumper lane for an EV is minimized?

2.2 Road Discretization

This study is based on homogeneous timestamps, meaning that the centralized system gathers all vehicle coordinates and sends instructions to vehicles at the end of each second. Accordingly, we discretize the road segment into a grid network of cells of equal length, i.e. if a vehicle is instructed to cruise forward to find a pull-over space further ahead, it moves forward by one cell. See Fig. 1.

Figure 1: An example of road segmentation discretization.
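As a rough illustration of this discretization, the following Python sketch maps a longitudinal position on the segment to a cell index. The 8 m cell length and the helper names `num_cells` and `cell_index` are assumptions made purely for illustration; the 80 m segment length matches the value used later in the simulation.

```python
import math

def num_cells(segment_length_m, cell_length_m):
    """Number of longitudinal cells needed to cover the segment."""
    return math.ceil(segment_length_m / cell_length_m)

def cell_index(position_m, cell_length_m, n_cells):
    """0-based index of the cell containing a position measured from the segment start.

    Positions outside the segment are clamped to the first or last cell.
    """
    idx = int(position_m // cell_length_m)
    return min(max(idx, 0), n_cells - 1)

# Hypothetical 8 m cells on an 80 m segment.
N = num_cells(80.0, 8.0)
print(N, cell_index(37.5, 8.0, N))
```

Clamping is one possible design choice; a vehicle beyond the last cell could equally be treated as having exited the segment.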

2.3 Uncertainty in Non-EV Pulling Over Time

Non-EV pulling over time varies from driver to driver, creating large uncertainty for the system. We use a geometric distribution to model this uncertainty. The number of timestamps in which a driver fails to pull over before succeeding is a random variable $X \sim \mathrm{Geo}(p)$, where $p$ represents the probability of success on each trial. That is, the probability that a driver finishes pulling over exactly at the $k$-th timestamp after receiving the instruction is $(1-p)^{k-1}p$.
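The geometric model can be sketched in a few lines of Python. The sketch assumes the convention that a driver succeeds exactly at the $k$-th timestamp with probability $(1-p)^{k-1}p$; the value `p = 0.5` and the function names are hypothetical.

```python
import random

def pull_over_pmf(k, p):
    """P(driver completes the pull-over exactly at the k-th timestamp), k = 1, 2, ..."""
    if k < 1:
        return 0.0
    return (1.0 - p) ** (k - 1) * p

def sample_pull_over_time(p, rng):
    """Draw the timestamp of success by repeated Bernoulli trials (requires p > 0)."""
    k = 1
    while rng.random() >= p:  # failure happens with probability 1 - p
        k += 1
    return k

p = 0.5  # hypothetical per-timestamp success probability
rng = random.Random(0)
samples = [sample_pull_over_time(p, rng) for _ in range(10000)]
print(sum(samples) / len(samples))  # the sample mean should be close to 1 / p
```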

2.4 Assumptions

In the proposed model, we assume the positions and kinematic characteristics of all non-EVs are known through the connected environment. By the definition of the road discretization, each cell can be occupied by only one non-EV. Since the EV is on a mission, its speed is significantly higher than the cruising speed of non-EVs, so its real-time position is updated to be immediately behind the last non-EV that has not pulled over or exited the road segment. Non-EVs on the other lane freeze immediately when the process starts, so we only investigate the movement of non-EVs in front of the EV. This study is also limited to the dynamic queue-jumper lane for a single EV on a link-level road segment.

3 Approximate Dynamic Programming Algorithm

In this section, we propose our approximate dynamic programming (ADP) algorithm to address the dynamic QJL problem with uncertain driver behavior.

To decide when and what instructions the centralized system should send to each non-EV to establish a dynamic queue-jumper lane, we structure the model as a Markov decision process (MDP). An MDP is described by the tuple $(S, A, R, P, \gamma)$, namely the state space, action space, reward collection, transition probability matrix, and discount factor.

3.1 Environment Setup for MDP

Taking advantage of the discretization of the road segment, we further label each cell to turn the road segment into a 2-dimensional grid environment. We also label the non-EVs on the upper lane in front of the EV in sequence, and label the cells of the grid environment vertically. An exit space appears at the end of the upper lane, allowing non-EVs to exit the road segment if they cannot find a pull-over space. After this labeling, the road segment can be visualized as in Fig. 2:

Figure 2: MDP grid environment.

3.2 State

The centralized system describes the coordinates of all non-EVs on the upper lane at timestamp $t$ as a collection $s_t = (x_1^t, x_2^t, \dots, x_n^t)$, where $x_i^t$ denotes the coordinate of the $i$-th vehicle in the grid environment.

For a two-lane road segment with $N$ cells in the longitudinal direction and the exit space at the end of the road segment, there are $2N+1$ cells in which each non-EV on the upper lane can be positioned. Therefore, the size of the state space is $(2N+1)^n$, where $n$ is the number of non-EVs on the upper lane.

In the example shown in Fig. 2, the state is the vector of cell labels occupied by the non-EVs on the upper lane. It can further be encoded as a 32-bit string for convenient storage and computation.
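To make the size of the state space concrete, the sketch below counts the states for a given number of cells and non-EVs and shows one possible string encoding of a state. The fixed-width decimal encoding and the function names are assumptions for illustration; the exact string format used in the study is not specified here.

```python
def state_space_size(n_cells, n_vehicles):
    """(2N + 1)^n: each non-EV may occupy any of the 2N lane cells or the exit."""
    return (2 * n_cells + 1) ** n_vehicles

def encode_state(cell_labels, width=2):
    """Concatenate zero-padded cell labels into a single string key (hypothetical format)."""
    return "".join(f"{label:0{width}d}" for label in cell_labels)

def decode_state(key, width=2):
    """Inverse of encode_state."""
    return [int(key[i:i + width]) for i in range(0, len(key), width)]

print(state_space_size(10, 2))   # states for N = 10 cells and n = 2 non-EVs
print(encode_state([3, 12]))     # two non-EVs in cells labeled 3 and 12
```

Even for this small case the count is in the hundreds, which illustrates why a lookup table becomes impractical as the number of non-EVs grows.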

3.3 Action

Each non-EV on the upper lane can take one of three actions: a = {cruise forward, pull over, remain still}. There are three situations in which a non-EV remains still in its current position: 1. the non-EV has already pulled over into the lower lane or exited; 2. the non-EV is performing a pull-over but fails to complete it within this timestamp because of the uncertainty in pulling over time; 3. the non-EV is trying to cruise forward but is blocked by another non-EV that remains still.

Since we consider the collection of all non-EVs, the action is also a vector collecting the specific action of each individual non-EV: $a_t = (a_1^t, a_2^t, \dots, a_n^t)$. The size of the action space involving $n$ non-EVs is $3^n$.

The action vector can also be encoded in the same format as the state. Each character of this string indicates the action for the corresponding non-EV: 2 represents cruising forward, 1 represents pulling over into the lower lane, and 0 represents remaining still. For example, an action telling all non-EVs to cruise forward is represented as a string of $n$ 2s.
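The joint action space can be enumerated directly from the digit codes given above. The following is a minimal sketch; the constant and function names are hypothetical.

```python
from itertools import product

# Digit codes from the text: 2 = cruise forward, 1 = pull over, 0 = remain still.
STAY, PULL_OVER, CRUISE = 0, 1, 2

def all_actions(n_vehicles):
    """Enumerate every joint action string; there are 3**n_vehicles of them."""
    return ["".join(str(a) for a in combo)
            for combo in product((STAY, PULL_OVER, CRUISE), repeat=n_vehicles)]

actions = all_actions(2)
print(len(actions))   # 9 joint actions for two non-EVs
print(actions[-1])    # "22": both non-EVs cruise forward
```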

3.4 Reward

At every timestamp, if the EV has not passed the road segment, i.e., the EV has not reached the exit cell, we set the reward to -1. If the EV has passed this road segment, the reward is set to 0 to help the learning process converge. To discourage collisions between non-EVs, i.e. two non-EVs moving into the same cell other than the exit cell, the reward for any collision is set to -100.
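The reward rule can be written as a small function. This is a sketch under stated assumptions: the exit cell is given the hypothetical label `EXIT = -1`, the state is a list of cell labels, and a collision is assumed to take precedence over the per-timestamp penalty.

```python
EXIT = -1  # hypothetical label for the exit cell

def reward(cell_labels, ev_passed):
    """Reward for one timestamp given the cells occupied by all non-EVs."""
    occupied = [c for c in cell_labels if c != EXIT]
    if len(occupied) != len(set(occupied)):
        return -100            # two non-EVs share a cell other than the exit
    return 0 if ev_passed else -1

print(reward([3, 4], ev_passed=False))
print(reward([3, 3], ev_passed=False))
```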

3.5 Transition Probability

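$P_{ss'}^{a}$ represents the probability of a transition from state $s$ under action $a$ into a new state $s'$. Although the uncertainty in non-EV pulling over time is modeled with a geometric distribution, the distribution is memoryless: the probability that a non-EV that has not yet pulled over in this timestamp successfully pulls over in the next timestamp is still $p$. Based on our definition of the actions, a non-EV instructed to pull over moves into the lower lane with probability $p$ and remains in its cell with probability $1-p$, whereas cruising forward and remaining still are deterministic.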
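A single-vehicle version of this transition can be sampled as follows. The sketch makes several assumptions for illustration: positions are `(lane, column)` pairs, a successful pull-over lands in the lower-lane cell of the same column, cruising past the last column leads to the exit, and the `blocked` flag is supplied by the caller.

```python
import random

UPPER, LOWER = 0, 1
EXIT = "exit"  # hypothetical marker for the exit space

def step_vehicle(pos, action, p, n_cells, rng, blocked=False):
    """Sample the next position of one non-EV.

    pos is (lane, column) or EXIT; action is 2 (cruise), 1 (pull over) or 0 (stay).
    """
    if pos == EXIT or pos[0] == LOWER or action == 0:
        return pos                      # already cleared, or told to remain still
    lane, col = pos
    if action == 2:                     # cruising forward is deterministic
        if blocked:
            return pos
        return EXIT if col + 1 >= n_cells else (lane, col + 1)
    # action == 1: the pull-over succeeds with probability p in this timestamp
    return (LOWER, col) if rng.random() < p else pos

rng = random.Random(1)
print(step_vehicle((UPPER, 3), 2, 0.5, 10, rng))
```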

3.6 Discount Factor

$\gamma$ indicates how important future rewards are to the current state. To help the learning process converge, we fix the value of $\gamma$ throughout training.

3.7 Q-Learning

To deal with stochastic transition probability in this problem, we can utilize Q-learning, a model-free learning algorithm, to cope with the uncertainty of non-EV pulling over time. Our goal is to yield a policy for the centralized system to broadcast real-time instructions for each non-EV in order to establish a queue-jumper lane in the shortest amount of time.

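Under a policy $\pi$, the combination of a state $s$ and an action $a$ taken in that state yields a state-action value as in (1):

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, s_t = s,\ a_t = a\right] \tag{1}$$

In (1), $\mathbb{E}_{\pi}$ denotes the expectation under the stochastic policy $\pi$, and $Q^{\pi}(s, a)$ represents the expected long-term reward collected by the agent when it chooses action $a$ in state $s$ and follows policy $\pi$ afterwards. The Q function can be written recursively as:

$$Q^{\pi}(s, a) = \sum_{s'} P_{ss'}^{a}\left[R_{ss'}^{a} + \gamma \sum_{a'} \pi(a' \mid s')\, Q^{\pi}(s', a')\right] \tag{2}$$

where $P_{ss'}^{a}$ is the probability that the state moves to $s'$ when action $a$ is taken in state $s$, and $R_{ss'}^{a}$ represents the reward for that move.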

From (2), the Q function under the optimal policy should satisfy Bellman’s optimality equation:

$$Q^{*}(s, a) = \sum_{s'} P_{ss'}^{a}\left[R_{ss'}^{a} + \gamma \max_{a'} Q^{*}(s', a')\right] \tag{3}$$
When the numbers of states and actions are finite, a simple tabular Q-learning algorithm can be initialized and updated through the centralized system’s experience, as introduced by [13]:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)\right] \tag{4}$$

where $\alpha$ represents the learning rate. Under this algorithm, the Q table converges to the optimal Q function. Under the traditional Q-learning approach, all non-EVs initially act naively or randomly, and the observed reward is used to update the corresponding $Q(s_t, a_t)$. The centralized system then plans the next action for the next state based on the collected values and updates the entry for the new state and new action. The iterations of Q-learning eventually maximize the reward and produce the optimal policy.
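A minimal sketch of one tabular Q-learning update is given below. The learning rate `alpha = 0.1`, the discount factor `gamma = 0.9`, and the string-valued states and actions are hypothetical choices for illustration only.

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply one temporal-difference update to the table entry Q[(s, a)]."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    td_error = r + gamma * best_next - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return Q[(s, a)]

Q = defaultdict(float)        # unseen state-action pairs start at 0
actions = ["0", "1", "2"]     # remain still, pull over, cruise forward
first = q_update(Q, "05", "1", -1, "05", actions)
second = q_update(Q, "05", "1", -1, "05", actions)
print(first, second)
```

Starting from an all-zero table, the first update moves the entry one tenth of the way towards the target of -1, and the second update moves it further in the same direction.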

3.8 Deep Q Network

In this Markov decision process framework, the size of the state space is $(2N+1)^n$ and that of the action space is $3^n$, both of which grow exponentially with $n$, the number of non-EVs that need to pull over. The state space additionally grows polynomially with $N$, the number of cells in the longitudinal direction. Thus, the traditional tabular Q-learning algorithm cannot handle the memory required to store the table or the time required to search for or update a given state-action value. To improve efficiency with respect to memory and time, we propose using a Deep Q Network (DQN), introduced by [14], to approximate the state-action values used to select an action in each state.

3.8.1 Design of the Deep Q Network

The DQN has two identical neural networks, an evaluation network and a target network. For each neural network, the input layer takes the feature vector of the state of all non-EVs. In this framework, the state vector itself is the feature vector, since we judge whether or not the queue-jumper lane has been established from the locations of the non-EVs. The output layer yields all possible state-action values. Thus, the input layer has $n$ neurons and the output layer has $3^n$ neurons.

Generally speaking, adding hidden layers can increase the accuracy a neural network achieves. Since the inputs, i.e. coordinates on the grid network, are simple numerical values with a linear relationship, one hidden layer is enough to reach high accuracy without additional training time. Subject to sufficient accuracy, the number of neurons in the hidden layer should also be kept small to prevent overfitting. In our neural network, we use 10 neurons in the hidden layer.

Finally, we select the Rectified Linear Unit (ReLU) as the activation function on the hidden layer because of its better training performance with respect to gradient attenuation [15].

A DQN for 2 non-EVs that need to pull over on this road segment has the neural network structure shown in Fig. 3. The neural network yields the state-action values, and the learner chooses the action with the largest state-action value.

Figure 3: Neural network structure for 2 non-EVs system
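The forward pass of such a network can be sketched in plain Python. The layer sizes follow the description above for two non-EVs (2 inputs, 10 hidden ReLU units, $3^2 = 9$ outputs); the weight initialization range, the example input, and all function names are assumptions for illustration, and the weights are untrained.

```python
import random

def init_layer(n_in, n_out, rng):
    """Create a dense layer as (weights, biases) with small random weights."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def dense(x, layer):
    """Affine map: one output per weight row."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]

def relu(v):
    return [max(0.0, z) for z in v]

def q_values(state, hidden, output):
    """One hidden ReLU layer followed by a linear output layer."""
    return dense(relu(dense(state, hidden)), output)

n = 2                                   # number of non-EVs
rng = random.Random(0)
hidden = init_layer(n, 10, rng)         # 10 hidden neurons
output = init_layer(10, 3 ** n, rng)    # one output per joint action
q = q_values([3.0, 5.0], hidden, output)
best_action_index = max(range(len(q)), key=q.__getitem__)
print(len(q), best_action_index)
```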

3.8.2 Training of the DQN

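A Deep Q Network can be viewed as a combination of a Q-learning algorithm and a neural network with experience replay and a fixed Q target. According to [14], the loss function that can be used to train this neural network is:

$$L(w) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s', a'; w) - Q(s, a; w)\right)^{2}\right] \tag{5}$$

where $w$ refers to the weights of this neural network and $r + \gamma \max_{a'} Q(s', a'; w)$ represents the expected long-term reward. Taking the partial derivative with respect to $w$, and treating the target term as a constant, we get:

$$\frac{\partial L(w)}{\partial w} = -2\,\mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s', a'; w) - Q(s, a; w)\right)\frac{\partial Q(s, a; w)}{\partial w}\right] \tag{6}$$

From (6), we can perform stochastic gradient descent to update $w$ and, accordingly, all weights of this neural network.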
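For concreteness, one stochastic gradient descent step on a single sampled transition $(s, a, r, s')$ can be written as follows. Here $L(w)$ denotes the squared-error loss, $Q(s, a; w)$ the network output, $y = r + \gamma \max_{a'} Q(s', a'; w)$ the target (held constant during differentiation), and $\eta > 0$ a step size; the symbol $\eta$ is an assumption introduced only for this illustration.

```latex
\begin{align*}
  L(w) &= \bigl(y - Q(s, a; w)\bigr)^{2}, \\
  \frac{\partial L(w)}{\partial w}
       &= -2\,\bigl(y - Q(s, a; w)\bigr)\,\frac{\partial Q(s, a; w)}{\partial w}, \\
  w &\leftarrow w - \eta\,\frac{\partial L(w)}{\partial w}
     = w + 2\eta\,\bigl(y - Q(s, a; w)\bigr)\,\frac{\partial Q(s, a; w)}{\partial w}.
\end{align*}
```

The update therefore moves $Q(s, a; w)$ towards the target $y$ whenever the two differ.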

3.8.3 Experience Replay and Fixed Q target

For states that the central system has never visited, we need an evaluation function to approximate the rewards. Updating the weights of the neural network for a specific state-action pair also changes the values $Q(s, a)$ of other state-action pairs, which may significantly increase the training time or even prevent convergence [16][17]. Experience replay, introduced by [18], stores each experience as a tuple $(s_t, a_t, r_t, s_{t+1})$ in an experience history queue $D$. An off-policy Q-learning algorithm benefits from randomly selecting a mini-batch of experience tuples from $D$, so that each stored tuple has an equal chance of being selected for training.

Another important characteristic powering DQN is the fixed Q target. After every fixed number of training steps, we replace the weights of the target network with those of the evaluation network. Between these replacements, the weights of the target network are held fixed to increase the efficiency of training. In mathematical terms, instead of minimizing the previous loss function (5), we minimize the new loss function (7) below:

$$L(w) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s', a'; w^{-}) - Q(s, a; w)\right)^{2}\right] \tag{7}$$

where $w^{-}$ is the fixed weight parameter of the target network, which is updated only once every fixed number of training steps.
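Both mechanisms are simple to sketch in Python. The class and function names, the buffer capacity, and the dictionary representation of network weights below are assumptions for illustration, not the implementation used in the study.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity experience history queue with uniform sampling."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are dropped first
        self.rng = random.Random(seed)

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        """Draw a mini-batch without replacement; every tuple is equally likely."""
        k = min(batch_size, len(self.buffer))
        return self.rng.sample(list(self.buffer), k)

def sync_target(eval_weights, target_weights, step, every):
    """Copy the evaluation weights into the target weights once every `every` steps."""
    if step % every == 0:
        target_weights.clear()
        target_weights.update(eval_weights)

buf = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buf.store(t, "1", -1, t + 1)
print(len(buf.buffer), buf.sample(2))
```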

3.9 Algorithm Overview

Summarizing the above and the DQN training algorithm introduced by [14], we propose a modified Deep Q Network algorithm to solve the DQJL problem with $n$ non-EVs. See Algorithm 1.

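Initialize experience history queue $D$ with a chosen mini-batch size
Initialize evaluation network $Q$ with set of weights $w$
Initialize target network $\hat{Q}$ with set of weights $w^{-} = w$
for each DQJL training episode do
   Initialize a random state $s_0$
   for each DQJL training step $t$ do
      Select an action $a_t$ to perform in $s_t$
      Observe the reward $r_t$ and the next state $s_{t+1}$
      Store the tuple $(s_t, a_t, r_t, s_{t+1})$ into $D$
      Collect experience samples $(s_j, a_j, r_j, s_{j+1})$ from $D$ with size of the mini-batch
      Transform $(s_j, a_j, r_j, s_{j+1})$ into a training pair $(x_j, y_j)$ by setting $x_j = (s_j, a_j)$ and $y_j = r_j + \gamma \max_{a'} \hat{Q}(s_{j+1}, a'; w^{-})$
      Update $w$ for the training pair $(x_j, y_j)$ according to the loss function (7)
      Reset the target network with the evaluation network every few steps
   end for
end for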
Algorithm 1 DQN Algorithm for Centralized DQJL
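The control flow of Algorithm 1 can be exercised on a toy problem. The sketch below is deliberately simplified and rests on several assumptions that are not part of the original method: a lookup table stands in for each neural network, actions are chosen epsilon-greedily, the environment contains a single non-EV, and every constant is hypothetical.

```python
import random
from collections import deque, defaultdict

# All constants are hypothetical and chosen only to keep the example small.
N_CELLS, P_SUCCESS, GAMMA, ALPHA, EPSILON = 5, 0.5, 0.9, 0.1, 0.2
ACTIONS = (0, 1, 2)   # 0 = remain still, 1 = pull over, 2 = cruise forward
CLEARED = -1          # toy terminal state: the non-EV has left the EV's lane

def env_step(col, action, rng):
    """Toy one-vehicle environment; returns (reward, next_state)."""
    if action == 1 and rng.random() < P_SUCCESS:
        return 0, CLEARED              # pull-over succeeded
    if action == 2:
        col += 1
        if col >= N_CELLS:
            return 0, CLEARED          # vehicle reached the exit
    return -1, col                     # lane not yet cleared

def train(episodes=200, max_steps=50, batch=8, sync_every=10, seed=0):
    rng = random.Random(seed)
    memory = deque(maxlen=1000)        # experience history queue D
    q_eval = defaultdict(float)        # table standing in for the evaluation network
    q_target = defaultdict(float)      # table standing in for the target network
    steps = 0
    for _ in range(episodes):
        s = rng.randrange(N_CELLS)     # random initial state
        for _ in range(max_steps):
            if rng.random() < EPSILON:                         # explore
                a = rng.choice(ACTIONS)
            else:                                              # exploit
                a = max(ACTIONS, key=lambda x: q_eval[(s, x)])
            r, s_next = env_step(s, a, rng)
            memory.append((s, a, r, s_next))
            sample = rng.sample(list(memory), min(batch, len(memory)))
            for sj, aj, rj, sj_next in sample:
                future = 0.0 if sj_next == CLEARED else max(
                    q_target[(sj_next, x)] for x in ACTIONS)
                y = rj + GAMMA * future                        # fixed-target value
                q_eval[(sj, aj)] += ALPHA * (y - q_eval[(sj, aj)])
            steps += 1
            if steps % sync_every == 0:                        # reset the target
                q_target = defaultdict(float, q_eval)
            if s_next == CLEARED:
                break
            s = s_next
    return q_eval

q = train()
print({a: round(q[(0, a)], 3) for a in ACTIONS})
```

Replacing the two tables with neural networks and the table update with a gradient step on the loss would recover the structure of the full algorithm.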

4 Comparison with Decentralized System

In this section, we validate the ADP algorithm in traffic simulation software against the simulation results of the decentralized/benchmark system. The comparison shows that the centralized system has a shorter QJL establishment time.

4.1 Benchmark System Simulation

In a decentralized/benchmark system, every non-EV driver is selfish and tries to pull over into the nearest space when helping to establish the QJL. However, this motion planning principle incurs system-wide time inefficiency, because the following non-EV drivers may need more time to find a space and pull over. The uncertainty in pulling over time can worsen the situation, since following non-EV drivers have to wait until the non-EV ahead of them successfully pulls over. A simple example is shown in Fig. 4.

Figure 4: An example of queue-jumper lane establishment for two systems

In the benchmark system, the queue-jumper lane establishment time is equal to the pull-over time of the red car; in the centralized system, it is equal to the pull-over time of the yellow car, which is significantly shorter. To validate that the centralized system can establish a dynamic queue-jumper lane faster than the benchmark system, we use Simulation of Urban Mobility (SUMO) [19] to examine our results.

SUMO has an existing module named Emergency Vehicle Simulation, introduced by [20]. Under this module, a blue light device, i.e. an EV, is able to overtake on the right, disregard the right of way, and exceed the speed limit. All non-EVs share identical parameters. We perform the simulation on the problem shown in Fig. 2. Table 1 lists the parameters we feed into the SUMO simulation for the benchmark system:

Value     Description
4 m/s     non-EV’s departure speed
8 m/s     non-EV’s max speed
80 m      length of this road segment
4.5 m     length of a non-EV
24 m/s    EV’s departure speed
30 m/s    EV’s max speed
Table 1: Parameters for benchmark system SUMO simulation

A snapshot of the SUMO simulation is shown in Fig. 6. The result shows that the benchmark system takes 10.2 seconds to form a QJL.

4.2 Centralized System Simulation

To embed our Deep Q Network into SUMO, we train the neural network offline in advance. With the trained weight parameters, we can obtain the optimal action for any state in real time. The pipeline for this interface is shown in Fig. 5. During the process, we exchange the state and the action between SUMO and our neural network every 2 seconds. For the states, since we discretize the road segment, we snap the non-EV positions output by SUMO to the nearest cells. For the actions selected by the neural network, we set the corresponding direction and velocity of the vehicles in SUMO.

Figure 5: Flowchart for NN-SUMO interaction
Figure 6: Snapshot of SUMO simulation

The centralized system with the same parameters yields a consistent queue-jumper lane establishment time of 8.9 seconds.

The result is easy to interpret intuitively, as we are minimizing the longest pull-over time in the system. The coordinated algorithm balances the pull-over time across all non-EVs. Therefore, even though some non-EVs experience a longer pull-over time, the system-wide pull-over time is reduced.

The comparison of the two systems shows a 12.74% decrease in the time the centralized system needs to form a queue-jumper lane. However, the results are expected to vary with the road congestion level, background traffic speed, road length, and initial vehicle positions. Ideally, the difference between the two systems approaches zero when the road is fully congested or empty.
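As a quick check of the reported improvement, the relative reduction follows directly from the two establishment times of 10.2 seconds (benchmark) and 8.9 seconds (centralized):

```latex
\[
  \frac{10.2 - 8.9}{10.2} = \frac{1.3}{10.2} \approx 0.1275 ,
\]
```

that is, a reduction of roughly 12.7%, in line with the figure quoted above.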

5 Conclusion

In this paper, we propose a novel stochastic dynamic programming algorithm for the dynamic queue-jumper lane problem. Utilizing the Markov decision process framework and the computing power of a two-layer Deep Q Network, we formulate a centralized system that can dispatch real-time instructions to all non-EVs to form a queue-jumper lane for an EV. Finally, the simulation results show that our approach forms a QJL faster than the benchmark/decentralized system.

In future work, a sensitivity analysis will be conducted to investigate the impact of different factors, including background traffic speed, road length, and congestion level, on system performance. The interaction between our trained neural network and SUMO can also be improved. To represent the positions of the vehicles more accurately, we could discretize the road environment into more cells, which would reduce the estimation error in the NN-SUMO communication.


  • [1] New York End-To-End Response Times, 2019 (accessed February 28, 2020). [Online]. Available: https://www1.nyc.gov/site/fdny/about/resources/data-and-analytics/end-to-end-response-times.page
  • [2] Emergency Response Incidents, 2014 (accessed February 28, 2020). [Online]. Available: https://data.cityofnewyork.us/Public-Safety/Emergency-Response-Incidents/pasr-j7fb
  • [3] Heart Disease and Stroke Statistics, 2013 (accessed February 28, 2020). [Online]. Available: https://cpr.heart.org/AHAECC/CPRAndECC/ResuscitationScience/UCM_477263_AHA-Cardiac-Arrest-%20Statistics.jsp%5BR=301,L,NC%5D
  • [4] G. Zhou and A. Gan, “Performance of transit signal priority with queue jumper lanes,” Transportation Research Record, vol. 1925, no. 1, pp. 265–271, 2005.
  • [5] B. Cesme, S. Z. Altun, and B. Lane, “Queue jump lane, transit signal priority, and stop location evaluation of transit preferential treatments using microsimulation,” Transportation Research Record, vol. 2533, no. 1, pp. 39–49, 2015.
  • [6] Y. Z. Farid, E. Christofa, and J. Collura, “Dedicated bus and queue jumper lanes at signalized intersections with nearside bus stops,” Transportation Research Record: Journal of the Transportation Research Board, vol. 2484, pp. 182–192, 2015.
  • [7] A. Buchenscheit, F. Schaub, F. Kargl, and M. Weber, “A vanet-based emergency vehicle warning system,” 2009 IEEE Vehicular Networking Conference (VNC), pp. 1–8, 2009.
  • [8] D. Krajzewicz, G. Hertkorn, C. Rössel, and P. Wagner, “An example of microscopic car models validation using the open source traffic simulation sumo,” in 14th European Simulation Symposium, ser. SCS European Publishing House, vol. Jahrgang 2002, 2002, pp. 318–322. [Online]. Available: http://elib.dlr.de/6657/
  • [9] F. Zuo, K. Ozbay, A. Kurkcu, J. Gao, H. Yang, and K. Xie, “Microscopic simulation based study of pedestrian safety applications at signalized urban crossings in a connected-automated vehicle environment and reinforcement learning based optimization of vehicle decisions,” in Road Safety and Simulation, 10 2019.
  • [10] X. Xiong, J. Wang, F. Zhang, and K. Li, “Combining deep reinforcement learning and safety based control for autonomous driving,” ArXiv, vol. abs/1612.00147, 2016.
  • [11] T. Schouwenaars, B. De Moor, E. Feron, and J. How, “Mixed integer programming for multi-vehicle path planning,” in 2001 European Control Conference (ECC), Sep. 2001, pp. 2603–2608.
  • [12] G. J. Hannoun, P. Murray-Tuite, K. Heaslip, and T. Chantem, “Facilitating emergency response vehicles’ movement through a road segment in a connected vehicle environment,” IEEE Transactions on Intelligent Transportation Systems, vol. 20, no. 9, pp. 3546–3557, Sep. 2019.
  • [13] S. Ohnishi, E. Uchibe, Y. Yamaguchi, K. Nakanishi, Y. Yasui, and S. Ishii, “Constrained deep q-learning gradually approaching ordinary q-learning,” Frontiers in Neurorobotics, vol. 13, p. 103, 2019.
  • [14] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015. [Online]. Available: http://dx.doi.org/10.1038/nature14236
  • [15] J. Brownlee, A Gentle Introduction to the Rectified Linear Unit (ReLU), 2019 (accessed February 28, 2020). [Online]. Available: shorturl.at/awCV3
  • [16] C. You, Q. Yang, L. Gjesteby, G. Li, S. Ju, Z. Zhang, Z. Zhao, Y. Zhang, W. Cong, G. Wang, et al., “Structurally-sensitive multi-scale deep neural network for low-dose ct denoising,” IEEE Access, vol. 6, pp. 41839–41855, 2018.
  • [17] C. You, G. Li, Y. Zhang, X. Zhang, H. Shan, M. Li, S. Ju, Z. Zhao, Z. Zhang, W. Cong, et al., “Ct super-resolution gan constrained by the identical, residual, and cycle learning ensemble (gan-circle),” IEEE Transactions on Medical Imaging, 2019.
  • [18] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience replay,” CoRR, vol. abs/1511.05952, 2015.
  • [19] P. A. Lopez, M. Behrisch, L. Bieker-Walz, J. Erdmann, Y.-P. Flötteröd, R. Hilbrich, L. Lücken, J. Rummel, P. Wagner, and E. Wießner, “Microscopic traffic simulation using sumo,” in The 21st IEEE International Conference on Intelligent Transportation Systems. IEEE, 2018. [Online]. Available: https://elib.dlr.de/124092/
  • [20] M. Behrisch, L. Bieker, J. Erdmann, and D. Krajzewicz, “Sumo - simulation of urban mobility: An overview,” in SIMUL 2011, The Third International Conference on Advances in System Simulation, 2011, pp. 63–68.