1 Introduction
Increasing population and urbanization have made it exceedingly challenging to operate urban emergency services efficiently. For example, historical data from New York City (NYC), USA [1] shows that the number of emergency vehicle (EV) incidents has grown from 1,114,693 in 2004 to 1,352,766 in 2014, with corresponding average response times of 7:53 min and 9:23 min, respectively [2]. This means an approximately 20% increase in response times in ten years. In the case of cardiac arrest, every minute until defibrillation reduces survival chances by 7% to 10%, and after 8 minutes there is little chance for survival [3]. Cities are less resilient with worsening response times from EVs (ambulances, fire trucks, police cars), mainly due to traffic congestion.
The performance of these EV service systems in congested traffic can be improved with technology. As a core of modern intelligent transportation systems (ITSs), wireless vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) connectivity provide significant opportunities for improving urban emergency response. On the one hand, wireless connectivity gives EVs the traffic conditions on possible routes between the station (hospital, fire station, police station, etc.) and the call, which enables more efficient dispatch and routing. On the other hand, through V2V/V2I communications, traffic managers can broadcast the planned route of EVs to non-EVs that may be affected, and non-EVs can cooperate to form dynamic queue-jump lanes (QJLs) for approaching EVs.
In response to these challenges, this paper develops a methodology for utilizing V2V/V2I connectivity to improve EV services. We design link-level coordination strategies for non-EVs to quickly establish dynamic queue-jump lanes (DQJLs) for EVs while maintaining safety. We incorporate state-of-the-art deep learning methods to deal specifically with the randomness of driver behavior and devise a scalable solution to the QJL problem. The models are implemented and the results are validated through traffic simulation software.
Although QJLs are a relatively new technology, literature is available documenting their positive effects in reducing travel time variability, especially when used in conjunction with transit signal priority (TSP). However, these studies are all based on moving-bottleneck models for buses [4, 5, 6]; such models do not directly apply to our setting, since EVs typically move faster than non-EVs and since EVs can "preempt" non-EV traffic because of their priority. In addition, QJLs have not been studied as a dynamic control strategy. The establishment of DQJLs involves real-time motion planning for both EVs and non-EVs, which has been a focus of robotics in both deterministic and stochastic settings [7]. However, although robotic motion planning algorithms provide useful insights, they do not directly apply to EVs, since human drivers cannot follow complex paths and react instantaneously as robots do. Furthermore, coordination algorithms for multiple robots are hardly applicable to traffic management due to the high randomness in drivers' reactions to coordination instructions. Instead, human drivers need driving strategies that are easy to interpret and implement, and that preferably depend only on the movement of neighboring vehicles; see [8]. [9] illustrates the use of dynamic programming to prevent vehicle-passenger collisions, and [10] shows how to use deep learning methods to ensure road safety. Mixed integer programming has been utilized in routing problems for multiple vehicles in different tasks, e.g., in [11]. In particular, [12] considered an integer linear program formulation for the DQJL problem in the centralized and deterministic setting, which provides a baseline but does not account for the randomness of driver behavior.
In this paper, we model the DQJL problem as a Markov decision process to cope with the uncertainty in drivers' behavior. We also introduce approximate dynamic programming (ADP), including a deep neural network, to address the complexity of this framework and solve the DQJL problem. We validate our results in traffic simulation software against a benchmark system.
Our results indicate that, by using ADP, the coordinated system can establish a DQJL faster than the benchmark/decentralized system in an urban environment. By incorporating our ADP algorithm, a coordinated system saves roughly 12% of the establishment time relative to the benchmark system, creating a critical time window for emergency vehicles to complete their tasks. Our ADP algorithm is capable of dealing with more sophisticated scenarios such as longer road segments, mixed vehicle types, and different congestion levels.
The rest of this paper is organized as follows. In Section 2, we model the establishment of DQJLs in a discretized road environment and capture the uncertainty with a geometric distribution. In Section 3, we propose our ADP algorithm to solve this extended DQJL problem. In Section 4, the results are validated through a simulation analysis against the benchmark system, and we discuss the insights behind the results.

2 Modeling a DQJL problem with uncertain driver behavior
In this section, we elaborate on how we formulate and model the DQJL problem in an urban road environment.
To model the establishment of a dynamic queue-jump lane, i.e., the path clearance process, for an emergency vehicle (EV), consider a typical urban road segment consisting of two lanes in the same direction. When an EV requests to pass this road segment, the centralized/coordinated vehicle-to-vehicle system sends real-time instructions to all non-EVs on the segment. We assume the EV always travels on one lane. When the EV approaches the road section, all non-EVs on the other lane immediately freeze. All non-EVs in front of the EV are instructed to cruise forward or pull over to clear a path for the EV. If a non-EV cannot find a suitable pull-over space, it can exit at the end of the road segment. However, the pull-over response time of each non-EV is uncertain, and the centralized system needs to address this uncertainty during the process.
Assuming that the speed of an EV on a mission is much faster than the non-EV cruising speed, the position of the EV is immediately behind the last vehicle that has not pulled over or exited the road segment. When there is no vehicle in front of the EV, the dynamic queue-jump lane has been established for this EV and the process is complete.
2.1 Problem Statement
Given a 2-lane directed link segment of length $L$ with $n$ non-EVs, how should the centralized system instruct all non-EVs, whose pull-over response times are uncertain, so that the time to establish a dynamic queue-jump lane for an EV is minimized?
2.2 Road Discretization
This study is based on homogeneous timestamps, meaning that the centralized system gathers all vehicle coordinates and sends instructions to vehicles at the end of each second. Accordingly, we discretize the road segment into a grid network of cells of equal length; i.e., if a vehicle is instructed to cruise forward to find a further pull-over space, it moves forward by one cell. See Fig. 1.
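As a concrete illustration of this discretization, the mapping from continuous positions to cells can be sketched as follows (the cell length here is an assumed value, roughly one vehicle length; the text does not state it explicitly):

```python
# Map continuous road positions (meters) to discrete cells; a "cruise
# forward" instruction advances a vehicle by exactly one cell.
CELL_LENGTH = 4.5  # meters; assumed, roughly one non-EV length

def to_cell(position_m, cell_length=CELL_LENGTH):
    """Return the index of the cell containing a longitudinal position."""
    return int(position_m // cell_length)

def cruise_forward(cell_index):
    """One 'cruise forward' instruction moves the vehicle one cell ahead."""
    return cell_index + 1
```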
2.3 Uncertainty in NonEV Pulling Over Time
Non-EV pull-over time varies from driver to driver, creating large uncertainty for the system. A geometric distribution is utilized to model this uncertainty. The number of timestamps until a driver successfully pulls over is a random variable $X \sim \mathrm{Geometric}(p)$, where $p$ represents the probability of success at each timestamp. That is, the probability that a driver finishes pulling over exactly at the $k$-th timestamp after receiving the instruction is $P(X = k) = (1-p)^{k-1} p$.

2.4 Assumptions
In the proposed model, we assume the positions and kinetic characteristics of all non-EVs are known through the connected environment. Each cell can be occupied by only one non-EV, per the definition of the road discretization. Since the EV is on a mission, its speed is significantly higher than the cruising speed of non-EVs, so its real-time position is updated as strictly behind the last non-EV that has not pulled over or exited the road segment. Non-EVs on the other lane freeze immediately when the process starts, so we only investigate the movement of non-EVs in front of the EV. This study is also limited to the dynamic queue-jump lane for a single EV on a link-level road segment.
3 Approximate Dynamic Programming Algorithm
In this section, we propose our approximate dynamic programming (ADP) algorithm to address the dynamic QJL problem with uncertain driver behavior.
To determine when and what instructions the centralized system should send to each non-EV to establish a dynamic queue-jump lane, we structure the model as a Markov decision process (MDP). An MDP is described by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P}, \gamma)$, namely the state space, action space, reward collection, transition probability matrix, and discount factor.
3.1 Environment Setup for MDP
Taking advantage of the discretization of the road segment, we label each cell to turn the road segment into a 2-dimensional grid environment. We also label the non-EVs on the upper lane in front of the EV starting from 1, and label the grid environment vertically. An exit space at the end of the upper lane allows non-EVs to exit the road segment if they cannot find a pull-over space. After this labeling of the environment, we can visualize the road segment as in Fig. 2:
3.2 State
The centralized system describes the coordinates of all non-EVs on the upper lane at timestamp $t$ as a collection $s_t = (x_1, x_2, \dots, x_n)$, where $x_i$ denotes the coordinate of the $i$-th vehicle in the grid environment.
For a two-lane road segment with $N$ cells in the longitudinal direction and the exit space at the end of the road segment, there are $2N + 1$ cells in which each non-EV on the upper lane can be positioned. Therefore, the size of the state space is $(2N+1)^n$.
In the example shown in Fig. 2, the state is represented by the vector of all non-EV coordinates. It can further be encoded into a 32-bit string for convenient storage and computation.
3.3 Action
Each non-EV on the upper lane can take three actions: a = {cruise forward, pull over, remain still}. There are three situations in which a non-EV is advised to remain still in its current position: 1. when the non-EV has already pulled over into the lower lane or exited; 2. when the non-EV is performing a pull-over but fails to complete it within this timestamp due to the uncertainty in pull-over time; 3. when the non-EV, while trying to cruise forward, is blocked by another non-EV that remains still.
Since we consider the collection of all non-EVs, the action is also a vector collecting the specific action of each individual non-EV: $a_t = (a_1, a_2, \dots, a_n)$. The size of the action space involving $n$ non-EVs is $3^n$. The action vector can also be encoded into the same string format as the state. Each character of this string indicates the action of the corresponding non-EV: 2 represents cruising forward, 1 represents pulling over into the lower lane, and 0 represents remaining still. For example, an action telling all non-EVs to cruise forward is represented as a string of all 2s.
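The action encoding described above can be sketched directly (the helper names are ours):

```python
# Per-non-EV actions as defined in the text: 2 = cruise forward,
# 1 = pull over into the lower lane, 0 = remain still.
CRUISE, PULL_OVER, STILL = 2, 1, 0

def encode_action(actions):
    """Encode a joint action (one digit per non-EV) as a string."""
    return "".join(str(a) for a in actions)

def all_cruise(n):
    """The joint action telling all n non-EVs to cruise forward."""
    return encode_action([CRUISE] * n)

def action_space_size(n):
    """Three choices per non-EV gives 3**n joint actions."""
    return 3 ** n
```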
3.4 Reward
At every timestamp, if the EV has not passed the road segment, i.e., the EV has not reached the exit cell, we set the reward to be $-1$. Once the EV has passed the road segment, the reward is set to $0$ for the convenience of convergence of the learning process. To discourage non-EV collisions of any kind, i.e., non-EVs moving into the same cell other than the exit cell, the reward for any collision is set to $-100$.
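The reward rule can be sketched as follows, assuming the step and collision rewards are negative (the minus signs appear to have been lost in extraction) and that a collision means two non-EVs occupying the same non-exit cell:

```python
def has_collision(cells, exit_cell):
    """True if two non-EVs occupy the same cell other than the exit cell."""
    occupied = [c for c in cells if c != exit_cell]
    return len(occupied) != len(set(occupied))

def step_reward(ev_passed, collision):
    """-1 per timestamp until the EV clears the segment, 0 afterwards,
    and -100 whenever a collision occurs (assumed signs)."""
    if collision:
        return -100
    return 0 if ev_passed else -1
```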
3.5 Transition Probability
$P(s_{t+1} \mid s_t, a_t)$ represents the probability of transitioning from state $s_t$ under action $a_t$ into a new state $s_{t+1}$. Although the uncertainty in non-EV pull-over time is modeled with a geometric distribution, by the memoryless property the probability that a pulling-over non-EV, which has not pulled over by this timestamp, successfully pulls over at the next timestamp is still $p$.
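The geometric model and the memoryless property that justifies using the same success probability at every timestamp can be checked numerically:

```python
def pull_over_pmf(k, p):
    """P(X = k) = (1 - p)**(k - 1) * p: the driver first completes the
    pull-over at the k-th timestamp after receiving the instruction."""
    return (1 - p) ** (k - 1) * p if k >= 1 else 0.0

def next_step_success(k, p):
    """P(X = k + 1 | X > k): success at the next timestamp given the
    driver has not pulled over within the first k timestamps."""
    tail = (1 - p) ** k                      # P(X > k)
    return pull_over_pmf(k + 1, p) / tail    # equals p by memorylessness
```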
3.6 Discount Factor
The discount factor $\gamma$ indicates how important future rewards are relative to the current state. For the convenience of learning convergence, we set $\gamma < 1$.
3.7 QLearning
To deal with the stochastic transition probabilities in this problem, we utilize Q-learning, a model-free learning algorithm, to cope with the uncertainty in non-EV pull-over time. Our goal is to yield a policy for the centralized system to broadcast real-time instructions to each non-EV so that a queue-jump lane is established in the shortest amount of time.
Under a policy $\pi$, a state-action pair yields a state-action value as in (1):

(1) $Q^{\pi}(s, a) = \mathbb{E}_{\pi}\big[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \mid s_t = s, a_t = a\big]$

In (1), $Q^{\pi}(s, a)$ represents the expected long-term reward when the agent in state $s$ chooses action $a$ under the stochastic policy $\pi$. The Q function can be written recursively as:
(2) $Q^{\pi}(s, a) = \sum_{s'} P(s' \mid s, a)\big[R(s, a, s') + \gamma \sum_{a'} \pi(a' \mid s')\, Q^{\pi}(s', a')\big]$

where $P(s' \mid s, a)$ denotes the probability that the state transitions into $s'$ when taking action $a$ in state $s$, and $R(s, a, s')$ represents the reward for that move.
From (2), we can determine that the Q function under the optimal policy satisfies Bellman's optimality equation:

(3) $Q^{*}(s, a) = \sum_{s'} P(s' \mid s, a)\big[R(s, a, s') + \gamma \max_{a'} Q^{*}(s', a')\big]$
When the numbers of states and actions are finite, a simple tabular Q-learning algorithm can be initialized and updated through the centralized system's experience, as introduced in [13]:

(4) $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\big[r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)\big]$

where $\alpha$ represents the learning rate. Under suitable conditions, the Q table converges to the optimal Q function. In the traditional Q-learning approach, all non-EVs act naively or randomly, and the observed rewards update the corresponding $Q(s_t, a_t)$. The centralized system then plans the next action for the next state based on the collected values and updates $Q$ for the new state-action pair. Iterating Q-learning eventually maximizes the reward and produces the optimal policy.
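The tabular update above can be sketched with a dictionary-backed Q table (a minimal illustration, not the authors' implementation; the alpha and gamma values are placeholders):

```python
from collections import defaultdict

Q = defaultdict(float)  # Q table: (state_key, action_key) -> value

def q_update(Q, s, a, r, s_next, next_actions, alpha=0.1, gamma=0.9):
    """One Q-learning step: move Q(s, a) toward the bootstrapped target
    r + gamma * max over a' of Q(s', a')."""
    best_next = max((Q[(s_next, a2)] for a2 in next_actions), default=0.0)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]
```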
3.8 Deep Q Network
In this Markov decision process framework, the dimension of the state space is $(2N+1)^n$ and that of the action space is $3^n$, both growing exponentially with $n$, the number of non-EVs that need to pull over. The dimension of the state space grows even faster with the number of cells in the longitudinal direction. Thus, a traditional tabular Q-learning algorithm cannot handle the memory complexity, nor the time complexity of searching or updating a given state-action value. To improve efficiency with respect to memory and time, we propose using a Deep Q Network (DQN), introduced by [14], to approximate $Q(s, a)$ and select an action for each state.
3.8.1 Design of the Deep Q Network
The DQN has two identical neural networks, an evaluation network and a target network. For each network, the input layer is the feature vector of the state of all non-EVs. Under this framework, the state vector serves as the feature vector, since we judge whether the queue-jump lane has been established by the locations of the non-EVs. The output layer yields all possible state-action values. Thus, the input layer has $n$ neurons and the output layer has $3^n$ neurons.
Generally speaking, more hidden layers allow a neural network to achieve higher accuracy. Since the inputs, i.e., coordinates on the grid network, have simple numerical values and near-linear relationships, we only need one hidden layer to reach high accuracy without spending more training time. With accuracy assured, the number of neurons in the hidden layer should also be kept small to prevent overfitting. In our neural network, 10 neurons in the hidden layer suffice.
Finally, we select the Rectified Linear Unit (ReLU) as the activation function of the hidden layer because of ReLU's better training performance with respect to the attenuation of gradients [15]. A DQN for 2 non-EVs that need to pull over in the road segment has a neural network structure like Fig. 3. The neural network yields the state-action values, and the learner chooses the action with the largest state-action value.
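A numpy sketch of this architecture for n non-EVs follows; the weights here are random placeholders, whereas a trained network would load learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_qnet(n, hidden=10):
    """n input neurons, one hidden ReLU layer of 10 units, 3**n outputs."""
    return {"W1": rng.normal(0.0, 0.1, (n, hidden)), "b1": np.zeros(hidden),
            "W2": rng.normal(0.0, 0.1, (hidden, 3 ** n)),
            "b2": np.zeros(3 ** n)}

def q_values(net, state):
    """Forward pass: state vector of cell labels -> one Q per joint action."""
    h = np.maximum(0.0, state @ net["W1"] + net["b1"])  # ReLU hidden layer
    return h @ net["W2"] + net["b2"]

def greedy_action(net, state):
    """Index of the joint action with the largest state-action value."""
    return int(np.argmax(q_values(net, state)))
```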
3.8.2 Training of the DQN
A Deep Q Network can be viewed as a combination of a Q-learning algorithm and a neural network with experience replay and a fixed Q target. According to [14], the loss function used to train this neural network is:

(5) $L(\theta) = \mathbb{E}\big[\big(r + \gamma \max_{a'} Q(s', a'; \theta) - Q(s, a; \theta)\big)^{2}\big]$

where $\theta$ refers to the weights of this neural network and $r + \gamma \max_{a'} Q(s', a'; \theta)$ represents the expected long-term reward. Taking the partial derivative with respect to $\theta$, we get:

(6) $\nabla_{\theta} L(\theta) = -2\,\mathbb{E}\big[\big(r + \gamma \max_{a'} Q(s', a'; \theta) - Q(s, a; \theta)\big)\, \nabla_{\theta} Q(s, a; \theta)\big]$
From (6), we can perform stochastic gradient descent to update $\theta$ and, accordingly, all weights of this neural network.

3.8.3 Experience Replay and Fixed Q Target
For states the central system has never visited, we need an evaluation function to approximate their rewards. Updating the weights of the neural network for a specific state-action pair changes $Q(s, a; \theta)$ for other state-action pairs, which may significantly increase the training time or even cause failure to converge [16][17]. Experience replay, introduced by [18], stores experience as tuples $(s_t, a_t, r_t, s_{t+1})$ in an experience history queue $D$. An off-policy Q-learning algorithm benefits from randomly selecting minibatches of experience tuples from $D$, so that each memory tuple has an equal chance of being selected for training.
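A minimal replay buffer consistent with this description (the capacity and batch size are placeholder values):

```python
import random
from collections import deque

class ReplayBuffer:
    """Store (s, a, r, s_next) tuples and sample uniform minibatches."""

    def __init__(self, capacity=10000):
        self.memory = deque(maxlen=capacity)  # oldest tuples evicted first

    def push(self, s, a, r, s_next):
        self.memory.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform sampling gives every stored tuple an equal chance.
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)
```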
Another important characteristic powering the DQN is the fixed Q target. Every certain number of training steps, we replace the weights of the target network with those of the evaluation network; otherwise, we keep the target network's weights fixed to increase training efficiency. Mathematically, instead of minimizing the loss function (5), we minimize the new loss function (7):

(7) $L(\theta) = \mathbb{E}\big[\big(r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta)\big)^{2}\big]$

where $\theta^{-}$ is the fixed weight parameter and is only updated every certain number of training steps.
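The fixed-target mechanism can be sketched as a periodic weight copy (`sync_every` is an assumed hyperparameter; the paper does not state the sync interval):

```python
def maybe_sync_target(eval_weights, target_weights, step, sync_every=100):
    """Copy evaluation-network weights into the target network every
    `sync_every` training steps; otherwise leave the target frozen."""
    if step % sync_every == 0:
        target_weights.clear()
        target_weights.update(eval_weights)
    return target_weights
```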
3.9 Algorithm Overview
4 Comparison with Decentralized System
In this section, we validate our ADP algorithm in traffic simulation software against the decentralized/benchmark system. The comparison shows that the centralized system achieves a shorter QJL establishment time.
4.1 Benchmark System Simulation
In a decentralized/benchmark system, every non-EV driver is selfish and tries to pull over into the nearest space when facilitating QJL establishment. However, such a motion planning principle incurs system-wide time inefficiency, because the following non-EV drivers may need longer to find a space and pull over. The uncertainty in pull-over time can worsen the situation, as following non-EV drivers have to wait until the vehicle ahead successfully pulls over. A simple example is elaborated in Fig. 4.
In the benchmark system, the system queue-jump lane establishment time equals the pull-over time of the red car; for the centralized system, it equals the pull-over time of the yellow car, which is significantly shorter. To validate that the centralized system can establish a dynamic queue-jump lane faster than the benchmark system, we use Simulation of Urban Mobility (SUMO) [19] to examine our results.
SUMO has an existing module named Emergency Vehicle Simulation, introduced by [20]. Under this module, a blue-light device, i.e., an EV, is able to overtake on the right, disregard the right of way, and exceed the speed limit. All non-EVs share identical parameters. We perform the simulation on the problem shown in Fig. 2; Table 1 lists all the parameters fed into the SUMO simulation for the benchmark system:
Parameters  Value  Description
  4 m/s  non-EV departure speed
  8 m/s  non-EV maximum speed
  80 m  length of the road segment
  4.5 m  length of a non-EV
  24 m/s  EV departure speed
  30 m/s  EV maximum speed
A snapshot of the SUMO simulation is shown in Fig. 6. The result shows that the benchmark system takes 10.2 seconds to form a QJL.
4.2 Centralized System Simulation
To embed our Deep Q Network into SUMO, we train our neural network in advance. Using offline training with trained weight parameters, we can obtain the optimal action for any state in real time. During the process, we exchange the state and action between SUMO and our neural network every 2 seconds. For the states, since we discretize the road segment, we approximate the positions of non-EVs output from SUMO to the nearest cells. For the actions selected by the neural network, we set the corresponding direction and velocity of the vehicles in SUMO. The pipeline for performing the centralized system simulation is shown in Fig. 5.
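The state-approximation step of this pipeline, snapping SUMO's continuous positions to the nearest grid cells, can be sketched as follows (the cell length is an assumed parameter):

```python
def nearest_cell(position_m, cell_length=4.5):
    """Snap a continuous SUMO position (meters) to the nearest cell index."""
    return round(position_m / cell_length)

def snap_positions(positions_m, cell_length=4.5):
    """Approximate all non-EV positions reported by SUMO to grid cells."""
    return [nearest_cell(p, cell_length) for p in positions_m]
```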
The centralized system with the same selection of parameters yields a consistent queue-jump lane establishment time of 8.9 seconds.
The result is intuitive, as we are minimizing the longest pull-over time in the system. The coordinated algorithm balances the pull-over times across all non-EVs. Therefore, even though some non-EVs experience longer pull-over times, the system-wide pull-over time is reduced.
The comparison of the two systems shows a 12.74% decrease in the time the centralized system takes to form a queue-jump lane. However, the results will vary with road congestion level, background traffic speed, road length, and initial vehicle positions. In the limit, the difference between the two systems approaches zero when the road is fully congested or completely empty.
5 Conclusion
In this paper, we propose a novel stochastic dynamic programming algorithm for the dynamic queue-jump lane problem. Utilizing a Markov decision process framework and the computing power of a deep Q network with evaluation and target networks, we formulate a centralized system that dispatches real-time instructions to all non-EVs to form a queue-jump lane for an EV. The simulation results show that our approach forms a QJL faster than the benchmark/decentralized system.
In future work, a sensitivity analysis will be conducted to investigate the impact of different factors, including background traffic speed, road length, and congestion level, on system performance. The interface between our trained neural network and SUMO can also be improved. To represent vehicle positions more accurately, we could further discretize the road environment into more cells, reducing the estimation error in the NN-SUMO communication.
References
 [1] New York EndToEnd Response Times, 2019 (accessed February 28, 2020). [Online]. Available: https://www1.nyc.gov/site/fdny/about/resources/dataandanalytics/endtoendresponsetimes.page
 [2] Emergency Response Incidents, 2014 (accessed February 28, 2020). [Online]. Available: https://data.cityofnewyork.us/PublicSafety/EmergencyResponseIncidents/pasrj7fb
 [3] Heart Disease and Stroke Statistics, 2013 (accessed February 28, 2020). [Online]. Available: https://cpr.heart.org/AHAECC/CPRAndECC/ResuscitationScience/UCM_477263_AHACardiacArrest%20Statistics.jsp%5BR=301,L,NC%5D
 [4] G. Zhou and A. Gan, “Performance of transit signal priority with queue jumper lanes,” Transportation Research Record, vol. 1925, no. 1, pp. 265–271, 2005.
 [5] B. Cesme, S. Z. Altun, and B. Lane, “Queue jump lane, transit signal priority, and stop location evaluation of transit preferential treatments using microsimulation,” Transportation Research Record, vol. 2533, no. 1, pp. 39–49, 2015.
 [6] Y. Z. Farid, E. Christofa, and J. Collura, “Dedicated bus and queue jumper lanes at signalized intersections with nearside bus stops,” Transportation Research Record: Journal of the Transportation Research Board, vol. 2484, pp. 182–192, Dec. 2015.
 [7] A. Buchenscheit, F. Schaub, F. Kargl, and M. Weber, “A vanetbased emergency vehicle warning system,” 2009 IEEE Vehicular Networking Conference (VNC), pp. 1–8, 2009.

 [8] D. Krajzewicz, G. Hertkorn, C. Rössel, and P. Wagner, “An example of microscopic car models validation using the open source traffic simulation sumo,” in 14th European Simulation Symposium, SCS European Publishing House, 2002, pp. 318–322. [Online]. Available: http://elib.dlr.de/6657/
 [9] F. Zuo, K. Ozbay, A. Kurkcu, J. Gao, H. Yang, and K. Xie, “Microscopic simulation based study of pedestrian safety applications at signalized urban crossings in a connected-automated vehicle environment and reinforcement learning based optimization of vehicle decisions,” in Road Safety and Simulation, Oct. 2019.
 [10] X. Xiong, J. Wang, F. Zhang, and K. Li, “Combining deep reinforcement learning and safety based control for autonomous driving,” ArXiv, vol. abs/1612.00147, 2016.
 [11] T. Schouwenaars, B. De Moor, E. Feron, and J. How, “Mixed integer programming for multivehicle path planning,” in 2001 European Control Conference (ECC), Sep. 2001, pp. 2603–2608.
 [12] G. J. Hannoun, P. MurrayTuite, K. Heaslip, and T. Chantem, “Facilitating emergency response vehicles’ movement through a road segment in a connected vehicle environment,” IEEE Transactions on Intelligent Transportation Systems, vol. 20, no. 9, pp. 3546–3557, Sep. 2019.
 [13] S. Ohnishi, E. Uchibe, Y. Yamaguchi, K. Nakanishi, Y. Yasui, and S. Ishii, “Constrained deep qlearning gradually approaching ordinary qlearning,” Frontiers in Neurorobotics, vol. 13, p. 103, 2019.
 [14] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Humanlevel control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015. [Online]. Available: http://dx.doi.org/10.1038/nature14236
 [15] J. Brownlee, A Gentle Introduction to the Rectified Linear Unit (ReLU), 2019 (accessed February 28, 2020). [Online]. Available: shorturl.at/awCV3
 [16] C. You, Q. Yang, L. Gjesteby, G. Li, S. Ju, Z. Zhang, Z. Zhao, Y. Zhang, W. Cong, G. Wang, et al., “Structurallysensitive multiscale deep neural network for lowdose ct denoising,” IEEE Access, vol. 6, pp. 41 839–41 855, 2018.

 [17] C. You, G. Li, Y. Zhang, X. Zhang, H. Shan, M. Li, S. Ju, Z. Zhao, Z. Zhang, W. Cong, et al., “CT super-resolution GAN constrained by the identical, residual, and cycle learning ensemble (GAN-CIRCLE),” IEEE Transactions on Medical Imaging, 2019.
 [18] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience replay,” CoRR, vol. abs/1511.05952, 2015.
 [19] P. A. Lopez, M. Behrisch, L. Bieker-Walz, J. Erdmann, Y.-P. Flötteröd, R. Hilbrich, L. Lücken, J. Rummel, P. Wagner, and E. Wießner, “Microscopic traffic simulation using sumo,” in The 21st IEEE International Conference on Intelligent Transportation Systems. IEEE, 2018. [Online]. Available: https://elib.dlr.de/124092/
 [20] M. Behrisch, L. Bieker, J. Erdmann, and D. Krajzewicz, “SUMO - simulation of urban mobility: An overview,” in SIMUL 2011, The Third International Conference on Advances in System Simulation, 2011, pp. 63–68.