EMVLight: A Decentralized Reinforcement Learning Framework for Efficient Passage of Emergency Vehicles

by   Haoran Su, et al.
Siemens AG
NYU college

Emergency vehicles (EMVs) play a crucial role in responding to time-critical events such as medical emergencies and fire outbreaks in an urban area. The less time EMVs spend traveling through traffic, the more likely they are to save lives and reduce property loss. To reduce the travel time of EMVs, prior work has used route optimization based on historical traffic-flow data and traffic signal pre-emption based on the optimal route. However, traffic signal pre-emption dynamically changes the traffic flow which, in turn, modifies the optimal route of an EMV. In addition, traffic signal pre-emption practices usually lead to significant disturbances in traffic flow and subsequently increase the travel time for non-EMVs. In this paper, we propose EMVLight, a decentralized reinforcement learning (RL) framework for simultaneous dynamic routing and traffic signal control. EMVLight extends Dijkstra's algorithm to efficiently update the optimal route for an EMV in real time as it travels through the traffic network. The decentralized RL agents learn network-level cooperative traffic signal phase strategies that not only reduce EMV travel time but also reduce the average travel time of non-EMVs in the network. This benefit has been demonstrated through comprehensive experiments with synthetic and real-world maps. These experiments show that EMVLight outperforms benchmark transportation engineering techniques and existing RL-based signal control methods.



1 Introduction

Emergency vehicles (EMVs) include ambulances, fire trucks, and police cars, which respond to critical events such as medical emergencies, fire disasters, and criminal activities. Emergency response time is the key indicator of a city’s emergency management capability. Reducing response time saves lives and prevents property loss. For instance, the survival rate from a sudden cardiac arrest without treatment drops 7% to 10% for every minute elapsed, and there is barely any chance to survive after 8 minutes. EMV travel time, the time interval for an EMV to travel from a rescue station to an incident site, accounts for a major portion of the emergency response time. However, overpopulation and urbanization have been exacerbating road congestion, making it more and more challenging to reduce the average EMV travel time. Records (Analytics, 2021) show that even with a decline in average emergency response time, the average EMV travel time increased from 7.2 minutes in 2015 to 10.1 minutes in 2021 in New York City. There is therefore an urgent need for, and a significant benefit in, shortening the average EMV travel time on increasingly crowded roads.

Existing works have studied strategies to reduce the travel time of EMVs by route optimization and traffic signal pre-emption. Route optimization usually refers to the search for a time-based shortest path. The traffic network (e.g., city road map) is modeled as a graph with intersections as nodes and road segments between intersections as edges. Based on the time a vehicle needs to travel through each edge (road segment), route optimization calculates an optimal route such that an EMV can travel from the rescue station to the incident site in the least amount of time. In addition, as the EMV needs to be as fast as possible, the law in most places requires non-EMVs to yield to emergency vehicles sounding sirens, regardless of the traffic signals at intersections. Even though this practice gives the right-of-way to EMVs, it poses safety risks for vehicles and pedestrians at the intersections. To address this safety concern, existing methods have also studied traffic signal pre-emption which refers to the process of deliberately altering the signal phases at each intersection to prioritize EMV passage.

However, as traffic conditions constantly change, an optimal route returned by route optimization can become suboptimal as an EMV travels through the network. Moreover, traffic signal pre-emption has a significant impact on the traffic flow, which would change the fastest route as well. Thus, the optimal route should be updated with real-time traffic flow information, i.e., the route optimization should be solved in a dynamic (time-dependent) way. As an optimal route can change as an EMV travels through the traffic network, the traffic signal pre-emption would need to adapt accordingly. In other words, the subproblems of route optimization and traffic signal pre-emption are coupled and should be solved simultaneously in real time. Existing approaches do not address this coupling.

In addition, most of the existing models on emergency vehicle service have a single objective of reducing the EMV travel time. As a result, their traffic signal control strategies have an undesirable effect of increasing the travel time of non-EMVs, since only EMV passage is optimized. In this paper, we aim to perform route optimization and traffic signal pre-emption to not only reduce EMV travel time but also to reduce the average travel time of non-EMVs. In particular, we address the following two key challenges:

  • How to dynamically route an EMV to a destination under time-dependent traffic conditions in a computationally efficient way? As the congestion level of each road segment changes over time, the routing algorithm should be able to update the remaining route as the EMV passes each intersection. Running the shortest-path algorithm each time the EMV passes through an intersection is not efficient. A computationally efficient dynamic routing algorithm is desired.

  • How to coordinate traffic signals to not only reduce EMV travel time but reduce the average travel time of non-EMVs as well? To reduce EMV travel time, only the traffic signals along the route of the EMV need to be altered. However, to further reduce average non-EMV travel time, traffic signals in the whole traffic network need to be operated cooperatively.

To tackle these challenges, we propose EMVLight, a decentralized multi-agent reinforcement learning framework with a dynamic routing algorithm to control traffic signal phases for efficient EMV passage. Our experimental results demonstrate that EMVLight outperforms traditional traffic engineering methods and existing RL methods under two metrics - EMV travel time and the average travel time of all vehicles - on different traffic configurations.

2 Related Work

Conventional routing optimization and traffic signal pre-emption for EMVs. Although, in reality, routing and pre-emption are coupled, the existing methods usually solve them separately. Many of the existing approaches leverage Dijkstra’s shortest path algorithm to get the optimal route Wang et al. (2013a); Mu et al. (2018); Kwon et al. (2003); Jotshi et al. (2009). An A* algorithm for ambulance routing has been proposed by Nordin et al. (2012). However, as this line of work assumes that the routes and traffic conditions are fixed and static, it fails to address the dynamic nature of real-world traffic flows. Another line of work has considered the change of traffic flows over time. Ziliaskopoulos and Mahmassani (1993) have proposed a shortest-path algorithm for time-dependent traffic networks, but the travel time associated with each edge at each time step is assumed to be known a priori. Musolino et al. (2013) propose different routing strategies for different times in a day (e.g., peak/non-peak hours) based on traffic history data at those times. However, in the problem we consider, routing and pre-emption strategies can significantly affect the travel time associated with each edge during the EMV passage, and the existing methods cannot deal with this kind of real-time change. Haghani et al. (2003) formulated the dynamic shortest path problem as a mixed-integer programming problem. Koh et al. (2020) have used RL for real-time vehicle navigation and routing. However, both of these studies have tackled a general routing problem, and signal pre-emption and its influence on traffic have not been modeled.

Once an optimal route for the EMV has been determined, traffic signal pre-emption is deployed. A common pre-emption strategy is to extend the green phases so that the EMV can pass each intersection along a fixed optimal route Wang et al. (2013a); Bieker-Walz and Behrisch (2019). Asaduzzaman and Vidyasankar (2017) have proposed pre-emption strategies for multiple EMV requests.

Please refer to Lu and Wang (2019) and Humagain et al. (2020) for a thorough survey of conventional routing optimization and traffic signal pre-emption methods. We would also like to point out that the conventional methods prioritize EMV passage and cause significant disturbances to the traffic flow, which increases the average non-EMV travel time.

RL-based traffic signal control. Traffic signal pre-emption only alters the traffic phases at the intersections where an EMV travels through. However, to reduce congestion, traffic phases at nearby intersections also need to be changed cooperatively. The coordination of traffic signals to mitigate traffic congestion is referred to as traffic signal control which has been addressed by leveraging deep RL in a growing body of work. Many of the existing approaches use Q-learning Abdulhai et al. (2003); Prashanth and Bhatnagar (2010); Wei et al. (2019a, b); Zheng et al. (2019); Chen et al. (2020). Zang et al. (2020) leverage meta-learning algorithms to speed up Q-learning for traffic signal control. Another line of work has used actor-critic algorithms for traffic signal control El-Tantawy et al. (2013); Aslani et al. (2017); Chu et al. (2019). Xu et al. (2021) propose a hierarchical actor-critic method to encourage cooperation between intersections. Please refer to Wei et al. (2019c) for a review on traffic signal control methods. However, these RL-based traffic control methods focus on reducing the congestion in the traffic network and are not designed for EMV pre-emption. In contrast, our RL framework is built upon state-of-the-art ideas such as max pressure and is designed to reduce both EMV travel time and overall congestion.

3 Preliminaries

Figure 1: Traffic movements illustration and an example pressure calculation for incoming lane #2.
Definition 1 (traffic map, link, lane)

A traffic map can be represented by a graph $G(\mathcal{V}, \mathcal{E})$, with intersections as nodes and road segments between intersections as edges. We refer to a one-directional road segment between two intersections as a link. A link has a fixed number of lanes, denoted as $h(l)$ for lane $l$. Fig. 1 shows 8 links and each link has 2 lanes.

Definition 2 (Traffic movements)

A traffic movement $(l, m)$ is defined as the traffic traveling across an intersection from an incoming lane $l$ to an outgoing lane $m$. The intersection shown in Fig. 1 has 24 permissible traffic movements. The set of all permissible traffic movements of an intersection is denoted as $\mathcal{M}$.

Definition 3 (Traffic signal phase)

A traffic signal phase is defined as the set of permissible traffic movements. As shown in Fig. 2, an intersection with 4 links has 8 phases.

Figure 2: Top: 8 signal phases; Left: phase #2 illustration; Right: phase #5 illustration.
Definition 4 (Pressure of an incoming lane)

The pressure of an incoming lane $l$ measures the unevenness of vehicle density between lane $l$ and the corresponding outgoing lanes in permissible traffic movements. The vehicle density of a lane is $x(l)/x_{\max}(l)$, where $x(l)$ is the number of vehicles on lane $l$ and $x_{\max}(l)$ is the vehicle capacity of lane $l$, which is related to the length of the lane. Then the pressure of an incoming lane $l$ is

$$w(l) = \left| \frac{x(l)}{x_{\max}(l)} - \sum_{\{m \mid (l,m) \in \mathcal{M}\}} \frac{1}{h(m)} \frac{x(m)}{x_{\max}(m)} \right|, \tag{1}$$

where $h(m)$ is the number of lanes of the outgoing link which contains $m$. In Fig. 1, $h(m) = 2$ for all the outgoing lanes. An example for Eqn. (1) is shown in Fig. 1.

Definition 5 (Pressure of an intersection)

The pressure of an intersection is the average of the pressure of all incoming lanes.

The pressure of an intersection indicates the unevenness of vehicle density between incoming and outgoing lanes in an intersection. Intuitively, reducing the pressure leads to more evenly distributed traffic, which indirectly reduces congestion and the average travel time of vehicles.
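The two pressure definitions can be illustrated with a minimal Python sketch. It assumes that the lane pressure is the absolute difference between the incoming lane's density and the sum of the densities of its permissible outgoing lanes, each divided by the number of lanes of its link. The function names and the tuple layout are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def lane_density(x: float, x_max: float) -> float:
    """Vehicle density of a lane: vehicles on the lane divided by its capacity."""
    return x / x_max


def lane_pressure(x_in: float, x_max_in: float,
                  outgoing: List[Tuple[float, float, int]]) -> float:
    """Pressure of one incoming lane.

    `outgoing` holds one (x, x_max, h) tuple per outgoing lane reachable through
    a permissible movement, where h is the lane count of that lane's link.
    """
    downstream = sum(lane_density(x, cap) / h for x, cap, h in outgoing)
    return abs(lane_density(x_in, x_max_in) - downstream)


def intersection_pressure(lane_pressures: List[float]) -> float:
    """Pressure of an intersection: average pressure over its incoming lanes."""
    return sum(lane_pressures) / len(lane_pressures)


if __name__ == "__main__":
    # Incoming lane at density 8/20; two outgoing lanes of a 2-lane link
    # at densities 2/20 and 6/20 (illustrative numbers, not from the paper).
    p = lane_pressure(8, 20, [(2, 20, 2), (6, 20, 2)])
    print(p, intersection_pressure([p, 0.4]))
```

A lane whose density matches the lane-averaged density downstream has zero pressure, which is the balanced situation the reward design later tries to encourage.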

4 Dynamic Routing

Dijkstra’s algorithm finds the shortest path between a given node and every other node in a graph, and it has been used for EMV routing. The EMV travel time along each link is estimated based on the number of vehicles on that link. We refer to it as the intra-link travel time. Dijkstra’s algorithm takes as input the traffic graph, the intra-link travel time and a destination, and returns the time-based shortest path as well as the estimated travel time from each intersection to the destination. The latter is usually referred to as the estimated time of arrival (ETA) of each intersection.
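As a rough illustration of this static step, the sketch below runs Dijkstra's algorithm backwards from the destination over a directed graph, producing the ETA of every intersection and the next intersection on its shortest path. The dictionary-of-dictionaries graph layout, the function name and the toy travel times are assumptions made for this example.

```python
import heapq
from typing import Dict, Optional, Tuple

# travel_time[i][j] is the intra-link travel time of the directed link i -> j.
Graph = Dict[str, Dict[str, float]]


def dijkstra_to_destination(
    travel_time: Graph, dest: str
) -> Tuple[Dict[str, float], Dict[str, Optional[str]]]:
    """Return (ETA, Next) for every intersection with respect to `dest`."""
    # Reverse the links so the search can expand from the destination.
    reverse: Graph = {node: {} for node in travel_time}
    for i, nbrs in travel_time.items():
        for j, t in nbrs.items():
            reverse.setdefault(j, {})[i] = t

    eta = {node: float("inf") for node in reverse}
    nxt: Dict[str, Optional[str]] = {node: None for node in reverse}
    eta[dest] = 0.0
    heap = [(0.0, dest)]
    while heap:
        d, j = heapq.heappop(heap)
        if d > eta[j]:
            continue  # stale heap entry
        for i, t in reverse[j].items():
            cand = d + t
            if cand < eta[i]:
                eta[i] = cand
                nxt[i] = j  # from i, drive to j next
                heapq.heappush(heap, (cand, i))
    return eta, nxt


if __name__ == "__main__":
    toy = {"A": {"B": 4.0, "C": 1.0}, "B": {"D": 5.0},
           "C": {"B": 2.0, "D": 8.0}, "D": {}}
    print(dijkstra_to_destination(toy, "D"))
```

Searching from the destination rather than from the origin is what makes a single run yield an ETA for every intersection at once.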

However, traffic conditions are constantly changing and so is the EMV travel time along each link. Moreover, EMV pre-emption techniques alter traffic signal phases, which significantly changes the traffic condition as the EMV travels. The pre-determined shortest path might become congested due to stochasticity and pre-emption. Thus, updating the optimal route dynamically can facilitate EMV passage. In theory we can run Dijkstra’s algorithm frequently as the EMV travels through the network to take into account the updated EMV intra-link travel time, but this is inefficient.

To achieve dynamic routing, we extend Dijkstra’s algorithm to efficiently update the optimal route based on the updated intra-link travel times. As shown in Algorithm 1, first a pre-population process is carried out where a (static) Dijkstra’s algorithm is run to get the ETA from each intersection to the destination. For each intersection, the next intersection along the shortest path is also calculated and stored. We assume this process can be done before the EMV starts to travel. This is reasonable since a sequence of processes, including call-taker processing, are performed before the EMVs are dispatched. Once the pre-population process is finished, we can update $\mathrm{ETA}$ and $\mathrm{Next}$ for each intersection efficiently in parallel, since the update only depends on information from neighboring intersections. Please see the Appendix for how the intra-link travel time is estimated in real time.

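Algorithm 1: Dynamic Dijkstra’s for EMV routing
Input:
    $G = (\mathcal{V}, \mathcal{E})$: traffic map as a graph
    $T^t = [T^t_{ij}]$: intra-link travel time at time $t$
    $i_d$: index of the destination
Output:
    $\mathrm{ETA}^t = [\mathrm{ETA}^t_i]$: ETA of each intersection
    $\mathrm{Next}^t = [\mathrm{Next}^t_i]$: next intersection to go from each intersection
/* pre-population */
$\mathrm{ETA}^0, \mathrm{Next}^0 \leftarrow \mathrm{Dijkstra}(G, T^0, i_d)$
/* dynamic routing */
for $t = 0 \to T$ do
    foreach $i \in \mathcal{V}$ do (in parallel)
        $\mathrm{ETA}^{t+1}_i \leftarrow \min_{(i,j) \in \mathcal{E}} \left( \mathrm{ETA}^t_j + T^t_{ij} \right)$
        $\mathrm{Next}^{t+1}_i \leftarrow \arg\min_{\{j \mid (i,j) \in \mathcal{E}\}} \left( \mathrm{ETA}^t_j + T^t_{ij} \right)$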
Remark 1

In static Dijkstra’s algorithm, the shortest path is obtained by repeatedly querying the $\mathrm{Next}$ attribute of each node from the origin until we reach the destination. In our dynamic Dijkstra’s algorithm, since the shortest path changes, at an intersection $i$ we only care about the immediate next intersection to go to, which is exactly $\mathrm{Next}_i$.
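One dynamic-routing step can be sketched as a single relaxation over each intersection's outgoing links, reading only the neighbors' previous ETA values. This is a minimal sketch under stated assumptions: the graph layout and function name are hypothetical, and pinning the destination's ETA at zero is a choice made for this example rather than something stated in Algorithm 1.

```python
from typing import Dict, Optional, Tuple

Graph = Dict[str, Dict[str, float]]  # travel_time[i][j] for directed link i -> j


def dynamic_update(
    travel_time: Graph, eta: Dict[str, float], dest: str
) -> Tuple[Dict[str, float], Dict[str, Optional[str]]]:
    """One synchronous update of ETA and Next from the previous ETA values."""
    new_eta: Dict[str, float] = {}
    new_next: Dict[str, Optional[str]] = {}
    # Each iteration touches only intersection i and its neighbors, so the
    # loop body could run in parallel across intersections.
    for i, nbrs in travel_time.items():
        if i == dest or not nbrs:
            # Assumption: the destination keeps ETA 0; dead ends are unreachable.
            new_eta[i] = 0.0 if i == dest else float("inf")
            new_next[i] = None
            continue
        best_j = min(nbrs, key=lambda j: eta[j] + nbrs[j])
        new_eta[i] = eta[best_j] + nbrs[best_j]
        new_next[i] = best_j
    return new_eta, new_next


if __name__ == "__main__":
    # ETA pre-populated on an uncongested toy map; link C -> B then slows to 10.
    eta0 = {"A": 8.0, "B": 5.0, "C": 7.0, "D": 0.0}
    congested = {"A": {"B": 4.0, "C": 1.0}, "B": {"D": 5.0},
                 "C": {"B": 10.0, "D": 8.0}, "D": {}}
    eta1, next1 = dynamic_update(congested, eta0, "D")
    eta2, next2 = dynamic_update(congested, eta1, "D")
    print(eta1, next1)
    print(eta2, next2)
```

In the toy run, intersection C reroutes immediately because it is adjacent to the changed link, while A only reflects the change one step later: each step propagates new information by one hop.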

5 Reinforcement Learning Formulation

While dynamic routing directs the EMV to the destination, it does not take into account the possible waiting times for red lights at the intersections. Thus, traffic signal pre-emption is also required for the EMV to arrive at the destination in the least amount of time. However, since traditional pre-emption only focuses on reducing the EMV travel time, the average travel time of non-EMVs can increase significantly. Thus, we set up traffic signal control for efficient EMV passage as a decentralized RL problem. In our problem, an RL agent controls the traffic signal phases of an intersection based on local information. Multiple agents coordinate the control signal phases of intersections cooperatively to (1) reduce EMV travel time and (2) reduce the average travel time of non-EMVs. First we design 3 agent types. Then we present agent design and multi-agent interactions.

5.1 Types of agents for EMV passage

When an EMV is on duty, we distinguish 3 types of traffic control agents based on EMV location and routing (Fig. 3). An agent is a primary pre-emption agent if an EMV is on one of its incoming links. The agent of the next intersection on the EMV route is referred to as a secondary pre-emption agent. The rest of the agents are normal agents. We design these types since different agents have different local goals, which is reflected in their reward designs.
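The three agent types can be derived from the EMV's current link and the routing table. The sketch below is illustrative only: the `AgentType` enum, the `(upstream, downstream)` link encoding and the `next_hop` dictionary are hypothetical names, not part of the paper.

```python
from enum import Enum
from typing import Dict, Iterable, Optional, Tuple


class AgentType(Enum):
    NORMAL = "normal"
    PRIMARY = "primary pre-emption"
    SECONDARY = "secondary pre-emption"


def classify_agents(
    intersections: Iterable[str],
    emv_link: Optional[Tuple[str, str]],
    next_hop: Dict[str, Optional[str]],
) -> Dict[str, AgentType]:
    """Assign a type to every intersection agent.

    emv_link is the (upstream, downstream) link the EMV is on, or None when no
    EMV is on duty; next_hop[i] is the next intersection on the route from i.
    """
    types = {i: AgentType.NORMAL for i in intersections}
    if emv_link is None:
        return types
    _, primary = emv_link  # the EMV is on an incoming link of `primary`
    types[primary] = AgentType.PRIMARY
    secondary = next_hop.get(primary)
    if secondary is not None:
        types[secondary] = AgentType.SECONDARY
    return types


if __name__ == "__main__":
    route = {"A": "B", "B": "C", "C": "D", "D": None}
    print(classify_agents(["A", "B", "C", "D"], ("A", "B"), route))
```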

5.2 Agent design

  • State: The state of an agent $i$ at time $t$ is denoted as $s_i^t$ and it includes the number of vehicles on each outgoing lane and incoming lane, the distance of the EMV to the intersection, the estimated time of arrival ($\mathrm{ETA}$), and which link the EMV will be routed to ($\mathrm{Next}$), i.e.,

    $$s_i^t = \{ x^t(l),\ x^t(m),\ d^t_{\mathrm{EMV}}[L_{ji}],\ \mathrm{ETA}^t_i,\ \mathrm{Next}^t_i \},$$

    where $L_{ji}$ represents the links incoming to intersection $i$, and with a slight abuse of notation $l$ and $m$ denote the set of incoming and outgoing lanes, respectively. For the intersection shown in Fig. 1, $d^t_{\mathrm{EMV}}$ is a vector of four elements. For primary pre-emption agents, one of the elements represents the distance of the EMV to the intersection in the corresponding link. The rest of the elements are set to -1. For all other agents, all elements of $d^t_{\mathrm{EMV}}$ are set to -1.

  • Action: Prior work has focused on using phase switch, phase duration and phase itself as actions. In this work, we define the action of an agent as one of the 8 phases in Fig. 2; this enables more flexible signal patterns as compared to the traditional cyclical patterns. Due to safety concerns, once a phase has been initiated, it should remain unchanged for a minimum amount of time, e.g. 5 seconds. Because of this, we set our MDP time step length to be 5 seconds to avoid rapid switch of phases.

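  • Reward: PressLight has shown that minimizing the pressure is an effective way to encourage efficient vehicle passage, so we adopt a similar idea for normal agents. For secondary pre-emption agents, we additionally encourage fewer vehicles on the link that the EMV is about to enter, in order to encourage efficient EMV passage. For primary pre-emption agents, we simply assign a unit penalty at each time step to encourage fast EMV passage. Thus, depending on the agent type, the local reward for agent $i$ at time $t$ is

    $$r_i^t = \begin{cases} -P_i^t, & i \text{ is a normal agent}, \\ -\beta P_i^t - \dfrac{1-\beta}{|L_{ji}|} \sum_{l \in L_{ji}} \dfrac{x(l)}{x_{\max}(l)}, & i \text{ is a secondary pre-emption agent}, \\ -1, & i \text{ is a primary pre-emption agent}, \end{cases} \tag{3}$$

    where $P_i^t$ is the pressure of intersection $i$ at time $t$, $L_{ji}$ is the link from the primary pre-emption intersection $j$ to the secondary pre-emption intersection $i$, and $\beta \in [0, 1]$ weights the two terms.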

Figure 3: Three types of agents.
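The type-dependent reward can be sketched as a small function. This is an illustration under assumptions: the secondary agent's reward is modeled here as a convex combination of the intersection pressure and the mean vehicle density of the link the EMV is about to enter, and the weight `beta` with its default of 0.5 is a placeholder rather than a value taken from the paper.

```python
from typing import Sequence


def local_reward(
    agent_type: str,
    pressure: float,
    next_link_densities: Sequence[float] = (),
    beta: float = 0.5,  # hypothetical weight, chosen only for illustration
) -> float:
    """Local reward of one agent for one time step.

    agent_type is one of "normal", "secondary", "primary";
    next_link_densities are the per-lane densities x/x_max of the link the
    EMV will enter next (only used by secondary pre-emption agents).
    """
    if agent_type == "primary":
        return -1.0  # unit penalty per step while the EMV is approaching
    if agent_type == "secondary":
        mean_density = sum(next_link_densities) / len(next_link_densities)
        return -beta * pressure - (1.0 - beta) * mean_density
    return -pressure  # normal agents only minimize pressure


if __name__ == "__main__":
    print(local_reward("normal", 0.3))
    print(local_reward("secondary", 0.2, [0.4, 0.6]))
    print(local_reward("primary", 0.9))
```

With `beta` close to 1 a secondary agent behaves like a normal agent, while smaller values push it to clear the link ahead of the EMV.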

Justification of agent design. The quantities in the local agent state can be obtained at each intersection using various technologies. The number of vehicles on each lane can be obtained by vehicle detection technologies, such as inductive loops Gajda et al. (2001), which rely on hardware installed underground. The distance of the EMV to the intersection can be obtained by vehicle-to-infrastructure technologies such as VANET Buchenscheit et al. (2009), which broadcasts the real-time position of a vehicle to an intersection. Prior work by Wang et al. (2013b) and Noori et al. (2016) has explored these technologies for traffic signal pre-emption.

The dynamic routing algorithm (Algorithm 1) can provide $\mathrm{ETA}$ and $\mathrm{Next}$ for each agent at every time step. However, due to the stochastic nature of traffic flows, updating the route too frequently might confuse the EMV driver, since the driver might be given a new route, say, every 5 seconds. There are many ways to ensure a reasonable update frequency. One option is to inform the driver only once while the EMV travels in a single link. We implement it by updating the routing information in the state of an RL agent only at the time step when the EMV has traveled through half of a link. For example, if the EMV travels through a link to agent $i$ from time step 11 to 20 at constant speed, then the dynamic routing information in $s_i^{11}$ to $s_i^{20}$ is the same, i.e., the agent holds a single pair of $\mathrm{ETA}$ and $\mathrm{Next}$ values for the whole traversal of that link.

As for the reward design, one might wonder how an agent can know its type. As we assume an agent can observe the state of its neighbors, agent type can be inferred from the observation. This will become clearer below.

5.3 Multi-agent Advantage Actor-critic

We adopt a multi-agent advantage actor-critic (MA2C) framework similar to Chu et al. (2019). The difference is that our local state includes dynamic routing information and our local reward encourages efficient passage of EMV. Here we briefly introduce the MA2C framework. Please refer to Chu et al. (2019) for additional details.

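In a multi-agent network $G(\mathcal{V}, \mathcal{E})$, the neighborhood of agent $i$ is denoted as $\mathcal{N}_i$. The local region of agent $i$ is $\mathcal{V}_i = \mathcal{N}_i \cup \{i\}$. We define the distance between two agents $d(i, j)$ as the minimum number of edges that connect them. For example, $d(i, i) = 0$ and $d(i, j) = 1$ for all $j \in \mathcal{N}_i$. In MA2C, each agent learns a policy $\pi_{\theta_i}$ (actor) and the corresponding value function $V_{\phi_i}$ (critic), where $\theta_i$ and $\phi_i$ are learnable neural network parameters of agent $i$.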

Local Observation. In an ideal setting, agents can observe the states of every other agent and leverage this global information to make a decision. However, this is not practical in our problem due to communication latency and will cause scalability issues. We assume each agent can observe its own state and the states of its neighbors, i.e., $s^t_{\mathcal{V}_i} = \{ s^t_j \mid j \in \mathcal{V}_i \}$. Each agent feeds this observation to its policy network $\pi_{\theta_i}$ and value network $V_{\phi_i}$.

Fingerprint. In multi-agent training, each agent treats other agents as part of the environment, but the policies of other agents are changing over time. Foerster et al. (2017) introduce fingerprints to inform agents about the changing policies of neighboring agents in multi-agent Q-learning. Chu et al. (2019) bring fingerprints into MA2C. Here we use the probability simplex of neighboring policies $\pi^{t-1}_{\mathcal{N}_i} = \{ \pi^{t-1}_j \mid j \in \mathcal{N}_i \}$ as fingerprints, and include it in the input of the policy network and value network. Thus, our policy network can be written as $\pi_{\theta_i}(a_i^t \mid \tilde{s}^t_{\mathcal{V}_i}, \pi^{t-1}_{\mathcal{N}_i})$ and our value network as $V_{\phi_i}(\tilde{s}^t_{\mathcal{V}_i}, \pi^{t-1}_{\mathcal{N}_i})$, where $\tilde{s}^t_{\mathcal{V}_i}$ is the local observation with spatial discount factor, which is introduced below.

Spatial Discount Factor and Adjusted Reward. MA2C agents cooperatively optimize a global cumulative reward. We assume the global reward is decomposable as $r_t = \sum_{i \in \mathcal{V}} r_i^t$, where $r_i^t$ is defined in Eqn. (3). Instead of optimizing the same global reward for every agent, Chu et al. (2019) propose a spatial discount factor $\alpha$ to let each agent pay less attention to rewards of agents far away. The adjusted reward for agent $i$ is

$$\tilde{r}_i^t = \sum_{d=0}^{D_i} \left( \sum_{j \in \mathcal{V},\, d(i,j) = d} \alpha^d\, r_j^t \right),$$

where $D_i$ is the maximum distance of agents in the graph from agent $i$. When $\alpha > 0$, the adjusted reward includes global information, which seems to contradict the local communication assumption. However, since the reward is only used for offline training, global reward information is allowed. Once trained, the RL agents can control traffic signals without relying on global information.
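A minimal sketch of the spatially discounted reward follows. It assumes an undirected agent graph stored as adjacency lists and computes hop distances by breadth-first search; the names `hop_distances`, `adjusted_reward` and `alpha` are chosen for this example.

```python
from collections import deque
from typing import Dict, List


def hop_distances(adj: Dict[str, List[str]], source: str) -> Dict[str, int]:
    """Minimum number of edges from `source` to every reachable agent (BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def adjusted_reward(
    adj: Dict[str, List[str]], rewards: Dict[str, float], agent: str, alpha: float
) -> float:
    """Sum of all agents' local rewards, each scaled by alpha ** distance."""
    dist = hop_distances(adj, agent)
    return sum((alpha ** d) * rewards[j] for j, d in dist.items())


if __name__ == "__main__":
    line = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}  # three agents in a row
    r = {"A": -1.0, "B": -2.0, "C": -4.0}
    for a in (0.0, 0.5, 1.0):
        print(a, adjusted_reward(line, r, "A", a))
```

Setting `alpha` to 0 reduces the adjusted reward to the agent's own local reward, and setting it to 1 recovers the undiscounted global reward; intermediate values trade off between the two.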

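Temporal Discount Factor and Return. The local return $\tilde{R}_i^t$ is defined as the cumulative adjusted reward $\tilde{R}_i^t := \sum_{\tau = t}^{T} \gamma^{\tau - t}\, \tilde{r}_i^{\tau}$, where $\gamma$ is the temporal discount factor and $T$ is the length of an episode. We can estimate the local return using the value function,

$$\tilde{R}_i^t = \tilde{r}_i^t + \gamma\, V_{\phi_i^-}\!\left( \tilde{s}^{t+1}_{\mathcal{V}_i}, \pi^{t}_{\mathcal{N}_i} \,\middle|\, \pi_{\theta^-_{-i}} \right),$$

where $\phi_i^-$ means the parameters $\phi_i$ are frozen and $\theta^-_{-i}$ means the parameters of the policy networks of all other agents are frozen.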
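The return computation can be sketched with a backward pass over a trajectory of adjusted rewards. In this sketch the critic is replaced by a plain number: `bootstrap_value` stands in for the frozen value network's estimate at the step after the last reward, which is an assumption made to keep the example self-contained.

```python
from typing import List, Sequence


def local_returns(
    adjusted_rewards: Sequence[float], gamma: float, bootstrap_value: float = 0.0
) -> List[float]:
    """Discounted return at every step of a reward sequence.

    Works backwards using R_t = r_t + gamma * R_{t+1}, seeded with
    bootstrap_value (0.0 if the episode ends after the last reward).
    """
    returns = [0.0] * len(adjusted_rewards)
    running = bootstrap_value
    for t in reversed(range(len(adjusted_rewards))):
        running = adjusted_rewards[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    print(local_returns([-1.0, -1.0, -1.0], gamma=0.5))
    print(local_returns([-1.0, -1.0, -1.0], gamma=0.5, bootstrap_value=-2.0))
```

Bootstrapping from the critic lets training use short rollouts instead of waiting for the end of an episode, at the cost of relying on the critic's current accuracy.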

Network architecture and training. As traffic flow data are spatio-temporal, we leverage a long short-term memory (LSTM) layer along with fully connected (FC) layers for the policy network (actor) and value network (critic). Our multi-agent actor-critic training pipeline is similar to that in Chu et al. (2019). We provide neural architecture details, the policy loss expression, the value loss expression, as well as training pseudocode in the Appendix.

6 Experimentation

In this section, we demonstrate our RL framework using Simulation of Urban MObility (SUMO) Lopez et al. (2018). SUMO is an open-source traffic simulator capable of simulating both microscopic and macroscopic traffic dynamics, suitable for capturing the EMV’s impact on the regional traffic as well as monitoring the overall traffic flow. A pipeline is established between the proposed RL framework and SUMO, i.e., the agents collect observations from SUMO and the preferred signal phases are fed back into SUMO.

6.1 Datasets and Maps Descriptions

We conduct the following experiments on both a synthetic map and a real-world map.


For the synthetic map, we synthesize a traffic grid in which intersections are connected with bi-directional links. Each link contains two lanes. We design 4 configurations, listed in Table 1. The origin (O) and destination (D) of the EMV are labelled in Fig. 4. The traffic for this map has a time span of 1200s. The EMV is dispatched only after traffic has built up, so that the roads are already loaded when it starts to travel.

Figure 4: Left: the synthetic grid. Right: an intersection illustration in SUMO; the teal areas are the regions covered by inductive-loop detectors. Origin and destination for the EMV are labeled.
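| Config | Non-peak traffic flow (veh/lane/hr) | Peak traffic flow (veh/lane/hr) | Origin | Destination |
|---|---|---|---|---|
| 1 | 200 | 240 | N, S | E, W |
| 2 | 160 | 320 | N, S | E, W |
| 3 | 200 | 240 | Randomly generated | Randomly generated |
| 4 | 160 | 320 | Randomly generated | Randomly generated |

Table 1: Configurations for the synthetic grid. Peak flow is assigned from 400s to 800s and non-peak flow is assigned outside this period. For Config. 1 and 2, the vehicles enter the grid from North and South, and exit toward East and West.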

The real-world map is a traffic network extracted from the Hell’s Kitchen area of Manhattan (Fig. 5) and customized for demonstrating EMV passage. In this traffic network, intersections are connected by 16 one-directional streets and 3 one-directional avenues. We assume each avenue contains four lanes and each street contains two lanes so that the right-of-way of EMVs and pre-emption can be demonstrated. The traffic flow for this map is generated from open-source NYC taxi data. Both the map and traffic flow data are publicly available (https://traffic-signal-control.github.io/). The origin and destination of the EMV are set to be far apart, as shown in Fig. 5.

Figure 5: Manhattan map: a 16-by-3 traffic network in Hell’s Kitchen area. Origin and destination for the EMV dispatching are labeled.
EMV Travel Time [s]

| Method | Config 1 | Config 2 | Config 3 | Config 4 | Manhattan |
|---|---|---|---|---|---|
| FT w/o EMV | N/A | N/A | N/A | N/A | N/A |
| W + Static + FT | 257.20 | 272.00 | 259.20 | 243.80 | 487.20 |
| W + Static + MP | 255.00 | 269.00 | 261.20 | 245.40 | 461.80 |
| W + Static + CL | 281.20 | 286.20 | 289.80 | 277.80 | 492.20 |
| W + Static + PL | 276.00 | 282.20 | 271.40 | 275.00 | 476.00 |
| W + dynamic + FT | 229.60 | 231.20 | 228.60 | 227.20 | 442.20 |
| W + dynamic + MP | 226.20 | 234.60 | 224.20 | 217.60 | 438.80 |
| W + dynamic + CL | 273.40 | 269.60 | 281.00 | 270.80 | 450.20 |
| W + dynamic + PL | 251.20 | 257.80 | 247.00 | 268.80 | 436.20 |
| EMVLight | **198.60** | **192.20** | **199.20** | **196.80** | **391.80** |

Average Travel Time [s]

| Method | Config 1 | Config 2 | Config 3 | Config 4 | Manhattan |
|---|---|---|---|---|---|
| FT w/o EMV | 353.43 | 371.13 | 314.25 | 334.10 | 1649.64 |
| W + Static + FT | 372.19 | 389.13 | 342.49 | 355.05 | 1811.03 |
| W + Static + MP | 349.38 | 352.54 | 307.91 | 322.68 | 708.13 |
| W + Static + CL | 503.35 | 524.26 | 488.12 | 509.55 | 2013.54 |
| W + Static + PL | 358.18 | 369.45 | 332.98 | 338.95 | 1410.76 |
| W + dynamic + FT | 370.09 | 393.40 | 330.13 | 345.50 | 1699.30 |
| W + dynamic + MP | 345.45 | 348.43 | 313.26 | 325.72 | 721.32 |
| W + dynamic + CL | 514.29 | 536.78 | 502.12 | 542.63 | 1987.86 |
| W + dynamic + PL | 359.31 | 342.59 | 340.11 | 349.20 | 1412.12 |
| EMVLight | **322.40** | **318.76** | **301.90** | **321.02** | **681.23** |

Table 2: Performance comparison of different methods evaluated in the four configurations of the synthetic traffic grid as well as the Manhattan map. For both metrics, a lower value indicates better performance. The lowest values are highlighted in bold. The average travel time of the Manhattan map (1649.64) is retrieved from data.

6.2 Baselines

Due to the lack of existing RL methods for efficient EMV passage, we select traditional methods and RL methods for each subproblem and combine them to set up baselines.

For traffic signal pre-emption, the most intuitive and widely used approach is extending the green light period for EMV passage at each intersection, which results in a Green Wave Corman et al. (2009). Walabi (W) Bieker-Walz and Behrisch (2019) is an effective rule-based method that implements Green Wave for EMVs in the SUMO environment. We integrate Walabi with combinations of routing and traffic signal control strategies introduced below as baselines.

Routing baselines:

  • Static routing is performed when the EMV is dispatched and the route remains fixed as the EMV travels. We adopt A* search as the baseline since it is a powerful extension to Dijkstra’s shortest path algorithm and is used in many real-time applications because of its optimality. (Our implementation of A* search employs the Manhattan distance as the heuristic function.)

  • Dynamic routing relies on real-time information of traffic conditions. To set up the baseline, we run A* every 50s as EMV travels. This is because running the full A* to update optimal route is not as efficient as our proposed dynamic Dijkstra’s algorithm.

Traffic signal control baselines:

  • Fixed Time (FT): Cyclical fixed time traffic phases with random offset Roess et al. (2004) is a policy that splits all phases with a predefined green ratio. It is the default strategy in real traffic signal control.

  • Max Pressure (MP): The state-of-the-art (SOTA) network-level signal control strategy based on pressure Varaiya (2013). It aggressively selects the phase with maximum pressure to smooth congestion.

  • Coordinated Learner (CL): A Q-learning based coordinator which directly learns joint local value functions for adjacent intersections Van der Pol and Oliehoek (2016).

  • PressLight (PL): An RL method aiming to optimize the pressure at each intersection Wei et al. (2019a).

6.3 Results

We evaluate performance of models under two metrics: EMV travel time, which reflects routing and pre-emption ability, and average travel time, which indicates the ability of traffic signal control for efficient vehicle passage. The performance of our EMVLight and the baselines in both the synthetic and the Manhattan map is shown in Table 2. The results of all methods are averaged over five independent runs and RL methods are tested with random seeds. We observe that EMVLight outperforms all baseline models under both metrics.

In terms of EMV travel time, the dynamic routing baselines perform better than the static routing baselines. This is expected since dynamic routing considers the time-dependent nature of traffic conditions and updates the optimal route accordingly. EMVLight further reduces EMV travel time by 18% on average as compared to the dynamic routing baselines. This advantage in performance can be attributed to the design of secondary pre-emption agents. This type of agent learns to “reserve a link” by choosing signal phases that help clear the vehicles in the link to encourage high-speed EMV passage (Eqn. (3)).

As for average travel time, we first notice that the traditional pre-emption technique (W+Static+FT) indeed increases the average travel time by roughly 5% to 10% as compared to a traditional Fixed Time strategy without EMV (denoted as “FT w/o EMV” in Table 2), thus decreasing the efficiency of vehicle passage. Different traffic signal control strategies have a direct impact on overall efficiency. Fixed Time is designed to handle steady traffic flow. Max Pressure, as a SOTA traditional method, outperforms Fixed Time and, surprisingly, outperforms both RL baselines in terms of overall efficiency. This shows that pressure is an effective indicator for reducing congestion and this is why we incorporate pressure in our reward design. Coordinated Learner performs the worst, probably because its reward is not based on pressure. PressLight doesn’t beat Max Pressure because it has a reward design that focuses on smoothing vehicle densities along a major direction, e.g. an arterial. Grid networks with the presence of an EMV make PressLight less effective. Our EMVLight improves its pressure-based reward design to encourage smoothing vehicle densities of all directions for each intersection. This enables us to achieve an advantage of about 5% over our best baseline (Max Pressure).
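The relative changes quoted above can be recomputed from the average travel times in Table 2. The snippet below is a small arithmetic check: the lists copy the table rows in the order Config 1 to Config 4 followed by Manhattan, the helper name `relative_change` is illustrative, and taking the W + dynamic + MP row as the Max Pressure baseline is an assumption of this example.

```python
from typing import List


def relative_change(new: float, base: float) -> float:
    """Percentage change of `new` relative to `base` (negative means faster)."""
    return (new - base) / base * 100.0


# Average travel time [s], columns: Config 1, 2, 3, 4, Manhattan (Table 2).
ft_without_emv = [353.43, 371.13, 314.25, 334.10, 1649.64]
walabi_static_ft = [372.19, 389.13, 342.49, 355.05, 1811.03]
walabi_dynamic_mp = [345.45, 348.43, 313.26, 325.72, 721.32]
emvlight = [322.40, 318.76, 301.90, 321.02, 681.23]

# Cost of traditional pre-emption relative to Fixed Time without an EMV.
preemption_cost: List[float] = [
    relative_change(n, b) for n, b in zip(walabi_static_ft, ft_without_emv)
]

# EMVLight relative to the dynamic-routing Max Pressure baseline.
emvlight_gain: List[float] = [
    relative_change(n, b) for n, b in zip(emvlight, walabi_dynamic_mp)
]
mean_gain = sum(emvlight_gain) / len(emvlight_gain)

if __name__ == "__main__":
    print([round(v, 2) for v in preemption_cost])
    print([round(v, 2) for v in emvlight_gain], round(mean_gain, 2))
```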

Ablation study on pressure and agent types

We propose three types of agents and design their rewards (Eqn. (3)) based on our improved pressure definition and heuristics. In order to see how our improved pressure definition and proposed special agents influence the results, we (1) replace our pressure definition by that defined in PressLight, (2) replace secondary pre-emption agents with normal agents and (3) replace primary pre-emption agents with normal agents.

Ablations                  (1)       (2)       (3)       EMVLight
EMV travel time [s]        197       289       320       199
Average travel time [s]    361.05    347.13    359.62    322.40

Table 3: Ablation study on pressure and agent types. Experiments are conducted on the Config 1 synthetic map.

Table 3 shows the results of these ablations: (1) PressLight-style pressure (see Appendix A) yields a slightly smaller EMV travel time but significantly increases the average travel time; (2) without secondary pre-emption agents, EMV travel time increases by 45%, since almost no “link reservation” happens; (3) without primary pre-emption agents, EMV travel time increases even more, by about 61% (320 s versus 199 s), which shows the importance of pre-emption.
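The percentages quoted above follow directly from Table 3. As a quick check, the short Python sketch below recomputes the relative change of each ablation against full EMVLight. The dictionary layout and the helper names `relative_change` and `ablation_report` are introduced here only for illustration.

```python
# Values copied from Table 3, in seconds.
EMV_TRAVEL_TIME = {"(1)": 197.0, "(2)": 289.0, "(3)": 320.0, "EMVLight": 199.0}
AVG_TRAVEL_TIME = {"(1)": 361.05, "(2)": 347.13, "(3)": 359.62, "EMVLight": 322.40}


def relative_change(value: float, reference: float) -> float:
    """Return the change of `value` relative to `reference`, in percent."""
    return 100.0 * (value - reference) / reference


def ablation_report(table: dict) -> dict:
    """Relative change of every ablation against the full EMVLight entry."""
    ref = table["EMVLight"]
    return {
        name: round(relative_change(value, ref), 1)
        for name, value in table.items()
        if name != "EMVLight"
    }


if __name__ == "__main__":
    print("EMV travel time change [%]:", ablation_report(EMV_TRAVEL_TIME))
    print("Average travel time change [%]:", ablation_report(AVG_TRAVEL_TIME))
```

Ablation (2) comes out at roughly +45% in EMV travel time, matching the figure in the text, while ablation (1) trades a change of about -1% in EMV travel time for roughly +12% in average travel time.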

Ablation study on fingerprint

In multi-agent RL, fingerprint has been shown to stabilize training and enable faster convergence. To see how the fingerprint affects training in EMVLight, we remove the fingerprint design, i.e., the policy and value networks no longer take the fingerprint as input. Fig. 6 shows the influence of the fingerprint on training. With the fingerprint, the reward converges faster and fluctuates less, confirming its effectiveness.

Figure 6: Reward convergence with and without fingerprint. Experiments are conducted on the Config 1 synthetic map.

7 Conclusion

In this paper, we proposed a decentralized reinforcement learning framework, EMVLight, to facilitate the efficient passage of EMVs and reduce traffic congestion at the same time. Leveraging the multi-agent A2C framework, agents incorporate dynamic routing and cooperatively control traffic signals to reduce EMV travel time and the average travel time of non-EMVs. Evaluated on both synthetic and real-world maps, EMVLight significantly outperforms the existing methods. Future work will explore more realistic microscopic interactions between EMVs and non-EMVs, the efficient passage of multiple EMVs, and closing the sim-to-real gap.


References

  • B. Abdulhai, R. Pringle, and G. J. Karakoulas (2003) Reinforcement learning for true adaptive traffic signal control. Journal of Transportation Engineering 129 (3), pp. 278–285. Cited by: §2.
  • N. Analytics (2021) End-to-end response times. https://www1.nyc.gov/site/fdny/about/resources/data-and-analytics/end-to-end-response-times.page Cited by: §1.
  • M. Asaduzzaman and K. Vidyasankar (2017) A priority algorithm to control the traffic signal for emergency vehicles. In 2017 IEEE 86th Vehicular Technology Conference (VTC-Fall), pp. 1–7. Cited by: §2.
  • M. Aslani, M. S. Mesgari, and M. Wiering (2017) Adaptive traffic signal control with actor-critic methods in a real-world traffic network with different traffic disruption events. Transportation Research Part C: Emerging Technologies 85, pp. 732–752. Cited by: §2.
  • L. Bieker-Walz and M. Behrisch (2019) Modelling green waves for emergency vehicles using connected traffic data. EPiC Series in Computing 62, pp. 1–11. Cited by: §2, §6.2.
  • A. Buchenscheit, F. Schaub, F. Kargl, and M. Weber (2009) A vanet-based emergency vehicle warning system. In 2009 IEEE Vehicular Networking Conference (VNC), pp. 1–8. Cited by: §5.2.
  • C. Chen, H. Wei, N. Xu, G. Zheng, M. Yang, Y. Xiong, K. Xu, and Z. Li (2020) Toward a thousand lights: decentralized deep reinforcement learning for large-scale traffic signal control. Proceedings of the AAAI Conference on Artificial Intelligence 34 (04), pp. 3414–3421. Cited by: §2.
  • T. Chu, J. Wang, L. Codecà, and Z. Li (2019) Multi-agent deep reinforcement learning for large-scale traffic signal control. IEEE Transactions on Intelligent Transportation Systems. Cited by: §2, §5.3, §5.3, §5.3, §5.3.
  • F. Corman, A. D’Ariano, D. Pacciarelli, and M. Pranzo (2009) Evaluation of green wave policy in real-time railway traffic management. Transportation Research Part C: Emerging Technologies 17 (6), pp. 607–616. Cited by: §6.2.
  • S. El-Tantawy, B. Abdulhai, and H. Abdelgawad (2013) Multiagent reinforcement learning for integrated network of adaptive traffic signal controllers (marlin-atsc): methodology and large-scale application on downtown toronto. IEEE Transactions on Intelligent Transportation Systems 14 (3), pp. 1140–1150. Cited by: §2.
  • J. Foerster, N. Nardelli, G. Farquhar, T. Afouras, P. H. Torr, P. Kohli, and S. Whiteson (2017) Stabilising experience replay for deep multi-agent reinforcement learning. In International Conference on Machine Learning, pp. 1146–1155. Cited by: §5.3.
  • J. Gajda, R. Sroka, M. Stencel, A. Wajda, and T. Zeglen (2001) A vehicle classification based on inductive loop detectors. In IMTC 2001. Proceedings of the 18th IEEE Instrumentation and Measurement Technology Conference. Rediscovering Measurement in the Age of Informatics, Vol. 1, pp. 460–464. Cited by: §5.2.
  • A. Haghani, H. Hu, and Q. Tian (2003) An optimization model for real-time emergency vehicle dispatching and routing. In 82nd annual meeting of the Transportation Research Board, Washington, DC, Cited by: §2.
  • S. Humagain, R. Sinha, E. Lai, and P. Ranjitkar (2020) A systematic review of route optimisation and pre-emption methods for emergency vehicles. Transport reviews 40 (1), pp. 35–53. Cited by: §2.
  • A. Jotshi, Q. Gong, and R. Batta (2009) Dispatching and routing of emergency vehicles in disaster mitigation using data fusion. Socio-Economic Planning Sciences 43 (1), pp. 1–24. Cited by: §2.
  • S. Koh, B. Zhou, H. Fang, P. Yang, Z. Yang, Q. Yang, L. Guan, and Z. Ji (2020) Real-time deep reinforcement learning based vehicle navigation. Applied Soft Computing 96, pp. 106694. Cited by: §2.
  • E. Kwon, S. Kim, and R. Betts (2003) Route-based dynamic preemption of traffic signals for emergency vehicle operations. In Transportation Research Board 82nd Annual Meeting, Cited by: §2.
  • P. A. Lopez, M. Behrisch, L. Bieker-Walz, J. Erdmann, Y. Flötteröd, R. Hilbrich, L. Lücken, J. Rummel, P. Wagner, and E. Wießner (2018) Microscopic traffic simulation using sumo. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pp. 2575–2582. Cited by: §6.
  • L. Lu and S. Wang (2019) Literature review of analytical models on emergency vehicle service: location, dispatching, routing and preemption control. In 2019 IEEE Intelligent Transportation Systems Conference (ITSC), pp. 3031–3036. Cited by: §2.
  • H. Mu, Y. Song, and L. Liu (2018) Route-based signal preemption control of emergency vehicle. Journal of Control Science and Engineering 2018, pp. 1–11. Cited by: §2.
  • G. Musolino, A. Polimeni, C. Rindone, and A. Vitetta (2013) Travel time forecasting and dynamic routes design for emergency vehicles. Procedia-Social and Behavioral Sciences 87, pp. 193–202. Cited by: §2.
  • H. Noori, L. Fu, and S. Shiravi (2016) A connected vehicle based traffic signal control strategy for emergency vehicle preemption. In Transportation Research Board 95th Annual Meeting, Cited by: §5.2.
  • N. A. M. Nordin, Z. A. Zaharudin, M. A. Maasar, and N. A. Nordin (2012) Finding shortest path of the ambulance routing: interface of a-star algorithm using c programming. In 2012 IEEE Symposium on Humanities, Science and Engineering Research, pp. 1569–1573. Cited by: §2.
  • L. Prashanth and S. Bhatnagar (2010) Reinforcement learning with function approximation for traffic signal control. IEEE Transactions on Intelligent Transportation Systems 12 (2), pp. 412–421. Cited by: §2.
  • R. P. Roess, E. S. Prassas, and W. R. McShane (2004) Traffic engineering. Pearson/Prentice Hall. Cited by: 1st item.
  • H. Su, K. Shi, J. Y. J. Chow, and L. Jin (2021) Dynamic queue-jump lane for emergency vehicles under partially connected settings: a multi-agent deep reinforcement learning approach. arXiv preprint arXiv:2003.01025. Cited by: Appendix B.
  • E. Van der Pol and F. A. Oliehoek (2016) Coordinated deep reinforcement learners for traffic light control. Proceedings of Learning, Inference and Control of Multi-Agent Systems (at NIPS 2016). Cited by: 3rd item.
  • P. Varaiya (2013) Max pressure control of a network of signalized intersections. Transportation Research Part C: Emerging Technologies 36, pp. 177–195. Cited by: 2nd item.
  • J. Wang, W. Ma, and X. Yang (2013a) Development of degree-of-priority based control strategy for emergency vehicle preemption operation. Discrete dynamics in nature and society 2013. Cited by: §2, §2.
  • Y. Wang, Z. Wu, X. Yang, and L. Huang (2013b) Design and implementation of an emergency vehicle signal preemption system based on cooperative vehicle-infrastructure technology. Advances in Mechanical Engineering 5, pp. 834976. Cited by: §5.2.
  • H. Wei, C. Chen, G. Zheng, K. Wu, V. Gayah, K. Xu, and Z. Li (2019a) Presslight: learning max pressure control to coordinate traffic signals in arterial network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1290–1298. Cited by: Appendix A, §2, 4th item.
  • H. Wei, N. Xu, H. Zhang, G. Zheng, X. Zang, C. Chen, W. Zhang, Y. Zhu, K. Xu, and Z. Li (2019b) Colight: learning network-level cooperation for traffic signal control. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 1913–1922. Cited by: §2.
  • H. Wei, G. Zheng, V. Gayah, and Z. Li (2019c) A survey on traffic signal control methods. arXiv preprint arXiv:1904.08117. Cited by: §2.
  • B. Xu, Y. Wang, Z. Wang, H. Jia, and Z. Lu (2021) Hierarchically and cooperatively learning traffic signal control. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, pp. 669–677. Cited by: §2.
  • X. Zang, H. Yao, G. Zheng, N. Xu, K. Xu, and Z. Li (2020) MetaLight: value-based meta-reinforcement learning for traffic signal control. Proceedings of the AAAI Conference on Artificial Intelligence 34 (01), pp. 1153–1160. Cited by: §2.
  • G. Zheng, Y. Xiong, X. Zang, J. Feng, H. Wei, H. Zhang, Y. Li, K. Xu, and Z. Li (2019) Learning phase competition for traffic signal control. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 1963–1972. Cited by: §2.
  • A. K. Ziliaskopoulos and H. S. Mahmassani (1993) Time-dependent, shortest-path algorithm for real-time intelligent vehicle highway system applications. In Transportation Research Record 1408, pp. 94–100. Cited by: §2.

Appendix A Pressure Definition Comparison

Here we present the key difference in pressure definition between our work and PressLight (Wei et al., 2019a).

A.1 Pressure in PressLight

PressLight assumes that traffic movements are lane-to-lane, i.e., vehicles in one lane can only move into one particular lane of the downstream link. Because of this lane-to-lane assumption, PressLight defines pressure per movement: the pressure of a movement is the difference in vehicle density between its incoming lane and its outgoing lane.

PressLight then defines the pressure of an intersection as the absolute value of the sum of the pressures of all permissible traffic movements of that intersection.
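Written out, the two PressLight definitions above take the following form. The notation is assumed here for illustration and may differ from the symbols used in the main text: $x(l)$ is the number of vehicles in lane $l$, $x_{\max}(l)$ is the maximum number of vehicles the lane can hold (so that their ratio is the vehicle density), $(l, m)$ is a movement from incoming lane $l$ to outgoing lane $m$, and $\mathcal{M}_i$ is the set of permissible movements of intersection $i$.

```latex
w(l, m) = \frac{x(l)}{x_{\max}(l)} - \frac{x(m)}{x_{\max}(m)},
\qquad
P_i = \Bigl|\, \sum_{(l, m) \in \mathcal{M}_i} w(l, m) \Bigr|
```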

A.2 Pressure in our work

EMVLight assumes a lane-to-link style traffic movement, as vehicles can enter either lane on the target link (see Fig. 1). To present the pressure of an intersection, we first define the pressure of an incoming lane (Definition 4) as the absolute difference between the vehicle density of that lane and that of the corresponding outgoing lanes.

The pressure of an intersection in EMVLight is then defined as the average of the pressures of all incoming lanes of that intersection (Definition 5).
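One possible way to write Definitions 4 and 5 in symbols is sketched below. The notation is an assumption made for illustration, and the outgoing-side term in particular is only one plausible reading of the lane-to-link movement: $\mathcal{O}(l)$ denotes the set of lanes on the outgoing links reachable from incoming lane $l$, whose densities are averaged, and $\mathcal{I}_i$ denotes the set of incoming lanes of intersection $i$. As before, $x(\cdot)$ and $x_{\max}(\cdot)$ are the vehicle count and maximum vehicle count of a lane.

```latex
w(l) = \Bigl|\, \frac{x(l)}{x_{\max}(l)}
        - \frac{1}{|\mathcal{O}(l)|} \sum_{m \in \mathcal{O}(l)} \frac{x(m)}{x_{\max}(m)} \Bigr|,
\qquad
P_i = \frac{1}{|\mathcal{I}_i|} \sum_{l \in \mathcal{I}_i} w(l)
```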

A.3 Comparison

The first difference between the two definitions is that the PressLight movement pressure can be positive or negative, whereas the EMVLight lane pressure is always non-negative: it measures the unevenness between the vehicle density in the incoming lane and that of the corresponding outgoing lanes. We take the absolute value since the direction of pressure is irrelevant here, and the goal of each agent is to minimize this unevenness. The second difference is that, at the intersection level, PressLight takes a sum whereas EMVLight takes an average. The average is more suitable for our purpose since it scales the pressure down, and the unit penalty for normal agents would be relatively large as compared to rewards for pre-emption agents (Eqn. (3)). This design puts the efficient passage of EMVs at the top priority. Our experimental results indicate that the proposed pressure design produces a more robust reward signal during training and outperforms PressLight in congestion reduction.
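The effect of the two differences can be seen on a small example. The Python sketch below implements both intersection-level pressures under stated assumptions: densities are given directly as numbers in [0, 1], the lane identifiers and the toy intersection are hypothetical, and averaging the outgoing-lane densities in `emvlight_pressure` is our reading of the lane-to-link movement rather than a confirmed detail.

```python
from typing import Dict, List, Tuple

Density = Dict[str, float]  # lane id -> vehicle density in [0, 1]


def presslight_pressure(density: Density, movements: List[Tuple[str, str]]) -> float:
    """Absolute value of the summed signed pressures of lane-to-lane movements."""
    return abs(sum(density[l] - density[m] for l, m in movements))


def emvlight_pressure(density: Density, outgoing: Dict[str, List[str]]) -> float:
    """Average over incoming lanes of |incoming density - mean outgoing density|.

    Averaging the outgoing-lane densities is an assumption of this sketch.
    """
    lane_pressures = []
    for lane, out_lanes in outgoing.items():
        mean_out = sum(density[m] for m in out_lanes) / len(out_lanes)
        lane_pressures.append(abs(density[lane] - mean_out))
    return sum(lane_pressures) / len(lane_pressures)


if __name__ == "__main__":
    # Hypothetical intersection: two incoming lanes, uneven in opposite directions.
    density = {"in_a": 0.9, "in_b": 0.1, "out_a": 0.1, "out_b": 0.9}
    movements = [("in_a", "out_a"), ("in_b", "out_b")]
    outgoing = {"in_a": ["out_a"], "in_b": ["out_b"]}
    print("PressLight-style:", presslight_pressure(density, movements))
    print("EMVLight-style:  ", emvlight_pressure(density, outgoing))
```

In this toy case the signed movement pressures cancel, so the summed pressure is zero even though both approaches are badly unbalanced, whereas the averaged absolute lane pressures still report the unevenness.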

Appendix B Intra-link EMV travel time

The intra-link traffic pattern with the presence of an EMV on duty is complicated and is under-explored in the current literature. For simplicity, here we demonstrate a simple intra-link traffic model for a link with 2 lanes. The model can be easily extended for multiple lanes.

In a two-lane link, the EMV takes one lane and the non-EMVs on the other lane usually slow down or stop entirely. Some non-EMVs ahead of the EMV find pull-over spots in the other lane and park there. Those that cannot find a parking spot continue to drive in front of the EMV, potentially blocking its passage (Su et al., 2021). In this study, we propose a mesoscopic model to estimate the intra-link travel time of an EMV.

Figure 7: Normal traffic state.

Normally, the traffic flow of a link is modeled by a fundamental diagram (see Fig. 7). This diagram depicts a simplified relationship between the flow rate, i.e., the number of vehicles passing per unit time, and the number of vehicles on the link. The maximum number of vehicles indicates the capacity of the link. The critical number of vehicles marks the boundary between the non-congested state and the congested state. When the number of vehicles is smaller than the critical number, all vehicles travel at the free flow speed, which is represented by the slope of the non-congested branch. When the number of vehicles is larger than the critical number, vehicles slow down and the traffic flow declines, since the link is now congested. The maximum flow is attained when the number of vehicles equals the critical number. The travel speed of vehicles in a congested state is obtained from the slope of the line connecting the origin to the corresponding point on the congested branch.
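As an illustration of this relationship, the sketch below implements a piecewise-linear (triangular) fundamental diagram in Python. The triangular shape, the function names, and every numeric value in the usage example are assumptions chosen for the example; they are not the calibrated values used in our experiments.

```python
def flow_rate(n: float, n_crit: float, n_max: float, q_max: float) -> float:
    """Flow [veh/s] on a link holding n vehicles, for a triangular diagram."""
    if not 0.0 <= n <= n_max:
        raise ValueError("n must lie in [0, n_max]")
    if n <= n_crit:
        # Non-congested branch: flow grows linearly up to q_max.
        return q_max * n / n_crit
    # Congested branch: flow declines linearly to zero at capacity.
    return q_max * (n_max - n) / (n_max - n_crit)


def travel_speed(n: float, n_crit: float, n_max: float, q_max: float,
                 link_length: float) -> float:
    """Speed [m/s]: slope of the ray from the origin, scaled by link length."""
    if n == 0:
        # Empty link: use the free flow speed (slope of the first branch).
        return link_length * q_max / n_crit
    return link_length * flow_rate(n, n_crit, n_max, q_max) / n


if __name__ == "__main__":
    # Hypothetical link: 200 m long, critical count 10, capacity 40, max flow 0.5 veh/s.
    for n in (5, 10, 25, 40):
        print(n, flow_rate(n, 10, 40, 0.5), travel_speed(n, 10, 40, 0.5, 200.0))
```

With these placeholder values the speed stays at the free flow value for any count up to the critical number and drops once the link is congested, reaching zero at capacity.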

Figure 8: Traffic state during EMV pre-emption.

During EMV pre-emption, the original traffic flow relationship, pictured in black, is split into two parts, one for each lane (see Fig. 8). The green line represents the traffic conditions of the pre-emption lane, i.e., the lane where the EMV is traveling, and the orange line represents the other lane. During pre-emption, some of the vehicles originally traveling in front of the EMV pull over onto the adjacent lane, resulting in a significant decrease in the maximum capacity of the pre-emption lane. Meanwhile, because vehicles can park on the curb, the orange line shows a larger maximum capacity than normal.

Regarding the max flow, the adjacent lane has a smaller max flow, as the vehicles on this lane are required to slow down when an EMV is on duty. The pre-emption lane attains its maximum when there is only one vehicle on the lane, i.e., the EMV itself. In this case, the flow rate corresponds to the free flow speed of the EMV. When vehicles remain in front of the EMV, the EMV may be slowed down, but its travel speed is still higher than the free flow speed of the non-EMVs. Furthermore, the travel speed of the EMV takes discrete values corresponding to the number of non-EMVs blocking it. For example, for a given traffic state on the green line, the travel speed of the EMV is obtained from the slope of the line connecting the origin to that state. We use this simple model, especially the green plot, to estimate the intra-link travel time of an EMV as a function of the number of vehicles in the link.

We calibrate this model with the SUMO environment. Intuitively, the travel speed of the EMV is affected more by the number of vehicles on the pre-emption lane than by their positions. The reason is that the ETA is updated frequently, at a short fixed interval, so the estimated ETA eventually converges. Imagine that some vehicles are far ahead of the EMV and about to leave through the intersection at the time of one update; their presence would not slow down the approaching EMV. Once they have left the intersection by the next update, we have an updated count of non-EMVs and an updated travel speed estimate.
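The periodic re-estimation described above can be sketched as follows. The speed table, link length, update interval, and function names are hypothetical placeholders; in practice the mapping from vehicle count to EMV speed would come from the SUMO calibration.

```python
from typing import List, Sequence


def emv_speed(n_blocking: int, speeds: Sequence[float]) -> float:
    """Discrete EMV speed [m/s] indexed by the number of blocking non-EMVs."""
    # Counts beyond the end of the table reuse the last (slowest) entry.
    return speeds[min(n_blocking, len(speeds) - 1)]


def intra_link_eta(remaining: float, n_blocking: int, speeds: Sequence[float]) -> float:
    """Estimated time [s] for the EMV to cover `remaining` meters of the link."""
    return remaining / emv_speed(n_blocking, speeds)


def simulate_updates(link_length: float, counts: Sequence[int],
                     speeds: Sequence[float], dt: float) -> List[float]:
    """Re-estimate the total intra-link travel time every dt seconds.

    counts[k] is the number of non-EMVs on the pre-emption lane at update k.
    Each estimate is the elapsed time plus the ETA over the remaining distance.
    """
    remaining, elapsed, estimates = link_length, 0.0, []
    for n in counts:
        if remaining <= 0.0:
            break
        estimates.append(elapsed + intra_link_eta(remaining, n, speeds))
        # The EMV advances at the current speed until the next update.
        remaining -= emv_speed(n, speeds) * dt
        elapsed += dt
    return estimates


if __name__ == "__main__":
    # Hypothetical speeds for 0, 1, and 2 or more blocking vehicles.
    print(simulate_updates(120.0, [2, 1, 0, 0], [12.0, 8.0, 6.0], dt=5.0))
```

In this toy run the estimate shrinks at each update as blocking vehicles leave the lane, which is the convergence behavior the text relies on.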

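Input:
    maximum time step of an episode
    batch size
    learning rate for policy networks
    learning rate for value networks
    spatial discount factor
    (temporal) discount factor
    regularizer coefficient
Output:
    learned parameters in value networks
    learned parameters in policy networks

initialize the network parameters, the counters, and the batch; initialize SUMO and get the initial state
repeat
    /* generate trajectories */
    foreach agent do (in parallel)
        sample an action from the current policy
        receive the reward and the next state
    advance the time step and the batch counter
    if the maximum time step of the episode is reached then
        initialize SUMO, reset the time step, and get the initial state
    /* update actors and critics */
    if the batch is full then
        foreach agent do (in parallel)
            calculate the quantities in Eqn. (4) and Eqn. (5)
            update the value and policy networks (Appendix C.1 and C.2)
until convergence

Algorithm 2 Multi-agent A2C Training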

Appendix C Training Details

C.1 Value loss function

Given a batch of collected data, each agent’s value network is trained by minimizing the difference between the bootstrapped estimated value and the value approximated by the neural network.
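A standard form of such a value loss in A2C-style training is given below. The notation is assumed for illustration: $B$ is the batch, $\tilde{R}_{i,t}$ is the bootstrapped return estimate of agent $i$ at step $t$, and $V_{\phi_i}$ is the value network with parameters $\phi_i$, evaluated on the agent's state input $\tilde{s}_{i,t}$.

```latex
L(\phi_i) = \frac{1}{2|B|} \sum_{t \in B}
  \bigl( \tilde{R}_{i,t} - V_{\phi_i}(\tilde{s}_{i,t}) \bigr)^2
```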


C.2 Policy loss function

Each agent’s policy network is trained by minimizing its policy loss
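A common way to write such a loss, with the regularizer taken to be an entropy term, is sketched below. The symbols and the entropy form are assumptions for illustration: $\tilde{A}_{i,t}$ is the estimated advantage, $\pi_{\theta_i}$ is the policy with parameters $\theta_i$, $\lambda$ is the regularizer coefficient, and $\mathcal{A}_i$ is the action set of agent $i$.

```latex
L(\theta_i) = -\frac{1}{|B|} \sum_{t \in B} \Bigl(
    \log \pi_{\theta_i}(a_{i,t} \mid \tilde{s}_{i,t}) \, \tilde{A}_{i,t}
    - \lambda \sum_{a \in \mathcal{A}_i}
      \pi_{\theta_i}(a \mid \tilde{s}_{i,t}) \log \pi_{\theta_i}(a \mid \tilde{s}_{i,t})
  \Bigr)
```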


In the policy loss, the estimated advantage measures how much better the chosen action is compared with the average performance of the policy in the given state. The second term is a regularization term that encourages initial exploration; it is computed over the action set of the agent. For an intersection as shown in Fig. 1, this action set contains 8 traffic signal phases.
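To make the two losses concrete, the sketch below evaluates them on plain Python lists. It is a minimal illustration under stated assumptions, not the training code: the function names are hypothetical, the regularizer is assumed to be the policy entropy, and a real implementation would compute these quantities on network outputs inside an automatic-differentiation framework.

```python
import math
from typing import Sequence


def value_loss(returns: Sequence[float], values: Sequence[float]) -> float:
    """Half mean squared error between bootstrapped returns and predicted values."""
    n = len(returns)
    return sum((r - v) ** 2 for r, v in zip(returns, values)) / (2 * n)


def policy_loss(probs: Sequence[Sequence[float]], actions: Sequence[int],
                advantages: Sequence[float], reg_coef: float) -> float:
    """Advantage-weighted negative log-likelihood with an entropy bonus.

    probs[t][a] is the policy probability of action a for sample t.
    The entropy form of the regularizer is an assumption of this sketch.
    """
    total = 0.0
    for p, a, adv in zip(probs, actions, advantages):
        # Sum of pi * log(pi) over the action set, i.e. the negative entropy.
        neg_entropy = sum(q * math.log(q) for q in p if q > 0.0)
        total += math.log(p[a]) * adv - reg_coef * neg_entropy
    return -total / len(actions)


if __name__ == "__main__":
    uniform = [1.0 / 8] * 8  # 8 traffic signal phases, all equally likely
    print(value_loss([1.0, 2.0], [0.0, 2.0]))
    print(policy_loss([uniform], [0], [1.0], reg_coef=0.01))
```

Because the entropy of a uniform policy is maximal, the regularized loss here is lower than the unregularized one, which is how the second term rewards exploration early in training.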

C.3 Training algorithm

Algorithm 2 shows the multi-agent A2C training process.
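The control flow of this training process can be illustrated with a small, self-contained Python skeleton. Everything in it is a stand-in: `ToyEnv` replaces the SUMO environment with random states and rewards, `ToyAgent` replaces the policy and value networks with a counter, and storing each transition in a per-agent batch is an assumption of the sketch. It only shows how trajectory generation, episode resets, and batched updates interleave.

```python
import random
from typing import Dict, List


class ToyEnv:
    """Stand-in for the SUMO environment; states and rewards are random."""

    def __init__(self, agent_ids: List[str]):
        self.agent_ids = agent_ids

    def reset(self) -> Dict[str, float]:
        return {i: 0.0 for i in self.agent_ids}

    def step(self, actions: Dict[str, int]):
        rewards = {i: -random.random() for i in self.agent_ids}
        states = {i: random.random() for i in self.agent_ids}
        return rewards, states


class ToyAgent:
    """Placeholder for one intersection's policy and value networks."""

    def __init__(self, n_actions: int = 8):
        self.n_actions = n_actions
        self.updates = 0

    def act(self, state: float) -> int:
        return random.randrange(self.n_actions)

    def update(self, batch: list) -> None:
        # A real agent would compute advantages and returns from the batch
        # and take gradient steps on its value and policy losses here.
        self.updates += 1


def train(agent_ids: List[str], max_steps: int, batch_size: int,
          total_steps: int) -> Dict[str, ToyAgent]:
    env = ToyEnv(agent_ids)
    agents = {i: ToyAgent() for i in agent_ids}
    batch = {i: [] for i in agent_ids}
    state, t = env.reset(), 0
    for _ in range(total_steps):  # stands in for "until convergence"
        # Generate trajectories: every agent acts on its own local state.
        actions = {i: agents[i].act(state[i]) for i in agent_ids}
        rewards, next_state = env.step(actions)
        for i in agent_ids:
            batch[i].append((state[i], actions[i], rewards[i], next_state[i]))
        state, t = next_state, t + 1
        if t == max_steps:  # episode finished: restart the simulator
            state, t = env.reset(), 0
        # Update actors and critics once a full batch has been collected.
        if len(batch[agent_ids[0]]) == batch_size:
            for i in agent_ids:
                agents[i].update(batch[i])
                batch[i] = []
    return agents


if __name__ == "__main__":
    trained = train(["a", "b"], max_steps=10, batch_size=4, total_steps=20)
    print({i: agent.updates for i, agent in trained.items()})
```

Each agent acts and updates independently inside the per-agent loops, which is what allows these steps to run in parallel in a decentralized setting.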

Appendix D Implementation Details

D.1 Implementation details for the synthetic map

  • dimension of :

  • dimension of :

  • dimension of :

  • Policy network: concat[ReLU, ReLU] → LSTM → Softmax

  • Value network: concat[ReLU, ReLU] → LSTM → Linear

  • Each link is . The free flow speed of the EMV is and the free flow speed for non-EMVs is .

  • Temporal discount factor is and spatial discount factor is .

  • The initial learning rates of the policy and value networks are both 1e-3, and they decay linearly. The Adam optimizer is used.

  • MDP step length and for secondary pre-emption reward weight is .

  • Regularization coefficient is .

D.2 Implementation details for the Manhattan map

The implementation is similar to the synthetic network implementation, with the following differences:

  • The initial learning rates of the policy and value networks are both 5e-4.

  • Since the avenues and streets are both one-directional, the number of actions of each agent is adjusted accordingly.