1. Introduction
Traffic congestion continues to be a pervasive problem in large cities around the world, and is of particular concern in cities experiencing the highest growth (Systematics, 2004). According to the Texas A&M Transportation Institute’s 2019 Urban Mobility Report (Lasley, 2019), traffic congestion and unexpected delays cost $ per person per hour. Traffic bottlenecks are one of the leading causes of congestion (Hale et al., 2016).
A traffic bottleneck is a localized congestion problem, often caused by a small number of road segments that converge somewhere in the road network. When the traffic volume on bottleneck edges overwhelms the capacity of the surrounding roads, congestion increases and traffic flow is reduced throughout the transportation network (Ji et al., 2014; Zheng et al., 2014). The detection and removal of traffic bottlenecks is therefore a management priority in road networks, and can help a traffic management company reduce congestion through new construction or traffic signal timing (Yuan et al., 2014). However, dynamically identifying traffic bottlenecks remains an important unsolved problem, since many different factors influence traffic movement in the network, and these can change suddenly or slowly over time (Yue et al., 2018).
Bottlenecks in high-demand areas can result in traffic diffusion described by a traffic spread model, which will influence neighboring road segments and will, in turn, create new traffic bottlenecks as traffic continues to move across a road network (Bertini, 2006). Intuitively, when any road segment is heavily congested, traffic migrates to neighboring roads, causing the neighboring roads to become bottlenecks. Two other important issues are common when attempting to create a data-driven traffic spread model (Anwar et al., 2020): (1) faulty sensors: real-time sensor data is often noisy, which can make the predictions less reliable; (2) missing records: some segments in a road network may have no sensor data available at all. For example, % of daily traffic volume data in Beijing is not reliable or missing entirely (Qu et al., 2009). Relying purely on a node-edge road network graph is not sufficient, since the shortest path is not always the route taken by a commuter (Xu et al., 2018).
Nevertheless, a huge volume of trajectory data is being generated by the transportation industry at an unprecedented rate (Zheng, 2015; Wang et al., 2021, 2018). This data captures real-time traffic conditions in road networks and provides promising new opportunities to address this longstanding problem. In this work, we address this challenge by formulating the Trajectory-driven Traffic Bottleneck Identification problem (): Given a directed road network represented with a graph, and a database of trajectories, the traffic bottleneck identification problem finds edges in a graph, called seeds, such that the total number of edges influenced is maximized for the traffic spread model. Less formally, when given an existing road network with vehicles at a certain time, improving the traffic conditions of the road segments selected would result in the largest improvement in traffic across the entire road network. Therefore, the key question becomes: which set of edges should be targeted? While we do provide a solution for the problem as defined above, in this work we do not explore how professionals might best use these road segments to improve traffic conditions.
In order to solve the problem, we must first overcome several important challenges. Firstly, how should traffic be modeled dynamically over time? Traffic congestion can cascade through the edges of the network in unexpected ways (Gomez-Rodriguez et al., 2012). A traffic spread model can be used to represent how traffic flow diffuses through the network over time. Existing methods assign the propagation rate between two adjacent edges in one of three ways: as a constant value (e.g., ), as a value drawn uniformly from a predefined set (e.g., ), or as the reciprocal of the node degree (Goyal et al., 2011). However, these methods do not capture patterns in real data. Secondly, how should seed edges (the traffic bottlenecks) be identified based on the traffic spread model? Each bottleneck edge candidate can influence a set of neighboring edges which also become congested, and the “influence” from several bottleneck edges can overlap. Let denote the influence of the edge . For instance, , , both and will influence . However, when we add both and to the seed edge set, will only be counted once. Thus, the seed edges cannot be obtained by simply ranking candidates by their individual influence.
To resolve the first challenge, we propose that each road segment can be modeled as a bidirectional weighted edge, where the weight is equal to the traffic volume, and is closely related to a temporal factor (Huang et al., 2014; Lv et al., 2014). For example, more vehicles travel from the suburbs to the CBD during the morning peak hours, whereas more vehicles travel in the opposite direction in the evening. In addition, we maintain two important characteristics when we construct our traffic spread model, namely the traffic diffusion probability between two adjacent edges and the residual ratio of traffic volume for each edge, both of which can be computed using historical trajectory data. The traffic spread model has two important properties: (1) spatial influence range, which constrains how far the diffusion chain can propagate to other edges; (2) scale, which is the number of affected edges. Specifically, we will use the scale value to represent the influence of each edge.
To resolve the second challenge, we maintain a monitor time window and transform a historical timeline into snapshots, where the size of each snapshot is the spread time window . As time progresses, the result is updated for and the traffic volume is updated every using the specified traffic spread model. Next, a two-phase algorithm, consisting of influence acquisition and bottleneck identification, first identifies the influenced edges, and then selects the seed edges with maximum coverage, i.e., the traffic bottleneck edges with the most influence over the road network. Specifically, to select the seed edges, we apply a best-first () algorithm. The main idea is to iteratively find the most profitable edge among all unselected edges. Though the idea is simple, it is consistently effective. However, when the number of edges is large, it becomes computationally expensive to search for the most profitable edge. To improve the efficiency, we also propose a sampling-based greedy () algorithm, which performs the selection on a small subset of sampled candidate edges.
In summary, our contributions are as follows:

We formalize the Traffic Bottleneck Identification () problem and show that the problem is NP-hard in Section 3.

We propose a two-phase approximation algorithm which can be used to solve the problem in Section 4.

We perform an experimental study to investigate the performance of our proposed algorithms, and compare them with state-of-the-art methods on three different test collections in Section 5.
2. Related Work
In this section, we first describe literature related to traffic spread models for traffic diffusion in Section 2.1. We then introduce the influence maximization problem, which is the domain most relevant to ours, in Section 2.2.
2.1. Traffic Spread
A traffic spread model captures traffic flow evolution in a road network temporally (Tedjopurnomo et al., 2020). Anwar et al. (2020) devise a probabilistic traffic diffusion model and show how to compute the influence scores for each road segment in an urban road network in order to select the top edges with the maximal influence scores. Long et al. (2008) propose a congestion propagation model based on a cell transmission model, which describes flow propagation using links, where a link represents a road, which is divided into road segments called cells. The congestion propagation model is validated using a simulated test environment. However, the model differs from our own in the following two ways: (1) Their model only measures the aggregated inflow (total traffic volume) propagated into a cell, and it is unclear whether the road-to-road inflow (i.e., which cell originates the propagation) can be captured at the road segment level. (2) Their model returns a segment of a road or a zone in the network by comparing the speed to a configurable threshold, while we model the relationships between all road segments to identify traffic bottlenecks. Zhao et al. (2017) study a traffic congestion diffusion model for both temporal and spatial data. Each spatial region is divided into grids, and then the vehicle traffic flow in and out of each one is computed dynamically in order to construct a model of traffic flow between grids. However, their model only computes the traffic spread between any two grids and not between two edges as in our work. To capture this dynamic and temporal property, Rodriguez et al. (2011) propose the use of probability-based diffusion models such as the exponential, power-law, and Rayleigh models. Saberi et al. (2020) devise an epidemic framework inspired by the susceptible-infected-recovered (SIR) model to describe the dynamics of traffic congestion spread, and support their claim based on a historical multi-city analysis.
In this model, each node is in one of three states (susceptible, infected, or recovered), which simulate whether a road is “contaminated” or “recovered”. If a node is susceptible, then it has never been contaminated. If a node is infected, then it is currently contaminated. If a node is recovered, then it was contaminated and has now recovered. Although the model captures road congestion over time, it has two disadvantages: (1) it does not show the specific roads which are in a susceptible, infected, or recovered state; (2) modeling influence between two road segments remains an open problem when using this model.
To summarize, our problem differs in the following ways: (1) We aim to define a data-driven traffic spread model for a large city by profiling and analyzing the trajectory datasets, and do not rely on a mathematical simulation model, as done by Rodriguez et al. (2011). (2) When considering the influence of each road segment, we are not limited to the traffic measures used in prior work (Long et al., 2008; Anwar et al., 2020; Saberi et al., 2020), as road sensors may be faulty or even missing. Moreover, evaluating the effects of road segments cannot be reduced to simply comparing the traffic volume or speed to a predefined threshold (Bertini, 2006; Rao and Rao, 2012). Instead, we focus on modeling the influence between any two road segments, which is rarely considered in previous work.
2.2. Influence Maximization and Variations
In the problem of influence maximization (), the goal is to find a seed set composed of nodes that maximize influence spread over a network graph for a given influence model (Tang et al., 2014). The two most commonly used influence models are (Kempe et al., 2003): (1) Independent Cascade (IC): each node may be active or inactive initially, and time proceeds in discrete steps. At step , every node that became active at step attempts to activate each inactive neighbor with probability . If it fails, it does not try again; (2) Linear Threshold (LT): similar to the IC model, each node may be active or inactive at the beginning. In addition, each directed edge has a weight , and each node has a randomly generated threshold value . At step , an inactive node becomes active if the aggregated weight of its active neighbors within one hop exceeds the threshold . For further details see the recent survey of Li et al. (2018b).
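To make the Independent Cascade process concrete, the following is a minimal sketch of a single IC simulation. It assumes one uniform activation probability for all node pairs; the names `graph`, `seeds`, and `prob` are illustrative and not taken from the cited works.

```python
import random


def independent_cascade(graph, seeds, prob, rng=None):
    """Simulate one run of the Independent Cascade model.

    graph -- dict mapping a node to the list of its out-neighbors
    seeds -- initially active nodes
    prob  -- assumed uniform activation probability
    """
    rng = rng or random.Random(0)
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        newly_active = []
        for u in frontier:
            for v in graph.get(u, ()):
                # Each newly active node gets one attempt per inactive neighbor.
                if v not in active and rng.random() < prob:
                    active.add(v)
                    newly_active.append(v)
        frontier = newly_active
    return active
```

Because a node never returns to the inactive state in this sketch, it also illustrates the one-way state change that does not hold for traffic bottlenecks.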
The IC model is not applicable to the traffic diffusion problem since, under assumptions imposed by the IC model, each edge can only be considered once, and activated with a fixed probability, which is not true for traffic flow. Besides, the IC model does not rely on edge weights (Wang et al., 2010). Although the LT model considers edge weights, there is currently no approach to assign the probabilities using a dataset (Li et al., 2018b). In summary, our problem differs from the problem in several fundamental ways: (1) In , the state of a node can be switched from being inactive to being active, but not vice versa (Jiang et al., 2011). In our setting, however, an originally active edge (i.e., a bottleneck edge) may become inactive (i.e., a non-bottleneck edge) as time elapses. (2) The propagation of influence from one road to another road may incur a certain time delay (Chen et al., 2012), while temporal information is rarely considered when using the IC and LT models.
2.3. Other related areas
In addition to the literature above, there are some other domains which are loosely related to our work, such as spatial object selection (Guo et al., 2016; Zhang et al., 2018, 2019; Guo et al., 2018) and the maximizing bichromatic reverse k nearest neighbor (MaxRNN) problem (Choudhury et al., 2016; Luo et al., 2018; Zhou et al., 2011). Specifically, Guo et al. (2016) define an influence maximization problem on trajectories, which aims to find a subset of trajectories with the maximum expected influence among a group of audiences. Zhang et al. (2018) propose a billboard placement problem, which finds a set of billboards within a given budget that influence the most trajectories. Zhang et al. (2019) extend the billboard problem of Zhang et al. (2018) to support the inclusion of impression counts. Guo et al. (2018) study how to select a set of representative spatial objects from the current region of users’ interest. The MaxRNN query aims to find the locations to set up new facilities which can serve the largest number of users, assuming that users prefer to go to the nearest facility. However, none of these works uses an influence model which can be applied in our problem scenario.
3. Problem Formulation
Symbol  Description 

A road network with a set of vertices, and a set of edges  
A set of trajectories, each of which contains a sequence of timestamped geocoordinates  
, ,  A directed edge in 
The traffic volume of an edge  
A set of seed edges, , and  
The traffic diffusion probability from the edge to  
The residual ratio of traffic volume residing in the edge  
, or  The influenced edges impacted by an edge , or a seed edge set 
, or  The influence score of a seed edge , or a seed set 
The traffic congestion parameter  
The consecutive influence time threshold to measure a traffic bottleneck  
The traffic spread time window  
The traffic monitor time window 
In this section, we formalize the traffic bottleneck identification problem. Table 1 summarizes the necessary notations.
Road Network. A road network can be represented as a directed and edgeweighted graph, which contains a set of vertices and a set of edges . A directed edge is a road segment from a vertex () to another connected vertex (). Similarly, the directed edge from the vertex to can be denoted as . For ease of presentation, we may omit subscripts and use to denote a directed edge when the context is clear. Each edge has a weight to represent the traffic volume of a certain road segment .
Trajectory. Let denote a set of trajectories . A trajectory is a sequence of timestamped geocoordinates: , where each () is a spatial location in a dimensional data space, and each () is the timestamp at which the corresponding occurs.
Assume that we have a traffic monitor time window ( hour for example) and a traffic spread time window (such as seconds). The spread time window slides forward every time until the end of the monitor time window is reached. The spread time window moves batches, and then a set of seed edges can be computed for this time period. The traffic volume of each edge is updated for time .
3.1. Traffic Spread Model
Before we define our problem formally, we first explain the traffic spread model to indicate how traffic volume diffuses through a road network. Our main goal is to determine how the traffic spread influences changes in traffic volume for each edge (as described in Definition 3.3). First, we define how an edge can influence another edge.
Definition 3.1 ().
Influenced Value. Given an edge with an initial traffic volume value , the influence value of by after one spread time window is defined as:
(1) 
Here, denotes the travel time from to along a road network. Now consider that traffic spread models have a limited coverage region for their influence (Li et al., 2018a; Yu et al., 2018). So, we assume that if , then the influence from to is valid, and is the traffic diffusion probability from to . If two edges and are connective neighbors, then can be directly derived from the real historical trajectory statistics and computed using the traffic volume from to divided by the total traffic volume diffused out from . Otherwise, can be computed as ), where is a travel path from to along the road network. Here, denotes the aggregated probabilities of all possible paths within travel time from to , and is the residual ratio of traffic volume to denote the remaining traffic volume which may still reside in edge .
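As an illustration of how the two model parameters could be derived from trajectory counts, the sketch below estimates the residual ratio as the share of vehicles remaining on an edge, and the diffusion probability as the volume moving to a neighbor divided by the total volume diffused out. The function names, the input dictionaries, and the product form used in `influence_value` are assumptions made for this example, since Equation 1 is not reproduced here.

```python
from collections import defaultdict


def estimate_spread_parameters(initial, remaining, moved):
    """Estimate residual ratios and diffusion probabilities from counts.

    initial[e]    -- vehicles on edge e at the start of the window
    remaining[e]  -- vehicles still on e at the end of the window
    moved[(e, f)] -- vehicles that moved from e to the adjacent edge f
    """
    residual = {}
    for e, n in initial.items():
        residual[e] = remaining.get(e, 0) / n if n else 0.0

    diffusion = defaultdict(dict)
    for (src, dst), count in moved.items():
        spread_out = initial.get(src, 0) - remaining.get(src, 0)
        if spread_out > 0:
            diffusion[src][dst] = count / spread_out
    return residual, dict(diffusion)


def influence_value(volume, residual, diffusion, src, dst):
    # Assumed form: the volume leaving src, scaled by the probability of reaching dst.
    p = diffusion.get(src, {}).get(dst, 0.0)
    return volume[src] * (1.0 - residual[src]) * p
```

With hypothetical counts of 100 vehicles on an edge, 40 remaining, and 30 moving to one neighbor, this gives a residual ratio of 0.4 and a diffusion probability of 0.5 for that neighbor.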
Example 0 ().
Figure 3.1 shows an example of a road network, which consists of eight nodes from to , and ten directed edges. For ease of visualization, we assume that only one directed edge exists for any two connected nodes, for example the edge from to is denoted as . We use dotted lines to represent exiting edges, such as the dotted edge originating from . In addition, Table 3.1 describes the traffic diffusion probability between any two directly connected edges and in a certain period of time. For instance, . We assume the residual ratio equals in the following examples. Both the and values can be estimated by analyzing the historical trajectory datasets and depend on the time window we aim to monitor. In our experiments, we precompute the and values for each hour. For example, using the historical trajectory dataset we know that there are vehicles on the edge initially. After traffic diffusion occurs, vehicles are spread out while vehicles are still left on the edge . We also know that vehicles moved from to . So, we can compute: , and . In practice, the roads are connected and the traffic diffusion probability is normally less than one.
Definition 3.3 ().
Traffic Spread. We now describe the traffic volume change after one traffic spread window to ease illustration. Given an edge with an initial traffic volume value , the incremental traffic inflow diffused by other edges is defined as:
(2) 
Similarly, the incremental traffic outflow which indicates the diffusing value from to other influenced edges is:
(3) 
Finally, if has an initial traffic volume , then the traffic volume after one traffic spread window is:
(4) 
Example 0 ().
Table 3 illustrates the traffic volume after traffic diffusion occurs. Note that, to ease illustration, we assume in this example that an edge can only influence its neighbor edges within a single hop, whereas the traffic spread time window (say seconds) is used to restrict the spread range in our experiments. For example, the initial traffic volume of the edge is . After the diffusion, the incremental incoming traffic volume diffused by the neighbor edges and is: . Similarly, the incremental outgoing traffic volume impacting the neighbor edges and is: . Finally, the traffic volume of edge after one spread time window is: .
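The single-hop update in this example can be sketched as follows. The update rule is an assumption made for illustration: the volume leaving an edge is taken to be its volume times one minus its residual ratio, split among neighbors by diffusion probability, and the new volume is the old volume plus inflow minus outflow. Destinations that are not in `volume` are treated as exits from the network.

```python
def spread_one_window(volume, residual, diffusion):
    """Return edge volumes after one spread window (assumed update rule).

    volume[e]       -- current traffic volume of edge e
    residual[e]     -- residual ratio of edge e
    diffusion[e][f] -- diffusion probability from e to neighbor f
    """
    inflow = {e: 0.0 for e in volume}
    outflow = {e: 0.0 for e in volume}
    for src, targets in diffusion.items():
        leaving = volume[src] * (1.0 - residual[src])
        for dst, p in targets.items():
            moved = leaving * p
            outflow[src] += moved
            if dst in inflow:  # otherwise the traffic exits the network
                inflow[dst] += moved
    return {e: volume[e] + inflow[e] - outflow[e] for e in volume}
```

Applying the function repeatedly, once per spread time window, gives the volumes that feed the next window.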


Definition 3.5 ().
Congested Edge. If the traffic volume of an edge is greater than a traffic volume threshold, i.e., , then is considered to be a congested edge. The length of the road segment, , can be easily computed based on real world geographical coordinates for datasets collected for a city. The traffic volume threshold can be changed using the traffic congestion parameter .
Definition 3.6 ().
Influenced Edge. Given two congested edges and , if and (i.e., is congested), then we call an influenced edge of . Let denote the influenced edges of at the th spread time window, then .
Definition 3.7 ().
consecutive. Given a set of influenced edges , , of the edge for multiple spread time windows, an influenced edge is consecutive if it is influenced by for a number of consecutive spread time windows that is no smaller than the threshold (measured in units of the spread time window size). The influenced edge set contains the influenced edges which satisfy the consecutive constraint.
Note that, in contrast to the concept of an influenced edge as in Definition 3.6 from only one spread time window, the consecutive in Definition 3.7 is from the perspective of the monitor time window which covers multiple spread time windows .
Example 0 ().
Suppose we have the influenced edges set , , for three spread time windows, and (two spread time windows), then since appears in and . Although is influenced by the first and third spread time windows, these two spread time windows are not consecutive.
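The consecutive constraint can be checked by intersecting the influenced sets of adjacent spread time windows. The sketch below is one way to do this; `windows` (one influenced set per spread time window, in time order) and `theta` (the consecutive threshold, in number of windows) are names chosen for this example.

```python
def consecutive_influenced(windows, theta):
    """Edges that appear in at least `theta` consecutive influenced sets."""
    if theta < 1:
        raise ValueError("theta must be at least 1")
    result = set()
    for start in range(len(windows) - theta + 1):
        # Intersect theta adjacent windows, then merge into the result.
        common = set(windows[start])
        for w in windows[start + 1:start + theta]:
            common &= set(w)
        result |= common
    return result
```

For three windows and a threshold of two, an edge that appears only in the first and third windows is excluded, matching the example above.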
3.2. Traffic Bottleneck Identification
Now, using the road network information and the traffic spread model above, we are in a position to define the traffic bottleneck identification problem. The overall goal is to find the important edges in a road network using the movement of traffic.
Definition 3.9 ().
Traffic Bottleneck Identification () Given a road network , a set of trajectories , a positive parameter , a traffic congestion parameter , a consecutive time interval , a spread time window and a monitor time window , choose a set of seed edges of size (), where , such that the total number of influenced edges is maximal, where is:
(5) 
Here, is defined as the total number of edges influenced by . However, additional factors such as the influence contribution degree for each edge may also be included, and do not change the time complexity of the problem.
3.3. NP-Hardness
We now show that the problem is NP-hard using a reduction from the Maximum Coverage problem.
Lemma 3.10 ().
The problem is NP-hard.
Proof.
In the Maximum Coverage () problem, given a collection of sets over a set of objects , where and each element has an associated weight , and a positive integer , we wish to know whether sets exist (e.g., ) such that the weight of the elements in is maximized. The NP-hardness of the problem was shown by Hochbaum and Pathria (1998), so it can serve as the basis of the reduction, which proceeds as follows. Given an arbitrary instance of the problem, each edge in the problem can be mapped to the elements in the problem. Each edge has an influenced edge set , so we map and the cardinality of ( ) in the problem to the subset and the weight in the problem, respectively. The goal of the problem as defined in Equation 5 is to select edges from , such that the total number of edges covered by the selected seed edge set () is maximized.
So, solving the problem is equivalent to deciding whether there are sets whose union has the maximum aggregated weight in the problem. Therefore, the problem is NP-hard. ∎
4. Our Approach
In this section, we propose a two-phase approach to solve the problem, consisting of influence acquisition and bottleneck identification, which are described in Section 4.1 and Section 4.2, respectively. The first phase, influence acquisition, obtains the edges influenced in each spread time window, and then filters out the candidate edges which violate the consecutive constraint. The output of influence acquisition is passed to the next phase, bottleneck identification, which selects the seed edges that maximize the influence score. We propose two different algorithmic solutions: a best-first () algorithm and a sampling-based greedy () algorithm with approximation guarantees in Section 4.2.1 and Section 4.2.2, respectively.
4.1. Influence Acquisition
The influence acquisition mainly contains two steps as shown in Algorithm 1.
(1) Get the influenced edges (i.e., ) for each edge in each spread time window (lines 11). That is, for each edge in the th spread time window, we start from and use depth-first search over the road network graph to obtain all the connected neighbor edges within spatial range (line 1). Note that this operation can be done offline, as we know the road length a priori and can thus estimate the travel speed. Next, for each connected neighbor edge , we compare the traffic volume with the congestion threshold and determine whether it is a congested edge (line 1) as defined in Definition 3.5. We regard an edge as a traffic bottleneck if it can reach other roads that are congested. This is an online operation since the real traffic volume in the th spread time window can continuously change over time. Finally, we update the traffic volume for each edge based on Equation 4 (line 1), which is used as the input to compute the next spread time window.
(2) Filter out the influenced edges in (obtained in the first step) that do not satisfy the consecutive constraint (from Definition 3.7) (lines 11). The intersection of the influenced edges over consecutive spread time windows is computed and inserted into the influenced set for each edge in line 1.
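The first step can be sketched for a single source edge and a single spread time window as a depth-first search bounded by a travel-time budget. This is a simplified illustration and not Algorithm 1 itself: the inputs `successors`, `travel_time`, `threshold`, and `budget` are hypothetical names, the per-edge congestion threshold is assumed to be precomputed, and diffusion probabilities are ignored.

```python
def influenced_edges(source, successors, travel_time, volume, threshold, budget):
    """Congested edges reachable from a congested `source` within `budget`.

    successors[e]  -- edges that directly follow e in the road network
    travel_time[e] -- time needed to traverse e
    threshold[e]   -- congestion threshold of e (assumed precomputed)
    """
    if volume[source] <= threshold[source]:
        return set()  # only a congested edge can influence others
    found = set()
    best = {source: 0.0}  # smallest travel time found so far per edge
    stack = [(source, 0.0)]
    while stack:
        edge, elapsed = stack.pop()
        for nxt in successors.get(edge, ()):
            t = elapsed + travel_time[nxt]
            if t > budget or best.get(nxt, float("inf")) <= t:
                continue  # out of range, or already reached at least as fast
            best[nxt] = t
            if nxt != source and volume[nxt] > threshold[nxt]:
                found.add(nxt)
            stack.append((nxt, t))
    return found
```

Running this for every edge in every spread time window, and then applying the consecutive constraint, yields the influenced sets used by the second phase.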
4.2. Bottleneck Identification
Given all edges and the influenced edges, bottleneck identification selects a small set of seed edges from . We introduce two different approaches to achieve the goal in Sections 4.2.1 and 4.2.2, respectively.
4.2.1. Best-First Algorithm ()
The best-first algorithm repeatedly adds to the seed set the edge that provides the maximum marginal gain, which is . The algorithm has an approximation ratio guarantee of , given the monotonicity and submodularity properties described in Lemma 4.1 and Lemma 4.2, respectively.
Lemma 4.1 ().
(Monotonicity) Let and be two sets of seed edges, and . Then,
(6) 
Proof.
Since , we have , then . Therefore, . ∎
Lemma 4.2 ().
(Submodularity) Let and be two sets of seed edges, and . Assume that is a newly inserted edge, we have:
(7) 
Proof.
Based on differing relationships between and , we analyze all possible outcomes as follows:
Case 1. If , then . Thus = . Similarly, = .
Case 2. If , then . Then = . As such, .
Case 3. If , we assume that , then . We also assume that , then . Note that , thus we can easily obtain that based on the monotonicity property in Lemma 4.1. Thus .
Therefore, the relationship is submodular. ∎
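The best-first selection can be sketched as a greedy maximum-coverage loop. This is a minimal illustration, assuming the influence score of a seed set is the size of the union of its influenced sets; `influenced` (a dictionary from each candidate edge to its set of influenced edges) and `k` (the number of seeds) are names chosen for this example.

```python
def best_first(influenced, k):
    """Greedily pick up to k seed edges by marginal coverage gain."""
    seeds, covered = [], set()
    candidates = set(influenced)
    for _ in range(k):
        best, best_gain = None, 0
        for e in sorted(candidates):  # sorted for deterministic tie-breaking
            gain = len(influenced[e] - covered)
            if gain > best_gain:
                best, best_gain = e, gain
        if best is None:
            break  # no remaining edge adds new coverage
        seeds.append(best)
        covered |= influenced[best]
        candidates.remove(best)
    return seeds, covered


seeds, covered = best_first(
    {"e1": {"a", "b", "c"}, "e2": {"b", "c", "d"}, "e3": {"d", "e"}}, 2
)
```

In this toy input, ranking edges by individual influence would select `e1` and `e2`, which together cover four edges, whereas the marginal-gain rule selects `e1` and `e3`, which cover five.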
4.2.2. Sampling-based Greedy Algorithm ()
The algorithm requires estimates in total and a seed edge is selected iteratively. However, most of these estimates are wasted since we only care about a small set of edges with the greatest influence spread. Therefore, to reduce the time consumption of on larger datasets, a sampling-based greedy algorithm is proposed and referred to as henceforth. The key idea is to sample a subset of edges as candidates instead of using all the edges. It is also possible to prove that can maintain an equivalent approximation guarantee to our previous approach.
Before introducing our algorithm, we first define the data structure which denotes the reverse influenced edges set. An structure contains an edge as the key, and the list of edges that impact as the values. For example, if an edge influences , then and . Now for each edge , we can record its reverse influenced edges set from the influenced edges sets.
The intuition of the algorithm is that if an edge appears in a large number of sets, then it has a high probability of influencing other edges. Therefore, the expected influence of will be large. Based on this premise, if a set with size covers the most sets, then will have the maximum expected influence for all size edge sets in .
Algorithm 2 shows in more detail. Specifically, we first generate a certain number of random sets with size as candidates (line 2). Then we will choose an edge which can cover the most sets (line 2) and add it into the seed set (line 2). Finally, the sets that have been covered are removed by (line 2). One important question remains: How should we specify the number of sets sampled, which is ? Lemma 4.4 provides the answer. This however requires a few other key pieces of information.
Lemma 4.3 ().
Let be a random Bernoulli variable that equals 0 with the probability if , and 1 with the probability otherwise. Given a seed set and a random set , then the expected influence of an arbitrary seed set using random sets is:
(8) 
To ensure the estimation in Equation 8 is accurate, must not deviate significantly from its expectation with an error bound and confidence , meaning that Equation 9 must hold. For example, we normally set and .
(9) 
For simplicity, since and all related variables are independent, then , where denotes the probability that overlaps with a random set, and . Let , we have . Equation 9 then reduces to:
(10) 
Then, we have:
(11) 
Lemma 4.4 ().
Given a confidence parameter , a sufficiently small error parameter and a sampling size , then for any set with edges, the following inequality holds with a probability that is at least :
(12) 
Proof.
Using a twosided Chernoff bound, for any , we have
(13) 
If we want a confidence of in the estimation, we would like the right side of Equation 13 to be at most , which is:
(14) 
∎
In essence, will still hold the approximation guarantee as it has the same iteration logic as the algorithm. That is, the approximation guarantee is maintained with probability as only a subset of edges is considered in , where is a confidence parameter such as 0.05.
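The sampling-based selection can be sketched as follows. This is an illustrative approximation of the idea and not Algorithm 2 itself: reverse influenced sets are sampled uniformly with replacement, the sample count `num_samples` is passed in directly instead of being derived from Lemma 4.4, and the function and parameter names are assumptions.

```python
import random


def reverse_influenced(influenced):
    """Invert edge -> influenced edges into edge -> edges that influence it."""
    reverse = {}
    for src, targets in influenced.items():
        for dst in targets:
            reverse.setdefault(dst, set()).add(src)
    return reverse


def sampling_greedy(influenced, k, num_samples, seed=0):
    """Pick up to k seeds that cover the most sampled reverse sets."""
    rng = random.Random(seed)
    reverse = reverse_influenced(influenced)
    keys = sorted(reverse)
    if not keys:
        return []
    samples = [reverse[rng.choice(keys)] for _ in range(num_samples)]
    seeds = []
    for _ in range(k):
        counts = {}
        for s in samples:
            for e in s:
                counts[e] = counts.get(e, 0) + 1
        if not counts:
            break  # every sampled set is already covered
        best = max(sorted(counts), key=counts.get)
        seeds.append(best)
        # Drop the sampled sets that the chosen edge covers.
        samples = [s for s in samples if best not in s]
    return seeds
```

Each iteration only scans the sampled sets, so the cost depends on the number of samples and not on the total number of edges.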
5. Experiment
In our experimental study, we aim to investigate the following questions.

Q1. How sensitive are our new approaches to parameter choice, and how do the choices affect efficiency and effectiveness tradeoffs?

Q2. How well do our methods scale as the dataset size increases?

Q3. How should the seed sets selected be evaluated when using our methods in a real traffic monitoring scenario?
5.1. Experimental Setup
Datasets. We use the taxi trajectory datasets of Xi’an, Chengdu and Porto, where the first two datasets are from the Didi Chuxing GAIA Initiative (https://gaia.didichuxing.com), and the third one is from a Kaggle trajectory prediction competition (http://www.geolink.pt/ecmlpkdd2015challenge/dataset.html). Table 4 describes the statistics of the road network and trajectory datasets.
Xi’an  Chengdu  Porto  
#nodes  2,086  4,326  60,287  
#edges  5,045  6,135  108,571  
avg edge length (m)  199  194  114  
latitude range  [, ]  [, ]  [, ]  
longitude range  [, ]  [, ]  [, ]  
#trajectories  119,019  192,901  1,565,595  
#points  28,327,565  41,664,011  100,995,114  
time span  Oct 1, 2016  Oct 1, 2016  July 1, 2013  June 30, 2014  
sampling time (s)  24  24  15 

Road Network. The spatial regions of both Xi’an and Chengdu are located in their respective urban areas, i.e., the regions bounded by the 2nd Ring Road. The road network dataset of each city is obtained from OpenStreetMap (https://www.openstreetmap.org/) based on a bounding box, where the latitude and longitude ranges are shown in Table 4. A road may be composed of one or more road segments, where two road segments with opposite directions form two different edges in the road network. Each of the two edges experiences different traffic flow patterns. The road networks for these three cities are shown in Figure 2.

Trajectory. In the raw taxi trip dataset, each driver has multiple orders, where each order includes a series of spatial locations attached with timestamp information. We consider each rider order as a trajectory.
Map-matching from raw trajectories to road network. The goal of map-matching is to match geographic coordinates (e.g., in the raw trajectories from vehicle GPS) in the real world to an existing road network (i.e., a graph). Similar to existing work (Wang et al., 2019), we use the map-matching algorithm of Yang and Gidofalvi (2018), which was previously shown to be efficient and scalable, to align the raw trajectories with the corresponding road network. This is a one-off, offline operation. Total matching times were around hours, hours and hours for Xi’an, Chengdu and Porto, respectively.
Implementation. All algorithms were implemented in C++, and all experiments were performed on a Linux server with an Intel Xeon E5 CPU and 256 GB RAM.
Algorithms. We include several algorithms in the experimental comparison. Since no previous work solves our problem (e.g., with the same information diffusion model), we adapt existing techniques for finding influential edges (or nodes, in other application scenarios) and extend them to use our traffic spread model.

The first baseline is a top ranking algorithm. It first divides the road network based on the traffic volume values by using the k-way partition [1], and then obtains the cut edges of the resulting partition. Finally, it sorts the cut edges in descending order of their average traffic volume over multiple spread time windows, and returns the top seed edges.

The second baseline is a community-based greedy algorithm that finds a subset of nodes with the maximal influence spread in a mobile social network (Wang et al., 2010). First, communities in a social network are detected using the information diffusion model. Then the algorithm applies dynamic programming to select communities and find the top influential nodes.

The third baseline is a cluster-based algorithm that finds a subset of trajectories with the maximum expected influence among a group of audiences (Guo et al., 2016). First, the trajectory database is divided into clusters using the k-means method, where the distance between two trajectories is computed based on overlapping POIs. Then the algorithm locates a cluster which may generate the maximal marginal gain, and identifies the edge which achieves that gain. After a seed edge is selected, the marginal gain of the remaining edges is updated, and the process continues until the specified number of seed edges has been returned.

The fourth algorithm is the best-first algorithm introduced in Section 4.2.1.

The fifth algorithm is the sampling-based greedy algorithm introduced in Section 4.2.2.
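The final ranking step of the top ranking baseline can be sketched as follows. This is a minimal illustration that assumes the cut edges have already been produced by the partitioning step; the function name, the edge identifiers, and the volume values are hypothetical.

```python
from statistics import mean

def rank_cut_edges(cut_edges, volumes_per_window, num_seeds):
    """Return the num_seeds cut edges with the highest average traffic volume.

    cut_edges: iterable of edge ids (assumed output of the partitioning step).
    volumes_per_window: dict mapping an edge id to its list of traffic volumes,
        one value per spread time window.
    """
    # Edges without any recorded volume are treated as having volume 0.
    avg = {e: mean(volumes_per_window.get(e, [0])) for e in cut_edges}
    return sorted(avg, key=lambda e: avg[e], reverse=True)[:num_seeds]

# Illustrative data: three cut edges observed over two spread time windows.
cut = ["e1", "e2", "e3"]
vols = {"e1": [10, 20], "e2": [40, 50], "e3": [5, 5]}
top = rank_cut_edges(cut, vols, 2)
```

Because the ranking depends only on volume, it ignores how congestion on one edge spreads to others, which is the property the effectiveness comparison examines.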
Performance Measurement. We evaluate both the efficiency and the effectiveness of all methods. For efficiency, we report the runtime to select the top seed edges. Each experiment is repeated times and the average runtime is reported. For effectiveness, we report the coverage ratio, which is computed as the influence score of the selected seed edges divided by the number of edges covered by the trajectories. A larger coverage ratio indicates a better selection of seed edges.
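The coverage ratio can be illustrated with a short sketch. Here we assume, for illustration only, that the influence score of a seed set is the number of distinct edges influenced by at least one seed; the exact definition is given with the traffic spread model. All names and values below are hypothetical.

```python
def coverage_ratio(seed_edges, influenced_by, covered_edges):
    """Influence score of the seeds divided by the number of covered edges.

    seed_edges: iterable of selected seed edge ids.
    influenced_by: dict mapping an edge id to the set of edges it influences.
    covered_edges: set of edges covered by at least one trajectory.
    """
    influenced = set()
    for seed in seed_edges:
        # Union, so an edge influenced by several seeds is counted once.
        influenced |= influenced_by.get(seed, set())
    if not covered_edges:
        return 0.0
    return len(influenced) / len(covered_edges)

# Illustrative data: two seeds influencing three distinct edges out of eight.
influence = {"a": {"a", "c"}, "b": {"c", "d"}}
covered = {"a", "b", "c", "d", "e", "f", "g", "h"}
ratio = coverage_ratio(["a", "b"], influence, covered)
```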
Parameter Setting. Parameter settings are shown in Table 5, with the default values shown in bold. Instead of fixing the number of seed edges directly, we use an additional parameter that controls the ratio of seed edges to be selected over the total number of edges. Thus, the number of seed edges is this ratio multiplied by the total number of edges.
Parameter  Setting  



(vehicle per meter)  1, 2, 3, 4, 5  
10, 20, 30, 40, 50  
3600 (i.e., 18:00 to 19:00)  
1, 2, 3, 4, 5  
The sampling size 

5.2. A Statistical Analysis of the Datasets
We now perform a statistical comparison to support the default parameter choices as shown in Table 5.
The distribution of edge lengths in the road network. First, we perform a statistical analysis of the edge length distribution, which is shown in Figure 3. For example, there are edges whose lengths are smaller than meters in the road network of Xi’an. For Chengdu, it can be observed that most of the edges are less than meters, and nearly half of the edges are within meters. Existing work on traffic diffusion (Li et al., 2018a; Yu et al., 2018) assumes that traffic flow can spread at most five hops from a given node or edge, after which the diffusion power is mitigated. Instead of using the “hop” as the unit of traffic spread, we use a time window to control the range of traffic diffusion, because the real traffic speed may vary across edges. In our experiments, the traffic speed is estimated from the historical trajectory datasets. In particular, we calculate the traffic speed per hour. We first record all of the trajectories passing through each edge, and then compute the traffic speed as the distance between two POIs in a trajectory divided by their timestamp difference. We then use the resulting average traffic speed in our experiments.
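The speed estimation described above can be sketched as follows. This is a minimal illustration under our own assumptions about the input layout: each observation is a pair of consecutive map-matched points on one edge, given as the distance between them and their two timestamps. The function name and the sample values are hypothetical.

```python
from collections import defaultdict

def average_edge_speeds(observations):
    """Average speed (m/s) per edge from consecutive trajectory points.

    observations: iterable of (edge_id, distance_m, t_start_s, t_end_s), one
        entry per pair of consecutive points of a trajectory on that edge.
    """
    speed_sum = defaultdict(float)
    count = defaultdict(int)
    for edge, distance_m, t_start, t_end in observations:
        dt = t_end - t_start
        if dt <= 0:
            # Skip pairs with duplicate or out-of-order timestamps.
            continue
        speed_sum[edge] += distance_m / dt
        count[edge] += 1
    return {edge: speed_sum[edge] / count[edge] for edge in speed_sum}

# Illustrative data: the last observation has a zero time difference.
obs = [
    ("e1", 30.0, 0, 3),
    ("e1", 40.0, 3, 5),
    ("e2", 45.0, 10, 25),
    ("e2", 10.0, 7, 7),
]
speeds = average_edge_speeds(obs)
```

Given such per-edge speeds, the travel time of an edge is its length divided by its speed, which is how a time window (rather than a hop count) can bound the range of diffusion.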
The distribution of edge volumes in the road network. The statistical analysis of the edge volume distribution is illustrated in Figure 4 for one peak hour (i.e., 18:00 to 19:00). Note that the statistical ratio is based only on edges that are covered by trajectories. For instance, edges have a traffic volume of less than in the road network of Xi’an, and edges have an edge density of larger than . But as one might expect, there are some edges which are not covered by any trajectory. Specifically, the proportions of edges without covering trajectories are , and for Xi’an, Chengdu and Porto, respectively. Observe that the traffic volumes in Porto are much smaller, and the cause is twofold: (1) The Porto dataset uses a lower sampling rate than the other two datasets (see the sampling times in Table 4). (2) Even though there are more trajectories in Porto than in Xi’an and Chengdu, they are spread over a much longer time span (nearly a year), whereas the trajectories of the other two cities cover a single day.
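The two statistics discussed above, per-edge volume in a peak hour and the proportion of edges without any covering trajectory, can be sketched as follows. We assume here that the volume of an edge is the number of distinct trajectories that traverse it within the time window; this counting rule, the function names, and the sample data are illustrative assumptions.

```python
from collections import Counter

def edge_volumes(trajectories, start_s, end_s):
    """Count, per edge, the trajectories that touch it within [start_s, end_s).

    trajectories: list of trajectories, each a list of (edge_id, timestamp_s).
    """
    volumes = Counter()
    for trajectory in trajectories:
        # A trajectory contributes at most once to each edge it traverses.
        edges_in_window = {edge for edge, ts in trajectory if start_s <= ts < end_s}
        volumes.update(edges_in_window)
    return volumes

def uncovered_fraction(all_edges, volumes):
    """Fraction of edges in the network that no trajectory covers."""
    if not all_edges:
        return 0.0
    uncovered = [edge for edge in all_edges if volumes.get(edge, 0) == 0]
    return len(uncovered) / len(all_edges)

# Illustrative data: timestamps are seconds since midnight.
PEAK_START, PEAK_END = 18 * 3600, 19 * 3600
trips = [
    [("e1", 64800), ("e2", 64900)],
    [("e1", 65000), ("e1", 65010)],
    [("e3", 70000)],  # outside the peak hour
]
volumes = edge_volumes(trips, PEAK_START, PEAK_END)
missing = uncovered_fraction(["e1", "e2", "e3", "e4"], volumes)
```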
5.3. Experimental Result (Q1 and Q2)
5.3.1. Performance Evaluation on Influence Acquisition
For each edge, the average runtime to identify the influenced edges in each spread time window are ms, ms, and ms for Xi’an, Chengdu, and Porto, respectively. After identifying the influenced edges in each spread time window, we check the consecutive constraint. We therefore also study the cost of checking this constraint under different parameter values on all three datasets, which is summarized in Table 6. As the parameter increases, the runtime also increases for all datasets, since more combinations of consecutive spread time windows have to be validated. Observe that the runtime on the Porto dataset is much smaller than on the other two datasets. As discussed in Section 5.2, a large number of edges in the Porto dataset are not covered by trajectories. When traffic spreads from one edge to another, the traffic influence value (Equation 1) is normally zero if the latter edge is not covered by trajectories, because the traffic diffusion probability is zero. In such cases, the number of influenced edges is smaller, which translates to a lower runtime during influence acquisition.
Table 6. Runtime (ms) of checking the consecutive constraint.
Parameter value  Xi’an  Chengdu  Porto
1  0.611  0.686  0.229
2  0.968  1.077  0.395
3  1.327  1.450  0.559
4  1.654  1.829  0.720
5  1.998  2.208  0.883
5.3.2. Performance Evaluation for Bottleneck Identification
In this section, we conduct an experimental study to evaluate the efficiency and effectiveness of bottleneck identification against the baselines under different parameter settings, over all three datasets. We plot multiple efficiency and effectiveness tradeoff graphs in order to better understand the performance differences among the algorithms. The start and end sweep values are shown for each line to make it easier to observe the performance trend of each algorithm.
Effect of the ratio of seed edges. The effect of this ratio, which controls the number of selected seed edges, on the performance is presented in Figure 5(a), Figure 6(a) and Figure 7(a). As the ratio of seed edges increases, both the coverage ratio and the runtime of all the algorithms increase, since more seed edges are selected and thus more influenced edges can be covered. Even though the top ranking algorithm is more efficient than the other four methods, its coverage ratio is significantly worse, as it only sorts the cut edges in descending order of their traffic volume. This is valuable evidence that roads (e.g., highways) with high traffic volumes do not necessarily have the most influence on other road segments. Furthermore, we can observe that our two proposed algorithms consistently outperform the community-based and cluster-based baselines by an order of magnitude in runtime, which shows that our proposed methods are also efficient. Compared with , is more stable with a larger because first locates a cluster with the largest estimated marginal gain and then selects a suitable seed edge with the maximum marginal gain. When is large, has to compute the estimated marginal gain values for more clusters. In the Porto dataset, both and can outperform by two orders of magnitude, showing that our algorithms are highly scalable. In terms of effectiveness, consistently outperforms the other algorithms as it can always leverage the best solution currently found directly.
Effect of the traffic congestion threshold. The effect of the traffic congestion threshold on the performance is shown in Figure 5(b), Figure 6(b) and Figure 7(b). With a larger threshold, all the algorithms achieve a better running time and coverage ratio, since fewer edges are considered congested. In terms of efficiency, our two proposed algorithms still have lower runtimes than the baselines. For the effectiveness, consistently achieves the best coverage ratio.
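Table 7. Effect of the sampling size.
Dataset  Method  Sampling size  Runtime (seconds)  Coverage ratio
Xi’an  baseline  -  0.36  0.153
Xi’an  sampling-based greedy  20%  0.20  0.140
Xi’an  sampling-based greedy  30%  0.21  0.148
Xi’an  sampling-based greedy  40%  0.42  0.152
Chengdu  baseline  -  0.56  0.176
Chengdu  sampling-based greedy  20%  0.31  0.159
Chengdu  sampling-based greedy  30%  0.43  0.174
Chengdu  sampling-based greedy  40%  0.63  0.175
Porto  baseline  -  4.98  0.010
Porto  sampling-based greedy  30%  2.96  0.009
Porto  sampling-based greedy  40%  3.73  0.010
Porto  sampling-based greedy  50%  4.47  0.010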
Effect of the traffic spread time window. The effect of the traffic spread time window on the performance is depicted in Figure 5(c), Figure 6(c) and Figure 7(c). With a larger time window, all the algorithms maintain a stable trend in runtime and coverage ratio, as more edges can be influenced by a given seed edge after the traffic diffusion.
Effect of sampling size. We also carry out an experimental study on the sampling-based greedy algorithm when the sampling size is varied. The performance of this algorithm is closely related to the sampling size, which is the number of samples used as candidates. The comparison of different sampling sizes against the baseline is presented in Table 7. As one might expect, a larger sample size leads to a better coverage ratio at the cost of a longer runtime. Based on this study, we set the default sampling size to 30%, 30% and 40% for Xi’an, Chengdu and Porto, respectively, as these settings provide a competitive tradeoff between efficiency and effectiveness.
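The tradeoff controlled by the sampling size can be illustrated with a generic greedy maximum-coverage routine that only considers a uniform random sample of the edges as candidates. This is a sketch under our own assumptions, not the algorithm of Section 4.2.2: the influence sets are assumed to be precomputed, and the function name, parameters, and data are hypothetical.

```python
import random

def sampled_greedy(influenced_by, num_seeds, sample_ratio, seed=0):
    """Greedily pick seed edges from a random sample of candidate edges.

    influenced_by: dict mapping an edge id to the set of edges it influences.
    sample_ratio: fraction of edges kept as candidates, in (0, 1].
    Returns the chosen seeds and the set of edges they cover.
    """
    rng = random.Random(seed)
    edges = sorted(influenced_by)
    sample_size = min(len(edges), max(1, round(sample_ratio * len(edges))))
    candidates = rng.sample(edges, sample_size)
    covered, seeds = set(), []
    for _ in range(min(num_seeds, len(candidates))):
        # Pick the candidate with the largest marginal gain in coverage.
        best = max(candidates, key=lambda e: len(influenced_by[e] - covered))
        if not influenced_by[best] - covered:
            break  # no remaining candidate adds new coverage
        seeds.append(best)
        covered |= influenced_by[best]
        candidates.remove(best)
    return seeds, covered

# Illustrative data: with sample_ratio=1.0 every edge is a candidate.
influence = {"a": {"a", "b", "c"}, "b": {"b", "c"}, "c": {"d"}, "d": {"a"}}
full_seeds, full_cover = sampled_greedy(influence, 2, 1.0)
half_seeds, half_cover = sampled_greedy(influence, 2, 0.5)
```

A smaller `sample_ratio` shrinks the candidate set that each greedy round scans, which reduces work per round but may exclude the edges with the largest gains.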
Scalability Test. Comparing the performance of all methods across the three datasets, we find that the runtime of both the cluster-based and community-based baselines increases significantly as the collection size increases. The time complexity of these two algorithms is proportional to the size of the clusters or communities derived from dividing the road network. For the cluster-based algorithm, a suitable cluster has to be found first for every seed edge selected, and then the estimated maximum marginal gain has to be updated for both clusters and edges. As the community-based algorithm is based on dynamic programming over the communities, increasing the number of communities reduces its efficiency. Larger datasets have more communities, which degrades the efficiency further. In contrast, our two algorithms consistently maintain a reasonable growth in runtime, since they select directly over edges rather than using clusters or communities.
5.4. Case Study (Q3)
The effectiveness of traffic bottleneck identification on road networks can be demonstrated using a case study visualization with the Plotly API (https://plotly.com/python/maps/) for the Xi’an and Chengdu datasets. We omit the visualization of the Porto dataset since its spatial region is much larger, which results in a sparse set of covered edges. The case study uses data from a typical peak rush hour (from 18:00 to 19:00). We hope to answer two questions using this visualization: (1) How do the congested road segments map to a road network in a real scenario? (2) How do the selected seed edges (which are considered to be traffic bottlenecks) influence the other edges?
We first show congested roads, which are highlighted in red in Figure 8(a) and Figure 9(a). In the remaining figures, we present the selected seed edges and their corresponding influenced edges. For better visualization, we have selected seed edges, with two endpoints plotted as circle markers and the influenced edges displayed as black lines. We have regarded the algorithm as a baseline, and also plot any edges which were not influenced by each method as red lines. More red lines imply a larger disagreement with the best solution. The general trend appears to be that in terms of effectiveness.
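A map of this kind can be sketched with Plotly as follows. This is an illustrative sketch rather than the plotting code used for the figures: the helper names, the edge identifiers, and the coordinates are hypothetical, and the colours simply follow the convention described above (red for congested or missed edges, black for influenced edges).

```python
def edges_to_polyline(edge_ids, geometry):
    """Flatten edges into lat/lon lists, using None to separate segments.

    geometry: dict mapping an edge id to ((lat1, lon1), (lat2, lon2)).
    """
    lats, lons = [], []
    for edge_id in edge_ids:
        (lat1, lon1), (lat2, lon2) = geometry[edge_id]
        lats += [lat1, lat2, None]
        lons += [lon1, lon2, None]
    return lats, lons

def build_figure(red_edges, black_edges, geometry):
    """Draw two groups of edges on an OpenStreetMap base layer."""
    # Imported lazily so that edges_to_polyline works without Plotly installed.
    import plotly.graph_objects as go

    fig = go.Figure()
    for name, edge_ids, color in (("congested", red_edges, "red"),
                                  ("influenced", black_edges, "black")):
        lats, lons = edges_to_polyline(edge_ids, geometry)
        fig.add_trace(go.Scattermapbox(lat=lats, lon=lons, mode="lines",
                                       line={"color": color, "width": 3},
                                       name=name))
    # Centre the map on the mean of all edge endpoints.
    points = [p for segment in geometry.values() for p in segment]
    center = {"lat": sum(p[0] for p in points) / len(points),
              "lon": sum(p[1] for p in points) / len(points)}
    fig.update_layout(mapbox_style="open-street-map",
                      mapbox_center=center, mapbox_zoom=12)
    return fig

# Illustrative geometry: two short edges with made-up coordinates.
geometry = {"e1": ((34.0, 108.0), (34.1, 108.1)),
            "e2": ((34.2, 108.2), (34.3, 108.3))}
lats, lons = edges_to_polyline(["e1", "e2"], geometry)
```

Calling `build_figure(["e1"], ["e2"], geometry).show()` would render the two groups; the `None` entries keep Plotly from joining unrelated edges into one line.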
6. Conclusion
In this paper, we have investigated the traffic bottleneck identification problem using trajectory datasets. We first proposed a traffic spread model to describe traffic dynamics over time, and used a historical trajectory dataset to provide diffusion information over edges in the network. Using this traffic spread model, we proposed a framework consisting of two main phases: influence acquisition and bottleneck identification. We then conducted an experimental study and a case study over three real-world datasets to validate the efficiency, scalability, and effectiveness of the proposed methods. In future work, we would like to create a prototype of a real-time traffic analysis system based on our proposed methods, and use information collected in real time from vehicles in an urban environment as trajectories to support transportation management. We would also like to address the problem of uncertainty in real test collections, such as that caused by faulty sensor data or incomplete datasets.
Acknowledgement.
This research is supported in part by ARC DP200102611 and DP190101113, Singtel Cognitive and Artificial Intelligence Lab for Enterprises (SCALE@NTU), which is a collaboration between Singapore Telecommunications Limited (Singtel) and Nanyang Technological University (NTU) that is funded by the Singapore Government through the Industry Alignment Fund - Industry Collaboration Projects Grant, and a Tier-1 project RG114/19.
References
Influence ranking of road segments in urban road traffic networks. Computing 102 (11), pp. 2333–2360.
You are the traffic jam: an examination of congestion measures. In The 85th Annual Meeting of Transportation Research Board.
Time-critical influence maximization in social networks with time-delayed diffusion process. In Proc. AAAI, Vol. 26.
Maximizing bichromatic reverse spatial and textual k nearest neighbor queries. PVLDB 9 (6), pp. 456–467.
Inferring networks of diffusion and influence. TKDD 5 (4), pp. 21:1–21:37.
A data-based approach to social influence maximization. PVLDB 5 (1), pp. 73–84.
Influence maximization in trajectory databases. TKDE 29 (3), pp. 627–641.
Efficient selection of geospatial data on maps for interactive and visualized exploration. In Proc. SIGMOD, pp. 567–582.
Traffic bottlenecks: identification and solutions. Technical report, United States. Federal Highway Administration. Office of Operations Research.
Analysis of the greedy approach in problems of maximum k-coverage. Naval Research Logistics 45 (6), pp. 615–627.
Deep architecture for traffic flow prediction: deep belief networks with multitask learning. TITS 15 (5), pp. 2191–2201.
Empirical observations of congestion propagation and dynamic partitioning with probe data for large-scale systems. Transportation Research Record 2422 (1), pp. 1–11.
Simulated annealing based influence maximization in social networks. In Proc. AAAI, pp. 127–132.
Maximizing the spread of influence through a social network. In Proc. SIGKDD, pp. 137–146.
Urban mobility report.
Diffusion convolutional recurrent neural network: data-driven traffic forecasting. In Proc. ICLR.
Influence maximization on social graphs: a survey. TKDE 30 (10), pp. 1852–1872.
Urban traffic congestion propagation and bottleneck identification. Science in China Series F: Information Sciences 51 (7), pp. 948.
MaxBRkNN queries for streaming geodata. In Proc. DASFAA, pp. 647–664.
Traffic flow prediction with big data: a deep learning approach. TITS 16 (2), pp. 865–873.
PPCA-based missing data imputation for traffic flow volume: a systematical approach. TITS 10 (3), pp. 512–522.
Measuring urban traffic congestion - a review. International Journal for Traffic & Transport Engineering 2 (4).
Uncovering the temporal dynamics of diffusion networks. In Proc. ICML, pp. 561–568.
A simple contagion process describes spreading of traffic jams in urban networks. Nature Communications 11 (1), pp. 1–9.
Traffic congestion and reliability: linking solutions to problems. Technical report, United States. Federal Highway Administration.
Influence maximization: near-optimal time complexity meets practical efficiency. In Proc. SIGMOD, pp. 75–86.
A survey on modern deep neural network for traffic prediction: trends, methods and challenges. TKDE.
Empowering A* search algorithms with neural networks for personalized route recommendation. In Proc. SIGKDD, pp. 539–547.
A survey on trajectory data management, analytics, and learning. ACM Computing Surveys (CSUR) 54 (2), pp. 1–36.
Torch: a search engine for trajectory data. In Proc. SIGIR, pp. 535–544.
Community-based greedy algorithm for mining top-k influential nodes in mobile social networks. In Proc. SIGKDD, pp. 1039–1048.
Discovery of critical nodes in road networks through mining from vehicle trajectories. TITS 20 (2), pp. 583–593.
Fast map matching, an algorithm integrating hidden Markov model with precomputation. IJGIS 32 (3), pp. 547–570.
Spatio-temporal graph convolutional networks: a deep learning framework for traffic forecasting. In Proc. IJCAI, pp. 3634–3640.
Identification and optimization of traffic bottleneck with signal timing. Journal of Traffic and Transportation Engineering (English Edition) 1 (5), pp. 353–361.
Urban traffic bottleneck identification based on congestion propagation. In Proc. ICC, pp. 1–6.
Trajectory-driven influential billboard placement. In Proc. SIGKDD, pp. 2748–2757.
Optimizing impression counts for outdoor advertising. In Proc. SIGKDD, pp. 1205–1215.
A data-driven congestion diffusion model for characterizing traffic in metrocity scales. In Big Data, pp. 1243–1252.
Urban computing: concepts, methodologies, and applications. TIST 5 (3), pp. 1–55.
Trajectory data mining: an overview. TIST 6 (3), pp. 1–41.
MaxFirst for MaxBRkNN. In Proc. ICDE, pp. 828–839.