1. Introduction
Traffic speed prediction has been a challenging problem for decades and has a wide range of applications in traffic planning and related areas, including congestion control (Li et al., 2017), vehicle routing planning (Johnson et al., 2017), urban road planning (Rathore et al., 2016) and travel time estimation (Gao et al., 2019). The difficulty of the problem comes from the complex and highly dynamic nature of traffic and road conditions, as well as a variety of other unpredictable, ad hoc factors. Urban traffic incidents, including lane restrictions, road construction and traffic collisions, are among the most important of these factors, and each tends to dramatically impact traffic for a limited time period. Yet the frequency of these events means their aggregate impact cannot be ignored when modeling and predicting traffic speed.
Despite a large amount of research on detecting urban traffic incidents (Zhang et al., 2018; Yuan et al., 2018), only a small number of recent works study the impact of such incidents. Miller and Gupta (2012) proposed a system for predicting the cost and impact of highway incidents. Javid and Javid (2018) developed a framework to estimate travel time variability caused by traffic incidents. He et al. (2019) proposed to use the ratio of speed before and after an incident as the traffic impact coefficient to evaluate the traffic influence of the incident. These works have demonstrated the significant impact of urban traffic incidents on traffic conditions. However, using traffic incidents to improve traffic speed prediction has not been well explored. Some previous works (Lin et al., 2017) use incident data collected from social networks (e.g., Twitter) by keywords to improve traffic prediction. However, they do not consider the impact level of different urban traffic incidents and instead treat all incidents equally for speed prediction.
Today, the large majority of solutions for traffic speed prediction, including traditional machine learning (Castro-Neto et al., 2009), matrix decomposition (Deng et al., 2016) and deep learning methods (Li et al., 2018; Lv et al., 2018; Yao et al., 2019), mainly use spatiotemporal features of the traffic network and context features such as weather data. These solutions do not factor in the impact of dynamic traffic incidents.
A number of questions naturally arise: how do different urban traffic incidents impact traffic flow speeds? Do high-impact traffic incidents have specific spatiotemporal patterns in the city? How can we use urban traffic incident data to improve traffic speed prediction? In this paper, our goal is to answer these questions and, in doing so, to understand the impact of urban traffic incidents on traffic speeds and propose an effective framework that uses urban traffic incident information to improve traffic speed prediction.
There are two main challenges in our incident-driven traffic speed prediction problem. First, the impact of urban traffic incidents is complex and varies significantly across incidents. For example, incidents that occur in the early hours and in remote areas have little impact on adjacent roads, while incidents during rush hours and in high-traffic areas (e.g., downtown) are very likely to affect the surrounding traffic flows or even cause congestion (Pan et al., 2012). Therefore, it is unreasonable to treat all urban traffic incidents equally for traffic speed prediction; doing so may even harm prediction performance. Second, the impact of urban traffic incidents on adjacent roads is affected by external factors such as incident occurrence time, incident type and road topology. We need to extract the latent impact features of traffic incidents on traffic flows to improve traffic speed prediction.
To tackle the first challenge, we propose a critical incident discovery method to quantify the impact of urban traffic incidents on traffic flows. We consider both the anomalous degree and the speed variation of adjacent roads to discover critical traffic incidents. To tackle the second challenge, we propose a binary classifier that uses deep learning methods to extract the latent impact features of incidents. The impact of incidents varies in degree and is neither binary nor strictly multi-class, so we extract the latent impact features from the middle layer of the classifier, where the latent features are continuous and filtered. We adopt the Graph Convolutional Network (GCN) (Bruna et al., 2014) to capture spatial features of road networks. GCN is known to effectively capture topology features in non-Euclidean structures, and a complex road network is a typical non-Euclidean structure. Combining the above methods, we propose a Deep Incident-Aware Graph Convolutional Network (DIGCNet) to improve traffic prediction with traffic incident data. DIGCNet can effectively leverage traffic incident, spatiotemporal, periodic and context features for prediction.
We test our framework using two real-world urban traffic datasets from San Francisco and New York City. Experimental results empirically answer the above-mentioned questions and also show markedly different spatiotemporal distributions of critical and noncritical incidents. We compare DIGCNet with state-of-the-art methods; the results demonstrate the superior performance of our model and verify that the incident learning component is the key to the improvement in prediction performance.
We summarize our key contributions as follows:

To quantify the impact of traffic incidents on traffic speeds, we propose a critical incident discovery method and discover critical incidents in the city. We further explore the spatiotemporal distributions of critical and noncritical incidents and find noteworthy differences.

In order to extract the latent incident impact features, we design a binary classifier and extract the latent impact features from its middle layer. We use the binary classifier as an internal component of our final framework to improve traffic speed prediction.

We propose DIGCNet to effectively incorporate incident, spatiotemporal, periodic and context features for traffic speed prediction. We conduct experiments using two real-world urban traffic datasets, and the results show that DIGCNet outperforms competing benchmarks and that the incident learning component is the key to the improvement in prediction performance. Moreover, the incident learning component can be flexibly inserted into other models as a general-purpose component for learning incident impact features.
2. Preliminaries
Before diving into details of the model, we begin with some preliminaries on our datasets and problem formulation in this section.
2.1. Datasets
We utilize two datasets, a traffic dataset and an attribute dataset (weather data). The traffic dataset consists of road network, speed and incident subdatasets from two major metropolitan areas, San Francisco (SFO) and New York City (NYC), with complex traffic conditions and varying physical features that may affect latent traffic patterns (Xie et al., 2018). We collect the weather dataset using the Yahoo Weather API (Yahoo, 2019); its fields include weather type, temperature and sunrise time. We collect the traffic dataset from a public API, HERE Traffic (Here, 2019). 1) Road Network: We set lat/lng bounding boxes (Figure 1(a)) on the two cities, SFO (37.707, -122.518 / 37.851, -122.337) and NYC (40.927, -74.258 / 40.495, -73.750), to gather the internal road network. 2) Traffic Speed: We collect the real-time traffic speed of each flow in the areas described above, recording it every 5 minutes. 3) Traffic Incident: We also collect the traffic incident data in the same areas every 5 minutes. For each incident, we obtain incident features such as type and location.
Table 1. Datasets used by each component.
Component | Datasets Description
Critical Incident Discovery | Uses the traffic incident, road network and speed subdatasets. The incident and speed data are from Apr. 17 to Apr. 24, 2019.
Impact Features Extraction | Uses the traffic incident, road network and speed subdatasets. The incident and speed data are from Apr. 17 to Apr. 24, 2019.
DIGCNet | Uses the traffic incident, road network and speed subdatasets and the weather dataset. The incident and speed data are from Apr. 4 to May 2, 2019 (4 weeks).
Flow. Real-time speeds differ between segments of a single road, and HERE divides every road into multiple segments. We denote one road segment as one flow $f$. Every flow has a speed at each time slot, and we use the flow as the smallest unit in the road network.
2.2. Problem Formulation and Preprocessing
First, we denote a road network as an undirected graph $G_r = (V_r, E_r)$, where each node represents an intersection or a split point on the road, and each edge represents a road segment.
Reconstruction of the road network. As our task is to predict the speed of every road segment, we use the road segment as the node. More specifically, we use every flow as one node to build the road network. If two flows $f_i$ and $f_j$ have points of intersection, we add an edge to connect node $i$ and node $j$. We thereby build a new road network graph $G = (V, E)$, where each node represents a flow and each edge represents an intersection of flows or a split point on a flow. The rebuilt graph has 2,416 nodes and 19,334 edges for SFO, and 13,028 nodes and 92,470 edges for NYC. We use the rebuilt road network graph in the rest of the paper.
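The reconstruction step above can be illustrated with a minimal sketch. It assumes each flow is given as a list of shape points and that two flows "intersect" when they share at least one point; the function name `build_flow_graph` and the input layout are hypothetical and are not taken from the HERE API or from our pipeline.

```python
from itertools import combinations


def build_flow_graph(flows):
    """Build a flow-level graph: one node per flow, one edge per shared point.

    flows: dict mapping a flow id to a list of (lat, lng) shape points
           (hypothetical input layout).
    Returns (nodes, edges) where edges is a set of ordered id pairs.
    """
    nodes = sorted(flows)
    point_sets = {fid: set(points) for fid, points in flows.items()}
    edges = set()
    for a, b in combinations(nodes, 2):
        # Assumption: sharing any shape point counts as an intersection.
        if point_sets[a] & point_sets[b]:
            edges.add((a, b))
    return nodes, edges


flows = {
    "f1": [(0.0, 0.0), (0.0, 1.0)],
    "f2": [(0.0, 1.0), (0.0, 2.0)],  # shares (0.0, 1.0) with f1
    "f3": [(5.0, 5.0), (5.0, 6.0)],  # isolated
}
nodes, edges = build_flow_graph(flows)
```

The pairwise loop is quadratic in the number of flows; for a graph of the size of NYC, an index from shape point to flow ids would be a more practical choice.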
Problem formulation. We use $v_i^t$ to represent the speed of flow $f_i$ at time slot $t$. For every speed snapshot of the road network, we obtain a vector of all flows $\mathbf{v}^t = (v_1^t, v_2^t, \dots, v_N^t)$, where $N$ is the total number of flows. Given the rebuilt road graph $G$ and a $T$-length historical real-time speed sequence of all flows, our task is to predict the future speeds of every flow in the city over the next $p$ time slots, where $p$ is the prediction length. We are also given a set of urban traffic incidents that occur close to the predicted time, more specifically, the set of incidents that occur within $[t_s, t_e]$, where $t_s$ is the earliest included incident occurrence time and $t_e$ is the latest. We extract the features of the impact of these incidents on traffic flows to improve speed prediction performance.
3. Urban Critical Incident Discovery
The impact of urban traffic incidents is complex and is also influenced by other factors such as the topological structure of the urban road network, temporal features and incident type. Treating all urban traffic incidents equally adds noise to the traffic speed prediction process. In this section, we focus on analyzing the impact of different urban traffic incidents and introduce our urban critical incident discovery methodology.
3.1. Methodology
Case Study: A Congestion Incident. Figure 1(b) presents a congestion incident that occurred at 06:32 am on Apr. 17, 2019 in San Francisco. Let $c$ be the center point of the incident and $r$ the radius of the impact range. The circle with center $c$ and radius $r$ stands for the region affected by the incident. We define that if the center of a flow is in the circle, then the flow might be affected by the incident. The circle in Figure 1(b) shows the affected region for the radius used in this example. The blue, red and green lines represent three flows in San Francisco that might be affected by the incident. The speed curves of the three candidate flows are shown in Figure 1(c). We observe that between 6:00 am and 7:00 am, the speeds of two of the flows show a sharp reduction, while the variation of the third is relatively slight, although it still becomes choppier after the incident occurs.
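The candidate selection rule (a flow is a candidate when its center lies inside the circle around the incident) can be sketched as follows. The sketch assumes flow and incident centers are already projected to planar coordinates in meters; `candidate_flows` and the sample coordinates are hypothetical.

```python
import math


def candidate_flows(incident_center, flow_centers, radius):
    """Return ids of flows whose center lies within `radius` of the incident.

    incident_center: (x, y) in planar coordinates (assumed, e.g. meters).
    flow_centers: dict mapping flow id -> (x, y) center of the flow.
    """
    cx, cy = incident_center
    return [
        fid
        for fid, (x, y) in flow_centers.items()
        if math.hypot(x - cx, y - cy) <= radius
    ]


flow_centers = {
    "f1": (0.0, 100.0),    # 100 m from the incident
    "f2": (300.0, 400.0),  # 500 m from the incident
    "f3": (900.0, 0.0),    # 900 m from the incident
}
affected = candidate_flows((0.0, 0.0), flow_centers, radius=600.0)
```

With raw latitude/longitude values, a great-circle distance would be needed instead of `math.hypot`.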
Next, we analyze whether each candidate flow is truly affected by the incident. We use a variant of the method proposed in (Zhang et al., 2018) to compute the anomalous degree of each flow. That method divides the city area into grids and computes the anomalous degree of each grid region to detect urban anomalies. The key idea is to compute the anomalous degree of a region based on its historically similar regions in the city. A sudden drop in the speed similarity between a region and its historically similar regions indicates the occurrence of an urban anomaly, and the experiments in that work verified the effectiveness of the detection method. In our problem, we use each flow as the unit rather than a grid region.
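Definition 1 (Pairwise Similarity of Flows). Given two flows $f_i$ and $f_j$ at time slot $t$ with speeds $v_i^t$ and $v_j^t$, for a time window of length $w$, the pairwise similarity $s_{i,j}^{t}$ is calculated by:
$$s_{i,j}^{t} = \mathrm{PCC}\big((v_i^{t-w+1}, \dots, v_i^{t}),\ (v_j^{t-w+1}, \dots, v_j^{t})\big) \qquad (1)$$
where $\mathrm{PCC}(\cdot, \cdot)$ calculates the Pearson correlation coefficient (Lin, 1989) of two speed sequences. Then the similarity matrix $S_t$ of all flows at $t$ is calculated by the following equation:
$$S_t = \big[s_{i,j}^{t}\big]_{N \times N} \qquad (2)$$
where $N$ is the total number of flows in the city.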
Definition 2 (Similarity Decrease Matrix, SD). Similar to (Zhang et al., 2018), we define the similarity decrease matrix $SD_t$, which represents the decreased similarity of each flow pair from time slot $t-1$ to $t$. $SD_t$ at time slot $t$ is calculated by $SD_t = \max(S_{t-1} - S_t, 0)$, applied element-wise. Negative entries are set to zero because we only consider the case where the similarity goes down.
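Definitions 1 and 2 can be illustrated with a small pure-Python sketch. It assumes the similarity at slot `t` uses the `w` most recent speeds (slots `t-w+1` to `t`) and treats a constant speed sequence as uncorrelated; both choices, and all function names, are assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # assumption: a constant sequence counts as uncorrelated
    return cov / (sx * sy)


def similarity_matrix(speeds, t, w):
    """speeds[i] is the speed sequence of flow i; uses slots t-w+1 .. t."""
    windows = [s[t - w + 1 : t + 1] for s in speeds]
    n = len(windows)
    return [[pearson(windows[i], windows[j]) for j in range(n)] for i in range(n)]


def similarity_decrease(s_prev, s_curr):
    """Element-wise max(S_{t-1} - S_t, 0)."""
    return [
        [max(p - c, 0.0) for p, c in zip(row_prev, row_curr)]
        for row_prev, row_curr in zip(s_prev, s_curr)
    ]


speeds = [
    [10, 20, 30, 40, 30],    # flow 0 drops at the last slot
    [20, 40, 60, 80, 100],   # flow 1 keeps rising
    [5, 5, 5, 5, 5],         # flow 2 is constant
]
s_prev = similarity_matrix(speeds, t=3, w=4)
s_curr = similarity_matrix(speeds, t=4, w=4)
sd = similarity_decrease(s_prev, s_curr)
```

In this toy example the similarity of flows 0 and 1 falls from about 1.0 to about 0.63 once flow 0 slows down, so their similarity decrease is positive.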
Definition 3 (Anomalous Degree, AD). We then use the similarity matrix $S$ and the similarity decrease matrix $SD$ to compute the anomalous degree $AD$ of flows at time slot $t$. We use a threshold parameter $\delta$ to capture the historically similar flows: when the similarity of two flows is equal to or greater than $\delta$, we define them as historically similar. Given a flow $f_i$ at time slot $t$, the set of flows historically similar to $f_i$ is denoted as $H_i^t$. Pairwise similarity is computed by the Pearson correlation coefficient (PCC), and a PCC in [0.5, 0.7] indicates that variables are moderately correlated according to (Rumsey, 2015). Therefore, we choose $\delta$ so as to select the historically similar flows that are at least moderately similar to the flow $f_i$. The anomalous degree of flow $f_i$ at time slot $t$ is calculated by the following equation:
(3) 
where is the decrease degree in speed similarity of and its historically similar flows.
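As a rough illustration of Definition 3, the sketch below computes an anomalous degree as the mean similarity decrease over the historically similar set. This averaging form, the use of the previous slot's similarity matrix to decide which flows are historically similar, and the name `anomalous_degree` are all assumptions made for the example, not a restatement of Eq. (3).

```python
def anomalous_degree(i, s_prev, sd, delta):
    """Assumed form: mean similarity decrease of flow i w.r.t. similar flows.

    s_prev: similarity matrix at the previous time slot (assumed to define
            which flows are historically similar).
    sd:     similarity decrease matrix at the current time slot.
    delta:  similarity threshold for "historically similar".
    """
    similar = [j for j in range(len(s_prev)) if j != i and s_prev[i][j] >= delta]
    if not similar:
        return 0.0  # assumption: no similar flows means no anomaly evidence
    return sum(sd[i][j] for j in similar) / len(similar)


s_prev = [
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.6],
    [0.2, 0.6, 1.0],
]
sd = [
    [0.0, 0.4, 0.1],
    [0.4, 0.0, 0.2],
    [0.1, 0.2, 0.0],
]
ad_values = [anomalous_degree(i, s_prev, sd, delta=0.5) for i in range(3)]
```

Here flow 0 has one similar flow (flow 1), flow 1 has two, and flow 2 has one, so the three values are 0.4, about 0.3 and 0.2.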
Local Anomalous Degree Algorithm. The time complexity of computing the similarity matrix $S_t$ is $O(N^2 w)$, where $N$ is the number of flows and $w$ is the length of the historical speed sequences. For cities with complex road networks such as New York City (13,028 flows), computing the similarity matrix $S_t$, the similarity decrease matrix $SD_t$ and the anomalous degree $AD$ is expensive. We propose a local anomalous degree algorithm based on spectral clustering (Yu and Shi, 2003) to speed up our method. Spectral clustering is able to identify spatial communities of nodes in graph structures. Following several studies (Tong et al., 2017; Yao et al., 2018; Zheng and Ni, 2013) which assume that traffic in nearby locations should be similar, we also assume that flows in the same community and in spatially nearby regions are historically similar. Given the road graph $G$, we perform spectral decomposition and obtain graph spatial features of each flow. Then we use K-means (Dhillon et al., 2004), a common unsupervised clustering method, to cluster flows into $K$ classes.
Validation of Local Algorithm. Figure 4 shows the clustering result when $K = 10$ (marked by different colors). The result shows that the eigenvectors can effectively capture spatial graph features. Our method divides New York City into 10 local districts which conform to the real-world urban districts, e.g., the red area corresponds to Manhattan. We then only need to compute the local values of $S_t$, $SD_t$ and $AD$ within the same district.
Next, different from anomaly detection, we aim at exploring the impact of different urban traffic incidents on traffic flows. Taking Figure 1(b) again as an example, the anomalous degree has a flaw when the three candidate flows are historically similar to each other at time slot $t$. In that case, the sharp variations of two of the flows strongly affect the anomalous degree of the third. Figure 2(a) shows the anomalous degrees of the three flows from 4:00 am to 12:00 pm. Near 06:32 am, the flow with the slight speed variation actually has a higher anomalous degree (0.198) than the other two (0.110 and 0.085). However, Figure 1(c) shows intuitively that close to 06:32 am the anomalous speed variations of those two flows are more striking. The reason for this opposite result is that, after the incident, the anomalous changes of the two sharply varying flows are very similar to each other, which leads to their low anomalous degree. Therefore, in order to handle this scenario, we add another metric to amend our discovery method.
Definition 4 (Relative Speed Variation, RSV). Given a flow $f_i$ at time slot $t$ and the historical speed sequence of $f_i$ in a $w$-length time window, we define the relative speed variation $RSV$ of $f_i$ as
(4) 
We define a normalization time window and use the maximum value observed in that window to normalize $RSV$. We use 24 hours (288 five-minute intervals) as the normalization window length.
Validation of $RSV$. As a heuristic approach, we test different candidate computing methods of $RSV$ as baselines for validation. We consider three related features: slope of speed variation (Viovy et al., 1992), recent speed and historical average speed (Boriboonsomsin et al., 2012), corresponding to three candidate computing methods of $RSV$. They are listed as follows:
Consider all three features, where two parameters control the ratio of recent speed and historical average speed, and both the historical average slope and the slope between two time slots are used.

Consider recent speed and historical average speed.

Consider historical average speed only.
We use the normalization item to normalize the three computing methods. We use the Pearson correlation coefficient to calculate the correlation between $AD$ and $RSV$ for all urban traffic incidents in our dataset (from an hour before to an hour after the incident). In order to use $RSV$ to amend $AD$, we choose the computing method most negatively correlated with $AD$ as our $RSV$ (with the two ratio parameters set to 0.5), i.e., the one that only considers historical average speed. Figure 2(b) shows the result for the congestion incident. Near 06:32 am, in contrast to $AD$, the maximum $RSV$ values of the two flows with sharp speed drops (0.377 and 0.333) are both larger than that of the third flow. This conforms to the speed variation (Figure 1(c)) and indicates that $RSV$ can also capture anomalies well and effectively correct the flaw of $AD$.
Definition 5 (Incident Effect Score, IES). Due to the complementarity of the anomalous degree $AD$ and the relative speed variation $RSV$, we combine them to compute the incident effect score. Given a flow $f_i$ at time slot $t$, the incident effect score $IES$ is calculated by:
(5) 
where $\lambda$ is a parameter to control the ratio of $AD$ and $RSV$.
Definition 6 (Critical Incident). For incidents like mega-events, the traffic flows might be affected before the incident begins; in contrast, incidents like traffic collisions begin to affect traffic flows only after they occur. Therefore, given an incident $I$ with a start time $t_0$, we first set a $T$-length “start to influence” window and define the set $F_I$ of flows that are highly affected by the incident, whose membership is controlled by a threshold parameter.
When $|F_I| \geq 1$, i.e., at least one flow is highly affected by $I$, we call $I$ a critical incident, where $|\cdot|$ denotes the cardinality of a set. We define an incident which is not a critical incident as a noncritical incident.
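Definitions 5 and 6 can be sketched together. The example assumes the incident effect score is a convex combination of $AD$ and $RSV$ weighted by a single parameter, and that a flow is "highly affected" when its maximum score inside the “start to influence” window reaches a single threshold. Both the combination form and the thresholding rule are assumptions for illustration; the function names are hypothetical.

```python
def incident_effect_score(ad, rsv, lam):
    """Assumed form: weighted combination of anomalous degree and RSV."""
    return lam * ad + (1.0 - lam) * rsv


def highly_affected_flows(scores_by_flow, threshold):
    """Flows whose maximum score inside the window reaches the threshold.

    scores_by_flow: dict flow id -> list of IES values, one per time slot
                    of the "start to influence" window.
    """
    return {
        fid for fid, scores in scores_by_flow.items() if max(scores) >= threshold
    }


def is_critical(scores_by_flow, threshold):
    """An incident is critical if at least one flow is highly affected."""
    return len(highly_affected_flows(scores_by_flow, threshold)) >= 1


scores_by_flow = {
    "f1": [incident_effect_score(0.05, 0.05, 0.5), incident_effect_score(0.2, 0.4, 0.5)],
    "f2": [incident_effect_score(0.01, 0.01, 0.5), incident_effect_score(0.02, 0.02, 0.5)],
}
affected = highly_affected_flows(scores_by_flow, threshold=0.1)
```

Raising the threshold shrinks the affected set, which is the mechanism behind the drop in the number of critical incidents when the thresholds are increased.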
3.2. Evaluation and results
Parameter Setting. The datasets we use here are listed in Table 1. We set and one hour as the length of “start to influence” time window.
Varying and . Figure 4 shows the number of critical incidents discovered when varying and . In SFO, when , most incidents are discovered as critical (1,706 out of 1,832 on average), which indicates that most incidents indeed have an impact on traffic flows. A small number of incidents have almost no impact (6.9%, and 12.2%, ), which further shows that treating all traffic incidents equally for traffic speed prediction is unreasonable. When rises (0.10, 0.15 or 0.20), there is a sharp reduction in critical incidents, which indicates that the impact of incidents varies in degree. In order to discover incidents with high impact, we set and for SFO. The results for NYC are similar to those for SFO: most incidents are discovered as critical incidents when is set to 0 or 0.05, and reductions also appear when rises. We set and for NYC.
Spatial Distributions. Figures 5(a) and 5(c) show the spatial distributions of incidents in SFO and NYC. An incident is plotted as a line with an origin and an end. In SFO, although most incidents of both types occur on the main roads (continuous parts), our method can effectively discover critical incidents (green circle). Moreover, we check the critical incidents in the green circles and find that they are mostly of the Event type, which has a high severity level recorded by HERE. In NYC, incidents of both types also gather on the main roads. The number of urban traffic incidents in NYC is far larger than in SFO, but we can still observe the differences. Critical incidents which did not occur on the main roads are mainly located in Manhattan (the middle circle). In the left circle, we find that most incidents away from the city center are discovered as noncritical.
Temporal Distributions. Figures 5(b) and 5(d) show the temporal distributions of critical and noncritical incidents in the two cities. In SFO, incidents mostly occur in rush hours (7-9 am and 4-7 pm), which is in line with daily routine. At about 12 pm (noon) and 3 pm on weekdays, the ratio of critical incidents drops while the ratio of noncritical incidents rises, which might be because neither time is in rush hours and incidents may not have a high impact. On weekends and during mid-afternoon, there is also a drop in critical incidents and a rise in noncritical ones. We also find that incidents are more likely to occur in the early morning on weekends than on weekdays. In NYC, most critical incidents also occur in rush hours. Incidents that occur in the early morning tend to be noncritical in both cities. On weekends, NYC has only one incident peak (mid-afternoon), while on weekdays NYC does not have the mid-afternoon peak that SFO presents.
Summary of Results. The two threshold parameters control which urban incidents are marked as having a high impact on traffic speeds: the lower they are, the lower the bar for marking an incident as critical. The results of varying them show that some urban incidents have almost no impact on traffic speeds and that the impact of urban incidents varies in degree, which indicates that it is unreasonable to use the features of all urban traffic incidents for traffic speed prediction. The spatiotemporal distributions show noteworthy differences between urban critical and noncritical incidents, which indicates that our urban critical incident discovery method can effectively discover incidents with a high impact on traffic speeds.
4. Extract the Latent Incident Impact Features
So far, we have shown that our discovery method can effectively distinguish urban critical and noncritical incidents. In this section, we propose to use deep learning methods to extract the latent incident impact features for traffic speed prediction. Taking two aspects into account, we design a binary classifier to extract the latent impact features:

Some urban incidents have almost no impact on traffic flows, and the features of low-impact incidents would even bring noise to the model. There are also noteworthy differences in spatiotemporal features between critical and noncritical incidents, which inspires us to consider the binary classification problem.

The impact of urban incidents on traffic speeds varies in degree and is neither binary nor strictly multi-class. Therefore, we should not use the binary result directly. Instead, we propose to extract the latent impact features from the middle layer of the binary classifier for traffic speed prediction, where the latent features are continuous and filtered.
4.1. Methodology
The task of the binary classifier is to predict whether an incident is critical or noncritical, i.e., whether an urban incident has a high or low impact on traffic speed. Considering that the impact of incidents is related to spatiotemporal and context features, and following previous works (Lv et al., 2018; Yao et al., 2018; Zhang et al., 2016) that use spatiotemporal and context features for traffic prediction (discussed in Section 6), our classifier consists of three components: a spatial learning component (GCN), a temporal learning component (LSTM) and a context learning component.
Spatial Learning: GCN (Figure 6(a)). A city road network has latent traffic patterns and complex spatial dependencies (Li et al., 2018). We need to capture the road topological features, i.e., the spatial dependencies of the road network. Traditional methods divide the city into grids and use a Convolutional Neural Network (CNN) to capture spatial features (Yao et al., 2018; Zhang et al., 2016). However, this neglects the road topological features and also loses the spatial information within grids. Moreover, graph-structure-related features are hard to use in a CNN for our problem. We adopt the graph convolutional network (GCN) (Bruna et al., 2014) to learn the spatial topology features. GCN is known for being able to capture topology features in non-Euclidean structures, which makes it suitable for road networks. The GCN model follows the layer-wise propagation rule (Kipf and Welling, 2017):
$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right) \qquad (6)$$
where $A$ is the adjacency matrix, $\tilde{A} = A + I_N$ is the adjacency matrix of the graph with added self-connections, $\tilde{D}$ is the degree matrix with $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$, and $\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$ is the symmetrically normalized adjacency matrix of the graph $G$ with self-connections. $\sigma(\cdot)$ denotes an activation function. $W^{(l)}$ is the trainable weight matrix and $H^{(l)}$ is the matrix of activations in the $l$-th layer, with $H^{(0)} = X$, where $X$ is the input of GCN.
We use the rebuilt road graph $G$ described above. At each time slot $t$, we obtain the real-time speed of every flow in $G$ and define the speed snapshot $\mathbf{v}^t \in \mathbb{R}^N$, where $N$ is the total number of flows in the city. We also add another graph-structure-related feature, the distance of each flow from the incident, because the impact of incidents on flows has a strong correlation with distance (Tong et al., 2017; Yao et al., 2018; Zheng and Ni, 2013). We define the distance of a flow from the incident as the Euclidean distance between the flow center and the incident center. Therefore, at each time slot $t$, the input features combine the speed and the distance of every flow. For an urban traffic incident, the time span of the input speed snapshots is the “start to influence” time window defined in Section 3, which is determined by the start time of the incident and the window length.
For the input signal $X \in \mathbb{R}^{N \times C}$ with $C$ input channels and $F$ filters or feature maps, the spectral convolution is computed as follows (Kipf and Welling, 2017):
$$Z = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} X \Theta \qquad (7)$$
where $\Theta \in \mathbb{R}^{C \times F}$ is a matrix of filter parameters, $Z \in \mathbb{R}^{N \times F}$ is the convolved signal matrix and $F$ is the number of filters or features. Next, at each time slot $t$, after $k$ graph convolutional (GC) layers, we feed the middle states into fully connected (FC) layers to get the spatial learning output of each snapshot.
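To make the graph convolution of Kipf and Welling (2017) concrete, the following pure-Python sketch applies one layer, $\mathrm{ReLU}(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} X \Theta)$, to a tiny two-node graph. It is a didactic sketch with hand-picked weights, not our training code; the helper names are hypothetical, and a real model would use a tensor library.

```python
import math


def normalized_adjacency(adj):
    """Return D~^{-1/2} (A + I) D~^{-1/2} for a dense adjacency matrix."""
    n = len(adj)
    a_tilde = [
        [adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)
    ]
    deg = [sum(row) for row in a_tilde]
    return [
        [a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
        for i in range(n)
    ]


def matmul(a, b):
    """Plain dense matrix product of two lists of lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def gcn_layer(adj, features, weights):
    """One graph convolutional layer with ReLU activation."""
    a_hat = normalized_adjacency(adj)
    z = matmul(matmul(a_hat, features), weights)
    return [[max(value, 0.0) for value in row] for row in z]


adj = [[0.0, 1.0], [1.0, 0.0]]   # two connected flows
features = [[1.0], [3.0]]        # one input channel per flow (e.g. a speed)
weights = [[1.0, -1.0]]          # C = 1 input channel, F = 2 filters
out = gcn_layer(adj, features, weights)
```

Each node's output averages its own feature with its neighbor's, so both rows become `[2.0, 0.0]` after the ReLU.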
Temporal Learning: LSTM (Figure 6(b)). We feed a sequence of graph speed snapshots to GCN, and the output is a sequence of spatial features, one for each time slot in the input window. We then adopt the Long Short-Term Memory (LSTM) model (Hochreiter and Schmidhuber, 1997) as our temporal learning component. LSTM is known for being able to learn long-term dependencies in time-related sequences; it removes or adds information to the cell state through its gate structures. We extract the spatial features for each snapshot in GCN and feed the sequence into LSTM cells, iteratively obtaining the output sequence. We use the last LSTM cell output as the output of the temporal learning part.
Context Learning (Figure 6(c)). Incident context features are also important for prediction. We use the following features for context learning:

Incident type (e.g., traffic collision and event).

Road status: Whether the urban incident leads to a road closure or not.

Start and end hour: HERE gives a start time and an anticipated end time of an incident.

Incident duration: The anticipated duration of the incident, i.e., the difference between the anticipated end time and the start time.

Weekday, Saturday or Sunday.
We use one-hot encoding to preprocess the categorical features and normalize the incident duration feature.
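The preprocessing of the context features listed above can be sketched as follows. The category list for incident types, the field names of the `incident` dictionary and the max-scaling of the duration are assumptions for illustration; the normalization scheme is not specified beyond "normalize".

```python
def one_hot(value, categories):
    """Return a one-hot list marking the position of `value` in `categories`."""
    return [1.0 if value == c else 0.0 for c in categories]


# Hypothetical category lists; the real HERE incident types may differ.
INCIDENT_TYPES = ["collision", "construction", "congestion", "event"]
DAY_TYPES = ["weekday", "saturday", "sunday"]


def encode_context(incident, max_duration_min):
    """Encode one incident into a flat context feature vector."""
    duration = incident["end_min"] - incident["start_min"]
    return (
        one_hot(incident["type"], INCIDENT_TYPES)
        + one_hot(incident["road_closed"], [False, True])
        + one_hot(incident["start_hour"], range(24))
        + one_hot(incident["end_hour"], range(24))
        + one_hot(incident["day_type"], DAY_TYPES)
        # Assumption: duration is scaled by a maximum and clipped to [0, 1].
        + [min(duration / max_duration_min, 1.0)]
    )


incident = {
    "type": "collision",
    "road_closed": True,
    "start_hour": 6,
    "end_hour": 7,
    "start_min": 392,   # 06:32 expressed in minutes since midnight
    "end_min": 452,     # anticipated end at 07:32
    "day_type": "weekday",
}
context_vector = encode_context(incident, max_duration_min=120)
```

The resulting vector has 4 + 2 + 24 + 24 + 3 + 1 = 58 entries and would serve as the input of the context learning component.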
The context learning component is a Deep Neural Network (DNN) structure, more specifically, an input layer and a fully connected layer (shown in Figure 6(c)). After embedding the context information, we feed the context embedding to a fully connected layer to get $h_c$, which is the output of context learning.
Latent incident impact features extraction. After getting the context feature $h_c$ and the spatiotemporal feature $h_{st}$, we concatenate them as the representation $h$ of each incident. Then we feed $h$ to FC layers. We extract the output of the last FC layer before the output layer as the latent incident impact features, because the output layer uses these features as its input to predict whether the incident has a high impact on traffic flows. We denote the latent impact features as $h_{inc}$. Finally, we get the prediction value $\hat{y}$ and compute the loss against the real value $y$.
Objective Function and Evaluation Metric.
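The classifier is trained by minimizing the Binary Cross Entropy loss (BCELoss) between the predicted value and the real value. BCELoss is defined as follows:
$$\mathrm{BCELoss} = -\frac{1}{M} \sum_{i=1}^{M} \big[\, y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \,\big] \qquad (8)$$
where $M$ is the number of samples, $y_i$ is the real label of the $i$-th incident and $\hat{y}_i$ is its predicted probability of being critical.
We use BCELoss and F1-score to evaluate the binary classifier.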
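The two evaluation quantities can be computed with a few lines of plain Python. The sketch below uses the standard definitions of binary cross entropy and F1-score; the 0.5 decision threshold and the clipping constant `eps` are assumptions added for the example.

```python
import math


def bce_loss(y_true, y_prob, eps=1e-12):
    """Mean binary cross entropy between labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


def f1_score(y_true, y_pred):
    """F1-score of the positive (critical) class."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


y_true = [1, 0, 1, 1]
y_prob = [0.9, 0.2, 0.4, 0.8]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]  # assumed 0.5 threshold
loss = bce_loss(y_true, y_prob)
f1 = f1_score(y_true, y_pred)
```

In this toy case one critical incident is missed, so precision is 1, recall is 2/3 and the F1-score is 0.8.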
4.2. Middle Experiments
Parameter Setting. The datasets we use here are listed in Table 1. We use the discovery results obtained in the last section as the ground truth. There are 1,061 positive samples (critical) and 771 negative samples (noncritical) for SFO, and 17,924 positive samples and 15,367 negative samples for NYC. We use 5 minutes as the time interval and train our classifier with the following hyperparameter settings: a learning rate of 0.001 with the Adam optimizer. In GCN, we use two GC layers followed by one FC layer with a 64-dimension output. The length of the “start to influence” window is set to one hour, i.e., the input size of the first GC layer is 12. We use the ReLU activation function and add Dropout in the GC layers. We use one LSTM layer with 64-dimension hidden states. After concatenation, we adopt one FC layer (16-dimension) followed by the output FC layer with a sigmoid activation function. We use 70% of the data for training and the remaining 30% as the test set. Within the training set, we use 90% for training and 10% as the validation set for early stopping.
Results and Analysis. Using the traffic incident and traffic speed subdatasets for training, we obtain a 0.8241 F1-score and 0.4429 BCELoss on the test set of SFO, and a 0.8000 F1-score and 0.4731 BCELoss on that of NYC. Our binary classifier can capture the latent impact features of different incidents on traffic flows; more specifically, we can get the embedding of each input incident. $h_c$ is the output of context learning and $h_{st}$ is the output of spatiotemporal learning. We feed their concatenation into $k$ ($k = 1$ in our experiment) FC layers to extract the latent impact features before the output layer. We use the binary classifier in the next section as an internal component to help improve traffic speed prediction performance. Since the classifier acts as a middle component of our incident-driven framework, we further evaluate the complete framework against competitive baselines in the next section.
5. Incidentdriven Traffic Speed Prediction
So far, we can effectively capture the latent impact features of urban incidents on traffic flow speeds. Combining the above methods, we propose the Deep Incident-Aware Graph Convolutional Network (DIGCNet) to improve traffic speed prediction with urban incident data.
5.1. Methodology
DIGCNet (Figure 7) consists of three components: spatiotemporal, incident and periodic learning. Our prediction problem is defined in Section 2.
Spatiotemporal Learning (Figure 7(a)). Since traffic speed prediction is also related to the spatiotemporal patterns of the traffic network, and following previous works (Lv et al., 2018; Yao et al., 2018; Zhang et al., 2016) that use spatiotemporal features for traffic prediction (discussed in Section 6), we use a structure similar to the spatial and temporal learning in the binary classifier. The combination of spatiotemporal and context structures is common in traffic prediction; here we use GCN rather than CNN to better capture the spatial features of the road network. GCN captures spatial graph features and LSTM captures the time evolution patterns of traffic speeds. The input feature of each node in GCN is $v_i^t$, i.e., the speed of flow $f_i$ at time slot $t$. More specifically, the input is $\mathbf{v}^t$, the graph speed snapshot at time slot $t$. We input a sequence of graph speed snapshots to GCN and, after the GCN part, similar to (Yao et al., 2018), we concatenate the weather context at each time slot to obtain the spatial features of that slot. Then we feed the spatial feature sequence to LSTM cells to iteratively get the output sequence, and use learnable units to predict future traffic speeds. The output of spatiotemporal learning is $O_{st}$.
Incident Learning (Figure 7(b)). To predict the traffic speed at a given time slot, we select all incidents that occurred within the last two hours as the incident learning inputs, spanning from the earliest to the latest included incident occurrence time. We use the pretrained binary classifier (trained in the previous section) to extract the latent incident impact features of each incident. Because the number of incidents within the time range is not fixed and incidents occur in sequential order, we adopt a standard Recurrent Neural Network (RNN) (Mikolov et al., 2010) for incident learning. An RNN is a neural network that contains loops that allow information to persist. Previous incidents affect traffic conditions, which may lead to future incidents. Using an RNN also helps us capture the interrelation of sequentially occurring urban traffic incidents, which is neglected by previous works (Lin et al., 2017). The output of the last RNN cell is the output of incident learning.
Periodic Learning (Figure 7(c)). Traffic flow speeds change periodically, and we use a structure similar to that of (Lv et al., 2018) to learn long-term periodic patterns. We use the same time slots in the last 5 days to learn the periodic features. A fully connected layer captures the long-term periodic features and produces the output of periodic learning.
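Selecting "the same time slots in the last 5 days" reduces to simple index arithmetic once time is discretized. The sketch below assumes 5-minute slots (the interval used in the evaluation), a slot index that counts from the start of the dataset, and a hypothetical `speed_history` list holding one list of flow speeds per slot; none of these names come from the paper.

```python
SLOT_MINUTES = 5
SLOTS_PER_DAY = 24 * 60 // SLOT_MINUTES  # 288 five-minute slots per day


def periodic_slot_indices(target_slot, num_days=5):
    """Indices of the same time-of-day slot in each of the previous num_days days.

    Days that fall before the start of the data are skipped.
    """
    candidates = [target_slot - d * SLOTS_PER_DAY for d in range(num_days, 0, -1)]
    return [i for i in candidates if i >= 0]


def periodic_features(speed_history, target_slot, num_days=5):
    """Collect the speed snapshots at the periodic slots, oldest first."""
    return [speed_history[i] for i in periodic_slot_indices(target_slot, num_days)]


# Hypothetical target: 07:00 on the seventh day of the dataset (day index 6).
target = 6 * SLOTS_PER_DAY + 7 * 60 // SLOT_MINUTES
indices = periodic_slot_indices(target)
```

The collected snapshots would then be flattened and passed to the fully connected layer of the periodic component.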
Output. After obtaining the spatiotemporal features, the incident impact features and the periodic features, we concatenate them and feed the result into FC layers. Finally, we obtain the predicted value and compute the loss against the real value.
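The incident learning and fusion steps can be sketched together. The example assumes a plain Elman-style recurrence with a tanh activation and randomly initialized, untrained weights; the embedding size, hidden size and all names are hypothetical and far smaller than the dimensions used in the experiments. It is meant to show only that a variable number of incident embeddings is reduced to one fixed-size vector (the last hidden state), which is then concatenated with the other two feature vectors.

```python
import math
import random


def rnn_last_state(sequence, w_x, w_h, bias):
    """Run a simple tanh RNN over `sequence` and return the final hidden state."""
    hidden = [0.0] * len(bias)
    for x in sequence:
        hidden = [
            math.tanh(
                sum(w * v for w, v in zip(w_x[j], x))
                + sum(w * h for w, h in zip(w_h[j], hidden))
                + bias[j]
            )
            for j in range(len(bias))
        ]
    return hidden


def fuse(spatiotemporal, incident, periodic):
    """Concatenate the three feature vectors before the final FC layers."""
    return spatiotemporal + incident + periodic


rng = random.Random(0)
EMB, HID = 4, 3  # hypothetical incident-embedding and hidden sizes
w_x = [[rng.uniform(-0.5, 0.5) for _ in range(EMB)] for _ in range(HID)]
w_h = [[rng.uniform(-0.5, 0.5) for _ in range(HID)] for _ in range(HID)]
bias = [0.0] * HID

# Two time windows with different numbers of incidents (random placeholder embeddings).
incidents_a = [[rng.random() for _ in range(EMB)] for _ in range(2)]
incidents_b = [[rng.random() for _ in range(EMB)] for _ in range(5)]
h_a = rnn_last_state(incidents_a, w_x, w_h, bias)
h_b = rnn_last_state(incidents_b, w_x, w_h, bias)

fused = fuse([0.1, 0.2], h_a, [0.3])  # placeholder spatiotemporal and periodic features
```

Both `h_a` and `h_b` have the same length regardless of how many incidents fell in the window, which is what makes the concatenation well defined.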
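Objective Function and Evaluation Metric. DIGCNet is trained by minimizing the Mean Squared Error (MSE) between the predicted speed and the real value. We use the Mean Absolute Percentage Error (MAPE) to evaluate DIGCNet. MAPE is defined as follows:
$$\mathrm{MAPE} = \frac{100\%}{N} \sum_{i=1}^{N} \left| \frac{\hat{y}_i - y_i}{y_i} \right| \tag{9}$$
where $N$ is the total number of flows, and $\hat{y}_i$ and $y_i$ denote the predicted and real speed of flow $i$.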
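As a minimal sketch of the evaluation metric, assuming the predicted and real speeds are given as equal-length lists with one entry per flow and that no real speed is zero, MAPE can be computed as follows. The function name and the sample values are illustrative only.

```python
def mape(predicted, actual):
    """Mean Absolute Percentage Error, in percent, over all flows."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("predicted and actual must be non-empty and of equal length")
    total = sum(abs(p - a) / abs(a) for p, a in zip(predicted, actual))
    return 100.0 * total / len(actual)


# Hypothetical speeds for three flows; each prediction is off by 10% of the real speed.
predicted_speeds = [27.0, 44.0, 66.0]
real_speeds = [30.0, 40.0, 60.0]
error = mape(predicted_speeds, real_speeds)
```

Because the real speed appears in the denominator, flows with a real speed of zero would have to be filtered out or handled separately before applying the metric.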
Table 2. MAPE of different methods on SFO and NYC.
| Method | MAPE (SFO) | MAPE (NYC) |
|---|---|---|
| ARIMA | 26.70% | 38.60% |
| SVR | 28.24% | 39.73% |
| LSTM | 18.98% | 30.26% |
| GC | 15.69% | 25.79% |
| LSMRN | 13.72% | 21.53% |
| LCRNN | 12.26% | 18.77% |
| DIGCNet | 11.02% | 17.21% |
5.2. Evaluations
Parameter Setting. The datasets we use here are listed in Table 1. We set the time interval to 5 minutes and the time window to 4 hours, i.e., 48 time slots. We train our network with the Adam optimizer and a learning rate of 0.001. In spatiotemporal learning, we use two GCN layers followed by one FC layer (64 dimensions), and the input size of the first GCN layer is 64. We use the ReLU activation function and add Dropout in the GCN layers. In incident learning, we use one RNN layer with a 128-dimensional hidden state. In periodic learning, we use one FC layer with a 64-dimensional hidden state. After the concat operation, we adopt one 256-dimensional FC layer connected to the final output layer. We use the ReLU activation function in the FC layers. We use the first three weeks of data for training and the remaining week as the test set. Within the training data, we use 90% for training and 10% as the validation set for early stopping.
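The windowing and data split described above can be made concrete with a short sketch. It assumes four weeks of data indexed by 5-minute slot and a chronological hold-out of the last 10% of training targets for validation; the paper does not state how the validation samples are selected, so that choice, like all names below, is an assumption.

```python
SLOT_MINUTES = 5
WINDOW_SLOTS = 4 * 60 // SLOT_MINUTES         # 4-hour window -> 48 slots
SLOTS_PER_WEEK = 7 * 24 * 60 // SLOT_MINUTES  # 2016 slots per week


def split_targets(total_weeks=4, train_weeks=3, val_percent=10):
    """Return (train, val, test) lists of target slot indices.

    A slot can be a target only if a full input window precedes it.
    """
    train_end = train_weeks * SLOTS_PER_WEEK
    total = total_weeks * SLOTS_PER_WEEK
    train_val = list(range(WINDOW_SLOTS, train_end))
    test = list(range(train_end, total))
    cut = len(train_val) - len(train_val) * val_percent // 100
    return train_val[:cut], train_val[cut:], test


def input_window(target_slot):
    """Slots whose speed snapshots form the model input for `target_slot`."""
    return range(target_slot - WINDOW_SLOTS, target_slot)


train, val, test = split_targets()
first_test_window = input_window(test[0])
```

Under these assumptions, the first test target still has a complete 48-slot input window because the window reaches back into the end of the third week.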
Comparison with competitive benchmarks. We compare our model with the following models, which cover traditional machine learning, matrix decomposition and state-of-the-art deep learning methods:

ARIMA (Contreras et al., 2003): Autoregressive integrated moving average is a classic linear model in time series forecasting.

SVR (Smola and Schölkopf, 1997): Support Vector Regression is based on the computation of linear regression in a high-dimensional feature space and is widely used.

LSTM (Ma et al., 2015): This method uses LSTM to capture nonlinear traffic dynamics to predict traffic speed.

GC (Michaël et al., 2016): GC uses graph convolution, pooling and fully connected layers to predict future traffic speed. GC is a variation of the basic GCN with efficient pooling.

LSMRN (Deng et al., 2016): The latent space model for road networks learns the attributes of vertices in latent spaces, mainly using matrix decomposition. It also considers the spatiotemporal effects of latent attributes and uses an incremental online algorithm to predict traffic speed.
Table 2 shows the MAPE results of the different methods on SFO and NYC. All benchmarks in the table are evaluated on one-step prediction. DIGCNet achieves the best performance in both cities: its MAPE is relatively 10.11% to 60.97% lower than the benchmarks in SFO and relatively 8.31% to 56.68% lower in NYC. We also note a significant gap between SFO and NYC across all methods, likely due to the large difference in road network size (SFO has 2,416 nodes and 19,334 edges, whereas NYC has 13,028 nodes and 92,470 edges). The results indicate that DIGCNet can effectively incorporate incident, spatiotemporal, periodic and context features for traffic speed prediction.
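The relative improvements quoted above follow directly from the MAPE values in Table 2. As a small worked check, the relative reduction is the difference between a benchmark's MAPE and DIGCNet's MAPE, divided by the benchmark's MAPE. The helper name below is illustrative; the numbers are copied from Table 2.

```python
def relative_reduction(baseline_mape, model_mape):
    """Relative MAPE reduction of the model over a baseline, in percent."""
    return 100.0 * (baseline_mape - model_mape) / baseline_mape


# MAPE values (in percent) from Table 2: (SFO, NYC).
table2 = {
    "ARIMA": (26.70, 38.60),
    "SVR": (28.24, 39.73),
    "LSTM": (18.98, 30.26),
    "GC": (15.69, 25.79),
    "LSMRN": (13.72, 21.53),
    "LCRNN": (12.26, 18.77),
}
digcnet_sfo, digcnet_nyc = 11.02, 17.21

sfo = {name: relative_reduction(v[0], digcnet_sfo) for name, v in table2.items()}
nyc = {name: relative_reduction(v[1], digcnet_nyc) for name, v in table2.items()}

sfo_range = (min(sfo.values()), max(sfo.values()))  # smallest gain vs LCRNN, largest vs SVR
nyc_range = (min(nyc.values()), max(nyc.values()))
```

The smallest reduction in each city is against LCRNN and the largest is against SVR, which matches the quoted ranges up to rounding in the second decimal place.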
Comparison with variants of DIGCNet. We also compare different variants of DIGCNet: the spatiotemporal component only, the spatiotemporal and periodic components, and the whole DIGCNet with all components (spatiotemporal, periodic and incident). The results are shown in Table 3. The first finding is that the improvement from periodic learning is relatively weak, with a MAPE difference of only 0.25% in SFO and 0.06% in NYC. One possible reason why the improvement margin is larger in SFO than in NYC is that SFO has a relatively simple road network and its traffic speed variation is more regular. The MAPE without incident learning (spatiotemporal + periodic) is 12.22% in SFO and 18.63% in NYC, which already outperforms all benchmarks (slightly outperforming LCRNN). The results also verify that our incident learning component is the key to the improvement, with a MAPE improvement of 1.2% in SFO and 1.42% in NYC.
Table 3. MAPE of different variants of DIGCNet on SFO and NYC.
| Variant | MAPE (SFO) | MAPE (NYC) |
|---|---|---|
| Spatiotemporal | 12.47% | 18.69% |
| Spatiotemporal + periodic | 12.22% | 18.63% |
| DIGCNet-all (Spatiotemporal + periodic + incident) | 11.02% | 17.21% |
Comparison across different time periods. As shown in Figure 5(b) and Figure 5(d), the number of incidents varies over time, and more incidents occur during traffic peak periods. Meanwhile, traffic speed variation is also time-sensitive. Therefore, we further select 2:00 - 3:00 am as the wee hour and 7:00 - 8:00 am as the rush hour, and take SFO as an illustration to evaluate the performance of the different methods. Figure 8 shows the MAPE results in the wee hour and the rush hour. In SFO, our method has a relatively 2.08% to 64.43% lower MAPE than the benchmarks in the wee hour, and a relatively 10.78% to 89.50% lower MAPE in the rush hour. The performance of our method and LCRNN is quite similar in the wee hour but shows a relatively clear gap in the rush hour, which derives from the more complex traffic patterns in the rush hour.
Comparison for multi-step prediction. We then present the comparison results for multi-step prediction. DIGCNet can be used for multi-step speed prediction by setting the number of learnable units in the spatiotemporal learning component. We set the prediction length k = 1, 2, 3 (speeds of the next 5, 10 and 15 minutes) to evaluate the multi-step case. The results are shown in Table 4. The performance of DIGCNet remains stable as the prediction length increases: relative to k = 1, MAPE increases by 3.09% for k = 2 and 5.44% for k = 3 in SFO, and by 3.88% for k = 2 and 9.03% for k = 3 in NYC. When the prediction length is within three steps, DIGCNet outperforms all one-step baselines in SFO; in NYC, only one-step LCRNN outperforms three-step DIGCNet. The multi-step results demonstrate that our model can be effectively applied to multi-step prediction within a certain time range.
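Table 4. MAPE of DIGCNet for multi-step prediction on SFO and NYC.
| Method | MAPE (SFO) | MAPE (NYC) |
|---|---|---|
| DIGCNet, k=1 | 11.02% | 17.27% |
| DIGCNet, k=2 | 11.36% | 17.94% |
| DIGCNet, k=3 | 11.62% | 18.83% |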
6. Related Work
Traffic Speed Prediction. A number of solutions have been proposed for traffic speed prediction. ARIMA (Contreras et al., 2003) is a classical model in this area, and regression methods (CastroNeto et al., 2009) are also widely used for predicting traffic speed. There are also matrix spectral decomposition models for traffic speed prediction: (Deng et al., 2016) proposed a latent space model to capture both topological and temporal properties. Recently, deep learning approaches have achieved great success in this space by using spatiotemporal and context features (Lv et al., 2014; Ma et al., 2017). The spatiotemporal and context structure is commonly used in traffic prediction. (Zhang et al., 2016) divided the road network into grids and used CNN to capture spatial dependencies. (Lv et al., 2018) proposed a model that integrates both RNN and CNN models. GCN has recently begun to be used for traffic speed prediction because of its ability to effectively capture topological features in non-Euclidean structures. (Li et al., 2018) proposed to model the traffic flow as a diffusion process on a directed graph. (Yu et al., 2017) proposed the STGCN model to tackle the time series prediction problem in the traffic domain. In our work, we effectively incorporate traffic incident, spatiotemporal, periodic and weather features for traffic speed prediction. Our main contributions focus on the effective use of incident information to improve prediction performance.
Urban Incidents. Research on urban anomalous incidents mainly focuses on incident detection. (Gu et al., 2016) mined tweet texts to extract incident information for traffic incident detection. (Zhang et al., 2018) proposed an SVM-based algorithm that captures rare patterns to detect urban anomalies. (Yuan et al., 2018) proposed a ConvLSTM model for traffic incident prediction. A few works also focus on mining the impact of incidents. (Miller and Gupta, 2012) proposed a system for predicting the cost and impact of highway incidents, in order to classify the duration of incident-induced delays and the magnitude of the incident impact. (Javid and Javid, 2018) developed a framework that estimates the travel time variability caused by traffic incidents using a series of robust regression methods. In our work, we extract the latent incident impact features for traffic speed prediction.
7. Conclusion
In this work, we investigate the problem of incident-driven traffic speed prediction. We first propose a critical incident discovery method to identify crucial urban incidents and their impact on traffic flows. We then design a binary classifier to extract the latent incident impact features for improving traffic speed prediction. Combining both processes, we propose the Deep Incident-Aware Graph Convolutional Network (DIGCNet) to effectively incorporate traffic incident, spatiotemporal, periodic and weather features for traffic speed prediction. We evaluate DIGCNet using two real-world urban traffic datasets from large cities (SFO and NYC). The results demonstrate the superior performance of DIGCNet and validate the effectiveness of extracting latent incident features in our framework.
References
 Boriboonsomsin et al. (2012) Kanok Boriboonsomsin, Matthew J Barth, Weihua Zhu, and Alexander Vu. 2012. Eco-routing navigation system based on multisource historical and real-time traffic information. IEEE Transactions on Intelligent Transportation Systems 13, 4 (2012), 1694–1704.
 Bruna et al. (2014) Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. 2014. Spectral networks and locally connected networks on graphs. In Proceedings of 2nd International Conference on Learning Representations (ICLR ’14).
 CastroNeto et al. (2009) Manoel CastroNeto, YoungSeon Jeong, MyongKee Jeong, and Lee D Han. 2009. Online-SVR for short-term traffic flow prediction under typical and atypical traffic conditions. Expert Systems with Applications 36, 3 (2009), 6164–6173.
 Contreras et al. (2003) Javier Contreras, Rosario Espinola, Francisco J Nogales, and Antonio J Conejo. 2003. ARIMA models to predict next-day electricity prices. IEEE Transactions on Power Systems 18, 3 (2003), 1014–1020.
 Deng et al. (2016) Dingxiong Deng, Cyrus Shahabi, Ugur Demiryurek, Linhong Zhu, Rose Yu, and Yan Liu. 2016. Latent Space Model for Road Networks to Predict Time-Varying Traffic. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16). 1525–1534.
 Dhillon et al. (2004) Inderjit S. Dhillon, Yuqiang Guan, and Brian Kulis. 2004. Kernel K-means: Spectral Clustering and Normalized Cuts. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’04). 551–556.

 Gao et al. (2019) Ruipeng Gao, Xiaoyu Guo, Fuyong Sun, Lin Dai, Jiayan Zhu, Chenxi Hu, and Haibo Li. 2019. Aggressive driving saves more time? Multi-task learning for customized travel time estimation. In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI ’19). AAAI Press, 1689–1696.
 Gu et al. (2016) Yiming Gu, Zhen (Sean) Qian, and Feng Chen. 2016. From Twitter to detector: Real-time traffic incident detection using social media data. Transportation Research Part C: Emerging Technologies 67 (2016), 321–342.
 He et al. (2019) Yaqin He, Yulun Rong, Zupeng Liu, and Shengpin Du. 2019. Traffic Influence Degree of Urban Traffic Accident Based on Speed Ratio. Journal of Highway and Transportation Research and Development (English Edition) 13, 3 (2019), 96–102.
 Here (2019) Here. 2019. https://developer.here.com/.
 Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jurgen Schmidhuber. 1997. Long Short-term Memory. Neural Computation (1997), 1735–1780.
 Javid and Javid (2018) Roxana J Javid and Ramina Jahanbakhsh Javid. 2018. A framework for travel time variability analysis using urban traffic incident data. IATSS research 42, 1 (2018), 30–38.
 Johnson et al. (2017) Isaac Johnson, Jessica Henderson, Caitlin Perry, Johannes Schöning, and Brent Hecht. 2017. Beautiful… but at What Cost?: An Examination of Externalities in Geographic Vehicle Routing. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (Ubicomp ’17) 1, 2 (2017), 15.
 Kipf and Welling (2017) Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In Proceedings of 5th International Conference on Learning Representations (ICLR ’17).
 Li et al. (2018) Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. 2018. Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting. In Proceedings of 6th International Conference on Learning Representations (ICLR ’18).
 Li et al. (2017) Zhibin Li, Pan Liu, Chengcheng Xu, Hui Duan, and Wei Wang. 2017. Reinforcement learning-based variable speed limit control strategy to reduce traffic congestion at freeway recurrent bottlenecks. IEEE Transactions on Intelligent Transportation Systems 18, 11 (2017), 3204–3217.
 Lin et al. (2017) Lu Lin, Jianxin Li, Feng Chen, Jieping Ye, and Jinpeng Huai. 2017. Road traffic speed prediction: a probabilistic model fusing multi-source data. IEEE Transactions on Knowledge and Data Engineering 30, 7 (2017), 1310–1323.
 Lin (1989) Lawrence I-Kuei Lin. 1989. A Concordance Correlation Coefficient to Evaluate Reproducibility. Biometrics 67 (1989), 255–268.
 Lv et al. (2014) Yisheng Lv, Yanjie Duan, Wenwen Kang, Zhengxi Li, and Fei-Yue Wang. 2014. Traffic flow prediction with big data: a deep learning approach. IEEE Transactions on Intelligent Transportation Systems 16, 2 (2014), 865–873.
 Lv et al. (2018) Zhongjian Lv, Jiajie Xu, Kai Zheng, Hongzhi Yin, Pengpeng Zhao, and Xiaofang Zhou. 2018. LC-RNN: A Deep Learning Model for Traffic Speed Prediction. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI ’18).
 Ma et al. (2017) Xiaolei Ma, Zhuang Dai, Zhengbing He, Jihui Ma, Yong Wang, and Yunpeng Wang. 2017. Learning traffic as images: a deep convolutional neural network for large-scale transportation network speed prediction. Sensors 17, 4 (2017), 818.
 Ma et al. (2015) Xiaolei Ma, Zhimin Tao, Yinhai Wang, Haiyang Yu, and Yunpeng Wang. 2015. Long short-term memory neural network for traffic speed prediction using remote microwave sensor data. Transportation Research Part C: Emerging Technologies 54 (2015), 187–197.
 Michaël et al. (2016) Defferrard Michaël, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In Proceedings of Neural Information Processing Systems (NIPS ’16).
 Mikolov et al. (2010) Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černockỳ, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Eleventh annual conference of the international speech communication association.
 Miller and Gupta (2012) Mahalia Miller and Chetan Gupta. 2012. Mining Traffic Incidents to Forecast Impact. In Proceedings of the ACM SIGKDD International Workshop on Urban Computing (Urbcomp ’12).
 Pan et al. (2012) Bei Pan, Ugur Demiryurek, and Cyrus Shahabi. 2012. Utilizing Real-World Transportation Data for Accurate Traffic Prediction. In Proceedings of 2012 IEEE 12th International Conference on Data Mining (ICDM ’12).
 Rathore et al. (2016) M Mazhar Rathore, Awais Ahmad, Anand Paul, and Seungmin Rho. 2016. Urban planning and building smart cities based on the internet of things using big data analytics. Computer Networks 101 (2016), 63–80.
 Rumsey (2015) Deborah J Rumsey. 2015. U Can: statistics for dummies. (2015).
 Smola and Schölkopf (1997) Alex J. Smola and Bernhard Schölkopf. 1997. A Tutorial on Support Vector Regression. Statistics and Computing. (1997), 199–222.
 Tong et al. (2017) Yongxin Tong, Yuqiang Chen, Zimu Zhou, Lei Chen, Jie Wang, Qiang Yang, Jieping Ye, and Weifeng Lv. 2017. The simpler the better: a unified approach to predicting original taxi demands based on large-scale online platforms. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1653–1662.
 Viovy et al. (1992) N Viovy, O Arino, and AS Belward. 1992. The Best Index Slope Extraction (BISE): A method for reducing noise in NDVI time-series. International Journal of Remote Sensing 13, 8 (1992), 1585–1590.
 Xie et al. (2018) Rong Xie, Yang Chen, Yu Xiao, and Xin Wang. 2018. We Know Your Preferences in New Cities: Mining and Modeling the Behavior of Travelers. IEEE Communications Magazine (2018), pages 28–35.
 Yahoo (2019) Yahoo. 2019. https://developer.yahoo.com/weather/.
 Yao et al. (2019) Huaxiu Yao, Xianfeng Tang, Hua Wei, Guanjie Zheng, and Zhenhui Li. 2019. Revisiting spatial-temporal similarity: A deep learning framework for traffic prediction. In AAAI Conference on Artificial Intelligence (AAAI ’19).
 Yao et al. (2018) Huaxiu Yao, Fei Wu, Jintao Ke, Xianfeng Tang, Yitian Jia, Siyu Lu, Pinghua Gong, Jieping Ye, and Zhenhui Li. 2018. Deep multi-view spatial-temporal network for taxi demand prediction. In Thirty-Second AAAI Conference on Artificial Intelligence (AAAI ’18).
 Yu et al. (2017) Bing Yu, Haoteng Yin, and Zhanxing Zhu. 2017. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. arXiv preprint arXiv:1709.04875 (2017).

 Yu and Shi (2003) Stella X. Yu and Jianbo Shi. 2003. Multiclass spectral clustering. In Proceedings Ninth IEEE International Conference on Computer Vision (ICCV ’03).
 Yuan et al. (2018) Zhuoning Yuan, Xun Zhou, and Tianbao Yang. 2018. Hetero-ConvLSTM: A deep learning approach to traffic accident prediction on heterogeneous spatio-temporal data. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ’18). 984–992.
 Zhang et al. (2018) Huichu Zhang, Yu Zheng, and Yong Yu. 2018. Detecting urban anomalies using multiple spatio-temporal data sources. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (Ubicomp ’18) 2, 1 (2018), 54.
 Zhang et al. (2017) Junbo Zhang, Yu Zheng, and Dekang Qi. 2017. Deep spatio-temporal residual networks for citywide crowd flows prediction. In Proceedings of Thirty-First AAAI Conference on Artificial Intelligence (AAAI ’17).
 Zhang et al. (2016) Junbo Zhang, Yu Zheng, Dekang Qi, Ruiyuan Li, and Xiuwen Yi. 2016. DNN-Based Prediction Model for Spatio-Temporal Data. In Proceedings of the 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (SIGSPATIAL ’16).
 Zheng and Ni (2013) Jiangchuan Zheng and Lionel M Ni. 2013. Time-dependent trajectory regression on road networks via multi-task learning. In Twenty-Seventh AAAI Conference on Artificial Intelligence.