1. Introduction
Early detection of traffic incidents and estimation of the impact of the nonrecurrent congestion they cause have become increasingly important research topics because of the significant social and economic losses involved. A one-minute reduction in congestion duration produces a gain of 65 US dollars per incident (Adler et al., 2013). Although nonrecurrent congestion is hard to predict because of its random nature, the impact and duration of traffic incidents remain a major focus for traffic operators. The wide deployment of traffic speed sensors and Traffic Incident Management Systems (TIMS) makes traffic speed data and traffic incident records readily accessible to transportation operators. With this abundance of traffic data sources, an efficient multitask learning model can be implemented to provide accurate predictions of incident duration.
Incident duration is the time elapsed from the incident occurrence until all evidence of the incident has been removed from the incident scene. From the perspective of traffic management and operation, the life cycle of a traffic incident is split into five stages: Detection, Verification, Response, Clearance, and Recovery (Ozbay and Kachroo, 1999). Figure 1 shows the life cycle of a traffic incident. However, the five-stage life cycle cannot be used directly as temporal features for traffic incident duration prediction. To accurately estimate the duration of a traffic incident in its early stages, transportation operators and first responders face three major challenges:
1) No explicit high-level temporal features: Although the conventional five-stage life cycle is effective for the purposes of traffic management, the five stages cannot serve as temporal features in the traffic incident duration prediction task. It is important to group the critical time point features in the early stages of the incident into higher-level time periods that act as better indicators of the incident duration.
2) Hard to predict the influence of an incident: In the research field of Traffic Incident Management, one of the most essential tasks is to estimate the impact of a traffic incident, in terms of its duration, at an early stage. However, the performance of conventional time series based methods is limited by their inability to identify higher-level temporal features.
3) Spatial connectivity of the road network is rarely considered: Traffic congestion cascades within the road network. As a consequence, the traffic patterns of incidents in their early stages are similar when the incidents are topologically close in the road network. Traffic incidents that are spatiotemporally closer should share more similar traffic speed patterns. However, this spatial correlation between traffic incidents is rarely considered in previous studies (Li et al., 2018).
Existing methods mostly cannot address these challenges. Norm-regularized feature learning methods such as Lasso (Tibshirani et al., 2005) perform feature selection, but they require strong assumptions on the design matrix (Zhou et al., 2013a). Zhan et al. (Zhan et al., 2011) propose an M5P tree algorithm to predict the clearance time of traffic incidents based on geometric and traffic features. Feature learning algorithms for biomarker identification (Zhou et al., 2013b) and social event indicators (Zhao et al., 2018) have proved effective at finding higher-level features. However, most of them focus on learning important feature sets from attributes and do not apply to our problem because of their expensive computation. These studies use the duration of an incident to quantify its impact, but their quantification strategies are designed to capture the one-time impact of the incident rather than the time-varying nature of the impact at different locations. Multitask learning based spatiotemporal models play an important role when the connectivity of the road network is considered. Such models focus on regression and classification problems such as county income prediction (Zhang et al., 2017), social unrest event forecasting (Zhao et al., 2016), and even service disruption detection for transit networks (Ji et al., 2018). However, none of the previously proposed methods can model the spatial connectivity between features at a higher level. Therefore, most existing models are not suitable for our traffic incident duration prediction problem.
To address these challenges, we propose a Traffic IncidenT DurAtion PredictioN (TITAN) model based on both sparse feature learning and multitask learning framework. Our main contributions are:
Formulating a novel machine learning framework for traffic incident duration prediction using temporal features. In contrast to existing works, we formulate the problem of traffic incident duration prediction for transportation systems as a multitask supervised learning problem. In the proposed method, models for different road segments are learned simultaneously by restricting all road segments to exploit a common set of features.
Modeling traffic speed similarity among road segments via spatial connectivity in feature space. Based on the cascading nature of traffic congestion in road networks, specifically designed constraints are proposed to model traffic speed similarities among data for spatiotemporally correlated road segments. These similarities in feature space are driven by spatial connectivity.
Proposing a sparse feature learning process to identify groups of temporal features at a higher level. By the nature of traffic incidents, the traffic speed fluctuation in the early stages of an incident is always important when estimating its impact and duration. In the proposed model, constraints with sparsity and orthogonality are introduced to extract grouped important temporal features at a higher level.
Developing an efficient algorithm to train the proposed model. The underlying optimization problem of the proposed multitask model is a nonsmooth, multiconvex, and inequality-constrained problem, which is challenging to solve. By introducing auxiliary variables, we develop an effective ADMM-based algorithm to decouple the main problem into several subproblems that can be solved by block coordinate descent and proximal operators.
The rest of our paper is structured as follows. Related works are reviewed in Section 2. In Section 3, we describe the problem setup of our work. In Sections 4 and 5, we present a detailed discussion of our proposed TITAN model for predicting durations of traffic incidents, and its solution for parameter learning. In Section 6, extensive experiment evaluations and comparisons are presented. In the last section, we discuss our conclusion and directions for future work.
2. Related Works
In this section, we review the current state of research on traffic incident analysis. Several threads of work are related to this paper: traffic incident impact analysis, urban event forecasting, and spatiotemporal multitask learning.
Traffic Impacts Analysis. Conventional statistical methods have proved effective for traffic incident duration prediction. They fall into several branches: Bayesian classifiers (Boyles et al., 2007), discrete choice models (Lin et al., 2004), linear/nonparametric regression (Peeta et al., 2000), and hazard-based duration models (Nam and Mannering, 2000). In the recent decade, Traffic Incident Management Systems (TIMS) have been deployed by traffic control centers in various cities and on highways to alleviate the influence of traffic incidents on traffic conditions (Owens et al., 2010). The historical traffic data corresponding to traffic incidents play an important role in predicting traffic incident durations. A new research field based on data-driven algorithms and supported by the availability of real-world traffic data has recently emerged for traffic incident duration prediction and is growing in popularity. Various data mining and machine learning approaches have been employed to estimate and predict traffic incident duration. Lee et al. (Lee and Wei, 2010) proposed a genetic algorithm for traffic incident duration prediction; Kim et al. and Zhan et al. (Zhan et al., 2011) applied decision tree and classification tree models to the same problem and achieved improvements; Valenti et al. (Valenti et al., 2010) proposed a support vector machine based method that utilizes the temporal features of the traffic data; artificial neural networks (Vlahogianni and Karlaftis, 2013) are another prominent direction for traffic incident duration prediction. In recent years, research on Intelligent Transportation Systems (ITS) has turned its attention to hybrid methods (Kim and Chang, 2012) for predicting traffic incident durations.
Urban Event Forecasting. Predicting and detecting the occurrence and impact of traffic incidents as urban events has received increasing attention in recent years. A large body of traditional work has focused on the early detection of events such as earthquakes (Sakaki et al., 2010), disease outbreaks (Zhao et al., 2015a), and transit service disruption (Ji et al., 2018), while event forecasting methods predict the incidence of such events in the future. Most existing event forecasting methods focus on temporal events, such as stock market movements (Bollen et al., 2011) and elections (O’Connor et al., 2010), with no interest in the geographical dimension. A handful of works have started to address urban event prediction at a spatiotemporal resolution. For example, Zhao et al. (Zhao et al., 2015b) proposed a multitask learning framework that models forecasting tasks in related geolocations concurrently, and Gerber et al. (Gerber, 2014) utilized a logistic regression model for spatiotemporal event forecasting, producing urban event predictions with true spatiotemporal resolution. One limitation of these existing studies is that the temporal dimension is considered to be independent of the spatial dimension, and any interactions between the two are ignored. Our proposed TITAN model addresses the importance of the topology dimension, which is derived from the spatial dimension. We propose a multitask learning framework with orthogonal constraints to model the interactions between the temporal and topological dimensions.
Spatiotemporal Multitask Learning. Multitask learning (MTL) refers to models that learn multiple related tasks simultaneously to improve overall performance. Recent decades have witnessed proposals for many MTL approaches (Zhou et al., 2011). Evgeniou et al. (Evgeniou and Pontil, 2004) proposed a regularized MTL formulation that constrains the models of each task to be close to each other. Task relatedness can also be modeled by constraining multiple tasks to share a common underlying structure (e.g., a common set of features) (Argyriou et al., 2007), or a common subspace (Ando and Zhang, 2005). Zhao et al. (Zhao et al., 2015b) designed a multitask learning framework that models forecasting tasks in related geolocations. MTL approaches have been applied in many domains, including computer vision and biomedical informatics. Our work, to the best of our knowledge, is the first to address the feasibility of combining multitask learning and orthogonal regularization techniques to solve the problems of traffic incident duration prediction and critical phase learning.
3. Problem Statement
Assume that we are given a collection of traffic incidents from the traffic incident management system (TIM). For each traffic incident in , we find the spatially correlated traffic sensor , and its traffic speed reading at time interval : . The granularity of the time interval is 1 minute. Given an incident record and the traffic speed readings of its corresponding traffic speed sensor, the main objective of this paper is to predict the future impact of the incident in terms of its temporal duration.
Definition I: Traffic speed in detection time and early verification time. Suppose the verification time of the traffic incident is in time interval . We define and extract two important time periods for feature construction: detection time (the time between incident occurrence and incident verification time ) and early verification time (a short period after the traffic incident verification time ). The traffic speeds for both time periods are extracted as: (1) Traffic speed in detection time: the previous readings: and (2) Traffic speed in early verification time: the succeeding readings .
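Definition I can be illustrated with a short sketch. The window lengths `n_before` and `n_after` are hypothetical names, since the source does not preserve how many readings are taken on each side, and excluding the reading at the verification minute itself is also an assumption.

```python
def build_features(speeds, t_v, n_before, n_after):
    """Concatenate speed readings around the verification minute t_v.

    speeds   : list of per-minute speed readings from one sensor
    t_v      : index of the verification minute within `speeds`
    n_before : number of detection-time readings to keep (assumed name)
    n_after  : number of early-verification readings to keep (assumed name)
    """
    if t_v - n_before < 0 or t_v + n_after >= len(speeds):
        raise ValueError("not enough readings around the verification time")
    # Readings strictly before the verification minute (detection time).
    detection = speeds[t_v - n_before:t_v]
    # Readings strictly after the verification minute (early verification time).
    early_verification = speeds[t_v + 1:t_v + 1 + n_after]
    return detection + early_verification


# Made-up readings in mph; the incident is verified at minute index 3.
example_speeds = [60, 58, 55, 40, 30, 25, 28, 35]
example_features = build_features(example_speeds, t_v=3, n_before=3, n_after=2)
```

Each incident then contributes one fixed-length feature vector, and stacking these vectors for all incidents on one arterial road gives the training input for that road.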
Given the collection of traffic incidents, we first filter the collection with a selection of arterial roads. This produces the targeted traffic incidents collection . Then based on which traffic incident takes place at the arterial road, is grouped into , for example, .
We adopt a combination of traffic speed readings in detection time and early verification time as the training features. For each traffic incident subcollection , we construct the training input and the label . The problem is then formulated as solving the mapping:
(1) 
where . is the number of traffic incident records for one arterial road; represents the feature dimension of the training data, which is a combination of the detection time and the verification time; is the learning model for inferring the traffic incident duration in the subcollection .
Because our problem is to predict the duration of traffic incidents given the historical traffic speed readings for the corresponding collection of target traffic incidents , it falls within the scope of regression problems. For instance, learning the function can be modeled as a regression problem with a least squares loss function, and the model parameters can be learned by solving the following optimization problem:
(2) 
where controls the sparsity of the grouped features, and is the total number of data points in . Moreover, inspired by the spatial correlations of traffic incidents introduced by the connectivity between road segments, we extend the subproblem defined in Section 3 to a regression problem under a multitask learning framework. The proposed model should be encouraged to capture hidden patterns among road segments and to maintain sparsity in feature space. Mathematically, this consideration inspires us to use the $\ell_{2,1}$ norm (Argyriou et al., 2008) to perform joint feature selection:
(3) 
where each column of , which is represented by , denotes the model parameters for . In this way, we can further model the relatedness among the road segments with the parameter matrix . An overview of the TITAN model is shown in Figure 3. The following subsections address the details of the constraints on orthogonality and spatial connectivity.
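To make the joint feature selection concrete, the sketch below computes a row-wise norm of a parameter matrix. It assumes the regularizer is the $\ell_{2,1}$ norm used for multitask feature learning by Argyriou et al., with rows indexing features and columns indexing road segments; the function name and the numbers are illustrative only.

```python
import math


def l21_norm(W):
    """Sum of the Euclidean norms of the rows of W (given as a list of rows)."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in W)


# Rows are features, columns are road segments (tasks); values are made up.
W_example = [
    [3.0, 4.0],  # feature 0 is used by both roads
    [0.0, 0.0],  # feature 1 is dropped for every road at once
    [0.0, 2.0],  # feature 2 is used by the second road only
]
penalty_example = l21_norm(W_example)
```

Because the penalty adds up one norm per row, it is reduced most effectively by zeroing entire rows, which is what makes all tasks agree on a common subset of features.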
4. Model
To identify the critical temporal features for traffic incident duration prediction, orthogonal constraints are applied to the TITAN model; to properly model the correlations between the traffic incidents based on the connectivity between the arterial roads, we apply a multitask learning framework while designing the model.
4.1. Group Feature Learning
In studies of Traffic Incident Management (TIM), one important task is to identify the key response time points and periods of traffic incidents. Assume that a two-vehicle collision occurs at 5:15 pm on a road segment of Interstate 66. Based on the traffic speed readings from the traffic sensor, the transportation agencies want to learn how much impact the incident will have on the local transportation system in terms of its duration. The traffic speed readings 5 minutes and 15 minutes after the incident's occurrence play an important role in predicting the duration of the traffic incident.
Definition II: Groups of key time points for a traffic incident. The group assignment information is represented in a vector, and the th group of time points is denoted by . If the th time point feature belongs to this group, then the th component of is nonzero and the relative magnitude represents the ‘importance’ of the feature in this group. For training data for one specific road segment, the new features generated by the group assignment is given by . Assume that there are groups of features and the group structure is denoted by , and the generalized new features are given by . To assign physical meaning to each generated group, the elements of have to be nonnegative.
The new model vector for the grouped features is denoted by . The resulting formulation of the key feature group identification problem is then defined by:
(4)  
where is the parameter that controls the sparsity of each assigned group in . The norm in the constraint fixes the length of each column in to be , which makes the group matrix easy to interpret.
By solving Equation 4, the model learns the group structure of the data features. However, the groups may overlap heavily because the proposed constraint places no restriction on feature overlap. Such group overlap is not ideal in our setting of traffic incident duration prediction. Because our features are selected from a time sequence of traffic speed readings, consecutive features always carry a physical meaning.
In the research of traffic incident management, the lifetime of an incident generally consists of five different stages: incident detection, verification, response, clearance, and recovery. Because no two stages overlap, we impose orthogonal constraints to control the overlap among the groups. The original nonnegative constraint between all , is also applied. For simplicity and interpretability, we normalize the group assignments and assume that the columns of have unit length under the norm. The constraint can further be expressed by . We use the norm regularization to control the sparsity on . The improved formulation of group feature learning can be given by:
(5)  
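The effect of combining nonnegativity with orthogonality can be checked numerically. The sketch below assumes the orthogonal constraint takes the form G^T G = I on a nonnegative group matrix G whose rows are time-point features and whose columns are groups; that form, the helper names, and the numbers are assumptions, since the source does not preserve the symbols.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def is_valid_grouping(G, tol=1e-9):
    """Check that G is entrywise nonnegative and has orthonormal columns."""
    if any(v < 0 for row in G for v in row):
        return False
    gram = matmul(transpose(G), G)
    k = len(gram)
    return all(
        abs(gram[i][j] - (1.0 if i == j else 0.0)) <= tol
        for i in range(k)
        for j in range(k)
    )


# Four time-point features assigned to two groups with disjoint supports.
G = [[0.6, 0.0], [0.8, 0.0], [0.0, 1.0], [0.0, 0.0]]
# Both groups use the same two features, so the columns are not orthogonal.
G_overlap = [[0.6, 0.6], [0.8, 0.8], [0.0, 0.0], [0.0, 0.0]]

X = [[50.0, 40.0, 20.0, 30.0]]  # one incident, four speed readings (made up)
grouped = matmul(X, G)          # one weighted feature per group
```

Two nonnegative vectors can only be orthogonal when they never share a nonzero position, so under these assumptions every time point belongs to at most one group, which matches the non-overlapping stages described above.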
4.2. Spatial Connectivity in Feature Space
In real-world transportation systems, different road segments are spatially related by intersections or interchanges. That is, two or more road segments may share similar traffic speed patterns during traffic incidents. For instance, traffic congestion on Interstate 495 could not only change traffic patterns on local road segments but also change traffic patterns on other arterial roads with close spatial correlations (e.g., Interstate 66 and US Route 7). This spatial relatedness, caused by cascading network failure (Su et al., 2014; Kwee et al., 2018), results in similar traffic speed fluctuations and therefore similar patterns of traffic incident durations.
Definition III: Traffic incident spatial correlations. With prior knowledge such as the road network connectivity, we assume that the traffic incidents are spatially correlated with each other. We are given a road network , where the vertex set represents the union of the intersections and interchanges, and the edge set represents the collection of roadblocks. To model the connectivity of the road network, we transform the original road network graph into its line graph , where the vertex set represents the roads, and the edge set represents the connectivity between roads. The adjacency matrix of the line graph reflects the overall connectivity of the roads. The road connectivity and the line graph transformation are shown in Figure 2. Mathematically, we improve the model with constraints on parameters among different tasks:
(6)  
where each constraint with forces the Euclidean distance between model parameters for a specific pair of road segments to be within a range. As defined in Section 3, is the adjacency matrix that models the connectivity between road segments.
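The line graph transformation in Definition III can be sketched as follows. The input format (a mapping from each road to the junctions it touches) and the placeholder road names and junction ids are assumptions made for illustration; they do not describe the real network.

```python
from itertools import combinations


def line_graph_adjacency(roads):
    """Build the road-to-road adjacency matrix of the line graph.

    roads: dict mapping a road name to the set of junction ids it touches.
    Returns (names, A) with A[i][j] = 1 when roads i and j share a junction.
    """
    names = sorted(roads)
    index = {name: i for i, name in enumerate(names)}
    A = [[0] * len(names) for _ in names]
    for a, b in combinations(names, 2):
        if roads[a] & roads[b]:  # at least one junction in common
            A[index[a]][index[b]] = 1
            A[index[b]][index[a]] = 1  # undirected, so A is symmetric
    return names, A


# Hypothetical topology: road_a and road_b meet at junction 2, road_c is isolated.
example_roads = {"road_a": {1, 2}, "road_b": {2, 3}, "road_c": {4}}
road_names, A_example = line_graph_adjacency(example_roads)
```

Only pairs with a nonzero entry in the resulting matrix would have their model parameters tied together by the distance constraints.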
Combining the models represented by Equations 5 and 6, we obtain our proposed TITAN model. By moving the nontrivial constraints that are correlated to spatial connectivity into the objective function, we can obtain an equivalent regularized problem, which is easier to solve:
(7)  
where is the tradeoff penalty balancing the value of the loss function and the regularizers; is the adjacency matrix representing the road connectivity; and denotes the connectivity information between the th road and the th road. Because the line graph for road segments is undirected, the corresponding adjacency matrix is symmetric. The coefficient is introduced to avoid counting each pair of roads twice, once in the upper and once in the lower triangle of the matrix.
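A small sketch of the connectivity regularizer follows. It assumes the penalty is the squared Euclidean distance between the parameter vectors of connected roads, summed over all ordered pairs and multiplied by 1/2; both the squared form and the value of the coefficient are assumptions, because the source does not preserve the formula.

```python
def connectivity_penalty(W_cols, A):
    """Return 0.5 * sum_ij A[i][j] * ||w_i - w_j||^2.

    W_cols: list of parameter vectors, one per road (task).
    A     : symmetric adjacency matrix of the road line graph.
    """
    total = 0.0
    m = len(W_cols)
    for i in range(m):
        for j in range(m):
            if A[i][j]:
                sq_dist = sum((a - b) ** 2 for a, b in zip(W_cols[i], W_cols[j]))
                total += A[i][j] * sq_dist
    # Each unordered pair was visited twice, as (i, j) and as (j, i).
    return 0.5 * total


W_cols_example = [[1.0, 0.0], [1.0, 2.0], [5.0, 5.0]]  # made-up parameters
A_pair = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]             # only roads 0 and 1 connect
penalty_value = connectivity_penalty(W_cols_example, A_pair)
```

With a symmetric adjacency matrix the halved double sum equals a single sum over the upper triangle, which is the role the text assigns to the coefficient.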
5. Parameter Learning for TITAN
The objective function in Equation 7 is multiconvex and the regularizer is nonsmooth, which makes the problem difficult to solve. A traditional way to solve this kind of problem is proximal gradient descent, but this approach is slow to converge. Recently, the alternating direction method of multipliers (ADMM) (Boyd et al., 2011) has become popular as an efficient algorithmic framework that decouples the original problem into smaller, easier-to-handle subproblems. Here we propose the ADMM-based Algorithm 1, which optimizes the proposed model efficiently. In particular, primal variables are updated on Line 4, dual variables on Line 5, and Lagrange multipliers on Line 6. Line 7 calculates both primal and dual residuals.
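The four steps named above (primal update, update of the auxiliary variable that the text calls dual, multiplier update, and residual computation) follow the generic ADMM pattern. The sketch below applies that pattern to a deliberately small toy problem, an $\ell_1$-penalized denoising objective, and not to the TITAN objective; the function names, the scaled-multiplier form, and the default parameters are choices made for this illustration.

```python
def soft_threshold(v, tau):
    """Proximal operator of tau * |.| applied to a scalar."""
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0


def admm_l1_denoise(b, lam, rho=1.0, max_iter=500, tol=1e-10):
    """Minimize 0.5 * ||x - b||^2 + lam * ||z||_1 subject to x = z."""
    n = len(b)
    x = [0.0] * n
    z = [0.0] * n  # auxiliary copy of x that carries the nonsmooth term
    u = [0.0] * n  # scaled Lagrange multiplier
    for _ in range(max_iter):
        # Smooth block: closed-form minimizer of the augmented Lagrangian in x.
        x = [(bi + rho * (zi - ui)) / (1.0 + rho) for bi, zi, ui in zip(b, z, u)]
        z_old = z
        # Nonsmooth block: proximal step on the auxiliary variable.
        z = [soft_threshold(xi + ui, lam / rho) for xi, ui in zip(x, u)]
        # Multiplier update.
        u = [ui + xi - zi for ui, xi, zi in zip(u, x, z)]
        # Primal residual ||x - z|| and dual residual rho * ||z - z_old||.
        primal = sum((xi - zi) ** 2 for xi, zi in zip(x, z)) ** 0.5
        dual = rho * sum((zi - zo) ** 2 for zi, zo in zip(z, z_old)) ** 0.5
        if primal < tol and dual < tol:
            break
    return z


z_star = admm_l1_denoise([3.0, -0.5, 1.0], lam=1.0)
```

For this toy objective the exact minimizer is the soft-thresholded input, so the result can be compared against `soft_threshold` applied entrywise. A solver for a multi-block problem keeps the same loop structure but replaces the closed-form smooth step with block coordinate descent or gradient steps.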
5.1. Augmented Lagrangian Scheme
First, we introduce auxiliary variables and into the original problem in Equation 7 and obtain the following equivalent problem:
(8)  
where is the set of variables to be optimized. Then we transform the above problem into its augmented Lagrangian form as follows:
(9)  
where , , and are the Lagrangian multipliers. With this step, we decouple the original problem into two easier to handle problems in which seven variables , , , , , , and will be optimized individually. Note that the coefficient is omitted according to the optimization problem, and is the Frobenius norm.
5.2. Parameter Optimization
The Lagrangian form in Equation 9 is separated based on the primal variables and the dual variables, where the problem of solving the primal variables and is smooth and convex:
(10)  
5.2.1. Update
We define Equation 10 as objective function which is multiconvex. In particular, of is convex where all other are fixed. This kind of problem can be decoupled into subproblems using block coordinate descent (BCD) (Xu and Yin, 2013), in which each is updated by solving the following suboptimization problems:
(11) 
is smooth and convex for each and can be solved by gradient descent as follows:
(12) 
where according to the BCD algorithm, the is calculated in sequence, from to . And the is defined as follows:
where and are the th columns of the corresponding Lagrangian multiplier and dual variable.
5.2.2. Update
Similarly, the objective function of is also smooth and convex. Because there are no constraints defined between the columns of , the problem can be solved by gradient descent directly on the objective function in Equation 10, and the gradient of is calculated by:
(13)  
and the primal variable is then updated with a step size of :
(14) 
Now that the primal variable is taken care of, the dual variable is updated as follows:
(15) 
Note that this problem is the definition of proximal , where is the nonsmooth function . The proximal operator can be evaluated efficiently (Parikh et al., 2014).
5.2.3. Update Dual Variables
Now that the primal variables and are taken care of, the dual variables and are updated as follows:
(16)  
where is the nonsmooth function and is the nonsmooth function . Both updates are proximal operators, which can be evaluated efficiently (Parikh et al., 2014).
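As an illustration of such a proximal update, the sketch below evaluates the proximal operator of a row-wise $\ell_{2,1}$ penalty, which has a closed form known as group soft-thresholding. Treating one of the nonsmooth functions as an $\ell_{2,1}$ norm is an assumption, since the source does not preserve the functions; for an entrywise $\ell_1$ penalty the same formula applies with every row of length one.

```python
import math


def prox_l21(V, tau):
    """Solve argmin_Z tau * ||Z||_{2,1} + 0.5 * ||Z - V||_F^2 row by row."""
    out = []
    for row in V:
        norm = math.sqrt(sum(v * v for v in row))
        # Shrink the whole row toward zero; rows with norm <= tau vanish.
        scale = max(0.0, 1.0 - tau / norm) if norm > 0 else 0.0
        out.append([scale * v for v in row])
    return out


V_example = [[3.0, 4.0], [0.3, 0.4]]  # made-up values
Z_example = prox_l21(V_example, tau=1.0)
```

The first row has norm 5 and is scaled by 0.8, while the second row has norm 0.5, which is below the threshold, so it is set to zero as a whole. This is how the update produces row sparsity.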
Next, the Lagrangian multipliers , , and are updated as follows:
(17)  
Finally, primal and dual residuals are calculated with:
(18)  
where is primal residual, and is dual residual.
6. Experiment
In this section, we present the experiment environment, the datasets, the evaluation metrics and comparison methods, an extensive experimental analysis of the predictive results, and a discussion of the learned features.
Method    I-270                       I-295                       I-395
          RMSE     MAE      MAPE      RMSE     MAE      MAPE      RMSE     MAE      MAPE
Ridge     92.4709  76.4666  96.3826   89.1404  69.1273  87.3530   84.6881  65.5869  83.3106
LASSO     90.8535  73.8732  90.3336   76.4372  58.8515  70.1599   72.4028  55.8695  68.8993
SVR       87.8016  72.9036  88.7639   72.4579  53.9583  68.6843   68.4456  50.0854  62.6849
nMTL      70.7942  59.9754  82.8141   55.4657  42.6052  55.3893   57.2953  43.3107  41.2034
FeaFiner  77.0080  57.5550  81.4397   63.3036  50.1060  62.6381   51.6727  40.8695  47.4805
TITAN     73.1291  59.5265  81.3789   46.0873  34.3043  52.9296   46.2329  38.9277  42.3794

Method    I-495                       I-66                        I-95
          RMSE     MAE      MAPE      RMSE     MAE      MAPE      RMSE     MAE      MAPE
Ridge     69.9718  52.2384  81.2393   80.4118  62.5392  85.3443   76.0088  64.6172  80.1281
LASSO     60.0119  48.5583  75.6027   68.0900  60.7429  77.9394   84.5617  58.7706  69.6493
SVR       58.9676  46.7641  71.5021   72.7470  59.0808  71.1609   62.8689  54.7717  68.8999
nMTL      52.5722  40.5422  63.6820   60.6244  48.4900  58.4887   57.1166  45.1327  49.4991
FeaFiner  56.3049  44.0023  44.9048   62.5098  50.4090  56.4438   55.6806  46.0073  56.0013
TITAN     47.7131  31.7725  37.1649   53.7001  44.3786  40.9370   52.6403  40.5345  49.9848
6.1. Experiment Setup
6.1.1. Experiment Environment
We conducted our experiments on a machine with an Intel Core i7-4790 3.6 GHz CPU, whose computational power is 4.13 Gflops per core. For real-world traffic incident analysis problems, time requirements are an important factor. The most time-consuming process of our proposed TITAN model is the training stage, which learns the parameters for the temporal features and the orthogonal groups of the temporal features. Once training is complete, a matrix multiplication generates the prediction rapidly. In the validation and testing stages, our prediction for a single data point is generated in less than seconds.
6.1.2. Dataset and Feature Settings
We evaluate our proposed Traffic Incident Duration Prediction model using two real-world traffic data sources. 1) Traffic incident records with reported duration. We collect 43,923 records of traffic incidents in the year 2018 from three major transportation agencies in the Washington DC Metropolitan area: the Washington DC, Virginia State, and Maryland State departments of transportation. From the collected traffic incident records, we select 29,075 traffic incidents that take place on the six major arterial roads in the region: I-270, I-295, I-395, I-495, I-66, and I-95. In the selected data frame, the duration of each traffic incident is recorded in minutes, and we use the duration as the ground truth. Of the selected incidents, 80% of the records serve as the training set, while the rest serve as the testing set. 2) INRIX traffic speed data. We leverage the traffic speed readings from the traffic sensors as the training features. Given the location and verification time of the traffic incidents, we collect traffic speed readings from nearby traffic sensors.
The connectivity of the road network determines the number of tunable parameters in our TITAN model. For the arterial roads selected in our experiment, seven hyperparameters can be tuned. During the experiment, we observe that the value of the loss function is significantly larger than that of the regularizers, which means that a large penalty should be used to balance the loss function and the regularizers.
6.2. Comparison Methods
To evaluate the performance of the traffic incident duration prediction, five comparison methods are considered in our experiment: $\ell_2$-regularized linear regression (ridge regression), $\ell_1$-regularized linear regression (LASSO), support vector regression (SVR), naïve multitask learning model (nMTL), and feature refiner method (FeaFiner).
$\ell_2$-Regularized Linear Regression (Ridge) (Peeta et al., 2000). Ridge regression is an extension of linear regression: a linear regression model regularized with the $\ell_2$ norm. The parameter is a scalar that controls the model complexity; the smaller is, the more complex the model will be. In our implementation, is searched from . This model only considers the temporal features for duration prediction. Neither multitask learning for arterial road connectivity nor grouped temporal features are considered.
$\ell_1$-Regularized Linear Regression (LASSO) (Ramakrishnan et al., 2014; Tibshirani, 1996). This is a classic way to conduct cost-efficient regression by enforcing sparsity of the selected features. It has proved effective in the field of event detection (Ramakrishnan et al., 2014). It includes a parameter that trades off the regularization term; typically, the larger this parameter is, the fewer features are selected. In our experiment, is searched from . The feature configuration used by this model is the same as that of the ridge regression model.
Support Vector Regression (SVR) (Tibshirani, 1996). Support vector regression provides solutions for both linear and nonlinear problems. In our implementation, we use a nonlinear kernel function (RBF kernel) to find the optimal solution for the incident duration prediction problem. The model parameters are selected with and . This model considers the same temporal features as the ridge regression and LASSO methods; no multitask features for connectivity are considered.
Naïve Multitask Learning Model (nMTL) (Zhao et al., 2016). We implement the fundamental settings of the naïve multitask learning model for event detection. This comparison method is regularized with constraint between tasks. The training tasks of this model are split by arterial road. The correlations between tasks are constrained by norm, and within each task, the importance of the features is constrained by norm. The penalty parameter is searched from .
FeaFiner (Zhou et al., 2013b). FeaFiner is a regression model capable of learning feature clusters. This method learns an optimal sparse feature grouping for general regression problems. However, it does not support multitask learning. In our implementation, we apply this method to the complete set of traffic incidents, and the target features are the temporal features. In the parameter initialization, we select the parameter for the k-means clustering.
6.3. Evaluation Metrics
To quantify and validate model performance on traffic incident duration prediction, we adopt root mean squared error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE). These metrics are widely used in traffic duration prediction studies (Li et al., 2018; Khattak et al., 2016; Park et al., 2016; Zou et al., 2016) and reflect the predictive performance of the proposed model. Equations 19, 20, and 21 give the calculations of the selected evaluation metrics:
(19) $\mathrm{RMSE} = \sqrt{\frac{1}{n}\,\lVert \hat{\mathbf{y}} - \mathbf{y} \rVert_2^2}$
(20) $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert \hat{y}_i - y_i \rvert$
(21) $\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n} \left\lvert \frac{\hat{y}_i - y_i}{y_i} \right\rvert$
where $n$ is the total number of records; $\hat{\mathbf{y}}$ is the vector of predicted traffic incident durations; $\mathbf{y}$ is the vector of ground truth values of the corresponding records; and $\hat{y}_i$ and $y_i$ are the predicted result and the ground truth value of the $i$th record, respectively.
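The three metrics can be computed directly from their standard definitions. The sketch below is a minimal plain-Python version; reporting MAPE as a percentage is an assumption based on the magnitude of the values in the results table, and the sample durations are made up.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    n = len(y_true)
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / n)


def mae(y_true, y_pred):
    """Mean absolute error."""
    n = len(y_true)
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / n


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent; needs nonzero ground truth."""
    n = len(y_true)
    return 100.0 * sum(abs((p - t) / t) for t, p in zip(y_true, y_pred)) / n


# Made-up durations in minutes.
durations_true = [100.0, 50.0, 200.0]
durations_pred = [110.0, 40.0, 200.0]
scores = (
    rmse(durations_true, durations_pred),
    mae(durations_true, durations_pred),
    mape(durations_true, durations_pred),
)
```

RMSE and MAE are expressed in minutes, whereas MAPE is relative, so the same 10-minute error weighs twice as much on the 50-minute incident as on the 100-minute one.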
6.4. Incident Duration Prediction Analysis
6.4.1. TITAN Performance Analysis on Spatial Connectivity
Table 1 summarizes the comparison of our proposed method with the competing methods for the task of traffic incident duration prediction. The experimental results justify our use of a multitask learning framework for predicting incident duration. In general, TITAN outperforms the single-task models (Ridge, LASSO, SVR, and FeaFiner) on RMSE, MAE, and MAPE. This result shows that the spatial correlations between road segments can improve the performance of traffic incident duration prediction. TITAN also outperforms nMTL on RMSE and MAE on every road except I-270, where nMTL attains a lower RMSE. These results demonstrate that, for the traffic incident duration prediction problem, regularizers alone are insufficient; the detailed spatial connectivity between road segments should also be considered.
6.4.2. TITAN Performance Analysis on Feature Groups Learning
Among the comparison methods, FeaFiner (Zhou et al., 2013b) uses an orthogonal constraint that groups low-level features into a high-level feature representation. The original FeaFiner applies Ridge and LASSO as its original problem settings. Thus, the results presented in Table 1 can be categorized by whether the orthogonal constraint is considered. The methods that consider the orthogonal constraint are FeaFiner and TITAN; the methods that do not are Ridge, LASSO, and SVR. Comparing these two categories, we find that the methods that consider the orthogonal constraint perform better overall than the methods that do not. However, this performance gain is not as significant as the gain from the spatial connectivity constraint introduced by the multitask learning framework.
6.4.3. Performance Analysis between Training Tasks
The results in Table 1 show that the model performance for traffic incident duration prediction is not the same across different road segments. For instance, on a highway that has only one spatial connection to the rest of the road segments, the prediction performances of all the comparison methods differ only slightly, because the Euclidean distance constraint for that task shares only a limited connection with the other columns of the feature matrix. In contrast, on a highway whose subtask shares feature similarity with all other subtasks, our model outperforms the comparison methods.
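The effect of connectivity on information sharing can be illustrated with a small sketch. It assumes, for illustration only, that the spatial constraint takes the form of a sum of squared Euclidean distances between the weight vectors of connected tasks; the road names, weights, and function names below are hypothetical, and the exact formulation is the one given in the model definition.

```python
def connectivity_penalty(W, edges):
    """Sum of squared Euclidean distances between the weight vectors
    of spatially connected tasks (assumed form of the constraint).

    W: dict mapping task name -> list of feature weights (one column per task)
    edges: iterable of (task_a, task_b) pairs for connected road segments
    """
    total = 0.0
    for a, b in edges:
        total += sum((wa - wb) ** 2 for wa, wb in zip(W[a], W[b]))
    return total


def degree(edges):
    """Number of spatial connections per task."""
    counts = {}
    for a, b in edges:
        counts[a] = counts.get(a, 0) + 1
        counts[b] = counts.get(b, 0) + 1
    return counts


# Hypothetical network: "hub" connects to every other road,
# while "spur" connects only to "hub".
edges = [("hub", "spur"), ("hub", "road_x"), ("hub", "road_y"), ("road_x", "road_y")]
W = {
    "hub": [1.0, 2.0],
    "spur": [1.0, 0.0],
    "road_x": [0.0, 2.0],
    "road_y": [1.0, 2.0],
}
print(degree(edges))
print(connectivity_penalty(W, edges))
```

In this toy network the task with a single connection appears in only one term of the penalty, so it receives little shared information, whereas the fully connected task is coupled to every other column.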
6.5. Feature Groups Assignment Analysis
The orthogonal constraint ensures that the proposed model learns groups of highlighted features that play an important role in predicting traffic incident durations. We also study the learned feature groups empirically. In this experiment, we set the number of groups to 10 and apply two conditions: 1) TITAN with the orthogonal constraint and 2) TITAN without the orthogonal constraint. Figure 4 shows the learned feature group assignments for both conditions. The feature groups learned with the orthogonal constraint overlap less than those learned without it.
When experimenting without the orthogonal constraint, we found that the model tends to group the low-level features into the same feature assignment for every group. Figure 4(a) shows this single feature group assignment for the model without the orthogonal constraint. From Figure 4(a), we can see that, without the orthogonal constraint, temporal features with higher indexes are assigned higher weights (>300). This result is reasonable: it indicates that the duration of a traffic incident is better inferred from the most recent traffic speed readings.
For comparison, Figure 4(b) shows the learned feature group assignments of the model with the orthogonal constraint for several subtasks. The most heavily weighted feature group can be found by checking the learned weights. For example, Figure 4(b) shows the top-weighted group for three subtasks. From Figure 4(b), we find that the top assigned feature groups for different arterial roads differ slightly from each other. This result shows that the most critical temporal features for predicting traffic incident duration vary from road to road, an observation that is valuable for transportation operators and first responders. In Figure 4(b), we can also observe that the high-level features of one subtask are shifted by 10 minutes relative to those of another. This 10-minute shift indicates that, to predict the duration of an incident on the corresponding road, the traffic speed readings from 10 minutes earlier carry higher importance.
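One simple way to quantify how much learned feature groups overlap is to sum the absolute inner products between distinct group columns, which is zero when the columns are mutually orthogonal. The sketch below is an illustrative assumption rather than the measure used to produce Figure 4; the function name and the toy assignment matrices are hypothetical.

```python
def group_overlap(G):
    """Sum of absolute inner products between distinct group columns.

    G: list of rows, where G[f][k] is the weight of low-level feature f
       in high-level group k.
    Returns 0.0 when the group columns are mutually orthogonal.
    """
    n_groups = len(G[0])
    overlap = 0.0
    for a in range(n_groups):
        for b in range(a + 1, n_groups):
            overlap += abs(sum(row[a] * row[b] for row in G))
    return overlap


# Hypothetical assignments of 4 temporal features to 2 groups.
disjoint = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
overlapping = [[1.0, 1.0], [1.0, 0.5], [0.0, 1.0], [0.0, 1.0]]
print(group_overlap(disjoint))
print(group_overlap(overlapping))
```

A lower value means that each low-level feature contributes to fewer groups at once, which is the behavior the orthogonal constraint encourages.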
6.6. Case Studies
During the experiments, the proposed approach revealed several interesting findings. Here we discuss the identification of the critical phases of traffic incidents and the influence of the connectivity between arterial roads.
6.6.1. Critical Phases Identification for Traffic Incidents
From the experimental results relating the number of groups to the performance of the TITAN model, we identify the optimal number of groups for the temporal features. Physically, the number of groups in this experiment corresponds to the number of phases identified for the traffic incidents. As mentioned in Section 1, the life cycle of a traffic incident is conventionally divided into five phases: detection, verification, response, clearance, and recovery. Although this grouping is effective from the perspective of transportation management and operations, it does not provide a useful temporal feature grouping for predicting traffic incident durations. This experiment studies how the performance of the TITAN model changes with the number of feature groups. Figure 5 illustrates the RMSE, MAE, and MAPE obtained by varying the number of groups from 1 to 50; the color-coded lines represent the different arterial roads in the experiment. From Figure 5, we learn that, for most of the arterial roads, the TITAN model reaches its best performance when the number of groups is in the range of 18 to 20 or 40 to 43. This result indicates that the conventional five-phase definition of the traffic incident life cycle may not provide informative input to the traffic incident duration prediction problem.
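The selection step of such a sweep can be sketched as follows, assuming the error for each candidate number of groups has already been measured. The helper name and the error values are made up for illustration and are not results from this experiment.

```python
def best_group_counts(errors_by_k, top=3):
    """Return the `top` group counts with the lowest error, best first.

    errors_by_k: dict mapping number of groups -> validation error
    """
    return sorted(errors_by_k, key=errors_by_k.get)[:top]


# Toy error values for four candidate group counts (illustrative only).
errors = {1: 9.0, 2: 7.5, 3: 8.1, 4: 7.9}
print(best_group_counts(errors, top=2))
```

Running the same selection separately for each arterial road and each metric would show whether the preferred ranges agree across tasks.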
6.6.2. Influences of Arterial Road Connectivity
The performance differences between the arterial roads can be observed in Figure 5. Overall, the prediction performance on Interstate 495 is better than on the other arterial roads, and Interstate 270 has the worst duration prediction results. This comparison reveals that the connectivity between arterial roads plays an important role in predicting traffic incident duration, because more connections with other arterial roads mean that more information is shared with other training tasks during the training stage. Interstate 495 intersects with all other arterial roads, while Interstate 270 intersects only with Interstate 495.
7. Conclusion
This paper presents TITAN, a novel model for traffic incident duration prediction and feature learning. The proposed model combines a multitask learning framework for prediction with a sparse feature learning framework for identifying higher-level feature groups. TITAN outperforms the existing traffic incident duration prediction models because of two major advantages in its design: 1) it considers the connectivity between road segments within the urban road network; and 2) the learned higher-level features provide a better predictive pattern for traffic incident duration prediction. Extensive experiments on real-world datasets, with comparisons against the baseline methods, justify the performance of the TITAN model. By applying the orthogonal constraint, the proposed model can identify groups of higher-level features, which can be further interpreted as the critical evolution stages of a traffic incident. This capability helps transportation operators and first responders judge the influence of traffic incidents.
References
Road congestion and incident duration. Economics of Transportation 2 (4), pp. 109–118.
A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research 6 (Nov), pp. 1817–1853.
Multitask feature learning. In Advances in Neural Information Processing Systems, pp. 41–48.
Convex multitask feature learning. Machine Learning 73 (3), pp. 243–272.
Twitter mood predicts the stock market. Journal of Computational Science 2 (1), pp. 1–8.
Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning 3 (1), pp. 1–122.
A naive Bayesian classifier for incident duration prediction. In 86th Annual Meeting of the Transportation Research Board, Washington, DC.
Regularized multi-task learning. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 109–117.
Predicting crime using Twitter and kernel density estimation. Decision Support Systems 61, pp. 115–125.
Multitask learning for transit service disruption detection. In 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pp. 634–641.
Modeling traffic incident duration using quantile regression. Transportation Research Record 2554 (1), pp. 139–148.
Development of a hybrid prediction model for freeway incident duration: a case study in Maryland. International Journal of Intelligent Transportation Systems Research 10 (1), pp. 22–33.
Trafficcascade: mining and visualizing lifecycles of traffic congestion events using public bus trajectories. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 1955–1958.
A computerized feature selection method using genetic algorithms to forecast freeway accident duration times. Computer-Aided Civil and Infrastructure Engineering 25 (2), pp. 132–148.
Overview of traffic incident duration analysis and prediction. European Transport Research Review 10 (2), pp. 22.
Integration of a discrete choice model and a rule-based system for estimation of incident duration: a case study in Maryland. In CD-ROM of Proceedings of the 83rd TRB Annual Meeting, Washington, DC.
An exploratory hazard-based analysis of highway incident duration. Transportation Research Part A: Policy and Practice 34 (2), pp. 85–102.
From tweets to polls: linking text sentiment to public opinion time series. In Fourth International AAAI Conference on Weblogs and Social Media.
Traffic incident management handbook. Technical report.
Incident management in intelligent transportation systems.
Proximal algorithms. Foundations and Trends® in Optimization 1 (3), pp. 127–239.
Interpretation of Bayesian neural networks for predicting the duration of detected incidents. Journal of Intelligent Transportation Systems 20 (4), pp. 385–400.
Providing real-time traffic advisory and route guidance to manage Borman incidents online using the Hoosier Helper program.
'Beating the news' with EMBERS: forecasting civil unrest using open source indicators. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1799–1808.
Earthquake shakes Twitter users: real-time event detection by social sensors. In Proceedings of the 19th International Conference on World Wide Web, pp. 851–860.
Robustness of interrelated traffic networks to cascading failures. Scientific Reports 4, pp. 5413.
Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 (1), pp. 91–108.
Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58 (1), pp. 267–288.
A comparative study of models for the incident duration prediction. European Transport Research Review 2 (2), pp. 103–111.
Fuzzy-entropy neural network freeway incident duration modeling with single and competing uncertainties. Computer-Aided Civil and Infrastructure Engineering 28 (6), pp. 420–433.
A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on Imaging Sciences 6 (3), pp. 1758–1789.
Prediction of lane clearance time of freeway incidents using the M5P tree algorithm. IEEE Transactions on Intelligent Transportation Systems 12 (4), pp. 1549–1557.
Spatiotemporal event forecasting from incomplete hyperlocal price data. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 507–516.
Simnest: social media nested epidemic simulation via online semi-supervised deep learning. In 2015 IEEE International Conference on Data Mining, pp. 639–648.
Multitask learning for spatiotemporal event forecasting. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1503–1512.
Distant-supervision of heterogeneous multitask learning for social event forecasting with multilingual indicators. In Thirty-Second AAAI Conference on Artificial Intelligence.
Hierarchical incomplete multi-source feature learning for spatiotemporal event forecasting. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2085–2094.
Malsar: multitask learning via structural regularization. Arizona State University 21.
Modeling disease progression via multitask learning. NeuroImage 78, pp. 233–248.
Feafiner: biomarker identification from medical data through feature generalization and selection. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1034–1042.
Application of finite mixture models for analysing freeway incident clearance time. Transportmetrica A: Transport Science 12 (2), pp. 99–115.