1 Introduction
Prediction of traffic crashes has been a major research topic in transportation safety studies. Crashes, especially on urban expressways, can trigger heavy traffic congestion, impose huge external costs, and reduce the level of service of transportation infrastructure. Therefore, the accurate and reliable prediction of crash risks is critical to the success of proactive safety management strategies on urban expressways.
There have been fruitful studies in the domain of real-time crash likelihood estimation (Abdel-Aty and Pemmanaboina, 2006; Abdel-Aty et al., 2007, 2008; Ahmed and Abdel-Aty, 2012). It has been reported that crash occurrence is affected by four major factors: the real-time traffic state, drivers' behavior, environmental factors, and road geometry (Ahmed and Abdel-Aty, 2013b). Traditional devices for detecting real-time traffic states are mainly intrusive, e.g., loop detectors. Recently, more non-intrusive traffic detection devices have come into use owing to their ease of installation and maintenance, accuracy, and affordable cost. For example, Remote Traffic Microwave Sensors (RTMS) and Automatic Vehicle Identification (AVI) devices provide access to real-time traffic data from multiple sources. In field applications, RTMS simultaneously provide real-time data on flow, time occupancy, and speed.
Although RTMS and other detectors (e.g., AVI devices and loop detectors) have been widely and successfully applied in traffic operations, including real-time crash likelihood estimation, the problem of missing information has troubled researchers and traffic operators for years (Turner et al., 2000; Chen et al., 2002, 2003; Smith et al., 2003; Rajagopal and Varaiya, 2007). Dynamic traffic flow data in intelligent transportation systems unavoidably suffer from missing values, mainly due to detector failures and lossy communication systems (Asif et al., 2016). According to Ahmed and Abdel-Aty (2012), loop detectors have failure rates ranging between 24% and 29%. It is reported that 5% of traffic data are lost at hundreds of detection points within the PeMS traffic flow database (Li et al., 2013a). In some extreme cases, the missing percentage can reach 90%, which has become a critical issue for traffic management (Tan et al., 2013). On the other hand, dynamic traffic flow data serve as one of the most important components of the features in real-time crash likelihood estimation.
Therefore, the issue of missing data is not negligible, since it may greatly affect the predictive performance of the models. However, in the crash likelihood estimation literature, most existing methods and algorithms are developed under the assumption of complete data (samples with missing data are simply deleted). Although there have been studies on missing data imputation for traffic flow measurements (Li et al., 2014b, 2013a, 2013b), the patterns and characteristics of traffic crash data differ from those of traffic flow data. One main distinction is that traffic crash data do not exhibit periodicity and tendency, which are important properties of traffic flow data. Few studies have sought suitable approaches for imputing missing crash data. Furthermore, the behavior, robustness, and properties of the predictive models under high missing ratios have not been fully understood in real-time crash likelihood estimation.
Another problem, which has attracted little attention in real-time crash likelihood prediction, is the imbalance of the field measurements. Crash prediction is a typical imbalanced classification problem because the number of crash samples is much smaller than the number of non-crash samples. An imbalanced dataset may produce classification results biased towards the majority class, for two reasons: (I) the classifiers regard the costs of misclassifying positive and negative samples as the same; (II) the objective of the algorithm tends to reduce the total error, on which the minority class has little influence. In this paper, two kinds of solutions are employed to address the imbalance: cost-sensitive learning at the algorithmic level and the synthetic minority oversampling technique (SMOTE) at the data level.
To bridge these two research gaps, this paper incorporates two important components into real-time crash likelihood estimation: missing data imputation and solutions to the imbalance issue. Firstly, various imputation approaches are examined and compared for crash data, whose patterns and distributions differ from those of traditional traffic flow data. Understanding the characteristics of the missing patterns in crash data and finding the most suitable imputation approach are important for the field application of real-time crash likelihood estimation. Secondly, different classifiers' tolerance and robustness to missing data are studied, especially when the missing ratio is high; classifiers with low tolerance to missing data should be avoided under such scenarios. Thirdly, two solutions to the imbalance issue are adopted and compared in real-time crash likelihood estimation.
The rest of the paper is organized as follows: Section 2 reviews related studies. Section 3 presents the methodologies for real-time crash likelihood estimation, missing data imputation, and solutions to imbalanced data. Section 4 demonstrates the numerical test on an urban expressway in Hangzhou, China, and presents the results of sensitivity analyses. Finally, Section 5 concludes the paper, summarizes its findings, and outlines future research.
2 Literature Review
Since the beginning of the last decade, researchers have employed a broad range of machine learning algorithms in real-time crash likelihood estimation, mainly utilizing factors extracted from traffic dynamics such as flow, occupancy, and speed. In the early studies in this domain, empirical models based on crash precursors (Lee et al., 2003) and non-parametric Bayesian models (Oh et al., 2005) were developed to predict potential crash occurrence. Abdel-Aty and Pemmanaboina (2006) designed a matched case-control logit model, which integrated traffic flow data and weather data into real-time crash likelihood estimation. The support vector machine (SVM), a classical machine learning algorithm, has been widely used to predict crash occurrence due to its robustness on small datasets (Yu and Abdel-Aty, 2013b; Sun et al., 2014). Tree-based ensemble methods, such as stochastic gradient boosting (SGB) and AdaBoost, have also demonstrated a strong ability to enhance the reliability of real-time risk assessment (Ahmed and Abdel-Aty, 2013a). Motivated by the fact that a large number of explanatory variables might induce overfitting, random forests have been widely used for selecting important factors (Saha et al., 2015). Kwak and Kho (2016) used conditional logistic regression to remove confounding factors, and developed separate models to predict crash occurrence with the genetic programming technique. Recently, more and more researchers have employed Bayesian approaches, such as hybrid latent class analysis (LCA), dynamic Bayesian networks (DBN), and semi-parametric models, to further reveal the unobserved heterogeneity among crashes (Sun and Sun, 2015; Yu et al., 2016; Yu and Abdel-Aty, 2013a, 2014; Xu et al., 2014). Roshandel et al. (2015) compared the accumulated approaches in this area and summarized the current knowledge, challenges, and opportunities in assessing the impact of traffic factors on crash occurrence.
Apart from developing state-of-the-art approaches for real-time crash risk assessment, researchers have made various attempts to seek more relevant explanatory variables. El-Basyouny et al. (2014) investigated the impact of sudden extreme snow or rain variation on the crash type, using full Bayesian multivariate Poisson log-normal models. Yu et al. (2014) proposed a hierarchical logistic regression model to predict crash likelihood with multi-source information, including traffic, weather, and roadway geometric factors.
On the other hand, the differences among crash types have also attracted attention from many researchers. Sun et al. (2016) built separate models for non-congested-flow crashes and congested-flow crashes and compared their safety factors. Although most of the literature defines the dependent variable as binary (crash occurrence or not), some researchers have attempted to predict crash likelihood at various levels of severity (Xu et al., 2013). Some studies concentrated on assessing the risk of rear-end crashes, which are regarded as the most severe crash type (Weng and Meng, 2014; Lao et al., 2014; Chen et al., 2015; Fildes et al., 2015; Li et al., 2014c). In addition to primary crashes, secondary crashes have also been studied; researchers have developed models based on speed contour plots to predict the probability of secondary crashes (Xu et al., 2016; Park and Haghani, 2016).
Although fruitful approaches have been proposed to improve predictive performance, few studies have focused on the issue of missing data in real-time crash likelihood estimation. However, approaches for missing data imputation used in related areas, such as missing traffic volume estimation (Tang et al., 2015) and missing data imputation in road networks (Asif et al., 2013), may provide insight for missing data imputation in the traffic safety area. Traditional imputation approaches in the transportation area include historical mean/median imputation, k-means clustering imputation, etc. (Conklin and Scherer, 2003; Deb and Liew, 2016).
Recently, tensor decomposition has also been developed to impute missing data in various areas (Wu et al., 2017; Tan et al., 2013).
The pattern of missing data refers to the distribution of missing values in the whole dataset. Little and Rubin (2014) classified missing patterns into three categories: missing completely at random (MCAR), missing at random (MAR), and not missing at random (NMAR).

MCAR means that the missingness has no relationship with the values of any variables, observed or missing; it does not depend on any specific variable.

MAR indicates that the missingness is unrelated to the missing values themselves, but may depend on the observed data of other attributes.

NMAR refers to cases in which the missingness depends on the unobserved values themselves and thus follows a certain pattern. This is the most difficult mechanism to deal with, because the missing pattern must be treated case by case.
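As an illustration of the MCAR mechanism, the sketch below (an illustrative assumption, not part of the original study) masks entries of a complete data table completely at random, which is the standard way to simulate MCAR missingness when benchmarking imputation methods at a chosen missing ratio:

```python
import numpy as np

rng = np.random.default_rng(42)

def apply_mcar(X, missing_ratio):
    """Mask entries completely at random (MCAR): every cell is
    missing with the same probability, independent of all values."""
    X = X.astype(float).copy()
    mask = rng.random(X.shape) < missing_ratio
    X[mask] = np.nan
    return X

data = rng.normal(size=(1000, 24))   # e.g., 24 traffic features per sample
masked = apply_mcar(data, 0.3)
print(np.isnan(masked).mean())       # close to 0.3
```

Simulating MAR or NMAR would instead require the mask to depend on observed values or on the masked values themselves, respectively.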
The MCAR and MAR problems can be addressed by some universal algorithms, while the NMAR problem admits almost countless possibilities in the distribution of missing values. Various algorithms that assume the missing pattern is MCAR or MAR have been proposed for missing data imputation. These algorithms can mainly be divided into three categories (Li et al., 2004):

The first group discards the samples with missing data, which is acceptable when the missing ratio is low and data resources are abundant.

The second group is interpolation, which refers to filling in missing values from the existing information under certain rules, such as mean or median imputation, nearest-neighbor imputation, and k-means imputation.
The third group is EM (expectation-maximization) based parameter estimation, which estimates the parameters of the data distribution from the existing data and then imputes the missing data based on the estimated distribution.
Among the methods in the third group, principal component analysis (PCA) based methods are considered an efficient and reliable category of missing data imputation methods, which combine PCA with EM estimation (Ilin and Raiko, 2010). The probabilistic principal component analysis (PPCA) based imputation method was shown to outperform conventional methods (e.g., nearest/mean historical imputation and local interpolation/regression) on a traffic flow volume dataset (Qu et al., 2009), mainly for two reasons: (I) the traffic flow volume data followed a Gaussian distribution, in accordance with the hypothesis of PCA-based missing data imputation; (II) PPCA succeeded in combining and utilizing global as well as local information.
Dear (1959) was the first to develop a PCA-based formulation to impute missing data. Grung and Manne (1998) further employed a least-square approach to solve the problem, and the algorithm was thus named least-square PCA (LSPCA). However, LSPCA faces an overfitting issue, especially when the missing ratio is high. To solve this problem, Tipping and Bishop (1999) proposed the PPCA algorithm, which, by assuming that PCA follows a probabilistic form, effectively adds a generalization term to the objective function to avoid overfitting. Furthermore, variational Bayesian PCA (VBPCA) was proposed to address the sensitive dependence on initial parameter values frequently observed in PPCA (Bishop, 1999). Ilin and Raiko (2010) summarized these three algorithms and compared their imputation performance in artificial experiments. In the transportation domain, Qu et al. (2009) and Li et al. (2013b) developed a broad range of variants of PCA-based missing data imputation, including PPCA and kernel probabilistic principal component analysis (KPPCA), which demonstrated outstanding performance in resolving missing-value issues in traffic volume estimation.
PCA-based approaches have at least three merits in the domain of missing traffic data imputation. Firstly, they do not require strict assumptions such as daily similarity, no continuous stretches of incomplete data points, or a large database. Secondly, the principal components remove relatively trivial details and ensure that only the major information is used for constructing the probabilistic distribution of the latent variables. Thirdly, they simultaneously achieve high imputation accuracy, acceptable speed, and robustness to abnormal data points in a broad range of missing traffic data imputation problems.
Although cutting-edge PCA-based imputation algorithms have been utilized to impute missing traffic volume data, there exist differences between traffic flow data and the dataset used in real-time crash likelihood estimation. Traffic flow data have continuous time-series properties, such as periodicity and tendency. In contrast, the dataset for real-time crash likelihood estimation can be viewed as a table with rows of samples and columns of features. The values of the features in different rows are not in sequential order and do not exhibit periodicity or tendency, which indicates that imputing missing data in real-time crash likelihood estimation is more difficult than in traffic flow data analysis. Most imputation approaches, such as historical mean/median imputation, k-means clustering methods, and interpolation, rely on the assumptions of periodicity and tendency, which no longer hold for the crash data table. PCA-based imputation approaches, however, do not make strict assumptions on the periodicity and continuity of the data; they first extract the major, able-to-model information and discard the trivial, unable-to-model details via principal components, and then use the obtained probabilistic distributions to impute the missing values via maximum likelihood estimation. Thus, PCA-based imputation approaches are expected to perform better in imputing missing crash data. One of the main goals of this paper is to verify this assumption and examine the ability of PCA-based algorithms to impute missing data in real-time crash likelihood estimation.
Classification on imbalanced datasets is another common issue in real-time crash likelihood estimation that has drawn little attention. In a binary classification problem, a dataset is said to be imbalanced when the number of samples in one class is much higher than in the other (Seiffert et al., 2010; Guo et al., 2008). The class with more samples is called the major class, while the one with relatively fewer samples is denoted the minor class. The imbalance issue is commonly observed in a broad range of classification problems (Longadge and Dongre, 2013), e.g., detection of rare diseases in medical diagnosis, fraud detection in banking systems, and detection of failures in technical devices.
Real-time crash likelihood prediction is a typical imbalanced classification problem, since the number of crash cases is usually much smaller than that of non-crash cases. This issue has attracted researchers' attention in recent years. Theofilatos et al. (2016) considered accidents as rare events and developed a series of rare-event logit models to predict real-time accidents. Basso et al. (2018) proposed an accident prediction model combining SMOTE and SVM, which was then validated with the original imbalanced data instead of artificially balanced data. To mitigate the imbalance issue, Yuan et al. (2017) proposed an informative sampling approach that selected diverse negative (non-crash) samples, with some close to and some far from the positive (crash) samples. In this paper, we use the matched case-control method (Ahmed and Abdel-Aty, 2013a) to select the negative samples. Previous research commonly selected 4:1 as the ratio of non-crash cases to crash cases (Ahmed and Abdel-Aty, 2012, 2013b). To better illustrate the performance of various solutions to the imbalance issue, we use a larger ratio of 10:1 for demonstration purposes, meaning that each crash case is matched with 10 non-crash cases.
In an imbalanced classification problem, most classifiers tend to classify all samples into the major class, which sacrifices the accuracy of predicting samples from the minor class. This is due to the inherent nature of their objective functions, which minimize the sum of errors by assigning the same weight to both the major and minor classes, so that samples in the minor class make little contribution (Sun et al., 2007). There are two main solutions to this problem: (I) cost-sensitive learning techniques, which stimulate the classifiers to pay more attention to the minor class by assigning different weights to the samples of the two classes in the objective function (Pazzani et al., 1994); and (II) resampling the dataset, including oversampling and undersampling (Chawla et al., 2004; Estabrooks, 2000; Kubat et al., 1997). SMOTE, viewed as one of the most efficient resampling algorithms, oversamples the minor class by generating synthetic samples instead of simple replications (Chawla et al., 2002).
3 Methodology
As mentioned above, the main objectives of this paper are to seek out the most suitable imputation approaches and solutions to the imbalance issue in the domain of real-time crash risk estimation, and to examine different classifiers' tolerance to missing data. In this section, we first revisit the problem of real-time crash likelihood estimation and the related classical classifiers (Problem 1), then present the two solutions to the imbalance issue (Problem 2), and finally present the PCA-based missing data imputation approaches (Problem 3).
3.1 Real-time crash likelihood estimation
Most studies in the domain of real-time crash likelihood estimation define the problem as a binary classification problem.
Problem 1 (Real-time crash likelihood estimation): Given a training dataset of $N$ samples $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, where $\mathbf{x}_i$ denotes the vector of explanatory variables of the $i$-th sample and $y_i \in \{-1, +1\}$, the objective is to find a function $f(\mathbf{x})$ that distinguishes positive samples ($y_i = +1$) from negative samples ($y_i = -1$). In this paper, positive labels refer to crash cases, while negative labels refer to non-crash cases.
3.1.1 Classical classifiers
A broad range of machine learning algorithms has been used to detect crash-prone cases. However, the main focus of this paper is to incorporate missing data imputation and solutions to the imbalance issue into real-time crash likelihood estimation, rather than to develop new classifiers. Therefore, we only consider two classical classifiers: SVM (with linear, Gaussian, and polynomial kernels) and the AdaBoost ensemble algorithm.
Firstly, SVM constructs a hyperplane or a set of hyperplanes in a high-dimensional space to separate data samples into two classes. Intuitively, a good hyperplane maximizes the distance to the nearest data points of the two classes; the formulation also extends to training datasets that cannot be linearly separated by introducing slack variables. Starting from this idea, the general objective function of SVM can be written as Eq. (1), where the first part is an L2-norm regularization term and the second part takes the form of a hinge loss:

$$\min_{\mathbf{w},\, b} \; \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{N} \max\left(0,\; 1 - y_i\left(\mathbf{w}^\top \mathbf{x}_i + b\right)\right) \tag{1}$$
In order to apply SVM to a nonlinearly separable classification problem, the kernel trick can be utilized. In this paper, three categories of kernels are used:

Linear kernel: $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^\top \mathbf{x}_j$;

Gaussian kernel: $K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2\right)$;

Polynomial kernel: $K(\mathbf{x}_i, \mathbf{x}_j) = \left(\mathbf{x}_i^\top \mathbf{x}_j + c\right)^d$, where the degree $d$ is fixed in this paper.
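The three kernel functions can be sketched directly in NumPy (an illustrative sketch; the function names and parameter values below are assumptions, not taken from the paper):

```python
import numpy as np

def linear_kernel(xi, xj):
    # K(xi, xj) = xi . xj
    return float(xi @ xj)

def gaussian_kernel(xi, xj, gamma=0.5):
    # K(xi, xj) = exp(-gamma * ||xi - xj||^2)
    return float(np.exp(-gamma * np.sum((xi - xj) ** 2)))

def polynomial_kernel(xi, xj, c=1.0, d=2):
    # K(xi, xj) = (xi . xj + c)^d
    return float((xi @ xj + c) ** d)

xi = np.array([1.0, 2.0])
xj = np.array([3.0, 0.0])
print(linear_kernel(xi, xj))        # 3.0
print(polynomial_kernel(xi, xj))    # (3 + 1)^2 = 16.0
```

Each kernel implicitly maps the 24-dimensional traffic feature vectors into a richer feature space in which the SVM hyperplane is fitted.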
Secondly, boosting is an ensemble machine learning technique that combines a series of weak learners, each of which achieves a classification accuracy slightly above 0.5 on the binary classification task, to improve prediction performance and robustness. AdaBoost, short for "Adaptive Boosting", is a typical boosting method which adapts subsequent weak learners to place more emphasis on the samples misclassified by prior weak learners. AdaBoost generates $T$ weak learners $f_t(\mathbf{x})$ and assembles them into a strong learner by weighting each weak learner with a coefficient $\alpha_t$, see Eq. (2):

$$F(\mathbf{x}) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t f_t(\mathbf{x})\right) \tag{2}$$
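The final AdaBoost prediction is simply a weighted vote of the weak learners. The following minimal sketch assumes three hypothetical pre-trained weak learners (single-feature threshold rules) with illustrative weights; in a real run these would come from the boosting rounds:

```python
# hypothetical weak learners: threshold rules on single features
weak_learners = [
    lambda x: 1 if x[0] > 0.5 else -1,
    lambda x: 1 if x[1] > 0.3 else -1,
    lambda x: 1 if x[0] + x[1] > 1.0 else -1,
]
alphas = [0.8, 0.4, 0.6]   # illustrative coefficients from boosting rounds

def adaboost_predict(x):
    # F(x) = sign(sum_t alpha_t * f_t(x))
    score = sum(a * f(x) for a, f in zip(alphas, weak_learners))
    return 1 if score > 0 else -1

print(adaboost_predict([0.9, 0.6]))   # all learners vote +1 -> 1
print(adaboost_predict([0.1, 0.1]))   # all learners vote -1 -> -1
```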
3.1.2 Measures of Effectiveness (MoEs)
Accuracy is the most basic MoE in classification problems, but it may give a biased illusion on imbalanced data. When the ratio of negative to positive samples is large, e.g., 99:1, a classifier may predict all samples as negative and still achieve 99% accuracy. However, such a result is meaningless when the objective is to detect the minority cases, as in crash likelihood prediction. To overcome this bias, the confusion matrix is utilized, where samples are categorized into four conditions, i.e., true negative (TN), false negative (FN), false positive (FP), and true positive (TP), as shown in Table 1.

Table 1. Confusion matrix.
                                  True label
                                  0 (non-crash)    1 (crash)
Predicted label  0 (non-crash)    TN               FN
                 1 (crash)        FP               TP
Based on the confusion matrix, the true positive rate (TPR) and false positive rate (FPR) can be calculated as

$$\mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} \tag{3}$$

$$\mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}} \tag{4}$$
Most classifiers provide a probabilistic degree or prediction score for each sample; the higher the score, the greater the confidence that the sample can be classified as positive. By varying the score threshold, a group of confusion matrices and the corresponding TPR and FPR values can be calculated. A receiver operating characteristic (ROC) curve plots FPR on the X-axis against TPR on the Y-axis. The ROC curve considers the accuracies of classifying both positive and negative samples, and thus provides a fairer MoE for imbalanced data. To quantify the ROC curve, the area under the ROC curve (AUC) is used: a higher AUC value implies a stronger classifier.
In addition to the AUC value, sensitivity and specificity are also utilized in this paper. Sensitivity is the same as TPR, while specificity equals (1 − FPR). High sensitivity indicates that the classifier successfully predicts the majority of crash occurrences, while high specificity implies that the classifier produces fewer irrelevant alarms. In an imbalanced dataset, improving sensitivity usually means sacrificing specificity to some extent; therefore, a trade-off must be made between these two MoEs.
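Given the four confusion-matrix counts, the MoEs above reduce to a few lines of code (a generic sketch with made-up counts for a 10:1 imbalanced test set):

```python
def moes(tp, fn, fp, tn):
    """Compute sensitivity (TPR), FPR, and specificity from
    confusion-matrix counts, following Eqs. (3) and (4)."""
    tpr = tp / (tp + fn)              # sensitivity, Eq. (3)
    fpr = fp / (fp + tn)              # Eq. (4)
    specificity = 1.0 - fpr
    return tpr, fpr, specificity

# made-up counts: 40 crashes and 400 non-crashes in the test set
sens, fpr, spec = moes(tp=30, fn=10, fp=50, tn=350)
print(sens, fpr, spec)   # 0.75 0.125 0.875
```

Note that overall accuracy here would be (30 + 350) / 440 ≈ 0.86 even though a quarter of the crashes are missed, which is exactly the bias the sensitivity/specificity pair exposes.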
3.2 Solutions to Imbalanced Data Classification
Problem 2 (Solutions to the imbalance issue): Given an imbalanced dataset with far more samples from the majority class than from the minority class, how can acceptable sensitivity, specificity, and AUC be achieved simultaneously in the classification?
3.2.1 Solution I—Algorithmic-Level Solutions
On the algorithmic level, cost-sensitive learning is the most widely used method to take the misclassification costs of different classes into consideration; it assigns different costs to misclassifying different classes. The objective functions of most classifiers can be written as the sum of two parts: the sum of empirical errors and a regularization term, e.g., Eqs. (1–2). Although a trade-off coefficient is employed to balance the two terms, it is set the same for all samples without distinguishing positive and negative labels, which may produce results biased towards the majority class. The cost-sensitive learning technique (COST) adjusts this bias by allocating a larger coefficient to the minority class, so that misclassifying a sample from the minority class (a crash case in this paper) receives a higher penalty. For instance, the objective function of SVM in Eq. (1) can be rewritten as follows:
$$\min_{\mathbf{w},\, b} \; \frac{1}{2}\|\mathbf{w}\|^2 + C^{+} \sum_{i \in S^{+}} \max\left(0,\; 1 - y_i\left(\mathbf{w}^\top \mathbf{x}_i + b\right)\right) + C^{-} \sum_{i \in S^{-}} \max\left(0,\; 1 - y_i\left(\mathbf{w}^\top \mathbf{x}_i + b\right)\right) \tag{5}$$

where $S^{+}$ and $S^{-}$ are the sets of positive and negative training samples, respectively, and $C^{+}$ and $C^{-}$ are the corresponding cost coefficients. This transformation can easily be migrated to other classifiers such as LR, decision trees, etc. In the following section, a sensitivity analysis examines the MoEs of the classifiers under different ratios of $C^{+}$ to $C^{-}$.
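In code, the class-weighted objective amounts to scaling each sample's hinge loss by a class-dependent cost. Below is a minimal sketch of such an objective (illustrative only; the names `c_pos` and `c_neg` are assumptions standing in for the two cost coefficients):

```python
import numpy as np

def weighted_hinge_objective(w, b, X, y, c_pos, c_neg):
    """L2 regularization plus class-weighted hinge losses:
    misclassifying a positive (crash) sample costs c_pos,
    a negative (non-crash) sample costs c_neg."""
    margins = y * (X @ w + b)
    hinge = np.maximum(0.0, 1.0 - margins)
    costs = np.where(y > 0, c_pos, c_neg)
    return 0.5 * float(w @ w) + float(np.sum(costs * hinge))

w = np.array([1.0, 0.0]); b = 0.0
X = np.array([[2.0, 0.0], [-2.0, 0.0], [0.0, 0.0]])
y = np.array([1.0, -1.0, 1.0])
# third sample sits on the boundary: hinge = 1, weighted by c_pos
print(weighted_hinge_objective(w, b, X, y, c_pos=10.0, c_neg=1.0))  # 0.5 + 10.0 = 10.5
```

Raising the positive-class cost makes errors on the rare crash class dominate the objective, which is the mechanism behind the COST approach above.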
3.2.2 Solution II—Data-Level Solutions
To avoid the poor performance of classifiers on imbalanced data, resampling at the data level is another group of approaches, which can be divided into two types: oversampling and undersampling. Simple oversampling may lead to overfitting, while undersampling discards a large amount of potentially meaningful information from an already small dataset. To overcome these drawbacks, an advanced oversampling algorithm, the synthetic minority oversampling technique (SMOTE), was proposed by Chawla et al. (2002). SMOTE creates "synthetic" samples of the minority class instead of simply duplicating existing samples.
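The core of SMOTE can be sketched in a few lines: pick a minority sample, pick one of its k nearest minority neighbors, and create a synthetic point at a random position on the segment between them (an illustrative sketch of the interpolation step only, not the full SMOTE algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_sample(minority, k=3):
    """Generate one synthetic minority sample by interpolating
    between a random minority sample and one of its k nearest
    minority neighbors."""
    i = rng.integers(len(minority))
    x = minority[i]
    dist = np.sum((minority - x) ** 2, axis=1)
    neighbors = np.argsort(dist)[1:k + 1]    # skip the sample itself
    x_nn = minority[rng.choice(neighbors)]
    return x + rng.random() * (x_nn - x)     # random point on the segment

crashes = rng.normal(size=(20, 24))          # hypothetical crash samples
synthetic = np.array([smote_sample(crashes) for _ in range(10)])
print(synthetic.shape)                       # (10, 24)
```

Because each synthetic point lies between two real crash samples, it stays inside the minority region of the feature space rather than duplicating any single observation.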
3.3 PCA-based missing data imputation
Problem 3 (Imputing a missing crash data table): Given an incomplete crash table with rows of samples and columns of features, how can the missing values be imputed based on the observed values?
The PCA-based missing data imputation algorithms formulate the relationship between the original variables and latent variables in a PCA-based form, and then solve the problem with EM iterations. PCA is a machine learning technique that compresses high-dimensional data into low-dimensional data with minimal loss of variance; the compressed low-dimensional data can also be reconstructed into the original data. This property can be utilized in missing data imputation: first, estimate the probability distribution of the compressed information (the latent variables) from the observed data, and then reconstruct the missing data from the compressed information. PCA-based missing data imputation consists of three main algorithms, namely LSPCA, PPCA, and VBPCA, which make different assumptions on the relationships between the original variables and the latent variables.
Suppose that we have $N$ samples of original $d$-dimensional vectors $\mathbf{y}_n$, each of which can be formulated as a function of a $c$-dimensional latent variable:

$$\mathbf{y}_n = \mathbf{W}\mathbf{x}_n + \mathbf{m} \tag{6}$$

where $\mathbf{W}$ is a $d \times c$ matrix, $\mathbf{x}_n$ is a vector of principal components (i.e., latent variables), and $\mathbf{m}$ is a bias term.
3.3.1 Imputation Algorithm I—Least-Square PCA (LSPCA)
A straightforward method to determine the latent variables is to minimize the mean-square error between the reconstructions $\hat{y}_{ij}$ obtained from the latent variables and the original observations $y_{ij}$:

$$\min_{\mathbf{W},\, \mathbf{X},\, \mathbf{m}} \; \sum_{(i,j) \in O} \left(y_{ij} - \hat{y}_{ij}\right)^2 \tag{7}$$

$$\hat{y}_{ij} = \sum_{k=1}^{c} w_{ik} x_{kj} + m_i \tag{8}$$

where $y_{ij}$ denotes the $i$-th variable of the $j$-th sample of the observed data, $\hat{y}_{ij}$ is the reconstruction of the data element $y_{ij}$, $O$ is the set of indexes of the observed elements, $w_{ik}$ is the $(i,k)$-th element of $\mathbf{W}$, $m_i$ is the $i$-th element of $\mathbf{m}$, and $x_{kj}$ denotes the $k$-th latent variable of the $j$-th sample.
This optimization problem can be solved by a least-square algorithm which alternately updates the parameters $\mathbf{W}$ and $\mathbf{m}$ and the latent variables $\mathbf{X}$. However, LSPCA easily suffers from overfitting, especially when the missing ratio is high: since its objective is to minimize the mean-square error between the observed data and their reconstructions, the algorithm may produce unreasonably large parameters that fit the observed data well but lose generalization ability.
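The least-square PCA idea can be sketched as an alternating procedure: fill the missing cells with column means, then repeatedly replace them with a low-rank PCA reconstruction of the current matrix. The following is a simplified SVD-based sketch of this idea, not the authors' exact algorithm:

```python
import numpy as np

def lspca_impute(Y, n_components=1, n_iter=200):
    """Impute NaN entries of Y by iterating a rank-c PCA
    reconstruction fitted to the current filled-in matrix."""
    Y = Y.astype(float).copy()
    miss = np.isnan(Y)
    col_means = np.nanmean(Y, axis=0)
    Y[miss] = np.take(col_means, np.nonzero(miss)[1])
    for _ in range(n_iter):
        m = Y.mean(axis=0)                   # bias term (Eq. 6's m)
        U, s, Vt = np.linalg.svd(Y - m, full_matrices=False)
        recon = (U[:, :n_components] * s[:n_components]) @ Vt[:n_components] + m
        Y[miss] = recon[miss]                # update only the missing cells
    return Y

# rank-1 toy table with one missing cell (true value is 1.0)
Y = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
Y[0, 0] = np.nan
out = lspca_impute(Y, n_components=1)
```

On this toy low-rank table the iteration drives the missing cell back toward a value consistent with the rank-1 structure; on real crash tables with high missing ratios, this unregularized fitting is exactly where the overfitting described above appears.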
3.3.2 Imputation Algorithm II—Probabilistic PCA (PPCA)
A natural solution to the overfitting problem of LSPCA is to add a regularization term to the objective function to penalize unreasonably large parameters. Another solution is to alter the transformation between the original data and the latent variables to a probabilistic form, from which the regularization term is naturally derived. PPCA is derived by adding an isotropic Gaussian noise term to Eq. (6):

$$\mathbf{y}_n = \mathbf{W}\mathbf{x}_n + \mathbf{m} + \boldsymbol{\varepsilon}_n \tag{9}$$

where $\mathbf{x}_n$ and $\boldsymbol{\varepsilon}_n$ follow normal distributions, i.e., $\mathbf{x}_n \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and $\boldsymbol{\varepsilon}_n \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})$. There are three groups of parameters, i.e., $\mathbf{W}$, $\mathbf{m}$, and $\sigma^2$, which can be estimated by the EM algorithm (Bishop, 1999).

3.3.3 Imputation Algorithm III—Variational Bayesian PCA (VBPCA)
PPCA is sometimes sensitive to the initialization of the parameters $\mathbf{W}$, $\mathbf{m}$, and $\sigma^2$. To overcome this defect, Gaussian prior distributions are assumed for the parameters $\mathbf{W}$ and $\mathbf{m}$, which yields VBPCA: the elements of $\mathbf{m}$ and the columns of $\mathbf{W}$ follow normal distributions whose variances are hyperparameters that can be updated during learning (e.g., using the evidence framework or variational approximations). VBPCA can also be solved iteratively by the EM algorithm (Ilin and Raiko, 2010).

To summarize, within the PCA framework, LSPCA employs the least-square approach to estimate the parameters, which may overfit when the missing ratio is high. PPCA introduces a probabilistic form for the original data by adding an isotropic noise term, which penalizes unreasonably large parameters and thus avoids overfitting. VBPCA further introduces a probabilistic form for the parameters in order to eliminate the sensitivity to parameter initialization observed in PPCA.
4 Numerical Tests and Discussions
4.1 Data Preparation
The data used in this study were obtained from the Traffic Management Platform of Hangzhou, China. This platform continuously collects 5-min aggregated lane-by-lane traffic flow, occupancy, and speed in real time via remote traffic microwave sensors. Besides the microwave sensor data, the platform also provides traffic crash records with several attributes, including the incident time, location, and incident type. The study site is located on the Shangtang-Zhonghe Urban Expressway of Hangzhou (see Fig. 1(a)). Two datasets (a train-and-test dataset and a validation dataset) are collected, and each dataset consists of a crash record table and a traffic flow database collected from the microwave sensors.
Each crash case in the table of crash records is mapped to the two closest upstream and the two closest downstream microwave sensors. For each crash case, we collect the dynamic traffic flow data during 5–10 min and 10–15 min prior to its occurrence time.
Fig. 1(b) shows the four microwave sensors selected for each crash and their names: m1 denotes the second closest upstream sensor to the crash, m2 denotes the nearest upstream sensor, and so on. Fig. 1(c) shows the time intervals prior to each crash, among which t2 (5–10 min prior to the crash occurrence) and t3 (10–15 min prior to the crash occurrence) are collected in our study, because t1 does not provide enough time for alarms or other crash-prevention measures based on the crash likelihood prediction.
Table 2. Notation of the 24 explanatory variables.

Number    1      2      3      4      5      6
Variable  fm1t3  fm1t2  fm2t3  fm2t2  fm3t3  fm3t2
Number    7      8      9      10     11     12
Variable  fm4t3  fm4t2  om1t3  om1t2  om2t3  om2t2
Number    13     14     15     16     17     18
Variable  om3t3  om3t2  om4t3  om4t2  sm1t3  sm1t2
Number    19     20     21     22     23     24
Variable  sm2t3  sm2t2  sm3t3  sm3t2  sm4t3  sm4t2
Three kinds of variables, i.e., flow, time occupancy, and speed, are selected for each crash case; thus each sample contains 4 (sensors) × 2 (time intervals) × 3 (variable types) = 24 variables. The notation of the explanatory variables is shown in Table 2, where the first part of each variable name indicates the category (flow, occupancy, or speed), the second part refers to the RTMS, and the third part represents the time interval. For example, fm1t3 represents the flow (f) at the second closest upstream sensor (m1) during the time interval of 10–15 min prior to the crash (t3).
As mentioned above, the matched case-control strategy is applied in selecting the non-crash samples. In this paper, we match 10 non-crash samples for each crash sample, following the matching rules below:

Location. The location of the matched non-crash cases should be the same as that of the crash case;

Within-day time. The within-day time of the non-crash cases should be the same as that of the matched crash case, but on different days. For example, if a crash occurred at 12:45 PM on June 15, 2015 (Monday), then one matched non-crash case can be extracted from the dataset with the timestamp 12:45 PM on June 29, 2015 (Monday);

Day type. We define two day types, i.e., weekday and weekend, and the crash cases should share the same day type with the matched non-crash cases.
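The three matching rules above can be sketched as a simple predicate using only the standard library; the record keys ("location", "time") are hypothetical and chosen only for illustration.

```python
from datetime import datetime

def is_match(crash, candidate):
    """Check the three matching rules: same location, same within-day time
    on a different day, and same day type (weekday vs. weekend)."""
    same_location = candidate["location"] == crash["location"]
    same_clock = (candidate["time"].hour == crash["time"].hour
                  and candidate["time"].minute == crash["time"].minute)
    different_day = candidate["time"].date() != crash["time"].date()
    same_day_type = ((candidate["time"].weekday() >= 5)
                     == (crash["time"].weekday() >= 5))   # weekend flag
    return same_location and same_clock and different_day and same_day_type

# The example from the text: a Monday crash matched two weeks later.
crash = {"location": "m2", "time": datetime(2015, 6, 15, 12, 45)}      # Monday
candidate = {"location": "m2", "time": datetime(2015, 6, 29, 12, 45)}  # Monday
```

In practice one would scan the non-crash pool with this predicate until 10 matches per crash are collected.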
The train-and-test dataset, collected from June 11 to November 11, 2015, is used for training and testing (10-fold cross validation in this paper). In order to implement the sensitivity analysis of the missing data, we only select the crash cases that can be matched with complete explanatory variables in the train-and-test dataset. The train-and-test dataset is iteratively split into a 90% training set and a 10% testing set in the 10-fold cross validation. On the other hand, the validation dataset, collected from June 1 to October 1, 2016, is used for validating the real-world effectiveness of the proposed framework, which comprises PCA-based missing data imputation and solutions to the imbalanced data (see Table 3). All the crash records in the validation dataset have been selected and matched with explanatory variables, while the missing ratio of the validation dataset reaches 21%. The values of each explanatory variable in the two datasets are standardized to zero mean and unit standard deviation.
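The standardization step can be sketched as a nan-aware column-wise z-scoring (a minimal sketch; the platform's actual preprocessing pipeline is not specified in the text):

```python
import numpy as np

def standardize(X):
    """Scale each explanatory variable (column) to zero mean and unit
    standard deviation, ignoring missing entries (NaN)."""
    mu = np.nanmean(X, axis=0)
    sigma = np.nanstd(X, axis=0)
    return (X - mu) / sigma
```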
Table 3. Summary of the two datasets.

Dataset                       Train-and-test dataset    Validation dataset
Time range                    2015/06/11–2015/11/11     2016/06/01–2016/10/01
Number of crash samples       123                       120
Number of non-crash samples   1230                      1200
Proportion of missing data    0%                        21%
Fig. 2 depicts the proposed framework, which comprises three components: the predictive models for real-time crash likelihood estimation, the algorithms for missing data imputation, and two solutions to the imbalanced data issue. A series of sensitivity analyses is implemented to examine the efficiency and interaction of these three components based on the train-and-test dataset, and the optimal method for each component is selected. In addition, an out-of-experiment validation test is conducted on the independent validation dataset to evaluate the predictive performance of the selected framework.
4.2 Sensitivity Analysis of Solutions to Imbalanced Issues
In this section, we design a sensitivity analysis to investigate how different class-weighted ratios affect the model MOEs (measures of effectiveness) on the train-and-test dataset, under different solutions to the imbalance issue, including COST, SMOTE sampling, and their combination. Considering that the real imbalance ratio in the dataset is 1:10, five class-weighted ratios, i.e., 1, 5, 10, 20, and 30, are tested in this sensitivity analysis. The parameters of the solutions to the imbalance issue are listed in Table 4.
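A minimal sketch of the COST solution using scikit-learn's `class_weight` parameter is shown below; the data here are synthetic and hypothetical, standing in for the paper's actual feature set, and the weighting mirrors the 1:10 class-weighted ratio discussed in the text.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical imbalanced sample: 40 crash cases (label 1) vs. 400
# non-crash cases (label 0), mimicking the 1:10 ratio in the text.
rng = np.random.default_rng(0)
X = rng.normal(size=(440, 4))
y = np.r_[np.ones(40), np.zeros(400)].astype(int)
X[y == 1] += 1.0                      # weak separation for the minority class

# COST: errors on the crash class are penalized 10x via class weights,
# i.e., a class-weighted ratio of 10.
clf = SVC(kernel="linear", class_weight={0: 1.0, 1: 10.0}).fit(X, y)
sensitivity = (clf.predict(X[y == 1]) == 1).mean()
```

Increasing the weight on the minority class trades specificity (and overall accuracy) for sensitivity, which is exactly the trade-off examined in this section.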
Fig. 3 shows the accuracy, AUC, sensitivity, and specificity of the 4 classifiers, i.e., SVM (with linear, Gaussian, and polynomial kernels) and AdaBoost, under different class-weighted ratios and different solutions to the imbalance issue. Several interesting results are observed:
Table 4. Parameters of the solutions to imbalanced issues.

Solutions       Parameters
COST
SMOTE           Multiplier of synthetic samples
COST + SMOTE


The accuracy of all models drops as the class-weighted ratio increases, which means the classifiers sacrifice part of the precision and overall accuracy to improve the sensitivity. A trade-off should be made between specificity and sensitivity, because a low specificity means more false alarms, which dulls people's vigilance, while a low sensitivity indicates low accuracy in identifying real crash cases;

AUC is not sensitive to the imbalanced data because of its inherent properties, and thus it is not a suitable indicator for selecting a proper class-weighted ratio;

The linear classifier, i.e., SVM (linear), is more sensitive to the class-weighted ratio than the nonlinear classifiers, i.e., SVM (Gaussian), SVM (polynomial), and AdaBoost;

The three examined solutions to the imbalance issue, i.e., COST, SMOTE, and COST + SMOTE, show comparable abilities in changing the classifiers' behavior under the same class-weighted ratio.
In the following analyses, the class-weighted ratio is set to 10 for all classifiers, considering the trade-off between sensitivity and specificity under such an imbalanced dataset.
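For reference, the data-level alternative, SMOTE (Chawla et al., 2002), can be sketched in a few lines of NumPy; this is a simplified version of the algorithm, not the exact implementation used in the experiments.

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Minimal SMOTE sketch: each synthetic minority sample is a random
    interpolation between a minority sample and one of its k nearest
    minority-class neighbours."""
    if rng is None:
        rng = np.random.default_rng(0)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]          # skip the sample itself
        j = rng.choice(nbrs)
        lam = rng.random()                     # interpolation weight in [0, 1)
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)
```

Appending the synthetic crash samples to the training set raises the effective class-weighted ratio without touching the classifier's loss function, which is why COST and SMOTE behave comparably in Fig. 3.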
4.3 Missing Data Imputation: Accuracy Analysis
In this section, we first discuss the trade-off between the computation complexity and imputation accuracy of the 3 PCA-based missing data imputation algorithms, and then compare them with two conventional interpolation methods, i.e., mean imputation and k-means clustering imputation.
Two experiments are designed based on the train-and-test dataset, which is complete. The missing pattern is assumed to be MCAR. In Experiment I, 20%, 40%, and 60% of the explanatory variables in the train-and-test dataset are randomly removed from each sample (for both crash cases and non-crash cases). Note that different samples (rows) have different missing explanatory variables (columns). Each row and each column must retain at least one observed value; otherwise, the row or column is removed. Three PCA-based missing data imputation approaches are then used to impute the missing values. The dimensionality of the latent variables is a key parameter in PCA-based approaches, so the root mean squared error (RMSE) and the computing time (which measures the computation complexity) are calculated under different latent dimensionalities and different missing ratios. RMSE is calculated by
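The MCAR removal with the row/column guard described above might be implemented as follows (a sketch; the rejection loop assumes the missing ratio is low enough that a valid mask is quickly found, which holds comfortably for ratios up to 60% on a 24-column dataset):

```python
import numpy as np

def mcar_mask(shape, missing_ratio, rng=None):
    """Generate an MCAR missing-entry mask (True = missing), resampling
    until every row and every column keeps at least one observed value."""
    if rng is None:
        rng = np.random.default_rng(0)
    while True:
        mask = rng.random(shape) < missing_ratio
        if (~mask).any(axis=0).all() and (~mask).any(axis=1).all():
            return mask
```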
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \hat{x}_i\right)^2}    (10)

where x_i and \hat{x}_i are the real and estimated values of the i-th imputed entry, respectively, and N is the number of imputed values.
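Computed only over the imputed entries, the RMSE of Eq. (10) translates directly into code:

```python
import numpy as np

def imputation_rmse(X_true, X_imputed, mask):
    """RMSE over the imputed (missing) entries only, per Eq. (10);
    mask is True where a value was missing and has been imputed."""
    diff = X_true[mask] - X_imputed[mask]
    return np.sqrt(np.mean(diff ** 2))
```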
After determining the latent dimensionality in Experiment I, the PCA-based approaches are compared with the two traditional interpolation methods, i.e., mean imputation and k-means clustering imputation, in Experiment II. Experiment II randomly generates a finer-grained set of missing ratios in the train-and-test dataset, ranging from 0 to 60% with a step of 5%. Considering that the random generation of missing values leads to different results in different trials, both Experiment I and Experiment II are repeated 5 times and the mean results are presented.
4.3.1 Experiment I—Trade-off Between Computation Complexity and Imputation Accuracy
The results of Experiment I are shown in Fig. 4. (I) RMSE decreases while the computing time increases with the dimensionality of the latent variables, which indicates that a trade-off should be made between accuracy and computation complexity. (II) PPCA and VBPCA achieve comparable imputation performance measured by RMSE and remain stable across different missing ratios. (III) The RMSE of LS-PCA fluctuates strongly, especially at high missing ratios, where it can reach 1. This is caused by the inherent overfitting issue of LS-PCA, whose objective function minimizes the difference between the original data and the reconstructed data without any regularization, which leads the algorithm to generate unreasonably large parameters.
Considering the trade-off between computation complexity and imputation accuracy, 15 is selected as the latent dimensionality for the PCA-based imputation algorithms in this paper.
4.3.2 Experiment II—Comparison Between PCA-based Imputation and Interpolation Methods
PCA-based missing data imputation algorithms belong to the EM-based parameter estimation family, where the parameters are estimated from the observed data and the missing data are then imputed via the probabilistic distribution defined by these parameters. In contrast, interpolation methods, e.g., mean imputation and k-means clustering imputation, impute the missing values from the observed values according to specific similarity rules.
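The two interpolation baselines can be sketched as follows. This is a simplified k-means imputation (rows are clustered on mean-imputed data, then missing entries are refilled with cluster means); the exact variant used in the experiments may differ in detail.

```python
import numpy as np

def mean_impute(X):
    """Replace each missing entry (NaN) with its column mean."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)
    idx = np.where(np.isnan(X))
    X[idx] = col_means[idx[1]]
    return X

def kmeans_impute(X, k=3, n_iter=20, rng=None):
    """k-means clustering imputation sketch: cluster mean-imputed rows,
    then replace each missing entry with its cluster's column mean."""
    if rng is None:
        rng = np.random.default_rng(0)
    filled = mean_impute(X)
    centers = filled[rng.choice(len(filled), size=k, replace=False)]
    for _ in range(n_iter):
        labels = np.argmin(((filled[:, None] - centers) ** 2).sum(-1), axis=1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = filled[labels == c].mean(axis=0)
    X_out = X.copy()
    miss = np.isnan(X)
    for i, c in enumerate(labels):
        X_out[i, miss[i]] = centers[c, miss[i]]   # refill from cluster mean
    return X_out
```

Both methods use only local similarity among observed values, which is why their RMSE stays near the data's standard deviation rather than decreasing with a learned low-dimensional structure.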
As shown in Fig. 5, the results of Experiment II show that the RMSEs of PPCA and VBPCA are comparable and much lower than those of the mean imputation and k-means imputation at all missing ratios. It is not surprising that the RMSE of LS-PCA increases rapidly with the missing ratio, since the overfitting problem of LS-PCA becomes more severe at higher missing ratios. The RMSE of the mean imputation remains stable around 1, because the explanatory variables in the dataset are standardized to zero mean and unit standard deviation.
4.4 Sensitivity Analysis of Missing Data on Predictive Performance
In this section, a sensitivity analysis is implemented to see how the predictive performance of the classifiers (SVM with linear, Gaussian, and polynomial kernels, and AdaBoost) reacts to increasing missing ratios.
Firstly, a random forest is utilized to calculate the feature importance of the explanatory variables (see Fig. 6). It is observed that the flow rate 10–15 min ahead of the crash at the second downstream RTMS (fm4t3), the flow rate 5–10 min ahead of the crash at the second downstream RTMS (fm4t2), and the flow rate 5–10 min ahead of the crash at the first downstream RTMS (fm3t2) are the top 3 most important features. Feature selection has proved to be an efficient method to avoid overfitting. In this paper, we compare two cases for each classifier: (I) no feature selection, i.e., all features are used (full features); (II) the top 8 most important features are selected for forecasting (selected features). Therefore, eight predictive models are examined and compared (see Table 5).

Table 5. The examined predictive models.

Model  Description

SVM-linear (full features)          SVM with linear kernel (without feature selection)
SVM-linear (selected features)      SVM with linear kernel (with feature selection)
SVM-Gaussian (full features)        SVM with Gaussian kernel (without feature selection)
SVM-Gaussian (selected features)    SVM with Gaussian kernel (with feature selection)
SVM-polynomial (full features)      SVM with polynomial kernel (without feature selection)
SVM-polynomial (selected features)  SVM with polynomial kernel (with feature selection)
AdaBoost (full features)            AdaBoost (without feature selection)
AdaBoost (selected features)        AdaBoost (with feature selection)
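The feature-importance ranking and top-8 selection described above could be sketched with scikit-learn as below; the data are synthetic, with the real 24 variables of Table 2 replaced by random columns, two of which are made artificially informative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 24))              # stand-in for the 24 variables
y = (X[:, 3] + X[:, 7] > 0).astype(int)     # only columns 3 and 7 carry signal

# Rank features by random-forest importance and keep the top 8.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
top8 = np.argsort(rf.feature_importances_)[::-1][:8]
X_selected = X[:, top8]                     # the "selected features" design
```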
Secondly, fractions of data ranging from 0 to 60% with a step of 5% are removed from the train-and-test dataset according to the MCAR pattern. Four missing data imputation algorithms (PPCA, LS-PCA, mean imputation, and k-means clustering imputation) are utilized to impute the missing values. Since the previous results show that PPCA and VBPCA are comparable in terms of imputation accuracy, only PPCA is examined in this section.
Finally, the 8 predictive models are trained and tested (with 10-fold cross validation) on the dataset under different missing ratios, and the AUCs, which measure the overall predictive performance, are calculated. Based on the study in Section 4.2, both COST and SMOTE perform well in solving the imbalance issue, without a significant distinction between them. Therefore, we do not compare different solutions to the imbalanced dataset in this section, but simply select COST for all predictive models. To ensure reliability, all the trials are repeated five times and the averaged AUCs are recorded.
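The evaluation loop, i.e., 10-fold cross-validated AUC with COST, can be sketched as follows (synthetic, deliberately imbalanced data; not the paper's dataset):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(220, 5))
y = np.r_[np.ones(20), np.zeros(200)].astype(int)   # 1:10 imbalance as in the text
X[y == 1, 0] += 1.5                                  # make the classes separable

# 10-fold cross-validated AUC for an SVM with COST (class weights).
clf = SVC(kernel="linear", class_weight={0: 1.0, 1: 10.0})
aucs = cross_val_score(clf, X, y, cv=10, scoring="roc_auc")
mean_auc = aucs.mean()
```

Repeating this over each missing ratio and each imputation method, and averaging over five trials, reproduces the structure of the sensitivity analysis described above.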
As shown in Fig. 6, the PPCA imputation outperforms the other three imputation approaches in terms of AUC. The AUCs of the classifiers using the LS-PCA imputation decrease the fastest as the missing ratio increases, which is consistent with the accuracy analysis of the imputation approaches in Section 4.3. PPCA shows an even greater advantage at higher missing ratios, where the difference between the models using PPCA and those using the other imputations is significant. This indicates that PPCA has a stable ability to recover the missing information even in a highly incomplete dataset.
It is observed that SVM (with Gaussian or polynomial kernels) and AdaBoost with full features achieve AUCs higher than 0.8 on the complete dataset, while the AUC of SVM with the linear kernel only reaches 0.76. However, as the missing ratio increases, the AUC of SVM with the polynomial kernel drops sharply, while SVM with the linear kernel is only slightly affected. It is also interesting that, when the PPCA imputation is used, the AUCs of all models except SVM with the polynomial kernel converge to around 0.7 at a high missing ratio (such as 0.6). On the other hand, feature selection cannot significantly improve the classifiers' predictive performance on the train-and-test dataset, which indicates that the gains from reducing overfitting do not exceed the losses from the discarded information.
These results might provide suggestions to traffic managers on how to select suitable classifiers and imputation approaches for real-time crash likelihood prediction:

It is highly suggested to use PPCA or VBPCA for missing data imputation, especially for datasets with high missing ratios.

When the missing ratio is low, it is suggested to use SVM with Gaussian or polynomial kernels instead of SVM with the linear kernel. However, when the missing ratio is high, SVM with the polynomial kernel should be avoided since its AUC drops dramatically, while the other classifiers, i.e., SVM with linear and Gaussian kernels and AdaBoost, achieve comparable AUCs.
4.5 Out-of-Experiment Validation
Table 6. Validation results on the validation dataset.

Classifier       Accuracy  AUC    Sensitivity  Specificity
SVM-linear       0.755     0.763  0.625        0.768
SVM-Gaussian     0.736     0.742  0.575        0.752
SVM-polynomial   0.740     0.740  0.550        0.759
AdaBoost         0.733     0.758  0.600        0.747
A series of sensitivity analyses has been conducted on the train-and-test dataset, while an out-of-experiment validation is carried out on the validation dataset. In this validation, PPCA is applied as the missing data imputation method, COST is used as the solution to the imbalanced dataset, and the four aforementioned classifiers are examined.
The results in Table 6 show that the four classifiers achieve AUCs around 0.75 on the validation dataset with 21% real-world missing data, which is comparable to the results on the train-and-test dataset. The AUCs are slightly lower than those in previous studies based on complete datasets, which were roughly in the range of 0.75 to 0.8 (Yu and Abdel-Aty, 2013b; Xu et al., 2014). On this validation dataset, SVM with the linear kernel performs best, with an AUC of 0.763, sensitivity of 0.625, specificity of 0.768, and accuracy of 0.755.
5 Conclusions
This paper addresses the missing data imputation problem and the solutions to the imbalanced dataset in real-time crash likelihood estimation. Although these two problems are frequently encountered in real-world applications, little research has been conducted on them in the domain of real-time crash likelihood estimation.
In terms of missing data imputation, we compare PCA-based imputation algorithms (PPCA, VBPCA, and LS-PCA) with conventional imputation approaches, i.e., mean imputation and k-means clustering imputation. Numerical results show that PPCA and VBPCA outperform LS-PCA and the conventional methods in terms of RMSE, and also help the classifiers achieve more stable predictive performance under high missing ratios. In terms of the imbalanced data, the two solutions, i.e., cost-sensitive learning on the algorithmic level and SMOTE on the data level, both achieve good performance in improving the sensitivity with an acceptable loss of specificity, by adjusting the classifiers to pay more attention to the crash cases (the minority class in the dataset).
It is observed that the classifiers' AUCs decrease along different curves as the missing ratio increases in the train-and-test dataset. SVM with the linear kernel is the weakest classifier on the complete dataset, but its predictive performance drops more slowly than that of the others. SVM with the polynomial kernel performs outstandingly on the complete dataset but becomes the worst at high missing ratios.
An out-of-experiment validation is implemented using an independent dataset, which has 21% missing data and is also imbalanced (the ratio of non-crash samples to crash samples is 10:1). On such a partly-missing and imbalanced dataset, the classifiers achieve AUCs around 0.75 with the help of the PPCA missing data imputation and the cost-sensitive learning technique.
These findings provide useful insights for traffic operators in implementing proper predictive strategies:

PPCA and VBPCA are two highly suggested missing data imputation approaches, especially when the missing ratio is high;

The cost-sensitive learning technique and SMOTE are useful and comparable methods for dealing with the imbalance issue;

Complex and high-dimensional models, such as SVM with the polynomial kernel, are not always the most accurate and stable classifiers, especially when the missing ratio is high. Operators should select different classifiers under different circumstances (for example, according to the missing ratio of the real-time traffic flow data).
The limitations and potential future work are discussed as follows. Firstly, only the MCAR case is considered in this paper, while the NMAR missing pattern is also commonly observed in field applications (for example, sensor failures can lead to long, continuous missing sequences in traffic flow data). In such scenarios, a case-by-case design should be implemented for the missing data imputation. In the future, we expect to propose approaches for dealing with NMAR patterns and to explore more missing data imputation algorithms, such as tensor decomposition, in the domain of real-time crash likelihood estimation. Secondly, this paper only considers the missing features of crash/non-crash samples, and does not tackle the issue of missed crash records. The issue of missing crash samples is more difficult than that of missing features, since it turns the classification into a semi-supervised learning problem. We hope to investigate this problem in future studies.
Acknowledgements
This research is financially supported by Zhejiang Provincial Natural Science Foundation of China [LR17E080002], National Natural Science Foundation of China [51508505, 71771198, 51338008], and Fundamental Research Funds for the Central Universities [2017QNA4025].
References
 Abdel-Aty et al. (2008) Abdel-Aty, M.A., Pande, A., Das, A., Knibbe, W., 2008. Assessing safety on Dutch freeways with data from infrastructure-based intelligent transportation systems. Transportation Research Record: Journal of the Transportation Research Board 2083, 153–161.
 Abdel-Aty et al. (2007) Abdel-Aty, M.A., Pande, A., Lee, C., Gayah, V., Santos, C.D., 2007. Crash risk assessment using intelligent transportation systems data and real-time intervention strategies to improve safety on freeways. Journal of Intelligent Transportation Systems 11, 107–120.
 Abdel-Aty and Pemmanaboina (2006) Abdel-Aty, M.A., Pemmanaboina, R., 2006. Calibrating a real-time traffic crash-prediction model using archived weather and ITS traffic data. IEEE Transactions on Intelligent Transportation Systems 7, 167–174.
 Ahmed and Abdel-Aty (2013a) Ahmed, M., Abdel-Aty, M.A., 2013a. Application of stochastic gradient boosting technique to enhance reliability of real-time risk assessment: Use of automatic vehicle identification and remote traffic microwave sensor data. Transportation Research Record: Journal of the Transportation Research Board 2386, 26–34.
 Ahmed and Abdel-Aty (2013b) Ahmed, M., Abdel-Aty, M.A., 2013b. A data fusion framework for real-time risk assessment on freeways. Transportation Research Part C: Emerging Technologies 26, 203–213.
 Ahmed and Abdel-Aty (2012) Ahmed, M.M., Abdel-Aty, M.A., 2012. The viability of using automatic vehicle identification data for real-time crash prediction. IEEE Transactions on Intelligent Transportation Systems 13, 459–468.
 Asif et al. (2016) Asif, M.T., Mitrovic, N., Dauwels, J., Jaillet, P., 2016. Matrix and tensor based methods for missing data estimation in large traffic networks. IEEE Transactions on Intelligent Transportation Systems 17, 1816–1825.
 Asif et al. (2013) Asif, M.T., Mitrovic, N., Garg, L., Dauwels, J., Jaillet, P., 2013. Lowdimensional models for missing data imputation in road networks, in: IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3527–3531.
 Basso et al. (2018) Basso, F., Basso, L.J., Bravo, F., Pezoa, R., 2018. Real-time crash prediction in an urban expressway using disaggregated data. Transportation Research Part C: Emerging Technologies 86, 202–219.

 Bishop (1999) Bishop, C.M., 1999. Variational principal components, in: the 9th IET International Conference on Artificial Neural Networks, pp. 509–514.
 Chawla et al. (2002) Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P., 2002. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16, 321–357.
 Chawla et al. (2004) Chawla, N.V., Japkowicz, N., Kotcz, A., 2004. Editorial: Special issue on learning from imbalanced data sets. ACM SIGKDD Explorations Newsletter 6, 1–6.
 Chen et al. (2003) Chen, C., Kwon, J., Rice, J., Skabardonis, A., Varaiya, P., 2003. Detecting errors and imputing missing data for singleloop surveillance systems. Transportation Research Record: Journal of the Transportation Research Board 1855, 160–167.
 Chen et al. (2002) Chen, C., Kwon, J., Varaiya, P., 2002. The quality of loop data and the health of California’s freeway loop detectors. PeMS Development Group .
 Chen et al. (2015) Chen, C., Zhang, G., Tarefder, R., Ma, J., Wei, H., Guan, H., 2015. A multinomial logit modelBayesian network hybrid approach for driver injury severity analyses in rearend crashes. Accident Analysis & Prevention 80, 76–88.
 Conklin and Scherer (2003) Conklin, J.H., Scherer, W.T., 2003. Data imputation strategies for transportation management systems. Technical Report. University of Virginia, Charlottesville.
 Dear (1959) Dear, R.E., 1959. A principalcomponent missingdata method for multiple regression models. System Development Corporation.
 Deb and Liew (2016) Deb, R., Liew, A.W.C., 2016. Missing value imputation for the analysis of incomplete traffic accident data. Information Sciences 339, 274–289.
 ElBasyouny et al. (2014) ElBasyouny, K., Barua, S., Islam, M.T., 2014. Investigation of time and weather effects on crash types using full Bayesian multivariate Poisson lognormal models. Accident Analysis & Prevention 73, 91–99.
 Estabrooks (2000) Estabrooks, A., 2000. A combination scheme for inductive learning from imbalanced data sets. DalTech.
 Fildes et al. (2015) Fildes, B., Keall, M., Bos, N., Lie, A., Page, Y., Pastor, C., Pennisi, L., Rizzi, M., Thomas, P., Tingvall, C., 2015. Effectiveness of low speed autonomous emergency braking in realworld rearend crashes. Accident Analysis & Prevention 81, 24–29.
 Grung and Manne (1998) Grung, B., Manne, R., 1998. Missing values in principal component analysis. Chemometrics and Intelligent Laboratory Systems 42, 125–139.
 Guo et al. (2008) Guo, X., Yin, Y., Dong, C., Yang, G., Zhou, G., 2008. On the class imbalance problem, in: the 4th IEEE International Conference on Natural Computation, pp. 192–201.
 Ilin and Raiko (2010) Ilin, A., Raiko, T., 2010. Practical approaches to principal component analysis in the presence of missing values. Journal of Machine Learning Research 11, 1957–2000.
 Kubat et al. (1997) Kubat, M., Matwin, S., et al., 1997. Addressing the curse of imbalanced training sets: onesided selection, in: ICML, Nashville, USA. pp. 179–186.
 Kwak and Kho (2016) Kwak, H.C., Kho, S., 2016. Predicting crash risk and identifying crash precursors on Korean expressways using loop detector data. Accident Analysis & Prevention 88, 9–19.
 Lao et al. (2014) Lao, Y., Zhang, G., Wang, Y., Milton, J., 2014. Generalized nonlinear models for rearend crash risk analysis. Accident Analysis & Prevention 62, 9–16.
 Lee et al. (2003) Lee, C., Hellinga, B., Saccomanno, F., 2003. Real-time crash prediction model for application to crash prevention in freeway traffic. Transportation Research Record: Journal of the Transportation Research Board, 67–77.
 Li et al. (2004) Li, D., Deogun, J., Spaulding, W., Shuart, B., 2004. Towards missing data imputation: A study of fuzzy kmeans clustering method, in: Rough Sets and Current Trends in Computing, Springer. pp. 573–579.
 Li et al. (2013a) Li, L., Li, Y., Li, Z., 2013a. Efficient missing data imputing for traffic flow by considering temporal and spatial dependence. Transportation Research Part C: Emerging Technologies 34, 108–120.
 Li et al. (2014a) Li, L., Su, X., Zhang, Y., Hu, J., Li, Z., 2014a. Traffic prediction, data compression, abnormal data detection and missing data imputation: An integrated study based on the decomposition of traffic time series, in: the 17th IEEE International Conference on Intelligent Transportation Systems, pp. 282–289.
 Li et al. (2014b) Li, Y., Li, Z., Li, L., 2014b. Missing traffic data: Comparison of imputation methods. IET Intelligent Transport Systems 8, 51–57.
 Li et al. (2013b) Li, Y., Li, Z., Li, L., Zhang, Y., Jin, M., 2013b. Comparison on PPCA, KPPCA and MPPCA based missing data imputing for traffic flow, in: ICTIS 2013: Improving Multimodal Transportation SystemsInformation, Safety, and Integration, pp. 1151–1156.
 Li et al. (2014c) Li, Z., Liu, P., Wang, W., Xu, C., 2014c. Development of a control strategy of variable speed limits to reduce rearend collision risks near freeway recurrent bottlenecks. IEEE Transactions on Intelligent Transportation Systems 15, 866–877.
 Little and Rubin (2014) Little, R.J., Rubin, D.B., 2014. Statistical Analysis with Missing Data. John Wiley & Sons.
 Longadge and Dongre (2013) Longadge, R., Dongre, S., 2013. Class imbalance problem in data mining review. arXiv preprint arXiv:1305.1707 .
 Oh et al. (2005) Oh, J.S., Oh, C., Ritchie, S.G., Chang, M., 2005. Real-time estimation of accident likelihood for safety enhancement. Journal of Transportation Engineering 131, 358–363.
 Park and Haghani (2016) Park, H., Haghani, A., 2016. Real-time prediction of secondary incident occurrences using vehicle probe data. Transportation Research Part C: Emerging Technologies 70, 69–85.
 Pazzani et al. (1994) Pazzani, M., Merz, C., Murphy, P., Ali, K., Hume, T., Brunk, C., 1994. Reducing misclassification costs, in: Proceedings of the Eleventh International Conference on Machine Learning, pp. 217–225.
 Qu et al. (2009) Qu, L., Li, L., Zhang, Y., Hu, J., 2009. PPCAbased missing data imputation for traffic flow volume: A systematical approach. IEEE Transactions on Intelligent Transportation Systems 10, 512–522.
 Rajagopal and Varaiya (2007) Rajagopal, R., Varaiya, P., 2007. Health of California’s loop detector system. California PATH Program, Institute of Transportation Studies, University of California at Berkeley .
 Roshandel et al. (2015) Roshandel, S., Zheng, Z., Washington, S., 2015. Impact of realtime traffic characteristics on freeway crash occurrence: Systematic review and metaanalysis. Accident Analysis & Prevention 79, 198–211.
 Saha et al. (2015) Saha, D., Alluri, P., Gan, A., 2015. A random forests approach to prioritize highway safety manual (HSM) variables for data collection. Journal of Advanced Transportation 50, 522–540.
 Seiffert et al. (2010) Seiffert, C., Khoshgoftaar, T.M., Van Hulse, J., Napolitano, A., 2010. Rusboost: A hybrid approach to alleviating class imbalance. IEEE Transactions on Systems, Man, and CyberneticsPart A: Systems and Humans 40, 185–197.
 Smith et al. (2003) Smith, B., Scherer, W., Conklin, J., 2003. Exploring imputation techniques for missing data in transportation management systems. Transportation Research Record: Journal of the Transportation Research Board 1836, 132–142.
 Sun et al. (2016) Sun, J., Li, T., Li, F., Chen, F., 2016. Analysis of safety factors for urban expressways considering the effect of congestion in Shanghai, China. Accident Analysis & Prevention 95, 503–511.
 Sun and Sun (2015) Sun, J., Sun, J., 2015. A dynamic Bayesian network model for real-time crash prediction using traffic speed conditions data. Transportation Research Part C: Emerging Technologies 54, 176–186.
 Sun et al. (2014) Sun, J., Sun, J., Chen, P., 2014. Use of support vector machine models for real-time prediction of crash risk on urban expressways. Transportation Research Record: Journal of the Transportation Research Board 2432, 91–98.
 Sun et al. (2007) Sun, Y., Kamel, M.S., Wong, A.K., Wang, Y., 2007. Costsensitive boosting for classification of imbalanced data. Pattern Recognition 40, 3358–3378.
 Tan et al. (2013) Tan, H., Feng, G., Feng, J., Wang, W., Zhang, Y.J., Li, F., 2013. A tensorbased method for missing traffic data completion. Transportation Research Part C: Emerging Technologies 28, 15–27.

 Tang et al. (2015) Tang, J., Zhang, G., Wang, Y., Wang, H., Liu, F., 2015. A hybrid approach to integrate fuzzy c-means based imputation method with genetic algorithm for missing traffic volume data estimation. Transportation Research Part C: Emerging Technologies 51, 29–40.
 Theofilatos et al. (2016) Theofilatos, A., Yannis, G., Kopelias, P., Papadimitriou, F., 2016. Predicting road accidents: A rare-events modeling approach. Transportation Research Procedia 14, 3399–3405.
 Tipping and Bishop (1999) Tipping, M.E., Bishop, C.M., 1999. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 61, 611–622.
 Turner et al. (2000) Turner, S., Albert, L., Gajewski, B., Eisele, W., 2000. Archived intelligent transportation system data quality: Preliminary analyses of San Antonio TransGuide data. Transportation Research Record: Journal of the Transportation Research Board 1719, 77–84.
 Weng and Meng (2014) Weng, J., Meng, Q., 2014. Rearend crash potential estimation in the work zone merging areas. Journal of Advanced Transportation 48, 238–249.

 Wu et al. (2017) Wu, Y., Tan, H., Li, Y., Li, F., He, H., 2017. Robust tensor decomposition based on Cauchy distribution and its applications. Neurocomputing 223, 107–117.
 Xu et al. (2016) Xu, C., Liu, P., Yang, B., Wang, W., 2016. Real-time estimation of secondary crash likelihood on freeways using high-resolution loop detector data. Transportation Research Part C: Emerging Technologies 71, 406–418.
 Xu et al. (2013) Xu, C., Tarko, A.P., Wang, W., Liu, P., 2013. Predicting crash likelihood and severity on freeways with real-time loop detector data. Accident Analysis & Prevention 57, 30–39.
 Xu et al. (2014) Xu, C., Wang, W., Liu, P., Guo, R., Li, Z., 2014. Using the Bayesian updating approach to improve the spatial and temporal transferability of real-time crash risk prediction models. Transportation Research Part C: Emerging Technologies 38, 167–176.
 Yu and Abdel-Aty (2013a) Yu, R., Abdel-Aty, M.A., 2013a. Multi-level Bayesian analyses for single- and multi-vehicle freeway crashes. Accident Analysis & Prevention 58, 97–105.
 Yu and Abdel-Aty (2013b) Yu, R., Abdel-Aty, M.A., 2013b. Utilizing support vector machine in real-time crash risk evaluation. Accident Analysis & Prevention 51, 252–259.
 Yu and Abdel-Aty (2014) Yu, R., Abdel-Aty, M.A., 2014. Using hierarchical Bayesian binary probit models to analyze crash injury severity on high speed facilities with real-time traffic data. Accident Analysis & Prevention 62, 161–167.
 Yu et al. (2014) Yu, R., Abdel-Aty, M.A., Ahmed, M.M., Wang, X., 2014. Utilizing microscopic traffic and weather data to analyze real-time crash patterns in the context of active traffic management. IEEE Transactions on Intelligent Transportation Systems 15, 205–213.

 Yu et al. (2016) Yu, R., Wang, X., Yang, K., Abdel-Aty, M.A., 2016. Crash risk analysis for Shanghai urban expressways: A Bayesian semiparametric modeling approach. Accident Analysis & Prevention 95, 495–502.
 Yuan et al. (2017) Yuan, Z., Zhou, X., Yang, T., Tamerius, J., Mantilla, R., 2017. Predicting traffic accidents through heterogeneous urban data: A case study.