Fatal crash incidents on highways, in cities, and on urban and rural roads are a significant concern to commuters and residents of the United States. Although it is a national problem (Schultz et al., 2011), this paper considers only the fatal crash incidents in the State of North Carolina; however, the approach can be easily extended to address the same problem in other states. One of the most recent articles (Pour-Rouholamin and Jalayer, 2016) discusses fatal crashes in North Carolina, but it is concerned with the severity of motorcycle crashes. Fatal crash incidents are random and possibly correlated with their locations (Abdel-Aty and Pande, 2007), where we can represent the locations using the variables longitude and latitude. As a result, they display interesting patterns in the form of clusters of locations over a very large domain, such as cities and states. These cluster formations are generally observable from the scatter plot of fatal crash incidents with longitude (x-axis) and latitude (y-axis) as location variables. Hence, statistical pattern recognition approaches (Suthaharan, 2015) can be adopted to study this problem efficiently.
The clusters of locations can be extracted using clustering techniques, such as k-means (MacQueen et al., 1967), k-means++ (Arthur and Vassilvitskii, 2007), and k-medoids clustering (Jin and Han, 2011). When clusters are extracted, it is expected that many strong clusters will overlap with busy traffic areas and densely populated regions, such as major cities and arterial highways. Hence, the frequency of fatal crash incidents can be observed from the cluster locations using visualization techniques (Ramos et al., 2015). This crash frequency can be used as a quantitative measure of traffic safety to identify the locations that face traffic safety problems. The crash frequency is a simple measure, and it can be used to represent the count of fatal crash incidents as well: the higher the fatal crash frequency, the lower the traffic safety. The authors of (Park et al., 2015) performed their study by grouping the crash frequency analysis into Crash Frequency-Based Analysis and Crash Rate-Based Analysis. They defined the crash frequency analysis by the absolute crash frequency with respect to each crash type and its severity, and the crash rate analysis as the combination of absolute crash frequency and traffic volume at the crash location. Thus, we can group the quantitative measures of traffic safety into absolute and relative fatal crash frequency measures for our study, where the relative fatal crash frequency measure quantifies the fatal crash frequency relative to other factors that influence the fatal crash incidents.
As stated in (Lovegrove and Sayed, 2006), only a limited number of quantitative measures based on relative crash frequency can be found in the transportation literature. The paper by (Liu et al., 2016) focused on injury crash frequency and severity using absolute crash frequency. Their focus is on the data associated with the crash counts at intersections. The authors of (Chiou and Fu, 2013) reported that the contributors to crash frequency and severity differ in many cases; hence, they proposed an architecture based on a multinomial generalized Poisson (MGP) model, which allowed them to analyze the frequency and severity of traffic incidents together. The paper (Claros et al., 2017) focused on evaluating red light cameras for traffic safety. They also used absolute crash frequency as a measure to draw their conclusions, such as “the reduction in angle crashes due to red light camera” and “the increase in rear-end crashes.” Hence, the crash frequency plays a significant role in the analysis of crash incidents.
In the paper (Chiou and Fu, 2015), the authors analyzed what they called multi-period crash frequency and severity by using a Poisson distribution model that they proposed. The model utilized spatial and temporal dependencies of the traffic incidents; hence, it provides a type of relative crash frequency analysis. However, they do not consider the relative frequency with respect to a fatal point that has the highest frequency of fatality. The paper (Abdel-Aty and Pande, 2007) presented a location-based traffic safety analysis using crash frequency as a measure. They studied the traffic safety performance at both the individual crash level and the collective crash level, and in both cases the crash frequency played a major role. (Ivan et al., 2015) studied the effect of low-light conditions on accident (crash) frequency in a city in Romania. Their findings show that the traffic accident count (i.e., the crash frequency) is linearly correlated with the low-light condition. (Gitelman et al., 2010) presented a composite safety indicator for traffic safety and determined that the frequency (or rate) of fatal crash incidents is not sufficient to determine the traffic safety performance.
The author (Montella, 2010) used four quantitative evaluation criteria to compare seven crash-hotspot identification methods for road safety. The main quantitative measures used were the crash frequencies associated with property damage, the crash rate over time, and the empirical Bayes estimate of crash frequency. The authors (Zhang et al., 2012) studied the mathematical relationship between crash frequency and the characteristics of road segments associated with urban roadways. They used generalized additive models with crash frequency to study this relationship. Hence, it is clear from the transportation research literature that the crash frequency is the major influencer in traffic safety research. Therefore, it will be useful to the traffic safety research community if a better quantitative measure of traffic safety is proposed and evaluated.
2 Research Motivation
One of the motivating factors is the limited research on developing a relative quantitative measure for fatal crash incidents. The Transportation Research Board (TRB) provides a database at TRID: https://trid.trb.org/, where articles on transportation research can be found. A search of this database indicated the need for such a measure. Another motivating factor is that the widely used crash frequency measure is very sensitive to the cluster variability (caused by the randomness of the fatal crash incidents) when the clusters are created by clustering techniques. The third motivating factor is that a clustering technique requires the number of clusters, a priori, as a parameter, while the goal of applying clustering is to find the right number of clusters that can help characterize crash locations. Hence, this paper presents a novel fatal point concept and a validation method using k-means clustering and FARS data to mitigate the above problems.
3 An Example Using FARS Data Source
The cross-sectional fatal crash data in the Fatality Analysis Reporting System (FARS) indicate that the spread of fatality incidents on highways, in major cities, and in rural areas in the State of North Carolina (NC) is a significant threat to commuters and residents (NHTSA, 2016). FARS is a data source that provides factual information about fatal crash incidents occurring in the United States. As mentioned earlier, we considered only the 2015 NC fatal crash data of FARS to study the patterns of fatal crash incidents in NC. We can clearly observe in Fig. 1 that the location information (longitude and latitude) of fatal crash incidents in the FARS data sets draws the map of North Carolina. The emergence of such a state map from the fatal crash incidents is an indicator that traffic safety is a serious problem in NC.
Therefore, it is important to study this fatal crash data set by utilizing suitable data mining techniques to characterize fatal crash locations, quantify traffic safety, and develop solutions to the spatially oriented traffic safety problem in NC. The most commonly used quantitative measure in traffic safety analytics is crash frequency; therefore, we used it in this example along with k-means clustering to describe these data sets. For example, the application of k-means clustering with k = 18 to the fatal crash data in Fig. 1 produced the 18 clusters shown in Fig. 2, which highlights some of the meaningful clusters. Also note that the points highlighted in red are the centroids of the clusters, not the actual locations where fatal crash incidents occurred. The scatter plots in these figures demonstrate that the frequency of fatality is very high in major cities, such as Charlotte, Greensboro, and Raleigh, and low in the northeastern areas of NC. They also show that the frequency of crashes is high along the main arterial highways, such as I-40, I-85, and I-95. This distribution of clusters would naturally encourage one to apply a clustering technique repeatedly with increasing values of the parameter k to detect more clusters as needed.
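The clustering step of this example can be sketched as follows. This is a minimal pure-Python version of Lloyd's k-means iteration, not the implementation used to produce the figures; the function names `kmeans` and `crash_frequency`, the random seed, and the sample coordinates are illustrative assumptions.

```python
import random
from collections import Counter


def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd's algorithm on (longitude, latitude) pairs.

    Hypothetical helper: initial centroids are sampled from the data,
    so different seeds can give different clusters.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(
                range(k),
                key=lambda c: (x - centroids[c][0]) ** 2 + (y - centroids[c][1]) ** 2,
            )
            for x, y in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return labels, centroids


def crash_frequency(labels):
    """Number of fatal crash records assigned to each cluster label."""
    return Counter(labels)


if __name__ == "__main__":
    # Made-up coordinates for illustration only (not FARS records).
    sample = [(-80.84, 35.23), (-80.80, 35.20), (-78.64, 35.78),
              (-78.60, 35.80), (-79.79, 36.07)]
    labels, centroids = kmeans(sample, k=3)
    print(crash_frequency(labels))
```

Because the result depends on the initial centroids, repeated runs with different seeds or different values of k illustrate the cluster variability discussed in this section.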
The repeated application of a clustering method creates variability in the clusters, which is an effect of the randomness of the fatal crash incidents. This problem can be conceptualized with a simple example: if we consider two clusters with the same crash frequencies, one may interpret those two clusters as having the same traffic safety quantitatively. However, if one of these clusters has a fatal point that has the highest number of fatal crash incidents, and the other cluster has several isolated fatal crash incidents, then the assumption of the same traffic safety for these two clusters may not be appropriate (i.e., the variability of the clusters can affect the clusters’ crash frequencies). This analogy is still valid when we have two clusters with two different crash frequencies in which one cluster has smaller segments of locations with very high numbers of fatal crash incidents. The authors (Claros et al., 2017) also stated that the crash frequency (i.e., an absolute crash frequency) may not be suitable to characterize crash incidents or quantify traffic safety. Thus, the same concern applies to the characterization and quantification of fatal crash incidents.
The FARS data set is freely available without access restrictions; hence, it supports reproducible research. It consists of several data sets in spreadsheet form, which is convenient for many applications. We used the accident-related 2015 NC fatal crash data, which include 1,275 fatal crash incidents. However, we had to perform minor cleansing to identify and remove outliers with unusual values, such as 888.8888 and 999.9999. We found such outliers in the longitude or latitude variables of 12 records; hence, the removal of these records reduced the number of records to 1,263, which is still large enough to perform the analysis. The number of independent variables found in this data is 51, but in this paper we are interested only in the longitude and latitude variables associated with each fatal crash incident.
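The cleansing step described above can be sketched as a simple filter. The record layout (dictionaries with `longitude` and `latitude` keys), the helper names, and the tolerance are assumptions made for illustration; only the two placeholder values come from the text.

```python
# Placeholder coordinate values named in the text; treated here as
# "location unknown" codes rather than real coordinates.
PLACEHOLDER_VALUES = (888.8888, 999.9999)


def is_valid_location(longitude, latitude, tol=1e-6):
    """True when neither coordinate equals one of the placeholder values."""
    return not any(
        abs(value - code) < tol
        for value in (longitude, latitude)
        for code in PLACEHOLDER_VALUES
    )


def clean_records(records):
    """Keep only the records whose longitude and latitude are usable."""
    return [
        r for r in records
        if is_valid_location(r["longitude"], r["latitude"])
    ]


if __name__ == "__main__":
    # Made-up records for illustration only.
    records = [
        {"longitude": -80.84, "latitude": 35.23},
        {"longitude": 999.9999, "latitude": 35.10},
        {"longitude": -78.64, "latitude": 888.8888},
    ]
    print(len(clean_records(records)))
```

A range check (longitude within ±180 and latitude within ±90) would be an alternative way to catch the same records.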
4 Proposed Approaches
The proposed approaches bring two new contributions to the traffic safety research domain. The first is a concept that leads to the development of a ratio as a quantitative measure of traffic safety. The second is fatal point detection through the induction of rounding errors, where the fatal point crash frequency is a major contributor to making the proposed quantitative measure less sensitive to cluster variability.
4.1 Proposed Fatal Point Concept
The proposed concept is presented in Fig. 3, where the size of the shapes represents the proportionality of the frequency rather than the size of the cluster region on the map of fatal crash incidents. To explain the concept, let us first define two variables and a concept: the fatal crash frequency of a cluster (the cluster frequency) and the frequency of the fatal point of that cluster (the fatal point frequency). The fatal point is a new concept, and it is defined as follows:
The fatal point is a logical location that is mapped to a set of locations (longitude and latitude) that form a segment with the highest frequency of fatality within a cluster.
Fig. 3 presents two examples that justify the reason for using a fatal point in the development of the proposed quantitative measure. The intra-domain example shows the variability between two clusters in a domain of clusters generated using a single value of k in k-means clustering. Similarly, the inter-domain example shows the cluster variability between two clusters from two domains of clusters generated using two different values of k in k-means clustering.
The intra-domain example shows two clusters that have the same crash frequency. It also shows their fatal point frequencies of fatality, which differ (one of the variabilities mentioned earlier). Hence, the ratios between the cluster frequency and the fatal point frequency of the two clusters are unequal.
It shows that one of the two clusters has lower traffic safety than the other. Similarly, the inter-domain example shows two clusters that have distinct crash frequencies (the second variability), but whose fatal points have the same frequency of fatality. Hence, the ratios of these two clusters are also unequal.
It also shows that one of the two clusters has lower traffic safety than the other. Different examples can be generated using this analogy (or these patterns), but these two patterns can form the basis for them.
4.2 Proposed Quantitative Measure
The proposed measure is devised based on the aforementioned concept and the ratio between the cluster frequency and the fatal point frequency. Hence, we represent the quantitative measure of traffic safety by a variable that varies with the variability of the cluster, and define it as this ratio, where the fatal point frequency is the frequency of the segment that has the highest frequency of fatality (i.e., the fatal point). The idea is to make the new quantitative measure uncorrelated with the cluster frequency so that it is less sensitive to the variability of clusters, which is generated by the randomness of fatal crash incidents, than the cluster frequency, which is generally affected significantly by cluster variability.
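A small numerical sketch may help to make the ratio concrete. It assumes that the ratio is taken as the cluster frequency divided by the fatal point frequency, which is one reading of "the ratio between the cluster frequency and the fatal point frequency"; the function name `safety_ratio` and all numbers are hypothetical.

```python
def safety_ratio(cluster_frequency, fatal_point_frequency):
    """Cluster frequency divided by fatal point frequency.

    Assumption: the direction of the ratio (cluster over fatal point)
    is an interpretation of the text, not a confirmed definition.
    """
    if fatal_point_frequency <= 0:
        raise ValueError("fatal point frequency must be positive")
    if fatal_point_frequency > cluster_frequency:
        raise ValueError("a fatal point cannot exceed its cluster")
    return cluster_frequency / fatal_point_frequency


if __name__ == "__main__":
    # Intra-domain pattern: same cluster frequency, different fatal points.
    print(safety_ratio(60, 10), safety_ratio(60, 5))
    # Inter-domain pattern: different cluster frequencies, same fatal point.
    print(safety_ratio(60, 5), safety_ratio(30, 5))
```

In both patterns the two clusters receive different ratios even though one of their two frequencies is identical, which is the distinction that the crash frequency alone cannot express.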
4.3 Proposed Fatal Point Detection
A method is proposed to define and detect the fatal point of a cluster by dividing the cluster into vertical segments of about 0.7 miles in width along the longitude variable. Vertical segments were selected because we observed a horizontal traffic flow pattern over the state of NC; however, future research will include horizontal segments as well. In this paper, we have also fixed the width of the segment at 0.7 miles, and it will be optimized in future research using an empirical approach. Each segment can be interpreted as a set of fatal crash locations in the proposed approach. These sets are created by rounding the longitude values of a cluster to the hundredth decimal place, based on the information available in (StackExchange, 2017). The rounded longitude locations are called the logical locations, and they create redundancy in the fatal crash locations, thus forming the aforementioned sets. The redundancies are used, along with the statistical measure “mode,” to obtain frequencies of fatality for these sets. The set with the highest frequency of fatality in the cluster is defined as the fatal point and is detected by the approach.
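The detection procedure described above (round the longitudes, then take the mode) can be sketched as follows. The function name `detect_fatal_point`, the input layout, and the tie-breaking rule (the first logical location seen wins) are assumptions; the text does not say how ties between segments are resolved.

```python
from collections import Counter


def detect_fatal_point(locations, decimals=2):
    """Find the fatal point of one cluster.

    locations: list of (longitude, latitude) pairs in the cluster.
    Returns (logical_longitude, frequency, member_locations).
    """
    if not locations:
        raise ValueError("cluster has no locations")
    # Rounding the longitude induces the "logical locations".
    logical = [round(lon, decimals) for lon, _ in locations]
    # The mode of the logical locations identifies the densest segment.
    # Ties go to the logical location encountered first (an assumption).
    mode_lon, frequency = Counter(logical).most_common(1)[0]
    members = [loc for loc, key in zip(locations, logical) if key == mode_lon]
    return mode_lon, frequency, members


if __name__ == "__main__":
    # Made-up cluster for illustration only.
    cluster = [(-80.843, 35.22), (-80.839, 35.30),
               (-80.841, 35.10), (-78.640, 35.78)]
    lon, freq, members = detect_fatal_point(cluster)
    print(lon, freq, len(cluster))
```

The returned frequency is the fatal point frequency of the cluster, and `len(cluster)` is its cluster frequency.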
5 Results and Performance Evaluation
The number of clusters that we considered for the k-means clustering varies from k = 8 to k = 128, with an increment of 1. Hence, they produce 121 domains of clusters. The cluster domain with k = 16 is selected to explain the steps of the simulation. When k-means clustering with k = 16 is applied to the two-dimensional map (longitude and latitude) of the 2015 NC fatal crash data in Fig. 1, it produces the spatial characteristics of the clusters presented in Fig. 4, which are highlighted using different colors and markers. It clearly characterizes the fatal crash incidents as clusters of locations that are meaningful to describe the map of North Carolina. The fatal crash frequencies of the 16 clusters are calculated and presented in the second column of Table 1. Also note that the first column of the table provides the labels for the 16 clusters.
We can observe from the table that the largest cluster frequency is 141, indicating the cluster with the largest number of fatal crashes in NC in 2015, which is associated with the areas (Charlotte) highlighted in “magenta” color. Similarly, the smallest cluster frequency is 37, indicating the cluster with the smallest number of fatal crashes in NC in 2015, which is associated with the areas (northeast of NC) in “cyan” color. So, using the crash frequency as a measure, we can say that the former cluster has the lowest traffic safety and the latter cluster has the highest traffic safety among the 16 clusters determined by k-means clustering. Hence, it is important to note that a larger crash frequency means lower traffic safety (i.e., the crash frequency provides an inverse measure of traffic safety). In the third column of Table 1, the normalized values of the cluster frequency are also presented to transform the measure to a common scale.
5.1 Results from Fatal Point Detection
The fatal crash frequency provides the information about the number of fatal crash locations with longitude and latitude values. For example, the number of fatal crash locations in cluster 1 is 74; therefore, this cluster has at most 74 distinct longitude values and their corresponding latitude values. In the fatal point detection, we are interested in the vertical segments; therefore, we induced rounding errors in the longitude values by rounding them to the hundredth decimal place. Then the statistical measure “mode” is used to detect the number of fatal crash locations in close proximity, which we call logical fatal crash locations. The set of locations associated with the most frequently occurring logical fatal crash location is recorded, and we call it a fatal point. The number of fatal crash incidents in the fatal points of the 16 clusters is listed in the fourth column of Table 1. One location from each fatal point of the clusters is also listed in the fifth column of the table to help the readers associate them visually with the clusters highlighted in Fig. 4. For this particular domain of clusters (i.e., k = 16), the cluster frequency and the fatal point frequency are correlated, with a correlation value of 0.7129. This does not mean that a high correlation is guaranteed for other values of k.
Therefore, we repeated the correlation analysis with the values of k from 8 to 128, and the results are presented in Fig. 5 in blue. It shows the following: when the number of clusters determined by k-means clustering is low (i.e., the size of the cluster is large), the correlations between the cluster frequency and the fatal point frequency are also low (about 0.4). When the number of clusters is increased, these correlations also increase and stabilize at about 0.75. This makes sense: when the cluster sizes become smaller, the cluster and the fatal point become closer. With this result, we can validate the use of k-means clustering and the induction of rounding errors. Although this research considered k-means clustering only, density-based spatial clustering of applications with noise (Ester et al., 1996), ordering points to identify the clustering structure (Ankerst et al., 1999), k-means++, and k-medoids will be used in future research to evaluate the concept of the fatal point.
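The correlation analysis of this subsection can be sketched as below. The text does not name the correlation coefficient, so the sketch assumes the Pearson coefficient; the function name `pearson` and the sample frequencies are hypothetical and are not the values of Table 1.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / (spread_x * spread_y)


if __name__ == "__main__":
    # One value per cluster in a single domain (made-up numbers).
    cluster_frequencies = [74, 141, 37, 90, 62]
    fatal_point_frequencies = [6, 11, 3, 5, 7]
    print(pearson(cluster_frequencies, fatal_point_frequencies))
```

Repeating this computation for every value of k from 8 to 128 would give one correlation value per domain, which is the kind of curve plotted in Fig. 5.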
5.2 Results from Quantitative Measure
The traffic safety measure defined in Section 4.2 (equation (4.3)) is then calculated and normalized. The results are presented in the sixth column of Table 1. The correlations between the cluster frequency and the proposed measure, with the number of clusters varying from 8 to 128, are calculated and presented in Fig. 5 in red. It shows the following: initially, when the number of clusters determined by k-means clustering is low, the correlations between the cluster frequency and the proposed measure are also low (about 0.05). Later, when the number of clusters is increased, these correlations also increase and stabilize at about 0.5. Now, comparing the correlation results in blue and red, we can determine that the proposed quantitative measure, as a random variable, has a lower correlation with the cluster frequency than the fatal point frequency has. We are also interested in whether these correlations are themselves correlated with respect to the number of clusters. Hence, we calculated the correlation of these correlations and obtained a very low correlation value of 0.1452. This indicates that when the cluster frequency and the fatal point frequency are correlated, it is less likely that the cluster frequency and the proposed measure will be correlated. Similarly, when the cluster frequency and the fatal point frequency are uncorrelated, it is less likely that the cluster frequency and the proposed measure will be uncorrelated. Hence, the selection of the proposed measure as an alternative to the cluster frequency is appropriate, so that we can limit the effect of the cluster variability.
5.3 Performance Evaluation
The expectation is that the proposed quantitative measure performs well in terms of measuring traffic safety accurately under the variability issues that result from the randomness of the fatal crash incidents. Therefore, the goal of the performance evaluation is to use the results obtained in the previous subsections and compare them with the results of the standard crash frequency measure. The average traffic safety values of NC per domain are calculated from the normalized traffic safety values (the normalized cluster frequency and the normalized proposed measure) of the clusters in that domain. As mentioned earlier, there are 121 domains (based on the values of k ranging from 8 to 128); therefore, we have 121 average traffic safety values for NC. These values are presented in Fig. 6, where blue represents the averages of the normalized cluster frequency and red represents the averages of the normalized proposed measure. At the same time, the averages of the variances are also calculated and presented in Fig. 7.
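The per-domain summary can be sketched as follows. The normalization scheme is not stated in the text, so this sketch assumes division by the largest value in the domain, and it uses the population variance; both choices, the function names, and the sample values are assumptions.

```python
from statistics import mean, pvariance


def normalize(values):
    """Scale values to (0, 1] by dividing by the maximum (assumed scheme)."""
    largest = max(values)
    if largest <= 0:
        raise ValueError("values must contain a positive maximum")
    return [v / largest for v in values]


def domain_summary(values):
    """Average and variance of the normalized values of one domain."""
    normalized = normalize(values)
    return mean(normalized), pvariance(normalized)


if __name__ == "__main__":
    # One raw value per cluster in a single domain (made-up numbers).
    cluster_frequencies = [74, 141, 37, 90, 62]
    average, variance = domain_summary(cluster_frequencies)
    print(average, variance)
```

Applying `domain_summary` to each of the 121 domains, once for the cluster frequency and once for the proposed measure, would yield the two pairs of curves compared in Fig. 6 and Fig. 7.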
To evaluate the performance of the proposed measure against the standard crash frequency measure, we divided the 121 domains into four groups and analyzed the results in Fig. 6 and Fig. 7 together. The groups are selected based on the changes that we can observe: the values of k from 8 to 24 form group 1, 25 to 40 form group 2, 41 to 64 form group 3, and 65 to 128 form group 4. Within group 1, both the proposed measure and the standard crash frequency measure show an increase in traffic safety. Note that a decrease in the blue graph (i.e., a decrease in fatal crashes) means an increase in traffic safety. Although both measures show an increase in traffic safety, we can clearly observe that the variability of the results from the proposed measure is significantly lower. This can also be confirmed from the variance graph in Fig. 7.
Within group 2, some stability in the traffic safety rating is displayed by both measures; however, the proposed measure is strongly stable at a value of approximately 0.54. The crash frequency measure has a very high oscillation, indicating high variability. Interestingly, the two measures show conflicting results within the next group. That is, the proposed measure indicates that the average traffic safety of the state is decreasing, whereas the crash frequency measure indicates that it is increasing. The question is, which one is correct? Considering the results of all four groups, three of the four groups suggest that the proposed measure is less sensitive to the variability; hence, by voting, we accept the output of the proposed measure and assume that traffic safety is decreasing within that group.
The results of group 4 are also very interesting and useful because a larger value of k means that the map of NC is divided into a larger number of clusters with smaller cluster sizes. Hence, the results of this group are highly suitable to describe the overall safety of the entire state. The results in Fig. 6 and Fig. 7 indicate that the proposed measure is less sensitive to the cluster variability, with convergence to an average traffic safety value of 0.35, a variance that is less than 0.05, and a variability that is very small. Hence, using the proposed measure, we may be able to say that the traffic safety in NC is about 35%. In other words, the ratio between the cluster crash frequency and the fatal point crash frequency is about 1 to 3, indicating high fatal crashes.
The proposed measure is a better alternative to the standard crash frequency measure because it is less sensitive to the variability of clusters generated by the k-means clustering technique. It also reports that the traffic safety rating of NC is 35% (indicating a high number of fatal crashes), using the fatal crash location information available in the 2015 NC fatal crash data set of the FARS data source. However, it is important to note that the results and findings reported in this paper are limited, and further significant research is required to support the findings. Hence, future research will include experimental analysis using other location-based clustering algorithms, such as density-based spatial clustering of applications with noise, ordering points to identify the clustering structure, k-means++, and k-medoids.
Future research will also include the analysis of the FARS data from other years so that the accuracy of the above rating can be confirmed. Once that research is completed, the FARS data from other states will be utilized to compare the traffic safety ratings between the states. In addition, the effect of varying the width of the fatal point cluster segments, in contrast to the fixed width of 0.7 miles, will be studied to determine the optimal size for the fatal point segments. In the present form, vertical segments are considered due to the horizontal flow of traffic patterns over the state map, and further study will be conducted with the inclusion of horizontal segments.
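As a starting point for studying the segment width, the width that corresponds to a 0.01 degree step in longitude can be estimated with a spherical approximation. The constant of about 111.32 km per degree of longitude at the equator, the function name `segment_width_miles`, and the representative latitude of 35.5 degrees for NC are assumptions of this sketch.

```python
import math

KM_PER_DEGREE_AT_EQUATOR = 111.32  # approximate, one degree of longitude
KM_PER_MILE = 1.609344


def segment_width_miles(latitude_deg, step_deg=0.01):
    """Approximate east-west width of a longitude step at a given latitude.

    Spherical approximation: the width shrinks with cos(latitude).
    """
    width_km = (
        KM_PER_DEGREE_AT_EQUATOR
        * math.cos(math.radians(latitude_deg))
        * step_deg
    )
    return width_km / KM_PER_MILE


if __name__ == "__main__":
    print(segment_width_miles(0.0))   # at the equator
    print(segment_width_miles(35.5))  # assumed representative NC latitude
```

At the equator the estimate is about 0.69 miles, which is consistent with the 0.7-mile figure; under this approximation the same rounding step gives a somewhat narrower segment (roughly 0.56 miles) at the latitude of NC, so the rounding precision and the physical width are not interchangeable when the width is tuned.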
- Abdel-Aty and Pande (2007) Abdel-Aty, M., Pande, A., 2007. Crash data analysis: Collective vs. individual crash level approach. Journal of Safety Research 38, 581–587.
- Ankerst et al. (1999) Ankerst, M., Breunig, M.M., Kriegel, H.P., Sander, J., 1999. OPTICS: ordering points to identify the clustering structure, in: ACM SIGMOD Record, ACM. pp. 49–60.
- Arthur and Vassilvitskii (2007) Arthur, D., Vassilvitskii, S., 2007. k-means++: The advantages of careful seeding, in: Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, Society for Industrial and Applied Mathematics. pp. 1027–1035.
- Chiou and Fu (2013) Chiou, Y.C., Fu, C., 2013. Modeling crash frequency and severity using multinomial-generalized Poisson model with error components. Accident Analysis & Prevention 50, 73–82.
- Chiou and Fu (2015) Chiou, Y.C., Fu, C., 2015. Modeling crash frequency and severity with spatiotemporal dependence. Analytic Methods in Accident Research 5, 43–58.
- Claros et al. (2017) Claros, B., Sun, C., Edara, P., 2017. Safety effectiveness and crash cost benefit of red light cameras in Missouri. Traffic Injury Prevention 18, 70–76.
- Ester et al. (1996) Ester, M., Kriegel, H.P., Sander, J., Xu, X., et al., 1996. A density-based algorithm for discovering clusters in large spatial databases with noise, in: KDD, pp. 226–231.
- Gitelman et al. (2010) Gitelman, V., Doveh, E., Hakkert, S., 2010. Designing a composite indicator for road safety. Safety science 48, 1212–1224.
- Ivan et al. (2015) Ivan, K., Haidu, I., Benedek, J., Ciobanu, S., 2015. Identification of traffic accident risk-prone areas under low-light conditions. Natural Hazards and Earth System Sciences 15, 2059–2068.
- Jin and Han (2011) Jin, X., Han, J., 2011. K-medoids clustering, in: Encyclopedia of Machine Learning. Springer, pp. 564–565.
- Liu et al. (2016) Liu, Y., Li, Z., Liu, J., Patel, H., 2016. Vehicular crash data used to rank intersections by injury crash frequency and severity. Data in brief 8, 930–933.
- Lovegrove and Sayed (2006) Lovegrove, G.R., Sayed, T., 2006. Macro-level collision prediction models for evaluating neighbourhood traffic safety. Canadian Journal of Civil Engineering 33, 609–621.
- MacQueen et al. (1967) MacQueen, J., et al., 1967. Some methods for classification and analysis of multivariate observations, in: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, Oakland, CA. pp. 281–297.
- Montella (2010) Montella, A., 2010. A comparative analysis of hotspot identification methods. Accident Analysis & Prevention 42, 571–581.
- NHTSA (2016) NHTSA, 2016. National Highway Traffic Safety Administration, Fatality Analysis Reporting System. https://www.nhtsa.gov/research-data/fatality-analysis-reporting-system-fars. (accessed: 2017.07.08).
- Park et al. (2015) Park, S., Musey, K., Press, J., McFadden, J., 2015. Exploring roundabouts safety and operation in the context of design consistency. Institute of Transportation Engineers. ITE Journal 85, 43.
- Pour-Rouholamin and Jalayer (2016) Pour-Rouholamin, M., Jalayer, M., 2016. Analyzing the severity of motorcycle crashes in North Carolina using highway safety information systems data. Institute of Transportation Engineers. ITE Journal 86, 45.
- Ramos et al. (2015) Ramos, L., Silva, L., Santos, M.Y., Pires, J.M., 2015. Detection of road accident accumulation zones with a visual analytics approach. Procedia Computer Science 64, 969–976.
- Schultz et al. (2011) Schultz, G.G., Dudley, S.C., Saito, M., 2011. Transportation Safety Data and Analysis. Volume 3: Framework for Highway Safety Mitigation and Workforce Development. Technical Report.
- StackExchange (2017) StackExchange, 2017. Geographic Information Systems, Measuring Accuracy of Latitude and Longitude? https://gis.stackexchange.com/questions/8650/measuring-accuracy-of-latitude-and-longitude. (accessed: 2017.07.08).
- Suthaharan (2015) Suthaharan, S., 2015. Machine Learning Models and Algorithms for Big Data Classification: Thinking with Examples for Effective Learning. volume 36. Springer.
- Zhang et al. (2012) Zhang, Y., Xie, Y., Li, L., 2012. Crash frequency analysis of different types of urban roadway segments using generalized additive model. Journal of safety research 43, 107–114.