Mapping road safety features from streetview imagery: A deep learning approach

07/15/2019 ∙ by Arpan Sainju, et al. ∙ The University of Alabama

Each year, around 6 million car accidents occur in the U.S. on average. Road safety features (e.g., concrete barriers, metal crash barriers, rumble strips) play an important role in preventing or mitigating vehicle crashes. Accurate maps of road safety features are an important component of safety management systems for federal or state transportation agencies, helping traffic engineers identify locations in which to invest in safety infrastructure. In current practice, mapping road safety features is largely done manually (e.g., observations on the road or visual interpretation of streetview imagery), which is both expensive and time consuming. In this paper, we propose a deep learning approach to automatically map road safety features from streetview imagery. Unlike existing Convolutional Neural Networks (CNNs) that classify each image individually, we propose to further add a Recurrent Neural Network (Long Short-Term Memory) to capture the geographic context of images (the spatial autocorrelation effect along linear road network paths). Evaluations on real world streetview imagery show that our proposed model outperforms several baseline methods.




1 Introduction

Every year, around 6 million car accidents occur in the U.S. on average [1]. Traffic safety has long been an important societal issue. To avoid or mitigate vehicle crashes, traffic engineers place roadside barriers to prevent out-of-control vehicles from veering off the road and hitting roadside hazards. Such road safety features can also prevent vehicles from crossing into the path of other vehicles. During the winter season, vehicles can become more difficult to control on icy and slippery road surfaces, particularly at high speed. Barriers on the roadside act as a safety precaution in such cases. Other safety features such as rumble strips help alert inattentive drivers who are deviating from their lanes. Figure 1 shows three common types of road safety features: rumble strips, concrete barriers, and metal crash barriers. Federal, state and local governments spend several hundred billion dollars each year on transportation infrastructure development and maintenance [2]. Mapping safety features along road networks can play a crucial role in managing and maintaining road safety infrastructure. Traffic engineers can use a detailed safety feature map to identify locations where investment in new safety infrastructure is needed.

Figure 1: Three common classes of road safety features on Google Streetview Imagery. (a) Rumble strips (b) Concrete barrier (c) Metal crash barrier

In current practice, mapping road safety features is mostly done manually by well-trained traffic engineers driving through road networks or visually interpreting streetview images. A streetview image is a geo-referenced image taken at a specific location on the ground. One common example is Google Streetview imagery, collected by vehicles equipped with GPS and cameras driving along streets on road networks. However, such a manual process is both expensive and time-consuming. Given the large amount of information to collect, the cost of these approaches quickly becomes prohibitive.

The focus of this paper is to develop a deep learning algorithm that can automatically map road safety features from streetview imagery. The results can be used by the transportation agencies in management and maintenance of road safety infrastructures, as well as planning the investment on new infrastructures. Specifically, we can utilize a small set of manually labeled imagery (whose road safety features are visually inspected) to learn a classification model. Then the model can be used to classify safety feature types on a large number of unlabeled imagery along the road network.

However, mapping road safety features based on streetview imagery poses several unique challenges. First, streetview images are not independent and identically distributed along a road network. Instead, the safety feature types of consecutive images along the same road network path often resemble each other (also called the spatial autocorrelation effect). Second, the spatial scales of road safety features may vary across class categories. For example, concrete barriers are often very long (e.g., miles). In contrast, metal crash barriers are much shorter (e.g., a few hundred meters). Third, individual images may be imperfect due to noise or obstacles. For example, a safety feature can be blocked by a large vehicle and thus become invisible in an image.

To address these challenges, we propose a deep learning model based on both convolutional and recurrent units. We use a convolutional neural network (CNN) model to extract semantic features from individual images. We also use a recurrent neural network, Long Short-Term Memory (LSTM), to model the spatial sequential structure of the features extracted from consecutive images along a road network path (the spatial autocorrelation effect). The integration of CNN and LSTM enables our deep learning model to utilize not only the content of individual images but also the geographic context between images. Evaluations on real world streetview images collected from highways in Alabama show that our approaches outperform several baseline methods in classification performance.

In summary, the contributions of this paper are listed below:

  • To the best of our knowledge, we are the first to explore a deep learning approach on Google Streetview imagery for road safety feature mapping.

  • We propose to use integrated deep learning models that combine CNN and LSTM. The integrated models can utilize not only the content of individual images but also the spatial sequential structure between images.

  • We compare our approaches with several baseline methods on two real world streetview imagery datasets collected in Alabama. We achieve 3 and 5 percent improvement in F-score over the best baseline method on two different test datasets.

  • We perform a case study of mapping road safety features with 69,500 streetview images over all the major highways in the entire state of Alabama.

The outline of the paper is as follows. Section 2 discusses related work. Section 3 formally defines the problem. Section 4 introduces the approaches. Section 5 summarizes the results of our experimental evaluation on two real world datasets and discusses the case study of mapping road safety features over all the major highways in Alabama. Section 6 concludes the paper with a discussion of future work.

2 Related Works

In this section, we briefly review the relevant research on transportation safety and deep learning techniques for spatial and spatiotemporal data.

2.1 Transportation Safety

Related work in transportation safety often focuses on analyzing the protective effect of different road safety features (e.g., roadside barriers) [3, 4, 5]. For example, studies in [6, 7, 8] quantify the protective effect of barriers with regard to motorcyclist injury. Work in [9] analyzes the performance of roadside barriers in relation to vehicle size and type. [10] studies how to increase the effectiveness of roadside barriers in safety protection. For example, studies found that concrete barriers can withstand high-energy truck crashes, but can also cause more fatalities. Some recent work focuses on developing energy-absorbing barriers [11]. Besides the protective effect, other studies on roadside barriers focus on their impact on mitigating near-road air pollution [12]. The study of the effect of solid barriers on the dispersion of roadway emissions in [13, 14] shows that roadside barriers are one of the most practical mitigation methods. There are also works that analyze spatial patterns in traffic accident event locations, such as network hotspots and colocation patterns [15, 16, 17]. [18] proposes efficient algorithms to identify primary corridors from cyclists' GPS trajectories on urban road networks in order to study riding behaviors for safety issues. [19] presents techniques to detect coarse-scale hotspots of road failure events through geo-tagged tweets from social media.

Recently, researchers have used Google Streetview imagery along the road network for traffic sign detection in roadway inventory management [20, 21, 22]. Other works use streetview imagery to estimate the demographic makeup of neighborhoods [23] and to assess street-level greenness in an urban area [24]. To the best of our knowledge, there is little research on utilizing streetview imagery to automatically map road safety features.

2.2 Deep Learning for Spatio-Temporal Data

In recent years, deep learning techniques have seen rapid growth in the field of spatiotemporal data mining [25, 26]. One common approach is to integrate deep convolutional neural networks (CNN) with recurrent neural networks such as Long Short-Term Memory (LSTM). The CNN component can be used to model the spatial dependency structure in one temporal snapshot, while the LSTM component can be used to model temporal dynamics between different snapshots. For example, [27] uses fully convolutional networks with LSTM to estimate vehicle count maps based on city cameras. [28] uses a CNN-LSTM model together with multi-view learning to predict taxi demand. [29] uses a one-dimensional CNN to capture spatial features of traffic flow and two LSTM models to capture the short-term variability and periodicities of traffic flow. [30] addresses the traffic prediction problem with a new spatiotemporal model. It uses a flow gating mechanism to learn the dynamic similarity between locations, and a periodically shifted attention mechanism to handle long-term periodic temporal shifting. [31, 32] study better traffic accident prediction to improve transportation and public safety. In these existing works, the LSTM model is typically used to model temporal dynamics between multiple spatial snapshots. In contrast, we use LSTM to capture the linear spatial sequential structure between consecutive images along a road network path.

3 Problem Description

In this section, we discuss some basic concepts and describe our problem.

Road network: A road network is a network whose nodes are road intersections, and whose edges are road segments. At the same time, a road network is also a spatial network whose nodes are spatial points and whose edges are spatial line strings. In other words, a road network has both graph properties and geometric properties.

Streetview imagery: Streetview imagery is a sequence of geo-referenced images whose locations are embedded on road network edges (in the form of line strings). The imagery is collected through driving a vehicle equipped with GPS and camera, so that each image can be geo-referenced based on the GPS time stamp. In this paper, we used Google Streetview API to select imagery at a regular spatial interval of 20 meters.

Road safety feature: A road safety feature is defined as a measure or piece of infrastructure placed on a road to improve safety. We consider the three most common safety features: rumble strips, concrete barriers and metal crash barriers. Figure 1 shows examples of the three safety features from Google Streetview imagery.

  • Rumble Strips: Rumble strips (Figure 1(a)) are milled grooves or rows of raised pavement markers placed perpendicular to the direction of travel, or a continuous sinusoidal pattern milled longitudinal to the direction of travel. They create a vibration and rumbling sound, transmitted through the wheels into the vehicle interior, which can alert drivers who have drifted from their lanes.

  • Concrete barrier: A concrete barrier (Figure 1(b)) is a rigid barrier that is easy to maintain. This type of barrier is often used on roads where traffic in opposing directions flows in close proximity due to lack of space.

  • Metal crash barrier: A metal crash barrier (Figure 1(c)), also known as a guardrail, is usually made from steel beams or rails. It reduces damage to the vehicle and its occupants by absorbing the impact energy of the colliding vehicle. It can also act as a good visual guide at night, helping drivers maintain their lane position.

Problem Definition: Given a road network with geo-referenced streetview imagery sampled at an equal distance interval, as well as a small collection of labeled imagery sequences (each image has three binary class labels corresponding to the existence of rumble strips, concrete barrier, and metal crash barrier respectively), the road safety feature mapping problem aims to learn a classification model that can predict the labels for all unlabeled images on the road network. Since each image may contain multiple types of road safety features at the same time, our problem is a multi-label classification problem.

Figure 2: Overall framework of our deep learning approach

4 Proposed Approach

In this section, we introduce our proposed deep learning approaches. Figure 2 illustrates the overall framework of our proposed models. The bottom component shows the data collection process. We sample a number of spatial points along road network edges at an equal distance interval (e.g., 20 meters), and then use the Google Streetview API to download geo-referenced imagery at those point locations. We fixed the distance interval at 20 meters because the average length of some road safety features, such as metal crash barriers, is only a few hundred meters. With a larger interval, there may not be enough images of short barriers such as metal crash barriers. Although a smaller interval provides a finer-grained dataset, it increases the number of streetview images to be downloaded, which incurs extra cost. The middle component of our proposed models is a retrained CNN model that extracts low-dimensional features from individual images. The last component is the LSTM layer. In contrast to existing works, our LSTM does not capture temporal dynamics between different spatial snapshots, but represents the spatial sequential pattern between consecutive images along road network edges.
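The equal-interval sampling step can be illustrated with a minimal sketch. It assumes the road edge is a polyline whose vertices are already in a projected coordinate system measured in meters; the function name `sample_points` and this input format are hypothetical and are not taken from our implementation.

```python
import math

def sample_points(polyline, interval=20.0):
    """Return points spaced `interval` apart along a polyline.

    Assumption: `polyline` is a list of (x, y) vertices in a projected,
    metric coordinate system. Raw latitude/longitude would need to be
    projected first.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    points = [polyline[0]]   # always keep the starting vertex
    next_at = interval       # distance along the line of the next sample
    travelled = 0.0          # length of the segments processed so far
    for (x0, y0), (x1, y1) in zip(polyline, polyline[1:]):
        seg_len = math.hypot(x1 - x0, y1 - y0)
        # Emit every sample that falls on the current segment.
        while seg_len > 0 and next_at <= travelled + seg_len:
            t = (next_at - travelled) / seg_len
            points.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
            next_at += interval
        travelled += seg_len
    return points
```

Each returned point would then be used as the location of one streetview image request. Distances are measured along the line rather than between consecutive samples, so spacing stays uniform around bends.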

4.1 Extract Image Feature with CNN

The Convolutional Neural Network (CNN) was developed mainly for image classification. CNN introduces the concept of parameter sharing, which allows the model to learn fewer parameters than a regular neural network. Similar to regular neural networks, a CNN consists of a sequence of layers. We briefly describe each layer below.

Input Layer: The input layer holds the raw pixel color values (RGB) of the images. Usually, the pixel values are normalized to stabilize the learning process and reduce the number of training epochs required to train deep learning models.

Convolution Layer: The convolutional layer transforms the input using the convolution operation. A convolution operation multiplies a pixel and its neighboring pixels' color values (RGB) element-wise by a small matrix, known as the convolution filter, and sums the products. Different filters are convolved across all the pixels in an image. Hand-designed filters such as horizontal and vertical edge detectors (e.g., Sobel filters) extract edge features from the image. In CNNs, filters are not predefined; they are learned during the training process. By stacking convolution layers on top of each other, a CNN can extract increasingly abstract information.

ReLU Layer: ReLU stands for Rectified Linear Unit, which is a type of activation function commonly used in neural networks. Activation functions are applied to introduce non-linear properties to the network. The function returns 0 if it receives any negative input. For any positive value $x$, the function returns the same value back. So, it can be written as $f(x) = \max(0, x)$. The ReLU activation function is computationally inexpensive because it involves only a comparison, which can reduce the model training time.

Pooling layer: The function of the pooling layer, also known as the downsampling layer, is to reduce the spatial size of the input. This reduces the number of parameters and the amount of computation in the subsequent layers of the network. It applies a filter (usually of size 2x2) to the input volume. Pooling filters can be based on different operations such as max, min or average. The most common one is the max filter, which extracts the maximum value from the filter region.

Fully Connected Layer: The fully connected (dense) layer takes an input volume (the output of an activation function) and outputs an N-dimensional vector. Similar to regular neural networks, neurons in a fully connected layer have full connections to all activations in the previous layer.
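The convolution, ReLU and pooling operations described above can be sketched in a few lines of plain Python on a single-channel image. This is an illustrative toy, not the Keras implementation used in our models; the function names are hypothetical. As in most deep learning frameworks, the "convolution" here slides the filter without flipping it.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (no padding) and sum the element-wise products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def relu(feature_map):
    """Replace negative responses with 0 and keep positive ones unchanged."""
    return [[max(0, v) for v in row] for row in feature_map]

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [
            max(feature_map[r * size + i][c * size + j]
                for i in range(size) for j in range(size))
            for c in range(cols)
        ]
        for r in range(rows)
    ]

# A 5x5 image with a vertical edge between columns 1 and 2,
# filtered with a hand-written vertical edge detector.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
vertical_edge = [[-1, 1], [-1, 1]]
pooled = max_pool(relu(conv2d(image, vertical_edge)))
```

In a trained CNN the entries of `vertical_edge` would be learned parameters rather than fixed by hand.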

For our proposed models, we use the state-of-the-art Inception-ResnetV2 [33] model to extract features from the streetview images. We use the Keras implementation of Inception-ResnetV2 pre-trained on the ImageNet dataset with 1000 classes. Inception-ResnetV2 combines the idea of residual connections with the Inception architecture. In a residual connection, each layer feeds into the next layer and also directly into layers a few hops away. Residual connections are important for very deep architectures. When deeper networks start converging, the accuracy can saturate and eventually degrade. Residual connections are designed to overcome this degradation problem. As the Inception-v4 network is very deep, with around 200 layers, combining the Inception architecture with residual connections can be beneficial.

We remove the final dense layer with softmax activation function because the network has been pretrained to classify 1000 classes. In our work, we only have 3 classes: rumble strips, concrete barrier, and metal crash barrier. Next, we add a dense layer with 250 nodes after the last average pooling layer (with 1,536 nodes). We reduce the feature dimension because we are classifying our dataset into a lower number of classes than the pretrained model. Finally, we add a dense layer with 3 nodes with a sigmoid activation function so that each node provides a probability value for one class label.

As shown in Figure 2, we retrain the CNN model using our streetview dataset. The input to the CNN model is a set of 224x224 streetview images. After retraining, we extract the output of dense layer with 250 nodes to get a feature vector of 250 dimensions for each image in the sequence. We then create a set of feature sequences to feed into the LSTM model.

4.2 Model Spatial Linear Pattern with LSTM

In order to model the spatial linear (sequential) structure along a road network path, we use the LSTM model on a sequence of image features extracted by the CNN model. LSTM is a type of recurrent neural network that uses gating functions to mitigate the exploding and vanishing gradient issues. The gate functions help a model memorize the state of previous units in a sequence. Such a recurrent structure is well suited to capturing the spatial autocorrelation effect across consecutive images. According to the first law of geography: "everything is related to everything else, but near things are more related than distant things." For example, concrete barriers are often very long, spanning several miles. Metal crash barriers, in contrast, have a shorter spatial scale, within a few hundred meters.

Figure 3: LSTM Unit

LSTM models a sequential structure by maintaining a sequence of memory cells ($c_t$, with $t$ as the spatial location index). At each spatial location $t$, LSTM takes an input feature $x_t$, the hidden state $h_{t-1}$ and the cell state $c_{t-1}$. Figure 3 shows an LSTM unit with a cell state ($c_t$) and three different gates: input gate, output gate and forget gate. The forget gate ($f_t$) decides how much information from the previous cell unit is ignored before coming to the next cell. The input gate ($i_t$) decides how much contribution an input feature vector makes to the current cell state. Finally, the output gate ($o_t$) decides what the current LSTM unit is going to output (current cell state $c_t$ and current hidden state $h_t$) based on the cell state. The LSTM transition equations are as follows,

$$
\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) \\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) \\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) \\
\tilde{c}_t &= \tanh(W_c [h_{t-1}, x_t] + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
\tag{1}
$$

where $\sigma$ denotes the sigmoid activation function, $\tanh$ is the hyperbolic tangent function and $\odot$ denotes the element-wise product. $W$ and $b$ denote model parameters. As Figure 2(c) shows, our LSTM model consists of 4 hidden layers. The first layer is an LSTM layer with an output dimension of 100 units. The second layer is a 20% dropout layer. The third layer is a dense layer with 50 nodes. Our problem is a multi-label classification problem because each image may contain multiple types of road safety features at the same time. We therefore implemented two design options for the last layer to handle multi-label classification, which are discussed in detail in subsection 4.3.
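To make the roles of the three gates concrete, the following is a minimal sketch of one step of a standard LSTM cell in plain Python. It assumes the common formulation in which each weight matrix acts on the concatenation of the previous hidden state and the current input; the helper names and the `params` layout are hypothetical, and the Keras LSTM layer used in our models organizes its weights differently.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def affine(W, v, b):
    """Compute W v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step along the image sequence.

    Assumption: `params` maps 'f', 'i', 'o', 'c' to (W, b) pairs, where each
    W has one row per hidden unit and acts on the concatenation [h_prev, x_t].
    """
    z = list(h_prev) + list(x_t)

    def gate(name, activation):
        W, b = params[name]
        return [activation(a) for a in affine(W, z, b)]

    f = gate("f", sigmoid)            # forget gate
    i = gate("i", sigmoid)            # input gate
    o = gate("o", sigmoid)            # output gate
    c_tilde = gate("c", math.tanh)    # candidate cell state
    c_t = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, c_tilde)]
    h_t = [ok * math.tanh(ck) for ok, ck in zip(o, c_t)]
    return h_t, c_t
```

Applying `lstm_step` to the 250-dimensional CNN feature vectors of consecutive images, carrying `h_t` and `c_t` forward, is what lets information from neighboring locations influence the prediction at the current location.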

4.3 Multi-label Classification

To model multi-label classification, we propose two approaches. The first approach trains a shared CNN-LSTM model for all three class labels together. The second approach trains a separate CNN-LSTM model for each class label independently.

In the first approach, the last layer in the model is a sigmoid transformation layer with 3 nodes, corresponding to the three independent class labels (rumble strips, concrete barrier, and metal crash barrier). This is different from the common softmax layer, whose output node values sum to one because class labels are assumed to be mutually exclusive. We use the binary cross-entropy loss. To get the final output labels for each image, we use a threshold of 0.5 on the sigmoid outputs.

In the second approach, we transform the multi-label problem into multiple single-label problems. We learn three independent models, one for each class label. For instance, the independent model for metal crash barriers only classifies an image based on the presence of a metal crash barrier; the same holds for the independent models for concrete barriers and rumble strips. In this design, the last layer is a sigmoid transformation layer with only 1 node, corresponding to one of the three class labels. We use the binary cross-entropy loss. However, it would also be fine to use a softmax activation function over two output nodes (present and absent) with categorical cross-entropy loss.
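The multi-label output handling can be sketched as follows: each sigmoid output is scored with binary cross-entropy independently and thresholded at 0.5. The label names and function names are hypothetical, and the clipping constant `eps` is an assumption added for numerical safety rather than a setting from our experiments.

```python
import math

# Order of the three sigmoid outputs (hypothetical names).
LABELS = ("rumble_strips", "concrete_barrier", "metal_crash_barrier")

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over independent labels."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def predict_labels(y_prob, threshold=0.5):
    """A label is predicted present when its probability exceeds the threshold."""
    return {name: p > threshold for name, p in zip(LABELS, y_prob)}

# An image that likely shows rumble strips and a metal crash barrier.
prediction = predict_labels([0.9, 0.2, 0.7])
```

Because the three probabilities are not forced to sum to one, any combination of labels, including none or all three, can be predicted for a single image. In the separate-model approach the same functions apply with a single label per model.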

5 Experimental Evaluation

In this section, we compare our proposed methods with baseline methods on two real world datasets in terms of classification performance. Experiments were conducted on a Dell workstation with an Intel(R) Xeon(R) CPU E5-2687w v4@3.00GHz, 64GB main memory, and an NVIDIA Quadro K6000 GPU with 2880 cores and 12GB memory. We used Keras with TensorFlow as the backend to run the deep learning models. Candidate classification methods included:

  • CNN only: We used the Inception-ResnetV2 CNN model on streetview images with three classes: rumble strips, concrete barriers and metal crash barriers. We added one more dense layer with 250 nodes and an activation function before the final sigmoid layer with 3 nodes.

  • CNN-DT: We extracted the output of the second-to-last layer (with 250 nodes) from our CNN only model (Inception-ResnetV2) as feature vectors and fed them into a Decision Tree (DT) model. We used the scikit-learn package in Python.

  • CNN-RF: We extracted the output of the second-to-last layer (with 250 nodes) from our CNN only model (Inception-ResnetV2) as feature vectors and fed them into a Random Forest (RF) model. We used the scikit-learn package in Python.

  • CNN-sharedLSTM: This is our proposed model to address the issue of multi-label classification using shared CNN-LSTM model for all three class labels together.

  • CNN-separateLSTM: This is our proposed model to address the issue of multi-label classification using separate single-label independent CNN-LSTM models corresponding to each class label.

Unless specified otherwise, we used default parameters in open source tools in baseline methods.

Evaluation Metrics: To evaluate the candidate classification methods, we used precision, recall and F-score. We computed the precision, recall and F-score for each class label. Finally, we computed the weighted average F-score for all candidate classification methods. To calculate the weighted average, we used equation 2,

$$
F_{avg} = \frac{N_{RS} F_{RS} + N_{MCB} F_{MCB} + N_{CB} F_{CB}}{N_{RS} + N_{MCB} + N_{CB}}
\tag{2}
$$

where $F_{RS}$, $F_{MCB}$, and $F_{CB}$ refer to the F-scores of the rumble strips, metal crash barriers and concrete barriers class labels respectively. Similarly, $N_{RS}$, $N_{MCB}$, and $N_{CB}$ refer to the number of images containing the class labels rumble strips, metal crash barriers and concrete barriers respectively.
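As a worked example of the weighted average, the sketch below weights each class F-score by the number of images containing that class. The function names are hypothetical. The numbers are the CNN-DT scores from Table 2 and the first test set's class counts from Table 1, assuming the class columns of Table 1 are ordered RS, MCB, CB.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def weighted_average_f(f_scores, counts):
    """Average the per-class F-scores, weighting each by its image count."""
    return sum(f * n for f, n in zip(f_scores, counts)) / sum(counts)

# CNN-DT on the first test set: F-scores for RS, MCB, CB and their image counts.
cnn_dt_avg = weighted_average_f([0.89, 0.80, 0.78], [857, 279, 354])
```

Rounded to two decimals, `cnn_dt_avg` agrees with the average F-score reported for CNN-DT in Table 2. Weighting by class count means the frequent rumble strips class influences the average more than the rarer barrier classes.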

5.1 Dataset Description

To evaluate the performance of our proposed models, we first randomly selected 3,745 labeled, isolated streetview images across the state of Alabama for pre-training. We used this dataset to pre-train the CNN in all baseline and proposed methods. Next, we selected different road segments on the I-20 highway in Alabama to extract spatially continuous streetview images for the training, validation, and test datasets. The images in the extracted datasets are different from the images in the pre-training dataset. For training and validation, we selected the road segment from 33°37’08.4"N 85°42’28.2"W to 33°35’06.5"N 86°03’40.6"W on I-20 East near Oxford, Alabama. We selected two test datasets. Our first test dataset was based on the road segment from 33°35’09.6"N 85°52’32.1"W to 33°36’56.4"N 85°41’25.2"W on I-20 West near Oxford, Alabama, which is close to the training and validation datasets. Our second test dataset was based on the road segment from 33°08’09.1"N 87°38’05.6"W to 33°11’13.2"N 87°19’59.3"W on I-20 West near Tuscaloosa, Alabama, which is far from the training and validation datasets. We refer to the first test dataset as Test Set_1 and the second test dataset as Test Set_2 in the rest of the paper. We used the same training and validation datasets for both test datasets.

We divided the road segments into a set of geolocation coordinates spaced at an equal distance interval of 20 meters. We then used the Google Street View API to download the streetview image for each coordinate. We have three safety feature classes: rumble strips (RS), metal crash barriers (MCB), and concrete barriers (CB). Table 1 shows the number of images and the class distribution for the training, validation and test datasets.

Dataset           # of Images  RS    MCB   CB
Pre-Training Set  3745         1882  1149  1632
Training Set      983          868   324   352
Validation Set    594          493   96    224
Test Set_1        950          857   279   354
Test Set_2        1350         879   403   784
Table 1: Class Distribution

5.2 Hyperparameter Settings

For our proposed models, several design decisions must be made, including the input vector length in LSTM, the dimension of the hidden state of LSTM, learning rate, dropout, optimization function and training batch size. To train our proposed CNN-LSTM models, we used an input vector length of 50 spatially continuous images for each sequence, which corresponds to streetview images covering 1000 meters (20 meters separation between images). We used a sliding window with a stride of 1 on the training and validation datasets to create the training and validation sequences, generating 883 training and 544 validation sequences. We set the dimension of the hidden state of LSTM to 100. For the learning rate, we first started with a high value and observed oscillating training and validation loss curves. We then gradually decreased the learning rate until we obtained more stable curves for the CNN and CNN-LSTM models. Next, we varied the dropout value. We observed that with higher dropout values the convergence of training and validation loss was slower and required more epochs. So, we settled on a dropout of 20%, as used in the dropout layer of our LSTM model. We trained the CNN and CNN-LSTM models using the Adam optimizer. The training batch size was 32 for the CNN model and 1 for the CNN-LSTM models.
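The construction of fixed-length sequences with a sliding window can be sketched as below. The function name and the toy input are hypothetical; in our setting each element would be the 250-dimensional CNN feature vector of one image, ordered along the road, and `length` would be 50.

```python
def sliding_windows(features, length=50, stride=1):
    """Cut an ordered list of per-image features into overlapping sequences."""
    if length <= 0 or stride <= 0:
        raise ValueError("length and stride must be positive")
    last_start = len(features) - length
    return [features[s:s + length] for s in range(0, last_start + 1, stride)]

# Toy example: 7 consecutive images, windows of 3 images, stride 1.
toy_sequences = sliding_windows(list(range(7)), length=3)
```

With a stride of 1, consecutive sequences overlap in all but one image, so most images appear in many training sequences at different positions. A path shorter than `length` yields no sequence in this sketch.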

Figure 4: Training and Validation performance of CNN-sharedLSTM over 50 epochs for all 3 class labels
Figure 5: Training and Validation performance of CNN-separateLSTM for rumble strips over 50 epochs
Figure 6: Training and Validation performance of CNN-separateLSTM for metal crash barriers over 50 epochs
Figure 7: Training and Validation performance of CNN-separateLSTM for concrete barriers over 50 epochs

5.3 Classification Performance

Figure 4 shows the training performance of the CNN-sharedLSTM model. Similarly, Figures 5, 6, and 7 show the training performance of the CNN-separateLSTM model for the rumble strips, metal crash barriers, and concrete barriers class labels respectively. The training and validation losses for CNN-sharedLSTM are 0.06 and 0.32 respectively. For CNN-separateLSTM, the training and validation losses for the rumble strips class label are 0.02 and 0.08 respectively. The gap between training and validation loss for rumble strips in the CNN-separateLSTM model is smaller than that of the CNN-sharedLSTM model. The training and validation losses in CNN-separateLSTM for the metal crash barriers class label are 0.05 and 0.28 respectively, which shows a similar trend. This is because the CNN-separateLSTM model can better learn the spatial scales of rumble strips and metal crash barriers separately. Rumble strips have a much larger spatial scale than metal crash barriers. Furthermore, rumble strips and metal crash barriers are mostly easy to identify in the streetview images. Finally, the training and validation losses for the concrete barrier class label are 0.02 and 0.48 respectively. As shown in Figure 7, the gap between the training and validation curves is large, which indicates over-fitting. This is likely because concrete barriers are very diverse and hard to learn. We observe that the texture of concrete barriers in some regions is very similar to the texture of the road.

Classifiers       Class  Precision  Recall  F     Avg. F
CNN-DT            RS     0.95       0.83    0.89  0.85
                  MCB    0.79       0.82    0.80
                  CB     0.81       0.76    0.78
CNN-RF            RS     0.94       0.87    0.90  0.87
                  MCB    0.89       0.86    0.87
                  CB     0.78       0.82    0.80
CNN only          RS     0.94       0.95    0.95  0.89
                  MCB    0.91       0.78    0.84
                  CB     0.74       0.85    0.79
CNN-sharedLSTM    RS     0.43       0.98    0.96  0.91
                  MCB    0.90       0.83    0.86
                  CB     0.74       0.91    0.82
CNN-separateLSTM  RS     0.95       0.96    0.96  0.92
                  MCB    0.93       0.83    0.88
                  CB     0.77       0.93    0.84
Table 2: Classification on Test Set_1

Classifiers       Class  Precision  Recall  F     Avg. F
CNN-DT            RS     0.85       0.84    0.85  0.75
                  MCB    0.50       0.86    0.63
                  CB     0.94       0.54    0.69
CNN-RF            RS     0.88       0.87    0.88  0.77
                  MCB    0.58       0.77    0.66
                  CB     0.93       0.58    0.71
CNN only          RS     0.87       0.97    0.92  0.78
                  MCB    0.78       0.73    0.75
                  CB     0.89       0.49    0.63
CNN-sharedLSTM    RS     0.85       0.98    0.91  0.79
                  MCB    0.68       0.84    0.75
                  CB     0.93       0.53    0.68
CNN-separateLSTM  RS     0.88       0.96    0.92  0.83
                  MCB    0.74       0.80    0.77
                  CB     0.96       0.61    0.76
Table 3: Classification on Test Set_2

We compare the candidate methods on precision, recall, and F-score. To obtain the predicted class labels, we set a probability threshold of 0.5. A road safety feature probability above the threshold indicates the presence of that road safety feature in the image. Results are summarized in Tables 2 and 3 for Test Set_1 and Test Set_2 respectively.

For Test Set_1 in Table 2, the average F-scores of the CNN-DT, CNN-RF and CNN only models are 0.85, 0.87 and 0.89 respectively. The average F-score for CNN with DT and RF is lower than for CNN only. This is because CNN-DT and CNN-RF take the output of the second-to-last layer (with 250 output dimensions) of the CNN only model as features and fit their models on them, whereas the last layer in the CNN only model, a dense layer with 3 nodes, has extra learnable parameters that were trained jointly with the features. Next, we also observe that both of our proposed models, CNN-sharedLSTM and CNN-separateLSTM, perform better than the CNN only model, with average F-scores of 0.91 and 0.92 respectively. The CNN only model may fail to correctly identify some in-between images in the test image sequence. Our proposed models can correct these errors by incorporating spatial dependency in the learning process using the LSTM network. We also observe that CNN-separateLSTM performs better than CNN-sharedLSTM. This is because, in CNN-separateLSTM, the independent model for each class label can better learn the spatial scale of that class label. For example, metal crash barriers have an average length of a few hundred meters, whereas other road safety features such as rumble strips may have an average length of a few kilometers. Furthermore, for CNN-separateLSTM, we observe a very high F-score of 0.96 on the rumble strips class, which is consistent with the learning curve in Figure 5 showing a small gap between the training and validation loss. For metal crash barriers, we observe an F-score of 0.88, which is also consistent with the learning curve in Figure 6 showing a medium gap between training and validation loss. Finally, for concrete barriers, we observe the lowest F-score of 0.84, again consistent with the learning curve in Figure 7 showing a large gap between training and validation loss.

Figure 8: Prediction maps on Test Set 1 for rumble strips
Figure 9: Prediction maps on Test Set 1 for metal crash barriers
Figure 10: Prediction maps on Test Set 1 for concrete barriers
Figure 11: Prediction maps on Test Set 2 for rumble strips
Figure 12: Prediction maps on Test Set 2 for metal crash barriers
Figure 13: Prediction maps on Test Set 2 for concrete barriers

For Test Set 2 in Table 3, we observe trends similar to those discussed above, but the overall performance of all candidate methods is lower. The average F-scores of CNN-DT, CNN-RF, CNN only, CNN-sharedLSTM, and CNN-separateLSTM are 0.75, 0.77, 0.78, 0.79, and 0.83, respectively. This is likely because the Test Set 2 image locations lie far from the training and validation region, so the training and validation images might not be very similar to the images in Test Set 2. Performance could likely be improved by adding representative images from locations closer to Test Set 2 to the training and validation data.

We also visualize the predicted class labels on maps, based on the results summarized in Table 2 and Table 3. Figures 8, 9 and 10 show the ground truth and prediction maps for the three road safety feature classes based on the CNN only, CNN-sharedLSTM and CNN-separateLSTM models on Test Set 1. Similarly, Figures 11, 12 and 13 show the ground truth and prediction maps on Test Set 2. We can observe that different road safety features have different spatial scales. From Figures 8 and 11, the ground truth segments of rumble strips are usually very long. From Figures 9 and 12, the ground truth segments of metal crash barriers are usually very short. Finally, from Figures 10 and 13, the ground truth segments of concrete barriers vary from short to long. In all the prediction maps, the CNN only predictions contain some isolated errors. The CNN-sharedLSTM model was able to correct some of those isolated errors, as highlighted in the zoom-in sub-visualizations. However, we observe that the CNN-separateLSTM model is more accurate, which is consistent with the results summarized in Tables 2 and 3. The CNN-sharedLSTM and CNN-separateLSTM models generate more accurate maps than the CNN only model because they incorporate the spatial sequential structure into the learning process.
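To make the notions of "spatial scale" and "isolated errors" concrete, the sketch below measures them on a binary label sequence ordered along a road path. It is only an illustrative diagnostic under our own assumptions (the function names and the sample sequence are hypothetical); it is not part of the proposed models, which learn this sequential structure through the LSTM.

```python
from typing import List


def positive_run_lengths(labels: List[int]) -> List[int]:
    """Length of each maximal run of consecutive 1s (feature present)."""
    runs = []
    current = 0
    for label in labels:
        if label == 1:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def isolated_flips(labels: List[int]) -> List[int]:
    """Indices whose label differs from two neighbors that agree with each other."""
    return [
        i for i in range(1, len(labels) - 1)
        if labels[i - 1] == labels[i + 1] != labels[i]
    ]


# Hypothetical per-image predictions for one class along a road path.
sequence = [1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0]
runs = positive_run_lengths(sequence)
flips = isolated_flips(sequence)
print(runs, flips)
```

Multiplying a run length by the spacing between consecutive images gives an approximate segment length. An isolated flip is only a candidate error, since a genuinely short feature would look the same, which is one reason a learned sequence model may be preferable to a fixed smoothing rule.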

Figure 14: Classification map for rumble strips class label using CNN-separateLSTM model
Figure 15: Classification map for metal crash barriers class label using CNN-separateLSTM model
Figure 16: Classification map for concrete barriers class label using CNN-separateLSTM model

5.4 Case Study

We downloaded 69,500 streetview images covering all the major highways in the state of Alabama. The major highways include I-10, I-20, I-59, I-65, I-85 and I-459 within Alabama. We then classified the road safety features in those images using our CNN-separateLSTM model. The maps of each road safety feature based on the classification results are shown in Figures 14, 15 and 16. We randomly checked a few data points to verify the classification results and found that most of the images were accurately classified.

According to the predicted maps, 54,890 streetview images were classified as containing rumble strips, which corresponds to around 1,200 km out of 1,390 km of major highways in Alabama. 12,841 streetview images were classified as containing metal crash barriers, which corresponds to around 257 km out of 1,390 km. Finally, 26,997 streetview images were classified as containing concrete barriers, which corresponds to around 540 km out of 1,390 km. We observed that the rumble strip segments in the predicted map are very long, which is consistent with the spatial scale observed in the ground truth for the test datasets. Similarly, from the predicted map in Figure 15, we observed that metal crash barriers are evenly distributed throughout the major highways in Alabama and have a short spatial scale, which is also consistent with the test datasets. Finally, from the predicted map in Figure 16, we observed that the spatial scale of concrete barriers varies. Long chains of concrete barriers are usually located near city areas, while short concrete barriers are usually placed on bridges. We also noticed that bridges with concrete barriers usually do not contain rumble strips.
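The conversion from image counts to road length can be approximated as below. This sketch assumes that images are uniformly spaced along the network (69,500 images over 1,390 km, i.e. roughly 20 m per image); that uniform spacing is our assumption, not a stated detail of the case study, and the function name is hypothetical.

```python
TOTAL_IMAGES = 69_500  # streetview images downloaded for the case study
TOTAL_KM = 1_390       # total length of the major highways covered


def images_to_km(n_images: int,
                 total_images: int = TOTAL_IMAGES,
                 total_km: float = TOTAL_KM) -> float:
    """Estimate road length covered by n_images, assuming uniform image spacing."""
    return n_images * total_km / total_images


# Image counts per class as reported in the case study.
counts = {
    "rumble strips": 54_890,
    "metal crash barriers": 12_841,
    "concrete barriers": 26_997,
}
estimates = {name: round(images_to_km(n)) for name, n in counts.items()}
print(estimates)
```

Under this simplification the estimates for metal crash barriers and concrete barriers match the reported 257 km and 540 km. The rumble strip estimate comes out near 1,100 km rather than the reported figure of around 1,200 km, so the reported value may rest on a different conversion or rounding.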

6 Conclusion

In this paper, we proposed two CNN-LSTM based spatial classification models for mapping safety features along road networks. Our CNN-LSTM models can capture the spatial linear structure between consecutive images along a road network path. Results on real world Google Streetview images collected in Alabama showed that our models outperform several baseline methods. Furthermore, through experimental evaluation, we found that separate CNN-LSTM models for independent class labels performed better than a shared CNN-LSTM model.

In future work, we plan to investigate more general spatial network structure with graph-LSTM.

7 Acknowledgement

This material is based upon work supported by the Alabama Transportation Institute.


  • [1] Driver Knowledge. Accident statistics, 2013.
  • [2] Congressional Budget Office. Public spending on transportation and water infrastructure, 1956 to 2017, 2018.
  • [3] Hawzheen Karim, Rolf Magnusson, and Mats Wiklund. Assessment of injury rates associated with road barrier collision. Procedia-social and behavioral sciences, 48:52–63, 2012.
  • [4] Carlos Roque and João Lourenço Cardoso. Observations on the relationship between european standards for safety barrier impact severity and the degree of injury sustained. IATSS research, 37(1):21–29, 2013.
  • [5] Yaotian Zou, Andrew P Tarko, Erdong Chen, and Mario A Romero. Effectiveness of cable barriers, guardrails, and concrete barrier walls in reducing the risk of injury. Accident Analysis & Prevention, 72:55–65, 2014.
  • [6] Michael R Bambach, RJ Mitchell, and Raphael H Grzebieta. The protective effect of roadside barriers for motorcyclists. Traffic injury prevention, 14(7):756–765, 2013.
  • [7] Hussein H Jama, Raphael H Grzebieta, Rena Friswell, and Andrew S McIntosh. Characteristics of fatal motorcycle crashes into roadside safety barriers in australia and new zealand. Accident Analysis & Prevention, 43(3):652–660, 2011.
  • [8] Raphael Grzebieta, Mike Bambach, and Andrew McIntosh. Motorcyclist impacts into roadside barriers: Is the european crash test standard comprehensive enough? Transportation Research Record, 2377(1):84–91, 2013.
  • [9] James E Bryden and Jan S Fortuniewicz. Traffic barrier performance related to vehicle size and type. Transportation Research Record, 1065:69–78, 1986.
  • [10] Carlos M Vieira, Henrique A Almeida, Irene S Ferreira, Joel O Vasco, Paulo J Bártolo, Rui B Ruben, and Sérgio P Santos. Development of an impact absorber for roadside barriers. In Proceedings of the 7th LS-DYNA Forum, 2008.
  • [11] Jennifer D Schmidt, Ronald K Faller, Dean L Sicking, John D Reid, Karla A Lechtenberg, Robert W Bielenberg, Scott K Rosenbaugh, Jim C Holloway, et al. Development of a new energy-absorbing roadside/median barrier system with restorable elastomer cartridges. Technical report, Nebraska. Dept. of Roads, 2013.
  • [12] Zheming Tong, Richard W Baldauf, Vlad Isakov, Parikshit Deshmukh, and K Max Zhang. Roadside vegetation barrier designs to mitigate near-road air pollution impacts. Science of the Total Environment, 541:920–927, 2016.
  • [13] Nico Schulte, Michelle Snyder, Vlad Isakov, David Heist, and Akula Venkatram. Effects of solid barriers on dispersion of roadway emissions. Atmospheric environment, 97:286–295, 2014.
  • [14] Gayle SW Hagler, Wei Tang, Matthew J Freeman, David K Heist, Steven G Perry, and Alan F Vette. Model evaluation of roadside barrier impact on near-road air pollution. Atmospheric Environment, 45(15):2522–2530, 2011.
  • [15] Benjamin Romano and Zhe Jiang. Visualizing traffic accident hotspots based on spatial-temporal network kernel density estimation. In Proceedings of the 25th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, page 98. ACM, 2017.
  • [16] Arpan Man Sainju and Zhe Jiang. Grid-based colocation mining algorithms on gpu for big spatial event data: A summary of results. In International Symposium on Spatial and Temporal Databases, pages 263–280. Springer, 2017.
  • [17] Arpan Man Sainju, Danial Aghajarian, Zhe Jiang, and Sushil K Prasad. Parallel grid-based colocation mining algorithms on gpus for big spatial event data. IEEE Transactions on Big Data, 2018.
  • [18] Zhe Jiang, Michael Evans, Dev Oliver, and Shashi Shekhar. Identifying k primary corridors from urban bicycle gps trajectories on a road network. Information Systems, 57:142–159, 2016.
  • [19] Aibek Musaev, Zhe Jiang, Steven Jones, Pezhman Sheinidashtegol, and Mirbek Dzhumaliev. Detection of damage and failure events of road infrastructure using social media. In International Conference on Web Services, pages 134–148. Springer, 2018.
  • [20] Vahid Balali, Elizabeth Depwe, and Mani Golparvar-Fard. Multi-class traffic sign detection and classification using google street view images. In Transportation Research Board 94th Annual Meeting, Transportation Research Board, Washington, DC, 2015.
  • [21] Vahid Balali, Armin Ashouri Rad, and Mani Golparvar-Fard. Detection, classification, and mapping of us traffic signs using google street view images for roadway inventory management. Visualization in Engineering, 3(1):15, 2015.
  • [22] Victor JD Tsai, Jyun-Han Chen, and Hsun-Sheng Huang. Traffic sign inventory from google street view images. The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 41:243–246, 2016.
  • [23] Timnit Gebru, Jonathan Krause, Yilun Wang, Duyun Chen, Jia Deng, Erez Lieberman Aiden, and Li Fei-Fei. Using deep learning and google street view to estimate the demographic makeup of neighborhoods across the united states. Proceedings of the National Academy of Sciences, 114(50):13108–13113, 2017.
  • [24] Xiaojiang Li, Chuanrong Zhang, Weidong Li, Robert Ricard, Qingyan Meng, and Weixing Zhang. Assessing street-level urban greenery using google street view and a modified green view index. Urban Forestry & Urban Greening, 14(3):675–685, 2015.
  • [25] Shashi Shekhar, Zhe Jiang, Reem Ali, Emre Eftelioglu, Xun Tang, Venkata Gunturi, and Xun Zhou. Spatiotemporal data mining: a computational perspective. ISPRS International Journal of Geo-Information, 4(4):2306–2338, 2015.
  • [26] Zhe Jiang. A survey on spatial prediction methods. IEEE Transactions on Knowledge and Data Engineering, 2018.
  • [27] Shanghang Zhang, Guanhang Wu, Joao P. Costeira, and Jose M. F. Moura. Fcn-rlstm: Deep spatio-temporal neural networks for vehicle counting in city cameras. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [28] Huaxiu Yao, Fei Wu, Jintao Ke, Xianfeng Tang, Yitian Jia, Siyu Lu, Pinghua Gong, Jieping Ye, and Zhenhui Li. Deep multi-view spatial-temporal network for taxi demand prediction. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • [29] Yuankai Wu and Huachun Tan. Short-term traffic flow forecasting with spatial-temporal correlation in a hybrid deep learning framework. arXiv preprint arXiv:1612.01022, 2016.
  • [30] Huaxiu Yao, Xianfeng Tang, Hua Wei, Guanjie Zheng, Yanwei Yu, and Zhenhui Li. Modeling spatial-temporal dynamics for traffic prediction. arXiv preprint arXiv:1803.01254, 2018.
  • [31] Zhuoning Yuan, Xun Zhou, and Tianbao Yang. Hetero-convlstm: A deep learning approach to traffic accident prediction on heterogeneous spatio-temporal data. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 984–992. ACM, 2018.
  • [32] Zhuoning Yuan, Xun Zhou, Tianbao Yang, James Tamerius, and Ricardo Mantilla. Predicting traffic accidents through heterogeneous urban data: A case study. In Proceedings of the 6th International Workshop on Urban Computing (UrbComp 2017), Halifax, NS, Canada, volume 14, 2017.
  • [33] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
  • [34] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.