Detecting Unsigned Physical Road Incidents from Driver-View Images

04/24/2020 ∙ by Alex Levering, et al.

Safety on roads is of utmost importance, especially in the context of autonomous vehicles. A critical need is to detect and communicate disruptive incidents early and effectively. In this paper we propose a system based on an off-the-shelf deep neural network architecture that is able to detect and recognize types of unsigned (not marked by traffic signs or other placards) and physical (visible in images) road incidents. We develop a taxonomy for unsigned physical incidents to provide a means of organizing and grouping related incidents. After selecting eight target types of incidents, we collect a dataset of twelve thousand images gathered from publicly available web sources. We subsequently fine-tune a convolutional neural network to recognize the eight types of road incidents. The proposed model is able to recognize incidents with a high level of accuracy (higher than 90%). We further show that the model generalizes well across spatial contexts by training a classifier on geostratified data in the United Kingdom (with an accuracy of over 90%), while translation to visually less similar environments requires spatially distributed data collection. Note: this is a pre-print version of work accepted in IEEE Transactions on Intelligent Vehicles (T-IV; in press). The paper is currently in production, and the DOI link will be added soon.







1 Introduction

Roads are a highly dynamic environment, continuously impacted by ephemeral changes to surface conditions. Incidents that interrupt the road network reduce network connectivity, cause economic damage by increasing travel times and delaying deliveries, and lead to missed connections to other modes of transport. In England, congestion caused by incidents on trunk roads and motorways is estimated to cost approximately half a billion pounds every year [highways_england_smart_2016] and to delay approximately one in five journeys [peluffo_strategic_2015]. This will worsen, as traffic on England’s trunk roads and motorways has grown by over 50% since 1993 and is expected to grow another 31% by 2041 [highways_england_highways_2017]. Combined with the continued growth of car ownership worldwide [dargay_vehicle_2007] and a continued increase in road vehicle miles [peluffo_strategic_2015], the effects of incidents on the serviceability of the road network are likely to deteriorate. The low cost and wide availability of vehicle-borne cameras offer great potential for visually detecting road incidents. This information could subsequently be communicated across the networked traffic. Yet, previous research on incident perception and detection from the vehicle-centric perspective (ego-vehicle) has thus far not widely considered the detection of incidents. In this research we propose a taxonomy of applicable incidents, collect a first-of-its-kind machine-learning dataset of incident images approximating a vehicle-centric perspective, and develop and test a deep learning model to recognize unsigned physical incidents.

The rest of this paper is organized as follows: Section 2 gives an overview of the background material of the research. Section 3 discusses the definition of unsigned physical incidents and the taxonomy of incidents. Section 4 presents the methodology and experimental set-up. In Section 5 we discuss the results of the model classification, and discuss its nuances in Section 6. We draw conclusions from our work in Section 7.

2 Background

2.1 Definitions of incidents

To properly study unsigned physical incidents, the concept must first be defined. The United States Federal Highway Administration defines an incident as "any non-recurring event that causes a reduction of roadway capacity or an abnormal increase in demand." [p.b._farradyne_traffic_2000, p.2]. This definition is unsuitable as it does not account for a reduction in serviceability. Secondly, it attempts to limit incidents to non-recurring events, which excludes events such as snowfall. Berdica defines an incident as "an event, which directly or indirectly can result in considerable reductions or interruptions in the serviceability of a link/route/road network." [berdica_introduction_2002, p.118]. This definition is well-suited for this research, as it covers the reduction in serviceability and does not limit the context in which incidents may occur. We thus apply Berdica’s definition as a basis for defining incidents in this paper. We further delineate incidents by their physical nature. We consider physical incidents to be incidents that are perceivable by sensors such as cameras, sonar systems, and laser scanners. Digital events such as a traffic light hack are not considered, because the cause of the disruption is not easily interpretable by in-situ sensors.

Finally, we further distinguish between signed and unsigned incidents. Signed incidents are incidents which are signposted or otherwise marked as a hazard. For instance, roadworks and parades are often signposted with barriers, traffic signs, and high-visibility equipment. On the contrary, unsigned incidents are unplanned, such as a flash-flood after heavy rainfall. Figure 1 gives an example of both types of physical incidents. Once detected and acted upon by authorities, incidents are often signposted. In this research we only consider unsigned physical incidents. We do not consider signed physical incidents because it shares a broad overlap with tasks such as street sign recognition [zhu_traffic-sign_2016, zaklouta_real-time_2014], traffic cone recognition [yong_real-time_2015], road marking detection [hillel_recent_2014], and roadworks detection [fazekas_locating_2017].

2.2 Image-based Incident Detection

Animals are a frequent cause of collisions with cars. Zhou et al. [zhou_real-time_2014] review the state of animal detection systems for road-going vehicles up to 2014. They record five papers applicable to intelligent vehicles’ need to classify various species of animals. Most of these efforts concern general animal classifiers rather than classifiers specialized in detecting animals on the road. Directly relevant to unsigned physical incident detection is the semantic segmentation of kangaroos on the road in synthetic images for collision prevention [saleh_kangaroo_2016]; this work relied on training on computer-generated images of kangaroos. Related work on road damage detection from smartphone dashcam images [alfarrarjeh_deep_2018] focused on road surface assessment rather than damage that may cause road closure.

Other types of unsigned physical incidents are scarcely considered in the existing literature. Chen et al. [chen_lidar-histogram_2017] consider the use of LIDAR sensors for the detection of road obstacles by defining the driving surface and then detecting outliers above the surface. In [levi_stixelnet:_2015], the same task is approached through images from true-color cameras to detect perceivable edges of the driving surface and thus find the ground-plane boundaries of potential obstacles. However, neither of these works attempts to classify the object that is obstructing the driving path, nor do they distinguish regular obstacles (e.g., cars) from unusual obstacles (e.g., debris). To the best of our knowledge, the only research in this domain that defines the specific type of hazard occurring on the driving surface is [shao_research_2015], in which the authors detect shallow holes and water hazards on the driving surface using the (lack of) returns of a given LIDAR beam. Similar research based on RGB images is still lacking.

2.3 Computer Vision Techniques

Many applicable feature extraction techniques exist, such as first- and second-order edge detection, image motion descriptors, shape matching, texture extraction, and statistical features [s._nixon_feature_2012]. Convolutional Neural Networks (CNNs) were conceptualized by Fukushima [fukushima_neocognitron:_1980] and expanded upon by LeCun et al. [lecun_gradient-based_1998] as a means to automatically learn features from images using trainable filters. Krizhevsky et al. [krizhevsky_imagenet_2012] implemented a GPU-based CNN which laid the foundation for modern deep CNN architectures. VGG [simonyan_deep_2014] improved upon the AlexNet architecture by experimenting with layer configurations and exploring deeper networks. ResNet [he_deep_2016] made notable improvements to CNNs by introducing skip connections. Skip connections let network layers learn identity mappings, enabling very deep networks to be trained while mitigating the vanishing gradient problem. Additional batch normalization layers helped stabilize the signal of intermediate layers. Further improvements to CNN architectures are continuously made, but for brevity they are not discussed here.

Figure 1: Left: A signed physical incident with signposted roadworks [dixon_roadworks_2017]. Right: An unsigned physical incident – road flooding [sandy_spring_volunteer_fire_department_ssvfd_2017].

3 Unsigned Physical Incidents

By enhancing the definition of incidents proposed by Berdica with the concepts of physicality and signage, the working definition for unsigned physical incidents used in this research is: An unsigned physical incident is an event without road signs or hazard markers that can be detected by sensors, which directly or indirectly can result in considerable reductions or interruptions in the serviceability of a link/route/road network.

Figure 2: Taxonomy of incidents and their semantic groupings

3.1 Taxonomy of Incidents

There are many incidents which may affect a road network, many of them related through shared characteristics. For instance, snowy and inundated roads are distinct phenomena, but related in that both concern natural events affecting the road surface. In order to systematically explore the breadth of the domain of possible incident types before developing a classifier, we first propose a taxonomy that semantically groups incidents. We define the structure of the taxonomy through Formal Concept Analysis (FCA) [ganter_formal_2012] in order to uncover attributes and attribute groupings of incidents. We iteratively refined and grouped attributes until they provided hierarchical, binary semantic groupings. From the FCA we identified two common aspects distinguishing between incidents. The first aspect is the manifest and most likely cause of an incident, i.e., man-made or natural. For instance, a car crash involving two cars manifests as a man-made obstruction on the road; whether it is caused by driver error or mechanical failure, both causes are rooted in human failure. Flooding, on the other hand, manifests as natural, although it may be triggered by a human damaging a road hydrant. This aspect must be interpreted as the perceivable nature of the incident. The second grouping is whether the incident is by nature a well-defined discrete (set of) objects or a continuous field, i.e., a cover. Road flooding is a continuous phenomenon without a well-defined discrete delineation of the incident. On the other hand, a fallen tree can be counted and can be considered a discrete incident, which we refer to as an object. We show how the resulting taxonomy may be structured formally in Figure 2. Such a distinction should serve to inform the comprehensive, systematic collection of training data. It may further function as a test of the model’s capacity to classify groups of incidents. As more refined models are produced, the taxonomy may deepen to include more probable causes to provide more fine-grained risk estimation.

The taxonomy has been designed with several limitations in mind:

  • While an error in a traffic management system may cause traffic disruption, it is hard if not impossible to detect this incident by sensors in-situ. Hence, digital incidents are not included. In a full hierarchy of possible incidents it would occur on the same level as physical incidents;

  • We consider attributes of incidents by their perceivable cause. It is possible that e.g., a tree was purposely cut to block a road, making it a man-made adversarial incident. Yet, such an incident is often hard to distinguish from natural tree fall, a much more common occurrence;

  • Signed physical incidents are not considered in this research, as the detection of signage is actively researched using computer vision techniques. With sufficient research progress, the detection of road signs may soon cover the detection of signed incidents and events such as roadworks and parades.

3.2 Incident Definitions

The lowest-level labels refer to individual incidents that belong to the specified groupings. Here, we consider eight class labels. We chose a variety of incidents that can be semantically described for the purposes of collecting a training dataset of images and for which we can reasonably expect to find hundreds of images per class, covering the spectrum and variability of common objects and surfaces wherever possible. We use the following working definitions of classes in this research:

  • Animal on Road: Any animal, both living and dead, situated on or within close proximity of the driving surface

  • Collapse: A major break-up of the driving surface which would be too big for common motor vehicles to drive across without incurring damage

  • Fire: An uncontrolled and active fire anywhere in the image that may affect the driving conditions immediately or when left uncontrolled

  • Flooding on Road: A (section of) driving surface that is submerged in a cover of water puddles such that it causes drivers to change their driving behaviour

  • Landslide: A cover of dirt, rocks, or natural debris originating from a raised surface, which has settled on the driving surface

  • Snow on Road: Any amount of snow on the driving surface such that it could cause drivers to change their driving behaviour

  • Treefall: A tree, trunk, or sizable branch leaning over or lying on the driving surface in such a way that it would obstruct traffic

  • Vehicle Crash: Any visible collision between one or more motor vehicles, or a motor vehicle collision with an object in the environment, such as a tree

(a) Crash (b) Collapse (c) Animal (d) Treefall
(e) Snow (f) Flood (g) Landslide (h) Fire
Figure 3: Prototypical images of each incident class covered in this research

4 Methodology

We perform supervised single-label classification on web-gathered positive examples and a diverse set of negative examples sampled from various sources. We do not consider multi-label classification a good fit for this task, as multiple incidents occurring at once are unusual.

4.1 Image Dataset

The input dataset of imagery of various incidents occurring on the road network is built from a multitude of sources. Each image is attributed to only one of the fine-level classes via a single label; thus, images that contain, e.g., both a crash and a fire are excluded from the training set. Images were selected to accurately represent road scenes from the perspective of a vehicle, with expected viewports. For instance, an image taken just centimeters above the road is not representative of a driving scene; likewise, an image that is rotated beyond a few degrees is not relevant either. Since this information is not supplied with most images, it is up to human labellers to filter scenes to match these criteria.

4.1.1 Positive Examples

In order for a classifier to differentiate between incidents and normal driving situations, a dataset of positive examples has to be gathered. Positive examples of unsigned physical incident images have one of the incidents clearly in view. We gather the set of positives from four sources. Firstly, we perform Web harvesting from the Google [google_custom_nodate], Flickr [yahoo_flickr_nodate], and Bing [microsoft_bing_nodate] image APIs. Per query, we retain the first 100 images, as the quality of returned images quickly declined beyond this limit. We gather images by constructing queries from synonym pairs: for two sets of predefined keywords, we combine the sets pairwise to form a new set of two-keyword queries. For instance, the sets {road, street} and {snow, blizzard} are combined to form the set {road snow, road blizzard, street snow, street blizzard}. We also constructed additional queries by translating these English keyword sets into Dutch, Croatian, Farsi, Mandarin, and Slovak, to increase the diversity of scenes and to reduce the influence of geographic bias. Lastly, we filter duplicate returns by checking for absolute equivalence between image matrices. Prototypical examples of each incident type are given in Figure 3.
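The keyword-pairing and exact-duplicate filtering steps described above can be sketched as follows. This is a minimal Python sketch under our own naming; the function names and toy data are illustrative, not from the paper:

```python
from itertools import product

import numpy as np

def build_queries(context_terms, incident_terms):
    """Combine two synonym sets pairwise into two-keyword search queries."""
    return [f"{c} {i}" for c, i in product(context_terms, incident_terms)]

def filter_exact_duplicates(images):
    """Keep only images whose pixel matrices are not byte-for-byte identical
    to an earlier image (the absolute-equivalence check described above)."""
    seen, unique = set(), []
    for img in images:
        key = img.tobytes()
        if key not in seen:
            seen.add(key)
            unique.append(img)
    return unique

queries = build_queries(["road", "street"], ["snow", "blizzard"])
# -> ["road snow", "road blizzard", "street snow", "street blizzard"]
```

Note that this exact-equivalence check only removes byte-identical copies; re-encoded or resized duplicates of the same photo would pass through it.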

Through synonym combinations, we formed a set of 118 English-language queries which retrieved 40,063 images. After selecting suitable examples, 5,844 images were included in the dataset: 2,439 from Google, 2,742 from Bing, and 663 from Flickr. Note that duplicate filtering works in favour of the total number of images retained from Bing; the total number of duplicates removed was not tracked. Figure 4 lists the distribution of images retained per class along with their sources. Google and Bing provide a superior rate of correct images when compared to Flickr, while being highly similar in the number of correctly returned images per class. By translating the 118 representative queries into the five non-English languages, we performed an additional 63 queries which retrieved 12,846 images, of which 1,641 were included in the dataset: 762 from Google, 804 from Bing, and 74 from Flickr. Figure 5 displays the distribution of images retained after non-English queries, per class and by source.

Figure 4: Overview of images per class as derived from each source, harvested using English language queries.

To add to the web-sampled images, the Geograph UK project supplied a database export for three classes, namely animal on road, flooding, and snow. The database exports were made using the search terms animals/cow/sheep on road, flooding road, and snow road. Table 1 provides an overview of positive collected images as gathered from English and non-English Web harvesting queries and from Geograph.

Incident English non-English Geograph Total
Animal on road 534 79 708 1,321
Collapse 362 123 6 491
Crash 1,158 320 - 1,478
Fire 791 74 - 865
Flooding 453 446 1,257 2,156
Landslide 676 149 - 825
Snow 1,265 304 3,174 4,743
Treefall 605 146 - 751
Total 5,844 1,641 5,145 12,630
Table 1: Total amount of images per class as collected by gathering type.

4.1.2 Negative Examples

Since this research is concerned with recognizing incidents as compared to normal driving situations, a dataset of negative examples was sampled. Negative examples are images of driving situations without an incident affecting the road. For this purpose we sample from various sources to cover the preconditions of the research:

  • Berkeley Deep Drive (20k): a random sample of 20k images from the 100k image subset of the Berkeley Deep Drive (BDD) dataset [seita_bdd100k:_2018]. This dataset consists of dashcam footage captured in a variety of US cities. It is notable for the variety of captured scenes and weather conditions. The inclusion of scenes containing wet and snowy conditions makes this dataset especially relevant to help distinguish between disruptive and non-disruptive weather conditions. It also contains variations in rotation and angle similar to the positive dataset. The image dataset contains the 10th frame of each video in their 100k videos dataset. This dataset thus captures the widest variety of driving conditions of the negatives datasets such as weather conditions, geographic diversity (across the United States), day/nighttime, and complications, e.g., reflections of the dashboard.

  • Cityscapes (10k): sample images from the Cityscapes dataset [cordts_cityscapes_2016], which covers a variety of street scenes in German cities and contains every 20th frame of 30fps video sequences. While less varied in weather and camera conditions, the driving conditions in the dataset are distinctly more European than those in the BDD dataset. We include 10k images from Cityscapes to improve the geographic diversity of the negatives class.

  • Geograph (10.2k): a random sample of 10.2k images tagged as 'road transport' from the Geograph project [noauthor_geograph_2018]

    . The dataset contains unfiltered stills covered by the tag, and therefore reflects a high diversity of images, at times even irrelevant to driving conditions in general. While most photos are taken from a viewpoint similar to BDD and Cityscapes, a number of images also contain odd angles and targets (e.g. streams or pastures). We include this dataset to offset the strong urban focus of the benchmark datasets, as well as to counteract potential overfitting on the difference in viewpoint rotation, angle, and orientation in the incidents dataset. We retain 10k images from this dataset to ensure the inclusion of landscape images, which are frequent in the positives dataset.

Images from the BDD dataset often contain elements of the ego vehicle in the image itself (e.g., dashboard or bonnet). To reduce the chance of overfitting the negatives dataset onto such irrelevant visual cues, we crop out the bottom 25% of all images from the BDD dataset. To retain the image aspect ratio, we also crop 12.5% from both the left and right sides of the image. Images are resized to 224×224 pixels to match the input size of the classification model. We subset the images into splits of 70/20/10% of the full data for training, validation, and testing respectively.
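The cropping and splitting steps above can be sketched in NumPy as follows (a sketch under our own assumptions; the frame size is hypothetical, and the final resize to 224×224 would typically be done with an image library such as Pillow):

```python
import numpy as np

def crop_bdd_frame(img):
    """Remove the bottom 25% of a BDD frame (dashboard/bonnet) and 12.5%
    from each side, preserving the original aspect ratio."""
    h, w = img.shape[:2]
    margin = int(w * 0.125)
    return img[: int(h * 0.75), margin : w - margin]

def split_indices(n_images, seed=0):
    """Shuffle image indices and split them 70/20/10 into
    training/validation/testing subsets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_images)
    n_train, n_val = int(n_images * 0.7), int(n_images * 0.2)
    return idx[:n_train], idx[n_train : n_train + n_val], idx[n_train + n_val :]

frame = np.zeros((720, 1280, 3), dtype=np.uint8)  # hypothetical 720p dashcam frame
cropped = crop_bdd_frame(frame)
print(cropped.shape)  # (540, 960, 3) -- same 16:9 aspect ratio as the input
```

Removing 25% of the height and 2 × 12.5% of the width scales both dimensions by the same factor (0.75), which is why the aspect ratio survives the crop.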

4.2 Set-up of CNN Model

To perform incident detection we fine-tune a ResNet-34 model [he_deep_2016] pre-trained on the ImageNet dataset [deng_imagenet:_2009]. To do so, we unfreeze all layers of the network. During training we consider the individual classes listed at the lowest-level nodes in Figure 2. To account for the imbalance in the number of positive and negative samples, and to limit overfitting onto specific classes, we scale the loss according to the inverse frequency of the number of images of each class across the dataset, as in Equation 1:

w_y = |C| / |C_y|    (1)

Here, C denotes the set of all labelled images (including negatives) in the dataset, and C_y is the set of positive examples of the ground-truth class y of the input image. We train the model with the parameter settings listed in Table 2. We optimize the model using the RMSprop optimization algorithm [hinton_neural_2015, s.29] without applying momentum.
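The inverse-frequency weighting can be computed as below (plain-Python sketch; the class counts are illustrative, not the paper's actual class sizes). In a PyTorch training loop, the resulting weights would typically be passed to `nn.CrossEntropyLoss(weight=...)` and the model optimized with `torch.optim.RMSprop(..., momentum=0)`:

```python
def inverse_frequency_weights(class_counts):
    """w_y = |C| / |C_y|: scale each class's loss contribution by the
    inverse of its frequency in the dataset."""
    total = sum(class_counts.values())
    return {label: total / count for label, count in class_counts.items()}

# Illustrative counts only -- not the paper's actual class sizes.
counts = {"snow": 4000, "flooding": 2000, "negative": 40000}
weights = inverse_frequency_weights(counts)
# A class ten times rarer than another receives a ten-times-larger loss weight.
```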

Figure 5: Overview of images per class as derived from each source using the non-English queries.
Parameter Value
Batch size 10
Initial learning rate 0.0001
Learning rate decay schedule Decay at epochs 10, 30, 40
L2 regularization strength 0.0001
Table 2: Parameter settings used to train the classification model.

We apply the following random augmentations with a 50% chance of occurrence where applicable:

  • Random horizontal flip

  • Random greyscale transform

  • Random rotation up to 5 degrees in either direction

  • Jittering of hue/brightness/contrast/saturation up to a factor of 0.05

Lastly, we normalize all images to the mean and standard deviation of the colour bands of the training subset. Network training is performed using PyTorch version 0.3 running on Python version 3.6, on the freely available Google Colaboratory platform.
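A simplified NumPy sketch of the augmentation pipeline listed above (rotation and the full hue/contrast/saturation jitter are omitted for brevity; in practice these transforms map directly onto `torchvision.transforms` equivalents):

```python
import random

import numpy as np

def augment(img, rng=None):
    """Apply each augmentation independently with 50% probability.
    img is an (H, W, 3) float array with values in [0, 255]."""
    rng = rng or random.Random()
    if rng.random() < 0.5:                     # random horizontal flip
        img = img[:, ::-1]
    if rng.random() < 0.5:                     # random greyscale transform
        img = np.repeat(img.mean(axis=2, keepdims=True), 3, axis=2)
    if rng.random() < 0.5:                     # brightness jitter up to a factor of 0.05
        img = np.clip(img * (1 + rng.uniform(-0.05, 0.05)), 0, 255)
    return img
```

Each branch fires independently, so a single image may receive zero, one, or several of the transforms, matching the "50% chance of occurrence where applicable" rule.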


5 Results

5.1 Full Dataset Results

Training was concluded after 50 epochs, with a stable state reached after 37 epochs. Figure 6 displays the loss of the model at every epoch during training and validation, while Table 3 shows the final accuracy and average F1-score derived for each phase on the best model. The confusion matrix for the testing phase is given in Table 4, showing the expected trend that most misclassifications pertain to the flooding class.

Metric Training Validation Testing
Accuracy 99.49% 96.31% 97.15%
F1-score 0.9403 0.9054 0.8909
Loss 0.02149 0.2135 0.1761
Table 3: Classification performance of the best model trained on the full dataset
Figure 6: Loss curve of the model trained on the complete dataset.
True label Predicted F1 Top-1
Animal on Road 129 0 1 0 2 1 0 1 1 0.9021 95.56%
Road Collapse 0 50 1 0 0 0 0 0 3 0.9174 92.59%
Vehicle Crash 1 0 155 0 0 0 1 1 2 0.9394 96.88%
Fire 0 0 0 97 0 1 0 0 2 0.9848 97.00%
Flooded Road 0 1 0 0 188 0 1 2 20 0.8806 88.68%
Landslide 0 1 0 0 0 65 1 0 3 0.9028 92.86%
Treefall 0 0 1 0 1 2 67 1 1 0.9241 91.78%
Snow on Road 2 0 0 0 3 0 0 468 14 0.9689 96.10%
Negative 19 3 12 0 21 6 2 6 3894 0.9854 98.26%
Table 4: Testing split confusion matrix (n = 5,263) of the best model trained on the full dataset. The rows represent the true class and the columns represent the predicted class.

5.2 Geographical Stratification

To assess the influence of geographical correlation in visual features in images on the incident detection task, we explore the impact of geographical stratification of the training image set. We run an experiment using three incident classes present in the Geograph data: Animals, Flooding, and Snow, while considering negatives as usual. We only use these three positive classes and the full dataset of negatives as the images retrieved from the Geograph project have reliable geotags that are situated in the United Kingdom or Ireland, and as such they can be regionally stratified by locations in these two countries. Images from England, Scotland, or Ireland are included in the training or validation dataset, while images from Wales form the holdout dataset. We thus effectively split the Geograph data situated in England, Scotland, and Ireland to a 72.5/22.5/5% split, with the 5% holdout data situated in Wales so that we can test the trained model performance on unseen data from a new geographical region.

During training and validation of the geographically stratified dataset we include the harvested and non-English data. 75% of the harvested and non-English data in each of the three positive incident classes is added to the training dataset, while the remaining 25% data is distributed to the validation dataset. While it is possible that some of the harvested data are situated in Wales, UK, the high ratio of Geograph to non-Geograph data in this class (2:1 for all three classes) assures that the occurrence of undesirable geographical correlation resulting from the inclusion of harvested images is low. Model training adhered to the same parameters as for all incident classes, as the chosen hyperparameters were observed to lead to good convergence.
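The region-based hold-out described above can be sketched as follows (a sketch under our own assumptions: the `region` key standing in for the Geograph geotag, and the function name, are ours):

```python
import random

def geo_stratified_split(samples, holdout_region="Wales", seed=42):
    """Hold out all images from one region as the test set and split the
    remainder 72.5/22.5 into training/validation, giving an overall
    72.5/22.5/5% split when the holdout region holds ~5% of the data."""
    test = [s for s in samples if s["region"] == holdout_region]
    rest = [s for s in samples if s["region"] != holdout_region]
    random.Random(seed).shuffle(rest)
    n_train = round(len(rest) * 72.5 / 95.0)
    return rest[:n_train], rest[n_train:], test
```

Because the test split is selected purely by region rather than at random, no image from the held-out region can leak into training, which is the point of the geographical stratification.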

Training was concluded after 50 epochs, while validation loss stabilized after 33 epochs. Table 5 displays the final accuracy and F1-score derived for each phase on the best model. The confusion matrix for the testing phase is given in Table 6. The trends visible in the confusion matrix largely follow those seen for the complete dataset. While the overall accuracy is good, the drop in accuracy and F1-score is notable.

Metric Training Validation Testing
Accuracy 97.90% 96.59% 92.90%
F1-score 0.9403 0.9054 0.9169
Loss 0.0771 0.1352 0.1973
Table 5: Classification performance of the best model trained on the geo-stratified dataset.
True label Predicted F1 Top-1
Animals on Road 73 0 0 0 0.9299 100%
Flooded Road 1 54 3 0 0.9319 93.10%
Negative 10 3 48 2 0.8205 76.19%
Snow on Road 0 0 3 112 0.9782 97.39%
Table 6: Testing split confusion matrix (n = 309) of the best model trained on the geographically stratified dataset. The rows represent the true class and the columns represent the predicted class.

6 Discussion

6.1 Incident Recognition

The confusion matrices show encouraging patterns, and evaluation of the F1-score confirms that the model is not overclassifying images to any particular class. A notably consistent error of the trained model is the confusion between Snow or Flooding and the negatives class. Both classes have a definition that is hard to delineate, challenging even human classifiers. This uncertainty is reflected in the consistency with which misclassifications occur across the three splits. Notice also how the Animal on Road class is hardly ever misclassified during training on the full dataset, yet it is one of the worst-performing classes during validation, despite having more training samples than other problematic classes such as Road Collapse. In Figure 7 we inspect the model’s visual attention on prototypical incidents by applying Class Activation Mapping, a summation of the total amount of signal at each pixel of the input image. Of the prototypical images, only one incident type is misclassified, namely the crash, which is classified as a negative scene. The class attention for the other classes is centred on the incident of interest, though the class attention for the snow class may drift; ideally, the attention for this class should remain on the driving surface.

(a) Negative (b) Collapse (c) Animal (d) Treefall
(e) Snow (f) Flood (g) Landslide (h) Fire
Figure 7: Class attention of predicted class overlaid on prototypical images of each class.
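For a ResNet-style network, a class activation map in the standard formulation weights each feature map of the last convolutional layer by the final classifier weight of the predicted class and sums over channels; the following NumPy sketch (our own variable names, assuming this standard formulation rather than the paper's exact implementation) illustrates the computation:

```python
import numpy as np

def class_activation_map(feature_maps, fc_weights, class_idx):
    """feature_maps: (C, H, W) activations of the last conv layer.
    fc_weights: (num_classes, C) weights of the final fully-connected layer.
    Returns an (H, W) map normalised to [0, 1] for overlaying on the image."""
    # Contract the channel axis: sum_c w[class, c] * feature_maps[c, :, :]
    cam = np.tensordot(fc_weights[class_idx], feature_maps, axes=1)
    cam -= cam.min()
    return cam / (cam.max() + 1e-8)
```

The resulting low-resolution map is typically upsampled to the input size and overlaid as a heatmap, as in Figure 7.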

In order to further investigate the model’s prediction patterns we apply t-SNE dimension reduction to the last fully-connected layer of the model, as in [krizhevsky_imagenet_2012]. t-SNE maps the 512-dimensional vector of the fully-connected layer to just two dimensions for visual interpretation. This is done by attracting points with a high similarity whilst repelling points that are not alike in the original high-dimensional space, then projecting the points onto a 2-dimensional plane. Several patterns appear in the t-SNE plots:

  • The model can distinguish most classes well
    Most positive classes are clustered together without a fuzzy border towards the grey negatives cluster, with the exception of the flooding and the landslide classes. This reflects the uncertainty of these classes as per their reported accuracies, while the other classes are far less affected by uncertainty throughout all three splits.

  • The flooding class has the greatest uncertainty in its classification region
    As indicated by point of interest d, the flooding class strongly gravitates towards the Geograph negatives cluster and shares a large indecisive boundary region with it.

  • Images within the negatives set are easily distinguishable within the negatives cluster
    Outlined with a red ellipse we find a cluster that predominantly consists of negative Geograph images. We conjecture that this cluster is formed by the comparatively greater amount of countryside images within the Geograph negatives set when compared to both the Berkeley Deep Drive (BDD) and the Cityscapes negatives.

    Figure 8: t-SNE dimension reduction of inputs to the fully-connected layer for every image of the complete dataset model, used to make predictions on which class each image belongs to. Plot generated with a perplexity of 50, a learning rate of 500, and a total of 1,000 iterations. The inner circle of each point represents its true class, with the outer circle representing its predicted class. The red ellipse indicates a region consisting almost exclusively of Geograph negatives.
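A t-SNE projection of this kind can be produced with scikit-learn roughly as follows. The random features below merely stand in for the real 512-dimensional activations, and the small perplexity keeps the toy example fast; the plot in Figure 8 used perplexity 50, a learning rate of 500, and 1,000 iterations:

```python
import numpy as np
from sklearn.manifold import TSNE

# Random stand-ins for the 512-d fully-connected activations of 100 images.
features = np.random.default_rng(0).normal(size=(100, 512))

# Map the 512-d vectors to 2-d for visual inspection.
embedding = TSNE(n_components=2, perplexity=5, init="pca",
                 random_state=0).fit_transform(features)
print(embedding.shape)  # (100, 2)
```

Each row of `embedding` would then be scattered on a 2-D plane, coloured by true class (inner circle) and predicted class (outer circle) as in Figure 8.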

6.2 Relevancy/Severity

An important focus of future work is the pair of concepts relevancy and severity. As noted in [ohn-bar_are_2017], objects should only be considered incidents if they are spatially related to the driving situation, i.e., relevant. Positive class examples gathered during this research all share this feature. A second characteristic to consider is the severity of the incident, i.e., whether an incident is significant enough to disrupt road serviceability. In Figure 9 we show examples which demonstrate the role of relevancy and severity. We note that both relevancy and severity depend on characteristics of the driving situation such as the type of vehicle used, tire quality, and velocity. As such, the relevancy/severity problem is likely a new, complex prediction problem that uses the detection of an incident as input. For the purposes of training, such characteristics can perhaps be averaged per vehicle type to compute type-specific relevancy/severity scores for incident examples.

(a) Animal (b) Flood (c) Animal
Figure 9: Relevancy vs severity. Figure (a) shows an irrelevant scene. Figure (b) shows the difficulty of estimating severity. Figure (c) shows an example where both the relevancy and severity are hard to determine.
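The per-vehicle-type averaging suggested above could take a form like the following. This is a purely hypothetical sketch: no such severity annotations exist in our dataset, and all names and scores are invented for illustration.

```python
from statistics import mean

# Hypothetical severity judgments for one incident example, one per judged
# driving situation (vehicle type, tire quality, velocity band).
# Scores in [0, 1]; 1 = fully disrupts road serviceability.
judgments = [
    {"vehicle": "car",   "tires": "worn", "speed": "high", "severity": 0.9},
    {"vehicle": "car",   "tires": "new",  "speed": "low",  "severity": 0.4},
    {"vehicle": "truck", "tires": "new",  "speed": "low",  "severity": 0.2},
    {"vehicle": "truck", "tires": "worn", "speed": "high", "severity": 0.5},
]

def type_specific_severity(judgments, vehicle):
    """Average severity over all judged situations for one vehicle type."""
    scores = [j["severity"] for j in judgments if j["vehicle"] == vehicle]
    return round(mean(scores), 2)

print(type_specific_severity(judgments, "car"))    # 0.65
print(type_specific_severity(judgments, "truck"))  # 0.35
```

Such type-specific scores could then serve as regression targets on top of the incident detector's output.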

6.3 Geographical Stratification

In driving scenarios it is imperative that trained models generalize well across regions and landscapes: a model needs to detect incidents regardless of the visual properties of the environment. The ability to generalize from training data collected in one region to deployment in unseen geographical regions is therefore an important indicator of a model’s fitness for deployment and its detection quality. To the best of our knowledge, this problem has thus far been neglected. Yet, with the advent of self-driving cars whose systems are trained only on data from limited regions (e.g., the USA), the lack of geographical transferability of training datasets may have serious consequences.

This is especially the case for classifiers that recognize hazardous situations. If, for example, an incident detector is unable to recognize flooding in a desert during a flash flood, an autonomous vehicle may drive straight into a hazardous situation. In this research we considered the role of geographical correlation between data points during training, and we demonstrated the need for future research to consider the adaptability of incident-detection models to unseen geographical domains.
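A geographically stratified split of the kind used in our second experiment can be sketched as follows. The grid-cell size and coordinates are illustrative, not those of our actual UK stratification; the key property is that no grid cell contributes images to both sets.

```python
import random

def geo_stratified_split(samples, cell_deg=1.0, test_frac=0.2, seed=0):
    """Split samples so no lat/lon grid cell appears in both sets.

    samples: list of (image_id, lat, lon) tuples.
    Returns (train_ids, test_ids).
    """
    # Bucket images into cell_deg x cell_deg grid cells.
    cells = {}
    for image_id, lat, lon in samples:
        cell = (int(lat // cell_deg), int(lon // cell_deg))
        cells.setdefault(cell, []).append(image_id)

    # Assign whole cells (not individual images) to the test set.
    cell_keys = sorted(cells)
    random.Random(seed).shuffle(cell_keys)
    n_test = max(1, int(len(cell_keys) * test_frac))
    train = [i for c in cell_keys[n_test:] for i in cells[c]]
    test = [i for c in cell_keys[:n_test] for i in cells[c]]
    return train, test

# Toy samples scattered over a UK-sized bounding box (id, lat, lon).
samples = [(f"img{i}",
            50.0 + random.Random(i).random() * 8,
            -5.0 + random.Random(i + 100).random() * 6) for i in range(200)]
train_ids, test_ids = geo_stratified_split(samples)
assert not set(train_ids) & set(test_ids)
```

Splitting by cell rather than by image prevents near-identical scenes photographed metres apart from leaking between training and test sets.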

6.4 Limitations

Limitations of the Taxonomy of Incidents

A possible refinement when creating deeper groupings in the taxonomy is to consider synsets (sets of related synonyms) such as those used in ImageNet [deng_imagenet:_2009]. Princeton’s WordNet [miller_wordnet:_1990] may form a good basis for those classes in this research that are not combinations of terms. For instance, landslide is listed as a distinct synset, along with its hyponyms rockslide and mudslide, while animal on road is not listed, as it is a combination of animal and road. Instead, the various hyponyms of animal may be considered separately and then combined with their context term (e.g., road). For classes that are covered by WordNet, standardized search terms should be used as much as possible so that the semantic definitions remain consistent across future research efforts.
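Combining hyponyms with a context term could be done as simply as the following sketch; the word lists are illustrative stand-ins for hyponyms that would, in practice, be drawn from WordNet synsets.

```python
from itertools import product

# Illustrative hyponym list; in practice drawn from the WordNet
# "animal" synset's hyponym tree.
animal_hyponyms = ["cow", "deer", "horse", "sheep"]
context_terms = ["on road", "on motorway"]

# Build standardized search terms for harvesting one incident class.
search_terms = [f"{animal} {context}"
                for animal, context in product(animal_hyponyms, context_terms)]
print(search_terms[:2])  # ['cow on road', 'cow on motorway']
```

Generating terms this way keeps the class definition tied to a fixed synset rather than to ad hoc keyword choices.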

Dataset Limitations

Images gathered by API harvesting contained many duplicates prior to image selection. We filtered many of these duplicates by comparing each image against every other image for exact equivalence, without accounting for resizing or compression artifacts. This means that resampled and resized images, as well as images with effect filters applied, may be retained in the dataset. While we suspect that only a few duplicates remain in the final datasets, it is worth considering more sophisticated data cleaning approaches in future work, e.g., feature-based methods such as perceptual image hashing using feature points, which can accurately detect equivalence while accounting for a wide variety of distortions, transformations, and alterations [monga_perceptual_2006].
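The exact-equivalence check described above amounts to comparing file contents; hashing each file once avoids the quadratic pairwise comparison. A minimal sketch (file names and byte strings are hypothetical):

```python
import hashlib

def dedupe_exact(images):
    """Keep only the first copy of each set of byte-identical images.

    images: dict mapping image name to raw file bytes.
    Note: like the check used in this work, this misses resized or
    re-encoded duplicates; a perceptual hash would be needed for those.
    """
    seen, keep = set(), []
    for name, data in images.items():
        digest = hashlib.sha256(data).hexdigest()
        if digest not in seen:
            seen.add(digest)
            keep.append(name)
    return keep

images = {
    "flood_01.jpg": b"\xff\xd8...bytes-a",
    "flood_02.jpg": b"\xff\xd8...bytes-a",  # byte-identical duplicate
    "flood_03.jpg": b"\xff\xd8...bytes-b",
}
print(dedupe_exact(images))  # ['flood_01.jpg', 'flood_03.jpg']
```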

Lastly, the full-dataset classifier may be sensitive to sampling biases resulting from the method used to generate the negatives dataset. The negatives set of the full experiment does not contain harvested images, which differ in character from the Geograph negatives and the driving datasets. Thus, the model may have learned to distinguish harvested positives from non-harvested negatives. The geographically stratified model does not suffer from this suspected bias, in that the test dataset for this experiment contains the same sources as the training and validation data, and harvested images are only used to enhance the model during training and validation. As the accuracy does not differ greatly between the two experiments, we do not believe this bias to be significant. For a decisive test of the influence of source-type biases, we suggest running a new experiment with a second curated test dataset that also contains harvested samples, so that the degree of bias can be assessed.

7 Conclusions

Road networks around the world are under increasing pressure as car ownership rises and road transport intensifies. This increase in road network pressure amplifies the effect that incidents have on the network. At the same time, sensor-equipped vehicles are becoming increasingly prevalent as autonomous vehicles enter production. To the best of our knowledge, no existing research has previously addressed the recognition of incidents as a domain using images as seen from sensor-equipped vehicles. The main motivation for this research was the need to assess the feasibility of visual incident detection as a new classification task, one shaped by constraints on the types of images used, the underspecification of the definition of an incident, and the dependence on large, geographically stratified datasets. In this research we have therefore created a taxonomy for unsigned physical incidents, gathered a dataset of images for classification, and confirmed that unsigned physical incidents are learnable by convolutional neural networks, with an overall accuracy of 97.15% and an F1-score of 0.8909. In a second experiment we determined that spatially stratifying the training and test datasets deteriorates performance: the overall accuracy of this second experiment was 92.90%, with an F1-score of 0.9169. While this small decline shows that the model generalizes well, the experiment indicates that performance in visually very distinct regions would drop rapidly. Future work should therefore focus on a comprehensive collection of geographically distributed training data to ensure consistent performance of such models globally. The dataset is available at


The authors would like to acknowledge the Geograph UK project for their assistance in using Geograph imagery.