Traffic sign detection and recognition are critical parts of an autonomous vehicle (AV) system for navigation. Currently, traffic sign detection is achieved by deploying state-of-the-art deep learning object detection networks [Wen2017traffic, zhu2016traffic, sheikh2016traffic, lee2018simultaneous]. For safety reasons, an AV should not miss any sign, as such a mistake could result in a catastrophic incident. Therefore, these detectors are required to operate reliably in variable conditions. However, unknown environments, degraded image quality due to bad weather, uneven illumination and poor textures are some of the factors that can impact the performance of object detection systems deployed onboard AVs. This raises safety concerns, and it is among the reasons halting the widespread deployment of autonomous vehicles beyond the level of autonomy [sae2014taxonomy] at which the vehicle must be assisted by the driver when needed [litman2017autonomous].
One way to tackle the safety issue is to keep improving the performance of traffic sign detectors. However, given that object detection systems cannot be guaranteed never to make mistakes, we argue that there should be a mechanism to detect when these detectors make mistakes during deployment: a failure detection system that raises an alarm when there is evidence that the performance of the deployed sign detector may have degraded. The aim is to alert the detector that it may have failed to identify a sign in a particular region of its input image. Upon receiving the alarm, the detector can take alternative measures to detect the sign again. If the frequency of alarms keeps increasing, the autonomous system can ask a human user to take control.
To this end, this paper proposes such a failure detection system in the context of traffic sign detection, although the proposed method is not restricted to traffic signs. Our proposed method uses the feature maps of a deployed deep neural network traffic sign detector to extract cues and detect potential false negatives. To the best of our knowledge, this method is the first to tackle this critical aspect of false negative detection in robotic vision.
The rest of the paper is organized as follows. In Section II, we review related work on failure detection. In Section III, we introduce our approach to detecting failures of a traffic sign detection system by discovering false negatives. Section IV outlines our experimental evaluation setup. Section V presents the results, and finally, in Section VI, we draw conclusions and suggest areas for future work.
II Related Work
Several works have addressed the detection or prediction of failures in vision systems. We can categorize the proposed approaches into two broad groups: the first group identifies failures by examining the output of the vision system, while the second group uses a separate system to predict the failure of the vision system based on its input.
Among the first group of approaches, [morris2007robotic] introduced the idea of introspection in the context of robotics. They described the ambiguity of a robot's awareness during deployment as a barrier to using these systems in real environments. [grimmett2013knowing] and [triebel2016driven] discussed the importance of introspection capability for classifiers in a robotics context. They performed extensive studies to explore the failure prediction capability of multiple classification algorithms and proposed to assess the predictive variance to mitigate potentially overconfident classifiers. [devries2018leveraging] proposed a single end-to-end framework to measure the uncertainty of an automated segmentation pipeline, and [devries2018learning] estimated the confidence of a neural network for out-of-distribution sample detection.
In the second group of approaches, [zhang2014predicting] proposed a warning system named ALERT to build a self-evaluating vision system. It analyzes the input and predicts the output reliability of the vision system. [hecker2018failure] proposed the concept of scene drivability, which predicts whether a driving scene is feasible for a given driving method. [Gurau2018LearnFE] used a probabilistic approach that exploits space, time and appearance to predict the performance of an autonomous vehicle and to anticipate when to hand over control to the human user. [hu2017introspective] proposed a method to evaluate the performance of a perception system without any ground truth. [saxena2017learning] argued that most vision-based perception failures occur because of improper illumination of the scene; to tackle this problem, they proposed a failure detection and recovery maneuver for a vision system. [daftry2016introspective] proposed a system-agnostic framework to predict failures in a vision system, arguing that predicting failure from raw sensor data is more effective than using the uncertainty of model-based classifiers.
Our proposed algorithm belongs to the first group of approaches, as we use the traffic sign detector itself to extract important cues that reveal its failures. As stated in the introduction, our work does not aim to improve the performance of the sign detector. Instead, we focus on identifying the cases where the detector fails to identify a traffic sign at a particular location.
III False Negative Detection
In this section, we describe our false negative detector (FND), which identifies failures of a deployed traffic sign detector (TSD). We assume that the weights of the TSD model are fixed during the deployment phase.
The proposed FND works in two steps:
1. Collect features from specific areas of an input image where TSD has not detected any sign.
2. Evaluate those features to identify false negative traffic signs in those areas.
The FND relies on the observation that when the TSD misses a sign, there are usually still some excited regions in its internal feature maps, some of which correspond to the location of that sign (see Figure 1). We exploit this fact to build a classifier that takes features from those regions and determines whether or not TSD has failed to detect a sign in that area.
Figure 2 shows an example of a missed sign with multiple excited regions in the feature maps of the detector. TSD has failed to detect a traffic sign in the input image, yet one of its internal feature maps for this image contains multiple excited regions. One of these regions is located at the position of the missing traffic sign. We refer to this region as a failure because TSD has failed to detect a sign there. The other regions are referred to as imposters because they are excited but not related to the missing traffic sign. After binarizing the feature map (step b in Figure 2), we apply contour area detection to locate the bounding box of each excited region (step c). Step d in Figure 2 is the output of the FND showing the discovered false negative traffic sign.
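The localization steps b and c can be sketched as follows. This is a minimal illustration, not the authors' implementation: it binarizes a single 2-D feature-map channel with a hypothetical relative threshold (`rel_threshold` times the peak activation, since the binarization rule is not stated here) and, in place of contour area detection, uses 4-connected component labelling to obtain one inclusive bounding box per excited region. Feature maps are plain nested lists so that only the standard library is needed.

```python
from collections import deque


def binarize(feature_map, rel_threshold=0.5):
    """Mark cells whose activation is positive and at least rel_threshold * peak.

    rel_threshold is a hypothetical parameter chosen for illustration."""
    peak = max(max(row) for row in feature_map)
    cutoff = rel_threshold * peak
    return [[1 if v >= cutoff and v > 0 else 0 for v in row] for row in feature_map]


def excited_regions(binary):
    """Return one inclusive box (top, left, bottom, right) per 4-connected blob."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not binary[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill of one excited region, tracking its extent.
            seen[y][x] = True
            queue = deque([(y, x)])
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and binary[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# Toy 4 x 5 activation map with two separate excited blobs.
fmap = [
    [0.0, 0.9, 0.8, 0.0, 0.0],
    [0.0, 0.7, 0.0, 0.0, 0.1],
    [0.0, 0.0, 0.0, 0.6, 0.6],
    [0.0, 0.0, 0.0, 0.6, 0.0],
]
print(excited_regions(binarize(fmap)))
```

OpenCV's contour functions would serve the same purpose on real arrays; connected components are used here only because they fit in a short, dependency-free example.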
III-A Training the false negative detection system
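During the training stage, we convert the coordinates of the excited regions from feature space to image space and measure their intersection over union (IoU) with the ground truth bounding boxes. We then label each region $r_i$ using Equation 1:

$$
\ell(r_i) =
\begin{cases}
\text{failure}, & \text{if } \mathrm{IoU}_{\max}(r_i) \geq \tau,\\
\text{imposter}, & \text{otherwise,}
\end{cases}
\qquad (1)
$$

where $\mathrm{IoU}_{\max}(r_i)$ measures the maximum intersection over union of region $r_i$ with all the ground truth boxes of the input image, and $\tau$ is an IoU threshold.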
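A sketch of the labelling rule in Equation 1 follows. It assumes that a feature-map box is mapped to image space by a uniform stride (the hypothetical `stride_y` and `stride_x` parameters; the exact conversion used by the authors is not given here) and that a region is a failure when its maximum IoU with any ground truth box reaches the threshold `tau`. All function names are illustrative.

```python
def to_image_space(box, stride_y, stride_x):
    """Scale an inclusive feature-map box (top, left, bottom, right) to pixel
    coordinates (y0, x0, y1, x1), treating each cell as a stride_y x stride_x block."""
    top, left, bottom, right = box
    return (top * stride_y, left * stride_x, (bottom + 1) * stride_y, (right + 1) * stride_x)


def iou(a, b):
    """Intersection over union of two (y0, x0, y1, x1) boxes with exclusive upper corners."""
    inter_h = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_w = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_h * inter_w
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_region(region_box, ground_truths, tau):
    """Equation 1 as sketched here: failure if the max IoU with any ground truth >= tau."""
    best = max((iou(region_box, gt) for gt in ground_truths), default=0.0)
    return "failure" if best >= tau else "imposter"


# One ground-truth sign; two excited regions from a hypothetical stride-16 feature map.
gt_boxes = [(0, 16, 32, 48)]
for feat_box in [(0, 1, 1, 2), (2, 3, 3, 4)]:
    img_box = to_image_space(feat_box, 16, 16)
    print(feat_box, img_box, label_region(img_box, gt_boxes, tau=0.5))
```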
Now that we have a set of failure and imposter regions, we extract the corresponding failure and imposter features from all of them. To do so, we first stack the three-dimensional feature maps $F_1, \dots, F_n$ of the deployed TSD network along their channel axis, $F = \operatorname{concat}(F_1, \dots, F_n)$, where $F_k \in \mathbb{R}^{H \times W \times C_k}$. The feature maps share the same height and width but have different numbers of channels.
After stacking, we get a new feature map $F$ of size $H \times W \times C$, where $C = \sum_{k=1}^{n} C_k$. We apply Algorithm 1 to extract a length-$C$ one-dimensional feature for each region from $F$.
In Algorithm 1, the feature of each region is the vector of maximum values of the stacked feature map within that region, taken separately along each channel.
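Stacking and the per-region feature extraction of Algorithm 1 can be sketched as follows, assuming feature maps stored as nested lists indexed `[y][x][c]`. The hypothetical `stack_channels` concatenates maps that share height and width along the channel axis, and `region_feature` takes the per-channel maximum inside an inclusive region box, giving one value per stacked channel.

```python
def stack_channels(feature_maps):
    """Concatenate H x W x C_k maps (nested lists [y][x][c]) along the channel axis."""
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    for fm in feature_maps:
        if len(fm) != h or len(fm[0]) != w:
            raise ValueError("all feature maps must share the same height and width")
    # For every spatial cell, chain the channel vectors of all maps in order.
    return [[[v for fm in feature_maps for v in fm[y][x]] for x in range(w)] for y in range(h)]


def region_feature(stacked, box):
    """Per-channel maximum of the stacked map inside an inclusive (top, left, bottom, right) box."""
    top, left, bottom, right = box
    cells = [stacked[y][x] for y in range(top, bottom + 1) for x in range(left, right + 1)]
    # zip(*cells) regroups the values by channel; the result has length C = sum of C_k.
    return [max(channel) for channel in zip(*cells)]


fm_a = [[[1], [2]], [[3], [4]]]                  # 2 x 2 x 1
fm_b = [[[5, 0], [6, 1]], [[7, 2], [8, 9]]]      # 2 x 2 x 2
stacked = stack_channels([fm_a, fm_b])          # 2 x 2 x 3
print(region_feature(stacked, (0, 0, 1, 1)))
```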
After extracting the failure and imposter feature vectors, we train a fully connected binary classifier (the failure detection network) to distinguish these two types of features. The full architecture of the proposed system is shown in Figure 3.
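The classifier's layer configuration is not given in this section. As a deliberately simplified stand-in, the sketch below trains the smallest possible fully connected binary classifier, a single dense layer with a sigmoid output, by batch gradient descent on binary cross-entropy. The toy features and all hyperparameters (`epochs`, `lr`, `seed`) are made up for illustration.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_fc_classifier(features, labels, epochs=500, lr=0.5, seed=0):
    """Fit one dense layer + sigmoid by batch gradient descent on binary cross-entropy.

    labels: 1 = failure, 0 = imposter. Returns (weights, bias)."""
    rng = random.Random(seed)
    dim = len(features[0])
    w = [rng.uniform(-0.01, 0.01) for _ in range(dim)]
    b = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            # For sigmoid + BCE, d(loss)/d(logit) = prediction - label.
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def failure_score(w, b, x):
    """Score in (0, 1) that the region behind feature x hides a missed sign."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Toy, hand-made features (not from the paper): failures are strong in channel 0.
X = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.1, 0.8], [0.2, 0.9], [0.3, 0.7]]
y = [1, 1, 1, 0, 0, 0]
w, b = train_fc_classifier(X, y)
print([round(failure_score(w, b, x), 2) for x in X])
```

A real FND classifier would add hidden layers and use a deep learning framework; the training signal (binary cross-entropy on failure versus imposter labels) is the part this sketch is meant to show.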
III-B Deployment of the false negative detection system
During the testing phase, FND follows the feature extraction pipeline (Figure 2) to extract features from the internal layers of TSD. First, FND receives the detection output generated by TSD and locates the areas of the input image without any detection. Excited regions are located in these areas following steps b and c of the pipeline, and these regions are then used to extract features from the internal layers. Finally, the failure detection network classifies each feature as a failure or an imposter.
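The deployment step can be summarized in a few lines. In this sketch, "areas without any detection" is approximated by discarding excited regions whose image-space box overlaps any TSD detection; `score_fn` stands in for the trained failure detection network and `threshold` for a hypothetical decision threshold. Boxes are `(y0, x0, y1, x1)` with exclusive upper corners, and all names are illustrative.

```python
def overlaps(a, b):
    """True if two (y0, x0, y1, x1) boxes with exclusive upper corners intersect."""
    return min(a[2], b[2]) > max(a[0], b[0]) and min(a[3], b[3]) > max(a[1], b[1])


def fnd_alarms(region_boxes, region_features, detections, score_fn, threshold=0.5):
    """Return the boxes of excited regions outside every TSD detection that the
    classifier scores as failures (i.e. likely missed signs)."""
    alarms = []
    for box, feat in zip(region_boxes, region_features):
        if any(overlaps(box, det) for det in detections):
            continue  # TSD already reported a sign here; only undetected areas are inspected
        if score_fn(feat) >= threshold:
            alarms.append(box)
    return alarms


regions = [(0, 0, 32, 32), (40, 40, 72, 72), (100, 100, 120, 120)]
feats = [[0.9], [0.8], [0.2]]
detections = [(0, 0, 30, 30)]
# Hypothetical scorer: reads a precomputed failure probability from the feature.
print(fnd_alarms(regions, feats, detections, score_fn=lambda f: f[0]))
```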
IV Experimental Setup
In this section we describe the datasets, the traffic sign detector and the evaluation metrics that we use to evaluate our proposed method.
IV-A Datasets
Our training and testing data consist of images from the Belgium Traffic Sign Dataset (BTSD) [timofte2014multi] and the German Traffic Sign Detection Benchmark (GTSDB) [Houben-IJCNN-2013]. The training split of BTSD is used for all training purposes. The testing split of BTSD and the whole GTSDB dataset are used only for evaluation (i.e., neither the sign detector nor the false negative detector has seen images from the BTSD testing split or GTSDB during training).
Rain and fog effects are applied to our test data using Automold [automold] to simulate different environments.
There are three groups of signs in both datasets: mandatory, prohibitory and danger [Houben-IJCNN-2013]. Table I shows the number of images and sign classes in our training and testing settings.
Table I: Number of images and sign classes for BTSD (train), BTSD (test) and GTSDB (test).
IV-B Evaluation Metrics
In this paper, we use precision and recall to evaluate the proposed method. For false negative detection, precision and recall are defined as follows:

$$
\text{Precision} = \frac{TF}{TF + FA},
$$

where $TF$ (true failure) denotes the number of cases where FND has successfully discovered a failure of the TSD, and $FA$ (false alarm) denotes the number of cases where FND has mistakenly identified an imposter instance as a failure.

$$
\text{Recall} = \frac{TF}{TF + MF},
$$

where $MF$ is the number of failure instances that have been identified as imposters.
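Both metrics can be computed from per-region failure scores as below; sweeping the decision threshold over the observed scores traces a precision-recall curve of the kind compared in Section V. The function name is hypothetical, and reporting precision as 1.0 when no alarm is raised is a convention assumed here, not one stated in the paper.

```python
def fnd_precision_recall(scores, is_failure, threshold):
    """Precision and recall of FND alarms at one decision threshold.

    scores: failure scores per region; is_failure: ground-truth label per region."""
    tf = sum(1 for s, f in zip(scores, is_failure) if s >= threshold and f)      # true failures
    fa = sum(1 for s, f in zip(scores, is_failure) if s >= threshold and not f)  # false alarms
    mf = sum(1 for s, f in zip(scores, is_failure) if s < threshold and f)       # failures called imposters
    precision = tf / (tf + fa) if tf + fa else 1.0  # assumed convention: no alarms -> 1.0
    recall = tf / (tf + mf) if tf + mf else 0.0
    return precision, recall


scores = [0.9, 0.8, 0.6, 0.4, 0.3]
is_failure = [True, False, True, True, False]
# One (threshold, precision, recall) point per distinct score, from strict to lenient.
curve = [(t, *fnd_precision_recall(scores, is_failure, t)) for t in sorted(set(scores), reverse=True)]
for point in curve:
    print(point)
```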
IV-C Traffic Sign Detector
In all of our experiments, we use the Single Shot MultiBox Detector (SSD) [liu2016ssd] for traffic sign detection. To train the detector for traffic signs, we use SSD with Inception V2 [szegedy2016rethinking], pre-trained on the COCO dataset [lin2014microsoft], from the TensorFlow Object Detection API [huang2017speed]. The minimum score threshold, which is used to filter traffic signs from the TSD-generated proposals, is kept at the same value for all experiments. Table II shows the detection performance of TSD on the BTSD and GTSDB testing data.
IV-D Baselines
In this subsection, we describe two baseline approaches against which we compare the performance of FND.
Baseline 1: To train this baseline, all the traffic signs are cropped from the BTSD training data and grouped by class (prohibitory, mandatory and danger). We then train an ImageNet pre-trained VGG16 classifier with dropout layers to classify these three classes. During testing, we use TSD to detect traffic signs in both the BTSD and GTSDB testing data. TSD discards proposals whose score is below the minimum score threshold. This baseline classifies all of those rejected proposals with the VGG16 classifier and measures the classifier's uncertainty using dropout sampling. A low uncertainty means that the classifier has found a traffic sign with high confidence in a rejected proposal.
Baseline 2: Following approaches similar to [zhang2014predicting, daftry2016introspective], we use TSD to detect traffic signs in the BTSD training split and collect the proposals whose TSD score is below the minimum score threshold. These proposals are divided into two groups. The first group, named failure, contains proposals where TSD has made a mistake by not detecting a sign. The second group, named imposter, contains proposals where there is no sign. An ImageNet pre-trained VGG16 binary classifier is trained on these two groups. During testing, the classifier assigns a failure score to each input proposal.
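The dropout sampling used by Baseline 1 can be illustrated with Monte Carlo dropout: dropout stays active at test time, the softmax outputs of several stochastic passes are averaged, and the predictive entropy of the mean serves as the uncertainty. This sketch uses a toy linear classifier with dropout on its inputs instead of VGG16; the entropy measure, the function names and all parameter values are assumptions, since the exact uncertainty measure is not specified here.

```python
import math
import random


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mc_dropout_predict(x, weights, biases, p_drop=0.5, samples=50, seed=0):
    """Monte Carlo dropout on a toy linear classifier.

    Keeps dropout active at test time, averages the softmax outputs of `samples`
    stochastic passes, and returns (mean_probs, predictive_entropy)."""
    if not 0.0 <= p_drop < 1.0:
        raise ValueError("p_drop must be in [0, 1)")
    rng = random.Random(seed)
    mean = [0.0] * len(biases)
    for _ in range(samples):
        # Inverted dropout on the inputs: zero each feature with prob p_drop and
        # rescale the survivors so the expected input is unchanged.
        kept = [xi / (1.0 - p_drop) if rng.random() >= p_drop else 0.0 for xi in x]
        logits = [sum(w * k for w, k in zip(row, kept)) + b for row, b in zip(weights, biases)]
        mean = [m + p / samples for m, p in zip(mean, softmax(logits))]
    entropy = -sum(p * math.log(p) for p in mean if p > 0)
    return mean, entropy


# Hypothetical 3-class head (prohibitory, mandatory, danger) on a 2-D feature.
W = [[4.0, 0.0], [0.0, 0.5], [0.0, 0.0]]
B = [0.0, 0.0, 0.0]
probs, h = mc_dropout_predict([1.0, 1.0], W, B)
print([round(p, 3) for p in probs], round(h, 3))
```

Low entropy indicates that the stochastic passes agree on one sign class, which Baseline 1 interprets as a confidently found sign in a rejected proposal.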
IV-E False Negative Detector (FND)
To train the FND, failure and imposter features are collected based on the procedures described in Section III. In our experiments, all the images from BTSD training split have been used for the training of TSD and FND.
For feature collection, we select all the three-dimensional convolutional feature maps in the base network of TSD (Inception V2). These feature maps have the same width and height but different numbers of channels. One of them is selected empirically to locate the excited regions. A single feature map is then generated by stacking all of these feature maps along their channel axis. To label the excited regions, FND uses Equation 1 with a fixed IoU threshold.
V Evaluation and Results
V-A Naive Solution
Before we present the results of our proposed method, we will discuss why the simple act of lowering the threshold to accept more detections by the sign detector does not provide a satisfactory solution.
Although lowering the minimum score threshold decreases the number of false negatives, this threshold needs to be tuned for each operational condition and environment. Figure 3(a) and Figure 3(b) show the percentage of false negatives generated by TSD under three different settings (normal, fog and rain) of the BTSD and GTSDB datasets.
In Figure 3(a), the percentage of false negatives for in normal BTSD test dataset is . However, it increases to and respectively for rain and fog conditions. To maintain a false negative rate similar to that of the normal environment, we need to accept all of the TSD-generated proposals for both fog and rain.
Figure 3(b) shows the percentage of false negatives generated by TSD when it is trained on BTSD and tested on GTSDB. Again, the false negative rate is higher for fog and rain than for normal weather. Moreover, a threshold tuned on the BTSD dataset does not work here. This experiment shows the necessity of a separate false negative detection system rather than tuning the minimum score threshold of TSD.
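The naive argument can be made concrete with a toy computation. Given, for each ground-truth sign, the highest TSD score of any proposal matching it, the false negative percentage at threshold θ is the fraction of signs scored below θ. The scores below are invented to mimic a normal and a foggy scene and are not the paper's measurements; they only show how the largest threshold that meets a fixed false negative target shifts with the environment. Function names are hypothetical.

```python
def false_negative_rate(best_scores, theta):
    """Percentage of ground-truth signs missed at minimum score threshold theta.

    best_scores[i]: highest TSD score among proposals matching sign i (0.0 if none)."""
    if not best_scores:
        raise ValueError("need at least one ground-truth sign")
    missed = sum(1 for s in best_scores if s < theta)
    return 100.0 * missed / len(best_scores)


def threshold_for_target(best_scores, target_fn_pct, candidates):
    """Largest candidate threshold whose false negative rate does not exceed the target."""
    ok = [t for t in candidates if false_negative_rate(best_scores, t) <= target_fn_pct]
    return max(ok) if ok else None


normal = [0.95, 0.9, 0.85, 0.7, 0.4]   # toy scores, clear weather
fog = [0.8, 0.6, 0.3, 0.2, 0.05]       # toy scores, degraded images
candidates = [i / 10 for i in range(1, 10)]
for name, best in (("normal", normal), ("fog", fog)):
    print(name, false_negative_rate(best, 0.5), threshold_for_target(best, 20.0, candidates))
```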
V-B Comparison to baselines
We tested baseline 1, baseline 2 and FND on the normal, foggy and rainy versions of the BTSD and GTSDB testing data. The purpose of these different settings is to evaluate the robustness of the proposed method under varying conditions.
Figure 4(a) compares the precision-recall curves on the normal BTSD testing data. For recall FND achieves precision. For similar recall, precision for baseline 1 and baseline 2 are and respectively.
Figure 5(a) and Figure 6(a) show the performance of FND, baseline 1 and baseline 2 in two different weather conditions of BTSD testing data. For simulated rainy weather FND achieves precision and it becomes for foggy weather. Precision for baseline 1 drops from to for rainy weather and for foggy weather. Baseline 2 also suffers from changing environment. Precision drops from to for rainy data and for foggy data.
On the normal GTSDB dataset, FND achieves precision for recall. Figure 4(b) shows the precision-recall curves of FND, baseline 1 and baseline 2 on GTSDB under normal weather conditions. Although FND was trained only on BTSD data, it raises alarms with high precision and recall when the traffic sign detector makes a mistake. This suggests that the collected features are robust for failure and imposter classification.
We also tested FND on the simulated rain and fog versions of GTSDB. For recall, FND achieves and precision respectively. Figure 6(b) and Figure 5(b) show the precision-recall curves of FND, baseline 1 and baseline 2 on the rain and fog versions of GTSDB.
Figure 8 shows qualitative results of our false negative detection system: cases where FND has successfully identified false negative traffic signs missed by the detector. These samples are taken from three different environments (normal, simulated fog and simulated rain) of the BTSD testing data.
VI Conclusions and Future Work
In this paper, we addressed the critical problem of detecting the false negatives of object detectors in the context of traffic sign detection for autonomous vehicles. This is an important safety issue: if the vehicle fails to locate a critical sign, the result could be catastrophic. We proposed a false negative detector (FND) that is trained to distinguish missed signs from imposters among the excited regions in the feature maps of a deployed traffic sign detector. We tested FND on two traffic sign benchmark datasets, the Belgium Traffic Sign Dataset and the German Traffic Sign Detection Benchmark, as well as on simulated fog and rain versions of both. We compared our proposed method to two baselines and showed that it performs better at detecting false negatives. Future work will focus on identifying the best layers in TSD from which to extract more effective excited regions for the false negative detector.