Semantic segmentation using deep learning has become crucial for many safety-critical systems such as vision-based self-driving cars [treml2016speeding, Feng2021DeepMO] and robot-assisted surgery [allan20192017, Shvets2018AutomaticIS]. For instance, semantic segmentation is a significant component of any self-driving car, supporting safety, reliability, and scene understanding [Yang2018DenseASPPFS, hao2020brief]. It also plays a substantial role in navigation [maturana2018real, zhang2018road] and obstacle avoidance [hua2019small, Arain2019ImprovingUO] by segmenting critical objects such as pedestrians and other vehicles in real time from visual sensor input. Due to its importance, there is ongoing research [sseg_lateef2019survey, sseg_Minaee2021ImageSU, GarciaGarcia2018ASO] to improve the overall performance of semantic segmentation to meet the safety-critical demands of robotic vision.
State-of-the-art research in semantic segmentation commonly assumes that the images encountered during training and later during deployment follow a similar distribution. However, this cannot be guaranteed for applications on autonomous vehicles that operate in the open, unconstrained world. The segmentation model will inevitably encounter situations (objects, object configurations, textures), environmental conditions (weather), or imaging conditions (motion blur, illumination and exposure effects) that were never seen during training. As a result, a severe drop in segmentation performance could occur without prior warning, posing an extreme risk for the vehicle and its surroundings.
The ideal way to achieve consistent semantic segmentation performance in all conditions is a highly effective, robust, and domain-agnostic model trained on every scenario that it will encounter during deployment. However, these requirements are hard to meet in practice. Another approach is to identify and remove inputs that can hinder segmentation accuracy. Out-of-distribution [ruff2021unifying] and open-set [geng2020recent] detection are examples of this approach. However, these approaches still cannot scale to the complex semantic scenes and structures in which an autonomous vehicle operates [synthcp].
Similarly, uncertainty and confidence estimation can be used to detect incorrect semantic segmentation. However, recent works, including [synthcp], show that these approaches alone are not effective enough to detect perception failures in semantic segmentation. Instead, [zhang2014predicting, daftry2016introspective, synthcp] have argued in favour of using a specifically trained model to identify the incorrect perception of a target model without depending on approaches such as out-of-distribution, open-set, and novelty detection or uncertainty and confidence estimation.
In the context of semantic segmentation, several lines of research, such as failure prediction [corbiere2019addressing, synthcp], introspective perception [zhang2014predicting, 9294308], and quality prediction [Devries2018LeveragingUE, robinson2018real, guruau2018learn], train a separate model to identify semantic segmentation failures. Most of these works are not applicable in autonomous vehicle scenarios, considering the complexity and significant variance of the visual sensor inputs encountered by the semantic segmentation network during deployment.
This paper proposes a novel framework, FSNet, consisting of semantic segmentation and corresponding failure detection networks. We train both of these networks simultaneously, so the failure detection network exploits the internal features generated by the segmentation network and identifies segmentation failures. FSNet is end-to-end trainable and does not require any additional failure-dataset as used in [synthcp, zhang2014predicting, corbiere2019addressing] for training purposes. We evaluate the proposed approach against the current SOTA methods using multiple datasets representing in and out-distribution scenarios. Although segmentation and failure detection networks are connected and trained jointly, our framework does not impede the segmentation accuracy. Our experimental results show that the accuracy of our jointly trained segmentation network is similar to a separately trained network. At the same time, the jointly trained failure detection network outperforms all existing approaches. Figure 1 shows an example of semantic segmentation, the mismatch between the predicted segmentation and ground-truth, and how FSNet can identify that incorrect segmentation.
II Related Works
Failure detection, or introspection, is an essential requirement in robotics to ensure safety and reliability [rahman2021run]. Morris [Morris-2007-9880] first proposed a robotic introspection framework that monitors the operational state of a robot for decision-making purposes. Later, [triebel2016driven, grimmett2016introspective] extended this idea to semantic mapping and obstacle avoidance in robotics. Zhang et al. [zhang2014predicting] proposed ALERT, a framework that predicts the failure of another model. A similar approach has been used for failure prediction in MAVs [daftry2016introspective], as a hardness predictor [wang2018towards] for image classifiers, and for performance monitoring [guruau2018learn] of robot perception systems. This work focuses on detecting the failure of a semantic segmentation model in the autonomous vehicle context.
The study of failure detection, or identifying the erroneous predictions of a model, is closely related to uncertainty and confidence estimation. Hendrycks et al. [hendrycks2016baseline] proposed using the Maximum Softmax Probability (MSP) derived from the softmax layer to detect failures in classification tasks. This work is considered the standard baseline in the related literature. However, MSP suffers from drawbacks such as failing to distinguish between in- and out-distribution samples and improper calibration. To reduce the risk of incorrect classification, Geifman et al. [geifman2017selective] introduced the selective classifier. This approach controls and guarantees the risk level of a classifier by using thresholds on pre-defined confidence functions, e.g., MSP. Jiang et al. [Jiang2018ToTO] proposed the trust score, which compares the predictions of a classifier and a modified nearest-neighbour classifier to measure classifier reliability. Most recently, Corbiere et al. [corbiere2019addressing] proposed the true class probability to improve the unreliable ranking of confidence scores. In addition, MC Dropout based techniques have become popular for failure detection in classification. However, Xia et al. [synthcp] argued that these approaches are not applicable to semantic segmentation because they lack information on semantic structure and context.
Failure detection in the context of image segmentation has been studied extensively in recent years. Kohlberger et al. [Kohlberger2012EvaluatingSE] used a novel space of segmentation features to predict the overlap error and the Dice coefficient of an organ segmentation model. Later, Valindria et al. [Valindria2017ReverseCA] introduced reverse classification accuracy to predict the quality of medical image segmentation. Huang et al. [Huang2016QualityNetSQ] showed that segmentation quality could be predicted using their proposed QualityNet. [Jungo2018UncertaintydrivenSC, Devries2018LeveragingUE] showed the application of Bayesian CNNs for predicting semantic segmentation failure. [chabrier2006unsupervised, gao2017novel] used unsupervised learning to quantify the quality of image segmentation. However, because they provide image-level segmentation quality rather than pixel-level failure detection, these works do not apply to identifying the areas where semantic segmentation is incorrect.
Xia et al. [synthcp] proposed the SOTA method SynthCP for pixel-level failure prediction. They also demonstrated the use of [hendrycks2016baseline, corbiere2019addressing, gal2016dropout, zhang2014predicting] for the same task. Of these, [synthcp, corbiere2019addressing, zhang2014predicting] are explicitly trained using a failure dataset to detect failures. This failure dataset is much smaller than the original training dataset because these approaches generate it from a held-out segmentation dataset. Hence, these models cannot take advantage of the entire segmentation dataset. [hendrycks2016baseline, gal2016dropout] use the per-pixel prediction confidence and entropy generated during segmentation inference for failure detection, which is suboptimal [synthcp, 9294308].
We address these issues by jointly training a semantic segmentation network and the corresponding failure detection network. The failure detection network is trained simultaneously with the semantic segmentation network using the entire segmentation training dataset, without requiring an explicit failure dataset. Moreover, unlike other approaches, the failure detection network exploits internal features of the segmentation model through the joint architecture and shows better performance than the existing approaches.
III Approach Overview
This section introduces our failure detection framework FSNet for semantic segmentation. FSNet consists of two connected components – one semantic segmentation network and one failure detection network. We will describe both of these networks and how they work jointly to detect the failure of semantic segmentation.
III-A Module Architecture
|Layer|Function|
|Conv1x1(l)|Applies a 2D convolution on l with kernel size 1×1.|
|Max(Softmax(l))|Extracts the maximum softmax value across the channels of l.|
|Sigmoid(Entropy(l))|Applies sigmoid normalization to the channel-wise entropy of l.|
|ArgMax(l)|Returns the index of the maximum value across the channels of l. Same as the segmentation label.|
Table I: Convolutional layers used to extract features from the output logits l of the semantic segmentation network.
FSNet uses a joint architecture to connect the semantic segmentation and corresponding failure detection network and trains these networks end-to-end using the semantic segmentation dataset. FSNet also allows the failure detection network to exploit internal features of the segmentation network.
Let $S$ be a basic semantic segmentation network combining a convolutional encoder $S_e$ and a decoder $S_d$. $S$ classifies each pixel of a given image $x$ of shape $w \times h \times 3$ into a particular label from a set $K$. During inference, $S_d$ uses the convolutional features $f_s$ from the last layer of $S_e$ to generate logits $l$ of size $w \times h \times k$, where $k = |K|$ is the number of classes. Depending on the architectural choice, $S_d$ may exploit features from different layers of $S_e$. Later, a softmax function is applied to $l$ to generate the predicted label map $\hat{y}$.
We propose a failure detection network $F$ to predict $\hat{y}_f$, a failure map of size $w \times h$ indicating the pixels where the predicted label map $\hat{y}$ is incorrect. $F$ consists of two encoders, $F_{e1}$ and $F_{e2}$, and a decoder $F_d$. $F_{e1}$ and $F_{e2}$ extract convolutional features from the image $x$ and the logits $l$ using four different convolutional layers. Each of these layers generates a single-channel feature from the $k$-channel input without changing the width and height. Table I lists these layers and their functions.
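The per-pixel maps listed in Table I can be illustrated with a short NumPy sketch. It assumes logits stored as an array of shape `(k, h, w)`. The function names and the `conv_weights` vector are hypothetical stand-ins for illustration, since in the framework the 1×1 convolution is a learned layer.

```python
import numpy as np


def softmax(logits, axis=0):
    """Numerically stable softmax over the class axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def logit_feature_maps(logits, conv_weights, conv_bias=0.0):
    """Compute four single-channel maps from logits of shape (k, h, w).

    conv_weights has shape (k,) and plays the role of a 1x1 convolution
    kernel with one output channel (hypothetical values, normally learned).
    """
    probs = softmax(logits, axis=0)
    # 1x1 convolution with one output channel: weighted sum over channels.
    conv1x1 = np.tensordot(conv_weights, logits, axes=(0, 0)) + conv_bias
    # Maximum softmax probability per pixel.
    max_softmax = probs.max(axis=0)
    # Per-pixel entropy of the class distribution, squashed by a sigmoid.
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=0)
    sigmoid_entropy = 1.0 / (1.0 + np.exp(-entropy))
    # Index of the winning class per pixel, i.e. the predicted label.
    argmax = logits.argmax(axis=0)
    return conv1x1, max_softmax, sigmoid_entropy, argmax
```

Each returned map has shape `(h, w)`, matching the statement that the layers keep the width and height and reduce the channels to one.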
Our failure detection network works in multiple stages. First, the encoder $F_{e1}$ takes its input, formed with the channel-wise concatenation operation $\oplus$, and produces the encoded feature $f_1$. Next, the encoder $F_{e2}$ produces the encoded feature $f_2$ from its own input. Later, using Equation 1, $f_1$ and $f_2$ are concatenated with the encoded feature $f_s$ from the segmentation network to form the feature $f$ for the failure detection decoder $F_d$.
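As an illustration of the fusion step described above, the concatenation could be written as follows. The symbol names are assumptions made here: $f_1$ and $f_2$ denote the two failure-encoder features, $f_s$ the segmentation encoder feature, and $\oplus$ channel-wise concatenation. The order of the operands is likewise assumed.

```latex
f = f_1 \oplus f_2 \oplus f_s
```

Under this reading, the three features must share the same spatial resolution, and the channel count of $f$ is the sum of their channel counts.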
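The decoder $F_d$ takes the fused feature $f$ as input and upsamples it to generate the failure map $\hat{y}_f$ of size $w \times h$. $\hat{y}_f$ represents the confidence of the failure detection network $F$ that the segmentation network $S$ has misclassified each pixel of $x$. Figure 2 shows an overview of our proposed architecture and the interconnection among its different components.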
III-B Training Procedure
We use a single dataset and two different loss functions, a segmentation loss $L_s$ and a failure detection loss $L_f$, to train the semantic segmentation network $S$ and the failure detection network $F$ of FSNet. For each input $x$, let $S$ predict the label map $\hat{y}$. The cross-entropy loss in Equation 2 is used to calculate the segmentation loss $L_s$ from the ground truth $y$ and the label prediction $\hat{y}$.
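For reference, a standard pixel-wise cross-entropy consistent with this description is sketched below. The notation is assumed here: $l_{i,c}$ is the logit of class $c$ at pixel $i$, $y_i$ is the ground-truth label of pixel $i$, $k$ is the number of classes, and the loss is averaged over the $w \cdot h$ pixels. The averaging is an assumption.

```latex
L_s = -\frac{1}{wh} \sum_{i=1}^{wh} \sum_{c=1}^{k} \mathbb{1}[y_i = c] \, \log p_{i,c},
\qquad
p_{i,c} = \frac{\exp(l_{i,c})}{\sum_{c'=1}^{k} \exp(l_{i,c'})}
```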
To train the failure detection network $F$ to predict the failure of the segmentation network $S$, we need a ground truth showing the mismatch between the segmentation ground truth $y$ and the prediction $\hat{y}$. This mismatch indicates the failure of $S$ in predicting the semantic label, and $F$ will be optimized to predict this failure. Equation 3 is used to generate the failure detection ground truth $y_f$. Taking $\hat{y}_f$ as the output of the failure detection network, we use the balanced binary cross-entropy loss of Equation 4 to calculate the failure detection loss $L_f$.
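The two steps described above, deriving the failure ground truth from the label mismatch and scoring the failure map with a class-balanced binary cross-entropy, can be sketched in NumPy as follows. The weighting scheme (each class weighted by the frequency of the other class) is one common form of balancing and is an assumption here, as are the function names and the `ignore_index=255` convention for unlabeled pixels.

```python
import numpy as np


def failure_ground_truth(pred_labels, gt_labels, ignore_index=255):
    """Return (target, valid): target is 1 where the prediction is wrong.

    Pixels whose ground truth equals ignore_index (assumed convention)
    are excluded through the boolean mask `valid`.
    """
    valid = gt_labels != ignore_index
    target = ((pred_labels != gt_labels) & valid).astype(np.float64)
    return target, valid


def balanced_bce(fail_prob, target, valid, eps=1e-7):
    """Class-balanced binary cross-entropy over the valid pixels.

    Failure pixels are weighted by the fraction of correct pixels and
    vice versa, so the rarer class is not swamped (assumed weighting).
    """
    p = np.clip(fail_prob[valid], eps, 1.0 - eps)
    t = target[valid]
    if t.size == 0:
        return 0.0
    beta = 1.0 - t.mean()  # weight applied to failure (positive) pixels
    loss = -(beta * t * np.log(p) + (1.0 - beta) * (1.0 - t) * np.log(1.0 - p))
    return float(loss.mean())
```

Because the target is derived from the segmentation network's own predictions on the training images, no separate failure dataset is needed to compute this loss.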
There are two steps in the training procedure. First, we backpropagate only the segmentation loss $L_s$ into FSNet until the segmentation network $S$ has converged. This step trains only $S$ to perform semantic segmentation. After $S$ converges, the combination of $L_s$ and the failure detection loss $L_f$ is used as the new loss. As $S$ and the failure detection network $F$ are connected, this step jointly optimizes both networks for semantic segmentation and failure detection. Our experiments show that, without converging $S$ first, the two networks of FSNet cannot be jointly optimized.
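The two-step schedule can be expressed as a small loss-selection function. This is a minimal sketch under two assumptions: convergence of the segmentation network is approximated by a fixed number of warm-up iterations (`seg_only_steps`, a hypothetical parameter), and the joint loss is the unweighted sum of the two losses.

```python
def fsnet_loss(step, seg_loss, fail_loss, seg_only_steps):
    """Select the loss to backpropagate at a given training step.

    Step 1 (step < seg_only_steps): segmentation loss only.
    Step 2 (afterwards): segmentation loss plus failure detection loss
    (the unweighted sum is an assumption).
    """
    if step < seg_only_steps:
        return seg_loss
    return seg_loss + fail_loss
```

In a training loop, the returned value would be the quantity passed to the optimizer at each iteration, so the failure detection branch receives gradients only in the second step.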
IV Experimental Setup
This section will describe the settings used for the proposed framework’s evaluation. First, we will discuss how we used the in- and out-distribution settings to evaluate the generalizability of this work. Next, existing approaches, related evaluation metrics and implementation details will be summarized.
In- and Out-Distribution Dataset. In all experiments, we used a training dataset of 2974 images from Cityscapes [cordts2016cityscapes] to train FSNet and all other approaches. To evaluate the proposed system, we considered two settings. The first one is in-distribution, where the training and testing data come from the same distribution. We used a testing dataset consisting of 500 images from Cityscapes to evaluate all approaches. The next setting is out-distribution, where the testing data comes from a different dataset. We used 1000 images from BDD100k [yu2018bdd100k] and randomly selected 1000 images from Mapillary [neuhold2017mapillary] semantic segmentation dataset in this setting to evaluate the proposed framework. All the segmentation models are trained to segment 19 classes available in the Cityscapes, and we used the same 19 classes from all three datasets.
Methods to Compare. We compare the FSNet failure detection network with multiple methods: MSP [hendrycks2016baseline], MCDropout [gal2016dropout], TCP [corbiere2019addressing], Direct-prediction [zhang2014predicting], and SynthCP [synthcp]. MSP and MCDropout provide pixel-level confidence maps as part of their inference, and these are used as standard baselines for pixel-level failure prediction. Direct-prediction, TCP, and SynthCP use a separate failure dataset to train the failure detection model. Direct-prediction trains a separate model on this dataset as its failure detector. TCP trains a model to predict the true class probability, which serves as a failure indicator. Most recently, SynthCP proposed using a conditional GAN and a comparison module to train a model that identifies semantic segmentation failures. SynthCP is the SOTA approach among these works.
Unlike existing approaches, FSNet jointly trains the semantic segmentation and corresponding failure detection network. Although jointly trained, FSNet segmentation network should perform similarly to the individually trained segmentation model. To ensure this, we will compare FSNet segmentation network accuracy with the individually trained SynthCP segmentation network.
Evaluation Metrics. Following [corbiere2019addressing, synthcp], we use AUPR-Error, AUPR-Success, AUROC, and FPR95 for evaluation. AUPR-Error treats incorrect predictions as the positive class and computes the area under the Precision-Recall (AUPR) curve. AUPR-Success also computes the AUPR but treats correct predictions as the positive class. AUROC is the area under the Receiver Operating Characteristic curve, and FPR95 is the False-Positive Rate at a 95% True-Positive Rate. Our proposed jointly trained segmentation model is compared with the SynthCP segmentation model using mean Intersection over Union (mIOU) and per-class accuracy (Cls-Acc). mIOU first calculates the IOU for each class and then averages over the classes. Cls-Acc measures the percentage of correctly labeled pixels for each semantic class and then averages over the classes.
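The metrics above can be computed from per-pixel scores and labels as in the following NumPy sketch. It is an illustrative implementation, not the evaluation code used in the experiments. The function names are hypothetical, the AUPR is computed as step-wise average precision, which class counts as positive is left to the caller, and classes absent from the ground truth are skipped when averaging (an assumption).

```python
import numpy as np


def _cumulative_counts(scores, positives):
    """TP/FP counts as the threshold sweeps from the highest score down.

    One point is kept per distinct score so tied scores share a threshold.
    """
    scores = np.asarray(scores, dtype=np.float64).ravel()
    positives = np.asarray(positives, dtype=np.float64).ravel()
    order = np.argsort(-scores, kind="mergesort")
    scores, positives = scores[order], positives[order]
    last_of_group = np.r_[np.nonzero(np.diff(scores))[0], scores.size - 1]
    tp = np.cumsum(positives)[last_of_group]
    fp = np.cumsum(1.0 - positives)[last_of_group]
    return tp, fp


def auroc(scores, positives):
    """Area under the ROC curve by the trapezoidal rule."""
    tp, fp = _cumulative_counts(scores, positives)
    tpr = np.r_[0.0, tp / tp[-1]]
    fpr = np.r_[0.0, fp / fp[-1]]
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))


def aupr(scores, positives):
    """Area under the precision-recall curve as average precision."""
    tp, fp = _cumulative_counts(scores, positives)
    precision = tp / (tp + fp)
    recall = tp / tp[-1]
    return float(np.sum(np.diff(np.r_[0.0, recall]) * precision))


def fpr_at_tpr(scores, positives, tpr_level=0.95):
    """False-positive rate at the first threshold reaching tpr_level."""
    tp, fp = _cumulative_counts(scores, positives)
    tpr, fpr = tp / tp[-1], fp / fp[-1]
    return float(fpr[np.searchsorted(tpr, tpr_level, side="left")])


def confusion_matrix(pred, gt, num_classes, ignore_index=255):
    """Rows index ground-truth classes, columns index predicted classes."""
    pred, gt = np.asarray(pred).ravel(), np.asarray(gt).ravel()
    valid = gt != ignore_index
    idx = gt[valid].astype(np.int64) * num_classes + pred[valid].astype(np.int64)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)


def miou_and_cls_acc(conf):
    """Mean IoU and mean per-class accuracy, skipping absent classes."""
    tp = np.diag(conf).astype(np.float64)
    gt_count = conf.sum(axis=1)
    pred_count = conf.sum(axis=0)
    with np.errstate(divide="ignore", invalid="ignore"):
        iou = tp / (gt_count + pred_count - tp)
        acc = tp / gt_count
    return float(np.nanmean(iou)), float(np.nanmean(acc))
```

For AUPR-Error, the failure score would be passed with `positives = (pred != gt)`. For AUPR-Success, the negated failure score would be passed with `positives = (pred == gt)`.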
Implementation. To compare with existing methods, we use FCN8 [fcn_long2015fully] and DeepLabV2 [chen2017deeplab] as the semantic segmentation networks in our framework. FCN8 and DeepLabV2 are based on VGG16 and ResNet101 backbone networks, respectively, and are pretrained on the MS-COCO semantic segmentation dataset. Both encoders in the proposed failure detector use the ResNet18 network pretrained on the ImageNet dataset. Our training process consists of two steps. First, the FSNet segmentation network is trained using only the segmentation loss until convergence. For this step, we follow the hyper-parameters and image augmentations proposed by SynthCP. Then, in the subsequent iterations, FSNet is trained using both the segmentation and failure detection losses. The failure detection network uses the Adam optimizer.
This section evaluates the semantic segmentation and failure detection accuracy of the proposed framework with the existing approaches. It also shows comparative performance for in-distribution and out-distribution settings.
V-A Semantic Segmentation Evaluation
Table II shows the comparative accuracy of the segmentation networks from SynthCP and FSNet. In the in-distribution setting, for both FCN8 and DeepLabV2, FSNet segmentation accuracy improves in the Cls-Acc metric. FSNet also shows better performance in the mIOU metric. In the out-distribution settings, both SynthCP and FSNet show lower accuracy than in the in-distribution setting, as the BDD100k and Mapillary datasets were unknown to the segmentation network. However, FSNet segmentation accuracy is better than that of SynthCP in the out-distribution settings too. This result shows that the segmentation and failure detection networks can be trained jointly without degrading the segmentation accuracy.
V-B Failure Detection Evaluation
Table III shows the failure detection accuracy of FSNet and all existing approaches using the AP-Err, AP-Suc, AUC, and FPR95 metrics. These metrics are averaged over the 19 classes available in the Cityscapes dataset. For the in-distribution setting, the FSNet failure detection network achieves a higher AP-Err than the SOTA method, SynthCP, when identifying failures of FCN8. In this case, FSNet also outperforms SynthCP in AUC and FPR95. However, FSNet is slightly inferior to SynthCP in the AP-Suc metric. Because we used the balanced binary cross-entropy loss function to train the failure detection network, FSNet improves AP-Err by a large margin at the cost of a negligible reduction in AP-Suc. In the same setting, FSNet for DeepLabV2 demonstrates a similar trend, outperforming SynthCP in the AP-Err metric.
Table III also shows FSNet failure detection accuracy for FCN8 and DeepLabV2 in out-distribution settings. Here, we trained FSNet using the Cityscapes dataset and evaluated using BDD100k and Mapillary datasets. These experiments illustrate the generalization capability of our proposed framework. In four metrics and two datasets, FSNet outperforms all the existing methods, including SOTA SynthCP.
We see higher AP-Err in the out-distribution settings than in the in-distribution setting. The reason is the lower semantic segmentation accuracy in the out-distribution settings (see Table II). Because the segmentation networks misclassify more pixels there, erroneous pixels make up a larger share of the evaluated pixels. The failure detection network can identify these errors and hence shows a higher AP-Err in the out-distribution settings than in the in-distribution setting.
V-C Risk-Coverage Evaluation
We use the Risk-Coverage [geifman2017selective] metric to evaluate the impact of FSNet after detecting failures in semantic segmentation. Here, Coverage is the percentage of predicted pixel labels that are not flagged as a failure by FSNet, and Risk is the percentage of misclassified pixels among those retained predictions. Based on this metric, FSNet can reject predictions of the segmentation network to achieve a desired risk level.
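A risk-coverage curve following these definitions can be computed by ranking pixels by their failure score and keeping the most trusted fraction. The NumPy sketch below is illustrative only. The function name and the rounding of the retained pixel count are assumptions.

```python
import numpy as np


def risk_coverage_curve(fail_score, is_error, coverages):
    """Risk at each coverage level.

    fail_score: per-pixel failure score (higher means more likely wrong).
    is_error:   1 where the segmentation prediction is wrong, else 0.
    coverages:  fractions in (0, 1] of pixels to retain.
    """
    fail_score = np.asarray(fail_score, dtype=np.float64).ravel()
    is_error = np.asarray(is_error, dtype=np.float64).ravel()
    # Most trusted pixels (lowest failure score) come first.
    order = np.argsort(fail_score, kind="mergesort")
    cum_errors = np.cumsum(is_error[order])
    n = is_error.size
    risks = []
    for c in coverages:
        kept = max(1, int(round(c * n)))  # number of retained pixels
        risks.append(cum_errors[kept - 1] / kept)
    return np.array(risks)
```

A detector that ranks errors well concentrates them among the rejected pixels, so its risk stays low until the coverage approaches 100%.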
Figure 3(a) shows Risk-Coverage curves for all methods when detecting the failure of FCN8 on the Cityscapes dataset in the in-distribution setting. We plot these curves using 10 different coverage levels. Here, FSNet demonstrates lower risk than all existing methods. At a given coverage level, FSNet rejects the remaining segmentation predictions on the assumption that they are incorrect, and the risk is the share of retained pixels that are nevertheless misclassified. The same comparison is made for DeepLabV2 in the in-distribution setting, where the risk of FSNet at a fixed coverage is set against the range of risks of the other existing approaches.
Figure 3(b) and Figure 3(c) show risk-coverage curves for FCN8 in the out-distribution settings. In both cases, FSNet shows a lower risk level than all existing methods at all coverage levels. Figure 3(e) and Figure 3(f) show risk-coverage curves for DeepLabV2 in the out-distribution settings, with a similar trend in which FSNet outperforms all existing methods.
Figure 4 shows qualitative results and the comparison between FSNet and SynthCP for detecting the failure of image segmentation.
Based on the experimental results, FSNet outperforms the SOTA approach SynthCP and other existing methods. In the ablation study, we experimented with multiple configurations to find the critical components of FSNet. These configurations include single- and multi-branch architectures and full and partial datasets. In the single-branch setting, we used only a single encoder to extract features from the input and the logits of the segmentation network. In the multi-branch setting (see Figure 2), FSNet used two different encoders to extract features from the input and the logits output. We also studied how the dataset size affects FSNet by training it on the full dataset and on partial datasets of two sizes, each comprising randomly selected Cityscapes training images.
As described in the literature, SynthCP, Direct-prediction, and TCP train a segmentation network and apply that network to an unseen dataset to create a new failure dataset for failure detection training. Hence, the dataset for failure detection training is significantly smaller than the segmentation dataset, and these approaches cannot take advantage of the entire available segmentation dataset. In contrast, FSNet introduces a joint architecture and uses the full semantic segmentation dataset to train both the segmentation and failure detection networks simultaneously. Table IV compares the accuracy FSNet achieves with the full dataset and with the partial datasets, and shows that the entire set of training data benefits our proposed framework.
Table IV shows that the multi-branch network performs better than a single-branch network. This accuracy gain is possible by using separate encoders to extract more informative features from the image and logits output of the segmentation network.
In all existing approaches, either the segmentation model or a separate network is used for failure detection. However, FSNet exploits internal features from the segmentation model and uses them simultaneously to detect segmentation failure. Figure 2 shows how these features connect the segmentation and failure detection networks. We removed one feature at a time from the FSNet multi-branch failure detection network to study the impact of the different features. In Table IV, we list the name of each removed feature and the accuracy of FSNet in all metrics after removing it. The table shows that FSNet accuracy drops by varying margins whenever any feature is removed from the failure detection network. Based on these accuracy drops, the most significant features are those extracted from the segmentation network's logits output and encoder. Without these features, FSNet accuracy drops below the baselines. Table IV shows that our proposed joint architecture significantly improves the accuracy of FSNet in detecting the failure of the semantic segmentation network.
As deep learning based semantic segmentation models become an essential component of autonomous vehicles, identifying their failures has gained paramount importance for ensuring safety and robustness. This paper proposes a novel joint learning framework to simultaneously train a semantic segmentation network and a corresponding failure detection network. The failure detection network can identify, at the pixel level, the image areas where the segmentation network has made an incorrect prediction. Therefore, our proposed framework can be used to inform downstream components in autonomous vehicle systems about the expected reliability of semantic segmentation. We show the effectiveness of our proposed framework using multiple datasets, segmentation models, and evaluation metrics.