Synthesize then Compare: Detecting Failures and Anomalies for Semantic Segmentation

03/18/2020 ∙ by Yingda Xia, et al.

The ability to detect failures and anomalies is a fundamental requirement for building reliable systems for computer vision applications, especially safety-critical applications of semantic segmentation, such as autonomous driving and medical image analysis. In this paper, we systematically study failure and anomaly detection for semantic segmentation and propose a unified framework, consisting of two modules, to address these two related problems. The first module is an image synthesis module, which generates a synthesized image from a segmentation layout map, and the second is a comparison module, which computes the difference between the synthesized image and the input image. We validate our framework on three challenging datasets and improve the state of the art by large margins, i.e., approximately 6% AUPR-Error on Cityscapes, 10% DSC correlation on pancreatic tumor segmentation in MSD and 20% AUPR on StreetHazards anomaly segmentation.




1 Introduction

Deep neural networks [32, 45, 47, 15, 20] have achieved great success in various computer vision tasks. However, when they are deployed in real-world applications, such as autonomous driving [23], medical diagnosis [5] and nuclear power plant monitoring [37], the safety issue [1] raises tremendous concerns, particularly in conditions where failures have severe consequences. As a result, it is of enormous value for a machine learning system to be capable of detecting failures, i.e., wrong predictions, as well as identifying anomalies, i.e., out-of-distribution (OOD) cases, that may cause these failures.

Previous works on failure detection [18, 25, 7] and anomaly (OOD) detection [36, 34, 9, 35, 19] mainly focus on classifying small images. Although failure detection and anomaly detection for semantic segmentation have received little attention in the literature so far, they are more closely related to safety-critical applications, e.g., autonomous driving and medical image analysis. The objective of failure detection for semantic segmentation is not only to determine whether there are failures in a segmentation result, but also to locate where the failures are. Anomaly detection for semantic segmentation, a.k.a. anomaly segmentation, is related to failure detection, and its objective is to segment anomalous objects or regions in a given image.

In this paper, our goal is to build a reliable alarm system to address failure detection for semantic segmentation (Fig. 1(i)) and anomaly segmentation (Fig. 1(ii)). This requires the system to provide a more detailed analysis than is needed for image classification, i.e., pixel-level error/confidence maps: whereas image classification outputs only a single image label, semantic segmentation outputs structured semantic layouts. Some previous works [27, 7, 17] directly applied failure detection/anomaly detection strategies designed for image classification pixel by pixel to estimate a pixel-level error map, but they do not take the structured semantic layout of a segmentation result into account.

Figure 1: We aim at addressing two tasks: (i) failure detection, i.e., image-level per-class IoU prediction (top left) and pixel-level error map prediction (bottom left); (ii) anomaly segmentation, i.e., segmenting anomalous objects (right middle).

In this paper, we propose a unified framework to address failure detection and anomaly detection for semantic segmentation. This framework consists of two components: an image synthesis module, which synthesizes an image from a segmentation result to reconstruct its input image, i.e., a reverse procedure of semantic segmentation, and a comparison module, which computes the difference between the reconstructed image and the input image. Our framework is motivated by the fact that the quality of semantic image synthesis [22, 48, 43] can be evaluated by the performance of a segmentation network. Presumably the converse is also true: the better the segmentation result, the closer a synthesized image generated from the segmentation result is to the input image. If a failure occurs during segmentation, for example, if a person is mis-segmented as a pole, the synthesized image generated from the segmentation result does not look like a person, and an obvious difference between the synthesized image and the input image should occur. Similarly, when an anomalous (OOD) object occurs in a test image, it will be classified as some in-distribution object in the segmentation result, and will then appear as an in-distribution object in the synthesized image generated from the segmentation result. Consequently, the anomalous object can be identified by finding the differences between the test image and the synthesized image.

We model this synthesis procedure by a semantic-to-image conditional GAN (cGAN) [43], which is capable of modeling the mapping from the segmentation layout space to the image space. This cGAN is trained on label-image pairs. Given the segmentation result of an input image obtained by a semantic segmentation model, we apply the trained cGAN to the segmentation result to generate a reconstructed image. Then, the reconstructed image and the input image are fed into the comparison module to identify the failures/anomalies. The comparison module is designed task-specifically: for failure detection, the comparison module is modeled by a Siamese network, outputting both image-level confidences and pixel-level confidences; for anomaly segmentation, the comparison module is realized by computing a distance defined on the intermediate features extracted by the semantic segmentation model.

We validate our framework on the Cityscapes street scene dataset, a pancreatic tumor segmentation dataset in the Medical Segmentation Decathlon (MSD) challenge and the StreetHazards dataset, and show its superiority to other failure detection and anomaly segmentation methods. Specifically, we achieve improvements over the state of the art of approximately 6% AUPR-Error on Cityscapes pixel-level error prediction, 10% correlation coefficient on pancreatic tumor DSC prediction and 20% AUPR on StreetHazards anomaly segmentation.

We summarize our contributions as follows:

  • To the best of our knowledge, we are the first to systematically study failure detection and anomaly detection for semantic segmentation.

  • We propose a unified framework, which enjoys the benefits of a semantic-to-image conditional GAN, to address both tasks.

  • Our framework achieves state-of-the-art failure detection and anomaly segmentation results on three challenging datasets.

2 Related Work

In this section, we first review the topics closely related to failure detection and anomaly segmentation, such as uncertainty/confidence estimation, quality assessment and out-of-distribution (OOD) detection. Then, we review generative adversarial networks (GANs), which serve as a key module in our framework.

Uncertainty estimation or confidence estimation has been a hot topic in the field of machine learning for years, and can be directly applied to the task of failure detection. Standard baselines were established in [18] for detecting failures in classification, where maximum softmax probability (MSP) provides reasonable results. However, the main drawback of using MSP for confidence estimation is that deep networks tend to produce high-confidence predictions [7]. Geifman et al. [12] controlled the user-specified risk level by setting thresholds on a pre-defined confidence function (e.g., MSP). Jiang et al. [25] measured the agreement between the classifier and a modified nearest-neighbor classifier on the test examples as a confidence score. A recent approach [7] proposed to directly regress the “true class probability”, which improved over MSP for failure detection. Additionally, Bayesian approaches have drawn attention in this field of study. Dropout-based approaches [10, 27] used Monte Carlo Dropout (MCDropout) for Bayesian approximation, where statistics such as entropy or variance indicate uncertainty. However, all these approaches mainly focus on small-image classification tasks. When applied to semantic segmentation, they lack information about semantic structures and contexts.

Segmentation quality assessment aims at estimating the overall quality of a segmentation without using ground-truth labels, which makes it suitable for raising alarms when a model fails. Some approaches [33, 26] utilize Bayesian CNNs (BCNNs) to predict the segmentation quality of medical images. [44, 30] regressed the segmentation quality from deep features computed from a pair of an image and its segmentation result. [24, 21] plugged an extra IoU regression head into object detection or instance segmentation. [4, 11] used unsupervised learning methods to estimate the segmentation quality using geometrical features. Recently, Liu et al. [38] proposed to use a VAE [29] to capture a shape prior for segmentation quality assessment on 3D medical images. However, it is hardly applicable to natural images, considering the complexity and large shape variance of 2D scenes and objects. Segmentation quality assessment will be referred to as image-level failure detection in the rest of the paper.

OOD detection aims at detecting out-of-distribution examples in testing data. Since the baseline MSP method [18] was introduced, many approaches have improved OOD detection from various aspects [36, 34, 9, 35, 19]. These approaches mainly focus on image-level OOD detection, i.e., determining whether an image is an OOD example; for instance, [31, 3] targeted detecting hazardous scenes in the Wilddash dataset [49]. In contrast, our approach focuses on anomaly segmentation, i.e., a pixel-level OOD detection task that aims at segmenting anomalous regions from an image. Pixel-wise reconstruction losses [2, 14] with auto-encoders (AEs) are the mainstream approaches for anomaly segmentation. However, they can hardly model the complex street scenes in natural images, and AEs cannot guarantee to generate an in-distribution image from OOD regions. Recently, it was found that MSP surprisingly outperforms these anomaly segmentation approaches on StreetHazards [17], a newly built, larger-scale street scene dataset with 250 types of anomalous objects and more than 6k high-resolution images.

Generative adversarial networks [13] generate realistic images by playing a “min-max” game between a generator and a discriminator. GANs effectively minimize a Jensen-Shannon divergence, thus generating in-distribution images. Our work utilizes conditional GANs [42] (cGANs) for image translation [22], a.k.a. pixel-to-pixel translation. Approaches designed for semantic image synthesis [48, 43, 40] improve pixel-to-pixel translation in synthesizing real images from semantic masks, which is the reverse procedure of semantic segmentation. Since semantic image synthesis is commonly evaluated by the performance of a segmentation model, we are conversely motivated to use a semantic-to-image generator for failure detection in semantic segmentation.

Figure 2: We first train the synthesis module $G$ on label-image pairs and then use this module to synthesize the image $\hat{x}$ conditioned on the predicted segmentation mask $\hat{y}$. By comparing the input image $x$ and $\hat{x}$ with a comparison module $F$, we can detect failures as well as segment anomalous objects. $F$ is instantiated in Sec 3.2.1 and Sec 3.3.2.

3 Methodology

In this section, we introduce our framework for failure detection and anomaly detection for semantic segmentation. Our framework consists of two modules, an image synthesis module and a comparison module. We first introduce the general framework (shown in Fig. 2), then describe the details of the modules for failure detection and anomaly detection in Sec 3.2 and Sec 3.3, respectively. Unless otherwise specified, the notation in this paper follows this convention: we use a lowercase letter, e.g., $x$, to represent a tensor variable, such as a 1D array or a 2D map, and denote its $i$-th element as $x^{(i)}$; we use a capital letter, e.g., $F$, to represent a function.

3.1 General Framework

Let $x$ be an image of size $w \times h$ and $\mathcal{L} = \{1, 2, \ldots, L\}$ be a set of integers representing the semantic labels. By feeding image $x$ to a segmentation model $M$, we obtain its segmentation result, i.e., a pixel-wise semantic label map $\hat{y} = M(x) \in \mathcal{L}^{w \times h}$. Our goal is to identify and locate the failures in $\hat{y}$ or detect anomalies in $x$ based on $\hat{y}$.

3.1.1 Image Synthesis Module

We model this image synthesis module by a pixel-to-pixel translation conditional GAN (cGAN) [42], which is known for its excellent ability for semantic-to-image mapping. It consists of a generator $G$ and a discriminator $D$.

Training. We train this translation cGAN on label-image pairs $(y, x)$, where $y$ is a ground-truth pixel-wise semantic label map and $x$ is its corresponding image. The objective of the generator $G$ is to translate semantic label maps to realistic-looking images, while the discriminator $D$ aims to distinguish real images from the synthesized ones. This cGAN models the conditional distribution of real images via the following min-max game:

$$\min_G \max_D \mathcal{L}_{cGAN}(G, D),$$

where the objective function $\mathcal{L}_{cGAN}(G, D)$ is defined as:

$$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{(y, x)}\left[\log D(y, x)\right] + \mathbb{E}_{y}\left[\log\left(1 - D(y, G(y))\right)\right].$$

Testing. After training, we fix the generator $G$. Given an image $x$ and a segmentation model $M$, we feed the predicted segmentation mask $\hat{y}$ into $G$, and obtain a synthesized (i.e., reconstructed) image $\hat{x}$:

$$\hat{x} = G(\hat{y}), \quad \hat{y} = M(x).$$

$x$ and $\hat{x}$ then serve as the input for the comparison module.

3.1.2 Comparison Module

We detect failures and anomalies in the predicted label map $\hat{y}$ by comparing the synthesized image $\hat{x}$ with the input image $x$. Our assumption is that, if $\hat{x}$ is more similar to $x$, then $\hat{y}$ is more similar to the ground-truth label map $y$. However, since the optimization of the generator $G$ does not guarantee that the synthesized image $\hat{x}$ has the same style as the original image $x$, simple similarity measurements such as the $\ell_1$ distance between $x$ and $\hat{x}$ are not accurate. To address this issue, we model the comparison module by a task-specific function $F$, which estimates a trustworthy task-specific confidence measure $\hat{c}$ between $x$ and $\hat{x}$:

$$\hat{c} = F(x, \hat{x}, \hat{y}).$$

For the task of failure detection, the confidence measure includes an image-level per-class intersection over union (IoU) array $\hat{c}_{iu} \in [0, 1]^{|\mathcal{L}|}$ and a pixel-level error map $\hat{c}_m \in [0, 1]^{w \times h}$; for the task of anomaly segmentation, the confidence measure is a pixel-level confidence map $\hat{c}_n \in [0, 1]^{w \times h}$ for anomalous objects.
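To make the data flow of the general framework concrete, here is a minimal Python sketch of the segment, synthesize, compare chain. All three callables are hypothetical toy stand-ins introduced only for illustration: a thresholding "segmenter", a lookup-table "generator" and a per-pixel absolute difference. The last one is exactly the kind of naive comparison the text warns is not accurate in practice; it is used here only to keep the sketch short.

```python
def synthesize_then_compare(image, segment, generate, compare):
    """Run the two-module pipeline: segment -> synthesize -> compare."""
    predicted_labels = segment(image)           # predicted label map
    synthesized = generate(predicted_labels)    # image re-synthesized from labels
    return compare(image, synthesized, predicted_labels)  # confidence measure


# --- Hypothetical toy stand-ins (not the models used in the paper) ---
def toy_segment(image):
    # Class 1 for bright pixels, class 0 for dark pixels.
    return [[1 if pixel > 0.5 else 0 for pixel in row] for row in image]


def toy_generate(labels):
    # Render class 0 as a dark pixel (0.1) and class 1 as a bright pixel (0.9).
    return [[0.9 if label == 1 else 0.1 for label in row] for row in labels]


def toy_compare(image, synthesized, labels):
    # Naive per-pixel absolute difference; a learned comparison would go here.
    return [[abs(a - b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(image, synthesized)]


image = [[0.1, 0.9],
         [0.55, 0.1]]
diff = synthesize_then_compare(image, toy_segment, toy_generate, toy_compare)
print(diff)  # the ambiguous pixel (0.55) shows the largest discrepancy
```

Passing the modules in as callables mirrors the design in the text: the synthesis module is shared across tasks, while the comparison function is swapped per task.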

3.2 Failure Detection

The goal of failure detection is two-fold: 1) to identify whether there are failures in the segmentation result $\hat{y}$, which requires a per-class IoU prediction $\hat{c}_{iu}$ for $\hat{y}$, and 2) to locate the failures in $\hat{y}$, which requires computing a pixel-level error map $\hat{c}_m$.

3.2.1 Instantiation of Comparison Module

Figure 3: We instantiate the comparison function $F$ as a lightweight siamese network for joint image-level per-class IoU prediction and pixel-level error map prediction.

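We instantiate the comparison module $F$ as a lightweight deep network. In practice, we use ResNet-18 [15] as the base network and follow a siamese-style design for learning the relationship between the input image $x$ and the synthesized image $\hat{x}$. As illustrated in Fig. 3, $x$ and $\hat{x}$ are first concatenated with the predicted label map $\hat{y}$ and then separately encoded by a shared-weight siamese encoder. Then two heads are built upon the siamese encoder and output the image-level per-class IoU array $\hat{c}_{iu}$ and the pixel-level error map $\hat{c}_m$, respectively. We rewrite the function $F$ for failure detection as below:

$$(\hat{c}_{iu}, \hat{c}_m) = F(x, \hat{x}, \hat{y}; \theta),$$

where $\theta$ represents the network parameters.

In the training stage, the supervision is obtained by computing the ground-truth confidence measure from the predicted label map $\hat{y}$ and the ground-truth label map $y$. For the ground-truth image-level per-class IoU array $c_{iu}$, we compute it by

$$c_{iu}^{(l)} = \frac{\left|\{i \mid \hat{y}^{(i)} = l\} \cap \{i \mid y^{(i)} = l\}\right|}{\left|\{i \mid \hat{y}^{(i)} = l\} \cup \{i \mid y^{(i)} = l\}\right|},$$

where $l$ is the $l$-th semantic class in label set $\mathcal{L}$. The $\ell_1$ loss function $\mathcal{L}_{\ell_1}(c_{iu}, \hat{c}_{iu})$ is applied to learning this image-level per-class IoU prediction head. For the ground-truth pixel-level error map $c_m$, we compute it by

$$c_m^{(i)} = \begin{cases} 1, & \text{if } y^{(i)} \neq \hat{y}^{(i)}, \\ 0, & \text{otherwise.} \end{cases}$$

The binary cross-entropy loss $\mathcal{L}_{ce}(c_m, \hat{c}_m)$ is applied to learning this pixel-level error map prediction head. The overall loss function of failure detection is the sum of the above two:

$$\mathcal{L} = \mathcal{L}_{\ell_1}(c_{iu}, \hat{c}_{iu}) + \mathcal{L}_{ce}(c_m, \hat{c}_m).$$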
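The supervision targets described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' training code: label maps are nested lists with 0-indexed classes, a class absent from both maps is given IoU 0 (an assumption; the text does not say how this case is handled), and both losses are averaged rather than summed over their elements (also an assumption).

```python
import math


def per_class_iou(gt, pred, num_classes):
    """Ground-truth IoU per class: |pred==l AND gt==l| / |pred==l OR gt==l|."""
    flat_gt = [label for row in gt for label in row]
    flat_pred = [label for row in pred for label in row]
    ious = []
    for cls in range(num_classes):
        inter = sum(1 for g, p in zip(flat_gt, flat_pred) if g == cls and p == cls)
        union = sum(1 for g, p in zip(flat_gt, flat_pred) if g == cls or p == cls)
        ious.append(inter / union if union else 0.0)  # assumption: absent class -> 0
    return ious


def pixel_error_map(gt, pred):
    """1 where the predicted label differs from the ground truth, else 0."""
    return [[int(g != p) for g, p in zip(gt_row, pred_row)]
            for gt_row, pred_row in zip(gt, pred)]


def failure_detection_loss(pred_iou, gt_iou, pred_err, gt_err, eps=1e-7):
    """Sum of an L1 loss on the IoU array and a BCE loss on the error map."""
    l1 = sum(abs(p - g) for p, g in zip(pred_iou, gt_iou)) / len(gt_iou)
    flat_pred = [min(max(p, eps), 1 - eps) for row in pred_err for p in row]
    flat_gt = [g for row in gt_err for g in row]
    bce = -sum(g * math.log(p) + (1 - g) * math.log(1 - p)
               for p, g in zip(flat_pred, flat_gt)) / len(flat_gt)
    return l1 + bce


gt = [[0, 0, 1],
      [0, 1, 1]]
pred = [[0, 1, 1],
        [0, 1, 2]]
target_iou = per_class_iou(gt, pred, num_classes=3)
target_err = pixel_error_map(gt, pred)
print(target_iou)   # one IoU value per class
print(target_err)   # [[0, 1, 0], [0, 0, 1]]
```

In this toy case the spurious class-2 pixel yields an IoU of 0 for class 2, which is the kind of image-level signal the IoU head is trained to reproduce.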


3.3 Anomaly Segmentation

3.3.1 Problem Definition

The goal of anomaly segmentation is to segment anomalous objects in a test image, i.e., objects that are unseen in the training images. Formally, given a test image $x$, an anomaly segmentation method should output a confidence score map $\hat{c}_n \in [0, 1]^{w \times h}$ for the regions of the anomalous objects in the image, i.e., $\hat{c}_n^{(i)} = 1$ and $\hat{c}_n^{(i)} = 0$ indicate that the $i$-th pixel belongs to an anomalous object and an in-distribution object (an object seen in the training images), respectively.

3.3.2 Instantiation of Comparison Module

As in failure detection, we first train a cGAN generator $G$ on the training images, which maps the in-distribution object labels to realistic images. Given a semantic segmentation model $M$ and a test image $x$, we feed the prediction $\hat{y}$ of $M$ into $G$ and obtain the synthesized image $\hat{x}$. Since $\hat{y}$ only contains in-distribution object labels, $\hat{x}$ also only contains in-distribution objects. Thus, we can compare $x$ with $\hat{x}$ to find the anomalies. The pixel-wise semantic difference between $x$ and $\hat{x}$ is a strong indicator of anomalous objects. Here, we simply instantiate the comparison function $F$ as the cosine distance defined on the intermediate features extracted by the segmentation model $M$:

$$\hat{c}_n^{(i)} = F(x, \hat{x}; M) = 1 - \left\langle \frac{f_M^{i}(x)}{\|f_M^{i}(x)\|_2}, \frac{f_M^{i}(\hat{x})}{\|f_M^{i}(\hat{x})\|_2} \right\rangle,$$

where $f_M^{i}$ is the feature vector at the $i$-th pixel position outputted by the last layer of segmentation model $M$ and $\langle \cdot, \cdot \rangle$ is the inner product of the two vectors.

Post-processing with MSP. Due to the artifacts and generalization errors of GANs, our approach may misclassify an in-distribution object as an anomalous object (a false positive). We use a simple post-processing step to address this issue. We refine the result by maximum softmax probability (MSP) [18], which is known as an effective uncertainty estimation strategy: $\hat{c}_n^{(i)} \leftarrow \hat{c}_n^{(i)} \cdot \mathbb{1}\{p^{(i)} \le t\} + (1 - p^{(i)}) \cdot \mathbb{1}\{p^{(i)} > t\}$, where $p^{(i)}$ is the maximum softmax probability at the $i$-th pixel outputted by the segmentation model $M$, $t \in [0, 1]$ is a threshold and $\mathbb{1}\{\cdot\}$ is the indicator function.
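The anomaly score and its MSP refinement can be sketched as follows. This is a minimal plain-Python illustration under stated assumptions: features are given as one vector per pixel (flattened), a small `eps` guards against zero-norm vectors, and pixels whose MSP exceeds the threshold have their score replaced by one minus the MSP, which is one reading of the refinement rule rather than a confirmed implementation detail.

```python
import math


def cosine_distance(u, v, eps=1e-12):
    """1 minus the cosine similarity of two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v + eps)


def anomaly_scores(feats_input, feats_synth, msp, threshold):
    """Per-pixel anomaly score with MSP-based pruning of false positives.

    feats_input / feats_synth: per-pixel feature vectors extracted from the
    input image and from the synthesized image by the same segmentation model.
    msp: per-pixel maximum softmax probability of that model.
    """
    scores = []
    for f_in, f_syn, p in zip(feats_input, feats_synth, msp):
        score = cosine_distance(f_in, f_syn)
        if p > threshold:
            # Assumption: a very confident pixel falls back to 1 - MSP.
            score = 1.0 - p
        scores.append(score)
    return scores


feats_input = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
feats_synth = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
msp = [0.5, 0.6, 0.9995]
scores = anomaly_scores(feats_input, feats_synth, msp, threshold=0.999)
print(scores)  # matching features -> ~0, orthogonal -> 1, confident pixel -> 1 - MSP
```

Setting `threshold=1.0` disables the refinement, since no probability can exceed 1.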

Figure 4: An analysis of our approach. Left: the segmentation model $M$ correctly maps the image $x$ to its label map $y$, resulting in a small distance between $x$ and the synthesized image $\hat{x}$. However, when there are failures in the prediction $\hat{y}$ (middle) or there are OOD examples in $x$ (right), the distance between $x$ and $\hat{x}$ is larger, given a reliable reverse mapping $G$.

3.4 Analysis

We give some analysis of our proposed method in Fig. 4, where $\mathcal{X}$ and $\mathcal{Y}$ correspond to the image space and the label space. $M$ is the segmentation model and $G$ is a semantic-to-image generator. The left image shows that when $M$ correctly maps an image to its corresponding segmentation mask, the synthesized image generated by $G$ is close to the original image. However, when $M$ makes a failure (middle) or encounters an out-of-distribution case (right), the synthesized image should be far away from the original image. As a result, the synthesized image serves as a strong indicator for either failure detection or OOD detection.

4 Experiments

4.1 Failure Detection

4.1.1 Evaluation Metrics

Following [38], we evaluate the performance of image-level failure detection, i.e., per-class IoU prediction, by three metrics: MAE, P.C. and S.C. MAE (mean absolute error) measures the average error between predicted IoUs and ground-truth IoUs. P.C. (Pearson correlation) and S.C. (Spearman correlation) measure their correlation coefficients. For pixel-level failure detection, i.e., pixel-level error map prediction, we evaluate the performance using the standard metrics in the literature [18, 7]: AUPR-Error, AUPR-Success, FPR at 95% TPR and AUROC. Following [7], AUPR-Error is our main metric, which computes the area under the Precision-Recall curve using errors as the positive class.
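The three image-level metrics can be computed with a few lines of plain Python. This is an illustrative sketch using textbook definitions (Spearman correlation as the Pearson correlation of average ranks); the numbers in the example are made up and are not results from the paper.

```python
import math


def mean_absolute_error(predicted, target):
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(predicted)


def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def spearman(a, b):
    return pearson(average_ranks(a), average_ranks(b))


# Made-up predicted vs. ground-truth IoUs for four images.
predicted_iou = [0.2, 0.5, 0.7, 0.9]
true_iou = [0.1, 0.6, 0.65, 1.0]
mae = mean_absolute_error(predicted_iou, true_iou)
pc = pearson(predicted_iou, true_iou)
sc = spearman(predicted_iou, true_iou)
print(mae, pc, sc)
```

Because the toy predictions preserve the ordering of the ground truth, the Spearman correlation is perfect even though the Pearson correlation and MAE are not, which is why the paper reports all three.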

| Image-level | FCN-8 MAE | FCN-8 STD | FCN-8 P.C. | FCN-8 S.C. | Deeplab-v2 MAE | Deeplab-v2 STD | Deeplab-v2 P.C. | Deeplab-v2 S.C. |
|---|---|---|---|---|---|---|---|---|
| MCDropout [10] | 17.28 | 13.33 | 3.62 | 5.97 | 19.31 | 12.86 | 4.55 | 1.37 |
| VAE alarm [38] | 16.28 | 11.88 | 21.82 | 18.26 | 16.78 | 12.21 | 17.92 | 19.63 |
| Direct Prediction | 13.25 | 11.96 | 58.34 | 59.74 | 14.45 | 12.20 | 60.94 | 62.01 |
| Ours separate | 11.58 | 11.50 | 64.63 | 65.63 | 13.60 | 12.32 | 62.51 | 63.41 |
| Ours joint | 12.69 | 11.29 | 62.52 | 61.23 | 13.68 | 11.60 | 64.05 | 65.42 |

| Pixel-level | FCN-8 AP-Err | FCN-8 AP-Suc | FCN-8 AUC | FCN-8 FPR95 | Deeplab-v2 AP-Err | Deeplab-v2 AP-Suc | Deeplab-v2 AUC | Deeplab-v2 FPR95 |
|---|---|---|---|---|---|---|---|---|
| MSP [18] | 50.31 | 99.02 | 91.54 | 25.34 | 48.46 | 99.24 | 92.26 | 24.41 |
| MCDropout [10] | 49.23 | 99.02 | 91.47 | 25.16 | 47.85 | 99.23 | 92.19 | 24.68 |
| TCP [7] | 48.54 | 98.82 | 90.29 | 32.21 | 45.57 | 98.84 | 89.14 | 36.98 |
| Direct Prediction | 52.17 | 99.15 | 92.55 | 22.34 | 48.76 | 99.34 | 92.94 | 21.56 |
| Ours separate | 54.14 | 99.15 | 92.70 | 22.47 | 48.79 | 99.31 | 92.74 | 22.15 |
| Ours joint | 55.53 | 99.18 | 92.92 | 22.47 | 49.99 | 99.34 | 92.98 | 21.69 |

Table 1: Experiments on the Cityscapes dataset. We detect failures in the segmentation results of FCN-8 and Deeplab-v2. “Ours separate” and “Ours joint” mean training the image-level and pixel-level failure detection heads in our network separately and jointly, respectively.

4.1.2 The Cityscapes Dataset

We validate our approach on the Cityscapes dataset [8], which contains 2975 high-resolution training images and 500 validation images. As far as we know, the Cityscapes dataset is the largest one for failure detection for semantic segmentation.

Baselines. We compare our approach to MCDropout [10], VAE alarm [38], MSP [18], TCP [7] and “Direct Prediction”. MCDropout, MSP and TCP output pixel-level confidence maps, serving as standard baselines for pixel-level failure prediction. VAE alarm [38] is the state-of-the-art image-level failure prediction method. Following [38], we also use MCDropout to predict image-level failures. Direct Prediction is a method that directly uses a network to predict both image-level and pixel-level failures, taking an image and its segmentation result as input. Note that Direct Prediction shares the same experimental settings (backbone and training strategies) with our approach, so it can be seen as an ablation study on the effectiveness of the synthesized image $\hat{x}$.

Implementation details. We use the state-of-the-art semantic-to-image cGAN, SPADE [43], in our framework. We use the default setting in [43] except that we do not input the instance maps. The backbone of our comparison module is ResNet-18 [15] pretrained on ImageNet. We train the network for 20k iterations using the Adam optimizer [28] with an initial learning rate of 0.01. We do not use an extra validation set to train the comparison module. Instead, we conduct 4-fold cross-validation on the training set with a given segmentation algorithm, in order to generate imperfect training samples for failure detection.
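The 4-fold scheme can be sketched as follows: each training image is segmented by a model that never saw it, so its prediction contains realistic errors for the comparison module to learn from. This is a hedged illustration; `k_fold_splits`, `collect_held_out_predictions` and the toy `train_majority_model` are hypothetical names, the folds are contiguous and unshuffled for simplicity, and the real pipeline would train a segmentation network in place of the toy trainer.

```python
def k_fold_splits(num_samples, k=4):
    """Return (train_indices, held_out_indices) pairs covering every sample once."""
    indices = list(range(num_samples))
    fold_sizes = [num_samples // k + (1 if fold < num_samples % k else 0)
                  for fold in range(k)]
    splits, start = [], 0
    for size in fold_sizes:
        held_out = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        splits.append((train, held_out))
        start += size
    return splits


def collect_held_out_predictions(images, labels, train_fn, k=4):
    """Predict every training sample with a model trained on the other folds."""
    predictions = [None] * len(images)
    for train_idx, held_idx in k_fold_splits(len(images), k):
        model = train_fn([images[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        for i in held_idx:
            predictions[i] = model(images[i])
    return predictions


def train_majority_model(train_images, train_labels):
    # Toy "trainer": the model always predicts the most frequent training label.
    majority = max(set(train_labels), key=train_labels.count)
    return lambda image: majority


images = list(range(8))                 # placeholders for real images
labels = [0, 0, 1, 1, 0, 1, 0, 1]       # placeholders for real label maps
held_out_predictions = collect_held_out_predictions(
    images, labels, train_majority_model, k=4)
print(held_out_predictions)
```

The (image, held-out prediction, ground truth) triples produced this way are what the failure detection targets are computed from.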

Figure 5: Visualization on the Cityscapes dataset for pixel-level error map prediction (top) and image-level per-class IoU prediction (bottom). For each example from left to right (top), we show the original image, ground-truth label map, segmentation prediction, synthesized image conditioned on the segmentation prediction, (ground-truth) errors in the segmentation prediction and our pixel-level error prediction. The plots (bottom) show significant correlations between the ground-truth IoU and our predicted IoU on most of the classes.

Results. Experimental results are shown in Table 1 and visualizations are shown in Fig. 5. We use the well-known FCN-8 [41] and Deeplab-v2 [6] as the segmentation models. For image-level failure detection, our method consistently outperforms other methods on all metrics. Results are averaged over 19 classes for all four metrics (detailed results in supplementary). We find that VAE alarm does not converge when trained on 2D images, which shows the difficulty of learning a 2D shape prior. Without the synthesized images from the segmentation results, Direct Prediction performs worse than ours, despite achieving better performance than the other baselines.

For pixel-level failure detection, our approach achieves state-of-the-art performance as well, especially on the AP-Error metric, where our approach outperforms other methods by a considerable margin. The comparison to Direct Prediction demonstrates that the improvements come from the image synthesis module in our framework. TCP does not obtain good results. In the training of TCP, we observe that the optimization does not converge well, showing that it is hard to fit the true class probability on a large dataset. We find that our method produces slightly more false positives than the “Direct Prediction” baseline (its FPR95 is slightly higher). We think the reason might be that some correctly segmented regions are not synthesized well by the generative model.

Moreover, we notice that jointly training the image-level and pixel-level failure detection tasks (“Ours joint”) leads to better performance than training these two tasks separately (“Ours separate”) in most settings. This shows that the two tasks can provide complementary information to each other, which is a known benefit of multi-task learning.

Tumor DSC prediction:

| Method | MAE | STD | P.C. | S.C. |
|---|---|---|---|---|
| Direct Prediction | 23.20 | 29.81 | 45.50 | 45.36 |
| Jungo et al. [26] | 26.57 | 29.78 | -23.87 | -20.23 |
| Kwon et al. [33] | 26.14 | 29.24 | 14.61 | 14.70 |
| VAE alarm [38] | 20.21 | 23.60 | 60.24 | 63.30 |
| VAE (our imple.) | 18.60 | 13.73 | 63.42 | 58.47 |
| Ours | 18.13 | 13.77 | 61.11 | 62.66 |
| Ours + VAE | 15.19 | 13.37 | 67.97 | 71.35 |

Table 2: Failure detection results on the pancreatic tumor segmentation dataset in MSD [46].

4.1.3 The Pancreatic Tumor Segmentation Dataset

We also validate our approach on medical images. Following VAE alarm [38], we applied our approach to the challenging pancreatic tumor segmentation task of Medical Segmentation Decathlon [46], where we randomly split the 281 cases into 200 training and 81 testing. The VAE alarm system [38] is the main competitor on this dataset. Since their approach explored shape prior for accurate quality assessment and tumor shapes have large variance, we expect our approach can outperform shape-based models or be complementary to the VAE-based alarm model. We only compare image-level failure detection in this dataset, because the VAE alarm system [38] is targeted to this task and sets up standard baselines.

We use the state-of-the-art network 3D AH-Net [39] as the segmentation model. Instead of IoU, the segmentation performance is measured by the Dice coefficient (DSC), a standard evaluation metric for medical image segmentation. Moving into 3D is challenging for our approach, since training 3D GANs is extremely hard, considering the limited GPU memory and high computational costs. In practice, we modify SPADE into 3D. Results and visualizations are shown in Table 2 and Fig. 6, respectively. In terms of baselines, we re-implement the VAE alarm system for a fair comparison in our settings, while the results of other methods are quoted from [38]. Our approach achieves performance comparable to the VAE alarm system. When combined with VAE alarm (a simple ensemble of the predicted DSC), all four metrics improve significantly (the P.C. and S.C. correlation coefficients both improve by approximately 10%), illustrating that our approach, which captures label-to-image information, is complementary to the shape-based VAE approach.
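For readers less familiar with the Dice coefficient, the sketch below computes it for two binary masks and shows one way a "simple ensemble of the predicted DSC" could look. Averaging the two predictions is an assumption made for illustration; the text does not specify the ensembling rule. Masks are flattened 0/1 lists, and two empty masks are treated as a perfect match (also an assumption).

```python
def dice_coefficient(pred_mask, gt_mask):
    """DSC = 2 * |A intersect B| / (|A| + |B|) for flattened binary masks."""
    intersection = sum(1 for p, g in zip(pred_mask, gt_mask) if p == 1 and g == 1)
    total = sum(pred_mask) + sum(gt_mask)
    return 2.0 * intersection / total if total else 1.0  # assumption for empty masks


def ensemble_predicted_dsc(dsc_from_model_a, dsc_from_model_b):
    """Hypothetical ensemble: plain average of two predicted DSC values."""
    return (dsc_from_model_a + dsc_from_model_b) / 2.0


pred_mask = [1, 1, 0, 0, 1]
gt_mask = [1, 0, 0, 1, 1]
dsc = dice_coefficient(pred_mask, gt_mask)
combined = ensemble_predicted_dsc(0.6, 0.8)
print(dsc, combined)
```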

Figure 6: Left two: an example of pancreatic tumor segmentation (in red). Right three: plots for tumor segmentation DSC score prediction by the VAE alarm [38], our approach and the ensemble of our approach and VAE approach.

4.2 Anomaly Segmentation

4.2.1 Evaluation Metrics

We use the standard metrics in literature for OOD detection and anomaly segmentation: area under the ROC curve (AUROC), false positive rate at 95% recall (FPR95), and area under the precision recall curve (AUPR).
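As a reference for how two of these metrics behave, the following plain-Python sketch computes FPR at a target recall and a step-wise average precision (one common estimator of AUPR) from per-pixel anomaly scores. The scores and labels are made up, anomalous pixels are the positive class, and ties between scores are not treated specially in the average precision; benchmark implementations may differ in these details.

```python
def fpr_at_tpr(scores, labels, target_tpr=0.95):
    """Lowest false positive rate among thresholds reaching the target recall."""
    positives = sum(labels)
    negatives = len(labels) - positives
    best_fpr = 1.0
    for threshold in sorted(set(scores)):
        true_pos = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 1)
        false_pos = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 0)
        if true_pos / positives >= target_tpr:
            best_fpr = min(best_fpr, false_pos / negatives)
    return best_fpr


def average_precision(scores, labels):
    """Mean of the precision values measured at each positive, in score order."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, precision_sum = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits


# Made-up anomaly scores for eight pixels; 1 marks an anomalous pixel.
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0]
fpr95 = fpr_at_tpr(scores, labels)
ap = average_precision(scores, labels)
print(fpr95, ap)
```

In this toy example all three anomalous pixels are recovered only once the threshold drops to 0.6, at which point one of the five normal pixels is also flagged.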

4.2.2 The StreetHazards Dataset

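| Method | FPR95 | AUROC | AUPR |
|---|---|---|---|
| AE [2] | 91.7 | 66.1 | 2.2 |
| Dropout [10] | 79.4 | 69.9 | 7.5 |
| MSP [18] | 33.7 | 87.7 | 6.6 |
| MSP + CRF [17] | 29.9 | 88.1 | 6.5 |
| Ours | 28.4 | 88.5 | 9.3 |

Table 3: Anomaly segmentation results on StreetHazards dataset [17].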

We validate our approach on the StreetHazards dataset of the CAOS Benchmark [17]. This dataset contains 5125 training images, 1000 validation images and 1500 test images. 250 types of anomalous objects appear only in the test images.

Baselines. Baseline approaches include MSP [18], MSP+CRF [17], Dropout [10] and an auto-encoder (AE) based approach [2]. Except for AE, all the other three approaches require a segmentation model to provide either softmax probability or uncertainty estimation. AE is the only approach that requires extra training of an auto-encoder for the images and computes pixel-wise loss for anomaly segmentation.

Implementation details. Following [17], we use two network backbones as the segmentation models: ResNet-101 [15] and PSPNet [50]. The cGAN is also SPADE [43] trained with the same training strategy as in Sec 3.2. The post-processing threshold is chosen for better AUPR and is discussed in detail in the following paragraph.

Results. Experimental results are shown in Table 3. Our approach improves on the previous state-of-the-art approach, MSP+CRF, from 6.5% to 9.3% in terms of AUPR. Fig. 7 shows some anomaly segmentation examples.

To study how much MSP post-processing contributes to our approach, we conduct experiments with different post-processing thresholds $t$. As shown in Table 4, without post-processing ($t = 1.0$), our approach achieves a higher AUPR (8.1) than the baselines, but also produces more false positives, which degrades FPR95 and AUROC. After pruning out false positives at high MSP positions ($t = 0.999$), we achieve state-of-the-art performance under all three metrics.

| t | 0.8 | 0.9 | 0.99 | 0.999 | 1.0 |
|---|---|---|---|---|---|
| FPR95 | 28.6 | 28.5 | 28.2 | 28.4 | 46.0 |
| AUROC | 88.3 | 88.4 | 88.6 | 88.5 | 81.9 |
| AUPR | 7.4 | 7.7 | 8.8 | 9.3 | 8.1 |

Table 4: Performance change by varying the post-processing threshold t.

Figure 7: Visualizations on the StreetHazards dataset. For each example, from left to right, we show the original image, ground-truth label map, segmentation prediction, synthesized image conditioned on segmentation prediction, MSP anomaly segmentation prediction and our anomaly segmentation prediction.

5 Discussions

Why does our approach work better? Current approaches, such as MSP, TCP and MCDropout, mainly focus on improving failure detection with self-estimated statistics. However, deep networks tend to yield high-confidence predictions [16, 18], so self-estimated statistics are not trustworthy. Approaches that leverage extra data [19] or alternative training strategies [16] can alleviate this problem. We propose to solve this problem from another perspective: analyzing the performance of deep discriminative models with generative models, i.e., the reverse procedure that models the conditional data distribution $p(x \mid y)$ of images given label maps. Our method models $p(x \mid y)$ with a cGAN, which proves beneficial to both failure and OOD detection for segmentation models.

Extra computational cost. Two steps require extra computation besides the original segmentation network ($M$, latency $t_M$) in our approach: GAN reconstruction ($G$, latency $t_G$) and the comparison function computation for failure detection/anomaly segmentation. Since $M$ and $G$ are mutually inverse procedures, their inference times should be of the same magnitude. Compared to $t_M$ or $t_G$, the inference time of the failure detection network and the distance computation for anomalies are insignificant. So the overall extra computational cost of our framework is approximately $t_G$, which is comparable to $t_M$. Compared to other approaches, MSP-based approaches [18, 19] are the most efficient. VAE alarm [38] and the AE-based approach [2] both need a separate network and thus have basically the same latency as ours. Dropout-based approaches [10, 33] require multiple samplings of a segmentation network, which typically costs a multiple of $t_M$.

Failure detection on predictions of an unseen model. We evaluate the generalizability of our failure detection system. We directly test our failure detection model, which is trained on Deeplab-v2 masks, on the segmentation masks produced by FCN-8. We achieve an AUPR-Error of 53.12 on pixel-level error detection and an MAE of 12.91 on image-level failure prediction. Full results are available in the supplementary material. The results are comparable with those of our model trained on the FCN-8 segmentation model in Table 1, which illustrates the generalizability of our failure detection system.

GAN types. We assume a stronger GAN would yield better synthesis, and thus choose the state-of-the-art SPADE model [43] for all the main experiments. We also tried a weaker generator - pix2pixHD [48]. It turns out that the synthesis quality is far from satisfactory when the generator takes the prediction as input (shown in supplementary materials). Under the same settings (FCN8 and pixel-level failure detection), AUPR of pix2pixHD model is only 51.31, which is close to the baseline “direct prediction” (AUPR 52.17). We thus conclude that a stronger generator benefits our failure detection scheme.

Adding image style encoder. The generator does not guarantee that the synthesized image has the same style as the input image, which increases the difficulty of the comparison module. We try to mitigate this effect by using the image-encoder version of SPADE [43], hoping that the encoder captures the style of the input so that images generated conditioned on the segmentation map share that style. However, the performance is not satisfactory (AUPR-Error drops slightly from 55.53 to 55.22). We hypothesize that the style encoder may also encode content (semantic) information and “cheat” by synthesizing the image without relying on the segmentation mask, thus making the generator “less conditional”.

6 Conclusions

We present a unified framework to detect failures and anomalies for semantic segmentation, which consists of an image synthesis module and a comparison module. We model the image synthesis module with a semantic-to-image conditional GAN (cGAN) and train it on label-image pairs. We then use it to reconstruct the image from the predicted segmentation mask. The synthesized image and the original image are fed to the comparison module, which outputs either a failure prediction (both image-level and pixel-level) or the mask of anomalous objects, depending on the specific task. Our framework achieves state-of-the-art performance on three challenging datasets.
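The data flow summarized above can be sketched in a few lines. The three callables below are toy stand-ins chosen purely for illustration (a threshold in place of the segmentation network, a per-class intensity lookup in place of the cGAN, and a per-pixel absolute difference in place of the comparison module); they are assumptions and not the networks used in this paper.

```python
import numpy as np


def synthesize_then_compare(image, segment, synthesize, compare):
    """Run the two-module pipeline and return the mask and the comparison output."""
    mask = segment(image)             # predicted segmentation layout
    recon = synthesize(mask)          # image reconstructed from the predicted mask
    return mask, compare(image, recon)


# Toy stand-ins (hypothetical, for illustration only).
def toy_segment(image):
    return (image > 0.5).astype(int)          # two classes: 0 and 1


def toy_synthesize(mask):
    return mask.astype(float)                 # class k is rendered with intensity k


def toy_compare(image, recon):
    return np.abs(image - recon)              # per-pixel discrepancy map


# The bottom-right pixel (0.6) is labelled class 1 but does not look like it,
# so it receives the largest discrepancy between input and reconstruction.
image = np.array([[0.0, 1.0], [0.0, 0.6]])
mask, discrepancy = synthesize_then_compare(image, toy_segment, toy_synthesize, toy_compare)
most_suspicious = np.unravel_index(np.argmax(discrepancy), discrepancy.shape)
```

Passing the modules as callables keeps the sketch agnostic to the task: a learned failure predictor or an anomaly distance can be swapped in for `toy_compare` without changing the pipeline function.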


  • [1] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané (2016) Concrete problems in ai safety. arXiv preprint arXiv:1606.06565. Cited by: §1.
  • [2] C. Baur, B. Wiestler, S. Albarqouni, and N. Navab (2018) Deep autoencoding models for unsupervised anomaly segmentation in brain mr images. In MICCAI Brainlesion Workshop, Cited by: §2, §4.2.2, Table 3, §5.
  • [3] P. Bevandić, I. Krešo, M. Oršić, and S. Šegvić (2018) Discriminative out-of-distribution detection for semantic segmentation. arXiv preprint arXiv:1808.07703. Cited by: §2.
  • [4] S. Chabrier, B. Emile, C. Rosenberger, and H. Laurent (2006) Unsupervised performance evaluation of image segmentation. EURASIP Journal on Applied Signal Processing. Cited by: §2.
  • [5] R. Challen, J. Denny, M. Pitt, L. Gompels, T. Edwards, and K. Tsaneva-Atanasova (2019) Artificial intelligence, bias and clinical safety. BMJ Quality and Safety. Cited by: §1.
  • [6] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2018) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, TPAMI. Cited by: §4.1.2.
  • [7] C. Corbière, N. Thome, A. Bar-Hen, M. Cord, and P. Pérez (2019) Addressing failure prediction by learning model confidence. In Advances in Neural Information Processing Systems, Cited by: §1, §1, §2, §4.1.1, §4.1.2, Table 1.
  • [8] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Cited by: §4.1.2.
  • [9] T. DeVries and G. W. Taylor (2018) Learning confidence for out-of-distribution detection in neural networks. arXiv preprint arXiv:1802.04865. Cited by: §1, §2.
  • [10] Y. Gal and Z. Ghahramani (2016) Dropout as a bayesian approximation: representing model uncertainty in deep learning. In International Conference on Machine Learning, ICML, Cited by: §2, §4.1.2, §4.2.2, Table 1, Table 3, §5.
  • [11] H. Gao, Y. Tang, L. Jing, H. Li, and H. Ding (2017) A novel unsupervised segmentation quality evaluation method for remote sensing images. Sensors. Cited by: §2.
  • [12] Y. Geifman and R. El-Yaniv (2017) Selective classification for deep neural networks. In Advances in Neural Information Processing Systems, Cited by: §2.
  • [13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, Cited by: §2.
  • [14] M. Haselmann, D. P. Gruber, and P. Tabatabai (2018) Anomaly detection using deep learning based image completion. In International Conference on Machine Learning and Applications, ICMLA, Cited by: §2.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Cited by: §1, §3.2.1, §4.1.2, §4.2.2.
  • [16] M. Hein, M. Andriushchenko, and J. Bitterwolf (2019) Why relu networks yield high-confidence predictions far away from the training data and how to mitigate the problem. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Cited by: §5.
  • [17] D. Hendrycks, S. Basart, M. Mazeika, M. Mostajabi, J. Steinhardt, and D. Song (2019) A benchmark for anomaly segmentation. arXiv preprint arXiv:1911.11132. Cited by: §1, §2, §4.2.2, §4.2.2, §4.2.2, Table 3.
  • [18] D. Hendrycks and K. Gimpel (2017) A baseline for detecting misclassified and out-of-distribution examples in neural networks. International Conference on Learning Representations, ICLR. Cited by: §1, §2, §2, §3.3.2, §4.1.1, §4.1.2, §4.2.2, Table 1, Table 3, §5, §5.
  • [19] D. Hendrycks, M. Mazeika, and T. Dietterich (2019) Deep anomaly detection with outlier exposure. International Conference on Learning Representations, ICLR. Cited by: §1, §2, §5, §5.
  • [20] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Cited by: §1.
  • [21] Z. Huang, L. Huang, Y. Gong, C. Huang, and X. Wang (2019) Mask scoring r-cnn. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Cited by: §2.
  • [22] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Cited by: §1, §2.
  • [23] J. Janai, F. Güney, A. Behl, and A. Geiger (2017) Computer vision for autonomous vehicles: problems, datasets and state-of-the-art. arXiv preprint arXiv:1704.05519. Cited by: §1.
  • [24] B. Jiang, R. Luo, J. Mao, T. Xiao, and Y. Jiang (2018) Acquisition of localization confidence for accurate object detection. In Proceedings of the European Conference on Computer Vision, ECCV, Cited by: §2.
  • [25] H. Jiang, B. Kim, M. Guan, and M. Gupta (2018) To trust or not to trust a classifier. In Advances in Neural Information Processing Systems, Cited by: §1, §2.
  • [26] A. Jungo, R. Meier, E. Ermis, E. Herrmann, and M. Reyes (2018) Uncertainty-driven sanity check: application to postoperative brain tumor cavity segmentation. arXiv preprint arXiv:1806.03106. Cited by: §2, Table 2.
  • [27] A. Kendall and Y. Gal (2017) What uncertainties do we need in bayesian deep learning for computer vision?. In Advances in Neural Information Processing Systems, Cited by: §1, §2.
  • [28] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. International Conference on Learning Representations, ICLR. Cited by: §4.1.2.
  • [29] D. P. Kingma and M. Welling (2014) Auto-encoding variational bayes. International Conference on Learning Representations, ICLR. Cited by: §2.
  • [30] T. Kohlberger, V. Singh, C. Alvino, C. Bahlmann, and L. Grady (2012) Evaluating segmentation error without ground truth. In International Conference on Medical Image Computing and Computer-Assisted Intervention, MICCAI, Cited by: §2.
  • [31] I. Krešo, M. Oršić, P. Bevandić, and S. Šegvić (2018) Robust semantic segmentation with ladder-densenet models. arXiv preprint arXiv:1806.03465. Cited by: §2.
  • [32] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, Cited by: §1.
  • [33] Y. Kwon, J. Won, B. J. Kim, and M. C. Paik (2020) Uncertainty quantification using bayesian neural networks in classification: application to biomedical image segmentation. Computational Statistics and Data Analysis. Cited by: §2, Table 2, §5.
  • [34] K. Lee, H. Lee, K. Lee, and J. Shin (2018) Training confidence-calibrated classifiers for detecting out-of-distribution samples. International Conference on Learning Representations, ICLR. Cited by: §1, §2.
  • [35] K. Lee, K. Lee, H. Lee, and J. Shin (2018) A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In Advances in Neural Information Processing Systems, Cited by: §1, §2.
  • [36] S. Liang, Y. Li, and R. Srikant (2018) Enhancing the reliability of out-of-distribution image detection in neural networks. International Conference on Learning Representations, ICLR. Cited by: §1, §2.
  • [37] O. Linda, T. Vollmer, and M. Manic (2009) Neural network based intrusion detection system for critical infrastructures. In International Joint Conference on Neural Networks, IJCNN, Cited by: §1.
  • [38] F. Liu, Y. Xia, D. Yang, A. L. Yuille, and D. Xu (2019) An alarm system for segmentation algorithm based on shape model. In Proceedings of the IEEE International Conference on Computer Vision, ICCV, Cited by: §2, Figure 6, §4.1.1, §4.1.2, §4.1.3, §4.1.3, Table 1, Table 2, §5.
  • [39] S. Liu, D. Xu, S. K. Zhou, O. Pauly, S. Grbic, T. Mertelmeier, J. Wicklein, A. Jerebko, W. Cai, and D. Comaniciu (2018) 3d anisotropic hybrid network: transferring convolutional features from 2d images to 3d anisotropic volumes. In International Conference on Medical Image Computing and Computer-Assisted Intervention, MICCAI, Cited by: §4.1.3.
  • [40] X. Liu, G. Yin, J. Shao, X. Wang, et al. (2019) Learning to predict layout-to-image conditional convolutions for semantic image synthesis. In Advances in Neural Information Processing Systems, Cited by: §2.
  • [41] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Cited by: §4.1.2.
  • [42] M. Mirza and S. Osindero (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784. Cited by: §2, §3.1.1.
  • [43] T. Park, M. Liu, T. Wang, and J. Zhu (2019) Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Cited by: §1, §1, §2, §4.1.2, §4.2.2, §5, §5.
  • [44] R. Robinson, O. Oktay, W. Bai, V. V. Valindria, M. M. Sanghvi, N. Aung, J. M. Paiva, F. Zemrak, K. Fung, E. Lukaschuk, et al. (2018) Real-time prediction of segmentation quality. In International Conference on Medical Image Computing and Computer-Assisted Intervention, MICCAI, Cited by: §2.
  • [45] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations, ICLR. Cited by: §1.
  • [46] A. L. Simpson, M. Antonelli, S. Bakas, M. Bilello, K. Farahani, B. Van Ginneken, A. Kopp-Schneider, B. A. Landman, G. Litjens, B. Menze, et al. (2019) A large annotated medical image dataset for the development and evaluation of segmentation algorithms. arXiv preprint arXiv:1902.09063. Cited by: §4.1.3, Table 2.
  • [47] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Cited by: §1.
  • [48] T. Wang, M. Liu, J. Zhu, A. Tao, J. Kautz, and B. Catanzaro (2018) High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Cited by: §1, §2, §5.
  • [49] O. Zendel, K. Honauer, M. Murschitz, D. Steininger, and G. Fernandez Dominguez (2018) Wilddash-creating hazard-aware benchmarks. In Proceedings of the European Conference on Computer Vision, ECCV, Cited by: §2.
  • [50] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Cited by: §4.2.2.