
On the Structures of Representation for the Robustness of Semantic Segmentation to Input Corruption

Semantic segmentation is a scene understanding task at the heart of safety-critical applications where robustness to corrupted inputs is essential. Implicit Background Estimation (IBE) has been shown to be a promising technique for improving the robustness of semantic segmentation models to out-of-distribution inputs at little to no cost. In this paper, we provide analysis comparing the structures learned as a result of optimization objectives that use Softmax, IBE, and Sigmoid in order to improve understanding of their relationship to robustness. As a result of this analysis, we propose combining Sigmoid with IBE (SCrIBE) to improve robustness. Finally, we demonstrate that SCrIBE exhibits superior segmentation performance aggregated across all corruptions and severity levels, with a mIOU of 42.1 compared to 40.3 for IBE and 37.5 for the Softmax Baseline.



1 Introduction

In the past two years alone, there has been explosive growth in automated applications built upon advances made in deep learning [Zion2019]. For vision systems alone, deep learning has paved the way to a host of new products and services in safety-critical applications from autonomous vehicles to medical diagnosis to surveillance [hatcher2018survey]. Largely owing to the increased accessibility of open-source software and computing power [pytorch2019, tensorflow2015-whitepaper], the barriers between lab-born innovations and market-ready products are lower than ever before. Though this makes it possible to bring deep vision systems to market quickly, doing so may be premature for safety-critical applications [ntsbHWY18MH010, KamannBenchmarkingRobustnessSemantic2019]. For applications where the failure of the vision system can result in severe consequences, such as damage or harm, it is necessary that the system is robust. In essence, robustness ensures that a system can prevent or minimize the impact of failures. Although image classification is of limited use in safety-critical applications, much of the work toward robustness of deep vision is centered around it [hendrycks17baseline, liang2018enhancing, lee2018simple, hendrycks2018oe, guo2017calibration, lee2018training, hendrycks2018benchmarking, Carlinievaluatingrobustnessneural2017, papernot2016limitations, GoodfellowExplainingharnessingadversarial2014, Temel2018_CUREOR, temel2019traffic, GeirhosGeneralisationhumansdeepa, Temel2017_NIPSW, Temel2018_SPM, Temel2019_ICIP].

To be more relevant to real-world needs, we study semantic segmentation, which is at the heart of safety-critical decision systems across a broad spectrum of applications because it performs localization and classification simultaneously. Despite the dizzying pace of advancements for semantic segmentation, contributions have largely only been toward improving task performance on increasingly challenging datasets [EveringhamPascalVisualObject2015, LinMicrosoftCOCOCommon2015, Cordts2016Cityscapes, ZhouSceneparsingade20k2017] or reducing the resource-footprint to capability ratio [SiamRTSegRealTimeSemantic2018, MehtaESPNetEfficientSpatial2018, HowardMobileNetsEfficientConvolutional2017, PaszkeENetDeepNeural2016]. Though still limited, there have been recent contributions to semantic segmentation for assessing and improving robustness to adversarial [ZhouAutomatedEvaluationSemantic2019], out-of-distribution [lehman2019implicit], and corrupted inputs [KamannBenchmarkingRobustnessSemantic2019]. Though we expand upon the techniques from [lehman2019implicit], the work most similar to ours is [KamannBenchmarkingRobustnessSemantic2019]. Kamann et al. [KamannBenchmarkingRobustnessSemantic2019] provide a benchmark comparing several popular semantic segmentation models trained on the Cityscapes dataset [Cordts2016Cityscapes] and PASCAL VOC 2012 [EveringhamPascalVisualObject2015], then tested on corrupted versions of the same. They proposed and verified that training models on noise can improve robustness to noise, but only reported ablation benchmarks across other types of corruption from [hendrycks2018benchmarking].





















Figure 1: Depictions of the dependency structures between components in $z$ for each model. The red node represents the background component, the red circle is the maximum foreground component, and red edges represent dependencies constructed with IBE. For IBE and SCrIBE, the background component depends only on the negative maximum foreground component, which removes dependencies. In SCrIBE, the result is a set of binary detectors that share the background representation.

In an effort to continue bridging the gap between innovations made in robustness and semantic segmentation, we investigate the effects of corrupted images on a popular semantic segmentation model, DeepLabV3+ [ChenDeepLabSemanticImage2018] with a ResNet50 [HeDeepresiduallearning2016a] backbone. Further, we provide evidence that improved robustness is a consequence of constructing pixel-wise representations with Implicit Background Estimation (IBE) from [lehman2019implicit] and, even more so, with our proposed method that combines IBE with the Sigmoid Cross Entropy objective (SCrIBE). We validate our proposed method against the Baseline (Softmax Cross Entropy) and IBE (Softmax Cross Entropy with IBE) using the ImageNet-C Corruption Toolkit from [hendrycks2018benchmarking] to corrupt the PASCAL VOC 2012 validation set [EveringhamPascalVisualObject2015].

2 Background

Figure 2: The IBE module takes as input the Class Activation Maps (CAMs) of the foreground, produces a background map, and concatenates it to the foreground CAMs to produce a complete prediction.

To better illustrate why IBE and SCrIBE benefit robustness, we must first review the structural properties of representation from the context of semantic segmentation. We will first analyze Softmax and provide insight on some properties that may result in susceptibility to corrupted inputs. Then, we will discuss IBE and outline why it results in the improved robustness observed in [lehman2019implicit]. Finally, we will discuss why Sigmoid alone is a poor choice for optimizing a segmentation objective, but when combined with IBE it becomes superior in robustness to corruption.

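Semantic Segmentation: Let a semantic segmentation model be $f$, and at each pixel location let the output of $f$ be $\hat{y} \in \mathbb{R}^{K}$, where $K$ is the number of classes. $f$ learns a representation for $\hat{y}$, which we define as $z \in \mathbb{R}^{K}$. Also defined at each pixel is a categorical label, $y \in \{0, 1\}^{K}$, as a one-hot vector. Let softmax be $\sigma(z)_i = e^{z_i} / \sum_{j} e^{z_j}$ and sigmoid be $s(z_i) = 1 / (1 + e^{-z_i})$.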
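As a minimal sketch of these two definitions, the snippet below applies softmax and sigmoid to the representation of a single pixel. The function names and the example values are illustrative assumptions, not taken from the paper.

```python
import math

def softmax(z):
    """Softmax over the K components of one pixel's representation z."""
    m = max(z)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(v):
    """Logistic sigmoid of a single component, written to avoid overflow."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)

z = [2.0, -1.0, 0.5]             # hypothetical 3-class representation
shifted = [v + 10.0 for v in z]  # same relative differences, different offset

print(softmax(z), softmax(shifted))                               # identical up to rounding
print([sigmoid(v) for v in z], [sigmoid(v) for v in shifted])     # clearly different
```

Softmax only sees the differences between components, so a constant shift leaves it unchanged, whereas sigmoid responds to the absolute value of each component relative to 0.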

2.1 Softmax

Consider a model trained with Softmax Cross Entropy that achieves satisfactory task performance. We begin with the gradient of the Softmax Cross Entropy loss $\mathcal{L}$ for semantic segmentation in (1), where $z_i$ is the scalar value of the representation vector $z$ at index $i$, $y_i$ is the corresponding entry of the one-hot label, and $\sigma(z)_i$ is the softmax output:

$$\frac{\partial \mathcal{L}}{\partial z_i} = \sigma(z)_i - y_i \tag{1}$$

During training, the updates follow the pattern of reinforcing True Positives (TP) by making that component larger and positive, reinforcing True Negatives (TN) by making those components larger and negative, and punishing False Positives (FP) and False Negatives (FN) in the opposite way. Though this behavior is desirable, Softmax heavily favors reinforcing TP and punishing FP because the updates for FN and TN depend on the relative magnitude of the components in $z$. One insight is that the components associated with negative detection are neglected in optimization. This is evident in Fig. 3, where the Baseline exhibits a chaotic structure in the autocorrelation of $z$ because, throughout training, the FN and TN components were neglected, leaving them close to their responses at initialization. Another insight is that the structure of optimization depends only on the relative response in $z$. This structure is fully connected because detection depends on every component $z_i$, as shown for the Baseline in Fig. 1.
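The neglect of negative components can be illustrated numerically. The sketch below assumes the standard Softmax Cross Entropy gradient (softmax output minus one-hot label) and uses made-up logits; the helper names are hypothetical.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_ce_grad(z, y):
    """Gradient of -sum_i y_i * log(softmax(z)_i) with respect to z."""
    return [p_i - y_i for p_i, y_i in zip(softmax(z), y)]

# A confident true positive at index 0; the other components sit near
# their (hypothetical) initial values around 0.
z = [6.0, 0.3, -0.2, 0.1]
y = [1.0, 0.0, 0.0, 0.0]

grad = softmax_ce_grad(z, y)
print(grad)  # the three negative components receive gradients well below 0.01
```

Each negative component is updated in proportion to its own softmax probability, so once the positive component dominates, the remaining components barely move regardless of their absolute values.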

Figure 3: Example auto-correlation matrices formed from the per-pixel responses $z$ of the validation set that share a given prediction, for each model trained with VOC2012.
Figure 4: Accumulated Explained Variance (EV) for all components in $z$ by model. Notice that both Baseline and IBE have lower dimensionality compared to SCrIBE, as shown by their faster accumulation of EV. SCrIBE retains dimensionality while decorrelating predicted and non-predicted components.

2.2 Implicit Background Estimation

When background is a class that must be learned, as with VOC2012, a problem arises from neglecting negative detection and requiring a fully connected dependency. Background is the complement of the foreground classes, so in the case of VOC2012 the model must potentially learn 20 representations for background, one for each foreground class. As was demonstrated in [lehman2019implicit] and shown in Fig. 2, by restricting detection of the background class to the case where all of the foreground components are in the negative orthant of $z$, the surjections of Softmax are eliminated. Inspecting the gradient update for IBE, we can observe that the TN and FN components for foreground classes are reinforced when the background is updated. This is again evident in Fig. 3, where for both IBE and SCrIBE there is evident orthogonality between the predicted component and all other components. This reinforcement decorrelates TP detections, which allows for the improvements in calibration and out-of-distribution detection observed in [lehman2019implicit]. However, as Softmax is still in use, the structure retains a fully connected dependence for the foreground, and a binary dependence between the foreground classes and the shared background class, as illustrated in Fig. 1.
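A minimal per-pixel sketch of IBE under Softmax Cross Entropy follows. It assumes, based on the description of Fig. 1, that the background logit is the negated maximum foreground logit; the function names, the index convention (background at index 0), and the example values are illustrative and not taken from [lehman2019implicit].

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def ibe_logits(fg):
    """Prepend a background logit equal to the negated maximum foreground logit."""
    return [-max(fg)] + list(fg)

def ibe_softmax_ce(fg, y):
    """Softmax cross entropy on IBE logits; y is one-hot, index 0 = background."""
    p = softmax(ibe_logits(fg))
    return -sum(y_i * math.log(p_i) for y_i, p_i in zip(y, p))

def ibe_softmax_ce_grad(fg, y):
    """Gradient of the loss above with respect to the foreground logits only."""
    p = softmax(ibe_logits(fg))
    m = max(range(len(fg)), key=lambda i: fg[i])  # index of the max foreground logit
    grad = [p[i + 1] - y[i + 1] for i in range(len(fg))]
    grad[m] -= p[0] - y[0]  # the background error flows back through -max(.)
    return grad

fg = [1.0, -0.5, 0.2]     # hypothetical foreground logits for one pixel
y = [1.0, 0.0, 0.0, 0.0]  # this pixel is labeled background
print(ibe_logits(fg))
print(ibe_softmax_ce_grad(fg, y))  # positive on the max component: descent pushes it negative
```

For a background-labeled pixel, the update acts on the strongest foreground response, which is how the foreground components receive negative-detection updates without a separately learned background output.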

2.3 Sigmoid

To give some context on why Sigmoid is a poor choice for categorical classification, consider the gradient update at each component $z_i$ under a Sigmoid Cross Entropy objective. Notice that the update penalizes only FN. Additionally, it allows for undesirable cases where $z$ is a collection of large positive values, which drives the gradient term to 0 and results in no gradient updates regardless of the value of the label $y$. However, the key difference from Softmax, which we will utilize, is that Sigmoid "pins" component responses about 0, eliminating the relative dependence inherent in Softmax and resulting in a structure of independent binary detectors.
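The behavior described above is consistent with a one-sided Sigmoid Cross Entropy in which only the labeled component contributes to the loss. The exact form of the objective is not recoverable from the text here, so the expression below is an assumption, with $s(\cdot)$ denoting the sigmoid, $z_i$ a component of the per-pixel representation, and $y_i$ the one-hot label entry.

```latex
\mathcal{L}_{\mathrm{sig}} = -\sum_{i} y_i \,\log s(z_i),
\qquad
\frac{\partial \mathcal{L}_{\mathrm{sig}}}{\partial z_i}
  = -\,y_i \,\bigl(1 - s(z_i)\bigr)
```

Under this form, a component with $y_i = 0$ receives no update at all (false positives go unpunished), and the factor $1 - s(z_i)$ vanishes once $z_i$ is large and positive, whatever the label.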

3 Sigmoid Cross Entropy with Implicit Background Estimation

When IBE is combined with a Sigmoid Cross Entropy objective, the gradient update results in binary dependencies, with only a single dependency manifesting at a time, as depicted in Fig. 1. This arises from the background update enforcing the shared representation between each class and from the Sigmoid operation acting on each component $z_i$ independently. Unlike the case of Sigmoid alone, training a collection of binary detectors with SCrIBE enforces an update to the foreground components when a background label is present. The consequence of these updates is that the model must learn a rich representation for background to support foreground detection. The resulting retention of dimensionality, notionally the "wiggle room", should improve robustness to corruptions that act as affine transformations in $z$.
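The combination can be sketched per pixel as follows. This assumes a one-sided Sigmoid Cross Entropy (only the labeled component contributes) and a background logit equal to the negated maximum foreground logit; both are assumptions inferred from the surrounding description, and all names and values are hypothetical.

```python
import math

def sigmoid(v):
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)

def scribe_loss(fg, y):
    """One-sided sigmoid cross entropy on IBE logits; y[0] is the background label."""
    logits = [-max(fg)] + list(fg)
    return -sum(y_i * math.log(sigmoid(a)) for y_i, a in zip(y, logits))

def scribe_grad(fg, y):
    """Gradient of scribe_loss with respect to the foreground logits."""
    m = max(range(len(fg)), key=lambda i: fg[i])  # index of the max foreground logit
    grad = [-y[i + 1] * (1.0 - sigmoid(fg[i])) for i in range(len(fg))]
    # Background term: d/d fg[m] of -y0 * log(sigmoid(-fg[m])) = y0 * sigmoid(fg[m])
    grad[m] += y[0] * sigmoid(fg[m])
    return grad

fg = [0.8, -1.2, 0.3]  # hypothetical foreground logits for one pixel
print(scribe_grad(fg, [1.0, 0.0, 0.0, 0.0]))  # background label: only the max component moves
print(scribe_grad(fg, [0.0, 0.0, 1.0, 0.0]))  # foreground label: only that detector moves
```

In this sketch a background label touches exactly one foreground detector per pixel (the currently strongest one), which matches the description of a single binary dependency manifesting at a time.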

4 Experimental Results

To validate the analysis in Section 2 and test the hypothesis in Section 3, we evaluate all three versions of a single model. We trained DeepLabV3+ with a ResNet50 backbone on the unmodified VOC2012 training set augmented with the Semantic Boundaries Dataset (SBD) [BharathICCV2011] to about 10k examples. Each model was trained with images randomly scaled and cropped to a fixed size, a batch size of 30, "poly" scheduled learning rates starting at 0.01 for the backbone and 0.1 for the classifier, and weight decay of 5e-5. Note that the hardware used was TITAN RTX GPUs with 24GB of memory each, but similar results are possible with a smaller batch size and lower learning rate.
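For readers unfamiliar with the "poly" schedule, a minimal sketch is given below. The decay exponent of 0.9 and the total number of steps are assumptions (neither is stated in the text); only the two starting learning rates come from the setup described above.

```python
def poly_lr(base_lr, step, max_steps, power=0.9):
    """Polynomial decay: base_lr * (1 - step / max_steps) ** power.

    power=0.9 is a commonly used value and an assumption here.
    """
    return base_lr * (1.0 - step / max_steps) ** power

max_steps = 1000  # hypothetical training length
for step in (0, 250, 500, 750, 1000):
    backbone = poly_lr(0.01, step, max_steps)   # starting rate for the backbone
    classifier = poly_lr(0.1, step, max_steps)  # starting rate for the classifier
    print(step, backbone, classifier)
```

Both parameter groups follow the same decay curve, so the classifier stays at ten times the backbone rate throughout training.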

Inspection of Representation Structure: To verify the observations made in the analysis of Section 2, we compare the auto-correlation matrices and the accumulated explained variance for the data. Let $Z$ be the matrix formed by every pixel response $z$ over a number of inputs. The responses shown in Fig. 3 are normalized auto-correlations of $Z$. Though only shown for a single class, these are consistent across all 21 classes. It is clear that IBE structures the response away from the chaotic response shown for the Baseline. By decomposing the covariance matrix and inspecting the effective dimensionality of the learned representation, shown in Fig. 4, we can see the effect of using Sigmoid in place of Softmax. SCrIBE has an inherently higher-dimensional representation compared to Baseline and IBE, as is evident from the slower accumulation of Explained Variance. To help support the notion that more "wiggle-room" results in resistance to affine-in-$z$ corruptions, we evaluate with a variety of corruptions.

Robustness to Corrupted Input: To test the robustness of each model, the mean Intersection-over-Union (mIOU) metric was measured on the VOC 2012 validation set (1449 examples), corrupted using the first 15 corruptions in the ImageNet-C Corruption Toolkit at all 5 severity levels to create about 109k examples. All models were tested on the same corrupted inputs to remove effects caused by the variability in generating corruptions at runtime. Additionally, the models were tested using the Multi-Scale Classification (MSC) method from [ChenDeepLabSemanticImage2018] for comparison.

               Noise             Blur                    Weather           Lighting    Spatial
Sv.  Model     Gaus. Sh.   Imp.  Dfc.  Gls.  Mtn.  Zm.   Sno.  Frs.  Fog   Bri.  Cnt.  Ela.  Pix.  JPEG
1    Baseline  55.5  56.0  50.0  52.5  49.5  56.0  44.5  51.0  60.0  63.0  69.5  66.5  48.5  62.0  62.5
1    IBE       59.0  59.0  50.5  57.5  53.0  59.5  49.0  52.0  61.0  65.5  71.0  68.5  49.5  60.5  63.0
1    SCrIBE    60.5  61.5  54.5  61.5  48.5  61.0  49.0  53.5  61.5  64.5  70.0  68.5  50.5  63.5  63.5
2    Baseline  43.0  41.5  37.0  41.0  33.5  41.5  35.5  32.5  44.0  60.0  69.0  63.0  25.5  59.0  59.0
2    IBE       48.5  47.0  40.5  49.5  39.0  47.5  40.5  33.5  45.5  62.0  70.0  66.5  26.5  58.0  59.0
2    SCrIBE    51.0  50.0  44.0  54.5  31.5  50.5  40.0  34.5  45.5  62.0  69.0  66.0  28.5  62.0  61.0
3    Baseline  26.0  26.0  26.0  22.5  12.0  24.5  29.5  36.0  33.5  54.0  67.5  55.0  51.0  38.0  56.5
3    IBE       30.5  29.5  29.0  33.5  14.5  30.5  36.0  39.0  35.0  57.0  68.0  60.0  53.0  38.5  56.5
3    SCrIBE    34.5  34.5  33.5  39.0  11.0  34.0  35.5  39.0  34.5  57.0  68.0  61.5  50.5  46.5  58.0
4    Baseline  10.5  9.0   8.5   12.5  8.0   12.5  24.0  27.0  31.5  49.5  64.5  34.5  35.0  19.5  45.0
4    IBE       12.5  12.0  10.5  21.0  10.5  16.0  29.0  30.5  33.5  52.0  65.5  41.5  36.5  21.0  46.5
4    SCrIBE    17.5  14.5  15.0  24.5  8.0   18.5  29.0  30.0  32.5  52.5  66.0  45.0  34.5  28.5  50.0
5    Baseline  3.5   4.5   3.5   7.5   5.0   8.5   19.0  24.3  25.5  34.0  60.5  14.0  13.5  13.5  29.0
5    IBE       5.0   6.5   4.5   12.5  7.5   11.0  24.5  26.0  28.0  39.0  62.5  20.0  15.5  14.5  32.0
5    SCrIBE    5.5   7.5   5.5   14.5  6.0   13.0  23.0  24.3  26.5  40.0  62.5  23.0  15.5  20.5  36.7
Avg. Baseline  37.4  37.0  35.4  36.7  21.9  29.0  30.6  32.6  39.7  52.7  66.9  46.5  35.2  39.8  51.4
Avg. IBE       40.2  40.0  36.7  43.4  24.8  33.3  35.8  34.7  41.0  55.6  68.2  51.3  36.4  40.5  52.3
Avg. SCrIBE    43.5  43.3  41.2  47.3  21.3  35.5  35.5  34.6  41.4  55.3  67.9  52.9  36.6  46.3  55.8

Table 1: mIOU scores for the corrupted VOC 2012 validation set by severity (Sv.). SCrIBE is clearly superior to the Baseline on almost every corruption type and level. Compared to IBE, SCrIBE offers improvements in Noise and Spatial robustness; improvement is mixed otherwise.
Figure 5: This plot aggregates mIOU across each corruption to show the effects of corruption severity. Severity 0 indicates uncorrupted inputs. With Multi-Scale Classification (MSC) both models have similar performance at Severity 0, but only SCrIBE continues to benefit from MSC. The Baseline actually performs worse when MSC is used for corrupted inputs.
Figure 6: The top row, labeled Original, shows a comparison between Baseline and SCrIBE for the input. The following rows, labeled Corrupted, show a comparison across the first 3 severity levels for the 3 corruptions. Though severities 4 and 5 are not shown, it is clear from the results in Table 1 that the predictions at those levels are generally almost entirely incorrect.

As summarized by Fig. 5 and Table 2 and detailed in Table 1, SCrIBE is clearly superior to the Baseline across almost all corruptions and severity levels, while all models have very similar performance with no corruption. However, SCrIBE does not improve across all corruptions compared to IBE. As suggested earlier, the additional "wiggle-room" gained by the Sigmoid objective will only help with affine-in-$z$ corruptions. MSC improves the performance for all models, as shown in Fig. 5.
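For reference, the mIOU metric and the size of the corrupted evaluation set can be sketched as below. This is a simplified illustration rather than the evaluation code used for the reported numbers: it scores flat lists of class indices, skips classes absent from both prediction and label, and treats 255 as a void label to ignore, all of which are assumptions.

```python
def mean_iou(pred, label, num_classes, ignore_index=255):
    """Mean Intersection-over-Union over flat lists of class indices."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, label):
        if t == ignore_index:
            continue  # void pixels do not count toward any class
        if p == t:
            inter[p] += 1
            union[p] += 1
        else:
            union[p] += 1  # false positive for the predicted class
            union[t] += 1  # false negative for the labeled class
    ious = [inter[c] / union[c] for c in range(num_classes) if union[c] > 0]
    return sum(ious) / len(ious)

# Size of the corrupted evaluation set described in the text.
num_corrupted = 1449 * 15 * 5
print(num_corrupted)  # 108675, i.e. about 109k

print(mean_iou([0, 1, 1, 0], [0, 1, 0, 0], num_classes=2))
```

In practice the intersection and union counts are usually accumulated over the whole dataset before dividing, so that small images do not weigh as much as large ones.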

Qualitative Results: We provide visualizations comparing the effects of 3 corruptions between Baseline and SCrIBE for qualitative evaluation in Fig. 6. In general for semantic segmentation, robustness manifests as retention of predictions under corrupted conditions. For all models, a misclassification is more often to background than to some other foreground class. In the cases where a foreground class is the resulting misclassification, we have observed that these exchanges follow the relative label frequency of the training set.

5 Conclusion

In this paper, we provided analytical and empirical evidence about the underlying structures of representation for semantic segmentation. We determined that the decorrelated components produced by IBE result in improved robustness. From the analysis, we hypothesized about corruptions that act as affine transformations in the representation space of $z$. We then showed that the effects of affine-in-$z$ corruptions can be further reduced by retaining dimensionality of the representation through applying SCrIBE. Though these results are promising for improving the robustness of semantic segmentation models, the evidence suggests that the properties of representation structure are largely unknown and likely untapped. Namely, though the addition of IBE and SCrIBE did improve robustness to corruptions and evidence was presented to associate observable properties with the improvement, the question of direct causality is still unanswered.

Model val val+MSC cor cor+MSC
Baseline 69.1 74.1 35.5 37.5
IBE 70.6 75.3 38.6 40.3
SCrIBE 69.9 74.6 39.5 42.1
Table 2: Results in terms of mIOU on the PASCAL VOC 2012 validation set using DeepLabV3+ with a ResNet50 backbone for the Baseline, IBE, and our SCrIBE variant. Multi-Scale Classification (MSC) is also used to improve performance. Results aggregated across 15 corruptions at 5 severity levels (cor) are provided for each model, both with and without MSC.