In the past two years alone, there has been explosive growth in automated applications built upon advances in deep learning [Zion2019]. For vision systems in particular, deep learning has paved the way to a host of new products and services in safety-critical applications, from autonomous vehicles to medical diagnosis to surveillance [hatcher2018survey]. Largely owing to the increased accessibility of open-source software and computing power [pytorch2019, tensorflow2015-whitepaper], the barriers between lab-born innovations and market-ready products are lower than ever before. Though this makes it possible to bring deep vision systems to market quickly, doing so may be premature for safety-critical applications [ntsbHWY18MH010, KamannBenchmarkingRobustnessSemantic2019]. For applications where failure of the vision system can result in severe consequences, such as damage or harm, robustness is necessary. In essence, robustness ensures that a system can prevent or minimize the impact of failures. Despite being of limited use in safety-critical applications, much of the work toward robustness of deep vision centers on image classification [hendrycks17baseline, liang2018enhancing, lee2018simple, hendrycks2018oe, guo2017calibration, lee2018training, hendrycks2018benchmarking, Carlinievaluatingrobustnessneural2017, papernot2016limitations, GoodfellowExplainingharnessingadversarial2014, Temel2018_CUREOR, temel2019traffic, GeirhosGeneralisationhumansdeepa, Temel2017_NIPSW, Temel2018_SPM, Temel2019_ICIP].
To be more relevant to real-world needs, we study semantic segmentation, which lies at the heart of safety-critical decision systems across a broad spectrum of applications because it simultaneously performs localization and classification. Despite the dizzying pace of advancements in semantic segmentation, contributions have largely been limited to improving task performance on increasingly challenging datasets [EveringhamPascalVisualObject2015, LinMicrosoftCOCOCommon2015, Cordts2016Cityscapes, ZhouSceneparsingade20k2017] or reducing the resource-footprint to capability ratio [SiamRTSegRealTimeSemantic2018, MehtaESPNetEfficientSpatial2018, HowardMobileNetsEfficientConvolutional2017, PaszkeENetDeepNeural2016]. Though still limited, there have been recent contributions to semantic segmentation for assessing and improving robustness to adversarial [ZhouAutomatedEvaluationSemantic2019], out-of-distribution [lehman2019implicit], and corrupted inputs [KamannBenchmarkingRobustnessSemantic2019]. Though we expand upon the techniques from [lehman2019implicit], the work most similar to the present paper is [KamannBenchmarkingRobustnessSemantic2019]. Kamann et al. [KamannBenchmarkingRobustnessSemantic2019] provide a benchmark comparing several popular semantic segmentation models trained on the Cityscapes dataset [Cordts2016Cityscapes] and PASCAL VOC 2012 [EveringhamPascalVisualObject2015], then tested on corrupted versions of the same. They proposed and verified that training models on noise can improve robustness to noise, but only reported ablation benchmarks across the other corruption types from [hendrycks2018benchmarking].
In an effort to continue bridging the gap between innovations in robustness and semantic segmentation, we investigate the effects of corrupted images on a popular semantic segmentation model, DeepLabV3+ [ChenDeepLabSemanticImage2018] with a ResNet50 [HeDeepresiduallearning2016a] backbone. Further, we provide evidence that improved robustness is a consequence of constructing pixel-wise representations with Implicit Background Estimation (IBE) from [lehman2019implicit] and, even more so, with our proposed method combining IBE with the Sigmoid Cross Entropy objective (SCrIBE). We validate our proposed method against the Baseline (Softmax Cross Entropy) and IBE (Softmax Cross Entropy with IBE) using the ImageNet-C Corruption Toolkit from [hendrycks2018benchmarking] to corrupt the PASCAL VOC 2012 validation set [EveringhamPascalVisualObject2015].
To better illustrate why IBE and SCrIBE benefit robustness, we must first review the structural properties of representation in the context of semantic segmentation. We first analyze Softmax and provide insight into properties that may result in susceptibility to corrupted inputs. Then, we discuss IBE and outline why it results in the improved robustness observed in [lehman2019implicit]. Finally, we discuss why Sigmoid alone is a poor choice for optimizing a segmentation objective, but when combined with IBE it becomes superior in robustness to corruption.
Semantic Segmentation: Let a semantic segmentation model be $f$, and at each pixel location let the output of $f$ be $\mathbf{z} \in \mathbb{R}^{K}$, where $K$ is the number of classes. $f$ learns a representation for each pixel, which we define as $\mathbf{z}$. Also defined at each pixel is a categorical label, $\mathbf{y} \in \{0, 1\}^{K}$, as a one-hot vector. Let softmax be $\sigma(\mathbf{z})_i = e^{z_i} / \sum_{j=1}^{K} e^{z_j}$, and sigmoid be $s(z_i) = 1 / (1 + e^{-z_i})$.
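As a concrete reference for these definitions, the two activations can be sketched per pixel in NumPy (the variable names here are illustrative, not from the paper):

```python
import numpy as np

def softmax(z):
    """Softmax over a (K,) logit vector: components compete for probability mass."""
    e = np.exp(z - z.max())  # shift by the max for numerical stability
    return e / e.sum()

def sigmoid(z):
    """Element-wise logistic sigmoid: each component acts as an independent detector."""
    return 1.0 / (1.0 + np.exp(-z))

# One pixel's logits for K = 4 classes.
z = np.array([2.0, -1.0, 0.5, -3.0])
p_soft = softmax(z)  # a distribution over the K classes (sums to 1)
p_sig = sigmoid(z)   # K independent detection scores in (0, 1)
```

Softmax couples all components through its normalizing sum, while sigmoid scores each component against a fixed threshold; this distinction underlies the analysis that follows.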
Consider a model trained with Softmax Cross Entropy that achieves satisfactory task performance. We begin with the gradient of the Softmax Cross Entropy loss for semantic segmentation,
$$\frac{\partial \mathcal{L}_{SCE}}{\partial z_i} = \sigma(\mathbf{z})_i - y_i. \quad (1)$$
Let $z_i$ be the scalar value of a vector $\mathbf{z}$ indexed at $i$.
During training, the updates follow the pattern of reinforcing True Positives (TP) by making that component larger and positive, reinforcing True Negatives (TN) by making those components larger and negative, and punishing False Positives (FP) and False Negatives (FN) in the opposite way. Though this behavior is desirable, Softmax heavily favors reinforcing TP and punishing FP because the updates for FN and TN depend on the relative magnitude of the components of $\mathbf{z}$ for which $y_i = 0$. One insight is that the components associated with negative detection are neglected in optimization, increasingly so as $K$ grows. This is evident in Fig. 3, where the Baseline exhibits a chaotic structure in the autocorrelation of $\mathbf{z}$: because the FN and TN components were neglected throughout training, they remain close to their responses at initialization. Another insight is that the structure of optimization depends only on the relative responses in $\mathbf{z}$. This structure is fully connected because detection depends on every component $z_j$, as shown for the Baseline in Fig. 1.
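The asymmetry described above can be observed numerically. The sketch below (illustrative, not the paper's code) evaluates the Softmax Cross Entropy gradient for a pixel with a confident false positive and shows the TN updates shrinking as $K$ grows:

```python
import numpy as np

def sce_grad(z, y):
    """Gradient of Softmax Cross Entropy w.r.t. the logits: sigma(z) - y."""
    e = np.exp(z - z.max())
    return e / e.sum() - y

def pixel(K):
    z = np.zeros(K)
    z[1] = 3.0        # a confident false positive at class 1
    y = np.eye(K)[0]  # one-hot label: true class is 0
    return sce_grad(z, y)

g_small, g_large = pixel(5), pixel(50)
# FN (index 0) and FP (index 1) receive large updates; the TN
# components receive small ones that shrink further as K grows,
# leaving them near their values at initialization.
```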
2.2 Implicit Background Estimation
When background is a class that must be learned, as with VOC2012, a problem arises from neglecting negative detection and requiring a fully connected dependency. Background is the complement of the foreground classes: in the case of VOC2012, the model must learn potentially $K-1$ distinct representations for background, one for each foreground class. As was demonstrated in [lehman2019implicit] and shown in Fig. 2, by restricting detection of the background class to when the foreground components $z_1, \dots, z_{K-1}$ all lie in the negative orthant of $\mathbb{R}^{K-1}$, the surjections of Softmax are eliminated. Also, by inspecting the gradient update for IBE, formulated as
$$\frac{\partial \mathcal{L}_{IBE}}{\partial z_i} = \sigma(\tilde{\mathbf{z}})_i - y_i,$$
where $\tilde{\mathbf{z}}$ is the logit vector augmented with the implicitly estimated background component, we can observe that the TN and FN components for foreground classes are reinforced when the background is updated. This is again evident in Fig. 3, where for both IBE and SCrIBE there is orthogonality between the background component and all other components. This reinforcement decorrelates TP detections, allowing for the observed improvements in calibration and out-of-distribution detection in [lehman2019implicit]. However, as Softmax is still in use, the structure retains a fully connected dependence among the foreground classes and a binary dependence between each foreground class and the shared background class, as illustrated in Fig. 1.
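One simple way to realize the negative-orthant behavior, used here purely for illustration (the exact background estimator in [lehman2019implicit] may differ), is to augment the foreground logits with a fixed zero background component before the Softmax:

```python
import numpy as np

def ibe_softmax(z_fg):
    """Softmax over foreground logits augmented with an implicit
    background component (fixed at 0 here for illustration)."""
    z_aug = np.concatenate(([0.0], z_fg))
    e = np.exp(z_aug - z_aug.max())
    return e / e.sum()

# Background (index 0) wins only when every foreground logit is
# negative, i.e. z_fg lies in the negative orthant.
p_bg = ibe_softmax(np.array([-1.0, -2.0, -0.5]))
p_fg = ibe_softmax(np.array([1.0, -2.0, -0.5]))
# When the label is background, the softmax gradient pushes *all*
# foreground components down, reinforcing their TN/FN directions.
```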
To give some context on why Sigmoid is a poor choice for categorical classification, consider the gradient update at $z_i$, which can be formulated as
$$\frac{\partial \mathcal{L}_{sig}}{\partial z_i} = -y_i \left(1 - s(z_i)\right).$$
Notice that the update penalizes only FN. Additionally, it allows for undesirable cases where $\mathbf{z}$ is a collection of large positive values, driving the $(1 - s(z_i))$ term to 0 and resulting in no gradient updates regardless of the value of $\mathbf{y}$. However, the key difference from Softmax, which we will exploit, is that Sigmoid "pins" component responses about 0, eliminating the relative dependence inherent to Softmax and yielding a structure of independent binary detectors.
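The saturation failure mode is easy to reproduce. The sketch below assumes the positives-only Sigmoid Cross Entropy gradient form described above (an illustration, not the released code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sig_grad(z, y):
    """Positives-only Sigmoid CE gradient: -y * (1 - s(z))."""
    return -y * (1.0 - sigmoid(z))

y = np.eye(3)[0]  # true class is 0

# Saturated case: all components are large positive values, so the
# (1 - s(z)) factor vanishes and no update occurs regardless of y.
g_sat = sig_grad(np.array([10.0, 12.0, 11.0]), y)

# Only the FN component (y = 1, low response) is penalized; the
# confident FP at index 1 receives no correction at all.
g_fn = sig_grad(np.array([-5.0, 12.0, 0.0]), y)
```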
3 Sigmoid Cross Entropy with Implicit Background Estimation
When combining IBE with a Sigmoid Cross Entropy objective, the gradient update becomes
$$\frac{\partial \mathcal{L}_{SCrIBE}}{\partial z_i} = y_{bg}\, s(z_i) - y_i \left(1 - s(z_i)\right),$$
where $y_{bg}$ indicates a background label, resulting in binary dependencies with only a single foreground component manifesting at a time, as depicted in Fig. 1. This arises from the background update enforcing the shared representation between each class and from the Sigmoid operation acting on each component $z_i$ independently. Unlike the case of Sigmoid alone, training a collection of binary detectors with SCrIBE enforces an update to all foreground components when a background label is present. The consequence of these updates is that the model must learn a rich representation for background to support foreground detection. The resulting retention of dimensionality, notionally the "wiggle room", should improve robustness to corruptions that act as affine transformations in $\mathbf{z}$.
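Under the gradient form above, a per-pixel SCrIBE update can be sketched as follows (variable names are illustrative; this is not the paper's released implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def scribe_grad(z_fg, y_fg, y_bg):
    """Sketch of the per-pixel SCrIBE gradient: a background label
    (y_bg = 1) pushes every foreground detector down, while a
    foreground label pulls only its own detector up."""
    s = sigmoid(z_fg)
    return y_bg * s - y_fg * (1.0 - s)

z = np.array([2.0, -1.0, 0.5])

# Background pixel: all foreground components receive a suppressive
# update, so background supports every foreground boundary.
g_bg = scribe_grad(z, np.zeros(3), 1.0)

# Foreground pixel of class 0: only that detector is reinforced.
g_fg = scribe_grad(z, np.eye(3)[0], 0.0)
```

The contrast with the Sigmoid-alone case is visible here: a background label now generates nonzero updates for every foreground component rather than none.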
4 Experimental Results
To validate the analysis in Section 2 and test the hypothesis in Section 3, we evaluate all three versions of a single model. We trained DeepLabV3+ with a ResNet50 backbone on the unmodified VOC2012 training set augmented with the Semantic Boundaries Dataset (SBD) [BharathICCV2011] to about 10k examples. Each model was trained with images randomly scaled and cropped to a fixed size in pixels, a batch size of 30, "poly" scheduled learning rates starting at 0.01 for the backbone and 0.1 for the classifier, and a weight decay of 5e-5. Note that the hardware used was TITAN RTX GPUs with 24GB of memory each, but similar results are possible with a smaller batch size and a lower learning rate.
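For reference, the "poly" schedule can be sketched as below; the decay power is not stated in the text, so the common DeepLab default of 0.9 is assumed:

```python
def poly_lr(base_lr, step, max_steps, power=0.9):
    """'poly' learning-rate schedule: decays base_lr to 0 over max_steps."""
    return base_lr * (1.0 - step / max_steps) ** power

# Backbone starts at 0.01 (classifier at 0.1), both decaying to 0.
lrs = [poly_lr(0.01, s, 30000) for s in (0, 15000, 30000)]
```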
Inspection of Representation Structure: To verify the observations made in the analysis from Section 2, we compare the auto-correlation matrices and the accumulated explained variance for the data. Let $\mathbf{Z} \in \mathbb{R}^{N \times K}$ be the matrix formed by every pixel response over a number of inputs. The responses shown in Fig. 3 are the normalized auto-correlation computed with $\frac{1}{N}\mathbf{Z}^{T}\mathbf{Z}$. Though only shown for a single class, these are consistent across all 21 classes. It is clear that IBE structures the response away from the chaotic response shown for the Baseline. By decomposing the covariance matrix and inspecting the effective dimensionality of the learned representation, shown in Fig. 4, we can see the effect of using Sigmoid in place of Softmax. SCrIBE has an inherently higher-dimensional representation compared to the Baseline and IBE, as is evident from the slower accumulation of explained variance. To help support the notion that more "wiggle room" yields resistance to affine-in-$\mathbf{z}$ corruptions, we evaluate with a variety of corruptions.
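These diagnostics can be computed as follows; the exact normalization used for the figures is not specified, so the unit-diagonal normalization here is an assumption:

```python
import numpy as np

def response_structure(Z):
    """Z: (N, K) matrix of per-pixel responses stacked over inputs.
    Returns the normalized auto-correlation (1/N) Z^T Z and the
    cumulative explained-variance curve of the covariance spectrum."""
    N = Z.shape[0]
    R = (Z.T @ Z) / N
    d = np.sqrt(np.clip(np.diag(R), 1e-12, None))
    R_norm = R / np.outer(d, d)  # unit diagonal for comparability
    evals = np.linalg.eigvalsh(np.cov(Z, rowvar=False))[::-1]
    cum_ev = np.cumsum(evals) / evals.sum()
    return R_norm, cum_ev

rng = np.random.default_rng(0)
Z = rng.normal(size=(1000, 21))  # stand-in for pixel responses
R_norm, cum_ev = response_structure(Z)
# A slowly accumulating explained-variance curve indicates a
# higher-dimensional representation (more "wiggle room").
```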
Robustness to Corrupted Input:
To test the robustness of each model, the mean Intersection-over-Union (mIoU) was measured on the VOC 2012 validation set (1449 examples) corrupted using the first 15 corruptions in the ImageNet-C Corruption Toolkit at all 5 severity levels, creating about 109k examples. All models were tested on the same corrupted input simultaneously to remove effects caused by variability in generating corruptions at runtime. Additionally, the models were tested using the Multi-Scale Classification (MSC) method from [ChenDeepLabSemanticImage2018] for comparison. As summarized by Fig. 5 and Table 2 and detailed in Table 1, SCrIBE is clearly superior to the Baseline across almost all corruptions and severity levels, while all models perform very similarly with no corruption. However, SCrIBE does not improve over IBE across all corruptions. As suggested earlier, the additional "wiggle room" gained from the Sigmoid objective only helps with affine-in-$\mathbf{z}$ corruptions. MSC improves the performance of all models, as shown in Fig. 5.
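For completeness, a minimal mIoU computation over flattened label maps might look like the following (a generic sketch; the paper's evaluation code is not shown):

```python
import numpy as np

def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean Intersection-over-Union from flattened prediction/label maps."""
    mask = target != ignore_index  # drop void pixels, as in VOC
    p, t = pred[mask], target[mask]
    conf = np.bincount(num_classes * t + p,
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    inter = np.diag(conf).astype(float)
    union = conf.sum(0) + conf.sum(1) - inter
    valid = union > 0  # skip classes absent from both maps
    return float((inter[valid] / union[valid]).mean())

# Toy check: two classes, one mislabeled pixel.
miou = mean_iou(np.array([0, 0, 1, 1]), np.array([0, 1, 1, 1]), num_classes=2)
```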
Qualitative Results: We provide visualizations comparing the effects of 3 corruptions between the Baseline and SCrIBE for qualitative evaluation in Fig. 6. In general, for semantic segmentation, robustness manifests as retention of predictions under corrupted conditions. For all models, it is more often the case that a misclassification is to background rather than to another foreground class. However, in the cases where a foreground class is the resulting misclassification, we have observed that these exchanges follow the relative label frequency of the training set.
In this paper, we provided analytical and empirical evidence about the underlying structures of representation for semantic segmentation. We determined that the decorrelated components produced by IBE result in improved robustness. From the analysis, we hypothesized about corruptions that act as affine transformations in the representation space of $\mathbf{z}$. We then showed that the effects of affine-in-$\mathbf{z}$ corruptions can be further reduced by retaining the dimensionality of the representation through SCrIBE. Though these results are promising for improving the robustness of semantic segmentation models, the evidence suggests that the properties of representation structure are largely unknown and likely untapped. Namely, though the addition of IBE and SCrIBE did improve robustness to corruptions, and evidence was presented to associate observable properties with the improvement, the question of direct causality remains unanswered.