Benchmarking Robustness in Object Detection: Autonomous Driving when Winter is Coming

07/17/2019 ∙ by Claudio Michaelis, et al.

The ability to detect objects regardless of image distortions or weather conditions is crucial for real-world applications of deep learning like autonomous driving. We here provide an easy-to-use benchmark to assess how object detection models perform when image quality degrades. The three resulting benchmark datasets, termed Pascal-C, Coco-C and Cityscapes-C, contain a large variety of image corruptions. We show that a range of standard object detection models suffer a severe performance loss on corrupted images (down to 30-60% of the original performance). A simple data augmentation trick, stylizing the training images, leads to a substantial increase in robustness across corruption type, severity and dataset. We envision our comprehensive benchmark to track future progress towards building robust object detection models. Benchmark, code and data are available at: http://github.com/bethgelab/robust-detection-benchmark


1 Introduction

A day in the near future: Autonomous vehicles are swarming the streets all over the world, tirelessly collecting data. But on this cold November afternoon traffic comes to an abrupt halt as it suddenly begins to snow: winter is coming. Huge snow flakes are falling from the sky, and the cameras of autonomous vehicles are no longer able to make sense of their surroundings, triggering immediate emergency brakes. A day later, an investigation of this traffic disaster reveals that the unexpectedly large size of the snow flakes was the cause of the chaos: While state-of-the-art vision systems had been trained on a variety of common weather types, their training data contained hardly any snow flakes of this size…

Figure 2: Expect the unexpected: To ensure safety, an autonomous vehicle must be able to recognize objects even in challenging outdoor conditions such as fog, frost, snow, and, of course, the occasional dragonfire. (Outdoor hazards have been directly linked to increased mortality rates; Lystad and Brown, 2018.)

This fictional example highlights the problems that arise when Convolutional Neural Networks (CNNs) encounter settings that were not explicitly part of their training regime. For example, state-of-the-art object detection algorithms such as Faster R-CNN (Ren et al., 2015) fail to recognize objects when snow is added to an image (as shown in Figure 1), even though the objects are still clearly visible to a human eye. At the same time, augmenting the training data with several types of distortions is not a sufficient solution to achieve general robustness against previously unknown corruptions: It has recently been demonstrated that CNNs generalize poorly to novel distortion types, despite being trained on a variety of other distortions (Geirhos et al., 2018). Even an innocuous distribution shift such as a transition from small snow flakes at training time to large snow flakes at test time can have a strong impact on current vision systems.

On a more general level, CNNs often fail to generalize outside of the training domain or training data distribution. Examples include the failure to generalize to images with uncommon poses of objects (Alcorn et al., 2019) or to cope with small distributional changes (e.g. Zech et al., 2018; Touvron et al., 2019). One of the most extreme cases is adversarial examples (Szegedy et al., 2013): images with a domain shift so small that it is imperceptible to humans yet sufficient to fool a CNN. We here focus on the less extreme but far more common problem of perceptible image distortions such as blur, noise, or natural distortions like snow.

As an example, autonomous vehicles need to be able to cope with wildly varying outdoor conditions such as fog, frost, snow, sand storms, or falling leaves, just to name a few (as visualized in Figure 2). One of the major reasons why autonomous cars have not yet gone mainstream is the inability of their recognition models to function well in adverse weather conditions (Dai and Van Gool, 2018). Many common environmental conditions can be (and have been) modelled, including fog (Sakaridis et al., 2018a), rain (Hospach et al., 2016), snow (von Bernuth et al., 2019) and daytime to nighttime transitions (Dai and Van Gool, 2018). However, it is impossible to foresee all potential conditions that might occur “in the wild”.

If we could build models that are robust to every possible image corruption, weather changes would not be an issue. However, in order to assess the robustness of models one first needs to define a measure. While testing models on the set of all possible corruption types is impossible, we argue that a useful approximation is to evaluate models on a diverse range of corruption types that were not part of the training data: if a model copes well with a dozen corruptions that it has never seen before, we expect it to cope well with yet another type of corruption.

In this work, we propose three easy-to-use benchmark datasets termed Pascal-C, Coco-C and Cityscapes-C to assess distortion robustness in object detection. Each dataset contains versions of the original object detection datasets corrupted with 15 distortions, each spanning five levels of severity. This approach is directly inspired by Hendrycks and Dietterich (2019), who introduced corrupted versions of commonly used classification datasets (ImageNet-C, CIFAR10-C) as standardized benchmarks. After evaluating standard object detection algorithms on these benchmark datasets, we show how a simple data augmentation technique (stylizing the training images) can strongly improve robustness across corruption type, severity and dataset.

1.1 Contributions

Our contributions can be summarized as follows:

  1. We demonstrate that a broad range of object detection and instance segmentation models suffer severe performance impairments on corrupted images.

  2. To quantify this behaviour and to enable tracking future progress, we propose the Robust Detection Benchmark, consisting of three benchmark datasets termed Pascal-C, Coco-C & Cityscapes-C.

  3. We show that a simple data augmentation technique—stylizing the training data—leads to large robustness improvements for all evaluated corruptions without any additional labelling costs or architectural changes.

  4. We make our benchmark, corruption and stylization code openly available in an easy-to-use fashion: https://github.com/bethgelab/robust-detection-benchmark, https://github.com/bethgelab/imagecorruptions and https://github.com/bethgelab/stylize-datasets.

1.2 Related Work

Benchmarking corruption robustness

In recent years, there have been several publications studying the vulnerability of DNNs to common corruptions. Dodge and Karam (2016) measure the performance of four state-of-the-art image recognition models on out-of-distribution data and show that DNNs are particularly vulnerable to blur and Gaussian noise. Geirhos et al. (2018) show that, across a broad range of corruption types, DNN performance drops much faster than human performance on the task of recognizing corrupted images as the perturbation level increases. Azulay and Weiss (2018) investigate the lack of invariance of several state-of-the-art DNNs to small translations. A benchmark to evaluate the robustness of recognition models against common corruptions was recently introduced by Hendrycks and Dietterich (2019).

Improving corruption robustness

One way to counter the performance drop on corrupted data is to preprocess the data in order to remove the corruption. Mukherjee et al. (2018) propose a CNN-based approach to restore the image quality of rainy and foggy images. Bahnsen and Moeslund (2018) and Bahnsen et al. (2019) propose algorithms to remove rain from images as a preprocessing step and report a subsequent increase in recognition rate. A challenge for these approaches is that noise removal is currently specific to a certain distortion type and thus does not generalize to other types of distortions. Another line of work seeks to enhance classifier performance by means of data augmentation, i.e. by directly including corrupted data in the training.

Vasiljevic et al. (2016) study the vulnerability of a classifier to blurred images and enhance its performance on blurred images by fine-tuning on them. Geirhos et al. (2018) study generalization between different corruption types and find that fine-tuning on one corruption type does not enhance performance on other corruption types. Geirhos et al. (2019) train a recognition model on a stylized version of the ImageNet dataset (Russakovsky et al., 2015), reporting increased general robustness against different corruptions as a result of a stronger bias towards ignoring textures and focusing on object shape. Hendrycks and Dietterich (2019) report several methods leading to enhanced performance on their corruption benchmark: Histogram Equalization, Multiscale Networks, Adversarial Logit Pairing, Feature Aggregating and Larger Networks.

Evaluating robustness to environmental changes in autonomous driving

In recent years, weather conditions have turned out to be a central limitation for state-of-the-art autonomous driving systems (Sakaridis et al., 2018a; Volk et al., 2019; Dai and Van Gool, 2018; Chen et al., 2018; Lee et al., 2018). While many specific approaches have been pursued, such as modelling weather conditions (Sakaridis et al., 2018a, b; Volk et al., 2019; von Bernuth et al., 2019; Hospach et al., 2016; von Bernuth et al., 2018) or collecting real (Wen et al., 2015; Yu et al., 2018; Che et al., 2019; Caesar et al., 2019) and artificial (Gaidon et al., 2016; Ros et al., 2016; Richter et al., 2017; Johnson-Roberson et al., 2017) datasets with varying weather conditions, no general solution to the problem has emerged. Robustness analysis and optimization of CNNs in the context of autonomous driving is considered by Volk et al. (2019). Building upon Hospach et al. (2016), Volk et al. (2019) study the fragility of an object detection model against rainy images, identify corner cases where the model fails and include images with synthetic rain variations in the training set. They report enhanced performance on real rain images. von Bernuth et al. (2018) report a drop in AP of a Recurrent Rolling Convolution network trained on the KITTI dataset when the camera images are modified by simulated rain drops on the windshield. von Bernuth et al. (2019) model photo-realistic snow and fog conditions to augment real and virtual video streams. They report a significant performance drop of an object detection model when evaluated on corrupted data. Pei et al. (2017) introduce VeriVis, a framework to evaluate the security and robustness of different object recognition models using real-world image corruptions such as brightness, contrast, rotations, smoothing, blurring, and others.

2 Methods

2.1 Robust Detection Benchmark

We introduce the Robust Detection Benchmark inspired by the ImageNet-C benchmark for object classification (Hendrycks and Dietterich, 2019) to assess object detection robustness on corrupted images.

Corruption types

Following Hendrycks and Dietterich (2019), we provide 15 corruptions on five severity levels each (visualized in Figure 3) to assess the effect of a broad range of different corruption types on object detection models. (These corruption types were introduced by Hendrycks and Dietterich (2019) and generalized by us to work with arbitrary image dimensions and aspect ratios; our generalized corruptions can be found at https://github.com/bethgelab/imagecorruptions and installed via pip3 install imagecorruptions.) The corruptions are sorted into four groups: noise, blur, digital and weather (as defined by Hendrycks and Dietterich (2019)). It is important to note that the corruption types are not meant to be used as a training data augmentation toolbox, but rather to measure a model’s robustness against previously unseen corruptions. For model validation purposes, the four additional held-out corruption types from ImageNet-C (Speckle Noise, Gaussian Blur, Spatter, Saturate) are provided as well, even though they are not used to assess performance on the Robust Detection Benchmark.
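To illustrate the intended usage, here is a minimal sketch that applies all 15 corruptions at every severity level to a single image with the pip package; the function names follow the imagecorruptions README, but please verify them against your installed version.

import numpy as np
from imagecorruptions import corrupt, get_corruption_names

# any HxWx3 uint8 image (here: random dummy data as a stand-in)
image = np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8)

for corruption_name in get_corruption_names():   # the 15 benchmark corruptions
    for severity in range(1, 6):                  # severity levels 1-5
        corrupted = corrupt(image, corruption_name=corruption_name, severity=severity)
        # hand `corrupted` to the detection model under evaluation ...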

Figure 3: 15 corruption types from Hendrycks and Dietterich (2019), adapted to corrupt arbitrary images (example shown here: randomly selected Pascal VOC image, center crop). Best viewed on screen.
Benchmark datasets

The Robust Detection Benchmark consists of three benchmark datasets: Pascal-C, Coco-C and Cityscapes-C. Among the vast number of available object detection datasets (Everingham et al., 2010; Geiger et al., 2012; Lin et al., 2014; Cordts et al., 2016; Zhou et al., 2017; Neuhold et al., 2017; Krasin et al., 2017), we chose to use Pascal VOC (Everingham et al., 2010), MS Coco (Lin et al., 2014) and Cityscapes (Cordts et al., 2016), as they are the most commonly used datasets for general object detection (Pascal & Coco) and street scenes (Cityscapes). We follow common convention in selecting the test splits (the VOC2007 test set for Pascal-C, the Coco 2017 validation set for Coco-C and the Cityscapes validation set for Cityscapes-C).

Performance measures

Since performance measures differ between the original datasets, the dataset-specific performance (P) measures are adopted as defined below:

P = AP50 for Pascal-C and P = AP for Coco-C and Cityscapes-C,

where AP stands for the ‘Average Precision’ metric. On the corrupted data, the benchmark performance is measured in terms of mean performance under corruption (mPC):

mPC = \frac{1}{N_c} \sum_{c=1}^{N_c} \frac{1}{N_s} \sum_{s=1}^{N_s} P_{c,s}    (1)

Here, $P_{c,s}$ is the dataset-specific performance measure evaluated on test data corrupted with corruption $c$ under severity level $s$, while $N_c$ and $N_s$ indicate the number of corruptions and severity levels, respectively. In order to measure relative performance degradation under corruption, the relative performance under corruption (rPC) is introduced as defined below:

rPC = \frac{mPC}{P_{clean}}    (2)

rPC measures the relative degradation of performance on the corrupted data compared to performance on clean data.
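For concreteness, here is a small sketch of how the three benchmark numbers can be computed from a grid of evaluation results (one score per corruption and severity level); the array layout and helper name are illustrative and not part of the benchmark code.

import numpy as np

def benchmark_metrics(clean_performance, corrupted_scores):
    # clean_performance: P evaluated on clean test data (AP or AP50)
    # corrupted_scores:  array of shape (N_c, N_s) holding P_{c,s} for each
    #                    corruption c and severity level s
    corrupted_scores = np.asarray(corrupted_scores, dtype=float)
    mpc = corrupted_scores.mean()          # Eq. (1): average over corruptions and severities
    rpc = mpc / clean_performance          # Eq. (2): relative performance under corruption
    return mpc, rpc

# toy example for a 15-corruption x 5-severity evaluation
scores = np.random.uniform(10.0, 30.0, size=(15, 5))
mpc, rpc = benchmark_metrics(36.3, scores)
print(f"mPC = {mpc:.1f}, rPC = {100 * rpc:.1f}%")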

Submissions

Submissions to the benchmark should be handed in as a simple pull request to the Robust Detection Benchmark repository (https://github.com/bethgelab/robust-detection-benchmark) and need to include all three performance measures: clean performance (P), mean performance under corruption (mPC) and relative performance under corruption (rPC). While mPC is the metric used to rank models on the Robust Detection Benchmark, the other measures provide additional insights into the causes of a performance gain, as they disentangle gains from higher clean performance (as measured by P) and gains from better generalization performance to corrupted data (as measured by rPC).

Baseline models

We provide baseline results for a set of common object detection models including Faster R-CNN (Ren et al., 2015), Mask R-CNN (He et al., 2017), Cascade R-CNN (Cai and Vasconcelos, 2018), Cascade Mask R-CNN (Chen et al., 2019a), RetinaNet (Lin et al., 2017a) and Hybrid Task Cascade (Chen et al., 2019a). We use a ResNet50 (He et al., 2016) with Feature Pyramid Networks (Lin et al., 2017b) as backbone for all models except for Faster R-CNN, where we additionally test ResNet101 (He et al., 2016), ResNeXt101-32x4d (Xie et al., 2017) and ResNeXt101-64x4d (Xie et al., 2017) backbones. We additionally provide results for Faster R-CNN and Mask R-CNN models with deformable convolutions (Dai et al., 2017; Zhu et al., 2018) in Appendix Section C. We integrate the robustness benchmark into the mmdetection toolbox (Chen et al., 2019b) and train and test all models with standard hyperparameters. The details can be found in Appendix Section A and on the Robust Detection Benchmark page.

2.2 Style transfer as data augmentation

Figure 4: Training data visualization for Coco and Stylized-Coco. The three different training settings are: standard data (top row), stylized data (bottom row) and the concatenation of both (termed ‘combined’ in plots).

For image classification, style transfer (the method of combining the content of an image with the style of another image) has been shown to strongly improve corruption robustness (Geirhos et al., 2019). We here transfer this method to object detection datasets, testing two settings: 1. replacing each training image with a stylized version; 2. adding a stylized version of each image to the existing dataset. We apply AdaIN (Huang and Belongie, 2017) with a fixed stylization hyperparameter to the training data, replacing the original texture of each image with the texture of a randomly chosen painting from Kaggle’s Painter by Numbers dataset (https://www.kaggle.com/c/painter-by-numbers/). Examples for the stylization of Coco images are given in Figure 4. We provide ready-to-use code for the stylization of arbitrary datasets at https://github.com/bethgelab/stylize-datasets.
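As a rough illustration of the second (‘combined’) setting, the sketch below builds the union of the original and the stylized images with standard PyTorch utilities; the dataset class and folder names are placeholders chosen for illustration. As noted in Appendix A, the paper’s own training instead simply adds the stylized folder to the list of dataset folders in the mmdetection config.

from torch.utils.data import ConcatDataset, DataLoader
from torchvision.datasets import CocoDetection

# the stylized copy reuses the original annotation file, so no new labels are needed
clean = CocoDetection(root="coco/train2017",
                      annFile="coco/annotations/instances_train2017.json")
stylized = CocoDetection(root="stylized-coco/train2017",   # output of stylize-datasets (placeholder path)
                         annFile="coco/annotations/instances_train2017.json")

combined = ConcatDataset([clean, stylized])   # 'combined' setting: twice the images per epoch
loader = DataLoader(combined, batch_size=2, shuffle=True,
                    collate_fn=lambda batch: tuple(zip(*batch)))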

3 Results

3.1 Image corruptions reduce model performance

In order to assess the effect of image corruptions, we evaluated a set of common object detection models on the three benchmark datasets defined in Section 2. Performance is heavily degraded on corrupted images (see Table 1). While Faster R-CNN retains roughly 60% relative performance (rPC) on the rather simple images in Pascal VOC, the same model suffers a dramatic reduction to 33% rPC on the Cityscapes dataset, which contains many small objects. With some variations, this effect is present in all tested models, and it also holds for instance segmentation tasks (for instance segmentation results please see Appendix Section C.1).

Pascal VOC
clean corrupted relative
model backbone P [AP50] mPC [AP50] rPC [%]
Faster r50 80.5 48.6 60.4

MS Coco
clean corrupted relative
model backbone P [AP] mPC [AP] rPC [%]
Faster r50 36.3 18.2 50.2
Faster r101 38.5 20.9 54.2
Faster x101-32x4d 40.1 22.3 55.5
Faster x101-64x4d 41.3 23.4 56.6
Mask r50 37.3 18.7 50.1
Cascade r50 40.4 20.1 49.7
Cascade Mask r50 41.2 20.7 50.2
RetinaNet r50 35.6 17.8 50.1
HTC x101-64x4d 50.6 32.7 64.7

Cityscapes
clean corrupted relative
model backbone P [AP] mPC [AP] rPC [%]
Faster r50 36.4 12.2 33.4
Mask r50 37.5 11.7 31.1
Table 1: Object detection performance of various models. Backbones indicated with r are ResNet and those indicated with x are ResNeXt. All model names except for RetinaNet and HTC indicate the corresponding model from the R-CNN family. All Coco models were downloaded from the mmdetection model zoo. For all reported quantities: higher is better; square brackets denote the metric.

3.2 Robustness increases with backbone capacity

We found corruption robustness to improve with backbone capacity. Almost all corruptions (except for the blur types) seem to induce a roughly fixed penalty on the encoder that does not depend on baseline performance, as can be seen by comparing two models with different backbones (compare Table 1 and Appendix Figure 13). Therefore, more powerful backbones lead to a relative performance improvement under corruption. This finding is supported by models with deformable convolutions (see Appendix C) and by the current state-of-the-art model Hybrid Task Cascade (Chen et al., 2019a), which not only outperforms the strongest baseline model by 9% AP on clean data but also distances itself on corrupted data by a similar margin, achieving a leading relative performance under corruption (rPC) of 64.7%. However, not all changes lead to similar improvements. Cascade R-CNN, which draws its performance increase from a sophisticated head architecture, does not show a relative performance increase over Faster R-CNN. This indicates that the improved robustness comes primarily from the image encoding: better head architectures cannot extract more information if the primary encoding is sufficiently impaired.

Figure 8: Object detection robustness of Faster R-CNN on corrupted versions of Pascal VOC, MS Coco and Cityscapes (panels a–c). Corruption severity 0 denotes clean data.

3.3 Training on stylized data improves robustness

In order to reduce the strong effect of corruptions on model performance observed above, we tested whether a simple approach (stylizing the training data) leads to a robustness improvement. We evaluate the exact same model (Faster R-CNN) with three different training data schemes (visualized in Figure 4):

  • standard: the unmodified training data of the respective dataset

  • stylized: the training data is replaced entirely by its stylized version

  • combined: the concatenation of standard and stylized training data

The results across our three datasets Pascal-C, Coco-C and Cityscapes-C are visualized in Figure 8. We observe a similar pattern as reported by Geirhos et al. (2019) for object classification on ImageNet: a model trained on stylized data suffers less from corruptions than the model trained only on the original “clean” data, but its performance on clean data is much lower. Combining stylized and clean data seems to achieve the best of both worlds: high performance on clean data as well as strongly improved performance under corruption. From the results in Table 2, it can be seen that both stylized and combined training improve the relative performance under corruption (rPC). Combined training yields the highest absolute performance under corruption (mPC) for all three datasets. This pattern is consistent for each corruption type and severity level with only a handful of exceptions. Detailed results across corruption types are reported in the Appendix (Figure 10, Figure 11 and Figure 12).

Pascal VOC [AP50] MS Coco [AP] Cityscapes [AP]
clean corr. rel. clean corr. rel. clean corr. rel.
train data P mPC rPC [%] P mPC rPC [%] P mPC rPC [%]
standard 80.5 48.6 60.4 36.3 18.2 50.2 36.4 12.2 33.4
stylized 68.0 50.0 73.5 21.5 14.1 65.6 28.5 14.7 51.5
combined 80.4 56.2 69.9 34.6 20.4 58.9 36.3 17.2 47.4
Table 2: Object detection performance of Faster R-CNN trained on standard images, stylized images and the combination of both evaluated on standard test sets (test 2007 for Pascal VOC; val 2017 for MS Coco, val for Cityscapes); higher is better.

3.4 Performance degradation does not simply scale with perturbation size

We investigated whether there is a direct relationship between the impact of a corruption on the pixel values of an image and its impact on model performance. Figure 9 shows the relative performance of Faster R-CNN on the corruptions in Pascal-C as a function of perturbation size, measured as the Root Mean Square Error (RMSE) between clean and corrupted images. It can be seen that there is no such simple relation. For instance, impulse noise alters only a few pixels but has a drastic impact on the performance of the model, while brightness or fog alter all pixel values but have little impact on model performance. However, there does seem to be a relation between the impact of a corruption on model performance and its corruption group (noise, blur, digital or weather). For instance, the digital corruptions have a much lower impact on performance than the blur corruptions.

Figure 9: Relative performance under corruption (rPC) as a function of corruption RMSE evaluated on Pascal VOC. The dots indicate the rPC of Faster R-CNN trained on standard data, the arrows show the performance gained via training on ‘combined’ data. Corruptions are grouped into four corruption types: noise, blur, weather and digital.
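For reference, the perturbation size on the horizontal axis of Figure 9 is the Root Mean Square Error between clean and corrupted images; a minimal per-image sketch is given below (the helper name is ours, and how the values are aggregated over images and severities is an assumption here).

import numpy as np

def corruption_rmse(clean, corrupted):
    # RMSE over all pixels and channels; images as arrays with values in [0, 255]
    clean = np.asarray(clean, dtype=np.float64)
    corrupted = np.asarray(corrupted, dtype=np.float64)
    return float(np.sqrt(np.mean((clean - corrupted) ** 2)))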

4 Discussion

We here showed that object detection and instance segmentation models suffer severe performance impairments on corrupted images, a pattern that has previously been observed in image recognition models (e.g. Geirhos et al., 2018; Hendrycks and Dietterich, 2019). In order to track future progress on this important issue, we proposed the Robust Detection Benchmark containing three easy-to-use benchmark datasets Pascal-C, Coco-C and Cityscapes-C. Apart from providing baselines against which future models and techniques can be compared, we then demonstrated how a simple data augmentation technique (adding a stylized copy of the training data in order to reduce a model’s focus on textural information) leads to strong robustness improvements. On corrupted images, we consistently observe a substantial performance increase with only small losses on clean data. This approach has the benefit that it can be applied to any image dataset, requires no additional labelling or model tuning, and thus comes basically for free. At the same time, our benchmark data show that there is still space for improvement, and it is yet to be determined whether the most promising robustness enhancement techniques will require architectural modifications, data augmentation schemes, modifications to the loss function, or a combination of these.

We encourage readers to expand the benchmark with novel corruption types. In order to achieve robust models, testing against a wide variety of different image corruptions is necessary; there is no such thing as ‘too much’. Since our benchmark is open source, we welcome new corruption types and look forward to your pull requests to https://github.com/bethgelab/imagecorruptions!

We envision our comprehensive benchmark to track future progress towards building robust object detection models that can be reliably deployed “in the wild”, eventually enabling them to cope with unexpected weather changes, corruptions of all kinds, and, if necessary, even the occasional dragonfire.

Author contributions

The initial project idea for improving detection robustness was developed by E.R., R.G. and C.M. The initial idea of benchmarking detection robustness was developed by C.M., B.M., R.G., E.R. & W.B. The overall research focus on robustness was collaboratively developed in the Bethge, Bringmann and Wichmann labs. The Robust Detection Benchmark was jointly designed by C.M., B.M., R.G. & E.R.; including selecting datasets, corruptions, metrics and models. B.M. and E.R. jointly developed the pip-installable package to corrupt arbitrary images. B.M. developed code to stylize arbitrary datasets with input from R.G. and C.M.; C.M. and B.M. developed code to evaluate the robustness of arbitrary object detection models. B.M. prototyped the core experiments; C.M. ran the reported experiments. The results were jointly analysed and visualized by C.M., R.G. and B.M. with input from E.R., M.B. and W.B.; C.M., B.M., R.G. & E.R. worked towards making our work reproducible, i.e. making data, code and benchmark openly accessible and (hopefully) user-friendly. Senior support, funding acquisition and infrastructure were provided by O.B., A.S.E., M.B. and W.B. The illustratory figures were designed by E.R., C.M. and R.G. with input from B.M. and W.B. The paper was jointly written by R.G., C.M., E.R. and B.M. with input from all other authors.

Acknowledgement

We would like to thank Alexander von Bernuth for help with Figure 1; Marissa Weis for help with the Cityscapes dataset; Andreas Geiger for helpful discussions on the topic of autonomous driving in bad weather; Mackenzie Mathis for helpful contributions to the stylization code as well as Eshed Ohn-Bar and Jan Lause for pointing us to important references. R.G. would like to acknowledge Felix Wichmann for senior support, funding acquisition and providing infrastructure. C.M., R.G. and E.R. graciously acknowledge support by the International Max Planck Research School for Intelligent Systems (IMPRS-IS) and the Iron Bank of Braavos. C.M. was supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) via grant EC 479/1-1 to A.S.E.. A.S.E., M.B. and W.B. acknowledge support from the BMBF competence center for machine learning (FKZ 01IS18039A) and the Collaborative Research Center Robust Vision (DFG Projektnummer 276693517 – SFB 1233: Robust Vision). M.B. acknowledges support by the Centre for Integrative Neuroscience Tübingen (EXC 307). O.B. and E.R. have been partially supported by the Deutsche Forschungsgemeinschaft (DFG) in the priority program 1835 “Cooperatively Interacting Automobiles” under grant BR2321/5-1 and BR2321/5-2. A.S.E., M.B. and W.B. were supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior / Interior Business Center (DoI/IBC) contract number D16PC00003.

References

Appendix

A Implementation details

Training

We train all our models with 2 images per GPU, which corresponds to a batch size of 16 on 8 GPUs. On Coco, we resize images so that their short edge is 800 pixels and train for 12 epochs with a starting learning rate of 0.02, which is decreased by a factor of 10 after 8 and 11 epochs. On Pascal VOC, images are resized so that their short edge is 600 pixels. Training is done for 12 epochs with a starting learning rate of 0.01 with a decay step of factor 10 after 9 epochs. For Cityscapes, we stayed as close as possible to the procedure described in [He et al., 2017], rescaling images to a shorter edge size between 800 and 1024 pixels and training for 64 epochs (to match 24k steps at a batch size of 8) with an initial learning rate of 0.02 and a decay step of factor 10 after 48 epochs. For evaluation, only a single scale (1024 pixels) is used. In all our experiments, we employ the linear scaling rule [Goyal et al., 2017] to reduce the learning rate when fewer than 8 GPUs are used for training. Specifically, we used 4 GPUs to train the Coco models and 1 GPU for all other models, resulting in an effective learning rate reduction by a factor of 2 for Coco and 8 for Pascal and Cityscapes. Training with stylized data is done by simply exchanging the dataset folder or adding it to the list of dataset folders to consider. For all further details please refer to the config files in our implementation.
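As a concrete example of the linear scaling rule described above (base learning rates defined for 8 GPUs with 2 images each, i.e. a reference batch size of 16), a small sketch with an illustrative helper name:

def scaled_lr(base_lr, num_gpus, imgs_per_gpu=2, reference_batch_size=16):
    # linear scaling rule [Goyal et al., 2017]: learning rate proportional
    # to the effective batch size
    return base_lr * (num_gpus * imgs_per_gpu) / reference_batch_size

print(scaled_lr(0.02, num_gpus=4))   # Coco on 4 GPUs       -> 0.01    (factor 2 reduction)
print(scaled_lr(0.01, num_gpus=1))   # Pascal VOC on 1 GPU  -> 0.00125 (factor 8 reduction)
print(scaled_lr(0.02, num_gpus=1))   # Cityscapes on 1 GPU  -> 0.0025  (factor 8 reduction)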

B Corrupt arbitrary images

In the original corruption benchmark of ImageNet-C [Hendrycks and Dietterich, 2019], two technical aspects are hard-coded: the image dimensions and the number of channels. To allow for different datasets with different image dimensions, several corruption functions are defined independently of each other, such as make_cifar_c, make_tinyimagenet_c, make_imagenet_c, and make_imagenet_c_inception. Additionally, many corruptions expect quadratic images. We have modified the code to resolve these constraints, and now all corruptions can be applied to non-quadratic images with varying sizes, which is a necessary prerequisite for adapting the corruption benchmark to the Pascal VOC and Coco datasets. For the corruption type ‘frost’, crops from provided images of frost are added to the input images. Since images in Pascal VOC and Coco have arbitrarily large dimensions, we resize the frost images to fit the largest input image dimension if necessary.

The original corruption benchmark also expects RGB images. Our code now allows for grayscale images (there are approximately 2–3% grayscale images in Pascal VOC/MS Coco).

Both motion_blur and snow relied on the motion-blur functionality of ImageMagick, resulting in an external dependency that could not be resolved by standard Python package managers. For convenience, we reimplemented the motion-blur functionality in Python and removed the dependency on non-Python software.

C Additional Results

C.1 Instance Segmentation Results

We evaluated Mask R-CNN and Cascade Mask R-CNN on instance segmentation. The results are very similar to those on the object detection task, with a slightly lower relative performance (about 1%; see Table 3). We also trained Mask R-CNN on the stylized datasets, again finding very similar trends for the instance segmentation task as for the object detection task (Table 4). On the one hand, this is not very surprising as Mask R-CNN and Faster R-CNN are very similar. On the other hand, the contours of objects can change due to the stylization process, which one might expect to lead to poor segmentation performance when training only on stylized images. We do not see such an effect but rather find the instance segmentation performance of Mask R-CNN to mirror the object detection performance of Faster R-CNN when trained on stylized images.

MS Coco
clean corr. rel.
model backbone P [AP] mPC [AP] rPC [%]
Mask r50 34.2 16.8 49.1
Cascade Mask r50 35.7 17.6 49.3
HTC x101-64x4d 43.8 28.1 64.0


Cityscapes
clean corr. rel.
model backbone P [AP] mPC [AP] rPC [%]
Mask r50 32.7 10.0 30.5
Table 3: Instance segmentation performance of various models. Backbones indicated with r are ResNet and those indicated with x are ResNeXt. All model names except for HTC indicate the corresponding model from the R-CNN family. All models were downloaded from the mmdetection model zoo.
MS Coco Cityscapes
clean corr. rel. clean corr. rel.
train data P [AP] mPC [AP] rPC [%] P [AP] mPC [AP] rPC [%]
standard 34.2 16.9 49.4 32.7 10.0 30.5
stylized 20.5 13.2 64.1 23.0 11.3 49.2
combined 32.9 19.0 57.7 32.1 14.9 46.3
Table 4: Instance segmentation performance of Mask R-CNN trained on standard images, stylized images and the combination of both, evaluated on standard test sets (val 2017 for MS Coco, val for Cityscapes).

C.2 Deformable Convolutional Networks

We tested the effect of deformable convolutions [Dai et al., 2017, Zhu et al., 2018] on corruption robustness. Deformable convolutions are a modification of the backbone architecture that replaces some standard convolutions in the last stages of the encoder with convolutions using adaptive filters. It has been shown that deformable convolutions can help on a range of tasks like object detection and instance segmentation. This is the case here too: networks with deformable convolutions not only perform better on clean images but also on corrupted ones, improving relative performance under corruption by 6–7% compared to the baselines with standard backbones (see Tables 5 and 6). The effect appears to be the same as for other backbone modifications such as using deeper architectures (see Section 3 in the main paper).

MS Coco
clean corr. rel.
model backbone P [AP] mPC [AP] rPC [%]
Faster r50-dcn 40.0 22.4 56.1
Faster x101-64x4d-dcn 43.4 26.7 61.6
Mask r50-dcn 41.1 23.3 56.7
Table 5: Object detection performance of models with deformable convolutions [Dai et al., 2017]. Backbones indicated with r are ResNet and those indicated with x are ResNeXt; the suffix dcn signifies deformable convolutions in stages c3–c5. All model names indicate the corresponding model from the R-CNN family. All models were downloaded from the mmdetection model zoo.
MS Coco
clean corr. rel.
model backbone P [AP] mPC [AP] rPC [%]
Mask r50-dcn 37.2 20.7 55.7
Table 6: Instance segmentation performance of Mask R-CNN with deformable convolutions [Dai et al., 2017]. The backbone indicated with r50-dcn is a ResNet-50 with deformable convolutions in stages c3–c5. The model was downloaded from the mmdetection model zoo.

Image rights & attribution

Figure 1: Home Box Office, Inc. (HBO).

Figure 10: Results for each corruption type on Pascal VOC.
Figure 11: Results for each corruption type on MS Coco.
Figure 12: Results for each corruption type on Cityscapes.
Figure 13: Results for each corruption type using different backbones. Faster R-CNN trained on MS Coco with ResNet-50, ResNet-101 and ResNeXt-101-64x4d backbones.