Neural networks have been shown to be sensitive to distribution shifts such as common corruptions (inet_c), adversarial examples (ref_adv) or background changes (terra_inco). When deployed in real-world environments, neural networks often encounter samples that come from a different distribution than the one used to train them. As a result, they perform worse in practical applications than on their test sets. In most cases, when designing a model, we do not have access to the distributions that it will encounter once deployed in real-world applications. Consequently, it is necessary to make neural networks more robust to distribution shifts.
Some methods have been proposed to make neural networks more robust to distribution shifts (ant; adv_prop; noisy_student). To estimate whether these methods are useful in practice, we need to establish benchmarks that measure robustness to distribution shifts. The usual approach is to measure the performance of models on out-of-distribution samples, i.e. samples that come from a different distribution than the one used to get the training samples. The underlying idea is that the better a model performs on an unseen distribution, the more we can expect it to be robust to other unseen distribution shifts.
However, there is no guarantee that the robustness measured using one particular distribution transfers to other distributions: a model robust to colorimetry variations is not necessarily robust to background changes. To address this issue, we generally use several distributions during the testing phase, to have more diverse out-of-distribution samples. We assume that the more a model is robust to a large diversity of distribution shifts, the more this model is likely to generalize to other unseen distributions. For this reason, finding new distributions from which to draw more diverse test samples can be useful to improve robustness estimations.
Various distribution shifts can be obtained by using synthetic corruptions such as Gaussian noise, rotations, contrast loss… The robustness of a model can be estimated by testing its performance on a test set that has been corrupted using various image transformations. In this paper, we make a distinction between synthetic and natural corruptions. Synthetic corruptions correspond to modeled image transformations that are used to corrupt images, such as translations or quantizations. On the other hand, natural corruptions are distribution shifts arising naturally in real-world applications (taori_nat). In this study, we do not consider transformations especially crafted to fool neural networks, such as adversarial attacks (ref_adv).
Constituting a benchmark of naturally corrupted samples is costly. It requires drawing samples from a distribution that is not covered by existing datasets, and labeling the gathered samples. Samples corrupted with synthetic corruptions are cheaper to gather. They can be obtained by corrupting already labeled images. However, we do not really know in which circumstances robustness to synthetic corruptions is predictive of robustness to natural corruptions (taori_nat; many_face_rob). Besides, it has been shown that synthetic corruption benchmarks can be biased, i.e. they give too much importance to the robustness to some kinds of corruptions and omit some others (overlap_score).
To address these limitations, we propose a new methodology to build synthetic corruption benchmarks. More precisely, given an initial set of synthetic corruptions, we split this set into specific categories. These categories are built such that the corruptions belonging to the same category overlap (they are correlated in terms of robustness), while the corruptions belonging to different categories do not. Based on these corruption categories, we identify three parameters to take into account while building a synthetic corruption benchmark: (1) the number of represented corruption categories, (2) the balance among categories, (3) the size of benchmarks. We show that considering these parameters helps to build synthetic corruption benchmarks that make robustness estimations more correlated with robustness to natural corruptions. We apply the proposed methodology to build a new benchmark called ImageNet-Syn2Nat that is used to measure the robustness of ImageNet classifiers.
Robustness Estimations. Several benchmarks have been proposed to estimate robustness of image classifiers to naturally occurring corruptions. For instance, SVSF is a store front classification dataset that reveals the natural corruptions that arise when varying three parameters: camera, year and country (many_face_rob). The SI-SCORE dataset focuses on the robustness to other parameters such as object size, location and orientation (si_score). Robustness to background changes is also a widely studied topic (terra_inco). A lot of robustness benchmarks have been proposed to measure robustness of ImageNet classifiers. ImageNet-A is a challenging benchmark, constructed by selecting images that are misclassified by various ResNet 50 architecture based models (inet_a). ImageNet-V2 (inet_v2) has been built by replicating the ImageNet construction process. Because of some statistical biases in the image selection (bias_inet_v2), a distribution shift is observed between ImageNet and ImageNet-V2. ObjectNet (onet) is a set of images that contains objects that have been randomly rotated or taken with various backgrounds and viewpoints. ImageNet-R contains artistic renditions of ImageNet object classes (many_face_rob). ImageNet-Vid (inet_vid) and ImageNet-P (inet_c) study stability of model predictions on sequences of similar images. ImageNet-D has been recently proposed to provide additional challenging distribution shifts (quickdraw, infograph…) (inet_d).
ImageNet-C is a synthetic corruption benchmark widely used to estimate robustness of ImageNet classifiers (inet_c). It contains fifteen corruptions which can be classified into noises, blurs, weather and digital corruptions. Other synthetic corruption benchmarks have been proposed to evaluate robustness of neural networks in various computer vision tasks such as face recognition (face_rec_noise), object detection (cc_object_detection), image segmentation (segmentation_rob), saliency region detection (cc_gaze), traffic sign recognition (cure_tsr) and scene classification (scene_classif).
Corruption Overlappings. Two synthetic corruptions overlap when they are correlated in terms of robustness. For instance, it has been demonstrated that corruptions that damage high frequencies in images (noises, blurs…) overlap (fourier; laugros19). It has been shown that a benchmark should not contain a pair of corruptions $c_1$, $c_2$ such that $c_1$ overlaps much more with the other corruptions of the benchmark than $c_2$ (overlap_score). Otherwise, the considered benchmark gives too much importance to the robustness towards some kinds of corruptions compared to others. The overlapping score metric (overlap_score) has been recently proposed to measure to what extent two corruptions $c_1$ and $c_2$ overlap:
$$O_{c_1,c_2} = \max\left\{0,\ \frac{1}{2}\left(\frac{R^{m_1}_{c_2} - R^{standard}_{c_2}}{R^{m_2}_{c_2} - R^{standard}_{c_2}} + \frac{R^{m_2}_{c_1} - R^{standard}_{c_1}}{R^{m_1}_{c_1} - R^{standard}_{c_1}}\right)\right\}$$
$standard$, $m_1$ and $m_2$ are models with the same architecture. $m_1$ and $m_2$ have been respectively trained with data augmentation on $c_1$ and $c_2$; $standard$ is only trained on clean samples. $R^{m}_{c}$ is the ratio between the accuracy of $m$ on samples corrupted with $c$ and the accuracy of $m$ on not-corrupted samples. The idea behind the overlapping score is that the more a data augmentation with $c_1$ makes a model robust to $c_2$ and conversely, the more we can suppose that $c_1$ and $c_2$ are correlated in terms of robustness. The overlapping range value is $[0,1]$. The higher this score is, the more the considered corruptions overlap.
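As an illustration, the overlapping score can be sketched in a few lines of Python. This is a minimal sketch, assuming the score is the mean of two normalized robustness gains clipped at zero, as defined in (overlap_score). The function and argument names are hypothetical, and the accuracy ratios are assumed to be already measured.

```python
def robustness_ratio(acc_corrupted: float, acc_clean: float) -> float:
    """Ratio between the accuracy on corrupted samples and the accuracy on clean samples."""
    return acc_corrupted / acc_clean


def overlapping_score(r_m1_c1: float, r_m1_c2: float,
                      r_m2_c1: float, r_m2_c2: float,
                      r_std_c1: float, r_std_c2: float) -> float:
    """Overlapping score between two corruptions c1 and c2 (assumed form).

    m1 / m2 are trained with data augmentation on c1 / c2;
    std is trained on clean samples only.
    r_<model>_<corruption> is the robustness ratio of that model for that corruption.
    """
    # Share of the robustness to c2 brought by training on c2
    # that is also obtained when training on c1 instead.
    gain_c2 = (r_m1_c2 - r_std_c2) / (r_m2_c2 - r_std_c2)
    # Symmetric term: robustness to c1 obtained when training on c2.
    gain_c1 = (r_m2_c1 - r_std_c1) / (r_m1_c1 - r_std_c1)
    # Negative transfers are clipped to zero.
    return max(0.0, 0.5 * (gain_c1 + gain_c2))


# Toy ratios (made up for illustration): each augmentation recovers half of
# the robustness gain that the dedicated augmentation provides.
score = overlapping_score(r_m1_c1=0.90, r_m1_c2=0.70,
                          r_m2_c1=0.65, r_m2_c2=0.90,
                          r_std_c1=0.40, r_std_c2=0.50)
```

With these toy ratios, both normalized gains are about one half, so the score is close to 0.5.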
Experimental Set-up. All overlapping scores computed in this paper are obtained using the ImageNet-100 dataset (a subset of ImageNet that contains every tenth ImageNet class by WordNetID order (wordnet_id)), the ResNet-18 architecture, and exactly the same training hyperparameters as the ones used in the paper introducing the overlapping score (overlap_score).
In all experiments, we measure the robustness of an image classifier $m$ to a distribution $d$ by computing the residual robustness: $A_{d} - A_{iid}$. $A_{iid}$ and $A_{d}$ are the accuracies of $m$ respectively computed with i.i.d. samples (independent and identically distributed samples with regard to the training set of $m$) and samples drawn from $d$. Other robustness metrics could have been used (taori_nat; inet_c), but we choose the residual robustness because it is how robustness is generally considered in industrial applications: it is the accuracy drop caused by a distribution shift. We note that comparing the residual robustness of two models to a distribution shift requires checking that the accuracies of the two models on i.i.d. samples are comparable. This condition is verified in all our experiments. In this paper, the robustness of a model $m$ to a synthetic corruption benchmark $bench$ refers to the mean of the residual robustnesses of $m$ computed with the corruptions of $bench$.
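The two quantities defined above can be sketched as follows. This is a minimal sketch with hypothetical function names; it assumes the residual robustness is the signed accuracy difference (shifted accuracy minus i.i.d. accuracy), so that a more negative value means a larger accuracy drop.

```python
from statistics import mean


def residual_robustness(acc_shifted: float, acc_iid: float) -> float:
    """Signed accuracy change caused by a distribution shift (sign convention assumed)."""
    return acc_shifted - acc_iid


def benchmark_robustness(acc_per_corruption: dict, acc_iid: float) -> float:
    """Robustness of one model to a synthetic corruption benchmark.

    acc_per_corruption maps each corruption name of the benchmark to the
    accuracy of the model on the test set corrupted with it.
    """
    return mean(residual_robustness(acc, acc_iid)
                for acc in acc_per_corruption.values())


# Toy accuracies, made up for illustration.
robustness = benchmark_robustness({"blur": 0.60, "noise": 0.50}, acc_iid=0.80)
```

Here the model loses 0.20 of accuracy under blur and 0.30 under noise, so its robustness to this two-corruption benchmark is about -0.25.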
Some experiments in this paper require selecting models that have been shown to be robust to some out-of-distribution shifts. The selected models are listed in Table 1.
|FastAutoAugment (fastautoaugment); Worst-of-10 spatial data augmentation using the following transformation space: pixels degrees (spatial_rob)|
|ResNet-50||ANT3x3 (ant); SIN Augmentation (stylized_imagenet); Augmix (augmix); DeepAugment (many_face_rob); MoPro (mopro); RSC (rsc); Adversarial Training: (pgd_aug)|
|EfficientNet-0||Noisy Student Training (noisy_student); AdvProp (adv_prop)|
|ResNeXt-101-32x16d||Weakly Supervised Pretraining (wsl); Semi-Supervised Pretraining (ssl)|
3 Corruption Categories
There are a lot of possible synthetic corruptions that can be included in a benchmark. Constructing a corruption benchmark requires picking some corruptions among all possible candidates. Here, we consider a list of 40 candidate corruptions whose names can be seen in the abscissa of Figure 2 and that are illustrated in Figure 1. Another list of corruptions could have been selected, but most of the corruptions that are usually included in existing benchmarks (cure_tsr; face_rec_noise; inet_c) can be found in these candidates: blurs, noises, contrast loss… The 40 corruptions are implemented using the albumentations library (albumentations); the function parameters used to model these corruptions can be found at [Link available upon acceptance].
The 40 considered corruptions form a heterogeneous set. It is difficult to determine the number and the kinds of corruptions to be included in a robustness benchmark a priori. In this paper, we propose a method to select groups of corruptions that make robustness estimations more correlated with robustness to natural corruptions.
The first step of our method is to compute the overlapping scores between candidate corruptions: here we use the corruptions displayed in Figure 1. Each corruption $c$ is then associated with a vector that contains all the overlapping scores computed using $c$ and any other corruption. The second step is to split the candidate corruptions into categories, such that the overlapping score vectors of the corruptions belonging to the same category are correlated, while the overlapping score vectors of the corruptions belonging to different categories are not. To achieve this, we cluster our candidates using their associated 40-dimensional vectors of overlapping scores. We use the K-means algorithm, progressively increasing the number of centroids $K$. We note that increasing $K$ raises on average the correlations between the overlapping vectors of the Same Category Corruptions (SCC), which is consistent with our goal; but it also raises the correlations between the vectors of Different Category Corruptions (DCC), which is not desired. We choose to stop increasing $K$ at $K=6$, when the mean of all the Pearson correlation coefficients computed using SCC overlapping vectors becomes higher than 0.5. The obtained categories can be seen in Figure 2.
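The clustering step can be sketched in pure Python. This is a minimal sketch under stated assumptions: the K-means initialisation (deterministic farthest-point seeding) and the starting value of two centroids are not specified in the text and are chosen here for reproducibility; all function names are hypothetical. The toy vectors are 4-dimensional, whereas the method uses 40-dimensional vectors of overlapping scores.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    norm = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return cov / norm if norm else 0.0


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(vectors, k, iters=50):
    """Lloyd's K-means with farthest-point initialisation (an assumption)."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        # Next centroid: the vector farthest from its nearest existing centroid.
        far = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(list(far))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: sq_dist(v, centroids[j]))
                      for v in vectors]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def mean_scc_correlation(vectors, labels):
    """Mean Pearson correlation over all pairs of same-category corruptions."""
    pairs = [(i, j) for i, j in combinations(range(len(vectors)), 2)
             if labels[i] == labels[j]]
    if not pairs:
        return 1.0  # only singleton categories: no pair left to compare
    return sum(pearson(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)


def choose_categories(vectors, threshold=0.5):
    """Increase the number of centroids until the mean SCC correlation exceeds the threshold."""
    for k in range(2, len(vectors) + 1):
        labels = kmeans(vectors, k)
        if mean_scc_correlation(vectors, labels) > threshold:
            return k, labels
    return len(vectors), list(range(len(vectors)))


# Toy overlapping-score vectors (made up): two obvious groups of corruptions.
toy_vectors = [[0.9, 0.8, 0.1, 0.0], [0.8, 0.9, 0.0, 0.1],
               [0.1, 0.0, 0.9, 0.8], [0.0, 0.1, 0.8, 0.9]]
n_categories, categories = choose_categories(toy_vectors)
```

On this toy input the loop stops at two categories, since the vectors inside each group are already strongly correlated.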
We observe that these categories do not all contain the same number of corruptions. We also notice that SCC can be associated with a human visual perception interpretation for most of the categories. Indeed, five of the six categories in Figure 2 could be called spatial transformations, blurs, lighting condition variations, fine-grain artifacts and color distortions. Note that the SCC of the remaining category overlap far less than the ones of other categories because it contains more heterogeneous corruptions. Corruptions of this category would likely have been distributed between more refined categories by using additional corruptions in our initial set of candidates.
Empirical Evaluation. We conduct an additional experiment to verify the relevance of the built corruption categories. For each corruption $c$ among the candidate corruptions displayed in Figure 1, we compute the residual robustness of the fifteen robust models displayed in Table 1 with the ImageNet validation set corrupted with $c$. Each candidate corruption $c$ is now associated with a vector that contains the fifteen robustness scores computed using $c$. For each possible pair of candidate corruptions, we compute the Pearson correlation coefficient using the two robustness score vectors associated with the corruptions of the considered pair. The mean correlation obtained using SCC is 0.68, while the one obtained using DCC is 0.10. This experiment confirms the relevance of the built corruption categories: SCC are in practice correlated in terms of robustness while DCC are not.
4 Synthetic Corruption Selection Criteria
We first define some terms used in this paper. The size of a benchmark is the number of corruptions this benchmark contains. Each time a benchmark $bench$ contains a corruption $c$ that belongs to the corruption category $cat$, we say that $cat$ is represented in $bench$, and $c$ is called a representative of $cat$ in $bench$. In this section, we identify three parameters of synthetic corruption benchmarks that influence the way robustness to these benchmarks is correlated with robustness to natural corruptions. These parameters are: (1) the number of corruption categories represented, (2) the balance among categories, (3) the size of benchmarks. We conduct an ablation study in each of the three following sections to demonstrate the importance of each parameter.
4.1 Number of Corruption Categories Represented in Benchmarks
Each corruption category displayed in Figure 2 contains image transformations that modify different attributes in images. For instance, the color distortion category contains essentially corruptions that modify colorimetry, while the lighting condition variation category contains corruptions that modify contrast and brightness. As a consequence, the features modified in one category are mostly different from the ones modified in the other categories. We therefore assume that the more corruption categories are represented in a benchmark, the more this benchmark takes into account a large diversity of attribute modifications. Distribution shifts due to natural corruptions generally change a lot of attributes at the same time: background, resolution, viewpoint… Intuitively, the larger the number of represented categories in a benchmark is, the more this benchmark is likely to make robustness estimations predictive of robustness to natural corruptions. To verify this intuition, we propose to use Algorithm 1 to build several benchmarks that have various numbers of represented categories.
Using the corruption categories displayed in Figure 2, we run Algorithm 1 for several couples $(n_{cat}, n_{rep})$, where $n_{cat}$ is the number of represented categories and $n_{rep}$ the number of representatives per category: (2,3), (3,2), (6,1), (4,3), (6,2). We repeat this process until we obtain a group of 1000 different benchmarks for each of the considered couples. We note that a benchmark generated using (4,3) contains 3 representatives of 4 out of 6 categories. We want to measure whether increasing the number of represented categories makes robustness estimations of benchmarks more correlated with robustness to natural corruptions. To verify this, we propose Algorithm 2, which measures to what extent the robustness estimations made by a group of synthetic corruption benchmarks are on average correlated with the robustness to one natural corruption benchmark. For each of the benchmark groups generated using Algorithm 1, we run Algorithm 2 for each of the following natural corruption benchmarks: ImageNet-A (inet_a), ImageNet-R (many_face_rob), ImageNet-V2 (inet_v2) and ObjectNet (onet). The group of neural networks used to run Algorithm 2 is the set of robust models presented in Table 1. The obtained results are displayed in Table 2.
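The benchmark generation step can be sketched as follows. This is only a plausible reading of Algorithm 1, which is not reproduced in the text: it assumes that categories and their representatives are drawn uniformly at random, and all names (`generate_benchmark`, `generate_group`, the toy categories) are hypothetical.

```python
import random


def generate_benchmark(categories, n_cat, n_rep, rng):
    """Draw n_rep corruptions from each of n_cat randomly chosen categories.

    categories maps a category name to the list of its corruptions.
    """
    # Only categories with enough corruptions can provide n_rep representatives.
    eligible = [name for name, members in categories.items() if len(members) >= n_rep]
    chosen = rng.sample(eligible, n_cat)
    return frozenset(c for name in chosen for c in rng.sample(categories[name], n_rep))


def generate_group(categories, n_cat, n_rep, group_size, seed=0):
    """Repeat the draw until group_size different benchmarks are obtained.

    group_size must not exceed the number of distinct benchmarks that exist
    for the given couple, otherwise the loop never ends.
    """
    rng = random.Random(seed)
    group = set()
    while len(group) < group_size:
        group.add(generate_benchmark(categories, n_cat, n_rep, rng))
    return group


# Toy categories, made up for illustration (not the ones of Figure 2).
toy_categories = {"A": ["a1", "a2", "a3"],
                  "B": ["b1", "b2", "b3"],
                  "C": ["c1", "c2", "c3"]}
toy_group = generate_group(toy_categories, n_cat=2, n_rep=2, group_size=10)
```

Because every represented category receives exactly the same number of representatives, the benchmarks produced this way are balanced by construction.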
The higher an obtained score is, the more the robustness estimations made by the synthetic corruption benchmarks of the considered group are, on average, correlated with the robustness to the natural corruption benchmark used to compute the score. To isolate the effect of the number of represented categories, we only compare the scores of Table 2 obtained using benchmarks that have the same size. So, we first compare the benchmarks of 6 corruptions, generated using the couples (2,3), (3,2) and (6,1). We see that the obtained scores increase with the number of represented categories (the first element of each couple) for all the tested natural corruption benchmarks. Similarly, for the benchmarks of 12 corruptions generated using the couples (4,3) and (6,2), the obtained scores are higher with 6 represented categories than with 4. This experiment confirms the idea that increasing the number of categories represented in synthetic corruption benchmarks makes robustness to these benchmarks more predictive of robustness to natural corruptions.
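The score computed for one group of synthetic benchmarks and one natural benchmark can be sketched as below. This is a hedged reading of Algorithm 2, which is not reproduced in the text: it assumes the Pearson coefficient is the correlation measure and that the per-benchmark correlations are simply averaged. Function and variable names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    norm = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return cov / norm if norm else 0.0


def mean_correlation(group, synthetic_scores, natural_scores):
    """Average correlation between synthetic and natural robustness estimations.

    group: iterable of benchmarks, each a collection of corruption names.
    synthetic_scores[c]: residual robustness of every model to corruption c
                         (one value per model, always in the same model order).
    natural_scores: residual robustness of every model to the natural benchmark.
    """
    correlations = []
    for bench in group:
        # Robustness of each model to the benchmark: mean over its corruptions.
        per_model = [sum(values) / len(values)
                     for values in zip(*(synthetic_scores[c] for c in bench))]
        correlations.append(pearson(per_model, natural_scores))
    return sum(correlations) / len(correlations)


# Toy scores for three models, made up for illustration.
toy_synthetic = {"c1": [-0.1, -0.2, -0.3], "c2": [-0.3, -0.2, -0.1]}
toy_natural = [-0.2, -0.4, -0.6]
toy_score = mean_correlation([("c1",), ("c2",)], toy_synthetic, toy_natural)
```

In this toy case one benchmark ranks the models exactly like the natural benchmark and the other ranks them in reverse, so the two correlations cancel out and the averaged score is close to zero.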
4.2 Balance Among Categories
We consider that the balance among the categories represented in a benchmark is preserved when all represented categories of this benchmark have the same number of representatives. For instance, the balance among categories of a benchmark $bench$ that contains the Gaussian noise, iso-noise, multiplicative noise and color-jitter corruptions is not preserved: $bench$ contains three representatives of the fine-grain artifact category and one representative of the color distortion category (see Figure 2). Obviously, $bench$ is biased towards texture damaging robustness rather than colorimetry variation robustness. Intuitively, the robustness to a benchmark biased towards a few kinds of feature modifications is not likely to be predictive of robustness to natural corruptions that change a large diversity of features in images. Consequently, preserving the balance among categories should help to build benchmarks that make robustness estimations more correlated with robustness to natural corruptions.
We conduct an experiment to verify this intuition. We denote by $\sigma$ the standard deviation computed using all the numbers of representatives of the categories represented in a benchmark. The $\sigma$ of benchmarks generated using Algorithm 1 are null: their balance among categories is preserved. We propose to get new benchmarks that have higher $\sigma$ by using the substitution operation. A substitution randomly removes a corruption $c_{old}$ from a benchmark $bench$ and adds to it a corruption $c_{new}$ randomly selected in the set of candidates. However, $c_{old}$ and $c_{new}$ are selected such that three conditions are respected: (1) the represented categories of $bench$ do not change, (2) the $\sigma$ of $bench$ strictly increases, (3) $c_{new}$ is not already in $bench$.
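The balance measure and the substitution operation can be sketched as follows. This is a minimal sketch with hypothetical names. It assumes a population standard deviation, and it picks uniformly among all swaps that satisfy the three conditions, which is one possible way to implement the random selection described above.

```python
import random
from collections import Counter
from statistics import pstdev


def balance_sigma(bench, category_of):
    """Standard deviation of the numbers of representatives of the represented categories."""
    counts = Counter(category_of[c] for c in bench)
    return pstdev(list(counts.values()))


def substitute(bench, candidates, category_of, rng):
    """Swap one corruption of bench so that its sigma strictly increases.

    Returns the new benchmark, or None when no valid swap exists.
    """
    bench = set(bench)
    represented = {category_of[c] for c in bench}
    sigma = balance_sigma(bench, category_of)
    valid = []
    for old in sorted(bench):
        for new in sorted(candidates):
            if new in bench:
                continue  # condition (3): the added corruption is not already present
            swapped = (bench - {old}) | {new}
            if {category_of[c] for c in swapped} != represented:
                continue  # condition (1): represented categories do not change
            if balance_sigma(swapped, category_of) > sigma:
                valid.append(swapped)  # condition (2): sigma strictly increases
    return rng.choice(valid) if valid else None


# Toy candidates, made up for illustration: two categories of four corruptions.
toy_category_of = {f"{cat.lower()}{i}": cat for cat in "AB" for i in range(1, 5)}
balanced = {"a1", "a2", "b1", "b2"}
unbalanced = substitute(balanced, list(toy_category_of), toy_category_of, random.Random(0))
```

Starting from two representatives in each of the two toy categories, any valid substitution leads to a three-versus-one split, which raises the standard deviation from 0 to 1.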
We consider $S$: a set of 1000 benchmarks that have been generated using Algorithm 1 with a single couple of parameters. We get 5000 new benchmarks by substituting from 1 to 5 corruptions of each benchmark in $S$. The obtained benchmarks have various $\sigma$: (0.6, 0.8, 1.0, 1.2, 1.4, 1.5, 1.8, 2.2). We group together all the benchmarks with the same $\sigma$. For each of these groups, we run Algorithm 2 for each of the following natural corruption benchmarks: ImageNet-A, ImageNet-R, ImageNet-V2 and ObjectNet. The group of neural networks used to run Algorithm 2 is the set of robust models presented in Table 1. The obtained results are displayed in Table 3. We observe that the computed scores decrease as the $\sigma$ of corruption benchmarks increases for all the considered natural corruption benchmarks. We repeat the experiment carried out in this section, using different benchmarks generated with Algorithm 1 with two other couples of parameters. For both couples, the measured mean correlations also diminish as the $\sigma$ of benchmarks increases. These experiments confirm that benchmarks for which the balance among represented categories is preserved make robustness estimations more correlated with robustness to natural corruptions.
4.3 Corruption Benchmark Size
In Table 2, we observe that, when the number of represented categories is fixed, the scores obtained with the generated benchmarks increase with the number of representatives per category. We notice that raising the number of representatives per category for a fixed number of represented categories when using Algorithm 1 is equivalent to increasing the size of generated benchmarks while conserving the balance among categories and the number of represented categories. It therefore appears that increasing the size of synthetic corruption benchmarks also helps to make robustness estimations more correlated with robustness to natural corruptions. To explain this, we see in Figure 2 that SCC are not completely correlated in terms of robustness. In other words, the representatives of the same category do not make exactly the same feature modifications. So, having more representatives in each category makes benchmarks measure the robustness to a larger range of feature modifications. Since natural corruptions modify a wide diversity of features in images, having more representatives per category (for a fixed number of represented categories) should make robustness to corruption benchmarks more predictive of robustness to natural corruptions, which can explain the obtained results.
Starting from our 40 candidate corruptions, we use the three parameters identified in this section to constitute a benchmark that makes robustness estimations as correlated as possible with robustness to natural corruptions. More precisely, the idea is to build the largest benchmark that represents all the categories displayed in Figure 2 and has its balance among categories preserved. To respect these conditions, we consider all the categories, and we select the same number of representatives per category, with the largest possible number of representatives. Three representatives are selected per category because the smallest category displayed in Figure 2 contains only three corruptions. For each category, we select the three corruptions that overlap the least with each other because we do not want to include corruptions that are almost equivalent. We group the selected corruptions to form a benchmark called ImageNet-Syn2Nat. The names of its 18 corruptions are underlined in Figure 2.
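The selection of representatives inside one category can be sketched as an exhaustive search. This is a minimal sketch with hypothetical names: the text does not state the exact criterion used to pick the three corruptions that overlap the least, so the sketch assumes the subset with the lowest mean pairwise overlapping score is chosen.

```python
from itertools import combinations


def least_overlapping(category, overlap, n_rep=3):
    """Pick the n_rep corruptions of a category with the lowest mean pairwise overlap.

    category: names of the corruptions of the category (at least n_rep of them).
    overlap: overlapping scores indexed by frozenset({name_1, name_2}).
    n_rep must be at least 2 so that there is a pair to compare.
    """
    def mean_overlap(subset):
        pairs = list(combinations(subset, 2))
        return sum(overlap[frozenset(pair)] for pair in pairs) / len(pairs)

    # Categories are small, so every subset of size n_rep can be enumerated.
    return min(combinations(sorted(category), n_rep), key=mean_overlap)


# Toy overlapping scores inside one category, made up for illustration.
toy_overlap = {
    frozenset({"a", "b"}): 0.9, frozenset({"a", "c"}): 0.1,
    frozenset({"a", "d"}): 0.2, frozenset({"b", "c"}): 0.8,
    frozenset({"b", "d"}): 0.7, frozenset({"c", "d"}): 0.1,
}
representatives = least_overlapping(["a", "b", "c", "d"], toy_overlap)
```

In the toy example the corruption that overlaps strongly with all the others is left out, and the three remaining corruptions are kept as representatives. Applying the same selection to every category and taking the union of the results gives a balanced benchmark that represents all categories.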
Now we want to estimate to what extent the robustness to ImageNet-Syn2Nat is predictive of robustness to natural corruptions. To do so, we compute the robustness of the fifteen models presented in Table 1 to ImageNet-Syn2Nat. Then, we compute the correlations between the robustness scores measured with ImageNet-Syn2Nat and the ones computed with the natural corruption benchmarks. The obtained scores are displayed in Table 4 and compared with the results obtained using another corruption benchmark (ImageNet-C (inet_c)). We observe that the robustness to ImageNet-Syn2Nat is much more correlated with the robustness to ImageNet-A and ImageNet-R than the robustness to ImageNet-C. For ImageNet-V2 and ObjectNet, the scores obtained with the two synthetic corruption benchmarks are relatively close. On average, we see that the robustness to ImageNet-Syn2Nat is more predictive of the robustness to the natural corruption benchmarks than the robustness to ImageNet-C.
We proposed a method to split a set of synthetic corruptions into categories using the overlapping score. We showed that such categories are useful to better understand and address robustness of neural networks. Indeed, using corruption categories, we identified three parameters that are important to consider while building a corruption benchmark: the number of categories represented, the balance among categories and the size. We demonstrated that taking these parameters into account helps to build corruption benchmarks that make robustness estimations more correlated with robustness to natural corruptions. We used these three parameters to build a new benchmark called ImageNet-Syn2Nat, which can be useful to complement existing ImageNet robustness benchmarks.
In future work, by using a larger set of candidate corruptions and by trying other clustering strategies, we would like to build more refined corruption categories than the ones presented in Figure 2. We also want to measure to what extent the robustness to ImageNet-Syn2Nat is predictive of the robustness to other natural corruption benchmarks such as ImageNet-D (inet_d). We plan to adapt the methodology presented in this paper to other computer vision tasks. We hope that this work will help to make robustness estimations that are more predictive of the observed robustness of neural networks in real-world applications.