Robustness properties of Facebook's ResNeXt WSL models
We investigate the robustness properties of ResNeXt image recognition models trained with billion scale weakly-supervised data (ResNeXt WSL models). These models, recently made public by Facebook AI, were trained on 1B images from Instagram and fine-tuned on ImageNet. We show that these models display an unprecedented degree of robustness against common image corruptions and perturbations, as measured by the ImageNet-C and ImageNet-P benchmarks. The largest of the released models, in particular, achieves state-of-the-art results on both ImageNet-C and ImageNet-P by a large margin. The gains on ImageNet-C and ImageNet-P far outpace the gains on ImageNet validation accuracy, suggesting the former as more useful benchmarks to measure further progress in image recognition. Remarkably, the ResNeXt WSL models even achieve a limited degree of adversarial robustness against state-of-the-art white-box attacks (10-step PGD attacks). However, in contrast to adversarially trained models, the robustness of the ResNeXt WSL models rapidly declines with the number of PGD steps, suggesting that these models do not achieve genuine adversarial robustness. Visualization of the learned features also confirms this conclusion. Finally, we show that although the ResNeXt WSL models are more shape-biased in their predictions than comparable ImageNet-trained models, they still remain much more texture-biased than humans.
Facebook AI recently released ResNeXt-class image recognition models trained with 1B images from Instagram using weak supervision and fine-tuned on ImageNet. To our knowledge, these models are the only publicly available models trained with such large scale data. These models can help us address some important and interesting questions about the relationship between training data size and the out-of-sample generalization behavior of image recognition models trained in standard classification tasks. For example: does more training data make the learned representations more robust against common image corruptions and perturbations? Does it make them more robust against adversarial attacks? Does it reduce or even eliminate some of the quirky behavior of ImageNet-trained models, such as their sensitivity to background cues, their heavy reliance on local textural information, and their surprising inability to integrate information more globally across an image (Geirhos et al., 2019)? In this paper, we address these questions. Code and all simulation results are available at: https://github.com/eminorhan/resnext-wsl
Intuitively, we expect that training with more data should in general increase the robustness of a model, because more data constrain the behavior of the model more strongly. But the scaling of different robustness measures with training data size is an open empirical question. We find that the models trained with billion scale data are substantially more robust than ImageNet-trained models on common image corruptions and perturbations, achieving state-of-the-art results on both ImageNet-C and ImageNet-P benchmarks (Hendrycks and Dietterich, 2019) by a large margin. These models even achieve a limited degree of robustness against white-box adversarial attacks. However, it remains relatively easy to generate adversarial examples for them, hence they do not achieve true adversarial robustness. They also retain the strong texture bias of ImageNet-trained models and their accuracy on the recently introduced “natural adversarial examples” (Hendrycks et al., 2019) remains low, suggesting that these issues are unlikely to be solved by simply increasing the training data size.
We consider five different models, all belonging to the ResNeXt family (Xie et al., 2017). The first one was trained on ImageNet; the remaining four models (WSL models) were trained on 1B images from Instagram using weak supervision and then fine-tuned on ImageNet (please see Mahajan et al. (2018) for further details about training). The WSL models can be accessed from: https://pytorch.org/hub/facebookresearch_WSL-Images_resnext/
resnext101_32x8d. This is an ImageNet-trained ResNeXt-101 model with cardinality 32 and a bottleneck width of 8 (please see Xie et al. (2017) for a description of the different architectural dimensions). We use the implementation of this model in torchvision.models (0.3).
resnext101_32x8d_wsl. This model has the same architecture as the previous one, but was trained on Instagram images. The comparison between this model and the previous one is important, because any difference between these models is due to the difference in training data.
resnext101_32x16d_wsl. Instagram-trained ResNeXt-101 model with cardinality 32 and a bottleneck width of 16.
resnext101_32x32d_wsl. Instagram-trained ResNeXt-101 model with cardinality 32 and a bottleneck width of 32.
resnext101_32x48d_wsl. Instagram-trained ResNeXt-101 model with cardinality 32 and a bottleneck width of 48. With 829M parameters, this is the largest WSL model released by Facebook AI.
We measured the robustness of the models against common natural corruptions and perturbations using the recently introduced ImageNet-C and ImageNet-P benchmarks (Hendrycks and Dietterich, 2019). We give a brief description of these benchmarks below and refer the reader to Hendrycks and Dietterich (2019) for further details.
ImageNet-C was designed to measure the robustness of classifiers against common image corruptions and contains 15 different corruption types (Gaussian, shot, and impulse noise; defocus, glass, motion, and zoom blur; snow, frost, fog, and brightness corruptions; contrast, elasticity, pixelation, and JPEG compression) applied to each ImageNet validation image at 5 different severity levels.
The robustness performance on ImageNet-C is measured by the mean corruption error (mCE), which is defined as follows. For each corruption type $c$, the classification error of the model is averaged over different severity levels $s$ and then divided by the average classification error of a reference classifier (AlexNet): i.e. $\mathrm{CE}_c = \sum_{s} E_{s,c} \,/\, \sum_{s} E^{\mathrm{AlexNet}}_{s,c}$. The mean corruption error is then obtained by averaging over the corruption types: $\mathrm{mCE} = \frac{1}{15} \sum_{c} \mathrm{CE}_c$. We also calculate a relative mCE (rel. mCE) score by subtracting the clean classification error of the classifiers from the corruption errors: i.e. $\mathrm{rel.\ CE}_c = \sum_{s} (E_{s,c} - E_{\mathrm{clean}}) \,/\, \sum_{s} (E^{\mathrm{AlexNet}}_{s,c} - E^{\mathrm{AlexNet}}_{\mathrm{clean}})$ and then averaging over different corruption types as before.
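As a rough illustration of the two scores defined above, the following sketch computes them from per-severity error rates. The function names, the dictionary layout, and all numbers are hypothetical and chosen only for illustration; they are not taken from the benchmark or from the released evaluation code.

```python
from statistics import mean


def corruption_error(model_err, alexnet_err):
    """CE_c: model error summed over severities, normalized by AlexNet's."""
    return sum(model_err) / sum(alexnet_err)


def relative_corruption_error(model_err, model_clean, alexnet_err, alexnet_clean):
    """rel. CE_c: same ratio after subtracting each classifier's clean error."""
    numerator = sum(e - model_clean for e in model_err)
    denominator = sum(e - alexnet_clean for e in alexnet_err)
    return numerator / denominator


def mean_corruption_error(model_errs, alexnet_errs):
    """mCE: average of CE_c over corruption types.

    Both arguments map a corruption name to a list of per-severity errors.
    """
    return mean(
        corruption_error(model_errs[c], alexnet_errs[c]) for c in model_errs
    )


# Hypothetical error rates for two corruption types at five severities.
model_errs = {
    "gaussian_noise": [0.2, 0.3, 0.4, 0.5, 0.6],
    "fog": [0.1, 0.2, 0.2, 0.3, 0.2],
}
alexnet_errs = {
    "gaussian_noise": [0.6, 0.8, 0.9, 1.0, 0.7],
    "fog": [0.5, 0.5, 0.5, 0.5, 0.5],
}

mce = mean_corruption_error(model_errs, alexnet_errs)
rel = relative_corruption_error(
    model_errs["gaussian_noise"], 0.1, alexnet_errs["gaussian_noise"], 0.4
)
```

Because both scores are ratios against AlexNet, a value below 1 (or below 100 when expressed as a percentage) means the model degrades less than the reference classifier does.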
ImageNet-P was designed to measure the stability of a model’s predictions as the input image undergoes a continuous sequence of transformations. ImageNet-P contains 10 common perturbation types (Gaussian noise, shot noise, motion blur, zoom blur, brightness, snow, translation, rotation, tilt, and scale perturbations) applied to each ImageNet validation image in a temporal sequence. Each sequence contains more than 30 images.
The robustness performance on ImageNet-P is measured by the mean flip rate (mFR) and the mean top-5 distance (mT5D) metrics. The mean flip rate is calculated by first computing the flip probability of the model’s predictions for consecutive frames for each perturbation $p$, $\mathrm{FP}_p$, normalizing by the AlexNet flip probability for the same perturbation, and then averaging over different perturbations: $\mathrm{mFR} = \frac{1}{10} \sum_{p} \mathrm{FP}_p \,/\, \mathrm{FP}^{\mathrm{AlexNet}}_p$. The flip probability is computed somewhat differently for the noise perturbations, where the consecutive frames are not temporally related. We refer the reader to Hendrycks and Dietterich (2019) for more details. The mean top-5 distance is defined similarly, but instead of the stability of the model’s top-1 prediction for consecutive frames, it measures the stability of the top-5 predictions. We again refer the reader to Hendrycks and Dietterich (2019) for further details.
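The flip-rate computation can be sketched as follows for the temporally ordered perturbations only; the special handling of the noise perturbations mentioned above is not covered. Function names and all numbers are hypothetical.

```python
def flip_probability(sequences):
    """Fraction of consecutive-frame pairs whose top-1 prediction changes.

    `sequences` is a list of per-sequence lists of predicted class labels,
    ordered by frame.
    """
    flips = 0
    pairs = 0
    for preds in sequences:
        for previous, current in zip(preds, preds[1:]):
            flips += previous != current
            pairs += 1
    return flips / pairs


def mean_flip_rate(model_fp, alexnet_fp):
    """mFR: AlexNet-normalized flip probability averaged over perturbations."""
    rates = [model_fp[p] / alexnet_fp[p] for p in model_fp]
    return sum(rates) / len(rates)


# Two hypothetical prediction sequences: one flip in six consecutive pairs.
fp = flip_probability([[1, 1, 2, 2], [3, 3, 3, 3]])

# Hypothetical flip probabilities for two perturbation types.
mfr = mean_flip_rate(
    {"tilt": 0.1, "snow": 0.05},
    {"tilt": 0.2, "snow": 0.2},
)
```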
We considered both black-box and white-box attacks to measure the robustness of the models against adversarial perturbations. Attacks were carried out with the state-of-the-art projected gradient descent (PGD) algorithm using the Foolbox implementation (Rauber et al., 2017).
Black-box attacks. In the black-box setting, we ran attacks against the resnext50_32x4d model in torchvision.models. Note that this model is different from the five models considered in this paper. We set the number of PGD steps to 10 and the step size to . We varied the total perturbation size of the attack , defined as the -norm of the perturbation divided by the -norm of the clean image: , from to . These attacks against the resnext50_32x4d model were highly successful, yielding below top-1 accuracy even for the lowest perturbation size . We then tested the generated adversarial images with the five ResNeXt-101 models considered in this paper.
White-box attacks. In the white-box setting, attacks were run directly against the ResNeXt-101 models. Attack parameters were identical to those described in the previous paragraph. However, since using only a small number of PGD steps can lead to a significant overestimation of the robustness of a model against white-box adversarial attacks (Engstrom et al., 2018), we also ran stronger white-box attacks with up to 50 PGD steps (fixing the total perturbation size to for these attacks).
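To make the structure of a PGD attack with a relative perturbation budget concrete, here is a minimal toy sketch on a linear scorer. It is not the Foolbox implementation used in the experiments. The choice of the maximum (infinity) norm, the step size, the budget, the [0, 1] pixel range, and the toy model are all assumptions made for illustration.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def linf_norm(v):
    """Maximum absolute entry of a vector."""
    return max(abs(a) for a in v)


def relative_size(adv, x):
    """Norm of the perturbation divided by the norm of the clean input."""
    return linf_norm([a - b for a, b in zip(adv, x)]) / linf_norm(x)


def pgd_attack(x, loss_grad, rel_eps, step_size, n_steps):
    """Signed-gradient ascent on the loss, projected after every step."""
    radius = rel_eps * linf_norm(x)  # absolute budget from the relative one
    adv = list(x)
    for _ in range(n_steps):
        grad = loss_grad(adv)
        adv = [a + step_size * sign(g) for a, g in zip(adv, grad)]
        # Project back into the ball of the given radius around x.
        adv = [min(max(a, xi - radius), xi + radius) for a, xi in zip(adv, x)]
        # Keep values in the assumed valid pixel range [0, 1].
        adv = [min(max(a, 0.0), 1.0) for a in adv]
    return adv


# Toy linear model: a higher score means more confidence in the true class.
w = [1.0, -2.0, 0.5]


def score(x):
    return sum(wi * xi for wi, xi in zip(w, x))


def loss_grad(x):
    # The loss is the negative score, so its gradient with respect to x is -w.
    return [-wi for wi in w]


x = [0.5, 0.2, 0.8]
adv = pgd_attack(x, loss_grad, rel_eps=0.1, step_size=0.02, n_steps=10)
```

In this toy case the attack saturates the budget after a few steps, so extra iterations change nothing. For deep networks the loss surface is not linear, which is why the number of PGD steps matters for estimating white-box robustness.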
Engstrom et al. (2019) recently showed that models trained with robust optimization learn fundamentally different features from models trained in the standard way (through minimization of training loss). In particular, they show that the learned features in robust models are much more meaningful and well-aligned with human perception than the learned features in non-robust models. Here, we use this idea to test whether the learned features in ResNeXt WSL models show this signature of robustness. Following Engstrom et al. (2019), we do this by starting from a seed image and finding an image that maximizes the activation of a particular unit in the penultimate layer of the network. Engstrom et al. (2019) show that the resulting “maximizing images” are much more meaningful and much less sensitive to the initial seed image in robust models than in non-robust models. To find these “maximizing images”, we removed the final softmax layer from the network and used the PGD algorithm to maximize different units in the penultimate layer of the network. We used the Foolbox implementation ProjectedGradientDescentAttack with the TargetClassProbability criterion set to a large value () for the corresponding unit. Note that this is slightly different from the way maximizing images were computed in Engstrom et al. (2019).
To test whether the Instagram-trained ResNeXt WSL models share the characteristic texture bias displayed by ImageNet-trained deep neural networks, we used the shape-texture cue conflict stimuli created by Geirhos et al. (2019). These are 1201 images created with a neural style transfer algorithm to look locally like an image from a given category (texture content), but globally like an image from a different category (shape content). Therefore, these images are ideal for testing a model’s relative sensitivity to local texture information vs. global shape information. Geirhos et al. (2019) showed that ImageNet-trained deep neural networks rely much more heavily on local texture information than on global shape information in making their predictions. This was found to be in stark contrast to humans, who are sensitive to both local and global information but rely almost exclusively on global shape in making classification judgments. We used the same evaluation procedure as Geirhos et al. (2019) to measure the shape/texture biases of the models. Briefly, the images were generated from 16 distinct super-categories in ImageNet. In evaluating the predictions of the models and hence measuring their shape/texture biases, only ImageNet classes belonging to these 16 super-categories were considered, the remaining classes being zeroed out. We refer the reader to Geirhos et al. (2019) for further details about the stimulus generation and model evaluation methods.
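A minimal sketch of this evaluation is given below. It assumes that a decision is the super-category of the highest-scoring class among the retained classes, and that shape bias is the fraction of shape decisions among trials where the model chose either the shape or the texture category (the definition we understand Geirhos et al. (2019) to use). The class-to-category mapping, category names, and probabilities are hypothetical.

```python
def predict_category(probs, class_to_category):
    """Pick the best class among those mapped to a super-category.

    Classes absent from `class_to_category` are ignored, which has the same
    effect as zeroing out their probabilities.
    """
    best_class = max(class_to_category, key=lambda i: probs[i])
    return class_to_category[best_class]


def shape_bias(decisions):
    """Shape decisions divided by shape-plus-texture decisions.

    `decisions` holds (predicted, shape_label, texture_label) triples; trials
    where the prediction matches neither label are excluded from the ratio.
    """
    shape = sum(pred == s for pred, s, _ in decisions)
    texture = sum(pred == t for pred, _, t in decisions)
    return shape / (shape + texture)


# Hypothetical 6-class model; classes 4 and 5 lie outside all super-categories.
class_to_category = {0: "cat", 1: "cat", 2: "elephant", 3: "car"}
probs = [0.05, 0.10, 0.30, 0.05, 0.40, 0.10]
category = predict_category(probs, class_to_category)

# Hypothetical cue-conflict trials: cat shape rendered with elephant texture.
decisions = [
    ("elephant", "cat", "elephant"),
    ("cat", "cat", "elephant"),
    ("car", "cat", "elephant"),
    ("elephant", "cat", "elephant"),
]
bias = shape_bias(decisions)
```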
Finally, we measured the performance of the ResNeXt WSL models on the recently introduced ImageNet-A dataset (Hendrycks et al., 2019). This curated dataset consists of 7500 natural, unmodified ImageNet-like images for which a standard ImageNet-trained classifier yields incorrect predictions with low confidence in the correct class (less than 15%). These “natural adversarial examples” were also selected to display a diverse range of confusions between different classes. Hendrycks et al. (2019) argue that misclassifications on the dataset result from a diverse range of underlying causes, such as over-reliance on texture, color, or background cues, sensitivity to image distortions or perturbations, tendency to over-generalize etc.
The images in ImageNet-A belong to a subset of 200 classes among the 1000 ImageNet-1K classes. Accuracies are measured on this 200-class subset only (outputs corresponding to the remaining classes being effectively zeroed out). Hendrycks et al. (2019) also introduce two uncertainty metrics to quantify the confidence miscalibration of models: the RMS calibration error (RMS-CE) and the area under the response rate accuracy curve (AURRA). We refer the reader to Hendrycks et al. (2019) for a detailed description of how these metrics are calculated.
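Restricting predictions to a class subset can be sketched as taking the argmax over the retained class indices only. The function name, the logits, and the tiny 4-class setup below are hypothetical stand-ins for the 1000-class outputs and the 200-class ImageNet-A subset.

```python
def subset_top1_accuracy(logits_batch, labels, subset):
    """Top-1 accuracy when only the classes in `subset` may be predicted."""
    correct = 0
    for logits, label in zip(logits_batch, labels):
        # Classes outside the subset never compete for the argmax.
        prediction = max(subset, key=lambda i: logits[i])
        correct += prediction == label
    return correct / len(labels)


# Hypothetical 4-class outputs where only classes 2 and 3 are in the subset.
logits_batch = [
    [5.0, 1.0, 2.0, 0.5],
    [0.1, 3.0, 0.2, 2.5],
    [0.0, 0.0, 4.0, 1.0],
]
labels = [2, 3, 3]
accuracy = subset_top1_accuracy(logits_batch, labels, subset=[2, 3])
```

In the first two examples the unrestricted argmax falls on an excluded class, so the restriction changes the prediction; this is why accuracies on the subset are not directly comparable to unrestricted 1000-way accuracies.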
ImageNet-C and ImageNet-P robustness scores are reported in Table 1. The ResNeXt WSL models outperform the ImageNet-trained resnext101_32x8d model on all metrics. The largest WSL model resnext101_32x48d_wsl, in particular, achieves state-of-the-art results on all metrics by a large margin. The robustness gains achieved by the WSL models over the ImageNet-trained resnext101_32x8d model are significantly larger than the gains achieved on ImageNet validation accuracy (ImageNet validation accuracies are reported in Table 2 under the “Clean” column). This suggests that robustness on ImageNet-C and ImageNet-P may be a more meaningful metric than ImageNet validation accuracy in evaluating future improvements in image recognition models.
|Patch Gaussian (ResNet-200) (Lopes et al., 2019)||60.4||75.7||–||–|
|resnext101_64x4d (Hendrycks and Dietterich, 2019)||62.2||80.1||65.9||43.2|
The robustness of the models against black-box and white-box adversarial attacks is shown in Table 2 and in Figure 1. The ResNeXt WSL models achieve significantly better black-box adversarial accuracy compared to the ImageNet-trained resnext101_32x8d model. Even more impressively, however, the WSL models also achieve a significant amount of robustness against 10-step white-box PGD attacks. Note that a 10-step PGD attack is strong enough to yield close to 0% accuracy on the ImageNet-trained resnext101_32x8d model for a standard perturbation size of . By comparison, the best WSL model (resnext101_32x16d_wsl) yields an accuracy of 40.7% in the same condition. In fact, this level of robustness is better than that achieved by some previous adversarial training methods (e.g. ALP, see Table 2). This is surprising given that the WSL models were not explicitly trained to be adversarially robust and suggests that simply training models with more data can automatically improve adversarial robustness.
Gilmer et al. (2019) recently argued that adversarial vulnerability and sensitivity to more common image corruptions and perturbations are two sides of the same underlying phenomenon, namely sensitivity to perturbations in general. According to their perspective, adversarial non-robustness simply arises as the worst-case manifestation of this general sensitivity to perturbations, whereas sensitivity to more common image corruptions and perturbations is the average-case manifestation of the same. This implies that robustness gains in one should, in general, accompany robustness gains in the other measure. Given the large gains in robustness to common image corruptions and perturbations and the concomitant gains in adversarial robustness observed in the WSL models, our results are consistent with this prediction of Gilmer et al. (2019).
The adversarial robustness of the WSL models, however, declined rapidly when we increased the number of PGD iterations up to 50 steps, fixing the total perturbation size to (Figure 1c). This is in contrast to the robustness achieved by a state-of-the-art adversarially-trained model (feature denoising with a ResNet-152 backbone, shown in green in Figure 1c), which remains much more stable as the number of PGD iterations is increased. This result suggests that the ResNeXt WSL models do not achieve true adversarial robustness.
| Model | Clean | Black-box | White-box (10-step) | White-box (50-step) |
|---|---|---|---|---|
| ALP (InceptionV3) (Kannan et al., 2018) | 72 | – | 27.9 | – |
| Denoising (ResNet-152) (Xie et al., 2018) | 65.3 | – | 55.7 | 47.9 |
Maximizing images for the ImageNet-trained resnext101_32x8d and the Instagram-trained resnext101_32x48d_wsl models are shown in Figures 2 and 3, respectively, together with the seed images used in optimization. Both models produce qualitatively similar maximizing images. The maximizing images essentially look like adversarial examples. Perceptually, the maximizing images for different units are almost identical to each other and to the seed image. Engstrom et al. (2019) recently showed that these properties are signatures of adversarially non-robust models (robust models yield perceptually meaningful and heterogeneous maximizing images for different units and the maximizing images are much less dependent on the seed image). This result supports our conclusion from the previous subsection that the ResNeXt WSL models do not achieve genuine adversarial robustness.
The shape biases of different models are reported in Table 3. Although billion scale training with Instagram images increases the shape biases of the WSL models compared to ImageNet-trained ResNet and ResNeXt models, the resulting models are still far more texture-biased than humans. Some example shape-texture cue-conflict stimuli are shown in Figure 4, together with the top 5 predictions of the resnext101_32x48d_wsl model.
This result is expected if the statistical regularities enabling high classification performance in Instagram are similar to those observed in ImageNet and standard image recognition models have an inductive bias for exploiting local textural regularities over more global shape-based regularities (Brendel and Bethge, 2019).
| Model | Shape bias (%) |
|---|---|
| ResNet-50 (Geirhos et al., 2019) | 22.1 |
| Shape-ResNet-50 (Geirhos et al., 2019) | 81 |
| Humans (Geirhos et al., 2019) | 95.9 |
Table 4 shows the top-1 accuracies and confidence miscalibration scores of the models on ImageNet-A. The ImageNet-trained resnext101_32x8d model achieves a top-1 accuracy of 2.3%, demonstrating the difficulty of this benchmark for standard ImageNet-trained models. Note that this is in sharp contrast to the performance of ImageNet-trained models on the ImageNetV2 dataset (Recht et al., 2019), where despite a significant 11–14% absolute drop in accuracy, the models remain relatively high-performing. The difference between ImageNetV2 and ImageNet-A is that the images in ImageNet-A were explicitly chosen to be hard for an ImageNet-trained classifier (thus the name “natural adversarial examples”), whereas the images in ImageNetV2 were selected in a way that matched as closely as possible the way the original ImageNet validation set was selected.
The Instagram-trained WSL models achieve better calibration scores and accuracies on ImageNet-A: in particular, the largest WSL model achieves a top-1 accuracy of 16.6%; however, the accuracies overall remain very low, suggesting that billion scale training with Instagram images is not able to address the underlying issues causing low classification accuracy on ImageNet-A. Again, just as in the persistent texture bias of the Instagram-trained WSL models, this result is also not too surprising if the statistical regularities enabling high classification performance in Instagram are similar to those in ImageNet, and suggests that these issues will not be feasibly solved by simply training the same models with even more data of the same kind.
Our results paint a mixed picture regarding the robustness properties of the ResNeXt WSL models trained with billion scale weakly-supervised data. On the one hand, these models achieve a remarkable degree of robustness against common image corruptions and perturbations and even a limited degree of adversarial robustness despite not having been explicitly trained for adversarial robustness, demonstrating yet another example of the “unreasonable effectiveness of data” (Halevy et al., 2009). On the other hand, they do not achieve genuine adversarial robustness and they retain some of the quirky behavior of ImageNet-trained models, such as their over-reliance on local texture and background cues, and their seeming inefficiency in integrating information more globally across an image.
We find it unlikely that simply scaling up the standard object classification tasks and models to even more data will be sufficient to feasibly achieve genuinely human-like, general-purpose visual representations: adversarially robust, more shape-based and, in general, better able to handle out-of-sample generalization. It remains a big open question what kinds of tasks and model biases can enable the learning of such robust, general-purpose visual representations. As we continue to deploy machine learning models in more and more challenging, open-ended domains, the need for such robust, general-purpose visual representations will likely increase as well. In the meantime, we can be duly impressed by the performance of current generation large scale vision models trained with large amounts of data on more restricted domains.
Engstrom et al. (2018). Evaluating and understanding the robustness of adversarial logit pairing. arXiv preprint arXiv:1807.10272.
Mahajan et al. (2018). Exploring the limits of weakly supervised pretraining. Proceedings of the European Conference on Computer Vision (ECCV), pp. 181–196.
Xie et al. (2017). Aggregated residual transformations for deep neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1492–1500.