Convolutional neural network (CNN) models have surpassed human performance on image classification benchmarks such as ImageNet (He et al. (2015); Deng et al. (2009)). However, their predictions are often sensitive to small image transformations (Szegedy et al. (2013); Azulay & Weiss (2018)) that are visually imperceptible to humans. In contrast, human visual systems are robust to a wide variety of distortions, including color and noise distortions (Geirhos et al. (2018)). Our goal is to systematically study this brittleness to small image transformations in order to better understand and improve the robustness of computer vision models.
Previously, model brittleness has commonly been studied through the framework of adversarial examples, where example images with minute but carefully constructed modifications can fool an otherwise accurate model (Szegedy et al. (2013); Kurakin et al. (2016); Carlini & Wagner (2017)).
Recent work has found that models are also brittle to examples that are not adversarially constructed. For instance, models are found to be sensitive to changes in camera settings such as exposure and contrast (Temel & AlRegib (2018)), small translations and rotations (Azulay & Weiss (2018); Engstrom et al. (2017)), and imperceptible variations across consecutive video frames (Azulay & Weiss (2018)). Convolutional models are much more brittle than humans to synthetic distortions when the same distortions are not used during training. However, models trained on specific distortions can outperform humans on those specific distortions (Geirhos et al. (2018)). To better evaluate the extent of model brittleness in the non-adversarial setting, the community has proposed datasets to benchmark robustness (Hendrycks & Dietterich (2018) and Temel & AlRegib (2018)). These proposed datasets mimic real world scenarios using increasingly more realistic synthetic distortions, such as perturbing clean images with artificial fog and snow.
We present the first study of CNN robustness to the natural transformations found across nearby video frames, which we term "natural robustness". This is an important type of robustness because it models a diverse range of transformations that are closer to changes occurring in the natural environment than the synthetic distortions previously studied. We provide a framework, with a dataset and evaluation metrics, to systematically evaluate natural robustness. Our results show that:

- More accurate image model architectures are more robust to natural transformations.
- Small translations and synthetic color distortions are good proxies for evaluating natural robustness.
- No single regularization technique systematically improves natural robustness across model architectures.
2 Robustness Metrics
Given a correct classification on one frame, we define natural robustness as the conditional accuracy on the neighboring frame, similar to the Jaggedness measure (Azulay & Weiss (2018)). More formally, let $f: \mathcal{X} \to \mathcal{Y}$ be a model mapping from input space $\mathcal{X}$ to prediction space $\mathcal{Y}$, and let $T: \mathcal{X} \to \mathcal{X}$ be an image transformation function. The robustness of a model is defined as $R(f, T) = P(f(T(x)) = y \mid f(x) = y)$, where $y$ is the ground truth. We consider two kinds of transformations: 1) synthetic distortions, such as adjusting the color saturation of an image, adding noise, etc., and 2) natural transformations that exist across consecutive frames of videos.
Unlike some prior definitions of model robustness, our definition is agnostic to the kind of transformation function. In Hendrycks & Dietterich (2018), robustness according to the relative mCE is defined to be proportional to the drop in total accuracy from clean to corrupted images, i.e., to $\text{Acc}_{\text{clean}} - \text{Acc}_{\text{corrupted}}$. This formulation implicitly assumes that transformed images come from a different distribution than the clean images and are harder to classify. This robustness metric is trivial when considering natural robustness, because neighboring frames come from the same distribution and are equally difficult to classify.
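The conditional-accuracy definition above can be sketched in a few lines of NumPy; the function name and array interface here are our own, not from the paper's released code:

```python
import numpy as np

def natural_robustness(preds_anchor, preds_neighbor, labels):
    """Conditional accuracy R(f, T) = P(f(T(x)) = y | f(x) = y).

    preds_anchor:   model predictions on the anchor frames
    preds_neighbor: predictions on the transformed (neighboring) frames
    labels:         ground-truth class labels
    """
    preds_anchor = np.asarray(preds_anchor)
    preds_neighbor = np.asarray(preds_neighbor)
    labels = np.asarray(labels)

    correct_on_anchor = preds_anchor == labels
    if not correct_on_anchor.any():
        return float("nan")  # undefined if nothing is classified correctly
    # Among frames the model got right, how often is the neighbor also right?
    return float((preds_neighbor[correct_on_anchor]
                  == labels[correct_on_anchor]).mean())
```

Because the measure conditions on a correct anchor prediction, it stays meaningful even when the transformed images are exactly as hard as the clean ones, unlike accuracy-drop metrics.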
3 Experimental Study Setup
We leverage a large-scale dataset of videos, YouTube-BoundingBoxes (YT-BB) (Real et al. (2017)), to evaluate natural robustness. YT-BB contains 380K video segments of 15-20s each, drawn from 210K unique videos and covering 24 classes. Each segment is selected to feature objects in natural settings without editing or post-processing. See Figure A.1 for examples of both natural transformations and synthetic distortions.
We create a classification task from YT-BB to train our image classification models. This ensures domain matching between the images used during training and the video frames used during robustness evaluation. Prior work used handpicked YouTube videos to evaluate image models trained on ImageNet (Azulay & Weiss (2018)). However, domain mismatch between training and evaluation can result in a significant drop in test accuracy: in our evaluation, we found that evaluating off-the-shelf ImageNet models on the YT-BB classification task, without finetuning, results in an average accuracy drop of 27% across 12 model architectures.
To create this classification task, we split the 210K videos into training, validation, and test sets. We divide the videos into contiguous shots of a single object and randomly sample a subset of 34,714 frames from unique shots, which we term "anchor frames". These anchor frames make up the supervised classification task across 23 imbalanced classes. We then extract 5 consecutive frames at 15Hz before and after each anchor frame to evaluate natural robustness. The top row of Figure A.1 illustrates an example anchor frame and its neighbors, which we term "natural transformations". Neighboring frames are only used to evaluate natural robustness; only anchor frames are used during training.
We examine the robustness of 12 model architectures: VGG16, VGG19 (Simonyan & Zisserman (2014)), ResNet-V1-50, ResNet-V1-101, ResNet-V1-152 (He et al. (2016)), MobileNet-V1 (Howard et al. (2017)), NASNet-Mobile (Zoph et al. (2018)), Inception-V1 (Szegedy et al. (2015)), Inception-V2, Inception-V3 (Szegedy et al. (2016)), Inception-V4 (Szegedy et al. (2017b)), and Inception-ResNet-V2 (Szegedy et al. (2017a)). For each model architecture, we train models on the YT-BB task both from scratch and using transfer learning from pre-trained ImageNet checkpoints. We tune hyper-parameters independently for every model using the same search space. This gives us a total of 24 baseline models with a wide range of accuracy on this task.
The preprocessing stage during training is particularly critical in determining the robustness of models against synthetic distortions. Geirhos et al. (2018) demonstrated that training on synthetic distortions disproportionately improves robustness to the distortion used during training. To ensure that our evaluation across models is fair and comparable, we use the same preprocessing during all model training: only random crop and resizing, skipping the color distortions that are typically used for training Inception models.
3.4 Natural Transformations and Synthetic Distortions
For each anchor frame in our classification task, we sample 10 neighboring frames, 5 frames before and after the anchor frames, at 15Hz. We call these neighboring frames the natural transformations of the anchor image. Similar to synthetic distortions, we can adjust the strength of the transformation defined by the temporal difference between the anchor and neighboring frame.
We also evaluated model robustness to synthetic distortions for comparison against natural robustness. We implemented 10 varieties of synthetic distortions at 5 different levels of severity, selecting the following distortions from prior work: Gaussian noise, Gaussian blur, pixelation, shot noise, JPEG quality, hue, contrast, saturation, brightness, and small translations (Hendrycks & Dietterich (2018); Azulay & Weiss (2018); Geirhos et al. (2018)).
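To make the severity scale concrete, a color distortion such as saturation can be implemented by blending an image toward its grayscale version; the particular severity schedule below is illustrative, not our exact implementation:

```python
import numpy as np

# Assumed severity schedule: severity 1 (mild) to 5 (fully desaturated).
SEVERITY_FACTORS = [0.8, 0.6, 0.4, 0.2, 0.0]

def adjust_saturation(image, severity):
    """Desaturate an RGB image (H, W, 3, floats in [0, 1]).

    factor = 1.0 keeps the original colors; factor = 0.0 yields grayscale.
    """
    factor = SEVERITY_FACTORS[severity - 1]
    # ITU-R 601 luma weights give the grayscale target.
    gray = image @ np.array([0.299, 0.587, 0.114])
    gray = gray[..., None].repeat(3, axis=-1)
    return factor * image + (1.0 - factor) * gray
```

Analogous factor schedules parameterize the other distortions (contrast, brightness, etc.), so every distortion exposes the same 1-5 severity interface.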
Azulay & Weiss (2018) showed that the implementation of the translation distortion can significantly affect the results. In our implementation, we shift the evaluation crop of the image to create a set of translations, without introducing empty space or the need for in-painting.
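A minimal sketch of this crop-shifting scheme follows; the function name and argument layout are our own:

```python
import numpy as np

def translated_crop(frame, crop_size, dx, dy):
    """Shift the central evaluation crop of `frame` by (dx, dy) pixels.

    Because the crop window moves within the full frame, the translated
    image contains only real pixels: no empty borders and no in-painting.
    The frame must be large enough to contain the shifted window.
    """
    h, w = frame.shape[:2]
    top = (h - crop_size) // 2 + dy
    left = (w - crop_size) // 2 + dx
    assert 0 <= top and top + crop_size <= h, "dy shifts crop out of frame"
    assert 0 <= left and left + crop_size <= w, "dx shifts crop out of frame"
    return frame[top:top + crop_size, left:left + crop_size]
```

Evaluating the model on `translated_crop(frame, s, dx, dy)` for a grid of offsets then yields the translation-robustness curves.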
We examine the result of evaluating each of the 24 fully trained models, spanning 12 model architectures, on 11 transformations, each at 5 strengths. For the 12 models trained from scratch, the Top-1 accuracy on the 23-class YT-BB classification task ranges from 72.5% to 80.4%. The same architectures fine-tuned from ImageNet obtain accuracies ranging from 83.9% to 88.7%.
4 Results
4.1 Frames closer in time have more similar model outputs
We expect our models to exhibit the property that more visually similar images have more similar outputs. During our evaluation, we found that this is indeed true. As seen in Figure 2, frame pairs temporally closer together are more likely to be both classified correctly.
We also evaluated the robustness of YT-BB models on synthetic distortions; the results can be found in Figure A.2. Our results reproduce those of Geirhos et al. (2018), showing that models are not robust to synthetic distortions and that robustness drops significantly as the severity of the distortions increases.
4.2 More accurate models are more robust
We summarize each model's robustness to a given transformation as the average across varying strengths. In Figure 3, we see a strong correlation between model accuracy and robustness to both translation and the natural transformations found in videos. The result on translation robustness contradicts the conclusion of Azulay & Weiss (2018), which states that modern networks are less translation invariant. Azulay & Weiss (2018) measure translation invariance by training their models on full-sized images, then evaluating them by embedding the image within a larger image. This introduces a large change between training and evaluation, and Figure 8 in Azulay & Weiss (2018) indicates that the closer the embedded size is to the original image, the more robust models are. Our work, by contrast, focuses on evaluating robustness to small transformations that are close to, or even in the same distribution as, those seen in training.
Figure 4 shows that more accurate models are also more robust to a number of synthetic distortions such as saturation, hue, and brightness. This conclusion is different from that of Hendrycks & Dietterich (2018), which claimed that model robustness is largely uncorrelated with model accuracy, according to their relative mCE metric. However, according to their unadjusted mCE robustness metric, more accurate classifiers are more accurate on the corrupted test set. As described in Sec. 2, the relative mCE metric is proportional to the drop in total accuracy from clean images to corrupted images. Our metric performs a more fine-grained measure of robustness, by computing accuracy on the corrupted image conditional on the clean image being classified correctly, rather than simply using the total accuracies.
4.3 Synthetic distortions as proxies of natural transformations
In many settings, it is difficult to acquire domain-matching videos to evaluate natural robustness directly. We evaluate model robustness across 10 different synthetic distortions and compute the correlation between the different types of robustness. As shown in Figure 5, we find that robustness to image translation and to color distortions like saturation and hue is highly correlated with natural robustness, indicating that these are good proxies for natural robustness in image models.
4.4 Distances between neighboring frames are much larger than those of adversarial examples
We next investigated the relationship between brittleness to natural transformations and adversarial examples. Adversarial examples are commonly defined as images within an $\epsilon$-ball of the clean image under an $\ell_p$ norm that result in a misclassification.
In Figure 6, we analyze the distribution of 10,000 video frame pairs, all 66ms apart. The distribution of $\ell_\infty$ distances between consecutive frames has a mean of 213 and a standard deviation of 49.1 on the 0-255 pixel scale, much larger than the $\epsilon$ typically considered for adversarial examples in $\ell_\infty$ space (Kannan et al. (2018)). We then study frame pairs that exhibit brittleness, where the anchor frame is classified correctly but its neighbor is misclassified. Less than 0.01% of frame pairs that exhibit brittleness fall within the definition of adversarial examples (an $\epsilon$-ball in $\ell_\infty$ space).
4.5 No single training technique systematically improves robustness
Finally, we explore regularization and adversarial training techniques to improve the natural robustness of image classifiers. We explored adversarial training through adversarial logit pairing (Kannan et al. (2018)); regularization techniques such as weight decay (Hanson & Pratt (1989)), label smoothing (Szegedy et al. (2017b)), clean logit squeezing, and clean logit pairing (Kannan et al. (2018)); and multi-class prediction with the sigmoid activation function on the logits (Goodfellow et al. (2016)). We explore 25 hyperparameter settings for each training technique. All models trained with these techniques remain within 1.2% of their original accuracy.
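For reference, the clean logit squeezing and clean logit pairing penalties from Kannan et al. (2018) can be sketched as follows; the coefficient value is illustrative, not a setting from our hyperparameter search:

```python
import numpy as np

def clean_logit_squeezing(logits, coeff=0.05):
    """Penalize large logit magnitudes: coeff * mean ||z||_2^2."""
    return coeff * float((logits ** 2).sum(axis=1).mean())

def clean_logit_pairing(logits_a, logits_b, coeff=0.05):
    """Penalize logit differences between random pairs of clean examples.

    logits_a and logits_b are the logits of the two halves of a
    randomly paired batch of clean images.
    """
    return coeff * float(((logits_a - logits_b) ** 2).sum(axis=1).mean())
```

Each penalty is added to the standard cross-entropy loss during training; neither requires generating adversarial examples, which is what makes them cheap baselines next to adversarial logit pairing.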
We find that no single technique we explored improves natural robustness for all model architectures (Figure 7). However, certain model architectures do become more robust with specific training techniques: Figure 7 shows that adversarial logit pairing improves MobileNet-V1 robustness by 0.8%, weight decay improves ResNet-V1-152 robustness by 1.2%, and clean logit pairing and clean logit squeezing improve VGG19 robustness by 1% and 1.3%, respectively.
5 Conclusion
In this work, we present the first study of natural robustness in CNNs by leveraging the large YT-BB video dataset (Real et al. (2017)).
Our results show that more accurate models are also more robust to natural transformations. This implies that researchers should continue designing new architectures and training techniques that improve model accuracy. Our analysis also highlights that the correlation between accuracy and robustness depends highly on the type of distortions applied. This refines conclusions from prior works that show accuracy is uncorrelated with robustness on average (Hendrycks & Dietterich (2018)).
When video frames are not available to evaluate natural robustness directly, we identify synthetic color distortions as good proxies for natural transformations. This result also helps explain and support prior findings that color-based transformations are good data augmentation policies for natural images (Cubuk et al. (2018)).
In exploring the relationship between brittleness found in videos and adversarial examples, we find that brittle examples in videos rarely fall within the typical definition of adversarial examples. This suggests that adversarial robustness does not directly measure robustness to natural transformations. Despite the misalignment in evaluation, we do find early signs that training techniques to improve adversarial robustness can improve robustness for some model architectures. However, no single training technique systematically improves natural robustness across model architectures, providing an interesting direction for future work.
- Azulay & Weiss (2018) Aharon Azulay and Yair Weiss. Why do deep convolutional networks generalize so poorly to small image transformations? arXiv preprint arXiv:1805.12177, 2018.
- Carlini & Wagner (2017) Nicholas Carlini and David Wagner. Adversarial examples are not easily detected: Bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 3–14. ACM, 2017.
- Cubuk et al. (2018) Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501, 2018.
- Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 248–255, 2009.
- Engstrom et al. (2017) Logan Engstrom, Brandon Tran, Dimitris Tsipras, Ludwig Schmidt, and Aleksander Madry. A rotation and a translation suffice: Fooling cnns with simple transformations. arXiv preprint arXiv:1712.02779, 2017.
- Geirhos et al. (2018) Robert Geirhos, Carlos RM Temme, Jonas Rauber, Heiko H Schütt, Matthias Bethge, and Felix A Wichmann. Generalisation in humans and deep neural networks. In Advances in Neural Information Processing Systems, pp. 7549–7561, 2018.
- Goodfellow et al. (2016) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning, volume 1. MIT Press, 2016.
- Hanson & Pratt (1989) Stephen José Hanson and Lorien Y Pratt. Comparing biases for minimal network construction with back-propagation. In Advances in neural information processing systems, pp. 177–185, 1989.
- He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026–1034, 2015.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
- Hendrycks & Dietterich (2018) Dan Hendrycks and Thomas G Dietterich. Benchmarking neural network robustness to common corruptions and surface variations. arXiv preprint arXiv:1807.01697, 2018.
- Howard et al. (2017) Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
- Kannan et al. (2018) Harini Kannan, Alexey Kurakin, and Ian Goodfellow. Adversarial logit pairing. arXiv preprint arXiv:1803.06373, 2018.
- Kurakin et al. (2016) Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.
- Real et al. (2017) Esteban Real, Jonathon Shlens, Stefano Mazzocchi, Xin Pan, and Vincent Vanhoucke. Youtube-boundingboxes: A large high-precision human-annotated data set for object detection in video. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pp. 7464–7473. IEEE, 2017.
- Simonyan & Zisserman (2014) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- Szegedy et al. (2013) Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
- Szegedy et al. (2015) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9, 2015.
- Szegedy et al. (2016) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826, 2016.
- Szegedy et al. (2017a) Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence, 2017a.
- Szegedy et al. (2017b) Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence, 2017b.
- Temel & AlRegib (2018) Dogancan Temel and Ghassan AlRegib. Traffic signs in the wild: Highlights from the IEEE Video and Image Processing Cup 2017 student competition [SP competitions]. IEEE Signal Processing Magazine, 35(2):154–161, 2018.
- Zoph et al. (2018) Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8697–8710, 2018.
Appendix A
In Figure A.2, we illustrate robustness across degrees of synthetic distortion, replicating previous results from Geirhos et al. (2018). Robustness to small translations stands out from the other types of robustness in that models are equally robust to translations of a single pixel or of 16 pixels. We hypothesize this is due to the use of random crops during training. Additionally, we notice that models are more robust to translations of exactly 4 and 8 pixels than to translations of 1 or 16 pixels. This phenomenon is likely due to the convolutional architecture, as alluded to by Azulay & Weiss (2018).