Over the past five years, machine learning has become a decidedly experimental field. Driven by a surge of research in deep learning, the majority of published papers have embraced a paradigm where the main justification for a new learning technique is its improved performance on a few key benchmarks. At the same time, there are few explanations as to why
a proposed technique is a reliable improvement over prior work. Instead, our sense of progress largely rests on a small number of standard benchmarks such as CIFAR-10, ImageNet, or MuJoCo. This raises a crucial question:
How reliable are our current measures of progress in machine learning?
Properly evaluating progress in machine learning is subtle. After all, the goal of a learning algorithm is to produce a model that generalizes well to unseen data. Since we usually do not have access to the ground truth data distribution, we instead evaluate a model’s performance on a separate test set. This is indeed a principled evaluation protocol, as long as we do not use the test set to select our models.
Unfortunately, we typically have limited access to new data from the same distribution. It is now commonly accepted to re-use the same test set multiple times throughout the algorithm and model design process. Examples of this practice are abundant and include both tuning hyperparameters (number of layers, etc.) within a single publication, and building on other researchers’ work across publications. While there is a natural desire to compare new models to previous results, it is evident that the current research methodology undermines the key assumption that the classifiers are independent of the test set. This mismatch presents a clear danger because the research community could easily be designing models that only work well on the specific test set but actually fail to generalize to new data.
1.1 A Reproducibility Study on CIFAR-10
To understand how reliable current progress in machine learning is, we design and conduct a new type of reproducibility study. Its main goal is to measure how well contemporary classifiers generalize to new, truly unseen data from the same distribution. We focus on the standard CIFAR-10 dataset since its transparent creation process makes it particularly well suited to this task. Moreover, CIFAR-10 has been the focus of intense research for almost 10 years now. Due to the competitive nature of this process, it is an excellent test case for investigating whether adaptivity has led to overfitting.
Our study proceeds in three steps:
First, we curate a new test set, carefully matching its sub-class distribution to that of the original CIFAR-10 dataset.
Second, after collecting about 2,000 new images, we evaluate the performance of 30 image classification models on our new test set. The results show two overarching phenomena. On the one hand, there is a significant drop in accuracy from the original test set to our new test set. For instance, VGG and ResNet architectures [18, 7] drop from their well-established 93% accuracy to about 85% on our new test set. On the other hand, we find the performance on the existing test set to be highly predictive of the performance on our new test set. Even small incremental improvements on CIFAR-10 often transfer to truly held-out data.
Motivated by this discrepancy between the original and new accuracies, the third step investigates multiple hypotheses for explaining this gap. A natural conjecture is that re-tuning standard hyperparameters recovers some of the observed gap, but we find only a small effect of about 0.6% improvement. While this and further experiments can explain some of the accuracy loss, a significant gap remains.
Overall, our results paint an unexpected picture of progress in contemporary machine learning. In spite of adapting to the CIFAR-10 test set for several years, there has been no stagnation. The top model is still a recent Shake-Shake network with Cutout regularization [4, 3]. Moreover, its advantage over a standard ResNet increased from 4% to 8% on our new test set. This shows that the current research methodology of “attacking” a test set for an extended period of time is surprisingly resilient to overfitting.
But our results also cast doubt on the robustness of current classifiers. While our new dataset presents only a minute distributional shift, the classification accuracy of widely used models drops significantly. For instance, the aforementioned accuracy loss of VGG and ResNet architectures corresponds to multiple years of progress on CIFAR-10. Note that the distributional shift induced by our experiment is neither adversarial nor the result of a different data source. So even in benign settings, distribution shift poses a serious challenge and calls into question to what extent current models truly generalize.
2 Formal Setup
Before we describe our specific experiment on CIFAR-10, we start with a formal description of our problem of interest. We adopt the standard classification setup and posit the existence of a "true" underlying data distribution $\mathcal{D}$ over labeled examples $(x, y)$. The goal is to find a model $\hat{f}$ that minimizes the population loss

$$L_{\mathcal{D}}(\hat{f}) \;=\; \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\,\mathbb{1}\{\hat{f}(x) \neq y\}\,\big]\,.$$

Since we usually do not know the distribution $\mathcal{D}$, we instead measure the performance of a trained classifier via a test set $S$ drawn from the distribution $\mathcal{D}$:

$$L_{S}(\hat{f}) \;=\; \frac{1}{|S|} \sum_{(x,y)\in S} \mathbb{1}\{\hat{f}(x) \neq y\}\,.$$

For a sufficiently large test set $S$, standard concentration results show that $L_S(\hat{f})$ is a good approximation of $L_{\mathcal{D}}(\hat{f})$ as long as the classifier $\hat{f}$ does not depend on $S$. This is arguably the core assumption underlying machine learning since it allows us to argue that our classifier truly generalizes (as opposed to, say, only memorizing the data). So if we collect a new test set $S'$ from the same distribution $\mathcal{D}$, we would expect that the accuracies match up to the confidence intervals given by the inherent sampling error:

$$L_{S}(\hat{f}) \;\approx\; L_{S'}(\hat{f})\,.$$

However, it is often hard to argue that a new test set is drawn from exactly the same distribution since we usually lack a precise definition of this distribution. So to obtain truly i.i.d. test sets, we ideally would have collected a larger initial dataset that we then randomly split into $S_{\text{train}}$, $S$, and $S'$. Unfortunately, we usually do not have such an exact setup to reproduce accuracy numbers on a new test set. In this paper, we instead mimic the data-generating distribution as closely as possible by repeating the dataset creation process that originally derived $S_{\text{train}}$ and $S$ from a larger dataset. While this method does not necessarily generate a test set that is an i.i.d. draw from the original data-generating distribution, it is a close approximation.
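To make the role of sampling error concrete, here is a small simulation (purely illustrative; the accuracy value and the test-set size are chosen to match the numbers used later in the paper, not taken from any experiment) of evaluating one fixed classifier on two i.i.d. test sets:

```python
import random

def empirical_accuracy(true_acc, n, rng):
    """Accuracy of a fixed classifier with population accuracy `true_acc`
    measured on n i.i.d. examples (each prediction is correct w.p. true_acc)."""
    return sum(rng.random() < true_acc for _ in range(n)) / n

rng = random.Random(0)
acc_s = empirical_accuracy(0.90, 2000, rng)      # original test set S
acc_s_new = empirical_accuracy(0.90, 2000, rng)  # new i.i.d. test set S'
# The two estimates differ only by sampling error (roughly +/- 1% at n = 2,000),
# far smaller than the accuracy drops reported in Section 4.
```

Under the i.i.d. assumption, any gap between the two measured accuracies shrinks as the test sets grow, which is exactly what makes the observed 8% drops surprising.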
3 Dataset Creation Methodology
To investigate how well current image classifiers generalize to truly unseen data, we collect a new test set for the CIFAR-10 image classification dataset. There are multiple reasons for this choice:
The dataset creation process for CIFAR-10 is transparent and well documented. Importantly, CIFAR-10 draws from the larger Tiny Images repository that has significantly more fine-grained labels. This makes it possible to conduct an experiment where we minimize various forms of distribution shift in our new test set.
CIFAR-10 poses a difficult enough problem so that the dataset is still the subject of active research (e.g., see [3, 4, 21, 23, 17]). Moreover, there is a wide range of classification models that achieve significantly different accuracy scores. Since code for these models is published in a variety of open source repositories, they can be treated as truly independent of our new test set.
Before we describe how we created our new test set, we briefly review relevant background on CIFAR-10 and Tiny Images.
3.1 Background on Tiny Images and CIFAR-10

The Tiny Images dataset contains 80 million RGB color images with a resolution of 32×32 pixels. The images are organized by roughly 75,000 keywords that correspond to the non-abstract nouns from the WordNet database. Each keyword was entered into multiple Internet search engines to collect roughly 1,000 to 2,500 images per keyword. It is important to note that Tiny Images is a fairly noisy dataset. Many of the images filed under a certain keyword do not clearly (or not at all) correspond to the respective keyword.
The goal for the CIFAR-10 dataset was to create a cleanly labeled subset of Tiny Images. To this end, the researchers assembled a dataset consisting of ten classes with 6,000 images per class. These classes are airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. The standard train / test split is class-balanced and contains 50,000 training images and 10,000 test images.
The CIFAR-10 creation process is well-documented. First, the researchers assembled a set of relevant keywords for each class by using the hyponym relations in WordNet. Since directly using the corresponding images from Tiny Images would not give a high quality dataset, the researchers paid student annotators to label the images from Tiny Images. The labeler instructions can be found in Appendix C of the original tech report and include a set of specific guidelines (e.g., an image should not contain two objects of the corresponding class). The researchers then verified the labels of the images selected by the annotators and removed near-duplicates from the dataset via an ℓ2 nearest-neighbor search.
3.2 Building the New Test Set
Our overall goal was to create a new test set that is as close as possible to being drawn from the same distribution as the original CIFAR-10 dataset. One crucial aspect here is that the CIFAR-10 dataset did not exhaust any of the Tiny Image keywords it is drawn from. So by collecting new images from the same keywords as CIFAR-10, our new test set can match the sub-class distribution of the original dataset.
Understanding the Sub-Class Distribution.
As the first step, we determined the Tiny Images keyword for every image in the CIFAR-10 dataset. A simple nearest-neighbor search sufficed since every image in CIFAR-10 had an exact duplicate (ℓ2-distance 0) in Tiny Images. Based on this information, we then assembled a list of the 25 most common keywords for each class. We decided on 25 keywords per class since the 250 total keywords make up more than 95% of CIFAR-10. Moreover, we wanted to avoid accidentally creating a harder dataset with infrequent keywords that the classifiers had little incentive to learn based on the original CIFAR-10 dataset.
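A minimal sketch of this duplicate lookup on synthetic data (the array shapes and function name are ours, not the original pipeline's):

```python
import numpy as np

def find_exact_duplicate(query, pool):
    """Index of the image in `pool` at l2-distance 0 from `query`, or -1.
    Stand-in for locating each CIFAR-10 image's source keyword in Tiny Images."""
    diffs = pool.astype(np.int64) - query.astype(np.int64)
    dists = np.einsum("ij,ij->i", diffs, diffs)  # squared l2 distances to every pool image
    idx = int(np.argmin(dists))
    return idx if dists[idx] == 0 else -1

rng = np.random.default_rng(0)
pool = rng.integers(0, 256, size=(500, 32 * 32 * 3), dtype=np.uint8)  # toy "Tiny Images"
assert find_exact_duplicate(pool[42].copy(), pool) == 42
```

In the real setting, the matched index carries the keyword label, which is what the sub-class distribution is built from.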
The keyword distribution can be found in Appendix E. Inspecting this list reveals the importance of matching the sub-class distribution. For instance, the most common keyword in the airplane class is stealth_bomber and not an arguably more ordinary civilian type of airplane. In addition, the third most common keyword for the airplane class is stealth_fighter. Both types of planes are highly distinctive. There are more examples where certain sub-classes are considerably different, e.g., images from the fire_truck keyword have image statistics that are rather different from say pictures for dump_truck.
Collecting New Images.
After determining the keywords, we collected corresponding images. To simulate the student / researcher split in the original CIFAR-10 collection procedure, we introduced a similar split among two authors of this paper. Author A took the role of the original student annotators and selected new suitable images for the 250 keywords. In order to ensure a close match between the original and new images for each keyword, we built a user interface that allowed Author A to first look through existing CIFAR-10 images for a given keyword and then select new candidates from the remaining pictures in Tiny Images. Author A followed the labeling guidelines in the original instruction sheet. The number of images Author A selected per keyword was chosen so that our final dataset would contain between 2,000 and 4,000 images. We decided on 2,000 images as a target number for two reasons:
While the original CIFAR-10 test set contains 10,000 images, a test set of size 2,000 is already sufficient for a fairly small confidence interval. In particular, a conservative confidence interval (Clopper-Pearson at confidence level 95%) for accuracy 90% has size about ±1.3% with n = 2,000 (to be precise, [88.6%, 91.3%]). Since we considered a potential discrepancy between original and new test accuracy only interesting if it was significantly larger than 1%, we decided that a new test set of size 2,000 was large enough for our study.
As with very infrequent keywords, our goal was to avoid accidentally creating a harder test set. Since some of the Tiny Images keywords have only a limited supply of remaining adequate images, we decided that a smaller target size for the new dataset would reduce the pressure to include images of more questionable difficulty.
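The Clopper-Pearson interval quoted above can be recomputed from scratch with the standard library alone, using bisection on the exact binomial CDF (a sketch; production code would use a statistics library):

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), via a log-space pmf recurrence."""
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 1.0 if k >= n else 0.0
    log_pmf = n * math.log1p(-p)  # log P(X = 0)
    cdf = math.exp(log_pmf)
    log_odds = math.log(p) - math.log1p(-p)
    for i in range(k):
        log_pmf += math.log((n - i) / (i + 1)) + log_odds
        cdf += math.exp(log_pmf)
    return min(cdf, 1.0)

def clopper_pearson(k, n, alpha=0.05):
    """Exact (conservative) two-sided confidence interval for k successes in n trials."""
    def bisect(cond):
        lo, hi = 0.0, 1.0
        for _ in range(50):
            mid = (lo + hi) / 2
            if cond(mid):
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2
    lower = 0.0 if k == 0 else bisect(lambda p: 1 - binom_cdf(k - 1, n, p) < alpha / 2)
    upper = 1.0 if k == n else bisect(lambda p: binom_cdf(k, n, p) >= alpha / 2)
    return lower, upper

lo, hi = clopper_pearson(1800, 2000)  # 90% accuracy measured on n = 2,000 examples
```

The lower endpoint solves P(X ≥ k | p) = α/2 and the upper endpoint solves P(X ≤ k | p) = α/2, which is the standard "exact" construction.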
After Author A had selected a set of about 9,000 candidate images, Author B adopted the role of the researchers in the original CIFAR-10 dataset creation process. In particular, Author B reviewed all candidate images and removed images that were unclear to Author B or did not conform to the labeling instructions in their opinion (some of the criteria are subjective). In the process, a small number of keywords did not have enough images remaining to reach the 2,000 threshold. Author B then notified Author A about the respective keywords and Author A selected a further set of images for these keywords. In this process, there was only one keyword where Author A had to carefully go through all available images in Tiny Images. This keyword was alley_cat and comprises less than 0.3% of the overall CIFAR-10 dataset.
After collecting a sufficient number of high-quality images for each keyword, we sampled a random subset from our pruned candidate set. The sampling procedure was such that the keyword-level distribution of our new dataset matches the keyword-level distribution of CIFAR-10 (see Appendix E). In the final stage, we again proceeded similarly to the original CIFAR-10 dataset creation process and used an ℓ2 nearest-neighbor search to filter out near-duplicates. In particular, we removed near-duplicates within our new dataset and also images that had a near-duplicate in the original CIFAR-10 dataset (train or test). The latter aspect is particularly important since our reproducibility study is only interesting if we evaluate on truly unseen data. Hence we manually reviewed the top-10 nearest neighbors for each image in our new test set. After removing near-duplicates in our dataset, we re-sampled the respective keywords until this process converged to our final dataset.
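The keyword-matched sampling step can be sketched as follows (the data structures and the helper name are ours; the real pipeline operates on Tiny Images keywords and image ids):

```python
import random
from collections import Counter

def sample_matching_distribution(keyword_counts, candidates, target_size, rng):
    """Draw ~target_size images from `candidates` (keyword -> candidate images)
    so that keyword proportions match `keyword_counts` from the original dataset.
    Per-keyword quotas are rounded, so the total can be off by a few images."""
    total = sum(keyword_counts.values())
    picked = []
    for kw, count in keyword_counts.items():
        quota = round(target_size * count / total)
        picked += rng.sample(candidates[kw], quota)
    return picked

rng = random.Random(0)
counts = {"stealth_bomber": 300, "dump_truck": 100}           # toy keyword histogram
cands = {kw: [(kw, i) for i in range(200)] for kw in counts}  # toy candidate pools
picked = sample_matching_distribution(counts, cands, 40, rng)
```

In the actual procedure this sampling is interleaved with near-duplicate removal, with the affected keywords re-sampled until the process converges.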
We remark that we did not run any classifiers on our new dataset during the data collection phase of our study. In order to ensure that the new data does not depend on the existing classifiers, it is important to strictly separate the data collection phase from the following evaluation phase.
4 Model Performance Results
After we completed the new test set, we evaluated a broad range of image classification models. The main question was how the accuracy on the original CIFAR-10 test set compares to the accuracy on our new test set. To this end, we experimented with a broad range of classifiers spanning multiple years of machine learning research. The models include widely used convolutional networks (VGG and ResNet [18, 7]), more recent architectures (ResNeXt, PyramidNet, DenseNet [20, 6, 10]), the published state-of-the-art (Shake-Drop), and a model derived from RL-based hyperparameter search (NASNet). In addition, we also evaluated "shallow" approaches based on random features [16, 2]. Overall, the accuracies on the original CIFAR-10 test set range from about 80% to 97%.
For all deep architectures, we used code previously published online (see Appendix A for a list). To avoid bias due to specific model repositories or frameworks, we also evaluated two widely used architectures (VGG and ResNets) from two different sources implemented in different deep learning libraries. We wrote our own implementation for the models based on random features.
4.1 A Significant Drop in Accuracy
All models see a large drop in accuracy from the original to the new test set. The absolute gap is larger for models that perform worse on the original test set and smaller for models with better published CIFAR-10 accuracy. For instance, VGG and ResNet architectures see a gap of about 8% between their original accuracy (around 93%) and their new accuracy (around 85%). The best original accuracy is achieved by shake_shake_64d_cutout, which sees a roughly 4% drop from 97% to 93%. While there is some variation in the accuracy drop, no model is a clear outlier.
In terms of relative error, the models with higher original accuracy tend to have a larger increase. Some of the models, such as DARC, shake_shake_32d, and resnext_29_4x64d, see roughly a 3× increase in their error rate. For simpler models such as VGG, AlexNet, or ResNet, the relative error increase is in the range of 1.7× to 2.3×. We refer the reader to Appendix C for a table with all relative error numbers.
4.2 Few Changes in the Relative Order
When sorting the models in order of their original and new accuracy, there are few changes in the overall ranking. Models with comparable original accuracy tend to see a similar decrease in performance. In fact, Figure 2 shows that the relationship between original and new accuracy can be explained well with a linear function derived from a least-squares fit. Measuring accuracies in percentage points, the new accuracy of a model is roughly given by the following formula:

$$\mathrm{acc}_{\text{new}} \;\approx\; 1.62 \cdot \mathrm{acc}_{\text{orig}} \;-\; 65.6\,.$$
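A fit of this form can be reproduced directly from the accuracy pairs in Table 1; the following sketch computes the least-squares line (the exact coefficients depend on which models are included):

```python
import numpy as np

# (original accuracy, new accuracy) pairs from Table 1, in percentage points
orig = np.array([97.1, 97.1, 97.0, 97.0, 96.9, 96.6, 96.6, 96.4, 96.3, 96.2,
                 95.9, 95.7, 95.5, 95.4, 95.0, 94.2, 93.6, 93.5, 93.4, 93.3,
                 93.0, 93.0, 92.7, 92.5, 88.5, 85.6, 85.0, 84.2, 83.3, 82.0])
new = np.array([93.0, 91.9, 91.4, 92.0, 92.3, 89.8, 89.5, 89.6, 90.5, 90.0,
                89.7, 89.3, 87.6, 88.8, 88.5, 85.9, 85.3, 85.2, 86.5, 85.0,
                84.2, 84.9, 84.4, 84.9, 77.5, 73.1, 71.9, 69.9, 67.9, 68.9])
# degree-1 least-squares fit: acc_new = slope * acc_orig + intercept
slope, intercept = np.polyfit(orig, new, 1)
```

Because the slope is noticeably larger than 1, models with higher original accuracy lose fewer absolute percentage points on the new test set.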
On the other hand, it is worth noting that some techniques give a consistently larger increase on the new test set. For instance, adding the Cutout data augmentation to a shake_shake_64d network adds only 0.12% accuracy on the original test set but gives an accuracy increase of about 1.5% on the new test set. Similarly, adding Cutout to a wide_resnet_28_10 classifier improves the accuracy by about 1% on the original test set and 2.2% on the new test set. As another example, note that increasing the width of a ResNet as opposed to its depth provides larger benefits on the new test set.
Table 1: Model accuracies on the original CIFAR-10 test set and on our new test set, with 95% confidence intervals in brackets and the change in ranking between the two test sets.

|Model||Original Accuracy||New Accuracy||Gap||Δ Rank|
|shake_shake_64d_cutout [4, 3]||97.1 [96.8, 97.4]||93.0 [91.8, 94.0]||4.1||0|
|shake_shake_96d ||97.1 [96.7, 97.4]||91.9 [90.7, 93.1]||5.1||-2|
|shake_shake_64d ||97.0 [96.6, 97.3]||91.4 [90.1, 92.6]||5.6||-2|
|wide_resnet_28_10_cutout [22, 3]||97.0 [96.6, 97.3]||92.0 [90.7, 93.1]||5.0||+1|
|shake_drop ||96.9 [96.5, 97.2]||92.3 [91.0, 93.4]||4.6||+3|
|shake_shake_32d ||96.6 [96.2, 96.9]||89.8 [88.4, 91.1]||6.8||-2|
|darc ||96.6 [96.2, 96.9]||89.5 [88.1, 90.8]||7.1||-4|
|resnext_29_4x64d ||96.4 [96.0, 96.7]||89.6 [88.2, 90.9]||6.8||-2|
|pyramidnet_basic_110_270 ||96.3 [96.0, 96.7]||90.5 [89.1, 91.7]||5.9||+3|
|resnext_29_8x64d ||96.2 [95.8, 96.6]||90.0 [88.6, 91.2]||6.3||+3|
|wide_resnet_28_10 ||95.9 [95.5, 96.3]||89.7 [88.3, 91.0]||6.2||+2|
|pyramidnet_basic_110_84 ||95.7 [95.3, 96.1]||89.3 [87.8, 90.6]||6.5||0|
|densenet_BC_100_12 ||95.5 [95.1, 95.9]||87.6 [86.1, 89.0]||7.9||-2|
|neural_architecture_search ||95.4 [95.0, 95.8]||88.8 [87.4, 90.2]||6.6||+1|
|wide_resnet_tf ||95.0 [94.6, 95.4]||88.5 [87.0, 89.9]||6.5||+1|
|resnet_v2_bottleneck_164 ||94.2 [93.7, 94.6]||85.9 [84.3, 87.4]||8.3||-1|
|vgg16_keras [18, 14]||93.6 [93.1, 94.1]||85.3 [83.6, 86.8]||8.3||-1|
|resnet_basic_110 ||93.5 [93.0, 93.9]||85.2 [83.5, 86.7]||8.3||-1|
|resnet_v2_basic_110 ||93.4 [92.9, 93.9]||86.5 [84.9, 88.0]||6.9||+3|
|resnet_basic_56 ||93.3 [92.8, 93.8]||85.0 [83.3, 86.5]||8.3||0|
|resnet_basic_44 ||93.0 [92.5, 93.5]||84.2 [82.6, 85.8]||8.8||-3|
|vgg_15_BN_64 [18, 14]||93.0 [92.5, 93.5]||84.9 [83.2, 86.4]||8.1||+1|
|resnet_preact_tf ||92.7 [92.2, 93.2]||84.4 [82.7, 85.9]||8.3||0|
|resnet_basic_32 ||92.5 [92.0, 93.0]||84.9 [83.2, 86.4]||7.7||+3|
|cudaconvnet ||88.5 [87.9, 89.2]||77.5 [75.7, 79.3]||11.0||0|
|random_features_256k_aug ||85.6 [84.9, 86.3]||73.1 [71.1, 75.1]||12.5||0|
|random_features_32k_aug ||85.0 [84.3, 85.7]||71.9 [69.9, 73.9]||13.1||0|
|random_features_256k ||84.2 [83.5, 84.9]||69.9 [67.8, 71.9]||14.3||0|
|random_features_32k ||83.3 [82.6, 84.0]||67.9 [65.9, 70.0]||15.4||-1|
|alexnet_tf||82.0 [81.2, 82.7]||68.9 [66.8, 70.9]||13.1||+1|
4.3 A Model for the Linear Fit
Though the linear fit observed in Figure 2 rules out that the new test set is distributed identically to the original test set, the linear relationship between the old and new test errors is striking. There are a variety of plausible explanations for this effect. For instance, one can posit that the original test set is composed of two sub-populations. On the "easy" sub-population, a classifier achieves an accuracy of $1 - \varepsilon$. The "hard" sub-population is $\Delta$ times more difficult in the sense that the classification error on these examples is $\Delta$ times larger. Hence the accuracy on this sub-population is $1 - \Delta\varepsilon$. If the relative frequencies of these two sub-populations are $p$ and $1 - p$, we get the following overall accuracy:

$$\mathrm{acc}_{\text{orig}} \;=\; p\,(1 - \varepsilon) \;+\; (1 - p)(1 - \Delta\varepsilon)\,,$$

which we can rewrite as a simple linear function of $\varepsilon$:

$$\mathrm{acc}_{\text{orig}} \;=\; 1 - c\,\varepsilon \qquad \text{where } c = p + (1 - p)\,\Delta\,.$$

For the new test set, we also assume a mixture distribution consisting of a different proportion of the same two components, with relative frequencies now $p'$ and $1 - p'$. We can then write the accuracy on the new test set as

$$\mathrm{acc}_{\text{new}} \;=\; 1 - c'\,\varepsilon \qquad \text{where } c' = p' + (1 - p')\,\Delta\,,$$

where we collected terms into a simple linear function as before. It is now easy to see that the new accuracy is indeed a linear function of the original accuracy:

$$\mathrm{acc}_{\text{new}} \;=\; 1 \;-\; \frac{c'}{c}\,\big(1 - \mathrm{acc}_{\text{orig}}\big)\,.$$
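A quick numeric check of this mixture model (the sub-population parameters below are arbitrary illustrations, not estimates from our data):

```python
def mixture_accuracy(eps, p_easy, delta):
    """Overall accuracy when a fraction p_easy of examples is 'easy' (error eps)
    and the rest is 'hard' (error delta * eps)."""
    return p_easy * (1 - eps) + (1 - p_easy) * (1 - delta * eps)

p, p_new, delta = 0.9, 0.8, 3.0  # hypothetical mixture parameters
c, c_new = p + (1 - p) * delta, p_new + (1 - p_new) * delta
for eps in [0.02, 0.05, 0.10]:
    acc_orig = mixture_accuracy(eps, p, delta)
    acc_new = mixture_accuracy(eps, p_new, delta)
    # the linear relation acc_new = 1 - (c_new / c) * (1 - acc_orig) holds exactly,
    # for every classifier (i.e., every base error rate eps)
    assert abs(acc_new - (1 - (c_new / c) * (1 - acc_orig))) < 1e-12
```

Since the relation holds for every base error rate, all classifiers fall on one line regardless of how accurate they are, which matches the behavior in Figure 2.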
We remark that we do not see this mixture model as a ground truth explanation, but rather as an illustrative example for how a linear dependency between the original and new test accuracies naturally arises with small distribution shifts between data sets. In reality, the two test sets have a more complex composition with different accuracies on various sub-populations. Nevertheless, this model reveals that surprising sensitivities to distribution shift can exist even while the relative ordering of classifiers remains constant. We hope that such sensitivities to distribution shift can be experimentally validated in future work.
5 Explaining the Gap
Since the gap between original and new accuracy is concerningly large, we investigated multiple hypotheses for explaining this gap.
5.1 Statistical error
A first natural guess is that the gap is simply due to statistical fluctuations. But as noted before, the sample size of our new test set is large enough so that a 95% confidence interval has size of only about ±1.3%. Since a 95% confidence interval for the original CIFAR-10 test accuracy is even smaller (roughly ±0.6% for 90% classification accuracy and ±0.3% for 97% classification accuracy), we can rule out statistical error as the sole explanation.
5.2 Differences in near-duplicate removal
As mentioned in Section 3.2, the final step of both the original CIFAR-10 and our dataset creation procedure is a near-duplicate removal. While removing near-duplicates between our new test set and the original CIFAR-10 dataset, we noticed that the latter contained images that we would have ruled out as near-duplicates. A large number of near-duplicates between CIFAR-10 train and test, combined with our more stringent near-duplicate removal, could explain some of the accuracy drop. Indeed, we found about 800 images in the CIFAR-10 test set that we would classify as near-duplicates. Moreover, most classifiers have accuracy between 99% and 100% on these near-duplicates (recall that most models achieve 100% training accuracy). But since the 800 images comprise only 8% of the original test set, the near-duplicates can explain at most 1% of the observed difference.
For completeness, we describe our process for finding near-duplicates in detail. For every test image, we visually inspected the top-10 nearest neighbors in both ℓ2-distance and the SSIM (structural similarity) metric. We compared the original test set to the CIFAR-10 training set, and our new test set to both the original training and test sets. We consider an image pair as near-duplicates if both images have the same object in the same pose. We include images that have different zoom, color scale, stretch in the horizontal or vertical direction, or small shifts in vertical or horizontal position. If the object was rotated or in a different pose, we did not include it as a near-duplicate.
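As an illustration of the second metric, a simplified single-window SSIM can be written as follows; note that the standard SSIM is computed over local windows, so this global variant is only a rough stand-in:

```python
import numpy as np

def global_ssim(x, y, L=255.0):
    """Single-window SSIM between two images with pixel range [0, L].
    A coarse stand-in for the usual locally windowed SSIM."""
    x, y = x.astype(np.float64), y.astype(np.float64)
    c1, c2 = (0.01 * L) ** 2, (0.03 * L) ** 2  # standard stabilizing constants
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

rng = np.random.default_rng(0)
a = rng.integers(0, 256, size=(32, 32), dtype=np.uint8)
b = rng.integers(0, 256, size=(32, 32), dtype=np.uint8)
# identical images score 1; unrelated noise scores near 0
```

SSIM complements ℓ2 because it is far less sensitive to uniform brightness or contrast changes, which is exactly the kind of variation our near-duplicate criterion allows.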
5.3 Hyperparameter tuning
Another conjecture is that we can recover some of the missing accuracy by re-tuning hyperparameters of a model. To this end, we performed a grid search over multiple parameters of a VGG model. We selected three standard hyperparameters known to strongly influence test set performance: initial learning rate, dropout, and weight decay. The vgg16_keras architecture uses different amounts of dropout across different layers of the network, so we chose to tune a multiplicative scaling factor for the amount of dropout, keeping the ratio of dropout across different layers constant.
We initialized a hyperparameter configuration from values tuned to the original test set (learning rate = , dropout ratio = , weight decay = ), and performed a grid search across the following values:
learning rate in
dropout ratio in
weight decay in
We ensured that the best performance was never at an extreme point of any of the ranges we tested for an individual hyperparameter. However, we did not find a setting with a significantly better accuracy on the new test set (the biggest improvement was from 85.25% to 85.84%).
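The search procedure itself is a standard exhaustive grid; a sketch with hypothetical grid values (the actual ranges are the ones listed above, and the evaluation function below is a toy stand-in for training vgg16_keras and measuring accuracy):

```python
from itertools import product

def grid_search(evaluate, grid):
    """Exhaustively evaluate every configuration; return the best one."""
    best_cfg, best_score = None, float("-inf")
    for values in product(*grid.values()):
        cfg = dict(zip(grid.keys(), values))
        score = evaluate(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# hypothetical grid values, centered on a configuration tuned to the original test set
grid = {
    "learning_rate": [0.05, 0.1, 0.2],
    "dropout_scale": [0.5, 1.0, 1.5],
    "weight_decay": [1e-4, 5e-4, 1e-3],
}

# toy evaluation with a unique optimum at the center of the grid
toy_eval = lambda c: -abs(c["learning_rate"] - 0.1) - abs(c["dropout_scale"] - 1.0) \
                     - abs(c["weight_decay"] - 5e-4)
best_cfg, _ = grid_search(toy_eval, grid)
```

Checking that the best configuration never lands on a grid boundary, as we did, is what justifies not extending the ranges further.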
5.4 Inspecting hard images
It is also possible that we accidentally created a more difficult test set by including a set of "harder" images. To explore this, we visually inspected the set of images that the majority of models incorrectly classified. We find that all the new images are natural images that are recognizable to humans. Figure 3 in Appendix B shows examples of the hard images in our new test set that no model correctly classified.
5.5 Training on part of our new test set
If our new test set came from a significantly different data distribution than the original CIFAR-10 dataset, then retraining on half of our new test set plus the original training set should improve the accuracy scores on the held-out fraction of the new test set.
We conducted this experiment by randomly drawing a class-balanced split containing 1010 images from the new test set. We then added these images to the full CIFAR-10 training set and retrained the vgg16_keras model. After training, we tested the model on the 1011 held-out images from the new test set. We repeated this experiment twice with different randomly selected splits from our test set, obtaining accuracies of 85.06% and 85.36% (compared to 84.9% without the extra training data). This provides further evidence that there are no large distribution shifts between our new test set and the original CIFAR-10 dataset.
Since cross-validation is a more principled way of measuring a model’s generalization ability, we tested if cross-validation on the original CIFAR-10 dataset could predict a model’s error on our new test set. We created cross-validation data by randomly dividing the training set into 5 class-balanced splits. We then randomly shuffled together 4 out of the 5 training splits with the original test set. The leftover held-out split from the training set then became the new test set.
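The class-balanced splitting in this step can be sketched with a per-class round-robin (a toy example; the real experiment splits the 50,000 CIFAR-10 training images):

```python
import random
from collections import defaultdict, Counter

def class_balanced_splits(labels, k, rng):
    """Partition example indices into k splits with near-equal class counts."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    splits = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for j, idx in enumerate(idxs):
            splits[j % k].append(idx)  # round-robin keeps every class balanced
    return splits

rng = random.Random(0)
labels = [c for c in range(10) for _ in range(100)]  # toy: 10 classes x 100 examples
splits = class_balanced_splits(labels, 5, rng)
```

Each split then serves once as the held-out test set while the remaining splits are shuffled together with the original test images for training.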
We retrained the models vgg_15_BN_64, wide_resnet_28_10, and shake_shake_64d_cutout on each of the 5 new datasets we created. The accuracies are reported in Table 2. The accuracies on each of the cross validation splits did not vary much from the accuracies on the original test set.
6 Discussion

Do our experiments reveal overfitting? This is arguably the main question when interpreting our results. To be precise, we first define two notions of overfitting:
Training set overfitting. One way to quantify overfitting is as the difference between the training accuracy and the test accuracy. Note that the deep neural networks in our experiments usually achieve 100% training accuracy. So this notion of overfitting already occurs on the existing dataset.
Test set overfitting. Another notion of overfitting is the gap between the test accuracy and the accuracy on the underlying data distribution. By adapting model design choices to the test set, the concern is that we implicitly fit the model to the test set. The test accuracy then loses its validity as an accurate measure of performance on truly unseen data.
Since the overall goal in machine learning is to generalize to unseen data, we argue that the second notion of overfitting through test set adaptivity is more important. Surprisingly, our results show no signs of such overfitting on CIFAR-10. Despite multiple years of competitive adaptivity on this dataset, there has been no stagnation on truly held out data. In fact, the best performing models on our new test set see an increased advantage over more established baselines. This trend is the opposite of what overfitting through adaptivity would suggest. While a conclusive picture will require further replication experiments, we view our results as support of the competition-based approach to increasing accuracy scores.
We note that one can read the analysis of Blum and Hardt’s Ladder algorithm as supporting this claim. Indeed, they show that a minor modification to standard machine learning competitions avoids the sort of overfitting that can arise from aggressive adaptivity. Our results show that even without these modifications, model tuning based on the test error does not lead to overfitting on a standard data set.
Although our results do not support the hypothesis of adaptivity-based overfitting, there is still a significant gap between original and new accuracy scores that needs to be explained. We view this gap as the result of a small distribution shift between the original CIFAR-10 dataset and our new test set. The fact that this gap is large, affects all models, and occurs despite our efforts to replicate the CIFAR-10 creation process is concerning. Normally, distribution shift is studied for specific changes in the data generation process (e.g., changes in lighting conditions) or for worst-case attacks in an adversarial setting. Our experiment is more benign and poses neither of these challenges. Nevertheless, the accuracy of all models drops by 4% to 15%, and the relative increase in error rates is up to roughly 3×. This indicates that current CIFAR-10 classifiers have difficulty generalizing to natural variations in image data.
Concrete future experiments should explore whether the competition approach is similarly resilient to overfitting on other datasets (e.g., ImageNet) and other tasks (such as language modeling). An important aspect here is to ensure that the data distribution of a new test set stays as close to the original dataset as possible. Furthermore, we should understand what types of naturally occurring distribution shifts are challenging for image classifiers. For instance, are there certain sub-populations that the models fail to learn on CIFAR-10 but appear trivial to a human? In Section 4.3, we described a simple mixture model based on sub-population shifts that could serve as a starting point for such an investigation.
More broadly, we view our results as motivation for a more thorough evaluation of machine learning research. Currently, the dominant paradigm is to propose a new algorithm and evaluate its performance on existing data. Unfortunately, there is often little understanding to what extent the improvements are broadly applicable. To truly understand generalization questions, more studies should collect insightful new data and evaluate existing algorithms on such data. Since we now have a large body of essentially pre-registered classifiers in open-source repositories, such studies would conform to well established standards of statistically valid research. It is important to note the distinction from current reproducibility efforts in machine learning that usually focus on computational reproducibility, i.e., running published code on the same test data. In contrast, generalization experiments such as ours focus on statistical reproducibility by evaluating classifiers on truly new data (similar to recruiting new participants for a reproducibility experiment in medicine or psychology).
Acknowledgments
Benjamin Recht and Vaishaal Shankar are supported by ONR award N00014-17-1-2502. Benjamin Recht is additionally generously supported in part by NSF award CCF-1359814, ONR awards N00014-17-1-219 and N00014-17-1-2401, the DARPA Fundamental Limits of Learning (Fun LoL) and Lagrange Programs, and an Amazon AWS AI Research Award. Rebecca Roelofs is generously supported by DOE award AC02-05CH11231. Ludwig Schmidt is generously supported by a Google PhD fellowship. Ludwig was a visitor at UC Berkeley while conducting the research for this paper.
We would like to thank Alexei Efros, David Fouhey, and Moritz Hardt for helpful discussions while working on this paper.
References
- Blum and Hardt  Avrim Blum and Moritz Hardt. The ladder: A reliable leaderboard for machine learning competitions. In International Conference on Machine Learning (ICML), 2015. http://arxiv.org/abs/1502.04585.
- Coates et al.  Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Conference on Artificial Intelligence and Statistics (AISTATS), 2011. http://proceedings.mlr.press/v15/coates11a.html.
- DeVries and Taylor  Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. 2017. https://arxiv.org/abs/1708.04552.
- Gastaldi  Xavier Gastaldi. Shake-shake regularization. 2017. https://arxiv.org/abs/1705.07485.
- Hamner  Ben Hamner. Popular datasets over time. https://www.kaggle.com/benhamner/popular-datasets-over-time/code.
- Han et al.  Dongyoon Han, Jiwhan Kim, and Junmo Kim. Deep pyramidal residual networks. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017. https://arxiv.org/abs/1610.02915.
- He et al. [2016a] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016a. https://arxiv.org/abs/1512.03385.
- He et al. [2016b] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision (ECCV), 2016b. https://arxiv.org/abs/1603.05027.
- Hinton et al.  Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. 2012. http://arxiv.org/abs/1207.0580.
- Huang et al.  Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017. https://arxiv.org/abs/1608.06993.
- Kawaguchi et al.  Kenji Kawaguchi, Leslie Pack Kaelbling, and Yoshua Bengio. Generalization in deep learning. 2017. https://arxiv.org/abs/1710.05468.
- Krizhevsky  Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009. https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf.
- Krizhevsky et al.  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), 2012. https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.
- Liu and Deng  Shuying Liu and Weihong Deng. Very deep convolutional neural network based image classification using small training sample size. In Asian Conference on Pattern Recognition (ACPR), 2015. https://ieeexplore.ieee.org/document/7486599/.
- Miller  George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 1995. http://doi.acm.org/10.1145/219717.219748.
- Rahimi and Recht  Ali Rahimi and Benjamin Recht. Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning. In Advances in Neural Information Processing Systems (NIPS), 2009. https://papers.nips.cc/paper/3495-weighted-sums-of-random-kitchen-sinks-replacing-minimization-with-randomization-in-learning.
- Real et al.  Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. Regularized evolution for image classifier architecture search. 2018. http://arxiv.org/abs/1802.01548.
- Simonyan and Zisserman  Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. 2014. https://arxiv.org/abs/1409.1556.
- Torralba et al.  Antonio Torralba, Rob Fergus, and William T. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008. https://ieeexplore.ieee.org/document/4531741/.
- Xie et al.  Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017. https://arxiv.org/abs/1611.05431.
- Yamada et al.  Yoshihiro Yamada, Masakazu Iwamura, and Koichi Kise. Shakedrop regularization. 2018. https://arxiv.org/abs/1802.02375.
- Zagoruyko and Komodakis  Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. 2016. https://arxiv.org/abs/1605.07146.
- Zoph et al.  Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. 2017. https://arxiv.org/abs/1707.07012.
Appendix A Model Descriptions
Table 3 contains the code repositories for the deep models. The specific parameters we list below are the default configurations for each model in the code repositories. Unless otherwise noted, all models were trained on a single GPU.
alexnet_tf: AlexNet model without data augmentation
cudaconvnet: AlexNet model with data augmentation
densenet_BC_100_12: DenseNet with batch size 32, initial learning rate 0.05, depth 100, block type "bottleneck", growth rate 12, compression rate 0.5
nas: Neural Architecture Search model for CIFAR-10, batch size 32, learning rate 0.025, cosine (single period) learning rate decay, auxiliary head loss weighting 0.4, clip global norm of all gradients by 5
pyramidnet_basic_110_270: PyramidNet with depth 110, block type "basic"
pyramidnet_basic_110_84: PyramidNet with depth 110, block type "basic"
random_features_256k: Random 1-layer convolutional network with 256k filters sampled from image patches, patch size = , pool size , pool stride .
random_features_32k_aug: Random 1-layer convolutional network with 32k filters sampled from image patches, patch size = , pool size , pool stride , and horizontal flip data augmentation.
random_features_32k: Random 1-layer convolutional network with 32k filters sampled from image patches, patch size = , pool size , pool stride
resnet_basic_32: ResNet with depth 32, block type "basic"
resnet_basic_44: ResNet with depth 44, block type "basic"
resnet_basic_56: ResNet with depth 56, block type "basic"
resnet_basic_110: ResNet with depth 110, block type "basic"
resnet_preact_basic_110: ResNet-preact with depth 110, block type "basic"
resnet_preact_bottleneck_164: ResNet-preact with depth 164, block type "bottleneck"
resnext_29_4x64d: ResNeXt with depth 29, cardinality 4, base channels 64, batch size 32, and initial learning rate 0.025
resnext_29_8x64d: ResNeXt with depth 29, cardinality 8, base channels 64, batch size 64, and initial learning rate 0.05
shake_shake_32d: Shake-shake with depth 26, base channels 32, S-S-I model
shake_shake_64d: Shake-shake with depth 26, base channels 64, S-S-I model, batch size 64, initial learning rate 0.1
shake_shake_96d: Shake-shake with depth 26, base channels 96, S-S-I model
wide_resnet_tf: Wide residual network with widening factor 10 (implemented in TensorFlow)
wide_resnet_28_10: Wide residual network with depth 28, widening factor 10
Appendix B Details for "Explaining the Gap" Experiments
Inspecting Hard Images
Figure 3 shows examples of the hard images in our new test set that no model correctly classified.
Appendix C Error Ratios
Original and new test-set error rates for each model (in percent, confidence intervals in brackets) and the ratio of new to original error:

| Model | Original Error (%) | New Error (%) | Error Ratio |
| --- | --- | --- | --- |
| shake_shake_64d_cutout [4, 3] | 2.9 [3.2, 2.6] | 7.0 [8.2, 6.0] | 2.4 |
| shake_shake_96d | 2.9 [3.3, 2.6] | 8.1 [9.3, 6.9] | 2.8 |
| shake_shake_64d | 3.0 [3.4, 2.7] | 8.6 [9.9, 7.4] | 2.8 |
| wide_resnet_28_10_cutout [22, 3] | 3.0 [3.4, 2.7] | 8.0 [9.3, 6.9] | 2.6 |
| shake_drop | 3.1 [3.5, 2.8] | 7.7 [9.0, 6.6] | 2.5 |
| shake_shake_32d | 3.4 [3.8, 3.1] | 10.2 [11.6, 8.9] | 3.0 |
| darc | 3.4 [3.8, 3.1] | 10.5 [11.9, 9.2] | 3.1 |
| resnext_29_4x64d | 3.6 [4.0, 3.3] | 10.4 [11.8, 9.1] | 2.9 |
| pyramidnet_basic_110_270 | 3.7 [4.0, 3.3] | 9.5 [10.9, 8.3] | 2.6 |
| resnext_29_8x64d | 3.8 [4.2, 3.4] | 10.0 [11.4, 8.8] | 2.7 |
| wide_resnet_28_10 | 4.1 [4.5, 3.7] | 10.3 [11.7, 9.0] | 2.5 |
| pyramidnet_basic_110_84 | 4.3 [4.7, 3.9] | 10.7 [12.2, 9.4] | 2.5 |
| densenet_BC_100_12 | 4.5 [4.9, 4.1] | 12.4 [13.9, 11.0] | 2.8 |
| neural_architecture_search | 4.6 [5.0, 4.2] | 11.2 [12.6, 9.8] | 2.4 |
| wide_resnet_tf | 5.0 [5.4, 4.6] | 11.5 [13.0, 10.1] | 2.3 |
| resnet_v2_bottleneck_164 | 5.8 [6.3, 5.4] | 14.1 [15.7, 12.6] | 2.4 |
| vgg16_keras [18, 14] | 6.4 [6.9, 5.9] | 14.7 [16.4, 13.2] | 2.3 |
| resnet_basic_110 | 6.5 [7.0, 6.1] | 14.8 [16.5, 13.3] | 2.3 |
| resnet_v2_basic_110 | 6.6 [7.1, 6.1] | 13.5 [15.1, 12.0] | 2.0 |
| resnet_basic_56 | 6.7 [7.2, 6.2] | 15.0 [16.7, 13.5] | 2.2 |
| resnet_basic_44 | 7.0 [7.5, 6.5] | 15.8 [17.4, 14.2] | 2.3 |
| vgg_15_BN_64 [18, 14] | 7.0 [7.5, 6.5] | 15.1 [16.8, 13.6] | 2.2 |
| resnet_preact_tf | 7.3 [7.8, 6.8] | 15.6 [17.3, 14.1] | 2.1 |
| resnet_basic_32 | 7.5 [8.0, 7.0] | 15.1 [16.8, 13.6] | 2.0 |
| cudaconvnet | 11.5 [12.1, 10.8] | 22.5 [24.3, 20.7] | 2.0 |
| random_features_256k_aug | 14.4 [15.1, 13.7] | 26.9 [28.9, 24.9] | 1.9 |
| random_features_32k_aug | 15.0 [15.7, 14.3] | 28.1 [30.1, 26.1] | 1.9 |
| random_features_256k | 15.8 [16.5, 15.1] | 30.1 [32.2, 28.1] | 1.9 |
| random_features_32k | 16.7 [17.4, 16.0] | 32.1 [34.1, 30.0] | 1.9 |
| alexnet_tf | 18.0 [18.8, 17.3] | 31.1 [33.2, 29.1] | 1.7 |
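The bracketed ranges above are confidence intervals for the error estimates. The text does not state how they were computed; as an illustrative sketch only, a 95% Wilson score interval over the test-set size (10,000 images for the original CIFAR-10 test set, roughly 2,000 for the new one) approximately reproduces the reported ranges:

```python
import math

def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion.

    Assumption: the paper does not specify its confidence-interval
    method, so treat this as an approximation, not the actual procedure.
    """
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# A model with 7.0% error on the ~2,000-image new test set:
lo, hi = wilson_interval(0.070, 2000)
print(f"[{lo:.1%}, {hi:.1%}]")  # approximately [6.0%, 8.2%]
```

Up to rounding, this matches the [8.2, 6.0] range reported for the 7.0% new-test error in the table above.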
Appendix D Class-balanced Test Set
The top 25 keywords per class capture approximately 95% of the CIFAR-10 dataset. However, the remaining 5% of CIFAR-10 that is not covered by the top 25 keywords is heavily skewed toward the ship class. As a result, our new dataset was not precisely class-balanced and contained only 8% ships.
We therefore created a class-balanced version of the new test set with exactly 2,000 images. For this version, we selected the top 50 keywords in each class and computed a fractional number of images for each keyword: the fractional ratio of that keyword within the corresponding CIFAR-10 class multiplied by the desired number of images per class. We then sorted the fractional keyword counts by their rounding error and removed one image at a time from the keywords with the largest rounding gap until we had exactly 2,000 images. Most models achieve slightly higher accuracy on the resulting dataset; Table 5 lists the difference in accuracy relative to the new test set described in the main paper.
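The balancing step described above (fractional targets per keyword, then removing images with the largest rounding gap) is a largest-remainder style apportionment. The sketch below illustrates it under our own assumptions about rounding direction and tie-breaking, which the text does not specify; the keyword names are hypothetical:

```python
import math

def balance_keyword_counts(fractions: dict[str, float], target: int) -> dict[str, int]:
    """Allocate `target` images across keywords proportionally to `fractions`.

    Mirrors the procedure described above: start from rounded-up fractional
    counts, then repeatedly remove one image from the keyword whose count
    overshoots its fractional target the most, until exactly `target` remain.
    Rounding direction and tie-breaking are assumptions, not the paper's spec.
    """
    raw = {k: f * target for k, f in fractions.items()}
    counts = {k: math.ceil(v) for k, v in raw.items()}
    while sum(counts.values()) > target:
        # keyword with the largest rounding gap (count minus fractional target)
        worst = max(counts, key=lambda k: counts[k] - raw[k])
        counts[worst] -= 1
    return counts

# Toy example: hypothetical keyword shares within one class,
# 200 images per class (2,000 total across 10 classes).
shares = {"coupe": 0.4375, "sedan": 0.3125, "convertible": 0.25}
print(balance_keyword_counts(shares, 200))
```

Starting from rounded-up counts guarantees the loop only ever removes images, matching the "removed one image ... until we had 2,000 images" phrasing of the procedure.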
| Model | Original Accuracy (%) | New Accuracy (%) | Gap | Δ Accuracy |
| --- | --- | --- | --- | --- |
| shake_shake_64d_cutout [4, 3] | 97.1 [96.8, 97.4] | 93.1 [90.9, 93.3] | 4 | 0.13 |
| shake_shake_96d | 97.1 [96.7, 97.4] | 92.0 [89.7, 92.2] | 5.1 | 0.015 |
| shake_shake_64d | 97.0 [96.6, 97.3] | 91.9 [89.6, 92.2] | 5.1 | 0.46 |
| wide_resnet_28_10_cutout [22, 3] | 97.0 [96.6, 97.3] | 92.1 [89.8, 92.3] | 4.9 | 0.12 |
| shake_drop | 96.9 [96.5, 97.2] | 92.3 [90.0, 92.5] | 4.6 | 0.019 |
| shake_shake_32d | 96.6 [96.2, 96.9] | 90.0 [87.6, 90.4] | 6.6 | 0.19 |
| darc | 96.6 [96.2, 96.9] | 89.9 [87.5, 90.3] | 6.7 | 0.39 |
| resnext_29_4x64d | 96.4 [96.0, 96.7] | 90.1 [87.8, 90.5] | 6.2 | 0.54 |
| pyramidnet_basic_110_270 | 96.3 [96.0, 96.7] | 90.5 [88.1, 90.9] | 5.8 | 0.05 |
| resnext_29_8x64d | 96.2 [95.8, 96.6] | 90.1 [87.7, 90.5] | 6.1 | 0.14 |
| wide_resnet_28_10 | 95.9 [95.5, 96.3] | 90.1 [87.8, 90.5] | 5.8 | 0.44 |
| pyramidnet_basic_110_84 | 95.7 [95.3, 96.1] | 89.6 [87.2, 90.0] | 6.1 | 0.34 |
| densenet_BC_100_12 | 95.5 [95.1, 95.9] | 87.9 [85.4, 88.4] | 7.6 | 0.32 |
| neural_architecture_search | 95.4 [95.0, 95.8] | 89.2 [86.8, 89.6] | 6.2 | 0.38 |
| wide_resnet_tf | 95.0 [94.6, 95.4] | 88.8 [86.4, 89.3] | 6.2 | 0.33 |
| resnet_v2_bottleneck_164 | 94.2 [93.7, 94.6] | 86.1 [83.6, 86.7] | 8.1 | 0.2 |
| vgg16_keras [18, 14] | 93.6 [93.1, 94.1] | 85.6 [83.1, 86.3] | 8 | 0.35 |
| resnet_basic_110 | 93.5 [93.0, 93.9] | 85.4 [82.9, 86.1] | 8.1 | 0.24 |
| resnet_v2_basic_110 | 93.4 [92.9, 93.9] | 86.9 [84.4, 87.5] | 6.5 | 0.41 |
| resnet_basic_56 | 93.3 [92.8, 93.8] | 84.9 [82.3, 85.5] | 8.5 | -0.11 |
| resnet_basic_44 | 93.0 [92.5, 93.5] | 84.8 [82.2, 85.5] | 8.2 | 0.58 |
| vgg_15_BN_64 [18, 14] | 93.0 [92.5, 93.5] | 85.0 [82.5, 85.7] | 7.9 | 0.19 |
| resnet_preact_tf | 92.7 [92.2, 93.2] | 85.1 [82.6, 85.8] | 7.6 | 0.74 |
| resnet_basic_32 | 92.5 [92.0, 93.0] | 85.2 [82.7, 85.9] | 7.3 | 0.34 |
| cudaconvnet | 88.5 [87.9, 89.2] | 78.2 [75.5, 79.2] | 10 | 0.66 |
| random_features_256k_aug | 85.6 [84.9, 86.3] | 73.6 [70.8, 74.8] | 12 | 0.47 |
| random_features_32k_aug | 85.0 [84.3, 85.7] | 72.2 [69.4, 73.4] | 13 | 0.26 |
| random_features_256k | 84.2 [83.5, 84.9] | 70.5 [67.7, 71.7] | 14 | 0.58 |
| random_features_32k | 83.3 [82.6, 84.0] | 68.7 [65.9, 70.0] | 15 | 0.76 |
| alexnet_tf | 82.0 [81.2, 82.7] | 69.2 [66.4, 70.5] | 13 | 0.32 |
Appendix E Keywords