Large-scale human-labeled image datasets such as ImageNet (deng2009imagenet) have greatly contributed to the rapid progress of research in image classification. In recent years, considerable effort has been put into designing novel network architectures (he2016deep; hu2018squeeze) and advanced optimization algorithms (kingma2014adam) to improve the training of image classifiers based on deep neural networks (DNNs), while little attention has been paid to comprehensive and fair evaluation and comparison of model performance. Conventional model evaluation for image classification generally follows a three-step approach (burnham2003model). First, pre-select a number of images from the space of all possible natural images (i.e., the natural image manifold) to form the test set. Second, collect a human label for each image in the test set to identify its ground-truth category. Third, rank the competing classifiers according to their goodness of fit (e.g., accuracy) on the test set; the one with the best result is declared the winner.
A significant problem with this methodology is the apparent contradiction between the enormous size and high dimensionality of the natural image manifold and the limited scale of affordable testing (i.e., human labeling, or verifying predicted labels, which is expensive and time-consuming). As a result, a typical "large-scale" test set for image classification contains only tens of thousands of natural images, which are therefore extremely sparsely distributed on the natural image manifold. Model comparison based on such a limited number of samples assumes that they sufficiently represent the whole population, an assumption that has proven doubtful in image classification. Specifically, recht2019imagenet found that a minute natural distribution shift leads to a large drop in accuracy for a broad range of image classifiers on both CIFAR-10 (krizhevsky2009learning) and ImageNet (deng2009imagenet), suggesting that current test sets may be far from sufficient to represent hard natural images encountered in the real world. Another problem with the conventional methodology is that the test sets are pre-selected and therefore fixed. This leaves the door open for adapting classifiers to the test images, deliberately or unintentionally, via extensive hyperparameter tuning, raising the risk of overfitting. As a result, it is never guaranteed that image classifiers with highly competitive performance on such small and fixed test sets generalize to real-world natural images with much richer content variations.
In order to reliably measure the progress in image classification and to fairly test the generalizability of existing classifiers in a natural setting, we believe it is necessary to compare the classifiers on a much larger image collection, on the order of millions or even billions. The main challenge here is how to exploit such a large-scale test set under a very limited budget for human labeling, knowing that collecting ground-truth labels for all images is extremely difficult, if not impossible.
In this work, we propose an efficient and practical methodology, the MAximum Discrepancy (MAD) competition, to meet this challenge. Inspired by wang2008maximum and ma2019group, instead of trying to prove an image classifier correct using a small and fixed test set, MAD starts with a large-scale unlabeled image set and attempts to falsify a classifier by finding images on which its predictions are in strong disagreement with those of the other competing classifiers (see Figure 1). A classifier that is harder to falsify in MAD is considered better. The initial image set for MAD to explore can be made arbitrarily large, provided that computational prediction for all competing classifiers is cheap. To quantify the discrepancy between two classifiers on one image, we propose a weighted distance over the WordNet hierarchy (miller1998wordnet), which is more semantically aligned with human cognition than the traditional binary judgment (agree vs. disagree). The set of model-dependent images selected by MAD is the most informative in discriminating the competing classifiers. Subjective experiments on the MAD test set reveal the relative strengths and weaknesses of the classifiers, and identify the training techniques and architecture choices that improve generalizability to the natural image manifold. This suggests potential ways to improve a classifier or to combine aspects of multiple classifiers.
We apply the MAD competition to compare eleven ImageNet classifiers, and find that MAD verifies the relative improvements achieved by recent DNN-based methods, with a minimal subjective testing budget. MAD is readily extensible, allowing future classifiers to be added into the competition with little additional cost.
2 The MAD Competition Methodology
The general problem of model comparison in image classification may be formulated as follows. We work with the natural image manifold $\mathcal{X}$, upon which we define a class label $c(x)$ for every $x \in \mathcal{X}$, where $c(x) \in \{1, \ldots, K\}$ and $K$ is the number of categories. We assume a subjective assessment environment, in which a human subject can identify the category membership of any natural image among all possible categories. A group of $m$ image classifiers $\{f_i\}_{i=1}^{m}$ is also assumed, each of which takes a natural image $x$ as input and makes a prediction $f_i(x)$ of $c(x)$. The goal is to compare the relative performance of the $m$ classifiers under very limited resources for subjective testing.
The conventional model comparison method for image classification first samples a natural image set $\mathcal{D} = \{x^{(k)}\}_{k=1}^{n} \subset \mathcal{X}$. For each image $x^{(k)}$, we ask human annotators to provide the ground-truth label $c(x^{(k)})$. Since human labeling is expensive and time-consuming, and DNN-based classifiers are hungry for labeled data in the training stage (krizhevsky2012imagenet), $n$ is typically small, on the order of tens of thousands (Russakovsky2015). The predictions of the classifiers are compared against the human labels by computing the empirical classification accuracy
$$\mathrm{Acc}(f_i; \mathcal{D}) = \frac{1}{n} \sum_{k=1}^{n} \mathbb{I}\big[f_i(x^{(k)}) = c(x^{(k)})\big]. \quad (1)$$
A classifier $f_i$ with a higher classification accuracy is said to outperform a classifier $f_j$ with a lower accuracy.
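Eq. (1) amounts to counting label matches. A minimal Python sketch, with hypothetical labels for illustration:

```python
# Empirical classification accuracy (Eq. (1)): the fraction of test
# images whose predicted label matches the human ground-truth label.
def accuracy(predictions, ground_truth):
    """predictions, ground_truth: equal-length sequences of class labels."""
    assert len(predictions) == len(ground_truth)
    correct = sum(p == c for p, c in zip(predictions, ground_truth))
    return correct / len(ground_truth)

# Toy example with made-up labels.
preds = ["dog", "cat", "dog", "watermelon"]
truth = ["dog", "cat", "cat", "watermelon"]
print(accuracy(preds, truth))  # 0.75
```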
As an alternative, the proposed MAD competition methodology aims to falsify a classifier in the most efficient way with the help of other competing classifiers. A classifier that is more likely to be falsified is considered worse.
2.1 The MAD Competition Procedure
The MAD competition methodology starts by sampling an image set $\mathcal{U}$ from the natural image manifold $\mathcal{X}$. Since the number of images selected by MAD for subjective testing is independent of the size of $\mathcal{U}$, we may choose $\mathcal{U}$ to be arbitrarily large such that it provides dense coverage of (i.e., sufficiently represents) $\mathcal{X}$. MAD relies on a distance measure to quantify the degree of discrepancy between the predictions of any two classifiers. The most straightforward measure is the 0-1 loss:
$$d_{0/1}\big(f_1(x), f_2(x)\big) = \mathbb{I}\big[f_1(x) \neq f_2(x)\big]. \quad (2)$$
Unfortunately, it ignores the semantic relations between class labels, which may be crucial in distinguishing two classifiers, especially when they share similar design philosophies (e.g., using DNNs as backbones) and are trained on the same image set (e.g., ImageNet). For example, misclassifying a "chihuahua" as a dog of another species is clearly more acceptable than misclassifying it as a "watermelon". We propose to leverage the semantic hierarchy in WordNet (miller1998wordnet) to measure the distance between two (predicted) class labels. Specifically, we model WordNet as a weighted undirected graph (although WordNet is largely tree-structured, a child node in WordNet may have multiple parent nodes). Each edge $e_{uv}$ connects a parent node $u$ at a more general level (e.g., canine) to its child node $v$ at a more specific level (e.g., dog). A nonnegative weight $w_{uv}$ is assigned to each edge to encode the semantic distance between $u$ and $v$; a larger $w_{uv}$ indicates that $u$ and $v$ are semantically more dissimilar. We measure the distance between two labels as the sum of the weights assigned to the edges along the shortest path $P^\star$ connecting them:
$$d\big(f_1(x), f_2(x)\big) = \sum_{e_{uv} \in P^\star} w_{uv}. \quad (3)$$
Eq. (3) reduces to the standard graph hop distance between two vertices by setting $w_{uv} = 1$ for all edges. Instead, we design $w_{uv}$ to be inversely proportional to the tree depth level $l(u)$ of the parent node $u$. In other words, we prefer shortest paths that traverse the root node (or nodes at shallower depths) as a way of encouraging $f_1(x)$ and $f_2(x)$ to differ at a more general level (e.g., vehicle rather than watercraft). Figure 2 shows the semantic advantages of our weighting over equal weighting. With the distance measure at hand, the optimal image in terms of discriminating $f_1$ and $f_2$ can be obtained by maximizing the discrepancy between the two classifiers over $\mathcal{U}$:
$$x^\star = \arg\max_{x \in \mathcal{U}} \; d\big(f_1(x), f_2(x)\big). \quad (4)$$
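To make the weighted shortest-path distance concrete, here is a minimal sketch on a toy WordNet-like hierarchy. The node names and the exact weighting (weight equal to the reciprocal of the parent's depth) are illustrative assumptions; the text only specifies inverse proportionality to the parent's depth level.

```python
import heapq

# Toy label hierarchy (hypothetical, WordNet-like): (parent, child) edges,
# listed so that each parent appears before its children.
edges = [
    ("entity", "animal"), ("entity", "food"),
    ("animal", "canine"), ("canine", "chihuahua"), ("canine", "wolf"),
    ("food", "watermelon"),
]
depth = {"entity": 1}   # tree depth level of each node
graph = {}              # undirected weighted adjacency list
for parent, child in edges:
    depth[child] = depth[parent] + 1
    w = 1.0 / depth[parent]  # assumed weighting: inverse parent depth
    graph.setdefault(parent, []).append((child, w))
    graph.setdefault(child, []).append((parent, w))

def semantic_distance(c1, c2):
    """Sum of edge weights along the shortest path between two labels:
    Dijkstra's algorithm on the weighted hierarchy."""
    dist, heap = {c1: 0.0}, [(0.0, c1)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == c2:
            return d
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return float("inf")

# Confusing "chihuahua" with "wolf" (same parent) is semantically closer
# than confusing it with "watermelon" (path through the root).
print(semantic_distance("chihuahua", "wolf"))        # 1/3 + 1/3 ≈ 0.667
print(semantic_distance("chihuahua", "watermelon"))
```

Because deep (specific) edges are cheap and shallow (general) edges are expensive, large distances correspond to disagreements at a general level, as intended.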
The queried label $c(x^\star)$ of the selected image leads to three possible outcomes (see Figure 1):
Case I. Both classifiers make correct predictions. Although impossible under the problem formulation above (a single ground-truth label per image), it is not uncommon in practice for a natural image to contain multiple distinct objects (e.g., guacamole and mortar). In this case, $f_1$ and $f_2$ successfully recognize different objects in $x^\star$, indicating that both classifiers tend to perform at a high level. By restricting $\mathcal{U}$ to contain only natural images with a single salient object, we may reduce the possibility of this outcome.
Case II. $f_1$ (or $f_2$) makes a correct prediction, while $f_2$ (or $f_1$) makes an incorrect one. In this case, MAD automatically identifies a strong failure case that falsifies one classifier but not the other; a clear winner is obtained. The selected image provides the strongest evidence for differentiating the two classifiers and ranking their relative performance.
Case III. Both classifiers make incorrect (but, by construction, different) predictions in a multiclass image classification problem (i.e., $K > 2$). Although both classifiers make mistakes, they differ substantially during inference, which in turn provides a strong indication of their respective weaknesses. (This is in stark contrast to natural adversarial examples in ImageNet-A (hendrycks2019nae), where different image classifiers tend to make consistent mistakes; for example, VGG16BN and ResNet34 make the same incorrect predictions on a large fraction of the images in ImageNet-A.) Depending on the subjective experimental setting, $x^\star$ may be used to rank the classifiers based on Eq. (3) if the full label $c(x^\star)$ has been collected. As will be clear later, due to the difficulty of subjective testing, we collect only a partial label: "$x^\star$ contains neither $f_1(x^\star)$ nor $f_2(x^\star)$". In this case, only Eq. (1) can be applied, and $x^\star$ contributes less to the performance comparison between the two classifiers.
In practice, to obtain a reliable performance comparison between $f_i$ and $f_j$, we choose the top-$k$ images in $\mathcal{U}$ with the largest distances computed by Eq. (3) to form the test subset $\mathcal{S}_{ij}$. MAD runs this game among all $\binom{m}{2}$ distinct pairs of classifiers, resulting in the final MAD test set $\mathcal{S} = \bigcup_{i<j} \mathcal{S}_{ij}$. The number of natural images in $\mathcal{S}$ is at most $k\binom{m}{2}$, which is independent of the size of $\mathcal{U}$. In other words, applying MAD to a larger image set has no impact on the cost of human labeling. In scenarios where the cost of computational prediction can be ignored, MAD encourages expanding $\mathcal{U}$ to cover as many "free" natural images as possible.
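The pairwise selection step can be sketched as follows. The classifiers, "images", and distance function below are toy stand-ins (here the 0-1 loss, for brevity), not the actual models or the WordNet distance:

```python
import itertools

def mad_test_set(classifiers, images, distance, k):
    """For each pair of classifiers, select the k images maximizing the
    prediction discrepancy; the union of the pairwise subsets forms the
    MAD test set (at most k * m * (m - 1) / 2 images for m classifiers)."""
    selected = set()
    for f, g in itertools.combinations(classifiers, 2):
        ranked = sorted(images, key=lambda x: distance(f(x), g(x)),
                        reverse=True)
        selected.update(ranked[:k])
    return selected

# Toy run: integers stand in for images, simple functions for classifiers.
f1 = lambda x: x % 2
f2 = lambda x: x % 3
S = mad_test_set([f1, f2], range(100), lambda a, b: int(a != b), k=5)
print(len(S))  # 5: one pair of classifiers, top-5 disagreements
```

Note that the subjective-testing cost depends only on `k` and the number of classifier pairs, never on the size of `images`, mirroring the argument above.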
We now describe our subjective assessment environment for collecting human labels. Given an image $x \in \mathcal{S}_{ij}$, which is associated with two classifiers $f_i$ and $f_j$, we pose two binary questions to human annotators: "Does $x$ contain an $f_i(x)$?" and "Does $x$ contain an $f_j(x)$?". When both answers are no (corresponding to Case III), we stop querying the ground-truth label of $x$, because it is difficult for humans to select one among $K$ classes, especially when $K$ is large and the ontology of the classes is complex.
After subjective testing, we first compare the classifiers in pairs and then aggregate the pairwise statistics into a global ranking. Specifically, we compute the empirical classification accuracies of $f_i$ and $f_j$ on $\mathcal{S}_{ij}$ using Eq. (1), denoted by $a_{ij}$ and $a_{ji}$, respectively. When $|\mathcal{S}_{ij}|$ is small, Laplace smoothing is employed to smooth the estimation. Note that $a_{ij} + a_{ji}$ may be greater than one because of Case I. The pairwise accuracy statistics of all classifiers form a matrix $A$, from which we compute another matrix $B$ with $b_{ij} = a_{ij} / a_{ji}$, indicating the pairwise dominance of $f_i$ over $f_j$. We aggregate the pairwise comparison results into a global ranking $r \in \mathbb{R}^{m}$ using Perron rank (saaty1984inconsistency):
$$r = \lim_{t \to \infty} \frac{B^{t} \mathbf{1}}{\mathbf{1}^{\top} B^{t} \mathbf{1}}, \quad (5)$$
where $\mathbf{1}$ is an $m$-dimensional vector of all ones. The limit of Eq. (5) is the normalized principal eigenvector of $B$ corresponding to the largest eigenvalue, where $r_i > 0$ for all $i$ and $\sum_{i} r_i = 1$. The larger $r_i$ is, the better $f_i$ performs in the MAD competition. Other ranking aggregation methods such as HodgeRank (jiang2011statistical) may also be applied. We summarize the workflow of the MAD competition in Algorithm 1.
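As a numerical sanity check, the Perron-rank limit can be computed by power iteration with normalization. The dominance matrix below is made up for illustration (a perfectly consistent one, so the result is easy to verify by hand):

```python
import numpy as np

def perron_rank(B, iters=1000):
    """Global ranking from a positive pairwise-dominance matrix B:
    the limit of B^t 1 / (1^T B^t 1), i.e., the normalized principal
    (Perron) eigenvector of B, computed by power iteration."""
    r = np.ones(B.shape[0])
    for _ in range(iters):
        r = B @ r
        r = r / r.sum()  # normalize so the entries sum to one
    return r

# Hypothetical dominance matrix for three classifiers: b_ij = a_ij / a_ji.
# This one is consistent (b_ij = v_i / v_j with v = (4, 2, 1)).
B = np.array([[1.0, 2.0, 4.0],
              [0.5, 1.0, 2.0],
              [0.25, 0.5, 1.0]])
r = perron_rank(B)
print(r)  # ≈ [4/7, 2/7, 1/7]: classifier 0 ranks highest
```

For a consistent matrix the iteration converges immediately to the ratio vector; for real (inconsistent) data it converges to the Perron eigenvector.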
Finally, it is straightforward and cost-effective to add an $(m+1)$-th classifier $f_{m+1}$ into the current MAD competition. No change is needed to the sampled $\mathcal{S}$ and the associated subjective testing. The additional work is to select at most $km$ new images from $\mathcal{U}$ (the top-$k$ for each pairing of $f_{m+1}$ with an existing classifier) for human labeling. We then enlarge $B$ by one along its rows and columns, and insert the pairwise comparison statistics between $f_{m+1}$ and the previous $m$ classifiers. An updated global ranking vector $r$ can be computed using Eq. (5). We summarize the procedure of adding a new classifier in Algorithm 2.
3 Application to ImageNet Classifiers
In this section, we apply the proposed MAD competition methodology to comparing ImageNet classifiers. We focus on ImageNet (deng2009imagenet) for two reasons. First, it is one of the first large-scale and most widely used datasets in image classification. Second, improvements on ImageNet appear to be plateauing, which provides an ideal platform for MAD to distinguish newly proposed image classifiers at a finer level.
3.1 Experimental Setups
Inspired by hendrycks2019nae, we focus on the same subset of ImageNet classes used there, avoiding rare and abstract classes as well as classes whose definitions have changed substantially since 2012. For each class, we crawl a large number of images from Flickr, forming the unlabeled set $\mathcal{U}$. Although MAD allows us to enlarge $\mathcal{U}$ arbitrarily at essentially no cost, we choose the size of $\mathcal{U}$ to be approximately three times that of the ImageNet validation set to provide a relatively easy environment for probing the generalizability of the classifiers. As will be clear in Section 3.2, the current setting of $\mathcal{U}$ is sufficient to discriminate the competing classifiers. To guarantee content independence between ImageNet and $\mathcal{U}$, we only collect images uploaded after 2013. It is worth noting that no data cleaning (e.g., removal of inappropriate content or near-duplicates) is necessary at this stage, since we only need to ensure that the selected subset for human labeling is eligible.
We select eleven representative ImageNet classifiers for benchmarking: VGG16BN (simonyan2014very) with batch normalization (ioffe2015batch), ResNet34, ResNet101 (he2016deep), WRN101-2 (zagoruyko2016wide), ResNeXt101-32x4d (xie2017aggregated), SE-ResNet-101, SENet154 (hu2018squeeze), NASNet-A-Large (zoph2018learning), PNASNet-5-Large (liu2018progressive), EfficientNet-B7 (tan2019efficientnet), and WSL-ResNeXt101-32x48d (mahajan2018exploring). Since VGG16BN and ResNet34 have nearly identical accuracies on the ImageNet validation set, it is particularly interesting to see which generalizes better to the natural image manifold. We compare ResNet34 with ResNet101 to see the influence of network depth. WRN101-2, ResNeXt101-32x4d, and SE-ResNet-101 are different improved versions of ResNet101. We also include two state-of-the-art classifiers: WSL-ResNeXt101-32x48d and EfficientNet-B7. The former leverages the power of weakly supervised pre-training on Instagram data, while the latter makes use of compound scaling. We use publicly available code repositories for all DNN-based models, whose top-1 accuracies on the ImageNet validation set are listed in Table 1 for reference.
When constructing $\mathcal{S}$ using the maximum discrepancy principle, we add another constraint based on prediction confidence. Specifically, a candidate image $x$ associated with $f_i$ and $f_j$ is filtered out unless both classifiers predict their labels with confidence scores (i.e., maximum softmax probabilities) above a predefined threshold. We include the confidence constraint for two main reasons. First, if $f_i$ misclassifies $x$ with low confidence, it is highly likely that $x$ lies near the decision boundary and thus carries less information for improving the decision rules of $f_i$. Second, some images in $\mathcal{U}$ do not necessarily fall into any of the ImageNet classes and are bound to be misclassified (a problem closely related to out-of-distribution detection). If they are misclassified by $f_i$ with high confidence, we consider them hard counterexamples for $f_i$. To encourage class diversity in $\mathcal{S}$, we retain a maximum of three images sharing the same predicted label. In addition, we exclude non-natural images. Figure 4 visually compares representative "manhole cover" images in $\mathcal{S}$ and the ImageNet validation set (see more in Figure 6).
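The filtering described above can be sketched as follows. The tuple format, the threshold value of 0.9, and the confidence scores are illustrative assumptions (the per-class cap of three is from the text):

```python
from collections import defaultdict

def filter_candidates(candidates, threshold=0.9, per_class_cap=3):
    """candidates: list of (image_id, label_i, conf_i, label_j, conf_j)
    tuples, sorted by decreasing discrepancy. Keep only images that both
    classifiers predict confidently, and at most `per_class_cap` images
    per predicted label to encourage class diversity."""
    kept, counts = [], defaultdict(int)
    for img, li, ci, lj, cj in candidates:
        if min(ci, cj) < threshold:   # low confidence: near the boundary
            continue
        if counts[li] >= per_class_cap or counts[lj] >= per_class_cap:
            continue                   # class already well represented
        counts[li] += 1
        counts[lj] += 1
        kept.append(img)
    return kept

# Toy run with made-up predictions and confidences.
cands = [
    ("img1", "dog", 0.95, "wolf", 0.97),
    ("img2", "dog", 0.50, "wolf", 0.99),  # dropped: low confidence
    ("img3", "dog", 0.99, "cat", 0.93),
]
print(filter_candidates(cands))  # ['img1', 'img3']
```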
Collecting Human Labels
As described in Section 2.1, given an image $x \in \mathcal{S}$, human annotators need to answer two binary questions. In our subjective experiments, we invite five volunteer graduate students, who are experts in computer vision, to label the images in $\mathcal{S}$ (the choice of the budget $k$ is analyzed in Section 3.2). If more than three of them have difficulty labeling an image $x$ (associated with $f_i$ and $f_j$), it is discarded and replaced by the image with the next largest distance in Eq. (3). A majority vote decides the final label when disagreement occurs. After subjective testing, we find that most of the annotated images belong to Case II, which form the cornerstone of the subsequent data analysis; the remainder pertain to Case I and Case III.
3.2 Experimental Results
Pairwise Ranking Results
Figure 3 shows the pairwise accuracy matrix $A$ from the current MAD competition, where a larger value of an entry (a brighter color) indicates a higher accuracy on the subset selected jointly by the corresponding row and column models. An interesting phenomenon we observe is that when two classifiers $f_i$ and $f_j$ perform at a similar level on $\mathcal{S}_{ij}$ (i.e., $a_{ij}$ is close to $a_{ji}$), the sum $a_{ij} + a_{ji}$ also tends to be small. That is, more images on which both make incorrect but different predictions (Case III) have been selected than images falling into Case I. Taking a closer look at the images in $\mathcal{S}_{ij}$, we may reveal the respective biases of $f_i$ and $f_j$. For example, we find that WSL-ResNeXt101-32x48d tends to focus on foreground objects, while EfficientNet-B7 attends more to background objects (see Figure 7). We also find several common failure modes of the competing classifiers through pairwise comparison, e.g., excessive reliance on relation inference (see Figure 8), bias towards low-level visual features (see Figure 9), and difficulty in recognizing rare instantiations of objects (see Figures 4 and 6).
Global Ranking Results
We present the global ranking results of MAD in Table 1, where we find that MAD tracks the steady progress in image classification, as verified by a reasonable Spearman rank-order correlation coefficient (SRCC) between the accuracy rank on the ImageNet validation set and the MAD rank on our test set. Moreover, by examining the differences between the two rankings, we obtain a number of interesting findings. First, VGG16BN outperforms not only ResNet34 but also ResNet101, suggesting that under similar computation budgets, VGG-like networks may generalize better to hard samples than networks with residual connections. Second, both networks equipped with the squeeze-and-excitation mechanism, i.e., SE-ResNet-101 and SENet154, move up by two places in the MAD ranking. This indicates that explicitly modeling dependencies between channel-wise feature maps is quite beneficial to image classification. Third, of the two models that exploit neural architecture search, NASNet-A-Large is still ranked high by MAD, while, interestingly, the rank of PNASNet-5-Large drops considerably. This implies that MAD may prefer the global search strategy used in NASNet-A-Large to the progressive cell-wise search strategy adopted in PNASNet-5-Large, although the former is slightly inferior in ImageNet top-1 accuracy. Last but not least, the two top performers, WSL-ResNeXt101-32x48d and EfficientNet-B7, remain the best in the MAD competition (irrespective of their relative ranking), verifying the effectiveness of large-scale hashtag data pre-training and compound scaling in the context of image classification.
[Table 1: each model's ImageNet top-1 accuracy, accuracy rank, MAD rank, and rank difference.]
We analyze the key hyperparameter of MAD, i.e., the number of images $k$ in each $\mathcal{S}_{ij}$ selected for subjective testing. We calculate the SRCC values between the full top-$k$ ranking (as reference) and the rankings obtained with smaller values of $k$. As shown in Figure 5, the ranking results are fairly stable once $k$ is moderately large. This supports our choice of $k$, since the final global ranking already enters a stable plateau.
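This stability check amounts to comparing ranking vectors. A minimal SRCC sketch for tie-free rankings (the rankings below are hypothetical):

```python
def srcc(rank_a, rank_b):
    """Spearman rank-order correlation between two rankings without ties:
    1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), where d_i is the per-item
    difference between the two rank positions."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical global rankings of five classifiers under two budgets k.
reference = [1, 2, 3, 4, 5]
smaller_k = [1, 3, 2, 4, 5]  # two neighbors swapped
print(srcc(reference, smaller_k))  # 0.9
```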
We have presented a new methodology, the MAD competition, for comparing image classification models. MAD effectively mitigates the conflict between the prohibitively large natural image manifold that we have to evaluate against and the expensive human labeling effort that we aim to minimize. Much of our endeavor has been dedicated to selecting natural images that are optimal in terms of distinguishing or falsifying classifiers. MAD requires explicit specification of the image classifiers to be compared, and provides an effective means of exposing the respective flaws of competing classifiers. It also directly contributes to model interpretability and helps us analyze the models' focus and bias when making predictions. We have demonstrated the effectiveness of the MAD competition in comparing ImageNet classifiers, and drawn a number of interesting observations that are not apparent from the (often quite close) accuracy numbers on the ImageNet validation set.
The application scope of MAD extends far beyond image classification. It can be applied to computational models that produce discrete-valued outputs, and is particularly useful when the sample space is large and the ground-truth label being predicted is expensive to measure. Examples include medical and hyperspectral image classification (filipovych2011semi; wang2014semisupervised), where significant domain expertise is crucial to obtain correct labels. MAD can also be used to spot rare but fatal failures in high-cost, failure-sensitive applications, e.g., comparing the perception systems of autonomous cars (chen2015deepdriving) under unconstrained real-world weather, lighting conditions, and road scenes. In addition, by restricting the test set to some domain of interest, MAD allows comparison of classifiers in more specific applications, e.g., fine-grained image recognition.
We feel it important to note the limitations of the current MAD. First, MAD compares models only relatively, and cannot give an absolute performance measure. Second, as an "error spotting" mechanism, MAD implicitly assumes that the models in the competition are reasonably good (e.g., ImageNet classifiers); otherwise, the selected counterexamples may be less meaningful. Third, although the distance in Eq. (3) is sufficient to distinguish multiple classifiers in the current experimental setting, it does not yet fully reflect human cognition of image label semantics. Fourth, the confidence computation used to select images is not perfectly grounded. How to marry the MAD competition with Bayesian probability theory to model uncertainties during image selection is an interesting direction for future research. Due to the above issues, MAD should be viewed as complementary to, rather than a replacement for, conventional accuracy comparison in image classification.
Our method arises as a natural combination of concepts drawn from two separate lines of research. The first explores the idea of model falsification as model comparison. wang2008maximum introduced the maximum differentiation competition for comparing computational models of continuous perceptual quantities, which was further extended by ma2019group. NIPS2017_6944 developed a computational method for comparing hierarchical image representations in terms of their ability to explain perceptual sensitivity in humans. MAD, on the other hand, is tailored to applications with discrete model responses and relies on a semantic distance measure to compute model discrepancy. The second endeavour arises from the machine learning literature on generating adversarial examples (szegedy2013intriguing; goodfellow2014explaining; madry2017towards) and evaluating image classifiers on new test sets (geirhos2018imagenet; recht2019imagenet; hendrycks2019benchmarking; hendrycks2019nae). The images selected by MAD can be seen as a form of natural adversarial examples, as each of them fools at least one classifier (once Case I is excluded). Unlike adversarial images, which often transfer across and mislead most classifiers, MAD-selected images emphasize discriminability among the competing models. Different from recently created test sets, the MAD-selected set is adapted to the competing classifiers with the goal of minimizing human labeling effort. In addition, MAD can be linked to the popular technique of differential testing (mckeeman1998differential) in software engineering.
Appendix A More Visualization Results
Due to the page limit, we put some additional figures here.