Human learners appreciate that some facts demand memorization whereas other facts support generalization. For example, English verbs have irregular cases that must be memorized (e.g., go→went) and regular cases that generalize well (e.g., kiss→kissed, miss→missed). Likewise, deep neural networks have the capacity to memorize rare or irregular forms but nonetheless generalize across instances that share common patterns or structures. We analyze how individual instances are treated by a model on the memorization-generalization continuum via a consistency score (C-score). The score is the expected accuracy of a particular architecture for a held-out instance on a training set of a fixed size sampled from the data distribution. We obtain empirical estimates of this score for individual instances in multiple datasets, and we show that the score identifies out-of-distribution and mislabeled examples at one end of the continuum and regular examples at the other end. We explore three proxies to the consistency score: kernel density estimation on input representations, kernel density estimation on hidden representations, and the time course of training, i.e., learning speed. In addition to helping to understand the memorization versus generalization dynamics during training, the C-score proxies have potential application for out-of-distribution detection, curriculum learning, and active data collection.
Human learning requires both memorization of specific facts and generalization to novel situations and contexts. Both memorization and generalization can be needed in a domain, and the boundary between when to memorize and when to generalize can be fuzzy. For example, in learning the past tense form of English verbs, there are some verbs whose past tenses must simply be memorized (go→went, eat→ate, hit→hit) and there are many regular verbs that obey the rule of appending “ed” (kiss→kissed, kick→kicked, brew→brewed, etc.). Generalization to a novel word typically follows the “ed” rule, for example, bink→binked. Intermediate between the exception verbs and regular verbs are subregularities: sets of exception verbs that have consistent structure (e.g., the mapping of sing→sang, ring→rang, spring→sprang). These subregularities might suggest a generalization rule to novel verbs that would obtain, for example, dring→drang. Note that rule-governed and exception cases can have very similar forms, which increases the difficulty of learning each. Consider one-syllable verbs containing ‘ee’, which include the regular cases need→needed and beep→beeped as well as the exception cases meet→met, feel→felt, and seek→sought. Generalization from the rule-governed cases can hamper the learning of the exception cases and vice versa. Indeed, children learning English initially master high-frequency exception verbs such as go→went, but after accumulating experience with regular verbs, they begin to overregularize by mapping go→goed, eventually learning the distinction between the regular and exception verbs; neural nets show the same interesting pattern over the course of training (Rumelhart & McClelland, 1986).
Memorization is tantamount to a look-up table with the individual facts accessible for retrieval. Generalization requires the inference of statistical regularities in the training environment, and the application of procedures or rules for exploiting the regularities. In deep learning, memorization is often considered a failure of a network because memorization implies no generalization. However, mastering a domain involves knowing when to generalize and when not to generalize. Consider the two-class problem with training examples positioned in an input space as in Figure 1a, or positioned in a latent space as in Figure 1b. Instance 3 (the iron throne) is an exception case, and there may not exist similar cases in the data environment. Instance 1 (a generic chair) lies in a region with a consistent labeling and thus seems to follow a strong regularity. Instance 2 (a rocking chair) has a few supporting neighbors, but it lies in a distinct neighborhood from the majority of same-label instances; its neighborhood might be considered a weak regularity.
In this article, we formalize the notion of strong regularities, weak regularities, and exceptions in the context of a deep net. We propose a consistency score or C-score for an instance $x$ with label $y$, defined as the expected accuracy of predicted label $\hat{y}$ from a classifier of architecture $f$ trained on $n$ i.i.d. examples $D$ drawn from a data distribution $\mathcal{P}$:

$$C_{\mathcal{P},n}(x, y) = \mathbb{E}_{D \overset{n}{\sim} \mathcal{P}}\left[\mathbb{P}\left(f(x; D \setminus \{(x, y)\}) = y\right)\right] \qquad (1)$$

Practically, we require that the instance $(x, y)$ is excluded from the training set, but under a continuous data distribution, the probability of selecting the same instance for both training and testing is zero. The C-score reflects how consistent the instance is with the training set: in Figure 1a, instance 1 should have a higher C-score than instance 2, which in turn should have a higher C-score than instance 3. The C-score reflects the relationship of each instance to the training population. A low C-score indicates that the instance is not aligned with the training population and therefore learning requires memorization. A high C-score indicates that the instance is supported by the training population and generalization thus follows naturally. The formulation of the C-score is closely related to the memorization score from Feldman (2019), which is defined relative to a dataset that includes $(x, y)$ and measures the change in the prediction accuracy on $x$ when $(x, y)$ is removed from the dataset. That work uses the score to quantify the importance of memorization for achieving optimal generalization on a data distribution containing a long tail of rare examples.

For a nearest-neighbor classifier that operates on the input space (Figure 1a), the C-score is related to the literature on outlier detection (Breunig et al., 2000; Ramaswamy et al., 2000; Campos et al., 2016). However, for a deep network, which operates over a latent space (Figure 1b), the C-score depends not just on the training data distribution but also on the model architecture, loss function, optimizer, and hyperparameters. Our work is thus related to adversarial methods to identify outliers in latent space (Lee et al., 2018; Pidhorskyi et al., 2018; Beggel et al., 2019).

The C-score has many potential uses. First, it can assist in understanding a dataset’s structure by teasing apart distinct regularities and subregularities. Second, it can be used for detecting out-of-distribution inputs as well as mislabeled instances: these instances will have low C-scores because they have little support from the training distribution, like instance 3 in Figure 1a. Third, it can be used to guide active data collection to improve performance on rare cases that the model treats as exceptions. Fourth, it can be used to prioritize training instances, along the lines of curriculum learning (Bengio et al., 2009; Saxena et al., 2019).
There are several reasons why the C-score as defined in Equation 1 cannot be computed exactly. The underlying data distribution is not known. The expectation must be approximated by sampling. Each sample requires model training. Thus, we seek computationally efficient proxies for the C-score. Ideally the score could be obtained from an untrained network or a single network early in the time course of training.
In our work, we estimate a ground-truth C-score for a dataset via holdout performance on trained networks. Figure 1c shows examples of various ImageNet classes with low and high estimated C-scores. Given these estimates, we investigate various proxies to the C-score, which include measures based on density estimation (in input, latent, and gradient spaces) and on the time course of learning within a single training run. Our key contributions are as follows.
We obtain empirical estimates of the C-score for individual instances in MNIST, CIFAR10, CIFAR100, and ImageNet. Estimation requires training up to 20,000 network replications per data set, permitting us to sort instances into those satisfying strong regularities, those satisfying weaker regularities, and exception (outlier) cases.
Because empirical estimation of the C-score is computationally costly, we define and evaluate a set of candidate C-score proxies. We identified a lightweight proxy score, the cumulative binary training loss, that correlates strongly with the C-score and can be computed for free for all instances in the training set. We note that this result is nontrivial because the C-score is defined for held-out instances, whereas the cumulative binary training loss is defined over a training set.
We explored the relationship between the C-score and learning dynamics, finding that the lower the C-score, the more slowly an instance is learned and the lower the learning rate required for the instance to be learned.
Mangalam & Prabhu (2019) compared the training of Random Forests and SVMs with that of deep networks and found that deep networks prioritize examples that are learnable by shallow models.
Arpit et al. (2017) looked at memorization in deep learning by studying gradient-based learning algorithms on noise vs. real data. They found that with carefully tuned explicit regularization, a network’s capability of memorizing the noisy data can be effectively controlled without compromising the generalization performance on real data.

Carlini et al. (2018) proposed multiple measures for finding prototypical examples that are intrinsic to the dataset and that could lead to good performance when training only on those examples. In contrast, our C-score captures the statistical regularity that combines biases in both the data and the learning algorithm. Moreover, training only on a small subset of examples with high C-scores does not necessarily lead to good performance, as a statistical regularity emerges only when enough supporting examples are present. Examples with low C-scores are also not necessarily unimportant for learning. Some metrics used in their studies are similar to ours. The closest pair is their model confidence and the learning speed studied by us. Note that the former ignores the labels, which we use to quantify the learning speed. The holdout retraining and ensemble agreement metrics used in Carlini et al. (2018) are conceptually similar to our holdout procedure. However, their retraining is a two-stage procedure that involves pre-training and fine-tuning, and their ensemble agreement mixes architectures with heterogeneous capacities and ignores the label information.
Feldman (2019) constructed a theoretical model to show that when the data distribution has a long tail of rare examples, memorization is necessary for optimal learning. In the proof, a score was proposed to quantify the memorization of an example. Our C-score closely resembles that definition. The main difference is that memorization in Feldman (2019) is defined relative to a given dataset, whereas the C-score evaluates the expected accuracy when training on i.i.d. sampled subsets of varying size $n$. We also aim to understand how the C-score depends on $n$, whereas Feldman (2019) and Feldman & Zhang (2019) focus on the effect that memorized examples have on the test set accuracy. Another line of recent theoretical work studies interpolation (e.g., Belkin et al., 2018a, b; Liang & Rakhlin, 2018; Belkin et al., 2019), which means the model perfectly fits the training data. It is shown that in some cases interpolation is harmless for optimal generalization. Note that interpolation does not necessarily imply memorization (consider fitting a linear classifier on two classes with well-separated clusters).
Computing the C-score by our definition (Equation 1) is not feasible in practice because the underlying data distribution is typically unknown, and even if it were known, the expectation cannot be computed analytically. In practice, we usually have a fixed data set $\hat{\mathcal{D}}$ consisting of $N$ i.i.d. samples from the underlying distribution; for example, with the CIFAR10 image classification task, we have 50,000 training examples. An estimate of the C-score can be computed by replacing the expectation in (1) with empirical averaging and by sampling i.i.d. subsets of a given size from the fixed data set. We thus define the empirical C-score for an instance $(x, y)$, based on the estimator of the memorization score from Feldman (2019) proposed in Feldman & Zhang (2019):

$$\hat{C}_{\hat{\mathcal{D}},n}(x, y) = \hat{\mathbb{E}}^{r}_{D \overset{n}{\sim} \hat{\mathcal{D}}}\left[\mathbb{P}\left(f(x; D \setminus \{(x, y)\}) = y\right)\right] \qquad (2)$$

where $D$ is a subset of size $n$ uniformly sampled from $\hat{\mathcal{D}}$ excluding $(x, y)$, and $\hat{\mathbb{E}}^{r}$ denotes empirical averaging with $r$ i.i.d. samples of such subsets. Because the cost of computing this estimate separately for each individual $(x, y)$ is prohibitive, we instead use a $k$-fold validation procedure. Specifically, we evaluate each fold on the instances not considered for training, and determine the empirical C-score for a given instance using only the folds in which the instance is in the held-out set. We refer to this procedure as holdout validation, summarized in Algorithm 1.
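The holdout validation procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reproduction of Algorithm 1: the function names `holdout_cscores` and `nearest_centroid` are hypothetical, and a nearest-centroid classifier stands in for the deep networks so that the example runs in well under a second.

```python
import numpy as np

def holdout_cscores(X, y, train_fn, n_train, n_folds, seed=0):
    """Estimate per-instance C-scores by repeated random train/held-out splits.

    train_fn(X_train, y_train) must return a function mapping inputs to
    predicted labels. An instance's score is its accuracy averaged over the
    folds in which it was held out (NaN if it was never held out).
    """
    rng = np.random.default_rng(seed)
    num_examples = len(X)
    correct = np.zeros(num_examples)
    counts = np.zeros(num_examples)
    for _ in range(n_folds):
        perm = rng.permutation(num_examples)
        train_idx, held_idx = perm[:n_train], perm[n_train:]
        predict = train_fn(X[train_idx], y[train_idx])
        correct[held_idx] += predict(X[held_idx]) == y[held_idx]
        counts[held_idx] += 1
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(counts > 0, correct / counts, np.nan)

def nearest_centroid(X_train, y_train):
    """Stand-in model (an assumption of this sketch, not the paper's network)."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    def predict(X):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        return classes[d2.argmin(axis=1)]
    return predict
```

With a real network, `train_fn` would wrap a full training run, which is why the estimate is expensive: every fold costs one trained model.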
Because each data set is a different size and we require $n < N$, we find it convenient to refer not to the absolute number of examples, $n$, but to the percentage of $\hat{\mathcal{D}}$ used for training, which we refer to as the subset ratio, $s = n/N$. We use a 3-layer fully connected network for MNIST, Inception for CIFAR10 / CIFAR100, and ResNet50 for ImageNet. Please refer to Appendix A for the full details on architectures and hyperparameters.
Figure 2 shows the distribution of empirical C-scores for CIFAR10 at several subset ratios $s$. For each level of $s$, the holdout validation procedure is run over many train/evaluation folds. Beyond giving a sense of what fraction of the data set must be used for training to obtain good generalization, the Figure suggests that floor and ceiling effects may concentrate instances, making it difficult to distinguish them based on their C-scores if $s$ is too small or too large (we justify this shortly). Rather than trying to determine the ‘just right’ value of $s$, we compute a C-score marginalized over $s$ under a uniform distribution. The left panel of Figure 3 shows a histogram of these estimated integral C-scores. Although the bulk of the scores are on the high end, they are more widely distributed than in the histogram for any particular $s$ (Figure 2).

We stratify the instances by their integral C-score into 30 bins, as indicated by the coloring of the bars of the histogram in Figure 3. In the right panel of the Figure, we separately plot the mean C-score for the instances in a bin as a function of the subset ratio $s$. Note that the monotonic ordering of C-scores does not vary with $s$, but instances bunch up at low C-scores for small $s$ and at high C-scores for larger $s$, as indicated by the opacity of the open circles in the Figure. (The semitransparent circles become opaque when superimposed on one another.) Bunching makes the instances less discriminable. At the low end of the integral C-scores (cyan lines), note that the curves drop below chance (0.1 for CIFAR10) with increasing $s$. We conjecture that these instances are ambiguous (e.g., visually similar to instances from a different class), and as the data set grows, regularities in other classes systematically pull these ambiguous instances in the wrong direction. This behavior is analogous to the phenomenon we mentioned earlier that children increase their production of verb overregularization errors (go→goed) as they acquire more exposure to a language.
For MNIST, CIFAR10, and CIFAR100, Figure 4 presents instances that have varying estimated integral C-scores. Each block of examples is one category; the left, middle, and right columns have high, intermediate, and low C-scores, respectively. The homogeneity of examples in the left column suggests a large cluster of very similar images that form a functional prototype. In contrast, many of the examples in the right column are ambiguous or even mislabeled.
The integral estimation computed in the previous section requires invoking the holdout validation procedure for a range of $s$, with each invocation involving training on the order of 2000 networks. For large-scale data sets like ImageNet, the computational cost of this approximate integration procedure is too high. Consequently, we investigate the feasibility of approximating the integral C-score with a point estimate, i.e., selection of the $s$ that best represents the integral score. By ‘best represents,’ we mean that the ranking of instances by the integral score matches the ranking by the score for a particular $s$. Figure 5 shows the rank correlation between the integral score and the score for a given $s$, as a function of $s$. The left and right graphs plot two different rank correlation measures, Spearman’s $\rho$ and Kendall’s $\tau$, respectively. Each curve in a graph corresponds to a particular data set. Examining the green CIFAR10 curve, there is a peak at an intermediate $s$ for both measures, indicating that this $s$ yields the best point-estimate approximation for the integral C-score. That the peak is at an intermediate $s$ is consistent with the observation from Figure 3 that the C-score bunches together instances for low and high $s$.
For MNIST, a less challenging data set than CIFAR10, the peak occurs at a lower $s$; for CIFAR100, a more challenging data set than CIFAR10, the peak occurs at a higher $s$. Thus, the peak appears to shift to larger $s$ for more challenging data sets. This finding is not surprising: more challenging data sets require a greater diversity of training instances in order to observe generalization.
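The selection of a point estimate can be illustrated with a short NumPy sketch. It assumes the empirical C-scores are stored as an array with one row per subset ratio and one column per instance; the function names and the toy numbers are hypothetical. Spearman's rank correlation is computed here as the Pearson correlation of tie-averaged ranks, which is one standard way to compute it and handles the ties that bunching produces.

```python
import numpy as np

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    inverse = inverse.ravel()
    rank_sums = np.bincount(inverse, weights=ranks)
    return rank_sums[inverse] / counts[inverse]

def spearman(a, b):
    """Spearman's rank correlation as the Pearson correlation of ranks."""
    return np.corrcoef(average_ranks(a), average_ranks(b))[0, 1]

def best_point_estimate(scores_by_ratio):
    """scores_by_ratio: array of shape (num_ratios, num_instances).

    Returns the integral C-score (uniform average over the ratio grid), the
    rank correlation of each ratio's scores with it, and the index of the
    ratio whose ranking agrees best with the integral score.
    """
    scores_by_ratio = np.asarray(scores_by_ratio, dtype=float)
    integral = scores_by_ratio.mean(axis=0)
    rhos = np.array([spearman(row, integral) for row in scores_by_ratio])
    return integral, rhos, int(np.argmax(rhos))

# Toy data: scores bunch at 0 for the small ratio and at 1 for the large one.
toy = np.array([
    [0.0, 0.0, 0.0, 0.0, 1.0],
    [0.1, 0.3, 0.5, 0.7, 0.9],
    [0.9, 1.0, 1.0, 1.0, 1.0],
])
integral, rhos, best = best_point_estimate(toy)
```

In this toy case the middle row is selected, because the bunched rows contain ties that blur the ranking.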
In addition to MNIST, CIFAR10, and CIFAR100, we conducted experiments with ImageNet. Due to the large data set size (1.2M examples), we picked a single $s$ for our C-score estimate. Based on the fact that the optimal $s$ increases with data set complexity, we picked $s = 70\%$ for ImageNet. In particular, we train 2,000 ResNet50 models, each with a random 70% subset of the ImageNet training set, and compute the C-scores for all the training examples.
The examples shown in Figure 1c are ranked according to this C-score estimate. Because ImageNet has 1,000 classes, we cannot offer a simple overview of the entire dataset as in MNIST and CIFAR. Thus, we focus on analyzing the behaviors of individual classes. Specifically, we compute the mean and standard deviation (SD) of the C-scores of all the examples in a particular class. The mean C-score indicates the relative difficulty of a class, and the SD indicates the diversity of examples within each class. The two-dimensional histogram in Figure 6 depicts the joint distribution of mean and SD across all classes. A strong correlation is observed: classes with high mean C-scores tend to have low variances. We selected several classes with various combinations of mean and SD, indicated by the markers in Figure 6. We then selected sample images from the 1%, 35%, and 99% percentiles ranked by the C-score within each class, and show them in Figure 7.

The class projectile has C-scores spread across the value range. In contrast, the class weasel has large masses on both low and high C-scores, leading to larger variance than projectile. The class green snake, from the high-density region of the 2D histogram in Figure 6, represents the common case among the 1,000 ImageNet classes: while highly regular examples dominate, there is usually also a nontrivial number of outliers or ambiguous examples that need to be memorized in training. The class oscilloscope is similar to green snake except with higher mean and lower SD. On the other extreme of the spectrum is the class yellow lady’s slipper, which mostly contains highly regular examples. From the image samples, we can see that even the examples ranked at the 99% percentile share a consistent color scheme with the rest of the images.
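The per-class summary used in this analysis is simple to compute once C-scores are available. The following sketch assumes integer class labels in the range 0 to `num_classes - 1` and at least one example per class; the function name is hypothetical.

```python
import numpy as np

def per_class_mean_sd(cscores, labels, num_classes):
    """Mean and (population) standard deviation of C-scores within each class."""
    cscores = np.asarray(cscores, dtype=float)
    labels = np.asarray(labels)
    counts = np.bincount(labels, minlength=num_classes)
    sums = np.bincount(labels, weights=cscores, minlength=num_classes)
    sq_sums = np.bincount(labels, weights=cscores ** 2, minlength=num_classes)
    mean = sums / counts
    # Clip tiny negative values caused by floating-point cancellation.
    variance = np.maximum(sq_sums / counts - mean ** 2, 0.0)
    return mean, np.sqrt(variance)
```

A class with high mean and low SD corresponds to the yellow lady’s slipper pattern described above, while a high SD signals a diverse class.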
Given meaningful estimates of the C-score, we now investigate various proxies to the C-score. To unwind the logic of our investigation, the C-score relates to the consistency of a given instance with the rest of the data set. We have shown that it is useful for understanding the data set structure and for identifying outliers and mislabeled instances. However, it is expensive to estimate. Our goal in this section is to identify proxy measures strongly correlated with the C-score that can be estimated before or while a model is training on the training instances alone. We emphasize this latter point because if we are successful in estimating C-scores for training examples, we should also be able to estimate performance on as-yet-unseen data. We explore two C-score proxy measures based on density estimation (in input space and in latent space) as well as a measure based on accuracy over the time course of training. In addition, we discuss a gradient-based measure related to the neural tangent kernel (Jacot et al., 2018) in the supplementary materials (Appendix C). All of these measures have the property that they require training only a single instance of the model and they can be used to estimate performance on a training example without explicit holdout.
In this section, we study C-score proxies based on kernel density estimation. Intuitively, an example is consistent with the data distribution if it lies near other examples having the same label. However, if the example lies far from instances in the same class or lies near instances of different classes, one might not expect it to generalize. Based on this intuition, we define a relative local-density score:

$$\hat{C}^{\pm L}(x, y) = \frac{1}{N} \sum_{i=1}^{N} 2\left(\mathbb{1}[y = y_i] - \tfrac{1}{2}\right) K(x_i, x) \qquad (3)$$

where $K$ is an RBF kernel with bandwidth $h$, and $\mathbb{1}[\cdot]$ is the indicator function. We introduce two additional scores as a means of determining what information in the density is critical to predicting the C-score. First, we define a class-conditional density:

$$\hat{C}^{L}(x, y) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[y = y_i]\, K(x_i, x) \qquad (4)$$

(Because we are mainly interested in the relative ranking of examples, we do not normalize the score to form a proper probability density function.) If $\hat{C}^{\pm L}$ is a better proxy than $\hat{C}^{L}$, then the contrast between classes is critical. Second, we define a class-independent density:

$$\hat{C}(x) = \frac{1}{N} \sum_{i=1}^{N} K(x_i, x) \qquad (5)$$

If $\hat{C}^{L}$ is a better proxy than $\hat{C}$, then the class labels are critical.
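The three density scores (relative local density, class-conditional density, and class-independent density) can be computed together from one kernel matrix. The sketch below is illustrative only: the text specifies an RBF kernel with a bandwidth but not its exact parameterization, so the form exp(-||x - x'||^2 / h^2) used here is an assumption, as are the function names. Each sum includes the instance itself, which contributes a kernel value of 1.

```python
import numpy as np

def rbf_kernel_matrix(X, bandwidth):
    """Pairwise RBF kernel values; exp(-||x - x'||^2 / h^2) is an assumed form."""
    sq_norms = (X ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(sq_dists, 0.0) / bandwidth ** 2)

def density_scores(X, y, bandwidth):
    """Return (relative, class_conditional, class_independent) scores per instance."""
    K = rbf_kernel_matrix(X, bandwidth)
    same_label = (y[:, None] == y[None, :]).astype(float)
    n = len(X)
    relative = (K * (2.0 * same_label - 1.0)).sum(axis=1) / n  # +K same class, -K otherwise
    class_conditional = (K * same_label).sum(axis=1) / n
    class_independent = K.sum(axis=1) / n
    return relative, class_conditional, class_independent
```

By construction the relative score equals twice the class-conditional score minus the class-independent score, so the three scores differ only in how they weight neighbors of other classes. For hidden-space variants, `X` would hold penultimate-layer activations instead of raw inputs.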
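[Table 1: agreement between the three input-space density scores and the estimated C-score on MNIST, CIFAR10 (CF10), and CIFAR100 (CF100).]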
Table 1 shows the agreement between our three proposed proxy scores and the estimated C-score. Agreement is quantified by two rank correlation measures on three data sets. As anticipated, the input-density score that ignores labels, $\hat{C}(x)$, and the class-conditional density, $\hat{C}^{L}(x, y)$, have poor agreement. However, so does the class-relative score, $\hat{C}^{\pm L}(x, y)$. We therefore move on to examining the relationship among instances in hidden space.
Using the penultimate layer of the network as a representation of an image, we evaluate three proxy scores: $\hat{C}^{\pm L}_h$, $\hat{C}^{L}_h$, and $\hat{C}_h$, with the subscript $h$ indicating that the score operates in hidden space. For each score and data set, we compute Spearman’s rank correlation between the proxy score and the C-score. We drop Kendall’s $\tau$ as it closely tracked Spearman’s $\rho$ in our previous experiments. Because the embedding changes as the network is trained, we plot the correlation as a function of training epoch in Figure 8. For all three data sets, the proxy score that correlates best with the C-score is $\hat{C}^{\pm L}_h$ (grey line); the other two scores (pink and blue lines) rank lower. Clearly, appropriate use of labels helps with the ranking. However, our proxy $\hat{C}^{\pm L}_h$ uses the labels in an ad hoc manner. In Appendix C, we discuss a more principled measure based on gradient vectors and relate it to the neural tangent kernel (Jacot et al., 2018).

The results reveal interesting properties of the hidden representation. One might be concerned that as training progresses, the representations will optimize toward the classification loss and may discard inter-class relationships that could be potentially useful for other downstream tasks (Scott et al., 2018). However, our results suggest that $\hat{C}^{\pm L}_h$ does not diminish as a predictor of the C-score, even long after training converges. Thus, at least some information concerning the relation between different examples is retained in the representation, even though intra- and inter-class similarity is not very relevant for a classification model. To the extent that the hidden representation, crafted through a discriminative loss, preserves class structure, one might expect that the C-score could be predicted without label reweighting; however, the poor performance of $\hat{C}_h$ suggests otherwise.
Even at asymptote, $\hat{C}^{\pm L}_h$ achieves a peak correlation of only about 0.7 for MNIST and CIFAR10 and 0.4 for CIFAR100. Nonetheless, the curves in Figure 8 offer an intriguing hint that information in the time course of training may be valuable for predicting the C-score. We thus investigate the time course of training itself in the next section. Specifically, we examine the accuracy of an example in the training set as the network weights evolve.
Intuitively, a training example that is consistent with many others should be learned quickly because the gradient steps for all consistent examples should be well aligned. One might therefore conjecture that strong regularities in a data set are not only better learned at asymptote, leading to better generalization performance, but are also learned sooner in the time course of training. This learning-speed hypothesis is nontrivial, because the C-score is defined for a held-out instance following training, whereas learning speed is defined for a training instance during training.
To test the learning-speed hypothesis, we partitioned examples in the CIFAR10 data set into bins by integral C-score, each bin having a width of 0.05. We then trained a model on all examples in the data set and plotted the average proportion correct for each bin as a function of training epoch, as shown in Figure 9a. The two jumps in the graph correspond to points at which the learning rate is reduced. Asymptotically, all examples are learned, as one would expect from an overparameterized model. However, interestingly, the (blue) examples having the lowest C-scores are learned most slowly and the (red) examples having the highest C-scores are learned most quickly. Indeed, learning speed is monotonically related to C-score bin.
In Figure 9b, we compute Spearman’s rank correlation between the C-score of an instance and its softmax confidence value as a function of training epoch. We consider two definitions of confidence: $p_L$, the softmax probability of the target class, and $p_{\max}$, the largest probability across all classes. Both correlate well with the C-score early in training, although the two are not equally good (Figure 9b).
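The two confidence definitions can be made concrete with a small NumPy sketch. It assumes the model's outputs are available as a matrix of logits with one row per instance; the function name is hypothetical.

```python
import numpy as np

def softmax_confidences(logits, labels):
    """Return (target-class probability, largest probability) for each instance."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)
    p_target = probs[np.arange(len(labels)), labels]
    p_max = probs.max(axis=1)
    return p_target, p_max
```

Only the target-class probability uses the label: a confidently misclassified instance has a high largest probability but a low target-class probability.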
We also computed a correlation between an instance’s C-score and an explicit measure of learning speed. One might define learning speed as the first epoch at which an example is classified correctly, but it is known that some instances flip-flop between “learned” and “forgotten” states during training (Toneva et al., 2019). Instead, we simply count the total number of training epochs in which the instance is classified correctly. To the extent that the instance is learned early and reliably, the count will be large. At the end of training, Spearman’s rank correlation between this cumulative binary training loss (CBTL) and the C-score is strong. Of the various proxies we have presented, the CBTL is best by far: the confidence-based scores (Figure 9b) attained lower correlations and were sensitive to the epoch at which the score was assessed, and the best KDE score (Figure 8) was also weaker.
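As described, the CBTL is a per-instance count of correct epochs, so it can be accumulated during a single training run at negligible cost. The sketch below assumes a boolean record of per-epoch correctness has been kept; the array layout and function name are hypothetical.

```python
import numpy as np

def cumulative_binary_training_loss(correct_by_epoch):
    """Count, per instance, the epochs in which it was classified correctly.

    correct_by_epoch: boolean array of shape (num_epochs, num_instances),
    where entry [t, i] is True if training instance i was classified
    correctly at epoch t.
    """
    return np.asarray(correct_by_epoch, dtype=bool).sum(axis=0)

# Three instances over four epochs: learned immediately, flip-flopping, learned late.
history = np.array([
    [True, False, False],
    [True, True, False],
    [True, False, False],
    [True, True, True],
])
cbtl = cumulative_binary_training_loss(history)
```

The flip-flopping second instance shows why a count is more robust than the first correct epoch: it is first correct at epoch 2 but is correct in only two of four epochs.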
An interesting observation from the learning speed plot in Figure 9a is that the stagewise learning rate decay has a greater impact for examples with lower C-scores. To explore this phenomenon further, we trained three model instances with constant learning rates of 0.1, 0.02, and 0.0008 (the same learning rates used in the stagewise schedule). As Figure 10 shows, larger learning rates appear to limit asymptotic (training) performance of the lower C-score examples. At the end of training, test accuracy for the stagewise learning-rate model is 95.1% (averaged over the last 10 epochs), whereas the constant learning-rate models attain test accuracy of only 84.8%, 91.2%, and 90.8%. Our observations suggest a plausible explanation for why we, like other computer vision researchers, have observed better generalization with stagewise learning rates than with a constant learning rate. Starting with a large learning rate effectively enforces a sort of curriculum in which the model first learns the strongest regularities. At a later stage, when the learning rate is lowered and exceptions or outliers can be learned, the model has already built a representation based on domain regularities. In contrast, if a constant small learning rate is used (Figure 10, lr=0.0008), the outliers are learned in parallel with the regularities, which may corrupt internal representations.

We explored the memorization-generalization continuum in deep learning via a consistency score that measures the statistical regularity of an instance in the context of a data distribution. We empirically estimated the C-score for individual instances in four data sets and we explored various proxies to the C-score based on density estimation and the time course of training. Our main contributions and take-home messages are as follows.
We assigned a consistency score (C-score) to every example in MNIST, CIFAR10, CIFAR100, and ImageNet. These scores can assist in understanding a data set’s structure by teasing apart regularities, subregularities, and exception cases. We are currently investigating whether the scores can be used to improve generalization via curriculum learning or instance reweighting, in particular, with the aim of encouraging networks to discover regularities before exception cases are memorized. The C-score can also be used to identify ambiguous and mislabeled examples for data cleaning and to identify difficult corner cases in safety-critical applications (e.g., the perception component of a self-driving car) for active data collection.
We explored the distribution of C-scores across all four data sets. For every class in MNIST, CIFAR10, and CIFAR100, high C-score examples are found that are visually uniform in color, shape, and alignment (see Appendix E for more examples). The instances with the lowest C-scores are often mislabeled, or the salient object in the image belongs to a different class. In ImageNet, some classes do appear to have strong regularities, such as yellow lady’s slipper. However, other classes, such as projectile, are more notable for their extreme diversity. Diversity seems to be reflected in the intra-class C-score variance.
We identified the cumulative binary training loss (CBTL) as a good proxy to the C-score. The CBTL costs almost nothing to compute and requires just one training run, not thousands like the C-score. Remarkably, the CBTL is based on the training performance of an instance, yet it predicts the C-score, which is the generalization performance of that same instance if it were held out of the training set.
Tracking the learning speed of examples grouped by C-score, we formulated a hypothesis to explain why a stagewise decreasing learning-rate schedule often generalizes better than a constant or adaptive schedule (more evidence in Appendix G). Our analysis suggests that the stagewise schedule provides scaffolding to build internal representations based on the strongest domain regularities first.
Neural net researchers in the 1980s touted the fact that their models could learn rule-governed behavior without explicit rules (Rumelhart & McClelland, 1986). In that era, most AI researchers were focused on constructing expert systems by extracting explicit rules from human domain experts. Expert systems ultimately failed because the diversity and nuance of statistical regularities in a domain were too great for any human to explicate. In the modern deep learning era, researchers have made much progress in automatically extracting regularities from data. Nonetheless, there is still much work to be done to understand these regularities, and how the consistency relationships among instances determine the outcome of learning. By defining and investigating a consistency score, we hope to have made some progress in this direction.
Beggel et al. (2019). Robust anomaly detection in images using adversarial autoencoders, 2019.
Pidhorskyi et al. (2018). Generative probabilistic novelty detection with adversarial autoencoders. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 6822–6833. Curran Associates, Inc., 2018.
Scott et al. (2018). Adapted deep embeddings: A synthesis of methods for k-shot inductive transfer learning. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 76–85. Curran Associates, Inc., 2018.

The details on model architectures, dataset information, and hyperparameters used in the experiments for empirical estimation of the C-score can be found in Table 2.
Hyperparameter  MNIST  CIFAR10  CIFAR100  ImageNet
Architecture  MLP(512,256,10)  Inception  Inception  ResNet50 (V2)
Optimizer  SGD  SGD  SGD  SGD
Momentum  0.9  0.9  0.9  0.9
Base Learning Rate  0.1  0.4  0.4  0.17
Learning Rate Scheduler  LinearRampupPiecewiseConstant
Batch Size  256  512  512  1287
Epochs  20  80  160  100
Data Augmentation  Random Padded Cropping + Random Left-Right Flipping
Image Size  28×28  32×32  32×32  224×224
Training Set Size  60,000  50,000  50,000  1,281,167
Number of Classes  10  10  100  1000
A simplified Inception model suitable for small image sizes, defined as follows:
Inception :: Conv(3×3, 96) → Stage1 → Stage2 → Stage3 → GlobalMaxPool → Linear.
Stage1 :: Block(32, 32) → Block(32, 48) → Conv(3×3, 160, Stride=2).
Stage2 :: Block(112, 48) → Block(96, 64) → Block(80, 80) → Block(48, 96) → Conv(3×3, 240, Stride=2).
Stage3 :: Block(176, 160) → Block(176, 160).
Block(C1, C2) :: Concat(Conv(1×1, C1), Conv(3×3, C2)).
Conv :: Convolution → BatchNormalization → ReLU.
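As a sanity check on the architecture definition above, the following minimal sketch traces the channel count and spatial size through the simplified Inception model. It reads each Block as a channel-wise concatenation of a 1×1 and a 3×3 convolution, and it assumes "same" padding everywhere so that only the stride-2 convolutions shrink the feature map; the padding choice and the names `LAYERS` and `trace_shapes` are illustrative assumptions, not details taken from the experiments.

```python
import math

# Layer list read off the definition above.
# ("conv", channels, stride) or ("block", c1, c2).
LAYERS = [
    ("conv", 96, 1),                                   # stem
    ("block", 32, 32), ("block", 32, 48),              # Stage1
    ("conv", 160, 2),
    ("block", 112, 48), ("block", 96, 64),             # Stage2
    ("block", 80, 80), ("block", 48, 96),
    ("conv", 240, 2),
    ("block", 176, 160), ("block", 176, 160),          # Stage3
]


def trace_shapes(image_size=32):
    """Return (kind, channels, spatial_size) after every layer.

    Assumption: "same" padding, so a stride-s convolution maps a feature
    map of size n to ceil(n / s), and blocks keep the spatial size.
    """
    channels, size = 3, image_size
    shapes = []
    for kind, a, b in LAYERS:
        if kind == "conv":
            channels, size = a, math.ceil(size / b)
        else:
            # Concat(Conv(1x1, a), Conv(3x3, b)) stacks the two outputs.
            channels = a + b
        shapes.append((kind, channels, size))
    return shapes


if __name__ == "__main__":
    for kind, channels, size in trace_shapes(32):
        print(f"{kind:5s} -> {channels:3d} channels, {size}x{size}")
```

Under these assumptions, a 32×32 input reaches the global max pooling layer as a 336-channel 8×8 feature map, so the final linear layer would take a 336-dimensional input.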

learning rate scheduler linearly increases the learning rate from 0 to the base learning rate in the first training steps, and then decreases it linearly to 0 over the remaining training steps.
LinearRampupPiecewiseConstant learning rate scheduler linearly increases the learning rate from 0 to the base learning rate in the first training steps. Then the learning rate remains piecewise constant, with decays at fixed fractions of the training steps.
Random Padded Cropping pads 4 pixels of zeros on all four sides of MNIST, CIFAR10, and CIFAR100 images and randomly crops back to the original image size. For ImageNet, a padding of 32 pixels is used on all four sides of the images.
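The two augmentations can be sketched as follows. This is a dependency-free illustration on single-channel images stored as nested lists, not the pipeline used in the experiments; the function names and the 0.5 flip probability are assumptions, and a practical implementation would use an array library.

```python
import random


def random_padded_crop(image, pad=4, rng=random):
    """Zero-pad `pad` pixels on every side, then crop a random window
    with the original height and width."""
    height, width = len(image), len(image[0])
    blank_row = [0] * (width + 2 * pad)
    padded = (
        [blank_row[:] for _ in range(pad)]
        + [[0] * pad + list(row) + [0] * pad for row in image]
        + [blank_row[:] for _ in range(pad)]
    )
    # Any offset in [0, 2 * pad] keeps the window inside the padded image.
    top = rng.randint(0, 2 * pad)
    left = rng.randint(0, 2 * pad)
    return [row[left:left + width] for row in padded[top:top + height]]


def random_left_right_flip(image, rng=random, probability=0.5):
    """Mirror the image horizontally with the given probability (assumed 0.5)."""
    if rng.random() < probability:
        return [list(reversed(row)) for row in image]
    return [list(row) for row in image]


if __name__ == "__main__":
    rng = random.Random(0)
    image = [[1] * 28 for _ in range(28)]  # MNIST-sized dummy image
    augmented = random_left_right_flip(random_padded_crop(image, 4, rng), rng)
    print(len(augmented), len(augmented[0]))
```

The crop offset is drawn uniformly from 0 to twice the padding, so an offset equal to the padding recovers the original image and every other offset shifts it by up to `pad` pixels in each direction.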
To compute the scores based on kernel density estimation on learned representations, we train neural network models with the same specification as in Table 2 on the full training set. We use an RBF kernel whose bandwidth parameter is adaptively chosen as a fraction of the mean pairwise Euclidean distance across the dataset.
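A minimal sketch of such a density score is given below. The kernel form exp(-d²/h²), the leave-one-out averaging, and the default `bandwidth_fraction=0.5` are illustrative assumptions, since the exact kernel expression and fraction are not recoverable from the text here; the sketch also ignores the label-based reweighting discussed elsewhere in the paper.

```python
import math


def kde_scores(points, bandwidth_fraction=0.5):
    """Density score for each point under an RBF kernel.

    Assumptions (hypothetical, for illustration only):
      * kernel K(x, x') = exp(-||x - x'||^2 / h^2)
      * h = bandwidth_fraction * mean pairwise Euclidean distance
      * each point's score averages the kernel over all *other* points
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    mean_dist = sum(
        dist[i][j] for i in range(n) for j in range(n) if i != j
    ) / (n * (n - 1))
    h = bandwidth_fraction * mean_dist  # zero if all points coincide (not handled)
    return [
        sum(math.exp(-((dist[i][j] / h) ** 2)) for j in range(n) if j != i)
        / (n - 1)
        for i in range(n)
    ]


if __name__ == "__main__":
    # Three nearby points and one far-away point.
    representations = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
    for point, score in zip(representations, kde_scores(representations)):
        print(point, round(score, 4))
```

Points in the dense cluster receive a much higher score than the isolated point, which is the behavior a density-based proxy for the Cscore relies on.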
The experiments on learning speed are conducted with ResNet18 on CIFAR10, trained for 200 epochs with a batch size of 32. For the optimizer, we use SGD with an initial learning rate of 0.1, momentum of 0.9 (with Nesterov momentum), and weight decay of 5e-4. The stagewise constant learning rate scheduler decreases the learning rate at the 60th, 90th, and 120th epochs with a decay factor of 0.2.
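The stagewise constant schedule described above can be written as a small function. This is a sketch under one assumption not stated in the text: epochs are counted from 0 and the decay takes effect at the start of the listed epoch. The function name is illustrative.

```python
def stagewise_lr(epoch, base_lr=0.1, milestones=(60, 90, 120), decay=0.2):
    """Learning rate in effect during `epoch`.

    Assumption: epochs are 0-indexed and each decay applies from the
    milestone epoch onward.
    """
    num_decays = sum(epoch >= milestone for milestone in milestones)
    return base_lr * decay ** num_decays


if __name__ == "__main__":
    for epoch in (0, 59, 60, 90, 120, 199):
        print(epoch, stagewise_lr(epoch))
```

With these settings the four stages use learning rates of approximately 0.1, 0.02, 0.004, and 0.0008, which are the values used for the constant learning rate baselines in Table 3.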
The time complexity of the holdout procedure for empirical estimation of the Cscore is $O(S(kT + E))$. Here $S$ is the number of subset ratios, $k$ is the number of holdouts for each subset ratio, and $T$ is the average training time for a neural network. $E$ is the time for computing the score given the $k$-fold holdout training results, which involves elementwise computation on a matrix of size $k \times N$ for a dataset of $N$ examples, and is negligible compared to the time for training neural networks. The space complexity is the space for training a single neural network times the number of parallel training jobs. The space complexity for computing the scores is $O(kN)$.
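To make the cost structure of the holdout procedure concrete, the sketch below runs it for a single subset ratio on a toy one-dimensional problem. A nearest-centroid classifier stands in for neural network training, and the function names, data, and number of runs are all illustrative assumptions; only the overall loop (sample a training subset, train, evaluate on the held-out examples, average correctness per example) follows the procedure described above.

```python
import random


def train_nearest_centroid(xs, ys):
    """Toy stand-in for training a neural network: one centroid per class."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: sums[y] / counts[y] for y in sums}
    return lambda x: min(centroids, key=lambda y: abs(x - centroids[y]))


def estimate_cscores(xs, ys, ratio, num_runs, train_fn, seed=0):
    """Holdout accuracy of each example at one subset ratio.

    Each run trains on a random subset of size int(ratio * n) and tests on
    the remaining examples. Returns None for examples never held out.
    """
    rng = random.Random(seed)
    n = len(xs)
    correct = [0] * n
    held_out = [0] * n
    for _ in range(num_runs):
        train_idx = rng.sample(range(n), int(ratio * n))
        in_train = set(train_idx)
        predict = train_fn([xs[i] for i in train_idx],
                           [ys[i] for i in train_idx])
        for i in range(n):
            if i not in in_train:
                held_out[i] += 1
                correct[i] += int(predict(xs[i]) == ys[i])
    return [c / h if h else None for c, h in zip(correct, held_out)]


if __name__ == "__main__":
    # Two well-separated classes plus one mislabeled example (the last one).
    xs = [0.0, 0.1, 0.2, 0.3, 0.4, 2.0, 2.1, 2.2, 2.3, 2.4, 0.2]
    ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
    scores = estimate_cscores(xs, ys, 0.8, 200, train_nearest_centroid)
    print(scores)
```

The training call inside the loop dominates the running time, while the per-example bookkeeping is a cheap elementwise pass, which matches the complexity analysis above. In this toy setting the mislabeled example is misclassified whenever it is held out, whereas the regular examples are classified correctly.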
For kernel density estimation based scores, the most expensive part is forming the pairwise distance matrix (and the kernel matrix), which requires $O(N^2)$ space and $O(N^2 d)$ time, where $N$ is the number of examples and $d$ is the dimension of the input or hidden representation spaces.
Most modern neural networks are trained with first-order gradient descent based algorithms and their variants. In each iteration, the gradient of the loss on a minibatch of training examples, evaluated at the current network weights, is computed and used to update the parameters. Let $\phi_t(x, y)$ be the function that maps an input-label training pair $(x, y)$ (the case of minibatch size one) to the corresponding gradient evaluated at the network weights of the $t$-th iteration. This defines a gradient based representation on which we can compute density based ranking scores. The intuition is that in a gradient based learning algorithm, an example is consistent with others if they all compute similar gradients.
Compared to the hidden representations defined by the outputs of a neural network layer, the gradient based representations induce a more natural way of incorporating the label information. In the previous section, we reweight the neighbor examples belonging to a different class by 0 or 1. For gradient based representations, no ad hoc reweighting is needed, as the gradient is computed on the loss, which already takes the label into account. Similar inputs with different labels automatically lead to dissimilar gradients. Moreover, this could seamlessly handle labels and losses with rich structures (e.g. image segmentation, machine translation) where an effective reweighting scheme is hard to find. The gradient based representation is closely related to recent developments on Neural Tangent Kernels (NTK) (Jacot et al., 2018). It is shown that when the network width goes to infinity, the neural network training dynamics can be effectively approximated via a Taylor expansion at the initial network weights. In other words, the algorithm is effectively learning a linear model on the nonlinear representations defined by $\phi_0$. This feature map induces the NTK, and connects deep learning to the literature of kernel machines.
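The claim that similar inputs with different labels yield dissimilar gradients can be checked on a model small enough to differentiate by hand. The sketch below uses binary logistic regression as an assumed stand-in for a neural network; for the cross-entropy loss its per-example gradient with respect to the weights is (p - y) x, where p is the predicted probability. The function names and the example values are illustrative.

```python
import math


def per_example_gradient(weights, x, y):
    """Gradient of the logistic cross-entropy loss for one (x, y) pair.

    With p = sigmoid(w . x), the gradient with respect to w is (p - y) * x.
    """
    logit = sum(w * xi for w, xi in zip(weights, x))
    p = 1.0 / (1.0 + math.exp(-logit))
    return [(p - y) * xi for xi in x]


def cosine(u, v):
    """Cosine similarity between two gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    weights = [0.1, -0.2]
    x1, x2 = [1.0, 2.0], [1.1, 1.9]  # two nearly identical inputs
    g1 = per_example_gradient(weights, x1, 1)
    same_label = cosine(g1, per_example_gradient(weights, x2, 1))
    diff_label = cosine(g1, per_example_gradient(weights, x2, 0))
    print("same label:", same_label)
    print("different label:", diff_label)
```

The two inputs are close in input space, yet their gradients point in nearly the same direction only when the labels agree; flipping one label flips the sign of its gradient, so no separate label-based reweighting is needed.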
Although NTK enjoys nice theoretical properties, it is challenging to perform density estimation on it. Even for the more practical case of finite width neural networks, the gradient representations are of extremely high dimension, as modern neural networks generally have parameters ranging from millions to billions (e.g. Tan & Le, 2019; Radford et al., 2019). As a result, both computation and memory requirements are prohibitive if a naive density estimation is to be computed on the gradient representations. We leave the exploration of efficient algorithms for practically computing this score to future work.
The histogram of individual point estimated Cscores with fixed subset ratios on CIFAR10 is shown in Figure 2. The same plots for MNIST and CIFAR100 are shown in Figure 11.
Similarly, the histograms of the estimated integral Cscores for MNIST and CIFAR100 are shown in Figure 12, which can be compared with the results for CIFAR10 in Figure 3 in the main text.
Examples with high, middle and low Cscores from a few representative classes of MNIST, CIFAR10 and CIFAR100 are shown in Figure 4. In this appendix, we depict the results for all the 10 classes of MNIST and CIFAR10 in Figure 13 and Figure 14, respectively. The results from the first 60 out of the 100 classes on CIFAR100 are depicted in Figure 15.
The notion of regularity comes primarily from the statistical consistency of the example with the rest of the population, and less from the intrinsic structure of the example’s contents. To illustrate this, we refer back to the experiments in Section 4.3 on measuring the learning speed of groups of examples generated via equal partition of the Cscore value range $[0, 1]$. As shown in Figure 3a, the distribution is uneven between high and low Cscore values. As a result, the high Cscore groups will have more examples than the low Cscore groups. This agrees with the intuition that regularity arises from high probability masses.
To test whether an example with a top-ranking Cscore is still highly regular after the density of its neighborhood is reduced, we redo the experiment, but subsample each group to contain an equal number of examples. We then train on this new dataset and observe the learning speed in each (subsampled) group. The result is shown in Figure 16, which is to be compared with the results without group-size equalizing in Figure 9a in the main text. The following observations can be made:
The learning curves for many of the groups start to overlap with each other.
The lower ranked groups now learn faster. For example, the lowest ranked group goes above 30% accuracy near epoch 50. In the original experiment (Figure 9a), this group is still below 20% accuracy at epoch 50. The model is now learning with a much smaller dataset. Since the lower ranked examples are not highly consistent with the rest of the population, this means there are fewer “other examples” to compete with (i.e. those “other examples” will move the weights towards a direction that is less preferable for the lower ranked examples). As a result, the lower ranked groups can now learn faster.
On the other hand, the higher ranked groups now learn slower, which is clear from a direct comparison between Figure 9a and Figure 16. This is because for highly regular examples, reducing the dataset size means removing consistent examples; that is, there are now fewer “supporters”, as opposed to fewer “competitors” in the case of the lower ranked groups. As a result, the learning speed is now slower.
Even though the learning curves are now overlapping, the highest ranked group and the lowest ranked group are still clearly separated. The potential reason is that while the lower ranked examples can be outliers in many different ways, the highest ranked examples are probably regular in a single (or very few) visual clusters (see the top ranked examples in Figure 4). As a result, the within group diversities of the highest ranked groups are still much smaller than the lowest ranked groups.
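The group-size equalizing step used in this experiment can be sketched as follows: examples are binned by an equal partition of the Cscore range and each bin is then subsampled to a common size. The function name, the number of groups, and the group size in the usage example are hypothetical placeholders rather than the values used in the experiment.

```python
import random


def equalize_groups(cscores, num_groups, group_size, seed=0):
    """Bin example indices by Cscore, then subsample each bin.

    The range [0, 1] is split into `num_groups` equal-width bins. Each bin
    keeps at most `group_size` randomly chosen examples. Both parameters
    are illustrative assumptions.
    """
    rng = random.Random(seed)
    groups = [[] for _ in range(num_groups)]
    for index, score in enumerate(cscores):
        # A score of exactly 1.0 is placed in the last bin.
        bin_id = min(int(score * num_groups), num_groups - 1)
        groups[bin_id].append(index)
    return [
        rng.sample(members, min(group_size, len(members)))
        for members in groups
    ]


if __name__ == "__main__":
    # Skewed toy distribution: few low-score examples, many high-score ones.
    cscores = [0.05] * 3 + [0.5] * 10 + [0.95] * 50
    for bin_id, members in enumerate(equalize_groups(cscores, 3, 3)):
        print(bin_id, sorted(members))
```

Before subsampling, the bins in the toy example hold 3, 10, and 50 examples; afterwards each holds 3, so any remaining difference in learning speed cannot be attributed to group size alone.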
In summary, the regularity of an example arises from its consistency relation with the rest of the population. A regular example in isolation is no different from an outlier. Moreover, regularity is not merely an intrinsic property of the data distribution, but is closely related to the model, loss function and learning algorithm. For example, while a picture with a red lake and a purple forest is likely to be considered an outlier in the usual sense, for a model that only uses grayscale information it could be highly regular.
In Section 4.3 we used the Cscore grouping to compare the learning dynamics of a stagewise constant learning rate scheduler and a constant learning rate scheduler. The observations lead to an interesting hypothesis for explaining why a stagewise constant learning rate usually performs better and is preferred in many computer vision tasks. We provide more details here, and also compare to Adam, an optimizer with adaptive learning rate scheduling.
Figure 17 shows the learning speed of groups of examples on CIFAR10 ranked by Cscore, with SGD using stagewise constant learning rate scheduling. This is the same as Figure 9a, replicated here for easy comparison. In Figure 18 we show the learning speeds of groups trained with SGD using constant learning rate scheduling. The 4 panels show the results for each of the values used in the 4 stages of the stagewise scheduler. In Figure 19 we also present the training results with the Adam optimizer, using the default base learning rate of 0.001. Adaptive algorithms like Adam scale the learning rate automatically and usually converge faster than vanilla SGD. However, it is observed that faster convergence from adaptive algorithms usually leads to worse generalization performance (Wilson et al., 2017; Keskar & Socher, 2017; Luo et al., 2019). We observe similar behavior in our experiments, as summarized in Table 3.
Optimizer  Learning Rate  Test Accuracy (%) 

SGD  Stagewise  95.14 
SGD  0.1  84.84 
SGD  0.02  91.19 
SGD  0.004  92.05 
SGD  0.0008  90.82 
Adam  Adaptive  92.97 
To restate the hypothesis: the stagewise learning rate scheduler generalizes better than the others because it delays the memorization of outliers (low Cscore examples) to later stages. In the first stage, when only the regular examples are learned, the patterns and structures discovered in those regular examples can be used to build a generalizable representation. In later stages, the memorization of outliers will not seriously disrupt the learned representation, as the learning rate is much smaller than in the earlier stages. In contrast, both Adam and SGD with a (small) constant learning rate learn the examples across all Cscore ranges fairly quickly. As a result, the model does not have a chance to build a generalizable representation from a clean subset of highly regular examples.
Our experiments are by no means extensive enough to fully verify this hypothesis. However, we think this is a very interesting side observation from our experiments that is worth mentioning. It also provides a concrete example of how our Cscore indexing could be useful for research on analyzing and understanding neural network training. We leave it as future work to systematically investigate the aforementioned hypothesis.