1 Introduction
Deep neural networks achieve impressive performance on many important machine learning benchmarks, such as image classification (Krizhevsky, 2009; Krizhevsky et al., 2012; Szegedy et al., 2015; Simonyan and Zisserman, 2015; He et al., 2016), automated translation (Bahdanau et al., 2014; Wu et al., 2016), and speech recognition (Deng et al., 2013; Graves et al., 2013). However, the benchmark datasets are used a multitude of times by researchers worldwide. Since state-of-the-art methods are selected and published based on their performance on the corresponding test set, it is typical to see results that continuously improve over time; see, e.g., the discussion of Recht et al. (2018) and Figure 1 for the performance improvement of classifiers published for the popular CIFAR-10 image classification benchmark (Krizhevsky, 2009). This process may naturally lead to models overfitted to the test set, rendering the test error rate (the average error measured on the test set) an unreliable indicator of the actual performance.
Detecting whether the model is overfitted to the test set is challenging, because independent test sets drawn from the same data distribution are generally not available, while alternative test sets often introduce a distribution shift.^{1} The shift may happen even if the new test set is collected very carefully. For example, the recent CIFAR-10.1 dataset was collected by diligently following the original data curation procedure of CIFAR-10, and yet the data distribution underlying CIFAR-10.1 appears to be different from that of the test set of CIFAR-10 (for details, see Recht et al., 2018).
^{1} Throughout the paper we work with the standard statistical learning theory framework which assumes that data, both training and test, is sampled independently from a fixed underlying data distribution.
To estimate the performance of a model on unseen data, one may use generalization bounds to get upper bounds on the expected error rate. The generalization bounds are also applicable when the model and the data are dependent (e.g., for error estimates based on the training data or the reused test data), but they usually lead to loose error bounds. Similarly, constructing confidence intervals around the training and test error rates from generalization bounds and rejecting independence of the test set if they do not overlap does not work in practice for the same reason. On the other hand, if the test data and the model are independent, much tighter bounds are available; hence, dependence testing can be applied to validate the resulting bounds. Recently, several methods that allow the reuse of the test set while keeping the validity of test error rates have been proposed
(Dwork et al., 2015). However, these are intrusive: they require the user to follow a strict protocol of interacting with the test set and are thus not applicable in the more common situation when enforcing such a protocol is impossible.
In this paper we take a new approach to the challenge of detecting overfitting of a model to the test set. Our method is non-intrusive and uses only the original test data. The core novel idea is to harness the power of adversarial examples (Goodfellow et al., 2014) for our purposes. These are data points^{2} which are not sampled from the data distribution, but instead are cleverly crafted based on existing data points so that the model errs on them. Several authors showed that the best models learned for the above-mentioned benchmark problems are highly sensitive to adversarial attacks (Goodfellow et al., 2014; Papernot et al., 2016; Uesato et al., 2018; Carlini and Wagner, 2017a; b; Papernot et al., 2017). For instance, one can often create adversarial versions of images properly classified by a state-of-the-art model that the model will make a mistake on. Often, the adversarial versions are (almost) indistinguishable from the original for a human observer; see, for example, Figure 2, where the adversarial image is obtained from the original one by a carefully selected translation.
^{2} Throughout the paper, we use the words “example” and “point” interchangeably.
[Figure 2: original image and its adversarially translated version; panel labels: “scale, weighing machine” and “toaster”.]
Most of the work on adversarial examples either concentrates on finding perturbation methods to create adversarial points that can fool classifiers (i.e., finding successful attack methods) or tries to train models that simultaneously have small test error rate and small error on adversarial modifications of the test set (which we will refer to as the adversarial error rate) (Gu and Rigazio, 2014; Cisse et al., 2017; Liao et al., 2017; Kolter and Wong, 2017; Madry et al., 2017; Kurakin and Goodfellow, 2017; Tramèr et al., 2018). In this paper we are not directly interested in these issues, and instead use adversarial examples to create a new error estimator that is less sensitive to overfitting to the data. The estimator uses adversarial examples together with importance weighting to take into account the change in the data distribution (covariate shift) due to the adversarial transformation. The obtained estimator, which we call the adversarial (error) estimator,^{3} is unbiased and has a smaller variance than the standard test error rate if the test set and the model are independent.
^{3} Note that the adversarial error estimator’s goal is to estimate the error rate, not the adversarial error rate.
More importantly, since it is based on adversarially generated data points, the adversarial estimator is expected to differ significantly from the test error rate if the model is overfitted to the test set, providing a way to detect test set overfitting. Thus, the test error rate and the adversarial error estimate (calculated based on the same test set) must be close if the test set and the model are independent, and are expected to be different in the opposite case. In particular, if the gap between the two error estimates is large, the independence hypothesis (i.e., that the model and the test set are independent) is dubious and will be rejected. Combining results from multiple training runs, we develop another method to test overfitting of a model architecture.^{4}
^{4} Note that by model architecture we mean not just the layout of the architecture, e.g., that of a neural network, but also the corresponding training procedure and its (hyper)parameters, except for the random seed.
The most challenging aspect of our method is to construct adversarial perturbations for which we can calculate importance weights, while keeping enough degrees of freedom in the way the adversarial perturbations are generated to maximize power, the ability of the test to detect dependence when it is present.
We apply our independence tests to state-of-the-art classification methods for two popular image classification benchmarks, ImageNet (Deng et al., 2009) and CIFAR-10 (Krizhevsky, 2009). Considering VGG-like image-classification models (Simonyan and Zisserman, 2015; Rosca et al., 2017), we found that the independence hypothesis cannot be rejected for ImageNet at almost any confidence level, which is strong evidence that the VGG-like model is not overfitted to the test set (as a control experiment, we verify that the independence of the trained classifier and the training set is rejected with high confidence). For CIFAR-10, our individual test rejects the independence of a trained model and the test set at a confidence level of about 40%.^{5} While this result is inconclusive, the independence of the CIFAR-10 test set and the model architecture can be rejected at a convincing confidence level of 97%. Similar results are obtained for the Wide Residual Network (WideResnet) architecture of Zagoruyko and Komodakis (2016). In conclusion, our method provides strong evidence for overfitting of the tested architectures to the test set, but the effect is much harder to detect for an individual training run, most likely due to the stochastic nature of the training process. As a sanity check, our individual independence test verifies (i.e., does not reject) the independence of the individual VGG models and the truly independent CIFAR-10.1 test set (on average, the independence can be rejected at a negligible confidence level of about 7%, while for about 80% of the individual models, independence cannot be rejected at any confidence level), while the independence of the model architecture can only be rejected at a confidence level of about 30%. Unfortunately, these latter results are less meaningful in the sense that similar confidence levels can be obtained if the test is applied to randomly selected subsets of the CIFAR-10 test set with the same size (2000 images).
^{5} To be precise, we consider a rescaled version of the CIFAR-10 dataset; see Section 5.3 for details.
The rest of the paper is organized as follows: In Section 2, we introduce a formal model for error estimation using adversarial examples, including the definition of adversarial example generators. The new overfitting-detection tests are derived in Section 3. To gain better understanding of their behavior, the tests are first analyzed on a synthetic example using Monte Carlo simulation in Section 4, while their application to image-classification benchmarks is presented in Section 5. Finally, conclusions are drawn and further research questions are discussed in Section 6.
2 Adversarial Risk Estimation
We consider a classification problem with deterministic (noise-free) labels, which is a reasonable assumption for many practical problems, such as image recognition (we leave the extension of our method to noisy labels for future work). In particular, let $\mathcal{X}$ denote the input space and $\mathcal{K}$ the set of labels. For simplicity, we assume that . Data is sampled from the distribution $\mathcal{P}$ over $\mathcal{X}$,^{6} and the class label is determined by a function $f^*: \mathcal{X} \to \mathcal{K}$, which we refer to as the ground truth. We denote a random vector drawn from $\mathcal{P}$ by $X$, and its corresponding class label by $Y = f^*(X)$. We consider deterministic classifiers of the form $f: \mathcal{X} \to \mathcal{K}$, which assign a deterministic label to every $x \in \mathcal{X}$. The performance is measured by the zero-one loss, that is, the loss of classifying $x$ is $L(f, x) = \mathbb{I}(f(x) \neq f^*(x))$,^{7} and the expected error (also known as the risk or expected risk in the learning theory literature) of the classifier $f$ is defined as
$$R(f) = \mathbb{E}[L(f, X)] = \int_{\mathcal{X}} L(f, x)\, d\mathcal{P}(x).$$
^{6} Throughout the paper we assume that the standard measurability assumptions are satisfied.
^{7} For a Boolean-valued expression $A$, $\mathbb{I}(A)$ denotes its indicator function, that is, $\mathbb{I}(A) = 1$ if $A$ is true and $\mathbb{I}(A) = 0$ otherwise.
Consider a test dataset $S = \{(X_1, Y_1), \ldots, (X_m, Y_m)\}$ where the $X_i$ are drawn from $\mathcal{P}$ independently of each other and $Y_i = f^*(X_i)$.^{8} In the learning setting, the classifier $f$ usually also depends on some randomly drawn training data and, as such, is random itself. If $f$ is independent from $S$, then $L(f, X_1), \ldots, L(f, X_m)$ are independent and identically distributed, and thus the empirical error rate
$$\hat{R}_S(f) = \frac{1}{m} \sum_{i=1}^{m} L(f, X_i) \tag{1}$$
is an unbiased estimate of $R(f)$ for all $f$, that is, $R(f) = \mathbb{E}[\hat{R}_S(f) \mid f]$. If $f$ and $S$ are not independent, the performance guarantees on the empirical estimates available in the independent case are significantly weakened because $L(f, X_1), \ldots, L(f, X_m)$ may no longer be independent. For example, in case of overfitting to $S$, the empirical error rate is likely to be much smaller than the expected error. To detect the lack of independence and the potential bias of the empirical error rate, we construct alternate error (risk) estimators with their corresponding confidence intervals that are valid under the independence assumption. If the confidence intervals do not overlap sufficiently, we can reject the hypothesis that the sample $S$ and the classifier $f$ are independent.
^{8} Here, following the rest of the machine learning literature, we abuse notation. Technically, the sample should be seen as a sequence (not a set) drawn from the $m$-fold product of $\mathcal{P}$ with itself, $\mathcal{P}^m$.
As the first estimate, we use the test error rate $\hat{R}_S(f)$. To produce the other estimate, we use the well-known importance sampling or change of measure idea (Kahn and Harris, 1951): instead of sampling from the distribution $\mathcal{P}$, we sample from another distribution $\mathcal{P}'$ and correct the estimate by appropriate reweighting. Assuming $\mathcal{P}$ is absolutely continuous with respect to $\mathcal{P}'$ on the set $E = \{x \in \mathcal{X} : L(f, x) \neq 0\}$,
$$R(f) = \int_{\mathcal{X}} L(f, x)\, d\mathcal{P}(x) = \int_{E} L(f, x)\, d\mathcal{P}(x) = \int_{E} L(f, x)\, h(x)\, d\mathcal{P}'(x), \tag{2}$$
where $h = d\mathcal{P} / d\mathcal{P}'$ is the density (formally: Radon-Nikodym derivative) of $\mathcal{P}$ with respect to $\mathcal{P}'$ on $E$ ($h$ can be defined to have arbitrary finite values on $\mathcal{X} \setminus E$). It is well known that the variance of the estimator
$$\hat{R}'_{S'}(f) = \frac{1}{m} \sum_{i=1}^{m} L(f, X'_i)\, h(X'_i) \tag{3}$$
obtained from a sample $S' = \{(X'_1, Y'_1), \ldots, (X'_m, Y'_m)\}$ drawn independently from $\mathcal{P}'$ is minimized if $\mathcal{P}'$ is the so-called zero-variance importance sampling distribution defined by $d\mathcal{P}'(x) = L(f, x)\, d\mathcal{P}(x) / R(f)$ (see, e.g., Section 4.2 of Bucklew, 2004). That is, $\mathcal{P}'$ is concentrated on $E$ and $h(x) = R(f) / L(f, x)$ for all $x \in E$. In particular, in binary classification, $h(x) = R(f)$ for all $x \in E$.
Clearly, the zero-variance importance sampling distribution, and hence the above estimation strategy, is impossible to use, because it depends on the knowledge of both $f^*$ and $R(f)$, the latter of which we want to estimate. Nevertheless, it suggests that to minimize the variance of an estimator $\hat{R}'_{S'}(f)$, $\mathcal{P}'$ needs to concentrate on points where $f$ makes mistakes. In the next subsection we use the notion of adversarial examples to generate such distributions.
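The reweighting idea can be illustrated with a minimal sketch. The toy problem below is hypothetical and not taken from the experiments: ten equally likely inputs, a classifier that errs on two of them (true risk 0.2), and a sampling distribution `q` that concentrates on the error set. The names `importance_weighted_error`, `p`, `q`, and `loss` are illustrative assumptions.

```python
import random

def importance_weighted_error(loss, p, q, m, rng):
    """Estimate E_{X~p}[loss(X)] from m draws X' ~ q, reweighting each draw by p/q."""
    points = list(q)
    draws = rng.choices(points, weights=[q[x] for x in points], k=m)
    return sum(loss(x) * p[x] / q[x] for x in draws) / m

# Hypothetical toy problem: ten equally likely inputs; the classifier errs on 0 and 1.
p = {x: 0.1 for x in range(10)}
loss = lambda x: 1.0 if x in (0, 1) else 0.0   # zero-one loss, true risk 0.2

# Sampling distribution concentrated on the error set (not the zero-variance one).
q = {x: (0.4 if x in (0, 1) else 0.025) for x in range(10)}

rng = random.Random(0)
plain = importance_weighted_error(loss, p, p, 2000, rng)     # q = p: ordinary error rate
weighted = importance_weighted_error(loss, p, q, 2000, rng)  # reweighted estimate
print(plain, weighted)
```

Both estimates target the same risk, but in this toy setting each reweighted term has variance 0.01 instead of 0.16. Sampling only from the error set (`{0: 0.5, 1: 0.5}`) makes every term equal to the risk itself, which is the zero-variance case discussed above.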
2.1 Generating adversarial examples
In this section we introduce a formal framework for generating adversarial examples. Given a classification problem instance with data distribution $\mathcal{P}$ and ground truth $f^*$, an adversarial example generator (AEG) for a classifier $f$ is a (measurable) mapping $g: \mathcal{X} \to \mathcal{X}$ such that
(G1) $g$ preserves the class labels of the samples, that is, $f^*(g(x)) = f^*(x)$ for $\mathcal{P}$-almost all $x$; and
(G2) $g$ does not change points that are incorrectly classified by $f$, that is, $g(x) = x$ if $f(x) \neq f^*(x)$ for $\mathcal{P}$-almost all $x$.
A simple example that illustrates AEGs is given in Figure 3, while another example is presented in Section 4. The justification of the definition of AEGs, in relation to the existing literature, is as follows: An adversarial example $g(x)$ is usually generated by staying in a small vicinity of the input $x$, i.e., by keeping $d(x, g(x))$ small for some distance(-like) function $d$, such as $d(x, x') = \|x - x'\|$ where $\|\cdot\|$ is, e.g., the $2$- or the max-norm. We view this as a practical way of ensuring condition (G1), the idea being that for all points $x$ in the support of $\mathcal{P}$, by moving the point by a small amount, say, $\epsilon$, the label associated with the point is not going to change. If the label did change, the adversarial example generators proposed in the literature would not make sense. Thus, we can view condition (G1) as a relaxation of the foundational (implicit) assumption made in the literature underlying adversarial example generators. When developing practical methods for image recognition in Section 5, we also return to this stronger assumption instead of condition (G1). Note that this implicit assumption implies that the boundary between the classes has zero measure, which is a geometric margin assumption. If the boundary does not have zero measure, the error metrics that use AEGs can be biased in proportion to its measure. It follows that as long as the probability of seeing examples on the boundary between the positive and negative examples is negligible compared to whatever error estimates the methods produce, the conclusions remain valid. (G2) formalizes the fact that there is no need to change samples which are already misclassified. Indeed, existing AEGs comply with this condition.
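The performance of an AEG is usually measured by how successfully it generates examples that are misclassified, which can be captured by the probability of error on the generated points,
$$\mathbb{P}\big(f(g(X)) \neq f^*(g(X))\big) = \mathbb{E}[L(f, g(X))].$$
Accordingly, we call a point $g(x)$ a successful adversarial example if $x$ is correctly classified by $f$ and $g(x)$ is not (i.e., $L(f, x) = 0$ and $L(f, g(x)) = 1$).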
In the development of our AEGs for image recognition tasks, we will make use of another condition. For simplicity, we formulate this condition for distributions $\mathcal{P}$ that have a density $p$ with respect to the uniform measure on $\mathcal{X}$, which is assumed to exist (notable cases are when $\mathcal{X}$ is finite, or $\mathcal{X} = [0, 1]^D$ or when $\mathcal{X} = \mathbb{R}^D$; in the latter two cases the uniform measure is the Lebesgue measure). The assumption states that the AEG needs to be density-preserving in the sense that
(G3) $p(x) = p(g(x))$ for $\mathcal{P}$-almost all $x$.
Note that a density-preserving map may not be measure-preserving (the latter means that for all measurable $A \subset \mathcal{X}$, $\mathcal{P}(A) = \mathcal{P}(g^{-1}(A))$).
We expect (G3) to hold when $g$ perturbs its input by a small amount and if $p$ is sufficiently smooth. We believe that the assumption is reasonable for, e.g., image recognition problems (at least in its relaxed form) where we expect that very close images will have a similar likelihood as measured by $p$. An AEG employing image translations, which satisfies (G3), will be introduced in Section 5. Both (G1) and (G3) can be relaxed (to a soft margin condition or allowing a slight change in $p$, resp.) at the price of an extra error term in the analysis that follows.
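For a fixed AEG $g$, let $\mathcal{P}_g$ denote the distribution of $g(X)$ where $X \sim \mathcal{P}$ ($\mathcal{P}_g$ is known as the pushforward measure of $\mathcal{P}$ under $g$). Further, let $h_g = d\mathcal{P} / d\mathcal{P}_g$ on $E = \{x \in \mathcal{X} : L(f, x) = 1\}$ and arbitrary otherwise. It is easy to see that $h_g$ is well-defined (on $E$) and, in addition, $h_g(x) \le 1$ for all $x \in E$: First note that for any measurable $A \subset E$,
$$\mathcal{P}_g(A) = \mathbb{P}(g(X) \in A) \ge \mathbb{P}(g(X) \in A, X \in E) = \mathbb{P}(X \in A, X \in E) = \mathcal{P}(A),$$
where the second to last equality holds because $g(X) = X$ for any $X \in E$ under condition (G2). Thus, $\mathcal{P}(A) \le \mathcal{P}_g(A)$ for any measurable $A \subset E$, which implies that $h_g$ is well-defined on $E$ and $h_g(x) \le 1$ for all $x \in E$.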
2.2 Risk estimation via adversarial examples
Combining the ideas of this section so far, we now introduce unbiased risk estimates based on adversarial examples. Our goal is to estimate the error rate of $f$ through an adversarially generated sample $S' = \{(X'_1, Y_1), \ldots, (X'_m, Y_m)\}$ obtained through an AEG $g$, where $X'_i = g(X_i)$ with $X_1, \ldots, X_m$ drawn independently from $\mathcal{P}$ and $Y_i = f^*(X_i)$. Since $g$ satisfies (G1) by definition, the original example $X_i$ and the corresponding adversarial example $X'_i$ have the same label $Y_i$. Recalling that $h_g \le 1$ on $E$, we immediately see that the importance weighted adversarial estimate
$$\hat{R}_g(f) = \frac{1}{m} \sum_{i=1}^{m} L(f, X'_i)\, h_g(X'_i) \tag{4}$$
obtained from (3) for the adversarial sample $S'$ has smaller variance than that of the empirical average $\hat{R}_S(f)$: both estimates are unbiased with expectation $R(f)$, and so
$$\mathbb{V}[\hat{R}_g(f)] = \frac{1}{m}\Big(\mathbb{E}\big[L(f, X'_1)^2\, h_g(X'_1)^2\big] - R(f)^2\Big) \le \frac{1}{m}\big(R(f) - R(f)^2\big) = \mathbb{V}[\hat{R}_S(f)].$$
Intuitively, the more successful the AEG is (i.e., the more classification error it induces), the smaller the variance of the estimate $\hat{R}_g(f)$ becomes. As the adversarial error probability becomes larger, $h_g$ becomes smaller in general, so the inequality is likely to be strengthened.
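A small exact calculation may help make the unbiasedness and variance claims concrete. The sketch below uses a hypothetical finite input space (not from the paper) where the pushforward measure and the density ratio can be enumerated with rational arithmetic. All names (`f_star`, `f`, `g`, `p_g`, `h_g`, `moments`) are illustrative assumptions; `h_g` stands for the density of the data distribution with respect to the pushforward distribution on the error set.

```python
from collections import defaultdict
from fractions import Fraction

points = range(10)
p = {x: Fraction(1, 10) for x in points}                 # uniform data distribution

f_star = lambda x: 0 if x < 5 else 1                     # ground truth
f = lambda x: 1 - f_star(x) if x == 0 else f_star(x)     # classifier, errs only on x = 0
g = lambda x: 0 if x in (1, 2, 3) else x                 # hypothetical AEG: sends 1, 2, 3 to the error point
loss = lambda x: int(f(x) != f_star(x))                  # zero-one loss

# Check the AEG conditions: (G1) labels preserved, (G2) misclassified points unchanged.
assert all(f_star(g(x)) == f_star(x) for x in points)
assert all(g(x) == x for x in points if loss(x) == 1)

# Pushforward distribution of g(X) and the density ratio on the error set.
p_g = defaultdict(Fraction)
for x in points:
    p_g[g(x)] += p[x]
h_g = {x: p[x] / p_g[x] for x in points if loss(x) == 1}

def moments(term):
    """Exact mean and variance of term(X) for X drawn from p."""
    mean = sum(p[x] * term(x) for x in points)
    var = sum(p[x] * term(x) ** 2 for x in points) - mean ** 2
    return mean, var

risk, var_test = moments(loss)
adv_mean, var_adv = moments(lambda x: loss(g(x)) * h_g.get(g(x), 0))
print(risk, adv_mean, var_test, var_adv)
```

In this toy case the per-example adversarial term has the same mean as the zero-one loss (1/10) but variance 3/200 instead of 9/100, in line with the inequality above.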
3 Detecting overfitting
In this section we show how the risk estimates introduced in the previous section can be used to test the independence hypothesis that
(H) the sample $S$ and the model $f$ are independent.
If (H) holds, $\mathbb{E}[\hat{R}_g(f)] = \mathbb{E}[\hat{R}_S(f)] = R(f)$, and so the difference $T_{S,g}(f) = \hat{R}_g(f) - \hat{R}_S(f)$ is expected to be small. On the other hand, if $f$ is overfitted to the dataset $S$ (in which case $\hat{R}_S(f) < R(f)$), we expect $\hat{R}_S(f)$ and $\hat{R}_g(f)$ to behave differently (the latter being less sensitive to overfitting) since (i) $\hat{R}_g(f)$ depends also on examples previously unseen by the training procedure; (ii) the adversarial transformation $g$ aims to increase the loss, countering the effect of overfitting; (iii) especially in high dimensional settings, in case of overfitting one may expect that there are misclassified points very close to the decision boundary of $f$ which can be found by a carefully designed AEG. Therefore, intuitively, (H) can be rejected if $|T_{S,g}(f)|$ exceeds some appropriate threshold. The simplest way to determine the threshold is based on constructing confidence intervals for these estimators based on concentration inequalities, as discussed in Appendix A.
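A smaller threshold for $|T_{S,g}(f)|$, and hence a more effective independence test, can be devised if instead of independently estimating the behavior of $\hat{R}_S(f)$ and $\hat{R}_g(f)$, one utilizes their apparent correlation. Indeed, $T_{S,g}(f)$ is the average of terms
$$T_i = L(f, g(X_i))\, h_g(g(X_i)) - L(f, X_i), \qquad i = 1, \ldots, m,$$
that is, $T_{S,g}(f) = \frac{1}{m} \sum_{i=1}^{m} T_i$, and the two terms in $T_i$ are typically highly correlated by the construction of $g$. Thus, we can apply the empirical Bernstein bound (Mnih et al., 2008) to the pairwise differences $T_i$ to set a tighter threshold in the test: if the independence hypothesis (H) holds (i.e., $S$ and $f$ are independent), then for any $0 < \delta < 1$, with probability at least $1 - \delta$,
$$|T_{S,g}(f)| \le B(m, \bar{\sigma}_T^2, \delta, U) \tag{6}$$
with
$$B(m, \sigma^2, \delta, U) = \sqrt{\frac{2 \sigma^2 \ln(3/\delta)}{m}} + \frac{3 U \ln(3/\delta)}{m}, \tag{7}$$
where $\bar{\sigma}_T^2 = \frac{1}{m} \sum_{i=1}^{m} (T_i - T_{S,g}(f))^2$ is the empirical variance of the $T_i$ terms, and $U$ denotes the range of the $T_i$ (that is, $a \le T_i \le a + U$ for some constant $a$), and we also used that the expectation of each $T_i$, and hence that of $T_{S,g}(f)$, is zero. Since $h_g(x) \le 1$ if $x \in E$ (as discussed in Section 2.2), it follows that $U \le 2$, but further assumptions (such as $g$ being density preserving) can result in tighter bounds.
This leads to our pairwise dependence detection method:
if $|T_{S,g}(f)| > B(m, \bar{\sigma}_T^2, \delta, U)$, reject (H) at a confidence level $1 - \delta$ (p-value $\delta$).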
Note that in order for the test to work well, we not only need the test statistic $T_{S,g}(f)$ to have a small variance in case of independence (this could be achieved if $g$ were the identity), but we also need the estimators $\hat{R}_S(f)$ and $\hat{R}_g(f)$ to behave sufficiently differently if the independence assumption is violated. The latter behavior is encouraged by stronger AEGs, as we will show empirically in Section 5.2 (see Figure 8 in particular).
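For a given statistic $(|T_{S,g}(f)|, \bar{\sigma}_T^2)$, the largest confidence level (smallest p-value) at which (H) can be rejected can be calculated by setting the value of the statistic $|T_{S,g}(f)| - B(m, \bar{\sigma}_T^2, \delta, U)$ to zero and solving for $\delta$. This leads to the following formula for the p-value (if the solution is larger than $1$, which happens when the bound (6) is loose, $\delta$ is capped at $1$):
$$\delta = \min\left\{1,\; 3 \exp\left(-\frac{m}{9 U^2}\left(\bar{\sigma}_T^2 + 3 U |T_{S,g}(f)| - \bar{\sigma}_T \sqrt{\bar{\sigma}_T^2 + 6 U |T_{S,g}(f)|}\right)\right)\right\}. \tag{8}$$
The main challenge in applying the method in practice is to find suitable AEG functions $g$ for which we can compute the density $h_g$.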
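The pairwise test can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the empirical Bernstein bound in its commonly stated form, sqrt(2 var ln(3/delta) / m) + 3 U ln(3/delta) / m, uses the biased (1/m) empirical variance, and takes U = 2 as the default range of the pairwise differences. The function names and the argument layout are hypothetical.

```python
import math

def bernstein_bound(m, var, delta, U):
    """Empirical Bernstein deviation bound (assumed standard form)."""
    log_term = math.log(3.0 / delta)
    return math.sqrt(2.0 * var * log_term / m) + 3.0 * U * log_term / m

def pairwise_test(adv_terms, test_terms, U=2.0):
    """adv_terms[i]: importance-weighted loss on the adversarial point of example i;
    test_terms[i]: zero-one loss on the original example i.
    Returns the mean difference, its empirical variance, and the p-value."""
    m = len(test_terms)
    diffs = [a - b for a, b in zip(adv_terms, test_terms)]
    T = sum(diffs) / m
    var = sum((d - T) ** 2 for d in diffs) / m            # biased empirical variance
    # Closed-form solution of |T| = bernstein_bound(m, var, delta, U) for delta.
    y = m / (9.0 * U ** 2) * (
        var + 3.0 * U * abs(T) - math.sqrt(var) * math.sqrt(var + 6.0 * U * abs(T))
    )
    p_value = min(1.0, 3.0 * math.exp(-y))
    return T, var, p_value

# Hypothetical data: 300 of 1000 examples gain a weighted adversarial loss of 0.5.
T, var, p_value = pairwise_test([0.5] * 300 + [0.0] * 700, [0.0] * 1000)
print(T, var, p_value)
```

Plugging the returned p-value back into `bernstein_bound` recovers `abs(T)`, which is a quick consistency check on the closed-form inversion. Independence is rejected at confidence level `1 - p_value`.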
3.1 Dependence detector for randomized training
The dependence between the model and the test set can arise from (i) selecting the “best” random seed in order to improve the test set performance and/or (ii) tweaking the model architecture (e.g., neural network structure) and hyperparameters (e.g., learning-rate schedule). If one has access to a single instance of a trained model, these two sources cannot be disentangled. However, if the model architecture and training procedure are fully specified and computational resources are adequate, it is possible to isolate (i) and (ii) by retraining the model multiple times and calculating the p-value for every training run separately. Assuming $N$ models, let $f_j$, $j = 1, \ldots, N$, denote the $j$-th trained model and $p_j$ the p-value calculated using the pairwise independence test (6) (i.e., from Eq. 8). We can investigate the degree to which (i) occurs by comparing the $p_j$ values with the corresponding test set error rates $\hat{R}_S(f_j)$. To investigate whether (ii) occurs, we can average over the randomness of the training runs.
For every example $X_i \in S$, consider the average test statistic
$$\bar{T}_i = \frac{1}{N} \sum_{j=1}^{N} T_{j,i},$$
where $T_{j,i}$ is the per-example statistic $T_i$ of the pairwise test, calculated for example $X_i$ and model $f_j$ with AEG $g_j$ selected for model $f_j$ (note that AEGs are model-dependent by construction). If, for each $j$ and $i \neq i'$, the random variables $T_{j,i}$ and $T_{j,i'}$ are independent, then so are $\bar{T}_i$ and $\bar{T}_{i'}$ (for all $i \neq i'$). Hence, we can apply the pairwise dependence detector (6) with $\bar{T}_i$ instead of $T_i$, using the average $\bar{T} = \frac{1}{m} \sum_{i=1}^{m} \bar{T}_i$ with empirical variance $\bar{\sigma}_{T,N}^2$, giving a single p-value $p_N$. If the training runs vary enough in their outcomes, different models will err on different data points in $S$, leading to $\bar{\sigma}_{T,N}^2 < \bar{\sigma}_T^2$ and therefore strengthening the power of the dependence detector (indeed, our experiments in Section 5.3 on a CIFAR-10 image classification model show this behavior). This observation can also be interpreted in the following way: randomizing the training seed weakens the overfitting of the model architecture by introducing an important parameter which is not controlled in the experiments. Averaging the outcomes of training runs removes this source of randomness, strengthening the dependence detector. On the other hand, if the random seed is “optimized” in order to improve the test set performance, then the above procedure is obviously not able to detect that.
Note that we can calculate $p_n$ also for a sequence of subsets of models (e.g., of size $n$ each), obtain a sequence of corresponding p-values and study their distribution. By changing the averaged model set size $n$ from 1 (testing every model independently) to $N$ we adjust the trade-off between more information about the p-value distribution and detection power. To avoid reusing the same training run multiple times, given $N$ training runs we will calculate $\lfloor N/n \rfloor$ statistics over $n$ training runs each, obtaining $\lfloor N/n \rfloor$ p-values.
For brevity, we call this overfitting detection method (or, correspondingly, this test of independence) an $N$-model detector (or an $N$-model independence test).
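The effect of averaging over training runs can be sketched as follows. This is an illustrative sketch only: the p-value formula is the closed-form inversion of the empirical Bernstein bound in its assumed standard form, with U = 2 assumed as the range of the per-example differences, and the function name and data layout are hypothetical.

```python
import math

def n_model_p_value(per_model_diffs, U=2.0):
    """per_model_diffs[j][i]: per-example difference for model j and example i.
    Averages the differences over models for each example, then applies the
    pairwise test to the averaged terms and returns the p-value."""
    n_models = len(per_model_diffs)
    m = len(per_model_diffs[0])
    avg = [sum(diffs[i] for diffs in per_model_diffs) / n_models for i in range(m)]
    T = sum(avg) / m
    var = sum((t - T) ** 2 for t in avg) / m              # biased empirical variance
    y = m / (9.0 * U ** 2) * (
        var + 3.0 * U * abs(T) - math.sqrt(var) * math.sqrt(var + 6.0 * U * abs(T))
    )
    return min(1.0, 3.0 * math.exp(-y))

# Hypothetical runs: two models whose nonzero differences fall on disjoint examples.
model_a = [0.5 if i % 2 == 0 else 0.0 for i in range(200)]
model_b = [0.5 if i % 2 == 1 else 0.0 for i in range(200)]

p_single = n_model_p_value([model_a])
p_pair = n_model_p_value([model_a, model_b])
print(p_single, p_pair)
```

In this constructed example both models have the same mean difference, but averaging removes the per-example variance entirely, so the two-model p-value is smaller than the single-model one.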
4 Synthetic experiments
In this section we present a simple synthetic classification problem to illustrate the power of the method of Section 3. The benefit of this example is that we are able to compute the density in an analytic form (see Figure 4 for an illustration).
Let and consider an input distribution with a density that is an equally weighted mixture of two 500-dimensional isotropic truncated Gaussian distributions with coordinate-wise standard deviation ( denotes the identity matrix of size ), means and densities truncated in the first dimension such that if and if . The label of an input point is , which is the sign of its first coordinate.
We consider linear classifiers of the form trained with the cross-entropy loss where . We employ a one-step gradient method (which is an version of the fast gradient-sign method of Goodfellow et al., 2014; Papernot et al., 2016) to define our AEG $g$, which tries to modify a correctly classified point $x$ with label $y$ in the direction of the gradient of the cost function : for some . For our specific choice of , the above simplifies to . To comply with the requirements for an AEG, we define $g$ as follows: if and (corresponding to (G2) and (G1), respectively), while otherwise. Therefore, if is misclassified by , and are the only points mapped to by . Thus, the density at after the transformation is and (note that ).
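A one-step gradient AEG for a linear classifier can be sketched as below. Several details are assumptions made for illustration, since the exact formulas are not reproduced here: the classifier is taken to be `sign(w . x + b)` with labels in {-1, +1}, the ground-truth label is the sign of the first coordinate, and the step is an L2-normalized move of length `eps` against the correct label (the direction in which the cross-entropy loss of a linear model increases). The candidate is kept only when the original point is correctly classified and the move does not change the ground-truth label; otherwise the input is returned unchanged.

```python
import math

def sign(v):
    return 1 if v >= 0 else -1

def linear_aeg(x, w, b, eps):
    """One-step L2 gradient AEG sketch for f(x) = sign(w . x + b).
    Returns the perturbed point, or a copy of x when the AEG conditions fail."""
    y = sign(x[0])                                        # ground-truth label
    pred = sign(sum(wi * xi for wi, xi in zip(w, x)) + b)
    if pred != y:                                         # (G2): leave misclassified points alone
        return list(x)
    norm = math.sqrt(sum(wi * wi for wi in w))
    candidate = [xi - eps * y * wi / norm for xi, wi in zip(x, w)]
    if sign(candidate[0]) != y:                           # (G1): the label must be preserved
        return list(x)
    return candidate

# Hypothetical two-dimensional usage (the synthetic problem itself is 500-dimensional).
print(linear_aeg([1.0, 0.0], [1.0, 0.0], 0.0, 0.5))      # moved towards the boundary
print(linear_aeg([1.0, 0.0], [1.0, 0.0], 0.0, 1.5))      # step would flip the label: unchanged
```

The sketch ignores the truncation of the data distribution near the class boundary; a faithful version would also reject candidates that leave the support of the distribution.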
We present two experiments showing the behavior of our independence test: one where the training and test sets are independent, and another where they are not.
In the first experiment a linear classifier was trained on a training set of size 500 for 50,000 steps using the RMSProp optimizer (Tieleman and Hinton, 2012) with batch size 100 and learning rate 0.01, obtaining zero (up to numerical precision) final training loss and, consequently, 100% prediction accuracy on the training data. Then the trained classifier was tested on a large test set of size 10,000.^{9} Both sets were drawn independently from defined above. We used a range of values matched to the scale of the data distribution: from , which is the order of magnitude of the margin between two classes (0.05), to , which is the order of magnitude of the width of the Gaussian distribution used for each class ().
^{9} The large number of test examples ensures that the random error in the empirical error estimate is negligible.
In the second experiment we consider the situation where the training and test sets are not independent. To enhance the effects of this dependence, the setup was modified to make the training process more amenable to overfitting by simulating a situation when the model has a wrong bias (this may happen in practice if a wrong architecture or data preprocessing method is chosen, which, despite the modeler’s best intentions, worsens the performance). Specifically, during training we added a penalty term to the training loss , decreased the size of the test set to 1000 and used 50% of the test data for training (the final penalized training loss was 0.25 with 100% prediction accuracy on the training set). Note that the small training set and the large penalty on yield classifiers that are essentially independent of the only interesting feature (recall that the true label of a point is ) and overfit to the noise in the data, resulting in a true model risk .
The results of the two experiments are shown in Figure 5, plotted against different perturbation strengths: the left column corresponds to the first experiment while the right column to the second. The first row presents the p-values for rejecting the independence hypothesis, calculated by repeating the experiment (sampling data and training the classifier) 100 times and applying the single-model (Section 3, labelled as in the plots) and $N$-model (Section 3.1, labelled as in the plots) independence test, and taking the average over models (or model sets of size ) for each . We also plot empirical 95% two-sided confidence intervals () or, due to the limited number of p-values available after dividing 100 runs into disjoint bins of size , ranges between minimum and maximum value (). For all methods of detecting dependence, it can be seen that for the independent case the test is correctly not able to reject the independence hypothesis (the average p-value is very close to , although in some runs it can drop to as low as ). On the other hand, for , the non-independent model failed the independence test at confidence level ; hence, in this range of our independence test reliably detects overfitting.
In fact, it is easy to argue that our test should only work for a limited range of , that is, it should not reject independence for too small or too large values of . First we consider the case of small values. Notice that except for points close (in norm) to the true decision boundary or the decision boundary of , is invertible: if is correctly classified and is away from the true decision boundary, there is exactly one point, , which is translated to , while if is incorrectly classified and away from the decision boundary of , no translation leads to and ; any other points are close to the decision boundary of either or . Thus, since is bounded, is invertible on a set of at least probability (according to ). When , , and so for all points with (since is continuous in all such ), implying on these points. It also follows that can only happen to a set of points with an probability. This means that on a set of probability, and for these points . Thus, with probability . Unless the test set is concentrated in large part on the set of remaining points with probability, the test statistic with high probability and our method will not reject the independence hypothesis for .
When is large (), notice that for any point with nonvanishing probability (i.e., with for some ), if then . Therefore, for such an , if and , , and so (if , we trivially have ). If , we have . If is invertible at then and . If is not invertible, then there is another such that ; however, if then (since is large), and so , giving . Therefore, for large , with high probability (i.e., for points with ), so the independence hypothesis will not be rejected with high probability.
To better understand the behavior of the test, the second row of Figure 5 shows the empirical test error rate , the (unadjusted) adversarial error rate , and the adversarial risk estimate , together with their confidence intervals. For the non-independent model, we also show the expected error (estimated over a large independent test set), while it is omitted for the independent model where it approximately coincides with both and . While the reweighted adversarial error estimate remains the same for all perturbations in case of an independent test set (left column), the adversarial error rate varies a lot for both the dependent and independent test sets. For example, in the case when the test samples and the model are not independent, it undershoots the true error for and overshoots it for larger perturbations. For very large perturbations ( close to ), the behavior of depends on the model : in the independent case decreases back to because such large perturbations increasingly often change the true label of the original example, so fewer and fewer adversarial points are generated. In the case when the data and the model are not independent (right column), the adversarial perturbations are almost always successful (i.e., lead to a valid adversarial example for most originally correctly classified points), yielding an adversarial error rate close to one for large enough perturbations. This is because the decision boundary of is almost orthogonal to the true decision boundary, and so the adversarial perturbations are parallel with the true boundary, almost never changing the true label of a point.
The plots of the densities (Radon-Nikodym derivatives), given in the third row of Figure 5, show how the change in their values compensates for the increase of the adversarial error rate : in the independent case, the effect is completely eliminated, yielding an unbiased adversarial error estimate , which is essentially constant over the whole range of (as shown in the first row), while in the non-independent case the similar densities do not bring back the adversarial error rate to the test error rate , allowing the test to detect overfitting. Note that the densities exhibit similar trends (and values) in both cases, driven by the dependence of typical values of the ratio on the perturbation strength for originally misclassified points () and for successful adversarial examples (i.e., and ).
To compare the behavior of our improved, pairwise test and the basic version, the fourth row of Figure 5 depicts a single realization of the experiments where the confidence intervals (as computed from Bernstein’s inequality) are shown for the estimates. For the independent case, the confidence intervals of and overlap for all , and thus the basic test is not able to detect overfitting. In the nonindependent case, the confidence intervals overlap for and , thus the basic test is not able to detect overfitting with at a confidence level, while the improved test (second row) is able to reject the independence hypothesis for these values at the same confidence level.
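The interval comparison used by the basic test can be illustrated with a minimal Python sketch. It assumes i.i.d. per-example values whose deviation from the mean is bounded by `bound`, and it plugs the sample variance into Bernstein's inequality in place of the true variance; the source does not state which variant was used, so this plug-in step and the function names `bernstein_interval` and `intervals_overlap` are assumptions made for illustration only.

```python
import math


def bernstein_interval(samples, bound=1.0, delta=0.05):
    """Two-sided interval for the mean of bounded i.i.d. samples.

    Solves 2 * exp(-n t^2 / (2 var + 2 bound t / 3)) = delta for t.
    Assumption of this sketch: the sample variance stands in for the
    true variance.
    """
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    log_term = math.log(2.0 / delta)
    a = bound * log_term / (3.0 * n)
    half_width = a + math.sqrt(a * a + 2.0 * var * log_term / n)
    return mean - half_width, mean + half_width


def intervals_overlap(ci_a, ci_b):
    """True if two closed intervals (lo, hi) share at least one point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]


# Hypothetical 0/1 losses: 10% errors on the test set, 30% after perturbation.
test_losses = [1] * 10 + [0] * 90
adversarial_losses = [1] * 30 + [0] * 70
ci_test = bernstein_interval(test_losses)
ci_adv = bernstein_interval(adversarial_losses)
# Overlapping intervals mean the basic test cannot reject independence.
print(ci_test, ci_adv, intervals_overlap(ci_test, ci_adv))
```

In this sketch the decision rule is simply whether the two intervals are disjoint, which is why the basic test is weak when the intervals are wide.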
Finally, in the fifth row of Figure 5 we plotted the histograms of the empirical distribution of p-values for both models, over 100 independent runs (between the runs, all the data was regenerated and the models were retrained). For , they concentrate heavily on either or , and have very thin tails extending far towards the opposite end of the interval. This explains the surprisingly wide confidence intervals for p-values plotted in the first row. In particular, the fact that some p-values for the independent model are as low as does not mean the independence test is not reliable, because almost all calculated p-values are close to or equal to , and the few outliers are a combined consequence of the finite sample size and the effectiveness of the AEG. The additional histogram for the nonindependent model illustrates a regime which is in between the single-model pairwise test (Section 3) completely failing to reject the independence hypothesis and clearly rejecting it.
[Figure 6. Left panel: independent model. Right panel: nonindependent model.]
To verify experimentally whether the model independence test can be a more powerful detector of overfitting than the single-model version, in Figure 6 (right panel) we plotted p-value histograms for for the intermediate AEG strength applied to the nonindependent model over 100 training runs. Indeed, as increases, the concentration of p-values around in the low () range increases. For we did not have enough values to plot a histogram: for we obtained and , while for the value is . The increase of the test power becomes apparent when we compare the last value with the mean of the p-values obtained by testing every training run separately, equal to , and the median .
For comparison, we also plotted in Figure 6 (left panel) the corresponding histograms for the independent model and a slightly higher attack strength, , at which the independence test fails for the overfitted model even without averaging (see Figure 5, first row, right panel). The histograms are all clustered in the region close to 1, indicating that the model test is not overly pessimistic.
5 Testing overfitting on image classification benchmarks
In the previous section we have shown how the proposed adversarial-example-based error estimates work for a synthetic problem where the densities (Radon-Nikodym derivatives) can be computed exactly. In this section we apply our estimates to two popular image datasets; here the main issue is to find sufficiently strong AEGs that allow computing the corresponding densities.
To facilitate the computation of the density , we limit ourselves to density-preserving AEGs as defined by (G3) (recall that (G3) is different from requiring ). Since in (4) and (3), is multiplied by , we only need to determine the density for data points that are misclassified by .
5.1 AEGs based on translations
To satisfy (G3), we use translations of images as adversarial perturbations. Translation attacks have recently been proposed as a means of generating adversarial examples (Azulay and Weiss, 2018). Although such attacks are relatively weak, they fit our needs well: unless the images are procedurally centered, it is reasonable to assume that translating them by a few pixels does not change their likelihood. When the images are procedurally centered, one can still run the test on the “neighboring distribution” where the centering is removed in a data augmentation procedure that applies random translations to the training data. Clearly, this (or, for that matter, translation attacks) will not work if centering is built into the prediction procedure, as may happen if one’s goal is to build a model that is invariant to (small) translations. Since centering alone is not expected to provide a satisfactory solution to this invariance, we do not expect centering to be part of what a model does. As before, we only apply perturbations to correctly classified images. We also make the natural assumption that the small translations we use do not change the true class of an image. These assumptions imply that translations by a few pixels satisfy conditions (G1) and (G3), and so a function is a valid AEG that also satisfies (G3) if it leaves all misclassified images in place (to comply with (G2)), and either leaves a correctly classified image unchanged or applies a small translation.
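The behavior described above (leave misclassified images in place, and move a correctly classified image only if a small translation makes the model err) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation used in the experiments: the names `translate`, `translational_aeg`, `predict` and `max_shift` are hypothetical, the cyclic wrap-around of `np.roll` is an assumed boundary treatment that the text does not specify, and the first misclassified translation in row-major order is returned for simplicity.

```python
import numpy as np


def translate(image, dx, dy):
    """Shift an (H, W, C) image by (dx, dy) pixels.

    Assumption of this sketch: pixels wrap around the border, which
    keeps every shift exactly invertible.
    """
    return np.roll(image, shift=(dy, dx), axis=(0, 1))


def translational_aeg(image, label, predict, max_shift):
    """Return `image` or a translated copy that `predict` misclassifies."""
    if predict(image) != label:
        return image  # misclassified points are left in place, as (G2) requires
    # Scan the candidate translations top to bottom, left to right.
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            if dx == 0 and dy == 0:
                continue
            candidate = translate(image, dx, dy)
            if predict(candidate) != label:
                return candidate  # a successful adversarial example
    return image  # no translation in range fools the model


# Toy usage with a hypothetical classifier that only looks at one pixel.
toy_predict = lambda x: int(x[0, 0, 0] > 0.5)
toy_image = np.zeros((4, 4, 1))
toy_image[1, 1, 0] = 1.0
adversarial = translational_aeg(toy_image, 0, toy_predict, max_shift=1)
print(toy_predict(toy_image), toy_predict(adversarial))
```

Returning the original array object when nothing changes makes it easy to tell unsuccessful attempts from successful adversarial examples.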
Formally, images are modeled as 3D tensors in space, where for RGB data, and and are the width and height of the images, respectively. Let denote the translation of an image by pixels in the (X, Y) plane (here denotes the set of integers). To control the amount of change, we limit the magnitude of translations and allow only, for some fixed positive . Thus, we consider AEGs of the form if and otherwise (if is correctly classified, we attempt to translate it to find an adversarial example in which is misclassified by , but is left unchanged if no such point exists). Denoting the density of the pushforward measure by , for any misclassified point , where the second equality follows from (G3). Therefore, the corresponding density is
(9) 
where is the number of neighboring images which are mapped to by . Note that given and , can be easily calculated by checking all possible translations of by for . It is easy to extend the above to nondeterministic perturbations, defined as distributions over AEGs, by replacing the indicator with its expectation with respect to the randomness of , yielding
(10) 
If is deterministic, we have for any successful adversarial example . Hence, for such , the range of the random variables defined in (3) has a tighter upper bound of 3/2 instead of 2 (as ), leading to a tighter bound in (6) and a stronger pairwise independence test. In the experiments, we use this stronger test.
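The density calculation for translational AEGs can be written out as a short derivation. The notation here is assumed for illustration and may differ from the original symbols: $f$ is the model, $g$ the AEG, $\tau_v$ the translation by $v$, $\mathcal{V}$ the set of allowed nonzero translations, $\rho$ the data density, $\rho_g$ the density of the pushforward measure, $h_g = \rho / \rho_g$ the corresponding density ratio, and $n(x)$ the number of neighboring images mapped to $x$ by $g$. The sketch is meant to be consistent with the description in the text, not a verbatim restoration of equations (9) and (10).

```latex
% Assumed notation; a sketch of the argument, not the original equations.
% For a point x misclassified by f (so that g(x) = x):
\begin{align*}
\rho_g(x)
  &= \rho(x) + \sum_{v \in \mathcal{V}} \rho\bigl(\tau_{-v}(x)\bigr)\,
     \mathbb{I}\bigl[g(\tau_{-v}(x)) = x\bigr] \\
  &= \rho(x)\,\bigl(1 + n(x)\bigr),
  \qquad
  n(x) = \sum_{v \in \mathcal{V}} \mathbb{I}\bigl[g(\tau_{-v}(x)) = x\bigr],
\end{align*}
% where the second equality uses density preservation (G3):
% \rho(\tau_{-v}(x)) = \rho(x). Hence
\[
h_g(x) = \frac{\rho(x)}{\rho_g(x)} = \frac{1}{1 + n(x)} .
\]
% For a randomized g, the indicator is replaced by its expectation:
\[
h_g(x) = \frac{1}{1 + \sum_{v \in \mathcal{V}}
  \Pr\bigl[g(\tau_{-v}(x)) = x\bigr]} .
\]
% If g is deterministic and x' = g(x) \neq x is a successful adversarial
% example, then at least one neighbor is mapped to x', so n(x') \ge 1 and
\[
h_g(x') \le \tfrac{1}{2} .
\]
```

Under these assumptions, the bound of $1/2$ on the density at successful adversarial examples is what permits the tighter range of 3/2 mentioned above.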
For the image classification benchmarks we consider two translation variants that are used in constructing a translational AEG. For every correctly classified image , we consider translations from (for some ), choosing from the set . If all translations result in correctly classified examples, we set . Otherwise, we use one of two possible ways to select (and we call the resulting points successful adversarial examples):

Strongest perturbation: Assuming the number of classes is , let denote the vector of the class logits calculated by the model for image , and let
We define
with ties broken deterministically by choosing the first translation from the candidate set, going top to bottom and left to right in row-major order. Thus, here we seek a nonidentical “neighbor” that causes the classifier to err the most, reachable from by translations within a maximum range .

Nearest misclassified neighbor: Here we aim to find the nearest image in that is misclassified. That is, letting