# Detecting Overfitting via Adversarial Examples

The repeated reuse of test sets in popular benchmark problems raises doubts about the credibility of reported test error rates. Verifying whether a learned model is overfitted to a test set is challenging as independent test sets drawn from the same data distribution are usually unavailable, while other test sets may introduce a distribution shift. We propose a new hypothesis test that uses only the original test data to detect overfitting. It utilizes a new unbiased error estimate that is based on adversarial examples generated from the test data and importance weighting. Overfitting is detected if this error estimate is sufficiently different from the original test error rate. The power of the method is illustrated using Monte Carlo simulations on a synthetic problem. We develop a specialized variant of our dependence detector for multiclass image classification, and apply it to testing overfitting of recent models to two popular real-world image classification benchmarks. In the case of ImageNet, our method was not able to detect overfitting to the test set for a state-of-the-art classifier, while on CIFAR-10 we found strong evidence of overfitting for the two recent model architectures we considered, and weak evidence of overfitting on the level of individual training runs.

## 1 Introduction

Deep neural networks achieve impressive performance on many important machine learning benchmarks, such as image classification (Krizhevsky, 2009; Krizhevsky et al., 2012; Szegedy et al., 2015; Simonyan and Zisserman, 2015; He et al., 2016), automated translation (Bahdanau et al., 2014; Wu et al., 2016) or speech recognition (Deng et al., 2013; Graves et al., 2013). However, the benchmark datasets are used a multitude of times by researchers worldwide. Since state-of-the-art methods are selected and published based on their performance on the corresponding test set, it is typical to see results that continuously improve over time; see, e.g., the discussion of Recht et al. (2018) and Figure 1 for the performance improvement of classifiers published for the popular CIFAR-10 image classification benchmark (Krizhevsky, 2009). This process may naturally lead to models overfitted to the test set, rendering the test error rate (the average error measured on the test set) an unreliable indicator of the actual performance.

Detecting whether the model is overfitted to the test set is challenging, because independent test sets drawn from the same data distribution are generally not available, while alternative test sets often introduce a distribution shift. (Throughout the paper we work with the standard statistical learning theory framework, which assumes that data, both training and test, is sampled independently from a fixed underlying data distribution.) The shift may happen even if the new test set is collected very carefully. For example, the recent CIFAR-10.1 dataset was collected by diligently following the original data curation procedure of CIFAR-10, and yet the data distribution underlying CIFAR-10.1 appears to be different from that of the test set of CIFAR-10 (for details, see Recht et al., 2018).

To estimate the performance of a model on unseen data, one may use generalization bounds to get upper bounds on the expected error rate. The generalization bounds are also applicable when the model and the data are dependent (e.g., for error estimates based on the training data or the reused test data), but they usually lead to loose error bounds. Similarly, constructing confidence intervals around the training and test error rates from generalization bounds and rejecting independence of the test set if they do not overlap does not work in practice for the same reason. On the other hand, if the test data and the model are independent, much tighter bounds are available; hence, dependence testing can be applied to validate the resulting bounds. Recently, several methods that allow the reuse of the test set while keeping the validity of test error rates have been proposed (Dwork et al., 2015). However, these are intrusive: they require the user to follow a strict protocol of interacting with the test set and are thus not applicable in the more common situation when enforcing such a protocol is impossible.

In this paper we take a new approach to the challenge of detecting overfitting of a model to the test set. Our method is non-intrusive and uses only the original test data. The core novel idea is to harness the power of adversarial examples (Goodfellow et al., 2014) for our purposes. These are data points (throughout the paper, we use the words “example” and “point” interchangeably) which are not sampled from the data distribution, but instead are cleverly crafted based on existing data points so that the model errs on them. Several authors showed that the best models learned for the above-mentioned benchmark problems are highly sensitive to adversarial attacks (Goodfellow et al., 2014; Papernot et al., 2016; Uesato et al., 2018; Carlini and Wagner, 2017a,b; Papernot et al., 2017). For instance, one can often create adversarial versions of images correctly classified by a state-of-the-art model on which the model makes a mistake. Often, the adversarial versions are (almost) indistinguishable from the original for a human observer; see, for example, Figure 2, where the adversarial image is obtained from the original one by a carefully selected translation.

The adversarial error estimator we propose, which combines adversarial examples generated from the test data with importance weighting, is unbiased and has a smaller variance than the standard test error rate if the test set and the model are independent.

More importantly, since it is based on adversarially generated data points, the adversarial estimator is expected to differ significantly from the test error rate if the model is overfitted to the test set, providing a way to detect test set overfitting. Thus, the test error rate and the adversarial error estimate (calculated based on the same test set) must be close if the test set and the model are independent, and are expected to be different in the opposite case. In particular, if the gap between the two error estimates is large, the independence hypothesis (i.e., that the model and the test set are independent) is dubious and will be rejected. Combining results from multiple training runs, we develop another method to test overfitting of a model architecture. (Note that by model architecture we mean not just the layout of the architecture, e.g., that of a neural network, but also the corresponding training procedure and its (hyper-)parameters, except for the random seed.)

The most challenging aspect of our method is to construct adversarial perturbations for which we can calculate importance weights, while keeping enough degrees of freedom in the way the adversarial perturbations are generated to maximize power, the ability of the test to detect dependence when it is present.

We apply our independence tests to state-of-the-art classification methods for two popular image classification benchmarks, ImageNet (Deng et al., 2009) and CIFAR-10 (Krizhevsky, 2009). Considering VGG-like image-classification models (Simonyan and Zisserman, 2015; Rosca et al., 2017), we found that the independence hypothesis cannot be rejected for ImageNet at almost any confidence level, which is strong evidence that the VGG-like model is not overfitted to the test set (as a control experiment, we verify that the independence of the trained classifier and the training set is rejected with high confidence). For CIFAR-10, our individual test rejects the independence of a trained model and the test set at a confidence level of about 40%. (To be precise, we consider a rescaled version of the CIFAR-10 dataset; see Section 5.3 for details.) While this result is inconclusive, the independence of the CIFAR-10 test set and the model architecture can be rejected at a convincing confidence level of 97%. Similar results are obtained for the Wide Residual Network (WideResnet) architecture of Zagoruyko and Komodakis (2016). In conclusion, our method provides strong evidence for overfitting of the tested architectures to the test set, but the effect is much harder to detect for an individual training run, most likely due to the stochastic nature of the training process. As a sanity check, our individual independence test verifies (i.e., does not reject) the independence of the individual VGG models and the truly independent CIFAR-10.1 test set (on average, the independence can be rejected at a negligible confidence level of about 7%, while for about 80% of the individual models, independence cannot be rejected at any confidence level), while the independence of the model architecture can only be rejected at a confidence level of about 30%. Unfortunately, these latter results are less meaningful in the sense that similar confidence levels can be obtained if the test is applied to randomly selected subsets of the CIFAR-10 test set with the same size (2000 images).

The rest of the paper is organized as follows: In Section 2, we introduce a formal model for error estimation using adversarial examples, including the definition of adversarial example generators. The new overfitting-detection tests are derived in Section 3. To gain better understanding of their behavior, the tests are first analyzed on a synthetic example using Monte Carlo simulation in Section 4, while their application to image-classification benchmarks is presented in Section 5. Finally, conclusions are drawn and further research questions are discussed in Section 6.

We consider a classification problem with deterministic (noise-free) labels, which is a reasonable assumption for many practical problems, such as image recognition (we leave the extension of our method to noisy labels for future work). In particular, let $\mathcal{X}$ denote the input space and $\mathcal{Y}$ the set of labels. For simplicity, we assume that . Data is sampled from the distribution $P$ over $\mathcal{X}$ (throughout the paper we assume that the standard measurability assumptions are satisfied), and the class label is determined by a function $f^*: \mathcal{X} \to \mathcal{Y}$, which we refer to as the ground truth. We denote a random vector drawn from $P$ by $X$, and its corresponding class label by $Y = f^*(X)$. We consider deterministic classifiers of the form $f: \mathcal{X} \to \mathcal{Y}$, which assign a deterministic label to every value of $x \in \mathcal{X}$. The performance is measured by the zero-one loss, that is, the loss of classifying $x$ is $L(f,x) = \mathbb{I}(f(x) \neq f^*(x))$, where, for a Boolean-valued expression $A$, $\mathbb{I}(A)$ denotes its indicator function (that is, $\mathbb{I}(A) = 1$ if $A$ is true and $\mathbb{I}(A) = 0$ otherwise). The expected error (also known as the risk or expected risk in the learning theory literature) of the classifier $f$ is defined as

$$R(f) = \mathbb{P}(f(X) \neq Y) = \mathbb{E}[\mathbb{I}(f(X) \neq Y)] = \mathbb{E}[L(f,X)] = \int_{\mathcal{X}} L(f,x)\,dP(x).$$

Consider a test dataset $S = \{(X_1,Y_1),\ldots,(X_m,Y_m)\}$ where the $X_i$ are drawn from $P$ independently of each other and $Y_i = f^*(X_i)$. (Here, following the rest of the machine learning literature, we abuse notation. Technically, the sample $S$ should be seen as a sequence, not a set, drawn from the $m$-fold product of $P$ with itself.) In the learning setting, the classifier $f$ usually also depends on some randomly drawn training data and, as such, is random itself. If $f$ is independent from $S$, then $L(f,X_1),\ldots,L(f,X_m)$ are independent and identically distributed, and thus the empirical error rate

$$\hat{R}_S(f) = \frac{1}{m}\sum_{i=1}^{m} L(f,X_i) = \frac{1}{m}\sum_{i=1}^{m} \mathbb{I}(f(X_i) \neq Y_i) \tag{1}$$

is an unbiased estimate of $R(f)$ for all $f$, that is, $\mathbb{E}[\hat{R}_S(f)] = R(f)$. If $f$ and $S$ are not independent, the performance guarantees on the empirical estimates available in the independent case are significantly weakened because $L(f,X_1),\ldots,L(f,X_m)$ may no longer be independent. For example, in case of overfitting to $S$, the empirical error rate is likely to be much smaller than the expected error. To detect the lack of independence and the potential bias of the empirical error rate, we construct alternate error (risk) estimators with their corresponding confidence intervals that are valid under the independence assumption. If the confidence intervals do not overlap sufficiently, we can reject the hypothesis that the sample $S$ and the classifier $f$ are independent.

As the first estimate, we use the test error rate $\hat{R}_S(f)$. To produce the other estimate, we use the well-known importance sampling or change of measure idea (Kahn and Harris, 1951): instead of sampling from the distribution $P$, we sample from another distribution $P'$ and correct the estimate by appropriate reweighting. Assuming $P$ is absolutely continuous with respect to $P'$ on the set $E = \{x \in \mathcal{X} : L(f,x) \neq 0\}$,

$$R(f) = \int_{\mathcal{X}} L(f,x)\,dP(x) = \int_{E} L(f,x)\,h(x)\,dP'(x), \tag{2}$$

where $h = dP/dP'$ is the density (formally: Radon-Nikodym derivative) of $P$ with respect to $P'$ on $E$ ($h$ can be defined to have arbitrary finite values on $\mathcal{X} \setminus E$). It is well known that the variance of the estimator

$$\hat{R}'_{S'}(f) = \frac{1}{m}\sum_{i=1}^{m} L(f,X'_i)\,h(X'_i) = \frac{1}{m}\sum_{i=1}^{m} \mathbb{I}(f(X'_i) \neq Y'_i)\,h(X'_i) \tag{3}$$

obtained from a sample $S' = \{(X'_1,Y'_1),\ldots,(X'_m,Y'_m)\}$ drawn independently from $P'$ is minimized if $P'$ is the so-called zero-variance importance sampling distribution defined by $dP'(x) = L(f,x)\,dP(x)/R(f)$ (see, e.g., Section 4.2 of Bucklew, 2004). That is, $P'$ is concentrated on $E$ and $h(x) = R(f)/L(f,x)$ for all $x \in E$. In particular, in binary classification, $h(x) = R(f)$ for all $x \in E$.

Clearly, the zero-variance importance sampling distribution, and hence the above estimation strategy, is impossible to use, because it depends on the knowledge of both $f^*$ and $R(f)$, the latter of which we want to estimate. Nevertheless, it suggests that to minimize the variance of an estimator $\hat{R}'_{S'}(f)$, $P'$ needs to concentrate on points where $f$ makes mistakes. In the next subsection we use the notion of adversarial examples to generate such distributions.
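The importance-weighted estimator (3) can be illustrated on a small finite problem. The following is a minimal sketch, not the authors' code: the toy distribution, the proposal `p_prime`, and all function names are assumptions made purely for illustration. It samples from a proposal that puts extra mass on the error set and reweights by $h = dP/dP'$.

```python
import random

def importance_weighted_error(sample, loss, h):
    """Average of loss(x) * h(x) over a sample drawn from the proposal P'."""
    return sum(loss(x) * h(x) for x in sample) / len(sample)

# Hypothetical toy problem: X is uniform on {0, ..., 9} and the classifier
# errs only on x == 0, so the true risk is R(f) = 0.1.
p = {x: 0.1 for x in range(10)}

def loss(x):
    return 1.0 if x == 0 else 0.0

true_risk = sum(p[x] * loss(x) for x in p)

# Assumed proposal P': half of its mass sits on the error set E = {0}.
p_prime = {x: (0.5 if x == 0 else 0.5 / 9) for x in range(10)}

def h(x):
    return p[x] / p_prime[x]  # density dP/dP'

rng = random.Random(0)
support = list(p_prime)
sample = rng.choices(support, weights=[p_prime[x] for x in support], k=20000)
estimate = importance_weighted_error(sample, loss, h)

# Exact per-sample variances of the plain and the reweighted estimators.
var_plain = true_risk * (1 - true_risk)
var_weighted = sum(p_prime[x] * (loss(x) * h(x)) ** 2 for x in support) - true_risk ** 2
print(estimate, var_plain, var_weighted)
```

In this toy setting the reweighted estimate stays close to the true risk while its per-sample variance is roughly nine times smaller than that of the plain error rate, which is the effect the zero-variance argument points to.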

In this section we introduce a formal framework for generating adversarial examples. Given a classification problem instance with data distribution $P$ and ground truth $f^*$, an adversarial example generator (AEG) for a classifier $f$ is a (measurable) mapping $g: \mathcal{X} \to \mathcal{X}$ such that

(G1) $g$ preserves the class labels of the samples, that is, $f^*(x) = f^*(g(x))$ for $P$-almost all $x$; and

(G2) $g$ does not change points that are incorrectly classified by $f$, that is, $g(x) = x$ if $f(x) \neq f^*(x)$ for $P$-almost all $x$.

A simple example that illustrates AEGs is given in Figure 3, while another example is presented in Section 4. The justification of the definition of AEGs, in relation to the existing literature, is as follows: An adversarial example $g(x)$ is usually generated by staying in a small vicinity of the input $x$, i.e., by keeping $d(x, g(x))$ small for some distance(-like) function $d$, such as $d(x,x') = \|x - x'\|$ where $\|\cdot\|$ is, e.g., the $\ell_2$- or the max-norm. We view this as a practical way of ensuring condition (G1), the idea being that for all points in the support of $P$, by moving the point by a small amount, say, $\epsilon$, the label associated with the point is not going to change. If the label did change, the adversarial example generators proposed in the literature would not make sense. Thus, we can view condition (G1) as a relaxation of the foundational (implicit) assumption made in the literature underlying adversarial example generators. When developing practical methods for image recognition in Section 5, we also return to this stronger assumption instead of condition (G1). Note that this implicit assumption implies that the $\epsilon$-boundary between the classes has zero $P$-measure, which is a geometric margin assumption. If this set does not have zero measure, the error metrics that use AEGs can be biased in proportion to its $P$-measure. It follows that as long as the probability of seeing examples on the $\epsilon$-boundary between the positive and negative examples is negligible compared to whatever error estimates the methods produce, the conclusions remain valid. (G2) formalizes the fact that there is no need to change samples which are already misclassified. Indeed, existing AEGs comply with this condition.

The performance of an AEG is usually measured by how successfully it generates examples that are misclassified, which can be captured by

$$p_g = \mathbb{P}\big(f(g(X)) \neq f^*(g(X)) \,\big|\, f(X) = f^*(X)\big).$$

Accordingly, we call a point $g(x)$ a successful adversarial example if $x$ is correctly classified by $f$ and $f(g(x)) \neq f(x)$ (i.e., $L(f,x) = 0$ and $L(f,g(x)) = 1$).

In the development of our AEGs for image recognition tasks, we will make use of another condition. For simplicity, we formulate this condition for distributions $P$ that have a density $\rho$ with respect to the uniform measure on $\mathcal{X}$, which is assumed to exist (notable cases are when $\mathcal{X}$ is finite, or $\mathcal{X} = [0,1]^D$, or when $\mathcal{X} = \mathbb{R}^D$; in the latter two cases the uniform measure is the Lebesgue measure). The assumption states that the AEG needs to be density-preserving in the sense that

(G3) $\rho(x) = \rho(g(x))$ for $P$-almost all $x$.

Note that a density-preserving map may not be measure-preserving (the latter means that for all measurable $A \subseteq \mathcal{X}$, $P(g^{-1}(A)) = P(A)$).

We expect (G3) to hold when $g$ perturbs its input by a small amount and if $\rho$ is sufficiently smooth. We believe that the assumption is reasonable for, e.g., image recognition problems (at least in its relaxed form) where we expect that very close images will have a similar likelihood as measured by $\rho$. An AEG employing image translations, which satisfies (G3), will be introduced in Section 5. Both (G1) and (G3) can be relaxed (to a soft margin condition or allowing a slight change in $\rho$, respectively) at the price of an extra error term in the analysis that follows.

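For a fixed AEG $g$, let $P_g$ denote the distribution of $g(X)$ where $X \sim P$ ($P_g$ is known as the pushforward measure of $P$ under $g$). Further, let $h_g = dP/dP_g$ on $E$ and arbitrary otherwise. It is easy to see that $h_g$ is well-defined (on $E$) and, in addition, $h_g(x) \le 1$ for all $x \in E$: First note that for any measurable $A \subseteq E$,

$$P_g(A) = \mathbb{P}(g(X) \in A) \ge \mathbb{P}(g(X) \in A,\, X \in E) = \mathbb{P}(X \in A) = P(A),$$

where the second to last equality holds because $g(X) = X$ for any $X \in E$ under condition (G2). Thus, $P(A) \le P_g(A)$ for any measurable $A \subseteq E$, which implies that $h_g$ is well-defined on $E$ and $h_g(x) \le 1$ for all $x \in E$.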

One may think that (G3) implies that $h_g(x) = 1$ for all $x \in E$. However, this does not hold. For example, if $P$ is a uniform distribution, any $g: \operatorname{supp} P \to \operatorname{supp} P$ satisfies (G3), where $\operatorname{supp} P$ denotes the support of the distribution $P$. This is also illustrated in Figure 3.

### 2.2 Risk estimation via adversarial examples

Combining the ideas of this section so far, we now introduce unbiased risk estimates based on adversarial examples. Our goal is to estimate the error rate of $f$ through an adversarially generated sample $S' = \{(X'_1,Y_1),\ldots,(X'_m,Y_m)\}$ obtained through an AEG $g$, where $X'_i = g(X_i)$ with $X_1,\ldots,X_m$ drawn independently from $P$ and $Y_i = f^*(X_i)$. Since $g$ satisfies (G1) by definition, the original example $X_i$ and the corresponding adversarial example $X'_i$ have the same label $Y_i$. Recalling that $h_g = dP/dP_g \le 1$ on $E$, we immediately see that the importance weighted adversarial estimate

$$\hat{R}_g(f) = \frac{1}{m}\sum_{i=1}^{m} \mathbb{I}(f(X'_i) \neq Y_i)\,h_g(X'_i) \tag{4}$$

obtained from (3) for the adversarial sample $S'$ has smaller variance than that of the empirical average $\hat{R}_S(f)$: both estimates are unbiased with expectation $R(f)$, and so

$$\mathbb{V}[\hat{R}_g(f)] = \frac{1}{m}\Big(\mathbb{E}\big[L(f,g(X))^2\,h_g(g(X))^2\big] - R(f)^2\Big) \le \frac{1}{m}\Big(\mathbb{E}\big[L(f,g(X))\,h_g(g(X))\big] - R(f)^2\Big) = \frac{1}{m}\big(R(f) - R(f)^2\big) = \mathbb{V}[\hat{R}_S(f)].$$

Intuitively, the more successful the AEG is (i.e., the more classification error it induces), the smaller the variance of the estimate $\hat{R}_g(f)$ becomes. As the adversarial error probability becomes larger, $h_g$ becomes smaller in general, so the inequality is likely to be strengthened.
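The unbiasedness and the variance inequality can be checked exactly on a finite input space. The sketch below is an illustration under assumed values: the six-point space, the distribution `P`, the classifier `f` and the generator `g` are all hypothetical, chosen so that (G1) and (G2) hold. It computes the pushforward $P_g$, the weights $h_g = P/P_g$ on the error set, and the per-sample mean and variance of both estimators using exact rational arithmetic.

```python
from fractions import Fraction as F

points = range(6)
# Hypothetical data distribution, ground truth and classifier.
P = {0: F(1, 4), 1: F(1, 4), 2: F(1, 8), 3: F(1, 8), 4: F(1, 8), 5: F(1, 8)}
f_star = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
f = {0: 0, 1: 0, 2: 1, 3: 1, 4: 1, 5: 0}          # errs on points 2 and 5
loss = {x: int(f[x] != f_star[x]) for x in points}

# Hypothetical AEG: moves 0 -> 2 and 3 -> 5, leaves everything else fixed.
g = {0: 2, 1: 1, 2: 2, 3: 5, 4: 4, 5: 5}
assert all(f_star[x] == f_star[g[x]] for x in points)      # (G1)
assert all(g[x] == x for x in points if loss[x] == 1)      # (G2)

# Pushforward measure P_g and importance weights h_g = P / P_g on E.
P_g = {x: F(0) for x in points}
for x in points:
    P_g[g[x]] += P[x]
h_g = {x: P[x] / P_g[x] for x in points if loss[x] == 1}

risk = sum(P[x] * loss[x] for x in points)

def weighted_loss(x):
    """Per-sample adversarial term L(f, g(x)) * h_g(g(x))."""
    y = g[x]
    return loss[y] * h_g[y] if loss[y] else F(0)

adv_mean = sum(P[x] * weighted_loss(x) for x in points)
adv_var = sum(P[x] * weighted_loss(x) ** 2 for x in points) - risk ** 2
std_var = risk - risk ** 2
print(risk, adv_mean, adv_var, std_var)
```

Here both estimators have expectation 1/4, while the per-sample variance drops from 3/16 to 1/24, in line with the inequality above.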

## 3 Detecting overfitting

In this section we show how the risk estimates introduced in the previous section can be used to test the independence hypothesis that

(H) the sample $S$ and the model $f$ are independent.

If (H) holds, $\mathbb{E}[\hat{R}_g(f)] = \mathbb{E}[\hat{R}_S(f)] = R(f)$, and so the difference $T_{S,g}(f) = \hat{R}_g(f) - \hat{R}_S(f)$ is expected to be small. On the other hand, if $f$ is overfitted to the dataset $S$ (in which case $\hat{R}_S(f) < R(f)$), we expect $\hat{R}_S(f)$ and $\hat{R}_g(f)$ to behave differently (the latter being less sensitive to overfitting) since (i) $\hat{R}_g(f)$ depends also on examples previously unseen by the training procedure; (ii) the adversarial transformation $g$ aims to increase the loss, countering the effect of overfitting; (iii) especially in high dimensional settings, in case of overfitting one may expect that there are misclassified points very close to the decision boundary of $f$ which can be found by a carefully designed AEG. Therefore, intuitively, (H) can be rejected if $|T_{S,g}(f)|$ exceeds some appropriate threshold. The simplest way to determine the threshold is based on constructing confidence intervals for these estimators based on concentration inequalities, as discussed in Appendix A.

A smaller threshold for $|T_{S,g}(f)|$, and hence a more effective independence test, can be devised if instead of independently estimating the behavior of $\hat{R}_S(f)$ and $\hat{R}_g(f)$, one utilizes their apparent correlation. Indeed, $T_{S,g}(f)$ is the average of terms

$$T_{i,g}(f) = L(f,g(X_i))\,h_g(g(X_i)) - L(f,X_i) = \big(L(f,g(X_i))\,h_g(g(X_i)) - R(f)\big) - \big(L(f,X_i) - R(f)\big),$$

that is, $T_{S,g}(f) = \frac{1}{m}\sum_{i=1}^{m} T_{i,g}(f)$, and the two terms in $T_{i,g}(f)$ are typically highly correlated by the construction of $g$. Thus, we can apply the empirical Bernstein bound (Mnih et al., 2008) to the pairwise differences $T_{i,g}(f)$ to set a tighter threshold in the test: if the independence hypothesis (H) holds (i.e., $S$ and $f$ are independent), then for any $0 < \delta \le 1$, with probability at least $1 - \delta$,

$$|T_{S,g}(f)| \le B(m, \bar{\sigma}_T^2, \delta, U) \tag{6}$$

with

$$B(m, \sigma^2, \delta, U) = \sqrt{\frac{2\sigma^2 \ln(3/\delta)}{m}} + \frac{3U\ln(3/\delta)}{m}, \tag{7}$$

where $\bar{\sigma}_T^2$ is the empirical variance of the $T_{i,g}(f)$ terms, and $U$ denotes the range of the $T_{i,g}(f)$ (that is, $T_{i,g}(f) \in [a, a+U]$ for some constant $a$), and we also used that the expectation of each $T_{i,g}(f)$, and hence that of $T_{S,g}(f)$, is zero. Since $h_g(x) \le 1$ if $L(f,x) = 1$ (as discussed in Section 2.2), it follows that $U \le 2$, but further assumptions (such as $g$ being density preserving) can result in tighter bounds.

This leads to our pairwise dependence detection method:

if $|T_{S,g}(f)| > B(m, \bar{\sigma}_T^2, \delta, U)$, reject (H) at a confidence level $1 - \delta$ ($p$-value $\delta$).
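The decision rule can be written down directly from (6) and (7). The following is a minimal sketch rather than the authors' implementation: the function names are made up for illustration, the empirical variance is assumed to use the $1/m$ normalisation (the text does not specify it), and the range defaults to the worst-case value $U = 2$ derived above.

```python
import math

def bernstein_bound(m, var, delta, U):
    """Threshold B(m, sigma^2, delta, U) from Eq. (7)."""
    log_term = math.log(3.0 / delta)
    return math.sqrt(2.0 * var * log_term / m) + 3.0 * U * log_term / m

def pairwise_test(t_terms, delta, U=2.0):
    """Return (reject, T_mean, threshold) for the per-example terms T_i.

    t_terms holds the values L(f, g(X_i)) * h_g(g(X_i)) - L(f, X_i).
    The 1/m variance normalisation is an assumption of this sketch.
    """
    m = len(t_terms)
    t_mean = sum(t_terms) / m
    var = sum((t - t_mean) ** 2 for t in t_terms) / m
    threshold = bernstein_bound(m, var, delta, U)
    return abs(t_mean) > threshold, t_mean, threshold

# No gap between the two estimates: independence is not rejected.
reject_none, _, thr = pairwise_test([0.0] * 1000, delta=0.05)
# A constant gap of 0.5 on every example: independence is rejected.
reject_gap, t_mean, _ = pairwise_test([0.5] * 1000, delta=0.05)
print(reject_none, reject_gap, thr)
```

With zero empirical variance only the second term of (7) remains, so the threshold in both calls equals $6\ln(60)/1000$.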

Note that in order for the test to work well, we not only need the test statistic $T_{S,g}(f)$ to have a small variance in case of independence (this could be achieved if $g$ were the identity), but we also need the estimators $\hat{R}_S(f)$ and $\hat{R}_g(f)$ to behave sufficiently differently if the independence assumption is violated. The latter behavior is encouraged by stronger AEGs, as we will show empirically in Section 5.2 (see Figure 8 in particular).

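For a given statistic $(|T_{S,g}(f)|, \bar{\sigma}_T^2)$, the largest confidence level (smallest $p$-value) at which (H) can be rejected can be calculated by setting the value of $|T_{S,g}(f)| - B(m, \bar{\sigma}_T^2, \delta, U)$ to zero and solving for $\delta$. This leads to the following formula for the $p$-value (if the solution is larger than $1$, which happens when the bound (6) is loose, $\delta$ is capped at $1$):

$$\delta = \min\left\{1,\; 3\exp\left(-\frac{m}{9U^2}\Big(\bar{\sigma}_T^2 + 3U|T_{S,g}(f)| - \bar{\sigma}_T\sqrt{\bar{\sigma}_T^2 + 6U|T_{S,g}(f)|}\Big)\right)\right\}. \tag{8}$$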
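Formula (8) is the inverse of the bound (7) with respect to $\delta$, which can be verified numerically. The sketch below uses hypothetical function names and example inputs; it evaluates (8) and checks that plugging the resulting $\delta$ back into (7) recovers $|T_{S,g}(f)|$ whenever the cap at $1$ is not active.

```python
import math

def bernstein_bound(m, var, delta, U):
    """Threshold B(m, sigma^2, delta, U) from Eq. (7)."""
    log_term = math.log(3.0 / delta)
    return math.sqrt(2.0 * var * log_term / m) + 3.0 * U * log_term / m

def p_value(m, var, t_abs, U=2.0):
    """Smallest delta at which (H) can be rejected, following Eq. (8)."""
    sigma = math.sqrt(var)
    inner = var + 3.0 * U * t_abs - sigma * math.sqrt(var + 6.0 * U * t_abs)
    return min(1.0, 3.0 * math.exp(-m / (9.0 * U ** 2) * inner))

# Example inputs (assumed for illustration): m = 1000 test points,
# empirical variance 0.01 and an observed gap |T| = 0.05.
delta = p_value(1000, 0.01, 0.05)
recovered_gap = bernstein_bound(1000, 0.01, delta, 2.0)
print(delta, recovered_gap)

# With no gap at all, the solution of the equation is 3, so delta is capped at 1.
delta_no_gap = p_value(1000, 0.01, 0.0)
```

The round trip through `bernstein_bound` is a cheap sanity check that is worth keeping when the formula is reimplemented, since sign or factor slips in (8) are easy to make.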

The main challenge in applying the method in practice is to find suitable AEG functions $g$ for which we can compute the density $h_g$.

### 3.1 Dependence detector for randomized training

The dependence between the model and the test set can arise from (i) selecting the “best” random seed in order to improve the test set performance and/or (ii) tweaking the model architecture (e.g., neural network structure) and hyperparameters (e.g., learning-rate schedule). If one has access to a single instance of a trained model, these two sources cannot be disentangled. However, if the model architecture and training procedure are fully specified and computational resources are adequate, it is possible to isolate (i) and (ii) by retraining the model multiple times and calculating the $p$-value for every training run separately. Assuming $N$ models, let $f_j$ denote the $j$-th trained model, with its $p$-value calculated using the pairwise independence test (6) (i.e., from Eq. 8). We can investigate the degree to which (i) occurs by comparing these $p$-values with the corresponding test set error rates $\hat{R}_S(f_j)$. To investigate whether (ii) occurs, we can average over the randomness of the training runs.

For every example $X_i \in S$, consider the average test statistic

$$\bar{T}_i = \frac{1}{N}\sum_{j=1}^{N} T_i(f_j) = \frac{1}{N}\sum_{j=1}^{N} \Big(L(f_j, g_j(X_i))\,h_{g_j}(g_j(X_i)) - L(f_j, X_i)\Big),$$

where $T_i(f_j)$ is the statistic (3) calculated for example $X_i$ and model $f_j$ with AEG $g_j$ selected for model $f_j$ (note that AEGs are model-dependent by construction). If, for each $i$ and $j$, the random variables $T_i(f_j)$ are independent, then so are the $\bar{T}_i$ (for all $i$). Hence, we can apply the pairwise dependence detector (6) with $\bar{T}_i$ instead of $T_{i,g}(f)$, using the average $\frac{1}{m}\sum_{i=1}^{m} \bar{T}_i$ with its empirical variance, giving a single $p$-value. If the training runs vary enough in their outcomes, different models will err on different data points in $S$, leading to a smaller empirical variance and therefore strengthening the power of the dependence detector (indeed, our experiments in Section 5.3 on a CIFAR-10 image classification model show this behavior). This observation can also be interpreted in the following way: randomizing the training seed weakens the overfitting of the model architecture by introducing an important parameter which is not controlled in the experiments. Averaging the outcomes of training runs removes this source of randomness, strengthening the dependence detector. On the other hand, if the random seed is “optimized” in order to improve the test set performance, then the above procedure is obviously not able to detect that.
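The averaging step can be sketched as follows. This is an illustration only: the function name, the layout `T[j][i]` (model $j$, example $i$), the $1/m$ variance normalisation, and the tiny input matrix are assumptions, not taken from the paper. The example shows the variance reduction that occurs when two models err on different test points.

```python
def averaged_statistics(T):
    """Average per-example terms over models.

    T[j][i] is the term T_i(f_j) of model j on example i.
    Returns the per-example averages, their mean, and their empirical
    variance (1/m normalisation, an assumption of this sketch).
    """
    n_models, m = len(T), len(T[0])
    t_bar = [sum(T[j][i] for j in range(n_models)) / n_models for i in range(m)]
    mean = sum(t_bar) / m
    var = sum((t - mean) ** 2 for t in t_bar) / m
    return t_bar, mean, var

# Two hypothetical models whose adversarial terms fire on different examples.
T = [
    [1.0, 0.0, 0.0, 0.0],   # model 1
    [0.0, 1.0, 0.0, 0.0],   # model 2
]
t_bar, mean_avg, var_avg = averaged_statistics(T)
_, mean_single, var_single = averaged_statistics(T[:1])
print(t_bar, mean_avg, var_avg, var_single)
```

Both versions have the same mean gap (0.25), but the averaged terms have variance 0.0625 instead of 0.1875, so the threshold in (6) shrinks and the same gap is easier to detect.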

Note that we can also calculate the averaged statistic for a sequence of subsets of the models (e.g., of a fixed size each), obtain a sequence of corresponding $p$-values and study their distribution. By changing the averaged model set size from 1 (testing every model independently) to the number of available training runs, we adjust the trade-off between more information about the $p$-value distribution and detection power. To avoid reusing the same training run multiple times, we calculate the statistics over disjoint groups of training runs, obtaining one $p$-value per group.

For brevity, we call this overfitting detection method (or, correspondingly, this test of independence) an $N$-model detector (or an $N$-model independence test).

## 4 Synthetic experiments

In this section we present a simple synthetic classification problem to illustrate the power of the method of Section 3. The benefit of this example is that we are able to compute the density $h_g$ in an analytic form (see Figure 4 for an illustration).

Let $\mathcal{X} = \mathbb{R}^{500}$ and consider an input distribution $P$ with a density $\rho$ that is an equally weighted mixture of two 500-dimensional isotropic truncated Gaussian distributions (one per class) with a common coordinate-wise standard deviation, class-specific means, and densities truncated in the first dimension so that the two classes are separated by a margin. The label of an input point $x$ is $f^*(x) = \operatorname{sgn}(x_1)$, which is the sign of its first coordinate.

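We consider linear classifiers of the form $f(x) = \operatorname{sgn}(w^\top x + b)$ trained with the cross-entropy loss $c$. We employ a one-step gradient method (which is an $\ell_2$ version of the fast gradient-sign method of Goodfellow et al., 2014; Papernot et al., 2016) to define our AEG $g$, which tries to modify a correctly classified point $x$ with label $y$ in the direction of the gradient of the cost function $c$: $x' = x + \epsilon\,\nabla_x c / \|\nabla_x c\|_2$ for some $\epsilon > 0$. For our specific choice of $c$, the above simplifies to $x' = x - \epsilon\, y\, w / \|w\|_2$. To comply with the requirements for an AEG, we define $g$ as follows: $g(x) = x'$ if $L(f,x) = 0$ and $f^*(x) = f^*(x')$ (corresponding to (G2) and (G1), respectively), while $g(x) = x$ otherwise. Therefore, if $x'$ is misclassified by $f$, $x'$ and $x = x' + \epsilon\, y\, w / \|w\|_2$ are the only points mapped to $x'$ by $g$. Thus, the density at $x'$ after the transformation is $\rho'(x') = \rho(x') + \rho(x)\,(1 - L(f,x))\,\mathbb{I}(f^*(x) = f^*(x'))$ and

$$h_g(x') = \frac{\rho(x')}{\rho'(x')} = \frac{\rho(x')}{\rho(x') + \rho(x)\,(1 - L(f,x))\,\mathbb{I}(f^*(x) = f^*(x'))}$$

(note that $h_g(x') \le 1$).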

We present two experiments showing the behavior of our independence test: one where the training and test sets are independent, and another where they are not.

In the first experiment a linear classifier was trained on a training set of size 500 for 50,000 steps using the RMSProp optimizer (Tieleman and Hinton, 2012) with batch size 100 and learning rate 0.01, obtaining zero (up to numerical precision) final training loss and, consequently, 100% prediction accuracy on the training data. Then the trained classifier was tested on a large test set of size 10,000. (The large number of test examples ensures that the random error in the empirical error estimate is negligible.) Both sets were drawn independently from $P$ defined above. We used a range of $\epsilon$ values matched to the scale of the data distribution: from the order of magnitude of the margin between the two classes (0.05) to the order of magnitude of the width of the Gaussian distribution used for each class.

In the second experiment we consider the situation where the training and test sets are not independent. To enhance the effects of this dependence, the setup was modified to make the training process more amenable to overfitting by simulating a situation when the model has a wrong bias (this may happen in practice if a wrong architecture or data preprocessing method is chosen, which, despite the modeler’s best intentions, worsens the performance). Specifically, during training we added a penalty term to the training loss , decreased the size of the test set to 1000 and used 50% of the test data for training (the final penalized training loss was 0.25 with 100% prediction accuracy on the training set). Note that the small training set and the large penalty on yield classifiers that are essentially independent of the only interesting feature (recall that the true label of a point is ) and overfit to the noise in the data, resulting in a true model risk .

The results of the two experiments are shown in Figure 5, plotted against different perturbation strengths: the left column corresponds to the first experiment while the right column to the second. The first row presents the -values for rejecting the independence hypothesis, calculated by repeating the experiment (sampling data and training the classifier) times and applying the single-model (Section 3, labelled as in the plots) and -model (Section 3.1, labelled as in the plots) independence test, and taking the average over models (or model sets of size ) for each . We also plot empirical 95% two-sided confidence intervals () or, due to limited number of -values available after dividing 100 runs into disjoint bins of size , ranges between minimum and maximum value (). For all methods of detecting dependence, it can be seen that for the independent case the test is correctly not able to reject the independence hypothesis (the average -value is very close to , although in some runs it can drop to as low as ). On the other hand, for , the non-independent model failed the independence test at confidence level , hence, in this range of our independence test reliably detects overfitting.

In fact, it is easy to argue that our test should only work for a limited range of , that is, it should not reject independence for too small or too large values of . First we consider the case of small values. Notice that except for points -close (in -norm) to the true decision boundary or the decision boundary of , is invertible: if is correctly classified and is -away from the true decision boundary, there is exactly one point, , which is translated to , while if is incorrectly classified and -away from the decision boundary of , no translation leads to and ; any other points are -close to the decision boundary of either or . Thus, since is bounded, is invertible on a set of at least probability (according to ). When , , and so for all points with (since is continuous in all such ), implying on these points. It also follows that can only happen to a set of points with an -probability. This means that on a set of -probability, and for these points . Thus, with -probability . Unless the test set is concentrated in large part on the set of remaining points with -probability, the test statistic with high probability and our method will not reject the independence hypothesis for .

When is large (), notice that for any point with non-vanishing probability (i.e., with for some ), if then . Therefore, for such an , if and , , and so (if , we trivially have ). If , we have . If is invertible at then and . If is not invertible, then there is another such that ; however, if then (since is large), and so , giving . Therefore, for large , with high probability (i.e., for points with ), so the independence hypothesis will not be rejected with high probability.

The plots of the densities (Radon-Nikodym derivatives), given in the third row of Figure 5, show how the change in their values compensates for the increase of the adversarial error rate : in the independent case, the effect is completely eliminated, yielding an unbiased adversarial error estimate , which is essentially constant over the whole range of (as shown in the first row), while in the non-independent case the similar densities do not bring the adversarial error rate back to the test error rate , allowing the test to detect overfitting. Note that the densities exhibit similar trends (and values) in both cases, driven by the dependence of typical values of the ratio on the perturbation strength for originally misclassified points () and for successful adversarial examples (i.e., and ).

To compare the behavior of our improved, pairwise test and the basic version, the fourth row of Figure 5 depicts a single realization of the experiments where the confidence intervals (as computed from Bernstein’s inequality) are shown for the estimates. For the independent case, the confidence intervals of and overlap for all , and thus the basic test is not able to detect overfitting. In the non-independent case, the confidence intervals overlap for and , thus the basic test is not able to detect overfitting with at a confidence level, while the improved test (second row) is able to reject the independence hypothesis for these values at the same confidence level.
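The overlap check used by the basic test can be sketched as follows. This is only an illustration under stated assumptions: it uses the textbook two-sided Bernstein bound with a known variance bound, not necessarily the exact bound used for Figure 5, and the names `bernstein_radius` and `intervals_overlap` are hypothetical.

```python
import math


def bernstein_radius(n, variance, value_range, delta):
    """Half-width of a two-sided Bernstein confidence interval.

    Assumes n i.i.d. samples whose variance is at most `variance` and
    whose deviation from the mean is at most `value_range`. With
    probability at least 1 - delta, the sample mean is within the
    returned radius of the true mean.
    """
    log_term = math.log(2.0 / delta)
    return (math.sqrt(2.0 * variance * log_term / n)
            + 2.0 * value_range * log_term / (3.0 * n))


def intervals_overlap(est_a, rad_a, est_b, rad_b):
    """True if [est_a - rad_a, est_a + rad_a] meets [est_b - rad_b, est_b + rad_b]."""
    return abs(est_a - est_b) <= rad_a + rad_b
```

When the two intervals overlap, a test built on them cannot separate the two error estimates at the chosen confidence level, which is the situation described for the basic test above.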

Finally, in the fifth row of Figure 5 we plotted the histograms of the empirical distribution of p-values for both models, over 100 independent runs (between the runs, all the data was regenerated and the models were retrained). For , they concentrate heavily on either or , and have very thin tails extending far towards the opposite end of the interval. This explains the surprisingly wide confidence intervals for p-values plotted in the first row. In particular, the fact that some p-values for the independent model are as low as does not mean the independence test is not reliable, because almost all calculated values are close to or equal to , and the few outliers are a combined consequence of the finite sample size and the effectiveness of the AEG. The additional histogram for the non-independent model illustrates a regime which is in between the single-model pairwise test (Section 3) completely failing to reject the independence hypothesis and clearly rejecting it.

To verify experimentally whether the -model independence test can be a more powerful detector of overfitting than the single-model version, in Figure 6 (right panel) we plotted p-value histograms for for the intermediate AEG strength applied to the non-independent model over 100 training runs. Indeed, as increases, the concentration of p-values around in the low () range increases. For we did not have enough values to plot a histogram: for we obtained and , while for the p-value is . The increase of the test power becomes apparent when we compare the last value with the mean of p-values obtained by testing every training run separately, equal , and the median .

For comparison, we also plotted in Figure 6 (left panel) the corresponding histograms for the independent model and a slightly higher attack strength, , at which the independence test fails for the overfitted model even without averaging (see Figure 5, first row, right panel). The histograms are all clustered in the region close to 1, indicating that the -model test is not overly pessimistic.

## 5 Testing overfitting on image classification benchmarks

In the previous section we have shown how the proposed adversarial-example-based error estimates work for a synthetic problem where the densities (Radon-Nikodym derivatives) can be computed exactly. In this section we apply our estimates to two popular image datasets; here the main issue is to find sufficiently strong AEGs that allow computing the corresponding densities.

To facilitate the computation of the density , we limit ourselves to density-preserving AEGs as defined by (G3) (recall that (G3) is different from requiring ). Since in (4) and (3), is multiplied by , we only need to determine the density for data points that are misclassified by .

### 5.1 AEGs based on translations

To satisfy (G3), we use translations of images as adversarial perturbations. Translation attacks have recently been proposed as a means of generating adversarial examples (Azulay and Weiss, 2018). Although such attacks are relatively weak, they fit our needs well: unless the images are procedurally centered, it is reasonable to assume that translating them by a few pixels does not change their likelihood. When the images are procedurally centered, one can still run the test on the “neighboring distribution” where the centering is removed in a data augmentation procedure that applies random translations to the training data. Clearly, this (or for that matter translation attacks) will not work if centering is built into the prediction procedure, as may happen if one’s goal is to build a model that is invariant to (small) translations. Since centering alone is not expected to provide a satisfactory solution to this invariance, we do not expect centering to be part of what a model does. As before, we only apply perturbations to correctly classified images. We also make the natural assumption that the small translations we use do not change the true class of an image. These assumptions imply that translations by a few pixels satisfy conditions (G1) and (G3), and so a function is a valid AEG also satisfying (G3) if it leaves all misclassified images in place (to comply with (G2)), and either leaves a correctly classified image unchanged or applies a small translation.

Formally, images are modeled as 3D tensors in space, where for RGB data, and and are the width and height of the images, respectively. Let denote the translation of an image by pixels in the (X, Y) plane (here denotes the set of integers). To control the amount of change, we limit the magnitude of translations and allow only, for some fixed positive . Thus, we consider AEGs in the form if and otherwise (if is correctly classified, we attempt to translate it to find an adversarial example in which is misclassified by , but is left unchanged if no such point exists). Denoting the density of the pushforward measure by , for any misclassified point ,

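$$\rho_g(x) = \rho(x) + \sum_{v \in V} \rho(\tau_{-v}(x))\, \mathbb{I}\big(g(\tau_{-v}(x)) = x\big) = \rho(x)\Big(1 + \sum_{v \in V} \mathbb{I}\big(g(\tau_{-v}(x)) = x\big)\Big)$$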

where the second equality follows from (G3). Therefore, the corresponding density is

$$h_g(x) = \frac{1}{1 + n(x)}, \tag{9}$$

where is the number of neighboring images which are mapped to by . Note that given and , can be easily calculated by checking all possible translations of by for . It is easy to extend the above to non-deterministic perturbations, defined as distributions over AEGs, by replacing the indicator with its expectation with respect to the randomness of , yielding

$$h_g(x) = \frac{1}{1 + \sum_{v \in V} \mathbb{P}\big(g(\tau_{-v}(x)) = x \mid x, v\big)}. \tag{10}$$
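For a deterministic AEG, the neighbor count and the resulting density in (9) can be computed by brute force over all allowed translations. The sketch below is a minimal illustration under several assumptions that are not taken from the text: translations wrap around cyclically (so that shifting by the negated vector exactly undoes a shift), the magnitude constraint is a max-norm bound, and the AEG picks the first misclassifying shift in a fixed order. The names `translation_set`, `translate`, `aeg`, and `density` are hypothetical.

```python
import numpy as np


def translation_set(eps):
    """All nonzero integer shifts (dx, dy) with max(|dx|, |dy|) <= eps (assumed norm)."""
    r = range(-eps, eps + 1)
    return [(dx, dy) for dy in r for dx in r if (dx, dy) != (0, 0)]


def translate(x, v):
    """Shift an image of shape (C, W, H) by v = (dx, dy) pixels.

    Cyclic wrap-around is an assumption made so that translate(., -v)
    is the exact inverse of translate(., v).
    """
    return np.roll(x, shift=v, axis=(1, 2))


def aeg(x, y, predict, shifts):
    """Deterministic translational AEG (illustrative selection rule).

    Misclassified points stay in place; a correctly classified point is
    moved to the first translation that predict gets wrong, if any.
    """
    if predict(x) != y:
        return x
    for v in shifts:
        candidate = translate(x, v)
        if predict(candidate) != y:
            return candidate
    return x


def density(x, y, predict, shifts):
    """Return 1 / (1 + n(x)) for a point x that predict misclassifies.

    n(x) counts the neighbors reachable by an inverse shift that the AEG
    maps onto x; translations are assumed to keep the true label y.
    """
    n = 0
    for dx, dy in shifts:
        neighbor = translate(x, (-dx, -dy))
        if np.array_equal(aeg(neighbor, y, predict, shifts), x):
            n += 1
    return 1.0 / (1 + n)
```

Here `predict` stands for any function returning the predicted class of a single image. The cost is one AEG evaluation per allowed shift, that is, quadratic in the number of shifts for each misclassified point.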

If is deterministic, we have for any successful adversarial example . Hence, for such , the range of the random variables defined in (3) has a tighter upper bound of 3/2 instead of 2 (as ), leading to a tighter bound in (6) and a stronger pairwise independence test. In the experiments, we use this stronger test.

For the image classification benchmarks we consider two translation variants that are used in constructing a translational AEG. For every correctly classified image , we consider translations from (for some ), choosing from the set . If all translations result in correctly classified examples, we set . Otherwise, we use one of two possible ways to select (and we call the resulting points successful adversarial examples):

• Strongest perturbation: Assuming the number of classes is , let denote the vector of the class logits calculated by the model for image , and let

 lexc(f,x)=max0≤i

We define

$$g_{\text{strongest}}(x) = \operatorname*{argmax}_{x' \in G(x)} l_{\text{exc}}(f, x'),$$

with ties broken deterministically by choosing the first translation from the candidate set, going top to bottom and left to right in row-major order. Thus, here we seek a non-identical “neighbor” that causes the classifier to err the most, reachable from by translations within a maximum range .

• Nearest misclassified neighbor: Here we aim to find the nearest image in that is misclassified. That is, letting