1 Introduction
Figure 1: (a) depicts an over-represented region (top) and under-represented region (bottom). This is the kind of overfitting evaluated by methods like the FID score and Precision and Recall. Illustration (b) depicts a data-copied region (top) and underfit region (bottom). This is the type of overfitting focused on in this work. Figure (c) shows VAE-generated and training samples from a data-copied (top) and underfit (bottom) region of the MNIST instance space. In each 10-image strip, the bottom row provides random generated samples from the region and the top row shows their training nearest neighbors. Samples in the bottom region are on average further from their training nearest neighbor than held-out test samples in the region, and samples in the top region are closer, and thus ‘copying’ (computed in embedded space; see Experiments section).

Overfitting is a basic stumbling block of any learning process. While it has been studied in great detail in the context of supervised learning, it has received much less attention in the unsupervised setting, despite being just as much of a problem.
To start with a simple example, consider a classical kernel density estimator (KDE), which given data $x_1, \dots, x_n \in \mathbb{R}^d$, constructs a distribution over $\mathbb{R}^d$ by placing a Gaussian of width $\sigma$ at each of these points, yielding the density

(1)  $q_\sigma(x) = \frac{1}{n (2\pi\sigma^2)^{d/2}} \sum_{i=1}^{n} \exp\left(-\frac{\|x - x_i\|^2}{2\sigma^2}\right).$

The only parameter is the scalar $\sigma$. Setting it too small makes $q_\sigma$ too concentrated around the given points: a clear case of overfitting (see Appendix Figure 6). This cannot be avoided by choosing the $\sigma$ that maximizes the log-likelihood on the training data, since in the limit $\sigma \to 0$, this likelihood goes to $\infty$.
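This failure of training likelihood is easy to see numerically. The following is a minimal 1-D sketch (the function and variable names are ours, not the paper's): it evaluates the mean log-likelihood of equation (1) and shows that shrinking $\sigma$ drives the training likelihood up without bound while the held-out likelihood collapses.

```python
import math
import random

def kde_log_likelihood(points, centers, sigma):
    # Mean log-likelihood of `points` under a 1-D Gaussian KDE (equation (1)
    # with d = 1) whose components are centered on `centers` with width sigma.
    norm = math.log(len(centers)) + 0.5 * math.log(2 * math.pi * sigma ** 2)
    total = 0.0
    for x in points:
        # log-sum-exp over the n components, for numerical stability
        logs = [-((x - c) ** 2) / (2 * sigma ** 2) for c in centers]
        m = max(logs)
        total += m + math.log(sum(math.exp(l - m) for l in logs)) - norm
    return total / len(points)

random.seed(0)
train = [random.gauss(0.0, 1.0) for _ in range(200)]
held_out = [random.gauss(0.0, 1.0) for _ in range(200)]

# As sigma shrinks, training likelihood diverges upward (each point sits on
# its own component) while the held-out likelihood collapses: overfitting.
ll_train_small = kde_log_likelihood(train, train, 0.001)
ll_train_wide = kde_log_likelihood(train, train, 0.5)
ll_test_small = kde_log_likelihood(held_out, train, 0.001)
ll_test_wide = kde_log_likelihood(held_out, train, 0.5)
```

The bandwidths 0.001 and 0.5 are illustrative choices for standard-normal data, not values taken from the paper.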
The classical solution is to find a parameter that has a low generalization gap – that is, a low gap between the training log-likelihood and the log-likelihood on a held-out validation set. This method, however, often does not apply to the more complex generative models that have emerged over the past decade or so, such as Variational Autoencoders (VAEs) (kingma) and Generative Adversarial Networks (GANs) (goodfellow). These models easily involve millions of parameters, and hence overfitting is a serious concern. Yet a major challenge in evaluating overfitting is that these models do not offer exact, tractable likelihoods. VAEs can tractably provide a log-likelihood lower bound, while GANs have no accompanying density estimate at all. Thus any method that can assess these generative models must be based only on the samples produced.
A body of prior work has provided tests for evaluating generative models based on samples drawn from them (salimans; mehdi; Ruslan_et_al; heusel); however, the vast majority of these tests focus on ‘mode dropping’ and ‘mode collapse’: the tendency of a generative model to either merge or delete high-density modes of the true distribution. A generative model that simply reproduces the training set, or minor variations thereof, will pass most of these tests.
In contrast, this work formalizes and investigates a type of overfitting that we call ‘data-copying’: the propensity of a generative model to recreate minute variations of a subset of training examples it has seen, rather than represent the true diversity of the data distribution. An example is shown in Figure 1(b); in the top region of the instance space, the generative model data-copies, or creates samples that are very close to the training samples; meanwhile, in the bottom region, it underfits. To detect this, we introduce a test that relies on three independent samples: the original training sample used to produce the generative model; a separate (held-out) test sample from the underlying distribution; and a synthetic sample drawn from the generator.
Our key insight is that an overfit generative model produces samples that are too close to the training samples – closer on average than an independently drawn test sample from the same distribution. Thus, if a suitable distance function is available, we can test for data-copying by checking whether the distances to the closest point in the training sample are on average smaller for the generated sample than for the test sample.
A further complication is that modern generative models tend to behave differently in different regions of space; a configuration as in Figure 1(b), for example, could cause a global test to fail. To address this, we use ideas from the design of nonparametric methods. We divide the instance space into cells, conduct our test separately in each cell, and then combine the results to get a sense of the average degree of data-copying.
Finally, we explore our test experimentally on a variety of illustrative datasets and generative models. Our results demonstrate that, given enough samples, our test can successfully detect data-copying in a broad range of settings.
1.1 Related work
There has been a large body of prior work on the evaluation of generative models (salimans; lopez; richardson; mehdi; Kilian; Ruslan_et_al). Most are geared to detect some form of mode-collapse or mode-dropping: the tendency to either merge or delete high-density regions of the training data. Consequently, they fail to detect even the simplest case of extreme data-copying – where a generative model memorizes and exactly reproduces a bootstrap sample from the training set. We discuss below a few such canonical tests.
To date there is a wealth of techniques for evaluating whether a model mode-drops or -collapses. Tests like the popular Inception Score (IS), Fréchet Inception Distance (FID) (heusel), the Precision and Recall test (mehdi), and extensions thereof (Kynk_improved; che_2016) all work by embedding samples using the features of a discriminative network such as ‘Inception-V3’ and checking whether the training and generated samples are similar in aggregate. The hypothesis-testing binning method proposed by richardson also compares aggregate training and generated samples, but without the embedding step. The parametric Kernel MMD method proposed by gretton uses a carefully selected kernel to estimate the distribution of both the generated and training samples and reports the maximum mean discrepancy between the two. All these tests, however, reward a generative model that only produces slight variations of the training set, and do not successfully detect even the most egregious forms of data-copying.
A test that can detect some forms of data-copying is the two-sample nearest-neighbor test, a nonparametric test proposed by lopez. Their method groups a training and generated sample of equal cardinality together, with training points labeled ‘1’ and generated points labeled ‘0’, and then reports the Leave-One-Out (LOO) Nearest-Neighbor (NN) accuracy of predicting ‘1’s and ‘0’s. Two values are then reported, as discussed by Kilian – the leave-one-out accuracy of the training points, and the leave-one-out accuracy of the generated points. An ideal generative model should produce an accuracy of 0.5 for each. More often, a mode-collapsing generative model will leave the training accuracy low and the generated accuracy high, while a generative model that exactly reproduces the entire training set should produce zero accuracy for both. Unlike this method, our test not only detects exact data-copying, which is unlikely, but estimates whether a given model generates samples closer to the training set than it should, as determined by a held-out test set.
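The mechanics of this baseline can be sketched in a few lines. The following is our own illustrative leave-one-out implementation (1-D points with absolute-value distance for brevity; any metric works analogously, and `two_sample_nn_accuracies` is a hypothetical name): an exact copy of the training set collapses both accuracies to zero, while well-separated samples give accuracy one for both.

```python
def two_sample_nn_accuracies(train, generated):
    # Leave-one-out 1-NN label-prediction accuracies for the two-sample NN
    # test: training points carry label 1, generated points label 0.
    labeled = [(x, 1) for x in train] + [(x, 0) for x in generated]
    correct = {0: 0, 1: 0}
    for i, (x, label) in enumerate(labeled):
        # nearest neighbor among all *other* points
        j = min((k for k in range(len(labeled)) if k != i),
                key=lambda k: abs(labeled[k][0] - x))
        if labeled[j][1] == label:
            correct[label] += 1
    n = len(train)
    return correct[1] / n, correct[0] / n  # (train acc, generated acc)

# Exact data-copying: every point's nearest neighbor is its duplicate
# with the opposite label, so both accuracies are 0.
copy_case = two_sample_nn_accuracies([0.0, 1.0, 2.0], [0.0, 1.0, 2.0])
# Well-separated samples: every neighbor shares its sample's label.
split_case = two_sample_nn_accuracies([0.0, 1.0, 2.0], [50.0, 51.0, 52.0])
```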
The concept of data-copying has also been explored by Kilian (where it is called ‘memorization’) for a variety of generative models and several of the above two-sample evaluation tests. Their results indicate that, out of a variety of popular tests, only the two-sample nearest-neighbor test is able to capture instances of extreme data-copying.
gretton_2 explores three-sample testing, but for comparing the performance of different models, not for detecting overfitting. reviewer_paper uses the three-sample test proposed by gretton_2 for detecting data-copying; unlike ours, their test is global in nature.
Finally, other works concurrent with ours have explored parametric approaches to rooting out data-copying. A recent work by GAN_benchmarks suggests that, given a large enough sample from the model, Neural Network Divergences are sensitive to data-copying. In a slightly different vein, a recent work by latent_recovery investigates whether latent-parameter models memorize training data by learning the reverse mapping from image to latent code. The present work departs from those by offering a probabilistically motivated nonparametric test that is entirely model agnostic.

2 Preliminaries
We begin by introducing some notation and formalizing the definitions of overfitting. Let $\mathcal{X}$ denote an instance space in which data points lie, and $P$ an unknown underlying distribution on this space. A training set $T$ of $n$ points is drawn from $P$ and is used to build a generative model $Q$. We then wish to assess whether $Q$ is the result of overfitting: that is, whether $Q$ produces samples that are too close to the training data. To help ascertain this, we are able to draw two additional samples:

A fresh sample of $n$ points from $P$; call this $P_n$.

A sample of $m$ points from $Q$; call this $Q_m$.
As illustrated in Figures 1(a) and 1(b), a generative model can overfit locally in a region $\mathcal{C} \subseteq \mathcal{X}$. To characterize this, for any distribution $D$ on $\mathcal{X}$, we use $D|_{\mathcal{C}}$ to denote its restriction to the region $\mathcal{C}$, that is, $D|_{\mathcal{C}}(A) = \frac{D(A \cap \mathcal{C})}{D(\mathcal{C})}$ for measurable $A \subseteq \mathcal{X}$.
2.1 Definitions of Overfitting
We now formalize the notion of data-copying, and illustrate its distinction from other types of overfitting.
Intuitively, data-copying refers to situations where $Q$ is “too close” to the training set $T$; that is, closer to $T$ than the target distribution $P$ happens to be. We make this quantitative by choosing a distance function $d(x)$ from points in $\mathcal{X}$ to the training set, for instance, $d(x) = \min_{t \in T} \|x - t\|_2$, if $\mathcal{X}$ is a subset of Euclidean space.
Ideally, we desire that $Q$’s expected distance to the training set is the same as that of $P$’s, namely $\mathbb{E}_{x \sim Q}[d(x)] = \mathbb{E}_{x \sim P}[d(x)]$. We may rewrite this as follows: given any distribution $D$ over $\mathcal{X}$, define $L(D)$ to be the one-dimensional distribution of $d(X)$ for $X \sim D$. We consider data-copying to have occurred if random draws from $L(P)$ are systematically larger than from $L(Q)$. The above equalized expected distance condition can be rewritten as

(2)  $\mathbb{E}\big[L(Q)\big] = \mathbb{E}\big[L(P)\big].$
However, we are less interested in how large the difference is, and more in how often $L(Q)$ is larger than $L(P)$. Let

$\Delta(P, Q) = \Pr_{\,l_P \sim L(P),\; l_Q \sim L(Q)}\big(l_Q > l_P\big),$

where $\Delta(P, Q)$ represents how ‘far’ $Q$ is from training sample $T$ as compared to true distribution $P$. A more interpretable yet equally meaningful condition is

$\Delta(P, Q) = \tfrac{1}{2},$

which guarantees (2) if the densities of $L(P)$ and $L(Q)$ have the same shape, but could plausibly be mean-shifted.
If $\Delta(P, Q) < \frac{1}{2}$, $Q$ is data-copying training set $T$, since samples from $Q$ are systematically closer to $T$ than are samples from $P$. However, even if $\Delta(P, Q) \geq \frac{1}{2}$, $Q$ may still be data-copying. As exhibited in Figures 1(b) and 1(c), a model may data-copy in one region and underfit in others. In this case, $Q$ may be further from $T$ than $P$ is globally, but much closer to $T$ locally. As such, we consider $Q$ to be data-copying if it is overfit in some subset $\mathcal{C} \subseteq \mathcal{X}$:
Definition 2.1 (Data-Copying).
A generative model $Q$ is data-copying training set $T$ if, in some region $\mathcal{C} \subseteq \mathcal{X}$, it is systematically closer to $T$ by distance metric $d$ than are samples from $P$. Specifically, if

$\Delta\big(P|_{\mathcal{C}},\; Q|_{\mathcal{C}}\big) < \tfrac{1}{2}.$
Observe that data-copying is orthogonal to the type of overfitting addressed by many previous works (heusel; mehdi), which we call ‘over-representation’. There, $Q$ over-emphasizes some region $\mathcal{C}$ of the instance space $\mathcal{X}$, often a region of high density in the training set $T$. For the sake of completeness, we provide a formal definition below.
Definition 2.2 (Over-Representation).
A generative model $Q$ is over-representing $P$ in some region $\mathcal{C} \subseteq \mathcal{X}$ if the probability of drawing $x \sim Q$ with $x \in \mathcal{C}$ is much greater than the probability of drawing $x \sim P$ with $x \in \mathcal{C}$. Specifically, if

$Q(\mathcal{C}) \gg P(\mathcal{C}).$

Observe that it is possible to over-represent without data-copying and vice versa. For example, if $P$ is an equally weighted mixture of two Gaussians, and $Q$ perfectly models one of them, then $Q$ is over-representing without data-copying. On the other hand, if $Q$ outputs a bootstrap sample of the training set $T$, then it is data-copying without over-representing. The focus of the rest of this work is on data-copying.
3 A Test for Data-Copying
Having provided a formal definition, we next propose a hypothesis test to detect data-copying.
3.1 A Global Test
We introduce our data-copying test in the global setting, when $\mathcal{C} = \mathcal{X}$. Our null hypothesis $H_0$ is that $Q$ may equal $P$:

(3)  $H_0 : \Delta(P, Q) = \tfrac{1}{2}.$

There are well-established nonparametric tests for this hypothesis, such as the Mann-Whitney test (mannwhitney). Let $l_{P_n}$ and $l_{Q_m}$ be the samples of $L(P_n)$ and $L(Q_m)$ given by $P_n$ and $Q_m$ and their distances to training set $T$. The Mann-Whitney $U$ statistic estimates the probability in Equation (3) by measuring the number of all $mn$ pairwise comparisons in which $l_{Q_m} > l_{P_n}$. An efficient and simple method to gather and interpret this test is as follows:

Sort the $n + m$ distance values such that each instance $l$ has rank $R(l)$, starting from rank 1 and ending with rank $n + m$. The values have no tied ranks with probability 1, assuming their distributions are continuous.

Calculate the rank-sum for $l_{Q_m}$, denoted $R_{Q_m}$, and from it the $U$ statistic, denoted $U_{Q_m}$:

$U_{Q_m} = R_{Q_m} - \frac{m(m+1)}{2}.$

Consequently, $\mathbb{E}\left[\frac{U_{Q_m}}{mn}\right] = \Pr(l_Q > l_P)$.

Under $H_0$, $U_{Q_m}$ is approximately normally distributed when there are more than 20 samples in both $Q_m$ and $P_n$, allowing for the z-scored statistic

$Z_U = \frac{U_{Q_m} - \mu_U}{\sigma_U}, \qquad \mu_U = \frac{mn}{2}, \qquad \sigma_U = \sqrt{\frac{mn(m + n + 1)}{12}}.$

$Z_U$ provides us a data-copying statistic with normalized expectation and variance under $H_0$. $Z_U \ll 0$ implies data-copying, and $Z_U \gg 0$ implies underfitting; the magnitude of $Z_U$ indicates how unlikely the observed rank-sum would be if $H_0$ held and $Z_U$ were a draw from a standard normal.

Observe that this test is completely model agnostic and uses no estimate of likelihood. It only requires a meaningful distance metric, which is becoming common practice in the evaluation of mode-collapse and -dropping (heusel; mehdi) as well.
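The steps above can be sketched directly (an illustrative implementation, not the authors' code; `z_u` and the argument names are ours):

```python
import math

def z_u(dist_test, dist_gen):
    # Z-scored Mann-Whitney U on distances-to-training-set.
    # dist_test: distances of held-out test points to T; dist_gen: distances
    # of generated points to T. Z << 0 suggests data-copying, Z >> 0 underfitting.
    n, m = len(dist_test), len(dist_gen)
    # rank all n + m values jointly (rank 1 = smallest)
    ranked = sorted([(d, 0) for d in dist_test] + [(d, 1) for d in dist_gen])
    r_q = sum(rank for rank, (_, is_gen) in enumerate(ranked, start=1) if is_gen)
    u_q = r_q - m * (m + 1) / 2  # number of pairs with l_Q > l_P
    mu = m * n / 2.0
    sigma = math.sqrt(m * n * (m + n + 1) / 12.0)
    return (u_q - mu) / sigma

# Generated points much closer to T than test points => strongly negative Z.
copying = z_u([x + 100.0 for x in range(30)], [float(x) for x in range(30)])
# Generated points much further from T => strongly positive Z.
underfit = z_u([float(x) for x in range(30)], [x + 100.0 for x in range(30)])
# Interleaved distances => Z near 0.
balanced = z_u([2.0 * x for x in range(30)], [2.0 * x + 1 for x in range(30)])
```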
3.2 Handling Heterogeneity
As described in Section 2.1, the above global test can be fooled by generators which are very close to the training data in some regions of the instance space (overfitting) but very far from the training data in others (poor modeling).
We handle this by introducing a local version of our test. Let $\Pi$ denote any partition of the instance space $\mathcal{X}$, which can be constructed in any manner. In our experiments, for instance, we run the $k$-means algorithm on $T$, so that $|\Pi| = k$. As the number of training and test samples grows, we may increase $k$ and thus the instance-space resolution of our test. Letting $L(D|_\pi)$ be the distribution of distances-to-training-set within cell $\pi \in \Pi$, we probe each cell of the partition individually.
Data-Copying.
To offer a summary statistic for data-copying, we collect the z-scored Mann-Whitney statistic, $Z_U$, described in Section 3.1 in each cell $\pi \in \Pi$. Let $P_n(\pi)$ denote the fraction of test points lying in cell $\pi$, and similarly for $Q_m(\pi)$. The test for cell $\pi$ and training set $T$ will then be denoted as $Z_U\big(L(P_n|_\pi), L(Q_m|_\pi); T\big)$, where $L(P_n|_\pi)$ is the empirical distance distribution of the test points in $\pi$, and similarly for $L(Q_m|_\pi)$. See Figure 1(c) for examples of these in-cell scores. For stability, we only measure data-copying for those cells significantly represented by $Q_m$, as determined by a threshold $\tau$. Let $\Pi_\tau = \{\pi \in \Pi : Q_m(\pi) \geq \tau\}$ be the set of all cells in the partition for which $Q_m(\pi) \geq \tau$. Then, our summary statistic for data-copying averages across all cells represented by $Q_m$:

$C_T(P_n, Q_m) = \frac{\sum_{\pi \in \Pi_\tau} P_n(\pi)\, Z_U\big(L(P_n|_\pi), L(Q_m|_\pi); T\big)}{\sum_{\pi \in \Pi_\tau} P_n(\pi)}.$
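A sketch of this averaged statistic, assuming distances-to-training-set have already been computed per cell (illustrative only; the dictionary-based cell structure and the names `z_u`, `c_t` are our own):

```python
import math

def z_u(dist_test, dist_gen):
    # in-cell z-scored Mann-Whitney U statistic (Section 3.1)
    n, m = len(dist_test), len(dist_gen)
    ranked = sorted([(d, 0) for d in dist_test] + [(d, 1) for d in dist_gen])
    r_q = sum(r for r, (_, g) in enumerate(ranked, start=1) if g)
    u_q = r_q - m * (m + 1) / 2
    return (u_q - m * n / 2.0) / math.sqrt(m * n * (m + n + 1) / 12.0)

def c_t(cells, tau):
    # cells maps a cell id to (test distances, generated distances).
    # Average the in-cell Z_U weighted by test mass P_n(pi), over cells
    # whose generated mass Q_m(pi) is at least tau.
    n_tot = sum(len(p) for p, _ in cells.values())
    m_tot = sum(len(q) for _, q in cells.values())
    kept = [(p, q) for p, q in cells.values() if len(q) / m_tot >= tau]
    weight = sum(len(p) / n_tot for p, _ in kept)
    return sum((len(p) / n_tot) * z_u(p, q) for p, q in kept) / weight

# One data-copied cell (generated far closer to T than test): C_T << 0.
one_cell = {0: ([x + 100.0 for x in range(30)], [float(x) for x in range(30)])}
score_copy = c_t(one_cell, 0.1)
# Copied cell plus an equally weighted underfit cell: the scores cancel.
two_cells = {0: ([x + 100.0 for x in range(30)], [float(x) for x in range(30)]),
             1: ([float(x) for x in range(30)], [x + 100.0 for x in range(30)])}
score_mixed = c_t(two_cells, 0.1)
```

The two-cell example mirrors Figure 1(b): a global test would see the cancellation, while the per-cell scores expose both failure modes.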
Over-Representation.
The above test will not catch a model that heavily over- or under-represents cells. For completeness, we next provide a simple representation test that is essentially the one used by richardson, now with an independent test set instead of the training set.

With $n_\pi$ test points and $m_\pi$ generated points in cell $\pi$, we may treat the cell proportions as Gaussian random variables. We then check the null hypothesis $H_0 : P(\pi) = Q(\pi)$. Assuming this null hypothesis, a simple two-proportion z-test is:

$Z_\pi = \frac{\frac{m_\pi}{m} - \frac{n_\pi}{n}}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{m} + \frac{1}{n}\right)}}, \qquad \text{where } \hat{p} = \frac{n_\pi + m_\pi}{n + m}.$

We then report two values for a significance level: the number of significantly different cells (‘bins’) with positive $Z_\pi$ (NDB over-representing), and the number with negative $Z_\pi$ (NDB under-representing).
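A minimal sketch of such a per-cell two-proportion z-test (our reading of the binning test; the function name and pooled-variance form are our assumptions, and every cell is assumed non-empty):

```python
import math

def ndb_over_under(test_counts, gen_counts, z_crit=1.96):
    # For each cell, compare the fraction of generated samples landing there
    # with the fraction of held-out test samples, via a pooled z-test.
    n, m = sum(test_counts), sum(gen_counts)
    over = under = 0
    for n_pi, m_pi in zip(test_counts, gen_counts):
        p_pool = (n_pi + m_pi) / (n + m)  # pooled proportion under H0
        se = math.sqrt(p_pool * (1 - p_pool) * (1 / n + 1 / m))
        z = (m_pi / m - n_pi / n) / se
        if z > z_crit:
            over += 1
        elif z < -z_crit:
            under += 1
    return over, under

# The generator piles 90% of its mass into the first of two equal cells:
# cell 0 is significantly over-represented, cell 1 under-represented.
skewed = ndb_over_under([50, 50], [90, 10])
# A well-matched generator triggers neither count.
matched = ndb_over_under([50, 50], [52, 48])
```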
Together, these summary statistics — $C_T$, NDB-over, and NDB-under — detect the ways in which $Q$ broadly represents $P$ without directly copying the training set $T$.
3.3 Performance Guarantees
We next provide some simple guarantees on the performance of the global test statistic $\frac{U_{Q_m}}{mn}$. Guarantees for the averaged statistic $C_T$ are more complicated, and are left as a direction for future work.

We begin by showing that when the null hypothesis does not hold, $\frac{U_{Q_m}}{mn}$ has a desirable property: it is a consistent estimator of the quantity of interest, $\Delta(P, Q)$:
Theorem 1.
For true distribution $P$, model distribution $Q$, and distance metric $d$, the estimator $\frac{U_{Q_m}}{mn}$ converges to $\Delta(P, Q)$ according to the concentration inequality

$\Pr\left(\left|\frac{U_{Q_m}}{mn} - \Delta(P, Q)\right| \geq t\right) \leq 2\exp\left(-\frac{2mn t^2}{m + n}\right).$
Furthermore, when the model distribution $Q$ actually matches the true distribution $P$, under modest assumptions we can expect $\Delta(P, Q)$ to be near $\frac{1}{2}$:
Theorem 2.
If $P = Q$, and the corresponding distance distribution $L(P)$ is non-atomic, then $\Delta(P, Q) = \frac{1}{2}$.
Additionally, we show that for a Gaussian Kernel Density Estimator, the bandwidth $\sigma$ that satisfies the condition in Equation (2) is the $\sigma$ corresponding to a maximum-likelihood Gaussian KDE model. Recall that a KDE model is described by

(4)  $q_\sigma(x) = \frac{1}{n (2\pi\sigma^2)^{d/2}} \sum_{t \in T} \exp\left(-\frac{\|x - t\|^2}{2\sigma^2}\right),$

where the posterior probability that a random draw $x \sim q_\sigma$ comes from the Gaussian component centered at training point $t \in T$ is

$\Pr(t \mid x) = \frac{\exp\left(-\|x - t\|^2 / 2\sigma^2\right)}{\sum_{t' \in T} \exp\left(-\|x - t'\|^2 / 2\sigma^2\right)}.$

Lemma 3.
For the kernel density estimator (4), the maximum-likelihood choice of $\sigma$, namely the maximizer of $\mathbb{E}_{x \sim P}\left[\log q_\sigma(x)\right]$, satisfies

$\mathbb{E}_{x \sim q_\sigma}\, \mathbb{E}_{t \sim \Pr(\cdot \mid x)}\left[\|x - t\|^2\right] = \mathbb{E}_{x \sim P}\, \mathbb{E}_{t \sim \Pr(\cdot \mid x)}\left[\|x - t\|^2\right].$

See Appendix 7.3 for the proof. Unless $\sigma$ is large, we know that for any given $x$, the posterior-weighted distance to the training points is approximately the nearest-neighbor distance $d(x)$. So, enforcing that $\mathbb{E}[L(Q)] = \mathbb{E}[L(P)]$, and more loosely that $\Delta(P, Q) = \frac{1}{2}$, provides an excellent nonparametric approach to selecting a Gaussian KDE, and ought to be enforced for any $Q$ attempting to emulate $P$; after all, Theorem 2 points out that effectively any model with $Q = P$ also yields this condition.
4 Experiments
Having clarified what we mean by data-copying in theory, we turn our attention to data-copying by generative models in practice (code available at https://github.com/caseymeehan/datacopying). We leave representation test results for the appendix, since this behavior has been well studied in previous works. Specifically, we aim to answer the two following questions:

Are the existing tests that measure generative model overfitting able to capture data-copying?

As popular generative models range from over- to underfitting, does our test indicate data-copying, and if so, to what degree?
Training, Generated and Test Sets.
In all of the following experiments, we select a training dataset $T$ with test split $P_n$, and a generative model $Q$ producing a sample $Q_m$. We perform $k$-means on $T$ to determine the partition $\Pi$, choosing $k$ with the objective of having a reasonable population of both $P_n$ and $Q_m$ in each cell $\pi$. We set the threshold $\tau$ such that we are guaranteed to have at least 20 samples in each cell used, in order to validate the Gaussian assumption on $U_{Q_m}$.
4.1 Detecting data-copying
First, we investigate which of the existing generative model tests can detect explicit data-copying.
Dataset and Baselines.
For this experiment, we use the simple two-dimensional ‘moons’ dataset, as it affords us limitless training and test samples and requires no feature embedding (see Appendix 7.4.1 for an example).
As baselines, we probe four of the methods described in our Related Work section to see how they react to data-copying: two-sample NN (lopez), FID (heusel), Binning-Based Evaluation (richardson), and Precision & Recall (mehdi). A detailed description of the methods is provided in Appendix 7.4.2. Note that, without an embedding, FID is simply the Fréchet distance between two maximum-likelihood normal distributions fit to $T$ and $Q_m$. We use the same size generated and training sample for all methods. Note that the two-sample NN test requires the generated sample size $m$ to be equal to the training sample size $n$. When $m < n$ (especially for large datasets and computationally burdensome samplers) we use a size-$m$ training subsample $\widetilde{T}$.
Experimental Methodology.
We choose as our generative model a Gaussian KDE, as it allows us to force explicit data-copying by setting the bandwidth $\sigma$ very low. As $\sigma \to 0$, $q_\sigma$ becomes a bootstrap sampler of the original training set. If a given test method can detect the level of data-copying by $q_\sigma$ on $T$, it will provide a different response to a heavily overfit KDE ($\sigma \ll \sigma_{\mathrm{MLE}}$), a well-fit KDE ($\sigma \approx \sigma_{\mathrm{MLE}}$), and an underfit KDE ($\sigma \gg \sigma_{\mathrm{MLE}}$).
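The degeneration to a bootstrap sampler is visible from the sampling procedure itself: a KDE draw picks a uniform-random training point and perturbs it by Gaussian noise. A 1-D sketch with illustrative names:

```python
import random

def sample_kde(train, sigma, k):
    # Draw k samples from a 1-D Gaussian KDE on `train`: choose a training
    # point uniformly at random, then add N(0, sigma^2) noise.
    return [random.choice(train) + random.gauss(0.0, sigma) for _ in range(k)]

random.seed(1)
train = [random.uniform(0.0, 10.0) for _ in range(100)]

# With sigma near 0 the KDE effectively bootstraps T: every generated
# point lands (numerically) on top of some training point.
copies = sample_kde(train, 1e-6, 50)
max_gap = max(min(abs(g - t) for t in train) for g in copies)
```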
Results.
Figure 2 depicts how each baseline method responds to KDE models of varying degrees of data-copying, as $\sigma$ ranges from data-copying ($\sigma \approx 0$) up to heavily underfit ($\sigma \gg \sigma_{\mathrm{MLE}}$). The Fréchet and Binning methods report effectively the same value for all $\sigma$ at or below $\sigma_{\mathrm{MLE}}$, indicating an inability to detect data-copying. Similarly, the Precision-Recall curves for different $\sigma$ values are nearly identical for all such $\sigma$, and only change for large $\sigma$.
The two-sample NN test does show a mild change in response as $\sigma$ decreases below $\sigma_{\mathrm{MLE}}$. This makes sense; as points in $Q_m$ become closer to points in $T$, the two-sample NN accuracy should steadily decline. The primary reason it does not drop to zero is the subsampled training set, $\widetilde{T}$, needed to perform this test. As such, each training point being copied by a generated point is unlikely to be present in $\widetilde{T}$ during the test. This phenomenon is especially pronounced in some of the following settings. However, even when $m = n$, this test will not reduce to zero as $\sigma \to 0$, due to the well-known result that a bootstrap sample of $T$ will only include about $1 - 1/e \approx 63\%$ of the samples in $T$. Consequently, several training samples will not have a generated sample as nearest neighbor. The $Z_U$ test avoids this by specifically finding the training nearest neighbor of each generated sample.
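The $1 - 1/e$ coverage fact is easy to verify by simulation (illustrative snippet; the function name is ours):

```python
import random

def bootstrap_coverage(n, trials):
    # Average fraction of n training indices that appear at least once in a
    # bootstrap resample of size n; approaches 1 - 1/e ~ 0.632 for large n.
    total = 0.0
    for _ in range(trials):
        seen = {random.randrange(n) for _ in range(n)}
        total += len(seen) / n
    return total / trials

random.seed(0)
coverage = bootstrap_coverage(1000, trials=200)
```

Each index is missed by one draw with probability $1 - 1/n$, hence by all $n$ draws with probability $(1 - 1/n)^n \to e^{-1}$.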
The reason most of these tests fail to detect data-copying is that most existing methods focus on another type of overfitting: mode-collapse and -dropping, wherein entire modes of $P$ are either forgotten or averaged together. However, if a model begins to data-copy, it is definitively overfitting without mode-collapsing.
Note that the above four baselines are all two-sample tests that do not use $P_n$ as $C_T$ does. For completeness, we present experiments with an additional, three-sample baseline in Appendix 7.5. Here, we repeat the ‘moons’ dataset experiment with the three-sample kernel MMD test originally proposed by gretton_2 for generative model selection and later adapted by reviewer_paper for testing model overfitting. We observe in Figure 10(b) that the three-sample kMMD test does not detect data-copying, treating the MLE model similarly to overfit models with small $\sigma$. See Appendix 7.5 for experimental details.
Figure caption: the $C_T$ statistic identifying data-copying in an MNIST VAE and an ImageNet GAN as they range from heavily overfit to underfit. Panels (c) and (e) demonstrate the relative insensitivity of the NN baseline to this overfitting, as does panel (a) for the generalization (ELBO) gap method for VAEs. (Note: the markers for (d) apply to the traces of (e).)

4.2 Measuring degree of data-copying
We now aim to answer the second question raised at the beginning of this section: does our test statistic $C_T$ detect and quantify data-copying?
We focus on three generative models: Gaussian KDEs, Variational Autoencoders (VAEs), and Generative Adversarial Networks (GANs). For these experiments, we consider two baselines in addition to our method — the two-sample NN test, and the likelihood generalization gap where it can be computed or approximated.
4.2.1 KDE-based tests
First we consider Gaussian KDEs. While KDEs do not provide a reliable likelihood in high dimension (theis), they do have several advantages as a preliminary benchmark – they allow us to directly force data-copying, and can help investigate the practical implications of the theoretical connection between the maximum-likelihood KDE and $\Delta(P, Q)$ described in Lemma 3. We explore two datasets with Gaussian KDE: the ‘moons’ dataset and MNIST.
KDEs: ‘moons’ dataset.
Here, we repeat the experiment performed in Section 4.1, now including the $C_T$ statistic for comparison. Appendix 7.4.1 provides more experimental details and examples of the dataset.
Results. Figure 2(a) depicts how the generalization gap dwindles as the KDE $\sigma$ increases. While this test is capable of capturing data-copying, it is insensitive to underfitting and relies on a tractable likelihood. Figures 2(b) and 2(c) give a side-by-side depiction of $C_T$ and the two-sample NN test accuracies across a range of KDE $\sigma$ values. Think of $C_T$ values as z-score standard deviations. We see that the $C_T$ statistic in Figure 2(b) precisely identifies the MLE model when $C_T \approx 0$, and responds sharply to $\sigma$ values above and below $\sigma_{\mathrm{MLE}}$. The baseline in Figure 2(c) similarly identifies the MLE model when training accuracy $\approx 0.5$, but is higher variance and less sensitive to changes in $\sigma$, especially for overfit $\sigma$. We will see in the next experiment that this test breaks down for more complex datasets when $m \neq n$.

KDEs: MNIST Handwritten Digits.
We now extend the KDE test performed on the moons dataset to the significantly more complex MNIST handwritten digit dataset (lecun).
While it would be convenient to directly apply the KDE sweeping tests discussed in the previous section, there are two primary barriers. The first is that the KDE model relies on $\ell_2$ norms being perceptually meaningful, which is well understood not to be true in pixel space. The second problem is that of dimensionality: the 784-dimensional space of digits is far too high for a KDE to be even remotely efficient at interpolating the space.

To handle these issues, we first embed each image $x$ to a perceptually meaningful 64-dimensional latent code $z$. We achieve this by training a convolutional autoencoder with a VGG-net perceptual loss produced by zhang (see Appendix 7.4.3 for more detail). Surely, even in the lower 64-dimensional space, the KDE will suffer some from the curse of dimensionality. We are not promoting this method as a powerful generative model, but rather as an instructive tool for probing a test’s response to data-copying in the image domain. All tests are run in the compressed latent space; Appendix 7.4.3 provides more experimental details.

As discussed briefly in Section 4.1, a limitation of the two-sample NN test is that it requires $m = n$. For a large training set like MNIST, it is computationally challenging to generate $n$ samples, even with a 64-dimensional KDE. We therefore use a subsampled training set $\widetilde{T}$ of size $m$ when running the two-sample NN test. The proposed test has no such restriction on the sizes of $Q_m$ and $T$.
Results. The likelihood generalization gap is depicted in Figure 2(d), repeating the trend seen with the ‘moons’ dataset.
Figure 2(e) shows how $C_T$ reacts decisively to over- and underfitting. It falsely determines the MLE $\sigma$ value as slightly overfit. However, the region of $\sigma$ where $C_T$ transitions from over- to underfit is relatively tight and includes the MLE $\sigma$.
Meanwhile, Figure 2(f) shows how — with the generated sample smaller than the training sample, $m \ll n$ — the two-sample NN baseline provides no meaningful estimate of data-copying. In fact, the most data-copying models with low $\sigma$ achieve the best scores, closest to 0.5. Again, we are forced to use the subsampled $\widetilde{T}$, and most instances of data-copying are completely missed.
These results are promising, and demonstrate the reliability of this hypothesis-testing approach to probing for data-copying across different data domains. In the next section, we explore how these tests perform on more sophisticated, non-KDE models.
4.2.2 Variational Autoencoders
Gaussian KDEs may have nice theoretical properties, but they are relatively ineffective in high-dimensional settings, precluding domains like images. As such, we also demonstrate our experiments on more practical neural models trained on higher-dimensional image datasets (MNIST and ImageNet), with the goal of observing whether the $C_T$ statistic indicates data-copying as these models range from over- to underfit. The first neural model we consider is a Variational Autoencoder (VAE) trained on the MNIST handwritten digits dataset.
Experimental Methodology.
Unlike KDEs, VAEs do not have a single parameter that controls the degree of overfitting. Instead, similar to VAEs_overfit, we vary model complexity by increasing the width (neurons per layer) in a three-layer VAE (see Appendix 7.4.3 for details) – where higher width means a model of higher complexity. As an embedding, we pass all samples through the convolutional autoencoder of Section 4.2.1, and collect statistics in this 64-dimensional space. Observe that the likelihood is not available for VAEs; instead we compute each model’s ELBO on a held-out validation set, and use the ELBO approximation to the generalization gap instead.

We again note here that, for the NN accuracy baseline, we use a subsampled training set $\widetilde{T}$, as with the KDE-based MNIST tests, where $m \ll n$.
Results.
Figures 3(b) and 3(c) compare the $C_T$ statistic to the NN accuracy baseline. $C_T$ behaves as it did in the previous sections: more complex models overfit, forcing $C_T < 0$, and less complex models underfit, forcing $C_T > 0$. We note that the range of $C_T$ values is far less dramatic, which is to be expected since the KDEs were forced to explicitly data-copy. We observe that the validation ELBO spikes for models with $C_T$ near 0. Figure 3(a) shows the ELBO approximation of the generalization gap as the latent dimension (and number of units in each layer) is decreased. This method is entirely insensitive to over- and underfit models. This may be because the ELBO is only a lower bound, not the actual likelihood.
The NN baseline in Figure 3(c) is less interpretable, and fails to capture the overfitting trend as $C_T$ does. While all three test accuracies still follow the upward-sloping trend of Figure 2(c), they do not indicate where the highest validation-set ELBO is. Furthermore, the NN accuracy statistics are shifted upward when compared to the results of the previous section: all NN accuracies are above 0.5 for all latent dimensions. This is problematic. A test statistic’s absolute score ought to bear significance between very different data and model domains like KDEs and VAEs.
4.2.3 ImageNet GAN
Finally, we scale our experiments up to a larger image domain.
Experimental Methodology.
We gather our test statistics on a state-of-the-art conditional GAN, ‘BigGan’ (BigGan), trained on the ImageNet 12 dataset (imagenet12). Conditioning on an input code, this GAN will generate one of 1000 different ImageNet classes. We run our experiments separately on three classes: ‘coffee’, ‘soap bubble’, and ‘schooner’. All generated, test, and training images are embedded into a 64-dimensional space by first gathering the 2048-dimensional features of an Inception-V3 network ‘Pool3’ layer, and then projecting them onto the 64 principal components of the training embeddings. Appendix 7.4.4 has more details.
Being limited to one pretrained model, we increase model variance (‘truncation threshold’) instead of decreasing model complexity. As proposed by BigGan’s authors, all standard normal input samples outside of this truncation threshold are resampled. The authors suggest that lower truncation thresholds, by only producing samples at the mode of the input, output higher quality samples at the cost of variety, as determined by Inception Score (IS). Similarly, the FID score finds suitable variety until truncation approaches zero.
Results.
The $C_T$ score is depicted in Figure 3(d); the statistic remains well below zero until the truncation threshold is nearly maximized, indicating that $Q$ produces samples closer to the training set than real samples tend to be. While FID finds that, in aggregate, the distributions are roughly similar, a closer look suggests that $Q$ allocates too much probability mass near the training samples.
Meanwhile, the two-sample NN baseline in Figure 3(e) hardly reacts to changes in truncation, even though the generated and training sets are the same size, $m = n$. Across all truncation values, the training-sample NN accuracy remains around 0.5, not quite implying over- or underfitting.
A useful feature of the $C_T$ statistic is that one can examine the in-cell $Z_U$ scores it is composed of, to see which of the cells are or are not copying. Figure 5 shows samples from over- and underfit clusters for two of the three classes. For both the ‘coffee’ and ‘bubble’ classes, the underfit cells are more diverse than the data-copied cells. While it might seem reasonable that these generated samples are further from nearest neighbors in more diverse clusters, keep in mind that the $Z_U$ statistic indicates that they are further from training neighbors than test set samples are. For instance, the people depicted in the underfit ‘bubbles’ cell are highly distorted.
4.3 Discussion
We now reflect on the two questions raised at the beginning of Section 4. Firstly, it appears that many existing generative model tests do not detect data-copying. The findings of Section 4.1 demonstrate that many popular generative model tests like FID, Precision and Recall, and Binning-Based Evaluation are wholly insensitive to explicit data-copying, even in low-dimensional settings. We suggest that this is because these tests are geared to detect over- and under-representation more than data-copying.
Second, the experiments of Section 4.2 indicate that the proposed test statistic not only detects explicitly forced data-copying (as in the KDE experiments), but also detects data-copying in complex, overfit generative models like VAEs and GANs. In these settings, we observe that as models overfit more, the statistic drops below zero, and significantly so.
A limitation of the proposed test is its dependence on the number of test samples. Without sufficient test samples, not only is the statistic of higher variance, but the instance space partition cannot be very fine-grained. Consequently, we are limited in our ability to manage heterogeneity across the instance space, and some cells may be mischaracterized. For example, in the BigGAN experiment of Section 4.2.3, we are provided only 50 test samples per image class (e.g. 'soap bubble'), limiting us to an instance space partition of only three cells. Developing more data-efficient methods to handle heterogeneity may be a promising avenue for future work.
5 Conclusion
In this work, we have formalized data-copying: an underexplored failure mode of generative model overfitting. We have provided preliminary tests for measuring data-copying, along with experiments indicating its presence in a broad class of generative models. In future work, we plan to establish more theoretical properties of data-copying, convergence guarantees for these tests, and experiments with different model parameters.
6 Acknowledgements
We thank Rich Zemel for pointing us to Kilian, which was the starting point of this work. Thanks to Arthur Gretton and Ruslan Salakhutdinov for pointers to prior work, and Philip Isola and Christian Szegedy for helpful advice. Finally, KC and CM would like to thank ONR under N000141612614, UC Lab Fees under LFR 18548554 and NSF IIS 1617157 for research support.
References
7 Appendix
7.1 Proof of Theorem 1
A restatement of the theorem:
For the true distribution, model distribution, and distance metric, the estimator concentrates around its expectation according to the following concentration inequality:
Proof.
We establish consistency using the following well-known lemma.
Lemma 4.
(Bounded Differences Inequality) Suppose $X_1, \ldots, X_n$ are independent, and let $f$ satisfy
$$\sup_{x_1, \ldots, x_n,\, x_i'} \big| f(x_1, \ldots, x_i, \ldots, x_n) - f(x_1, \ldots, x_i', \ldots, x_n) \big| \leq c_i$$
for $1 \leq i \leq n$. Then we have for any $t > 0$
$$\Pr\Big( \big| f(X_1, \ldots, X_n) - \mathbb{E} f(X_1, \ldots, X_n) \big| \geq t \Big) \leq 2 \exp\left( \frac{-2 t^2}{\sum_{i=1}^n c_i^2} \right). \tag{5}$$
This directly equips us to prove the Theorem.
It is relatively straightforward to apply Lemma 4 to the normalized statistic. First, think of it as a function of the $n$ independent test samples and the $m$ independent generated samples.
Let $c_{\mathrm{test}}$ bound the change in the statistic after substituting any one test sample, and $c_{\mathrm{gen}}$ bound the change after substituting any one generated sample. Substituting a single test sample flips at most $m$ of the $nm$ indicator comparisons, which after normalization by $nm$ changes the statistic by at most $1/n$; so $c_{\mathrm{test}} \leq 1/n$, with equality when all $m$ of those comparisons flip. By a similar argument, $c_{\mathrm{gen}} \leq 1/m$.
Equipped with $c_{\mathrm{test}}$ and $c_{\mathrm{gen}}$, we may simply substitute into Equation 5 of the Bounded Differences Inequality, giving us
$$\Pr\big( |\hat{U} - \mathbb{E}\hat{U}| \geq t \big) \leq 2 \exp\left( \frac{-2 t^2}{n (1/n)^2 + m (1/m)^2} \right) = 2 \exp\left( \frac{-2 t^2 n m}{n + m} \right),$$
writing $\hat{U}$ for the normalized statistic.
7.2 Proof of Theorem 2
A restatement of the theorem:
When , and the corresponding distance distribution is nonatomic,
Proof.
For identically distributed random variables $A$ and $B$, we can partition the event space of the pair $(A, B)$ into three disjoint events:
$$\{A < B\}, \qquad \{A > B\}, \qquad \{A = B\}.$$
Since $A$ and $B$ are identically distributed, the first two events have equal probability $p$, so $2p + \Pr(A = B) = 1$.
And since the distributions of $A$ and $B$ are nonatomic (i.e. $\Pr(A = B) = 0$), we have that $p = 1/2$, and the result follows.
7.3 Proof of Lemma 3
Lemma 3. For the kernel density estimator (1), the maximum-likelihood choice of the bandwidth, namely the maximizer of the log-likelihood, satisfies
Proof.
We have
Setting the derivative of this to zero and simplifying, we find that the maximum-likelihood bandwidth satisfies
(6) 
Now, interpreting as a mixture of Gaussians, and using the notation to mean that is chosen uniformly at random from , we have
Combining this with (6) yields the lemma.
7.4 Procedural Details of Experiments
7.4.1 Moons Dataset, and Gaussian KDE
moons dataset
‘Moons’ is a synthetic dataset consisting of two curved, interlocking manifolds with configurable added noise. We chose this dataset as a proof of concept because it is low-dimensional (and thus KDE-friendly and easy to visualize), and because we can draw unlimited train, test, and validation samples.
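For reference, the dataset can be generated with numpy alone; `make_moons` here is our hypothetical stand-in for the standard `sklearn.datasets.make_moons`:

```python
import numpy as np

def make_moons(n_samples, noise=0.1, rng=None):
    """Generate the two-moons dataset: two interlocking half-circles
    with isotropic Gaussian noise added to every point."""
    rng = np.random.default_rng(rng)
    n = n_samples // 2
    t = np.pi * rng.random(n)
    # Upper half-circle, and a lower half-circle offset so they interlock.
    upper = np.column_stack([np.cos(t), np.sin(t)])
    lower = np.column_stack([1 - np.cos(t), 0.5 - np.sin(t)])
    X = np.vstack([upper, lower]) + noise * rng.standard_normal((2 * n, 2))
    y = np.repeat([0, 1], n)  # manifold membership labels
    return X, y
```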
Gaussian KDE
We use a Gaussian KDE as our preliminary generative model because its likelihood is theoretically related to our non-parametric test. Perhaps more importantly, it is trivial to control the degree of data-copying with the bandwidth parameter. Figures 5(b), 5(c), 5(d) provide contour plots of a Gaussian KDE trained on the moons dataset with progressively larger bandwidth. With a very small bandwidth, the KDE effectively resamples the training set; an intermediate bandwidth is nearly the MLE model; and with a large bandwidth, the KDE struggles to capture the distinct shape of the two manifolds.
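A minimal numpy sketch of the KDE in equation (1) illustrates why small bandwidths data-copy: sampling from the mixture is 'pick a training point, add Gaussian noise', so a tiny bandwidth all but resamples the training set (function names are ours, not from the paper's code):

```python
import numpy as np

def kde_logpdf(X_train, X_eval, sigma):
    """Mean log-density of X_eval under a Gaussian KDE (equation (1)):
    an equal-weight mixture of width-sigma Gaussians at the training points."""
    d = X_train.shape[1]
    # Pairwise squared distances, shape (n_eval, n_train).
    sq = ((X_eval[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    log_kernel = -sq / (2 * sigma**2) - d * np.log(sigma * np.sqrt(2 * np.pi))
    # log of the mixture average, via log-sum-exp for numerical stability.
    m = log_kernel.max(axis=1, keepdims=True)
    log_density = m[:, 0] + np.log(np.exp(log_kernel - m).mean(axis=1))
    return log_density.mean()

def kde_sample(X_train, sigma, n, rng=None):
    """Sample from the KDE: pick a training point uniformly, add
    N(0, sigma^2 I) noise. Tiny sigma all but resamples the training set."""
    rng = np.random.default_rng(rng)
    idx = rng.integers(len(X_train), size=n)
    return X_train[idx] + sigma * rng.standard_normal((n, X_train.shape[1]))
```

Evaluated on the training points themselves, the log-likelihood grows as the bandwidth shrinks, which is exactly the degenerate MLE behavior discussed in the introduction.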
7.4.2 Moons Experiments
Our experiments that examined whether several baseline tests could detect data-copying (Section 4.1), and our first test of our own metric (Section 4.2.1), use the moons dataset. In both, we fix a training sample of 2000 points, a test sample of 1000 points, and a generated sample of 1000 points. We regenerate the generated sample 10 times, and report the average statistic across these trials along with a single standard deviation. If the standard deviation buffer along the line is not visible, it is because the standard deviation is relatively small. We artificially constrain the test sample to be smaller than the training sample, as is true for large natural datasets and for more elaborate models that are computationally burdensome to sample from.
Section 4.1 Methods
Here are the routines we used for the four baseline tests:

Frechét Inception Distance (FID) (heusel): Normally, this test is run on two samples of images (training and generated) that are first embedded into a perceptually meaningful latent space using a discriminative neural net, like the Inception Network. By ‘meaningful’ we mean that points closer together are more perceptually alike to the human eye. Unlike images in pixel space, the samples of the moons dataset require no embedding, so we run the Frechét test directly on the samples.
First, we fit two MLE Gaussians: $\mathcal{N}(\mu_T, \Sigma_T)$ to the training sample and $\mathcal{N}(\mu_G, \Sigma_G)$ to the generated sample, by collecting their respective MLE mean and covariance parameters. The statistic reported is the Frechét distance between these two Gaussians, which for Gaussians has the closed form
$$\mathrm{Fr}^2 = \| \mu_T - \mu_G \|_2^2 + \mathrm{Tr}\big( \Sigma_T + \Sigma_G - 2 (\Sigma_T \Sigma_G)^{1/2} \big).$$
Naturally, if the generated sample data-copies the training sample, its MLE mean and covariance will be nearly identical to the training sample's, rendering this test ineffective for capturing this kind of overfitting.
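As a concrete illustration, the closed form is easy to evaluate when both covariances are diagonal, which sidesteps the matrix square root; this simplified variant (our own sketch, not the standard FID implementation) makes the blind spot explicit: identical moments give distance zero, so an exact data-copy of the training set scores perfectly.

```python
import numpy as np

def frechet_diag(mu1, var1, mu2, var2):
    """Squared Frechet distance between two Gaussians with DIAGONAL
    covariances: ||mu1 - mu2||^2 + sum(var1 + var2 - 2*sqrt(var1*var2)).
    (The general closed form instead needs the matrix square root
    of Sigma1 @ Sigma2, e.g. via scipy.linalg.sqrtm.)"""
    mu1, var1, mu2, var2 = map(np.asarray, (mu1, var1, mu2, var2))
    mean_term = ((mu1 - mu2) ** 2).sum()
    cov_term = (var1 + var2 - 2 * np.sqrt(var1 * var2)).sum()
    return mean_term + cov_term
```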

Binning-Based Evaluation (richardson): This test takes a hypothesis-testing approach to evaluating mode collapse and deletion, and bears much similarity to the test described in Section 3.2. The basic idea is as follows. Split the training set into a partition of $k$ bins using $k$-means; the number of samples falling into each bin is approximately normally distributed if it holds more than 20 samples. For each bin $i$, check the null hypothesis that the fraction of the training set in the bin equals the fraction of the generated set in the bin. Specifically, compute
$$z_i = \frac{\hat{p}_i^{T} - \hat{p}_i^{G}}{\sqrt{\hat{p}_i (1 - \hat{p}_i) \left( \tfrac{1}{n} + \tfrac{1}{m} \right)}},$$
where $\hat{p}_i$ is the pooled fraction of the combined samples falling in bin $i$. We then perform a hypothesis test per bin at significance level 0.05, and count the number of bins whose $p$-value falls below this level. We call this the number of statistically different bins, or NDB. NDB/$k$ ought to equal the significance level if the generated distribution matches the training distribution.
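A rough sketch of the per-bin test, assuming a two-proportion $z$-statistic with pooled variance and a two-sided test (the exact variant in richardson may differ in details such as sidedness); `ndb` is our own helper name:

```python
import numpy as np
from math import erf, sqrt

def ndb(train_bins, gen_bins, alpha=0.05):
    """Number of statistically different bins. train_bins/gen_bins are
    per-sample cluster assignments (e.g. from k-means on the training set).
    For each bin, a two-proportion z-test compares the fraction of each
    sample in the bin; bins with p-value below alpha are counted."""
    train_bins, gen_bins = np.asarray(train_bins), np.asarray(gen_bins)
    n, m = len(train_bins), len(gen_bins)
    count = 0
    for b in np.union1d(train_bins, gen_bins):
        pt, pg = (train_bins == b).mean(), (gen_bins == b).mean()
        p = (pt * n + pg * m) / (n + m)            # pooled proportion
        se = sqrt(p * (1 - p) * (1 / n + 1 / m))   # pooled standard error
        if se == 0:
            continue
        z = (pt - pg) / se
        p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided
        count += p_value < alpha
    return count
```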

Two-Sample Nearest-Neighbor (lopez): In this test — our primary baseline — we report the three LOO NN values discussed in Kilian. The generated sample and training sample (subsampled so as to have equal size) are joined to create a combined sample, with training samples labeled ‘1’ and generated samples labeled ‘0’. One then fits a 1-Nearest-Neighbor classifier to the combined sample, and reports the leave-one-out accuracy in predicting the training samples (the ‘1’s), the accuracy in predicting the generated samples (the ‘0’s), and their average. One can expect that — when the generated distribution collapses to a few mode centers of the training set — the training accuracy is low and the generated accuracy is high, thus indicating over-representation. Additionally, one could imagine that when both the training and generated accuracies are near 0, we have extreme data-copying. However, as explained in the Experiments section, when we are forced to subsample the training set, it is unlikely that a given copied training point is used in the test, thus making the test result unclear.
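The test can be sketched with a brute-force leave-one-out 1-NN (hypothetical helper name; a KD-tree or ball tree would be used at scale):

```python
import numpy as np

def two_sample_nn(train, gen):
    """Leave-one-out 1-NN accuracies for the two-sample NN test: join
    train (label 1) and generated (label 0), predict each point's label
    from its nearest other point, and report per-class accuracy.
    Accuracies near 0 signal data-copying (each point's nearest neighbor
    is its copy in the other sample); low training accuracy with high
    generated accuracy signals mode collapse / over-representation."""
    X = np.vstack([train, gen])
    y = np.concatenate([np.ones(len(train)), np.zeros(len(gen))])
    # Pairwise squared distances with self-distances masked out (LOO).
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(sq, np.inf)
    pred = y[sq.argmin(axis=1)]
    acc_train = (pred[y == 1] == 1).mean()
    acc_gen = (pred[y == 0] == 0).mean()
    return acc_train, acc_gen, (acc_train + acc_gen) / 2
```

When the generated sample is an exact copy of the training sample, every point's nearest neighbor is its duplicate with the opposite label, so both accuracies collapse to zero.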

Precision and Recall (mehdi): This method offers a clever technique for scaling classical precision and recall statistics to high-dimensional, complex spaces. First, all samples are embedded into Inception Network Pool3 features. Then, the authors use the following insight: for a training distribution and a generated distribution with histograms $p$ and $q$ over a partition of the instance space, the precision-recall curve is approximately given by the set of points
$$\widehat{\mathrm{PRD}} = \big\{ (\alpha(\lambda), \beta(\lambda)) : \lambda \in \Lambda \big\},$$
where
$$\alpha(\lambda) = \sum_{i} \min( \lambda \, p_i, \, q_i ), \qquad \beta(\lambda) = \sum_{i} \min\Big( p_i, \, \frac{q_i}{\lambda} \Big),$$
and where the size of the slope set $\Lambda$ is the ‘resolution’ of the curve, the partition cells are determined by running $k$-means on the combination of the training and generated sets, and $p_i, q_i$ are the fractions of the training and generated samples falling in cell $i$. In our tests here, we fix the resolution and report the average PRD curve measured over 10 $k$-means clusterings (and then rerun 10 times for 10 separate trials of the generated sample).
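A sketch of the curve computation from the two cell histograms, with the slopes parameterized by equally spaced angles (our reading of the construction; the reference implementation may choose $\Lambda$ differently):

```python
import numpy as np

def prd_curve(p, q, resolution=1001):
    """Precision-recall curve between histograms p (training) and q
    (generated) over the same partition cells. For each slope lam,
    precision alpha = sum_i min(lam * p_i, q_i) and
    recall    beta  = sum_i min(p_i, q_i / lam)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    p, q = p / p.sum(), q / q.sum()
    # Slopes tan(theta) for angles swept across (0, pi/2).
    angles = np.linspace(1e-6, np.pi / 2 - 1e-6, resolution)
    lams = np.tan(angles)
    alpha = np.array([np.minimum(l * p, q).sum() for l in lams])
    beta = np.array([np.minimum(p, q / l).sum() for l in lams])
    return alpha, beta
```

Identical histograms reach precision and recall of 1 (at slope 1), while disjoint histograms pin the whole curve at zero; note that an exact data-copy of the training set also produces matching histograms, so the curve cannot flag data-copying.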
7.4.3 MNIST Experiments
The experiments of Sections 4.2.1 and 4.2.2 use the MNIST digit dataset (lecun). We use a training sample, , of size , a test sample of size , a validation sample of , and create generated samples of size .
Here, to obtain a meaningful distance metric, we create a custom embedding using a convolutional autoencoder trained with the VGG perceptual loss proposed by zhang. The encoder and decoder each have four convolutional layers using batch normalization, two linear layers using dropout, and two max-pool layers. The autoencoder is trained for 100 epochs with a batch size of 128 and the Adam optimizer with learning rate 0.001. For each training sample, the encoder compresses the image to the latent space, and the decoder expands it back up to image space. Our loss combines the VGG perceptual reconstruction loss with a linear hinge penalty applied to latent representations outside of a unit ball. The hinge loss encourages the encoder to learn a latent representation within a bounded domain, hopefully augmenting its ability to interpolate between samples. It is worth noting that the perceptual loss is not trained on MNIST, and hopefully uses agnostic features that help keep us from overfitting. We opt for a standard autoencoder instead of a stochastic autoencoder like a VAE because we want to be able to exactly data-copy the training set; thus, we want the encoder to create a near-exact encoding and decoding of the training samples specifically. Figure 7 provides an example of linearly spaced steps between two training samples. While not perfect, we observe that halfway between the ‘2’ and the ‘0’ is a sample that appears perceptually to be almost a ‘2’ and almost a ‘0’. As such, we consider the distance metric on this space used in our experiments to be meaningful.
KDE tests:
In the MNIST KDE experiments, we fit each KDE on the 64d latent representations of the training set for several bandwidth values; we gather all statistical tests in this space, and only decode in order to visually inspect samples. We gather the average and standard deviation of each data point across 5 trials of generating the sample. For the Two-Sample Nearest-Neighbor test, it is computationally intense to compute the nearest neighbor in a 64-dimensional dataset of 20,000 points 20,000 times. To limit this, we average each of the training and generated NN accuracies over 500 training and generated samples. We find this acceptable, since the test results depicted in Figure 2(f) have relatively low variance.
VAE experiments:
In the MNIST VAE experiments, we only use the 64d autoencoder latent representation in computing the and 1NN test scores, and not at all in training. Here, we experiment with twenty standard, fully connected, VAEs using binary cross entropy reconstruction loss. The twenty models have three hidden layers and latent dimensions ranging from to in steps of 5. The number of neurons in intermediate layers is approximately twice the number of the layer beneath it, so for a latent space of 50d, the encoder architecture is , and the decoder architecture is the opposite.
To sample from a trained VAE, we draw from a standard normal with dimensionality equal to the VAE's latent dimension, and pass the draws through the VAE decoder to the 784d image space. We then encode these generated images to the agnostic 64d latent space of the perceptual autoencoder described at the beginning of the section, where distance is meaningful. We also encode the training and test samples to this space, and then run the proposed and two-sample NN tests. We again compute the nearest-neighbor accuracies for 10,000 of the training and generated samples (the 1-NN classifier is fit on the 20,000-sample combined set), which appears to be acceptable due to low test variance.
7.4.4 ImageNet Experiments
Here, we have chosen three of the one thousand ImageNet12 classes that BigGAN produces. To reiterate, a conditional GAN can output samples from a specific class by conditioning on a class-code input. We acknowledge that conditional GANs combine features from many classes in ways not yet well understood, but we treat the GAN of each class as a distinct generative model trained on the training samples from that class. So, for the ‘coffee’ class, we treat the GAN as a coffee generator trained on the 1300 ‘coffee’ class samples. For each class, we have 1300 training samples, 2000 generated samples, and 50 test samples. Being atypically training-sample starved, we subsample the generated sample (not the training sample!) to produce equal-size samples for the two-sample NN test. As such, all training samples are used in the combined set. We also note that the 50 test samples provided in each class are highly limiting, only allowing us to split the instance space into about three cells while keeping a reasonable number of test samples in each cell. As the number of test samples grows, so can the number of cells and the resolution of the partition. Figure 10 provides an example of where this clustering might be limited; the generated samples of the underfit cell seem hardly any different from those of the overfit cell. A finer-grained partition is likely needed here. However, the data-copied cell to the left does appear to be very close to the training set, potentially too close according to the statistic.
In performing these experiments, we gather the statistic for a given class of images. To embed the images into a meaningful lower-dimensional latent space, we pass each image through an InceptionV3 network and gather the 2048-dimension feature embeddings after the final average pooling layer (Pool3). We then project all inception-space images onto the 64 principal components of the training set embeddings. Next, we use $k$-means to partition the points of each sample into one of three cells. The number of cells is limited by the 50 test images available per class; any more cells would strain the Central Limit Theorem assumption in computing the statistic. Finally, we gather the proposed and two-sample NN baseline statistics on this 64d space.
7.5 Comparison with three-sample kernel-MMD
Another three-sample test not shown in the main body of this work is the three-sample kernel MMD test introduced by gretton_2, intended more for model comparison than for checking model overfitting. For samples $X = \{x_1, \ldots, x_n\}$ and $Y = \{y_1, \ldots, y_m\}$, we can estimate the squared kernel MMD between the underlying distributions under kernel $k$ by empirically estimating
$$\widehat{\mathrm{MMD}}^2(X, Y) = \frac{1}{n^2} \sum_{i, j} k(x_i, x_j) + \frac{1}{m^2} \sum_{i, j} k(y_i, y_j) - \frac{2}{n m} \sum_{i, j} k(x_i, y_j).$$
More recent works such as reviewer_paper have repurposed this test for measuring generative model overfitting. Intuitively, if the model is overfitting its training set, the empirical kMMD between training and generated data may be smaller than that between training and test sets. This may be triggered by the data-copying variety of overfitting.
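The plug-in estimator can be sketched directly in numpy with an RBF kernel (this is the biased V-statistic form; the code posted by gretton_2 may use unbiased variants and a different bandwidth heuristic):

```python
import numpy as np

def rbf_mmd2(X, Y, gamma=1.0):
    """Biased estimator of squared MMD between samples X and Y under an
    RBF kernel k(a, b) = exp(-gamma * ||a - b||^2):
    mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)."""
    def k(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()
```

Identical samples give an MMD of exactly zero, and a bootstrap of the training set gives nearly the same value as an independent sample of the same size, which previews the insensitivity to data-copying discussed below.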
This test provides an interesting benchmark to consider in addition to those in the main body. Figure 11 demonstrates some preliminary experimental results repeating both the ‘moons’ KDE experiment of Figure 2 and the MNIST VAE experiment of Figure 3(b). To implement the kMMD test, we used code posted by gretton_2 (https://github.com/eugenium/MMD), specifically the three-sample RBF-kMMD test.
In Figures 10(a), 10(b), and 10(c) we compare the proposed statistic and the kMMD gap for 50 values of the KDE bandwidth. We observe that the kMMD between the test and training sets (blue) and between the generated and training sets (orange) both remain near zero for all bandwidth values less than the MLE bandwidth, indicated by the red circle. This suggests that the three-sample kMMD is not a particularly strong test for data-copying, since low bandwidth values effectively bootstrap the original training set. The kMMD gap does diverge for bandwidth values much larger than the MLE, however, indicating that it can detect underfitting.
This is corroborated by Figure 10(c), which displays the $p$-value of the kMMD hypothesis test used by reviewer_paper. This checks the null hypothesis that the kMMD between the generated and training sets is greater than that between the test and training sets. A high $p$-value confirms this null hypothesis (as seen for all bandwidth values). A $p$-value near 0.5 suggests that the kMMDs are approximately equal. A $p$-value near zero rejects the null hypothesis, suggesting that the kMMD between the generated and training sets is much smaller than that between the test and training sets. We see that the $p$-value remains well above 0.5 for all bandwidth values, and treats the MLE bandwidth just as it does the overfit bandwidths.
Figures 10(d) and 10(e) compare the proposed and kMMD tests for twenty MNIST VAEs of decreasing complexity (latent dimension). Figure 10(e) again depicts the kMMD distance to the training set for both the generated (orange) and test (blue) samples. We observe that this test does not appear sensitive to overparametrized VAEs in the same way our proposed test (Figure 10(d)) is. As in the ‘moons’ case above, it does appear sensitive to underfitting. Here, the corresponding kMMD $p$-values are effectively 1 for all latent dimension values, and thus are omitted.
We suspect that this insensitivity to data-copying is due to the fact that – for a large number of samples – the kMMDs between the training and generated sets, and between the training and test sets, are both likely to be near zero when the model data-copies. Consider the case of extreme data-copying, when the generated sample is simply a bootstrap sample from the training set. The kMMD estimate will be
Informally speaking, the second and third summations of this expression behave almost identically to the first on average, since the generated set is a bootstrap sample of the training set. They only behave differently for summation terms that are collisions: in the second summation, terms where two bootstrap points are copies of the same training point, and in the third, terms where a bootstrap point is a copy of the training point it is paired with.
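The collision intuition can be checked numerically: a bootstrap sample of size $n$ contains on average a $1 - (1 - 1/n)^n \approx 1 - 1/e$ fraction of distinct training points, so the vast majority of kernel terms pair distinct points, just as in the train-versus-test sums:

```python
import numpy as np

n = 100_000
rng = np.random.default_rng(0)
# Bootstrap: draw n indices with replacement from a size-n training set.
boot = rng.integers(n, size=n)
# Fraction of training points that appear at least once in the bootstrap;
# its expectation is 1 - (1 - 1/n)^n, which tends to 1 - 1/e ~ 0.632.
frac_distinct = np.unique(boot).size / n
```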