Figure 1: Illustration (a) depicts an over-represented region (top) and an under-represented region (bottom). This is the kind of overfitting evaluated by methods like the FID score and Precision and Recall. Illustration (b) depicts a data-copied region (top) and an underfit region (bottom). This is the type of overfitting this work focuses on. Panel (c) shows VAE-generated and training samples from a data-copied (top) and an underfit (bottom) region of the MNIST instance space. In each 10-image strip, the bottom row shows random generated samples from the region and the top row shows their training nearest neighbors. Samples in the bottom region are on average farther from their training nearest neighbor than held-out test samples in the region are, while samples in the top region are closer, and are thus ‘copying’ (distances are computed in an embedded space; see the Experiments section).
Overfitting is a basic stumbling block of any learning process. While it has been studied in great detail in the context of supervised learning, it has received much less attention in the unsupervised setting, despite being just as much of a problem.
To start with a simple example, consider a classical kernel density estimator (KDE), which given data $x_1, \ldots, x_n \in \mathbb{R}^k$, constructs a distribution over $\mathbb{R}^k$ by placing a Gaussian of width $\sigma$ at each of these points, yielding the density
$$q_\sigma(x) = \frac{1}{(2\pi)^{k/2} \sigma^k n} \sum_{i=1}^{n} \exp\left(-\frac{\|x - x_i\|^2}{2\sigma^2}\right). \quad (1)$$
The only parameter is the scalar $\sigma$. Setting it too small makes $q_\sigma$ too concentrated around the given points: a clear case of overfitting (see Appendix Figure 6). This cannot be avoided by choosing the $\sigma$ that maximizes the log likelihood on the training data, since in the limit $\sigma \to 0$, this likelihood goes to $\infty$.
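The effect described above can be checked numerically. The following is a minimal sketch, assuming a one-dimensional Gaussian KDE and standard-normal toy data (both are illustrative choices, and `kde_log_likelihood` is a hypothetical helper name): as the width shrinks, the training log-likelihood keeps growing while the held-out log-likelihood collapses.

```python
import math
import random

def kde_log_likelihood(points, centers, sigma):
    """Average log-density of `points` under a 1-D Gaussian KDE centered at `centers`."""
    total = 0.0
    for x in points:
        # log of (1/n) * sum_i N(x; c_i, sigma^2), via log-sum-exp for numerical stability
        exps = [-(x - c) ** 2 / (2 * sigma ** 2) for c in centers]
        top = max(exps)
        lse = top + math.log(sum(math.exp(e - top) for e in exps))
        total += lse - math.log(len(centers)) - 0.5 * math.log(2 * math.pi * sigma ** 2)
    return total / len(points)

rng = random.Random(0)
train = [rng.gauss(0, 1) for _ in range(200)]      # toy training data (assumption)
held_out = [rng.gauss(0, 1) for _ in range(200)]   # toy validation data (assumption)

for sigma in (1.0, 0.3, 0.01, 0.0001):
    train_ll = kde_log_likelihood(train, train, sigma)
    valid_ll = kde_log_likelihood(held_out, train, sigma)
    print(f"sigma={sigma:<7} train={train_ll:10.2f} held-out={valid_ll:12.2f}")
```

The gap between the two columns is the generalization gap discussed next; it is only computable here because the KDE has a tractable likelihood.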
The classical solution is to find a parameter that has a low generalization gap – that is, a low gap between the training log-likelihood and the log-likelihood on a held-out validation set. This method however often does not apply to the more complex generative models that have emerged over the past decade or so, such as Variational Auto Encoders (VAEs) (kingma) and Generative Adversarial Networks (GANs) (goodfellow). These models easily involve millions of parameters, and hence overfitting is a serious concern. Yet, a major challenge in evaluating overfitting is that these models do not offer exact, tractable likelihoods. VAEs can tractably provide a log-likelihood lower bound, while GANs have no accompanying density estimate at all. Thus any method that can assess these generative models must be based only on the samples produced.
A body of prior work has provided tests for evaluating generative models based on samples drawn from them (salimans; mehdi; Ruslan_et_al; heusel); however, the vast majority of these tests focus on ‘mode dropping’ and ‘mode collapse’: the tendency for a generative model to either merge or delete high-density modes of the true distribution. A generative model that simply reproduces the training set or minor variations thereof will pass most of these tests.
In contrast, this work formalizes and investigates a type of overfitting that we call ‘data-copying’: the propensity of a generative model to recreate minute variations of a subset of training examples it has seen, rather than represent the true diversity of the data distribution. An example is shown in Figure 1(b); in the top region of the instance space, the generative model data-copies, or creates samples that are very close to the training samples; meanwhile, in the bottom region, it underfits. To detect this, we introduce a test that relies on three independent samples: the original training sample used to produce the generative model; a separate (held-out) test sample from the underlying distribution; and a synthetic sample drawn from the generator.
Our key insight is that an overfit generative model would produce samples that are too close to the training samples – closer on average than an independently drawn test sample from the same distribution. Thus, if a suitable distance function is available, then we can test for data-copying by testing whether the distances to the closest point in the training sample are on average smaller for the generated sample than for the test sample.
A further complication is that modern generative models tend to behave differently in different regions of space; a configuration as in Figure 1(b), for example, could cause a global test to fail. To address this, we use ideas from the design of non-parametric methods. We divide the instance space into cells, conduct our test separately in each cell, and then combine the results to get a sense of the average degree of data-copying.
Finally, we explore our test experimentally on a variety of illustrative data sets and generative models. Our results demonstrate that given enough samples, our test can successfully detect data-copying in a broad range of settings.
1.1 Related work
There has been a large body of prior work on the evaluation of generative models (salimans; lopez; richardson; mehdi; Kilian; Ruslan_et_al) . Most are geared to detect some form of mode-collapse or mode-dropping: the tendency to either merge or delete high-density regions of the training data. Consequently, they fail to detect even the simplest case of extreme data-copying – where a generative model memorizes and exactly reproduces a bootstrap sample from the training set. We discuss below a few such canonical tests.
To date, there is a wealth of techniques for evaluating whether a model mode-drops or -collapses. Tests like the popular Inception Score (IS), Fréchet Inception Distance (FID) (heusel), Precision and Recall test (mehdi), and extensions thereof (Kynk_improved; che_2016) all work by embedding samples using the features of a discriminative network such as ‘InceptionV3’ and checking whether the training and generated samples are similar in aggregate. The hypothesis-testing binning method proposed by richardson also compares aggregate training and generated samples, but without the embedding step. The parametric Kernel MMD method proposed by gretton uses a carefully selected kernel to estimate the distribution of both the generated and training samples and reports the maximum mean discrepancy between the two. All these tests, however, reward a generative model that only produces slight variations of the training set, and do not successfully detect even the most egregious forms of data-copying.
A test that can detect some forms of data-copying is the Two-Sample Nearest Neighbor, a non-parametric test proposed by lopez. Their method groups a training and generated sample of equal cardinality together, with training points labeled ‘1’ and generated points labeled ‘0’, and then reports the Leave-One-Out (LOO) Nearest-Neighbor (NN) accuracy of predicting ‘1’s and ‘0’s. Two values are then reported as discussed by Kilian: the leave-one-out accuracy of the training points, and the leave-one-out accuracy of the generated points. An ideal generative model should produce an accuracy of $0.5$ for each. More often, a mode-collapsing generative model will leave the training accuracy low and generated accuracy high, while a generative model that exactly reproduces the entire training set should produce zero accuracy for both. Unlike this method, our test not only detects exact data-copying, which is unlikely, but estimates whether a given model generates samples closer to the training set than it should, as determined by a held-out test set.
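The two-sample nearest-neighbor baseline described above can be sketched as follows. This is an illustrative brute-force version under assumed conventions (squared Euclidean distance, points given as tuples, hypothetical function name `loo_nn_accuracies`), not the reference implementation of the cited works.

```python
import random

def loo_nn_accuracies(train, generated):
    """Return (training accuracy, generated accuracy) of a leave-one-out 1-NN classifier."""
    pooled = [(x, 1) for x in train] + [(x, 0) for x in generated]
    correct = {1: 0, 0: 0}
    for i, (x, label) in enumerate(pooled):
        best_label, best_dist = None, float("inf")
        for j, (y, other_label) in enumerate(pooled):
            if i == j:
                continue  # leave the query point itself out
            dist = sum((a - b) ** 2 for a, b in zip(x, y))
            if dist < best_dist:
                best_dist, best_label = dist, other_label
        correct[label] += int(best_label == label)
    return correct[1] / len(train), correct[0] / len(generated)

rng = random.Random(0)
draw = lambda k: [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(k)]  # toy 2-D data (assumption)
train = draw(200)

print(loo_nn_accuracies(train, draw(200)))    # independent sample from the same distribution
print(loo_nn_accuracies(train, list(train)))  # exact copy of the whole training set
```

With an exact copy, every point's nearest neighbor is its duplicate carrying the opposite label, which is why both accuracies fall to zero in that extreme case.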
The concept of data-copying has also been explored by Kilian (where it is called ‘memorization’) for a variety of generative models and several of the above two-sample evaluation tests. Their results indicate that out of a variety of popular tests, only the two-sample nearest neighbor test is able to capture instances of extreme data-copying.
gretton_2 explores three-sample testing, but for comparing the performance of different models, not for detecting overfitting. reviewer_paper uses the three-sample test proposed by gretton_2 for detecting data-copying; unlike ours, their test is global in nature.
Finally, other works concurrent with ours have explored parametric approaches to rooting out data-copying. A recent work by GAN_benchmarks suggests that, given a large enough sample from the model, Neural Network Divergences are sensitive to data-copying. In a slightly different vein, a recent work by latent_recovery investigates whether latent-parameter models memorize training data by learning the reverse mapping from image to latent code. The present work departs from those by offering a probabilistically motivated non-parametric test that is entirely model agnostic.
We begin by introducing some notation and formalizing the definitions of overfitting. Let $\mathcal{X}$ denote an instance space in which data points lie, and $P$ an unknown underlying distribution on this space. A training set $T$ is drawn from $P$ and is used to build a generative model $Q$. We then wish to assess whether $Q$ is the result of overfitting: that is, whether $Q$ produces samples that are too close to the training data. To help ascertain this, we are able to draw two additional samples:
A fresh sample of $n$ points from $P$; call this $P_n$.
A sample of $m$ points from $Q$; call this $Q_m$.
2.1 Definitions of Overfitting
We now formalize the notion of data-copying, and illustrate its distinction from other types of overfitting.
Intuitively, data-copying refers to situations where $Q$ is “too close” to the training set $T$; that is, closer to $T$ than the target distribution $P$ happens to be. We make this quantitative by choosing a distance function $d: \mathcal{X} \to \mathbb{R}$ from points in $\mathcal{X}$ to the training set, for instance, $d(x) = \min_{t \in T} \|x - t\|^2$, if $\mathcal{X}$ is a subset of Euclidean space.
Ideally, we desire that $Q$’s expected distance to the training set is the same as that of $P$’s, namely $\mathbb{E}_{X \sim P}[d(X)] = \mathbb{E}_{Y \sim Q}[d(Y)]$. We may rewrite this as follows: given any distribution $D$ over $\mathcal{X}$, define $L(D)$ to be the one-dimensional distribution of $d(X)$ for $X \sim D$. We consider data-copying to have occurred if random draws from $L(P)$ are systematically larger than from $L(Q)$. The above equalized expected distance condition can be rewritten as
$$\mathbb{E}_{A \sim L(P),\, B \sim L(Q)}[B - A] = 0. \quad (2)$$
However, we are less interested in how large the difference is, and more in how often $B$ is larger than $A$. Let
$$\Delta_T(P, Q) = \Pr\left(B > A \mid B \sim L(Q),\, A \sim L(P)\right)$$
where $0 \le \Delta_T(P, Q) \le 1$ represents how ‘far’ $Q$ is from training sample $T$ as compared to true distribution $P$. A more interpretable yet equally meaningful condition is
$$\Delta_T(P, Q) = \frac{1}{2} \quad (3)$$
which guarantees (2) if densities $L(P)$ and $L(Q)$ have the same shape, but could plausibly be mean-shifted.
If $\Delta_T(P, Q) \ll \frac{1}{2}$, $Q$ is data-copying training set $T$, since samples from $Q$ are systematically closer to $T$ than are samples from $P$. However, even if $\Delta_T(P, Q) \ge \frac{1}{2}$, $Q$ may still be data-copying. As exhibited in Figures 1(b) and 1(c), a model $Q$ may data-copy in one region and underfit in others. In this case, $Q$ may be further from $T$ than $P$ is globally, but much closer to $T$ locally. As such, we consider $Q$ to be data-copying if it is overfit in a subset $\mathcal{C} \subseteq \mathcal{X}$:
Definition 2.1 (Data-Copying).
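A generative model $Q$ is data-copying training set $T$ if, in some region $\mathcal{C} \subseteq \mathcal{X}$, it is systematically closer to $T$ by distance metric $d$ than are samples from $P$. Specifically, if
$$\Delta_T\left(P|_{\mathcal{C}},\, Q|_{\mathcal{C}}\right) < \frac{1}{2}$$
where $P|_{\mathcal{C}}$ and $Q|_{\mathcal{C}}$ are the conditional distributions of $P$ and $Q$ restricted to $\mathcal{C}$.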
Observe that data-copying is orthogonal to the type of overfitting addressed by many previous works (heusel; mehdi), which we call ‘over-representation’. There, $Q$ overemphasizes some region of the instance space $\mathcal{C} \subseteq \mathcal{X}$, often a region of high density in the training set $T$. For the sake of completeness, we provide a formal definition below.
Definition 2.2 (Over-Representation).
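A generative model $Q$ is over-representing $P$ in some region $\mathcal{C} \subseteq \mathcal{X}$, if the probability of drawing $Y \in \mathcal{C}$, $Y \sim Q$, is much greater than it is of drawing $X \in \mathcal{C}$, $X \sim P$. Specifically, if
$$Q(\mathcal{C}) - P(\mathcal{C}) \gg 0$$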
Observe that it is possible to over-represent without data-copying and vice versa. For example, if $P$ is an equally weighted mixture of two Gaussians, and $Q$ perfectly models one of them, then $Q$ is over-representing without data-copying. On the other hand, if $Q$ outputs a bootstrap sample of the training set $T$, then it is data-copying without over-representing. The focus of the rest of this work is on data-copying.
3 A Test For Data-Copying
Having provided a formal definition, we next propose a hypothesis test to detect data-copying.
3.1 A Global Test
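We introduce our data-copying test in the global setting, when $\mathcal{C} = \mathcal{X}$. Our null hypothesis $H_0$ suggests that $Q$ may equal $P$:
$$H_0: \Delta_T(P, Q) = \frac{1}{2}$$
There are well-established non-parametric tests for this hypothesis, such as the Mann-Whitney $U$ test (mannwhitney). Let $L(P_n)$, $L(Q_m)$ be samples of $L(P)$, $L(Q)$ given by $P_n$, $Q_m$ and their distances $d(x)$ to training set $T$. The $U$ statistic estimates the probability in Equation 3 by measuring the number of all $mn$ pairwise comparisons in which $B > A$. An efficient and simple method to gather and interpret this test is as follows:
Sort the $n + m$ values $L(P_n) \cup L(Q_m)$ such that each instance $l_i \in L(P_n)$, $l_j \in L(Q_m)$ has rank $R(l_i)$, $R(l_j)$, starting from rank 1, and ending with rank $n + m$. $L(P_n)$, $L(Q_m)$ have no tied ranks with probability 1 assuming their distributions are continuous.
Calculate the rank-sum for $L(Q_m)$, denoted $R_{Q_m}$, and its $U$ score, denoted $U_{Q_m}$:
$$R_{Q_m} = \sum_{l_j \in L(Q_m)} R(l_j), \qquad U_{Q_m} = R_{Q_m} - \frac{m(m+1)}{2}$$
Under $H_0$, $U_{Q_m}$ is approximately normally distributed with $> 20$ samples in both $L(Q_m)$ and $L(P_n)$, allowing for the following $z$-scored statistic
$$Z_U(L(P_n), L(Q_m); T) = \frac{U_{Q_m} - \mu_U}{\sigma_U}, \qquad \mu_U = \frac{mn}{2}, \qquad \sigma_U = \sqrt{\frac{mn(m + n + 1)}{12}}$$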
$Z_U$ provides us a data-copying statistic with normalized expectation and variance under $H_0$. $Z_U \ll 0$ implies data-copying, $Z_U \gg 0$ implies underfitting. $Z_U < -5$ implies that if $H_0$ holds, $Z_U$ is as likely as sampling a value $< -5$ from a standard normal.
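The rank-sum recipe above can be sketched in a few lines. This is a minimal illustration assuming squared Euclidean nearest-neighbor distances, no tied distances, and toy 2-D Gaussian data; the names `nn_sq_dist` and `z_u` are hypothetical. It uses the standard Mann-Whitney mean $mn/2$ and standard deviation $\sqrt{mn(m+n+1)/12}$ for the $z$-score.

```python
import math
import random

def nn_sq_dist(x, train):
    """Squared Euclidean distance from x to its nearest training point."""
    return min(sum((a - b) ** 2 for a, b in zip(x, t)) for t in train)

def z_u(test_dists, gen_dists):
    """z-scored Mann-Whitney U of the generated distances (assumes no ties).

    Negative values mean generated points sit closer to the training set
    than held-out test points do."""
    n, m = len(test_dists), len(gen_dists)
    pooled = sorted([(d, 0) for d in test_dists] + [(d, 1) for d in gen_dists])
    # rank sum of the generated distances, ranks starting at 1
    rank_sum = sum(rank for rank, (_, is_gen) in enumerate(pooled, start=1) if is_gen)
    u = rank_sum - m * (m + 1) / 2
    mean_u = m * n / 2
    std_u = math.sqrt(m * n * (m + n + 1) / 12)
    return (u - mean_u) / std_u

rng = random.Random(0)
draw = lambda k: [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(k)]
train, test = draw(200), draw(200)
# a "generator" that returns training points with tiny jitter (explicit data-copying)
copies = [(a + rng.gauss(0, 1e-3), b + rng.gauss(0, 1e-3)) for a, b in train]

test_d = [nn_sq_dist(x, train) for x in test]
print(z_u(test_d, [nn_sq_dist(x, train) for x in draw(200)]))  # fresh draws from the true distribution
print(z_u(test_d, [nn_sq_dist(x, train) for x in copies]))     # near-copies of the training set
```

The brute-force nearest-neighbor search is quadratic; for larger samples a spatial index would be the natural substitute.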
Observe that this test is completely model agnostic and uses no estimate of likelihood. It only requires a meaningful distance metric, which is becoming common practice in the evaluation of mode-collapse and -dropping (heusel; mehdi) as well.
3.2 Handling Heterogeneity
As described in Section 2.1, the above global test can be fooled by generators which are very close to the training data in some regions of the instance space (overfitting) but very far from the training data in others (poor modeling).
We handle this by introducing a local version of our test. Let $\Pi$ denote any partition of the instance space $\mathcal{X}$, which can be constructed in any manner. In our experiments, for instance, we run the $k$-means algorithm on $T$, so that $|\Pi| = k$. As the number of training and test samples grows, we may increase $k$ and thus the instance-space resolution of our test. Letting $L_\pi(D)$ be the distribution of distances-to-training-set within cell $\pi \in \Pi$, we probe each cell of the partition $\Pi$ individually.
To offer a summary statistic for data copying, we collect the $z$-scored Mann-Whitney statistic, $Z_U$, described in Section 3.1 in each cell $\pi \in \Pi$. Let $P_n(\pi) = |\{x \in P_n : x \in \pi\}| / n$ denote the fraction of $P_n$ points lying in cell $\pi$, and similarly for $Q_m(\pi)$. The $Z_U$ test for cell $\pi$ and training set $T$ will then be denoted as $Z_U(L_\pi(P_n), L_\pi(Q_m); T)$, where $L_\pi(P_n) = \{d(x) : x \in P_n,\, x \in \pi\}$ and similarly for $L_\pi(Q_m)$. See Figure 1(c) for examples of these in-cell scores. For stability, we only measure data-copying for those cells significantly represented by $Q$, as determined by a threshold $\tau$. Let $\Pi_\tau$ be the set of all cells in the partition $\Pi$ for which $Q_m(\pi) \ge \tau$. Then, our summary statistic for data copying averages across all cells represented by $Q$:
$$C_T(P_n, Q_m) = \frac{\sum_{\pi \in \Pi_\tau} P_n(\pi)\, Z_U(L_\pi(P_n), L_\pi(Q_m); T)}{\sum_{\pi \in \Pi_\tau} P_n(\pi)}$$
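A sketch of the cell-wise summary statistic follows. Several choices here are assumptions made for illustration: cells are defined by nearest fixed centroid (standing in for the $k$-means partition of the training set), distances are taken to the whole training set, and each retained cell's $z$-score is weighted by its share of test points. All function names are hypothetical.

```python
import math
import random

def nn_sq_dist(x, train):
    return min(sum((a - b) ** 2 for a, b in zip(x, t)) for t in train)

def z_u(test_dists, gen_dists):
    """z-scored Mann-Whitney U of the generated distances (assumes no ties)."""
    n, m = len(test_dists), len(gen_dists)
    pooled = sorted([(d, 0) for d in test_dists] + [(d, 1) for d in gen_dists])
    rank_sum = sum(r for r, (_, is_gen) in enumerate(pooled, start=1) if is_gen)
    u = rank_sum - m * (m + 1) / 2
    return (u - m * n / 2) / math.sqrt(m * n * (m + n + 1) / 12)

def nearest_cell(x, centroids):
    return min(range(len(centroids)),
               key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))

def c_t(train, test, gen, centroids, tau):
    """Test-weighted average of per-cell z-scores over cells holding >= tau of the generated sample."""
    num = den = 0.0
    for c in range(len(centroids)):
        test_d = [nn_sq_dist(x, train) for x in test if nearest_cell(x, centroids) == c]
        gen_d = [nn_sq_dist(x, train) for x in gen if nearest_cell(x, centroids) == c]
        if len(gen_d) / len(gen) < tau or not test_d:
            continue  # cell is under-populated by the generator (or has no test points)
        weight = len(test_d) / len(test)
        num += weight * z_u(test_d, gen_d)
        den += weight
    return num / den if den else float("nan")

rng = random.Random(0)
blob = lambda center, k: [(center + rng.gauss(0, 1), center + rng.gauss(0, 1)) for _ in range(k)]
centroids = [(0.0, 0.0), (10.0, 10.0)]           # fixed cell centers (assumption)
train = blob(0, 100) + blob(10, 100)
test = blob(0, 100) + blob(10, 100)
gen = [(a + rng.gauss(0, 1e-3), b + rng.gauss(0, 1e-3)) for a, b in train]  # copies everywhere
print(c_t(train, test, gen, centroids, tau=20 / len(gen)))
```

Keeping the per-cell scores (rather than only their average) is what allows a later inspection of which cells are copied and which are underfit.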
The above test will not catch a model that heavily over- or under-represents cells. For completeness, we next provide a simple representation test that is essentially the one used by richardson, now with an independent test set instead of the training set.
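With $n, m \ge 20$ in cell $\pi$, we may treat $Q_m(\pi)$, $P_n(\pi)$ as Gaussian random variables. We then check the null hypothesis $H_0: 0 = P(\pi) - Q(\pi)$. Assuming this null hypothesis, a simple $z$-test is:
$$Z_\pi = \frac{Q_m(\pi) - P_n(\pi)}{\sqrt{\hat{p}_\pi (1 - \hat{p}_\pi)\left(\frac{1}{n} + \frac{1}{m}\right)}}$$
where $\hat{p}_\pi = \frac{n P_n(\pi) + m Q_m(\pi)}{n + m}$. We then report two values for a significance level $s$: the number of significantly different cells (‘bins’) with $Z_\pi > s$ (NDB over-representing), and the number with $Z_\pi < -s$ (NDB under-representing).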
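The representation check can be sketched as a standard pooled two-proportion $z$-test per cell. The function names and the cut-off of 1.96 (the usual two-sided 5% critical value) are assumptions chosen for illustration, not values taken from the text.

```python
import math

def representation_z(test_count, n, gen_count, m):
    """Pooled two-proportion z-score; positive when the generator over-represents the cell."""
    p_frac, q_frac = test_count / n, gen_count / m
    pooled = (test_count + gen_count) / (n + m)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n + 1 / m))
    return 0.0 if std_err == 0 else (q_frac - p_frac) / std_err

def ndb_counts(test_counts, gen_counts, n, m, threshold):
    """Number of cells that are significantly over- and under-represented."""
    scores = [representation_z(t, n, g, m) for t, g in zip(test_counts, gen_counts)]
    over = sum(z > threshold for z in scores)
    under = sum(z < -threshold for z in scores)
    return over, under

# three cells: matched, under-represented, over-represented (toy counts)
print(ndb_counts([50, 30, 20], [50, 10, 40], n=100, m=100, threshold=1.96))
```

Because it only compares cell occupancy, this check is blind to how close the generated points are to the training set, which is why it complements rather than replaces the distance-based statistic.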
Together, these summary statistics ($C_T$, NDB-over, NDB-under) detect the ways in which $Q$ broadly represents $P$ without directly copying the training set $T$.
3.3 Performance Guarantees
We next provide some simple guarantees on the performance of the global test statistic. Guarantees for the average test are more complicated, and are left as a direction for future work.
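We begin by showing that when the null hypothesis $H_0$ does not hold, $Z_U$ has some desirable properties: the normalized statistic $\frac{1}{mn} U_{Q_m}$ is a consistent estimator of the quantity of interest, $\Delta_T(P, Q)$:
Theorem 1. For true distribution $P$, model distribution $Q$, and distance metric $d$, the estimator $\frac{1}{mn} U_{Q_m} \to_P \Delta_T(P, Q)$ according to the concentration inequality
$$\Pr\left(\left|\frac{1}{mn} U_{Q_m} - \Delta_T(P, Q)\right| \ge t\right) \le 2\exp\left(-\frac{2 t^2 mn}{m + n}\right)$$
Furthermore, when the model distribution $Q$ actually matches the true distribution $P$, under modest assumptions we can expect $Z_U$ to be near $0$:
Theorem 2. If $Q = P$, and the corresponding distance distribution $L(Q) = L(P)$ is non-atomic, then
$$\mathbb{E}\left[\frac{1}{mn} U_{Q_m}\right] = \frac{1}{2} \quad \text{and} \quad \mathbb{E}[Z_U] = 0$$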
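A short symmetry argument shows why the expectation is exactly one half when the model matches the true distribution. It is a sketch under the stated assumptions only, writing $l_i$ for test-sample distances, $l_j$ for generated-sample distances, $U$ for the count of pairs with $l_j > l_i$ (the standard pairwise form of the Mann-Whitney statistic), and $\sigma_U$ for its null standard deviation:

```latex
\begin{align*}
\mathbb{E}\Big[\tfrac{1}{mn}\, U\Big]
  &= \tfrac{1}{mn} \sum_{i=1}^{n} \sum_{j=1}^{m} \Pr(l_j > l_i)
   = \Pr(B > A), \qquad A, B \text{ i.i.d. from the common distance distribution}, \\
1 &= \Pr(B > A) + \Pr(A > B) + \Pr(A = B)
   = 2 \Pr(B > A) + 0
   \quad \text{(exchangeability; no atoms, hence no ties)}, \\
\Rightarrow\quad \mathbb{E}\Big[\tfrac{1}{mn}\, U\Big] &= \tfrac{1}{2},
   \qquad
   \mathbb{E}[Z_U] = \frac{\mathbb{E}[U] - mn/2}{\sigma_U} = 0 .
\end{align*}
```

The non-atomic assumption is what rules out ties; with atoms, $\Pr(A = B) > 0$ and the expectation would fall below one half.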
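Additionally, we show that for a Gaussian Kernel Density Estimator, the parameter $\sigma$ that satisfies the condition in Equation 2 is the $\sigma$ corresponding to a maximum likelihood Gaussian KDE model. Recall that a KDE model is described by
$$q_\sigma(x) = \frac{1}{(2\pi)^{k/2} \sigma^k |T|} \sum_{t \in T} \exp\left(-\frac{\|x - t\|^2}{2\sigma^2}\right), \quad (4)$$
where the posterior probability that a random draw $x \sim q_\sigma$ comes from the Gaussian component centered at training point $t$ is
$$Q_\sigma(t \mid x) = \frac{\exp\left(-\|x - t\|^2 / (2\sigma^2)\right)}{\sum_{t' \in T} \exp\left(-\|x - t'\|^2 / (2\sigma^2)\right)}$$
Lemma 3. For the kernel density estimator (4), the maximum-likelihood choice of $\sigma$, namely the maximizer of $\mathbb{E}_{X \sim P}[\log q_\sigma(X)]$, satisfies
$$\mathbb{E}_{X \sim P}\left[\sum_{t \in T} Q_\sigma(t \mid X)\, \|X - t\|^2\right] = \mathbb{E}_{Y \sim Q_\sigma}\left[\sum_{t \in T} Q_\sigma(t \mid Y)\, \|Y - t\|^2\right]$$
See Appendix 7.3 for proof. Unless $\sigma$ is large, we know that for any given $X$, $\sum_{t \in T} Q_\sigma(t \mid X)\, \|X - t\|^2 \approx d(X) = \min_{t \in T} \|X - t\|^2$. So, enforcing that $\mathbb{E}_{X \sim P}[d(X)] = \mathbb{E}_{Y \sim Q}[d(Y)]$, and more loosely that $\Delta_T(P, Q) = \frac{1}{2}$, provides an excellent non-parametric approach to selecting a Gaussian KDE, and ought to be enforced for any $Q$ attempting to emulate $P$; after all, Theorem 2 points out that effectively any model with $Q = P$ also yields this condition.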
Having clarified what we mean by data-copying in theory, we turn our attention to data-copying by generative models in practice (code: https://github.com/casey-meehan/data-copying). We leave representation test results for the appendix, since this behavior has been well studied in previous works. Specifically, we aim to answer the two following questions:
Are the existing tests that measure generative model overfitting able to capture data-copying?
As popular generative models range from over- to underfitting, does our test indicate data-copying, and if so, to what degree?
Training, Generated and Test Sets.
In all of the following experiments, we select a training dataset $T$ with test split $P_n$, and a generative model $Q$ producing a sample $Q_m$. We perform $k$-means on $T$ to determine partition $\Pi$, with the objective of having a reasonable population of both $P_n$ and $Q_m$ in each $\pi \in \Pi$. We set threshold $\tau = 20/m$, such that we are guaranteed to have at least 20 samples in each cell in order to validate the Gaussian assumption of $Z_U$.
4.1 Detecting data-copying
First, we investigate which of the existing generative model tests can detect explicit data-copying.
Dataset and Baselines.
For this experiment, we use the simple two-dimensional ‘moons’ dataset, as it affords us limitless training and test samples and requires no feature embedding (see Appendix 7.4.1 for an example).
As baselines, we probe four of the methods described in our Related Work section to see how they react to data-copying: two-sample NN (lopez), FID (heusel), Binning-Based Evaluation (richardson), and Precision & Recall (mehdi). A detailed description of the methods is provided in Appendix 7.4.2. Note that, without an embedding, FID is simply the Fréchet distance between two maximum likelihood normal distributions fit to $T$ and $Q_m$. We use the same size generated and training sample for all methods. Note that the two-sample NN test requires the generated sample size $m$ to be equal to the training sample size $|T|$. When $m < |T|$ (especially for large datasets and computationally burdensome samplers) we use an $m$-size training subsample $\tilde{T} \subseteq T$.
We choose as our generative model a Gaussian KDE as it allows us to force explicit data-copying by setting $\sigma$ very low. As $\sigma \to 0$, $Q$ becomes a bootstrap sampler of the original training set. If a given test method can detect the level of data-copying by $Q$ on $T$, it will provide a different response to a heavily over-fit KDE ($\sigma \ll \sigma_{\mathrm{MLE}}$), a well-fit KDE ($\sigma \approx \sigma_{\mathrm{MLE}}$), and an underfit KDE ($\sigma \gg \sigma_{\mathrm{MLE}}$).
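Sampling from a Gaussian KDE amounts to picking a training point uniformly at random and adding isotropic Gaussian noise, which makes the bootstrap limit easy to see. The sketch below assumes toy 2-D data and a hypothetical helper `sample_kde`; with zero width every generated point is an exact training point.

```python
import random

def sample_kde(train, sigma, m, rng):
    """Draw m points: choose a training point uniformly, then add N(0, sigma^2 I) noise."""
    samples = []
    for _ in range(m):
        center = rng.choice(train)
        samples.append(tuple(c + rng.gauss(0, sigma) for c in center))
    return samples

rng = random.Random(0)
train = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]  # toy training set (assumption)

for sigma in (0.0, 0.1, 1.0):
    generated = sample_kde(train, sigma, 100, rng)
    exact_copies = sum(g in train for g in generated)
    print(f"sigma={sigma}: {exact_copies} of {len(generated)} generated points are exact training points")
```

Sweeping the width upward from zero therefore moves the sampler continuously from pure copying toward an over-smoothed, underfit model.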
Figure 2 depicts how each baseline method responds to KDE models of varying degrees of data-copying, as $\sigma$ ranges from data-copying (small $\sigma$) up to heavily underfit (large $\sigma$). The Fréchet and Binning methods report effectively the same value for all $\sigma$, indicating inability to detect data-copying. Similarly, the Precision-Recall curves for different $\sigma$ values are nearly identical across small $\sigma$, and only change for large $\sigma$.
The two-sample NN test does show a mild change in response as $\sigma$ decreases below $\sigma_{\mathrm{MLE}}$. This makes sense; as points in $Q_m$ become closer to points in $T$, the two-sample NN accuracy should steadily decline. The primary reason it does not drop to zero is the subsampled training set, $\tilde{T}$, needed to perform this test. As such, each training point $t \in T$ being copied by a generated point $q \in Q_m$ is unlikely to be present in $\tilde{T}$ during the test. This phenomenon is especially pronounced in some of the following settings. However, even when $m = |T|$, this test will not reduce to zero as $\sigma \to 0$, due to the well-known result that a bootstrap sample of $T$ will only include a fraction $\approx 1 - 1/e$ of the samples in $T$. Consequently, several training samples will not have a generated sample as nearest neighbor. The $C_T$ test avoids this by specifically finding the training nearest neighbor of each generated sample.
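The bootstrap fact invoked above is easy to verify: drawing $N$ items with replacement from $N$ distinct items leaves each item out with probability $(1 - 1/N)^N \approx 1/e$, so roughly a $1 - 1/e \approx 0.632$ fraction is covered. A minimal simulation, with the hypothetical helper name `bootstrap_coverage`:

```python
import math
import random

def bootstrap_coverage(n, rng):
    """Fraction of n distinct items that appear at least once in a size-n bootstrap sample."""
    drawn = {rng.randrange(n) for _ in range(n)}
    return len(drawn) / n

rng = random.Random(0)
print(bootstrap_coverage(100_000, rng), 1 - 1 / math.e)
```

The uncovered remainder, about 37% of training points, has no exact copy among the generated points, which keeps the leave-one-out accuracy away from zero even for a perfect bootstrap sampler.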
Most of these tests fail to detect data-copying because most existing methods focus on another type of overfitting: mode-collapse and -dropping, wherein entire modes of $P$ are either forgotten or averaged together. However, if a model begins to data-copy, it is definitively overfitting without mode-collapsing.
Note that the above four baselines are all two-sample tests that do not use $P_n$ as $C_T$ does. For completeness, we present experiments with an additional, three-sample baseline in Appendix 7.5. Here, we repeat the ‘moons’ dataset experiment with the three-sample kernel MMD test originally proposed by gretton_2 for generative model selection and later adapted by reviewer_paper for testing model over-fitting. We observe in Figure 10(b) that the three-sample kMMD test does not detect data-copying, treating the MLE model similarly to overfit models with $\sigma \ll \sigma_{\mathrm{MLE}}$. See Appendix 7.5 for experimental details.
Figures (b) and (d) show the $C_T$ statistic identifying data-copying in an MNIST VAE and ImageNet GAN as they range from heavily over-fit to underfit. Figures (c) and (e) demonstrate the relative insensitivity of the NN baseline to this overfitting, as does figure (a) of the generalization (ELBO) gap method for VAEs. (Note: the markers for (d) apply to the traces of (e).)
4.2 Measuring degree of data-copying
We now aim to answer the second question raised at the beginning of this section: does our test statistic detect and quantify data-copying?
We focus on three generative models: Gaussian KDEs, Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs). For these experiments, we consider two baselines in addition to our method — the two-sample NN test, and the likelihood generalization gap where it can be computed or approximated.
4.2.1 KDE-based tests
First we consider Gaussian KDEs. While KDEs do not provide a reliable likelihood in high dimension (theis), they do have several advantages as a preliminary benchmark: they allow us to directly force data-copying, and can help investigate the practical implications of the theoretical connection between the maximum likelihood KDE and $C_T$ as described in Lemma 3. We explore two datasets with Gaussian KDE: the ‘moons’ dataset, and MNIST.
KDEs: ‘moons’ dataset.
Results. Figure 2(a) depicts how the generalization gap dwindles as KDE $\sigma$ increases. While this test is capable of capturing data-copying, it is insensitive to underfitting and relies on a tractable likelihood. Figures 2(b) and 2(c) give a side-by-side depiction of $C_T$ and the two-sample NN test accuracies across a range of KDE $\sigma$ values. Think of $C_T$ values as $z$-score standard deviations. We see that the $C_T$ statistic in Figure 2(b) precisely identifies the MLE model when $C_T \approx 0$, and responds sharply to $\sigma$ values above and below $\sigma_{\mathrm{MLE}}$. The baseline in Figure 2(c) similarly identifies the MLE model when training accuracy $\approx 0.5$, but has higher variance and is less sensitive to changes in $\sigma$, especially for over-fit $\sigma \ll \sigma_{\mathrm{MLE}}$. We will see in the next experiment that this test breaks down for more complex datasets when $m \ll |T|$.
KDEs: MNIST Handwritten Digits.
We now extend the KDE test performed on the moons dataset to the significantly more complex MNIST handwritten digit dataset (lecun).
While it would be convenient to directly apply the KDE $\sigma$-sweeping tests discussed in the previous section, there are two primary barriers. The first is that the KDE model relies on $\ell_2$ norms being perceptually meaningful, which is well understood not to be true in pixel space. The second problem is that of dimensionality: the 784-dimensional space of digits is far too high for a KDE to be even remotely efficient at interpolating the space.
To handle these issues, we first embed each image, $x \in \mathcal{X}$, to a perceptually meaningful 64-dimensional latent code, $z$. We achieve this by training a convolutional autoencoder with a VGGnet perceptual loss produced by zhang (see Appendix 7.4.3 for more detail). Surely, even in the lower 64-dimensional space, the KDE will suffer somewhat from the curse of dimensionality. We are not promoting this method as a powerful generative model, but rather as an instructive tool for probing a test’s response to data-copying in the image domain. All tests are run in the compressed latent space; Appendix 7.4.3 provides more experimental details.
As discussed briefly in Section 4.1, a limitation of the two-sample NN test is that it requires $m = |T|$. For a large training set like MNIST, it is computationally challenging to generate $|T|$ samples, even with a 64-dimensional KDE. We therefore use a subsampled training set $\tilde{T}$ of size $m$ when running the two-sample NN test. The proposed $C_T$ test has no such restriction on the size of $T$ and $Q_m$.
Results. The likelihood generalization gap is depicted in Figure 2(d) repeating the trend seen with the ‘moons’ dataset.
Figure 2(e) shows how $C_T$ reacts decisively to over- and underfitting. It falsely determines the MLE $\sigma$ value as slightly over-fit. However, the region of $\sigma$ where $C_T$ transitions from over- to underfit is relatively tight and includes the MLE $\sigma$.
Meanwhile, Figure 2(f) shows how, with the generated sample smaller than the training sample ($m < |T|$), the two-sample NN baseline provides no meaningful estimate of data-copying. In fact, the most data-copying models with low $\sigma$ achieve the best scores, closest to $0.5$. Again, we are forced to use the $m$-subsampled $\tilde{T}$, and most instances of data-copying are completely missed.
These results are promising, and demonstrate the reliability of this hypothesis testing approach to probing for data-copying across different data domains. In the next section, we explore how these tests perform on more sophisticated, non-KDE models.
4.2.2 Variational Autoencoders
Gaussian KDEs may have nice theoretical properties, but are relatively ineffective in high-dimensional settings, precluding domains like images. As such, we also demonstrate our experiments on more practical neural models trained on higher dimensional image datasets (MNIST and ImageNet), with the goal of observing whether the $C_T$ statistic indicates data-copying as these models range from over- to underfit. The first neural model we consider is a Variational Autoencoder (VAE) trained on the MNIST handwritten images dataset.
Unlike KDEs, VAEs do not have a single parameter that controls the degree of overfitting. Instead, similar to VAEs_overfit, we vary model complexity by increasing the width (neurons per layer) in a three-layer VAE (see Appendix 7.4.3 for details), where higher width means a model of higher complexity. As an embedding, we pass all samples through the convolutional autoencoder of Section 4.2.1, and collect statistics in this 64-dimensional space. Observe that likelihood is not available for VAEs; instead we compute each model’s ELBO on a held-out validation set, and use the ELBO approximation to the generalization gap.
We again note here that, for the NN accuracy baseline, we use a subsampled training set $\tilde{T}$, as with the KDE-based MNIST tests where $m < |T|$.
Figures 3(b) and 3(c) compare the $C_T$ statistic to the NN accuracy baseline. $C_T$ behaves as it did in the previous sections: more complex models over-fit, forcing $C_T \ll 0$, and less complex models underfit, forcing it $\gg 0$. We note that the range of $C_T$ values is far less dramatic, which is to be expected since the KDEs were forced to explicitly data-copy. We observe that the ELBO spikes for models with $C_T$ near 0. Figure 3(a) shows the ELBO approximation of the generalization gap as the latent dimension (and number of units in each layer) is decreased. This method is entirely insensitive to over- and underfit models. This may be because the ELBO is only a lower bound, not the actual likelihood.
The NN baseline in Figure 3(c) is less interpretable, and fails to capture the overfitting trend as $C_T$ does. While all three test accuracies still follow the upward-sloping trend of Figure 2(c), they do not indicate where the highest validation set ELBO is. Furthermore, the NN accuracy statistics are shifted upward when compared to the results of the previous section: all NN accuracies are above 0.5 for all latent dimensions. This is problematic. A test statistic’s absolute score ought to bear significance between very different data and model domains like KDEs and VAEs.
4.2.3 ImageNet GAN
Finally, we scale our experiments up to a larger image domain.
We gather our test statistics on a state of the art conditional GAN, ‘BigGan’ (BigGan), trained on the Imagenet 12 dataset (imagenet12). Conditioning on an input code, this GAN will generate one of 1000 different Imagenet classes. We run our experiments separately on three classes: ‘coffee’, ‘soap bubble’, and ‘schooner’. All generated, test, and training images are embedded to a 64-dimensional space by first gathering the 2048-dimensional features of an InceptionV3 network ‘Pool3’ layer, and then projecting them onto the 64 principal components of the training embeddings. Appendix 7.4.4 has more details.
Being limited to one pre-trained model, we increase model variance (‘truncation threshold’) instead of decreasing model complexity. As proposed by BigGan’s authors, all standard normal input samples outside of this truncation threshold are resampled. The authors suggest that lower truncation thresholds, by only producing samples at the mode of the input, output higher quality samples at the cost of variety, as determined by Inception Score (IS). Similarly, the FID score finds suitable variety until truncation approaches zero.
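The resampling rule described above (redraw any standard normal input coordinate that falls outside the truncation threshold) can be sketched as follows. This is only an illustration of the rule as stated in the text; the function name and the code dimension are hypothetical, and it is not BigGan’s actual sampling code.

```python
import random

def truncated_normal_code(dim, threshold, rng):
    """Sample a latent code whose coordinates are standard normal, redrawn until |z| <= threshold."""
    code = []
    for _ in range(dim):
        z = rng.gauss(0, 1)
        while abs(z) > threshold:
            z = rng.gauss(0, 1)  # resample coordinates outside the truncation threshold
        code.append(z)
    return code

rng = random.Random(0)
for threshold in (0.5, 1.0, 2.0):
    code = truncated_normal_code(128, threshold, rng)
    print(threshold, max(abs(z) for z in code))
```

Smaller thresholds concentrate the inputs near the mode of the prior, which is the mechanism by which truncation trades sample variety for sample quality.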
The results for the $C_T$ score are depicted in Figure 3(d); the statistic remains well below zero until the truncation threshold is nearly maximized, indicating that $Q$ produces samples closer to the training set than real samples tend to be. While FID finds that in aggregate the distributions are roughly similar, a closer look suggests that $Q$ allocates too much probability mass near the training samples.
Meanwhile, the two-sample NN baseline in Figure 3(e) hardly reacts to changes in truncation, even though the generated and training sets are the same size, $m = |T|$. Across all truncation values, the training sample NN accuracy remains around 0.5, not quite implying over- or underfitting.
A useful feature of the $C_T$ statistic is that one can examine the $Z_U$ scores it is composed of to see which of the cells are or are not copying. Figure 5 shows the samples of over- and underfit clusters for two of the three classes. For both ‘coffee’ and ‘bubble’ classes, the underfit cells are more diverse than the data-copied cells. While it might seem reasonable that these generated samples are further from nearest neighbors in more diverse clusters, keep in mind that the $Z_U$ statistic indicates that they are further from training neighbors than test set samples are. For instance, the people depicted in the underfit ‘bubbles’ cell are highly distorted.
We now reflect on the two questions recited at the beginning of Section 4. Firstly, it appears that many existing generative model tests do not detect data-copying. The findings of Section 4.1 demonstrate that many popular generative model tests like FID, Precision and Recall, and Binning-Based Evaluation are wholly insensitive to explicit data-copying even in low-dimensional settings. We suggest that this is because these tests are geared to detect over- and underrepresentation more than data-copying.
Secondly, the experiments of Section 4.2 indicate that the proposed test statistic $C_T$ not only detects explicitly forced data-copying (as in the KDE experiments), but also detects data-copying in complex, overfit generative models like VAEs and GANs. In these settings, we observe that as models overfit more, $C_T$ drops below 0 and significantly below -1.
A limitation of the proposed test is the number of test samples it requires. Without sufficient test samples, not only does the statistic have higher variance, but the instance space partition cannot be very fine-grained. Consequently, we are limited in our ability to manage the heterogeneity of the model across the instance space, and some cells may be mischaracterized. For example, in the BigGan experiment of Section 4.2.3, we are provided only 50 test samples per image class (e.g. ‘soap bubble’), limiting us to an instance space partition of only three cells. Developing more data-efficient methods to handle heterogeneity may be a promising area of future work.
In this work, we have formalized data-copying: an under-explored failure mode of generative model overfitting. We have provided preliminary tests for measuring data-copying and experiments indicating its presence in a broad class of generative models. In future work, we plan to establish more theoretical properties of data-copying, convergence guarantees of these tests, and experiments with different model parameters.
We thank Rich Zemel for pointing us to Kilian, which was the starting point of this work. Thanks to Arthur Gretton and Ruslan Salakhutdinov for pointers to prior work, and Philip Isola and Christian Szegedy for helpful advice. Finally, KC and CM would like to thank ONR under N00014-16-1-2614, UC Lab Fees under LFR 18-548554 and NSF IIS 1617157 for research support.
7.1 Proof of Theorem 1
A restatement of the theorem:
For true distribution , model distribution , and distance metric , the estimator according to the concentration inequality
We establish consistency using the following nifty lemma
(Bounded Differences Inequality) Suppose are independent, and . Let satisfy
for . Then we have for any
This directly equips us to prove the Theorem.
It is relatively straightforward to apply Lemma 4 to the normalized . First, think of it as a function of independent samples of and independent samples of ,
Let bound the change in after substituting any with , and bound the change in after substituting any with . Specifically
We then know that for all , with equality when for all . In this case, substituting with flips of the indicator comparisons in from 1 to 0, and is then normalized by . By a similar argument, for all .
Equipped with and , we may simply substitute into Equation 5 of the Bounded Differences Inequality, giving us
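Written out, the substitution is a short computation. The following is a sketch under assumed notation: $n$ and $m$ are the two sample sizes, and $b_i$, $c_j$ are the per-sample bounded differences of the normalized statistic described above. Only the exponent of the bound is shown.

```latex
\begin{align*}
b_i &\le \frac{m}{mn} = \frac{1}{n}, \qquad c_j \le \frac{n}{mn} = \frac{1}{m}, \\
\sum_{i=1}^{n} b_i^2 + \sum_{j=1}^{m} c_j^2 &\le \frac{1}{n} + \frac{1}{m} = \frac{m+n}{mn}, \\
\exp\!\left(\frac{-2t^2}{\sum_i b_i^2 + \sum_j c_j^2}\right) &\le \exp\!\left(\frac{-2t^2\, mn}{m+n}\right).
\end{align*}
```

The bound tightens as both sample sizes grow, and is dominated by the smaller of the two.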
7.2 Proof of Theorem 2
A restatement of the theorem:
When , and the corresponding distance distribution is non-atomic,
For random variables and , we can partition the event space of into three disjoint events:
Since , the first two events have equal probability, , so
And since the distributions of and are non-atomic (i.e. ) we have that , and thus
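The symmetry argument can be checked numerically. The sketch below is a minimal simulation, not the experimental code: `dist_to_train` and `delta_hat` are hypothetical helper names, the Gaussian data and sample sizes are illustrative, and `delta_hat` is assumed to be the fraction of (test, generated) pairs in which the generated point lies farther from the training set. When both samples come from the same continuous distribution, the estimate should hover around 1/2.

```python
import numpy as np

def dist_to_train(points, train):
    """Euclidean distance from each point to its nearest training point."""
    # Pairwise squared distances, shape (len(points), len(train)).
    d2 = ((points[:, None, :] - train[None, :, :]) ** 2).sum(axis=-1)
    return np.sqrt(d2.min(axis=1))

def delta_hat(test_d, gen_d):
    """Fraction of (test, generated) pairs with the generated distance larger."""
    return float((gen_d[None, :] > test_d[:, None]).mean())

rng = np.random.default_rng(0)
estimates = []
for _ in range(200):
    train = rng.normal(size=(100, 2))
    test = rng.normal(size=(50, 2))  # held-out sample from the true distribution
    gen = rng.normal(size=(50, 2))   # "model" sample drawn from the same distribution
    estimates.append(delta_hat(dist_to_train(test, train), dist_to_train(gen, train)))

mean_estimate = float(np.mean(estimates))
print(f"mean estimate over trials: {mean_estimate:.3f}")
```
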
7.3 Proof of Lemma 3
Setting the derivative of this to zero and simplifying, we find that the maximum-likelihood satisfies
Now, interpreting as a mixture of Gaussians, and using the notation to mean that is chosen uniformly at random from , we have
Combining this with (6) yields the lemma.
7.4 Procedural Details of Experiments
7.4.1 Moons Dataset, and Gaussian KDE
‘Moons’ is a synthetic dataset consisting of two curved interlocking manifolds with added configurable noise. We chose this dataset as a proof of concept because it is low dimensional, and thus KDE friendly and easy to visualize, and because we can draw unlimited training, test, and validation samples.
We use a Gaussian KDE as our preliminary generative model because its likelihood is theoretically related to our non-parametric test. Perhaps more importantly, it is trivial to control the degree of data-copying with the bandwidth parameter. Figures 6(b), 6(c), 6(d) provide contour plots of a Gaussian KDE trained on the moons dataset with progressively larger bandwidth. With the smallest bandwidth, the KDE will effectively resample the training set. The intermediate bandwidth is nearly the MLE model. With the largest bandwidth, the KDE struggles to capture the unique definition of the true distribution.
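Sampling from a Gaussian KDE makes the role of the bandwidth concrete: draw a training point uniformly at random and add isotropic Gaussian noise. The sketch below is illustrative only; `sample_kde`, `mean_nn_distance`, the stand-in Gaussian training data, and the two bandwidth values are assumptions rather than the settings used in the experiments.

```python
import numpy as np

def sample_kde(train, sigma, m, rng):
    """Draw m samples from a Gaussian KDE with bandwidth sigma fit to train."""
    centers = train[rng.integers(0, len(train), size=m)]  # uniform choice of kernel
    return centers + sigma * rng.normal(size=centers.shape)

def mean_nn_distance(points, train):
    """Average distance from each point to its nearest training point."""
    d2 = ((points[:, None, :] - train[None, :, :]) ** 2).sum(axis=-1)
    return float(np.sqrt(d2.min(axis=1)).mean())

rng = np.random.default_rng(0)
train = rng.normal(size=(500, 2))     # stand-in for the moons training set
held_out = rng.normal(size=(300, 2))  # fresh sample from the same distribution

d_test = mean_nn_distance(held_out, train)
d_small = mean_nn_distance(sample_kde(train, 0.001, 300, rng), train)
d_large = mean_nn_distance(sample_kde(train, 1.0, 300, rng), train)
print(d_small, d_test, d_large)
```

With a very small bandwidth the generated points sit much closer to the training set than held-out points do, which is the data-copying regime; with a large bandwidth they sit farther away.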
7.4.2 Moons Experiments
Our experiments that examined whether several baseline tests could detect data-copying (Section 4.1), and the first test of our own metric (Section 4.2.1), use the moons dataset. In both of these, we fix a training sample of 2000 points, a test sample of 1000 points, and a generated sample of 1000 points. We regenerate the generated sample 10 times, and report the average statistic across these trials along with a single standard deviation. If the standard deviation band along a line is not visible, it is because the standard deviation is relatively small. We artificially constrain the generated sample to be smaller than the training sample, as is true for big natural datasets and for more elaborate models that are computationally burdensome to sample from.
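As a rough illustration of this setup, the sketch below computes a global distance-based score on samples of the sizes above. It assumes the score is a z-scored Mann-Whitney U statistic on distances to the nearest training point, with negative values indicating generated samples that are closer to the training set than test samples; `nn_distance`, `z_u`, and the Gaussian stand-in data are hypothetical and not taken from the experimental code.

```python
import numpy as np

def nn_distance(points, train):
    """Euclidean distance from each point to its nearest training point."""
    d2 = ((points[:, None, :] - train[None, :, :]) ** 2).sum(axis=-1)
    return np.sqrt(d2.min(axis=1))

def z_u(test, gen, train):
    """z-scored Mann-Whitney U on distances to the nearest training point."""
    d_test, d_gen = nn_distance(test, train), nn_distance(gen, train)
    n, m = len(d_test), len(d_gen)
    # U counts pairs where the generated distance exceeds the test distance
    # (ties count one half).
    greater = (d_gen[:, None] > d_test[None, :]).sum()
    ties = (d_gen[:, None] == d_test[None, :]).sum()
    u = greater + 0.5 * ties
    return float((u - m * n / 2.0) / np.sqrt(m * n * (m + n + 1) / 12.0))

rng = np.random.default_rng(0)
train = rng.normal(size=(2000, 2))
test = rng.normal(size=(1000, 2))

# A copying "model": training points plus tiny noise.
copied = train[rng.integers(0, len(train), size=1000)] + 0.001 * rng.normal(size=(1000, 2))
# A well-behaved "model": fresh draws from the true distribution.
fresh = rng.normal(size=(1000, 2))

z_copy = z_u(test, copied, train)
z_fresh = z_u(test, fresh, train)
print(z_copy, z_fresh)
```

Repeating the last few lines over several regenerated samples and averaging would mirror the 10-trial protocol described above.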
Section 4.1 Methods
Here are the routines we used for the four baseline tests:
Fréchet Inception Distance (FID) (heusel): Normally, this test is run on two samples of images that are first embedded into a perceptually meaningful latent space using a discriminative neural net, like the Inception Network. By ‘meaningful’ we mean that points closer together are more perceptually alike to the human eye. Unlike images in pixel space, the samples of the moons dataset require no embedding, so we run the Fréchet test directly on the samples.
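First, we fit two MLE Gaussians, one to the training sample and one to the generated sample, by collecting their respective MLE mean and covariance parameters. The statistic reported is the Fréchet distance between these two Gaussians, which for Gaussians has a closed form. Writing $\mu_T, \Sigma_T$ and $\mu_Q, \Sigma_Q$ for the two fitted means and covariances, the squared distance is
$$\|\mu_T - \mu_Q\|_2^2 + \mathrm{Tr}\left(\Sigma_T + \Sigma_Q - 2\left(\Sigma_T \Sigma_Q\right)^{1/2}\right).$$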
Naturally, if the model is data-copying the training set, the two MLE means and covariances will be nearly identical, rendering this test ineffective for capturing this kind of overfitting.
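To make this concrete, the sketch below computes the standard closed-form (squared) Fréchet distance between two fitted Gaussians and applies it to a bootstrap copy of a training sample. The function name, the Gaussian stand-in data, and the eigenvalue-based trace computation are choices made for this illustration.

```python
import numpy as np

def frechet_distance(x, y):
    """Squared Fréchet distance between MLE Gaussians fit to x and y (rows are points)."""
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False, bias=True)  # bias=True gives the MLE covariance
    cov_y = np.cov(y, rowvar=False, bias=True)
    # Tr((cov_x cov_y)^{1/2}) equals the sum of square roots of the eigenvalues of
    # cov_x @ cov_y, which are real and non-negative for PSD inputs.
    eigvals = np.linalg.eigvals(cov_x @ cov_y).real
    tr_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    mean_term = ((mu_x - mu_y) ** 2).sum()
    return float(mean_term + np.trace(cov_x) + np.trace(cov_y) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
train = rng.normal(size=(2000, 2))
copied = train[rng.integers(0, len(train), size=1000)]  # pure data-copying
shifted = train + np.array([1.0, 0.0])                   # same shape, shifted mean

fd_copy = frechet_distance(train, copied)
fd_shift = frechet_distance(train, shifted)
print(fd_copy, fd_shift)
```

The copied sample scores close to zero, the best possible value, even though it contains no new points, while a plain shift of the mean is penalized.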
Binning Based Evaluation (richardson): This test takes a hypothesis testing approach for evaluating mode collapse and deletion. The test bears much similarity to the test described in Section 3.2. The basic idea is as follows. Split the training set into $k$ bins using $k$-means; the number of samples falling into each bin is approximately normally distributed if it has >20 samples. For each bin, check the null hypothesis that the normal distribution of the fraction of the training set in that bin equals the normal distribution of the fraction of the generated set in that bin. Specifically, compute the $z$-score of the difference between the two fractions, standardized by the standard error under the pooled fraction of both samples falling in the bin. We then perform a one-sided hypothesis test, and count the bins whose positive $z$-score is significant at the 0.05 level. We call this the number of statistically different bins, or NDB. The ratio NDB/$k$ ought to equal the significance level if the model distribution equals the true distribution.
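The per-bin comparison can be sketched as a pooled two-proportion $z$-test. The code below assumes that bin assignments have already been computed (for instance by $k$-means) and, for simplicity, flags bins in either direction with a two-sided threshold, which is a variation on the one-sided test described above; `ndb` and the synthetic bin labels are hypothetical.

```python
import numpy as np

def ndb(train_bins, gen_bins, k, z_crit=1.96):
    """Count bins whose train/generated proportions differ significantly.

    train_bins, gen_bins: integer bin index (0..k-1) of every sample.
    z_crit: critical |z| value; 1.96 corresponds to a two-sided 0.05 level.
    """
    n_t, n_g = len(train_bins), len(gen_bins)
    count_t = np.bincount(train_bins, minlength=k)
    count_g = np.bincount(gen_bins, minlength=k)
    p_t, p_g = count_t / n_t, count_g / n_g
    pooled = (count_t + count_g) / (n_t + n_g)
    se = np.sqrt(pooled * (1.0 - pooled) * (1.0 / n_t + 1.0 / n_g))
    # Guard against empty bins, where the standard error is zero.
    z = np.where(se > 0, (p_t - p_g) / np.where(se > 0, se, 1.0), 0.0)
    return int((np.abs(z) > z_crit).sum()), z

rng = np.random.default_rng(0)
train_bins = rng.integers(0, 10, size=2000)
gen_same = rng.integers(0, 10, size=1000)      # same bin distribution as training
gen_collapsed = rng.integers(0, 3, size=1000)  # mode collapse onto three bins

ndb_same, _ = ndb(train_bins, gen_same, k=10)
ndb_collapsed, _ = ndb(train_bins, gen_collapsed, k=10)
print(ndb_same, ndb_collapsed)
```

Because the test only looks at bin counts, a model that copies training points while preserving the bin proportions would look no different from `gen_same` here.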
Two-Sample Nearest-Neighbor (lopez): In this test, our primary baseline, we report the three LOO NN values discussed in Kilian. The generated sample and the training sample (subsampled so that the two have equal size) are joined together to create a combined sample of twice that size, with training samples labeled ‘1’ and generated samples labeled ‘0’. One then fits a 1-Nearest-Neighbor classifier to the combined sample, and reports the leave-one-out accuracy in predicting the training samples (‘1’s), the accuracy in predicting the generated samples (‘0’s), and the average.
One can expect that, when the model collapses to a few mode centers of the training distribution, the training accuracy is low and the generated accuracy is high, thus indicating over-representation. Additionally, one could imagine that when the training and generated accuracies are near 0, we have extreme data-copying. However, as explained in the Experiments section, when we are forced to subsample the training set, it is unlikely that a given copied training point is used in the test, thus making the test result unclear.
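A minimal sketch of the leave-one-out 1-NN computation follows, using brute-force distances. The function name and the synthetic data are illustrative assumptions; the two demonstrations contrast a well-behaved generator with an exact copy of the training sample.

```python
import numpy as np

def loo_1nn_accuracies(train, gen):
    """Leave-one-out 1-NN accuracies on the pooled (train, generated) sample."""
    pooled = np.vstack([train, gen])
    labels = np.concatenate([np.ones(len(train), dtype=int),
                             np.zeros(len(gen), dtype=int)])
    d2 = ((pooled[:, None, :] - pooled[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)  # leave-one-out: a point cannot match itself
    pred = labels[d2.argmin(axis=1)]
    correct = pred == labels
    train_acc = float(correct[labels == 1].mean())
    gen_acc = float(correct[labels == 0].mean())
    return train_acc, gen_acc, (train_acc + gen_acc) / 2.0

rng = np.random.default_rng(0)
train = rng.normal(size=(300, 2))
fresh = rng.normal(size=(300, 2))  # generator matching the true distribution
copied = train.copy()              # extreme data-copying, no subsampling

acc_fresh = loo_1nn_accuracies(train, fresh)
acc_copied = loo_1nn_accuracies(train, copied)
print(acc_fresh, acc_copied)
```

In the copied case every point's nearest neighbor is its duplicate from the other sample, so both accuracies fall to zero; this signal weakens once the copied training points are subsampled out of the pooled set.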
Precision and Recall (mehdi): This method offers a clever technique for scaling classical precision and recall statistics to high dimensional, complex spaces. First, all samples are embedded to Inception Network Pool3 features. Then, the authors use the following insight: for two distributions, the precision and recall curve is approximately given by the set of points:
Here the number of points sets the ‘resolution’ of the curve, the cells form a partition of the instance space, and each distribution is summarized by the fraction of its samples falling in each cell. The partition is determined by running $k$-means on the combination of the training and generated sets. In our tests here, we report the average PRD curve measured over 10 $k$-means clusterings (and then re-run 10 times for 10 separate trials of the generated sample).
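The curve can be sketched from two histograms over a shared partition. The code below follows the commonly used formulation in which, for each slope $\lambda$, precision is $\sum_\omega \min(\lambda P(\omega), Q(\omega))$ and recall is $\sum_\omega \min(P(\omega), Q(\omega)/\lambda)$, with slopes taken as tangents of evenly spaced angles in $(0, \pi/2)$. The function name, the default resolution, and the toy histograms are assumptions for illustration.

```python
import numpy as np

def prd_curve(p_hist, q_hist, num_angles=1001):
    """Precision-recall pairs from two histograms over the same cells."""
    p = np.asarray(p_hist, dtype=float) / np.sum(p_hist)  # reference (training) fractions
    q = np.asarray(q_hist, dtype=float) / np.sum(q_hist)  # generated fractions
    # Slopes lambda = tan(angle) for angles strictly inside (0, pi/2).
    angles = np.linspace(0.0, np.pi / 2, num_angles + 2)[1:-1]
    lam = np.tan(angles)[:, None]
    precision = np.minimum(lam * p[None, :], q[None, :]).sum(axis=1)
    recall = np.minimum(p[None, :], q[None, :] / lam).sum(axis=1)
    return precision, recall

# Identical cell fractions: the curve reaches perfect precision and recall.
prec_same, rec_same = prd_curve([10, 20, 30], [1, 2, 3])
# Disjoint support: precision and recall are zero everywhere.
prec_disjoint, rec_disjoint = prd_curve([1, 0], [0, 1])
print(prec_same.max(), rec_same.max(), prec_disjoint.max())
```

As with the binning test, only cell fractions enter the computation, so a generated sample that reproduces training points within each cell is indistinguishable from a novel one.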
7.4.3 MNIST Experiments
The experiments of Sections 4.2.1 and 4.2.2 use the MNIST digit dataset (lecun). We use a training sample, , of size , a test sample of size , a validation sample of , and create generated samples of size .
Here, for a meaningful distance metric, we create a custom embedding using a convolutional autoencoder trained with a VGG perceptual loss proposed by zhang. The encoder and decoder each have four convolutional layers using batch normalization, two linear layers using dropout, and two max pool layers. The autoencoder is trained for 100 epochs with a batch size of 128 and the Adam optimizer with learning rate 0.001. For each training sample, the encoder compresses the 784-d image to a 64-d latent representation, and the decoder expands it back up to 784-d. Our loss is then
where one term is the VGG perceptual loss, and the other provides a linear hinge loss outside of a unit ball. The hinge loss encourages the encoder to learn a latent representation within a bounded domain, hopefully augmenting its ability to interpolate between samples. It is worth noting that the perceptual loss is not trained on MNIST, and hopefully uses agnostic features that help keep us from overfitting. We opt to use a standard autoencoder instead of a stochastic autoencoder like a VAE, because we want to be able to exactly data-copy the training set. Thus, we want the encoder to create a near-exact encoding and decoding of the training samples specifically. Figure 7 provides an example of linearly spaced steps between two training samples. While not perfect, we observe that half-way between the ‘2’ and the ‘0’ is a sample that appears perceptually to be almost a ‘2’ and almost a ‘0’. As such, we consider the distance metric on this space used in our experiments to be meaningful.
In the MNIST KDE experiments, we fit each KDE on the 64-d latent representations of the training set for several values of the bandwidth; we gather all statistical tests in this space, and effectively only decode to visually inspect samples. We gather the average and standard deviation of each data point across 5 trials of generating samples. For the Two-Sample Nearest-Neighbor test, it is computationally intense to compute the nearest neighbor in a 64-dimensional dataset of 20,000 points 20,000 times. To limit this, we average each of the training and generated NN accuracy over 500 training and generated samples. We find this acceptable, since the test results depicted in Figure 2(f) are relatively low variance.
In the MNIST VAE experiments, we only use the 64-d autoencoder latent representation in computing the and 1-NN test scores, and not at all in training. Here, we experiment with twenty standard, fully connected, VAEs using binary cross entropy reconstruction loss. The twenty models have three hidden layers and latent dimensions ranging from to in steps of 5. The number of neurons in intermediate layers is approximately twice the number of the layer beneath it, so for a latent space of 50-d, the encoder architecture is , and the decoder architecture is the opposite.
To sample from a trained VAE, we sample from a standard normal with dimensionality equal to the VAE’s latent dimension, and pass the samples through the VAE decoder to the 784-d image space. We then encode these generated images to the agnostic 64-d latent space of the perceptual autoencoder described at the beginning of the section, where distance is meaningful. We also encode the training sample and test sample to this space, and then run the proposed and two-sample NN tests. We again compute the nearest neighbor accuracies for 10,000 of the training and generated samples (the 1-NN classifier is fit on the combined 20,000-sample set), which appears to be acceptable due to low test variance.
7.4.4 ImageNet Experiments
Here, we have chosen three of the one thousand ImageNet12 classes that ‘BigGan’ produces. To reiterate, a conditional GAN can output samples from a specific class by conditioning on a class code input. We acknowledge that conditional GANs combine features from many classes in ways not yet well understood, but treat the GAN of each class as a uniquely different generative model trained on the training samples from that class. So, for the ‘coffee’ class, we treat the GAN as a coffee generator trained on the 1300 ‘coffee’ class samples. For each class, we have 1300 training samples, 2000 generated samples, and 50 test samples. Being atypically starved of training samples (there are fewer of them than generated samples), we subsample the generated sample (not the training sample!) to produce equal size samples for the two-sample NN test. As such, all training samples used are in the combined set. We also note that having only 50 test samples per class is highly limiting, only allowing us to split the instance space into about three cells while keeping a reasonable number of test samples in each cell. As the number of test samples grows, so can the number of cells and the resolution of the partition. Figure 10 provides an example of where this clustering might be limited; the generated samples of the underfit cell seem hardly any different from those of the over-fit cell. A finer-grained partition is likely needed here. However, the data-copied cell to the left does appear to be very close to the training set, potentially too close according to the statistic.
In performing these experiments, we gather the statistic for a given class of images. In an attempt to embed the images into a meaningful lower dimensional latent space, we pass each image through an InceptionV3 network and gather the 2048-dimension feature embeddings after the final average pooling layer (Pool3). We then project all inception-space images onto the 64 principal components of the training set embeddings. Finally, we use $k$-means to partition the points of each sample into one of three cells. The number of cells is limited by the 50 test images available per class. Any more cells would strain the Central Limit Theorem assumption in computing the statistic. We then gather the proposed and two-sample NN baseline statistics on this 64-d space.
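The projection step can be sketched with a plain SVD. In the snippet below, the principal directions are fit on training features only and then applied to any other sample; the helper names, the random stand-in features, and the reduced dimensions (128 features, 16 components) are assumptions chosen to keep the example small, in place of the 2048-d embeddings and 64 components described above.

```python
import numpy as np

def fit_pca(train_feats, n_components):
    """Return the mean and top principal directions of the training features."""
    mean = train_feats.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(train_feats - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(feats, mean, components):
    """Project features onto principal directions fit on the training set."""
    return (feats - mean) @ components.T

rng = np.random.default_rng(0)
train_feats = rng.normal(size=(300, 128))  # stand-in for training embeddings
test_feats = rng.normal(size=(50, 128))    # stand-in for test embeddings
gen_feats = rng.normal(size=(400, 128))    # stand-in for generated embeddings

mean, components = fit_pca(train_feats, n_components=16)
train_low = project(train_feats, mean, components)
test_low = project(test_feats, mean, components)
gen_low = project(gen_feats, mean, components)
print(train_low.shape, test_low.shape, gen_low.shape)
```

Fitting the projection on the training set alone keeps the test and generated samples from influencing the space in which their distances to the training set are measured.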
7.5 Comparison with three-sample Kernel-MMD:
Another three-sample test not shown in the main body of this work is the three-sample kernel MMD test introduced by gretton_2, intended more for model comparison than for checking model overfitting. For two samples, we can estimate the squared kernel MMD between their underlying distributions under a given kernel by empirically estimating
More recent works such as reviewer_paper have repurposed this test for measuring generative model overfitting. Intuitively, if the model is overfitting its training set, the empirical kMMD between training and generated data may be smaller than that between training and test sets. This may be triggered by the data-copying variety of overfitting.
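For intuition, the sketch below computes a standard unbiased estimate of the squared kernel MMD with an RBF kernel and compares the training set against a held-out sample, a bootstrap copy of itself, and a shifted sample. The function names, the kernel bandwidth, and the Gaussian stand-in data are assumptions; this is not the code of the referenced three-sample test.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth):
    """RBF kernel matrix between the rows of a and the rows of b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def mmd2_unbiased(x, y, bandwidth=1.0):
    """Unbiased estimate of the squared kernel MMD between samples x and y."""
    n, m = len(x), len(y)
    k_xx = rbf_kernel(x, x, bandwidth)
    k_yy = rbf_kernel(y, y, bandwidth)
    k_xy = rbf_kernel(x, y, bandwidth)
    # Exclude the diagonal (i == j) terms from the within-sample sums.
    term_xx = (k_xx.sum() - np.trace(k_xx)) / (n * (n - 1))
    term_yy = (k_yy.sum() - np.trace(k_yy)) / (m * (m - 1))
    return float(term_xx + term_yy - 2.0 * k_xy.mean())

rng = np.random.default_rng(0)
train = rng.normal(size=(500, 2))
test = rng.normal(size=(300, 2))                       # held-out sample
copied = train[rng.integers(0, len(train), size=300)]  # bootstrap copy of training set
shifted = rng.normal(size=(300, 2)) + 3.0              # clearly different distribution

mmd_test = mmd2_unbiased(train, test)
mmd_copy = mmd2_unbiased(train, copied)
mmd_far = mmd2_unbiased(train, shifted)
print(mmd_test, mmd_copy, mmd_far)
```

In this toy setting both the held-out sample and the bootstrap copy give estimates near zero, so the gap between them carries little information about copying, while the shifted sample is clearly separated.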
This test provides an interesting benchmark to consider in addition to those in the main body. Figure 11 demonstrates some preliminary experimental results repeating both the ‘moons’ KDE experiment of Figure 2 and the MNIST VAE experiment of Figure 3(b). To implement the kMMD test, we used code posted by gretton_2 (https://github.com/eugenium/MMD), specifically the three sample RBF-kMMD test.
In Figures 11(a), 11(b), and 11(c) we compare the proposed statistic and the kMMD gap for 50 values of the KDE bandwidth. We observe that the kMMD between the test and training set (blue) and between the generated and training set (orange) remain near zero for all bandwidth values less than the MLE bandwidth, indicated by the red circle. This suggests that the three-sample kMMD is not a particularly strong test for data-copying, since low bandwidth values effectively bootstrap the original training set. The kMMD gap does diverge for bandwidth values much larger than the MLE bandwidth, however, indicating that it can detect underfitting by the model.
This is corroborated by Figure 11(c), which displays the $p$-value of the kMMD hypothesis test used by reviewer_paper. This checks the null hypothesis that the kMMD between the generated and training samples is greater than that between the test and training samples. A high $p$-value is consistent with this null hypothesis (as seen for all bandwidth values). A $p$-value near 0.5 suggests that the kMMDs are approximately equal. A $p$-value near zero rejects the null hypothesis, suggesting that the kMMD between the generated and training samples is much smaller than that between the test and training samples. We see that the $p$-value remains well above 0.5 for all bandwidth values and treats the MLE bandwidth just as it does the overfit values.
Figures 11(d) and 11(e) compare the proposed and kMMD tests for twenty MNIST VAEs with decreasing complexity (latent dimension). Figure 11(e) again depicts the kMMD distance to the training set for both the generated (orange) and test samples (blue). We observe that this test does not appear sensitive to over-parametrized VAEs in the same way our proposed test (Figure 11(d)) is. As in the ‘moons’ case above, it appears sensitive to underfitting. Here, the corresponding kMMD $p$-values are effectively 1 for all latent dimension values, and thus are omitted.
We suspect that this insensitivity to data-copying is due to the fact that, for a large number of samples, the kMMD between the test and training samples and the kMMD between the generated and training samples are both likely to be near zero when the model data-copies. Consider the case of extreme data-copying, when the generated sample is simply a bootstrap sample from the training set. The kMMD estimate will be
Informally speaking, the second and third summations of this expression almost behave identically on average to the first since is a bootstrap sample. They only behave differently for summation terms that are collisions: in the second summation, and