Spurious samples in deep generative models: bug or feature?

10/03/2018 ∙ by Balázs Kégl, et al. ∙ 12

Traditional wisdom in the generative modeling literature is that spurious samples that a model can generate are errors and should be avoided. Recent research, however, has shown interest in studying or even exploiting such samples instead of eliminating them. In this paper, we ask whether such samples can be eliminated altogether without sacrificing coverage of the generating distribution. For the class of models we consider, we experimentally demonstrate that this is not possible without losing the ability to model some of the test samples. While our results need to be confirmed on a broader set of model families, these initial findings provide partial evidence that spurious samples share structural properties with the learned dataset, which, in turn, suggests they are not simply errors but a feature of deep generative nets.




1 Introduction

The goal of unsupervised modelling is to learn a characterization of the data generating distribution from a set of training instances. Generative modelling also aims at a constructive procedure to generate samples from the learned distribution. Evaluating the quality of these models is not trivial: Theis et al. (2015) show that the most commonly used criteria, such as average log-likelihood, Parzen window estimates, and visual quality of samples, are largely independent of each other when the data is high-dimensional. More importantly, they conclude that extrapolation from one criterion to another is not warranted and that generative models need to be evaluated directly with respect to the application(s) they were intended for (Theis et al., 2015).

The difficulties in the evaluation of generative models become exacerbated when we consider the notion of spurious samples (Bengio et al., 2013). These kinds of examples (Figure 1) are ubiquitous in the mainstream generative model literature. Traditionally, researchers strive to eliminate these samples (Bengio et al., 2013; Goodfellow, 2016; Salimans et al., 2016) since they are considered failures. There is also a chance that such samples are underreported in published work. By contrast, recent work (Nguyen et al., 2015; Lake et al., 2015; Kazakçı et al., 2016; Cherti et al., 2017) has shown growing interest in exploiting or studying these kinds of samples. Kazakçı et al. (2016) found that it was quite easy to generate examples that had zero likelihood under any possible notion of likelihood; more precisely, models trained on digits generated symbols which were not digits under any notion of what a digit is (Figure 1(c)). Lake et al. (2015) called these examples “unconstrained”. Such studies highlight that spurious samples should not be discarded (e.g., as “noise”): they share deep structural properties with examples of the training set, yet they obviously do not come from the distribution that generated the training set.


(a) “Synthetic” objects from ImageNet categories from Figure 7 of (Nguyen et al., 2015)
(b) “Unconstrained” symbols from Figure 7 of (Lake et al., 2015)
(c) New types of symbols from Figure 6 of (Kazakçı et al., 2016)
(d) Non-recognizable animal generated by (Goodfellow, 2016).
Figure 1: Examples of spurious objects.

Given these shared structural properties between relevant and spurious samples, one trivial and fundamental question needs to be answered: is it possible to get rid of all spurious samples without sacrificing the coverage of a model? This question translates into understanding the relationship between the following kinds of failure modes:

  1. Spurious modes: whether the model generates objects that clearly do not belong to the domain (Figure 2(a)).

  2. Missing modes or the lack of coverage: whether the model can generate all objects (e.g., all bedrooms or handwritten digits) of the domain (Figure 2(b)).

Figure 2: Generated hand-written symbols illustrating (a) spurious modes and (b) missing modes. Detecting missing modes visually is hard; even in the case of the simple domain of MNIST, we need a reference sample (c) from the tail of the distribution to notice that the model that generated (b) is probably missing some of the modes.

In other words, in this paper we ask whether it is possible at all to learn a training set (e.g., MNIST digits) and only the training set, or whether, when we learn a representation of the whole training set, we inescapably learn a larger set of structurally similar objects not present in the training set. The latter would mean that to achieve full coverage, we need to live with spurious objects. On the specific set of models we studied, the latter seems to be the case: we either discard a subset of digits, learning only the bulk of the distribution and losing the tail, or, if we pick up the full set of digits, the models also naturally represent a much larger set of structurally similar objects.

Contrary to the latest research on generative modeling, we chose MNIST as the main data set of study, complemented by the HWRT dataset (Thoma, 2015) of handwritten mathematical symbols for out-of-class objects, mainly because the experimental setup required training a large number of generative models. Unlike positive results of learnability (e.g., a new algorithm which can learn MNIST), our argument is not harmed by this limitation. Essentially, we provide evidence that even a simple distribution or training set is difficult to learn properly. This is a negative result; eliminating spurious modes while covering the full distribution should be more difficult on larger, more heterogeneous and higher-dimensional data sets.

The paper has the following contributions:

  • We provide an experimental framework to study and to quantify both spurious and missing modes in a specific type of generative model.

  • We define a new metric that can be used to tune the spurious/missing mode trade-off and to select models that achieve the best compromise.

  • We show that, for the type of models we studied, it is impossible to eliminate spurious modes and learn all genuine modes at the same time.

2 Spurious samples and the evaluation of generative models

Theis et al. (2015) argue against using simple Parzen-based (Breuleux et al., 2009) likelihood estimators, even going as far as saying that if the goal is visual appeal, then likelihood itself is not necessarily a good measure. Detecting spurious modes (Figure 2(a); Figure 1) visually is relatively easy. So, when the goal is not to generate these spurious examples (Bengio et al., 2013) or failures (Salimans et al., 2016), non-likelihood-based metrics concentrate on eliminating them. The inception (or, more generally, objectness) score (Salimans et al., 2016) and the Fréchet (inception) distance (Heusel et al., 2017) require a sub-class predictor to label sub-class modes (e.g., ImageNet or MNIST classes), somewhat defeating the very goal of unsupervised learning. Moreover, by design, they cannot detect missing modes that were not labeled, within the classes used in the prediction task. For example, Figure 2(b) displays a generated sample which is missing some labels (sub-classes, from the point of view of modeling digits) but which also shows less variety within the classes. Missing classes are detected by objectness, but the lack of within-class variety is not. One may argue that objectness even penalizes tail examples, since the predictor is better at classifying typical examples than tail examples. Also note that detecting missing modes visually is hard: even in the case of the simple domain of MNIST, we need a reference sample (Figure 2(c)) from the tail of the test sample to notice that the model that generated the sample in Figure 2(b) is probably missing some of the modes.

None of these metrics seems suitable for analyzing the relationship between spurious and missing modes. In the next section, we propose a new metric that we derive from the in-class and out-of-class reconstruction rates. As we will see, this metric has the advantage that it does not require a sub-class classifier (like the inception ImageNet classifier). On the other hand, it requires a control sample on which we can measure out-of-class reconstruction rates. In practical situations, say, in a data challenge, the control set can be kept hidden from the modelers, making it less likely that they overfit the metric. It is even possible to use several proxy control sets and to combine the resulting scores using various statistics (e.g., mean or min).

Besides the control set, another requirement for applying the metric is that each trained model has to be able to answer the binary question of whether a given object can or cannot be reconstructed by it. Autoencoders (Vincent et al., 2010; Bengio et al., 2013) and autoregressive models (Oord et al., 2016; van den Oord et al., 2016) have this property but, for example, GANs (Goodfellow et al., 2014) do not.

3 The formal setup

Let $\mathcal{X} = [0, 1]^{28 \times 28}$ be the set of all images of dimension $28 \times 28$ with gray-scale pixel values in $[0, 1]$. Each model $M$ is an autoencoder that represents a manifold $\mathcal{X}_M = \{x \in \mathcal{X} : d(x, M(x)) < \theta\}$, where $d(x, M(x))$ is the reconstruction error of $x$ and $\theta$ is a threshold. We shall also say that $M$ can reconstruct elements of $\mathcal{X}_M$ or $\mathcal{X}_M$ itself. The threshold $\theta$ was set experimentally in order to maximize the dynamic range of our scores (smaller or larger thresholds would have resulted in models that reconstructed few digits and symbols or most of them, respectively; see Figure 4). The manifold $\mathcal{X}_M$ is loosely related to the distribution of the images (Alain & Bengio, 2014); more precisely, it is the approximate support of the distribution when the autoencoder is used in an iterative generative mode targeting its fixed points (Figure 3). In our setup we are not interested in the actual likelihood assigned to the fixed point (related to the measure of the set of random seeds that generate the fixed point), but rather in a yes/no answer to the question of whether the model can represent/generate a given image $x$. Also note that the actual setup of turning the generative model into an oracle that can answer the question “can you reconstruct $x$?” may vary, depending on the model. The particular setup is somewhat independent of the metrics we propose.
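The yes/no oracle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the helper names `mse` and `can_reconstruct`, the choice of mean squared error as the reconstruction error, the threshold value, and the `toy_autoencoder` stand-in for a trained model are all hypothetical.

```python
from typing import Callable, List, Sequence

Image = Sequence[float]  # flattened gray-scale image, pixel values in [0, 1]


def mse(a: Image, b: Image) -> float:
    """Mean squared error between two images with the same number of pixels."""
    if len(a) != len(b):
        raise ValueError("images must have the same number of pixels")
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def can_reconstruct(model: Callable[[Image], Image], x: Image, threshold: float) -> bool:
    """Oracle: True if the reconstruction error of x is below the threshold."""
    return mse(x, model(x)) < threshold


def toy_autoencoder(x: Image) -> List[float]:
    """Hypothetical stand-in for a trained model (assumption): it can only
    output binary pixels, so it reconstructs near-binary images well."""
    return [1.0 if p >= 0.5 else 0.0 for p in x]


near_binary = [0.05, 0.95, 0.9, 0.1]   # close to what the toy model can output
gray = [0.5, 0.45, 0.55, 0.5]          # far from any binary image

in_manifold = can_reconstruct(toy_autoencoder, near_binary, threshold=0.01)
out_of_manifold = can_reconstruct(toy_autoencoder, gray, threshold=0.01)
print(in_manifold, out_of_manifold)
```

Any model that exposes a reconstruction function can be plugged in as `model`; only the binary answer is used downstream.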

Figure 3: Iterative generation with autoencoders. We start from random images obtained by picking pixel intensities uniformly at random from 0 to 1, and we repeatedly apply the autoencoder to the images for several iterations. Each row corresponds to an independent sample, while columns correspond to iterations. The images of the last iteration correspond to approximate fixed points of the autoencoder. In other words, those points can be reconstructed very well by the autoencoder, that is, the reconstruction of such an image $x$ is approximately $x$ itself.
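The iterative procedure of Figure 3 can be sketched as follows. This is a minimal sketch, not the authors' implementation: `iterate_to_fixed_point` is a hypothetical helper, and `toy_contractive_model` is an assumed stand-in whose fixed points are binary images, used only so that the example runs without a trained network.

```python
import random
from typing import Callable, List


def iterate_to_fixed_point(
    model: Callable[[List[float]], List[float]],
    n_pixels: int,
    n_iterations: int,
    rng: random.Random,
) -> List[List[float]]:
    """Start from uniform noise in [0, 1] and repeatedly apply the model.
    Returns the whole trajectory: the seed followed by one image per iteration."""
    x = [rng.random() for _ in range(n_pixels)]
    trajectory = [x]
    for _ in range(n_iterations):
        x = model(x)
        trajectory.append(x)
    return trajectory


def toy_contractive_model(x: List[float]) -> List[float]:
    """Hypothetical stand-in for a trained autoencoder (assumption): it moves
    every pixel halfway toward the nearest of 0 and 1."""
    return [p + 0.5 * ((1.0 if p >= 0.5 else 0.0) - p) for p in x]


trajectory = iterate_to_fixed_point(
    toy_contractive_model, n_pixels=16, n_iterations=30, rng=random.Random(0)
)
last = trajectory[-1]
# Reconstruction error of the final image; it is tiny at an approximate fixed point.
final_error = sum(
    (a - b) ** 2 for a, b in zip(last, toy_contractive_model(last))
) / len(last)
print(final_error)
```

Each row of Figure 3 corresponds to one such trajectory, with the columns showing successive entries of the returned list.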

The models are all trained on the 60000 training images of MNIST. We use the test set of MNIST for evaluation. For detecting spurious modes, we also use a control set of handwritten mathematical symbols from Thoma (2015). This dataset is originally vectorized and consists of a sequence of coordinates for each example. We rasterized it by joining the coordinates with segments of a thickness similar to MNIST. We also padded the images with zeros, as was done in MNIST. The full dataset has 369 classes of mathematical symbols. We removed all the classes with fewer than 100 examples, obtaining a total of 343 classes and 151853 examples. We randomly split the full set into a training set of 60000 examples and a test set of 91853 examples. The training sets were used to train a digit vs. symbol classifier, which was used in the analysis in Section 4.1. We denote the posterior probabilities output by this classifier by $p(\mathrm{digit} \mid x)$ and $p(\mathrm{symbol} \mid x)$.
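The rasterization step mentioned above (joining coordinates with segments and leaving a zero border) could look roughly like the sketch below. Everything in it is an assumption made for illustration: the `rasterize` helper, the single-stroke input normalized to the unit square, the 28 x 28 grid with `pad=4`, and the brush `radius` are not taken from the paper.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y), assumed normalized to the unit square


def rasterize(stroke: List[Point], size: int = 28, pad: int = 4,
              radius: float = 1.0) -> List[List[float]]:
    """Draw a polyline on a size x size grid of zeros. Stroke centres stay
    at least `pad` pixels away from the image border."""
    image = [[0.0] * size for _ in range(size)]
    inner = size - 2 * pad - 1  # extent of the drawing area in pixels
    steps = 2 * size            # samples per segment, dense enough to leave no gaps
    for (x0, y0), (x1, y1) in zip(stroke, stroke[1:]):
        for i in range(steps + 1):
            t = i / steps
            cx = pad + (x0 + t * (x1 - x0)) * inner
            cy = pad + (y0 + t * (y1 - y0)) * inner
            # Turn on every pixel whose centre lies within `radius` of the sample.
            for row in range(max(0, int(cy - radius)), min(size, int(cy + radius) + 2)):
                for col in range(max(0, int(cx - radius)), min(size, int(cx + radius) + 2)):
                    if (row - cy) ** 2 + (col - cx) ** 2 <= radius ** 2:
                        image[row][col] = 1.0
    return image


# A single diagonal stroke from the top-left to the bottom-right corner.
image = rasterize([(0.0, 0.0), (1.0, 1.0)])
print("\n".join("".join("#" if v else "." for v in row) for row in image))
```

Real handwriting data usually contains several strokes per symbol; under the same assumptions, each stroke would be drawn onto the same image in turn.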

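Let $D$ and $S$ denote the MNIST test digits and the test symbols, respectively, and let $D_M \subseteq D$ and $S_M \subseteq S$ be the sets of digits and symbols, respectively, which a trained model $M$ can reconstruct. The in-class reconstruction rate (IRR) and out-of-class reconstruction rate (ORR) of a model $M$ are defined as

$$\mathrm{IRR}(M) = \frac{|D_M|}{|D|} \qquad (1)$$

$$\mathrm{ORR}(M) = \frac{|S_M|}{|S|} \qquad (2)$$

For each model $M$, we will use $\mathrm{IRR}(M)$ and $\mathrm{ORR}(M)$ to quantify the “measure” of the missing modes and spurious modes, respectively: a low IRR indicates missing modes and a high ORR indicates spurious modes. Assuming that the elements of $D$ and $S$ are generated i.i.d., $\mathrm{IRR}(M)$ and $\mathrm{ORR}(M)$ are unbiased estimates of the probabilities that a digit and a symbol, respectively, can be reconstructed by $M$. Thus, $\mathrm{IRR}(M)$ is indeed a measure of the missing modes under the sampling distribution of $D$, but $\mathrm{ORR}(M)$ is only a proxy of the measure of the spurious modes, since it only covers those modes that are sampled by the symbol set $S$. In the experiments we will use the difference

$$\mathrm{IRR}(M) - \mathrm{ORR}(M) \qquad (3)$$

as a proxy metric for identifying “good” models, that is, good compromises with low rates of both spurious and missing modes.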
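As a rough illustration of how the two rates and their difference could be computed from a reconstruction oracle, consider the sketch below. The helper names, the `toy_oracle`, and the toy “digits” and “symbols” (plain numbers standing in for images) are assumptions made for the example, not part of the paper.

```python
from typing import Callable, Iterable, Tuple


def reconstruction_rate(oracle: Callable[[object], bool],
                        samples: Iterable[object]) -> float:
    """Fraction of samples for which the oracle answers 'yes'."""
    samples = list(samples)
    if not samples:
        raise ValueError("need at least one sample")
    return sum(1 for x in samples if oracle(x)) / len(samples)


def irr_orr_difference(oracle: Callable[[object], bool],
                       digits: Iterable[object],
                       symbols: Iterable[object]) -> Tuple[float, float, float]:
    """Return (IRR, ORR, IRR - ORR) for one model, given its oracle."""
    irr = reconstruction_rate(oracle, digits)   # in-class rate
    orr = reconstruction_rate(oracle, symbols)  # out-of-class rate
    return irr, orr, irr - orr


def toy_oracle(x: float) -> bool:
    """Hypothetical oracle (assumption): 'reconstructs' values below 0.5."""
    return x < 0.5


toy_digits = [0.1, 0.2, 0.3, 0.9]    # one digit is missed
toy_symbols = [0.4, 0.6, 0.7, 0.8]   # one symbol is spuriously reconstructed
irr, orr, difference = irr_orr_difference(toy_oracle, toy_digits, toy_symbols)
print(irr, orr, difference)
```

A model that reconstructs every digit and no symbol would score a difference of 1, while a model that treats digits and symbols alike would score 0.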

Figure 4: The effect of the threshold on which images are considered “recognized” and which are not. For each value of the threshold, the left panels correspond to images considered recognized by the model, while the right panels correspond to images that are not. Within each panel, the top row contains the original images and the bottom row contains their corresponding reconstructions. The reconstructions are obtained from a convolutional autoencoder trained on the MNIST training set, and the original images are taken from the test set of handwritten mathematical symbols.

4 Experiments

In the experiments, we use a family of convolutional autoencoders. All the models consist of a set of $L$ convolutional layers in the encoder, followed by a set of $L$ convolutional layers with padding (Dumoulin & Visin, 2016) (to increase the size of the feature maps) in the decoder, thus a total of $2L$ layers. Each convolutional layer uses filters of a fixed size and the ReLU activation function. We apply a separate activation function in the bottleneck (“code”). The output layer uses a sigmoid activation function.

To explore the space of architectures, we vary several hyperparameters while fixing others. We vary the number of layers $L$ from 1 to 6. We use 128 feature maps in all the layers, except in the bottleneck layer, where the number of feature maps varies. We use the same filter size and stride in all the layers. For the activation function of the bottleneck, we use the spatial winner-take-all (WTA) activation of Makhzani & Frey (2015). In each feature map, spatialwta zeroes out all the activations except the one with the maximum value, thus backpropagating only through the activation with the maximum value in each feature map. After applying spatialwta, we apply an additional sparsity activation, which we call channelwta. channelwta is parametrized by a sparsity rate $\rho$: it keeps only the feature maps with the highest activations (only a single activation is greater than zero per feature map after spatialwta has been applied) and zeroes out the rest to achieve a channel-wise sparsity rate of $\rho$. We try several values of $\rho$, including a setting in which channelwta is not used. We also use denoising (Vincent et al., 2010) with salt-and-pepper noise, varying the probability of corruption. All the models are trained on the MNIST training dataset with the reconstruction error objective, using mean squared error (MSE). A total of 187 models have been trained for the current experiments.
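The two bottleneck activations can be illustrated on plain nested lists. This is a forward-pass sketch only, under several assumptions: the function names `spatial_wta` and `channel_wta`, the tie-breaking rule, and the `n_keep` parameter (the number of feature maps kept, used here instead of a sparsity rate) are hypothetical choices, and gradient handling is not shown.

```python
from typing import List

FeatureMap = List[List[float]]  # one channel: rows of activations


def spatial_wta(maps: List[FeatureMap]) -> List[FeatureMap]:
    """In each feature map, keep only the maximum activation and zero the rest."""
    out = []
    for fmap in maps:
        best = max(max(row) for row in fmap)
        kept = False
        new_map = []
        for row in fmap:
            new_row = []
            for a in row:
                if a == best and not kept:
                    new_row.append(a)
                    kept = True  # ties: keep the first maximum only (assumption)
                else:
                    new_row.append(0.0)
            new_map.append(new_row)
        out.append(new_map)
    return out


def channel_wta(maps: List[FeatureMap], n_keep: int) -> List[FeatureMap]:
    """Keep the n_keep feature maps with the highest peak activation and
    zero out all the others."""
    peaks = [max(max(row) for row in fmap) for fmap in maps]
    ranked = sorted(range(len(maps)), key=lambda c: peaks[c], reverse=True)
    keep = set(ranked[:n_keep])
    return [
        fmap if c in keep else [[0.0] * len(row) for row in fmap]
        for c, fmap in enumerate(maps)
    ]


code = [
    [[0.1, 0.7], [0.3, 0.2]],
    [[0.9, 0.0], [0.4, 0.5]],
    [[0.2, 0.1], [0.0, 0.3]],
]
after_spatial = spatial_wta(code)
after_channel = channel_wta(after_spatial, n_keep=2)
print(after_spatial)
print(after_channel)
```

In this toy example, the third feature map has the lowest peak and is the one removed by the channel-wise step.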

4.1 The spurious/missing mode trade-off

Figure 5 shows the out-of-class reconstruction rate ORR (2) versus the in-class reconstruction rate IRR (1) of each trained model. For selected models, we show some digits that they cannot reconstruct, illustrating the missing modes, and some symbols that they can, illustrating the spurious modes. The most important observation here is that none of the models is perfect: they either reconstruct all digits but also a large portion of the symbols, or they have a low rate of spurious modes but also miss a large portion of the digits. The actual numbers depend somewhat on the reconstruction threshold, but the pattern holds with the threshold we selected: the first model that reconstructs nearly all of the digits also reconstructs a substantial share of the symbols, and the first model that discards nearly all of the symbols reconstructs only part of the digits. The model with the best difference IRR − ORR (3) strikes a compromise between the share of digits and the share of symbols it reconstructs.

The panels attached to selected models show both the original digits and symbols (first and third rows) and the reconstructed digits and symbols (second and fourth rows). As we move towards higher IRR and ORR, missing modes become more and more “esoteric” until they disappear completely, while spurious modes become richer and richer as models pick up more and more symbols.

One criticism of the methodology could be that the sets of symbols and digits overlap, and that the reconstructed symbols all look like digits. It is clear from the examples that this is not the case: most reconstructed symbols do not look like digits to a human evaluator. To make this counterargument more formal, we trained a digit vs. symbol classifier. The low test error of this classifier showed that indeed most symbols can be recognized as symbols by an “objective” classifier. There still remained a doubt about whether the models “pulled” all the reconstructed symbols into the digit set, so we also looked at how often the classifier labels reconstructed symbols as digits, that is, the rate of reconstructed symbols that looked like digits to the discriminator. While these rates were not negligible, they remained in the low 10s, confirming that indeed, most reconstructed symbols are spurious, even under this more stringent criterion.

Figure 5: The out-of-class reconstruction rate ORR (2) versus the in-class reconstruction rate IRR (1). The first row of each panel shows a random set of digits that cannot be reconstructed by the selected model, and the second row shows their (attempted) reconstructions. These images represent missing modes. The third row of each panel shows symbols that the selected model can reconstruct, with the reconstructions in the fourth row. These images represent spurious modes. An interactive version of the plot, where the reader can click on any dot to see the corresponding panel, is available at https://goo.gl/ehbrb3.

Figure 6 shows the same ORR vs. IRR plot, but the panels of selected models show images generated from random seeds using the procedure in Figure 3. Models at the low-ORR end generate overwhelmingly digits, but the variability of these digits is visibly lower than in MNIST. Models at the high-ORR end generate overwhelmingly spurious symbols. These models are typical candidates for research in novelty generation (Nguyen et al., 2015; Lake et al., 2015; Kazakçı et al., 2016; Cherti et al., 2017). Finally, models with the best difference IRR − ORR are those that could be considered a good compromise, generating mostly digit-looking symbols with high variability.

Figure 6: The out-of-class reconstruction rate ORR (2) versus the in-class reconstruction rate IRR (1). The panels show a random set of images generated from random seeds by the procedure described in Figure 3 for selected models. An interactive version of the plot, where the reader can click on any dot to see the corresponding panel, is available at https://goo.gl/Ltzp3V.

The difference IRR − ORR (3) can be used as a metric for selecting these models, and the full IRR-ORR plane can be used to tune the trade-off between accepting either spurious or missing modes. Note also that in practical situations, say, in a data challenge, the control set can be kept hidden from the modelers, making it less likely that they overfit the particular ORR metric and thus the difference. It is even possible to use several proxy control sets and to combine the resulting scores using various statistics (e.g., mean or min).
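Combining scores over several proxy control sets could be as simple as the following sketch. The `combined_difference` helper and the control-set names are hypothetical, and the ORR values are toy numbers rather than results from the paper.

```python
from statistics import mean
from typing import Dict


def combined_difference(irr: float, orr_by_control_set: Dict[str, float],
                        how: str = "min") -> float:
    """Combine the per-control-set differences IRR - ORR into one score.
    'min' is the pessimistic choice; 'mean' averages over control sets."""
    if not orr_by_control_set:
        raise ValueError("need at least one control set")
    differences = [irr - orr for orr in orr_by_control_set.values()]
    if how == "min":
        return min(differences)
    if how == "mean":
        return mean(differences)
    raise ValueError("how must be 'min' or 'mean'")


# Toy ORR values for two hypothetical control sets.
toy_orr = {"math_symbols": 0.25, "letters": 0.5}
worst_case = combined_difference(0.75, toy_orr, how="min")
average = combined_difference(0.75, toy_orr, how="mean")
print(worst_case, average)
```

Taking the minimum rewards models that avoid spurious modes on every control set, whereas the mean tolerates a weak score on one set if the others are good.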

4.2 Comparing to objectness

Objectness (or inception score) (Salimans et al., 2016) is one of the popular non-likelihood-based quality metrics. It requires a sub-class classifier, so we trained a standard convnet for classifying MNIST digits. Figure 7 shows the scatterplot of objectness vs. the difference IRR − ORR (3). The two metrics agree on what bad models are, but not on what good models are. Furthermore, it is hard to say if there is any correlation between these measures and human judgement. Objectness tends to be insensitive to spurious modes, possibly because of the “blind spots” of the classifier (it confidently classifies spurious symbols into one of the digit classes).
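For reference, the inception-style score of Salimans et al. (2016) is the exponential of the average KL divergence between the classifier posterior of each generated image and the marginal class distribution over all generated images. The sketch below computes it from a list of posteriors; the `objectness` function name and the toy posteriors are assumptions, and in practice the posteriors would come from the trained digit classifier.

```python
import math
from typing import List


def objectness(posteriors: List[List[float]]) -> float:
    """exp of the mean KL divergence between each posterior p(y|x) and the
    marginal p(y) obtained by averaging the posteriors."""
    if not posteriors:
        raise ValueError("need at least one posterior")
    n = len(posteriors)
    n_classes = len(posteriors[0])
    marginal = [sum(p[k] for p in posteriors) / n for k in range(n_classes)]
    total_kl = 0.0
    for p in posteriors:
        # Terms with p(y|x) = 0 contribute nothing to the KL divergence.
        total_kl += sum(
            pk * math.log(pk / marginal[k]) for k, pk in enumerate(p) if pk > 0.0
        )
    return math.exp(total_kl / n)


# Confident and diverse predictions: the score reaches the number of classes.
confident_diverse = objectness([[1.0, 0.0], [0.0, 1.0]])
# Uninformative predictions: the score is at its minimum of 1.
uninformative = objectness([[0.5, 0.5], [0.5, 0.5]])
print(confident_diverse, uninformative)
```

The score depends only on the classifier's posteriors, which is consistent with the observation above that a confidently misclassified spurious symbol is not penalized.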

Figure 7: Objectness (Salimans et al., 2016) vs. the difference IRR − ORR (3), visualizing generated images of selected models. An interactive version of the plot, where the reader can click on any dot to see the corresponding panel, is available at https://goo.gl/KWuTvk.

5 Discussion

The paper presents an investigation of spurious samples in deep generative models and their relationship with a model’s ability to effectively learn the domain being modelled. Through a set of experiments and for a specific model family, we have shown that there is a trade-off between a model’s potential to generate spurious samples and its effectiveness in covering all the available training instances. This implies that, at least for the models we considered, one cannot eliminate spurious samples without sacrificing the model’s ability to generate some data we actually want to model. The metrics we used in this study, the in-class and out-of-class reconstruction rates and their difference, can be used as alternative non-likelihood-based metrics to tune the spurious/missing mode trade-off and to select models that achieve the best compromise.