Variational Auto-Encoder (VAE) and Generative Adversarial Network (GAN) models have enjoyed considerable success recently in generating natural-looking images. However, in many cases it can be difficult to tell when a trained generator has actually learned to generate new examples; it is entirely possible for a VAE to simply memorize a training set. GAN training provides only indirect access to a training set, so direct memorization is less of an issue. However, it is still possible for any generative model to concentrate its probability mass on a small set of outputs, and the intrinsic dimension of the output is unclear. To identify memorization, experimenters often provide visual evidence: e.g. some generated examples may be shown alongside their nearest neighbors in a training set (e.g. Radford et al. 2015; Gregor et al. 2015). If the neighbors differ from the generated example, memorization is declared to be unlikely. Alternatively, outputs along a path in latent space may be plotted; if the output changes smoothly this suggests generalization, whereas sudden changes suggest memorization.
Instead, we propose studying the induced probability density on output space. To do this, we must change variables and transform a density on the latent space to a density on the output space. This computation requires some care, as the data lie on low-dimensional submanifolds of the input and output space, and the standard formulas will become degenerate and fail. In what follows we first establish the local geometry of the situation, which allows us to obtain a formula for the output density. We then introduce ways to measure the degree to which a generator has memorized, and show experimental results.
This enables us to characterize "memorizing" as learning a probability density on output space which concentrates its mass on a finite number of points (in the limit, the learned measure tends to a collection of delta functions). In contrast, generalization implies a density which smoothly interpolates points, assigning mass to large regions of output space.
2 Mapping Latent Space to Output Space
Consider a trained generative model, where a learned (but now fixed) generator function G: Z → X maps a space of random variables Z ⊆ ℝⁿ to an output space X ⊆ ℝᵐ. We assume our generator mapping G is differentiable. (This may be false, particularly when G is a neural network with non-differentiable nonlinearities, but G will still be piecewise smooth.) We further assume that G⁻¹ is so difficult to compute that it is effectively unavailable.
A main difficulty is that the latent space Z may be large relative to the intrinsic dimension of the learned representation. The latent dimension n is typically chosen to be "large enough" for the problem at hand, and may be larger than necessary. That is, the learned latent representation may have dimension k ≤ n. Assuming the typical case where k < n, we observe that G can only map the latent space onto a submanifold of the output space with dimension at most k. Thus we see that Z_k ⊆ Z is a latent (sub)manifold of dimension k and X_k = G(Z_k) is our output (sub)manifold of dimension k.
In particular, as a map G: ℝⁿ → ℝᵐ, we see that G is degenerate as its range lies on a low-dimensional submanifold.
2.1 Tangent Spaces, Singular Vectors, and the Volume Element
While global understanding of G is not possible in general, the local behavior can be understood by computing the Jacobian matrix J(z) = ∂G/∂z and considering the linearized map

G(z + δz) ≈ G(z) + J(z) δz

for small δz. In particular, the rank of J tells us about the intrinsic dimension of the manifold near a point (see Shifrin 2005). Further, the singular value decomposition (SVD) allows us to write

J = U Σ Vᵀ

where U ∈ ℝ^{m×m} and V ∈ ℝ^{n×n} are orthogonal matrices whose columns (the "singular vectors") span ℝᵐ and ℝⁿ respectively, and Σ ∈ ℝ^{m×n} is a diagonal matrix (the "singular values"). It follows that the right and left singular vectors corresponding to non-zero singular values form bases for the tangent spaces to Z_k and X_k respectively (Shifrin 2005; do Carmo 2013). The singular vectors with degenerate (i.e. σ_i = 0) singular values correspond to subspaces which get collapsed (via projection) onto the tangent space before the linearized mapping.
In other words, if J has k non-zero singular values, G locally maps Z onto a k-manifold in X. Moving in directions corresponding to large singular values σ_i will cause greater change (in the ℓ₂ distance) in the output than directions with smaller σ_i. (See section 5.2 for experimental analysis of the intrinsic dimension.)
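For concreteness, the rank calculation can be sketched numerically (an illustrative sketch only: the toy generator `G` below is invented for this example, and its Jacobian is approximated by central differences):

```python
import numpy as np

# Toy "generator" G: R^3 -> R^4 that ignores its third latent coordinate,
# so its image is (at most) a 2-dimensional surface in the output space.
def G(z):
    return np.array([z[0], z[1], z[0] * z[1], z[0] ** 2])

def jacobian(G, z, eps=1e-6):
    """Central-difference approximation to the Jacobian of G at z."""
    z = np.asarray(z, dtype=float)
    m = np.asarray(G(z)).size
    J = np.empty((m, z.size))
    for j in range(z.size):
        e = np.zeros(z.size)
        e[j] = eps
        J[:, j] = (G(z + e) - G(z - e)) / (2 * eps)
    return J

z0 = np.array([0.5, -0.3, 2.0])
sigma = np.linalg.svd(jacobian(G, z0), compute_uv=False)
rank = int(np.sum(sigma > 1e-6))  # number of nondegenerate singular values
# rank is 2 here: locally G maps the latent space onto a 2-manifold
```

The number of singular values above a small threshold is exactly the local dimension k discussed above.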
Assuming we have k < min(m, n) non-zero singular values at some point z, then (as mentioned above) G is degenerate and it does not make sense to talk about a volume element on all of Z. However, we can consider the restriction of G as a map between k-manifolds, G: Z_k → X_k. This restriction of G will still be a diffeomorphism from Z_k onto its image, and we can talk about a volume element here. In particular, the change-of-variable formula will give

dV_X = (∏_{i=1}^{k} σ_i(z)) dV_Z    (1)

That is, volumes on Z_k and X_k differ by the product of the nonzero singular values at corresponding points z and G(z).
3 The Density on Output Space
The random vectors z ∈ Z are typically drawn from distributions that are easy to sample, e.g. each coordinate z_i may be an independent normal or uniform random variable. Whatever the distribution p_Z of z is, in conjunction with G it induces a density p_X on outputs x = G(z); in the case n = m the induced density would have (by change-of-variable) the well-known form

p_X(x) = p_Z(z) / |det J(z)|

where J is the Jacobian matrix of G (implicitly at z), and |det J| denotes the absolute value of the determinant (recall the determinant describes the volume element and is the product of the eigenvalues of J). However, we have n ≠ m and cannot use this formula directly. If J had full rank n we could replace the denominator with √(det(JᵀJ)), but as discussed above, we find ourselves in a still more degenerate case.
However, using equation 1 we can compute the induced density on X_k as

p_X(G(z)) = p_Z(z) / ∏_{i=1}^{k} σ_i(z)    (2)

where σ_1(z), …, σ_k(z) are the non-zero singular values of J at z. Note that we can also restrict to even lower-dimensional problems by discarding more singular values and vectors; in practice one typically sets a threshold below which any singular values are considered to be zero.
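Equation 2 transcribes directly into code (a minimal sketch; the threshold `tol` is an assumed parameter playing the role of the cutoff just described):

```python
import numpy as np

def induced_density(p_z, J, tol=1e-8):
    """Equation 2: p_X(G(z)) = p_Z(z) / product of the nonzero singular
    values of the Jacobian J at z. Singular values below `tol` are
    treated as zero and discarded, restricting the map to the k-manifold."""
    sigma = np.linalg.svd(J, compute_uv=False)
    sigma = sigma[sigma > tol]
    return p_z / np.prod(sigma)

# Degenerate linear example: a rank-2 Jacobian with singular values 3 and 2,
# so a latent density value of 1 is divided by 3 * 2 = 6.
J = np.array([[2.0, 0.0, 0.0],
              [0.0, 3.0, 0.0],
              [0.0, 0.0, 0.0]])
p = induced_density(1.0, J)
```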
4 Measuring Memorization
We introduce two measurements: the first is based on the density on lines (in latent space) joining two outputs, and considers how much the density drops in-between sample points. The second measure is based on the local rate of decay of the density about individual points, and provides a local measure of the density’s concentration.
It may help to consider figure 1: if G has memorized a few examples, it implies G must map large regions of Z to small regions of X. If we plot p_X as a function of distance (in X), we can examine both the drop in density between output samples and the rate at which the density decays around a given example or examples. Particularly for considering the decay rate, it is important that we measure distances on X as opposed to Z – that is, we should use distances on the output manifold, not distances in latent space. In cases of memorization, large regions in Z correspond to nearly constant output. This implies the density will appear to be spread over "large" neighborhoods when viewed as a function on Z, while simultaneously appearing obviously concentrated when viewed as a function on X (see again figure 1).
4.1 The Density Along Lines
When latent representations z_1, z_2 for outputs x_1, x_2 are known, one can construct a path joining the sample points in latent space. For example, the linear path

z(t) = (1 − t) z_1 + t z_2,  t ∈ [0, 1]

corresponds immediately to a path in output space

x(t) = G(z(t)).

However, we wish to detect point masses in output space, which means we should measure distances there as well. This unfortunately means we need to choose a metric on X. For a VAE, the standard reconstruction loss obtained from assuming a Gaussian distribution on output space leads to a square loss on output space. This means the ℓ₂ metric is in some sense natural in this case, and it is what we will use.¹ We can then reparametrize the path by arc length s on X to obtain x(s), by constructing s(t) as

s(t) = ∫₀ᵗ |x′(τ)| dτ

and then inverting the mapping (numerically) to obtain t(s). Finally, we can plot p_X(x(s)). In practice we compute the integral by summing distances between sample points along a discrete curve as described in procedure 1.

¹The ℓ₂ metric is typically a poor metric for comparing, say, images. However in our case a metric which made related objects seem close, independent of their visual details, might actually be a disadvantage. Using a metric in which all objects of a certain type were at distance zero from each other, we could not tell if the generator had memorized or not.
Procedure 1:
1. Consider a path z(t) joining two latent points and sample it at times t_0, …, t_N. Let z_i = z(t_i).
2. Using x_i = G(z_i), construct a discrete path in X, setting s_0 = 0 and s_i = s_{i−1} + |x_i − x_{i−1}|.
3. Using equation 2 this yields the discrete function

p_X(s_i) = p_Z(z_i) / ∏_{j=1}^{k} σ_j(z_i)

where the σ_j(z_i) are the non-zero singular values of J at z_i.
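The steps above can be sketched as follows (illustrative Python only; the finite-difference `jacobian` helper and the latent log-density `log_p_z` are assumed inputs, not part of the method's fixed implementation):

```python
import numpy as np

def jacobian(G, z, eps=1e-5):
    """Central-difference Jacobian of G at z."""
    z = np.asarray(z, dtype=float)
    m = np.asarray(G(z)).size
    J = np.empty((m, z.size))
    for j in range(z.size):
        e = np.zeros(z.size)
        e[j] = eps
        J[:, j] = (np.asarray(G(z + e)) - np.asarray(G(z - e))) / (2 * eps)
    return J

def density_along_path(G, log_p_z, z1, z2, N=101, tol=1e-8):
    """Procedure 1: sample the linear latent path from z1 to z2,
    accumulate arc length in output space, and evaluate the induced
    log-density (equation 2) at each sample."""
    ts = np.linspace(0.0, 1.0, N)
    zs = [(1 - t) * z1 + t * z2 for t in ts]
    xs = [np.asarray(G(z)) for z in zs]
    # s_i = s_{i-1} + |x_i - x_{i-1}|: distances measured in output space
    steps = [np.linalg.norm(xs[i] - xs[i - 1]) for i in range(1, N)]
    s = np.concatenate([[0.0], np.cumsum(steps)])
    log_p = []
    for z in zs:
        sigma = np.linalg.svd(jacobian(G, z), compute_uv=False)
        sigma = sigma[sigma > tol]
        log_p.append(log_p_z(z) - np.sum(np.log(sigma)))
    return s, np.array(log_p)
```

Plotting `log_p` against `s` gives the one-dimensional density profiles discussed in section 5.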
When looking along paths joining two endpoints, it is possible for the path to come near the latent representation of a third point. In this case, the density may noticeably rise in the middle, or it may simply cause a falsely high density in between the endpoints.
To avoid the false readings above, we suggest either
Using the local measure described next, or
Computing only along paths joining nearest neighbors. Indeed, perhaps the most convincing evidence of memorization comes from computing paths joining two nearby instances of a single class and seeing essentially no probability mass in the middle.
4.2 Local Measures
Rather than consider paths between sample points, one might wish for a local measure of concentration. One alternative to interpolating between endpoints would be to compute the probability mass contained in balls of radius r (measured in X, not Z) about a given set of points. If the mass increases rapidly as a function of r, this indicates memorization. However, this is essentially noncomputable, as we'd need to integrate the density over a very high-dimensional ball, and evaluating the density requires G⁻¹, which is unavailable.
It might be possible to use some random sampling or other methods to overcome or mitigate the objections above. However, looking at the rate of decay of p_X along a collection of paths passing through a point seems to provide a good measure of concentration. We consider a set of lines in Z passing through a given point z_0 and apply methodology similar to procedure 1. The question is: which lines are informative?
Note that choosing random directions or, say, every coordinate direction, will generally not work well. Section 2 explains why: one should consider decay only in nondegenerate singular directions. Using degenerate singular directions amounts to measuring the density along paths on which the output is constant, or nearly so. This does nothing but add noise and numerical instability to the calculations, and in the worst case (which can be easily observed by choosing degenerate directions, see figure 6) renders memorization and generalization indistinguishable.
1. Consider a path z(t) = z_0 + t·v for a nondegenerate singular vector v and point z_0.
2. Sample at times t_{−N}, …, t_0, …, t_N. Let z_i = z(t_i) (so z_0 here agrees with z_0 above).
3. Using x_i = G(z_i), construct a discrete path in X.
4. Set s_0 = 0 and s_i = s_{i−1} + |x_i − x_{i−1}| for i > 0 (similarly for negative i, but distances are negative).
5. Using equation 2 this yields the discrete function

p_X(s_i) = p_Z(z_i) / ∏_{j=1}^{k} σ_j(z_i)

where the σ_j(z_i) are the non-zero singular values of J at z_i.
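A sketch of this local procedure, in the same illustrative style (the finite-difference `jacobian` helper and latent log-density `log_p_z` are again assumed inputs):

```python
import numpy as np

def jacobian(G, z, eps=1e-5):
    """Central-difference Jacobian of G at z."""
    z = np.asarray(z, dtype=float)
    m = np.asarray(G(z)).size
    J = np.empty((m, z.size))
    for j in range(z.size):
        e = np.zeros(z.size)
        e[j] = eps
        J[:, j] = (np.asarray(G(z + e)) - np.asarray(G(z - e))) / (2 * eps)
    return J

def decay_about_point(G, log_p_z, z0, t_max=1.0, N=50, tol=1e-8):
    """Move from z0 along the top right-singular vector of the Jacobian,
    accumulate signed arc length in output space, and evaluate the
    induced log-density (equation 2) at each sample."""
    _, _, Vt = np.linalg.svd(jacobian(G, z0))
    v = Vt[0]  # nondegenerate (largest) singular direction
    ts = np.linspace(-t_max, t_max, 2 * N + 1)
    zs = [z0 + t * v for t in ts]
    xs = [np.asarray(G(z)) for z in zs]
    s = np.zeros(len(ts))
    for i in range(N + 1, len(ts)):   # positive side of the path
        s[i] = s[i - 1] + np.linalg.norm(xs[i] - xs[i - 1])
    for i in range(N - 1, -1, -1):    # negative side: distances are negative
        s[i] = s[i + 1] - np.linalg.norm(xs[i + 1] - xs[i])
    log_p = []
    for z in zs:
        sigma = np.linalg.svd(jacobian(G, z), compute_uv=False)
        sigma = sigma[sigma > tol]
        log_p.append(log_p_z(z) - np.sum(np.log(sigma)))
    return s, np.array(log_p)
```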
There are several ways to combine the decay measures from each singular direction into a single measure. In section 5.3 we propose means of second differences of log p_X (a measure of peakiness, or concentration), but other variants (e.g. taking the maximum) are possible.
5 Computation and Results
Here we discuss some computational details and illustrate the methods of section 4 on a pair of VAEs trained on the MNIST dataset. The two VAEs have identical architectures; one was well-trained, while the other was deliberately trained to overfit and memorize the data set.
5.1 Computing the Jacobian
Our experiments were performed with the Keras frontend to TensorFlow. While TensorFlow supports automatic differentiation for scalar-valued functions, there is no support for automatic differentiation of vector-valued functions (this is awkward to implement using reverse-mode automatic differentiation). Hence, we are unable to use autodiff to compute Jacobian matrices. We instead use a simple central-difference approximation for each entry

J_ij(z) ≈ (G_i(z + ε e_j) − G_i(z − ε e_j)) / (2ε)

for some small ε > 0. This requires 2n forward evaluations of G for a latent space of dimension n, each producing all m outputs. However, each evaluation is a neural network forward pass and easily parallelizable.
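Because the 2n perturbed points are independent, they can be evaluated in one batched forward pass. A minimal sketch (assuming a batched forward function `G_batch` mapping a (batch, n) array of latent vectors to a (batch, m) array of outputs, as a Keras generator would):

```python
import numpy as np

def numerical_jacobian(G_batch, z, eps=1e-5):
    """Central-difference Jacobian of a generator at z.

    All 2n perturbed latent points are evaluated in two batched calls,
    mirroring how the forward passes would be parallelized in practice.
    """
    z = np.asarray(z, dtype=float)
    n = z.size
    E = eps * np.eye(n)
    plus = G_batch(z[None, :] + E)    # row j holds G(z + eps * e_j)
    minus = G_batch(z[None, :] - E)   # row j holds G(z - eps * e_j)
    return (plus - minus).T / (2 * eps)   # (m, n) Jacobian
```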
5.2 The Intrinsic Dimension of the Manifolds
The SVD of the Jacobian tells us about the intrinsic dimension of the generator function near a point. In figure 2 we plot the 20 largest singular values in decreasing order, averaged over 1000 randomly-chosen points in the training set. For both the well-trained and overfit networks, there are no significant singular values beyond the 14th, and particularly for the overfit network the decay begins earlier. The well-trained network has larger singular values overall, reflecting the fact that its generator covers greater volumes of the output space. It seems reasonable to declare the intrinsic dimension of either map to be no greater than 14 dimensions. (Applying SVD to the point cloud of latent representations, as opposed to tangent spaces, also suggests the dimension of the latent representation is no greater than 14.) In the second plot, we show decay of singular values about 4 training examples from different classes in the well-trained network. These decay at various rates, suggesting that the effective dimension of the map varies across classes, so locally the map may be lower-dimensional in some areas.
We use the methods of section 4
to explore the differences between two VAEs trained on MNIST. Each used an identical architecture consisting of several convolutional layers with ReLU and max-pooling, followed by fully-connected layers to compute means and log-variances for the latent distributions, followed by fully-connected layers and transposed convolutions with strides (each with ReLU) for the generator. Data was normalized to lie in a fixed range, and a matching final output layer was applied.
Both models used a 100-dimensional latent representation with a latent prior of independent standard normal variables. However, the first was trained on only 100 examples, while the second was trained on the entire 60,000-example training set. The difference was stark: the overfit model had huge dips in density between training samples, and the decay measure likewise showed huge concentration of mass on training examples.
Figures 4, 5 show the two-point interpolation method of section 4.1. The endpoints of each path correspond to two training examples in the same class; the well-trained network does a much better job of interpolating the endpoints, and this is reflected in the log-density plots. The well-trained network places significant mass in the regions between the endpoints, whereas the overtrained network places essentially all its mass at the endpoints.
Figure 3 shows the local decay method of section 4.2. These plots are obtained using a line in the largest singular direction, although similar plots are obtained using other non-degenerate singular directions. The conclusion is similar: the overtrained network concentrates the density much more closely on each example.
Finally, in figure 6 we show what happens when we look in degenerate directions: the plots become meaningless, as we're looking in directions in which G is nearly constant. Memorization and generalization become extremely difficult to distinguish based on the plots alone.
We obtain a single score from both methods as follows:
We consider log p_X along lines joining each training sample to its nearest neighbor. We compute second differences

D_i = p_i(s_0) − 2 p_i(s_mid) + p_i(s_N)

where p_i is the log-density along the i-th path, s_0 and s_N correspond to the endpoints of the path and s_mid is the midpoint. We then average these second differences over the M training samples:

Dip = (1/M) ∑_{i=1}^{M} D_i
For the local decay measure, we also consider second differences, but now at several radii r about the central point along the largest singular direction, normalized by radius:

D(r) = (p(r) − 2 p(0) + p(−r)) / r²

For a given set of radii r_1, …, r_R we compute the mean decay of the peak:

Decay = (1/R) ∑_{j=1}^{R} D(r_j)
The motivation for the multiscale method is to obtain robustness to various shapes of decay curves by averaging over several scales. In experiments we use radii of and .
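Both scores reduce to second differences of a sampled log-density. A sketch (illustrative; `log_p` is a sampled log-density along a path as produced by the procedures of section 4, and the radii are assumed inputs):

```python
import numpy as np

def dip_score(log_p):
    """Second difference along one nearest-neighbor path: endpoints minus
    twice the midpoint. Positive values mean the density dips between
    the two samples, suggesting memorization."""
    mid = len(log_p) // 2
    return log_p[0] - 2 * log_p[mid] + log_p[-1]

def decay_score(s, log_p, radii):
    """Mean radius-normalized second difference about the central point
    of a path centered on a sample (negative for a peaked density)."""
    mid = len(s) // 2
    vals = []
    for r in radii:
        ip = int(np.argmin(np.abs(s - (s[mid] + r))))  # nearest sample at +r
        im = int(np.argmin(np.abs(s - (s[mid] - r))))  # nearest sample at -r
        vals.append((log_p[ip] - 2 * log_p[mid] + log_p[im]) / r ** 2)
    return float(np.mean(vals))
```

Averaging `dip_score` over all nearest-neighbor paths gives the mean dip; averaging `decay_score` over sample points gives the mean decay.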
Dip and peak results are summarized in the following table (note that positive dip and negative decay indicate memorization, since the curvatures are opposite):
Model | Mean Dip | Mean Decay
The results differ by an order of magnitude in each case. The "Memorized" network exhibits a huge drop in log-probability in between samples and a very peaky density. In contrast, the dip and peak scores for the well-trained network show that, if anything, the density increases slightly away from training samples.
"Memorization" in generative models means learning an output distribution which is concentrated on a finite number of output examples. We have introduced methods for studying the output distribution and its concentration in the case where the latent density is easily to evaluate and the generator is a fixed function which is difficult to invert. The main difficulty (the apparent degeneracy of the generator function ) is overcome by noting that it is in fact a smooth map between submanifolds and its image and we introduce machinery for computing the induced density on .
Radford, A., Metz, L. & Chintala, S. (2015) Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv preprint abs/1511.06434.
Gregor, K., Danihelka, I., Graves, A. & Wierstra, D. (2015) DRAW: A Recurrent Neural Network For Image Generation. arXiv preprint abs/1502.04623.
Shifrin, T. (2005) Multivariable Mathematics: Linear Algebra, Multivariable Calculus, and Manifolds. Hoboken, NJ: Wiley.
do Carmo, M. (2013) Riemannian Geometry. Boston, MA: Birkhäuser Boston.