Latent Geometry and Memorization in Generative Models

05/25/2017 · by Matt Feiszli, et al. · Evolv Technologies

It can be difficult to tell whether a trained generative model has learned to generate novel examples or has simply memorized a specific set of outputs. In published work, it is common to attempt to address this visually, for example by displaying a generated example and its nearest neighbor(s) in the training set (in, for example, the L2 metric). As any generative model induces a probability density on its output domain, we propose studying this density directly. We first study the geometry of the latent representation and generator, relate this to the output density, and then develop techniques to compute and inspect the output density. As an application, we demonstrate that "memorization" corresponds to a density that tends to a collection of delta functions concentrated on the memorized examples. We note that without first understanding the geometry, the measurement would be essentially impossible to make.

1 Introduction

Variational Auto-Encoder (VAE) and Generative Adversarial Network (GAN) models have enjoyed considerable success recently in generating natural-looking images. However, in many cases it can be difficult to tell when a trained generator has actually learned to generate new examples; it is entirely possible for a VAE to simply memorize a training set. GAN training provides only indirect access to a training set, so direct memorization is less of an issue. However, it is still possible for any generative model to concentrate its probability mass on a small set of outputs, and the intrinsic dimension of the output is unclear. To identify memorization, experimenters often provide visual evidence: e.g. some generated examples may be shown alongside their nearest neighbors in a training set (e.g. [1], [2]). If the neighbors differ from the generated example, memorization is declared to be unlikely. Alternatively, outputs along a path in latent space may be plotted; if the output changes smoothly this suggests generalization, whereas sudden changes suggest memorization.

Instead, we propose studying the induced probability density on output space. To do this, we must change variables and transform a density on the latent space to a density on the output space. This computation requires some care, as the data lie on low-dimensional submanifolds of the input and output space, and the standard formulas will become degenerate and fail. In what follows we first establish the local geometry of the situation, which allows us to obtain a formula for the output density. We then introduce ways to measure the degree to which a generator has memorized, and show experimental results.

This enables us to characterize "memorizing" as learning a probability density on output space which concentrates its mass on a finite number of points (in the limit, the learned measure tends to a collection of delta functions). In contrast, generalization implies a density which smoothly interpolates points, assigning mass to large regions of output space.

2 Mapping Latent Space to Output Space

Consider a trained generative model, where a learned (but now fixed) generator function $g \colon Z \to X$ maps a space $Z \subset \mathbb{R}^n$ of random variables $z$ to an output space $X \subset \mathbb{R}^m$. We assume our generator mapping $g$ is differentiable. (This may be false, particularly when $g$ is a neural network with non-differentiable nonlinearities, but $g$ will still be piecewise smooth.) We further assume that $g^{-1}$ is so difficult to compute that it is effectively unavailable.

A main difficulty is that the latent space $Z$ may be large relative to the intrinsic dimension of the learned representation. The latent dimension $n$ is typically chosen to be "large enough" for the problem at hand, and may be larger than necessary. That is, the learned latent representation may have dimension $k < n$. Assuming the typical case where $n < m$, we observe that $g$ can only map the latent space onto a submanifold of the output space with dimension at most $n$. Thus we see that $Z_k \subset Z$ is a latent (sub)manifold of dimension $k$ and $M = g(Z_k) \subset X$ is our output (sub)manifold of dimension at most $k$.

In particular, as a map from $\mathbb{R}^n$ to $\mathbb{R}^m$, we see that $g$ is degenerate as its range lies on a low-dimensional submanifold.

2.1 Tangent Spaces, Singular Vectors, and the Volume Element

While global understanding of $g$ is not possible in general, the local behavior can be understood by computing the Jacobian matrix $J$ and considering the linearized map $g(z + \delta z) \approx g(z) + J \delta z$ for $\delta z$ small. In particular, the rank of $J$ tells us about the intrinsic dimension of the manifold near a point (see [3]). Further, the singular value decomposition (SVD) allows us to write

$J = U \Sigma V^T$,

where $U$ and $V$ are orthogonal matrices whose columns (the "singular vectors") span $\mathbb{R}^m$ and $\mathbb{R}^n$ respectively, and $\Sigma$ is a diagonal matrix (the "singular values"). It follows that the right and left singular vectors corresponding to non-zero singular values form a basis for the tangent spaces to $Z_k$ and $M$ ([3], [4]). The singular vectors of $J$ with degenerate (i.e. zero) singular values correspond to subspaces which get collapsed (via projection) onto the tangent space before the linearized mapping $J$ is applied.

In other words, if $J$ has $k$ non-zero singular values, $g$ locally maps $Z$ onto a $k$-manifold in $X$. Moving in directions corresponding to large singular values $\sigma_i$ will cause greater change (in the $L^2$ distance) in the output than directions with smaller $\sigma_i$. (See section 5.2 for experimental analysis of the intrinsic dimension.)
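As an illustrative aside (our own sketch, not code from the paper), the nondegenerate singular directions and the effective rank can be extracted from a precomputed Jacobian as follows; the name `tangent_basis` and the relative threshold `tol` are our choices.

```python
import numpy as np

def tangent_basis(J, tol=1e-6):
    """Split the SVD of an (m x n) Jacobian into its nondegenerate part.

    Columns of U_k span the tangent space to the output manifold M,
    columns of V_k span the tangent space to the latent manifold Z_k,
    and sigma_k are the corresponding non-zero singular values.
    """
    U, sigma, Vt = np.linalg.svd(J, full_matrices=False)
    k = int(np.sum(sigma > tol * sigma[0]))  # effective rank of J
    return U[:, :k], sigma[:k], Vt[:k, :].T, k
```

The columns of `V_k` returned here are the nondegenerate latent directions used by the local measures of section 4.2.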

Assuming we have $k < n$ non-zero singular values at some point $z$, then (as mentioned above) $g$ is degenerate and it does not make sense to talk about a volume element on all of $Z$. However, we can consider the restriction of $g$ to $Z_k$ as a map between $k$-manifolds. This restriction of $g$ will still be a diffeomorphism from $Z_k$ onto its image, and we can talk about a volume element here. In particular, the change-of-variable formula will give

$dV_X = \left( \prod_{i=1}^{k} \sigma_i(z) \right) dV_Z$.   (1)

That is, volumes on $Z_k$ and $M$ differ by the product of the nonzero singular values at corresponding points $z$ and $x = g(z)$.

3 The Density on Output Space

The random vectors $z$ are typically drawn from distributions that are easy to sample, e.g. each coordinate $z_i$ may be an independent normal or uniform random variable. Whatever the distribution $p_Z$ of $z$ is, in conjunction with $g$ it induces a density $p_X$ on outputs $x = g(z)$; in the case $n = m$ the induced density would have (by change-of-variable) the well-known form

$p_X(x) = p_Z(z) \, / \, |\det J|$, with $x = g(z)$,

where $J$ is the Jacobian matrix of $g$ (implicitly at $z$), and $\det$ denotes the determinant (recall the determinant describes the volume element and is the product of eigenvalues of $J$). However, we have $n < m$ and cannot use this formula directly. If $J$ had rank $n$ we could replace the denominator with $\sqrt{\det(J^T J)}$, but as discussed above, we find ourselves in a still more degenerate case.

However, using equation 1 we can compute the induced density as

$p_X(g(z)) = p_Z(z) \, / \, \prod_{i=1}^{k} \sigma_i(z)$,   (2)

where $\sigma_1(z), \ldots, \sigma_k(z)$ are the non-zero singular values of $J$ at $z$. Note that we can also restrict to even lower-dimensional problems by discarding more singular values and vectors; in practice one typically sets a threshold below which any singular values are considered to be zero.
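As a concrete illustration (our own sketch), equation 2 can be evaluated in log form once the singular values at $z$ are known; the helper `log_prior_gaussian` assumes an independent standard normal prior on $Z$ and is purely illustrative.

```python
import numpy as np

def log_prior_gaussian(z):
    """Log-density of an assumed independent standard normal prior on Z."""
    z = np.asarray(z, dtype=np.float64)
    return -0.5 * np.dot(z, z) - 0.5 * z.size * np.log(2.0 * np.pi)

def log_density_output(log_p_z, sigma, tol=1e-6):
    """Equation 2 in log form: log p_X(g(z)) = log p_Z(z) - sum_i log sigma_i(z),
    keeping only singular values above a relative threshold."""
    sigma = np.asarray(sigma, dtype=np.float64)
    keep = sigma > tol * sigma.max()
    return log_p_z - np.sum(np.log(sigma[keep]))
```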

4 Measuring Memorization

We introduce two measurements: the first is based on the density on lines (in latent space) joining two outputs, and considers how much the density drops in-between sample points. The second measure is based on the local rate of decay of the density about individual points, and provides a local measure of the density’s concentration.

It may help to consider figure 1: if $g$ has memorized a few examples, it implies $g$ must map large regions of $Z$ to small regions of $X$. If we plot $p_X$ as a function of distance (in $X$), we can examine both the drop in density between output samples and the rate at which the density decays around a given example or examples. Particularly for considering the decay rate, it is important that we consider $p_X$ as opposed to $p_Z$ – that is, we should use distances on $X$, not distances in $Z$. In cases of memorization, large regions in $Z$ correspond to nearly constant output. This implies the density will appear to be spread over "large" neighborhoods when viewed as a function on $Z$, while simultaneously appearing obviously concentrated in $X$ (see again figure 1).

Figure 1: A mapping taking a uniform density in $Z$ to a density concentrated on two points in $X$. Dot spacing corresponds to density, i.e. the density is lower in-between the points of concentration in $X$.

4.1 The Density Along Lines

When latent representations $z_0, z_1$ for two outputs are known, one can construct a path joining the sample points in latent space. For example, the linear path

$z(t) = (1-t)\, z_0 + t\, z_1$, for $t \in [0, 1]$,

corresponds immediately to a path in output space

$x(t) = g(z(t))$.

However, we wish to detect point masses in output space, which means we should measure distances there as well. This unfortunately means we need to choose a metric on $X$. For a VAE, the standard reconstruction loss obtained from assuming a Gaussian distribution on output space leads to a square loss on output space. This means the $L^2$ metric is in some sense natural in this case, and it is what we will use. (The $L^2$ metric is typically a poor metric for comparing, say, images. However, in our case a metric which made related objects seem close, independent of their visual details, might actually be a disadvantage: using a metric in which all objects of a certain type were at distance zero from each other, we could not tell whether the generator had memorized or not.) We can then reparametrize to obtain $p_X$ as a function of arclength $s$ by constructing

$s(t) = \int_0^t \| x'(\tau) \|_2 \, d\tau$

and then inverting the mapping (numerically) to obtain $t(s)$. Finally, we can plot $p_X(x(t(s)))$. In practice we compute the integral by summing $L^2$ distances between sample points along a discrete curve as described in procedure 1.

In summary:

  1. Consider a path $z(t)$ joining two latent points and sample it at times $t_0, \ldots, t_N$. Let $z_i = z(t_i)$.

  2. Using $x_i = g(z_i)$, construct a discrete path in $X$.

  3. Set $s_0 = 0$ and $s_i = s_{i-1} + \| x_i - x_{i-1} \|_2$.

  4. Using equation 2 this yields the discrete function

    $p_X(s_i) = p_Z(z_i) \, / \, \prod_j \sigma_j(z_i)$,

    where the $\sigma_j(z_i)$ are the non-zero singular values of $J$ at $z_i$.

Procedure 1 Computing density along paths in $X$
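A minimal numpy sketch of procedure 1 (ours, not the paper's code), assuming a generator callable `g`, a Jacobian routine `jacobian` (for example the finite-difference version of section 5.1), and a latent log-density `log_prior` are available:

```python
import numpy as np

def density_along_path(g, jacobian, log_prior, z0, z1, n_steps=50, tol=1e-6):
    """Procedure 1 (sketch): discretize the line from z0 to z1 in latent
    space, accumulate L2 arclength in output space, and evaluate the
    induced log-density of equation 2 at each sample."""
    ts = np.linspace(0.0, 1.0, n_steps)
    zs = [(1.0 - t) * z0 + t * z1 for t in ts]
    xs = [np.asarray(g(z)).ravel() for z in zs]

    s = np.zeros(n_steps)                            # arclength in X
    for i in range(1, n_steps):
        s[i] = s[i - 1] + np.linalg.norm(xs[i] - xs[i - 1])

    log_px = np.empty(n_steps)                       # log p_X at each sample
    for i, z in enumerate(zs):
        sigma = np.linalg.svd(jacobian(z), compute_uv=False)
        keep = sigma > tol * sigma.max()
        log_px[i] = log_prior(z) - np.sum(np.log(sigma[keep]))
    return s, log_px
```

Plotting `log_px` against `s` gives log-probability as a function of arclength, as in figures 4 and 5.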

Remarks:

  1. When looking along paths joining two endpoints, it is possible for the path to come near the latent representation of a third point. In this case, the density may noticeably rise in the middle, or it may simply cause a falsely high density in between the endpoints.

  2. To avoid the false readings above, we suggest either

    1. Using the local measure described next, or

    2. Computing only along paths joining nearest neighbors. Indeed, perhaps the most convincing evidence of memorization comes from computing paths joining two nearby instances of a single class and seeing essentially no probability mass in the middle.

4.2 Local Measures

Rather than consider paths between sample points, one might wish for a local measure of concentration. One alternative to interpolating between endpoints would be to compute the probability mass contained in balls of radius $r$ (measured in $X$, not $Z$) about a given set of points. If the mass increases rapidly as a function of $r$, this indicates memorization. However, this is essentially noncomputable, as we’d need to integrate the density over a very high-dimensional ball, and evaluating the density requires $g^{-1}$, which is unavailable.

It might be possible to use some random sampling or other methods to overcome or mitigate the objections above. However, looking at the rate of decay of $p_X$ along a collection of paths passing through a point seems to provide a good measure of concentration. We consider a set of lines in $Z$ passing through a given point and apply methodology similar to procedure 1. The question is: which lines are informative?

Note that choosing random directions or, say, every coordinate direction, will generally not work well. Section 2 explains why: one should consider decay only in nondegenerate singular directions. Using degenerate singular directions amounts to measuring the density along paths on which the output is constant, or nearly so. This does nothing but add noise and numerical instability to the calculations, and in the worst case (which can be easily observed by choosing degenerate directions, see figure 6) renders memorization and generalization indistinguishable.

Similar to procedure 1, we have procedure 2 for computing decay:

  1. Consider a path $z(t) = z_0 + t\, v$ for a nondegenerate singular vector $v$ and point $z_0$.

  2. Sample at times $t_{-N}, \ldots, t_0, \ldots, t_N$ with $t_0 = 0$. Let $z_i = z(t_i)$ (so $z_0$ here agrees with $z_0$ above).

  3. Using $x_i = g(z_i)$, construct $x_i$ in $X$.

  4. Set $s_0 = 0$ and $s_i = s_{i-1} + \| x_i - x_{i-1} \|_2$ for $i > 0$ (similarly for negative $i$, but distances are negative).

  5. Using equation 2 this yields the discrete function

    $p_X(s_i) = p_Z(z_i) \, / \, \prod_j \sigma_j(z_i)$,

    where the $\sigma_j(z_i)$ are the non-zero singular values of $J$ at $z_i$.

Procedure 2 Computing decay of $p_X$ about points in $X$
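A corresponding sketch of procedure 2 (ours), under the same assumptions as the procedure-1 sketch (callables `g`, `jacobian`, and `log_prior`); the radius and number of steps are illustrative defaults, not values from the paper:

```python
import numpy as np

def decay_about_point(g, jacobian, log_prior, z0, v, radius=1.0,
                      n_steps=25, tol=1e-6):
    """Procedure 2 (sketch): sample z(t) = z0 + t*v along a nondegenerate
    singular direction v, compute signed arclength in output space, and
    evaluate the induced log-density of equation 2 at each sample."""
    ts = np.linspace(-radius, radius, 2 * n_steps + 1)
    zs = [z0 + t * v for t in ts]
    xs = [np.asarray(g(z)).ravel() for z in zs]

    c = n_steps                                      # index of t = 0
    s = np.zeros(len(ts))                            # signed arclength in X
    for i in range(c + 1, len(ts)):                  # forward from the center
        s[i] = s[i - 1] + np.linalg.norm(xs[i] - xs[i - 1])
    for i in range(c - 1, -1, -1):                   # backward: negative distances
        s[i] = s[i + 1] - np.linalg.norm(xs[i + 1] - xs[i])

    log_px = np.empty(len(ts))
    for i, z in enumerate(zs):
        sigma = np.linalg.svd(jacobian(z), compute_uv=False)
        keep = sigma > tol * sigma.max()
        log_px[i] = log_prior(z) - np.sum(np.log(sigma[keep]))
    return s, log_px
```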

Remarks:

  • There are several ways to combine the decay measures from each singular direction into a single measure. In section 5.3 we propose means of second differences of $\log p_X$ (a measure of peakiness or concentration), but other variants (e.g. taking the maximum) are possible.

5 Computation and Results

Here we discuss some computational details and illustrate the methods of section 4 on a pair of VAEs trained on the MNIST dataset. The two share an identical architecture; one VAE was well-trained while the other was trained to overfit and memorize the data set.

5.1 Computing the Jacobian

Our experiments were performed with the Keras frontend to Tensorflow. While Tensorflow supports automatic differentiation for scalar-valued functions, there is no support for automatic differentiation of vector-valued functions (this is awkward to implement using reverse-mode automatic differentiation). Hence, we are unable to use autodiff to compute Jacobian matrices. We instead use a simple central-difference approximation for each entry,

$J_{ij} \approx \left( g_i(z + \epsilon e_j) - g_i(z - \epsilon e_j) \right) / (2 \epsilon)$,

for some small $\epsilon$, where $e_j$ is the $j$-th coordinate direction in latent space. This requires $2n$ forward evaluations of $g$ for a latent space of size $n$, each producing all $m$ coordinates of the output space. However, each evaluation is a neural network forward pass and easily parallelizable.
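A minimal numpy sketch of this finite-difference Jacobian (ours); `eps` is an illustrative default, and in practice the $2n$ perturbed latent vectors would be batched through the network rather than looped over in Python:

```python
import numpy as np

def jacobian_central_difference(g, z, eps=1e-3):
    """Central-difference approximation of the (m x n) Jacobian of the
    generator g at latent point z: column j is
    (g(z + eps*e_j) - g(z - eps*e_j)) / (2*eps)."""
    z = np.asarray(z, dtype=np.float64)
    cols = []
    for j in range(z.size):
        e = np.zeros(z.size)
        e[j] = eps
        cols.append((np.asarray(g(z + e)).ravel()
                     - np.asarray(g(z - e)).ravel()) / (2.0 * eps))
    return np.stack(cols, axis=1)                    # shape (m, n)
```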

5.2 The Intrinsic Dimension of the Manifolds

The SVD of the Jacobian tells us about the intrinsic dimension of the generator function near a point. In figure 2 we plot the 20 largest singular values in decreasing order, averaged over 1000 randomly-chosen points in the training set. For both the well-trained and overfit networks, there are no significant singular values beyond the 14th, and particularly for the overfit network the decay begins earlier. The well-trained network has larger singular values overall, reflecting the fact that its generator covers greater volumes of the output space. It seems reasonable to declare the intrinsic dimension of either map to be no greater than 14 dimensions. (Applying SVD to the point cloud of latent representations, as opposed to tangent spaces, also suggests the dimension of the latent representation is no greater than 14.) In the second plot, we show decay of singular values about 4 training examples from different classes in the well-trained network. These decay at various rates, suggesting that the effective dimension of the map varies across classes, so locally the map may be lower-dimensional in some areas.

Figure 2: Left: Mean decay of the first 20 singular values of the Jacobian at 1000 randomly-chosen training samples suggests an intrinsic dimension of at most 14. Right: Decay of singular values near four points from different classes. The effective dimension appears lower near certain examples than others.
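Such a plot could be produced along the following lines (our sketch, not the paper's code): average the sorted singular spectra of the Jacobian over a sample of latent points, then count how many values exceed a relative threshold; the threshold is our own illustrative choice.

```python
import numpy as np

def mean_singular_spectrum(jacobian, zs, top=20):
    """Average the `top` largest singular values of the Jacobian over a
    collection of latent points, as in the left panel of figure 2."""
    spectra = [np.sort(np.linalg.svd(jacobian(z), compute_uv=False))[::-1][:top]
               for z in zs]
    return np.mean(np.stack(spectra), axis=0)

def intrinsic_dimension(mean_sigma, rel_tol=1e-2):
    """Count averaged singular values above a relative threshold of the largest."""
    return int(np.sum(mean_sigma > rel_tol * mean_sigma[0]))
```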

5.3 Results

We use the methods of section 4 to explore the differences between two VAEs trained on MNIST. Each used an identical architecture consisting of several convolutional layers with ReLU and max-pooling, followed by fully-connected layers to compute means and log-variances for the latent distributions, followed by fully-connected layers and transposed convolutions with strides (each with ReLU) for the generator. Data was normalized to a fixed range and a final output layer was applied.

Both models used a 100-dimensional latent representation with a latent prior of independent normal components. However, the first was trained on only 100 examples, while the second was trained on the entire 60,000-example training set. The difference was stark: the overfit model had huge dips in density in between training samples, and the decay measure also showed huge concentration of mass on training examples.

Figures 4, 5 show the two-point interpolation method of section 4.1. The endpoints of each path correspond to two training examples in the same class; the well-trained network does a much better job of interpolating the endpoints, and this is reflected in the log-density plots. The well-trained network places significant mass in the regions between the endpoints, whereas the overtrained network places essentially all its mass at the endpoints.

Figure 3 shows the local decay method of section 4.2. These plots are obtained using a line in the largest singular direction, although similar plots are obtained using other non-degenerate singular directions. The conclusion is similar: the overtrained network concentrates the density much more closely on each example.

Finally, in figure 6 we show what happens when we look in degenerate directions: the plots become meaningless, as we’re looking in directions in which $g$ is nearly constant. Memorization and generalization become extremely difficult to distinguish based on the plots alone.

Figure 3: Left: Decay of $p_X$ along the largest singular direction through a training sample of a "4". Right: similar, through a "1". Solid is overfit, dashed is well-trained.
Figure 4: Above left: a path in the well-trained network smoothly interpolates between two digits. Above right: the same path in the overfitted network remains constant and then crossfades in the middle. Below: $\log p_X$, i.e. log-probability as a function of arclength in $X$. Dashed is well-trained, solid is overfit.
Figure 5: Top left: well-trained. Top right: overfit. Bottom: $\log p_X$. While the well-trained network interpolates better than the overfit network, the interpolations are visibly worse than in figure 4 and this is reflected in the dip in $\log p_X$. Dashed is well-trained, solid is overfit.
Figure 6: Decay of $p_X$ in a direction with vanishingly small singular value, through a training sample of a "1". Solid is overfit, dashed is well-trained. Note the horizontal axis’ scale: decay is almost immediate and both curves would essentially be delta functions if superimposed on figure 3. (The greater absolute value of the overfit curve reflects the fact that the mass is concentrated on a handful of examples.) In any event, degenerate directions carry essentially no information and will only destabilize calculations.

We obtain a single score from both methods as follows:

  1. We consider $\log p_X$ along lines joining each training sample to its nearest neighbor. We compute second differences

    $D_i = \log p_i(s_0) - 2 \log p_i(s_{\mathrm{mid}}) + \log p_i(s_1)$,

    where $p_i$ is the density along the $i$-th path, $s_0$ and $s_1$ correspond to the endpoints of the path and $s_{\mathrm{mid}}$ is the midpoint. We then average these second differences over the training set:

    $\mathrm{Dip} = \frac{1}{N} \sum_i D_i$.

  2. For the local decay measure, we also consider second differences, but now at several radii $r$ about the central point along the largest singular direction, normalized by radius:

    $d_i(r) = \left( \log p_i(-r) - 2 \log p_i(0) + \log p_i(r) \right) / r$.

    For a given set of radii $R$ we compute the mean decay of the peak:

    $\mathrm{Decay} = \frac{1}{N \, |R|} \sum_i \sum_{r \in R} d_i(r)$.

    The motivation for the multiscale method is to obtain robustness to various shapes of decay curves by averaging over several scales. In experiments we average over several radii; a sketch of both scores is given below.
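A sketch of both scores (ours), assuming the discrete curves of $\log p_X$ against arclength have already been computed with procedures 1 and 2; the interpolation at radius $r$ is an illustrative choice:

```python
import numpy as np

def dip_score(log_px_paths):
    """Mean second difference of log p_X along nearest-neighbour paths:
    endpoint + endpoint - 2 * midpoint, averaged over paths.
    Positive values indicate a dip between samples (memorization)."""
    dips = [lp[0] - 2.0 * lp[len(lp) // 2] + lp[-1] for lp in log_px_paths]
    return float(np.mean(dips))

def decay_score(s, log_px, radii):
    """Mean radius-normalized second difference of log p_X about the
    central sample of a procedure-2 curve, averaged over the given radii.
    Negative values indicate a sharp peak at the sample (memorization)."""
    log_p = lambda r: np.interp(r, s, log_px)        # log p_X at signed arclength r
    vals = [(log_p(-r) - 2.0 * log_p(0.0) + log_p(r)) / r for r in radii]
    return float(np.mean(vals))
```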

Dip and peak results are summarized in the following table (note that positive dip and negative decay indicate memorization, since the curvatures are opposite):

Model           Mean Dip    Mean Decay
Memorized         1.07        -3.40
Well-trained     -0.242        0.0369

The results differ by an order of magnitude in each case. The "Memorized" network exhibits a huge drop in log-probability in between samples and a very peaky density. In contrast, the dip and peak scores for the well-trained network show that, if anything, the density increases slightly away from training samples.

6 Conclusion

"Memorization" in generative models means learning an output distribution which is concentrated on a finite number of output examples. We have introduced methods for studying the output distribution and its concentration in the case where the latent density is easily to evaluate and the generator is a fixed function which is difficult to invert. The main difficulty (the apparent degeneracy of the generator function ) is overcome by noting that it is in fact a smooth map between submanifolds and its image and we introduce machinery for computing the induced density on .

References

[1] Radford, A., Metz, L., & Chintala, S. (2015) Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv preprint arXiv:1511.06434.

[2] Gregor, K., Danihelka, I., Graves, A., & Wierstra, D. (2015) DRAW: A Recurrent Neural Network For Image Generation. arXiv preprint arXiv:1502.04623.

[3] Shifrin, T. (2005) Multivariable Mathematics: Linear Algebra, Multivariable Calculus, and Manifolds. Hoboken, NJ: Wiley.

[4] do Carmo, M. (2013) Riemannian Geometry. Boston, MA: Birkhäuser Boston.