Latent Space Cartography: Generalised Metric-Inspired Measures and Measure-Based Transformations for Generative Models

02/06/2019 ∙ by Max F. Frenzel, et al. ∙ 14

Deep generative models are universal tools for learning data distributions on high dimensional data spaces via a mapping to lower dimensional latent spaces. We provide a study of latent space geometries and extend and build upon previous results on Riemannian metrics. We show how a class of heuristic measures gives more flexibility in finding meaningful, problem-specific distances, and how it can be applied to diverse generator types such as autoregressive generators commonly used in e.g. language and other sequence modeling. We further demonstrate how a diffusion-inspired transformation previously studied in cartography can be used to smooth out latent spaces, stretching them according to a chosen measure. In addition to providing more meaningful distances directly in latent space, this also provides a unique tool for novel kinds of data visualizations. We believe that the proposed methods can be a valuable tool for studying the structure of latent spaces and learned data distributions of generative models.


1 Introduction

Deep generative models such as Variational Autoencoders (VAEs) [1, 2] and Generative Adversarial Networks (GANs) [3] have played a prominent role in the advancement of unsupervised learning. While the details of the architectures vary widely, the general objective is the same across generative models. Given a set of observations $x$ in an observation space $\mathcal{X}$ of dimension $D$, we want to learn a function that can model the data via latent variables $z$ in a latent space $\mathcal{Z}$, which is usually of much lower dimension $d \ll D$. The (potentially stochastic) generator function $g: \mathcal{Z} \to \mathcal{X}$ allows us to map from an arbitrary latent variable $z$ to its corresponding observation $x = g(z)$ in data space $\mathcal{X}$.

Many prominent applications of generative models such as clustering, comparisons of semantic similarity, and data interpolations rely heavily on the notion of distance in latent space. However, there is generally no guarantee that distances in latent space represent a meaningful measure. The notion of meaningful distances itself can be hard to define and highly problem-specific. The manifold hypothesis asserts that the observations in data space actually lie on a low dimensional manifold. This is the key property exploited by generative models. However, while observations in the observation space might be extremely sparse in certain regions, the training objectives of most generative models promote a densely packed latent space. This can lead to rather dissimilar observations being embedded in close proximity.
This problem, as well as the importance of meaningful distances, has led to a growing interest in the geometry of latent spaces. Several groups have independently introduced the idea of applying Riemannian geometry to define a metric on the latent space [4, 5, 6, 7, 8, 9], which allows for concepts such as geodesics that give distances and shortest paths which more closely reflect the data. This requires a meaningful and differentiable metric in the observation space. However, this limits the potential metrics that can be used and excludes entire classes of models such as language models whose generators rely on repeated sampling.
In this work we extend and build on previous results in several aspects. In section 2, after briefly reviewing the general idea of a metric on latent space as well as previous work on Riemannian metrics and their limitations, we introduce an easy to implement method for approximating a wide range of heuristic metrics and metric-inspired measures. These quantities, while not as rigorous as Riemannian metrics and in some cases lacking certain desirable properties, can be applied to any type of generator function and can be precisely engineered based on the specific problem to be solved, offering considerably more flexibility. Section 3 introduces a diffusion based transformation and investigates how this transformation can be used to smooth out a latent space according to a particular measure, integrating meaningful distance information directly in the latent space itself. In section 4 we present a qualitative analysis of the measures induced by different architectures as well as different data types, and discuss how these results can be used for improved visualizations and how desired properties can be incorporated into the visualization by the choice of measure. We further show that given a good measure, one can find a transformed latent space which has more meaningful distances, and allows for improved interpolations and semantic clustering. Finally we conclude in section 5.

2 Latent Space Metrics

A metric defines a notion of distance between any pair of points in a space. This distance is non-negative, symmetric, satisfies the triangle inequality, and vanishes iff the two points coincide. The most frequently used metric in many applications is the Euclidean or $L_2$ metric, which is given by the length of the straight line connecting two points. The use of this metric comes with the implicit assumption that the underlying space is Euclidean and has no distortions or curvature. However, despite the widespread use of this metric in applications that rely on distances in latent space, there is in general no guarantee that latent spaces are actually flat Euclidean spaces.
On the contrary, the training objectives of most generative models naturally encourage the space to be stretched in some areas and compressed in others. The evidence lower bound in VAEs for example contains a term that encourages the learned approximate posterior distribution to match a (usually Gaussian) prior via minimization of the KL-divergence between the two, whose asymmetric nature encourages the posterior to completely fill the prior, not having any low density regions where the prior has high density. Qualitatively, we encourage our models to learn a data distribution that is free of holes. However, unless we also have data that is uniformly distributed, this naturally leads to distortions in latent space. A small volume occupied by e.g. a category boundary in latent space may actually correspond to a vast empty volume in data space. Trying to match the data to a Gaussian prior also induces a higher density near the origin. We can expect that a segment of a given length that is close to the origin covers a larger variety of data than a segment of the same length that lies towards the edge of the distribution (c.f. Fig 1). Thus the Euclidean metric is in general inadequate to represent meaningful distances between latent variables.

2.1 Riemannian Metrics and their Limitations

This problem has inspired a number of recent investigations aiming to use ideas from Riemannian geometry to define a more suitable metric. In particular, the authors in [4, 5, 6] advocate the use of a Riemannian metric instead of $L_2$, treating the latent space as a Riemannian manifold. This allows one to replace distances in latent space, where a readily available notion of distance is generally lacking, with distances in observation space for which it is assumed that we do have a meaningful measure of distance.
For a detailed discussion of Riemannian manifolds and metrics we refer the reader to the original literature [4, 5, 6], but the general idea is to consider the Jacobian $J$ with respect to some map $f$ defined on the latent space $\mathcal{Z}$. From the Jacobian we can get the metric tensor $G = J^{\top} J$, a symmetric positive definite matrix that encodes local curvature of the space. Related to this is the associated Riemannian measure $\mu(z) = \sqrt{\det G(z)}$, which quantifies how much volume in the target space of $f$ an infinitesimal volume around $z$ occupies. (The Riemannian measure is also sometimes referred to as the magnification factor or volume element. In the following we will be using the terms measure, magnification factor, and volume element interchangeably.) It essentially defines a density distribution over $\mathcal{Z}$. The non-uniformity of this distribution gives us a notion of how distorted $\mathcal{Z}$ is. Assuming isotropy, we can consider $\mu(z)$ as a multiplicative factor applied to an infinitesimal line segment $\mathrm{d}z$ passing through $z$. The more volume in the target space an infinitesimal unit cell around $z$ corresponds to, and thus the larger $\mu(z)$, the more distance we should assign to the segment $\mathrm{d}z$. We will also use this idea as the basis for our heuristic measures introduced below, which are not necessarily derived from a Jacobian. While the assumption of isotropy is very strong and usually does not hold, and the stretching of the line segment should thus also depend on its direction, we find that this assumption still leads to useful results as we shall show below.
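For reference, the standard quantities behind this paragraph can be written compactly. The symbols $f$ (the map), $J$ (its Jacobian), $G$ (the metric tensor) and $\mu$ (the measure) are notation chosen here for illustration and may differ from the symbols used in the original typeset paper.

```latex
J(z) = \frac{\partial f(z)}{\partial z}, \qquad
G(z) = J(z)^{\top} J(z), \qquad
\mu(z) = \sqrt{\det G(z)}
```

As a sanity check, for a linear map $f(z) = A z$ the Jacobian is $A$ everywhere, so $\mu = \sqrt{\det(A^{\top} A)}$ is constant and the latent space is uniformly stretched.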
The original papers introducing the idea defined the Jacobian with respect to the generator function mapping to observation space, i.e. the map is the generator itself and its target is the observation space. However, as pointed out in [8], this only provides a meaningful metric if Euclidean distances in observation space are meaningful. For example, for images it is highly questionable whether the $L_2$ metric provides a meaningful measure of semantic distance. The authors suggest an alternative metric which is not defined on the final output space, but on some intermediate activation layer in the generator. Since hidden units in intermediate layers tend to represent certain features, defining the Jacobian with respect to their activations should provide distances that capture semantic ideas rather than linear interpolations of the data.
In any application of the above ideas, it is crucial to have a meaningful and tractable metric on the data space, informed by the data, which we want to pull back into the latent space. In some cases, such as the simulation of a pendulum in [4], we do indeed have a very meaningful metric on the data, namely the angle of the pendulum. But in other cases, such as for images, we might need to find a less obvious metric. In general, what metric or measure we find “meaningful” might strongly depend on the specific problem we are trying to solve. Hence we would like to be as flexible as possible in the choice of metrics we can use. However, calculating the Riemannian metric requires the generator function (or more generally the map with respect to which the Jacobian is defined) to be differentiable and smooth. This excludes generators which use sampling procedures in the generation process, such as the decoders of most seq2seq models [10, 11].

2.2 Universal Heuristic Measures

To circumvent these limitations we propose an approximate sampling-based method that both allows one to easily approximate the Riemannian metric to arbitrary precision and provides large freedom in the design of other heuristic metrics and magnification factors for particular problems. This method can be applied to arbitrary generator functions, as well as more complex functions defined on the output of the generator or the latent space itself. Specifically we can consider an arbitrary function $f$, which maps a point $z$ in the latent space $\mathcal{Z}$ to a point $f(z)$ in what we shall call the “meaning space” $\mathcal{M}$, where a distance we deem meaningful is defined. We are free to choose an appropriate $f$ and the corresponding $\mathcal{M}$ based on the particular task we are trying to achieve.
To find the metric, or directly a heuristic measure, we begin by placing a square grid on the latent space that covers the entire embedded data, with $n_k$ cells in the $k$th dimension. We then proceed by assigning each grid cell in the latent space a characteristic point in the meaning space. Depending on the nature of $f$ and the grid resolution this can either simply be done by mapping each cell center $z_c$ to the corresponding point $f(z_c)$, or, for stochastic generators, by sampling multiple $z$ in the cell and defining the characteristic point as the average of the respective mappings. The resulting tensor, which holds one meaning-space point per grid cell, now forms the basis for calculating the metric or heuristic volume elements.
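The grid construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `cell_centers` and `meaning_tensor` are hypothetical, and drawing the samples uniformly inside each cell is an assumption, since the text does not specify the sampling scheme.

```python
import numpy as np

def cell_centers(lower, upper, n_cells):
    """Centers of a regular grid with n_cells[k] cells along latent dimension k."""
    axes = [lo + (np.arange(n) + 0.5) * (hi - lo) / n
            for lo, hi, n in zip(lower, upper, n_cells)]
    mesh = np.meshgrid(*axes, indexing="ij")
    return np.stack(mesh, axis=-1)  # shape: n_cells + (d,)

def meaning_tensor(f, lower, upper, n_cells, n_samples=1, rng=None):
    """Assign every grid cell a characteristic point in meaning space.

    n_samples == 1 maps the cell center through f. n_samples > 1 averages f
    over points drawn uniformly inside the cell (assumed sampling scheme),
    which is the variant intended for stochastic generators.
    """
    rng = np.random.default_rng() if rng is None else rng
    centers = cell_centers(lower, upper, n_cells)
    widths = (np.asarray(upper, float) - np.asarray(lower, float)) / np.asarray(n_cells)
    flat = centers.reshape(-1, centers.shape[-1])
    out = []
    for c in flat:
        if n_samples == 1:
            out.append(np.asarray(f(c), float))
        else:
            pts = c + (rng.random((n_samples, c.size)) - 0.5) * widths
            out.append(np.mean([f(p) for p in pts], axis=0))
    out = np.asarray(out)
    return out.reshape(tuple(n_cells) + out.shape[1:])
```

The number of evaluations of `f` grows as the product of the entries of `n_cells`, which is the exponential scaling with latent dimension that restricts this approach to low-dimensional latent spaces.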

2.2.1 Approximate Riemannian Metric

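Given the tensor of characteristic points $m_{i,j}$ it is straightforward to use a simple finite-difference approximation for the Jacobian with respect to $f$, for example for $d = 2$, $J_{i,j} \approx \left[ (m_{i+1,j} - m_{i,j}) / \Delta_1, \; (m_{i,j+1} - m_{i,j}) / \Delta_2 \right]$, where $\Delta_1$ and $\Delta_2$ are the cell widths in the two dimensions respectively. From this we can directly calculate an approximation to the measure $\mu = \sqrt{\det(J^{\top} J)}$.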
If we choose $f$ to be the full generator function $g$, such that the meaning space is the observation space, this provides an approximation to the metric considered in [4, 5, 6], whereas using a mapping to one of the intermediate layers in the generator recovers the feature-based metric suggested in [8].
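A minimal sketch of this finite-difference approximation is given below. It assumes the grid tensor is stored as a NumPy array with the meaning-space dimension last, and it uses `np.gradient`, which takes central differences in the interior and one-sided differences at the boundary; this is one of several possible finite-difference schemes. The function name `riemannian_measure` is hypothetical.

```python
import numpy as np

def riemannian_measure(M, widths):
    """Approximate sqrt(det(J^T J)) for every cell of a regular grid.

    M      : array of shape (n_1, ..., n_d, D), one meaning-space point per cell
    widths : cell width along each of the d latent dimensions
    """
    d = M.ndim - 1
    # Partial derivative of the meaning vector along each latent axis.
    partials = [np.gradient(M, widths[k], axis=k) for k in range(d)]
    J = np.stack(partials, axis=-1)      # (..., D, d) Jacobian per cell
    G = np.swapaxes(J, -1, -2) @ J       # (..., d, d) metric tensor per cell
    # Clip tiny negative determinants caused by round-off before the root.
    return np.sqrt(np.clip(np.linalg.det(G), 0.0, None))
```

For a linear map the result is constant across the grid, which gives a quick way to check an implementation before applying it to a trained generator.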

2.2.2 General Heuristic Measures and the Jensen-Shannon Measure for Autoregressive Generators

In addition to calculating the measure via a Jacobian, we can consider an arbitrary dissimilarity function $\delta$ between points in the meaning space $\mathcal{M}$ and directly define its corresponding heuristic measure on the grid as the average local dissimilarity under $\delta$ between each cell and its nearest neighbours. This approach, while sacrificing some directionality information, is extremely flexible and the potential applications are abundant. To illustrate the general idea and show one explicit application in the context of a common and highly relevant use case, let us consider the following specific example.
Autoregressive models are commonly used for sequence modeling tasks such as text or audio [12, 13, 14, 15]. Their generators rely on repeated sampling from a distribution which gets conditioned on the past sampling history. In many such cases we do not even have a clear reasonably smooth metric on the observation space. For example for text, while there are measures such as edit distance [16], they are neither smooth nor do they capture semantic distance which is usually what we would like to capture.
We instead introduce an alternative measure which is not defined on the generated data itself, but on the intermediate conditional distributions involved in generating it. Specifically, let us consider a generator representing a conditional language model with finite vocabulary [13, 14]. At step $t$ in the generation process the generator produces a distribution $p(w_t \mid z, w_{<t})$ over words, conditioned on both the latent variable $z$ as well as the previous words $w_{<t}$. Assuming a total sentence length $T$, we now define our meaning function $f$ as the average over the intermediate word distributions

$$f(z) = \frac{1}{T} \sum_{t=1}^{T} p(w_t \mid z, w_{<t}) \qquad (1)$$

This vector essentially captures the average word distribution associated with a point in latent space. Note that this is certainly not a perfect solution since word frequency alone is not enough to capture the full meaning of text, and it can assign the same meaning vector to different generator outputs. Despite these concerns, we did find this to be a useful quantity in practice. The question of better meaning-capturing functions is an interesting direction for further research. One possible (albeit less easily interpretable) alternative could for example be the average hidden state of an LSTM decoder.
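The averaging in equation (1) can be sketched for a generic autoregressive sampler. Everything here is illustrative: `next_word_probs` stands for whatever routine returns the model's next-word distribution given the latent variable and the words sampled so far, and `toy_next_word_probs` is a hypothetical stand-in so that the example runs without a trained language model.

```python
import numpy as np

def average_word_distribution(next_word_probs, z, length, rng):
    """Average of the per-step word distributions along one sampled sentence."""
    history, dists = [], []
    for _ in range(length):
        p = next_word_probs(z, history)   # distribution over the vocabulary
        dists.append(p)
        # Sample the next word and condition the following step on it.
        history.append(int(rng.choice(len(p), p=p)))
    return np.mean(dists, axis=0)

def toy_next_word_probs(z, history, vocab_size=5):
    """Hypothetical stand-in for a conditional language model."""
    logits = np.arange(vocab_size, dtype=float) * z[0]
    if history:
        logits[history[-1]] -= 1.0        # discourage repeating the last word
    logits -= logits.max()                # numerical stability of the softmax
    p = np.exp(logits)
    return p / p.sum()
```

Because the generator samples, a single call gives a noisy estimate; averaging several calls per grid cell, as described in section 2.2, reduces that noise.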
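To arrive at a useful measure, we also need to define a suitable dissimilarity function $\delta$. A natural choice in the cases where $f(z)$ represents probability distributions is the Jensen-Shannon distance

$$\delta_{JS}(p, q) = \sqrt{\tfrac{1}{2} D_{KL}(p \,\|\, \bar{p}) + \tfrac{1}{2} D_{KL}(q \,\|\, \bar{p})},$$

where $\bar{p} = \tfrac{1}{2}(p + q)$ [17]. With this choice we arrive at the Jensen-Shannon measure

$$\mu_{JS}(z_i) = \frac{1}{N_i} \sum_{j \in \mathrm{nn}(i)} \delta_{JS}\big(f(z_i), f(z_j)\big) \qquad (2)$$

where $\mathrm{nn}(i)$ denotes the nearest neighbours of cell $i$ and $N_i$ is the number of cells bordering on cell $i$. This measure quantifies how much the word distribution changes as we move through the unit cell around $z_i$.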
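A sketch of the Jensen-Shannon measure on a two-dimensional grid follows. It assumes that "nearest neighbours" means the axis-aligned bordering cells (up to four per cell, fewer at the boundary) and that natural logarithms are used; both are choices made for this illustration, and the function names are hypothetical.

```python
import numpy as np

def js_distance(p, q, eps=1e-12):
    """Jensen-Shannon distance: square root of the JS divergence (natural log)."""
    p = np.asarray(p, float)
    q = np.asarray(q, float)
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * (np.log(p + eps) - np.log(m + eps)), axis=-1)
    kl_qm = np.sum(q * (np.log(q + eps) - np.log(m + eps)), axis=-1)
    return np.sqrt(np.clip(0.5 * (kl_pm + kl_qm), 0.0, None))

def js_measure(M):
    """Average JS distance of each cell to its axis-aligned neighbours.

    M has shape (n_1, n_2, V): one probability vector per cell of a 2-D grid.
    """
    total = np.zeros(M.shape[:2])
    count = np.zeros(M.shape[:2])
    # Neighbour pairs along the first axis contribute to both cells of a pair.
    d0 = js_distance(M[1:], M[:-1])
    total[1:] += d0
    total[:-1] += d0
    count[1:] += 1
    count[:-1] += 1
    # Neighbour pairs along the second axis.
    d1 = js_distance(M[:, 1:], M[:, :-1])
    total[:, 1:] += d1
    total[:, :-1] += d1
    count[:, 1:] += 1
    count[:, :-1] += 1
    return total / count
```

The same function applies unchanged when the per-cell vectors are classifier probabilities rather than average word distributions.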
As previously noted, this measure, just like the Riemannian measure derived from a metric tensor, lacks directional information, which might be crucial in certain applications. An extension of the current approach that provides not only a measure but a genuine metric could be a fruitful direction for further research. Despite this limitation, we still found these directionless quantities to be useful in practice.

2.2.3 Classifier Measures

Another interesting type of metric can be found if categorical labels are available for the data. In this case we can train a classifier over the latent variables to predict the class $y$ given a latent variable $z$, and use the resulting probability vector $p(y \mid z)$ as the feature vector $f(z)$. Using again the Jensen-Shannon measure (2), the measure now captures how fast the class probabilities change in the vicinity of $z$ and thus encodes information such as class boundaries. This can be interesting to get insights into the learned data distribution, as well as in conjunction with the transformation we introduce in section 3 to produce visualizations with clearly distinct clusters for each class.

3 Latent Space Transformations

Figure 1: Learned MNIST data distributions and latent spaces with their respective measures, as well as data interpolations and the transformed latent spaces, for different activation functions. Top row: Original latent space $\mathcal{Z}$ with embedded validation data. The measure, square-rooted for contrast, is shown in red. The inset shows the measure around the origin without embeddings. A straight path (black) and pseudo-geodesic (red) between two embeddings is shown, as well as equidistance lines around the starting point of the path. Middle row: Interpolations corresponding to the two paths (top: straight line; bottom: pseudo-geodesic). Bottom row: Embeddings, interpolation paths, and equidistance lines in the transformed space $\mathcal{Z}'$.

The Riemannian metric and associated measure, as well as the heuristic measures, all quantify how much volume in an abstract meaning space maps to each unit cell in latent space. As noted above, the measure essentially defines a density distribution over the latent space. This raises the question of whether we can find a transformation on the latent space that accounts for this unequal density and maps each point $z$ to a corresponding point $z'$, stretching the space in such a way as to equilibrate the density.
This problem has previously been studied in a seemingly unrelated domain: cartography. Specifically, cartograms [18], also known as density-equalizing maps, are maps in which the size of geographic regions is proportional to certain properties of that region, such as population or GDP. The most successful techniques for calculating the transformations underlying cartograms are inspired by physical diffusion processes [19, 20], where we assume that the property of interest represents a particle density, and then allow the system to relax to its equilibrium state.
We can directly apply these methods to our present problem, essentially producing cartograms of latent spaces where the quantity of interest is the measure. The result is a bijective map from the original space $\mathcal{Z}$ to a stretched space $\mathcal{Z}'$ in which unit volumes map to equal volumes in meaning space $\mathcal{M}$.
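To build intuition for what density equalisation does, consider a one-dimensional analogue, where the problem has a closed-form answer: integrating the measure along the axis gives a map under which equal lengths in the new coordinate carry equal amounts of measure. The sketch below illustrates only this one-dimensional special case; it is not the diffusion-based method of [19, 20] used for the experiments, and the function names are hypothetical.

```python
import numpy as np

def equalizing_map_1d(edges, mu):
    """1-D analogue of a density-equalising map.

    edges : the n + 1 increasing cell boundaries of a 1-D latent grid
    mu    : the n positive measure values, one per cell
    Returns the transformed cell boundaries, rescaled to the original range.
    """
    mass = mu * np.diff(edges)                     # measure contained in each cell
    cum = np.concatenate([[0.0], np.cumsum(mass)])
    return edges[0] + (edges[-1] - edges[0]) * cum / cum[-1]

def forward(z, edges, new_edges):
    """Map original coordinates to the stretched space (piecewise linear)."""
    return np.interp(z, edges, new_edges)

def inverse(z_new, edges, new_edges):
    """Map stretched coordinates back; requires mu > 0 so the map is monotone."""
    return np.interp(z_new, new_edges, edges)
```

Cells with a large measure are widened and cells with a small measure are compressed, so that each transformed cell has a width proportional to its measure.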
We refer the reader to the original cartogram literature [19, 20] for details on how to calculate the transformation given a density distribution. (For all reported experiments we used the method proposed in [19] and their open implementation, which can be found at http://www-personal.umich.edu/~mejn/cart/. We also implemented [20] but found the resulting maps to be less satisfying for metrics with very fine structure and high local gradients.) In our scenario, treating the Riemannian or heuristic measure as a density distribution over the latent space, the resulting map is a discrete vector field over the grid defined on $\mathcal{Z}$, which maps each cell center $z_c$ to a corresponding point $z'_c$. Using bilinear interpolation between cell centers we can find a continuous map $T$ for arbitrary points $z$. It is also straightforward to (approximately) determine the inverse transformation $T^{-1}$ such that $T^{-1}(T(z)) \approx z$.
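The bilinear interpolation step can be sketched as follows for a two-dimensional latent space. The array layout (transformed cell-center positions stored in an array of shape `(n_x, n_y, 2)`) and the clamping of points that fall outside the outermost cell centers are assumptions of this illustration, and the function name is hypothetical.

```python
import numpy as np

def bilinear_transform(points, centers_x, centers_y, T):
    """Interpolate a discrete map T, given at cell centers, at arbitrary points.

    points    : (N, 2) latent coordinates
    centers_x : (n_x,) increasing cell-center coordinates along the first axis
    centers_y : (n_y,) increasing cell-center coordinates along the second axis
    T         : (n_x, n_y, 2) transformed position of every cell center
    """
    points = np.atleast_2d(points)
    # Clamp so that points outside the outermost centers use the edge values.
    x = np.clip(points[:, 0], centers_x[0], centers_x[-1])
    y = np.clip(points[:, 1], centers_y[0], centers_y[-1])
    # Index of the lower-left cell center of the enclosing rectangle.
    i = np.clip(np.searchsorted(centers_x, x) - 1, 0, len(centers_x) - 2)
    j = np.clip(np.searchsorted(centers_y, y) - 1, 0, len(centers_y) - 2)
    # Fractional position inside the rectangle, in [0, 1].
    tx = ((x - centers_x[i]) / (centers_x[i + 1] - centers_x[i]))[:, None]
    ty = ((y - centers_y[j]) / (centers_y[j + 1] - centers_y[j]))[:, None]
    return ((1 - tx) * (1 - ty) * T[i, j] + tx * (1 - ty) * T[i + 1, j]
            + (1 - tx) * ty * T[i, j + 1] + tx * ty * T[i + 1, j + 1])
```

Bilinear interpolation reproduces any affine map exactly, so feeding it the untransformed cell centers returns the input points unchanged, which is a convenient check.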
Euclidean distances in the transformed space $\mathcal{Z}'$ give a much more faithful representation of semantic distance due to the direct incorporation of the measure. However, note that because we lose directionality information when considering the volume elements, straight lines in $\mathcal{Z}'$ are in most cases not true geodesics, and we shall refer to them as pseudo-geodesics. While not representing absolute shortest paths, they are still useful for distance comparisons, especially locally, as well as for data interpolations (c.f. Fig 1), and are trivial to compute. One could determine true geodesics for example by considering path integrals between points, but this comes at a significantly higher computational cost, similar to the approaches in [4] and [6], which rely on neural networks and on solving a system of differential equations respectively to determine geodesics. We leave this to future investigations.


Similarly, finding a more advanced transformation that is not based on the measure, but the metric itself (assuming that a metric is available), and takes not just the local density but also directionality into account, is an interesting open question.

4 Experiments

Having introduced the methods for universal heuristic-based measures and the measure-smoothing transform, we now turn towards some explicit applications and give a qualitative study of how different model and data aspects affect latent spaces.

4.1 MNIST

We first consider the canonical example of MNIST images, which has also been studied in relation to latent space metrics by [4, 5, 6]. For the VAE, we use a simple architecture consisting of two fully connected 512-unit layers each for encoder and decoder. We trained three separate models for the activation functions , , and respectively and approximated their Riemannian measure as described in 2.2.1. Here and in all the following experiments we have used an -grid for the approximation. For comparison with previous work [4, 5, 6] we also defined the Jacobian with respect to the data space $\mathcal{X}$. Based on this metric we computed the transformations and applied them to the embeddings. We also performed data interpolations along straight paths in the original space, as well as along the pseudo-geodesics. The results are shown in Fig. 1.
Interestingly, we find that the activation function used leaves a very strong imprint on the latent space. We confirmed via repeated experiments (not shown) that each activation function indeed leads to a very characteristic metric on the latent space. This shows that the proposed method can be a useful tool in the study of model architectures and activation functions. We also find that, as expected, the interpolation along the pseudo-geodesic leads to smoother transitions. We note however that, as pointed out previously, smooth here only means linear interpolation between images due to the limited usefulness of the $L_2$-metric on the data, and not necessarily smooth in terms of semantic meaning.
Looking at the transformed space $\mathcal{Z}'$, we note that despite the questionable adequacy of the $L_2$-metric on the data, the transformation achieves a visually nice separation of the data into more distinct clusters, as well as a very noticeable overall smoothing of the non-uniform Gaussian density induced by the VAE’s prior. We believe that this is useful both in terms of more meaningful distances, as well as for improved visualizations. The increase in uniformity was further confirmed by calculating the entropy of the embeddings before and after the transformation, which showed a significant increase for all models considered.
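The text does not state how the entropy of the embeddings was estimated. One simple possibility, shown here purely as an assumption, is the Shannon entropy of a two-dimensional histogram of the embedded points, computed with the same bins before and after the transformation; the function name and default bin count are hypothetical.

```python
import numpy as np

def histogram_entropy(points, bins=20, bounds=None):
    """Shannon entropy (in nats) of a 2-D histogram of the embeddings.

    points : (N, 2) array of embedded points
    bounds : optional [[x_min, x_max], [y_min, y_max]] so that two point
             sets can be compared on identical bins
    """
    counts, _, _ = np.histogram2d(points[:, 0], points[:, 1],
                                  bins=bins, range=bounds)
    p = counts.ravel() / counts.sum()
    p = p[p > 0]                      # empty bins contribute nothing
    return float(-np.sum(p * np.log(p)))
```

The estimate is bounded above by the logarithm of the number of bins, which is attained when the points are spread evenly over all bins, so a more uniform embedding yields a larger value.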

4.1.1 Improved Clustering

Figure 2: k-means clustering applied to the MNIST data distribution learned by a simple (non-variational) autoencoder in the original space (top) as well as the transformed space (bottom). The cluster-based classification F1-score improves from before to after the transformation.

To further study the cluster- and uniformity-improving properties of the transformation we trained a standard autoencoder (AE) with the same architecture as the above VAEs and with activations. Due to the lack of regularization we expected the AE to have a highly distorted latent space. Fig. 2 shows that, as expected, the transformation dramatically helps smooth out the distorted data distribution. We also performed k-means clustering based classification to get a more quantitative measure of the clustering, and found that the transformation led to an improvement in F1-score from to . We note however that repeating this same analysis for the VAEs only led to minor improvements. Given the questionable usefulness of the $L_2$-metric we did not necessarily expect it to lead to improved clustering at all. However, we believe that our method can be highly useful for clustering if one is able to find a measure which is more appropriate for capturing differences between the desired kinds of clusters (and again, our measures are flexible enough to be tweaked to accommodate certain desired properties and clusterings).

4.1.2 Latent Space Distortions due to bad Training Data

Figure 3: VAE trained on MNIST with two particular data points repeated times in the training data. The square rooted measure in red shows clearly how corrupted data (in this case repeated data) can lead to strong distortions and high curvature in latent space.

Another interesting application is the study of the effects that corrupted training data have on the learned latent spaces. As a simple experiment we retrained the VAE with activation on a dataset in which we repeated two particular data points times. Fig. 3 very clearly shows the resulting distortion of the latent space, the VAE essentially reserving large and (particularly in the case of the “0”) disconnected areas in latent space for the memorization of these samples. While being a simple toy example, this effectively demonstrates the importance of clean training data.

4.1.3 Classifier Visualisations

Figure 4: Top: Classifier based measure in red. Bottom: The same space after the transformation based on this measure. The strong class-separating effect that can be achieved through this measure and transformation is clearly visible.

To conclude our analysis of MNIST, we trained a simple MLP classifier [21] over the original embeddings of the -VAE and calculated the classifier based measure as described in 2.2.3. The measure and resulting transformation are shown in Fig. 4. We can clearly see the class transitions learned by the classifier, as well as the resulting strong separation into clusters of unique classes. Again, while this is only a toy example, it shows how this type of measure can be used for producing visualizations that highlight distinct clusters of classes. One could imagine a more complicated dataset with hundreds of classes. If we want to highlight one or two particular classes, we could define a measure based only on those classes’ respective classifier probabilities and use the resulting transformation to visually clearly separate out these classes from the remaining data. The strength of the desired effect can easily be controlled by applying a Gaussian blur to the metric.

4.2 Natural Language

To show our method's applicability to a class of generators that was completely inaccessible to previous methods and to test our proposed Jensen-Shannon measure for autoregressive generators, we consider a language modeling seq2seq VAE. In terms of architectures, we use exactly the same model and training procedure as proposed in [14], with an LSTM encoder and a dilated-convolution-based decoder.
We also introduce a new dataset that we found particularly interesting for language modeling (as well as classification) for practical applications. This dataset consists of consumer complaints about financial products collected by the US Government, along with various categorical labels such as product category. (Data available at https://www.consumerfinance.gov/data-research/consumer-complaints/. This dataset is constantly updated; the version we used for experiments was retrieved on 23/05/2017.) This dataset is interesting for its semantic richness, as well as labelled categories with fairly well defined semantic content.

Figure 5: Average word distribution based Jensen-Shannon measure in black and embeddings (left), as well as transformed embeddings (right) for the Consumer Complaints dataset. Points are colored by product category. Top: seq2seq VAE with -dimensional latent space. Bottom: seq2seq VAE with followed by compression to two dimensions (shown in the figure) via a second VAE.

We first trained a model with and calculated the Jensen-Shannon measure and resulting transformation. Results are shown in Fig. 5. Despite the small latent space dimension, the learned data distribution has surprisingly distinct regions for the different product categories. What is also remarkable is the complicated spider-web-like structure of the measure . Regions of very smooth variation are enclosed by sharp boundaries that mark rapidly changing word distributions. This is reflected in the transformed space by strong clusters emerging in regions that were previously uniform. We can also see that the generator has learned a complicated function that extends well beyond the range of the actual data. To not give too much weight to the volume elements in these regions, we smoothly relaxed the measure to its average away from any data point before calculating the transformation.
To show that the methods can still be useful for the analysis of high dimensional latent spaces, despite their current computational cost which restricts them to low dimensions, we first trained a seq2seq VAE with a higher-dimensional latent space and then trained a second two-dimensional VAE on the learned embeddings, a simple compression VAE with two fully-connected layers for both encoder and decoder. We then calculated the JSD-measure on the two dimensional latent space by feeding the output of the compression VAE’s generator as latent variable to the generator of the seq2seq VAE. Results are shown in the bottom half of Fig. 5. We find that the higher dimensional VAE generally learns a much more distinct separation between the categories. Further, we see that the measure has the same web-like characteristics, although it could be argued that it looks qualitatively more complex. Remarkably, after applying the transformation, we find that the continuous distribution is split into semantically meaningful clusters, both between category boundaries, as well as within individual categories.
These results show both that the metric does indeed capture semantic meaning and that this method could potentially be very useful for clustering beyond human-defined labels. A more detailed study of how the enclosed areas in the metric arise and how they can be used for defining clusters is another interesting area for future study.

5 Conclusion and Outlook

We have introduced an easy-to-implement approximate method for the calculation of Riemannian metrics, as well as more general heuristic metrics and measures. Based on these measures, we introduced a density-equalizing transformation that allows one to smooth out the latent space and rescale local distances according to certain desired properties. We used the proposed methods to analyze the effects of different types of architecture and data on the learned latent space.
The proposed methods in their present form certainly have drawbacks, particularly the exponential scaling with the latent space dimension, which limits the applicability to low dimensions, as well as the lack of directionality information in the measure and resulting transformation. The “un-distortion” effect of the transformation is only perfect if the metric is isotropic, which is unlikely in most realistic scenarios. However, we believe that even in their current form, the proposed methods provide useful tools for studying the fundamental structure of latent spaces, including those of higher dimension as we have hinted at in section 4.2, and point towards promising directions for future research. While lacking some of the desirable quantitative properties of precise Riemannian metrics, our treatment opens up new possibilities for qualitative analysis and transformation of latent spaces. We also believe that they can be effectively used for advanced data visualizations. Further, researchers in fields such as computer graphics and cosmology have worked on similar problems for a long time and developed highly efficient methods for computations in higher dimensions, particularly if this high dimensional space is very sparsely populated, as is the case for most high dimensional latent spaces. Applying methods from these fields could drastically improve the applicability of the heuristic metric calculations, as well as the transformation.

References