The Riemannian Geometry of Deep Generative Models

11/21/2017 ∙ by Hang Shao, et al. ∙ The University of Utah ∙ IBM

Deep generative models learn a mapping from a low dimensional latent space to a high-dimensional data space. Under certain regularity conditions, these models parameterize nonlinear manifolds in the data space. In this paper, we investigate the Riemannian geometry of these generated manifolds. First, we develop efficient algorithms for computing geodesic curves, which provide an intrinsic notion of distance between points on the manifold. Second, we develop an algorithm for parallel translation of a tangent vector along a path on the manifold. We show how parallel translation can be used to generate analogies, i.e., to transport a change in one data point into a semantically similar change of another data point. Our experiments on real image data show that the manifolds learned by deep generative models, while nonlinear, are surprisingly close to zero curvature. The practical implication is that linear paths in the latent space closely approximate geodesics on the generated manifold. However, further investigation into this phenomenon is warranted, to identify if there are other architectures or datasets where curvature plays a more prominent role. We believe that exploring the Riemannian geometry of deep generative models, using the tools developed in this paper, will be an important step in understanding the high-dimensional, nonlinear spaces these models learn.




1 Introduction

Learning from unlabeled raw sensory observations, which are often high-dimensional, is a problem of significant importance in machine learning. An influential notion in this line of research is the manifold hypothesis, which states that these high-dimensional observations are concentrated around a manifold of much lower dimensionality [7, 19, 20, 21, 22]. Indeed, the manifold hypothesis has been the basis of much of the prior work on the problems of unsupervised and semi-supervised learning [21, 19, 22, 20, 23, 1, 4, 2, 24, 18, 12].

These problem areas have witnessed a surge in activity following the recent success of deep generative models in modeling the observed data with higher fidelity than was earlier possible. This is particularly true for visual observations, where deep generative models such as variational autoencoders (VAEs) [11, 17], generative adversarial networks (GANs) [8], PixelCNN [15], and their variants [16, 25, 3, 9] have been shown to generate good quality images. All of these models involve learning a mapping (termed the generator or decoder) from a lower-dimensional latent space to the high-dimensional space of observed data. This allows for generating novel data samples by ancestral sampling, which is seeded by samples in the latent space.

As the learned generator in these models is able to generate high-fidelity data samples, the generator mapping can be argued to approximate the data manifold reasonably well. This has been explored in the context of semi-supervised learning to obtain smooth invariances for classification via estimating the tangent directions to the data manifold as learned by the generator [12, 18]. However, the metric properties of these generated manifolds still remain unexplored.

In this work, we investigate the Riemannian geometry of the manifolds learned by these deep generative models. Our contributions are summarized as follows:

  • We propose an algorithm for computing geodesic paths between points on the generated manifold. This can be used to interpolate between two generated data points on the manifold using the least amount of change necessary, while enforcing that the points along the path remain on the manifold. The arclength of a minimal geodesic path is a distance metric between points on the manifold, and is a natural way to measure the similarity between two data points. While the continuous geodesic equation requires expensive second derivatives and matrix inversions, we formulate an efficient numerical strategy for computing discretized geodesic curves that avoids these computations. In addition to point-to-point geodesic paths, we show how to “shoot” a geodesic from an initial starting position and initial velocity (tangent vector).

  • Next, we develop an algorithm for parallel translation of tangent vectors along a path on the generated manifold. Parallel translation moves a tangent vector continuously along a path using the minimal amount of change needed to keep it tangent to the manifold. This operation provides a means for computing analogies, i.e., taking the change from a point $a$ to a point $b$ on the manifold (represented as a geodesic segment) and applying that change to a third point $c$.

  • In our experiments, we show how the above tools can be used to explore the Riemannian geometry of the manifolds learned by deep generative models, and in particular, to investigate the curvature of these manifolds. We demonstrate, at least for the VAE architecture used in our experiments, that the generated manifolds learned from the real image datasets we use (CelebA [13], SVHN [14]) have surprisingly little curvature. As a result, straight lines in the latent space map via the generator to curves on the manifold that are quite similar to geodesics. This may help explain why the latent coordinates, and interpolations between them, tend to give plausible changes in the generated images. Geodesic curves are in a sense the smoothest possible transitions, as they move at constant speed and minimize the distance traveled from one point to another. Our conclusion is that having latent coordinates that approximate geodesics is a desirable property, and that it should be checked by interrogating the Riemannian geometry of a trained deep generative model.

2 Deep Generative Models as Manifolds

Figure 1: Depiction of a generative model as a mapping from low-dimensional latent coordinates onto an immersed manifold in data space.

In this section, we illustrate the connection between deep generative models and manifolds. A deep generative model represents a mapping, $g : Z \to X$, from some low-dimensional latent space $Z \subseteq \mathbb{R}^d$ to a high-dimensional data space $X \subseteq \mathbb{R}^D$ (typically, $d \ll D$). Under certain conditions (described precisely below), the image of $g$ is a smooth manifold, $M = g(Z)$. As depicted in Figure 1, $g$ maps linear coordinates in $Z$ onto curvilinear coordinates on $M$.

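We will construct $g$ as the composition of multiple layers, i.e., $g = g^{(L)} \circ g^{(L-1)} \circ \cdots \circ g^{(1)}$, where we use superscripts to denote the layer index. Each layer $g^{(l)}$ is an affine mapping, followed by a nonlinear activation function:

$$g^{(l)}_i(x) = \phi\left( w^{(l)}_i \cdot x + b^{(l)}_i \right).$$

Here we have used subscripts, $g^{(l)}_i$, to denote the $i$th component of the output, and $w^{(l)}_i$ to denote the $i$th row of the weight matrix, $W^{(l)}$.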

The image of $g$ is a smooth (i.e., $C^\infty$), $d$-dimensional, immersed manifold if for every point $z \in Z$, the Jacobian of $g$ at $z$, $J_g(z)$, has rank $d$. As a straightforward application of the chain rule, this will be true when the following conditions are met:

  1. The activation function, $\phi$, is a smooth, monotonic function.

  2. Each weight matrix $W^{(l)}$ has maximal rank.

Note that condition 1 can be enforced during the modeling phase, by selecting an appropriate activation function. Condition 2 must be checked after training. Also, note that condition 2 is sufficient but not necessary: we could potentially have less-than-maximal rank weight matrices in the middle layers, as long as the final rank of the Jacobian is $d$. However, checking this more general condition would require checking that the Jacobian is rank-$d$ at every possible input $z$, which is not feasible. Finally, we emphasize that $M$ is only guaranteed to be an immersed manifold. This means that it is locally diffeomorphic to $d$-dimensional Euclidean space, but globally it may have self intersections.
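
The two conditions can be checked numerically on a small example. The sketch below is only an illustration under stated assumptions: the layer sizes, the random weights, and the helper names (`elu`, `elu_grad`, `generator_with_jacobian`) are hypothetical and are not taken from the paper. It builds a toy ELU network, verifies that each weight matrix has maximal rank, and spot-checks the Jacobian rank at sampled latent points (a spot check, not a proof over all inputs).

```python
import numpy as np

def elu(a):
    # ELU activation: identity for a > 0, exp(a) - 1 otherwise.
    return np.where(a > 0, a, np.exp(np.minimum(a, 0.0)) - 1.0)

def elu_grad(a):
    # Derivative of ELU; strictly positive everywhere.
    return np.where(a > 0, 1.0, np.exp(np.minimum(a, 0.0)))

def generator_with_jacobian(z, weights, biases):
    """Evaluate a toy MLP generator and its Jacobian via the chain rule."""
    x = np.asarray(z, dtype=float)
    J = np.eye(x.shape[0])
    for W, b in zip(weights, biases):
        a = W @ x + b
        x = elu(a)
        # Layer Jacobian is diag(phi'(a)) @ W; accumulate by the chain rule.
        J = (elu_grad(a)[:, None] * W) @ J
    return x, J

rng = np.random.default_rng(0)
dims = [2, 8, 16]  # hypothetical sizes: d = 2 latent, D = 16 data dimensions
weights = [rng.standard_normal((m, n)) for n, m in zip(dims[:-1], dims[1:])]
biases = [rng.standard_normal(m) for m in dims[1:]]

# Condition 2: every weight matrix has maximal rank.
weight_ranks = [int(np.linalg.matrix_rank(W)) for W in weights]

# Spot check (not a proof): Jacobian rank at randomly sampled latent points.
jacobian_ranks = {
    int(np.linalg.matrix_rank(generator_with_jacobian(z, weights, biases)[1]))
    for z in rng.standard_normal((100, 2))
}
print(weight_ranks, jacobian_ranks)
```

For a trained model, the same rank test would be applied to the learned weight matrices instead of random ones.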

The Jacobian matrix of $g$ provides a way to map tangent vectors in the latent space to tangent vectors on the manifold. At any point $z \in Z$, the Jacobian matrix $J_g(z)$ is a linear mapping from $T_z Z$, the tangent space of $Z$ at $z$, to $T_{g(z)} M$, the tangent space of $M$ at $g(z)$. In practice, $J_g(z)$ is computed as the partial derivative matrix of $g$ via backpropagation. A Riemannian metric provides an inner product structure between tangent vectors in each tangent space $T_x M$. We will use the induced metric from the ambient data space $X$. In other words, thinking of two vectors $u, v \in T_x M$ as living in a linear subspace of $\mathbb{R}^D$, we can use the Euclidean dot product of $\mathbb{R}^D$ to compute the Riemannian metric $\langle u, v \rangle$.

Intuitively speaking, the curvature of a Riemannian manifold measures the extent to which the metric deviates from being Euclidean. For a precise mathematical explanation of curvature, refer to standard texts in Riemannian geometry, e.g., [6]. We emphasize an important distinction: a manifold that is flat, i.e., has zero curvature, is not necessarily linear. For example, take a sheet of paper and draw a straight line on it. Now bend the sheet of paper into any shape without creasing it. This surface is metrically equivalent to 2D Euclidean space: the straight line you drew is now a geodesic curve with the same arc length. In other words, the surface has zero curvature (this is the Gaussian curvature in the case of a 2D surface). In particular, rolling the paper into the famous “swiss roll” results in a surface that is highly nonlinear, but nonetheless has zero curvature.
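
The rolled-paper example can be verified numerically. The sketch below assumes a hypothetical toy generator `roll` that wraps the plane onto a cylinder of radius one (a simplified stand-in for the swiss roll, not a model from the paper). With the induced metric, the inner product of two pushed-forward tangent vectors is $u^\top J^\top J v$, so the surface is metrically Euclidean exactly when $J^\top J$ is the identity at every point.

```python
import numpy as np

def roll(z):
    # Hypothetical toy generator: wrap the (u, v) plane onto a unit cylinder.
    u, v = z
    return np.array([np.cos(u), np.sin(u), v])

def roll_jacobian(z):
    # Analytic 3x2 Jacobian of `roll` at z.
    u, _ = z
    return np.array([[-np.sin(u), 0.0],
                     [np.cos(u), 0.0],
                     [0.0, 1.0]])

def induced_metric(J):
    # Metric induced by the Euclidean dot product of the ambient space.
    return J.T @ J

rng = np.random.default_rng(0)
samples = rng.uniform(-3.0, 3.0, size=(50, 2))

# Largest deviation of the induced metric from the 2x2 identity.
max_dev = max(
    np.abs(induced_metric(roll_jacobian(z)) - np.eye(2)).max() for z in samples
)
print(max_dev)
```

The map is clearly nonlinear in the ambient space, yet the metric in latent coordinates is the Euclidean one, so straight lines in the latent plane are geodesics on the rolled surface.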

3 Riemannian Geometry Computations

In this section we develop three algorithms for Riemannian computations on a manifold represented by a deep generative network $g$. These are geodesic interpolation between two points on the manifold, parallel translation of a tangent vector along a path on the manifold, and geodesic shooting from an initial point and velocity on the manifold. We begin with a general discussion of the geodesic equation on a Riemannian manifold.

We will consider all objects (tangent vectors, curves, the Riemannian metric) to be defined in the coordinate space $Z$. However, we point out that each of these objects has a unique corresponding counterpart on the manifold, $M$, through the mapping $g$ (or its derivative mapping). We represent the Riemannian metric as a symmetric, positive definite matrix field, $G(z)$, defined at each point $z$ of the latent coordinate space, $Z$. It is given by the formula:

$$G(z) = J_g(z)^\top J_g(z).$$

Given two tangent vectors $u, v \in T_z Z$ in coordinates, their inner product is $\langle u, v \rangle = u^\top G(z)\, v$.

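Now, consider a smooth curve $\gamma : [0, 1] \to Z$. Again, this corresponds to a curve on the manifold, $g(\gamma(t)) \in M$. The arc length of $\gamma$ is defined as

$$L[\gamma] = \int_0^1 \sqrt{\langle \gamma'(t), \gamma'(t) \rangle}\; dt. \tag{1}$$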

A geodesic curve locally minimizes the arc length, although this is done through minimizing a slightly different energy functional:

$$E[\gamma] = \frac{1}{2} \int_0^1 \langle \gamma'(t), \gamma'(t) \rangle\; dt. \tag{2}$$

Minimizing this energy leads to geodesic curves, which also locally minimize the arc length, but in addition have constant speed parameterizations.
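
The link between energy and arc length follows from the Cauchy-Schwarz inequality. The short derivation below is a sketch under stated assumptions: the curve is parameterized on $[0, 1]$, $\|\gamma'(t)\|$ denotes the speed measured in the Riemannian metric, and the energy carries a factor of $\tfrac{1}{2}$, so that $L[\gamma] = \int_0^1 \|\gamma'(t)\|\, dt$ and $E[\gamma] = \tfrac{1}{2} \int_0^1 \|\gamma'(t)\|^2\, dt$.

```latex
\begin{aligned}
L[\gamma]^2
  &= \left( \int_0^1 1 \cdot \|\gamma'(t)\| \, dt \right)^2 \\
  &\le \left( \int_0^1 1^2 \, dt \right)
       \left( \int_0^1 \|\gamma'(t)\|^2 \, dt \right)
   = 2\, E[\gamma],
\end{aligned}
```

Equality holds exactly when $\|\gamma'(t)\|$ is constant in $t$. Reparameterizing a curve to constant speed leaves $L$ unchanged and lowers $E$ to $L^2 / 2$, so a minimizer of the energy is a shortest path traversed at constant speed.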

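Taking a variation of the geodesic energy functional (2) results in the Euler-Lagrange equation for a geodesic:

$$\frac{d^2 \gamma^k}{dt^2} = -\sum_{i,j} \Gamma^k_{ij}\, \frac{d\gamma^i}{dt}\, \frac{d\gamma^j}{dt}, \tag{3}$$

where $\Gamma^k_{ij}$ are the Christoffel symbols of the metric, written here as the matrix field $G$ with entries $G_{ij}$. These are defined as

$$\Gamma^k_{ij} = \frac{1}{2} \sum_{l} G^{kl} \left( \frac{\partial G_{jl}}{\partial z^i} + \frac{\partial G_{il}}{\partial z^j} - \frac{\partial G_{ij}}{\partial z^l} \right),$$

where $G^{kl}$ denotes the entries of the inverse of $G$. Geodesic paths can then be computed using a numerical integration of the ordinary differential equation (3). However, notice that computation of the Christoffel symbols requires taking derivatives of $G$ (which involves second derivatives of the generator, $g$) and also a matrix inverse of $G$. As we show in the next subsection, these expensive calculations can be avoided if we start from a discrete counterpart to the geodesic energy (2).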

3.1 Efficient Discrete Geodesic Computation

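We begin with a discretized curve as a sequence of coordinates $z_0, z_1, \ldots, z_T \in Z$. We think of this as approximating a continuous curve, $\gamma : [0, 1] \to Z$. Thus, with $T$ time steps, we have a discrete time interval of $\delta t = 1/T$. This also corresponds to a discrete curve on the manifold as $x_i = g(z_i)$. Using forward finite differences, we get the approximate velocity of the curve at $x_i$ as $v_i = (x_{i+1} - x_i)/\delta t$. Now the discrete analog of (2) gives us the energy of this curve:

$$E(z_0, \ldots, z_T) = \frac{1}{2} \sum_{i=0}^{T-1} \frac{1}{\delta t}\, \| x_{i+1} - x_i \|^2. \tag{4}$$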

Fixing the endpoints, $z_0$ and $z_T$, as our target start and end points of the geodesic path, we will minimize this discrete geodesic energy by taking a gradient descent in the remaining points on the curve, $z_1, \ldots, z_{T-1}$. The gradient with respect to $z_i$ is

$$\nabla_{z_i} E = -\frac{1}{\delta t}\, J_g(z_i)^\top \left( x_{i+1} - 2 x_i + x_{i-1} \right). \tag{5}$$

Notice that the gradient is a finite-difference second derivative in the $X$ space, followed by a Jacobian of $g$ coming from the chain rule. The second finite difference in $X$ space may have a component normal to the tangent space $T_{x_i} M$. However, the $J_g(z_i)^\top$ will project out this normal component and map the gradient in $X$ to a gradient in $Z$. Finally, geodesic path finding proceeds by optimizing the curve coordinates $z_i$, using gradient descent with the gradient in (5).

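Input: Two points, $z_0, z_T \in Z$
$\alpha$ is gradient descent step size
Output: Discrete geodesic path, $z_0, z_1, \ldots, z_T \in Z$
Initialize $z_i$ as linear interpolation between $z_0$ and $z_T$
while not converged do
       for $i = 1, \ldots, T-1$ do
             Compute the modified gradient $\eta_i$ using (6)
             $z_i \leftarrow z_i - \alpha\, \eta_i$
Algorithm 1 Geodesic Path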

While this gradient descent algorithm for computing discretized geodesics avoids the expensive Christoffel symbol calculations, it does still require computation of the Jacobian of the generator, $J_g$. For deep generative models, this Jacobian can be expensive. However, we can make an additional speed up for models with a corresponding encoder function, i.e., a mapping $h : X \to Z$, such that $h(g(z)) = z$. For such models, e.g., VAEs, the encoder Jacobian $J_h$ is significantly faster to compute. Now imagine moving our discrete curve points, $x_i$, in the negative gradient direction along $M$. Mapping this direction down into $Z$ via the Jacobian of $h$, $J_h$, will produce an equivalent direction in $z$ coordinates. This results in the following modified gradient, which replaces $J_g^\top$ with the faster-to-compute $J_h$:

$$\eta_i = -\frac{1}{\delta t}\, J_h(x_i) \left( x_{i+1} - 2 x_i + x_{i-1} \right). \tag{6}$$

Although this modified gradient is no longer the gradient of the discrete curve energy, it does move the $z_i$ in the same initial direction. Also, descent in this modified gradient direction has the same fixed point as gradient descent. The final geodesic path algorithm is given in Algorithm 1.
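
The discrete scheme can be sketched on a surface with a known Jacobian. The example below is an illustration only, under stated assumptions: it uses a hypothetical analytic generator `g` (a hyperbolic paraboloid $(u, v) \mapsto (u, v, u^2 - v^2)$) rather than a trained network, and it descends along the exact energy gradient with $J_g^\top$ instead of the encoder-based modified gradient of Algorithm 1, since this toy surface has no learned encoder. The step size and iteration count are hand-picked for this surface.

```python
import numpy as np

def g(z):
    # Toy generator: hyperbolic paraboloid (u, v) -> (u, v, u^2 - v^2).
    u, v = z[..., 0], z[..., 1]
    return np.stack([u, v, u**2 - v**2], axis=-1)

def jac_g(z):
    # Analytic Jacobian of g, shape (..., 3, 2).
    u, v = z[..., 0], z[..., 1]
    one, zero = np.ones_like(u), np.zeros_like(u)
    return np.stack([
        np.stack([one, zero], axis=-1),
        np.stack([zero, one], axis=-1),
        np.stack([2 * u, -2 * v], axis=-1),
    ], axis=-2)

def discrete_energy(z):
    # Discrete energy with dt = 1/T: 0.5 * sum_i ||x_{i+1} - x_i||^2 / dt.
    x = g(z)
    T = len(z) - 1
    return 0.5 * T * np.sum((x[1:] - x[:-1]) ** 2)

def geodesic_path(z_start, z_end, T=10, step=0.002, iters=5000, tol=1e-10):
    # Initialize with linear interpolation in latent space.
    s = np.linspace(0.0, 1.0, T + 1)[:, None]
    z = (1 - s) * np.asarray(z_start, float) + s * np.asarray(z_end, float)
    for _ in range(iters):
        x = g(z)
        second_diff = x[2:] - 2 * x[1:-1] + x[:-2]            # (T-1, 3)
        J = jac_g(z[1:-1])                                     # (T-1, 3, 2)
        # Exact gradient: -(1/dt) * J^T (x_{i+1} - 2 x_i + x_{i-1}).
        grad = -T * np.einsum('nij,ni->nj', J, second_diff)
        z[1:-1] -= step * grad                                 # endpoints stay fixed
        if np.abs(grad).max() < tol:
            break
    return z

z_lin = np.linspace([-1.0, 1.0], [1.0, 1.0], 11)
z_geo = geodesic_path([-1.0, 1.0], [1.0, 1.0], T=10)
print(discrete_energy(z_lin), discrete_energy(z_geo))
```

On this surface the optimized path has lower discrete energy than the straight latent line between the same endpoints. A straight latent line along the diagonal $u = v$ maps to a straight line in the ambient space, so it is already a geodesic and the iteration leaves it unchanged.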

3.2 Parallel Translation

Given a geodesic path from a point $a$ to a point $b$, we can transfer the change from $a$ to $b$ into a change of a third point $c$. This type of “analogy” is performed in three steps: (1) compute the initial velocity to the geodesic from $a$ to $b$, (2) parallel translate this velocity along the geodesic from $a$ to $c$, and (3) use this velocity at $c$ to shoot a geodesic segment. In Euclidean space, these operations would be (1) take the difference $v = b - a$, (2) consider $v$ as a vector based at $c$, and (3) shoot the geodesic (straight line) by adding $c + v$. Parallel translation for non-flat manifolds moves a tangent vector along the manifold with as little change as possible, while still enforcing that the vector stay tangent. This operation preserves the inner product between tangent vectors, and as such, preserves the length of a translated tangent vector. As a concrete example, imagine the 2D sphere with a tangent vector at the north pole. Now rotate the sphere and tangent vector with it. This is parallel translation along the path swept out by the rotation.

Now, assume that we already have a discrete path $z_0, \ldots, z_T$ in coordinates and a tangent vector $v_0 \in T_{x_0} M$ at the initial point $x_0 = g(z_0)$ on the manifold. A small step of parallel translation is approximately equivalent to Euclidean translation of the vector from $x_i$ to $x_{i+1}$. However, the vector at this new position will be slightly out of the tangent space. This can be corrected by applying the minimal rotation to bring this vector into the tangent space. Note that we can do this using the singular value decomposition (SVD) of the Jacobian, $J_g(z_{i+1}) = U \Sigma V^\top$. The left singular vectors, i.e., the columns of $U$, give an orthonormal basis for the tangent space. Rotation onto this basis is equivalent to a projection (multiplication by $U U^\top$) followed by a rescaling of the vector back to its original length. Repeating this process for each time step along the curve gives our parallel translation routine, summarized in Algorithm 2.

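Input: Discrete path: $z_0, \ldots, z_T$, and tangent vector: $v_0 \in T_{g(z_0)} M$
Output: Tangent vector $v_T \in T_{g(z_T)} M$
for $i = 1, \ldots, T$ do
       Compute SVD: $J_g(z_i) = U_i \Sigma_i V_i^\top$
       Project: $u_i = U_i U_i^\top v_{i-1}$
       Rescale: $v_i = \left( \| v_{i-1} \| / \| u_i \| \right) u_i$
Algorithm 2 Parallel Translation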
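
The project-and-rescale step can be sketched on the unit sphere, where the expected result is known in closed form. This is an illustration under stated assumptions: the spherical-coordinate generator with Jacobian `sphere_jacobian` and the function name `parallel_translate` are hypothetical stand-ins for a trained model, and tangent vectors are stored as vectors in the ambient space.

```python
import numpy as np

def sphere_jacobian(z):
    # Jacobian of (theta, phi) -> (sin t cos p, sin t sin p, cos t), shape (3, 2).
    theta, phi = z
    return np.array([
        [np.cos(theta) * np.cos(phi), -np.sin(theta) * np.sin(phi)],
        [np.cos(theta) * np.sin(phi),  np.sin(theta) * np.cos(phi)],
        [-np.sin(theta), 0.0],
    ])

def parallel_translate(jac, path, v):
    """Move tangent vector v along a discrete latent path by projection."""
    v = np.asarray(v, dtype=float)
    for z in path[1:]:
        # Columns of U form an orthonormal basis of the tangent space at g(z).
        U, _, _ = np.linalg.svd(jac(z), full_matrices=False)
        u = U @ (U.T @ v)                                  # project onto tangent space
        v = u * (np.linalg.norm(v) / np.linalg.norm(u))    # restore original length
    return v

# Quarter of the equator: theta = pi/2, phi from 0 to pi/2.
T = 50
path = np.stack([np.full(T + 1, np.pi / 2),
                 np.linspace(0.0, np.pi / 2, T + 1)], axis=1)

# A vector pointing along the equator keeps pointing along it.
v_along = parallel_translate(sphere_jacobian, path, [0.0, 1.0, 0.0])
# A vector pointing north stays pointing north.
v_north = parallel_translate(sphere_jacobian, path, [0.0, 0.0, 1.0])
print(v_along, v_north)
```

Because the equator is a geodesic, its own velocity direction is carried along the curve while the northward vector is unchanged, and both keep unit length.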

3.3 Geodesic Shooting

Given a starting point $x_0 \in M$ and a starting velocity $v_0 \in T_{x_0} M$, there is a unique geodesic $\gamma$ with these initial conditions, $\gamma(0) = x_0$ and $\gamma'(0) = v_0$. (Technically, such a geodesic is only guaranteed to exist for some finite time.) In Euclidean space, this intuitively says that given a starting point and velocity, there is only one straight line with those initial conditions.

To compute geodesic shooting, that is, a geodesic path from initial conditions, we will use the connection between the geodesic equation and parallel translation from the previous subsection. The geodesic equation says that the velocity of a geodesic moves by parallel translation along the geodesic. Therefore, we can compute a discrete geodesic step by taking a small step in the current velocity direction, followed by updating the velocity to this new point by parallel translation. This process is detailed in Algorithm 3.

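Input: Initial point $z_0$, and initial velocity $v_0$
Output: Final point on geodesic segment: $z_T$
Discrete timestep: $\delta t = 1/T$
for $i = 0, \ldots, T-1$ do
       Step from $z_i$ in the current velocity direction $v_i$ for time $\delta t$ to obtain $z_{i+1}$
       Compute SVD: $J_g(z_{i+1}) = U \Sigma V^\top$
       Parallel translate $v_i$ to the new point (project with $U U^\top$, rescale to length $\| v_i \|$) to obtain $v_{i+1}$
Algorithm 3 Geodesic Shooting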
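
A minimal sketch of this step-then-translate loop on the unit sphere follows. It rests on stated assumptions: the velocity is stored as an ambient-space tangent vector, and it is mapped to a latent step with the pseudoinverse of the generator Jacobian, which is one possible choice made here for illustration and is not necessarily how the paper implements the step. The names `sphere_jacobian` and `geodesic_shoot` are hypothetical.

```python
import numpy as np

def sphere_jacobian(z):
    # Jacobian of (theta, phi) -> (sin t cos p, sin t sin p, cos t), shape (3, 2).
    theta, phi = z
    return np.array([
        [np.cos(theta) * np.cos(phi), -np.sin(theta) * np.sin(phi)],
        [np.cos(theta) * np.sin(phi),  np.sin(theta) * np.cos(phi)],
        [-np.sin(theta), 0.0],
    ])

def geodesic_shoot(jac, z0, v0, T=100):
    """Shoot a discrete geodesic for unit time from z0 with ambient velocity v0."""
    z = np.asarray(z0, dtype=float)
    v = np.asarray(v0, dtype=float)
    dt = 1.0 / T
    for _ in range(T):
        # Assumption: convert the ambient velocity to a latent step via pinv(J).
        z = z + dt * (np.linalg.pinv(jac(z)) @ v)
        # Parallel translate the velocity to the new point (project, rescale).
        U, _, _ = np.linalg.svd(jac(z), full_matrices=False)
        u = U @ (U.T @ v)
        v = u * (np.linalg.norm(v) / np.linalg.norm(u))
    return z, v

start = [np.pi / 2, 0.0]  # the point (1, 0, 0) on the equator

# Shoot along the equator with speed pi/2: expect a quarter turn in phi.
z_east, _ = geodesic_shoot(sphere_jacobian, start, [0.0, np.pi / 2, 0.0])
# Shoot north with speed 0.5: expect theta to decrease by 0.5.
z_north, v_north_end = geodesic_shoot(sphere_jacobian, start, [0.0, 0.0, 0.5])
print(z_east, z_north)
```

Both shots follow great circles at constant speed, so the latent end points match the arc lengths prescribed by the initial velocities.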

4 Experiments

Figure 2: Comparing linear interpolation with geodesic interpolation (Algorithm 1) for a pair of points on the manifold induced by the generator of the VAE, which is trained on the data from a hyperbolic paraboloid. Left: True surface of the hyperbolic paraboloid. Middle: Surface of $M$ (range of the VAE’s learned generator mapping, $g$) overlaid with the curves for linear and geodesic interpolation. Right: Linear and geodesic interpolation curves in $Z$.

In this section, we conduct an extensive empirical study of the proposed algorithms for various Riemannian geometry computations in the context of deep generative models. We work with the variational autoencoder (VAE) [11, 17] as our generative model of choice; however, the proposed algorithms are equally applicable to other popular generative models, such as generative adversarial networks [8] and PixelCNN [15].

VAE Encoder architecture
Conv (stride 2), Batch norm, ELU
Conv (stride 2), Batch norm, ELU
Conv (stride 2), Batch norm, ELU
Conv (stride 2), Batch norm, ELU
FC 256, Batch norm, ELU
FC 32 (Mean) | FC 32, Sigmoid (Std. dev.)
Table 1: Architectural details of the VAE model used for the CelebA and SVHN datasets. The architecture of the generator is the reverse of the encoder, with Conv layers replaced with transposed convolutions (Deconv) and an additional final Deconv layer of size .

4.1 Synthetic Manifold

Since it is difficult to visualize high dimensional real data as manifolds, we illustrate the geodesic traversal using a simple analytically defined manifold. In particular, we use a hyperbolic paraboloid which is a 2-D surface in three dimensions, defined as the set . We sample data from this manifold using ancestral sampling, with and . We sample points on this manifold and train a VAE on this data with latent dimension of 2. The encoder $h$ is a two layer neural network with a fully-connected hidden layer of size 100 (FC-100) having ELU activations. The encoder outputs the mean (FC-2) and variance (FC-2, followed by Sigmoid) of the approximate posterior. The decoder $g$ has the reverse architecture of the encoder (FC-100, ELU, FC-3) and maps the two dimensional latents to three dimensional points on the manifold. We use exponential linear units (ELU) [5] so that the resulting generator mapping is differentiable ($C^1$). Although the use of ELUs does not result in a $C^\infty$ mapping, it does ensure that we generate a $C^1$ manifold. Also, all of our proposed algorithms are valid because they require at most first derivatives of the generator. We train this using minibatch stochastic gradient descent with batch size of and learning rate of for minibatch iterations.

We pick two points reasonably far away on the analytically defined hyperbolic paraboloid, $x_0$ and $x_T$, and map these to the latent space of the trained VAE using the encoder as $z_0 = h(x_0)$ and $z_T = h(x_T)$, where $h$ in this context represents the mean of the approximate posterior. The corresponding points on the manifold are obtained as $g(z_0)$ and $g(z_T)$. We use Algorithm 1 to estimate the geodesic connecting the points $g(z_0)$ and $g(z_T)$, and compare it with the curve traced on the generator’s manifold by linear interpolation between $z_0$ and $z_T$.

Fig. 2 visualizes the true shape of our analytically defined hyperbolic paraboloid (left-most plot) along with the shape of the manifold as learned by the VAE’s generator (middle plot). We also visualize the geodesic and linear interpolation curves between the points $g(z_0)$ and $g(z_T)$ on the learned manifold (middle plot), and the same set of curves between $z_0$ and $z_T$ in the two dimensional latent space (right-most plot). This clearly brings out the differences between linear and geodesic interpolation paths, with a shorter geodesic curve on the manifold (about 35% smaller arclength than the linear curve) being traced by a longer curve in the latent space.

4.2 Real Manifolds

In this section, we investigate the Riemannian geometry of the generated manifolds learned on real images by carrying out computations such as geodesic interpolation and geodesic mean, and comparing these with the corresponding linear counterparts in $Z$ space. We use two real image datasets in our experiments:
CelebA [13].  It consists of RGB face images of celebrities. We use center-cropped images of shape as used in several earlier works, using of these for training the VAE.
SVHN [14].  It consists of house numbers obtained from Google Street View images. We use about cropped digits of shape for training that are provided as part of the dataset.

Implementation details.  The architecture of the encoder ($h$) for both CelebA and SVHN is shown in Table 1. The architecture of the generator ($g$) is the reverse of the encoder architecture, with an additional transposed convolution layer that outputs the RGB image. The latent dimension is kept at 32 for both datasets. The model is trained for minibatch iterations (batch size of ) using ADAM [10] with the learning rate of .

Row 1 (linear): arc length 65.84
Row 2 (geodesic): arc length 65.17
Row 3 (linear): arc length 56.34
Row 4 (geodesic): arc length 53.76
Row 5 (linear): arc length 82.71
Row 6 (geodesic): arc length 77.01
Figure 3: Linear and geodesic interpolation results for CelebA dataset. Rows 1, 3, 5: linear; Rows 2, 4, 6: geodesic; Column 1: arc length.

4.2.1 Geodesic Interpolation

We use Algorithm 1 to estimate the geodesic curve connecting a given pair of images on the generated manifold, discretizing it at 10 points (). To get an image on the generated manifold, we pick a real image $x$ from the dataset and use $g(h(x))$ to get the corresponding point on the generated manifold. Figs. 3 and 4 show a few images (equally spaced in Z space) on the linear and geodesic interpolation curves along with their arclengths, for CelebA and SVHN, respectively. Although the geodesic curve on the manifold gives a shorter arclength than linear interpolation in Z space, the difference is not as pronounced as in our earlier experiment with the synthetic manifold: the reduction is roughly 1% to 7% for the CelebA pairs in Fig. 3 and under 2% for the SVHN pairs in Fig. 4, compared with about 35% for the hyperbolic paraboloid. This suggests that the generated manifolds learned by our VAE architecture for CelebA and SVHN, although nonlinear, have very little curvature.

Row 1 (linear): arc length 18.10
Row 2 (geodesic): arc length 17.88
Row 3 (linear): arc length 24.03
Row 4 (geodesic): arc length 23.88
Row 5 (linear): arc length 15.94
Row 6 (geodesic): arc length 15.66
Figure 4: Linear and geodesic interpolation results for SVHN dataset. Rows 1, 3, 5: linear; Rows 2, 4, 6: geodesic; Column 1: arc length.

4.2.2 Fréchet Means

We take a step further and look at the Fréchet mean of a chosen set of points on the generated manifold, comparing it with the linear mean in Z space. The Fréchet mean of a set is a point on the manifold which minimizes the total sum-of-squared geodesic distance to all the points in the set. In our setting, if $x_1, \ldots, x_N \in M$ are input data points, the Fréchet mean is defined as the solution to the optimization problem:

$$\mu = \arg\min_{x \in M} \sum_{i=1}^{N} d(x, x_i)^2,$$

where $d$ is the geodesic distance, i.e., the arc length of the path computed using Algorithm 1. We optimize this least squares problem using gradient descent in the latent coordinates for $\mu$.
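
To make the definition concrete, the sketch below computes a Fréchet mean on the unit sphere. It is a toy illustration under stated assumptions and swaps in a different technique from the one described above: instead of geodesic distances from Algorithm 1 and descent in latent coordinates, it uses the closed-form exponential and logarithm maps of the sphere, which are available only because the geometry is known analytically. All function names are hypothetical.

```python
import numpy as np

def sphere_log(mu, x):
    # Log map: tangent vector at mu pointing to x, with length equal to
    # the great-circle distance between mu and x.
    cos_t = np.clip(mu @ x, -1.0, 1.0)
    theta = np.arccos(cos_t)
    if theta < 1e-12:
        return np.zeros_like(mu)
    return theta * (x - cos_t * mu) / np.sin(theta)

def sphere_exp(mu, v):
    # Exp map: follow the geodesic from mu with initial velocity v for unit time.
    norm_v = np.linalg.norm(v)
    if norm_v < 1e-12:
        return mu
    return np.cos(norm_v) * mu + np.sin(norm_v) * v / norm_v

def frechet_mean(points, step=1.0, iters=100):
    mu = points[0]
    for _ in range(iters):
        # The mean log map is the negative gradient of 0.5 * mean_i d(mu, x_i)^2.
        direction = np.mean([sphere_log(mu, x) for x in points], axis=0)
        mu = sphere_exp(mu, step * direction)
    return mu

def spherical(theta, phi):
    return np.array([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)])

# Three points placed symmetrically around the north pole.
latent = np.array([[0.5, 0.0], [0.5, 2 * np.pi / 3], [0.5, 4 * np.pi / 3]])
points = np.array([spherical(t, p) for t, p in latent])

mu_frechet = frechet_mean(points)              # intrinsic mean on the sphere
mu_linear = spherical(*latent.mean(axis=0))    # generator applied to the latent mean
print(mu_frechet, mu_linear)
```

In these strongly non-geodesic coordinates the two means differ: the Fréchet mean is the north pole, while the mean of the latent coordinates maps to a point at geodesic distance 0.5 from it. This contrasts with the nearly flat learned manifolds studied here, where the two means turn out to be similar.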

A set of real images from CelebA is constructed by randomly selecting images from the dataset that all have the same value for a chosen pair of attributes. We construct four such sets, each consisting of 100 images, corresponding to attributes (black hair, mouth open), (black hair, mouth closed), (blond hair, mouth open) and (blond hair, mouth closed), respectively. We find the corresponding points on the VAE’s generated manifold by applying $g \circ h$ to each of these images. Fig. 5 visualizes the Fréchet means and linear means for these four groups of images. Here the Fréchet means are similar in appearance to the linear means in the latent space. Again, this indicates that there may be limited curvature in the manifold. However, there are certainly subtle differences (particularly in the color) that indicate curvature is playing at least some role.

Figure 5: Linear mean in $Z$ space (top) and geodesic mean (bottom) for the four groups of images from CelebA. From left to right: (black hair, mouth closed), (black hair, mouth open), (blond hair, mouth closed), (blond hair, mouth open).

4.2.3 Geodesic Distance and Attribute Groupings

In this section, we analyze how well the geodesic distances align with the groupings of the images based on the ground truth attributes. We reuse the four groups of images constructed in the earlier section for CelebA for this experiment. In addition, we also construct ten groups of images for SVHN, with each group consisting of randomly sampled images of a digit. We apply $g \circ h$ to each of these points to get corresponding points on the generated manifold, and compute linear and geodesic distances for each pair of these points. This gives us linear and geodesic distance matrices of size $400 \times 400$ for CelebA and for SVHN. We calculate scores for each distance matrix, , where is the attribute label for and is just total number of data points. The score essentially measures the ratio of the intra-group squared distances and total squared distances, with a higher value indicating better agreement between the attribute based grouping and the distances. As the score is already normalized by the sum of all squared distances, it is directly comparable across linear and geodesic distance matrices. As shown in Table 2, we obtain slightly higher scores with geodesic distances compared to the linear distances, indicating that geodesic distances group similar images slightly closer together than linear distances.

We also use multidimensional scaling (MDS) to embed the points into two dimensions based on these distance matrices, which are visualized in Fig. 6 for CelebA. The embedding based on geodesic distances visually seems to give a slightly tighter concentration around the groups, compared to the embedding based on linear distances. We also calculate the eigenvalues for the MDS matrices and plot them in Fig. 7. The eigenvalues of MDS indicate whether the data can be isometrically embedded in Euclidean space (i.e., while preserving the distance metric between pairs of points). If all eigenvalues are non-negative, then this Euclidean embedding is possible, and the dimension of the Euclidean space is the number of nonzero eigenvalues. The presence of negative eigenvalues demonstrates that the space has nonzero curvature, and exact Euclidean embedding is impossible. The magnitude of the negative eigenvalues is a measure of how far the manifold distances are deviating from Euclidean, i.e., it is a measure of how much curvature the manifold has. As expected, the linear distance matrix resulted in exactly 32 positive eigenvalues, with exactly zero eigenvalues after the 32nd. The geodesic distance matrix has negative eigenvalues, but they have very small magnitude compared with the positive eigenvalues. This strongly indicates that the generated manifold has some curvature, but it is close to being zero.
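
The eigenvalue test can be sketched with classical MDS, which double-centers the matrix of squared distances and inspects the spectrum. The example is an illustration under stated assumptions: it uses synthetic points rather than the CelebA distance matrices, and the function name `mds_eigenvalues` is hypothetical. Euclidean distances between points in $\mathbb{R}^3$ should give exactly three positive eigenvalues and none negative, while arc-length distances on a circle cannot be embedded isometrically in any Euclidean space and should produce negative eigenvalues.

```python
import numpy as np

def mds_eigenvalues(D):
    """Eigenvalues of the classical-MDS matrix B = -1/2 * H D^2 H, descending."""
    n = D.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * H @ (D ** 2) @ H           # D ** 2 squares the entries
    return np.sort(np.linalg.eigvalsh(B))[::-1]

# Case 1: Euclidean distances between random points in R^3.
rng = np.random.default_rng(0)
P = rng.standard_normal((30, 3))
D_euc = np.linalg.norm(P[:, None] - P[None], axis=-1)
eig_euc = mds_eigenvalues(D_euc)

# Case 2: geodesic (arc-length) distances between points on a unit circle.
angles = np.linspace(0.0, 2 * np.pi, 30, endpoint=False)
diff = np.abs(angles[:, None] - angles[None])
D_circ = np.minimum(diff, 2 * np.pi - diff)
eig_circ = mds_eigenvalues(D_circ)

print(np.sum(eig_euc > 1e-8), eig_euc.min(), eig_circ.min())
```

The size of the most negative eigenvalue relative to the largest positive one gives a rough indication of how far a distance matrix is from being Euclidean.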

Dataset   Geodesic    Linear
CelebA    0.7782278   0.7638913
SVHN      0.9024925   0.9021349
Table 2: Scores with geodesic and linear distance matrices (the higher the better)
Figure 6: MDS embedding for linear (left) and geodesic (right) distance matrices, for four groups of images from CelebA.
Figure 7: Eigenvalues of the MDS matrices for the four groups of images from CelebA dataset (400 total images). Left: all 400 eigenvalues. Right: Zooming in on lowest eigenvalues. Note that the vertical axis scale is much smaller in the right plot.

4.2.4 Geodesic Analogy

The analogy problem is defined as $a : b :: c : d$. In our context, $a$, $b$, and $c$ are images and we want to find an image $d$ that is related to $c$ in the same way as $b$ is related to $a$. We reuse the four CelebA groups constructed in the earlier experiments. We take $a$ to be the geodesic mean of the group (blond hair, mouth closed) and $b$ to be the geodesic mean of the group (blond hair, mouth open). We take $c$ to be a randomly selected test image with attributes (blond hair, mouth closed). For geodesic analogy, we first compute the geodesic between $a$ and $b$. The initial velocity vector at $a$ is then parallel translated to $c$ along the geodesic connecting $a$ and $c$ using Algorithm 2. We then use Algorithm 3 to shoot a geodesic of the same arc length as the $a$-$b$ geodesic along this parallel translated vector. The end point $d$ of this geodesic is expected to have a similar semantic relation to $c$ as $b$ has to $a$ (i.e., change in the binary mouth attribute from closed to open).

We also try a linear analogy operation in $Z$ space. We compute the difference $z_b - z_a$ (where $z_a = h(a)$, $z_b = h(b)$), and add the resulting vector to $z_c = h(c)$ corresponding to the test image (i.e., $z_d = z_c + (z_b - z_a)$). The answer to the linear analogy problem is then taken to be the image $g(z_d)$. Fig. 8 shows the results for geodesic and linear analogies for two different attribute combinations. The linear analogy is visually quite close to the geodesic analogy (with subtle differences), which again suggests that the generated manifold has very low curvature.

Figure 8: Linear mean vector analogy (rows 1,3) vs. geodesic parallel translated vector analogy (rows 2,4). First two rows change black hair to blond, and the last two rows change closed mouth to open.

5 Conclusion

In this paper we have introduced methods for exploring the Riemannian geometry of manifolds learned by deep generative models. Our experiments show that these models represent real image data with manifolds that have surprisingly little curvature. Consequently, straight lines in the latent space are relatively close to geodesic curves on the manifold. This fact may explain why traversal in the latent space results in visually plausible changes to the generated data: curvilinear distances in the original data metric are roughly preserved. However, our experiments were limited to a single type of deep network (VAE) and two real image data sets (CelebA and SVHN). Further investigation into this phenomenon is warranted, to identify if there are other architectures or datasets where curvature plays a more prominent role. Also, even for the results presented here, the role of curvature should not be completely discounted: there are still differences between latent distances and geodesic distances that may have more nuanced effects in certain applications. We believe that exploring the Riemannian geometry of deep generative models, using the tools developed in this paper, will be an important step in understanding the high-dimensional, nonlinear spaces these models learn.