Official Code for the paper "A Geometric Analysis of Deep Generative Image Models and Its Applications"
Generative adversarial networks (GANs) have emerged as a powerful unsupervised method to model the statistical patterns of real-world data sets, such as natural images. These networks are trained to map random inputs in their latent space to new samples representative of the learned data. However, the structure of the latent space is hard to intuit due to its high dimensionality and the non-linearity of the generator, which limits the usefulness of the models. Understanding the latent space requires a way to identify input codes for existing real-world images (inversion), and a way to identify directions with known image transformations (interpretability). Here, we use a geometric framework to address both issues simultaneously. We develop an architecture-agnostic method to compute the Riemannian metric of the image manifold created by GANs. The eigen-decomposition of the metric isolates axes that account for different levels of image variability. An empirical analysis of several pretrained GANs shows that image variation around each position is concentrated along surprisingly few major axes (the space is highly anisotropic) and the directions that create this large variation are similar at different positions in the space (the space is homogeneous). We show that many of the top eigenvectors correspond to interpretable transforms in the image space, with a substantial part of eigenspace corresponding to minor transforms which could be compressed out. This geometric understanding unifies key previous results related to GAN interpretability. We show that the use of this metric allows for more efficient optimization in the latent space (e.g. GAN inversion) and facilitates unsupervised discovery of interpretable axes. Our results illustrate that defining the geometry of the GAN image manifold can serve as a general framework for understanding GANs.
Generative adversarial networks (GANs) learn patterns that characterize complex datasets, and subsequently generate new samples representative of that set. In recent years, there has been tremendous success in training GANs to generate high-resolution and photorealistic images (karras2017progressGAN; brock2018BigGAN; donahue2019BigBiGAN; karras2020StyleGAN2). Well-trained GANs show smooth transitions between image outputs when interpolating in their latent input space, which makes them useful in applications such as high-level image editing (changing attributes of faces), object segmentation, and image generation for art and neuroscience (zhu2016natmanif; shen2020latentFaceEdit; pividori2019exploitBigGAN; ponce2019NeuroEvol). However, there is no systematic approach for understanding the latent space of any given GAN or its relationship to the manifold of natural images.
Because a generator provides a smooth map onto image space, one relevant conceptual model for GAN latent space is a Riemannian manifold. To define the structure of this manifold, we have to ask questions such as: are images homogeneously distributed on a sphere (white2016samplingGAN)? What is the structure of its tangent space: do all directions induce the same amount of variance in image transformation? Here we develop a method to compute the metric of this manifold and investigate its geometry directly, and then use this knowledge to navigate the space and improve several applications.
To define a Riemannian geometry, we need to have a smooth map and a notion of distance on it, defined by the metric tensor. For image applications, the relevant notion of distance is in image space rather than code space. Thus, we can pull back the distance function from the image space onto the latent space. Differentiating this distance function on latent space, we will get a differential geometric structure (Riemannian metric) on the image manifold. Further, by computing the Riemannian metric at different points (i.e. around different latent codes), we can estimate the anisotropy and homogeneity of this manifold.
The paper is organized as follows: first, we review the previous work using tools from Riemannian geometry to analyze generative models in section 2. Using this geometric framework, we introduce an efficient way to compute the metric tensor $H$ on the image manifold in section 3, and empirically investigate the properties of $H$ in various GANs in section 4. We explain the properties of this metric in terms of network architecture and training in section 5. We show that this understanding provides a unifying principle behind previous methods for interpretable axes discovery in the latent space. Finally, we demonstrate other applications that this geometric information could facilitate, e.g. gradient-free searching in the GAN image manifold in section 6.
Concepts in Riemannian geometry have been recently applied to illuminate the structure of latent space of generative models (i.e. GANs and variational autoencoders, VAEs). shao2018riemannianGeomDGM designed algorithms to compute the geodesic path, parallel transport of vectors and geodesic shooting in the latent space; they used finite difference together with a pretrained encoder to circumvent the Jacobian computation of the generator. While promising, this method did not provide information about the metric directly and could not be applied to GANs without encoders. arvanitidis2017latentCurvature focused on the geometry of VAEs, deriving a formula for the metric tensor in order to solve for the geodesic in the latent space; this worked well with shallow convolutional VAEs and low-resolution images (28 x 28 pixels). chen2018metricDGM computed the geodesic through minimization, applying their method on shallow VAEs trained on MNIST images and a low-dimensional robotics dataset. In the above, the featured methods could only be applied to neural networks without ReLU activation. Here, our geometric analysis is architecture-agnostic and it is applied to modern large-scale GANs (e.g. BigGAN, StyleGAN2). Further, we extend the pixel L2 distance assumed in previous works to any differentiable distance metric.
A generative network, denoted by $G$, is a mapping from latent code $z$ to image $I$, $G: \mathbb{R}^n \to \mathcal{I} = \mathbb{R}^{H \times W \times 3}$, $z \mapsto I$. Borrowing the language of Riemannian geometry, $G(z)$ parameterizes a submanifold in the image space with $z \in \mathbb{R}^n$. Note that for applications in the image domain, we care about distance in the image space. Thus, given a distance function in image space $D: \mathcal{I} \times \mathcal{I} \to \mathbb{R}_+$, we can define the distance between two codes as the distance between the images they generate, i.e. pull back the distance function to latent space through $G$: $d(z_1, z_2) := D(G(z_1), G(z_2))$.
The Hessian matrix (second order partial derivative) of the squared distance function $d^2$ can be seen as the metric tensor of the image manifold (palais1957metricReconstr). The intuition behind this is as follows: consider the squared distance to a fixed reference vector $z_0$ as a function of $z$, $f_{z_0}(z) = d^2(z_0, z)$. Obviously, $z = z_0$ is a local minimum of $f_{z_0}(z)$, thus $f_{z_0}(z)$ can be locally approximated by a positive semi-definite quadratic form $H(z_0)$ as in Eq. 1. This matrix induces an inner product and defines a vector norm, $\|v\|_H^2 = v^T H(z_0) v$. This squared vector norm approximates the squared image distance up to the constant factor of the second-order Taylor expansion, $d^2(z_0, z_0 + \delta z) \approx \frac{1}{2}\,\delta z^T H(z_0)\,\delta z = \frac{1}{2}\|\delta z\|_H^2$. Thus, this matrix encodes the local distance information on the image manifold up to second order approximation. This is the intuition behind the Riemannian metric. In this article, the terms "metric tensor" and "Hessian matrix" are used interchangeably. We will call $\alpha_H(v) = v^T H v / v^T v$ the approximate speed of image change along $v$ as measured by metric $H$.
As defined above, the metric tensor $H$ can be computed by doubly differentiating the squared distance function $d^2$. Here we use a convolutional neural network (CNN)-based distance metric, LPIPS (zhang2018PercMetric), as it has been demonstrated to approximate human perceptual similarity judgements. The direct method to compute the Hessian is by building a computational graph towards the gradient $g(z) = \partial_z d^2(z_0, z)$ and then computing the gradient towards each element in $g(z)$. This method computes $H$ column by column, therefore its time complexity is proportional to the latent-space dimension $n$ and the backpropagation time through this graph.
For situations when direct backpropagation is too slow (e.g. FC6GAN, StyleGAN2), we developed an approximation method to compute the major eigen-dimensions of the Hessian more efficiently. These top eigen-pairs are useful in applications like optimization and exploration; moreover, they form the best low-rank approximation to the Hessian. As we will later discover, the spectra of these Hessians have a fast decay, thus far fewer than $n$ eigenvectors are required to approximate them, cf. Sec 4. As a matrix, the Hessian is a linear operator, which can be defined as long as one can compute the Hessian vector product (HVP), $v \mapsto Hv$. Since the gradient with respect to $z$ commutes with the inner product with $v$, the HVP can be rewritten as the gradient of $v^T g(z)$, or as the directional derivative of the gradient $g(z)$ along $v$ (Eq. 2). The first form is easy to compute in reverse-mode auto-differentiation, and the directional derivative is easy to compute in forward-mode auto-differentiation (or finite differencing). Then, Lanczos iteration is applied to the HVP operator defined in these two ways to solve for the largest eigenpairs, which can reconstruct an approximate Hessian matrix. The iterative algorithms using the two HVP definitions are termed Backward Iteration and Forward Iteration, respectively. For details and efficiency comparison, see Appendix A.2.
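The finite-difference form of the HVP can be sketched on a toy problem. The block below is only an illustration under stated assumptions: `phi` is a hypothetical smooth map standing in for the generator composed with the image distance features, the squared distance is taken as $\frac{1}{2}\|\phi(z) - \phi(z_0)\|^2$ so that its gradient is available in closed form, and the names `grad` and `hvp_forward` are ours, not the authors' code. The reverse-mode form would need an auto-differentiation framework and is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 8, 20                                  # toy latent / output dimensions (assumed)
W = rng.normal(size=(m, n)) / np.sqrt(n)
z0 = rng.normal(size=n)                       # reference latent code

def phi(z):
    """Hypothetical smooth map standing in for the generator plus feature map."""
    return np.tanh(W @ z)

def jacobian(z):
    """Analytic Jacobian of phi at z, shape (m, n)."""
    return (1.0 - np.tanh(W @ z) ** 2)[:, None] * W

def grad(z):
    """Gradient of f(z) = 0.5 * ||phi(z) - phi(z0)||^2 with respect to z."""
    return jacobian(z).T @ (phi(z) - phi(z0))

def hvp_forward(v, eps=1e-4):
    """Central finite difference of the gradient along v, approximating H v at z0."""
    return (grad(z0 + eps * v) - grad(z0 - eps * v)) / (2.0 * eps)

# At z0 the residual vanishes, so the exact Hessian of f is J^T J.
J0 = jacobian(z0)
H = J0.T @ J0

v = rng.normal(size=n)
v /= np.linalg.norm(v)                        # unit direction
max_err = np.max(np.abs(hvp_forward(v) - H @ v))
print("max abs error of finite-difference HVP:", max_err)
```

Each call to `hvp_forward` costs two gradient evaluations and never forms the full matrix, which is what makes it usable as the operator inside an iterative eigen-solver.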
Note a similar computational method has been employed to understand the optimization landscape of deep neural networks recently (ghorbani2019HessSpecNN), although it has not been applied towards the geometry of latent space of GANs before.
This formulation and computation of the Riemannian metric is generic to any mapping into a metric space. Consider a mapping $\phi(z): \mathbb{R}^n \to \mathbb{R}^M$, which could be the feature map of a layer in the GAN, or a CNN processing the generated image. We can pull back the squared L2 distance and metric from $\mathbb{R}^M$, $d_\phi^2(z_1, z_2) = \frac{1}{2}\|\phi(z_1) - \phi(z_2)\|_2^2$, and define a manifold. The metric tensor of this manifold, $H_\phi$, can be derived as the Hessian of $d_\phi^2$. Note that there is a simple relationship between the Hessian of $d_\phi^2$, $H_\phi$, and the Jacobian of $\phi$, $J_\phi$ (Eq. 3): $H_\phi = J_\phi^T J_\phi$, so that $v^T H_\phi v = \|J_\phi v\|^2$. Through this we know the eigenvalues and eigenvectors of the Hessian matrix $H_\phi$ correspond to the squared singular values and right singular vectors of the Jacobian $J_\phi$. This allows us to examine the geometry of any representation throughout the GAN, and analyze how the geometry in the image space builds up.
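The correspondence between the Hessian and the Jacobian can be checked numerically on a small example. This is a minimal sketch, assuming a hypothetical map `phi(z) = tanh(W z)` whose Jacobian is known analytically; it compares the eigen-decomposition of $J^T J$ with the SVD of $J$, and verifies by a second difference that $J^T J$ is the curvature of $\frac{1}{2}\|\phi(z) - \phi(z_0)\|^2$ at $z_0$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 6, 15                                  # toy dimensions (assumed)
W = rng.normal(size=(m, n)) / np.sqrt(n)
z0 = rng.normal(size=n)

def phi(z):
    """Hypothetical representation map."""
    return np.tanh(W @ z)

def sq_dist(z):
    """Pulled-back squared distance to z0, with the 1/2 convention."""
    return 0.5 * np.sum((phi(z) - phi(z0)) ** 2)

J = (1.0 - np.tanh(W @ z0) ** 2)[:, None] * W   # analytic Jacobian at z0
H = J.T @ J                                      # metric tensor at z0

eigvals, eigvecs = np.linalg.eigh(H)             # eigh returns ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
_, svals, Vt = np.linalg.svd(J, full_matrices=False)

# |cosine| between each eigenvector and the matching right singular vector
overlap = np.abs(np.sum(eigvecs * Vt.T, axis=0))

# Second difference of the squared distance along the top eigenvector
eps = 1e-3
v = eigvecs[:, 0]
curvature = (sq_dist(z0 + eps * v) + sq_dist(z0 - eps * v) - 2.0 * sq_dist(z0)) / eps**2

print(eigvals, svals**2, overlap, curvature)
```

The same check applies to any intermediate layer, since only the Jacobian of the chosen representation enters.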
In this work, we use LPIPS, which defines image distance based on the squared L2 distance of the first few layers of a pretrained CNN. If we concatenate the activations and denote this representational map by $\varphi(I)$, then the metric tensor of the image manifold can be derived from the Jacobian of the composite of the generator and the representation map $\varphi \circ G$, $H = J_{\varphi \circ G}^T J_{\varphi \circ G}$. This connection is crucial for understanding how geometry depends on the network architecture.
Using the above method, we analyzed the geometry of the latent space of the following GANs: DCGAN (radford2015DCGAN), DeePSiM/FC6GAN (dosovitskiy2016DeepSiMGAN), BigGAN (brock2018BigGAN), BigBiGAN (donahue2019BigBiGAN), Progressive Growing of GANs (PGGAN) (karras2017progressGAN), StyleGAN 1 and 2 (karras2019styleGAN; karras2020StyleGAN2) - model specifications reviewed in Sec. A.3. These GANs are progressively deeper and more complex, and some employ a style-based architecture instead of the conventional DCGAN architecture (e.g. StyleGAN1,2). This diverse set of models allowed us to test the broad applicability of this new approach. In the following sections, ”top” and ”bottom” eigenvectors refer to the eigenvectors with large and small eigenvalues.
In differential geometry, a metric tensor captures an infinitesimal notion of distance. To determine whether this quantity represents evident image changes, we randomly picked a latent code $z_0$, then computed the metric tensor $H(z_0)$ and its eigendecomposition $H(z_0) = \sum_i \lambda_i v_i v_i^T$. Then we explored linearly in the latent space along the eigenvectors, $G(z_0 + \mu_i v_i)$. (For some spaces, we used spherical linear exploration, i.e. SLERP, where we restrict the vector to a sphere of a certain norm: we project $v_i$ onto the tangent space at $z_0$ and travel on the great circle from $z_0$ along $v_i$.) We found that images changed much faster when moving along top than along bottom eigenvectors, both per visual inspection and LPIPS (Fig. 1). More intriguingly, eigenvectors at different ranks encoded qualitatively different types of changes; for example, in BigGAN noise space, the top eigenvectors encoded head direction, proximity and size, while lower eigenvectors encoded background changes, shading or much more subtle pixel-wise changes. Moreover, PGGAN and StyleGANs trained on the face datasets (CelebA, FFHQ) have top eigenvectors that represent similar interpretable transforms of faces, such as head direction, sex or age (Fig. 10). These observations raised the possibility that top eigenvectors also captured perceptually relevant changes: we tested this directly with positive results in Sec. 6.
To explore how eigenvalues were distributed, for each GAN, we randomly sampled 100-1000 codes $z$ in the latent space, used backpropagation to compute $H(z)$ and then performed the eigendecomposition. In Fig. 2, we plotted the mean and 90% confidence interval of the spectra and found that they spanned 5-10 orders of magnitude, with fast decays; each spectrum was dominated by a few eigenvectors with large eigenvalues. In other words, only a small fraction of dimensions were responsible for major image changes (Table 2), while most dimensions introduced nuanced changes (e.g. shading, background); thus GAN latent spaces were highly anisotropic.
We found this anisotropy in every GAN we tested, which raises the question of why it has not been discussed more frequently. One possibility is that the statistical properties of high dimensionality create an illusion of isotropy. When traveling along a random direction $v$ in latent space, the approximate rate of image change $\alpha_H(v)$ is a weighted average of all eigenvalues as in Eq. 9. In Sec A.6, we show analytically that the variance of $\alpha_H(v)$ across random directions will be $(n+2)/2$ times smaller than the variance among eigenvalues. For example, in BigGAN latent space (256 dimensions), the eigenvalues span over six orders of magnitude, while the $\alpha_H(v)$ for random directions has a standard deviation of less than one order of magnitude (Figs. 2, 6). Further, the center of this distribution was closer to the top of the spectrum, and thus provided a reasonable rate of change, while masking the existence of eigendimensions of extremely large and small eigenvalues.
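The illusion of isotropy can be reproduced with a short simulation. The sketch below assumes a synthetic log-uniform spectrum over six orders of magnitude in 256 dimensions (not a measured BigGAN spectrum) and directions drawn uniformly on the unit sphere; for such directions the variance of $v^T H v$ equals $\frac{2}{n+2}$ times the variance of the eigenvalues, which the Monte Carlo estimate should approach.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
eigvals = np.logspace(0, -6, n)               # assumed spectrum spanning six decades

# Random unit directions, expressed in the eigenbasis of H
V = rng.normal(size=(20000, n))
V /= np.linalg.norm(V, axis=1, keepdims=True)

# Rate of change v^T H v is a weighted average of eigenvalues with weights v_i^2
alpha = (V ** 2) @ eigvals

var_ratio = alpha.var() / eigvals.var()       # empirical variance ratio
predicted = 2.0 / (n + 2)                     # ratio for uniform unit vectors
log_std = np.log10(alpha).std()               # spread of alpha in decades

print("variance ratio:", var_ratio, "predicted:", predicted)
print("std of log10(alpha):", log_std, "decades; spectrum spans 6 decades")
print("log10 of mean alpha:", np.log10(alpha.mean()))
```

Because the mean of `alpha` equals the arithmetic mean of the eigenvalues, it is dominated by the few largest ones, which is why a random direction looks like a moderately fast axis rather than a typical one.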
Because the metric describes local geometry, the next question is how it varies at different positions in the latent space. We computed the metric $H(z)$ at randomly selected $z$ and examined their similarity using a statistic adopted from kornblith2019ReprSimNN. In this statistic, we applied the eigenvectors $U_i = [u_1, \dots, u_n]$ from a metric tensor $H_i$ at position $z_i$ to the metric tensor $H_j$ at $z_j$, as $u_k^T H_j u_k$. These values formed a vector $\Lambda_{ij}$, representing the effects of metric $H_j$ on eigenvectors of $H_i$. Then we computed the Pearson correlation coefficient between $\Lambda_{ij}$ and the target eigenvalues, $\Lambda_j$, as $C_{ij} = \mathrm{corr}(\Lambda_{ij}, \Lambda_j)$. This correlation measured the similarity of the action of metric tensors on eigenframes around different positions. As the spectrum usually spanned several orders of magnitude, we computed the correlation on the log scale, $C^{\log}_{ij} = \mathrm{corr}(\log \Lambda_{ij}, \log \Lambda_j)$, where the eigenvalues distribute more uniformly.
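A minimal sketch of this consistency statistic follows. It reflects our reading of the description above (eigenvectors of one metric applied to another, then a Pearson correlation with the second metric's eigenvalues on the log scale); the function name `hessian_consistency`, the numerical floor, and the synthetic matrices are assumptions made for illustration.

```python
import numpy as np

def hessian_consistency(H_i, H_j, log_scale=True, floor=1e-12):
    """Correlate u_k^T H_j u_k (u_k: eigenvectors of H_i) with the eigenvalues of H_j."""
    lam_j = np.linalg.eigvalsh(H_j)                    # ascending eigenvalues of H_j
    _, U_i = np.linalg.eigh(H_i)                       # columns ordered by ascending eigenvalue of H_i
    lam_ij = np.einsum("ik,ij,jk->k", U_i, H_j, U_i)   # quadratic forms u_k^T H_j u_k
    if log_scale:
        lam_j = np.log10(np.maximum(lam_j, floor))
        lam_ij = np.log10(np.maximum(lam_ij, floor))
    return np.corrcoef(lam_ij, lam_j)[0, 1]

rng = np.random.default_rng(0)
n = 50
spectrum = np.logspace(0, -6, n)                       # assumed anisotropic spectrum
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))           # shared eigenframe
R, _ = np.linalg.qr(rng.normal(size=(n, n)))           # unrelated eigenframe

H_a = Q @ np.diag(spectrum) @ Q.T
H_b = Q @ np.diag(spectrum * rng.uniform(0.5, 2.0, size=n)) @ Q.T  # same frame, jittered spectrum
H_c = R @ np.diag(spectrum) @ R.T                      # same spectrum, unrelated frame

c_same = hessian_consistency(H_a, H_a)
c_shared = hessian_consistency(H_a, H_b)
c_unrelated = hessian_consistency(H_a, H_c)
print(c_same, c_shared, c_unrelated)
```

Two metrics that share an eigenframe score close to 1 even when their eigenvalues differ, while an unrelated eigenframe gives a much lower score, which is the contrast the statistic is meant to expose.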
Using this correlation statistic, we computed the consistency of the metric tensor across hundreds of positions within each GAN latent space. As shown in Fig. 3C, the average correlation between eigenvalues and $v^T H v$ values of two points was 0.934 in BigGAN. For DCGAN-type architectures, mean correlations on the log scale ranged from 0.92 to 0.99; for StyleGAN1 and 2, they ranged from 0.64 to 0.73 in the Z space, and from 0.87 to 0.89 in the W space (Fig. 3D, Tab. 4). Overall, this shows that the local directions that induce image changes of different orders of magnitude are highly consistent at different points in the latent space. Because of this, the notion of a "global" Hessian makes sense, and we estimated it for each latent space by averaging the Hessian matrices at different locations.
As the spectra have a large portion of small eigenvalues and the metric tensors are correlated in space, the bottom eigenvectors should create a global subspace, in which latent traversal will result in small or even imperceptible changes in the image. This is supported by our perceptual study, as over half of the subjects could not see any change in the image when the latent vector moved in the bottom eigenspace (Sec. 6). This perceptually "null" space has implications for exploration in the GAN space and interpretable axes discovery. As $G(z + v) \approx G(z)$ for $v$ in the null space, if one axis $u$ encodes an interpretable transform $G(z + u)$, then shifting this vector by a vector $v$ in the null space will still result in an interpretable axis, $G(z + u + v) \approx G(z + u)$. Thus, each interpretable axis has a family of "equivalent" axes which encode similar transforms, differing from each other by a vector in the "null" space. However, adding a component in the null space will decrease the rate of image change along that axis. In this sense, the vectors using the smallest step size to achieve that transform should be the "purest" axes of the family. Further, the cosine of the angle between two interpretable axes may not represent the similarity of the transforms they encode. A large angle can be found between two axes of the same family but at different image traversal speeds. We compared the axes from previous works in A.9 and observed that projecting out a large part of their axes did not affect the semantics they encoded (Fig. 8).
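The argument about families of equivalent axes can be made concrete with a small worked example. The sketch assumes an idealized diagonal metric with an exact null space (real spectra only decay towards zero), and the vectors `u`, `v`, `w` are hypothetical.

```python
import numpy as np

n = 10
eigvals = np.array([4.0, 1.0] + [0.0] * (n - 2))   # assumed spectrum with an exact null space
H = np.diag(eigvals)                                # eigenvectors are the coordinate axes

u = np.zeros(n)
u[0] = 1.0                                          # "pure" axis: the top eigenvector
v = np.zeros(n)
v[5] = 3.0                                          # a vector inside the null space
w = u + v                                           # another member of the same family

def sq_change(x):
    """Predicted squared image change x^T H x for a full step x."""
    return x @ H @ x

def rate(x):
    """Rate of image change per unit latent length, x^T H x / x^T x."""
    return (x @ H @ x) / (x @ x)

cosine = (u @ w) / (np.linalg.norm(u) * np.linalg.norm(w))

print("squared change:", sq_change(u), sq_change(w))   # 4.0 and 4.0
print("rate of change:", rate(u), rate(w))             # 4.0 and 0.4
print("cosine between u and w:", cosine)               # about 0.316
```

Both axes produce the same predicted image change, yet `w` needs a step about 3.2 times longer and sits at a large angle from `u`, so cosine similarity alone would suggest that the two axes encode different transforms.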
Above, we showed an intriguingly consistent geometric structure across multiple GANs. Next, we sought to understand how this structure emerged through network architecture and training.
To link the metric tensor to the generator architecture, it is helpful to highlight the relationship between the metric tensor and Jacobian matrix (Eq. 3). As the latent space gets warped and mapped onto image space, directions in latent spaces are scaled differently by the Jacobian of the map; specifically, directions that undergo the most amplification will become the top eigenvectors (Fig. 4A). As the Jacobian of the generator is a composition of Jacobians of each layer, $J_G = J_L \cdots J_2 J_1$, the scaling effect on the image manifold is a product of the scaling effects of each intermediate layer. We can analyze the scaling effect of different layers by applying a set of vectors $v_k$ onto the metric tensors $H_l$ of these layers, $v_k^T H_l v_k$. In BigGAN, when we apply the eigenvectors of the first few layers onto the metric of other layers, the top eigenvectors are still strongly amplified by subsequent layers, thus forming the top eigendimensions of the manifold. Of note, this is not true for a weight-shuffled control BigGAN: in that case, the top eigendimension of the first few layers was not particularly amplified on the image manifold, and vice versa (Fig. 4B). This shows that the amplification effect of layers becomes more aligned through training, with the top eigenspace shared across layers. Further, as the amplification effects are not lined up across layers of weight-shuffled networks, these networks should exhibit a more isotropic geometry on their image manifold. Indeed, we find their spectra to be flatter and the largest eigenvalue smaller (Fig. 7).
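The effect of alignment across layers can be illustrated with a purely linear toy model. This is not the BigGAN experiment: it assumes two linear layers with identical singular values, and it mimics the weight-shuffled control by a random rotation between the layers, which is our simplification.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 32
s = np.logspace(0, -2, n)                    # per-layer singular values (assumed)

def random_orthogonal(k):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.normal(size=(k, k)))
    return q

V, U1, U2, R = (random_orthogonal(n) for _ in range(4))

J1 = U1 @ np.diag(s) @ V.T                   # first layer Jacobian
J2_aligned = U2 @ np.diag(s) @ U1.T          # input directions match the outputs of J1
J2_misaligned = U2 @ np.diag(s) @ R.T        # input directions unrelated to J1

sv_aligned = np.linalg.svd(J2_aligned @ J1, compute_uv=False)
sv_misaligned = np.linalg.svd(J2_misaligned @ J1, compute_uv=False)

# Eigenvalues of the metric are the squared singular values of the composite Jacobian
print("aligned:    top", sv_aligned[0] ** 2, "bottom", sv_aligned[-1] ** 2)
print("misaligned: top", sv_misaligned[0] ** 2, "bottom", sv_misaligned[-1] ** 2)
```

In the aligned case the per-layer amplifications multiply direction by direction, giving the widest possible spectrum; without alignment the largest singular value can only shrink and the smallest can only grow, so the composite spectrum is flatter.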
By defining the geometry of the latent space via the metric tensor, we gain an understanding of which directions in this space are more informative than others. This understanding leads to improvements in three applications: 1) finding human-interpretable axes in the latent space, 2) improving gradient-based optimizers, 3) accelerating gradient-free search.
When users wish to manipulate generated images via their latent code, it is useful to reduce the number of variables needed to effectuate that manipulation. Our method provides a systematic way to compute the most informative axes (top eigenspace) in the latent space to use as variables, and the resulting eigenvalues can serve to compute appropriate step sizes along each corresponding axis. We visualized the image changes corresponding to the top eigenvectors in BigGAN, BigBiGAN, PGGAN, and StyleGAN1 and 2 (Fig. 1). We found that many of these eigenvectors appeared to capture interpretable transformations like zooming, head direction and object position, consistent across reference images.
To test if this was apparent to people other than the authors, we conducted a study using Amazon's Mechanical Turk. We tested the perceptual properties of the axes identified by the metric tensor, including the top 10 eigenvectors, random vectors orthogonal to the top 15-dimensional eigenspace, and the bottom 10 eigenvectors. Images were generated using four different GANs (PGGAN, BigGAN noise space, StyleGAN2-Cat and -Face), and were presented to 185 participants. In each trial, five randomly sampled reference images were perturbed along a given axis, and participants were asked to a) state whether they could perceive a change, b) indicate an estimate of its magnitude [0%-100%], c) describe a common change in their own words and how many of the five images shared this change, d) indicate how similar the five image changes were (consistency score, scale of 1-9, 9 most similar), and finally, e) state how difficult it was to describe this change (difficulty score, scale of 1-9, 9 most difficult).
Only 48.5% of the subjects reported seeing any change for bottom eigenvectors, while the fraction was 93.5% and 89.8% for top and orthogonal directions, respectively. Further, when subjects observed some change, they reported that the image transformations induced by top eigenvectors were larger than those of orthogonal directions (two-sample t-test) and than those of bottom eigenvectors. This was true even though we picked a step size in the top eigenspace that was 5-10 times smaller than in the orthogonal and bottom eigenspaces. Further, subjects reported that the top 10 eigenvectors had a higher mean perceptual consistency score than the orthogonal and bottom eigenvectors. Participants reported that the top eigenvectors were easier to interpret than the bottom eigenvectors, albeit comparably to the orthogonal directions. Thus, overall we conclude that the Hessian eigenvectors not only capture informative axes of image transformations, but that these were also perceptually relevant, corresponding to similar semantic changes when applied to different reference vectors (Fig. 11); that is, the axes are interpretable not just in a local sense, but in a global sense.
For applications like GAN-assisted drawing and photo editing (zhu2016natmanif; shen2020latentFaceEdit), one crucial step is to find a latent code corresponding to a given natural image (termed GAN inversion). For this problem, one basic approach is to minimize the distance between a generated image and the target image. Although second-order information (the Hessian) is valuable in optimization, it is seldom used as it is expensive to compute and update. However, since we find that the local Hessians are highly correlated across the latent space, we can pre-compute the Hessian once for each latent space and use the global average Hessian to boost first-order optimization. As an example, ADAM is a first-order optimization algorithm that adapts the learning rate of each parameter separately according to the moments of gradients on that parameter (kingma2014adamOptim). It can be seen as a quasi-second order optimizer that approximates a diagonal Hessian matrix based on first-order information. However, if the true Hessian is far from diagonal, i.e. the space is anisotropic and the valley is not aligned with the coordinates, then this approximation could work poorly.
To test whether the metric can help overcome this problem, we used the eigenvectors of the global average Hessian to rotate the latent space; this orthogonal change of variables should make the Hessian more diagonal and thus accelerate ADAM. This method can be seen as a preconditioning step which could be inserted into any pipeline involving ADAM. We tested this modification on the state-of-the-art algorithm for inverting BigGAN, i.e. BasinCMA (huh2020BigGANprojection), which interleaves ADAM and CMAES steps. We used our Hessian eigenbasis in the ADAM steps, and found that we could consistently lower the fitted distance to the target when inverting ImageNet and BigGAN-generated images (Fig. 5). Similarly, eigenbasis preconditioning consistently improved inversion of PGGAN and StyleGAN2-Face for real images sampled from both FFHQ and CelebA using ADAM. In short, the understanding of homogeneity and anisotropy of the latent space can improve gradient-based optimization.
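The change of variables can be sketched on a quadratic stand-in for the inversion loss. Everything here is an assumption made for illustration: the loss is a synthetic quadratic whose Hessian plays the role of the global average Hessian, `adam_minimize` is a plain re-implementation of the standard ADAM update, and the optimum `z_target` is hypothetical. The script prints the losses so the two runs can be compared; it is not a reproduction of the BasinCMA experiments.

```python
import numpy as np

def adam_minimize(grad_fn, x0, lr=0.01, steps=500, b1=0.9, b2=0.999, eps=1e-8):
    """Plain ADAM with bias correction, returning the final iterate."""
    x = x0.copy()
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        x = x - lr * (m / (1 - b1 ** t)) / (np.sqrt(v / (1 - b2 ** t)) + eps)
    return x

rng = np.random.default_rng(0)
n = 20
eigvals = np.logspace(0, -4, n)                 # assumed anisotropic spectrum
U, _ = np.linalg.qr(rng.normal(size=(n, n)))    # eigenvectors, not axis-aligned
H = U @ np.diag(eigvals) @ U.T                  # stand-in for the global average Hessian
z_target = U @ np.ones(n)                       # hypothetical optimum

def loss(z):
    d = z - z_target
    return 0.5 * d @ H @ d

def grad(z):
    return H @ (z - z_target)

z_init = np.zeros(n)

# ADAM in the original coordinates
z_plain = adam_minimize(grad, z_init)

# ADAM in rotated coordinates w = U^T z, where z = U w and dL/dw = U^T dL/dz
w_final = adam_minimize(lambda w: U.T @ grad(U @ w), U.T @ z_init)
z_rot = U @ w_final

# In the rotated coordinates the Hessian is diagonal
H_rot = U.T @ H @ U
off_diag = np.max(np.abs(H_rot - np.diag(np.diag(H_rot))))

print("initial loss:", loss(z_init))
print("plain ADAM:  ", loss(z_plain))
print("rotated ADAM:", loss(z_rot))
```

Only the coordinates handed to the optimizer change; the loss itself is untouched, which is why the rotation can be inserted into an existing pipeline as a preconditioning step.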
In some domains, it is important to optimize objectives in the absence of a gradient, for example, in black-box attacks against image recognition systems via adversarial images, when searching for activity-maximizing stimuli for neurons in primate visual cortex, or when optimizing perceptual evaluation in the user (ponce2019NeuroEvol; xiao2020xdream; Chiu2020HumanInLoop). These applications usually involve a low-dimensional parameter space (such as GANs) and an efficient gradient-free search algorithm, such as covariance matrix adaptation evolution strategy (CMAES). CMAES explores the latent space using a Gaussian distribution and adapts the shape of the Gaussian (covariance matrix) according to the search history and local landscape. However, online learning of a covariance matrix in high-dimensional space is computationally costly, and inaccurate knowledge of it can be detrimental to optimization. Here we applied the prior geometric knowledge of the space to build the covariance matrix instead of learning it from scratch. For example, as illustrated by natural gradient descent (amari1998naturalgradGeom), one simple heuristic for optimizing on the image manifold is to move in smaller steps along dimensions that change the image faster to avoid overshoot. We built in this heuristic to improve CMAES, termed CMAES-Hessian. With our method, the search can be limited to the most informative directions, which should increase sampling efficiency; further, our method tunes the exploration step size in a way that is inversely proportional to the rate of image change. To test this approach, we applied our CMAES-Hessian algorithm to the problem of searching for activation maximizing stimuli for units in AlexNet (nguyen2016synthesizing) in the latent space of FC6GAN and BigGAN. We found that the dimension of the search space could be reduced from 4096 to 500 for FC6GAN without impairing maximal activation values. Further, we found that CMAES-Hessian consistently led to higher activation values compared to the classic CMAES algorithm in BigGAN space (Fig. 5F).
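The sampling heuristic can be sketched as follows. This is not the CMAES-Hessian implementation: it only shows one way to draw exploration steps restricted to the top eigenspace with axis-dependent step sizes. The scaling exponent is an assumption; the text states only that step size is inversely related to the rate of image change, and here each axis is scaled by $\lambda^{-1/2}$ so that a unit coefficient predicts the same image change on every axis. The subspace size `k`, the spectrum and the helper `sample_perturbations` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 64, 8                                   # latent dimension and search subspace size (assumed)
eigvals = np.logspace(0, -5, n)                # assumed spectrum, sorted in decreasing order
U, _ = np.linalg.qr(rng.normal(size=(n, n)))
H = U @ np.diag(eigvals) @ U.T                 # stand-in for the global average Hessian

U_k, lam_k = U[:, :k], eigvals[:k]             # top-k eigenpairs

def sample_perturbations(num, sigma=0.1):
    """Gaussian steps in the top-k eigenspace, each axis scaled by lambda^(-1/2)."""
    coeffs = rng.normal(size=(num, k)) * sigma / np.sqrt(lam_k)
    return coeffs @ U_k.T                      # shape (num, n)

steps = sample_perturbations(5)

# A unit coefficient along each scaled axis predicts the same squared image change
axis_steps = U_k / np.sqrt(lam_k)
pred_change = np.einsum("ik,ij,jk->k", axis_steps, H, axis_steps)

# Sampled steps have no component outside the top-k eigenspace
residual = steps - steps @ U_k @ U_k.T
print(pred_change, np.max(np.abs(residual)))
```

The implied covariance is $\sigma^2 U_k \Lambda_k^{-1} U_k^T$, fixed in advance from the metric instead of being adapted online from the search history.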
In this work, we developed an efficient and architecture-agnostic way to compute the geometry of the manifold learnt by generative networks. This method discovers axes accounting for the largest variation in image transformation, which frequently represent semantically interpretable changes. Subsequently, this geometric method can facilitate image manipulation, increase explainability, and accelerate optimization on the manifold (with or without gradients).
There have been multiple efforts directed at identifying interpretable axes in latent space using unsupervised methods, including (ramesh2018spectral; harkonen2020ganspace; shen2020GANSemFact; voynov2020unsupInterpDir; peebles2020HessianPenalty). Our description of the connection between the metric tensor of the image manifold and the Jacobian matrices of intermediate layers unifies these previous results. As we have shown, the top right singular vectors of the weights (i.e. Jacobian) of the first few layers (as used in shen2020GANSemFact) correspond to the top eigenvectors of the metric tensor of the image manifold, and these usually relate to interpretable transforms. Similarly, the top principal components (PCs) of intermediate layer activations (harkonen2020ganspace) roughly correspond to the top left singular vectors of the Jacobian, thus also to the interpretable top eigenvectors of the metric on the image manifold. Likewise, ramesh2018spectral also observed that the top right singular vectors of the Jacobian of $G$ are locally disentangled. Regarding voynov2020unsupInterpDir and peebles2020HessianPenalty, we empirically compared their interpretable axes and our eigenvectors, and found that in some of the GANs, the discovered axes have a significantly larger alignment with our top eigenspace, and are more highly concentrated on individual top axes, than expected from random mixing. We refer readers to Sec. A.9, A.10 and Fig. 8, 9 for further comparison.
Although we have answered how the anisotropy comes into being mechanistically, there remains the question of why it should exist at all. Anisotropy may result from gradient training: theoretical findings on deep-linear networks for classification show that gradient descent aligns the weights of layers, resulting in a highly anisotropic Jacobian (ji2018gradAlignLayer). Whether that analysis transfers to the setting of generative networks remains to be investigated.
Alternatively, assuming that a well-trained GAN faithfully represents the data distribution, this anisotropy may reveal the intrinsic dimensionality of the data manifold. Statistical dependencies of variation in real-world images imply that the images reside in a statistical manifold of much lower dimension. Further, among transformations that happen on this manifold, there will be some dimensions that transform images a lot and some that do not. In that sense, our method may be equivalent to performing a type of nonlinear PCA of the image space through the generator map. In fact, we have found that GANs trained on similar datasets (e.g. PGGAN, StyleGAN1,2 trained on the human face dataset CelebA,FFHQ) have top eigenvectors that represent the same set of transforms (e.g. head direction, gender, age; Fig. 10). This supports the ”PCA” hypothesis, as these transformations may account for much of the pixelwise variability in face space; the GANs are able to learn to represent these transformations as linear directions, which our method can then identify.
This further raises the intriguing possibility that if the dataset is actually distributed on a lower dimensional space, one could learn generators with smaller latent spaces; or alternatively, it may be easier to learn generators with large latent spaces and reduce them after intensive training. These are questions worth exploring.
We appreciate the conceptual and technical inspirations from Dr. Timothy Holy (WUSTL). We are grateful for the constructive discussion with Zhengdao Chen (NYU), whose pointers to the relevant literature helped launch this work. We thank Hao Sun (CUHK) for sharing experience with the submission and rebuttal process. We thank friends and colleagues Yunyi Shen (UW–Madison), Lingwei Kong (ASU) and Yuxiu Shao (ENS) who read and commented on our early manuscript. This work was supported by Washington University in St. Louis, the David and Lucille Packard Foundation, the McDonnell Center for Systems Neuroscience (pre-doctoral fellowship to B.W.), and a seed grant from the Center for Brains, Minds and Machines (CBMM; funded by NSF STC award CCF-1231216).
B.W. conceptualized the project, designed the algorithm, developed the code base, performed the numerical experiment and analyzed the data. B.W. and C.R.P. interpreted the results. B.W. and C.R.P. designed the human MTurk task and analyzed the data. B.W. and C.R.P. prepared and revised the manuscript.
It is useful to compare our work to the "information geometry" (amari2016InfoGeom) on the space of distributions. In that formulation, KL divergence is a pseudo-metric function on the space of distributions, and its Hessian matrix with respect to the parameters of the distribution is the Fisher information matrix. In information geometry, this Fisher information matrix can be considered as the metric on the manifold of distributions; this metric information can be further used to assist optimization on the manifold of distributions, termed natural gradient descent (amari1998naturalgradGeom). In our formulation, the squared image difference function $d^2$ is analogous to this KL divergence; the image $G(z)$ as parameterized by latent code $z$ is analogous to the distribution $p_\theta$ parameterized by $\theta$. The metric tensor we computed is comparable to the Fisher information matrix in their setting. Thus our way of using metric information to assist optimization on the manifold is analogous to natural gradient descent.
One direct way to compute the Hessian of a given scalar function $f$ (e.g. the squared distance in our case) is to compute the gradient $g(z) = \partial_z f(z)$, create a computational graph from the code $z$ to the gradient $g(z)$, and backpropagate from the gradient vector element by element. In this way, the computational time scales linearly with the time of a single backward pass times the latent space dimension $n$.
Given a large latent space or a deep network (e.g. 4096 dimensions in FC6GAN, or 512 in StyleGAN2), this method can be very slow. A more efficient way is to use the Hessian vector product (HVP) operator together with iterative eigenvalue algorithms, such as power iteration or Lanczos iteration, to solve for the eigenvectors corresponding to the eigenvalues of largest magnitude. These largest eigenpairs give the best low-rank approximation to the real Hessian matrix. Note that finding the smallest-magnitude eigenpairs requires the inverse Hessian vector product operator, which is much more expensive to compute. However, as the eigenspace with the smallest eigenvalues represents directions that do not change images much, the exact eigenvectors do not matter. We can simply define an arbitrary basis in the “null” space orthogonal to the eigenspace with large-magnitude eigenvalues.
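There are two ways to construct an HVP operator. One way uses the second-order computational graph from the code $z$ to the gradient $g(z) = \partial_z f(z)$ to compute the HVP by backpropagation, i.e. $\mathrm{HVP}_{\mathrm{backward}}: v \mapsto \partial_z \left( v^\top g(z) \right)$. The other way uses a finite difference of the first-order gradient, i.e. $\mathrm{HVP}_{\mathrm{forward}}: v \mapsto \left( g(z + \epsilon v) - g(z - \epsilon v) \right) / (2\epsilon)$. As it does not require backpropagation through the second-order graph, a single operation of $\mathrm{HVP}_{\mathrm{forward}}$ is faster than one of $\mathrm{HVP}_{\mathrm{backward}}$, but it is less accurate and takes more iterations to converge. We use the ARPACK (lehoucq1998arpack) implementation of the Lanczos algorithm as our iterative eigenvalue solver.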
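As an illustration of the finite-difference route, the following is a minimal sketch under stated assumptions, not the code used for the experiments. The toy quadratic `make_toy_problem` stands in for the squared distance function, the helper names are hypothetical, and plain subspace (block power) iteration in NumPy is swapped in for the ARPACK Lanczos solver so that the example has no dependency beyond NumPy.

```python
import numpy as np

def make_toy_problem(n=30, seed=0):
    """Toy stand-in for the squared distance: f(z) = 0.5 z^T H z with a known,
    highly anisotropic Hessian H (hypothetical; not a real GAN)."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))  # random orthonormal basis
    eigvals = 10.0 ** np.linspace(2, -4, n)           # spectrum spanning 6 decades
    H = (Q * eigvals) @ Q.T                           # H = Q diag(eigvals) Q^T
    grad = lambda z: H @ z                            # first-order gradient g(z)
    return H, grad

def hvp_forward(grad, z, v, eps=1e-3):
    """Central finite difference of the gradient: H v ~ (g(z+eps v) - g(z-eps v)) / (2 eps)."""
    return (grad(z + eps * v) - grad(z - eps * v)) / (2.0 * eps)

def top_eigenpairs(hvp, n, k=5, n_iter=200, seed=0):
    """Subspace iteration that touches the Hessian only through HVP calls."""
    rng = np.random.default_rng(seed)
    V, _ = np.linalg.qr(rng.standard_normal((n, k)))
    for _ in range(n_iter):
        HV = np.column_stack([hvp(V[:, j]) for j in range(k)])
        V, _ = np.linalg.qr(HV)                       # re-orthonormalize the block
    # Rayleigh-Ritz step: diagonalize the small k x k projected matrix.
    HV = np.column_stack([hvp(V[:, j]) for j in range(k)])
    T = V.T @ HV
    w, S = np.linalg.eigh(0.5 * (T + T.T))
    order = np.argsort(-np.abs(w))                    # largest magnitude first
    return w[order], V @ S[:, order]

H, grad = make_toy_problem()
z0 = np.zeros(H.shape[0])
eigvals_est, eigvecs_est = top_eigenpairs(lambda v: hvp_forward(grad, z0, v), n=H.shape[0])
eigvals_true = np.sort(np.linalg.eigvalsh(H))[::-1][:5]
```

On a real generator the gradient would come from a backward pass rather than a closed form, and the step size `eps` would trade truncation error against round-off.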
We termed the direct method Full BackProp (BP), and the iterative methods using the backpropagation-based and the finite-difference HVP operators Backward Iteration and Forward Iteration, respectively. We computed the Hessian at the same latent code using these three methods in different GANs and compared their temporal cost empirically in Table 1.
Note, our method can be employed to compute the singular values and right singular vectors of the Jacobian from latent space towards any intermediate layer representation. To obtain the left singular vector, i.e. the change in representation or image space caused by the direction, we need to push forward the right singular vectors through the Jacobian, which is feasible through forward-mode autodiff or finite difference.
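The push-forward step can be sketched as follows. This is a minimal illustration under assumptions: the two-layer map `G` is a hypothetical toy generator, the dense finite-difference Jacobian is only feasible for such tiny maps, and the central difference in `push_forward` stands in for forward-mode autodiff.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = 0.3 * rng.standard_normal((16, 8))
W2 = rng.standard_normal((32, 16))

def G(z):
    """Hypothetical toy generator: 8d latent code -> 32d representation."""
    return W2 @ np.tanh(W1 @ z)

def jacobian_fd(f, z, eps=1e-5):
    """Dense Jacobian by central differences, one latent dimension at a time."""
    cols = [(f(z + eps * e) - f(z - eps * e)) / (2 * eps) for e in np.eye(z.size)]
    return np.column_stack(cols)

def push_forward(f, z, v, eps=1e-5):
    """Finite-difference Jacobian-vector product J v."""
    return (f(z + eps * v) - f(z - eps * v)) / (2 * eps)

z0 = rng.standard_normal(8)
J = jacobian_fd(G, z0)

# Eigenvectors of the metric J^T J are the right singular vectors of J,
# and its eigenvalues are the squared singular values.
evals, V = np.linalg.eigh(J.T @ J)
evals, V = evals[::-1], V[:, ::-1]                    # descending order
sing_vals = np.sqrt(np.clip(evals, 0.0, None))

# Left singular vectors: push each right singular vector through J and normalize.
U = np.column_stack([push_forward(G, z0, V[:, i]) / sing_vals[i]
                     for i in range(V.shape[1])])
```

Each column of `U` is the unit-norm change in the representation caused by moving along the corresponding latent direction in `V`.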
| Dimension | Full BackProp | Backward Iter | Forward Iter |
The pretrained GANs used in the paper are from the following sources:
DCGAN model was obtained from torch hub https://pytorch.org/hub/facebookresearch_pytorch-gan-zoo_dcgan/. It was trained on a 64 by 64 pixel fashion dataset. It has a 120d latent space, using a Gaussian as the latent space distribution.
Progressive Growing GAN (PGGAN) was obtained from torch hub https://pytorch.org/hub/facebookresearch_pytorch-gan-zoo_pgan/ and we use the 256 pixel version. It’s trained on celebrity faces dataset (CelebA). It has a 512d latent space, using Gaussian as latent space distribution.
FC6GAN model was re-written and translated into PyTorch, with weights obtained from the official page https://lmb.informatik.uni-freiburg.de/people/dosovits/code.html of dosovitskiy2016DeepSiMGAN. The architecture is designed to mirror that of AlexNet, and the FC6GAN model is trained to invert AlexNet’s mapping from image to FC6 layer. Thus it has a 4096d latent space. This model is highly expressive in fitting arbitrary patterns, but not particularly photorealistic.
BigGAN model was obtained through Hugging Face’s translation of DeepMind’s Tensorflow implementation https://github.com/huggingface/pytorch-pretrained-BigGAN; we use the biggan-deep-256 version. It was trained on the ImageNet dataset in a class-conditional way. It has a 128d latent space called noise space, and a 128d embedding space for the 1000 classes called class space. The two vectors are concatenated and sent into the network. The distribution used to sample in noise space is a truncated normal. Here we analyze the metric tensor computed in the concatenated 256d space (BigGAN) or in the 128d noise space or class space separately (BigGAN-noise, class).
BigBiGAN model was obtained via a translation of DeepMind’s Tensorflow implementation https://tfhub.dev/deepmind/bigbigan-resnet50/1; we use the bigbigan-resnet50 version. It was trained on the ImageNet dataset in an unconditional fashion. It has a 120d latent space, using a Gaussian as the latent distribution. Note that the latent vector is split into six 20d chunks and sent into different parts of the model, which explains why the spectrum of BigBiGAN has the staircase form (in Fig. 2).
StyleGAN model was obtained via a translation of NVIDIA’s Tensorflow implementation https://github.com/rosinality/style-based-gan-pytorch. We used the 256 pixel output. It has a 512d latent space called Z space, where the latent distribution is Gaussian. This distribution gets warped into another 512d latent space called W space by a multi-layer perceptron. The latent vector in W space is sent into a style-based generative network, in which it just modulates the feature maps in the conv layers, instead of serving as a spatial input as in DCGAN, FC6GAN and PGGAN.
StyleGAN2 models were obtained via a translation of NVIDIA’s Tensorflow implementation https://github.com/rosinality/stylegan2-pytorch. This is an improved version of StyleGAN: it also has a network mapping the 512d Z space to the W space, followed by the style-based generative network. The various pre-trained models were fetched from https://pythonawesome.com/a-collection-of-pre-trained-stylegan-2-models-to-download. More specifically, StyleGAN2-Face256 and 512 are both trained on the FFHQ dataset, while Face256 generates lower resolution images and uses narrower conv layers. StyleGAN2-Cat is trained on the LSUN cat dataset (yu2015lsun) at 512 resolution.
WaveGAN model was obtained from the repository https://github.com/mostafaelaraby/wavegan-pytorch/. Its architecture resembles that of DCGAN, but applied to the one-dimensional waveform generation problem. We trained it ourselves on the waveforms of piano performance clips. It has a 100d latent space, using a Gaussian as the latent space distribution.
We quantified the anisotropy of the space, i.e. the low rankness of the metric tensor in Table 2. To do this, we computed the number of eigenvalues needed to account for the 0.99, 0.999, 0.9999, 0.99999 fraction of the sum of all eigenvalues. This can be thought of as the minimal number of dimensions needed to achieve a low rank approximation of the Jacobian with 0.01, 0.001, 0.0001, 0.00001 residue in terms of the Frobenius norm.
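The counting rule described above can be sketched in a few lines. This is a minimal illustration; the function name `dims_to_explain` and the geometric toy spectrum are assumptions made for the example, not values from Table 2.

```python
import numpy as np

def dims_to_explain(eigvals, fractions=(0.99, 0.999, 0.9999, 0.99999)):
    """Smallest number of top eigenvalues whose sum reaches each fraction
    of the sum of all eigenvalues."""
    ev = np.sort(np.abs(np.asarray(eigvals, dtype=float)))[::-1]  # descending
    cum = np.cumsum(ev) / ev.sum()                                # cumulative fraction
    # searchsorted returns the first index whose cumulative fraction is >= f.
    return {f: int(np.searchsorted(cum, f) + 1) for f in fractions}

# Hypothetical spectrum in which each eigenvalue is half of the previous one.
toy_spectrum = 2.0 ** (-np.arange(20))
counts = dims_to_explain(toy_spectrum)
```

A flat spectrum would require nearly all dimensions at every threshold, whereas a steeply decaying one, as in the toy example, needs only a handful.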
There are a few interesting patterns we noticed in this table. For BigGAN, we noted that the class subspace is more low-rank than the noise subspace, i.e. fewer directions could account for most of the changes across object classes than within classes. For StyleGAN 1 and 2, we analyzed the geometry of Z space and W space separately, and found that in all the models the metric in W space is significantly more isotropic, i.e. less low-rank, than that in Z space. Thus, in this regard, the mapping network warped the spherical distribution in Z space into an elongated one in W space, but the mapping from W space to image is still more isotropic.
Our work used the LPIPS distance metric to compute the Riemannian metric tensor. To determine how much of the results depended on this choice of metric, we computed the metric tensor at the same hidden vector using different image distance functions, specifically a) the structural similarity index measure (SSIM) and b) the mean squared error (MSE) in pixel space, neither of which depends on a CNN. We computed the Hessian at 100 randomly sampled vectors in BigGAN, Progressive Growing GAN (Face) and StyleGAN2 (Face 256), using MSE, SSIM and LPIPS, and then compared their Hessian spectra and eigenvectors. We found that the entry-wise correlation across the Hessian matrices ranged from 0.94 to 0.99. The correlation of the eigenvalue spectra ranged from 0.987 to 0.995. Measuring Hessian similarity using the statistics we developed resulted in correlations concentrated at 0.99. Thus, we found that the Hessian matrices and their spectra were highly correlated across image distance metrics, and that the Hessian matrices had similar effects on each other’s eigenvectors.
One major difference across the Hessians from different image distance metrics was evident in the scale of the eigenvalues. We regressed the log Hessian spectra induced by SSIM or MSE onto the log Hessian spectrum induced by LPIPS, and found that the intercepts of the regression were usually not zero (Tab. 5). This result shows that different image distance metrics exhibit different “units” or intrinsic scales, although they all factor out the same structure in the GAN.
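A comparison of this kind could be computed as in the sketch below. It is a minimal illustration under assumptions: the function name `compare_hessians` is hypothetical, the eigenvalue correlation is taken on the log spectra (the scale used for that correlation is not stated here), and the two toy Hessians differ only by a constant factor of 50 so that the expected slope and intercept are known.

```python
import numpy as np

def compare_hessians(H_ref, H_alt):
    """Compare two Hessians computed at the same latent code with different
    distance functions (e.g. H_ref from LPIPS, H_alt from SSIM or MSE)."""
    # Entry-wise Pearson correlation over all matrix elements.
    h_corr = np.corrcoef(H_ref.ravel(), H_alt.ravel())[0, 1]
    # Eigenvalue spectra, sorted in descending order.
    ev_ref = np.sort(np.linalg.eigvalsh(H_ref))[::-1]
    ev_alt = np.sort(np.linalg.eigvalsh(H_alt))[::-1]
    # Keep strictly positive pairs so that the logarithm is defined.
    keep = (ev_ref > 0) & (ev_alt > 0)
    log_ref, log_alt = np.log10(ev_ref[keep]), np.log10(ev_alt[keep])
    eig_corr = np.corrcoef(log_ref, log_alt)[0, 1]
    # Regress the alternative log spectrum onto the reference log spectrum.
    slope, intercept = np.polyfit(log_ref, log_alt, 1)
    return {"H corr": h_corr, "eigval corr": eig_corr,
            "slope": slope, "intercept": intercept}

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((20, 20)))
H_ref = (Q * 10.0 ** np.linspace(1, -3, 20)) @ Q.T   # hypothetical reference Hessian
H_alt = 50.0 * H_ref                                 # same structure, different "unit"
stats = compare_hessians(H_ref, H_alt)
```

In this toy case the slope is 1 and the intercept is log10(50): the two matrices share all structure and differ only in intrinsic scale, which is the situation a nonzero intercept indicates.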
This result is contextualized by Section 5. As equation 3 showed, the Riemannian metric or Hessian of the generator manifold is the inner product matrix of the Jacobian of the representational map. The effect of image distance metric on the Riemannian metric is to add a few more terms on top of the chain of Jacobians. The Jacobian terms from the layers of generator seem to have a larger effect than the final terms coming from the image distance metric.
Note that this does not mean that the choice of sample space distance function is irrelevant. Going beyond image generation, when applying our method to an audio-generating GAN, the WaveGAN, we found that the choice of distance function in the space of sound waves substantially affects the Hessian obtained. We used the MSE of waveforms and the MSE of spectrograms (denoted by STFT) to compute the metric tensor of that sound wave manifold. We found that the element-wise Hessian correlation between these two is around 0.53, while the other Hessian similarity metrics are also lower than their counterparts for BigGAN, PGGAN and StyleGAN2 (Tab. 5). We think the MSE of spectrograms is a more perceptually relevant distance metric for sound waves than the MSE of waveforms, and this difference is reflected in the geometry they induce, i.e. anisotropy and homogeneity (Tab. 2, 4). Thus, when and how the sample space distance metric affects the geometry of a generative model remains an open question that requires further work.
| H corr | eigval corr | slope | intercept |
The linear regression between the log spectrum of LPIPS and that of the alternative (SSIM or MSE) yields the slope (slope) and intercept (intercept). The mean and standard deviation (in parentheses) of the 100 statistics are shown. In the last row, WaveGAN is an audio-generating GAN. We measured the similarity of the Hessians using the MSE of waveforms (MSE) and the MSE of spectrograms (STFT) as dissimilarity metrics. The Hessians computed using these two measures are less similar to each other.
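Here we give a simple derivation of why a highly ill-conditioned Hessian matrix may appear normal under the probe of random vectors. Given a symmetric matrix $H$ and its eigendecomposition $H = U \Lambda U^\top$, we computed its effect on an isotropic random vector $v \sim \mathcal{N}(0, I)$:
$$\alpha_H(v) = \frac{v^\top H v}{v^\top v}.$$
This random variable represents the effect of the symmetric matrix on random directions.
Note that a change of variable using the orthogonal matrix $U$ will not change the distribution, i.e. $u = U^\top v \sim \mathcal{N}(0, I)$. Through this, the random variable can be rewritten as
$$\alpha_H(v) = \frac{\sum_i \lambda_i u_i^2}{\sum_i u_i^2}.$$
As each element of $u$ is distributed as an i.i.d. unit normal, each $u_i^2$ is distributed as an i.i.d. chi-square variable of parameter 1. Thus the normalized weights $w_i = u_i^2 / \sum_j u_j^2$ conform to a Dirichlet distribution $\mathrm{Dir}(1/2, \ldots, 1/2)$. Through the moment formulas of the Dirichlet distribution, it is straightforward to compute the mean and variance of $\alpha_H(v)$:
$$\mathbb{E}[\alpha_H(v)] = \frac{1}{n}\sum_i \lambda_i, \qquad \mathrm{Var}[\alpha_H(v)] = \frac{2}{n+2}\,\mathrm{Var}[\lambda],$$
where $n$ is the dimension of $H$ and $\mathrm{Var}[\lambda] = \frac{1}{n}\sum_i \lambda_i^2 - \left(\frac{1}{n}\sum_i \lambda_i\right)^2$ is the variance of the eigenvalue distribution.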
As we can see, the variance of the effect on random directions scales as $2/(n+2)$ relative to the variance of the original eigenvalue distribution, where $n$ is the dimension of the matrix. This is why the distribution of the effect on random directions is much tighter than the whole eigenvalue distribution.
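The scaling can be checked numerically with a short Monte Carlo experiment. This is a sketch under assumptions: the log-spaced spectrum, the dimension and the sample count are hypothetical choices, and the probes are sampled directly in the eigenbasis, which is valid because an orthogonal change of variable leaves the isotropic normal distribution unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64
eigvals = 10.0 ** np.linspace(2, -4, n)         # hypothetical anisotropic spectrum

u = rng.standard_normal((50_000, n))            # probes expressed in the eigenbasis
w = u**2 / np.sum(u**2, axis=1, keepdims=True)  # normalized weights, Dirichlet(1/2, ..., 1/2)
alpha = w @ eigvals                             # weighted average of eigenvalues per probe

mean_theory = eigvals.mean()
var_theory = 2.0 / (n + 2) * eigvals.var()      # population variance of the spectrum

print(f"mean: empirical {alpha.mean():.3f}, theory {mean_theory:.3f}")
print(f"var:  empirical {alpha.var():.3f}, theory {var_theory:.3f}")
print(f"variance of the spectrum itself: {eigvals.var():.3f}")
```

With 64 dimensions the variance seen by random probes is about 3% of the variance of the spectrum, so even a spectrum spanning six orders of magnitude looks narrow from the point of view of random directions.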
This phenomenon may explain why the perceptual path length (PPL) regularization used in karras2020StyleGAN2 does not really result in a flat spectrum. In our notation, the PPL regularization amounts to minimizing the variance of the rate of image change along random directions, with the direction sampled from a normal distribution and the latent code sampled from the latent distribution. The global minimizer for this regularization is indeed a mapping with a flat spectrum, i.e. an isometry up to some scaling. However, we can see through our derivation and Fig. 6 that even for a highly anisotropic spectrum, this variance will not be very large. Thus we should expect a limited effect of this regularization.
We developed our own statistic to quantify the similarity of metric tensor between different points. Here we discuss the benefits and caveats of it.
Angles between eigenvectors per se are not used, since eigenvectors with close-by eigenvalues are likely to rotate into each other when computing the eigendecomposition (van1987RitzClose-by-eigen). Instead, the statistic should be invariant to this eigenvector mixing, and should also take the eigenvalues into account. In our statistic, we applied the eigenvectors $u_k$ of one matrix $H_1$ to the other matrix $H_2$, i.e. we examined the length of these vectors as measured by the other matrix as the metric tensor, $u_k^\top H_2 u_k$. Recall that $u_k^\top H_1 u_k = \lambda_k$ and $u_k^\top u_k = 1$. If an eigenvector falls in the eigenspace of a single eigenvalue of $H_2$, then $u_k^\top H_2 u_k$ will equal that eigenvalue, and thus our statistic is invariant to rotation within the eigenspace. If the eigenvectors are totally uncorrelated, the resulting $u_k^\top H_2 u_k$ will be distributed like that of random vectors, as in Fig. 6. By computing the correlation between the eigenvalues $\lambda_k$ and the values $u_k^\top H_2 u_k$, we summarize the similarity of the actions of $H_1$ and $H_2$ on the eigenvectors.
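A minimal sketch of this statistic follows. The function name `metric_similarity`, the optional log-scale variant and the toy matrices are assumptions made for illustration; the two example matrices share a spectrum but one of them uses an unrelated random eigenbasis.

```python
import numpy as np

def metric_similarity(H1, H2, log_scale=False):
    """Correlate the eigenvalues of H1 with the lengths of H1's eigenvectors
    as measured by H2, i.e. with u_k^T H2 u_k."""
    lam, U = np.linalg.eigh(H1)               # columns of U are unit eigenvectors of H1
    action = np.sum(U * (H2 @ U), axis=0)     # u_k^T H2 u_k for every k
    if log_scale:
        pos = (lam > 0) & (action > 0)        # the logarithm needs positive values
        lam, action = np.log10(lam[pos]), np.log10(action[pos])
    return np.corrcoef(lam, action)[0, 1]

rng = np.random.default_rng(0)
n = 50
spectrum = 10.0 ** np.linspace(2, -4, n)      # hypothetical anisotropic spectrum
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
R, _ = np.linalg.qr(rng.standard_normal((n, n)))
H_a = (Q * spectrum) @ Q.T
H_same_frame = 3.0 * H_a                      # same eigen-frame, rescaled
H_other_frame = (R * spectrum) @ R.T          # same spectrum, unrelated eigen-frame

sim_same = metric_similarity(H_a, H_same_frame)
sim_other = metric_similarity(H_a, H_other_frame)
```

Because the statistic only uses quadratic forms, any orthonormal basis of a degenerate eigenspace gives the same value, which is the invariance the text asks for.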
However, this method assumes anisotropy of the spectra in both metric tensors. For example, if both tensors are identity matrices $I$, then this correlation will yield NaN, as there is no variation in the spectra to be correlated. Similarly, if a metric tensor has a more isotropic spectrum, then it will generally have a smaller correlation with others. In that sense, spectral anisotropy also plays a role in our statistic for metric similarity or homogeneity of the manifold. In all the GAN spaces we examined, there is strong anisotropy in the metric spectra, so this correlation works well. But there is a caveat when comparing this correlation between two GANs whose spectra also differ in anisotropy, as a smaller anisotropy can also result in a smaller metric similarity.
Finally, we are aware that there are different ways to average symmetric positive semi-definite (SPSD) matrices, induced by different measures of distance in the space of SPSD matrices (yuan2020averageSPDMat). Here we picked the simplest one to estimate the global Hessian in the latent space: averaging the metric tensors element by element.
| Log scale | Linear scale |
Here we show the geometric analysis for the shuffled controls for all our GANs. Specifically, we shuffled the elements of the weight tensor from each layer to keep the overall weight distribution unchanged. To show how learning affected the geometry of the image manifold, we computed the spectra and the associated metric consistency statistic for weight-shuffled GANs. (We were unable to obtain sensible spectra for either the shuffled or the randomly initialized BigBiGAN, possibly due to its architecture, but we show the comparison for all other models.)
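The shuffling control can be sketched as below. This is a minimal illustration under assumptions: the dictionary of NumPy arrays stands in for a model's per-layer weight tensors, and the layer names and shapes are hypothetical.

```python
import numpy as np

def shuffle_weights(weights, seed=0):
    """Permute the elements within each layer's weight tensor, so that every
    layer keeps exactly the same set of values but loses its learned structure."""
    rng = np.random.default_rng(seed)
    shuffled = {}
    for name, w in weights.items():
        flat = rng.permutation(w.ravel())     # shuffled copy of the flattened tensor
        shuffled[name] = flat.reshape(w.shape)
    return shuffled

rng = np.random.default_rng(1)
weights = {
    "conv1": rng.standard_normal((8, 3, 3, 3)),  # hypothetical conv kernel
    "fc": rng.standard_normal((10, 8)),          # hypothetical linear layer
}
shuffled = shuffle_weights(weights)
```

Since each layer is permuted separately, per-layer statistics such as the mean and variance of the weights are preserved exactly.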
In Fig. 7, we show that the shuffled controls exhibited flatter spectra and smaller top eigenvalues. The correlation of metric tensors in shuffled GANs, however, gives an unclear result. In some GANs, strong correlations remain in the metrics across locations, while in others the correlation is close to zero. We think the reason is that our statistic for homogeneity (i.e. a correlation of the action of metric tensors) somewhat entangles homogeneity with the anisotropy of the space. That is, when the space has a totally flat spectrum (the map is isometric), the correlation coefficient of action will be zero or NaN, although the metric tensor will be the same everywhere. Thus the change in anisotropy and the change in homogeneity may interfere with each other, and shuffling can produce mixed results. We are working to develop new statistics that measure the similarity of Hessians in a way that is invariant to anisotropy.
Here we compare the axes discovered by our method with those from a previous approach. Specifically, we applied our method to the same pre-trained GANs used in voynov2020unsupInterpDir, comparing the axes they discovered versus our Hessian structure. Although this method follows a quite different approach compared to ours and those of (harkonen2020ganspace; shen2020GANSemFact), we thought it would be interesting to determine if the interpretable axes discovered in their approach had a relationship with the Hessian structure defined above. If so, this could serve as independent confirmation of the effectiveness of both types of approaches.
In their work, for each generative network $G$, two additional models were simultaneously trained to discover interpretable axes: a “deformator” $A$ and a “latent shift predictor” $R$. The deformator learned to propose vectors $v$ to alter the image, which were used to create image pairs $(G(z), G(z + v))$ using random reference vectors $z$; the latent shift predictor took in the image pairs and learned to classify the direction in which the latent code shifted. The axes learned by the deformator were subsequently annotated and a subset was selected by humans.
Using their code, we compared these annotated axes with the Hessian structure we computed on their GANs (PGGAN512, BigGAN noise and StyleGAN Face). In PGGAN512, we found that their discovered axes had a significantly larger $v^\top H v$ (i.e. approximate rate of image change) than random vectors in that space; in other words, their axes were significantly more aligned with the top eigenspace (statistically significant for all axes). Further, we wanted to investigate whether their axes aligned with individual eigenvectors identified by our Hessian or whether they were random mixtures within our top eigenspace. To this end, we searched for power coefficients that were significantly higher than expected from the projection of a unit random vector. For every one of the discovered axes, we found 1-3 eigenvectors with which it was significantly aligned. Moreover, these strongly aligned eigenvectors were all within our top 60 eigen dimensions; in fact, 3 of their axes aligned with eigenvector 11 and 2 of their axes aligned with eigenvector 6 (Fig. 8A). We then “purified” their axes by a) retaining projection coefficients only on the top 60 eigenvectors, or b) retaining coefficients only on the 1-3 strongly aligned eigenvectors and setting all the other 500+ coefficients to zero, and compared their effects on the same set of reference vectors, using the same step size. We found that after projecting out the coefficients in the lower eigenspace, the image transformation was perceptually very similar (Fig. 8B, C). If we retained only the eigenvectors with which an axis was highly aligned, the image transformation differed more, but the annotated semantics of the transform seemed to be preserved (Fig. 8D, E). Thus, their method also discovered that the top eigenspace of PGGAN contained informative transformations, and further confirmed that optimizing interpretability may improve alignment with individual eigenvectors rather than mixing all the eigen dimensions.
Note that as we project out coefficients, the resulting vector has a smaller than unit norm, so we are moving a smaller distance in latent space using the same step size (Fig. 8B-E, titles). If we renormalize the vector to unit norm, we need to take a smaller step size to achieve the same transform. This confirms our predictions in Sec. 4: each interpretable axis has a family of equivalent axes, each of which adds a direction in the lower eigenspace or null space of the GAN. These axes encode the same transform, but the speeds of image change along them are different. In this sense, the top eigenspace could be used to provide a “purer” version of the interpretable axes discovered elsewhere.
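The projection and purification steps can be sketched as follows. This is a minimal illustration under assumptions: the eigenbasis, the spectrum and the synthetic axis (built mostly from eigenvectors 6 and 11 plus noise) are hypothetical, and the helper names are not from the released code.

```python
import numpy as np

def power_coefficients(axis, eigvecs):
    """Squared projections of the unit-normalized axis onto each eigenvector (columns)."""
    unit = axis / np.linalg.norm(axis)
    return (eigvecs.T @ unit) ** 2

def purify(axis, eigvecs, keep_idx):
    """Keep projection coefficients only on the chosen eigenvectors and zero the rest.
    The result generally has less than unit norm."""
    unit = axis / np.linalg.norm(axis)
    coef = eigvecs.T @ unit
    kept = np.zeros_like(coef)
    kept[keep_idx] = coef[keep_idx]
    return eigvecs @ kept

def rate_of_change(v, H):
    """Approximate rate of image change along v, normalized by the length of v."""
    return (v @ H @ v) / (v @ v)

rng = np.random.default_rng(0)
n, k = 128, 20
eigvecs, _ = np.linalg.qr(rng.standard_normal((n, n)))
eigvals = 10.0 ** np.linspace(2, -4, n)       # sorted in descending order
H = (eigvecs * eigvals) @ eigvecs.T

# Synthetic "discovered" axis: mostly eigenvectors 6 and 11, plus a random component.
axis = eigvecs[:, 6] + 0.3 * eigvecs[:, 11] + 0.05 * rng.standard_normal(n)
power = power_coefficients(axis, eigvecs)
purified = purify(axis, eigvecs, np.arange(k))  # keep the top-k eigenspace only
```

Here `np.linalg.norm(purified)` is below one, while `rate_of_change(purified, H)` is at least as large as `rate_of_change(axis, H)`, which illustrates why a renormalized purified axis needs a smaller step size for the same transform.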
Although both types of approaches are promising, by removing the need to train additional networks, our method can be viewed as a more efficient way to identify informative axes. Further work comparing axes discovered by different methods will elucidate the connection between interpretable axes and the Hessian structure more.
In this section, we compare our work with that of (peebles2020HessianPenalty), which also focused on the use of a Hessian matrix to “disentangle” directions in generator latent space. This Hessian penalty approach devised a stochastic finite difference estimator of the non-diagonal elements of the Hessian, using only a forward pass, resulting in an efficient regularization of Hessian diagonality during GAN training. This clever approach led to several benefits, including increases in smoothness in latent interpolation, and improvements in interpretability. However, there are clear contrasts between the Hessian penalty approach and ours.
At the highest conceptual level, both works relied on the analysis of the Hessian matrix. Our framework is motivated by a geometric picture of the image manifold and latent representations, while their work is motivated by the idea of increasing the independence of latent factors. The Hessian matrices involved are also different: we computed the Hessian of the squared distance metric in image space, while they computed the Hessian of every pixel in the image.
At the implementation level, the methods are also different. We computed the exact Hessian matrices or their low-rank approximations based on backpropagation and the Lanczos iteration. They devised a stochastic finite-difference estimator of the total non-diagonal elements of the Hessian, using only a forward pass, which they termed the Hessian penalty. This remarkable achievement enabled them to efficiently regularize Hessian diagonality during GAN training. As another point of contrast, the Hessian penalty does not provide an explicit geometric picture as the exact Hessian matrices do in our work. Specifically, their penalty did not reveal the spectral structure of the Hessian matrices, which is encoded in the diagonal elements of the matrix. Because of this, the anisotropy can be explicitly demonstrated in our work.
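To make the contrast concrete, the sketch below illustrates the kind of forward-only estimator described above, applied to a scalar toy function. It is an assumption-laden illustration rather than the implementation of peebles2020HessianPenalty: it relies on the identity that, for a symmetric matrix $H$ and random sign vectors $v$, the variance of $v^\top H v$ equals $2\sum_{i \neq j} H_{ij}^2$, and it enumerates all sign vectors of a small toy problem instead of sampling them.

```python
import itertools
import numpy as np

def second_directional_derivative(f, z, v, eps=1e-2):
    """Second-order central difference along v, using only forward evaluations of f."""
    return (f(z + eps * v) - 2.0 * f(z) + f(z - eps * v)) / eps**2

def offdiag_energy_estimate(f, z, directions, eps=1e-2):
    """Variance of v^T H v over the given sign vectors. Over all sign vectors this
    equals 2 * sum_{i != j} H_ij^2, so it vanishes only for a diagonal Hessian."""
    vals = np.array([second_directional_derivative(f, z, v, eps) for v in directions])
    return vals.var()

rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n))
A = 0.5 * (A + A.T)                            # hypothetical symmetric Hessian
f = lambda z: 0.5 * z @ A @ z                  # toy scalar function with Hessian A
z0 = rng.standard_normal(n)

# Enumerate all 2^n sign vectors (feasible only for tiny n; in practice one samples).
all_signs = np.array(list(itertools.product([-1.0, 1.0], repeat=n)))
estimate = offdiag_energy_estimate(f, z0, all_signs)
exact = 2.0 * (np.sum(A**2) - np.sum(np.diag(A) ** 2))
```

The estimator returns a single number summarizing off-diagonal energy, whereas the exact or low-rank Hessian used in our work also exposes the eigenvalues and eigenvectors.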
It is clear that the Hessian penalty provides a very important and complementary approach to ours. Their work showed that regularizing Hessian diagonality during training could promote disentangled latent representations using PGGAN (and it provides a more detailed evaluation of disentanglement than we achieved).
Interestingly, many of the phenomena they observed in Hessian-penalized generative networks were reminiscent of phenomena we observed in normally trained generative networks with a Hessian eigen-frame rotation of the latent space. As a first example, they found that after applying the Hessian penalty, many factors stopped affecting the image (termed “latent space shrinkage” in their work). We also found this phenomenon in normal pre-trained GANs, i.e. they showed a large bottom eigenspace with eigenvalues close to 0, in which the eigenvectors generated small to no changes in the image (termed “null space” in our work; Sec. 4). Thus, if we perform a Hessian eigen-frame rotation, non-Hessian-penalized generators will exhibit behavior similar to theirs.
As a second example, they also found that enforcing the Hessian penalty on middle layers is helpful in regularizing the Hessian diagonality in the image space, which is reminiscent of our finding that the Hessian eigen-frames are usually well aligned across the layers of generator (Sec. 5), though the spectra get shaped throughout the layers.
The Hessian penalty raises an interesting question: what makes a diagonal Hessian matrix special? Because the Hessian is a real symmetric matrix, at each point $z$ it can be diagonalized by a rotation of the latent space. However, to achieve a diagonal matrix across the latent space, peebles2020HessianPenalty had to (implicitly) enforce that each point shares the same rotation (i.e. Hessian eigen-frame). Given the training involved, this encouraged the homogeneity or flatness of the latent space as identified in our framework. However, as shown in our work, most GANs exhibit homogeneity or flatness in their latent space even without the Hessian penalty. So it would be interesting to compare generators trained with the Hessian penalty against those trained without the penalty but with a post hoc rotation of the latent space using the global eigen-frame. We expect that their training will yield flatter geometry than the post hoc rotated latent space. However, even if this is not true, it would still be interesting if this flat geometry can emerge from modern GAN training, i.e. from fitting the natural image distribution.
Finally, aside from regularizing GAN training, peebles2020HessianPenalty also explored the use of the Hessian penalty in finding interpretable directions in a pre-trained BigGAN (BigGAN-128). Similar to voynov2020unsupInterpDir, the axes they discovered showed a striking correspondence to those identified in our approach (Fig. 9). We showed that when we computed the Hessian at a few points in the noise space of the generator, the interpretable axes they found aligned well with our top eigenvectors, with a one-to-one or one-to-two matching. For example, their reported interpretable axes 5, 6 and 8 (for the golden retriever class) had 0.998, 0.964 and 0.990 of their power concentrated in our top 10-dimensional eigenspace. Moreover, they aligned with single eigenvectors 0, 5 and 2 with power 0.988, 0.499 and 0.852, respectively. Due to the presence of close-by eigenvalues, eigenvectors can mix into each other, with the result that an axis identified in peebles2020HessianPenalty can correspond to a few adjacent eigenvectors (e.g. their axis 5 corresponds to our eigenvectors 4 and 5, with corresponding eigenvalues 8.3 and 7.4; for reference, eigenvalue 3 = 17.6 and eigenvalue 6 = 5.6, a much larger gap). We found that power concentration within the top eigenspace is a great indicator of the interpretability of their reported axes: all but one of the reported axes showed over 0.95 power concentrated in the top 10d eigenspace, while the axes they found not to be interpretable showed power (mean±std) in the top 10d eigenspace. One advantage of our approach is computational efficiency. As the alternative method required learning (i.e. iterative optimization of a mixing matrix), it took us 40 minutes (50 epochs) to finish the training on a 6GB GTX 1060 GPU. In contrast, the present method directly computes the Hessian matrices at one or a few points, taking 12 seconds to compute a full Hessian matrix (5-50 points usually suffice). The reason for this difference is that their method optimizes for a basis that diagonalizes the Hessian matrix based on a noisy estimate of diagonality, the Hessian penalty; in contrast, our approach directly computes and diagonalizes the matrix at given points. Finally, our method orders the axes by their eigenvalues, facilitating focus on the top eigenspace and thus alleviating the need to go through all the axes to find interpretable ones.
In summary, we believe that as a stochastic regularizer of the Hessian matrix, the Hessian penalty is a valuable and complementary approach to ours. Our methods provide additional value by accurately estimating the top eigenvectors and eigenspectrum, suitable for analyzing geometry post hoc. However, unlike the Hessian penalty, direct use of our method to regularize training may be inefficient. It would be interesting to explore whether there is a middle ground that incorporates the advantages of both their estimator and our more precise calculation.