Authors' official PyTorch implementation of "WarpedGANSpace: Finding non-linear RBF paths in GAN latent space" [ICCV 2021].
This work addresses the problem of discovering, in an unsupervised manner, interpretable paths in the latent space of pretrained GANs, so as to provide an intuitive and easy way of controlling the underlying generative factors. In doing so, it addresses some of the limitations of state-of-the-art works, namely, a) that they discover directions that are independent of the latent code, i.e., paths that are linear, and b) that their evaluation relies either on visual inspection or on laborious human labeling. More specifically, we propose to learn non-linear warpings on the latent space, each one parametrized by a set of RBF-based latent space warping functions, where each warping gives rise to a family of non-linear paths via the gradient of the function. Building on the work of Voynov and Babenko, which discovers linear paths, we optimize the trainable parameters of the set of RBFs so that images generated by codes along different paths are easily distinguishable by a discriminator network. This leads to easily distinguishable image transformations, such as pose and facial expressions in facial images. We show that linear paths can be derived as a special case of our method, and show experimentally that non-linear paths in the latent space lead to steeper, more disentangled, and more interpretable changes in the image space than state-of-the-art methods, both qualitatively and quantitatively. We make the code and the pretrained models publicly available at: https://github.com/chi0tzp/WarpedGANSpace.
Figure 1: (a) Warpings of a vector space due to two RBF functions, $f_1$ and $f_2$, lead to different non-linear paths in that space for any given point $z$ (dashed bold lines) via their gradients, $\nabla f_1$ and $\nabla f_2$. Solid black lines represent isohypses (level sets) of the warpings and the colored vectors represent the vector fields induced by their gradients. (b) Illustration of a non-linear path due to a warping $f$, starting from a latent code $z$ and moving along the gradient $\nabla f$ by steps of magnitude $\epsilon$.
Generative Adversarial Networks (GANs) have emerged as the leading generative learning paradigm, exhibiting clear superiority in terms of the quality of generated realistic and aesthetically pleasing images [25, 3, 17, 18, 19]. However, despite their generative efficiency, GANs do not provide an inherent way of comprehending or controlling the underlying generative factors. To address this, the research community has directed its efforts towards studying the structure of the GAN latent space [28, 30, 18, 33, 7, 37, 1, 34, 35, 9, 14, 32, 26, 11]. These works attempt to find interpretable directions in it; that is, directions along which sampling is expected to generate images where only a few (ideally one) factors of variation are “activated”. Meaningful human-interpretable directions can refer to either domain-specific factors (e.g., facial expressions) or domain-agnostic factors (e.g., zoom scale [14, 26, 32]).
Several methods adopt a supervised learning framework, and discover directions in the latent space that align well with factors controlled by supervision. In this line of research [31, 9, 18], supervision is in the form of labels assigned to the generated images, either by explicit human annotation or by the use of pretrained semantic classifiers. Recent works, such as [32, 14, 26], steer the directions in the latent space so as to align well with controllable manipulations in the image space (e.g., zoom). Those works are limited by the fact that the factors are assumed to be known and by practical issues in generating the supervisory signals.
Another line of research imposes unsupervised constraints on the directions in the latent space. GANSpace performs PCA on deep features at the early layers of the generator and finds directions in the latent space that best map to those deep PCA vectors, arriving at a set of non-orthogonal directions in the latent space. Similarly to other methods, this is a very demanding training process that requires drawing large numbers of random latent codes and regressing the latent directions. Similarly, Voynov and Babenko proposed an unsupervised method to discover linear interpretable latent space directions. While the unsupervised learning framework is of interest, current works make the hard assumption that the discovered directions are isotropic in the latent space, i.e., the same at every latent code, leading to linear paths. Furthermore, despite the fact that these works lead to more complex directions compared to methods that do not use any optimization at all (e.g., [14, 32]), the evaluation of the obtained results is either left to subjective visual inspection or relies on laborious human labeling.
In this work, we propose to learn non-linear warping functions on the latent space, each one parametrized by a set of RBF-based latent space warping operations, where each warping function gives rise to a family of non-linear paths via its gradient. More precisely, at each latent code $z$, the gradient of the $k$-th warping function, $\nabla f_k(z)$, gives the direction along the $k$-th family of paths. The gradient of $f_k$ is not isotropic in $z$, i.e., it changes with the latent code, which gives rise to non-linear paths. An example is shown in Fig. 1, where two RBF-warping functions, $f_1$ and $f_2$, are depicted together with two distinct non-linear paths. Building on the work of Voynov and Babenko, which discovers linear paths, we optimize the trainable parameters of the RBFs so that images generated by codes along paths of different families are easily distinguishable by a discriminator network (Fig. 2). This leads to easily distinguishable image transformations, such as pose and facial expressions in facial images (Fig. 1(b)). We show that the method of Voynov and Babenko, which learns linear paths, can be derived as a special case of our method, and we perform extensive comparisons with state-of-the-art methods both qualitatively and quantitatively.
For a quantitative evaluation, we propose to utilize trained classifiers that assign attributes to the generated images, and we propose a framework that monitors the correlation between paths in the latent space and the corresponding changes/paths in the attribute space, so as to determine how strongly paths along certain warping functions correlate with certain attributes. We experimentally show that the proposed non-linear paths in the latent space lead to more disentangled and more interpretable changes in the image space than state-of-the-art methods. In addition, we show that for paths of the same length in the latent space, our method is able to produce much larger changes in the attribute space in comparison to the linear one, i.e., the generated attribute paths are much steeper, and that we are able to generate larger attribute changes before the quality of the generated images deteriorates.
The main contributions of this paper can be summarized as follows:
We propose an unsupervised and model-agnostic method for discovering non-linear interpretable paths in the latent space of pretrained GANs by using RBF-based warping functions. We derive the case of linear paths as a special case and learn a set of such warping functions so that the corresponding image transformations are distinguishable from each other.
We propose a quantitative evaluation protocol for measuring the interpretability/disentanglement of paths in the latent space, by analysing the corresponding changes to attributes in the generated images, as those are measured by pretrained semantic classifiers (e.g., pretrained face attribute networks).
We apply our method to four pretrained GANs (i.e., SN-GAN, BigGAN, ProgGAN, and StyleGAN2) and compare our non-linear paths to linear ones [34, 11], both qualitatively and quantitatively. We show that in comparison to the state-of-the-art, our method produces steeper, more disentangled, and longer paths in the attribute space.
A disentangled representation in the context of generative learning can be defined as one where single latent units are sensitive to changes in single generative factors, while being relatively invariant to changes in other factors. Imposing disentanglement in the latent space of a generative learning method has drawn significant attention from the research community in recent years. These works typically refer to the notion of a disentangled latent space [4, 13, 23, 22, 29, 36], in the context of either VAEs (e.g., [36, 13]) or GANs (e.g., [4, 23]), and they typically try to improve the architectures and the training protocols of standard generative methods in order to obtain latent spaces where generative factors are disentangled. While these works provide comprehensive theoretical insights, they are typically applied to toy or low-resolution datasets and exhibit inferior results in terms of generation quality and diversity compared to state-of-the-art GANs, such as ProgGAN or StyleGAN2.
Since the early days of GANs, it has been shown that the GAN latent space often exhibits semantically meaningful vector space arithmetic. Radford et al. showed that there exist latent directions corresponding to adding smiles or glasses on faces. This paved the way for the development of methods that would facilitate image editing and has since received significant research attention. Some works [9, 30, 18] require explicit human-provided supervision to identify interpretable directions in the latent space. More specifically, [30, 18] use classifiers, pretrained on the CelebA dataset, in order to predict certain face attributes. These classifiers are then used to produce pseudo-labels for a large number of generated images and their latent codes. Based on these pseudo-labels, a separating hyperplane is learned in the latent space, giving rise to a direction that captures the corresponding attribute. Plumerault et al. also solve an optimization problem in the latent space for maximizing the score of the pretrained model to predict image memorability and then find the directions that increase memorability. By contrast to the above works, our method is trained in an unsupervised manner.
Some recent works [14, 26, 32] seek those vectors in the latent space that correspond to controlled image augmentations such as zoom or translation. While these approaches are of interest, they can find only the directions capturing the transformations that they have been trained on. By contrast, our method can discover non-linear paths that correspond to more complex generative factors (e.g., skin color, age, etc.).
Finally, our method is closely related to those of [34, 11], since we are also learning a set of interpretable paths in an unsupervised and model-agnostic manner. More specifically, Voynov and Babenko optimize a set of linear interpretable directions, modeled by a set of vectors in the latent space, and they evaluate the performance of their method using the judgements of eleven human assessors. GANSpace is trained in an unsupervised manner in order to discover meaningful directions by using PCA on deep features of the generator. This method seeks linear directions in the latent space that best map to those deep PCA vectors, and results in a set of non-orthogonal directions. Similarly to other methods discussed above, it also requires a very demanding training procedure (drawing random latent codes and regressing the latent directions), while it provides only qualitative evaluation results.
In contrast to these works, our method discovers non-linear paths in the latent space of a pretrained GAN generator in an unsupervised manner. Moreover, in order to lift the obvious limitations introduced by manual labeling of the discovered paths, we propose a quantitative and automatic evaluation protocol that obtains the most interpretable paths in terms of correlation with a certain number of attributes.
In this section, we present our method for discovering non-linear interpretable paths in the latent space of a pretrained GAN generator, by learning $K$ warping functions, $f_1, \ldots, f_K$, the gradients of which define the directions of the paths at each latent code $z$. More specifically, we transform the latent space by $f_k$, which is parameterized as a weighted sum of RBFs, and for any given $z$ we move along the path belonging to the $k$-th family of paths by following the direction of $\nabla f_k(z)$. In order to obtain interpretable paths, we adopt the framework of Voynov and Babenko and learn warping functions that give families of paths that lead to image transformations that are distinguishable from each other by a discriminator/reconstructor. The parameters of the warping functions and of the reconstructor/discriminator network are optimized jointly. By contrast to that framework and other methods in the literature, the warping functions may lead to non-linear paths, and the linear ones can be obtained for specific values of the parameters. An overview of the proposed method is given in Fig. 2.
Given a vector space $\mathbb{R}^d$, we define $f\colon \mathbb{R}^d \to \mathbb{R}$ as a weighted sum of $N$ parametric Gaussian RBFs given by
$$f(z) = \sum_{i=1}^{N} \alpha_i \exp\left(-\gamma_i \lVert z - s_i \rVert^2\right), \tag{1}$$
where $\alpha_i \in \mathbb{R}$, $\gamma_i > 0$, and $s_i \in \mathbb{R}^d$ denote the weight, the scale, and the center of the $i$-th RBF, respectively. Geometrically, $f$ transforms each point $z$ of the given vector space into a $(d+1)$-dimensional point, $(z, f(z))$, that lies on a $d$-dimensional manifold. We define this transformation as a warping of the vector space $\mathbb{R}^d$. Hereafter, we refer to the centers of the RBFs as the support vectors, driven by the geometric intuition that they “support” the induced warping of the space, and we use the term support set to refer to the set of support vectors, $\mathcal{S} = \{s_i\}_{i=1}^{N}$. The corresponding weights and scale parameters are referred to as the sets $\mathcal{A} = \{\alpha_i\}_{i=1}^{N}$ and $\mathcal{G} = \{\gamma_i\}_{i=1}^{N}$, respectively. Different support sets will in general lead to different warpings of a given vector space.
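The above warping operation is differentiable and its gradient is given analytically as follows
$$\nabla f(z) = -2 \sum_{i=1}^{N} \alpha_i \gamma_i \exp\left(-\gamma_i \lVert z - s_i \rVert^2\right) (z - s_i). \tag{2}$$
Thus, given an arbitrary $z \in \mathbb{R}^d$, $\nabla f(z)$ defines a (local) direction, which we use in order to define a curve in $\mathbb{R}^d$. More specifically, for any $z$ and sufficiently small shift magnitude $\epsilon$, we define a continuous curve in $\mathbb{R}^d$ induced by the warping operation using (2) by shifting $z$ by
$$\delta z = \epsilon \, \frac{\nabla f(z)}{\lVert \nabla f(z) \rVert}. \tag{3}$$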
In Fig. 1(a), we illustrate this for a given vector space and two warpings, $f_1$ and $f_2$, which lead to two different non-linear paths in that space for any given $z$ (dashed bold lines). In this figure, thin solid lines represent level sets of the warpings, while the vector fields represent their gradients.
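The warping, its gradient, and the resulting path traversal can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' PyTorch code: it assumes the Gaussian RBF form $f(z) = \sum_i \alpha_i \exp(-\gamma_i \lVert z - s_i \rVert^2)$ and a shift of magnitude $\epsilon$ along the normalized gradient, and the function names (`warp`, `warp_grad`, `shift`, `traverse`) and the toy 2-D support set are hypothetical.

```python
import math

def warp(z, centers, weights, gammas):
    """Weighted sum of Gaussian RBFs: sum_i alpha_i * exp(-gamma_i * ||z - s_i||^2)."""
    total = 0.0
    for s, a, g in zip(centers, weights, gammas):
        sq_dist = sum((zj - sj) ** 2 for zj, sj in zip(z, s))
        total += a * math.exp(-g * sq_dist)
    return total

def warp_grad(z, centers, weights, gammas):
    """Analytic gradient: -2 * sum_i alpha_i * gamma_i * exp(-gamma_i * ||z - s_i||^2) * (z - s_i)."""
    grad = [0.0] * len(z)
    for s, a, g in zip(centers, weights, gammas):
        diff = [zj - sj for zj, sj in zip(z, s)]
        coef = -2.0 * a * g * math.exp(-g * sum(d * d for d in diff))
        grad = [gj + coef * dj for gj, dj in zip(grad, diff)]
    return grad

def shift(z, eps, centers, weights, gammas):
    """Step of magnitude |eps| along the normalized gradient (undefined where the gradient vanishes)."""
    grad = warp_grad(z, centers, weights, gammas)
    norm = math.sqrt(sum(g * g for g in grad))
    return [eps * g / norm for g in grad]

def traverse(z, eps, steps, centers, weights, gammas):
    """Follow the gradient field for `steps` steps, recomputing the direction at every point."""
    path = [list(z)]
    for _ in range(steps):
        dz = shift(path[-1], eps, centers, weights, gammas)
        path.append([zj + dj for zj, dj in zip(path[-1], dz)])
    return path

# Toy 2-D support set (illustrative values only): one bipolar pair of support vectors.
centers = [[1.0, 0.5], [-1.0, -0.5]]
weights = [1.0, -1.0]
gammas = [0.5, 0.5]
path = traverse([0.2, -0.3], eps=0.1, steps=5,
                centers=centers, weights=weights, gammas=gammas)
```

Because the direction is recomputed at every point, consecutive steps generally point in different directions, which is what makes the path non-linear.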
Following the discussion above, given a pretrained GAN’s latent space, which is typically modeled as a $d$-dimensional vector space $\mathbb{R}^d$, we may model a set of $K$ different warpings by a set of support sets $\mathcal{S}_k$, along with the corresponding weights $\mathcal{A}_k$ and parameters $\mathcal{G}_k$, $k = 1, \ldots, K$. We embed the support sets into the support tensor $S \in \mathbb{R}^{K \times N \times d}$, and the weights and parameters into the matrices $A \in \mathbb{R}^{K \times N}$ and $\Gamma \in \mathbb{R}^{K \times N}$, respectively, where $N$ is the number of support vectors per warping. Then, each support set, along with the corresponding weights and parameters, leads to a specific warping of the latent space via the function $f_k$ defined by (1), whose gradient $\nabla f_k$ is given analytically by (2). Thus, for each $f_k$, $k = 1, \ldots, K$, we define a vector field on the latent space, which we use to traverse it using (3).
Here, we define each warping to be given by a set of pairs of “bipolar” support vectors, i.e., pairs that have opposite weights, $\alpha$ and $-\alpha$, and equal scale $\gamma$. In this formulation, $\gamma$ controls the degree of non-linearity of the path, where very small values of $\gamma$ lead to linear paths, similar to those of Voynov and Babenko. This is illustrated in Fig. 3, where the vector fields for two bipolar support vectors with different values of $\gamma$ are depicted.
Finally, let us note that in contrast to the global linear directions discovered by [34, 11], in our case the directions along each warping are different for different latent codes. That is, as shown in (3), the gradient and the shift vector depend on the latent code itself. This anisotropic behaviour of the proposed method reflects our intuition that interpretable paths do not necessarily have the same direction at every region of the latent space.
In this section we will show that the method of Voynov and Babenko can be derived as a special case of our method. We first note that their framework, which discovers linear directions encoded in the columns of a matrix $U$, can be derived in the special case that the warping functions are linear in $z$, that is, $f_k(z) = u_k^\top z$. In that case, the direction along the $k$-th path is given by $\nabla f_k(z) = u_k$, where $u_k$ is the $k$-th column of $U$.
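It is straightforward to show that this solution can be obtained in our formulation when each RBF warping is given by pairs of bipolar RBFs, i.e., pairs of support vectors with opposite weights $\alpha$ and the same scale $\gamma$, when the value of $\gamma$ is sufficiently small. In what follows we give the proof for the simple case of a single bipolar pair, in the special case that $s_1 = -s_2 = s$, with weights $\alpha_1 = -\alpha_2 = \alpha > 0$. In that case, (2) can be written as $\nabla f(z) = -2\alpha\gamma\left[e^{-\gamma \lVert z - s \rVert^2}(z - s) - e^{-\gamma \lVert z + s \rVert^2}(z + s)\right]$, which, for sufficiently small $\gamma$ (so that both exponentials are approximately equal to 1), leads to $\nabla f(z) \approx 4\alpha\gamma s$. Then, the shift in the latent space, given by (3), is written as
$$\delta z = \epsilon \, \frac{s}{\lVert s \rVert}. \tag{4}$$
In this case, the derivative of the warping function at $z$ is independent of $z$ and equal to a constant vector.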
It is straightforward to show that linear directions can be obtained also in the more general case that each of the warping functions is given by several bipolar support vectors, each with a small $\gamma$. It is also the case that such parameters could be found by the optimization process, if they lead to discernible image transformations.
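The linear special case can be checked numerically. The sketch below assumes a single bipolar pair with centers $s$ and $-s$, weights $\alpha$ and $-\alpha$, a shared scale $\gamma$, and the Gaussian RBF form used above; the function name `bipolar_direction` and the test points are hypothetical. For a very small $\gamma$ the normalized gradient should be close to $s / \lVert s \rVert$ at every latent code, while for a larger $\gamma$ it should vary with $z$.

```python
import math

def bipolar_direction(z, s, gamma, alpha=1.0):
    """Normalized gradient of f(z) = alpha*exp(-gamma*||z-s||^2) - alpha*exp(-gamma*||z+s||^2)."""
    d_pos = [zj - sj for zj, sj in zip(z, s)]  # z - s
    d_neg = [zj + sj for zj, sj in zip(z, s)]  # z + s
    c_pos = -2.0 * alpha * gamma * math.exp(-gamma * sum(d * d for d in d_pos))
    c_neg = 2.0 * alpha * gamma * math.exp(-gamma * sum(d * d for d in d_neg))
    grad = [c_pos * p + c_neg * n for p, n in zip(d_pos, d_neg)]
    norm = math.sqrt(sum(g * g for g in grad))
    return [g / norm for g in grad]

s = [3.0, 4.0]            # illustrative support vector, so s / ||s|| = [0.6, 0.8]
z_a, z_b = [0.5, -1.0], [2.0, 1.0]

# Small gamma: (almost) the same direction at both latent codes, i.e. a linear path.
small_a = bipolar_direction(z_a, s, gamma=1e-6)
small_b = bipolar_direction(z_b, s, gamma=1e-6)

# Larger gamma: the direction depends on the latent code, i.e. a non-linear path.
large_a = bipolar_direction(z_a, s, gamma=1.0)
large_b = bipolar_direction(z_b, s, gamma=1.0)
```

Comparing `small_a` with `small_b`, and `large_a` with `large_b`, makes the role of $\gamma$ as a non-linearity control concrete.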
An overview of the learning process is presented in Fig. 2. We use a pretrained generator $G$ and learn a) the parameters of a warping network $W$ that generates paths in the latent space of $G$, and b) the parameters of a reconstructor network $R$ that recognises the index of the warping that generated the changes between a pair of images. The trainable modules of our method are the following:
The warping network $W$ is parametrized by a set of triplets, $(\mathcal{S}_k, \mathcal{A}_k, \mathcal{G}_k)$, of the support set $\mathcal{S}_k$ and the corresponding weights $\mathcal{A}_k$ and parameters $\mathcal{G}_k$, $k = 1, \ldots, K$. Each such triplet gives rise to a warping of the latent space and, thus, to a non-linear path for any given latent code $z$. $W$ is implemented by standard layers and is differentiable.
A reconstructor $R$ is a model that we use in order to distinguish the image transformations that are induced by the different support sets (i.e., the different latent space warpings). As shown in Fig. 2, the input to the reconstructor is a pair of images, $G(z)$ and $G(z + \delta z)$. The reconstructor’s goals are i) to predict which support set gave rise to the transformation at hand, i.e., recognise the index $k$, and ii) to reproduce the magnitude of the shift in the latent space; that is, predict $\epsilon$. In the experiments, we use the LeNet backbone for SN-GAN (MNIST and AnimeFaces datasets) and ResNet-18 for BigGAN (ImageNet), ProgGAN (CelebA-HQ), and StyleGAN2 (FFHQ). We modify the input channels of the reconstructor so that it receives pairs of images (i.e., we concatenate the input image pair along the channel dimension). Finally, we define two output “heads”, one for predicting the index (classification), and the other for predicting the shift magnitude (regression).
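The optimization problem that we solve is as follows
$$\min_{S, A, \Gamma, R} \; \mathbb{E}_{z, k, \epsilon} \left[ \mathcal{L}_{cl}(k, \hat{k}) + \lambda \, \mathcal{L}_{reg}(\epsilon, \hat{\epsilon}) \right], \tag{5}$$
where $\hat{k}$ and $\hat{\epsilon}$ are the reconstructor's predictions of the warping index and of the shift magnitude, $\mathcal{L}_{cl}$ denotes the classification loss term, for which we use the cross-entropy function, $\mathcal{L}_{reg}$ denotes the regression loss term, for which we use the mean absolute error, and $\lambda$ is a weighting coefficient. We note that the objective function is differentiable with respect to the support vectors, weights, and parameters, allowing us to learn not only the positions of the support vectors, but also their weights and/or parameters. To ensure the positivity of $\gamma$ we learn its logarithm. As discussed above, for each warping we learn a set of bipolar pairs of support vectors.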
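The two-term objective can be made concrete with a small plain-Python sketch. It assumes the loss is the cross-entropy on the predicted warping index plus a weighting coefficient times the absolute error on the predicted shift magnitude, averaged over a batch; the function names, the tuple layout of `batch`, and the default value of `lam` are hypothetical and are not taken from the paper.

```python
import math

def cross_entropy(logits, target):
    """Negative log-softmax of the target class, computed with the log-sum-exp trick."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[target]

def reconstructor_loss(batch, lam=0.25):
    """Batch mean of L_cl + lam * L_reg.

    Each item is (logits, k, eps_pred, eps): the classification head's logits over
    warping indices, the true index, the regression head's output, and the true shift.
    `lam` is an assumed value for the weighting coefficient.
    """
    total = 0.0
    for logits, k, eps_pred, eps in batch:
        total += cross_entropy(logits, k) + lam * abs(eps_pred - eps)
    return total / len(batch)

# One training pair with 4 candidate warpings and an uninformative classifier.
example_batch = [([0.0, 0.0, 0.0, 0.0], 1, 0.7, 0.5)]
example_loss = reconstructor_loss(example_batch, lam=0.5)
```

In a real training loop both terms would be differentiated with respect to the reconstructor and the warping parameters, with the scale stored as its logarithm to keep it positive.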
During training, we generate pairs of images $G(z)$ and $G(z + \delta z)$, where $\delta z$ is given by (3) for the $k$-th warping, $k$ is a warping function index uniformly sampled in $\{1, \ldots, K\}$, and $\epsilon$ is a scalar sampled uniformly from a bounded interval. The pair of images is fed to the reconstructor, where the loss is calculated, and the gradients are back-propagated to the warping network and the reconstructor.
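The sampling step of this training procedure might look as follows. This is only a sketch under stated assumptions: the latent code is drawn from a standard normal distribution, the shift magnitude is drawn from a symmetric interval whose bound `eps_max` is a made-up value, and warping indices are zero-based. None of these choices is specified in the text above.

```python
import random

def sample_training_tuple(num_warpings, latent_dim, eps_max=1.0, rng=random):
    """Draw (z, k, eps) for one training pair.

    z   : latent code, assumed here to be standard normal
    k   : warping index, uniform over range(num_warpings) (zero-based)
    eps : shift magnitude, uniform in [-eps_max, eps_max] (assumed interval)
    """
    z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
    k = rng.randrange(num_warpings)
    eps = rng.uniform(-eps_max, eps_max)
    return z, k, eps

# Seeded generator so the draws are reproducible.
rng = random.Random(0)
samples = [sample_training_tuple(num_warpings=4, latent_dim=8, eps_max=0.5, rng=rng)
           for _ in range(1000)]
```

Each tuple would then be turned into an image pair by generating from `z` and from `z` shifted along the gradient of warping `k`, with `k` and `eps` serving as the reconstructor's targets.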
In this section we present the experimental evaluation of the proposed method and provide qualitative and quantitative comparisons with state-of-the-art methods. We first show that, in comparison to the linear case, our method finds paths in the latent space that produce changes in the image space that are easier for a discriminating network to distinguish; this is achieved consistently across several GANs that are pretrained on different datasets (Table 1). We then show that, in comparison to the state-of-the-art, our method finds paths in the latent space that produce more distinguishable, more disentangled, and larger changes in the generated images. We first show this qualitatively, by presenting images generated along paths of equal length in the latent space for different methods (Figs. 4, 5, 8) and observing the variations they produce in the image space. We subsequently show this quantitatively (Table 2), by estimating semantic attributes (e.g., rotations, smile, etc.) in the generated images and reporting the correlations and ranges as we follow different paths in the latent space. Finally, we show that our method finds paths in the latent space that correspond to steeper changes/paths in the attribute space, and therefore allows for better, controllable generation without arriving at latent space regions of low density and, thus, at quality degradation or distortions (Figs. 7, 8).
We first show that a reconstructor that discriminates images according to the warping in the latent space that generated them, i.e., estimates the index of the warping function, has better classification performance than in the corresponding linear case. This is an indication that the paths that are generated by our method can be discriminated more effectively and are therefore more likely to be interpretable. The results are summarised in Table 1 and are consistent across several pretrained GANs.
We then show qualitatively that the proposed method finds interpretable paths in the latent space that are similar to the ones reported in prior work, but exhibit larger variations in the captured generative factors. More specifically, for a given method that discovers a set of paths in the latent space of a pretrained GAN (linear in the cases of [34, 11], non-linear in our case), we generate an image sequence for each path, starting from a random latent code and “walking” in the positive and the negative directions of the path for a certain number of steps. This gives rise to an image sequence that shows how the learned path at hand affects the generation. For fair comparison, the step size, and therefore the path length, is the same for all methods.
In Fig. 4 we show the generated images along manually selected directions found by our method and the method of Voynov and Babenko on SN-GAN (AnimeFaces). In the same figure, we show three interpretable paths discovered by our method, namely zoom, background removal, and rotation, in comparison with the corresponding ones reported by Voynov and Babenko. We note that these are the directions chosen in their work and that we generate the paths using the publicly available models provided by the authors. We can clearly see that in both cases, the paths found by our method produce larger changes in the image space and larger variations in the content.
In Fig. 8 we show paths discovered in the latent space of ProgGAN, which is trained on CelebA-HQ. For our method we report the directions that are most correlated with three attributes, namely yaw, smile, and race, with the correlations estimated with a method we describe below. We compare with the corresponding linear directions obtained by [34, 11] and we note that our method not only leads to greater variation in the respective generative factors (e.g., larger rotation angles) for the same traversal lengths in the latent space, but also produces more disentangled generations. This is apparent in Fig. 8, where, for instance, changing the smile attribute using our method preserves other generative factors better than [34, 11].
As noted, the length of the paths in the latent space is the same for all sequences and methods. To obtain a measure of the non-linearity of the generated paths, we calculate the ratio $r$ between the length of a path and the distance between its endpoints, and report the averages for all the traversals on a given warping. Clearly, for linear paths, $r = 1$. The results are summarized in Fig. 6, where we plot (sorted) the values of $r$ for the discovered non-linear warpings for ProgGAN. An illustration is given in Fig. 3.
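The non-linearity measure described above, the length of a path divided by the distance between its endpoints, is simple to compute for a discretized path. The sketch below treats the path as a polyline of latent codes; the function name `path_length_ratio` and the toy paths are hypothetical.

```python
import math

def path_length_ratio(path):
    """Arc length of a polyline divided by the straight-line distance between its endpoints."""
    arc_length = sum(math.dist(p, q) for p, q in zip(path, path[1:]))
    chord = math.dist(path[0], path[-1])
    return arc_length / chord

# A straight path has ratio 1; any bend makes the ratio larger than 1.
straight = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
bent = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
straight_ratio = path_length_ratio(straight)
bent_ratio = path_length_ratio(bent)
```

Averaging this ratio over many starting latent codes for the same warping would give one value per warping, which is the quantity that is sorted and plotted.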
As discussed before, for a given method that discovers a set of interpretable paths in the latent space of a pretrained GAN generator (linear in the cases of [34, 11], non-linear in the case of the proposed method), we generate an image sequence for each path, starting from a random latent code and “walking” in the positive and the negative directions of the path for a certain number of steps. For each image of such a sequence, we apply a set of pretrained networks that predict the following: a) the location of the face (bounding box), using a pretrained face detector, b) an identity score for each image of the sequence that expresses the similarity between the original image (central image of the sequence) and each of the rest, using ArcFace, c) an age, race, and gender score using FairFace, d) a set of CelebA attribute classifiers (e.g., smile, wavy hair, etc.), and e) an estimation of the face pose (yaw, pitch, roll), using Hopenet. In this way, for each warping we have a set of paths in the latent space and the corresponding paths in the attribute space.
In order to obtain a measure of how well the paths generated by a warping function are correlated with a certain attribute, we estimate the average Pearson’s correlation between the index of the step along the path and the corresponding values in the attribute vector. By doing so, for each warping, we obtain a vector of correlations (one entry per attribute), which we normalize. This allows for sorting the discovered paths with respect to their correlation with each attribute and selecting the paths that give the maximum absolute correlation for each attribute.
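A minimal sketch of this correlation measure is given below. It computes Pearson's correlation between the step index and each attribute's values along a traversal, averages over traversals, and normalizes the resulting per-attribute vector. The choice of the L1 norm for the normalization is an assumption (the norm is not recoverable from the text), and the function names and toy attribute values are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either sequence is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def normalized_correlations(attribute_paths):
    """attribute_paths maps an attribute name to a list of traversals (value sequences).

    Returns the average step-index correlation per attribute, divided by the
    L1 norm of the whole vector (assumed normalization).
    """
    averages = {}
    for name, traversals in attribute_paths.items():
        corrs = [pearson(list(range(len(t))), t) for t in traversals]
        averages[name] = sum(corrs) / len(corrs)
    l1 = sum(abs(c) for c in averages.values())
    return {name: (c / l1 if l1 > 0.0 else 0.0) for name, c in averages.items()}

# Toy attribute readings along one traversal of a single warping (made-up numbers).
toy_paths = {
    "yaw": [[-10.0, -5.0, 0.0, 5.0, 10.0]],   # changes steadily with the step index
    "smile": [[0.2, 0.1, 0.3, 0.1, 0.2]],     # fluctuates without a trend
}
corr_vector = normalized_correlations(toy_paths)
```

In this toy case the warping would be associated with yaw, since that attribute carries essentially all of the normalized correlation.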
The results are summarised in Table 2, where we report quantitative results for our method (Tab. 2(a)), in comparison to the linear methods of [34, 11] (Tab. 2(b) and Tab. 2(c)), in terms of normalized correlation averaged across 100 latent codes. We note that our method achieves better correlations for the respective attributes, while at the same time the correlations with the rest of the attributes are lower than those achieved by [34, 11], as is evident from the lower values in the off-diagonal elements of the matrix. This shows quantitatively what was evident qualitatively in Fig. 8, that is, that the discovered paths in the latent space lead to more disentangled changes in the attribute space.
Finally, in Fig. 5 we show the results of generation across some non-linear interpretable paths obtained automatically by our method for StyleGAN2, for the following attributes: age, race (skin color), gender (“femaleness”), and yaw (rotation). In this figure, we report the paths with the highest correlation with the respective attribute.
In this paper, we presented our method for discovering non-linear interpretable paths in the latent space of pretrained GANs in an unsupervised and model-agnostic manner. We do so by modeling non-linear latent paths using the gradient of RBF-based warping functions, which we optimize in order to be distinguishable from each other. This leads to paths that correspond to interpretable generation where only a small number of generative factors are affected for each path. Finally, we proposed a quantitative evaluation protocol for the case of face-generating GANs, which can be used to automatically associate the discovered paths with interpretable attributes such as smiling and rotation.
Acknowledgments: This work was supported by the EU H2020 AI4Media No. 951911 project.
ArcFace: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4690–4699, 2019.
HOPE-Net: A graph-based model for hand-object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6608–6617, 2020.
Proceedings of the Machine Learning for Creativity and Design Workshop at NeurIPS, 2017.
The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 4836–4843, 2020.