1 Introduction
With the success of recent generative models in producing high-resolution photorealistic images (StyleGan; largescalegan; vqvae), an increasing number of applications are emerging, such as image inpainting, dataset synthesis, and deepfakes. However, the use of generative models is often limited by the lack of control over the generated images. More control could be used to improve existing approaches which aim at generating new training examples (gandataaugmentation) by allowing the user to choose more specific properties of the generated images.
First attempts in this direction showed that one can modify an attribute of a generated image by adding a learned vector to its latent code (vector_arithmetics) or by combining the latent codes of two images (StyleGan). Moreover, the study of the latent space of generative models provides insights about its structure, which is of particular interest as generative models are also powerful tools to learn unsupervised data representations. For example, vector_arithmetics observed, on autoencoders trained on datasets with labels for some factors of variation, that the latent spaces exhibit a vector space structure where some directions encode these factors of variation.
We suppose that images result from underlying factors of variation such as the presence of objects, their relative positions or the lighting of the scene. We distinguish two categories of factors of variation. Modal factors of variation are discrete values that correspond to isolated clusters in the data distribution, such as the category of the generated object. On the other hand, the size of an object or its position are described by continuous factors of variation, expressed in a range of possible values. As humans, we naturally describe images by using factors of variation, suggesting that they are an efficient representation of natural images. For example, to describe a scene, one likely enumerates the objects seen, their relative positions and relations and their characteristics (berg12understanding). This way of characterizing images is also described in visualgenome. Thus, explaining the latent space of generative models through the lens of factors of variation is promising. However, the control over the image generation is often limited to discrete factors and requires both labels and an encoder model. Moreover, for continuous factors of variation described by a real parameter $t$, previous works do not provide a way to get precise control over $t$.
In this paper, we propose a method to find meaningful directions in the latent space of generative models that can be used to control precisely specific continuous factors of variation, whereas the literature has mainly tackled semantic labeled attributes like gender, emotion or object category (vector_arithmetics; conditionalgan). We test our method on image generative models for three factors of variation of an object in an image: vertical position, horizontal position and scale. Our method has the advantage of requiring neither a labeled dataset nor a model with an encoder. It could be adapted to other factors of variation such as rotations, changes of brightness, contrast or color, or more sophisticated transformations like local deformations. However, we focused on position and scale as these are quantities that can be evaluated, allowing us to measure quantitatively the effectiveness of our method. We demonstrate both qualitatively and quantitatively that such directions can be used to control precisely the generative process and show that our method can reveal interesting insights about the structure of the latent space. Our main contributions are:

We propose a method to find interpretable directions in the latent space of generative models, corresponding to parametrizable continuous factors of variations of the generated image.

We show that properties of generated images can be controlled precisely by sampling latent representations along linear directions.

We propose a novel reconstruction loss for inverting generative models with gradient descent.

We give insights into why inverting generative models with optimization can be difficult by reasoning about the geometry of the natural image manifold.

We study the impacts of disentanglement on the ability to control the generative models.
2 Latent Space Directions of a Factor of Variation
We argue that it is easier to modify a property of an image than to obtain a label describing that property. For example, it is easier to translate an image than to determine the position of an object within said image. Hence, if we can determine the latent code of a transformed image, we can compute its difference with the latent code of the original image to find the direction in the latent space which corresponds to this specific transformation as in vector_arithmetics.
Let us consider a generative model $G: z \in \mathcal{Z} \mapsto G(z) \in \mathcal{I}$, with $\mathcal{Z} \subset \mathbb{R}^d$ its latent space of dimension $d$ and $\mathcal{I}$ the space of images, and a transformation $\mathcal{T}_t: \mathcal{I} \to \mathcal{I}$ characterized by a continuous parameter $t$. For example, if $\mathcal{T}$ is a rotation, then $t$ could be the angle, and if $\mathcal{T}$ is a translation, then $t$ could be a component of the vector of the translation in an arbitrary frame of reference. Let $z_0$ be a vector of $\mathcal{Z}$ and $I = G(z_0)$ a generated image. Given a transformation $\mathcal{T}_T$, we aim at finding $z_T$ such that $G(z_T) \approx \mathcal{T}_T(I)$ to then use the difference between $z_0$ and $z_T$ in order to estimate the direction encoding the factor of variation described by $\mathcal{T}$.
2.1 Latent Space Trajectories of an Image Transformation
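Given an image $I \in \mathcal{I}$, we want to determine its latent code. When no encoder is available, we can search for an approximate latent code $\hat{z}$ that minimizes a reconstruction error $\mathcal{L}$ between $I$ and $\hat{I} = G(\hat{z})$ ($\hat{I}$ can be seen as the projection of $I$ on $G(\mathcal{Z})$), i.e.
$$\hat{z} = \underset{z \in \mathcal{Z}}{\arg\min}\; \mathcal{L}(I, G(z)) \tag{1}$$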
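Solving this problem by optimization leads to solutions located in regions of low likelihood of the distribution used during training. It causes the reconstructed image $\hat{I} = G(\hat{z})$ to look unrealistic. (Footnote 1: We could have used a penalty on the norm of $z$ to encode a centered Gaussian prior on the distribution of $z$. However, the penalty requires an additional hyperparameter that can be difficult to choose.) Since $z$ follows a normal distribution $\mathcal{N}(0, \mathbf{I}_d)$ in a $d$-dimensional space, we have $\|z\|^2 \sim \chi^2_d$. Thus, $\mathbb{E}[\|z\|^2] = d$ and $\mathrm{Var}[\|z\|^2] = 2d$, so the relative standard deviation of $\|z\|^2$ is $\sqrt{2/d}$. Hence, when $d$ is large, the norm of $z$ is approximately equal to $\sqrt{d}$. This can be used to regularize the optimization by constraining $z$ to verify $\|z\| \le \sqrt{d}$:
$$\hat{z} = \underset{z \in \mathcal{Z},\, \|z\| \le \sqrt{d}}{\arg\min}\; \mathcal{L}(I, G(z)) \tag{2}$$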
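The concentration of the norm around $\sqrt{d}$ can be checked numerically. The following is a minimal NumPy sketch, not the implementation used in our experiments: the latent dimension `d = 128`, the helper name `project_to_ball` and the choice of rescaling as the way to enforce the constraint are all assumptions made for illustration.

```python
import numpy as np

def project_to_ball(z, radius):
    """Rescale z onto the ball of the given radius if it lies outside; otherwise return it unchanged."""
    norm = np.linalg.norm(z)
    return z if norm <= radius else z * (radius / norm)

rng = np.random.default_rng(0)
d = 128  # illustrative latent dimension (assumption)
samples = rng.standard_normal((10_000, d))
norms = np.linalg.norm(samples, axis=1)

# For z ~ N(0, I_d), E[||z||^2] = d, so the norms cluster around sqrt(d).
mean_sq_norm = float(np.mean(norms ** 2))
relative_spread = float(np.std(norms) / np.sqrt(d))
```

In a projected gradient descent, `project_to_ball(z, np.sqrt(d))` would be applied after each update of the latent code.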
2.1.1 Choice of the reconstruction error
One of the important choices regarding this optimization problem is that of $\mathcal{L}$. In the literature, the most commonly used are the pixelwise Mean Squared Error (MSE) and the pixelwise cross-entropy, as in LatentRecovery and InvertGan. However, in practice, pixelwise losses are known to produce blurry images. To address this issue, other works have proposed alternative reconstruction errors. However, they are based on another neural network (VAEGAN; PerceptualLoss), making them computationally expensive.
The explanation usually given for the poor performance of the pixelwise mean squared error is that it favors the solution which is the expected value of all the possibilities (l2loss). (Footnote 2: Indeed, if we model the value of a pixel by a random variable $x$, then $\arg\min_{\hat{x}} \mathbb{E}[(x - \hat{x})^2] = \mathbb{E}[x]$. In fact, this problem can easily be generalized to every pixelwise loss if we assume that nearby pixels follow approximately the same distribution, as $\arg\min_{\hat{x}} \mathbb{E}[\mathcal{L}(x, \hat{x})]$ will have the same value for nearby pixels.) We propose to go deeper into this explanation by studying the effect of the MSE on images in the frequency domain. In particular, our hypothesis is that due to its limited capacity and the low dimension of its latent space, the generator cannot produce arbitrary texture patterns, as the manifold of textures is very high dimensional. This uncertainty over texture configurations explains why textures are reconstructed as uniform regions when using pixelwise errors. In Appendix A, by expressing the MSE in the Fourier domain and assuming that the phase of high frequencies cannot be encoded in the latent space, we show that the contribution of high frequencies to such a loss is proportional to their squared magnitude, pushing the optimization toward solutions with fewer high frequencies, that is to say blurrier ones. In order to get sharper results, we therefore propose to reduce the weight of high frequencies in the penalization of errors with the following loss:
$$\mathcal{L}(I_1, I_2) = \left\| \mathcal{F}\{I_1 - I_2\}\, \mathcal{F}\{\sigma\} \right\|^2 = \left\| (I_1 - I_2) * \sigma \right\|^2 \tag{3}$$
where $\mathcal{F}$ is the Fourier transform, $*$ is the convolution operator and $\sigma$ is a Gaussian kernel. Because this loss gives less importance to the high frequencies when determining $\hat{z}$ in Equation 2, it allows a larger range of possibilities for $G(\hat{z})$, including images with more details (i.e. with more high frequencies) and appropriate texture, which gives more realistic generated images. A qualitative comparison to some reconstruction errors and choices of $\sigma$ can be found in Appendix C. We also report a quantitative comparison to other losses, based on the Learned Perceptual Image Patch Similarity (LPIPS), proposed by lpips.
2.1.2 Recursive Estimation of the Trajectory
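Using Equation 2, our problem of finding $z_T$ such that $G(z_T) \approx \mathcal{T}_T(I)$, given a transformation $\mathcal{T}_T$, can be solved through the following optimization problem:
$$z_T = \underset{z \in \mathcal{Z},\, \|z\| \le \sqrt{d}}{\arg\min}\; \mathcal{L}(\mathcal{T}_T(I), G(z)) \tag{4}$$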
In practice, this problem is difficult and an “unlucky” initialization can lead to a very slow convergence. PhotoEditingM proposed to use an auxiliary network to estimate $z_T$ and use it as initialization. Training a specific network to initialize this problem is nevertheless costly. One can easily observe that a linear combination of natural images is usually not a natural image itself; this fact highlights the highly curved nature of the manifold of natural images in pixel space. In practice, the trajectories corresponding to most transformations in pixel space may imply small gradients of the loss that slow down the convergence of the problem of Eq. (2) (see Appendix D).
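To address this, we guide the optimization on the manifold by decomposing the transformation $\mathcal{T}_T$ into smaller transformations $[\mathcal{T}_{\delta t_0}, \ldots, \mathcal{T}_{\delta t_N}]$ such that $\delta t_0 = 0$ (i.e. $\mathcal{T}_{\delta t_0}$ is the identity) and $\delta t_N = T$, and solve sequentially:
$$z_{\delta t_n} = \underset{z \in \mathcal{Z},\, \|z\| \le \sqrt{d}}{\arg\min}\; \mathcal{L}(\mathcal{T}_{\delta t_n}(I), G(z)) \tag{5}$$
each time initializing $z$ with the result of the previous optimization. In comparison to PhotoEditingM, our approach does not require extra training and can thus be used directly without training a new model. We compare qualitatively our method to a naive optimization in Appendix F.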
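The warm-started sequence of small problems can be illustrated on a toy case. The sketch below is only an illustration under strong assumptions: `toy_generator` is a hypothetical one-dimensional "generator" that draws a narrow bump at position `z[0]`, the transformation is a shift of that bump, the reconstruction error is a plain MSE, and the gradient is estimated by finite differences. None of these choices come from our actual experiments.

```python
import numpy as np

GRID = np.linspace(0.0, 1.0, 200)

def toy_generator(z):
    """Hypothetical 1-D 'generator': a narrow bump centred at position z[0]."""
    return np.exp(-((GRID - z[0]) ** 2) / (2 * 0.02 ** 2))

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def invert(target, z_init, steps=50, lr=0.01, eps=1e-4):
    """Gradient descent on z using a central finite-difference gradient of the reconstruction error."""
    z = np.array(z_init, dtype=float)
    for _ in range(steps):
        grad = np.zeros_like(z)
        for i in range(z.size):
            step = np.zeros_like(z)
            step[i] = eps
            grad[i] = (mse(toy_generator(z + step), target)
                       - mse(toy_generator(z - step), target)) / (2 * eps)
        z -= lr * grad
    return z

def recursive_trajectory(z0, total_shift, n_steps):
    """Solve a sequence of small-shift problems, warm-starting each one from the previous solution."""
    z0 = np.array(z0, dtype=float)
    trajectory = [z0]
    for n in range(1, n_steps + 1):
        shift = total_shift * n / n_steps
        target = toy_generator(z0 + shift)  # stands in for the shifted image T_{dt_n}(G(z0))
        trajectory.append(invert(target, trajectory[-1]))
    return trajectory
```

On this toy problem, a direct call such as `invert(toy_generator(z0 + 0.4), z0)` barely moves because the two bumps do not overlap and the gradient is numerically zero, whereas `recursive_trajectory(z0, 0.4, 20)` reaches the shifted position.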
A transformation on an image usually leads to undefined regions in the new image (for instance, for a translation to the right, the left-hand side is undefined). Hence, we ignore the value of the undefined regions of the image to compute $\mathcal{L}$. Another difficulty is that often the generative model cannot produce arbitrary images. For example, a generative model trained on a given dataset is not expected to be able to produce images where the object shape position is outside of the distribution of object shape positions in the dataset. This is an issue when applying our method because, as we generate images from a random start point, we have no guarantee that the transformed images are still on the data manifold. To reduce the impact of such outliers, we discard latent codes that give a reconstruction error above a threshold in the generated trajectories. In practice, we remove the tenth of the latent codes that leads to the worst reconstruction errors. This finally results in Algorithm 1 to generate trajectories in the latent space.
2.2 Encoding Model of the Factor of Variation in the Latent Space
After generating trajectories with Algorithm 1, we need to define a model which describes how factors of variation are encoded in the latent space. We make the core hypothesis that the parameter $t$ of a specific factor of variation can be predicted from the coordinate of the latent code along an axis $u \in \mathcal{Z}$; thus we pose a model of the form $t = g(\langle z, u \rangle)$, with $g: \mathbb{R} \to \mathbb{R}$ and $\langle \cdot, \cdot \rangle$ the Euclidean scalar product in $\mathbb{R}^d$.
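When $g$ is a monotonic differentiable function, we can, without loss of generality, suppose that $\|u\| = 1$ and that $g$ is an increasing function. Under these conditions, $\langle z, u \rangle \sim \mathcal{N}(0, 1)$ and, by a change of variable, the distribution $\varphi$ of $t = g(\langle z, u \rangle)$ when $z \sim \mathcal{N}(0, \mathbf{I}_d)$ is given by:
$$\varphi(t) = \mathcal{N}\!\left(g^{-1}(t);\, 0, 1\right) \frac{d}{dt} g^{-1}(t) \tag{6}$$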
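For example, consider the dSprites dataset (dsprites) and the factor corresponding to the horizontal position of an object in an image: we have $t$ that follows a uniform distribution $\mathcal{U}([-0.5, 0.5])$ in the dataset while the projection of $z$ onto an axis $u$ follows a normal distribution $\mathcal{N}(0, 1)$. Thus, it is natural to adopt $g: \mathbb{R} \to [-0.5, 0.5]$ and for $s \in \mathbb{R}$:
$$g(s) = \frac{1}{2} \operatorname{erf}\!\left(\frac{s}{\sqrt{2}}\right) \tag{7}$$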
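As a sanity check of this kind of encoding model, one can verify numerically that a shifted normal CDF turns the normally distributed coordinate $\langle z, u \rangle$ into a uniformly distributed parameter. The sketch below assumes the specific form $g(s) = \frac{1}{2}\operatorname{erf}(s/\sqrt{2})$ with range $[-0.5, 0.5]$; the dimension, sample size and the name `g_uniform` are illustrative assumptions.

```python
import math
import numpy as np

def g_uniform(s):
    """g(s) = 0.5 * erf(s / sqrt(2)): the standard normal CDF shifted to the range [-0.5, 0.5]."""
    return 0.5 * np.vectorize(math.erf)(np.asarray(s, dtype=float) / math.sqrt(2.0))

rng = np.random.default_rng(0)
d = 10
u = rng.standard_normal(d)
u /= np.linalg.norm(u)                 # unit-norm axis, so <z, u> ~ N(0, 1)
z = rng.standard_normal((50_000, d))
t = g_uniform(z @ u)

# Ten equal-width bins over [-0.5, 0.5] should each receive about a tenth of the samples.
counts, _ = np.histogram(t, bins=10, range=(-0.5, 0.5))
```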
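However, in general, the distribution of the parameter $t$ is not known. One can adopt a more general parametrized model $g_\theta$ of the form:
$$t = g_\theta(\langle z, u \rangle) \tag{8}$$
with $g_\theta: \mathbb{R} \to \mathbb{R}$ and ($\theta$, $u$) trainable parameters of the model. We typically used piecewise linear functions for $g_\theta$.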
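However, this model cannot be trained directly as we do not have access to $t$ (in the case of horizontal translation, the $x$ coordinate for example) but only to the difference $\delta t$ between an image and its transformation ($\delta x$ or $\delta y$ in the case of translation). We solve this issue by modeling $\delta t$ instead of $t$:
$$\delta t = g_\theta(\langle z_{\delta t}, u \rangle) - g_\theta(\langle z_0, u \rangle) \tag{9}$$
Hence, $u$ and $\theta$ are estimated by training $g_\theta$ to minimize the MSE between $\delta t$ and $g_\theta(\langle z_{\delta t}, u \rangle) - g_\theta(\langle z_0, u \rangle)$ with gradient descent on a dataset produced by Algorithm 1 for a given transformation.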
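To make the fitting step concrete, the sketch below treats the simplest special case where the encoding function is linear, $g(s) = a\,s$, so that $\delta t = a \langle z_{\delta t} - z_0, u \rangle$ and the gradient-descent fit reduces to ordinary least squares. This is a deliberate simplification of the piecewise linear model described above; the synthetic data, the noise level and the name `fit_linear_direction` are assumptions for illustration.

```python
import numpy as np

def fit_linear_direction(z_start, z_end, delta_t):
    """Least-squares fit of w in delta_t ~ <z_end - z_start, w>.

    Returns the unit direction u = w / ||w|| and the slope a = ||w||.
    """
    w, *_ = np.linalg.lstsq(z_end - z_start, delta_t, rcond=None)
    slope = float(np.linalg.norm(w))
    return w / slope, slope

# Synthetic trajectories (assumption): pairs of latent codes and noisy parameter differences.
rng = np.random.default_rng(0)
d, n = 16, 500
u_true = rng.standard_normal(d)
u_true /= np.linalg.norm(u_true)
z_start = rng.standard_normal((n, d))
z_end = z_start + 0.5 * rng.standard_normal((n, d))
delta_t = 0.1 * (z_end - z_start) @ u_true + 0.01 * rng.standard_normal(n)

u_hat, slope = fit_linear_direction(z_start, z_end, delta_t)
```

With a nonlinear $g_\theta$, the same data would instead be fed to a gradient-based optimizer over both $u$ and $\theta$.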
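An interesting application of this method is the estimation of the distribution of the images generated by $G$ by using Equation 6. With the knowledge of $g_\theta$ we can also choose how to sample images. For instance, let us say that we want to have $t \sim \phi$, with $\phi$ an arbitrary distribution; we can simply transform $z \sim \mathcal{N}(0, \mathbf{I}_d)$ as follows:
$$z \leftarrow z - \langle z, u \rangle u + \left(g_\theta^{-1} \circ h_\phi\right)(\langle z, u \rangle)\, u \tag{10}$$
with $h_\phi: \mathbb{R} \to \mathbb{R}$ and $X \sim \mathcal{N}(0, 1)$ such that:
$$P\left(h_\phi(X) \le x\right) = \int_{-\infty}^{x} \phi(t)\, dt \tag{11}$$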
These results are interesting to bring control not only on a single output of a generative model but also on the distribution of its outputs. Moreover, since generative models reflect the datasets on which they have been trained, the knowledge of these distributions could be applied to the training dataset to reveal potential bias.
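The resampling idea can be sketched as follows: the coordinate of each latent code along the learned direction is replaced so that the decoded parameter follows a chosen target distribution, while the orthogonal component is left untouched. In this sketch, the encoding function $g = \tanh$ (hence `np.arctanh` as its inverse), the uniform target on $[-0.5, 0.5]$ and the helper names are all hypothetical choices made for illustration.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def h_uniform(x):
    """Map a standard normal variable to a uniform variable on [-0.5, 0.5]."""
    return 0.5 * _erf(np.asarray(x, dtype=float) / math.sqrt(2.0))

def resample_along_direction(z, u, g_inv, h):
    """Replace the coordinate of each row of z along the unit vector u by g_inv(h(<z, u>))."""
    coord = z @ u
    new_coord = g_inv(h(coord))
    return z + np.outer(new_coord - coord, u)

rng = np.random.default_rng(0)
d = 8
u = rng.standard_normal(d)
u /= np.linalg.norm(u)
z = rng.standard_normal((20_000, d))

# Hypothetical encoding model t = tanh(<z, u>), so g_inv is arctanh.
z_new = resample_along_direction(z, u, np.arctanh, h_uniform)
t_new = np.tanh(z_new @ u)
```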
3 Experiments
Datasets: We performed experiments on two datasets. The first one is dSprites (dsprites), composed of 737280 binary images containing a white shape on a dark background. Shapes can vary in position, scale and orientations making it ideal to study disentanglement. The second dataset is ILSVRC (ILSVRC15), containing natural images from one thousand different categories.
Implementation details:
All our experiments have been implemented with TensorFlow 2.0 (tensorflow2015whitepaper) and the corresponding code is available on GitHub. We used a BigGAN model (largescalegan) whose weights are taken from TensorFlow Hub, allowing easy reproduction of our results. The BigGAN model takes two vectors as inputs: a latent vector and a one-hot vector to condition the model to generate images from one category. The latent vector is then split into six parts which are the inputs at different scale levels in the generator. The first part is injected at the bottom layer while the next parts are used to modify the style of the generated image thanks to Conditional Batch Normalization layers (CondBatchNorm). We also trained several $\beta$-VAEs (beta_VAE) to study the importance of disentanglement in the process of controlling generation. The exact VAE architecture used is given in Appendix B. The models were trained on dSprites (dsprites) with an Adam optimizer during steps with a batch size of 128 images and a learning rate of .
3.1 Quantitative evaluation method
Evaluating quantitatively the effectiveness of our method on complex datasets is intrinsically difficult as it is not always trivial to measure a factor of variation directly. We focused our analysis on two factors of variation: position and scale. On simple datasets such as dSprites, the position of the object can be estimated effectively by computing the barycenter of white pixels. However, for natural images sampled with the BigGAN model, we first have to use saliency detection on the generated image to produce a binary image from which we can extract the barycenter. For saliency detection, we used the model provided by Saliency, which is implemented in the PyTorch framework (pytorch). The scale is evaluated by the proportion of salient pixels. The evaluation procedure is:
Get the direction which should describe the chosen factor of variation with our method.

Sample latent codes from a standard normal distribution.

Generate images with latent code $z - \langle z, u \rangle u + t u$ with $t \in [-T, T]$.

Estimate the real value of the factor of variation for all the generated images.
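The last step of this procedure, for position and scale, amounts to simple statistics on a binary mask. The sketch below is a minimal illustration; the function name `barycenter_and_scale` and the normalization of coordinates to $[0, 1]$ are assumptions, and the mask is taken as given (white pixels for dSprites, or the output of a saliency model for natural images).

```python
import numpy as np

def barycenter_and_scale(mask):
    """Return the (x, y) barycenter in [0, 1] image coordinates and the fraction of foreground pixels."""
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        raise ValueError("mask contains no foreground pixel")
    rows, cols = np.nonzero(mask)
    height, width = mask.shape
    # Pixel centres sit at index + 0.5, so a full-image mask has its barycenter at (0.5, 0.5).
    x = (cols.mean() + 0.5) / width
    y = (rows.mean() + 0.5) / height
    return float(x), float(y), float(mask.mean())
```

For example, an 8x8 mask whose foreground covers rows 2 to 3 and columns 4 to 7 has its barycenter at $(0.75, 0.375)$ and a scale of $0.125$.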
steerability proposed an alternative method for quantitative evaluation that relies on an object detector. Similarly to ours, it allows an evaluation for $x$ and $y$ shift as well as scale, but it is restricted to image categories that can be recognized by a detector trained on some categories of ILSVRC. The proposed approach is thus more generic.
3.2 Results on BigGAN
We performed a quantitative analysis on ten chosen categories of objects of ILSVRC, avoiding categories that are not actual objects such as "beach" or "cliff". Results are presented in Figure 2 (top). We observe that for the chosen categories of ILSVRC, we can control the position and scale of the object relatively precisely by moving along directions of the latent space found by our method. However, one can still wonder whether the directions found are independent of the category of interest. To answer this question, we merged all the datasets of trajectories into one and learned a common direction on the resulting dataset. Results for the ten test categories are shown in Figure 2 (bottom). This figure shows that the directions which correspond to some factors of variation are indeed shared between all the categories. Qualitative results are also presented in Figure 3 for illustrative purposes.
We also checked which parts of the latent code are used to encode position and scale. Indeed, BigGAN uses a hierarchical latent code, which means that the latent code is split into six parts which are injected at different levels of the generator. We wanted to see in which part of the latent code these directions are encoded. The squared norm of each part of the latent direction is reported in Figure 4 for horizontal position, vertical position and scale. This figure shows that the directions corresponding to spatial factors of variation are mainly encoded in the first part of the latent code. However, for the $y$ position, the contribution of level 5 is higher than for the $x$ position and the scale. We suspect that this is due to correlations between the vertical position of the object in the image and its background, which we introduced by transforming the objects: because of the horizon, the background is not invariant under vertical translation.
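The per-level analysis reduces to splitting the direction vector into consecutive chunks and computing the squared norm of each. The sketch below assumes equal-sized chunks and a latent dimension divisible by the number of levels; the function name `squared_norm_per_level` is hypothetical.

```python
import numpy as np

def squared_norm_per_level(direction, n_levels=6):
    """Split a latent direction into equal consecutive chunks and return each chunk's squared norm."""
    # np.split raises a ValueError if the length is not divisible by n_levels.
    chunks = np.split(np.asarray(direction, dtype=float), n_levels)
    return np.array([float(chunk @ chunk) for chunk in chunks])
```

For a unit-norm direction, the returned values sum to one and can be read directly as the share of the direction carried by each level.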
3.3 The importance of disentangled representations
To test the effect of disentanglement on the performance of our method, we trained several $\beta$-VAEs (beta_VAE) on dSprites (dsprites), with different $\beta$ values. Indeed, $\beta$-VAEs are known for having more disentangled latent spaces as the regularization parameter $\beta$ increases. Results can be seen in Figure 5. The figure shows that it is possible to control the position of the object in the image by moving in the latent space along the direction found with our method. As expected, the effectiveness of the method depends on the degree of disentanglement of the latent space, since the results are better with a larger $\beta$. Indeed, we can see in Figure 5 that as $\beta$ increases, the standard deviation decreases (red curve), allowing a more precise control of the position of the generated images. This observation further motivates the interest of disentangled representations for control over the generative process.
4 Related works
Our work aims at finding interpretable directions in the latent space of generative models to control their generative process. We distinguish two families of generative models: GAN-like models, which do not provide an explicit way to get the latent representation of an image, and autoencoders, which provide an encoder to get the latent representation of images. From an architectural point of view, conditional GANs (conditionalgan) allow the user to choose the category of a generated object or some chosen properties of the generated image, but this approach requires a labeled dataset and uses a model which is explicitly designed to allow this control. Similarly, regarding VAEs, engel2018latent identified that they suffer from a tradeoff between reconstruction accuracy and sample plausibility and proposed to identify regions of the latent space that correspond to plausible samples to improve reconstruction accuracy. They also use conditional reconstruction to control the generative process. In comparison to these approaches, our method does not directly require labels. With InfoGAN, infogan shows that adding a code to the input of the GAN generator and optimizing with an appropriate regularization term leads to a disentangled latent space and makes it possible to find meaningful directions a posteriori. In contrast, we show that it is possible to find such directions in several generative models, without changing the learning process (our approach could even be applied to InfoGAN) and with a priori knowledge of the factor of variation sought. More recently, GanDissection analyzes the activations of the network's neurons to determine those that result in the presence of an object in the generated image, and thus allows one to control such a presence. In contrast, our work focuses on the latent space and not on the intermediate activations inside the generator.
One of our contributions, and a part of our global method, is a procedure to find the latent representation of an image when an encoder is not available. Several previous works have studied how to invert the generator of a GAN to find the latent code of an image. InvertGan showed on simple datasets (MNIST (Lecun98gradientbasedlearning) and Omniglot (Omniglot)) that this inversion process can be achieved by optimizing the latent code to minimize the reconstruction error between the generated image and the target image. LatentRecovery introduced tricks to improve the results on a more challenging dataset (CelebA (CelebA)). However, we observed that these methods fail when applied to a more complex dataset (ILSVRC (ILSVRC15)). The reconstruction loss introduced in Section 2.1.1 is adapted to this particular problem and improves the quality of reconstructions significantly. We also theoretically justify the difficulty of inverting a generative model, compared to other optimization problems. In the context of vector space arithmetic in a latent space, white16sampling_generative argues that replacing a linear interpolation by a spherical one also reduces the blurriness. This work also proposes an algorithmic data augmentation, named "synthetic attribute", to generate images with less noticeable blur with a VAE. In contrast, we act directly on the loss.
The closest works were released on arXiv very recently (Ganalyze; steerability), indicating that finding interpretable directions in the latent space of generative models to control their output is of high interest for the community. In these papers, the authors describe a method to find interpretable directions in the latent space of the BigGAN model (largescalegan). While their method exhibits similarities with ours (use of transformations, linear trajectories in the latent space), it also differs on several points. From a technical point of view, our training procedure differs in the sense that we first generate a dataset of interesting trajectories and then train our model, while they train their model directly. Our evaluation procedure is also more general, as we use a saliency model instead of a MobileNetSSD v1 (ssd) trained on specific categories of the ILSVRC dataset, allowing us to measure performance on more categories. We provide additional insight on how autoencoders can also be controlled with the method, on the impact of disentangled representations on the control and on the structure of the latent space of BigGAN. Moreover, we also propose an alternative reconstruction error to invert generators. However, the main difference we identify between the two works is the model of the latent space used. Our model allows a more precise control over the generative process and can be adapted to more cases.
5 Conclusions
Generative models are increasingly powerful but suffer from little control over the generative process and a lack of interpretability in their latent representations. In this context, we propose a method to extract meaningful directions in the latent space of such models and use them to control precisely some properties of the generated images. We show that a linear subspace of the latent space of BigGAN can be interpreted in terms of intuitive factors of variation (namely translation and scale). It is an important step toward the understanding of the representations learned by generative models.
References
Appendix A Penalty on the amplitude of frequencies due to MSE
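In Section 2.1, we consider a target image $I \in \mathcal{I}$ and a generated image $\hat{I} = G(\hat{z})$ to be determined according to a reconstruction loss $\mathcal{L}$ (Equation 1). Let us note $\mathcal{F}\{\cdot\}$ the Fourier transform. If $\mathcal{L}$ is the usual MSE, from the Plancherel theorem, we have $\|\hat{I} - I\|^2 = \|\mathcal{F}\{\hat{I}\} - \mathcal{F}\{I\}\|^2$. Let us consider a particular frequency $\omega$ in the Fourier space and compute its contribution to the loss. The Fourier transform of $I$ (resp. $\hat{I}$) having a magnitude $r$ (resp. $\hat{r}$) and a phase $\theta$ (resp. $\hat{\theta}$) at $\omega$, we have:
$$\left| \mathcal{F}\{\hat{I}\}(\omega) - \mathcal{F}\{I\}(\omega) \right|^2 = \left| \hat{r} e^{i\hat{\theta}} - r e^{i\theta} \right|^2 = \hat{r}^2 + r^2 - 2\, r\, \hat{r} \cos(\hat{\theta} - \theta) \tag{12}$$
If we model the inability of the generator to model every high-frequency pattern as an uncertainty on the phase of the high frequencies of the generated image, i.e. by posing $\hat{\theta} \sim \mathcal{U}([0, 2\pi])$, the expected value of the high-frequency contributions to the loss is equal to:
$$\mathbb{E}_{\hat{\theta}}\left[ \left| \mathcal{F}\{\hat{I}\}(\omega) - \mathcal{F}\{I\}(\omega) \right|^2 \right] = \hat{r}^2 + r^2 - 2\, r\, \hat{r}\; \mathbb{E}_{\hat{\theta}}\left[\cos(\hat{\theta} - \theta)\right] = \hat{r}^2 + r^2 \tag{13}$$
The term $r^2$ is a constant w.r.t. the optimization of $\hat{I}$ and can thus be ignored. The contribution to the total loss thus directly depends on $\hat{r}^2$. While minimizing $\mathcal{L}$, the optimization process tends to favor images with smaller magnitudes in the high frequencies, that is to say smoother images, with fewer high frequencies.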
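The expectation over a uniformly distributed phase can be checked with a short Monte Carlo simulation. The sketch below is purely illustrative: the function name, the sample size and the fixed target phase are assumptions, and the result should approach the sum of the squared magnitudes of the target and generated components.

```python
import numpy as np

def expected_frequency_error(r, r_hat, n_samples=200_000, seed=0):
    """Monte Carlo estimate of E|r e^{i theta} - r_hat e^{i theta_hat}|^2 with theta_hat uniform on [0, 2 pi)."""
    rng = np.random.default_rng(seed)
    theta = 0.7  # arbitrary fixed target phase; the expectation does not depend on it
    theta_hat = rng.uniform(0.0, 2.0 * np.pi, n_samples)
    err = np.abs(r * np.exp(1j * theta) - r_hat * np.exp(1j * theta_hat)) ** 2
    return float(err.mean())
```

With a target magnitude of 2 and a generated magnitude of 1, the estimate is close to $2^2 + 1^2 = 5$, and it drops to $4$ when the generated magnitude is zero, which is the behavior that pushes the MSE toward blurry reconstructions.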
Appendix B VAE architecture
The VAE framework was introduced by beta_VAE to discover interpretable factorized latent representations for images without supervision. For our experiments, we designed a simple convolutional VAE architecture to generate images of size 64x64. The decoder network mirrors the encoder, with transposed convolutions.


Appendix C Qualitative and quantitative experiments with our reconstruction error


On Fig. 6 we show qualitative reconstruction results with our method (Eq. 3) for several values of $\sigma$. On this representative example, we observe quite good results with and . Higher values also penalize low frequencies too weakly, which leads to a less accurate reconstruction.
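For reference, a loss that down-weights high frequencies with a Gaussian kernel can be sketched in NumPy as follows. This is a minimal illustration under assumptions, not the TensorFlow code used for the experiments: images are single-channel arrays, the blur is applied with periodic boundary conditions through the FFT, and the names `gaussian_transfer` and `filtered_loss` are hypothetical.

```python
import numpy as np

def gaussian_transfer(shape, sigma):
    """Fourier transform of a periodic, unit-sum Gaussian kernel with standard deviation `sigma` (in pixels)."""
    h, w = shape
    # Signed pixel offsets with wrap-around, so the kernel is centred on index (0, 0).
    dy = np.fft.fftfreq(h) * h
    dx = np.fft.fftfreq(w) * w
    kernel = np.exp(-(dy[:, None] ** 2 + dx[None, :] ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    return np.fft.fft2(kernel)

def filtered_loss(img_a, img_b, sigma):
    """Mean squared difference between two images after Gaussian blurring, computed in the Fourier domain."""
    diff_hat = np.fft.fft2(img_a - img_b)
    weighted = diff_hat * gaussian_transfer(img_a.shape, sigma)
    # With NumPy's unnormalized FFT, dividing by size**2 gives the spatial mean of the squared blurred difference.
    return float(np.sum(np.abs(weighted) ** 2) / img_a.size ** 2)
```

A constant offset and a checkerboard pattern of the same amplitude have the same MSE, but `filtered_loss` keeps the penalty on the former and almost removes it on the latter.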
We also illustrate on Fig. 7 a comparison of our approach to two others, namely the classical Mean Squared Error (MSE) and the structural dissimilarity (DSSIM) proposed by SSIM. Results are presented both with an unconstrained latent code during optimization (Eq. 1) and with the proposed constrained approach (Eq. 2). This example shows the accuracy of the reconstruction obtained with our approach, as well as the fact that the restriction of $z$ to a ball of radius $\sqrt{d}$ avoids the presence of artifacts.
We also performed a quantitative evaluation of the performance of our approach. We randomly selected one image for each of the 1000 categories of the ILSVRC dataset and reconstructed it with our method with a budget of 3000 iterations. We then computed the Learned Perceptual Image Patch Similarity (LPIPS), proposed by lpips, between the final reconstruction and the target image. We used the official implementation of the LPIPS paper with default parameters. Results are reported in Table 2. They suggest that images reconstructed using our reconstruction error are perceptually closer to the target image than those obtained with MSE or DSSIM. The higher standard deviation of the LPIPS for MSE reconstructions suggests that some images are degraded perceptually. This can be the case for textured images in particular, for the reasons explained in Appendix A.
reconstruction error | mean LPIPS | std LPIPS
MSE                  | 0.57       | 0.14
DSSIM                | 0.58       | 0.12
Our ()               | 0.52       | 0.12
Appendix D On the difficulty of optimization on the natural image manifold.
The curvature of the natural image manifold makes the optimization problem of Equation 2 difficult to solve. This is especially true for factors of variation which correspond to curved walks in pixel space (for example translation or rotation, as opposed to brightness or contrast changes, which are linear).
To illustrate this fact, we show that the trajectory described by an image undergoing common transformations is curved in pixel space. We consider three types of transformations, namely translation, rotation and scaling, and get images from the dSprites (dsprites) dataset which correspond to the progressive transformation (interpolation) of an image. To visualize, we compute the PCA of the resulting trajectories and plot the trajectories on the two main axes of the PCA. The result of this experiment can be seen in Figure 8.
In this figure, we can see that for large translations, the direction of the shortest path between two images in pixel space is near orthogonal to the manifold. The same problem occurs for rotation and, to a lesser extent, for scale. However, this problem does not exist for brightness, for example, as its change is a linear transformation in pixel space. This is problematic during optimization of the latent code because the gradient of the reconstruction loss with respect to the generated image is tangent to this direction. Thus, in the case of near orthogonality, the gradient of the error with respect to the latent code is small.
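Indeed, let us consider an ideal case where $G$ is a bijection between $\mathcal{Z}$ and the manifold of natural images. Let $z \in \mathcal{Z}$; a basis of vectors tangent to the manifold at point $G(z)$ is given by $\left(\frac{\partial G}{\partial z_i}(z)\right)_{1 \le i \le d}$.
If $\nabla_{G(z)} \mathcal{L}(I, G(z))$ is near orthogonal to the manifold then:
$$\forall i \in \{1, \ldots, d\}, \quad \left\langle \nabla_{G(z)} \mathcal{L}(I, G(z)),\; \frac{\partial G}{\partial z_i}(z) \right\rangle \approx 0 \tag{14}$$
Thus, by the chain rule,
$$\nabla_z \mathcal{L}(I, G(z)) = \left( \left\langle \nabla_{G(z)} \mathcal{L}(I, G(z)),\; \frac{\partial G}{\partial z_i}(z) \right\rangle \right)_{1 \le i \le d} \approx 0 \tag{15}$$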
It shows that when the direction of descent in pixel space is near orthogonal to the manifold described by the generative model, optimization gets slowed down and can stop if the gradient of the loss with respect to the generated image is orthogonal to the manifold.
For example, let us assume we have an ideal GAN which generates a small white circle on a black background, with a latent space of dimension 2 that encodes the position of the circle. Let us consider a generated image with the circle on the left of the image, which we want to move to the right. Obviously, we thus have $\nabla_z \mathcal{L} = 0$ if the intersection of the two circles is empty (see Figure 8), since a small translation of the object does not change the reconstruction error.
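This plateau is easy to reproduce numerically. The sketch below renders a white disk on a black background and evaluates the MSE against a target disk; the image size, the radius and the function names are arbitrary assumptions, and only the horizontal position is varied.

```python
import numpy as np

def disk_image(center_x, size=64, radius=4):
    """Binary image of a white disk on a black background, centred vertically at row size // 2."""
    rows, cols = np.mgrid[0:size, 0:size]
    inside = (cols - center_x) ** 2 + (rows - size // 2) ** 2 <= radius ** 2
    return inside.astype(float)

def reconstruction_error(center_x, target_x):
    """Pixelwise MSE between a disk at center_x and a disk at target_x."""
    return float(np.mean((disk_image(center_x) - disk_image(target_x)) ** 2))
```

With a target at `target_x = 50`, the error is identical for `center_x = 10`, `11` or `20` because the disks do not overlap, so there is no signal indicating in which direction to move; the error only starts to decrease once the disks intersect, for example at `center_x = 48`.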
Appendix E Additional qualitative examples
We show qualitative examples of images generated with the BigGAN model for position, scale and brightness. The latent codes of the images are sampled in the following way: with and $u$ the learned direction. We have chosen the categories to produce interesting results: for position and scale, the categories are objects; for brightness, the categories are likely to be seen in a bright or dark environment. Notice that for some of the chosen categories, we failed to control the brightness of the image. This is likely due to the absence of dark images for these categories in the training data. For position and scale, the direction is learned on the ten categories presented here, while for brightness only the five top categories are used.
Appendix F Qualitative comparison between our optimization method and the naive method.
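[Figure: reconstructions at successive optimization steps. First row of panels labeled 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100; second row labeled 0, 500, 850, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500.]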