High-quality, diverse, and photorealistic images can now be generated by unconditional GANs (e.g., StyleGAN). However, limited options exist to control the generation process using (semantic) attributes, while still preserving the quality of the output. Further, due to the entangled nature of the GAN latent space, performing edits along one attribute can easily result in unwanted changes along other attributes. In this paper, in the context of conditional exploration of entangled latent spaces, we investigate the two sub-problems of attribute-conditioned sampling and attribute-controlled editing. We present StyleFlow as a simple, effective, and robust solution to both the sub-problems by formulating conditional exploration as an instance of conditional continuous normalizing flows in the GAN latent space conditioned by attribute features. We evaluate our method using the face and the car latent space of StyleGAN, and demonstrate fine-grained disentangled edits along various attributes on both real photographs and StyleGAN generated images. For example, for faces, we vary camera pose, illumination variation, expression, facial hair, gender, and age. Finally, via extensive qualitative and quantitative comparisons, we demonstrate the superiority of StyleFlow to other concurrent works.
A longstanding goal of Computer Graphics has been to generate high-quality realistic images that can be controlled using user-specified attributes. One broad philosophy has been to create detailed 3D models, decorate them with custom materials and texture properties, and render them using realistic camera and illumination models. Such an approach, once realized, provides users with significant control over a range of attributes including object properties, camera position, and illumination. However, it is still hard to achieve photorealism over a range of attribute specifications.
Alternately, in a recent development, generative adversarial networks (GANs) have opened up an entirely different image generation paradigm. Tremendous interest in this area has resulted in rapid improvements both in speed and accuracy of the generated results. For example, StyleGAN (Karras et al., 2019b), one of the most celebrated GAN frameworks, can produce high resolution images with unmatched photorealism. However, there exist limited options for the user to control the generation process with adjustable attribute specifications. For example, starting from a real face photograph, how can we edit the image to change the (camera) pose, illumination, or the person’s expression?
One successful line of research that supports some of the above-mentioned edit controls is conditional generation using GANs. In such a workflow, attributes have to be specified directly at (GAN) training time. While the resultant conditional GANs (Choi et al., 2019; Xiao et al., 2018; Park et al., 2019) provide a level of (semantic) attribute control, the resultant image qualities can be blurry and fail to reach the quality produced by uncontrolled GANs like StyleGAN. Further, other attributes, which were not specified at training time, can change across the generated images, and hence result in loss of (object) identity due to edits.
We take a different approach. We treat the attribute-based editing problem as a conditional exploration problem in an unsupervised GAN, rather than conditional generation requiring attribute-based retraining. We explore the problem using StyleGAN, as one of the leading uncontrolled GAN setups, and treat attribute-based user manipulations as finding corresponding non-linear paths in the StyleGAN’s latent space. Specifically, we explore two problems: (i) attribute-conditioned sampling, where the goal is to sample a diverse set of images all meeting user-specified attribute(s); and (ii) attribute-controlled editing, where the user wants to edit, possibly sequentially, a particular image with target attribute specifications. The paths inferred by StyleFlow are conditioned on the input image, and hence can be adapted to the uniqueness of individual faces.
Technically, we solve the problem by proposing a novel normalizing flow based technique to conditionally sample from the GAN latent space. First, assuming access to an attribute evaluation function (e.g., an attribute classification network), we generate sample pairs linking StyleGAN latent variables with attributes of the corresponding images. In our implementation, we consider a range of attributes including camera, illumination, expression, gender, and age for human faces; and camera, type, and color for cars. We then enable adaptive latent space vector manipulation by casting the conditional sampling problem in terms of conditional normalizing flows using the attributes for conditioning. Note that, unlike in conditional GANs, this does not require attribute information during GAN training. This results in a simple yet robust attribute-based image editing framework.
We evaluate our method mainly in the context of human faces (both generated as well as real images projected into the relevant latent space) and present a range of high-quality identity-preserving edits at an unmatched quality. As a further proof of robustness, we demonstrate sequential edits, see Figure 1 and Figure 9, to the images without forcing the latent vectors out of the distribution, as guaranteed by our formulation of the problem with normalizing flows. We compare our results with concurrently proposed manipulation techniques, and demonstrate superior identity preservation, both quantitatively and qualitatively. In summary, we enable conditional exploration of latent spaces of unconditional GANs using conditional normalizing flows based on (semantic) attributes.
Project code and user interface (see supplementary material) will be released for research use.
Generative Adversarial Networks (GANs) were introduced by Goodfellow et al. (2014) and sparked a huge amount of follow-up work. One direction of research has been the improvement of the GAN architecture, loss functions, and training dynamics. As the current state of the art, we consider a sequence of architecture versions by Karras and his co-authors: ProgressiveGAN (Karras et al., 2017), StyleGAN (Karras et al., 2019b), and StyleGAN2 (Karras et al., 2019a). These architectures are especially strong in human face synthesis. Another strong system is BigGAN (Brock et al., 2018), which produces excellent results on ImageNet (Deng et al., 2009). We build our work on StyleGAN2 as it is easier to work with. A detailed review of the history of GAN architectures, or a discussion on loss functions, etc. is beyond the scope of the paper. We also note that in addition to GANs, there are other generative models, such as Variational Autoencoders (VAEs) (Kingma and Welling, 2013) or pixelCNN (van den Oord et al., 2016). While these methods have some advantages, GANs currently produce the highest quality output for the applications we consider by a large margin.
A significant invention for image manipulations are conditional GANs (CGANs). To add conditional information as input, CGANs (Mirza and Osindero, 2014) learn a mapping from an observed input and a randomly sampled vector to an output image. One important class of CGANs uses images as conditioning information, such as pix2pix
(Isola et al., 2017), BicycleGAN (Zhu et al., 2017b), pix2pixHD (Wang et al., 2018), SPADE (Park et al., 2019), MaskGAN (Lee et al., 2019) and SEAN (Zhu et al., 2019). Other notable works (Nie et al., 2020; Siarohin et al., 2020; Zakharov et al., 2019) produce high quality image animations and manipulations. CGANs can be trained even with unpaired training data using cycle-consistency loss (Zhu et al., 2017a; Yi et al., 2017; Kim et al., 2017). They can be used as a building block for image editing, for example, by using a generator G to translate a line drawing or semantic label map to a realistic-looking output image. CGANs have given rise to many application specific modifications and refinements.
CGANs are an excellent tool for semantic image manipulations. In the context of faces, StarGAN (Choi et al., 2018, 2019) proposes a GAN architecture that considers face attributes such as hair color, gender, and age. Another great idea is to use GANs conditioned on sketches and color information to fill in regions in a face image. This strategy was used by FaceShop (Portenier et al., 2018) and SC-FEGAN (Jo and Park, 2019). One interesting aspect of these papers is that they use masks to restrict the generation of content to a predefined region. Therefore, some of the components of these systems borrow from the inpainting literature. We just mention DeepFillv2 (Yu et al., 2018) as a representative state-of-the-art technique. An early paper in computer graphics showed how conditioning on sketches can produce good results for terrain modeling (Guérin et al., 2017). Specialized image manipulation techniques, such as makeup transfer PSGAN (Jiang et al., 2019) or hair editing (Tan et al., 2020) are also very useful, as faces are an important class of images. A very challenging style transfer technique is the transformation of input photographs to obtain caricatures (Cao et al., 2019). This problem is quite difficult, because the input and output are geometrically deformed. Overall, these methods are not directly comparable to our work because the input and problem statements are slightly different.
A competing approach to conditional GANs is the idea to manipulate latent codes of a pretrained GAN. Radford et al. (2016) observed that interesting semantic editing operations can be achieved by first computing a difference vector between two latent codes (e.g. a latent code for a person with beard and a latent code for a person without beard) and then adding this difference vector to latent codes of other people (e.g. to obtain an editing operation that adds a beard). Our technique falls into this category of methods that do not design a separate architecture, but manipulate latent codes of a pretrained GAN. This approach became very popular recently and all noteworthy competing papers are only published on arXiv at the point of submission. We would like to note that these papers were developed independently of our work. Nevertheless, we believe it will be useful for the reader to judge our work in competition with these recent papers (Nitzan et al., 2020; Tewari et al., 2020a; Härkönen et al., 2020), because they provide better results than other work. A great idea is proposed by the StyleRig (Tewari et al., 2020a) paper, which aims to transfer face rigging information from an existing model as a method to control face manipulations in the StyleGAN latent space. While the detailed control of the face ultimately did not work, they have very nice results for the transfer of overall pose (rotation) and illumination. Based on earlier work, some authors worked on the hypothesis that the StyleGAN latent space is actually linear and they propose linear manipulations in that space. Two noteworthy efforts are InterFaceGAN (Shen et al., 2019) and GANSpace (Härkönen et al., 2020). The former tries to find the latent space vectors that correspond to meaningful edits. The latter takes a data driven approach and uses PCA to learn the most important directions. Upon analyzing these directions the authors discover that the directions often correspond to meaningful semantic edits. Our results confirm that the assumption of a linear latent space is a useful simplification that produces good results. However, we are still able to produce significantly better disentangled results with a non-linear model of the latent space conditioned on the input image.
We would also like to mention techniques that try to embed images into the latent space of a GAN. Generally, there are three main techniques. The first technique is to build an encoder network that maps an image into the latent space. The second technique is to use an optimization algorithm to iteratively improve a latent code so that it produces the output image (Abdal et al., 2019; Karras et al., 2019a; Abdal et al., 2020). Finally, there is the idea to combine the two techniques and first use the encoder network to obtain an approximate embedding and then refine it with an optimization algorithm (Zhu et al., 2016, 2020). We will use the optimization based technique of Karras et al. (Karras et al., 2019a). In addition, embedding can itself be used for GAN-supported image modifications. We will compare to one recent approach in our work (Abdal et al., 2019).
Neural rendering refers to the idea to generate images from a scene description using a neural network. We refer the reader to a recent state of the art report (Tewari et al., 2020b) that summarizes recent techniques. Current methods tackle specific sub-problems like novel view synthesis (Hedman et al., 2018; Thies et al., 2020; Sitzmann et al., 2019), relighting under novel lighting conditions (Xu et al., 2018; Guo et al., 2019), animating faces (Kim et al., 2018; Thies et al., 2019; Fried et al., 2019), and animating bodies (Aberman et al., 2019; Shysheya et al., 2019; Martin-Brualla et al., 2018) in novel poses. While these techniques share some similar goals in terms of user interaction, the overall problem setting is sufficiently different from our work so that it is hard to compare to these works directly.
We support two tasks: first, attribute-conditioned sampling, wherein we want to sample high-quality realistic images with target attributes; and, second, attribute-controlled editing, wherein we want to edit given images such that the edited images have the target attributes, while best preserving the identity of the source images.
For generating realistic images, we use StyleGAN. In our implementation, we support sampling from both StyleGAN (Karras et al., 2019b) and StyleGAN2 (Karras et al., 2019a). We recall that StyleGAN maps latent space samples $z \in \mathbb{R}^{512}$ to intermediate vectors $w \in \mathbb{R}^{512}$ in $W$ space, by learning a non-linear mapping $f: Z \rightarrow W$, such that the $w$-s can then be decoded to images $I(w)$. In the uncontrolled setup, $z$ is sampled from a multi-dimensional normal distribution. The $w$ vector is used to control the normalization at 18 different locations of the StyleGAN2 generator network. The idea of the extended latent space $W+$ is to not use the same vector $w$ eighteen times, but use different vectors. Hence, a vector $w^+ \in W+$ has dimensions $18 \times 512$. We will use both of these latent spaces in our paper. For training the StyleFlow network, we use $W$. For restricting edits and editing real images, we use $W+$.
In order to measure attributes of any image, we assume access to a class-specific attribute function $\mathcal{A}$, typically realized as a classifier network, that returns a vector of attributes $a := \mathcal{A}(I)$ for any given image $I$ belonging to the class under consideration. The attributes are represented as an $l$-dimensional vector (e.g., $l = 17$ for human faces in our tests).
Solving the first task amounts to sampling $z$ from a multi-dimensional normal distribution and using a learned mapping function of the form $\phi(z, a)$, where $a$ denotes the target attributes, to produce suitable intermediate weights $w$. These weights, when decoded via StyleGAN, produce attribute-conditioned image samples of the form $I(w)$ matching the target attribute.
Specifically, using a zero-mean multi-dimensional normal distribution with identity as variance we can conditionally sample as,
$$w = \phi(z, a), \quad z \sim \mathcal{N}(0, I) \tag{1}$$
and in the process satisfy $\mathcal{A}(I(w)) = a$. In Section 5, we describe how to train and use a neural network to model such a function using forward inference on a conditional continuous normalizing flow (CNF). In other words, the normalizing flow maps the samples from an n-dimensional prior distribution, in this case a normal distribution, to a latent distribution conditioned on the target attributes.
Solving the second task is more complex. Given an image $I_0$, we first project it to the StyleGAN space to obtain $w_0$ such that $I(w_0) \approx I_0$ using (Abdal et al., 2019; Karras et al., 2019a). Recall that our goal is to edit the current image attributes $a_0 := \mathcal{A}(I_0)$ to user specified attributes $a'$, whereby the user has indicated changes to one or multiple of the original attributes, while best preserving the original image identity. We then recover latent variables $z_0$ that lead to intermediate weights $w_0$ using an inverse lookup $z_0 = \phi^{-1}(w_0, a_0)$. We realize the inverse map $\phi^{-1}$ using a reverse inference of the CNF network described earlier. Finally, we perform a forward inference, using the same CNF, to get the edited image $I(w')$ that preserves the identity of the source image as,
$$w' = \phi(z_0, a') = \phi(\phi^{-1}(w_0, a_0), a') \tag{2}$$
We first summarize normalizing flows in Section 4, and then, in Section 5, we describe how the invertible CNF can be used to compute the exact likelihood of the samples from the latent distribution of a generative model.
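The two-step edit described above (reverse inference with the source attributes, then forward inference with the target attributes) can be sketched with a toy invertible map. This is an illustration only: `forward` and `inverse` are hypothetical stand-ins for the two inference directions of the conditional flow, implemented here as a simple attribute-dependent affine map rather than a CNF, and the two-entry attribute layout is assumed.

```python
import math

def forward(z, a):
    """Toy conditional map z -> w; a[0] sets a scale, a[1] a shift (assumed layout)."""
    scale, shift = math.exp(0.3 * a[0]), a[1]
    return [scale * z_i + shift for z_i in z]

def inverse(w, a):
    """Exact inverse of `forward` for the same attributes: w -> z."""
    scale, shift = math.exp(0.3 * a[0]), a[1]
    return [(w_i - shift) / scale for w_i in w]

def edit(w, a_source, a_target):
    """Reverse inference with the source attributes, then forward with the target ones."""
    z0 = inverse(w, a_source)
    return forward(z0, a_target)

w = [0.2, -1.0, 0.7, 1.5]      # stand-in for a projected latent code
a_source = [0.0, 1.0]          # attributes measured on the source image
a_target = [1.0, 1.0]          # user changes only the first attribute

w_edit = edit(w, a_source, a_target)
w_same = edit(w, a_source, a_source)   # unchanged attributes reproduce the input
```

Because the same invertible map is used in both directions, leaving the attributes unchanged returns the original code, which is the property the identity-preserving edit relies on.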
A normalizing flow, often realized as a sequence of invertible transformations, allows one to map an unknown distribution to a known distribution (e.g., normal or uniform distribution). This inverse mapping, from an initial density to a final density and vice versa, can be simply seen as a chain of recursive changes of variables.
Let $\phi: \mathbb{R}^n \rightarrow \mathbb{R}^n$ be a bijective map such that there exists an invertible map $g$ with $g := \phi^{-1}$. Let the transformation of the random variable be from $z$ to $w$ such that $w = \phi(z)$. By the change of variable rule, the output probability density of variable $w$ can be obtained as,
$$p_w(w) = p_z(z) \left| \det \frac{\partial \phi}{\partial z} \right|^{-1} \tag{3}$$
where $z = g(w)$ or $z = \phi^{-1}(w)$. The same rule applies for a successive transformation of the variable $z$. Specifically, let the transformation be represented by $\phi_k$, i.e., $z_k = \phi_k(z_{k-1})$, and since $\phi_k^{-1}$ exists the inverse mapping is expressed as $z_{k-1} = \phi_k^{-1}(z_k)$. Therefore, applying the change of variable rule, we obtain the modified output log probability density,
$$\log p(z_K) = \log p(z_0) - \sum_{k=1}^{K} \log \left| \det \frac{\partial \phi_k}{\partial z_{k-1}} \right| \tag{4}$$
where $z_0 = z$ and $z_K = w$.
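As a small numerical check of the chained change-of-variables rule, the sketch below pushes a standard normal sample through three invertible 1-D affine maps and compares the accumulated log density with the closed-form Gaussian density of the composed map. The affine steps and their coefficients are made up for illustration and are not part of StyleFlow.

```python
import math

def std_normal_logpdf(z):
    """Log density of a 1-D standard normal."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))

# Each step is an invertible 1-D affine map z_k = s_k * z_{k-1} + t_k (made-up values).
STEPS = [(2.0, 1.0), (0.5, -3.0), (-1.5, 0.25)]

def flow_logpdf(z0):
    """Apply all steps and subtract log|det| of each Jacobian from the base log density."""
    z, log_p = z0, std_normal_logpdf(z0)
    for s, t in STEPS:
        z = s * z + t
        log_p -= math.log(abs(s))   # the Jacobian of a 1-D affine map is the scalar s
    return z, log_p

def reference_logpdf(x):
    """Closed form: the composed map is affine, so the output is Gaussian."""
    total_scale, total_shift = 1.0, 0.0
    for s, t in STEPS:
        total_scale, total_shift = s * total_scale, s * total_shift + t
    sigma = abs(total_scale)
    u = (x - total_shift) / sigma
    return -0.5 * (u * u + math.log(2.0 * math.pi)) - math.log(sigma)

z_final, log_p_final = flow_logpdf(0.3)
```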
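In the special case of planar flows, the function $\phi$ can be modeled by a neural network (Rezende and Mohamed, 2015) where the flow takes the form,
$$z_k = \phi_k(z_{k-1}) = z_{k-1} + u_k \, h(w_k^\top z_{k-1} + b_k) \tag{5}$$
where $u_k \in \mathbb{R}^n$, $w_k \in \mathbb{R}^n$, and $b_k \in \mathbb{R}$ are the learnable parameters (here $w_k$ denotes a weight vector of the flow, not a StyleGAN latent code), and $h$ is a smooth element-wise non-linear activation with derivative $h'$. The probability density obtained by sampling $z_0$ and applying a sequence of $K$ planar transforms to produce variable $z_K$ takes the form,
$$\log p(z_K) = \log p(z_0) - \sum_{k=1}^{K} \log \left| 1 + u_k^\top \psi_k(z_{k-1}) \right| \tag{6}$$
where $\psi_k(z) = h'(w_k^\top z + b_k) \, w_k$.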
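The closed-form log-determinant of a planar transform can be checked against a finite-difference Jacobian. The sketch below does this for a single 2-D planar step with a tanh activation; the parameter values are arbitrary and chosen only so that the transform is invertible.

```python
import math

def planar(z, u, weight, b):
    """One planar step: z + u * tanh(weight . z + b)."""
    act = math.tanh(sum(w_i * z_i for w_i, z_i in zip(weight, z)) + b)
    return [z_i + u_i * act for z_i, u_i in zip(z, u)]

def planar_logdet(z, u, weight, b):
    """Closed form log|1 + u . psi(z)| with psi(z) = h'(weight . z + b) * weight."""
    pre = sum(w_i * z_i for w_i, z_i in zip(weight, z)) + b
    h_prime = 1.0 - math.tanh(pre) ** 2
    psi = [h_prime * w_i for w_i in weight]
    return math.log(abs(1.0 + sum(u_i * p_i for u_i, p_i in zip(u, psi))))

def numeric_logdet_2d(z, u, weight, b, eps=1e-6):
    """log|det J| from a central-difference Jacobian (2-D inputs only)."""
    cols = []
    for j in range(2):
        z_plus, z_minus = list(z), list(z)
        z_plus[j] += eps
        z_minus[j] -= eps
        f_plus = planar(z_plus, u, weight, b)
        f_minus = planar(z_minus, u, weight, b)
        cols.append([(p - m) / (2.0 * eps) for p, m in zip(f_plus, f_minus)])
    # cols[j][i] holds d f_i / d z_j
    det = cols[0][0] * cols[1][1] - cols[1][0] * cols[0][1]
    return math.log(abs(det))

z = [0.2, -0.4]
u = [0.5, -0.3]
weight = [1.0, 2.0]
b = 0.1
analytic = planar_logdet(z, u, weight, b)
numeric = numeric_logdet_2d(z, u, weight, b)
```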
The normalizing flows can be generalized into a continuous formulation (Grathwohl et al., 2018; Chen et al., 2018) using neural ODEs (Chen et al., 2018), which adopt the adjoint sensitivity method (Pontryagin, 2018) to compute the gradients with respect to the parameters in a black-box ODE solver. In continuous normalizing flows, the transformation is expressed through a differential equation in which $z$ is the variable of a given distribution, $t$ is the time variable, and $\theta$ are the parameters of an arbitrary neural network. Specifically, the differential equation takes the form,
$$\frac{dz}{dt} = \phi(z(t), t; \theta).$$
Finally, the change in the log density can be expressed as,
$$\frac{\partial \log p(z(t))}{\partial t} = -\mathrm{Tr}\left( \frac{\partial \phi}{\partial z(t)} \right) \tag{7}$$
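A minimal sketch of the continuous formulation follows, under simplifying assumptions: the dynamics are a hypothetical linear field $dz/dt = Az$ instead of a neural network, and a fixed-step RK4 integrator replaces an adaptive solver. For linear dynamics the Jacobian trace is constant, so the log-density change over unit time should equal $-\mathrm{Tr}(A)$, and integrating backwards in time should recover the starting point.

```python
def velocity(z, t, A):
    """Hypothetical linear dynamics dz/dt = A z (t is unused in this toy case)."""
    return [sum(a_ij * z_j for a_ij, z_j in zip(row, z)) for row in A]

def jacobian_trace(z, t, A):
    """Trace of d(velocity)/dz; for linear dynamics this is simply tr(A)."""
    return sum(A[i][i] for i in range(len(A)))

def integrate(z0, A, t0=0.0, t1=1.0, steps=200):
    """Fixed-step RK4 for z(t), with a left-endpoint sum for the log-density change."""
    h = (t1 - t0) / steps
    z, t, delta_logp = list(z0), t0, 0.0
    for _ in range(steps):
        delta_logp -= h * jacobian_trace(z, t, A)
        k1 = velocity(z, t, A)
        k2 = velocity([z_i + 0.5 * h * k for z_i, k in zip(z, k1)], t + 0.5 * h, A)
        k3 = velocity([z_i + 0.5 * h * k for z_i, k in zip(z, k2)], t + 0.5 * h, A)
        k4 = velocity([z_i + h * k for z_i, k in zip(z, k3)], t + h, A)
        z = [z_i + h * (a + 2.0 * b + 2.0 * c + d) / 6.0
             for z_i, a, b, c, d in zip(z, k1, k2, k3, k4)]
        t += h
    return z, delta_logp

A = [[-0.5, 1.0], [-1.0, -0.2]]
z_start = [0.3, -0.8]
z_end, dlogp = integrate(z_start, A)                      # forward in time
z_recovered, dlogp_back = integrate(z_end, A, 1.0, 0.0)   # reverse in time
```

Running the same integrator with the time limits swapped is what makes the mapping invertible without any architectural constraint on the velocity field.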
We decided against using discrete normalizing flow (DNF) networks because it is difficult to ensure a reversible mapping. Another related problem is the limited expressiveness and versatility of such networks, due to the fixed number of invertible functions to choose from. Finally, the Jacobian determinant computation is costly, so a workaround is to constrain the network architecture, which is also undesirable (Grathwohl et al., 2018).
In our StyleFlow framework, we use CNFs for our formulation. In contrast to DNFs, the main benefit of the continuous formulation is that it is invertible by definition and the determinant is replaced by a matrix trace, which is considerably easier to compute. Hence, it allows us to choose from a wide variety of architectures. Additionally, FFJORD (Grathwohl et al., 2018) also claims that CNFs can potentially learn a less entangled internal representation compared to DNFs (see e.g. Figure 2 in FFJORD). We would also like to point out that in this work, we interpret time as a "virtual" concept related to how CNFs are evaluated. Instead of evaluating the results by sequentially passing through the network layers, as in DNFs, in CNFs the ODEs are used to evaluate the function through time. Hence, for both our conditional sampling and editing tasks, we desire to condition based on the target attributes and use CNFs to continuously evolve the image samples. The CNFs are expected to retarget the probability densities as described next.
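The matrix trace mentioned above is commonly estimated stochastically rather than computed exactly; the paper reports using Hutchinson's trace estimator. The sketch below shows the idea on an explicit matrix with Rademacher (plus or minus one) probe vectors. It is a simplified illustration: in a real CNF the matrix is the Jacobian of the velocity field and is only accessed through vector-Jacobian products, and the matrix values here are made up.

```python
import random

def matvec(A, v):
    """Dense matrix-vector product."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in A]

def hutchinson_trace(A, probes):
    """Average of eps^T A eps over the given probe vectors."""
    total = 0.0
    for eps in probes:
        A_eps = matvec(A, eps)
        total += sum(e * x for e, x in zip(eps, A_eps))
    return total / len(probes)

def rademacher_probes(dim, count, rng):
    """Probe vectors with independent entries drawn uniformly from {-1, +1}."""
    return [[rng.choice((-1.0, 1.0)) for _ in range(dim)] for _ in range(count)]

A = [[2.0, 0.5, -1.0],
     [0.3, -1.0, 0.7],
     [0.0, 0.4, 0.5]]
exact = sum(A[i][i] for i in range(3))
estimate = hutchinson_trace(A, rademacher_probes(3, 10, random.Random(0)))
```

The estimate is unbiased because the off-diagonal contributions cancel in expectation, while each diagonal entry is multiplied by a squared probe entry equal to one.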
We consider the latent vectors $w$ sampled from the $W$ space of the StyleGAN1/2. The prior distribution is represented by $p(z)$, where $z \sim \mathcal{N}(0, I)$. Our aim is to model a conditional mapping between the two domains. Moreover, it is imperative to be able to learn a semantic mapping between the domains so that editing applications are realizable. We explain our method in the following subsections.
A general workflow for the preparation of the dataset is as follows: We first sample 10k samples from the Gaussian $Z$ space of the StyleGAN1/2 (Karras et al., 2019a; Karras et al., 2019b). Then we infer the corresponding $w$ codes in the disentangled $W$ space of the models. We use the vectors $w$ of $W$ space truncated by a factor $\psi$, as suggested by StyleGAN. Otherwise, there will be outlier faces of low quality. We generate corresponding images $I(w)$ via the StyleGAN1/2 generator and hence create a mapping between the $W$ space and the image space. To have conditional control over the image features, we use a face classifier network to map the images to the attribute domain. We use this dataset for the final training of the flow network using triplets of latent code $w$, image $I(w)$, and attributes $a$. Note that the attributes do not depend on the time variable $t$. To prepare the attribute domain of the training dataset, we use the state-of-the-art Microsoft Face API (Microsoft, 2020), which we found to be robust for face attribute classification. The API provides a diverse set of attributes given a face image. The main attributes that we focus on in our work are gender, pitch, yaw, eyeglasses, age, facial hair, expression, and baldness. For the lighting attribute, we use the predictions from the DPR model (Zhou et al., 2019) that outputs a 9-dimensional vector per image measuring the first 9 spherical harmonics of the lighting. Thus, for faces, the attribute vector has $8 + 9 = 17$ dimensions.
We use a series of gate-bias modulation networks (Grathwohl et al., 2018; Yang et al., 2019), called "ConcatSquash", to model the function of the conditional continuous normalizing flows. We make this design choice as the gate part can learn the per-dimension scaling factor given an input latent code and the bias part of the network can learn to translate the code towards a particular edit direction, which is suitable for our formulation of adaptive identity aware edits. This model builds on top of FFJORD (Grathwohl et al., 2018)
as a general framework where the attributes can be 2D or 3D tensors like an image for an Image2Image translation task. Figure 5 shows the function block used in the CNF block. To include the condition information into the network, we transform the time variable $t$ with a broadcast operation to match the spatial dimensions of the attribute space. Then, we apply channel-wise concatenation of the resultant variable with the attribute variable, and finally the new variable is fed to the network as a conditional attribute variable. Note that at inference time, we use linear interpolation in the attribute domain to smoothly translate between the two edits to get the final image. For stable training, we used 4 stacked CNF functions (i.e., gate-bias functions) and 2 Moving Batch norm functions (Yang et al., 2019) (one before and one after the CNF functions) where each function outputs a vector of the same dimension (i.e. 512). We observed that models using only 2-3 CNF function blocks overfit on the data. The matrix trace is computed by 10 evaluations of Hutchinson’s trace estimator. Depending on the properties of the extended attribute tensor, we can use a convolutional or linear neural network to transform the tensors to make them the same shape as the input. We make use of linear layers in this work, but we expect extensions to this work where the flows can be conditioned on images, segmentation maps etc. Then, we perform gate-bias modulation on the input tensor. Note here we use a sigmoid non-linearity before the gate tensor operation (Grathwohl et al., 2018). The final output tensor is passed through a non-linearity before passing to the next stage of the normalizing flow.
An important insight of our work is that flow networks trained on one attribute at a time learn entangled vector fields, and hence resultant edits can produce unwanted changes along other attributes. Instead, we propose to use joint attribute learning for training the flow network. We concatenate all the attributes to a single tensor before feeding it to the network. In Section 7, we show that the joint attribute training produces an improvement in the editing quality with a more disentangled representation. We hypothesize that training on a single condition tends to over-fit the model on the training data. Further, in the absence of measures along other attribute axes, the conditional CNF remains oblivious of variations along those other attribute directions. Therefore, the flow changes multiple features at a time. Joint attribute training tends to learn stable conditional vector fields for each attribute.
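The gate-bias modulation described above can be sketched as a single layer. This is a rough illustration loosely modelled on the ConcatSquash layer of FFJORD, with assumptions labelled: the context is the time value concatenated with all attributes, the weights are random placeholders rather than trained parameters, and the dimensions are tiny (the paper uses 512-dimensional vectors).

```python
import math
import random

def linear(weights, bias, x):
    """Dense layer: weights @ x + bias."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(weights, bias)]

def concat_squash(x, t, attrs, p):
    """Gate-bias modulation: linear(x) * sigmoid(gate(context)) + bias(context)."""
    context = [t] + list(attrs)                 # time concatenated with the attributes
    hidden = linear(p["W"], p["b"], x)
    gate = [1.0 / (1.0 + math.exp(-g))
            for g in linear(p["W_gate"], p["b_gate"], context)]
    shift = linear(p["W_bias"], [0.0] * len(hidden), context)
    return [h * g + s for h, g, s in zip(hidden, gate, shift)]

def random_params(dim, ctx_dim, rng):
    """Placeholder parameters (hypothetical; a trained model would supply these)."""
    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {"W": mat(dim, dim), "b": [0.0] * dim,
            "W_gate": mat(dim, ctx_dim), "b_gate": [0.0] * dim,
            "W_bias": mat(dim, ctx_dim)}

rng = random.Random(0)
x = [0.1, -0.2, 0.3, 0.4]
attrs = [1.0, 0.0, 0.25]                         # all attributes fed jointly
params = random_params(dim=4, ctx_dim=1 + len(attrs), rng=rng)
out = concat_squash(x, t=0.5, attrs=attrs, p=params)
```

Feeding every attribute through the same context vector is what lets the gate and bias respond to all attributes at once, in line with the joint attribute training argued for above.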
The goal during the training is to maximize the likelihood of the data $w$ given a set of attributes $a$. So, the objective can be written as $\max_{\theta} \sum_{w} \log p(w \mid a, \theta)$. Here, we assume the standard Gaussian prior with $z$ as the variable. Also, let $\mathcal{N}$ represent the Gaussian probability density function. Algorithm 1 shows the training algorithm (Grathwohl et al., 2018; Chen et al., 2018) of the proposed joint conditional continuous normalizing flows. Other details are Epochs: 10; Batch size: 5; Training speed: 1.07 - 2.5 iter/sec depending on the number of CNF functions (see Table 1); GPU: Nvidia Titan XP; Parameters: 1128449; Final Log-Likelihood: -4327872; Inference time: 0.61 sec. For faster implementation and practical purposes, we also train a model with 6 stacked CNF functions which improves the inference time to 0.21 sec at the cost of a slight decrease in the quality of disentanglement. We use the adjoint method to compute the gradients and solve the ODE using the ‘dopri5’ ODE solver (Grathwohl et al., 2018). The tolerances are set to default . We use the Adam optimizer with an initial learning rate of . Other parameters of the Adam optimizer are set to default values.
A natural benefit of the training formulation of the framework is the sampling. In particular, the mapping learnt between the two domains is used to produce a vector $w$ given a vector $z$ and vice versa. Moreover, we can manipulate the vectors in the respective domains and the changes translate to the other domain semantically from the editing point of view. Please refer to the supplementary video for interaction sessions.
Once the network is trained, we are able to conditionally sample the $w$ vectors with the Gaussian prior modelled by the continuous normalizing flows. Formally, we set the attribute variable $a$ to a desired set of values, and then sample multiple $z \sim \mathcal{N}(0, I)$. These vectors are then passed through the (trained) conditional CNF network. The learned vector field translates the vectors to produce the latent vectors $w$, which are then finally fed to the StyleGAN1/2 generator. In Section 7, we show the results of sampling given a set of attributes. We notice that the quality of the samples is very high and that unedited attributes remain largely fixed. The conditional sampling is an important result of our work and validates the fact that the network is able to learn the underlying semantic representations, which is further used to perform semantic edits to the images.
Here we show the procedure to semantically edit the images using the proposed framework. Note here that the vector manipulations are adaptive and are obtained by integrating a vector field with an ODE solver. Unlike the previous methods (Abdal et al., 2019; Härkönen et al., 2020; Shen et al., 2019), the semantic edits performed on the latent vectors force the resultant vector to remain in the distribution of $W$ space ($p(w)$). This enables us to do stable sequential edits which, to the best of our knowledge, are difficult to obtain with the previous methods. We show the results of the edits in Section 7. In the following, we will discuss the procedure and components of the editing framework.
The first step of the semantic editing operation in the StyleFlow framework is the Joint Reverse Encoding. Here, we jointly encode the variables $w$ and $a$. Specifically, given a $w$, we infer the source image $I(w)$. Note that we can also start with a real image and use the projection methods (Abdal et al., 2019; Karras et al., 2019a; Zhu et al., 2020) to infer the corresponding $w$. Such procedures may render the vectors outside the original $p(w)$ distribution and hence make the editing a challenging task. But in later sections, we show that StyleFlow is a general framework which is able to produce high quality edits on real images. We pass the image through the face classifier API (Microsoft, 2020) and the lighting prediction DPR network (Zhou et al., 2019) to infer the attributes $a$. Then, we use reverse inference given a set $w$ and $a$ to infer the corresponding $z_0$.
The second step is the Conditional Forward Editing, where we fix the $z_0$ (this vector encodes a perfect projection of the given image) and, to translate the semantic manipulation to the image $I$, we change the set of desired conditions, e.g., we change the age attribute from 20 yo to 60 yo. Then, with the given vector $z_0$ and the new set of (target) attributes $a'$, we do a forward inference using the flow model. Finally, we process the resulting vector $w'$ to produce an editing vector.
This is the third step of the StyleFlow editing framework. Studying the structure of the StyleGAN1/2, we choose to apply the given vector at the different indices of the (Karras et al., 2019b; Abdal et al., 2019, 2020) () space depending on the nature of the edit, e.g., we would expect the lighting change to be in the later layers of the StyleGAN where mostly the color/style information is present. Empirically, we found the following indices of the edits to work best: Light (), Expression (), Yaw (), Pitch (), Age (), Gender (), Remove Glasses (), Add Glasses (), Baldness () and Facial hair ( and ). The final step is the inference of the image from the modified latents. In Section 7, we show the importance of this module in improving the editing quality.
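The edit-specific subset selection above amounts to overwriting only some of the 18 per-layer vectors with the edited vector. A minimal sketch follows; the layer indices used here are hypothetical placeholders, not the values found empirically in the paper, and the vector dimension is reduced from 512 to 4 to keep the example small.

```python
def apply_edit(w_plus, w_edit, layers):
    """Copy w_plus (one row per layer) and overwrite only the rows listed in `layers`."""
    chosen = set(layers)
    return [list(w_edit) if i in chosen else list(row)
            for i, row in enumerate(w_plus)]

NUM_LAYERS, DIM = 18, 4            # DIM is 512 in StyleGAN; 4 keeps the demo small
w_source = [0.0] * DIM             # stand-in for the source latent code
w_edited = [1.0] * DIM             # stand-in for the flow output after an edit
w_plus = [list(w_source) for _ in range(NUM_LAYERS)]

HYPOTHETICAL_LAYERS = range(7, 12)  # placeholder indices, not taken from the paper
w_plus_edited = apply_edit(w_plus, w_edited, HYPOTHETICAL_LAYERS)
```

Restricting an edit to a layer subset leaves the remaining layers, and hence the image content they control, untouched.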
We have two ways to edit: in the faster and approximate version, we do not reproject every time an edit is performed, and in the slower and accurate implementation all the vectors (18 $w$-s) are reprojected in one pass after an edit. As the $w$ code is manipulated every time by the $w'$ based on the edit-specific subset selection, some of these subsets overlap with others in a sequential edit and may make the edit unstable. In the fast option, we sometimes observe a sudden change in the output image. Encoding these vectors back to $Z$ space ensures that the flow network is aware of the changes made to the W/W+ space (identity-aware) of the StyleGAN1/2 and hence makes the edits stable. Note that since the previous and concurrent methods are not identity aware, this problem also adds to the reasons for failed sequential edits. So, in our work the vectors of the resultant W+ space are re-mapped to a new set of $z_0$-s using Joint Reverse Encoding (JRE), followed by Conditional Forward Editing (CFE) and edit-specific subset selection, to perform a subsequent edit.
In this section, we discuss the results produced by our StyleFlow framework and compare them both quantitatively and qualitatively with other methods.
We evaluate our results on two datasets: FFHQ (Karras et al., 2019b) and LSUN-Car (Yu et al., 2015). Flickr-Faces-HQ (FFHQ) is a high-quality image dataset of human faces (at $1024 \times 1024$ resolution) with 70,000 images. The dataset is diverse in terms of ethnicity, age, and accessories. LSUN-Car is a $512 \times 384$ resolution car dataset. The dataset is diverse in terms of car pose, color, and types. We use StyleGAN pretrained on these datasets to evaluate our results.
We consider the following three methods as concurrent work: InterfaceGAN, GANSpace, and StyleRig. In addition, we compare to a simple version of vector arithmetic Image2StyleGAN as previous work. InterfaceGAN and Image2StyleGAN are re-implemented using StyleGAN2. GANSpace and StyleFlow are naturally implemented in StyleGAN2. StyleRig uses StyleGAN1.
StyleFlow: our method.
InterfaceGAN: We retrained InterfaceGAN on the same data as StyleFlow. The images were segregated based on the attributes to create the binary data. It is difficult for competing methods to produce the same magnitude of change when a given vector is applied to different faces. For example, in the top row of Figure 10 the rotation is a complete flip; for InterfaceGAN, the learned vector is translated until it matches the extreme pose. Hence, instead of evaluating the results by treating the attributes as binary (as InterfaceGAN does, which may not represent the quality of edits), we propose three metrics in Section 7.10 that we use to evaluate the quality, consistency, and disentanglement of the edits.
Image2StyleGAN: For Image2StyleGAN, we embedded paired images of expression (IMPA-FACES3D (IMPA-FACE3D, 2012)), lighting pairs from DPR, and rotation pairs from StyleFlow outputs. The lighting edit of both methods (Image2StyleGAN and InterfaceGAN) is trained for the right-to-left lighting change.
GANSpace: We used the code provided by the authors. This is true for the layer subsets as well.
StyleRig: At present, code is not available. We will add a comparison in a future version.
Here we describe different aspects of the editing methods and how the design choices affect the results.
The main difference between our method and linear methods such as Image2StyleGAN, InterfaceGAN, and GANSpace is that the editing direction depends on the starting latent. These three competing methods compute a single editing direction for an attribute (e.g., expression or pose) and apply this same editing direction to all starting latents by scaling and adding the vector. By contrast, our method as well as StyleRig compute an edit direction that depends on the starting latent. Our results show that this design choice contributes to better disentanglement.
GANSpace and StyleFlow both use edit specific subset selection. This is an advantage over InterfaceGAN (and possibly StyleRig) and contributes to disentanglement in our results.
Our method as well as StyleRig, InterfaceGAN, and Image2StyleGAN are supervised. They need a training corpus of latent vectors / images that are labeled with attributes. By contrast, GANSpace discovers edits in an unsupervised manner. GANSpace finds a large number of edits that could be interesting and then labels them semantically using visual inspection. Our results show that the unsupervised discovery of edit directions does not lead to editing directions that are sufficiently disentangled. It sometimes leads to edits that are not identity preserving, as multiple changes are entangled together. However, GANSpace is able to discover interesting edits for which labels might not be available. One example is the expression where the mouth forms an O-shape.
Our method can create edits by specifying a continuous parameter for a given attribute. InterfaceGAN and GANSpace just give an edit direction and the user has to manually tune the scaling parameter to control the strength of an edit. That makes the edits not directly comparable and we have to manually tune this scaling parameter of these other methods to make it approximately match the strength of our edit.
In addition to making an edit direction depend on the starting latent, the edit paths in latent space of our method and StyleRig are non-linear. This is in contrast to InterfaceGAN and GANSpace, which use linear trajectories. Our evaluation will show that this difference is not the main factor in explaining the edit quality of our method. It is also possible to set up our method by approximating the nonlinear edit path with a linear one and still achieve reasonable results.
In our results we show that the quality of edits improves if we train StyleFlow using all attributes at the same time. This helps disentanglement and provides higher quality edits than training a network for each attribute separately. This is in contrast to StyleRig. In the StyleRig paper the results are shown by training for each edit separately. In additional materials, the authors demonstrate that it is also possible to train for all edits jointly with some loss in quality. We believe that our behavior is more intuitive.
Previous work often focuses on applying a single edit to a starting image. In our experience, the quality of an editing framework becomes much more obvious when applying sequential edits. Small errors in disentanglement accumulate, and the initial face loses its identity much more easily when subsequent edits are applied. While we show results for individual edits, we focus on sequential edits in the paper to provide a more challenging evaluation of edit quality.
The quality of edits depends on the availability of good attribute labels and a good training dataset for StyleGAN. In general, the face dataset has the highest quality attributes and the highest quality latent space. Also StyleRig can only work on faces. We therefore focus mainly on faces in our evaluation. However, there is nothing specific to faces in StyleFlow and our work can be applied to any dataset. For faces, InterfaceGAN and StyleFlow can produce edits for the same set of attributes. However, StyleRig does not include several common attributes such as hair length, gender, eyeglasses, age. We are therefore also restricted in our evaluation to a subset of the attributes when comparing to StyleRig.
In principle, there is no difference between editing real images and editing synthetic images. When editing real images, it becomes important to have a good projection method into latent space. We use a reimplementation of Image2StyleGAN for StyleGAN2. This method seems to be sufficient to get good edits, but multiple researchers are working on improved embedding algorithms. Analyzing the compound effect of embedding and editing is beyond the scope of this paper. We would just like to note that it is essential to embed into the W+ latent space and not into the W latent space as, for example, proposed by the StyleGAN2 paper. All methods can work with real as well as synthetic images. However, StyleRig seems to have the most problems working with real images, and all results shown in their paper are on synthetic images only. The additional results show two results on real images, but the quality is inconclusive.
All methods can work with StyleGAN1 as well as StyleGAN2. That creates a compatibility issue. A synthetic face in the StyleGAN1 latent space is generally not transferable into the StyleGAN2 latent space. Therefore, StyleRig and StyleFlow are not directly comparable on synthetic faces. Transferring from one latent space into another would require projection, which would lead to a decrease in quality and would be unfair to one of the methods. We can therefore only compare edits on real faces. We hope to coordinate such a comparison in the future.
First, we evaluate the two design choices of joint attribute encoding and edit specific subset selection. As explained, we perform joint attribute encoding to ensure the face identity is preserved during the Conditional Forward Editing (CFE). Figure 6 shows the variation of the proposed approach when attributes are trained jointly versus separately. The results show that with joint encoding of the attributes, the identity of the face is preserved, along with unedited attributes such as hair style, age, and background.
Also, to show the effectiveness of the edit specific subset selection block, we show the results in Figure 7. Here, the framework without the Edit Specific Subset Selection Block is referred to as V1 and the final framework is referred to as V2. We notice that the V2 framework performs high quality edits, producing images whose skin tone, background, and clothes are comparable to those of the source image. To evaluate the variation quantitatively, we also compare them in the next subsection.
We also provide an ablation study of the architecture choice in terms of the number of CNF function blocks in Table 1. More evaluations will be provided in a future version.
We show the results of the conditional sampling of StyleGAN2 in Figure 8. As shown in the figure, the generated samples are of very high quality. In the first row, for instance, we sample females of different age groups with glasses and fixed pose. Note that during the sampling operation we resample to infer vectors in and keep a set of attributes fixed. Apart from the quality of the samples we show that the diversity of the samples is also very high.
We show sequential edits on real images projected to StyleGAN2 by re-implementing the Image2StyleGAN W+ projection algorithm in Figure 1 and Figure 15. In addition to the sequential edits, we show non-sequential edits in Figure 16. Note that we demonstrate high quality results on images with diverse pose, lighting, expression, and age attributes. For example, in Row 4 of Figure 16, the method handles the edits well even though the expression is asymmetrical. Hence, as opposed to other concurrent methods such as StyleRig, we show that our method performs better disentangled edits on real images with minimal artifacts. Figure 9 shows the results of the sequential edits on generated images using our framework. Here, we consider the sequential edits of Pose → Lighting → Expression → Gender → Age. To show that different permutations of the edits can be performed without affecting the performance, Figure 9 and Figure 15 show the results of a random sequence of edits performed on a source image. Here we consider multiple edits of gender, facial hair, pose, lighting, age, expression, eyeglasses, and baldness. Note that the quality of the edits is very high and that the order of the edits does not affect the quality of the images. Notice that we can handle extreme pose changes and can smoothly transfer the edits as the attributes change. Global features like background, clothes, and skin tone are largely preserved. Most importantly, the edits of pose change, expression, and lighting preserve the identity of the person.
We also show the editing results of our framework with StyleGAN2 trained on the LSUN-Car dataset. Figure 11 shows qualitative results of our framework. We show sequential edits including SUV/Hatchback conversion, rotation, and color change. We use a fine-tuned ResNet-152 (He et al., 2016) model trained on Stanford Cars (Krause et al., 2013) to create the attributes. To evaluate car manipulation, we used a car recognition model (spectrico (Spectrico, 2020)) with 95% classification accuracy. We report an accuracy of 80% for the Hatchback/SUV conversion and 100% for the colour. For the rotation, there is no precise model in the literature to evaluate the scores quantitatively; hence we show multiple visual examples in the supplementary video to judge the quality.
Previous relevant works that directly manipulate the latent space (Radford et al., 2016; Abdal et al., 2019; Shen et al., 2019), e.g., by adding offset vectors, are not able to achieve the same high quality because vector manipulations often move the final latent into a region outside the distribution (Karras et al., 2019a). This leads to visible artifacts in the generated images, and the face identity is affected. Owing to the design of our framework, our method differs from these models in two major ways. Firstly, attribute-guided edits amount to non-linear curves in the StyleGAN latent space. Secondly, the above curves (or even their linear approximations (see Figure 17)) should be conditioned on the current (face/car) identity. Note that this is in contrast to edit vectors being independent of the current identity as in GANSpace and InterfaceGAN. Figure 10 evaluates our claim of high quality edits. Here, we subject Image2StyleGAN (Abdal et al., 2019), InterfaceGAN (Shen et al., 2019), and our StyleFlow to extreme attribute conditions and perform sequential edits on the images. We consider three primary edits of pose, expression, and lighting. The figure shows that while Image2StyleGAN suffers and drives the image out of the distribution due to its usage of space for the edit computation, InterfaceGAN’s conditional manipulation produces relatively better image samples. However, preserving face identity still remains a major issue. In contrast, our framework handles the sequential edits, producing high quality output and preserving facial features.
Figure 12 shows a visual comparison of our method with the GANSpace method. Here we compare the transition results produced by GANSpace. Notice that in the top sequence of the figure, the transition fails and drastically changes the gender from female to male, while our results are gender preserving. Moreover, we notice that the edits computed by GANSpace do not work in all scenarios. In the lower sequence of Figure 12, we show a failure case of a lighting edit. We attribute these failure cases to the fact that the GANSpace edits, although very interesting, are still linear in nature and do not depend on the current identity of the face. Here the results are shown till 2. Note that GANSpace, being unsupervised, cannot control which attributes or combinations of attributes are discovered as PCA axes. In contrast, in StyleFlow we directly learn a nonlinear mapping between the GAN latent space and targeted attribute variations. Nevertheless, GANSpace does not need image annotations, which is an advantage of the method when working with new datasets.
Similarly, we observe several shortcomings of the StyleRig method. StyleRig has not been demonstrated to work with real images, while our method demonstrates high quality edits on real images (see Figure 1, Figure 15 and Figure 16). StyleRig can only do a very limited number of edits. Additionally, while sequential edits are very important in practice and are one of the important contributions of this work, StyleRig does not provide evidence that it supports them. Since the code is not publicly available at the time of submission, in Figure 13 we show a counterpart of Figure 9 in the StyleRig paper. We show that the edits by our method on attribute transfer are of similar or higher visual quality. Note that we transfer the attributes from the source image to the target image. We evaluate on a diverse set of attributes which the StyleRig framework does not currently support. For example, the StyleRig framework cannot process eyeglasses, facial hair, age, and gender, which are not modeled in morphable face models.
To demonstrate the compatibility of our work with the older StyleGAN1, we show the results of selected edits in Figure 14. Despite the more entangled latent space, our method is able to perform well.
In this subsection we compare to two previous methods. Both of these methods are based on the idea that edits can be encoded by vector addition and subtraction in latent space (Radford et al., 2016). We compare to edits stemming from edit vectors computed by Image2StyleGAN (Abdal et al., 2019) and InterfaceGAN (Shen et al., 2019) and refer to these methods as linear models.
First, we would like to analyze how much the edits depend on the initial latent vector. For an edit we can compute the difference vector between the final latent vector and the initial latent vector . The linear models Image2StyleGAN and InterfaceGAN make the assumption that the difference vectors are independent of the starting latent vector if the same edit is applied. Given many edits of the same type (e.g. changing a neutral expression to smiling by translating the attribute from 0 to 1), we compute their difference vectors . Then, given a set of pairs of these vectors, we compare the magnitude and the angles between the vectors. By sampling multiple such edits, the mean of the magnitudes (norm) of these difference vectors is computed to be indicating the adaptive edits. The angles between the vectors are observed to vary up to . This shows that the edits depend on the initial latent allowing the resultant vector to follow the original posterior distribution. Unlike previous linear models, which apply the same vector to the latent code, our method adaptively adjusts the manipulation of latents to produce high quality edits.
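The analysis above can be sketched in a few lines. The sketch assumes each edit is given as a pair of flattened latent vectors (initial, final) and uses made-up two-dimensional toy data, so the printed statistics are not results from the paper. The helper name `edit_statistics` is hypothetical.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple

Vector = List[float]


def norm(v: Sequence[float]) -> float:
    return math.sqrt(sum(x * x for x in v))


def angle_deg(u: Sequence[float], v: Sequence[float]) -> float:
    """Angle between two vectors in degrees."""
    cos = sum(x * y for x, y in zip(u, v)) / (norm(u) * norm(v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))


def edit_statistics(pairs: Sequence[Tuple[Vector, Vector]]) -> Tuple[float, float]:
    """Mean norm of the difference vectors and largest pairwise angle between them."""
    diffs = [[b - a for a, b in zip(w_init, w_final)] for w_init, w_final in pairs]
    mean_norm = sum(norm(d) for d in diffs) / len(diffs)
    max_angle = max(angle_deg(u, v) for u, v in combinations(diffs, 2))
    return mean_norm, max_angle


# Toy data (not from the paper): two edits of the same type on different latents.
pairs = [([0.0, 0.0], [1.0, 0.0]), ([1.0, 1.0], [1.0, 3.0])]
mean_norm, max_angle = edit_statistics(pairs)
```

For a strictly linear model, every difference vector would be a positive multiple of the same direction, so the largest pairwise angle would be zero. Non-zero angles indicate that the edit adapts to the starting latent.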
To assess the non-linearity of the edit path, we compare the interpolation in the attribute domain () to the interpolation in the latent domain (), i.e., we linearly change the variable of the attribute that is fed to the flow model versus linear interpolation of the vector to . We sample 20 points along the interpolation paths of both scenarios and then compare the latents produced by both the methods. We compute mean of the norm of these difference vectors along the path. Sampling multiple edits produced by StyleFlow, we conclude that, on average, the linear interpolation in the domain differs from the attribute domain by a factor of , validating the non-linearity of the path taken. In Figure 17, we compare the results of the non-linear path edits with the linear interpolation visually. Note here the final w’ is obtained by subjecting StyleFlow framework to extreme edits. We notice that the non-linear path is able to retain hair style, clothes and head coverings for a larger extent along the path validating the improvement in the disentanglement.
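A minimal sketch of this comparison is given below. It assumes the edit is exposed as a function `edit_path(t)` that returns the latent for an attribute value interpolated with `t` in [0, 1]. This function and the toy paths are hypothetical stand-ins for the flow model.

```python
import math
from typing import Callable, List

Vector = List[float]


def lerp(a: Vector, b: Vector, t: float) -> Vector:
    """Linear interpolation between two latents."""
    return [x + t * (y - x) for x, y in zip(a, b)]


def path_deviation(edit_path: Callable[[float], Vector], n_points: int = 20) -> float:
    """Mean distance between the attribute-driven path and the straight line
    that connects its two end points, sampled at n_points positions."""
    w_start, w_end = edit_path(0.0), edit_path(1.0)
    total = 0.0
    for k in range(n_points):
        t = k / (n_points - 1)
        total += math.dist(edit_path(t), lerp(w_start, w_end, t))
    return total / n_points


# Toy paths (not from the paper): one curved, one straight.
def curved(t: float) -> Vector:
    return [t, t * t]


def straight(t: float) -> Vector:
    return [2.0 * t, -t]


curved_dev = path_deviation(curved)
straight_dev = path_deviation(straight)
```

A deviation of zero means the edit path coincides with linear interpolation in latent space; larger values indicate a more strongly curved path.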
We quantitatively compare the results of our framework against Image2StyleGAN and InterfaceGAN using three metrics, namely the FID, face identity, and edit consistency scores.
To measure the diversity and quality of the output samples, we use the FID score between the test images and the edited images generated from these images. We evaluate the results with 1k generated samples from the StyleGAN2 framework. These samples are then used to perform sequential edits as shown in Figure 10, i.e., ’Pose’, ’Light’ and ’Expression’. In Table 2, we show that the FID score of our method is lower (lower is better) than that of the other methods.
To evaluate the quality of the edit and quantify the identity preserving property of the edits, we evaluate the edited images on the face identity score reported in Table 3. Here we consider three metrics to determine if the face identity is preserved. We take a state-of-the-art face classifier model for face recognition (Geitgey, 2020) to evaluate the metrics. The classifier outputs the embeddings of the images, which can be compared. Given two images (before and after the edits), i.e., and , we calculate the Euclidean distance and the cosine similarity between the embeddings. Note that we use a classifier different from the attribute estimator used in training our StyleFlow. We choose three major edits for this purpose: light, pose, and expression. The metrics show that our method outperforms the other methods across all the metrics and the edits. We also evaluate the scores when all the edits are applied sequentially. Here, our method also shows superiority in quantitative evaluation. Moreover, we also compute the accuracy based on the final decision of the classifier on whether the two embeddings belong to the same face.
We introduce the edit consistency score to measure the consistency of the applied edit across the images. For example, in a sequential editing setup, if the pose edit is applied, it should not be affected by where in the sequence it is applied. In principle, in the above case, different permutations of edits must lead to the same attributes when classified with an attribute classifier. In Table 4, we show the cyclic edit consistency evaluation of our method and compare it with other methods. Here , for instance, refers to the application of the expression and then the pose in the sequence, and comparing it with the pose and lighting edit – we expect the pose attribute to be the same when evaluated on the final image. We measure this using the score where denotes conditional edit along attribute specification and denotes pose attribute vector regressed by the attribute classifier. As shown in Table 4, our editing method remains consistent under different permutations. We used the mean (absolute) error across the respective attributes.
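The identity and consistency measures described above can be sketched as follows. The sketch assumes the face embeddings and the regressed pose attributes are already available as plain lists of floats. How they are obtained (face recognition model, attribute classifier) is outside the sketch, and all numbers are toy values rather than results from the paper.

```python
import math
from typing import Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Distance between two face embeddings (lower means more similar)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embeddings (higher means more similar)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def edit_consistency(attr_a: Sequence[float], attr_b: Sequence[float]) -> float:
    """Mean absolute error between the same attribute regressed from two
    images that were produced by different edit orders."""
    return sum(abs(x - y) for x, y in zip(attr_a, attr_b)) / len(attr_a)


# Toy embeddings of an image before and after an edit.
emb_before, emb_after = [1.0, 0.0], [0.6, 0.8]
dist = euclidean_distance(emb_before, emb_after)
cos = cosine_similarity(emb_before, emb_after)

# Toy pose attributes regressed after two different edit sequences.
pose_order_1, pose_order_2 = [0.5, -0.2], [0.4, -0.2]
consistency = edit_consistency(pose_order_1, pose_order_2)
```

A consistency value close to zero means the pose attribute ends up the same regardless of where in the sequence the pose edit was applied.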
To make a more interactive experience for a user, we developed an image editing UI (see Figure 18). The UI allows the user to select a given real or generated image and perform various edits with the help of interactive sliders. The checkpoint images are saved in a panel so that a user can revisit the changes made during the interactive session.
We presented StyleFlow, a simple yet robust solution to the conditional exploration of the StyleGAN latent space. We investigated two important subproblems of attribute-conditioned sampling and attribute-controlled editing on StyleGAN using conditional continuous normalizing flows. As a result, we are able to sample high quality images from the latent space given a set of attributes. Also, we demonstrate fine-grained disentangled edits along various attributes, e.g., camera pose, illumination variation, expression, skin tone, gender, and age for faces. The real face editing of our framework is demonstrated to be of higher quality than that of the concurrent works. The qualitative and quantitative results show the superiority of the StyleFlow framework over other competing methods.
We identified three major limitations of our work. First, our work relies on the availability of attributes. These attributes might be difficult to obtain for new datasets and could require a manual labeling effort. Second, great results are only achievable with StyleGAN trained on high quality datasets, mainly FFHQ. It would be good to have different types of datasets of similar quality, e.g., buildings or indoor scenes, to better evaluate our method. The lack of very high quality data is still a major limitation for evaluating GAN research. Third, real image editing sometimes produces artifacts compared to the synthesized images. While the quality of these edits in our framework is still better than that of competing work, a better understanding of this problem and better projection algorithms require further research. We suggest this line of investigation as the most rewarding avenue of future work. In addition, it would be interesting to develop extensions to other attributes. In this context, it would be interesting to analyze what attributes are even captured by a GAN model. A combination of GANSpace to discover attributes and our method to encode conditional edits could perhaps be developed in the future.
Proceedings of the IEEE International Conference on Computer Vision. 4432–4441.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Neural ordinary differential equations. In Advances in Neural Information Processing Systems. 6571–6583.
Proceedings of the 34th International Conference on Machine Learning, Volume 70. JMLR.org, 1857–1865.