SemanticStyleGAN: Learning Compositional Generative Priors for Controllable Image Synthesis and Editing

12/04/2021
by   Yichun Shi, et al.
ByteDance Inc.

Recent studies have shown that StyleGANs provide promising prior models for downstream tasks on image synthesis and editing. However, since the latent codes of StyleGANs are designed to control global styles, it is hard to achieve a fine-grained control over synthesized images. We present SemanticStyleGAN, where a generator is trained to model local semantic parts separately and synthesizes images in a compositional way. The structure and texture of different local parts are controlled by corresponding latent codes. Experimental results demonstrate that our model provides a strong disentanglement between different spatial areas. When combined with editing methods designed for StyleGANs, it can achieve a more fine-grained control to edit synthesized or real images. The model can also be extended to other domains via transfer learning. Thus, as a generic prior model with built-in disentanglement, it could facilitate the development of GAN-based applications and enable more potential downstream tasks.



1 Introduction

Recent studies on Generative Adversarial Networks (GANs) have made impressive progress on image synthesis, where photo-realistic images can be generated from random codes in a latent space [BigGAN, stylegan, stylegan2, karras2021alias]. These models provide powerful generative priors for downstream tasks by serving as neural renderers. However, their synthesis procedure is usually stochastic and offers no built-in user control. Thus, achieving controllable image synthesis and editing with generative priors remains a challenging problem.

One of the best-known families of such generative priors is the StyleGAN series [stylegan, stylegan2, karras2021alias], where each generated image is conditioned on a set of coarse-to-fine latent codes. However, the meanings of these latent codes are still relatively ambiguous. Thus, a plethora of studies have attempted to further investigate the latent space of StyleGAN to improve controllability. It has been shown that, by learning a linear boundary or a neural network in the latent space of StyleGAN, one can control the global attributes [shen2020interpreting, shen2020closed, harkonen2020ganspace, abdal2021styleflow] or the 3D structure [tewari2020stylerig] of the generated images. Furthermore, by using an optimization- or encoder-based method, real images can also be embedded into the latent space to create a unified synthesis/editing model [abdal2019image2stylegan, abdal2020image2stylegan++, zhu2020indomain, tov2021e4e, richardson2021psp, alaluf2021restyle, wang2021high]. However, as pure learning-based methods, these solutions inevitably suffer from the biases in the StyleGAN latent space. For example, since different attributes can be correlated in StyleGAN, unexpected attributes or local parts are often changed when one intends to edit only a certain attribute or area.

To obtain more precise control, another solution is to train a new GAN model from scratch by introducing additional supervision or inductive biases. For example, by using 3D-rendered faces, CONFIG [kowalski2020config] and DiscoFaceGAN [deng2020discofacegan] build GANs in which pose and other 3D information are factorized in the latent space. GAN-Control [shoshan2021gancontrol] disentangles the latent space by incorporating pre-trained attribute models for contrastive learning. Given the recent progress on neural rendering, it has also been shown that 3D-controllable GANs can be trained from images by injecting volumetric rendering into the synthesis procedure [nguyen2019hologan, schwarz2020graf, chan2021pigan, gu2021stylenerf, zhou2021cips3d]. However, a major limitation of the above-mentioned models is that they are designed for holistic attributes and offer no fine-grained local editability.

In this work, we propose SemanticStyleGAN, which introduces a new type of generative prior for controllable image synthesis. Unlike prior work, the latent space of SemanticStyleGAN is factorized based on semantic parts defined by semantic segmentation masks. Each semantic part is modulated individually with corresponding local latent codes and an image is synthesized by composing local feature maps. Different from layout-to-image translation methods [zhu2020SEAN, wang2021image, chen2021sofgan], our local latent codes are able to control both the structure and texture of semantic parts. Compared to attribute-conditional GANs [kowalski2020config, deng2020discofacegan, shoshan2021gancontrol], our model is not designed for any specific task and can serve as a generic prior like StyleGAN. Thus, it can be combined with latent manipulation methods designed for StyleGAN to edit output images while providing more precise local controls. The contributions of this work can be summarized as follows:

  • A GAN training framework that learns the joint modeling of image and semantic segmentation masks.

  • A compositional generator architecture that disentangles the latent space into different semantic areas to control the structure and texture of local parts via implicit latent code manipulation.

  • Experiments showing that our generator can be combined with existing latent manipulation methods to edit images in a more controllable fashion.

  • Experiments showing that our generator can be adapted to other domains with only limited images while preserving spatial disentanglement.

2 Related Work

2.1 GAN Latent Space for Image Editing

Given the success of GANs in synthesizing high-quality images [stylegan, stylegan2, BigGAN], many studies have attempted to utilize GANs as an image prior to achieve controllable image synthesis and editing. These studies fall into two categories. The first type aims to learn a model that manipulates the latent space of a pre-trained GAN to achieve editability. For example, InterFaceGAN [shen2020interpreting], GANSpace [harkonen2020ganspace] and StyleFlow [abdal2021styleflow] train attribute models in the StyleGAN latent space to control binary attributes. StyleRig [tewari2020stylerig] learns a set of latent-space networks to change the pose and lighting. Similarly, StyleFusion [kafri2021stylefusion] learns to fuse semantic parts from different images in the latent space. The second type aims to learn a GAN with a more disentangled latent space using additional supervision. For example, CONFIG [kowalski2020config] and DiscoFaceGAN [deng2020discofacegan] use 3D-rendered data to disentangle pose, identity and expression from other information. GAN-Control [shoshan2021gancontrol] separates attributes such as identity and age in the latent space by utilizing pre-trained attribute models. Besides these, StyleMapGAN [kim2021stylemapgan] proposes to use style maps to modulate a synthesis network, but the meaning of each style pixel is unclear. Different from prior works, we propose a new type of factorization of the GAN latent space according to semantic labels. Our disentangled latent codes can independently control the shape and texture of each semantic part in the output image.

2.2 Compositional Image Synthesis

A plethora of studies have investigated how to build generative models that mimic the compositional nature of the world. To achieve compositionality, some studies take images as input and compose a complicated scene from elements of real images [arandjelovic2019object, azadi2020compositional, sbai2021surprising]. On the other side, the majority of studies aim to build a generative model that discovers different objects in the training images without supervision and then synthesizes them from independent latent codes. Most of these methods assume that objects are positioned independently in the scene, and a compositional generative model is designed to discover such objects [gregor2015draw, eslami2016attend, yang2017lrgan, greff2019IODINE, burgess2019monet, anciukevicius2020object, yang2020learning, ehrhardt2020relate, van2020investigating, hudson2021compositional]. Some other methods approach compositional synthesis from a 3D perspective and disentangle objects from the background by learning from multi-view datasets [nanbo2020learning, henderson2020unsupervised, nguyen2020blockgan, niemeyer2021giraffe]. Similar to these works, we inject composition as an inductive bias to encourage disentanglement. However, we focus on semantic parts that are defined by humans. This allows us to decompose highly correlated local parts below the object level (e.g., hair and face) and enables more fine-grained control during synthesis.

Figure 2: Overview of our training framework. An MLP first maps randomly sampled codes into the $\mathcal{W}$ space. The resulting code is used to modulate the weights of the local generators. Each local generator takes Fourier features as its input and outputs a feature map and a pseudo-depth map, which are fused into a coarse segmentation mask and a global feature map for image synthesis. The render network, which is conditioned only on the feature map, refines the upsampled coarse mask into a high-resolution segmentation mask by learning a residual, and generates the fake image. A dual-branch discriminator models the joint distribution of RGB images and semantic segmentation masks.

2.3 Layout-based Generators for Local Editing

In the layout-to-image translation problem, a layout image is provided as the condition for controllable image synthesis. The layout image can be a semantic segmentation mask [chen2017CRN, qi2018SIMS, wang2018Pix2PixHD, park2019SPADE, liu2019learning, zhu2020SEAN, zhu2020semantically, wang2021image, chen2021sofgan], a sketch image [wang2018Pix2PixHD, chen2020deepfacedrawing, richardson2021psp], etc. Among these, some studies have attempted to represent different semantic parts with latent codes [zhu2020SEAN, zhu2020semantically, chen2021sofgan]. But since the layout is controlled by the input segmentation mask, they are only able to control the local texture. Our method also shares similarities with prior research that utilizes semantic masks as intermediate representations for generation [hong2018inferring, johnson2018image, ashual2019specifying], but those methods are engineered for conditional generation tasks and are not able to generate images from scratch. Recently, some researchers have also analyzed the correlation between the StyleGAN style space and semantic masks [collins2020editing, wu2021stylespace, kafri2021stylefusion] or supervised the latent manipulation with segmentation masks [futschik2021real, ling2021editgan, pernuvs2021high] to achieve local editing. In contrast to these methods, we build a semantic-aware generator that directly associates different local areas with latent codes; these codes can then be used to edit both the local structure and texture.

3 Methodology

A typical GAN framework learns a generator that maps a vector $z$ to an image, where $z$ is sampled from a prior that is usually a standard normal distribution. In StyleGANs [stylegan, stylegan2], to handle the non-linearity of the data distribution, $z$ is first mapped into a latent code $w \in \mathcal{W}$ with an MLP. This $\mathcal{W}$ space is then extended into a $\mathcal{W}^{+}$ space that controls the output styles at different resolutions [stylegan]. However, these latent codes do not have a strictly defined meaning and can hardly be used individually.

We propose to build a generator whose $\mathcal{W}$ space is disentangled for different semantic areas. Formally, given a labeled dataset $\{(x_i, y_i)\}$, where $y_i$ is the semantic segmentation mask of image $x_i$ and $K$ is the number of semantic classes, our generator uses a factorized $\mathcal{W}$ space such that:

$w = (w_{base}, w_1, w_2, \dots, w_K)$   (1)

Here each local latent code $w_k$ controls the shape and texture of the semantic area $k$ defined in the segmentation labels, while $w_{base}$ is a shared code that controls the coarse structure, such as pose. Each $w_k$ is further decomposed into a shape code $w_k^s$ and a texture code $w_k^t$. The generator maps the latent codes to an RGB image and a semantic segmentation mask. To this end, we identify two major challenges:

  1. How to decouple different local areas?

  2. How to ensure the semantic meanings of these areas?

For the first problem, inspired by compositional generative models [greff2019IODINE, burgess2019monet, niemeyer2021giraffe], we introduce local generators and a compositional synthesis procedure as the inductive bias. For the second problem, we use a dual-branch discriminator that models the joint distribution of images and segmentation masks to supervise the shapes of local parts after composition.
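To make the latent layout of Eq. (1) concrete, the sketch below samples such a factorized code in PyTorch. The class count, the code width, the 8-layer MLP, and the choice of broadcasting one mapped code to every part are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of the factorized latent code in Eq. (1); all sizes assumed.
import torch
import torch.nn as nn

NUM_PARTS = 13   # number of semantic classes K (assumed)
STYLE_DIM = 512  # width of the w space (assumed)

mapping = nn.Sequential(*[
    layer
    for _ in range(8)
    for layer in (nn.Linear(STYLE_DIM, STYLE_DIM), nn.LeakyReLU(0.2))
])

def sample_latents(batch_size: int):
    """Sample z ~ N(0, I), map it to w, and lay it out as a shared base code
    plus a (shape, texture) pair for every semantic part."""
    z = torch.randn(batch_size, STYLE_DIM)
    w = mapping(z)
    w_base = w                                          # coarse structure (e.g. pose)
    w_shape = [w.clone() for _ in range(NUM_PARTS)]     # per-part shape codes w_k^s
    w_texture = [w.clone() for _ in range(NUM_PARTS)]   # per-part texture codes w_k^t
    return w_base, w_shape, w_texture
```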

Figure 3: The architecture of a local generator. Blue blocks are modulated 1×1 convolution layers whose weights are conditioned on the input latent codes. Purple blocks are linear transformation layers.

3.1 Generator

The overall structure of our generator is shown in Figure 2. Similar to StyleGAN2 [stylegan, stylegan2], an 8-layer MLP first maps $z$ to the intermediate code $w$. Then, local generators are introduced to model the different semantic parts using $w$. A render net takes in the fused results from the local generators and outputs an RGB image and a corresponding semantic segmentation mask.

Local Generator

Following recent work on continuous image rendering [anokhin2021CIPS, skorokhodov2021INRGAN, zhou2021cips3d], we use modulated MLPs for the local generators (Fig. 3), which allows explicit spatial control over the synthesized output. Given Fourier features [tancik2020fourier] (a position encoding) $p$ and the latent codes as inputs, a local generator $g_k$ outputs a feature map $f_k$ and a pseudo-depth map $d_k$:

$(f_k, d_k) = g_k(p, w_{base}, w_k^s, w_k^t)$   (2)

Here, to reduce the computation cost, the input Fourier feature map as well as the outputs are of a reduced size, smaller than the final output image. In practice, we choose a size that balances efficiency and quality. During training, style mixing [stylegan] is conducted independently for $w_{base}$, $w_k^s$ and $w_k^t$ within each local generator, so that different local parts and different shapes and textures can work collaboratively for synthesis.
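A minimal PyTorch sketch of one local generator is given below. It assumes a simplified style modulation (channel-wise scaling of the input rather than full weight modulation and demodulation), an assumed sine/cosine position encoding, and illustrative layer counts and channel widths; the split of early layers for the shape code and later layers for the texture code mirrors the description above.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModulatedConv1x1(nn.Module):
    """Simplified style modulation: scale the input channels by an affine
    projection of w, then apply a 1x1 convolution. (The official modulated
    convolution also demodulates the weights; omitted here for brevity.)"""
    def __init__(self, in_ch, out_ch, style_dim=512):
        super().__init__()
        self.affine = nn.Linear(style_dim, in_ch)
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x, w):
        scale = self.affine(w).unsqueeze(-1).unsqueeze(-1)  # (B, in_ch, 1, 1)
        return self.conv(x * scale)

def fourier_features(size, num_freqs=16):
    """Sin/cos position encoding of a [-1, 1]^2 coordinate grid."""
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, size), torch.linspace(-1, 1, size), indexing="ij")
    coords = torch.stack([xs, ys], dim=0)                    # (2, H, W)
    freqs = (2.0 ** torch.arange(num_freqs).float()) * math.pi
    angles = coords.unsqueeze(0) * freqs.view(-1, 1, 1, 1)   # (F, 2, H, W)
    feats = torch.cat([angles.sin(), angles.cos()], dim=0)   # (2F, 2, H, W)
    return feats.flatten(0, 1).unsqueeze(0)                  # (1, 4F, H, W)

class LocalGenerator(nn.Module):
    """One local generator g_k: Fourier features plus (w_base, w_k^s, w_k^t)
    are mapped to a per-part feature map f_k and a pseudo-depth map d_k."""
    def __init__(self, in_ch=64, hidden=64, feat_ch=512, style_dim=512,
                 n_shape=4, n_texture=4):
        super().__init__()
        dims = [in_ch] + [hidden] * (n_shape + n_texture)
        self.layers = nn.ModuleList(
            ModulatedConv1x1(dims[i], dims[i + 1], style_dim)
            for i in range(n_shape + n_texture))
        self.n_shape = n_shape
        self.to_depth = nn.Conv2d(hidden, 1, 1)        # pseudo-depth head d_k
        self.to_feat = nn.Conv2d(hidden, feat_ch, 1)   # feature head f_k

    def forward(self, pos, w_base, w_shape, w_texture):
        x, depth = pos, None
        for i, layer in enumerate(self.layers):
            # first layer: shared base code; early layers: shape; later: texture
            w = w_base if i == 0 else (w_shape if i < self.n_shape else w_texture)
            x = F.leaky_relu(layer(x, w), 0.2)
            if i == self.n_shape - 1:
                depth = self.to_depth(x)               # d_k from the shape layers
        return self.to_feat(x), depth                  # f_k, d_k
```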

Fusion

In the fusion step, we first generate a coarse segmentation mask $m$ from the pseudo-depth maps. Following prior work on compositional generation [greff2019IODINE, burgess2019monet], the pseudo-depth maps are used as logits for a softmax function:

$m_k(i, j) = \frac{\exp\left(d_k(i, j)\right)}{\sum_{l=1}^{K} \exp\left(d_l(i, j)\right)}$   (3)

where $m_k(i, j)$ denotes the pixel $(i, j)$ in the $k$-th class of mask $m$, and similarly for $d_k(i, j)$. The feature maps are then aggregated by:

$f = \sum_{k=1}^{K} m_k \odot f_k$   (4)

Here $\odot$ denotes element-wise multiplication. The aggregated feature map $f$ contains all the information about the output image and is sent into the render net for rendering. We note that directly using $m$ for feature aggregation could be problematic when some classes are transparent. Thus, we use a modified version of $m$ for feature aggregation in the case of transparent classes, e.g., glasses (see the appendix for details).
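The two equations above amount to a softmax over the K pseudo-depth maps followed by a pixel-wise weighted sum of the feature maps; a small PyTorch sketch with assumed tensor shapes is:

```python
import torch
import torch.nn.functional as F

def fuse(feats, depths):
    """Eqs. (3)-(4): softmax over the K pseudo-depth maps gives the coarse
    mask m; the mask then weights the per-part feature maps pixel-wise.

    feats:  (B, K, C, H, W) per-part feature maps f_k
    depths: (B, K, 1, H, W) per-part pseudo-depth maps d_k
    """
    mask = F.softmax(depths, dim=1)       # coarse mask m, Eq. (3)
    fused = (mask * feats).sum(dim=1)     # global feature map f, Eq. (4)
    return fused, mask.squeeze(2)         # f: (B, C, H, W), m: (B, K, H, W)
```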

Render Net

The render net is similar to the original StyleGAN2 generator with a few modifications. First, it does not use modulated convolution layers, and the output is purely conditioned on the input feature map $f$. Second, we input the feature map at two resolutions: a downsampled copy and the full-resolution map are concatenated with the render net features at the corresponding levels. The additional low-resolution input allows better blending between different parts. Last, we find that directly training with the coarse mask $m$ as the output segmentation is difficult due to the intrinsic gap between softmax outputs and real segmentation masks. Thus, besides the ToRGB branch after each convolution layer, we add a ToSeg branch as in SemanticGAN [li2021SemanticGAN] that outputs residuals to refine the coarse segmentation mask into the final mask $s$, which has the same size as the output image. Here a regularization loss is needed so that the final mask does not deviate too much from the coarse mask:

$\mathcal{L}_{mask} = \lVert \mathrm{down}(s) - m \rVert_2^2$   (5)

where $\mathrm{down}(\cdot)$ downsamples the refined mask to the resolution of the coarse mask.
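A sketch of this regularization, assuming an average-pooling downsampler and the tensor shapes noted in the comments, could look as follows:

```python
import torch.nn.functional as F

def mask_regularization(final_seg, coarse_mask):
    """Eq. (5): keep the refined mask s close to the coarse mask m by
    penalizing the residual between the downsampled s and m.

    final_seg:   (B, K, H, W) refined segmentation output s
    coarse_mask: (B, K, h, w) coarse softmax mask m from the fusion step
    """
    down = F.adaptive_avg_pool2d(final_seg, coarse_mask.shape[-2:])
    return ((down - coarse_mask) ** 2).mean()
```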

3.2 Discriminator and Learning Framework

In order to model the joint distribution of images and segmentation masks, the discriminator needs to take both RGB images and segmentation masks as input. We found that a simple concatenation does not work due to the large gradient magnitude on the segmentation masks. Thus, we propose a dual-branch discriminator that has two convolution branches, one for the image and one for the segmentation mask. The branch outputs are then summed before the fully connected layers. Such a design allows us to separately regularize the gradient norm of the segmentation branch with an additional R1 regularization loss $\mathcal{L}_{R1}^{seg}$. The resulting training framework is similar to StyleGAN2, with the loss function:

$\mathcal{L} = \mathcal{L}_{StyleGAN2} + \lambda_{mask} \mathcal{L}_{mask} + \lambda_{seg} \mathcal{L}_{R1}^{seg}$   (6)

where $\mathcal{L}_{StyleGAN2}$ denotes the loss functions used in the original StyleGAN2.
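The sketch below shows the dual-branch layout and a per-input R1 penalty in PyTorch. The branch depths, channel widths, and plain convolutional stems are illustrative assumptions (the actual discriminator uses StyleGAN2 residual blocks, see the appendix).

```python
import torch
import torch.nn as nn
import torch.autograd as autograd

class DualBranchDiscriminator(nn.Module):
    """Two convolutional stems (image and segmentation) whose outputs are
    summed before a shared head, mirroring the description above."""
    def __init__(self, num_classes, ch=64):
        super().__init__()
        def stem(in_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, ch, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1), nn.LeakyReLU(0.2))
        self.img_branch = stem(3)
        self.seg_branch = stem(num_classes)
        self.head = nn.Sequential(
            nn.Conv2d(ch * 2, ch * 2, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(ch * 2, 1))

    def forward(self, img, seg):
        return self.head(self.img_branch(img) + self.seg_branch(seg))

def r1_penalties(d_out, inputs):
    """One R1 gradient penalty per input, so the image and segmentation
    terms can be weighted separately. Each tensor in `inputs` must have
    requires_grad_(True) set before the discriminator forward pass."""
    grads = autograd.grad(d_out.sum(), inputs, create_graph=True)
    return [g.pow(2).flatten(1).sum(1).mean() for g in grads]

# usage sketch:
#   img.requires_grad_(True); seg.requires_grad_(True)
#   out = D(img, seg)
#   r1_img, r1_seg = r1_penalties(out, [img, seg])
```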

4 Implementation Details

We implement our method in PyTorch and use the same optimizer and batch settings as in StyleGAN2. The loss weights $\lambda_{mask}$ and $\lambda_{seg}$ in Eq. (6) are kept fixed throughout training, and the style mixing probability and path regularization weight are reduced relative to their StyleGAN2 defaults. The 256×256 and 512×512 models are trained on 4 and 8 32GB Tesla V100 GPUs, respectively. For some experiments, we fine-tune our models on image-only datasets. In such cases, we drop the segmentation branch of the discriminator and use the original StyleGAN2 loss functions to fine-tune the model. Due to the space limit, more details about the network architectures are given in the appendix.

[Figure 4 row labels: Image, Pseudo-depth, Segmentation]

Figure 4: Illustration of compositional synthesis. Starting from background, we gradually add more components into the feature map. The second row shows the pseudo-depth map of each corresponding component used for fusion. Note that the “hair” generator outputs a complete shape even though it is covered by the face. During synthesis, all pseudo-depth maps are fused without an order.
Method | Data | Compositional | FID | IS
StyleGAN2 | img | ✗ | 4.45 | 3.40
SemanticGAN | img&seg | ✗ | 18.54 | 2.77
 + proposed training | img&seg | ✗ | 7.50 | 3.51
SemanticStyleGAN (ours) | img&seg | ✓ | 6.42 | 3.21
Table 1: Quantitative evaluation on synthesis quality. All the models are trained on CelebAMask-HQ at 256×256. "img" and "seg" refer to RGB image and segmentation mask, respectively.
Figure 5: Example generation results of our model trained on CelebAMask-HQ. The images are generated at a resolution of 512×512 with truncation applied.
[Figure 6 panel labels: real image, reconstruction, synthesized, translation, zoom out]
Figure 6: Image composition and transformation via Fourier feature manipulation. Real images are used as the background for synthesis and are inverted into feature maps. A foreground can then be synthesized on top of the real image in the feature space. The location and size of the foreground can be controlled via the Fourier features.

5 Experiments

5.1 Semantic-aware and Disentangled Generation

We first evaluate our model on synthesis quality and local disentanglement. For synthesis quality, we compare our model with StyleGAN2 [stylegan2] and SemanticGAN [li2021SemanticGAN]. The original StyleGAN2, which neither models segmentation masks nor provides local controllability, is compared against as an upper bound of synthesis quality. SemanticGAN modifies StyleGAN2 into a joint training framework to output both images and segmentation masks. Since its goal is segmentation, it does not allow local control either. All the models are trained on the first 28,000 images of CelebAMask-HQ resized to 256×256. Fréchet Inception Distance (FID) [heusel2017FID] and Inception Score (IS) [salimans2016improved] are used to measure the synthesis quality.

Our project was initially built on the SemanticGAN framework for learning a semantic-aware model. The original SemanticGAN is semi-supervised, and we modify it to use all the training labels. As shown in Tab. 1, SemanticGAN achieves much lower quality compared to the original StyleGAN2, indicating that learning a joint model of images and segmentation masks is a challenging task. Hypothesizing that the main bottleneck of SemanticGAN is the additional patch discriminator used for learning segmentation masks, we replace it with the proposed dual-branch discriminator. The new training framework achieves a much better synthesis score. We further replace the SemanticGAN generator with our SemanticStyleGAN generator. Compared to the SemanticGAN generator, our model shows similar synthesis quality while providing additional controllability over each semantic area. We then extend our model to 512×512 resolution and evaluate its FID and IS, with the StyleGAN2 generator at the same resolution serving as a reference. Fig. 5 shows the synthesis results of the 512×512 model.

To interpret the compositional synthesis of our model, Fig. 4 shows the results of synthesis with a limited set of components. We first disable all the foreground generators and then gradually add them back into the forward process. It can be seen that these local generators can work independently to generate a semantic part. The pseudo-depth maps, despite the lack of 3D supervision, learn meaningful shapes that can be used to collaboratively compose different faces. We note that, unlike traditional GANs that generate a complete image, such a compositional process also allows our model to generate the foreground only and to control it by manipulating the Fourier features (see Fig. 6).
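Concretely, such Fourier-feature manipulation can be thought of as transforming the coordinate grid before the position encoding. The sketch below, an assumption consistent with the fourier_features sketch in Sec. 3.1 rather than the released code, shifts and scales that grid to translate or zoom the rendered content.

```python
import torch

def coordinate_grid(size, dx=0.0, dy=0.0, zoom=1.0):
    """Coordinate grid for the Fourier-feature input. Shifting (dx, dy) or
    scaling (zoom) the grid before the position encoding translates or
    rescales whatever the local generators draw on top of it."""
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, size), torch.linspace(-1, 1, size), indexing="ij")
    xs = (xs - dx) / zoom
    ys = (ys - dy) / zoom
    return torch.stack([xs, ys], dim=0)   # (2, H, W), then apply sin/cos encoding
```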

[Figure 7 row labels: All, Background, Face, Hair]

Figure 7: Results of latent interpolation on the whole latent space and on specified subspaces. Here, "Face" refers to all components relevant to the face, including eyes, mouth, etc.

Fig. 7 shows the results of latent interpolation of our generator model. The first row shows that our model could interpolate smoothly between two randomly sampled images. Besides, we can interpolate on a specific semantic area by changing the corresponding latent codes, e.g. face or hair, while fixing irrelevant parts. The results indicate that our model has learned a smooth and disentangled latent space for semantic editing. Fig. 8 further shows the results of individually changing shape and texture codes. Overall, even though there is no explicit constraint during training, we observe that our model could disentangle most local shapes and textures. We also refer the readers to the appendix for more results on semantic local style mixing.

[Figure 8 panel labels: Face, Hair]

Figure 8: Random sampling in local latent spaces. The middle row shows a randomly sampled face. The first two and last two columns show the results of changing shape and texture codes, respectively. Here, “Face” refers to all the components relevant to face, including eyes, mouth, etc.

5.2 Controlled Synthesis and Image Editing

With the semantic decomposition in the latent space, our model provides a more disentangled generative prior for image editing. Here, we evaluate our model on downstream editing tasks and compare it to StyleGAN2. We use the PyTorch conversion of the official StyleGAN2 (config-F, trained on FFHQ at 1024×1024) as our baseline, which is widely used in related studies on image editing. The 512×512 model is used for our method.

Method | MSE | ID | LPIPS
StyleGAN2 (FFHQ) | 0.031 ± 0.015 | 0.654 ± 0.097 | 0.309 ± 0.046
StyleGAN2 | 0.029 ± 0.016 | 0.575 ± 0.119 | 0.330 ± 0.052
SemanticStyleGAN | 0.031 ± 0.017 | 0.602 ± 0.122 | 0.335 ± 0.051
Table 2: Quantitative evaluation of reconstruction performance using the Restyle (pSp) encoder. The bottom two rows (StyleGAN2 and ours) are trained on the same split of CelebAMask-HQ.
[Figure 9 panel labels — attributes: Expression, Bald, Bangs, Beard; methods: Input, Inversion (StyleGAN2), StyleFlow, InterFaceGAN, Inversion (Ours), StyleFlow+Ours, InterFaceGAN+Ours]
Figure 9: Results of GAN inversion and editing. For each attribute and method, we show the inversion result of Restyle encoder, the edited image and the difference map between them.
Figure 10: Quantitative comparison of local attribute editing using StyleGAN2 and our model when combined with StyleFlow and InterFaceGAN.

5.2.1 Encoding and Editing Real Images

To evaluate editing results on real images, we first need to embed such images into the GAN latent space. Here, we adopt a state-of-the-art GAN encoder, ReStyle-pSp [alaluf2021restyle], for both StyleGAN2 and our model. We use the official model from the ReStyle authors for StyleGAN2, while a new encoder is trained for our model with default hyper-parameters. For reference, we also train an encoder for a StyleGAN2 model trained on CelebAMask-HQ. Tab. 2 shows the quantitative results of image reconstruction using the ReStyle encoders. Overall, our model achieves comparable reconstruction performance.

The next question is whether our model can be applied to local editing on these reconstructed images. Here, we adopt two popular editing methods that were proposed for StyleGAN2: InterFaceGAN [shen2020interpreting] and StyleFlow [abdal2021styleflow]. Both methods need to generate a set of fake images and label their attributes to train a latent manipulation model. In particular, InterFaceGAN learns a linear SVM, while StyleFlow uses a conditional continuous normalizing flow [grathwohl2018ffjord] to model the latent attribute manipulation. For both generators, we randomly synthesize 50,000 images for labeling. Following InterFaceGAN, a ResNet-50 [he2016deep] is trained on the CelebA dataset [liu2015celeba] to label these images. During the experiments, we found that our model trained on CelebAMask-HQ exhibits much lower diversity compared to the FFHQ-based StyleGAN2. Thus, we fine-tune our model on FFHQ for additional steps (see Sec. 4), which sufficiently improves diversity without loss of controllability.
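For intuition, the sketch below pairs an InterFaceGAN-style linear boundary with the local editing that our factorized latent space allows: the learned direction is applied only to the latent dimensions of the chosen semantic part. The scikit-learn SVM, the flat NumPy latent layout, and the part_slice bookkeeping are illustrative assumptions, not the authors' pipeline.

```python
import numpy as np
from sklearn import svm

def fit_direction(latents, labels):
    """InterFaceGAN-style boundary: fit a linear SVM on (latent, attribute)
    pairs and use the unit normal of the hyperplane as the edit direction."""
    clf = svm.LinearSVC(C=1.0)
    clf.fit(latents, labels)               # latents: (N, D), labels: (N,) in {0, 1}
    direction = clf.coef_[0]
    return direction / np.linalg.norm(direction)

def edit_locally(w, direction, part_slice, strength):
    """Apply the edit only to the dimensions of one semantic part (e.g. the
    hair codes for 'bald'); the rest of the latent vector is untouched."""
    w_edit = w.copy()
    w_edit[part_slice] += strength * direction[part_slice]
    return w_edit
```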

We choose 4 local attributes covering different parts of the face for the editing experiments, namely smile, baldness, beard and bangs, and test on the held-out last portion of CelebAMask-HQ, which was not used for training. For StyleGAN2, we keep the original selection of latent dimensions in these methods for content preservation. For ours, we manually choose the relevant areas for editing, e.g., hair for baldness and face for beard, which can be regarded as a trivial step during deployment. Fig. 9 shows the qualitative results of applying InterFaceGAN to StyleGAN2 and to our model. Although InterFaceGAN successfully edits the attributes on StyleGAN2, irrelevant parts are inevitably altered due to the entanglement in the latent space. In comparison, our model changes only the specified semantic areas. We also conduct a quantitative evaluation of the editing task. For each image, we control the degree of manipulation to generate 10 images. Then a "preservation-score" curve is plotted using the attribute classifier. Here, the score is the average gain in classification score of the target attribute, and the preservation is the negative image-difference loss between the edited image and its inversion. This difference loss is an approximation of the L0 "norm", i.e., it counts the number of pixels that have been altered. In our experiments, we found this simple metric correlates best with the spatial difference between images. From Fig. 10, it can be seen that our model achieves a better overall performance. Note that for baldness, our model stops when it removes all the hair, but InterFaceGAN+StyleGAN2 keeps increasing the score by drifting into correlated attributes (such as aging). For bangs, our model tends to increase the overall length of the hair, which could be a bias inherited from the original training data. Besides, we found that StyleFlow is more sensitive to label imbalance: given the small number of bald examples, it fails to learn the baldness attribute for both generators.
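A minimal sketch of such a preservation metric, assuming uint8 images and an assumed exponent that softly approximates the pixel-change count, is:

```python
import numpy as np

def preservation(edited, source, p=0.5):
    """Negative approximate count of altered pixels between the edited image
    and its inversion. A small exponent pushes each per-pixel difference
    toward a 0/1 indicator, approximating the L0 'norm'; the exponent value
    here is an assumption, not the one used in the paper."""
    diff = np.abs(edited.astype(np.float64) - source.astype(np.float64)).mean(axis=-1)
    return -np.power(diff / 255.0, p).mean()
```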

[Figure 11 prompts: initial face, "a person with brown skin", "a person with purple long hair", "a person with blue eyes"; rows: StyleCLIP, StyleCLIP + Ours]
Figure 11: Results of text-guided image synthesis under sequential editing. Starting from an average fake face, the first row (from left to right) shows the results of sequentially applying optimization-based StyleCLIP [patashnik2021styleclip] with StyleGAN2 while the second row shows the results of our model with the same input texts.

5.2.2 Text-guided Synthesis

Recent work has shown that one can use a text-image embedding, such as CLIP [radford2021CLIP], to guide the synthesis of StyleGAN2 for controlled generation [patashnik2021styleclip]. Similar to attribute editing, StyleGAN2 suffers from a lack of local disentanglement in this setting. Fig. 11 shows a few examples of using StyleCLIP [patashnik2021styleclip] to manipulate a synthesized image with a sequence of text prompts. Here, we use the optimization-based version of StyleCLIP as it is flexible for any input text. It can be seen that the original StyleCLIP often modifies the whole image even when the text aims to change only a specific area. Our model, by additionally letting the user choose the relevant areas, can faithfully constrain the edits to local parts. The results indicate that our model could be a more suitable tool for text-guided portrait synthesis where detailed descriptions are provided.
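The sketch below shows what such a locally constrained, optimization-based text edit could look like: only the latent codes of the user-chosen parts are optimized against a CLIP loss. The generator(w) interface, the latent indexing, the optimizer settings, and the simplified CLIP preprocessing are all assumptions rather than the StyleCLIP or SemanticStyleGAN implementation.

```python
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP package

def text_guided_local_edit(generator, w_init, part_indices, prompt,
                           steps=200, lr=0.02, lam_reg=0.5, device="cuda"):
    """Optimize only the latent rows of the chosen semantic parts against a
    CLIP text loss; all other codes stay frozen."""
    model, _ = clip.load("ViT-B/32", device=device)
    model = model.float()  # keep everything in fp32 for simplicity
    text = clip.tokenize([prompt]).to(device)
    text_feat = F.normalize(model.encode_text(text), dim=-1).detach()

    w = w_init.clone().to(device)                       # (num_codes, D)
    delta = torch.zeros_like(w[part_indices], requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)

    for _ in range(steps):
        w_edit = w.clone()
        w_edit[part_indices] = w[part_indices] + delta  # edit only chosen parts
        img = generator(w_edit)                         # (1, 3, H, W) in [-1, 1], assumed
        img224 = F.interpolate((img + 1) / 2, size=224, mode="bilinear")
        img_feat = F.normalize(model.encode_image(img224), dim=-1)
        loss = 1 - (img_feat * text_feat).sum()         # cosine distance to the prompt
        loss = loss + lam_reg * delta.pow(2).mean()     # stay close to the inversion
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w_edit.detach()
```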

[Figure 12 panel labels: Photos, Toonify, MetFaces, Bitmoji]

Figure 12: Example results of changing hair styles on adapted new domains. The first four columns and last three columns show the results of sampling different latent codes for hair shapes and textures, respectively.

[Figure 13 panel labels: Hair, Top, Bottom]

Figure 13: Controlled generation results on the DeepFashion dataset. Our model can generate various styles for different semantic parts.

5.3 Results on Other Domains

Training our model from scratch requires access to images and segmentation masks at the same time, which might not be feasible in some cases. Thus, we ask whether the model can be fine-tuned on image-only datasets while preserving the local disentanglement (see Sec. 4 for the fine-tuning procedure). Fig. 12 shows the results after fine-tuning our model on the Toonify [pinkney2020toonify], MetFaces [karras2020stylegan_ada] and BitMoji [BitMoji] datasets. All of these datasets have a much smaller number of images compared to CelebAMask-HQ and no segmentation masks. We train our model for hundreds of steps until perceptually good results are generated. It can be seen that, for datasets with a limited domain gap, our model is able to maintain local controllability even after fine-tuning.

Although our experiments so far have focused on face datasets, our method does not include any face-specific module and can hence be applied to other objects as well. Fig. 13 shows the results of training our model on the DeepFashion dataset [liu2016deepfashion], for which we obtain the labels from [zhu2020semantically]. With the default hyper-parameters, our model trains successfully on this fashion dataset, and we can similarly control the structure and texture of different semantic parts in the latent space.

6 Limitations and Discussion

Applicable Datasets

Although we have shown that our method can be applied to domains beyond face photos, we still see a limitation caused by its design and supervision. Since we need to build a local generator for each class, the method would not scale to datasets that have too many semantic classes, such as scenes [zhou2017ADE20k]. Besides, for the sake of synthesis quality, we change the semi-supervised framework of SemanticGAN [li2021SemanticGAN] into a fully-supervised one, which prevents our model from being trained on image-only datasets from scratch. It would be beneficial to develop a semi-supervised version of our method in the future.

Disentanglement

As the disentanglement between pose, shape and texture is only enforced by the layer separation in the local generators, the boundary between them is still sometimes ambiguous. For example, the shared coarse structure code can encode some information about expression, and the shape code can affect the beard. However, in this work we mainly focus on the spatial disentanglement between different semantic parts, and we believe additional regularization losses or architecture tuning could be incorporated in the future to better decouple this information.

Societal Impact

Our work focuses on the technical problem of improving the controllability of GANs and is not designed for any malicious use. That being said, we do see that the method could potentially be extended to controversial applications such as generating fake profiles. Therefore, we believe that images synthesized with our approach should be clearly presented as synthetic.

7 Conclusion

In this paper, we present a new type of GAN method that synthesizes images in a controllable way. Through the design of local generators, masked feature aggregation and joint modeling of images and segmentation masks, we are able to model the structure and texture of different semantic areas separately. Experiments show that our method is able to synthesize high-quality images while disentangling different local parts. By combining our model with other editing methods, we can edit synthesized images with a more fine-grained control. Experiments also show that our model can be adapted to image-only datasets while preserving disentanglement capability. We believe the proposed method presents a new and interesting direction of GAN priors for controllable image synthesis, which could shed light on many potential downstream tasks.

References

Appendix A Implementation Details

a.1 Fusion with Transparent Classes

Following Sec. 3.1 of the main paper, a coarse semantic mask $m$ is fused from the pseudo-depth maps and is further used to aggregate the local feature maps. In the general case, the aggregation can simply be achieved by computing $f = \sum_k m_k \odot f_k$ (Eq. 4), where the frontmost class in the semantic mask dominates the output feature. However, in the case of transparent classes, this formulation can be problematic. For example, although the whole eye area could be labeled as glasses in the semantic masks, we are still able to see the skin behind it. Thus, we treat such transparent classes separately during feature aggregation. In particular, we use a modified mask:

$\tilde{m}_k = \frac{(1 - t_k)\, m_k}{\sum_{l=1}^{K} (1 - t_l)\, m_l} + t_k\, m_k$   (7)

where $t_k$ is an indicator that equals 1 if class $k$ is transparent and 0 otherwise, and $1 - t_k$ is the opposite indicator for non-transparent classes. The first term of Eq. 7 means that we first aggregate the features without the transparent classes. The second term of Eq. 7 then adds the transparent features using their original weights in mask $m$. In this way, the feature map is not affected if there are no transparent classes; if there are, they are added onto the feature map as additional residuals. Note that this formulation assumes that transparent classes do not overlap with each other. In our experiments, we set glasses and earrings as transparent classes. In fact, the model can also be trained stably by simply using the original mask $m$ for fusion, but the texture behind transparent classes could be distorted.
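A PyTorch sketch of this aggregation, with assumed tensor shapes and a boolean vector marking the transparent classes, is:

```python
import torch

def fuse_with_transparency(feats, mask, transparent):
    """Feature aggregation with transparent classes, following Eq. (7):
    renormalize the non-transparent weights so they sum to one, then add the
    transparent classes back with their original softmax weights.

    feats:       (B, K, C, H, W) per-part feature maps f_k
    mask:        (B, K, 1, H, W) coarse softmax mask m
    transparent: (K,) bool tensor, True for transparent classes
    """
    t = transparent.view(1, -1, 1, 1, 1).float()
    opaque = (1.0 - t) * mask
    opaque = opaque / opaque.sum(dim=1, keepdim=True).clamp_min(1e-8)
    mask_tilde = opaque + t * mask
    return (mask_tilde * feats).sum(dim=1)   # (B, C, H, W)
```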

Figure 14: Details of the render net. Here, we take the 256×256 model as the example. A "ConvBlock" is a StyleGAN2 convolution block that has 2 convolution layers. We remove the style modulation and add a linear segmentation output branch to each convolution layer. ⊕ indicates upsampling and summation.
Figure 15: Details of the dual-branch discriminator. The "Residual Blocks" are the convolution layers. The image branch and the segmentation branch are symmetric except for the input channels. "concat std." is the minibatch standard deviation step. The discriminator would be equivalent to the StyleGAN2 discriminator if we removed the segmentation branch.

a.2 Architecture Details

As shown in Fig. 3 of the main paper, each local generator is a modulated MLP (implemented with 1×1 convolutions) that has 10 layers. The input and output feature maps share the same spatial size. All the hidden layers have 64 channels. The Fourier feature input is first transformed into the hidden feature map with a linear fully connected (FC) layer. The "toDepth" layer is an FC layer that outputs a 1-channel pseudo-depth map. The "toFeat" layer is an FC layer that outputs a 512-channel feature map. To encourage the disentanglement between shape and texture, we stop the gradient between the shape and texture layers, except for the background generator. We also fix the pseudo-depth map of the background generator to be all zeros.

The detailed architecture of the render network is shown in Fig. 14. Note that there is an upsampling and residual operation at every layer for the segmentation mask, so the full-resolution residual is not explicitly computed. Instead, we compute the regularization in Eq. (5) from the difference between the downsampled output segmentation and the coarse mask $m$.

The detailed architecture of the discriminator is shown in Fig. 15. It is similar to the StyleGAN2 discriminator except that we add an additional segmentation branch that is symmetric to the image branch. During fine-tuning, we remove this branch and the discriminator reduces to an image-only discriminator.

a.3 Efficiency

For the 512×512 model, training takes about two and a half days for 150,000 steps with a batch size of 32 on 8 32GB Nvidia Tesla V100 GPUs, after which the best checkpoint is selected. For inference, our model needs less than a second to generate an image on a single GPU, without parallelizing the local generators.

Appendix B Style Mixing and Additional Results

In the main paper, we showed that the proposed model can interpolate smoothly in a local latent space. Here, we show more fine-grained style mixing results using our model. Different from StyleGANs [stylegan, stylegan2, karras2021alias], we can conduct style mixing between local generators to transfer a certain semantic component from one image to another. This is done by transferring both the shape code $w_k^s$ and the texture code $w_k^t$ of that component. Fig. 16 shows the results of semantic style mixing using our model trained on CelebAMask-HQ [CelebAMaskHQ]. Besides the local latent codes, we also show the results of transferring the coarse structure code $w_{base}$. It can be seen that our model is able to transfer most local component styles between images, including small components such as eyes and mouth. However, it is also observable that the coarse structure code currently encodes some information about these local components, such as expression and hair. Although a user or developer is able to change the number of coarse structure codes dynamically during testing (and even manipulate all the layers in a local generator), we believe it would be beneficial to further regularize the information in the coarse code in the future. Fig. 17 and Fig. 18 show the semantic style mixing results of a model after transfer learning (on the BitMoji dataset [BitMoji]) and of the model trained on the DeepFashion dataset [liu2016deepfashion]. A similar effect can be seen on DeepFashion, where the coarse structure code affects certain components. We also see that the hair color is sometimes affected by the background on this dataset. Since the head in this dataset is rather small, we believe such entanglement is caused by the low-resolution (16×16) feature map that is fed into the render network for blending, a size that was originally selected for face datasets. Further tuning the hyper-parameters of the render net might alleviate such issues.
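Such semantic style mixing boils down to copying one part's shape and texture codes from a reference latent into a source latent; a sketch with an assumed dictionary layout for the factorized codes is:

```python
def mix_local_style(w_source, w_reference, part):
    """Transfer one semantic component from a reference latent to a source
    latent by copying that part's shape and texture codes."""
    w_mixed = {
        "base": w_source["base"].clone(),
        "shape": {k: v.clone() for k, v in w_source["shape"].items()},
        "texture": {k: v.clone() for k, v in w_source["texture"].items()},
    }
    w_mixed["shape"][part] = w_reference["shape"][part].clone()
    w_mixed["texture"][part] = w_reference["texture"][part].clone()
    return w_mixed
```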

Fig. 19 and Fig. 20 show more results on randomly sampled images and pseudo-depth maps, respectively. Figs. 21, 22, 23 and 24 show more results on real face editing using our model and the original StyleGAN2. As mentioned in the main paper, we see that StyleFlow is more sensitive to data imbalance and less robust. Taking bangs for instance, with our model it tries to reduce the hair on the sides but not in the front. For beard, it tries to make the face skin look darker for our model, while it fails completely on the original StyleGAN2. Note that we re-train both StyleFlow and InterFaceGAN using newly sampled images and our own attribute prediction model. Overall, we observe that our model achieves much more localized control when editing output images.

[Figure 16 column labels: Coarse Structure, Background, Face (skin), Eyes, Eyebrows, Mouth, Hair]

Figure 16: Local style mixing of the model trained on CelebAMask-HQ. The first column shows randomly sampled images for editing. The remaining columns show the results of mixing local styles using the reference images in the first row.

[Figure 17 column labels: Coarse Structure, Face (skin), Eyes, Eyebrows, Mouth, Hair]

Figure 17: Local style mixing of the model fine-tuned on the BitMoji dataset. The first column shows randomly sampled images for editing. The remaining columns show the results of mixing local styles using the reference images in the first row.

[Figure 18 column labels: Coarse Structure, Background, Hair, Up, Bottom]

Figure 18: Local style mixing of the model trained on the DeepFashion dataset. The first column shows different randomly sampled images for editing. The remaining columns show the results of mixing local styles using the reference images in the first row.


Figure 19: Example generated images using our 512×512 model trained on CelebAMask-HQ. On the right of each generated photo is the refined segmentation mask output by the model.

[Figure 20 column labels: Image, Pseudo-depth, Segmentation (repeated for four sample groups)]

Figure 20: Illustration of compositional synthesis. Starting from background, we gradually add more components into the feature map. The second row of each sample shows the pseudo-depth map of each corresponding component used for fusion. During synthesis, all pseudo-depth maps are fused without an order.
Figure 21: Results of GAN inversion and editing for the smile attribute. For each method, we show the inversion result of Restyle encoder, the edited image and the difference map between them.
Figure 22: Results of GAN inversion and editing for the bald attribute. For each method, we show the inversion result of Restyle encoder, the edited image and the difference map between them.
Figure 23: Results of GAN inversion and editing for the bangs attribute. For each method, we show the inversion result of Restyle encoder, the edited image and the difference map between them.
Figure 24: Results of GAN inversion and editing for the beard attribute. For each method, we show the inversion result of Restyle encoder, the edited image and the difference map between them.