Labels4Free: Unsupervised Segmentation using StyleGAN

03/27/2021 · by Rameen Abdal, et al.

We propose an unsupervised segmentation framework for StyleGAN generated objects. We build on two main observations. First, the features generated by StyleGAN hold valuable information that can be utilized towards training segmentation networks. Second, the foreground and background can often be treated as largely independent and composited in different ways. For our solution, we propose to augment the StyleGAN2 generator architecture with a segmentation branch and to split the generator into a foreground and background network. This enables us to generate soft segmentation masks for the foreground object in an unsupervised fashion. On multiple object classes, we report results comparable to state-of-the-art supervised segmentation networks, while against the best unsupervised segmentation approach we demonstrate a clear improvement in both qualitative and quantitative metrics.







1 Introduction

Given the high quality and photo-realistic results of current generative adversarial networks (GANs), we are witnessing their widespread adaptation for many applications. Examples include various image and video editing tasks, image inpainting [11, 44, 45, 40], local image editing [41, 2], low bit-rate video conferencing [39], image super resolution [27, 10], image colorization [4, 26], and extracting 3D models [29].

Figure 1: We propose an unsupervised segmentation framework that enables foreground/background separation for raw input images. At the core of our framework is an unsupervised network, which segments class-specific StyleGAN images, and is used to generate segmentation masks for training supervised segmentation networks.

While it was originally conjectured that GANs merely memorize the training data, recent work in GAN-based image editing [16, 3, 35, 41] demonstrates that GANs learn non-trivial semantic information about a class of objects, e.g., faces or cars. For example, GANs are able to learn the concept of pose and can show the same, or at least a very similar looking, object in different orientations. Even though the background changes in subtle ways during these editing operations, in this paper we explore to what extent the underlying generator network actually learns the distinction between foreground and background, and how to encourage it to disentangle the two without explicit mask-level supervision. As an important byproduct, we can extract information from an unsupervised GAN that is useful for general object segmentation. For example, we are able to create a large synthetic dataset for training a state-of-the-art segmentation network and then segment arbitrary face (or horse, car, cat) images into foreground and background without requiring any manually assigned labels (see Fig. 1).

Our implementation is based on StyleGAN [22, 21], generally considered the state of the art for GANs trained on individual object classes. Our solution is built on two ideas. First, based on our analysis of GAN-based image editing, the features generated by StyleGAN hold a lot of information useful for segmentation and can be used towards corresponding mask synthesis. Second, the foreground and background should be largely independent and can be composited in different ways. The exact coupling between foreground and background is highly non-trivial, however, and there are multiple ways of decoupling them that we analyze in our work.

For our solution, we propose to augment the StyleGAN generator architecture with a segmentation branch and to split the generator into a foreground and background network. This enables us to generate soft segmentation masks for the foreground object. In order to facilitate easier training, we propose a training strategy that starts from a fully trained network that only has a single generator and utilizes it towards unsupervised segmentation mask generation.

To summarize, our main contributions are:

  • A novel architecture modification, a loss function, and a training strategy to split StyleGAN into a foreground and background network.

  • Generating synthetic datasets for segmentation. Our framework can be used to create a complete dataset of labeled GAN generated images of high quality in an unsupervised manner. This dataset can then be used to train other state-of-the-art segmentation networks to yield compelling results.

2 Related Work

High Quality GANs.

From the seminal works of Goodfellow et al. [15] and Radford et al. [30], subsequent GAN research has contributed major improvements in the visual quality of the generated results. State-of-the-art networks like ProGAN [18], BigGAN [9], StyleGAN [21], StyleGAN2 [22], and StyleGAN2-ADA [19] demonstrate superior performance in terms of diversity and quality of the generated samples. While the StyleGAN series by Karras et al. [19] has demonstrated high quality and photo-realistic results on human faces using the high quality FFHQ [21] dataset, BigGAN can produce high quality samples on complex datasets like ImageNet. In our work we build on StyleGAN2, which is the current state of the art for many smaller datasets, including faces.

GAN Interpretability and Image Editing.

GAN interpretability has been an important aspect of GAN research since the beginning. Some recent works in this domain [5, 6, 14, 16, 48, 3, 35, 2, 36] study the structure of the activation and latent spaces. For instance, GANSpace [16] simplifies the latent space of StyleGAN (the W space) to be linear and extracts meaningful directions using PCA. StyleRig [36] mapped the StyleGAN latent space to a riggable face model. StyleFlow [3] studied the non-linear nature of the StyleGAN latent space using normalizing flows and is able to produce high-quality sequential edits on generated and real images. On the other hand, layer-activation based methods [14, 2, 5, 41] try to understand the properties of GANs through an analysis of the activation space. A recent work, StyleSpace [41], studies the style parameters of the channels to determine various properties and editing abilities of StyleGAN.

Another interesting approach to the interpretability of GANs is image embedding. Abdal et al. [1] demonstrated high quality embeddings into the extended latent space, called W+, for real image editing. Subsequent works [47, 31, 37] try to improve the embedding quality, e.g., by suggesting new regularizers for the optimization. Embedding methods combined with the image editing capabilities of StyleGAN have led to many applications and even commercial software such as Adobe Photoshop's Neural Filters [13]. A known problem in the domain of image editing is the disentanglement of features during an editing operation. While state-of-the-art editing frameworks [3, 36, 16] achieve high quality fine grained edits using supervised and unsupervised approaches, background/foreground aware embeddings and edits still pose a challenge. Often, the background of the scene changes considerably during the embedding itself or during sequential edits performed on the images. Later, in Sec. 4, we show how our unsupervised segmentation framework leads to improved image editing.

Unsupervised Object Segmentation.

Among the few attempts in this domain are the works by Ji et al. [17] and Ouali et al. [28], which learn a clustering function in an unsupervised setting. A more recent work by Van Gansbeke et al. [38] adopts a predetermined prior in a contrastive optimization objective to learn pixel embeddings. In the GAN domain, PSeg [7] is currently the only approach to segmenting foreground and background; it reformulates and retrains the GAN using relative jittering between the composite and background generators, at the cost of sample quality.

3 Method

Our goal is to segment StyleGAN generated images (i.e., images from a generative model that samples a latent code from a normal distribution to produce class-specific images) into foreground and background layers. We achieve this without access to ground truth masks. Our self-supervised algorithm relies on two observations: first, for many object classes, foreground and background layers can be swapped and recomposited to still produce good quality images; and second, the interior layers of a trained StyleGAN already contain valuable information that can be used to extract such a foreground/background segmentation.


Figure 2: Our unsupervised segmentation framework, which makes use of the pretrained foreground and background generators to simultaneously train a segmentation network (see Figure 3) and a ‘weak’ discriminator, without requiring supervision from ground truth masks (see Sec. 3).
Figure 3: Architecture of the Alpha Network, which operates directly on pretrained features of StyleGAN to produce a segmentation mask.

Our framework consists of two generators that share a single discriminator. The first generator is a pretrained StyleGAN generator whose feature channels are used by the Alpha Network to extract the foreground image together with its alpha mask (see Sec. 3.1). The second generator, the background generator, is responsible for generating background image samples (see Sec. 3.2). Note that both are derived from pretrained generators. The final image, obtained by mixing foreground and background from the two branches, is composited using standard alpha blending, and a discriminator should not be able to tell it apart from real images. Note that only the Alpha Network is trained in this setup, using the regular GAN loss.
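As a concrete sketch, the compositing step can be written in a few lines of NumPy. The generators and the Alpha Network are replaced here by random stand-in tensors; only the alpha blending of foreground and background mirrors the pipeline described above.

```python
import numpy as np

# Hypothetical stand-ins for the generator outputs and the Alpha Network's
# prediction; in the real pipeline these come from frozen StyleGAN2 branches.
rng = np.random.default_rng(0)
H, W = 64, 64

foreground = rng.random((3, H, W))   # sample from the foreground generator
background = rng.random((3, H, W))   # sample from the trimmed background generator
alpha = rng.random((1, H, W))        # soft mask from the Alpha Network, in [0, 1]

# Standard alpha blending; the composite is what the discriminator sees.
composite = alpha * foreground + (1.0 - alpha) * background
```

Because the blend is a convex combination, every composited pixel lies between the corresponding foreground and background values.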

Fig. 2 shows an overview of this architecture.

We expand on the training dynamics of this network in Sec. 3.3.

3.1 Foreground Generator with Alpha Network

Features of the pretrained generator already contain sufficient information to segment image pixels into foreground and background. We extend the foreground generator with an Alpha Network to learn the alpha mask for the foreground (see Fig. 3). In this context, we overcome multiple challenges.

First, the feature maps from lower layers of the generator need to be upsampled. To this effect, we introduce upsampling blocks in the alpha network.

Second, the number of features per pixel is quite high (several thousand for StyleGAN2). We therefore use 1×1 convolutions for compression and feature selection. This encourages dropping the many features that do not contain segmentation information.

Third, we optionally discard channels and complete layers using a semi-automatic analysis of the tRGB layers of StyleGAN2. These layers are RGB channels conditioned on the output tensors, with the W codes at each resolution contributing to the final image. We initialize the tRGB layers at different resolutions with noise at each pixel position and analyze the SSIM loss with respect to the original sample. We average the results at each resolution and discard the layers with high SSIM scores. For example, for faces, the tRGB corresponding to the 4th layer of StyleGAN is retained, as the average SSIM for the 3rd layer is about 25% higher than for the 4th layer. Hence we select all the later tensors, including the 4th layer tensors, for the construction of the mask. Similarly, we identify the dominant layers to be the 2nd for LSUN-Horse and LSUN-Cat and the 3rd for LSUN-Car (see Appendix B).
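The ranking logic behind this selection can be sketched as follows. The single-window SSIM and the toy "noised tRGB" images below are illustrative stand-ins, not the paper's exact procedure; the point is only that layers whose noising leaves the image almost unchanged (high SSIM) are discarded.

```python
import numpy as np

def ssim_global(x, y, c1=0.01**2, c2=0.03**2):
    """Simplified single-window SSIM over whole images in [0, 1] (no sliding window)."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx**2 + my**2 + c1) * (vx + vy + c2))

rng = np.random.default_rng(1)
original = rng.random((64, 64))  # stand-in for a StyleGAN2 sample

# Noising an unimportant tRGB layer barely changes the image (high SSIM);
# noising a dominant layer heavily corrupts it (low SSIM).
unimportant = np.clip(original + 0.01 * rng.standard_normal((64, 64)), 0, 1)
dominant = rng.random((64, 64))

scores = {"layer_3": ssim_global(original, unimportant),
          "layer_4": ssim_global(original, dominant)}
# Discard layers with high SSIM: noise there did not affect the output.
kept = [name for name, s in scores.items() if s < 0.5]
```

The layer names and the 0.5 cutoff are arbitrary for this sketch; the paper selects layers by comparing average SSIM across resolutions.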

Fourth, the features per pixel have to be processed to output an alpha value. We therefore use several non-linear activation functions in the upsampling blocks and add a sigmoid function at the end to constrain the output to the range [0, 1]. This yields a lightweight but deliberately simple network for extracting segmentation information; in particular, neighboring pixels do not interact with each other.
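A minimal NumPy sketch of such a per-pixel network follows, assuming 1×1 convolutions, ReLU activations, nearest-neighbour upsampling, and a final sigmoid; the real Alpha Network's layer counts and channel widths differ.

```python
import numpy as np

def conv1x1(x, w, b):
    """1x1 convolution: a per-pixel linear map over channels. x: (C_in, H, W), w: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x) + b[:, None, None]

def upsample2x(x):
    """Nearest-neighbour upsampling, standing in for the upsampling blocks."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
feat = rng.standard_normal((32, 8, 8))  # hypothetical StyleGAN2 feature map

# Compress channels with a 1x1 conv, upsample, then map to a single alpha channel.
h = np.maximum(conv1x1(feat, 0.1 * rng.standard_normal((16, 32)), np.zeros(16)), 0)  # ReLU
h = upsample2x(h)
alpha = sigmoid(conv1x1(h, 0.1 * rng.standard_normal((1, 16)), np.zeros(1)))
```

Note that every operation here acts on each pixel independently (1×1 convs mix only channels, and nearest-neighbour upsampling only replicates values), matching the claim that neighboring pixels do not interact.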

3.2 Background Generator

The challenge for the background generator is the training dynamics. Specifically, when initializing with a pre-trained generator, the method fails because the background already includes the foreground object. When pre-training the background generator on a different dataset, our attempts failed because the discriminator can easily detect that the backgrounds are out of distribution (see Supplemental (Appendix A)). We therefore adopted an approach based on the conjecture that the foreground and background image are already composited by StyleGAN.

We start out by seeking channels which are responsible for generating the background pixels in a StyleGAN image. Let G be the StyleGAN generator, and let z and n be the latent and noise variables, respectively. In order to bootstrap the network to identify the channels, we first identify and collect generic StyleGAN backgrounds from multiple sampled images by cropping. We notice that backgrounds with, for example, a white, black, or blue tinge are common in-domain backgrounds. Then we compute the gradient of a reconstruction objective between the generated image G(z, n) and the upsampled background crops with respect to the tensors in the StyleGAN2 layers at all resolutions. This process identifies the layers which are most responsible for deleting an object (e.g., faces, cats, horses) from a composited StyleGAN2 image.

In order to quantify the calculated gradient maps, we calculate the sum of gradient norms over the channels of the respective layers. We found that the first layer (the constant layer and the first layer in which the W latent is injected, excluding the tRGB layer [22]) has the maximum value of this measure. Hence, we hypothesize that switching off (i.e., zeroing out) the identified channels excludes the object information from the tensor representation and produces an approximate background. Figs. 4 and 5 show the sampled backgrounds following the steps described above. Note that we could produce higher quality curated backgrounds by empirically setting some channels to be active based on a threshold of the above measure. However, we noticed that such backgrounds occasionally contained traces of foreground objects. Hence, in the training phase, we adopt a safe strategy and zero out all the channels of the selected layer. We refer to this trimmed network as the background generator.
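The scoring and zeroing steps above can be sketched as follows; the layer names, tensor shapes, and gradient magnitudes are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical per-layer gradients of the reconstruction objective w.r.t. the
# layer tensors; the first layer is made dominant by construction.
grads = {
    "const_4x4": 5.0 * rng.standard_normal((512, 4, 4)),
    "conv_8x8": 0.5 * rng.standard_normal((512, 8, 8)),
    "conv_16x16": 0.2 * rng.standard_normal((256, 16, 16)),
}

def layer_score(g):
    """Sum of per-channel gradient norms, used as a per-layer importance measure."""
    return sum(np.linalg.norm(g[c]) for c in range(g.shape[0]))

scores = {name: layer_score(g) for name, g in grads.items()}
target = max(scores, key=scores.get)

# Zero out every channel of the most responsible layer to delete the object,
# yielding the trimmed background generator.
activations = {name: rng.standard_normal(g.shape) for name, g in grads.items()}
activations[target] = np.zeros_like(activations[target])
```

In the real pipeline the dominant layer turns out to be the first (constant) layer, and all its channels are zeroed during training.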

3.3 Training Dynamics

Our unsupervised setup consists of the Alpha Network, a pre-trained StyleGAN generator for the foreground, a modified pre-trained StyleGAN generator for the background, and a weak discriminator.

We freeze the generators during training and only train the Alpha Network and the discriminator using adversarial training. Note that we disable path length regularization and style mixing during training. Unlike others [7], our framework does not alter the generation quality of samples. The final composited image given to the discriminator is

    x = m ⊙ x_fg + (1 − m) ⊙ x_bg,

where m is the mask predicted by the Alpha Network, x_fg is the foreground sample, and x_bg is the background sample. We use the W space for training, and show in Sec. 4 that the method generalizes to the W+ space to handle real images projected into StyleGAN2. In order to robustly train the network and avoid degenerate solutions (e.g., a mask of all 1s), we make several changes as described next.

(a) Discriminator: Unlike the generators, which are kept frozen and pre-trained, we train the discriminator from scratch. Our method fails to converge when starting the training from a pre-trained discriminator. We hypothesize that the pre-trained discriminator is already very strong in detecting correlations between the foreground and the background [34], for example, environmental illumination, shadows, and reflections in the case of cars, cats, and horses. Hence, to enable our network to train on independently sampled backgrounds, we use a ‘weak’ discriminator, in the sense that it does not have a strong prior about the correlation between foregrounds and backgrounds.

(b) Truncation trick: We observed that, unlike the FFHQ trained StyleGAN2, samples from the LSUN-Object trained StyleGAN2 have variable quality when produced without truncation. To avoid rejection sampling and ensure high quality samples during training, we use the truncation trick (ψ ∈ [0, 1]) for the generator in the case of LSUN-Object training [20].
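The truncation trick itself is the standard StyleGAN interpolation toward the mean latent; a minimal sketch follows, where w_avg and the sampled latent are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(4)
w_avg = rng.standard_normal(512)  # stand-in for the mapping network's mean latent

def truncate(w, psi):
    """Pull a latent toward w_avg: psi = 1 leaves it unchanged, psi < 1 trades
    sample diversity for quality, psi = 0 collapses to the mean latent."""
    return w_avg + psi * (w - w_avg)

w = rng.standard_normal(512)
w_truncated = truncate(w, 0.7)
```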

(c) Regularizer: As the alpha segmentation network may converge to a sparse map, we use a binary enforcing regularizer to guide the training. Note that we control this regularizer to still get soft segmentation maps. The truncation trick above may result in backgrounds that are more aligned with the original distribution than the composite image. Hence, for LSUN-Object, we use another regularizer which ensures that the optimization does not converge to the trivial solution of only using the background network. Similarly, a third regularizer checks that the solution does not degenerate to only using the foreground network.
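A sketch of the two kinds of regularizer, under assumed functional forms (the paper's exact losses and weights may differ): a binary-enforcing term that is large for uncertain masks, and a coverage term that penalizes all-foreground or all-background solutions.

```python
import numpy as np

def binary_reg(mask):
    """Pushes mask values toward 0 or 1: maximal (0.5) for a mask of all 0.5s,
    near zero for a crisp mask. The paper's exact form may differ."""
    return float(np.minimum(mask, 1.0 - mask).mean())

def coverage_reg(mask, target=0.5):
    """Penalizes degenerate masks that use only the background (mean ~0) or only
    the foreground (mean ~1). Assumed form for illustration."""
    return float((mask.mean() - target) ** 2)

soft = np.full((64, 64), 0.5)  # a totally uncertain mask
crisp = np.where(np.arange(64 * 64).reshape(64, 64) % 2 == 0, 0.02, 0.98)
```

Here `binary_reg(soft)` is at its maximum while `binary_reg(crisp)` is small, and an all-zeros mask is penalized by `coverage_reg` while the half-and-half crisp mask is not.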

                 Truncation = 0.7                          Truncation = 1.0
Method   IOU fg/bg  mIOU  F1    Prec  Rec   Acc    IOU fg/bg  mIOU  F1    Prec  Rec   Acc
PSeg     0.52/0.82  0.67  0.80  0.78  0.81  0.85   0.50/0.81  0.66  0.78  0.77  0.80  0.84
Ours     0.87/0.94  0.90  0.95  0.95  0.94  0.95   0.75/0.89  0.82  0.90  0.92  0.89  0.92

Table 1: Evaluation of unsupervised PSeg [8, 7] against our unsupervised approach, using results from the supervised network BiSeNet [42, 49], trained on the CelebA-Mask dataset [24], as ground truth. Note that for PSeg, since the samples can be of low quality, we use the Detectron2 model for person detection before evaluating the masks. See Figure 4 for assessing the visual quality of the BiSeNet generated masks.
Figure 4: Qualitative results of our unsupervised framework on StyleGAN2 trained on FFHQ compared with BiSeNet trained on CelebA-HQ Masks. Note that the green and red areas are the ‘False Positives’ and ‘False Negatives’ with respect to the foreground in ground truth. We report our results on two truncation levels.
Figure 5:

Qualitative results of our unsupervised framework on StyleGAN2 trained on LSUN-Horse, LSUN-Cat and LSUN-Car (LSUN-Object) compared with Detectron 2 trained on MS-COCO. Note that the green and red areas are the ‘False Positives’ and ‘False Negatives’ with respect to the foreground in ground truth.

Figure 6: Our method achieves better background preservation compared to the original semantic edits in StyleFlow [3]. For the real image, we first obtain a background layer, segmented using Labels4Free and then completed using Content-Aware Fill; then, for each edit using StyleFlow, we again segment the result using our method and composite it back with the completed background layer. Compare the first and third rows.

4 Results

4.1 Evaluation Measures

For the task of segmentation, IOU (Intersection over Union) and mIOU (mean Intersection over Union) measure the overlap between the ground truth and predicted segments. Since there are two classes, we calculate IOU for both classes in the experiments and also report the final mIOU. Additional metrics that we report to evaluate the segmentation are Precision (Prec), Recall (Rec), F1 score, and accuracy (Acc). As a surrogate for visual quality, we use FID. This can only be seen as a rough approximation of visual quality and is mainly useful in conjunction with visual inspection of the results.
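These measures are straightforward to compute from binary masks; the following self-contained sketch follows the standard definitions (foreground taken as the positive class).

```python
import numpy as np

def seg_metrics(pred, gt):
    """Binary segmentation metrics from predicted and ground-truth masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)      # foreground correctly predicted
    fp = np.sum(pred & ~gt)     # background predicted as foreground
    fn = np.sum(~pred & gt)     # foreground predicted as background
    tn = np.sum(~pred & ~gt)    # background correctly predicted
    iou_fg = tp / (tp + fp + fn)
    iou_bg = tn / (tn + fp + fn)
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return {
        "iou_fg": iou_fg, "iou_bg": iou_bg, "miou": (iou_fg + iou_bg) / 2,
        "prec": prec, "rec": rec, "f1": 2 * prec * rec / (prec + rec),
        "acc": (tp + tn) / (tp + fp + fn + tn),
    }

# Toy example: a 4x4 ground-truth square and a slightly wider prediction.
gt = np.zeros((8, 8), dtype=int); gt[2:6, 2:6] = 1
pred = np.zeros((8, 8), dtype=int); pred[2:6, 2:7] = 1
m = seg_metrics(pred, gt)
```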

4.2 Datasets

We use StyleGAN2 pretrained on the FFHQ [20], LSUN-Cars, LSUN-Cats, and LSUN-Horse [43] datasets. FFHQ is a face image dataset at 1024×1024 resolution, with facial features aligned canonically. The LSUN-Object datasets contain various objects and are relatively diverse in terms of pose, position, and number of objects in a scene. We use the subcategories for cars, cats, and horses.

4.3 Competing Methods

We compare our results with two approaches, a supervised approach and an unsupervised approach.

In the supervised setting, we use BiSeNet [42, 49] trained on the CelebA-Mask dataset [24] for the evaluation on faces, and Facebook’s Detectron2 Mask R-CNN model (R101-FPN [12]) with a ResNet-101 backbone pre-trained on the MS-COCO dataset [25] for the evaluation on LSUN-Object datasets. As our method is unsupervised, these methods mainly serve to judge how close we can come to supervised training; they are not direct competitors to our method. We also create a custom evaluation dataset of 10 images per class to directly compare the two approaches (see Supplemental, Appendix C).

In the unsupervised setting, we compare our method with PSeg [8, 7] using the parameters in the open-source GitHub repo. This method is the only published competitor to ours.

4.4 Comparison on Faces

We compare and evaluate segmentation results on both sampled and real images, quantitatively and qualitatively. First, we show qualitative results of the learned segmentation masks using images and backgrounds sampled from StyleGAN2 in Fig. 4. To put these results into context, we compare the segmentation masks of the learned alpha network with the supervised BiSeNet. The figure shows results at two truncation levels. The ψ = 0.7 setting leads to higher quality images. The ψ = 1.0 setting leads to lower quality images that are more difficult to segment from the composited representation; in such cases our method is able to outperform the supervised segmentation network. In Table 1, we compare the results of the unsupervised segmentation of StyleGAN2 generated images using our method with unsupervised PSeg. We train PSeg [8, 7] using the parameters in the GitHub repo on the FFHQ dataset. In the absence of a large corpus of testing data, we estimate the ground truth using a supervised segmentation network (BiSeNet). We sample images and backgrounds at different truncation levels and compute the evaluation metrics. The quantitative results clearly show that our method produces results of much higher quality than our only direct (unsupervised) competitor. The PSeg method tends to extract some random attributes from the learned images and is very sensitive to hyperparameters (see Fig. 7). In contrast, our method ensures that the original generative quality of StyleGAN is maintained. We also compare the FID of the sampled images in Table 5. The results show that PSeg drastically affects sample quality.

Figure 7: Samples from the PSeg method trained on the FFHQ dataset. For samples on LSUN-Object, refer to Figures 3 and 4 in the PSeg paper [7]. Note that the method struggles with sample quality; the authors communicated that they are working on a better version.

Background preserving face editing

In order to show the application of our method in real image editing, we use the state-of-the-art image editing framework StyleFlow [3]. One limitation of sequential editing using StyleGAN2 is its tendency to change the background. Fig. 6 shows some edits on real and sampled images using StyleFlow, and the corresponding results where our method is able to preserve the background under the same edit. Notice that our method is robust to pose changes.

Performance on real face images.

One way to obtain segmentation results on real images is to project them into the StyleGAN2 latent space. We do this for images from the CelebA-Mask dataset using the PSP encoder [31]. We show quantitative results in Table 4, comparing the output segmentation masks with the ground truth in the dataset. Note that the Alpha Network is not trained on the W+ space but is able to generalize well to real images. A downside of this approach is that the projection method introduces its own errors, affecting segmentation.

Labels4Free to obtain synthetic training data.

A better method to extend to arbitrary images is to generate synthetic training data (see Table 4). We train a UNet [32] and BiSeNet on sampled images and backgrounds produced by our unsupervised framework and report the scores. Note that the scores are comparable to the supervised setup. This demonstrates that our method can be applied to synthetically generate high quality training data for class specific foreground/background segmentation.

Method     IOU fg/bg  mIOU  F1    Prec  Rec   Acc
PSeg (A)   0.65/0.68  0.66  0.80  0.81  0.80  0.79
Ours (A)   0.84/0.77  0.81  0.89  0.89  0.90  0.90
PSeg (B)   0.50/0.40  0.45  0.71  0.69  0.73  0.63
Ours (B)   0.83/0.67  0.75  0.85  0.84  0.91  0.87
PSeg (C)   0.81/0.73  0.77  0.83  0.83  0.84  0.85
Ours (C)   0.93/0.84  0.89  0.94  0.93  0.95  0.95
Table 2: Evaluation of unsupervised PSeg [8, 7] against our unsupervised approach on LSUN-Object categories (LSUN-Horse, LSUN-Cat, and LSUN-Car), using results from the supervised network Detectron2 R101-FPN [12] trained on the MS-COCO dataset. We report results without truncation; the threshold for the masks is 0.9. A: LSUN-Cat; B: LSUN-Horse; C: LSUN-Car.
Method IOU fg/bg mIOU F1 Prec Rec Acc
BiSeNet(A) 0.94/0.97 0.95 0.98 0.98 0.97 0.98
Ours(A) 0.95/0.98 0.96 0.98 0.99 0.98 0.98
Dt2(B) 0.93/0.95 0.94 0.97 0.97 0.97 0.97
Ours(B) 0.95/0.96 0.96 0.98 0.98 0.98 0.98
Dt2(C) 0.96/0.94 0.95 0.97 0.97 0.97 0.98
Ours(C) 0.97/0.96 0.97 0.98 0.98 0.98 0.98
Dt2(D) 0.99/0.96 0.97 0.99 0.99 0.98 0.99
Ours(D) 0.98/0.95 0.97 0.98 0.98 0.99 0.99
Table 3: Evaluation on custom dataset. Dt2 : Detectron 2 A: FFHQ ; B: LSUN-Cat ; C: LSUN-Horse ; D: LSUN-Car.
Method IOU fg/bg mIOU F1 Prec Rec Acc
BiSeNet 0.84/0.92 0.88 0.93 0.93 0.94 0.94
Ours(PSP) 0.83/0.92 0.88 0.93 0.94 0.93 0.94
Ours(UNet) 0.75/0.88 0.82 0.90 0.90 0.90 0.91
Ours(BiSeNet) 0.81/0.91 0.86 0.92 0.92 0.92 0.93
Table 4: Evaluation of the unsupervised object segmentation of real images using a projection method (PSP) and segmentation networks (Unet and BiSeNet) trained on our generated synthetic data.
Method     A      B      C      D
PSeg       24.33  57.32  53.60  30.87
StyleGAN2  2.82   6.93   3.43   2.32
Table 5: FID comparison of the generation quality of samples from the PSeg method vs. StyleGAN2. A: FFHQ; B: LSUN-Cat; C: LSUN-Horse; D: LSUN-Car.

4.5 Comparison on other datasets

We also train our framework on LSUN-Cat, LSUN-Horse, and LSUN-Car [43]. These are more challenging datasets than FFHQ. We have identified two related problems with StyleGAN2 trained on these datasets. First, the quality of the samples at lower or no truncation levels is not as high as for the FFHQ trained StyleGAN2. Second, these datasets can have multiple instances of the objects in a scene, and StyleGAN2 does not handle such samples well. Both factors affect the training as well as the evaluation of the unsupervised framework.

In Fig. 5 we show the qualitative results of our unsupervised segmentation approach on the LSUN-Object datasets. Notice that the quality of segmentation masks produced by our framework is comparable to the supervised network. In Table 2, we report the quantitative results on sampled images from StyleGAN2. Note that here we resort to rejection sampling to discard samples not identified by the Detectron2 model. For simplicity, we also reject samples with multiple object instances. The results show that the metric scores are comparable to a state-of-the-art supervised method. In order to directly compare the supervised and unsupervised approaches, we compare the metrics on the custom dataset mentioned in Section 4.3. Table 3 shows that our unsupervised approach is either better than or has similar segmentation capabilities to the supervised approach.

4.6 Training Details

Our method is faster to train than the competing methods. We train our framework on 4 RTX 2080 (24 GB) cards with a batch size of 8, using the StyleGAN2 PyTorch implementation [33]. The weights of the two regularizers, the weight of the non-saturating discriminator loss, and the learning rates are set per dataset. For LSUN-Cat, we run 900 iterations; for LSUN-Horse, 500 iterations; and for LSUN-Car, 250 iterations. FFHQ, LSUN-Horse, and LSUN-Cat share one learning rate, while LSUN-Car uses a different one. Each model takes less than 30 minutes to converge.

5 Conclusion

We proposed a framework for unsupervised segmentation of StyleGAN generated images into a foreground and a background layer. The most important property of this segmentation is that it works entirely without supervision. To that effect, we leverage information that is already present in the layers to extract segmentation information. In the future, we would like to explore the unsupervised extraction of other information, e.g., illumination information, the segmentation into additional classes, and depth information.

6 Acknowledgements

This work was supported by Adobe and the KAUST Office of Sponsored Research (OSR) under Award No. OSR-CRG2017-3426.


  • [1] R. Abdal, Y. Qin, and P. Wonka (2019) Image2stylegan: how to embed images into the stylegan latent space?. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4432–4441. Cited by: §2.
  • [2] R. Abdal, Y. Qin, and P. Wonka (2020) Image2stylegan++: how to edit the embedded images?. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8296–8305. Cited by: §1, §2, §8.1.
  • [3] R. Abdal, P. Zhu, N. Mitra, and P. Wonka (2020) Styleflow: attribute-conditioned exploration of stylegan-generated images using conditional continuous normalizing flows. arXiv e-prints, pp. arXiv–2008. Cited by: §1, §2, §2, Figure 6, §4.4.
  • [4] M. Afifi, M. A. Brubaker, and M. S. Brown (2020) HistoGAN: controlling colors of gan-generated and real images via color histograms. arXiv preprint arXiv:2011.11731. Cited by: §1.
  • [5] D. Bau, J. Zhu, H. Strobelt, B. Zhou, J. B. Tenenbaum, W. T. Freeman, and A. Torralba (2018) GAN dissection: visualizing and understanding generative adversarial networks. External Links: 1811.10597 Cited by: §2.
  • [6] D. Bau, J. Zhu, J. Wulff, W. Peebles, H. Strobelt, B. Zhou, and A. Torralba (2019) Seeing what a gan cannot generate. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4502–4511. Cited by: §2.
  • [7] A. Bielski and P. Favaro (2019) Emergence of object segmentation in perturbed generative models. arXiv preprint arXiv:1905.12663. Cited by: §2, §3.3, Table 1, Figure 7, §4.3, §4.4, Table 2.
  • [8] A. Bielski. Perturbed-seg (GitHub repository). Cited by: Table 1, §4.3, §4.4, Table 2.
  • [9] A. Brock, J. Donahue, and K. Simonyan (2019) Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, External Links: Link Cited by: §2.
  • [10] G. Daras, J. Dean, A. Jalal, and A. G. Dimakis (2021) Intermediate layer optimization for inverse problems using deep generative models. arXiv preprint arXiv:2102.07364. Cited by: §1.
  • [11] U. Demir and G. Unal (2018) Patch-based image inpainting with generative adversarial networks. arXiv preprint arXiv:1803.07422. Cited by: §1.
  • [12] Facebook AI Research. Detectron2. Cited by: §4.3, Table 2.
  • [13] Adobe Photoshop. Neural Filters. Cited by: §2.
  • [14] A. Frühstück, I. Alhashim, and P. Wonka (2019) TileGAN: synthesis of large-scale non-homogeneous textures. ACM Transactions on Graphics (TOG) 38 (4), pp. 1–11. Cited by: §2.
  • [15] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial networks. arXiv preprint arXiv:1406.2661. Cited by: §2.
  • [16] E. Härkönen, A. Hertzmann, J. Lehtinen, and S. Paris (2020) Ganspace: discovering interpretable gan controls. arXiv preprint arXiv:2004.02546. Cited by: §1, §2, §2.
  • [17] X. Ji, J. F. Henriques, and A. Vedaldi (2019) Invariant information clustering for unsupervised image classification and segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9865–9874. Cited by: §2.
  • [18] T. Karras, T. Aila, S. Laine, and J. Lehtinen (2018) Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations, External Links: Link Cited by: §2.
  • [19] T. Karras, M. Aittala, J. Hellsten, S. Laine, J. Lehtinen, and T. Aila (2020) Training generative adversarial networks with limited data. In Proc. NeurIPS, Cited by: §2.
  • [20] T. Karras, S. Laine, and T. Aila (2018) A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948. Cited by: §3.3, §4.2.
  • [21] T. Karras, S. Laine, and T. Aila (2019) A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4401–4410. Cited by: §1, §2.
  • [22] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila (2020) Analyzing and improving the image quality of StyleGAN. In Proc. CVPR, Cited by: §1, §2, §3.2.
  • [23] LabelBox. LabelBox tool. Cited by: §9.1.
  • [24] C. Lee, Z. Liu, L. Wu, and P. Luo (2020) MaskGAN: towards diverse and interactive facial image manipulation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 1, §4.3.
  • [25] T. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollár (2014) Microsoft COCO: common objects in context. arXiv preprint arXiv:1405.0312. Cited by: §4.3.
  • [26] X. Luo, X. Zhang, P. Yoo, R. Martin-Brualla, J. Lawrence, and S. M. Seitz (2020) Time-travel rephotography. arXiv preprint arXiv:2012.12261. Cited by: §1.
  • [27] S. Menon, A. Damian, S. Hu, N. Ravi, and C. Rudin (2020) PULSE: self-supervised photo upsampling via latent space exploration of generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2437–2445. Cited by: §1.
  • [28] Y. Ouali, C. Hudelot, and M. Tami (2020) Autoregressive unsupervised image segmentation. In European Conference on Computer Vision, pp. 142–158. Cited by: §2.
  • [29] X. Pan, B. Dai, Z. Liu, C. C. Loy, and P. Luo (2020) Do 2d gans know 3d shape? unsupervised 3d shape reconstruction from 2d image gans. arXiv preprint arXiv:2011.00844. Cited by: §1.
  • [30] A. Radford, L. Metz, and S. Chintala (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434. Cited by: §2.
  • [31] E. Richardson, Y. Alaluf, O. Patashnik, Y. Nitzan, Y. Azar, S. Shapiro, and D. Cohen-Or (2020) Encoding in style: a StyleGAN encoder for image-to-image translation. arXiv preprint arXiv:2008.00951. Cited by: §2, §4.4.
  • [32] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §4.4.
  • [33] rosinality. Stylegan2-pytorch. Cited by: §4.6.
  • [34] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen (2016) Improved techniques for training gans. arXiv preprint arXiv:1606.03498. Cited by: §3.3.
  • [35] Y. Shen, C. Yang, X. Tang, and B. Zhou (2020) Interfacegan: interpreting the disentangled face representation learned by gans. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §1, §2.
  • [36] A. Tewari, M. Elgharib, G. Bharaj, F. Bernard, H. Seidel, P. Pérez, M. Zollhofer, and C. Theobalt (2020) Stylerig: rigging stylegan for 3d control over portrait images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6142–6151. Cited by: §2, §2.
  • [37] A. Tewari, M. Elgharib, M. BR, F. Bernard, H. Seidel, P. Pérez, M. Zöllhofer, and C. Theobalt (2020-12) PIE: portrait image embedding for semantic control. Vol. 39. Cited by: §2.
  • [38] W. Van Gansbeke, S. Vandenhende, S. Georgoulis, and L. Van Gool (2021) Unsupervised semantic segmentation by contrasting object mask proposals. arXiv preprint arXiv:2102.06191. Cited by: §2.
  • [39] T. Wang, A. Mallya, and M. Liu (2020) One-shot free-view neural talking-head synthesis for video conferencing. arXiv preprint arXiv:2011.15126. Cited by: §1.
  • [40] R. Webster, J. Rabin, L. Simon, and F. Jurie (2019) Detecting overfitting of deep generative networks via latent recovery. arXiv preprint arXiv:1901.03396. Cited by: §1.
  • [41] Z. Wu, D. Lischinski, and E. Shechtman (2020) StyleSpace analysis: disentangled controls for stylegan image generation. arXiv preprint arXiv:2011.12799. Cited by: §1, §1, §2, §8.1.
  • [42] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang (2018-09) BiSeNet: bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: Table 1, §4.3.
  • [43] F. Yu, Y. Zhang, S. Song, A. Seff, and J. Xiao (2015) LSUN: construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365. Cited by: §4.2, §4.5.
  • [44] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang (2018) Generative image inpainting with contextual attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5505–5514. Cited by: §1.
  • [45] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. Huang (2018) Free-form image inpainting with gated convolution. arXiv preprint arXiv:1806.03589. Cited by: §1.
  • [46] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva (2014) Learning deep features for scene recognition using places database. In Advances in Neural Information Processing Systems, Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger (Eds.), Vol. 27. Cited by: §7.1.
  • [47] J. Zhu, Y. Shen, D. Zhao, and B. Zhou (2020) In-domain gan inversion for real image editing. In European Conference on Computer Vision, pp. 592–608. Cited by: §2.
  • [48] P. Zhu, R. Abdal, Y. Qin, and P. Wonka (2020) Improved stylegan embedding: where are the good latents?. arXiv preprint arXiv:2012.09036. Cited by: §2.
  • [49] zllrunning Face-parsing.pytorch. Note: Cited by: Table 1, §4.3.

7 Appendix A

7.1 Ablation Study

To validate the importance of in-domain backgrounds for training the unsupervised network, we train our framework with backgrounds taken from the MIT Places [46] dataset. To do so, we replace the background generator with images randomly selected from MIT Places. During training, we train not only the alpha network but also the discriminator (as in the main method described in the paper). As discussed in the main paper, the discriminator is very good at identifying out-of-distribution images. In Table 6, we report the scores obtained when using MIT Places backgrounds, compared with BiSeNet and Detectron2 (see Table 1 and Table 2 in the main paper). Note that the scores drop drastically.

Dataset      mIOU  F1    Prec  Rec   Acc
FFHQ         0.34  0.40  0.34  0.50  0.68
LSUN-Horse   0.14  0.22  0.14  0.50  0.28
LSUN-Cat     0.20  0.28  0.69  0.50  0.39
LSUN-Car     0.34  0.51  0.63  0.63  0.51
Table 6: Quantitative results of using the MIT Places dataset for the backgrounds.
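The background-replacement ablation above boils down to alpha compositing a generated foreground over an externally sampled background instead of a generated one. A minimal NumPy sketch of that compositing step (function and variable names are ours, not from the paper's code) is:

```python
import numpy as np

def composite(foreground, background, alpha):
    """Alpha-composite a generated foreground over a background image.

    alpha is a soft mask in [0, 1] broadcastable to the image shape;
    all arrays are H x W x 3 floats. In the ablation, `background`
    would be a randomly selected MIT Places image rather than the
    output of the background generator.
    """
    return alpha * foreground + (1.0 - alpha) * background

# Toy example: a white foreground over a black background with a
# half-transparent mask; every output pixel becomes 0.5.
fg = np.ones((4, 4, 3))
bg = np.zeros((4, 4, 3))
mask = np.full((4, 4, 1), 0.5)
out = composite(fg, bg, mask)
```

The same formula is used at training time regardless of where the background comes from; only the source of `background` changes between the main method and this ablation.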

As a second ablation study, we learn the alpha mask only from the features of the last layer before the output. As seen in Table 7, this straightforward variant does not work well. In summary, using features from multiple layers of the generator is important to achieve higher fidelity.

Dataset      mIOU  F1    Prec  Rec   Acc
FFHQ         0.34  0.41  0.57  0.50  0.69
LSUN-Horse   0.33  0.42  0.41  0.44  0.63
LSUN-Cat     0.31  0.41  0.48  0.50  0.58
LSUN-Car     0.42  0.57  0.57  0.58  0.63
Table 7: Quantitative results of using only the last layer of StyleGAN2 for the construction of the alpha network.
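The multi-layer alternative that the ablation compares against can be sketched as upsampling feature maps from several generator resolutions to a common size and concatenating them along the channel axis before feeding the alpha network. The sketch below is our own illustration (nearest-neighbour upsampling, NumPy layout H x W x C), not the paper's implementation:

```python
import numpy as np

def upsample_nearest(feat, size):
    """Nearest-neighbour upsampling of an H x W x C feature map to size x size."""
    h, w, _ = feat.shape
    rows = np.repeat(np.arange(h), size // h)
    cols = np.repeat(np.arange(w), size // w)
    return feat[rows][:, cols]

def aggregate_features(feature_maps, size):
    """Bring feature maps from several generator layers to a common
    resolution and concatenate them along the channel axis."""
    return np.concatenate(
        [upsample_nearest(f, size) for f in feature_maps], axis=-1)

# Hypothetical features from three resolutions (4x4, 8x8, 16x16).
feats = [np.random.rand(4, 4, 8),
         np.random.rand(8, 8, 4),
         np.random.rand(16, 16, 2)]
stacked = aggregate_features(feats, 16)  # shape (16, 16, 14)
```

The last-layer-only ablation corresponds to passing a single feature map instead of the full list, which, per Table 7, discards information that the mask prediction needs.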

8 Appendix B

8.1 Visualization of tRGB layers

In Fig. 9,  10 and  11, we visualize the tRGB layers at different resolutions, as discussed in the experiment in Section 3.1 of the main paper. We select all the resolution tensors at and after the highlighted resolution. Here, we first normalize the tRGB tensor at each resolution. Notice that for the face visualization in Fig. 9, the face structure is clearly noticeable at the resolution corresponding to the 4th layer of StyleGAN2. Other efforts in StyleGAN-based local editing [2, 41] also select early layers for the semantic manipulation of images. These tests suggest that even features from earlier layers are beneficial for segmentation.
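The exact normalization formula is not reproduced here; a common choice for visualizing such tensors, which we assume for illustration, is per-tensor min-max scaling into [0, 1]:

```python
import numpy as np

def normalize_trgb(t):
    """Min-max normalize a tRGB tensor to [0, 1] for visualization.

    This is an assumed normalization for display purposes; the small
    epsilon guards against a constant (zero-range) tensor.
    """
    t_min, t_max = t.min(), t.max()
    return (t - t_min) / (t_max - t_min + 1e-8)

# Example: a 2x2 single-channel tensor scaled into the display range.
t = np.array([[0.0, 2.0], [4.0, 8.0]])
n = normalize_trgb(t)
```

Any affine rescaling into the display range would serve the same purpose here, since the visualization only needs relative structure, not absolute values.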

Figure 8: Multiple object segmentation.
Figure 9: Visualization of the tRGB layers in the StyleGAN2 trained on FFHQ dataset. Note that the maps produce a prominent face structure at resolution corresponding to layer 4 of StyleGAN2.
Figure 10: Visualization of the tRGB layers in the StyleGAN2 trained on LSUN-Horse and LSUN-Cat datasets.
Figure 11: Visualization of the tRGB layers in the StyleGAN2 trained on LSUN-Car dataset.

8.2 Multiple object segmentation

Apart from the main object, a scene can contain multiple instances of the same object, or multiple objects correlated with the foreground object. Fig. 8 shows that our method is able to handle such cases.

9 Appendix C

9.1 Custom dataset

We curated a custom dataset to evaluate the performance of our method against the supervised frameworks (BiSeNet and Detectron2). We collected 10 images per class. These images were sampled using the pretrained StyleGAN2 at different truncation levels and on different datasets, i.e., FFHQ, LSUN-Horse, LSUN-Cat, and LSUN-Car. Note that we collected a diverse set of images, i.e., diverse poses, lighting, and background instances (see Fig. 12). The images were annotated using the LabelBox tool [23].
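Sampling at "different truncation levels" refers to the standard StyleGAN truncation trick, which interpolates a latent toward the average latent to trade diversity for quality. A minimal sketch (variable names are ours):

```python
import numpy as np

def truncate(w, w_avg, psi):
    """StyleGAN truncation trick: pull a latent w toward the average
    latent w_avg. psi = 1 leaves w unchanged (full diversity);
    psi = 0 collapses every sample to w_avg (maximum quality, no
    diversity); intermediate values interpolate between the two.
    """
    return w_avg + psi * (w - w_avg)

# Example: a 2-D latent truncated halfway toward the average.
w = np.array([1.0, 2.0])
w_avg = np.zeros(2)
w_half = truncate(w, w_avg, 0.5)  # [0.5, 1.0]
```

Varying psi across samples, as done when building the evaluation set, yields images ranging from prototypical (low psi) to more varied poses and backgrounds (psi near 1).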

Figure 12: Our custom dataset with the corresponding ground truth labels.