Learning High-Resolution Domain-Specific Representations with a GAN Generator

06/18/2020
by Danil Galeev et al.
SAMSUNG

In recent years, generative models of visual data have made great progress, and they are now able to produce images of high quality and diversity. In this work we study the representations learnt by a GAN generator. First, we show that these representations can be easily projected onto a semantic segmentation map using a lightweight decoder. We find that such a semantic projection can be learnt from just a few annotated images. Based on this finding, we propose the LayerMatch scheme for approximating the representation of a GAN generator, which can be used for unsupervised domain-specific pretraining. We consider the semi-supervised learning scenario in which a small amount of labeled data is available along with a large unlabeled dataset from the same domain. We find that using a LayerMatch-pretrained backbone leads to superior accuracy compared to standard supervised pretraining on ImageNet. Moreover, this simple approach also outperforms recent semi-supervised semantic segmentation methods that use both labeled and unlabeled data during training. Source code for reproducing our experiments will be available at the time of publication.


1 Introduction

Generative models of visual data, and generative adversarial nets (GANs) in particular, have made remarkable progress in recent years Goodfellow et al. (2014); Arjovsky et al. (2017); Mescheder et al. (2018); Miyato et al. (2018); Karras et al. (2017); Brock et al. (2018); Kingma and Dhariwal (2018); Karras et al. (2019a, b), and they are now able to produce images of high quality and diversity. Generative models have long been considered a means of representation learning, with the common assumption that the ability to generate data from some domain implies an understanding of that domain's semantics. Accordingly, various ideas for using GANs for representation learning have been studied in the literature Radford et al. (2015); Chen et al. (2016). Most of these works focus on producing universal feature representations by training a generative model on a large and diverse dataset Donahue et al. (2016); Donahue and Simonyan (2019). However, the use of GANs as universal feature extractors has several limitations.

In this work we consider the task of unsupervised domain-specific pretraining. Rather than trying to learn a universal representation on a diverse dataset, we focus on producing a specialized representation for a particular domain. Our intuition is that GAN generators are most efficient for learning high-resolution representations, as generating a realistic-looking image implies learning the appearance and location of its different semantic parts. Thus, we experiment with semantic segmentation as the target downstream task. To illustrate our idea, we perform experiments with semantic projection of a GAN generator and show that it can be easily converted into a semantic segmentation model. Based on this finding, we introduce a novel LayerMatch scheme that trains a model to predict the activations of the internal layers of a GAN generator. Since the proposed scheme is trained on synthetic data and requires only a trained generator model, it can be used for unsupervised domain-specific pretraining.

As a practical use-case we consider the scenario when a limited amount of labeled data is available along with a large unlabeled dataset. This scenario is usually addressed by semi-supervised learning methods that use both labeled and unlabeled data during training. We evaluate LayerMatch pretraining as follows. First, a GAN model is trained on the unlabeled data, and a backbone model is pretrained using LayerMatch with the available GAN generator. Then, a semantic segmentation model with the LayerMatch-pretrained backbone is fine-tuned on the labeled part of the data. We perform experiments on two datasets with high quality GAN models available (CelebA-HQ Lee et al. (2019) and LSUN Yu et al. (2015a)). Surprisingly, we find that LayerMatch pretraining outperforms both the standard supervised pretraining on ImageNet and the recent semi-supervised semantic segmentation methods based on pseudo-labeling Lee (2013) and adversarial training Hung et al. (2018).

The rest of the paper is organized as follows. In Section 2.1 we explore semantic projections of GAN generator and describe our experiments. In Section 3.1 we introduce the LayerMatch scheme for unsupervised domain-specific pretraining that is based on inverting GAN generator. Section 4 describes our experiments with the models pretrained using LayerMatch. In Section 5 we discuss related works.

2 Semantic projection of a GAN generator

Let us introduce the following notation. A typical GAN model consists of a jointly trained generator $G$ and a discriminator $D$. The generator $G$ transforms a random latent vector $z$ into an image $I = G(z)$, and the discriminator classifies whether an image is real or fake. Let us denote the activations of the internal layers of $G$ for a latent vector $z$ by $F(z) = \{F_1(z), \dots, F_n(z)\}$.

Semantic projection of a generator is a mapping $P$ of the features $F(z)$ onto a dense label map $M \in \{1, \dots, C\}^{H \times W}$, where $C$ is the number of classes. It can be implemented as a decoder that takes the features from different layers of the generator and outputs the semantic segmentation result. An example of a decoder architecture built on top of a style-based generator is shown in Figure 1 (a).

Figure 1: (a) Semantic projection implemented by a decoder built on top of a style-based generator as described in Section 2.1; (b) LayerMatch scheme for pretraining a backbone to approximate the activations of a GAN model as described in Section 3.1.

2.1 Converting semantic projection model into a segmentation model

The training procedure for a semantic projection model is shown in Algorithm 1. First, we sample a few images using the GAN generator and store the corresponding activations of its internal layers. The latent vectors are sampled from a normal distribution. Then, we manually annotate the generated images. The decoder is trained in a supervised manner on the segmentation masks from the previous step paired with the corresponding intermediate generator features, using cross-entropy between the predicted mask and the ground truth as the loss function.

Input: GAN model $G$
Output: Semantic projection model $P$
1 Generate images $I_k = G(z_k)$ from random latent vectors $z_k$, $k = 1, \dots, n$, and store them along with their features $F(z_k)$
2 Annotate the images and create semantic maps $M_k$, $k = 1, \dots, n$
3 Train a decoder $P$ on the pairs $(F(z_k), M_k)$, $k = 1, \dots, n$
Algorithm 1 Training a semantic projection model
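
Below is a minimal PyTorch sketch of Algorithm 1. It assumes a frozen StyleGAN-like generator that exposes its intermediate activations; the `return_features=True` interface and a `decoder` accepting a list of feature maps are hypothetical placeholders, not the authors' actual API.

import torch
import torch.nn.functional as F

def train_semantic_projection(generator, decoder, latents, masks,
                              steps=1000, lr=1e-3):
    """Algorithm 1 sketch: fit a lightweight decoder on the generator
    features of a handful of manually annotated synthetic images."""
    # `latents`: list of z vectors used to produce the annotated images.
    # `masks`: list of (H, W) long tensors with per-pixel class labels.
    opt = torch.optim.Adam(decoder.parameters(), lr=lr)
    generator.eval()
    with torch.no_grad():
        # Hypothetical interface: the generator returns the image together
        # with a list of intermediate activations F_1(z), ..., F_n(z).
        feats = [generator(z.unsqueeze(0), return_features=True)[1] for z in latents]
    for _ in range(steps):
        idx = torch.randint(len(latents), (1,)).item()
        logits = decoder(feats[idx])   # assumed (1, C, H, W) at mask resolution
        loss = F.cross_entropy(logits, masks[idx].unsqueeze(0))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return decoder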

Once we have trained a semantic projection model $P$, we can obtain pixelwise annotation for generated images by applying $P$ to the features produced by the generator $G$. However, generator features are not available for real images. Since semantic projection alone does not allow obtaining semantic segmentation maps for real images, we propose Algorithm 2 for converting the semantic projection model into a semantic segmentation model applicable to real images. The intuition is that training on a large number of GAN-generated images, together with the accurate annotations provided by semantic projection, should result in an accurate segmentation model.

Input: GAN model $G$, semantic projection model $P$
Output: Semantic segmentation model $S$
1 Generate images $I_k = G(z_k)$ from random latent vectors $z_k$, $k = 1, \dots, m$, and store them along with their features $F(z_k)$
2 Compute the results of semantic projection $M_k = P(F(z_k))$, $k = 1, \dots, m$
3 Train a semantic segmentation model $S$ on the pairs $(I_k, M_k)$, $k = 1, \dots, m$
Algorithm 2 Converting a semantic projection model into a semantic segmentation model
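
A corresponding sketch of Algorithm 2 might look as follows; again, the generator and decoder interfaces are assumptions, and `seg_model` stands for any segmentation network such as DeepLabV3+.

import torch
import torch.nn.functional as F

@torch.no_grad()
def make_synthetic_dataset(generator, decoder, num_images, latent_dim=512):
    """Algorithm 2, steps 1-2: label generated images with the projection model."""
    images, labels = [], []
    for _ in range(num_images):
        z = torch.randn(1, latent_dim)
        image, feats = generator(z, return_features=True)   # hypothetical interface
        labels.append(decoder(feats).argmax(dim=1))          # pseudo ground truth (1, H, W)
        images.append(image)
    return torch.cat(images), torch.cat(labels)

def train_segmentation(seg_model, images, labels, epochs=20, lr=1e-4, batch=8):
    """Algorithm 2, step 3: train an ordinary segmentation network on the pairs."""
    opt = torch.optim.Adam(seg_model.parameters(), lr=lr)
    for _ in range(epochs):
        for img, lbl in zip(images.split(batch), labels.split(batch)):
            loss = F.cross_entropy(seg_model(img), lbl)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return seg_model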

2.2 Experiments with semantic projections

In this section we address the following questions: 1) Will a lightweight decoder be sufficient to implement an accurate semantic projection model? 2) How many images are required to train semantic projection to a reasonable accuracy? 3) Will the use of Algorithm 2 lead to improved performance on real images?

Experimental protocol. We perform experiments with a style-based generator Karras et al. (2018) on two datasets (FFHQ and LSUN-cars). In both experiments, we manually annotate 20 randomly generated images for training the semantic projection models. For the FFHQ experiment we use two classes: hair and background. Hair is a challenging category for segmentation as it usually has a complex shape with narrow elongated parts. For LSUN-cars we use the car and background categories. We also train a DeepLabV3+ Chen et al. (2018) model using Algorithm 2 with semantic projection models trained on 20 images. In all experiments we use ResNet-50 as a backbone. For LSUN-cars we experiment with both ImageNet-pretrained and randomly initialized backbones. For comparison we train a similar DeepLabV3+ model on 20 labeled real images. 80 annotated real images are used for testing the semantic segmentation models. Pixel accuracy and intersection-over-union (IoU) are used to compare the methods.

Architecture of the semantic projection model. The lightweight decoder architecture for semantic projection is shown in Figure 1 (a). It has an order of magnitude fewer parameters than standard decoder architectures and 16 times fewer than the DeepLabV3+ decoder. Each CBlock of the decoder takes the features from the corresponding SBlock of StyleGAN as input. A CBlock consists of a 50% dropout layer, a convolutional layer, and a batch normalization layer. Each RBlock of the decoder contains one residual block with two convolutional layers. The number of feature maps in each convolutional layer of the decoder is set to 32, as wider feature maps resulted in only minor improvements in our experiments.
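
The block structure described above could be expressed roughly as follows; the kernel sizes and activation choices are assumptions, and only the dropout + convolution + batch-norm composition of a CBlock and the two-convolution residual RBlock follow the text.

import torch
import torch.nn as nn

class CBlock(nn.Module):
    """Connector block: 50% dropout on a StyleGAN feature map,
    then a convolution and batch normalization (channel width 32)."""
    def __init__(self, in_channels, out_channels=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Dropout2d(p=0.5),
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
        )

    def forward(self, x):
        return self.body(x)

class RBlock(nn.Module):
    """Decoder block: a single residual block with two convolutions."""
    def __init__(self, channels=32):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.conv2(self.relu(self.conv1(x))))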

Figure 2: (a) - evaluation results of the semantic projection model on two classes (background and hair) with respect to the number of images in training; (b) - outputs of semantic projection model for test images generated by StyleGAN. Note that while the model was trained just on 20 images, it provides quite accurate segmentation.
Figure 3: Outputs of two semantic segmentation models trained with equal amount of supervision. First row - FFHQ, second row - LSUN-cars. From left to right: input image, output of the model trained with Algorithm 2 on synthetic images, output of the model trained on the same number of real images.
Categories      | Method                             | ImageNet-pretrained backbone | Accuracy | IoU
Hair/Background | Training on 20 labeled images      | +                            | 0.9515   | 0.8194
Hair/Background | Algorithm 2 with 20 labeled images | +                            | 0.9675   | 0.8759
Car/Background  | Training on 20 labeled images      | -                            | 0.8588   | 0.6983
Car/Background  | Algorithm 2 with 20 labeled images | -                            | 0.9787   | 0.9408
Car/Background  | Training on 20 labeled images      | +                            | 0.9641   | 0.9049
Car/Background  | Algorithm 2 with 20 labeled images | +                            | 0.9862   | 0.9609
Table 1: Comparison of the segmentation models trained with equal amount of supervision. See text for more details.

Results and discussion. Figure 2 (b) shows the outputs of a semantic projection model trained on 20 synthetic images using Algorithm 1. The results of varying the training set size from 1 to 15 synthetic images are shown in Figure 2 (a). The test set in this experiment contains 30 manually annotated GAN-generated images. We observe that even with a single training image the model achieves reasonable segmentation quality, and the quality grows quite slowly after 10 images.

Next, we compare two semantic segmentation models trained with an equal amount of supervision. The first one uses Algorithm 2 with a semantic projection model trained on 20 synthetic images. The second one uses an ImageNet-pretrained backbone and is trained on 20 real images. Table 1 shows a quantitative comparison of the two models. One can notice that when the backbone for DeepLabV3+ is randomly initialized, the model trained with Algorithm 2 is significantly more accurate than the baseline approach. When using ImageNet-pretrained backbones, Algorithm 2 leads to about a 6% improvement in terms of IoU on both datasets. Figure 3 shows examples of hair and car segmentation for real images from the test set.

Our experiments on two datasets demonstrate that a lightweight decoder is sufficient to implement an accurate semantic projection model. We observe that just a few annotated images are enough to train semantic projection to a reasonable accuracy. Algorithm 2 leads to improved accuracy on real images compared to simply training a similar model on the same number of annotated real images.

3 Transfer learning using generator representation

Training a semantic projection model introduced in Section 2.1 requires manual annotation of GAN-generated images. Thus, we cannot use standard real-image datasets for comparison with other works. Real images could potentially be embedded into the GAN latent space, but in practice this approach has its own limitations Bau et al. (2019). Besides, some of the images produced by GAN generators can be hard to label.

A semantic segmentation network transforms an image into a segmentation map, while a GAN generator transforms a random vector $z$ into an image $G(z)$. Obviously, the input dimensions of these two types of models do not match, so models trained for image generation cannot be directly applied to image segmentation. To overcome this issue, one can think of inverting the generator. Inverted GAN generators have been widely used for the task of image manipulation Abdal et al. (2019); Bau et al. (2019). For this purpose, an encoder model is usually trained to predict the latent vector from an image. Following Abdal et al. (2019); Bau et al. (2019), we train an encoder network, but predict the activations of a fixed GAN generator instead of the latent vector. The backbone of the trained encoder can then be used to initialize a semantic segmentation model.

3.1 Unsupervised pretraining with LayerMatch

The scheme of the LayerMatch algorithm is shown in Figure 1 (b). We can view the generator as a function of the latent vector and all of its intermediate activations, $I = G(z, F_1, \dots, F_n)$, where each intermediate feature itself depends on the latent vector and all the previous features, $F_i = F_i(z, F_1, \dots, F_{i-1})$. The generated image is fed to the encoder $E$, which tries to predict the specified activation tensors: $(\hat{F}_{i_1}, \dots, \hat{F}_{i_k}) = E(I)$, where $\{i_1, \dots, i_k\} \subseteq \{1, \dots, n\}$.

The loss function for LayerMatch training consists of two terms:

$$L = L_{match} + L_{rec},$$

where the matching loss $L_{match}$ is the sum of L2-losses between generated and predicted features, penalizing the difference between the outputs of the encoder and the activations of the generator:

$$L_{match} = \sum_{j=1}^{k} \big\| \hat{F}_{i_j} - F_{i_j} \big\|_2^2 .$$

The reconstructed image $\hat{I}$ is obtained by replacing a randomly chosen feature $F_{i_j}$ with its prediction $\hat{F}_{i_j}$, where $j$ is sampled uniformly from $\{1, \dots, k\}$, and recalculating all subsequent features:

$$\hat{I} = G(z, F_1, \dots, F_{i_j - 1}, \hat{F}_{i_j}).$$

The reconstruction loss $L_{rec}$ is the L2-loss between the generated image and the reconstructed image. It ensures that the generator produces an image close to the original one when generator activations are replaced with the outputs of the backbone:

$$L_{rec} = \big\| I - \hat{I} \big\|_2^2 .$$
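
A rough sketch of the two loss terms in PyTorch, assuming a generator that exposes its intermediate activations and can resume synthesis from a given layer; `return_features` and `forward_from` are hypothetical interfaces, not the authors' actual API.

import torch
import torch.nn.functional as F

def layermatch_loss(generator, encoder, z):
    """One LayerMatch training objective on a batch of latent vectors z."""
    with torch.no_grad():
        image, feats = generator(z, return_features=True)   # I, [F_{i_1}, ..., F_{i_k}]
    preds = encoder(image)                                   # predicted activations

    # Matching loss: sum of L2 distances between predicted and true activations.
    l_match = sum(F.mse_loss(p, f) for p, f in zip(preds, feats))

    # Reconstruction loss: replace one randomly chosen activation with its
    # prediction, resume synthesis from that layer, and compare the images.
    j = torch.randint(len(preds), (1,)).item()
    recon = generator.forward_from(layer=j, feature=preds[j], z=z)  # hypothetical
    l_rec = F.mse_loss(recon, image)

    return l_match + l_rec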

Figure 4: (a) - example face image with color-coded semantic annotation, (b) - t-SNE visualization of the features of an ImageNet-pretrained backbone, (c) - similar visualization of the features learnt with LayerMatch. In (b) and (c) each point is color-coded according to the ground truth annotation. See text for more details.

Figure 4 shows t-SNE visualizations of the features from the internal layers of two similar models. Each point on plots (b) and (c) represents a feature vector corresponding to a particular position in an image; we used the activations for 5 images in this visualization. Plot (b) shows the activations of an ImageNet-pretrained model, and plot (c) shows the activations of the model pretrained with LayerMatch. One can observe that the distribution of the features learned by LayerMatch contains semantically meaningful clusters, while the universal features of the ImageNet-pretrained backbone look more scattered.
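
The visualization itself can be reproduced with a few lines, e.g. by treating every spatial position of an intermediate feature map as one point; the `backbone` interface returning a (N, C, H, W) tensor is an assumption.

import torch
from sklearn.manifold import TSNE

@torch.no_grad()
def tsne_embed(backbone, images):
    """Embed per-pixel backbone features of a few images into 2D with t-SNE."""
    feats = backbone(images)                             # assumed shape (N, C, H, W)
    n, c, h, w = feats.shape
    vectors = feats.permute(0, 2, 3, 1).reshape(-1, c)   # one row per spatial position
    return TSNE(n_components=2).fit_transform(vectors.cpu().numpy())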

4 Experiments with LayerMatch

Evaluation protocol.

The standard protocol for evaluating unsupervised learning techniques proposed in Zhang et al. (2016) involves training a model on unlabeled ImageNet, freezing its learned representation, and then training a linear classifier on its outputs using all of the training set labels. This protocol is based on the assumption that the resulting representation is universal and applicable to different domains. Instead, we focus on domain-specific pretraining, i.e. "specializing" the backbone to a particular domain, and we aim at high-resolution tasks such as semantic segmentation. Therefore, we apply a different evaluation protocol.

We assume that we have a high-quality GAN model trained on a large unlabeled dataset from the domain of interest, along with a limited number of annotated images from the same domain. The unlabeled data is used for training a GAN model, which in turn is used for pretraining the backbone with LayerMatch (see Algorithm 3). The pixelwise-annotated data is later used for training a semantic segmentation network with the pretrained backbone using a standard cross-entropy loss. Then, we evaluate the resulting model on a test set using standard semantic segmentation metrics such as mIoU and pixel accuracy. We perform experiments with a varying fraction of labeled data. In all our experiments we initialize the networks with ImageNet-pretrained backbones.

Input: GAN model $G$ trained on a large unlabeled dataset, a small labeled dataset $\{(I_k, M_k)\}$
Output: Semantic segmentation model $S$
1 Generate images $I'_k = G(z_k)$ from random latent vectors $z_k$ and store them along with their features $F(z_k)$
2 Train the backbone using LayerMatch on the pairs $(I'_k, F(z_k))$
3 Train a semantic segmentation model $S$ with the pretrained backbone on the labeled data $\{(I_k, M_k)\}$
Algorithm 3 Semi-supervised training of a segmentation model with LayerMatch pretraining
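
A condensed sketch of Algorithm 3, reusing the `layermatch_loss` sketch from Section 3.1; the two-stage structure follows the text, while the batch size, learning rates, step count, and module interfaces are assumptions.

import torch
import torch.nn.functional as F

def train_with_layermatch(generator, encoder, seg_model, labeled_loader,
                          pretrain_steps=50000, latent_dim=512, batch=8):
    """Algorithm 3 sketch: LayerMatch pretraining followed by supervised fine-tuning."""
    # Stage 1: unsupervised pretraining of the encoder on generated images.
    opt = torch.optim.Adam(encoder.parameters(), lr=1e-4)
    for _ in range(pretrain_steps):
        z = torch.randn(batch, latent_dim)
        loss = layermatch_loss(generator, encoder, z)    # sketch from Section 3.1
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Stage 2: transfer the pretrained backbone and fine-tune on labeled data.
    seg_model.backbone.load_state_dict(encoder.backbone.state_dict())
    opt = torch.optim.Adam(seg_model.parameters(), lr=1e-4)
    for images, masks in labeled_loader:
        loss = F.cross_entropy(seg_model(images), masks)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return seg_model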

Comparison with prior work. For all compared methods we use the same network architectures differing only in training procedure and loss functions used. The first baseline uses a standard ImageNet-pretrained backbone without domain-specific pretraining. The semantic segmentation model is trained using available annotated data and does not use the unlabeled data.

The other two baselines are recent semi-supervised segmentation methods that use both labeled and unlabeled data during training. In the experiments with these methods we used exactly the same amount of labeled and unlabeled data as for LayerMatch. Namely, for the experiments with CelebA-HQ we used both the unlabeled part of CelebA and the FFHQ dataset, which was used for GAN training. For the experiments with LSUN-church, all the unlabeled data in the LSUN-church dataset was used during training.

The first semi-supervised segmentation method that we use for comparison is based on pseudo-labeling Lee (2013): unlabeled data is augmented by generating pseudo-labels from the network predictions, and only the pixels with high-confidence pseudo-labels are used as ground truth for training. The second one is an adversarial semi-supervised segmentation approach Hung et al. (2018), in which the segmentation network is supervised by both the standard cross-entropy loss with the ground-truth label map and an adversarial loss with a discriminator network. In our experiments we used the official implementation provided by the authors and changed only the backbone.
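
For reference, a minimal sketch of the confidence-thresholded pseudo-labeling loss used by the first baseline; the threshold value is an assumption.

import torch
import torch.nn.functional as F

def pseudo_label_loss(seg_model, unlabeled_images, threshold=0.9):
    """Cross-entropy on self-generated labels, restricted to confident pixels."""
    with torch.no_grad():
        probs = F.softmax(seg_model(unlabeled_images), dim=1)
        confidence, pseudo_labels = probs.max(dim=1)          # both (N, H, W)
    logits = seg_model(unlabeled_images)
    per_pixel = F.cross_entropy(logits, pseudo_labels, reduction="none")
    keep = (confidence > threshold).float()                   # mask of trusted pixels
    return (per_pixel * keep).sum() / keep.sum().clamp(min=1.0)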

Figure 5: Comparison of the models trained with Algorithm 3 to semi-supervised segmentation methods for a varying number of annotated samples. (a) - FFHQ+CelebA-HQ dataset. (b) - LSUN-church dataset

Datasets. CelebA-HQ Lee et al. (2019) contains 30,000 high-resolution face images selected from the CelebA dataset Liu et al. (2015). Each image has a 512x512 segmentation mask with 19 classes covering facial components and accessories such as skin, nose, eyes, eyebrows, ears, mouth, lips, hair, hat, eyeglasses, earrings, necklace, neck, cloth, and background. We use a StyleGAN2 model trained on the FFHQ dataset provided in Karras et al. (2019b), which has an FID of 3.31 and a PPL of 125. In the experiments with CelebA-HQ we vary the fraction of labeled data from 1/1024 to the full dataset.

LSUN-church Yu et al. (2015b) contains 126,000 images of churches at 256x256 resolution. We selected the top 10 semantic categories that occupy more than 1% of the image area, namely road, vegetation, building, sidewalk, car, sky, terrain, pole, fence, and wall. We use a StyleGAN2 model provided in Karras et al. (2019b), which has an FID of 3.86 and a PPL of 342. As the LSUN dataset does not contain pixelwise annotation, we take the outputs of the Unified Perceptual Parsing network Xiao et al. (2018) as ground truth in this experiment, similarly to Bau et al. (2019). In the experiments with LSUN-church we vary the fraction of labeled data from 1/4096 to the full dataset.

Implementation details. HRNet Sun et al. (2019) is used as the encoder architecture. We add an auxiliary head for each of the activations that we want to predict (see Figure 1 (b)). After training, the auxiliary heads are discarded and only the pretrained backbone is used for transfer learning, similar to ImageNet pretraining. For pretraining the encoder we use the Adam optimizer with cosine learning rate decay. We use the source code from the HRNet repository for training the semantic segmentation networks.
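
A sketch of this setup, with 1x1-convolution auxiliary heads mapping backbone features to the channel widths of the predicted generator activations; the head design, channel numbers, learning rate, and schedule length are assumptions.

import torch
import torch.nn as nn

class EncoderWithAuxHeads(nn.Module):
    """HRNet-style backbone plus one auxiliary head per predicted activation.
    The heads are discarded after pretraining; only the backbone is transferred."""
    def __init__(self, backbone, feat_channels, target_channels):
        super().__init__()
        self.backbone = backbone
        self.heads = nn.ModuleList(
            nn.Conv2d(c_in, c_out, kernel_size=1)
            for c_in, c_out in zip(feat_channels, target_channels)
        )

    def forward(self, x):
        feats = self.backbone(x)          # assumed: list of multi-resolution feature maps
        return [head(f) for head, f in zip(self.heads, feats)]

# Adam with cosine decay, as described above (hyperparameter values assumed):
# encoder = EncoderWithAuxHeads(hrnet_backbone, [48, 96, 192, 384], [512, 512, 256, 128])
# optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-4)
# scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50000)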

Results and discussion. Figure 5 compares the proposed LayerMatch pretraining scheme to three baseline methods on two datasets with a varying fraction of annotated data. Note that pseudo-labeling is only applicable when part of the dataset is unlabeled.

One can see that LayerMatch pretraining shows significantly higher IoU than the baseline methods on CelebA-HQ (see Figure 5 (a)) for any fraction of labeled data. On LSUN-church it shows higher accuracy than the other methods when up to 1/512 of the data is annotated. Figure 6 shows a qualitative comparison of the model pretrained with LayerMatch to standard ImageNet pretraining when trained with 1/512 of the annotated data. The difference between the two models is quite noticeable for both CelebA-HQ and LSUN-church. Table 2 shows category-wise results for all four compared models trained with 1/512 of the labeled data. LayerMatch pretraining leads to a significant accuracy improvement for the eyeglasses category.

Overall, LayerMatch pretraining improves results in the semi-supervised learning scenario compared to both simple ImageNet pretraining and semi-supervised segmentation methods. The lower accuracy for larger fractions of annotated data on LSUN-church can be attributed to the lower quality of the LSUN-church GAN generator compared to the CelebA-HQ one. Another possible reason for this effect may be the imperfect annotation of both the training and test data, which may lead to inaccuracies in evaluation.

Method                        | pixAcc | mIoU | background | skin | nose | eye glasses | left eye | right eye | left brow | right brow | left ear | right ear | mouth | upper lip | lower lip | hair | hat | earrings | necklace | neck | cloth
ImageNet only                 | .92    | .62  | .89        | .89  | .87  | .01         | .75      | .77       | .67       | .68        | .71      | .72       | .76   | .74       | .79       | .87  | 0.  | .34      | 0.       | .77  | .62
AdvSemiSeg Hung et al. (2018) | .91    | .62  | .89        | .89  | .85  | .14         | .75      | .77       | .66       | .66        | .71      | .70       | .77   | .73       | .78       | .86  | 0.  | .29      | 0.       | .75  | .59
Pseudo-labeling Lee (2013)    | .92    | .63  | .89        | .90  | .86  | .01         | .77      | .78       | .71       | .70        | .73      | .72       | .79   | .75       | .79       | .86  | 0.  | .33      | 0.       | .77  | .58
LayerMatch                    | .93    | .67  | .90        | .91  | .87  | .68         | .78      | .78       | .70       | .69        | .75      | .74       | .79   | .76       | .80       | .87  | 0.  | .34      | 0.       | .80  | .64
Table 2: Comparison of segmentation models trained on the CelebA-HQ dataset with an equal amount of supervision. Notice that LayerMatch provides better results for almost all categories and improves the IoU for the eye glasses category several times over.
Figure 6: Test results. First row: CelebA-HQ, 1/512, second row: LSUN-church, 1/512. From left to right: input image, LayerMatch pretraining, ImageNet pretraining only.

5 Related work

Several works consider generative models for unsupervised pretraining Makhzani et al. (2015); Larsen et al. (2015); Donahue et al. (2016); Donahue and Simonyan (2019). One approach Radford et al. (2015) uses the representation learnt by the discriminator. Another line of research extends GANs to a bidirectional framework (BiGAN) by introducing an auxiliary encoder branch that predicts the latent vector from a natural image Donahue et al. (2016); Donahue and Simonyan (2019). The encoder learnt via the BiGAN framework can be used as a feature extractor for downstream tasks Donahue and Simonyan (2019). The use of GANs as universal feature extractors has severe limitations. First, GANs are not always capable of learning a multimodal distribution, as they tend to suffer from mode collapse Liu et al. (2019). The trade-off between GAN precision and recall is still difficult to control Kynkäänniemi et al. (2019). Besides, training a GAN on a large dataset of high-resolution images requires an extremely large computational budget, which makes ImageNet-scale experiments prohibitively expensive. Our approach differs from this line of work, as we use a GAN to specialize a model to a particular domain rather than trying to obtain universal feature representations. We explore the representation of a GAN generator that, to the best of our knowledge, has not been previously considered for transfer learning.

Bau et al. Bau et al. (2018) show that the activations of a generator are highly correlated with semantic segmentation masks of the generated image. One of the means for analyzing the latent space and the internal representation of a generator is latent embedding, i.e. finding a latent vector that corresponds to a particular image. Several methods for embedding images into the GAN latent space have been proposed Karras et al. (2019b); Bau et al. (2019); Abdal et al. (2019), bringing interesting insights about generator representations. For instance, they made it possible to demonstrate that some semantic categories are systematically missing in GAN-generated images Bau et al. (2019). Similarly to these works, we invert a GAN generator using both feature approximation and image reconstruction losses, although we do not aim at reconstructing the latent code and only approximate the activations of the generator's layers.

While image-level classification has been extensively studied in the semi-supervised setting, dense pixel-level classification with limited data has only drawn attention recently. Most works on semi-supervised semantic segmentation borrow ideas from semi-supervised image classification and generalize them to high-resolution tasks. Hung et al. (2018) adopt an adversarial learning scheme and propose a fully convolutional discriminator that learns to differentiate ground-truth label maps from the probability maps of segmentation predictions. Mittal et al. (2019) use two network branches that link semi-supervised classification with semi-supervised segmentation, including self-training.

6 Conclusion

We study the use of GAN generators for learning domain-specific representations. We show that the representation of a GAN generator can be easily projected onto a semantic segmentation map using a lightweight decoder. We then propose the LayerMatch scheme for unsupervised domain-specific pretraining, which is based on approximating the generator representation. We present experiments in the semi-supervised learning scenario and compare against recent semi-supervised semantic segmentation methods.

References

  • R. Abdal, Y. Qin, and P. Wonka (2019) Image2StyleGAN: how to embed images into the stylegan latent space?. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4432–4441. Cited by: §3, §5.
  • M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein gan. arXiv preprint arXiv:1701.07875. Cited by: §1.
  • D. Bau, J. Zhu, H. Strobelt, B. Zhou, J. B. Tenenbaum, W. T. Freeman, and A. Torralba (2018) Gan dissection: visualizing and understanding generative adversarial networks. arXiv preprint arXiv:1811.10597. Cited by: §5.
  • D. Bau, J. Zhu, J. Wulff, W. Peebles, H. Strobelt, B. Zhou, and A. Torralba (2019) Seeing what a gan cannot generate. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4502–4511. Cited by: §3, §3, §4, §5.
  • A. Brock, J. Donahue, and K. Simonyan (2018) Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096. Cited by: §1.
  • L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, Cited by: §2.2.
  • X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel (2016) Infogan: interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pp. 2172–2180. Cited by: §1.
  • J. Donahue, P. Krähenbühl, and T. Darrell (2016) Adversarial feature learning. arXiv preprint arXiv:1605.09782. Cited by: §1, §5.
  • J. Donahue and K. Simonyan (2019) Large scale adversarial representation learning. In Advances in Neural Information Processing Systems, pp. 10541–10551. Cited by: §1, §5.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §1.
  • W. Hung, Y. Tsai, Y. Liou, Y. Lin, and M. Yang (2018) Adversarial learning for semi-supervised semantic segmentation. arXiv preprint arXiv:1802.07934. Cited by: §1, Table 2, §4, §5.
  • T. Karras, T. Aila, S. Laine, and J. Lehtinen (2017) Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196. Cited by: §1.
  • T. Karras, S. Laine, and T. Aila (2018) A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948. Cited by: §2.2.
  • T. Karras, S. Laine, and T. Aila (2019a) A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4401–4410. Cited by: §1.
  • T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila (2019b) Analyzing and improving the image quality of stylegan. arXiv preprint arXiv:1912.04958. Cited by: §1, §4, §4, §5.
  • D. P. Kingma and P. Dhariwal (2018) Glow: generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10215–10224. Cited by: §1.
  • T. Kynkäänniemi, T. Karras, S. Laine, J. Lehtinen, and T. Aila (2019) Improved precision and recall metric for assessing generative models. In Advances in Neural Information Processing Systems, pp. 3929–3938. Cited by: §5.
  • A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther (2015) Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300. Cited by: §5.
  • C. Lee, Z. Liu, L. Wu, and P. Luo (2019) MaskGAN: towards diverse and interactive facial image manipulation. arXiv preprint arXiv:1907.11922. Cited by: §1, §4.
  • D. Lee (2013) Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, Vol. 3, pp. 2. Cited by: §1, Table 2, §4.
  • K. Liu, W. Tang, F. Zhou, and G. Qiu (2019) Spectral regularization for combating mode collapse in gans. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6382–6390. Cited by: §5.
  • Z. Liu, P. Luo, X. Wang, and X. Tang (2015) Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), Cited by: §4.
  • A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey (2015) Adversarial autoencoders. arXiv preprint arXiv:1511.05644. Cited by: §5.
  • L. Mescheder, A. Geiger, and S. Nowozin (2018) Which training methods for gans do actually converge?. arXiv preprint arXiv:1801.04406. Cited by: §1.
  • S. Mittal, M. Tatarchenko, and T. Brox (2019) Semi-supervised semantic segmentation with high-and low-level consistency. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §5.
  • T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida (2018) Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957. Cited by: §1.
  • A. Radford, L. Metz, and S. Chintala (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434. Cited by: §1, §5.
  • K. Sun, Y. Zhao, B. Jiang, T. Cheng, B. Xiao, D. Liu, Y. Mu, X. Wang, W. Liu, and J. Wang (2019) High-resolution representations for labeling pixels and regions. arXiv preprint arXiv:1904.04514. Cited by: §4.
  • T. Xiao, Y. Liu, B. Zhou, Y. Jiang, and J. Sun (2018) Unified perceptual parsing for scene understanding. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434. Cited by: §4.
  • F. Yu, A. Seff, Y. Zhang, S. Song, T. Funkhouser, and J. Xiao (2015a) Lsun: construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365. Cited by: §1.
  • F. Yu, Y. Zhang, S. Song, A. Seff, and J. Xiao (2015b) LSUN: construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365. Cited by: §4.
  • R. Zhang, P. Isola, and A. A. Efros (2016) Colorful image colorization. In European conference on computer vision, pp. 649–666. Cited by: §4.