StyleGAN2 Distillation for Feed-forward Image Manipulation

03/07/2020 ∙ by Yuri Viazovetskyi, et al. ∙ Yandex

StyleGAN2 is a state-of-the-art network for generating realistic images. Moreover, it was explicitly trained to have disentangled directions in latent space, which allows efficient image manipulation by varying latent factors. Editing existing images requires embedding a given image into the latent space of StyleGAN2. Latent code optimization via backpropagation is commonly used for the qualitative embedding of real-world images, although it is prohibitively slow for many applications. We propose a way to distill a particular image manipulation of StyleGAN2 into an image-to-image network trained in a paired way. The resulting pipeline is an alternative to existing GANs trained on unpaired data. We provide results for transformations of human faces: gender swap, aging/rejuvenation, style transfer, and image morphing. We show that the quality of generation using our method is comparable to StyleGAN2 backpropagation and current state-of-the-art methods in these particular tasks.




1 Introduction

Generative adversarial networks (GANs) [17] have created wide opportunities in image manipulation. The general public is familiar with them from the many applications that offer to change one's face in some way: make it older or younger, add glasses, a beard, etc.

There are two types of network architectures that can perform such translations feed-forward: networks trained on paired datasets and networks trained on unpaired ones. In practice, only unpaired datasets are used. The methods there are based on cycle consistency [60], and the follow-up studies [23, 10, 11] support a maximum resolution of 256x256.

At the same time, existing paired methods (e.g. pix2pixHD [54] or SPADE [41]) support resolutions up to 2048x1024. But it is very difficult or even impossible to collect a paired dataset for a task like age manipulation: for each person, such a dataset would have to contain photos taken at different ages, with the same head position and facial expression. Close examples of such datasets exist, e.g. CACD [8] and AgeDB [39], although with different expressions and face orientations. To the best of our knowledge, they have never been used to train neural networks in a paired mode.

These obstacles can be overcome by making a synthetic paired dataset, provided we solve two known issues concerning dataset generation: the appearance gap [21] and the content gap [27]. Here, unconditional generation methods such as StyleGAN [29] can be of use. StyleGAN generates images whose quality and distribution are close to those of real-world photos, as indicated by its low FID. Thus the output of this generative model can be a good substitute for real-world images. The properties of its latent space make it possible to create sets of images differing only in particular parameters. The addition of path length regularization (introduced as a measure of quality in [29]) in the second version of StyleGAN [30] makes the latent space even more suitable for manipulations.

Basic operations in the latent space correspond to particular image manipulation operations: adding a vector, linear interpolation, and crossover in latent space lead to expression transfer, morphing, and style transfer, respectively. A distinctive feature of both versions of the StyleGAN architecture is that the latent code is applied several times at different layers of the network. Changing the vector for some layers will lead to changes at the corresponding scales of the generated image. The authors group the spatial resolutions involved in generation into coarse, middle, and fine. It is thus possible to combine two people by using one person's code at one scale and the other person's at another.
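The three latent-space operations above can be sketched with plain arrays. The layer count, code dimensionality, and crossover point below are illustrative stand-ins for StyleGAN2's actual configuration (18 style inputs of 512 dimensions at 1024x1024); the attribute direction is a random placeholder:

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_LAYERS, DIM = 18, 512               # StyleGAN2 at 1024x1024: 18 style inputs

w_a = rng.standard_normal((NUM_LAYERS, DIM))   # person A, one code per layer
w_b = rng.standard_normal((NUM_LAYERS, DIM))   # person B
v_attr = rng.standard_normal(DIM)              # attribute direction (illustrative)

# Adding a vector: shift every layer's code along an attribute direction.
w_shifted = w_a + 1.5 * v_attr

# Linear interpolation: morphing between two faces.
alpha = 0.5
w_morph = (1 - alpha) * w_a + alpha * w_b

# Crossover: person A's coarse layers, person B's middle and fine layers.
w_mix = np.vstack([w_a[:4], w_b[4:]])
```

Each resulting array would then be fed to the generator in place of the original per-layer codes.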

Operations mentioned above are easily performed for images with known embeddings. For many entertainment purposes it is vital to manipulate an existing real-world image on the fly, e.g. to edit a photo that has just been taken. Unfortunately, in all cases of successful latent-space search described in the literature, backpropagation-based optimization was used [2, 1, 15, 30, 46]. A feed-forward pass is only reported to work as an initial state for latent code optimization [5]. Slow inference makes the application of image manipulation with StyleGAN2 in production very limited: it is expensive in a data center and almost impossible to run on-device. However, there are examples of backpropagation running in production, e.g. [47].

In this paper we consider opportunities to distill [20, 4] a particular image manipulation of the StyleGAN2 generator trained on the FFHQ dataset. Distillation allows us to extract the information about faces' appearance and the ways they can change (e.g. aging, gender swap) from StyleGAN into an image-to-image network. We propose a way to generate a paired dataset and then train a "student" network on the gathered data. This method is very flexible and is not limited to a particular image-to-image model.

Even though the resulting image-to-image network is trained only on generated samples, we show that it performs on real-world images on par with StyleGAN backpropagation and current state-of-the-art algorithms trained on unpaired data.

Our contributions are summarized as follows:

  • We create synthetic datasets of paired images to solve several tasks of image manipulation on human faces: gender swap, aging/rejuvenation, style transfer and face morphing;

  • We show that it is possible to train image-to-image network on synthetic data and then apply it to real world images;

  • We study the qualitative and quantitative performance of image-to-image networks trained on the synthetic datasets;

  • We show that our approach outperforms existing approaches on the gender swap task.

We publish all collected paired datasets for reproducibility and future research.

2 Related work

2.0.1 Unconditional image generation

Following the success of ProgressiveGAN [28] and BigGAN [6], StyleGAN [29] became the state-of-the-art image generation model. This was achieved by rethinking the generator architecture and borrowing approaches from style transfer networks: a mapping network and AdaIN [22], constant input, noise addition, and mixing regularization. The next version, StyleGAN2 [30], removes the artifacts of the first version by revising AdaIN and improves disentanglement by using perceptual path length as a regularizer.

The mapping network is a key component of StyleGAN: it transforms the latent space Z into a less entangled intermediate latent space W. Instead of the actual latent code z, sampled from a normal distribution, the intermediate code w produced by the mapping network is fed to AdaIN. It is also possible to sample vectors from the extended space W+, which consists of multiple independent samples of w, one for each layer of the generator. Varying w at different layers changes the details of the generated picture at different scales.

2.0.2 Latent codes manipulation

It was recently shown [16, 26] that linear operations in the latent space of a generator allow successful image manipulations in a variety of domains and with various GAN architectures. In GANalyze [16], attention is directed to searching for interpretable directions in the latent space of BigGAN [6], using MemNet [31] as an "assessor" network. Jahanian et al. [26] show that walks in latent space lead to interpretable changes in different model architectures: BigGAN, StyleGAN, and DCGAN [42].

To manipulate real images in the latent space of StyleGAN, one first needs to find their embeddings in it. The method of searching for the embedding in the intermediate latent space W via backpropagation-based optimization is described in [2, 1, 15, 46]. The authors use non-trivial loss functions to find an image that is both close and perceptually good, and show that the embedding fits better in the extended space W+. Gabbay et al. [15] show that the StyleGAN generator can be used as a general-purpose image prior. Shen et al. [46] show the opportunity to manipulate the appearance of a generated person, including age, gender, eyeglasses, and pose, for both PGGAN [28] and StyleGAN. The authors of StyleGAN2 [30] propose searching for embeddings in W instead of W+ to check whether a picture was generated by StyleGAN2.

2.0.3 Paired Image-to-image translation

Pix2pix [25] is one of the first conditional generative models applied to image-to-image translation. It learns a mapping from input to output images. Chen and Koltun [9] propose the first model that can synthesize 2048x1024 images. It is followed by pix2pixHD [54] and SPADE [41]. In the SPADE generator, each normalization layer uses the segmentation mask to modulate the layer activations, so its usage is limited to translation from segmentation maps. There are numerous follow-up works based on the pix2pixHD architecture, including ones working with video [7, 52, 53].

2.0.4 Unpaired Image-to-image translation

The idea of applying cycle consistency to train on unpaired data was first introduced in CycleGAN [60]. The methods of unpaired image-to-image translation can be either single-mode GANs [60, 58, 35, 10] or multimodal GANs [61, 23, 32, 33, 36, 11]. FUNIT [36] supports multi-domain image translation using a few reference images from a target domain. StarGAN v2 [11] provides both latent-guided and reference-guided synthesis. All of the above-mentioned methods operate at a resolution of at most 256x256 when applied to human faces.

Gender swap is one of the well-known tasks of unsupervised image-to-image translation [10, 11, 37].

Face aging/rejuvenation is a special task that receives a lot of attention [59, 49, 18]. The formulation of the problem can vary. The simplest version of this task is making faces look older or younger [10]. A more difficult task is to produce faces matching particular age intervals [34, 55, 57, 37]. S2GAN [18] proposes continuous age changing using weight interpolation between the transforms corresponding to the two closest age groups.

2.0.5 Training on synthetic data

Synthetic datasets are widely used to extend datasets for analysis tasks (e.g. classification). In many cases, a simple graphics engine can be used to generate synthetic data. To perform well on real-world images, such data needs to overcome both the appearance gap [21, 14, 50, 51, 48] and the content gap [27, 45].

Ravuri et al. [43] study the quality of a classifier trained on synthetic data generated by BigGAN and show [44] that BigGAN does not capture the ImageNet [13] data distribution and is only partly successful for data augmentation. Shrivastava et al. [48] reduce the quality drop of this approach by revising the training setup.

Synthetic data is what underlies knowledge distillation, a technique that trains a "student" network using data generated by a "teacher" network [20, 4]. This additional source of data can be used to improve metrics [56] or to reduce the size of the target model [38]. Aguinaldo et al. [3] show that knowledge distillation is successfully applicable to generative models.

3 Method overview

3.1 Data collection

Figure 2: Method of finding correspondence between latent codes and facial attributes.

All of the images used in our datasets are generated using the official implementation of StyleGAN2. We use only the config-f checkpoint pretrained by the authors of StyleGAN2 on the FFHQ dataset. All manipulations are performed on the disentangled image codes w.

We use the most straightforward way of generating datasets for style mixing and face morphing. Style mixing is described in [29] as a regularization technique and requires using two intermediate latent codes w1 and w2 at different scales. Face morphing corresponds to linear interpolation of intermediate latent codes. We generate 50 000 samples for each task. Each sample consists of two source images and a target image. Each source image is obtained by sampling z randomly from a normal distribution, mapping it to an intermediate latent code w, and generating an image with StyleGAN2. We produce the target image by performing the corresponding operation on the latent codes and feeding the result to StyleGAN2.
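The triplet generation for these two tasks can be sketched as follows. The mapping network here is a placeholder random linear map, and the crossover layer is an assumed split point; only the structure (sample two codes, combine them, emit source/source/target) mirrors the described procedure:

```python
import numpy as np

rng = np.random.default_rng(2)
NUM_LAYERS, DIM = 18, 512

# Placeholder for StyleGAN2's mapping network f: Z -> W; a fixed random
# linear map, used only to illustrate the data flow.
MAPPING = rng.standard_normal((DIM, DIM)) / np.sqrt(DIM)

def mapping_network(z):
    return z @ MAPPING

def make_sample(task="morphing", crossover_layer=4):
    """Return (w1, w2, w_target); each code would be fed to the generator."""
    z1, z2 = rng.standard_normal((2, DIM))
    w1 = np.tile(mapping_network(z1), (NUM_LAYERS, 1))  # one w per layer (W+)
    w2 = np.tile(mapping_network(z2), (NUM_LAYERS, 1))
    if task == "morphing":
        w_target = 0.5 * (w1 + w2)          # average of the two codes
    else:                                   # style mixing: crossover
        w_target = np.vstack([w1[:crossover_layer], w2[crossover_layer:]])
    return w1, w2, w_target
```

Generating 50 000 such triplets and rendering all three codes yields the paired dataset.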

Face attributes, such as gender or age, are not explicitly encoded in the StyleGAN2 latent space or intermediate space. To overcome this limitation we use a separate pretrained face classification network. Its outputs include the confidence of face detection, an age bin, and gender. The network is proprietary; therefore, we release the final versions of our gender and age datasets in order to maintain full reproducibility of this work.


We create the gender and age datasets in four major steps. First, we generate an intermediate dataset mapping latent vectors to target attributes, as illustrated in Fig. 2. Second, we find the direction in latent space associated with the attribute. Third, we generate a raw dataset using the above-mentioned vector, as briefly described in Fig. 3. Finally, we filter the images to get the final dataset. The method is described below in more detail.

  1. Generate random latent vectors z, map them to intermediate latent codes w, and generate the corresponding image samples with StyleGAN2.

  2. Get attribute predictions (age and gender) for each image from the pretrained classification network.

  3. Filter out images where faces were detected with low confidence (this helps to reduce generation artifacts in the dataset while maintaining high variability, as opposed to lowering the truncation-psi parameter). Then select only images with high classification certainty.

  4. Find the center of every class and the transition vectors from one class to another.

  5. Generate random samples and pass them through the mapping network. For the gender swap task, create a set of five images along the gender direction. For aging/rejuvenation, first predict the faces' attributes, then use the corresponding vectors to generate faces that should be two bins older/younger.

  6. Get predictions for every image in the raw dataset. Filter out by confidence.

  7. From every set of images, select a pair based on classification results. Each image must belong to the corresponding class with high certainty.
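Steps 4-5 (class centers, transition vectors, and candidate sets) can be sketched with toy data; the labels below are random stand-ins for the proprietary classifier's predictions, and the multipliers along the direction are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 512

# Toy data: intermediate codes w with labels from the attribute classifier.
w_codes = rng.standard_normal((1000, DIM))
genders = rng.integers(0, 2, size=1000)        # 0 = female, 1 = male (toy)

# Step 4: class centers and the transition vector between them.
center_f = w_codes[genders == 0].mean(axis=0)
center_m = w_codes[genders == 1].mean(axis=0)
v_f2m = center_m - center_f                    # female -> male direction

# Step 5 (gender swap): a set of five candidate codes along the direction.
w = rng.standard_normal(DIM)
candidates = [w + t * v_f2m for t in (-2.0, -1.0, 0.0, 1.0, 2.0)]
```

Steps 6-7 then render each candidate, re-classify it, and keep the pair with the highest class certainty.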

Once we have aligned data, a paired image-to-image translation network can be trained.

Figure 3: Dataset generation. We first sample random vectors z from a normal distribution. Then for each sample we generate a set of images along the vector corresponding to a facial attribute. Finally, for each set of images we select the best pair based on the classification results.

3.2 Training process

In this work, we focus on illustrating the general approach rather than on solving every task as well as possible. As a result, we choose to train pix2pixHD [54] as a unified framework for image-to-image translation instead of selecting a custom model for every type of task.

It is known that pix2pixHD produces blob artifacts and also tends to repeat patterns [41]. The problem with repeated patterns is solved in [29, 41], and the light-blob problem is solved in StyleGAN2.

We suppose that a similar treatment could also be applied to pix2pixHD. Fortunately, even vanilla pix2pixHD trained on our datasets produces sufficiently good results with little or no artifacts, so we leave improving or replacing pix2pixHD for future work. We perform most of our experiments and comparisons at 512x512 resolution, but also try 1024x1024 for gender swap.

Style mixing and face averaging tasks require two input images to be fed to the network at the same time. This is done by setting the number of input channels to 6 and concatenating the inputs along the channel axis.
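The two-input setup amounts to a simple channel concatenation; a minimal sketch, assuming a channel-first layout (in the official pix2pixHD implementation this corresponds to configuring `input_nc = 6`):

```python
import numpy as np

# Two source images in channel-first layout; the student network receives
# a single 6-channel tensor built from both.
h, w = 512, 512
img_a = np.zeros((3, h, w), dtype=np.float32)   # stand-in for source image A
img_b = np.ones((3, h, w), dtype=np.float32)    # stand-in for source image B
x = np.concatenate([img_a, img_b], axis=0)      # shape (6, H, W)
```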

4 Experiments

Although StyleGAN2 can be trained on data of different nature, we concentrate our efforts only on face data. We show the application of our method to several tasks: gender swap, aging/rejuvenation, style mixing, and face morphing. In all our experiments we collect data from StyleGAN2 trained on the FFHQ dataset [29].

4.1 Evaluation protocol

Only the gender transformation task (both directions) is used for evaluation. We use the Fréchet inception distance (FID) [19] for the quantitative comparison of methods, as well as human evaluation.

For each feed-forward baseline we calculate FID between 50 000 real images from the FFHQ dataset and 20 000 generated images, using 20 000 images from FFHQ as source images. For each source image we apply the transformation to the other gender, with the source gender determined by our classification model. Before calculating FID, all images are resized to 256x256 for a fair comparison.
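The FID computation behind this protocol can be sketched as follows. The toy 8-dimensional features stand in for the 2048-dimensional Inception-v3 activations used in practice; the matrix square root is taken via the eigenvalues of the covariance product:

```python
import numpy as np

def fid(mu1, sigma1, mu2, sigma2):
    # Fréchet distance between two Gaussians fitted to feature activations:
    # ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2}).
    diff = mu1 - mu2
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    covmean_trace = np.sqrt(np.clip(eigvals.real, 0, None)).sum()
    return diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2 * covmean_trace

rng = np.random.default_rng(3)
# Toy "features": in the real protocol these are Inception-v3 activations
# of 50k real and 20k generated images resized to 256x256.
real = rng.standard_normal((500, 8))
fake = rng.standard_normal((500, 8)) + 0.5
mu_r, s_r = real.mean(0), np.cov(real, rowvar=False)
mu_f, s_f = fake.mean(0), np.cov(fake, rowvar=False)
score = fid(mu_r, s_r, mu_f, s_f)
```

A distribution compared with itself gives a score of (numerically) zero, and the score grows as the generated distribution drifts from the real one.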

Human evaluation is also used for a more accurate comparison with optimization-based methods. Our study consists of two surveys. In the first one, crowdsource workers were asked to select the image showing the better translation from one gender to the other, given a source image. They were also instructed to consider the preservation of the person's identity, clothes, and accessories. We refer to this measure as "quality". The second survey asked workers to select the most realistic image; no source image was provided. All images were resized to 512x512 in this comparison.

The first task should show which method is best at performing the transformation; the second, which produces the most realistic output. We use side-by-side experiments for both tasks, where one side is our method and the other is one of the optimization-based baselines. Answer choices are shuffled. For each comparison of our method with a baseline, we generate 1000 questions, and each question is answered by 10 different people. For answer aggregation we use the Dawid-Skene method [12] and filter out examples with a confidence level below 95% (approximately 4% of all questions).
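The aggregation step can be illustrated with a one-coin simplification of Dawid-Skene EM (a single accuracy per worker instead of a full confusion matrix); this is a sketch of the idea, not the exact implementation used in the study:

```python
import numpy as np

def one_coin_em(answers, n_iter=50):
    """One-coin simplification of Dawid-Skene EM for binary questions.

    answers: (n_tasks, n_workers) array of 0/1 votes.
    Returns the per-task posterior probability of answer "1".
    """
    t = answers.mean(axis=1)                     # initialise with majority vote
    for _ in range(n_iter):
        # M-step: each worker's accuracy under the current posteriors.
        acc = (t[:, None] * answers
               + (1 - t)[:, None] * (1 - answers)).mean(axis=0)
        acc = np.clip(acc, 1e-6, 1 - 1e-6)
        # E-step: posterior that the true answer is "1" for each task.
        log1 = (answers * np.log(acc) + (1 - answers) * np.log(1 - acc)).sum(axis=1)
        log0 = ((1 - answers) * np.log(acc) + answers * np.log(1 - acc)).sum(axis=1)
        t = 1.0 / (1.0 + np.exp(log0 - log1))
    return t

# Toy example: 3 questions, 10 workers; keep only confident (>= 95%) answers.
votes = np.array([[1] * 10, [0] * 10, [1] * 9 + [0] * 1])
posteriors = one_coin_em(votes)
confident = np.maximum(posteriors, 1 - posteriors) >= 0.95
```

Questions whose posterior confidence falls below the threshold are discarded, mirroring the 95% filter above.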

4.2 Distillation of image-to-image translation

4.2.1 Gender swap

We generate a paired dataset of male and female faces according to the method described above and then train a separate pix2pixHD model for each gender translation direction.

(a) Female to male
(b) Male to female
Figure 4: Gender transformation: comparison with image-to-image translation approaches. MUNIT and StarGAN v2* are multimodal, so we show one random realization for each.

We compete with both unpaired image-to-image methods and different StyleGAN embedders with latent code optimization. For the comparison with unpaired methods we choose StarGAN [10], MUNIT [24], and StarGAN v2* [11] (an unofficial implementation, so its results may differ from the official one). We train all these methods on FFHQ classified into males and females.

Fig. 4 shows a qualitative comparison between our approach and unpaired image-to-image ones. It demonstrates that the distilled transformation has significantly better visual quality and more stable results. The quantitative comparison in Table 4(a) confirms our observations.

StyleGAN2 provides an official projection method. It operates in W, which only allows finding faces generated by the model itself, but not real-world images. Therefore, we also build a similar method operating in W+ for comparison. It optimizes a separate w for each layer of the generator, which helps to better reconstruct a given image. After finding the embedding, we can add the transformation vector described above and generate a transformed image.
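The W+ projection can be illustrated schematically. The linear "generator" and L2 loss below are stand-ins for StyleGAN2 and its perceptual losses; only the optimization structure (a separate code per layer, updated by gradient descent through the generator) matches the described method:

```python
import numpy as np

rng = np.random.default_rng(5)
NUM_LAYERS, DIM, OUT = 18, 64, 128

# Toy stand-in for the generator: a fixed linear map per layer whose
# contributions are summed into one "image" vector.
G = rng.standard_normal((NUM_LAYERS, DIM, OUT)) / np.sqrt(DIM)

def generate(w_plus):
    return np.einsum("ld,ldo->o", w_plus, G)

# The "real image" to embed (here: rendered from a hidden true code).
target = generate(rng.standard_normal((NUM_LAYERS, DIM)))

# Optimise a separate w per layer (the W+ parametrisation) by gradient descent.
w_plus = np.zeros((NUM_LAYERS, DIM))
lr = 0.04
for _ in range(200):
    err = generate(w_plus) - target          # residual of the L2 loss
    grad = np.einsum("o,ldo->ld", err, G)    # gradient w.r.t. each layer's w
    w_plus -= lr * grad
```

After convergence, the recovered `w_plus` reconstructs the target, and the transformation vector can then be added to it.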

We also add earlier approaches to finding the latent code [40, 5] to the comparison, even though they are based on the first version of StyleGAN. StyleGAN Encoder [5] adds more advanced loss functions to [40] and a feed-forward approximation of the optimization starting point, which improves reconstruction results.

Method FID
StarGAN [10] 29.7
MUNIT [23] 40.2
StarGAN v2* [11] 25.6
Ours 14.7
Real images 3.3
(a) Comparison with unpaired image-to-image methods
Method Quality Realism
StyleGAN Encoder [40] 18% 14%
StyleGAN Encoder [5] 30% 68%
StyleGAN2 projection (W) 22% 22%
StyleGAN2 projection (W+) 11% 14%
Real images - 85%
(b) User study of StyleGAN-based approaches. Win rate "method vs. ours"
Table 1: Quantitative comparison. We use FID as the main measure for comparison with other image-to-image translation baselines. We use a user study for all StyleGAN-based approaches because we consider human evaluation a more reliable measure of perception.
Figure 5: Gender transformation: comparison with StyleGAN2 latent code optimization methods. Input samples are real images from FFHQ. Notice that unusual objects are lost with optimization but kept with image-to-image translation.

Since unpaired methods show significantly worse quality, we put more effort into comparisons between the different methods of searching for an embedding through optimization. We avoid using FID here because all of these methods are based on the same StyleGAN model. Moreover, FID cannot measure the "quality of transformation" because it does not check the preservation of identity. So we make the user study our main measure for all StyleGAN-based methods. Fig. 5 shows a qualitative comparison of all the methods. It is visible that our method performs better in terms of transformation quality; only StyleGAN Encoder [5] outperforms our method in realism. However, that method generates the background unconditionally.

We find that pix2pixHD keeps more details in the transformed images than all the encoders. We suppose this is achieved due to the ability of pix2pixHD to pass part of the unchanged content through the network. Pix2pixHD solves an easier task than the encoders, which are forced to encode all the information about the image into one vector.

Fig. 4 and 5 also show drawbacks of our approach. The "gender" vector is not perfectly disentangled due to some bias in the attribute distribution of FFHQ and, consequently, latent space correlations of StyleGAN [46]. For example, it can be seen that translation into female faces can also add a smile.

We also encounter problems of the pix2pixHD architecture: repeated patterns, light blobs, and difficulties with finetuning at 1024x1024 resolution. We show an uncurated list of generated images in the supplementary materials.

4.2.2 Aging/rejuvenation

To show that our approach can be applied to other image-to-image transformation tasks, we also carry out a similar experiment with face age manipulation. First, we estimate the age of all generated images, then group them into several bins. After that, for each bin we find the "+2 bins" and "-2 bins" vectors. Using these vectors, we generate a united paired dataset in which each pair contains younger and older versions of the same face. Finally, we train two pix2pixHD networks, one for each of the two directions.
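The "+2 bins"/"-2 bins" shifts can be sketched as follows, with random toy centers standing in for the per-bin mean codes computed from classified samples (clamping at the range ends is our assumption for boundary bins):

```python
import numpy as np

rng = np.random.default_rng(4)
DIM, NUM_BINS = 512, 8

# Hypothetical per-bin class centers: mean w code of samples classified
# into each age bin, computed as for the gender classes.
bin_centers = rng.standard_normal((NUM_BINS, DIM))

def age_shift(w, src_bin, delta):
    # Move a code from its predicted age bin towards a bin `delta` away,
    # clamping at the ends of the bin range.
    dst_bin = min(max(src_bin + delta, 0), NUM_BINS - 1)
    return w + (bin_centers[dst_bin] - bin_centers[src_bin])

w = rng.standard_normal(DIM)
older = age_shift(w, src_bin=3, delta=+2)    # "+2 bins" vector
younger = age_shift(w, src_bin=3, delta=-2)  # "-2 bins" vector
```

Rendering `younger` and `older` for each sample yields the paired aging dataset.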

Figure 6: Aging/rejuvenation. Source images are sampled from FFHQ.

Examples of the application of this approach are presented in Fig. 6.

4.3 Distillation of style mixing

4.3.1 Style mixing and face morphing

There are 18 AdaIN inputs in the StyleGAN2 architecture. These AdaINs work with different spatial resolutions, and changing different inputs will change details at different scales. The authors divide them into three groups: coarse styles (for 4x4–8x8 spatial resolutions), middle styles (16x16–32x32), and fine styles (64x64–1024x1024). The opportunity to change coarse, middle, or fine details is a unique feature of the StyleGAN architectures.

Figure 7: Style mixing with pix2pixHD. (a), (b), (c) show the results of the distilled crossover of two latent codes in W+; (d) shows the result of the average latent code transformation. Source images are sampled from FFHQ.

We collect datasets of triplets (two source images and their mixture) and train our models for each transformation. We concatenate two images into 6 channels to feed our pix2pixHD model. Fig. 7(a,b,c) shows the results of style mixing.

Another simple linear operation is to average two latent codes. It corresponds to morphing operation on images. We collect another dataset with triplet latent codes: two random codes and an average one. The examples of face morphing are shown in Fig. 7(d).

5 Conclusions

In this paper, we unite unconditional image generation and paired image-to-image GANs to distill a particular image manipulation in the latent code of StyleGAN2 into a single image-to-image translation network. The resulting technique shows both fast inference and impressive quality. It outperforms existing unpaired image-to-image models in FID score, and StyleGAN Encoder approaches both in the user study and in inference time on the gender swap task. We show that the approach is also applicable to other image manipulations, such as aging/rejuvenation and style transfer.

Our framework has several limitations. The StyleGAN2 latent space is not perfectly disentangled, so the transformations made by our network are not perfectly pure; however, the impurities are not severe.

We use only the pix2pixHD network, although different architectures may fit different tasks better. Besides, we distill every transformation into a separate model, although a universal model could be trained. This opportunity should be investigated in future studies.


  • [1] R. Abdal, Y. Qin, and P. Wonka (2019) Image2StyleGAN: how to embed images into the stylegan latent space?. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4432–4441. Cited by: §1, §2.0.2.
  • [2] R. Abdal, Y. Qin, and P. Wonka (2019) Image2StyleGAN++: how to edit the embedded images?. arXiv preprint arXiv:1911.11544. Cited by: §1, §2.0.2.
  • [3] A. Aguinaldo, P. Chiang, A. Gain, A. Patil, K. Pearson, and S. Feizi (2019) Compressing gans using knowledge distillation. arXiv preprint arXiv:1902.00159. Cited by: §2.0.5.
  • [4] J. Ba and R. Caruana (2014) Do deep nets really need to be deep?. In Advances in neural information processing systems, pp. 2654–2662. Cited by: §1, §2.0.5.
  • [5] P. Baylies (2019) StyleGAN encoder - converts real images to latent space. GitHub. Cited by: §1, 4(b), §4.2.1, §4.2.1.
  • [6] A. Brock, J. Donahue, and K. Simonyan (2018) Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096. Cited by: §2.0.1, §2.0.2.
  • [7] C. Chan, S. Ginosar, T. Zhou, and A. A. Efros (2019) Everybody dance now. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5933–5942. Cited by: §2.0.3.
  • [8] B. Chen, C. Chen, and W. H. Hsu (2014) Cross-age reference coding for age-invariant face recognition and retrieval. In European conference on computer vision, pp. 768–783. Cited by: §1.
  • [9] Q. Chen and V. Koltun (2017) Photographic image synthesis with cascaded refinement networks. In Proceedings of the IEEE international conference on computer vision, pp. 1511–1520. Cited by: §2.0.3.
  • [10] Y. Choi, M. Choi, M. Kim, J. Ha, S. Kim, and J. Choo (2018) StarGAN: unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Cited by: §1, §2.0.4, §2.0.4, §2.0.4, 4(a), §4.2.1.
  • [11] Y. Choi, Y. Uh, J. Yoo, and J. Ha (2019) StarGAN v2: diverse image synthesis for multiple domains. arXiv preprint arXiv:1912.01865. Cited by: §1, §2.0.4, §2.0.4, §4.2.1.
  • [12] A. P. Dawid and A. M. Skene (1979) Maximum likelihood estimation of observer error-rates using the em algorithm. Journal of the Royal Statistical Society: Series C (Applied Statistics) 28 (1), pp. 20–28. Cited by: §4.1.
  • [13] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei (2009) ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, Cited by: §2.0.5.
  • [14] G. French, M. Mackiewicz, and M. Fisher (2017) Self-ensembling for visual domain adaptation. arXiv preprint arXiv:1706.05208. Cited by: §2.0.5.
  • [15] A. Gabbay and Y. Hoshen (2019) Style generator inversion for image enhancement and animation. arXiv preprint arXiv:1906.11880. Cited by: §1, §2.0.2.
  • [16] L. Goetschalckx, A. Andonian, A. Oliva, and P. Isola (2019-10) GANalyze: toward visual definitions of cognitive image properties. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §2.0.2.
  • [17] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §1.
  • [18] Z. He, M. Kan, S. Shan, and X. Chen (2019) S2GAN: share aging factors across ages and share aging trends among individuals. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9440–9449. Cited by: §2.0.4.
  • [19] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in neural information processing systems, pp. 6626–6637. Cited by: §4.1.
  • [20] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §1, §2.0.5.
  • [21] J. Hoffman, E. Tzeng, T. Park, J. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell (2017) Cycada: cycle-consistent adversarial domain adaptation. arXiv preprint arXiv:1711.03213. Cited by: §1, §2.0.5.
  • [22] X. Huang and S. Belongie (2017) Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1501–1510. Cited by: §2.0.1.
  • [23] X. Huang, M. Liu, S. Belongie, and J. Kautz (2018) Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 172–189. Cited by: §1, §2.0.4, 4(a).
  • [24] X. Huang, M. Liu, S. Belongie, and J. Kautz (2018) Multimodal unsupervised image-to-image translation. In ECCV, Cited by: §4.2.1.
  • [25] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134. Cited by: §2.0.3.
  • [26] A. Jahanian, L. Chai, and P. Isola (2019) On the ”steerability” of generative adversarial networks. arXiv preprint arXiv:1907.07171. Cited by: §2.0.2.
  • [27] A. Kar, A. Prakash, M. Liu, E. Cameracci, J. Yuan, M. Rusiniak, D. Acuna, A. Torralba, and S. Fidler (2019) Meta-sim: learning to generate synthetic datasets. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4551–4560. Cited by: §1, §2.0.5.
  • [28] T. Karras, T. Aila, S. Laine, and J. Lehtinen (2017) Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196. Cited by: §2.0.1, §2.0.2.
  • [29] T. Karras, S. Laine, and T. Aila (2019) A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4401–4410. Cited by: §1, §2.0.1, §3.1, §3.2, §4.
  • [30] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila (2019) Analyzing and improving the image quality of stylegan. arXiv preprint arXiv:1912.04958. Cited by: §1, §1, §2.0.1, §2.0.2.
  • [31] A. Khosla, A. S. Raju, A. Torralba, and A. Oliva (2015) Understanding and predicting image memorability at a large scale. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2390–2398. Cited by: §2.0.2.
  • [32] H. Lee, H. Tseng, J. Huang, M. K. Singh, and M. Yang (2018) Diverse image-to-image translation via disentangled representations. In European Conference on Computer Vision, Cited by: §2.0.4.
  • [33] H. Lee, H. Tseng, Q. Mao, J. Huang, Y. Lu, M. K. Singh, and M. Yang (2019) DRIT++: diverse image-to-image translation viadisentangled representations. arXiv preprint arXiv:1905.01270. Cited by: §2.0.4.
  • [34] P. Li, Y. Hu, Q. Li, R. He, and Z. Sun (2018) Global and local consistent age generative adversarial networks. In 2018 24th International Conference on Pattern Recognition (ICPR), pp. 1073–1078. Cited by: §2.0.4.
  • [35] M. Liu, T. Breuel, and J. Kautz (2017) Unsupervised image-to-image translation networks. In Advances in neural information processing systems, pp. 700–708. Cited by: §2.0.4.
  • [36] M. Liu, X. Huang, A. Mallya, T. Karras, T. Aila, J. Lehtinen, and J. Kautz (2019) Few-shot unsupervised image-to-image translation. arXiv preprint arXiv:1905.01723. Cited by: §2.0.4.
  • [37] Y. Liu, Q. Li, and Z. Sun (2019) Attribute-aware face aging with wavelet-based generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11877–11886. Cited by: §2.0.4, §2.0.4.
  • [38] S. Mirzadeh, M. Farajtabar, A. Li, and H. Ghasemzadeh (2019) Improved knowledge distillation via teacher assistant: bridging the gap between student and teacher. arXiv preprint arXiv:1902.03393. Cited by: §2.0.5.
  • [39] S. Moschoglou, A. Papaioannou, C. Sagonas, J. Deng, I. Kotsia, and S. Zafeiriou (2017) Agedb: the first manually collected, in-the-wild age database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 51–59. Cited by: §1.
  • [40] D. Nikitko (2019) StyleGAN – encoder for official tensorflow implementation. GitHub. Cited by: 4(b), §4.2.1.
  • [41] T. Park, M. Liu, T. Wang, and J. Zhu (2019) Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2337–2346. Cited by: §1, §2.0.3, §3.2.
  • [42] A. Radford, L. Metz, and S. Chintala (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434. Cited by: §2.0.2.
  • [43] S. Ravuri and O. Vinyals (2019) Classification accuracy score for conditional generative models. In Advances in Neural Information Processing Systems, pp. 12247–12258. Cited by: §2.0.5.
  • [44] S. Ravuri and O. Vinyals (2019) Seeing is not necessarily believing: limitations of biggans for data augmentation. Cited by: §2.0.5.
  • [45] N. Ruiz, S. Schulter, and M. Chandraker (2018) Learning to simulate. arXiv preprint arXiv:1810.02513. Cited by: §2.0.5.
  • [46] Y. Shen, J. Gu, X. Tang, and B. Zhou (2019) Interpreting the latent space of gans for semantic face editing. arXiv preprint arXiv:1907.10786. Cited by: §1, §2.0.2, §4.2.1.
  • [47] T. Shi, Y. Yuan, C. Fan, Z. Zou, Z. Shi, and Y. Liu (2019) Face-to-parameter translation for game character auto-creation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 161–170. Cited by: §1.
  • [48] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb (2017) Learning from simulated and unsupervised images through adversarial training. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2107–2116. Cited by: §2.0.5, §2.0.5.
  • [49] J. Song, J. Zhang, L. Gao, X. Liu, and H. T. Shen (2018) Dual conditional gans for face aging and rejuvenation. In IJCAI, pp. 899–905. Cited by: §2.0.4.
  • [50] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel (2017) Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ international conference on intelligent robots and systems (IROS), pp. 23–30. Cited by: §2.0.5.
  • [51] Y. Tsai, W. Hung, S. Schulter, K. Sohn, M. Yang, and M. Chandraker (2018) Learning to adapt structured output space for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7472–7481. Cited by: §2.0.5.
  • [52] T. Wang, M. Liu, A. Tao, G. Liu, J. Kautz, and B. Catanzaro (2019) Few-shot video-to-video synthesis. In Conference on Neural Information Processing Systems (NeurIPS), Cited by: §2.0.3.
  • [53] T. Wang, M. Liu, J. Zhu, G. Liu, A. Tao, J. Kautz, and B. Catanzaro (2018) Video-to-video synthesis. arXiv preprint arXiv:1808.06601. Cited by: §2.0.3.
  • [54] T. Wang, M. Liu, J. Zhu, A. Tao, J. Kautz, and B. Catanzaro (2018) High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8798–8807. Cited by: §1, §2.0.3, §3.2.
  • [55] Z. Wang, X. Tang, W. Luo, and S. Gao (2018) Face aging with identity-preserved conditional generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7939–7947. Cited by: §2.0.4.
  • [56] Q. Xie, E. Hovy, M. Luong, and Q. V. Le (2019) Self-training with noisy student improves imagenet classification. arXiv preprint arXiv:1911.04252. Cited by: §2.0.5.
  • [57] H. Yang, D. Huang, Y. Wang, and A. K. Jain (2019) Learning continuous face age progression: a pyramid of gans. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §2.0.4.
  • [58] Z. Yi, H. Zhang, P. Tan, and M. Gong (2017) DualGAN: unsupervised dual learning for image-to-image translation. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §2.0.4.
  • [59] Z. Zhang, Y. Song, and H. Qi (2017) Age progression/regression by conditional adversarial autoencoder. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5810–5818. Cited by: §2.0.4.
  • [60] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223–2232. Cited by: §1, §2.0.4.
  • [61] J. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman (2017) Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems, Cited by: §2.0.4.