StyleGAN2 is a state-of-the-art network for generating realistic images. Moreover, it was explicitly trained to have disentangled directions in latent space, which allows efficient image manipulation by varying latent factors. Editing existing images requires embedding a given image into the latent space of StyleGAN2. Latent code optimization via backpropagation is commonly used for high-quality embedding of real-world images, although it is prohibitively slow for many applications. We propose a way to distill a particular image manipulation of StyleGAN2 into an image-to-image network trained in a paired way. The resulting pipeline is an alternative to existing GANs trained on unpaired data. We provide results on human face transformations: gender swap, aging/rejuvenation, style transfer, and image morphing. We show that the quality of generation using our method is comparable to StyleGAN2 backpropagation and current state-of-the-art methods in these particular tasks.
Generative adversarial networks (GANs) have created wide opportunities in image manipulation. The general public is familiar with them through the many applications that offer to change one's face in some way: making it older or younger, adding glasses, a beard, etc.
Two types of network architecture can perform such translations in a feed-forward manner: networks trained on paired datasets and networks trained on unpaired ones. In practice, only unpaired datasets are used; the corresponding methods are based on cycle consistency. Follow-up studies [23, 10, 11] reach a maximum resolution of 256x256.
At the same time, existing paired methods (e.g. pix2pixHD or SPADE) support resolutions up to 2048x1024. However, it is very difficult or even impossible to collect a paired dataset for tasks such as age manipulation. For each person, such a dataset would have to contain photos taken at different ages, with the same head position and facial expression. Close examples of such datasets exist, e.g. CACD and AgeDB, although with varying expressions and face orientations. To the best of our knowledge, they have never been used to train neural networks in a paired mode.
These obstacles can be overcome by making a synthetic paired dataset, provided we solve two known issues of dataset generation: the appearance gap and the content gap. Here, unconditional generation methods such as StyleGAN can be of use. StyleGAN generates images whose quality is close to real-world photos and whose distribution is close to the real one, as indicated by its low FID scores. Thus, the output of this generative model can be a good substitute for real-world images. The properties of its latent space allow creating sets of images that differ in particular parameters. The addition of path length regularization (introduced as a quality measure in ) in the second version of StyleGAN makes the latent space even more suitable for manipulations.
Basic operations in the latent space correspond to particular image manipulation operations. Adding a vector, linear interpolation, and crossover in latent space lead to expression transfer, morphing, and style transfer, respectively. The distinctive feature of both versions of the StyleGAN architecture is that the latent code is applied several times at different layers of the network. Changing the vector for some layers changes the generated image at the corresponding scales. The authors group the spatial resolutions of the generation process into coarse, middle, and fine ones. It is possible to combine two people by using one person's code at one scale and the other person's at another.
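The first two operations can be sketched with plain array arithmetic. This is a toy illustration: the 512-dimensional codes and the attribute direction below are random placeholders, not real StyleGAN2 latents.

```python
import numpy as np

rng = np.random.default_rng(0)
w_a = rng.standard_normal(512)   # stand-in for one intermediate latent code
w_b = rng.standard_normal(512)

# Adding a direction vector shifts an attribute of the generated face.
v = rng.standard_normal(512)     # placeholder attribute direction
w_shifted = w_a + 1.5 * v

# Linear interpolation between two codes corresponds to morphing.
alpha = 0.5
w_morph = (1 - alpha) * w_a + alpha * w_b
```

The crossover operation additionally requires per-layer codes and is discussed with the style mixing experiments.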
The operations mentioned above are easily performed for images with known embeddings. For many entertainment purposes it is vital to manipulate an existing real-world image on the fly, e.g. to edit a photo that has just been taken. Unfortunately, in all the cases of successful search in latent space described in the literature, backpropagation-based optimization was used [2, 1, 15, 30, 46]. Feed-forward inference is only reported to work as an initial state for latent code optimization. Slow inference makes the application of image manipulation with StyleGAN2 in production very limited: it is costly in a data center and almost impossible to run on a device. However, there are examples of backpropagation run in production, e.g. .
In this paper, we consider opportunities to distill [20, 4] a particular image manipulation of the StyleGAN2 generator trained on the FFHQ dataset. Distillation allows extracting the information about faces' appearance and the ways they can change (e.g. aging, gender swap) from StyleGAN into an image-to-image network. We propose a way to generate a paired dataset and then train a "student" network on the gathered data. This method is very flexible and is not limited to a particular image-to-image model.
Although the resulting image-to-image network is trained only on generated samples, we show that it performs on real-world images on par with StyleGAN backpropagation and current state-of-the-art algorithms trained on unpaired data.
Our contributions are summarized as follows:
We create synthetic datasets of paired images to solve several tasks of image manipulation on human faces: gender swap, aging/rejuvenation, style transfer and face morphing;
We show that it is possible to train an image-to-image network on synthetic data and then apply it to real-world images;
We study the qualitative and quantitative performance of image-to-image networks trained on the synthetic datasets;
We show that our approach outperforms existing approaches in the gender swap task;
We publish all collected paired datasets for reproducibility and future research: https://github.com/EvgenyKashin/stylegan2-distillation.
Following the success of ProgressiveGAN and BigGAN, StyleGAN became the state-of-the-art image generation model. This was achieved by rethinking the generator architecture and borrowing approaches from style transfer networks: a mapping network and AdaIN, a constant input, noise addition, and mixing regularization. The next version, StyleGAN2, removes the artifacts of the first version by revising AdaIN and improves disentanglement by using perceptual path length as a regularizer.
The mapping network is a key component of StyleGAN: it transforms the latent space Z into a less entangled intermediate latent space W. Instead of the actual latent z sampled from a normal distribution, the code w produced by the mapping network is fed to AdaIN. It is also possible to sample vectors from the extended space W+, which consists of multiple independent samples of w, one for each layer of the generator. Varying w at different layers changes details of the generated picture at different scales.
It was recently shown [16, 26] that linear operations in the latent space of a generator allow successful image manipulations in a variety of domains and with various GAN architectures. GANalyze directs attention to searching for interpretable directions in the latent space of BigGAN, using MemNet as an "assessor" network. Jahanian et al. show that walks in latent space lead to interpretable changes across different model architectures: BigGAN, StyleGAN, and DCGAN.
To manipulate real images in the latent space of StyleGAN, one needs to find their embeddings in it. The method of searching for the embedding in the intermediate latent space via backpropagation-based optimization is described in [2, 1, 15, 46]. The authors use non-trivial loss functions to find a close and perceptually good image, and show that the embedding fits better in the extended space W+. Gabbay et al. show that the StyleGAN generator can be used as a general-purpose image prior. Shen et al. show the possibility to manipulate the appearance of a generated person, including age, gender, eyeglasses, and pose, for both PGGAN and StyleGAN. The authors of StyleGAN2 propose to search for embeddings in W instead of W+ to check whether a picture was generated by StyleGAN2.
Pix2pix is one of the first conditional generative models applied to image-to-image translation; it learns a mapping from input to output images. Chen and Koltun propose the first model that can synthesize 2048x1024 images; it is followed by pix2pixHD and SPADE. In the SPADE generator, each normalization layer uses the segmentation mask to modulate the layer activations, so its usage is limited to translation from segmentation maps. There are numerous follow-up works based on the pix2pixHD architecture, including ones working with video [7, 52, 53].
The idea of applying cycle consistency to training on unpaired data was first introduced in CycleGAN. Unpaired image-to-image translation methods are either single-mode GANs [60, 58, 35, 10] or multimodal GANs [61, 23, 32, 33, 36, 11]. FUNIT supports multi-domain image translation using a few reference images from a target domain. StarGAN v2 provides both latent-guided and reference-guided synthesis. All of the above-mentioned methods operate at a resolution of at most 256x256 when applied to human faces.
Face aging/rejuvenation is a special task that receives a lot of attention [59, 49, 18]. The formulation of the problem varies. The simplest version of the task is making faces look older or younger. A more difficult one is to produce faces matching particular age intervals [34, 55, 57, 37]. GAN proposes continuous age changing using weight interpolation between the transforms corresponding to the two closest age groups.
Synthetic datasets are widely used to extend datasets for some analysis tasks (e.g. classification). In many cases, a simple graphics engine can be used to generate synthetic data. To perform well on real-world images, this data needs to overcome both the appearance gap [21, 14, 50, 51, 48] and the content gap [27, 45]. It was also shown that BigGAN does not capture the ImageNet data distributions and is only partly successful for data augmentation. Shrivastava et al. reduce the quality drop of this approach by revising the training setup.
Synthetic data also underlies knowledge distillation, a technique that allows training a "student" network on data generated by a "teacher" network [20, 4]. This additional source of data can be used to improve metrics or to reduce the size of the target model. Aguinaldo et al. show that knowledge distillation is successfully applicable to generative models.
All of the images used in our datasets are generated with the official implementation of StyleGAN2 (https://github.com/NVlabs/stylegan2). In addition, we only use the config-f checkpoint pretrained by the authors of StyleGAN2 on the FFHQ dataset. All manipulations are performed with the disentangled intermediate latent codes.
We use the most straightforward way of generating datasets for style mixing and face morphing. Style mixing is described in  as a regularization technique and requires applying two intermediate latent codes at different scales. Face morphing corresponds to linear interpolation of intermediate latent codes. We generate 50,000 samples for each task. Each sample consists of two source images and a target image. Each source image is obtained by randomly sampling z from a normal distribution, mapping it to an intermediate latent code w, and generating the image with StyleGAN2. We produce the target image by performing the corresponding operation on the latent codes and feeding the result to StyleGAN2.
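The sampling procedure for the morphing triplets can be sketched as follows. The `mapping` and `synthesis` arguments stand in for the corresponding StyleGAN2 networks; they are hypothetical wrappers, not the official API.

```python
import numpy as np

def make_morphing_triplet(mapping, synthesis, rng):
    """One training sample: two source images and their morphing target."""
    z_a = rng.standard_normal(512)          # z ~ N(0, I)
    z_b = rng.standard_normal(512)
    w_a, w_b = mapping(z_a), mapping(z_b)   # intermediate latent codes
    w_mid = 0.5 * (w_a + w_b)               # averaged code -> morphed face
    return synthesis(w_a), synthesis(w_b), synthesis(w_mid)
```

For style mixing, the only change is that `w_mid` is replaced by a crossover of `w_a` and `w_b` at chosen layers.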
Face attributes, such as gender or age, are not explicitly encoded in the StyleGAN2 latent space or the intermediate space. To overcome this limitation, we use a separate pretrained face classification network. Its outputs include the confidence of face detection, an age bin, and gender. The network is proprietary; therefore, we release the final version of our gender and age datasets to maintain full reproducibility of this work (https://github.com/EvgenyKashin/stylegan2-distillation).
We create the gender and age datasets in four major steps. First, we generate an intermediate dataset mapping latent vectors to target attributes, as illustrated in Fig. 2. Second, we find the direction in latent space associated with the attribute. Third, we generate a raw dataset using the above-mentioned vector, as briefly described in Fig. 3. Finally, we filter the images to get the final dataset. The method is described below in more detail.
Generate random latent vectors z, map them to intermediate latent codes w, and generate the corresponding image samples with StyleGAN2.
Get attribute predictions from the pretrained classification network.
Filter out images where faces were detected with low confidence (this helps to reduce generation artifacts in the dataset while maintaining high variability, as opposed to lowering the truncation-psi parameter). Then select only the images with high classification certainty.
Find the center of every class and the transition vectors from one class to another.
Generate random samples and pass them through the mapping network. For the gender swap task, create a set of five images by moving each code along the gender direction with different magnitudes. For aging/rejuvenation, first predict the faces' attributes, then use the corresponding vectors to generate faces that should be two bins older/younger.
Get predictions for every image in the raw dataset and filter out low-confidence detections.
From every set of images, select a pair based on the classification results. Each image must belong to the corresponding class with high certainty.
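The class centers and transition vectors used in the steps above can be sketched minimally, assuming `W` is an (N, 512) array of intermediate latents and `labels` holds the attribute class of each sample (both names are illustrative):

```python
import numpy as np

def class_centers(W, labels):
    """Mean latent code of each attribute class."""
    return {c: W[labels == c].mean(axis=0) for c in np.unique(labels)}

def transition_vector(centers, src, dst):
    """Direction that moves a latent from class `src` toward class `dst`."""
    return centers[dst] - centers[src]
```

Adding `transition_vector(centers, src, dst)` to a code from class `src` then pushes the generated face toward class `dst`.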
Once we have aligned data, a paired image-to-image translation network can be trained.
In this work, we focus on illustrating the general approach rather than solving every task as well as possible. As a result, we choose to train pix2pixHD (https://github.com/NVIDIA/pix2pixHD) as a unified framework for image-to-image translation instead of selecting a custom model for every type of task.
It is known that pix2pixHD produces blob artifacts (https://github.com/NVIDIA/pix2pixHD/issues/46) and also tends to repeat patterns. The problem with repeated patterns is solved in [29, 41]; light blobs are a problem solved in StyleGAN2. We suppose that a similar treatment could also be applied to pix2pixHD. Fortunately, even vanilla pix2pixHD trained on our datasets produces sufficiently good results with few or no artifacts. Thus, we leave improving or replacing pix2pixHD for future work. We run most of our experiments and comparisons at 512x512 resolution, but also try 1024x1024 for gender swap.
The style mixing and face morphing tasks require two input images to be fed to the network at the same time. This is done by setting the number of input channels to 6 and concatenating the inputs along the channel axis.
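Concretely, the two-input conditioning amounts to a single concatenation (a minimal sketch; NCHW layout is assumed):

```python
import numpy as np

# Two 3-channel source images stacked into one 6-channel input
# for the two-input tasks (style mixing, face morphing).
img_a = np.zeros((1, 3, 512, 512), dtype=np.float32)
img_b = np.ones((1, 3, 512, 512), dtype=np.float32)
net_input = np.concatenate([img_a, img_b], axis=1)  # (1, 6, 512, 512)
```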
Although StyleGAN2 can be trained on data of a different nature, we concentrate our efforts only on face data. We show the application of our method to several tasks: gender swap, aging/rejuvenation, style mixing, and face morphing. In all our experiments, we collect data from StyleGAN2 trained on the FFHQ dataset.
Only the gender transformation task (both directions) is used for evaluation. We use the Fréchet inception distance (FID) for quantitative comparison of the methods, as well as human evaluation.
For each feed-forward baseline, we calculate FID between 50,000 real images from the FFHQ dataset and 20,000 generated images, using 20,000 FFHQ images as source images. For each source image, we apply the transformation to the other gender, assuming the source gender is determined by our classification model. Before calculating FID, all images are resized to 256x256 for a fair comparison.
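For reference, FID is the Fréchet distance between two Gaussians fitted to Inception activations of the two image sets. A minimal numpy sketch of the formula follows; in practice, the activations come from an InceptionV3 pooling layer, which is omitted here.

```python
import numpy as np

def fid(act_real, act_fake):
    """Fréchet distance between Gaussians fitted to two activation sets."""
    mu1, mu2 = act_real.mean(axis=0), act_fake.mean(axis=0)
    s1 = np.cov(act_real, rowvar=False)
    s2 = np.cov(act_fake, rowvar=False)
    # tr(sqrtm(s1 @ s2)) equals the sum of square roots of the
    # (real, non-negative) eigenvalues of s1 @ s2 for PSD s1, s2.
    eig = np.linalg.eigvals(s1 @ s2).real
    tr_sqrt = np.sqrt(np.clip(eig, 0, None)).sum()
    return float(((mu1 - mu2) ** 2).sum() + np.trace(s1 + s2) - 2 * tr_sqrt)
```

Identical activation sets give a distance of zero, and the score grows as the fitted Gaussians drift apart.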
Human evaluation is also used for a more accurate comparison with optimization-based methods. Our study consists of two surveys. In the first one, crowdsource workers were asked to select the image showing the better translation from one gender to the other, given a source image. They were also instructed to consider the preservation of the person's identity, clothes, and accessories. We refer to this measure as "quality". The second survey asked them to select the most realistic image; no source image was provided. All images were resized to 512x512 in this comparison.
The first task shows which method is best at performing the transformation; the second, which method produces the most realistic output. We use side-by-side experiments for both tasks, where one side is our method and the other is one of the optimization-based baselines. Answer choices are shuffled. For each comparison of our method with a baseline, we generate 1000 questions, and each question is answered by 10 different people. For answer aggregation, we use the Dawid-Skene method and filter out examples with a confidence level below 95% (approximately 4% of all questions).
We generate a paired dataset of male and female faces according to the method described above and then train a separate pix2pixHD model for each direction of gender translation.
We compare against both unpaired image-to-image methods and different StyleGAN embedders with latent code optimization. For the comparison with unpaired methods, we choose StarGAN (https://github.com/yunjey/stargan), MUNIT (https://github.com/NVlabs/MUNIT), and StarGAN v2 (https://github.com/taki0112/StarGAN_v2-Tensorflow; an unofficial implementation, so its results may differ from the official one). We train all these methods on FFHQ classified into males and females.
Fig. 4 shows a qualitative comparison between our approach and the unpaired image-to-image ones. It demonstrates that the distilled transformation has significantly better visual quality and more stable results. The quantitative comparison in Table 4(a) confirms our observations.
StyleGAN2 provides an official projection method. It operates in W, which only allows finding faces generated by the model itself, not real-world images. Therefore, we also build a similar method for W+ for comparison. It optimizes a separate code w for each layer of the generator, which helps to better reconstruct a given image. After finding the embedding, we can add the transformation vector described above and generate a transformed image.
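The projection loop can be illustrated with a toy example. Here a fixed linear map stands in for the generator and plain gradient descent on a pixel loss stands in for the actual optimizer; the real projection uses perceptual losses and the full StyleGAN2 synthesis network.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((64, 8))       # stand-in linear "generator"
target = G @ rng.standard_normal(8)    # "image" to be reconstructed

# Safe step size for gradient descent on 0.5 * ||G w - target||^2.
H = G.T @ G
lr = 1.0 / np.linalg.eigvalsh(H).max()

w = np.zeros(8)                        # latent code being optimized
for _ in range(300):
    residual = G @ w - target          # reconstruction error
    w -= lr * (G.T @ residual)         # gradient step
```

After convergence, `w + transition_vector` would be fed back to the generator to obtain the transformed image.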
We also add previous approaches [40, 5] for finding the latent code to the comparison, even though they are based on the first version of StyleGAN. StyleGAN Encoder adds more advanced loss functions and a feed-forward approximation of the optimization starting point, which improves reconstruction results.
Since unpaired methods show significantly worse quality, we put more effort into comparing the different methods of searching for an embedding through optimization. We avoid FID-based comparisons here because all of these methods are built on the same StyleGAN model. Moreover, FID cannot measure the "quality of transformation" because it does not check identity preservation. We therefore make the user study our main measure for all StyleGAN-based methods. Fig. 5 shows a qualitative comparison of all the methods. Our method performs better in terms of transformation quality, and only StyleGAN Encoder outperforms it in realism. However, that method generates the background unconditionally.
We find that pix2pixHD keeps more details of the transformed images than all the encoders. We suppose this is due to the ability of pix2pixHD to pass unchanged content through the network: pix2pixHD solves an easier task than the encoders, which are forced to encode all the information about the image into one vector.
Figs. 4 and 5 also show the drawbacks of our approach. The "gender" vector is not perfectly disentangled due to some bias in the attribute distribution of FFHQ and, consequently, correlations in the StyleGAN latent space. For example, translation into female faces can also add a smile.
We also encounter problems of the pix2pixHD architecture: repeated patterns, light blobs, and difficulties with fine-tuning at 1024x1024 resolution. We show an uncurated list of generated images in the supplementary materials.
To show that our approach can be applied to other image-to-image transformation tasks, we carry out a similar experiment with face age manipulation. First, we estimate the age of all generated images and group them into several bins. Then, for each bin, we find the "+2 bins" and "-2 bins" vectors. Using these vectors, we generate a united paired dataset in which each pair contains a younger and an older version of the same face. Finally, we train two pix2pixHD networks, one for each direction.
Examples of the application of this approach are presented in Fig. 6.
There are 18 AdaIN inputs in the StyleGAN2 architecture. These AdaINs work at different spatial resolutions, and changing a different input changes details at a different scale. The authors divide them into three groups: coarse styles (4x4–8x8 spatial resolutions), middle styles (16x16–32x32), and fine styles (64x64–1024x1024). The opportunity to change coarse, middle, or fine details is a unique feature of the StyleGAN architectures.
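The crossover corresponding to these groups can be sketched on random placeholder codes of shape (18, 512), one per AdaIN input; the layer boundaries below follow the coarse/middle/fine split at layers 0–3, 4–7, and 8–17:

```python
import numpy as np

rng = np.random.default_rng(0)
w_a = rng.standard_normal((18, 512))  # person A, one code per layer
w_b = rng.standard_normal((18, 512))  # person B

def mix_styles(w_src, w_style, layers):
    """Copy `w_style` into the given layer range of `w_src`."""
    w = w_src.copy()
    w[layers] = w_style[layers]
    return w

# Person A with person B's coarse styles (pose, face shape).
w_coarse_mix = mix_styles(w_a, w_b, slice(0, 4))
# Person A with person B's fine styles (color scheme, fine texture).
w_fine_mix = mix_styles(w_a, w_b, slice(8, 18))
```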
We collect datasets of triplets (two source images and their mixture) and train our models for each transformation. We concatenate the two images into 6 channels to feed our pix2pixHD model. Fig. 7(a,b,c) shows the results of style mixing.
Another simple linear operation is averaging two latent codes, which corresponds to the morphing operation on images. We collect another dataset of latent code triplets: two random codes and their average. Examples of face morphing are shown in Fig. 7(d).
In this paper, we combine unconditional image generation and paired image-to-image GANs to distill a particular image manipulation in the latent code of StyleGAN2 into a single image-to-image translation network. The resulting technique shows both fast inference and impressive quality. It outperforms existing unpaired image-to-image models in FID score, and it outperforms StyleGAN Encoder approaches both in the user study and in inference time on the gender swap task. We show that the approach is also applicable to other image manipulations, such as aging/rejuvenation and style transfer.
Our framework has several limitations. The StyleGAN2 latent space is not perfectly disentangled, so the transformations made by our network are not perfectly pure; however, the impurities are not severe.
We use only the pix2pixHD network, although different architectures may fit different tasks better. Besides, we distill every transformation into a separate model, although a universal model could be trained. These opportunities should be investigated in future studies.
Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134.
StyleGAN encoder for official TensorFlow implementation. GitHub: https://github.com/Puzer/stylegan-encoder
Age progression/regression by conditional adversarial autoencoder. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5810–5818.