1 Introduction
Semantic modifications over a human face have multiple applications, including forensic art [fu2010age] and cross-face verification [park2010age]. Recently, many generative models have shown their efficiency in generating photorealistic human faces with the required semantic modification. For example, generative adversarial networks (GANs) like progressiveGAN
[karras2017progressive] and styleGAN [karras2019style] can generate high-quality images with controls to create semantic modifications over attributes such as age, gender and smile [shen2020interpreting]. Despite the capability to generate semantic changes over an image sampled from the latent space, these models do not provide any encoding scheme to map a real-world image onto the latent space and generate similar modifications. Recently, [zhu2020domain] proposed a two-way encoding scheme to use the pretrained styleGAN to generate similar semantic changes over a real-world human face. Their method involves first learning an initial latent vector generated by an encoder network and then using the learned latent vector for a domain-regularized search. However, their method involves training an additional encoder network, which can be a time-consuming and resource-intensive process. An alternative to such GAN-based architectures are latent space-based autoencoders like styleALAE
[pidhorskyi2020adversarial]. Unlike other GAN-based models, styleALAE trains both an encoder network (to map real-world input images to the latent space) and a decoder network (to map latent vectors to the image space) within the same training paradigm. This ensures that there is no additional overhead of training a separate network to work with out-of-training-dataset images for semantic manipulation. Unlike traditional autoencoders, which use an MSE-based reconstruction loss for image generation [kingma2013auto], styleALAE uses an adversarial loss with a compositional network architecture to ensure that the generated output is of high quality. Even the decoder network in styleALAE uses the styleGAN-based architecture to ensure the photorealism of the generated output. It further uses a latent space-based MSE loss to ensure reciprocity between the input image and the reconstructed output image [ulyanov2018takes]. [zhu2020domain] showed that for efficient GAN-based inversion and to generate the necessary semantic modifications, the encoder network needs to be trained with the gradients generated by the GAN model. Therefore, the process of training both the encoder and decoder networks simultaneously in styleALAE ensures an efficient training of the encoder network using the gradients generated by the styleGAN-based decoder network. At the same time, such a training process ensures that the encoder also plays its role in modeling the shared latent space. However, despite all its advantages, the reconstructed output in styleALAE, i.e., D(E(I)), where D denotes the decoder and E the encoder, does not preserve the identity of the input image I. This limits the application of styleALAE for downstream tasks, which often consider the identity of the input image an essential feature for semantic face editing. The difference in identity between the reconstructed image and the input image is caused by the disparity between the distribution learned by the model and the input data.
Unlike other autoencoder-based architectures, styleALAE uses a latent space-based MSE loss to ensure the reciprocity between the encoded image and the reconstructed output [ulyanov2018takes, srivastava2017veegan]. Even though this ensures that the decoder network captures multiple semantics from the input image, it fails to ensure identity-based similarity between the reconstructed output and the input image. Recently, [yang2020one] proposed a framework to generate images from the same distribution as that of a given one-shot example in a styleGAN-based model. The complete process consists of two steps: iterative latent optimization [abdal2019image2stylegan] to project the input image onto the latent space of the pretrained styleGAN, followed by fine-tuning the weights of styleGAN while fixing the projected latent vector to match the perceptual and pixel-wise similarity with the given input image. This one-shot training process ensures that the manifold structure of styleGAN shifts toward the target distribution and can further be reused to generate multiple images with a similar identity to the input image. A similar trick to preserve identity has been proposed in [zakharov2019few] and [kowalski2020config], where one-shot fine-tuning of the generator reduces the identity difference in deepfake and computer graphics-based semantic face editing systems.
In our work, we propose a one-shot domain adaptation method to solve the mismatch between the identity of the input image and the reconstructed output of styleALAE. We ensure this by first mapping the input image onto the latent distribution of pretrained styleALAE, and then fine-tuning the decoder to produce the required manifold shift toward the given input image. We further show that the latent space of the pretrained styleALAE model can be reused to generate age-, smile- and gender-based semantic modifications under the given one-shot domain adaptation technique. Unlike the deepfake and graphics-based face editing methods, which use a conditional GAN to generate semantic modifications, our approach uses a latent space-based traversal scheme [shen2020interpreting] to generate the required semantic modification on the input image. We further experiment with different inversion techniques to project the input image onto the latent space of styleALAE and show their effect on the overall semantic modification and identity preservation. Our contributions can be summarized as follows.

We propose a one-shot domain adaptation technique to solve the disparity between the identity of the input image and the reconstructed output in styleALAE.

We experiment with different inversion techniques and show their impact on semantic modification and identity preservation in styleALAE.

We show that within the one-shot domain adaptation technique, the latent space of pretrained styleALAE can be reused to generate semantic modifications using a linear traversal-based scheme.
2 Related Work
2.1 Generative models
Recently, a lot of progress has been made in the area of semantic face editing. Models like [kingma2018glow] have shown an efficient method for semantic face editing of real-world images by learning a disentangled latent structure. However, the high dimensionality of the latent space of such a network often makes it cumbersome to train. Researchers have also explored conditional GAN-based models [he2019attgan, lu2018attribute, gu2019mask] for semantic face editing, but training such models often requires real-world images with the necessary semantic variations, which are often hard to obtain. To tackle this lack of data, [kowalski2020config] used graphics rendering engines to generate synthetic face images. They then trained an autoencoder with a shared latent space between the synthetic dataset and real-world images, and used this dataset to generate semantic modifications over the input image. However, in such an approach, the semantic modification of any input image is tied to the variation captured in the synthetic dataset.
2.2 GAN Inversion
With the recent success of models like progressiveGAN [karras2017progressive], styleGAN [karras2019style] and StyleGAN2 [karras2020analyzing] in generating photorealistic images, researchers have focused on enhancing their applications in downstream face editing tasks. [shen2020interpreting] showed that the latent structure of these models has a defined semantic structure, and a linear traversal along a specific direction can be utilized for semantic face editing. Similarly, [abdal2020styleflow, shubham2020learning] have explored non-linear traversal schemes for face editing. To extend their application to real-world images with known identities, [abdal2019image2stylegan] and [zhu2020domain] proposed latent space-based optimization methods to map any input image onto the latent space of an existing pretrained model. [zhu2020domain] further showed that the latent space of styleGAN can be reused to generate the required semantic modification over the mapped image. However, their approach involves training an additional encoder network that provides a domain regularization for the latent optimization. This puts an additional overhead of training a separate network, particularly for real-world images.
2.3 Fewshot domain adaptation
Training a model to generate photorealistic and semantically rich facial images often requires a large dataset. Recently, there has been a growing interest in generalizing the performance of deep learning models with only a few images [zakharov2019few, motiian2017few, liu2019few, finn2017model]. Multiple techniques have been proposed to achieve this. [finn2017model] proposed a meta-learning approach to adapt a model to any new task. Similarly, [snell2017prototypical, vinyals2016matching] focused on learning embedding spaces better suited for few-shot learning. Recently, [yang2020one] proposed a one-shot domain adaptation framework for generating images with different styles using a pretrained styleGAN. The authors first mapped the given input image onto the latent space of styleGAN and then fine-tuned the generator to produce the necessary manifold shift for the required face editing. One-shot domain adaptation techniques have also been used for identity preservation. [zakharov2019few] and [kowalski2020config] showed that the disparity in the identity of the reconstructed output for a given input image can be mitigated by fine-tuning the generator using one-shot training.
3 Method
Unlike other works in GAN inversion, ours utilizes the encoder network of the pretrained styleALAE model to map real-world images onto the latent space. We further explore the properties of the encoder network to generate semantic modifications over images using a one-shot domain adaptation technique. Such approaches have previously been studied only for style transfer and identity preservation. Similarly, unlike other generative models, which require synthetic datasets for semantic modifications, ours reuses the existing latent space of styleALAE to generate similar changes on the reconstructed images of real-world facial images.
Our algorithm therefore consists of the following three crucial steps.

Image inversion: For any given input image, a corresponding latent vector is generated by projecting the image onto the latent space of the pretrained styleALAE model. In our work, we have experimented with different projection schemes for image inversion and have shown their impact on the overall semantic modification (see Sections 3.1 and 4).

Manifold shifting: Once the latent vector is generated for the given input image, we fix the corresponding latent vector and update the weights of the decoder of the styleALAE model to minimize the difference between the input image and the reconstructed output image. This optimization produces the manifold shift [yang2020one] in the output distribution towards the given input image.

Semantic modification: Subsequently, we use the linear traversal scheme on the latent space of styleALAE to generate the necessary semantic modification [shen2020interpreting].
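The three steps can be sketched end to end on toy vectors. Below is a minimal numpy illustration in which hypothetical linear maps `W_enc` and `W_dec` stand in for the styleALAE encoder and decoder; all names, dimensions and hyperparameters here are our own illustrative choices, not the actual configuration of the model.

```python
import numpy as np

rng = np.random.default_rng(0)
img_dim, latent_dim = 16, 4

# Hypothetical linear stand-ins for the pretrained encoder and decoder.
W_enc = 0.1 * rng.normal(size=(latent_dim, img_dim))
W_dec = 0.1 * rng.normal(size=(img_dim, latent_dim))

x = rng.normal(size=img_dim)              # the one-shot input "image"

# Step 1: image inversion -- project the input with the pretrained encoder.
w = W_enc @ x

# Step 2: manifold shifting -- fine-tune only the decoder (keeping w fixed)
# to minimise the reconstruction error (perceptual term omitted for brevity).
initial_err = np.linalg.norm(W_dec @ w - x)
lr = 0.1
for _ in range(1000):
    residual = W_dec @ w - x
    W_dec -= lr * np.outer(residual, w)   # gradient of 0.5*||W_dec w - x||^2
final_err = np.linalg.norm(W_dec @ w - x)

# Step 3: semantic modification -- linear traversal along a unit direction n.
n = rng.normal(size=latent_dim)
n /= np.linalg.norm(n)
alpha = 0.5
edited = W_dec @ (w + alpha * n)
```

In the real pipeline, step 2 minimizes the combined perceptual and pixel-wise loss of Section 3.2 over the full decoder weights rather than a single matrix.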
3.1 Image Inversion
The first step of the one-shot domain adaptation method [yang2020one] involves projecting the input image onto the latent space of styleALAE. While other GAN-based approaches do not provide any encoder network to map the input image onto the latent space, styleALAE provides an encoder to project images onto the learned latent space.
Recent works in GAN inversion [zhu2020domain, abdal2019image2stylegan] have shown that styleGAN has latent vectors associated with images of real-world people with known identities. Such a latent vector, when passed through the generator of the original styleGAN, can generate a perceptually similar image with the same identity as the input image. [zhu2020domain] further used this learned vector for semantic modification, and [yang2020one] showed its suitability for domain adaptation purposes. However, we noticed that such approaches, when applied to the latent space of styleALAE, do not generate a high-quality image and sometimes even fail to generalize across different identities (Section 4).
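For contrast, the latent optimization-based projection of [abdal2019image2stylegan] can be sketched on the same kind of toy linear decoder (again, all names and dimensions are our own illustrative choices): the latent code itself is optimized by gradient descent while the decoder stays frozen.

```python
import numpy as np

rng = np.random.default_rng(1)
img_dim, latent_dim = 16, 4
W_dec = 0.1 * rng.normal(size=(img_dim, latent_dim))  # frozen toy "decoder"
x = rng.normal(size=img_dim)                          # target "image"

# Iterative latent optimization: gradient descent on the latent code w to
# minimise 0.5 * ||W_dec w - x||^2 while the decoder weights stay fixed.
w = np.zeros(latent_dim)
initial_err = np.linalg.norm(W_dec @ w - x)
lr = 0.5
for _ in range(2000):
    residual = W_dec @ w - x
    w -= lr * (W_dec.T @ residual)    # gradient with respect to w
final_err = np.linalg.norm(W_dec @ w - x)
```

Note that nothing in this procedure constrains the optimized w to stay near the learned latent manifold, which is the failure mode discussed above.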
In our experiments, we found that the latent vector generated using the pretrained encoder network in styleALAE is better suited for projecting images for one-shot domain adaptation, particularly for subsequent semantic modification (Section 3.3). The encoder of the styleALAE network is trained with the gradients of the generator and is efficient in capturing high-level semantic features such as color, race, gender and age. This helps the encoder network project any input image closer to the learned manifold structure, which can later be used to generate semantic modifications (Section 3.3).
3.2 Manifold shifting
After generating the latent vector associated with the given input image, the next step is to generate a manifold shift in styleALAE towards the given input image [yang2020one]. For this, we fine-tune the decoder of styleALAE using pixel-wise and perceptual similarity while keeping the projected latent vector fixed during the complete fine-tuning process. The fine-tuning of the decoder reduces the identity gap between the input image and the reconstructed image associated with the latent vector generated by the encoder network. The change in the identity of the reconstructed image is visible not only for the projected latent vector: with the fine-tuned decoder, even the neighborhood of the projected vector in the latent space generates images that resemble the identity of the given input image.
In our work, we fine-tune the decoder of styleALAE using a weighted combination of a VGG16-based perceptual loss [abdal2019image2stylegan] and a pixel-wise MSE loss:
L(I) = λ_percept · L_percept(I, D(w)) + λ_mse · ||I - D(w)||_2^2    (1)

where D and E are the decoder and encoder of styleALAE, and w = E(I) is the projected latent vector, which is held fixed during fine-tuning.
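A minimal numpy sketch of this fine-tuning objective is given below; the `feature_maps` argument is a placeholder for the VGG16 activations used by the smooth-L1 perceptual term described next, and the function names and default weights are our own illustrative choices.

```python
import numpy as np

def smooth_l1(a, b, beta=1.0):
    # Smooth L1 (Huber-style) distance: quadratic near zero, linear in the tails.
    d = np.abs(a - b)
    return np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta).mean()

def perceptual_loss(img1, img2, feature_maps, layer_weights):
    # Weighted smooth-L1 distance between feature maps of the two images.
    # `feature_maps(img)` stands in for the VGG16 activations at
    # conv1_1, conv1_2, conv3_2 and conv4_2.
    f1, f2 = feature_maps(img1), feature_maps(img2)
    return sum(lw * smooth_l1(a, b) for lw, a, b in zip(layer_weights, f1, f2))

def finetune_loss(img, recon, feature_maps, layer_weights,
                  lam_percept=1.0, lam_mse=1.0):
    # Weighted combination of the perceptual term and the pixel-wise MSE term.
    mse = np.mean((img - recon) ** 2)
    return (lam_percept * perceptual_loss(img, recon, feature_maps, layer_weights)
            + lam_mse * mse)
```

During fine-tuning, this scalar would be minimized over the decoder weights with the projected latent vector held fixed.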
For the perceptual loss in Equation 1, we use a smooth L1 loss [rashid2017interspecies] over VGG16 features:

L_percept(I_1, I_2) = Σ_j λ_j · smoothL1(F_j(I_1), F_j(I_2))    (2)

where I_1 and I_2 are the images used for comparison and the F_j are VGG16 features from conv1_1, conv1_2, conv3_2 and conv4_2, respectively [abdal2019image2stylegan]. The perceptual loss in the overall loss function encourages the model to generate similar feature representations for the input image and the reconstructed output, and acts as a regularizer to ensure smooth optimization [abdal2019image2stylegan].
3.3 Semantic Modification
Once the required manifold shift has been generated in styleALAE toward the given input image, we reuse the latent space of the pretrained model to generate semantic modifications. For this, we modify the projected latent vector using linear interpolation along the normal of the hyperplane trained to segregate latent vectors with semantically opposite features [shen2020interpreting]:

w_edit = w + α · n    (3)

where w_edit is the new edited latent vector, w is the original projected vector, n is the normal of the hyperplane for the required semantic modification, and α is a hyperparameter.
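A one-line numpy sketch of this traversal (the function name and the normalization of the hyperplane normal are our own choices):

```python
import numpy as np

def semantic_edit(w, n, alpha):
    # Move the projected latent code w by alpha along the unit normal n of
    # the hyperplane separating the semantically opposite attribute values.
    n = np.asarray(n, dtype=float)
    n = n / np.linalg.norm(n)
    return np.asarray(w, dtype=float) + alpha * n
```

Increasingly positive (or negative) values of alpha push the edit further toward one side of the attribute boundary, e.g. older vs. younger.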
4 Experiments
For our experiments, we use a styleALAE model pretrained on 1024×1024×3 images from the FFHQ dataset. We have compared our formulation (one-shot adaptation + encoder-based projection) with four different algorithms, namely: (i) vanilla styleALAE; (ii) latent optimization-based inversion [abdal2019image2stylegan], where the projection vector generated in styleALAE by the encoder network is optimized using the pixel-wise and perceptual loss (see Equation 1); this involves only latent optimization; (iii) one-shot domain adaptation in styleALAE with a randomly sampled vector as the projection vector; this involves one-shot adaptation and random projection: the decoder of the styleALAE model is fine-tuned while a randomly sampled latent vector is held fixed as the projection vector, and the same vector is later used for semantic modification; (iv) one-shot domain adaptation in styleALAE with latent optimization-based projection [yang2020one], involving one-shot adaptation as well as latent optimization: the encoded latent vector is first optimized using latent optimization and then held fixed during the complete fine-tuning of the decoder; the resulting latent vector is then reused for semantic modifications. To generate semantic modifications, we used the default latent directions provided by the authors of styleALAE for all the above-mentioned algorithms. The experiments were performed on an Intel(R) Xeon Silver 4114 processor @ 2.20GHz with 40 cores and 128 GB RAM, using one Nvidia GeForce RTX 2080Ti GPU with 11016 MB VRAM.
4.1 Quantitative evaluation
To compare the performance of our proposed method on real-world identities, we use 1000 single-face images of celebrities from the IMDB-WIKI face dataset [RotheIJCV2018, RotheICCVW2015]. In our quantitative evaluation, we perform two kinds of analyses. First, we compare the input image with the reconstructed output generated after the manifold shift (Section 3.2); second, we compare the images generated during the linear traversal for semantic modification (Section 3.3) with the given input image. As metrics, we use SSIM, PSNR [hore2010image] and SWD [rabin2011wasserstein] scores to compare the different algorithms.
Algorithm  SSIM  PSNR  SWD

Vanilla styleALAE  0.596 / 0.072  15.823 / 2.029  1839.956 / 156.494
Only latent optimization  0.746 / 0.063  22.759 / 1.170  1489.477 / 164.251
One-shot adaptation + random projection  0.853 / 0.040  27.461 / 1.903  1194.786 / 155.150
One-shot adaptation + latent optimization  0.887 / 0.037  29.407 / 2.152  1079.925 / 163.816
One-shot adaptation + encoder-based projection (Ours)  0.881 / 0.039  29.061 / 2.197  1098.163 / 168.272
Table 1 shows the comparison scores between the reconstructed image and the input image. Among the different methods ((i), (ii), (iii), (iv)), the one-shot domain adaptation method with latent optimization-based projection performs best, closely followed by our method. The high generation quality of the reconstructed output with latent optimization-based and encoder-based projection, compared to the other methods, can be attributed to the better initialization these methods provide for the fine-tuning of the decoder.
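Of the reported metrics, PSNR is straightforward to reproduce; a standard numpy implementation (our own sketch, not necessarily the exact variant of [hore2010image]) is:

```python
import numpy as np

def psnr(img1, img2, max_val=255.0):
    # Peak signal-to-noise ratio in dB; higher means closer to the reference.
    mse = np.mean((np.asarray(img1, dtype=np.float64)
                   - np.asarray(img2, dtype=np.float64)) ** 2)
    if mse == 0:
        return float("inf")          # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```

The higher PSNR of the one-shot variants in Table 1 thus directly reflects a lower mean squared error against the input image.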
Type  Algorithm  SSIM  PSNR  SWD

Age  Vanilla styleALAE  0.567 / 0.071  15.052 / 1.817  1974.962 / 140.477
Age  Only latent optimization  0.591 / 0.069  16.020 / 1.803  1912.731 / 141.372
Age  One-shot adaptation + random projection  0.692 / 0.046  18.948 / 1.259  1786.257 / 114.478
Age  One-shot adaptation + latent optimization  0.684 / 0.061  18.135 / 1.685  1711.060 / 132.413
Age  One-shot adaptation + encoder-based projection (Ours)  0.787 / 0.049  22.715 / 1.670  1490.155 / 147.353
Gender  Vanilla styleALAE  0.581 / 0.071  15.530 / 1.950  1874.735 / 139.257
Gender  Only latent optimization  0.607 / 0.069  16.625 / 1.956  1805.526 / 139.186
Gender  One-shot adaptation + random projection  0.726 / 0.048  20.521 / 1.596  1602.929 / 122.529
Gender  One-shot adaptation + latent optimization  0.708 / 0.060  19.009 / 1.822  1606.178 / 130.274
Gender  One-shot adaptation + encoder-based projection (Ours)  0.820 / 0.045  24.870 / 1.882  1330.361 / 141.232
Smile  Vanilla styleALAE  0.566 / 0.071  15.415 / 1.897  1886.499 / 141.816
Smile  Only latent optimization  0.592 / 0.068  16.500 / 1.892  1816.944 / 142.755
Smile  One-shot adaptation + random projection  0.723 / 0.049  20.987 / 1.366  1589.207 / 131.564
Smile  One-shot adaptation + latent optimization  0.643 / 0.061  17.927 / 1.613  1695.687 / 127.006
Smile  One-shot adaptation + encoder-based projection (Ours)  0.786 / 0.046  23.831 / 1.559  1398.035 / 149.570
Table 2 shows the comparison between the semantically modified images and the input image. The results make it evident that our method generates high-quality images with greater perceptual similarity. This can be attributed to the ability of the encoder network to capture multiple semantic features of the input image and generate a latent vector that lies closer to the true latent manifold structure of styleALAE.
4.2 Qualitative evaluation
Figure 1 shows the qualitative comparison between the reconstructed output and the input image for all the algorithms. As can be seen, compared to vanilla styleALAE or latent optimization-only methods, one-shot domain adaptation does a better job of preserving the identity of the reconstructed image, irrespective of the choice of the projection vector.
The impact of different projection vectors on one-shot domain adaptation is, however, much more evident in semantic modifications. As can be seen in Figure 2, a random projection vector generates unwanted artifacts on the input image, while the latent optimization-based projection vector and one-shot adaptation with latent optimization generate images with a different identity. Compared to the other algorithms, our method fares well in generating the necessary semantic modification while preserving the identity.
4.3 User study
Considering the subjective nature of perceiving semantic manipulations and identity preservation, we further evaluated the algorithms using human evaluators. For this, 450 different trajectories of semantic modifications were generated and rated by two annotators on a 5-point Likert scale. The evaluators rated the algorithms on the following dimensions: (a) their ability to preserve the identity of the input image while generating the required semantic manipulation; (b) their ability to prevent image distortion while generating the required semantic manipulation; and (c) their ability to prevent blurring while generating the required semantic manipulation.
Figure 3 shows the results of the user study for the mentioned algorithms for age-, gender- and smile-based manipulations. From the results, it is evident that our approach outperforms the other methods in terms of identity preservation and prevention of image distortion.
As per the results (see Figure 2 and Figure 3), it is evident that the latent vector generated by the encoder network of styleALAE does a better job of preventing image distortion and preserving the identity of the input image in a linear traversal-based semantic modification scheme. As shown in Figure 1, fine-tuning the decoder of styleALAE ensures that the reconstructed output preserves the identity of the input image irrespective of the choice of the image inversion technique. But in a linear traversal-based semantic modification scheme, the projected vector must lie close to the underlying manifold structure of styleALAE to generate semantic modifications. A random vector or a latent optimization-based projection vector fails to land close to the existing manifold structure of styleALAE, and hence performs poorly in preserving identity or avoiding distortion in the generated output. The encoder network, on the other hand, ensures that the projected latent vector lies close to the underlying manifold structure and hence generates better output.
5 Conclusion
In our work, we have addressed the problem of the difference in identity between the reconstructed output and the given input image in styleALAE, and its impact on generating semantic modifications over real-world images. Our results show that with one-shot domain adaptation, the identity of the input image can be preserved and, under the given setting, the latent space of styleALAE can be reused to generate semantic modifications over a given input image with a known identity. We have further shown the importance of the projected vector in one-shot domain adaptation for a latent space-based semantic modification scheme and for generating high-quality images. In the future, we will extend this formulation to other attributes. We also intend to further investigate the impact of fine-tuning the generator on flow-based and autoregressive models.