Generative Adversarial Networks (GANs) Goodfellow et al. (2014) have made great progress in image and video synthesis. Among GAN models, the recent StyleGAN Karras et al. (2019) and StyleGAN2 Karras et al. (2020) have further pushed forward the quality of generated images. The most distinguishing characteristic of StyleGAN is its intermediate latent space, whose design enables the disentanglement of different attributes at different semantic levels. This has attracted considerable attention toward demystifying the latent space and achieving simple image manipulations Shen et al. (2020b, a); Collins et al. (2020); Abdal et al. (2019); Wu et al. (2020).
With its disentanglement property, StyleGAN has unlocked numerous image editing and manipulation tasks. The controllability of the generation process can be improved by augmenting and regularizing the latent space Chen et al. (2020); Alharbi and Wonka (2020); Shoshan et al. (2021), and by inverting images back into the latent space Abdal et al. (2019); Bau et al. (2020, 2019). Furthermore, various conventional conditional image generation tasks can be achieved with the help of these inversion techniques. For example, image-to-image translation can be done by injecting encoded features into StyleGAN Richardson et al. (2020); Kwong et al. (2021), and image inpainting and outpainting can be realized by locating the appropriate codes in the latent space Abdal et al. (2020); Cheng et al. (2021); Lin et al. (2021). However, most methods are either designed in a task-specific manner or require additional architectures.
In this work, we demonstrate that a vanilla StyleGAN is sufficient to host a variety of different tasks, as shown in Figure 1. By exploiting the spatial properties of the intermediate layers along with some simple operations, we can, without any additional training, perform feature interpolation, panorama generation, and generation from a single image. With fine-tuning, we can achieve image-to-image translation, which leads to various applications including continuous translation, local image translation, and attributes transfer. Qualitative and quantitative comparisons show that the proposed method performs comparably to current state-of-the-art methods without any additional architecture. All code and models can be found at https://github.com/mchong6/SOAT.
2 Related Work
Image editing with StyleGAN
The style-based generators of StyleGAN Karras et al. (2019, 2020) provide an intermediate latent space W that has been shown to be semantically disentangled. This property facilitates various image editing applications via manipulation of the W space. In the presence of labels in the form of binary attributes or segmentations, vector directions in the W space can be discovered for semantic edits Shen et al. (2020b); Wu et al. (2020). In an unsupervised setting, EIS Collins et al. (2020) analyzes the style space of a large number of images to build a catalog that isolates and binds specific parts of the style code to specific facial parts. However, as the StyleGAN latent code is a one-dimensional vector with no spatial structure, these methods usually offer limited control over the spatial editing of images. Optimization-based methods, on the other hand, provide better control over spatial editing. Bau et al. Bau et al. (2020) allow users to interactively rewrite the rules of a generative model by manipulating the layers of a GAN as a linear associative memory. However, this requires optimization of the model weights and cannot operate on the feature layers. To enable intuitive spatial editing, Suzuki et al. Suzuki et al. (2018) perform collaging (cut and paste) in the intermediate spatial feature space of GANs, yielding realistic blending of images. However, due to the nature of collaging, the method is highly dependent on the pose and structure of the images and does not generate realistic results in many scenarios.
Panorama generation aims to generate a sequence of continuous images, either in an unconditional setting Lin et al. (2021, 2019); Skorokhodov et al. (2021) or conditioned on given images Cheng et al. (2021). These methods condition the generation on a coordinate system; arbitrary-length panoramas are then produced by continually sampling along the coordinate grid.
Generation from a single image
SinGAN Rott Shaham et al. (2019) recently proposed to learn the distribution of patches within a single image. The learned distribution enables the generation of diverse samples that follow the patch distribution of the original image. However, scalability is a major downside of SinGAN, as every new image requires an individual SinGAN model, which is both time-consuming and computationally intensive. In contrast, the proposed method achieves a similar effect by manipulating the feature space of a pretrained StyleGAN.
Image to image translation (I2I)
I2I aims to learn mappings among different domains. Most I2I methods Isola et al. (2017); Zhu et al. (2017); Huang et al. (2018); Lee et al. (2018) formulate the mapping via learning a conditional distribution. However, this formulation is sensitive to and heavily dependent on the input distribution, which often leads to unstable training and unsatisfactory inference results. To leverage the unconditional distributions of both the source and target domains, Toonify Pinkney and Adler (2020) recently proposed to finetune a pretrained StyleGAN and swap weights between the pretrained and finetuned models to enable high-quality I2I translation. Finetuning from a pretrained model preserves the semantics learned from the original dataset, and transfer learning reduces the amount of training data needed. However, Toonify offers limited control and fails to achieve edits such as local translation and continuous translation.
3 Image Manipulation with StyleGAN
We introduce some common operations and their applications using StyleGAN. For the rest of the paper, let f_i denote the intermediate features of the i-th layer of the StyleGAN generator.
All images are generated at the same resolution. For faces, we use the pretrained FFHQ model from rosinality's stylegan2-pytorch; for churches, the model pretrained on the LSUN-Churches dataset Yu et al. (2015) by Karras et al. Karras et al. (2020); for landscapes and towers, we train StyleGAN2 models on LHQ Skorokhodov et al. (2021) and LSUN-Towers Yu et al. (2015) using standard hyperparameters. For the face2disney and face2anime tasks, we fine-tune the FFHQ model on the Disney Pinkney and Adler (2020) and Danbooru2018 Anonymous et al. (2019) datasets, respectively.
For quantitative evaluation, we perform a user study and FID computations. All FID computations use the debiased FID estimator of Chong and Forsyth Chong and Forsyth (2020). For the user study, given a pair of images, users are asked to choose the one that is more realistic and more relevant to the task; each user compares pairs of images from the different methods, and we aggregate the responses across all subjects.
3.2 Simple spatial operations
Since StyleGAN is fully convolutional, we can adjust the spatial dimensions of the intermediate features f to cause a corresponding spatial change in the output image. We experiment with simple spatial operations such as padding and resizing, and show that they achieve pleasing and intuitive results.
We apply all spatial operations to an intermediate feature layer. First, we consider padding, which expands an input tensor by appending additional values along its borders. In Fig. 2, we explore several variants of padding and investigate their effect on the generated image. Replicate padding pads the tensor to the desired size using its boundary values; Fig. 2 shows that the background is extended by replicating the bushes and trees. Reflection padding reflects the tensor at its border, and circular padding wraps the tensor around, creating copies of the same tensor, as shown in Fig. 2. We then introduce the resizing operation, performed directly in the feature space. Compared to naive image resizing, which causes artifacts such as blurred textures, resizing in the feature space maintains realistic texture.
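As a toy illustration of these padding modes, the sketch below uses NumPy arrays as stand-ins for the intermediate feature tensors (in the actual model these would be torch tensors and `torch.nn.functional.pad` with modes "replicate", "reflect", and "circular"); the array sizes are arbitrary.

```python
import numpy as np

# Toy stand-in for an intermediate StyleGAN feature map with shape (C, H, W).
feat = np.arange(2 * 4 * 4, dtype=np.float32).reshape(2, 4, 4)

pad = 2  # expand each spatial border by 2 feature "pixels"
spatial = ((0, 0), (pad, pad), (pad, pad))  # pad only H and W, not channels

replicate = np.pad(feat, spatial, mode="edge")    # repeat boundary values
reflect = np.pad(feat, spatial, mode="reflect")   # mirror at the border
circular = np.pad(feat, spatial, mode="wrap")     # wrap the tensor around

# All three expand the spatial dimensions of the feature map; the later
# convolution layers then decode the padded features into a correspondingly
# extended image.
print(replicate.shape)  # (2, 8, 8)
```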
3.3 Feature interpolation
Suzuki et al. Suzuki et al. (2018) show that collaging (copy and paste) features in the intermediate layers of StyleGAN allows images to be blended seamlessly. However, collaging does not work well when the images to be blended are too different. Instead of collaging, we show that interpolating the features leads to smooth transitions between two images even when they differ significantly.
At each StyleGAN layer, we generate features f_1 and f_2 separately using different latent noise. We then blend them smoothly as f = (1 − m) ⊙ f_1 + m ⊙ f_2, where m is a mask with values in [0, 1] whose shape is decided by the desired way of blending; e.g., for horizontal blending, the mask grows larger from left to right. The blended feature f is then passed on to the next convolution layer, where the same blending occurs again. Note that we do not have to perform this blending at every single layer; we later show that strategic choices of where to blend can impact the results.
In most experiments, we set m to scale linearly (using linspace), which allows a smooth interpolation between the two features. The scale depends on the task. For landscapes, the two images are normally structurally different and thus benefit from a longer, slower scale that allows a smooth transition. This is evident in Fig. 3, where we compare feature interpolation with the feature collaging of Suzuki et al., which fails to produce a smooth transition. We also perform a user study in which users select which interpolated images look more realistic; as shown in Table 3.4, users prefer our method over that of Suzuki et al.
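A minimal sketch of this blending step, with NumPy arrays standing in for the two feature maps and a linspace mask for horizontal blending (shapes and names here are illustrative, not the actual implementation):

```python
import numpy as np

# Hypothetical features from two different latent codes at the same layer.
C, H, W = 8, 4, 16
rng = np.random.default_rng(0)
f1 = rng.normal(size=(C, H, W))
f2 = rng.normal(size=(C, H, W))

# Horizontal blending mask m in [0, 1] that grows from left to right,
# broadcast over channels and rows.
m = np.linspace(0.0, 1.0, W)[None, None, :]

# f = (1 - m) * f1 + m * f2: pure f1 at the left edge, pure f2 at the right.
blended = (1.0 - m) * f1 + m * f2
```

In the real pipeline, this blend would be computed at each chosen layer and the result passed on to the next convolution.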
3.4 Generation from a single image
In addition to feature interpolation between different images, we can apply interpolation within a single image. In selected feature layers, we pick relevant patches and replicate them spatially by blending them with other regions. Specifically, with a shift operator shift(·, d) that translates its input in a given direction d, and a mask m selecting a patch, we can replicate the patch at a shifted location via f′ = shift(m, d) ⊙ shift(f, d) + (1 − shift(m, d)) ⊙ f.
In combination with the simple spatial operations, we can generate diverse images from a single image while keeping its patch distribution and structure consistent. This task is similar to SinGAN Rott Shaham et al. (2019), except that SinGAN involves sampling while we manually choose patches for feature interpolation. Also, unlike SinGAN, where each image requires an individual model, our method uses the same StyleGAN with different latent codes.
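A sketch of the patch-replication idea using `np.roll` as the shift operator on a toy feature map (the mask, shift direction, and tensor sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
f = rng.normal(size=(4, 8, 8))  # toy intermediate feature map (C, H, W)

# Binary mask selecting a patch: here the top-left 3x3 region.
m = np.zeros((1, 8, 8))
m[:, :3, :3] = 1.0

d = (0, 4)  # shift 4 positions to the right
m_shift = np.roll(m, d, axis=(1, 2))  # shift(m, d)
f_shift = np.roll(f, d, axis=(1, 2))  # shift(f, d)

# Replicate the selected patch at the shifted location; everywhere else the
# original features are kept. A soft mask would blend instead of paste.
f_new = m_shift * f_shift + (1.0 - m_shift) * f
```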
We qualitatively and quantitatively compare the single-image generation capability of SinGAN and the proposed method. In Fig. 4, we perform comparisons on the LSUN-Churches and LHQ datasets. Our method generates realistic structures borrowed from different parts of the image and blends them into a coherent whole. While SinGAN has more flexibility and is able to generate more arbitrary structures, in practice the results are less realistic, especially for image extension. Notice that in landscape extension, SinGAN fails to correctly capture the structure of clouds, leading to unrealistic samples. Comparatively, our extension based on reflection padding generates realistic textures that are structurally sound. For the user study, we compare with SinGAN on image extension, with our method using spatial reflection padding in the feature space. As Table 3.4 shows, over 80% of the users prefer our method.
[Table 3.4 — user study preference rates: attributes transfer vs. Suzuki et al. Suzuki et al. (2018) and vs. EIS Collins et al. (2020); feature interpolation vs. Suzuki et al. Suzuki et al. (2018); single image generation vs. SinGAN Rott Shaham et al. (2019).]
3.5 Improved GAN inversion
GAN inversion aims to locate a style code in the W space that can synthesize an image similar to a given target image. In practice, despite being able to reconstruct the target image, the resulting style codes often fall into unstable, out-of-domain regions of the W space, making it difficult to exert any semantic control over the resulting images. Wulff and Torralba Wulff and Torralba (2020) discover that, under a simple non-linear transformation, the W space can be modeled with a Gaussian distribution, and applying a Gaussian prior improves the stability of GAN inversion. However, in our attributes transfer setting we need to invert both a source and a reference image, and this formulation struggles to provide satisfactory results.
In a StyleGAN, the latent space W is mapped to the style coefficient space S by an affine transformation in the AdaIN module. Recent work has shown better performance in face manipulations Xu et al. (2020); Collins et al. (2020) when utilizing S compared to W. We discover that the S space, without any transformation, can also be modeled as a Gaussian distribution. We are then able to impose the same Gaussian prior in this space instead during GAN inversion.
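A minimal sketch of what imposing such a Gaussian prior can look like: fit a Gaussian to style codes gathered from many sampled latents, then penalize a candidate code by its Mahalanobis distance during the inversion optimization. The names and the toy stand-in for style codes are our own, not the paper's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for style codes s = A(w) collected from many sampled latents;
# in practice these come from the mapping network plus the affine layers.
samples = rng.normal(loc=2.0, scale=0.5, size=(10000, 16))

mu = samples.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(samples, rowvar=False))

def gaussian_prior_penalty(s):
    """Mahalanobis distance of a candidate style code under the fitted
    Gaussian; added (with a weight) to the reconstruction loss so the
    optimizer stays in a well-behaved region of the style space."""
    d = s - mu
    return float(d @ cov_inv @ d)
```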
3.6 Controllable I2I translation
Building upon Toonify, Kwong et al. Kwong et al. (2021) propose to freeze the fully-connected layers during the finetuning phase to better preserve semantics after I2I translation. This preserves StyleGAN's W space, which exhibits disentanglement properties Karras et al. (2020); Shen et al. (2020a); Abdal et al. (2019). Following the discussion in Section 3.5 that the S space exhibits better disentanglement than the W space, we propose to also freeze the affine transformation layers that produce S. In Fig. 6(d), we show that this simple change better preserves the semantics for image translation (note the expressions and shapes of the mouths).
Following Toonify, we first finetune an FFHQ-pretrained StyleGAN on the target dataset. Both Toonify and Kwong et al. then proceed to perform weight swapping for I2I. While they produce visually pleasing results, they offer limited control over the degree of image translation. One interesting observation we make is that feature interpolation also works across the pretrained and finetuned StyleGANs. This allows us to blend real and Disney faces together in numerous ways, achieving different results: 1) We can perform continuous translation by using a constant blending mask m across all spatial dimensions; the value of m determines the degree of translation. 2) We can perform localized image translation by choosing the area in which to perform feature interpolation. 3) We can use GAN inversion to perform both face editing and translation on real faces; our improved GAN inversion yields more realistic and accurate results.
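All three controls reduce to one blending function over features from the two generators. The sketch below uses toy arrays; `translate` and the shapes are our own illustrative names:

```python
import numpy as np

def translate(f_src, f_tgt, alpha):
    """Blend features from the pretrained (f_src) and finetuned (f_tgt)
    generator at one layer. alpha may be a scalar (continuous translation)
    or a spatial mask (localized translation)."""
    return (1.0 - alpha) * f_src + alpha * f_tgt

C, H, W = 4, 8, 8
rng = np.random.default_rng(0)
f_real = rng.normal(size=(C, H, W))    # features from the FFHQ generator
f_disney = rng.normal(size=(C, H, W))  # features from the finetuned generator

# 1) Continuous translation: a constant alpha sets the degree of translation.
halfway = translate(f_real, f_disney, 0.5)

# 2) Localized translation: only the masked region (e.g., the eyes) is
# translated; the rest of the face keeps the real-domain features.
mask = np.zeros((1, H, W))
mask[:, 2:4, :] = 1.0
eyes_only = translate(f_real, f_disney, mask)
```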
Fig. 6 shows a comprehensive overview of our capabilities in I2I translation. We can perform multimodal translations across different datasets: reference images provide the overall style of the translated image, while source images provide semantics such as pose and hair style. Sampling different reference images also produces significantly varied styles (drawing style, colors, etc.). By controlling the blending mask, we also obtain visually pleasing continuous translation results; for example, in the first row of Fig. 9(b), we can maintain the texture of a real face while enlarging the eyes. We further show that we can selectively choose which area to translate through feature interpolation. This affords a large degree of controllability, allowing us to create a face with Disney eyes or even an anime head with a human face.
3.7 Panorama Generation
Using feature interpolation, we can blend two side-by-side images by creating a realistic transition that connects them. We can extend this to infinite panorama generation by continuously blending pairs of images and knitting them together. Under the blending constraints illustrated in Fig. 7, the pieces knit together perfectly: to enforce that specified areas remain unchanged, we choose which areas to blend through a careful choice of mask weights. Note that we are not limited to blending only two images at once; the only limitation is GPU memory. Depending on the dataset, our panorama method is also not limited to horizontal generation and can be extended in any direction.
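One way to sketch such a constrained blending mask: freeze a border of each segment to pure 0 or pure 1 so that consecutive segments share identical boundary regions and tile seamlessly (the `keep` width and function name are illustrative choices):

```python
import numpy as np

def blend_mask(width, keep=4):
    """Horizontal mask that is exactly 0 on the left `keep` columns and
    exactly 1 on the right `keep` columns, ramping linearly in between.
    The frozen borders act as the knitting constraint: the left border of
    one segment matches the right border of the previous one."""
    m = np.ones(width)
    m[:keep] = 0.0
    m[keep:width - keep] = np.linspace(0.0, 1.0, width - 2 * keep)
    return m
```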
Even though feature interpolation allows us to blend images that differ, the results are not ideal when the input images are too semantically dissimilar (e.g., side-by-side blending of sea and trees). To overcome this issue, we perform latent smoothing: applying a Gaussian filter across the sequence of latent codes so that neighboring latent codes become more similar. Neighboring images are consequently more alike, which yields a more natural interpolation between them.
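A minimal sketch of latent smoothing, assuming the panorama's latent codes are stacked into an (N, dim) array and filtered along the sequence axis (the function name and default parameters are illustrative):

```python
import numpy as np

def smooth_latents(ws, sigma=1.0, radius=3):
    """Gaussian-filter a sequence of latent codes ws with shape (N, dim)
    along the sequence axis, so neighboring panorama segments become more
    semantically similar before feature interpolation."""
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-x**2 / (2 * sigma**2))
    kernel /= kernel.sum()  # normalize so constant sequences are unchanged
    padded = np.pad(ws, ((radius, radius), (0, 0)), mode="edge")
    return np.stack([(padded[i:i + 2 * radius + 1] * kernel[:, None]).sum(axis=0)
                     for i in range(len(ws))])
```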
In our experiments, when blending images to form a panorama, we perform feature interpolation at every layer. We choose a blending mask that scales linearly from left to right within the areas constrained by our construction in Fig. 7. We quantitatively compare our method with ALIS Skorokhodov et al. (2021) using the FID variant introduced in that work. Just by repurposing a pretrained StyleGAN, our method obtains a score comparable to ALIS, which is trained specifically for this task. We also show that latent smoothing leads to a significant improvement in the score.
3.8 Attributes Transfer
While Suzuki et al. Suzuki et al. (2018) show that feature collaging can perform localized feature transfer between two images, the results are highly dependent on pose and orientation: transferring features from a left-looking face to a right-looking face causes awkward misalignments. Naively applying our feature interpolation leads to similar failures. EIS allows realistic facial feature transfer that performs well even when faces have different poses. However, EIS does not ensure that irrelevant regions stay unaffected; e.g., transferring eye features can affect the nose features too. Moreover, EIS only allows transfers of predefined features, not arbitrary user-defined ones. Lastly, EIS can only generate in-distribution images, limiting its ability to produce less common examples such as a face with one eye made up and the other not.
To make feature interpolation work well for arbitrary poses, we perform a pose alignment between the source and reference images. There are numerous ways to pose-align StyleGAN images Shen et al. (2020b); Härkönen et al. (2020). Based on the observation in Karras et al. (2020) that the early layers of StyleGAN primarily control pose and structure, we can simply align the early entries of the style code between the source and reference images. Once pose-aligned, we can apply feature interpolation to transfer the chosen features from reference to source. This procedure is shown in Fig. 8.
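A sketch of this alignment, assuming the style code is stored as one vector per generator layer; the cutoff of 7 coarse layers is an assumed illustrative value, not the paper's exact setting:

```python
import numpy as np

def pose_align(w_ref, w_src, n_pose=7):
    """Copy the source's early (coarse, pose-controlling) style vectors into
    a copy of the reference code, so both images share pose and structure
    before feature interpolation. w_ref, w_src: (num_layers, dim) arrays."""
    w = w_ref.copy()
    w[:n_pose] = w_src[:n_pose]
    return w
```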
We can further allow arbitrary localized edits by choosing the area in which to perform feature interpolation. The final pipeline has the user draw a bounding box on the region of the source face they wish to change (say, eyes + nose); attributes are then automatically transferred from a chosen reference face even if the poses are not aligned. We can even generate interesting out-of-distribution examples, such as a vertical blending between a male and a female face (Fig. 9(b)).
To perform natural attributes transfer with minimal blending artifacts, we perform feature interpolation only on a subset of intermediate layers. In Fig. 9, we qualitatively compare our face attributes transfer with several other methods, using the proposed improved GAN inversion to run the comparisons on real images. Our results are generally more realistic and better capture the attributes of interest: Suzuki et al. produce unnatural images due to the pose difference between source and reference, while EIS is less accurate in transferring attributes. We further validate our results through a user study in which users choose based on both realism and transfer accuracy (Table 3.4); our method is preferred over both alternatives.
4 Conclusions and Broader Impacts
In this work, we show that with only pretrained StyleGAN models along with the proposed spatial operations on the latent space, we can achieve comparable results in various image manipulation tasks that usually require task-specific architectures or training paradigms. The proposed method is lightweight, efficient, and applicable to any pretrained StyleGAN model.
Our method provides a simple and computationally efficient way for the general public to perform a variety of image manipulation tasks. As a trade-off, however, it can just as easily be applied for disinformation; for example, attributes transfer can be used to create DeepFakes for malicious purposes. Also, as our method relies on a pretrained StyleGAN, it is limited by that model's capacity. There may be issues of diversity where minorities are not well represented in the training dataset, in which case our method may not perform manipulations well on faces of minorities. A well-balanced dataset that properly represents minorities is pertinent to a fair model, and more research and insight into mode dropping in GANs are also necessary.
- Abdal et al. (2019) Image2StyleGAN: how to embed images into the StyleGAN latent space? In Int. Conf. Comput. Vis.
- Abdal et al. (2020) Image2StyleGAN++: how to edit the embedded images? In IEEE Conf. Comput. Vis. Pattern Recog.
- Alharbi and Wonka (2020) Disentangled image generation through structured noise injection. In IEEE Conf. Comput. Vis. Pattern Recog.
- Anonymous et al. (2019) Danbooru2018: a large-scale crowdsourced and tagged anime illustration dataset. https://www.gwern.net/Danbooru2018
- Bau et al. (2020) Rewriting a deep generative model. In Eur. Conf. Comput. Vis.
- Bau et al. (2019) Semantic photo manipulation with a generative image prior. In SIGGRAPH.
- Chen et al. (2020) A free viewpoint portrait generator with dynamic styling. arXiv preprint arXiv:2007.03780.
- Cheng et al. (2021) In&Out: diverse image outpainting via GAN inversion. arXiv preprint arXiv:2104.00675.
- Chong and Forsyth (2020) Effectively unbiased FID and Inception Score and where to find them. In IEEE Conf. Comput. Vis. Pattern Recog.
- Collins et al. (2020) Editing in Style: uncovering the local semantics of GANs. In IEEE Conf. Comput. Vis. Pattern Recog.
- Goodfellow et al. (2014) Generative adversarial networks. arXiv preprint arXiv:1406.2661.
- Härkönen et al. (2020) GANSpace: discovering interpretable GAN controls. arXiv preprint arXiv:2004.02546.
- Huang et al. (2018) Multimodal unsupervised image-to-image translation. In Eur. Conf. Comput. Vis.
- Isola et al. (2017) Image-to-image translation with conditional adversarial networks. In IEEE Conf. Comput. Vis. Pattern Recog.
- Karras et al. (2019) A style-based generator architecture for generative adversarial networks. In IEEE Conf. Comput. Vis. Pattern Recog.
- Karras et al. (2020) Analyzing and improving the image quality of StyleGAN. In IEEE Conf. Comput. Vis. Pattern Recog.
- Kwong et al. (2021) Unsupervised image-to-image translation via pre-trained StyleGAN2 network. IEEE Transactions on Multimedia.
- Lee et al. (2018) Diverse image-to-image translation via disentangled representations. In Eur. Conf. Comput. Vis.
- Lin et al. (2019) COCO-GAN: generation by parts via conditional coordinating. In Int. Conf. Comput. Vis.
- Lin et al. (2021) InfinityGAN: towards infinite-resolution image synthesis. arXiv preprint arXiv:2104.03963.
- Pinkney and Adler (2020) Resolution dependent GAN interpolation for controllable image synthesis between domains. arXiv preprint arXiv:2010.05334.
- Richardson et al. (2020) Encoding in Style: a StyleGAN encoder for image-to-image translation. arXiv preprint arXiv:2008.00951.
- Rott Shaham et al. (2019) SinGAN: learning a generative model from a single natural image. In Int. Conf. Comput. Vis.
- Shen et al. (2020) Interpreting the latent space of GANs for semantic face editing. In IEEE Conf. Comput. Vis. Pattern Recog.
- Shen et al. (2020) InterFaceGAN: interpreting the disentangled face representation learned by GANs. IEEE Trans. Pattern Anal. Mach. Intell.
- Shoshan et al. (2021) GAN-Control: explicitly controllable GANs. arXiv preprint arXiv:2101.02477.
- Skorokhodov et al. (2021) Aligning latent and image spaces to connect the unconnectable. arXiv preprint arXiv:2104.06954.
- rosinality. stylegan2-pytorch. GitHub: https://github.com/rosinality/stylegan2-pytorch
- Suzuki et al. (2018) Spatially controllable image synthesis with internal representation collaging. arXiv preprint arXiv:1811.10153.
- Wu et al. (2020) StyleSpace analysis: disentangled controls for StyleGAN image generation. arXiv preprint arXiv:2011.12799.
- Wulff and Torralba (2020) Improving inversion and generation diversity in StyleGAN using a Gaussianized latent space. arXiv preprint arXiv:2009.06529.
- Xu et al. (2020) Generative hierarchical features from synthesizing images. arXiv preprint arXiv:2007.10379.
- Yu et al. (2015) LSUN: construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365.
- Zhang et al. (2018) The unreasonable effectiveness of deep features as a perceptual metric. In IEEE Conf. Comput. Vis. Pattern Recog.
- Zhu et al. (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Int. Conf. Comput. Vis.
Appendix A Appendix
We blend images with an alpha mask. By controlling the speed at which the mask scales from 0 to 1, we obtain different masks for feature blending. In Figure 10, we illustrate the concept of alpha blending, and in Figure 11 we apply different alpha masks to different tasks. For landscape images, whose contents are usually structurally different, a slower ramp allows a smoother transition. On the other hand, for face editing, a faster ramp is usually beneficial, as we want to accurately reproduce the fine-grained features of the reference without them being affected by the transition.
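The different ramp speeds can be sketched as masks whose 0-to-1 transition occupies more or fewer columns (the widths and function name are illustrative):

```python
import numpy as np

def alpha_mask(width, ramp):
    """Horizontal alpha mask whose 0 -> 1 transition occupies `ramp` columns
    centered in the middle; a smaller ramp gives a faster, sharper blend."""
    start = (width - ramp) // 2
    m = np.zeros(width)
    m[start:start + ramp] = np.linspace(0.0, 1.0, ramp)
    m[start + ramp:] = 1.0
    return m

slow = alpha_mask(16, 12)  # gradual transition: suits landscapes
fast = alpha_mask(16, 2)   # sharp transition: suits face features
```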
A.2 Latent smoothing
In addition to feature interpolation, we adopt latent smoothing to handle cases where the input images are too semantically dissimilar, applying a Gaussian filter across the latent codes. As shown in Figure 12, latent smoothing greatly alleviates the artifacts.