StyleGAN of All Trades: Image Manipulation with Only Pretrained StyleGAN

Recently, StyleGAN has enabled various image manipulation and editing tasks thanks to its high-quality generation and disentangled latent space. However, additional architectures or task-specific training paradigms are usually required for different tasks. In this work, we take a deeper look at the spatial properties of StyleGAN. We show that a pretrained StyleGAN, together with some simple operations and without any additional architecture, can perform comparably to state-of-the-art methods on various tasks, including image blending, panorama generation, generation from a single image, controllable and local multimodal image-to-image translation, and attribute transfer. The proposed method is simple, effective, efficient, and applicable to any existing pretrained StyleGAN model.

1 Introduction

Generative Adversarial Networks (GANs) Goodfellow et al. (2014) have made great progress in the field of image and video synthesis. Among GAN models, the recent StyleGAN Karras et al. (2019) and StyleGAN2 Karras et al. (2020) have further pushed forward the quality of generated images. The most distinguishing characteristic of StyleGAN is the design of an intermediate latent space that enables disentanglement of different attributes at different semantic levels. This has attracted considerable attention toward demystifying the latent space and achieving simple image manipulations Shen et al. (2020b, a); Collins et al. (2020); Abdal et al. (2019); Wu et al. (2020).

With its disentanglement property, StyleGAN has enabled numerous image editing and manipulation tasks. We can improve the controllability of the generation process by augmenting and regularizing the latent space Chen et al. (2020); Alharbi and Wonka (2020); Shoshan et al. (2021), and by inverting images back into the latent space Abdal et al. (2019); Bau et al. (2020, 2019). Furthermore, various conventional conditional image generation tasks can be achieved with the help of inversion techniques. For example, image-to-image translation can be done by injecting encoded features into StyleGAN Richardson et al. (2020); Kwong et al. (2021), and image inpainting and outpainting can be realized by locating the appropriate codes in the latent space Abdal et al. (2020); Cheng et al. (2021); Lin et al. (2021). However, most methods either are designed in a task-specific manner or require additional architectures.

In this work, we demonstrate that a vanilla StyleGAN is sufficient to host a variety of different tasks, as shown in Figure 1. By exploiting the spatial properties of the intermediate layers along with some simple operations, we can, without any additional training, perform feature interpolation, panorama generation, and generation from a single image. With fine-tuning, we can achieve image-to-image translation, which leads to various applications including continuous translation, local image translation, and attribute transfer. Qualitative and quantitative comparisons show that the proposed method performs comparably to current state-of-the-art methods without any additional architecture. All code and models can be found at https://github.com/mchong6/SOAT.

2 Related Work

Image editing with StyleGAN

The style-based generators of StyleGAN Karras et al. (2019, 2020) provide an intermediate latent space that has been shown to be semantically disentangled. This property facilitates various image editing applications via the manipulation of the $\mathcal{W}$ space. In the presence of labels in the form of binary attributes or segmentations, vector directions in the $\mathcal{W}$ space can be discovered for semantic edits Shen et al. (2020b); Wu et al. (2020). In an unsupervised setting, EIS Collins et al. (2020) analyzes the style space of a large number of images to build a catalog that isolates and binds specific parts of the style code to specific facial parts. However, as the latent space of StyleGAN is one-dimensional, these methods usually have limited control over spatial editing of images. Optimization-based methods, on the other hand, provide better control over spatial editing. Bau et al. Bau et al. (2020) allow users to interactively rewrite the rules of a generative model by manipulating a layer of a GAN as a linear associative memory. However, this requires optimizing the model weights and cannot operate on the feature layers. To enable intuitive spatial editing, Suzuki et al. Suzuki et al. (2018) perform collaging (cut and paste) in the intermediate spatial feature space of GANs, yielding realistic blending of images. However, due to the nature of collaging, the method is highly dependent on the pose and structure of the images and does not generate realistic results in many scenarios.

Panorama generation

Panorama generation aims to generate a sequence of continuous images either in an unconditional setting Lin et al. (2021, 2019); Skorokhodov et al. (2021) or conditioned on given images Cheng et al. (2021). These methods condition the generation on a coordinate system; arbitrary-length panoramas are then generated by continually sampling along the coordinate grid.

Generation from a single image

SinGAN Rott Shaham et al. (2019) recently proposes to learn the distribution of patches within a single image. The learned distribution enables the generation of diverse samples that follow the patch distribution of the original image. However, scalability is a major downside of SinGAN, as every new image requires an individual SinGAN model, which is both time- and compute-intensive. In contrast, the proposed method achieves a similar effect by manipulating the feature space of a pretrained StyleGAN.

Image to image translation (I2I)

I2I aims to learn mappings among different domains. Most I2I methods Isola et al. (2017); Zhu et al. (2017); Huang et al. (2018); Lee et al. (2018) formulate the mapping by learning a conditional distribution. However, this formulation is sensitive to and heavily dependent on the input distribution, which often leads to unstable training and unsatisfactory inference results. To leverage the unconditional distributions of both the source and target domains, Toonify Pinkney and Adler (2020) recently proposed to finetune a pretrained StyleGAN and swap weights between the pretrained and finetuned models to allow high-quality I2I translation. Finetuning from a pretrained model preserves the semantics learned from the original dataset, and transfer learning also reduces the amount of data needed for training. However, Toonify offers limited control and fails to achieve edits such as local translation and continuous translation.

3 Image Manipulation with StyleGAN

We introduce some common operations and their applications using StyleGAN. For the rest of the paper, let $F_i$ denote the intermediate features of the $i$-th layer of StyleGAN.

3.1 Setup

All images are generated at the same resolution. For faces, we use the pretrained FFHQ model by rosinality [28]; for churches, the model pretrained on the LSUN-Churches dataset Yu et al. (2015) by Karras et al. Karras et al. (2020); for landscapes and towers, we train StyleGAN2 models on LHQ Skorokhodov et al. (2021) and LSUN-Towers Yu et al. (2015) using standard hyperparameters. For the face2disney and face2anime tasks, we fine-tune the FFHQ model on the Disney Pinkney and Adler (2020) and Danbooru2018 Anonymous et al. (2019) datasets, respectively.

For quantitative evaluations, we perform a user study and FID computations. All FID scores are computed with the debiased FID of Chong and Forsyth Chong and Forsyth (2020). For the user study, given a pair of images, users are asked to choose the one that is more realistic and more relevant to the task. Each user compares pairs of images from different methods, and we collect results across all subjects.

Figure 2: Spatial Operations. Performing simple spatial operations such as resizing and padding on StyleGAN's intermediate feature layers results in intuitive and realistic manipulations.

3.2 Simple spatial operations

Since StyleGAN is fully convolutional, we can adjust the spatial dimensions of $F_i$ to cause a corresponding spatial change in the output image. We experiment with simple spatial operations such as padding and resizing and show that we can achieve pleasing and intuitive results.

We apply all spatial operations on the intermediate features. First, we consider padding operations, which expand an input tensor by appending additional values to its borders. In Fig. 2, we explore several variants of padding and investigate their effect on the generated image. Replicate padding pads the tensor to the desired size with its boundary values; Fig. 2 shows that the background is extended by replicating the bushes and trees. Reflection padding reflects the tensor at the border, and circular padding wraps the tensor around, creating copies of the same tensor, as shown in Fig. 2. We then introduce a resizing operation performed in the feature space. Compared to naive resizing in pixel space, which causes artifacts such as blurred textures, resizing in the feature space maintains realistic texture.
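As a minimal sketch of these operations (assuming access to an intermediate activation tensor from a rosinality-style PyTorch generator; the helper below is hypothetical, not the authors' exact code), the padding and resizing edits of Fig. 2 could look like:

```python
import torch
import torch.nn.functional as F

def pad_and_resize_features(feat: torch.Tensor,
                            pad: int = 4,
                            mode: str = "replicate",
                            scale: float = 1.5) -> torch.Tensor:
    """Apply simple spatial edits to a StyleGAN intermediate feature map.

    feat: (N, C, H, W) activations from an early synthesis layer.
    mode: 'replicate', 'reflect', or 'circular' padding, as in Fig. 2.
    """
    # Pad the feature borders; the generator's remaining convolutions
    # turn the padded region into plausible image content.
    feat = F.pad(feat, (pad, pad, pad, pad), mode=mode)
    # Resize in feature space rather than pixel space to keep textures sharp.
    feat = F.interpolate(feat, scale_factor=scale,
                         mode="bilinear", align_corners=False)
    return feat
```

The edited feature map is then simply fed through the remaining layers of the synthesis network to produce the manipulated image.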

Figure 3: Feature Interpolation. We compare our feature interpolation with the feature collaging in Suzuki et al. Suzuki et al. (2018). H and V represent horizontal and vertical blending respectively. Our feature interpolation is able to blend and transition the two images seamlessly while there are obvious blending artifacts in Suzuki et al.

3.3 Feature interpolation

Suzuki et al. Suzuki et al. (2018) show that collaging (copying and pasting) features in the intermediate layers of StyleGAN allows images to be blended seamlessly. However, this collaging does not work well when the images to be blended are too different. Instead of collaging, we show that interpolating the features leads to smooth transitions between two images even if they are largely different.

At each StyleGAN layer $i$, we generate two features $F_i^A$ and $F_i^B$ separately using different latent noise. We then blend them smoothly as $\hat{F}_i = (1 - \alpha) \odot F_i^A + \alpha \odot F_i^B$, where $\alpha$ is a mask that controls the blending and is chosen according to the desired type of blend, e.g., for horizontal blending, the mask grows from left to right. $\hat{F}_i$ is then passed on to the next convolution layer, where the same blending is applied again. Note that we do not have to perform this blending at every single layer; we later show that strategic choices of where to blend can impact the results.

In most experiments, we set $\alpha$ to scale linearly (using linspace), which allows a smooth interpolation between the two features. The scaling speed depends on the task. For landscapes, the two images are normally structurally different and thus benefit from a longer, slower ramp that allows a smooth transition. This is evident in Fig. 3, where we compare feature interpolation with the feature collaging of Suzuki et al., which fails to produce a smooth transition. We also perform a user study in which users select which interpolated images look more realistic. As shown in Table 1, 87.6% of users prefer our method over Suzuki et al.
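A minimal sketch of the horizontal blend at a single layer in PyTorch (the linear ramp follows the description above; the helper name and shapes are assumptions):

```python
import torch

def horizontal_blend(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    """Blend two same-shaped feature maps (N, C, H, W) left-to-right.

    The mask alpha ramps linearly from 0 (pure A) on the left
    to 1 (pure B) on the right, as described in Sec. 3.3.
    """
    n, c, h, w = feat_a.shape
    alpha = torch.linspace(0.0, 1.0, w, device=feat_a.device)  # (W,)
    alpha = alpha.view(1, 1, 1, w)                             # broadcast over N, C, H
    return (1.0 - alpha) * feat_a + alpha * feat_b
```

The blended feature is then fed to the next convolution, where the same blend is applied to that layer's two activations; making the ramp faster or slower (Appendix a.1) trades feature preservation against smoothness of the transition.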

Figure 4: Generation from a single image. We compare single image generation with SinGAN. For our method, we perform feature interpolation to collage image structures or spatial paddings to widen the image. Our images are significantly more diverse and realistic. SinGAN fails to vary the church structures in a meaningful way and generates unrealistic clouds and landscapes.

3.4 Generation from a single image

In addition to feature interpolation between different images, we can apply interpolation within a single image. In selected feature layers, we pick relevant patches and replicate them spatially by blending them with other regions. Specifically, with a shift operator $T$ that translates the mask $\alpha$ in a given direction:

$$\hat{F}_i = T(\alpha) \odot T(F_i) + \bigl(1 - T(\alpha)\bigr) \odot F_i \qquad (1)$$

In combination with simple spatial operations, we can generate diverse images from a single image with consistent patch distributions and structure. This task is similar to SinGAN Rott Shaham et al. (2019), except that SinGAN involves random sampling while we manually choose the patches for feature interpolation. Unlike SinGAN, where each image requires an individual model, our method uses the same StyleGAN with different latent codes.
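A sketch of Eq. (1) in PyTorch, using a wrap-around shift for simplicity (the mask and shift offsets would be chosen by hand, as described above; the helper is illustrative, not the authors' exact implementation):

```python
import torch

def replicate_patch(feat: torch.Tensor,
                    mask: torch.Tensor,
                    offset) -> torch.Tensor:
    """Replicate the patch selected by `mask` elsewhere in the same feature map.

    feat:   (N, C, H, W) intermediate feature of a single image.
    mask:   (1, 1, H, W) soft mask in [0, 1] selecting the source patch.
    offset: (dy, dx) translation applied to both the mask and the feature,
            implementing Eq. (1) with a wrap-around shift operator T.
    """
    dy, dx = offset
    shifted_feat = torch.roll(feat, shifts=(dy, dx), dims=(2, 3))
    shifted_mask = torch.roll(mask, shifts=(dy, dx), dims=(2, 3))
    # T(alpha) * T(F) + (1 - T(alpha)) * F
    return shifted_mask * shifted_feat + (1.0 - shifted_mask) * feat
```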

We qualitatively and quantitatively compare the single-image generation capability of SinGAN and the proposed method. In Fig. 4, we perform comparisons on the LSUN-Churches and LHQ datasets. Our method generates realistic structures borrowed from different parts of the image and blends them into a coherent whole. While SinGAN has more flexibility and is able to generate more arbitrary structures, in practice the results are less realistic, especially in the case of image extension. Notice that in landscape extension, SinGAN is not able to correctly capture the structure of clouds, leading to unrealistic samples. Comparatively, our extensions based on reflection padding produce realistic textures that are structurally sound. For the user study, we compare with SinGAN on image extension, with our method using spatial reflection padding. As shown in Table 1, over 80% of the users prefer our method.

Figure 5: GAN Inversion. We compare our GAN inversion with the state-of-the-art method of Wulff et al. Wulff and Torralba (2020). Our method more faithfully reconstructs the original image while maintaining better editability. Our deepfakes are more natural and capture the facial attributes better.
Table 1: User Preference. We conduct user studies on different tasks against other methods. Numbers are the percentage of users preferring our method.
  Attributes transfer: 70.4% vs. Suzuki et al. Suzuki et al. (2018); 64.0% vs. EIS Collins et al. (2020)
  Feature interpolation: 87.6% vs. Suzuki et al. Suzuki et al. (2018)
  Single image generation: 83.3% vs. SinGAN Rott Shaham et al. (2019)
Table 2: Quantitative Results on Panorama Generation. We measure -FID to evaluate the visual quality of the generated panoramas; for reference, the base StyleGAN2 Karras et al. (2020) obtains an FID of 4.5.
  ALIS Skorokhodov et al. (2021): 10.5
  Ours: 15.7
  Ours + latent smoothing: 12.9

3.5 Improved GAN inversion

GAN inversion aims to locate a style code in the latent space that can synthesize an image similar to a given target image. In practice, despite being able to reconstruct the target image, the resulting style codes often fall into unstable out-of-domain regions of the latent space, making it difficult to perform any semantic control over the resulting images. Wulff and Torralba Wulff and Torralba (2020) discover that, under a simple non-linear transformation, the $\mathcal{W}$ space can be modeled with a Gaussian distribution, and that applying a Gaussian prior improves the stability of GAN inversion. However, in our attribute transfer setting, where we need to invert both a source and a reference image, this formulation struggles to provide satisfactory results.

In StyleGAN, the $\mathcal{W}$ latent space is mapped to the style coefficient space $\mathcal{S}$ by an affine transformation in the AdaIN module. Recent work has shown better performance in face manipulations Xu et al. (2020); Collins et al. (2020) when utilizing $\mathcal{S}$ compared to $\mathcal{W}$. We discover that the $\mathcal{S}$ space, without any transformations, can also be modeled as a Gaussian distribution. We are therefore able to impose the same Gaussian prior in the $\mathcal{S}$ space instead during GAN inversion.

In Fig. 5, we compare our GAN inversion with that of Wulff and Torralba and show significant improvements in both reconstruction and editability. For both GAN inversions, we perform gradient descent with LPIPS Zhang et al. (2018) and MSE losses.
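For illustration, a sketch of such an inversion loop under these losses (the `generator(styles=...)` call, the style-space statistics `s_mean`/`s_std`, and the prior weight are assumptions, not the authors' exact setup):

```python
import torch
import lpips  # perceptual loss from https://github.com/richzhang/PerceptualSimilarity

def invert(generator, target, s_mean, s_std,
           steps=1000, lr=0.05, prior_weight=1e-3):
    """Sketch of GAN inversion with a Gaussian prior on the style space.

    `generator(styles=s)` is a hypothetical call that synthesizes an image
    directly from per-layer style coefficients; `s_mean`/`s_std` are empirical
    style-space statistics estimated from random samples.
    """
    percep = lpips.LPIPS(net="vgg").to(target.device)
    s = s_mean.clone().requires_grad_(True)      # start from the mean style code
    opt = torch.optim.Adam([s], lr=lr)
    for _ in range(steps):
        img = generator(styles=s)
        loss = percep(img, target).mean()        # perceptual reconstruction term
        loss = loss + torch.nn.functional.mse_loss(img, target)
        # Gaussian prior: keep the code close to the style-space distribution.
        loss = loss + prior_weight * (((s - s_mean) / s_std) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return s.detach()
```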

Figure 6: Image-to-Image Translation. (a) Our method easily transfers to multimodal image translation on multiple datasets. (b) We can control the degree of translation. (c) We are also able to perform local translation in a user-prescribed region. (d) Our method preserves semantics better compared to Kwong et al. Kwong et al. (2021) (note the facial expressions).

3.6 Controllable I2I translation

Building upon Toonify, Kwong et al. Kwong et al. (2021) propose to freeze the fully-connected layers during the finetuning phase to better preserve semantics after I2I translation. This preserves StyleGAN's $\mathcal{W}$ space, which exhibits disentanglement properties Karras et al. (2020); Shen et al. (2020a); Abdal et al. (2019). Following the discussion in Section 3.5 that the $\mathcal{S}$ space exhibits better disentanglement than the $\mathcal{W}$ space, we propose to also freeze the affine transformation layers that produce $\mathcal{S}$. In Fig. 6(d), we show that this simple change allows us to better preserve semantics during image translation (note the expressions and shapes of the mouths).
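A minimal sketch of this freezing step, assuming the parameter naming of the rosinality StyleGAN2 implementation (the name patterns below are assumptions about that codebase, where the mapping MLP lives under `style.*` and each modulated convolution has a `modulation` affine layer):

```python
def freeze_for_finetuning(generator):
    """Freeze the mapping network and the affine (modulation) layers.

    Only the remaining synthesis weights receive gradients during finetuning,
    which preserves both the W space and the style coefficients S.
    """
    for name, param in generator.named_parameters():
        if name.startswith("style.") or ".modulation." in name:
            param.requires_grad = False
```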

Following Toonify, we first finetune an FFHQ-pretrained StyleGAN on the target dataset. Both Toonify and Kwong et al. then perform weight swapping for I2I. While they produce visually pleasing results, they have limited control over the degree of image translation. One interesting observation we make is that feature interpolation also works across the pretrained and finetuned StyleGAN. This allows us to blend real and Disney faces in numerous ways, achieving different results: 1) We can perform continuous translation by using a constant $\alpha$ across all spatial dimensions; the value of $\alpha$ determines the degree of translation. 2) We can perform localized image translation by choosing the area in which to perform feature interpolation. 3) We can use GAN inversion to perform both face editing and translation on real faces; our improved GAN inversion yields more realistic and accurate results.
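A sketch of how such cross-model blending could be wired up (the per-layer forward interface on the generators is hypothetical; only the blending logic reflects the description above):

```python
import torch

@torch.no_grad()
def translate(g_src, g_tgt, w, alpha):
    """Controllable I2I translation by blending features across two generators.

    g_src: FFHQ-pretrained generator; g_tgt: its finetuned (e.g. Disney) copy.
    Both are assumed to expose hypothetical `input_layer`, `layer_forward`, and
    `to_image` calls so that the two synthesis passes can run in lockstep on the
    same latent `w`. `alpha` is either a scalar (degree of translation) or a
    (1, 1, H, W) mask (local translation), resized to each layer's resolution.
    """
    feat = g_src.input_layer(w)                   # hypothetical constant input
    for i in range(g_src.num_layers):             # hypothetical layer count
        f_src = g_src.layer_forward(i, feat, w)
        f_tgt = g_tgt.layer_forward(i, feat, w)
        if torch.is_tensor(alpha):
            a = torch.nn.functional.interpolate(
                alpha, size=f_src.shape[-2:], mode="bilinear", align_corners=False)
        else:
            a = alpha
        feat = (1 - a) * f_src + a * f_tgt        # alpha = 1 -> fully translated
    return g_src.to_image(feat)                   # hypothetical output head
```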

Fig. 6 shows a comprehensive overview of our capabilities in I2I translation. We show that we can perform multimodal translation across different datasets. Reference images provide the overall style of the translated image, while source images provide semantics such as pose, hair style, etc. Sampling different reference images also results in significantly varied styles (drawing style, colors, etc.). By controlling the blending parameter $\alpha$, we also obtain visually pleasing continuous translation results. For example, in the first row of Fig. 9(b), we can maintain the texture of a real face while enlarging the eyes. We further show that we can selectively choose which area to translate through feature interpolation. This gives us a large degree of controllability, allowing us to create a face with Disney eyes or even an anime head with a human face.

Figure 7: Panorama Generation. We generate panoramas by knitting spans (blends of two images). We enforce certain constraints to enable flawless knitting. Colored bars indicate areas that are exactly the same, and the numbers indicate the areas we take to obtain the final panorama. By ensuring that the yellow area of one span is identical to the corresponding yellow area of the next span (and likewise for the black areas), we can knit neighboring spans perfectly. We can repeat this process to form an arbitrary-length panorama. We show random samples from LHQ, LSUN-Churches, and LSUN-Towers.

3.7 Panorama Generation

Using feature interpolation, we can blend two side-by-side images by creating a realistic transition that connects them. We can extend this to infinite panorama generation by continuously blending pairs of images and knitting them together. Under certain blending constraints, illustrated in Fig. 7, the spans can be knitted perfectly. To enforce the constraint that specified areas remain unchanged, we choose which areas to blend through a careful choice of blending weights. Note that we are not limited to blending only two images at once; the limit is imposed only by GPU memory. Depending on the dataset, our panorama method is not limited to horizontal generation and can be extended in any direction.

Even though feature interpolation allows us to blend images that are different, the results are not ideal when the input images are too semantically dissimilar (e.g., side-by-side blending of sea and trees). To overcome this issue, we perform latent smoothing: applying a Gaussian filter across the sequence of latent codes to smooth neighboring codes. This makes neighboring images more similar and, as such, yields a more natural interpolation between them, leading to more realistic results.
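A minimal sketch of this latent smoothing (the kernel radius and sigma are illustrative choices, not values taken from the paper):

```python
import torch

def smooth_latents(w_seq: torch.Tensor, sigma: float = 1.0, radius: int = 2) -> torch.Tensor:
    """Gaussian-smooth a sequence of latent codes along the sequence dimension.

    w_seq: (T, D) latent codes for the T images that will be knitted into a
    panorama. Neighboring codes are averaged with Gaussian weights so that
    adjacent images become semantically closer before feature interpolation.
    """
    offsets = torch.arange(-radius, radius + 1, dtype=w_seq.dtype)
    kernel = torch.exp(-0.5 * (offsets / sigma) ** 2)
    kernel = kernel / kernel.sum()
    smoothed = torch.zeros_like(w_seq)
    for weight, off in zip(kernel, range(-radius, radius + 1)):
        # Clamp indices at the sequence boundaries instead of wrapping around.
        idx = (torch.arange(w_seq.shape[0]) + off).clamp(0, w_seq.shape[0] - 1)
        smoothed += weight * w_seq[idx]
    return smoothed
```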

In our experiments, when blending images to form a panorama, we perform feature interpolation at every single layer. We choose a blending mask that scales linearly from left to right within the areas constrained by our construction in Fig. 7. We quantitatively compare our method with ALIS Skorokhodov et al. (2021) using the -FID metric introduced in that work. Just by hijacking a pretrained StyleGAN, our method obtains a -FID comparable to that of ALIS, which is trained specifically for this task. We also show that latent smoothing leads to a significant improvement in the score.

Figure 8: Attribute Transfer with Pose Alignment. Naive feature interpolation does not work well when images have very different poses. Our method addresses this problem with a simple pose alignment, allowing us to perform attribute transfer regardless of the original poses.
Figure 9: Attribute Transfer Comparisons. We compare attribute transfer with other state-of-the-art methods. Collins et al. Collins et al. (2020) does not accurately transfer fine-grained attributes, and Suzuki et al. Suzuki et al. (2018) produces unrealistic outputs when the poses are mismatched. Our method is both accurate and realistic. Furthermore, our method can perform transfer in arbitrary regions: we can seamlessly blend two halves of a face, have two distinctly different eyes on each side, etc.

3.8 Attributes Transfer

While Suzuki et al. Suzuki et al. (2018) show that feature collaging can perform localized feature transfer between two images, the results are highly dependent on pose and orientation. Transferring features from a left-looking face to a right-looking face causes awkward misalignments, and naively applying our feature interpolation suffers from the same problem. EIS allows realistic facial feature transfer that performs well even when faces have different poses. However, EIS does not ensure that irrelevant regions remain unaffected; e.g., transferring eye features can also affect the nose. Moreover, EIS only allows transfers of predefined features rather than arbitrary user-defined regions. Lastly, EIS only generates in-distribution images, limiting its ability to produce less common examples such as a face with one eye with makeup and one without.

To make feature interpolation work well for arbitrary poses, we perform a pose alignment between the source and reference images. There are numerous ways to pose-align StyleGAN images Shen et al. (2020b); Härkönen et al. (2020). Based on the observation in Karras et al. (2020) that the early layers of StyleGAN primarily control pose and structure, we can simply align the style codes of the first few layers between the source and reference images. Once pose-aligned, we apply feature interpolation to transfer the chosen features from the reference to the source. This procedure is shown in Fig. 8.
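A minimal sketch of the pose-alignment step (the number of copied layers is an assumed value, since the exact cut-off depends on the model):

```python
import torch

def pose_align(w_source: torch.Tensor, w_reference: torch.Tensor,
               num_pose_layers: int = 4) -> torch.Tensor:
    """Align the reference's pose/structure to the source before attribute transfer.

    w_source, w_reference: (num_layers, 512) per-layer style codes (W+).
    The first `num_pose_layers` layers, which mostly control pose and coarse
    structure, are copied from the source; the remaining layers keep the
    reference's appearance.
    """
    w_aligned = w_reference.clone()
    w_aligned[:num_pose_layers] = w_source[:num_pose_layers]
    return w_aligned
```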

We can further allow arbitrary localized edits by choosing the area in which to perform feature interpolation. The final pipeline has the user draw a bounding box on the region of the source face they wish to change (say, eyes + nose). Attributes are then automatically transferred from a chosen reference face even if the poses are not aligned. We can even generate interesting out-of-distribution examples, such as a vertical blend between a male and a female face (Fig. 9(b)).

To perform natural attribute transfer with minimal blending artifacts, we perform feature interpolation on selected layers. In Fig. 9, we qualitatively compare our face attribute transfer with several other methods, using the proposed improved GAN inversion to perform the comparisons on real images. Our results are generally more realistic and better capture the attributes of interest. Suzuki et al. produce unnatural images due to the difference in poses between the source and reference images, while EIS is less accurate in transferring attributes. We further validate our results through a user study in which users choose based on both realism and transfer accuracy (Table 1); our method is preferred over both other methods.

4 Conclusions and Broader Impacts

In this work, we show that with only pretrained StyleGAN models along with the proposed spatial operations on the latent space, we can achieve comparable results in various image manipulation tasks that usually require task-specific architectures or training paradigms. The proposed method is lightweight, efficient, and applicable to any pretrained StyleGAN model.

Our method provides a simple and computationally efficient way for the general public to perform a variety of image manipulation tasks. However, as a trade-off, it can just as easily be applied for disinformation; for example, attribute transfer can be used to create DeepFakes maliciously. Also, as our method relies on a pretrained StyleGAN, it is limited by that model's capacity. There may be diversity issues when minorities are not well represented in the training dataset, in which case our method may not perform manipulations well on faces of minorities. A well-balanced dataset that properly represents minorities is pertinent to a fair model, and more research and insight into mode dropping in GANs are also necessary.

References

  • [1] R. Abdal, Y. Qin, and P. Wonka (2019) Image2stylegan: how to embed images into the stylegan latent space?. In Int. Conf. Comput. Vis., Cited by: §1, §1, §3.6.
  • [2] R. Abdal, Y. Qin, and P. Wonka (2020) Image2stylegan++: how to edit the embedded images?. In IEEE Conf. Comput. Vis. Pattern Recog., Cited by: §1.
  • [3] Y. Alharbi and P. Wonka (2020) Disentangled image generation through structured noise injection. In IEEE Conf. Comput. Vis. Pattern Recog., Cited by: §1.
  • [4] Anonymous, the Danbooru community, G. Branwen, and A. Gokaslan (2019) Danbooru2018: a large-scale crowdsourced and tagged anime illustration dataset. Note: https://www.gwern.net/Danbooru2018 Cited by: §3.1.
  • [5] D. Bau, S. Liu, T. Wang, J. Zhu, and A. Torralba (2020) Rewriting a deep generative model. In Eur. Conf. Comput. Vis., Cited by: §1, §2.
  • [6] D. Bau, H. Strobelt, W. Peebles, J. Wulff, B. Zhou, J. Zhu, and A. Torralba (2019) Semantic photo manipulation with a generative image prior. In SIGGRAPH, Cited by: §1.
  • [7] A. Chen, R. Liu, L. Xie, and J. Yu (2020) A free viewpoint portrait generator with dynamic styling. arXiv preprint arXiv:2007.03780. Cited by: §1.
  • [8] Y. Cheng, C. H. Lin, H. Lee, J. Ren, S. Tulyakov, and M. Yang (2021) In&Out: diverse image outpainting via gan inversion. arXiv preprint arXiv:2104.00675. Cited by: §1, §2.
  • [9] M. J. Chong and D. Forsyth (2020) Effectively unbiased fid and inception score and where to find them. In IEEE Conf. Comput. Vis. Pattern Recog., Cited by: §3.1.
  • [10] E. Collins, R. Bala, B. Price, and S. Susstrunk (2020) Editing in style: uncovering the local semantics of gans. In IEEE Conf. Comput. Vis. Pattern Recog., Cited by: §1, §2, Figure 9, §3.4, §3.5.
  • [11] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial networks. arXiv preprint arXiv:1406.2661. Cited by: §1.
  • [12] E. Härkönen, A. Hertzmann, J. Lehtinen, and S. Paris (2020) GANSpace: discovering interpretable GAN controls. arXiv preprint arXiv:2004.02546. Cited by: §3.8.
  • [13] X. Huang, M. Liu, S. Belongie, and J. Kautz (2018) Multimodal unsupervised image-to-image translation. In Eur. Conf. Comput. Vis., Cited by: §2.
  • [14] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In IEEE Conf. Comput. Vis. Pattern Recog., Cited by: §2.
  • [15] T. Karras, S. Laine, and T. Aila (2019) A style-based generator architecture for generative adversarial networks. In IEEE Conf. Comput. Vis. Pattern Recog., Cited by: §1, §2.
  • [16] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila (2020) Analyzing and improving the image quality of stylegan. In IEEE Conf. Comput. Vis. Pattern Recog., Cited by: §1, §2, §3.1, §3.4, §3.6, §3.8.
  • [17] S. Kwong, J. Huang, and J. Liao (2021) Unsupervised image-to-image translation via pre-trained stylegan2 network. IEEE Transactions on Multimedia. Cited by: §1, Figure 6, §3.6.
  • [18] H. Lee, H. Tseng, J. Huang, M. Singh, and M. Yang (2018) Diverse image-to-image translation via disentangled representations. In Eur. Conf. Comput. Vis., Cited by: §2.
  • [19] C. H. Lin, C. Chang, Y. Chen, D. Juan, W. Wei, and H. Chen (2019) Coco-gan: generation by parts via conditional coordinating. In Int. Conf. Comput. Vis., Cited by: §2.
  • [20] C. H. Lin, Y. Cheng, H. Lee, S. Tulyakov, and M. Yang (2021) InfinityGAN: towards infinite-resolution image synthesis. arXiv preprint arXiv:2104.03963. Cited by: §1, §2.
  • [21] J. N. Pinkney and D. Adler (2020) Resolution dependant gan interpolation for controllable image synthesis between domains. arXiv preprint arXiv:2010.05334. Cited by: §2, §3.1.
  • [22] E. Richardson, Y. Alaluf, O. Patashnik, Y. Nitzan, Y. Azar, S. Shapiro, and D. Cohen-Or (2020) Encoding in style: a stylegan encoder for image-to-image translation. arXiv preprint arXiv:2008.00951. Cited by: §1.
  • [23] T. Rott Shaham, T. Dekel, and T. Michaeli (2019) SinGAN: learning a generative model from a single natural image. In Int. Conf. Comput. Vis., Cited by: §2, §3.4, §3.4.
  • [24] Y. Shen, J. Gu, X. Tang, and B. Zhou (2020) Interpreting the latent space of gans for semantic face editing. In IEEE Conf. Comput. Vis. Pattern Recog., Cited by: §1, §3.6.
  • [25] Y. Shen, C. Yang, X. Tang, and B. Zhou (2020) Interfacegan: interpreting the disentangled face representation learned by gans. IEEE transactions on pattern analysis and machine intelligence. Cited by: §1, §2, §3.8.
  • [26] A. Shoshan, N. Bhonker, I. Kviatkovsky, and G. Medioni (2021) GAN-control: explicitly controllable gans. arXiv preprint arXiv:2101.02477. Cited by: §1.
  • [27] I. Skorokhodov, G. Sotnikov, and M. Elhoseiny (2021) Aligning latent and image spaces to connect the unconnectable. arXiv preprint arXiv:2104.06954. Cited by: §2, §3.1, §3.4, §3.7.
  • [28] rosinality (2019) stylegan2-pytorch. GitHub. Note: https://github.com/rosinality/stylegan2-pytorch Cited by: §3.1.
  • [29] R. Suzuki, M. Koyama, T. Miyato, T. Yonetsuji, and H. Zhu (2018) Spatially controllable image synthesis with internal representation collaging. arXiv preprint arXiv:1811.10153. Cited by: §2, Figure 3, Figure 9, §3.3, §3.4, §3.8.
  • [30] Z. Wu, D. Lischinski, and E. Shechtman (2020) StyleSpace analysis: disentangled controls for stylegan image generation. arXiv preprint arXiv:2011.12799. Cited by: §1, §2, §3.5.
  • [31] J. Wulff and A. Torralba (2020) Improving inversion and generation diversity in stylegan using a gaussianized latent space. arXiv preprint arXiv:2009.06529. Cited by: Figure 5.
  • [32] Y. Xu, Y. Shen, J. Zhu, C. Yang, and B. Zhou (2020) Generative hierarchical features from synthesizing images. arXiv preprint arXiv:2007.10379. Cited by: §3.5.
  • [33] F. Yu, A. Seff, Y. Zhang, S. Song, T. Funkhouser, and J. Xiao (2015) LSUN: construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365. Cited by: §3.1.
  • [34] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. In IEEE Conf. Comput. Vis. Pattern Recog., Cited by: §3.5.
  • [35] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Int. Conf. Comput. Vis., Cited by: §2.

Appendix A Appendix

a.1 $\alpha$ blending

We blend images with an $\alpha$ mask. We can control the speed at which $\alpha$ scales from 0 to 1 to obtain different masks for feature blending. In Figure 10, we illustrate the concept of alpha blending, and in Figure 11 we apply different $\alpha$ masks to different tasks. For landscape images, where contents are usually structurally different, a slower $\alpha$ ramp allows a smoother transition. For face editing, on the other hand, a faster ramp is usually beneficial, as we want to accurately reproduce the fine-grained features of the reference without them being affected by the transition.
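A small sketch of how the ramp speed might be controlled (the parameterization below is illustrative, not the exact schedule used in the paper):

```python
import torch

def ramp_mask(width: int, speed: float = 1.0) -> torch.Tensor:
    """Build a horizontal blending mask whose transition speed is adjustable.

    speed > 1 compresses the 0-to-1 ramp into a narrow central band
    (fast scaling, better feature preservation); speed <= 1 keeps a gentle
    slope across the whole width (slow scaling, smoother transitions).
    """
    x = torch.linspace(0.0, 1.0, width)
    ramp = (x - 0.5) * speed + 0.5          # re-center, then rescale the slope
    return ramp.clamp(0.0, 1.0).view(1, 1, 1, width)
```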

Figure 10: $\alpha$ blending. We inject two different styles to get two intermediate features $F_i^A$ and $F_i^B$, which we blend using a spatially varying mask. The blended output is then passed on to the next convolution layers, where the same process is repeated.
Figure 11: $\alpha$ scaling. Different $\alpha$ scaling gives different blending results. Fast scaling preserves features better; this is harmful when the two images are very different, as the transition becomes abrupt (see the landscape example), but it is useful for accurate facial attribute transfer. Slow scaling gives a smooth, gradual blend, which is helpful for landscapes but fails to accurately preserve facial features.

a.2 Latent smoothing

In addition to feature interpolation, we adopt latent smoothing to handle cases where the input images are too semantically dissimilar. We apply a Gaussian filter across the sequence of latent codes. As shown in Figure 12, latent smoothing greatly alleviates the artifacts.

Figure 12: Latent smoothing. We compare our feature interpolation with and without latent smoothing. When blending two very different images, the resulting blend can be unrealistic (left). Latent smoothing brings neighboring latent codes (and consequently neighboring images) closer together, giving a more realistic blend.

a.3 More Samples

We present more samples of panorama generation, generation from a single image, and image-to-image translation in Figure 13, Figure 14, and Figure 15, respectively.

Figure 13: More Samples of Panorama generation.
Figure 14: More Samples of Generation from a single image.
Figure 15: More Samples of Image-to-Image translation.