Many computer vision problems can be viewed as image-to-image translation problems, such as style transfer[gatys2016image, huang2017arbitrary], super-resolution[dong2015image, kim2016accurate], image inpainting[pathak2016context, iizuka2017globally], colorization[zhang2016colorful, zhang2017real], and so on. Image-to-image translation maps an input image from one domain to a comparable image in a different domain. Most recent image-to-image translation methods are based on Generative Adversarial Networks (GANs)[goodfellow2014generative]. pix2pix[isola2017image] is the representative supervised image-to-image translation model. The pix2pix framework uses a conditional generative adversarial network to learn a mapping from input to output images. In its edges-to-photo example, the discriminator, D, learns to classify between fake (synthesized by the generator) and real edge and photo tuples. The generator, G, learns to fool the discriminator. Unlike in an unconditional GAN, both the generator and the discriminator observe the input edge map. CycleGAN[zhu2017unpaired] is one of the first unsupervised image-to-image translation algorithms, and it proposes cycle consistency. The translation networks for both directions are trained together, and they provide supervision signals for each other.
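The cycle-consistency idea can be illustrated with a minimal NumPy sketch. The two "networks" below are hypothetical fixed functions rather than learned generators, and the L1 form of the loss is an assumption made for illustration; the point is only that translating to the other domain and back should recover the input.

```python
import numpy as np

def cycle_consistency_loss(x, y, G, F):
    """L1 cycle loss: F(G(x)) should recover x, and G(F(y)) should recover y."""
    forward = np.mean(np.abs(F(G(x)) - x))   # X -> Y -> X
    backward = np.mean(np.abs(G(F(y)) - y))  # Y -> X -> Y
    return forward + backward

# Hypothetical stand-ins for the two translation networks (not learned).
G = lambda img: 1.0 - img        # "X -> Y": invert intensities
F_good = lambda img: 1.0 - img   # exact inverse of G
F_bad = lambda img: img          # ignores what G did

rng = np.random.default_rng(0)
x = rng.random((8, 8))
y = rng.random((8, 8))

loss_good = cycle_consistency_loss(x, y, G, F_good)  # close to zero
loss_bad = cycle_consistency_loss(x, y, G, F_bad)    # clearly positive
```

A pair of mappings that are mutual inverses drives the loss to zero, which is the sense in which each direction supervises the other.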
UNIT[liu2017unsupervised] is a unimodal image-to-image translation algorithm. Unlike CycleGAN, UNIT makes a different assumption, called the shared latent space assumption: the latent space is shared by both domains, such that two analogous images are mapped to the same latent code. MUNIT[huang2018multimodal] is a representative multimodal image-to-image translation model. MUNIT assumes that the image representation can be decomposed into a content code that is domain-invariant and a style code that captures domain-specific properties. For example, MUNIT can be applied to the edges2shoes dataset, which contains example images (edge maps of shoes) and shoe images, and it generates realistic shoe images with reference to the example edge images.
Recently, several state-of-the-art methods have addressed exemplar-based image-to-image translation [huang2018multimodal, ma2018exemplar, wang2019example, zhang2020cross]. For example, SPADE[park2019semantic] shows impressive results using spatially-adaptive normalization, which modulates the activations in the normalization layers through an affine transformation learned from the input semantic layout, so that the layout information is preserved. This model allows the user to select an external style image to control the "global appearance" of the output image. Moreover, it demonstrates user control over both semantics and style when synthesizing images.
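A simplified sketch of spatially-adaptive normalization is given below. In SPADE the per-pixel scale and shift are predicted from the layout by small convolutional networks and the normalization is batch normalization; here, as a stated simplification, they are looked up from hypothetical per-class tables (`gamma_table`, `beta_table`) and the statistics are computed over a single feature map.

```python
import numpy as np

def spade_modulate(features, seg_map, gamma_table, beta_table, eps=1e-5):
    """features: (C, H, W); seg_map: (H, W) integer labels;
    gamma_table, beta_table: (num_classes, C) hypothetical lookup tables."""
    # Parameter-free normalization per channel (single-image simplification).
    mean = features.mean(axis=(1, 2), keepdims=True)
    std = features.std(axis=(1, 2), keepdims=True)
    normalized = (features - mean) / (std + eps)
    # Spatially varying scale and shift, selected by the label at each pixel.
    gamma = gamma_table[seg_map].transpose(2, 0, 1)  # (H, W, C) -> (C, H, W)
    beta = beta_table[seg_map].transpose(2, 0, 1)
    return gamma * normalized + beta

rng = np.random.default_rng(0)
features = rng.normal(size=(3, 4, 4))
seg_map = np.zeros((4, 4), dtype=int)
seg_map[:, 2:] = 1  # right half of the layout belongs to class 1
gamma_table = np.array([[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]])
beta_table = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 5.0]])
out = spade_modulate(features, seg_map, gamma_table, beta_table)
```

Because the scale and shift vary per pixel, the layout is not washed out by the normalization step, which is the property the paragraph above attributes to SPADE.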
CoCosNet[zhang2020cross] shows state-of-the-art results in exemplar-based image-to-image translation. Despite this success, CoCosNet has several critical problems. First, it requires heavy training, which incurs an expensive computation cost. Second, such heavy training is mandatory for every single task (e.g., mask-to-face, edge-to-face, pose-to-edge), and each task needs a large-scale training set. Third, the SPADE block inside the CoCosNet network requires labeled masks for the training set, which restricts the practical use of the model. Lastly, CoCosNet fails to match instances within a pair of images, which causes the generated image to overfit to the mask image.
Recently, there have been many attempts to reverse the generation process by re-mapping the image space into a latent space, widely known as GAN inversion. For example, "On the 'steerability' of generative adversarial networks" showed impressive results by exploring the latent space: it shows how to achieve transformations in the output space by moving in the latent space. Image2StyleGAN presents a StyleGAN inversion method that uses an efficient embedding technique to map an input image to the extended latent space W+ of a pre-trained StyleGAN model. mGANprior performs GAN inversion using multiple latent codes and adaptive channel importance. However, mGANprior fails to generate the best results when the same layer and parameters are used for every image.
In this paper, we explore an alternative solution, called Multiple GAN Inversion for Exemplar-based Image-to-Image Translation, to overcome the aforementioned limitations of exemplar-based image-to-image translation. Multiple GAN Inversion (MGI) refines the translation hypothesis: we present a novel GAN inversion scheme within MGI that self-decides the optimal translation result, using the warped image in exemplar-based image-to-image translation. We also apply super-resolution with GAN inversion to generate high-quality translation results.
2 Related Works
Image-to-image translation aims to learn the mapping between two domains. Exemplar-based image-to-image translation is one kind of image-to-image translation. It uses a structure image (input) and a style image (exemplar) to generate a single image that combines both aspects. The point is that the output should contain the structure from the input and the style from the exemplar in a ‘balanced’ way, and should look ‘natural’.
Producing a moderately balanced output has been an issue in exemplar-based image-to-image translation, since it requires well-extracted features for both structure and texture. Recent approaches require a paired image dataset or labeled images[zhang2020cross, park2019semantic, zhu2020sean] to capture semantic attributes more accurately. Some other approaches generate a pretext in order to create a ground-truth-like image.
SPADE[park2019semantic], which replaces the normalization with a combination of batch normalization and a structure-based conditional form in order to take in more spatial information, fails to represent style. SEAN[zhu2020sean] then points out two shortcomings of SPADE: it has only one style code to control the entire style of an image, and it inserts style information only at the beginning of the network. In order to maintain more style information, SEAN introduces one style code per region and injects style information at multiple locations. However, both SPADE and SEAN require labeled semantic masks, which means that they need a large-scale labeled dataset. In contrast, our approach does not require labeled data, or even a dataset: all we need is a pair of images, an input and an exemplar.
Swapping Autoencoder[park2020swapping] tries to resolve the balance problem by training an autoencoder with two independent components: a structure code and a texture code. This model is innovative since it does not need any semantic information in advance. However, Swapping Autoencoder fails to extract local style features, so its output cannot be said to be moderately balanced. In our work, we apply a correlation matrix and GAN inversion to extract local style features in an unsupervised manner.
CoCosNet[zhang2020cross] shows state-of-the-art results in exemplar-based image-to-image translation. Yet, CoCosNet has several critical problems. First, it requires heavy training, which incurs an expensive computation cost. Second, such heavy training is mandatory for every single task (e.g., mask-to-face, edge-to-face, pose-to-edge), and each task needs a large-scale training set. Third, the SPADE block inside the CoCosNet network requires labeled masks for the training set, which restricts the practical use of the model.
To address this problem, OEFT (Online Exemplar Fine-Tuning)[kang2020online] utilizes pre-trained off-the-shelf networks and fine-tunes them online for a single pair of input images, which removes the need for an off-line training phase. Its correspondence fine-tuning module establishes accurate correspondence fields by considering only the internal matching statistics. However, OEFT still has several critical problems regarding computation time and generalization to unseen input images. In contrast, our model shows high generalization ability to unseen input images and requires less computation time than other approaches.
The goal of GAN inversion is to map a given image back to a latent code using a pre-trained GAN model (PGGAN, StyleGAN, etc.). Image2StyleGAN[abdal2019image2stylegan] is an efficient embedding technique that maps an input image to the extended latent space W+ of a pre-trained StyleGAN model. It raises several questions that give insight into the structure of the StyleGAN latent space. Moreover, this algorithm supports applications such as style transfer and face embedding.
PULSE[menon2020pulse] is a novel super-resolution technique based on GAN inversion, which creates realistic high-resolution images that downscale to the correct low-resolution input. This algorithm proposes a new framework for single-image super-resolution, using a downscaling loss and latent space exploration.
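The downscaling constraint can be sketched as follows. This is an illustration only: average pooling stands in for the downscaling operator and a mean squared error stands in for the loss, both of which are assumptions rather than the exact choices of PULSE.

```python
import numpy as np

def downscale(img, factor):
    """Average-pool a (H, W) image by an integer factor (assumed operator)."""
    h, w = img.shape
    return img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def downscaling_loss(candidate_hr, target_lr, factor):
    """Mean squared error between the downscaled candidate and the low-resolution input."""
    return float(np.mean((downscale(candidate_hr, factor) - target_lr) ** 2))

rng = np.random.default_rng(0)
target_lr = rng.random((4, 4))

# A high-resolution candidate that downscales exactly to the target ...
consistent_hr = np.kron(target_lr, np.ones((4, 4)))
# ... and one that does not.
blank_hr = np.zeros((16, 16))

loss_consistent = downscaling_loss(consistent_hr, target_lr, factor=4)
loss_blank = downscaling_loss(blank_hr, target_lr, factor=4)
```

Many high-resolution candidates satisfy the constraint equally well, which is why the loss is combined with a search in the latent space of a generator that only produces realistic images.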
Collaborative Learning for Faster StyleGAN Embedding[guan2020collaborative] uses the same design as Image2StyleGAN, with a few differences, such as the initialization method and the use of an LPIPS loss. The reported runtime of Image2StyleGAN is 420.00, whereas the runtime of this method is 0.71, which is about 0.17% of the Image2StyleGAN runtime.
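The ratio of the two reported runtimes can be checked with a short calculation; the variable names are illustrative, and the unit is left unspecified because the text does not state it.

```python
image2stylegan_runtime = 420.00
collaborative_runtime = 0.71

# Fraction of the Image2StyleGAN runtime, expressed in percent.
ratio_percent = 100.0 * collaborative_runtime / image2stylegan_runtime
# Equivalent speed-up factor.
speedup = image2stylegan_runtime / collaborative_runtime

print(f"{ratio_percent:.3f}% of the runtime, speed-up of about {speedup:.0f}x")
```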
mGANprior[gu2020image] uses multiple latent codes and adaptive channel importance in GAN inversion. It is an effective GAN inversion method that can be used in many applications, such as super-resolution, colorization, inpainting, semantic manipulation, and so on. However, mGANprior does not provide a way to find the optimal hyperparameters for different images. In our work, we propose Multiple GAN Inversion, which builds on mGANprior and allows us to find the best hyperparameters of the GAN inversion model. It avoids human intervention by using a self-deciding algorithm that chooses the number of layers based on FID.
We want to learn the translation from the source domain ($\mathcal{S}$) to the target domain ($\mathcal{T}$). The generated output ($\hat{I}$) is desired to conform to the content of the source image ($I_S$) while resembling the style of semantically similar parts in the exemplar ($I_T$). To address this problem, we first establish a correspondence between $I_S$ and $I_T$, which lie in different domains, and then warp the exemplar image accordingly so that its semantics match those of $I_S$. After that, we refine the warped image, which is generated by the cross-domain correspondence networks, to generate a more natural and plausible image through the proposed Multiple GAN Inversion, using a self-deciding algorithm to choose the hyperparameters for GAN inversion. Figure 1 shows our pipeline of exemplar-based image-to-image translation.
3.1 Cross-domain Correspondence Networks
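We first encode the source image ($I_S$) and the target image ($I_T$) using a pre-trained VGG-19 feature extractor, and obtain the features $F_S = \mathcal{F}(I_S; \theta_S)$ and $F_T = \mathcal{F}(I_T; \theta_T)$, with $F_S, F_T \in \mathbb{R}^{H \times W \times C}$. Here $\mathcal{F}(I; \theta)$ denotes a feed-forward process of input $I$ through a deep network with parameters $\theta$; $H$, $W$, and $C$ denote the spatial size of the feature and the number of channels; and $\theta_S$ and $\theta_T$ are the network parameters for encoding the source and target features, respectively.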
Local Feature Matching. Ideally, the source image and the warped image should match structurally. To this end, we propose to use a correlation matrix to warp the features of $I_S$ and $I_T$. Specifically, we calculate a global correlation matrix $M$ in which each element is a pairwise feature correlation.
$\hat{F}_S(u)$ and $\hat{F}_T(v)$ are the channel-wise centralized features of $F_S$ and $F_T$ at positions $u$ and $v$. $M(u, v)$ can be summarized as follows:
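A form consistent with this description, and with the correspondence layer of CoCosNet [zhang2020cross] on which this module is based, is the cosine similarity between centralized features. The symbol names ($F_S$, $F_T$, $C$) and the exact normalization are assumptions made here for illustration:

```latex
\begin{equation}
M(u, v) = \frac{\hat{F}_S(u)^{\top} \hat{F}_T(v)}
               {\lVert \hat{F}_S(u) \rVert \, \lVert \hat{F}_T(v) \rVert},
\qquad
\hat{F}(u) = F(u) - \frac{1}{C} \sum_{c=1}^{C} F(u)_c .
\end{equation}
```

Under this form each entry of $M$ lies in $[-1, 1]$, and a value close to 1 indicates that position $u$ in the source and position $v$ in the exemplar carry similar features.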
After that, we obtain the warped image from the correlation matrix: we warp $I_T$ according to $M$ and obtain the warped exemplar $R_{T \to S}$. More specifically, for each position we select the most relevant pixels from $I_T$ and calculate their weighted average. This can be summarized as follows:
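The two steps, correlation and warping, can be sketched in NumPy under stated assumptions: the weights are a softmax over exemplar positions with a sharpness coefficient `alpha` (the value 100 is illustrative), the features are random vectors instead of VGG-19 activations, and all function names are hypothetical.

```python
import numpy as np

def correlation_matrix(feat_s, feat_t, eps=1e-8):
    """feat_s, feat_t: (H*W, C) features, one row per spatial position."""
    fs = feat_s - feat_s.mean(axis=1, keepdims=True)  # channel-wise centering
    ft = feat_t - feat_t.mean(axis=1, keepdims=True)
    fs = fs / (np.linalg.norm(fs, axis=1, keepdims=True) + eps)
    ft = ft / (np.linalg.norm(ft, axis=1, keepdims=True) + eps)
    return fs @ ft.T  # entry [u, v] is the correlation of source u and exemplar v

def warp_exemplar(corr, exemplar_pixels, alpha=100.0):
    """Softmax-weighted average of exemplar pixels for every source position."""
    logits = alpha * corr
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ exemplar_pixels  # (H*W, 3)

# Toy check: the source features are a shuffled copy of the exemplar features,
# so the warp should undo the shuffle.
rng = np.random.default_rng(0)
num_positions, channels = 16, 64
feat_t = rng.normal(size=(num_positions, channels))
perm = rng.permutation(num_positions)
feat_s = feat_t[perm]
exemplar_pixels = rng.random((num_positions, 3))

corr = correlation_matrix(feat_s, feat_t)
warped = warp_exemplar(corr, exemplar_pixels)
```

A large `alpha` makes the softmax close to a hard selection of the best match, while a small one blends many exemplar pixels; this trade-off is why the weighted average, rather than a plain argmax, keeps the operation differentiable.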
3.2 Multiple GANs Inversion
Even though the technique described above produces a translated image, the output inherently has a limited resolution due to memory limitations, and it contains some artifacts. To overcome these issues, CoCosNet [zhang2020cross] trains additional translation networks using SPADE blocks; however, this imposes an extra training burden. Instead, we propose to utilize GAN inversion [gu2020image], in which we only optimize the latent code that is likely to generate a plausible image, guided by the warped image, with a pre-trained, fixed generation network $G$ of parameters $\theta$. Even though any GAN inversion technique can be used, in this paper we use a recent GAN inversion technique named mGANprior [gu2020image].
To further improve the performance, we present a self-deciding algorithm for choosing the hyperparameters of GAN inversion based on the Fréchet Inception Distance (FID) [heusel2017gans], which measures the quality of generated images without a paired ground-truth image, so we do not need any additional supervision.
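FID is the Fréchet distance between two Gaussians fitted to feature statistics, $\lVert \mu_a - \mu_b \rVert^2 + \mathrm{Tr}(\Sigma_a + \Sigma_b - 2(\Sigma_a \Sigma_b)^{1/2})$. The sketch below computes this distance for generic feature arrays; in the actual metric the features come from an Inception network, which is omitted here, and the function name is hypothetical.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two (N, D) feature sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) is the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

rng = np.random.default_rng(0)
real_feats = rng.normal(size=(500, 4))

same = frechet_distance(real_feats, real_feats)           # identical statistics
shifted = frechet_distance(real_feats, real_feats + 3.0)  # mean shifted by 3 per dimension
```

With identical statistics the distance vanishes, and a pure mean shift contributes only the squared distance between the means (here 4 dimensions times 3 squared).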
Specifically, we reformulate the GAN inversion module to generate multiple hypotheses using different numbers of layers. Among the $N$ hypotheses, we choose the most plausible one based on FID scores, which encourages the inversion result to lie in the natural data distribution. We define the latent code for the $n$-th attempt as $z_n$, which is found by minimizing the distance function between the downsampled generated image and the warped exemplar $R_{T \to S}$ as
\[ z_n = \arg\min_{z} \mathcal{D}\big(\downarrow G(z; \theta_n),\; R_{T \to S}\big), \]
where $\downarrow$ is the downsampling operator and $n \in \{1, \dots, N\}$. $\theta$ is an inversion network parameter, so $\theta_n$ denotes the parameters of the $n$-th attempt.
$\mathcal{D}$ is the distance function, which can be an L1, L2, or perceptual loss function[johnson2016perceptual]. By measuring the FID scores of the reconstructed images $\hat{I}_n = G(z_n; \theta_n)$ and finding the minimum, we finally obtain the translation image $\hat{I} = \hat{I}_{n^{*}}$ with $n^{*} = \arg\min_{n} \mathrm{FID}(\hat{I}_n)$. Figure 2 visualizes the procedure.
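The self-deciding step amounts to running the inversion once per candidate configuration and keeping the hypothesis with the lowest score. The sketch below shows only this selection logic; `toy_invert` and `toy_score` are hypothetical stand-ins for the mGANprior optimization and the FID computation, and the candidate range of 4 to 8 composing layers is taken from the implementation details.

```python
def select_best_hypothesis(configs, invert, score):
    """Run `invert` once per configuration and keep the lowest-scoring result."""
    best_config, best_image, best_score = None, None, float("inf")
    for config in configs:
        image = invert(config)  # one inversion attempt (one hypothesis)
        value = score(image)    # lower is better, as with FID
        if value < best_score:
            best_config, best_image, best_score = config, image, value
    return best_config, best_image, best_score

# Hypothetical stand-ins: the "image" is a record whose artifact level is
# smallest when 6 composing layers are used.
def toy_invert(num_layers):
    return {"num_layers": num_layers, "artifact_level": abs(num_layers - 6)}

def toy_score(image):
    return float(image["artifact_level"])

best_config, best_image, best_score = select_best_hypothesis(
    range(4, 9), toy_invert, toy_score
)
```

Since every hypothesis requires a full inversion, the cost grows linearly with the number of candidate configurations, which is the price paid for removing the manual hyperparameter choice.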
4.1 Implementation Details
In this section, we report all the details including implementation and training details to reproduce our experiments.
We first summarize the implementation details of our approach, in particular of the cross-domain correspondence networks and the multiple GAN inversion. For the correspondence module, we used the CoCosNet default setting in our experiments: the Adam solver and, following TTUR, imbalanced learning rates for the generator and the discriminator.
For the Multiple GAN Inversion module, we basically follow the default setting of mGANprior, with some modifications. We generate hypotheses with the composing layer ranging from 4 to 8 and the number of latent codes ranging from 10 to 40. The up-sampling factor was 4, for processing a 256 × 256 image into a 1024 × 1024 image. We used the PGGAN-Multi-Z and StyleGAN losses for mGANprior. For the distance function, we used the L2 and perceptual losses, as in mGANprior. We used a batch size of 1.
4.2 Experimental Setup
Datasets. We used two datasets to evaluate our method, namely CelebA-HQ[liu2015deep] and Flickr-Faces-HQ (FFHQ)[karras2019style]. During optimization, the input images were resized to 256 × 256.
Baselines. We compared our method with recent state-of-the-art exemplar-based image-to-image translation methods, namely CoCosNet [zhang2020cross], Swapping Autoencoder [park2020swapping], Image2StyleGAN[abdal2019image2stylegan], and Style Transfer[gatys2015neural]. In addition, we evaluated our GAN inversion module against mGANprior [gu2020image], PULSE [menon2020pulse], and Image2StyleGAN [abdal2019image2stylegan], which are state-of-the-art GAN inversion methods.
4.3 Experimental Results
Since our framework consists of two sub-networks, Cross-domain Correspondence Networks and Multiple GAN Inversion, we evaluate the two modules separately in this section. Specifically, exemplar-warp denotes the results of the Cross-domain Correspondence Networks module [zhang2020cross] alone, and ours-fin denotes the inversion results obtained with the Multiple GAN Inversion module. We report ours-fin as our final results.
Qualitative evaluation. We evaluate our MGI module in Figure 3, which shows that our generated results are more realistic and plausible. For instance, our method better reflects details such as eyes and wrinkles, as well as the overall structure, than the state-of-the-art methods. In addition, our approach shows high generalization ability to unseen input images, on which previous methods often fail.
We compare the computation time of our MGI module with that of the state-of-the-art models OEFT and CoCosNet. The CoCosNet paper reports training on 8 32GB Tesla V100 GPUs, which takes roughly 4 days for 100 epochs on the ADE20k dataset. In contrast, OEFT and MGI use online optimization and a pre-trained GAN model, respectively. On the CelebA-HQ dataset, OEFT takes 632 seconds and our model takes 467.09 seconds. Figure 7 shows our evaluation result.
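The relative saving implied by the two reported timings can be worked out directly; the variable names are illustrative.

```python
oeft_seconds = 632.0
mgi_seconds = 467.09

# Relative reduction in computation time with respect to OEFT, in percent.
reduction_percent = 100.0 * (oeft_seconds - mgi_seconds) / oeft_seconds
# Equivalent speed-up factor.
speedup = oeft_seconds / mgi_seconds

print(f"reduction: {reduction_percent:.2f}%, speed-up: {speedup:.3f}x")
```

On these figures MGI needs roughly a quarter less time than OEFT per image pair.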
User study. We also conducted a user study with 80 participants; Figure 8 shows the result. In the comparison with OEFT, 63.27% of users prefer the image quality of our method. On FFHQ, 77.78% of users prefer our image quality, and most respondents prefer our results in terms of the structure of the source and the style of the target.
In this paper, we proposed a novel framework, Multiple GAN Inversion (MGI), for exemplar-based image-to-image translation. Our model generates a more natural and plausible image through the proposed Multiple GAN Inversion, using a self-deciding algorithm to choose the hyperparameters for GAN inversion. We formulate the overall network as two sub-networks, Cross-domain Correspondence Networks and Multiple GAN Inversion. Our approach achieves high generalization ability to unseen image domains, which is one of the major bottlenecks of state-of-the-art methods. Experimental results show the superiority of the proposed method over existing state-of-the-art exemplar-based image-to-image translation methods.