From Continuity to Editability: Inverting GANs with Consecutive Images

Existing GAN inversion methods are stuck in a paradox: the inverted codes can either achieve high-fidelity reconstruction or retain editing capability, but not both. Having only one of the two clearly cannot realize real image editing. In this paper, we resolve this paradox by introducing consecutive images (e.g., video frames or the same person with different poses) into the inversion process. The rationale behind our solution is that the continuity of consecutive images leads to inherent editable directions. This inborn property is used for two unique purposes: 1) regularizing the joint inversion process, such that each inverted code is semantically accessible from one of the others and kept within an editable domain; 2) enforcing inter-image coherence, such that the fidelity of each inverted code can be maximized with complementary information from the other images. Extensive experiments demonstrate that our alternative significantly outperforms state-of-the-art methods in terms of reconstruction fidelity and editability on both a real image dataset and a synthesized dataset. Furthermore, our method provides the first support for video-based GAN inversion, and an interesting application of unsupervised semantic transfer from consecutive images. Source code can be found at: <https://github.com/cnnlstm/InvertingGANs_with_ConsecutiveImgs>.


1 Introduction

Generative Adversarial Networks (GANs) [NIPS2014_5ca3e9b1, karras2019style, karras2019analyzing] have demonstrated versatile image editing capability, especially by discovering spontaneously learned interpretable directions that manipulate corresponding image attributes [radford2015unsupervised, jahanian2019steerability, shen2019interpreting, goetschalckx2019ganalyze, voynov2020unsupervised]. Concretely, given a random latent code $z$, image editing can be achieved by pushing the latent vector along a specific semantic direction (e.g., age, gender):

$x^{\text{edit}} = G(z + \alpha n),$  (1)

where $x^{\text{edit}}$ is the edited image, $G$ is the generator, $\alpha$ is a scaling factor, and $n$ is the interpretable direction.
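The editing rule of Eq. (1) is simple enough to sketch directly. The snippet below is a minimal illustration, assuming a pretrained generator `G`, a latent code `z`, and a learned direction `n` (all placeholder names, not the authors' released code):

```python
import torch

def edit(G, z, n, alpha):
    """Eq. (1): move the latent code along direction n, scaled by alpha,
    and decode the edited image with the pretrained generator G."""
    return G(z + alpha * n)

# e.g. z = torch.randn(1, 512); older_face = edit(G, z, age_direction, alpha=2.0)
```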

As a consequence, recent attempts [abdal2019image2stylegan, abdal2019image2stylegan2, richardson2020encoding, zhu2020domain] aim to migrate this power to real image editing by inverting a real image back to a latent code. There are two prominent demands for this task: whether the inverted code can faithfully reconstruct the original input, and whether the pre-learned semantic directions can be successfully applied to it. However, existing methods seem to be stuck in a paradox, as achieving one end inevitably sacrifices the other. As shown in the teaser figure, I2S, I2S++, and pSp [abdal2019image2stylegan, abdal2019image2stylegan2, richardson2020encoding] concentrate only on obtaining faithful reconstruction, but the inverted codes show limited editability. In contrast, latent codes obtained from In-Domain inversion [zhu2020domain] are regularized in the semantically meaningful domain at the expense of fidelity. We argue that balancing these two factors based solely on a single image is extremely challenging, as there is no indicator to shed light on the editable domain in the latent space, preventing the optimization from reaching a perfect balance between the two factors.

In this paper, we resolve the above problem by introducing consecutive images, which can be either a video segment or the same person with different poses, to form a joint optimization process. The rationale behind our alternative solution is that the continuity brought by consecutive images can be used as an indicator to constrain editability. In particular, to ensure that each inverted latent code is editable, we jointly optimize multiple latent codes by enforcing that each of them is semantically accessible from one of the other codes with a simple linear combination. In addition, we further exploit this inborn continuity for fidelity by injecting multi-source supervision: common regions of the reconstructed images should be consistent across all the consecutive images. We establish dense correspondences between the input images, and then apply the obtained correspondences to warp each reconstruction to its neighbors for a consistency and coherence measurement.

To evaluate the proposed method, we construct a real video-based dataset, RAVDESS-12, and another consecutive-images dataset synthesized by manipulating attributes of images generated by StyleGAN [karras2019analyzing]. Extensive experiments demonstrate the superior performance of our solution over existing methods in terms of editability and reconstruction fidelity. Furthermore, our method is capable of various applications, e.g., video-based GAN inversion, unsupervised semantic transfer, and image morphing.

In summary, our contributions are three-fold:

  • We resolve the editability-fidelity paradox of GAN inversion from a novel perspective. We propose an alternative GAN inversion method for consecutive images, and delve deep into the inborn continuity property of consecutive images for GAN inversion.

  • We tailor two novel constraints: one is the mutually accessible constraint, which formulates consecutive-images inversion as a linear combination process in the latent space to ensure editability; the other is the inversion consistency constraint, which works in the RGB space to guarantee reconstruction fidelity by measuring reconstruction consistency across inputs.

  • We demonstrate state-of-the-art performance in terms of editability and reconstruction fidelity, and we support various new applications such as video-based GAN inversion and unsupervised semantic transfer.

2 Related Work

Image Editing via Latent Space Exploration. Generative models show great potential in synthesizing versatile images by taking random latent codes as inputs. Recent works show that the latent space of a pre-trained GAN encodes rich semantic directions, and varying a latent code along a specific direction edits the corresponding attribute of the image. In particular, Radford et al. [radford2015unsupervised] observe that there are directions in the latent space corresponding to adding smiles or glasses to faces. GANalyze [goetschalckx2019ganalyze] explores the memorability direction in the latent space using a fixed assessor. Jahanian et al. [jahanian2019steerability] study the steerability of GANs toward certain image transformations. Shen et al. [shen2019interpreting] explore semantic boundaries in the latent space for binary attributes. Voynov et al. [voynov2020unsupervised] discover the semantic directions hidden in the latent space with an unsupervised, model-agnostic procedure. Varying latent codes along such directions manipulates the corresponding attributes of the output images. It is natural to transfer these techniques to real image editing; before that, a real image must be inverted back to a latent code.

GAN Inversion. To realize real image editing, GAN inversion methods are proposed to infer the latent code of an input image given a pre-trained GAN [perarnau2016invertible, zhu2016generative, abdal2019image2stylegan, creswell2018inverting, pan2020exploiting]. These methods can be categorized into two classes: optimization-based and encoder-based. The former individually optimizes the latent code for a specific image, with a concentration on pixel-wise reconstruction [abdal2019image2stylegan, abdal2019image2stylegan2, creswell2018inverting, NEURIPS2018_e0ae4561]. However, ensuring reconstruction fidelity cannot guarantee that the output latent code is editable by the learned directions. On the other hand, encoder-based methods train a general encoder that maps real images to latent codes [zhu2020domain, richardson2020encoding]. In particular, In-Domain GAN inversion [zhu2020domain] combines the learned encoder with an optimization process to align the encoder with the semantic knowledge of the generator. However, existing methods do not resolve the problem of locating the editable domain in the latent space, and thus cannot achieve a perfect balance between editability and fidelity. We aim to solve this problem from the novel perspective of considering multiple consecutive images.

3 Method

Figure 1: The pipeline of the proposed consecutive-images-based GAN inversion. $G$ is a pretrained StyleGAN generator and $F$ is a pretrained FlowNet for calculating the bidirectional optical flow. The upper part shows the Mutually Accessible Constraint: given consecutive images as input, we enforce that each of them is semantically accessible from one of the other codes with a simple linear combination; both the base code $w_1$ and the directions $n_i$ are optimization targets. The bottom part shows the Inversion Consistency Constraint in the RGB space. Note that, for simplicity, we only show the forward flow between $x_1$ and the other frames when calculating $\mathcal{L}_{icc}$; the remaining flows and warpings are omitted.

3.1 Overview

The editability of the latent code and the fidelity of the reconstruction are the two vital factors that affect the performance of GAN inversion. To satisfy both, we exploit the continuity brought by consecutive images, which depict the same subject with different variations. The pipeline of our approach is illustrated in Fig. 1. Given a sequence of consecutive images as input, the proposed method seeks their optimal latent codes in the latent space and then feeds them into a pretrained and fixed generator for reconstruction. In particular, i) we define a linear combination mechanism among consecutive images, which facilitates the editability of the latent codes via a joint optimization with semantic directions, and ii) we establish a consistency constraint in the RGB space between the warped results of the reconstructed images and their corresponding originals, promoting the fidelity of the reconstruction. Note that we choose the generator of StyleGAN [karras2019style] as the pretrained generator in our model.

3.2 Consecutive Images Based GAN Inversion

Mutually Accessible Constraint. Each image in an input set may gradually change into the others (as shown in Fig. 1, where the mouth opens progressively), or it may vary drastically, such as the same person at very different poses. In either case, given the first image $x_1$ with latent code $w_1$, it can be intuitively assumed that the latent codes of the other images are linear combinations of $w_1$ and a specific semantic direction $n$ (e.g., expression, pose). Then, the latent code $w_i$ of the $i$-th image can be formulated as follows:

$w_i = w_1 + \alpha_i n, \quad i = 2, \dots, N,$  (2)

where $N$ denotes the total number of images and $\alpha_i$ is a scaling factor. However, this assumption is too strong, since consecutive images are likely to vary from one to another in different attributes. Alternatively, the specific semantic direction could be predefined [shen2019interpreting, voynov2020unsupervised, shen2021closedform], but that would require semantically equivalent input images. To cope with images with arbitrary semantic variations, we regard the direction as one of our optimization targets. Note that the scaling factors $\alpha_i$ are also learnable; we therefore absorb them into the direction and reformulate Eq. (2) as the proposed mutually accessible constraint:

$w_i = w_1 + n_i, \quad i = 2, \dots, N.$  (3)

In this way, we can figure out the latent codes of all the images via a joint optimization of $w_1$ and the $n_i$'s. Such a simple linear combination mechanism promotes the editability of the latent codes. The main reasons are that i) each inverted latent code can be regarded as an edited version of one of the others, ii) if the images vary along a specific semantic direction, the scales of variation can be learned adaptively, and, more importantly, iii) it can deal with variations of different attributes among consecutive images. Moreover, the learned $n_i$'s have the potential to be transferred to other latent codes as predefined semantic directions.
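As a rough illustration of how the mutually accessible constraint could be parameterized, the sketch below keeps only the base code and the per-image directions as free variables; the W+ layout (18 x 512 for a 1024-resolution StyleGAN) and all names are assumptions, not the authors' released code:

```python
import torch

N, L, D = 5, 18, 512                               # images per sequence, W+ layers, latent dim (assumed)
w1 = torch.zeros(1, L, D, requires_grad=True)      # base latent code w_1
n = torch.zeros(N - 1, L, D, requires_grad=True)   # directions n_2 ... n_N, initialized to zero

def latent_codes(w1, n):
    """Eq. (3): every other code is a single linear step away from w_1,
    which is what keeps the whole set of codes editable."""
    return torch.cat([w1, w1 + n], dim=0)          # shape (N, L, D)
```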

Inversion Consistency Constraint. Once we have the latent codes, we feed them into the generator to reconstruct the consecutive images. For a certain base image $x_i$, its reconstruction $x_i'$ with regard to the latent code $w_i$ is calculated by

$x_i' = G(w_i), \quad i = 1, \dots, N.$  (4)

In order to ensure the fidelity of the reconstructions, we particularly consider inversion consistency between common regions of the reconstructed images and the input consecutive images. Specifically, we tailor an inversion consistency constraint loss based on bidirectional flows in the RGB space. As shown at the bottom of Fig. 1, first, the forward flows between the base image $x_1$ and the other images are calculated by a pretrained FlowNet2 [ilg2017flownet] $F$:

$f_{1 \to i} = F(x_1, x_i),$  (5)

where $i = 2, \dots, N$. Then we warp $x_1$ with this flow to form the warped images $\tilde{x}_{1 \to i}$:

$\tilde{x}_{1 \to i} = \mathcal{W}(x_1, f_{1 \to i}),$  (6)

where $\mathcal{W}(\cdot, \cdot)$ denotes the warping operation. Likewise, we obtain the warped results of the recovered image $x_1'$:

$\tilde{x}'_{1 \to i} = \mathcal{W}(x_1', f_{1 \to i}).$  (7)

Since consecutive images describe the same subject, an inherent relationship should exist among the warpings $\tilde{x}_{1 \to i}$ generated for each base image, and this relationship should transfer to the corresponding warpings $\tilde{x}'_{1 \to i}$ of the reconstructions. In the same way, we can calculate the backward flow $f_{i \to 1}$ between $x_i$ and $x_1$, and the corresponding warpings $\tilde{x}_{i \to 1}$ and $\tilde{x}'_{i \to 1}$. By iterating over all values of $i$ and $j$, we inject multi-source supervision from the warpings of the input images to confine the reconstructions, and the proposed inversion consistency constraint loss is given by

$\mathcal{L}_{icc} = \sum_{i=1}^{N} \sum_{j \neq i} \ell\big(\mathcal{W}(x_i, f_{i \to j}),\ \mathcal{W}(x_i', f_{i \to j})\big),$  (8)

where $\ell(\cdot, \cdot)$ denotes a pixel-wise distance.
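A possible implementation of the inversion consistency constraint is sketched below. The flow estimator `flow_net` stands in for the pretrained FlowNet2, and its calling convention, as well as the (dx, dy) channel order of the flow, are assumptions; the warping itself is standard backward warping with `grid_sample`:

```python
import torch
import torch.nn.functional as F

def warp(img, flow):
    """Backward-warp img (B,C,H,W) with a dense flow field (B,2,H,W), in pixels,
    assuming channel 0 is the x-displacement and channel 1 the y-displacement."""
    B, _, H, W = img.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(img.device)        # (2,H,W) pixel grid
    coords = base.unsqueeze(0) + flow                                  # (B,2,H,W) sampling coords
    gx = 2.0 * coords[:, 0] / (W - 1) - 1.0                            # normalize to [-1, 1]
    gy = 2.0 * coords[:, 1] / (H - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                               # (B,H,W,2) for grid_sample
    return F.grid_sample(img, grid, align_corners=True)

def icc_loss(inputs, recons, flow_net):
    """Eq. (8): compare warped inputs against warped reconstructions over all
    ordered pairs (i, j), i != j, using an L1 pixel-wise distance."""
    N = inputs.shape[0]
    loss = 0.0
    for i in range(N):
        for j in range(N):
            if i == j:
                continue
            f_ij = flow_net(inputs[i:i + 1], inputs[j:j + 1])          # assumed flow-net interface
            loss = loss + (warp(inputs[i:i + 1], f_ij)
                           - warp(recons[i:i + 1], f_ij)).abs().mean()
    return loss
```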

Moreover, we maintain pixel-wise consistency between the input images and their corresponding recoveries. A pixel-wise consistency loss is therefore introduced into our objective:

$\mathcal{L}_{pix} = \sum_{i=1}^{N} \ell(x_i, x_i').$  (9)

To guarantee fine visual perception of the reconstructions, we also utilize a perceptual loss, formulated as follows:

$\mathcal{L}_{percep} = \sum_{i=1}^{N} \sum_{k} \big\| \phi_k(x_i) - \phi_k(x_i') \big\|_2^2,$  (10)

where $\phi_k$ denotes the $k$-th selected layer of a pretrained VGG-16 network; we follow Abdal et al. [abdal2019image2stylegan] and use the features produced by the conv1_1, conv1_2, conv3_2, and conv4_2 layers of VGG-16 for modeling this loss.
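The perceptual term can be sketched with torchvision's VGG-16; the feature indices below are our mapping of conv1_1, conv1_2, conv3_2, and conv4_2 onto `vgg16().features`, and should be treated as an assumption rather than the authors' exact setup:

```python
import torch
import torchvision.models as models

vgg = models.vgg16(pretrained=True).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)                 # VGG stays fixed; only the latents are optimized

LAYERS = {0, 2, 12, 19}                     # assumed indices of conv1_1, conv1_2, conv3_2, conv4_2

def perceptual_loss(x, y):
    """Eq. (10): squared distance between selected VGG-16 features of x and y."""
    loss, fx, fy = 0.0, x, y
    for idx, layer in enumerate(vgg):
        fx, fy = layer(fx), layer(fy)
        if idx in LAYERS:
            loss = loss + (fx - fy).pow(2).mean()
    return loss
```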

Finally, the whole objective function is defined as follows:

$\mathcal{L} = \lambda_{icc}\mathcal{L}_{icc} + \lambda_{pix}\mathcal{L}_{pix} + \lambda_{percep}\mathcal{L}_{percep},$  (11)

where the $\lambda$'s denote the balance factors. Then the latent code $w_1$ and the directions $n_i$ can be optimized by

$w_1^*, \{n_i^*\} = \arg\min_{w_1, \{n_i\}} \mathcal{L}.$  (12)

Note that we follow [abdal2019image2stylegan] and initialize $w_1$ with the mean latent vector. The directions $n_i$ are initialized to zero and updated during optimization.
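Putting the pieces together, a minimal optimization loop could look like the sketch below. It reuses the placeholder names from the earlier snippets (`latent_codes`, `icc_loss`, `perceptual_loss`, `flow_net`, the generator `G`, and the input tensor `inputs`); the balance weights are dummy values, not the paper's:

```python
import torch

lambda_icc, lambda_pix, lambda_percep = 1.0, 1.0, 1.0    # placeholder weights

optimizer = torch.optim.Adam([w1, n], lr=0.01)            # w1, n from the earlier sketch

for step in range(5000):                                  # 5,000 steps, as in the paper
    codes = latent_codes(w1, n)                           # (N, L, D) latent codes, Eq. (3)
    recons = G(codes)                                     # reconstructions, Eq. (4) (assumed generator call)
    loss = (lambda_icc * icc_loss(inputs, recons, flow_net)
            + lambda_pix * (inputs - recons).pow(2).mean()
            + lambda_percep * perceptual_loss(inputs, recons))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```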

4 Experiments

Methods | RAVDESS-12 Dataset (NIQE / FID / LPIPS / MSE(e-3)) | Synthesized Dataset (NIQE / FID / LPIPS / MSE(e-3))
I2S [abdal2019image2stylegan] | 3.770 / 16.284 / 0.162 / 8.791 | 3.374 / 48.909 / 0.271 / 35.011
pSp [richardson2020encoding] | 3.668 / 29.701 / 0.202 / 22.337 | 3.910 / 84.355 / 0.391 / 46.244
InD [zhu2020domain] | 3.765 / 18.135 / 0.193 / 9.963 | 3.152 / 42.773 / 0.352 / 44.645
Ours | 3.596 / 13.136 / 0.148 / 5.972 | 2.807 / 37.225 / 0.250 / 24.395
I2S++ [abdal2019image2stylegan2] | 3.358 / 0.320 / 0.003 / 0.174 | 2.644 / 2.967 / 0.014 / 1.458
Ours++ | 3.352 / 0.311 / 0.003 / 0.165 | 2.597 / 2.897 / 0.014 / 1.432

Table 1: Comparisons with existing GAN inversion methods on image reconstruction with four metrics on two datasets. Lower is better for all metrics; the best results are marked in bold.

4.1 Implementation Details

We implement the proposed method in PyTorch [NEURIPS2019_9015] on a PC with an Nvidia GeForce RTX 3090. We utilize the generator of StyleGAN [karras2019style] pre-trained on the FFHQ dataset [karras2019style] at a resolution of 1024x1024. The latent codes and semantic directions are optimized with the Adam optimizer [kingma2014adam]. Following [abdal2019image2stylegan], we use 5,000 gradient descent steps with a learning rate of 0.01 and keep the remaining optimizer hyperparameters as in that work. The balancing weights in Eq. (11) are set empirically, and we set $N = 5$ in Eq. (4), i.e., each input sequence contains 5 consecutive images.

4.2 Experimental Settings

Datasets. We first conduct our experiments on the RAVDESS dataset [livingstone2018ryerson], which contains real videos. The original RAVDESS dataset contains 2,452 videos of 24 subjects speaking and singing with various semantic expressions. We select 12 of these videos for evaluation, resulting in 1,454 frames, and name this subset the RAVDESS-12 Dataset. Since there are no ground-truth latent codes for real images, we cannot objectively evaluate the inverted codes and their editability in the latent space. On the other hand, the learned semantic directions are known to work very well on generated images. We therefore construct a synthesized dataset containing 1,000 samples randomly generated by StyleGAN. For each sample, we vary its latent code with 5 random combinations of a scaling value (ranging from -3 to 3) and a semantic direction (acquired from InterFaceGAN [shen2019interpreting]), producing 5,000 images. We record the latent codes of the original samples, the corresponding editing specifications, and the edited latent codes for evaluating editability. GAN inversion methods invert the original samples and edit them toward the target attributes for comparison; a sketch of this construction is given below.
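The following snippet illustrates how such a synthesized set could be built from the description above; `G.mapping`, `G.synthesis`, and the `directions` dictionary (attribute name to direction vector) are placeholders for whatever StyleGAN implementation and InterFaceGAN directions are actually used:

```python
import random
import torch

samples = []
for _ in range(1000):                                    # 1,000 original samples
    z = torch.randn(1, 512)
    w = G.mapping(z)                                     # assumed mapping-network call
    for _ in range(5):                                   # 5 random edits per sample
        name, n = random.choice(list(directions.items()))
        alpha = random.uniform(-3.0, 3.0)                # scaling value in [-3, 3]
        w_edit = w + alpha * n
        samples.append({"w": w, "attr": name, "alpha": alpha,
                        "w_edit": w_edit, "img": G.synthesis(w_edit)})
```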

Competitors. We mainly compare with four GAN inversion methods: Image2StyleGAN (I2S) [abdal2019image2stylegan], Image2StyleGAN++ (I2S++) [abdal2019image2stylegan2], In-Domain inversion (InD) [zhu2020domain], and the pSp network [richardson2020encoding]. All methods invert to the same latent space of StyleGAN and apply the same directions for editing. It is worth noting that I2S++ introduces an additional noise space for recovering small details. A main problem is that the two inverted codes, in the $W^+$ space and the noise space, are highly coupled, whereas the learned semantic directions are optimized in $W^+$ only. Applying them changes the $W^+$ latent code but leaves the noise vector unchanged, and these unpaired vectors yield "ghosting" artifacts after editing (see Fig. 17). As a result, we mainly use I2S++ for reconstruction comparisons, and we also extend our method to include the noise space, denoted Ours++.

Figure 9: Qualitative comparison on image reconstruction (panels, left to right: Original, I2S, pSp, InD, Ours, I2S++, Ours++). Compared with the works that optimize in the $W^+$ space (left part), our method reconstructs the most faithful appearances. Involving the noise space largely improves reconstruction (right part), but Ours++ shows better color preservation than I2S++ (second row).

Evaluation Metrics. For the quantitative comparisons, we use four metrics to evaluate reconstruction fidelity: the Naturalness Image Quality Evaluator (NIQE) [mittal2012making], the Fréchet Inception Distance (FID) [heusel2017gans], the Learned Perceptual Image Patch Similarity (LPIPS) [zhang2018unreasonable], and the pixel-wise Mean Square Error (MSE). In particular, FID computes the Wasserstein-2 distance between the Inception feature distributions of the input and output images. NIQE estimates the quality perceived by a human; it is a completely blind assessment that requires no ground-truth image. Since there is no GT for the semantic editing task on the real RAVDESS-12 Dataset, we use FID and NIQE to evaluate the real image editing results. The two paired metrics can be computed as sketched below.
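A minimal sketch of the paired metrics using the `lpips` package follows; images are assumed to be tensors scaled to [-1, 1], and FID and NIQE require separate tooling that is omitted here:

```python
import torch
import lpips

lpips_fn = lpips.LPIPS(net="alex")          # learned perceptual metric

def paired_metrics(x, y):
    """Pixel-wise MSE and LPIPS between a reconstruction y and its input x."""
    return {"MSE": (x - y).pow(2).mean().item(),
            "LPIPS": lpips_fn(x, y).mean().item()}
```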

4.3 Evaluation on Image Reconstruction

Quantitative Evaluation. We first evaluate the reconstruction fidelity of the inverted codes. Quantitative comparisons are shown in Tab. 1. Our method significantly outperforms the three editable GAN inversion methods (upper part) on both the real dataset and the synthesized one. In particular, for the pixel-wise difference metric MSE, we improve over the state of the art by 31%. This indicates that the proposed joint optimization successfully incorporates complementary information from neighboring images. Besides, by involving the noise space, I2S++ and Ours++ achieve the most faithful reconstructions of all methods; thanks to the inter-image coherence, Ours++ pushes the reconstruction record a bit further.

Qualitative Evaluation. Qualitative comparisons are shown in Fig. 9. Image2StyleGAN cannot recover the image colors correctly, while pSp and InD cannot recover the finest facial details of the original images (see the teeth in the first row). Compared with these three methods, our method reconstructs faithful appearance details. Unsurprisingly, I2S++ recovers the finest details among all competitors, mainly because its optimization in the noise space encodes high-frequency information. We also show our results optimized in the noise space (Ours++), which preserve the original colors better than I2S++ (see the second row).

Figure 17: Qualitative comparison on semantic editing with the Pose and Age attributes on the real RAVDESS-12 Dataset (panels, left to right: Input, I2S, InD, pSp, Ours, I2S++, Ours++). Images marked by red boxes are the reconstruction targets, and images in the middle row of each example are the inversion results. Our method supports more favorable semantic editing.
Metrics | I2S [abdal2019image2stylegan] | pSp [richardson2020encoding] | InD [zhu2020domain] | Ours
NIQE | 3.776 | 5.242 | 3.693 | 3.254
FID | 21.609 | 30.128 | 19.271 | 15.482

Table 2: Quantitative evaluation on real image manipulation with two blind metrics on the RAVDESS-12 Dataset. Lower is better for both metrics; the best results are marked in bold.
Figure 23: Qualitative comparison on semantic editing with the Smile and Age attributes on the Synthesized Dataset (panels, left to right: GT, I2S, InD, pSp, Ours). Note that we force each input sequence to contain semantic changes different from the direction used for editing: the first sequence contains "gender" changes for inversion and is edited with "smile", while the second contains "pose" changes for inversion and is edited with "age". Our edited results are the most similar to the ground truths.

4.4 Evaluation on Image Editing

In this section, we evaluate our GAN inversion method on real image editing as well as synthesized images. We conduct two editing tasks based on the inverted latent codes, the first one is semantic manipulation and the second is image morphing.

4.4.1 Semantic Manipulation

Semantic manipulation aims to edit a real image by varying its inverted codes along with a specific semantic direction. We use five semantic directions (i.e., gender, pose, smile, eyeglasses, and age) acquired by [shen2019interpreting] in the experiment.

Metrics | I2S [abdal2019image2stylegan] | pSp [richardson2020encoding] | InD [zhu2020domain] | Ours
NIQE | 3.390 | 3.917 | 3.193 | 3.163
FID | 35.894 | 58.342 | 48.867 | 33.872
LPIPS | 0.399 | 0.452 | 0.424 | 0.347
MSE(e-3) | 89.671 | 126.642 | 101.563 | 70.224

Table 3: Quantitative evaluation on image manipulation with four metrics on the Synthesized Dataset. Lower is better for all metrics; the best results are marked in bold.

Qualitative Evaluation. The qualitative comparisons on real data are shown in Fig. 17. Our manipulated faces are visually more plausible than those of the competitors. In particular, the manipulated results of Image2StyleGAN [abdal2019image2stylegan] present noisy artifacts under pose changes, and glasses are entangled with the age attribute, revealing that the edited latent codes have escaped from the editable domain. Similar situations can be found for I2S++ [abdal2019image2stylegan2]. The manipulated faces from pSp [richardson2020encoding] are almost unchanged, because pSp focuses on learning a direct mapping from the input to the latent code and ignores editability. This problem is addressed by In-Domain inversion [zhu2020domain], but at the cost of reconstruction quality. In contrast, thanks to the inherent editability constraint jointly enforced between consecutive images, our inverted latent codes are more semantically editable, leading to more disentangled manipulations. On the other hand, the noise-space optimization methods (right part of Fig. 17) show more obvious noise artifacts than the others under pose changes, because the pre-optimized noise vector no longer matches the edited latent vector. Nevertheless, Ours++ disentangles gender from glasses better than the code inverted by I2S++.

To evaluate whether the obtained inversion can be edited along arbitrary directions, on the synthesized dataset we force the editing direction to differ from the semantic changes contained in the input sequence. The qualitative comparisons on the synthesized data are shown in Fig. 23. Similar to the evaluation on real data, I2S produces obvious artifacts, pSp fails to edit the results, and InD cannot preserve the original identity. Our manipulated results are more similar to the GTs, which indicates that our inverted codes are much closer to the GT latent codes and also inherit their editability.

Quantitative Evaluation. We present the quantitative comparisons in Tab. 2 and Tab. 3. Our method achieves the best results on both the RAVDESS-12 Dataset and the Synthesized Dataset. In particular, for the blind metric NIQE, our edited results achieve a 13.8% improvement over the state-of-the-art method, which indicates that our editing is visually more plausible. The quantitative results on the Synthesized Dataset evaluate whether the inverted codes are close enough to the GT codes to reuse their semantic information. From the two non-blind metrics, LPIPS and MSE, we can see that our edited results are very similar to the GTs. Thanks to our semantically accessible regularization in the latent space, our inverted latent codes show stronger editability than those of the competitors.

Figure 24: Qualitative comparison on the image morphing task (rows: Ours, InD, pSp, I2S; columns: Inverted A, morphing sequence, Inverted B). Our results present a continuous morphing process and the morphing faces are realistic.

4.4.2 Image Morphing

Image morphing aims to fuse two images semantically by interpolating their latent codes. It is another way to evaluate whether the inverted codes indeed lie in the latent space and reuse its semantic knowledge: for high-quality inverted codes, the interpolated results should also stay in the editable domain and the semantics should vary continuously. Qualitative comparisons are shown in Fig. 24. The morphing results produced by Image2StyleGAN [abdal2019image2stylegan] have noticeable artifacts, while the results produced by pSp [richardson2020encoding] are unrealistic, with unnatural hair. In contrast, our method presents high-quality results with a continuous morphing process. We also present the quantitative evaluation on the morphing task in Tab. 4 and Tab. 5; our inversion results outperform the other inversion methods on both the real dataset and the synthesized one.
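Morphing itself is plain latent interpolation; a minimal sketch, with `G`, `wA`, and `wB` as placeholders for the pretrained generator and two inverted codes, is:

```python
import torch

def morph(G, wA, wB, steps=8):
    """Fade from G(wA) to G(wB) by linearly interpolating the inverted codes."""
    return [G((1.0 - t) * wA + t * wB) for t in torch.linspace(0.0, 1.0, steps)]
```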

Metrics | I2S [abdal2019image2stylegan] | pSp [richardson2020encoding] | InD [zhu2020domain] | Ours
NIQE | 4.255 | 5.350 | 4.051 | 3.688
FID | 40.627 | 38.474 | 38.925 | 37.695

Table 4: Quantitative evaluation on image morphing with two blind metrics on the RAVDESS-12 Dataset. Lower is better for both metrics; the best results are marked in bold.
Metrics | I2S [abdal2019image2stylegan] | pSp [richardson2020encoding] | InD [zhu2020domain] | Ours
NIQE | 3.389 | 3.800 | 3.212 | 3.115
FID | 31.776 | 30.192 | 21.901 | 18.621
LPIPS | 0.472 | 0.467 | 0.469 | 0.402
MSE(e-3) | 141.432 | 121.834 | 125.674 | 98.354

Table 5: Quantitative evaluation on image morphing with four metrics on the Synthesized Dataset. Lower is better for all metrics; the best results are marked in bold.
Figure 25: The directions $n_i$ acquired from consecutive images without supervision can be used to transfer semantics. The first row is the input set regarded as the reference, and the images in red boxes are the target faces. We can transfer the semantic changes of the reference to the target faces, even when more than one attribute changes.

4.5 Semantic Transfer

As discussed in Sec. 3.2, both the latent code $w_1$ and the semantic directions $n_i$ are obtained without supervision after inversion. Besides the latent code, each acquired direction $n_i$ represents the semantic changes of the input images. Given the input images as a reference, we can transfer their semantic changes to target faces.

The transfer results are shown in Fig. 25. The semantic attributes of the target faces are modified following the reference image set. Note that more than one attribute is changed in the reference; for example, in the right example, the mouth and pose vary simultaneously, yet we can still capture those changes. This shows that our acquired direction is disentangled from the reference identity and can be applied to other faces. Unlike existing supervised [shen2019interpreting, shen2020interfacegan] or unsupervised [shen2021closedform, voynov2020unsupervised] learning of interpretable directions, this sheds light on a new, exemplar-based way of learning semantic directions; a minimal sketch is given below.
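The transfer amounts to adding the recovered direction to the inverted code of a new face; `invert`, `G`, and `n_ref` are placeholder names for the inversion procedure, the pretrained generator, and a direction recovered from a reference sequence:

```python
import torch

def transfer(G, w_target, n_ref, alpha=1.0):
    """Apply the semantic change recovered from a reference sequence (n_ref)
    to the inverted code of a target face."""
    return G(w_target + alpha * n_ref)

# e.g. w_target = invert(target_image); edited = transfer(G, w_target, n_ref)
```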

4.6 Ablation Study

In this section, we analyze the efficacy of our two components: the mutually accessible constraint (MAC) and the inversion consistency constraint (ICC). Without these two components, our method reduces to Image2StyleGAN inversion, which we take as our baseline. By unplugging one of the two constraints, we obtain two variants, "w/o MAC" and "w/o ICC". In the "w/o MAC" case, the linear combination of Eq. (3) is removed and all the latent codes are optimized simultaneously.

Variants | NIQE | FID | LPIPS | MSE(e-3)
Baseline | 3.770 | 16.284 | 0.162 | 8.791
w/o MAC | 3.685 | 13.375 | 0.151 | 8.065
w/o ICC | 3.765 | 14.791 | 0.160 | 8.508
Ours | 3.596 | 13.136 | 0.148 | 5.972

Table 6: Ablation study on image reconstruction with four metrics. Lower is better for all metrics; the best results are marked in bold.
Metrics | Baseline | w/o MAC | w/o ICC | Ours
NIQE | 3.776 | 3.659 | 3.398 | 3.254
FID | 21.609 | 16.121 | 17.274 | 15.482

Table 7: Ablation study on semantic manipulation with two blind metrics. Lower is better for both metrics; the best results are marked in bold.

We perform ablation studies on the image reconstruction and semantic manipulation tasks on the RAVDESS-12 Dataset. Quantitative comparisons for GAN inversion are shown in Tab. 6. Every variant outperforms the baseline on all metrics, indicating that both components contribute to the GAN inversion performance. Meanwhile, the "w/o MAC" variant performs better than the "w/o ICC" variant, indicating that the inversion consistency brought by consecutive images contributes more to the reconstruction task. In Tab. 7, on semantic editing, we observe the opposite: the "w/o ICC" variant performs better than the "w/o MAC" variant, revealing that the mutually accessible constraint confines the inverted latent codes to the editable domain. Together, the two evaluations show that our two constraints work as intended by our design principles.

Figure 30: Ablation study on semantic editing with the two variants and the baseline by editing the "age" attribute (panels, left to right: Baseline, w/o MAC, w/o ICC, Ours).

We show the results of the different variants in Fig. 30 by changing the "age" attribute. The baseline and the "w/o MAC" variant entangle the age change with glasses, showing that concentrating only on reconstruction fidelity limits the editability of the inverted codes. In contrast, the "w/o ICC" variant and our final result successfully modify the "age" attribute, revealing the strong regularization power of our designed mutually accessible constraint.

5 Conclusion

In this paper, we propose an alternative GAN inversion method for consecutive images. We formulate the inversion of consecutive images as a linear combination process in the latent space to ensure editability, and transfer reconstruction consistency across inputs in the RGB space to guarantee reconstruction fidelity. The experimental results demonstrate the effectiveness of our method in terms of editability and reconstruction fidelity. In addition, our method supports various new applications such as video-based GAN inversion and unsupervised semantic transfer.

References

6 Supplemental Results

Metrics | Image Reconstruction (I2S / pSp / InD / Ours) | Nonlinear Semantic Edit (I2S / pSp / InD / Ours)
NIQE | 3.632 / 3.439 / 3.254 / 2.997 | 3.940 / 3.974 / 3.703 / 3.476
FID | 40.098 / 62.932 / 77.692 / 33.692 | 48.032 / 64.224 / 51.607 / 37.039
LPIPS | 0.252 / 0.323 / 0.414 / 0.238 | 0.489 / 0.522 / 0.471 / 0.403
MSE(e-3) | 30.767 / 79.878 / 83.294 / 25.192 | 87.361 / 118.285 / 99.623 / 69.449

Table 8: Quantitative evaluations on the image reconstruction and semantic editing tasks on the nonlinear dataset. Lower is better for all metrics.

To further demonstrate that our method is not restricted to linear editing, we synthesize a new dataset under nonlinear constraints using StyleFlow [abdal2021styleflow]. It consists of 1,000 sequences (5,000 images in total) with different semantic changes, such as pose, illumination, expression, eyeglasses, gender, and age. We conduct an inversion experiment on it, with results shown in Tab. 8 and Fig. 36; our method obtains accurate reconstructions on this nonlinear dataset. Besides, we also conduct a nonlinear semantic editing task using StyleFlow [abdal2021styleflow], as shown in the right part of Tab. 8 and in Fig. 42. Our results are more similar to the GT (see the hair color of "age+" in the bottom-right corner of Fig. 42). These results show that our method is not constrained by the linear-editing assumption: it is an optimization-based GAN inversion method that does not rely on any attribute constraint on the input images, and the optimization is image-specific, without training a general network.

Figure 36: Qualitative comparison on image reconstruction on the nonlinear dataset (panels, left to right: Original, I2S, pSp, InD, Ours).
Figure 42: Qualitative comparison on semantic editing on the nonlinear dataset (panels, left to right: GT, I2S, pSp, InD, Ours).

We also give more qualitative comparisons on image reconstruction on the RAVDESS-12 Dataset and the linear Synthesized Dataset in Fig. 50. Our method reconstructs the most faithful appearances when optimizing the latent code in the $W^+$ space. Involving the noise space largely improves reconstruction quality, and Ours++ reconstructs the correct colors.

The qualitative comparison on the image editing task on the Synthesized Dataset is shown in Fig. 56. Our edited results are more similar to the ground truths and show pleasing appearances, which indicates that our inverted latent codes are close enough to the GT codes.

Figure 50: Qualitative comparison on image reconstruction (panels, left to right: Original, I2S, pSp, InD, Ours, I2S++, Ours++).
Figure 56: Qualitative comparison on semantic editing (panels, left to right: GT, I2S, InD, pSp, Ours).