1 Introduction
Generative Adversarial Networks (GANs) [NIPS2014_5ca3e9b1, karras2019style, karras2019analyzing] have demonstrated versatile image editing capability, especially through the discovery of spontaneously learned interpretable directions that can manipulate the corresponding image attributes [radford2015unsupervised, jahanian2019steerability, shen2019interpreting, goetschalckx2019ganalyze, voynov2020unsupervised]. Concretely, given a random latent code $z$, image editing can be achieved by pushing the latent vector along a specific semantic direction (e.g., age, gender):

$$I' = G(z + \alpha n), \qquad (1)$$

where $I'$ is the edited image, $G$ is the generator, $\alpha$ is a scaling factor, and $n$ is the interpretable direction.
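As a minimal sketch of the edit in Eq. (1), the snippet below moves a latent code along a direction and feeds the result to a generator. The names `edit_latent` and `toy_generator` are hypothetical, and the fixed linear map only stands in for a real pretrained generator so that the example runs on its own.

```python
def edit_latent(z, n, alpha):
    """Return z + alpha * n, the edited latent code of Eq. (1)."""
    if len(z) != len(n):
        raise ValueError("latent code and direction must have the same length")
    return [z_k + alpha * n_k for z_k, n_k in zip(z, n)]


def toy_generator(z):
    """Hypothetical stand-in for G: a fixed linear map from 3-D codes to 2 'pixels'."""
    weights = [[1.0, 0.5, 0.0],
               [0.0, 1.0, 2.0]]
    return [sum(w_k * z_k for w_k, z_k in zip(row, z)) for row in weights]


z = [0.2, -0.1, 0.4]   # latent code (fixed here for reproducibility)
n = [0.0, 0.0, 1.0]    # semantic direction, e.g. "age" (assumed unit length)
alpha = 2.0            # scaling factor

original = toy_generator(z)
edited = toy_generator(edit_latent(z, n, alpha))
```

In this toy setting the direction only touches the third latent coordinate, so only the second output value changes; with a real generator the effect of a direction is of course not this cleanly separable.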
As a consequence, recent attempts [abdal2019image2stylegan, abdal2019image2stylegan2, richardson2020encoding, zhu2020domain] aim to migrate this power to real image editing by inverting an image to its latent code. There are two prominent demands for this task: whether the inverted code can faithfully reconstruct the original input, and whether the prelearned semantic directions can be successfully applied. However, existing methods seem to be stuck in a paradox, as achieving one end inevitably sacrifices the other. As shown in the teaser figure, I2S, I2S++, and pSp [abdal2019image2stylegan, abdal2019image2stylegan2, richardson2020encoding] concentrate only on obtaining faithful reconstruction, but the inverted codes show limited editability. In contrast, latent codes obtained from In-domain inversion [zhu2020domain] are regularized in the semantically meaningful domain at the expense of fidelity. We argue that balancing these two factors based solely on a single image is extremely challenging, as there is no indicator of where the editable domain lies in the latent space, which prevents the optimization from balancing the two factors.
In this paper, we resolve the above problem by introducing consecutive images, which can be either a video segment or the same person in different poses, to form a joint optimization process. The rationale behind our solution is that the continuity brought by consecutive images can be used as an indicator to constrain the editability. In particular, to ensure that each of the inverted latent codes is editable, we jointly optimize multiple latent codes by enforcing that each of them is semantically accessible from one of the other codes through a simple linear combination. In addition, we further exploit this inborn continuity for fidelity by injecting multi-source supervision: common regions of the reconstructed images should be consistent across all the consecutive images. We establish dense correspondences between input images, and then apply the obtained correspondences to warp each of the reconstructions to its neighbors for a consistency and coherence measurement.
To evaluate the proposed method, we construct a real video-based dataset, RAVDESS12, and another dataset of consecutive images synthesized by manipulating attributes of images generated by StyleGAN [karras2019analyzing]. Extensive experiments demonstrate the superior performance of our solution over existing methods in terms of editability and reconstruction fidelity. Furthermore, our method supports various applications, e.g., video-based GAN inversion, unsupervised semantic transfer, and image morphing.
In summary, our contributions are threefold:

We resolve the editability-fidelity paradox of GAN inversion from a novel perspective. We propose an alternative GAN inversion method for consecutive images, and delve into the inborn continuity property of consecutive images for GAN inversion.

We tailor two novel constraints: the mutually accessible constraint, which formulates the inversion of consecutive images as a linear combination process in the latent space to ensure editability, and the inversion consistency constraint, which works in the RGB space to guarantee reconstruction fidelity by measuring reconstruction consistency across inputs.

We demonstrate superior performance in terms of editability and reconstruction fidelity, and we support various new applications such as video-based GAN inversion and unsupervised semantic transfer.
2 Related Work
Image Editing via Latent Space Exploration. Generative models show great potential in synthesizing versatile images from random latent codes. Recent works show that the latent space of a pretrained GAN encodes rich semantic directions, and that moving a latent code along a specific direction edits the corresponding attribute of the output image. In particular, Radford et al. [radford2015unsupervised] observe that there are directions in the latent space corresponding to adding smiles or glasses to faces. Goetschalckx et al. [goetschalckx2019ganalyze] explore the memorability direction in the latent space with a fixed assessor. Jahanian et al. [jahanian2019steerability] study the steerability of GANs to fit some image transformations. Shen et al. [shen2019interpreting] explore the semantic boundaries of binary attributes in the latent space. Voynov et al. [voynov2020unsupervised] discover the semantic directions hidden in the latent space by an unsupervised, model-agnostic procedure. It is natural to transfer these techniques to real image editing, which first requires inverting a real image back to a latent code.
GAN Inversion. To realize real image editing, GAN inversion methods are proposed to infer the latent code of an input image based on a pretrained GAN [perarnau2016invertible, zhu2016generative, abdal2019image2stylegan, creswell2018inverting, pan2020exploiting]. These methods can be categorized into two classes, optimization-based and encoder-based. The former individually optimizes the latent code for a specific image, concentrating on pixel-wise reconstruction [abdal2019image2stylegan, abdal2019image2stylegan2, creswell2018inverting, NEURIPS2018_e0ae4561]. However, ensuring reconstruction fidelity does not guarantee that the output latent code is editable by the learned directions. On the other hand, encoder-based methods train a general encoder that maps real images to latent codes [zhu2020domain, richardson2020encoding]. In particular, In-Domain GAN inversion [zhu2020domain] combines the learned encoder with an optimization process to align the encoder with the semantic knowledge of the generator. However, existing methods do not identify the editable domain in the latent space, and thus cannot achieve a good balance between editability and fidelity. We aim to solve this problem from the novel view of considering multiple consecutive images.
3 Method
3.1 Overview
The editability of the latent code and the fidelity of the reconstruction are the two vital factors that affect the performance of GAN inversion. To satisfy both, we exploit the continuity brought by consecutive images, which depict the same subject with different variations. The pipeline of our approach is illustrated in Fig. 1. Given a sequence of consecutive images as input, the proposed method seeks their optimal latent codes in the latent space, and then feeds them into a pretrained and fixed generator for reconstruction. In particular, i) we define a linear combination mechanism among consecutive images, which facilitates the editability of the latent codes via a joint optimization with semantic directions, and ii) we establish a consistency constraint in the RGB space between the warped results of the reconstructed images and their corresponding originals, promoting the fidelity of the reconstruction. Note that we choose the generator of StyleGAN [karras2019style] as the pretrained one in our model.
3.2 Consecutive Images Based GAN Inversion
Mutually Accessible Constraint. Each image in an input set may change gradually into the others, as in Fig. 1, where the mouth opens progressively, or it may differ drastically from them, as with the same person in very different poses. In either case, given the first image $I_1$ with latent code $w_1$, it can be intuitively assumed that the latent codes of the other images are linear combinations of that of the starting image with a specific semantic direction $n$ (e.g., expression, pose). The latent code $w_i$ of the image $I_i$ can then be formulated as follows:

$$w_i = w_1 + \alpha_i n, \quad i = 2, \dots, N, \qquad (2)$$

where $\alpha_i$ is the scaling factor of the $i$-th image and $N$ denotes the total number of images. However, this assumption is too strong, since consecutive images are likely to differ from one another in several attributes. Moreover, although the specific semantic direction can be predefined [shen2019interpreting, voynov2020unsupervised, shen2021closedform], doing so requires input images that vary only along that semantic. To cope with images with arbitrary semantic variations, we regard the direction as one of our optimization targets. Since the scaling factors $\alpha_i$ are also learnable, we absorb them into the direction and reformulate Eq. (2) as the proposed mutually accessible constraint:

$$w_i = w_1 + n_i, \quad i = 2, \dots, N. \qquad (3)$$

In this way, we can obtain the latent codes of all the images via a joint optimization of $w_1$ and the directions $n_i$. Such a simple linear combination mechanism promotes the editability of the latent codes, for three reasons: i) each of the inverted latent codes can be regarded as an edited code with respect to one of the others; ii) if the images vary along a specific semantic direction, the scales of the variations can be learned adaptively; and, more importantly, iii) it can deal with variations of different attributes among consecutive images. Moreover, the learned directions $n_i$ have the potential to be transferred to other latent codes as predefined semantic directions.
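The parameterization behind the mutually accessible constraint can be sketched as follows: every latent code is the base code plus a per-image offset, so any code is reachable from any other by one linear step, and an offset can be replayed on an unrelated code. The function names and numeric values are illustrative assumptions, not part of the method's released implementation.

```python
def compose_codes(w1, offsets):
    """Build all latent codes from the base code and per-image offsets (Eq. 3).

    offsets[0] belongs to the first image and is expected to be all zeros.
    """
    return [[a + b for a, b in zip(w1, n_i)] for n_i in offsets]


def offset_between(offsets, i, j):
    """Direction that moves code i onto code j: n_j - n_i."""
    return [b - a for a, b in zip(offsets[i], offsets[j])]


w1 = [0.5, -1.0]
offsets = [[0.0, 0.0], [0.1, 0.0], [0.3, 0.2]]   # first offset is zero by construction
codes = compose_codes(w1, offsets)

# Any code is reachable from any other by a single linear step.
step = offset_between(offsets, 1, 2)
reached = [a + b for a, b in zip(codes[1], step)]

# The same offset can be replayed on an unrelated code (semantic transfer).
other = [2.0, 2.0]
transferred = [a + b for a, b in zip(other, offsets[2])]
```

Keeping the first offset at zero removes the redundancy between the base code and the offsets, which is one simple way to make the joint optimization well posed.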
Inversion Consistency Constraint. Once we have the latent codes, we feed them into the generator to reconstruct the consecutive images. For a base image $I_i$, its reconstruction $\hat{I}_i$ with regard to the latent code $w_i$ is calculated by

$$\hat{I}_i = G(w_i), \quad i = 1, \dots, N. \qquad (4)$$

To ensure the fidelity of the reconstructions, we consider an inversion consistency between common regions of the reconstructed images and the input consecutive images. Specifically, we tailor an inversion consistency constraint loss based on bidirectional flows in the RGB space. As shown at the bottom of Fig. 1, the forward flows between the base image $I_i$ and the other images $I_j$ are first calculated by a pretrained FlowNet2 [ilg2017flownet]:

$$F_{i \to j} = \mathrm{Flow}(I_i, I_j), \qquad (5)$$

where $j \in \{1, \dots, N\}$ and $j \neq i$. Then we warp $I_i$ with this flow to form the warped images $I_{i \to j}$:

$$I_{i \to j} = \mathrm{Warp}(I_i, F_{i \to j}). \qquad (6)$$

Likewise, the warped results $\hat{I}_{i \to j}$ of the recovered images $\hat{I}_i$ are

$$\hat{I}_{i \to j} = \mathrm{Warp}(\hat{I}_i, F_{i \to j}). \qquad (7)$$

Since consecutive images describe the same subject, an inherent relationship should exist among the warpings $I_{i \to j}$ generated for each base image $I_i$, and this relationship should carry over to the warpings $\hat{I}_{i \to j}$ of the reconstructions. In the same way, we calculate the backward flow $F_{j \to i}$ between $I_j$ and $I_i$, and the corresponding warpings $I_{j \to i}$ and $\hat{I}_{j \to i}$. By iterating over all the values of $i$ and $j$, we inject multi-source supervision from the warpings of the input images to confine the reconstructions, and the proposed inversion consistency constraint loss is given by

$$\mathcal{L}_{icc} = \sum_{i=1}^{N} \sum_{j \neq i} d\left(I_{i \to j}, \hat{I}_{i \to j}\right), \qquad (8)$$

where $d(\cdot, \cdot)$ denotes a pixel-wise distance.
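The warping-based consistency check can be illustrated on tiny grayscale images. This is only a sketch under stated assumptions: the flow is hand-specified rather than predicted by FlowNet2, warping is a nearest-neighbor backward warp with clamped borders, the pixel-wise distance is a mean squared difference, and all function names are hypothetical.

```python
def warp(image, flow):
    """Backward-warp a 2-D grayscale image with an integer flow field.

    flow[y][x] = (dx, dy); the output pixel (x, y) is sampled from
    image[y + dy][x + dx], with coordinates clamped to the image border.
    """
    h, w = len(image), len(image[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            dx, dy = flow[y][x]
            sx = min(max(x + dx, 0), w - 1)
            sy = min(max(y + dy, 0), h - 1)
            row.append(image[sy][sx])
        out.append(row)
    return out


def pixel_distance(a, b):
    """Mean squared pixel-wise difference between two images of equal size."""
    diffs = [(p - q) ** 2 for ra, rb in zip(a, b) for p, q in zip(ra, rb)]
    return sum(diffs) / len(diffs)


def consistency_loss(inputs, recons, flows):
    """Compare warped inputs with identically warped reconstructions.

    flows maps an ordered pair (i, j) to the flow used to warp image i
    towards image j, so forward and backward directions are both covered.
    """
    total = 0.0
    for (i, j), flow in flows.items():
        total += pixel_distance(warp(inputs[i], flow), warp(recons[i], flow))
    return total


shift_right = [[(1, 0)] * 3 for _ in range(2)]
shift_left = [[(-1, 0)] * 3 for _ in range(2)]
inputs = [[[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]],
          [[1.0, 2.0, 2.0], [4.0, 5.0, 5.0]]]
flows = {(0, 1): shift_right, (1, 0): shift_left}

perfect = consistency_loss(inputs, inputs, flows)        # exact reconstructions
recons = [inputs[0], [[1.0, 2.0, 2.0], [5.0, 5.0, 5.0]]]  # one wrong pixel
imperfect = consistency_loss(inputs, recons, flows)
```

In practice the flow would be dense and sub-pixel, and the warp would use bilinear sampling so that gradients can flow back to the latent codes; the integer version above only shows which quantities are compared.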
Moreover, we maintain a pixel-wise consistency between each input image and its corresponding recovery. A pixel-wise consistency loss is therefore introduced in our objective:

$$\mathcal{L}_{pix} = \sum_{i=1}^{N} \left\| I_i - \hat{I}_i \right\|_2^2. \qquad (9)$$

To guarantee the perceptual quality of the reconstructions, we also utilize a perceptual loss, which is formulated as follows:

$$\mathcal{L}_{per} = \sum_{i=1}^{N} \sum_{k} \left\| \phi_k(I_i) - \phi_k(\hat{I}_i) \right\|_2^2, \qquad (10)$$

where $\phi_k$ denotes the $k$-th layer of a pretrained VGG-16 network, and we follow Abdal et al. [abdal2019image2stylegan] in selecting the features produced by the conv1_1, conv1_2, conv3_2 and conv4_2 layers of VGG-16 for modeling the loss.

Finally, the whole objective function is defined as follows:

$$\mathcal{L} = \lambda_{icc} \mathcal{L}_{icc} + \lambda_{pix} \mathcal{L}_{pix} + \lambda_{per} \mathcal{L}_{per}, \qquad (11)$$

where the $\lambda$s denote the balance factors. Then the latent code $w_1$ and the directions $n_i$ can be optimized by

$$w_1^{*}, \{n_i^{*}\}_{i=2}^{N} = \arg\min_{w_1, \{n_i\}} \mathcal{L}. \qquad (12)$$

Note that we follow [abdal2019image2stylegan] and initialize $w_1$ with the mean latent vector of the latent space. The directions $n_i$ are initially set to zero and updated during optimization.
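The joint optimization can be sketched on a toy problem. The snippet below is not the method's implementation: it assumes a hypothetical 2-D linear generator, keeps only a pixel-wise term of the objective, and swaps Adam for plain gradient descent with analytic gradients so that it runs without any deep learning library. It does mirror the structure described above, with one shared base code, per-image offsets initialized to zero, and the first offset held at zero.

```python
A = [[2.0, 0.0],
     [1.0, 1.0]]  # hypothetical linear "generator": G(w) = A w


def generate(w):
    return [sum(a * x for a, x in zip(row, w)) for row in A]


def pixel_loss_and_grad(w, target):
    """Return ||G(w) - target||^2 and its gradient 2 A^T (G(w) - target)."""
    residual = [g - t for g, t in zip(generate(w), target)]
    loss = sum(r * r for r in residual)
    grad = [2.0 * sum(A[k][d] * residual[k] for k in range(len(A)))
            for d in range(len(w))]
    return loss, grad


def invert_jointly(targets, dim=2, steps=5000, lr=0.01):
    """Jointly optimize the base code and the offsets (first offset fixed to 0)."""
    w1 = [0.0] * dim                           # stands in for the mean latent code
    offsets = [[0.0] * dim for _ in targets]   # directions start at zero
    loss = 0.0
    for _ in range(steps):
        loss = 0.0
        grad_w1 = [0.0] * dim
        grad_offsets = []
        for n_i, target in zip(offsets, targets):
            w_i = [a + b for a, b in zip(w1, n_i)]
            l_i, g_i = pixel_loss_and_grad(w_i, target)
            loss += l_i
            # The base code is shared, so it accumulates every image's gradient.
            grad_w1 = [a + b for a, b in zip(grad_w1, g_i)]
            grad_offsets.append(g_i)
        w1 = [w - lr * g for w, g in zip(w1, grad_w1)]
        for i in range(1, len(offsets)):       # offsets[0] stays zero
            offsets[i] = [n - lr * g for n, g in zip(offsets[i], grad_offsets[i])]
    return w1, offsets, loss                   # loss measured before the last update


true_codes = [[0.5, -1.0], [0.6, -1.0], [0.8, -0.8]]
targets = [generate(w) for w in true_codes]
w1, offsets, final_loss = invert_jointly(targets)
```

Because the toy generator is invertible, the optimization recovers the base code and the offsets between the true codes; with a real generator the problem is non-convex and the additional loss terms matter.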
4 Experiments
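Table 1: Quantitative comparison of image reconstruction on the RAVDESS12 Dataset and the Synthesized Dataset. The upper four rows are editable inversion methods; the last two rows additionally optimize the noise space.

| Method | RAVDESS12 NIQE | RAVDESS12 FID | RAVDESS12 LPIPS | RAVDESS12 MSE (e-3) | Synthesized NIQE | Synthesized FID | Synthesized LPIPS | Synthesized MSE (e-3) |
|---|---|---|---|---|---|---|---|---|
| I2S [abdal2019image2stylegan] | 3.770 | 16.284 | 0.162 | 8.791 | 3.374 | 48.909 | 0.271 | 35.011 |
| pSp [richardson2020encoding] | 3.668 | 29.701 | 0.202 | 22.337 | 3.910 | 84.355 | 0.391 | 46.244 |
| InD [zhu2020domain] | 3.765 | 18.135 | 0.193 | 9.963 | 3.152 | 42.773 | 0.352 | 44.645 |
| Ours | 3.596 | 13.136 | 0.148 | 5.972 | 2.807 | 37.225 | 0.250 | 24.395 |
| I2S++ [abdal2019image2stylegan2] | 3.358 | 0.320 | 0.003 | 0.174 | 2.644 | 2.967 | 0.014 | 1.458 |
| Ours++ | 3.352 | 0.311 | 0.003 | 0.165 | 2.597 | 2.897 | 0.014 | 1.432 |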

4.1 Implementation Details
We implement the proposed method in PyTorch [NEURIPS2019_9015] on a PC with an Nvidia GeForce RTX 3090. We utilize the generator of StyleGAN [karras2019style] pretrained on the FFHQ dataset [karras2019style]. The latent codes and semantic directions are optimized with the Adam optimizer [kingma2014adam]. Following [abdal2019image2stylegan], we use 5000 gradient descent steps with a learning rate of 0.01, $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$. The balancing weights in Eq. (12) are set empirically. We set $N = 5$ in Eq. (4), i.e., each input sequence contains 5 consecutive images.
4.2 Experimental Settings
Datasets. We first conduct our experiments on the RAVDESS dataset [livingstone2018ryerson] of real videos. The original RAVDESS dataset contains 2,452 videos of 24 subjects speaking and singing with various semantic expressions. We select 12 of these videos for evaluation, resulting in 1,454 frames, and name this subset the RAVDESS12 Dataset. Since there are no ground-truth latent codes for real images, we cannot objectively evaluate the inverted code and its editability in the latent space. On the other hand, it has been demonstrated that the learned semantic directions work very well on generated images. We therefore construct a synthesized dataset containing 1000 samples randomly generated by StyleGAN. For each sample, we vary its latent code with 5 random combinations of the scaling factor (ranging from -3 to 3) and the semantic direction (acquired from InterFaceGAN [shen2019interpreting]), producing 5000 images. We record the latent codes of the original samples, the corresponding editing specifications, and the edited latent codes for evaluating the editability. GAN inversion methods invert the original samples and edit them towards the target attributes for comparison.
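A sketch of how the editing specifications of such a synthesized set could be drawn is given below. The uniform sampling, the seed, and the function name are assumptions made for illustration; the direction names are the five attributes used later in the semantic manipulation experiment.

```python
import random

DIRECTIONS = ["gender", "pose", "smile", "eyeglasses", "age"]


def make_edit_specs(num_samples=1000, edits_per_sample=5, max_scale=3.0, seed=0):
    """Draw (sample index, direction name, scale) triples for the synthesized set."""
    rng = random.Random(seed)
    specs = []
    for sample_id in range(num_samples):
        for _ in range(edits_per_sample):
            direction = rng.choice(DIRECTIONS)           # which attribute to edit
            alpha = rng.uniform(-max_scale, max_scale)   # how far to move
            specs.append((sample_id, direction, alpha))
    return specs


specs = make_edit_specs()
```

Each triple would then be turned into an edited latent code by adding the scaled direction to the recorded code of the sample, and both codes would be stored as ground truth for the editability evaluation.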
Competitors. We mainly compare with four GAN inversion methods: Image2StyleGAN (I2S) [abdal2019image2stylegan], Image2StyleGAN++ (I2S++) [abdal2019image2stylegan2], In-domain Inversion (InD) [zhu2020domain], and the pSp network [richardson2020encoding]. All the methods invert images to the same latent space of StyleGAN and apply the same directions for editing. It is worth noting that I2S++ introduces an additional noise space for recovering small details. A main problem is that the two inverted codes in the latent and noise spaces are highly coupled, whereas the learned semantic directions are optimized in the latent space only. Applying them changes the latent code but leaves the noise vector unchanged, and these unpaired vectors yield “ghosting” artifacts after editing (see Fig. 17). As a result, we mainly use I2S++ for reconstruction comparisons, and we also extend our method to include the noise space, named Ours++.
Evaluation Metrics. For the quantitative comparisons, we use four metrics to evaluate the reconstruction fidelity: Natural Image Quality Evaluator (NIQE) [mittal2012making], Fréchet Inception Distance (FID) [heusel2017gans], Learned Perceptual Image Patch Similarity (LPIPS) [zhang2018unreasonable], and pixel-wise Mean Square Error (MSE). In particular, FID computes the Wasserstein-2 distance between the distributions of input and output images. NIQE estimates the quality perceived by a human and is a completely blind assessment that requires no GT image. Since there is no GT for the semantic editing task on the real RAVDESS12 Dataset, we use FID and NIQE to evaluate real image editing results.
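For reference, when the Inception features of the two image sets are modeled as Gaussians, this distance has a closed form. The subscripts $x$ and $y$ for the input and output sets are labels chosen here; $\mu$ and $\Sigma$ are the feature means and covariances.

```latex
\mathrm{FID}(x, y) = \left\| \mu_x - \mu_y \right\|_2^2
  + \operatorname{Tr}\!\left( \Sigma_x + \Sigma_y - 2 \left( \Sigma_x \Sigma_y \right)^{1/2} \right)
```

The first term penalizes a shift of the mean feature and the second a mismatch of the feature covariances, so lower values indicate closer distributions.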
4.3 Evaluation on Image Reconstruction
Quantitative Evaluation. We first evaluate the reconstruction fidelity of the inverted codes. Quantitative comparisons are shown in Tab. 1. Our method significantly outperforms the three editable GAN inversion methods (upper part) on both the real dataset and the synthesized one. In particular, for the pixel-wise difference metric MSE, we improve on the state of the art by 31%. This indicates that the proposed joint optimization successfully incorporates complementary information from neighboring images. Besides, by involving the noise space, I2S++ and Ours++ achieve the most faithful reconstructions among all methods. Thanks to the inter-image coherence, Ours++ further improves slightly on I2S++.
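The text does not state how the 31% figure is derived. One reading that is consistent with Tab. 1 is the relative MSE reduction over the best editable competitor (I2S), averaged over the two datasets, as the following arithmetic shows.

```python
def relative_improvement(baseline, ours):
    """Fractional reduction of an error metric relative to the baseline."""
    return (baseline - ours) / baseline


# MSE (e-3) values copied from Tab. 1: I2S (best editable competitor) vs. Ours.
ravdess = relative_improvement(8.791, 5.972)
synthesized = relative_improvement(35.011, 24.395)
average = (ravdess + synthesized) / 2

print(f"{ravdess:.1%} {synthesized:.1%} {average:.1%}")  # prints "32.1% 30.3% 31.2%"
```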
Qualitative Evaluation. Qualitative comparisons are shown in Fig. 50. Image2StyleGAN cannot recover the image color correctly, while pSp and InD cannot recover the finest facial details of the original images (see the teeth in the first row). Compared with these three methods, our method reconstructs faithful appearance details. Unsurprisingly, I2S++ recovers the finest details among all competitors, mainly because its optimization in the noise space encodes high-frequency information. We also show our results optimized in the noise space (Ours++), which preserve the original color better than I2S++ (see the second row).
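Table 2: Quantitative comparison of semantic manipulation on the RAVDESS12 Dataset.

| Metric | I2S [abdal2019image2stylegan] | pSp [richardson2020encoding] | InD [zhu2020domain] | Ours |
|---|---|---|---|---|
| NIQE | 3.776 | 5.242 | 3.693 | 3.254 |
| FID | 21.609 | 30.128 | 19.271 | 15.482 |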
4.4 Evaluation on Image Editing
In this section, we evaluate our GAN inversion method on the editing of real and synthesized images. We conduct two editing tasks based on the inverted latent codes: semantic manipulation and image morphing.
4.4.1 Semantic Manipulation
Semantic manipulation aims to edit a real image by varying its inverted code along a specific semantic direction. We use five semantic directions (i.e., gender, pose, smile, eyeglasses, and age) acquired by [shen2019interpreting] in the experiment.
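Table 3: Quantitative comparison of semantic manipulation on the Synthesized Dataset.

| Metric | I2S [abdal2019image2stylegan] | pSp [richardson2020encoding] | InD [zhu2020domain] | Ours |
|---|---|---|---|---|
| NIQE | 3.390 | 3.917 | 3.193 | 3.163 |
| FID | 35.894 | 58.342 | 48.867 | 33.872 |
| LPIPS | 0.399 | 0.452 | 0.424 | 0.347 |
| MSE (e-3) | 89.671 | 126.642 | 101.563 | 70.224 |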
Qualitative Evaluation. The qualitative comparisons on real data are shown in Fig. 17. Our manipulated faces are visually more plausible than those of the competitors. In particular, the results manipulated with Image2StyleGAN [abdal2019image2stylegan] present noisy artifacts under pose changes, and glasses are entangled with the age attribute, revealing that the edited latent codes have escaped from the editable domain. Similar situations can be found in I2S++ [abdal2019image2stylegan2]. The faces manipulated with pSp [richardson2020encoding] are almost unchanged, because pSp focuses on learning a direct mapping from the input to the latent code and ignores editability. This problem is addressed by In-domain inversion [zhu2020domain], but at the cost of reconstruction quality. In contrast, thanks to the jointly considered inherent editability constraint between consecutive images, our inverted latent codes are more semantically editable, leading to more disentangled manipulations. On the other hand, the methods that optimize the noise space (right part of Fig. 17) show more obvious noise artifacts than the others under pose changes, because the pre-optimized noise vector does not suit the edited latent vector. Nevertheless, Ours++ disentangles gender from glasses better than the code inverted by I2S++.
To evaluate whether the obtained inversion can be edited along arbitrary directions, on the synthesized dataset we force the editing direction to differ from the semantic changes contained in the input sequence. The qualitative comparisons on the synthesized data are shown in Fig. 56. As in the evaluation on real data, I2S produces obvious artifacts, pSp fails to edit the results, and InD cannot preserve the original identity. Our manipulated results are more similar to the GTs, which indicates that our inverted codes are much closer to the GT latent codes and also inherit their editability.
Quantitative Evaluation. We present the quantitative comparisons in Tab. 2 and Tab. 3. Our method achieves the best results on both the RAVDESS12 Dataset and the Synthesized Dataset. In particular, for the blind metric NIQE, our edited results achieve a 13.8% improvement over the state-of-the-art method, which indicates that our editing is more visually plausible. Quantitative results on the Synthesized Dataset evaluate whether the inverted codes are close enough to the GT codes that their semantic information can be reused. The two non-blind metrics, LPIPS and MSE, show that our edited results are the closest to the GTs. Thanks to our semantically accessible regularization in the latent space, our inverted latent codes show stronger editability than those of the competitors.
4.4.2 Image Morphing
Image morphing aims to fuse two images semantically by interpolating their latent codes. It is another way to evaluate whether the inverted codes indeed lie in the latent space and reuse the semantic knowledge. For high-quality inverted codes, the interpolated results should also stay in the editable domain and the semantics should vary continuously. Qualitative comparisons are shown in Fig. 24. The morphing results produced by Image2StyleGAN [abdal2019image2stylegan] have noticeable artifacts, and the results produced by pSp [richardson2020encoding] look unrealistic, with unnatural hair. In contrast, our method presents high-quality results with a continuous morphing process. We also present the quantitative evaluation on the morphing task in Tab. 4 and Tab. 5. Our inversion results outperform the other inversion methods on both the real dataset and the synthesized one.

Table 4: Quantitative comparison of image morphing on the RAVDESS12 Dataset.

| Metric | I2S [abdal2019image2stylegan] | pSp [richardson2020encoding] | InD [zhu2020domain] | Ours |
|---|---|---|---|---|
| NIQE | 4.255 | 5.350 | 4.051 | 3.688 |
| FID | 40.627 | 38.474 | 38.925 | 37.695 |

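Table 5: Quantitative comparison of image morphing on the Synthesized Dataset.

| Metric | I2S [abdal2019image2stylegan] | pSp [richardson2020encoding] | InD [zhu2020domain] | Ours |
|---|---|---|---|---|
| NIQE | 3.389 | 3.800 | 3.212 | 3.115 |
| FID | 31.776 | 30.192 | 21.901 | 18.621 |
| LPIPS | 0.472 | 0.467 | 0.469 | 0.402 |
| MSE (e-3) | 141.432 | 121.834 | 125.674 | 98.354 |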
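The latent interpolation behind the morphing task of Sec. 4.4.2 can be sketched as below. Linear interpolation with evenly spaced steps is an assumption made here (the text only says the codes are interpolated), and `interpolate_codes` is a hypothetical helper.

```python
def interpolate_codes(w_a, w_b, num_frames):
    """Return num_frames codes evenly spaced on the segment from w_a to w_b."""
    if num_frames < 2:
        raise ValueError("need at least two frames (the endpoints)")
    frames = []
    for k in range(num_frames):
        t = k / (num_frames - 1)   # t runs from 0 (w_a) to 1 (w_b)
        frames.append([(1.0 - t) * a + t * b for a, b in zip(w_a, w_b)])
    return frames


frames = interpolate_codes([0.0, 2.0], [1.0, 0.0], num_frames=5)
```

Each interpolated code would be passed through the generator to obtain one frame of the morphing sequence; if both endpoints lie in the editable domain, the intermediate frames are expected to stay realistic.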

4.5 Semantic Transfer
As discussed in Sec. 3.2, both the latent code and the semantic direction are obtained without supervision after inversion. Besides the latent code, our acquired direction represents the semantic changes of the input images. Given the input images as a reference, we can transfer their semantic changes to target faces.
The transfer results are shown in Fig. 25. The semantic attributes of the target faces are modified following the reference image set. Note that more than one attribute may change in the reference. For example, in the right example, the mouth and the pose vary simultaneously, but we can still capture those changes. This shows that our acquired direction is disentangled from the reference faces and can be applied to other faces. Unlike existing supervised [shen2019interpreting, shen2020interfacegan] or unsupervised [shen2021closedform, voynov2020unsupervised] learning of interpretable directions, this sheds light on a new exemplar-based way of learning semantic directions.
4.6 Ablation Study
In this section, we analyze the efficacy of our two components: the mutually accessible constraint (MAC) and the inversion consistency constraint (ICC). Note that without these two components, our method is equivalent to Image2StyleGAN inversion, which we set as our baseline. By unplugging one of the two constraints, we obtain two variants, “w/o MAC” and “w/o ICC”. Without MAC, the direction term is removed and all the latent codes are optimized simultaneously.
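Table 6: Ablation study on image reconstruction on the RAVDESS12 Dataset.

| Variant | NIQE | FID | LPIPS | MSE (e-3) |
|---|---|---|---|---|
| Baseline | 3.770 | 16.284 | 0.162 | 8.791 |
| w/o MAC | 3.685 | 13.375 | 0.151 | 8.065 |
| w/o ICC | 3.765 | 14.791 | 0.160 | 8.508 |
| Ours | 3.596 | 13.136 | 0.148 | 5.972 |

Table 7: Ablation study on semantic manipulation on the RAVDESS12 Dataset.

| Metric | Baseline | w/o MAC | w/o ICC | Ours |
|---|---|---|---|---|
| NIQE | 3.776 | 3.659 | 3.398 | 3.254 |
| FID | 21.609 | 16.121 | 17.274 | 15.482 |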
We perform the ablation study on the image reconstruction and semantic manipulation tasks on the RAVDESS12 Dataset. Quantitative comparisons of GAN inversion are shown in Tab. 6. Every variant outperforms the baseline on all metrics, which indicates that both components contribute to the GAN inversion performance. Meanwhile, the variant without MAC performs better than the variant without ICC, which indicates that the inversion consistency brought by consecutive images contributes more to the GAN inversion task. For semantic editing (Tab. 7), we observe a different situation: the variant without ICC achieves a better NIQE than the variant without MAC, which reveals that the mutually accessible constraint confines the inverted latent codes to the editable domain. These two evaluations show that our two constraints work as intended by our design principles.
We show the results of the different variants in Fig. 30 by changing the “age” attribute. For the baseline and the variant without MAC, the edit is entangled with glasses, showing that concentrating only on reconstruction fidelity leaves the inverted codes lacking in editability. In contrast, the variant without ICC and our final result successfully modify the “age” attribute, revealing the strong regularization power of our mutually accessible constraint.
5 Conclusion
In this paper, we propose an alternative GAN inversion method for consecutive images. We formulate the inversion of consecutive images as a linear combination process in the latent space to ensure editability, and transfer the reconstruction consistency across inputs in the RGB space to guarantee reconstruction fidelity. The experimental results demonstrate the effectiveness of our method in terms of editability and reconstruction fidelity. Besides, we also support various new applications such as video-based GAN inversion and unsupervised semantic transfer.
References
6 Supplemental Results
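Table 8: Quantitative comparison on the nonlinear dataset synthesized with StyleFlow, for image reconstruction and nonlinear semantic editing.

| Metric | Reconstruction I2S | Reconstruction pSp | Reconstruction InD | Reconstruction Ours | Nonlinear edit I2S | Nonlinear edit pSp | Nonlinear edit InD | Nonlinear edit Ours |
|---|---|---|---|---|---|---|---|---|
| NIQE | 3.632 | 3.439 | 3.254 | 2.997 | 3.940 | 3.974 | 3.703 | 3.476 |
| FID | 40.098 | 62.932 | 77.692 | 33.692 | 48.032 | 64.224 | 51.607 | 37.039 |
| LPIPS | 0.252 | 0.323 | 0.414 | 0.238 | 0.489 | 0.522 | 0.471 | 0.403 |
| MSE (e-3) | 30.767 | 79.878 | 83.294 | 25.192 | 87.361 | 118.285 | 99.623 | 69.449 |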
To further demonstrate that our method is not restricted to linear editing, we synthesize a new dataset under nonlinear constraints with StyleFlow [abdal2021styleflow]. It consists of 1,000 sequences, resulting in 5,000 images, with different semantic changes such as pose, illumination, expression, eyeglasses, gender, and age. We conduct an inversion experiment on it, and the results are shown in Tab. 8 and Fig. 36. Our method obtains an accurate reconstruction on this nonlinear dataset. Besides, we also conduct a nonlinear semantic editing task using StyleFlow [abdal2021styleflow], as shown in the right parts of Tab. 8 and in Fig. 42. Our results are more similar to the GT (see the hair color of “age+” in the bottom-right corner of Fig. 42). These results indicate that our method is not constrained by the linear-editing assumption. This is because our method is an optimization-based GAN inversion method that does not rely on any attribute constraint on the input images, and the optimization is image-specific and does not require training a general network.
We also give more qualitative comparisons of image reconstruction on the RAVDESS12 Dataset and the linear Synthesized Dataset in Fig. 50. Among the methods that optimize only the latent code, ours reconstructs the most faithful appearances. Involving the noise space largely improves the reconstruction quality, and Ours++ reconstructs the correct colors.
The qualitative comparison of image editing on the Synthesized Dataset can be seen in Fig. 56. Our edited results are more similar to the ground truths and have pleasing appearances, which indicates that our inverted latent codes are close to the GT codes.