
High-Fidelity Image Inpainting with GAN Inversion

Image inpainting seeks a semantically consistent way to recover a corrupted image from its unmasked content. Previous approaches usually reuse a well-trained GAN as an effective prior and generate realistic patches for missing holes through GAN inversion. Nevertheless, these algorithms ignore a hard constraint, which may leave a gap between GAN inversion and image inpainting. To address this problem, we devise a novel GAN inversion model for image inpainting, dubbed InvertFill, which mainly consists of an encoder with a pre-modulation module and a GAN generator with an F&W+ latent space. Within the encoder, the pre-modulation network leverages multi-scale structures to encode more discriminative semantics into style vectors. To bridge the gap between GAN inversion and image inpainting, the F&W+ latent space is proposed to eliminate glaring color discrepancy and semantic inconsistency. To reconstruct faithful and photorealistic images, a simple yet effective Soft-update Mean Latent module is designed to capture more diverse in-domain patterns that synthesize high-fidelity textures for large corruptions. Comprehensive experiments on four challenging datasets, including Places2, CelebA-HQ, MetFaces, and Scenery, demonstrate that InvertFill outperforms advanced approaches both qualitatively and quantitatively and supports the completion of out-of-domain images well.


1 Introduction

Image inpainting is an ill-posed problem that requires recovering missing or corrupted content from incomplete images with masks. It has been widely adopted for manipulating photographs, for example repairing corrupted images, removing unwanted objects, or modifying object positions [3, 30, 31].

The mainstream approaches [27, 20] often employ a UNet-style [29] encoder-decoder architecture for image inpainting and have demonstrated promising results in dealing with narrow holes or removing small objects. To handle more complicated cases, later works have focused on improving performance with various discriminators [43, 45, 16], contextual attention mechanisms [16, 45, 47], and auxiliary information [21, 46, 26, 8]. Nevertheless, limited by their model capacity, these UNet-like methods still struggle to fill large corruptions with visually realistic patches.

Recently, generative adversarial network (GAN) models [11, 13, 14] have been shown to produce high-resolution photorealistic images. For these models, GAN inversion [51, 38] plays an important role. Specifically, a GAN that is simply fed with stochastic latent vectors cannot be applied directly to image-to-image translation. To handle this problem, GAN inversion methods use a pre-trained GAN as a prior and encode the given images into latent vectors that represent the target images, resulting in high-fidelity translation results. Inspired by this, several approaches [6, 28, 2] have made great efforts to introduce GAN inversion for image inpainting. Despite excellent performance, existing methods may suffer from the following issues:

Figure 1: Visual results of our contributions. Image (a) shows the high-fidelity inpainting results for large corruptions, image (b) exhibits the improvement of our method for the “gapping” issue over previous inversion-based inpainting method pSp [28], and image (c) demonstrates the semantically consistent results by our model for the out-of-domain masked image. Best viewed in color for all figures throughout the paper.
  • Distortion for extreme image inpainting. Due to large corruptions, current methods (e.g., [16, 47, 8]) may degenerate because they cannot effectively extract correlations from the inadequate knowledge in extremely degraded images. Such correlation information is crucial for eliminating the ambiguity of large continuous holes, especially far from the hole boundary.

  • Inconsistency caused by hard constraint. Unlike regular conditional translation (e.g., super-resolution [6], face editing [50], and label-to-image [28]), image inpainting has a hard constraint: the unmasked regions in the input and the output should be the same. Current inversion-based algorithms [6, 28, 2], however, ignore this constraint, which results in color discrepancy and semantic inconsistency as displayed in Fig. 1(b) and may require additional post-processing such as image blending [2]. We call this problem “gapping” in the following sections.

  • Robustness for out-of-domain inputs. To reconstruct faithful images, the key is to find an in-domain latent code that aligns with the domain of a well-trained GAN model [50]. Unfortunately, the encoder fails to invert out-of-domain inputs accurately. For example, pSp [28] struggles with corrupted images whose contents or masks come from unseen domains, which limits the applicability of GAN inversion.

To solve the above issues, we introduce a novel InvertFill network for image inpainting. It follows the encoder-based inversion architecture [28], which consists of an encoder and a GAN generator. We first develop a new latent space (as explained later) that encodes the original images into style vectors and gives the generator backbone access to the inputs, decreasing color discrepancy and semantic inconsistency. Besides, to make full use of the encoder, we present pre-modulation networks that amplify the reconstruction signals of the style vectors based on the predicted multi-scale structures, further enhancing the discriminative semantics. Then, we propose a simple yet effective soft-update mean latent technique to sample a dynamic in-domain code for the generator. Compared to using a fixed code, our method facilitates diverse downstream goals while reconstructing faithfully and photo-realistically, even on unseen domains. To verify the superiority of our method, we conduct extensive experiments on four datasets, including CelebA-HQ [11], Places2 [49], MetFaces [12], and Scenery [41]. The results demonstrate that our method achieves favorable performance, especially for images with large corruptions. Furthermore, our approach can handle images and masks from unseen domains by optimizing a lightweight encoder without retraining the GAN generator on a large-scale dataset. Fig. 1 shows several visual results of our approach.

The contributions of our work are three-fold: (1) We introduce a novel latent space to resolve the problems of color discrepancy and semantic inconsistency and thus bridge the gap between image inpainting and GAN inversion. (2) We propose (a) pre-modulation networks to encode more discriminative semantics from compact multi-scale structures and (b) soft-update mean latent to synthesize more semantically reasonable and visually realistic patches by leveraging diverse patterns. (3) Extensive experiments on CelebA-HQ [11], Places2 [49], MetFaces [12], and Scenery [41] show that the proposed approach outperforms current state-of-the-art methods, evidencing its effectiveness.

2 Related Work

Image Inpainting. Image inpainting can be treated as a conditional translation task with a hard constraint. The seminal learning-based work by Pathak et al. [27] integrates UNet [29] and a GAN discriminator [5] for image inpainting, and has since given rise to many variants that deal effectively with narrow holes or remove small objects. More recently, several works have attempted to extend the idea in [27] to more complicated cases. Roughly speaking, these methods can be categorized into three types. The first explicitly disposes of invalid signals at masked regions [20, 43, 23, 44]. Among them, Liu et al. [20] attach a heuristic mask update step to standard convolution, and Yu et al. [43] replace the mask update process with a learnable convolution layer. The second type, valuable signal shifting, is inspired by the traditional exemplar-based approach [3] and is now typically realized by modeling contextual attention [40, 42, 22, 16, 46]. In particular, RFR [16] applies multiple iterations at the bottleneck while sharing the attention scores to guide a patch-swap process. ProFill [47] iteratively performs inpainting based on a confidence map calculated by spatial attention. CRFill [46] introduces a contextual reconstruction objective that learns query-reference feature similarity. The third branch adopts auxiliary labels, generating intermediate structures to assist with more accurate semantics [26, 21, 17, 8]. Specifically, EC [26] introduces Canny edges to deliver finer inpainting structures. MEDFE [21] jointly learns to represent structures and textures and utilizes spatial and channel equalization to ensure consistency. CTSDG [8] couples texture and structure through parallel pathways and then fuses them with bidirectional gated layers. Other approaches exist beyond these three types. One notable example is Score-SDE [32], which proposes a scoring model that saves the gradient computation of energy-based models for efficient sampling.

Inpainting with GAN Inversion. StyleGAN [13] implicitly learns hierarchical latent styles instead of using the initial stochastic vector directly, which provides control over the style of outputs at coarse-to-fine levels of detail through style-modulation modules [10]. StyleGAN2 [14] further proposes weight demodulation, path length regularization, and a generator redesign for improved image quality. These models are adept at generation without any given images, but otherwise require specialized networks [24] or regularization [7, 25] and paired training data. GAN inversion [51] is a common practice that takes advantage of the intrinsic statistics of a well-trained large-scale GAN as a prior for generic applications [50, 1]. Existing GAN inversion approaches can be roughly divided into optimization-based [2, 6, 37, 34] and encoder-based [28, 50, 39] methods. Among these methods, mGANprior [6] utilizes multiple latent codes and adaptive channel importance for faithful reconstruction and shows applications in different tasks including inpainting. pSp [28] synthesizes images with a mapping network that extracts style vectors of the latent space $\mathcal{W}^+$ [1] separately for the corresponding style-modulation layers of StyleGAN. Nevertheless, these approaches ignore the “gapping” issue, resulting in color inconsistency and semantic misalignment.

Difference with Previous Studies. In this paper we focus on encoder-based GAN inversion to improve generation fidelity for image inpainting. The proposed InvertFill is related to, but significantly different from, previous studies. Specifically, InvertFill is relevant to the methods in [28, 50, 39], where an encoder-based architecture is adopted. Unlike them, however, we introduce a new latent space to explicitly handle the “gapping” issue, which previous algorithms ignore. Our method also shares a similar spirit with the works of [47, 46], which adopt GANs for image inpainting. The difference is that these approaches may suffer from ambiguity when filling large corruptions, while the proposed InvertFill exploits the priors of a large-scale generator and can achieve image inpainting with high-fidelity semantics.

Figure 2: The main components of InvertFill, including a feature pyramid-based encoder (image (a)), mapping network with pre-modulation network (image (c)) and a StyleGAN2 generator with the proposed latent space (image (b)).

3 The Proposed Method

Given an original image and a binary mask, the corrupted image is obtained as an element-wise product between the image and the mask, so that the masked pixels are removed. A mask value of 1 marks a pixel as invisible. We aim to produce a visually realistic reconstructed image from the corrupted image.

3.1 Latent Space

Our architecture mainly consists of three components: (i) a feature pyramid-based [19] encoder that extracts features from input images and provides hierarchical reconstructed RGB images, (ii) the mapping networks with a pre-modulation module, and (iii) a StyleGAN2 generator that takes in the style vectors as well as the input image to generate an image. The details of InvertFill are shown in Fig. 2.

Specifically, we attach three RGB heads to the encoder to generate reconstructed RGB images at three different scales. We follow map2style [28] for the mapping network and reduce the number of networks to three, each of which corresponds to one disentanglement level of the image representation (i.e., coarse, middle, and fine [13, 28]). The three map2style networks encode the output feature maps of the encoder into intermediate latent codes. Similarly, we replicate map2style as map2structure to progressively project the reconstructed RGB images into structure vectors.

Before executing the style modulation in the generator, we apply pre-modulation networks to project the semantic structure into the style vectors of the latent space. The number of style vectors equals the number of style-modulation layers of the StyleGAN2 generator, which is set by the image resolution on the generator side. As Fig. 2(c) shows, we adopt Instance Normalization (IN) [33] to regularize the latent code and then carry out denormalization according to the multi-scale structure vectors. In this denormalization, one index runs over the style vectors, a second index selects among the three structure vectors that correspond to the coarse, middle, and fine levels, and the transformation uses a pair of affine parameters learned by the networks shown in Fig. 2(c). Unlike previous methods, which use only the intermediate latent code from a single network, the proposed pre-modulation module is lightweight and novel in applying more discriminative multi-scale features to help the latent code perceive the uncorrupted prior and better guide image generation.
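The pre-modulation step described above (normalize the latent code, then denormalize it with an affine pair predicted from a structure vector) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper `linear`, the per-vector normalization axis, and all toy dimensions are hypothetical stand-ins for the learned networks of Fig. 2(c).

```python
import math
from statistics import fmean, pvariance


def linear(s, weight, bias):
    """Hypothetical single linear layer standing in for a learned network."""
    return [sum(w_ij * s_j for w_ij, s_j in zip(row, s)) + b
            for row, b in zip(weight, bias)]


def instance_norm(w, eps=1e-5):
    """Normalize one latent code to zero mean and (near) unit variance."""
    mu = fmean(w)
    var = pvariance(w, mu)
    return [(x - mu) / math.sqrt(var + eps) for x in w]


def pre_modulate(w, s, gamma_layer, beta_layer):
    """Normalize w, then denormalize it with an affine pair predicted from s."""
    gamma = linear(s, *gamma_layer)  # per-channel scale (assumed form)
    beta = linear(s, *beta_layer)    # per-channel shift (assumed form)
    return [g * x + b for g, x, b in zip(gamma, instance_norm(w), beta)]


# Toy usage: a 4-dimensional latent code and a 2-dimensional structure vector.
w = [1.0, 2.0, 3.0, 4.0]
s = [0.5, 0.5]
gamma_layer = ([[1.0, 1.0]] * 4, [0.0] * 4)  # every channel gets scale 1.0
beta_layer = ([[0.0, 0.0]] * 4, [0.0] * 4)   # every channel gets shift 0.0
w_star = pre_modulate(w, s, gamma_layer, beta_layer)
```

With a unit scale and zero shift the result is just the normalized code; in the setting described above, each of the three structure vectors (coarse, middle, fine) would supply its own affine pair.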

The GAN is initially fed with a stochastic vector $z \in \mathcal{Z}$, and previous works [1, 6, 28, 50] invert the source images into the intermediate latent space $\mathcal{W}$ or $\mathcal{W}^+$, which is a less entangled representation than the latent space $\mathcal{Z}$. The style vectors in $\mathcal{W}$ or $\mathcal{W}^+$ are sent to the style-modulation layers of the pre-trained StyleGAN2 to synthesize target images. In the formulation of these approaches, Eq. (2), the output image is produced by two components: an encoder that maps source images into the latent space, and the pre-trained GAN generator, which synthesizes the image from the resulting latent code.
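The encoder-then-generator formulation just described can be written compactly as below. The symbol names are assumptions chosen for illustration: $I_m$ for the corrupted input image, $E$ for the encoder, $G$ for the pre-trained generator, $w$ for the latent code, and $\hat{I}$ for the output.

```latex
w = E(I_m), \qquad \hat{I} = G(w) = G\big(E(I_m)\big)
```

Note that in this form the generator sees the input image only through $w$.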

Nevertheless, the formulation in Eq. (2) may encounter the “gapping” issue in image translation tasks with a hard constraint, e.g., image inpainting. The hard constraint requires that the unmasked parts of the source and recovered images remain the same; formally, the output must agree with the input on every unmasked pixel. Intuitively, we argue that the “gapping” issue arises because the GAN model cannot directly access the pixels of the input image, only the intermediate latent code. To avoid the semantic inconsistency and color discrepancy caused by this problem, we feed the corrupted image to the GAN generator as an additional input, inspired by the skip connections of U-Net [29]. In detail, the corrupted image is fed into the RGB branch as shown in Fig. 2(b), and the feature maps of the RGB branch and the generator are combined by element-wise addition. Hence, the formulation in Eq. (2) is updated so that the generator takes both the latent code and the corrupted image as inputs.
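A plausible way to write the hard constraint and the updated formulation is given below. The notation is assumed for illustration: $I_{gt}$ is the original image, $I_m$ the corrupted image, $M$ the binary mask with $1$ on masked pixels, $\odot$ the element-wise product, $E$ the encoder, $G$ the generator, and $\hat{I}$ the output.

```latex
\hat{I} \odot (1 - M) = I_{gt} \odot (1 - M),
\qquad
\hat{I} = G\big(E(I_m),\, I_m\big)
```

The first relation states that output and input agree on unmasked pixels; the second differs from the encoder-only formulation solely by the extra generator input $I_m$.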


3.2 Soft-update Mean Latent

Pixels closer to the mask boundary are easier to inpaint, whereas the model struggles to predict the specific missing content farther from it. We find that the encoder learns the shortcut of averaging textures to reconstruct regions far from the unmasked area. This causes blurring or mosaic artifacts in some areas of the output image, mainly located away from the mask borders, as shown in Fig. 7. Drawing inspiration from L2 regularization and motivated by the intuition that fitting diverse domains works better than fitting a preset static domain, a feasible solution is to bound the style code by the mean latent code of the pre-trained GAN.

The mean latent code is obtained from abundant random samples; it restricts the encoder outputs to the average style and hence reduces the diversity of the encoder's output distribution. In addition, it introduces additional hyperparameters and a static mean latent code that must be loaded when training the model.

We adopt a dynamic mean latent code instead of a static one by stochastically fluctuating the mean latent code during training. Further, we smooth the effect of the fluctuating variance for convergence, inspired by reinforcement learning [18]. For initialization, a target mean latent code and an online mean latent code are sampled. The online code is used in image generation instead of the target code, which is held fixed for a set number of iterations and then resampled. Between two successive resamplings, the online code is softly updated toward the target code at every training iteration, with an updating factor controlling the step size. The soft-update mean latent degrades to the static mean latent [28] when the parameter of the soft-update mean latent approaches zero.
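The update schedule can be illustrated with a short sketch. It assumes the usual soft-update rule from reinforcement learning, online ← τ·target + (1 − τ)·online, and assumes that the online code is the one that tracks the target; the sampling helper, the resampling interval `resample_every`, and the factor name `tau` are hypothetical and not taken from the source.

```python
import random


def sample_mean_latent(dim, rng):
    """Hypothetical stand-in for drawing a mean latent code from the pre-trained GAN."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def soft_update(online, target, tau):
    """Move the online code a fraction tau of the way toward the target code."""
    return [tau * t + (1.0 - tau) * o for o, t in zip(online, target)]


def simulate(num_iters, resample_every, tau, dim=4, seed=0):
    """Return the online code used at each training iteration."""
    rng = random.Random(seed)
    target = sample_mean_latent(dim, rng)
    online = sample_mean_latent(dim, rng)
    used = []
    for step in range(1, num_iters + 1):
        used.append(online)  # the online code would enter the fidelity loss here
        online = soft_update(online, target, tau)
        if step % resample_every == 0:  # the target stays fixed between resamplings
            target = sample_mean_latent(dim, rng)
    return used
```

Under this assumed rule, `tau = 0.0` leaves the online code unchanged at every step, which corresponds to a static mean latent code; larger values make the online code follow each newly sampled target more quickly.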

3.3 Optimization

Figure 3: Qualitative results on Places2 dataset. Columns, left to right: Masked, GC, RFR, CTSDG, MEDFE, CRFill, ProFill, Ours.

Figure 4: Qualitative results on CelebA-HQ dataset. Panels: (a) Masked, (b) mGANprior, (c) pSp, (d) Ours, (e) GT. The two columns of (b-d) show the original model output and the composition output, from left to right, respectively. The output of the GAN-inversion-based methods (pSp [28] and mGANprior [6]) is inconsistent at the edge of the mask. Zoom in to see the details.

Following prior work in inpainting [20, 16], our architecture is supervised by the regular inpainting loss, which consists of the pixel-wise Euclidean norm of the valid and hole regions, the perceptual loss $\mathcal{L}_{perc}$, the style loss $\mathcal{L}_{style}$, and the total variation loss $\mathcal{L}_{tv}$. All of these distances are calculated between the reconstructed image and the original image. The valid and hole terms are norms on the known and masked regions, respectively. The perceptual loss and the style loss are based on a pre-trained VGG-16 network. More details can be found in [16].

To directly optimize our encoder, the multi-scale reconstruction loss is used to penalize the deviation of the reconstructed RGB images at each scale. It contains three different losses: perceptual [4], style [20], and mean-squared losses, where the mean-squared term compares the reconstructed image at each scale with the corresponding original image. Its role is to supervise the images generated from the decoder and keep the final generation close to the original image.

The soft-update mean latent is used to keep the encoder from falling into the texture-averaging shortcut. We adopt a fidelity loss, designed as a mean squared loss between the style vectors and the online mean latent code. Its role is to improve the quality and diversity of the output images.

Overall, the loss of our networks is defined as the weighted sum of the inpainting loss, the multi-scale reconstruction loss, and the fidelity loss, where two balancing factors weight the multi-scale reconstruction loss and the fidelity loss, respectively.
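As a hedged reading of the description above, the fidelity loss and the overall objective could take the following form. All symbol names are assumptions for illustration: $w$ denotes the style vectors, $\bar{w}_{o}$ the online mean latent code, $\mathcal{L}_{in}$ the inpainting loss, $\mathcal{L}_{msr}$ the multi-scale reconstruction loss, and $\lambda_{msr}$, $\lambda_{fid}$ the two balancing factors; the mean-squared term is written up to a normalization constant.

```latex
\mathcal{L}_{fid} = \lVert w - \bar{w}_{o} \rVert_2^2,
\qquad
\mathcal{L} = \mathcal{L}_{in} + \lambda_{msr}\,\mathcal{L}_{msr} + \lambda_{fid}\,\mathcal{L}_{fid}
```

Only two balancing factors are mentioned in the text, so the inpainting loss is assumed here to carry unit weight.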

4 Experiments


Figure 5: The visual effect of our method when processing input images from unseen domains. The 1st row shows inpainting results on MetFaces, and the 2nd row shows outpainting results on Scenery. Each result is laid out as the masked image, the model output, and the original image.

We perform extensive validating experiments aiming to answer the following research questions:

  • RQ1: How does our approach perform compared with existing methods, especially in terms of fidelity when the input images have large-scale masks?

  • RQ2: Can our approach resolve the “gapping” issue?

  • RQ3: Can our approach handle input from unseen domain by reusing the well-trained generator while only retraining a lightweight encoder?

  • RQ4: How do different components (e.g., soft-update mean latent, pre-modulation) affect our approach?

4.1 Experimental Settings


Experiments for RQ1, RQ2, and RQ4 are conducted on two datasets, Places2 [49] and CelebA-HQ [11]. CelebA-HQ contains 30,000 high-resolution celebrity faces, and we follow [42, 43] to split this dataset for training and testing. Places2 contains real-world photos with more prominent objects, such as streets, cars, and houses, which makes it better suited than CelebA-HQ for verifying models on large-scale masks. Based on the official train/val/test split, we train the model on train plus test (about 200,000 images) and evaluate it on the first 5,000 images of val. For RQ3, we use two datasets, Scenery [41] and MetFaces [12]. The Scenery dataset is a common benchmark for recent image outpainting tasks and contains 6,040 landscape photographs. We follow [41] and use about 5,000 images as the training set and the remaining 1,000 images as the test set. MetFaces consists of 1,336 human faces extracted from works of art; we randomly select 1,000 images as the training set and use the other images as the test set. Our model and all baselines adopt the same training and test strategies to ensure experimental fairness.

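| Dataset | Mask | Metric | B1 | B2 | B3 | ProFill | B5 | B6 | Ours |
|---|---|---|---|---|---|---|---|---|---|
| Places2 | Hard | SSIM | 0.624 | 0.645 | 0.598 | 0.664 | 0.651 | 0.629 | 0.641 |
| Places2 | Hard | FID | 22.05 | 27.77 | 44.38 | 21.49 | 35.77 | 22.46 | 12.44 |
| Places2 | Hard | LPIPS | 0.246 | 0.235 | 0.294 | 0.240 | 0.272 | 0.250 | 0.232 |
| Places2 | Extreme | SSIM | 0.363 | 0.382 | 0.323 | 0.409 | 0.393 | 0.360 | 0.366 |
| Places2 | Extreme | FID | 51.35 | 71.19 | 111.85 | 46.44 | 95.50 | 51.26 | 21.08 |
| Places2 | Extreme | LPIPS | 0.407 | 0.395 | 0.495 | 0.402 | 0.438 | 0.413 | 0.386 |
| Places2 | All | SSIM | 0.734 | 0.750 | 0.714 | 0.764 | 0.755 | 0.738 | 0.761 |
| Places2 | All | FID | 14.19 | 16.26 | 26.15 | 13.81 | 21.36 | 14.44 | 9.29 |
| Places2 | All | LPIPS | 0.178 | 0.170 | 0.217 | 0.173 | 0.199 | 0.182 | 0.155 |
| CelebA-HQ | Hard | SSIM | 0.790 | 0.825 | 0.781 | - | 0.818 | 0.810 | 0.812 |
| CelebA-HQ | Hard | FID | 17.38 | 9.98 | 21.97 | - | 15.13 | 13.78 | 9.89 |
| CelebA-HQ | Hard | LPIPS | 0.170 | 0.128 | 0.192 | - | 0.151 | 0.139 | 0.121 |
| CelebA-HQ | Extreme | SSIM | 0.589 | 0.641 | 0.552 | - | 0.616 | 0.639 | 0.652 |
| CelebA-HQ | Extreme | FID | 41.70 | 22.07 | 55.52 | - | 33.89 | 30.19 | 13.21 |
| CelebA-HQ | Extreme | LPIPS | 0.297 | 0.241 | 0.359 | - | 0.281 | 0.275 | 0.214 |
| CelebA-HQ | All | SSIM | 0.852 | 0.878 | 0.846 | - | 0.875 | 0.859 | 0.867 |
| CelebA-HQ | All | FID | 11.78 | 7.96 | 15.52 | - | 10.32 | 11.94 | 7.71 |
| CelebA-HQ | All | LPIPS | 0.128 | 0.092 | 0.142 | - | 0.110 | 0.114 | 0.089 |

Table 1: Quantitative comparison with the mainstream inpainting approaches on Places2 and CelebA-HQ datasets. Hard, Extreme, and All denote masks with coverage ratios of 50% to 60%, 70% to 90%, and 10% to 90%, respectively. Higher SSIM is better; lower FID and LPIPS are better. Best and second best results are highlighted. B1 to B6 are baseline columns in source order; the column with ‘-’ on CelebA-HQ is ProFill, and the rightmost column is our method.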

We use three metrics following prior works to measure the quality and fidelity of inpainting results. SSIM [35], which models image distortion through structure, luminance, and contrast, is a pixel-level objective metric similar to PSNR; the drawbacks of both can make their evaluation results inconsistent with human judgment. Despite that, they are classical metrics for image evaluation, and we select SSIM for quantitative comparison. FID [9] is a deep metric that is closer to human perception. It measures the distribution distance with a pre-trained Inception model, which better captures distortions. LPIPS [48] is another learned perceptual metric, commonly used to score the intra-conditioning diversity of model outputs. Following previous works [16, 26, 43], we calculate these quantitative metrics between the original images and the composition images.
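A composition image is commonly formed by keeping the known pixels of the original and taking only the masked pixels from the model output. The sketch below illustrates that convention on nested lists; the function name and the mask convention (1 for masked pixels) are assumptions, and the exact composition used for the reported metrics is not spelled out here.

```python
def composite(original, output, mask):
    """Keep known pixels from the original and take masked pixels from the output.

    All arguments are 2D lists of equal shape; mask uses 1 for masked
    (invisible) pixels and 0 for known pixels.
    """
    return [
        [m * out + (1 - m) * org for org, out, m in zip(org_row, out_row, mask_row)]
        for org_row, out_row, mask_row in zip(original, output, mask)
    ]


original = [[10, 20], [30, 40]]
output = [[11, 99], [31, 77]]
mask = [[0, 1], [0, 1]]  # right column is masked
result = composite(original, output, mask)  # [[10, 99], [30, 77]]
```

By construction the composition satisfies the hard constraint on unmasked pixels, so any remaining "gapping" shows up as a visible seam at the mask boundary rather than as altered known content.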


We carefully select baseline methods mainly from two perspectives: UNet style methods and Inversion style methods to demonstrate our approach’s characteristics and superiority. First, for the sake of validating the ability of InvertFill in filling images under large-scale masks, we compare it with the previous approaches including EC [26], GC [42], RFR [16], MEDFE [21], ProFill [47], CTSDG [8] and CRFill [46]. Second, we compare with the latest GAN inversion-based inpainting methods mGANprior [6] and pSp [28].

4.2 Implementation Details

We utilize eight A100 GPUs for pre-training the GAN generator, and one TITAN RTX GPU for optimizing the encoder and for the other experiments. Following [16], we rescale the images of all datasets to a common size as the input. According to the mask coverage, we classify the test masks into three difficulty levels, Hard, Extreme, and All, which denote masks with coverage ratios of 50% to 60%, 70% to 90%, and 10% to 90%, respectively. During testing, for a fair comparison, we use the same image-mask pairs for all approaches. More implementation details are given in the supplementary material.
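The three difficulty levels can be expressed as a small helper that maps a mask to the levels it belongs to. This is an illustrative sketch: the function names are hypothetical, and treating the range endpoints as inclusive is an assumption, since the text only gives the ranges.

```python
def coverage_ratio(mask):
    """Fraction of pixels that are masked (mask value 1) in a 2D binary mask."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


def difficulty_levels(ratio):
    """Return every difficulty level whose coverage range contains the ratio."""
    ranges = {
        "All": (0.10, 0.90),
        "Hard": (0.50, 0.60),
        "Extreme": (0.70, 0.90),
    }
    # Endpoints are treated as inclusive (an assumption).
    return [name for name, (low, high) in ranges.items() if low <= ratio <= high]


mask = [[1, 1, 1, 0],
        [1, 1, 1, 0],
        [1, 1, 1, 0],
        [1, 1, 1, 0]]
ratio = coverage_ratio(mask)       # 12 of 16 pixels are masked
levels = difficulty_levels(ratio)  # the mask falls in both "All" and "Extreme"
```

Because the All range contains the other two, a Hard or Extreme mask is also counted under All.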

Figure 6: Comparison with pSp [28] and pSp+Blend [36], which adds image blending as post-processing. Columns, left to right (shown twice): Masked, pSp, pSp+Blend, Ours. The 1st row shows the color discrepancy, which image blending is sufficient to resolve satisfactorily. The 2nd row shows that the semantic inconsistency remains, except for our method.

4.3 Result Analysis


We reproduce all the above baselines using their official implementations. For the Places2 dataset, we use the pre-trained weights officially released by the baselines. On the CelebA-HQ dataset, only EC [26], GC [43], and mGANprior [6] offer pre-trained weights, so we carefully retrain the other baselines from the official source code. Because ProFill only offers a Web API on Places2, we use the placeholder ‘-’ in Table 1 for ProFill on CelebA-HQ.

From Table 1, our method achieves the best or comparable performance among advanced inpainting approaches. In terms of the FID metric, our method leads by a margin of up to 54.60% and 40.14% on the Places2 and CelebA-HQ datasets, respectively. Our method also improves on the second-best approach by 11.2% and 10.4% on another perceptual metric, LPIPS, which validates the superiority of our design.
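The FID margins quoted above appear to be relative reductions with respect to the second-best FID at the Extreme setting of Table 1. That reading is an inference from the table values rather than something stated in the text, and the helper name below is hypothetical; the sketch simply reproduces the arithmetic.

```python
def relative_reduction(second_best, ours):
    """Percentage by which `ours` is lower than `second_best`."""
    return 100.0 * (second_best - ours) / second_best


# Extreme-mask FID values taken from Table 1 (second-best baseline vs. ours).
places2_margin = relative_reduction(46.44, 21.08)  # roughly 54.6
celeba_margin = relative_reduction(22.07, 13.21)   # roughly 40.1
```

Both results agree with the quoted 54.60% and 40.14% to within rounding.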

| Method | FID | LPIPS | SSIM |
|---|---|---|---|
| Score-SDE [32] | 24.76 | 0.337 | 0.428 |
| mGANprior [6] | 29.57 | 0.273 | 0.608 |
| pSp [28] | 25.61 | 0.248 | 0.594 |
| pSp + Blend [36] | 21.96 | 0.240 | 0.602 |
| Ours | 13.21 | 0.214 | 0.652 |

Table 2: Comparison with previous GAN inversion-based and diffusion-based approaches on CelebA-HQ dataset.

| Method | FID | LPIPS | SSIM |
|---|---|---|---|
| RFR [16] | 138.31 | 0.455 | 0.376 |
| pSp [28] | 49.62 | 0.379 | 0.392 |
| Boundless [15] | 45.05 | 0.368 | 0.413 |
| NS-outpaint [41] | 38.95 | 0.342 | 0.410 |
| Ours | 20.90 | 0.294 | 0.439 |

Table 3: Comparison with previous outpainting approaches and inpainting baselines on Scenery dataset.

Figs. 3 and 4 provide several visual inpainting results on the Places2 and CelebA-HQ datasets. Fig. 3 reveals that prior works still struggle to generate refined texture when the input image has large corruptions, while our approach is able to create semantically rich objects such as windows, towers, and woods. In Fig. 4, mGANprior [6] progressively erases the color discrepancy by relying on optimization-based inversion but is unable to avoid semantic inconsistency. The encoder-based inversion method pSp [28] can synthesize realistic pixels for corrupted regions based on the well-trained model, though it still does not resolve the “gapping” issue. The results indicate that our method produces consistent output while generating higher-fidelity texture than existing methods.


The “gapping” issue causes color discrepancy and semantic inconsistency, and at the beginning of this study we relied on image post-processing to tackle it. Specifically, we adopted image blending [36], which is effective in eliminating the color discrepancy but not helpful in remedying semantic inconsistency.

To further demonstrate the superiority of our method, we construct a pSp+Blend variant that applies an image blending [36] method after generating the output images. In Fig. 6, the first row shows the distinct gap at the stitching boundary in the pSp output, and pSp+Blend fixes this color discrepancy problem. Even so, the second row shows that pSp+Blend is unable to resolve the semantic inconsistency problem, given that the glasses are still incomplete. Compared with the vanilla pSp and pSp+Blend, the output images of our approach no longer suffer from color discrepancy or semantic inconsistency.

We conduct a comparison experiment on the CelebA-HQ dataset with the Extreme level masks. As demonstrated in Table 2, our method performs better than a recent diffusion-based approach, Score-SDE [32], with respect to the FID, LPIPS, and SSIM metrics. The results in Table 2 also show that our method performs best among the existing inversion-based inpainting approaches after resolving the “gapping” issue. Notably, our method does not require any image post-processing.

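| Method | Easy SSIM | Easy FID | Easy LPIPS | Extreme SSIM | Extreme FID | Extreme LPIPS |
|---|---|---|---|---|---|---|
| RFR | 0.93 | 18.89 | 0.069 | 0.52 | 58.24 | 0.315 |
| CRFill | 0.95 | 13.67 | 0.042 | 0.54 | 50.93 | 0.278 |
| pSp | 0.95 | 14.91 | 0.040 | 0.49 | 65.04 | 0.341 |
| Ours | 0.97 | 8.64 | 0.033 | 0.60 | 38.85 | 0.227 |

Table 4: Comparison with previous inpainting methods on MetFaces. In this experimental setting, the model/generator is only trained on CelebA-HQ.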

Figure 7: The importance of soft-update mean latent (SML). Columns, left to right (shown twice): Masked, w/o SML, w/ SML, GT.


To validate that our approach can reuse the pre-trained GAN generator as a prior to handle images from unseen domains, we conduct two extended tests that introduce images or masks from unseen domains and only require optimizing the lightweight encoder. The first is archaic photograph inpainting: we use MetFaces [12] to optimize the encoder and retain the pre-trained GAN generator weights from the CelebA-HQ dataset. For the second, we apply our approach with outpainting masks [15] on the Scenery dataset. Similarly, the generator is not retrained on the Scenery dataset but retains the weights trained on Places2.

The 1st row of Fig. 5 shows the results of archaic photograph inpainting. It demonstrates that our method enables the generator to synthesize semantically consistent styles and patches, even in an unseen domain. In the 2nd row of Fig. 5, the outpainting results on the Scenery dataset show that our approach can still synthesize realistic textures and prominent objects, e.g., trees and mountains. To ensure the masks are unseen by the GAN generator, we use the outpainting masks only to train the encoder, not the GAN generator.

Furthermore, we quantitatively compare with mainstream outpainting approaches and adopt RFR [16] and pSp [28] as additional baselines. As shown in Table 3, our model considerably outperforms the best outpainting baselines [15, 41] with respect to FID, LPIPS, and SSIM. Similarly, we conduct experiments against inpainting baselines on MetFaces, as shown in Table 4. In summary, the results indicate that our proposed method is robust and extends to other tasks with out-of-domain inputs.

Due to limited space, please kindly refer to the supplementary material for more results.

4.4 Ablation Study (RQ4)

The ablation experiments are carried out on the Places2 dataset under the Extreme mask setting. In Table 5, we construct three variants to verify the contribution of the proposed modules, in which PM and SML denote pre-modulation and soft-update mean latent. With these modules, our method considerably outperforms the most naive variant with respect to FID, LPIPS, SSIM, and PSNR.

The soft-update mean latent is motivated by the intuition that fitting diverse domains works better than fitting a preset static domain, especially when the training dataset contains various scenarios such as street and landscape. As shown in Fig. 7, when we use SML code that dynamically fluctuates during training, the masked region far away from the mask border tends to be reconstructed by explicitly learned semantics instead of repetitive patterns. Notably, ‘w/o SML’ represents using regular static mean latent code.

4.5 Failure Cases and Discussion

Fig. 8 shows two failure cases. Even if the model can recognize the corrupted objects (our method tends to recover the human face in the left case of Fig. 8), it mistakenly locates them and produces severe artifacts. When lacking sufficient prior knowledge, our method fails to reconstruct details. This demonstrates that these situations are challenging for image inpainting and need further study.

FID    LPIPS  SSIM   PSNR
35.37  0.395  0.357  13.85
24.73  0.389  0.358  13.99
21.08  0.386  0.366  14.62
42.85  0.392  0.361  14.25
Table 5: Ablation study comparison on the Places2 dataset under the Extreme mask setting.

Figure 8: Illustration of two failure cases of the proposed method. Panels for each case, from left to right: Masked, Original, Ours.

5 Conclusion

In this paper, we propose InvertFill, an encoder-based GAN inversion method for image inpainting. The encoder projects corrupted images into a latent space with pre-modulation to learn more discriminative representations. The novel latent space resolves the “gapping” issue that arises when GAN inversion is applied to image inpainting. In addition, the soft-update mean latent dynamically samples diverse in-domain patterns, leading to more realistic textures. Extensive quantitative and qualitative comparisons demonstrate the superiority of our model over previous approaches and show that it can cheaply support the semantically consistent completion of images or masks from unseen domains.


This work was supported by the Key Research Program of Frontier Sciences, CAS, Grant No. ZDBS-LY-JSC038. Libo Zhang was supported by the CAAI-Huawei MindSpore Open Fund and Youth Innovation Promotion Association, CAS (2020111). Heng Fan and his employer received no financial support for the research, authorship, and/or publication of this article.


  • [1] R. Abdal, Y. Qin, and P. Wonka (2019) Image2StyleGAN: how to embed images into the stylegan latent space?. In ICCV, pp. 4431–4440. Cited by: §2, §3.1.
  • [2] Y. Cheng, C. H. Lin, H. Lee, J. Ren, S. Tulyakov, and M. Yang (2021) In&Out: diverse image outpainting via GAN inversion. CoRR abs/2104.00675. Cited by: 2nd item, §1, §2.
  • [3] A. Criminisi, P. Pérez, and K. Toyama (2003) Object removal by exemplar-based inpainting. In CVPR, pp. 721–728. Cited by: §1, §2.
  • [4] L. A. Gatys, A. S. Ecker, and M. Bethge (2015) A neural algorithm of artistic style. CoRR abs/1508.06576. Cited by: §3.3.
  • [5] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio (2014) Generative adversarial nets. In NeurIPS, pp. 2672–2680. Cited by: §2.
  • [6] J. Gu, Y. Shen, and B. Zhou (2020) Image processing using multi-code GAN prior. In CVPR, pp. 3009–3018. Cited by: 2nd item, §1, §2, Figure 4, §3.1, §4.1, §4.3, §4.3, Table 3.
  • [7] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville (2017) Improved training of wasserstein gans. In NeurIPS, pp. 5767–5777. Cited by: §2.
  • [8] X. Guo, H. Yang, and D. Huang (2021) Image inpainting via conditional texture and structure dual generation. In ICCV, pp. 14134–14143. Cited by: 1st item, §1, §2, §4.1.
  • [9] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) GANs trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS, pp. 6626–6637. Cited by: §4.1.
  • [10] X. Huang and S. J. Belongie (2017) Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV, pp. 1510–1519. Cited by: §2.
  • [11] T. Karras, T. Aila, S. Laine, and J. Lehtinen (2018) Progressive growing of gans for improved quality, stability, and variation. In ICLR, Cited by: §1, §1, §1, §4.1.
  • [12] T. Karras, M. Aittala, J. Hellsten, S. Laine, J. Lehtinen, and T. Aila (2020) Training generative adversarial networks with limited data. In NeurIPS, Cited by: §1, §1, §4.1, §4.3.
  • [13] T. Karras, S. Laine, and T. Aila (2019) A style-based generator architecture for generative adversarial networks. In CVPR, pp. 4401–4410. Cited by: §1, §2, §3.1.
  • [14] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila (2020) Analyzing and improving the image quality of stylegan. In CVPR, pp. 8107–8116. Cited by: §1, §2.
  • [15] D. Krishnan, P. Teterwak, A. Sarna, A. Maschinot, C. Liu, D. Belanger, and W. T. Freeman (2019) Boundless: generative adversarial networks for image extension. In ICCV, pp. 10520–10529. Cited by: §4.3, §4.3, Table 3.
  • [16] J. Li, N. Wang, L. Zhang, B. Du, and D. Tao (2020) Recurrent feature reasoning for image inpainting. In CVPR, pp. 7757–7765. Cited by: 1st item, §1, §2, §3.3, §4.1, §4.1, §4.2, §4.3, Table 3.
  • [17] L. Liao, J. Xiao, Z. Wang, C. Lin, and S. Satoh (2021) Image inpainting guided by coherence priors of semantics and textures. In CVPR, pp. 6539–6548. Cited by: §2.
  • [18] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2016) Continuous control with deep reinforcement learning. In ICLR, Cited by: §3.2.
  • [19] T. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie (2017) Feature pyramid networks for object detection. In CVPR, pp. 936–944. Cited by: §3.1.
  • [20] G. Liu, F. A. Reda, K. J. Shih, T. Wang, A. Tao, and B. Catanzaro (2018) Image inpainting for irregular holes using partial convolutions. In ECCV, pp. 89–105. Cited by: §1, §2, §3.3, §3.3.
  • [21] H. Liu, B. Jiang, Y. Song, W. Huang, and C. Yang (2020) Rethinking image inpainting via a mutual encoder-decoder with feature equalizations. In ECCV, pp. 725–741. Cited by: §1, §2, §4.1.
  • [22] H. Liu, B. Jiang, Y. Xiao, and C. Yang (2019) Coherent semantic attention for image inpainting. In ICCV, pp. 4169–4178. Cited by: §2.
  • [23] Y. Ma, X. Liu, S. Bai, L. Wang, D. He, and A. Liu (2019) Coarse-to-fine image inpainting via region-wise convolutions and non-local correlation. In IJCAI, pp. 3123–3129. Cited by: §2.
  • [24] X. Mao, Q. Li, H. Xie, R. Y. K. Lau, Z. Wang, and S. P. Smolley (2017) Least squares generative adversarial networks. In ICCV, pp. 2813–2821. Cited by: §2.
  • [25] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida (2018) Spectral normalization for generative adversarial networks. In ICLR, Cited by: §2.
  • [26] K. Nazeri, E. Ng, T. Joseph, F. Z. Qureshi, and M. Ebrahimi (2019) EdgeConnect: structure guided image inpainting using edge prediction. In ICCVW, pp. 3265–3274. Cited by: §1, §2, §4.1, §4.1, §4.3.
  • [27] D. Pathak, P. Krähenbühl, J. Donahue, T. Darrell, and A. A. Efros (2016) Context encoders: feature learning by inpainting. In CVPR, pp. 2536–2544. Cited by: §1, §2.
  • [28] E. Richardson, Y. Alaluf, O. Patashnik, Y. Nitzan, Y. Azar, S. Shapiro, and D. Cohen-Or (2021) Encoding in style: A stylegan encoder for image-to-image translation. In CVPR, pp. 2287–2296. Cited by: Figure 1, 2nd item, 3rd item, §1, §1, §2, §2, Figure 4, §3.1, §3.1, §3.2, Figure 6, §4.1, §4.3, §4.3, Table 3.
  • [29] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In MICCAI, pp. 234–241. Cited by: §1, §2, §3.1.
  • [30] R. Shetty, M. Fritz, and B. Schiele (2018) Adversarial scene editing: automatic object removal from weak supervision. In NeurIPS, pp. 7717–7727. Cited by: §1.
  • [31] L. Song, J. Cao, L. Song, Y. Hu, and R. He (2019) Geometry-aware face completion and editing. In AAAI, pp. 2506–2513. Cited by: §1.
  • [32] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole (2021) Score-based generative modeling through stochastic differential equations. In ICLR, Cited by: §2, §4.3, Table 3.
  • [33] D. Ulyanov, A. Vedaldi, and V. S. Lempitsky (2016) Instance normalization: the missing ingredient for fast stylization. CoRR abs/1607.08022. Cited by: §3.1.
  • [34] H. Wang, N. Yu, and M. Fritz (2021) Hijack-gan: unintended-use of pretrained, black-box gans. In CVPR, pp. 7872–7881. Cited by: §2.
  • [35] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. TIP, pp. 600–612. Cited by: §4.1.
  • [36] H. Wu, S. Zheng, J. Zhang, and K. Huang (2019) GP-GAN: towards realistic high-resolution image blending. In MM, pp. 2487–2495. Cited by: Figure 6, §4.3, §4.3, Table 3.
  • [37] Z. Wu, D. Lischinski, and E. Shechtman (2021) StyleSpace analysis: disentangled controls for stylegan image generation. In CVPR, pp. 12863–12872. Cited by: §2.
  • [38] W. Xia, Y. Zhang, Y. Yang, J. Xue, B. Zhou, and M. Yang (2021) GAN inversion: A survey. CoRR abs/2101.05278. Cited by: §1.
  • [39] Y. Xu, Y. Shen, J. Zhu, C. Yang, and B. Zhou (2021) Generative hierarchical features from synthesizing images. In CVPR, pp. 4432–4442. Cited by: §2, §2.
  • [40] Z. Yan, X. Li, M. Li, W. Zuo, and S. Shan (2018) Shift-net: image inpainting via deep feature rearrangement. In ECCV, pp. 3–19. Cited by: §2.
  • [41] Z. Yang, J. Dong, P. Liu, Y. Yang, and S. Yan (2019) Very long natural scenery image prediction by outpainting. In ICCV, pp. 10560–10569. Cited by: §1, §1, §4.1, §4.3, Table 3.
  • [42] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang (2018) Generative image inpainting with contextual attention. In CVPR, pp. 5505–5514. Cited by: §2, §4.1, §4.1.
  • [43] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang (2019) Free-form image inpainting with gated convolution. In ICCV, pp. 4470–4479. Cited by: §1, §2, §4.1, §4.1, §4.3.
  • [44] T. Yu, Z. Guo, X. Jin, S. Wu, Z. Chen, W. Li, Z. Zhang, and S. Liu (2020) Region normalization for image inpainting. In AAAI, pp. 12733–12740. Cited by: §2.
  • [45] Y. Zeng, J. Fu, H. Chao, and B. Guo (2021) Aggregated contextual transformations for high-resolution image inpainting. CoRR abs/2104.01431. Cited by: §1.
  • [46] Y. Zeng, Z. Lin, H. Lu, and V. M. Patel (2021) CR-fill: generative image inpainting with auxiliary contextual reconstruction. In ICCV, pp. 14164–14173. Cited by: §1, §2, §2, §4.1.
  • [47] Y. Zeng, Z. Lin, J. Yang, J. Zhang, E. Shechtman, and H. Lu (2020) High-resolution image inpainting with iterative confidence feedback and guided upsampling. In ECCV, pp. 1–17. Cited by: 1st item, §1, §2, §2, §4.1.
  • [48] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, pp. 586–595. Cited by: §4.1.
  • [49] B. Zhou, À. Lapedriza, A. Khosla, A. Oliva, and A. Torralba (2018) Places: A 10 million image database for scene recognition. TPAMI, pp. 1452–1464. Cited by: §1, §1, §4.1.
  • [50] J. Zhu, Y. Shen, D. Zhao, and B. Zhou (2020) In-domain GAN inversion for real image editing. In ECCV, pp. 592–608. Cited by: 2nd item, 3rd item, §2, §2, §3.1.
  • [51] J. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros (2016) Generative visual manipulation on the natural image manifold. In ECCV, pp. 597–613. Cited by: §1, §2.