High-Fidelity Pluralistic Image Completion with Transformers

by   Ziyu Wan, et al.

Image completion has made tremendous progress with convolutional neural networks (CNNs), because of their powerful texture modeling capacity. However, due to some inherent properties (e.g., local inductive prior, spatial-invariant kernels), CNNs do not perform well in understanding global structures or naturally support pluralistic completion. Recently, transformers demonstrate their power in modeling the long-term relationship and generating diverse results, but their computation complexity is quadratic to input length, thus hampering the application in processing high-resolution images. This paper brings the best of both worlds to pluralistic image completion: appearance prior reconstruction with transformer and texture replenishment with CNN. The former transformer recovers pluralistic coherent structures together with some coarse textures, while the latter CNN enhances the local texture details of coarse priors guided by the high-resolution masked images. The proposed method vastly outperforms state-of-the-art methods in terms of three aspects: 1) large performance boost on image fidelity even compared to deterministic completion methods; 2) better diversity and higher fidelity for pluralistic completion; 3) exceptional generalization ability on large masks and generic dataset, like ImageNet.






1 Introduction

Image completion (a.k.a. image inpainting), which aims to fill the missing parts of images with visually realistic and semantically appropriate contents, has been a longstanding and critical problem in computer vision. It is widely used in a broad range of applications, such as object removal [2], photo restoration [32, 33], image manipulation [15], and image re-targeting [6]. To solve this challenging task, traditional methods like PatchMatch [2] usually search for similar patches within the image and paste them into the missing regions, but they require appropriate information to be contained in the input image, e.g., similar structures or patches, which is often difficult to satisfy.

In recent years, CNN-based solutions [14, 18, 25, 35, 21] started to dominate this field. By training on large-scale datasets in a self-supervised way, CNNs have shown their strength in learning rich texture patterns, and fill the missing regions with such learned patterns. Besides, CNN models are computationally efficient considering the sparse connectivity of convolutions. Nonetheless, they share some inherent limitations: 1) The local inductive priors of the convolution operation make modeling the global structures of an image difficult; 2) CNN filters are spatially invariant, i.e., the same convolution kernel operates on the features across all positions, so duplicated patterns or blurry artifacts frequently appear in the masked regions. Moreover, CNN models are inherently deterministic. To achieve diverse completion outputs, recent frameworks [39, 37] rely on optimizing the variational lower bound of instance likelihood. Nonetheless, the extra distributional assumptions inevitably hurt the quality of generated contents [38].

Transformers, as well-explored architectures in language tasks, are on the rise in many computer vision tasks. Compared to CNN models, the transformer abandons the baked-in local inductive prior and is designed to support long-term interaction via the dense attention module [31]. Some preliminary works [5] also demonstrate its capacity in modeling the structural relationships for natural image synthesis. Another advantage of using a transformer for synthesis is that it naturally supports pluralistic outputs by directly optimizing the underlying data distribution. However, the transformer also has its own deficiencies. Because its computational complexity grows quadratically with input length, it struggles in high-resolution image synthesis or processing. Besides, most existing transformer-based generative models [24, 5] work in an auto-regressive manner, i.e., they synthesize pixels in a fixed order, like the raster-scan order, which hampers their application in the image completion task where the missing regions often have arbitrary shapes and sizes.

In this paper, we propose a new high-fidelity pluralistic image completion method by bringing the best of both worlds: the global structural understanding ability and pluralism support of transformers, and the local texture refinement ability and efficiency of CNN models. To achieve this, we decouple image completion into two steps: pluralistic appearance priors reconstruction with a transformer to recover the coherent image structures, and low-resolution upsampling with a CNN to replenish fine textures. Specifically, given an input image with missing regions, we first leverage the transformer to sample low-resolution completion results, i.e., appearance priors. Then, guided by the appearance priors and the available pixels of the input image, another upsampling CNN model is utilized to render high-fidelity textures for missing regions while ensuring coherence with neighboring pixels. In particular, unlike previous auto-regressive methods [5, 30], in order to make the transformer model capable of completing the missing regions by considering all the available contexts, we optimize the log-likelihood objective of missing pixels based on bi-directional conditions, which is inspired by masked language models like BERT [9].

To demonstrate the superiority, we compare our method with state-of-the-art deterministic [23, 20, 34] and pluralistic [39] image completion approaches on multiple datasets. Our method makes significant progress in three aspects: 1) Compared with previous deterministic completion methods, our method establishes a new state of the art and outperforms them on a variety of metrics by a large margin; 2) Compared with previous pluralistic completion methods, our method further enhances the diversity of results while achieving higher completion fidelity; 3) Thanks to the strong structure modeling capacity of transformers, our method generalizes much better in completing extremely large missing regions and on large-scale generic datasets (e.g., ImageNet), as shown in Figure 1. Remarkably, the FID score on ImageNet is improved by up to 41.2 compared with the state-of-the-art method PIC [39].

2 Related Works

Visual Transformers    Vaswani et al. [31] first proposed transformers for machine translation, and their success has subsequently been proven in various downstream natural language processing (NLP) tasks. The overall architecture of transformers is composed of stacked self-attention and point-wise feed-forward layers for the encoder and decoder. Since the attention mechanism can model the dense relationships among the elements of an input sequence well, transformers are now becoming popular in computer vision. For example, DETR [4] employs transformers as the backbone to solve the object detection problem. Dosovitskiy et al. [10] propose ViT, which first utilizes transformers for image recognition and achieves excellent results compared with CNN-based methods. Besides, Parmar et al. [24] and Chen et al. [5] leverage the transformer to model the image. Nonetheless, to generate an image, these methods rely on a fixed permutation order, which is not suitable for completing missing areas with varying shapes.

Figure 2: Pipeline Overview. Our method consists of two networks. The upper one is a bi-directional transformer, which is responsible for producing the probability distribution of missing regions; the appearance priors can then be reconstructed by sampling from this distribution with diversity. Subsequently, we employ another CNN to upsample the appearance prior to the original resolution under the guidance of the input masked images. Our method combines the advantages of both transformer and CNN, leading to high-fidelity pluralistic image completion performance. E: Encoder, D: Decoder, R: Residual block.

Deterministic Image Completion    Traditional image completion methods, like diffusion-based [3, 11] and patch-based [2, 12, 7], rely on strong low-level assumptions, which may be violated while facing large-area masks. To generate semantic-coherent contents, recently many CNN-based methods [25, 19, 21, 35, 23] have been proposed. Most of the methods share a similar encoder-decoder architecture. Specifically, Pathak et al. [25] bring the adversarial training into inpainting and achieve semantic hole-filling. Iizuka et al. [14] improve the performance of CE [25] by involving a local-global discriminator. Yu et al. [35] propose a new contextual attention module to capture the long-range correlations. Liu et al. [19] design a new operator named partial-conv to alleviate the negative influence of the masked regions produced by convolutions. These methods could generate reasonable contents for masked regions but lack the ability to generate diversified results.

Pluralistic Image Completion    To obtain a diverse set of results for each masked input, Zheng et al. [39] propose a dual pipeline framework, which couples the estimated distribution from the reconstructive path and the conditional prior of the generative path by jointly maximizing the lower bound. Similar to [39], UCTGAN [37] projects both the masked input and a reference image into a common space by optimizing the KL-divergence between encoded features and distribution to achieve diversified sampling. Although these methods achieve diversity to a certain extent, their completion quality is limited due to variational training. Unlike these methods, we directly optimize the log-likelihood in the discrete space via transformers without auxiliary assumptions.

3 Method

Image completion aims to transform the input image $I_m \in \mathbb{R}^{H \times W \times 3}$ with missing pixels into a complete image $I$. This task is inherently stochastic, which means that, given the masked image $I_m$, there exists a conditional distribution $p(I \mid I_m)$. We decompose the completion procedure into two steps, appearance priors reconstruction and texture details replenishment. Since obtaining the coarse prior $X$ given $I$ and $I_m$ is deterministic, $p(I \mid I_m)$ could be re-written as,

$$p(I \mid I_m) = p(I \mid I_m) \cdot p(X \mid I, I_m) = p(X, I \mid I_m) = p(X \mid I_m) \cdot p(I \mid X, I_m).$$

Instead of directly sampling from $p(I \mid I_m)$, we first use a transformer to model the underlying distribution of appearance priors given $I_m$, denoted as $p(X \mid I_m)$ (described in Sec. 3.1). These reconstructed appearance priors contain ample cues of global structure and coarse textures, thanks to the transformer's strong representation ability. Subsequently, we employ another CNN to replenish texture details under the guidance of appearance priors and unmasked pixels, denoted as $p(I \mid X, I_m)$ (described in Sec. 3.2). The overall pipeline is shown in Figure 2.

3.1 Appearance Priors Reconstruction

Discretization    Considering the quadratically increasing computational cost of multi-head attention [31] in the transformer architecture, we represent the appearance priors of a natural image with its corresponding low-resolution version (at one of two small fixed resolutions in our implementation), which contains structural information and coarse textures only. Nonetheless, the dimensionality of the RGB pixel representation ($256^3$) is still too large. To further reduce the dimension and faithfully re-represent the low-resolution image, an extra visual vocabulary with spatial size $512 \times 3$ is generated using K-Means cluster centers of the whole ImageNet [8] RGB pixel space. Then, for each pixel of the appearance priors, we search for the index of the nearest element in the visual vocabulary to obtain its discrete representation. In addition, the elements of the representation sequence corresponding to hole regions are replaced with a special token [MASK], which is also the learning target of the transformer. In this way, we convert the appearance prior into a discretized sequence.
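The discretization step can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the vocabulary below is random rather than fitted with K-Means on ImageNet, the 8x8 image is a toy size, and the function names and the convention of using index 512 for [MASK] are assumptions made for illustration.

```python
import numpy as np

def quantize_to_vocabulary(image, vocabulary, hole_mask, mask_token=None):
    """Map each pixel of a low-resolution RGB image to the index of its
    nearest vocabulary colour, then overwrite hole positions with [MASK]."""
    if mask_token is None:
        mask_token = len(vocabulary)  # assumption: [MASK] takes one extra index
    pixels = image.reshape(-1, 3).astype(np.float64)           # (L, 3)
    # Squared Euclidean distance from every pixel to every vocabulary entry.
    dists = ((pixels[:, None, :] - vocabulary[None, :, :]) ** 2).sum(axis=-1)
    tokens = dists.argmin(axis=1)                               # (L,)
    tokens[hole_mask.reshape(-1)] = mask_token
    return tokens

def tokens_to_image(tokens, vocabulary, size):
    """Inverse lookup: turn a complete token sequence back into RGB."""
    return vocabulary[tokens].reshape(size, size, 3)

rng = np.random.default_rng(0)
vocabulary = rng.uniform(0, 255, size=(512, 3))         # stand-in for K-Means centres
image = vocabulary[rng.integers(0, 512, size=(8, 8))]   # toy 8x8 "appearance prior"
hole = np.zeros((8, 8), dtype=bool)
hole[2:5, 3:6] = True                                   # a 3x3 missing region
tokens = quantize_to_vocabulary(image, vocabulary, hole)
```

The brute-force distance matrix is fine at this scale; a nearest-neighbour index would be a natural replacement for larger inputs.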

Transformer    For each token of the discretized sequence $X = \{x_1, x_2, \ldots, x_L\}$, where $L$ is the length of $X$, we project it into a $d$-dimensional feature vector through prepending a learnable embedding. To encode the spatial information, extra learnable position embeddings are added to the token features for every location $1 \le i \le L$ to form the final input $E \in \mathbb{R}^{L \times d}$ of the transformer model.

Following GPT-2 [26], we use a decoder-only transformer as our network architecture, which is mainly composed of $N$ self-attention based transformer layers. At each transformer layer $l$, the calculation could be formulated as

$$F^{l-1} = \mathrm{LN}\left(\mathrm{MSA}(E^{l-1})\right) + E^{l-1}, \qquad E^{l} = \mathrm{LN}\left(\mathrm{MLP}(F^{l-1})\right) + F^{l-1},$$

where LN, MSA, MLP denote layer normalization [1], multi-head self-attention and FC layer respectively. More specifically, given input $E$, the MSA could be computed as:

$$\mathrm{head}_j = \mathrm{softmax}\left(\frac{(E W_Q^j)(E W_K^j)^{\top}}{\sqrt{d}}\right)(E W_V^j), \qquad \mathrm{MSA} = [\mathrm{head}_1; \ldots; \mathrm{head}_h]\, W_O,$$

where $h$ is the number of heads, $W_Q^j$, $W_K^j$ and $W_V^j$ are three learnable linear projection layers, and $1 \le j \le h$. $W_O$ is also a learnable FC layer, whose purpose is to fuse the concatenation of the outputs from different heads. By adjusting the number of transformer layers $N$, the embedding dimension $d$ and the head number $h$, we could easily scale the size of the transformer. It should also be noted that, unlike auto-regressive transformers [5, 24], which generate elements via single-directional attention and are only constrained by the context before the scanning line, we make each token attend to all positions to achieve bi-directional attention, as shown in Figure 3. This ensures that the generated distribution captures all available contexts, both before and after a mask position in the raster-scan order, thus leading to consistency between generated contents and unmasked regions.

Figure 3: Differences between single-directional (left) and bi-directional (right) attention.
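The difference between the two attention patterns comes down to whether a causal mask is applied to the attention scores. The following single-head NumPy sketch is illustrative only; the function name, the toy sizes and the random weights are assumptions, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(E, W_q, W_k, causal):
    """Single-head attention weights for a token sequence E of shape (L, d)."""
    q, k = E @ W_q, E @ W_k
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if causal:
        # Single-directional: position i may only attend to positions j <= i.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores, axis=-1)

rng = np.random.default_rng(0)
L, d = 6, 4
E = rng.normal(size=(L, d))
W_q, W_k = rng.normal(size=(d, d)), rng.normal(size=(d, d))
uni = attention_weights(E, W_q, W_k, causal=True)    # lower-triangular weights
bi = attention_weights(E, W_q, W_k, causal=False)    # every entry is non-zero
```

With `causal=False`, a masked token early in the raster order can still draw on visible pixels that come later, which is the property the bi-directional design relies on.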

The output of the final transformer layer is further projected to a per-element distribution over the 512 elements of the visual vocabulary with fully connected layers and a softmax function. We adopt a masked language model (MLM) objective similar to the one used in BERT [9] to optimize the transformer model. Specifically, let $\Pi = \{\pi_1, \pi_2, \ldots, \pi_K\}$ denote the indexes of [MASK] tokens in the discretized input, where $K$ is the number of masked tokens. Let $X_{\Pi}$ denote the set of [MASK] tokens in $X$, and $X_{-\Pi}$ denote the set of unmasked tokens. The objective of MLM minimizes the negative log-likelihood of $X_{\Pi}$ conditioned on all observed regions:

$$L_{\mathrm{MLM}} = \mathbb{E}_{X}\left[\frac{1}{K}\sum_{k=1}^{K} -\log p\left(x_{\pi_k} \mid X_{-\Pi}, \theta\right)\right],$$

where $\theta$ denotes the parameters of the transformer. The MLM objective combined with bi-directional attention ensures that the transformer model can capture the whole contextual information to predict the probability distribution of missing regions.
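As a rough illustration of the MLM objective, the sketch below averages the negative log-likelihood over the masked positions only, for a single sequence. The function name and array layout are assumptions for illustration; a real training loop would compute this on batches inside an autodiff framework.

```python
import numpy as np

def mlm_loss(logits, targets, masked_idx):
    """Average negative log-likelihood over masked positions only.

    logits:     (L, V) unnormalised scores over the visual vocabulary
    targets:    (L,)   ground-truth token ids
    masked_idx: (K,)   indexes of the [MASK] positions
    """
    z = logits[masked_idx]
    z = z - z.max(axis=-1, keepdims=True)                      # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    picked = log_probs[np.arange(len(masked_idx)), targets[masked_idx]]
    return float(-picked.mean())

L, V = 16, 512
targets = np.arange(L) % V
masked_idx = np.array([3, 4, 10])
uniform_loss = mlm_loss(np.zeros((L, V)), targets, masked_idx)  # equals log(V)
```

Unmasked positions contribute nothing to the loss; they only act as conditioning context.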

Figure 4: Qualitative comparison with state-of-the-art methods on FFHQ, Places2 dataset. The completion results of our method are with better quality and diversity.

Sampling Strategy    We now describe how to obtain reasonable and diversified appearance priors using the trained transformer. Given the distribution generated by the transformer, directly sampling the entire set of masked positions at once does not produce good results, because the positions would be sampled independently of each other. Instead, we employ Gibbs sampling to iteratively sample tokens at different locations. Specifically, in each iteration, a grid position is sampled from $p(x_{\pi_k} \mid X_{-\Pi}, X_{<\pi_k}, \theta)$ with the top-$\mathcal{K}$ predicted elements, where $x_{\pi_k}$ is the token at the $k$-th masked position, $X_{-\Pi}$ is the set of unmasked tokens, and $X_{<\pi_k}$ denotes the previously generated tokens. Then the corresponding [MASK] token is replaced with the sampled one, and the process is repeated until all positions are updated. Similar to PixelCNN [30], the positions are chosen sequentially in raster-scan order by default. After sampling, we obtain a set of complete token sequences. For each complete discrete sequence sampled from the transformer, we reconstruct its appearance prior by querying the visual vocabulary.
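The iterative procedure can be sketched as a single raster-scan pass with top-K sampling. This is a simplified illustration under stated assumptions: `toy_model` is a hypothetical stand-in that returns pseudo-random logits instead of transformer predictions, and the function names and the [MASK] index 512 are not from the paper.

```python
import numpy as np

def top_k_sample(logits, k, rng):
    """Sample one index from the k most likely entries of a logit vector."""
    top = np.argsort(logits)[-k:]
    z = logits[top] - logits[top].max()
    p = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(top, p=p))

def sample_prior(tokens, mask_token, predict_logits, k=50, seed=0):
    """Fill [MASK] positions one by one in raster-scan order."""
    rng = np.random.default_rng(seed)
    tokens = tokens.copy()
    for pos in np.flatnonzero(tokens == mask_token):   # ascending = raster order
        logits = predict_logits(tokens)[pos]           # re-run on updated context
        tokens[pos] = top_k_sample(logits, k, rng)
    return tokens

def toy_model(tokens, vocab_size=512):
    """Hypothetical stand-in for the transformer: deterministic pseudo-logits."""
    rng = np.random.default_rng(int(tokens.sum()))
    return rng.normal(size=(len(tokens), vocab_size))

tokens = np.arange(16) % 7
tokens[[3, 4, 10]] = 512                               # three [MASK] positions
completed = sample_prior(tokens, 512, toy_model, k=50, seed=0)
```

Because the model is re-evaluated after every replacement, each newly sampled token conditions the positions that follow it. Changing `seed` yields different completions, while `k=1` reduces the procedure to a deterministic greedy choice.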

3.2 Guided Upsampling

After reconstructing the low-dimensional appearance priors, we reshape $X$ into $I_t \in \mathbb{R}^{\sqrt{L} \times \sqrt{L} \times 3}$ for subsequent processing. Since $I_t$ already contains the diversity, the problem is now how to learn a deterministic mapping that re-scales $I_t$ to the original resolution $H \times W \times 3$ while preserving the boundary consistency between hole regions and unmasked regions. Since CNNs are good at modeling texture patterns, we introduce another guided upsampling network, which renders high-fidelity details of the reconstructed appearance priors with the guidance of the masked input $I_m$. The guided upsampling could be written as

$$I_{pred} = \mathcal{F}_{\delta}\left(I_t^{\uparrow} \oplus I_m\right) \in \mathbb{R}^{H \times W \times 3},$$

where $I_t^{\uparrow}$ is the result of bilinear interpolation of $I_t$ and $\oplus$ denotes the concatenation operation along the channel dimension. $\mathcal{F}$ is the backbone of the upsampling network parameterized by $\delta$, which is mainly composed of an encoder, a decoder and several residual blocks. More details about the architecture can be found in the supplementary material.

We optimize this guided upsampling network by minimizing the $\ell_1$ loss between $I_{pred}$ and the corresponding ground-truth $I$:

$$L_{\ell_1} = \mathbb{E}\left(\lVert I_{pred} - I \rVert_1\right).$$

To generate more realistic details, an extra adversarial loss is also involved in the training process, specifically,

$$L_{adv} = \mathbb{E}\left[\log\left(1 - D_{\omega}(I_{pred})\right)\right] + \mathbb{E}\left[\log D_{\omega}(I)\right],$$

where $D$ is the discriminator parameterized by $\omega$. We jointly train the upsampling network $\mathcal{F}$ and the discriminator $D$ by solving the following optimization,

$$\min_{\mathcal{F}} \max_{D} \; L_{upsample}(\delta, \omega) = \alpha_1 L_{\ell_1} + \alpha_2 L_{adv}.$$

The loss weights $\alpha_1$ and $\alpha_2$ are set to fixed values in all experiments. We also observe that involving instance normalization (IN) [29] causes color inconsistency and severe artifacts during optimization. Therefore we remove all IN layers in the upsampling network.
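To make the input of the guided upsampling network concrete, the sketch below bilinearly upsamples a low-resolution prior and concatenates it with the masked image along the channel axis. It only builds the network input; the CNN itself is omitted. The function names, the align-corners sampling convention and the toy sizes are assumptions for illustration.

```python
import numpy as np

def bilinear_upsample(img, out_h, out_w):
    """Bilinear resize of an (h, w, c) array using align-corners sampling."""
    h, w, _ = img.shape
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]          # vertical interpolation weights
    wx = (xs - x0)[None, :, None]          # horizontal interpolation weights
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bottom = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

def upsampler_input(prior, masked_image):
    """Concatenate the upsampled prior with the masked image (6 channels)."""
    H, W, _ = masked_image.shape
    return np.concatenate([bilinear_upsample(prior, H, W), masked_image], axis=-1)

rng = np.random.default_rng(0)
prior = rng.uniform(size=(4, 4, 3))            # toy low-resolution appearance prior
masked_image = rng.uniform(size=(16, 16, 3))   # toy high-resolution masked input
x = upsampler_input(prior, masked_image)       # shape (16, 16, 6)
```

Feeding both tensors lets the network copy exact textures from visible pixels while taking structure in the hole from the prior.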

Figure 5: Qualitative comparison with state-of-the-art methods on ImageNet dataset. More qualitative examples are shown in supplementary materials.

4 Experiments

We present the implementation details in Sec. 4.1, then evaluate (Sec. 4.2) and analyze (Sec. 4.3) the proposed transformer-based image completion method. The pluralistic image completion experiments are conducted at a fixed resolution on three datasets: FFHQ [16], Places2 [40] and ImageNet [27]. We hold out 1K images from FFHQ for testing, and use the original training and test splits for the other datasets. The diversified irregular mask dataset provided by PConv [19] is employed for both training and evaluation.

4.1 Implementation Details

We control the scale of the transformer architecture by balancing the representation ability against the size of the dataset. The discrete sequence length is set separately for each dataset: limited by computational resources, we use a shorter sequence on the large-scale datasets Places2 and ImageNet than on FFHQ. The detailed configurations of the different transformer models are given in the supplementary material.

We train the transformers until convergence on Tesla V100 GPUs, with batch size 64 for FFHQ and batch size 256 for Places2 and ImageNet. We optimize the network parameters using AdamW [22]. The learning rate is warmed up linearly from 0 in the first epoch, then decays to 0 via a cosine scheduler over the remaining iterations. No extra weight decay or dropout is employed in the model. To train the guided upsampling network we use the Adam [17] optimizer with a fixed learning rate. During optimization, the weights of the different loss terms are set empirically to the fixed values described in Sec. 3.2.
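The warm-up and cosine-decay schedule for the transformer can be written as a small function of the training step. This is a sketch, not the authors' code: the function name and all arguments, including the `peak_lr` value used in the example, are placeholders.

```python
import math

def learning_rate(step, steps_per_epoch, total_steps, peak_lr):
    """Linear warm-up over the first epoch, then cosine decay to zero."""
    if step < steps_per_epoch:
        return peak_lr * step / steps_per_epoch
    remaining = max(1, total_steps - steps_per_epoch)
    progress = min(1.0, (step - steps_per_epoch) / remaining)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# Example with placeholder numbers: 100 steps per epoch, 1000 steps in total.
schedule = [learning_rate(s, 100, 1000, peak_lr=1.0) for s in range(1001)]
```

The schedule is continuous at the end of the first epoch, where both branches evaluate to `peak_lr`.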

4.2 Results

Our method is compared against the following state-of-the-art inpainting algorithms: EC [23], DeepFillv2 (DFv2) [34], MED [20] and PIC [39], using the official pre-trained models. We also fully train these models on the FFHQ dataset for fair comparison.

Qualitative Comparisons    We qualitatively compare the results with other baselines in this section. We adopt the sampling strategy introduced in Sec. 3.1 with top-$\mathcal{K}$ sampling ($\mathcal{K}=50$) to generate 20 solutions in parallel, then select the top 6 results ranked by the discriminator score of the upsampling network, following PIC [39]. All reported results of our method are direct outputs from the trained models without extra post-processing steps.

We show the results on the FFHQ and Places2 datasets in Figure 4. Specifically, EC [23] and MED [20] can generally produce the basic components of missing regions, but the absence of texture details makes their results non-photorealistic. DeepFillv2 [34], which is based on a multi-stage restoration framework, can generate sharp details. However, severe artifacts appear when mask regions are relatively large. Besides, these methods can only produce a single solution for each input. As the state-of-the-art diversified image inpainting method, PIC [39] tends to generate over-smooth contents and strange patterns, while its semantically reasonable variation is limited to a small range. Compared to these methods, ours is superior in both photo-realism and diversity. We further show the comparison with DeepFillv1 [35] and PIC [39] on the ImageNet dataset in Figure 5. In this challenging setting, fully CNN-based methods cannot understand the global context well, resulting in unreasonable completions. When facing large masks, they cannot even maintain an accurate structure, as shown in the second row of Figure 5. In comparison, our method gives superior results, which demonstrates its exceptional generalization ability on large-scale datasets.

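| Mask ratio | Method | FFHQ [16] PSNR | FFHQ SSIM | FFHQ MAE | FFHQ FID | Places2 [40] PSNR | Places2 SSIM | Places2 MAE | Places2 FID |
|---|---|---|---|---|---|---|---|---|---|
| 20%-40% | DFv2 [34] | 25.868 | 0.922 | 0.0231 | 16.278 | 26.533 | 0.881 | 0.0215 | 24.763 |
| 20%-40% | EC [23] | 26.901 | 0.938 | 0.0209 | 14.276 | 26.520 | 0.880 | 0.0220 | 25.642 |
| 20%-40% | PIC [39] | 26.781 | 0.933 | 0.0215 | 14.513 | 26.099 | 0.865 | 0.0236 | 26.393 |
| 20%-40% | MED [20] | 26.325 | 0.922 | 0.0230 | 14.791 | 26.469 | 0.877 | 0.0224 | 26.977 |
| 20%-40% | Ours | 27.922 | 0.948 | 0.0208 | 10.995 | 26.503 | 0.880 | 0.0244 | 21.598 |
| 20%-40% | Ours (top-1) | 28.242 | 0.952 | 0.0155 | 10.515 | 26.712 | 0.884 | 0.0198 | 20.431 |
| 40%-60% | DFv2 [34] | 21.108 | 0.802 | 0.0510 | 28.711 | 22.192 | 0.729 | 0.0440 | 39.017 |
| 40%-60% | EC [23] | 21.368 | 0.780 | 0.0510 | 30.499 | 22.225 | 0.731 | 0.0438 | 39.271 |
| 40%-60% | PIC [39] | 21.723 | 0.811 | 0.0488 | 25.031 | 21.498 | 0.680 | 0.0507 | 49.093 |
| 40%-60% | MED [20] | 20.765 | 0.763 | 0.0592 | 34.148 | 22.271 | 0.717 | 0.0457 | 45.455 |
| 40%-60% | Ours | 22.613 | 0.845 | 0.0445 | 20.024 | 22.215 | 0.724 | 0.0431 | 33.853 |
| 40%-60% | Ours (top-1) | 23.076 | 0.864 | 0.0371 | 20.843 | 22.635 | 0.739 | 0.0401 | 34.206 |
| Random | DFv2 [34] | 24.962 | 0.882 | 0.0310 | 19.506 | 25.692 | 0.834 | 0.0280 | 29.981 |
| Random | EC [23] | 25.908 | 0.882 | 0.0301 | 17.039 | 25.510 | 0.831 | 0.0293 | 30.130 |
| Random | PIC [39] | 25.580 | 0.889 | 0.0303 | 17.364 | 25.035 | 0.806 | 0.0315 | 33.472 |
| Random | MED [20] | 25.118 | 0.867 | 0.0349 | 19.644 | 25.632 | 0.827 | 0.0291 | 31.395 |
| Random | Ours | 26.681 | 0.910 | 0.0292 | 14.529 | 25.788 | 0.832 | 0.0267 | 25.420 |
| Random | Ours (top-1) | 27.157 | 0.922 | 0.0223 | 14.039 | 25.982 | 0.839 | 0.0254 | 25.985 |

Table 1: Quantitative results on FFHQ and Places2 datasets with different mask ratios. Ours: default top-50 sampling. Ours (top-1): top-1 sampling.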
| Mask ratio | Method | PSNR | SSIM | MAE | FID |
|---|---|---|---|---|---|
| 20%-40% | PIC [39] | 24.010 | 0.867 | 0.0319 | 47.750 |
| 20%-40% | Ours | 24.757 | 0.888 | 0.0263 | 28.818 |
| 40%-60% | PIC [39] | 18.843 | 0.642 | 0.0756 | 101.278 |
| 40%-60% | Ours | 20.135 | 0.721 | 0.0585 | 59.486 |
| Random | PIC [39] | 22.711 | 0.791 | 0.0462 | 59.428 |
| Random | Ours | 23.775 | 0.835 | 0.0358 | 35.842 |

Table 2: Quantitative comparison with PIC on ImageNet dataset.

Quantitative Comparisons    We numerically compare our method with other baselines in Table 1 and Table 2. The peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), and relative $\ell_1$ error (MAE) are used to compare the low-level differences between the recovered output and the ground truth, which is more suitable for measuring small mask ratios. To evaluate larger missing areas, we adopt Fréchet Inception Distance (FID) [13], which calculates the feature distribution distance between completion results and natural images. Since our method can produce multiple solutions, we need to pick one exemplar to calculate the mentioned metrics. Unlike PIC [39], which selects the results with high-ranking discriminator scores for each sample, here we directly provide stochastic sampling results with top-$\mathcal{K}$ sampling ($\mathcal{K}=50$) to demonstrate its generalization capability. Besides, we also provide deterministic sampling results with $\mathcal{K}=1$ in Table 1. It can be seen that our method with top-1 sampling achieves superior results compared with other competitors in almost all metrics. In the case of relatively large mask regions, top-50 sampling leads to slightly better FID scores. On the ImageNet dataset, as shown in Table 2, our method outperforms PIC by a considerable margin, especially in FID (more than 41.2 for large masks).
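For reference, the two simplest low-level metrics can be computed as below. This is a generic sketch that assumes images scaled to [0, 1]; the exact normalisation used for the reported numbers is not specified here, and the function names are illustrative.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images in [0, max_val]."""
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))

def mae(pred, target):
    """Mean absolute error between two images of the same shape."""
    return float(np.mean(np.abs(pred - target)))

target = np.zeros((4, 4, 3))
pred = np.full((4, 4, 3), 0.1)
example_psnr = psnr(pred, target)   # about 20 dB for a constant error of 0.1
example_mae = mae(pred, target)     # about 0.1
```

Both metrics compare a completion against a single ground truth pixel by pixel, so they penalise plausible but different completions; this is why the text relies on FID for large missing areas.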

Figure 6: Results of user study.

User Study    To better evaluate the subjective quality, we further conduct a user study to compare our method with other baselines. Specifically, we randomly select 30 masked images from the test set. For a test image, we use each method to generate one completion result and ask the participant to rank five results from the highest photorealism to the lowest photorealism. We have collected answers from 28 participants and calculate the ratios of each method being selected as top 1,2,3, with the statistics shown in Figure. 6. Our method is 73.70% more likely to be picked as the first rank, demonstrating its clear advantage.

Figure 9: Left: Diversity curve. Right: FID curve. P and F denote Places2 and FFHQ dataset respectively.
Figure 10: Image completion results of large-scale masks. It could be noted that all the compared baselines struggle to generate plausible contents.

4.3 Analysis

Diversity.    Following Zhu et al. [41], we calculate the average LPIPS distance [36] between pairs of randomly-sampled outputs from the same input to measure the completion diversity. Specifically, we leverage 1K input images and sample 5 output pairs per input at different mask ratios. The LPIPS is computed based on the deep features of a VGG [28] model pre-trained on ImageNet. The diversity scores are shown in Figure 9. Since completion with diverse but meaningless contents also leads to high LPIPS scores, we simultaneously provide the FID score of each level, computed between all sampled results (10K) and natural images, in the right part of Figure 9. Our method achieves better diversity in all cases. Besides, at the maximum mask ratio on Places2, although PIC [39] approaches the diversity of our method, the perceptual quality of our completions outperforms PIC [39] by a large margin.
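The diversity protocol amounts to averaging a perceptual distance over randomly drawn output pairs. The sketch below shows only that averaging logic; `mean_abs_distance` is a plain pixel distance standing in for LPIPS, and all names are assumptions made for illustration.

```python
import numpy as np

def diversity_score(samples_per_input, distance, pairs_per_input=5, seed=0):
    """Average pairwise distance between outputs sampled for the same input."""
    rng = np.random.default_rng(seed)
    scores = []
    for samples in samples_per_input:
        for _ in range(pairs_per_input):
            i, j = rng.choice(len(samples), size=2, replace=False)
            scores.append(distance(samples[i], samples[j]))
    return float(np.mean(scores))

def mean_abs_distance(a, b):
    """Stand-in metric; the paper uses LPIPS on VGG features instead."""
    return float(np.mean(np.abs(a - b)))

identical = [[np.zeros((4, 4, 3))] * 3]                      # one input, no diversity
opposite = [[np.zeros((4, 4, 3)), np.ones((4, 4, 3))]]       # one input, two extremes
low = diversity_score(identical, mean_abs_distance)
high = diversity_score(opposite, mean_abs_distance)
```

A high score alone does not imply useful diversity, which is why the text pairs it with FID.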

Figure 11: Visualization of probability map generated by transformer. Higher confidence denotes lower uncertainty.

Robustness for the completion of extremely large holes.    To further understand the ability of the transformer, we conduct extra experiments on the setting of extremely large holes, which means only very limited pixels are visible. Although both the transformer and upsampling network are only trained using a dataset of PConv [19] (max mask ratio 60%), our method generalizes fairly well to this difficult setting. In Figure. 10, almost all the baselines fail with large missing regions, while our method could produce high-quality and diversified completion results.

Can the transformer understand global structure better than a CNN?    To answer this question, we conduct completion experiments on some geometric primitives. Specifically, we ask our transformer-based method and other fully CNN-based methods, DeepFillv1 [35] and PIC [39], all trained on ImageNet, to recover the missing parts of the pentagram shape in Figure 12. As expected, all fully CNN-based methods fail to reconstruct the missing shape, which may be caused by the locality of the convolution kernel. In contrast, the transformer can easily reconstruct the right geometry in the low-dimensional discrete space. Based on such an accurate appearance prior, the upsampling network can then render the original-resolution result more effectively.

Figure 12: Completion of basic geometric shape. All compared models are trained on ImageNet. : Appearance prior reconstructed from transformer.

Visualization of probability map.    Intuitively, since the contour of missing regions is contiguous to existing pixels, the completion confidence should gradually decrease from the mask boundary to the interior region. Lower confidence corresponds to more diversified results. To verify this point, we plot the probability map in Figure 11, where each pixel denotes the maximum probability over the visual vocabulary generated by the transformer. We make some interesting observations: 1) In the right part of Figure 11, the uncertainty indeed increases from outside to inside. 2) For the portrait completion example, the uncertainty of face regions is overall lower than that of hair parts. The underlying reason is that the seen parts of the face constrain the diversity of other regions to some degree. 3) The probability of the right cheek in the portrait example is the highest among the masked regions, which indicates that the transformer captures the symmetry of the face.
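A probability map of this kind can be obtained by taking, at every grid position, the maximum of the softmax distribution over the vocabulary. The sketch below assumes the transformer output is available as a logits array; the function name and toy sizes are illustrative.

```python
import numpy as np

def confidence_map(logits, size):
    """Per-position maximum softmax probability, reshaped to a size x size grid.

    logits: (size * size, V) unnormalised scores over the visual vocabulary
    """
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return probs.max(axis=-1).reshape(size, size)

logits = np.zeros((16, 512))      # fully uncertain everywhere
logits[0, 7] = 50.0               # except one very confident position
conf = confidence_map(logits, 4)
```

Values near `1 / V` indicate that the model is close to uniform over the vocabulary, while values near 1 indicate a position that is almost fully determined by its context.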

5 Concluding Remarks

There is a long-standing dilemma in the image completion area: achieving both sufficient diversity and photorealistic quality. Existing attempts mostly optimize the variational lower bound through a fully CNN-based architecture, which not only limits the generation quality but also makes it difficult to render natural variations. In this paper, we are the first to propose bringing the best of both worlds: the structural understanding capability and pluralism support of transformers, and the local texture enhancement and efficiency of CNNs, to achieve high-fidelity free-form pluralistic image completion. Extensive experiments demonstrate the superiority of our method compared with state-of-the-art fully convolutional approaches, including a large performance gain in the regular evaluation setting, more diversified and vivid results, and exceptional generalization ability on large-scale masks and datasets.


  • [1] J. L. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §3.1.
  • [2] C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman (2009) PatchMatch: a randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics (TOG) 28 (3), pp. 1–11. Cited by: §1, §2.
  • [3] M. Bertalmio, G. Sapiro, V. Caselles, and C. Ballester (2000) Image inpainting. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques, pp. 417–424. Cited by: §2.
  • [4] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko (2020) End-to-end object detection with transformers. arXiv preprint arXiv:2005.12872. Cited by: §2.
  • [5] M. Chen, A. Radford, R. Child, J. Wu, H. Jun, D. Luan, and I. Sutskever (2020) Generative pretraining from pixels. In International Conference on Machine Learning, pp. 1691–1703. Cited by: §1, §1, §2, §3.1.
  • [6] D. Cho, J. Park, T. Oh, Y. Tai, and I. So Kweon (2017) Weakly-and self-supervised learning for content-aware deep image retargeting. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4558–4567. Cited by: §1.
  • [7] S. Darabi, E. Shechtman, C. Barnes, D. B. Goldman, and P. Sen (2012) Image melding: combining inconsistent images using patch-based synthesis. ACM Transactions on graphics (TOG) 31 (4), pp. 1–10. Cited by: §2.
  • [8] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §3.1.
  • [9] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1, §3.1.
  • [10] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. Cited by: §2.
  • [11] A. A. Efros and W. T. Freeman (2001) Image quilting for texture synthesis and transfer. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pp. 341–346. Cited by: §2.
  • [12] J. Hays and A. A. Efros (2007) Scene completion using millions of photographs. ACM Transactions on Graphics (TOG) 26 (3), pp. 4–es. Cited by: §2.
  • [13] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637. Cited by: §4.2.
  • [14] S. Iizuka, E. Simo-Serra, and H. Ishikawa (2017) Globally and locally consistent image completion. ACM Transactions on Graphics (TOG) 36 (4). Cited by: §1, §2.
  • [15] Y. Jo and J. Park (2019) SC-fegan: face editing generative adversarial network with user’s sketch and color. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1745–1753. Cited by: §1.
  • [16] T. Karras, S. Laine, and T. Aila (2019) A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4401–4410. Cited by: Table 1, §4.
  • [17] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.1.
  • [18] Y. Li, S. Liu, J. Yang, and M. Yang (2017) Generative face completion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Cited by: §1.
  • [19] G. Liu, F. A. Reda, K. J. Shih, T. Wang, A. Tao, and B. Catanzaro (2018) Image inpainting for irregular holes using partial convolutions. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 85–100. Cited by: §2, §4.3, §4.
  • [20] H. Liu, B. Jiang, Y. Song, W. Huang, and C. Yang (2020) Rethinking image inpainting via a mutual encoder-decoder with feature equalizations. arXiv preprint arXiv:2007.06929. Cited by: §1, §4.2, §4.2, Table 1.
  • [21] H. Liu, B. Jiang, Y. Xiao, and C. Yang (2019) Coherent semantic attention for image inpainting. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4170–4179. Cited by: §1, §2.
  • [22] I. Loshchilov and F. Hutter (2017) Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101. Cited by: §4.1.
  • [23] K. Nazeri, E. Ng, T. Joseph, F. Z. Qureshi, and M. Ebrahimi (2019) Edgeconnect: generative image inpainting with adversarial edge learning. arXiv preprint arXiv:1901.00212. Cited by: §1, §2, §4.2, §4.2, Table 1.
  • [24] N. Parmar, A. Vaswani, J. Uszkoreit, L. Kaiser, N. Shazeer, A. Ku, and D. Tran (2018) Image transformer. In International Conference on Machine Learning, pp. 4055–4064. Cited by: §1, §2, §3.1.
  • [25] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. Efros (2016) Context encoders: feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Cited by: §1, §2.
  • [26] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. Cited by: §3.1.
  • [27] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) Imagenet large scale visual recognition challenge. International journal of computer vision 115 (3), pp. 211–252. Cited by: §4.
  • [28] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §4.3.
  • [29] D. Ulyanov, A. Vedaldi, and V. Lempitsky (2016) Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022. Cited by: §3.2.
  • [30] A. Van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al. (2016) Conditional image generation with pixelcnn decoders. Advances in neural information processing systems 29, pp. 4790–4798. Cited by: §1, §3.1.
  • [31] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §1, §2, §3.1.
  • [32] Z. Wan, B. Zhang, D. Chen, P. Zhang, D. Chen, J. Liao, and F. Wen (2020) Bringing old photos back to life. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2747–2757. Cited by: §1.
  • [33] Z. Wan, B. Zhang, D. Chen, P. Zhang, D. Chen, J. Liao, and F. Wen (2020) Old photo restoration via deep latent space translation. arXiv preprint arXiv:2009.07047. Cited by: §1.
  • [34] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang (2019) Free-form image inpainting with gated convolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4471–4480. Cited by: §1, §4.2, §4.2, Table 1.
  • [35] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang (2018) Generative image inpainting with contextual attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Cited by: §1, §2, §4.2, §4.3.
  • [36] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 586–595. Cited by: §4.3.
  • [37] L. Zhao, Q. Mo, S. Lin, Z. Wang, Z. Zuo, H. Chen, W. Xing, and D. Lu (2020) UCTGAN: diverse image inpainting based on unsupervised cross-space translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5741–5750. Cited by: §1, §2.
  • [38] S. Zhao, J. Song, and S. Ermon (2017) Towards deeper understanding of variational autoencoding models. arXiv preprint arXiv:1702.08658. Cited by: §1.
  • [39] C. Zheng, T. Cham, and J. Cai (2019) Pluralistic image completion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1438–1447. Cited by: §1, §1, §2, §4.2, §4.2, §4.2, §4.2, §4.3, §4.3, Table 1, Table 2.
  • [40] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba (2017) Places: a 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: Table 1, §4.
  • [41] J. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman (2017) Toward multimodal image-to-image translation. arXiv preprint arXiv:1711.11586. Cited by: §4.3.