High-Fidelity Pluralistic Image Completion with Transformers
Image completion has made tremendous progress with convolutional neural networks (CNNs), thanks to their powerful texture modeling capacity. However, due to some inherent properties (e.g., local inductive prior, spatially invariant kernels), CNNs do not perform well in understanding global structures or naturally supporting pluralistic completion. Recently, transformers have demonstrated their power in modeling long-range relationships and generating diverse results, but their computational complexity is quadratic in the input length, which hampers their application to high-resolution images. This paper brings the best of both worlds to pluralistic image completion: appearance prior reconstruction with a transformer and texture replenishment with a CNN. The transformer recovers pluralistic coherent structures together with some coarse textures, while the CNN enhances the local texture details of the coarse priors guided by the high-resolution masked images. The proposed method vastly outperforms state-of-the-art methods in three aspects: 1) a large performance boost on image fidelity even compared to deterministic completion methods; 2) better diversity and higher fidelity for pluralistic completion; 3) exceptional generalization ability on large masks and generic datasets such as ImageNet.
Image completion (a.k.a. image inpainting), which aims to fill the missing parts of images with visually realistic and semantically appropriate content, has been a longstanding and critical problem in computer vision. It is widely used in a broad range of applications, such as object removal, photo restoration [32, 33], image manipulation, and image re-targeting. To solve this challenging task, traditional methods like PatchMatch usually search for similar patches within the image and paste them into the missing regions, but they require appropriate information, e.g., similar structures or patches, to be contained in the input image, which is often difficult to satisfy.
In recent years, CNN-based solutions [14, 18, 25, 35, 21] started to dominate this field. By training on large-scale datasets in a self-supervised way, CNNs have shown their strength in learning rich texture patterns and filling the missing regions with such learned patterns. Besides, CNN models are computationally efficient thanks to the sparse connectivity of convolutions. Nonetheless, they share some inherent limitations: 1) the local inductive prior of the convolution operation makes modeling the global structures of an image difficult; 2) CNN filters are spatially invariant, i.e., the same convolution kernel operates on the features at all positions, which is why duplicated patterns or blurry artifacts frequently appear in the masked regions. On top of that, CNN models are inherently deterministic. To achieve diverse completion outputs, recent frameworks [39, 37] rely on optimizing a variational lower bound of the instance likelihood. However, the extra distributional assumption inevitably hurts the quality of the generated content.
The transformer, a well-explored architecture in language tasks, is on the rise in many computer vision tasks. Compared to CNN models, it abandons the baked-in local inductive prior and is designed to support long-range interactions via dense attention modules. Some preliminary works also demonstrate its capacity for modeling the structural relationships needed for natural image synthesis. Another advantage of using a transformer for synthesis is that it naturally supports pluralistic outputs by directly optimizing the underlying data distribution. However, the transformer also has its own deficiencies. Because its computational complexity grows quadratically with input length, it struggles with high-resolution image synthesis or processing. Besides, most existing transformer-based generative models [24, 5] work in an auto-regressive manner, i.e., they synthesize pixels in a fixed order such as raster-scan order, which hampers their application to image completion, where the missing regions often have arbitrary shapes and sizes.
In this paper, we propose a new high-fidelity pluralistic image completion method by bringing together the best of both worlds: the global structural understanding and pluralism support of transformers, and the local texture refinement ability and efficiency of CNNs. To achieve this, we decouple image completion into two steps: pluralistic appearance prior reconstruction with a transformer to recover coherent image structures, and low-resolution upsampling with a CNN to replenish fine textures. Specifically, given an input image with missing regions, we first leverage the transformer to sample low-resolution completion results, i.e., appearance priors. Then, guided by the appearance priors and the available pixels of the input image, another upsampling CNN model renders high-fidelity textures for the missing regions while ensuring coherence with neighboring pixels. In particular, unlike previous auto-regressive methods [5, 30], to make the transformer capable of completing the missing regions by considering all the available context, we optimize the log-likelihood objective of the missing pixels conditioned on bi-directional context, inspired by masked language models such as BERT.
To demonstrate the superiority, we compare our method with state-of-the-art deterministic [23, 20, 34] and pluralistic image completion approaches on multiple datasets. Our method makes significant progress in three aspects: 1) Compared with previous deterministic completion methods, our method establishes a new state of the art and outperforms them on a variety of metrics by a large margin; 2) Compared with previous pluralistic completion methods, our method further enhances the diversity of the results while achieving higher completion fidelity; 3) Thanks to the strong structure modeling capacity of transformers, our method generalizes much better when completing extremely large missing regions and on large-scale generic datasets (e.g., ImageNet), as shown in Figure 1. Remarkably, the FID score on ImageNet is improved by up to 41.2 compared with the state-of-the-art method PIC.
Visual Transformers Vaswani et al. first proposed transformers for machine translation, and their success has subsequently been proved in various downstream natural language processing (NLP) tasks. The overall architecture of the transformer is composed of stacked self-attention and point-wise feed-forward layers for the encoder and decoder. Since the attention mechanism models the dense relationships among the elements of the input sequence well, transformers are now gradually becoming popular in computer vision. For example, DETR employs a transformer as the backbone to solve the object detection problem. Dosovitskiy et al. propose ViT, which first utilized transformers for image recognition and achieved excellent results compared with CNN-based methods. Besides, Parmar et al. and Chen et al. leverage the transformer to model images. Nonetheless, to generate an image, these methods rely on a fixed permutation order, which is not suitable for completing missing areas with varying shapes.
Deterministic Image Completion Traditional image completion methods, like diffusion-based [3, 11] and patch-based [2, 12, 7] ones, rely on strong low-level assumptions, which may be violated when facing large-area masks. To generate semantically coherent content, many CNN-based methods [25, 19, 21, 35, 23] have recently been proposed. Most of them share a similar encoder-decoder architecture. Specifically, Pathak et al. brought adversarial training into inpainting and achieved semantic hole-filling. Iizuka et al. improve the performance of CE by introducing a local-global discriminator. Yu et al. propose a contextual attention module to capture long-range correlations. Liu et al. design a new operator named partial-conv to alleviate the negative influence of the masked regions on convolutions. These methods can generate reasonable content for masked regions but lack the ability to generate diversified results.
Pluralistic Image Completion To obtain a diverse set of results for each masked input, Zheng et al. propose a dual-pipeline framework, which couples the estimated distribution of the reconstructive path and the conditional prior of the generative path by jointly maximizing a lower bound. Similarly, UCTGAN projects both the masked input and a reference image into a common space by optimizing the KL-divergence between the encoded features and a prior distribution to achieve diversified sampling. Although these methods achieve a certain degree of diversity, their completion quality is limited by the variational training. Unlike them, we directly optimize the log-likelihood in a discrete space via transformers, without auxiliary assumptions.
Image completion aims to transform an input image I_m with missing pixels into a complete image I. This task is inherently stochastic: given the masked image I_m, there exists a conditional distribution p(I | I_m) over plausible completions. We decompose the completion procedure into two steps, appearance prior reconstruction and texture detail replenishment. Since obtaining the coarse prior X given I and I_m is deterministic, p(I | I_m) can be rewritten as

p(I | I_m) = p(I | X, I_m) · p(X | I_m).
Instead of directly sampling from p(I | I_m), we first use a transformer to model the underlying distribution of appearance priors given the masked input, denoted as p(X | I_m) (described in Sec. 3.1). These reconstructed appearance priors contain ample cues about global structure and coarse textures, thanks to the transformer's strong representation ability. Subsequently, we employ another CNN to replenish texture details under the guidance of the appearance priors and the unmasked pixels, denoted as p(I | X, I_m) (described in Sec. 3.2). The overall pipeline is shown in Figure 2.
Discretization Considering the quadratically increasing computational cost of multi-head attention in the transformer architecture, we represent the appearance prior of a natural image with its corresponding low-resolution version, which contains only structural information and coarse textures. Nonetheless, the dimensionality of the raw RGB pixel representation (256³ values per pixel) is still too large. To further reduce the dimensionality while faithfully re-representing the low-resolution image, an extra visual vocabulary of 512 elements is generated using K-Means cluster centers of the whole ImageNet RGB pixel space. Then, for each pixel of the appearance prior, we search for the index of the nearest element in the visual vocabulary to obtain its discrete representation. In addition, the elements of the representation sequence corresponding to hole regions are replaced with a special token [MASK], which is also the learning target of the transformer. In this way, we convert the appearance prior into a discretized sequence.
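The discretization step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the vocabulary size and the [MASK] sentinel value are the only details taken from the text, and the toy K-Means here clusters a given pixel sample rather than the full ImageNet pixel space.

```python
import numpy as np

def build_vocabulary(pixels, k=512, iters=10, seed=0):
    """Toy K-Means over RGB pixels; the paper clusters the whole
    ImageNet pixel space, approximated here by the given sample."""
    rng = np.random.default_rng(seed)
    centers = pixels[rng.choice(len(pixels), k, replace=False)].astype(float)
    for _ in range(iters):
        # assign each pixel to its nearest center, then recompute means
        d = ((pixels[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            sel = pixels[labels == j]
            if len(sel):
                centers[j] = sel.mean(0)
    return centers

def discretize(low_res_img, centers, mask, mask_token=-1):
    """Map each pixel to the index of its nearest vocabulary element;
    hole positions become the special [MASK] token."""
    flat = low_res_img.reshape(-1, 3).astype(float)
    d = ((flat[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    seq = d.argmin(1)
    seq[mask.reshape(-1)] = mask_token
    return seq
```

Querying `centers[seq]` for the non-masked indices inverts the mapping up to quantization error, which is how the appearance prior is later reconstructed from a sampled sequence.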
Transformer For each token of the discretized sequence X = (x_1, …, x_L), where L is the length of X, we project it into a d-dimensional feature vector through a learnable embedding. To encode spatial information, extra learnable position embeddings are added to the token features at every location to form the final input of the transformer model.
Following GPT-2, we use a decoder-only transformer as our network architecture, which is mainly composed of self-attention-based transformer layers. At each transformer layer l, the calculation can be formulated as

x̂^l = x^{l−1} + MSA(LN(x^{l−1})),
x^l = x̂^l + MLP(LN(x̂^l)),

where LN, MSA, and MLP denote layer normalization, multi-head self-attention, and a point-wise feed-forward (FC) layer respectively. More specifically, given input x, the MSA can be computed as

MSA(x) = Concat(head_1, …, head_h) W_O, with head_i = softmax((x W_i^Q)(x W_i^K)^T / √(d/h)) (x W_i^V),

where h is the number of heads; W_i^Q, W_i^K, and W_i^V are three learnable linear projections for head i; and W_O is a learnable FC layer that fuses the concatenation of the outputs from the different heads. By adjusting the number of transformer layers, the embedding dimension d, and the head number h, we can easily scale the size of the transformer. It should also be noted that, unlike auto-regressive transformers [5, 24], which generate elements via single-directional attention constrained only by the context before the scanning position, we let each token attend to all positions to achieve bi-directional attention, as shown in Figure 3. This ensures that the generated distribution captures all available context, both before and after a mask position in raster-scan order, leading to consistency between the generated content and the unmasked regions.
The output of the final transformer layer is further projected to a per-element distribution over the 512 elements of the visual vocabulary with fully connected layers and a softmax function. We adopt a masked language model (MLM) objective similar to the one used in BERT to optimize the transformer. Specifically, let Π = {π_1, …, π_K} denote the indexes of [MASK] tokens in the discretized input, where K is the number of masked tokens. Let X_Π denote the set of [MASK] tokens in X, and X_{−Π} the set of unmasked tokens. The MLM objective minimizes the negative log-likelihood of X_Π conditioned on all observed regions:

L_MLM = E_X [ −(1/K) Σ_{k=1}^{K} log p(x_{π_k} | X_{−Π}; θ) ],

where θ denotes the parameters of the transformer. The MLM objective, combined with bi-directional attention, ensures that the transformer captures the whole contextual information when predicting the probability distribution of the missing regions.
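The key property of the objective above is that the loss is averaged only over [MASK] positions, not over the whole sequence. A minimal sketch, assuming `logits` of shape (L, V) from the final projection layer:

```python
import numpy as np

def mlm_loss(logits, targets, mask_positions):
    """Negative log-likelihood averaged over [MASK] positions only,
    mirroring the BERT-style objective described above."""
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    nll = -np.log(probs[mask_positions, targets[mask_positions]] + 1e-12)
    return nll.mean()
```

With uniform logits over a 512-element vocabulary, the loss is log 512 regardless of the targets, a convenient sanity check for an untrained model.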
Sampling Strategy In this section we introduce how to obtain reasonable and diversified appearance priors using the trained transformer. Given the distribution generated by the transformer, directly sampling the entire set of masked positions at once does not produce good results, because it treats the positions as independent. Instead, we employ Gibbs sampling to iteratively sample tokens at different locations. Specifically, in each iteration, a token is sampled at one grid position from the top-K predicted elements, conditioned on the previously generated tokens. The corresponding [MASK] token is then replaced with the sampled one, and the process is repeated until all positions are updated. Similar to PixelCNN, the positions are chosen sequentially in raster-scan order by default. After sampling, we obtain a set of complete token sequences. For each complete discrete sequence sampled from the transformer, we reconstruct its appearance prior by querying the visual vocabulary.
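The iterative top-K filling loop can be sketched as below. The `predict_fn` interface is a hypothetical stand-in for the trained transformer (one forward pass returning per-position probabilities); the raster-scan order and top-K restriction follow the description above.

```python
import numpy as np

def topk_gibbs_sample(seq, predict_fn, mask_token=-1, k=50, seed=0):
    """Fill [MASK] positions one by one in raster-scan order.
    `predict_fn(seq) -> (L, V) probabilities` stands in for the
    trained transformer (a hypothetical interface)."""
    rng = np.random.default_rng(seed)
    seq = seq.copy()
    for pos in np.flatnonzero(seq == mask_token):
        p = predict_fn(seq)[pos]
        top = np.argsort(p)[::-1][:k]  # restrict to the top-K tokens
        q = p[top] / p[top].sum()      # renormalize over the shortlist
        seq[pos] = rng.choice(top, p=q)
    return seq
```

Setting k=1 makes the procedure deterministic (greedy), matching the deterministic top-1 sampling variant reported in the quantitative comparisons; larger k trades fidelity for diversity.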
After reconstructing the low-dimensional appearance prior, we reshape the token sequence into a low-resolution image X for subsequent processing. Since X already carries the diversity, the remaining problem is how to learn a deterministic mapping that re-scales X to the original resolution while preserving boundary consistency between hole regions and unmasked regions. To this end, since CNNs are good at modeling texture patterns, we introduce a guided upsampling network that renders high-fidelity details of the reconstructed appearance prior under the guidance of the masked input I_m. The guided upsampling can be written as

I_pred = G_φ([up(X), I_m]),

where up(X) is the result of bilinear interpolation of X and [·, ·] denotes concatenation along the channel dimension. G_φ is the backbone of the upsampling network parameterized by φ, which is mainly composed of an encoder, a decoder, and several residual blocks. More details about the architecture can be found in the supplementary material.
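The input construction for the guided upsampling network, bilinearly upsample the prior and concatenate it with the high-resolution masked image along channels, can be sketched as follows. The interpolation routine is a minimal half-pixel-aligned implementation written for clarity, not the one the paper's framework uses.

```python
import numpy as np

def bilinear_upsample(x, size):
    """Minimal bilinear interpolation for an (h, w, c) array."""
    h, w, c = x.shape
    H, W = size
    ys = (np.arange(H) + 0.5) * h / H - 0.5
    xs = (np.arange(W) + 0.5) * w / W - 0.5
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 1)
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 1)
    y1 = np.clip(y0 + 1, 0, h - 1)
    x1 = np.clip(x0 + 1, 0, w - 1)
    wy = np.clip(ys - y0, 0, 1)[:, None, None]
    wx = np.clip(xs - x0, 0, 1)[None, :, None]
    top = x[y0][:, x0] * (1 - wx) + x[y0][:, x1] * wx
    bot = x[y1][:, x0] * (1 - wx) + x[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def upsampler_input(prior, masked_img):
    """Concatenate the bilinearly upsampled prior with the
    high-resolution masked input along the channel axis."""
    up = bilinear_upsample(prior, masked_img.shape[:2])
    return np.concatenate([up, masked_img], axis=-1)
```

The concatenated tensor is what the network G_φ would consume: the prior channels carry structure and coarse color, while the masked-image channels carry the sharp details of the known pixels.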
We optimize this guided upsampling network by minimizing the ℓ1 loss between I_pred and the corresponding ground truth I:

L_1 = ‖I_pred − I‖_1.
To generate more realistic details, an extra adversarial loss is also involved in the training process; specifically,

L_adv = E[log D_ψ(I)] + E[log(1 − D_ψ(I_pred))],

where D_ψ is the discriminator parameterized by ψ. We jointly train the upsampling network and the discriminator by solving

min_φ max_ψ λ_1 L_1 + λ_adv L_adv.

The loss weights λ_1 and λ_adv are set to fixed values in all experiments. We also observe that using instance normalization (IN) causes color inconsistency and severe artifacts during optimization, so we remove all IN layers from the upsampling network.
We present the implementation details in Sec. 4.1, then evaluate (Sec. 4.2) and analyze (Sec. 4.3) the proposed transformer-based image completion method. The pluralistic image completion experiments are conducted on three datasets: FFHQ, Places2, and ImageNet. We hold out 1K images from FFHQ for testing, and use the common training and test splits for the other datasets. The diversified irregular mask dataset provided by PConv is employed for both training and evaluation.
We control the scale of the transformer architecture by balancing representation ability against dataset size. The discrete sequence length is set for FFHQ; limited by computational resources, we decrease it to a feasible length on the large-scale datasets Places2 and ImageNet. The detailed configurations of the different transformer models are given in the supplementary material.
We train the transformers until convergence on Tesla V100 GPUs, with batch size 64 for FFHQ and batch size 256 for Places2 and ImageNet. We optimize the network parameters using AdamW. The learning rate is warmed up from 0 linearly in the first epoch, then decays to 0 via a cosine scheduler over the remaining iterations. No extra weight decay or dropout is employed in the model. The guided upsampling network is trained with the Adam optimizer at a fixed learning rate. During optimization, the weights of the different loss terms are set to the fixed values described in Sec. 3.2.
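The warmup-then-cosine schedule described above can be written as a small function. The peak learning rate is left as a parameter since the exact base rate is not specified here; this is a generic sketch of the schedule shape, not the training code.

```python
import math

def lr_schedule(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup from 0 to peak_lr over warmup_steps,
    then cosine decay to 0 over the remaining steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * t))
```

For example, the rate is 0 at step 0, exactly peak_lr at the end of warmup, and decays back to 0 by the final step.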
Our method is compared against the following state-of-the-art inpainting algorithms: EC, MED, DeepFillv2 (DFv2), and PIC, using the official pre-trained models. We also fully train these models on the FFHQ dataset for a fair comparison.
Qualitative Comparisons In this section we qualitatively compare the results with the other baselines. We adopt the sampling strategy introduced in Sec. 3.1 with K=50 to generate 20 solutions in parallel, then select the top 6 results ranked by the discriminator score of the upsampling network, following PIC. All reported results of our method are direct outputs of the trained models without extra post-processing.
We show the results on the FFHQ and Places2 datasets in Figure 4. Specifically, EC and MED can generally produce the basic components of the missing regions, but the absence of texture details makes their results non-photorealistic. DeepFillv2, which is based on a multi-stage restoration framework, can generate sharp details; however, severe artifacts appear when the masked regions are relatively large, and the method produces only a single solution for each input. PIC, the state-of-the-art diversified image inpainting method, tends to generate over-smooth content and strange patterns, and its semantically reasonable variation is limited to a small range. Compared to these methods, ours is superior in both photo-realism and diversity. We further show comparisons with PIC on the ImageNet dataset in Figure 5. In this challenging setting, purely CNN-based methods cannot understand the global context well, resulting in unreasonable completions. With large masks, they cannot even maintain accurate structure, as shown in the second row of Figure 5. In comparison, our method gives superior results, demonstrating its exceptional generalization ability on large-scale datasets.
[Table 1: quantitative results on the FFHQ and Places2 datasets.]
Quantitative Comparisons The peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), and relative ℓ1 error (MAE) are used to compare low-level differences between the recovered output and the ground truth, and are most suitable for small mask ratios. To evaluate larger missing areas, we adopt the Fréchet Inception Distance (FID), which measures the feature distribution distance between completion results and natural images. Since our method produces multiple solutions, we need one exemplar to compute these metrics. Unlike PIC, which selects the result with the highest discriminator score for each sample, we directly report stochastic sampling results with K=50 to demonstrate generalization capability. Besides, we also provide deterministic sampling results with K=1 in Table 1. Our method with top-1 sampling achieves superior results compared with the other competitors on almost all metrics, and for relatively large mask regions, top-50 sampling leads to slightly better FID scores. On the ImageNet dataset, as shown in Table 2, our method outperforms PIC by a considerable margin, especially on the FID metric (more than 41.2 for large masks).
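For reference, the two simplest of these metrics can be computed as below. This is a generic sketch of PSNR and relative ℓ1 error on raw pixel arrays; SSIM and FID require reference implementations and are omitted.

```python
import numpy as np

def psnr(pred, gt, max_val=255.0):
    """Peak signal-to-noise ratio in dB; infinite for a perfect match."""
    mse = ((pred.astype(float) - gt.astype(float)) ** 2).mean()
    return float("inf") if mse == 0 else 10 * np.log10(max_val ** 2 / mse)

def relative_l1(pred, gt):
    """Relative mean absolute error between completion and ground truth."""
    diff = np.abs(pred.astype(float) - gt.astype(float)).mean()
    return diff / np.abs(gt.astype(float)).mean()
```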
User Study To better evaluate subjective quality, we further conduct a user study comparing our method with the other baselines. Specifically, we randomly select 30 masked images from the test set. For each test image, we use each method to generate one completion result and ask the participant to rank the five results from highest to lowest photorealism. We collected answers from 28 participants and calculated the ratio of each method being ranked in the top 1, 2, and 3, with the statistics shown in Figure 6. Our method is 73.70% more likely to be picked as the first rank, demonstrating its clear advantage.
Diversity We measure completion diversity with the LPIPS distance. Specifically, we leverage 1K input images and sample 5 output pairs per input at different mask ratios. The LPIPS is computed on the deep features of a VGG model pre-trained on ImageNet. The diversity scores are shown in Figure 9. Since completions with diverse but meaningless content would also lead to high LPIPS scores, we simultaneously report the FID score at each mask level, computed between all sampled results (10K) and natural images, in the right part of Figure 9. Our method achieves better diversity in all cases. Besides, at the maximum mask ratio on Places2, although PIC approaches the diversity of our method, the perceptual quality of our completions outperforms PIC by a large margin.
Robustness to extremely large holes. To further understand the ability of the transformer, we conduct extra experiments in the setting of extremely large holes, where only very limited pixels are visible. Although both the transformer and the upsampling network are trained only with the mask dataset of PConv (max mask ratio 60%), our method generalizes fairly well to this difficult setting. In Figure 10, almost all the baselines fail with large missing regions, while our method produces high-quality and diversified completion results.
Can the transformer understand global structure better than CNNs? To answer this question, we conduct completion experiments on some geometric primitives. Specifically, we ask our transformer-based method and two purely CNN-based methods, DeepFillv1 and PIC, all trained on ImageNet, to recover the missing parts of a pentagram shape in Figure 12. As expected, both CNN-based methods fail to reconstruct the missing shape, which may be caused by the locality of the convolution kernel. In contrast, the transformer easily reconstructs the correct geometry in the low-dimensional discrete space. Based on such an accurate appearance prior, the upsampling network can then effectively render the final full-resolution result.
Visualization of the probability map. Intuitively, since the contour of a missing region is contiguous with existing pixels, the completion confidence should gradually decrease from the mask boundary toward the interior, and lower confidence corresponds to more diversified results. To verify this, we plot the probability map in Figure 11, where each pixel denotes the maximum probability over the visual vocabulary generated by the transformer. We make some interesting observations: 1) In the right part of Figure 11, the uncertainty indeed increases from outside to inside. 2) For the portrait completion example, the uncertainty of the face regions is overall lower than that of the hair parts; the underlying reason is that the visible parts of the face constrain the diversity of the other facial regions to some degree. 3) The probability on the right cheek of the portrait example is the highest among the masked regions, which indicates that the transformer captures the symmetry of faces.
There is a long-standing dilemma in image completion between achieving sufficient diversity and photorealistic quality. Existing attempts mostly optimize a variational lower bound through a fully convolutional architecture, which not only limits generation quality but also makes it difficult to render natural variations. In this paper, we propose to bring together the best of both worlds: the structural understanding capability and pluralism support of transformers, and the local texture enhancement and efficiency of CNNs, to achieve high-fidelity free-form pluralistic image completion. Extensive experiments demonstrate the superiority of our method over state-of-the-art fully convolutional approaches, including a large performance gain in regular evaluation settings, more diversified and vivid results, and exceptional generalization ability to large masks and large-scale datasets.
International Conference on Machine Learning, pp. 1691–1703.
Weakly- and self-supervised learning for content-aware deep image retargeting. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4558–4567.
2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
SC-FEGAN: face editing generative adversarial network with user's sketch and color. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1745–1753.
Towards a deeper understanding of variational autoencoding models. arXiv preprint arXiv:1702.08658.
Places: a 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Toward multimodal image-to-image translation. arXiv preprint arXiv:1711.11586.