As an ill-posed problem, image inpainting aims not to recover the original content of corrupted regions but to synthesize alternative contents that are visually plausible and semantically reasonable. It has been widely investigated in various image editing tasks such as object removal, old photo restoration, movie restoration, and so on. Realistic and high-fidelity image inpainting remains a challenging task, especially when the corrupted regions are large and have complex texture and structural patterns.
State-of-the-art image inpainting methods leverage generative adversarial networks (GANs) heavily for generating realistic high-frequency details. But they often face a dilemma between perceptual quality and reconstruction accuracy, which are known to share a perception-distortion trade-off. Specifically, the adversarial loss in GANs tends to recover high-frequency texture details and improve the perceptual quality [31, 8], while the L1/L2 reconstruction loss focuses more on recovering low-frequency global structures. Concurrently optimizing the two objectives in the spatial domain tends to introduce inter-frequency conflicts as illustrated in Fig. 1. GMCNN balances the two objectives by a weighted sum, but it still works in the spatial domain with mixed frequencies and struggles to generate realistic high-frequency details due to the inter-frequency conflicts. Gated Convolution (GC) mitigates this issue by adopting a coarse-to-fine strategy [39, 32, 29, 20, 40] that first predicts global low-frequency structures and then refines high-frequency texture details. The coarse estimation network is generally trained with L1 loss, but the inter-frequency conflicts still exist in the refinement network. Moreover, the two-stage network often suffers from inconsistency between the generated structures and texture details due to the lack of effective alignment and fusion of multi-stage features.
To address the aforementioned issues, we design WaveFill, an innovative image inpainting framework that employs wavelet transform to complete corrupted image regions at multiple frequency bands separately. Specifically, we convert images into the wavelet domain with the 2D discrete wavelet transform (DWT), where images can be disentangled into multiple frequency bands accurately without losing spatial information. The disentanglement allows us to apply the adversarial (or L1) loss to the high-frequency (or low-frequency) branches explicitly and separately, which greatly mitigates the content conflicts introduced by concurrently optimizing the two different objectives over entangled features in the spatial domain. In addition, we design a novel frequency region attentive normalization (FRAN) scheme that aggregates attention from low frequency to high frequency to align and fuse the multi-frequency features. FRAN ensures consistency across multiple frequency bands and helps suppress artifacts and preserve texture details effectively. The separately completed features in different frequency bands are then transformed back to the spatial domain via the inverse discrete wavelet transform (IDWT) to produce the final completion.
The contributions of this work can be summarized in three aspects. First, we propose WaveFill, an innovative image inpainting technique that synthesizes corrupted image regions at different frequency bands explicitly and separately, which effectively mitigates the inter-frequency conflicts while minimizing adversarial and reconstruction losses. Second, we design a novel normalization scheme that enables attentive alignment and fusion of the multi-frequency features with effective artifact suppression and detail preservation. Third, extensive experiments over multiple datasets show that the proposed WaveFill achieves superior inpainting as compared with the state-of-the-art.
2 Related Works
2.1 Image Inpainting
Image inpainting has been studied for years, and earlier works employ diffusion and image patches heavily. Specifically, diffusion methods [3, 1] propagate neighboring information towards the corrupted regions but often fail to recover meaningful structures with little global information. Patch-based methods [2, 7] complete images by searching and transferring similar patches from the background. They work well for stationary textures but struggle to generate meaningful semantics for non-stationary data.
Though the aforementioned methods address image completion in different manners, most of them work in the spatial domain where information of different frequencies is mixed and often introduces inter-frequency conflicts in learning and optimization. Our method instead decomposes images into the frequency space and applies different objectives to different frequency bands explicitly and separately, which mitigates inter-frequency conflicts and improves image inpainting quality effectively.
2.2 Wavelet-based Methods
Wavelet transforms decompose a signal into different frequency components and have shown great effectiveness in various image processing tasks. Wavelet-based inpainting was investigated well before the prevalence of deep learning. For example, Chan et al. design variational models with total variation (TV) minimization for image inpainting, which is later improved with non-local TV regularization. In addition, Dobrosotskaya et al. combine diffusion with the non-locality of wavelets for better sharpness in inpainting. Zhang and Dai decompose images in the wavelet domain and generate structures and textures with diffusion and exemplar-based methods, respectively. The aforementioned methods leverage hand-crafted features and cannot generate meaningful content for large corrupted regions. We borrow the idea of wavelet-based decomposition and incorporate CNN representations and adversarial learning, which mitigates this issue effectively.
3 Proposed Method
The overview of our proposed inpainting network is illustrated in Fig. 2. An input image is first decomposed and assembled into three frequency bands, LowFreq, Lv2HighFreq and Lv1HighFreq, which are then fed to three network branches for respective completion. We apply the L1 reconstruction loss to LowFreq and the adversarial loss to Lv2HighFreq and Lv1HighFreq to mitigate the inter-frequency conflicts. In addition, we design a novel normalization scheme, FRAN, that aligns and fuses features from the three branches to enforce the completion consistency across the three frequency bands. The generation results in the three branches are finally transformed back to the spatial domain to complete the inpainting. More details are described in the ensuing subsections.
3.2 Wavelet Decomposition
The key innovation of our work is to disentangle images into multiple frequency bands and complete the images in different bands separately in the wavelet domain. We adopt the 2D discrete wavelet transform (DWT) to first decompose images into multiple wavelet sub-bands with different frequency contents. For each iteration of the decomposition, the DWT applies low-pass and high-pass wavelet filters alternately along image columns and rows (followed by downsampling), which produces 4 sub-bands LL_i, LH_i, HL_i and HH_i. The decomposition continues iteratively on LL_i to produce LL_{i+1}, LH_{i+1}, HL_{i+1} and HH_{i+1} until the target level of decomposition L is reached. Hence, a total of 3L+1 wavelet sub-bands will finally be produced, including LL_L and {LH_i, HL_i, HH_i} for i = 1, ..., L. Here LL_i captures low-frequency information at the i-th level, while LH_i, HL_i and HH_i capture the horizontal, vertical and diagonal high-frequency information at the i-th level, respectively. Note that the sub-bands at the i-th level are down-sampled by a factor of 2^i.
In this work, we adopt the Haar wavelet filter as the basis for the wavelet transform, where the high-pass filter is (−1, 1)/√2 and the low-pass filter is (1, 1)/√2. The level of wavelet decomposition is empirically set to 2. We treat LL_2 as low-frequency, and concatenate LH_i, HL_i and HH_i in the channel dimension as the i-th level high-frequency. Given an input image of size H × W × 3, we thus obtain three inputs in the wavelet domain, namely, LowFreq of size (H/4) × (W/4) × 3, Lv2-HighFreq of size (H/4) × (W/4) × 9 and Lv1-HighFreq of size (H/2) × (W/2) × 9.
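The decomposition above can be sketched with a minimal, self-contained Haar DWT in NumPy (an illustrative re-implementation, not the authors' code; `haar_dwt2` and `haar_idwt2` are our own helper names). Applying it twice yields exactly the LowFreq, Lv2-HighFreq and Lv1-HighFreq groupings described above, and the round trip through the inverse transform is lossless:

```python
import numpy as np

def haar_dwt2(x):
    """One level of 2D Haar DWT on a (H, W) array -> (LL, LH, HL, HH),
    each of size (H/2, W/2). Low-pass (1, 1)/sqrt(2), high-pass (-1, 1)/sqrt(2)."""
    # Filter + downsample along the first axis
    lo = (x[0::2, :] + x[1::2, :]) / np.sqrt(2)
    hi = (x[1::2, :] - x[0::2, :]) / np.sqrt(2)
    # Filter + downsample along the second axis
    LL = (lo[:, 0::2] + lo[:, 1::2]) / np.sqrt(2)
    LH = (lo[:, 1::2] - lo[:, 0::2]) / np.sqrt(2)
    HL = (hi[:, 0::2] + hi[:, 1::2]) / np.sqrt(2)
    HH = (hi[:, 1::2] - hi[:, 0::2]) / np.sqrt(2)
    return LL, LH, HL, HH

def haar_idwt2(LL, LH, HL, HH):
    """Inverse of haar_dwt2 (perfect reconstruction)."""
    H2, W2 = LL.shape
    lo, hi = np.empty((H2, 2 * W2)), np.empty((H2, 2 * W2))
    lo[:, 0::2] = (LL - LH) / np.sqrt(2); lo[:, 1::2] = (LL + LH) / np.sqrt(2)
    hi[:, 0::2] = (HL - HH) / np.sqrt(2); hi[:, 1::2] = (HL + HH) / np.sqrt(2)
    x = np.empty((2 * H2, 2 * W2))
    x[0::2, :] = (lo - hi) / np.sqrt(2); x[1::2, :] = (lo + hi) / np.sqrt(2)
    return x

rng = np.random.default_rng(0)
img = rng.standard_normal((256, 256))      # one channel of a 256x256 image
LL1, LH1, HL1, HH1 = haar_dwt2(img)        # level-1 sub-bands: 128x128
LL2, LH2, HL2, HH2 = haar_dwt2(LL1)        # level-2 sub-bands: 64x64
# LowFreq = LL2; Lv2-HighFreq = (LH2, HL2, HH2); Lv1-HighFreq = (LH1, HL1, HH1)
rec = haar_idwt2(haar_idwt2(LL2, LH2, HL2, HH2), LH1, HL1, HH1)
```

Since both directions are simple linear maps, the transform pair is differentiable, which is what allows the network to be trained end-to-end through IDWT.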
3.3 Frequency Region Attentive Normalization
It is a vital step to align and fuse the low-frequency and high-frequency features for generating consistent and realistic contents across different frequency bands. An effective fusion of low-frequency and high-frequency features faces two major challenges. First, the statistics of the low-frequency and high-frequency bands differ clearly; directly summing or concatenating them could greatly suppress the high-frequency information due to its high sparsity. Second, the different branches are trained with their own explicit loss terms, and the learning capacity (number of CNN layers and kernel sizes) also varies among the branches. Thus, when inpainting different branches independently without inter-branch alignment, a network branch may generate contents that are reasonable in its own frequency bands but inconsistent (in object shapes or sizes) with the frequency bands of other branches. Both issues can lead to various blurs and artifacts in the completion results. We design a novel Frequency Region Attentive Normalization (FRAN) technique that aligns and fuses low-frequency and high-frequency features for more realistic inpainting.
For the issue with the statistical difference, we propose to align the low-frequency features with the target high-frequency features so as to fuse them effectively and alleviate the difficulty of generating the target high-frequency bands. Inspired by the spatially-adaptive normalization (SPADE), we achieve the feature alignment by injecting the learnable modulation parameters γ and β of the high-frequency features into the low-frequency features F_l ∈ R^{N×C}, where N is the number of spatial positions, i.e. N = H × W.
To align the contents in the missing regions, we aggregate the self-attention scores of the low-frequency features to the high-frequency features. Since the attention map depicts the correlation between low-frequency feature patches, the misaligned high-frequency features of corrupted regions can be reconstructed by collectively aggregating features from uncorrupted regions. Another advantage of applying attention aggregation is to leverage complementary features of distant regions by establishing long-range dependencies. As shown in Fig. 3, the attention scores are computed from the low-frequency features F_l ∈ R^{N×C} (C is the channel number), which are first transformed into two feature spaces for key and query respectively, i.e. k(F_l) and q(F_l), where k and q are 1×1 convolutions. For efficiency, we employ max-pooling to obtain a reduced spatial dimension for attention calculation and aggregation.
The high-frequency features F_h are then mapped to a feature space with the same hidden dimension by v(F_h), where v is the transformation function implemented by a 1×1 convolution. The aggregation A of v(F_h) at position j is defined by:

A_j = Σ_i α_{j,i} v(F_h)_i,  α_{j,i} = exp(q(F_l)_j · k(F_l)_i) / Σ_i exp(q(F_l)_j · k(F_l)_i)
Since the high-frequency features are significantly sparse, the magnitude of the resultant aggregation is relatively small. We adopt a parameter-free positional normalization to normalize it while preserving structure information. The same normalization is also applied to the low-frequency features before the modulation. Finally, the aggregation output is convolved to produce the modulation parameters γ and β that modulate the normalized low-frequency features:

F_out = γ · (F_l − μ) / σ + β

where F_out is the modulated features, and μ and σ are the mean and standard deviation of F_l along the channel dimension.
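The overall flow of FRAN can be sketched in NumPy as below. This is a simplified illustration under our own assumptions, not the authors' implementation: the key/query/value maps and the γ/β projections are plain matrix multiplications standing in for 1×1 convolutions, the max-pooling step is omitted for brevity, and all names (`pono`, `fran`, `Wq`, ...) are ours:

```python
import numpy as np

def pono(f, eps=1e-5):
    """Parameter-free positional normalization: normalize each spatial
    position across its channel dimension."""
    mu = f.mean(axis=-1, keepdims=True)
    sigma = f.std(axis=-1, keepdims=True)
    return (f - mu) / (sigma + eps)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def fran(f_low, f_high, Wq, Wk, Wv, Wgamma, Wbeta):
    """Sketch of Frequency Region Attentive Normalization.
    f_low, f_high: (N, C) features flattened over N spatial positions."""
    q, k = f_low @ Wq, f_low @ Wk            # query/key from low-freq features
    v = f_high @ Wv                          # value map of high-freq features
    attn = softmax(q @ k.T, axis=-1)         # (N, N) low-freq patch correlations
    agg = pono(attn @ v)                     # aggregate, then normalize (sparse input)
    gamma, beta = agg @ Wgamma, agg @ Wbeta  # modulation parameters
    return gamma * pono(f_low) + beta        # modulate normalized low-freq features

# Toy usage: N = 64 positions, C = 8 channels, hidden dimension d = 4
rng = np.random.default_rng(0)
N, C, d = 64, 8, 4
f_low = rng.standard_normal((N, C))
f_high = rng.standard_normal((N, C))
Wq, Wk, Wv = (rng.standard_normal((C, d)) for _ in range(3))
Wgamma, Wbeta = rng.standard_normal((d, C)), rng.standard_normal((d, C))
out = fran(f_low, f_high, Wq, Wk, Wv, Wgamma, Wbeta)
```

The attention row for a corrupted position thus pulls high-frequency content from the uncorrupted positions its low-frequency features correlate with, which is what keeps the two bands consistent.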
3.4 Network Architecture
Our network consists of one generator and two discriminators as illustrated in Fig. 2.
Generation Network. The generation network consists of three branches, LowFreq, Lv2-HighFreq and Lv1-HighFreq, which recover corrupted regions separately. The LowFreq branch contains a completion module, GC ResBlk, that adopts gated convolution. Specifically, GC ResBlk consists of several consecutive residual blocks with growing dilation rates up to 16 to increase the receptive field. Meanwhile, it replaces all convolutions by gated convolutions to dynamically handle missing regions. The generated low-frequency features are propagated to a decoder with two gated convolutions to predict the completion of the low-frequency sub-bands. Besides, they are also transferred to the two high-frequency branches to guide and align with their generation.
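The gating mechanism inside the gated convolutions can be illustrated with a minimal NumPy sketch. For brevity we use the 1×1 case (per-position linear maps) and assume a ReLU feature activation, whereas the actual GC ResBlk uses larger dilated kernels, so this only shows the idea, not the authors' implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_conv1x1(x, Wf, Wg):
    """Gated convolution, 1x1 case: a learned soft gate in (0, 1) at every
    spatial position and channel decides how much of the feature response
    passes through, letting the network down-weight invalid (masked) regions.
    x: (H, W, Cin); Wf, Wg: (Cin, Cout)."""
    feature = np.maximum(x @ Wf, 0.0)  # feature path (ReLU assumed)
    gate = sigmoid(x @ Wg)             # gating path, per position and channel
    return feature * gate

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 16, 4))
Wf = rng.standard_normal((4, 8))
Wg = rng.standard_normal((4, 8))
y = gated_conv1x1(x, Wf, Wg)
```

Unlike a hard binary mask, the gate is learned and continuous, so partially valid regions can still contribute a scaled-down response.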
The high-frequency branch Lv2-HighFreq consists of a new residual block, FRAN ResBlk, which is built upon FRAN as illustrated in Fig. 2 (right). As the learned modulation parameters have encoded high-frequency information, we directly feed the high-frequency bands to FRAN without additional encoding. After injecting the high-frequency information into the low-frequency features, we propagate the acquired high-frequency features to a separate decoder that also consists of two gated convolutions. The other high-frequency branch, Lv1-HighFreq, shares a similar structure with Lv2-HighFreq, except that it concatenates the well-aligned and normalized features from the previous two branches and up-samples them to the current spatial dimension. The generation network thus predicts the inpainting of all three frequency bands, and finally converts them back to the spatial domain via the inverse discrete wavelet transform (IDWT). As DWT and IDWT are both differentiable, the network can be trained end-to-end.
Discrimination Network. To synthesize high-frequency information, we adopt two discriminators of the same structure for the Lv2-HighFreq and Lv1-HighFreq predictions, respectively. Motivated by PatchGAN and global-and-local GANs, we adopt global and local sub-networks on top of PatchGAN to ensure the generation consistency. Additionally, we append a self-attention layer after the last convolutional layer to assess the global structure and enforce geometric consistency.
3.5 Loss Functions
We denote the finally completed image by x̂, the predictions in the wavelet domain by {ŷ_LL, ŷ_1, ..., ŷ_L} (L is the number of levels in wavelet decomposition), the ground-truth image by x and its corresponding wavelet coefficients by {y_LL, y_1, ..., y_L}. D_i denotes the discriminator for the i-th level high-frequency wavelet coefficients in the wavelet domain.
Low-Frequency L1 Loss. We explicitly employ the L1 loss on the low-frequency sub-bands in the wavelet domain, which can be defined by:

L_low = ||ŷ_LL − y_LL||_1
Adversarial Loss. For the two discriminators of the high-frequency branches, we apply the same adversarial objective to them using the hinge loss. The adversarial loss for discriminator D_i is defined as:

L_{D_i} = E[max(0, 1 − D_i(y_i))] + E[max(0, 1 + D_i(ŷ_i))]
For the generator, we sum up the adversarial loss of each discriminator to obtain the final loss as below:

L_adv = − Σ_{i=1}^{L} E[D_i(ŷ_i)]
Feature Matching Loss. As the training could be unstable due to the sparsity of the high-frequency bands, we adopt the feature matching loss following pix2pixHD on both discriminators to stabilize the training process:

L_fm = E Σ_{i=1}^{T} (1/N_i) ||D^{(i)}(y) − D^{(i)}(ŷ)||_1

where T is the last layer of the discriminator, and D^{(i)} and N_i are the activation map and its number of elements in the i-th layer of the discriminator, respectively.
Perceptual Loss. To penalize the perceptual and semantic discrepancy, we employ the perceptual loss using a pretrained VGG-19 network:

L_perc = Σ_l λ_l ||φ_l(x̂) − φ_l(x)||_1

where λ_l are the balancing weights and φ_l is the activation of the l-th selected layer of the VGG-19 model, corresponding to the activation maps from layers relu1_2, relu2_2, relu3_2, relu4_2 and relu5_2. Among them, we emphasize the relu4_2 layer to capture the high-level semantics.
Full Objective. With a linear combination of the aforementioned losses, the network is optimized by the following objective:

L = L_low + λ_adv L_adv + λ_fm L_fm + λ_perc L_perc

where the balancing weights λ_adv, λ_fm and λ_perc are set empirically in our experiments.
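A minimal NumPy sketch of the loss terms above (L1 reconstruction and hinge adversarial losses) may help make the objective concrete; the weight used in the toy total is a placeholder, since the paper's balancing weights are set empirically:

```python
import numpy as np

def l1_loss(pred, target):
    """Reconstruction (L1) loss, used for the low-frequency sub-bands."""
    return np.mean(np.abs(pred - target))

def hinge_d_loss(d_real, d_fake):
    """Hinge adversarial loss for one discriminator over real/fake scores."""
    return (np.mean(np.maximum(0.0, 1.0 - d_real))
            + np.mean(np.maximum(0.0, 1.0 + d_fake)))

def hinge_g_loss(d_fakes):
    """Generator adversarial loss: sum of -E[D_i(fake)] over discriminators."""
    return sum(-np.mean(d) for d in d_fakes)

# Toy example with scores from the two high-frequency discriminators
rng = np.random.default_rng(0)
d_fake_lv2, d_fake_lv1 = rng.standard_normal(32), rng.standard_normal(32)
recon = l1_loss(rng.standard_normal((8, 8)), rng.standard_normal((8, 8)))
total = recon + 1.0 * hinge_g_loss([d_fake_lv2, d_fake_lv1])  # placeholder weight
```

Note that the hinge discriminator loss vanishes once real and fake scores are pushed beyond the ±1 margins, which is what stabilizes the adversarial training.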
4.1 Experimental Settings
Datasets. We conduct experiments on three public datasets that have different characteristics:
Compared Methods. We compare our method with a number of state-of-the-art methods as listed:
GMCNN: It is a generative model with different receptive fields in different branches.
GC: It is also known as DeepFill v2, a two-stage method that leverages gated convolution.
EC: It is a two-stage method that first predicts salient edges to guide the generation.
MEDFE: It is a mutual encoder-decoder that treats features from deep and shallow layers as structures and textures of an input image.
We perform evaluations using four widely adopted evaluation metrics: 1) Fréchet Inception Distance (FID), which evaluates the perceptual quality by measuring the distribution distance between the synthesized images and real images; 2) mean L1 error; 3) peak signal-to-noise ratio (PSNR); and 4) structural similarity index (SSIM) with a window size of 51.
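FID and SSIM are typically computed with library implementations, but the two simpler metrics can be sketched directly in NumPy; normalizing the mean error by the dynamic range is our assumption here:

```python
import numpy as np

def psnr(x, y, max_val=255.0):
    """Peak signal-to-noise ratio between two images in [0, max_val]."""
    mse = np.mean((np.asarray(x, float) - np.asarray(y, float)) ** 2)
    return np.inf if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

def mean_l1(x, y, max_val=255.0):
    """Mean absolute error, reported as a fraction of the dynamic range."""
    return np.mean(np.abs(np.asarray(x, float) - np.asarray(y, float))) / max_val

img = np.full((64, 64), 128.0)
noisy = img + 16.0  # constant offset: MSE = 256, mean absolute error = 16
```

Higher PSNR and SSIM (and lower FID and mean L1 error) indicate better inpainting.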
The proposed method is implemented in PyTorch. The network is trained using images with random rectangle masks or irregular masks. We use the Adam optimizer and set the learning rates at 1e-4 and 4e-4 for the generator and discriminators, respectively. The experiments are conducted on 4 NVIDIA Tesla V100 GPUs. Inference is performed on a single GPU, and our full model runs at 0.138 seconds per image.
4.2 Quantitative Evaluation
We perform extensive quantitative evaluations over data with central square masks and irregular masks. For inpainting with central square masks, we use fixed-size masks at the image center, and benchmark with GMCNN, EC and GC over the validation images of CelebA-HQ. For inpainting with irregular masks, we conducted experiments over Places2 and Paris StreetView and benchmarked with GC, EC and MEDFE. The irregular masks in the experiments are categorized based on the ratios of the masked regions over the image size. Performance of the compared methods was acquired by running publicly available pre-trained models. The only exception is EC, which was trained with the official implementation on CelebA-HQ with random rectangle masks.
Table 1 shows the experimental results for CelebA-HQ with central square masks. It can be observed that WaveFill outperforms all existing methods under different evaluation metrics consistently. In addition, experiments with irregular masks show that WaveFill achieves superior inpainting under different mask ratios, as shown in Table 2. The effectiveness of WaveFill is largely attributed to the wavelet-based frequency decomposition and the proposed normalization scheme. Specifically, disentangling frequency information in the wavelet domain helps mitigate the conflicts in generating low-frequency and high-frequency contents effectively, and it improves the inpainting quality in PSNR and SSIM as well. With the proposed normalization scheme, the low- and high-frequency information can be aligned for consistent generation in different frequency bands. Moreover, it allows the model to establish long-range dependencies, which help generate more semantically plausible contents with better perceptual quality in FID. Quantitative results for Paris StreetView are provided in the supplementary materials due to the space limit.
4.3 Qualitative Evaluations
Figs. 4 and 5 show qualitative experimental results over the validation sets of CelebA-HQ and Places2, respectively. As demonstrated in Fig. 4, the inpainting by GMCNN and EC clearly suffers from unreasonable semantics and inconsistency near edge regions, while the inpainting by GC contains obvious artifacts and blurry textures. As a comparison, the inpainting by WaveFill is more semantically reasonable, with fewer artifacts and richer texture details. For Places2, the inpainting by GC and MEDFE contains undesired artifacts and distorted structures as shown in Figs. 5b and 5c. Though EC produces more visually appealing contents with fewer artifacts, its generated semantics are still short of plausibility. Thanks to the frequency disentanglement and FRAN, WaveFill achieves superior inpainting for both central square masks and irregular masks.
4.4 User Study
We performed user studies over the datasets Paris StreetView, Places2 and CelebA-HQ. Specifically, we randomly sampled 25 test images from each dataset without inspecting the inpainting results, which leads to 75 multiple-choice questions in the survey. We recruited 20 volunteers with image processing backgrounds, and each subject was asked to vote for the most realistic inpainting in each question. As Fig. 6 shows, the proposed WaveFill outperforms state-of-the-art methods by large margins.
4.5 Ablation Study
We study the individual contributions of our technical designs with several ablation studies over Paris StreetView, as shown in Table 3. In the ablation studies, we trained four network models: 1) Spatial + Concat (Baseline), which adopts the typical encoder-decoder network with gated convolution; different from WaveFill, the L1 and adversarial losses are applied together and multi-level features are directly concatenated; 2) DCT + Concat, which adopts the discrete cosine transform (DCT) for comparison with the wavelet transform; similar to WaveFill, we split the frequency bands into three groups and feed them to the three generation branches; 3) Wavelet + Concat, which replaces FRAN by concatenation of multi-frequency features; 4) Wavelet + SPADE, which replaces FRAN by SPADE.
As shown in Table 3, using DCT degrades the inpainting greatly due to the lack of spatial information. The wavelet transform preserves spatial information, which improves inpainting by large margins. In addition, the wavelet-based models outperform the baseline especially in FID, largely because they disentangle multi-frequency information and recover corrupted regions in different frequency bands separately. The visual evaluation in Fig. 7 is well aligned with the quantitative experiments. We can see that the DCT-based model fails to synthesize meaningful structures as shown in (c), while the spatial-domain model introduces unreasonable semantics and clear artifacts as shown in (b). Our wavelet-based model fills the missing regions with much fewer artifacts as shown in (d). Further, concatenation and SPADE do not align the features of different frequencies for better content consistency; FRAN addresses this issue effectively, as shown in Table 3 and Fig. 7. More ablation studies are included in the supplementary materials.
This paper presents WaveFill, a novel image inpainting framework that disentangles low and high frequency information in the wavelet domain and fills the corrupted regions explicitly and separately. To ensure the inpainting consistency across multiple frequency bands, we propose a novel frequency region attentive normalization (FRAN) that effectively aligns and fuses the multi-frequency features especially those within the missing regions. Extensive experiments show that WaveFill achieves superior image inpainting for both rectangle and free-form masks. Moving forward, we will study how to adapt the idea of wavelet decomposition and separate processing in different frequency bands to other image recovery and generation tasks.
-  (2001) . IEEE Transactions on Image Processing 10 (8), pp. 1200–1211. Cited by: §2.1.
-  (2009) PatchMatch: a randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics (TOG) 28 (3), pp. 24. Cited by: §2.1.
-  (2000) Image inpainting. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques, pp. 417–424. Cited by: §2.1.
- The perception-distortion tradeoff. IEEE Conference on Computer Vision and Pattern Recognition, pp. 6228–6237. Cited by: §1.
-  (2006) Total variation wavelet inpainting. Journal of Mathematical imaging and Vision 25 (1), pp. 107–125. Cited by: §2.2.
Uses of complex wavelets in deep convolutional neural networks. Ph.D. Thesis, University of Cambridge. Cited by: §1.
-  (2012) Image melding: combining inconsistent images using patch-based synthesis. ACM Transactions on Graphics (TOG) 31 (4), pp. 1–10. Cited by: §2.1.
-  (2019) Wavelet domain style transfer for an effective perception-distortion tradeoff in single image super-resolution. In International Conference on Computer Vision, pp. 3076–3085. Cited by: §1, §2.2.
-  (2008) A wavelet-laplace variational technique for image deconvolution and inpainting. IEEE Transactions on Image Processing 17 (5), pp. 657–663. Cited by: §2.2.
-  (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680. Cited by: §1, §2.1.
-  (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §3.4.
-  (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637. Cited by: §4.1.
-  (2017) Wavelet-srnet: a wavelet-based cnn for multi-scale face super resolution. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1689–1697. Cited by: §2.2.
-  (2017) Globally and locally consistent image completion. ACM Transactions on Graphics (TOG) 36 (4), pp. 1–14. Cited by: §3.4.
-  (2017) Image-to-image translation with conditional adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134. Cited by: §3.4, §3.5.
-  (2016) Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pp. 694–711. Cited by: §3.5.
-  (2018) Progressive growing of gans for improved quality, stability, and variation. International Conference on Learning Representations. Cited by: Figure 4, item –, §4.2, §4.3, §4.4, Table 1.
-  (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.1.
-  (2019) Positional normalization. In Advances in Neural Information Processing Systems, pp. 1622–1634. Cited by: §3.3.
-  (2018) Image inpainting for irregular holes using partial convolutions. In European Conference on Computer Vision, pp. 85–100. Cited by: §1, §2.1, §4.1, §4.2, Table 2, Table 3.
-  (2020) Rethinking image inpainting via a mutual encoder-decoder with feature equalizations. In European Conference on Computer Vision, Cited by: §1, §2.1, item –, §4.2, §4.3, Table 2.
-  (2020) Wavelet-based dual-branch network for image demoiréing. In European Conference on Computer Vision, Cited by: §2.2.
-  (2019) Attribute-aware face aging with wavelet-based generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 11877–11886. Cited by: §2.2.
-  (2015) Deep learning face attributes in the wild. In International Conference on Computer Vision, pp. 3730–3738. Cited by: item –.
-  (1999) A wavelet tour of signal processing. Elsevier. Cited by: §2.2.
-  (2019) Edgeconnect: generative image inpainting with adversarial edge learning. arXiv preprint arXiv:1901.00212. Cited by: §2.1, item –, §4.2, §4.3, Table 1, Table 2.
-  (2019) Semantic image synthesis with spatially-adaptive normalization. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2337–2346. Cited by: §3.3, §4.2, §4.2, §4.5, Table 3.
-  (2016) Context encoders: feature learning by inpainting. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2536–2544. Cited by: §1, §2.1, Figure 7, item –, §4.4.
-  (2019) Structureflow: image inpainting via structure-aware appearance flow. In International Conference on Computer Vision, pp. 181–190. Cited by: §1.
The earth mover’s distance as a metric for image retrieval. International journal of computer vision 40 (2), pp. 99–121. Cited by: Figure 1.
-  (2017) Enhancenet: single image super-resolution through automated texture synthesis. In International Conference on Computer Vision, pp. 4491–4500. Cited by: §1.
-  (2018) Contextual-based image inpainting: infer, match, and translate. In European Conference on Computer Vision, pp. 3–19. Cited by: §1.
-  (2020) Multi-level wavelet-based generative adversarial network for perceptual quality enhancement of compressed video. In European Conference on Computer Vision, Cited by: §2.2.
-  (2018) High-resolution image synthesis and semantic manipulation with conditional gans. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 8798–8807. Cited by: §3.5.
-  (2018) Image inpainting via generative multi-column convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 331–340. Cited by: Figure 1, §1, §2.1, item –, §4.2, §4.3, Table 1.
-  (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612. Cited by: §4.1.
-  (2020) LEED: label-free expression editing via disentanglement. In European Conference on Computer Vision, pp. 781–798. Cited by: §2.1.
-  (2020) Cascade ef-gan: progressive facial expression editing with local focuses. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 5021–5030. Cited by: §2.1.
-  (2018) Generative image inpainting with contextual attention. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 5505–5514. Cited by: §1, §2.1.
-  (2019) Free-form image inpainting with gated convolution. In International Conference on Computer Vision, pp. 4471–4480. Cited by: Figure 1, §1, §2.1, §3.4, item –, item –, §4.2, §4.3, §4.5, Table 1, Table 2.
-  (2021) Diverse image inpainting with bidirectional and autoregressive transformers. arXiv preprint arXiv:2104.12335. Cited by: §2.1.
-  (2021) Unbalanced feature transport for exemplar-based image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.1.
-  (2021) Bi-level feature alignment for versatile image translation and manipulation. arXiv preprint arXiv:2107.03021. Cited by: §2.1.
-  (2019) Spatial fusion gan for image synthesis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3653–3662. Cited by: §2.1.
- Self-attention generative adversarial networks. International Conference on Machine Learning, pp. 7354–7363. Cited by: §3.4.
-  (2012) Image inpainting based on wavelet decomposition. Procedia Engineering 29, pp. 3674–3678. Cited by: §2.2.
-  (2010) Wavelet inpainting by nonlocal total variation. Inverse Problems & Imaging 4 (1), pp. 191. Cited by: §2.2.
Places: a 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (6), pp. 1452–1464. Cited by: Figure 5, item –, §4.2, §4.3, §4.4, Table 2.