Image inpainting (a.k.a. image completion or image hole-filling) is the task of synthesizing alternative contents in missing regions such that the modification is visually realistic and semantically correct. It allows users to remove distracting objects or retouch undesired regions in photos. It can also be extended to tasks including image/video un-cropping, rotation, stitching, re-targeting, re-composition, compression, super-resolution, harmonization and many others.
In computer graphics, two broad approaches to image inpainting exist: patch-based methods using low-level features, and deep generative models based on convolutional neural networks. The former [Efros and Leung, 1999; Barnes et al., 2009; Efros and Freeman, 2001] can synthesize plausible stationary textures, but usually makes critical failures in non-stationary cases like complicated scenes, faces and objects. The latter [Iizuka et al., 2017; Yu et al., 2018] can exploit semantics learned from large-scale datasets to fill contents in non-stationary images in an end-to-end fashion.
However, deep generative models based on vanilla convolutional networks are naturally ill-fitted for image hole-filling because convolutional filters treat all input pixels as equally valid. For hole-filling, the input images/features are composed of both regions with valid pixels outside holes and invalid or synthesized pixels in masked regions. Vanilla convolutions apply the same filters on all valid, invalid and mixed (on hole boundaries) pixels, leading to visual artifacts such as color discrepancy, blurriness and obvious edge responses surrounding holes when tested on free-form masks.
To address this limitation, partial convolution [Liu et al., 2018] was recently proposed, where the convolution is masked and re-normalized to be conditioned only on valid pixels. It is followed by a mask-update step that re-computes a new mask layer by layer. Partial convolution is essentially a hard-gating single-channel un-learnable layer multiplied to input feature maps.
It heuristically categorizes all pixel locations to be either valid or invalid, and multiplies hard-gating values (i.e. ones or zeros) to input images/features. However, this assumption has several problems. First, if we want to extend it to user-guided image inpainting with conditional channels, where users provide sparse sketches inside the mask, should these pixel locations be considered as valid or invalid? How should the mask be properly updated for the next layer? Second, for partial convolution the invalid pixels progressively disappear in deep layers, leaving all gating values as ones (Figure 3). However, we will show that if we allow the network to learn the optimal gating values by itself, the network assigns different gating values to different locations in different channels based on input masks and sketches, even in deep layers, as shown in the visualization results in Figure 3.
We propose gated convolution, which learns a dynamic feature selection mechanism for each channel and each spatial location (e.g. inside or outside masks, RGB or user-input channels) for the task of free-form image inpainting. Specifically, we consider the formulation where the input feature is first used to compute a gating value $g = \sigma(w_g x)$ ($\sigma$ is the sigmoid function, $w_g$ is a learnable parameter). The final output is a multiplication of the learned feature and the gating value, $y = \phi(w x) \odot g$, where $\phi$ is any activation function. Gated convolution is easy to implement and performs significantly better when (1) the masks have arbitrary shapes and (2) the inputs are no longer simply RGB channels with a mask but also include conditional inputs like sketches. For network architectures, we stack gated convolutions to form a simple encoder-decoder network [Yu et al., 2018]. Skip connections in a U-Net [Ronneberger et al., 2015], as adopted in some image inpainting networks [Liu et al., 2018], are not effective for non-narrow masks, mainly because the inputs of these skip connections are almost zeros and thus cannot propagate detailed color or texture information to the decoder. This can be explained by our visualization of the learned feature representation of the encoder. Our inpainting network also integrates the contextual attention module [Yu et al., 2018] within the same refinement network to better capture long-range dependencies.
Without degradation of performance, we also significantly simplify the training objectives to two terms: a pixel-wise reconstruction loss and an adversarial loss. The modification is mainly designed for free-form image inpainting. As the holes may appear anywhere in images with any shape, global and local GANs [Iizuka et al., 2017] designed for a single rectangular mask are not suitable. Instead, we propose a variant of generative adversarial networks, named SN-PatchGAN, motivated by global and local GANs [Iizuka et al., 2017], MarkovianGANs [Li and Wand, 2016], perceptual loss [Johnson et al., 2016] and recent work on spectral-normalized GANs [Miyato et al., 2018]. The discriminator of SN-PatchGAN directly computes a hinge loss on each point of its output map of shape $\mathbb{R}^{h \times w \times c}$, formulating $h \times w \times c$ GANs focusing on different locations and different semantics (represented in different channels) of the input image. SN-PatchGAN is simple in formulation, fast and stable in training, and produces high-quality inpainting results.
For practical image inpainting tools, enabling user interactivity is crucial because there could exist many plausible solutions for filling a hole in an image. To this end, we present an extension to allow user-guided inputs (i.e. sketches). Comparison to other methods is summarized in Table 1. In summary, our contributions are as follows:
We introduce gated convolution to learn a dynamic feature selection mechanism for each channel at each spatial location across all layers, significantly improving the color consistency and inpainting quality of free-form masks and inputs.
We propose a novel GAN discriminator SN-PatchGAN designed for free-form image inpainting. It is simple, fast and produces high-quality inpainting results.
We extend our proposed inpainting model to an interactive one which can take user sketches as guidance to obtain more user-desired inpainting results.
For the first time we provide visualization and interpretation of learned CNN feature representation for the image inpainting task. The visualization demonstrates the efficacy of gated convolution in shallow and deep layers.
Our proposed generative image inpainting system achieves higher-quality free-form inpainting than previous state-of-the-art methods on benchmark datasets including Places2 natural scenes and CelebA-HQ faces. We show that the proposed system helps users quickly remove distracting objects, modify image layouts, clear watermarks, edit faces and interactively create novel objects in images.
2. Related Work
2.1. Automatic Image Inpainting
A variety of approaches have been proposed for image inpainting. Traditionally, diffusion-based methods [Bertalmio et al., 2000; Ballester et al., 2001] propagate local image appearance into the target holes along the isophote direction field. They mainly work on small and narrow holes and usually fail on large ones. Patch-based algorithms [Efros and Leung, 1999; Efros and Freeman, 2001] progressively extend pixels close to the hole boundaries based on low-level features (for example, mean square difference in RGB space) to search for and paste the most similar image patch. These algorithms work well on stationary textural regions but often fail on non-stationary images. Further, Simakov et al. propose a bidirectional similarity synthesis approach [Simakov et al., 2008] to better capture and summarize non-stationary visual data. To reduce the high cost of memory and computation during search, tree-based acceleration structures [Mount and Arya, 1998] and randomized algorithms [Barnes et al., 2009] are proposed. Moreover, inpainting results are improved by matching local features like image gradients [Ballester et al., 2001; Darabi et al., 2012] and statistics of similar patch offsets [He and Sun, 2014]. In these works, a Markov random field model is usually assumed, and the conditional distribution of a pixel given all its neighbors synthesized so far is estimated by querying the sample image and finding all similar neighborhoods.
Recently, image inpainting systems based on deep learning have been proposed to directly predict pixel values inside masks in an end-to-end manner. A significant advantage of these models is their ability to learn adaptive image features for different semantics, and thus they can synthesize pixels that are more visually plausible, especially on structured images like faces [Li et al., 2017], objects [Pathak et al., 2016] and natural scenes [Iizuka et al., 2017; Yu et al., 2018]. Among these methods, Iizuka et al. [Iizuka et al., 2017] proposed a fully convolutional image inpainting network with both global and local consistency to handle high-resolution images on a variety of datasets [Russakovsky et al., 2015; Karras et al., 2017; Zhou et al., 2017]. This approach, however, still heavily relies on Poisson image blending with traditional patch-based inpainting results [He and Sun, 2014] as a post-processing step. Yu et al. [Yu et al., 2018] proposed an end-to-end image inpainting model that adopts stacked generative networks to further ensure the color and texture consistency of generated regions with their surroundings. Moreover, to capture long-range spatial dependencies, the contextual attention module [Yu et al., 2018] is proposed and integrated into networks to explicitly borrow information from distant spatial locations. However, this approach is mainly trained on large rectangular masks and does not generalize well to free-form masks. To better handle irregular masks, partial convolution [Liu et al., 2018] is proposed, where the convolution is masked and re-normalized to utilize valid pixels only. It is followed by a mask-update step to re-compute new masks layer by layer.
2.2. Guided Image Inpainting
To improve image inpainting, different user guidance is explored including dots or lines [Ashikhmin, 2001; Drori et al., 2003; Sun et al., 2005; Barnes et al., 2009], structures [Huang et al., 2014], transformation or distortion information [Huang et al., 2013; Pavić et al., 2006] and image exemplars [Criminisi et al., 2004; Kwatra et al., 2005; Hays and Efros, 2007; Whyte et al., 2009; Zhao et al., 2018]. Notably, Hays and Efros [Hays and Efros, 2007] first utilize millions of photographs as a database to search for an example image which is most similar to the input, and then complete the image by cutting and pasting the corresponding regions from the matched image.
2.3. User-Guided Image Synthesis with Deep Learning
Recent advances in conditional generative networks empower user-guided image processing, synthesis and manipulation learned from large-scale datasets. Here we selectively review several related works. Zhang et al. proposed colorization networks that can take user guidance as additional inputs; the system recommends plausible colors based on the input image and current user inputs to obtain better colorization. Wang et al. synthesized high-resolution photo-realistic images from semantic label maps using conditional generative adversarial networks. The Scribbler network [Sangkloy et al., 2017] explored a deep adversarial image synthesis architecture conditioned on sketched boundaries and sparse color strokes to generate realistic cars, bedrooms, or faces.
3. Approach

In this section, we describe the details of the gated convolution, SN-PatchGAN, the design of the inpainting network, the free-form mask generation algorithm, and our extension to allow additional user guidance.
3.1. Gated Convolution
We first explain why vanilla convolutions used in [Iizuka et al., 2017; Yu et al., 2018] are ill-fitted for the task of free-form image inpainting. We consider a convolutional layer in which a bank of filters are applied to the input feature map to produce an output feature map. Assume the input is a $C$-channel feature map; each pixel located at $(y, x)$ in the $C'$-channel output map is computed as

$$O_{y,x} = \sum_{i=-k_h'}^{k_h'} \sum_{j=-k_w'}^{k_w'} W_{k_h'+i,\,k_w'+j} \cdot I_{y+i,\,x+j},$$

where $x, y$ represent the x-axis and y-axis of the output map, $k_h$ and $k_w$ are the kernel height and width (e.g. $3 \times 3$), $k_h' = \frac{k_h-1}{2}$, $k_w' = \frac{k_w-1}{2}$, $W \in \mathbb{R}^{k_h \times k_w \times C' \times C}$ represents the convolutional filters, and $I_{y+i,x+j} \in \mathbb{R}^{C}$ and $O_{y,x} \in \mathbb{R}^{C'}$ are inputs and outputs. For simplicity, the bias term of the convolution is ignored in the equation.
It can be observed that for all spatial locations $(y, x)$, the same filters are applied to compute the output in vanilla convolution layers. This makes sense for tasks such as image classification and object detection, where all pixels of the input image are valid and local features are extracted in a sliding-window fashion. However, for image inpainting, the input features are composed of both regions with valid pixels outside holes and invalid pixels (in shallow layers) or synthesized pixels (in deep layers) in masked regions. This causes ambiguity during training and leads to visual artifacts such as color discrepancy, blurriness and obvious edge responses during testing, as reported in [Liu et al., 2018].
Recently proposed partial convolution-based architectures [Liu et al., 2018] use a masking and re-normalization step to make the convolution dependent only on valid pixels. Mathematically, partial convolution is computed as:

$$O_{y,x} = \begin{cases} \sum\sum W \cdot \left(I \odot \frac{M}{\mathrm{sum}(M)}\right), & \text{if } \mathrm{sum}(M) > 0 \\ 0, & \text{otherwise} \end{cases}$$

in which $M$ is the corresponding binary mask, $m_{y,x} = 1$ represents that the pixel at location $(y, x)$ is valid, $m_{y,x} = 0$ represents that the pixel is invalid, and $\odot$ denotes element-wise multiplication. After each partial convolution operation, a mask-update step is required to propagate the new mask $M'$ with the following rule:

$$m'_{y,x} = \begin{cases} 1, & \text{if } \mathrm{sum}(M) > 0 \\ 0, & \text{otherwise} \end{cases}$$
Partial convolution [Liu et al., 2018] improves the quality of inpainting on irregular masks, but it still has remaining issues. (1) It heuristically classifies all spatial locations to be either valid or invalid. The mask in the next layer will be set to ones no matter how many pixels are covered by the filter range in the previous layer (e.g. 1 valid pixel and 9 valid pixels are treated the same when updating the current mask). (2) It is incompatible with additional user inputs. We aim at a user-guided image inpainting system where users can optionally provide sparse sketches inside the mask as conditional channels. In this situation, should these pixel locations be considered as valid or invalid? How should the mask be properly updated for the next layer? (3) For partial convolution the invalid pixels progressively disappear in deep layers, gradually converting all mask values to ones, as shown in Figure 3. However, our study shows that if we allow the network to learn optimal masks by itself, the network assigns soft mask values to every spatial location (Figure 3). (4) All channels in each layer share the same mask, which limits the flexibility. In fact, partial convolution can be considered as a hard-gating single-channel un-learnable layer multiplied to each input feature map.
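For illustration, the masking, re-normalization and hard mask-update steps can be sketched in NumPy as follows. This is a single-channel, stride-1, 'valid'-padding toy version, not the implementation of [Liu et al., 2018]:

```python
import numpy as np

def partial_conv2d(x, mask, w):
    """Single-channel partial convolution sketch (stride 1, 'valid' padding).

    x: (H, W) input, mask: (H, W) binary validity mask, w: (k, k) filter.
    Returns the re-normalized output and the hard-updated mask."""
    k = w.shape[0]
    out_h, out_w = x.shape[0] - k + 1, x.shape[1] - k + 1
    out = np.zeros((out_h, out_w))
    new_mask = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            m = mask[i:i + k, j:j + k]
            s = m.sum()
            if s > 0:
                # condition only on valid pixels, re-normalized by the
                # number of valid pixels under the filter window
                out[i, j] = np.sum(w * x[i:i + k, j:j + k] * m) / s
                new_mask[i, j] = 1.0  # any valid pixel marks location valid
    return out, new_mask
```

Note how `new_mask` is set to one whether a single pixel or all pixels under the filter window are valid, which is exactly issue (1) above.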
We propose gated convolution for the image inpainting network. Instead of hard masks updated with rules, gated convolutions learn soft masks automatically from data. It can be expressed as:

$$\begin{aligned} Gating_{y,x} &= \sum\sum W_g \cdot I \\ Feature_{y,x} &= \sum\sum W_f \cdot I \\ O_{y,x} &= \phi(Feature_{y,x}) \odot \sigma(Gating_{y,x}) \end{aligned}$$

where $\sigma$ is the sigmoid function, thus the output gating values are between zeros and ones; $\phi$ can be any activation function (e.g. ReLU or LeakyReLU); and $W_g$ and $W_f$ are two different convolutional filters.
The proposed gated convolution enables the network to learn a dynamic feature selection mechanism for each channel and each spatial location. Interestingly, visualization (Figure 3) of intermediate gating values shows that it learns to select the feature maps not only according to backgrounds, masks and sketches, but also considering semantic segmentation in some channels. Even in deep layers, gated convolution learns to highlight the masked regions and sketch information in separate channels to better generate inpainting results.
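As a concrete illustration of the formulation, a minimal single-channel NumPy version of gated convolution might look as follows (stride 1, 'valid' padding; a sketch rather than the trained network's implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_conv2d(x, w_f, w_g, phi=np.tanh):
    """Single-channel gated convolution sketch (stride 1, 'valid' padding).

    x: (H, W) input; w_f, w_g: (k, k) feature and gating filters;
    phi: any activation function for the feature branch."""
    k = w_f.shape[0]
    out_h, out_w = x.shape[0] - k + 1, x.shape[1] - k + 1
    feature = np.zeros((out_h, out_w))
    gating = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i:i + k, j:j + k]
            feature[i, j] = np.sum(w_f * patch)  # Feature = sum(W_f . I)
            gating[i, j] = np.sum(w_g * patch)   # Gating  = sum(W_g . I)
    # O = phi(Feature) * sigmoid(Gating): a learned soft mask per location
    return phi(feature) * sigmoid(gating)
```

Unlike partial convolution's hard 0/1 mask, $\sigma(Gating)$ is a learned soft value in $(0, 1)$ that can differ per location, and per channel when multiple gating filters are used.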
3.2. Spectral-Normalized Markovian Discriminator (SN-PatchGAN)
Previous image inpainting networks try to complete images with a single rectangular hole. An additional local GAN using a patch surrounding that hole is used to improve results [Iizuka et al., 2017; Yu et al., 2018]. However, we consider the task of free-form image inpainting where there may be multiple holes with any shapes and at any locations. Motivated by global and local GANs [Iizuka et al., 2017], MarkovianGANs [Li and Wand, 2016; Isola et al., 2016], perceptual loss [Johnson et al., 2016] and recent work on spectral-normalized GANs [Miyato et al., 2018], we developed a simple yet highly effective GAN loss, SN-PatchGAN, for training free-form image inpainting networks. It is described in detail below. SN-PatchGAN is fast and stable during GAN training and produces high-quality inpainting results.
A convolutional neural network is used as the discriminator, where the input consists of image, mask and guidance channels, and the output is a 3-D feature map of shape $\mathbb{R}^{h \times w \times c}$, where $h$, $w$ and $c$ represent the height, width and number of channels respectively. As shown in Figure 4, six strided convolutions are stacked to capture the feature statistics of Markovian patches [Li and Wand, 2016]. We then directly apply GANs for each feature element in this feature map, formulating $h \times w \times c$ GANs focusing on different locations and different semantics (represented in different channels) of the input image. Note that the receptive field of each point in the output map can still cover the entire input image in our training setting, thus a global discriminator is not necessary.
We adopt the recently proposed weight normalization technique called spectral normalization [Miyato et al., 2018] to further stabilize the training of GANs. We use the default fast approximation algorithm of spectral normalization described in SN-GANs [Miyato et al., 2018]. To discriminate whether the input is real or fake, we also use the hinge loss as the objective function:

$$\mathcal{L}_{D^{sn}} = \mathbb{E}_{x \sim \mathbb{P}_{data}(x)}\left[\mathrm{ReLU}(1 - D^{sn}(x))\right] + \mathbb{E}_{z \sim \mathbb{P}_{z}(z)}\left[\mathrm{ReLU}(1 + D^{sn}(G(z)))\right]$$

$$\mathcal{L}_{G} = -\mathbb{E}_{z \sim \mathbb{P}_{z}(z)}\left[D^{sn}(G(z))\right]$$

where $D^{sn}$ represents the spectral-normalized discriminator and $G$ is the image inpainting network that takes the incomplete image $z$.
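For reference, the fast approximation amounts to one step of power iteration per training update on each (reshaped) weight matrix. The following NumPy sketch illustrates the idea; variable names `u`, `v` and `sigma` follow the SN-GAN formulation, and this is not the authors' code:

```python
import numpy as np

def spectral_normalize(w, u, n_iter=1):
    """Power-iteration estimate of the spectral norm, as in SN-GAN.

    w: 2-D weight matrix (a conv kernel reshaped to (out_ch, -1));
    u: persistent estimate of the left singular vector, reused across
    training steps so a single iteration per step suffices in practice."""
    v = None
    for _ in range(n_iter):
        v = w.T @ u
        v = v / (np.linalg.norm(v) + 1e-12)
        u = w @ v
        u = u / (np.linalg.norm(u) + 1e-12)
    sigma = u @ w @ v   # approximates the largest singular value of w
    return w / sigma, u  # weight rescaled to spectral norm ~ 1
```

In a training loop, `u` is stored alongside the weight and updated in place, so the extra cost per step is negligible.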
With SN-PatchGAN, our inpainting network trains faster per batch of samples than the baseline model [Yu et al., 2018]. We do not use perceptual loss, since similar patch-level information is already encoded in SN-PatchGAN. Unlike PartialConv [Liu et al., 2018], in which different loss terms and balancing hyper-parameters are used, our final objective function for the inpainting network is composed only of a pixel-wise reconstruction loss and the SN-PatchGAN loss, with a default loss balancing hyper-parameter.
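Under the hinge formulation above, the per-point losses can be sketched as follows; `real_map` and `fake_map` are hypothetical names for the discriminator's $h \times w \times c$ output maps on real and generated images:

```python
import numpy as np

def sn_patchgan_d_loss(real_map, fake_map):
    """Hinge loss applied independently at every point of the output map,
    then averaged: one GAN per location and per channel."""
    relu = lambda z: np.maximum(z, 0.0)
    return np.mean(relu(1.0 - real_map)) + np.mean(relu(1.0 + fake_map))

def sn_patchgan_g_loss(fake_map):
    """Generator loss: raise the discriminator's score on generated pixels."""
    return -np.mean(fake_map)
```

The final generator objective is then the pixel-wise reconstruction loss plus this GAN term.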
3.3. Inpainting Network Architecture
We use a state-of-the-art generative inpainting network and customize it with the proposed gated convolutions and SN-PatchGAN loss. Specifically, we adopt the full model architecture in [Yu et al., 2018] with both coarse and refinement networks. The coarse network is shown in Figure 3 (for simplicity, the refinement network is omitted in the figure; its details can be found in [Yu et al., 2018]). The refinement network with the contextual attention module especially improves the sharpness of texture details.
For both the coarse and refinement networks, we use a simple encoder-decoder network [Yu et al., 2018] instead of the U-Net used in PartialConv [Liu et al., 2018]. We found that skip connections in a U-Net [Ronneberger et al., 2015] have no significant effect for non-narrow masks. This is mainly because, for the center of a masked region, the inputs to these skip connections are almost zeros and thus cannot propagate detailed color or texture information to the decoder for that region. This can be explained by visualization of the learned feature representation of the encoder. For hole boundaries, our encoder-decoder architecture equipped with gated convolution is sufficient to generate seamless results.
We replace all vanilla convolutional layers of [Yu et al., 2018] with gated convolutions. One potential problem is that gated convolutions introduce additional parameters. To maintain the same efficiency as our baseline model [Yu et al., 2018], we slim the base model width and have not found an apparent performance drop, either quantitatively or qualitatively. Our inpainting network is trained in an end-to-end manner and can be tested on free-form holes at arbitrary locations. Since our network is fully convolutional, it also supports varied input resolutions.
3.4. Free-Form Masks Generation
| Method | mean $\ell_1$ loss (center rect.) | mean $\ell_2$ loss (center rect.) | mean TV loss (center rect.) | mean $\ell_1$ loss (free-form) | mean $\ell_2$ loss (free-form) | mean TV loss (free-form) |
|---|---|---|---|---|---|---|
| PatchMatch [Barnes et al., 2009] | 16.1% | 3.9% | 25.0% | 11.3% | 2.4% | 26.1% |
| Global&Local [Iizuka et al., 2017] | 9.3% | 2.2% | 26.7% | 21.6% | 7.1% | 31.4% |
| ContextAttention [Yu et al., 2018] | 8.6% | 2.1% | 25.3% | 17.2% | 4.7% | 27.8% |
| PartialConv* [Liu et al., 2018] | 9.8% | 2.3% | 26.9% | 10.4% | 1.9% | 27.0% |
The algorithm to automatically generate free-form masks is important and non-trivial. The sampled masks, in essence, should be (1) similar in shape to holes drawn in real use-cases, (2) diverse, to avoid over-fitting, (3) efficient in computation and storage, and (4) controllable and flexible. The previous method [Liu et al., 2018] collects a fixed set of irregular masks from an occlusion estimation method between two consecutive frames of videos. Although random dilation, rotation and cropping are added to increase its diversity, the method does not meet the other requirements listed above.
We introduce a simple algorithm to automatically generate random free-form masks on-the-fly during training. For the task of hole filling, users behave as if using an eraser, brushing back and forth to mask out undesired regions. This behavior can be simply simulated with a randomized algorithm that draws lines and rotates angles repeatedly. To ensure the smoothness of strokes, we also draw a circle at the joint between two lines.
We use maxVertex, maxLength, maxWidth and maxAngle as four hyper-parameters to control the mask generation process. In this way, the generated masks can have large variety. Moreover, our algorithm generates masks on-the-fly with little computational overhead and no storage is required. In practice, the computation of free-form masks on the CPU can easily be hidden behind training networks on the GPU in modern deep learning frameworks such as TensorFlow [Abadi et al., 2015].
The overall mask generation algorithm is illustrated in Algorithm 1. Additionally, we can sample multiple strokes in a single image to mask multiple regions. We can also add regular masks (e.g. rectangles) on top of sampled free-form masks. Example masks compared with the previous method [Liu et al., 2018] are shown in Figure 5.
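A self-contained sketch of such a generator is shown below. It follows the maxVertex/maxLength/maxWidth/maxAngle parameterization described above but replaces an image library's line-drawing primitives with a brute-force point-to-segment distance test (which also yields circular caps at stroke joints), so it is an illustrative version rather than Algorithm 1 verbatim:

```python
import numpy as np

def random_free_form_mask(h, w, max_vertex=8, max_length=40,
                          max_width=10, max_angle=np.pi, rng=None):
    """Generate one random brush stroke as a binary (h, w) mask."""
    rng = np.random.default_rng() if rng is None else rng
    mask = np.zeros((h, w), dtype=np.uint8)
    ys, xs = np.mgrid[0:h, 0:w]          # pixel coordinate grids
    num_vertex = rng.integers(1, max_vertex + 1)
    x, y = rng.integers(0, w), rng.integers(0, h)
    angle = rng.uniform(0, 2 * np.pi)
    for _ in range(num_vertex):
        # turn by a random angle, then walk a random length
        angle += rng.uniform(-max_angle, max_angle)
        length = rng.integers(1, max_length + 1)
        width = rng.integers(1, max_width + 1)
        nx = int(np.clip(x + length * np.cos(angle), 0, w - 1))
        ny = int(np.clip(y + length * np.sin(angle), 0, h - 1))
        # mark every pixel within `width` of the segment (x,y)-(nx,ny);
        # the rounded endpoints act as the circles drawn at joints
        px, py = nx - x, ny - y
        seg_len2 = max(px * px + py * py, 1)
        t = np.clip(((xs - x) * px + (ys - y) * py) / seg_len2, 0, 1)
        dist2 = (xs - (x + t * px)) ** 2 + (ys - (y + t * py)) ** 2
        mask[dist2 <= width ** 2] = 1
        x, y = nx, ny
    return mask
```

Sampling this function several times and taking the union of the results simulates multiple strokes in a single image.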
3.5. Extension to User-Guided Image Inpainting
We use sketches as an example user guidance to extend our image inpainting network to a user-guided system. Sketches (or edges) are simple and intuitive for users to draw. Moreover, it is relatively easy to obtain training data for them. We show two cases with faces and natural scenes. For faces, we extract landmarks and connect related landmarks as sketches, as shown in Figure 6. The motivation is that (1) for users, the regions of interest are most likely around face landmarks, and (2) algorithms for detecting face landmarks are much more robust than edge detectors. For natural scene images, we directly extract edge maps using the HED edge detector [Xie and Tu, 2015] and set all values above a certain threshold (0.6) to ones, as shown in Figure 6.
For training the user-guided image inpainting system, intuitively we would need an additional constraint loss to enforce that the network generates results conditioned on the user guidance. However, we find that with the same combination of pixel-wise reconstruction loss and GAN loss (with the conditional channels as inputs to the discriminator), we are able to learn a conditional generative network in which the generated results respect the user inputs faithfully. We also tried adding a pixel-wise loss on HED [Xie and Tu, 2015] output features, with the raw image and the generated result as inputs, to enforce the constraint, but it did not further boost performance.
We first evaluate our proposed inpainting model on Places2 [Zhou et al., 2017] and CelebA-HQ faces [Karras et al., 2017]. Our model has a total of 4.1M parameters, and is trained with TensorFlow v1.8, CUDNN v7.0 and CUDA v9.0. For testing, it runs at 0.21 seconds per image on a single NVIDIA(R) Tesla(R) V100 GPU and 1.9 seconds on an Intel(R) Xeon(R) CPU @ 2.00GHz on average, regardless of hole sizes. For efficiency, we can also restrict the search range of the contextual attention module from the whole image to a local neighborhood, which makes the run-time significantly faster while maintaining the overall quality of the results.
4.1. Quantitative Results
As mentioned in [Yu et al., 2018], image inpainting lacks good quantitative evaluation metrics. Nevertheless, we report our evaluation results in terms of mean $\ell_1$ error, mean $\ell_2$ error and total variation (TV) loss on validation images of Places2, with both center rectangle masks and free-form masks, for reference in Table 2. As shown in the table, learning-based methods perform better in terms of $\ell_1$ and $\ell_2$ errors, while PatchMatch [Barnes et al., 2009] has slightly lower TV loss because it directly borrows raw image patches. Moreover, partial convolution implemented within the same framework obtains worse performance in terms of reconstruction loss. This might be due to the fact that partial convolution uses an un-learnable binary mask for all channels in a layer, which has lower representation capacity for learning from the complex distribution of input data.
4.2. Qualitative Comparisons
Next, we compare our model with previous state-of-the-art methods [Iizuka et al., 2017; Yu et al., 2018; Liu et al., 2018]. Free-form masks and sketches are used as inputs for PartialConv and GatedConv, trained with exactly the same settings. For the existing methods without a user-guided option (Global&Local and ContextAttention), we simply evaluate them with the same free-form masks without the sketch channel. Note that for all learning-based methods, no post-processing steps are performed, to ensure fairness of the comparisons.
Figure 7 shows comparisons of different methods on two example images. As reported in [Iizuka et al., 2017], simple uniform regions similar to sky areas in the first example are hard cases for learning-based image inpainting networks. Previous methods with vanilla convolution have obvious visual artifacts in masked regions and edge responses surrounding holes. PartialConv produced better results but still with observable color discrepancy. Our proposed method based on gated convolution obtains a visually pleasing result without noticeable color inconsistency. We also show effects of additional sketch inputs for both examples in Figure 7. Given sparse sketches, our method is able to produce realistic results with seamless boundary transitions. Our generated results nicely follow the user sketches, which is useful for creatively editing image layouts and faces.
4.3. Case Study
Moreover, we specifically study the two most important real image inpainting use cases, object removal and creative editing, and compare our method with the commercial product Photoshop(R) (based on PatchMatch [Barnes et al., 2009]) and the previous state-of-the-art generative inpainting network [Yu et al., 2018].
4.3.1. Object Removal
In the first example, we try to remove three people in Figure 8. We can see that both Content-Aware Fill and our method successfully removed the person on the right. But for the two people on the left, marked with a red box, Content-Aware Fill incorrectly copied half of a person from the left. This example reflects that traditional methods without learning from data can ignore the semantics of images and make critical failures on non-stationary scenes. For the learning-based method with vanilla convolution [Yu et al., 2018], we can observe artifacts near the hole boundaries for both the left and right persons.
4.3.2. Creative Editing
Next we study the use case where users interact with the inpainting algorithm to produce more desired results. The input is shown on the left in Figure 9. We use the sketch input as a guidance image for Content-Aware Fill. We find that it directly copied a tree from the left into the masked regions without any modification, shown with black boxes. The guidance image only serves as initial optimization values for PatchMatch [Barnes et al., 2009]; thus we can observe that the output does not exactly follow the guidance (the tree sketch on the right). Meanwhile, our system can create novel objects that do not exist anywhere in the current image.
4.4. User Study
We perform a user study on the validation set of Places2 [Zhou et al., 2017]. We show either a completed image or a random real image from the dataset to 50 users to evaluate the naturalness of the completion. Figure 10 shows the results of the user study; the numbers are the percentage of images believed to be real by the 50 users, for the ground truth (GT) and our approach. In the study, 88.7% of our results are believed to be real, which is unprecedented among previous inpainting methods. For comparison, the real images are correctly categorized as real 94.3% of the time.
4.5. More Examples and Results
5. Conclusion

We presented a novel free-form image inpainting system based on an end-to-end generative network with gated convolution and a novel GAN loss. We showed that gated convolutions significantly improve inpainting results with free-form masks and user guidance input. We demonstrated sketches as an exemplar user guidance to help users quickly remove distracting objects, modify image layouts, clear watermarks, edit faces and interactively create novel objects in photos. We also visualized the learned feature representations to interpret and understand the proposed gated convolution in the trained inpainting network. Quantitative results, qualitative comparisons and user studies demonstrated the superiority of our proposed free-form image inpainting system.
- Abadi et al.  Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2015. TensorFlow: A System for Large-Scale Machine Learning.
- Ashikhmin  Michael Ashikhmin. 2001. Synthesizing natural textures. In Proceedings of the 2001 symposium on Interactive 3D graphics. ACM, 217–226.
- Ballester et al.  Coloma Ballester, Marcelo Bertalmio, Vicent Caselles, Guillermo Sapiro, and Joan Verdera. 2001. IEEE transactions on image processing 10, 8 (2001), 1200–1211.
- Barnes et al.  Connelly Barnes, Eli Shechtman, Adam Finkelstein, and Dan B Goldman. 2009. PatchMatch: A randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics (TOG) (Proceedings of SIGGRAPH 2009) (2009).
- Bertalmio et al.  Marcelo Bertalmio, Guillermo Sapiro, Vincent Caselles, and Coloma Ballester. 2000. Image inpainting. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques. ACM Press/Addison-Wesley Publishing Co., 417–424.
- Criminisi et al.  Antonio Criminisi, Patrick Pérez, and Kentaro Toyama. 2004. Region filling and object removal by exemplar-based image inpainting. IEEE Transactions on image processing 13, 9 (2004), 1200–1212.
- Darabi et al.  Soheil Darabi, Eli Shechtman, Connelly Barnes, Dan B Goldman, and Pradeep Sen. 2012. Image melding: Combining inconsistent images using patch-based synthesis. ACM Transactions on Graphics (TOG) (Proceedings of SIGGRAPH 2012) (2012).
- Drori et al.  Iddo Drori, Daniel Cohen-Or, and Hezy Yeshurun. 2003. Fragment-based image completion. In ACM Transactions on graphics (TOG). ACM.
- Efros and Freeman  Alexei A Efros and William T Freeman. 2001. Image quilting for texture synthesis and transfer. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques. ACM, 341–346.
- Efros and Leung  Alexei A Efros and Thomas K Leung. 1999. Texture synthesis by non-parametric sampling. In Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, Vol. 2. IEEE, 1033–1038.
- Hays and Efros  James Hays and Alexei A Efros. 2007. Scene completion using millions of photographs. ACM Transactions on Graphics (TOG) (Proceedings of SIGGRAPH 2007). ACM.
- He and Sun  Kaiming He and Jian Sun. 2014. Image completion approaches using the statistics of similar patches. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 12 (2014), 2423–2435.
- Huang et al.  Jia-Bin Huang, Sing Bing Kang, Narendra Ahuja, and Johannes Kopf. 2014. Image completion using planar structure guidance. ACM Transactions on Graphics (TOG) 33, 4 (2014), 129.
- Huang et al.  Jia-Bin Huang, Johannes Kopf, Narendra Ahuja, and Sing Bing Kang. 2013. Transformation guided image completion. In 2013 IEEE International Conference on Computational Photography (ICCP). IEEE, 1–9.
- Iizuka et al.  Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa. 2017. Globally and locally consistent image completion. ACM Transactions on Graphics (TOG) 36, 4 (2017), 107.
- Isola et al.  Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. 2016. Image-to-Image Translation with Conditional Adversarial Networks. arXiv preprint arXiv:1611.07004 (2016).
- Johnson et al.  Justin Johnson, Alexandre Alahi, and Li Fei-Fei. 2016. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision. Springer, 694–711.
- Karras et al.  Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. 2017. Progressive Growing of GANs for Improved Quality, Stability, and Variation. arXiv preprint arXiv:1710.10196 (2017).
- Kwatra et al.  Vivek Kwatra, Irfan Essa, Aaron Bobick, and Nipun Kwatra. 2005. Texture optimization for example-based synthesis. ACM Transactions on Graphics (TOG) 24, 3 (2005), 795–802.
- Li and Wand  Chuan Li and Michael Wand. 2016. Precomputed real-time texture synthesis with markovian generative adversarial networks. In European Conference on Computer Vision. Springer, 702–716.
- Li et al.  Yijun Li, Sifei Liu, Jimei Yang, and Ming-Hsuan Yang. 2017. Generative Face Completion. arXiv preprint arXiv:1704.05838 (2017).
- Liu et al.  Guilin Liu, Fitsum A Reda, Kevin J Shih, Ting-Chun Wang, Andrew Tao, and Bryan Catanzaro. 2018. Image Inpainting for Irregular Holes Using Partial Convolutions. arXiv preprint arXiv:1804.07723 (2018).
- Miyato et al.  Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. 2018. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957 (2018).
- Mount and Arya  David M Mount and Sunil Arya. 1998. ANN: library for approximate nearest neighbour searching. (1998).
- Pathak et al.  Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. 2016. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2536–2544.
- Pavić et al.  Darko Pavić, Volker Schönefeld, and Leif Kobbelt. 2006. Interactive image completion with perspective correction. The Visual Computer 22, 9 (2006), 671–681.
- Ronneberger et al.  Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention. Springer, 234–241.
- Russakovsky et al.  Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. 2015. Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115, 3 (2015), 211–252.
- Sangkloy et al.  Patsorn Sangkloy, Jingwan Lu, Chen Fang, Fisher Yu, and James Hays. 2017. Scribbler: Controlling Deep Image Synthesis With Sketch and Color. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5400–5409.
- Simakov et al.  Denis Simakov, Yaron Caspi, Eli Shechtman, and Michal Irani. 2008. Summarizing visual data using bidirectional similarity. In 2008 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 1–8.
- Sun et al.  Jian Sun, Lu Yuan, Jiaya Jia, and Heung-Yeung Shum. 2005. Image completion with structure propagation. ACM Transactions on Graphics (TOG) 24, 3 (2005), 861–868.
- Wang et al.  Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. 2017. High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs. arXiv preprint arXiv:1711.11585 (2017).
- Whyte et al.  Oliver Whyte, Josef Sivic, and Andrew Zisserman. 2009. Get Out of my Picture! Internet-based Inpainting. In Proceedings of the 20th British Machine Vision Conference, London.
- Xie and Tu  Saining Xie and Zhuowen Tu. 2015. Holistically-nested edge detection. In Proceedings of the IEEE international conference on computer vision. 1395–1403.
- Yu et al.  Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S Huang. 2018. Generative Image Inpainting with Contextual Attention. arXiv preprint arXiv:1801.07892 (2018).
- Zhang et al.  Richard Zhang, Jun-Yan Zhu, Phillip Isola, Xinyang Geng, Angela S Lin, Tianhe Yu, and Alexei A Efros. 2017. Real-time user-guided image colorization with learned deep priors. arXiv preprint arXiv:1705.02999 (2017).
- Zhao et al.  Yinan Zhao, Brian Price, Scott Cohen, and Danna Gurari. 2018. Guided Image Inpainting: Replacing an Image Region by Pulling Content from Another Image. arXiv preprint arXiv:1803.08435 (2018).
- Zhou et al.  Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. 2017. Places: A 10 Million Image Database for Scene Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (2017).