Free-Form Image Inpainting with Gated Convolution

06/10/2018, by Jiahui Yu, et al.

We present a novel deep learning based image inpainting system to complete images with free-form masks and inputs. The system is based on gated convolutions learned from millions of images without additional labelling efforts. The proposed gated convolution solves the issue of vanilla convolution, which treats all input pixels as valid, and generalizes partial convolution by providing a learnable dynamic feature selection mechanism for each channel at each spatial location across all layers. Moreover, as free-form masks may appear anywhere in images with any shape, global and local GANs designed for a single rectangular mask are not suitable. To this end, we also present a novel GAN loss, named SN-PatchGAN, obtained by applying spectral-normalized discriminators on dense image patches. It is simple in formulation, and fast and stable in training. Results on automatic image inpainting and the user-guided extension demonstrate that our system generates higher-quality and more flexible results than previous methods. We show that our system helps users quickly remove distracting objects, modify image layouts, clear watermarks, edit faces and interactively create novel objects in images. Furthermore, visualization of the learned feature representations reveals the effectiveness of gated convolution and provides an interpretation of how the proposed neural network fills in missing regions. More high-resolution results and video materials are available at http://jiahuiyu.com/deepfill2


1. Introduction

Image inpainting (a.k.a. image completion or image hole-filling) is the task of synthesizing alternative content in missing regions such that the modification is visually realistic and semantically correct. It allows users to remove distracting objects or retouch undesired regions in photos. It can also be extended to tasks including image/video un-cropping, rotation, stitching, re-targeting, re-composition, compression, super-resolution, harmonization and many others.

In computer graphics, two broad approaches to image inpainting exist: patch-based methods using low-level features and deep generative models using convolutional neural networks. The former [Efros and Leung, 1999; Barnes et al., 2009; Efros and Freeman, 2001] can synthesize plausible stationary textures, but usually fails critically in non-stationary cases such as complicated scenes, faces and objects. The latter [Iizuka et al., 2017; Yu et al., 2018] can exploit semantics learned from large-scale datasets to fill content in non-stationary images in an end-to-end fashion.

However, deep generative models based on vanilla convolutional networks are naturally ill-fitted for image hole-filling because convolutional filters treat all input pixels as equally valid. For hole-filling, the input images/features are composed of both regions with valid pixels outside holes and invalid or synthesized pixels in masked regions. Vanilla convolutions apply the same filters to all valid, invalid and mixed (on hole boundaries) pixels, leading to visual artifacts such as color discrepancy, blurriness and obvious edge responses surrounding holes when tested on free-form masks.

To address this limitation, partial convolution [Liu et al., 2018] was recently proposed, in which the convolution is masked and re-normalized to be conditioned only on valid pixels. It is followed by a rule-based mask-update step to re-compute the new mask layer by layer. Partial convolution is essentially a hard-gating, single-channel, un-learnable layer multiplied to the input feature maps: it heuristically categorizes all pixel locations as either valid or invalid and multiplies hard-gating values (e.g. ones or zeros) to the input images/features. However, this assumption has several problems. First, if we want to extend it to user-guided image inpainting, where users provide sparse sketches inside the mask as conditional channels, should these pixel locations be considered valid or invalid, and how should the mask be properly updated for the next layer? Second, for partial convolution the invalid pixels progressively disappear in deep layers, leaving all gating values as ones (Figure 3). However, we will show that if we allow the network to learn the optimal gating values by itself, it assigns different gating values to different locations in different channels based on the input masks and sketches, even in deep layers, as shown in the visualization results in Figure 3.

We propose gated convolution, which learns a dynamic feature selection mechanism for each channel and each spatial location (e.g. inside or outside masks, RGB or user-input channels) for the task of free-form image inpainting. Specifically, we consider a formulation in which the input feature is first used to compute gating values g = σ(W_g · I) (σ is the sigmoid function, W_g is a learnable convolutional filter). The final output is a multiplication of the learned feature and the gating values, O = φ(W_f · I) ⊙ σ(W_g · I), in which φ is any activation function. Gated convolution is easy to implement and performs significantly better when (1) the masks have arbitrary shapes and (2) the inputs are no longer simply RGB channels with a mask but also include conditional inputs like sketches. For the network architecture, we stack gated convolutions to form a simple encoder-decoder network [Yu et al., 2018]. Skip connections in a U-Net [Ronneberger et al., 2015], as adopted in some image inpainting networks [Liu et al., 2018], are not effective for non-narrow masks, mainly because the inputs to these skip connections are almost zeros and thus cannot propagate detailed color or texture information to the decoder. This is explained by our visualization of the learned feature representation of the encoder. Our inpainting network also integrates the contextual attention module [Yu et al., 2018] within the same refinement network to better capture long-range dependencies.

Without degradation of performance, we also significantly simplify the training objective into two terms: a pixel-wise reconstruction loss and an adversarial loss. The modification is mainly designed for free-form image inpainting. As the holes may appear anywhere in images with any shape, global and local GANs [Iizuka et al., 2017] designed for a single rectangular mask are not suitable. Instead, we propose a variant of generative adversarial networks, named SN-PatchGAN, motivated by global and local GANs [Iizuka et al., 2017], MarkovianGANs [Li and Wand, 2016], perceptual loss [Johnson et al., 2016] and recent work on spectral-normalized GANs [Miyato et al., 2018]. The discriminator of SN-PatchGAN directly computes the hinge loss on each point of an output feature map of shape h × w × c, in effect formulating h × w × c GANs focusing on different locations and different semantics (represented in different channels) of the input image. SN-PatchGAN is simple in formulation, fast and stable in training, and produces high-quality inpainting results.

For practical image inpainting tools, enabling user interactivity is crucial because there could exist many plausible solutions for filling a hole in an image. To this end, we present an extension to allow user-guided inputs (i.e. sketches). Comparison to other methods is summarized in Table 1. In summary, our contributions are as follows:

  • We introduce gated convolution to learn a dynamic feature selection mechanism for each channel at each spatial location across all layers, significantly improving the color consistency and inpainting quality of free-form masks and inputs.

  • We propose a novel GAN discriminator SN-PatchGAN designed for free-form image inpainting. It is simple, fast and produces high-quality inpainting results.

  • We extend our proposed inpainting model to an interactive one which can take user sketches as guidance to obtain more user-desired inpainting results.

  • For the first time we provide visualization and interpretation of learned CNN feature representation for the image inpainting task. The visualization demonstrates the efficacy of gated convolution in shallow and deep layers.

  • Our proposed generative image inpainting system achieves higher-quality free-form inpainting than previous state-of-the-art methods on benchmark datasets including Places2 natural scenes and CelebA-HQ faces. We show that the proposed system helps users quickly remove distracting objects, modify image layouts, clear watermarks, edit faces and interactively create novel objects in images.

2. Related Work

2.1. Automatic Image Inpainting

A variety of approaches have been proposed for image inpainting. Traditionally, diffusion-based methods [Bertalmio et al., 2000; Ballester et al., 2001] propagate local image appearance around the target holes based on the isophote direction field. They mainly work on small and narrow holes and usually fail on large ones. Patch-based algorithms [Efros and Leung, 1999; Efros and Freeman, 2001] progressively extend pixels close to the hole boundaries based on low-level features (for example, mean squared differences in RGB space) to search for and paste the most similar image patches. These algorithms work well on stationary textural regions but often fail on non-stationary images. Further, Simakov et al. propose a bidirectional similarity synthesis approach [Simakov et al., 2008] to better capture and summarize non-stationary visual data. To reduce the high memory and computation cost of the search, tree-based acceleration structures [Mount and Arya, 1998] and randomized algorithms [Barnes et al., 2009] have been proposed. Moreover, inpainting results are improved by matching local features like image gradients [Ballester et al., 2001; Darabi et al., 2012] and statistics of similar patch offsets [He and Sun, 2014]. In these works, a Markov random field model is usually assumed, and the conditional distribution of a pixel given all its neighbors synthesized so far is estimated by querying the sample image and finding all similar neighborhoods.

Recently, image inpainting systems based on deep learning have been proposed to directly predict pixel values inside masks in an end-to-end manner. A significant advantage of these models is their ability to learn adaptive image features for different semantics, and thus they can synthesize more visually plausible pixels, especially on structured images like faces [Li et al., 2017], objects [Pathak et al., 2016] and natural scenes [Iizuka et al., 2017; Yu et al., 2018]. Among these methods, Iizuka et al. [2017] proposed a fully convolutional image inpainting network with both global and local consistency to handle high-resolution images on a variety of datasets [Russakovsky et al., 2015; Karras et al., 2017; Zhou et al., 2017]. This approach, however, still heavily relies on Poisson image blending with traditional patch-based inpainting results [He and Sun, 2014] as a post-processing step. Yu et al. [2018] proposed an end-to-end image inpainting model that adopts stacked generative networks to further ensure color and texture consistency of generated regions with their surroundings. Moreover, to capture long-range spatial dependencies, a contextual attention module [Yu et al., 2018] is proposed and integrated into the network to explicitly borrow information from distant spatial locations. However, this approach is mainly trained on large rectangular masks and does not generalize well to free-form masks. To better handle irregular masks, partial convolution [Liu et al., 2018] is proposed, in which the convolution is masked and re-normalized to utilize valid pixels only, followed by a mask-update step to re-compute new masks layer by layer.

2.2. Guided Image Inpainting

To improve image inpainting, different user guidance is explored including dots or lines [Ashikhmin, 2001; Drori et al., 2003; Sun et al., 2005; Barnes et al., 2009], structures [Huang et al., 2014], transformation or distortion information [Huang et al., 2013; Pavić et al., 2006] and image exemplars [Criminisi et al., 2004; Kwatra et al., 2005; Hays and Efros, 2007; Whyte et al., 2009; Zhao et al., 2018]. Notably, Hays and Efros [Hays and Efros, 2007] first utilize millions of photographs as a database to search for an example image which is most similar to the input, and then complete the image by cutting and pasting the corresponding regions from the matched image.

2.3. User-Guided Image Synthesis with Deep Learning

Recent advances in conditional generative networks empower user-guided image processing, synthesis and manipulation learned from large-scale datasets. Here we selectively review several related works. Zhang et al. [2017] proposed colorization networks which can take user guidance as additional inputs; the system recommends plausible colors based on the input image and the current user inputs to obtain better colorization. Wang et al. [2017] synthesized high-resolution photo-realistic images from semantic label maps using conditional generative adversarial networks. The Scribbler network [Sangkloy et al., 2017] explored a deep adversarial image synthesis architecture conditioned on sketched boundaries and sparse color strokes to generate realistic cars, bedrooms, or faces.

3. Approach

In this section, we describe the details of the gated convolution, SN-PatchGAN, the design of the inpainting network, the free-form mask generation algorithm, and our extension to allow additional user guidance.

3.1. Gated Convolution

We first explain why the vanilla convolutions used in [Iizuka et al., 2017; Yu et al., 2018] are ill-fitted for the task of free-form image inpainting. We consider a convolutional layer in which a bank of filters is applied to the input feature map to produce an output feature map. Assuming the input is a C-channel feature map, each pixel located at (y, x) in the C'-channel output map is computed as

O_{y,x} = ∑_{i=-k'_h}^{k'_h} ∑_{j=-k'_w}^{k'_w} W_{k'_h+i, k'_w+j} · I_{y+i, x+j}

where (y, x) indexes the output map, k_h and k_w are the kernel height and width (e.g. 3 × 3), k'_h = (k_h − 1)/2, k'_w = (k_w − 1)/2, W represents the convolutional filters, and I and O are the input and output. For simplicity, the bias term of the convolution is ignored in the equation.

It can be observed that for all spatial locations (y, x), the same filters are applied to compute the output in vanilla convolution layers. This makes sense for tasks such as image classification and object detection, where all pixels of the input image are valid, to extract local features in a sliding-window fashion. However, for image inpainting, the input features are composed of both regions with valid pixels outside holes and invalid pixels (in shallow layers) or synthesized pixels (in deep layers) inside masked regions. This causes ambiguity during training and leads to visual artifacts such as color discrepancy, blurriness and obvious edge responses during testing, as reported in [Liu et al., 2018].

Recently proposed partial convolution-based architectures [Liu et al., 2018] use a masking and re-normalization step to make the convolution dependent only on valid pixels. Mathematically, partial convolution is computed as:

O_{y,x} = ∑∑ W · (I ⊙ M) · sum(1)/sum(M), if sum(M) > 0;  O_{y,x} = 0, otherwise,

in which M is the corresponding binary mask, M_{y,x} = 1 represents that the pixel at location (y, x) is valid, M_{y,x} = 0 represents that it is invalid, ⊙ denotes element-wise multiplication, and sum(·) is taken over the current sliding window. After each partial convolution operation, a mask-update step is required to propagate the new mask M' with the following rule:

m'_{y,x} = 1, if sum(M) > 0;  m'_{y,x} = 0, otherwise.
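To make the masked, re-normalized convolution and the mask-update rule above concrete, the following Python/TensorFlow sketch shows one possible implementation. It is an illustrative version with a single-channel mask and the bias term omitted, written for exposition; it is not the implementation released by [Liu et al., 2018], and the function name and arguments are our own.

import tensorflow as tf

def partial_conv2d(x, mask, kernel, strides=1):
    # x: [batch, H, W, C_in] features; mask: [batch, H, W, 1] binary validity map;
    # kernel: [kh, kw, C_in, C_out]. Bias term omitted for simplicity.
    ones_kernel = tf.ones_like(kernel[..., :1, :1])  # [kh, kw, 1, 1]
    strides4 = [1, strides, strides, 1]
    feat = tf.nn.conv2d(x * mask, kernel, strides=strides4, padding="SAME")
    valid_count = tf.nn.conv2d(mask, ones_kernel, strides=strides4, padding="SAME")
    window_size = float(int(kernel.shape[0]) * int(kernel.shape[1]))
    # Re-normalize by sum(1)/sum(M) over each window; zero out windows that
    # contain no valid pixel at all.
    ratio = window_size / tf.maximum(valid_count, 1.0)
    feat = feat * ratio * tf.cast(valid_count > 0, feat.dtype)
    # Mask-update rule: a location becomes valid if its window saw any valid pixel.
    new_mask = tf.cast(valid_count > 0, mask.dtype)
    return feat, new_mask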

Partial convolution [Liu et al., 2018] improves the quality of inpainting on irregular masks, but it still has remaining issues. (1) It heuristically classifies all spatial locations as either valid or invalid. The mask in the next layer is set to one no matter how many pixels are covered by the filter range in the previous layer (e.g. 1 valid pixel and 9 valid pixels are treated the same when updating the current mask). (2) It is incompatible with additional user inputs. We aim at a user-guided image inpainting system where users can optionally provide sparse sketches inside the mask as conditional channels. In this situation, should these pixel locations be considered valid or invalid, and how should the mask be properly updated for the next layer? (3) For partial convolution, the invalid pixels progressively disappear in deep layers, gradually converting all mask values to ones, as shown in Figure 3. However, our study shows that if we allow the network to learn optimal masks by itself, it assigns soft mask values to every spatial location (Figure 3). (4) All channels in each layer share the same mask, which limits the flexibility. In fact, partial convolution can be considered a hard-gating, single-channel, un-learnable layer multiplied to each input feature map.

Figure 2. Illustration of partial convolution (left) and gated convolution (right).

We propose gated convolution for the image inpainting network. Instead of hard masks updated with rules, gated convolutions learn soft masks automatically from data. It can be expressed as:

Gating_{y,x} = ∑∑ W_g · I
Feature_{y,x} = ∑∑ W_f · I
O_{y,x} = φ(Feature_{y,x}) ⊙ σ(Gating_{y,x})

where σ is the sigmoid function, so the output gating values are between zero and one; φ can be any activation function (e.g. ReLU or LeakyReLU); and W_g and W_f are two different convolutional filters.
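As an illustration of this formulation, a gated convolution layer can be sketched in a few lines of Keras code. This is a minimal sketch written for exposition, not the authors' released TensorFlow implementation; it uses a single convolution with doubled output channels and splits the result into feature and gating halves, which is equivalent to applying two separate filters W_f and W_g.

import tensorflow as tf

class GatedConv2D(tf.keras.layers.Layer):
    # One convolution with 2x filters, split into a feature half and a gating half.
    def __init__(self, filters, kernel_size, strides=1, dilation_rate=1,
                 activation=tf.nn.elu, **kwargs):
        super().__init__(**kwargs)
        self.activation = activation
        self.conv = tf.keras.layers.Conv2D(
            2 * filters, kernel_size, strides=strides,
            dilation_rate=dilation_rate, padding="same")

    def call(self, x):
        features, gating = tf.split(self.conv(x), num_or_size_splits=2, axis=-1)
        # O = phi(Feature) * sigmoid(Gating), applied per channel and per location.
        return self.activation(features) * tf.sigmoid(gating)

For example, GatedConv2D(64, 5) would stand in for a vanilla 5 x 5 convolution with 64 output channels; replacing the gating branch with all ones recovers a vanilla convolution, while a fixed single-channel hard gate corresponds to partial convolution.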

The proposed gated convolution enables the network to learn a dynamic feature selection mechanism for each channel and each spatial location. Interestingly, the visualization of intermediate gating values (Figure 3) shows that the network learns to select features not only according to the background, mask and sketch, but also according to semantic segmentation in some channels. Even in deep layers, gated convolution learns to highlight the masked regions and sketch information in separate channels to better generate inpainting results.

Figure 3. Comparison of gated convolution to partial convolution, with visualization and interpretation of the learned gating values. The 1st row shows our inpainting network architecture, based on [Yu et al., 2018] with all convolutions replaced by gated convolutions (for simplicity, the subsequent refinement network of [Yu et al., 2018] is omitted from the figure). With the same settings, we train two models based on gated convolution and partial convolution separately. The 2nd row directly visualizes the intermediate un-normalized gating values; the values differ mainly across three parts: background, mask and sketch. The 3rd row provides an interpretation based on which part(s) have higher gating values. Interestingly, for some channels (e.g. channel 31 of the layer after dilated convolution), the learned gating values follow foreground/background semantic segmentation. For comparison, the 4th row visualizes the un-learnable fixed binary mask of partial convolution. The inpainting results of gated convolution and partial convolution are compared in Section 4.

3.2. Spectral-Normalized Markovian Discriminator (SN-PatchGAN)

Figure 4. Overview of SN-PatchGAN and our architecture for learning the free-form image inpainting network. Details of the generative inpainting network are shown in Figure 3. We aim at free-form image inpainting, for which the masks may appear anywhere in images with any shape. Previous global and local GANs [Iizuka et al., 2017] designed for a single rectangular mask are not suitable. Thus we introduce SN-PatchGAN, which directly applies a GAN loss to each point in the output feature map of a convolutional discriminator. It is simple in formulation, fast and stable in training, and produces high-quality inpainting results. Note that with our choice of convolutional kernel size, the receptive field of each point in the output map can still cover the entire input image in our training setting, thus a global GAN is not used.

Previous image inpainting networks try to complete images with a single rectangular hole. An additional local GAN using a patch surrounding that hole is used to improve results [Iizuka et al., 2017; Yu et al., 2018]. However, we consider the task of free-form image inpainting where there may be multiple holes with any shapes and at any locations. Motivated by global and local GANs [Iizuka et al., 2017], MarkovianGANs [Li and Wand, 2016; Isola et al., 2016], perceptual loss [Johnson et al., 2016] and recent work on spectral-normalized GANs [Miyato et al., 2018], we developed a simple yet highly effective GAN loss, SN-PatchGAN, for training free-form image inpainting networks. It is described in detail below. SN-PatchGAN is fast and stable during GAN training and produces high-quality inpainting results.

A convolutional neural network is used as the discriminator, where the input consists of image, mask and guidance channels, and the output is a 3-D feature map of shape h × w × c, where h, w, c represent the height, width and number of channels respectively. As shown in Figure 4, six strided convolutions with kernel size 5 and stride 2 are stacked to capture the feature statistics of Markovian patches [Li and Wand, 2016]. We then directly apply GANs to each element of this feature map, in effect formulating h × w × c GANs focusing on different locations and different semantics (represented in different channels) of the input image. Note that the receptive field of each point in the output map can still cover the entire input image in our training setting, thus a global discriminator is not necessary.
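The discriminator just described can be sketched as follows. This is a hedged approximation rather than the released model: the channel multipliers are illustrative, and it assumes a SpectralNormalization wrapper layer (available in recent Keras releases; tensorflow_addons provides an equivalent for older versions).

import tensorflow as tf
from tensorflow.keras import layers

def build_sn_patchgan_discriminator(in_channels=5, width=64):
    # in_channels: e.g. RGB image (3) + mask (1) + sketch guidance (1).
    # Channel multipliers below are illustrative, not the exact widths of the paper.
    inputs = layers.Input(shape=[None, None, in_channels])
    x = inputs
    for mult in (1, 2, 4, 4, 4, 4):
        # Kernel size 5, stride 2, spectral normalization on the convolution kernel.
        x = layers.SpectralNormalization(
            layers.Conv2D(width * mult, kernel_size=5, strides=2, padding="same"))(x)
        x = layers.LeakyReLU(0.2)(x)
    # Output is a 3-D feature map; the hinge loss is applied directly to every
    # element of this map (no separate global branch).
    return tf.keras.Model(inputs, x)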

We adopt the recently proposed weight normalization technique called spectral normalization [Miyato et al., 2018] to further stabilize the training of GANs. We use the default fast approximation algorithm of spectral normalization described in SN-GANs [Miyato et al., 2018]. To discriminate whether the input is real or fake, we also use the hinge loss as the objective function:

L_D = E_{x ~ P_data(x)}[ReLU(1 − D^sn(x))] + E_{z ~ P_z(z)}[ReLU(1 + D^sn(G(z)))]
L_G = − E_{z ~ P_z(z)}[D^sn(G(z))]

where D^sn represents the spectral-normalized discriminator and G is the image inpainting network that takes the incomplete image z as input.

With SN-PatchGAN, our inpainting network trains faster per batch than the baseline model [Yu et al., 2018]. We do not use the perceptual loss, since similar patch-level information is already encoded in SN-PatchGAN. Unlike PartialConv [Liu et al., 2018], in which several different loss terms and balancing hyper-parameters are used, the final objective function of our inpainting network is composed only of a pixel-wise ℓ1 reconstruction loss and the SN-PatchGAN loss, with the default loss balancing hyper-parameter set to 1:1.
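A minimal sketch of this objective is given below, assuming the discriminator outputs raw (un-activated) feature maps of shape [batch, h, w, c]; function and argument names are illustrative, not those of the released code.

import tensorflow as tf

def sn_patchgan_losses(d_real, d_fake, real, completed,
                       l1_weight=1.0, gan_weight=1.0):
    # d_real, d_fake: discriminator output feature maps of shape [batch, h, w, c];
    # real, completed: ground-truth and completed images in the same value range.
    # Hinge loss is applied point-wise over the output maps and averaged.
    d_loss = tf.reduce_mean(tf.nn.relu(1.0 - d_real)) + \
             tf.reduce_mean(tf.nn.relu(1.0 + d_fake))
    # Generator objective: pixel-wise l1 reconstruction plus the adversarial term,
    # balanced 1:1 by default.
    l1_loss = tf.reduce_mean(tf.abs(real - completed))
    g_loss = l1_weight * l1_loss - gan_weight * tf.reduce_mean(d_fake)
    return d_loss, g_loss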

3.3. Inpainting Network Architecture

We use a state-of-the-art generative inpainting network [Yu et al., 2018] and customize it with the proposed gated convolutions and SN-PatchGAN loss. Specifically, we adopt the full model architecture in [Yu et al., 2018] with both coarse and refinement networks. The coarse network is shown in Figure 3 (for simplicity, the refinement network is omitted from the figure; its details can be found in [Yu et al., 2018]). The refinement network with the contextual attention module especially improves the sharpness of texture details.

For both the coarse and refinement networks, we use a simple encoder-decoder network [Yu et al., 2018] instead of the U-Net used in PartialConv [Liu et al., 2018]. We found that skip connections in a U-Net [Ronneberger et al., 2015] have no significant effect for non-narrow masks, mainly because for the center of a masked region the inputs to these skip connections are almost zeros and thus cannot propagate detailed color or texture information to the decoder for that region. This can be explained by the visualization of the learned feature representation of the encoder. For hole boundaries, our encoder-decoder architecture equipped with gated convolution is sufficient to generate seamless results.

We replace all vanilla convolutional layers of the baseline [Yu et al., 2018] with gated convolutions. One potential concern is that gated convolutions introduce additional parameters. To maintain the same efficiency as the baseline model [Yu et al., 2018], we slim the base model width and have found no apparent performance drop, either quantitatively or qualitatively. Our inpainting network is trained in an end-to-end manner and can be tested on free-form holes at arbitrary locations. Since our network is fully convolutional, it also supports varying input resolutions.

3.4. Free-Form Masks Generation

Figure 5. Comparison of free-form masks. The left two masks are from [Liu et al., 2018]. The right two masks are sampled from our automatic algorithm. Details can be found in Algorithm 1.
                                     center rectangle masks            free-form masks
Method                               mean ℓ1    mean ℓ2    mean TV     mean ℓ1    mean ℓ2    mean TV
PatchMatch [Barnes et al., 2009]     16.1%      3.9%       25.0%       11.3%      2.4%       26.1%
Global&Local [Iizuka et al., 2017]   9.3%       2.2%       26.7%       21.6%      7.1%       31.4%
ContextAttention [Yu et al., 2018]   8.6%       2.1%       25.3%       17.2%      4.7%       27.8%
PartialConv* [Liu et al., 2018]      9.8%       2.3%       26.9%       10.4%      1.9%       27.0%
Ours                                 8.6%       2.0%       26.6%       9.1%       1.6%       26.8%
Table 2. For reference, results of mean ℓ1 error, mean ℓ2 error and mean TV loss on validation images of Places2, with both center rectangle masks and free-form masks. Both PartialConv* and ours are trained on the same random combination of rectangle and free-form masks. No edge guidance is used in training/inference to ensure a fair comparison. * denotes our implementation within the same framework, because the official implementation and models are unavailable.

The algorithm to automatically generate free-form masks is important and non-trivial. The sampled masks should, in essence, be (1) similar in shape to the holes drawn in real use cases, (2) diverse to avoid over-fitting, (3) efficient in computation and storage, and (4) controllable and flexible. The previous method [Liu et al., 2018] collects a fixed set of irregular masks from an occlusion estimation method applied between two consecutive frames of videos. Although random dilation, rotation and cropping are added to increase diversity, the method does not meet the other requirements listed above.

We introduce a simple algorithm to automatically generate random free-form masks on-the-fly during training. For the task of hole filling, users behave as if using an eraser, brushing back and forth to mask out undesired regions. This behavior can be simulated with a randomized algorithm that repeatedly draws lines and rotates the drawing angle. To ensure the smoothness of the strokes, we also draw a circle at the joint between two consecutive lines.

We use maxVertex, maxLength, maxBrushWidth and maxAngle as four hyper-parameters to control the mask generation process. In this way, the generated masks have large variety. Moreover, our algorithm generates masks on-the-fly with little computational overhead and requires no storage. In practice, the computation of free-form masks on the CPU can easily be hidden behind training the network on the GPU in modern deep learning frameworks such as TensorFlow [Abadi et al., 2015].

The overall mask generation algorithm is illustrated in Algorithm 1. Additionally, we can sample multiple strokes in a single image to mask multiple regions, and we can add regular (e.g. rectangular) masks on top of the sampled free-form masks. Example masks, compared with the previous method [Liu et al., 2018], are shown in Figure 5.

mask = zeros(imageHeight, imageWidth)
numVertex = random.uniform(maxVertex)
startX = random.uniform(imageWidth)
startY = random.uniform(imageHeight)
for i = 1 to numVertex do
     angle = random.uniform(maxAngle)
     if  (i % 2 == 0)  then
         angle = 2 * pi - angle // comment: reverse mode
     end if
     length = random.uniform(maxLength)
     brushWidth = random.uniform(maxBrushWidth)
     Draw line from point (startX, startY) with angle, length and brushWidth as line width.
     startX = startX + length * sin(angle)
     startY = startY + length * cos(angle)
     Draw a circle at point (startX, startY) with radius as half of brushWidth. // comment: ensure smoothness of strokes.
end for
mask = random.flipLeftRight(mask)
mask = random.flipTopBottom(mask)
Algorithm 1 Algorithm for sampling free-form training masks. maxVertex, maxLength, maxBrushWidth, maxAngle are four hyper-parameters to control the mask generation.
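Below is a minimal NumPy/OpenCV sketch of Algorithm 1 for reference. The default hyper-parameter values and the brush-width lower bound are illustrative choices, not the settings used for training in the paper.

import numpy as np
import cv2

def random_free_form_mask(height, width, max_vertex=20, max_length=100,
                          max_brush_width=24, max_angle=2 * np.pi):
    # Simulate eraser-like brush strokes: repeatedly draw a line with random angle,
    # length and width, and a circle at each joint to keep the stroke smooth.
    mask = np.zeros((height, width), np.float32)
    num_vertex = np.random.randint(1, max_vertex + 1)
    start_x = int(np.random.randint(width))
    start_y = int(np.random.randint(height))
    for i in range(num_vertex):
        angle = np.random.uniform(0, max_angle)
        if i % 2 == 0:
            angle = 2 * np.pi - angle  # reverse mode
        length = np.random.uniform(0, max_length)
        brush_width = int(np.random.uniform(1, max_brush_width))
        end_x = int(start_x + length * np.sin(angle))
        end_y = int(start_y + length * np.cos(angle))
        cv2.line(mask, (start_x, start_y), (end_x, end_y), 1.0, brush_width)
        cv2.circle(mask, (end_x, end_y), brush_width // 2, 1.0, -1)  # smooth joint
        start_x, start_y = end_x, end_y
    if np.random.rand() > 0.5:
        mask = np.fliplr(mask)
    if np.random.rand() > 0.5:
        mask = np.flipud(mask)
    return mask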

3.5. Extension to User-Guided Image Inpainting

Figure 6. For the face dataset (on the left), we directly detect face landmarks and connect related nearby landmarks as training sketches, which is extremely robust and useful for editing faces. For natural scenes (on the right), we use the HED [Xie and Tu, 2015] model with a threshold (0.6) to extract binary sketches.

We use sketches as an example of user guidance to extend our image inpainting network to a user-guided system. Sketches (or edges) are simple and intuitive for users to draw, and it is relatively easy to obtain training data for them. We show two cases: faces and natural scenes. For faces, we extract landmarks and connect related landmarks as sketches, as shown in Figure 6. The motivation is that (1) for users, the regions of interest are most likely around face landmarks, and (2) algorithms for detecting face landmarks are much more robust than edge detectors. For natural scene images, we directly extract edge maps using the HED edge detector [Xie and Tu, 2015] and set all values above a certain threshold (i.e. 0.6) to ones, as shown in Figure 6.
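For the natural-scene case, the binarization step is a one-liner. The sketch below assumes edge_prob is an HED edge-probability map in [0, 1] produced by any HED implementation (the edge model itself is not shown).

import numpy as np

def binarize_hed_edges(edge_prob, threshold=0.6):
    # Pixels whose HED edge probability exceeds the threshold become sketch pixels
    # (ones); everything else becomes zero.
    return (edge_prob >= threshold).astype(np.float32)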

For training the user-guided image inpainting system, one might expect an additional constraint loss to be needed to enforce that the network generates results conditioned on the user guidance. However, we find that with the same combination of pixel-wise reconstruction loss and GAN loss (with the conditional channels as inputs to the discriminator), we are able to learn a conditional generative network whose results respect the user inputs faithfully. We also tried an additional pixel-wise loss on HED [Xie and Tu, 2015] output features, computed with the raw image and the generated result as inputs, to enforce the constraint, but it did not further boost performance.

4. Results

Figure 7. Qualitative Comparisons on the Places2 and CelebA-HQ validation sets.

We first evaluate our proposed inpainting model on Places2 [Zhou et al., 2017] and CelebA-HQ faces [Karras et al., 2017]. Our model has a total of 4.1M parameters, and is trained with TensorFlow v1.8, CUDNN v7.0, CUDA v9.0. For testing, it runs at 0.21 seconds per image on a single NVIDIA(R) Tesla(R) V100 GPU and 1.9 seconds on an Intel(R) Xeon(R) CPU @ 2.00GHz for images of resolution 512 × 512 on average, regardless of hole size. For efficiency, we can also restrict the search range of the contextual attention module from the whole image to a local neighborhood, which makes the run-time significantly faster while maintaining the overall quality of the results.

4.1. Quantitative Results

As mentioned in [Yu et al., 2018], image inpainting lacks good quantitative evaluation metrics. Nevertheless, we report our evaluation results in terms of mean ℓ1 error, mean ℓ2 error and total variation (TV) loss on validation images of Places2, with both center rectangle masks and free-form masks, for reference in Table 2. As shown in the table, learning-based methods perform better in terms of ℓ1 and ℓ2 errors, while PatchMatch [Barnes et al., 2009] has slightly lower TV loss because it directly borrows raw image patches. Moreover, partial convolution implemented within the same framework obtains worse performance in terms of the reconstruction losses. This might be due to the fact that partial convolution uses an un-learnable binary mask shared by all channels in a layer, which has lower representation capacity for learning from the complex distribution of the input data.
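For reference, the metrics reported in Table 2 can be computed roughly as in the sketch below, assuming images are float arrays in [0, 1]; the exact normalization of the TV loss used in the paper may differ.

import numpy as np

def inpainting_metrics(real, completed):
    # real, completed: float arrays in [0, 1] of shape [H, W, C].
    l1 = np.mean(np.abs(real - completed))       # mean l1 error
    l2 = np.mean((real - completed) ** 2)        # mean l2 error
    # One common total-variation loss: mean absolute difference between
    # neighboring pixels of the completed image.
    tv = np.mean(np.abs(completed[1:, :, :] - completed[:-1, :, :])) + \
         np.mean(np.abs(completed[:, 1:, :] - completed[:, :-1, :]))
    return l1, l2, tv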

Figure 8. Comparison results of object removal case study.
Figure 9. Comparison results of creative editing case study.

4.2. Qualitative Comparisons

Next, we compare our model with previous state-of-the-art methods [Iizuka et al., 2017; Yu et al., 2018; Liu et al., 2018]. Free-form masks and sketches are used as inputs for PartialConv and GatedConv trained with exactly the same settings. For the existing methods without user-guided option (Global&Local and ContextAttention), we simply evaluate them with the same free-form masks without the sketch channel. Note that for all learning-based methods, no post-processing steps are performed to ensure fairness of the comparisons.

Figure 7 shows comparisons of different methods on two example images. As reported in [Iizuka et al., 2017], simple uniform regions similar to sky areas in the first example are hard cases for learning-based image inpainting networks. Previous methods with vanilla convolution have obvious visual artifacts in masked regions and edge responses surrounding holes. PartialConv produced better results but still with observable color discrepancy. Our proposed method based on gated convolution obtains a visually pleasing result without noticeable color inconsistency. We also show effects of additional sketch inputs for both examples in Figure 7. Given sparse sketches, our method is able to produce realistic results with seamless boundary transitions. Our generated results nicely follow the user sketches, which is useful for creatively editing image layouts and faces.

4.3. Case Study

Moreover, we specifically study the two most important real-world inpainting use cases, object removal and creative editing, and compare our method with the commercial product Photoshop(R) (Content-Aware Fill, based on PatchMatch [Barnes et al., 2009]) and the previous state-of-the-art generative inpainting network [Yu et al., 2018].

4.3.1. Object Removal

In the first example, we try to remove three people in Figure 8. We can see that both Content-Aware Fill and our method successfully removed the person on the right. But for the two people on the left, marked with the red box, Content-Aware Fill incorrectly copied half of a person from the left. This example reflects that traditional methods, which do not learn from data, can ignore the semantics of images and make critical failures in non-stationary scenes. For the learning-based method with vanilla convolution of Yu et al. [2018], we can observe artifacts near the hole boundaries for both the left and right persons.

4.3.2. Creative Editing

Next we study a use case where users want to interact with the inpainting algorithm to produce more desired results. The input is shown on the left in Figure 9. We use the sketch input as a guidance image for Content-Aware Fill. We find that it directly copied a tree from the left into the masked region without any modification, shown with black boxes. The guidance image serves only as the initialization for the PatchMatch [Barnes et al., 2009] optimization, thus the output does not exactly follow the guidance (the tree sketch on the right). Meanwhile, our system can create novel objects that do not exist anywhere in the current image.

4.4. User Study

Figure 10. User study for evaluating the naturalness.
Figure 11. More results from our free-form inpainting system on faces.
Figure 12. More results from our free-form inpainting system on natural images (1).
Figure 13. More results from our free-form inpainting system on natural images (2).

We perform a user study on the validation set of Places2 [Zhou et al., 2017]. We show either a fully completed image or a real image from the dataset to 50 users and ask them to judge the naturalness of what they see. Figure 10 summarizes the results: the numbers are the percentage of images believed to be real by the 50 users, for the ground truth (GT) and for our approach. In the study, 88.7% of our results are believed to be real, which is unprecedented among previous inpainting methods. For comparison, the real images are correctly identified as real 94.3% of the time.

4.5. More Examples and Results

We show more results on CelebA-HQ validation set in Figure 11 and results from our model on natural images in Figure 12 and Figure 13.

5. Conclusions

We presented a novel free-form image inpainting system based on an end-to-end generative network with gated convolution and a novel GAN loss. We showed that gated convolutions significantly improve inpainting results with free-form masks and user-guidance input. We demonstrated sketches as an exemplar user guidance to help users quickly remove distracting objects, modify image layouts, clear watermarks, edit faces and interactively create novel objects in photos. We also visualized the learned feature representations to interpret and understand the proposed gated convolution in the trained inpainting network. Quantitative results, qualitative comparisons and user studies demonstrated the superiority of our proposed free-form image inpainting system.

References