Image Inpainting using Partial Convolution

08/19/2021 ∙ by Harsh Patel, et al. ∙ 5

Image inpainting is a popular task in image processing with broad applications in computer vision. In many practical settings, images are degraded by noise or contain corrupted, lost, or undesirable information. Various restoration techniques, both classical and deep-learning-based, have been used to handle such issues. Traditional methods include filling gap pixels from nearby known pixels or from a moving average over them. The aim of this paper is to perform image inpainting using robust deep learning methods that use partial convolution layers.




1 Problem Statement

Image Inpainting is an image reconstruction technique in which missing sections of an image (holes) are filled in, or "predicted". Its applications include removing unwanted objects from an image, restoring damaged portions of old images, and removing unwanted text. Classical vision approaches to Image Inpainting propagate information from the unmasked parts of the image into the holes. These methods have shown varied success, but they all lack the ability to reconstruct the holes semantically. Recent work on Image Inpainting has predominantly used learning-based approaches, which can effectively learn the semantics of the image. Generative Adversarial Networks (GANs) have also been used in this domain to generate satisfactory results.


This project is inspired by the original paper "Image Inpainting for Irregular Holes Using Partial Convolutions" [5]. This approach can restore images with holes of arbitrary shape. The central idea of the work lies in its Partial Convolution layers and in its loss networks, which compare specific high-level features.

The following sections of the report describe the Approach, the Quantitative and Qualitative results, and the Key Observations.

2 Approach

In this project, we use Partial Convolution layers with a mask update step to perform the inpainting operation. The following subsections describe the approach in detail.

Figure 1: Pipeline of the approach

2.1 Dataset

We use CelebA-HQ/256 [6], a large-scale dataset of 30,000 human face images, each of size 256x256. We divided the dataset into train, validation, and test sets in the proportions 70%, 15%, and 15%. For the applications described in later sections, we use 256x256 samples from the Places2 dataset [10].
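The 70/15/15 split can be sketched as follows. This is only an illustration: the shuffling, the fixed seed, and the helper name `split_indices` are assumptions, since the report does not say how images were assigned to each subset.

```python
import random

def split_indices(n_images, fractions=(0.70, 0.15, 0.15), seed=0):
    """Shuffle image indices and cut them into train/validation/test lists."""
    indices = list(range(n_images))
    random.Random(seed).shuffle(indices)   # assumed: a seeded random shuffle
    n_train = round(n_images * fractions[0])
    n_val = round(n_images * fractions[1])
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]       # the remainder goes to the test set
    return train, val, test

train_ids, val_ids, test_ids = split_indices(30_000)
print(len(train_ids), len(val_ids), len(test_ids))  # 21000 4500 4500
```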

2.2 Partial Convolution Layers

A partial convolution, as the name suggests, is very similar to a usual convolution layer, except that it performs the convolution only over input pixels that are not currently masked, i.e. that are "available". More formally, if W is the convolution filter, b is its corresponding bias, M is the binary mask for the current convolution window, and X are the image features in that window, then the partial convolution is defined as

$$
x' = \begin{cases} W^{T}(X \odot M)\,\dfrac{\mathrm{sum}(\mathbf{1})}{\mathrm{sum}(M)} + b, & \text{if } \mathrm{sum}(M) > 0 \\ 0, & \text{otherwise} \end{cases}
$$

Note that ⊙ denotes element-wise multiplication and 1 is a vector with the same shape as M whose elements are all one.
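A minimal NumPy sketch of this operation for a single channel is given below. It assumes stride 1, an odd kernel size, and zero padding that is treated as invalid; the function name `partial_conv2d` and these choices are illustrative assumptions, not the project's actual implementation. The response over the valid pixels is rescaled by the ratio of the window size to the number of valid pixels, and windows without any valid pixel produce zero.

```python
import numpy as np

def partial_conv2d(x, mask, weight, bias=0.0):
    """Single-channel partial convolution with stride 1 and zero padding.

    x, mask: 2D arrays of equal shape; mask is 1 for valid pixels, 0 for holes.
    weight:  2D square kernel with odd side length.
    Returns the output feature map and the updated mask.
    """
    k = weight.shape[0]
    pad = (k - 1) // 2
    xp = np.pad(x * mask, pad)          # hole pixels contribute zero
    mp = np.pad(mask, pad)              # assumed: padded border counts as invalid
    out = np.zeros(x.shape, dtype=float)
    new_mask = np.zeros(mask.shape, dtype=float)
    window_size = float(k * k)          # sum(1)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            valid = mp[i:i + k, j:j + k].sum()   # sum(M) for this window
            if valid > 0:
                patch = xp[i:i + k, j:j + k]
                out[i, j] = (weight * patch).sum() * window_size / valid + bias
                new_mask[i, j] = 1.0
    return out, new_mask

x = np.ones((5, 5))
mask = np.ones((5, 5))
mask[1:4, 1:4] = 0.0                     # 3x3 hole in the centre
weight = np.full((3, 3), 1.0 / 9.0)      # mean filter
out, new_mask = partial_conv2d(x, mask, weight)
print(int(new_mask.sum()))  # 24: only the centre pixel saw no valid input
```

With a mean filter on a constant image, the rescaling returns the constant value wherever at least one valid pixel was visible, which is the behaviour the renormalisation term is meant to provide.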

2.3 Mask Update Step

Every Partial Convolution layer is followed by a mask update step. As seen in Section 2.2, only the available pixels are used to calculate the convolution output. The mask update step changes a mask value from 0 to 1 if the convolution window at that pixel contains at least one valid input. Mathematically, the updated mask can be defined as

$$
m' = \begin{cases} 1, & \text{if } \mathrm{sum}(M) > 0 \\ 0, & \text{otherwise} \end{cases}
$$

It is important to observe that, after the masked image has passed through enough successive layers of the network, all values of the mask will eventually be ones, provided the input contained at least one valid pixel.
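The shrinking of the hole can be checked with a small sketch. It assumes a 3x3 window, stride 1, and no change of resolution between layers (pooling is ignored); `update_mask` and `layers_until_filled` are hypothetical helper names.

```python
import numpy as np

def update_mask(mask, kernel_size=3):
    """Mark a pixel valid if any pixel in its k x k neighbourhood is valid."""
    pad = (kernel_size - 1) // 2
    padded = np.pad(mask, pad)
    new_mask = np.zeros_like(mask)
    for i in range(mask.shape[0]):
        for j in range(mask.shape[1]):
            if padded[i:i + kernel_size, j:j + kernel_size].sum() > 0:
                new_mask[i, j] = 1
    return new_mask

def layers_until_filled(mask, kernel_size=3, max_layers=100):
    """Count the mask updates needed before no hole pixels remain."""
    steps = 0
    while mask.min() == 0 and steps < max_layers:
        mask = update_mask(mask, kernel_size)
        steps += 1
    return steps

mask = np.ones((16, 16), dtype=int)
mask[4:12, 4:12] = 0                  # an 8x8 square hole
print(layers_until_filled(mask))      # 4
```

Each update removes a one-pixel border from the hole, so an 8x8 hole closes after four updates under these assumptions.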

2.4 UNet Architecture

Figure 2: Skeleton of typical UNET

The UNet architecture, as can be seen from Figure 2, has a 'U' shape. Figure 2 shows only two layers; a typical UNet contains four.

The two limbs of Figure 2 represent the contraction path (left portion, also called the encoders) and the expansion path (right portion, also called the decoders), respectively.

The contraction path is a stack of convolutional layers and max pooling layers applied successively. Every layer of the path contains two convolutions (CNNs), each followed by an activation function. Typically, the activation function for all layers is ReLU. The output of these two CNNs is passed through a max pooling layer with a kernel size that is typically equal to 2. The convolution operations increase the number of channels (say, from 3 for an RGB image) to a greater number of channels in a latent space (say, 256 to 512). Since we apply max pooling (kernel size 2), the height and width are halved.

Functionally, the contraction path captures information from images that helps us identify people, places, or other entities relevant to us. In other words, it determines the "context" of the image. This can be understood from the fact that as the image goes through the contraction path, its size reduces but the number of channels increases. This means that information about spatial locality is being lost, while information about the features in the image is being amplified.

The expansion path is also a stack of CNNs, but here we have upsampling layers instead of max pooling layers. Every decoding layer consists of two CNNs, each followed by an activation function (the last layer usually has no activation function).

An important property of UNet is the presence of skip connections. These are the connections that run across the pipeline (the "copy and crop" connections in Figure 2). They distinguish UNet from regular encoder-decoder networks, which use only the output of the previous layer. The output from the corresponding layer (same depth) of the contraction path is concatenated with the output of the previous decoding layer. The size of the image is thus gradually restored while the number of channels is reduced through the CNN layers.

The function of the expansion path is to recover the spatial locality information that was lost in the contraction path. This happens not only through upsampling but also through the skip connections, because the feature map from the encoder at the same depth contains locality information.
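The concatenation performed by a skip connection can be illustrated on array shapes alone. The sketch below uses nearest-neighbour upsampling for brevity (not the bilinear mode used in this project), and the channel counts and helper names are assumptions chosen only for the example.

```python
import numpy as np

def upsample_nearest(x, scale=2):
    """Nearest-neighbour upsampling of a (C, H, W) array."""
    return x.repeat(scale, axis=1).repeat(scale, axis=2)

def decoder_input(previous_decoder_output, encoder_features):
    """Upsample the deeper feature map and concatenate the encoder map along channels."""
    up = upsample_nearest(previous_decoder_output)
    assert up.shape[1:] == encoder_features.shape[1:], "spatial sizes must match"
    return np.concatenate([up, encoder_features], axis=0)

deep = np.zeros((512, 8, 8))     # output of the previous (deeper) decoding layer
skip = np.zeros((256, 16, 16))   # encoder output at the same depth
print(decoder_input(deep, skip).shape)  # (768, 16, 16)
```

The convolutions of the decoding layer then reduce the concatenated channels again, while the encoder map supplies the fine spatial detail.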

All in all, the UNet architecture is arranged in a U-shape, with a contraction path containing encoders and an expansion path containing decoders, where shallow layers (encoders) are connected to deeper layers (decoders) through skip links.

2.5 Modified UNet (Our Contribution)

For our task of image inpainting, we have used a total of seven layers in our model. In other words, we have seven encoders and seven decoders. We have the same Module List inside each encoder and decoder as described in the previous section. However there are a few changes.

We have used the following activation functions for the different layers:

  • ReLU for all the encoding layers.

  • Leaky ReLU for all decoding layers except the first layer (depth = 1).

  • No activation for the first decoding layer.

Note that the first decoding layer is the topmost layer in the architecture and the final layer that is used in the sequence followed by the pipeline.

For the upsampling step we have used the bilinear mode of PyTorch, which is the interpolation mode PyTorch provides for 4D (batch, channel, height, width) tensors.

We did not need to crop the encoder output in any skip connection, since we maintain the invariant that the image size at the encoder and the decoder of the same depth is equal. We ensure this by using a padding of size $(k-1)/2$ for a kernel size of $k$, with stride 1 and no dilation for every convolution operation. This ensures that the output image size of any CNN is the same as its input image size. The max pooling layer with a 2x2 kernel is responsible for contraction by a factor of 2, while the upsampling layer with scale factor = 2 is responsible for expansion.
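The size invariant can be verified with the standard convolution output-size formula. The sketch below assumes a 256x256 input, a kernel size of 3, and that every one of the seven encoders ends with pooling while every decoder starts with upsampling; the kernel size and this exact ordering are assumptions made for illustration.

```python
def conv_output_size(size, kernel_size, padding, stride=1, dilation=1):
    """Standard convolution output-size formula along one spatial axis."""
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

def trace_unet_sizes(size=256, depth=7, kernel_size=3):
    """Return the spatial size at each encoder and each decoder level."""
    padding = (kernel_size - 1) // 2
    encoder_sizes = []
    for _ in range(depth):
        size = conv_output_size(size, kernel_size, padding)  # first convolution
        size = conv_output_size(size, kernel_size, padding)  # second convolution
        encoder_sizes.append(size)   # size carried by the skip connection
        size //= 2                   # 2x2 max pooling
    decoder_sizes = []
    for _ in range(depth):
        size *= 2                    # upsampling with scale factor 2
        decoder_sizes.append(size)
    return encoder_sizes, decoder_sizes

enc, dec = trace_unet_sizes()
print(enc)        # [256, 128, 64, 32, 16, 8, 4]
print(dec[::-1])  # [256, 128, 64, 32, 16, 8, 4]
```

Because the decoder sizes mirror the encoder sizes at every depth, the encoder output can be concatenated directly without cropping.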

2.6 Loss Functions

The choice of loss functions plays a significant role in achieving both good predictions and faster convergence in any deep learning framework. In this image restoration problem, the following loss functions are responsible for improving both the features and their composition in the restored image. The subsequent sections use the following notation.

  • $I_{gt}$: Ground truth image

  • $I_{out}$: Output restored image generated by the network

  • $I_{comp}$: Output restored image with non-hole regions replaced with the ground truth

  • $\Psi_i$: Activation map of the ith layer of the network

  • $G(I)$ = Gram matrix of I

  • $K_i$: Gram matrix normalising factor, dependent on the size of the input (h*w)

2.6.1 Total Variation Loss

The total variation loss acts much like a regularization loss. It is responsible for maintaining spatial continuity and smoothness in the restored image. It is widely used in similar computer vision applications such as style transfer, and in digital signal processing for noise removal.
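As a rough illustration, a total variation term can be computed as the mean absolute difference between neighbouring pixels. The sketch below applies it to the whole image and uses an L1 form; whether this project restricts the term to a region around the hole, as some formulations do, is not stated here, so both choices are assumptions.

```python
import numpy as np

def total_variation_loss(image):
    """Mean absolute difference between neighbouring pixels of a (C, H, W) array."""
    dh = np.abs(image[:, 1:, :] - image[:, :-1, :]).mean()   # vertical neighbours
    dw = np.abs(image[:, :, 1:] - image[:, :, :-1]).mean()   # horizontal neighbours
    return dh + dw

flat = np.full((1, 4, 4), 0.5)
checkerboard = (np.indices((4, 4)).sum(axis=0) % 2).astype(float)[None]
print(total_variation_loss(flat))          # 0.0
print(total_variation_loss(checkerboard))  # 2.0
```

A smooth image scores zero, while a maximally oscillating pattern scores highest, which is why minimising this term favours spatial continuity.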


2.6.2 Perceptual Loss

Perceptual loss, also known as feature reconstruction loss, encourages the restored image to have feature representations similar to those of the ground truth, rather than to match the ground truth pixels exactly [4].

These feature representations are obtained by passing the image through a pre-trained Convolutional Neural Network (VGG-16 [9] in our experiments), which is a series of convolution layers followed by some fully connected layers. The layers from the input to the last max pooling layer are used for feature extraction.

This loss function tries to minimize the semantic differences between the features that are activated for the input image and for the reconstructed image, at the different available layers in the network.
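The comparison of feature maps can be sketched as below. The function `toy_features` is a hypothetical stand-in for the pre-trained VGG-16 activations (it only computes block averages at three scales), and the use of a mean L1 distance per layer is an assumption about the exact form of the loss.

```python
import numpy as np

def toy_features(image):
    """Hypothetical stand-in for network activations: block averages at three scales."""
    c, h, w = image.shape
    feats = []
    for s in (2, 4, 8):
        feats.append(image.reshape(c, h // s, s, w // s, s).mean(axis=(2, 4)))
    return feats

def perceptual_loss(features_pred, features_gt):
    """Sum over layers of the mean absolute difference between feature maps."""
    return sum(np.abs(fp - fg).mean() for fp, fg in zip(features_pred, features_gt))

ground_truth = np.ones((3, 16, 16))
restored = np.zeros((3, 16, 16))
print(perceptual_loss(toy_features(restored), toy_features(ground_truth)))      # 3.0
print(perceptual_loss(toy_features(ground_truth), toy_features(ground_truth)))  # 0.0
```

In practice the feature lists would come from the selected layers of the pre-trained network rather than from `toy_features`.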


2.6.3 Style Loss

Style loss is used to incorporate texture-like feature details of the ground truth image, rather than its global arrangement, into the restored image. It helps capture the general appearance of the ground truth image in terms of colors and localised compositions [3].

These styles are represented as Gram matrices, which are obtained by computing the correlations between the feature maps (the same as in Section 2.6.2) produced by the activations of the convolutional neural network at different layers.

This loss function tries to minimize the L1 distance between the entries of the Gram matrix computed from the style of the input image and the Gram matrices of both the network output generated after every iteration and the reconstructed image in which non-hole regions are replaced with the ground truth.
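A Gram-matrix comparison can be sketched as follows. The normalisation by C*H*W and the mean L1 distance are assumptions made for this example; the feature maps would normally come from the pre-trained network. The demonstration shuffles pixel positions to show that the Gram matrix ignores global arrangement.

```python
import numpy as np

def gram_matrix(features):
    """Channel-by-channel correlations of a (C, H, W) feature map."""
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    return flat @ flat.T / (c * h * w)   # assumed normalising factor

def style_loss(features_pred, features_gt):
    """Sum over layers of the mean L1 distance between Gram matrices."""
    return sum(
        np.abs(gram_matrix(fp) - gram_matrix(fg)).mean()
        for fp, fg in zip(features_pred, features_gt)
    )

rng = np.random.default_rng(0)
feats = rng.random((4, 8, 8))
perm = rng.permutation(64)
shuffled = feats.reshape(4, 64)[:, perm].reshape(4, 8, 8)   # same pixels, new positions
print(style_loss([shuffled], [feats]) < 1e-12)  # True
```

Since rearranging the spatial positions leaves the loss at zero, the term constrains texture statistics rather than layout, which matches the role described for it above.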


2.6.4 Pixel losses

The pixel losses for the hole and non-hole regions aim to improve per-pixel reconstruction accuracy.
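A possible form of these two terms is sketched below, using the mask convention of Section 2.2 (1 for valid pixels, 0 for holes). Dividing both terms by the total number of pixels is an assumption; the helper name `pixel_losses` is illustrative.

```python
import numpy as np

def pixel_losses(output, ground_truth, mask):
    """L1 errors over hole (mask == 0) and non-hole (mask == 1) pixels."""
    diff = np.abs(output - ground_truth)
    n = diff.size
    loss_hole = ((1 - mask) * diff).sum() / n
    loss_valid = (mask * diff).sum() / n
    return loss_hole, loss_valid

ground_truth = np.zeros((4, 4))
output = np.ones((4, 4))
mask = np.ones((4, 4))
mask[0:2, 0:2] = 0                      # four hole pixels
print(pixel_losses(output, ground_truth, mask))
```

Keeping the two regions separate lets the aggregate loss weight errors inside the hole differently from errors on pixels that were already known.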


The aggregate loss function is shown in Equation 9. For our initial experiments we use the values of the hyperparameters (λs) according to the following equation:


2.7 Training Details

The architecture described above, along with the loss functions, was trained for 34500 iterations with a batch size of 3. The total training time was approximately 20 hours, depending on the GPU system available.

3 Applications

We demonstrate the applications of this project in the following two ways.

3.1 Automated Segmented Object Removal

In this section, we use an automatic image segmentation technique, the state-of-the-art FCN-ResNet101 model [8], to generate masks of the segmented regions. The segmentation model can identify a large number of classes (20 in our case) and can thus generate masks corresponding to different objects in the image. We use the pre-trained PyTorch version of this model, trained on the Pascal VOC dataset (20 classes), to create this pipeline. For the demonstration of this application, we use the pre-trained image inpainting model trained on the Places2 dataset. The user only needs to provide the input image; the automated pipeline first detects the humans in the image (or objects of any of the 20 Pascal VOC classes) and then performs image inpainting using the detected object regions as masks. In the example given here, we have chosen the object to be a human figure in the image. The segmentation model creates a segmented mask as shown in Figure 3.


Figure 3: Automated Segmented Object Removal. The rightmost image is the ground truth that is provided as input to the automated pipeline. The image segmentation part of the pipeline detects all the humans and generates a mask for them, as can be seen in the second image from the left. The leftmost image goes as input into the image inpainting pipeline, which eventually generates the reconstructed outputs seen in the middle and second-to-last images.

This mask, along with the original image, is fed as input to our inpainting model. The results are compared with alternative inpainting algorithms.
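The step from segmentation output to inpainting mask can be sketched as follows. The example starts from a per-pixel class label map, which would typically be obtained by taking the highest-scoring class at each pixel of the segmentation output; the class index 15 for "person" assumes the usual 21-entry Pascal VOC label order with background at index 0, and the helper names are hypothetical.

```python
import numpy as np

PERSON_CLASS = 15  # assumed index of "person" in the Pascal VOC label list

def labels_to_inpainting_mask(label_map, target_class=PERSON_CLASS):
    """Return a mask that is 0 on the object to remove and 1 elsewhere."""
    hole = label_map == target_class
    return np.where(hole, 0, 1).astype(np.uint8)

def apply_mask(image, mask):
    """Blank out the hole region of a (C, H, W) image."""
    return image * mask[None]

label_map = np.zeros((4, 4), dtype=int)
label_map[1:3, 1:3] = PERSON_CLASS          # a 2x2 "person" region
mask = labels_to_inpainting_mask(label_map)
masked_image = apply_mask(np.ones((3, 4, 4)), mask)
print(int(mask.sum()), int(masked_image.sum()))  # 12 36
```

The masked image and the mask together form the input pair expected by the inpainting model.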

3.2 Manual Mask Generation and Inpainting

In this demonstration, we provide the user with a custom board on which a mask can be drawn manually. The masked region is then inpainted using our trained model. As with the previous application, the results are compared with classical approaches. The results of this experiment are shown in Figure 4.

Figure 4: Manual Mask Generation and Inpainting. Part A: The user can create a manual mask in the input image. Part B: The resulting output consists of, from left to right, the masked input, the mask, the output of our work, the output of the Navier-Stokes algorithm, and the ground truth.

4 Observations

State-of-the-art models for image feature extraction break images into primarily two types of information: features that are essential for the shape of the image (edges, blobs, corners, etc.) and features that are essential for its style (color patterns). In this project, we tried to understand the contribution of each type of feature and to study how removing one of them affects the performance of the architecture in generating the resulting images. Our qualitative and quantitative observations are described in the following subsections.

4.1 Ablation Study

An ablation study typically involves removing some component of the model to understand the contribution of that component to the overall system. As described in the previous sections, this work relies significantly on complex loss functions. Therefore, we performed the following two experiments. In the first experiment, we removed the perceptual loss component (described in Section 2.6.2) from the loss function of the model and trained the model for 20000 epochs. In the second experiment, we removed the style loss component (described in Section 2.6.3) and trained the model similarly. These experiments serve as a proof of concept for the losses described in Section 2.6.

4.1.1 Perceptual Loss Ablation

Figure 5 shows the results of this experiment. The results of the perceptual loss ablation clearly reflect a lack of sharp image features such as edges and corners. Although the generated images capture the overall structure of the input image through the other loss functions involved, the results are clearly lacking.

4.1.2 Style Loss Ablation

Figure 5 shows the results of this experiment. As described in Section 2.6.3, style loss is responsible for capturing colors and localised compositions. When we remove the effect of style loss, we can clearly see that the output images contain structural feature details but are unable to match the ground truth in terms of color composition.

Figure 5: Results of Ablation Study UNET

4.2 Quantitative Study

In this section, we use quantitative metrics, namely the L1 norm, Mean Squared Error (MSE), Peak Signal-to-Noise Ratio (PSNR), and Structural Similarity Index (SSIM), to measure the performance of our model. The reader should note that there can be multiple plausible inpainting results for the same image and mask. Therefore, although the quantitative analysis provides a basis for comparing different methods, it is not the most robust way of analysing performance. Previous works in the field agree on this point but nevertheless report these metrics for comparison. We therefore follow the convention and report our results in terms of the above-mentioned metrics.
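Three of these metrics can be computed in a few lines. The sketch assumes images scaled to [0, 1] and takes the L1 measure as a mean absolute error; both are assumptions about the evaluation setup. SSIM is omitted because it requires a windowed computation that does not fit a short example.

```python
import numpy as np

def l1_error(a, b):
    """Mean absolute error between two images."""
    return np.abs(a - b).mean()

def mse(a, b):
    """Mean squared error between two images."""
    return ((a - b) ** 2).mean()

def psnr(a, b, max_value=1.0):
    """Peak signal-to-noise ratio in dB for images with the given peak value."""
    err = mse(a, b)
    if err == 0:
        return float("inf")
    return 10.0 * np.log10(max_value ** 2 / err)

ground_truth = np.zeros((3, 8, 8))
restored = np.full((3, 8, 8), 0.1)
# A uniform error of 0.1 gives L1 = 0.1, MSE = 0.01, and a PSNR of about 20 dB.
print(l1_error(restored, ground_truth), mse(restored, ground_truth), psnr(restored, ground_truth))
```

Lower L1 and MSE and higher PSNR indicate a closer match to the ground truth, subject to the caveat above that several different completions can be equally plausible.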

We compare our outputs with a classical approach, the Navier-Stokes image inpainting technique. This approach uses ideas from classical fluid dynamics to propagate isophote lines continuously from the exterior into the region to be inpainted. We repeat our study for various ratios of mask size to image size. The results are displayed in Table 1 below.

Trained Model    r = 0.05    r = 0.1    r = 0.2
L1 MSE PSNR SSIM    L1 MSE PSNR SSIM    L1 MSE PSNR SSIM
PConv, no Style Loss
PConv, no Perceptual Loss
Classical Method
Table 1: Quantitative Measure

4.3 Key Observations

Some of our key observations are:

  1. An image is made of structural features and features that correspond to colours.

  2. Classical techniques are not as effective as the learning based approaches for the problem of image inpainting.

  3. Capturing the semantics of the image rather than per pixel information, is a much more efficient method for image reconstruction.

5 Conclusion and Future Work

This project was an attempt to reproduce and build on the deep-learning-based image inpainting methodology that uses Partial Convolution layers and complex loss networks. We performed a detailed analysis of the UNet-inspired network architecture, carried out ablation studies, and implemented two applications of the technique.

The project can be extended as part of larger deep learning frameworks such as image de-occlusion, or adapted for automatic modification of videos.


  • [1] C. A. Z. Barcelos and M. A. Batista (2007) Image restoration using digital inpainting and noise removal. Image and Vision Computing 25 (1), pp. 61–69.
  • [2] M. Bertalmio, A. L. Bertozzi, and G. Sapiro (2001) Navier-Stokes, fluid dynamics, and image and video inpainting. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), Vol. 1, pp. I–I.
  • [3] L. A. Gatys, A. S. Ecker, and M. Bethge (2015) A neural algorithm of artistic style. arXiv:1508.06576.
  • [4] J. Johnson, A. Alahi, and L. Fei-Fei (2016) Perceptual losses for real-time style transfer and super-resolution. arXiv:1603.08155.
  • [5] G. Liu, F. A. Reda, K. J. Shih, T. Wang, A. Tao, and B. Catanzaro (2018) Image inpainting for irregular holes using partial convolutions. CoRR abs/1804.07723.
  • [6] Z. Liu, P. Luo, X. Wang, and X. Tang (2015) Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV).
  • [7] R. V. Marinescu, D. Moyer, and P. Golland (2021) Bayesian image reconstruction using deep generative models. arXiv:2012.04567.
  • [8] E. Shelhamer, J. Long, and T. Darrell (2016) Fully convolutional networks for semantic segmentation. arXiv:1605.06211.
  • [9] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556.
  • [10] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba (2017) Places: a 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.