1 Problem Statement
Image inpainting is an image reconstruction technique in which missing sections of an image (holes) are filled in or "predicted". The technique has several applications, including removing unwanted objects from an image, restoring damaged portions of old photographs, and removing unwanted text. Classical vision approaches to image inpainting propagate unmasked image statistics into the holes. These methods have shown varied success, but they all lack the ability to semantically reconstruct the holes. Recent work on image inpainting has predominantly been based on learning-based approaches, which can effectively learn the semantics of the image. Generative Adversarial Networks (GANs) have also been used in this domain to generate satisfactory results.
This project is inspired by the original paper "Image Inpainting for Irregular Holes Using Partial Convolutions". This approach to image inpainting is capable of restoring images with arbitrarily shaped holes. The central idea of the work lies in its Partial Convolution layers and its high-level feature loss networks.
The remainder of the report contains sections describing the approach, quantitative and qualitative results, and key observations.
In this project, we use Partial Convolution layers with a mask-update step to perform the inpainting operation. The following sections describe the approach in detail.
2.1 Dataset
We use CelebA-HQ/256, a large-scale dataset of 30,000 human-face images, each of size 256x256. We divided the dataset into training, validation, and test sets in the proportions 70%, 15%, and 15%. For the applications described in later sections, we use samples of size 256x256 from the Places2 dataset.
2.2 Partial Convolution Layers
A partial convolution, as the name suggests, is very similar to a standard convolution layer, except that it performs the convolution only on pixels that are not currently masked, i.e. "available". More formally, if W is the convolution filter, b is its corresponding bias, M is the binary mask under the convolution window, and X are the image features under the window, then the partial convolution is defined as

$$x' = \begin{cases} W^{T}(X \odot M)\,\dfrac{\operatorname{sum}(\mathbf{1})}{\operatorname{sum}(M)} + b, & \text{if } \operatorname{sum}(M) > 0 \\ 0, & \text{otherwise,} \end{cases}$$

where $\mathbf{1}$ is a vector of the same shape as M with all elements equal to one, and $\odot$ denotes element-wise multiplication.
2.3 Mask Update Step
Every Partial Convolution layer is followed by a mask-update step. As seen in Section 2.2, only the available pixels are used to compute the convolution output. The mask-update step updates a mask value from 0 to 1 if there is at least one valid input pixel in the window under consideration. Mathematically, the updated mask is

$$m' = \begin{cases} 1, & \text{if } \operatorname{sum}(M) > 0 \\ 0, & \text{otherwise.} \end{cases}$$
It is important to observe that after the masked image passes through successive layers of the network, all values of the mask eventually become one, provided the input contained at least one valid pixel.
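The two steps above can be sketched in PyTorch as follows. This is an illustrative helper, not the paper's reference implementation; it assumes stride 1, "same" padding, and a mask with the same shape as the input:

```python
import torch
import torch.nn.functional as F

def partial_conv2d(x, mask, weight, bias):
    """One partial-convolution step followed by the mask update.
    x: (N, C, H, W) features; mask: (N, C, H, W) binary, 1 = valid pixel.
    weight: (C_out, C, k, k); bias: (C_out,)."""
    k = weight.shape[-1]
    pad = (k - 1) // 2
    # W^T (X .* M): convolve only the unmasked inputs
    raw = F.conv2d(x * mask, weight, bias=None, padding=pad)
    # sum(M): number of valid inputs under each window (all-ones kernel)
    ones_kernel = torch.ones(1, mask.shape[1], k, k)
    mask_sum = F.conv2d(mask, ones_kernel, padding=pad)
    # sum(1): total number of elements in the window
    window = mask.shape[1] * k * k
    valid = mask_sum > 0
    scale = torch.where(valid, window / mask_sum.clamp(min=1.0),
                        torch.zeros_like(mask_sum))
    out = raw * scale + bias.view(1, -1, 1, 1)
    out = torch.where(valid, out, torch.zeros_like(out))
    # mask update: a location becomes valid if any input in its window was valid
    return out, valid.float()
```

Each application of this layer shrinks the hole by the kernel's receptive radius, which is why the mask eventually fills with ones.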
2.4 UNet Architecture
The two limbs of Figure 2 represent the contraction path (left portion, also called the encoder) and the expansion path (right portion, also called the decoder), respectively.
The contraction path is a stack of convolutional layers and max-pooling layers applied successively. Every level contains two convolutions, each followed by an activation function, typically ReLU for all layers. The output of these two convolutions is passed through a max-pooling layer with a kernel size that is typically equal to 2. The convolutions increase the number of channels (say, from 3 for an RGB image) to a greater number of channels in a latent space (say, 256 or 512). Since max pooling (kernel size 2) is applied, the height and width are halved.
Functionally, the contraction path captures information from images that helps identify people, places, or other relevant entities. In other words, it determines the "context" of the image. This can be understood from the fact that as the image passes through the contraction path, its spatial size shrinks while the number of channels increases: information about spatial locality is lost, while information about the features in the image is amplified.
The expansion path is also a stack of convolutions, but with upsampling layers in place of the max-pooling layers. Every decoding level consists of two convolutions, each followed by an activation function (the last layer usually has none).
An important property of UNet is the presence of skip connections. These are the connections that run across the pipeline (the "copy and crop" connections in Figure 2), and they distinguish UNet from regular encoder-decoder networks, which use only the output of the previous layer. The output from the corresponding encoder (at the same depth) in the contraction path is concatenated with the output of the previous decoding layer. The size of the image is thereby gradually restored while the number of channels is reduced through the convolutions.
The function of the expansion path is to recover the spatial-locality information that was lost in the contraction path. This happens not only through upsampling but also through the skip connections, because the feature map from the encoder at the same depth retains locality information.
All in all, the UNet architecture is arranged in a U shape, with a contraction path of encoders and an expansion path of decoders, where shallow layers (encoders) are connected to deeper layers (decoders) through skip links.
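As an illustration, the pattern can be written as a minimal two-level UNet in PyTorch. This is a sketch of the structure only, not our seven-layer model; the layer widths are arbitrary:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Two-level UNet sketch: the contraction path halves H and W while
    increasing channels; the expansion path upsamples and concatenates
    the skip connection from the encoder at the same depth."""
    def __init__(self, in_ch=3, base=16):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, base, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(base, base, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)
        self.enc2 = nn.Sequential(nn.Conv2d(base, base * 2, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(base * 2, base * 2, 3, padding=1), nn.ReLU())
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        # the decoder sees upsampled features concatenated with the skip connection
        self.dec1 = nn.Sequential(nn.Conv2d(base * 2 + base, base, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(base, in_ch, 3, padding=1))  # no final activation

    def forward(self, x):
        s1 = self.enc1(x)              # full-resolution features, kept for the skip
        d = self.enc2(self.pool(s1))   # half resolution, more channels ("context")
        d = self.up(d)                 # restore spatial size
        d = torch.cat([d, s1], dim=1)  # skip connection (sizes match, so no crop)
        return self.dec1(d)
```

Because padded convolutions preserve the spatial size, the encoder and decoder feature maps at the same depth align exactly and can be concatenated directly.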
2.5 Modified UNet (Our Contribution)
For our task of image inpainting, we use a total of seven levels in our model; in other words, seven encoders and seven decoders. Each encoder and decoder contains the same module list described in the previous section, with a few changes.
We used the following activation functions for the different layers:
ReLU for all encoding layers.
Leaky ReLU for all decoding layers except the first (depth = 1).
No activation for the first decoding layer.
Note that the first decoding layer is the topmost layer in the architecture and the final layer in the pipeline.
For the upsampling step we use PyTorch's bilinear mode, which is recommended for 4D data.
We never need to crop the encoder output in a skip connection, since we maintain the invariant that the image size at the encoder and at the decoder of the same depth is identical. We ensure this by using a padding of $(k-1)/2$ for a kernel size of $k$, with stride 1 and no dilation, for every convolution; the output of any convolution then has the same spatial size as its input. The max-pooling layer with a 2x2 kernel is responsible for contraction by a factor of 2, while the upsampling layer with scale factor 2 is responsible for the corresponding expansion.
2.6 Loss Functions
The choice of loss functions plays a significant role in achieving both good predictions and fast convergence in any deep-learning framework. In this image-restoration problem, the following loss functions improve both the features and their composition in the restored image. For the subsequent sections, please consider the following notation.
$I_{gt}$: ground-truth image
$I_{out}$: restored image output by the network
$I_{comp}$: restored image with the non-hole regions replaced by the ground truth
$\Psi_i$: activation map of the $i$-th layer of the network
$G(\Psi_i)$: Gram matrix of $\Psi_i$
$K_i$: Gram-matrix normalising factor, dependent on the size of the input ($h \times w$)
2.6.1 Total Variation Loss
The total variation loss is analogous in nature to regularisation losses. It is responsible for maintaining spatial continuity and smoothness in the restored image. It is widely used in similar computer-vision applications such as style transfer, and in digital signal processing for noise removal.
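A minimal sketch of this loss for a 4D image batch (an illustrative helper, not the exact training code):

```python
import torch

def total_variation_loss(img):
    """Total variation loss: mean absolute difference between neighbouring
    pixels, penalising abrupt intensity jumps to encourage smooth,
    spatially continuous reconstructions. img: (N, C, H, W)."""
    dh = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean()  # vertical neighbours
    dw = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean()  # horizontal neighbours
    return dh + dw
```

A perfectly flat image incurs zero loss, while noisy or blocky outputs are penalised, which is exactly the smoothing behaviour described above.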
2.6.2 Perceptual Loss
Perceptual loss, also known as feature-reconstruction loss, encourages the restored image to have feature representations similar to those of the ground truth, rather than exactly matching the ground-truth pixels.
These feature representations are obtained by passing the image through a pre-trained convolutional neural network (VGG-16 in our experiments), a series of convolutional layers followed by fully connected layers. The layers from the input up to the last max-pooling layer are used for feature extraction.
This loss function tries to minimize the semantic differences between the features that are activated for the input image and for the reconstructed image, at the different available layers in the network.
2.6.3 Style Loss
Style loss incorporates texture-like feature details of the ground-truth image, rather than its global arrangement, into the restored image. It helps capture the general appearance of the ground-truth image in terms of colours and localised compositions.
These styles are represented as Gram matrices, obtained by computing the correlations between the feature maps (the same as in Section 2.6.2) extracted from the activations of the convolutional neural network at different layers.
This loss function minimises the L1 distance between the entries of the Gram matrix of the ground-truth features and the Gram matrices of both the raw output and the composited output generated at every iteration.
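A sketch of the Gram matrix and the resulting style term for a single layer's feature maps (illustrative helpers; the normalising constant here is assumed to be C*H*W):

```python
import torch
import torch.nn.functional as F

def gram_matrix(feat):
    """Gram matrix of a feature map: channel-by-channel correlations,
    normalised by a size-dependent factor. feat: (N, C, H, W)."""
    n, c, h, w = feat.shape
    f = feat.view(n, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)  # (N, C, C)

def style_loss(feat_out, feat_comp, feat_gt):
    """L1 distance between the ground-truth Gram matrix and the Gram
    matrices of both the raw output and the composited output."""
    g_gt = gram_matrix(feat_gt)
    return (F.l1_loss(gram_matrix(feat_out), g_gt)
            + F.l1_loss(gram_matrix(feat_comp), g_gt))
```

Because the Gram matrix discards spatial positions (it sums over all locations), matching it reproduces textures and colour statistics without constraining the global arrangement.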
2.6.4 Pixel Losses
The pixel losses for both the hole and non-hole regions aim to improve per-pixel reconstruction accuracy.
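These two terms can be sketched as masked L1 losses (an illustrative helper; the normalisation and weighting shown here are assumptions, not the exact training configuration):

```python
import torch

def pixel_losses(out, gt, mask):
    """Per-pixel L1 losses split into hole (mask == 0) and valid (mask == 1)
    regions. The hole term is typically weighted more heavily when the
    individual losses are combined into the total loss."""
    hole = ((1 - mask) * (out - gt)).abs().sum() / (1 - mask).sum().clamp(min=1)
    valid = (mask * (out - gt)).abs().sum() / mask.sum().clamp(min=1)
    return hole, valid
```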
2.7 Training Details
The architecture described above, together with the loss functions, was trained for 34,500 iterations with a batch size of 3. The approximate total training time was 20 hours, depending on the available GPU.
We demonstrate the applications of this project in the following two ways.
3.1 Automated Segmented Object Removal
In this section, we use an automatic image-segmentation technique with the state-of-the-art FCN-ResNet101 model to generate masks of the segmented regions. The segmentation model can identify a large number of classes (20 in our case) and can therefore generate masks corresponding to different objects in the image. We use the pre-trained PyTorch version of this model, trained on the Pascal VOC dataset (20 classes), to create this pipeline. For this demonstration, we use the pre-trained image-inpainting model trained on the Places2 dataset. The user only needs to supply the input image; the automated pipeline first detects objects in the image (any of the 20 Pascal VOC classes) and then performs the inpainting using the detected object regions as masks. Here we give an example: we chose the object to be a human figure in the image, and the segmentation model creates the segmented mask shown in Figure 3.
This mask, along with the original image, is fed as an input to our inpainting model. The results are compared with other alternate inpainting algorithms.
3.2 Manual Mask Generation and Inpainting
In this demonstration, we provide the user with a custom drawing board on which a mask can be created manually. The created mask is then inpainted using our trained model. As with the previous application, this method is also compared with classical approaches. The results of this experiment are shown in Figure 4.
State-of-the-art models for image feature extraction break images into primarily two types of information: features essential to the shape of the image (edges, blobs, corners, etc.), and features essential to its style (colour patterns). In this project, we tried to understand the contribution of each type of feature and study how removing one of them affects the architecture's ability to generate the resulting images. Our qualitative and quantitative observations are described in the following subsections.
4.1 Ablation Study
An ablation study typically involves removing a component of the model to understand its contribution to the overall system. As described in the previous sections, this work relies significantly on the composite loss functions, so we performed the following two experiments. In the first, we removed the perceptual loss component (described in Section 2.6.2) from the model's loss function and trained the model for 20,000 iterations. In the second, we removed the style loss component (described in Section 2.6.3) and trained the model similarly. These experiments serve as a clear proof of concept for the losses described in Section 2.6.
4.1.1 Perceptual Loss Ablation
Figure 5 shows the results of this experiment. The results of the perceptual-loss ablation clearly reflect a lack of sharp image features such as edges and corners. Although the generated images capture the overall structure of the input via the other loss functions involved, the results clearly seem lacking.
4.1.2 Style Loss Ablation
Figure 5 shows the results of this experiment. As described in Section 2.6.3, style loss is responsible for capturing colours and localised compositions. When we remove the style loss, we can clearly see that the output images contain structural feature details but fail to match the ground truth in terms of colour composition.
4.2 Quantitative Study
In this section, we use quantitative metrics, namely the L1 norm, mean squared error (MSE), peak signal-to-noise ratio (PSNR), and structural similarity index (SSIM), to measure the performance of our model. The reader should note that multiple valid inpainting results can exist for the same input image and mask. Therefore, although quantitative analysis gives a measure for comparing different methods, it is not the most robust way of analysing performance. Previous works in the field agree with this idea but nevertheless report these metrics for comparison, so we follow the convention and report our results in terms of the metrics above.
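Three of these metrics can be computed directly, as sketched below (illustrative helpers assuming images scaled to [0, 1]; SSIM is more involved, and a standard implementation is available as `skimage.metrics.structural_similarity`):

```python
import numpy as np

def l1_error(a, b):
    """Mean absolute per-pixel difference."""
    return np.abs(a - b).mean()

def mse(a, b):
    """Mean squared per-pixel difference."""
    return ((a - b) ** 2).mean()

def psnr(a, b, max_val=1.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the target.
    Infinite for identical images (zero error)."""
    m = mse(a, b)
    return float('inf') if m == 0 else 10 * np.log10(max_val ** 2 / m)
```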
We compare our outputs with a classical approach, the Navier-Stokes image inpainting technique. This approach uses ideas from classical fluid dynamics to propagate isophote lines continuously from the exterior into the region to be inpainted. We repeat our study for various mask-to-image size ratios r. The results are displayed in Table 1 below.
|Trained Model|r = 0.05|r = 0.1|r = 0.2|
|PConv, no Style Loss| | | |
|PConv, no Perceptual Loss| | | |
4.3 Key Observations
Some of our key observations are:
An image is made of structural features and features that correspond to colours.
Classical techniques are not as effective as the learning based approaches for the problem of image inpainting.
Capturing the semantics of the image, rather than per-pixel information, is a much more effective method for image reconstruction.
5 Conclusion and Future Work
This project was an attempt to reproduce and build on the deep-learning-based image-inpainting methodology that uses Partial Convolution layers and complex loss networks. We performed a detailed analysis of the UNet-inspired network architecture, performed ablation studies, and implemented two applications to demonstrate the technique.
The current project can be extended as part of larger deep-learning frameworks, such as image de-occlusion, or further applied to automatic modification of videos.
References
- (2007) Image restoration using digital inpainting and noise removal. Image and Vision Computing 25(1), pp. 61–69.
- (2001) Navier-Stokes, fluid dynamics, and image and video inpainting. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), Vol. 1, pp. I–I.
- (2015) A neural algorithm of artistic style.
- (2016) Perceptual losses for real-time style transfer and super-resolution.
- (2018) Image inpainting for irregular holes using partial convolutions. CoRR abs/1804.07723.
- (2015) Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV).
- (2021) Bayesian image reconstruction using deep generative models.
- (2016) Fully convolutional networks for semantic segmentation.
- (2015) Very deep convolutional networks for large-scale image recognition.
- Places: a 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.