Historically, inpainting is an ancient technique performed by professional artists to restore damaged paintings in museums. Defects such as scratches, cracks, dust and spots were inpainted by hand to restore and maintain image quality. The evolution of computers over the last century and their frequent daily use have encouraged inpainting to take a digital form [1, 2, 3, 4, 5, 6, 7] as an image restoration technique. Image inpainting aims to fill in missing pixels caused by a defect, based on pixel similarity information.
The state-of-the-art approaches fall into two categories: conventional and deep learning-based methods. Traditional methods [1, 3, 8, 9] use image statistics of best-fitting pixels to fill in missing regions (defects). However, these approaches often fail to produce images with plausible visual semantics. With the evolution of research, deep learning-based methods [4, 5, 10, 11, 7, 6] encode the semantic context of an image into a feature space and fill in missing pixels by hallucination, through the use of generative neural networks.
Pathak et al. [4] propose the context encoder, closely related to the auto-encoder [12, 13] and AlexNet [14], to predict missing pixels in RGB images. This technique applies the $\ell_2$-norm as a pixel reconstruction loss to capture the overall structure of the missing region with respect to the image context. Iizuka et al. [10] introduce a globally and locally consistent training approach that uses two discriminators: a global discriminator assesses the coherency of the whole image, while a local discriminator ensures that the pixels predicted within the binary mask region are locally consistent with the image. Yang et al. [6]
propose a multi-scale neural patch synthesis approach. This method uses a combined optimisation framework, with global content and local texture constraints on a Convolutional Neural Network (CNN), to preserve contextual structures and predict missing regions in an image. Yeh et al. [15] use context and prior losses to train a network that searches for the encoding of a corrupted image in latent space in order to reconstruct the original image.
Liu et al. [5] use partial convolutions in place of typical convolutions, with an automatic mask-updating step. This technique masks and renormalises convolutions so that they are conditioned only on valid pixels. Yan et al. [7] use deep feature rearrangement, adding a special shift-connection layer to the U-Net architecture [17]. Wang et al. [18] introduce a Laplacian pyramid approach based on residual learning [19] to propagate high-frequency details and predict missing information. However, most of these models apply expensive post-processing techniques (e.g. Poisson blending) to the final output in order to produce images that are visually consistent with the original. Inpainting therefore remains a challenging research area, owing to failures in generating realistic images from a compact latent feature.
Although deep learning approaches achieve excellent performance in facial inpainting, the state of the art has some limitations, as illustrated in Figure 1: Figure 1(a) shows poor performance on holes of arbitrary size; Figure 1(b) illustrates the lack of edge preservation in an existing technique; Figure 1(c) depicts blurry artefacts; and Figure 1(d) demonstrates poor performance on high-resolution images and on image completion with a mask at the border region.
To correctly predict the missing parts of a face and preserve its realism, we propose the Symmetric Skip Connection Wasserstein GAN (S-WGAN), with the following contributions:
- We propose a new framework with a Wasserstein GAN that uses symmetric skip connections to preserve image details.
- We define a new combined loss function based on feature space.
- We demonstrate that our loss, combined with our S-WGAN, achieves better results than state-of-the-art algorithms.
II Proposed Framework
Our proposed model uses skip connections with dilated convolution to perform image inpainting. We discuss the architecture and loss function of S-WGAN in the following sections.
II-A Network Architecture
Figure 2 shows the overall framework of our proposed S-WGAN. The network is designed to have a generator ($G$) and a discriminator ($D$). We define $G$ as an encoder-decoder framework with dilated convolutions and symmetric skip connections. Figure 3 shows the process of dilated convolution. Dilated convolutions combined with skip connections are critical to the design of our model because:
It broadens the receptive fields to capture more contextual information without parameter accretion or added computational complexity; this information is preserved and transferred by skip connections to the corresponding deconvolution layers.
It detects fine details, maintains high-resolution feature maps, and achieves end-to-end feature learning with a better local minimum (higher restoration performance).
Generator ($G$): The effectiveness of feature encoding is improved by an encoder of $n$ convolutional layers, with a kernel size of 5 and a dilation rate of 2, designed to match the size of the output image. This technique enables our model to learn larger spatial filters and helps reduce volume. We apply dropout regularisation with a probability of 0.25 in the 4th and final layer of the encoder. The dropout layer randomly disconnects nodes and adjusts the weights so that information propagates to the decoder without overfitting.
Decoder: The decoder consists of blocks of deconvolutional layers, with learnable upsampling layers that recover image details. The corresponding feature maps in the decoder are symmetrically linked by element-wise skip connections to reach the optimum size. The final layer of the decoder uses a Tanh activation.
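As a concrete illustration, the following is a minimal Keras sketch of such a generator, not the exact implementation: the filter widths, stage count, 256x256 input size and the use of plain strided convolutions for downsampling are assumptions (Keras forbids combining a stride greater than 1 with a dilation rate greater than 1), while the 5x5 kernels, dilation rate of 2, dropout of 0.25 in the final encoder layer, element-wise skip connections and Tanh output follow the description above.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_generator(input_shape=(256, 256, 3), base_filters=64, depth=4):
    """Encoder-decoder sketch with dilated convolutions and symmetric,
    element-wise skip connections, ending in a Tanh activation."""
    inp = layers.Input(shape=input_shape)
    x, skips = inp, []

    # Encoder: 5x5 dilated convolutions (rate 2) enlarge the receptive
    # field; a plain strided convolution then halves the resolution.
    for i in range(depth):
        f = base_filters * 2 ** i
        x = layers.Conv2D(f, 5, padding='same', dilation_rate=2,
                          activation='relu')(x)
        skips.append(x)                      # saved for the mirror stage
        x = layers.Conv2D(f, 3, strides=2, padding='same',
                          activation='relu')(x)
        if i == depth - 1:
            x = layers.Dropout(0.25)(x)      # regularise the final encoder layer

    # Decoder: learnable upsampling; each stage is linked element-wise
    # (Add) to its mirror-image encoder feature map.
    for i in reversed(range(depth)):
        f = base_filters * 2 ** i
        x = layers.Conv2DTranspose(f, 3, strides=2, padding='same',
                                   activation='relu')(x)
        x = layers.Add()([x, skips[i]])      # symmetric skip connection

    out = layers.Conv2D(3, 5, padding='same', activation='tanh')(x)
    return Model(inp, out, name='s_wgan_generator')
```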
Dilated Convolutions: We express the dilated convolution of the network input in Equation 1:

$$y(i) = \sum_{k} x(i + r \cdot k)\, w(k) \tag{1}$$

where $y$ is the output feature map of the dilated convolution from the input $x$, and the filter is given by $w$. The dilation rate parameter ($r$) reverts the operation to a normal convolution when $r = 1$.
It is advantageous to use dilated convolution compared with typical convolutional layers combined with pooling, because a small kernel of size $k \times k$ enlarges to an effective size of $(k + (k-1)(r-1)) \times (k + (k-1)(r-1))$ according to the dilation rate $r$, allowing a flexible receptive field over fine-detail contextual information while maintaining high-quality resolution.
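This enlargement can be checked with a few lines of Python; the relation $k_{eff} = k + (k-1)(r-1)$ used here is the standard effective kernel size for a dilation rate $r$:

```python
def effective_kernel(k: int, r: int) -> int:
    """Effective side length of a k x k kernel dilated at rate r."""
    return k + (k - 1) * (r - 1)

print(effective_kernel(5, 1))   # 5  -> r = 1 reverts to a plain convolution
print(effective_kernel(5, 2))   # 9  -> the kernel/rate used in our encoder
print(effective_kernel(3, 4))   # 11 -> small kernels reach far at high rates
```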
The inpainting solver may produce predictions for the missing region that are reasonable or ill-posed. Because of this, instead of a discriminator output constrained to [0, 1] as in normal GANs, we include a Wasserstein critic in our network, adopted from [25], to provide improved stability and enhanced discrimination for photo-realistic images. As adversarial training progresses, the discriminator becomes unable to distinguish real data from fake. Equation 2 shows the reconstruction of the image during training from $G$:

$$x_{rec} = M \odot x_{gt} + (1 - M) \odot \hat{x} \tag{2}$$

where $x_{rec}$ is the reconstructed image; $x_{gt}$ is the ground truth; $\hat{x}$ is the prediction; $\odot$ is element-wise multiplication; and $M$ is the binary mask, taking the value 1 over the context of the image and 0 over the missing regions.
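In code, Equation 2 amounts to a single element-wise blend. A minimal numpy sketch, assuming the mask convention above (1 over the context, 0 over the holes):

```python
import numpy as np

def compose(x_gt: np.ndarray, x_pred: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Equation 2: copy the known context from the ground truth and
    fill the missing regions with the generator's predictions."""
    return mask * x_gt + (1.0 - mask) * x_pred
```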
II-B Loss Function
Instead of using the typical Mean Square Error (MSE) loss function used in [4], we define a combined loss function based on feature space. To achieve this, we adopt a pre-trained VGG-16 model trained on ImageNet and use its block3-conv3 layer as the feature extractor for computing our loss. Using MSE as a base, we calculate the squared difference of the feature maps instead of the pixel-wise representation. Since our goal is to output high-resolution, visually plausible images whose remaining flaws are undetectable by the human visual system, we use the $\ell_1$ loss as a constraint and compute our perceptual loss function (Equation 5) in feature space. The $\ell_1$ loss preserves colour and luminance and does not over-penalise larger errors [27]. The $\ell_1$ term adjusts our perceptual loss to minimise error, and it allows better evaluation of predicted features against ground-truth ones, rather than the normal pixel-wise difference.
More specifically, we define the feature-space loss as:

$$\mathcal{L}_{feature} = \sum_{p \in \Phi} \left( \hat{y}_p - y_p \right)^2$$

where $p$ is the pixel index belonging to the feature maps ($\Phi$), and $\hat{y}_p$ and $y_p$ are the pixel values of the predictions and the ground truth, respectively. The advantage of using feature space is that a particular filter determines the extraction of feature maps, from low-level to high-level, sophisticated features. To reconstruct quality images, we compute our loss function with the feature maps determined by block3-conv3, resized to the same size as the masks and the generated images. Using block4-conv4 or block5-conv5 instead would result in poor quality, as the network starts to widen its view at these layers due to the larger number of filters used.
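A minimal sketch of this feature extractor, assuming the Keras VGG-16 implementation (where the layer is named block3_conv3) and, for illustration, a 256x256 input:

```python
from tensorflow.keras import Model
from tensorflow.keras.applications import VGG16

def build_feature_extractor(input_shape=(256, 256, 3)) -> Model:
    """VGG-16 pre-trained on ImageNet, truncated at block3_conv3 and
    frozen, so it acts purely as a fixed feature extractor."""
    vgg = VGG16(weights='imagenet', include_top=False, input_shape=input_shape)
    vgg.trainable = False
    return Model(vgg.input, vgg.get_layer('block3_conv3').output,
                 name='vgg_block3_conv3')
```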
We adopt the loss formulation of [26] and define our special perceptual loss as follows:

$$\mathcal{L}_{perceptual} = \frac{1}{C_j H_j W_j} \left\| \phi_j(x_{rec}) - \phi_j(x) \right\|_1 \tag{5}$$

where $x$ is the input image, $x_{rec}$ is the reconstructed image, and $C_j H_j W_j$ are the dimensions of the feature maps $\phi_j$, whose high-level representational abstractions are extracted from the third block convolution layer. By combining the perceptual loss, the model learns to produce finer details in the predicted features and outputs images without blurry artefacts. The entire model trains end-to-end with backpropagation and uses the Wasserstein loss ($\mathcal{L}_w$) to optimise $D$ and to learn reasonable predictions. Our goal is to reconstruct an image from $\hat{x}$ by training the generator to learn and preserve image details. The Wasserstein loss improves convergence in GANs; it is the mean difference between the critic scores of real and generated images. The combined loss function, Wasserstein-perceptual ($\mathcal{L}_{w\text{-}perceptual}$), is defined in Equation 6:

$$\mathcal{L}_{w\text{-}perceptual} = \mathcal{L}_w + \mathcal{L}_{perceptual} \tag{6}$$
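A hedged TensorFlow sketch of these loss terms: phi is the block3_conv3 extractor from the previous snippet, the {-1, +1} labelling in the Wasserstein term is the usual Keras convention, and the unweighted sum in combined_loss is an assumption, since no balancing coefficients are stated here.

```python
import tensorflow as tf

def perceptual_loss(phi, x_gt, x_rec):
    """l1 distance in VGG feature space, normalised by the feature-map
    dimensions C*H*W (Equation 5)."""
    f_gt, f_rec = phi(x_gt), phi(x_rec)
    dims = tf.cast(tf.reduce_prod(tf.shape(f_gt)[1:]), tf.float32)
    return tf.reduce_sum(tf.abs(f_rec - f_gt)) / dims

def wasserstein_loss(y_true, y_pred):
    """Mean difference of critic scores; y_true is +1 for real and
    -1 for generated samples."""
    return tf.reduce_mean(y_true * y_pred)

def combined_loss(phi, w_term, x_gt, x_rec):
    """Equation 6: Wasserstein term plus the perceptual term."""
    return w_term + perceptual_loss(phi, x_gt, x_rec)
```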
This section describes the dataset, binary masks and the implementation.
III-A Dataset and Irregular Binary Mask
Our experiment focuses on high-resolution face images and irregular binary masks. The benchmark dataset for high-resolution face images is the CelebA-HQ dataset [28], which was curated from the CelebA dataset [29] and contains 30,000 images. Figure 4 shows a few samples from the CelebA-HQ dataset.
To create irregular holes in the images, we use the Quickdraw irregular mask dataset [30], which is publicly available and divided into 50,000 training masks and 10,000 test masks. Figure 5 shows samples from the mask dataset [30].
We convert each image to a normalised floating-point representation, setting the pixel intensity values to the range [-1, 1], and apply the mask to the image to obtain our input, as shown in Figure 6. We initialise pre-trained VGG-16 weights to compute our loss function. We use separate learning rates for $G$ and $D$ and optimise the training process using the Adam optimiser [31]. We train the models on a Quadro P6000 GPU. Given our hardware conditions, we use a batch size of 5 in each epoch. The network takes 0.193 seconds to predict the missing pixels of any size created by a binary mask on an image, and ten days to train for 100 epochs.
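A sketch of this preprocessing round trip, assuming 8-bit inputs; the factor 127.5 maps [0, 255] onto [-1, 1], matching the generator's Tanh output:

```python
import numpy as np

def normalise(img_uint8: np.ndarray) -> np.ndarray:
    """Convert to float first, then scale 8-bit intensities to [-1, 1]."""
    return img_uint8.astype(np.float32) / 127.5 - 1.0

def denormalise(img: np.ndarray) -> np.ndarray:
    """Inverse-normalise network outputs back to displayable 8-bit."""
    return ((img + 1.0) * 127.5).clip(0.0, 255.0).astype(np.uint8)

def make_input(img_uint8: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Apply the binary mask (1 = context, 0 = hole) to form the input."""
    return normalise(img_uint8) * mask
```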
We assess the performance of the inpainting methods qualitatively and quantitatively in this section.
IV-A Qualitative Comparisons
Considering the importance of visual and semantic coherence, we conducted a qualitative comparison on our test dataset. First, we implemented a WGAN approach with perceptual loss and Wasserstein distance. We observed induced patterns and poor colour in the images, as shown in Figure 7(d). We then introduced dilated convolution and skip connections, combined with our Wasserstein-perceptual loss function ($\mathcal{L}_{w\text{-}perceptual}$), to remove the induced patterns and match the luminance of the original images.
We compare our model with three popular methods: the context encoder (CE) [4], partial convolutions (PConv) [5], and a baseline WGAN.
We test our S-WGAN against the state of the art on the CelebA-HQ test dataset and show the results in Figure 7. On visual inspection, Figure 7(b) illustrates the blurry and checkerboard effects generated by Pathak et al.'s CE method [4]. PConv [5] generates clearer images, but with residues of the mask left on the image, as illustrated in Figure 7(c). WGAN induces patterns and poor colour in the images, as shown in Figure 7(d). Overall, our proposed S-WGAN, shown in Figure 7(e), produces the best visual results when compared with the ground truth in Figure 7(f).
IV-B Quantitative Comparisons
We select several popular image quality metrics, including Mean Absolute Error (MAE), MSE, Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index Measure (SSIM), to evaluate performance quantitatively. The table below shows the results of our experiment compared with the state of the art [4, 5] for image inpainting, with our S-WGAN in the final row.

Inpainting Method       MSE       MAE      PSNR    SSIM
WGAN                    3562.13   87.03    13.50   0.56
Pathak et al. [4]       133.481   129.30   27.71   0.76
Guilin Liu et al. [5]   124.62    105.94   28.82   0.90
S-WGAN (ours)           81.03     66.09    29.87   0.94
For MSE and MAE, lower values indicate better image quality. MSE measures the average squared intensity difference between pixels, while MAE measures the average magnitude of error between the ground-truth image and the reconstructed image. Conversely, for PSNR and SSIM, higher values indicate image quality closer to the ground truth. Our S-WGAN shows better results than the other state-of-the-art algorithms across all four metrics.
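For reference, a sketch of these metrics in numpy, assuming 8-bit images (peak value 255); SSIM is omitted since it requires a windowed implementation (e.g. skimage.metrics.structural_similarity):

```python
import numpy as np

def mse(gt: np.ndarray, rec: np.ndarray) -> float:
    """Average squared intensity difference between pixels."""
    d = gt.astype(np.float64) - rec.astype(np.float64)
    return float(np.mean(d ** 2))

def mae(gt: np.ndarray, rec: np.ndarray) -> float:
    """Average absolute error between ground truth and reconstruction."""
    return float(np.mean(np.abs(gt.astype(np.float64) - rec.astype(np.float64))))

def psnr(gt: np.ndarray, rec: np.ndarray, peak: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio in dB; higher is closer to the ground truth."""
    m = mse(gt, rec)
    return float('inf') if m == 0.0 else 10.0 * np.log10(peak ** 2 / m)
```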
Our proposed S-WGAN, with dilated convolution and the Wasserstein-perceptual loss function, outperforms the state-of-the-art results. Our model learns an end-to-end mapping from input images in a large-scale dataset to predict the missing pixels of the binary mask regions in the image. S-WGAN automatically learns to identify missing pixels from the input and encodes them as feature representations, to be reconstructed in the decoder. Skip connections help to transfer image details forward and to find a better local minimum through backpropagation.
Our experiments show the benefit of skip connections combined with the Wasserstein-perceptual loss for image inpainting. We visually compared our proposed method with the state of the art [4, 5] in Figure 7. To verify the effectiveness of our network, we carried out experiments with regular convolutions and the perceptual loss function of [26]. We noticed that the images produced had checkerboard artefacts and poor visual similarity to the original image, as shown in Figure 7(d). We then introduced skip connections with dilated convolution and our new loss function, and obtained improved results that were semantically reasonable, with realism preserved in all aspects.
Compared with existing methods, our S-WGAN learns specific structures in natural images thanks to the symmetric skip connections. As Figure 7 shows, our S-WGAN can handle irregularly shaped binary masks without blurry artefacts, and exhibits edge preservation and mask completion at border regions in the output images. Additionally, using the Wasserstein discriminator enables the overall network to perform better. This boosts the experimental performance of our network, achieving state-of-the-art performance in inpainting high-resolution images.
One limitation shared by other inpainting methods lies in the preprocessing step. Most preprocessing pipelines ignore the fact that the image must be converted to floating point before normalisation, with inverse normalisation applied to the output image; this contributes to the colour variation experienced by most researchers and leads to expensive post-processing. We solve this with S-WGAN and a new combination of loss functions that preserves colour and image detail.
VI Conclusion and Future Work
In this paper, we propose the Symmetric Skip Connection Wasserstein Generative Adversarial Network (S-WGAN). Our network generates images that are semantically and visually plausible, with the realism of facial features preserved. We achieved this with two key contributions: dilated convolutions that widen the receptive field in each block to capture more information and forward it to the corresponding deconvolutional blocks, and a new loss function based on feature space combined with the $\ell_1$ loss. Our network attains state-of-the-art performance and generates high-resolution images from inputs covered with arbitrarily shaped binary masks. The proposed network demonstrates the effectiveness of skip connections with dilated convolutions as a mechanism for capturing and refining contextual information, combined with a Wasserstein GAN. For future work, we aim to extend our model to inpaint coarse and fine wrinkles extracted by wrinkle detectors [32], with preserved realism.
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Quadro P6000 used for this research.
-  Alexei A Efros and Thomas K Leung. Texture synthesis by non-parametric sampling. In IEEE International Conference on Computer Vision (ICCV), page 1033. IEEE, 1999.
-  Marcelo Bertalmio, Guillermo Sapiro, Vincent Caselles, and Coloma Ballester. Image inpainting. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques, pages 417–424. ACM Press/Addison-Wesley Publishing Co., 2000.
-  Antonio Criminisi, Patrick Pérez, and Kentaro Toyama. Region filling and object removal by exemplar-based image inpainting. IEEE Transactions on image processing, 13(9):1200–1212, 2004.
-  Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2536–2544, 2016.
-  Guilin Liu, Fitsum A Reda, Kevin J Shih, Ting-Chun Wang, Andrew Tao, and Bryan Catanzaro. Image inpainting for irregular holes using partial convolutions. arXiv preprint arXiv:1804.07723, 2018.
-  Chao Yang, Xin Lu, Zhe Lin, Eli Shechtman, Oliver Wang, and Hao Li. High-resolution image inpainting using multi-scale neural patch synthesis. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, page 3, 2017.
-  Zhaoyi Yan, Xiaoming Li, Mu Li, Wangmeng Zuo, and Shiguang Shan. Shift-net: Image inpainting via deep feature rearrangement. In Proceedings of the European Conference on Computer Vision (ECCV), pages 1–17, 2018.
-  Connelly Barnes, Eli Shechtman, Adam Finkelstein, and Dan B Goldman. Patchmatch: A randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics (ToG), 28(3):24, 2009.
-  Jian Sun, Lu Yuan, Jiaya Jia, and Heung-Yeung Shum. Image completion with structure propagation. In ACM Transactions on Graphics (ToG), volume 24, pages 861–868. ACM, 2005.
-  Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa. Globally and locally consistent image completion. ACM Transactions on Graphics (TOG), 36(4):107, 2017.
-  Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S Huang. Generative image inpainting with contextual attention. arXiv preprint, 2018.
-  Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(Dec):3371–3408, 2010.
-  Quoc V Le, Jiquan Ngiam, Adam Coates, Abhik Lahiri, Bobby Prochnow, and Andrew Y Ng. On optimization methods for deep learning. In Proceedings of the 28th International Conference on International Conference on Machine Learning, pages 265–272. Omnipress, 2011.
-  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
-  Raymond A Yeh, Chen Chen, Teck-Yian Lim, Alexander G Schwing, Mark Hasegawa-Johnson, and Minh N Do. Semantic image inpainting with deep generative models. In CVPR, volume 2, page 4, 2017.
-  Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Deep image prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9446–9454, 2018.
-  Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
-  Qiang Wang, Huijie Fan, Gan Sun, Yang Cong, and Yandong Tang. Laplacian pyramid adversarial network for face completion. Pattern Recognition, 88:493–505, 2019.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  Vincent Dumoulin and Francesco Visin. A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285, 2016.
-  Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
-  Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2017.
-  Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
-  Adrian Rosebrock. Deep Learning for Computer Vision with Python. PyImageSearch.com, 2.1.0 edition, 2019.
-  Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
-  Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.
-  Hang Zhao, Orazio Gallo, Iuri Frosio, and Jan Kautz. Loss functions for image restoration with neural networks. IEEE Transactions on Computational Imaging, 3(1):47–57, 2016.
-  Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
-  Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Large-scale celebfaces attributes (celeba) dataset. Retrieved August 15, 2018.
-  Karim Iskakov. Semi-parametric image inpainting. arXiv preprint arXiv:1807.02855, 2018.
-  Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  Moi Hoon Yap, Jhan Alarifi, Choon-Ching Ng, Nazre Batool, and Kevin Walker. Automated facial wrinkles annotator. In Proceedings of the European Conference on Computer Vision (ECCV), pages 0–0, 2018.