Symmetric Skip Connection Wasserstein GAN for High-Resolution Facial Image Inpainting

01/11/2020, by Jireh Jam, et al.

We propose a Symmetric Skip Connection Wasserstein Generative Adversarial Network (S-WGAN) for high-resolution facial image inpainting. The architecture is an encoder-decoder with convolutional blocks, linked by skip connections. The encoder is a feature extractor that captures data abstractions of an input image to learn an end-to-end mapping from an input (binary masked image) to the ground truth. The decoder uses the learned abstractions to reconstruct the image. With skip connections, S-WGAN transfers image details to the decoder. In addition, we propose a Wasserstein-perceptual loss function to preserve colour and maintain realism in the reconstructed image. We evaluate our method and the state-of-the-art methods on the CelebA-HQ dataset. Our results show that S-WGAN produces sharper and more realistic images when visually compared with other methods. The quantitative measures show that our proposed S-WGAN achieves the best Structural Similarity Index Measure (SSIM) of 0.94.




I Introduction

Inpainting is an ancient technique, historically performed by professional artists to restore damaged paintings in museums. Defects (scratches, cracks, dust and spots) were inpainted by hand to restore and maintain image quality. The evolution of computers over the last century, and their frequent daily use, has encouraged inpainting to take a digital form [1, 2, 3, 4, 5, 6, 7] as an image restoration technique. Image inpainting aims to fill in missing pixels caused by a defect, based on pixel-similarity information [2].

The state-of-the-art approaches fall into two categories: conventional and deep learning-based methods. Traditional methods [1, 3, 8, 9] use image statistics of best-fitting pixels to fill in missing regions (defects); however, these approaches often fail to produce images with plausible visual semantics. With the evolution of research, deep learning-based methods [4, 5, 10, 11, 7, 6] encode the semantic context of an image into a feature space and fill in missing pixels by hallucination [6], through the use of generative neural networks.

Pathak et al. [4] propose the context encoder, closely related to the auto-encoder [12, 13] and AlexNet [14], to predict missing pixels in RGB images. This technique applies the ℓ2-norm as a pixel reconstruction loss to capture the overall structure of the missing region with regard to the image context. Iizuka et al. [10] introduce a globally and locally consistent training approach that uses two discriminators: a global discriminator assesses the coherency of the whole image, while a local discriminator ensures the local consistency of the pixels predicted in the binary mask region. Yang et al. [6] propose a multi-scale neural patch synthesis approach, which uses a combined optimisation framework with global and local texture constraints on a Convolutional Neural Network (CNN) to preserve contextual structures and predict missing regions of an image. Yeh et al. [15] use context and prior losses to train a network that searches for the encoding of a corrupted image in latent space to reconstruct the original image.

Liu et al. [5] use partial convolutions to replace typical convolutions [16], with an automatic mask-updating step. This technique masks and renormalises convolutions to target only valid pixels. Yan et al. [7] use deep feature rearrangement, adding a special shift-connection layer to the U-Net architecture [17]. Wang et al. [18] introduce a Laplacian approach based on residual learning [19] to propagate high-frequency details and predict missing information. However, most of these models apply expensive post-processing techniques (e.g. Poisson blending) to the final output to produce images that are visually consistent with the original image. Hence, this remains a challenging research area, owing to failures in generating realistic images from a compact latent feature.

Fig. 1: Images showing some issues with the state of the art: (a) Poor performance on holes of arbitrary sizes; (b) Lack of an edge-preserving technique; (c) Blurry artefacts; and (d) Poor performance on high-resolution images and on image completion with a mask at the border region.
Fig. 2: S-WGAN framework. The dilated convolution and deconvolution with the element-wise sum of feature maps (skip connection) combined with a Wasserstein network. The skip connections in the diagram ensure local pixel-level accuracy of the feature details to be retained.

Although deep learning approaches achieve excellent performance in facial inpainting, the state of the art has limitations, as illustrated in Figure 1: Figure 1(a) shows poor performance on holes of arbitrary sizes; Figure 1(b) illustrates the lack of edge preservation in existing techniques; Figure 1(c) depicts blurry artefacts; and Figure 1(d) demonstrates poor performance on high-resolution images and on image completion with a mask at the border region.

To correctly predict missing parts of a face and preserve its realism, we propose Symmetric Skip Connection Wasserstein GAN (S-WGAN) with the following contributions:

  • We propose a new framework with Wasserstein GAN that uses symmetric skip connection to preserve image details.

  • We define a new combined loss function based on feature space.

  • We demonstrate that our loss, combined with our S-WGAN, can achieve better results than the state-of-the-art algorithms.

II Proposed Framework

Our proposed model uses skip connections with dilated convolution to perform image inpainting. We discuss the architecture and loss function of S-WGAN in the following sections.

II-A Architecture

Fig. 3: Illustration of the dilated convolution process: convolving a 3×3 kernel with a dilation factor of 2 [20]. The growth of the receptive field is linear in the number of parameters [21]. At dilation rate 2, the 3×3 kernel has the same receptive field as a 5×5 kernel whilst using only 9 parameters.

Figure 2 shows the overall framework of our proposed S-WGAN. The network is designed with a generator (G) and a discriminator (D). We define G as an encoder-decoder framework with dilated convolutions and symmetric skip connections. Figure 3 shows the process of dilated convolution. Dilated convolutions, combined with skip connections, are critical to the design of our model, as:

  • It broadens the receptive fields to capture more contextual information without parameter accretion and computational complexity, which are preserved and transferred by skip connections to corresponding deconvolution layers.

  • It detects fine details and maintains high-resolution feature maps, and achieves end-to-end feature learning with a better local minimum (high restoration performance).

  • It has shown considerable improvement of accuracy in segmentation tasks [21, 22, 23].

Generator (G): The effectiveness of feature encoding is improved by having an encoder of n convolutional layers, with a kernel size of 5 and a dilation rate of 2, designed to match the size of the output image. This technique enables our model to learn larger spatial filters and helps reduce volume [24]. Each convolution block, with the exception of the final layer, has a leaky ReLU activation and a max-pooling operation. We apply dropout regularisation with a probability of 0.25 in the 4th and final layers of the encoder. The dropout layers randomly disconnect nodes and adjust the weights to propagate information to the decoder without overfitting.
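As an illustration, one encoder block as described above might be sketched in Keras (the paper's framework) as follows. This is a hypothetical reconstruction: the filter count, input size and pool size are assumptions, since the paper does not publish code.

```python
import tensorflow as tf
from tensorflow.keras import layers

def encoder_block(x, filters, dropout=False):
    # 5x5 convolution with dilation rate 2, as described for the encoder.
    x = layers.Conv2D(filters, kernel_size=5, dilation_rate=2, padding="same")(x)
    x = layers.LeakyReLU()(x)                 # leaky ReLU activation per block
    x = layers.MaxPooling2D(pool_size=2)(x)   # pool size is an assumption
    if dropout:                               # dropout p=0.25 in the 4th/final blocks
        x = layers.Dropout(0.25)(x)
    return x

inp = layers.Input(shape=(256, 256, 3))       # input resolution assumed
feat = encoder_block(inp, 64)
model = tf.keras.Model(inp, feat)
```

With "same" padding, each block halves the spatial resolution via pooling while the dilated kernel widens the receptive field.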

Decoder: The decoder consists of blocks of deconvolutional layers, with learnable upsampling layers that recover image details. The corresponding feature maps in the encoder and decoder are symmetrically linked by element-wise skip connections to reach the optimum size. The final layer in the decoder uses a Tanh activation.
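A minimal sketch of one decoder block with an element-wise skip connection, assuming (hypothetically) a transposed convolution for the learnable upsampling and the same 5x5 kernel as the encoder:

```python
import tensorflow as tf
from tensorflow.keras import layers

def decoder_block(x, skip, filters):
    # Learnable upsampling via a transposed convolution (kernel size assumed),
    # followed by an element-wise sum with the mirrored encoder feature map.
    x = layers.Conv2DTranspose(filters, kernel_size=5, strides=2, padding="same")(x)
    x = layers.Add()([x, skip])   # element-wise skip connection, not concatenation
    return layers.LeakyReLU()(x)

bottleneck = layers.Input(shape=(64, 64, 128))   # shapes are illustrative
enc_feat = layers.Input(shape=(128, 128, 64))    # mirrored encoder feature map
out = decoder_block(bottleneck, enc_feat, 64)
model = tf.keras.Model([bottleneck, enc_feat], out)
```

The element-wise sum requires the decoder's output channels to match the mirrored encoder feature map, which is why the blocks are symmetric.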

Dilated Convolutions: We express the dilated convolution on the network input in Equation 1:

    y(i) = Σ_k x(i + l·k) · w(k),     (1)

where y is the output feature map of the dilated convolution from the input x, and the filter is given by w. The dilation rate parameter (l) reverts the operation to a standard convolution when l = 1.

It is advantageous to use dilated convolution compared to typical convolutional layers combined with pooling, because a small k×k kernel enlarges to an effective size of k + (k − 1)(l − 1) at dilation rate l, thus allowing a flexible receptive field over fine-detail contextual information while maintaining high-quality resolution.
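The effective kernel size above can be checked with a few lines of arithmetic:

```python
def effective_kernel(k, l):
    # A k x k kernel with dilation rate l spreads its taps l-1 cells apart,
    # giving an effective size of k + (k - 1) * (l - 1) with no extra weights.
    return k + (k - 1) * (l - 1)

# A 3x3 kernel at dilation rate 2 covers a 5x5 region with only 9 parameters;
# the 5x5 kernels at rate 2 used here cover a 9x9 region with 25 parameters.
print(effective_kernel(3, 2))  # 5
print(effective_kernel(5, 2))  # 9
```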

The inpainting solver may produce predictions of the missing region that are reasonable or ill-posed. Because of this, instead of the discriminator output being constrained to [0, 1] as in normal GANs, we include a Wasserstein critic adopted from [25] to provide improved stability and enhanced discrimination for photo-realistic images. With ongoing adversarial training, the discriminator becomes unable to distinguish real data from fake. Equation 2 shows the reconstruction of the image during training from G:

    Î = M ⊙ I_gt + (1 − M) ⊙ I_pred,     (2)

where Î is the reconstructed image, I_gt is the ground truth, I_pred is the prediction of G, ⊙ denotes element-wise multiplication, and M is the binary mask, whose values 1 and 0 represent the known context of the image and the missing regions, respectively.
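The composition in Equation 2 can be sketched in a few lines of numpy. This is a hedged sketch; it assumes the convention that mask value 1 marks known pixels and 0 marks the hole:

```python
import numpy as np

def compose(pred, gt, mask):
    # Sketch of Eq. 2: known pixels are copied from the ground truth,
    # missing pixels are taken from the generator's prediction.
    return mask * gt + (1.0 - mask) * pred

gt   = np.full((4, 4), 0.8)                     # ground-truth intensities
pred = np.full((4, 4), -0.2)                    # generator output
mask = np.ones((4, 4)); mask[1:3, 1:3] = 0.0    # a 2x2 hole in the centre
out  = compose(pred, gt, mask)
```

Only the hole region receives generated content; everything else is passed through unchanged, which keeps the loss focused on the prediction.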

Equation 3, adopted from [25], gives the Wasserstein discriminator objective:

    max_D  E_{x∼P_r}[D(x)] − E_{x̂∼P_g}[D(x̂)],     (3)

where D is the discriminator and P_r is the real data distribution; G is the generator of our network and P_g is the distribution of generated data.
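Written as losses to minimise, the critic and generator objectives of Equation 3 reduce to mean differences of critic scores; a minimal numpy sketch:

```python
import numpy as np

def critic_loss(d_real, d_fake):
    # Wasserstein critic objective (Eq. 3) as a loss to minimise:
    # the critic maximises E[D(real)] - E[D(fake)].
    return -(np.mean(d_real) - np.mean(d_fake))

def generator_loss(d_fake):
    # The generator tries to raise the critic's score on generated samples.
    return -np.mean(d_fake)
```

In practice the critic's Lipschitz constraint (weight clipping or a gradient penalty) must also be enforced; that detail is omitted in this sketch.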

II-B Loss function

Perceptual loss

Instead of the typical Mean Square Error (MSE) loss function used in [4], we define a combined loss function based on feature space. To achieve this, we adopt a VGG-16 model pre-trained on ImageNet [14]. We utilise block3-convolution3 of this model as our feature extractor to compute our loss function [26]. Using MSE as a base, we calculate the squared difference of the feature maps instead of the pixel-wise representation. Since our goal is to output high-resolution, visually plausible images whose detailed characteristics pass undetected by the human visual system, we use the ℓ1 loss as a constraint and compute our perceptual loss function (Equation 5) in feature space. The ℓ1 loss preserves colour and luminance and does not over-penalise larger errors [27]. It adjusts our perceptual loss to minimise errors, and allows a better evaluation of feature predictions against the ground truth than the normal pixel-wise difference.

More specifically, we define the ℓ1 loss as:

    L_ℓ1 = (1/N) Σ_p |F̂_p − F_p|,     (4)

where p is the pixel index belonging to the feature maps, with F̂_p and F_p as pixel values of the prediction and ground-truth feature maps, and N the number of elements. The advantage of using feature space is that a particular filter determines the extraction of feature maps from low-level to high-level, sophisticated features. To reconstruct quality images, we compute our loss function with feature maps taken from block3-conv3, resized to the same size as the masks and generated images. Using block4-conv4 or block5-conv5 would result in poorer quality, as the network widens its view at these layers due to the larger number of filters used.
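The ℓ1 term defined above is simply a mean absolute difference taken over feature-map elements rather than raw pixels; a minimal numpy sketch:

```python
import numpy as np

def l1_feature_loss(f_pred, f_gt):
    # Mean absolute difference over all feature-map elements: the l1 loss
    # defined above, computed in feature space rather than on raw pixels.
    return np.mean(np.abs(f_pred - f_gt))
```

Unlike a squared error, each element contributes linearly, so large outlier errors do not dominate the gradient.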

We adopt the loss function of [26] and define our perceptual loss as follows:

    L_perc = (1/(C·H·W)) Σ_p |Φ(I)_p − Φ(Î)_p|,     (5)

where I is the input (ground-truth) image, Î is the reconstructed image, Φ denotes the feature maps with high-level representational abstractions extracted from the third block convolution layer, and C × H × W are the dimensions of those feature maps. By combining the ℓ1 loss, the model learns to produce finer details in the predicted features and outputs without blurry artefacts. The entire model trains end-to-end with backpropagation and uses the Wasserstein loss (L_w) to optimise G and D and to learn reasonable predictions. Our goal is to reconstruct an image from the masked input by training the generator to learn and preserve image details. The Wasserstein loss improves convergence in GANs and is computed as the mean difference between the critic's scores on real and generated images. The combined Wasserstein-perceptual loss function (L_wp) is defined in Equation 6:

    L_wp = L_w + L_perc.     (6)
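The VGG-16 feature extractor used for the perceptual loss might be built in Keras as below. This is a sketch under stated assumptions: the input shape is assumed, and the resizing of masks mentioned above is omitted.

```python
import tensorflow as tf
from tensorflow.keras.applications import VGG16

def build_extractor(input_shape=(256, 256, 3), weights="imagenet"):
    # Frozen VGG16 truncated at block3_conv3; the ImageNet-pretrained weights
    # act as a fixed perceptual feature extractor (input shape is assumed).
    vgg = VGG16(weights=weights, include_top=False, input_shape=input_shape)
    extractor = tf.keras.Model(vgg.input, vgg.get_layer("block3_conv3").output)
    extractor.trainable = False
    return extractor

def perceptual_loss(extractor, y_true, y_pred):
    # Mean absolute (l1) difference between the images' feature maps (Eq. 5).
    return tf.reduce_mean(tf.abs(extractor(y_true) - extractor(y_pred)))
```

Freezing the extractor is important: only the generator's weights should change in response to this loss.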


III Experiment

This section describes the dataset, binary masks and the implementation.

III-A Dataset and irregular binary mask

Fig. 4: Sample images from CelebA-HQ Dataset [28].

Our experiment focuses on high-resolution face images and irregular binary masks. The benchmark dataset for high-resolution face images is the CelebA-HQ dataset [28], which was curated from the CelebA dataset [29] and contains 30,000 images. Figure 4 shows a few samples from the CelebA-HQ dataset.

To create irregular holes on images, we use the Quickdraw irregular mask dataset [30], which is publicly available and divided into 50,000 training and 10,000 test masks. Figure 5 shows samples from the mask dataset [30].

Fig. 5: Sample irregular binary mask images reproduced from Quick draw irregular mask dataset [30].

III-B Implementation

We used the Keras library with a TensorFlow backend to implement and design our network. With our choice of dataset, we followed the experimental settings of the state of the art [5] and split our data into 27,000 images for training and 3,000 images for testing.

We convert each image to a normalised floating-point representation, setting the pixel intensity values to the range [−1, 1], and apply the mask to the image to obtain our input, as shown in Figure 6. We initialise pre-trained VGG16 weights to compute our loss function. We use separate learning rates for G and D and optimise the training process using the Adam optimiser [31]. We use a Quadro P6000 GPU machine to train these models. Given our hardware conditions, we use a batch size of 5 for the input images. It takes 0.193 seconds to predict the missing pixels of any size created by a binary mask on an image, and ten days to train 100 epochs.
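The normalisation and masking step above can be sketched in numpy. The mask convention (1 = known pixel, 0 = hole) is an assumption, as is zeroing out the hole region:

```python
import numpy as np

def make_input(img_uint8, mask):
    # Convert the 8-bit image to floating point and normalise to [-1, 1],
    # then apply the binary mask (assumed 1 = known pixel, 0 = hole).
    img = img_uint8.astype(np.float32) / 127.5 - 1.0
    return img * mask
```

Converting to float *before* normalising avoids the integer truncation that, as discussed later, causes colour variation in the output.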

(a) GT
(b) Mask
(c) Masked-Image
Fig. 6: Process of input generation: a) CelebA-HQ image; b) Binary mask image [30]; and c) Corresponding masked image (input image).

IV Results

We assess the performance of the inpainting methods qualitatively and quantitatively in this section.

IV-A Qualitative Comparisons

Considering the importance of visual and semantic coherence, we conducted a qualitative comparison on our test dataset. First, we implemented a WGAN approach with perceptual loss and Wasserstein distance. We observed an induced pattern and poor colour on the images, as shown in Figure 7(d). We then introduced dilated convolution and skip connections, combined with our Wasserstein-perceptual loss function, to handle the induced pattern and match the luminance of the original images.

We compare our model with three popular methods:

  • CE: Context-Encoder method by Pathak et al. [4].

  • PConv: Image Inpainting for irregular holes using partial convolutions by Liu et al. [5].

  • WGAN: a Wasserstein GAN approach with perceptual loss, using the same network structure as S-WGAN but with regular convolutions and no skip connections.

We test our S-WGAN against the state of the art on the CelebA-HQ test dataset and show the results in Figure 7. Based on visual inspection, Figure 7(b) illustrates the blurry and checkerboard effects generated by Pathak et al.'s CE method [4]. PConv [5] generates clear images but leaves residues of the mask on the image, as illustrated in Figure 7(c). WGAN induces a pattern and poor colour on the images, as shown in Figure 7(d). Overall, our proposed S-WGAN, shown in Figure 7(e), produces the best visual results when compared with the ground truth in Figure 7(f).

(a) Masked-Image
(b) CE
(c) PConv
(d) WGAN
(e) S-WGAN
(f) GT
Fig. 7: Qualitative comparison of our proposed S-WGAN with the state-of-the-art methods on CelebA-HQ: (a) Input image; (b) CE [4]; (c) PConv [5]; (d) WGAN; (e) S-WGAN (proposed method);and (f) Ground-truth image.

IV-B Quantitative Comparisons

We select popular image quality metrics, including Mean Absolute Error (MAE), MSE, Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index Measure (SSIM), to evaluate performance quantitatively. Table I shows the results of our experiment compared to the state of the art [4, 5] for image inpainting, with our S-WGAN in the last row.

TABLE I: Quantitative comparison of performance assessment metrics on 3,000 test images from the CelebA-HQ dataset.

Inpainting Method        MSE      MAE     PSNR   SSIM
WGAN                     3562.13  87.03   13.50  0.56
Pathak et al. [4]        133.481  129.30  27.71  0.76
Guilin Liu et al. [5]    124.62   105.94  28.82  0.90
S-WGAN (ours)            81.03    66.09   29.87  0.94

For MSE and MAE, the lower the value, the better the image quality. MSE measures the average squared intensity difference of pixels while MAE measures the magnitude of error between the ground-truth image and the reconstructed image. Conversely, for PSNR and SSIM, the higher the value, the closer the image quality to the ground-truth. In our case, S-WGAN shows better results than the other state-of-the-art algorithms across all the performance metrics.
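The metric definitions above can be sketched in numpy (SSIM involves local windowed statistics and is best left to a library such as scikit-image):

```python
import numpy as np

def mse(a, b):
    # Average squared intensity difference of pixels; lower is better.
    return np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)

def mae(a, b):
    # Average magnitude of error between ground truth and reconstruction.
    return np.mean(np.abs(a.astype(np.float64) - b.astype(np.float64)))

def psnr(a, b, peak=255.0):
    # Peak signal-to-noise ratio in dB; higher means closer to the ground truth.
    e = mse(a, b)
    return float("inf") if e == 0 else 10.0 * np.log10(peak ** 2 / e)
```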

V Discussion

Our proposed S-WGAN, with dilated convolution and the Wasserstein-perceptual loss function, outperforms the state-of-the-art results. Our model learns an end-to-end mapping of input images from a large-scale dataset to predict the missing pixels of the binary mask regions. S-WGAN automatically identifies missing pixels from the input and encodes them as feature representations to be reconstructed in the decoder. Skip connections help transfer image details forward, while backward propagation finds a good local minimum.

Our experiments show the benefit of skip connections combined with the Wasserstein-perceptual loss for image inpainting. We have visually compared our proposed method with the state of the art [4, 5] in Figure 7. To verify the effectiveness of our network, we carried out experiments with regular convolutions and the perceptual loss function of [26]. We noticed that the produced images had checkerboard artefacts and poor visual similarity to the original image, as shown in Figure 7(d). We then introduced skip connections with dilated convolution and our new loss function, and obtained improved results that are semantically reasonable, with preserved realism in all aspects.

Compared to existing methods, our S-WGAN learns specific structures in natural images owing to its symmetric skip connections. Based on Figure 7, our S-WGAN can handle irregularly shaped binary masks without blurry artefacts, and shows edge preservation and mask completion at border regions of the output images. Additionally, using the Wasserstein discriminator enables the overall network to perform better. This boosts the experimental performance of our network to achieve state-of-the-art performance in inpainting high-resolution images.

One limitation common to other inpainting methods lies in the preprocessing step. Most preprocessing pipelines ignore the fact that the image has to be converted to floating point before normalisation, and inverse-normalised for the output image; this contributes to the colour variation experienced by most researchers and leads to expensive post-processing. We solve this in S-WGAN with a new combination of loss functions that preserves colour and image detail.

VI Conclusion and Future Work

In this paper, we propose a Symmetric Skip Connection Wasserstein Generative Adversarial Network (S-WGAN). Our network can generate images that are semantically and visually plausible, with preserved realism of facial features. We achieved this with two key contributions: dilated convolutions that widen the receptive field in each block to capture more information, and symmetric skip connections that forward it to the corresponding deconvolutional blocks. In addition, we introduced a new loss function based on feature space, combined with the ℓ1 loss. Our network attains state-of-the-art performance and generates high-resolution images from inputs covered by arbitrary binary mask shapes. The proposed network has shown the effectiveness of skip connections with dilated convolutions as a mechanism for capturing and refining contextual information, combined with a Wasserstein GAN. For future work, we aim to extend our model to inpaint coarse and fine wrinkles extracted from wrinkle detectors [32] with preserved realism.


We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Quadro P6000 used for this research.