Deep Stacked Networks with Residual Polishing for Image Inpainting

12/31/2017 ∙ by Ugur Demir, et al. ∙ 0

Deep neural networks have shown promising results in image inpainting even if the missing area is relatively large. However, most of the existing inpainting networks introduce undesired artifacts and noise to the repaired regions. To solve this problem, we present a novel framework which consists of two stacked convolutional neural networks that inpaint the image and remove the artifacts, respectively. The first network considers the global structure of the damaged image and coarsely fills the blank area. Then the second network modifies the repaired image to cancel the noise introduced by the first network. The proposed framework splits the problem into two distinct partitions that can be optimized separately, therefore it can be applied to any inpainting algorithm by changing the first network. Second stage in our framework which aims at polishing the inpainted images can be treated as a denoising problem where a wide range of algorithms can be employed. Our results demonstrate that the proposed framework achieves significant improvement on both visual and quantitative evaluations.



There are no comments yet.


page 2

page 8

page 9

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

The goal of inpainting is reconstruction of an image without incurring noticeable changes [Bertalmio et al.(2000)Bertalmio, Sapiro, Caselles, and Ballester]. It is a widely used technique by the photo and video editing applications for repairing damaged images, removing undesired objects or refilling the missing parts of images. Although fixing the small deteriorations are relatively simple, filling the large holes or removing an object from the scene are still challenging due to complexity of the problem.

With the recent advancement of Convolutional Neural Networks (CNN), several generative models that produce visually pleasant outputs have been presented for inpainting [Pathak et al.(2016)Pathak, Krähenbühl, Donahue, Darrell, and Efros, Yang et al.(2016)Yang, Lu, Lin, Shechtman, Wang, and Li, Nguyen et al.(2017)Nguyen, Yosinski, Bengio, Dosovitskiy, and Clune]

. The most popular approach is using an Autoencoder-like (AE) architecture that takes center cropped images (see Figure

5) and tries to synthesize realistic image patches to fill the blank areas. Results demonstrate that CNNs have a great potential to learn structure of the images collected from the real world [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton, Goodfellow et al.(2014)Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, and Bengio].

One of the essential questions about realistic texture synthesis is: how can we measure the realism? No magical mathematical formula to determine whether an image is real or artificially constructed exists. In order to solve this challenging problem, a crucial step is to construct synthesis models which are trained based on a comparison of real images with generated outputs. Although primitive objective functions like Euclidean Distance assist in measuring and comparing information on the general structure of the image, they tend to converge to the mean of pixel values that cause blurry outputs.

Goodfellow et alhave taken image synthesis step forward by presenting Generative Adversarial Networks (GAN) [Goodfellow et al.(2014)Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, and Bengio]

. An additional binary classifier, which is called a discriminative network, is included in order to classify whether an image comes from a real distribution or a generator network output. The discriminative network is trained while the generative network tries to convince the former by producing more realistic patches. During the training, the generative network is scored by an adversarial loss that is calculated by the discriminator network. Another remarkable approach for image generation is given by a loss function that compares the features extracted from a pre-trained network instead of through direct pixel-wise measurements. It is known as the content loss in the literature

[Johnson et al.(2016)Johnson, Alahi, and Fei-Fei, Larsen et al.(2016)Larsen, Sønderby, Larochelle, and Winther]. The idea of this approach lies in the recovery of high frequency details rather than blur. While using these loss functions, although plausible image patches are produced, they introduce undesired artifacts and noise due to complexity of the training procedure as can be seen in Figure 5.

The contribution of our work can be summarized upfront as follows: we divide the problem of image inpainting into two parts. First part, which we call the Coarse Painter Net (CPN), synthesizes the image texture by studying the uncorrupted parts. Second part of our method, which we call the Fine Painter Network (FPN) takes the reconstructed region and applies an enhancement to obtain an improved reconstruction. The latter is similar in spirit to the super-resolution problem, however, instead of enlarging the image, we aim to recover details without resizing. Furthermore, the second network aims to reduce noise if present. Overall, our FPN is designed to learn changes instead of the final image patch itself.


Figure 1: General structure of the Residual Polishing framework.

2 Related Work

As described above, we perform an inpainting process followed by an enhancement of the result by another process, overall of which we call the Residual Polishing framework. Thus, our work is related to a set of topics such as inpainting, denoising and super-resolution in the literature.

Early studies on inpainting generally worked on a single damaged image [Bertalmio et al.(2000)Bertalmio, Sapiro, Caselles, and Ballester, Criminisi et al.(2004)Criminisi, Perez, and Toyama, Liu and Caselles(2013)]. Starting from the border of the missing region, similar patches are searched inside the uncorrupted image segments. The closest discovered patches are placed into the missing region, and the process is repeated until the blank area is completely filled. Most of those earlier works paid attention to improving the speed of similar patch finding procedure. Although those methods have shown superior performance on the images which contain similar texture details, they suffered from the lack of global structural information, which led to undesirable outputs.

CNN has shown great success in both classification [Simonyan and Zisserman(2014), Szegedy et al.(2016)Szegedy, Ioffe, and Vanhoucke, He et al.(2016b)He, Zhang, Ren, and Sun] and regression tasks [Bruna et al.(2015)Bruna, Sprechmann, and LeCun, Goodfellow et al.(2014)Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, and Bengio, Radford et al.(2015)Radford, Metz, and Chintala]. Especially AE architecture has been used widely for image generation and reconstruction [Pathak et al.(2016)Pathak, Krähenbühl, Donahue, Darrell, and Efros, Dong et al.(2014)Dong, Loy, He, and Tang, Ledig et al.(2016)Ledig, Theis, Huszar, Caballero, Aitken, Tejani, Totz, Wang, and Shi]. Adversarial training scheme has shown striking success for realistic image generation. Lately, a wide range of GAN type architectures, which offer novel objective functions by changing the structure of discriminator [Zhao et al.(2016)Zhao, Mathieu, and LeCun, Nguyen et al.(2017)Nguyen, Yosinski, Bengio, Dosovitskiy, and Clune, Springenberg(2016)], have been proposed. Pathak et aluse the GAN for an inpainting CNN which is called Context-Encoder [Pathak et al.(2016)Pathak, Krähenbühl, Donahue, Darrell, and Efros]. Their inpainting network takes the center cropped images and regress the missing part. Indeed, our CPN architecture is inspired from the Context-Encoder. On the other hand, most GANs have been focusing solely on the realistic image generation instead of generation of an image patch well-matched to the global image, and that property of GANs is incompatible with the original goal of the inpainting.

Using features extracted from pre-trained networks for comparing images has become popular recently in art transfer [Johnson et al.(2016)Johnson, Alahi, and Fei-Fei, Li and Wand(2016)], AE training [Larsen et al.(2016)Larsen, Sønderby, Larochelle, and Winther] and inpainting [Yang et al.(2016)Yang, Lu, Lin, Shechtman, Wang, and Li]. In the literature, this idea is known as either mainly content loss, perceptual loss, style loss or feature loss, as we call here. Generally, features are extracted from a classification network, which is trained on huge data collections due to their representation strength. During the optimization, features of the generated image and the ground truth are forced to be close. The learned feature loss measures produce more robust results compared to those of the pixel-based distance measures [Larsen et al.(2016)Larsen, Sønderby, Larochelle, and Winther]. Notable inpainting results are obtained recently through using a combination of the feature loss and the adversarial loss [Yeh et al.(2016)Yeh, Chen, Lim, Hasegawa-Johnson, and Do]. Also, in [Yang et al.(2016)Yang, Lu, Lin, Shechtman, Wang, and Li], Yang et alreported state-of-the-art results by applying a local texture constraint that was inspired from [Li and Wand(2016)] along with the content loss and the adversarial loss.

Residual Networks have been developed to improve gradient flow in the deep networks through addition of skip connections to the architecture [He et al.(2016b)He, Zhang, Ren, and Sun, He et al.(2016a)He, Zhang, Ren, and Sun, Zagoruyko and Komodakis(2016), Szegedy et al.(2016)Szegedy, Ioffe, and Vanhoucke, Xie et al.(2017)Xie, Girshick, Dollár, Tu, and He]

. They stack the residual blocks which consist of several convolutional layers and a residual connection. This operation reduces the training time dramatically while classification accuracies on the public datasets are improved significantly. In another area of regression problems, residual connections are used differently. Kim

et alproposed a super-resolution model that learns the difference between the high resolution image and its low resolution counterparts [Kim et al.(2016)Kim, Lee, and Lee]. The final result is obtained by adding the difference to the input image. It was shown that regressing the difference between input and the desired output can produce considerable performance improvement [Zhang et al.(2016)Zhang, Zuo, Chen, Meng, and Zhang, Han et al.(2016)Han, Yoo, and Ye, Lettry et al.(2016)Lettry, Vanhoey, and Gool].

In this paper, we present a novel method that combines a coarse painter and a fine painter network, which is described in Section 3. Our fine painter idea is inspired from the super-resolution problem where a high resolution image is produced based on a given low resolution input. To our knowledge, our proposed residual polishing method is the first application that uses the residual connections to improve the results in image inpainting problem, which we will demonstrate in Experiments (Section 4), followed by Conclusions (Section 5).

3 Residual Polishing

In Residual Polishing framework, the intention is completion of the inpainting process at two stages. In the beginning, we obtain the input image by removing the center part of the image that is taken from the dataset. Our CPN fills the blank regions in to obtain inpainted image part at the end of the first stage. We hypothesize that has noise and artifacts introduced by the CPN, therefore, further improvement should be possible by applying a suitable procedure. To improve quality of the ultimate result, in the second stage, the FPN removes the undesired effects on and generates the final image . Figure 1 shows the general structure of our approach. Apart from noise and artifacts, FPN additionally finds the undiscovered details. The generated image part is placed to the missing area of the input to finalize the algorithm. Here we can formulate Residual Polishing as;


where represents FPN, represents CPN and is the center removal operation to obtain from . Following sections give details about the architectures and training steps.

3.1 Coarse Painter Network

CPN is formed by sequentially stacking an encoder and a decoder module . The Encoder part takes an input image and produces a latent representation called the bottleneck features. The latent output is passed to the decoder network to generate missing part of . This approach is inspired from the Context-Encoder proposed in [Pathak et al.(2016)Pathak, Krähenbühl, Donahue, Darrell, and Efros]. The difference is that we use a fully-connected layer at the end of the encoder instead of the channel-wise fully-connected layer.

Architecture of the CPN is very similar to that of [Radford et al.(2015)Radford, Metz, and Chintala]

. The filter sizes are fixed to 4x4 for each convolutional layer. To expose global information in the image, input is subsampled by strided convolutions. Filter depth is doubled for each subsampling operations. Layers of the encoder other than the last one consist of convolution, batch normalization

[Ioffe and Szegedy(2015)]

and Leaky ReLU (LReLU) activation, respectively. The last layer contains only a traditional fully-connected layer which connects the encoder to the decoder. To reconstruct the output from the bottleneck features, decoder applies transposed convolution, batch normalization and Exponential Linear Unit (ELU)

[Clevert et al.(2015)Clevert, Unterthiner, and Hochreiter] activation. Figure 2 shows the detailed architectural design of the CPN.


Figure 2: Coarse Painter Network architecture.

At the training stage, we use a combination of three loss functions. They are optimized jointly via backpropagation. We describe each loss function briefly as follows.

Euclidean Loss computes the pixel-wise Euclidean Distance between the synthesized image patch and the ground truth. Even though it forces the network to produce a blurry output, it guides the network to roughly predict texture colors. It is defined as:


where is the number of samples, is the ground truth, is the generated output, , , are width, height and channel of the compared images.

Adversarial Loss is computed by the discriminator network that is introduced in the training phase. It tries to distinguish whether the input comes from the real data distribution or generator output distribution . The generative network , which is CPN in that case, is optimized to fool the while the discriminator tries to increase its accuracy. and are trained simultaneously by solving


where and are the parameters of the CPN and the discriminator network. While the adversarial loss helps to generate more realistic image textures, it causes appearance of superfluous details. Relying only on the adversarial loss makes training difficult and causes unstable behaviour.

Feature Loss transfers the compared images from the pixel value space to a space where the features obtained from an external model are used. We utilized VGG16 [Simonyan and Zisserman(2014)]

network, which is trained on ImageNet dataset

[Deng et al.(2009)Deng, Dong, Socher, Li, Li, and Fei-Fei], as the feature extraction network due to its proven success. The feature network is used with pre-trained values and its weights are kept constant during the training. We use the intermediate activation maps (relu_2_1) as features. The feature loss is calculated by


where is the inpainted image, is the feature extraction operation, , , are the width, height and depth of the activation map. Here, we note that in Equation 4, the whole inpainted image and the whole source image are compared instead of just the center part of the original image and the output of the CPN to make use of the global structure similarity in images.

The final CPN architecture and the training strategy are determined through the experiments. The best results are obtained when the combination of the Euclidean Loss, Adversarial Loss and Feature Loss are used as the objective function. Each component of the loss function is governed by a coefficient :


Adversarial loss is calculated by solving the Equation 3. Also a L2 regularization term is added to the loss function to apply weight decay to the CPN parameters.

3.2 Fine Painter Network

Our FPN in Figure 3 takes the image patch obtained from the CPN, and supposing that the output of the CPN has a noisy characteristic, it aims to improve its quality by a "noise-removal" operation through residual connections in the network.

Input and output of the FPN are close to each other because most of the texture detail is determined by the first network. Thus, learning the residual image which is the difference between the input and the output is more accessible than directly regressing the output. To condition the network to produce a residual image , the input of the FPN is connected to the output with a skip connection. Defining the residual image by , the objective function becomes


Equation 6 indicates that residual image and the ground truth

must have the same size. In order to satisfy the size constraint, the convolutional layers are used without stride and activation maps are padded with zero. We build two different FPNs. First version uses only cascaded convolutional layers and ELU activations. Our second design puts batch normalization between convolution and the activation for each layer. The second network was experimented with several activation functions and we obtained the best results with ReLU. Figure

3 shows our FPN architecture.


Figure 3: Fine Painter Network architecture.

To obtain the polished output, the generated residual image is added to the output of the CPN. Similar to the literal meaning of "polishing", our Residual Polishing network grinds the surface of the image to make it appear smoother with reduced artifacts in the inpainted area.

4 Experiments

In this section, we evaluate the performance of our proposed method and compare residual polishing with the recent inpainting methods. One of the major problems of inpainting applications is measuring the output quality. Whereas the extracted center parts of the images are used as ground truths, the generated patches can be different while they are still plausible. Thus, pixel-wise comparison can be misleading. Nevertheless, to compare the algorithms, we use peak signal to noise ratio (PSNR), mean L1 and L2 losses for evaluation. We also provide visual evidence of successfully inpainted images in Figure 5.

4.1 Dataset

With the advancement of augmented and virtual reality applications, street view images and videos receive increased interest. People can travel around the world even without being there. However, a large amount of confidential or private scenes exists in the street view image collections. To avoid personal privacy breach, for instance, GoogleTM adds blur or some filtering to cover undesired image parts, which certainly disrupts the integrity of user experience. As per mentioned motivation, we trained our proposed inpainting network on Google Street View dataset [Zamir and Shah(2014)] to learn realistic street view image generation so that we can hide the unwanted parts in the images.

Google Street View dataset consist of 62058 high quality images. It is divided into 10 parts. We use the first and tenth parts as the testing set, the ninth part for validation, and the rest of the parts are included in the training set. In this way, 46200 images are used for training. Images are scaled to the size 128x128 and its 64x64 sized center part is cropped and kept as the ground truth. We do not apply data augmentation at the training stage.

4.2 Implementation

Residual Polishing framework is implemented using Tensorflow

[Abadi et al.(2015)Abadi, Agarwal, Barham, Brevdo, Chen, Citro, Corrado, Davis, Dean, Devin, Ghemawat, Goodfellow, Harp, Irving, Isard, Jia, Jozefowicz, Kaiser, Kudlur, Levenberg, Mané, Monga, Moore, Murray, Olah, Schuster, Shlens, Steiner, Sutskever, Talwar, Tucker, Vanhoucke, Vasudevan, Viégas, Vinyals, Warden, Wattenberg, Wicke, Yu, and Zheng]. Our networks are trained separately on NVIDIATM Tesla K20 5GB and GeForceTM

GTX 960 4GB graphic cards. For performance measurements, we used Context-Encoder Torch

[Collobert et al.(2002)Collobert, Bengio, and Marithoz] implementation provided by its authors. Authors of [Yang et al.(2016)Yang, Lu, Lin, Shechtman, Wang, and Li] have made available only the pre-trained model of their approach. Therefore, we could not train that network on the Google Street View dataset. Instead, to our CPN output, we applied a local texture constraint that is used by the mentioned method, and obtained plausible results.

(a) Coarse Painter Network (b) Fine Painter Networks on Validation Set
Figure 4: PSNR curve of the networks: (a) CPN performance on training and validation set; (b) Comparison of different FPN architectures and CPN on evaluation set. Note that CPN is not trained during the FPN training, we added CPN curve on (b) for comparison purposes.

4.3 Training

CPN is trained with joint loss function stated in Equation 5 using Adam optimizer [Kingma and Ba(2014)]. We set the parameters for the optimizer as , and . Contributions of different loss functions are determined by the parameters , and . Figure 4

shows the training and validation performance of CPN during 200 epochs. It is clearly seen that after 50 epochs, CPN stops learning. Thus, we take the model at that point as our final CPN.

During the FPN training, CPN weights are not updated. FPN is trained by Adam optimizer which uses the same parameters specified for CPN. In Figure 4, we show the performance of different FPN architectures against CPN. We tried several residual architectures with different activations. Without batch normalization, ELU is the only activation that we can train our network where ReLU and LReLU could not achieve considerable improvements. After batch normalization layers are placed between convolution layers and the activations, we obtain the best results for our setup. With batch normalization, performance of ReLU and LReLU are too close. The final FPN is constructed by convolution, batch normalization and ReLU blocks.

(a) Input (b) Ctx-Enc (c) NPS (d) CPN (e) Our Results

Figure 5: Image inpaintig results obtained from different methods. Please zoom in while comparing the results. (a) Center cropped input images taken from the test set; (b) Context Encoder [Pathak et al.(2016)Pathak, Krähenbühl, Donahue, Darrell, and Efros] output; (c) Output of Neural Patch Synthesis [Yang et al.(2016)Yang, Lu, Lin, Shechtman, Wang, and Li]; (d) CPN output (e) FPN output.

(b) (c) (d) (e) (f) (g) (h)
Figure 6: Fail cases of Context Encoder [Pathak et al.(2016)Pathak, Krähenbühl, Donahue, Darrell, and Efros] (a,e), Neural Patch Synthesis [Yang et al.(2016)Yang, Lu, Lin, Shechtman, Wang, and Li] (b,f), CPN (c,g) and FPN (d,h).

4.4 Evaluations

We evaluate the performance of our algorithm against recent inpainting algorithms [Yang et al.(2016)Yang, Lu, Lin, Shechtman, Wang, and Li, Pathak et al.(2016)Pathak, Krähenbühl, Donahue, Darrell, and Efros]. The output images are given in Figure 5. Visual outputs show that our Residual Polishing approach softens the inpainted image and produces visually plausible outputs. In Table 1, we demonstrate that our method achieves the best PSNR value on Google Street View dataset.

Method Mean L1 Loss Mean L2 Loss PSNR
Context-Encoder [Pathak et al.(2016)Pathak, Krähenbühl, Donahue, Darrell, and Efros] 2.74 0.53 20.60 dB
Neural Patch Synth.[Yang et al.(2016)Yang, Lu, Lin, Shechtman, Wang, and Li] 5.74 1.01 20.72 dB
CPN (Our) 1.97 0.38 21.37 dB
Residual Polish (Our) 1.74 0.32 22.89 dB
Table 1: Performance comparison on Google Street View dataset. For each measures the best results are shown in bold.

Our results show that instead of proposing an end-to-end solution, dividing the inpainting problem into simpler tasks and solving them separately can produce better results. Although our CPN architecture is inspired from Context-Encoder and it has nearly the same number of parameters, CPN achieves better results due to addition of the feature loss. This indication is consistent with the recent studies [Larsen et al.(2016)Larsen, Sønderby, Larochelle, and Winther, Johnson et al.(2016)Johnson, Alahi, and Fei-Fei, Li and Wand(2016), Yeh et al.(2016)Yeh, Chen, Lim, Hasegawa-Johnson, and Do] and shows that using pre-trained network features improves texture synthesis quality.

Residual polishing policy provides us additional performance gain by fixing the local deformations introduced by previous inpainting network. If the first network does not generate a proper texture, our FPN cannot improve the texture details (see Figure 6 for example fail cases). On the other hand, if we increase the receptive field of our residual network, it can be capable of repairing more global deformations as stated in [Kim et al.(2016)Kim, Lee, and Lee].

5 Conclusion

In this paper, we proposed Residual Polishing framework as a novel inpainting algorithm. Our motivation is to simplify the inpainting problem by dividing it into two stages which for each we present different networks. At the first stage, a coarse texture is obtained by considering at the surrounding pixels of the damaged area. Then our second network produces a residual image which is the difference between the desired output and the coarsely inpainted image. We have demonstrated that this residual image contains significant information that improves the performance of our final results. Residual Polishing framework can be benefited by any of the inpainting algorithms to enhance their outputs. Further, we will investigate different architectures and training policies for our residual network to ensure it can fix even more complex artifacts.