Where is the Fake? Patch-Wise Supervised GANs for Texture Inpainting

11/06/2019 ∙ by Ahmed Ben Saad, et al. ∙ 0

We tackle the problem of texture inpainting where the input images are textures with missing values along with masks that indicate the zones that should be generated. Many works have been done in image inpainting with the aim to achieve global and local consistency. But these works still suffer from limitations when dealing with textures. In fact, the local information in the image to be completed needs to be used in order to achieve local continuities and visually realistic texture inpainting. For this, we propose a new segmentor discriminator that performs a patch-wise real/fake classification and is supervised by input masks. During training, it aims to locate the fake and thus backpropagates consistent signal to the generator. We tested our approach on the publicly available DTD dataset and showed that it achieves state-of-the-art performances and better deals with local consistency than existing methods.



There are no comments yet.


page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

The inpainting task consists in filling missing parts of an image. A ”good” inpainting has to be visually plausible. In other words, it needs to respect the texture, colors, shapes and patterns continuities. This is even more the case when we tackle Texture Inpainting, which is the scope of this paper.

Generative Adversarial Networks [6] proved to be very efficient in yielding the most realistic results in the inpainting task. For instance, Context Encoders (CE) [11] (Fig. 1 leftmost) obtained impressive results compared to traditional approaches [2, 4, 5]. The idea was to train a generator (encoder-decoder network) with the help of an adversarial loss computed through a discriminator network. Howeve, the main purpose of CE was feature learning and not inpainting, leading to a good global consistency (i.e., a generated image is globally visually plausible) but a poor local one (i.e., zooming on an image reveals many inconsistencies).

Figure 1: Visual comparison of s.o.t.a discriminators and our proposed one. R/F refers to Real/Fake.

Iizuka et al., 2017 [8] tackled this problem of local inconsistencies, by adding a local discriminator (Fig. 1 middle-left) that takes image patches centered on the completed region. This technique succeeded in dealing better with local consistency but it usually generates boundary artifacts and distortions which forced the authors to use Poisson Blending [12] as post-processing step. Isola et al. [9] went further by proposing a PatchGAN discriminator (Fig. 1

middle-right) that divides the images in overlapping patches then classifies all of them. The final output was the average of all classification results. This technique was, for instance, successfully applied in inpainting in the medical imagery context by Armanious

et al., 2018 [1]. However, we believe that averaging all the patches’ contributions limits the power of the discriminators. In fact, PatchGAN can classify images with tiny ”fake” regions as globally real; and risk to learn features from the bad locations of fake and real regions.

Figure 2: In our Inpainting framework, the generator (left) takes as input masked images and outputs inpainted images, that are fed to the discriminator (right) which segments fake regions. The latter is trained with GT masks and used as an adversarial loss for the former, which is also trained with classical reconstruction loss. Multiscale filters are not represented for simplicity.

In this paper, we propose to solve these problems using what we call a Segmentor As A Discriminator (SAAD). The main idea behind SAAD (Fig. 1 rightmost) is to have a finer discriminator that locates fake parts in inpainted images, thus backpropagates better gradients to the generator. To do so, instead of classifying the whole image as real or fake, we propose a discriminator that solves a segmentation task, and thus learn to locate the fake. The segmentation ground-truth is given “for free” thanks to the inpainting masks. Additionally, while state-of-the-art (s.o.t.a) discriminators handle fake regions at one specific scales, we proposed to follow a multi-scale real/fake approach within our segmentor discriminator.

Experiments were conducted on the DTD dataset [3] where we compared our method to the works mentioned above. Results show that our approachs achieve state-of-the-art performance and better inpaint texture images.

2 Method Description

Our inpainting method is composed of two components: (i) a classical generator that performs the completion task (Sec. 2.1); and (ii) our main contribution that is a Segmentor As A Discriminator (SAAD) (Sec. 2.2). Furthermore in Sec. 2.3, we present a multi-scale SAAD version that aims to deal with multi-scale fake regions.

2.1 Generator

The generator takes as input masked images ( with being the mask locations and the ground-truth image) and outputs inpainted images (denoted ). is a classical U-Net like architecture (encoder-decoder + long-skips) [13]

with 2-strided convolutions 

[14] in the encoder-decoder for dimensions reduction and dilated convolutions [15] in the middle convolutional blocks in order to increase receptive fields sizes. Note that, regenerates every pixel to form a new image (denoted ). However, real pixels of the input masked image do not need to be replaced. Hence, we consider only pixels at the mask locations and the final output of becomes: . For the training of , we use the sum of a reconstruction loss as well as an adversarial loss coming from our segmentor discriminator (described in the next section). For , we use MSE between generated image and corresponding ground-truth (GT): .

2.2 Segmentor As A Discriminator (SAAD)

The main idea behind SAAD is to have a finer discriminator that is able, given an inpainted image, to locate its fake parts, thus backpropagating better gradients to the generator. Locating the fake helps in: (i) avoiding to classify images with tiny generated regions as globally real or fake; and (ii) learning features from the correct locations of fake and real regions.

To locate the fake, we propose that the discriminator performs a segmentation task. In fact, in inpainting, the segmentation masks are given “for free”, since they correspond to the inpainting masks. Specifically, the discriminator takes as input and outputs feature maps on top of which we add a convolution filter that outputs a real/fake map that we denote . Simply said, . To learn our segmentor discriminator , we enforce its output (

: sigmoid function) to be close to

, by minimizing a pixel-wise BCE loss. This corresponds to .

Note that, for we can use classical architectures, thus, the output size of the last feature map is usually smaller than the input size. It is thus the same for . Hence, to match the size of the input masks (), we up-sample from to . Note also that has a receptive field of size with . This means that classifies patches of the input images and this is why we characterize as a patch-wise discriminator.

After model convergence, as for any discriminator real and fake patches cannot be distinguished. However, during training, these last are usually well classified. In our case, the discriminator is able to go further by classifying and localizing the fake regions as illustrated in Fig. 2.

2.3 Multiscale approach

In the above section we used only one real/fake segmentation filter that has a specific receptive field of size . That size is defined by the position of in the network. It is thus sub-optimal to handle fake regions that can occur at different scales with only one filter at a specific scale. Thus, we propose to follow a multi-scale real/fake segmentation approach to capture more texture diversity.

To do so, we perform the segmentation task with multiple filters positioned at different levels of the network and thus having different receptive fields sizes. Formally, each filters takes as input the feature maps given by the convolutional layer and outputs real/fake maps that are upsampled and always compared to the same ground-truth mask , as in Sec. 2.2.

3 Experiments and results

3.1 Experimental settings

Texture Inpainting Task
Since the GAN-based Texture Inpainting task is not common in the literature, we proposed to set up a new experimental setting using the publicly available Describable Textures Dataset (DTD) [3]. DTD contains 5640 texture images and we used nearly 200 random images for testing purposes and the rest for training/validation. For each image, we generated multiple rectangle masks (random number, at most 5), at randomly positions before feeding it to the generator. The masks eventually overlaps each other and cover 15% to 30% of training and test images. We used a fixed set of masks for the test images for fair comparisons.

To compare the performance of all methods, we used 3 common metrics: Peak Signal To Noise Ratio (PSNR), Structural Similarity (SSIM) and Mean Perceptual Similarity (MPS) computed by:

, where is the set of masked test images, and PS is the Perceptual Loss as defined in [16]. Moreover, every generator is trained 5 times and the average score is reported to ensure fair comparison.

Comparison Methods
We compared our discriminators (SAAD and its multiscale version) with three existing ones: (i) Context Encoder (CE) that globally classifies the generated image; (ii) GLCIC which consist in concatenating the features of a global and a local discriminator; and (iii) GLPG which is a combination of GLCIC and PatchGAN (consist in classifying real/fake patches with convolutional filters and averaging their outputs to get the global prediction). SAAD and these three methods are illustrated in Fig 1. One should note that, many works proposed to use Perceptual loss [16] calculated over VGG-19 or AlexNet features [1] but this is orthogonal to our contribution, and the goal here is to asses the different supervisions of the discriminators.

Note that, the same generator network was used for all the methods as well as the same discriminator’s backbone. The latter, corresponding to the first 3 blocks of the ResNET-18 [7]

architecture as we are dealing with textures and do not need high-level features. For the local discriminator in GLCIC and GLPG, we used just the two first blocks. We trained all the networks with 200 epochs using Adam optimizer with learning rates of

and respectively for the generator and the discriminator. To avoid model collapse, we used zero-centered gradient penalty as defined in [10]

3.2 Results

The results of the different methods on the texture inpainting task in DTD are presented in Tab. 1

. We can see that our methods perform better than all others, regardless of the evaluation metric. For instance, SAAD-multiscale outperforms the CE baseline by 2 points of MPS. More importantly, compared to the recent GLPG, we improve the MPS by 1.6%. Since the only difference between GLPG and SAAD is the supervision (

i.e., classification vs segmentation), this result shows that the main contribution of this paper is valuable.

However, one must be careful with manipulating the PSNR, SSIM and MPS evaluation metrics when dealing with texture images. Indeed, sometimes visually good results yield worse quantitative scores, as illustrated in Fig. 3. Thus, we decided to do a qualitative comparison of different methods. The results are given in Fig. 4. From these results, we can clearly observe how the generated textures of our method are visually better compared to others.

Context Encoder (Pathak et al.) 95.3 24.385 0.901
GLCIC (Iizuka et al.) 96.2 24.728 0.924
GLPG (Armanious et al.) 95.6 26.409 0.930
SAAD (Ours) 97.2 26.635 0.934
SAAD MultiScale (ours) 97.3 27.536 0.937
Table 1: Inpainting results with MPS in % and PSNR in dB.

4 Conclusion

We presented a new approach for GAN-based texture inpainting that involves changing the discrimination task to a segmentation one to achieve better texture completion. We have shown, through quantitative and qualitative results on DTD, that this new way of supervision allows the generator to better generate textures and preserve mostly local features like colors, contrasts and shapes.

Figure 3: Illustration of the ineffectiveness of evaluation metrics for texture inpainting problem. For instance, in the first row, the left result is perceptually blurry, while it gets much higher scores than the right one, that inpaints the shapes quite clearly.

Figure 4: Qualitative results of different methods in the texture inpainting task. Masks are colored in blue for visibility.


  • [1] K. Armanious, Y. Mecky, S. Gatidis, and B. Yang (2019) Adversarial inpainting of medical image modalities. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Cited by: §1, §3.1.
  • [2] A. Bugeau and M. Bertalmio (2009) Combining texture synthesis and diffusion for image inpainting.. In VISAPP 2009-Proceedings of the Fourth International Conference on Computer Vision Theory and Applications, Cited by: §1.
  • [3] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi (2014) Describing textures in the wild. In

    Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)

    Cited by: §1, §3.1.
  • [4] A. Criminisi, P. Pérez, and K. Toyama (2004) Region filling and object removal by exemplar-based image inpainting. IEEE Transactions on image processing. Cited by: §1.
  • [5] I. Drori, D. Cohen-Or, and H. Yeshurun (2003) Fragment-based image completion. In ACM Transactions on graphics (TOG), Cited by: §1.
  • [6] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In NIPS, Cited by: §1.
  • [7] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §3.1.
  • [8] S. Iizuka, E. Simo-Serra, and H. Ishikawa (2017) Globally and locally consistent image completion. ACM Transactions on Graphics (ToG). Cited by: §1.
  • [9] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In CVPR, Cited by: §1.
  • [10] L. Mescheder, A. Geiger, and S. Nowozin (2018) Which training methods for gans do actually converge?. arXiv preprint arXiv:1801.04406. Cited by: §3.1.
  • [11] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros (2016) Context encoders: feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition, Cited by: §1.
  • [12] P. Pérez, M. Gangnet, and A. Blake (2003) Poisson image editing. ACM Transactions on graphics (TOG). Cited by: §1.
  • [13] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, Cited by: §2.1.
  • [14] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller (2014) Striving for simplicity: the all convolutional net. arXiv preprint arXiv:1412.6806. Cited by: §2.1.
  • [15] F. Yu and V. Koltun (2015) Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122. Cited by: §2.1.
  • [16] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang (2018)

    The unreasonable effectiveness of deep features as a perceptual metric

    In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §3.1, §3.1.