The inpainting task consists in filling in missing parts of an image. A "good" inpainting has to be visually plausible: it must respect the continuity of textures, colors, shapes and patterns. This is all the more true for Texture Inpainting, which is the scope of this paper.
Generative Adversarial Networks (GANs) proved to be very efficient at yielding realistic results on the inpainting task. For instance, Context Encoders (CE) (Fig. 1, leftmost) obtained impressive results compared to traditional approaches [2, 4, 5]. The idea was to train a generator (an encoder-decoder network) with the help of an adversarial loss computed through a discriminator network. However, the main purpose of CE was feature learning, not inpainting, which led to good global consistency (i.e., a generated image is globally visually plausible) but poor local consistency (i.e., zooming in on an image reveals many inconsistencies).
Iizuka et al., 2017 tackled this problem of local inconsistencies by adding a local discriminator (Fig. 1, middle-left) that takes image patches centered on the completed region. This technique deals better with local consistency, but it usually generates boundary artifacts and distortions, which forced the authors to use Poisson Blending as a post-processing step. Isola et al. went further by proposing a PatchGAN discriminator (Fig. 1, middle-right) that divides the image into overlapping patches and classifies all of them; the final output is the average of all classification results. This technique was, for instance, successfully applied to inpainting in the medical imaging context by Armanious et al., 2018. However, we believe that averaging all the patches' contributions limits the power of the discriminator. In fact, PatchGAN can classify images with tiny "fake" regions as globally real, and risks learning features from the wrong locations of fake and real regions.
In this paper, we propose to solve these problems using what we call a Segmentor As A Discriminator (SAAD). The main idea behind SAAD (Fig. 1, rightmost) is to have a finer discriminator that locates fake parts in inpainted images, and thus backpropagates better gradients to the generator. To do so, instead of classifying the whole image as real or fake, we propose a discriminator that solves a segmentation task and thereby learns to locate the fake. The segmentation ground-truth is given "for free" thanks to the inpainting masks. Additionally, while state-of-the-art (s.o.t.a.) discriminators handle fake regions at one specific scale, we propose to follow a multi-scale real/fake approach within our segmentor discriminator.
Experiments were conducted on the DTD dataset, where we compared our method to the works mentioned above. Results show that our approach achieves state-of-the-art performance and better inpaints texture images.
2 Method Description
Our inpainting method is composed of two components: (i) a classical generator that performs the completion task (Sec. 2.1); and (ii) our main contribution, a Segmentor As A Discriminator (SAAD) (Sec. 2.2). Furthermore, in Sec. 2.3, we present a multi-scale SAAD version that aims to deal with fake regions at multiple scales.
2.1 Generator
The generator $G$ takes as input masked images $x_M = (1 - M) \odot x$ (with $M$ being the mask locations and $x$ the ground-truth image) and outputs inpainted images (denoted $G(x_M)$). $G$ is a classical U-Net-like architecture (encoder-decoder + long skip connections) with 2-strided convolutions in the encoder-decoder for dimension reduction, and dilated convolutions in the middle convolutional blocks in order to increase receptive field sizes. Note that $G$ regenerates every pixel to form a new image. However, the real pixels of the input masked image do not need to be replaced. Hence, we keep the generated pixels only at the mask locations, and the final output of $G$ becomes $\hat{x} = M \odot G(x_M) + (1 - M) \odot x_M$. For the training of $G$, we use the sum of a reconstruction loss $\mathcal{L}_{rec}$ and an adversarial loss coming from our segmentor discriminator (described in the next section). For $\mathcal{L}_{rec}$, we use the MSE between the generated image and the corresponding ground truth (GT): $\mathcal{L}_{rec} = \| x - \hat{x} \|_2^2$.
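The composition step and reconstruction loss described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; the function names are ours, and the mask convention (1 = missing pixel) follows the description above.

```python
import numpy as np

def compose_output(x_masked, g_out, mask):
    """Keep generated pixels only inside the mask; copy real pixels elsewhere.
    mask == 1 marks the missing (inpainted) locations."""
    return mask * g_out + (1.0 - mask) * x_masked

def reconstruction_loss(x_gt, x_hat):
    """MSE between the composed output and the ground-truth image."""
    return float(np.mean((x_gt - x_hat) ** 2))
```

Because only masked locations are taken from the generator, unmasked pixels are passed through unchanged and contribute nothing to the reconstruction error.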
2.2 Segmentor As A Discriminator (SAAD)
The main idea behind SAAD is to have a finer discriminator that is able, given an inpainted image, to locate its fake parts, thus backpropagating better gradients to the generator. Locating the fake helps in: (i) avoiding the classification of images with tiny generated regions as globally real or fake; and (ii) learning features from the correct locations of fake and real regions.
To locate the fake, we propose that the discriminator perform a segmentation task. Indeed, in inpainting, the segmentation masks are given "for free", since they correspond to the inpainting masks. Specifically, the discriminator takes as input $\hat{x}$ and outputs feature maps, on top of which we add a convolution filter that outputs a real/fake map that we denote $S$. Simply said, $S$ is obtained by convolving this filter over the last feature maps. To learn our segmentor discriminator $D$, we enforce its output $\sigma(S)$ ($\sigma$: sigmoid function) to be close to the mask $M$ by minimizing a pixel-wise BCE loss. This corresponds to $\mathcal{L}_{seg} = \mathrm{BCE}(\sigma(S), M)$.
Note that, for $D$, we can use classical architectures; thus, the output size of the last feature map is usually smaller than the input size, and so is $S$. Hence, to match the size of the input masks $M$, we up-sample $S$ to the input resolution. Note also that each unit of $S$ has a receptive field smaller than the whole input image. This means that $S$ classifies patches of the input image, which is why we characterize $D$ as a patch-wise discriminator.
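The segmentation head just described can be sketched in a few lines. This is an illustrative NumPy version under our own simplifications: a $1 \times 1$ convolution stands in for the real/fake filter, and nearest-neighbour interpolation stands in for the up-sampling step (the paper does not state the interpolation used).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def seg_head(features, w):
    """1x1 convolution as a dot product over channels:
    features (C, h, w), w (C,) -> real/fake logit map S of shape (h, w)."""
    return np.tensordot(w, features, axes=(0, 0))

def upsample_nearest(s, factor):
    """Nearest-neighbour up-sampling of S back to the mask resolution."""
    return np.repeat(np.repeat(s, factor, axis=0), factor, axis=1)

def pixelwise_bce(logits, mask, eps=1e-7):
    """Pixel-wise BCE between sigmoid(S) and the inpainting mask M."""
    p = np.clip(sigmoid(logits), eps, 1.0 - eps)
    return float(-np.mean(mask * np.log(p) + (1 - mask) * np.log(1 - p)))
```

Training the discriminator then amounts to minimizing `pixelwise_bce` on inpainted images (target: the mask) and on real images (target: an all-zero mask).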
After model convergence, as for any discriminator, real and fake patches can no longer be distinguished. During training, however, they are usually well classified. In our case, the discriminator goes further by both classifying and localizing the fake regions, as illustrated in Fig. 2.
2.3 Multiscale approach
In the section above, we used only one real/fake segmentation filter, whose receptive field size is defined by its position in the network. A single filter at a specific scale is thus sub-optimal for handling fake regions that can occur at different scales. Hence, we propose to follow a multi-scale real/fake segmentation approach to capture more texture diversity.
To do so, we perform the segmentation task with multiple filters positioned at different levels of the network, and thus having different receptive field sizes. Formally, each filter takes as input the feature maps given by its convolutional layer and outputs a real/fake map that is up-sampled and always compared to the same ground-truth mask $M$, as in Sec. 2.2.
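A compact NumPy sketch of this multi-scale loss follows, under our assumptions: each head is a $1 \times 1$ convolution, maps are up-sampled with nearest-neighbour interpolation, and the per-scale losses are simply summed (the paper does not state the per-scale weighting).

```python
import numpy as np

def multiscale_seg_loss(feature_maps, filters, mask):
    """Sum of pixel-wise BCE losses over segmentation filters placed at
    different depths. Each real/fake map is up-sampled back to the mask
    resolution and compared against the same ground-truth mask."""
    total = 0.0
    for F, w in zip(feature_maps, filters):
        s = np.tensordot(w, F, axes=(0, 0))  # 1x1 conv -> (h_i, w_i) logits
        # nearest-neighbour up-sampling to the mask resolution
        s = np.repeat(np.repeat(s, mask.shape[0] // s.shape[0], axis=0),
                      mask.shape[1] // s.shape[1], axis=1)
        p = np.clip(1.0 / (1.0 + np.exp(-s)), 1e-7, 1.0 - 1e-7)
        total += -np.mean(mask * np.log(p) + (1 - mask) * np.log(1 - p))
    return float(total)
```

Because every scale is supervised by the same mask, coarse heads penalize large fake regions while fine heads penalize small ones.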
3 Experiments and results
3.1 Experimental settings
Texture Inpainting Task
Since the GAN-based Texture Inpainting task is not common in the literature, we set up a new experimental protocol using the publicly available Describable Textures Dataset (DTD). DTD contains 5640 texture images; we used nearly 200 random images for testing and the rest for training/validation. For each image, we generated multiple rectangular masks (a random number, at most 5) at random positions before feeding it to the generator. The masks may overlap each other and cover 15% to 30% of the training and test images. For fair comparison, we used a fixed set of masks for the test images.
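A mask generator in the spirit of this protocol can be sketched as below. The rectangle-size heuristic and the stopping rule are our assumptions for illustration; the paper only specifies "at most 5 rectangles, possibly overlapping, covering 15% to 30%".

```python
import numpy as np

def random_rect_masks(h, w, max_rects=5, min_cover=0.15, rng=None):
    """Draw up to max_rects random (possibly overlapping) rectangles and
    return a binary mask (1 = hole). Stops early once the union of
    rectangles reaches min_cover of the image; the size heuristic
    (between 1/8 and 1/2 of each dimension) is a hypothetical choice."""
    rng = np.random.default_rng(rng)
    mask = np.zeros((h, w), dtype=np.float32)
    n_rects = rng.integers(1, max_rects + 1)
    for _ in range(n_rects):
        rh = rng.integers(h // 8, h // 2)
        rw = rng.integers(w // 8, w // 2)
        y = rng.integers(0, h - rh)
        x = rng.integers(0, w - rw)
        mask[y:y + rh, x:x + rw] = 1.0
        if mask.mean() >= min_cover:
            break
    return mask
```

Fixing the `rng` seed reproduces the same masks, which is how a fixed test-time mask set can be obtained.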
To compare the performance of all methods, we used three common metrics: Peak Signal-to-Noise Ratio (PSNR), Structural Similarity (SSIM) and Mean Perceptual Similarity (MPS), computed as $\mathrm{MPS} = 100 \cdot \left(1 - \frac{1}{|\mathcal{T}|} \sum_{x \in \mathcal{T}} \mathrm{PS}(x, \hat{x})\right)$, where $\mathcal{T}$ is the set of masked test images and PS is the Perceptual Similarity loss of Zhang et al., 2018. Moreover, every generator is trained 5 times and the average score is reported to ensure a fair comparison.
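The MPS aggregation reduces to one line; this helper assumes the per-image perceptual distances PS have already been computed (e.g., with an LPIPS-style network), which is outside the scope of this sketch.

```python
def mean_perceptual_similarity(ps_scores):
    """MPS in percent: 100 * (1 - average perceptual distance over the
    masked test set). ps_scores holds one PS(x, x_hat) value per image."""
    return 100.0 * (1.0 - sum(ps_scores) / len(ps_scores))
```

Lower perceptual distances therefore map directly to a higher MPS percentage.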
We compared our discriminators (SAAD and its multi-scale version) with three existing ones: (i) Context Encoder (CE), which globally classifies the generated image; (ii) GLCIC, which concatenates the features of a global and a local discriminator; and (iii) GLPG, a combination of GLCIC and PatchGAN (classifying real/fake patches with convolutional filters and averaging their outputs to get the global prediction). SAAD and these three methods are illustrated in Fig. 1. One should note that many works propose to use a Perceptual loss calculated over VGG-19 or AlexNet features, but this is orthogonal to our contribution; the goal here is to assess the different supervisions of the discriminators.
Note that the same generator network was used for all methods, as well as the same discriminator backbone. The latter corresponds to the first three blocks of the ResNet-18 architecture, since we are dealing with textures and do not need high-level features. For the local discriminator in GLCIC and GLPG, we used only the first two blocks. We trained all the networks for 200 epochs using the Adam optimizer, with separate learning rates for the generator and the discriminator. To avoid mode collapse, we used a zero-centered gradient penalty as defined in Mescheder et al., 2018.
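The zero-centered (R1) gradient penalty regularizes the discriminator by penalizing its gradient norm on real samples. A minimal numeric sketch follows; in practice the gradient is obtained via autograd, whereas here it is estimated by central finite differences, and `d` is any hypothetical scalar-valued discriminator function.

```python
import numpy as np

def r1_penalty(d, x, eps=1e-4):
    """Zero-centered gradient penalty: 1/2 * ||grad_x D(x)||^2, evaluated
    on a real sample x. Finite-difference version for illustration only."""
    g = np.zeros(x.size)
    for i in range(x.size):
        e = np.zeros(x.size)
        e[i] = eps
        e = e.reshape(x.shape)
        g[i] = (d(x + e) - d(x - e)) / (2 * eps)
    return 0.5 * float(np.sum(g ** 2))
```

The penalty (scaled by a regularization weight) is added to the discriminator loss, pushing its gradients toward zero on the real-data manifold.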
The results of the different methods on the texture inpainting task on DTD are presented in Tab. 1. Our methods outperform all others, regardless of the evaluation metric. For instance, SAAD-multiscale outperforms the CE baseline by 2 points of MPS. More importantly, compared to the recent GLPG, we improve the MPS by 1.6%. Since the only difference between GLPG and SAAD is the supervision (i.e., classification vs. segmentation), this result shows the value of the main contribution of this paper.
However, one must be careful when using the PSNR, SSIM and MPS metrics on texture images: visually good results sometimes yield worse quantitative scores, as illustrated in Fig. 3. We therefore also performed a qualitative comparison of the different methods. The results are given in Fig. 4, where the textures generated by our method are clearly visually better than those of the others.
Table 1: Quantitative results on DTD.

| Method | MPS (%) | PSNR | SSIM |
|---|---|---|---|
| Context Encoder (Pathak et al.) | 95.3 | 24.385 | 0.901 |
| GLCIC (Iizuka et al.) | 96.2 | 24.728 | 0.924 |
| GLPG (Armanious et al.) | 95.6 | 26.409 | 0.930 |
| SAAD MultiScale (ours) | 97.3 | 27.536 | 0.937 |
4 Conclusion

We presented a new approach for GAN-based texture inpainting that changes the discrimination task into a segmentation one to achieve better texture completion. We have shown, through quantitative and qualitative results on DTD, that this new form of supervision allows the generator to better synthesize textures and preserve local features such as colors, contrasts and shapes.
References

- Armanious et al. (2019) Adversarial inpainting of medical image modalities. In ICASSP.
- (2009) Combining texture synthesis and diffusion for image inpainting. In VISAPP.
- Cimpoi et al. (2014) Describing textures in the wild. In CVPR.
- Criminisi et al. (2004) Region filling and object removal by exemplar-based image inpainting. IEEE Transactions on Image Processing.
- Drori et al. (2003) Fragment-based image completion. ACM Transactions on Graphics (TOG).
- Goodfellow et al. (2014) Generative adversarial nets. In NIPS.
- He et al. (2016) Deep residual learning for image recognition. In CVPR.
- Iizuka et al. (2017) Globally and locally consistent image completion. ACM Transactions on Graphics (ToG).
- Isola et al. (2017) Image-to-image translation with conditional adversarial networks. In CVPR.
- Mescheder et al. (2018) Which training methods for GANs do actually converge? arXiv:1801.04406.
- Pathak et al. (2016) Context encoders: feature learning by inpainting. In CVPR.
- Pérez et al. (2003) Poisson image editing. ACM Transactions on Graphics (TOG).
- Ronneberger et al. (2015) U-Net: convolutional networks for biomedical image segmentation. In MICCAI.
- Springenberg et al. (2014) Striving for simplicity: the all convolutional net. arXiv:1412.6806.
- Yu and Koltun (2015) Multi-scale context aggregation by dilated convolutions. arXiv:1511.07122.
- Zhang et al. (2018) The unreasonable effectiveness of deep features as a perceptual metric. In CVPR.