Faster Unsupervised Semantic Inpainting: A GAN Based Approach

08/14/2019 · by Avisek Lahiri, et al. · IIT Kharagpur

In this paper, we propose to improve the inference speed and visual quality of a contemporary baseline for Generative Adversarial Network (GAN) based unsupervised semantic inpainting. This is made possible by better initialization of the core iterative optimization involved in the framework. To the best of our knowledge, this is also the first attempt at GAN based video inpainting that takes temporal cues into account. On single image inpainting we achieve about a 4.5-5× speedup, and about 80× on videos, compared to the baseline. Simultaneously, our method yields better spatial and temporal reconstruction quality, as found on three image datasets and one video dataset.

1 Introduction

Semantic inpainting refers to filling in missing pixels of a given image by leveraging neighborhood information. Traditional methods [1, 3] were mainly successful when deployed on background scenes and images with repeated textures. However, they fail to learn complex semantic representations and thereby produce unconvincing reconstructions of complex, non-repetitive textured objects. With the advent of Generative Adversarial Networks (GAN), there has been a recent surge of interest [13, 4, 14, 7] in solving inpainting with deep generative models. There are mainly two schools of approach:
Fully unsupervised: This approach, first proposed by Yeh et al. [13], aligns with the concept of the pioneering GAN paper [2]. In [13], the objective is to learn a GAN model that generates realistic images conditioned on noise priors only, and inpainting is done by iteratively matching a masked/damaged image to its ‘best matching’ noise prior. This method does not require any paired (masked, unmasked) training set and hence we term it ‘unsupervised’.
Hybrid: These methods [14, 7, 10, 4] in general rely on initial training with a conventional reconstruction loss on a paired (masked, unmasked) dataset. Since reconstruction-loss-only outputs lack high frequency components, the next step is to push the solutions closer to the original data manifold with an additional adversarial loss. Note that without the initial supervised training phase these methods fail to work, and thus we term this framework the ‘hybrid’ approach.
Motivation: For [13], being fully unsupervised comes at the cost of significant inference time due to an iterative search for a matching noise prior. Hybrid methods perform test time inference in a single forward pass, and thus research has been dedicated mainly to this genre. However, going against the trend, we advocate the former method because the true potential of GANs is realized when there is no source of supervision. Our primary aim in this paper is to reduce the inference run time of [13] while achieving better or similar reconstruction performance. Towards this, the paper presents the following contributions:

1. A better initialization method for the iterative optimization of [13], which speeds up inference on single image inpainting by 4.5-5×.

2. The first demonstration of totally unsupervised GAN based inpainting on videos (in the context of error concealment), with a speedup of up to 80× obtained by leveraging temporal redundancy.

3. A group consistency loss for more temporally consistent sequence reconstruction, leading to a more pleasing spatio-temporal experience as ascertained by the MOVIE metric [12].

4. Exhaustive experiments on the SVHN, Stanford Cars and CelebA image datasets and the VidTIMIT video dataset demonstrate the benefits of our approach.

2 GAN preliminaries

A GAN model consists of two deep neural nets, viz., a generator, $G$, and a discriminator, $D$. The task of the generator is to create an image, $G(z)$, from a noise prior vector, $z$, given as input. $z$ is sampled from a known distribution, $p_z$, usually a simple uniform or Gaussian prior. The discriminator has to distinguish between real samples (sampled from the real data distribution, $p_{data}$) and generated samples. The game is played on:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{data}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big] \qquad (1)$$
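The following is a minimal PyTorch sketch of the minimax game in Eq. 1, included only for illustration; `netG`, `netD` and the latent dimension are assumed stand-ins, not the authors' actual training code.

```python
import torch

def gan_losses(netG, netD, real_batch, z_dim=100):
    """One step of the Eq. 1 objective for a generator/discriminator pair."""
    z = torch.randn(real_batch.size(0), z_dim)        # noise prior z ~ p_z
    fake = netG(z)
    d_real = netD(real_batch)                         # D(x): probability of "real"
    d_fake = netD(fake.detach())                      # D(G(z)) with G frozen
    # Discriminator maximizes log D(x) + log(1 - D(G(z)))
    loss_D = -(torch.log(d_real + 1e-8) + torch.log(1 - d_fake + 1e-8)).mean()
    # Generator minimizes log(1 - D(G(z)))
    loss_G = torch.log(1 - netD(fake) + 1e-8).mean()
    return loss_D, loss_G
```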

3 Method

We build upon the unsupervised inpainting framework of Yeh et al. [13]. Given a masked image, $y$, corresponding to an original image, $x$, and a pre-trained GAN model, the idea is to iteratively find the ‘closest’ vector $\hat{z}$ (starting from a random initialization) whose generated image is semantically similar to the corrupted image. $\hat{z}$ is optimized as,

$$\hat{z} = \arg\min_{z} \; \mathcal{L}(z \,|\, y, M) \qquad (2)$$

where $M$ is the binary mask with zeros on the masked region and ones elsewhere, $\odot$ is the pointwise (Hadamard) multiplier and $\mathcal{L}$ is the objective function to be minimized. It is interesting to note that the objective function never assumes knowledge of the pixel intensities inside the masked region (hence the term ‘unsupervised’). Upon convergence, the inpainted image, $\hat{x}$, is given as $\hat{x} = M \odot y + (1 - M) \odot G(\hat{z})$. The objective function, $\mathcal{L}$, is composed of two components:
Fidelity Loss: This loss ensures that the predicted noise prior preserves fidelity between the generated image and the original unmasked regions,

$$\mathcal{L}_f(z) = \big\| M \odot G(z) - M \odot y \big\|_1 \qquad (3)$$

Perceptual Loss: This loss ensures that the inpainted output lies near the original/real data manifold and is measured by the log likelihood of the real class assigned by the pre-trained discriminator,

$$\mathcal{L}_p(z) = -\log\big(D(G(z))\big) \qquad (4)$$

The overall objective is $\mathcal{L}(z) = \mathcal{L}_f(z) + \lambda \mathcal{L}_p(z)$, where $\lambda$ controls the relative importance of $\mathcal{L}_p$.
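A hedged PyTorch sketch of this iterative search is given below. It assumes a pre-trained `netG` and `netD`, a masked image `y` and a binary mask `M` (1 on known pixels); the L1 form of the fidelity term and the optimizer settings are illustrative choices, not necessarily those of the original implementation.

```python
import torch

def inpaint_single(netG, netD, y, M, z0, lam=0.01, steps=1000, lr=0.01):
    """Iteratively optimize z (Eq. 2) with fidelity (Eq. 3) + perceptual (Eq. 4) losses."""
    z = z0.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        gz = netG(z)
        fidelity = torch.abs(M * gz - M * y).sum()         # Eq. 3: match known pixels only
        perceptual = -torch.log(netD(gz) + 1e-8).mean()    # Eq. 4: look "real" to D
        loss = fidelity + lam * perceptual                 # overall objective
        loss.backward()
        opt.step()
    with torch.no_grad():
        x_hat = M * y + (1 - M) * netG(z)                  # blend generated content into the hole
    return x_hat, z.detach()
```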

3.1 Better initialization for the noise prior search

One of the fundamental drawbacks of Yeh et al. is the iterative optimization required by Eq. 2. A random $z^{(0)}$ usually generates an image quite disparate from the masked image at hand, and thus the optimization requires many updates of $z$. In fact, the authors in [13] suggest around 1000 iterations per image. Our motivation is to initialize $z^{(0)}$ by respecting some global statistics of the masked image.
Nearest neighbor search: After training a GAN, we store (as a one time offline task) a pool, $\mathcal{P}$, of images obtained by passing random noise vectors through the pre-trained generator. For a given damaged image, $y$, we perform a nearest neighbor search over the pool, $\mathcal{P}$, to identify the ‘closest’ matching pair. Specifically, we perform the matching between $y$ and a candidate pooled image, $G(z_i)$ (generated from $z_i$), based on a distance metric, $d$. Please note, even during matching we do not exploit the masked region of $y$. While formulating $d$ we want to make sure that we not only match the overall color statistics of the damaged image but also respect its overall structure. Thus $d$ has two components:
Data loss: This loss penalizes deviation of the pixel intensities of a pooled image, $G(z_i)$, from the damaged image, $y$;

$$\mathcal{L}_{d}(z_i) = \big\| M \odot G(z_i) - M \odot y \big\|_1 \qquad (5)$$

Structure loss: This loss penalizes deviation of the structure (captured in essence by gradients) of a pooled image from the damaged image. The structure loss, $\mathcal{L}_s$, is defined as:

$$\mathcal{L}_{s}(z_i) = \big\| M \odot \big(\nabla_h G(z_i) - \nabla_h y\big) \big\|_1 + \big\| M \odot \big(\nabla_v G(z_i) - \nabla_v y\big) \big\|_1 \qquad (6)$$

where $\nabla_h$ and $\nabla_v$ are the horizontal and vertical gradient operators. The final matching criterion is $d(z_i) = \mathcal{L}_d(z_i) + \gamma \mathcal{L}_s(z_i)$, where $\gamma$ controls the relative importance of $\mathcal{L}_s$. The effect of $\gamma$ is discussed in Fig. 1. The initial noise vector, $z^{(0)}$, is given by,

$$z^{(0)} = \underset{z_i \,:\, G(z_i) \in \mathcal{P}}{\arg\min} \; d(z_i) \qquad (7)$$
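Below is a minimal sketch of this initialization, under the assumption that the gradient operators of Eq. 6 are simple finite differences and that the pool stores the $(z_i, G(z_i))$ pairs produced offline; names such as `build_pool` and `nearest_init` are illustrative only.

```python
import torch

def build_pool(netG, pool_size=300, z_dim=100):
    """One time offline task: sample noise vectors and cache the generated images."""
    with torch.no_grad():
        zs = torch.randn(pool_size, z_dim)
        imgs = netG(zs)
    return zs, imgs

def grad_h(x):  # horizontal finite difference
    return x[..., :, 1:] - x[..., :, :-1]

def grad_v(x):  # vertical finite difference
    return x[..., 1:, :] - x[..., :-1, :]

def nearest_init(y, M, zs, imgs, gamma=0.01):
    """Return the pooled z whose image best matches the unmasked part of y (Eq. 7)."""
    data = torch.abs(M * imgs - M * y).flatten(1).sum(dim=1)                        # Eq. 5
    struct = (torch.abs(grad_h(M * imgs) - grad_h(M * y)).flatten(1).sum(dim=1)
              + torch.abs(grad_v(M * imgs) - grad_v(M * y)).flatten(1).sum(dim=1))  # Eq. 6
    d = data + gamma * struct                                                       # matching criterion
    return zs[d.argmin()]                                                           # z^(0)
```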

3.2 Video Inpainting: Exploiting temporal redundancy

To the best of our knowledge, this is the first demonstration of unsupervised GAN based inpainting on videos. By video inpainting we refer to the concept of error concealment in video, i.e., recovering the damaged/masked portion of a frame. A naive application of [13] would be to apply the single image model independently to each frame. This poses two problems: (a) such an approach does not leverage temporal redundancy among neighboring frames, and (b) independent frame level reconstructions result in temporal inconsistency across a sequence. We address these challenges with two innovations.
Reuse of predicted vector: It is safe to assume that neighboring frames are coherent in appearance, and so are their noise priors. Thus, it makes sense to initialize the current frame's search with the previous frame's solution, $z_t^{(0)} = \hat{z}_{t-1}$. This drastically speeds up optimization, by almost 80× compared to the vanilla version of Yeh et al. [13]. We refer to this proposed method as Proposed (Re) in all experiments.
Group consistency loss: Even though we initialize with $\hat{z}_{t-1}$, the final solution for time step $t$ ($\hat{z}_t$) can diverge appreciably from $\hat{z}_{t-1}$. This would mean abrupt changes in scene appearance when the frames are viewed as a sequence. To enforce smooth temporal dynamics we impose a group consistency loss, $\mathcal{L}_g$, by constraining a group of reconstructed frames to be similar. The disparity between two generated images can be expressed through the disparity between the corresponding $z$ vectors [15]. Specifically, we impose the consistency loss over a window, $\mathcal{W}$, of neighboring frames,

$$\mathcal{L}_g = \sum_{i, j \in \mathcal{W},\; i \neq j} \big\| z_i - z_j \big\|_2^2 \qquad (8)$$

Please note that $\mathcal{L}_g$ is imposed only over a neighboring window of frames and not over the entire video. The combination of this loss and the reuse of the $z$ vector is denoted as Proposed (Re + G) in the experiments.
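A hedged sketch of the video extension follows: the previous frame's solution warm-starts every frame in the window (Re), and the window is refined jointly with the pairwise consistency term of Eq. 8 (Re + G). The per-frame losses mirror the single image sketch above; the weight `mu` on the consistency term is an assumed hyperparameter, not a value reported in the paper.

```python
import torch

def inpaint_window(netG, netD, ys, Ms, z_prev, lam=0.01, mu=0.1, steps=100, lr=0.01):
    """Jointly inpaint a window of frames: ys/Ms are lists of frame tensors and masks."""
    W = len(ys)
    zs = [z_prev.clone().requires_grad_(True) for _ in range(W)]   # Re: warm start from previous frame
    opt = torch.optim.Adam(zs, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = 0.0
        for y, M, z in zip(ys, Ms, zs):                            # per-frame Eq. 3 + Eq. 4
            gz = netG(z)
            loss = loss + torch.abs(M * gz - M * y).sum() \
                        - lam * torch.log(netD(gz) + 1e-8).mean()
        for i in range(W):                                         # Eq. 8: group consistency over the window
            for j in range(i + 1, W):
                loss = loss + mu * ((zs[i] - zs[j]) ** 2).sum()
        loss.backward()
        opt.step()
    with torch.no_grad():
        outs = [M * y + (1 - M) * netG(z) for y, M, z in zip(ys, Ms, zs)]
    return outs, [z.detach() for z in zs]
```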

4 Experiment Settings

Datasets: For image inpainting we tested on SVHN [9], Stanford Cars [5] at 64×64 resolution, and CelebA [8] at 64×64 and 128×128 resolutions. For video inpainting, we experimented at 128×128 resolution on the VidTIMIT [11] dataset.
Network Architectures: For a fair comparison with our baseline [13], we borrowed their architectures for both the generator and the discriminator and followed their GAN training paradigm. The weighting parameter, $\lambda$, was set to 0.01 following [13].
Balancing data loss and structure loss: The hyperparameter $\gamma$ sets the relative importance of the structure loss, $\mathcal{L}_s$, over the data loss, $\mathcal{L}_d$. Setting $\gamma = 0$ means the initial solution will not explicitly preserve edge information and will simply retrieve the nearest image based on raw intensity. On the contrary, a high value of $\gamma$ will enforce only edge preservation without respecting intensity. See Fig. 1 for an illustration of these two extreme cases. In either case, the initial solution can be appreciably different from the desired solution and will thus require longer iterative refinement. We reserved a validation dataset on which we tested different settings of $\gamma$; $\gamma = 0.01$ gave the peak speedup across datasets and is thus used for all experiments. See Fig. 2 for some examples of initial solutions with our proposed method v/s the random initialization of [13].
Pool size ($|\mathcal{P}|$) for nearest neighbor search: Nearest neighbor search with an ideal generator would be able to retrieve the exact $z$ vector corresponding to a masked image, $y$, if we allowed $|\mathcal{P}| \to \infty$. However, for practical viability it is not possible to search over every possible $z$ vector. On the validation set, we experiment with different values of $|\mathcal{P}|$ and compare the average MS-SSIM between the masked initial solution and the masked image. $|\mathcal{P}|$ is set to 300 (on all datasets unless otherwise stated), above which the MS-SSIM does not increase appreciably.
Selecting the group consistency window, $\mathcal{W}$: Setting the window size (in Eq. 8) to a large value results in over-smoothing of a sequence in the temporal dimension and thus degrades the MOVIE metric [12] due to poor perceptual quality. A window of a single frame, on the other hand, has no effect in incorporating temporal coherence, which again hurts the MOVIE metric. On a held out validation set of VidTIMIT we experimented with different window sizes; a window of 5 frames was selected, which, on average, yielded the best MOVIE metric. So, for a sequence, every 5th frame (the pivot frame) is inpainted as a single image. Intermediate frames are initialized with the inpainted pivot frame's solution, and the solutions within a group are additionally constrained by $\mathcal{L}_g$.
Comparing Methods: Our primary comparison is with the unsupervised baseline of [13]. However, for completeness we also provide comparisons with the hybrid frameworks of [10, 4, 14]. Please note that the latter approaches are only successful if trained initially with a supervised reconstruction loss; if trained with only the unsupervised adversarial loss, these methods fail drastically.

Figure 1: Role of the data loss, $\mathcal{L}_d$ (Eq. 5), and the structure loss, $\mathcal{L}_s$ (Eq. 6), in retrieving the nearest matching initial solution. For each tuple, left column: masked image, middle column: initial solution retrieved with only $\mathcal{L}_d$, right column: initial solution retrieved with only $\mathcal{L}_s$. Using only $\mathcal{L}_d$ mainly tries to maintain the global color statistics, while using only $\mathcal{L}_s$ focuses on matching the structure irrespective of absolute intensity. It can be appreciated that $\mathcal{L}_s$ retrieves initial matches by maintaining facial expression (smile), orientation of cars, and strokes of digits. Thus we apply a weighted combination of $\mathcal{L}_d$ and $\mathcal{L}_s$ as the objective (Eq. 7) for nearest neighbor retrieval.
Figure 2: Benefit of the proposed noise prior initialization v/s the random initialization of Yeh et al. [13]. Top row: masked image, middle row: initial solution of [13], bottom row: initial solution by our proposed nearest neighbor based initialization. It is evident that our initial solutions are much closer to the masked images and thus require fewer iterative updates of Eq. 2 compared to [13].
Dataset       [10]   [4]    [14]   [13]   Ours
Cars          14.3   15.3   14.5   13.5   14.1
SVHN          21.5   23.6   23.7   20.4   22.0
CelebA(64)    23.0   24.1   24.2   22.6   23.3
CelebA(128)   20.0   20.9   20.6   17.6   18.8
Table 1: Comparing inpainting PSNR (in dB) on different datasets with the unsupervised baseline of [13]. We also compare with the hybrid methods of [10, 4, 14].
Method              Image   Video
Yeh et al. [13]     9.0     33.5
Proposed (Re)       1.9     0.36
Proposed (Re + G)   -       0.41
Table 2: Comparison of absolute run times (in seconds) on an NVIDIA K-40 GPU for image (64×64) and video (128×128) inpainting. Time is measured until the corresponding loss of a model converges to 95% of its saturation value. Notice how naive application of the single image model of [13] to higher resolution video blows up the inference time; our proposed model (Re + G) speeds this up by almost 80×. In fact, on image and video we achieve speedups of about 5× and 80×, respectively, in terms of iteration count. This table also includes the time for the nearest neighbor search.
Dataset       [10]   [4]    [14]   [13]   Ours (Re)   Ours (Re + G)
Cars          13.8   14.2   14.5   15.6   18.2        20.0
SVHN          21.3   21.8   22.1   22.8   24.1        24.8
CelebA(64)    23.1   23.2   23.6   24.1   25.6        26.3
CelebA(128)   21.8   20.9   21.6   21.9   22.4        23.5
Table 3: Average temporal consistency (in dB) on the test sets of different datasets. A higher value means a model is more temporally coherent.
Method   [13]   [10]   [4]    [14]   Proposed (Re)   Proposed (Re + G)
MOVIE    0.66   0.63   0.55   0.46   0.52            0.47
Table 4: Comparison of the MOVIE metric [12] on the VidTIMIT video test set. We compare with the unsupervised baseline of Yeh et al. [13] and also with the hybrid methods of [10, 4, 14]. A lower MOVIE value is better in terms of spatio-temporal effectiveness of inpainting.

5 Results

Speedup in optimization: With respect to our unsupervised baseline [13], on average we achieved about a 5× speedup for single image inpainting. On videos the speedup is almost 80×. See Table 2 for speed comparisons.
Image Inpainting: In Fig. 3 we show some exemplary inpainting comparisons with [13]. Even with an appreciable speedup, we usually achieve better (or similar) visual performance compared to [13]. We also show visual comparisons with recent hybrid benchmarks in Fig. 4. Recently, researchers [6, 13, 14] have shown that the PSNR metric is not fully adequate for assessing tasks such as inpainting and super resolution. However, for reference we also report PSNR in Table 1.
Pseudo sequences and temporal consistency: Before analyzing the effects of our proposed losses on real videos, we study them on the simpler case of pseudo sequences. A pseudo sequence of length $N$ is basically a single image replicated $N$ times but masked with different masks. An ideal model would inpaint all the frames identically. We can thus measure temporal consistency as the average PSNR over all pairwise combinations of the $N$ reconstructed frames. In Table 3 we report temporal consistency over different datasets. It can be seen that the proposed initialization technique improves consistency, with a further improvement brought by the group consistency loss. Even the current hybrid benchmarks [14, 10, 4] manifest greater inconsistency because these methods do not leverage any temporal information. Exemplary visualizations are shown in Fig. 5. Note that this measure gives an indication of temporal consistency only (which can be studied only on such pseudo sequences); it does not capture spatial correctness. For example, a model might reconstruct all blank images, which would give high temporal consistency but a worse MOVIE metric.
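A small sketch of this pseudo-sequence check, under the assumption (made explicit here, since the original definition is partially garbled) that consistency is the mean PSNR over all frame pairs:

```python
import itertools
import torch

def temporal_consistency(frames, max_val=1.0):
    """frames: list of N reconstructed tensors of identical shape, values in [0, max_val]."""
    psnrs = []
    for a, b in itertools.combinations(frames, 2):      # all pairwise combinations
        mse = torch.mean((a - b) ** 2)
        psnrs.append(10 * torch.log10(max_val ** 2 / (mse + 1e-12)))
    return torch.stack(psnrs).mean()                    # higher (dB) = more temporally coherent
```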
Analyzing real video inpainting performance: PSNR and MS-SSIM are not well suited to judging a reconstructed video since these metrics are agnostic to the temporal dimension. We advocate the MOVIE metric [12], which considers spatial, temporal and spatio-temporal aspects by comparing the original and reconstructed sequences. In Table 4 we report the average MOVIE metric on the VidTIMIT test set. Our proposed modifications to [13] significantly improve reconstructed video quality.

Figure 3: Visual comparison of inpainting by the unsupervised baseline of [13] (3rd column) and our proposed method (4th column). We perform equivalently (sometimes better) with 5× fewer iterations. 1st column: original; 2nd column: masked image.
Figure 4: Comparison with the contemporary hybrid methods CE [10] and GIP [14].
Figure 5: Visualizing the benefit of the proposed model (bottom row) over Yeh et al. [13] (middle row) on pseudo sequences. A pseudo sequence is created from a single image of a person masked with different masks (thereby mimicking a temporal aspect). An ideal sequence inpainting model should produce identical outputs for a given subject. Note that our sequence reconstructions are more temporally consistent (notice the lips and eyes).

6 Conclusion

In this paper, we first discussed the problem of the impractically long inference time of the recent completely unsupervised inpainting framework of [13]. We then proposed to speed up the iterative optimization of [13] with a better initialization technique on images and by leveraging temporal redundancy in videos. In the process, we achieved a significant speedup. Several comparisons were also made with current hybrid benchmarks, and we achieved comparable performance. Future work might consider replacing the iterative optimization with learning to project a damaged image into the noise prior space.

References

  • [1] C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman (2009) PatchMatch: a randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics (TOG) 28 (3), pp. 24.
  • [2] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In NIPS, pp. 2672–2680.
  • [3] J. Hays and A. A. Efros (2007) Scene completion using millions of photographs. ACM Transactions on Graphics (TOG), Vol. 26, pp. 4.
  • [4] S. Iizuka, E. Simo-Serra, and H. Ishikawa (2017) Globally and locally consistent image completion. ACM Transactions on Graphics (TOG) 36 (4), pp. 107.
  • [5] J. Krause, M. Stark, J. Deng, and L. Fei-Fei (2013) 3D object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia.
  • [6] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, et al. (2017) Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, Vol. 2, pp. 4.
  • [7] Y. Li, S. Liu, J. Yang, and M. Yang (2017) Generative face completion. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 1, pp. 3.
  • [8] Z. Liu, P. Luo, X. Wang, and X. Tang (2015) Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3730–3738.
  • [9] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng (2011) Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, Vol. 2011, pp. 5.
  • [10] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros (2016) Context encoders: feature learning by inpainting. In CVPR, pp. 2536–2544.
  • [11] C. Sanderson and B. C. Lovell (2009) Multi-region probabilistic histograms for robust and scalable identity inference. In International Conference on Biometrics, pp. 199–208.
  • [12] K. Seshadrinathan and A. C. Bovik (2010) Motion tuned spatio-temporal quality assessment of natural videos. IEEE Transactions on Image Processing 19 (2), pp. 335–350.
  • [13] R. A. Yeh, C. Chen, T. Y. Lim, A. G. Schwing, M. Hasegawa-Johnson, and M. N. Do (2017) Semantic image inpainting with deep generative models. In CVPR, pp. 5485–5493.
  • [14] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang (2018) Generative image inpainting with contextual attention. In CVPR.
  • [15] J. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros (2016) Generative visual manipulation on the natural image manifold. In ECCV, pp. 597–613.