Image inpainting usually refers to filling holes or masked regions with plausible pixel values coherent with the neighborhood context. Traditional techniques [2, 9] were mainly successful in inpainting backgrounds and scenes with repetitive textures by matching and copying background patches into holes. However, these methods fail in cases where patterns are unique or non-repetitive, such as faces and objects, and they also fail to capture the higher semantics of the scene. With the recent breakthroughs in generative models such as the Variational Autoencoder (VAE) and Generative Adversarial Networks (GAN), inpainting is, in general, seen as an image completion problem. There are mainly two schools of approach, viz. a) completely unsupervised: conditioned on a prior latent/noise vector; b) a mixture of supervised and unsupervised: conditioned on the masked image [23, 11]. The latter methods depend heavily on an initial phase of fully supervised training (reconstruction loss between original and inpainted outputs within the mask), followed by a refinement stage with an adversarial loss to add high-frequency components to the reconstructions. Going against the trend, we feel the true essence of GANs lies in their ability to generate data within a completely unsupervised framework. The former, unsupervised method is more difficult to train because it has to 'hallucinate' an entire object with just a noise/latent vector conditioning and no information about the masked/damaged pixels. Thus, though the latter school of approach has gained major attention in the inpainting community, in this paper we advocate the former, unsupervised genre (pixel values under the mask are never used). Being unsupervised is the merit of this approach, but it also creates a runtime bottleneck: the algorithm follows an iterative gradient descent optimization to find the 'best matching' noise prior corresponding to the damaged image. Such an iterative framework prohibits real-time applications.
In this paper we primarily aim to massively accelerate inference runtime (we achieve a 1500× speedup over the iterative baseline) with simultaneous visual quality improvement by parametrically learning noise priors. Another issue with inpainting (both supervised and unsupervised) is the multi-modal completion possibility of a masked region. For example, a masked lip region of a face may be completed as smiling or neutral. We show that it is possible to regularize the inpainted outputs with structural priors; for a face, for example, we can make use of the facial landmarks as priors. Lastly, single-image inpainting models cannot be appreciably applied to videos. Though each frame might be visually pleasing, when viewed as a sequence there is a lot of jitter and flicker due to the temporal inconsistency of the models. We propose to subdue such inconsistencies with a recurrent-net-based grouped noise prior learning combined with a subsequence consistency constraint. Our contributions can be summarized as follows:
An unsupervised, data-driven GAN noise prior prediction framework that converts the iterative paradigm of the baseline to a single feed-forward pipeline with visually better reconstruction and a simultaneous, massive 1500× speedup of inference time.
Augmenting structural priors to improve GAN samples which eventually results in better reconstructions. Such priors also regularize GAN training to respect pose and size of objects.
A pioneering effort towards GAN-based sequence inpainting with recurrent-neural-net-based grouped prior learning for better temporal consistency of reconstructed sequences compared to both supervised and unsupervised benchmarks.
A sub-sequence consistency loss to further improve temporal smoothness of reconstructed sequences.
We exhaustively validate our models on the CelebA, SVHN, Stanford Cars and CelebA-HQ image datasets and the VidTIMIT video dataset.
2 Related works
Traditional image inpainting methods [1, 4, 6, 7] broadly worked by matching patches and diffusing low-level features from unmasked sections into the masked region. These methods mainly worked for synthesizing stationary textures of background scenes, where it is plausible to find a matching patch among the unmasked regions. However, complex objects lack such redundancy of appearance features, and thus recent methods leverage the hierarchical feature learning capability of deep neural nets to learn the higher-order semantics of a scene. Initial deep-learning-based methods [15, 28] were completely supervised and trained with a conservative reconstruction loss. With the advent of GANs, a common practice [23, 11] has been to refine the blurry reconstructions of the reconstruction loss with an adversarial loss coming from a discriminator that is simultaneously trained to distinguish real samples from inpainted samples. Notably, the first work within this paradigm was the Context Encoder (CE) by Pathak et al., in which the authors tried to learn a scene representation along with inpainting. Iizuka et al. proposed 'Globally and Locally Consistent Image Completion' (GLCIC), in which an inpainter/generator network is pitted against two discriminators, one gauging the realism of the entire image and the other measuring the fidelity of the local reconstruction of the masked patch region. Recently, Yu et al. improved upon GLCIC by incorporating contextual attention within the inpainting network, so that the net learns to leverage distant information from uncorrupted pixels. These methods share a common pipeline of a fully supervised training stage followed by adversarial-loss-based refinement. Thus these methods are not fully unsupervised, since paired examples (masked and unmasked) are required during training.
In this paper, we advocate a fully unsupervised approach (information about the masked pixels is not used anywhere in the training pipeline) to inpainting, pioneered by Yeh et al. There, the idea is to first train a GAN conditioned only on a noise prior $z$ sampled from a known distribution. At test time, since their method is completely unsupervised, the authors used an iterative gradient descent optimization with the pre-trained generator and discriminator networks of the GAN to find the 'best matching' $z$ vector for the damaged image. However, this iterative optimization takes about 2.5 minutes per image and is thus not suitable for practical applications. We consider the framework of Yeh et al. as a baseline and seek to improve upon its inference time and reconstruction quality. In the process, we also achieve performance comparable to contemporary hybrid-trained methods.
3.1 GAN Basics
Proposed by Goodfellow et al., a GAN model consists of two parametrized deep neural nets, viz., a generator, $G$, and a discriminator, $D$. The task of the generator is to yield an image, $G(z)$, with a latent noise prior vector, $z$, as input. $z$ is sampled from a known distribution, $p_z$; a common choice is the uniform distribution over $[-1, 1]$. The discriminator is pitted against the generator to distinguish real samples (sampled from the data distribution $p_{data}$) from fake/generated samples. Specifically, discriminator and generator play the following minimax game on the value function $V(D, G)$:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$
With enough capacity, on convergence, $G$ fools $D$ at random, i.e., $D(G(z)) \approx 0.5$.
3.2 Baseline GAN based unsupervised inpainting
We first review the unsupervised inpainting baseline of Yeh et al. Given a damaged image, $y$, corresponding to an original image, $x$, and a pre-trained GAN model, the idea is to iteratively find the 'closest' vector $\hat{z}$ (starting randomly from $p_z$) which results in a reconstructed image whose semantics are similar to those of the corrupted image. $z$ is optimized as
$$\hat{z} = \arg\min_z \; \mathcal{L}\big(M \odot G(z),\; M \odot y\big),$$
where $M$ is the binary mask (zero on the masked region and unity elsewhere), $\odot$ is the Hadamard (element-wise) product and $\mathcal{L}$ is any loss function. It is interesting to note that the loss function never makes use of pixels inside the masked region. Upon convergence, the inpainted image, $\hat{x}$, is given as $\hat{x} = M \odot y + (1 - M) \odot G(\hat{z})$.
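The baseline's test-time procedure can be sketched as a small NumPy toy problem. Here the "generator" is a stand-in linear map with random weights (a real system would use a trained GAN), and the gradient step is derived for a squared contextual loss; all names, dimensions and the learning rate are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Toy sketch of the iterative baseline: gradient descent on the latent
# vector z so that G(z) matches the damaged image y on unmasked pixels.
rng = np.random.default_rng(0)
d_z, d_x = 8, 16
W = rng.standard_normal((d_x, d_z)) * 0.3   # hypothetical "generator" weights
G = lambda z: W @ z                          # stand-in for a trained generator

x = G(rng.standard_normal(d_z))              # "original" image (vectorized)
M = (rng.random(d_x) > 0.25).astype(float)   # binary mask: 1 = unmasked pixel
y = M * x                                    # damaged image (holes zeroed out)

z = rng.standard_normal(d_z)                 # random initialization from p_z
lr = 0.2
for _ in range(1500):                        # paper reports ~1500 iterations
    r = M * (G(z) - y)                       # contextual residual, unmasked only
    z -= lr * (W.T @ r)                      # gradient of 0.5 * ||M ⊙ (G(z) - y)||^2

# Blend: keep the real unmasked pixels, fill the holes from the generator.
x_hat = M * y + (1 - M) * G(z)
```

Note that the optimization never touches pixels under the mask, which is exactly what makes the scheme unsupervised but also what makes 1500 iterations per image necessary at test time.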
4 Proposed Method
4.1 Data driven Noise Prior Learning
Though the unsupervised characteristic of the baseline is encouraging for the generative learning community, the iterative optimization is a major bottleneck in the pipeline. Instead of iteratively optimizing the noise prior, $z$, for each test image during runtime, we propose to learn an unsupervised, offline parametric model for predicting the $z$ vector. The predictor's parameter set is optimized to minimize the following unsupervised losses:
Contextual Loss: This loss ensures that the predicted noise prior preserves fidelity with respect to the original unmasked regions.
Realism Loss: This loss ensures that the inpainted output lies near the original/real data manifold; it is measured by the log-likelihood of belonging to the real class as assigned by the pre-trained discriminator.
Gradient Difference Loss: Inspired by [21, 22], we also use a gradient difference loss imposed between the gradient (horizontal and vertical) matrices of the original and reconstructed outputs. This compels the network to predict noise priors whose samples retain high-frequency content and respect the gradients of the original scene.
Please note that the loss is still calculated on the unmasked regions only. In summary, the predictor's parameter set is optimized to minimize the combined loss, where weighting coefficients $\lambda$ control the relative importance of each loss factor. After the predictor's training converges, given a masked image, $y$, and mask, $M$, we can get the inpainted output, $\hat{x}$, in one feed-forward step instead of the iterative optimization of the baseline. The inpainted image is given by $\hat{x} = M \odot y + (1 - M) \odot G(z)$, where $z$ is the predicted noise prior.
Though Eq. 2 and Eq. 6 are functionally the same, prediction using a learned parametric network tends to perform better than ad hoc iterative optimization. This is because, as training evolves, the network learns to adapt its parameters to map images with closely matching appearances to similar $z$ vectors. A parameter update for a given image thus implicitly generalizes to images with similar characteristics.
4.2 Regularization with Structural Priors
Image inpainting intrinsically suffers from a multi-modal completion problem: a given masked region has multiple plausible completions. For example, consider Fig. 2: in an unconstrained optimization setup, the masked region of the face can be inpainted with different facial expressions. From a single-image inpainting point of view this might not be an issue, but for sequences it is desirable to maintain a smooth flow of scene dynamics; a laughing face, for example, cannot suddenly be inpainted as a neutral frame. We propose to further regularize our network by augmenting it with structural priors. A structural prior can be any representation that captures the pose and size of the object to be inpainted, thereby compelling the network to yield outputs that respect such priors. Such additional priors can be seen as conditional variables, $s$, to the GAN framework. The formulation of Eq. 1 changes subtly to respect the joint distribution of real samples and conditional information. The modified game is:
$$\min_G \max_D V(D, G) = \mathbb{E}_{(x, s) \sim p_{data}}[\log D(x, s)] + \mathbb{E}_{z \sim p_z,\, s \sim p_{data}}[\log(1 - D(G(z, s), s))]$$
The noise prior predictor network now has to optimize the combined loss while respecting the structural prior as an additional constraint.
In this paper, without loss of generality, we have considered face inpainting with semantic priors in the form of facial landmarks, automatically extracted in real time (5 ms at 256×256 resolution) using the robust framework of Kazemi et al., which achieves benchmark performance on face alignment.
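One common way to feed such a landmark prior to a GAN, sketched below, is to render the landmark coordinates as a heatmap channel and concatenate it with the image channels; the rendering scheme, shapes and landmark coordinates here are illustrative assumptions, not the paper's exact conditioning mechanism.

```python
import numpy as np

def landmark_heatmap(points, hw=(64, 64), sigma=2.0):
    """Render 2-D landmark coordinates (x, y) as a one-channel Gaussian heatmap."""
    h, w = hw
    yy, xx = np.mgrid[0:h, 0:w]
    hm = np.zeros(hw)
    for (px, py) in points:
        hm = np.maximum(hm, np.exp(-((xx - px) ** 2 + (yy - py) ** 2)
                                   / (2 * sigma ** 2)))
    return hm[None]                         # shape (1, H, W)

def condition_input(img, s):
    """Stack image channels with the structural-prior channel for D(x, s) / G(z, s)."""
    return np.concatenate([img, s], axis=0)

face = np.zeros((3, 64, 64))                # placeholder RGB face image
s = landmark_heatmap([(20, 30), (44, 30), (32, 45)])  # hypothetical eye/mouth points
x_cond = condition_input(face, s)           # (4, 64, 64): RGB + prior channel
```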
4.3 Grouped Noise Prior Learning for Sequences
To the best of our knowledge, this is the first demonstration of GAN-based, completely unsupervised sequence inpainting. A naive approach to applying the formulation of Eq. 6 to sequences is to inpaint individual frames independently. However, such an approach fails to learn the temporal dynamics of the sequence and thereby yields jittering effects. In this regard, for a sequence of frames, we propose to use a Recurrent Neural Network (RNN) to jointly predict the $z$ vectors for a subset of frames at a time. An RNN maintains a hidden state that summarizes the information observed up to the current time step. The hidden state is updated after looking at the previous hidden state and the corrupted image (with an additional option to condition on structural priors), leading to more consistent reconstructions in terms of appearance.
Since RNNs suffer from the vanishing gradient problem and are unable to capture long-range dependencies, we use Long Short-Term Memory (LSTM) networks. Fig. 3 shows our LSTM-based framework architecture for jointly inpainting a group of frames. Let $y_1, \dots, y_T$ be a group of corrupted successive frames. Initially, each frame is passed through a CNN module (of the same architecture as the single-image predictor, except for the dimensionality of the last layer's output) to obtain the input sequence for the recurrent network. We obtain the predicted prior, $z_t$, by feeding the hidden state, $h_t$, of the recurrent network to a fully-connected layer. $z_t$ is then used for reconstruction with the help of the pre-trained generator, $G$. We use the loss function in Eq. 6, averaged over the grouped window of frames, to optimize the parameters of the LSTM and the CNN descriptor network. Please note, the parameters of the pre-trained generator and discriminator are kept frozen.
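The frame-by-frame flow of this grouped prediction can be sketched as follows. For brevity the sketch uses a plain tanh-RNN cell in place of the paper's LSTM and a linear map in place of the CNN; all weights are random placeholders, and every dimension and name is an illustrative assumption.

```python
import numpy as np

# Grouped noise-prior prediction for a window of T frames:
# CNN stand-in -> recurrent state update -> FC head emitting z_t.
rng = np.random.default_rng(1)
d_feat, d_hid, d_z, T = 32, 24, 16, 5

W_embed = rng.standard_normal((d_feat, 64)) * 0.1   # "CNN" stand-in: linear map
W_xh = rng.standard_normal((d_hid, d_feat)) * 0.1   # input-to-hidden weights
W_hh = rng.standard_normal((d_hid, d_hid)) * 0.1    # hidden-to-hidden weights
W_hz = rng.standard_normal((d_z, d_hid)) * 0.1      # FC head: hidden -> z_t

frames = rng.standard_normal((T, 64))               # vectorized corrupted frames
h = np.zeros(d_hid)
zs = []
for t in range(T):
    f = np.tanh(W_embed @ frames[t])                # per-frame descriptor
    h = np.tanh(W_xh @ f + W_hh @ h)                # recurrent state update
    zs.append(W_hz @ h)                             # predicted prior for frame t
zs = np.stack(zs)                                   # (T, d_z): one z per frame
```

Because each $z_t$ is a function of the shared hidden state, appearance information propagates across the window, which is what discourages frame-to-frame jitter.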
4.4 Subsequence consistency loss
We further regularize training of the LSTM framework with an implicit subsequence consistency loss over a group of neighboring frames. The motivation is that a group of adjacent frames in a video exhibits close coherence of appearance. Thus, we define a subsequence clique as a collection of adjacent frames and penalize the model if the appearances of the frames differ from each other. The disparity between two inpainted images can be approximated by the Euclidean distance between their latent $z$ vectors. We define the subsequence consistency loss as the accumulated pairwise distance $\|z_i - z_j\|_2$ over all pairs of frames within a clique.
So, the grouped prior loss helps in learning the temporal dynamics, while the subsequence consistency loss explicitly fosters temporal smoothness. If the consistency term dominates, the network will be penalized by the grouped prior loss, because over-smoothing of a sequence is not a true characterization of a real-world sequence. The final loss function for the combined LSTM-CNN framework is the grouped prior loss plus a weighted subsequence consistency term, where the weighting coefficient sets the relative importance of subsequence consistency. Please note, the consistency loss is applied only on a neighborhood of frames and not on the entire sequence: applying it on the entire sequence would not be a true representation of the temporal dynamics, because we would then penalize appearance changes even across distant frames. On the contrary, reducing the weight to zero means no explicit temporal consistency loss.
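A minimal sketch of the subsequence consistency term is given below. The mean of squared distances used here (rather than any particular norm or normalization) is an assumption for illustration.

```python
import numpy as np

def subsequence_consistency(zs):
    """Mean pairwise squared Euclidean distance between the latent vectors
    of all frame pairs inside one subsequence clique."""
    T = len(zs)
    pairs = [(i, j) for i in range(T) for j in range(i + 1, T)]
    return sum(np.sum((zs[i] - zs[j]) ** 2) for i, j in pairs) / len(pairs)

zs_smooth = np.ones((4, 8))                       # identical latents -> zero loss
zs_jitter = np.arange(32, dtype=float).reshape(4, 8)  # drifting latents -> penalty
```

Identical latents over the clique incur no penalty, while drifting latents are penalized in proportion to their spread, which is the temporal-smoothness pressure described above.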
5.1 Single Image Inpainting
We experiment on cropped SVHN, Stanford Cars, CelebA and CelebA-HQ. SVHN crops are resized to 64×64. On Stanford Cars we use the bounding box information to extract and resize cars to 64×64. CelebA images are center cropped to 64×64 and 128×128. CelebA-HQ images are resized to 256×256. On SVHN and Cars, we use the dataset providers' test/train splits. For CelebA and CelebA-HQ, we keep 2000 images for testing.
5.1.1 Importance of Learned Noise Prior:
Here we compare the iterative optimization of the baseline with our one-shot feed-forward solution. Without any mechanism to estimate the noise prior from the masked image, initial solutions lie far from the real data manifold, thereby mandating an iterative approach. Abiding by the suggestions in the baseline work, each image requires 1500 test-time iterations. Our approach adds just a subtle amount of computation for the noise predictor network and a negligible overhead for the structural priors, thereby making our model almost 1500× faster than the baseline. From Fig. 12, it is encouraging to see that even after the iterative optimization, the visual quality of our method is usually superior to that of the baseline.
5.1.2 Importance of Structural Priors:
In this paper, we have considered the special case of face inpainting with semantic priors in the form of facial landmarks detected by the robust framework of Kazemi et al. We observed three-fold benefits of leveraging such priors.
Improved GAN Samples and Reconstructions: Conditioning on structural priors forces the generator to yield samples closer to the natural data manifold. Random samples from such a conditioned generator are thus more photo-realistic (see Fig. 5) compared to the unconditioned vanilla GAN used by the baseline. Towards this end, we visually compare (following the protocol in the baseline work) the quality of random samples from our proposed semantically conditioned GAN and the baseline at resolutions of 64×64 and 128×128. For the visual Turing test, a human annotator is randomly shown a total of 200 images (100 real and 100 generated) in groups of 20 and asked to label each sample as real or fake. Decisions from 10 annotators are taken. On average, at 64×64 resolution, the real-vs-fake classification accuracy is 5.8% higher for DIP, and 4.2% higher at 128×128 resolution. Thus, human annotators found it more difficult to distinguish samples from our model compared to DIP.
Control of Pose and Expression: With structural priors, the generator learns to disentangle appearance and pose. A given semantic prior should force the generator to create a face with matching head pose and facial expression, while two nearby $z$ vectors result in similar facial textures. In Fig. 6 (top setting) we show such disentanglement learned by our model.
Greater structural fidelity to the reference image: In Fig. 8, we show the importance of structural priors on top of learned noise priors. Reconstructions with only our proposed learned noise priors might be realistic in isolation, but are not penalized for changing facial expressions. For example, a (masked) smiling face can be inpainted as a neutral face when conditioning only on a learned noise prior. However, if we constrain the model with structural priors, the reconstructions are more coherent in appearance and expression with the reference image. Such structural fidelity is key to achieving temporally more consistent sequence reconstructions, as discussed in the upcoming sections.
5.1.3 Comparison to Hybrid Benchmarks
Though our method is unsupervised, for completeness of the paper, we also compare with the recent hybrid inpainting benchmarks of [23, 11, 30, 19]. To scale our GAN model up to 256×256, we follow the progressive growing training strategy. See Fig. 13 for visual examples.
Is a Supervised Phase Mandatory?
To seek an answer, we trained the models of [23, 11, 30, 19] with only the adversarial loss and no reconstruction loss. We observe that these methods fail to perform in the absence of the reconstruction loss. In Fig. 12, we show some visual examples.
5.2 Sequence Inpainting
5.2.1 Temporal consistency and Synthetic Sequences
Recent deep-learning-based inpainting works have been restricted to single-image inpainting; the genre of video has received little interest. Even where there are some works [18, 27], the reported results are in terms of per-frame PSNR, which does not take into account the temporal consistency/dynamics of scene reconstructions. For example, it is very annoying for a viewer if the stationary portions of a series of frames are reconstructed with different appearances on each frame, thereby creating jitter effects.
We dedicate this section to analyzing the temporal consistency of different methods on synthetic sequences. A synthetic sequence of length $N$ is formed by taking a single image and masking it with different (or the same) corruption masks. An ideal inpainting model should be agnostic of the corruption masks and yield identical reconstructions for all the frames. We define temporal consistency as the mean pairwise PSNR between all possible pairs of inpainted frames within a synthetic sequence. Eq. 12 allows enumerating the consistency of a generative model; ideally, all reconstructions are identical. Please note that this evaluation is not possible on real videos, because the transformation from one frame to another is not known, and it is thus impossible to align the frames to a single frame of reference without incorporating interpolation noise from a motion compensator. In Table 1 we compare the consistencies with contemporary benchmarks. We see a progressive improvement of consistency with the addition of the LSTM grouped prior and structural priors. Note that even the hybrid (supervised + adversarial) benchmarks manifest higher inconsistencies, with the exception of the generative face completion model, because it jointly trains the network with an inpainting loss and a face parsing loss. This bolsters the hypothesis that prior knowledge of object structure helps in inpainting.
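The temporal-consistency measure can be sketched directly from its definition: mean pairwise PSNR over all frame pairs of a reconstructed synthetic sequence. The peak value of 1.0 (float images in [0, 1]) and the epsilon guard are assumptions of this sketch.

```python
import numpy as np

def psnr(a, b, peak=1.0, eps=1e-12):
    """Peak signal-to-noise ratio in dB between two same-sized frames."""
    mse = np.mean((a - b) ** 2)
    return 10 * np.log10(peak ** 2 / (mse + eps))

def temporal_consistency(frames):
    """Mean pairwise PSNR between all pairs of frames in the sequence."""
    N = len(frames)
    vals = [psnr(frames[i], frames[j])
            for i in range(N) for j in range(i + 1, N)]
    return float(np.mean(vals))

ideal = np.tile(np.linspace(0, 1, 16), (3, 1))   # identical reconstructions
noisy = ideal + np.random.default_rng(2).normal(0, 0.05, ideal.shape)
```

An ideal, mask-agnostic model reproduces identical frames and scores a very high (epsilon-capped) consistency, whereas jittery reconstructions score lower.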
| Method | Temporal Consistency (dB) on Synthetic Sequences: SVHN @ 64×64 | CelebA @ 128×128 | Cars @ 64×64 | PSNR (dB) on Single Images: SVHN @ 64×64 | CelebA @ 128×128 | CelebA-HQ @ 256×256 |
| Yeh et al. | 22.5 / 22.9 / 22.8 | 21.9 / 22.2 / 21.1 | 13.9 / 14.0 / 13.2 | 20.9 / 21.2 / 21.0 | 23.0 / 23.1 / 21.4 | 15.7 / 16.0 / 13.1 |
5.2.2 Importance of Subsequence Consistency Loss
In Fig. 10 we show a synthetic sequence in which the same face is masked differently. The proposed LSTM grouped-prior-based reconstruction is successful in maintaining the same overall facial expression, but fails to maintain subtle textural consistencies, as shown in the highlighted insets. The subsequence consistency loss helps in maintaining such subtle texture coherence, which results in improved temporal consistency. Again, please note, these differences are much easier to illustrate (and visualize) in such synthetic sequences than in real videos.
5.2.3 Application on Real Videos:
The experiments with synthetic sequences taught us three lessons, viz., a) LSTM-CNN based grouped noise prior learning is better than independent noise prior learning; b) structural priors foster higher fidelity; and c) the subsequence consistency loss helps in preserving subtle texture details. With this knowledge, we proceed to demonstrate the first attempt at GAN-based inpainting on real videos. For this, we selected the VidTIMIT dataset, which consists of video recordings of 43 subjects, each narrating 10 different sentences. Images of the CelebA dataset are of superior resolution to those of VidTIMIT. Due to this intrinsic difference in data distribution, we fine-tuned our pre-trained (on CelebA) models on 33 randomly selected subjects of VidTIMIT. The videos of the remaining 10 subjects were kept for testing inpainting performance. In total, there are 9600 frames for testing. All faces are center cropped to 128×128.
Evaluating Video Quality with the MOVIE metric: Traditional metrics such as PSNR and structural similarity (SSIM) are not a true reflection of human visual perception, as shown in recent studies [30, 17]. Also, these metrics do not consider any temporal information. For this reason, we preferred the MOVIE metric, a spatio-spectrally localized framework for assessing video quality that considers spatial, temporal and spatio-temporal aspects. A lower value of the MOVIE metric indicates a better video, and the metric has been found to correlate appreciably with human perception. In Table 2, we compare the average test-set MOVIE metric. All variants of our proposed framework outperform the iterative baseline. With the independent noise prior model, we get better performance than the baseline and performance comparable to [11, 30]. Addition of the LSTM grouped prior and the structural prior boosts our performance, with further improvement coming from the subsequence consistency loss. It is interesting to see that even if we compute the structural prior on every third frame (and reuse it in between), there is only a subtle degradation of performance. We show some video snippets in Fig. 11.
6 Discussion and Conclusion
In this paper, we showed the importance of priors in GANs for pushing the performance envelope of the unsupervised inpainting framework of Yeh et al., with better inpainting quality and an almost 1500× speedup. The objective of this paper was to purposefully abstain from the contemporary practice of hybrid (supervised + unsupervised) training and to focus on creating a faster unsupervised framework with comparable visual performance. Our proposed framework with grouped LSTM-CNN guided noise priors and structural priors manifests better spatio-temporal characteristics than contemporary hybrid baselines. This shows that current single-image inpainting methods have further scope for improvement on videos, and the frameworks used by us in this regard can be exploited by those algorithms as well. Given the current state of GAN research, it is not expected that a completely unsupervised GAN-based inpainter can work on natural images such as the ImageNet or Places2 datasets (which hybrid methods are capable of due to their supervised training phase). However, as our understanding of GANs improves and we enable GAN models to generate natural scenes, the methods of this paper shall seamlessly fit those scenarios as well.
-  (2001) Filling-in by joint interpolation of vector fields and gray levels. IEEE transactions on image processing 10 (8), pp. 1200–1211. Cited by: §2.
-  (2009) PatchMatch: a randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics (ToG) 28 (3), pp. 24. Cited by: §1.
-  (1994) Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks 5 (2), pp. 157–166. Cited by: §4.3.
-  (2000) Image inpainting. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques, pp. 417–424. Cited by: §2.
-  (2017) Real-time video super-resolution with spatio-temporal networks and motion compensation. In CVPR. Cited by: §5.2.1.
-  (2001) Image quilting for texture synthesis and transfer. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pp. 341–346. Cited by: §2.
-  (1999) Texture synthesis by non-parametric sampling. In ICCV, pp. 1033. Cited by: §2.
-  (2014) Generative adversarial nets. In NIPS, pp. 2672–2680. Cited by: §1, §3.1.
-  (2007) Scene completion using millions of photographs. In ACM Transactions on Graphics (TOG), Vol. 26, pp. 4. Cited by: §1.
-  (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §4.3.
-  (2017) Globally and locally consistent image completion. ACM Transactions on Graphics (TOG) 36 (4), pp. 107. Cited by: §1, Figure 2, §2, Figure 12, §5.1.3, §5.2.3, Table 1, Table 2, Figure 13.
-  (2018) Progressive growing of gans for improved quality, stability, and variation. In ICLR, Cited by: §5.1.3, §5.1, Figure 13.
-  (2014) One millisecond face alignment with an ensemble of regression trees. In CVPR, pp. 1867–1874. Cited by: §4.2, §5.1.2.
-  (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §1.
-  (2014) Mask-specific inpainting with deep neural networks. In German Conference on Pattern Recognition, pp. 523–534. Cited by: §2.
-  (2013) 3D object representations for fine-grained categorization. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 554–561. Cited by: §5.1.
-  (2017) Photo-realistic single image super-resolution using a generative adversarial network.. In CVPR, Vol. 2, pp. 4. Cited by: §5.2.3.
-  (2018) Inpainting of continuous frames of old movies based on deep neural network. In 2018 International Conference on Audio, Language and Image Processing (ICALIP), pp. 132–137. Cited by: §5.2.1.
-  (2017) Generative face completion. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 1, pp. 3. Cited by: Figure 12, §5.1.3, §5.2.1, Table 1, Table 2, Figure 13.
-  (2015) Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3730–3738. Cited by: §5.1.
-  (2016) Deep multi-scale video prediction beyond mean square error. ICLR. Cited by: §4.1.
-  (2017) Medical image synthesis with context-aware generative adversarial networks. In MICCAI, pp. 417–425. Cited by: §4.1.
-  (2016) Context encoders: feature learning by inpainting. In CVPR, pp. 2536–2544. Cited by: §1, Figure 2, §2, §5.1.3, §5.2.3, Table 1, Table 2, Figure 13.
-  (2009) Multi-region probabilistic histograms for robust and scalable identity inference. In International Conference on Biometrics, pp. 199–208. Cited by: §5.2.3.
-  (2010) Motion tuned spatio-temporal quality assessment of natural videos. IEEE transactions on image processing 19 (2), pp. 335–350. Cited by: §5.2.3, Table 2.
-  (2017) Learning from simulated and unsupervised images through adversarial training. CVPR, pp. 2107–2116. Cited by: §5.1.2.
-  (2018) A temporally-aware interpolation network for video frame inpainting. arXiv preprint arXiv:1803.07218. Cited by: §5.2.1.
-  (2014) Deep convolutional neural network for image deconvolution. In Advances in Neural Information Processing Systems, pp. 1790–1798. Cited by: §2.
-  (2017) Semantic image inpainting with deep generative models. In CVPR, pp. 5485–5493. Cited by: item 1, §1, §1, Figure 2, §2, §3.2, Figure 4, §4.1, Figure 11, Figure 7, Figure 9, §5.1.1, §5.1.2, §5.2.3, Table 1, Table 2, Figure 13, §6.
-  (2018) Generative image inpainting with contextual attention. In CVPR, Cited by: Figure 2, §2, Figure 12, §5.1.3, §5.2.3, Table 1, Table 2, Figure 13.