Semantic inpainting refers to the reconstruction of damaged portions of an image using the available neighborhood information. In this paper, we investigate the role of automated dense semantic conditioning of generative adversarial networks (GANs) for the specific task of semantic inpainting. We focus on the special case of faces because faces are hard to inpaint owing to their fine semantic details. Also, with the contemporary proliferation of multimedia services, video calling is becoming a frequent mode of communication, and in such video streams human faces occupy a major part of each frame. Computer vision guided inpainting of facial sequences is therefore timely. Specifically, we wish to study and improve upon two aspects, viz., a) consistency and b) correctness.
Consistency applies to sequence inpainting, where we measure the coherence among a group of reconstructed frames. If it is not accounted for, generative models render abrupt structural changes and unpleasant flickering over stationary portions of frames. This behavior is intrinsic to generative models because the forward process of mapping a corrupted section to a valid image manifold is multimodal. An illustration is shown in Figure 2 (refer to Figure 1 for an actual comparison of outputs), wherein a generative model has multiple independent and equiprobable options for semantically filling in the corrupted portion of the image. However, if the model is applied to a stream of video frames, such independent reconstructions render the sequence unrealistic because, for example, a smiling face has a very low probability of transitioning into a neutral face in the next frame. Our intuition for tackling this problem is to constrain the possible modes of generation with auxiliary conditional information. Such conditioning can take the form of shape priors, as used by Fišer et al. for the synthesis of stylized facial animations, or of consistency in the optical flow field for video style transfer. Iizuka et al. concentrated only on consistency of reconstruction at a local and global scale within a single frame, and did not address multimodal image completion in sequence inpainting. We illustrate, both numerically and visually, the inconsistencies in GAN based inpainting methods and offer a simple and computationally frugal solution to enforce consistency.
Correctness refers to a similarity metric quantifying the fidelity of the reconstructed output to the original version. As we build upon the recent "DCGAN" based inpainting method of Yeh et al. (abbreviated as 'DIP' in the rest of the paper), the quality of reconstruction depends on how well the generator is trained to approximate the underlying data distribution. Recent work shows that conditioning the GAN framework on positional constraints fosters better sample generation. Our idea for improving upon DIP is to condition the GAN framework on automatically extracted facial semantics, thereby enabling (or, equivalently, constraining) the generator to produce specific facial components that adhere to this conditioning input. In §5.2, we show that this simple solution significantly improves the quality of generated samples and also plays a pivotal role in achieving consistent reconstruction.
Specifically, our key contributions in the paper are:
- To the best of our knowledge, this is the first time the dual concepts of "correctness" and "consistency" are explicitly studied in the context of GAN based inpainting.
- We show that the facial semantic conditional information enables our generative model to disentangle appearance and pose cues (§5.1).
- A new framework is presented for assessing the consistency of reconstruction by generative models. We show that our model reconstructs more consistent images than DIP (§5.4).
2 Related Works
With the advent of the Variational Auto-Encoder (VAE) and the GAN, there has been a recent surge of interest in automated image/video generation and subsequent unsupervised feature learning [9, 34, 23]. GANs are known to generate sharper images than VAEs, because a VAE is trained with a reconstruction-based loss and thereby produces blurry outputs. An empirically tested, stable architecture for GAN training was later recommended and became popular as the "DCGAN". While the original GAN formulation let the generator sample from the generated distribution without restraint, researchers have recently used additional conditional information to constrain the output space of GANs for controlled sample generation. The class conditional GAN was the natural extension, wherein the generator is forced to generate samples of a given class. Denton et al. extended this idea to a class conditional Laplacian pyramid GAN setting, where the hierarchical conditioning information improved sample quality. Apart from discrete class labels, continuous attributes such as 'smile' and 'age' have been used to interactively modify a given image. Such continuous conditioning has also been leveraged by [4, 51] for semantically consistent photo editing of faces. Conditioning on natural text was leveraged by Reed et al. to map an informal description of a flower or bird directly to pixel space. Later, Zhang et al. used a stacked GAN architecture to generate sharper images at higher resolution by conditioning on both the text and the image generated by the first-level GAN.
Another contemporary practice is to condition a GAN on an auxiliary image, especially for tasks such as image-to-image translation [16, 7], style transfer [52, 6, 17], video/sequence generation [29, 43], image denoising, real-time texture synthesis, image super-resolution, semantic inpainting and unsupervised visual domain alignment [3, 40], to list a few.
Conditional information is not restricted to GANs. Recent works on VAEs have also explored such auxiliary conditions for predicting future states from a single static image, for attribute based face editing and for learning to represent structured outputs. Conditional inputs have also been used as discriminative regularizers for improving VAE sample quality.
Recently, Reed et al. showed that providing sparse localization information to the generator network helps it produce better samples. Our idea is mainly motivated by this observation. However, our approach is computationally more scalable because we use an automated facial fiducial point detection framework based on real-time face alignment with an ensemble of regression trees, whereas the authors of that work had to manually mark the parts of the objects before training the conditional GAN. Also, we provide a dense semantic guide to the generator instead of sparse body joint or bounding box locations.
3.1 Generative Adversarial Networks
A generative adversarial network engages two parametrized models, viz., a discriminator $D$ and a generator $G$, in a two-player min-max game. Realized as a feed-forward neural network, the generator takes a latent noise vector $z$ drawn from a prior noise distribution $p_z(z)$. Following [49, 12], $p_z$ is a uniform distribution, and the generator maps $z$ onto an image, $x = G(z)$. The other network, the discriminator, has the task of discriminating between samples coming from the true data distribution $p_{data}(x)$ and the generated distribution $p_g(x)$. Specifically, the generator and the discriminator play the following game on $V(D, G)$:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{1}$$
This min-max game has a global optimum when $p_g = p_{data}$, which is attained when both the discriminator and the generator have enough capacity. Empirically, it has been observed that it is prudent for the generator to maximize $\log D(G(z))$ instead of minimizing $\log(1 - D(G(z)))$.
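The preference for the non-saturating generator objective can be illustrated numerically. The sketch below is only an illustration under stated assumptions: it treats the discriminator output on a generated sample as a sigmoid of a logit `a` (the names `sigmoid`, `saturating_grad` and `non_saturating_grad` are ours, not the paper's) and compares the gradients of the two objectives with respect to that logit.

```python
import math


def sigmoid(a):
    """Discriminator probability D = sigmoid(a) for a logit a."""
    return 1.0 / (1.0 + math.exp(-a))


def saturating_grad(a):
    """Gradient w.r.t. a of log(1 - sigmoid(a)), the term the generator would minimize."""
    return -sigmoid(a)


def non_saturating_grad(a):
    """Gradient w.r.t. a of -log(sigmoid(a)), the surrogate that is minimized instead."""
    return -(1.0 - sigmoid(a))


if __name__ == "__main__":
    # Early in training the discriminator rejects generated samples confidently,
    # i.e. the logit is strongly negative.
    for logit in (-6.0, -2.0, 0.0, 2.0):
        print(logit, saturating_grad(logit), non_saturating_grad(logit))
```

When the discriminator confidently rejects a sample (a strongly negative logit), the gradient of the original objective almost vanishes, while the surrogate still provides a gradient of magnitude close to one.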
3.2 Conditional generative adversarial networks
In conditional GANs, an extra input $c$ is fed to the generator in addition to the vector $z$, so that $x = G(z, c)$. Under this conditioning, the modified objective for GAN training becomes
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x, c)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z, c), c))] \tag{2}$$
The GAN framework is flexible in accepting different genres of conditioning inputs such as class labels, natural language descriptions, localization information and even an entire image [50, 16] or a sequence of images. In our case, we condition the GAN framework on a facial semantic map capturing the pose (head orientation, size) and coarse facial expressions.
4.1 Facial Semantic Map Extraction
The first requirement for training our semantically guided GAN framework is the extraction of facial semantics. To this end, we use the real-time face alignment framework of Kazemi et al. However, detecting facial key points alone does not explicitly give semantic information about the face. To address this, semantically similar facial components are grouped together and given the same RGB color encoding. As shown in Figure 3, this semantic map acts as conditional information during the GAN training and inference phases.
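A minimal sketch of how detected key points might be turned into a dense, color-coded map is given below. It assumes the common 68-point landmark layout (eyebrows 17 to 26, nose 27 to 35, eyes 36 to 47, outer mouth 48 to 59); the grouping, the colors and the function names are hypothetical, since the exact encoding used for the paper's maps is not stated in the text.

```python
# Hypothetical grouping and colors; each group lists index ranges that are
# filled as closed polygons and share one RGB color.
COMPONENTS = {
    "eyebrows": ([range(17, 22), range(22, 27)], (255, 0, 0)),
    "eyes": ([range(36, 42), range(42, 48)], (0, 255, 0)),
    "nose": ([range(27, 36)], (0, 0, 255)),
    "mouth": ([range(48, 60)], (255, 255, 0)),
}


def inside(px, py, poly):
    """Even-odd ray casting test of a point against a closed polygon."""
    hit = False
    n = len(poly)
    for i in range(n):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % n]
        if (y1 > py) != (y2 > py):  # edge crosses the horizontal line at py
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                hit = not hit
    return hit


def semantic_map(landmarks, height, width):
    """Rasterize grouped landmarks into an RGB map on a black background."""
    canvas = [[(0, 0, 0)] * width for _ in range(height)]
    for polygons, color in COMPONENTS.values():
        for index_range in polygons:
            poly = [landmarks[i] for i in index_range]
            for y in range(height):
                for x in range(width):
                    # Test the pixel center against the component outline.
                    if inside(x + 0.5, y + 0.5, poly):
                        canvas[y][x] = color
    return canvas
```

The brute-force per-pixel test is chosen for clarity only; a practical pipeline would use a library polygon fill. Both eyes share one color, which mirrors the idea that semantically similar components receive the same encoding.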
4.2 Training semantic conditioned GAN
The basic architecture of our proposed conditioned GAN is shown in Figure 3. We draw $z$ from the noise prior and tile it across all spatial locations to match the resolution of the conditional map. Next, the tiled vector and the facial semantic map are concatenated and fed to a conv-deconv network (a deconv layer should ideally be termed a transposed convolution layer). The convolutional section has 5 convolution layers with stride 2 and kernel size 5, and the number of filters doubles at every stage. The transposed convolutional section consists of 4 layers of fractionally strided convolution; each layer upscales the previous layer's output by 2 and halves the number of filters. The discriminator is also conditioned on the semantic map by concatenating the generated/real images with the corresponding maps. This forces the generator not only to generate realistic samples but also to adhere to the face pose and expression constraints imposed by the semantic map. The discriminator consists of a series of stride 2, kernel size 5 convolutions until the spatial resolution is reduced to 4×4, followed by a linear layer that outputs the probability of the joint (face, map) combination belonging to the real or fake distribution. Following the DCGAN recommendations, we apply Batch Normalization after all layers of the generator, followed by a ReLU non-linearity. The exception is the last deconvolutional layer, which is followed by a tanh non-linearity without Batch Normalization. In the discriminator, Batch Normalization is applied after all convolutional layers except the first and the last. We use a LeakyReLU activation after each convolutional layer, and the final layer is followed by a sigmoid non-linearity.
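Two bookkeeping steps of this architecture can be sketched without any deep learning library: tiling $z$ over the spatial grid before concatenation with the map, and counting the stride-2 convolutions the discriminator needs to reach 4×4. The function names, the channel ordering and the nested-list image layout below are assumptions made for illustration.

```python
import math


def tile_and_concat(z, semantic_map):
    """Tile z over every pixel and append it to that pixel's map channels.

    semantic_map is a height x width grid of channel lists, e.g. [r, g, b].
    Each output pixel holds the map channels followed by a copy of z.
    """
    return [[list(pixel) + list(z) for pixel in row] for row in semantic_map]


def discriminator_depth(resolution, final=4):
    """Number of stride-2 convolutions that reduce resolution to final."""
    steps = math.log2(resolution / final)
    if steps != int(steps):
        raise ValueError("resolution must be final times a power of two")
    return int(steps)


if __name__ == "__main__":
    print(discriminator_depth(64), discriminator_depth(128))
```

Under these assumptions the network input has as many channels as the map plus the dimension of $z$, and the discriminator needs one more stride-2 layer at 128×128 than at 64×64.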
4.3 Semantic inpainting with appearance and pose constraint GAN
It has been shown that a linear interpolation in the $z$ space results in a smooth transition in semantic space. This indicates that semantically similar looking images can be created from 'close' $z$ vectors (here, we define "close" as per Equation 3). We build upon DIP, wherein the idea is to find the $z$ vector of the natural image that is semantically "closest" to the corrupted image. In our case, however, the semantic closeness between the corrupted and uncorrupted images is constrained by both appearance and pose criteria; this joint constraint yields visually correct and structurally aligned inpainting. Specifically, given a damaged image $y$, the corruption mask $M$ and the semantic map conditioning $c$, we aim to find the best-fit vector $\hat{z}$ by iteratively optimizing the following loss function,
$$\mathcal{L}(z \mid y, M, c) = \mathcal{L}_c(z \mid y, M, c) + \lambda \mathcal{L}_p(z \mid c) \tag{3}$$
where $\lambda$ strikes a trade-off between $\mathcal{L}_c$ and $\mathcal{L}_p$. $\mathcal{L}_c$ is the contextual loss, which penalizes changes to the appearance of the uncorrupted pixels. It compares the generated image $G(z, c)$ with $y$ after both are masked by $M$ through the Hadamard product $\odot$, where $M = 1$ for uncorrupted pixels and 0 otherwise. $\mathcal{L}_p$ is the perceptual loss coming from the pre-trained discriminator of §4.2; it penalizes the joint combination of generated image and semantic map for lying away from the natural image manifold.
For a given corrupted image, we start with a random $z$ and iteratively update it with stochastic gradient descent to minimize the loss in Equation 3. This lets us approximately find the vector $\hat{z}$ that maps to the closest semantic neighbor of the corrupted image. After calculating $\hat{z}$, the inpainted image is formed by overlaying the corrupted image $y$ with the reconstructed image $G(\hat{z}, c)$ in the corrupted region.
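The latent search can be illustrated with a deliberately small stand-in. The sketch below makes several assumptions that are not taken from the paper: the generator is replaced by a toy linear map $G(z) = Wz$, the contextual term is a masked squared error (the norm used in the paper is not recoverable from this text), the perceptual term $\lambda \mathcal{L}_p$ is omitted because it needs a trained discriminator, plain gradient descent replaces Adam, and $z$ is clipped to $[-1, 1]$ after every step. All function names are hypothetical.

```python
import random


def generate(W, z):
    """Toy linear stand-in for the generator: G(z) = W z."""
    return [sum(w * zj for w, zj in zip(row, z)) for row in W]


def contextual_loss(W, z, y, mask):
    """Masked squared error over the uncorrupted pixels (mask value 1)."""
    g = generate(W, z)
    return sum(m * (gi - yi) ** 2 for m, gi, yi in zip(mask, g, y))


def fit_latent(W, y, mask, steps=500, lr=0.05, seed=0):
    """Gradient descent on z, clipping z to [-1, 1] after each update."""
    rng = random.Random(seed)
    z = [rng.uniform(-1.0, 1.0) for _ in W[0]]
    for _ in range(steps):
        g = generate(W, z)
        residual = [m * (gi - yi) for m, gi, yi in zip(mask, g, y)]
        # Analytic gradient of the masked squared error with respect to z.
        grad = [2.0 * sum(r * row[j] for r, row in zip(residual, W))
                for j in range(len(z))]
        z = [max(-1.0, min(1.0, zj - lr * gj)) for zj, gj in zip(z, grad)]
    return z


def blend(y, mask, g):
    """Keep known pixels from y and fill the hole from the generator output."""
    return [m * yi + (1 - m) * gi for m, yi, gi in zip(mask, y, g)]


if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
    clean = generate(W, [0.5, -0.25])
    mask = [1, 1, 1, 0]                      # the last pixel is corrupted
    y = [m * xi for m, xi in zip(mask, clean)]
    z_hat = fit_latent(W, y, mask)
    print(blend(y, mask, generate(W, z_hat)))
```

Only the uncorrupted pixels drive the search, and the missing pixel is then read off the generator output, which is the same division of labor between the contextual loss and the final overlay that the text describes.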
4.4 Implementation Details
Both the GAN training and the inpainting experiments were performed at 64×64 and 128×128 resolutions. We followed mini-batch stochastic gradient descent optimization with a mini-batch size of 64, using the Adam optimizer.
During GAN training, learning rate was kept constant at 210 and Adam momentum parameters, and = 0.5.
During semantic inpainting, learning rate was set to 510 and iterative backpropagation was carried out for 1500 iterations, after which the loss in Equation 3 saturates. The $z$ vector was restricted to a bounded range and $\lambda$ was set to 0.1. The momentum parameters of Adam were set to $\beta_1$ = 0.9 and $\beta_2$ = 0.99. The same parameters were used for both 64×64 and 128×128 resolutions and for all deformation types. For the DIP framework, we used the parameter settings reported by its authors.
Our model assumes that a facial semantic map is available for a corrupted image. This is not an overly restrictive assumption, because current state-of-the-art facial landmark localizers perform appreciably under significant occlusions. Also, the recent work of Luc et al. has shown significant promise in predicting future semantic maps of a scene. Thus, our assumption of having facial semantic maps is pragmatic. Moreover, we envision that the concepts of this paper will be extended to video inpainting. In videos, temporal redundancy makes it possible to reuse facial maps from a preceding voxel. It is frequent practice [10, 22] in the traditional video coding literature to reuse spatio-temporal information from nearby voxels for appearance and motion vector based error concealment. However, the main question we ask in this paper is, "if we somehow provide semantic maps to a GAN, will that improve overall inpainting quality and consistency?". Predicting facial semantic maps under occlusion is an independent research problem and would distract from the central theme of this paper.
Before delving into the experimental analysis, we feel a justification is required for selecting DIP as our baseline and for restricting ourselves to the original GAN formulation of Goodfellow et al. for our GAN training. First, DIP is, as of today, the benchmark for GAN based image inpainting. The superiority of the GAN paradigm of inpainting over the previous state-of-the-art method of context encoders has already been shown. The core objective of this paper is to make readers aware of the intrinsic sequence inpainting inconsistency of DIP and to examine whether semantic conditioning helps mitigate this drawback. Second, we could have exploited other variants of the GAN formulation such as Wasserstein GAN, boundary equilibrium GAN and unrolled GAN, as these variants have been shown to produce better samples than the original GAN formulation. However, in this paper we are interested in showing that a conditional semantic map is effective in improving sample quality compared to the original unconstrained GAN formulation. Combining a better GAN loss function with semantic conditioning would not make for a fair comparison with the DIP framework.
5.1 Independence of appearance and pose
The main hypothesis behind our semantically conditioned GAN is that the generator should learn to disentangle appearance and pose cues when generating images. Intuitively, the semantic map should force the generator to create a face with matching head pose and facial expression, while two nearby $z$ vectors should result in similar facial textures. In Figure 4 (top setting) we show groups of images generated using the same $z$ vectors but different semantic maps. Appearance factors such as gender, skin texture and hair color/style are preserved, yet the facial expressions and pose closely adhere to the semantic map. In Figure 4 (bottom setting), images along a row are generated with different $z$ vectors for a given facial map. Changes in appearance can be seen, but the facial expression and orientation remain constant. Such independence of appearance and shape is key to the success of our inpainting method. Given a semantic map, the $z$ vector mainly focuses on perfecting the appearance.
5.2 Generated image quality and visual turing test
The success of a GAN based inpainting framework depends on the capability of the generator to approximate the real image manifold. A generator yielding more realistic samples is therefore expected to perform better inpainting. To this end, we visually compare the quality of random samples from our proposed semantically conditioned GAN and from the GAN used by DIP at resolutions of 64×64 and 128×128. As shown in Figure 5, samples from our proposed model are usually sharper and structurally more coherent. To quantitatively compare the visual appearance of the two models, we perform a visual Turing test. A human annotator is randomly shown a total of 200 images (100 real and 100 generated) in groups of 20 and is asked to label each sample as real or fake. Decisions from 10 annotators are taken. On average, the classification accuracy is 5.8% higher for DIP at 64×64 resolution and 4.2% higher at 128×128 resolution. Thus, human annotators found it more difficult to tell samples of our model apart from real images than samples of DIP. This finding advocates the use of semantic conditioning for improving GAN samples without any significant overhead of architecture and loss function modification.
5.3 Image inpainting
As we are interested in face inpainting, we used the CelebA dataset, which contains 202,599 face images. Following prior work, we set aside 2000 images for testing. GAN training (for both DIP and our model) was done on the remaining images.
5.3.1 Correctness of inpainting: Quantitative and visual evaluation
For evaluating correctness, we measure the PSNR between an uncorrupted image and its inpainted version. When reporting PSNR, we do not use Poisson blending as post-processing, because such post-processing obscures the true performance of a deep generative model. We performed extensive experiments on different types of corruption masks, as shown in Figures 6 and 7, to compare the generalization capability of each model. The mean PSNR for each setting is reported in Table 1. At 64×64 resolution, for the Central, Checkerboard, Left and Freehand masks, our method outperforms DIP by margins of 1.36 dB, 1.46 dB, 1.29 dB and 1.33 dB respectively. At 128×128 resolution, the corresponding margins are 0.93 dB, 0.76 dB, 0.98 dB and 0.92 dB. The differences are statistically significant in all cases, which shows that our model significantly outperforms DIP. We show some exemplary inpaintings in Figures 6 and 7. Finer facial structural details are preserved by our model. Our reconstructions are also sharper owing to the superiority of the underlying semantically conditioned GAN model.
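For reference, PSNR between two images can be computed as in the sketch below. It assumes flattened images with an 8-bit peak value of 255 and evaluates the error over all pixels; whether the paper restricts the measurement to the corrupted region is not stated, so that choice is an assumption, as is the function name.

```python
import math


def psnr(reference, restored, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Both images are flat sequences of pixel intensities.
    """
    if len(reference) != len(restored):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


if __name__ == "__main__":
    print(psnr([0, 10, 20], [1, 11, 21]))
```

Because PSNR is logarithmic in the mean squared error, a margin of about 1 dB corresponds to roughly a 20% reduction in mean squared error.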
However, we acknowledge that PSNR (or even the Structural Similarity Metric (SSIM)) might not be the best metric for comparing generative models, because these models are not explicitly trained to minimize a pixel-wise loss. Such observations were made in recent works on image super-resolution [25, 17]. To complement our findings in Table 1, we perform a human visual testing experiment. Each subject is shown the original image, the corrupted version and the two inpainted images, without revealing the identity of the underlying algorithms. The subject has to vote for the inpainted image with better visual quality. Each subject was shown a random selection of 100 images. In a study with 10 participants, our algorithm was selected 69.7% of the time, which is significantly better than chance.
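One way to check a preference rate against chance is an exact one-sided binomial test, sketched below. This is an assumption on our part: the text does not say which statistical test was used, and the sketch treats the 10 × 100 = 1000 votes as independent, which ignores per-subject effects. The function name is hypothetical.

```python
from math import comb


def binomial_tail(successes, trials):
    """Exact P(X >= successes) for X ~ Binomial(trials, 0.5)."""
    favorable = sum(comb(trials, k) for k in range(successes, trials + 1))
    return favorable / 2 ** trials


if __name__ == "__main__":
    # 69.7% of 1000 pooled votes corresponds to 697 votes for one method.
    print(binomial_tail(697, 1000))
```

Python's arbitrary-precision integers keep the sum exact before the final division, so no approximation of the binomial distribution is needed at this sample size.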
5.4 Consistency in inpainting
The strategy behind quantifying the consistency of inpainting is to corrupt a given image using different (or the same) deformations and to pass the corrupted images independently to the inpainting model. Ideally, the inpainted images should be coherent with each other. Note that such a study of consistency is not recommended on real videos, because there are unknown transformations between two successive frames. We could use approximate motion compensation to align two frames, but the evaluation would then carry an intrinsic motion compensation noise that is not separable from the generative modeling noise. Thus, we perform this study on pseudo sequences generated from the 2000 test images of CelebA. Examples of pseudo sequences are shown in Figure 8.
To formalize, given an uncorrupted image $x$, we create a sequence of $N$ differently (or identically) corrupted images $y_i = \mathcal{T}_i(x)$, $i = 1, \dots, N$, where $\mathcal{T}_i$ is a corruption operator on $x$. Following Equation 3, for each $y_i$ we converge at a $\hat{z}_i$ and obtain the inpainted image $\hat{x}_i$ from the generative model. To calculate consistency, we compute the PSNR between all possible pairs of inpainted frames in the sequence. The consistency for the pseudo sequence seeded from $x$ is then calculated from these pairwise PSNR values.
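A plausible reading of this measure, consistent with the mean consistency values reported later, is the average PSNR over all unordered pairs of inpainted frames, $\frac{2}{N(N-1)} \sum_{i<j} \mathrm{PSNR}(\hat{x}_i, \hat{x}_j)$. The sketch below implements that reading; the averaging, the peak value of 255 and the function names are assumptions, since the exact formula is not recoverable from this text.

```python
import math
from itertools import combinations


def psnr(a, b, peak=255.0):
    """PSNR in dB between two flat, equally sized images."""
    mse = sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


def consistency(frames, peak=255.0):
    """Mean PSNR over all unordered pairs of inpainted frames."""
    pairs = list(combinations(frames, 2))
    if not pairs:
        raise ValueError("need at least two frames")
    return sum(psnr(a, b, peak) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    print(consistency([[0, 0], [1, 1], [2, 2]]))
```

A higher value means the independently inpainted frames of one pseudo sequence agree more closely with each other, regardless of how close each frame is to the ground truth.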
In Table 2 we report the average consistency over the 2000 sequences at 64×64 and 128×128 resolution with the different deformation masks shown in Figure 8. The random center and Freehand masks depict the condition where each frame in the sequence is corrupted by a different deformation (the center mask varies from 50% to 70%; the Freehand mask corrupts a random 25% in 3 different freehand shapes). The constant left mask corrupts the left 50% of every frame in the pseudo sequence.
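The construction of a pseudo sequence can be sketched as follows. The mask convention (0 for corrupted pixels, 1 otherwise) follows the definition of $M$ above; interpreting the 50% to 70% range as a fraction of the side length of a centered square hole is our assumption, and freehand masks are left out because their shapes are not specified. All function names are hypothetical.

```python
import random


def left_mask(height, width):
    """Mask with 0 on the left half (corrupted) and 1 elsewhere."""
    return [[0 if x < width // 2 else 1 for x in range(width)]
            for _ in range(height)]


def random_center_mask(height, width, rng, low=0.5, high=0.7):
    """Centered rectangular hole whose sides are a random fraction of the frame."""
    frac = rng.uniform(low, high)
    hole_h, hole_w = round(height * frac), round(width * frac)
    top, left = (height - hole_h) // 2, (width - hole_w) // 2
    return [[0 if top <= y < top + hole_h and left <= x < left + hole_w else 1
             for x in range(width)] for y in range(height)]


def pseudo_sequence(image, masks):
    """Apply each mask to the same image to obtain one corrupted frame per mask."""
    return [[[p * m for p, m in zip(row, mask_row)]
             for row, mask_row in zip(image, mask)]
            for mask in masks]


if __name__ == "__main__":
    rng = random.Random(0)
    image = [[128] * 64 for _ in range(64)]
    masks = [random_center_mask(64, 64, rng) for _ in range(3)]
    frames = pseudo_sequence(image, masks)
    print(len(frames), sum(row.count(0) for row in masks[0]))
```

Each frame of the pseudo sequence shares the same underlying image, so any disagreement between the inpainted frames can be attributed to the generative model rather than to motion.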
From Table 2, we see that our method is more consistent in reconstruction under all the different deformations. At 64×64 resolution, for the random central crop, the mean consistency of our method is 4.62 dB higher than that of DIP. The corresponding margins are 3.94 dB and 0.64 dB for the random Freehand and left masks respectively. At 128×128 resolution, the average margins for our method are 1.84 dB, 3.2 dB and 1.05 dB respectively. The differences are statistically significant for each deformation setting at both 64×64 and 128×128 resolution. Thus our reconstructions are significantly more consistent than those of DIP. To complement the numerical findings, we also show example cases in Figure 8. DIP often fails to maintain the consistency of subtle yet important semantics such as facial expression (smile/neutral face), extent of eye opening and skin texture. Independently, each of the frames reconstructed by DIP might be acceptable, but when perceived as a sequence, the result lacks realism due to abrupt changes of facial semantics. Our model, however, faithfully retains not only the pose and expressions but also the skin texture of a subject. This observation strongly supports our hypothesis that a semantic guide is crucial for incorporating consistency in generative models.
6 Video inpainting on Youtube Faces
As a proof of concept of the viability of our model for video inpainting, we conducted preliminary experiments on the challenging YouTube Faces dataset with the GAN models pretrained on CelebA. Note that faces in CelebA have a higher resolution than those in the YouTube Faces dataset, which mainly captures celebrity videos in the wild. There is thus an intrinsic domain difference between the distribution on which we trained the GAN models and the distribution on which we perform inpainting. As a result, the visual quality of the inpainting results is not on par with the results on CelebA sequences. For this preliminary study, however, we were interested in examining the raw performance of the GAN models without any domain adaptation or retraining on YouTube videos. We randomly chose video sequences of 2 celebrities, viz. Elizabeth Hurley and Gary Bettman. We extracted the facial region from each frame, resized it to 64×64 and corrupted it randomly. The PSNR performance is reported in Table 3 and a short snippet of qualitative visualizations is shown in Figure 9. Even on real-life video sequences, our model outperforms DIP, both visually and quantitatively. To our knowledge, this is the first time a GAN based semantic inpainting framework has been applied to real-life videos, and our model shows a promising pathway in this regard.
7 Discussion and Conclusion
In this work we presented a simple yet effective framework for improving consistency and correctness in sequence inpainting by conditioning the original GAN formulation on a semantic map. Such conditioning also significantly improved the visual quality of generated samples at both 64×64 and 128×128 resolution. We showed that our model learns to disentangle appearance and pose information during sample generation, which helps preserve both pose and appearance during inpainting. We also showed initial success of our model on YouTube Faces videos.
An important lesson here is that generative models are not naively suitable for video/sequence applications due to the multimodal nature of their inference pipelines. The results in this paper thus advocate the use of semantic conditioning for improving GAN performance, especially for video applications. In future we wish to study the combination of advanced GAN variants [1, 2] with semantic conditioning. Another immediate extension would be to incorporate frameworks for predicting the semantic map on corrupted portions of frames.
- [1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In ICML, pages 214–223, 2017.
- [2] D. Berthelot, T. Schumm, and L. Metz. BEGAN: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717, 2017.
- [3] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. In CVPR, pages 3722–3731, 2017.
- [4] A. Brock, T. Lim, J. M. Ritchie, and N. Weston. Neural photo editing with introspective adversarial networks. arXiv preprint arXiv:1609.07093, 2016.
- [5] J. Caballero, C. Ledig, A. Aitken, A. Acosta, J. Totz, Z. Wang, and W. Shi. Real-time video super-resolution with spatio-temporal networks and motion compensation. In CVPR, 2016.
- [6] A. J. Champandard. Semantic style transfer and turning two-bit doodles into fine artworks. arXiv preprint arXiv:1603.01768, 2016.
- [7] Q. Chen and V. Koltun. Photographic image synthesis with cascaded refinement networks. arXiv preprint arXiv:1707.09405, 2017.
- [8] E. L. Denton, S. Chintala, R. Fergus, et al. Deep generative image models using a Laplacian pyramid of adversarial networks. In NIPS, pages 1486–1494, 2015.
- [9] J. Donahue, P. Krähenbühl, and T. Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
- [10] M. Ebdelli, O. Le Meur, and C. Guillemot. Video inpainting with short-term windows: application to object removal and error concealment. IEEE Transactions on Image Processing, 24(10):3034–3047, 2015.
- [11] J. Fišer, O. Jamriška, D. Simons, E. Shechtman, J. Lu, P. Asente, M. Lukáč, and D. Sýkora. Example-based synthesis of stylized facial animations. ACM Transactions on Graphics (TOG), 36(4):155, 2017.
- [12] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.
- [13] H. Huang, H. Wang, W. Luo, L. Ma, W. Jiang, X. Zhu, Z. Li, and W. Liu. Real-time neural style transfer for videos. In CVPR, pages 783–791, 2017.
- [14] S. Iizuka, E. Simo-Serra, and H. Ishikawa. Globally and locally consistent image completion. ACM Transactions on Graphics (TOG), 36(4):107, 2017.
- [15] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, pages 448–456, 2015.
- [16] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, pages 1125–1134, 2016.
- [17] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, pages 694–711. Springer, 2016.
- [18] T. Kaneko, K. Hiramatsu, and K. Kashino. Generative attribute controller with conditional filtered generative adversarial networks. In CVPR, pages 6089–6098, 2017.
- [19] V. Kazemi and J. Sullivan. One millisecond face alignment with an ensemble of regression trees. In CVPR, pages 1867–1874, 2014.
- [20] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [21] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
- [22] W.-Y. Kung, C.-S. Kim, and C.-C. Kuo. Spatial and temporal error concealment techniques for video transmission over noisy channels. IEEE Transactions on Circuits and Systems for Video Technology, 16(7):789–803, 2006.
- [23] A. Lahiri, K. Ayush, P. K. Biswas, and P. Mitra. Generative adversarial learning for reducing manual annotation in semantic segmentation on large scale microscopy images: Automated vessel segmentation in retinal fundus image as test case. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 42–48, 2017.
- [24] A. Lamb, V. Dumoulin, and A. Courville. Discriminative regularization for generative models. arXiv preprint arXiv:1602.03220, 2016.
- [25] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, pages 4681–4690, 2016.
- [26] C. Li and M. Wand. Precomputed real-time texture synthesis with Markovian generative adversarial networks. In ECCV, pages 702–716. Springer, 2016.
- [27] P. Luc, C. Couprie, S. Chintala, and J. Verbeek. Semantic segmentation using adversarial networks. In NIPS Workshop on Adversarial Learning, 2016.
- [28] A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In ICML, volume 30, 2013.
- [29] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. In ICLR, 2016.
- [30] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein. Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163, 2016.
- [31] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
- [32] N. Neverova, P. Luc, C. Couprie, J. Verbeek, and Y. LeCun. Predicting deeper into the future of semantic segmentation. arXiv preprint arXiv:1703.07684, 2017.
- [33] A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier GANs. In ICML, pages 2642–2651, 2017.
- [34] D. Pathak, P. Krähenbühl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In CVPR, pages 2536–2544, 2016.
- [35] P. Pérez, M. Gangnet, and A. Blake. Poisson image editing. In ACM Transactions on Graphics (TOG), volume 22, pages 313–318. ACM, 2003.
- [36] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
- [37] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text-to-image synthesis. In ICML, 2016.
- [38] S. E. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee. Learning what and where to draw. In NIPS, pages 217–225, 2016.
- [39] S. E. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee. Learning what and where to draw. In NIPS, pages 217–225, 2016.
- [40] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from simulated and unsupervised images through adversarial training. In CVPR, pages 2107–2116, 2017.
- [41] K. Sohn, H. Lee, and X. Yan. Learning structured output representation using deep conditional generative models. In NIPS, pages 3483–3491, 2015.
- [42] Y. Sun, X. Wang, and X. Tang. Deep convolutional network cascade for facial point detection. In CVPR, pages 3476–3483, 2013.
- [43] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In NIPS, pages 613–621, 2016.
- [44] J. Walker, C. Doersch, A. Gupta, and M. Hebert. An uncertain future: Forecasting from static images using variational autoencoders. In ECCV, pages 835–851. Springer, 2016.
- [45] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
- [46] L. Wolf, T. Hassner, and I. Maoz. Face recognition in unconstrained videos with matched background similarity. In CVPR, pages 529–534, 2011.
- [47] J. M. Wolterink, T. Leiner, M. A. Viergever, and I. Isgum. Generative adversarial networks for noise reduction in low-dose CT. IEEE Transactions on Medical Imaging, 2017.
- [48] X. Yan, J. Yang, K. Sohn, and H. Lee. Attribute2image: Conditional image generation from visual attributes. In ECCV, pages 776–791. Springer, 2016.
- [49] R. A. Yeh, C. Chen, T. Y. Lim, A. G. Schwing, M. Hasegawa-Johnson, and M. N. Do. Semantic image inpainting with deep generative models. In CVPR, pages 5485–5493, 2017.
- [50] H. Zhang, T. Xu, H. Li, S. Zhang, X. Huang, X. Wang, and D. Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In CVPR, pages 5077–5086, 2016.
- [51] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In ECCV, pages 597–613. Springer, 2016.
- [52] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.
8 Visualization: Independence of appearance and pose
As mentioned in Section 5.1 of the main paper, independence of appearance and pose refers to the fact that our generator learns to disentangle appearance and pose cues when generating images. In Figure 10, we show example cases in which different faces are generated for a given semantic map. In Figure 11, we show example cases in which similar looking faces are generated from the same $z$ vector but conditioned on different facial maps.
9 Visualization: Inpainting performance
10 Visualization: Consistency of inpainting
In Figures 16 and 17 we visually compare the consistency of inpainting on pseudo sequences at 64×64 resolution. Figures 18 and 19 show the visualizations for 128×128 resolution. Given an uncorrupted image, a pseudo sequence is created by deforming the original image with different (or the same) masks. Refer to Section 5.4 of the main paper for more details.
11 Visualization: Quality of generated samples from GAN
In Figures 20 and 21 we show some samples generated by our semantically conditioned GAN at 64×64 and 128×128 resolution respectively. In Figures 22 and 23 the corresponding samples from the DCGAN architecture used by DIP are shown. Qualitatively, the samples from our GAN model are better. Refer to Section 5.2 of the main paper for a detailed analysis.