Facial landmarks can be regarded as the most compressed representation of a face, since only a small number of points is required to capture the landmark locations. In spite of this incredibly low number of keypoints, landmarks are known to preserve important information about the face such as pose, gender and structure [34, 28, 33]. Success of facial analysis tasks using just landmark keypoints is appealing from the perspective of memory management and information privacy. Since the landmark representation is an order of magnitude smaller than the image, storing only landmarks results in significant memory savings: for a particular application, we can store only the landmark keypoints and discard the face image. In addition, landmark information can be safely stored, transported, and distributed without potential violation of human privacy and confidentiality. Motivated by these reasons, it is interesting to understand how landmarks can be exploited for performing high-level facial analysis tasks in the absence of the corresponding face images.
Several researchers have demonstrated that facial landmarks can be used in many face analysis tasks such as face recognition [5, 19, 20], facial attribute inference, age estimation, gender recognition and expression analysis. However, these methods operate on a small set of keypoints, due to which their performance is severely limited. To overcome this problem, we propose a novel solution that involves synthesis of faces from landmark points using recently popular generative models [7, 43, 2, 42, 38, 32, 37]. While several methods [41, 31, 14, 29]
have been proposed in the literature for landmark detection, the inverse problem of synthesizing faces from their corresponding landmarks remains largely unexplored. We believe that using synthesized faces will result in better recognition performance, as they leverage the capability of generative models to accentuate the information present in landmarks. Apart from their use in high-level facial analysis tasks, these generative methods can create virtually unlimited stochastic samples by conditioning on both landmarks and a stochastic noise vector, enabling us to augment existing datasets for large-scale learning.
In this work, generative models are exploited to synthesize faces from landmarks in an attempt to accentuate the information (gender in particular) present in the landmarks. Cao et al. specifically address the question of whether facial metrology can be used to predict gender, and demonstrate that gender recognition using landmarks alone achieves reasonable performance. This is remarkable considering that only 68 keypoints are used to predict the gender of the face they represent. However, generating faces from landmarks enables further improvement in performance, as this process leverages generative models to learn the distribution of landmarks and their mappings to the respective faces. While recognition of other attributes like ethnicity, pose, identity, etc. could also be improved, in this work we specifically focus on the gender attribute. To this end, we propose the Gender Preserving Generative Adversarial Network (GP-GAN) to generate faces from their respective landmarks (as shown in Fig. 1). To further enhance the network's performance, it is guided by a perceptual loss and a gender preserving loss in addition to the adversarial loss. To summarize, the key contributions of this work are:
To the best of our knowledge, this is the first attempt to generate faces from landmark keypoints while preserving gender information.
Detailed experiments are conducted to demonstrate the improvements in gender recognition obtained from synthesized images using the proposed method.
II Related Work
In contrast to landmark detection methods [24, 29, 14, 1], we focus on the inverse problem of synthesizing or generating faces from landmark keypoints, which is relatively unexplored. To this end, recently popular generative models are explored in this work. Among these methods, we specifically study Generative Adversarial Networks (GANs) [7, 43, 2, 42] and Variational Auto-encoders (VAEs) [22, 13].
VAEs are powerful generative models that use deep networks to describe the distribution of observed and latent variables. A VAE consists of two networks: one encoding a data sample to a latent representation, and the other decoding the latent representation back to data space. The VAE regularizes the encoder by imposing a prior over the latent distribution. Conditional VAE (CVAE) is an extension of VAE that models latent variables and data, both conditioned on side information such as a part or label of the image. GANs are another class of generative models that are used to synthesize realistic images by effectively learning the distribution of training images. Recently, several variants based on this game-theoretic approach have been proposed for image-to-image translation tasks. Isola et al. proposed Conditional GANs for several tasks such as labels to street scenes, labels to facades, image colorization, etc. In another variant, Zhu et al. proposed CycleGAN, which learns image-to-image translation in an unsupervised fashion. Berthelot et al. proposed a new, relatively more stable method for training auto-encoder based GANs, paired with a loss inspired by the Wasserstein distance. Other applications of GANs include image de-hazing, crowd counting, and image de-raining.
III Proposed Method
Given an application where only facial landmarks are available, we explore how to leverage the information preserved in these keypoints. To this end, we propose to model the joint distribution of facial landmarks and corresponding face images (the face images are available only during training) using generative modeling. Inspired by the success of GANs, we explore adversarial networks in this work for synthesizing faces from landmark keypoints. GANs, motivated by game theory, consist of two competing networks: a generator $G$ and a discriminator $D$. The goal is to train $G$ to produce samples from the training distribution such that the synthesized samples are indistinguishable from the actual distribution by $D$. Conditional GAN is a variant in which the generator is conditioned on additional variables such as discrete labels, text and images. The objective function of a conditional GAN is defined as

$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_{x,z}[\log(1 - D(x, G(x, z)))],$

where $y$, the ground-truth output image, and $x$, the observed image, are sampled from the training distribution and are distinguished as real by the discriminator $D$, while $G$ would like to fool $D$ with generated samples $G(x, z)$ drawn from its learned distribution.
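As a toy numerical illustration (not part of the paper's pipeline), the conditional-GAN objective can be estimated from a batch of discriminator outputs. The sketch below assumes the discriminator outputs probabilities in (0, 1):

```python
import numpy as np

def cgan_objective(d_real, d_fake):
    """Monte-Carlo estimate of the conditional-GAN objective
    E[log D(x,y)] + E[log(1 - D(x,G(x,z)))], given the
    discriminator's probability outputs on batches of real and
    generated pairs. A toy illustration, not the training loop."""
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# a confident discriminator yields a higher objective value than
# one that is maximally confused (D = 0.5 everywhere)
print(cgan_objective([0.9, 0.95], [0.05, 0.1]))
print(cgan_objective([0.5, 0.5], [0.5, 0.5]))
```

During training, $D$ ascends this objective while $G$ descends it, which is the minimax game the paper builds on.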
As shown in Fig. 2, the proposed network consists of a generator sub-network $G$ (based on the U-Net and DenseNet architectures) conditioned on a facial landmark image, and a patch-based discriminator sub-network $D$. $G$ takes a landmark image as input and attempts to generate the corresponding face image, while $D$ attempts to distinguish between real and synthesized images. The two sub-networks are trained iteratively. In addition to the adversarial loss, we propose to guide the generator using three other loss functions: a perceptual loss based on the VGG-16 architecture, a gender preserving loss, and a reconstruction error.
Deeper networks are known to better capture high-level concepts; however, the vanishing gradient problem affects the convergence rate as well as the quality of convergence. Several works have been developed to overcome this issue, among which U-Net and DenseNet are of particular interest. While U-Net incorporates longer skip connections to preserve low-level features, DenseNet employs short-range connections within micro-blocks, resulting in maximum information flow between layers in addition to an efficient network. Motivated by these two methods, we propose UDeNet for the generator sub-network, in which the U-Net architecture is seamlessly integrated into the DenseNet network in order to leverage the advantages of both methods. This novel combination enables more efficient learning and improved convergence quality.
A set of 3 dense blocks (along with transition blocks) is stacked in the front, followed by a set of 5 dense blocks (with transition blocks). The initial dense blocks are composed of 6 bottleneck layers each. For efficient training and better convergence, symmetric skip connections are incorporated into the generator sub-network, similar to the encoder-decoder architecture of Mao et al. The number of channels for each layer is as follows: C(64)-M(64)-D(256)-T(128)-D(512)-T(256)-D(1024)-T(512)-D(1024)-DT(256)-D(512)-DT(128)-D(256)-DT(64)-D(64)-D(32)-D(32)-DT(16)-C(3), where C(K) denotes a convolutional layer with K output channels, M(K) a max-pooling layer, D(K) a dense block with K output channels, and T(K) a transition layer with K output channels for downsampling. DT(K) is similar to T(K), except that a transposed convolutional layer replaces the convolutional layer for upsampling.
Motivated by the image-to-image translation framework of Isola et al., a patch-based discriminator $D$ is used and trained iteratively along with $G$. The primary goal of $D$ is to learn to discriminate between real and synthesized samples. This information is backpropagated into $G$ so that it generates samples that are as realistic as possible. Additionally, the patch-based discriminator ensures the preservation of high-frequency details, which are usually lost when only the L1 loss is used. All convolutional layers in $D$ use the same filter size; the number of channels for each convolutional layer is specified in Fig. 2.
III-C Objective function
The network parameters are learned by minimizing the following objective function:

$\mathcal{L} = \mathcal{L}_A + \lambda_P \mathcal{L}_P + \lambda_G \mathcal{L}_G + \lambda_{L1} \mathcal{L}_{L1},$

where $\mathcal{L}_A$ is the adversarial loss, $\mathcal{L}_P$ is the perceptual loss, $\mathcal{L}_G$ is the gender preserving loss, and $\mathcal{L}_{L1}$ is the loss based on the $L_1$-norm between the target and reconstructed image; $\lambda_P$, $\lambda_G$ and $\lambda_{L1}$ are the weights for the perceptual loss, gender preserving loss and $L_1$ loss, respectively.
Adversarial loss: The adversarial loss is based primarily on the discriminator sub-network $D$. Given a synthesized face $\hat{y} = G(x, z)$, the entropy loss from $D$ that is used to learn the parameters of $G$ is defined as:

$\mathcal{L}_A = -\mathbb{E}_{x,z}[\log D(x, G(x, z))].$
Perceptual loss: Johnson et al. showed that, instead of using only a per-pixel reconstruction error, network parameters can be learned using errors between high-level image feature representations extracted from a pre-trained convolutional neural network. Similar to their work, a pre-trained VGG-16 network is used to extract high-level features (from the conv4_3 layer), and the distance between these features of real and fake images is used to guide the generator $G$. The perceptual loss function is defined as:

$\mathcal{L}_P = \frac{1}{C_j H_j W_j} \left\| V_j(y) - V_j(\hat{y}) \right\|_2^2,$

where $y$ and $\hat{y}$ indicate the real and fake images, respectively, $V_j$ denotes the feature map extracted from layer $j$ of the VGG-16 network, and $C_j \times H_j \times W_j$ is the size of that feature map.
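The feature-space distance itself is straightforward; the sketch below computes it in numpy, assuming the VGG-16 conv4_3 features have already been extracted by an external network:

```python
import numpy as np

def perceptual_loss(feat_real, feat_fake):
    """Squared L2 distance between feature maps of the real and
    synthesized images, normalized by the feature volume C*H*W
    (the Johnson et al. formulation; the VGG-16 conv4_3 feature
    extractor itself is assumed to be external)."""
    assert feat_real.shape == feat_fake.shape
    c, h, w = feat_real.shape
    return np.sum((feat_real - feat_fake) ** 2) / (c * h * w)

# toy example with random "features"
rng = np.random.default_rng(0)
fr = rng.standard_normal((8, 4, 4))
print(perceptual_loss(fr, fr))  # identical features -> 0.0
```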
Gender preserving loss: Inspired largely by the perceptual loss, we define a gender preserving loss. As the name indicates, this function measures the error between the gender attribute of the synthesized image and that of the real image. It is defined as:

$\mathcal{L}_G = \left\| C(y) - C(\hat{y}) \right\|_2^2,$

where $C$ represents a pre-trained gender classification network. In this work, $C$ is constructed using the standard VGG-16 network in which the convolutional layers are retained and the fully connected layers are replaced by a new set of layers, as shown in Fig. 2. This network is trained by minimizing the standard binary cross-entropy error.
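One plausible reading of this loss, mirroring the perceptual loss but in "gender space", is a squared distance between the classifier's outputs on the real and synthesized images; the sketch below assumes $C$ emits a scalar gender probability and may differ from the paper's exact formulation:

```python
import numpy as np

def gender_preserving_loss(c_real, c_fake):
    """Squared distance between the outputs of the pre-trained
    gender classifier C on the real and synthesized images
    (an illustrative sketch; C itself is assumed external)."""
    c_real, c_fake = np.asarray(c_real), np.asarray(c_fake)
    return float(np.sum((c_real - c_fake) ** 2))

print(gender_preserving_loss(0.95, 0.90))  # small: genders agree
print(gender_preserving_loss(0.95, 0.10))  # large: gender flipped
```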
L1 loss: The L1 loss measures the reconstruction error between the synthesized face image and the corresponding real image, and is defined as

$\mathcal{L}_{L1} = \mathbb{E}\left[ \left\| y - \hat{y} \right\|_1 \right].$
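With the individual terms computed, the overall objective of Sec. III-C reduces to a weighted sum. In the sketch below the weight values are placeholders (the paper's actual values did not survive extraction):

```python
import numpy as np

def l1_loss(y_real, y_fake):
    """Mean absolute reconstruction error between real and
    synthesized images."""
    return np.mean(np.abs(y_real - y_fake))

def total_generator_loss(l_adv, l_perc, l_gender, l_l1,
                         lam_p=1.0, lam_g=1.0, lam_l1=1.0):
    """Weighted sum L = L_A + lam_p*L_P + lam_g*L_G + lam_l1*L_L1.
    The lambda weights here are placeholders, not the paper's
    tuned values."""
    return l_adv + lam_p * l_perc + lam_g * l_gender + lam_l1 * l_l1

# toy usage with scalar loss values
print(total_generator_loss(0.7, 0.2, 0.05, 0.1))
```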
Table I: Gender recognition rates (mean ± standard deviation, %).

| | LM (D) | LM (A) | CycleGAN | CVAE | BEGAN | CGAN | GP-GAN (UNet+GP-Loss) | GP-GAN (UDeNet, no GP-Loss) | GP-GAN (UDeNet+GP-Loss) |
|---|---|---|---|---|---|---|---|---|---|
| LFW | 78.0 ± 1.9 | 79.8 ± 2.4 | 81.8 ± 1.1 | 80.3 ± 2.0 | 84.4 ± 1.9 | 86.3 ± 2.5 | 91.1 ± 1.1 | 91.7 ± 1.6 | 93.1 ± 1.2 |
| CASIA | 61.0 ± 11.8 | 61.7 ± 13.6 | 64.8 ± 3.3 | 62.0 ± 4.1 | 67.8 ± 5.0 | 70.4 ± 5.5 | 73.2 ± 3.9 | 76.7 ± 4.3 | 78.4 ± 4.1 |
IV Experiments and Evaluations
In this section, the experimental settings and evaluation of the proposed method are discussed in detail. We present qualitative and quantitative results of the synthesis experiment. The quantitative performance is measured using gender recognition rates. Results are compared with four state-of-the-art generative models: Conditional GAN, CycleGAN, CVAE, and an adapted BEGAN (https://github.com/taey16/pix2pixBEGAN.pytorch), in addition to two baseline methods: (a) GP-GAN using a U-Net generator with GP-Loss, and (b) GP-GAN using a UDeNet generator without GP-Loss. The baseline comparisons are performed to demonstrate the improvements achieved by the gender preserving loss and UDeNet components. We also demonstrate that synthesis using GP-GAN accentuates the gender information present in landmarks, by comparing gender recognition rates with methods that compute these rates directly from landmark points. Furthermore, we conduct an experiment to evaluate the data augmentation capabilities of the synthesis method.
IV-A Preprocessing and training details
Prior to performing these experiments, all images in both datasets are fed through a pre-processing pipeline. First, MTCNN is employed to detect face bounding boxes, which are used to crop the faces, followed by landmark keypoint detection using the TCDCN algorithm. Pairs of these detected landmarks and faces are used for training the proposed method. Since we treat this problem as image-to-image translation, the input landmarks are encoded as a heatmap (as shown in Fig. 1), created by imposing a 2D Gaussian with a standard deviation of 0.2 at every landmark location on a blank image, similar to the density maps used in crowd counting. Note that the cropped face images are resized to 64×64.
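The heatmap encoding just described can be sketched as follows; the pixel-space sigma below is an illustrative assumption, since the paper's standard deviation of 0.2 is presumably given in normalized coordinates:

```python
import numpy as np

def landmarks_to_heatmap(landmarks, size=64, sigma=1.5):
    """Encode landmark keypoints as a single-channel heatmap by
    summing a 2D Gaussian centred at each (x, y) location on a
    blank size x size image (sigma in pixels is an illustrative
    choice, not the paper's exact value)."""
    ys, xs = np.mgrid[0:size, 0:size]
    heat = np.zeros((size, size))
    for x, y in landmarks:
        heat += np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return heat

hm = landmarks_to_heatmap([(32, 32), (10, 50)])
print(hm.shape)  # (64, 64)
```

In practice one such heatmap (or one channel per landmark) is paired with the cropped 64×64 face to form a training sample.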
The proposed network is trained on a single TitanX GPU for approximately 10 hours (200 epochs). The same learning rate is used for $G$ and $D$. For the perceptual network, the input images are resized to the VGG-16 input size of 224×224. The learning rate is decayed every epoch after 100 epochs. The weights $\lambda_P$, $\lambda_G$ and $\lambda_{L1}$ are fixed throughout training.
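The schedule described above (constant rate, then per-epoch decay after epoch 100) can be written as a small function; the base rate and decay factor below are placeholders, since the actual values were lost from the text:

```python
def learning_rate(epoch, base_lr, decay=0.9, decay_start=100):
    """Learning-rate schedule: constant for the first decay_start
    epochs, then multiplied by `decay` once per epoch thereafter.
    base_lr and decay are placeholder values, not the paper's."""
    if epoch < decay_start:
        return base_lr
    return base_lr * decay ** (epoch - decay_start + 1)

# rate is unchanged before epoch 100, then shrinks geometrically
print(learning_rate(50, 2e-4))
print(learning_rate(150, 2e-4))
```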
For learning the parameters of the proposed method and the baselines, the training set from the LFW deep-funneled aligned dataset is used. It contains 5749 identities and 13233 images. The official training, validation and testing split (View 1) was used for this experiment. After the detection and cropping procedure, we are left with 3757 training images and 1615 test images. The trained network is evaluated on the LFW test set and a subset of the CASIA-WebFace dataset. The CASIA-WebFace test subset is constructed by randomly selecting 1000 male and 1000 female face images. Note that, in order to demonstrate generalization performance, the proposed network is trained using only the LFW training set and evaluated on both the LFW test set and the CASIA-WebFace subset.
Fig. 3 and Fig. 4 show sample reconstruction results using the various methods on the LFW and CASIA datasets, respectively. The landmark image is used as the input for all methods except CVAE; for CVAE, the inputs are the original image and the normalized landmark locations as attributes. It can be clearly observed that Conditional GAN, CycleGAN and BEGAN are unable to reconstruct visually coherent faces. Though CVAE is able to generate visually appropriate faces, it fails to preserve the gender information. Since its network implements an auto-encoder-like architecture and uses a pixel-wise Euclidean measure, the output is often blurry, which makes gender classification very difficult. GP-GAN using the UDeNet generator without GP-Loss generates perceptually better results than GP-GAN using the U-Net generator with GP-Loss, demonstrating the superior performance obtained by the novel combination of the U-Net and DenseNet architectures. The proposed method, GP-GAN (UDeNet and GP-Loss), outperforms all existing and baseline methods. It may be argued that identity information is lost during reconstruction; however, the goal of the proposed method is not to capture the exact mapping between landmarks and the corresponding faces. Instead, the idea is to explore the generation of visually coherent faces from landmark keypoints, which can further assist in data augmentation and other tasks.
As discussed earlier, the quantitative performance is measured in terms of gender recognition rates, shown in Table I. Gender recognition rates for the synthesized images are calculated using LBP features and a linear SVM classifier trained on the LFW training set, whereas the recognition rates for landmarks, LM(D) and LM(A), are calculated using the distance and angle methods described by Cao et al. Note that gender recognition is performed based only on landmark keypoints, since the corresponding face images are assumed unavailable; hence recent state-of-the-art gender recognition methods cannot be used for comparison, as they operate on actual face images rather than facial landmarks alone. Consistent with the visual comparisons, the quantitative results show that gender recognition rates generally improve using the generative models as compared to the landmark-based methods.
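For reference, a minimal LBP computation can be sketched as below, assuming the basic 8-neighbour formulation of Ojala et al. (the paper's exact multiresolution variant may differ):

```python
import numpy as np

def lbp_image(img):
    """Basic 8-neighbour local binary pattern: each interior pixel
    gets a code whose bits record which neighbours are >= the
    centre pixel. A simplified sketch of the LBP descriptor."""
    h, w = img.shape
    c = img[1:-1, 1:-1]
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
              (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros(c.shape, dtype=int)
    for bit, (dy, dx) in enumerate(shifts):
        nb = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        code |= (nb >= c).astype(int) << bit
    return code

img = np.array([[0, 0, 0], [0, 5, 0], [0, 0, 0]], dtype=float)
print(lbp_image(img))  # [[0]] -- no neighbour reaches the centre
```

A 256-bin histogram of these codes would then serve as the feature vector fed to the linear SVM.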
With respect to the baseline comparisons, GP-GAN using the UDeNet generator without GP-Loss outperforms GP-GAN using the U-Net generator with GP-Loss, in spite of the fact that GP-Loss is not used, indicating the effectiveness of the UDeNet architecture. Furthermore, the proposed method, GP-GAN (UDeNet with GP-Loss), outperforms all existing and baseline methods by a large margin in terms of gender recognition rates. This indicates that the proposed synthesis method can be used to generate face images from facial landmarks alone while retaining the gender information present in these landmarks.
In addition, we conducted a face synthesis experiment to verify whether the proposed method can be used for data augmentation. In this experiment, we manipulate the landmarks of a face (for instance, changing an open mouth to a closed one) and use these landmarks to synthesize a face using the generator $G$. Sample results are shown in Fig. 5. It can be seen that the generator is able to synthesize realistic faces from the modified landmarks while reflecting the modification in the synthesized face. Additionally, the gender attribute is retained. Based on these experiments, we conclude that the proposed method is able to successfully generate face samples that can be used for data augmentation in other facial analysis tasks.
V Conclusion
We explored the problem of synthesizing faces from landmark points using recently introduced generative models. The aim of this work was to demonstrate that information (especially gender) present in landmark keypoints can be accentuated using synthesis models while generating realistic images. The proposed network is based on generative adversarial networks and is guided by a perceptual loss and a novel gender preserving loss. Further, we propose a novel generator based on the U-Net and DenseNet architectures. Evaluations are performed on two popular datasets, LFW and CASIA-WebFace, and the results are compared with recent state-of-the-art generative methods. The proposed method achieves significant improvements in terms of visual quality and gender recognition. Additionally, we conducted a face synthesis experiment to demonstrate that the proposed generative method can be used as a data augmentation technique.
This research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA R&D Contract No. 2014-14071600012. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. We would like to thank He Zhang for his insightful discussion on this topic.
-  A. Asthana, S. Zafeiriou, S. Cheng, and M. Pantic. Robust discriminative response map fitting with constrained local models. In , pages 3444–3451, 2013.
-  D. Berthelot, T. Schumm, and L. Metz. Began: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717, 2017.
-  K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. arXiv preprint arXiv:1612.05424, 2016.
-  D. Cao, C. Chen, M. Piccirilli, D. Adjeroh, T. Bourlai, and A. Ross. Can facial metrology predict gender? In Biometrics (IJCB), 2011 International Joint Conference on, pages 1–8. IEEE, 2011.
-  J. C. Chen, V. M. Patel, and R. Chellappa. Unconstrained face verification using deep cnn features. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1–9, March 2016.
-  X. Di and V. M. Patel. Face synthesis from visual attributes via sketch using conditional vaes and gans. arXiv preprint arXiv:1801.00077, 2017.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.
-  G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
-  G. B. Huang, M. Mattar, H. Lee, and E. Learned-Miller. Learning to align from scratch. In NIPS, 2012.
-  G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, 2007.
-  P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004, 2016.
-  J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.
-  D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
-  A. Kumar, R. Ranjan, V. Patel, and R. Chellappa. Face alignment by local deep descriptor regression. arXiv preprint arXiv:1601.07950, 2016.
-  L. Ma, X. Jia, Q. Sun, B. Schiele, T. Tuytelaars, and L. Van Gool. Pose guided person image generation. In NIPS, 2017.
-  X.-J. Mao, C. Shen, and Y.-B. Yang. Image denoising using very deep fully convolutional encoder-decoder networks with symmetric skip connections. arXiv preprint, 2016.
-  M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
-  T. Ojala, M. Pietikainen, and T. Maenpaa. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on pattern analysis and machine intelligence, 24(7):971–987, 2002.
-  R. Ranjan, V. M. Patel, and R. Chellappa. Hyperface: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1, 2017.
-  R. Ranjan, S. Sankaranarayanan, A. Bansal, N. Bodla, J. C. Chen, V. M. Patel, C. D. Castillo, and R. Chellappa. Deep learning for understanding faces: Machines may be just as good, or better, than humans. IEEE Signal Processing Magazine, 35(1):66–83, Jan 2018.
-  S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text-to-image synthesis. In Proceedings of The 33rd International Conference on Machine Learning, 2016.
-  D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
-  O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
-  J. M. Saragih, S. Lucey, and J. F. Cohn. Deformable model fitting by regularized landmark mean-shift. International Journal of Computer Vision, 91(2):200–215, 2011.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
-  V. A. Sindagi and V. M. Patel. Generating high-quality crowd density maps using contextual pyramid cnns. In IEEE International Conference on Computer Vision, 2017.
-  K. Sohn, H. Lee, and X. Yan. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491, 2015.
-  Q. Sun, L. Ma, S. J. Oh, L. V. Gool, B. Schiele, and M. Fritz. Natural and effective obfuscation by head inpainting. In CVPR, 2018.
-  Y. Sun, X. Wang, and X. Tang. Deep convolutional network cascade for facial point detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3476–3483, 2013.
-  S. Taheri, P. Turaga, and R. Chellappa. Towards view-invariant expression analysis using analytic shape manifolds. In Automatic Face & Gesture Recognition and Workshops (FG 2011), 2011 IEEE International Conference on, pages 306–313. IEEE, 2011.
-  M. Valstar, B. Martinez, X. Binefa, and M. Pantic. Facial point detection using boosted regression and graph models. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2729–2736. IEEE, 2010.
-  L. Wang, V. A. Sindagi, and V. M. Patel. High-quality facial photo-sketch synthesis using multi-adversarial networks. CoRR, abs/1710.10182, 2017.
-  W. Wang, X. Alameda-Pineda, D. Xu, E. Ricci, and N. Sebe. Every smile is unique: Landmark-guided diverse smile generation. In CVPR, 2018.
-  T. Wu, P. Turaga, and R. Chellappa. Age estimation and face verification across aging using landmarks. IEEE Transactions on Information Forensics and Security, 7(6):1780–1788, 2012.
-  X. Yan, J. Yang, K. Sohn, and H. Lee. Attribute2image: Conditional image generation from visual attributes. In European Conference on Computer Vision, pages 776–791. Springer, 2016.
-  D. Yi, Z. Lei, S. Liao, and S. Z. Li. Learning face representation from scratch. arXiv preprint arXiv:1411.7923, 2014.
-  H. Zhang and V. M. Patel. Densely connected pyramid dehazing network. arXiv preprint arXiv:1803.08396, 2018.
-  H. Zhang, V. Sindagi, and V. M. Patel. Image de-raining using a conditional generative adversarial network. arXiv preprint arXiv:1701.05957, 2017.
-  H. Zhang, V. Sindagi, and V. M. Patel. Joint transmission map estimation and dehazing using deep networks. arXiv preprint arXiv:1708.00581, 2017.
-  K. Zhang, Z. Zhang, Z. Li, and Y. Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499–1503, 2016.
-  Z. Zhang, P. Luo, C. C. Loy, and X. Tang. Facial landmark detection by deep multi-task learning. In ECCV, pages 94–108. Springer, 2014.
-  J. Zhao, M. Mathieu, and Y. LeCun. Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126, 2016.
-  J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.