GP-GAN: Gender Preserving GAN for Synthesizing Faces from Landmarks

10/03/2017 · by Xing Di, et al. · Rutgers University

Facial landmarks constitute the most compressed representation of faces and are known to preserve information such as pose, gender and facial structure present in the faces. Several works exist that attempt to perform high-level face-related analysis tasks based on landmarks. In contrast, in this work, an attempt is made to tackle the inverse problem of synthesizing faces from their respective landmarks. The primary aim of this work is to demonstrate that information preserved by landmarks (gender in particular) can be further accentuated by leveraging generative models to synthesize corresponding faces. Though the problem is particularly challenging due to its ill-posed nature, we believe that successful synthesis will enable several applications such as boosting the performance of high-level face-related tasks using landmark points and performing dataset augmentation. To this end, a novel face-synthesis method known as Gender Preserving Generative Adversarial Network (GP-GAN) that is guided by an adversarial loss, a perceptual loss and a gender preserving loss is presented. Further, we propose a novel generator sub-network, UDeNet, for GP-GAN that leverages the advantages of the U-Net and DenseNet architectures. Extensive experiments and comparisons with recent methods are performed to verify the effectiveness of the proposed method.


I Introduction

Facial landmarks can be regarded as the most compressed representation of a face, since very few points are required to capture the landmark locations. In spite of the incredibly low number of keypoints, they are known to preserve important information about the face such as pose, gender [4] and structure [34, 28, 33]. The ability to perform facial analysis tasks using just landmark keypoints is valuable from the perspective of memory management and information privacy. Since the landmark representation is orders of magnitude smaller than the image itself, storing only landmarks results in significant memory savings: one can keep only the landmark keypoints and discard the face image for a particular application. In addition, landmark information can be safely stored, transported, and distributed without potential violation of human privacy and confidentiality. Motivated by these reasons, it would be interesting to understand how landmarks can be exploited for performing high-level facial analysis tasks in the absence of corresponding face images.

Several researchers have demonstrated that facial landmarks can be used in many face analysis tasks such as face recognition [5, 19, 20], facial attribute inference [41], age estimation [34], gender recognition [4] and expression analysis [30]. However, these methods operate on a small set of keypoints, due to which their performance is severely limited. To overcome this problem, we propose a novel solution that involves synthesis of faces from landmark points using the recently popular generative models [7, 43, 2, 42, 38, 32, 37]. While several methods [41, 31, 14, 29] have been proposed in the literature for landmark detection, the inverse problem of synthesizing faces from their corresponding landmarks is largely unexplored. We believe that using synthesized faces will result in better recognition performance, as they leverage the capabilities of generative models to accentuate information present in landmarks. Apart from their use in high-level facial analysis tasks, these generative methods can be used to create virtually unlimited stochastic samples by conditioning on both landmarks and a stochastic noise vector, enabling us to augment existing datasets for large-scale learning [3].

Fig. 1: Overview of the proposed GP-GAN method for synthesizing faces from landmarks. In addition to the adversarial loss function, the generator sub-network is guided by a perceptual loss and a gender preserving loss.

In this work, generative models are exploited to synthesize faces from landmarks in an attempt to accentuate information (gender in particular) present in the landmarks. Cao et al. [4] specifically address the question of whether facial metrology can be used to predict gender, and demonstrate that gender recognition using landmarks achieves reasonable performance. This is remarkable considering that only 68 keypoints are used to predict the gender of the face represented by these keypoints. However, generating faces from landmarks enables further improvement in performance, as this process leverages generative models to learn the distribution of landmarks and their mappings to the respective faces. While recognition of other attributes like ethnicity, pose, identity, etc. could also be improved, in this work we specifically focus on the gender attribute. To this end, we propose the Gender Preserving Generative Adversarial Network (GP-GAN) to generate faces from their respective landmarks (as shown in Fig. 1). To further enhance the network's performance, it is guided by a perceptual loss and a gender preserving loss in addition to the adversarial loss. To summarize, the key contributions of this work are as follows:

  • To the best of our knowledge, this is the first attempt to generate faces from landmark keypoints while preserving gender information.

  • A GAN-based framework guided by perceptual and gender preserving losses is proposed. The generator is constructed using a novel combination of UNet [23] and DenseNet [8], which we call UDeNet.

  • Detailed experiments are conducted to demonstrate the improvements in gender recognition obtained from synthesized images using the proposed method.

Fig. 2: Architecture of the proposed GP-GAN framework. Left: Generator (G) synthesizes the face image from landmarks and is based on the UNet and DenseNet architectures. D is a patch-based discriminator that is trained to distinguish between real/fake face images and is responsible for providing adversarial feedback to G. G is also guided by a perceptual loss (based on the VGG-16 architecture) and a gender-preserving loss. Right: Dense block used in generator G.

II Related Work

In contrast to landmark detection methods [24, 29, 14, 1], we focus on the inverse problem of synthesizing or generating faces from landmark keypoints, which is a relatively unexplored problem. To this end, recently popular generative models are explored in this work. Among these methods, we specifically study Generative Adversarial Networks (GANs) [7, 43, 2, 42] and Variational Auto-encoders (VAEs) [22, 13].

VAEs are powerful generative models that use deep networks to describe the distribution of observed and latent variables. A VAE consists of two networks, with one network encoding a data sample to a latent representation and the other decoding the latent representation back to data space. The VAE regularizes the encoder by imposing a prior over the latent distribution. Conditional VAE (CVAE) [27, 35, 6] is an extension of VAE that models latent variables and data, both conditioned on side information such as a part or label of the image.

GANs [7] are another class of generative models that are used to synthesize realistic images by effectively learning the distribution of training images. Recently, several variants based on this game-theoretic approach have been proposed for image-to-image translation tasks. Isola et al. [11] proposed Conditional GANs [17] for several tasks such as labels to street scenes, labels to facades, and image colorization. In another variant, Zhu et al. [43] proposed CycleGAN, which learns image-to-image translation in an unsupervised fashion. Berthelot et al. [2] proposed a new method for training auto-encoder based GANs that is relatively more stable; their method is paired with a loss inspired by the Wasserstein distance. Other applications of GANs include image de-hazing [39], crowd counting [26], and image de-raining [38].

III Proposed Method

Given an application where only facial landmarks are available, we explore how to leverage the information preserved in these keypoints. To this end, we propose to model the joint distribution of facial landmarks and corresponding face images (face images are available only during training) using generative modeling. Inspired by the success of GANs [7], we explore adversarial networks in this work for synthesizing faces from landmark keypoints. GANs, motivated by game theory, consist of two competing networks: a generator G and a discriminator D. The goal of a GAN is to train G to produce samples from the training distribution such that the synthesized samples are indistinguishable from the actual distribution by the discriminator D. Conditional GAN is another variant where the generator is conditioned on additional variables such as discrete labels [17], text [21] and images [11]. The objective function of a conditional GAN is defined as follows:

$$\min_G \max_D \; \mathbb{E}_{x,y}\!\left[\log D(x, y)\right] + \mathbb{E}_{x,z}\!\left[\log\left(1 - D(x, G(x, z))\right)\right] \quad (1)$$

where $y$, the output image, and $x$, the observed image, are sampled from the data distribution and are distinguished as real by the discriminator $D$, while the generated fake $G(x, z)$, sampled from the generator distribution, attempts to fool $D$.
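For concreteness, the following is a minimal PyTorch sketch of the alternating updates implied by Eq. (1). The generator G and discriminator D modules, and the convention that D takes the landmark heatmap and a face image as a conditioned pair, are illustrative assumptions rather than the paper's exact code.

```python
import torch
import torch.nn.functional as F

def discriminator_step(D, G, landmark, real_face):
    """One discriminator update: push real pairs toward 1, fakes toward 0."""
    fake_face = G(landmark).detach()  # block gradients from flowing into G
    logits_real = D(landmark, real_face)
    logits_fake = D(landmark, fake_face)
    loss_real = F.binary_cross_entropy_with_logits(
        logits_real, torch.ones_like(logits_real))
    loss_fake = F.binary_cross_entropy_with_logits(
        logits_fake, torch.zeros_like(logits_fake))
    return loss_real + loss_fake

def generator_adversarial_term(D, G, landmark):
    """Generator's adversarial term: make D label synthesized faces as real."""
    logits_fake = D(landmark, G(landmark))
    return F.binary_cross_entropy_with_logits(
        logits_fake, torch.ones_like(logits_fake))
```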

As shown in Fig. 2, the proposed network consists of a generator sub-network G (based on the U-Net [23] and DenseNet [8] architectures) conditioned on a facial landmark image, and a patch-based discriminator sub-network D. G takes the landmark as input and attempts to generate the corresponding face image, while D attempts to distinguish between real and synthesized images. The two sub-networks are trained iteratively. In addition to the adversarial loss, we propose to guide the generator using three other loss functions: a perceptual loss based on the VGG-16 architecture [25], a gender preserving loss, and a reconstruction error.

III-A Generator

Deeper networks are known to better capture high-level concepts; however, the vanishing gradient problem affects the rate as well as the quality of convergence. Several works have been developed to overcome this issue, among which U-Net [23] and DenseNet [8] are of particular interest. While U-Net incorporates longer skip connections to preserve low-level features, DenseNet employs short-range connections within micro-blocks, resulting in maximum information flow between layers in addition to an efficient network. Motivated by these two methods, we propose UDeNet for the generator sub-network, in which the U-Net architecture is seamlessly integrated into the DenseNet network in order to leverage the advantages of both methods. This novel combination enables more efficient learning and improved convergence quality.

A set of 3 dense blocks (along with transition blocks) is stacked in the front, followed by a set of 5 dense-block layers (with transition blocks). The initial set of dense blocks is composed of 6 bottleneck layers each. For efficient training and better convergence, symmetric skip connections are incorporated into the generator sub-network, similar to [16]. Details regarding the number of channels for each layer are as follows: C(64)-M(64)-D(256)-T(128)-D(512)-T(256)-D(1024)-T(512)-D(1024)-DT(256)-D(512)-DT(128)-D(256)-DT(64)-D(64)-D(32)-D(32)-DT(16)-C(3), where C(K) is a set of K-channel convolutional layers followed by batch normalization and ReLU activation, M is a max-pooling layer, D(K) is a dense-block layer with K-channel output, and T(K) is a transition layer with K-channel output for downsampling. DT(K) is similar to T(K) except that it uses a transposed convolutional layer instead of a convolutional layer for upsampling.
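The D(K), T(K) and DT(K) building blocks described above can be sketched in PyTorch as follows. This is an illustrative rendition under standard DenseNet conventions (bottleneck width of 4x the growth rate, average-pool downsampling), not the authors' released implementation.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """DenseNet bottleneck: BN-ReLU-1x1 conv then BN-ReLU-3x3 conv; the
    output is concatenated with the input (dense connectivity)."""
    def __init__(self, in_ch, growth):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, 4 * growth, kernel_size=1, bias=False),
            nn.BatchNorm2d(4 * growth), nn.ReLU(inplace=True),
            nn.Conv2d(4 * growth, growth, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, x):
        return torch.cat([x, self.body(x)], dim=1)

class DenseBlock(nn.Module):
    """D(K): a stack of bottleneck layers; out_ch = in_ch + n_layers * growth."""
    def __init__(self, in_ch, growth, n_layers=6):
        super().__init__()
        layers, ch = [], in_ch
        for _ in range(n_layers):
            layers.append(Bottleneck(ch, growth))
            ch += growth
        self.body = nn.Sequential(*layers)
        self.out_ch = ch

    def forward(self, x):
        return self.body(x)

class TransitionDown(nn.Module):
    """T(K): 1x1 conv to K channels followed by 2x average-pool downsampling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
            nn.AvgPool2d(2),
        )

    def forward(self, x):
        return self.body(x)

class TransitionUp(nn.Module):
    """DT(K): transposed convolution for 2x upsampling; the U-Net style long
    skip connection concatenates the symmetric encoder feature map."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)

    def forward(self, x, skip):
        return torch.cat([self.up(x), skip], dim=1)
```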

III-B Discriminator

Motivated by [11], a patch-based discriminator D is used and is trained iteratively along with G. The primary goal of D is to learn to discriminate between real and synthesized samples. This information is backpropagated into G so that it generates samples that are as realistic as possible. Additionally, the patch-based discriminator ensures the preservation of high-frequency details, which are usually lost when only the L1 loss is used. All the convolutional layers in D have the same filter size. Details regarding the number of channels for each convolutional layer are specified in Fig. 2.
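A patch-based discriminator of this kind can be sketched as below. The channel widths and the 4x4 filter size follow common PatchGAN practice and are assumptions, since the paper defers the exact configuration to Fig. 2; in_ch=6 assumes a 3-channel landmark heatmap concatenated with a 3-channel face image.

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Sketch of a patch-based discriminator: the output is a grid of logits,
    each judging one receptive-field patch of the input as real or fake."""
    def __init__(self, in_ch=6, widths=(64, 128, 256, 512)):
        super().__init__()
        layers, ch = [], in_ch
        for i, w in enumerate(widths):
            stride = 2 if i < len(widths) - 1 else 1
            layers.append(nn.Conv2d(ch, w, kernel_size=4, stride=stride, padding=1))
            if i > 0:
                layers.append(nn.BatchNorm2d(w))
            layers.append(nn.LeakyReLU(0.2, inplace=True))
            ch = w
        layers.append(nn.Conv2d(ch, 1, kernel_size=4, stride=1, padding=1))  # patch logits
        self.body = nn.Sequential(*layers)

    def forward(self, landmark, face):
        # Condition the discriminator on the landmark heatmap by concatenation.
        return self.body(torch.cat([landmark, face], dim=1))
```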

III-C Objective function

The network parameters are learned by minimizing the following objective function:

$$L = L_A + \lambda_P L_P + \lambda_G L_G + \lambda_{L_1} L_{L_1} \quad (2)$$

where $L_A$ is the adversarial loss, $L_P$ is the perceptual loss, $L_G$ is the gender preserving loss and $L_{L_1}$ is the loss based on the $L_1$-norm between the target and reconstructed image; $\lambda_P$, $\lambda_G$ and $\lambda_{L_1}$ are the weights for the perceptual loss, gender preserving loss and $L_1$ loss, respectively.

Adversarial loss: The adversarial loss is based primarily on the discriminator sub-network D. Given a set of $N$ synthesized faces $\{G(x_i)\}_{i=1}^{N}$, the entropy loss from D that is used to learn the parameters of G is defined as:

$$L_A = -\frac{1}{N}\sum_{i=1}^{N} \log D\!\left(x_i, G(x_i)\right) \quad (3)$$

Perceptual loss: Johnson et al. [12] introduced the perceptual loss function for style transfer and super-resolution. Instead of relying only on the L1 or L2 reconstruction error, they learn the network parameters using errors between high-level image feature representations extracted from a pre-trained convolutional neural network. Similar to their work, a pre-trained VGG-16 [25] network is used to extract high-level features (from the conv4_3 layer), and the distance between these features of real and fake images is used to guide the generator G. The perceptual loss function is defined as:

$$L_P = \frac{1}{C_j H_j W_j} \left\| V_j(y) - V_j(\hat{y}) \right\|_2^2 \quad (4)$$

where $y$ and $\hat{y}$ indicate the real and fake images, respectively, $V_j$ denotes the feature map extracted at a particular layer $j$ of the VGG-16 network, and $C_j \times H_j \times W_j$ is the shape of that feature map.
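A sketch of Eq. (4) using torchvision's pre-trained VGG-16 follows; the layer index used to reach the activation after conv4_3 and the mean-squared reduction are implementation assumptions (a recent torchvision with the weights API is also assumed).

```python
import torch
import torch.nn as nn
from torchvision import models

class VGGPerceptualLoss(nn.Module):
    """Sketch of Eq. (4): mean squared distance between VGG-16 features of
    the real and synthesized images. Index 22 in torchvision's VGG-16
    feature stack is the ReLU following conv4_3."""
    def __init__(self):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        self.features = vgg.features[:23].eval()
        for p in self.features.parameters():
            p.requires_grad_(False)  # the loss network stays frozen

    def forward(self, y, y_hat):
        # torch.mean over C*H*W implements the 1/(C_j H_j W_j) normalization.
        return torch.mean((self.features(y) - self.features(y_hat)) ** 2)
```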

Fig. 3: Sample qualitative results of synthesis experiments from LFW dataset. The proposed method GP-GAN (UDeNet + GP Loss) achieves more realistic synthesis compared to the other methods (CycleGAN, CVAE, BEGAN, CGAN) and the baseline methods from the ablation study: GP-GAN (UNet+GP Loss), GP-GAN (UDeNet+ No GP Loss).
Fig. 4: Sample qualitative results of synthesis experiments from CASIA WebFace dataset. The proposed method GP-GAN (UDeNet + GP Loss) achieves more realistic synthesis compared to the other methods (CycleGAN, CVAE, BEGAN, CGAN) and the baseline methods from the ablation study: GP-GAN (UNet+GP Loss), GP-GAN (UDeNet+ No GP Loss).

Gender preserving loss: Inspired largely by the perceptual loss, we define a gender preserving loss. As indicated by the name, this function measures the error between the gender attribute of the synthesized image and that of the real image. It is defined as:

$$L_G = \left\| \Phi(y) - \Phi(\hat{y}) \right\|_2^2$$

where $\Phi$ represents a pre-trained gender classification network. In this work, $\Phi$ is constructed using the standard VGG-16 network, in which the convolutional layers are retained and the fully connected layers are replaced by a new set of layers, as shown in Fig. 2. This network is trained by minimizing the standard binary cross entropy error.
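One plausible rendering of this loss, assuming it compares the frozen gender classifier's soft predictions on real and synthesized faces (the exact functional form is not recoverable from the extracted text), is:

```python
import torch
import torch.nn as nn

def gender_preserving_loss(gender_net: nn.Module, y, y_hat):
    """Hypothetical form: match the frozen gender classifier's soft
    prediction on the synthesized face to its prediction on the real face."""
    with torch.no_grad():
        target = torch.sigmoid(gender_net(y))  # soft gender label from the real face
    pred = torch.sigmoid(gender_net(y_hat))   # gradients reach G through y_hat
    return torch.mean((pred - target) ** 2)
```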

L1 loss: The L1 loss measures the reconstruction error between the synthesized face image and the corresponding real image and is defined as:

$$L_{L_1} = \left\| y - \hat{y} \right\|_1 \quad (5)$$
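Putting the four terms together, a sketch of the overall generator objective in Eq. (2) might look as follows. The weight values shown are placeholders (the paper's actual settings are elided in the extracted text), and perceptual_loss and gender_preserving_loss refer to the sketches above.

```python
import torch
import torch.nn.functional as F

def generator_loss(D, G, gender_net, perceptual_loss, landmark, real_face,
                   w_perc=1.0, w_gender=1.0, w_l1=100.0):  # placeholder weights
    """Full generator objective of Eq. (2): adversarial term plus weighted
    perceptual, gender preserving, and L1 reconstruction terms."""
    fake_face = G(landmark)
    logits = D(landmark, fake_face)
    l_adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    l_perc = perceptual_loss(real_face, fake_face)
    l_gender = gender_preserving_loss(gender_net, real_face, fake_face)
    l_l1 = F.l1_loss(fake_face, real_face)
    return l_adv + w_perc * l_perc + w_gender * l_gender + w_l1 * l_l1
```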
|       | LM (D) | LM (A) | CycleGAN | CVAE | BEGAN | CGAN | GP-GAN (UNet+GP-Loss) | GP-GAN (UDeNet, No GP Loss) | GP-GAN (UDeNet+GP-Loss) |
|-------|--------|--------|----------|------|-------|------|-----------------------|------------------------------|--------------------------|
| LFW   | 78.0 ± 1.9 | 79.8 ± 2.4 | 81.8 ± 1.1 | 80.3 ± 2.0 | 84.4 ± 1.9 | 86.3 ± 2.5 | 91.1 ± 1.1 | 91.7 ± 1.6 | 93.1 ± 1.2 |
| CASIA | 61.0 ± 11.8 | 61.7 ± 13.6 | 64.8 ± 3.3 | 62.0 ± 4.1 | 67.8 ± 5.0 | 70.4 ± 5.5 | 73.2 ± 3.9 | 76.7 ± 4.3 | 78.4 ± 4.1 |

TABLE I: Quantitative comparison of gender recognition accuracy (%, mean ± standard deviation) for various methods.

IV Experiments and Evaluations

In this section, experimental settings and the evaluation of the proposed method are discussed in detail. We present qualitative and quantitative results of the synthesis experiment. The quantitative performance is measured using gender recognition rates. Results are compared with four state-of-the-art generative models: Conditional GAN [11], CycleGAN [43], CVAE [27, 35] and an adapted BEGAN (following https://github.com/taey16/pix2pixBEGAN.pytorch), in addition to two baseline methods: (a) GP-GAN using the U-Net generator with GP-Loss, and (b) GP-GAN using the UDeNet generator without GP-Loss. The baseline comparisons are performed to demonstrate the improvements achieved by the gender preserving loss and the UDeNet components. We also demonstrate that synthesis using GP-GAN accentuates the gender information present in landmarks by comparing gender recognition rates against methods that compute these rates directly from landmark points [4]. Furthermore, we conduct an experiment to evaluate the data augmentation capabilities of the synthesis method.

IV-A Preprocessing and training details

Prior to performing these experiments, all images in both datasets are fed through a pre-processing pipeline. First, MTCNN [40] is employed for detecting face bounding boxes, which are used to crop the faces, followed by landmark keypoint detection using the TCDCN algorithm [41]. Pairs of these detected landmarks and faces are used for training the proposed method. Since we consider this problem as an image-to-image translation task, the input landmarks are encoded as a heatmap (similar to [15]), as shown in Fig. 1, which is created by imposing a 2D Gaussian with a standard deviation of 0.2 at every landmark location on a blank image, as in the crowd counting work [26]. Note that the cropped face images are resized to 64×64.
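A minimal sketch of this heatmap encoding is given below, assuming the landmarks are provided in pixel coordinates of the 64×64 image; the 0.2 standard deviation follows the value quoted above.

```python
import numpy as np

def landmarks_to_heatmap(landmarks, size=64, sigma=0.2):
    """Place a 2D Gaussian at each landmark location on a blank size x size
    image; sigma follows the standard deviation quoted in the text."""
    ys, xs = np.mgrid[0:size, 0:size]
    heatmap = np.zeros((size, size), dtype=np.float32)
    for x, y in landmarks:  # landmarks given in pixel coordinates
        g = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
        heatmap = np.maximum(heatmap, g)  # keep the strongest response per pixel
    return heatmap
```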

The proposed network is trained on a single TitanX GPU for approximately 10 hours (200 epochs). The same learning rate is used for G and D. For the perceptual network, the input images are resized to the VGG-16 input resolution. The learning rate is decayed by a fixed factor every epoch after the first 100 epochs. The weights $\lambda_P$, $\lambda_G$ and $\lambda_{L_1}$ are kept fixed for all experiments.

For learning the parameters of the proposed method and the baselines, the training set from the official deep-funneled LFW dataset [10, 9] is used. It contains 5749 identities and 13233 images. The official View 1 training, validation and test splits were used for this experiment. After the detection and cropping procedure, we are left with 3757 training images and 1615 test images. The trained network is evaluated on the LFW test set and a subset of the CASIA-WebFace dataset [36]. The test subset for CASIA-WebFace is constructed by randomly selecting 1000 male and 1000 female face images. Note that, in order to demonstrate generalization performance, the proposed network is trained using only the LFW training set and evaluated on both the LFW test set and the CASIA-WebFace subset.

IV-B Results

Fig. 3 and Fig. 4 show sample reconstruction results for the various methods on the LFW and CASIA datasets, respectively. The landmark image is used as the input for all methods except CVAE [27, 35]; for CVAE, the inputs are the original image and normalized landmark locations as attributes. It can be clearly observed that Conditional GAN [11], CycleGAN [43] and BEGAN [2] are unable to reconstruct visually coherent faces. Though CVAE is able to generate visually appropriate faces, it fails to preserve the gender information: since its network implements an auto-encoder-like architecture and uses a pixel-wise Euclidean measure, the output is often blurry, which makes gender classification very difficult. GP-GAN using the UDeNet generator without GP-Loss generates perceptually better results than GP-GAN using the UNet generator with GP-Loss, demonstrating the superior performance obtained using the novel combination of the UNet and DenseNet architectures. The proposed method, GP-GAN (UDeNet and GP-Loss), outperforms all existing and baseline methods. It may be argued that identity information is lost during the reconstruction process; however, note that the goal of the proposed method is not to capture the exact mapping between landmarks and corresponding faces. Instead, the idea is to explore the generation of visually coherent faces from landmark keypoints, which can further assist in data augmentation and other tasks.

As discussed earlier, the quantitative performance is measured in terms of gender recognition rates, shown in Table I. Gender recognition rates for the synthesized images are calculated using LBP features [18] and a linear SVM classifier trained on the LFW training set, whereas the recognition rates for landmarks, LM(D) and LM(A), are calculated using the distance and angle methods described in [4]. Note that this gender recognition is performed based only on landmark keypoints, considering that the corresponding face images are unavailable; hence, recent state-of-the-art gender recognition methods cannot be used for comparison, as they operate on actual face images rather than only on facial landmarks. Consistent with the visual comparisons, the quantitative results show that gender recognition rates generally improve using the generative models as compared to the landmark-based methods.
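For reference, the LBP-plus-linear-SVM evaluation pipeline could be sketched as follows; the LBP neighborhood parameters (P, R) are assumptions, as the paper does not state them.

```python
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.svm import LinearSVC

def lbp_histogram(gray_face, P=8, R=1):
    """Uniform LBP histogram for one grayscale face image; P neighbors at
    radius R yield P + 2 uniform-pattern bins."""
    lbp = local_binary_pattern(gray_face, P, R, method="uniform")
    n_bins = P + 2
    hist, _ = np.histogram(lbp, bins=n_bins, range=(0, n_bins), density=True)
    return hist

# Hypothetical usage: train on LFW training faces, evaluate on synthesized ones.
# X_train = np.stack([lbp_histogram(f) for f in train_faces])
# clf = LinearSVC().fit(X_train, train_gender_labels)  # labels in {0, 1}
# accuracy = clf.score(np.stack([lbp_histogram(f) for f in synth_faces]),
#                      synth_gender_labels)
```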

With respect to the baseline comparisons, it can be observed that GP-GAN using the UDeNet generator without GP-Loss outperforms GP-GAN using the UNet generator with GP-Loss, in spite of the fact that GP-Loss is not used, thus indicating the effectiveness of the UDeNet architecture. Furthermore, the proposed method, GP-GAN (UDeNet with GP-Loss), outperforms all existing and baseline methods by a large margin in terms of gender recognition rates. This indicates that the proposed synthesis method can generate face images from facial landmarks alone while retaining the gender information present in these landmarks.

In addition, we conducted a face synthesis experiment to verify whether the proposed method can be used for data augmentation. In this experiment, we manipulate the landmarks of a face (for instance, changing an open mouth to a closed mouth) and use the modified landmarks to synthesize a face using generator G. Sample results are shown in Fig. 5. It can be seen that the generator is able to synthesize realistic faces from the modified landmarks while reflecting the modification in the synthesized face. Additionally, the gender attribute is retained. Based on these experiments, we conclude that the proposed method is able to successfully generate face samples that can be used for data augmentation in other facial analysis tasks.

Fig. 5: Results of experiment for dataset augmentation where landmark corresponding to a face is modified and used for synthesis. We are able to generate new samples while preserving gender information. (i) Original face image. (ii) Landmark corresponding to original face. (iii) Synthesized face from original landmark. (iv) Landmark obtained after manipulating original landmark. (v) Synthesized face image using manipulated landmark.

V Conclusion

We explored the problem of synthesizing faces from landmark points using recently introduced generative models. The aim of this work was to demonstrate that information (especially gender) present in landmark keypoints can be accentuated using synthesis models while generating realistic images. The proposed network is based on generative adversarial networks and is guided by a perceptual loss and a novel gender preserving loss. Further, we proposed a novel generator based on the UNet and DenseNet architectures. Evaluations were performed on two popular datasets, LFW and CASIA-WebFace, and the results were compared with recent state-of-the-art generative methods. It was clearly demonstrated that the proposed method achieves significant improvements in terms of visual quality and gender recognition. Additionally, we conducted a face synthesis experiment to demonstrate that the proposed generative method can be used as a data augmentation technique.

Acknowledgments

This research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA R&D Contract No. 2014-14071600012. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. We would like to thank He Zhang for his insightful discussion on this topic.

References