Face Synthesis from Visual Attributes via Sketch using Conditional VAEs and GANs

by   Xing Di, et al.
Rutgers University

Automatic synthesis of faces from visual attributes is an important problem in computer vision and has wide applications in law enforcement and entertainment. With the advent of deep generative convolutional neural networks (CNNs), attempts have been made to synthesize face images from attributes and text descriptions. In this paper, we take a different approach, where we formulate the original problem as a stage-wise learning problem. We first synthesize the facial sketch corresponding to the visual attributes and then we reconstruct the face image based on the synthesized sketch. The proposed Attribute2Sketch2Face framework, which is based on a combination of deep Conditional Variational Autoencoder (CVAE) and Generative Adversarial Networks (GANs), consists of three stages: (1) Synthesis of facial sketch from attributes using a CVAE architecture, (2) Enhancement of coarse sketches to produce sharper sketches using a GAN-based framework, and (3) Synthesis of face from sketch using another GAN-based network. Extensive experiments and comparison with recent methods are performed to verify the effectiveness of the proposed attribute-based three stage face synthesis method.






1 Introduction

Facial attributes are descriptions or labels that can be given to a face to describe its appearance kumar_ttributes . In the biometrics community, attributes are also referred to as soft biometrics softbio . Various methods have been developed in the literature for predicting facial attributes from images DeepAtt , kumar2008facetracer , zhang2014panda . For instance, Kumar et al. kumar2008facetracer proposed a facial part-based method for attribute prediction. Zhang et al. zhang2014panda proposed a method which combines part-based models and deep learning for learning attributes. Similarly, Liu et al. DeepAtt proposed a convolutional neural network (CNN) based approach which combines two CNNs, one for localizing the face region and one for extracting high-level features from the localized region, to predict attributes.

Figure 1: Attribute prediction vs. face synthesis from attributes. (a) Attribute prediction: given a face image, the goal is to predict the corresponding attributes. (b) Face synthesis from attributes: given a list of facial attributes, the goal is to generate a face image that satisfies these attributes.

Figure 2: Three-stage training network. Stage 1 generates a coarse approximation of the sketch image from attributes. Stage 2 further enhances the sketch image from Stage 1. Finally, Stage 3 generates a face image from the sketch generated by Stage 2, conditioned on the attributes. Here, $G_2$ and $G_3$ denote generators, while $D_2$ and $D_3$ denote discriminators. Note that the attributes are divided into two separate groups - one corresponding to texture and the other corresponding to color.

Figure 3: Testing phase of the proposed Attribute2Sketch2Face method. The network takes attributes and noise vectors as the input and generates high-quality face images.

While several methods have been proposed in the literature for inferring attributes from images, the inverse problem of synthesizing faces from their corresponding attributes is a relatively unexplored problem (see Figure 1). Visual description-based face synthesis has many applications in law enforcement and entertainment. For example, visual attributes are commonly used in law enforcement to assist in identifying suspects involved in a crime when no facial image of the suspect is available at the crime scene. This is commonly done by constructing a composite or forensic sketch of the person based on the visual attributes.

Reconstructing an image from attributes or text descriptions is an extremely challenging problem. Several recent works have attempted to solve this problem by using CNN-based generative models such as the conditional variational autoencoder (CVAE) sohn2015learning and the generative adversarial network (GAN) goodfellow2014generative . For instance, Yan et al. yan2016attribute2image proposed a CVAE-based method for attribute-conditioned image generation. In a different approach, Reed et al. reed2016generative proposed a GAN-based method for synthesizing images from detailed text descriptions. Similarly, Zhang et al. zhang2016stackgan proposed a stacked GAN method for synthesizing photo-realistic images from text.

In contrast to the above mentioned methods, we propose a different approach to the problem of face image reconstruction from attributes. Rather than directly reconstructing a face from attributes, we first synthesize a sketch image corresponding to the attributes and then reconstruct the face image from the synthesized sketch. Our approach is motivated by the way forensic sketch artists render the composite sketches of an unknown subject using a number of individually described parts and attributes.

In particular, the proposed framework consists of three stages (see Figure 2). In the first stage, we adapt a CVAE-based framework to generate a sketch image from visual attributes. The generated sketch images from the first stage are often of poor quality. Hence, in the second stage, we further enhance the sketch images using a GAN-based framework in which the generator sub-network leverages the advantages of the UNet ronneberger2015u and DenseNet huang2017densely architectures, a design inspired by jegou2017one . Finally, in the third stage, we reconstruct a color face image from the enhanced sketch image with the help of attributes using another GAN-based framework. The Stage 3 formulation is motivated by the disentangled representation learning framework proposed in disentangled . In particular, the attribute information is fused with the latent representation vector to learn a disentangled representation. Once the three-stage network is trained, one can synthesize sketches and face images by inputting visual attributes along with noise, as shown in Figure 3.

To summarize, this paper makes the following contributions:

  • We formulate the attribute-to-face reconstruction problem as a stage-wise learning problem (i.e. attribute-to-sketch, sketch-to-sketch, sketch-to-face).

  • A novel attribute-preserving dense UNet-based generator architecture, called AUDeNet, is proposed which incorporates the encoded texture attributes and the coarse sketches from stage 1 to generate sharper sketches.

  • A new sketch-to-face synthesis generator is proposed which reconstructs the face image from the sketch image using attributes. This generator is based on a new UNet structure and is able to preserve the attributes of the reconstructed image and improves the overall image quality.

  • We use the combination of L1 loss, adversarial loss and perceptual loss johnson2016perceptual in different stages for the purpose of image synthesis.

  • Extensive experiments are conducted to demonstrate the effectiveness of the proposed image synthesis method. Furthermore, an ablation study is conducted to demonstrate the improvements obtained by different stages of our framework.

The rest of the paper is organized as follows. In Section 2, we review related work. Details of the proposed attribute-to-face image synthesis method are given in Section 3. Experimental results are presented in Section 4, and Section 5 concludes the paper with a brief summary.

2 Background and Related Work

2.1 Conditional VAE (CVAE)

VAEs are powerful generative models that use deep networks to describe the distribution of observed and latent variables. A VAE consists of two networks: one encodes a data sample into a latent representation and the other decodes the latent representation back to data space. The VAE regularizes the encoder by imposing a prior over the latent distribution. The conditional VAE (CVAE) sohn2015learning , yan2016attribute2image is an extension of the VAE that models latent variables and data, both conditioned on side information such as a part or label of the image. The CVAE is trained by maximizing the variational lower bound

$$\log p_\theta(y \mid x) \;\ge\; -\mathrm{KL}\big(q_\phi(z \mid x, y) \,\|\, p_\theta(z)\big) + \mathbb{E}_{q_\phi(z \mid x, y)}\big[\log p_\theta(y \mid x, z)\big], \quad (1)$$

where $x$, $y$ and $z$ are input, output and latent variables, respectively, and $\theta$ and $\phi$ are the parameters. Here, $p_\theta(z)$ is assumed to be an isotropic Gaussian distribution and $p_\theta(y \mid x, z)$ and $q_\phi(z \mid x, y)$ are multivariate Gaussian distributions.
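
To make the regularization term of the lower bound concrete, the following is a minimal sketch of the closed-form KL divergence between a diagonal Gaussian and a standard normal prior. It assumes that the encoder outputs a mean vector and a log-variance vector and that the isotropic prior is N(0, I); the function name and the example values are illustrative and not taken from the paper.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions.

    Assumption: the approximate posterior is a diagonal Gaussian and the
    prior is a standard normal, which gives this closed form.
    """
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )


# A posterior that already equals the prior incurs no penalty.
kl_zero = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])

# Shifting one mean by 1 (unit variance) costs 0.5 nats.
kl_shifted = kl_to_standard_normal([1.0, 0.0], [0.0, 0.0])
```

The term grows as the encoder's distribution moves away from the prior, which is what keeps samples drawn from the prior at test time close to the latent codes seen during training.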

Figure 4: Stage 1 (A2S) network architecture.

2.2 Conditional GAN

GANs goodfellow2014generative are another class of generative models that synthesize realistic images by learning the distribution of the training images. The goal of a GAN is to train a generator, $G$, to produce samples from the training distribution such that the discriminator, $D$, cannot distinguish the synthesized samples from real ones. A conditional GAN is a variant in which the generator is conditioned on additional variables such as discrete labels, text or images. The objective function of a conditional GAN is defined as follows

$$\min_G \max_D \; \mathbb{E}_{x, y \sim p_{data}(x, y)}\big[\log D(x, y)\big] + \mathbb{E}_{x \sim p_{data}(x),\, z \sim p_z(z)}\big[\log\big(1 - D(x, G(x, z))\big)\big], \quad (2)$$

where $z$ is the input noise, $y$ is the output image and $x$ is the observed image. Real pairs $(x, y)$ are sampled from the distribution $p_{data}(x, y)$ and are to be recognized as real by the discriminator $D$, while the generated fake $G(x, z)$, with $x$ and $z$ sampled from $p_{data}(x)$ and $p_z(z)$, is meant to fool $D$.
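
As a small numeric illustration of the conditional GAN objective, the sketch below evaluates the quantity that the discriminator maximizes, given its output probabilities on real and generated pairs. The function name and the probability values are hypothetical; no network is involved.

```python
import math


def discriminator_objective(d_real, d_fake):
    """Mean of log D(real pair) + log(1 - D(fake pair)).

    d_real and d_fake are equal-length lists of discriminator outputs in
    (0, 1). The discriminator tries to maximize this value, while the
    generator tries to make the second term small.
    """
    if len(d_real) != len(d_fake):
        raise ValueError("expected the same number of real and fake scores")
    total = sum(
        math.log(r) + math.log(1.0 - f) for r, f in zip(d_real, d_fake)
    )
    return total / len(d_real)


# A discriminator that separates real from fake well scores higher ...
confident = discriminator_objective([0.9, 0.8], [0.1, 0.2])

# ... than one that outputs 0.5 everywhere, i.e. one that is fully fooled.
fooled = discriminator_objective([0.5, 0.5], [0.5, 0.5])
```

When the discriminator outputs 0.5 for every pair, the objective equals 2 log 0.5, the value reached when generated samples are indistinguishable from real ones.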

Recently, several variants based on this game-theoretic approach have been proposed for image synthesis and image-to-image translation tasks. Isola et al. isola2016image used conditional GANs mirza2014conditional for several tasks such as labels to street scenes, labels to facades and image colorization. In another variant, Zhu et al. zhu2017unpaired proposed CycleGAN, which learns image-to-image translation in an unsupervised fashion. Berthelot et al. berthelot2017began proposed a new method for training auto-encoder based GANs that is relatively more stable. Their method is paired with a loss inspired by the Wasserstein distance arjovsky2017wasserstein . Reed et al. reed2016generative proposed a conditional GAN network to generate reasonable images conditioned on a text description. Zhang et al. zhang2016stackgan proposed a two-stage stacked GAN method which achieves state-of-the-art image synthesis results. Recently, Bao et al. bao2017cvae proposed a fine-grained image generation method based on a combination of CVAE and GANs. Yan et al. yan2016attribute2image proposed a CVAE method using a disentangled representation in the latent and the original data distribution to achieve impressive attribute-to-image synthesis results.

Note that the approach we take in this paper differs from the above-mentioned methods in that we make use of an intermediate representation (i.e. sketch) for the problem of image synthesis from attributes. In contrast, some of the other methods attempt to directly reconstruct the image from attributes. The method closest to our approach is StackGAN zhang2016stackgan , where the original image synthesis problem is broken into more manageable sub-problems through a primitive shape and color refinement process. An important difference is that zhang2016stackgan was specifically designed for text-to-image translation, while our approach targets attribute-to-face image reconstruction. Furthermore, as will be shown later, our approach produces much better face reconstructions compared to zhang2016stackgan .

3 Proposed Method

In this section, we provide details of the proposed Attribute2Sketch2Face method for image reconstruction from attributes. It consists of three stages: attribute-to-sketch (A2S), sketch-to-sketch (S2S), and sketch-to-face (S2F) (see Figure 2). Note that the training phase of our method requires ground truth attributes and the corresponding sketch and face images. Furthermore, the attributes are divided into two separate groups - one corresponding to texture and the other corresponding to color. Since sketch contains no color information, we use only texture attributes in A2S and S2S stages as indicated in Figure 2.

3.1 Stage 1: Attribute-to-Sketch (A2S)

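In the A2S stage, we adapt the CVAE architecture from yan2016attribute2image . Figure 4 gives an overview of the Stage 1 network architecture. Given a texture attribute vector $a_t$, a noise vector $n$ and a ground-truth sketch $s$, we aim to learn a model $p_\theta(s \mid a_t, z)$ which can model the distribution of $s$ and generate a sketch $\hat{s}$. Here, $p_\theta(s \mid a_t, z)$ denotes the decoder with parameter $\theta$ and $q_\phi(z \mid s, a_t)$ denotes the encoder with parameter $\phi$. In this approach, the objective is to find the best parameter $\theta$ which maximizes the log-likelihood $\log p_\theta(s \mid a_t)$. In the conditional VAE, the objective is to maximize the following variational lower bound,

$$\log p_\theta(s \mid a_t) \;\ge\; -\mathrm{KL}\big(q_\phi(z \mid s, a_t) \,\|\, p_\theta(z)\big) + \mathbb{E}_{q_\phi(z \mid s, a_t)}\big[\log p_\theta(s \mid a_t, z)\big], \quad (3)$$

where $p_\theta(z)$ is an isotropic multivariate Gaussian distribution and $p_\theta(s \mid a_t, z)$ and $q_\phi(z \mid s, a_t)$ are two multivariate Gaussian distributions. The purpose of this function is to approximate the true conditional probability $p_\theta(z \mid s, a_t)$ with error $\mathrm{KL}\big(q_\phi(z \mid s, a_t) \,\|\, p_\theta(z \mid s, a_t)\big)$ by maximizing the bound.

As shown in Figure 4, the two encoders, $q_\phi(z \mid s, a_t)$ and $q_\beta(z \mid n, a_t)$, correspond to Encoder 1 and Encoder 2 in Figure 2, respectively. The encoder $q_\phi(z \mid s, a_t)$ takes the sketch and attributes as input, whereas $q_\beta(z \mid n, a_t)$ takes the noise and attribute vectors as input. The overall loss function of the A2S stage is as follows

$$\mathcal{L}_{A2S} = \mathrm{KL}\big(q_\phi(z \mid s, a_t) \,\|\, p_\theta(z)\big) + \mathrm{KL}\big(q_\beta(z \mid n, a_t) \,\|\, p_\theta(z)\big) - \mathbb{E}_{z \sim q_\phi(z \mid s, a_t)}\big[\log p_\theta(s \mid a_t, z)\big] - \mathbb{E}_{z \sim q_\beta(z \mid n, a_t)}\big[\log p_\theta(s \mid a_t, z)\big]. \quad (4)$$

The first two terms in (4), $\mathrm{KL}\big(q_\phi(z \mid s, a_t) \,\|\, p_\theta(z)\big)$ and $\mathrm{KL}\big(q_\beta(z \mid n, a_t) \,\|\, p_\theta(z)\big)$, are regularization terms that enforce the latent variables $z \sim q_\phi(z \mid s, a_t)$ and $z \sim q_\beta(z \mid n, a_t)$ to both match the prior normal distribution, $p_\theta(z) = \mathcal{N}(0, I)$. The remaining two terms are the reconstruction terms for the two encoder paths.
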
Encoder 1 has two modules: one encoding the input sketch (in blue) and the other encoding the texture attributes (in yellow). The encoding module for the sketch consists of the following components: CONV5(64) - CONV5(128) - CONV3(256) - CONV3(512) - CONV4(1024), where CONVk(N) denotes an N-channel convolutional layer with a kernel of size $k \times k$. In particular, CONV5(64) and CONV5(128) each consist of a convolutional layer followed by ReLU and a stride-2 max pooling layer. The next two layers, CONV3(256) and CONV3(512), each consist of a convolutional layer followed by batch normalization and ReLU. The final CONV4(1024) layer is a convolutional layer with a kernel of size $4 \times 4$ and a 1024-channel output. The other encoding module, for the attributes, is a fully-connected network with a 256-dimensional output followed by 1D batch normalization and ReLU layers.

Encoder 2, which takes the noise and attributes as input, consists of the same encoding module for attributes as Encoder 1 (shown in yellow) and an encoding module for noise (shown in purple). The noise encoding module consists of one fully-connected layer with a 1024-dimensional output along with 1D batch normalization and ReLU layers. For the decoder (shown in green), we first concatenate the encoded attributes with the encoded image or noise and apply the reparameterization trick as in kingma2013auto . We then reshape the mixed latent vector into feature maps and apply four UpsampleBlocks, each of which consists of a 2D nearest-neighbor upsampling layer followed by convolutional, batch normalization and ReLU layers.
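
The reparameterization trick mentioned above can be sketched in a few lines. The code assumes the latent distribution is a diagonal Gaussian described by a mean vector and a log-variance vector; the function name, the vector length and the values are illustrative, not the paper's.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1) and sigma = exp(0.5 * log_var).

    All randomness comes from eps, so z stays a differentiable function of
    mu and log_var in a framework with automatic differentiation.
    """
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]


mu = [0.5, -1.0, 2.0]

# With a very small variance the sample collapses onto the mean.
near_mean = reparameterize(mu, [-100.0] * 3, random.Random(0))

# With unit variance the sample is a noisy version of the mean.
sample = reparameterize(mu, [0.0] * 3, random.Random(0))
```

Passing an explicit random generator keeps the sketch reproducible, which is convenient when comparing sketches generated from the same attributes with different noise.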

Figure 5: Sample reconstruction results from Stage 1. Odd columns: reconstructed sketch images. Even columns: real sketch images.

Figure 6: Stage 2 (S2S) network architecture (AUDeNet). (a) The generator ($G_2$) produces sharp sketch images from blurry inputs. The discriminator ($D_2$) is a patch-based discriminator with 4 downsampling blocks which provides the adversarial feedback to $G_2$. (b) The DenseBlock huang2016densely used in $G_2$.

3.2 Stage 2: Sketch-to-Sketch (S2S)

As shown in Figure 5, sketch reconstructions from Stage 1 are often of poor quality. Hence, we propose a conditional GAN-based framework to generate sharper sketch images from blurry images. As shown in Figure 6, the proposed network consists of a generator sub-network $G_2$ (based on the UNet ronneberger2015u and DenseNet huang2017densely architectures), conditioned on the encoded attribute vector from the A2S stage, and a patch-based discriminator sub-network $D_2$. $G_2$ takes blurry sketch images as input and attempts to generate sharper sketch images, while $D_2$ attempts to distinguish between real and generated images. The two sub-networks are trained iteratively.

3.2.1 Generator (G2)

Deeper networks are known to better capture high-level concepts; however, the vanishing gradient problem affects the convergence rate as well as the quality of convergence. Several works have been developed to overcome this issue, among which UNet ronneberger2015u and DenseNet huang2017densely are of particular interest. While UNet incorporates longer skip connections to preserve low-level features, DenseNet employs short-range connections within micro-blocks, resulting in maximum information flow between layers in addition to an efficient network. Motivated by these two methods, we propose AUDeNet for the generator sub-network $G_2$, in which the UNet architecture is integrated into the DenseNet network in order to leverage the advantages of both methods. This combination enables more efficient learning and improved convergence quality. Furthermore, in order to generate attribute-preserving reconstructions, we concatenate the latent attribute vector from A2S with the latent vector from the encoder, as shown in Figure 6.

A set of 3 dense-blocks (along with transition blocks) is stacked in the front, followed by a set of 5 dense-block layers (with transition blocks). The initial set of dense-blocks is composed of 6 bottleneck layers. For efficient training and better convergence, symmetric skip connections are included in the generator sub-network, similar to mao2016image . Details regarding the number of channels for each convolutional layer are as follows: C(64) - M(64) - D(256) - T(128) - D(512) - T(256) - D(1024) - T(512) - D(1024) - DT(256) - D(512) - DT(128) - D(256) - DT(64) - D(64) - D(32) - D(32) - DT(16) - C(3), where C(K) is a set of K-channel convolutional layers followed by batch normalization and ReLU activation, M is a max-pooling layer, D(K) is a dense-block layer with K-channel output, and T(K) is a transition layer with K-channel output for downsampling. DT(K) is similar to T(K) except that it uses a transposed convolutional layer instead of a convolutional layer, for upsampling.
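
The channel specification above is compact enough to check mechanically. The following sketch parses the layer string and counts downsampling and upsampling layers; it assumes that the max-pooling layer and every T layer halve the spatial resolution and that every DT layer doubles it, which the text does not state explicitly. The helper name `parse_spec` is hypothetical.

```python
import re

# Layer specification of the Stage 2 generator, copied from the text.
SPEC = (
    "C(64) - M(64) - D(256) - T(128) - D(512) - T(256) - D(1024) - T(512) - "
    "D(1024) - DT(256) - D(512) - DT(128) - D(256) - DT(64) - D(64) - D(32) - "
    "D(32) - DT(16) - C(3)"
)


def parse_spec(spec):
    """Turn 'C(64) - M(64) - ...' into a list of (layer_type, channels) pairs."""
    layers = []
    for token in spec.split(" - "):
        match = re.fullmatch(r"([A-Z]+)\((\d+)\)", token.strip())
        if match is None:
            raise ValueError("unrecognized layer token: " + token)
        layers.append((match.group(1), int(match.group(2))))
    return layers


layers = parse_spec(SPEC)

# Assumption: M and T reduce resolution by 2, DT increases it by 2.
num_down = sum(1 for kind, _ in layers if kind in ("M", "T"))
num_up = sum(1 for kind, _ in layers if kind == "DT")
output_channels = layers[-1][1]
```

Under the stated assumption, the four downsampling steps are matched by four upsampling steps, so the generator's output has the same spatial size as its input sketch.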

3.2.2 Discriminator (D2)

Motivated by isola2016image , a patch-based discriminator $D_2$ is used, and it is trained iteratively along with $G_2$. The primary goal of $D_2$ is to learn to discriminate between real and synthesized samples. This information is backpropagated into $G_2$ so that it generates samples that are as realistic as possible. Additionally, the patch-based discriminator helps preserve high-frequency details, which are usually lost when only the L1 loss is used. All the convolutional layers in $D_2$ use the same filter size.

3.2.3 Objective function

The network parameters for the S2S stage are learned by minimizing the following objective function:

$$\mathcal{L}_{S2S} = \mathcal{L}_A + \lambda_1 \mathcal{L}_1 + \lambda_P \mathcal{L}_P, \quad (5)$$

where $\mathcal{L}_A$ is the adversarial loss, $\mathcal{L}_1$ is the loss based on the $\ell_1$-norm between the synthesized image and the target, $\mathcal{L}_P$ is the perceptual loss, and $\lambda_1$ and $\lambda_P$ are weights. The adversarial loss is based primarily on the discriminator sub-network $D_2$. Given a set of $N$ synthesized sketch images, $\{\hat{y}_i\}_{i=1}^{N}$, the entropy loss from $D_2$ that is used to learn the parameters of $G_2$ is defined as

$$\mathcal{L}_A = -\frac{1}{N} \sum_{i=1}^{N} \log D_2(\hat{y}_i). \quad (6)$$

The L1 loss measures the reconstruction error between the synthesized sketch image and the corresponding target sketch and is defined as

$$\mathcal{L}_1 = \frac{1}{N} \sum_{i=1}^{N} \lVert y_i - \hat{y}_i \rVert_1. \quad (7)$$

Finally, the perceptual loss johnson2016perceptual is used to measure the distance between high-level features extracted from a pre-trained CNN and is defined as

$$\mathcal{L}_P = \frac{1}{N} \sum_{i=1}^{N} \lVert V(y_i) - V(\hat{y}_i) \rVert_2^2. \quad (8)$$

Here, $y_i$ and $\hat{y}_i$ indicate target and synthesized images, respectively, and $V$ is a particular layer of the VGG-16 network. In our work, the output from the conv1-2 layer of a pre-trained VGG-16 network simonyan2014very is used as the feature representation. Note that the coarse sketches from Stage 1, along with the corresponding target sketches, are used to train the network.
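
The way the three loss terms combine can be illustrated on toy values. The sketch below treats images and feature maps as flat lists of numbers and assumes the weights multiply the L1 and perceptual terms; the weight values, the per-element averaging and all function names are hypothetical and chosen only for the example.

```python
import math


def adversarial_loss(d_scores):
    """Negative mean log of the discriminator's scores on synthesized images."""
    return -sum(math.log(s) for s in d_scores) / len(d_scores)


def l1_loss(synth, target):
    """Mean absolute difference between synthesized and target pixels."""
    return sum(abs(a - b) for a, b in zip(synth, target)) / len(target)


def perceptual_loss(synth_feat, target_feat):
    """Mean squared difference between feature vectors of the two images."""
    return sum((a - b) ** 2 for a, b in zip(synth_feat, target_feat)) / len(target_feat)


def total_loss(adv, l1, perc, lambda_1, lambda_p):
    """Weighted sum of the three terms (the weights are assumed values)."""
    return adv + lambda_1 * l1 + lambda_p * perc


adv = adversarial_loss([1.0])                # fully fooled discriminator gives 0
l1 = l1_loss([0.0, 1.0], [1.0, 1.0])         # one of two pixels off by 1
perc = perceptual_loss([1.0, 2.0], [1.0, 4.0])
loss = total_loss(adv, l1, perc, lambda_1=10.0, lambda_p=0.5)
```

In a real training loop the feature vectors would come from a fixed, pre-trained network, and only the generator's parameters would receive gradients from this sum.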

3.3 Stage 3: Sketch-to-Face (S2F)

Figure 7: Stage 3 (S2F) network architecture. A novel UNet-based generator, $G_3$, conditioned on visual attributes is used to synthesize face images from the sketch images. $D_3$ is a patch-based discriminator.

The objective of Stage 3 is to reconstruct a color face image from the sketch image generated from the S2S stage. We propose a GAN-based framework for this problem where we make use of another UNet-based architecture for the generator sub-network. In particular, the visual attribute vector is combined with the latent representation to produce attribute-preserved image reconstructions. Figure 7 gives an overview of the proposed network architecture for S2F.

3.3.1 Generator (G3)

The Stage 3 generator consists of five convolutional layers and five transposed convolutional layers. Details regarding the number of channels for each convolutional and transposed convolutional layer are as follows: C(64) - C(128) - C(256) - C(512) - C(512) - R(512) - DC(512) - DC(256) - DC(128) - DC(64) - DC(1), where C(K) is a set of K-channel convolutional layers followed by batch normalization and leaky ReLU activation. DC(K) denotes a set of K-channel transposed convolutional layers along with ReLU and batch normalization layers. R(K) is a two-layer ResNet block as in StackGAN zhang2016stackgan , used to fuse the attribute vector with the UNet latent vector. Note that unlike Stages 1 and 2, the attribute vector here consists of both texture and color attributes.

3.3.2 Discriminator (D3)

Similar to $D_2$, a patch-based discriminator $D_3$, consisting of 4 downsampling blocks, is used, and it is trained iteratively along with $G_3$.

3.3.3 Objective function

The network parameters for the S2F stage are learned by minimizing (5). In particular, a combination of L1 loss, adversarial loss and perceptual loss is used. As before, the perceptual loss is measured using the deep feature representations from the conv1-2 layer of a pre-trained VGG-16 network simonyan2014very . We use the enhanced sketch from the previous stage along with the target face image to train this network.

3.4 Testing

Figure 3 shows the testing phase of the proposed method. Attribute and noise vectors are first passed through the encoder/decoder structure corresponding to the A2S stage. The encoded texture attribute vector along with the generated sketch from the A2S stage are fed into an AUDeNet-based generator (G2) to produce a sharper sketch image. Finally, a UNet-based attribute-conditioned generator (G3) corresponding to the S2F stage is used to reconstruct a high-quality face image from the sketch image generated from the S2S stage. In other words, our method takes noise and attribute vectors as input and generates high-quality face images via sketch images.
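
The data flow at test time can be summarized as a short function. In the sketch below, `a2s`, `g2` and `g3` are hypothetical callables standing in for the trained Stage 1, Stage 2 and Stage 3 networks, and their signatures are assumptions made for the example; the stubs only record what each stage receives.

```python
def synthesize_face(texture_attrs, color_attrs, noise, a2s, g2, g3):
    """Run the three stages in order and return all intermediate outputs.

    Assumed (hypothetical) signatures:
      a2s(texture_attrs, noise) -> (coarse_sketch, encoded_texture_attrs)
      g2(coarse_sketch, encoded_texture_attrs) -> sharp_sketch
      g3(sharp_sketch, all_attrs) -> face_image
    """
    coarse_sketch, encoded_texture = a2s(texture_attrs, noise)
    sharp_sketch = g2(coarse_sketch, encoded_texture)
    # Stage 3 is conditioned on texture and color attributes together.
    all_attrs = list(texture_attrs) + list(color_attrs)
    face = g3(sharp_sketch, all_attrs)
    return coarse_sketch, sharp_sketch, face


# Stub stages that return labelled strings instead of images.
def stub_a2s(texture_attrs, noise):
    return "coarse", ("encoded", tuple(texture_attrs))


def stub_g2(sketch, encoded_texture):
    return "sharp(" + sketch + ")"


def stub_g3(sketch, attrs):
    return "face(" + sketch + ", n_attrs=" + str(len(attrs)) + ")"


result = synthesize_face([1, 0], [1], [0.3], stub_a2s, stub_g2, stub_g3)
```

Only the texture attributes reach the two sketch stages, while the final generator sees the full attribute vector, mirroring the split between texture and color attributes described for training.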

4 Experimental Results

In this section, experimental settings and evaluation of the proposed method are discussed in detail. Results are compared with several state-of-the-art generative models: CVAE sohn2015learning adapted from yan2016attribute2image , text2img reed2016generative and StackGAN zhang2016stackgan . In addition, we compare the performance of our method with a baseline, attr2face, in which we attempt to recover the image directly from attributes without going through the intermediate stage of sketch. The entire network in Figure 2 is trained stage-by-stage using PyTorch.

4.1 Datasets

We conduct experiments using three publicly available datasets: CelebA liu2015faceattributes , deep funneled LFW Huang2012a and CUHK wang2009face . The CelebA database contains 202,599 face images, 10,177 different identities and 40 binary attributes for each face image. The deep funneled LFW database contains 13,233 images, 5,749 different identities and 40 binary attributes for each face image, which are from the LFWA dataset liu2015faceattributes . The CUHK dataset wang2009face consists of 88 real sketches and photos for training, and 100 real sketches and photos for testing. For each face image in the CUHK dataset, the corresponding sketch image was drawn by an artist while viewing the photo. Note that the training part of our network requires original face images and the corresponding sketch images as well as the corresponding list of visual attributes. The CelebA and the deep funneled LFW datasets consist of both the original images and the corresponding attributes, while the CUHK dataset consists of face-sketch image pairs. To generate the missing sketch images in the CelebA and the deep funneled LFW datasets, we use a pencil-sketch synthesis method (http://www.askaswiss.com/2016/01/how-to-create-pencil-sketch-opencv-python.html) to generate the sketch images from the face images. The missing attributes in the CUHK dataset were manually labeled. Figure 8(a) shows some sample generated sketch images from the CelebA and the deep funneled LFW datasets. Figure 8(b) shows examples of synthetic sketches, real sketches and real face images from CUHK.

Figure 8: Generated sketch images. (a) Sketch images from the LFW and the CelebA datasets are shown in row 1 and row 2, respectively. (b) Left to right: comparison of the composed sketch, real sketch and real photo from the CUHK dataset. As can be seen from this figure, the composed sketch images are very similar to the ones drawn by artists, and they preserve the texture and shading information present in the color images. Hence, they can be used as a good replacement for real sketches.

4.2 Preprocessing

The MTCNN method zhang2016joint was used to detect and crop faces from the original images. The detected faces were rescaled to a fixed size. Since many attributes from the original list of 40 attributes were not significantly informative, we selected the 23 most useful attributes for our problem. The selected attributes were further divided into 17 texture and 6 color attributes as shown in Table 1. During experiments, the texture attributes were used for generating sketches in the A2S and S2S stages, while all 23 attributes were used for generating high-quality face images in the final S2F stage.

Texture: Arched_Eyebrows, Bags_Under_Eyes, Bald, Bangs, Big_Lips, Big_Nose, Bushy_Eyebrows, Chubby, Male, Narrow_Eyes, No_Beard, Smiling, Young
Color: Black_Hair, Blond_Hair, Brown_Hair, Gray_Hair, Pale_Skin, Rosy_Cheeks

Table 1: List of fine-grained texture and color attributes.

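Splitting an attribute annotation into the two groups can be sketched as follows. The code uses only the attribute names that appear in Table 1 as reproduced here, takes a hypothetical dictionary of 0/1 labels as input, and returns the texture vector used by the sketch stages together with the full vector used by the final stage; the ordering of the vectors is an assumption.

```python
# Attribute names as listed in Table 1 of this text.
TEXTURE_ATTRIBUTES = [
    "Arched_Eyebrows", "Bags_Under_Eyes", "Bald", "Bangs", "Big_Lips",
    "Big_Nose", "Bushy_Eyebrows", "Chubby", "Male", "Narrow_Eyes",
    "No_Beard", "Smiling", "Young",
]
COLOR_ATTRIBUTES = [
    "Black_Hair", "Blond_Hair", "Brown_Hair", "Gray_Hair", "Pale_Skin",
    "Rosy_Cheeks",
]


def split_attributes(labels):
    """Return (texture_vector, full_vector) from a name -> 0/1 mapping.

    The texture vector is meant for the A2S and S2S stages; the full
    vector (texture followed by color) is meant for the S2F stage.
    """
    texture = [labels[name] for name in TEXTURE_ATTRIBUTES]
    color = [labels[name] for name in COLOR_ATTRIBUTES]
    return texture, texture + color


# Hypothetical annotation: a smiling face with black hair.
labels = {name: 0 for name in TEXTURE_ATTRIBUTES + COLOR_ATTRIBUTES}
labels["Smiling"] = 1
labels["Black_Hair"] = 1
texture_vec, full_vec = split_attributes(labels)
```

Keeping the color attributes out of the first vector reflects the observation that a sketch carries no color information.
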
Figure 9: Comparison of results from different configurations of the proposed network. For all subfigures (a) (b) (c) and (d): first column: output using the proposed method, second column: output with the specific configuration, and third column: reference images. (a) Results corresponding to the case where attributes are not used in Stage 3. (b) S2S reconstructions without the use of encoded texture attributes. (c) Results when the second stage of S2S is omitted from the pipeline. Results show reconstructions with wrong attributes. (d) Results when the second stage of S2S is omitted from the pipeline. Poor quality reconstructions are obtained when S2S is skipped from the proposed method.

4.3 Ablation Study

In this section, we perform an ablation study to demonstrate the effects of different modules in the proposed method. The following three configurations are evaluated.

  1. Omit attributes while enhancing the sketch images generated from the A2S stage. This will show the significance of using attributes while enhancing the sketch images in the S2S stage.

  2. Remove the second stage of sketch image enhancement from the entire pipeline. In other words, reconstruct the face image directly from the blurry sketch generated from A2S without enhancement. This will clearly show the significance of the S2S stage.

  3. Remove the attribute concatenation from the final S2F stage. This will show the significance of using attributes in the final stage of sketch-to-image generation.

Results corresponding to the above three configurations are shown in Figure 9. Results corresponding to the first experiment are shown in Figure 9(b), where the first, second and third columns show the outputs from the S2S stage of our method, reconstructions without the use of attributes in S2S, and the reference sketch, respectively. From this figure we clearly see that the attribute-conditioned generator produces sketches that are much better than the ones obtained when sketches are enhanced directly without conditioning on the attributes.

Results corresponding to the third experiment are shown in Figure 9(a), where the first, second and third columns show the reconstruction results from our method, reconstructions without using attributes in S2F, and reference images, respectively. As can be seen from this figure, the absence of attributes in the final stage results in reconstructions with wrong face features such as gender and hair. When attributes are used along with the sketch from S2S, the produced results have attributes that are very close to the ones corresponding to the original images. This can be clearly seen by comparing the first and last columns in Figure 9(a).

In the final experiment, we omit the second stage of S2S from our pipeline and attempt to reconstruct the image from attributes in a two-stage procedure. In other words, sketch images generated from the A2S stage are directly fed into the S2F stage. Results are shown in Figure 9(c) and (d). In both figures, first, middle and last columns show reconstructions from our method, without the second stage and reference images, respectively. As can be seen from these figures, omission of the S2S stage from our pipeline produces images that are of poor quality (see results in Figure 9(d)). The enhancement of sketches in Stage 2 not only produces sharper results but also with correct attributes (see results in Figure 9(c)).

4.4 CelebA Dataset Results

Figure 10: Image reconstruction results on the CelebA dataset. (a) Stage 1 results, (b) Stage 2 results, (c) Stage 3 results (output from our method), (d) CVAE yan2016attribute2image , (e) text2img reed2016generative , (f) StackGAN zhang2016stackgan , (g) attr2face, (h) reference sketch, (i) reference face image.

The CelebA dataset liu2015faceattributes consists of 162,770 training samples, 19,867 validation samples and 19,962 test samples. After preprocessing and combining the training and validation sets, we obtain 182,468 samples which we use for training our three-stage network. After preprocessing, the number of samples in the test set remains the same. During training, we used a batch size of 128. The ADAM algorithm adam_opt with a learning rate of 0.0002 is used. We keep this initial learning rate for the first 10 epochs. For the next 10 epochs, we let it drop after every epoch by 1/decay_epoch of its previous value, where decay_epoch is 10, i.e. by 1/10. The total training time was about 20 hours on a single Titan X GPU.
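
One literal reading of this schedule is sketched below: the rate is held constant for the first block of epochs and then multiplied by (1 - 1/decay_epoch) after every subsequent epoch. This multiplicative interpretation, the 0-indexed epoch convention and the function name are assumptions; a linear decay toward zero would be another plausible reading of the same description.

```python
def learning_rate(epoch, base_lr=0.0002, constant_epochs=10, decay_epoch=10):
    """Learning rate for a 0-indexed epoch under the assumed schedule.

    Epochs 0 .. constant_epochs - 1 use base_lr. After that, the rate
    loses 1/decay_epoch of its previous value at every epoch.
    """
    if epoch < constant_epochs:
        return base_lr
    steps = epoch - constant_epochs + 1
    return base_lr * (1.0 - 1.0 / decay_epoch) ** steps


# Rates over the 20 CelebA training epochs under this reading.
schedule = [learning_rate(e) for e in range(20)]
```

Under the same reading, a setting with 20 constant epochs and a drop of 1/20 corresponds to `learning_rate(e, constant_epochs=20, decay_epoch=20)`.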

Figure 11: Image reconstruction results on the LFWA dataset. (a) Stage 1 results, (b) Stage 2 results, (c) Stage 3 results (output from our method), (d) CVAE yan2016attribute2image , (e) text2img reed2016generative , (f) StackGAN zhang2016stackgan , (g) attr2face, (h) reference sketch, (i) reference face image.

Sample image reconstruction results corresponding to different methods from the CelebA test set are shown in Figure 10. As can be seen from this figure, text2img and StackGAN methods are able to provide attribute-preserved reconstructions, but the synthesized face images are distorted and contain many artifacts. The CVAE method is able to reconstruct the images without distortions but they are blurry. Also, some of the attributes are difficult to see in the reconstructions from the CVAE method. For example, hair color is hard to see in the reconstructed images. The attr2face baseline provides reasonable reconstructions but images are distorted. In comparison to these methods, the proposed method, as shown in (c), provides the best attribute-preserved reconstructions. This can be seen by comparing the attributes of images in (i) with (c). To show the improvements obtained from different stages of our method, we also show the results from Stage 1 and Stage 2 in (a) and (b), respectively.

4.5 LFWA Dataset Results

Images in the LFWA dataset come from the LFW dataset Huang2012a , LFWTech , and the corresponding attributes come from liu2015faceattributes . This dataset contains the same 40 binary attributes as the CelebA dataset. After preprocessing, the training and testing subsets contain 6,263 and 6,880 samples, respectively. The learning strategy for the ADAM method is the same as the one used for the CelebA dataset, except that the initial learning rate is kept the same for the first 20 epochs and is then dropped by 1/decay_epoch (here 1/20) of its previous value after every epoch.

Sample results corresponding to different methods on the LFWA dataset are shown in Figure 11. As can be seen from the results, the CVAE method produces reconstructions which are blurry and distorted. Attribute-conditioned GAN-based approaches such as text2img and StackGAN produce poor quality results with many distortions. The attr2face baseline and the proposed method show better reconstructions compared to the other methods. By comparing the reconstructions from our method in (c) with the images in (i), we see that the proposed method is able to reconstruct high-quality attribute-preserved face images. Again, outputs from Stage 1 and Stage 2 of our method are shown in (a) and (b), respectively.

4.6 CUHK Dataset Results

Figure 12: Image reconstruction results on the CUHK dataset. (a) Stage 1 results, (b) Stage 2 results, (c) Stage 3 results (output from our method), (d) CVAE yan2016attribute2image , (e) text2img reed2016generative , (f) StackGAN zhang2016stackgan , (g) attr2face, (h) reference sketch, (i) reference face image.

Instead of using composed sketches, as was done for the experiments on the CelebA and LFWA datasets, in this section we implement our algorithm using real sketches and photos from the CUFS dataset wang2009face . The CUFS dataset is relatively small. After preprocessing and data augmentation, such as flipping and rotation, we obtained 264 samples for training and 300 samples for testing. A batch size of 8 was used while training our network. The other settings are kept the same as for the CelebA dataset. Since this dataset does not come with attribute annotations, we manually annotated 23 attributes on this dataset.

Results corresponding to different methods are shown in Figure 12. The results are similar to those obtained on the CelebA and LFWA datasets. The text2img, StackGAN, and attr2face methods generate images with some visual artifacts, while the CVAE method produces blurry results. In contrast, our method produces the best results and generates photo-realistic and attribute-preserved face reconstructions.

4.7 Face Synthesis

Figure 13: Sample image synthesis on CelebA when attributes are changed while the noise vector is kept frozen. (a) Female to male. (b) Neutral to smile. (c) Original skin tone to pale skin tone. (d) Original hair color to black hair color.
Figure 14: Sample image synthesis results on (a) CelebA, and (b) LFWA when attributes are kept frozen while the noise vector is varied. Note that the identity changes as we vary the noise vector but the attributes stay the same on the reconstructed images.
Metric Dataset text2img reed2016generative StackGAN zhang2016stackgan CVAE yan2016attribute2image attr2face Attribute2Sketch2Face
Inception Score CelebA
Attribute CelebA
Table 2: Quantitative results corresponding to different methods. The Inception Score and Attribute measure are used to compare the performance of different methods.

In this section, we show the image synthesis capability of our network by manipulating the input attribute and noise vectors. Note that the testing phase of our network takes an attribute vector and noise as inputs and produces a face reconstruction as the output. In the first set of experiments, we keep the random noise vector fixed and vary the weight of a particular attribute from -1 to 1. The corresponding results on the CelebA dataset are shown in Figure 13. From this figure, we can see that when we give a higher weight to a certain attribute, the corresponding appearance changes. For example, one can synthesize an image with a different gender by changing the weights corresponding to the gender attribute, as shown in Figure 13(a). Each row shows the progression of gender change as the attribute weights are changed from -1 to 1. Similarly, Figures 13(b), (c) and (d) show the synthesis results when a neutral face image is transformed into a smiling face image, skin tones are changed to a pale skin tone, and hair colors are changed to black, respectively. It is interesting to see that when attribute weights other than the gender attribute are changed, the identity of the person does not change; only the attributes change.
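The first experiment can be sketched as building a batch of inputs in which one attribute weight sweeps from -1 to 1 while everything else, including the noise, is held fixed. The number of steps, the noise dimension (64), the standard normal noise distribution, and the names `attribute_sweep` and `generator` are all assumptions made for illustration; the text does not specify them.

```python
import numpy as np

def attribute_sweep(base_attrs, attr_index, num_steps=5):
    """Copies of base_attrs with one attribute weight varied from -1 to 1."""
    weights = np.linspace(-1.0, 1.0, num_steps)
    batch = np.tile(np.asarray(base_attrs, dtype=np.float32), (num_steps, 1))
    batch[:, attr_index] = weights  # only the chosen attribute changes
    return batch

rng = np.random.default_rng(0)
noise = rng.standard_normal(64).astype(np.float32)  # assumed noise vector

base = -np.ones(40, dtype=np.float32)   # 40 binary attributes, all "off"
sweep = attribute_sweep(base, attr_index=3, num_steps=5)
noise_batch = np.tile(noise, (len(sweep), 1))  # same noise in every row

# faces = generator(sweep, noise_batch)  # hypothetical trained network
```

Each row of `sweep` paired with the shared noise vector would correspond to one image in a row of Figure 13.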

In the second set of experiments, we keep the input attribute vector frozen but change the noise vector by inputting different realizations of it. Sample results corresponding to this experiment are shown in Figure 14(a) and (b) using the CelebA and LFWA datasets, respectively. Each column shows how the output changes as we change the noise vector. Different subjects are shown in different rows. It is interesting to note that, as we change the noise vector, attributes stay the same while the identity changes. This can be clearly seen by comparing the reconstructions in each row.
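The second experiment is the mirror image of the first: the attribute vector is repeated unchanged while a fresh noise vector is drawn for every sample. The sketch below assumes standard normal noise of dimension 64, which the text does not confirm, and the names `noise_variations` and `generator` are hypothetical.

```python
import numpy as np

def noise_variations(attrs, num_samples=4, noise_dim=64, seed=0):
    """Pair one frozen attribute vector with several noise realizations."""
    rng = np.random.default_rng(seed)
    attr_batch = np.tile(np.asarray(attrs, dtype=np.float32), (num_samples, 1))
    # Assumption: noise is drawn from a standard normal distribution.
    noise_batch = rng.standard_normal((num_samples, noise_dim)).astype(np.float32)
    return attr_batch, noise_batch

attrs = np.where(np.arange(40) % 2 == 0, 1.0, -1.0)  # arbitrary attribute vector
attr_batch, noise_batch = noise_variations(attrs, num_samples=4)

# faces = generator(attr_batch, noise_batch)  # hypothetical trained network
```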

4.8 Quantitative Results

In addition to the qualitative results presented in Figures 10, 11 and 12, we present quantitative comparisons based on the Inception Score salimans2016improved and the Attribute norm. The Inception Score is used to evaluate the realism and diversity of the generated samples and has been used before to evaluate the performance of deep generative methods bao2017cvae , zhang2016stackgan . The Attribute norm is used to compare the quality of attributes corresponding to different images. We extract the attributes from the synthesized images as well as the reference image using the MOON attribute prediction method rudd2016moon . Once the attributes are extracted, we simply take the norm of the difference between the two attribute vectors, which hold the 23 attributes extracted from the reference image and the synthesized image, respectively. Note that higher values of the Inception Score and lower values of the Attribute measure imply better performance. The quantitative results corresponding to different methods on the CelebA, LFWA and CUHK datasets are shown in Table 2. Results are evaluated on the test split of each dataset, and the average performance along with the standard deviation is reported.
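Both measures can be sketched in a few lines. The Inception Score follows its standard definition, the exponential of the mean KL divergence between each sample's class distribution and the marginal class distribution; here the class probabilities are taken as a given array instead of being computed by an Inception network. For the Attribute measure, the order of the norm is not recoverable from the text, so the L2 norm is used as an assumption, and the function names are hypothetical.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """exp(mean over samples of KL(p(y|x) || p(y))) for (N, C) probabilities."""
    probs = np.asarray(probs, dtype=np.float64)
    marginal = probs.mean(axis=0, keepdims=True)  # p(y)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))

def attribute_distance(ref_attrs, syn_attrs, order=2):
    """Norm of the difference between two attribute vectors.

    Assumption: order=2 (L2 norm); the order is not stated in the text.
    """
    diff = np.asarray(ref_attrs, dtype=np.float64) - np.asarray(syn_attrs, dtype=np.float64)
    return float(np.linalg.norm(diff, ord=order))

# Confident, evenly spread predictions over two classes give a score near 2.
score = inception_score([[1.0, 0.0], [0.0, 1.0]])

# One flipped attribute out of three gives an L2 distance of 2.
dist = attribute_distance([1, -1, 1], [1, 1, 1])
```

In practice the attribute vectors would come from an attribute predictor applied to the reference and synthesized images, and the per-image distances would be averaged over the test set.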


As can be seen from this table, the proposed Attribute2Sketch2Face method produces the highest inception scores implying that the images generated by our method are more realistic than the ones generated by other methods. Furthermore, our method produces the lowest Attribute scores. This implies that our method is able to generate attribute-preserved images better than the other compared methods. This can be clearly seen by comparing the images synthesized by different methods in Figures 10, 11 and 12.

5 Conclusion

We presented a novel deep generative framework for reconstructing face images from visual attributes. Our method makes use of an intermediate sketch representation to generate photo-realistic images. The training part of our method consists of three stages: A2S, S2S and S2F. The A2S stage is based on the CVAE model while the S2S and S2F stages are based on GANs. Novel UNet-based generators are proposed for the S2S and S2F stages. Experiments on three publicly available datasets show the significance of the proposed three-stage synthesis framework. In addition, an ablation study was conducted to show the importance of different components of our network. Together, these experiments showed that the proposed method is able to generate high-quality images and achieves significant improvements over the state-of-the-art methods.

This research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA R&D Contract No. 2014-14071600012. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.