Attribute-Guided Sketch Generation

01/28/2019 · by Hao Tang et al. · EPFL, University of Michigan, University of Oxford, Università di Trento, Texas State University

Facial attributes are important since they provide a detailed description and determine the visual appearance of human faces. In this paper, we aim at converting a face image to a sketch while simultaneously generating facial attributes. To this end, we propose a novel Attribute-Guided Sketch Generative Adversarial Network (ASGAN), an end-to-end framework containing two pairs of generators and discriminators, one of which is used to generate faces with attributes while the other is employed for image-to-sketch translation. The two generators form a W-shaped network (W-net) and are trained jointly with a weight-sharing constraint. Additionally, we propose two novel discriminators, a residual one focusing on attribute generation and a triplex one helping to generate realistic-looking sketches. To validate our model, we have created a new large dataset with 8,804 images, named the Attributed Face Photo & Sketch (AFPS) dataset, which is the first dataset containing attributes associated with face sketch images. The experimental results demonstrate that the proposed network (i) generates more photo-realistic faces with sharper facial attributes than baselines and (ii) has good generalization capability on different generative tasks.




I Introduction

Recently, there has been a new trend in computer vision of using machines to express the "creativity" of art. Novel, never-before-seen images can be generated by inverting the convolution process in CNNs ("upconvolution" or "deconvolution"), which gives such networks the ability to "dream" [24] and to generate images. Deep generative models have usually been adopted for these tasks, and they have also been employed for face-to-sketch translation. These models can be roughly divided into two categories: Generative Adversarial Networks (GANs) [5] and Variational AutoEncoders (VAEs) [12, 15, 20]. In GANs, two subnetworks, a generator and a discriminator, play a two-player minimax game with a value function [5]. The generator acts as a mapping function that converts an input image into a generated image so as to fool the discriminator, which is trained to distinguish generated images from real input images. In this work, we use GANs as the basis of our face-to-sketch translation model.

Fig. 1: Architecture comparison of three generators. The goal of this work is to generate facial attributes and sketches simultaneously. To achieve this target, we need to train two Encoder-Decoder networks [7] (Top), two U-nets [9] (Middle) or the proposed W-net (Bottom). The W-net consists of two subnetworks which are fused together via a novel joint learning strategy.

Face-to-sketch translation is quite challenging because it is a non-linear process conditioned on the appearance of the input face. To address this problem, several methods [38, 49, 52, 9, 46, 17] have been proposed for image-to-image translation problems that convert a photo to a sketch. However, these works focus only on face-to-sketch translation, ignoring the possibility of using facial attributes (e.g., facial expressions, age) and of generating sketches conditioned on external attributes (e.g., glasses, scarf).

To overcome this challenging problem, we present a novel Attribute-Guided Sketch Generative Adversarial Network (ASGAN) based on conditional generative adversarial networks. ASGAN contains two generators, G1 and G2, and two discriminators, D1 and D2. The two generators form a novel W-shaped network (W-net) and are learned jointly, as shown in Fig. 1. We set G1 as an Encoder-Decoder network [7] and G2 as a U-net [9]. The proposed W-net can jointly perform facial attribute generation and face-to-sketch translation. Our formulation is similar to Pix2pix [9], but differs significantly in that we extend it to generate sketches with attributes, conditioned not only on image priors but also on attribute labels. G1 learns the translation from images without attributes to images with attributes, guided by attribute labels. In this way, the different attributes can be learned simultaneously, as in StarGAN [2] and GGAN [42]. Next, G2 converts images to sketches carrying the learned attributes. In detail, the first branch of the W-net is a conditional encoder-decoder which learns to add attributes to the input images. This is implemented by fusing the embedding of the conditioning label with the embedding of the input face image at the bottleneck. The second branch learns to convert the generated attributed face images into attributed sketches. We train the two generators in an alternating way, and they share the same weights between the decoder of G1 and the encoder of G2, which constrains the network behavior and lets the two tasks benefit each other. Moreover, two loss functions are designed to train these generators, i.e., the attribute loss and the sketch loss. Besides, compared with a simple combination of two Pix2pix models, the proposed W-net needs only 75% of the parameters thanks to the weight-sharing strategy.
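The weight-sharing idea can be illustrated with a minimal PyTorch sketch. This is a hypothetical toy, not the paper's implementation: layer sizes and class names are invented, and only the sharing mechanism is shown. Because the shared block is a single module instance, its parameters are stored and updated once, which is the source of the parameter saving over two independent networks.

```python
import torch.nn as nn

# A single shared block: it plays the role of the decoder in the attribute
# generator AND the encoder in the sketch generator (illustrative sizes).
shared = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())

class AttrGen(nn.Module):          # face -> attributed face
    def __init__(self, shared_block):
        super().__init__()
        self.encoder = nn.Conv2d(3, 3, 3, padding=1)
        self.decoder = shared_block  # shared with the sketch generator

class SketchGen(nn.Module):        # attributed face -> attributed sketch
    def __init__(self, shared_block):
        super().__init__()
        self.encoder = shared_block  # shared with the attribute generator
        self.decoder = nn.Conv2d(8, 3, 3, padding=1)

g1, g2 = AttrGen(shared), SketchGen(shared)

# Counting parameters per-generator double-counts the shared tensors;
# counting unique tensors shows the actual (smaller) parameter budget.
total_separate = sum(p.numel() for p in g1.parameters()) \
               + sum(p.numel() for p in g2.parameters())
unique = {id(p): p.numel() for m in (g1, g2) for p in m.parameters()}
total_shared = sum(unique.values())
```

Since the two generators hold references to the same tensors, a gradient step through either network updates the shared weights for both.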

We also propose two novel discriminators, D1 and D2: the residual and the triplex discriminator. To focus on learning facial attributes, the residual discriminator D1 is trained to distinguish the residuals between the input images and the generated images from the residuals between the input images and the ground-truth images carrying the conditioned attributes. Moreover, since the input image, the attributed image, and the attributed sketch are strongly correlated, we feed the triplet (face, face with attribute, sketch with attribute) to the triplex discriminator D2, which is trained to distinguish real triplets from fake ones, as shown in Fig. 2. In this way, D2 can also take into account the correlations between the elements. To evaluate the quality of generated images, we present a novel evaluation metric, the Feature-Level Similarity Score (FLSS), inspired by feature matching. FLSS works as a complementary metric to the Inception Score (IS) and the Structural Similarity index (SSIM).
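A small NumPy sketch can make the two discriminator inputs concrete. This is an assumed illustration (shapes, variable names, and the channel-wise concatenation for the triplet are our own choices, not taken from the paper's code): the residual discriminator sees the difference image that isolates the added attribute, while the triplex discriminator sees the full triplet.

```python
import numpy as np

rng = np.random.default_rng(0)
face = rng.random((3, 64, 64))        # input face
attr_face = rng.random((3, 64, 64))   # face with attribute (real or generated)
sketch = rng.random((3, 64, 64))      # sketch with attribute (real or generated)

# Residual discriminator input: the residual between the attributed face and
# the input face, which concentrates the signal on the attribute itself.
residual_input = attr_face - face

# Triplex discriminator input: the triplet (face, attributed face, attributed
# sketch), here stacked along the channel axis as one possible encoding.
triplex_input = np.concatenate([face, attr_face, sketch], axis=0)
```

Feeding the residual rather than the full image means a discriminator cannot "cheat" by judging overall face realism; it must judge the attribute region.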

In addition, to validate the proposed network and the overall framework, we have collected a new dataset of 8,804 images, built from five existing datasets by adding visual attributes (glasses, beard, and bow tie), named the Attributed Face Photo & Sketch (AFPS) dataset. The AFPS consists of two subsets, i.e., face images with attributes and face sketches with attributes. To our knowledge, this is the first dataset containing attributes associated with face sketch images. Qualitative and quantitative experimental results demonstrate that the proposed ASGAN (i) generates more photo-realistic faces with sharper/more realistic facial attributes than the baselines on the AFPS dataset and (ii) has good generalization ability on other generative tasks, i.e., face colorization and face completion. In summary, the contributions of this paper are as follows:


  • We propose a novel Attribute Guided Sketch Generative Adversarial Network (ASGAN). The two generators in ASGAN form a novel W-net using a weight-sharing paradigm, which can generate faces with attributes and the corresponding sketches jointly.

  • We design two novel discriminators, the residual and the triplex discriminators. The former one focuses on attribute generation by distinguishing the generated attribute from the real attribute. The latter one focuses on sketch translation by distinguishing the real triplet tuple (original face, real face with attribute, real sketch with attribute) from the fake triplet tuple (original face, generated face with attribute, generated sketch with attribute).

  • We introduce the new AFPS dataset, which contains 8,804 face images and sketches with visual attributes. The new dataset will be made available to the research community.

II Related Work

Fig. 2: The structure of the proposed ASGAN. The generator is a W-net composed of G1 and G2, where G1 is an Encoder-Decoder network and G2 is a U-net. G1 receives the attribute labels at the bottleneck fully connected layer to guide the attribute generation. Both generators are made up of an encoder (Enc.) and a decoder (Dec.). We train G1 and G2 in an alternating way, and they share the same weights between the decoder of G1 and the encoder of G2. Moreover, the corresponding discriminators D1 and D2 are the residual and triplex discriminators.

Many works have tried to address the face-to-sketch problem. For instance, Song et al. [38] present the Bidirectional Transformation Network (BTN), which generates a whole face/sketch recursively using a small number of facial patches. A representation learning framework that generates the photo-sketch mapping through structure and texture decomposition is presented in [49]. Zhang et al. [52] propose a Cascaded Image Synthesis (CIS) strategy and integrate a Sparse Representation-based Greedy Search (SRGS) and Bayesian Inference (BI) for face sketch synthesis. Recently, Isola et al. [9] introduced a general framework for image-to-image translation problems, which can convert an image to a sketch. However, all of these works focus on the face-to-sketch translation task and cannot generate sketches conditioned on external facial attributes. Generating facial attributes (e.g., glasses, scarf, facial expression, age) associated with the sketch is very challenging.

Recently, Generative Adversarial Networks (GANs) [5] have drawn significant attention in both supervised and unsupervised learning, and many GAN variants have been explored. For example, Conditional GANs (CGANs) were introduced to solve ill-posed problems such as text-to-image translation [31, 21], image-to-image translation [9], facial attribute generation [35] and image super-resolution [16]. CGANs incorporate additional information into GANs, e.g., category labels [28, 25, 21, 42], text descriptions [50, 31], object keypoints [18, 32], human skeletons [41, 37], context [3] and conditional images [9, 48]. CGANs take as input a random noise and an additional non-random variable. For instance, Odena et al. construct an Auxiliary Classifier GAN (ACGAN) using the label as the conditioning prior. Reed et al. combine Recurrent Neural Networks (RNNs) with GANs for text-to-image translation using text descriptions, and introduce the Generative Adversarial What-Where Network (GAWWN) [32] to synthesize sharp and realistic images conditioned on both text and keypoints. A recent study [9] employs CGANs to generate photo-realistic images on different image generation tasks, such as label to street scene, aerial to map, day to night, edge to photo and gray-scale image to color image. Despite these successes, GANs still have many limitations, including difficulty of convergence, training instability, mode collapse, and low sample quality [30, 34], which may introduce artifacts that make the generated image visually unpleasant and artificial.

Based on CGANs, Isola et al. [9] developed a generic framework, "Pix2pix", which is suitable for different generative tasks. In Pix2pix, one conditional image is adopted as a reference during training. The generator in Pix2pix is a U-net, which tries to synthesize a fake image conditioned on the given conditional image in order to fool the discriminator, while the discriminator tries to identify the fake image by comparing it with the corresponding target image. Under these settings, the discriminator takes pairs of images as input. The U-net is an Encoder-Decoder network with skip connections, in which the encoder consists of multiple convolution layers and the decoder consists of multiple deconvolution layers. Isola et al. [9] added skip connections between each layer i and layer n−i, where n is the total number of layers, which allows feature sharing between the encoder and decoder. All channels at layer i are simply concatenated with those at layer n−i by the skip connections. [9] shares a similar goal with us, but it cannot solve the face-to-attributed-sketch translation task since it cannot convert a face image to a sketch conditioned on external facial attributes, whereas our ASGAN is specifically designed to tackle this task.
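The skip-connection pattern described above can be sketched with a deliberately tiny U-net in PyTorch. This is an illustrative toy (two levels, invented channel counts), not the Pix2pix architecture itself; it only demonstrates how encoder features at layer i are concatenated with decoder features at layer n−i, doubling the decoder's input channels.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Two-level U-net toy: one skip connection between mirrored layers."""
    def __init__(self):
        super().__init__()
        self.down1 = nn.Conv2d(3, 16, 4, stride=2, padding=1)   # 64 -> 32
        self.down2 = nn.Conv2d(16, 32, 4, stride=2, padding=1)  # 32 -> 16
        self.up1 = nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1)  # 16 -> 32
        # Input channels double (16 + 16) because of the skip concatenation.
        self.up2 = nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1)   # 32 -> 64

    def forward(self, x):
        e1 = torch.relu(self.down1(x))
        e2 = torch.relu(self.down2(e1))
        d1 = torch.relu(self.up1(e2))
        d1 = torch.cat([d1, e1], dim=1)  # skip: reuse low-level encoder features
        return torch.tanh(self.up2(d1))

out = TinyUNet()(torch.zeros(1, 3, 64, 64))
```

The concatenation lets low-level information (edges, contours) bypass the bottleneck, which is exactly what face-to-sketch translation needs since input and output share contours.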

III Model Description

Fig. 3: The proposed generator is a W-shaped network (W-net) (viewed from the right side) composed of two generators, G1 (the first and second rows) and G2 (the second and third rows). To learn all attributes simultaneously, G1 receives the attribute label as input to the fully connected layer at the bottleneck. The two generators are trained alternately in an end-to-end way. Besides, the convolution layers in the encoder of G2 share weights with the decoder of G1. Each block represents a Convolution-BatchNorm-ReLU structure. All ReLUs in the encoder are leaky, while those in the decoder are not. The arrows denote the different operations.

III-A Objective Function

GANs are generative models that learn a mapping from a random noise z to an output image y, G: z → y [5]. The objective function of a traditional GAN [5] can be formulated as follows:

L_GAN(G, D) = E_y[log D(y)] + E_z[log(1 − D(G(z)))].

In contrast, CGANs [9] learn a mapping from a conditional image x and a random noise z to y, G: {x, z} → y. The generator G is trained to produce outputs that cannot be distinguished from "real" images by an adversarial discriminator D, while the discriminator is trained to distinguish the "fake" images from the "real" ones. The objective function of a CGAN [9] can be expressed as follows:

L_CGAN(G, D) = E_{x,y}[log D(x, y)] + E_{x,z}[log(1 − D(x, G(x, z)))],

where G tries to minimize the objective function while D tries to maximize it, i.e.,

G* = arg min_G max_D L_CGAN(G, D).
Our ASGAN contains two generator/discriminator pairs: (G1, D1) controls facial attribute generation and (G2, D2) controls face-to-sketch translation. Similar to StarGAN [2] and GGAN [42], all attributes can be learned simultaneously, since G1 can receive an arbitrary facial attribute label. The facial attribute is represented by a one-hot vector c that distinguishes each attribute from the others: only the element corresponding to the label is set to 1, while the others are set to 0. The one-hot vector is passed through a linear layer to obtain a 64-dimensional feature embedding, which is then concatenated with the image embedding vector and passed to the fully connected layer at the bottleneck of G1. This training procedure is shown in Fig. 2. The loss functions of the facial attribute generation network and the face-to-sketch translation network can be written as

L_attr(G1, D1) = E_{x,y}[log D1(x, y)] + E_{x,c,z1}[log(1 − D1(x, G1(x, c, z1)))],

L_sketch(G2, D2) = E_{x,y,s}[log D2(x, y, s)] + E_{x,c,z1,z2}[log(1 − D2(x, ŷ, G2(ŷ, z2)))], with ŷ = G1(x, c, z1),

where x represents the input image without attribute, y denotes the ground-truth image with attribute, and s is the ground-truth sketch with attribute. The two noises z1 and z2 are sampled independently. The outputs are not two sketches but one face image with attribute, ŷ, and one sketch with attribute, ŝ = G2(ŷ, z2).
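The label injection at the bottleneck of the attribute-generation branch can be sketched as follows. This is a hedged illustration with assumed dimensions (the image-embedding size and the layer names are ours; only the 64-dimensional label embedding and the concatenation at the bottleneck come from the text).

```python
import torch
import torch.nn as nn

num_attrs, img_dim = 3, 512            # e.g. glasses / beard / bow tie; 512 is assumed
label_embed = nn.Linear(num_attrs, 64) # one-hot label -> 64-dim embedding
bottleneck = nn.Linear(img_dim + 64, img_dim)  # fuse label and image codes

one_hot = torch.zeros(1, num_attrs)
one_hot[0, 0] = 1.0                    # select one attribute, e.g. "glasses"
img_code = torch.randn(1, img_dim)     # stand-in for the flattened encoder output

# Concatenate the two embeddings and pass them through the bottleneck layer,
# so the decoder is conditioned on the requested attribute.
fused = bottleneck(torch.cat([img_code, label_embed(one_hot)], dim=1))
```

Because the label enters as a dense embedding rather than a hard switch, a single network can be trained once for all attributes and steered at test time by changing the one-hot vector.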

Previous approaches to CGANs have found it beneficial to mix the GAN objective with a more traditional loss, such as the L1 [9] or L2 loss [26]. Under that condition, the generator is asked not only to fool the discriminator but also to stay close to the ground-truth output in an L1 or L2 sense. We also explore this option, using the L1 distance rather than the L2, since L1 encourages less blurring [9]:

L_L1(G1, G2) = E[||y − G1(x, c, z1)||_1] + E[||s − G2(G1(x, c, z1), z2)||_1].

It is worth noting that there are many other loss functions we could try, e.g., the feature loss [10] and the total variation loss [19]. We did not use them in our experiments and list them here only for completeness. Therefore, the final objective is

G* = arg min_{G1,G2} max_{D1,D2} L_attr(G1, D1) + L_sketch(G2, D2) + λ L_L1(G1, G2),

where λ controls the relative importance of the L1 term.
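The composition of the final objective — adversarial terms plus a λ-weighted L1 reconstruction term — can be checked numerically with a toy sketch. Everything here is illustrative: the adversarial part is a placeholder scalar, the tensors are dummies, and λ = 100 is simply a value inside the range the experiments later find to work well.

```python
import torch

def l1_term(fake_face, real_face, fake_sketch, real_sketch):
    # L1 reconstruction on both outputs: attributed face and attributed sketch.
    return (fake_face - real_face).abs().mean() + \
           (fake_sketch - real_sketch).abs().mean()

lam = 100.0                                   # assumed weight for the L1 term
real_y = torch.zeros(1, 3, 8, 8)              # dummy ground-truth attributed face
real_s = torch.zeros(1, 3, 8, 8)              # dummy ground-truth attributed sketch
fake_y = torch.ones(1, 3, 8, 8)               # dummy generator outputs
fake_s = torch.ones(1, 3, 8, 8)

adv = torch.tensor(0.5)                       # placeholder for the adversarial losses
total = adv + lam * l1_term(fake_y, real_y, fake_s, real_s)
```

With these dummies each L1 term is exactly 1, so the total is 0.5 + 100 × 2; in real training the L1 term dominates early and keeps the outputs anchored to the ground truth while the adversarial terms sharpen them.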

In our experiments, instead of using Gaussian noises z1 and z2, we generate the noise via the dropout layers, which is consistent with [9].

III-B Network Architectures

The architectures of the generators and discriminators in this paper are developed based on [9]. The face-to-sketch translation problem is essentially one of mapping a high-resolution input face image to a high-resolution output sketch. In addition, although the input image and the output sketch differ in appearance, the sketch is a rendering of the real image and they share the same contour; the contour of the input image is therefore roughly aligned with that of the output sketch. Many previous solutions to this problem [26, 43, 10] employ an Encoder-Decoder network [7]. In such a network, the input is passed through a series of layers and progressively down-sampled until a bottleneck layer is reached, at which point the process is reversed. Such a network requires that all information flow through all the layers, including the bottleneck. For many image translation problems, there is a great deal of low-level information shared between the input and output, and it is desirable to transfer this information directly across the network. For example, in face-to-sketch translation the input image and the output sketch share the locations of prominent structures and edges. The generator architecture in [9] takes this into consideration: to help the generator circumvent the bottleneck for such information, [9] adds skip connections, following the general shape of a U-net [33].

Generator Architectures. In this paper, we present a novel generator which is a W-shaped network consisting of two generators, G1 and G2. The components of the W-shaped network are described below.

Generator G1: As shown in Fig. 3, the encoder in G1 has eight Convolution-BatchNorm-ReLU layers, while the decoder also has eight layers, i.e., three Convolution-BatchNorm-Dropout-ReLU layers and five Convolution-BatchNorm-ReLU layers. The dropout rate is 50%. The numbers of convolution feature maps in the encoder and decoder follow the Pix2pix generator configuration [9]. In the encoder and decoder, the kernels in all convolution layers have the same shape and the same stride of 2. The output of each convolution layer in the encoder is down-sampled by 2, while the output of each convolution layer in the decoder is up-sampled by 2. At the end of the decoder, a convolution is applied to map to a 3-channel result, followed by a tanh function. BatchNorm is not employed in the first layer of the encoder. All ReLUs in the encoder are leaky, with slope 0.2, while the ReLUs in the decoder are not leaky.
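The basic encoder building block described above can be sketched in PyTorch. The kernel size of 4×4 is an assumption carried over from the Pix2pix convention [9] (the exact shape is not restated here); the stride of 2, the LeakyReLU slope of 0.2, and the BatchNorm-free first layer match the text, while the channel counts in the usage line are illustrative.

```python
import torch
import torch.nn as nn

def enc_block(in_ch, out_ch, batchnorm=True):
    """Convolution-BatchNorm-LeakyReLU block: stride-2 conv halves resolution."""
    layers = [nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1)]
    if batchnorm:            # the first encoder layer omits BatchNorm
        layers.append(nn.BatchNorm2d(out_ch))
    layers.append(nn.LeakyReLU(0.2))
    return nn.Sequential(*layers)

x = torch.randn(2, 3, 64, 64)
y = enc_block(3, 64, batchnorm=False)(x)  # first layer: no BatchNorm
```

Stacking eight such blocks halves the spatial size each time, which is how the encoder reaches its bottleneck.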

Generator G2: We aim to generate the attributed face and the corresponding sketch jointly. Therefore, we select a U-net as the sketch generation model, as it can reuse the features of the attributed face via its skip connections. In this way, we formulate the problem as a multi-task learning problem, and both the attributed face and the corresponding sketch benefit from each other. As shown in Fig. 3, the generator G2 is a U-net, i.e., an Encoder-Decoder network with skip connections between mirrored layers in the encoder and decoder stacks [33]. The number of convolution feature maps of the encoder in G2 is the same as that of the decoder in G1. Besides, the convolution layers in the encoder of G2 share their weights with the decoder of G1, which makes the final output capture not only the facial attributes but also the face-to-sketch translation. However, the skip connections change the number of channels in the decoder: each decoder layer of G2 receives twice as many input channels because of the concatenation.

Discriminator Architectures. We adopt discriminators D1 and D2 similar to [9]. The discriminators are built with the basic Convolution-BatchNorm-ReLU layer. All ReLUs are leaky, with slope 0.2. The numbers of convolution feature maps follow the discriminator configuration of [9]. After the last layer, a convolution is applied to map to a one-dimensional output, followed by a sigmoid function. BatchNorm is not adopted in the first layer.

III-C Optimization

We follow the standard optimization method of [5] to optimize the proposed ASGAN: we alternate between one gradient descent step on the discriminators D1 and D2 with G1 and G2 fixed, and one step on the generators G1 and G2 with D1 and D2 fixed. In addition, following the suggestion in [5], we train the generator G1 (or G2) to maximize log D1(x, G1(x, c, z1)) (or log D2(x, ŷ, G2(ŷ, z2))) rather than minimize log(1 − D1(·)) (or log(1 − D2(·))). Moreover, in order to slow down the learning of the discriminator D1 (or D2) relative to the generator G1 (or G2), we divide its objective by 2 while optimizing D1 (or D2). The discriminator objectives are implemented with the Binary Cross Entropy (BCE) loss. We employ the minibatch SGD algorithm with the Adam optimizer [11] as the solver; the momentum terms β1 and β2 of Adam are 0.5 and 0.999, respectively. The initial learning rate for Adam is 0.0002.
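The alternating update scheme can be sketched with placeholder linear models (the real generators and discriminators are convolutional; here only the update order, the halved discriminator objective, the non-saturating generator target, and the Adam settings from the text are shown).

```python
import torch
import torch.nn as nn

G, D = nn.Linear(4, 4), nn.Linear(4, 1)       # stand-ins for generator/discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
bce = nn.BCEWithLogitsLoss()
real, z = torch.randn(8, 4), torch.randn(8, 4)

# Discriminator step (objective divided by 2 to slow D relative to G);
# the generator output is detached so only D is updated.
opt_d.zero_grad()
d_loss = 0.5 * (bce(D(real), torch.ones(8, 1)) +
                bce(D(G(z).detach()), torch.zeros(8, 1)))
d_loss.backward()
opt_d.step()

# Generator step: maximize log D(G(z)) (non-saturating form) by labelling
# the fakes as "real", rather than minimizing log(1 - D(G(z))).
opt_g.zero_grad()
g_loss = bce(D(G(z)), torch.ones(8, 1))
g_loss.backward()
opt_g.step()
```

The same two-phase pattern is applied to both generator/discriminator pairs in ASGAN, with the discriminators held fixed during the generator step and vice versa.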

III-D Feature-Level Similarity Score

Currently, there are two kinds of evaluation for the image generation task: qualitative and quantitative. On the one hand, qualitative results are usually judged by human observers, e.g., previous works [9, 1] conduct user surveys on generated images and collect the scores given by the users, while others [9, 51] run "real vs fake" perceptual studies on Amazon Mechanical Turk (AMT). On the other hand, quantitative results are usually computed by algorithms, e.g., [28, 4, 6, 47] use an image distance such as MSE (Mean Squared Error), SSIM (Structural Similarity) or PSNR (Peak Signal-to-Noise Ratio) to measure the quality of the results. Others apply Inception or ResNet models to every generated image [39, 25, 41]. Recent works adopt additional classifiers to check whether the generated images can be correctly detected [43], segmented [9] or classified [8, 34, 30].

It is clear that no single metric can measure the quality of generated images accurately and holistically [34]. Moreover, measures such as SSIM and IS (Inception Score) are not always reliable, as shown in [36, 10, 18]: sometimes sharper and more photo-realistic generated images obtain a lower SSIM or IS. We therefore present an alternative method to measure the quality of generated images, called the Feature-Level Similarity Score (FLSS), which is analogous to pixel-level similarity methods [48]. The idea of the FLSS derives from feature matching methods in image retrieval [40, 54]. Let x̂_i denote the generated images and x_i the ground-truth images, where i is the image index. Given a feature extraction function φ, our task is to extract features from x̂_i and x_i and match them:

m_i = match(φ(x̂_i), φ(x_i)),

where m_i denotes the number of matches between the two sets of descriptors φ(x̂_i) and φ(x_i). In other words, we perform feature matching between the two descriptor sets and then compute how many features (matches) are shared between them:

FLSS = (1/N) Σ_i m_i / |φ(x_i)|,

where |·| denotes the cardinality of a set, N denotes the number of generated (or ground-truth) images in a dataset, and FLSS is the final similarity score of the dataset.
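The FLSS idea can be sketched with NumPy. This is a hedged toy: random descriptor arrays stand in for φ(·) (the paper's implementation uses SIFT descriptors matched via VLFeat), nearest-neighbour matching under a distance threshold stands in for the matcher, and the threshold value is arbitrary.

```python
import numpy as np

def flss_pair(desc_gen, desc_gt, thresh=0.5):
    """Fraction of ground-truth descriptors with a close match among the
    generated image's descriptors (toy nearest-neighbour matching)."""
    # Pairwise distances between every GT descriptor and every generated one.
    dists = np.linalg.norm(desc_gt[:, None, :] - desc_gen[None, :, :], axis=2)
    matches = (dists.min(axis=1) < thresh).sum()
    return matches / len(desc_gt)

rng = np.random.default_rng(0)
gt = rng.random((20, 8))                       # stand-in descriptors of one image
score_same = flss_pair(gt, gt)                 # identical images: every feature matches
score_rand = flss_pair(rng.random((20, 8)), gt)  # unrelated descriptors: partial score
```

Averaging such per-image scores over a dataset gives the final FLSS; because matching runs on compact descriptor sets rather than full pixel grids, it is also cheap to compute, consistent with the timing comparison reported later.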

Dataset Image Source Type Train. # Test. # Total
CUFS [44] CUHK student Hand-Drawn 94 94 188
CUFSF [53] FERET [29] Hand-Drawn 562 561 1,123
E-PRIP [23] AR [22] Composite 70 16 86
PRIP-VSGC [14] AR [22] Composite 70 16 86
Caricature [13] Internet Caricatural 101 30 131
Total AFPS dataset - 897 717 1,614
TABLE I: Key characteristics of the proposed AFPS dataset.

IV Experiments

In this section, we first introduce the dataset used and the implementation details. We then show detailed qualitative and quantitative results and analyses.

IV-A Datasets

To evaluate the proposed method and the overall framework, we investigated the available face sketch datasets [27]. However, none of them includes attributes on sketches. Therefore, we used a public photo-editing software to add the attributes (glasses, beard and bow tie) to images from five existing datasets, which are then combined into the Attributed Face Photo & Sketch (AFPS) dataset. The AFPS contains three types of face sketches: hand-drawn, composite, and caricatural. The main characteristics of the AFPS are listed in Table I. The AFPS contains 897 training samples and 717 testing samples. Each sample has two images, i.e., a face photo with attribute and a face sketch with attribute. Since several face images in the Caricature, E-PRIP and PRIP-VSGC datasets do not leave room to augment them with a bow tie, and the Caricature dataset already contains faces with a mustache, we omit the bow tie and beard attributes for the Caricature dataset and the bow tie attribute for both the E-PRIP and PRIP-VSGC datasets. The total number of images in AFPS is therefore 8,804.

IV-B Implementation Details

The proposed network is implemented in the PyTorch deep learning framework. In our experiments, all images are re-scaled to a fixed resolution, and all models are trained for 200 epochs on each dataset. For all datasets, we apply left-right flips for data augmentation. At training time, we need to input x, y and s, while at testing time we only require x, and the model outputs ŷ and ŝ. The weights are initialized from a Gaussian distribution with mean 0 and standard deviation 0.02. Training and testing are conducted on an Nvidia TITAN Xp GPU with 12 GB of memory.
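The weight initialization described above can be expressed as a small PyTorch helper. The module layout is illustrative; only the Gaussian with mean 0 and standard deviation 0.02 comes from the text (zero-initialized biases are a common companion choice, assumed here).

```python
import torch
import torch.nn as nn

def init_weights(m):
    """Draw conv weights from N(0, 0.02^2), as described in the paper."""
    if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d)):
        nn.init.normal_(m.weight, mean=0.0, std=0.02)
        if m.bias is not None:
            nn.init.zeros_(m.bias)   # assumed: biases start at zero

net = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(), nn.Conv2d(16, 3, 3))
net.apply(init_weights)              # recursively applies to every submodule
w = net[0].weight.detach()
```

`Module.apply` walks the whole network, so the same helper covers arbitrarily deep generator and discriminator stacks.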

Fig. 4: Results of the proposed ASGAN on the face-to-attributed-sketch task with different losses on the CUFS (top) and CUFSF (bottom) datasets. Ground truths are provided only for comparison purposes.

IV-C Evaluation

Analysis of the Loss Function. We first investigate the performance of the proposed ASGAN using the objective in Eq. 8. Ablation experiments are run to isolate the effects of the attribute loss term, the sketch loss term and the L1 loss term. Figs. 4 and 5 show qualitative results of the variants on the face-to-attributed-sketch translation task. The attribute term gives the faces attributes. The sketch term controls the face-to-sketch translation. Finally, the L1 term together with the attribute and sketch terms generates photo-realistic sketches and sharp facial attributes simultaneously. Note that the ground-truth sketches of the E-PRIP and PRIP-VSGC datasets do not include torsos, so we do not show results generated with the bow tie attribute. We also generate a beard on the face of a lady, as shown in the first row of Fig. 5, only to validate that the proposed ASGAN is able to generate the requested attributes. At testing time, we can control the proposed ASGAN to generate the desired attributes.

Fig. 5: Results of the proposed ASGAN on the face-to-attributed-sketch translation task on the E-PRIP (top) and PRIP-VSGC (bottom) datasets. Ground truths are provided only for comparison purposes.

We also provide quantitative results for the generated sketches. SSIM [45] and the proposed FLSS are adopted to measure the quality of the generated images. The SIFT matching function of the VLFeat library is used in the implementation of the FLSS. We list the average results over all attributes with SSIM and FLSS in Table II. As we can see from Table II, the attribute + sketch + L1 model produces higher-quality results and more photo-realistic attributes than the other loss combinations. Besides, we also report the time needed to evaluate the quality of the generated images. The proposed FLSS is much faster than SSIM, which highlights the advantage of the proposed evaluation metric.

Model (SSIM FLSS per dataset) | CUFS | CUFSF | E-PRIP | PRIP-VSGC
attribute loss + sketch loss | 0.5090 0.1571 | 0.3454 0.1225 | 0.3761 0.0790 | 0.3373 0.0781
attribute + sketch + L1, ASGAN (Ours) | 0.6249 0.2534 | 0.3515 0.1567 | 0.4864 0.1280 | 0.4122 0.1258
Encoder-Decoder [7] | 0.5025 0.1387 | 0.3440 0.1386 | 0.4677 0.1190 | 0.4030 0.1119
Pix2pix [9] | 0.6127 0.2431 | 0.3488 0.1517 | 0.4781 0.1264 | 0.4018 0.1136
Time (s), SSIM FLSS | 239.9 15.3 | 1416.8 102.6 | 40.3 2.7 | 40.5 2.5 | 85.8 5.9 (Caricature)
TABLE II: Quantitative results of the generated sketches using different losses on the proposed AFPS dataset, compared with the Encoder-Decoder [7] and Pix2pix [9] models. For both SSIM and FLSS, higher is better. For the Caricature dataset cell, the top value is the result of the colorization (col.) task and the bottom one the result of the completion (com.) task.

Comparison against Baselines. Our joint learning task aims to generate attributed faces and sketches simultaneously from a single face image, which is quite novel. Therefore, we compare our method only with the most related ones, i.e., the Encoder-Decoder network [7] and Pix2pix [9], which are the most successful models for image-to-image translation. Note that the baselines require training two encoder-decoder networks or two U-nets, which doubles the parameters, while the proposed ASGAN has only 75% of the parameters of the baselines since we share the parameters between the two generators. Moreover, the baselines must be trained separately for every attribute, while the proposed ASGAN needs to be trained only once. Qualitative results are shown in Fig. 6 and quantitative results are listed in Table II. We can observe in Fig. 6 that the proposed method achieves superior results compared with the Encoder-Decoder network [7] and Pix2pix [9]. The proposed ASGAN consistently generates clear and convincing visual attributes, and produces more vivid and higher-quality face sketches than the baselines. As shown in Table II, the proposed ASGAN consistently outperforms the Encoder-Decoder network [7] and Pix2pix [9] on both metrics across the datasets.

Fig. 6: Comparisons of the Encoder-Decoder net [7], Pix2pix [9] and ASGAN (ours) for face-to-attributed-sketch translation task. Ground Truths (GT) are provided only for comparison.

Moreover, we follow the same protocol from Pix2pix [9] to run “real vs fake” perceptual studies on Amazon Mechanical Turk (AMT). The average results of all facial attributes are listed in Table III. We can observe that the proposed ASGAN achieves better performance than Encoder-Decoder net [7] and Pix2pix [9] on both the generated faces and sketches.

% Turkers labeled real CUFS CUFSF E-PRIP PRIP-VSGC
Model attr. sket. attr. sket. attr. sket. attr. sket.
Encoder-Decoder [7] 1.3% 0.7% 1.1% 15.3% 3.1% 0.3% 2.8% 1.6%
Pix2pix [9] 40.1% 3.3% 29.3% 10.7% 40.7% 3.2% 11.8% 3.2%
ASGAN (Ours) 42.7% 6.9% 46.7% 20.4% 43.7% 4.6% 15.6% 6.3%
TABLE III: AMT "real vs fake" test of the generated attributed faces (attr.) and attributed sketches (sket.) on the AFPS dataset, compared with the Encoder-Decoder [7] and Pix2pix [9] models.

Cross-Dataset Experiments. To further investigate the generalization ability of the proposed network, we also perform cross-dataset experiments. The model is trained on the E-PRIP dataset but tested on face photos from the PRIP-VSGC dataset, and vice versa. In Fig. 7, we can observe that the outputs follow the sketch style of the training dataset rather than that of the ground truths, while exhibiting the correct visual attributes. Thus, we conclude that the trained ASGAN has correctly learned the face-to-attributed-sketch mapping.

(a) Test on PRIP-VSGC; trained on E-PRIP.
(b) Test on E-PRIP; trained on PRIP-VSGC.
Fig. 7: Cross-dataset test results of the proposed method on the face-to-attributed-sketch translation task on the E-PRIP [23] and PRIP-VSGC [14] datasets, compared to the ground truth. The attributes are beard and glasses, from top to bottom respectively.

Influence of Hyper-Parameters. We also analyze the influence of the hyper-parameter λ and of the number of training samples. We set λ to 1, 10, 50, 100, 150, 200, 600 and 1,000, respectively. The performance for different λ is shown in Fig. 8 (a), where the horizontal axis represents the value of λ and the vertical axis the SSIM and FLSS scores. For both metrics, larger is better. We observe that better performance is achieved when λ is between 50 and 200. In addition, we have also investigated the influence of the number of training samples. Results are displayed in Fig. 8 (b). Note that in most cases SSIM and FLSS increase when more training samples are used.

Fig. 8: FLSS and SSIM with different values of λ (Left) and different numbers of training samples (Right).

Other Applications. To further evaluate the effectiveness of the proposed network on different generative tasks, the Caricature dataset [13] is employed for the face colorization and face completion tasks. Test results are shown in Fig. 9 and Table II. For face colorization, our approach is effective at creating reasonable and realistic color renderings while generating the correct facial attribute. For the completion task, the Encoder-Decoder network [7] cannot generate plausible face images, while Pix2pix [9] generates face images but without convincing attributes. The proposed ASGAN yields better qualitative results than both baselines, and the missing part is restored correctly with glasses. The good results on both tasks indicate that the proposed approach can be useful for other generative tasks.

Fig. 9: Test results on the Caricature dataset [13] for the face colorization (Left) and completion (Right) tasks, compared to the Encoder-Decoder network [7], Pix2pix [9], the proposed ASGAN and the ground truth. The input and output images are with and without glasses, respectively. For the face completion task, we add random noise to the central square patch of the input image.
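The completion inputs are formed by overwriting the central square patch of each face with random noise. A minimal sketch of that corruption step is shown below; the patch size (`frac`), the uniform noise distribution, and the uint8 image assumption are our own choices for illustration — the paper only states that random noise is added to the central square patch.

```python
import numpy as np

def corrupt_center(img, frac=0.5, seed=0):
    """Replace the central square patch of a uint8 image with
    uniform random noise, producing an input for the completion
    task. `frac` is the patch side relative to the image side
    (an assumed value; the paper does not state the exact size)."""
    rng = np.random.default_rng(seed)
    out = img.copy()
    h, w = img.shape[:2]
    ph, pw = max(1, int(h * frac)), max(1, int(w * frac))
    top, left = (h - ph) // 2, (w - pw) // 2
    patch = out[top:top + ph, left:left + pw]
    # Overwrite only the central patch; the border is left intact.
    out[top:top + ph, left:left + pw] = rng.integers(
        0, 256, size=patch.shape, dtype=out.dtype)
    return out
```

Fixing the seed makes the corruption reproducible across evaluation runs, so all compared models receive identical inputs.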

V Conclusion

In this paper, we present a novel Attribute-Guided Sketch Generative Adversarial Network (ASGAN) which consists of two generators and two discriminators. The generators are jointly trained through a weight-sharing strategy and generate attributed faces and attributed sketches at the same time. We also propose two novel discriminators, the residual and the triplex discriminators. Experimental results demonstrate that the proposed ASGAN generates more photo-realistic faces with sharper facial attributes than the baselines on three different generative tasks, i.e., face-to-attributed-sketch translation, face colorization and face completion. Finally, the proposed ASGAN has great potential for forensic sketch synthesis, which is our future research direction.


  • [1] Y. Cao, Z. Zhou, W. Zhang, and Y. Yu. Unsupervised diverse colorization via generative adversarial networks. In ECML-PKDD, 2017.
  • [2] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In CVPR, 2018.
  • [3] E. Denton, S. Gross, and R. Fergus. Semi-supervised learning with context-conditional generative adversarial networks. arXiv:1611.06430, 2016.
  • [4] A. Deshpande, J. Lu, M.-C. Yeh, and D. Forsyth. Learning diverse image colorization. In CVPR, 2017.
  • [5] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
  • [6] Y. Güçlütürk, U. Güçlü, R. van Lier, and M. A. van Gerven. Convolutional sketch inversion. In ECCV, 2016.
  • [7] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
  • [8] S. Iizuka, E. Simo-Serra, and H. Ishikawa. Let there be color!: Joint end-to-end learning of global and local image priors for automatic image colorization with simultaneous classification. ACM TOG, 35(4):110, 2016.
  • [9] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
  • [10] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.
  • [11] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2014.
  • [12] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2013.
  • [13] B. F. Klare, S. S. Bucak, A. K. Jain, and T. Akgul. Towards automated caricature recognition. In ICB, 2012.
  • [14] S. J. Klum, H. Han, B. F. Klare, and A. K. Jain. The facesketchid system: Matching facial composites to mugshots. IEEE TIFS, 9(12):2248–2263, 2014.
  • [15] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther. Autoencoding beyond pixels using a learned similarity metric. In ICML, 2015.
  • [16] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.
  • [17] Y. Lu, S. Wu, Y.-W. Tai, and C.-K. Tang. Image generation from sketch constraint using contextual GAN. In ECCV, 2018.
  • [18] L. Ma, X. Jia, Q. Sun, B. Schiele, T. Tuytelaars, and L. Van Gool. Pose guided person image generation. In NIPS, 2017.
  • [19] A. Mahendran and A. Vedaldi. Understanding deep image representations by inverting them. In CVPR, 2015.
  • [20] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. Adversarial autoencoders. In ICLR, 2016.
  • [21] E. Mansimov, E. Parisotto, J. L. Ba, and R. Salakhutdinov. Generating images from captions with attention. In ICLR, 2016.
  • [22] A. M. Martinez. The AR face database. CVC Technical Report, 24, 1998.
  • [23] P. Mittal, A. Jain, G. Goswami, R. Singh, and M. Vatsa. Recognizing composite sketches with digital face images via SSD dictionary. In IJCB, 2014.
  • [24] A. Mordvintsev, C. Olah, and M. Tyka. Inceptionism: Going deeper into neural networks. Google Research Blog. Retrieved June, 20:14, 2015.
  • [25] A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier GANs. In ICLR, 2017.
  • [26] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.
  • [27] C. Peng, N. Wang, X. Gao, and J. Li. Face recognition from multiple stylistic sketches: Scenarios, datasets, and evaluation. In ECCV, 2016.
  • [28] G. Perarnau, J. van de Weijer, B. Raducanu, and J. M. Álvarez. Invertible conditional GANs for image editing. In NIPSW, 2016.
  • [29] P. J. Phillips, H. Moon, S. A. Rizvi, and P. J. Rauss. The FERET evaluation methodology for face-recognition algorithms. IEEE TPAMI, 22(10):1090–1104, 2000.
  • [30] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
  • [31] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text-to-image synthesis. In ICML, 2016.
  • [32] S. E. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee. Learning what and where to draw. In NIPS, 2016.
  • [33] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
  • [34] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. In NIPS, 2016.
  • [35] W. Shen and R. Liu. Learning residual images for face attribute manipulation. In CVPR, 2017.
  • [36] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In CVPR, 2016.
  • [37] A. Siarohin, E. Sangineto, S. Lathuilière, and N. Sebe. Deformable GANs for pose-based human image generation. In CVPR, 2018.
  • [38] Y. Song, Z. Zhang, and H. Qi. Recursive cross-domain face/sketch generation from limited facial parts. arXiv:1706.00556, 2017.
  • [39] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
  • [40] H. Tang and H. Liu. A novel feature matching strategy for large scale image retrieval. In IJCAI, 2016.
  • [41] H. Tang, W. Wang, D. Xu, Y. Yan, and N. Sebe. Gesturegan for hand gesture-to-gesture translation in the wild. In ACM MM, 2018.
  • [42] H. Tang, D. Xu, W. Wang, Y. Yan, and N. Sebe. Dual generator generative adversarial networks for multi-domain image-to-image translation. In ACCV, 2018.
  • [43] X. Wang and A. Gupta. Generative image modeling using style and structure adversarial networks. In ECCV, 2016.
  • [44] X. Wang and X. Tang. Face photo-sketch synthesis and recognition. IEEE TPAMI, 31(11):1955–1967, 2009.
  • [45] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE TIP, 13(4):600–612, 2004.
  • [46] L. Wolf, Y. Taigman, and A. Polyak. Unsupervised creation of parameterized avatars. In ICCV, 2017.
  • [47] X. Yan, J. Yang, K. Sohn, and H. Lee. Attribute2image: Conditional image generation from visual attributes. In ECCV, 2016.
  • [48] D. Yoo, N. Kim, S. Park, A. S. Paek, and I. S. Kweon. Pixel-level domain transfer. In ECCV, 2016.
  • [49] D. Zhang, L. Lin, T. Chen, X. Wu, W. Tan, and E. Izquierdo. Content-adaptive sketch portrait generation by decompositional representation learning. IEEE TIP, 26(1):328–339, 2017.
  • [50] H. Zhang, T. Xu, H. Li, S. Zhang, X. Huang, X. Wang, and D. Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017.
  • [51] R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. In ECCV, 2016.
  • [52] S. Zhang, X. Gao, N. Wang, and J. Li. Face sketch synthesis from a single photo–sketch pair. IEEE TCSVT, 27(2):275–287, 2017.
  • [53] W. Zhang, X. Wang, and X. Tang. Coupled information-theoretic encoding for face photo-sketch recognition. In CVPR, 2011.
  • [54] L. Zheng, S. Wang, W. Zhou, and Q. Tian. Bayes merging of multiple vocabularies for scalable image retrieval. In CVPR, 2014.