Composition-aided Sketch-realistic Portrait Generation

12/04/2017 ∙ by Fei Gao, et al.

Sketch portrait generation has a wide range of applications, including digital entertainment and law enforcement. Despite the great progress achieved by existing face sketch generation methods, they mostly yield blurring and great deformation over various facial parts. To tackle this challenge, we propose a novel composition-aided generative adversarial network (CA-GAN) for sketch portrait generation. First, we utilize paired inputs consisting of a face photo and the corresponding pixel-wise face labels for generating the portrait. Second, we propose an improved pixel loss, termed the compositional loss, to focus training on hard-to-generate components and delicate facial structures. Moreover, we use stacked CA-GANs (SCA-GAN) to further rectify defects and add compelling details. Experimental results show that our method is capable of generating identity-preserving, sketch-realistic, and visually comfortable sketch portraits over a wide range of challenging data, and outperforms existing methods. Besides, our methods show considerable generalization ability.


I Introduction

Face photo-sketch synthesis refers to synthesizing a face sketch (or photo) given an input face photo (or sketch). It has a wide range of applications, such as digital entertainment and law enforcement. Ideally, the synthesized photo or sketch portrait should be appearance-preserving and photo/sketch-realistic, so that it yields both high sketch identification accuracy and excellent perceptual quality. Despite the great success achieved in this area, existing photo-sketch synthesis methods [1], even the most advanced deep learning based method [2], yield serious blurring and deformation in synthesized sketches and photos [3] (see Fig. 1).

Fig. 1: Illustration of the results of existing methods and the proposed methods. (a) Input, (b) MrFSPS [1], (c) cGAN [4], (d) our CA-GAN, (e) our SCA-GAN, (f) Ground truth. Our results show more natural textures and details.

Recently, generative adversarial networks (GANs) [5] have achieved great success in image transformation, e.g., image style transfer [4], image super-resolution [6], and image-to-image translation [7]. The face photo-sketch synthesis process can be naturally formulated as a photo-to-sketch or sketch-to-photo translation problem, which can be handled by a conditional generative adversarial network (cGAN) model [4]. Wang et al. [8] therefore tested the vanilla cGAN for facial sketch generation. The results show that cGAN is promising for producing sketch-like textures. However, as the vanilla cGAN only takes the face photo as input, it is difficult for the model to learn the structural relationships among the facial components without any composition information, resulting in deformation of some facial parts (see Fig. 1).

Since faces obey strong geometric constraints and contain complicated structural details, it is promising to use facial composition information to help the generation of sketch portraits. In this paper, we propose to use pixel-wise face labelling masks to characterize the facial composition. This is motivated by two observations. First, the facial structure can be well represented by pixel-wise face labelling masks. In particular, the pixel-wise labels map to the face photo/sketch pixel by pixel, thus preserving person-specific information. Second, pixel-wise facial labels are easy to obtain thanks to recent progress in face parsing [9], which avoids heavy human annotation and makes the approach feasible at test time.

Moreover, we propose an improved pixel loss, termed the compositional loss, for learning the photo/sketch generator. In typical image generation methods, the pixel loss (i.e., the reconstruction error) is uniformly calculated across the whole image as (part of) the objective [4]. Thus large components that comprise a vast number of pixels dominate the training procedure, preventing the model from generating delicate facial structures. However, for face photos/sketches, large components are typically unimportant for recognition (e.g., background) or easy to generate (e.g., facial skin). In contrast, small components (e.g., eyes) are critical for recognition and difficult to generate, because they comprise complicated structures. To eliminate this barrier, we introduce a weighting factor for the pixel loss of each component, which down-weights the loss assigned to large components. In other words, our compositional loss focuses training on hard components and prevents the large components from overwhelming the generator during training.

In this paper, we propose a Composition-Aided Generative Adversarial Network (CA-GAN) for face photo-sketch synthesis. Our model is based on the cGAN framework. First, we utilize paired inputs consisting of a face photo and the corresponding pixel-wise face labelling masks for generating the portrait. Second, we use the proposed compositional loss for training the GAN. Moreover, we use stacked CA-GANs (SCA-GAN) for refinement, which proves capable of rectifying defects and adding compelling details [6]. As the proposed framework jointly exploits the image appearance space and the structural composition space, it is capable of generating natural face photos and sketches. Experimental results show that our methods outperform existing methods in terms of perceptual quality and obtain highly comparable quantitative evaluation results. We also verify the excellent generalization ability of our model across different datasets.

The contributions of this paper are mainly three-fold.

  • First, to the best of our knowledge, this is the first work to employ facial composition information in the loop of learning a face photo-sketch synthesis model.

  • Second, we propose an improved pixel loss, termed the compositional loss, to focus training on hard-to-generate components and delicate facial structures, which is demonstrated to be highly effective. It both speeds up training and greatly stabilizes it.

  • Third, the proposed method yields identity-preserving, realistic, and visually comfortable photos and sketches over a wide range of challenging data. Besides, our methods show considerable generalization ability.

The rest of this paper is organized as follows. Section II introduces related works. Section III details the proposed sketch portrait generation framework. Experimental results and analysis are presented in Section IV. Section V concludes this paper.

II Related Work

II-A Face Photo-Sketch Synthesis

Tremendous efforts have been made to develop facial photo-sketch synthesis methods, which can be broadly classified into two groups: data-driven methods and model-driven methods [10]. Data-driven refers to methods that synthesize a photo/sketch using a linear combination of similar training photo/sketch patches [11, 12, 13, 14, 15, 16]. These methods have two main parts: similar photo/sketch patch searching and linear combination weight computation. The patch searching process greatly increases test time and makes it difficult to use a large-scale training dataset. Model-driven refers to methods that learn a mathematical function offline to map a photo to a sketch, or a sketch to a photo [1, 17, 18, 19]. Traditionally, researchers have devoted great effort to exploring hand-crafted features, neighbour searching strategies, and learning techniques. However, these methods typically yield serious blurring and great deformation in synthesized face photos and sketches.

Inspired by the great success achieved by deep learning techniques [20, 5] in various image-to-image translation tasks [7], several attempts have been made to learn deep face sketch synthesis models. To name a few, Zhang et al. [21] propose to use a branched fully convolutional network (FCN) for generating structural and textural representations, respectively, and then use face parsing results to fuse them together. However, the resulting sketches suffer from blurring and ringing effects. Recently, Wang et al. [8] propose to first use the vanilla cGAN to generate a sketch and then refine it with a post-processing approach termed back projection. Experimental results show that cGAN can produce sketch-like structures in the synthesized portrait. However, there is also great deformation of various facial parts. More recently, Wang et al. [22] take CycleGAN [23] as the prototype and propose to use multi-scale discriminators [24] for generating high-resolution sketches/photos. This method shows distinctly improved performance and yields sketch-realistic textures. However, slight blurring and degradation of the color components remain.

A few existing methods use composition information to guide the generation of the face sketch [21, 25]. In particular, they learn a specific generator for each component and then combine the outputs to form the entire face. Similar ideas have also been proposed for face image hallucination [26, 27]. In contrast, we propose to employ facial composition information in the loop of learning the generator to boost the performance.

II-B Image-to-Image Translation

Our work is closely related to image-to-image translation, which has achieved significant progress with the development of generative adversarial networks (GANs) [5, 28] and variational auto-encoders (VAEs) [29]. Among these models, the conditional generative adversarial network (cGAN) [4] has attracted growing attention, with many interesting works built upon it, including conditional face generation [30], text-to-image synthesis [6], and image style transfer [31], all of which obtain impressive results. Inspired by these observations, we are interested in generating sketch-realistic portraits using cGAN. However, we found the vanilla cGAN insufficient for this task, and therefore propose to boost the performance by both extending the network architecture and modifying the objective.

III Method

III-A Preliminaries

The proposed method is capable of handling both sketch synthesis and photo synthesis, because these two procedures are symmetric. In this section, we take face sketch synthesis as an example to introduce our method.

Our problem is defined as follows. Given a face photo $x$, we would like to generate a sketch portrait $\hat{y}$ that shares the same identity and has a sketch-realistic appearance. Our key idea is to use the facial composition information to help the generation of the sketch portrait. The first step is to obtain the structural composition of a face. As face parsing can well represent the facial composition, we employ the pixel-wise face labelling masks $m$ as prior knowledge of the facial composition. The remaining problem is to generate the sketch portrait based on the face photo and the composition masks: $\hat{y} = G(x, m)$. Here, we propose a composition-aided GAN (CA-GAN) for this purpose. We further employ stacked CA-GANs (SCA-GAN) to refine the generated sketch portraits. Details are given in the remainder of this section.

III-B Face Decomposition

Assume that the given face photo is $x \in \mathbb{R}^{H \times W \times C}$, where $H$, $W$, and $C$ are the height, width, and number of channels, respectively. We decompose the input photo into $K$ components (e.g., hair, nose, mouth, etc.) by employing the face parsing method proposed by Liu et al. [9], chosen for its excellent performance. For notational convenience, we refer to this model as P-Net. Using P-Net, we obtain the pixel-wise labels of $K = 8$ components, i.e., two eyes, two eyebrows, nose, upper and lower lips, inner mouth, facial skin, hair, and background [9].

We propose to use soft labels (probabilistic outputs) in this paper. Let $m \in [0, 1]^{H \times W \times K}$ denote the pixel-wise face labelling masks. Here, $m_i(p)$ denotes the probability that pixel $p$ belongs to the $i$-th component, as predicted by P-Net, with $\sum_{i=1}^{K} m_i(p) = 1$. In a preliminary implementation, we also tested the performance of hard labels (binary outputs), where each value $m_i(p)$ indicates whether pixel $p$ belongs to the $i$-th component. Because it is almost impossible to obtain absolutely precise pixel-wise face labels, using hard labels occasionally yields deformation in the border area between two neighbouring components.
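
For concreteness, the soft and hard labelling schemes can be sketched as follows in PyTorch, assuming the face parser exposes a (B, K, H, W) score map per image; the function names and tensor layout are illustrative assumptions, not P-Net's actual interface.

import torch.nn.functional as F

def soft_composition_masks(parsing_logits):
    """Turn per-pixel parsing scores into soft composition masks.
    parsing_logits: (B, K, H, W) raw scores for K components (e.g. K = 8).
    Returns masks m with m[:, i] in [0, 1] and sum over i equal to 1 per pixel."""
    return F.softmax(parsing_logits, dim=1)

def hard_composition_masks(parsing_logits):
    """Binary (one-hot) alternative; in our setting soft masks behaved better
    near borders between neighbouring components."""
    k = parsing_logits.shape[1]
    labels = parsing_logits.argmax(dim=1)                              # (B, H, W)
    return F.one_hot(labels, num_classes=k).permute(0, 3, 1, 2).float()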

III-C Composition-aided GAN (CA-GAN)

In the proposed framework, we first utilize paired inputs consisting of a face photo and the corresponding pixel-wise face labels for generating the portrait. Second, we propose an improved pixel loss, termed the compositional loss, to focus training on hard-to-generate components and delicate facial structures. Moreover, we use stacked CA-GANs to further rectify defects and add compelling details. Details are introduced in the following subsections.

Fig. 2: Generator architecture of the proposed composition-aided generative adversarial network (CA-GAN).

III-C1 Generator Architecture

The architecture of the generator in CA-GAN is presented in Fig. 2. In our case, the generator needs to translate two inputs (i.e., the face photo $x$ and the face labelling masks $m$) into a single output $\hat{y}$. Because $x$ and $m$ are of different modalities, we propose to use distinct encoders to model them, referred to as the Appearance Encoder and the Composition Encoder, respectively. The features of these two encoders are concatenated at the bottleneck layer and fed to the decoder [32]. In this way, the information of the face photo and of the facial composition can each be well modeled. The architectures of the encoder, decoder, and discriminator are exactly the same as those used in [4] but without dropout, following the shape of a "U-Net". Specifically, we concatenate all channels at layer $i$ in both encoders with those at layer $n - i$ in the decoder, where $n$ is the total number of layers. Details of the network can be found in the appendix of [4].

In addition, we tested a network with one single encoder that takes the concatenation of $x$ and $m$, i.e., $[x, m]$, as input. This network is the most straightforward solution for simultaneously encoding the face photo and the composition masks. Experimental results show that this structure decreases the face sketch recognition accuracy by about 2 percent and yields slightly blurred effects in the hair region.
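
The following minimal PyTorch sketch illustrates the dual-encoder design. It is a toy three-level version with placeholder channel widths rather than the exact CA-GAN generator of Fig. 2, but it shows the two encoders, the bottleneck concatenation, and U-Net-style skips drawn from both encoders.

import torch
import torch.nn as nn

def down(cin, cout):
    # Conv -> InstanceNorm -> LeakyReLU, stride-2 downsampling block
    return nn.Sequential(nn.Conv2d(cin, cout, 4, 2, 1),
                         nn.InstanceNorm2d(cout), nn.LeakyReLU(0.2, True))

def up(cin, cout):
    # Transposed conv -> InstanceNorm -> ReLU, stride-2 upsampling block
    return nn.Sequential(nn.ConvTranspose2d(cin, cout, 4, 2, 1),
                         nn.InstanceNorm2d(cout), nn.ReLU(True))

class DualEncoderGenerator(nn.Module):
    """Toy dual-encoder U-Net: an appearance encoder for the photo x, a
    composition encoder for the masks m, concatenation at the bottleneck, and
    skips from BOTH encoders to the decoder. Depths/widths are illustrative."""
    def __init__(self, photo_ch=3, mask_ch=8, out_ch=1, ngf=64):
        super().__init__()
        self.a1, self.a2, self.a3 = down(photo_ch, ngf), down(ngf, ngf * 2), down(ngf * 2, ngf * 4)
        self.c1, self.c2, self.c3 = down(mask_ch, ngf), down(ngf, ngf * 2), down(ngf * 2, ngf * 4)
        self.u3 = up(ngf * 8, ngf * 2)           # concat of both bottlenecks
        self.u2 = up(ngf * 2 + ngf * 4, ngf)     # + skips from both encoders
        self.u1 = nn.Sequential(nn.ConvTranspose2d(ngf + ngf * 2, out_ch, 4, 2, 1), nn.Tanh())

    def forward(self, x, m):
        a1, c1 = self.a1(x), self.c1(m)
        a2, c2 = self.a2(a1), self.c2(c1)
        a3, c3 = self.a3(a2), self.c3(c2)
        h = self.u3(torch.cat([a3, c3], 1))
        h = self.u2(torch.cat([h, a2, c2], 1))
        return self.u1(torch.cat([h, a1, c1], 1))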

III-C2 Compositional Loss

Previous approaches to cGANs have found it beneficial to mix the GAN objective with a pixel loss (i.e., reconstruction error) for various tasks, e.g., image translation [4] and super-resolution reconstruction [7]. Besides, using the L1 distance encourages less blurring than the L2 distance [4]. We therefore use the normalized L1 distance between the generated sketch and the target when computing the pixel loss. We introduce the compositional loss starting from the standard pixel loss for image generation. In previous works on cGANs, the pixel loss is calculated over the whole image; for distinction, we refer to it as the global pixel loss in this paper.

Global pixel loss. Suppose the target sketch $y$ and the generated sketch $\hat{y} = G(x, m)$ both have shape $H \times W$. Let $\mathbf{1}$ be an $H \times W$ matrix of ones. The global pixel loss is expressed as:

$L_{gp}(G) = \frac{1}{HW} \left\| \mathbf{1} \odot (y - \hat{y}) \right\|_{1}.$   (1)

In the global pixel loss, the loss related to the $i$-th component, $L_i$, can be expressed as:

$L_{i}(G) = \frac{1}{HW} \left\| m_i \odot (y - \hat{y}) \right\|_{1},$   (2)

with $L_{gp}(G) = \sum_{i=1}^{K} L_i(G)$. Here, $\odot$ denotes the pixel-wise product operation. As all pixels are treated equally in the global pixel loss, large components (e.g., background and facial skin) contribute more to learning the generator than small components (e.g., eyes and mouth).

Compositional Loss. To eliminate this barrier, we introduce a weighting factor, $w_i$, to balance the pixel losses of the individual components. Specifically, inspired by the balanced cross-entropy loss [33], we set $w_i$ by inverse component frequency. When we adopt soft facial labels, the (soft) size of the $i$-th component is $N_i = \|m_i\|_1$, i.e., the sum over all pixels of the probability of belonging to that component; if we adopt hard facial labels, $N_i$ is simply the number of pixels belonging to the component. The component frequency is thus $f_i = N_i / (HW)$. We set $w_i = 1 / f_i$ and multiply it with $L_i$, resulting in the balanced loss:

$\bar{L}_{i}(G) = w_i L_i(G) = \frac{1}{N_i} \left\| m_i \odot (y - \hat{y}) \right\|_{1}.$   (3)

Obviously, the balanced loss is exactly the normalized L1 loss over the corresponding componential region.

The compositional loss is defined as

$L_{comp}(G) = \sum_{i=1}^{K} \bar{L}_{i}(G).$   (4)

As $w_i$ is broadly in inverse proportion to the component size, it reduces the loss contribution from large components; equivalently, it up-weights the losses assigned to small and hard-to-generate components.

In practice we use a weighted average of the global pixel loss and the compositional loss:

$L_{pix}(G) = (1 - \gamma) L_{gp}(G) + \gamma L_{comp}(G),$   (5)

where $\gamma$ is used to balance the global pixel loss and the compositional pixel loss. We adopt this form in our experiments, as it yields slightly improved perceptual comfortability over using the compositional loss alone.
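
The whole pixel objective of Eqs. (1)-(5) fits in a few lines. The sketch below assumes soft masks of shape (B, K, H, W) that sum to 1 over K, sketches averaged over their channels, and a placeholder value for the trade-off $\gamma$; it is an illustration, not the exact training code.

import torch

def pixel_losses(y_hat, y, masks, gamma=0.5):
    """Weighted global + compositional L1 loss (Eqs. (1)-(5)).
    y_hat, y : generated / target sketches, shape (B, C, H, W)
    masks    : soft composition masks, shape (B, K, H, W)
    gamma    : global/compositional trade-off of Eq. (5) (placeholder value)."""
    abs_err = (y_hat - y).abs().mean(dim=1)                       # (B, H, W)
    global_loss = abs_err.mean()                                  # Eq. (1)

    # Soft component sizes N_i and per-component errors (Eqs. (2)-(3)).
    n_i = masks.sum(dim=(2, 3)).clamp(min=1.0)                    # (B, K)
    comp_err = (masks * abs_err.unsqueeze(1)).sum(dim=(2, 3))     # (B, K)
    balanced = comp_err / n_i                                     # Eq. (3): mean error inside each component
    compositional_loss = balanced.sum(dim=1).mean()               # Eq. (4)

    return (1 - gamma) * global_loss + gamma * compositional_loss # Eq. (5)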

III-C3 Objective

Following the objective of the vanilla cGAN, we express the adversarial loss of CA-GAN as:

$L_{adv}(G, D) = \mathbb{E}_{x, m, y}\left[\log D(x, m, y)\right] + \mathbb{E}_{x, m}\left[\log\left(1 - D(x, m, G(x, m))\right)\right].$   (6)

Similar to the settings in [4], we do not add Gaussian noise as an input, and we do not use dropout in the generator. Finally, we use a combination of the adversarial loss and the weighted pixel loss to learn the generator. We aim to solve:

$G^{*} = \arg\min_{G}\max_{D} \; L_{adv}(G, D) + \lambda L_{pix}(G),$   (7)

where $\lambda$ is a weighting factor.

III-D Stacked Refinement Network

Finally, we use stacked CA-GANs (SCA-GAN) to further boost the quality of the generated sketch portrait [6]. The architecture of SCA-GAN is illustrated in Fig. 3. SCA-GAN includes two stacked GANs, each comprising a generator and a discriminator, denoted by $\{G_1, D_1\}$ and $\{G_2, D_2\}$. In SCA-GAN, the Stage-I GAN yields an initial portrait, $\hat{y}_1 = G_1(x, m)$, based on the given face photo $x$ and the pixel-wise label masks $m$. Afterwards, the Stage-II GAN takes $(x, \hat{y}_1, m)$ as inputs to rectify defects and add compelling details, yielding a refined sketch portrait, $\hat{y}_2$. The network architectures of these two GANs are almost the same, except that the inputs of $G_2$ and $D_2$ have one more channel (i.e., the initial sketch) than those of $G_1$ and $D_1$, respectively. Here, the given photo and the initial sketch are concatenated and fed into the appearance encoder. In the implementation, we also tested an SCA-GAN variant with a single discriminator shared by the two GANs; however, it cannot yield vivid hair.

Fig. 3: Pipeline of the proposed stacked composition-aided generative adversarial network (SCA-GAN).

III-E Optimization and Implementation

In the proposed method, the input image should be of fixed size, e.g., 256 x 256. In the default setting of cGAN [4], the input image is resized from an arbitrary size to 256 x 256. However, we observed that resizing the input face photo yields serious blurring and great deformation in the generated sketch [8], [22]. In contrast, by padding the input image to the target size, we obtain considerable performance improvement. We therefore use padding in all experiments.
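
A simple way to realize this preprocessing is constant padding around the aligned crop instead of resizing; the fill value and the centring in the NumPy sketch below are illustrative choices, not the exact settings used in our experiments.

import numpy as np

def pad_to_square(img, size=256, fill=255):
    """Pad an aligned face image to size x size rather than resizing it,
    so the facial geometry is preserved. Works for grayscale or color arrays."""
    h, w = img.shape[:2]
    assert h <= size and w <= size, "image larger than target size"
    top, left = (size - h) // 2, (size - w) // 2
    out = np.full((size, size) + img.shape[2:], fill, dtype=img.dtype)
    out[top:top + h, left:left + w] = img
    return out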

To optimize our networks, following [4], we alternate between one gradient descent step on $D$ and one step on $G$. We use minibatch SGD with the Adam solver. For clarity, we illustrate the optimization procedure of SCA-GAN in Algorithm 1. In our experiments, we use a batch size of 1 and run 700 epochs in all experiments. Besides, we apply instance normalization, which has shown great superiority over batch normalization in image generation tasks [4]. We trained our models on a single Pascal Titan X GPU. With a training set of 500 samples, it took about 3 hours to train the CA-GAN model and 6 hours to train the SCA-GAN model. At test time, all models run in well under one second per image on this GPU.
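
The four networks are trained with separate Adam optimizers; the learning rate and momentum terms in the sketch below follow common pix2pix practice [4] and are assumptions, not values reported here.

import torch

def make_optimizers(G1, D1, G2, D2, lr=2e-4, betas=(0.5, 0.999)):
    # One Adam optimizer per network of SCA-GAN, keyed by name.
    return {name: torch.optim.Adam(net.parameters(), lr=lr, betas=betas)
            for name, net in (("G1", G1), ("D1", D1), ("G2", G2), ("D2", D2))}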

Input: a set of training instances, each in the form of a triplet
{a face photo $x$, pixel-wise label masks $m$, a target sketch $y$};
iteration counter $k$, maximum number of iterations $T$.
Output: the optimal parameters of $G_1$, $D_1$, $G_2$, and $D_2$.
Initialize $G_1$, $D_1$, $G_2$, and $D_2$;
for $k = 1$ to $T$ do
    1. Randomly select one training instance {a face photo $x$, pixel-wise label masks $m$, a target sketch $y$};
    2. Estimate the initial sketch portrait: $\hat{y}_1 = G_1(x, m)$;
    3. Estimate the refined sketch portrait: $\hat{y}_2 = G_2(x, \hat{y}_1, m)$;
    4. Update $D_1$: one gradient step on the adversarial loss (Eq. (6)) computed with $\hat{y}_1$;
    5. Update $G_1$: one gradient step on the objective (Eq. (7)) computed with $\hat{y}_1$;
    6. Update $D_2$: one gradient step on the adversarial loss computed with $\hat{y}_2$;
    7. Update $G_2$: one gradient step on the objective computed with $\hat{y}_2$;
end for
Algorithm 1 Optimization procedure of SCA-GAN (for sketch synthesis).
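
The loop body of Algorithm 1 can be sketched in PyTorch as follows. It assumes the conditional discriminators take (photo, masks, sketch) and return a logit, reuses the pixel_losses helper sketched in Section III-C2 and the optimizer dictionary sketched above, and, for simplicity, detaches the Stage-I output before it enters Stage II; these are implustration-level implementation choices, not the exact training code.

import torch

def train_step(x, m, y, G1, D1, G2, D2, opts, lam=100.0):
    """One SCA-GAN update mirroring steps 2-7 of Algorithm 1."""
    bce = torch.nn.functional.binary_cross_entropy_with_logits
    y1 = G1(x, m)                                    # step 2: initial sketch
    y2 = G2(torch.cat([x, y1.detach()], dim=1), m)   # step 3: refined sketch

    # Steps 4 and 6: discriminator updates on real vs. generated sketches.
    for D, fake, key in ((D1, y1, "D1"), (D2, y2, "D2")):
        opts[key].zero_grad()
        real_logit = D(x, m, y)
        fake_logit = D(x, m, fake.detach())
        d_loss = bce(real_logit, torch.ones_like(real_logit)) + \
                 bce(fake_logit, torch.zeros_like(fake_logit))
        d_loss.backward()
        opts[key].step()

    # Steps 5 and 7: generator updates with adversarial term + weighted pixel loss (Eq. (7)).
    for D, fake, key in ((D1, y1, "G1"), (D2, y2, "G2")):
        opts[key].zero_grad()
        fake_logit = D(x, m, fake)
        g_loss = bce(fake_logit, torch.ones_like(fake_logit)) + lam * pixel_losses(fake, y, m)
        g_loss.backward()
        opts[key].step()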

IV Experiments

In this section, we will first introduce the experimental settings and then present a series of empirical results to verify the effectiveness of the proposed method.

IV-A Settings

IV-A1 Datasets

We conducted experiments on three publicly available databases: the CUHK Face Sketch database (CUHK) [34], the CUFSF database [35], and the VIPSL-FS database [19], [36]. The CUHK database consists of 606 face photos from three databases: the CUHK student database [37] (188 persons), the AR database [38] (123 persons), and the XM2VTS database [39] (295 persons). The CUFSF database includes 1194 persons [40]. In the CUFSF database, there is lighting variation in the face photos and shape exaggeration in the sketches, which makes CUFSF very challenging. In both the CUHK and CUFSF databases, each person has one face photo and one face sketch drawn by an artist. The VIPSL-FS database includes 200 persons; for each person, there are 5 sketches drawn by different artists. Because the original images in the VIPSL-FS database are of higher resolution, we use it to test the performance of the proposed method in generating high-resolution face photos/sketches.

Following existing methods [3], all face images (photos and sketches) are geometrically aligned using three points: the two eye centers and the mouth center. For the CUHK and CUFSF databases, the aligned images are cropped to a fixed size. For the VIPSL-FS database, the aligned face photos are first cropped and then resized to the network input size; the corresponding pixel-wise label masks are estimated from the photo and resized accordingly, and the aligned sketches are cropped to the target resolution.
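
A three-point alignment of this kind can be implemented with a single affine warp; the reference landmark coordinates and the output size in the OpenCV sketch below are placeholders, not the exact values used for these databases.

import cv2
import numpy as np

def align_face(img, eye_l, eye_r, mouth, ref=None, size=(200, 250)):
    """Align a face photo/sketch using the two eye centers and the mouth center.
    `ref` holds the target positions of the three landmarks in the output image."""
    if ref is None:
        ref = np.float32([[60, 95], [140, 95], [100, 180]])   # illustrative targets
    src = np.float32([eye_l, eye_r, mouth])
    M = cv2.getAffineTransform(src, ref)     # exact affine map from 3 point pairs
    return cv2.warpAffine(img, M, size)      # size is (width, height)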

In the following, we present a series of experiments:

  • First, we perform face photo-sketch synthesis on the CUHK, CUFSF, and VIPSL-FS databases, respectively, to evaluate the performance of the proposed methods (see Part IV-B and Part IV-C);

  • Second, we conduct cross-dataset experiments to verify whether the proposed method is independent of the training data (see Part IV-D); and

  • Third, we discuss the network configurations for our proposed method on the CUHK database and CUFSF database (see Part IV-E).

We use the proposed architecture for both the sketch synthesis and photo synthesis and release all the synthesized sketches and photos online: https://github.com/fei-hdu/ca-gan.

It is well known that a large training dataset is necessary for learning GAN-based models. In the experiments, unless otherwise specified, we randomly split each dataset into a training set (80%) and a testing set (20%), with no overlap between them. Besides, we ran the training-testing process 10 times and report the average values of the following criteria as the performance measure.

IV-A2 Criteria

We adopt the Peak Signal-to-Noise Ratio (PSNR) and the Feature Similarity Index Metric (FSIM) [41] between the synthesized image and the ground-truth image to objectively assess the quality of the synthesized image. It is worth mentioning that, although these metrics work well for evaluating the quality of natural images and have become prevalent in the face photo-sketch synthesis community, their scores on synthesized images are indicative but not infallible [42].
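
For reference, PSNR between a synthesized image and its ground truth is computed as below; FSIM [41] is more involved and is not sketched here.

import numpy as np

def psnr(img_a, img_b, peak=255.0):
    """Peak Signal-to-Noise Ratio in dB for two images of the same shape."""
    mse = np.mean((img_a.astype(np.float64) - img_b.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)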

In addition, sketch-based face recognition is often used to assist law enforcement, so it is necessary to verify whether the synthesized images can be used for identity recognition. We therefore evaluate the face recognition accuracy using the ground-truth image (the photo or the sketch drawn by the artist) as the probe and the synthesized images (photos or sketches) as the gallery. Null-space linear discriminant analysis (NLDA) [43] is employed to conduct the face recognition experiments. We repeat each face recognition experiment 20 times by randomly partitioning the data and report the average accuracy.

Fig. 4: Examples of synthesized face sketches on the CUHK database and the CUFSF database. (a) Photo, (b) MrFSPS [1], (c) RSLCR [3], (d) FCN [2], (e) BP-GAN [8], (f) cGAN [4], (g) CA-GAN, (h) SCA-GAN, and (i) Sketch drawn by the artist. From top to bottom, the examples are selected from the CUHK student database [37], the AR database [38], the XM2VTS database [39], and the CUFSF database [40], sequentially.

IV-B Face Sketch Synthesis

Comparison with existing methods

There is great divergence in the experimental settings among existing face sketch synthesis methods, and existing methods are typically tested on the CUHK and CUFSF databases. In this paper, we follow the protocol of [3] and split the datasets as follows. For the CUHK student database, 88 photo-sketch pairs are used for training and the rest for testing. For the AR database, we randomly choose 80 pairs for training and use the remaining 43 pairs for testing. For the XM2VTS database, we randomly choose 100 pairs for training and use the remaining 195 pairs for testing.

Fig. 4 presents synthesized face sketches from different methods on the CUHK database and the CUFSF database. Four advanced methods are compared: MrFSPS [1], RSLCR [3], FCN [2], and cGAN [4]. All the sketches synthesized by RSLCR and FCN are those released by Wang et al. at http://www.ihitworld.com/RSLCR.html. All the sketches synthesized by MrFSPS are those released by the author Peng at http://chunleipeng.com/TNNLS2015_MrFSPS.html.

As shown in Fig. 4, cGAN, CA-GAN, and SCA-GAN can generate sketch-like textures (e.g., in the hair region) and shadows. In contrast, BP-GAN yields over-smoothed sketch portraits, while MrFSPS, RSLCR, and FCN yield serious blurring and great deformation in various facial parts. Besides, there is deformation in the sketches synthesized by cGAN, especially in the mouth area. In contrast, CA-GAN alleviates such defects, and SCA-GAN almost eliminates them. This illustrates the effectiveness of the proposed methods.

             MrFSPS[1]  RSLCR[3]  FCN[2]  BP-GAN[8]  cGAN[4]  CA-GAN  SCA-GAN
PSNR  CUHK   N/A        30.08     30.04   30.73      30.04    30.13   30.01
      CUFSF  N/A        28.90     28.41   29.87      29.29    29.22   29.21
FSIM  CUHK   73.39      69.64     69.34   69.05      71.09    71.19   71.43
      CUFSF  N/A        66.47     66.22   68.18      72.81    72.72   72.86
Acc.  CUHK   97.70      98.38     96.49   93.14      95.48    95.64   95.90
      CUFSF  75.36      75.94     69.80   67.45      80.89    79.84   79.88
TABLE I: Comparison with existing face sketch synthesis methods in terms of the average PSNR, FSIM (%), and face recognition accuracy (Acc.) (%) on the CUHK and CUFSF databases. The experimental settings follow RSLCR [3].

Table I presents the average PSNR, FSIM, and face sketch recognition accuracy (Acc.) of the most advanced face sketch synthesis methods and the proposed ones on the CUHK and CUFSF databases. The evaluation method is exactly the same as that presented in [3]. Specifically, in the face sketch recognition experiment, we randomly split the CUHK database into a training set (150 synthesized sketches and corresponding ground truths) and a testing set (188 sketches) constituting the gallery. For the CUFSF database, we randomly choose 300 synthesized sketches and corresponding ground truths for training and use 644 synthesized sketches as the gallery. We repeat each face recognition experiment 20 times by randomly partitioning the data.

As shown in Table I, the PSNR values of all these methods are highly comparable. In terms of FSIM, cGAN, CA-GAN, and SCA-GAN outperform existing methods, except MrFSPS, on both the CUHK and CUFSF databases. In terms of recognition accuracy, cGAN, CA-GAN, and SCA-GAN are 2-3 percent inferior to MrFSPS and RSLCR on the CUHK database, but 4-5 percent superior to them on the CUFSF database. Note that the CUFSF database is much larger than the CUHK database; besides, the lighting variation in the face photos and the shape exaggeration in the sketches both increase the difficulty of face sketch-photo synthesis and recognition. We therefore conclude that cGAN, CA-GAN, and SCA-GAN outperform existing methods in terms of face sketch recognition accuracy. There is no considerable difference between cGAN, CA-GAN, and SCA-GAN in terms of these three criteria across the CUHK and CUFSF databases. In addition, PS-MAN [22] achieves an FSIM value of 73.61 on the CUHK database, which is slightly better than both CA-GAN and SCA-GAN.

From Fig. 4 and Table I, we can safely conclude that both CA-GAN and SCA-GAN generate much better sketches and achieve highly comparable quantitative evaluations, in comparison with existing face sketch synthesis methods.

High-resolution sketch synthesis

We add one convolutional layer to the encoders and one deconvolutional layer to the decoders of cGAN, CA-GAN, and SCA-GAN for the purpose of generating high-resolution sketches. We use 1000 photo-sketch pairs from the VIPSL-FS database here, randomly split into a training set and a testing set (80%:20%). Fig. 5 shows the sketch portraits generated by cGAN [4] and SCA-GAN. Obviously, cGAN yields checkerboard-like textures and blurring in the hair region. Besides, cGAN yields deformation in small facial components (see the left eye of the first person in Fig. 5). In contrast, SCA-GAN generates very high-quality and sketch-realistic portraits, alleviating such defects.

Quantitative Evaluation

Since GANs typically need a large amount of training data, we further conduct the sketch synthesis experiment on the CUHK, CUFSF, and VIPSL-FS databases by randomly splitting each database into a training set (80%) and a testing set (20%). For face sketch recognition, we randomly split the CUHK database into a training set (70 synthesized sketches and corresponding ground truths) and a testing set (188 sketches) constituting the gallery. For the CUFSF database, we randomly choose 120 synthesized sketches and corresponding ground truths for training and use 250 synthesized sketches as the gallery. For the VIPSL-FS database, we randomly choose 20 synthesized sketches and corresponding ground truths for training and use 40 synthesized sketches as the gallery. We repeat each face sketch recognition experiment 20 times by randomly partitioning the data. We run the training-testing process 10 times and calculate the average PSNR, FSIM, and face recognition accuracy (Acc.) of the synthesized sketches. The corresponding results are shown in Table II.

As shown in Table II, there is no distinct difference between cGAN, CA-GAN, and SCA-GAN in terms of PSNR. In terms of FSIM, CA-GAN is highly comparable with cGAN, and SCA-GAN shows slight superiority over both of them. In addition, both CA-GAN and SCA-GAN achieve higher face sketch recognition accuracy on the CUHK database and remain comparable with cGAN on the CUFSF and VIPSL-FS databases. Recall that the sketches generated by SCA-GAN look most like the input faces (as illustrated in Figs. 1, 4, and 5). We can safely draw the conclusion that both CA-GAN and SCA-GAN are capable of generating identity-preserving and sketch-realistic sketch portraits.

Fig. 5: Examples of high-resolution synthesized face sketches on the VIPSL-FS database. (a) Photo, (b) cGAN [4], (c) CA-GAN, (d) SCA-GAN, and (e) Sketch drawn by artist.
                cGAN   CA-GAN  SCA-GAN
PSNR  CUHK      30.69  30.65   30.73
      CUFSF     29.11  29.05   29.13
      VIPSL-FS  31.70  31.76   31.73
FSIM  CUHK      72.53  72.67   73.03
      CUFSF     73.09  72.96   73.24
      VIPSL-FS  70.85  70.50   71.12
Acc.  CUHK      98.73  98.92   99.61
      CUFSF     88.88  86.77   87.69
      VIPSL-FS  66.35  66.15   65.10
TABLE II: Average PSNR, FSIM (%), and face recognition accuracy (Acc.) (%) of the synthesized sketches on the CUHK, CUFSF, and VIPSL-FS databases. Each database is randomly split into a training set (80%) and a testing set (20%).

IV-C Face Photo Synthesis

We exchange the roles of the sketch and photo in the proposed model, and evaluate the face photo synthesis performance on the aforementioned datasets, separately.

Fig. 6 illustrates the synthesized face photos of MrFSPS [1], cGAN, CA-GAN, and SCA-GAN. All the photos synthesized by MrFSPS are those released by the author Peng at http://chunleipeng.com/TNNLS2015_MrFSPS.html. Obviously, the face photos synthesized by MrFSPS are heavily blurred. Besides, there are serious degradations in the photos synthesized by cGAN. In contrast, the photos generated by both CA-GAN and SCA-GAN consistently show considerable improvement in perceptual quality.

Fig. 6: Examples of synthesized face photos. (a) Sketch drawn by artist, (b) MrFSPS [1], (c) cGAN, (d) CA-GAN, (e) SCA-GAN, and (f) ground-truth photo. From top to bottom, the examples are selected from the CUHK student database [37], the AR database [38], the XM2VTS database [39], and the CUFSF database [40], sequentially.

Table III presents the average PSNR, FSIM, and face recognition accuracy (Acc.) on the CUHK, CUFSF, and VIPSL-FS databases. Obviously, SCA-GAN obtains the best FSIM values across all three databases. Besides, both CA-GAN and SCA-GAN outperform cGAN in terms of face recognition on the CUHK and VIPSL-FS databases, but are inferior to cGAN on the CUFSF database.

From Fig. 6 and Table III, we can see that both CA-GAN and SCA-GAN generate better face photos and achieve highly comparable quantitative evaluations as compared with cGAN. We can safely draw the conclusion that both CA-GAN and SCA-GAN are capable of generating identity-preserving and natural face photos.

Comparison with existing methods

Recently, only a few methods have been proposed for face photo synthesis. Here we compare the proposed method with two advanced methods: MrFSPS [1] and PS-MAN [22]. In terms of FSIM and face recognition accuracy on the CUHK and CUFSF databases, the performance of CA-GAN and SCA-GAN is highly comparable with that reported for MrFSPS and PS-MAN [22]. Besides, the photos synthesized by CA-GAN and SCA-GAN are perceptually better than those of MrFSPS and PS-MAN. Specifically, there is serious blurring in the photos synthesized by MrFSPS and visible degradation of the color components in those synthesized by PS-MAN [22]. In contrast, the results of CA-GAN and SCA-GAN show more natural colors and details.

                cGAN   CA-GAN  SCA-GAN
PSNR  CUHK      30.98  30.92   30.85
      CUFSF     30.06  29.65   29.97
      VIPSL-FS  30.40  30.18   30.52
FSIM  CUHK      76.18  76.53   77.13
      CUFSF     79.54  79.13   79.67
      VIPSL-FS  72.83  72.80   73.15
Acc.  CUHK      94.80  96.96   95.49
      CUFSF     77.50  73.38   74.85
      VIPSL-FS  61.80  63.20   63.55
TABLE III: Average PSNR, FSIM (%), and face recognition accuracy (Acc.) (%) of the synthesized photos on the CUHK, CUFSF, and VIPSL-FS databases. Each database is randomly split into a training set (80%) and a testing set (20%).

High-resolution photo synthesis

In addition, we evaluate the performance of cGAN, CA-GAN, and SCA-GAN, in synthesizing high-resolution photos, on the VIPSL-FS database. The experimental settings are exactly the same as those previously presented in Section IV-B, except that the roles of the sketch and photo are exchanged.

Fig. 7 illustrates the synthesized photos. Obviously, cGAN yields checkerboard-like textures and blurring in the hair region. Besides, cGAN yields deformation in small facial components (e.g., the left eye of the second person in Fig. 7 (b)). In contrast, both CA-GAN and SCA-GAN generate very high-quality and natural face photos, alleviating such defects. Moreover, the photos synthesized by SCA-GAN are of the best perceptual quality.

Fig. 7: Examples of synthesized high-resolution face photos on the VIPSL-FS database. (a) Sketch drawn by artist, (b) cGAN [4], (c) CA-GAN, (d) SCA-GAN, and (e) ground truth photo.

IV-D Dataset Independence

To verify the generalization ability of the learned model, we conducted two cross-dataset experiments.

Cross-database experiment

First, we apply the model learned from the CUHK training set to the whole VIPSL-FS database. There is great divergence in person identity, background, and sketch style between these two datasets. Fig. 8 illustrates the synthesized sketches on the VIPSL-FS database, and Fig. 9 the synthesized photos. Obviously, both CA-GAN and SCA-GAN generate much better sketches and photos than cGAN, and the results of SCA-GAN show the best appearance.

Table IV lists the average PSNR, FSIM (%), and face recognition accuracy (Acc.) (%) of the synthesized photos/sketches on the VIPSL-FS database. In the face sketch recognition task, we randomly choose 100 synthesized sketches and corresponding ground truths for training and use 200 synthesized sketches as the gallery. We repeat each face sketch recognition experiment 20 times by randomly partitioning the data. CA-GAN and SCA-GAN outperform cGAN in terms of PSNR and FSIM, but are inferior to cGAN in terms of face recognition accuracy.

                        cGAN   CA-GAN  SCA-GAN
Sketch Synthesis  PSNR  27.77  27.70   27.74
                  FSIM  55.92  58.41   58.76
                  Acc.  24.17  26.67   23.33
Photo Synthesis   PSNR  25.44  25.63   25.52
                  FSIM  61.81  64.42   65.17
                  Acc.  27.50  22.50   22.50
TABLE IV: Average PSNR, FSIM (%), and face recognition accuracy (Acc.) (%) of the synthesized photos/sketches on the whole VIPSL-FS database while the model is learned from the CUHK training dataset.
Fig. 8: Synthesized sketches on the VIPSL-FS database while the model is trained on the CUHK database. (a) Photo, (b) cGAN, (c) CA-GAN, (d) SCA-GAN, (e) ground-truth sketch drawn by the artist.
Fig. 9: Synthesized photos on the VIPSL-FS database while the model is trained on the CUHK database. (a) Sketch drawn by the artist, (b) cGAN, (c) CA-GAN, (d) SCA-GAN, (e) ground-truth photo.

Face photo-sketch synthesis of Chinese celebrities

In addition, we tested the CA-GAN and SCA-GAN models, trained on the CUHK database, on photos and sketches of Chinese celebrities. These photos and sketches were downloaded from the web and contain different lighting conditions and backgrounds compared with the images in the training set. Fig. 10 shows the synthesized sketches, and Fig. 11 the synthesized photos. Obviously, our results show more natural textures and details than those of cGAN.

Fig. 10: Synthesized sketches of Chinese celebrities. (a) Photo, (b) cGAN, (c) CA-GAN, (d) SCA-GAN.
Fig. 11: Synthesized photos of Chinese celebrities. (a) Sketch, (b) cGAN, (c) CA-GAN, (d) SCA-GAN.

Limitations

It is encouraging that both CA-GAN and SCA-GAN show outstanding generalization ability in the sketch synthesis task. However, as shown in Fig. 8, the proposed method cannot handle black margins well and yields ink marks in the corresponding areas. This is likely because there are few black margins in the CUHK dataset, so the generator learns little about how to process them. In addition, the synthesized photos in the cross-dataset experiment are unsatisfactory. This might be due to the great divergence between the input sketches in terms of textures and styles. It is necessary to further improve the generalization ability of the photo synthesis models.

IV-E Discussions on the Network Configurations

IV-E1 Ablation Study

There are mainly three components in CA-GAN, i.e. (i) using face labels in G; (ii) using face labels in D; and (iii) the compositional loss. To illustrate the contribution of each component, we accordingly evaluate the performance related to the following settings: cGAN, cGAN+i, cGAN+ii, cGAN+iii, CA-GAN (i.e. cGAN+i+ii+iii), and SCA-GAN. We separately conduct the photo synthesis and sketch synthesis experiments on the CUHK database and the CUFSF database. We randomly split each database into two parts: 80% for training and the rest for testing. There is no overlap between the training set and the testing set.

We show the corresponding PSNR, FSIM, and recognition accuracy (Acc.) of the synthesized sketches and photos in Table V and Table VI, respectively. There is no distinct difference between the different settings. Fig. 12 illustrates the synthesized sketches and photos. Compared to (b), (c)-(e) show less deformation and sharper boundaries in the nose, mouth, and eye regions. In other words, all three proposed components improve the quality of the generated sketches.

Fig. 12: Illustration of synthesized face sketch and photo under different configurations. (a) Input, (b) cGAN, (c) cGAN with face labels in G (cGAN+i), (d) cGAN with face labels in D (cGAN+ii), (e) cGAN with the compositional loss (cGAN+iii), (f) CA-GAN, (g) SCA-GAN, (h) ground-truth.
             cGAN   cGAN+i  cGAN+ii  cGAN+iii  CA-GAN  SCA-GAN
PSNR  CUHK   30.69  30.68   30.68    30.67     30.65   30.73
      CUFSF  29.11  29.13   29.16    29.19     29.05   29.13
FSIM  CUHK   72.53  72.57   72.71    72.27     72.67   73.03
      CUFSF  73.09  73.01   73.08    72.98     72.96   73.24
Acc.  CUHK   98.73  98.14   99.02    98.43     98.92   99.61
      CUFSF  88.88  86.38   89.27    88.77     86.77   87.69
TABLE V: Average PSNR, FSIM, and face recognition accuracy (Acc.) of the synthesized sketches on the CUHK and CUFSF databases. (i) Using face labels in G; (ii) using face labels in D; and (iii) the compositional loss.
             cGAN   cGAN+i  cGAN+ii  cGAN+iii  CA-GAN  SCA-GAN
PSNR  CUHK   30.98  30.86   30.58    30.81     30.92   30.85
      CUFSF  30.06  30.02   29.98    30.07     29.65   29.97
FSIM  CUHK   76.18  76.76   76.52    76.27     76.53   77.13
      CUFSF  79.54  79.37   79.57    79.36     79.13   79.67
Acc.  CUHK   94.80  96.47   95.00    94.71     96.96   95.49
      CUFSF  77.50  75.27   77.35    77.62     73.38   74.85
TABLE VI: Average PSNR, FSIM, and face recognition accuracy (Acc.) of the synthesized photos on the CUHK and CUFSF databases. (i) Using face labels in G; (ii) using face labels in D; and (iii) the compositional loss.

IV-E2 Stability of the Training Procedure

We find that our proposed approaches considerably stabilize the training procedure of the network. Fig. 13 shows the (smoothed) training loss curves of cGAN [4], CA-GAN, and SCA-GAN on the CUHK database. Specifically, (a) and (b) show the reconstruction error (global pixel loss) and the adversarial loss in the sketch synthesis task; (c) and (d) show the reconstruction error and the adversarial loss in the photo synthesis task, respectively. For clarity, we smooth the raw loss curves by averaging every 40 adjacent loss values.

Obviously, there are large impulses in the adversarial loss of cGAN. In contrast, the corresponding curves of CA-GAN and SCA-GAN are much smoother. The reconstruction errors of both CA-GAN and SCA-GAN are smaller than that of cGAN. Besides, SCA-GAN achieves the lowest reconstruction error and the smoothest loss curves. This observation helps explain why stacked generators are capable of refining the generation performance [44].
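
The smoothing used for Fig. 13 is a plain moving average over 40 adjacent values; a minimal NumPy version is given below for reference.

import numpy as np

def smooth(losses, window=40):
    """Moving-average smoothing of a raw per-iteration loss curve."""
    kernel = np.ones(window) / window
    return np.convolve(np.asarray(losses, dtype=np.float64), kernel, mode="valid")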

Fig. 13: Training loss curves of cGAN, CA-GAN, and SCA-GAN, on the CUHK database. (a) Reconstruction error in the sketch synthesis task, (b) adversarial loss in the sketch synthesis task, (c) reconstruction error in the photo synthesis task, and (d) adversarial loss in the photo synthesis task.

V Conclusion

In this paper, we propose a novel composition-aided generative adversarial network for face photo-sketch synthesis. Our approach produces high-quality face photos and sketches over a wide range of challenging data. We hope that the presented approach can benefit other image generation problems. Besides, it is essential to develop models that can handle photos/sketches with great variations in head pose, lighting conditions, and style. Finally, exciting work remains to be done on better evaluating the quality of synthesized sketches and photos.

References

  • [1] C. Peng, X. Gao, N. Wang, D. Tao, X. Li, and J. Li, “Multiple representations-based face sketch-photo synthesis,” IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 11, pp. 2201–2215, 2016.
  • [2] L. Zhang, L. Lin, X. Wu, S. Ding, and L. Zhang, “End-to-end photo-sketch generation via fully convolutional representation learning,” in Proceedings of the 5th ACM on International Conference on Multimedia Retrieval, 2015, pp. 627–634.
  • [3] N. Wang, X. Gao, and J. Li, “Random sampling for fast face sketch synthesis,” Pattern Recognition (PR), 2017.
  • [4] P. Isola, J. Zhu, T. Zhou, and A. Efros, “Image-to-image translation with conditional adversarial networks,” arXiv preprint arXiv: 1611.07004, Tech. Rep., 2016.
  • [5] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in International Conference on Neural Information Processing Systems, 2014, pp. 2672–2680.
  • [6] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. Metaxas, “Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks,” Tech. Rep., 2016.
  • [7] J. Johnson, A. Alahi, and F. F. Li, “Perceptual losses for real-time style transfer and super-resolution,” in European Conference on Computer Vision, 2016, pp. 694–711.
  • [8] N. Wang, W. Zha, J. Li, and X. Gao, “Back projection: an effective postprocessing method for gan-based face sketch synthesis,” Pattern Recognition Letters, pp. 1–7, 2017.
  • [9] S. Liu, J. Yang, C. Huang, and M. H. Yang, “Multi-objective convolutional learning for face labeling,” in Computer Vision and Pattern Recognition, 2015, pp. 3451–3459.
  • [10] N. Wang, M. Zhu, J. Li, B. Song, and Z. Li, “Data-driven vs. model-driven: Fast face sketch synthesis,” Neurocomputing, 2017.
  • [11] Y. Song, J. Zhang, L. Bao, and Q. Yang, “Fast preprocessing for robust face sketch synthesis,” in Proceedings of International Joint Conference on Artifical Intelligence, 2017, pp. 4530–4536.
  • [12] Y. Song, L. Bao, S. He, Q. Yang, and M. H. Yang, “Stylizing face images via multiple exemplars,” Computer Vision and Image Understanding, 2017.
  • [13] X. Gao, N. Wang, D. Tao, and X. Li, “Face sketch–photo synthesis and retrieval using sparse representation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 8, pp. 1213–1226, 2012.
  • [14] Y. Song, L. Bao, Q. Yang, and M. H. Yang, “Real-time exemplar-based face sketch synthesis,” in European Conference on Computer Vision, 2014, pp. 800–813.
  • [15] Q. Pan, Y. Liang, L. Zhang, and S. Wang, “Semi-coupled dictionary learning with applications to image super-resolution and photo-sketch synthesis,” in Computer Vision and Pattern Recognition, 2012, pp. 2216–2223.
  • [16] X. Wang and X. Tang, “Face photo-sketch synthesis and recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 11, pp. 1955–67, 2009.
  • [17] S. Zhang, X. Gao, N. Wang, J. Li, and M. Zhang, “Face sketch synthesis via sparse representation-based greedy search.” IEEE Transactions on Image Processing, vol. 24, no. 8, pp. 2466–77, 2015.
  • [18] S. Zhang, X. Gao, N. Wang, and J. Li, “Robust face sketch style synthesis,” IEEE Transactions on Image Processing, vol. 25, no. 1, p. 220, 2016.
  • [19] N. Wang, D. Tao, X. Gao, X. Li, and J. Li, “Transductive face sketch-photo synthesis,” IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 9, pp. 1364–1376, 2013.
  • [20] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in International Conference on Neural Information Processing Systems, 2012, pp. 1097–1105.
  • [21] D. Zhang, L. Lin, T. Chen, X. Wu, W. Tan, and E. Izquierdo, “Content-adaptive sketch portrait generation by decompositional representation learning,” IEEE Transactions on Image Processing, vol. 26, no. 1, pp. 328–339, 2016.
  • [22] L. Wang, V. A. Sindagi, and V. M. Patel, “High-quality facial photo-sketch synthesis using multi-adversarial networks,” arXiv preprint arXiv:1710.10182, 2017.
  • [23] J. Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” pp. 2242–2251, 2017.
  • [24] T. C. Wang, M. Y. Liu, J. Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro, “High-resolution image synthesis and semantic manipulation with conditional gans,” 2017.
  • [25] M. Zhang, J. Li, N. Wang, and X. Gao, “Compositional model-based sketch generator in facial entertainment,” IEEE Transactions on Cybernetics, vol. PP, no. 99, pp. 1–12, 2017.
  • [26] Y. Song, J. Zhang, S. He, L. Bao, and Q. Yang, “Learning to hallucinate face images via component generation and enhancement,” in Proceedings of International Joint Conference on Artifical Intelligence, 2017, pp. 4537–4543.
  • [27] R. Huang, S. Zhang, T. Li, and R. He, “Beyond face rotation: Global and local perception gan for photorealistic and identity preserving frontal view synthesis,” 2017.
  • [28] M. Y. Liu and O. Tuzel, “Coupled generative adversarial networks,” 2016.
  • [29] A. Makhzani, J. Shlens, N. Jaitly, and I. Goodfellow, “Adversarial autoencoders,” Computer Science, 2015.
  • [30] K. Tero, A. Timo, L. Samuli, and L. Jaakko, “Progressive growing of gans for improved quality, stability, and variation,” Tech. Rep., 2017.
  • [31] D. Chen, L. Yuan, J. Liao, N. Yu, and G. Hua, “Stylebank: An explicit representation for neural image style transfer,” 2017.
  • [32] Y. Yan, J. Xu, B. Ni, and X. Yang, “Skeleton-aided articulated motion generation,” 2017.
  • [33] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” 2017.
  • [34] X. Wang and X. Tang, “Face photo-sketch synthesis and recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 11, pp. 1955–1967, 2009.
  • [35] W. Zhang, X. Wang, and X. Tang, “Coupled information-theoretic encoding for face photo-sketch recognition,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 513–520.
  • [36] N. Wang, D. Tao, X. Gao, X. Li, and J. Li, “A comprehensive survey to face hallucination,” International Journal of Computer Vision, vol. 106, no. 1, pp. 9–30, 2014.
  • [37] X. Tang and X. Wang, “Face photo recognition using sketch,” in Proceedings of IEEE International Conference on Image Processing, 2002, pp. 257–260.
  • [38] A. Martinez and R. Benavente, “The AR face database,” CVC, Barcelona, Spain, Tech. Rep. 24, Jun. 1998.
  • [39] K. Messer, J. Matas, J. Kittler, J. Luettin, and G. Maitre, “XM2VTSDB: the extended M2VTS database,” in Proceedings of the International Conference on Audio- and Video-Based Biometric Person Authentication, Apr. 1999, pp. 72–77.
  • [40] P. Phillips, H. Moon, P. Rauss, and S. Rizvi, “The FERET evaluation methodology for face recognition algorithms,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 10, pp. 1090–1104, 2000.
  • [41] L. Zhang, L. Zhang, X. Mou, and D. Zhang, “Fsim: A feature similarity index for image quality assessment,” IEEE Transactions on Image Processing, vol. 20, no. 8, p. 2378, 2011.
  • [42] N. Wang, X. Gao, J. Li, B. Song, and Z. Li, “Evaluation on synthesized face sketches,” Neurocomputing, vol. 214, no. C, pp. 991–1000, 2016.
  • [43] L. Chen, H. Liao, and M. Ko, “A new lda-based face recognition system which can solve the small sample size problem,” Pattern Recognition, vol. 33, no. 10, pp. 1713–1726, 2000.
  • [44] X. Huang, Y. Li, O. Poursaeed, J. Hopcroft, and S. Belongie, “Stacked generative adversarial networks,” 2016.