Semi-Supervised Learning for Face Sketch Synthesis in the Wild, ACCV2018
Face sketch synthesis has made great progress in the past few years. Recent methods based on deep neural networks are able to generate high quality sketches from face photos. However, due to the lack of training data (photo-sketch pairs), none of such deep learning based methods can be applied successfully to face photos in the wild. In this paper, we propose a semi-supervised deep learning architecture which extends face sketch synthesis to handle face photos in the wild by exploiting additional face photos in training. Instead of supervising the network with ground truth sketches, we first perform patch matching in feature space between the input photo and photos in a small reference set of photo-sketch pairs. We then compose a pseudo sketch feature representation using the corresponding sketch feature patches to supervise our network. With the proposed approach, we can train our networks using a small reference set of photo-sketch pairs together with a large face photo dataset without ground truth sketches. Experiments show that our method achieve state-of-the-art performance both on public benchmarks and face photos in the wild. Codes are available at https://github.com/chaofengc/Face-Sketch-Wild.READ FULL TEXT VIEW PDF
Sketch-based face recognition is an interesting task in vision and multi...
Exemplar-based face sketch synthesis methods usually meet the challengin...
In this paper, we propose a kinship generator network that can synthesiz...
Doodling is a useful and common intelligent skill that people can learn ...
We contribute the first large-scale dataset of scene sketches, SketchySc...
The problem of domain generalization is to learn from multiple training
Imagine a world in which each photo, printed or digitally displayed, hid...
Semi-Supervised Learning for Face Sketch Synthesis in the Wild, ACCV2018
Face sketch synthesis targets at generating a sketch from an input face photo. It has many useful applications. For instance, police officers often have to rely on face sketches to identify suspects, and face sketch synthesis makes it feasible for matching sketches against photos in a mugshot database automatically. Artists can also employ face sketch synthesis to simplify the animation production process . Many people prefer using sketches as their profile pictures in social media networks , and face sketch synthesis allows them to produce sketches without the help of a professional artist.
Much effort has been devoted to face sketch synthesis. In particular, exemplar based methods dominated in the past two decades. These methods can achieve good performance without explicitly modeling the highly nonlinear mapping between face photos and sketches. They commonly subdivide a test photo into overlapping patches, and match these test patches with the photo patches in a reference set of photo-sketch pairs. They then compose an output sketch using the corresponding sketch patches in the reference set. Although promising results have been reported [1, 3, 4, 5], these methods have several drawbacks. For example, sketches in Fig. 4
(c)(d)(e)(f) are over-smoothed and fail to preserve subtle contents such as strands of hair on the forehead. Moreover, the patch matching and optimization processes are often very time-consuming. Recent methods exploited Convolutional Neural Networks (CNNs) to learn a direct mapping between photos and sketches, which is, however, a non-trivial task. The straight forward CNN based method produces blurry sketches (see Fig.4(g)), and methods based on Generative Adversary Networks (GAN)  introduces undesirable artifacts (see Fig. 4(h),(i)). Besides, all these CNN based methods do not generalize well to face photos in the wild due to the lack of large training datasets of photo-sketch pairs. Although unpaired GAN based methods such as Cycle-GAN  can use unpaired data to transfer images between different domains, they fail to well preserve the facial content because of the weak content constraint (see fig. 8).
In this paper, we propose a semi-supervised learning framework for face sketch synthesis that takes advantages of the exemplar based approach, the perceptual loss and GAN. We design a residual net  with skip connections as our generator network. Suppose we have a small reference set of photo-sketch pairs and a large face photo dataset without ground truth sketches. Similar to the exemplar based approach, we subdivide the VGG-19  feature maps of the input photo into overlapping patches, and match them with the photo patches (in feature space) in the reference set. We then compose a pseudo sketch feature representation using the corresponding sketch patches (in feature space) in the reference set. We can then supervise our generator network using a perceptual loss based on the mean squared error (MSE) between the feature maps of the generated sketch and the corresponding pseudo sketch feature of the input photo. An adversary loss is also utilized to make the generated sketches more realistic.
In summary, our main contributions are three folds: (1) A semi-supervised learning framework for face sketch synthesis. Our framework allows us to train our networks using a small reference set of photo-sketch pairs together with a large face photo dataset without ground truth sketches. This enables our networks to generalize well to face photos in the wild. (2) A perceptual loss based on pseudo sketch feature. We show that the proposed loss is critical in preserving both facial content and texture details in the generated sketches. Extensive experiments are conducted to verify the effectiveness of our model. Both qualitative and quantitative results illustrate the superiority of our method. (3) To the best of our knowledge, our method is the first work that can generate visually pleasant sketches for face photos in the wild.
Tang and Wang  first introduced the exemplar based method based on eigentransformation. They projected an input photo onto the eignspace of the training photos, and then reconstruct a sketch from the eignspace of the training sketches using the same projection. Liu et al.  observed that the linear model holds better locally, and therefore proposed a nonlinear model based on local linear embedding (LLE). They first subdivided an input photo into overlapping patches and reconstructed each photo patch as a linear combination of the training photo patches. They then obtained the sketch patches by applying the same linear combinations to the corresponding training sketch patches. Wang and Tang  employed a multi-scale markov random fields (MRF) model to improve the consistency between neighboring patches. By introducing shape priors and SIFT features, Zhang et al.  proposed an extended version of MRF which can handle face photos under different illuminations and poses. However, these MRF based methods are not capable of synthesizing new sketch patches since they only select the best candidate sketch patch for each photo patch. To tackle this problem, Zhou et al.  presented the markov weight fields (MWF) model which produces a target sketch patch as a linear combination of best candidate sketch patches. Considering that patch matching based on traditional image features (e.g., PCA and SIFT) is not robust, a recent method  used CNN feature to represent the training patches and computed more accurate combination coefficients. To accelerate the synthesis procedure, Song et al.  formulated face sketch synthesis as a spatial sketch denoising (SSD) problem, and Wang et al.  presented an offline random sampling strategy for nearest neighbor selection of patches.
Recent works applied CNN to synthesize sketches and produced promising results. Zhang et al. 
proposed a 7-layer fully convolutional network (FCN) to directly transfer an input photo to a sketch. Although their model can roughly estimate the outline of a face, it fails to capture texture details with the use of intensity based mean square error (MSE) loss. Zhanget al.  utilized a branched fully convolutional network (BFCN) consisting of a content branch and a texture branch. Because the face content and texture are predicted separately with different loss metrics, the final sketch looks disunited. Chen et al.  proposed the pyramid column feature and used it to compose a reference style for a test photo from the training sketches. They utilized a CNN to create a content image from the photo, and then transferred the reference style to introduce shadings and textures in the output sketch. Wang et al.  presented the multi-scale generative adversarial networks (GANs) to generate sketches from photos and vice versa. Multiple discriminators at different hidden layers are used to supervise the synthesis process. Gao et al.  took advantage of the facial parsing map and proposed a composition-aided stack GAN. All these deep learning based methods require ground truth photo-sketch pairs for training, and they do not generalize well to face photos in the wild due to the lack of training data.
Our framework is composed of three main parts, namely a generator network , a pseudo sketch feature generator and a discriminator network (see Fig. 1). The generator network is a deep residual network with skip connections. It is used to generate a synthesized sketch for each input photo . The pseudo sketch feature generator is the key to our semi-supervised learning approach. Instead of training the generator network directly with ground truth sketches, we construct a pseudo sketch feature for each input photo to supervise the synthesis of . In this way, we can train our network on any face photo datasets, and generalize our model to face photos in the wild. We further adopt a discriminator network to minimize the gap between generated sketches and real sketches drawn by artists.
Given a reference set , the pseudo sketch feature generator targets at constructing a pseudo sketch feature for a test photo which is used to supervise the synthesis of the sketch . We follow MRF-CNN  to extract a local patch representation of an image. We first feed into a pretrained VGG-19 network and extract the feature map at the -th layer. Similarly, we obtain and . Let us denote a patch centered at a point of as , and the same definition applies to and . Now for each patch , where and with and being the height and width of , we find its best match in the reference set based on cosine distance, i.e.,
Since the photos and the corresponding sketches in are well aligned, we directly apply to index the corresponding sketch feature patch for , and use it as the pseudo sketch feature patch for . Finally, a pseudo sketch feature representation (at layer ) for is given by . Fig. 2 visualizes an example of the pseudo sketch feature. It can be seen that the pseudo sketch feature provides a good approximation of the real sketch feature (see Fig. 2(a)). We also show a naïve reconstruction in Fig. 2(b) obtained by directly using the matching index to index the pixel values in the training sketches. We can see such a naïve reconstruction does roughly resemble the real sketch, which also justifies the effectiveness of the pseudo sketch feature. Note that we only need alignment between photos and sketches in . Since we perform a dense patch matching between the input photo and the reference photos, we can also generate reasonable pseudo sketch features for input faces under different poses (see Fig. 2(c)).
We define our pseudo sketch feature loss as
where refer to layers relu3_1, relu4_1, and relu5_1 respectively. High level features after relu3_1 are better representations of textures and more robust to appearance changes and geometric transforms . Fig. 3 shows the results of using different layers in . As expected, low level features (e.g., relu1_1 and relu2_1) fail to generate sketch textures. While high level features (e.g., relu5_1) can better preserve textures, they produce artifacts in terms of details (see the eyes of sketches in Fig. 3). To get better performance and reduce the computation cost of patch matching, we set .
For easier convergence, we use the least square loss when training the GAN, known as LSGAN . The objective functions of LSGAN are given by
where denotes the intensity value at of the synthesized sketch .
Based on the above loss terms, we can train our generator network and discriminator network
using the following two loss functions respectively:
where and are minimized alternatively until converge. , and are trade-off weights for each loss term respectively.
We use two public datasets: the CUFS dataset(consist of the CUHK student dataset , the AR dataset , and the XM2VTS dataset ) and the CUFSF dataset , to evaluate our model111Data comes from http://www.ihitworld.com/RSLCR.html. The CUFSF dataset is more challenging than the CUFS dataset because (1) the photos were captured under different lighting conditions and (2) the sketches exhibit strong deformation in shape and cannot be aligned with the photos well. Details of these datasets are summarized in Table 1.
We use the VGG-Face dataset  to evaluate our model on photos in the wild. There are 2,622 persons in this dataset and each person has 1,000 photos. We randomly select 2,000 persons for training and the rest for testing. For each person in the training split, we randomly select photos and named the resulting dataset VGG-Face222The dataset will be made available., where . We also randomly select 2 photos for each person in the testing split to construct a VGG test set of 1,244 photos.
For photos/sketches which have already been aligned and have a size of , we leave them unchanged. For the rest, we first detect 68 face landmarks on the image using dlib333http://dlib.net/, and calculate a similarity transform to warp the image into one with the two eyes located at and respectively. We then crop the resulting image to a size of . We simply drop those photos/sketches from which we fail to detect face landmarks.
As in exemplar based methods, patch matching is a time-consuming process. We accelerate this process in three ways. First, we precompute and store the feature patches for the photos and sketches in the reference set (i.e., and ). Second, instead of searching the whole reference feature set, we first identify best matched reference photos for each input photo based on the cosine distance of their relu5_1 feature maps. Patch matching is then restricted within these reference photos (we set in the whole training process). Third, Equ. 1 is implemented as a convolution operator which can be computed efficiently on GPU.
We update the generator and discriminator alternatively at every iteration. The trade-off weights , are set to and , and is set to when use CUFS as reference style and when use CUFSF. We implemented our model using PyTorch444http://pytorch.org/, and trained it on a Nvidia Titan X GPU. We used Adam  with learning rates from to , decreasing with a factor of . Data augmentation was done online in the color space (brightness, contrast, saturation and sharpness). Each iteration took about 2s with a batch size of 6, and the model converged after about 5 hours of training.
In this section, we evaluate our model using two public benchmarks, namely CUFS and CUFSF, which were captured under laboratory conditions. We use the training photos from CUFSCUFSF to train our networks. When evaluating on CUFS, the reference photo-sketch pairs only comes from CUFS, and the same applies to CUFSF. To demonstrate the effectiveness of our model, we compare our results both qualitatively and quantitatively with seven other methods, namely MWF , SSD , RSLCR , DGFL , FCN 
, Pix2Pix-GAN, and Cycle-GAN . We also compare our results quantitatively with the latest GAN based sketch synthesis methods, i.e., PS-MAN  and stack-CA-GAN . Since the models of their work are not available, we can only compare with the results that are directly taken from their published papers.
As we can observe in Fig. 4, exemplar based methods (see Fig. 4(c),(d),(e) in general perform worse than learning based methods (see Fig. 4(g),(h),(i),(j)), especially in preserving contents of the input photos. Using deep features in exemplar based methods helps to alleviate the problem, but the results are over-smoothed (see Fig. 4(f)). Due to the lack of training data, FCN produces bad results when the photos are taken under very different lighting conditions (see last two rows of Fig. 4(g)). Although the two GANs can produce much better results than FCN, they also introduce many artifacts and noise. Thanks to the pseudo sketch feature loss, our method does not suffer from the above problems. In particular, our semi-supervised strategy allows us to incorporate more training photos without ground truth in training, which helps to improve the generalization ability.
as an image quality assessment metric to measure the similarity between a generated sketch and the ground truth sketch. However, many researchers (e.g., in super resolution and face sketch synthesis [30, 29]) pointed out that SSIM is not always consistent with the perceptual quality. One main reason is that SSIM favors slightly blurry images when the images contain rich textures. To demonstrate this, we show some sketches generated using different methods together with their SSIM scores in Fig. 5. It can be seen that sketch generated by RSLCR is smoother than those by Pix2Pix-GAN and our model, but have higher SSIM scores. We applied a bilateral filter to smooth all the sketches. It can be observed that the SSIM scores of the sketch generated by RSLCR remain roughly the same after smoothing, whereas those of the sketches generated by Pix2Pix-GAN and our model improve by more than .
In Fig. 6(a), we show the averaged SSIM scores of the sketches generated by different methods on CUFS, together with the averaged SSIM scores of their smoothed counterparts. As expected, the averaged SSIM scores of most of the methods improve after smoothing, same for a few exemplar based methods which produce over-smoothed sketches. The averaged SSIM score of our smoothed results is comparable to that of the state-of-the-art method. In Fig. 6(b), we show the corresponding results on CUFSF. Similar conclusions can be drawn.
Due to the drawback of SSIM, we use feature similarity (FSIM)  as our image quality assessment metric. FSIM is better at evaluating detailed textures compared with SSIM. It can be observed from Fig. 5 that the FSIM scores of the sketches decrease after smoothing. The average FSIM scores of the sketches generated by different methods on CUFS and CUFSF are shown in Fig. 6(c) and Fig. 6(d) respectively. It can be seen that our method achieves the state-of-the-art in terms of FSIM score on both CUFS and CUFSF.
Sketch recognition is an important application of face sketch synthesis. We follow the same practice of Wang et al.  and employ the null-space linear discriminant analysis (NLDA)  to perform the recognition experiments. Fig. 7
shows the recognition accuracy of different methods on the two datasets. Our method achieves the best result when the dimension of the reduced eigenspace is less than 100, and achieves a competitive result to the state-of-the-art method when the dimension is above 100.
To further demonstrate the effectiveness of the proposed method, we compare it with two latest GAN methods, namely PS-MAN  and stack-CA-GAN , which are specially designed for sketch synthesis. As shown in Table 5.2.3, our method achieves the best performance on almost all datasets, except for the SSIM score in CUFSF. However, we obtain a better performance on NLDA which indicates that our model can better preserve the identify information. Note that both of these GAN methods use extra information to train their network, i.e., multi-scale supervision (PS-MAN) and parsing map (stack-CA-GAN). Compared with them, our perceptual loss can not only avoid producing artifacts but also help to improve the generalization of the network.
There are two challenges for sketch synthesis in the wild. The first challenge is how to deal with real photos captured under uncontrolled environments with varying pose and lighting, and cluttered backgrounds. The second is the computation time. Our method tackles the first challenge by introducing more training photos through the construction of pseudo sketch features. Regarding computation time, our CNN based model can generate a sketch in a single feed forward pass which takes about 7ms on a GPU for a photo. We compare our method with five other methods , including SSD555http://www.cs.cityu.edu.hk/~yibisong/eccv14/index.html, FCN, Pix2Pix-GAN666https://github.com/phillipi/pix2pix, Cycle-GAN777https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix, and Fast-RSLCR888http://www.ihitworld.com/RSLCR.html. In this experiment, we train our model using CUFS as the reference set and all the training photos from the CUFS, CUFSF and VGG-Face10 as the training set. Since there are no ground truth sketches for the test photos, we carry out a mean opinion score (MOS) test to quantitatively evaluate the results.
As photos in the wild are captured under uncontrolled environments, their appearance may vary largely. Fig. 8 shows some photos sampled from our VGG-Face test dataset and the sketches generated by different methods. It can be observed that these photos may show very different lightings, poses, image resolutions, and hair styles. Besides, some photos may be incomplete and people may also use a cartoon as their photos for entertainment (see the last row of Fig. 8). It is therefore very difficult, if not impossible, for a method which only learns from a small set of photo-sketch pairs to generate sketches for photos in the wild. Among the results of other methods, exemplar based methods (see Fig. 8(b)(c)) fail to deal with pose changes and different hair styles. FCN produces sketches (see Fig. 8(d)) that can roughly preserve the contour of the face but lose important facial components (e.g., nose and eyes). Although GANs can generate some sketch like textures, none of them can well preserve the contents. The face shapes are distorted and the key facial parts are lost. It can be seen from Fig. 8(g) that our model can handle photos in the wild well and generate pleasant results.
Introducing more training photos from VGG-Face dataset is the key to improve the generalization ability of our model. As demonstrated in Fig. 6.2, the model trained without additional photos from VGG-Face has difficulty in handling uncontrolled lightings and different hair colors (see Fig. 6.2(b)). As we add more photos to the training set, the results improve significantly (see the eyes and hair in Fig. 6.2).
Since there are no ground truth sketches for the photos in the wild, we performed a MOS test to assess the perceptual quality of the sketches generated by different methods. Specifically, we randomly selected 30 photos from the VGG test set, and then generated the sketches for these photos using SSD, FCN, Fast-RSLCR, Pix2Pix-GAN and our method respectively. Given the example photo-sketch pairs from public benchmarks as reference, 108 raters were asked to rank 10 groups of randomly selected sketches synthesized by the five different methods. We assign a score of 1-5 to the sketches based on their rankings (5 being the best). The results are presented in Fig. 6.2. It can be observed that the MOS of our results significantly outperforms that of the other methods. This demonstrates the superiority of our method on photos in the wild.
In this paper, we propose a semi-supervised learning framework for face sketch synthesis in the wild. We design a residual network with skip connections to transfer photos to sketches. Instead of supervising our network using ground truth sketches, we construct a novel pseudo sketch feature representation for each input photo based on feature space patch matching with a small reference set of photo-sketch pairs. This allows us to train our model using a large face photo dataset (without ground truth sketches) with the help of a small reference set of photo-sketch pairs. Training with a large face photo dataset enables our model to generalize better to photos in the wild. Experiments show that our method can produce sketches comparable to those produced by other state-of-the-art methods on four public benchmarks (in terms of SSIM and FSIM), and outperforms them on photos in the wild.
We thank Nannan Wang, Hao Zhou and Yibing Song for providing their codes and data. We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research.
In: European Conference on Computer Vision. (2014) 800–813
In: IEEE Conference on Computer Vision and Pattern Recognition. (2012) 1091–1097
In: Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence. (2017) 3574–3580
Image-to-image translation with conditional adversarial networks.CVPR (2017)