Learning to Sketch Human Facial Portraits using Personal Styles by Case-Based Reasoning

07/10/2016 · by Bingwen Jin et al. · Zhejiang University and New Jersey Institute of Technology

This paper employs case-based reasoning (CBR) to capture the personal styles of individual artists and generate human facial portraits from photos accordingly. For each human artist to be mimicked, a series of cases is first built up from her/his exemplars of source facial photos and hand-drawn sketches; stylization of a facial photo is then cast as a style-transfer process of iterative refinement that looks for and applies best-fit cases in the sense of style optimization. Two models, a fitness evaluation model and a parameter estimation model, are learned from these cases for case retrieval and adaptation, respectively. The fitness evaluation model decides which case best fits the sketching of current interest, and the parameter estimation model automates case adaptation. The resultant sketch is synthesized progressively with an iterative loop of retrieval and adaptation of candidate cases until the desired aesthetic style is achieved. To explore the effectiveness and advantages of the novel approach, we experimentally compare the sketch portraits generated by the proposed method with those of a state-of-the-art example-based facial sketch generation algorithm as well as a couple of commercial software packages. The comparisons reveal that our CBR-based synthesis method for facial portraits is superior to the peer methods in both capturing and reproducing artists' personal illustration styles.




1 Introduction

Image-based artistic rendering (IB-AR) algorithms often rely on manually encoded heuristics to emulate specific artistic illustration styles. These heuristics can be classified into three categories Kyprianidis2013 : stroke-based rendering for image approximation lu2010interactive , region-based techniques zhao2010sisley , and image processing (filtering) papari2007artistic papari2009continuous . However, it is difficult for an artist to explicitly state the rules of artistic illustration, because these rules are often exercised subconsciously and are not always expressible verbally or symbolically. Therefore, example-based artistic rendering (EBAR) has been proposed to generate stylized sketches; it is more popular and more effective since the styles are implicitly embodied in the examples, and it is natural to specify artistic styles by showing a set of examples Chen2004 .

Existing EBAR methods can be broadly classified into model-based and model-free methods. Model-based methods embed prior knowledge about artistic rendering styles into generative models that carry a set of process-control parameters. For example, Lu et al. Lu2012Pencil built parametric histogram models of the tone distribution inside the sketch example; the resultant sketch was then generated by adjusting tones with this model according to the tone map of the source image. Tu et al. Tu2010 learned direct combined models from the joint distributions of feature-vector pairs of input facial image vs. neutral facial shape, neutral vs. exaggerated facial shape, and facial sketch vs. the combination of input facial image and exaggerated facial shape; the system then optimally synthesized the corresponding output sketch by applying the MMSE criterion within the combined eigenspace. Model-free methods synthesize the resultant style-preserving illustrations directly from the correlations between pixels, patches, or strokes in the exemplars. Wang et al. Wang2009 divided the facial region of the source photo into overlapping patches; for each patch, they located a similar photo patch in the examples and collected its corresponding sketch patch for smooth synthesis of the resultant sketch. Zhao et al. Zhao2011 built a dictionary of stroke templates for oil-painting portraits with complete information of artists' stroke-by-stroke drawing processes; their method painterly rendered a new portrait by reusing brush strokes from a template matched by facial shape and color in the source image. Model-based methods can generate diverse new illustrations with parametric styles through generalization over exemplars; however, subtle and unique artistic characteristics of an individual artist are somewhat lost when modelling the stylization. Model-free methods naturally incorporate the visual correlations in the examples into the resultant sketch, but the richness of the resulting sketches is usually limited by that of the given examples.

In the community of artificial intelligence, case-based reasoning (CBR) solves a new problem instance by first recalling one or multiple similar, previously solved problem instances and then reapplying the known solutions, often with adaptations to the new context aamodt1994case . From the point of view of problem solving, the exemplars in EBAR, a photo and its corresponding artistic illustration, are very similar to the cases in CBR. This motivates us to employ case-based reasoning to generate stylized human facial portraits for an individual artist. Fig. 1 shows two facial sketches automatically created by our method using the styles learned from two artists; sketches generated by two commercial software packages are also presented for comparison.

From a technical perspective, the major challenges for CBR-based stylization of portraits lie in case retrieval and adaptation. Case retrieval in stylization is complex and goes beyond selecting a best example for a given input as in Wang2009 , because the portrait sketching practice of human artists is usually a multi-step, progressive painting process in which case retrieval may occur at multiple steps, accounting not only for the given input but also for the current interim sketch and the desired resulting sketch in the artist's mind. However, while looking for suitable cases, the expected resultant sketch is not yet available. Case adaptation for stylization transfers the style-related correlations in the selected cases into the current sketching. Although existing style-transfer approaches Tu2010 Wang2009 Zhao2011 can be utilized, the key point is how to automate case adaptation for stylization, especially when multiple cases are applicable, since automatic case adaptation is necessary to enable CBR systems to function autonomously and to serve naive as well as expert users leake1995learning .

We address these challenges with an approach specific to personalized stylization for sketching faces. First, we designed an iterative pipeline as the overall algorithmic mechanism for CBR-based stylization, in which the current sketch is refined iteratively by the best-fit case until the desired sketch emerges. Second, motivated by a human artist's practice of forming a desired sketch in mind before drawing, we train a predictive model from the cases of each artist to hypothetically create such a sketch. The model evaluates the fitness of each candidate case in terms of visual similarity to the presumed "resultant sketch"; the best-fit case is selected to sketch the current face. Furthermore, to automate case adaptation, a parameter estimation model is learned for each artist in advance, which automatically assigns appropriate parameter values for case adaptation whenever a new case is retrieved for refinement of the resulting sketch.

To summarize, our paper makes the following contributions:

  • A novel case-based reasoning algorithmic pipeline that iteratively stylizes human portraits by exemplars, producing rich portraits while preserving personal styles well.

  • A predictive framework integrated with a generate-and-evaluate mechanism for best-fit case retrieval, capable of hypothetically creating the desired sketch in the artist's mind, which greatly facilitates the evaluation of best-fit cases.

  • A parameter estimation model learned for each artist, which enables automatic case adaptation for stylized sketching of facial photos.

2 Related Work

A multitude of image-based artistic rendering (IB-AR) techniques have been proposed; Kyprianidis et al. Kyprianidis2013 give an in-depth survey of IB-AR techniques. Here we focus on example-based artistic rendering (EBAR), which can be roughly categorized into two classes: model-based and model-free methods.

Model-based methods acquire prior knowledge of artistic rendering styles from examples and accordingly build up stylization models. Besides the model-based approaches in Lu2012Pencil Tu2010 (see Section 1), Reinhard et al. reinhard2001color modeled the color style of a source image through the means and standard deviations along each of the three axes of the lαβ color space, and then imposed these means and standard deviations onto the target image, transferring the color style to it. Liang et al. Liang2002 modeled the facial exaggeration style by analyzing the correlation between image-caricature pairs using partial least squares (PLS). The model-based hatching in Kalogerakis2012 trained a mapping from the features of an input 3D object to hatching properties; a new illustration was then generated in the target style according to the predicted properties. The aforementioned model-based methods can generate diverse new illustrations with parametric styles through generalization over exemplars; however, subtle and unique artistic characteristics of an individual artist are somewhat lost when modelling the stylization.

Model-free methods generate new artistic illustrations directly by reusing the correlations provided by the exemplars. Hertzmann et al. Hertzmann2001 proposed the image analogies algorithm, which reuses a source image A and an artistic depiction A' of that image to synthesize an artistic illustration B' for a new image B at the pixel level. Each new pixel is synthesized by reusing the exemplar pixel that best matches the pixel being synthesized. Later, an extension of this algorithm incorporated image gradient direction to better preserve the object shape of the target image Lee2010DTT . Example-based stippling Kim2009 proposed a texture similarity metric based on the gray-level co-occurrence matrix, aiming at generating stipple textures that are perceptually close to input samples; their reuse of examples merely accounted for pixels related to stipple primitives. Besides pixel-level reuse of examples, patch-level reuse has also been proposed, which often divides the original image exemplars into patch exemplars Wang2009 . Liu et al. Liu2005 took a similar approach to Wang2009 , retrieving multiple exemplars for each photo patch of a new face and synthesizing a sketch patch by linearly blending the sketch patches of the candidate exemplars. Wang et al. wang2013transductive improved this approach by defining a probabilistic model that optimizes both the reconstruction fidelity of the input photo and the synthesis fidelity of the target output sketch. Song et al. song2014real further extended it with a spatial sketch denoising method. Moreover, the face image can also be decomposed into patches in terms of the anatomical structure of the human face, as in Chen2004 Gao2009 Min2007 zhang2014data . However, the style of a rendition is mainly embodied in the strokes of a sketch rather than in the pixels or patches of an image; therefore stroke-based reuse in EBAR has also been investigated Zhao2011 . Berger et al. berger2013style presented an approach that reuses real artists' strokes in the image. To retain specific styles, they composed a stroke library for each artist and "cloned" the relevant strokes that matched the detected edges in the source image.

3 Method Overview

Figure 2: System overview. Feature points on images are represented by yellow dots.
Figure 3: Facial feature points and their corresponding color-coded facial regions.

Fig. 2 gives an overview of our CBR-based sketch synthesis framework, which produces visually superior results to existing synthesis methods. Given exemplars of source facial photos and stylized sketches hand-drawn by a human artist, the preprocessing phase for case-based reasoning proceeds in three steps: 1) we generate Sketch Synthesis (STS) cases and construct a library of them; 2) Fitness Evaluation (FE) models for the artist are trained from the STS cases; 3) Parameter Estimation (PE) models are learned for automatic case adaptation. During the runtime phase, we use the STS cases, FE models, and PE models to iteratively synthesize a new facial sketch via CBR for each newly given facial photo.

The best-fit case retrieval is carried out by a predictive framework with a generate-and-evaluate mechanism. From the point of view of style imitation, the best-fit case should be the one that maximally preserves the human artist's style in the resulting sketch. However, the resulting sketch is yet to be generated, which raises the problem of an unknown ground truth. Our solution is to hypothesize that there is always a ground truth in the mind of a human artist while drawing the sketch; indeed, a human artist can easily select the best-fit case using the desired sketch in his/her mind. We therefore propose a predictive framework embedded with a generate-and-evaluate mechanism, in which each candidate case is implicitly applied to the sketch of current interest. The FE model is trained to rank each case's fitness in terms of the similarity between the optimally generated sketch and the one mentally imagined by the artist.

Our case adaptation for sketch synthesis is implemented through a blending operator, whose parameter setting should maximize the similarity between the generated sketch and its hypothesized groundtruth. An explicit parameter search for optimization is time-consuming, so we learn a PE model to configure the blending parameter automatically: to generate a new sketch, we blend the stylized sketch from the best-fit case and the sketch of current interest with the parameter assigned by the PE model.

During the iterative synthesis phase, we first segment the entire facial region into multiple regions according to the anatomical structure of a human face (see Fig. 3 for facial regions painted with different colors). Then, for each segmented region, the resultant sketch is synthesized progressively with an iterative loop of retrieval and adaptation of candidate cases until the desired aesthetic style is achieved. This is motivated by the human artist's multi-step, progressive portrait sketching practice. The FE model guides case retrieval, while the PE model automates case adaptation. Finally, the overall sketch for the entire face is composed from the sketches of these local facial regions according to their relative spatial layout based on facial feature points.
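The per-region loop described above can be sketched as follows. Here `fe_model`, `pe_model`, `blend`, and the case structure are illustrative stand-ins for the learned FE/PE models, the blending operator, and the STS cases (regions are reduced to scalars for brevity); this is a sketch of the control flow, not the full implementation.

```python
# Minimal sketch of the iterative CBR refinement loop for one facial region.
def synthesize_region(photo, cases, fe_model, pe_model, blend, eps=0.01):
    # First iteration: pick the case whose source photo is visually closest.
    best = min(cases, key=lambda c: abs(c["photo"] - photo))
    sketch = best["sketch"]
    prev_fit = float("-inf")
    while True:
        # Retrieval: the FE model ranks each candidate case for the
        # current interim sketch.
        best = max(cases, key=lambda c: fe_model(photo, sketch, c))
        fit = fe_model(photo, sketch, best)
        if fit - prev_fit < eps:   # no meaningful improvement -> stop
            return sketch
        # Adaptation: the PE model assigns the blending weight automatically.
        w = pe_model(photo, sketch, best)
        sketch = blend(sketch, best["sketch"], w)
        prev_fit = fit
```

The loop terminates when the predicted fitness stops improving, mirroring the threshold-based stopping rule described in Section 5.1.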

Moreover, a significant characteristic of facial sketches is exaggeration. With our CBR framework, simulation of exaggeration is easily embedded as a post-processing step: after the iterative synthesis process, an exaggeration field is generated from the STS cases and then applied to the synthesized sketch to imitate the exaggeration style of a human artist.

4 Building up Cases

Figure 4: Pipeline of building up cases. Feature points on images are represented by yellow dots.

To build up STS cases (see Fig. 4), we start from a set of paired source photos and their corresponding stylized sketches, , where is the -th sample composed of a front-view facial photo and its corresponding stylized sketch . All sketches in are illustrated in the same style, as denoted by the subscript. The cardinality of , , is the number of sample pairs in it.

A STS case consists of six components: source photo , stylized sketch , neutral sketch , exaggeration field , and two feature point sets, and , of and respectively. Each source photo contains exactly one human face in front view. The stylized sketch is hand-drawn by the human artist. The neutral sketch is obtained by removing the exaggeration from the stylized sketch; it is identical to the stylized sketch if there is no exaggeration. The exaggeration field is a 2D vector field representing the displacement of relevant pixels due to exaggeration. Each feature point set has 67 pre-defined points (see Fig. 3): one set is from the source photo, and the other from the stylized sketch. Formally, we denote a STS case as:


A well-established practice for representing the geometry of a facial image is based on the spatial layout of facial feature points. Following this principle, the geometry of and is represented by and respectively. is a set of 67 pre-defined facial feature points extracted from (see Fig. 3) by the active shape model (ASM) Milborrow2008 . Each feature point is uniquely assigned to one of the nine aforementioned facial regions. Similarly, is a set of feature points extracted from , also detected by ASM Milborrow2008 . The index numbers of the feature points in the two sets and are the same. That is, points with the same index number presumably depict the same location on a face, assuming a perfect positional alignment between and . However, facial elements sketched by artists sometimes deform geometrically and/or positionally from their counterparts in the corresponding facial photo, partly due to exaggeration. To obtain the correct matching between them, manual adjustment is often required.

More details about exaggeration field, features, and neutral sketch are given below.

4.1 Exaggeration Field

Exaggeration field (EF) is a matrix of two-dimensional vectors. The dimension of the matrix is the same as the size of the EF’s corresponding image. Each pixel in the image has a counterpart two-dimensional vector that represents the pixel’s horizontal and vertical displacements from its position in the neutral sketch to the corresponding position in the deformed sketch with exaggeration.

The EF representing the exaggeration in is denoted as . To generate , the positions of the feature points in and are aligned, and the geometric image transformation from to is computed by the image deformation algorithm based on Moving Least Squares (MLS) Schaefer2006 . The user can also manually adjust the transformation if needed. In the MLS algorithm, a denser set of displacement vectors is interpolated from the translations between the feature points and . In our prototype implementation, 50,000 two-dimensional vectors are derived in an EF for an image of 200 by 250 pixels.
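As an illustration, applying an EF to an image can be sketched as a per-pixel backward warp. Nearest-neighbour sampling is used here for brevity, whereas the paper relies on MLS-based deformation; the function name and sampling scheme are assumptions for this sketch.

```python
import numpy as np

# Sketch: an exaggeration field (EF) stores one 2D displacement per pixel;
# applying it warps a neutral sketch toward the exaggerated one.
def apply_exaggeration(image, ef):
    h, w = image.shape
    out = np.zeros_like(image)
    for y in range(h):
        for x in range(w):
            dy, dx = ef[y, x]          # vertical / horizontal displacement
            # Backward mapping with clamping at the image border.
            sy = min(max(int(round(y - dy)), 0), h - 1)
            sx = min(max(int(round(x - dx)), 0), w - 1)
            out[y, x] = image[sy, sx]  # sample from the neutral sketch
    return out
```

For a 200-by-250 image the EF holds exactly the 50,000 vectors mentioned above, one per pixel.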

4.2 Features

Given a facial image (sketch or photo) and its feature points, the facial image is segmented into nine regions by assigning each pixel inside the outline of the facial feature points to its nearest pre-defined facial feature point by the distance. Let be one of the nine facial regions in ; be 's corresponding area in ; be a region identification function whose input and output are a facial region and an index representing the region type, respectively. Without loss of generality, let be the -th facial region in , which means . We denote and as the feature point sets of and respectively.

Let be a vector of features for , which includes four types of features, i.e. . Similarly, we extract a set of features for characterizing , i.e. .

The features used to characterize photo region are:

  • The feature vector is a SURF descriptor Bay2006_surf . For a photo patch , we extract a dimensional SURF feature vector to describe the distribution of Haar-wavelet responses in the photo patch.

  • The feature vector is a normalized histogram on the gray value of pixels in . We empirically set the histogram dimension to 32 in our implementation.

  • The feature vector is a normalized histogram on the gradient directions of pixels in where the gradient direction of each pixel is calculated using the Sobel operator MachineVision . We also empirically set the histogram dimension to 32.

  • The feature vector is composed of simplified shape context descriptors Belongie2002 , which is a log-polar histogram of the coordinates of the remaining points measured using a reference point as the origin. Before generating shape context descriptors, we sample the outline of with uniform spacing, resulting in 100 vertices. Then, for each feature point in , a shape context descriptor is computed with 1 and 8 bins for and respectively. Therefore, the dimensionality of is .

These features depict the high-level visual characteristics because exaggerated sketch illustrations usually focus more on the salient visual features of an object or scene Mish1994 rather than the details.
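As a concrete illustration of one of the feature types above, the 32-bin normalized gray-value histogram can be computed as follows (the function name and bin range are assumptions for this sketch):

```python
import numpy as np

# Sketch of one region feature (Section 4.2): a 32-bin normalized histogram
# of pixel gray values, so that the bins sum to 1.
def gray_histogram(region, bins=32):
    hist, _ = np.histogram(region, bins=bins, range=(0, 256))
    return hist / hist.sum()
```

The gradient-direction histogram is built analogously, binning Sobel gradient angles instead of gray values.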

4.3 Neutral Sketch

Figure 5: Two instances of the recovered sketches after removing exaggeration (columns: photo, artist-illustrated sketch, recovered sketch).

To generate the neutral sketch, we also use the MLS algorithm to derive the geometric image transformation from to , similar to the method introduced in Sec. 4.1. By applying this transformation to , we remove the exaggeration from the sample sketch illustration , resulting in a neutral sketch . As a byproduct, the method also produces the new positions of the 67 pre-defined facial feature points in , denoted by a new feature point set . Fig. 5 shows two restored sketches from the stylized facial portraits.

5 Fitness Evaluation Model for Case Retrieval

In case retrieval, the fitness is evaluated in terms of the visual similarity between the generated sketch and the groundtruth sketch for the same input facial photo as illustrated by the target human artist. The metric for visual similarity assessment is based on normalized mutual information Modat2010FFD , partly because mutual information neither depends on any assumption about the data thevenaz2000 nor requires the extraction of additional features, such as edges and corners, a process which may introduce additional geometric errors maes1997 .
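For illustration, the normalized mutual information similarity can be sketched as follows, using the common definition NMI(A, B) = (H(A) + H(B)) / H(A, B) computed from a joint histogram; the bin count is an assumption, since the paper follows the implementation of Modat2010FFD.

```python
import numpy as np

# Sketch of normalized mutual information between two images.
def nmi(a, b, bins=32):
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()                  # joint distribution
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)  # marginals

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    # NMI ranges from 1 (independent) to 2 (identical).
    return (entropy(px) + entropy(py)) / entropy(pxy.ravel())
```

Identical images score 2, while an image compared against a constant scores 1, which makes the measure convenient to threshold.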

For each facial region, we train one FE model. Let be the FE model trained for the -th facial region. Its output is the fitness between a case from our case set and a case , which can be another case in the set or a new case. Its input is a composite feature vector computed from and , defined as . Then we have


5.1 Training Data Generation

To generate training data for , we start from a case set , which is randomly divided into 10 equal-sized subsets . Sketches in one of the subsets are used as the ground truth while the remaining 9 subsets are used to generate the sketch results, which are then cross-validated against the groundtruth subset. This process is performed 10 times, so each case is used exactly once as a groundtruth case. Without loss of generality, let the current groundtruth case set be , and the remaining case set be , where .
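The splitting scheme above can be sketched as follows; the random seed and the interleaved fold construction are illustrative details.

```python
import random

# Sketch of the 10-fold split used to generate FE training data: each
# subset serves once as the ground-truth set while the other nine
# provide the cases used for synthesis.
def ten_fold_splits(cases, seed=0):
    cases = list(cases)
    random.Random(seed).shuffle(cases)
    folds = [cases[i::10] for i in range(10)]
    for k in range(10):
        ground_truth = folds[k]
        remaining = [c for i, f in enumerate(folds) if i != k for c in f]
        yield ground_truth, remaining
```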

Formally, given a STS case set and a photo region from , we identify and adapt one or multiple cases to iteratively synthesize a neutral sketch region that maximally approximates the groundtruth . Let and respectively denote the final resultant sketch and the interim sketch after the -th iteration.

In the first iteration, the algorithm selects a case whose source photo region appears visually closest to in terms of the normalized mutual information Modat2010FFD . The sketch in the selected case is then denoted as .

In the -th iteration, the method searches in for a best-fit case . For each candidate case in , the adaptation of is carried out by blending and . The maximum similarity between the resultant sketch patch of the blending operation and the groundtruth is


where is also represented as the fitness value of , is the resultant sketch patch of blending and with blending weight , and is the visual similarity between and in terms of normalized mutual information Modat2010FFD .

The image blending operator used to synthesize was originally proposed in Lee1996 . It generates an in-between image, , of two input images, and , by , where and are two non-linear warping functions built from all pairs of corresponding feature points between and . Clearly, is identical to our . During the blending of and , their feature point sets, and , are also blended to compute the feature point set for . Formally, we define the blending operation as follows:


where is the resulting image of blending and with the blending parameter ; is the resulting set of facial feature points.

A non-linear optimization method Powell1998COBYLA is employed as the solver for Equation (3). After computing for all candidate cases, the case with the maximum is selected as . We denote this maximum as . The resultant sketch patch corresponding to , , is used in the next iteration. This iterative synthesis procedure terminates when is less than a threshold (set to in our implementation).
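As an illustration of solving for the optimal blending weight, the sketch below replaces the COBYLA solver with a simple grid search and uses a linear blend in place of the warping-based operator of Lee1996; both substitutions are simplifications for exposition.

```python
# Sketch of the per-case adaptation step: find the blending weight w in
# [0, 1] that maximizes similarity between blend(s_cur, s_case, w) and
# the ground truth. A grid search stands in for the COBYLA solver.
def best_blend_weight(s_cur, s_case, truth, similarity, steps=101):
    blend = lambda a, b, w: (1 - w) * a + w * b   # simplified linear blend
    candidates = [i / (steps - 1) for i in range(steps)]
    return max(candidates, key=lambda w: similarity(blend(s_cur, s_case, w), truth))
```

In the real system, `similarity` would be the normalized mutual information of Section 5, and the blend would warp the feature point sets as well.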

All generated in the aforementioned iterative process are collected as sample set , where is the number of iterations, and all combinations of are collected as training samples. Hence we generate samples for after synthesizing a sketch for . Supposing the mean value of is and the size of is , we can generate training samples.

5.2 Learning Fitness Evaluation Model

Given the training samples for , we train the fitness evaluation model via regression. To identify the optimal regression model for , we employ ten-fold cross-validation (CV) during the model selection process schaffer1993 .

The definitions of and indicate that the input to consists of hundreds of features; the dimensionality of the input features is between 528 and 648, where the exact dimensionality depends on the specific type of facial region. Feature selection is accomplished using the minimal-redundancy-maximal-relevance (mRMR) criterion Peng2005 . However, the mRMR method can only identify the most important features given a user-specified number. To determine this number optimally, we employ the best-first search algorithm kohavi1997wrappers , minimizing the CV error. The best regression model is then selected by the minimal CV error obtained by performing feature selection for each candidate regression model.
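The model-selection loop can be sketched as follows; `select_features` and `cv_error` are hypothetical stand-ins for the mRMR/best-first search and the ten-fold CV error described above.

```python
import numpy as np

# Sketch: for each candidate regressor, perform feature selection and
# record the cross-validation error; keep the model with the smallest one.
def select_model(candidates, X, y, select_features, cv_error):
    best_model, best_err = None, float("inf")
    for model in candidates:
        feats = select_features(model, X, y)   # mRMR + best-first search
        err = cv_error(model, X[:, feats], y)  # ten-fold CV error
        if err < best_err:
            best_model, best_err = model, err
    return best_model, best_err
```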

Figure 6: The sizes of the training sample sets for the nine facial regions. The reported feature dimensionality refers to the reduced dimensionality of the selected feature set with the optimally chosen regression models.
Figure 7: The cross-validation errors of candidate regression models of for . The averaged training time for each model is also shown.
Figure 8: The cross-validation errors of candidate regression models of for . The averaged training time for each model is also shown.
Table 1: Selected optimal regression models for the FE models for datasets , , and respectively (rows: dataset; columns: facial regions 1–9).
Table 2: Selected optimal regression models for the PE models for datasets , , and respectively (rows: dataset; columns: facial regions 1–9).

Fig. 6 shows the sample numbers and feature dimensionality in the model selection process based on dataset (see Section 8 for details about ). Using the Weka toolkit Hall2009_weka , the 12 most popular regression models, , are taken into consideration: bagging regression tree (), SVM regression (), M5P (), regression tree (), conjunctive case (), M5Cases (), isotonic regression (), additive regression (), KNN (), linear regression (), pace regression (), and decision stump (). Let be the cross-validation errors of the -th candidate regression model for , applicable to the -th facial region. Then are used to select an optimal regression model for . Fig. 7 shows the cross-validation errors of the candidate regression models of for , from which we find the optimal one to be bagging trees Breiman1996 . Table 1 shows the regression models selected for , , and (see Section 8 for details about and ), in which we observe that the optimal regression models chosen for FE models may vary across facial regions and datasets.

6 Parameter Estimation Model for Automatic Case Adaptation

Case adaptation is carried out by blending the sketch from the retrieved case with the interim sketch from the previous iteration. To automate case adaptation, we train PE models to identify the optimal blending parameter, maximizing the similarity between the blended sketch and the hypothetical groundtruth sketch.

Let be the PE model trained for the -th facial region. In each iteration of sketch synthesis, a new photo region , a sketch region of current interest , and a case are given. The PE model estimates the optimal parameter for blending and . That is:


The learning of the PE model uses the same training samples as the FE model. It is worth noting that an optimal blending weight is also acquired after solving Equation (3). Therefore, the training data for is generated as a by-product while preparing the training data for , as described in Section 5.1.

We use the same learning method introduced in Section 5.2 to train the PE models. Fig. 6 also shows the sample numbers and feature dimensionality in the model selection process for PE models. Fig. 8 shows the cross-validation errors of the candidate regression models for of . Table 2 shows the regression models selected for , , and (see Section 8 for details about and ), in which we also observe that the optimal regression models chosen for PE models vary across facial regions and datasets.

7 Synthesizing the Desired Sketch via CBR

Given an input facial photo and a set of STS cases of a human artist, the resultant sketch is synthesized progressively with an iterative loop of retrieval and adaptation of candidate cases until the desired aesthetic style is achieved. Figs. 9 and 10 clearly show that the sketch synthesized using multiple cases is much closer to the ground truth image than the sketch produced using a single case, in the sense of normalized mutual information Modat2010FFD . Moreover, exaggeration imitation is embedded as a post-processing step.

7.1 Iterative sketching by Cases

To synthesize the sketch for a given input facial photo , the key point is how to optimally retrieve relevant cases and adaptively fuse them to produce the resultant sketch. Our CBR-based synthesis pipeline is illustrated in Fig. 11; it originates from the general CBR framework aamodt1994case .

Figure 9: Comparison between sketch synthesis results using a single case (first column) and multiple cases (second column); the third column is the ground truth. For each row, the number of cases used and the similarity to the ground truth measured by normalized mutual information Modat2010FFD are given, single case vs. multiple cases: 1: 0.9024 vs. 4: 0.9036; 1: 0.8163 vs. 3: 0.8316; 1: 0.8472 vs. 3: 0.8700; 1: 0.7329 vs. 3: 0.7455; 1: 0.8865 vs. 3: 0.9276; 1: 0.9378 vs. 5: 0.9531.
Figure 10: Comparison between the synthesized facial sketches using a single case (first column) and multiple cases (second column) for each facial region, as compared with the groundtruth facial sketches (third column). "Sim" is the visual similarity to the groundtruth image measured by normalized mutual information Modat2010FFD (row 1: Sim = 0.5697 vs. 0.6141; row 2: Sim = 0.5175 vs. 0.5367).
Figure 11: The pipeline of synthesizing sketches for individual facial regions iteratively through case-based reasoning. Feature points on images are represented by yellow dots.

Given a new facial region from a new photo and a set of STS cases, the process for synthesizing a new sketch is very similar to the one in Section 5.1. The major difference is that and are now already known. Therefore, and the optimal of Equation 3 can be directly calculated from and respectively, bypassing the non-linear optimization of Equation 3.

We are aware that Liu et al. Liu2005 introduced a similar method for finding optimal example sketch patches and their blending parameters during a facial texture synthesis procedure. Their method is based on a local linearity assumption and searches for the examples most visually similar to an input photo region. The optimal blending parameters are identified by minimizing the reconstruction error in terms of the visual similarity between the input photo and the blended result of the selected example photos. Instead of calculating blending parameters directly from examples, we generate a large amount of training data from a limited number of sketch synthesis cases to train the PE models, which allows us to fully utilize the available cases for parameter estimation.

The blended image appears blurrier than the source one, so post-processing is usually required. We extend the image analogies algorithm Hertzmann2001 into an example-based image sharpening procedure with a multi-scale autoregression process that can learn from multiple pairs of example images. More concretely, we use as the “filtered” examples and apply a Gaussian kernel, whose radius and standard deviation are empirically set to and pixels respectively, over each to generate its “unfiltered” version . According to the exemplified mapping relation , the sharpened version of is synthesized by image analogy. Fig. 12 shows that the sharpened sketch is much better than the one processed by the high-pass filter-based method gonzalez2002digital in terms of the visual appearance of sketching.
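The construction of the “unfiltered” training examples can be sketched as follows. The paper's exact kernel radius and standard deviation are not given in this text, so sigma here is a hypothetical stand-in, and the blur is a plain separable Gaussian implemented with NumPy convolutions.

```python
import numpy as np

def gaussian_blur(img, sigma):
    """Separable Gaussian blur with edge padding (pure NumPy)."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2.0 * sigma**2))
    k /= k.sum()
    pad = np.pad(img.astype(float), radius, mode='edge')
    # Convolve rows, then columns, with the 1D kernel.
    tmp = np.apply_along_axis(lambda r: np.convolve(r, k, mode='valid'), 1, pad)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode='valid'), 0, tmp)

def make_unfiltered_examples(sketches, sigma=1.0):
    """Build (unfiltered, filtered) pairs for image-analogy sharpening:
    each crisp example sketch is the 'filtered' image and its blurred
    copy the 'unfiltered' one. sigma is a hypothetical parameter."""
    return [(gaussian_blur(s, sigma), s) for s in sketches]
```

Given such (A, A') pairs, the image analogies machinery then transfers the blur-to-sharp mapping onto the blended sketch.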

7.2 Exaggeration

The blended mouth By high pass filter By our Method
Figure 12: Comparison with a high-pass filter based method. Our method can present more details than the high-pass filter based method.
Photo (a) (b) (c)
Figure 13: Deforming a sketch using the exaggeration field and feature point displacement vectors sampled from the exaggeration field. (a) is an initial sketch. (b) is generated by a standard image remapping algorithm opencv_library . (c) is generated by our revised approach. Notice: (b) and (c) are best viewed at 400% zoom.

Exaggeration is popular in facial sketch illustrations and is often captured as a 2D transformation from to , represented by the exaggeration field .

In our CBR framework, a new EF for exaggeration imitation, , is generated by a linear system as , where is a new facial sketch and is the set of feature points on .

The linear system has the properties of superposition and homogeneity of degree 1 Chen1998LST , therefore , where is a set of sample inputs of , and is the salience weight associated with . Given a new input , we can optimally find a linear combination of sample inputs to that best reconstructs . That is:


where , , and is defined as the sum of squared Euclidean distances between pairs of corresponding points in two input feature point sets, and is the sample output of corresponding to (see Sec. 4.1 for the method of generating ). According to the index number assigned to each feature point (see Fig. 3), we can naturally establish a pairwise point-to-point correspondence between the two sets of feature points.

We cannot theoretically prove the linearity assumption of our method. Experimentally, however, the exaggeration produced by the predicted EF appears highly close to the ones illustrated by the target human artist.
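The reconstruction above can be sketched as an unconstrained least-squares fit: find the weights that best reconstruct the new feature-point set from the sample inputs, then apply the same weights to the sample exaggeration fields by superposition. The array shapes below (k feature points, dense (h, w, 2) displacement fields) are illustrative assumptions, and plain least squares is a simplification of the optimization described in the text.

```python
import numpy as np

def reconstruct_ef(new_points, sample_inputs, sample_outputs):
    """Estimate a new exaggeration field by linear superposition.
    new_points: (k, 2) feature points of the new sketch.
    sample_inputs: list of (k, 2) sample feature-point sets.
    sample_outputs: list of sample EFs, e.g. (h, w, 2) displacement arrays."""
    # Stack each sample point set as one column of the design matrix.
    A = np.stack([p.ravel() for p in sample_inputs], axis=1)  # (2k, n)
    b = new_points.ravel()
    # Weights minimizing the squared Euclidean reconstruction error.
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    # Superposition: the same weights combine the sample outputs.
    ef = sum(wi * out for wi, out in zip(w, sample_outputs))
    return w, ef
```

If the new point set is an exact mixture of the samples, the fitted weights recover that mixture and the predicted EF is the corresponding mixture of the sample fields.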

Once is generated, a straightforward approach to exaggerating is to warp following the guidance of through a standard image remapping function opencv_library . However, is usually noisy, which would cause unreliable deformations (see Fig. 13(b)). Therefore we first remove the noise from by interpolating the control point displacements Schaefer2006 . The smoothed is then used to warp through image remapping opencv_library , which generates the exaggerated sketch accordingly (see Fig. 13(c)).
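The denoise-then-remap step can be sketched as follows. This substitutes simple bilinear interpolation over a coarse control grid for the moving least squares interpolation of Schaefer2006 , and nearest-neighbour sampling for the OpenCV remapping function; the control grid step and field shapes are hypothetical.

```python
import numpy as np

def smooth_field(flow, ctrl_step=8):
    """Denoise a dense displacement field (h, w, 2) by sampling it on a
    sparse control grid and bilinearly upsampling back; a simple
    surrogate for the interpolation of Schaefer2006."""
    coarse = flow[::ctrl_step, ::ctrl_step]
    h, w = flow.shape[:2]
    gy = np.linspace(0, coarse.shape[0] - 1, h)
    gx = np.linspace(0, coarse.shape[1] - 1, w)
    y0 = np.floor(gy).astype(int); x0 = np.floor(gx).astype(int)
    y1 = np.minimum(y0 + 1, coarse.shape[0] - 1)
    x1 = np.minimum(x0 + 1, coarse.shape[1] - 1)
    fy = (gy - y0)[:, None, None]; fx = (gx - x0)[None, :, None]
    top = coarse[y0][:, x0] * (1 - fx) + coarse[y0][:, x1] * fx
    bot = coarse[y1][:, x0] * (1 - fx) + coarse[y1][:, x1] * fx
    return top * (1 - fy) + bot * fy

def remap_nearest(img, flow):
    """Warp img by the displacement field flow, nearest-neighbour
    sampling standing in for cv2.remap."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    sy = np.clip(np.rint(ys + flow[..., 1]), 0, h - 1).astype(int)
    sx = np.clip(np.rint(xs + flow[..., 0]), 0, w - 1).astype(int)
    return img[sy, sx]
```

A zero field leaves the sketch unchanged, while a smoothed noisy field produces the stable deformation the text describes.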

8 Experimental Results

In our experiments, we prepared 3 sets of facial sketch illustrations, denoted as for the three human artists respectively. Each set consists of facial sketch portraits illustrated by one human artist, as differentiated through the subscript of ’s. Every set contains 49 front-view facial photos of 49 different people, where each photo is accompanied by its sketch illustrated by the artist. The resolutions of the images are 200 by 250 pixels in , 191 by 235 pixels in , and 194 by 247 pixels in . In particular, for , all facial photos and their sketch illustrations are from the CUHK student dataset Wang2009 . For and , 39 photos are from and the remaining 10 human facial photos are newly taken; both groups are included in and . For the 49 photos in each of and , we hired two artists to draw 49 facial sketches per set. The artist for has more than 20 years of professional experience in creating human facial portraits. The artist for is a PhD student of digital art and design; he has been a freelance illustrator for 7 years and showed his caricature art pieces at a national culture and art expo. In this manual sketch illustration process, we used a 21-inch LCD to display the human facial photos one by one and asked the artists to draw a sketch illustration for each displayed facial photo on an A4 paper canvas. The artists were allowed to spend as long as they wanted on these sketch illustrations. Afterwards, all hand-drawn sketches were digitized by a scanner. The original resolutions of the images in and are 2480 by 3508 pixels. As the image resolution of the released CUHK student dataset is 200 by 250 pixels, we downsampled the images in and to make the experimental conditions comparable across the three datasets.

Photo (a) (b) (c) Artist I
Figure 14: Comparing our approach with two commercial software packages. Columns (a) and (b) are generated by the two commercial software packages siyanhui and akvis2012 respectively. Column (c) is generated by our approach with training examples from . The sketch illustrations created by artist I are the ground truths, which are not involved in the training process.
Figure 15: The user study comparing the sketch generation quality of our approach and Wang2009 . Data were collected from 2624 completed questionnaires, each answering 10 questions.
Photo (a) (b) Artist I Photo (a)  (b) Artist II Photo (a)  (b) Artist III
Figure 16: Generating sketch illustrations of facial portraits from input human facial photos in the personal illustration styles of three artists. Columns (a) are generated by the peer algorithm proposed in Wang2009 while columns (b) are produced by our algorithm.

To compare the aesthetic stylization of our approach with that of the facial sketch synthesis approach in Wang2009 and two commercial packages siyanhui ; akvis2012 , we perform a leave-one-out test on . Note that since the sketch illustrations in do not present an exaggerative style, the procedure of removing exaggeration from example sketches (Sec. 4.3) is not performed; that is, for , and are the same. Fig. 14 shows the results of our method and the two peer commercial software packages, which clearly demonstrate that the sketch illustrations generated by our method appear visually much closer to the target artist’s hand-drawn ones. Fig. 16 gives more sketching instances from our approach and the state-of-the-art peer algorithm Wang2009 , which leads to the same qualitative conclusion. In this experiment, as we perform the leave-one-out test on , the number of training samples used for sketch generation is 48, significantly fewer than the number of sample sketches, at least 88, in Wang2009 . In all experiments in this paper, fewer than 50 training samples are used to capture an artist’s personal facial sketch illustration style.

To further assess the stylization of the sketches produced by our approach and the peer algorithm, we conducted a user survey on the sketches generated in the above experiment. We selected 10 artist-drawn sketches from and their corresponding sketches generated by our approach and Wang2009 respectively. We then created an online questionnaire consisting of 10 questions. Each question presents three images side by side: the hand-drawn sketch illustration is always shown in the middle, while the corresponding sketch illustrations generated by the two algorithms are randomly placed on the left and right. Each question asks a human viewer to decide which of the two flanking images appears visually more similar to the middle one. We released the online questionnaire on a twitter-like social media community dedicated to comic fans sinaweibo , whose members are generally very familiar with artistic facial drawing styles. 2624 completed questionnaires were collected over 3 days. None of the participants was compensated monetarily; they took part in the online survey out of curiosity and interest in facial sketch illustrations. A dynamic webpage tracking feature showed that each questionnaire took around 2.5 minutes on average to complete. Among the ten pairs of sketches in our online survey, seven received more votes for our sketches, and the remaining three received more votes for the sketches from the peer algorithm. Overall, of all the 26240 answers to the ten questions favor the resulting sketches by our method (one-sample t-test, ; two-sample t-test, ). Fig. 15 presents more details about our assessment. The voting clearly shows the superiority of our method in terms of the visual appearance of sketching.
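As an illustration of the one-sample test used above, the t statistic on per-question vote shares can be computed against the null share of 0.5 (no preference). The shares below are hypothetical stand-ins, not the actual survey numbers, which are omitted in the text.

```python
import numpy as np

def one_sample_t(shares, mu0=0.5):
    """t statistic for H0: the mean per-question vote share equals mu0."""
    shares = np.asarray(shares, dtype=float)
    n = shares.size
    return (shares.mean() - mu0) / (shares.std(ddof=1) / np.sqrt(n))

# Hypothetical per-question shares of votes favoring one method;
# seven of ten exceed 0.5, mirroring the survey's 7-of-10 outcome.
shares = [0.62, 0.58, 0.55, 0.60, 0.53, 0.57, 0.61, 0.44, 0.47, 0.46]
t = one_sample_t(shares)
```

A positive t indicates the mean share is above 0.5; the corresponding p-value would come from the t distribution with n-1 degrees of freedom.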

We also perform leave-one-out tests on the datasets and . Fig. 16 gives more synthesized sketches, which demonstrate that our approach is indeed capable of imitating multiple personal facial illustration styles, with or without exaggeration. The first artist sketches human face portraits more or less following the visual characteristics of the original facial photo, while the other two artists present significant exaggeration in their facial illustrations.

In Fig. 16, it is obvious that the peer algorithm Wang2009 fails to capture the sketch illustration styles of artists II and III, as the generated sketch illustrations differ significantly from the ones created by the target artists. This is partly because their algorithm is built on the assumption that each facial region in a photo occupies the same image area as its counterpart region in the corresponding sketch illustration, which is not always true in reality, particularly for sketch illustrations with significant exaggeration.

Figure 17: Comparing our approach with the method proposed by wang2013transductive . The faces in columns (c) and (e) are generated by our algorithm and the peer algorithm proposed in wang2013transductive respectively.
Photo (a) Artist 1 (b) Artist 2
Figure 18: Human facial sketch portraits generated in two illustration styles for the same set of input facial photos. Images in column (a) and (b) are generated by our approach. The training data for generating the images in columns (a) and (b) come from and respectively.

We also compare our approach with the transductive learning algorithm proposed by Wang et al. wang2013transductive . Fig. 17 shows our resultant sketches in the leave-one-out test on and sketches from Wang’s website of results on the CUHK face sketch database CUHKFace2013 . Nasolabial folds burgess2005cosmetic are among the essential elements of facial stylization and often appear in face portraits drawn by human artists. They are well preserved and clearly visible in our resultant sketches; unfortunately, they are barely present in the sketches generated by the transductive learning algorithm. Hence, our resultant sketches more closely resemble the ground-truth sketches by the target human artists.

Fig. 18 shows more resulting sketches from the leave-one-out tests on and . It demonstrates that our method can successfully learn the personal sketching styles of multiple artists and accordingly generate individually stylized sketches for the same input photos.

Regarding time performance, it takes about 1 minute to generate a facial sketch from an input photo at a resolution of 200 by 250 pixels using our unoptimized, single-core prototype implementation executed on a PC equipped with an Intel i5-3450 3.1GHz CPU and 3.2G memory, while the peer algorithm Wang2009 takes 3 minutes on the same hardware platform.

Originally our training took about 3 days to finish. However, we observe that it is straightforward to parallelize our training algorithm, since the FE and PE models for each facial region can be trained independently. We evaluated the parallel implementation on a PC with an Intel i7 CPU (8 cores) and 6G memory; it took about 8 hours to complete all training tasks. We are thus optimistic about the computational efficiency of our algorithm in practice with a parallel implementation.
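The per-region parallel training can be sketched with a worker pool. The region list and the training stub below are hypothetical placeholders for the FE/PE model selection procedures of Sections 5 and 6; a thread pool keeps the sketch portable, whereas CPU-bound training would use a ProcessPoolExecutor in practice.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical facial-region list; the paper trains FE and PE models
# per region, and these training tasks are mutually independent.
FACIAL_REGIONS = ["left_eye", "right_eye", "nose", "mouth", "face_contour"]

def train_region_models(region):
    """Stand-in for training one region's FE and PE models."""
    fe_model = "FE[%s]" % region   # placeholder trained models
    pe_model = "PE[%s]" % region
    return region, (fe_model, pe_model)

def train_all(regions):
    # Independent per-region tasks map directly onto a worker pool,
    # which is what cuts the 3-day training to about 8 hours on 8 cores.
    with ThreadPoolExecutor(max_workers=len(regions)) as pool:
        return dict(pool.map(train_region_models, regions))
```

Each worker returns its region's models, and the results are collected into a single dictionary keyed by region.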

9 Conclusion and Discussions

We propose a new CBR-based sketch synthesis algorithm that produces visually superior results compared with existing synthesis methods. To the best of our knowledge, it is the first CBR framework explicitly introduced to personally stylize human facial portraits.

For each human artist to be mimicked, a series of cases is first built up from her/his exemplars of source facial photos and hand-drawn sketches, and the stylization of a facial photo is then formulated as a style-transfer process of iterative refinement that looks for and applies a series of best-fit cases in a sense of style optimization. The presented experimental results demonstrate that, in comparison with a state-of-the-art method and a couple of commercially available software packages, the new approach is capable of generating visually more appealing portraits from the point of view of personal style, which also more closely resemble the ground-truth sketches by the target human artists.

Our method has a major limitation: it does not yet support synthesizing sketch illustrations for human hair. This limitation stems from the difficulty of establishing correspondences between the strokes or texture areas that depict hair and the counterpart hair regions in human facial photos. A related future research direction for this problem is cross-modal face matching ouyang2014face . Besides, if we focus on the hair of a specific group of people, it is possible to represent the correspondences through manually indicated key points, as done by Chen et al. Chen2004 .


  • (1) Aamodt, A., Plaza, E.: Case-based reasoning: Foundational issues, methodological variations, and system approaches. AI communications 7(1), 39–59 (1994)
  • (2) AKVIS: Akvis sketch v.13.0 (2012). URL http://akvis.com/en/sketch/index.php
  • (3) Bay, H., Tuytelaars, T., Van Gool, L.: Surf: speeded up robust features. ECCV, pp. 404–417. Springer-Verlag (2006)
  • (4) Belongie, S., Malik, J., Puzicha, J.: Shape matching and object recognition using shape contexts. PAMI 24(4), 509 –522 (2002)
  • (5) Berger, I., Shamir, A., Mahler, M., Carter, E., Hodgins, J.: Style and abstraction in portrait sketching. TOG 32(4), 55 (2013)
  • (6) Bradski, G.: The OpenCV Library. Dr. Dobb’s Journal of Software Tools (2000)
  • (7) Breiman, L.: Bagging predictors. Machine Learning 24, 123–140 (1996)
  • (8) Burgess, C.M.: Cosmetic dermatology. Springer (2005)
  • (9) Chen, C.T.: Linear System Theory and Design, 3rd edn. Oxford University Press, Inc. (1998)
  • (10) Chen, H., Liu, Z., Rose, C., Xu, Y., Shum, H.Y., Salesin, D.: Example-based composite sketching of human portraits. In: NPAR, pp. 95–153. ACM (2004)
  • (11) Gonzalez, R.C., Woods, R.E.: Digital image processing, 2nd. SL: Prentice Hall (2002)
  • (12) Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The weka data mining software: an update. SIGKDD Explor. Newsl. 11(1), 10–18 (2009)
  • (13) Hertzmann, A., Jacobs, C.E., Oliver, N., Curless, B., Salesin, D.H.: Image analogies. In: SIGGRAPH, pp. 327–340. ACM (2001)
  • (14) Kalogerakis, E., Nowrouzezahrai, D., Breslav, S., Hertzmann, A.: Learning hatching for pen-and-ink illustration of surfaces. ACM Trans. Graph. 31(1), 1:1–1:17 (2012)
  • (15) Kim, S.Y., Maciejewski, R., Isenberg, T., Andrews, W.M., Chen, W., Sousa, M.C., Ebert, D.S.: Stippling by example. In: NPAR, pp. 41–50. ACM (2009)
  • (16) Kohavi, R., John, G.H.: Wrappers for feature subset selection. Artificial intelligence 97(1), 273–324 (1997)
  • (17) Kyprianidis, J., Collomosse, J., Wang, T., Isenberg, T.: State of the art: A taxonomy of artistic stylization techniques for images and video. TVCG 19(5), 866–885 (2013)
  • (18) Leake, D.B., Kinley, A., Wilson, D.: Learning to improve case adaptation by introspective reasoning and cbr. In: Case-Based Reasoning Research and Development, pp. 229–240. Springer (1995)
  • (19) Lee, H., Seo, S., Ryoo, S., Yoon, K.: Directional texture transfer. In: NPAR, pp. 43–48. ACM (2010)
  • (20) Lee, S.Y., Chwa, K.Y., Hahn, J., Shin, S.Y.: Image morphing using deformation techniques. JVCA 7(1), 3–23 (1996)
  • (21) Liang, L., Chen, H., Xu, Y.Q., Shum, H.Y.: Example-based caricature generation with exaggeration. In: PG, pp. 386–. IEEE Computer Society (2002)
  • (22) Liu, Q., Tang, X., Jin, H., Lu, H., Ma, S.: A nonlinear approach for face sketch synthesis and recognition. In: CVPR, pp. 1005–1010. IEEE Computer Society (2005)
  • (23) Lu, C., Xu, L., Jia, J.: Combining sketch and tone for pencil drawing production. In: NPAR, pp. 65–73. Eurographics Association (2012)
  • (24) Lu, J., Sander, P.V., Finkelstein, A.: Interactive painterly stylization of images, videos and 3d animations. In: Proceedings of the 2010 ACM SIGGRAPH symposium on Interactive 3D Graphics and Games, pp. 127–134. ACM (2010)
  • (25) Maes, F., Collignon, A., Vandermeulen, D., Marchal, G., Suetens, P.: Multimodality image registration by maximization of mutual information. T-MI 16(2), 187–198 (1997)
  • (26) Microsoft: Microsoft office homestyle+ trial edition (2003). URL http://www.microsoft.com/zh-tw/download/details.aspx?id=4851
  • (27) Milborrow, S., Nicolls, F.: Locating facial features with an extended active shape model. ECCV (2008)
  • (28) Min, F., Suo, J.L., Zhu, S.C., Sang, N.: An automatic portrait system based on and-or graph representation. In: EMMCVPR, pp. 184–197. Springer-Verlag (2007)
  • (29) Mish, F.C.: Merriam Webster’s Collegiate Dictionary. Merriam Webster, Inc. (1994)
  • (30) Modat, M., Ridgway, G.R., Taylor, Z.A., Lehmann, M., Barnes, J., Hawkes, D.J., Fox, N.C., Ourselin, S.: Fast free-form deformation using graphics processing units. Comput. Methods Prog. Biomed. 98, 278–284 (2010)
  • (31) Papari, G., Petkov, N.: Continuous glass patterns for painterly rendering. Image Processing, IEEE Transactions on 18(3), 652–664 (2009)
  • (32) Papari, G., Petkov, N., Campisi, P.: Artistic edge and corner enhancing smoothing. Image Processing, IEEE Transactions on 16(10), 2449–2462 (2007)
  • (33) Peng, H., Long, F., Ding, C.: Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy. PAMI 27(8), 1226–1238 (2005)
  • (34) Powell, M.J.D.: Direct search algorithms for optimization calculations. Acta Numerica pp. 287–336 (1998)
  • (35) Reinhard, E., Ashikhmin, M., Gooch, B., Shirley, P.: Color transfer between images. IEEE Computer graphics and applications 21(5), 34–41 (2001)
  • (36) Schaefer, S., McPhail, T., Warren, J.: Image deformation using moving least squares. TOG 25, 533–540 (2006)
  • (37) Schaffer, C.: Selecting a classification method by cross-validation. Machine Learning 13(1), 135–143 (1993)
  • (38) Semillon: Semillon’s homepage on sina weibo (2013). URL http://weibo.com/semillon
  • (39) Song, Y., Bao, L., Yang, Q., Yang, M.H.: Real-Time Exemplar-Based Face Sketch Synthesis, pp. 800–813. Springer (2014)
  • (40) Sonka, M., Hlavac, V., Boyle, R.: Image Processing, Analysis and Machine Vision. London: Chapman and Hall (1993)
  • (41) Thévenaz, P., Unser, M.: Optimization of mutual information for multiresolution image registration. Image Processing, IEEE Transactions on 9(12), 2083–2099 (2000)
  • (42) Tu, C.T., Lien, J.J.J.: Automatic location of facial feature points and synthesis of facial sketches using direct combined model. Trans. Sys. Man Cyber. Part B 40, 1158–1169 (2010)
  • (43) Wang, N.: Results on cuhk face sketch database (2013). URL https://nannanwang.github.io/Result_CUFS.htm
  • (44) Wang, N., Tao, D., Gao, X., Li, X., Li, J.: Transductive face sketch-photo synthesis. Neural Networks and Learning Systems, IEEE Transactions on 24(9), 1364–1376 (2013)
  • (45) Wang, X., Tang, X.: Face photo-sketch synthesis and recognition. PAMI 31, 1955–1967 (2009)
  • (46) Wei, G., Rui, M., Lei, W., Yi, Z., Zhenyun, P., Yaohui, Z.: Template-based portrait caricature generation with facial components analysis. In: ICIS, pp. 219 – 223 (2009)
  • (47) Wu, C., Liu, C., Shum, H.Y., Xu, Y.Q., Zhang, Z.: Automatic eyeglasses removal from face images. Pattern Analysis and Machine Intelligence, IEEE Transactions on 26(3), 322–336 (2004)
  • (48) Zhang, Y., Dong, W., Deussen, O., Huang, F., Li, K., Hu, B.G.: Data-driven face cartoon stylization. In: SIGGRAPH Asia 2014 Technical Briefs, pp. 14:1–14:4. ACM (2014)
  • (49) Zhao, M., Zhu, S.C.: Sisley the abstract painter. In: NPAR, pp. 99–107. ACM (2010)
  • (50) Zhao, M., Zhu, S.C.: Portrait painting using active templates. In: NPAR, pp. 117–124. ACM (2011)
  • (51) Zhou, H., Kuang, Z., Wong, K.Y.K.: Markov weight fields for face sketch synthesis. In: CVPR, 2012 IEEE Conference on, pp. 1091–1097 (2012)