Scoot: A Perceptual Metric for Facial Sketches

by Deng-Ping Fan et al.
Nankai University

While it is trivial for humans to quickly assess the perceptual similarity between two images, the underlying mechanism is thought to be quite complex. Despite this, the most widely adopted perceptual metrics today, such as SSIM and FSIM, are simple, shallow functions that fail to account for many factors of human perception. Recently, the facial modelling community has observed that including both structure and texture significantly benefits face sketch synthesis (FSS). But how perceptual are these so-called "perceptual features"? Which elements are critical for their success? In this paper, we design a perceptual metric, called Structure Co-Occurrence Texture (Scoot), which simultaneously considers block-level spatial structure and co-occurrence texture statistics. To test the quality of metrics, we propose three novel meta-measures based on various reliable properties. Extensive experiments verify that our Scoot metric exceeds the performance of prior work. In addition, we build the first large-scale (152k judgments) human-perception-based sketch database that can evaluate how well a metric is consistent with human perception. Our results suggest that "spatial structure" and "co-occurrence texture" are two generally applicable perceptual features in face sketch synthesis.








1 Introduction

The ability to compare data items is a fundamental operation throughout computing [80, 90], especially in computer vision [91, 8, 5]. For various end-user applications such as face sketch synthesis [49], image style transfer [27], image quality assessment [65], saliency detection [13, 12, 11, 89], segmentation [42, 41, 43], disease classification [73], and image denoising [71], the comparison often amounts to evaluating a "perceptual distance", which assesses how similar two images are in a way that highly correlates with human perception.

In this paper, we study facial sketches and show that human judgments often differ from current evaluation metrics; as a first attempt in this direction, we provide a novel perceptual distance for sketches grounded in how humans make such choices. As noticed in [80], human judgments of similarity depend on high-order image structure. Facial sketches are rich in texture, and many algorithms exist for synthesizing them, which makes them a good fit for this problem. However, a good perceptual metric must take human perception of facial sketch comparison into account; it should:

  • closely match human perception so that good sketches can be directly used in various subjective applications, e.g., law enforcement and entertainment.

  • be insensitive to slight mismatches (e.g., re-size, rotation), since real-world sketches drawn by artists do not precisely match each pixel of the original photos.

  • be capable of capturing holistic content, that is, prefer the complete sketch to one that only contains strokes (e.g., has lost some facial components).

[Figure 1 image: left sketch (S0), middle reference (R), right sketch (S1)]



Figure 1: Which sketch (left or right) is “closer” to the middle sketch in these examples? For the right case, sketch 0 (S0) is more similar than sketch 1 (S1) w.r.t. reference (R) in terms of structure and texture. Sketch 1 almost completely destroys the texture of the hair. The widely-used (SSIM [65], FSIM [76]), classic (IFC [39], VIF [40]) and recently released (GMSD [72]) metrics disagree with humans. Only our Scoot metric agrees well with humans.


No. Model Year’Pub Sj. Rr. Ob. No. Model Year’Pub Sj. Rr. Ob.
1 ST [49] 03’ICCV VRR 2 STM [50] 04’TCSVT VRR
3 LLE [31] 05’CVPR VRR 4 BTI [32] 07’IJCAI RMSE
5 E-HMMI [22] 08’NC VRR UIQI 6 EHMM [21] 08’TCSVT VRR
7 MRF [64] 08’PAMI VRR 8 SL [70] 10’NC VRR UIQI
9 RMRF [86] 10’ECCV VRR 10 SNS-SRE [20] 12’TCSVT VRR
11 MWF [92] 12’CVPR VRR 12 SCDL [60] 12’CVPR PSNR
13 Trans [57] 13’TNNLS VRR 14 SFS-SVR [56] 13’PRL VRR VIF
15 Survey [58] 14’IJCV RMSE, UIQI, SSIM 16 SSD [45] 14’ECCV SV VRR
17 SFS [81] 15’TIP VRR FSIM, SSIM 18 FCN [75] 15’ICMR ES VRR
19 RFSSS [82] 16’TIP VRR FSIM, SSIM 20 KD-Tree [88] 16’ECCV VRR VIF, SSIM
23 RR [59] 17’NC VRR VIF, SSIM 24 Bayesian [55] 17’TIP VRR VIF, SSIM
27 ArFSPS [29] 17’NC VRR FSIM 28 BFCN [74] 17’TIP SV VRR
29 DGFL [93] 17’IJCAI VRR SSIM 30 FreeH [30] 17’IJCV SV
31 Pix2pix [27] 17’CVPR 32 CA-GAN [18] 17’CVPR VRR SSIM
33 ESSFA [14] 17’TOG 34 PSMAN [52] 18’FG VRR FSIM, SSIM
35 NST [38] 17’NPAR 36 CMSG [77] 18’TC SV VRR
37 RSLCR [54] 18’PR VRR SSIM 38 MRNF [78] 18’IJCAI VIF, SSIM
39 GAN [84] 18’IJCAI FSIM 40 FSSN [28] 18’PR PSNR, SSIM


Table 1: Summarization of 42 representative FSS-based algorithms. Sj.: Subjective metric. Rr.: Recognition rates. Ob.: Objective metric. SV = Subjective Voting. ES = Empirical Study. VRR = various recognition-rate methods, such as null-space LDA [4], Random Sampling LDA [62, 63], dual-space LDA [61], LPP [26], and Sparse Representation and Classification [68]. Note that UIQI [66] is a special case of SSIM [65].

To the best of our knowledge, no prior metric can satisfy all these properties simultaneously.

For example, in face sketch synthesis (FSS), the target is for the synthetic sketch to be indistinguishable from the reference by a human subject, even though their pixel representations may be mismatched. Consider the three examples in Fig. 1: which sketch is closer to the middle reference? While this comparison task seems trivial for humans, the widely-used metrics to date disagree with human judgments. Not only are visual patterns very high-dimensional, but the very notion of visual similarity is often subjective.


Our contributions to the facial sketch community can be summarized in three points. Firstly, as described in Sec. 3, we propose a Structure Co-Occurrence Texture (Scoot) perceptual metric for FSS that provides a unified evaluation considering both structure and texture.

Secondly, as described in Sec. 4.2, we design three meta-measures based on the three reliable properties above. Extensive experiments on these meta-measures verify that our Scoot metric exceeds the performance of prior work. Our experiments indicate that "spatial structure" and "co-occurrence texture" are two generally applicable perceptual features in FSS.

Thirdly, we explore different ways of exploiting texture statistics (e.g., Gabor, Sobel, and Canny). We find that simple texture features [17, 16] perform far better than the commonly used metrics in the literature [39, 65, 76, 40, 72]. Based on our findings, we construct the first large-scale human-perception-based sketch database that can evaluate how well a metric is in line with human perception.

Our three contributions together offer a complete metric benchmark suite, providing a novel view and practical tools (i.e., a metric, meta-measures, and a database) for analyzing data similarity from the perspective of human perception.

[Figure 2 panels: (a), (b); (c) Guideline / Outline / Add details; (d), (e) Sketch, Pix2pix [27], LR [59]; (f)]



Figure 2: Motivation of the proposed Scoot metric. (a) Pencil grades and their strokes. (b) Using stroke tones to present texture; the stroke textures shown, from top to bottom, are "cross-hatching" and "stippling", and the stroke attributes, from left to right, range from sparse to dense. Images are from [67]. (c) The artist draws the sketch from guideline to details. (d) The original sketches. (e) The quantized sketches. (f) Creating various stroke tones by applying different pressure (e.g., light to dark) on the pencil tip.

2 Related Work

From Tab. 1, we observe that some works utilize recognition rates (Rr.) to evaluate the quality of synthetic sketches. However, Rr. cannot fully reflect the visual quality of synthetic sketches [53]. In the FSS area, the widely-used perceptual metrics, e.g., SSIM [65], FSIM [76], and VIF [40], were originally designed for image quality assessment (IQA), which aims to evaluate image distortions such as Gaussian blur and JPEG / JPEG 2000 compression. Directly transferring an IQA metric to FSS may be intractable (see Fig. 1) due to the different nature of the tasks.

Psychophysics [95] and prior work on, e.g., line drawings [23, 15] indicate that human perception of sketch similarity depends on two crucial factors, i.e., image structure [65] and texture [53]. However, how perceptual are these so-called "perceptual features"? Which elements are critical for their success? How well do they actually correspond to human visual perception? As noticed by Wang et al. [53], there is currently no reliable perceptual metric in FSS. Within the constraints of space, we review the topics most pertinent to facial sketches:

Heuristic-based Metrics. The most widely used metric in FSS is SSIM, proposed by Wang et al. [65]; it combines structure, luminance, and contrast comparisons computed with a sliding window over local patches. Sheikh and Bovik [40] proposed the VIF metric, which evaluates image quality by quantifying two kinds of information: one obtained via the human-visual-system channel, relating the ground truth to the reference image, and the other obtained via the distortion channel (distortion information); the final result is the ratio of these two types of information. Studies of the human vision system (HVS) found that the features perceived by human vision are consistent with the phase of the Fourier series at different frequencies. Zhang et al. [76] therefore chose phase congruency as the primary feature and proposed a low-level feature-similarity metric called FSIM.

Recently, Xue et al. [72] devised a simple metric named gradient magnitude similarity deviation (GMSD), in which the pixel-wise gradient-magnitude similarity measures local image quality, and the standard deviation of the overall gradient-magnitude similarity map serves as the final image quality index. Their metric achieves state-of-the-art (SOTA) performance compared with the other metrics.

Learning-based Metrics. Besides heuristic-based metrics, there are numerous learning-based metrics [7, 19, 48] for comparing images in a perceptual manner, which have been used to evaluate image compression and many other imaging tasks. We refer readers to a recent survey [80] for a comprehensive review of the various deep features adopted for perceptual metrics. This paper focuses on showing why face sketches require a specific perceptual distance metric that differs from, or improves upon, previous heuristic-based methods.

2.1 Motivation

We observed the basic principles of sketching and noted that "graphite pencil grades" and "pencil strokes" are its two fundamental elements.

2.2 Graphite Pencil Grades.

In the European system, "H" and "B" stand for "hard" and "soft" pencils, respectively. Fig. 2(a) illustrates the grades of graphite pencils. Sketch images are expressed through a limited medium (the graphite pencil), which provides no color. Illustrator Sylwia Bomba [47] notes that "if you put your hand closer to the end of the pencil, you have darker markings. Gripping further up the pencil will result in lighter markings." Moreover, after long practice, artists form their own fixed pressure style (e.g., from guideline to detail in Fig. 2(c)). In other words, the marking of a stroke can be varied (e.g., light to dark in Fig. 2(f)) by changing the pressure on the pencil tip. Note that different pressures on the tip result in different types of markings, which is one of the quantifiable factors, called gray tone.

Gray Tone. The quantification of gray tone should reduce the effect of slight noise and over-sensitivity to subtle gray-tone gradient changes in sketches. We therefore introduce intensity quantization when evaluating gray-tone similarity. Inspired by previous work [6], we quantize the input sketch into a small number of grades to reduce the number of intensities to be considered. A typical example of such quantization is shown in Fig. 2(d, e). Humans consistently rank Pix2pix higher than LR both before (Fig. 2(d)) and after (Fig. 2(e)) quantizing the input sketches when evaluating perceptual similarity. Although quantization may introduce artifacts, our experiments (Sec. 6) show that this process reduces sensitivity to minor intensity variations and balances performance against computational complexity.
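For illustration, the grade quantization described above can be sketched as follows (a minimal reading with equal-width bins; the paper's exact binning scheme is not specified here):

```python
import numpy as np

def quantize(sketch, grades=6):
    """Map a grayscale sketch (uint8, values 0-255) onto a small set of
    gray-tone grades {0, ..., grades-1} using equal-width bins."""
    return (sketch.astype(np.int32) * grades) // 256
```

For example, `quantize(np.array([[0, 128, 255]], dtype=np.uint8))` yields grades `[[0, 3, 5]]` with the default 6 levels.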

2.3 Pencil’s Strokes.

Because all sketches are generated by moving a pencil tip on paper, different paths of the tip create different stroke shapes. One example is shown in Fig. 2(b), in which different spatial distributions of strokes produce various textures (e.g., sparse or dense). Thus, the stroke tone is another quantifiable factor.

Stroke Tone. The stroke tone and gray tone are not independent concepts: the gray tone is based on the gray-scale values of the strokes in a sketch image, while the stroke tone is the spatial distribution of gray tones.

An example is shown in Fig. 2(d). Intuitively, Pix2pix [27] is better than LR [59] since Pix2pix preserves the texture (or stroke tone) of the hair and the details of the face, whereas LR presents an overly smooth result and loses much of the sketch style.

3 Proposed Algorithm

This section explains the proposed Scoot metric, which captures the co-occurrence texture statistics in the “block-level” spatial structure.

3.1 Co-Occurrence Texture

With the two quantifiable factors at hand, we describe the details. To simultaneously extract statistics about the "stroke tone" and its relationship to the surrounding "gray tone", we need to characterize their spatial interrelationships. Previous work on texture [25] verified that the co-occurrence matrix efficiently captures texture features thanks to its various powerful statistics. Since sketches show many similarities to textures, we use the co-occurrence matrix as our gray-tone and stroke-tone extractor. Specifically, the matrix is defined as:

$$M_{\vec{d}}(i,j) \;=\; \sum_{p,\; q = p + \vec{d}} \begin{cases} 1, & \text{if } Q(p) = i \text{ and } Q(q) = j,\\ 0, & \text{otherwise,} \end{cases} \qquad (1)$$

where $i$ and $j$ denote gray values; $\vec{d}$ is the relative distance of $q$ to $p$; $p$ and $q$ are spatial positions in the given quantized sketch $Q$; $Q(p)$ denotes the gray value of $Q$ at position $p$; and $W$ and $H$ are the width and height of the sketch $Q$, respectively. To extract the perceptual features in a sketch, we test the three most widely used [24] statistics: Homogeneity (H), Contrast (C), and Energy (E).
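The co-occurrence counting described above can be read directly as the following sketch (for brevity it assumes a non-negative displacement and normalizes the counts to sum to 1):

```python
import numpy as np

def cooccurrence(q, d=(0, 1), grades=6):
    """Count gray-tone pairs (i, j) between each position p and p + d in
    the quantized sketch q, then normalize the counts to sum to 1.
    Assumes a non-negative displacement d = (dy, dx)."""
    dy, dx = d
    h, w = q.shape
    src = q[:h - dy, :w - dx]   # positions p
    dst = q[dy:, dx:]           # positions p + d
    m = np.zeros((grades, grades))
    np.add.at(m, (src.ravel(), dst.ravel()), 1)
    return m / m.sum()
```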

Homogeneity reflects how much the texture changes in local regions; it will be high if the gray tone of each pixel pair is similar. The homogeneity is defined as:

$$H_{\vec{d}} \;=\; \sum_{i}\sum_{j} \frac{M_{\vec{d}}(i,j)}{1 + |i - j|}. \qquad (2)$$
Contrast represents the difference between a pixel in $Q$ and its neighbor, summed over the whole sketch. Note that a low-contrast sketch is not characterized by low gray tones but rather by low spatial frequencies; the contrast is highly correlated with spatial frequencies, and it equals 0 for a constant-tone sketch:

$$C_{\vec{d}} \;=\; \sum_{i}\sum_{j} (i - j)^2 \, M_{\vec{d}}(i,j). \qquad (3)$$
Energy measures textural uniformity. When only similar gray tones occur in a sketch ($Q$) patch, a few elements of $M_{\vec{d}}$ will be close to 1 while the others are close to 0; energy reaches its maximum if the patch contains only one gray tone. Thus, high energy corresponds to a gray-tone distribution with either a periodic or a constant form:

$$E_{\vec{d}} \;=\; \sum_{i}\sum_{j} M_{\vec{d}}(i,j)^2. \qquad (4)$$
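With a normalized co-occurrence matrix in hand, the three statistics can be computed directly (Haralick-style forms; the paper's exact normalization may differ slightly):

```python
import numpy as np

def glcm_stats(m):
    """Homogeneity, Contrast, and Energy of a normalized co-occurrence
    matrix m (entries sum to 1)."""
    i, j = np.indices(m.shape)
    homogeneity = float(np.sum(m / (1.0 + np.abs(i - j))))
    contrast = float(np.sum((i - j) ** 2 * m))
    energy = float(np.sum(m ** 2))
    return homogeneity, contrast, energy
```

For a constant-tone patch the matrix has a single unit entry, so contrast is 0 and energy is maximal (1), matching the descriptions above.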

Algorithm 1: Structure Co-Occurrence Texture Measure
Input: Synthetic sketch X, ground-truth sketch Y
Step 1: Quantize X and Y into a small number of gray-tone grades
Step 2: Calculate the co-occurrence matrices of X and Y according to Eq. 1
Step 3: Divide the whole sketch image into a grid of blocks
Step 4: Extract the features according to Eq. 3 & 4 from each block
and concatenate them together
Step 5: Compute the average feature of four orientations with Eq. 5
Step 6: Evaluate the similarity between X and Y according to Eq. 6
Output: Scoot score


3.2 Spatial Structure

To holistically represent the spatial structure, we follow the spatial-envelope strategy [34, 9] and extract the statistics from the "block-level" spatial structure of the sketch. First, we divide the whole sketch image into a grid of blocks. Our experiments demonstrate that this process helps to derive content information. Second, we compute the co-occurrence matrix for each block and normalize it such that the sum of its components is 1. Then, we concatenate the statistics (i.e., Contrast and Energy) of all the blocks into a single feature vector.

Note that each of the above statistics is based on a single direction (e.g., horizontal, that is $\vec{d} = (0, 1)$). However, the direction of the spatial distribution is also very important for capturing style, such as the "hair direction" or "the direction of shadowing strokes". To exploit this observation and efficiently extract the stroke direction style, we compute the average feature over four orientation vectors to capture more directional information:

$$\Phi \;=\; \frac{1}{4} \sum_{k=1}^{4} \phi_{\vec{d}_k}, \qquad (5)$$

where $\vec{d}_k$ denotes the $k$-th direction, $\vec{d}_k \in \{(0,1),\,(1,0),\,(1,1),\,(1,-1)\}$, and $\phi_{\vec{d}_k}$ is the concatenated block feature vector extracted along it.
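Putting the pieces together, the block-grid extraction and four-orientation averaging might look as follows (an illustrative sketch using Contrast and Energy as the per-block statistics; the helper names are ours, not the paper's):

```python
import numpy as np

def cooccur(q, d, grades):
    """Normalized co-occurrence matrix of quantized sketch q at
    displacement d = (dy, dx), handling negative offsets."""
    dy, dx = d
    h, w = q.shape
    src = q[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    dst = q[max(0, dy):h + min(0, dy), max(0, dx):w + min(0, dx)]
    m = np.zeros((grades, grades))
    np.add.at(m, (src.ravel(), dst.ravel()), 1)
    return m / m.sum()

def directional_feature(q, n=4, grades=6,
                        dirs=((0, 1), (1, 0), (1, 1), (1, -1))):
    """Split q into an n-by-n grid, extract Contrast and Energy from each
    block for each direction, and average over the four directions."""
    h, w = q.shape
    per_dir = []
    for d in dirs:
        feats = []
        for by in range(n):
            for bx in range(n):
                blk = q[by * h // n:(by + 1) * h // n,
                        bx * w // n:(bx + 1) * w // n]
                m = cooccur(blk, d, grades)
                i, j = np.indices(m.shape)
                feats.append(np.sum((i - j) ** 2 * m))  # Contrast
                feats.append(np.sum(m ** 2))            # Energy
        per_dir.append(feats)
    return np.mean(per_dir, axis=0)
```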

[Figure 3 panels: (a) Photo, (b) Reference, (c) Difference, (d) Downsized; (e) Reference, (f) Re-sized, (g) Pix2pix, (h) LR]

Figure 3: Meta-measure 1: Stability to Slight Re-sizing.

3.3 Scoot Metric

After obtaining the perceptual feature vectors of the reference sketch Y and the synthetic sketch X, a function is needed to evaluate their similarity. We tested various forms, such as Euclidean distance and exponential functions, and found that the simple Euclidean distance works best in our experiments. Thus, the proposed perceptual similarity Scoot metric can be defined as:

$$\mathrm{Scoot}(X, Y) \;=\; \frac{1}{1 + \lVert \Phi_X - \Phi_Y \rVert_2}, \qquad (6)$$

where $\lVert\cdot\rVert_2$ denotes the $\ell_2$-norm, and $\Phi_X$ and $\Phi_Y$ denote the feature vectors of the quantized X and Y, respectively. A score of 1 represents identical style.
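The final score is then a direct function of the two feature vectors, assuming the 1 / (1 + distance) form described above:

```python
import numpy as np

def scoot_score(phi_x, phi_y):
    """Scoot similarity between two feature vectors: 1 / (1 + L2 distance),
    so identical styles score exactly 1 and the score decays toward 0 as
    the feature vectors diverge."""
    diff = np.asarray(phi_x, dtype=float) - np.asarray(phi_y, dtype=float)
    return 1.0 / (1.0 + np.linalg.norm(diff))
```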

4 Experiments

4.1 Implementation Details

The grid size of the spatial structure in Sec. 3.2 is set to 4 (i.e., a 4 × 4 grid) to achieve the best performance. The quantization parameter is set to 6 grades. We have demonstrated that a pair of features (i.e., Contrast in Eq. 3 combined with Energy in Eq. 4) achieves the best performance (see Sec. 6). Due to the symmetry of the co-occurrence matrix, the statistical features in 4 orientations are actually equivalent to those of the 8 neighbor directions at distance 1. Empirically, we use 4 orientations to achieve robust performance.

4.2 Meta-measures

As described in [33], one of the most challenging tasks in designing a metric is proving its performance. Following [37], we adopt the meta-measure methodology: a meta-measure assesses the quality of an evaluation metric itself. Inspired by [9, 10, 33], we propose three meta-measures based on the three properties described in Sec. 1.

[Figure 4 panels: (a) Reference, (b) R-Reference, (c) Pix2pix, (d) MWF]

Figure 4: Meta-measure 2: Rotation Sensitivity.

Meta-measure 1: Stability to Slight Resizing.

The first meta-measure (MM1) specifies that the rankings of synthetic sketches should not change much with slight changes in the reference sketch. Therefore, we perform a minor 5-pixel downsizing of the reference using nearest-neighbor interpolation. Fig. 3 gives an example. The hair of the reference in (b), drawn by the artist, has a slight size discrepancy compared to the photo (a); we observe that a deviation of about 5 pixels at the boundary (Fig. 3(c)) is common. Although the two sketches (e) & (f) are almost identical, widely-used metrics, e.g., SSIM [65], VIF [40], and GMSD [72], switched the ranking of the two synthetic sketches (g, h) when using (e) or (f) as the reference. However, the proposed Scoot metric consistently ranked (g) higher than (h).

For this meta-measure, we applied the measure from [2] to test a metric's ranking stability before and after the reference downsizing. Its value falls in the range [0, 2].

Tab. 2 shows the results: the lower the value, the more stable a metric is to slight downsizing. We see a significant improvement (77% and 83%) over the existing SSIM, FSIM, GMSD, and VIF metrics on both the CUFS and CUFSF databases. These improvements mainly stem from the proposed metric considering "block-level" rather than "pixel-level" statistics.
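The exact stability measure (cited as [2]) is not reproduced here; as a rough illustration only, a Kendall-style swap fraction captures the same idea of comparing the rankings produced with the original and downsized references:

```python
from itertools import combinations

def swap_fraction(rank_a, rank_b):
    """Fraction of item pairs whose relative order differs between two
    rankings of the same items (0 = identical order, 1 = fully reversed).
    This is an illustrative stand-in, not the measure used in the paper."""
    pos = {item: i for i, item in enumerate(rank_b)}
    pairs = list(combinations(rank_a, 2))
    swaps = sum(1 for a, b in pairs if pos[a] > pos[b])
    return swaps / len(pairs)
```

A stable metric should yield a swap fraction near 0 between the two rankings.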

[Figure 5 panels: (a) Reference, (b) Synthetic, (c) Light]

Figure 5: Meta-measure 3: Content Capture Capability.


Metric | CUFS: MM1 (resize), MM2 (rotation), MM3 (content), Jud (judgment) | CUFSF: MM1 (resize), MM2 (rotation), MM3 (content), Jud (judgment)


Classical & Widely Used
IFC [39] 0.256 0.189 1.20% 26.9% 0.089 0.112 3.07% 25.4%
SSIM [65] 0.162 0.086 81.4% 37.3% 0.073 0.074 97.4% 36.8%
FSIM [76] 0.268 0.123 14.2% 50.0% 0.151 0.058 32.4% 37.5%
VIF [40] 0.322 0.236 43.5% 44.1% 0.111 0.150 22.2% 52.8%
GMSD [72] 0.417 0.210 21.9% 42.6% 0.259 0.132 63.6% 58.6%
Scoot (Ours) 0.037 0.025 95.9% 76.3% 0.012 0.008 97.5% 78.8%


Texture-based & Edge-based
Canny [3] 0.086 0.078 33.7% 27.8% 0.138 0.146 0.00% 0.10%
Sobel [44] 0.040 0.037 0.00% 32.8% 0.048 0.044 0.00% 52.6%
GLRLM [17] 0.111 0.111 18.6% 73.7% 0.125 0.079 64.6% 68.0%
Gabor [16] 0.062 0.055 0.00% 72.2% 0.089 0.043 19.3% 80.9%
Scoot (Ours) 0.037 0.025 95.9% 76.3% 0.012 0.008 97.5% 78.8%


Feature Combination
0.034 0.024 95.9% 76.3% 0.011 0.008 97.4% 78.7%
0.007 0.005 61.5% 77.5% 0.003 0.003 79.1% 77.8%
0.200 0.104 98.5% 73.1% 0.044 0.026 99.2% 77.4%
0.010 0.007 54.4% 74.6% 0.009 0.006 64.7% 73.4%
0.011 0.007 60.1% 74.6% 0.007 0.005 78.1% 73.7%
0.156 0.088 97.9% 75.7% 0.030 0.017 98.8% 80.3%
(Scoot) 0.037 0.025 95.9% 76.3% 0.012 0.008 97.5% 78.8%


Table 2: Benchmark results of classical and alternative texture/edge-based metrics on CUFS (left four columns) and CUFSF (right four columns). The best result is highlighted in bold, and the differences are all statistically significant. For MM1 and MM2, lower is better; for MM3 and Jud, higher is better.

Meta-measure 2: Rotation Sensitivity. In real-world situations, sketches drawn by artists may also be slightly rotated compared to the original photographs. The proposed second meta-measure (MM2) therefore verifies a metric's sensitivity to reference rotation. We applied a slight counter-clockwise rotation to each reference. Fig. 4 shows an example: when the reference (a) is replaced by the slightly rotated reference (b), the ranking results should not change much. In MM2, we obtained the ranking results for each metric using the reference sketches and the slightly rotated reference sketches (R-Reference) separately, and utilized the same measure as in meta-measure 1 to evaluate rotation sensitivity.

The sensitivity results are shown in Tab. 2. It is worth noting that MM1 and MM2 are two aspects of the same desired property described in Sec. 1. Our metric again significantly outperforms the current metrics on the CUFS and CUFSF databases.

Meta-measure 3: Content Capture Capability. The third meta-measure (MM3) states that a good metric should assign a higher score to a complete sketch generated by a SOTA algorithm than to any sketch that preserves only incomplete strokes. Fig. 5 presents an example: we expect a metric to prefer the SOTA synthetic result (b) over the light-strokes result (c). (To construct the light-strokes image, we use a simple grayscale threshold of 170 to separate the sketch (Fig. 5, Reference) into darker and lighter strokes; the lighter-strokes image loses the main texture features of the face (hair, eye, beard), resulting in an incomplete sketch.) For MM3, we compute the mean score of 10 SOTA [75, 27, 31, 59, 64, 92, 54, 45, 93, 55] face sketch synthesis algorithms; the mean is robust to a single model generating a poor result. We record the number of times the mean SOTA score is higher than the light-strokes score.
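The dark/light stroke separation used to build the incomplete sketch can be sketched as follows (the 170 threshold is from the text; filling the removed pixels with white background is our assumption):

```python
import numpy as np

def split_strokes(sketch, thresh=170, background=255):
    """Split a grayscale sketch into a darker-strokes image and a
    lighter-strokes image at a fixed gray threshold; pixels removed
    from each branch are filled with white background."""
    dark = np.where(sketch < thresh, sketch, background)
    light = np.where(sketch >= thresh, sketch, background)
    return dark, light
```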


Figure 6: Our distortions. These distortions are generated by various real synthesis algorithms [75, 27, 31, 59, 64, 92, 54, 45, 93, 55].

For the case shown in Fig. 5, the current widely-used metrics (SSIM, FSIM, VIF) all favor the light sketch; only the proposed Scoot metric gives the correct order. In terms of pixel-level matching, the regions where dark strokes are removed obviously differ from the corresponding parts in (a), but at all other positions the pixels are identical to the reference. Previous metrics only consider "pixel-level" matching and therefore rank the light-strokes sketch higher. However, the synthetic sketch (b) is better than the light one (c) in terms of both style and content. From Tab. 2, we observe a large (14%) improvement over the other metrics on the CUFS database; a slight improvement is also achieved on CUFSF.

5 Proposed Perceptual Similarity Dataset

To evaluate the performance of different perceptual metrics, we built a large-scale, highly diverse dataset of perceptual judgments using the two-alternative forced-choice (2AFC) scheme [80]. These judgments are derived from a wide space of distortions and the outputs of real synthesis algorithms. Because the true test of a synthetic-sketch assessment metric is on real problems and real algorithms, we gather perceptual judgments on such outputs.

5.1 Distortions

Source Images. Data on real algorithms is more limited, as each synthesis model has its own unique properties. To obtain more distortion data, we collect 338 pairs (CUFS) and 944 pairs (CUFSF) of test-set images as source images, following the split scheme of [54].

Distortion Types. We simulate diverse distortions introduced by traditional and CNN-based synthesis methods, to more closely cover the space of artifacts that can arise from real algorithms [75, 27, 31, 59, 64, 92, 54, 45, 93, 55]. Our goal with each selected algorithm is not to address the task per se, but rather to explore common artifacts that plague the outputs of traditional and deep methods. As shown in Fig. 6, we introduce 10 distortion types: lightness shift, foreground noise, shifting, linear warping, structural damage, contrast change, blur, component loss, ghosting, and checkerboard artifacts.

5.2 Psychophysical Similarity Measurements

Data selection. Twenty-one viewers, pre-trained on 50 ranking pairs, are asked to rank the synthetic sketch results based on two criteria: texture similarity and content similarity. To minimize the ambiguity of human ranking, we follow the voting strategy of [53] and conduct this experiment (152K judgments) in the following stages:

  • We let the first group of viewers (7 subjects) select four out of ten sketches for each photo, consisting of two good and two bad ones. This leaves 1352 (4 × 338) and 3776 (4 × 944) sketches for CUFS and CUFSF, respectively.

  • For the four selected sketches of each photo, the second group of viewers (7 subjects) is asked to choose the three sketches they find easiest to rank. Based on the viewers' votes, we pick the 3 most frequently selected sketches.

  • Sketches that are too similar make it difficult for viewers to judge which is better, potentially causing random decisions. To avoid such random selection, we ask the last group of viewers (7 subjects) to pick the pair of sketches that is most clear-cut to rank.

2AFC similarity judgments. For each image, we have a reference sketch R drawn by an artist and two distorted sketches S0 and S1. We ask each viewer which sketch is closer to the reference R and record the response. On average, viewers spent about 2 seconds per judgment. Our dataset thus consists of image triplets (R, S0, S1) with the associated votes. Note that 5 volunteers were involved in the whole process to cross-check the rankings: if the majority of viewers prefer S0 over S1, the final ranking places S0 first. All triplets with a clear majority are preserved and the others discarded. Finally, we establish two new human-ranked datasets, RCUFS and RCUFSF, which include 1014 (3 × 338 triplets) and 2832 (3 × 944 triplets) human-ranked images, respectively. Recent works [46, 69] show that the scale of a dataset is important; to the best of our knowledge, this is the first large-scale publicly available human-judgment dataset in FSS. Please refer to our website for the complete datasets.

5.3 Human Judgments

Here, we evaluate how well our Scoot metric and the compared metrics agree with human judgments. RCUFS and RCUFSF contain 338 and 944 judged triplets, respectively. Since human judging is an inherently noisy process, we compute the agreement of a metric with each triplet and adopt the average over the dataset as the final performance.
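Per-triplet agreement with the human votes can be computed as follows (an illustrative sketch; `metric(a, b)` stands for any similarity function where higher means more similar):

```python
def metric_agreement(triplets, metric):
    """Average 2AFC agreement of a similarity metric with human votes.
    Each triplet is (s0, s1, ref, h), where h is 0 if humans judged s0
    closer to ref and 1 otherwise."""
    hits = 0
    for s0, s1, ref, h in triplets:
        pred = 0 if metric(s0, ref) >= metric(s1, ref) else 1
        hits += int(pred == h)
    return hits / len(triplets)
```

For example, with a toy similarity on numbers, `metric = lambda a, b: -abs(a - b)`, agreement is 1.0 when the metric's preferences match every vote.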

How well do classical metrics and our Scoot perform? Tab. 2 shows the performance of the classical metrics (i.e., IFC, SSIM, FSIM, VIF, and GMSD). Interestingly, they all perform at about the same low level (below 59%). Despite their common use in FSS, these metrics were not designed for situations where pixel mismatch is a large factor. The proposed Scoot metric, in contrast, shows a significant (26.3%) improvement over the best prior metric on RCUFS. This improvement is due to our joint consideration of structure and texture similarity, the two factors human perception treats as essential when evaluating sketches.

6 Discussion

Which elements are critical for their success? In Sec. 3.1, we considered 3 widely-used statistics: Homogeneity (H), Contrast (C), and Energy (E). To achieve the best performance, we need to explore the best combination of these statistical features. We applied our three meta-measures as well as human judgments to test the performance of the Scoot metric using each single feature, each feature pair, and the combination of all three.

[Figure 7 panels: (a), (b) grid-scale sweep; (c), (d) quantization sweep]

Figure 7: Sensitivity experiments of the spatial structure (top) and quantization (bottom). For MM1 & MM2, the lower the better. For MM3 & MM4, the higher the better.

The results are shown in Tab. 2. All possibilities (i.e., H, C, E, and their pairwise and three-way combinations) perform well in Jud (human judgment). H and E are insensitive to re-sizing (MM1) and rotation (MM2) but not good at content capture (MM3); C behaves in the opposite way. Thus, using a single feature is not good. The results of combining two features show that a pair including C retains some sensitivity to re-sizing and rotation, while partially overcoming the content-capture weakness of H and E. The three-feature combination shows no improvement over the "C+E" pair. Previous work [1] also found energy and contrast to be the most efficient features for discriminating textural patterns. Thus, we choose the "C+E" pair as our final combination for extracting perceptual features.

How well do these "perceptual features" actually correspond to human visual perception? As described in Sec. 3.1, sketches are quite close to textures, and there are many other texture- and edge-based features (GLRLM [17], Gabor [16], Canny [3], Sobel [44]). Here, we select the most widely-used features as candidate alternatives to our "C+E" feature. For GLRLM, we use all five statistics from the original version. Results are shown in Tab. 2. Gabor and GLRLM are texture features, while the other two are edge-based. All the texture features (GLRLM, Gabor) and the proposed Scoot metric provide good (above 68%) consistency with human ranking (Jud), and among them the proposed metric provides a consistently high average performance. GLRLM performs well according to MM1, MM2, and MM3; Gabor is reasonable in terms of MM1 and MM2 but not good at MM3. For the edge-based features, Canny fails on all meta-measures; Sobel is very stable to slight re-sizing (MM1) and rotation (MM2), but cannot capture content (MM3) and is not consistent with human judgment (Jud). Interestingly, Canny, Sobel, and Gabor can assign the incomplete-strokes image a higher score than the sketch generated by a SOTA algorithm; in other words, these features completely reversed the ranking results for all tested cases. In terms of overall results, we conclude that our "C+E" feature is more robust than the other competitors.

What is the Sensitivity to the Spatial Structure? To analyze the effect of spatial structure, we derive seven variants, each of which divides the sketch with a different grid size: 1, 2, 4, ..., 64. The results of MM3 & MM4 in Fig. 7(b) show that a grid size of 1 achieves the best performance. However, the weakness of this variant is that it captures only “image-level” statistics and ignores the structure of the sketch; a sketch made up of an arbitrary re-arrangement of strokes could also achieve a high score. The experiment of MM1 in Fig. 7(a) clearly shows that a grid size of 4 achieves the best performance on the CUFS dataset. Based on the two experiments, the 4 × 4 grid gains the most robust performance.
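The grid division itself is simple to sketch. In the toy version below, the per-block statistic defaults to a plain block mean, which is only a placeholder for the co-occurrence statistics; the parameter `k` plays the role of the grid size swept above.

```python
def block_statistics(img, k=4, stat=None):
    """Divide `img` (a list of rows) into a k-by-k grid and apply `stat`
    to every block; concatenating the results gives a block-level
    feature vector that preserves spatial structure."""
    if stat is None:
        # Placeholder statistic: the mean grey value of the block.
        stat = lambda b: sum(sum(row) for row in b) / (len(b) * len(b[0]))
    h, w = len(img), len(img[0])
    feats = []
    for bi in range(k):
        for bj in range(k):
            r0, r1 = bi * h // k, (bi + 1) * h // k
            c0, c1 = bj * w // k, (bj + 1) * w // k
            feats.append(stat([row[c0:c1] for row in img[r0:r1]]))
    return feats
```

With k = 1 the vector collapses to a single image-level statistic, which is exactly the degenerate case discussed above: any re-arrangement of the same strokes scores identically.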

What is the Sensitivity to Quantization? To determine which quantization level (baseline set: {2, 4, 6, 8, 16, 32, 64, 128} grey levels) produces the best performance, we perform a further sensitivity test. From Fig. 7(c)&(d), we observe that quantizing the input sketch to 32 grey levels achieves an excellent result; however, on the experiments of MM3 & MM4 it gains the worst performance. Weighing all the experiments together, we adopt the quantization level with the most robust overall result.
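The sweep itself has a simple shape. The sketch below uses a crude "distinct bins occupied" count as a stand-in for the meta-measure scores (which require the full benchmark); it only illustrates how coarser quantization discards tonal detail.

```python
def quantization_sweep(img, level_set=(2, 4, 6, 8, 16, 32, 64, 128)):
    """For each candidate number of grey levels, quantize the sketch and
    report how many distinct bins it actually occupies - a rough proxy
    for how much tonal detail survives quantization."""
    report = {}
    for levels in level_set:
        bins = {v * levels // 256 for row in img for v in row}
        report[levels] = len(bins)
    return report
```

In the real experiment, each candidate level would instead be scored against MM1-MM4 and Jud, and the most robust level selected.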

7 Conclusion

In this work, we explore the human perception problem, i.e., the gap between human judgments and existing metrics, using facial sketches as the test case. We provide a specific metric, called Scoot (Structure Co-Occurrence Texture), that captures human perception, and analyze it with the three proposed meta-measures. Finally, we built the first human-perception-based sketch database for evaluating how well a metric is in line with human perception. We systematically evaluate different texture-based and edge-based features within our Scoot architecture and compare them with classic metrics. Our results show that “spatial structure” and “co-occurrence texture” are two generally applicable perceptual features in facial sketches. In the future, we will continue to develop and apply Scoot in order to further push the frontiers of research, e.g., for the evaluation of background subtraction [94].


This research was supported by NSFC (61572264, 61620106008, 61802324, 61772443), the National Youth Talent Support Program, and the Tianjin Natural Science Foundation (17JCJQJC43700, 18ZXZNGX00110).


  • [1] A. Baraldi and F. Parmiggiani (1995) An investigation of the textural characteristics associated with gray level cooccurrence matrix statistical parameters. IEEE T Geosci. Remote. 33 (2), pp. 293–304. Cited by: §6.
  • [2] D. Best and D. Roberts (1975) Algorithm AS 89: the upper tail probabilities of Spearman’s rho. J R STAT SOC C-APPL 24 (3), pp. 377–379. Cited by: §4.2.
  • [3] J. Canny (1986) A computational approach to edge detection. IEEE TPAMI 8, pp. 679–698. Cited by: Table 2, §6.
  • [4] L. Chen, H. M. Liao, M. Ko, J. Lin, and G. Yu (2000) A new LDA-based face recognition system which can solve the small sample size problem. Pattern Recognition 33 (10), pp. 1713–1726. Cited by: Table 1.
  • [5] M. Cheng, Y. Liu, W. Lin, Z. Zhang, P. L. Rosin, and P. H. Torr (2019) BING: binarized normed gradients for objectness estimation at 300fps. Computational Visual Media 5 (1), pp. 3–20. Cited by: §1.
  • [6] D. A. Clausi (2002) An analysis of co-occurrence texture statistics as a function of grey level quantization. Can. J Remote. Sens. 28 (1), pp. 45–62. Cited by: §2.2.
  • [7] A. Dosovitskiy and T. Brox (2016) Generating images with perceptual similarity metrics based on deep networks. In NIPS, pp. 658–666. Cited by: §2.
  • [8] M. Elhoseiny, Y. Zhu, H. Zhang, and A. Elgammal (2017) Link the head to the “beak”: zero shot learning from noisy text description at part precision. In IEEE CVPR, Cited by: §1.
  • [9] D. Fan, M. Cheng, Y. Liu, T. Li, and A. Borji (2017) Structure-measure: A New Way to Evaluate Foreground Maps. In IEEE ICCV, pp. 4548–4557. Cited by: §3.2, §4.2.
  • [10] D. Fan, C. Gong, Y. Cao, B. Ren, M. Cheng, and A. Borji (2018) Enhanced-alignment Measure for Binary Foreground Map Evaluation. In IJCAI, pp. 698–704. Cited by: §4.2.
  • [11] D. Fan, Z. Lin, J. Zhao, Y. Liu, Z. Zhang, Q. Hou, M. Zhu, and M. Cheng (2019) Rethinking rgb-d salient object detection: models, datasets, and large-scale benchmarks. arXiv preprint arXiv:1907.06781. Cited by: §1.
  • [12] D. Fan, J. Liu, S. Gao, Q. Hou, A. Borji, and M. Cheng (2018) Salient objects in clutter: bringing salient object detection to the foreground. In ECCV, pp. 1597–1604. Cited by: §1.
  • [13] D. Fan, W. Wang, M. Cheng, and J. Shen (2019) Shifting more attention to video salient object detection. In IEEE CVPR, pp. 8554–8564. Cited by: §1.
  • [14] J. Fišer, O. Jamriška, D. Simons, E. Shechtman, J. Lu, P. Asente, M. Lukáč, and D. Sỳkora (2017) Example-based synthesis of stylized facial animations. ACM TOG 36 (4), pp. 155. Cited by: Table 1.
  • [15] W. T. Freeman, J. B. Tenenbaum, and E. C. Pasztor (2003) Learning style translation for the lines of a drawing. ACM TOG 22 (1), pp. 33–46. Cited by: §2.
  • [16] D. Gabor (1946) Theory of communication. part 1: the analysis of information. Journal of the Institution of Electrical Engineers-Part III: Radio and Communication Engineering 93 (26), pp. 429–441. Cited by: §1, Table 2, §6.
  • [17] M. M. Galloway (1974) Texture analysis using grey level run lengths. NASA STI/Recon Technical Report N 75. Cited by: §1, Table 2, §6.
  • [18] F. Gao, S. Shi, J. Yu, and Q. Huang (2017) Composition-aided sketch-realistic portrait generation. arXiv preprint arXiv:1712.00899. Cited by: Table 1.
  • [19] F. Gao, Y. Wang, P. Li, M. Tan, J. Yu, and Y. Zhu (2017) DeepSim: deep similarity for image quality assessment. Neurocomputing 257, pp. 104–114. Cited by: §2.
  • [20] X. Gao, N. Wang, D. Tao, and X. Li (2012) Face sketch–photo synthesis and retrieval using sparse representation. IEEE TCSVT 22 (8), pp. 1213–1226. Cited by: Table 1.
  • [21] X. Gao, J. Zhong, J. Li, and C. Tian (2008) Face sketch synthesis algorithm based on E-HMM and selective ensemble. IEEE TCSVT 18 (4), pp. 487–496. Cited by: Table 1.
  • [22] X. Gao, J. Zhong, D. Tao, and X. Li (2008) Local face sketch synthesis learning. Neurocomputing 71 (10-12), pp. 1921–1930. Cited by: Table 1.
  • [23] S. Grabli, E. Turquin, F. Durand, and F. X. Sillion (2004) Programmable style for NPR line drawing. Rendering Techniques (Eurographics Symposium on Rendering). Cited by: §2.
  • [24] R. M. Haralick et al. (1979) Statistical and structural approaches to texture. Proceedings of the IEEE 67 (5), pp. 786–804. Cited by: §3.1.
  • [25] R. M. Haralick, K. Shanmugam, et al. (1973) Textural features for image classification. IEEE Transactions on Systems, Man, and Cybernetics, pp. 610–621. Cited by: §3.1.
  • [26] X. He and P. Niyogi (2004) Locality preserving projections. In NIPS, pp. 153–160. Cited by: Table 1.
  • [27] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In IEEE CVPR, pp. 1125–1134. Cited by: Figure 2, Table 1, §1, §2.3, Figure 6, §4.2, §5.1.
  • [28] L. Jiao, S. Zhang, L. Li, F. Liu, and W. Ma (2018) A modified convolutional neural network for face sketch synthesis. PR 76, pp. 125–136. Cited by: Table 1.
  • [29] J. Li, X. Yu, C. Peng, and N. Wang (2017) Adaptive representation-based face sketch-photo synthesis. Neurocomputing 269, pp. 152–159. Cited by: Table 1.
  • [30] Y. Li, Y. Song, T. M. Hospedales, and S. Gong (2017) Free-hand sketch synthesis with deformable stroke models. IJCV 122 (1), pp. 169–190. Cited by: Table 1.
  • [31] Q. Liu, X. Tang, H. Jin, H. Lu, and S. Ma (2005) A nonlinear approach for face sketch synthesis and recognition. In IEEE CVPR, Vol. 1, pp. 1005–1010. Cited by: Table 1, Figure 6, §4.2, §5.1.
  • [32] W. Liu, X. Tang, and J. Liu (2007) Bayesian tensor inference for sketch-based facial photo hallucination. In IJCAI, pp. 2141–2146. Cited by: Table 1.
  • [33] R. Margolin, L. Zelnik-Manor, and A. Tal (2014) How to evaluate foreground maps?. In IEEE CVPR, pp. 248–255. Cited by: §4.2.
  • [34] A. Oliva and A. Torralba (2001) Modeling the shape of the scene: a holistic representation of the spatial envelope. IJCV 42 (3), pp. 145–175. Cited by: §3.2.
  • [35] C. Peng, X. Gao, N. Wang, and J. Li (2017) Superpixel-based face sketch–photo synthesis. IEEE TCSVT 27 (2), pp. 288–299. Cited by: Table 1.
  • [36] C. Peng, X. Gao, N. Wang, D. Tao, X. Li, and J. Li (2016) Multiple representations-based face sketch–photo synthesis. IEEE TNNLS 27 (11), pp. 2201–2215. Cited by: Table 1.
  • [37] J. Pont-Tuset and F. Marques (2013) Measures and meta-measures for the supervised evaluation of image segmentation. In IEEE CVPR, pp. 2131–2138. Cited by: §4.2.
  • [38] A. Semmo, T. Isenberg, and J. Döllner (2017) Neural style transfer: a paradigm shift for image-based artistic rendering?. In ACM NPAR, pp. 5. Cited by: Table 1.
  • [39] H. R. Sheikh, A. C. Bovik, and G. De Veciana (2005) An information fidelity criterion for image quality assessment using natural scene statistics. IEEE TIP 14 (12), pp. 2117–2128. Cited by: Figure 1, §1, Table 2.
  • [40] H. R. Sheikh and A. C. Bovik (2006) Image information and visual quality. IEEE TIP 15 (2), pp. 430–444. Cited by: Figure 1, §1, §2, §4.2, Table 2.
  • [41] J. Shen, Y. Du, W. Wang, and X. Li (2014) Lazy random walks for superpixel segmentation. IEEE TIP 23 (4), pp. 1451–1462. Cited by: §1.
  • [42] J. Shen, X. Hao, Z. Liang, Y. Liu, W. Wang, and L. Shao (2016) Real-time superpixel segmentation by DBSCAN clustering algorithm. IEEE TIP 25 (12), pp. 5933–5942. Cited by: §1.
  • [43] J. Shen, J. Peng, and L. Shao (2018) Submodular trajectories for better motion segmentation in videos. IEEE TIP 27 (6), pp. 2688–2700. Cited by: §1.
  • [44] I. Sobel (1990) An isotropic 3×3 image gradient operator. Machine Vision for Three-dimensional Scenes, pp. 376–379. Cited by: Table 2, §6.
  • [45] Y. Song, L. Bao, Q. Yang, and M. Yang (2014) Real-time exemplar-based face sketch synthesis. In ECCV, pp. 800–813. Cited by: Table 1, Figure 6, §4.2, §5.1.
  • [46] X. Sun, J. Yang, M. Sun, and K. Wang (2016) A benchmark for automatic visual classification of clinical skin disease images. In ECCV, pp. 206–222. Cited by: footnote 2.
  • [47] B. Sylwia, C. Rovina, C. Brun, G. Justin, and L. Marisa (2015) Beginner’s guide to sketching. 3dtotal Publishing. Cited by: §2.2.
  • [48] H. Talebi and P. Milanfar (2018) NIMA: neural image assessment. IEEE TIP 27 (8), pp. 3998–4011. Cited by: §2.
  • [49] X. Tang and X. Wang (2003) Face sketch synthesis and recognition. In IEEE CVPR, pp. 687–694. Cited by: Table 1, §1.
  • [50] X. Tang and X. Wang (2004) Face sketch recognition. IEEE TCSVT 14 (1), pp. 50–57. Cited by: Table 1.
  • [51] C. Tu, Y. Chan, and Y. Chen (2016) Facial Sketch Synthesis Using 2D Direct Combined Model-Based Face-Specific Markov Network. IEEE TIP 25 (8), pp. 3546–3561. Cited by: Table 1.
  • [52] L. Wang, V. Sindagi, and V. Patel (2018) High-quality facial photo-sketch synthesis using multi-adversarial networks. In IEEE FG, pp. 83–90. Cited by: Table 1.
  • [53] N. Wang, X. Gao, J. Li, B. Song, and Z. Li (2016) Evaluation on synthesized face sketches. Neurocomputing 214, pp. 991–1000. Cited by: §2, §2, §5.2.
  • [54] N. Wang, X. Gao, and J. Li (2018) Random sampling for fast face sketch synthesis. Pattern Recognition 76, pp. 215–227. Cited by: Table 1, Figure 6, §4.2, §5.1, §5.1.
  • [55] N. Wang, X. Gao, L. Sun, and J. Li (2017) Bayesian face sketch synthesis. IEEE TIP 26 (3), pp. 1264–1274. Cited by: Table 1, Figure 6, §4.2, §5.1.
  • [56] N. Wang, J. Li, D. Tao, X. Li, and X. Gao (2013) Heterogeneous image transformation. PRL 34 (1), pp. 77–84. Cited by: Table 1.
  • [57] N. Wang, D. Tao, X. Gao, X. Li, and J. Li (2013) Transductive face sketch-photo synthesis. IEEE TNNLS 24 (9), pp. 1364–1376. Cited by: Table 1.
  • [58] N. Wang, D. Tao, X. Gao, X. Li, and J. Li (2014) A comprehensive survey to face hallucination. IJCV 106 (1), pp. 9–30. Cited by: Table 1.
  • [59] N. Wang, M. Zhu, J. Li, B. Song, and Z. Li (2017) Data-driven vs. model-driven: fast face sketch synthesis. Neurocomputing. Cited by: Figure 2, Table 1, §2.3, Figure 6, §4.2, §5.1.
  • [60] S. Wang, L. Zhang, Y. Liang, and Q. Pan (2012) Semi-coupled dictionary learning with applications to image super-resolution and photo-sketch synthesis. In IEEE CVPR, pp. 2216–2223. Cited by: Table 1.
  • [61] X. Wang and X. Tang (2004) Dual-space linear discriminant analysis for face recognition. In IEEE CVPR, Vol. 2, pp. II–II. Cited by: Table 1.
  • [62] X. Wang and X. Tang (2004) Random sampling lda for face recognition. In IEEE CVPR, pp. 259–265. Cited by: Table 1.
  • [63] X. Wang and X. Tang (2006) Random sampling for subspace face recognition. IJCV 70 (1), pp. 91–104. Cited by: Table 1.
  • [64] X. Wang and X. Tang (2009) Face photo-sketch synthesis and recognition. IEEE TPAMI 31 (11), pp. 1955–1967. Cited by: Table 1, Figure 6, §4.2, Table 2, §5.1.
  • [65] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE TIP 13 (4), pp. 600–612. Cited by: Figure 1, Table 1, §1, §2, §4.2, Table 2.
  • [66] Z. Wang and A. C. Bovik (2002) A universal image quality index. IEEE SPL 9 (3), pp. 81–84. Cited by: Table 1.
  • [67] G. Winkenbach and D. H. Salesin (1994) Computer-generated pen-and-ink illustration. In ACM SIGGRAPH, pp. 91–100. Cited by: Figure 2.
  • [68] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma (2009) Robust face recognition via sparse representation. IEEE TPAMI 31 (2), pp. 210–227. Cited by: Table 1.
  • [69] X. Wu, C. Zhan, Y. Lai, M. Cheng, and J. Yang (2019) IP102: a large-scale benchmark dataset for insect pest recognition. In IEEE CVPR, pp. 8787–8796. Cited by: footnote 2.
  • [70] B. Xiao, X. Gao, D. Tao, Y. Yuan, and J. Li (2010) Photo-sketch synthesis and recognition based on subspace learning. Neurocomputing 73 (4-6), pp. 840–852. Cited by: Table 1.
  • [71] J. Xu, L. Zhang, D. Zhang, and X. Feng (2017) Multi-channel weighted nuclear norm minimization for real color image denoising. In IEEE ICCV, Cited by: §1.
  • [72] W. Xue, L. Zhang, X. Mou, and A. C. Bovik (2014) Gradient magnitude similarity deviation: a highly efficient perceptual image quality index. IEEE TIP 23 (2), pp. 684–695. Cited by: Figure 1, §1, §2, §4.2, Table 2.
  • [73] J. Yang, X. Sun, J. Liang, and P. L. Rosin (2018) Clinical skin lesion diagnosis using representations inspired by dermatologist criteria. In IEEE CVPR, pp. 1258–1266. Cited by: §1.
  • [74] D. Zhang, L. Lin, T. Chen, X. Wu, W. Tan, and E. Izquierdo (2017) Content-adaptive sketch portrait generation by decompositional representation learning. IEEE TIP 26 (1), pp. 328–339. Cited by: Table 1.
  • [75] L. Zhang, L. Lin, X. Wu, S. Ding, and L. Zhang (2015) End-to-end photo-sketch generation via fully convolutional representation learning. In ACM ICMR, pp. 627–634. Cited by: Table 1, Figure 6, §4.2, §5.1.
  • [76] L. Zhang, L. Zhang, X. Mou, and D. Zhang (2011) FSIM: a feature similarity index for image quality assessment. IEEE TIP 20 (8), pp. 2378–2386. Cited by: Figure 1, §1, §2, Table 2.
  • [77] M. Zhang, J. Li, N. Wang, and X. Gao (2018) Compositional model-based sketch generator in facial entertainment. IEEE TOC 48 (3), pp. 904–915. Cited by: Table 1.
  • [78] M. Zhang, N. Wang, X. Gao, and Y. Li (2018) Markov random neural fields for face sketch synthesis.. In IJCAI, pp. 1142–1148. Cited by: Table 1.
  • [79] M. Zhang, N. Wang, Y. Li, R. Wang, and X. Gao (2018) Face sketch synthesis from coarse to fine. In AAAI, Cited by: Table 1.
  • [80] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. In IEEE CVPR, pp. 586–595. Cited by: §1, §2, §5.
  • [81] S. Zhang, X. Gao, N. Wang, J. Li, and M. Zhang (2015) Face sketch synthesis via sparse representation-based greedy search. IEEE TIP 24 (8), pp. 2466–2477. Cited by: Table 1.
  • [82] S. Zhang, X. Gao, N. Wang, and J. Li (2016) Robust face sketch style synthesis. IEEE TIP 25 (1), pp. 220–232. Cited by: Table 1.
  • [83] S. Zhang, X. Gao, N. Wang, and J. Li (2017) Face sketch synthesis from a single photo–sketch pair. IEEE TCSVT 27 (2), pp. 275–287. Cited by: Table 1.
  • [84] S. Zhang, R. Ji, J. Hu, Y. Gao, and C. Lin (2018) Robust face sketch synthesis via generative adversarial fusion of priors and parametric sigmoid.. In IJCAI, pp. 1163–1169. Cited by: Table 1.
  • [85] S. Zhang, R. Ji, J. Hu, X. Lu, and X. Li (2018) Face sketch synthesis by multidomain adversarial learning. IEEE TNNLS. Cited by: Table 1.
  • [86] W. Zhang, X. Wang, and X. Tang (2010) Lighting and pose robust face sketch synthesis. In ECCV, pp. 420–433. Cited by: Table 1.
  • [87] W. Zhang, X. Wang, and X. Tang (2011) Coupled information-theoretic encoding for face photo-sketch recognition. In IEEE CVPR, pp. 513–520. Cited by: Table 2.
  • [88] Y. Zhang, N. Wang, S. Zhang, J. Li, and X. Gao (2016) Fast face sketch synthesis via kd-tree search. In ECCV, pp. 64–77. Cited by: Table 1.
  • [89] J. Zhao, Y. Cao, D. Fan, M. Cheng, X. Li, and L. Zhang (2019) Contrast prior and fluid pyramid integration for RGBD salient object detection. In IEEE CVPR, Cited by: §1.
  • [90] J. Zhao, J. Liu, D. Fan, J. Yang, and M. Cheng (2019) Edge-based network for salient object detection. In IEEE ICCV, Cited by: §1.
  • [91] J. Zhao, R. Bo, Q. Hou, M. Cheng, and P. L. Rosin (2018) FLIC: fast linear iterative clustering with active search. Computational Visual Media 4 (4), pp. 333–348. Cited by: §1.
  • [92] H. Zhou, Z. Kuang, and K. K. Wong (2012) Markov weight fields for face sketch synthesis. In IEEE CVPR, pp. 1091–1097. Cited by: Table 1, Figure 6, §4.2, §5.1.
  • [93] M. Zhu, N. Wang, X. Gao, and J. Li (2017) Deep graphical feature learning for face sketch synthesis. In IJCAI, pp. 3574–3580. Cited by: Table 1, Figure 6, §4.2, §5.1.
  • [94] Y. Zhu and A. Elgammal (2017) A multilayer-based framework for online background subtraction with freely moving cameras. In IEEE ICCV, Cited by: §7.
  • [95] S. W. Zucker, A. Dobbins, and L. Iverson (1989) Two stages of curve detection suggest two styles of visual computation. Neural computation 1 (1), pp. 68–81. Cited by: §2.