Cali-Sketch: Stroke Calibration and Completion for High-Quality Face Image Generation from Poorly-Drawn Sketches

11/01/2019 ∙ by Weihao Xia, et al. ∙ 4

Image generation has received increasing attention because of its wide application in security and entertainment. Sketch-based face generation makes image generation more engaging and improves its quality through supervised interaction. However, when a sketch poorly aligned with the true face is given as input, existing supervised image-to-image translation methods often cannot generate acceptable photo-realistic face images. To address this problem, we propose Cali-Sketch, a method that generates photo-realistic images from poorly-drawn sketches. Cali-Sketch explicitly models stroke calibration and image generation using two constituent networks: a Stroke Calibration Network (SCN), which calibrates strokes of facial features and enriches facial details while preserving the original intent features; and an Image Synthesis Network (ISN), which translates the calibrated and enriched sketches to photo-realistic face images. In this way, we decouple a difficult cross-domain translation problem into two easier steps. Extensive experiments verify that the face photos generated by Cali-Sketch are both photo-realistic and faithful to the input sketches, compared with state-of-the-art methods.




I Introduction

Drawing a sketch may be the easiest way for amateurs to describe an object or scene quickly. Compared with photographs or portraits, it does not require capture devices or professional painting skills. Generating photo-realistic images from free-hand sketches enables a novice to create images from their imagination, turning a face or scene that otherwise exists only in their dreams into reality. However, the sketches drawn by non-artists are usually simple and imperfect. They are sparse, lack necessary details, and their strokes do not precisely align with the original images or actual objects. It is hence challenging to synthesize natural and realistic images from such poorly-drawn sketches.

Fig. 1: The overall architecture of our method. Stroke Calibration Network first calibrates unreasonable strokes and adds necessary details. The modified sketches are then fed into Image Synthesis Network to produce photo-realistic face images.

Recent progress on image-to-image translation [1, 2, 3, 4, 5, 6, 7] has shown that an end-to-end generative adversarial network (GAN) architecture can produce high-quality results. A few of these methods are capable of synthesizing facial photos from sketches, but they require sketches with precisely and strictly aligned boundaries to produce plausible results. Building such a large-scale dataset with thousands of image pairs (i.e., a face photo and its corresponding sketch drawn by a professional portraitist) would be quite time-consuming and expensive. It is much easier to build a dataset of face photos and their corresponding free-hand sketches drawn by amateurs. Technically, given such poorly-drawn sketches and photos, the networks proposed for cross-domain translation [1, 2, 8, 9] would learn both stroke modification and image generation simultaneously. However, the large stroke and appearance differences between sketches and photos diminish the effectiveness of these networks, leading to unsatisfactory results.

There are some interactive face image modification methods under the framework of image inpainting [10, 11, 12, 13]. Given a partially and irregularly masked image, they refill the erased regions using strokes provided by the user as guidance. The refilled regions are consistent with the input reference strokes and compatible with the whole image. Recent work [10] can obtain a realistic synthetic face photo even when the user makes some modifications, and the network tolerates minor errors or mismatches. However, to generate an appropriately edited and restored result, these methods still need a plausible sketch of the original image. When poorly-drawn sketches are fed into the model, the results can be unacceptable. Moreover, synthesizing an image from a complete sketch is much harder than from a regionally-erased image, since in the latter case the remaining edges and colors can significantly help the reconstruction.

Some methods [14, 15] consider the case of casual free-hand frontal face sketches, where the generated images do not have to strictly align with the input sketches and have more freedom in appearance. However, these methods produce blurry results with visible artifacts. Moreover, crucial components and the drawing intention of the original sketches, such as facial contours and hairstyles, are not preserved in the synthesized images.

To address the above issues, we propose a novel two-stage generative adversarial network called "Cali-Sketch" to realize face photo synthesis from poorly-drawn sketches in a unified framework. It explicitly models stroke calibration and image generation using two constituent GANs: a Stroke Calibration Network (SCN), which calibrates and completes strokes of facial features and enriches facial details while preserving the original intent features of the painter, and an Image Synthesis Network (ISN), which translates the calibrated and completed sketches to face photos. The two GANs are first trained separately for each stage, and then trained jointly.

We focus on image generation from "poorly-drawn sketches", which is less discussed but appears more often in real applications. Compared with the aforementioned methods in [1, 10, 5], ours does not require a sketch well aligned with the original image to generate an appropriately edited and restored result, whereas those methods might produce unacceptable results given such poorly-drawn sketches as input.

To preserve facial features and drawing intention, we propose both a global contour loss and a local detail loss to accomplish the necessary stroke modifications and detail improvements. To eliminate artifacts, we also incorporate a perceptual loss and a reconstruction loss in the overall objective function. In this way, we make the final appearance of the generated images photo-realistic while keeping the determinant attributes and drawing intention of the input sketch. Experiments confirm that face images synthesized by our proposed method are natural-looking and visually pleasant without observable artifacts.

To sum up, our key contributions are three-fold:

  • We present the first two-stage method for translating poorly-drawn face sketches to photos. It achieves stroke calibration and image synthesis with two consecutive GANs: SCN and ISN.

  • We propose SCN for necessary stroke calibration and detail completion. To preserve identity and drawing intention during the reconstruction of fine-grained face sketches, we design novel calibration loss functions. Furthermore, when given a free-hand drawn sketch, this network can act as a pre-processing modification module for other tasks using reference sketches such as interactive face image modification.

  • We propose ISN for face sketch-to-image generation. The synthesized face images are both identity-consistent and appearance-realistic.

The rest of this paper is organized as follows. Section II provides an overview of previous methods and related techniques. Section III presents the proposed Cali-Sketch method. Section IV reports the qualitative and quantitative performance of sketch-based image synthesis experiments using the proposed method, and Section V summarizes and concludes the paper.

II Related Work

II-A Photo-Realistic Image Synthesis

Photo-realistic image synthesis methods have progressed rapidly during the last few years. The goal of image synthesis is to generate photo-realistic and faithful images from abstract semantic label maps or sketches, referred to as label-based image synthesis and sketch-based image synthesis, respectively.

Label-based image synthesis methods [5, 16, 17] synthesize images from abstract label maps, such as sparse landmarks or pixel-wise segmentation maps. [5] proposes a framework for instance-level image synthesis with conditional GANs. [16] proposes a cascade framework to synthesize high-resolution images from pixel-wise labeling maps.

Facial sketch-based image synthesis approaches have been widely developed during the last few years. Existing studies can be broadly classified into two categories: image retrieval based approaches [18, 19, 20] and deep learning based methods [1, 2, 14, 15, 21, 22]. The former mainly have three basic steps: retrieve, select and composite. Given a sketch plus overlaid text labels as input, Sketch2Photo [18] and Photo-Sketcher [19] automatically synthesize realistic pictures by seamlessly composing objects and backgrounds based on sketch searching and image compositing. PoseShop [20] constructs a large segmented character database for human synthesis, where people in pictures are segmented and annotated by actions and appearance attributes. Human images are then composed by using the given sketches with text labels as queries. Those methods often suffer from heavy blurring and a tedious inference process.

Deep learning based methods learn the mapping between sketches and photos. pix2pix [1] translates precise edge maps to pleasing shoe pictures using conditional GANs. CycleGAN [2] proposes a cycle-consistent loss to handle the paired training data limitation of pix2pix. SketchyGAN [14] synthesizes plausible images of objects from 50 classes. It aims to synthesize results that are both photo-realistic and faithful to the intention of the given sketches. In this case, intention is defined as the generated images sharing similar poses with the input sketches, since it is hard to learn human intention. PhotoSketchMAN [21] generates face photos iteratively from low resolution to high resolution with multi-adversarial networks. CA-GAN [22] proposes to use pixel-wise labelling of facial composition information to help face sketch-photo synthesis. Contextual-GAN [15] formulates sketch-image synthesis as joint image completion, where sketches provide contextual information for completion.

Fig. 2: Qualitative comparison with baselines. We compare our methods with pix2pix [1], CycleGAN [2], DRIT [9], MUNIT [8]. Our approach generates more photo-realistic images. The corresponding image can be recognised easily from a batch of mixed sketches, which means crucial components and drawing intention of original sketches such as facial contours and hair styles are well-preserved in the synthesized images.

II-B Generative Adversarial Networks

In recent years, Generative Adversarial Networks (GANs) [23] have been successfully applied in many computer vision tasks to improve the realism of generated images, such as domain adaptation [24, 25, 26] and super-resolution [27, 28, 29, 30, 31, 32]. They are composed of a generator $G$ and a discriminator $D$. The discriminator tries to distinguish generated fake images from real ones, while the generator aims to fool the discriminator into accepting fake images as real. The ideal solution is the Nash equilibrium, where neither $G$ nor $D$ can improve its cost unilaterally.

Despite their great success, GANs still face several challenges, including generalization [33, 34] and training stability [35, 36]. Several techniques have been proposed to alleviate these problems. For example, Arjovsky et al. [35, 37] propose to minimize the Wasserstein distance between model and data distributions. Berthelot et al. [38] optimize a lower bound of the Wasserstein distance between the auto-encoder loss distributions on real and fake data. Mao et al. [39] propose a least squares loss for the discriminator, which implicitly minimizes the Pearson divergence, leading to stable training, high image quality and considerable diversity.

II-C Image-to-Image Translation with GANs

General image-to-image translation methods aim to learn a mapping from the source domain to the target domain. Isola et al. [1] propose a pix2pix framework trained with image pairs and achieve convincing synthetic images on many translation tasks. To handle the limitation of paired images for training, CycleGAN [2], DualGAN [40], DiscoGAN [41] present cycle consistency loss to constrain the translation between inputs and translated images. CSGAN [42] extends [2] with additive cyclic-synthesized loss between the synthesized image of one domain and the cycled image of another domain. InstaGAN [43] incorporates instance attribute information for multi-instance transfiguration. MUNIT [8] and DRIT [9] are proposed for one-to-many diverse image translation. ComboGAN [44] also proposes a multi-component translation method without being constrained to two domains.

III Method

III-A Overview

Our goal is to synthesize a face photo from a poorly-drawn sketch. Consider two data collections from different domains, $S$ referring to the input sketch domain and $P$ referring to the output photo domain. $I \in \mathbb{R}^{H \times W \times C}$ represents an image of height $H$, width $W$ and $C$ channels. Converting a face sketch from the source domain $S$ to an image in the target photo domain $P$ can be written as $G: S \rightarrow P$. This is a typical cross-domain image translation problem, but we could not directly learn the mapping with existing image-to-image translation methods. Instead, we decompose this translation into two stages: 1) a Stroke Calibration Network named SCN, and 2) an Image Synthesis Network named ISN. Let $G_{SCN}$ and $D_{SCN}$ be the generator and discriminator of SCN, and $G_{ISN}$ and $D_{ISN}$ be the generator and discriminator of ISN, respectively. As shown in Figure 1, the input sketch S is first fed into SCN to obtain the refined sketch R after stroke calibration and detail completion, which is then fed into ISN to generate a photo-realistic face image P. We first train the Stroke Calibration Network and the Image Synthesis Network separately until the losses plateau, and then train them jointly in an end-to-end way until convergence. A qualitative comparison with baselines is shown in Figure 2. Illustrations of SCN and ISN are shown in Figures 3 and 4, respectively. Training details and the network architecture can be found in Section IV-B.

III-B Stroke Calibration Network

The Stroke Calibration Network aims to modify unreasonable strokes and enrich the necessary details of the input sketch. Let $S$ be the input sketches. Ground truth face photos and their edge counterparts are denoted as $P_{gt}$ and $E_{gt}$. The mapping from poorly-drawn sketches to the modified ones can be denoted as $G_{SCN}$:

$$R = G_{SCN}(S),$$

where the edge counterparts $E_{gt}$ are composed of two components: global contours $E_{gt}^{c}$ and local details $E_{gt}^{d}$.

To this end, we introduce a novel calibration loss that consists of a global contour loss and a local detail loss. The global contour loss drives the modification of unreasonable strokes, and the local detail loss enriches the necessary details.

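We define both losses based on the feature matching loss [5]. Feature representations of real and synthesized images are extracted from multiple layers of the discriminator and used to calculate the feature matching loss as

$$\mathcal{L}_{k} = \mathbb{E}\left[\sum_{i=1}^{T} \frac{1}{N_i} \left\| D_{SCN}^{(i)}\left(E_{gt}^{k}\right) - D_{SCN}^{(i)}\left(R\right) \right\|_1\right], \quad k \in \{c, d\},$$

where $k \in \{c, d\}$ denotes global contour and local detail respectively, $D_{SCN}^{(i)}$ represents the intermediate representation from the $i$-th layer of the discriminator used as a feature extractor, $T$ is the number of layers and $N_i$ is the number of elements in the $i$-th layer. In our experiments, global contour and local detail are implemented by HED [45] and Canny [46] edge maps, respectively. This calibration loss can stabilize training by forcing the generator to produce natural statistics at different scales [5].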
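To make the structure of the calibration loss concrete, the following is a minimal pure-Python sketch of a feature matching loss computed over per-layer discriminator features. It assumes each layer's features are already flattened into a list of floats, and it assumes the two terms are combined by an unweighted sum; the function names and the absence of weights are illustrative choices, not details given in the paper.

```python
def l1_mean(a, b):
    """Mean absolute difference between two equally sized flat feature lists."""
    assert len(a) == len(b) and len(a) > 0
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def feature_matching_loss(target_feats, refined_feats):
    """Sum over discriminator layers of the per-layer mean L1 distance."""
    assert len(target_feats) == len(refined_feats)
    return sum(l1_mean(t, r) for t, r in zip(target_feats, refined_feats))


def calibration_loss(hed_feats, canny_feats, refined_feats):
    """Global contour term (HED target) plus local detail term (Canny target).

    Assumption: the two terms are simply added; the paper may weight them.
    """
    contour = feature_matching_loss(hed_feats, refined_feats)
    detail = feature_matching_loss(canny_feats, refined_feats)
    return contour + detail


# Two hypothetical discriminator layers with 2 and 4 activations each.
target = [[1.0, 2.0], [0.0, 0.0, 0.0, 0.0]]
refined = [[1.0, 4.0], [1.0, 1.0, 1.0, 1.0]]
print(feature_matching_loss(target, refined))
```

Dividing by the number of elements in each layer keeps large early layers from dominating the sum, which is the role of $N_i$ in the feature matching loss of [5].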

Fig. 3: Illustration of Stroke Calibration Network (SCN): (a) reference; (b) input sketch; (c) Canny result; (d) modified sketch by our SCN.

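For stable training, high image quality and considerable diversity, as discussed in Section II, we use the least-squares GAN [39] in our experiments. Thus, $\mathcal{L}_{adv}$ can be formulated as

$$\min_{D} \mathcal{L}_{adv}(D) = \frac{1}{2}\,\mathbb{E}_{x \sim p_{data}(x)}\left[(D(x) - b)^2\right] + \frac{1}{2}\,\mathbb{E}_{s \sim p_{s}(s)}\left[(D(G(s)) - a)^2\right],$$

$$\min_{G} \mathcal{L}_{adv}(G) = \frac{1}{2}\,\mathbb{E}_{s \sim p_{s}(s)}\left[(D(G(s)) - c)^2\right],$$

where $a$, $b$ and $c$ denote the label for fake data, the label for real data, and the value that $G$ wants $D$ to believe for fake data, respectively. In our experiments, $x$ are ground truth images and $s$ are input sketches sampled from the distribution $p_{s}(s)$.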
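As an illustration of how the three labels enter the least-squares objective, here is a small pure-Python sketch operating on lists of discriminator scores. The default values a = 0, b = 1, c = 1 are one common coding from the least-squares GAN literature and are an assumption here, since the label values used in the experiments are not stated; the function names are hypothetical.

```python
def lsgan_d_loss(d_real, d_fake, a=0.0, b=1.0):
    """Discriminator loss: push scores on real data to b and on fake data to a."""
    real_term = sum((v - b) ** 2 for v in d_real) / len(d_real)
    fake_term = sum((v - a) ** 2 for v in d_fake) / len(d_fake)
    return 0.5 * real_term + 0.5 * fake_term


def lsgan_g_loss(d_fake, c=1.0):
    """Generator loss: push discriminator scores on fake data towards c."""
    return 0.5 * sum((v - c) ** 2 for v in d_fake) / len(d_fake)


# A perfect discriminator has zero loss; the generator is then penalized.
print(lsgan_d_loss([1.0, 1.0], [0.0, 0.0]))
print(lsgan_g_loss([0.0, 0.0]))
```

Unlike the sigmoid cross-entropy loss, the squared error keeps penalizing samples that are classified correctly but lie far from the target value, which is the property [39] credits for more stable training.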

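The total loss of the Stroke Calibration Network combines the least-squares adversarial loss and the calibration loss as

$$\mathcal{L}_{SCN} = \mathcal{L}_{adv} + \lambda\, \mathcal{L}_{cal},$$

where $\lambda$ is a regularization parameter controlling the relative importance of the two terms.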

III-C Image Synthesis Network

After stroke calibration and detail completion, the refined sketch R is fed into the Image Synthesis Network to generate a photo-realistic face photo $P$. This translation process from the refined sketch R to the photo-realistic face image P can be defined as $G_{ISN}: R \rightarrow P$. The output image should yield both high sketch identification similarity and favourable perceptual quality, while sharing the same resolution as the input sketch:

$$P = G_{ISN}(R).$$

Fig. 4: Illustration of Image Synthesis Network (ISN): (a) reference; (b) original input sketch; (c) input sketch modified by our SCN. (d) generated image by our ISN.

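We train this image synthesis network with a joint loss, which consists of five terms: an $\ell_1$ reconstruction loss $\mathcal{L}_{rec}$, an adversarial loss $\mathcal{L}_{adv}$, a perceptual loss $\mathcal{L}_{perc}$, a style loss $\mathcal{L}_{style}$ and a total variation loss $\mathcal{L}_{tv}$:

$$\mathcal{L}_{ISN} = \lambda_{rec}\mathcal{L}_{rec} + \lambda_{adv}\mathcal{L}_{adv} + \lambda_{perc}\mathcal{L}_{perc} + \lambda_{style}\mathcal{L}_{style} + \lambda_{tv}\mathcal{L}_{tv},$$

where each $\lambda$ weights the corresponding term.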


The reconstruction loss minimizes the differences between the reference and generated images:

$$\mathcal{L}_{rec} = \left\| P_{gt} - P \right\|_1.$$


The perceptual loss was proposed by Johnson et al. [47] based on perceptual similarity. It was originally defined as the distance between two activated features of a pre-trained deep neural network. Here we adopt a more effective perceptual loss which uses features before the activation layers [31]. These features are denser and thus provide relatively stronger supervision, leading to better performance:

$$\mathcal{L}_{perc} = \mathbb{E}\left[\sum_{i} \frac{1}{N_i} \left\| \phi_i(P_{gt}) - \phi_i(P) \right\|_1\right],$$

where $\phi_i$ denotes the feature maps before activation of the $i$-th selected layer of the VGG-19 network pre-trained for image classification, and $N_i$ is the number of elements in that layer.

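The style loss [47] is adopted in the same form as in the original work, and measures the differences between the covariances of activation features:

$$\mathcal{L}_{style} = \mathbb{E}_{j}\left[\left\| G_j^{\phi}(P_{gt}) - G_j^{\phi}(P) \right\|_F^2\right],$$

where $G_j^{\phi}$ represents the Gram matrix constructed from the feature maps $\phi_j$.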
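The Gram matrix behind the style loss can be sketched in a few lines of pure Python. The example assumes a feature map given as a list of channels, each flattened to a list of activations, and normalizes by the number of channels times the number of positions as in Johnson et al. [47]; the function names and the squared-difference distance are illustrative assumptions.

```python
def gram_matrix(features):
    """Gram matrix of a feature map given as C channels of H*W activations each.

    Entry (i, j) is the inner product of channels i and j, divided by C*H*W.
    """
    c = len(features)
    n = len(features[0])
    return [
        [sum(x * y for x, y in zip(features[i], features[j])) / (c * n)
         for j in range(c)]
        for i in range(c)
    ]


def style_distance(feat_ref, feat_gen):
    """Squared Frobenius distance between the Gram matrices of two feature maps."""
    g_ref = gram_matrix(feat_ref)
    g_gen = gram_matrix(feat_gen)
    return sum(
        (a - b) ** 2
        for row_a, row_b in zip(g_ref, g_gen)
        for a, b in zip(row_a, row_b)
    )


feats = [[1.0, 0.0], [0.0, 2.0]]  # 2 channels, 2 spatial positions
print(gram_matrix(feats))
print(style_distance(feats, [[0.0, 0.0], [0.0, 0.0]]))
```

Because the Gram matrix sums over spatial positions, it captures which channels fire together regardless of where, which is why it is used as a texture or style descriptor rather than a content descriptor.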

The total variation loss is based on the principle that images with unrestrained and possibly spurious detail have high total variation. Accordingly, reducing the total variation of an image, subject to it being a close match to the original image, removes unwanted noise while enforcing spatial smoothness and preserving important details such as edges. It is defined on the basis of the absolute gradient of the generated images:

$$\mathcal{L}_{tv} = \sum_{i,j} \left( \left| P_{i+1,j} - P_{i,j} \right| + \left| P_{i,j+1} - P_{i,j} \right| \right).$$
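The absolute-gradient definition can be illustrated with a short pure-Python sketch of anisotropic total variation on a single-channel image stored as a list of rows. The function name is hypothetical, and any normalization by image size is omitted as an assumption.

```python
def total_variation(img):
    """Sum of absolute differences between vertically and horizontally adjacent pixels."""
    h, w = len(img), len(img[0])
    vertical = sum(
        abs(img[i + 1][j] - img[i][j]) for i in range(h - 1) for j in range(w)
    )
    horizontal = sum(
        abs(img[i][j + 1] - img[i][j]) for i in range(h) for j in range(w - 1)
    )
    return vertical + horizontal


flat = [[1, 1], [1, 1]]     # constant image: no variation
checker = [[0, 1], [1, 0]]  # checkerboard: every neighbor pair differs
print(total_variation(flat), total_variation(checker))
```

A constant image scores zero while a noisy one scores high, so minimizing this term alongside the reconstruction loss suppresses isolated noise without forbidding genuine edges.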


IV Experiments

IV-A Training Data

Appropriate and adequate training data is important for network performance. Since it is infeasible to collect large-scale paired images and sketches, most existing free-hand sketch based image synthesis methods generate sketches automatically from images.

To exhibit different styles of free-hand sketches and to improve the generality of the network, we augment the training data with multiple styles of input sketches. Specifically, we generate four different free-hand sketch styles in total. We use the XDoG edge detector [48] and the Photocopy effect [49] in Photoshop to generate two styles. To better resemble hand-drawn sketches, we simplify the edge images using [50] as in [51]. We also use photo-sketch [52] to generate face sketches; this recent method generates contour sketches that are imperfectly aligned with the input images. Poorly-drawn sketches should be sparse and contain wrong edges, which is why the Canny algorithm [46] is not used to produce input sketches: the edges generated by Canny are solid and well-aligned with the input images. To show the effectiveness and efficiency of our approach, we use the CUHK Face Sketch Database [53] in our experiments for its appropriateness and popularity. We use its resized and cropped version.

Fig. 5: Illustration of well-drawn sketches from [53]. Best viewed in color.

Fig. 6: Illustration of four different free-hand sketch styles: photo-sketch [52], XDoG [50], Photocopy [49] and FDoG [48].

Figure 5 illustrates well-drawn sketches from [53]. These sketches are drawn by artists. Compared with the poorly-drawn sketches in Figure 6, the well-depicted sketches capture the most distinctive characteristics of human faces and are faithful to the original face images. We can often easily recognize a person from the corresponding sketch. In contrast, free-hand sketches are often sparse and deformed, lack necessary strokes or details, and have lines that do not precisely align with the real face images; sometimes they even lack necessary lines in the area of the mouth or jaw, as illustrated in Figure 6.

The ground truth sketches for the Stroke Calibration Network are generated using the Canny algorithm [46] and the Holistically-nested Edge Detection (HED) edge detector [45]. Specifically, we extract HED edges from images after histogram equalization to reduce the influence of lighting. We thus obtain a new dataset consisting of high-quality face photos, the corresponding poorly-drawn face sketches, and Canny and HED edges.

IV-B Experiment Settings

Network architecture. Inspired by recent image translation studies, our generators follow an encoder-decoder architecture similar to the method proposed by Johnson et al. [47]. Each generator consists of two down-sampling encoder layers, followed by eight residual blocks [54] and two up-sampling decoder layers. Skip connections are added to concatenate previous layers with the same spatial resolution. We replace the regular convolutions in the residual blocks with dilated convolutions with a dilation factor of two to obtain large receptive fields. Our discriminators are based on the SN-PatchGAN [55] architecture, which determines whether or not overlapping image patches of a certain size are real. Spectral normalization [56] is introduced for rapid and stable training and helps produce high-quality results.

Note that we did not deliberately design the structure of the Image Synthesis Network (ISN). In fact, we adopt a quite simple structure for ISN to show that it is easy to generate satisfactory results from the calibrated sketches even when the sketch2image synthesis network is not deliberately designed. For more details about the indispensability of the Stroke Calibration Network and the scalability of the Image Synthesis Network, refer to Sections IV-E2 and IV-E3 respectively.

Training strategy. The training strategy is given in Algorithm 1. Forward and backward in Algorithm 1 represent forward propagation and back-propagation respectively. The forward process passes the input through the network layers and calculates the output and losses of the model. The backward process back-propagates errors and updates the weights of the network. We refer to the corresponding operations as forward and backward for simplicity, and emphasize that our method is an end-to-end method with three-stage training. $T_1$, $T_2$ and $T_3$ are the iteration numbers of the three stages, which are large enough to guarantee convergence.

First, we train the Stroke Calibration Network using the Canny and HED edges as supervision. Meanwhile, we train the Image Synthesis Network using Canny $\odot$ HED as the input refined sketches and ground truth face images as supervision, with the same learning rate. Here, $\odot$ denotes the Hadamard product. We then decrease the learning rate and jointly train both generators in an end-to-end way until convergence. Discriminators are trained with a learning rate one tenth of the generators' at each training stage. Both networks are trained on resized images.
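The Hadamard product used to build the ISN training input is simply an element-wise multiplication of the two edge maps. The sketch below assumes both maps are same-sized lists of rows with binary values for readability; in practice the HED output is usually a soft probability map, in which case the product weights each Canny edge pixel by its HED response. The function name is hypothetical.

```python
def hadamard(a, b):
    """Element-wise product of two equally sized 2-D maps."""
    assert len(a) == len(b) and all(len(ra) == len(rb) for ra, rb in zip(a, b))
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


canny = [[1, 0, 1],
         [1, 1, 0]]
hed = [[1, 1, 0],
       [0, 1, 0]]

# Only pixels marked as edges in both maps survive.
print(hadamard(canny, hed))
```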

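Stage 1: SCN training
Input: $S$, free-hand sketch.
Output: $R$, refined sketch.
while $i < T_1$ do
      $R$, $\mathcal{L}_{G}$, $\mathcal{L}_{D}$ = SCN.forward($S$); SCN.backward($\mathcal{L}_{G}$, $\mathcal{L}_{D}$)
Stage 2: ISN training
Input: $E$, Canny $\odot$ HED.
Output: $P$, generated face image.
while $i < T_2$ do
      $P$, $\mathcal{L}_{G}$, $\mathcal{L}_{D}$ = ISN.forward($E$); ISN.backward($\mathcal{L}_{G}$, $\mathcal{L}_{D}$)
Stage 3: Joint training
Input: $S$, free-hand sketch.
Output: $P$, generated face image.
while $i < T_3$ do
      $R$, $\mathcal{L}_{G}^{SCN}$, $\mathcal{L}_{D}^{SCN}$ = SCN.forward($S$); $P$, $\mathcal{L}_{G}^{ISN}$, $\mathcal{L}_{D}^{ISN}$ = ISN.forward($R$); ISN.backward($\mathcal{L}_{G}^{ISN}$, $\mathcal{L}_{D}^{ISN}$); SCN.backward($\mathcal{L}_{G}^{SCN}$, $\mathcal{L}_{D}^{SCN}$)
Algorithm 1 Training strategy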
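The control flow of the three-stage schedule can be sketched with stand-in networks. Everything below is hypothetical scaffolding: `StubNetwork`, its placeholder losses and the iteration counts are assumptions made only to show the order of forward and backward calls, and a real joint stage would also let the ISN losses flow back into the SCN generator.

```python
class StubNetwork:
    """Hypothetical stand-in for a GAN stage; counts weight updates only."""

    def __init__(self, name):
        self.name = name
        self.updates = 0

    def forward(self, x):
        output = f"{self.name}({x})"
        gen_loss, dis_loss = 0.0, 0.0  # placeholder losses
        return output, gen_loss, dis_loss

    def backward(self, gen_loss, dis_loss):
        self.updates += 1  # a real model would back-propagate and step here


def train(scn, isn, sketch, edge_target, t1, t2, t3):
    # Stage 1: train SCN alone on free-hand sketches.
    for _ in range(t1):
        _, g, d = scn.forward(sketch)
        scn.backward(g, d)
    # Stage 2: train ISN alone on ground-truth edges (Canny * HED).
    for _ in range(t2):
        _, g, d = isn.forward(edge_target)
        isn.backward(g, d)
    # Stage 3: joint training, feeding the SCN output into ISN.
    photo = None
    for _ in range(t3):
        refined, g_s, d_s = scn.forward(sketch)
        photo, g_i, d_i = isn.forward(refined)
        isn.backward(g_i, d_i)
        scn.backward(g_s, d_s)
    return photo


scn, isn = StubNetwork("SCN"), StubNetwork("ISN")
print(train(scn, isn, "S", "CannyHED", t1=3, t2=2, t3=4))
```

Training ISN on ground-truth edges first means that, when the joint stage begins, it already maps clean edges to photos, so the remaining work is adapting to the imperfections of the SCN output.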

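Evaluation metrics. For our task of face image synthesis from poorly-drawn sketches, we use two kinds of evaluation metrics: similarity metrics and perceptual scores. We apply the widely used full-reference image quality assessment metrics PSNR and SSIM as similarity metrics. Given two images $X, Y \in \mathbb{R}^{H \times W \times C}$, the peak signal-to-noise ratio (PSNR) is defined as

$$\mathrm{PSNR}(X, Y) = 10 \log_{10} \frac{L^2}{\frac{1}{N}\left\| X - Y \right\|_F^2},$$

where $L$ is usually 255, $\|\cdot\|_F$ is the Frobenius norm and $N = HWC$. The structural similarity index (SSIM) is defined as

$$\mathrm{SSIM}(X, Y) = \frac{(2\mu_X \mu_Y + C_1)(2\sigma_{XY} + C_2)}{(\mu_X^2 + \mu_Y^2 + C_1)(\sigma_X^2 + \sigma_Y^2 + C_2)},$$

where $\mu_X$ and $\sigma_X^2$ are the mean and variance of $X$, $\mu_Y$ and $\sigma_Y^2$ are the mean and variance of $Y$, $\sigma_{XY}$ is the covariance between $X$ and $Y$, and $C_1$ and $C_2$ are constant relaxation terms. A higher score indicates a more structurally similar face for a given sketch.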
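Both similarity metrics are easy to check by hand on tiny inputs. The pure-Python sketch below works on images flattened to lists of pixel values. The SSIM here is computed once over the whole image, whereas standard implementations average it over local windows; the constants k1 = 0.01 and k2 = 0.03 are the commonly used defaults and are an assumption, as the paper does not state them.

```python
import math


def psnr(x, y, peak=255.0):
    """Peak signal-to-noise ratio between two flattened images."""
    mse = sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


def global_ssim(x, y, peak=255.0, k1=0.01, k2=0.03):
    """SSIM computed over the whole image as a single window (simplification)."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1, c2 = (k1 * peak) ** 2, (k2 * peak) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x * mu_x + mu_y * mu_y + c1) * (var_x + var_y + c2)
    return numerator / denominator


img = [10.0, 50.0, 200.0, 120.0]
print(psnr(img, img), global_ssim(img, img))
print(psnr([0.0, 0.0], [255.0, 255.0]))
```

The second call shows the worst case for 8-bit images: when every pixel differs by the full range, the mean squared error equals $L^2$ and the PSNR is 0 dB.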

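For perceptual scores, we use the Fréchet Inception Distance (FID) [57]. The FID is defined using the Fréchet distance between two multivariate Gaussians:

$$\mathrm{FID} = \left\| \mu_r - \mu_g \right\|_2^2 + \mathrm{Tr}\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right),$$

where $X_r \sim \mathcal{N}(\mu_r, \Sigma_r)$ and $X_g \sim \mathcal{N}(\mu_g, \Sigma_g)$ are the 2048-dimensional activations of the Inception-v3 pool3 layer for real and generated samples respectively. A lower FID indicates more perceptually convincing results.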

IV-C Baselines

We perform the evaluation on the following baseline approaches:

Pix2pix [1] is an early work of image-to-image translation. It achieves good photo results on edge-to-photo generation, and the models trained on automatically detected edges can generalize to human drawings.

CycleGAN [2] achieves unsupervised image-to-image translation via cycle-consistent loss.

DRIT [9] is a recent work, which realizes diverse image-to-image translation via disentangled content and attribute representations of different domains. Experiment on the edge-to-shoes dataset shows it can produce both realistic and diverse images.

MUNIT [8] is the state-of-the-art unsupervised multi-domain image-to-image translation framework. It achieves quality and diversity comparable to the state-of-the-art supervised algorithms on the task of edge-to-shoes/handbags.

We implement all baselines using their default architectures and settings.

IV-D Comparison Against Baselines

Qualitative evaluation. A qualitative comparison with the baselines is shown in Figure 2. The results produced by pix2pix [1] all have obvious artifacts. Facial features all suffer from shape distortion to some degree, especially the facial and ear contours in the first and fifth rows. CycleGAN [2] produces the faces most similar to the reference, but its results are blurry and unpleasing, with two or more visible spots in the hair region. The contours of face images generated by DRIT [9] are aligned with the lines of the input sketches, which notably deteriorates image quality. MUNIT [8] produces relatively realistic and consistent results; however, they look more like oil paintings than photos.

Compared with baseline methods, our approach generates high-quality images. The generated human face images are more photo-realistic. The corresponding image can be recognised easily from a batch of mixed sketches, which means crucial components and drawing intention of original sketches like facial contours, hair styles are well-preserved in the synthesized images.

Quantitative comparison. The quantitative evaluation against the baselines is shown in Table I. For PSNR and SSIM, CycleGAN [2] achieves the highest structural similarity, and our method ranks second. For the task of sketch-to-image generation, similarity is not that important, since for most free-hand drawn sketches there are no corresponding real face images as reference. What really matters is whether the generated images are photo-realistic or not.

The Fréchet Inception Distance (FID) is calculated by computing the Fréchet distance between two Gaussians fitted to feature representations extracted from the pre-trained Inception network [58]. It is not only a measure of similarity between two datasets of images, but has also been shown to correlate well with human visual judgement of image quality. Owing to these advantages, FID [57] is the metric most often used to evaluate the quality of images generated by Generative Adversarial Networks. As shown in Table I, our method achieves the lowest FID score, which means that our method produces the best results in both perceptual judgement and high-level similarity.

Method PSNR SSIM FID
pix2pix 18.83 0.7554 76.90
CycleGAN 24.21 0.8508 80.17
MUNIT 17.23 0.7515 78.57
DRIT 16.14 0.7047 97.36
Ours 20.25 0.8006 58.43
TABLE I: Performance in terms of PSNR, SSIM and FID on the CUHK dataset. The best and second best results are highlighted in each column. For details refer to Section IV-D.

IV-E Component Analysis of Cali-Sketch

IV-E1 The choice of contour and detail

There are many choices of contour and detail for our method, such as edges, boundaries and contours. There are a few differences between them. Edge maps are precisely aligned to object boundaries, and they usually contain more information about details and backgrounds. Boundaries pay more attention to external lines. Contours contain object boundaries as well as salient internal and background edges. Contours can be obtained with boundary or contour edge extractors such as HED [45], COB [59, 60] and RCF [61, 62], or, similar to pix2pixHD [5], simplified from face parsing semantic labels. For a face image, contours are more like facial feature boundaries.

Since sketches are approximate outlines of objects with spatial transformations and deformed strokes, we need to modify their strokes and add more details before synthesis. Contours and edges are respectively responsible for stroke calibration and detail completion. We illustrate the reasons in the next part. In our experiments, we choose HED as the ground truth global contour and Canny as the ground truth local detail for simplicity.

IV-E2 The impact of Stroke Calibration Network

We tested directly applying pix2pix to generate face images from poorly-drawn sketches, but found the training unstable and the quality of the results unsatisfactory. The original sketches are deformed and sometimes lack necessary lines in the area of the mouth or jaw, as shown in Figure 3 and Figure 7. This inspired us to modify strokes and add essential details before image synthesis. Edges such as those produced by the Canny detector can act as ground truth for the training of this process. The refined sketches are more visually favourable and consistent with the original identity, as shown in the third column of Figure 7.

Fig. 7: The impact of Stroke Calibration Network and with or without global contour. (a) input sketch. (b) canny edge. (c) result with only local detail loss. (d) result with both local detail loss and global contour loss.

However, Figure 7 also demonstrates that supervision by Canny alone is not enough. It results in unwanted strokes in the area of the eyebrows, mouth or jaw, and even changes the shape of the eyes. Stroke calibration should take priority over detail enrichment. We want the Stroke Calibration Network to modify strokes without changing holistic properties such as facial contours and hair styles. We therefore add both the contours as a global constraint and the edges as a local constraint. As shown in the fourth column of Figure 7, this calibrates unreasonable strokes and preserves the original properties.

Table II shows the accuracy of our Stroke Calibration Network. We measure precision and recall against Canny for the ablation study of the local detail loss and the global contour loss. For each setting, we binarize the refined sketch and the corresponding Canny edge map with a constant threshold value (i.e., each pixel is either zero or one). Precision measures how many ones in the refined sketch are actual ones in the ground truth Canny, and recall measures how many ones in the ground truth Canny are contained in the refined sketch. The high precision and relatively low recall are in line with expectations. The original purpose of the Stroke Calibration Network is to modify unreasonable strokes and add essential details. The low accuracy when using the local detail loss only is consistent with the results in Figure 7. Since HED and Canny are different, it is not surprising that the accuracy when using the global contour loss only is low.
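The precision and recall computation described above can be sketched in pure Python on small binary maps. The threshold value and function names are illustrative assumptions; the paper only states that a constant threshold is used.

```python
def binarize(img, threshold):
    """Map each pixel to 1 if it exceeds the threshold, else 0."""
    return [[1 if v > threshold else 0 for v in row] for row in img]


def precision_recall(pred, target):
    """Pixel-wise precision and recall of a binary prediction against a binary target."""
    pairs = [(p, t) for rp, rt in zip(pred, target) for p, t in zip(rp, rt)]
    true_pos = sum(1 for p, t in pairs if p == 1 and t == 1)
    pred_pos = sum(p for p, _ in pairs)
    target_pos = sum(t for _, t in pairs)
    precision = true_pos / pred_pos if pred_pos else 0.0
    recall = true_pos / target_pos if target_pos else 0.0
    return precision, recall


refined = binarize([[0.9, 0.1], [0.8, 0.2]], threshold=0.5)
canny = [[1, 1], [1, 1]]

# Every predicted stroke is correct, but only half of the Canny edges are covered.
print(precision_recall(refined, canny))
```

This toy case mirrors the pattern in Table II for the full model: a refined sketch that draws only strokes supported by Canny but omits many of its fine edges has high precision and lower recall.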

Components Precision Recall
detail only 0.1559 0.1475
contour only 0.1713 0.1564
detail and contour 0.9962 0.4772
TABLE II: Quantitative performance of our Stroke Calibration Network trained on only local detail loss (detail only), only global contour loss (contour only), and local detail loss together with global contour loss (detail and contour).
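The precision and recall measurement described for Table II can be sketched as follows. This is a minimal illustration, not the authors' evaluation script: it assumes edge pixels are bright (close to one), uses a hypothetical threshold of 0.5 because the value used in the paper is not stated here, and works on plain Python lists to stay dependency-free.

```python
from typing import List, Tuple

Image = List[List[float]]  # grayscale map as rows of values in [0, 1]


def binarize(image: Image, threshold: float = 0.5) -> List[List[int]]:
    """Map every pixel to 1 (edge) or 0 (background) with a constant threshold."""
    return [[1 if pixel >= threshold else 0 for pixel in row] for row in image]


def precision_recall(refined: Image, canny: Image,
                     threshold: float = 0.5) -> Tuple[float, float]:
    """Pixel-wise precision and recall of a refined sketch against a Canny map."""
    pred = binarize(refined, threshold)
    truth = binarize(canny, threshold)
    # Pixels that are one in both maps.
    true_pos = sum(p & t for pred_row, truth_row in zip(pred, truth)
                   for p, t in zip(pred_row, truth_row))
    pred_pos = sum(sum(row) for row in pred)    # ones in the refined sketch
    truth_pos = sum(sum(row) for row in truth)  # ones in the ground truth
    precision = true_pos / pred_pos if pred_pos else 0.0
    recall = true_pos / truth_pos if truth_pos else 0.0
    return precision, recall


# Toy example: the refined sketch keeps only strokes that exist in the
# Canny map, but does not reproduce all of them.
refined = [[0.9, 0.1, 0.0],
           [0.8, 0.0, 0.0]]
canny = [[1.0, 1.0, 0.0],
         [1.0, 0.0, 1.0]]
precision, recall = precision_recall(refined, canny)
```

In the toy example every stroke of the refined sketch lies on a Canny edge while half of the Canny edges are not reproduced, which mirrors the high-precision, lower-recall pattern discussed above.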

IV-E3 The scalability of Image Synthesis Network

Notice that in Section III-C we did not deliberately design the structure of the Image Synthesis Network (ISN). In fact, we adopt a quite simple ISN structure to show that it is easy to generate satisfactory results from the calibrated sketches even when the sketch2image synthesis network is not deliberately designed. Since there are many methods [10, 11] for the "well-drawn" sketch2image problem, we argue that stroke calibration is indispensable for these methods to be applied well in real applications that require realistic images, such as cultural relics or digital sketch generation for suspects. Synthesizing a high-quality image from poorly-drawn sketches is therefore both a useful application and a new solution. The results in Section IV-D have shown that our proposed Stroke Calibration Network is simple yet effective. The calibrated sketches can be fed directly into other existing "well-drawn" sketch2image methods [10, 11] to produce more diverse and more photo-realistic results. Our two-stage Algorithm 1 provides end-to-end scalability: the ISN can be improved by designing a novel architecture or by combining it with existing "well-drawn" sketch2image methods.

For example, we can improve the Image Synthesis Network (ISN) by simply doubling the number of residual blocks (referred to as Improved-1), or by building our generator on U-Net with the Masked Residual Unit (MRU) module proposed in [14] (referred to as Improved-2). According to [14], MRU is more effective than ResNet, Cascaded Refinement Network (CRN), or DCGAN structures in image synthesis tasks. We compare images generated by different ISN structures on the CUHK dataset using PSNR, SSIM and FID as metrics. The results are shown in Table III.

Structure PSNR SSIM FID
Original 20.25 0.8006 58.43
Improved-1 20.34 0.8092 57.09
Improved-2 21.09 0.8137 55.12
TABLE III: The scalability of Image Synthesis Network. We compare results generated by different structures of the Image Synthesis Network (ISN) on the CUHK dataset using PSNR, SSIM and FID as metrics. For details refer to Section IV-D and Section IV-E3.
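Of the three metrics in Table III, PSNR has the simplest closed form, 10 · log10(MAX² / MSE), and can be sketched in a few lines. The function below is an illustrative stand-in, not the evaluation code used for the table: it assumes grayscale images stored as plain Python lists with a peak value of 255, and the name `psnr` is chosen here for convenience. SSIM and FID need considerably more machinery and are not shown.

```python
import math
from typing import List

Image = List[List[float]]  # grayscale image as rows of intensities


def psnr(generated: Image, reference: Image, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    squared_errors = [
        (g - r) ** 2
        for gen_row, ref_row in zip(generated, reference)
        for g, r in zip(gen_row, ref_row)
    ]
    mse = sum(squared_errors) / len(squared_errors)  # mean squared error
    if mse == 0:
        return math.inf  # identical images have unbounded PSNR
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy example: a constant error of 10% of the peak value on every pixel.
reference = [[100.0, 150.0], [200.0, 250.0]]
generated = [[pixel + 25.5 for pixel in row] for row in reference]
score = psnr(generated, reference)
```

Higher PSNR and SSIM indicate closer agreement with the reference photo, whereas lower FID is better, which is the direction of improvement from Original to Improved-2 in Table III.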

V Conclusion and Discussion

We propose Cali-Sketch, a face image synthesis method based on poorly-drawn sketches. Our method can generate pleasing results even when the input sketches are not plausible. To achieve this, we introduce a two-stage sketch-to-image translation method consisting of two GANs. The Stroke Calibration Network first calibrates unreasonable strokes and adds necessary details. The refined sketches are then fed into the Image Synthesis Network to produce photo-realistic face images. Given poorly-drawn sketches, Cali-Sketch can generate identity-consistent and appearance-realistic face images. Experimental results show the effectiveness and efficiency of the proposed Cali-Sketch, which outperforms state-of-the-art methods.

V-A Extended Application

Our Stroke Calibration Network can act as a pre-processing module for real-world sketches. Interactive face image manipulation methods like [10] require a plausible input sketch; when free-hand sketches are fed directly into those models, the results may be unacceptable. In this case, our Stroke Calibration Network can act as a pre-processing modification module. [10] is a recent facial image editing method in which users draw sketch and color as guidance on an incomplete image erased by a mask. To show the effectiveness and efficiency of our approach in this case, we first use the original poorly-drawn sketch directly as the input sketch for [10] to get an edited image. Then we feed in the refined sketch pre-processed by our Stroke Calibration Network to produce another edited image. As demonstrated in Figure 8, when the input sketch is sparse, contains wrong strokes, and is fed directly into [10], the generated facial features are distorted and deformed. Our Stroke Calibration Network calibrates unreasonable strokes and adds necessary details. When this refined sketch is fed into [10], the result improves significantly.

Fig. 8: Stroke Calibration Network as a pre-processing module for real-world sketches. (a) original image. (b) masked image and input sketch. (c) masked image and modified sketch. (d) editing result of original sketch. (e) editing result of modified sketch.
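The pre-processing role described above amounts to composing two stages: a calibration step followed by any downstream sketch-conditioned model. The sketch below shows only that wiring. The name `make_two_stage` and both toy stages are hypothetical stand-ins so the example runs; a real setup would wrap the trained Stroke Calibration Network and a model such as [10] behind the same callable interface.

```python
from typing import Callable, List

Sketch = List[List[float]]  # stroke map with values in [0, 1]
Photo = List[List[int]]     # 8-bit grayscale output, for illustration only


def make_two_stage(calibrate: Callable[[Sketch], Sketch],
                   synthesize: Callable[[Sketch], Photo]) -> Callable[[Sketch], Photo]:
    """Chain a stroke-calibration stage with any sketch-to-image stage."""
    def pipeline(sketch: Sketch) -> Photo:
        refined = calibrate(sketch)   # stage 1: fix strokes, add details
        return synthesize(refined)    # stage 2: downstream generator or editor
    return pipeline


# Hypothetical stand-ins for trained networks.
def toy_calibrate(sketch: Sketch) -> Sketch:
    """Strengthen faint strokes, clipping at 1.0."""
    return [[min(1.0, pixel * 2) for pixel in row] for row in sketch]


def toy_synthesize(sketch: Sketch) -> Photo:
    """Scale the refined sketch to 8-bit intensities."""
    return [[int(pixel * 255) for pixel in row] for row in sketch]


pipeline = make_two_stage(toy_calibrate, toy_synthesize)
result = pipeline([[0.5, 0.2], [0.0, 1.0]])
```

Because the second stage is just a callable, swapping in a different downstream model does not require touching the calibration stage.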

Our method also shows potential for sketch-based object search [63, 64] and image retrieval [65, 66, 67]. Various works have been proposed to efficiently support automatic annotation of multimedia content and to help content-based retrieval, but a precise image sample that satisfies the user's specification may not always be at hand. In such cases, a sketch can be an alternative way to initialize the search, i.e., sketch-based image retrieval [63, 68]. Our method can complete the object information that is critical for reliable search performance.

V-B Limitation and Future Work

Compared with image inpainting or image-to-sketch synthesis, generating photo-realistic images from poorly-drawn sketches is more challenging since sketches contain less information. Thus, for now we experiment only on frontal faces without large rotation and translation. These dataset limitations provide strong motivation for future work to improve performance by expanding the datasets to faces with various angles or expressions, and further to all classes, e.g., the Google QuickDraw Dataset. In addition, the category of a sketch is also critical for image generation. Sketches are far from complete in terms of object information, so a sketch may be transformed into a totally different object during generation. For example, as illustrated in Figure 9, if a user intends to generate a pyramid image by simply drawing a ‘triangle’, the sketch is not sufficiently discriminative to uniquely resemble a pyramid. Incorporating category information of poorly-drawn sketches is thus critical. We will develop Cali-Sketch into a drawing assistant that creates photographic self-portraits or the user's favorite cartoon characters.

Fig. 9: Sketch samples of ‘triangle’. It is not sufficiently discriminative to uniquely resemble the pyramids by simply drawing a ‘triangle’. Incorporating category information is critical for image generation from poorly-drawn sketches.


  • [1] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 5967–5976.
  • [2] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 2242–2251.
  • [3] Y. Zhou, R. Jiang, X. Wu, J. He, S. Weng, and Q. Peng, “Branchgan: Unsupervised mutual image-to-image transfer with a single encoder and dual decoders,” IEEE Trans. Multimedia, 2019.
  • [4] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, “Stargan: Unified generative adversarial networks for multi-domain image-to-image translation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., June 2018.
  • [5] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro, “High-resolution image synthesis and semantic manipulation with conditional gans,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018.
  • [6] L. Chen, L. Wu, Z. Hu, and M. Wang, “Quality-aware unpaired image-to-image translation,” IEEE Trans. Multimedia, vol. 21, no. 10, pp. 2664–2674, 2019.
  • [7] W. Xu, S. Keshmiri, and G. Wang, “Adversarially approximated autoencoder for image generation and manipulation,” IEEE Trans. Multimedia, vol. 21, no. 9, pp. 2387–2396, 2019.
  • [8] X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz, “Multimodal unsupervised image-to-image translation,” in Proc. Eur. Conf. Comput. Vis., 2018.
  • [9] H.-Y. Lee, H.-Y. Tseng, J.-B. Huang, M. K. Singh, and M.-H. Yang, “Diverse image-to-image translation via disentangled representations,” in Proc. Eur. Conf. Comput. Vis., 2018.
  • [10] Y. Jo and J. Park, “Sc-fegan: Face editing generative adversarial network with user’s sketch and color,” ArXiv, vol. abs/1902.06838, 2019.
  • [11] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang, “Generative image inpainting with contextual attention,” arXiv preprint arXiv:1801.07892, 2018.
  • [12] K. Nazeri, E. Ng, T. Joseph, F. Z. Qureshi, and M. Ebrahimi, “Edgeconnect: Generative image inpainting with adversarial edge learning,” ArXiv, vol. abs/1901.00212, 2019.
  • [13] J. Liu, S. Yang, Y. Fang, and Z. Guo, “Structure-guided image inpainting using homography transformation,” IEEE Trans. Multimedia, vol. 20, no. 12, pp. 3252–3265, Dec 2018.
  • [14] W. Chen and J. Hays, “Sketchygan: Towards diverse and realistic sketch to image synthesis,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., June 2018.
  • [15] Y. Lu, S. Wu, Y.-W. Tai, C.-K. Tang, and T. Youtu, “Sketch-to-image generation using deep contextual completion,” ArXiv, vol. abs/1711.08972, 2017.
  • [16] Q. Chen and V. Koltun, “Photographic image synthesis with cascaded refinement networks,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 1520–1529.
  • [17] X. Qi, Q. Chen, J. Jia, and V. Koltun, “Semi-parametric image synthesis,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 8808–8816.
  • [18] T. Chen, M.-M. Cheng, P. Tan, A. Shamir, and S.-M. Hu, “Sketch2photo: Internet image montage,” ACM Trans. Graph., vol. 28, no. 5, p. 124, 2009.
  • [19] M. Eitz, R. Richter, K. Hildebrand, T. Boubekeur, and M. Alexa, “Photosketcher: interactive sketch-based image synthesis,” Computer Graphics and Applications, vol. 31, no. 6, pp. 56–66, 2011.
  • [20] T. Chen, P. Tan, L.-Q. Ma, M.-M. Cheng, A. Shamir, and S.-M. Hu, “Poseshop: Human image database construction and personalized content synthesis,” IEEE Trans. Visualization and Computer Graphics, vol. 19, no. 5, pp. 824–837, 2013.
  • [21] L. Wang, V. Sindagi, and V. Patel, “High-quality facial photo-sketch synthesis using multi-adversarial networks,” in IEEE Conf. Automatic Face & Gesture Recognition, 2018, pp. 83–90.
  • [22] J. Yu, S. Shi, F. Gao, D. Tao, and Q. Huang, “Composition-aided face photo-sketch synthesis,” ArXiv, vol. abs/1712.00899, 2017.
  • [23] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2672–2680.
  • [24] Z. Murez, S. Kolouri, D. Kriegman, R. Ramamoorthi, and K. Kim, “Image to image translation for domain adaptation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 4500–4509.
  • [25] W. Deng, L. Zheng, Q. Ye, G. Kang, Y. Yang, and J. Jiao, “Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 994–1003.
  • [26] M. Long, Z. Cao, J. Wang, and M. I. Jordan, “Conditional adversarial domain adaptation,” in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 1647–1657.
  • [27] B. Yan, B. Bare, C. Ma, K. Li, and W. Tan, “Deep objective quality assessment driven single image super-resolution,” IEEE Trans. Multimedia, 2019.
  • [28] J. Jiang, C. Chen, J. Ma, Z. Wang, Z. Wang, and R. Hu, “Srlsp: A face image super-resolution algorithm using smooth regression with local structure prior,” IEEE Trans. Multimedia, vol. 19, no. 1, pp. 27–40, 2017.
  • [29] J. Shi and G. Zhao, “Face hallucination via coarse-to-fine recursive kernel regression structure,” IEEE Trans. Multimedia, vol. 21, no. 9, pp. 2223–2236, 2019.
  • [30] J. Xie, R. S. Feris, S. Yu, and M. Sun, “Joint super resolution and denoising from a single depth image,” IEEE Trans. Multimedia, vol. 17, no. 9, pp. 1525–1537, 2015.
  • [31] X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao, and C. C. Loy, “Esrgan: Enhanced super-resolution generative adversarial networks,” in Proc. Eur. Conf. Comput. Vis. Workshops, September 2018.
  • [32] W. Yang, X. Zhang, Y. Tian, W. Wang, and J. Xue, “Deep learning for single image super-resolution: A brief review,” IEEE Trans. Multimedia, 2019.
  • [33] S. Arora, R. Ge, Y. Liang, T. Ma, and Y. Zhang, “Generalization and equilibrium in generative adversarial nets,” in Proc. Conf. Machine Learning, 2017, pp. 224–232.
  • [34] L. Mescheder, S. Nowozin, and A. Geiger, “The numerics of gans,” in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 1825–1835.
  • [35] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adversarial networks,” in Proc. Conf. Machine Learning, 2017, pp. 214–223.
  • [36] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training gans,” in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 2234–2242.
  • [37] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, “Improved training of wasserstein gans,” in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, 2017, pp. 5767–5777.
  • [38] D. Berthelot, T. Schumm, and L. Metz, “Began: Boundary equilibrium generative adversarial networks,” ArXiv, vol. abs/1703.10717, 2017.
  • [39] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. Paul Smolley, “Least squares generative adversarial networks,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 2794–2802.
  • [40] Z. Yi, H. Zhang, P. Tan, and M. Gong, “Dualgan: Unsupervised dual learning for image-to-image translation,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 2849–2857.
  • [41] T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim, “Learning to discover cross-domain relations with generative adversarial networks,” in Proc. Conf. Machine Learning, 2017, pp. 1857–1865.
  • [42] K. B. Kancharagunta and S. R. Dubey, “Csgan: Cyclic-synthesized generative adversarial networks for image-to-image transformation,” ArXiv, vol. abs/1901.03554, 2019.
  • [43] S. Mo, M. Cho, and J. Shin, “Instagan: Instance-aware image-to-image translation,” in Proc. Int. Conf. Learn. Represent., 2019.
  • [44] A. Anoosheh, E. Agustsson, R. Timofte, and L. Van Gool, “Combogan: Unrestrained scalability for image domain translation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2018, pp. 783–790.
  • [45] S. Xie and Z. Tu, “Holistically-nested edge detection,” in Proc. IEEE Int. Conf. Comput. Vis., 2015.
  • [46] J. Canny, “A computational approach to edge detection,” Readings in computer vision, pp. 184–203, 1987.
  • [47] J. Johnson, A. Alahi, and F. F. Li, “Perceptual losses for real-time style transfer and super-resolution,” in Proc. Eur. Conf. Comput. Vis., 2016, pp. 694–711.
  • [48] H. Winnemöller, J. E. Kyprianidis, and S. C. Olsen, “Xdog: An extended difference-of-gaussians compendium including advanced image stylization,” Computers & Graphics, vol. 36, no. 6, pp. 740–753, 2012.
  • [49] “Create filter gallery photocopy effect with single step in photoshop.”
  • [50] E. Simo-Serra, S. Iizuka, K. Sasaki, and H. Ishikawa, “Learning to simplify: fully convolutional networks for rough sketch cleanup,” ACM Trans. Graph., vol. 35, no. 4, pp. 121:1–121:11, 2016.
  • [51] Y. Lu, S. Wu, Y. Tai, and C. Tang, “Image generation from sketch constraint using contextual GAN,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 213–228.
  • [52] M. Li, Z. Lin, R. Měch, E. Yumer, and D. Ramanan, “Photo-sketching: Inferring contour drawings from images,” IEEE Winter Conf. on Applications of Comput. Vis., 2019.
  • [53] X. Wang and X. Tang, “Face photo-sketch synthesis and recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 11, pp. 1955–1967, 2009.
  • [54] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
  • [55] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang, “Free-form image inpainting with gated convolution,” arXiv preprint arXiv:1806.03589, 2018.
  • [56] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, “Spectral normalization for generative adversarial networks,” in 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018.
  • [57] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 6626–6637.
  • [58] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 2818–2826.
  • [59] K. Maninis, J. Pont-Tuset, P. Arbeláez, and L. V. Gool, “Convolutional oriented boundaries,” in Proc. Eur. Conf. Comput. Vis., 2016.
  • [60] K. Maninis, J. Pont-Tuset, P. Arbelaez, and L. V. Gool, “Convolutional oriented boundaries: From image segmentation to high-level tasks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 4, pp. 819–833, 2018.
  • [61] Y. Liu, M.-M. Cheng, X. Hu, K. Wang, and X. Bai, “Richer convolutional features for edge detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 3000–3009.
  • [62] Y. Liu, M.-M. Cheng, X. Hu, J.-W. Bian, L. Zhang, X. Bai, and J. Tang, “Richer convolutional features for edge detection,” IEEE Trans. Pattern Anal. Mach. Intell., 2019.
  • [63] S. D. Bhattacharjee, J. Yuan, Y. Huang, J. Meng, and L. Duan, “Query adaptive multiview object instance search and localization using sketches,” IEEE Trans. Multimedia, vol. 20, no. 10, pp. 2761–2773, 2018.
  • [64] S. He, Z. Zhou, F. Farhat, and J. Z. Wang, “Discovering triangles in portraits for supporting photographic creation,” IEEE Trans. Multimedia, vol. 20, no. 2, pp. 496–508, 2018.
  • [65] A. Grigorova, F. G. B. De Natale, C. Dagli, and T. S. Huang, “Content-based image retrieval by feature adaptation and relevance feedback,” IEEE Trans. Multimedia, vol. 9, no. 6, pp. 1183–1192, 2007.
  • [66] G. Ioannakis, A. Koutsoudis, I. Pratikakis, and C. Chamzas, “Retrieval - an online performance evaluation tool for information retrieval methods,” IEEE Trans. Multimedia, vol. 20, no. 1, pp. 119–127, 2018.
  • [67] J. Choi, H. Cho, J. Song, and S. M. Yoon, “Sketchhelper: Real-time stroke guidance for freehand sketch retrieval,” IEEE Trans. Multimedia, vol. 21, no. 8, pp. 2083–2092, 2019.
  • [68] Y. Zhang, X. Qian, X. Tan, J. Han, and Y. Tang, “Sketch-based image retrieval by salient contour reinforcement,” IEEE Trans. Multimedia, vol. 18, no. 8, pp. 1604–1615, 2016.