How can we visualize a scene or portray an object quickly? One of the easiest ways is to draw a sketch. Compared to photography, drawing a sketch does not require any special devices and is not limited to faithfully sampling reality. However, sketches are often overly simple, so it would be compelling if we could synthesize realistic images from a novice sketch. Sketch-based image synthesis enables non-artists to create realistic images without significant artistic skill or domain expertise in image synthesis. The task is generally hard because sketches are sparse, and novice human artists cannot draw sketches that precisely reflect object boundaries. A real-looking image synthesized from a sketch should respect the intent of the artist as much as possible, but might need to deviate from the coarse strokes in order to stay on the natural image manifold.
In the past 30 years, the most popular sketch-based image synthesis techniques have been driven by image retrieval methods such as PhotoSketcher Eitz11 and Sketch2Photo chen2009sketch2photo. Such approaches often require carefully designed feature representations which are invariant between sketches and photos. They also involve complicated post-processing procedures like graph cut compositing and gradient domain blending in order to make the synthesized images realistic.
The recent emergence of deep convolutional neural networks lecun2015DeepLearning,Krizhevsky14AlexNet,he2016ResNet has provided enticing methods for image synthesis, among which Generative Adversarial Networks (GANs) Goodfellow14 have shown great potential. A GAN frames its training as a zero-sum game between the generator and the discriminator. The goal of the discriminator is to decide whether a given image is fake or real, while the generator tries to generate realistic images so that the discriminator will misclassify them as real. Sketch-based image synthesis can be formulated as an image translation problem conditioned on an input sketch. There exist several methods that use GANs to translate images from one domain to another pix2pix2016,CycleGAN2017. However, none of them is specifically designed for image synthesis from sketches.
In this paper, we propose SketchyGAN, a GAN-based, end-to-end trainable sketch-to-image synthesis approach that can generate realistic objects from 50 classes. The input is a sketch illustrating an object and the output is a realistic image containing that object in a similar pose. This is challenging because (i) paired photos and sketches are difficult to acquire, so there is no massive database to learn from, and (ii) there is no established neural network method for sketch-to-image synthesis for diverse categories. Previous works train models for single or very few categories kim2017discoGAN,sangkloy2016scribbler.
We resolve the first challenge by augmenting the Sketchy database Patsorn16Sketchy, which contains nearly 75,000 actual human sketches paired with photos, with a larger dataset of paired edge maps and photos. This augmentation dataset is obtained by collecting 2,299,144 Flickr images from 50 categories and synthesizing edge maps from them. During training, we adjust the ratio between edge map-image and sketch-image pairs so that the network can transfer its knowledge gradually from edge map-image synthesis to sketch-image synthesis. For the second challenge, we build a GAN-based model, conditioned on an input sketch, with several additional loss terms which improve synthesis quality. We also introduce a new building block called Masked Residual Unit (MRU) which helps generate higher quality images. This block takes an extra image input, and utilizes its internal mask to dynamically decide the information flow of the network. By chaining these blocks we are able to input a pyramid of images with different scales. We show that this structure outperforms naive convolutional approaches and ResNet blocks on our sketch to image synthesis tasks.
Our main contributions are:
We present SketchyGAN, a deep learning approach to sketch-to-image synthesis. Unlike previous approaches, we do not do image retrieval at test time. Our method is capable of generating real-looking objects from 50 diverse categories. Sketch-based image synthesis is very challenging and our results are not generally photorealistic, but we demonstrate a significant increase in quality compared to existing deep generative models.
We demonstrate a data augmentation technique for sketch data that addresses the lack of sufficient human-annotated training data.
We formulate a GAN model with additional objective functions and a new network building block. We show that all of them are critical to our task, and lacking any of them will reduce the quality of our results.
2 Related Work
Sketch-Based Image Retrieval and Synthesis.
There exist numerous works on sketch-based image retrieval eitz2010sketchFeatureDescriptor,eitz2011sketchBagOfFeatures,hu2010gradientFieldSketch,cao2011edgelSketchSearch,cao2010mindfinder,wang2010mindfinder2,hu2011bagofRegionSketch,hu2013gradientHoGSketch,james2014reenactSketchDesign,lin2013_3DSketchRetrieval,wang2015sketch3DRetrieval,Li2016FineGrainedSketchRetrieval. Most methods use bag-of-words representations and edge detection to build features that are (ideally) invariant across both domains. Common shortcomings include the inability to perform fine-grained retrieval and the inability to map from badly drawn sketch edges to photo boundaries. To address these problems, Yu et al. Yu16SketchMe and Sangkloy et al. Patsorn16Sketchy train deep convolutional neural networks (CNNs) to relate sketches and photos, treating sketch-based image retrieval as a search in the learned feature embedding space. They show that using CNNs greatly improves performance and enables fine-grained and instance-level retrieval.
Beyond the task of retrieval, Sketch2Photo chen2009sketch2photo and PhotoSketcher Eitz11 synthesize realistic images by compositing objects and backgrounds retrieved from a given sketch. PoseShop chen2013poseshop composites images of people by letting users input an additional 2D skeleton into the query so that the retrieval will be more precise.
Sketch-Based Datasets. There are only a few datasets of human-drawn sketches and they are generally small due to the huge effort needed to collect drawings. One of the most commonly used sketch datasets is the TU-Berlin dataset eitz2012tu_berlin, which contains 20,000 human sketches spanning 250 categories. Yu et al. Yu16SketchMe introduced a new dataset with paired sketches and images, but there are only two categories: shoes and chairs. There is also the CUHK Face Sketches dataset wang2009CUHK, containing 606 face sketches drawn by artists. The newly published QuickDraw dataset ha2017SketchRNN has an impressive 50 million sketches. However, the sketches are particularly crude because of a 10 second time limit. The sketches lack detail and tend to be iconic or canonical views. The Sketchy database, in contrast, has more detailed drawings in a greater variety of poses. It spans 125 categories with a total of 75,471 sketches of 12,500 objects. Critically, it is the only substantial dataset of paired sketches and images spanning diverse categories, so we choose to use this dataset.
Image-to-Image Translation with GANs.
Generative Adversarial Networks (GANs) have shown great potential in generating natural, realistic images berthelot2017began,gulrajani2017improvedwgan. Instead of directly optimizing a log-likelihood over the whole image, which often leads to blurry and conservative results, GANs use a discriminator to distinguish unrealistic images from real ones, forcing the generator to produce sharper images. The “pix2pix” work of Isola et al. pix2pix2016 demonstrates a straightforward approach to translating one image to another using conditional GANs. Conditional settings are also adopted in other image translation tasks, including sketch coloring sangkloy2016scribbler, style transformation yoo2016pixelLevelTransfer and domain adaptation bousmalis2016UnACGAN. In contrast to using conditional GANs and paired data, Liu et al. liu2017UNIT introduce an unsupervised image translation framework consisting of CoupledGAN liu2016CoGAN and a pair of variational autoencoders Kingma2014VAE. More recently, CycleGAN CycleGAN2017 shows promising results on unsupervised image translation by enforcing cycle-consistency losses.
3 Sketchy Database Augmentation
In this section, we discuss how we augment the Sketchy database with images crawled from Flickr and synthesize edge maps which we hope approximate human sketches. The dataset will be made publicly available. The final augmentation dataset contains 2,299,144 image-edge map pairs spanning 50 categories. Section 3.2 describes why we choose these specific categories instead of all categories in the Sketchy database. Section 3.3 describes how we collect images and how we use existing Convolutional Neural Networks to eliminate unrelated images. Section 3.4 describes our edge map synthesis. Section 3.5 describes how we use the augmented dataset.
3.1 Edges vs Sketches
Figure 1 visualizes the difference between image edges and human-drawn sketches. A sketch is a set of human-drawn strokes mimicking the approximate boundary and internal contours of an object, and an edge map is a machine-generated array of pixels that precisely correspond to photo intensity boundaries. Generating photos from sketches is considerably harder than from edges. First, unlike edge maps, sketches are not precisely aligned to object boundaries, so a generative model needs to learn spatial transformations to correct deformed strokes. Second, edge maps usually contain more information about backgrounds and details, while sketches do not, meaning a model must insert more information itself. Finally, sketches may contain caricatured or iconic features, like the “tiger” stripes on the cat’s face in Figure 1(c), which a model must learn to handle. Despite these considerable differences, we show that edge maps are still a valuable augmentation to the limited Sketchy database. However, it is non-trivial to gradually transition a model from edge-based image synthesis to sketch-based image synthesis.
3.2 Category Choice
Since we use off-the-shelf Convolutional Neural Networks trained on ImageNet ImageNet15 and MS COCO lin2014MSCOCO to eliminate faulty images, we need to find the categories that overlap between Sketchy and these two datasets. We find 18 common categories between the 125 Sketchy categories and the 80 COCO classes, and 38 common categories between Sketchy and ImageNet. Note that although Sketchy claims to choose its categories from ImageNet, it uses the categories for bounding box annotations, which differ from the 1000 classification categories on which most models are trained Krizhevsky14AlexNet,Simonyan14VGG,szegedy2015googleNet. During training, we end up using 50 of the 56 available categories, because the six excluded categories often contain training images that have a human as a main object alongside the class object, which makes training harder. The excluded classes are harp, violin, umbrella, saxophone, racket and trumpet.
3.3 Data Collection
Because we need rich training data for our generative model, a large number of images per category is necessary. ImageNet only has around 1,000 images per class, and photos in COCO often contain crowds of objects. These data are not suitable for training our network because sketches in Sketchy usually contain only one object. Instead, we collect images directly from Flickr through the Flickr API by querying category names as keywords, with the returned images sorted by “relevance”. 100,000 images are gathered for each category. Two different models are used for filtering out unrelated images. An Inception-ResNet-v2 network szegedy2017inceptionResNet trained on ImageNet is used to classify whether an image falls into one of the 38 ImageNet categories. Since there is no classification model available for COCO, we use the Single Shot MultiBox Detector liu2016ssd to detect whether a given image contains an object in the 18 COCO categories. An extra restriction is added: the bounding box of a detected object must cover more than 5% of the whole image, because an image with a large object is less likely to be crowded. After filtering, we obtain a large set of images with an average of 46,265 images per ImageNet category and 61,365 images per COCO category.
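The bounding-box restriction can be illustrated with a minimal Python sketch. It assumes that detections arrive as (class name, box) pairs with boxes given as (x_min, y_min, x_max, y_max) in pixels; the function names and this data layout are hypothetical and are not taken from the actual collection pipeline.

```python
def box_area_fraction(box, image_width, image_height):
    """Fraction of the image covered by a box (x_min, y_min, x_max, y_max) in pixels."""
    x_min, y_min, x_max, y_max = box
    # Clip the box to the image so out-of-frame detections cannot inflate the area.
    x_min, x_max = max(0, x_min), min(image_width, x_max)
    y_min, y_max = max(0, y_min), min(image_height, y_max)
    box_area = max(0, x_max - x_min) * max(0, y_max - y_min)
    return box_area / float(image_width * image_height)


def keep_image(detections, image_width, image_height, wanted_classes, min_fraction=0.05):
    """Keep an image if a detection of a wanted class covers more than min_fraction of it."""
    for class_name, box in detections:
        fraction = box_area_fraction(box, image_width, image_height)
        if class_name in wanted_classes and fraction > min_fraction:
            return True
    return False
```

For example, `keep_image([("cat", (0, 0, 50, 50))], 100, 100, {"cat"})` keeps the image, since the box covers 25% of it, while a 10×10 box in the same image would be rejected.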
3.4 Edge Map Creation
We use edge detection and several post-processing steps to obtain sketch-like edge maps. The pipeline is illustrated in Figure 2. The first step is to detect edges with Holistically-nested edge detection (HED) xie2015HED, as in Isola et al. pix2pix2016. After binarizing the output and thinning all edges zhang1984fastThinning, we clean out isolated pixels and remove small connected components. Next we perform erosion with a threshold on all edges, further decreasing the number of edge fragments. Remaining spurs are then removed. Because edges are very sparse, we calculate an unsigned Euclidean distance field for each edge map to obtain a dense representation (see Figure 2(g)). A similar idea is also employed in recent works on 3D shape recovery Nguyen20163DShapeRepair,han20173DShapeCompletion. We also calculate distance fields for sketches in Sketchy.
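Two of the steps above, removing small connected components and computing the unsigned Euclidean distance field, can be sketched in plain Python. This is an illustrative sketch only: it assumes a binarized edge map stored as a list of rows of 0/1 values, uses 8-connectivity (an assumption), and computes the distance field by brute force, which is only practical for small images. Edge detection, thinning, erosion and spur removal are not shown.

```python
import math


def remove_small_components(edge_map, min_size):
    """Zero out 8-connected groups of edge pixels that have fewer than min_size pixels."""
    height, width = len(edge_map), len(edge_map[0])
    cleaned = [row[:] for row in edge_map]
    seen = [[False] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            if not cleaned[r][c] or seen[r][c]:
                continue
            # Flood fill to collect one connected component.
            stack, component = [(r, c)], []
            seen[r][c] = True
            while stack:
                y, x = stack.pop()
                component.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        inside = 0 <= ny < height and 0 <= nx < width
                        if inside and cleaned[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
            if len(component) < min_size:
                for y, x in component:
                    cleaned[y][x] = 0
    return cleaned


def distance_field(edge_map):
    """Unsigned Euclidean distance from every pixel to its nearest edge pixel."""
    edges = [(r, c) for r, row in enumerate(edge_map) for c, v in enumerate(row) if v]
    if not edges:
        raise ValueError("edge map contains no edge pixels")
    return [[min(math.hypot(r - er, c - ec) for er, ec in edges)
             for c in range(len(edge_map[0]))]
            for r in range(len(edge_map))]
```

Edge pixels map to a distance of zero and the value grows smoothly away from strokes, which is what makes the representation dense.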
3.5 Training Adaptation from Edges to Sketches
Because our final goal is a network that generates images from sketches, it is necessary to train the network on both edge maps and sketches. To simplify the training process, we use a strategy that gradually shifts the inputs from edge maps to sketches: at the beginning of training, the training data are mostly pairs of images and edge maps. During training, we slowly increase the proportion of sketch-image pairs. Let $i_{max}$ be the maximum number of training iterations and $i_{cur}$ be the current iteration. The proportions of sketches and edge maps at the current iteration are then given by
$$P_{sk} = 0.1 + \min\left(0.8, \left(\frac{i_{cur}}{i_{max}}\right)^{\lambda}\right) \tag{1}$$
$$P_{edge} = 1 - P_{sk} \tag{2}$$
respectively, where $\lambda$ is an adjustable hyperparameter indicating how fast the portion of sketches grows. We use a fixed value of $\lambda$ in our experiments. It is easy to see that $P_{sk}$ grows slowly from 0.1 to 0.9. Using this training schedule, we eliminate the need for separate pre-training on edge maps, so the whole training process is completed in a single run.
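The schedule can be sketched in Python as follows. The sketch assumes a schedule of the form $0.1 + \min(0.8, (i_{cur}/i_{max})^{\lambda})$, which is consistent with the stated growth from 0.1 to 0.9; the function names and the default `lam=1.0` are illustrative assumptions and are not values taken from the text.

```python
import random


def sketch_proportion(i_cur, i_max, lam=1.0):
    """Assumed schedule: fraction of sketch-image pairs at iteration i_cur."""
    # lam controls how quickly sketches take over; 1.0 is only an illustrative default.
    return 0.1 + min(0.8, (i_cur / i_max) ** lam)


def sample_source(i_cur, i_max, lam=1.0, rng=random):
    """Pick the data source for one training example according to the schedule."""
    if rng.random() < sketch_proportion(i_cur, i_max, lam):
        return "sketch"
    return "edge"
```

Under this assumed form, a larger `lam` keeps the network on edge maps for longer before sketches start to dominate.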
4 SketchyGAN
In this section we present a Generative Adversarial Network framework that transforms input sketches into realistic images. Our GAN learns a mapping from an input sketch $x$ to an output image $y$, so that $G: x \rightarrow y$. There are two parts in the GAN, the generator $G$ and the discriminator $D$. Section 4.1 introduces the new Masked Residual Unit, Section 4.2 illustrates the network structure, and Section 4.3 discusses the objective functions.
4.1 Masked Residual Unit (MRU)
The DCGAN radford15 structure has been popular in GANs because the vanilla GAN suffers from unstable training. Since the introduction of Wasserstein GAN arjovsky2017wasserstein, more complicated structures such as ResNet blocks have been used in GANs.
ResNet has gained great success in feed-forward networks by employing shortcuts through multiple non-linear layers to retain more information and help back-propagation. In a generative model, however, we argue that the network can also gain information by receiving the input image repetitively. With internal masks, the network can selectively extract new features it fails to keep in previous layers from the input images.
Figure 3 shows the structure of the Masked Residual Unit (MRU). A qualitative comparison to DCGAN and ResNet in our generative task can be found in Section 5.3. An MRU block takes two inputs, input feature maps $x_i$ and an image $I$, and outputs feature maps $y_i$. For convenience we only discuss the case in which the inputs and outputs all have the same spatial dimension. Let $[\cdot, \cdot]$ denote concatenation, $Conv(x)$ denote convolution on $x$, and $f(x)$ be an activation function. We first want to merge the information in the input image $I$ into the input feature maps $x_i$. A naive approach is to concatenate them along the feature depth dimension and apply a convolution:
$$z_i = f(Conv([x_i, I])) \tag{3}$$
However, it is better if the block can decide how much information it wants to preserve upon receiving the new image. So instead we use the following approach:
$$m_i = \sigma(Conv([x_i, I])) \tag{4}$$
$$z_i = f(Conv([m_i \odot x_i, I])) \tag{5}$$
where $\sigma$ is the sigmoid function, $\odot$ is element-wise multiplication, and $m_i$ is a mask over the input feature maps. Multiple convolutional layers can be stacked here to increase performance. We then want to dynamically combine the information from the newly convolved feature maps and the original input feature maps, so we use another mask
$$n_i = \sigma(Conv([x_i, I])) \tag{6}$$
to combine the input feature maps with the new feature maps to get the final output:
$$y_i = (1 - n_i) \odot z_i + n_i \odot x_i \tag{7}$$
The second term in Equation 7 serves as a residual connection. Because there are internal masks to determine information flow, we call this structure the Masked Residual Unit. We can stack multiple units and repeatedly input the same image at different scales, so that the network can retrieve information from the input image dynamically along its path, which greatly improves its performance.
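The gating can be made concrete with a minimal pure-Python sketch of one MRU step at a single spatial position. It assumes the two-mask form $y = (1 - n) \odot z + n \odot x$ with sigmoid masks, replaces every convolution with a 1×1 convolution (a per-pixel linear map), and uses ReLU as the activation $f$; these simplifications, and the parameter names `W_m`, `W_z`, `W_n`, are illustrative assumptions rather than the actual network.

```python
import math


def linear(weights, bias, vector):
    """A 1x1 convolution at one pixel: matrix-vector product plus bias."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weights, bias)]


def sigmoid(values):
    return [1.0 / (1.0 + math.exp(-v)) for v in values]


def relu(values):
    return [max(0.0, v) for v in values]


def mru_pixel(x, image, params):
    """One Masked Residual Unit step for feature vector x and image pixel `image`."""
    concat = x + image  # [x, I]: concatenation along the feature dimension
    m = sigmoid(linear(params["W_m"], params["b_m"], concat))  # mask over input features
    masked = [mi * xi for mi, xi in zip(m, x)]
    z = relu(linear(params["W_z"], params["b_z"], masked + image))  # new features
    n = sigmoid(linear(params["W_n"], params["b_n"], concat))  # output mask
    # Blend new features with the input; the n * x term is the residual path.
    return [(1.0 - ni) * zi + ni * xi for ni, zi, xi in zip(n, z, x)]
```

When the output mask saturates at 1 the unit passes its input through unchanged, and when it saturates at 0 the unit outputs only the newly computed features, which shows how the masks control information flow.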
One thing to note is that the formulations are similar to those of the Gated Recurrent Unit (GRU) cho2014GRU. However, we are driven by very different motivations and there are several crucial differences: 1) We are motivated by repeatedly inputting the same image to improve the information flow, whereas the GRU is designed to address vanishing gradients in recurrent neural networks. 2) GRU cells are recurrent, so part of the output is fed back into the same cell, while MRU blocks are cascaded, so the outputs of a previous block are fed into the next block. 3) The GRU shares weights across steps, so it can only receive fixed-length inputs. No two MRU blocks share weights, so we can shrink or expand the size of the output feature maps as with normal convolutional layers.
4.2 Network Structure
Our complete network structure is shown in Figure 4. The generator uses an encoder-decoder structure. Both the encoder and the decoder are built with MRU blocks, where the sketches are resized and fed into every MRU block on the path. In our best results in Figure 8, we also apply skip-connections between encoder and decoder blocks, so the output feature maps from encoder blocks are concatenated to the outputs of the corresponding decoder blocks. The discriminator is also built with MRU blocks but shrinks in spatial dimension. At the end of the discriminator, we output two logits, one for the GAN loss and one for the classification loss.
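Feeding a resized sketch to every block amounts to building an image pyramid. The following minimal Python sketch assumes the resizing is done by 2×2 average pooling on images whose sides are divisible by two; the text does not specify the resizing method, so this choice and the function names are illustrative only.

```python
def downsample_2x(image):
    """Halve height and width by averaging each 2x2 block (image is a list of rows)."""
    return [[(image[r][c] + image[r][c + 1] + image[r + 1][c] + image[r + 1][c + 1]) / 4.0
             for c in range(0, len(image[0]) - 1, 2)]
            for r in range(0, len(image) - 1, 2)]


def build_pyramid(image, levels):
    """Return [full resolution, 1/2, 1/4, ...] with `levels` entries."""
    pyramid = [image]
    for _ in range(levels - 1):
        pyramid.append(downsample_2x(pyramid[-1]))
    return pyramid
```

Each encoder or decoder block would then receive the pyramid level whose resolution matches its feature maps.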
4.3 Objective Function
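Let $x$ and $y$ each be either an image or a sketch, $z$ be a noise vector, and $c$ be a class label. Our GAN objective function can be expressed as
$$L_{GAN}(D, G) = E_{y \sim P_{image}}[\log D(y)] + E_{x \sim P_{sketch}, z \sim P_z}[\log(1 - D(G(x, z)))] \tag{8}$$
and the objective of the generator is to minimize the second term.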
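As a minimal illustration, the two terms of a standard GAN objective can be estimated from a batch of discriminator outputs. The sketch below assumes the original GAN formulation of Goodfellow et al., with `d_real` and `d_fake` holding the probabilities the discriminator assigns to real and generated images; the function name is hypothetical.

```python
import math


def gan_objective(d_real, d_fake):
    """Batch estimates of E[log D(y)] and E[log(1 - D(G(x, z)))]."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term, fake_term
```

The discriminator benefits when both terms are large, while the generator lowers the second term by pushing `d_fake` towards 1.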
It has been shown that giving the model side information improves the quality of generated images odena2016ACGAN, so we use conditional instance normalization Dumoulin17ConditionalNorm in the generator and pass in the labels of the input sketches. In addition, we let the discriminator predict class labels for the images it sees. The auxiliary classification loss of the discriminator maximizes the log-likelihood of the ground-truth labels under the predicted label distribution:
$$L_{ac}(D) = E[\log P(C = c \mid y)] \tag{9}$$
and the generator maximizes the same log-likelihood, $L_{ac}(G) = E[\log P(C = c \mid G(x, z))]$, with the discriminator fixed.
Since we have paired image data, we are able to provide direct supervision to the network with the L1 distance between generated images and ground truth images:
$$L_{sup}(G) = \| G(x, z) - y \|_1 \tag{10}$$
However, directly minimizing the L1 loss between the generated image and the ground truth image discourages diversity, so we add a perceptual loss to encourage the network to generate diverse images Chen2017CRN. We use four intermediate layers from an Inception-V4 szegedy2017inceptionResNet to calculate the perceptual loss. Let $\phi_i$ be the filter response of layer $i$ in the Inception model and $\lambda_p$ be a weighting coefficient. Then
$$L_p(G) = \sum_i \lambda_p \| \phi_i(G(x, z)) - \phi_i(y) \|_1 \tag{11}$$
gives us the perceptual loss on the generator.
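The supervised and perceptual terms can be sketched in Python as follows. Images and feature responses are represented as flat lists of numbers, and `feature_layers` is a list of hypothetical callables standing in for the intermediate Inception layers; the single shared `weight` is an assumption made for illustration.

```python
def l1_distance(a, b):
    """L1 distance between two equally sized flat lists of values."""
    return sum(abs(p - q) for p, q in zip(a, b))


def perceptual_loss(generated, target, feature_layers, weight=1.0):
    """Weighted sum of L1 distances between feature responses of the two images."""
    return sum(weight * l1_distance(layer(generated), layer(target))
               for layer in feature_layers)
```

With `feature_layers=[lambda v: v]` the perceptual term reduces to the plain pixel-wise L1 distance, which makes the relationship between the two losses explicit.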
To further encourage diversity, we concatenate Gaussian noise to the feature maps at the bottleneck of the generator. Previous works conclude that conditional GANs tend to ignore the noise completely pix2pix2016 or produce worse results because of the noise pathak2016contextEncoder. We find that a simple diversity loss
$$L_{div}(G) = -\lambda_{div} \| G(x, z_1) - G(x, z_2) \|_1 \tag{12}$$
improves both the quality and the diversity of generated images, where $\lambda_{div}$ is a weighting coefficient. The interpretation is straightforward: given a pair of different noise vectors $z_1$ and $z_2$ conditioned on the same image, the generator should output a pair of slightly different images.
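A minimal Python sketch of such a diversity term is given below. It assumes the loss is the negated, weighted L1 distance between two outputs generated from the same sketch with different noise vectors; `generator` is a hypothetical callable taking a sketch and a noise vector, and outputs are flat lists of values.

```python
def diversity_loss(generator, sketch, z1, z2, weight=1.0):
    """Negative L1 distance between two outputs for the same sketch and different noise."""
    out1 = generator(sketch, z1)
    out2 = generator(sketch, z2)
    # Minimizing this value pushes the two outputs apart.
    return -weight * sum(abs(a - b) for a, b in zip(out1, out2))
```

A generator that ignores its noise input receives a loss of zero, the worst possible value under this term, so minimizing it discourages collapsing to a single output per sketch.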
Our complete discriminator and generator losses are thus
$$L(D) = L_{GAN}(D, G) + L_{ac}(D) \tag{13}$$
$$L(G) = L_{GAN}(G) - L_{ac}(G) + L_{sup}(G) + L_p(G) + L_{div}(G) \tag{14}$$
where the discriminator maximizes Equation 13 and the generator minimizes Equation 14. In practice, we use the DRAGAN loss kodali2017dragan in order to stabilize training and use the focal loss Lin2017Focal as the classification loss.
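For reference, the focal loss for a single example can be written in a few lines of Python. The sketch follows the standard definition $-(1 - p_t)^{\gamma} \log p_t$ without the optional class-balancing weight; the default `gamma=2.0` is the value commonly used with this loss and is an assumption here, since the text does not state the value used.

```python
import math


def focal_loss(probs, label, gamma=2.0):
    """Focal loss for one example given predicted class probabilities."""
    p_t = probs[label]  # probability assigned to the ground-truth class
    # The (1 - p_t)^gamma factor down-weights examples that are already easy.
    return -((1.0 - p_t) ** gamma) * math.log(p_t)
```

Setting `gamma=0` recovers the ordinary cross-entropy loss, so the focal loss can be seen as cross-entropy that concentrates on hard examples.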
Table 1: Comparison of Inception Scores.
|Model||Inception Score|
|pix2pix, Sketchy only||3.94|
5 Experiments
5.1 Experiment Settings
We use the sketch-image pairs of the 50 categories from the training split of Sketchy as basic training data, and augment them with edge map-image pairs.
In the following sections, we call the training data from Sketchy “Sketchy”, and Sketchy augmented with edge maps “Augmented Sketchy”. Since we are interested in sketch-to-image synthesis, all the models are tested on the test split of Sketchy. All images are resized to 64×64 regardless of the original aspect ratio. Both the sketches and the edge maps are converted into distance fields.
Implementation Details In all our experiments, we use a batch size of 8, except for Figure 8, which uses a batch size of 32. We augment our data by random horizontal flipping during training. We use the Adam optimizer kingma2014adam in our experiments, and set the initial learning rate of the generator to 0.0001 and that of the discriminator to 0.0002 heusel2017TTUR.
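With paired data, a horizontal flip has to be applied to the input and the target together, otherwise the pair no longer corresponds. A minimal Python sketch, assuming images stored as lists of rows and a flip probability of 0.5 (the probability is an assumption, as is the function name):

```python
import random


def random_horizontal_flip(sketch, photo, p=0.5, rng=random):
    """Flip both images of a pair left-to-right with probability p."""
    if rng.random() < p:
        # Reverse every row of both images so the pair stays aligned.
        return [row[::-1] for row in sketch], [row[::-1] for row in photo]
    return sketch, photo
```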
Evaluation Metrics For our task of image synthesis, we mainly use Inception Score salimans2016improvedGAN to measure the quality of synthesized images. The intuition behind Inception Score is that a good synthesized image should have easily recognizable objects by an off-the-shelf recognition system. Besides reporting Inception Scores, because the ultimate goal is to make the generated images plausible to humans, we also visually inspect the results and report our observations and show qualitative examples.
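The Inception Score is the exponential of the average KL divergence between the per-image class distribution and the marginal class distribution over all generated images. A minimal Python sketch of the core computation, assuming the class probabilities have already been obtained from a pretrained classifier (the function name is hypothetical, and the usual practice of averaging over several splits is omitted):

```python
import math


def inception_score(class_probs):
    """exp of the mean KL(p(y|x) || p(y)) over a list of per-image class distributions."""
    n, k = len(class_probs), len(class_probs[0])
    # Marginal class distribution p(y), averaged over all images.
    marginal = [sum(p[j] for p in class_probs) / n for j in range(k)]
    total_kl = 0.0
    for p in class_probs:
        # Terms with zero probability contribute nothing to the KL divergence.
        total_kl += sum(pj * math.log(pj / marginal[j]) for j, pj in enumerate(p) if pj > 0)
    return math.exp(total_kl / n)
```

The score is 1 when every image receives the same prediction and approaches the number of classes when predictions are both confident and evenly spread over the classes.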
|Model||Num of parameters||Inception Score|
5.2 Comparison to Baselines
We based our experiments on the popular pix2pix and its variations. All models are trained for 300k iterations.
We include three baselines:
pix2pix on Sketchy This is the simplest model. We directly take the authors’ pix2pix code and train it on the 50 categories from Sketchy. Since we find the image quality stops improving after 100k iterations, we stop early at 150k iterations and report the results.
pix2pix on Augmented Sketchy In this model, we train pix2pix on both the edge map-image and sketch-image pairs, as we do in our method. The network structure and loss functions remain the same as in the previous model.
Label-Supervised pix2pix on Augmented Sketchy In this model, we modify the pix2pix model to pass class labels into the generator using conditional instance normalization, and also add an auxiliary classification loss to its discriminator. This is a much stronger baseline than the previous two, since the label information helps the network decide the object type and in turn improves the quality of the generated images gulrajani2017improvedwgan,odena2016ACGAN.
The comparison of Inception Scores can be found in Table 1, and visual results can be found in Figure 5. Our observations are as follows: (i) pix2pix trained on Sketchy fails completely, generating meaningless color patches. This means the model is unable to find the translation function from sketches to images. Since pix2pix has been successful with edge map-image translations, this implies that sketch-to-image synthesis is not trivial. (ii) pix2pix trained on Augmented Sketchy performs slightly better, as it starts to capture the general shape of the object. This shows that augmenting with edge maps indeed helps the training. However, more data alone does not solve the problem. The trained model still fills in wrong colors and generates meaningless backgrounds. (iii) The label-supervised pix2pix on Augmented Sketchy is better than the previous two, as it correctly colors the object more often and also starts to generate some meaningful backgrounds. The results are still blurry sometimes, and a large number of artifacts can be observed. (iv) Compared to the baselines, our method generates sharper images, gets the object color correct, and puts more detailed textures on the object. The backgrounds are more meaningful and the whole images are more colorful.
5.3 Component Analysis
Here we analyze how much each part of our model contributes. We decouple our objective function and analyze the effect of removing each part of it. All models are trained on Augmented Sketchy with the same set of parameters. A detailed comparison can be found in Table 3. We first remove the GAN loss and the discriminator. The result is surprisingly poor, as the images are extremely vague. This observation is consistent with that of Isola et al. pix2pix2016.
Next we remove the auxiliary loss and substitute conditional instance normalization with batch normalization ioffe2015batchnorm. This leads to a significant decrease in image quality such as wrong colors and misplaced textures. This indicates that class information helps a lot, which makes sense because we are generating 50 categories from a single model.
We then remove the L1 loss and the perceptual loss. We find they also have a large influence on image quality. From sample images we can see that the model produces wrong colors and fails to find object boundaries. Finally, we remove the diversity loss, and find that doing so also decreases image quality slightly. This may be related to how we apply the diversity loss, which forces the generator to generate image pairs that are realistic and similar but not identical. This encourages the generator to generalize, because it needs to find a solution that, given different noise vectors, only makes changes in unimportant areas, and those changes still need to look real.
Comparison between MRU, ResNet and DCGAN To demonstrate the effectiveness of our MRU blocks, we compare the performance of the MRU, ResNet and DCGAN structures in our image synthesis task. We train two additional models. The first uses improved ResNet blocks he2016improvedResNet, which perform better than all variants of the original ResNet blocks he2016ResNet, in both the generator and the discriminator. The second is a weak baseline using a DCGAN-like structure. All models are trained on Augmented Sketchy using all of our objective functions under the same hyperparameters, and we keep the number of parameters in the MRU model and in the ResNet model roughly the same by reducing the feature depth in MRU. The number of parameters and the corresponding scores can be found in Table 2. Judging from both visual quality and the Inception Scores, the MRU model generates better images than the ResNet model. We notice that the MRU model tends to produce higher quality foreground objects, which can be observed in Figure 7. This may be due to the internal masks of MRU serving as a kind of attention mechanism, causing the network to selectively focus on the main object. In our task this is helpful, since we are interested in generating only a specific object from a sketch.
6 Conclusion
In this work, we presented a novel approach to the sketch-to-image synthesis problem. The problem is challenging given the nature of sketches, and this work showed that deep generative models are promising. We presented a data augmentation technique for paired sketch and image data to encourage research in this direction. The demonstrated GAN framework can synthesize more realistic images from sketches than popular generative models, and the generated images are diverse. Currently, the main focus of GAN research is on finding better probability metrics as objective functions, but there have been very few works searching for better network structures in GANs. We proposed a new network structure for our generative task, and we showed that it performs much better than existing structures.
Supplementary Material Outline
1 Category list
Here are the 50 categories we use for training and testing our models: airplane, ant, apple, banana, bear, bee, bell, bench, bicycle, candle, cannon, car, castle, cat, chair, church, couch, cow, cup, dog, elephant, geyser, giraffe, hammer, hedgehog, horse, hotdog, hourglass, jellyfish, knife, lion, motorcycle, mushroom, pig, pineapple, pizza, pretzel, rifle, scissors, scorpion, sheep, snail, spoon, starfish, strawberry, tank, teapot, tiger, volcano, zebra.
2 Evaluation of MRU on CIFAR-10
We introduce the Masked Residual Unit (MRU) to improve generative deep networks by giving repeated access to the conditioning signal (in our case, a sketch). But this network building block is also quite useful for classification tasks. We compare the performance of the MRU and other recent architectures on CIFAR-10 and show that the MRU performance is on par with ResNet. Accuracy numbers for other models are obtained from their corresponding papers. For convenience, we call the improved ResNet “ResNet-v2” in the table. In “MRU-108, LeakyReLU gate”, we substitute the sigmoid activations in our MRU units with LeakyReLU Maas13LReLU_supp, and normalize the obtained masks to the range of $[0, 1]$.
|MRU-108, LeakyReLU gate||5.83|
3 Samples from all 50 categories
Here we present samples from all 50 categories from pix2pix variants and our methods for comparison. Each category contains three input samples, among which the third sample is a failure case for our method. The six columns in each figure are: (Input) input sketch, (a) pix2pix on Sketchy, (b) pix2pix on Augmented Sketchy, (c) Label-supervised pix2pix on Augmented Sketchy, (d) our method, (GT) ground truth image.