Teaching GANs to Sketch in Vector Format

04/07/2019 ∙ by Varshaneya V, et al. ∙ Sri Sathya Sai Institute of Higher Learning ∙ Indian Institute of Technology Hyderabad

Sketching is more fundamental to human cognition than speech. Deep Neural Networks (DNNs) have achieved the state of the art in speech-related tasks but have made little progress in generating stroke-based sketches, a.k.a. sketches in vector format. Though there are Variational Auto Encoders (VAEs) for generating sketches in vector format, there is no Generative Adversarial Network (GAN) architecture for the same. In this paper, we propose a standalone GAN architecture, SkeGAN, and a VAE-GAN architecture, VASkeGAN, for sketch generation in vector format. SkeGAN is a stochastic policy in Reinforcement Learning (RL), capable of generating both multidimensional continuous and discrete outputs. VASkeGAN hybridizes a VAE and a GAN in order to couple the efficient data representation of a VAE with the powerful generating capabilities of a GAN, to produce visually appealing sketches. We also propose a new metric called the Ske-score, which quantifies the quality of vector sketches. We have validated that SkeGAN and VASkeGAN generate visually appealing sketches by using a Human Turing Test and the Ske-score.




1 Introduction

Some of the most popular generative models in Deep Learning (DL) are VAEs [1] and GANs [2]. VAEs tend to maximize the likelihood of the generated data coming from the actual distribution while assuming a Gaussian prior. Different VAE architectures have performed well in generating various types of images, ranging from handwritten digits [1, 3] to faces [1], house numbers [3] and CIFAR images [3]. On the other hand, in a GAN, the generator and discriminator play a minmax game until they reach an equilibrium. At this point, the distribution of the generator is close to that of the original data. GANs have been used for simple image generation [4] as well as for more sophisticated tasks such as style transfer [5, 6], super-resolution [7], image-to-image translation [8, 9] and image in-painting [10]. All of the aforesaid works are limited to raster images.

Vector graphics were introduced in computer displays in the 1960s. Since then vector graphics have been studied very intensely. These images do not degrade when transformations are applied and they require minimal amount of space to be stored and transferred. Most importantly, they can be rescaled infinitely. They are represented as curves and strokes.

Sketching was a means of communication long before languages were developed. Hence, sketching is a fundamental skill of human cognition. Today, DNNs achieve state-of-the-art performance in language-related tasks, but there are only a handful of works that discuss sketch generation in vector format, let alone vector image generation in the wild. As of today, the works [11, 12, 13, 14], which are based on VAEs, generate sketches in vector format. The scarcity of such works suggests that sketch generation in vector format is a challenging problem. In addition, there are no GAN architectures for sketch generation in vector format. So, we focus on generating sketches in vector format.

We consider a sketch to be a collection of strokes, wherein each stroke consists of 2-D continuous offsets and 3-D discrete pen-states. This representation is also known as the stroke-based format. The discrete outputs make it difficult to pass gradient updates from the discriminator to the generator. [15] proposed a novel policy-gradient-based loss for generating 1-D discrete tokens. In our case, however, we have a combination of 2-D continuous variates and 3-D discrete variates, which makes adapting the policy gradient loss far from straightforward. Currently there are no metrics that quantify the ‘goodness’ of vector sketches. Hence, we propose the ‘Ske-score’, which quantifies the goodness of vector sketches. Our contributions are as follows:

  • The first GAN-based architecture SkeGAN, for sketch generation in vector format. To this end we propose a novel coupling mechanism which models the influence of offsets on pen-states while sketching.

  • An alternative GAN architecture VASkeGAN based on VAE-GAN architecture [16] for comparison with SkeGAN.

  • A new metric known as the Ske-score, which quantifies the goodness of generated vector sketches.

2 Related Work

There are very few approaches to sketch generation that use stochastic techniques such as Hidden Markov Models (HMMs) [17], and others that use pure image processing techniques such as [18]. There is a considerable body of work on human-drawn sketches using DL, covering recognition [19, 20, 21], eye-fixation or saliency [22], guessing a sketch being drawn [23] and parsing [24]. Specifically, [25] uses GANs to generate sketches of human faces given digital portraits of their faces. There are also works such as [26, 27] which discuss approaches to convert rasterized sketches into realistic images. One commonality amongst all of them is that they work with sketches in the raster format.

The first attempt at generating vector images is by [12], which generates Kanji characters (logographic characters of Chinese origin) using a two-layered LSTM, where each Kanji character is represented in the stroke-based format. Following [12], D. Ha et al. propose a VAE model called “sketch-rnn” for vector sketch generation in [11], which is trained on the “QuickDraw” dataset [28]. Here too, the sketches are represented in the stroke-based format. This paper shows very good performance in unconditional generation, conditional reconstruction, latent space representation and predicting the ending of incomplete sketches for a variety of classes of objects. [11] produced visually appealing sketches when trained on a single category of sketches, but not when a mix of categories was used for training. To overcome this difficulty, [13] replaces the encoder of [11] with a Convolutional Neural Network (CNN) and removes the KL-Divergence loss. This model too produces sketches in the stroke-based format. Since the convolution is spatial, the input to this model is the rasterized format of sketches from the QuickDraw dataset. Based on the Turing Test, the authors of [13] conclude that the models with CNN encoders outperformed those with RNN encoders in generating human-style sketches. K. Zhong in [14] extends the VAE proposed in [11] to create an end-to-end pipeline which takes in fonts in Scalable Vector Graphics (SVG) format to learn and generate novel fonts. The results are demonstrated on the Google Fonts Dataset.

All the architectures mentioned for sketch generation in vector format are VAEs. A well-known disadvantage of VAEs is that they tend to produce blurred images in the case of raster images. Since there is no concept of blurring in vector images, those produced by VAEs like sketch-rnn [11] tend to suffer from a mode-collapse-like situation wherein the pen is not lifted to draw at another location, but stays on the paper and continues to scribble. We call this the scribble effect. Figure 1 shows this effect in the sketches of “yoga poses” and “mosquitos”. Since VAEs assume the prior to be Gaussian, they need to be trained for a very large number of iterations so that the weights of the decoder get adjusted to generate samples close to the distribution of the data. In the case of [11], the training is done for 10 million iterations. In contrast, GANs have performed outstandingly well with raster images for the variety of tasks mentioned in Section 1. In order to alleviate these disadvantages of VAEs and harness the power of GANs, we propose a standalone GAN called SkeGAN and another GAN called VASkeGAN, with which we compare SkeGAN.

Figure 1: Scribble effect of [11] with yoga pose sketches (left) and mosquito sketches (right).

3 Our Contributions

3.1 Ske-score: Evaluation Metric

Since sketches suffer from the scribble effect, we propose a novel metric known as the ‘Ske-score’, which quantifies the goodness of a sketch. The Ske-score of a sketch is defined as the ratio of the number of times the pen is lifted to the number of times it touches the paper while the sketch is being drawn. The Ske-score of a model is defined to be the average of the individual Ske-scores of the sketches generated by it. The Ske-score of the dataset is the average of the Ske-scores of all of the sketches in it. Intuitively, this metric quantifies the fraction of time when the pen is lifted from the paper. A high Ske-score indicates that the pen is lifted more often. A model is said to generate ‘good sketches’ without scribble effect iff .
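The definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a sketch is taken to be a list of 5-tuples (dx, dy, p1, p2, p3) as in the stroke-based format, "pen lifted" is counted as the number of tuples with p2 = 1, and "pen touches the paper" as the number of tuples with p1 = 1. The function names `ske_score` and `mean_ske_score` are hypothetical.

```python
def ske_score(sketch):
    """Ratio of pen-lifted tuples to pen-on-paper tuples for one sketch.

    Assumption: `sketch` is a list of 5-tuples (dx, dy, p1, p2, p3), where
    p1 = 1 means the pen is on the paper and p2 = 1 means it is lifted.
    """
    lifted = sum(1 for (_, _, p1, p2, p3) in sketch if p2 == 1)
    touching = sum(1 for (_, _, p1, p2, p3) in sketch if p1 == 1)
    if touching == 0:
        raise ValueError("sketch never touches the paper")
    return lifted / touching


def mean_ske_score(sketches):
    """Average of per-sketch Ske-scores, as used for a model or a dataset."""
    scores = [ske_score(s) for s in sketches]
    return sum(scores) / len(scores)


# Four pen-down tuples and one pen-lift give a Ske-score of 1/4.
example = [
    (1, 0, 1, 0, 0),
    (1, 0, 1, 0, 0),
    (0, 1, 1, 0, 0),
    (0, 1, 0, 1, 0),
    (2, 2, 1, 0, 0),
    (0, 0, 0, 0, 1),
]
print(ske_score(example))
```

A sketch exhibiting the scribble effect rarely lifts the pen, so its score under this counting convention stays close to zero.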

3.2 Problem Setup

Sketches are considered to be a collection of 5-tuples $(\Delta x, \Delta y, p_1, p_2, p_3)$, where $\Delta x$ and $\Delta y$ are the offsets to be moved along the $x$ and $y$ axes respectively and $(p_1, p_2, p_3)$ is the pen-state. $p_1 = 1$ indicates that the pen is on the paper, $p_2 = 1$ indicates that it is lifted and $p_3 = 1$ indicates that the drawing has ended. The pen-state is modeled as a categorical random variable. All drawings are assumed to start from the origin. This is done by prepending the sketch with the start-of-sequence symbol $S_0$, which is $(0, 0, 1, 0, 0)$. The offsets are modeled as a Gaussian Mixture Model (GMM) in the case of SkeGAN and as IID normal variables in the case of VASkeGAN. In both models, we incorporate the temperature parameter $\tau$ as defined in [11] to control the randomness or variety of the generated samples. Sketch generation proceeds tuple by tuple until $p_3 = 1$ or until the maximum length $N_{max}$ is reached. A sketch is generated stroke by stroke, wherein the stroke at time-step $t$ depends on all of the strokes at previous time-steps. In order to model this dependency, the generator is an auto-regressive model like an LSTM or a GRU. The discriminator distinguishes whether a batch of sketches has come from the dataset or from the generator, so it must understand the dependency between strokes at different time-steps. Therefore, the discriminator is also an auto-regressive model like an LSTM or a GRU.
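To make the stroke-based representation concrete, the following sketch converts a drawing given as polylines of absolute points into a sequence of 5-tuples. It is an illustration rather than the authors' preprocessing code: the function name `to_stroke5` is hypothetical, and the start-of-sequence tuple (0, 0, 1, 0, 0) and the convention that the last point of each stroke carries the "pen lifted" state are assumptions taken from the sketch-rnn format of [11].

```python
def to_stroke5(strokes):
    """Convert polylines of absolute (x, y) points into 5-tuples.

    Each output tuple is (dx, dy, p1, p2, p3): offsets from the previous
    point followed by a one-hot pen-state (on paper, lifted, end of drawing).
    """
    sequence = [(0, 0, 1, 0, 0)]  # assumed start-of-sequence symbol
    prev_x, prev_y = 0, 0         # every drawing starts from the origin
    for stroke in strokes:
        for i, (x, y) in enumerate(stroke):
            is_last = i == len(stroke) - 1
            pen = (0, 1, 0) if is_last else (1, 0, 0)
            sequence.append((x - prev_x, y - prev_y) + pen)
            prev_x, prev_y = x, y
    sequence.append((0, 0, 0, 0, 1))  # end-of-drawing tuple
    return sequence


# Two strokes: a horizontal segment, then a single point drawn elsewhere.
drawing = [[(1, 1), (3, 1)], [(3, 4)]]
for step in to_stroke5(drawing):
    print(step)
```

Storing offsets instead of absolute coordinates keeps the values small and makes the representation invariant to where on the canvas the drawing is placed.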

3.3 SkeGAN: A Sequential GAN for Vector Images

In a GAN architecture, the weights of the generator are updated based on the signal/reward from the discriminator. GANs have a limitation when there is a need to generate discrete tokens: the discrete outputs make it difficult to pass gradient updates from the discriminator to the generator. In our case, the offsets are continuous random variables whereas the pen-states are discrete random variables. So, in a conventional GAN architecture, the gradient updates are passed during back-propagation without any difficulty for the offsets but not for the pen-states. Also, a discriminator can guide a generator only when a complete sequence is given to it, which means that it cannot guide the generator while a sequence is still being generated. Therefore, we propose a coupled GAN architecture with a combination of policy gradient and standard adversarial losses to generate both multi-dimensional discrete and continuous tokens. The generator $G_\theta$ in SkeGAN is a stochastic policy in RL which can sample tuples for the Monte Carlo search. By performing a Monte Carlo search, the reward signal from the discriminator is passed back to $G_\theta$ even at its intermediate action values. Further, policy gradients are used for updating the weights of $G_\theta$ via the gradient ascent rule in Equation 7.

We assume that the current coordinate at which the pen is situated dictates whether the pen must be on the paper or must be lifted, when it is to be moved to the next coordinate. In other words, we assume that the offsets influence the pen-states. In addition to this, the previous pen-state influences the next pen-state. So, the current pen-state depends both on its previous state and the current offset. To model this relationship, we propose a coupled generator consisting of two generators viz. for generating offsets and for generating pen-states. Each of and is an LSTM with a hidden size of 512. The hidden state of at time-step is denoted as and that of at time-step is denoted as . The coupling is achieved by having two update gates and with learnable parameters. The coupling effect can be mathematically described in the following equations:


where refers to element-wise multiplication and , , and are learnable parameters. The Generator of the proposed architecture is shown in the left portion of Figure 2. At each time-step , generates and generates

. The parameters for the distribution of offsets are estimated from

while those for the distribution of pen-states are estimated from as given in [11].

Figure 2: Generator (Left) and Discriminator (Right) of SkeGAN

The discriminator $D_\phi$ is a Bidirectional LSTM with a hidden size of 256. A batch with one half containing generated sketches and the other half from the dataset is shuffled and given to it. The forward and the backward hidden states of the LSTM are concatenated and mapped to a vector of dimension 2, followed by a softmax activation to predict the probability of each sequence being real or fake. The discriminator is shown in the right portion of Figure 2.


Policy gradient based formulation:

Let and . Since this policy gradient based formulation is meaningful only for discrete tokens, the following discussion pertains to alone. Given a sequence , where each is a 3-tuple consisting of a valid pen-state i.e. and . At a time-step , the state of is the sequence of produced tokens, which is and its action is to select the next token . It can be observed that though the policy model is stochastic, the transition state is deterministic after an action. In other words, if is the current state and the action is , then next state . is the probability of a particular sequence being real or not. Since there is no intermediate reward for an incomplete sequence, the objective is to generate a sequence from the start state which maximizes the expected end reward as given by:


where $Q_{D_\phi}^{G_\theta}(s, a)$ is the action-value function of the sequence and $R_T$ is the reward of the complete sequence. $Q_{D_\phi}^{G_\theta}(s, a)$ is the expected accumulative reward starting from state $s$, taking action $a$ and then following the policy $G_\theta$. The next step is to estimate the action-value function. The estimated probability of a sequence being real, $D_\phi(Y_{1:T})$, is considered to be the reward for $G_\theta$. $D_\phi$ can provide the reward for a complete sequence only. Also, one must look to maximize the long-term reward. Therefore, to evaluate every intermediary step $t$, Monte Carlo search with a rollout policy $G_\beta$ is used to sample the rest of the tokens. Let the output of an $N$-time Monte Carlo search be represented as:

$$\left\{ Y^1_{1:T}, \ldots, Y^N_{1:T} \right\} = \mathrm{MC}^{G_\beta}(Y_{1:t}; N) \tag{6}$$

where $Y^n_{1:t} = (y_1, \ldots, y_t)$ and $Y^n_{t+1:T}$ is sampled based on the rollout policy $G_\beta$ and the current state. Here $G_\beta$ is set to $G_\theta$ itself for simplicity and speed. Thus, the action-value function is defined as:

$$Q_{D_\phi}^{G_\theta}(s = Y_{1:t-1}, a = y_t) = \begin{cases} \frac{1}{N} \sum_{n=1}^{N} D_\phi(Y^n_{1:T}), \quad Y^n_{1:T} \in \mathrm{MC}^{G_\beta}(Y_{1:t}; N) & \text{for } t < T \\ D_\phi(Y_{1:t}) & \text{for } t = T \end{cases}$$
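The Monte Carlo estimate of the action value can be sketched as follows. This is a toy illustration in the spirit of [15], not the SkeGAN implementation: `estimate_q`, `sample_next` and `discriminator` are hypothetical names, pen-states are encoded as integers 0, 1, 2, and the estimate for an incomplete prefix is assumed to be the average discriminator score over the completed rollouts.

```python
import random


def estimate_q(prefix, total_len, sample_next, discriminator, n_rollouts, rng):
    """Estimate the action value of `prefix` (which already includes y_t).

    For a complete sequence the discriminator score is used directly;
    otherwise the prefix is completed `n_rollouts` times with the rollout
    policy `sample_next` and the scores are averaged.
    """
    if len(prefix) == total_len:
        return discriminator(prefix)
    total = 0.0
    for _ in range(n_rollouts):
        sequence = list(prefix)
        while len(sequence) < total_len:
            sequence.append(sample_next(sequence, rng))
        total += discriminator(sequence)
    return total / n_rollouts


def uniform_policy(sequence, rng):
    """Toy rollout policy: picks one of three pen-states uniformly."""
    return rng.randrange(3)


def toy_discriminator(sequence):
    """Toy reward: fraction of tokens equal to 0 (pen on paper)."""
    return sum(1 for y in sequence if y == 0) / len(sequence)


rng = random.Random(0)
q_value = estimate_q([0, 1], 8, uniform_policy, toy_discriminator, 16, rng)
print(q_value)
```

Using the generator itself as the rollout policy, as the text does, means no second network has to be trained or kept in sync.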

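The advantage of using $D_\phi$ as the reward function is that, since it is updated by the adversarial loss at every iteration, it keeps improving its capability of distinguishing between real and fake. Due to this, it can provide better feedback to $G_\theta$. The gradient of the objective $J(\theta)$ with respect to $\theta$, from [29], is given by:

$$\nabla_\theta J(\theta) = \sum_{t=1}^{T} \mathbb{E}_{Y_{1:t-1} \sim G_\theta} \left[ \sum_{y_t \in \mathcal{Y}} \nabla_\theta G_\theta(y_t \mid Y_{1:t-1}) \cdot Q_{D_\phi}^{G_\theta}(Y_{1:t-1}, y_t) \right]$$

Using likelihood ratios [15], this becomes:

$$\nabla_\theta J(\theta) \simeq \sum_{t=1}^{T} \mathbb{E}_{y_t \sim G_\theta(y_t \mid Y_{1:t-1})} \left[ \nabla_\theta \log G_\theta(y_t \mid Y_{1:t-1}) \cdot Q_{D_\phi}^{G_\theta}(Y_{1:t-1}, y_t) \right]$$

The parameters of $G_\theta$ are updated by the gradient ascent rule, which is also known as the policy gradient equation:

$$\theta \leftarrow \theta + \alpha_h \nabla_\theta J(\theta) \tag{7}$$

where $\alpha_h$ is the learning rate at the $h$-th iteration.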
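A single policy-gradient step can be illustrated on a toy policy. The sketch below assumes, purely for illustration, that the policy over the three pen-states is a softmax over three logits that are themselves the parameters; in SkeGAN the probabilities come from an LSTM, so this only shows the shape of the update. For a softmax policy, the gradient of the log-probability of the chosen action with respect to the logits is the one-hot vector of the action minus the probabilities. The name `policy_gradient_step` is hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_step(logits, action, q_value, learning_rate):
    """One gradient-ascent step on q_value * log pi(action).

    grad of log pi(action) w.r.t. the logits = one_hot(action) - pi.
    """
    probs = softmax(logits)
    grad_log_prob = [
        (1.0 if k == action else 0.0) - p for k, p in enumerate(probs)
    ]
    return [
        theta + learning_rate * q_value * g
        for theta, g in zip(logits, grad_log_prob)
    ]


logits = [0.0, 0.0, 0.0]
updated = policy_gradient_step(logits, action=1, q_value=1.0, learning_rate=0.3)
print(softmax(logits)[1], softmax(updated)[1])
```

An action that earned a high estimated reward becomes more probable after the step, and the size of the change scales with the reward.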


In order to ensure stability and faster convergence, the training of SkeGAN is done in two stages, viz. pre-training and adversarial training. $G_\theta$ is pre-trained so that it avoids generating meaningless values for both offsets and pen-states. At each time-step , is fed to . The pre-training of $G_\theta$ is done for 38500 iterations and the loss function used is the reconstruction loss [11] as given in Equation 16. $D_\phi$ is also pre-trained so that it can effectively differentiate between the real and the fake samples to provide better feedback to $G_\theta$. In our case, $D_\phi$ is pre-trained for 35000 iterations. Each batch for pre-training contains 50% samples from the dataset (labeled as real data) and 50% samples generated by $G_\theta$ (labeled as fake data). The loss function used is the binary cross-entropy loss. The number of pre-training iterations was decided empirically.

The adversarial training is done as in [15]. One round of training constitutes one epoch (700 iterations) of training $G_\theta$, followed by two epochs (1400 iterations) of training $D_\phi$. At each iteration in the training of $G_\theta$, the policy-gradient-based loss is used to update the parameters of the pen-state generator and the adversarial loss is used to update the parameters of the offset generator. Firstly, a sequence is generated from $G_\theta$. The action-value function is then calculated for the pen-state generator by using the Monte Carlo rollout of Equation 6, only for the pen-states. The parameters of the pen-state generator are updated by the policy gradient rule of Equation 7. For finding the adversarial loss, a sequence of length is given to . The generated sequence is then prepended with and a batch containing such sequences is given to $D_\phi$ for its decision. Based on this decision, the weights of are updated. The training of $D_\phi$ is similar to its pre-training.

As in [11], the offsets are modeled as a mixture of $M$ bivariate normal distributions, with the parameters of each distribution being $(\mu_x, \mu_y, \sigma_x, \sigma_y, \rho_{xy})$. There is an additional vector $\Pi$ of length $M$ which consists of the mixing weights of the Gaussian Mixture Model (GMM). Therefore, at each time-step $t$, the hidden state of the offset generator is mapped to a vector of size $6M$, from which the parameters of the GMM are obtained. The distribution of the offsets is given as:


The vector can be split into the parameters of the GMM as:


The weight for each of the component in the GMM is calculated as:


We then apply $\exp$ and $\tanh$ operations to ensure that the standard deviations are non-negative and that the correlation is in the range $(-1, 1)$.



The pen-states is a vector of size 3 and hence the hidden state of is mapped to a vector of size 3. The probabilities are calculated for pen-states as:


and are concatenated to get . We incorporate the parameter as in [11], to control the randomness or the variety in the generated samples. Mathematically writing, the parameters , and would be replaced by , and respectively.
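The mapping from a raw output vector to GMM parameters, and the sampling of an offset from them, can be sketched as below. This follows the conventions of sketch-rnn [11] as an assumption: the layout of the raw vector (mixing logits, two means, two log standard deviations, one raw correlation, each of length M), the division of the mixing logits by the temperature and the scaling of the standard deviations by the square root of the temperature are not confirmed by the text here. The names `split_gmm_params` and `sample_offset` are hypothetical.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def split_gmm_params(raw, num_mixtures, temperature=1.0):
    """Turn a raw vector of size 6*M into valid GMM parameters.

    Assumed layout: [pi_hat | mu_x | mu_y | sigma_x_hat | sigma_y_hat | rho_hat].
    """
    m = num_mixtures
    if len(raw) != 6 * m:
        raise ValueError("expected a vector of size 6 * num_mixtures")
    pi_hat, mu_x, mu_y, sx_hat, sy_hat, rho_hat = (
        raw[i * m:(i + 1) * m] for i in range(6)
    )
    weights = softmax([v / temperature for v in pi_hat])  # sums to 1
    scale = math.sqrt(temperature)                        # variance * temperature
    sigma_x = [math.exp(v) * scale for v in sx_hat]       # non-negative
    sigma_y = [math.exp(v) * scale for v in sy_hat]
    rho = [math.tanh(v) for v in rho_hat]                 # within [-1, 1]
    return weights, mu_x, mu_y, sigma_x, sigma_y, rho


def sample_offset(params, rng):
    """Pick a mixture component, then draw (dx, dy) from its bivariate normal."""
    weights, mu_x, mu_y, sigma_x, sigma_y, rho = params
    j = rng.choices(range(len(weights)), weights=weights)[0]
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    dx = mu_x[j] + sigma_x[j] * z1
    dy = mu_y[j] + sigma_y[j] * (rho[j] * z1 + math.sqrt(1.0 - rho[j] ** 2) * z2)
    return dx, dy


raw_vector = [0.0, 0.0, 1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
params = split_gmm_params(raw_vector, num_mixtures=2, temperature=0.25)
print(params)
print(sample_offset(params, random.Random(0)))
```

Lowering the temperature sharpens the mixing weights and shrinks the standard deviations, so samples concentrate around the most likely component means.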

The number of mixtures in the GMM is 20. The batch size is set to 100. The maximum sequence length is set to the length of the longest sequence in the dataset. Gradients are clipped between for both $G_\theta$ and $D_\phi$ to avoid exploding gradients, which is a common issue with sequence models. The initial learning rate is set to 0.001, with a decay of 0.9999 after every 700 iterations for $G_\theta$ and 1400 iterations for $D_\phi$. The learning rate is decayed only if it is above 0.00001. Recurrent dropout with a drop probability of 0.1 is used. The maximum number of steps in the rollout is set to 8 and the update rate for the policy gradient update is set to 0.8. The parameters of $G_\theta$ are updated using the Adam optimizer and those of $D_\phi$ are updated using Stochastic Gradient Descent (SGD).
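The learning-rate schedule and the gradient clipping described above can be sketched as follows. This is a minimal illustration: `decayed_learning_rate` and `clip_gradients` are hypothetical names, clipping is assumed to be element-wise, and the clipping bound used in the example is a placeholder because the actual range is not stated here.

```python
def decayed_learning_rate(initial_lr, iteration, decay_every,
                          decay=0.9999, min_lr=0.00001):
    """Multiply the rate by `decay` once per `decay_every` iterations,
    but only while it is still above `min_lr`."""
    lr = initial_lr
    for _ in range(iteration // decay_every):
        if lr > min_lr:
            lr *= decay
    return lr


def clip_gradients(gradients, bound):
    """Clip every gradient value into [-bound, bound] (assumed element-wise)."""
    return [max(-bound, min(bound, g)) for g in gradients]


# Generator schedule: decay after every 700 iterations.
print(decayed_learning_rate(0.001, iteration=1400, decay_every=700))
# Placeholder bound of 1.0, chosen only for the example.
print(clip_gradients([-2.0, 0.5, 3.0], bound=1.0))
```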

Figure 3: Encoder and Decoder of VASkeGAN architecture.

Figure 4: Discriminator of VASkeGAN architecture.

3.4 VASkeGAN: VAE-GAN for Sketch Generation

For the sake of fair comparison of SkeGAN with another GAN architecture, we propose a VAE-GAN [16] based architecture. Since VAEs are good at representing the data in the latent space and GANs at generating data, the VAE in VAE-GAN produces meaningful representation of the data, which helps the generator in generating data close to its actual distribution. [16] shows the success of VAE-GAN architecture for datasets such as CelebA [30] and Labeled Faces in the Wild (LFW) [31]. It alleviates the blurring in generated images. We hence propose VASkeGAN for sketch generation.

VASkeGAN is a combination of VAE and GAN, wherein the decoder of the VAE doubles up to be the generator of the GAN and there is a discriminator. The encoder is a Bi-directional LSTM with a hidden size of 256, which takes a sketch as an input and produces a latent vector of size . The parameters and are estimated from using linear layer with learnable parameters, where is a concatenation of the forward and backward hidden states. is calculated from as . and are then used to obtain the latent vector based on the “reparametrization trick" given as . The decoder/generator is an LSTM with a hidden size of 512 and it produces sketches conditioned on . The initial hidden state and cell state are derived from via the following equation: , where and are learnable parameters. The input to the decoder at each time-step is the previous point concatenated with . The output of decoder at every time-step , can be split as:


Exponential operation is applied to so that the standard deviations are non-negative and softmax is used to calculate the probabilities of the pen-states. The sampling of and is akin to the reparametrization trick. In other words, and are independently sampled and and are calculated as:


We experimented with two types of discriminators viz. one being a GRU with a hidden size of 512 and another being an LSTM with a hidden size of 512. The discriminator has to classify whether the batch of sketches presented to it is from the actual dataset (real samples with distribution

) or is generated by the generator (fake samples with distribution ). The hidden layer in the last time-step is mapped to a feature vector of size 2 using a linear layer with softmax activation, so as to output class probabilities of it being real and fake. The GAN plays a minmax game with the generator trying to “fool" the discriminator, while the discriminator tries to foil this attempt. The model is fully trained when the discriminator is no longer able to detect whether the batch of samples is from the real data or generated by the generator. At that time, is closer to . The proposed architecture is shown in Figures 3 and 4.
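The reparametrized sampling of the latent vector used by the encoder can be sketched as below. It assumes, following [11], that the standard deviation is obtained from the encoder output as the exponential of half its value, and that the latent vector is the mean plus the standard deviation times standard normal noise. The name `reparameterize` is hypothetical.

```python
import math
import random


def reparameterize(mu, sigma_hat, rng):
    """Sample z = mu + sigma * eps with sigma = exp(sigma_hat / 2).

    The noise eps is drawn from a standard normal, so the randomness is
    kept outside the quantities that depend on learnable parameters.
    """
    return [
        m + math.exp(s / 2.0) * rng.gauss(0.0, 1.0)
        for m, s in zip(mu, sigma_hat)
    ]


rng = random.Random(0)
latent = reparameterize([0.0] * 4, [0.0] * 4, rng)
print(latent)
```

Because the noise enters only as an additive, scaled term, gradients can flow through the mean and the standard deviation during back-propagation.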


The training of VAE-GAN is a combination of the training procedures of VAE and GAN. We train the encoder and the decoder (generator ), with reconstruction loss and KL Divergence loss from [11] and the adversarial loss for generator . The adversarial loss for is where and is the binary cross-entropy loss. The expression for and are given in Equation 16 and Equation 15 respectively. Therefore, the objective that needs to be minimized by the VAE part is .


where is set to 0.5. On the other hand, the objective that the discriminator minimizes is given as:


The weights of the encoder and the decoder are updated using Adam, whereas those of the discriminator are updated using SGD. It has been found empirically by [11] that annealing the weight of the KL divergence term yields better results, because it lets the Adam optimizer focus first on minimizing the reconstruction loss term, which is harder to optimize than the KL divergence loss. Therefore the modified objective is given as:


The batch size is set to 100. is set to the length of the longest sequence in the dataset and is the maximum length of any generated sequence. Gradients are clipped between for both generator and discriminator to avoid exploding of gradients, which is a common issue with sequence models. The initial learning rate is set to 0.001, with a decay of 0.9999 for every 100 iterations. The learning rate is decayed only if it is above a minimum threshold of 0.00001. The length of the latent vector is set to 128. In order to encourage the optimizer to put less focus on optimizing , it is modified as . The is assigned a value of 0.2. Recurrent dropouts with drop probability of 0.1 is used.
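The KL divergence term with its floor and annealing can be sketched as follows, under assumptions borrowed from [11]: the KL loss is computed in closed form against a standard normal prior and averaged over the latent dimensions, the floor is applied as a maximum with the value 0.2, and the annealing factor rises from `eta_min` towards 1 at rate `decay`. The values 0.5 and 0.2 come from the text; `eta_min`, `decay` and the function names are hypothetical, and the reconstruction and adversarial terms are left out.

```python
import math


def kl_loss(mu, sigma_hat):
    """Closed-form KL divergence to a standard normal prior, averaged over
    the latent dimensions (sigma_hat is the log-variance)."""
    n = len(mu)
    total = sum(1.0 + s - m * m - math.exp(s) for m, s in zip(mu, sigma_hat))
    return -total / (2.0 * n)


def weighted_kl_term(kl, step, w_kl=0.5, kl_min=0.2,
                     eta_min=0.01, decay=0.99999):
    """KL contribution to the objective with a floor and an annealed weight.

    eta_min and decay are placeholder values, not taken from the text.
    """
    eta = 1.0 - (1.0 - eta_min) * decay ** step
    return w_kl * eta * max(kl, kl_min)


kl = kl_loss([1.0, 0.0], [0.0, 0.0])
print(kl, weighted_kl_term(kl, step=0), weighted_kl_term(kl, step=10 ** 7))
```

The floor stops the optimizer from spending effort on a KL term that is already small, while the annealed weight keeps the early updates focused on reconstruction.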

4 Experiments and Results

4.1 Datasets, Baselines and Performance Metrics

We have used the QuickDraw dataset created in [11] for training and experimentation. QuickDraw consists of sketches belonging to 345 different categories. Each category consists of 75000 sketches for training, 2500 for validation and 2500 for testing. As of today, QuickDraw is the only dataset with a large number of sketches in vector format for training and testing. We trained VASkeGAN and SkeGAN on the cat, firetruck, mosquito and yoga pose categories, as done in [11]. Since there is a huge variety of sketches in QuickDraw, we chose these categories because they represent sketches of humans, animals, insects and non-living things, and capture the diversity of the dataset. We trained a separate model for each category for both the VASkeGAN and SkeGAN architectures, as done for sketch-rnn in [11]. VASkeGAN was trained for 200000 iterations on the aforesaid sketch categories. The total number of training rounds for cat, mosquito, yoga and firetruck sketches are 4, 3, 6 and 4 respectively. The results of SkeGAN and VASkeGAN, along with their implications, are discussed subsequently. We have quantitatively assessed the visual appeal of the sketches by performing a Human Turing Test in which a group of 45 human subjects rated the sketches on a scale of 1 to 5 on criteria such as clarity, drawing skill and naturalness. We have also introduced a metric, the Ske-score, to objectively assess sketch generation in vector format.

4.2 Results

Unconditional Generation of SkeGAN: All of the sketches are generated from a single starting tuple (the start-of-sequence symbol). Subsequently, tuples are generated until the end-of-drawing pen-state equals 1 or the number of tuples generated reaches the maximum length. Figure 5 shows some of the sketches generated for the aforesaid categories. Note that the sketches to the left of the separating line in Figure 5 are generated just after the pre-training of the generator. The sketches to the right of the separating line are generated by the trained model. The visual appeal of the generated images on the right favours the combination of policy gradients and adversarial loss for generating sketches. The images on the left of the separating line indicate that pre-training is essential but not sufficient to generate good sketches. It is clear from Figure 5 that the ‘scribble effect’ of [11] is alleviated by SkeGAN.

Figure 5: Sketches generated by SkeGAN after pre-training (left) and after the actual training (right).

Sketch Completion by SkeGAN: In order to test the extrapolative abilities of SkeGAN, we feed a partially drawn sketch and observe how it can figure out various endings for the incomplete sketch. The generator which is trained with sketches of a particular category, is conditioned with an incomplete sketch from that category. The hidden state of the generator after this conditioning is , which contains the semantic information of the incomplete sketch. Using this information, the remainder of the tuples for the sketch are sampled from the generator, with as its initial hidden state. Figure 6 shows various completions for the same input sketch at a temperature of 0.25. The completed sketches shown are indeed meaningful and visually appealing, which highlights the creative aspect of SkeGAN.

Figure 6: Partially drawn sketches (Left). Completed sketches by SkeGAN (Right).

Evaluation Loss of SkeGAN: Figure 7 shows the evaluation loss for both the generator and the discriminator across different categories of sketches. The generator is evaluated based on the reconstruction loss (as used in the pre-training stage), while the discriminator is evaluated using Negative Log Likelihood. The trends in both evaluation losses indicate that the minmax game played by the generator and discriminator is stable. Hence, the introduction of policy gradients into the GAN has not affected the minmax game or the stability of training, and has led to the generation of visually appealing sketches.

Figure 7: Evaluation loss of Generator and Discriminator for SkeGAN.

Sketches Generated by VASkeGAN: We allow the trained model to generate sketches after being conditioned on a sketch from a particular class. A sample of the generated images is shown in Figure 8. This confirms that the model is indeed generating meaningful sketches and not random strokes. Here too, it is clear from Figure 8 that the ‘scribble effect’ of [11] is alleviated by VASkeGAN. Following this, VASkeGAN is trained as a standalone GAN wherein the encoder and the are removed, retaining only the generator (decoder) and the discriminator with and adversarial loss. The sketches generated in this case for all the categories are just doodles without any discernible entity. This experimentally validates the fact that discrete outputs make it difficult to pass gradient updates from the discriminator to the generator, and hence empirically strengthens the argument in favour of the formulation of SkeGAN.

Figure 8: Sketches generated by VASkeGAN with GRU discriminator (Top) and LSTM discriminator (Bottom).

Transfer Learning by VASkeGAN: The weights of the model trained on cat sketches for 200000 iterations are used as an initialization to train two different models on pig and aeroplane sketches. In this case, the training is done for only 100000 iterations. Figure 9 shows the pig and aeroplane sketches generated by transferring the learnt representations across categories. This shows that VASkeGAN generalizes well and is able to transfer knowledge across categories. Transfer learning on SkeGAN led to a mode collapse, which is a direction of future investigation.

Figure 9: Transfer learning with GRU discriminator (Top) and LSTM discriminator (Bottom).

Visual Appeal: The scores of the Turing Test on a scale of 5 for the aforesaid criteria of visual appeal across the 4 categories of sketches are shown in Figure 10. The averages of the scores for each criterion across the different categories are tabulated in Table 1. It is interesting to note that SkeGAN-generated sketches are almost on par with those in the dataset. This supports our claim that SkeGAN can generate sketches that are as clear, as artistic and as natural as those from the dataset.

Figure 10: Scores of Human Turing Test for different criteria of visual appeal.
Model             Clarity        Drawing Skill   Naturalness
Dataset           2.79 ± 1.01    2.71 ± 1.07     2.6 ± 1.12
SkeGAN            2.25 ± 1.1     2.25 ± 1.04     2.24 ± 1.21
VASkeGAN (GRU)    1.8 ± 0.86     1.74 ± 0.91     1.87 ± 1.12
VASkeGAN (LSTM)   1.87 ± 0.93    1.68 ± 0.79     1.83 ± 1.04
Table 1: Turing Test for Visual Appeal

Evaluation of Sketches With Ske-score: The Ske-scores for the proposed models and for the dataset are tabulated in Table 2. The Ske-scores of SkeGAN are generally closer to those of the dataset than the Ske-scores of VASkeGAN. Therefore, SkeGAN generates ‘good’ sketches without scribble effect. This indicates that our formulation of SkeGAN is well suited for sketch generation in vector format.

Model             Cat            Fire truck     Mosquito       Yoga
Dataset           0.18 ± 0.07    0.12 ± 0.05    0.16 ± 0.08    0.15 ± 0.07
SkeGAN            0.19 ± 0.09    0.13 ± 0.08    0.18 ± 0.12    0.18 ± 0.1
VASkeGAN (GRU)    0.15 ± 0.07    0.1 ± 0.05     0.13 ± 0.08    0.12 ± 0.06
VASkeGAN (LSTM)   0.15 ± 0.06    0.09 ± 0.05    0.13 ± 0.06    0.13 ± 0.06
Table 2: Evaluation of generated sketches using Ske-score

5 Discussions

We now present ablation studies related to our models, discuss the best discriminator for VASkeGAN, and compare the training times of SkeGAN with VASkeGAN.

Effect of Temperature $\tau$:
Conditional Generation of VASkeGAN: We fix a sketch from each category and vary the temperature to see its effect on the reconstruction. The effect of temperature on sketches generated by models trained with GRU and LSTM discriminators is shown in Figure 11. The images to the left of the separating line in the top and bottom subfigures are the human inputs to the trained model. To the right of the separating line in both subfigures are the sketches conditionally generated by the VASkeGAN model at temperatures of 0.2, 0.4, 0.6, 0.8 and 1.0 respectively. From Figure 11, it can be noticed that as the temperature increases, the “randomness” increases. Under the influence of $\tau$, , and are replaced by , and respectively. Therefore, the higher the value of $\tau$, the greater the influence of the variance, which translates into variations in the generated sketches.

Another observation is that the generated sketches of a particular category have the best visual appeal at a particular temperature. At this temperature, the model not only reconstructs the sketches but also generates extra visually appealing features that are not present in the input image. For example, in Figure 11, the change in the position of the cat's whiskers in the top subfigure and the whiskers generated on the cat’s face in the bottom subfigure are not present in the human input; the model adds them to make the generated sketch look more natural.

Figure 11: Effect of temperature on VASkeGAN with GRU discriminator (Top) and LSTM discriminator (Bottom).

Unconditional Generation of SkeGAN: The effect of temperature on unconditional generation is similar to its effect on conditional generation with VASkeGAN. As the temperature increases, the randomness of the generated sketches also increases. Since we are investigating unconditional generation, there is no ground truth against which to compare the randomness. Instead, a group of sketches generated at one temperature must be compared with a group generated at a different temperature. The temperature influences the parameters of the output distribution in the same way as in conditional generation with VASkeGAN. In addition, it influences the mixing weights of the GMM by acting as their inverse multiplier, as in [11]. In Figure 12, each of the 5 rows depicts sketches generated by SkeGAN at temperatures of 0.2, 0.4, 0.6, 0.8 and 1.0 respectively. Based on visual appeal, we find that a temperature of 0.4 is ideal for sketch generation.
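The role of temperature described above can be illustrated with a minimal sketch. It assumes the convention of [11], in which the variance of each Gaussian is multiplied by the temperature and the mixture logits are divided by it before the softmax. The function names, logits and sample sizes are hypothetical and are not taken from our implementation.

```python
import math
import random

def adjust_mixture_weights(logits, temperature):
    """Softmax over mixture logits divided by the temperature (assumed convention of [11])."""
    scaled = [z / temperature for z in logits]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - largest) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_offset(mu, sigma, temperature, rng):
    """Draw one pen offset from N(mu, sigma^2 * temperature)."""
    return rng.gauss(mu, sigma * math.sqrt(temperature))

rng = random.Random(0)
logits = [2.0, 1.0, 0.0]  # hypothetical mixture logits
for t in (0.2, 0.6, 1.0):
    weights = adjust_mixture_weights(logits, t)
    draws = [sample_offset(0.0, 1.0, t, rng) for _ in range(2000)]
    spread = math.sqrt(sum(d * d for d in draws) / len(draws))
    print(t, [round(w, 3) for w in weights], round(spread, 2))
```

Under these assumptions, a low temperature concentrates the mixing weights on the dominant component and shrinks the sampled offsets towards the mean, whereas a temperature of 1.0 leaves the learned distribution unchanged.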

Figure 12: Effect of temperature on sketches generated by SkeGAN.

Weighting Policy Gradient Loss: To understand the effect of the policy gradient loss on training, we multiply the loss by different weights and analyze the effect on the generated sketches and on their Ske-scores. Figure 13 shows how the weight applied to the policy gradient loss affects the generated sketches. One can observe that the policy gradient loss and the adversarial loss must be weighted equally in order to generate visually appealing sketches. Therefore, the ideal weight for both the adversarial loss and the policy gradient loss is 1.0.
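The weighting can be sketched as follows. This is an illustration only: it assumes a REINFORCE-style policy gradient loss over the log-probabilities of the sampled pen states, and the function names, probabilities, rewards and the adversarial loss value are hypothetical.

```python
import math

def policy_gradient_loss(log_probs, rewards):
    """REINFORCE-style loss: negative mean of reward-weighted log-probabilities."""
    weighted = [lp * r for lp, r in zip(log_probs, rewards)]
    return -sum(weighted) / len(weighted)

def generator_loss(adversarial_loss, pg_loss, pg_weight=1.0, adv_weight=1.0):
    """Weighted sum of the adversarial loss and the policy gradient loss."""
    return adv_weight * adversarial_loss + pg_weight * pg_loss

# Hypothetical probabilities of the sampled pen states and discriminator rewards.
log_probs = [math.log(p) for p in (0.7, 0.2, 0.9)]
rewards = [0.8, 0.3, 0.9]
pg = policy_gradient_loss(log_probs, rewards)

# The weights studied in this ablation.
for weight in (0.25, 0.5, 1.0, 2.0):
    print(weight, round(generator_loss(0.65, pg, pg_weight=weight), 4))
```

With both weights left at 1.0, the two terms contribute on an equal footing, which is the setting found to work best here.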

The variation in Ske-scores due to the weight given to the policy gradient loss is tabulated in Table 3. A very important observation is that the Ske-score increases as the weight given to the policy gradient loss is increased. Also, a Ske-score that is sufficiently close to that of the dataset implies an alleviation of the Scribble Effect. Therefore, we conclude that the policy gradient formulation is well suited to pen-states, as it reduces the Scribble Effect and helps in generating visually appealing sketches. The Scribble Effect is one of the shortcomings of Sketch-rnn [11].

Figure 13: Effect of weighting policy gradient loss for cat sketches (Top) and firetruck sketches (Bottom).
| Model | Cat | Firetruck |
|---|---|---|
| Dataset | 0.18 ± 0.07 | 0.12 ± 0.05 |
| SkeGAN@0.25 | 0.18 ± 0.08 | 0.13 ± 0.07 |
| SkeGAN@0.5 | 0.18 ± 0.08 | 0.13 ± 0.07 |
| SkeGAN@1.0 | 0.19 ± 0.09 | 0.13 ± 0.08 |
| SkeGAN@2.0 | 0.19 ± 0.09 | 0.13 ± 0.08 |

Table 3: Effect of weighting policy gradient loss on Ske-scores.

Weighting KL Divergence Loss: To understand the effect of weighting the KL divergence loss, we set the KL weight to 0.25, 0.5 and 1.0 and analyze the quality of the generated sketches both visually and using the Ske-score. In [11], a higher KL weight produces images closer to the data manifold; in our case, the behaviour is exactly the opposite. Figure 14 shows the training curves for different values of the KL weight when the proposed model is trained on cat sketches with GRU and LSTM discriminators respectively. The implication for sketch generation is shown in Figure 15. From visual inspection of these sketches, we conclude that a KL weight of 0.5 is ideal.
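How a KL weight enters a VAE objective can be sketched as below. The example assumes a diagonal Gaussian posterior, a standard normal prior and a KL term averaged over latent dimensions; the function names and all numeric values are hypothetical, and the adversarial term of the model is omitted.

```python
import math

def kl_divergence(mu, log_var):
    """KL(N(mu, exp(log_var)) || N(0, 1)), averaged over latent dimensions."""
    terms = [1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var)]
    return -0.5 * sum(terms) / len(terms)

def weighted_vae_loss(reconstruction_loss, mu, log_var, kl_weight):
    """Reconstruction loss plus the KL divergence loss scaled by the KL weight."""
    return reconstruction_loss + kl_weight * kl_divergence(mu, log_var)

# Hypothetical posterior parameters and reconstruction loss for one sketch.
mu = [0.5, -0.3, 0.1]
log_var = [-0.2, 0.1, 0.0]
for kl_weight in (0.25, 0.5, 1.0):
    print(kl_weight, round(weighted_vae_loss(1.2, mu, log_var, kl_weight), 4))
```

A larger KL weight pulls the posterior more strongly towards the prior at the expense of reconstruction, which is the trade-off explored in this ablation.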

Figure 14: Training curves for various values of the KL weight with GRU (left) and LSTM (right) discriminators.

We also analyze the effect of changing the KL weight on the Ske-scores of the sketches. The Ske-scores of cat sketches for different KL weights are tabulated in Table 4. The Ske-score does not change significantly with the KL weight, implying that there is no correlation between the weight assigned to the KL divergence loss and the Ske-score.

Figure 15: Effect of the KL weight on the generated sketches at a constant temperature of 0.25.
| Model | Cat |
|---|---|
| Dataset | 0.18 ± 0.07 |
| VASkeGAN(GRU)@0.25 | 0.15 ± 0.06 |
| VASkeGAN(GRU)@0.5 | 0.15 ± 0.07 |
| VASkeGAN(GRU)@1.0 | 0.15 ± 0.06 |
| VASkeGAN(LSTM)@0.25 | 0.15 ± 0.07 |
| VASkeGAN(LSTM)@0.5 | 0.15 ± 0.06 |
| VASkeGAN(LSTM)@1.0 | 0.15 ± 0.06 |

Table 4: Effect of the KL weight on Ske-score.

Discriminator for VASkeGAN: Based on the visual appeal scores for the sketches in each category, as shown in Figure 10, the best discriminator for VASkeGAN is an LSTM discriminator, except for the ‘firetruck’ class, where the GRU discriminator worked well. This highlights how sensitive GAN training is to the choice of discriminator.

Training-time Comparison: Both VASkeGAN and SkeGAN are trained on an NVIDIA GeForce GTX 1080 Ti. The average time per training iteration for VASkeGAN is 0.6s, giving a total of 33.33 hours to train the model for 200000 iterations on a given category. The average time per iteration (one generator iteration plus two discriminator iterations) for SkeGAN is 16.95s. The total training times of SkeGAN for cat, mosquito, yoga and firetruck sketches are 13.18 hours, 9.88 hours, 19.77 hours and 13.18 hours respectively, all much less than the 33.33 hours needed for VASkeGAN. Although each SkeGAN iteration is slower, SkeGAN needs far fewer iterations and thus converges faster than VASkeGAN.
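The arithmetic behind these figures can be checked with a short sketch. The helper names are hypothetical. The SkeGAN iteration counts are not reported above; they are only implied by the stated total times and the average time per iteration.

```python
def total_hours(seconds_per_iteration, iterations):
    """Total training time in hours."""
    return seconds_per_iteration * iterations / 3600.0

def implied_iterations(hours, seconds_per_iteration):
    """Number of iterations implied by a total time and a per-iteration time."""
    return hours * 3600.0 / seconds_per_iteration

# VASkeGAN: 0.6 s per iteration for 200000 iterations.
print(round(total_hours(0.6, 200000), 2))  # 33.33

# SkeGAN: 16.95 s per iteration; total hours as reported per category.
skegan_hours = {"cat": 13.18, "mosquito": 9.88, "yoga": 19.77, "firetruck": 13.18}
for category, hours in skegan_hours.items():
    print(category, round(implied_iterations(hours, 16.95)))
```

Under this reading, SkeGAN trains for a few thousand iterations per category, compared with 200000 iterations for VASkeGAN, which is why its total training time is lower despite the slower iterations.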

6 Conclusion

In this work, we proposed two GAN-based approaches to address the problem of sketch generation in vector format. Until now, only a handful of approaches, all based on VAEs [12, 11, 13, 14], have addressed this problem. We proposed two architectures, viz. SkeGAN and VASkeGAN, the former a standalone GAN trained with policy gradients and adversarial loss, the latter based on VAE-GAN. SkeGAN generates sketches that are better, both qualitatively and quantitatively, than those generated by VASkeGAN, and it converges faster than VASkeGAN. SkeGAN generates sketches whose clarity, naturalness and drawing skill are close to those of the sketches in the dataset, while maintaining a stable training process. Most importantly, both VASkeGAN and SkeGAN overcome the “scribble effect” of [11], thus highlighting their usefulness. Future directions include generalizing this work to larger vector-art datasets, including cartoons.

References


  • [1] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • [2] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • [3] Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra. Draw: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.
  • [4] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
  • [5] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, 2017.
  • [6] Donggeun Yoo, Namil Kim, Sunggyun Park, Anthony S Paek, and In So Kweon. Pixel-level domain transfer. In European Conference on Computer Vision, pages 517–532. Springer, 2016.
  • [7] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4681–4690, 2017.
  • [8] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. CVPR, 2017.
  • [9] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [10] Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, Trevor Darrell, and Alexei Efros. Context encoders: Feature learning by inpainting. In Computer Vision and Pattern Recognition (CVPR), 2016.
  • [11] David Ha and Douglas Eck. A neural representation of sketch drawings. In International Conference on Learning Representations, 2018.
  • [12] D Ha. Recurrent net dreams up fake chinese characters in vector format with tensorflow, 2015.
  • [13] Yajing Chen, Shikui Tu, Yuqi Yi, and Lei Xu. Sketch-pix2seq: a model to generate sketches of multiple categories. arXiv preprint arXiv:1709.04121, 2017.
  • [14] Kimberli Zhong. Learning to draw vector graphics: applying generative modeling to font glyphs. PhD thesis, Massachusetts Institute of Technology, 2018.
  • [15] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. Seqgan: Sequence generative adversarial nets with policy gradient. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
  • [16] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300, 2015.
  • [17] Saul Simhon and Gregory Dudek. Sketch interpretation and refinement using statistical models. In Rendering Techniques, pages 23–32, 2004.
  • [18] Patrick Tresset and Frederic Fol Leymarie. Portrait drawing by paul the robot. Computers & Graphics, 37(5):348–363, 2013.
  • [19] Ravi Kiran Sarvadevabhatla, Jogendra Kundu, et al. Enabling my robot to play pictionary: Recurrent neural networks for sketch recognition. In Proceedings of the 24th ACM international conference on Multimedia, pages 247–251. ACM, 2016.
  • [20] Ravi Kiran Sarvadevabhatla et al. Analyzing structural characteristics of object category representations from their semantic-part distributions. In Proceedings of the 24th ACM international conference on Multimedia, pages 97–101. ACM, 2016.
  • [21] Ravi Kiran Sarvadevabhatla et al. Eye of the dragon: Exploring discriminatively minimalist sketch-based abstractions for object categories. In Proceedings of the 23rd ACM international conference on Multimedia, pages 271–280. ACM, 2015.
  • [22] Ravi Kiran Sarvadevabhatla, Sudharshan Suresh, and R Venkatesh Babu. Object category understanding via eye fixations on freehand sketches. IEEE Transactions on Image Processing, 26(5):2508–2518, 2017.
  • [23] Ravi Kiran Sarvadevabhatla, Shiv Surya, Trisha Mittal, and R Venkatesh Babu. Game of sketches: Deep recurrent models of pictionary-style word guessing. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • [24] Ravi Kiran Sarvadevabhatla, Isht Dwivedi, Abhijat Biswas, Sahil Manocha, et al. Sketchparse: Towards rich descriptions for poorly drawn sketches using multi-task hierarchical deep networks. In Proceedings of the 25th ACM international conference on Multimedia, pages 10–18. ACM, 2017.
  • [25] Jun Yu, Shengjie Shi, Fei Gao, Dacheng Tao, and Qingming Huang. Composition-aided face photo-sketch synthesis. 2017.
  • [26] Wengling Chen and James Hays. Sketchygan: towards diverse and realistic sketch to image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9416–9425, 2018.
  • [27] Yongyi Lu, Shangzhe Wu, Yu-Wing Tai, and Chi-Keung Tang. Image generation from sketch constraint using contextual gan. In Proceedings of the European Conference on Computer Vision (ECCV), pages 205–220, 2018.
  • [28] Jonas Jongejan, Henry Rowley, Takashi Kawashima, Jongmin Kim, and Nick Fox-Gieg. The quick, draw!-ai experiment, 2016.
  • [29] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.
  • [30] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
  • [31] Gary B. Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.