1 Introduction
Some of the most popular generative models in Deep Learning (DL) are VAEs
[1] and GANs [2]. VAEs maximize the likelihood of the generated data coming from the actual distribution while assuming a Gaussian prior. Different VAE architectures have performed well in generating various types of images, ranging from handwritten digits [1, 3] to faces [1], house numbers [3] and CIFAR images [3]. In a GAN, on the other hand, the generator and discriminator play a minimax game until they reach an equilibrium; at this point, the distribution of the generator is close to that of the original data. GANs have been used for simple image generation [4] as well as for more sophisticated tasks such as style transfer [5, 6, 7], image-to-image translation [8, 9] and image inpainting [10]. All of the aforesaid works are limited to raster images.

Vector graphics were introduced in computer displays in the 1960s and have been studied intensely since. These images do not degrade when transformations are applied, they require a minimal amount of space to be stored and transferred and, most importantly, they can be rescaled infinitely. They are represented as curves and strokes.
Sketching was a means of communication long before languages were developed; it is thus a fundamental skill of human cognition. Today, DNNs deliver the state of the art in language-related tasks, but there are only a handful of works that discuss sketch generation in vector format, let alone vector image generation in the wild. As of today, only [11, 12, 13, 14], all based on VAEs, generate sketches in vector format. This shows that sketch generation in vector format is indeed a very challenging problem to tackle. In addition, there are no GAN architectures for sketch generation in vector format. We therefore focus on generating sketches in vector format.
We consider a sketch to be a collection of strokes, wherein each stroke consists of 2D continuous offsets and 3D discrete pen-states. This representation is also known as the stroke-based format. The discrete outputs pose a difficulty for the gradient updates to be passed from the discriminator to the generator for the weight update. [15] proposed a novel policy-gradient-based loss for generating 1D discrete tokens, whereas in our case we have a combination of 2D continuous variates and 3D discrete variates, which makes adapting the policy gradient loss a non-trivial task. Currently there are no metrics that quantify the 'goodness' of vector sketches; hence, we propose the 'Skescore', which quantifies the goodness of vector sketches. Our contributions are as follows:

The first GAN-based architecture, SkeGAN, for sketch generation in vector format. To this end we propose a novel coupling mechanism which models the influence of offsets on pen-states while sketching.

An alternative GAN architecture, VASkeGAN, based on the VAE-GAN architecture [16], for comparison with SkeGAN.

A new metric, the Skescore, which quantifies the goodness of generated vector sketches.
2 Related Work
There are very few approaches to sketch generation that use stochastic techniques such as Hidden Markov Models (HMMs)
[17], and others that use pure image processing techniques such as [18]. There is quite a lot of DL work on human-drawn sketches in general, such as recognition [19, 20, 21], eye-fixation or saliency [22], guessing a sketch as it is being drawn [23] and parsing [24]. Specifically, [25] uses GANs to generate sketches of human faces given digital portraits of their faces. There are also works such as [26, 27] which discuss approaches to convert rasterized sketches into realistic images. One commonality amongst all of them is that they all work with sketches in the raster format.

The first attempt at generating vector images is by [12], which generates Kanji characters using a two-layered LSTM, where each character is represented in the stroke-based format. Following [12], D. Ha et al. propose a VAE model called "sketch-rnn" for vector sketch generation in [11], trained on the "QuickDraw" dataset [28]. Here too, the sketches are represented in the stroke-based format. This model shows very good performance in unconditional generation, conditional reconstruction, latent space representation and predicting the endings of incomplete sketches for a variety of object classes. [11] produces visually appealing sketches when trained with a single category of sketch, but not when a mix of categories is used for training. Hence, in order to overcome this difficulty, [13] replaces the encoder of [11]
with a Convolutional Neural Network (CNN) and removes the KL-Divergence loss. This model too produces sketches in the stroke-based format. Since convolution is spatial, the input to this model is the rasterized format of sketches from the QuickDraw dataset. Based on a Turing Test, the authors of
[13] conclude that models with CNN encoders outperform those with RNN encoders in generating human-style sketches. K. Zhong in [14] extends the VAE proposed in [11] to create an end-to-end pipeline which takes in fonts as Scalable Vector Graphics (SVGs) to learn and generate novel fonts; the results are demonstrated on the Google Fonts dataset.

All the architectures mentioned for sketch generation in vector format are VAEs. A well-known disadvantage of VAEs is that they tend to produce blurred raster images. Since there is no concept of blurring in vector images, those produced by VAEs like sketch-rnn [11] instead suffer from a mode-collapse-like situation wherein the pen is not lifted to draw at another location, but stays on the paper and continues to scribble. We call this the scribble effect. Figure 1 shows this effect in the sketches of "yoga poses" and "mosquitos". Since VAEs assume the prior to be Gaussian, they need to be trained for a very large number of iterations so that the weights of the decoder adjust to generate close to the distribution of the data; in the case of [11], training is done for 10 million iterations. GANs, on the other hand, have performed outstandingly well on the variety of raster-image tasks mentioned in Section 1. In order to alleviate these disadvantages of VAEs and harness the power of GANs, we propose a standalone GAN called SkeGAN and another GAN called VASkeGAN with which we compare SkeGAN.
3 Our Contributions
3.1 Skescore: Evaluation Metric
Since generated sketches suffer from the scribble effect, we propose a novel metric known as the 'Skescore' which quantifies the goodness of a sketch. The Skescore of a sketch is defined as the ratio of the number of times the pen is lifted to the number of times it touches the paper while the sketch is being drawn. The Skescore of a model is defined to be the average of the individual Skescores of the sketches it generates. The Skescore of the dataset is the average of the Skescores of all of the sketches in it. Intuitively, this metric quantifies the fraction of time when the pen is lifted from the paper: a high value indicates that the pen is lifted more often. A model is said to generate 'good sketches', without the scribble effect, iff its Skescore is close to that of the dataset.
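The Skescore computation can be sketched in a few lines; the helper below is hypothetical (not from the paper) and assumes sketches stored as stroke-5 arrays with columns $(\Delta x, \Delta y, p_1, p_2, p_3)$, so pen touches are counted from $p_1$ and pen lifts from $p_2$:

```python
import numpy as np

def skescore(sketch):
    """Skescore of one sketch: (# pen lifts) / (# pen touches).

    `sketch` is an (L, 5) array of stroke-5 tuples; columns 2..4 hold the
    one-hot pen-state (pen on paper, pen lifted, end of drawing).
    """
    sketch = np.asarray(sketch)
    lifts = int(sketch[:, 3].sum())    # p2 = 1: pen lifted
    touches = int(sketch[:, 2].sum())  # p1 = 1: pen on paper
    return lifts / touches

def model_skescore(sketches):
    """Skescore of a model or dataset: mean of per-sketch Skescores."""
    return float(np.mean([skescore(s) for s in sketches]))
```

A sketch drawn with the pen never leaving the paper (the scribble effect) has a Skescore near zero, while normal sketches sit at a moderate positive value.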
3.2 Problem Setup
Sketches are considered to be a collection of 5-tuples $(\Delta x, \Delta y, p_1, p_2, p_3)$, where $\Delta x, \Delta y$ are the offsets to be moved along the $x$ and $y$ axes respectively and $(p_1, p_2, p_3)$ is the pen-state: $p_1 = 1$ indicates that the pen is on the paper, $p_2 = 1$ indicates it is lifted and $p_3 = 1$ indicates that the drawing has ended. The pen-state is modeled as a categorical random variable. All drawings are assumed to start from the origin. This is done by prepending the sketch with the start-of-sequence symbol $S_0 = (0, 0, 1, 0, 0)$. The offsets are modeled as a Gaussian Mixture Model (GMM) in the case of SkeGAN and as IID normal variables in the case of VASkeGAN. In both models, we incorporate the temperature parameter $\tau$ as defined in [11], to control the randomness or variety in the generated samples. Sketch generation proceeds tuple by tuple until $p_3$ becomes 1 or until the maximum length is reached. A sketch is generated stroke by stroke, wherein the stroke at timestep $t$ depends on all of the strokes at previous timesteps. In order to model this dependency, the generator is an autoregressive model like an LSTM or a GRU. The discriminator distinguishes whether a batch of sketches has come from the dataset or from the generator, so it must understand the dependency between strokes at different timesteps in order to distinguish the sketches. Therefore, the discriminator is also an autoregressive model like an LSTM or a GRU.

3.3 SkeGAN: A Sequential GAN for Vector Images
In a GAN architecture, the weights of the generator are updated based on the signal/reward from the discriminator. GANs have a limitation when there is a need to generate discrete tokens: the discrete outputs pose a difficulty for the gradient updates to be passed from the discriminator to the generator for the weight update. In our case, the offsets are continuous random variables whereas the pen-states are discrete random variables. So, given a conventional GAN architecture, during backpropagation the gradient updates are passed without any difficulty for the offsets but not for the pen-states. Also, a discriminator can guide a generator only when a complete sequence is given to it; it cannot guide the generator while a sequence is still being generated. Furthermore, in our case we must generate both discrete and continuous data. Therefore, we propose a coupled GAN architecture with a combination of policy gradient and standard adversarial losses to generate both multi-dimensional discrete and continuous tokens. The generator $G_\theta$
in SkeGAN is treated as a stochastic policy in RL which can sample tuples for the Monte Carlo search. By performing a Monte Carlo search, the reward signal from the discriminator is passed back to $G_\theta$ even at its intermediate action values. Further, policy gradients are used for updating the weights of $G_\theta$ via the gradient ascent rule in Equation 7.

We assume that the current coordinate at which the pen is situated dictates whether the pen must be on the paper or must be lifted when it is to be moved to the next coordinate. In other words, we assume that the offsets influence the pen-states. In addition to this, the previous pen-state influences the next pen-state. So, the current pen-state depends both on its previous state and on the current offset. To model this relationship, we propose a coupled generator consisting of two generators, viz. $G_1$ for generating offsets and $G_2$ for generating pen-states. Each of $G_1$ and $G_2$ is an LSTM with a hidden size of 512. The hidden state of $G_1$ at timestep $t$ is denoted as $h^1_t$ and that of $G_2$ at timestep $t$ is denoted as $h^2_t$. The coupling is achieved by having two update gates $z^1_t$ and $z^2_t$ with learnable parameters. The coupling effect can be mathematically described in the following equations:
(1) $z^1_t = \sigma(W_1 h^1_t + U_1 h^2_{t-1})$

(2) $z^2_t = \sigma(W_2 h^2_{t-1} + U_2 h^1_t)$

(3) $h^1_t \leftarrow z^1_t \odot h^1_t + (1 - z^1_t) \odot h^2_{t-1}$

(4) $h^2_{t-1} \leftarrow z^2_t \odot h^2_{t-1} + (1 - z^2_t) \odot h^1_t$
where $\odot$ refers to elementwise multiplication and $W_1$, $U_1$, $W_2$ and $U_2$ are learnable parameters. The generator of the proposed architecture is shown in the left portion of Figure 2. At each timestep $t$, $G_1$ generates the offsets $(\Delta x_t, \Delta y_t)$ and $G_2$ generates the pen-state $(p_1, p_2, p_3)_t$. The parameters for the distribution of offsets are estimated from $h^1_t$, while those for the distribution of pen-states are estimated from $h^2_t$, as given in [11].

The discriminator $D_\phi$ is a bidirectional LSTM with a hidden size of 256. A batch with one half containing generated sketches and the other half from the dataset is shuffled and given to it. The forward and backward hidden states of the LSTM are concatenated and mapped to a vector of dimension 2, followed by a softmax activation, to predict the probability of each sequence being real or fake. The discriminator is shown in the right portion of Figure 2.

Policy-gradient-based formulation:
Since this policy-gradient-based formulation is meaningful only for discrete tokens, the following discussion pertains to the pen-states alone. Consider a sequence $Y_{1:T} = (y_1, \ldots, y_T)$, where each $y_t$ is a 3-tuple consisting of a valid pen-state, i.e. $y_t \in \{(1,0,0), (0,1,0), (0,0,1)\}$. At a timestep $t$, the state of $G_\theta$ is the sequence of produced tokens, which is $Y_{1:t-1}$, and its action is to select the next token $y_t$. It can be observed that though the policy model is stochastic, the state transition is deterministic after an action: if $Y_{1:t-1}$ is the current state and the action is $y_t$, then the next state is $Y_{1:t}$. $D_\phi(Y_{1:T})$ is the probability of a particular sequence being real. Since there is no intermediate reward for an incomplete sequence, the objective is to generate a sequence from the start state $s_0$ which maximizes the expected end reward, as given by:
(5) $J(\theta) = \mathbb{E}[R_T \mid s_0, \theta] = \sum_{y_1 \in \mathcal{Y}} G_\theta(y_1 \mid s_0) \cdot Q^{G_\theta}_{D_\phi}(s_0, y_1)$
where $Q^{G_\theta}_{D_\phi}(s, a)$ is the action-value function of the sequence and $R_T$ is the reward of the complete sequence. $Q^{G_\theta}_{D_\phi}(s, a)$ is the expected accumulative reward starting from state $s$, taking action $a$ and following the policy $G_\theta$. The next step is to estimate the action-value function. The estimated probability of a sequence being real, $D_\phi(Y_{1:T})$, is considered to be the reward for $G_\theta$. $D_\phi$ can provide the reward for a complete sequence only. Also, one must look at maximizing the long-term rewards. Therefore, to evaluate every intermediate step $t$, a Monte Carlo search with a rollout policy $G_\beta$ is used to sample the rest of the tokens. Let the output of an $N$-time Monte Carlo search be represented as:
(6) $\mathrm{MC}^{G_\beta}(Y_{1:t}; N) = \{Y^1_{1:T}, \ldots, Y^N_{1:T}\}$
where $Y^n_{1:t} = Y_{1:t}$ and $Y^n_{t+1:T}$ is sampled based on the rollout policy and the current state. Here $G_\beta$ is set to $G_\theta$ itself for simplicity and speed. Thus, the action-value function for $G_\theta$ is defined as:

$$Q^{G_\theta}_{D_\phi}(s = Y_{1:t-1}, a = y_t) = \begin{cases} \dfrac{1}{N} \displaystyle\sum^{N}_{n=1} D_\phi(Y^n_{1:T}), \; Y^n_{1:T} \in \mathrm{MC}^{G_\beta}(Y_{1:t}; N) & \text{for } t < T \\ D_\phi(Y_{1:t}) & \text{for } t = T \end{cases}$$
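The Monte Carlo estimate above can be illustrated with a small, self-contained sketch; `discriminator` and `rollout` below are hypothetical stand-ins for $D_\phi$ and the rollout policy $G_\beta$, not the paper's trained networks:

```python
def q_value(prefix, t, T, discriminator, rollout, N=8):
    """Estimate Q(s = prefix[:-1], a = prefix[-1]): for t < T, average the
    discriminator's score over N rollouts that complete the sequence; for
    t = T, score the finished sequence directly."""
    if t == T:
        return discriminator(prefix)
    total = 0.0
    for _ in range(N):
        seq = list(prefix)
        while len(seq) < T:
            seq.append(rollout(seq))  # rollout policy samples the next token
        total += discriminator(seq)
    return total / N
```

With a toy discriminator that scores the fraction of 1-tokens and a rollout that always emits 1, the prefix `[1, 0]` with `T = 4` is completed to `[1, 0, 1, 1]` in every rollout, so the estimate is 0.75.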
The advantage of using $D_\phi$ as the reward function is that, since it is updated by the adversarial loss at every iteration, it keeps improving its capability of distinguishing between real and fake. Due to this, it can provide better feedback to $G_\theta$. The gradient of the loss function with respect to $\theta$, from [29], is given by:

$$\nabla_\theta J(\theta) = \sum^{T}_{t=1} \mathbb{E}_{Y_{1:t-1} \sim G_\theta} \Big[ \sum_{y_t \in \mathcal{Y}} \nabla_\theta G_\theta(y_t \mid Y_{1:t-1}) \cdot Q^{G_\theta}_{D_\phi}(Y_{1:t-1}, y_t) \Big]$$

Using likelihood ratios [15], this becomes:

$$\nabla_\theta J(\theta) \simeq \sum^{T}_{t=1} \mathbb{E}_{y_t \sim G_\theta(y_t \mid Y_{1:t-1})} \Big[ \nabla_\theta \log G_\theta(y_t \mid Y_{1:t-1}) \cdot Q^{G_\theta}_{D_\phi}(Y_{1:t-1}, y_t) \Big]$$

The parameters of $G_\theta$ are updated by the gradient ascent rule, which is also known as the policy gradient equation:
(7) $\theta \leftarrow \theta + \alpha_h \nabla_\theta J(\theta)$
where $\alpha_h$ is the learning rate at the $h^{th}$ iteration.
Training:
In order to ensure stability and faster convergence, the training of SkeGAN is done in two stages, viz. pre-training and adversarial training. $G_\theta$ is pre-trained so that it avoids generating meaningless values for both offsets and pen-states. At each timestep $t$, the ground-truth tuple of the previous timestep is fed to $G_\theta$. The pre-training of $G_\theta$ is done for 38500 iterations and the loss function used is the reconstruction loss [11] as given in Equation 16. $D_\phi$ is also pre-trained so that it can effectively differentiate between real and fake samples and thereby provide better feedback to $G_\theta$; in our case, $D_\phi$ is pre-trained for 35000 iterations. Each batch for pre-training contains 50% samples from the dataset (labeled as real data) and 50% samples generated by $G_\theta$ (labeled as fake data). The loss function used is the binary cross-entropy loss. The number of pre-training iterations was decided empirically.
The adversarial training is done as in [15]. One round of training constitutes one epoch (700 iterations) of training $G_\theta$, followed by two epochs (1400 iterations) of training $D_\phi$. At each iteration in the training of $G_\theta$, the policy-gradient-based loss is used to update the parameters of $G_2$ and the adversarial loss is used to update the parameters of $G_1$. Firstly, a sequence is generated from $G_\theta$. The action-value function is then calculated using the Monte Carlo rollout of Equation 6, only for the pen-states, and the parameters are updated by the policy gradient of Equation 7. For finding the adversarial loss, the generated sequence is prepended with $S_0$ and a batch containing such sequences is given to $D_\phi$ for its decision. Based on this decision, the weights of $G_\theta$ are updated. The training of $D_\phi$ is similar to its pre-training.

As in [11], the offsets are modeled as a mixture of $M$ bivariate normal distributions with the parameters for each of the distributions being $(\mu_x, \mu_y, \sigma_x, \sigma_y, \rho_{xy})$. There is an additional vector of length $M$ which consists of the mixing weights of the Gaussian Mixture Model (GMM). Therefore, at each timestep $t$, the hidden state of $G_1$ is mapped to a vector $y$ of size $6M$, from which the parameters of the GMM are obtained. The distribution of the offsets is given as:

(8) $p(\Delta x, \Delta y) = \sum^{M}_{j=1} \Pi_j \, \mathcal{N}(\Delta x, \Delta y \mid \mu_{x,j}, \mu_{y,j}, \sigma_{x,j}, \sigma_{y,j}, \rho_{xy,j})$
The vector $y$ can be split into the parameters of the GMM as:

(9) $[(\hat{\Pi}_j, \mu_{x,j}, \mu_{y,j}, \hat{\sigma}_{x,j}, \hat{\sigma}_{y,j}, \hat{\rho}_{xy,j})]^{M}_{j=1} = y$

The weight for each component of the GMM is calculated as:

(10) $\Pi_j = \dfrac{\exp(\hat{\Pi}_j)}{\sum^{M}_{k=1} \exp(\hat{\Pi}_k)}$

We then apply $\exp$ and $\tanh$ operations to ensure that the standard deviations are nonnegative and the correlation is in the range $(-1, 1)$:

(11) $\sigma_x = \exp(\hat{\sigma}_x), \quad \sigma_y = \exp(\hat{\sigma}_y), \quad \rho_{xy} = \tanh(\hat{\rho}_{xy})$

The pen-state is a vector of size 3 and hence the hidden state of $G_2$ is mapped to a vector $(\hat{p}_1, \hat{p}_2, \hat{p}_3)$ of size 3. The probabilities of the pen-states are calculated as:

(12) $p_k = \dfrac{\exp(\hat{p}_k)}{\sum^{3}_{j=1} \exp(\hat{p}_j)}, \quad k = 1, 2, 3$
The sampled offsets $(\Delta x, \Delta y)$ and pen-state $(p_1, p_2, p_3)$ are concatenated to obtain the generated tuple. We incorporate the temperature parameter $\tau$ as in [11], to control the randomness or variety in the generated samples. Mathematically, the parameters $\sigma^2_x$, $\sigma^2_y$ and $\hat{p}_k$ are replaced by $\tau\sigma^2_x$, $\tau\sigma^2_y$ and $\hat{p}_k / \tau$ respectively.
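The softmax/exp/tanh transforms and the temperature adjustment described above can be sketched as follows; this is a minimal numpy illustration, and the column ordering of the $6M$-sized output vector is an assumption:

```python
import numpy as np

def gmm_params(y, M, tau=1.0):
    """Split a raw output vector `y` (length 6M) into GMM parameters and
    apply the transforms: softmax mixture weights (logits sharpened by
    1/tau), exp for nonnegative std devs (scaled by sqrt(tau) so the
    variance scales by tau), and tanh for the correlation."""
    y = np.asarray(y, dtype=float).reshape(M, 6)
    pi_hat, mu_x, mu_y, s_x_hat, s_y_hat, rho_hat = y.T
    pi = np.exp(pi_hat / tau)
    pi /= pi.sum()                            # mixing weights sum to 1
    sigma_x = np.exp(s_x_hat) * np.sqrt(tau)  # variance scaled by tau
    sigma_y = np.exp(s_y_hat) * np.sqrt(tau)
    rho = np.tanh(rho_hat)                    # correlation in (-1, 1)
    return pi, mu_x, mu_y, sigma_x, sigma_y, rho
```

Lower temperatures concentrate the mixture weights on the dominant component and shrink the variances, which is what makes low-$\tau$ samples less random.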
The number of mixtures in the GMM is 20. The batch size is set to 100. $N_{max}$ is set to the length of the longest sequence in the dataset. Gradients are clipped for both $G_\theta$ and $D_\phi$ to avoid exploding gradients, which is a common issue with sequence models. The initial learning rate is set to 0.001, with a decay of 0.9999 after every 700 iterations for $G_\theta$ and 1400 iterations for $D_\phi$. The learning rate is decayed only while it is above 0.00001. Recurrent dropout with a drop probability of 0.1 is used. The maximum number of steps in the rollout is set to 8 and the update rate for the policy gradient update is set to 0.8. Parameters of $G_\theta$ are updated using the Adam optimizer and those of $D_\phi$ are updated using Stochastic Gradient Descent (SGD).
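The stepped learning-rate schedule described above (multiplicative decay with a floor) can be written as a small sketch; the exact stopping behaviour at the floor is an assumption:

```python
def decayed_lr(step, initial=0.001, decay=0.9999, every=700, floor=0.00001):
    """Learning rate after `step` iterations: one multiplicative decay per
    `every` iterations, applied only while the rate is above `floor`."""
    lr = initial
    for _ in range(step // every):
        if lr <= floor:
            break
        lr *= decay
    return lr
```

The same schedule serves both networks, with `every=700` for the generator and `every=1400` for the discriminator.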
3.4 VASkeGAN: VAE-GAN for Sketch Generation
For a fair comparison of SkeGAN with another GAN architecture, we propose a VAE-GAN [16] based architecture. Since VAEs are good at representing data in a latent space and GANs at generating data, the VAE in a VAE-GAN produces a meaningful representation of the data, which helps the generator generate data close to its actual distribution. [16] shows the success of the VAE-GAN architecture on datasets such as CelebA [30] and Labeled Faces in the Wild (LFW) [31], and it alleviates the blurring in generated images. We hence propose VASkeGAN for sketch generation.
VASkeGAN is a combination of a VAE and a GAN, wherein the decoder of the VAE doubles up as the generator of the GAN, alongside a discriminator. The encoder is a bidirectional LSTM with a hidden size of 256, which takes a sketch as input and produces a latent vector $z$ of size $N_z$. The parameters $\mu$ and $\hat{\sigma}$ are estimated from $h$ using linear layers with learnable parameters, where $h$ is the concatenation of the forward and backward hidden states. $\sigma$ is calculated from $\hat{\sigma}$ as $\sigma = \exp(\hat{\sigma}/2)$. $\mu$ and $\sigma$ are then used to obtain the latent vector based on the "reparametrization trick", given as $z = \mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. The decoder/generator is an LSTM with a hidden size of 512 and it produces sketches conditioned on $z$. The initial hidden state and cell state are derived from $z$ via $[h_0; c_0] = \tanh(W_z z + b_z)$, where $W_z$ and $b_z$ are learnable parameters. The input to the decoder at each timestep is the previous point concatenated with $z$. The output of the decoder at every timestep $t$, $y_t$, can be split as:
(13) $[(\mu_x, \mu_y, \hat{\sigma}_x, \hat{\sigma}_y, \hat{p}_1, \hat{p}_2, \hat{p}_3)] = y_t$
An exponential operation is applied to $\hat{\sigma}_x$ and $\hat{\sigma}_y$ so that the standard deviations are nonnegative, and a softmax is used to calculate the probabilities of the pen-states. The sampling of $\Delta x$ and $\Delta y$ is akin to the reparametrization trick. In other words, $\epsilon_x$ and $\epsilon_y$ are independently sampled from $\mathcal{N}(0, 1)$ and $\Delta x$ and $\Delta y$ are calculated as:
(14) $\Delta x = \mu_x + \exp(\hat{\sigma}_x)\,\epsilon_x, \quad \Delta y = \mu_y + \exp(\hat{\sigma}_y)\,\epsilon_y$
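The reparametrization trick for the latent vector and the offset sampling of Equation 14 can be illustrated with a minimal numpy sketch (the function names are hypothetical helpers, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparam_latent(mu, sigma_hat):
    """Reparametrization trick: z = mu + sigma * eps with sigma = exp(sigma_hat / 2),
    so gradients can flow through mu and sigma_hat while eps stays stochastic."""
    sigma = np.exp(np.asarray(sigma_hat) / 2.0)
    eps = rng.standard_normal(np.shape(mu))
    return np.asarray(mu) + sigma * eps

def sample_offsets(mu_x, mu_y, sigma_x_hat, sigma_y_hat):
    """Sample the offsets the same way: delta = mu + exp(sigma_hat) * eps."""
    dx = mu_x + np.exp(sigma_x_hat) * rng.standard_normal()
    dy = mu_y + np.exp(sigma_y_hat) * rng.standard_normal()
    return dx, dy
```

Driving $\hat{\sigma}$ strongly negative collapses the noise, so the samples reduce to the predicted means, which is a handy sanity check.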
We experimented with two types of discriminators, viz. a GRU with a hidden size of 512 and an LSTM with a hidden size of 512. The discriminator has to classify whether the batch of sketches presented to it is from the actual dataset (real samples with distribution $p_{data}$) or is generated by the generator (fake samples with distribution $p_{model}$). The hidden state at the last timestep is mapped to a feature vector of size 2 using a linear layer with softmax activation, so as to output the class probabilities of it being real and fake. The GAN plays a minimax game, with the generator trying to "fool" the discriminator while the discriminator tries to foil this attempt. The model is fully trained when the discriminator is no longer able to detect whether the batch of samples is from the real data or generated by the generator; at that point, $p_{model}$ is close to $p_{data}$. The proposed architecture is shown in Figures 3 and 4.

Training:
The training of VASkeGAN is a combination of the training procedures of the VAE and the GAN. We train the encoder and the decoder (generator $G$) with the reconstruction loss $L_R$ and the KL-Divergence loss $L_{KL}$ from [11], together with the adversarial loss for the generator, $L^{adv}_G$, which is the binary cross-entropy between the discriminator's predictions on generated sketches and the 'real' label. The expressions for $L_R$ and $L_{KL}$ are given in Equation 16 and Equation 15 respectively. Therefore, the objective that needs to be minimized by the VAE part is $L_R + w_{KL} L_{KL} + L^{adv}_G$.
(15) $L_{KL} = -\dfrac{1}{2N_z}\displaystyle\sum^{N_z}_{i=1}\big(1 + \hat{\sigma}_i - \mu_i^2 - \exp(\hat{\sigma}_i)\big)$

(16) $L_R = L_s + L_p$

where $w_{KL}$ is set to 0.5 and, as in [11], $L_s$ is the negative log-likelihood of the offsets under the predicted offset distribution while $L_p$ is the categorical cross-entropy of the pen-states, both averaged over $N_{max}$. On the other hand, the objective that the discriminator minimizes is given as:

(17) $L_D = -\mathbb{E}_{x \sim p_{data}}[\log D(x)] - \mathbb{E}_{z \sim \mathcal{N}(0, I)}[\log(1 - D(G(z)))]$
The weights of the encoder and the decoder are updated using Adam, whereas those of the discriminator are updated using SGD. It has been found empirically in [11] that annealing $L_{KL}$ yields better results, by making the Adam optimizer focus on minimizing the reconstruction loss term, which is tougher to optimize than the KL-Divergence loss. Therefore the modified objective is given as:

(18) $L = L_R + w_{KL}\,\eta_{step}\,L_{KL} + L^{adv}_G, \quad \eta_{step} = 1 - (1 - \eta_{min})R^{step}$
The batch size is set to 100. $N_{max}$ is set to the length of the longest sequence in the dataset and is the maximum length of any generated sequence. Gradients are clipped for both the generator and the discriminator to avoid exploding gradients, which is a common issue with sequence models. The initial learning rate is set to 0.001, with a decay of 0.9999 every 100 iterations. The learning rate is decayed only while it is above a minimum threshold of 0.00001. The length of the latent vector $z$ is set to 128. In order to encourage the optimizer to put less focus on optimizing $L_{KL}$, it is replaced by $\max(L_{KL}, KL_{min})$, with $KL_{min}$ assigned a value of 0.2. Recurrent dropout with a drop probability of 0.1 is used.
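The annealed and floored KL term can be combined into one small sketch; `eta_min` and `R` are assumed values, since only $w_{KL} = 0.5$ and $KL_{min} = 0.2$ are fixed in the text:

```python
def kl_term(l_kl, step, w_kl=0.5, eta_min=0.01, r=0.99999, kl_min=0.2):
    """Annealed, floored KL contribution: w_kl * eta_step * max(L_KL, KL_min),
    with eta_step = 1 - (1 - eta_min) * r**step rising from eta_min towards 1.
    eta_min and r are assumed values for illustration."""
    eta = 1.0 - (1.0 - eta_min) * (r ** step)
    return w_kl * eta * max(l_kl, kl_min)
```

Early in training the KL term is nearly switched off so the optimizer concentrates on reconstruction; the floor stops the optimizer from squeezing the KL below $KL_{min}$ once it is already small.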
4 Experiments and Results
4.1 Datasets, Baselines and Performance Metrics
We have used the QuickDraw dataset created in [11] for training and experimentation. QuickDraw consists of sketches belonging to 345 different categories; each category contains 75000 sketches for training, 2500 for validation and 2500 for testing. As of today, QuickDraw is the only dataset with a large number of sketches in vector format for training and testing. We trained VASkeGAN and SkeGAN on the categories cat, firetruck, mosquito and yoga poses, as done in [11]. Since there is a huge variety of sketches in QuickDraw, we chose these categories because they represent sketches of humans, animals, insects and non-living things, capturing the diversity of the dataset. We trained a separate model for each category for both the VASkeGAN and SkeGAN architectures, as done for sketch-rnn in [11]. VASkeGAN was trained for 200000 iterations on the aforesaid sketch categories. The total numbers of training rounds for cat, mosquito, yoga and firetruck sketches are 4, 3, 6 and 4 respectively. The results of SkeGAN and VASkeGAN, along with their implications, are discussed subsequently. We quantitatively assessed the visual appeal of the sketches by performing a Human Turing Test with a group of 45 human subjects, who rated the sketches on a scale of 1 to 5 on the criteria of clarity, drawing skill and naturalness. We have also introduced a metric, the Skescore, to objectively assess sketch generation in vector format.
4.2 Results
Unconditional Generation of SkeGAN: All sketches are generated from the single starting tuple $S_0$. Tuples are then generated until the pen-state $p_3$ equals 1 or the number of generated tuples reaches $N_{max}$. Figure 5 shows some of the sketches generated for the aforesaid categories. Note that the sketches to the left of the separating line in Figure 5 are generated just after the pre-training of $G_\theta$, while the sketches to the right of the separating line are generated by the fully trained model. The visual appeal of the generated images on the right favours the combination of policy gradients and adversarial loss for generating sketches. The images on the left of the separating line indicate that pre-training is essential but not sufficient to generate good sketches. It is very clear from Figure 5 that the 'scribble effect' of [11] is alleviated by SkeGAN.
Sketch Completion by SkeGAN: In order to test the extrapolative abilities of SkeGAN, we feed a partially drawn sketch and observe how it figures out various endings for the incomplete sketch. The generator trained on a particular category is conditioned with an incomplete sketch from that category. The hidden state of the generator after this conditioning, $h_c$, contains the semantic information of the incomplete sketch. Using this information, the remaining tuples of the sketch are sampled from the generator, with $h_c$ as its initial hidden state. Figure 6 shows various completions for the same input sketch at a temperature of 0.25. The completed sketches are indeed meaningful and visually appealing, which highlights the creative aspect of SkeGAN.
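Sketch completion amounts to teacher-forcing the partial sketch through the generator and then sampling until the end state is produced. A minimal sketch of this loop is below; `step_fn` and `sample_fn` are hypothetical stand-ins for the generator's recurrent step and its output sampler:

```python
def complete_sketch(partial, step_fn, sample_fn, h0, max_len):
    """Condition on a partial sketch, then sample the remaining tuples.

    step_fn(h, stroke) -> h    advances the generator's hidden state
    sample_fn(h) -> stroke     samples the next stroke-5 tuple from h
    """
    h = h0
    for stroke in partial:      # teacher-force the human-drawn prefix
        h = step_fn(h, stroke)
    sketch = list(partial)
    while len(sketch) < max_len:
        nxt = sample_fn(h)
        sketch.append(nxt)
        if nxt[4] == 1:         # p3 = 1: the drawing has ended
            break
        h = step_fn(h, nxt)
    return sketch
```

Because the sampler is stochastic, repeated calls with the same prefix yield different endings, which is what Figure 6 visualizes.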
Evaluation Loss of SkeGAN: Figure 7 shows the evaluation losses of both the generator and the discriminator across different categories of sketches. The generator is evaluated with the reconstruction loss (as used in the pre-training stage), while the discriminator is evaluated using the negative log-likelihood. The trends in both evaluation losses indicate that the minimax game played by the generator and discriminator is stable. Hence, the introduction of policy gradients into the GAN has not affected the minimax game or the stability of training, and has led to the generation of visually appealing sketches.
Sketches Generated by VASkeGAN: We allow the trained model to generate sketches after being conditioned on a sketch from a particular class. A sample of the generated images is shown in Figure 8, confirming that the model is indeed generating meaningful sketches and not random strokes. Here too, it is very clear from Figure 8 that the 'scribble effect' of [11] is alleviated by VASkeGAN. Following this, VASkeGAN is trained as a standalone GAN wherein the encoder and the KL-Divergence loss are removed, retaining only the generator (decoder) and the discriminator with the adversarial loss. The sketches generated in this case, for all categories, are just doodles without any discernible entity. This experimentally validates the fact that discrete outputs pose a difficulty for the gradient updates to be passed from the discriminator to the generator, empirically strengthening the argument in favour of the formulation of SkeGAN.
Transfer Learning by VASkeGAN: The weights of the model trained on cat sketches for 200000 iterations are used as an initialization to train two different models on pig and aeroplane sketches. In this case, the training is done for only 100000 iterations. Figure 9 shows the pig and aeroplane sketches generated by transferring the learnt representations across categories. This shows that VASkeGAN generalizes well and is able to transfer knowledge across categories. Transfer learning on SkeGAN led to a mode collapse, which is a direction of future investigation.
Visual Appeal: The Turing Test scores, on a scale of 5, for the aforesaid criteria of visual appeal across the 4 categories of sketches are shown in Figure 10. The averages of the scores for each criterion across the different categories are tabulated in Table 1. It is interesting to note that SkeGAN-generated sketches are almost at par with those in the dataset. This supports our claim that SkeGAN can generate sketches that are as clear, as artistic and as natural as those from the dataset.
Model  Clarity  Drawing Skill  Naturalness

Dataset  2.79 ± 1.01  2.71 ± 1.07  2.60 ± 1.12
SkeGAN  2.25 ± 1.10  2.25 ± 1.04  2.24 ± 1.21
VASkeGAN (GRU)  1.80 ± 0.86  1.74 ± 0.91  1.87 ± 1.12
VASkeGAN (LSTM)  1.87 ± 0.93  1.68 ± 0.79  1.83 ± 1.04
Evaluation of Sketches With Skescore: The Skescores for the proposed models and for the dataset are tabulated in Table 2. It is clear that the Skescore of SkeGAN is the closest to that of the dataset, as compared to VASkeGAN. Therefore, SkeGAN generates 'good' sketches without the scribble effect. This shows that our formulation of SkeGAN is ideal for sketch generation in vector format.
Model  Cat  Fire truck  Mosquito  Yoga

Dataset  0.18 ± 0.07  0.12 ± 0.05  0.16 ± 0.08  0.15 ± 0.07
SkeGAN  0.19 ± 0.09  0.13 ± 0.08  0.18 ± 0.12  0.18 ± 0.10
VASkeGAN (GRU)  0.15 ± 0.07  0.10 ± 0.05  0.13 ± 0.08  0.12 ± 0.06
VASkeGAN (LSTM)  0.15 ± 0.06  0.09 ± 0.05  0.13 ± 0.06  0.13 ± 0.06
5 Discussions
We now present ablation studies related to our models, discuss the best discriminator for VASkeGAN, and compare the training times of SkeGAN with VASkeGAN.
Effect of Temperature $\tau$:
Conditional Generation of VASkeGAN:
We fix a sketch from each category and vary the temperature to see its effect on the reconstruction. The effect of temperature on sketches from models trained with GRU and LSTM discriminators is shown in Figure 11. The images to the left of the separating line in the top and bottom subfigures are the human input to the trained model. To the right of the separating line, in both subfigures, are the sketches conditionally generated by the VASkeGAN model at temperatures of 0.2, 0.4, 0.6, 0.8 and 1.0 respectively. From Figure 11, it can be noticed that as the temperature increases, the "randomness" increases. Under the influence of $\tau$, the parameters $\sigma^2_x$, $\sigma^2_y$ and $\hat{p}_k$ are replaced by $\tau\sigma^2_x$, $\tau\sigma^2_y$ and $\hat{p}_k/\tau$ respectively. Therefore, the higher the value of $\tau$,
the more the influence of the variance, which translates into variations in the sketch generation.
Another observation is that the generated sketches of a particular category have the best visual appeal at a particular temperature. It is also at this temperature that, along with the reconstruction of sketches, extra visually appealing features (not present in the input image) are generated. For example, in Figure 11, the change in position of the whiskers of the cats in the top subfigure and the generation of whiskers on the cat's face in the bottom subfigure are not present in the human input, but are generated so as to make the sketch more natural.
Unconditional Generation of SkeGAN: The effect of temperature on unconditional generation is similar to its effect on conditional generation in the case of VASkeGAN: as $\tau$ increases, the randomness of the generated sketches also increases. Since we are investigating its effect on unconditional generation, there is no ground truth against which to compare the randomness; instead, a group of sketches generated with a particular $\tau$ must be compared with those generated at a different value of $\tau$. The influence of $\tau$ on $\sigma^2_x$, $\sigma^2_y$ and $\hat{p}_k$ is the same as in the conditional generation of VASkeGAN. In addition to this, $\tau$ influences the mixing weights of the GMM by acting as an inverse multiplier on their logits, as in [11]. In Figure 12, there are 5 rows, each depicting the sketches generated by SkeGAN at $\tau$ values of 0.2, 0.4, 0.6, 0.8 and 1.0. We find that a $\tau$ value of 0.4 is ideal for sketch generation based on visual appeal.
Weighting Policy Gradient Loss: In order to understand the effect of the policy gradient loss on training, we multiply the loss by different weights and analyze the effect on the generated sketches and their Skescore. Figure 13 shows the effect of weighting the policy gradient loss on the sketches generated. One can observe that the weightage given to the policy gradient loss and the adversarial loss must be equal in order to generate visually appealing sketches. Therefore, the ideal weight for both the adversarial loss and the policy gradient loss is 1.0.
The variation in Skescores due to the weightage given to the policy gradient loss is tabulated in Table 3. A very important observation is that as the weightage of the policy gradient loss increases, the Skescore increases. Also, a Skescore which is sufficiently close to that of the dataset implies an alleviation of the scribble effect, one of the shortcomings of sketch-rnn [11]. Therefore, we conclude that the policy gradient formulation is ideal for pen-states, as it reduces the scribble effect and helps in generating visually appealing sketches.
Table 3: Skescore (mean ± std) for different weights on the policy gradient loss.

Model          Cat           Firetruck
Dataset        0.18 ± 0.07   0.12 ± 0.05
SkeGAN@0.25    0.18 ± 0.08   0.13 ± 0.07
SkeGAN@0.5     0.18 ± 0.08   0.13 ± 0.07
SkeGAN@1.0     0.19 ± 0.09   0.13 ± 0.08
SkeGAN@2.0     0.19 ± 0.09   0.13 ± 0.08
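The weighted combination of the generator's two losses studied in this ablation can be sketched as follows (a minimal illustration with hypothetical names, not the paper's code; the per-step rewards would come from the discriminator, as in SeqGAN [15]):

```python
import numpy as np

def policy_gradient_loss(log_probs, rewards):
    """REINFORCE-style loss for the discrete pen-states.

    log_probs: log-probabilities the generator assigned to the sampled
    pen-states; rewards: discriminator-derived rewards per step.
    Minimizing this raises the probability of highly rewarded
    pen-state sequences.
    """
    log_probs = np.asarray(log_probs, dtype=np.float64)
    rewards = np.asarray(rewards, dtype=np.float64)
    return -np.mean(log_probs * rewards)

def generator_loss(adv_loss, pg_loss, w_adv=1.0, w_pg=1.0):
    """Weighted sum of the adversarial and policy gradient losses.

    The ablation above found w_adv = w_pg = 1.0 to give the most
    visually appealing sketches.
    """
    return w_adv * adv_loss + w_pg * pg_loss

# Example: two pen-state steps, both rewarded by the discriminator.
pg = policy_gradient_loss([-0.5, -1.0], [1.0, 1.0])
total = generator_loss(adv_loss=2.0, pg_loss=pg)
```

Sweeping `w_pg` while holding `w_adv = 1.0` reproduces the experiment summarized in Table 3.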
Weighting KL Divergence Loss: To understand the effect of weighting the KL divergence loss, we assign it weights of 0.25, 0.5 and 1.0 and analyze the quality of the generated sketches both visually and using the Skescore. Unlike [11], where a higher weight produces images closer to the data manifold, in our case the behaviour is exactly the opposite. Figure 14 plots the training curves for the different weights while training the proposed model on cat sketches, with a GRU and an LSTM respectively as the discriminator. The effect on sketch generation is shown in Figure 15. Visually inspecting these sketches, we conclude that a weight of 0.5 is ideal.
We also analyze the effect of this weight on the Skescores of the sketches. The Skescores of cat sketches for the different weights are tabulated in Table 4. There is no significant change in Skescore as the weight changes, implying that there is no correlation between the weight assigned to the KL divergence loss and the Skescore.
Table 4: Skescore (mean ± std) of cat sketches for different weights on the KL divergence loss.

Model                  Cat
Dataset                0.18 ± 0.07
VASkeGAN(GRU)@0.25     0.15 ± 0.06
VASkeGAN(GRU)@0.5      0.15 ± 0.07
VASkeGAN(GRU)@1.0      0.15 ± 0.06
VASkeGAN(LSTM)@0.25    0.15 ± 0.07
VASkeGAN(LSTM)@0.5     0.15 ± 0.06
VASkeGAN(LSTM)@1.0     0.15 ± 0.06
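The weighted KL term being ablated can be sketched as below (our illustration, assuming the standard Gaussian prior of [1]; the names are hypothetical and not from the paper's code):

```python
import numpy as np

def kl_gaussian(mu, logvar):
    """KL divergence between N(mu, exp(logvar)) and the standard normal prior.

    This is the closed-form KL term of a VAE with a Gaussian encoder [1].
    """
    mu = np.asarray(mu, dtype=np.float64)
    logvar = np.asarray(logvar, dtype=np.float64)
    return -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar))

def vae_objective(recon_loss, mu, logvar, w_kl=0.5):
    """Reconstruction term plus weighted KL term.

    The ablation above sweeps w_kl over {0.25, 0.5, 1.0} and finds
    0.5 to be ideal for VASkeGAN.
    """
    return recon_loss + w_kl * kl_gaussian(mu, logvar)

# Example: one latent dimension with mu = 1, unit variance.
loss = vae_objective(recon_loss=1.0, mu=[1.0], logvar=[0.0], w_kl=0.5)
```

Re-running training with each `w_kl` value and scoring the outputs reproduces the comparison summarized in Table 4.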
Discriminator for VASkeGAN: Based on the visual-appeal scores for the sketches in each category, shown in Figure 10, the best discriminator for VASkeGAN is an LSTM, except for the 'firetruck' class, where a GRU worked better. This highlights the sensitivity involved in choosing a discriminator when training a GAN.
Training-time Comparison: Both VASkeGAN and SkeGAN are trained on an NVIDIA GeForce GTX 1080 Ti. The average time per iteration for VASkeGAN is 0.6 s, giving a total of 33.33 hours to train the model for a given category (200,000 iterations). The average time per iteration of SkeGAN (one generator iteration + two discriminator iterations) is 16.95 s. The total times to train SkeGAN on cat, mosquito, yoga and firetruck sketches are 13.18, 9.88, 19.77 and 13.18 hours respectively, which are much less than the training time of VASkeGAN (33.33 hours). SkeGAN evidently admits faster convergence than VASkeGAN.
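The quoted totals follow directly from the per-iteration timings; a quick sanity check of the arithmetic:

```python
def total_hours(seconds_per_iter, iterations):
    """Convert a per-iteration wall-clock timing into total hours."""
    return seconds_per_iter * iterations / 3600.0

# VASkeGAN: 0.6 s/iteration for 200,000 iterations, ~33.33 hours as stated.
vaskegan_hours = total_hours(0.6, 200_000)

# SkeGAN on cats: 13.18 hours at 16.95 s/iteration implies
# roughly 2,800 (generator + 2x discriminator) iterations.
skegan_cat_iters = 13.18 * 3600 / 16.95
```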
6 Conclusion
In this work, we proposed two GAN-based approaches to address the problem of sketch generation in vector format. Until now, only a handful of approaches, based on VAEs [12, 11, 13, 14], address this problem. We proposed two architectures, viz. SkeGAN and VASkeGAN: the former a standalone GAN trained with policy gradient and adversarial losses, the latter based on VAE-GAN. SkeGAN generates sketches that are better, both qualitatively and quantitatively, than those generated by VASkeGAN, and converges faster. SkeGAN generated sketches that are clear, natural and artistic compared to those from the dataset, while maintaining a stable training process. Most importantly, both VASkeGAN and SkeGAN overcome the "Scribble Effect" of [11], thus highlighting their usefulness. Future directions include generalizing this work to larger vector-art datasets, including cartoons.
References
 [1] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 [2] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
 [3] Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra. DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.
 [4] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
 [5] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, 2017.
 [6] Donggeun Yoo, Namil Kim, Sunggyun Park, Anthony S Paek, and In So Kweon. Pixel-level domain transfer. In European Conference on Computer Vision, pages 517–532. Springer, 2016.

 [7] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4681–4690, 2017.
 [8] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. CVPR, 2017.
 [9] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
 [10] Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, Trevor Darrell, and Alexei Efros. Context encoders: Feature learning by inpainting. In Computer Vision and Pattern Recognition (CVPR), 2016.
 [11] David Ha and Douglas Eck. A neural representation of sketch drawings. In International Conference on Learning Representations, 2018.

 [12] D. Ha. Recurrent net dreams up fake chinese characters in vector format with tensorflow, 2015.
 [13] Yajing Chen, Shikui Tu, Yuqi Yi, and Lei Xu. Sketch-pix2seq: a model to generate sketches of multiple categories. arXiv preprint arXiv:1709.04121, 2017.
 [14] Kimberli Zhong. Learning to draw vector graphics: applying generative modeling to font glyphs. PhD thesis, Massachusetts Institute of Technology, 2018.

 [15] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
 [16] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300, 2015.
 [17] Saul Simhon and Gregory Dudek. Sketch interpretation and refinement using statistical models. In Rendering Techniques, pages 23–32, 2004.
 [18] Patrick Tresset and Frederic Fol Leymarie. Portrait drawing by paul the robot. Computers & Graphics, 37(5):348–363, 2013.

 [19] Ravi Kiran Sarvadevabhatla, Jogendra Kundu, et al. Enabling my robot to play pictionary: Recurrent neural networks for sketch recognition. In Proceedings of the 24th ACM international conference on Multimedia, pages 247–251. ACM, 2016.
 [20] Ravi Kiran Sarvadevabhatla et al. Analyzing structural characteristics of object category representations from their semantic-part distributions. In Proceedings of the 24th ACM international conference on Multimedia, pages 97–101. ACM, 2016.
 [21] Ravi Kiran Sarvadevabhatla et al. Eye of the dragon: Exploring discriminatively minimalist sketch-based abstractions for object categories. In Proceedings of the 23rd ACM international conference on Multimedia, pages 271–280. ACM, 2015.
 [22] Ravi Kiran Sarvadevabhatla, Sudharshan Suresh, and R Venkatesh Babu. Object category understanding via eye fixations on freehand sketches. IEEE Transactions on Image Processing, 26(5):2508–2518, 2017.
 [23] Ravi Kiran Sarvadevabhatla, Shiv Surya, Trisha Mittal, and R Venkatesh Babu. Game of sketches: Deep recurrent models of pictionary-style word guessing. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
 [24] Ravi Kiran Sarvadevabhatla, Isht Dwivedi, Abhijat Biswas, Sahil Manocha, et al. Sketchparse: Towards rich descriptions for poorly drawn sketches using multi-task hierarchical deep networks. In Proceedings of the 25th ACM international conference on Multimedia, pages 10–18. ACM, 2017.
 [25] Jun Yu, Shengjie Shi, Fei Gao, Dacheng Tao, and Qingming Huang. Composition-aided face photo-sketch synthesis. 2017.
 [26] Wengling Chen and James Hays. SketchyGAN: towards diverse and realistic sketch to image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9416–9425, 2018.
 [27] Yongyi Lu, Shangzhe Wu, YuWing Tai, and ChiKeung Tang. Image generation from sketch constraint using contextual gan. In Proceedings of the European Conference on Computer Vision (ECCV), pages 205–220, 2018.
 [28] Jonas Jongejan, Henry Rowley, Takashi Kawashima, Jongmin Kim, and Nick Fox-Gieg. The Quick, Draw! A.I. Experiment, 2016.
 [29] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.
 [30] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.

 [31] Gary B. Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.