StrokeCoder: Path-Based Image Generation from Single Examples using Transformers

03/26/2020 ∙ by Sabine Wieluch, et al. ∙ Universität Ulm

This paper demonstrates how a Transformer Neural Network can be used to learn a Generative Model from a single path-based example image. We further show how a data set can be generated from the example image and how the model can be used to generate a large set of deviated images, which still represent the original image's style and concept.





Introduction

Hand-drawn sketches are often used to quickly illustrate a scene or easily capture a thought. However, stroke- or path-based images only make up a very small part of the current research trends in generative machine learning.

Recent advances in image generation have been achieved with Generative Adversarial Networks [18, 7]. These types of neural nets are well suited to generating pixel-based images and have been used with great success. But Generative Adversarial Networks are not suitable for path-based images, as a sequence and not a pixel grid needs to be processed.

Sketch-based images have also been of interest to other domains like Computer Vision and Human Computer Interaction [4, 29, 6, 12], where one of the most common applications is symbol recognition [26, 21, 10, 17], which can for example be used to recognize letters in handwriting. Here, mostly Convolutional or Recurrent Neural Networks have been used.

Also, the neural path representation differs a lot between research works: for example, Sequence-to-Sequence Variational Autoencoders [9] combined with LSTMs or Mixture Density Networks [16], compared to graph-based representations with Transformers [27, 28].
Transformers [24] are a new class of neural nets which have been especially useful to the Natural Language Processing community. They are used for sequence-to-sequence translation tasks as well as for text generation [19]. Other domains have also used Transformers for various tasks like music generation [11].
In One-Shot Learning, a model is trained with only one or a few examples. This is especially interesting for Generative Models used in creative applications, where one single user is not able to produce a large training data set. For example, it is not feasible for a computer game level designer to create thousands of levels to train a Generative Model, as the effort of creating the data set would exceed the Generative Model’s value. One-Shot Learning has already been applied to various domains like pixel-based image generation [23], pose estimation [1] or speech [14].
In our previous research [25], we examined how Dropout can be used with Generative Adversarial Networks to create different, but coherent images in image-to-image translation tasks. For this research paper we aim to learn a Generative Model from one single path-based image. We focus on generating diverse images that match the input image’s style and concept. The generated images will be especially useful in art or digital fabrication.

Data Structure

In this paper, we aim to learn a Generative Model from one single path-based sketch image. A sketch consists of one or more strokes. Each stroke can be represented by a sequence of curves, where Bézier Curves [20] are the state-of-the-art representation.
In our experiments, we will use cubic Bézier Curves. Here, each curve consists of a start point, an end point and two control points. The control points are often referred to as ”handles”, as they are used to manipulate the curvature.
However, most path-based image editing software does not interpret a path as a sequence of curves, but as a sequence of points. So, each point has an ”in” and an ”out” handle as additional attributes to control the path flow. We will use this representation in our work. Both definitions are depicted in figure 1.
Because we use this simple path representation, it is easy to use common path-based image formats as input images or to export our results to such formats.
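The equivalence of the two curve definitions can be illustrated with a minimal Python sketch. It is not the authors' code: the names PathPoint, to_bezier_segments and evaluate are hypothetical, and the sketch assumes that handles are stored relative to their point.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec = Tuple[float, float]
Segment = Tuple[Vec, Vec, Vec, Vec]  # (start, control 1, control 2, end)


@dataclass
class PathPoint:
    pos: Vec         # absolute position of the point
    handle_in: Vec   # in-handle, assumed relative to pos
    handle_out: Vec  # out-handle, assumed relative to pos


def add(a: Vec, b: Vec) -> Vec:
    return (a[0] + b[0], a[1] + b[1])


def to_bezier_segments(path: List[PathPoint]) -> List[Segment]:
    """Turn a point-wise path into a list of cubic Bezier curves.

    The out-handle of a point and the in-handle of its successor
    become the two control points of the curve between them.
    """
    segments = []
    for a, b in zip(path, path[1:]):
        segments.append(
            (a.pos, add(a.pos, a.handle_out), add(b.pos, b.handle_in), b.pos)
        )
    return segments


def evaluate(segment: Segment, t: float) -> Vec:
    """Evaluate a cubic Bezier curve at parameter t in [0, 1]."""
    u = 1.0 - t
    weights = (u ** 3, 3 * u * u * t, 3 * u * t * t, t ** 3)
    x = sum(w * p[0] for w, p in zip(weights, segment))
    y = sum(w * p[1] for w, p in zip(weights, segment))
    return (x, y)


path = [
    PathPoint((0.0, 0.0), (0.0, 0.0), (10.0, 0.0)),
    PathPoint((30.0, 0.0), (-10.0, 0.0), (0.0, 0.0)),
]
segments = to_bezier_segments(path)
midpoint = evaluate(segments[0], 0.5)
```

A path with n points yields n - 1 curves, so the point-wise form stores each shared point only once.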

One of the most common path-based image formats is Scalable Vector Graphics (SVG) [5]. This graphic format supports paths, texts and geometric primitives like rectangles, ellipses, etc. Scalable Vector Graphics have already been used successfully to learn a neural representation for letters [16]; however, the data structure used there included most of the description types available in SVG. This is not necessary in our case, because all supported shapes can be described as paths. So we will only focus on a neural representation for Bézier paths.


Figure 1: Two types of curve definitions: the left image shows the typical cubic Bézier Curve definition with one start point, one end point and two control handles, which define the curvature. The right image shows a point-wise curve definition, where each point has an in- and out-handle to control the flow through this point. Both definitions are interchangeable. The point-wise definition is used in this work.

To learn a Generative Model from a path-based example image, we want the neural net to be able to predict the next point in the path, dependent on the previous points. This problem is very similar to problems in Natural Language Processing (NLP), where a Generative Model should predict the next word dependent on the previous words in the text. For this reason, we will use a neural net with a Transformer architecture [24] which is the basic architecture for most recent NLP Models like BERT [3], Transformer-XL [2] or GPT-2 [19].
To use the Transformer, we need a single sequence of points. Our path-based images consist of a sequence of paths and paths consist of a sequence of points. So the images are easily flattened to a single sequence.
Each point consists of:

  • The point’s relative position to the previous point. If there is no previous point, it is relative to a fixed reference point.

  • The in-handle position, relative to the point’s position.

  • The out-handle position, relative to the point’s position.

This combined information will further be referred to as a ”move”, as the neural net chooses between next moves to draw the path further.
We also add two special moves to our data sequence:

  • Path End: marks the end of a path, so that the following point’s position is again relative to the fixed reference point.

  • Image End: is the last move of an image.

As we want our Model to be able to generate multiple images in a row, we feed in as many sequence sections as our data set provides. As we have no sequence-to-sequence problem, we do not need padding to fill in empty places in a sequence section. We also do not need to remove too long sequences, as we only work on sequence sections.
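The flattening described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the reference point is assumed to be the origin (ORIGIN), the special moves are represented by the hypothetical string tokens PATH_END and IMAGE_END, and the sequence sections are assumed to be sliding windows whose target is the input shifted by one move.

```python
from typing import List, Tuple, Union

Vec = Tuple[float, float]
# A point is (absolute position, in-handle, out-handle); handles are relative.
Point = Tuple[Vec, Vec, Vec]
Move = Union[Tuple[Vec, Vec, Vec], str]

PATH_END = "PATH_END"      # hypothetical token name
IMAGE_END = "IMAGE_END"    # hypothetical token name
ORIGIN: Vec = (0.0, 0.0)   # assumption: reference for the first point of a path


def flatten_image(paths: List[List[Point]]) -> List[Move]:
    """Flatten one image (a list of paths) into a single move sequence."""
    moves: List[Move] = []
    for path in paths:
        previous = ORIGIN  # each path starts relative to the reference point
        for pos, handle_in, handle_out in path:
            relative = (pos[0] - previous[0], pos[1] - previous[1])
            moves.append((relative, handle_in, handle_out))
            previous = pos
        moves.append(PATH_END)
    moves.append(IMAGE_END)
    return moves


def flatten_data_set(images: List[List[List[Point]]]) -> List[Move]:
    """Concatenate all images so the model can draw several images in a row."""
    sequence: List[Move] = []
    for image in images:
        sequence.extend(flatten_image(image))
    return sequence


def sections(moves: List[Move], length: int):
    """Cut fixed-length (input, target) sections; no padding is required."""
    return [
        (moves[i:i + length], moves[i + 1:i + length + 1])
        for i in range(len(moves) - length)
    ]


image = [
    [((10.0, 10.0), (0.0, 0.0), (5.0, 0.0)), ((20.0, 10.0), (-5.0, 0.0), (0.0, 0.0))],
    [((3.0, 4.0), (0.0, 0.0), (0.0, 0.0))],
]
flat = flatten_image(image)
pairs = sections(flat, 4)
```

Because every section has the same length, neither padding nor truncation of long sequences is needed.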

Data Set Generation

When learning Generative Models from few natural images [23], the images are altered into a variety of so-called patches. These patches are cut-out parts of the original images, which are also often slightly deformed, scaled or changed in other ways to produce a larger amount of training data than the initial images would have provided.
To learn a Generative Model on one single path-based image, we propose a similar method: the initial paths are altered in different ways to produce a large and diverse training data set. As path-based images differ a lot from natural, pixel-based images, the altering methods need to be adapted accordingly. All proposed altering methods are visualized in figure 2 and will be described below:


Figure 2: Five types of path manipulations used to create a large data set from one example. Translation, Rotation and Scaling are used on the whole path-based image, whereas Path Reversal and Jitter are used on individual paths in the image. Not depicted is Path Order Shuffle, where the stroke order in the image is changed in a random manner.
  • Translation: The whole path-image is moved to a new position such that it is still contained in the initial image boundaries. This way, the relation between the path points is unchanged.

  • Rotation: The whole path-image is rotated by a random angle. Here, the distance relations between path points also remain unchanged.

  • Path Order Shuffle: Each path-based image contains a list of paths, therefore each image has an implicit path order. This ordering is irrelevant in our setting as we are only interested in the final resulting image and not the drawing progress. So, we can randomly shuffle the path order to generate new patch variations.

  • Path Reversal: As each path consists of a list of points, a path has an implicit direction. In our setting, the path direction is not important, so a path can be reversed to generate new patches. The path direction might be important for other settings like sketch or handwriting classification [27], where the stroke direction and path order are very similar in one letter. As Path Reversal is a binary state (either the path is reversed or not), this manipulation is applied with a probability of 0.5.

  • Jitter/Noise: Because we only work with one path-based input image, we have very sparse information: we only have one single image, and path-based images in themselves contain very little information. To add more variation, we apply slight noise to the point positions; the point handles stay unchanged so that the path is not deformed too much. So each point is translated by a random vector that is bounded by a chosen threshold.

  • Scaling: The whole path-based image is scaled to a smaller size. It is important that this step is performed after the noise induction step, because the noise would have a much larger impact on the smaller image. The scaling operation also changes the point distances, but the relative point distance relations are still coherent.
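The altering methods listed above can be sketched in Python as follows. This is a simplified illustration, not the authors' code. Several details are assumptions: noise is drawn uniformly per coordinate, rotation is performed around the image center, scaling is performed around the origin, and translation is applied last so that the patch ends up inside the image boundary. All function names are hypothetical.

```python
import math
import random
from typing import List, Tuple

Vec = Tuple[float, float]
Point = Tuple[Vec, Vec, Vec]  # (absolute position, in-handle, out-handle)
Path = List[Point]
Image = List[Path]


def translate(image: Image, dx: float, dy: float) -> Image:
    return [[((p[0] + dx, p[1] + dy), hi, ho) for p, hi, ho in path] for path in image]


def rotate(image: Image, angle: float, center: Vec) -> Image:
    c, s = math.cos(angle), math.sin(angle)

    def rot(v: Vec) -> Vec:
        return (c * v[0] - s * v[1], s * v[0] + c * v[1])

    rotated = []
    for path in image:
        new_path = []
        for p, hi, ho in path:
            r = rot((p[0] - center[0], p[1] - center[1]))
            # Handles are relative vectors, so they are rotated without offset.
            new_path.append(((r[0] + center[0], r[1] + center[1]), rot(hi), rot(ho)))
        rotated.append(new_path)
    return rotated


def scale(image: Image, factor: float) -> Image:
    def sc(v: Vec) -> Vec:
        return (v[0] * factor, v[1] * factor)

    return [[(sc(p), sc(hi), sc(ho)) for p, hi, ho in path] for path in image]


def reverse_path(path: Path) -> Path:
    # Reversing the direction swaps the roles of in- and out-handle.
    return [(p, ho, hi) for p, hi, ho in reversed(path)]


def jitter(image: Image, threshold: float, rng: random.Random) -> Image:
    # Only positions are moved; handles stay unchanged.
    def noise() -> float:
        return rng.uniform(-threshold, threshold)

    return [[((p[0] + noise(), p[1] + noise()), hi, ho) for p, hi, ho in path]
            for path in image]


def make_patch(image: Image, rng: random.Random, size: float = 100.0,
               noise: float = 0.0, min_scale: float = 1.0) -> Image:
    patch = [reverse_path(p) if rng.random() < 0.5 else list(p) for p in image]
    rng.shuffle(patch)  # path order shuffle
    patch = rotate(patch, rng.uniform(0.0, 2.0 * math.pi), (size / 2, size / 2))
    if noise > 0.0:
        patch = jitter(patch, noise, rng)
    patch = scale(patch, rng.uniform(min_scale, 1.0))  # scaling after the noise step
    xs = [p[0] for path in patch for p, _, _ in path]
    ys = [p[1] for path in patch for p, _, _ in path]
    # Assumes the rotated drawing still fits into the size x size boundary.
    dx = rng.uniform(-min(xs), size - max(xs))
    dy = rng.uniform(-min(ys), size - max(ys))
    return translate(patch, dx, dy)
```

Calling make_patch repeatedly with the same input image, for example 1500 times, would yield one training data set in this sketch.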

In our experiments, we use path-based images with an image boundary of 100x100 units. The initial images are hand-drawn with a digitizer pen, from which only points are recorded. The point sequence is then simplified to remove unnecessary points, so that the hand-drawn path is described by a few curves instead of many points. To this end, a sequence of as few curves as possible is fitted through the hand-drawn path’s points with an allowed maximum error. The fitting algorithm is described by Schneider [22] and the result is visualized in figure 3.


Figure 3: Simplification of a path: the left image depicts a hand-drawn path, where on each mouse event a new point is added to the path. The right side shows a simplified version, where multiple points can be substituted with a curve.

Transformer Architecture

A Transformer [24] consists of an encoder and a decoder, as it is usually used for sequence-to-sequence translation tasks. However, in our setting we aim to learn a Generative Model and therefore we will only use a Transformer Decoder.
But before we can use our Decoder, the input needs to be prepared accordingly. In the last sections, we described how a training data set can be generated from one image and how the move sequence is constructed. The moves now need to be converted such that they can easily be used by a neural net. In Natural Language Processing, using word sequences as input is a very similar problem. Here, word embeddings [15] are used to encode words into vectors that can be used as neural net input.
Early experiments without embedding and separate flags for path-end or image-end showed that the Transformer tends to generate illegal move sequences like an image-end without a path-end before.
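By analogy with word embeddings, each distinct move, including the two special moves, can be treated as one token of a vocabulary whose integer ids index an embedding table. The following Python sketch illustrates this idea only; the rounding step in quantize and all function names are assumptions that are not taken from the paper.

```python
from typing import Dict, Hashable, List


def quantize(move, digits: int = 1):
    """Round all coordinates so that nearly identical moves share one token.

    The rounding precision is an assumption made for this sketch.
    Special moves (strings) are passed through unchanged.
    """
    if isinstance(move, str):
        return move
    return tuple(tuple(round(c, digits) for c in vec) for vec in move)


def build_vocabulary(sequences: List[List[Hashable]]) -> Dict[Hashable, int]:
    """Assign a consecutive integer id to every distinct move."""
    vocabulary: Dict[Hashable, int] = {}
    for sequence in sequences:
        for move in sequence:
            if move not in vocabulary:
                vocabulary[move] = len(vocabulary)
    return vocabulary


def encode(sequence: List[Hashable], vocabulary: Dict[Hashable, int]) -> List[int]:
    """Replace each move by its id; the ids index the embedding table."""
    return [vocabulary[move] for move in sequence]


sequence = [
    ((1.04, 2.0), (0.0, 0.0), (0.5, 0.0)),
    "PATH_END",
    ((1.01, 2.0), (0.0, 0.0), (0.5, 0.0)),
    "PATH_END",
    "IMAGE_END",
]
quantized = [quantize(move) for move in sequence]
vocabulary = build_vocabulary([quantized])
token_ids = encode(quantized, vocabulary)
```

Since the special moves are ordinary tokens in the same vocabulary, the model predicts them the same way as drawing moves.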

In addition to the embedded move sequence, a positional encoding is needed to provide positional information to the Transformer network. In our research, we use the standard Transformer positional encoding.
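Assuming that the standard encoding refers to the sinusoidal positional encoding of the original Transformer [24], it can be sketched in plain Python as follows. The function name positional_encoding is hypothetical; the values 16 and 52 are the sequence length and hidden embedding size listed in the experiment settings.

```python
import math
from typing import List


def positional_encoding(seq_len: int, d_model: int) -> List[List[float]]:
    """Sinusoidal encoding: sine on even dimensions, cosine on odd ones.

    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    """
    table = []
    for pos in range(seq_len):
        row = []
        for dim in range(d_model):
            angle = pos / (10000 ** ((2 * (dim // 2)) / d_model))
            row.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


encoding = positional_encoding(16, 52)
```

Each row of the table is added element-wise to the embedded move at the same sequence position.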


The Transformer Decoder consists of multiple layers, which end in a Linear and Softmax Layer. Decoder Layers can be stacked on top of each other any number of times.

One Decoder Layer consists of three sub-layers. The first and second sub-layers are Multi-Head-Attention Layers, which are a number of parallel attention layers whose outputs are concatenated and finalized with a linear layer. A self-attention layer gives the neural net the ability to focus more on certain moves or ignore other moves in the sequence. A mask is applied to the first attention layer to prevent the neural net from seeing ”future” sequence elements. Each of these two sub-layers ends with a layer normalization. The second sub-layer would usually receive information from the Transformer Encoder, but as we only use the Decoder, this connection is not used. The third sub-layer is a feed-forward network, also ending with a layer normalization. Figure 4 gives a visual overview of the architecture we use.
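The effect of the mask on the first attention sub-layer can be illustrated with a single attention head in plain Python. This is a didactic sketch of scaled dot-product attention as defined in [24], without learned projections, multiple heads or layer normalization; the function names are hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def causal_mask(n: int) -> List[List[bool]]:
    """Position i may only attend to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def softmax(values: List[float]) -> List[float]:
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def masked_self_attention(q: Matrix, k: Matrix, v: Matrix) -> Tuple[Matrix, Matrix]:
    """Scaled dot-product attention with a causal mask (single head)."""
    n, d = len(q), len(q[0])
    mask = causal_mask(n)
    outputs, weights = [], []
    for i in range(n):
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
            if mask[i][j] else float("-inf")  # masked: weight becomes 0
            for j in range(n)
        ]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w[j] * v[j][c] for j in range(n)) for c in range(len(v[0]))])
    return outputs, weights


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = masked_self_attention(x, x, x)
```

In the resulting weight matrix, every entry above the diagonal is zero, so the prediction for a move never depends on later moves.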


Figure 4: Transformer Decoder Net as used in this work: after the input is embedded and receives positional encoding information, multiple self-attention layers and a feed-forward net process the sequence. Any number of Decoder Layers can be stacked. The Decoder Layer output is passed through a linear and a softmax layer to obtain the final one-hot encoded vector, which can then be used to read the final move from the embedding.

For our training, we use the Adam optimizer [13] with the learning rate schedule described in [24], where the learning rate is first increased linearly during the warm-up steps and decreased thereafter.
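The schedule from [24] can be written as a short Python function. The default of 4000 warm-up steps is the value used in [24] and is only an assumption here, as the number of warm-up steps used in this work is not stated; d_model = 52 corresponds to the hidden embedding size listed in the experiment settings.

```python
def transformer_learning_rate(step: int, d_model: int = 52,
                              warmup_steps: int = 4000) -> float:
    """Learning rate schedule from [24].

    lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)

    The rate grows linearly for the first warmup_steps steps and then
    decays proportionally to the inverse square root of the step number.
    """
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


rates = [transformer_learning_rate(s) for s in (100, 200, 4000, 8000)]
```

The peak learning rate is reached exactly at the last warm-up step.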

As a loss function, we use Cross Entropy Loss.

Experiments and Results

For our experiments, we recorded 5 sketches, which vary in path length, number of paths per image and look. The samples can be seen in figure 5.
For all of our experiments, we used the following Transformer settings:

  • Batch Size: 500

  • Sequence Length: 16

  • Epochs: 100

  • Hidden Embedding Size: 52

  • Decoder Layers: 6

  • Attention Heads: 4

  • Data Set: 1500 patches


Figure 5: Five sample sketches, which were recorded for our experiments. Each path is drawn in a random color to distinguish between them. The images and their corresponding data sets are called ”waves”, ”flower”, ”boxes”, ”spirals” and ”triangles”.

In our first basic experiment, we used Translation, Rotation, Path Order Shuffle and Path Reversal to generate training data sets of 1500 images. We trained the Transformer for 100 epochs, but did not shuffle the input data order. Therefore the neural net is not able to generalize well and might also be biased by the data seen before.
Figure 6 shows the generation results from the ”triangle” and ”waves” data set: 16 images were generated and arranged in a 4x4 grid. Each image is colored in a random color to better distinguish which path belongs to which image. The images were generated in one process, so each image is generated in one sequence with the previous image. Because of the unshuffled training data, the model can easily get stuck in a generative loop. The generated images in figure 6 depict this clearly: the Generative Model either generates one single image over and over or switches back and forth between two different images. Also, the images do not resemble the input example very well.
The results differ a lot if the training data set is shuffled for each epoch, which can be seen in figure 7. The resulting images differ from each other and resemble the original input image very well. The generated images also show slight differences from the original images, like non-closed tips at the triangle tops. This is a desirable result, because it shows that the learned Model does not always replicate the input image exactly, but that the training data set creates enough variety for slight deviations.


Figure 6: Generated path-based images from the ”triangle” and ”waves” data set: because the input training data was not shuffled, the Generative Model can easily get stuck in a generative loop.


Figure 7: Generated path-based images from the ”triangle” and ”waves” data set with shuffled input data: the images closely resemble the original image but also show slight deviations like non-closed triangle tops.

Scaling Images in Data Set

In the next experiment step, we added the Scaling operation to our patch creation process. The patches could be scaled down to a minimum of 0.5 of the original size. The results can be seen in figure 8: scaled-down variations of the original image appear in the generated image set. However, the generated images still very closely resemble the original. With the introduction of the scaling operation, we hoped to see a higher deviation between original and generated image. But the model tends to overfit easily, which is very likely due to the generated moves being too few or rather the move-space being too large. A solution to this problem will be discussed in the conclusion.


Figure 8: Generated path-based images from the ”flower” and ”spirals” data set: with the scaling operation added to the patch generation process, scaled-down versions also appear. However, the generated images still closely resemble the original, which hints at an overfitting problem.

Getting Creative: Generating more Diverse Paths

Another approach to add more variety to the generated images is to add noise to the patch generation. Here, all points on the path are slightly translated by a random vector that is bounded by a threshold. For our experiment, we chose the threshold such that the change is visible but the original image is not distorted too much.
The result can be seen in figure 9: the model was trained on the ”boxes” data set with all 6 manipulation options described in the Data Set Generation section. Even with this high noise level, the generated images still closely resemble the original image and do not form deviations, except for the heavy distortion from the added noise.
A better solution to create interesting deviations is to utilize the Transformer net structure: the last two layers (a linear and a softmax layer) produce a vector which can be interpreted as a list of probabilities for each possible move. Usually, the move with the highest probability is chosen and added to the next iteration’s input.
However, it is also possible to choose, for example, the move with the second highest probability. In the experiment depicted in figure 10, we chose the second best move with a probability of 25%. Now the generated images trained on the ”boxes” data set clearly show deviations from the original image. Strokes no longer only form rectangles but also other angular shapes. However, most of the deviating images look confusing. Also, single short strokes tend to appear, which look very disconnected from the image.
This approach can be improved by only choosing the second best move if it is likely to be a ”good” move, thus having a similarly high probability as the best move. In our last experiment, we chose a threshold to distinguish between a ”good” and a ”bad” second-best move. So, only if the probability difference between the first and second best moves was smaller than this threshold, the second best move was chosen, again with a probability of 25%. The resulting generated images can be seen in figure 11. The images still resemble the structure of the original image. Some images differ more than others, but all fit very well into the look of the ”boxes” data set.
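Both move selection variants can be sketched with one small Python function. The 25% chance is taken from the experiments above; the parameter name max_gap and the example threshold values are hypothetical, as the threshold actually used is not given here.

```python
import random
from typing import List, Optional


def choose_move(probabilities: List[float], rng: random.Random,
                second_best_chance: float = 0.25,
                max_gap: Optional[float] = None) -> int:
    """Return the index of the next move.

    With max_gap=None, the second best move is chosen with probability
    second_best_chance. With a max_gap, it is only eligible if its
    probability is less than max_gap below that of the best move.
    """
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: probabilities[i], reverse=True)
    best = ranked[0]
    if len(ranked) < 2:
        return best
    second = ranked[1]
    gap = probabilities[best] - probabilities[second]
    is_good_alternative = max_gap is None or gap < max_gap
    if is_good_alternative and rng.random() < second_best_chance:
        return second
    return best


rng = random.Random(0)
softmax_output = [0.1, 0.6, 0.3]
# Unconditional variant: the second best move appears in about 25% of draws.
free_choices = [choose_move(softmax_output, rng) for _ in range(2000)]
# Thresholded variant: the gap of 0.3 is too large, so the best move always wins.
strict_choices = [choose_move(softmax_output, rng, max_gap=0.1) for _ in range(2000)]
```

Because the selection happens at generation time, second_best_chance and max_gap can be varied without retraining the model.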
Compared to the noise-induced data set, this probability-based method produces superior results and provides multiple benefits:

  • No additional change is needed in the data set generation phase.

  • The intensity of the deviated image generation can be controlled after the model is trained, so no additional time and resources are needed.

  • The results more closely resemble the concept of the original input image, as the deviation results from choosing a similar likely next move.


Figure 9: Generated path-based images from the ”boxes” data set with noise added in the patch generation phase. The resulting images look distorted and do not form any interesting deviations from the original image except the added noise.


Figure 10: Generated path-based images from the ”boxes” data set, where the second most likely move is chosen with a probability of 25%. The resulting images form deviations from the original image, though they look very unstructured and do not resemble the concept of the original image well.


Figure 11: Generated path-based images from the ”boxes” data set, where the second most likely move is chosen with a probability of 25% only if the second best move has a similar probability as the first best move. The resulting images form very pleasant deviations from the original image and resemble the original image’s concept very well.


Conclusion

In our work, we showed that a Transformer neural net can be used to learn and generate path-based images from one single input image. We generated large training data sets from one image, using different path altering methods. Five hand-drawn image samples were recorded and used to generate our training data sets. We experimented with different configurations of path altering methods and found that Translation, Rotation, Path Reversal and Scaling were helpful tools to generate large data sets, but the resulting generated images were very close to the original image and showed hardly any deviations. Shuffling the training data each epoch is also important to prevent a bias in the Generative Model. This could be observed very well, as a model trained with non-shuffled data often got stuck in generative loops.
To generate interesting images that deviated from the original, we proposed two methods to evoke said deviation: additional noise on the training data set and choosing the second most likely move instead of the first at a given probability. Both methods resulted in very different outcomes: generated images from the noisy training data set looked very distorted and often did not resemble the original image’s concept any more.
In contrast, the move choosing manipulation worked very well and resulted in interesting images that deviated from the original but still clearly reflected the original concept. This method worked especially well if the second best move was only chosen when the probability distance between the first and second best move was very small, so that the second best move was a similarly ”good” choice.
The basic generated images (without deviation methods) very closely resembled the original input image. This is most likely due to a too large move-space to be embedded. So, each move is so different from the other moves, that it is very clear that it belongs to a certain training image. To suppress this overfitting, future work should not use full curves as moves, but divide moves into smaller sub-moves, so that more similar moves are created. A similar approach is used in Natural Language Processing, where words are divided into sub-words to create a more common word basis.
In our future research we want to expand these generative methods to create larger path-based pattern images from one input image. For this it might be interesting to use hierarchical approaches, as they have been used successfully in other domains like dialogue generation (serban2017hierarchical). A hierarchical approach might be helpful to give the neural net an overview of already generated paths and their places beyond the sequence length memory.
Another interesting field is Co-Creative Design [8], where a user cooperates with an artificial agent that supports them in a design task. Here, a path-based image representation can be very useful, especially in the domain of digital fabrication. It will be interesting to explore these applications and domains in future research.


  • [1] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and M. J. Black (2016) Keep it smpl: automatic estimation of 3d human pose and shape from a single image. In European Conference on Computer Vision, pp. 561–578. Cited by: Introduction.
  • [2] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov (2019) Transformer-xl: attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860. Cited by: Data Structure.
  • [3] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: Introduction, Data Structure.
  • [4] M. Eitz, J. Hays, and M. Alexa (2012) How do humans sketch objects?. ACM Transactions on graphics (TOG) 31 (4), pp. 1–10. Cited by: Introduction.
  • [5] J. Ferraiolo, F. Jun, and D. Jackson (2000) Scalable vector graphics (svg) 1.0 specification. iuniverse Bloomington. Cited by: Data Structure.
  • [6] D. Gasques, J. G. Johnson, T. Sharkey, and N. Weibel (2019) What you sketch is what you get: quick and easy augmented reality prototyping with pintar. In Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 1–6. Cited by: Introduction.
  • [7] I. Goodfellow (2016) NIPS 2016 tutorial: generative adversarial networks. arXiv preprint arXiv:1701.00160. Cited by: Introduction.
  • [8] M. Guzdial, N. Liao, and M. Riedl (2018) Co-creative level design via machine learning. arXiv preprint arXiv:1809.09420. Cited by: Conclusion.
  • [9] D. Ha and D. Eck (2017) A neural representation of sketch drawings. arXiv preprint arXiv:1704.03477. Cited by: Introduction.
  • [10] J. He, X. Wu, Y. Jiang, B. Zhao, and Q. Peng (2017) Sketch recognition with deep visual-sequential fusion model. In Proceedings of the 25th ACM international conference on Multimedia, pp. 448–456. Cited by: Introduction.
  • [11] C. A. Huang, A. Vaswani, J. Uszkoreit, I. Simon, C. Hawthorne, N. Shazeer, A. M. Dai, M. D. Hoffman, M. Dinculescu, and D. Eck (2018) Music transformer: generating music with long-term structure. Cited by: Introduction.
  • [12] P. Karimi, M. L. Maher, N. Davis, and K. Grace (2019) Deep learning in a computational model for conceptual shifts in a co-creative design system. arXiv preprint arXiv:1906.10188. Cited by: Introduction.
  • [13] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: Transformer Architecture.
  • [14] B. Lake, C. Lee, J. Glass, and J. Tenenbaum (2014) One-shot learning of generative speech concepts. In Proceedings of the Annual Meeting of the Cognitive Science Society, Vol. 36. Cited by: Introduction.
  • [15] O. Levy and Y. Goldberg (2014) Neural word embedding as implicit matrix factorization. In Advances in neural information processing systems, pp. 2177–2185. Cited by: Transformer Architecture.
  • [16] R. G. Lopes, D. Ha, D. Eck, and J. Shlens (2019) A learned representation for scalable vector graphics. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7930–7939. Cited by: Introduction, Data Structure.
  • [17] A. Prabhu, V. Batchu, S. A. Munagala, R. Gajawada, and A. Namboodiri (2018) Distribution-aware binarization of neural networks for sketch recognition. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 830–838. Cited by: Introduction.
  • [18] A. Radford, L. Metz, and S. Chintala (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434. Cited by: Introduction.
  • [19] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI Blog 1 (8), pp. 9. Cited by: Introduction, Data Structure.
  • [20] D. Salomon (2007) Curves and surfaces for computer graphics. Springer Science & Business Media. Cited by: Data Structure.
  • [21] R. K. Sarvadevabhatla and J. Kundu (2016) Enabling my robot to play pictionary: recurrent neural networks for sketch recognition. In Proceedings of the 24th ACM international conference on Multimedia, pp. 247–251. Cited by: Introduction.
  • [22] P. J. Schneider (1990) An algorithm for automatically fitting digitized curves. In Graphics gems, pp. 612–626. Cited by: Data Set Generation.
  • [23] T. R. Shaham, T. Dekel, and T. Michaeli (2019) Singan: learning a generative model from a single natural image. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4570–4580. Cited by: Introduction, Data Set Generation.
  • [24] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: Introduction, Data Structure, Transformer Architecture, Transformer Architecture.
  • [25] S. Wieluch and F. Schwenker (2019) Dropout induced noise for co-creative gan systems. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 0–0. Cited by: Introduction.
  • [26] P. Xu, Y. Huang, T. Yuan, K. Pang, Y. Song, T. Xiang, T. M. Hospedales, Z. Ma, and J. Guo (2018) Sketchmate: deep hashing for million-scale human sketch retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8090–8098. Cited by: Introduction.
  • [27] P. Xu, C. K. Joshi, and X. Bresson (2019) Multi-graph transformer for free-hand sketch recognition. arXiv preprint arXiv:1912.11258. Cited by: Introduction, 4th item.
  • [28] P. Xu, Z. Song, Q. Yin, Y. Song, and L. Wang (2020) Deep self-supervised representation learning for free-hand sketch. arXiv preprint arXiv:2002.00867. Cited by: Introduction.
  • [29] Q. Yu, F. Liu, Y. Song, T. Xiang, T. M. Hospedales, and C. Loy (2016) Sketch me that shoe. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 799–807. Cited by: Introduction.