The characterization and extraction of a human style profile from a given human activity (such as speech, handwriting, or human interaction) is an open research problem. Styles usually have no clear definition, which makes style extraction an ill-posed problem. For generative models, taking styles into account allows more personalized generation.
In this paper, we look at the problem from the angle of handwriting generation. Ideally, given a letter from a writer, we would like to have information about the letter symbol (the character) and about the factors that characterize its shape (which, by default, characterize the writer). By doing so, we can: i) better study what constitutes the human profile, and ii) produce more human-acceptable samples.
In this paper, we discuss 4 methods to capture the style, to be used for biasing handwriting generation. The handwriting generation and two of the proposed style methods are based on the state of the art in deep learning. We then propose our performance metrics and the reasoning behind them. The cardinal power of the style methods (their ranking by the amount of style information they carry) is known beforehand, which allows us to validate our choice of evaluation metrics.
II Related work
Some of the remarkable recent advances in deep learning happened in the area of generative models. For static data, such as images, the work done using Variational Autoencoders and Generative Adversarial Networks has shown remarkable results.
In contrast, handling temporal data, such as generating tracings from a letter/writer embedding, is more challenging: the data is sequential, and it is difficult to maintain coherence over long sequences. Advances in neural network cell structures, like LSTM and GRU [6, 7], have shown impressive results in handling long-term dependencies in temporal sequences.
These advances later allowed the development of state-of-the-art neural network architectures for generating biased temporal sequences, specifically for text generation and image captioning, which show impressive results [8, 9, 10, 11]. More applications have been explored since then, like music generation and speech synthesis.
The generation of continuous data has always been tricky. Graves combined Long Short-Term Memory (LSTM) networks with Mixture Density Networks (MDN) to generate continuous handwritten characters, using the IAM Handwriting Database. While the results are impressive, the MDN approach is quite difficult to train. Another possible approach for generating continuous tracings is Gaussian Scale Mixtures (GSM).
In order to simplify the procedure and focus on our investigation of styles, we discretized the tracings using Freeman codes for direction and a quantized speed (see Section III-B for more details), and applied a softmax to the output of the last layer instead of an MDN. This was inspired by the results reported in [18, 13], which show impressive results in the discrete domain given a good discretization policy. A categorical distribution is more flexible and generic than a continuous distribution, and requires no assumption about the shape of the data distribution.
Recently, interesting work has been done on style extraction in the area of speech synthesis [19, 20]. These works extract a number of style tokens. They evaluate the performance of their system via classical subjective rating of voice, and show that these tokens relate to some aspects of the speech prosody and the speaker's voice.
On the evaluation side, there has been a lot of progress in developing performance metrics for image captioning and machine translation. Metrics like BLEU, METEOR and CIDEr are considered the state of the art in image captioning and machine translation evaluation. Traditionally, the evaluation of these kinds of applications is subjective. But with the advance of machine and statistical learning, there was a need to develop metrics that are cheap to evaluate, yet correlate well with human evaluation.
III Dataset and Pre-Processing
The dataset we chose is the IRON-OFF Cursive Handwriting Dataset. While better-known handwriting datasets, like the IAM Handwriting Database, are already available, our dataset provides separated and labeled letters instead of entire sentences, thus allowing us to focus more on the problem of styles. A quick summary of this dataset is given below:
- 700 writers in total. We use the 412 writers who have written isolated letters.
- 10,685 isolated lower case letters.
- 10,679 isolated upper case letters, e.g. see Fig 1.
- 410 euro signs.
- 4,086 isolated digits.
- Gender, handedness, age and nationality are available for all writers.
For each letter, we have the letter image (around 167x214 pixels, at a resolution of 300 dpi) and the timed pen movement sequence, comprising continuous X, Y and pen pressure, as well as the discrete pen state. This data is sampled at 100 points per second on a Wacom UltraPad A4.
One particular challenge in this dataset is that each writer wrote the letters only once. Since we are focusing on the styles, this makes it particularly challenging for us. We do not use the pressure or the pen state, in order to simplify the model.
All the images of letters have been denoised and cropped in order to focus on the letters. Then, the images have been down-scaled to 28x28 pixels.
We cleaned the selected motion-captured isolated letters by removing frames related to false starts or corrections and extra strokes, as well as by removing entire tracings whose duration exceeds 1 second, in particular due to lengthy pen-up durations. All tracings exceeding 99 time steps have been discarded from the dataset as well.
All the letter tracings are represented as two modalities: Freeman codes for direction and quantized speed (see Section III-B1).
III-B1 Freeman coding for direction and quantizing speed
Freeman codes belong to a family of compression algorithms called chain codes. These algorithms are useful for encoding an image that contains connected components. They are considered compression algorithms because they can transform a sparse matrix into a sequence of codes that takes only a small fraction of the size of the image. The original Freeman codes have 2 versions: 4-directional codes and 8-directional codes. Both are fairly simple, as they encode each direction with a unique number (from 0 to n-1, where n is the number of directions). A direction is defined in the image as the directed vector connecting two neighbouring pixels on the contour of a connected component.
In our work, we compute the direction angle between every two consecutive points. Then, we convert each direction to its corresponding Freeman code symbol, as shown in Fig 2. Then, we perform one-hot encoding on the direction and feed it to our network. In order to have a faithful reconstruction of the letters, we also quantize the speed of each displacement.
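As a rough illustration of the encoding described above, the following Python sketch maps consecutive pen positions to direction codes and quantized speeds. The 16 directions follow the 16 direction symbols mentioned later in the model description. The uniform speed binning, the `max_step` normalizer and all function names are assumptions made for this example, not the exact procedure used in this work.

```python
import math

N_DIRECTIONS = 16   # assumed from the 16 direction symbols used by the model
N_SPEED_BINS = 16   # assumed: 16 speed levels


def freeman_code(p, q, n_directions=N_DIRECTIONS):
    """Index of the nearest of n equally spaced directions for the move p -> q."""
    angle = math.atan2(q[1] - p[1], q[0] - p[0]) % (2 * math.pi)  # in [0, 2*pi)
    sector = 2 * math.pi / n_directions
    return int(round(angle / sector)) % n_directions


def speed_bin(p, q, max_step, n_bins=N_SPEED_BINS):
    """Uniformly quantize the displacement length (hypothetical binning policy).

    With a fixed sampling rate, the displacement between two consecutive
    points is proportional to the pen speed.
    """
    dist = math.hypot(q[0] - p[0], q[1] - p[1])
    return min(int(dist / max_step * n_bins), n_bins - 1)


def one_hot(index, size):
    """One-hot vector with a 1 at position `index`."""
    return [1 if i == index else 0 for i in range(size)]


def encode_trace(points, max_step):
    """Return (direction codes, speed codes) for a list of (x, y) points."""
    pairs = list(zip(points, points[1:]))
    directions = [freeman_code(p, q) for p, q in pairs]
    speeds = [speed_bin(p, q, max_step) for p, q in pairs]
    return directions, speeds
```

Note that the sign convention of the Y axis (image coordinates versus mathematical coordinates) only permutes the direction codes, so either convention works as long as it is applied consistently.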
IV-A Model selection
Obtaining good generation quality from our model was quite challenging, due to the issues mentioned in Section III-A. We ran random hyper-parameter selection for several days to get the best results. The resulting generator is based on GRU cells, with 3 hidden layers, each of size , and a dropout of . The Adam optimizer is selected, with a learning rate of . An MLP is applied to the output of the GRU at each time step, with an output size of 34. Two softmax operations are then applied, one on output , representing the Freeman codes, and the other on output , representing the speed.
For the models used to extract styles/bias our generator, we followed a more conservative approach, starting from already tested architectures, and modifying their hyper-parameters gradually, until we got satisfying results. The architectures are reported in the following sections.
An end-of-sequence (EOS) symbol is added at the end of each sequence. Padding is applied to make all sequence lengths equal.
The first time step represents the bias we use for the model. It is projected to the same dimension as the rest of the letter sequence. For example, if we use the letter as embedding (as a one-hot encoding, it has 26 dimensions), and the dimension of our sequence is 34 (16 + 1 for direction + EOS, 16 + 1 for speed + EOS), then we use a Multi-Layer Perceptron (MLP) to project the 26 dimensions into 34 dimensions.
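The input layout described above can be sketched as follows. This is only an illustration under stated assumptions: the projection is reduced to a single randomly initialized tanh layer (the actual MLP is trained and its depth is not specified here), EOS is assumed to be the last index of each 17-way half, padding frames are assumed to be all zeros, and every name is hypothetical.

```python
import math
import random

N_LETTERS = 26
HALF_DIM = 17                 # 16 symbols + EOS
FRAME_DIM = 2 * HALF_DIM      # 34 = direction half + speed half
EOS = 16                      # assumed position of EOS inside each half


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def make_frame(direction, speed):
    """Concatenate the one-hot direction and speed codes into a 34-d frame."""
    return one_hot(direction, HALF_DIM) + one_hot(speed, HALF_DIM)


def init_projection(in_dim, out_dim, seed=0):
    """Random weights standing in for the trained MLP parameters."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def project(params, vector):
    """One dense layer with tanh: maps the bias vector to the frame dimension."""
    weights, biases = params
    return [math.tanh(sum(w * v for w, v in zip(row, vector)) + b)
            for row, b in zip(weights, biases)]


def build_sequence(letter_index, directions, speeds, params, max_len):
    """Frame 0 is the projected bias, followed by the trace, EOS and padding."""
    frame0 = project(params, one_hot(letter_index, N_LETTERS))
    frames = [make_frame(d, s) for d, s in zip(directions, speeds)]
    frames.append(make_frame(EOS, EOS))
    frames += [[0.0] * FRAME_DIM] * (max_len - len(frames))
    return [frame0] + frames
```

For a letter+writer or image-based bias, only the input dimension of the projection changes; the rest of the sequence layout stays the same.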
In the training phase (Fig 2(b)), a token that encodes the letter and the writer or his/her style is first projected to the same feature dimension as the encoded sequence and considered as frame 0. This frame is placed before the rest of the encoded sequence (frames 1 to N) in order to bias the hidden states of the network. The objective of the model is to predict the next frame in the sequence given the preceding ones. The input to the model during training is always the ground truth.
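To formalize this, let $x = (x_1, \dots, x_T)$ be the input letter trace, where $T$ is the trace length, let $b$ be the letter with/without style (the model bias), and let $f$ be the MLP used to project $b$ to the same dimension as $x_t$. Our system then works as follows:
$$x_0 = f(b), \qquad h_t = \mathrm{GRU}(x_t, h_{t-1}), \qquad \hat{y}_{t+1} = \mathrm{softmax}\big(\mathrm{MLP}(h_t)\big), \qquad t = 0, \dots, T-1$$
The loss used to optimize the GRU parameters is the negative log-likelihood of the correct trace point at each time step, calculated as follows:
$$\mathcal{L} = -\sum_{t=1}^{T} \log \hat{y}_t[x_t]$$
where $\hat{y}_t[x_t]$ is the probability that the model assigns to the correct trace point $x_t$.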
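A minimal sketch of this loss, assuming that the 34 outputs are split into a 17-way direction half and a 17-way speed half, and that the two log-probabilities are simply summed at each step. The split positions, the summation and the function names are assumptions for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def step_nll(logits34, target_direction, target_speed):
    """Negative log-likelihood of the correct direction and speed at one step."""
    p_direction = softmax(logits34[:17])   # assumed: first half is direction
    p_speed = softmax(logits34[17:])       # assumed: second half is speed
    return -(math.log(p_direction[target_direction])
             + math.log(p_speed[target_speed]))


def sequence_nll(logits_per_step, target_directions, target_speeds):
    """Sum of the per-step losses over one trace (teacher forcing: the
    logits at each step are computed from ground-truth inputs)."""
    return sum(step_nll(logits, d, s)
               for logits, d, s in zip(logits_per_step,
                                       target_directions, target_speeds))
```

With uniform outputs, each step contributes 2·log(17), which gives a useful reference value for an untrained model.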
During inference (Fig 2(c)), the first time step carries the embedding information used to bias the model. The network then generates the first frame. This frame is then fed back to the network's input to generate the second frame. This continues until an EOS symbol is generated.
Over the course of generation, the model accumulates errors, leading to a degradation of performance when generating long sequences. Some techniques, like Scheduled Sampling, can be applied during the training phase in order to enhance the quality of the model training, but they are not used in this work.
In order to generate the tracing of the letter, we use the softmax sampling strategy: at each time step, we generate two multinomial distributions, one for the directions and the other for the speed. At time step $t$, we sample both distributions according to a temperature level, and use these samples as the model's input for the next time step $t+1$.
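The sampling loop can be sketched as below. The trained network is replaced by a hypothetical `step_fn(frame, state)` callback that returns 34 logits and the next state. The EOS index, the choice to stop as soon as either half emits EOS, and the 99-step cap (borrowed from the dataset cleaning threshold) are assumptions of this example.

```python
import math
import random

HALF_DIM = 17   # 16 symbols + EOS
EOS = 16        # assumed position of EOS inside each half


def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature; lower temperature sharpens the result."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(probs, rng):
    """Draw one index from a categorical distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def generate(step_fn, bias_frame, temperature=1.0, max_steps=99, seed=0):
    """Autoregressive generation: each sampled frame becomes the next input."""
    rng = random.Random(seed)
    frame, state, trace = bias_frame, None, []
    for _ in range(max_steps):
        logits, state = step_fn(frame, state)
        d = sample_index(softmax_with_temperature(logits[:HALF_DIM], temperature), rng)
        s = sample_index(softmax_with_temperature(logits[HALF_DIM:], temperature), rng)
        if d == EOS or s == EOS:      # assumed stopping rule
            break
        trace.append((d, s))
        frame = one_hot(d, HALF_DIM) + one_hot(s, HALF_DIM)
    return trace
```

A temperature below 1 concentrates the samples on the most likely symbols, while a temperature above 1 produces more varied but noisier tracings.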
V Biasing the network with a style input
We assess multiple methods to bias our letter generator in terms of their ability to capture writing styles. These methods are chosen because we know their cardinal order beforehand (which one carries more information than which). We use this prior knowledge to ground our performance metrics. The methods are:
- Letter bias: the letter code is used as bias. No style information is thus included. We use this as a lower baseline.
- Letter + Writer bias: the letter and writer codes are used as bias. Thus, the model has access to explicit information about the writer (i.e. via his/her identity), and this method is expected to perform the best. This model will also serve as an upper baseline.
- Image classifier embedding: we build a convolutional neural network (CNN) to classify the letter images, as shown in Fig 4. Our architecture achieves classification accuracy. The embedding layer will encode information about the discriminative distance between the letters. This model should perform the same as, or slightly worse than, the Letter bias, since it learns to cluster the letters and there are classification errors.
- Image auto-encoder latent space: we train a letter image autoencoder using the reconstruction error, and use the latent space as a representation of the letter+style bias. The architecture we use can be seen in Fig 4. The latent space encodes the similarity between the letters. This model should perform worse than the Letter bias since, while it captures the similarity between the letter images, it does not capture discriminative features about each letter.
Left: architecture of the CNN letter classifier. Batch normalization is used after each convolution layer. The Dense 1 layer is the embedding that is used to bias our generator. Right: the autoencoder architecture we used. The first Dense 34 layer provides the latent space used to bias the generator.
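To make the two baseline biases concrete, here is a small sketch of how the corresponding input vectors could be built. Representing the writer as a one-hot code over the 412 writers and concatenating it with the letter code is an assumption of this example, as are the function names; the two image-based biases would instead use the activations of the corresponding dense layer.

```python
N_LETTERS = 26
N_WRITERS = 412   # writers with isolated letters, from the dataset summary


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def letter_bias(letter):
    """Lower baseline: one-hot letter code, no style information."""
    return one_hot(ord(letter.lower()) - ord("a"), N_LETTERS)


def letter_writer_bias(letter, writer_id):
    """Upper baseline: letter code concatenated with a one-hot writer code
    (the concatenation is an assumption of this sketch)."""
    return letter_bias(letter) + one_hot(writer_id, N_WRITERS)
```

Either vector is then projected to the frame dimension and used as frame 0 of the sequence.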
VI-A Model selection
Training the generator part of our model has been quite challenging, due to the issues mentioned in the Dataset and Pre-Processing section. We ran random hyper-parameter selection for several days to get the best results. For the other models, used to extract styles/bias our generator, we followed a more conservative approach: we started from previously tested architectures and modified their hyper-parameters gradually until we obtained satisfying results.
Evaluation, in generative models, is by far the most challenging part. Ideally, we want metrics to capture the distance between the generated and the reference distributions of handwriting features, and not between images using an ink-deposition model . In order to objectively compare the proposed style embeddings, we propose the following metrics:
- BLEU score: an important metric for evaluating the quality of text generation, as in machine translation and image captioning. In this work, we test the hypothesis that the BLEU score is also relevant to the generation of handwriting. (In text evaluation, while the BLEU score is usually used when there are multiple reference sentences, there is no constraint on using it with one reference sentence only.) In this study, we report the BLEU scores for 1-, 2- and 3-grams, for the Freeman codes and the speed separately. The final score is calculated as follows:
$$p_n = \frac{\sum_{s \in C} \sum_{g \in \text{n-grams}(s)} \mathrm{Count}_{clip}(g)}{\sum_{s \in C} \sum_{g \in \text{n-grams}(s)} \mathrm{Count}(g)}, \qquad \mathrm{BLEU} = \min\left(1, e^{1 - r/c}\right) \cdot \exp\left(\frac{1}{N} \sum_{n=1}^{N} \log p_n\right)$$
where $C$ is the set of all generated sequences, $g$ is the N-gram to be measured, $N$ is the total number of N-gram orders we want to consider, $\mathrm{Count}_{clip}$ is the clipped N-gram count (if an N-gram occurs more often in the generated sequence than in the reference sequence, its count is limited to the number of occurrences in the reference sequence), $r$ is the length of the reference sequence, and $c$ is the length of the generated sequence. The term $\min(1, e^{1 - r/c})$ is added in order to penalize short generated sequences (shorter than the reference sequence), which would otherwise achieve high scores.
- Generated Sequence Length: another aspect that we measure is the relationship between the length of the generated sequence and that of the reference sequence. For each proposed method, we use the Wilcoxon signed-rank test to compute the statistical significance of the difference between the distribution of the lengths of the generated letters and that of the reference letters. In addition, we calculate the Pearson correlation coefficient on the lengths, in order to better quantify the relation between the generated and the ground-truth letters.
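The BLEU computation listed above can be sketched as follows, using the standard definition with clipped N-gram counts, uniform weights and a brevity penalty, and a single reference per generated sequence. Whether the reported B-1, B-2 and B-3 values are cumulative up to that order (as assumed here through `max_n`) is an assumption, and the function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(sequence, n):
    """All contiguous n-grams of a sequence, as tuples."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


def corpus_bleu(generated, references, max_n=3):
    """BLEU over paired lists of code sequences (one reference each)."""
    clipped = [0] * max_n
    total = [0] * max_n
    gen_len = ref_len = 0
    for gen, ref in zip(generated, references):
        gen_len += len(gen)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            gen_counts = Counter(ngrams(gen, n))
            ref_counts = Counter(ngrams(ref, n))
            # Clip each n-gram count to its count in the reference.
            clipped[n - 1] += sum(min(c, ref_counts[g])
                                  for g, c in gen_counts.items())
            total[n - 1] += sum(gen_counts.values())
    if gen_len == 0 or min(clipped) == 0:
        return 0.0   # some order has no matching n-gram
    log_precision = sum(math.log(c / t) for c, t in zip(clipped, total)) / max_n
    # Brevity penalty: generated sequences shorter than the reference are penalized.
    penalty = 1.0 if gen_len >= ref_len else math.exp(1 - ref_len / gen_len)
    return penalty * math.exp(log_precision)
```

The same function would be applied once to the direction codes and once to the speed codes, since the two streams are scored separately.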
VII Results and Discussion
VII-A BLEU scores
The final results using the BLEU score can be seen in Table I. The following is observed:
- The letter + writer bias performed better than all other biases (in terms of B-3, for both speed and Freeman directions), thus showing that having access to information about the writer, even information as basic as the writer ID, gives a clear advantage in the resulting quality of the handwriting generation.
- The embedding from the image autoencoder performed the worst. To understand why, we show a 2-D projection of the latent space using t-SNE in Fig 4(a). Since the autoencoder is trained to minimize the reconstruction error only, the distances in the latent space encode mostly the proximity between the images, with no distinct representations for letter and style. It can be seen that the model's latent space does not encode discriminative features for the letters. Using this latent space for our generator, we find that the model gets easily confused between nearby letters, leading it to generate different letters than requested.
- The embedding from the image classifier performs better than the letter-only baseline, but the results vary compared to the letter+writer model. Since the classifier is trained on a single objective only (to classify the letters), and the classifier performs very well, we can expect the embedding to cluster the letters well, as seen in Fig 4(b). Also, we can expect the model to capture some of the writer style, possibly in the inter-cluster variance. This is an interesting result, suggesting that some fine-tuning of the image classifier during the generation task could be beneficial.
| Model / B-score | B-1 | B-2 | B-3 | B-1 | B-2 | B-3 |
| Letter + Writer bias | 51.5 | 41.4 | 25.1 | 56.7 | 39.4 | 28.3 |
VII-B EOS performance
As mentioned earlier, we performed a statistical test between the paired distributions of the lengths of the generated and the reference letters (in other words, of when the EOS symbol first appears). The results are shown in Table II. We can see the following:
- For the statistical test, we can see that the letter+writer bias outperforms the rest of the approaches, achieving p-value . This is quite reassuring, since it is also in line with the results from the BLEU score.
- The results from the Pearson correlation coefficients are also consistent with the rest of the results. High coefficients are obtained for the letter+writer biases compared to the other methods. The image classifier and autoencoder give the lowest results. This can be due to errors during learning, and to the insufficient information about the letter length that can be inferred from the image. For the image classifier, as noted earlier, fine-tuning during the generation task is worth exploring.
| Models | Pearson coefficient | p value |
| Letter + Writer bias | 0.55 | 0.04 |
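The two length statistics reported above can be computed along the following lines. This sketch is standard-library only and uses a plain normal approximation for the Wilcoxon p-value, without tie or continuity corrections, so its p-values are approximate; the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def wilcoxon_signed_rank(x, y):
    """Return (W, two-sided p) for paired samples; zero differences are dropped."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    w = min(w_plus, w_minus)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean) / sd
    return w, math.erfc(abs(z) / math.sqrt(2))   # normal approximation
```

Here `x` would hold the lengths of the generated letters and `y` the lengths of the paired reference letters. In practice, library routines such as `scipy.stats.wilcoxon` and `scipy.stats.pearsonr` would typically be preferred, since they handle small samples and ties more carefully.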
VIII Conclusions and future work
We have proposed baselines for the task of handwriting generation, and evaluation metrics to measure the quality of the different methods. The baselines are a letter bias only, which captures the average of the letters, and a letter + writer bias, which has direct access to the writer ID (and thus has information about the style). We also proposed two performance metrics: the BLEU score (adapted from machine translation) and EOS analysis. In order to ground those metrics, we leveraged our prior knowledge of the cardinal power of the different styling methods. With the performance metrics matching our expectations, we have a logical argument for using these metrics in the future for this task. This is an essential first step towards further study and analysis of styles in handwriting, enabling further techniques to be developed and compared to each other.
Several directions could enhance our results or make our study more complete. For example:
- Extract styles from examples: the letter + writer bias has explicit access to the writer ID, which we argue is the simplest possible style information about the writer. Its advantage is that it is quite simple, yet it does not carry much information about the writer. For example, for the letter X, some people draw it clockwise and some anticlockwise; some people start from the left side, and some start from the right side.
- Style transfer: from our observation of the data, although there are 400 writers, there are some shared components of writing styles, like the ones mentioned for the letter X in the previous point (although it is not possible to enumerate them). One way to test the quality of a style extraction method is by performing a style transfer: leveraging the information from different writers to make a quick adaptation to a new unseen writer. One interesting method we are currently investigating to extract the writer style is to adapt the method used in FaceNet, which creates an embedding for human faces. FaceNet introduced a loss function, the triplet loss, which is generic enough to be used in other applications, like identifying the speaker turn. Also, recent work on style transfer in the domain of speech synthesis [19, 20], which separates textual input from voice and expressivity, shows promising results.
- Task specific metrics: the metrics proposed in this paper are quite generic, allowing us to evaluate the system as a whole. Yet, a better understanding and analysis of the different systems requires more task-specific metrics. This is also in line with the previous points, since it will give better insight for developing better methods for writer style extraction.
This work is supported by PERSYVAL (ANR-11-LABX-0025) via the project-action RHUM.
-  C. Viard-Gaudin, P. M. Lallican, S. Knerr, and P. Binter, “The ireste on/off (ironoff) dual handwriting database,” in Document Analysis and Recognition, 1999. ICDAR ’99. Proceedings of the Fifth International Conference on, Sep 1999, pp. 455–458.
-  I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
-  D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.
-  S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
-  K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
-  J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.
-  I. Sutskever, J. Martens, and G. E. Hinton, “Generating text with recurrent neural networks,” in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 1017–1024.
-  I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, ser. NIPS’14. Cambridge, MA, USA: MIT Press, 2014, pp. 3104–3112. [Online]. Available: http://dl.acm.org/citation.cfm?id=2969033.2969173
-  A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for generating image descriptions,” in
-  O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” in Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on. IEEE, 2015, pp. 3156–3164.
-  J.-P. Briot and F. Pachet, “Music generation by deep learning-challenges and directions,” arXiv preprint arXiv:1712.04371, 2017.
-  A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, 2016.
-  A. Graves, “Generating sequences with recurrent neural networks,” CoRR, vol. abs/1308.0850, 2013. [Online]. Available: http://arxiv.org/abs/1308.0850
-  C. M. Bishop, “Mixture density networks,” 1994.
-  U.-V. Marti and H. Bunke, “A full english sentence database for off-line handwriting recognition,” in Document Analysis and Recognition, 1999. ICDAR’99. Proceedings of the Fifth International Conference on. IEEE, 1999, pp. 705–708.
-  L. Theis and M. Bethge, “Generative image modeling using spatial lstms,” in Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, ser. NIPS’15. Cambridge, MA, USA: MIT Press, 2015, pp. 1927–1935. [Online]. Available: http://dl.acm.org/citation.cfm?id=2969442.2969455
-  A. Van Den Oord, N. Kalchbrenner, and K. Kavukcuoglu, “Pixel recurrent neural networks,” in Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ser. ICML’16. JMLR.org, 2016, pp. 1747–1756. [Online]. Available: http://dl.acm.org/citation.cfm?id=3045390.3045575
-  R. J. Skerry-Ryan, E. Battenberg, Y. Xiao, Y. Wang, D. Stanton, J. Shor, R. J. Weiss, R. Clark, and R. A. Saurous, “Towards end-to-end prosody transfer for expressive speech synthesis with tacotron,” CoRR, vol. abs/1803.09047, 2018. [Online]. Available: http://arxiv.org/abs/1803.09047
-  Y. Wang, D. Stanton, Y. Zhang, R. J. Skerry-Ryan, E. Battenberg, J. Shor, Y. Xiao, F. Ren, Y. Jia, and R. A. Saurous, “Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis,” CoRR, vol. abs/1803.09017, 2018. [Online]. Available: http://arxiv.org/abs/1803.09017
-  P. Koehn, Statistical Machine Translation, 1st ed. New York, NY, USA: Cambridge University Press, 2010.
-  K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Proceedings of the 40th annual meeting on association for computational linguistics. Association for Computational Linguistics, 2002, pp. 311–318.
-  S. Banerjee and A. Lavie, “Meteor: An automatic metric for mt evaluation with improved correlation with human judgments,” in Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, 2005, pp. 65–72.
-  R. Vedantam, C. Lawrence Zitnick, and D. Parikh, “Cider: Consensus-based image description evaluation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 4566–4575.
-  H. Freeman, “On the encoding of arbitrary geometric configurations,” IRE Transactions on Electronic Computers, vol. 2, pp. 260–268, 1961.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, “Scheduled sampling for sequence prediction with recurrent neural networks,” in Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, ser. NIPS’15. Cambridge, MA, USA: MIT Press, 2015, pp. 1171–1179. [Online]. Available: http://dl.acm.org/citation.cfm?id=2969239.2969370
-  V. Nguyen and M. Blumenstein, “Techniques for static handwriting trajectory recovery: a survey,” in Proceedings of the 9th IAPR International Workshop on Document Analysis Systems. ACM, 2010, pp. 463–470.
-  F. Wilcoxon, “Individual comparisons by ranking methods,” Biometrics Bulletin, vol. 1, no. 6, pp. 80–83, 1945. [Online]. Available: http://www.jstor.org/stable/3001968
-  F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” CoRR, vol. abs/1503.03832, 2015. [Online]. Available: http://arxiv.org/abs/1503.03832
-  H. Bredin, “Tristounet: Triplet loss for speaker turn embedding,” CoRR, vol. abs/1609.04301, 2016. [Online]. Available: http://arxiv.org/abs/1609.04301
Example of the letters
The design choices of our experiments (discretization, and ignoring the pen state) affect the final shape of the letters; yet, the letters and their style remain quite recognizable. See examples of the original letters in Figure 6. Examples of the generation with our methods are in Figure 7.