Handwriting styles: benchmarks and evaluation metrics

09/04/2018 ∙ by Omar Mohammed, et al. ∙ Grenoble Institute of Technology ∙ Université Grenoble Alpes

Evaluating the style of handwriting generation is a challenging problem, since it is not well defined. It is a key component in developing systems that offer more personalized experiences to humans. In this paper, we propose baseline benchmarks, in order to set anchors to estimate the relative quality of different handwriting style methods. This is done using deep learning techniques, which have shown remarkable results in different machine learning tasks: classification, regression, and, most relevant to our work, generating temporal sequences. We discuss the challenges associated with evaluating our methods, which are related to the evaluation of generative models in general. We then propose evaluation metrics that we find relevant to this problem, and we discuss how we validate these metrics. In this study, we use the IRON-OFF dataset. To the best of our knowledge, no prior work has used this dataset for generating handwriting (either in terms of methodology or performance metrics) or for exploring styles.




I Introduction

The characterization and extraction of a human style profile, given some human activity (such as speech, handwriting, or human interactions), is an open research problem. Usually, there is no clear definition of styles, making style extraction an ill-posed problem. In the case of generative models, taking styles into account allows for more personalized generation.

In this paper, we look at the problem from the angle of handwriting generation. Ideally, given a letter from a writer, we would like to have information about the letter symbol (the character) and the factors that characterize the shape (which, by default, characterize the writer). By doing so, we can: i) better study what constitutes the human profile, and ii) produce more human-acceptable samples.

In this paper, we discuss 4 methods to capture the style, to be used for biasing handwriting generation. The handwriting generation and two of the proposed style methods are based on the state of the art in deep learning. We then propose our performance metrics, and the reasoning behind them. The cardinal power of the style methods (which method carries more style information than which) is known beforehand. This allows us to validate our choice of evaluation metrics.

II Related work

Some of the remarkable recent advances in deep learning [2] happened in the area of generative models. For static data, such as generating images, the work done using Variational Autoencoders [3] and Generative Adversarial Networks [4] has shown remarkable results.

In contrast, handling temporal data, such as generating tracings from a letter/writer embedding, is more challenging: the data is sequential, and it is difficult to keep coherence over long sequences. Advances in neural network cell structures, like LSTM [5] and GRU [6, 7], showed impressive results in handling long-term dependencies in temporal sequences.

These advances later allowed the development of state-of-the-art neural network architectures for generating biased temporal sequences, specifically for text generation and image captioning, which show impressive results [8, 9, 10, 11]. More applications have since been explored, like music generation [12] and speech synthesis [13].

The generation of continuous data has always been tricky. Graves [14] combined Long Short-Term Memory (LSTM) networks with Mixture Density Networks (MDN) [15] to generate continuous handwritten characters, using the IAM Handwriting Database [16]. While the results are impressive, the MDN approach is quite difficult to train. Another possible approach for generating continuous tracings is Gaussian Scale Mixtures (GSM) [17].

In order to simplify the procedure, and focus on our investigation of styles, we discretize the tracings using Freeman codes for direction, and quantized speed (see Section III-B for more details), and apply a softmax to the output of the last layer, instead of an MDN. This was inspired by the results reported in [18, 13], which show impressive results in the discrete domain, given a good discretization policy. A categorical distribution is more flexible and generic than a continuous distribution, and requires no assumption about the shape of the data distribution.

Recently, interesting work has been done on style extraction in the area of speech synthesis [19, 20]. In that work, the authors extract a number of style tokens. They evaluate the performance of their system via classical subjective ratings of voice, and show that these tokens relate to some aspects of the speech prosody and the speaker's voice.

On the evaluation side, there have been many advances in developing performance metrics for image captioning and machine translation [21]. Metrics like BLEU [22], METEOR [23] and CIDEr [24] are considered the state of the art in image captioning and machine translation evaluation. Traditionally, the evaluation of these kinds of applications is subjective. But with the advance of machine and statistical learning, there was a need to develop metrics that are cheap to evaluate, yet correlate well with human evaluation.

III Dataset and Pre-Processing

III-A Dataset

The dataset we chose is the IRON-OFF Cursive Handwriting Dataset [1]. While more famous handwriting datasets, like the IAM Handwriting Database [16], are already available, our dataset provides separated and labeled letters, instead of entire sentences, thus allowing us to focus more on the problem of styles. A quick summary of this dataset is given below:

  • 700 writers total. We use 412 writers, who have written isolated letters.

  • 10,685 isolated lower case letters.

  • 10,679 isolated upper case letters, e.g. see Fig 1.

  • 410 euro signs.

  • 4,086 isolated digits.

  • Gender, handedness, age and nationality are available for all writers.

  • For each letter, we have the letter image (with a size of around 167x214 pixels and a resolution of 300 dpi), a timed pen movement sequence comprising continuous X, Y and pen pressure, and also the discrete pen state. This data is sampled at 100 points per second on a Wacom UltraPad A4.

One particular challenge in this dataset is that each writer wrote the letters only once. Since we are focusing on the styles, this makes it particularly challenging for us. We do not use the pressure or the pen state, in order to simplify the model.

Fig. 1: Example of a letter, showing the trajectory, strokes, speed and pen pressure

III-B Pre-processing

All the letter images have been denoised and cropped in order to focus on the letters. The images were then down-scaled to 28x28 pixels.

We cleaned the selected motion-captured isolated letters by removing frames related to false starts, corrections, and extra strokes, as well as removing entire tracings whose duration exceeds 1 second, in particular due to lengthy pen-up durations. All tracings exceeding 99 time steps have been discarded from the dataset as well.

All the letter tracings are represented as two modalities: Freeman codes (see Section III-B1) and speed features. Each modality is quantized into 16 levels, and then represented as one-hot encoded vectors.

III-B1 Freeman coding for direction and quantizing speed

Freeman codes [25] belong to a family of compression algorithms called chain codes. These algorithms are useful to encode an image that contains connected components. They are considered compression algorithms as they can transform a sparse matrix into a sequence of codes that takes just a small fraction of the size of the image. The original Freeman codes have 2 versions: 4-directional codes and 8-directional codes. Both are fairly simple, as they encode each direction with a unique number (from 0 to n-1, where n is the number of directions). A direction is defined in the image as the directed vector connecting two neighbouring pixels on the contour of a connected component.

In our work, we compute the direction angle between each two consecutive points. Then, we convert each direction to its corresponding Freeman code symbol, as shown in Fig 2. Then, we perform one-hot encoding on the direction, and feed it to our network. In order to have a faithful reconstruction of the letters, we also quantize the speed of each displacement.

Fig. 2: Example of the Freeman code representation for 8 directions. Each direction is given one number.
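The direction and speed coding described above can be sketched as follows. This is only an illustrative sketch: the function names, the choice of code 0 pointing along the positive x axis, and the uniform speed binning up to a hypothetical `max_step` are assumptions, since the exact quantization scheme is not specified here.

```python
import math

N_LEVELS = 16  # each modality is quantized into 16 levels


def direction_code(p, q, n_dirs=N_LEVELS):
    """Map the displacement p -> q to one of n_dirs equally spaced direction codes."""
    angle = math.atan2(q[1] - p[1], q[0] - p[0]) % (2 * math.pi)
    sector = 2 * math.pi / n_dirs
    # Round to the nearest sector centre; code 0 points along +x (assumed convention).
    return int(round(angle / sector)) % n_dirs


def speed_code(p, q, max_step, n_levels=N_LEVELS):
    """Uniformly quantize the displacement length into n_levels bins (assumed scheme)."""
    step = math.hypot(q[0] - p[0], q[1] - p[1])
    ratio = min(step / max_step, 1.0)
    return min(int(ratio * n_levels), n_levels - 1)


def one_hot(index, size):
    """Return a one-hot list of the given size."""
    return [1 if i == index else 0 for i in range(size)]


def encode_trace(points, max_step):
    """Return (direction codes, speed codes) for each pair of consecutive points."""
    pairs = list(zip(points, points[1:]))
    dirs = [direction_code(p, q) for p, q in pairs]
    speeds = [speed_code(p, q, max_step) for p, q in pairs]
    return dirs, speeds
```

Since the tracings are sampled at a fixed rate, the displacement length between consecutive points is used here as a proxy for speed.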

IV Models

IV-A Model selection

Achieving good generation quality with our model has been quite challenging, due to the issues mentioned in Section III-A. We ran a random hyper-parameter search for several days to get the best results. The resulting generator is based on GRU cells, with 3 hidden layers, each of size , and a dropout of . The Adam optimizer [26] is selected, with a learning rate of . An MLP is applied to the output of the GRU at each time step, with an output size of 34. Two softmax operations are then applied, one for output , representing Freeman codes, and the other on output , representing the speed.

For the models used to extract styles/bias our generator, we followed a more conservative approach, starting from already tested architectures, and modifying their hyper-parameters gradually, until we got satisfying results. The architectures are reported in the following sections.

IV-B Training

We follow an approach similar to the work done in image captioning [11]. Each handwritten character is encoded as shown in Fig 3(a). An End-of-Sequence (EOS) symbol is added at the end of each sequence. Padding is done to make all sequence lengths equal.

The first time step represents the bias we use for the model. It is projected to the same dimension as the rest of the letter sequence. For example, if we use the letter as embedding (as a one-hot encoding, it has 26 dimensions), and the dimension of our sequence is 34 (16 + 1 for direction + EOS, 16 + 1 for speed + EOS), then we use a Multi-Layer Perceptron (MLP) to project the 26 dimensions into 34 dimensions.
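To make the 34-dimensional frame layout concrete, the following sketch builds the frame sequence for one letter, appends the EOS frame, and pads to a fixed length. The slot ordering (direction first, then speed, with EOS as the last slot of each half) and the all-zero padding frames are assumptions made for illustration only.

```python
DIR_SLOTS = 17    # 16 direction codes + EOS
SPEED_SLOTS = 17  # 16 speed levels + EOS
FRAME_DIM = DIR_SLOTS + SPEED_SLOTS  # 34, as in the text
EOS = 16          # index of the EOS slot within each half (assumed layout)


def make_frame(dir_code, speed_code):
    """Concatenate the one-hot direction and one-hot speed into a single frame."""
    frame = [0] * FRAME_DIM
    frame[dir_code] = 1
    frame[DIR_SLOTS + speed_code] = 1
    return frame


def encode_sequence(dir_codes, speed_codes, max_len):
    """Encode a letter trace as frames, append EOS, and zero-pad to max_len."""
    frames = [make_frame(d, s) for d, s in zip(dir_codes, speed_codes)]
    frames.append(make_frame(EOS, EOS))
    if len(frames) > max_len:
        raise ValueError("sequence longer than max_len")
    # Padding with all-zero frames is an assumption, not taken from the paper.
    frames += [[0] * FRAME_DIM for _ in range(max_len - len(frames))]
    return frames
```

The bias frame (frame 0) would then be produced by the MLP and placed in front of this sequence.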

In the training phase, Fig 3(b), a token that encodes the letter and the writer or his/her style is first projected to the same feature dimension as the encoded sequence and considered as frame 0. This frame is prepended to the rest of the encoded sequence (frames 1 to N) in order to bias the hidden states of the network. The objective of the model is to predict the next frame in the sequence given the preceding ones. The input to the model during training is always the ground truth.

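To formalize this, let $X = (x_1, \dots, x_N)$ be the input letter trace, where $N$ is the trace length, let $b$ be the letter with/without style (the model bias), and let $f$ be the MLP used to project $b$ to the same dimension as $x_t$. Our system then works as follows:

$$x_0 = f(b), \qquad p_{t+1} = \mathrm{GRU}(x_0, x_1, \dots, x_t), \quad t \in \{0, \dots, N-1\}$$

The loss used to optimize the GRU parameters is the negative log-likelihood of the correct trace point at each time step, calculated as follows:

$$L(X, b) = -\sum_{t=1}^{N} \log p_t(x_t)$$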

Fig. 3: Illustration of biasing the generative model using letter + writer: (a) input sequence shape, (b) training mode, (c) generation framework. The MLP, which receives the letter/writer embedding, is responsible for down-/up-sizing the input dimension to the frame dimension of the tracings (34, i.e. 16+1 one-hot encodings for direction and for speed, each including an EOS feature). In this example, the model is biased using the letter and writer code.
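As a rough illustration of the training objective of Section IV-B, the sketch below computes the negative log-likelihood of the ground-truth codes under the per-step predicted distributions. The function name `nll_loss`, the equal weighting of the direction and speed terms, and the small `eps` for numerical safety are assumptions; the paper does not state how the two softmax outputs are combined in the loss.

```python
import math


def nll_loss(dir_probs, speed_probs, dir_targets, speed_targets, eps=1e-12):
    """Sum over time steps of -log p(correct code), for both output heads.

    dir_probs / speed_probs: one probability list (length 17) per time step.
    dir_targets / speed_targets: index of the ground-truth code at each step.
    """
    total = 0.0
    for pd, ps, d, s in zip(dir_probs, speed_probs, dir_targets, speed_targets):
        total -= math.log(pd[d] + eps)  # direction (Freeman code) head
        total -= math.log(ps[s] + eps)  # speed head
    return total
```

With teacher forcing, the distributions at step t are computed from the ground-truth frames up to step t-1, so this loss can be evaluated for all time steps of a sequence in one pass.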

IV-C Inference

During inference, Fig 3(c), the first time step carries the embedding information used to bias the model. The network then generates the first frame. This frame is then fed back to the network's input to generate the second frame. This continues until an EOS symbol is generated.

Over the course of generation, the model accumulates errors, leading to a degradation of performance when generating long sequences. Some techniques, like Scheduled Sampling [27], can be applied during the training phase in order to improve the quality of the trained model, but they are not used in this work.

In order to infer/generate the tracing of the letter, we use the softmax sampling strategy: at each time step, we generate two multinomial distributions, one for the directions and the other for the speed. At time step $t$, we sample both distributions according to a temperature level, and use these samples to feed the model's input for the next time step $t+1$.
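A minimal sketch of temperature-based softmax sampling and of the feedback loop described in this section is given below. The `step_fn` callable stands in for the trained GRU + MLP and is purely hypothetical, as are the stopping rule (EOS in either head) and the `max_len` cap.

```python
import math
import random


def sample_with_temperature(logits, temperature=1.0, rng=random):
    """Sample an index from softmax(logits / temperature); temperature must be > 0."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - m) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate(step_fn, bias_frame, eos=16, max_len=100, temperature=1.0, rng=random):
    """Autoregressive generation: feed each sampled frame back until EOS.

    step_fn(frame, state) -> (dir_logits, speed_logits, new_state) is a
    hypothetical wrapper around the trained network.
    """
    state, frame, trace = None, bias_frame, []
    for _ in range(max_len):
        dir_logits, speed_logits, state = step_fn(frame, state)
        d = sample_with_temperature(dir_logits, temperature, rng)
        s = sample_with_temperature(speed_logits, temperature, rng)
        if d == eos or s == eos:  # assumed stopping rule
            break
        trace.append((d, s))
        frame = (d, s)  # the sampled codes become the next input
    return trace
```

Lower temperatures concentrate the samples on the most likely codes, while higher temperatures produce more varied tracings.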

V Biasing the network with a style input

We assess multiple methods to bias our letter generator in terms of their ability to capture writing styles. These methods are chosen since we know beforehand their cardinal order (which one carries more information than which). We use this prior knowledge to ground our performance metrics. The methods are:


Letter bias: the letter code is used as bias. No style information is thus included. We use this as a lower baseline.

Letter + Writer bias: the letter and writer codes are used as bias. Thus, the model has access to explicit information about the writer (i.e. via his/her identity), and this method is expected to perform the best. This model also serves as an upper baseline.

Image classifier embedding: we build a convolutional neural network (CNN) to classify the letter images, as shown in Fig 4. Our architecture achieves classification accuracy. The embedding layer encodes information about the discriminative distance between the letters. This model should perform the same as, or slightly worse than, the Letter bias, since it learns to cluster the letters, and there are classification errors.

Image auto-encoder latent space: we train a letter image autoencoder, using the reconstruction error, and use the latent space as a representation of the letter+style bias. The architecture we use can be seen in Fig 4. The latent space encodes the similarity between the letters. This model should perform worse than the Letter bias, since, while it captures the similarity between the letter images, it does not capture discriminative features of each letter.

Fig. 4: Left: architecture of the CNN letter classifier. Batch normalization is used after each convolution layer. The Dense 1 layer is the embedding that is used to bias our generator. Right: the autoencoder architecture we used. The first Dense 34 layer provides the latent space used to bias the generator.

VI Experiments

VI-A Model selection

Building the generator part of our model has been quite challenging, due to the issues mentioned in the Dataset and Pre-Processing section. We ran a random hyper-parameter search for several days to get the best results. For the other models, used to extract styles/bias our generator, we followed a more conservative approach: we started from previously tested architectures and modified their hyper-parameters gradually, until we got satisfying results.

Evaluation, in generative models, is by far the most challenging part. Ideally, we want metrics to capture the distance between the generated and the reference distributions of handwriting features, and not between images using an ink-deposition model [28]. In order to objectively compare the proposed style embeddings, we propose the following metrics:

BLEU score [22]

It is an important metric for evaluating the quality of text generation in areas like machine translation [9] and image captioning [11]. In this work, we test the hypothesis that the BLEU score is also relevant to the generation of handwriting. (In text evaluation, while the BLEU score is usually used when there are multiple reference sentences, there is no constraint on using it with one reference sentence only.) In this study, we report the BLEU scores for 1, 2 and 3 grams, for the Freeman codes and the speed separately. The final score is calculated as follows:

$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\frac{1}{N} \sum_{n=1}^{N} \log p_n\right), \qquad p_n = \frac{\sum_{s \in S} \sum_{g_n \in s} \mathrm{Count}_{clip}(g_n)}{\sum_{s \in S} \sum_{g_n \in s} \mathrm{Count}(g_n)}, \qquad \mathrm{BP} = \min\left(1, e^{1 - r/c}\right)$$

where: $S$ is the set of all generated sequences, $g_n$ is the N-gram to be measured, $N$ is the total number of N-grams we want to consider, $\mathrm{Count}_{clip}$ is the clipped N-gram count (if the number of occurrences of an N-gram in the generated sequence is larger than in the reference sequence, the count is limited to the number in the reference sequence), $r$ is the length of the reference sequence, and $c$ is the length of the generated sequence. The term $\mathrm{BP}$ is added in order to penalize short generated sequences (shorter than the reference sequence), which would otherwise achieve high scores.
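The clipped N-gram counting and the length term described above can be sketched in a few lines. This follows the standard corpus-level BLEU definition with a single reference per generated sequence and uniform weights over N-gram orders; whether the paper uses exactly this weighting is an assumption, and the function names are illustrative.

```python
import math
from collections import Counter


def ngrams(seq, n):
    """All contiguous n-grams of a code sequence, as tuples."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def bleu(generated, references, max_n=3):
    """Corpus-level BLEU with one reference per generated sequence."""
    log_precisions = []
    for n in range(1, max_n + 1):
        clipped, total = 0, 0
        for gen, ref in zip(generated, references):
            gen_counts = Counter(ngrams(gen, n))
            ref_counts = Counter(ngrams(ref, n))
            # Clip each n-gram count to its count in the reference.
            clipped += sum(min(c, ref_counts[g]) for g, c in gen_counts.items())
            total += sum(gen_counts.values())
        if clipped == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    c = sum(len(g) for g in generated)     # total generated length
    r = sum(len(ref) for ref in references)  # total reference length
    brevity = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity * math.exp(sum(log_precisions) / max_n)
```

Under these assumptions, a B-1, B-2 or B-3 score corresponds to calling `bleu` with `max_n` set to 1, 2 or 3, once on the Freeman code sequences and once on the speed code sequences.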

Generated Sequence Length

Another aspect that we measure is the relationship between the lengths of the generated sequence and the reference sequence. Thus, for each proposed method, we use the Wilcoxon signed-rank test [29] to assess the statistical significance of the difference between the length distributions of the generated letters and the reference letters. In addition, we also calculate the Pearson correlation coefficient on the lengths, in order to better quantify the relation between the generated and the ground-truth letters.
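For illustration, both length statistics can be computed with the standard library alone, as sketched below. The Wilcoxon implementation uses the normal approximation without continuity or tie correction, which is a simplification chosen for this sketch (it is only reasonable for larger samples); in practice a statistics library such as SciPy (`scipy.stats.wilcoxon`, `scipy.stats.pearsonr`) would typically be used.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    Returns (W_plus, p_value). Zero differences are dropped and tied
    absolute differences receive average ranks.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    std = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / std
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return w_plus, p_value
```

Here `x` would hold the generated lengths and `y` the paired reference lengths, i.e. the time step at which the EOS symbol first appears in each sequence.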

VII Results and Discussion

VII-A BLEU scores

The final results using the BLEU score can be seen in Table I. The following is observed:

  • The letter + writer bias performed better than all other biases (in terms of B-3, for both speed and Freeman directions), thus showing that having access to information about the writer, even as basic as the writer ID, has a clear advantage in the resulting quality of the handwriting generation.

  • The embedding from the image autoencoder performed the worst (in terms of B-3). To understand why, we show a 2-D projection of the latent space using t-SNE in Fig 5(a). Since the autoencoder is trained to minimize the reconstruction error only, the distances in the latent space mostly encode the proximity between the images, with no distinct representations for letter and style. It can be seen that the model's latent space does not encode discriminative features for the letters. Using this latent space for our generator, we find that the model gets easily confused between nearby letters, leading it to generate different letters than requested.

  • The embedding from the image classifier performs better than the letter-only baseline, but the gap to the letter+writer model varies across scores. Since the classifier is trained on a single objective only (to classify the letters), and the classifier performs very well, we can expect the embedding to cluster the letters well, as seen in Fig 5(b). Also, we can expect the model to capture some of the writer style, possibly in the inter-cluster variance. This is an interesting result, suggesting that some fine-tuning of the image classifier during the generation task could be beneficial.

| Model | Speed B-1 | Speed B-2 | Speed B-3 | Freeman B-1 | Freeman B-2 | Freeman B-3 |
|---|---|---|---|---|---|---|
| Letter bias | 49.7 | 37.3 | 24.2 | 47.4 | 36.6 | 26.8 |
| Image classifier | 50.9 | 38.2 | 24.6 | 48.5 | 37.9 | 28.1 |
| Image autoencoder | 51.9 | 37.9 | 23.1 | 46.4 | 35.0 | 24.5 |
| Letter + Writer bias | 51.5 | 41.4 | 25.1 | 56.7 | 39.4 | 28.3 |

TABLE I: Comparing different approaches for style extraction using clipped n-grams
Fig. 5: a) In the autoencoder latent space, there is no clear separation between letters; the encoding is based on the similarity of the images only. b) In the classifier embedding, there is a clear separation between the letters, with few exceptions.

VII-B EOS performance

As mentioned earlier, we performed a statistical test between the paired distributions of the lengths of the generated and the reference letters (in other words, when the EOS symbol first appears). The results are shown in Table II. We can see the following:

  • For the statistical test, we can see that the letter+writer bias outperforms the rest of the approaches, achieving the lowest p-value (0.04 in Table II). This is quite reassuring, since it is also in line with the results from the BLEU score.

  • The results from the Pearson correlation coefficients are also consistent with the rest of the results. The highest coefficient is obtained by the letter+writer bias, compared to the other methods. The image classifier and the autoencoder give the lowest results. This may be due to errors during learning, and to the insufficient information about the letter length that can be inferred from the image. For the image classifier, as noted earlier, fine-tuning during the generation task is worth exploring.

| Models | Pearson coefficient | p value |
|---|---|---|
| Letter bias | 0.38 | 0.84 |
| Image classifier | 0.32 | 0.62 |
| Image autoencoder | 0.25 | 0.29 |
| Letter + Writer bias | 0.55 | 0.04 |

TABLE II: Pearson correlation coefficients and associated p-values for the EOS distributions of the different style biases.

VIII Conclusions and future work

We have proposed baselines for the task of handwriting generation, and evaluation metrics in order to measure the quality of the different methods. The baselines are a letter bias only, which captures the average of the letters, and a letter + writer bias, which has direct access to the writer ID (and thus has information about the style). We also proposed two performance metrics: the BLEU score (adapted from machine translation) and EOS analysis. In order to ground those metrics, we leveraged our prior knowledge of the cardinal power of the different styling methods. With the performance metrics matching our expectations, we provide a logical argument for using these metrics in the future for this task. This is an essential first step towards further study and analysis of styles in handwriting, enabling further techniques to be developed and compared to each other.

Several directions can be pursued in order to enhance our results, or to extend our study to make it more complete. For example:

Extract styles from examples: The letter + writer bias has explicit access to the writer ID, which we argue is the simplest possible style information about the writer. The advantage is that it is quite simple, yet it does not carry much information about the writer. For example, for the letter X, some people draw it clockwise and some anticlockwise. Some people start from the left side, and some start from the right side.

Style transfer: From our observation of the data, although there are around 400 writers, there are some common components of writing styles, like the ones mentioned for the letter X in the previous point (although it is not possible to enumerate them). One way to test the quality of a style extraction method is by performing a style transfer: leveraging the information from different writers to adapt quickly to a new unseen writer. One interesting method we are currently investigating to extract the writer style is to adapt the method used in FaceNet [30], which creates an embedding for human faces. It introduced a loss function, the triplet loss, which is generic enough to be used in other applications, like identifying the speaker turn [31]. Also, recent work on style transfer in the domain of speech synthesis [19, 20], for separating textual input from voice and expressivity, shows promising results.

Task specific metrics: The proposed metrics in this paper are quite generic, allowing us to evaluate the system as a whole. Yet, a better understanding and analysis of the different systems requires more task-specific metrics. This is also in line with the previous points, since it will give better insight into developing better methods for writer style extraction.


This work is supported by PERSYVAL (ANR-11-LABX-0025) via the project-action RHUM.


Example of the letters

The design choices of our experiments (discretization, and ignoring the pen state) affect the final shape of the letters, yet the letters and their style are quite recognizable. See examples of the original letters in Fig 6. Examples of the generation with our methods are in Fig 7.

Fig. 6: Examples of original letters, panels (a) to (z), one per letter, A to Z. The blue x mark is the starting point. E and F are visually harder to recognize, since we do not model the pen pressure; otherwise, the rest of the letters are well recognizable.
Fig. 7: Examples of generated letters, panels (a) to (z), one per letter, A to Z. The blue x mark is the starting point. These are generated using the letter + writer bias. The general quality is quite acceptable.