StackMix and Blot Augmentations for Handwritten Text Recognition

This paper proposes a handwritten text recognition (HTR) system that outperforms current state-of-the-art methods. The comparison was carried out on three of the datasets most frequently used in HTR tasks, namely Bentham, IAM, and Saint Gall. In addition, results on two recently presented datasets, Peter the Great's manuscripts and the HKR Dataset, are provided. The paper describes the architecture of the neural network and two ways of increasing the volume of training data: an augmentation that simulates strikethrough text (HandWritten Blots) and a new text generation method (StackMix), which proved to be very effective in HTR tasks. StackMix can also be applied to the standalone task of generating handwritten text based on printed text.




1 Introduction

Handwriting text recognition is a vital task. Automation allows for a dramatic reduction in labor costs for processing correspondence and application forms and for deciphering historical manuscripts.

Handwritten text is difficult to recognize because of both inter- and intraclass variability: different instances of the same word written by different people may vary greatly, and the same character written by the same writer may look very different depending on the context in which it was written. In addition, historical records may contain flaws such as ink drops and paper defects. Another problem with historical documents is the usually small amount of labeled data. Historical manuscripts are deciphered by rare specialists, namely historians and linguists, which significantly increases the deciphering cost, especially when there are many documents. Our approach involves having specialists label a relatively small number of documents, training the model, and applying it to the remaining data.

We propose a handwritten text recognition system, as well as two ways to increase the volume of training data: an augmentation that simulates strikethrough text (HandWritten Blots) and a "new text" generation method (StackMix).

The system was originally designed to decipher Peter the Great's manuscripts, which were first introduced in [potanin2021digital], using marked-up lines of text as input. One line of text is 2 048 pixels wide and 128 pixels high (Fig. 1).

The remainder of the paper is organized as follows. Related works are presented in Section 2. Some of them are used for comparison with state-of-the-art models. Section 3 is devoted to describing our method. It includes information about the neural network architecture and augmentations, and introduces StackMix. Section 4 describes the datasets and metrics used in the experiments. Section 5 provides a detailed description of the experiments and results. Section 6 concludes the paper.

Figure 1: One line of text.
Figure 2: Neural network architecture.

2 Related Work

Early works on handwritten recognition problems suggest using a combination of hidden Markov models and RNNs [bengio1999, bourlard1994] or algorithms based on conditional random fields [lafferty2001]. The disadvantage of these approaches is the impossibility of end-to-end loss function optimization. In 2006, a new approach was introduced: Connectionist Temporal Classification (CTC). The basic idea was to interpret the network outputs as a probability distribution over all possible label sequences, conditioned on a given input sequence. Given this distribution, an objective function can be derived that directly maximizes the probabilities of the correct labelings. Since the objective function is differentiable, the network can then be trained with standard backpropagation through time. The same article introduced a new loss function, the CTC loss. The ideas proposed in the article were widely adopted by researchers, and CTC became the de facto standard for handwritten recognition works [gramCTC, HWRfew, de2019no].

MDLSTM [voigtlaender2016handwriting] networks use a 2D-RNN, which can deal with both axes of an input image. A simple model consisting of several CNN and MDLSTM layers and trained with CTC loss provides excellent metrics on the IAM [marti2002iam] dataset. However, MDLSTM models have disadvantages such as high computational cost and instability. In [de2019no], the authors proposed an “example-packing” method, which eliminates the cost of padding for different-sized images. In [coquenet2020recurrence] and [ingle2019scalable], the authors try to eliminate recurrent layers in CNN-LSTM-CTC to decrease the number of parameters. Their Gated Fully Convolutional Networks show relatively good results, even without a language model.

Another alternative to the RCNN-CTC approach is seq2seq models [michael2019evaluating]. The encoder extracts features from the input, and the decoder with an attention mechanism emits the output sequentially. Common tricks may significantly improve the quality of HTR models. In [aradillas2020boosting], the authors investigated data augmentation and transfer learning for small historical datasets.

OrigamiNet [yousef2020origaminet] is a module that combines segmentation and text recognition tasks. It can be added to any line-level handwritten text recognition model to make it page-level.

We compared the proposed model with the models described above. The data for comparison is given in Section 4.1, and the description of the metrics is given in Section 4.2. We also found that different authors used various data partitions for train, test, and validation on the IAM dataset (see Section 4.1). This made it difficult to compare the models correctly, so we present our results on the same splits that were used in the original papers.

3 Method

Our method comprises three parts. Section 3.1 describes the modified Resnet neural network architecture we used. Section 3.2 proposes a new augmentation method that simulates strikethrough text, Handwritten Blots. Finally, Section 3.3 describes how to significantly increase the amount of training data by generating new text in the style of the current dataset (the StackMix approach).

3.1 Neural Network Architecture

The neural network underlying the proposed system consists of three parts: a feature generator, a recurrent network to account for the order of the features, and the classifier that outputs the probability of each character.

As a feature generator, various network architectures were tested, and the final choice fell on Resnet (Fig. 2). We took only the first three blocks from Resnet-34 and replaced the stride parameter in the first layer with 1 to increase the "width" of the objects. These Resnet blocks (Fig. 3) consisted of 3, 4, and 6 residual blocks with 64, 128, and 256 output channels, respectively.

After the features were extracted, they were averaged through the AdaptiveAvgPool2d layer and fed into the three BiLSTM layers to deal with feature sequences. As a final classifier, we use two fully connected layers with GELU and dropout between them.

Figure 3: Resnet block architecture.

The results achieved using the described architecture without any additional modifications are shown in the ”base” row of Table 4.

3.2 Blot Augmentation

The idea of the augmentation appeared during the analysis of the Digital Peter dataset [complink, potanin2021digital]. While examining the dataset, we found images in which some characters were crossed out and almost indistinguishable, but they were still present in the markup. Hence, the idea of using the Cutout augmentation [devries2017improved] emerged: it covers parts of symbols or entire symbols, which makes the result look somewhat like crossed-out symbols.

However, while training the models, we decided to implement an algorithm that simulates strikethrough characters as closely as possible to the real ones. Since we did not find an open-source implementation of such an algorithm, we created it ourselves.

To implement the strikethrough effect, we decided to use the Bezier curve construction algorithm, which in our case smoothed the curve transition between points.

A Bezier curve is a parametric curve defined through Bernstein polynomials. The basis polynomials of degree $n$ are given by the formula:

$$b_{k,n}(t) = \binom{n}{k} t^{k} (1 - t)^{n-k}, \tag{1}$$

where $k = 0, \dots, n$, $t \in [0, 1]$, and $\binom{n}{k}$ is the binomial coefficient. The curve is defined as a linear combination:

$$B(t) = \sum_{k=0}^{n} P_k \, b_{k,n}(t), \tag{2}$$

where $P_k$ is a point in space and $b_{k,n}$ is defined above. The basis polynomials sum to one:

$$\sum_{k=0}^{n} b_{k,n}(t) = 1, \tag{3}$$

so the values $b_{k,n}(t)$ act as non-negative weights that sum to one. We found the implementation of the algorithm for constructing the Bezier curve in [Hermes2017].
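As an illustration of the Bernstein-weighted sum described above, the following minimal Python sketch evaluates a Bezier curve for 2D control points. It is not the implementation from [Hermes2017]; the function names `bernstein`, `bezier_point`, and `bezier_curve` are chosen here for illustration only.

```python
from math import comb


def bernstein(k, n, t):
    """Bernstein basis polynomial: C(n, k) * t^k * (1 - t)^(n - k)."""
    return comb(n, k) * t**k * (1 - t) ** (n - k)


def bezier_point(points, t):
    """Point on the Bezier curve at parameter t in [0, 1].

    points is a list of (x, y) control points; the curve degree is
    len(points) - 1.
    """
    n = len(points) - 1
    weights = [bernstein(k, n, t) for k in range(n + 1)]
    x = sum(w * p[0] for w, p in zip(weights, points))
    y = sum(w * p[1] for w, p in zip(weights, points))
    return x, y


def bezier_curve(points, steps=50):
    """Sample steps + 1 points at evenly spaced parameter values."""
    return [bezier_point(points, i / steps) for i in range(steps + 1)]
```

Because the weights are non-negative and sum to one, every sampled point lies inside the convex hull of the control points, which keeps a generated stroke inside the region where its points were drawn.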

Figure 4: Graphic description of the algorithm. a) define the strikethrough area; b) define areas for generating points; c) generate random points; d) draw a Bezier curve from the generated points

Next, we implemented our own algorithm that simulates strikethrough. The main steps were as follows:

  1. Determine the coordinates of the strikethrough area.

  2. Define areas for generating points to be used for drawing a Bezier curve.

  3. Generate points for the Bezier curve with the parameters of the intensity of the points and their coordinates to simulate the slope. Sometimes, a random point needs to be used several times so that the loop extends slightly further from the curve.

  4. Draw a Bezier curve with a specified transparency.

A graphical description of the algorithm is shown in Fig. 4. An example of a picture from a dataset using our strikethrough is provided in Fig. 5.

Figure 5: Sample image using faux strikethrough

The implementation of this algorithm can be found here [codelink]. We empirically selected the parameters for strikethrough (minh = 50, maxh = 100, minw = 10, maxw = 50, incline = 15, intensity = 0.9, transparency = 0.95, count = 1 …11, proba = 0.5) and tested it on different datasets.
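The four steps can be sketched in pure Python as follows. This is a simplified illustration under stated assumptions, not the code from [codelink]: the image is a list of rows of grayscale floats, the point-generation areas are taken to be equal vertical strips of the strikethrough box, and `alpha` is a hypothetical stroke-opacity parameter that only loosely corresponds to the transparency value listed above.

```python
import random
from math import comb


def bezier_curve(points, steps=200):
    """Sample a Bezier curve defined by 2D control points."""
    n = len(points) - 1
    curve = []
    for i in range(steps + 1):
        t = i / steps
        w = [comb(n, k) * t**k * (1 - t) ** (n - k) for k in range(n + 1)]
        curve.append((sum(wk * p[0] for wk, p in zip(w, points)),
                      sum(wk * p[1] for wk, p in zip(w, points))))
    return curve


def add_blot(image, box, n_points=6, alpha=0.95, ink=0.0, rng=random):
    """Draw one faux strikethrough inside box = (x0, y0, x1, y1), in place.

    image: list of rows of floats in [0, 1], where 1.0 is white paper.
    alpha: assumed opacity of the stroke (1.0 hides the pixel completely).
    """
    x0, y0, x1, y1 = box                      # step 1: strikethrough area
    strip = (x1 - x0) / n_points              # step 2: one strip per point
    points = [(rng.uniform(x0 + i * strip, x0 + (i + 1) * strip),
               rng.uniform(y0, y1))           # step 3: random control points
              for i in range(n_points)]
    # Step 4: rasterize the curve; a set makes each pixel blend only once.
    pixels = {(int(round(y)), int(round(x))) for x, y in bezier_curve(points)}
    for row, col in pixels:
        if 0 <= row < len(image) and 0 <= col < len(image[0]):
            image[row][col] = (1 - alpha) * image[row][col] + alpha * ink
    return image
```

A training pipeline would typically call such a function a random number of times per line image and only with some probability, in the spirit of the count and proba parameters above.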

The effect of the Blot augmentation on quality metrics is shown in the ”blots” row of Table 4, and in Figure 9. The obtained data suggests that handwritten blot augmentation makes a significant contribution to the quality of training models. Therefore, we recommend using it to train models in handwriting recognition problems.

Figure 6: Example images of symbol segmentation using semi supervised methods. Images have ids: IAM-a02-082-05, BenthamR0-072_105_002_04_05 and SaintGall-csg562-004-13 respectively.

3.3 StackMix

Approaches that combine information from several objects of the training dataset improve the quality and stability of neural networks in image recognition tasks: they perform well and make the network more general and more robust to samples it has not seen before. For example, the CutMix approach [yun2019cutmix] is an augmentation in which parts of the images are cut from different samples and inserted into a new one, while the targets are mixed according to the proportions of the original parts of the images. A similar approach is used in SnapMix [huang2020snapmix], but the cutting of images is regulated by the neural network model using a Class Activation Map. This approach allows for reducing the noise of the cut objects and selecting the most significant parts from the point of view of the neural network. Additionally, an interesting approach to mixing objects is presented in MixUp [zhang2018mixup] and MWH [yu2021mixup], where images are overlaid on each other with a transparency coefficient and the targets are mixed in the same proportion. Unfortunately, these methods cannot be applied to optical character recognition and handwritten text recognition tasks, because mixing recursively dependent targets makes it very difficult to obtain a correct mapping between image and text. Therefore, we introduce the StackMix approach, which improves the quality and stability of our neural network.

Figure 7: Post-processing scheme to get the boundaries of the symbols.

Weakly Supervised Symbol Segmentation. To apply the proposed StackMix approach to the OCR task, additional data markup that exactly marks the symbol boundaries is required. To obtain it, training images were automatically segmented into symbols by post-processing the outputs of a neural network pretrained with CTC loss. The main idea was to connect the last layer of the RNN (after applying SoftMax activation for every symbol from image features) with the image width to get the symbols' boundaries using only weakly supervised training, without any manual markup (Fig. 7). The neural network can be trained with the base scheme, without any augmentations or tricks. To get high-quality marking of the symbols' boundaries, samples from the training stage should be used.
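A minimal sketch of this post-processing idea is given below. It assumes that the most probable label of every output frame is already available and that output frames are spread uniformly over the image width; the exact scheme of Fig. 7 may differ, and the function name `symbol_boundaries` and the blank symbol `"-"` are illustrative choices.

```python
def symbol_boundaries(frame_labels, image_width, blank="-"):
    """Map per-frame CTC labels to (symbol, left_px, right_px) spans.

    frame_labels holds the most probable label of every output frame
    (hypothetical input format); blank is the CTC blank symbol.
    """
    if not frame_labels:
        return []
    scale = image_width / len(frame_labels)  # pixels covered by one frame
    spans, start = [], 0
    for i in range(1, len(frame_labels) + 1):
        # Close the current run when the label changes or the sequence ends.
        if i == len(frame_labels) or frame_labels[i] != frame_labels[start]:
            if frame_labels[start] != blank:
                spans.append((frame_labels[start],
                              round(start * scale), round(i * scale)))
            start = i
    return spans


frames = list("--aa-bb-b-")  # ten frames that decode to "abb"
print(symbol_boundaries(frames, image_width=100))
# [('a', 20, 40), ('b', 50, 70), ('b', 80, 90)]
```

Each run of identical non-blank frames becomes one symbol, so the repeated "b" separated by a blank frame is kept as two symbols, exactly as CTC decoding would read it.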

Corpus. The StackMix approach also requires an external text corpus that has the same alphabet as the main dataset. The corpus does not require special marking and contains only allowed symbols. For the Digital Peter dataset, we used texts of the XVII-XVIII centuries (~3M lines). For the IAM and BenthamR0 datasets, we used cleaned texts from the Kaggle competition "Jigsaw Unintended Bias in Toxicity Classification" [jigsaw] (~560k lines). For the HKR dataset, we used a small portion of Russian texts from Wikimedia [ruwiki] (~2.2M lines), and for the Saint Gall dataset, we used a cleaned Latin corpus from Kaggle [kaggle_latin_library] (~180k lines) that is based on "The Latin Library" [the_latin_library].

StackMix Algorithm. The algorithm takes text from the external corpus as input and creates a new image of this text from parts of images of the training dataset. The algorithm from nltk MWETokenizer [nltkmwe] is used for tokenization. It processes tokenized text and merges multi-word expressions into single tokens. Collections of multi-word expressions are obtained from the training dataset, using symbol borders to connect parts of images and MWE tokens, including spaces and punctuation marks. In this study, we used one random MWE tokenizer out of six, with a token dimension of no more than 3, 4, 5, 6, 7, and 8 and probabilities 0.05, 0.15, 0.2, 0.2, 0.2, and 0.2, respectively. After this, each token was matched with part of an image from the training data, and the pieces were stacked together (hence the name "StackMix") into a complete image, maintaining the correct order of the tokens.
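The stacking step can be sketched in pure Python as follows. This is an illustration under assumptions rather than the authors' code: a greedy longest-match tokenizer stands in for nltk's MWETokenizer, `crops` is a hypothetical dictionary that maps each token to image crops cut at the symbol boundaries, and every crop is a list of pixel rows of the same height.

```python
import random

# Token-length limits and their probabilities as listed in the text.
MAX_LENS = [3, 4, 5, 6, 7, 8]
PROBAS = [0.05, 0.15, 0.2, 0.2, 0.2, 0.2]


def sample_max_len(rng=random):
    """Pick the maximum token length for one generated line."""
    return rng.choices(MAX_LENS, weights=PROBAS, k=1)[0]


def tokenize(text, vocab, max_len):
    """Greedy longest-match split of text into tokens present in vocab."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + size] in vocab:
                tokens.append(text[i:i + size])
                i += size
                break
        else:
            raise KeyError(f"no training crop covers {text[i]!r}")
    return tokens


def stackmix(text, crops, max_len=8, rng=random):
    """Build a line image for text by stacking crops left to right."""
    pieces = [rng.choice(crops[tok]) for tok in tokenize(text, crops, max_len)]
    if not pieces:
        return []
    height = len(pieces[0])
    # Concatenate the pixel rows of all pieces, keeping the token order.
    return [[px for piece in pieces for px in piece[r]] for r in range(height)]
```

For example, `stackmix(line, crops, max_len=sample_max_len())` would generate one synthetic training image for a corpus line; picking a random crop per token is what makes repeated calls on the same text produce different images.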

Examples of the StackMix algorithm for various datasets are given in Figure 8 and the supplementary materials. Despite the visible places where tokens were glued together, the algorithm significantly increased the quality of recognition. The alignment and selection of samples to increase the realism of the generated string did not lead to an increase in metrics in our experiments.

Nevertheless, after certain improvements, this algorithm may be used for the realistic generation of new documents. As an example, in the supplementary materials we generated pages of text from different sources. For all models, some paragraphs from the first chapter of a Harry Potter book were used. The results suggest that it is possible to generate different texts with different styles and fonts. For English-language models, we used the original Harry Potter book, and for models with Cyrillic symbols, the Russian version of the book was used.

Figure 8: Example images created using StackMix.

4 Benchmarks

4.1 Datasets

Five different datasets were used in the experiments to demonstrate the state-of-the-art quality of our model.

Bentham manuscripts refers to a large set of documents that were written by the renowned English philosopher and reformer Jeremy Bentham (1748-1832). Volunteers of the Transcribe Bentham initiative transcribed this collection. Currently, 6 000 documents or 25 000 pages have been transcribed using this public web platform.

For our experiments, we used the BenthamR0 dataset [bentham], a part of the Bentham manuscripts.

The IAM handwriting dataset contains forms of handwritten English text. It consists of 1 539 pages of scanned text from 657 different writers. This dataset is widely used for experiments in many papers. However, there is a big problem with uncertainty in data splitting for model training and evaluation. This issue was described in [michael2019evaluating]. The IAM dataset has different train/val/test splits that are shown in Table 1. The problem is that none of them are labeled as a standard, so the IAM dataset split differs from paper to paper. However, results should be compared on the same split.

In our experiments, we use the IAM-B and IAM-D partitions. IAM-B was used to compare our model with others. IAM-D is a new partition inspired by the official page of the project. This page contained "unknown", "val1", and "val2" split labels. We added the "unknown" samples to the train set and combined "val1" and "val2" together.

IAM-B was chosen because many recently published papers used this partition. We used IAM-D because it provides more training samples.

We created a GitHub repository [iam_splits] with information and indexes corresponding to each IAM split in Table 1. It also contains links to papers that use these splits. We hope this helps researchers choose appropriate IAM partitions and make valid comparisons with other papers.

Split Train Val Test
IAM-A 6161 966 2915
IAM-B 6482 976 2915
IAM-C 6161 940 1861
IAM-D 9652 1840 1861
Table 1: IAM splits.

Digital Peter is a completely new dataset of Peter the Great's manuscripts [potanin2021digital]. It consists of 9 694 images and text files corresponding to lines in historical documents. The open machine learning competition Digital Peter was held based on this dataset [complink]. There are 6 237 lines in the training set, 1 527 lines in the validation set, and 1 930 lines in the testing set.

HKR_Dataset [nurseitov2020hkr] is a recently published dataset of modern Russian and Kazakh language. This database consists of 1 400 filled forms. It contains 64 943 lines and 715 699 symbols produced by about 200 different writers. The data are split in the following manner: 45 559 lines for the train set, 10 009 lines for validation, and 9 375 lines for test. This data splitting was found in the GitHub repository [hkr_splitting_github] of the authors of the HKR_Dataset, but these train/valid/test proportions are slightly different from those of the original paper [nurseitov2020hkr]. We assumed that the seed was not fixed in the script that produces the split, which was not a big problem for comparing results. The authors of the HKR_Dataset also split the test data into Test1 and Test2, which demonstrates the problem of recognizing text written in handwriting styles that are new to the model. The behavior of our approaches, separately on Test1 and Test2, can be found in the supplementary materials.

Saint Gall dataset contains handwritten historical manuscripts written in Latin that date back to the 9th century. It consists of 60 pages, 1 410 text lines and 11 597 words.

4.2 Metrics

Figure 9: This graph compares the relative train time and CER results of experiments for different datasets and approaches. Arb. units for train time were obtained by the formula , where ms is the minimum value of train time per one image. The colored lines represent obtained experiment points. Since the quality of approaches grows with increasing train time, some points have no line.

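We used the character error rate (CER, %), word error rate (WER, %), and string accuracy (ACC, %) to measure the quality of the HTR models:

$$\mathrm{CER} = \frac{\sum_{i=1}^{n} \mathrm{dist}_c(\mathrm{pred}_i, \mathrm{true}_i)}{\sum_{i=1}^{n} \mathrm{len}_c(\mathrm{true}_i)}, \tag{4}$$

$$\mathrm{WER} = \frac{\sum_{i=1}^{n} \mathrm{dist}_w(\mathrm{pred}_i, \mathrm{true}_i)}{\sum_{i=1}^{n} \mathrm{len}_w(\mathrm{true}_i)}, \tag{5}$$

where $\mathrm{dist}_c$ and $\mathrm{dist}_w$ are the Levenshtein distances calculated in characters (including spaces) and words, respectively, and $\mathrm{len}_c$ and $\mathrm{len}_w$ are the lengths of the string in characters and words, respectively.

$$\mathrm{ACC} = \frac{\sum_{i=1}^{n} [\mathrm{pred}_i = \mathrm{true}_i]}{n}. \tag{6}$$

In formulas (4), (5), and (6), $n$ is the size of the test sample, $\mathrm{pred}_i$ is the string of characters that the model recognized in the $i$-th image, and $\mathrm{true}_i$ is the true transcription of the $i$-th image made by the expert.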
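The three metrics can be computed with a short Python sketch. It assumes that the error rates are aggregated over the whole test sample (summed Levenshtein distances divided by summed reference lengths) and that words are separated by whitespace; the helper names `levenshtein` and `htr_metrics` are illustrative.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of words)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]


def htr_metrics(preds, trues):
    """Return (CER, WER, ACC) in percent for parallel lists of strings."""
    char_err = sum(levenshtein(p, t) for p, t in zip(preds, trues))
    word_err = sum(levenshtein(p.split(), t.split())
                   for p, t in zip(preds, trues))
    cer = char_err / sum(len(t) for t in trues)
    wer = word_err / sum(len(t.split()) for t in trues)
    acc = sum(p == t for p, t in zip(preds, trues)) / len(trues)
    return 100 * cer, 100 * wer, 100 * acc


cer, wer, acc = htr_metrics(["hello world", "good day"],
                            ["hello word", "good day"])
print(round(cer, 2), wer, acc)  # 5.56 25.0 50.0
```

In the usage example, one extra character in 18 reference characters gives a CER of about 5.56%, one wrong word out of four gives a WER of 25%, and one exactly matching line out of two gives an ACC of 50%.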

5 Experiments

5.1 Experimental Setup

Each experiment was repeated five times. AdamW [adamW] was used as an optimizer with the OneCycleLR scheduler, starting at a learning rate of 0.001 down to 1e-08. The mean and standard deviation of each metric are given as the result. The use of augmentations allows for further metric improvements. In our experiments, we used StackMix "on the fly" during training with a probability of 0.8. We added the traditional augmentations CLAHE [Reza2004], JpegCompression, and Rotate, and our augmentation (simulation of crossed-out letters), "HandWritten Blots". Different combinations of augmentations were grouped in our experiments:

  • "base" - experiments without augmentations, 300 epochs (HKR 100 epochs)

  • "augs" - standard augmentations (CLAHE [Reza2004], JpegCompression, Rotate), 300 epochs (HKR 100 epochs)

  • "blots" - using only our HandWrittenBlot augmentation, 500 epochs (HKR 150 epochs)

  • "stackmix" - using only our StackMix approach, 1 000 epochs (HKR 300 epochs)

  • "all" - using all previous augmentations (augs + blots + stackmix), 1 000 epochs (HKR 300 epochs)

Models with StackMix were trained for 1 000 epochs but did not overfit. We believe they should be trained longer with bigger external text corpora. A comparison of our results for various datasets (IAM [marti2002iam], BenthamR0 [bentham], Digital Peter [potanin2021digital], HKR_Dataset [nurseitov2020hkr], Saint Gall [fischer2011transcription]) is presented in Table 4 and in Figure 9.

Inference speed was measured as the average recognition speed for one line of text with batch size 16 and an image size of 128x2048 pixels, using one Tesla V100 GPU. The inference speed for different datasets is shown in Table 2, and the training speed of models with different augmentations is shown in Table 3.

Dataset         Speed, img / sec
BenthamR0       86.2 ± 1.3
IAM             94.9 ± 1.2
Digital Peter   92.6 ± 2.7
HKR             98.4 ± 1.3
Saint Gall      86.3 ± 0.6
Table 2: Inference speed for 1xV100 GPU, image size 128x2048.

Experiment      Speed, img / sec
base            28.4 ± 0.9
augs            28.5 ± 0.6
blots           28.2 ± 1.4
stackmix        21.0 ± 5.0
all             17.7 ± 5.8
Table 3: Training speed for 1xV100 GPU, image size 128x2048.

            BenthamR0                                  IAM-B
            CER, %        WER, %       ACC, %          CER, %        WER, %       ACC, %
base        2.99 ± 0.06   11.8 ± 0.3   52.1 ± 0.8      5.80 ± 0.08   18.9 ± 0.2   29.3 ± 0.4
augs        2.85 ± 0.05   11.3 ± 0.1   52.8 ± 1.0      5.43 ± 0.04   17.8 ± 0.1   30.7 ± 0.5
blots       2.27 ± 0.04   9.5 ± 0.2    57.0 ± 0.4      4.59 ± 0.03   15.0 ± 0.1   36.4 ± 0.5
stackmix    2.08 ± 0.03   9.0 ± 0.1    58.2 ± 0.8      4.90 ± 0.07   16.4 ± 0.2   35.2 ± 0.5
all         1.73 ± 0.02   7.9 ± 0.1    61.9 ± 1.1      3.77 ± 0.06   12.8 ± 0.2   43.6 ± 0.6

            Digital Peter                              IAM-D
            CER, %        WER, %       ACC, %          CER, %        WER, %       ACC, %
base        4.44 ± 0.02   24.3 ± 0.2   43.7 ± 0.5      4.55 ± 0.06   14.5 ± 0.2   35.5 ± 0.7
augs        4.15 ± 0.04   23.0 ± 0.2   45.7 ± 0.4      4.38 ± 0.04   14.0 ± 0.2   36.3 ± 0.6
blots       3.39 ± 0.04   19.3 ± 0.4   51.9 ± 0.4      3.70 ± 0.03   11.8 ± 0.1   42.3 ± 0.6
stackmix    3.40 ± 0.05   19.2 ± 0.3   51.6 ± 0.7      3.77 ± 0.04   12.3 ± 0.1   42.4 ± 0.8
all         2.50 ± 0.03   14.6 ± 0.2   60.8 ± 0.8      3.01 ± 0.02   9.8 ± 0.1    50.7 ± 0.3

            HKR                                        Saint Gall
            CER, %        WER, %       ACC, %          CER, %        WER, %       ACC, %
base        6.71 ± 0.18   22.5 ± 0.3   71.1 ± 0.5      4.71 ± 0.05   32.5 ± 0.3   2.2 ± 0.5
augs        6.67 ± 0.24   21.5 ± 0.3   72.2 ± 0.2      4.49 ± 0.11   31.3 ± 0.5   3.4 ± 0.7
blots       7.40 ± 0.26   23.0 ± 0.5   72.6 ± 0.4      4.08 ± 0.03   28.1 ± 0.2   4.6 ± 1.1
stackmix    3.69 ± 0.13   14.4 ± 0.4   80.0 ± 0.3      4.06 ± 0.12   28.8 ± 0.8   6.1 ± 0.8
all         3.49 ± 0.08   13.0 ± 0.3   82.0 ± 0.5      3.65 ± 0.11   26.2 ± 0.6   11.8 ± 2.0
Table 4: Results for all experiments

5.2 Comparison with State-of-the-art

In this section, we present the comparison with other models (Table 5).

Our model outperforms other approaches on the BenthamR0, HKR, and IAM-D datasets. It reached a CER of 3.77% on the IAM-B dataset, which is very close to the best model, published in 2016, which achieved 3.5% CER. On the Saint Gall dataset, Table 5 lists 3.65% CER for our model against 5.26% for [de2020htr], while our WER (26.2%) remains above their 21.14%.

The IAM dataset has two versions because papers use various data splits. We included IAM-B and IAM-D partitions in Table 5 to compare them with other models of the same split.

The authors of [de2020htr] provided open code for their model [htrflorlink]. We noticed that when evaluating the model, they converted the characters in the predicted and true strings to lowercase. However, in our experiments, we did not convert the characters to lowercase. For real HTR tasks, it is not important to track the case of characters. In the supplementary materials, we provide the metrics of the models trained and evaluated on lowercase characters, and a comparison similar to Table 5 on less widespread datasets.

Model                            CER, %         WER, %

IAM-B
[voigtlaender2016handwriting]    3.5            9.3
[yousef2020origaminet]           4.7            -
[coquenet2020end]                4.32           16.24
[coquenet2020recurrence]         7.99           28.61
[moysset20192d]                  7.73           25.22
[wang2020decoupled]              6.64           19.6
ours                             3.77 ± 0.06    12.8 ± 0.2

IAM-D
[abdallah2020attention]          7.8            25.5
ours                             3.01 ± 0.02    9.8 ± 0.1

Digital Peter
[complink]                       10.5           44.4
ours                             2.50 ± 0.03    14.6 ± 0.2

BenthamR0
[abdallah2020attention]          7.1            20.9
[de2020htr]                      3.98 ± 0.06    9.8 ± 0.14
ours                             1.73 ± 0.02    7.9 ± 0.1

HKR
[abdallah2020attention]          4.5            19.2
ours                             3.49 ± 0.08    13.0 ± 0.3

Saint Gall
[abdallah2020attention]          7.25           23.0
[de2020htr]                      5.26 ± 0.03    21.14 ± 0.13
ours                             3.65 ± 0.11    26.2 ± 0.6
Table 5: Comparison to other models, test set

6 Conclusion

The neural network architecture presented in this article allows the problem of handwriting recognition of both modern and historical documents for various languages to be solved. The described augmentations - HandWritten Blots and StackMix - further improve the quality of recognition, demonstrating the best result among the currently known handwriting recognition systems.

The presented system can significantly increase the speed of deciphering historical documents. For example, according to the organizers of the competition [complink], it took a team of 10-15 historians about 3 months to decipher 662 pages of manuscripts from the Digital Peter dataset. When working on the same dataset on a single Tesla V100, the average recognition speed was 95 lines/s or 380 pages/min, which is unattainable for human experts.