Due to the rapid growth of document storage in modern society, handwriting recognition has become an important research branch in the field of machine learning and pattern recognition. In the context of handwriting digit strings recognition (HDSR), there are several application scenarios such as postcode recognition, bank checks, document indexing and word spotting [1, 4, 14].
Most state-of-the-art approaches focus on the segmentation of connected components, followed by training a model to classify each component. However, the performance collapses when two or more digits are touching. Besides, variant approaches based on multiple segmentation algorithms have been proposed. For such algorithms, generating a potential segmentation cut requires a heuristic analysis based on background and foreground information, contour, shape, or a combination of these. The over-segmentation strategy is frequently used to optimize the segmentation of touching components by over-segmenting the string. Although over-segmentation increases the probability of obtaining a plausible segmentation cut, it also increases the computational cost, since the hypothesis space expands exponentially with the number of segmentation cuts. Moreover, the use of heuristics to deal with touching digits has shown that performance degrades in the presence of noise, fragments, and the lack of context in digit strings, such as a lexicon and the string length.
To better address these issues and to take advantage of deep learning models, segmentation-free methods have emerged, offering unique capabilities to the research community in the domain [18, 2, 17, 23]. Along this line, the proposed approaches rely on implicit segmentation through a deep model or a set of them. Recently, Hochuli et al. evaluated object recognition models in the HDSR context. For that, a digit string is considered a sequence of objects. The advantage is that these models can encode the background, the shape, and the neighbourhood of digits efficiently. Nonetheless, the annotation of digit bounding boxes (ground truth) is a drawback when synthetic data are not available. Finally, sequence-to-sequence approaches were proposed, resulting in feasible end-to-end models for word string recognition. The primary strategy is to split the input string into fragments to feed a recurrent neural network (RNN); a transcription layer then determines the resulting string. As stated in [23, 19], due to the lack of context, this approach did not achieve outstanding results in HDSR; however, it offers a good trade-off between data annotation, training complexity and accuracy.
An essential aspect of the strategies mentioned above is that most of them have been proposed for modern handwritten digit string recognition. There is still a lack of approaches in the context of historical (ancient) document recognition. The challenges differ from those of modern documents and include paper texture deterioration, noise, ancient handwriting style, ink failure, bleed-through, and the lack of data. Remarkably, the performance of the modern approaches applied to the historical document context is a matter of discussion.
Recently, Kusetogullari et al. released to the public the ARDIS dataset, composed of historical handwriting digit strings extracted from Swedish church records. Their comprehensive analysis reveals that a model trained with modern isolated digits (MNIST, USPS, etc.) fails by a fair margin to correctly encode the isolated digits from ARDIS due to their unique characteristics. However, a comprehensive analysis using the historical digit strings of the ARDIS dataset is missing. In light of this, we propose to assess two state-of-the-art approaches, applying slight modifications to their architectures to better fit the characteristics of the ARDIS digit strings. Moreover, we introduce data augmentation techniques to represent the classes more efficiently. The resulting analysis proposes a baseline for the ARDIS strings dataset and pinpoints which efforts are needed to implement feasible solutions for historical digit string recognition. Moreover, it highlights the research gaps for further investigation.
2 Related Work
We surveyed the state of the art and divided the related work into two main categories: (a) segmentation-based and (b) segmentation-free approaches. The related work is discussed in the following paragraphs.
Segmentation-based approaches: By segmenting the connected components as much as possible, we attain the concept of over-segmentation, which is the most commonly used strategy. It maximizes the probability of generating the optimal segmentation point. However, as mentioned earlier, it also increases the hypothesis space, resulting in a higher computational cost to classify all the candidates compared with a strategy based on single segmentation.
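The growth of the hypothesis space can be illustrated with a minimal sketch. It assumes, as a simplification, that every subset of the candidate cuts forms a valid segmentation hypothesis; practical systems prune many of them. The function names are hypothetical and not taken from any cited work.

```python
from itertools import product


def enumerate_hypotheses(num_cuts):
    """List every keep/discard decision over the candidate cuts (1 = keep)."""
    return list(product((0, 1), repeat=num_cuts))


def hypothesis_count(num_cuts):
    """Each candidate cut is either kept or discarded, giving 2**n hypotheses."""
    return 2 ** num_cuts


# Assumption: unconstrained hypothesis space, no pruning or filtering.
for n in (1, 3, 5, 10):
    print(n, "cuts ->", hypothesis_count(n), "hypotheses")
```

Under this assumption, each additional candidate cut doubles the number of hypotheses to classify, which is why filtering unnecessary cuts pays off.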
An implicit filter is proposed by Vellasques et al. to reduce the computational cost of over-segmentation, in which a Support Vector Machine (SVM) classifier is used to determine whether the cut produces reliable candidates. The proposed filter succeeded in eliminating up to 83% of unnecessary segmentation cuts in their experimental results.
In Roy et al. , a segmentation-based approach is devised to segment out destination address block for postal applications; a review of postal and check processing applications is warranted in .
Besides the strategies mentioned above, several segmentation algorithms were proposed in the last decade. Ribas et al. assessed most of them considering their performance, computational cost, touching types, and complexity. This characterization aimed to identify the limitations of the algorithms based on a given pair of touching digits. Moreover, the work reveals that most of the heuristic segmentation strategies are biased towards the characteristics of the dataset under scrutiny; thus, a single method that works for all touching types is impractical.
Segmentation-free approaches: To the best of our knowledge, the first attempt along this line was proposed by Matan et al. A convolutional neural network (CNN) based model is displaced from left to right over the input. The proposed approach is termed SDNN (Space Displacement Neural Network), and it reported 66% correct classification on 3000 images of ZIP Codes. LeCun et al. stated that SDNN is an attractive technique but has not yielded better results than heuristic segmentation methods.
Years later, Choi and Oh presented a modular neural network composed of 100 sub-networks trained to recognize 100 classes of touching digits (00..99). The recognition rate on 1374 pairs of digits extracted from the NIST database reaches 95.3%. A similar concept was presented by Ciresan, in which a 100-class CNN was trained with 200,000 images, reporting a recognition rate of 94.65%.
An image-based sequence recognition approach was proposed by Shi et al. The end-to-end framework, named Convolutional Recurrent Neural Network (CRNN), naturally handles sequences of arbitrary length without character segmentation or horizontal scale normalization. The approach achieved outstanding performance on recognising scene text (text in the wild) and music scores.
To make handwriting digit recognition less dependent on segmentation algorithms, Hochuli et al. proposed a segmentation-free framework based on a dynamic selection of classifiers. The authors postulate that a set of convolutional neural networks trained to (a) predict the size of touching components and (b) specific-task models to recognize up to three touching digits performs better than if the digits were segmented. However, this algorithm’s generalisation to other datasets needs further verification since there is a lack of diversity of the used datasets in the experimental protocol.
Cheng et al. proposed a strategy based on an improved VGG-16 model to overcome the lack of texture features in handwriting digit recognition. The model was examined on the extended MNIST dataset and achieved a high accuracy of 99.97%, which indicates that this VGG-based model has a more robust feature extraction ability than traditional classifiers and can meet the requirements of handwriting digit classification and recognition. Besides, a VGG-like model with multiple sub-classifiers was built to recognize CAPTCHAs. Although the CAPTCHA test images feature a lot of noise and touching digits, the model accuracy reached 98.26% without any pre-segmentation.
End-to-end approaches have been frequently proposed in recent years, including those tackling writer identification and document analysis and recognition [24, 25]. Recently, approaches based on object recognition models have been applied to the HDSR task [17, 19]. Considering that a string is a sequence of objects, these models can efficiently encode the background, shape, and neighbourhood of digits, providing an end-to-end solution for the problem. Additionally, they reduce the restrictions imposed on the number of touching components or string length. However, the annotation of each digit bounding box (ground truth) is a bottleneck when synthetic data are not applicable.
3 ARDIS dataset
The Arkiv Digital Sweden (ARDIS) Dataset comprises historical handwriting digit strings extracted from Swedish church document images written by different priests from 1895 to 1970. The dataset is fully annotated, including the digits’ bounding boxes. The sub-datasets of ARDIS are exemplified in Figure 1.
The dataset (I) is composed of 4-digit strings that represent the year of a record. Most of the samples were cropped with the size of 175 x 95 pixels from the document image and stored in its pristine RGB colour space. The dataset (I) contains 75 classes mapping to different years. However, due to insufficient samples, classes later than 1920 were not considered in this work. The class distribution used in this study is shown in Figure 2.
The historical digit strings pose several challenges to classification, including the variability of handwriting styles, touching digits, ink failure, and noisy handwriting. Figure 3 demonstrates some of the aforementioned challenges. Moreover, as observed in Figure 2, the classes are not equally distributed.
The dataset (II) comprises cropped digits from the original digit strings, containing artefacts and fragments of the neighbour digits. In dataset (III), the isolated digits were manually cleaned. For completeness, a uniform distribution of each digit’s occurrences was ensured in the dataset (III), resulting in 7600 de-noised digit images in RGB colour space. For dataset (IV), the isolated digits are normalised and binarised.
3.1 Synthetic Data
Recently, data augmentation techniques for handwritten digits were proposed to generate synthetic training sets [30, 2, 18]. In order to improve the data representation of the ARDIS strings (dataset I), we propose the creation of synthetic data by permuting and concatenating several single digits from dataset III. Figure 4 depicts two synthetic samples. Although both digit strings belong to the label “1987”, the representation (e.g., style) is remarkably different.
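The concatenation idea can be sketched as follows. This is only an illustration under stated assumptions: images are represented as nested lists of pixel rows with equal height, and `concat_digits`, `make_synthetic_string`, and the toy `pool` are hypothetical names and data, not the authors' pipeline.

```python
import random


def concat_digits(digit_images):
    """Horizontally join equally tall images given as lists of pixel rows."""
    height = len(digit_images[0])
    if any(len(img) != height for img in digit_images):
        raise ValueError("all digit images must share the same height")
    return [sum((img[r] for img in digit_images), []) for r in range(height)]


def make_synthetic_string(label, pool, rng):
    """Pick one random isolated sample per digit of `label` and join them."""
    chosen = [rng.choice(pool[ch]) for ch in label]
    return concat_digits(chosen)


# Toy pool (assumption): one 2x3 image per digit class.
pool = {
    "1": [[[0, 1, 0], [0, 1, 0]]],
    "8": [[[1, 1, 1], [1, 1, 1]]],
}
sample = make_synthetic_string("18", pool, random.Random(0))
print(sample)
```

Because each digit is drawn independently from the pool, two synthetic strings with the same label can combine samples from different writers, which is the source of the style variation noted above.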
We combine synthetic and real data to reach up to 1000 samples per class during the training phase to balance the class distribution. The numbers of real and synthetic samples are summarized in Table 1. Considering the limited amount of data in the ARDIS dataset, it is reasonable (and close to standard practice) to retain about 43% of the real data for testing (2651/(3474+2651)) and use the remaining 57% of the real data for training. It is worth mentioning that the synthetic data, which represent 88.8% of the training samples, were used only for training. In total, 31000 and 2651 images are used for training and testing, respectively. All the data used in this work are publicly available through the ARDIS website (https://ardisdataset.github.io/ARDIS/), made available by the dataset’s authors.
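The split percentages quoted above can be checked with a few lines of arithmetic. All input counts come from the text; the number of synthetic training samples is derived here as the difference between the total and the real training samples, which is an inference rather than a reported figure.

```python
real_train, real_test = 3474, 2651   # real samples, from the text
classes, per_class = 31, 1000        # 31 year classes, 1000 samples each

total_train = classes * per_class              # 31000 training images
synthetic_train = total_train - real_train     # derived, not reported

test_share = real_test / (real_train + real_test)
synthetic_share = synthetic_train / total_train

print(f"real data held out for testing: {test_share:.1%}")
print(f"synthetic share of training set: {synthetic_share:.1%}")
```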
|Label||Training Samples||Test Samples|
4 Approaches for Historical Handwriting Digit String Recognition
As stated by  and , the segmentation problem has been overcome by segmentation-free approaches in the recent advances in deep learning models. In light of this, we propose to evaluate two segmentation-free approaches on the ARDIS dataset to tackle the task of historical handwriting digit string recognition. The first approach (Section 4.1) is based on the well-known VGG-16 model. The second one (Section 4.2) is based on a sequence-to-sequence model.
Motivations on the choice of models: Given the VGG-16 model’s decent performance in other character recognition tasks, we consider it the baseline of this experiment against which alternative models are evaluated. The VGG network model was proposed by the Visual Geometry Group (VGG) at Oxford University. When first created, the focus of this network was to classify materials by their textural appearance and not by their colour. Due to the excellent generalisation performance of VGG-Net, its pre-trained model on the ImageNet dataset is widely used for feature extraction problems [9, 13], such as object candidate frame (object proposal) generation, fine-grained object localization, image retrieval, and image co-localization. On the other hand, our new approach is based on modifying the concept of CRNN. The CRNN is mainly used for end-to-end recognition of text sequences of indefinite length. It does not require pre-segmentation of long continuous text.
4.1 Specific-Task Classifiers
In this approach, we adapted the well-known VGG-16 model to the context of the ARDIS dataset, which is composed of 4-digit strings. Instead of using a dynamic selection of classifiers, we propose to parallelize the classification task by adding four dense layers (classifiers) at the end of the architecture. The final architecture is depicted in Figure 5. With this simple modification, we produce an end-to-end pipeline that avoids both heuristic segmentation and fusion methods.
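The rationale here is that each specific-task classifier ($C_1$, $C_2$, $C_3$, and $C_4$) should determine the ten classes (0..9) for the corresponding digit of the 4-digit string. The prediction for an input digit string is defined as follows:
Let $p_i(j \mid x)$ be the probability produced by the $i$-th digit classifier $C_i$ (10 classes) for class $j$ given the input $x$. Then, the $i$-th digit is assigned to the class $\hat{y}_i$ ($j = 0 \ldots 9$) according to Equation 1:
$$\hat{y}_i = \arg\max_{j \in \{0, \ldots, 9\}} p_i(j \mid x), \quad i = 1, \ldots, 4 \tag{1}$$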
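The per-digit decision rule can be sketched in a few lines, assuming each of the four classifiers outputs a list of ten class probabilities. The function `predict_string` and the helper `peaked` are hypothetical names used only for illustration; the actual model produces these probabilities from the dense layers on top of VGG-16.

```python
def predict_string(head_probs):
    """head_probs: four lists of ten class probabilities, one per digit."""
    digits = []
    for probs in head_probs:
        # Index of the largest probability is the predicted digit class.
        best = max(range(len(probs)), key=probs.__getitem__)
        digits.append(str(best))
    return "".join(digits)


def peaked(k, n_classes=10):
    """Toy distribution (assumption): most of the mass on class k."""
    return [0.91 if j == k else 0.01 for j in range(n_classes)]


print(predict_string([peaked(1), peaked(9), peaked(0), peaked(5)]))
```

Each position is decided independently, so no fusion step or explicit segmentation is needed to assemble the final 4-digit string.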
First, convolutional layers extract features from an input image, and then a sequence of feature vectors is extracted from the feature maps. Since each region of the feature map is associated with a receptive field in the input image, each vector in the sequence is a descriptor of this image field, as illustrated in Figure 6b. Next, this sequence is fed to the recurrent layers, which are composed of a bidirectional Long Short-Term Memory (LSTM) network, producing a per-frame prediction from left to right of the image. Finally, the transcription layer determines the correct sequence of classes for the input image by removing the repeated adjacent labels and the blanks, represented by the character ‘-’. This solution is well suited when the past and future context of a sequence contributes to recognising the whole input. With the aid of contextual information, such as a lexicon, this approach achieves high text recognition performance. The application of this solution to handwriting digits is a matter of discussion since we have fewer classes (0..9) as compared to words, but there is no lexicon to mitigate possible confusion.
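The transcription step described above (merge repeated adjacent labels, then drop blanks) can be sketched as a greedy collapse over per-frame labels. This is a minimal illustration; `ctc_collapse` is a hypothetical name, and the per-frame labels are assumed to be already chosen as the most likely class of each frame.

```python
from itertools import groupby

BLANK = "-"


def ctc_collapse(frame_labels, blank=BLANK):
    """Merge runs of identical adjacent labels, then remove blank symbols."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != blank)


print(ctc_collapse("11-88-9-00"))
print(ctc_collapse("1-8-99-9"))
```

The blank symbol is what allows a genuinely repeated digit to survive the merge: in the second call, the two runs of ‘9’ are separated by a blank and are therefore kept as two digits.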
To address the context of historical digit string recognition, we propose a modification of the recurrent layers. Due to the lack of data, we replaced the LSTM with a Gated Recurrent Unit (GRU). Since the latter has fewer parameters, it reduces training time and also minimises the impact of the vanishing gradient. Moreover, we combined two identical GRU layers to process the feature maps from left to right and vice versa, and then the outputs of both GRUs are merged. It is worth mentioning that the feature maps fed to the GRU are reshaped into vectors to provide a sequence of information. Further, another two identical GRUs repeat the process; however, their outputs are concatenated. Finally, a fully connected layer determines the class probabilities, and the connectionist temporal classification (CTC) layer determines the final prediction. The architecture and characterisation of our modified CRNN approach are depicted in Figure 7.
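The parameter saving of a GRU over an LSTM can be made concrete with the textbook formulations, in which an LSTM has four gate-like blocks and a GRU has three, each with an input matrix, a recurrent matrix, and a bias. The input and hidden sizes below are hypothetical, not the values used in this work, and some implementations keep an extra bias vector per gate, which changes the counts slightly.

```python
def gate_block_params(input_size, hidden_size):
    """Input weights + recurrent weights + bias for one gate-like block."""
    return hidden_size * (input_size + hidden_size) + hidden_size


def lstm_params(input_size, hidden_size):
    return 4 * gate_block_params(input_size, hidden_size)


def gru_params(input_size, hidden_size):
    return 3 * gate_block_params(input_size, hidden_size)


d, h = 512, 128  # hypothetical feature and hidden sizes
print("LSTM:", lstm_params(d, h))
print("GRU: ", gru_params(d, h))
print("ratio:", gru_params(d, h) / lstm_params(d, h))
```

Under these assumptions, a GRU layer needs three quarters of the parameters of an LSTM layer of the same size, which is relevant when training data are scarce.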
In this section, we assess all reported models in the context of HDSR. We also added to the comparison the native VGG16 model (i.e., without our modification). All metrics used to measure the performance are described in Section 5.1. The training protocol is presented in Section 5.2. Finally, the results are discussed in Section 5.3.
5.1 Evaluation Metrics
To better assess the performance of the proposed approaches, besides the well-known accuracy and F1-score, we propose using the Normalized Levenshtein Distance (NLD) and the Average Normalized Levenshtein Distance (ANLD).
The Normalized Levenshtein Distance (NLD) describes how close a predicted string is to the ground truth by eliminating the influence of string length on performance measurement. The NLD can be defined as follows:
$$\mathrm{NLD}(s, \hat{s}) = \frac{\mathrm{LD}(s, \hat{s})}{|s|}$$
where $s$ is the ground-truth string, $\hat{s}$ is the predicted string, $|s|$ represents the length of the string, and LD represents the Levenshtein distance, which refers to the minimum number of character editing operations (insert, delete, and substitute) needed to convert one string into the other. A zero value indicates a correct prediction.
Complementary to this, the Average Normalized Levenshtein Distance (ANLD) is a soft metric to evaluate the model performance:
$$\mathrm{ANLD} = \frac{1}{T} \sum_{i=1}^{T} \mathrm{NLD}(s_i, \hat{s}_i)$$
where T indicates the number of evaluated strings. Lower values indicate better performance.
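Both metrics can be computed with a short dynamic-programming sketch. It assumes that the normalisation divides the Levenshtein distance by the length of the ground-truth string, which is our reading of the definition; the function names are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions (a -> b)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def nld(truth, predicted):
    """Assumption: normalise by the length of the ground-truth string."""
    return levenshtein(truth, predicted) / len(truth)


def anld(pairs):
    """Average NLD over (ground truth, prediction) pairs."""
    return sum(nld(t, p) for t, p in pairs) / len(pairs)


pairs = [("1890", "1890"), ("1890", "1819"), ("1905", "190")]
print([nld(t, p) for t, p in pairs], anld(pairs))
```

Unlike accuracy, which counts the second and third predictions simply as errors, the ANLD gives partial credit according to how many digits were recovered.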
The models were fine-tuned using the data described in Table 1, which comprises real and synthetic data summing up to 1000 training samples for each of the 31 classes (the number of years in the selected period). All the classifiers were trained for up to 100 epochs, and over-fitting was prevented through early stopping when no improvement occurred for ten epochs. The training parameters are summarized in Table 2.
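The stopping rule can be sketched as follows, assuming that the monitored quantity is a validation loss and that "no improvement" means no new minimum; both are assumptions, as the monitored quantity is not specified. The loss values are simulated and purely illustrative.

```python
def stopping_epoch(val_losses, max_epochs=100, patience=10):
    """Return the 1-based epoch at which training would stop."""
    best = float("inf")
    stale = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return min(len(val_losses), max_epochs)


# Simulated curve (assumption): improves for 3 epochs, then plateaus.
losses = [1.0, 0.8, 0.7] + [0.75] * 20
print(stopping_epoch(losses))
```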
|Specific-Task||32||Cat. Crossentropy||Dropout (0.25)||Adam||TRUE|
|VGG-16||32||Cat. Crossentropy||L2 ()||Adam||FALSE|
5.3 Results and Discussion
The performance of the evaluated models in this work is reported in Table 3. As described in Section 4.1, we modified the last layer of the VGG-16 architecture by adding four classifiers instead of one. As shown in Table 3, this modification achieved the best performance since it can encode the string by implicitly segmenting the digits with the specific-task classifiers. A comprehensive analysis reveals that, in several cases, the information of the whole image is difficult to encode with only one classifier (VGG-16) due to the handwriting variability. For example, consider the strings “1890” and “1819”. From a computational perspective, the global representation makes it difficult to discriminate the digits ‘0’ and ‘9’, which misleads the classifier. However, for the specific-task approach, the information can be implicitly segmented according to each classifier’s domain space. In this regard, the classifiers for the first and second digits exhibit reduced complexity compared with those for the third and fourth digits, since the former two need to discriminate fewer classes (only 1 for the first digit and [8,9] for the second).
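The reduced domain of the first two classifiers can be verified by enumerating the digits that occur at each position, assuming the 31 classes are the consecutive years 1890 to 1920. The sets for the third and fourth positions are derived by this sketch and are not figures reported in the experiments.

```python
# Assumption: classes are the consecutive years 1890..1920 (31 classes).
years = [str(y) for y in range(1890, 1921)]

# Distinct digits observed at each of the four string positions.
per_position = [sorted({y[i] for y in years}) for i in range(4)]

for i, digits in enumerate(per_position, start=1):
    print(f"digit {i}: {len(digits)} classes -> {digits}")
```

The first classifier faces a single class and the second only two, whereas the fourth must still separate all ten digits.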
Regarding the CRNN, we believe that the model suffers from a lack of context. Contrary to word recognition, the CTC layer produces wrong predictions since there is no lexicon to mitigate confusions such as fragment recognition and repeated labels. The issue is quite similar to that of over-segmentation.
6 Conclusion and Future Work
In this work, we explored the recognition of historical digit strings. In this context, an image-based dataset containing 31 classes representing handwritten years ranging from 1890 to 1920 was used.
To this end, we evaluated three models implementing end-to-end solutions. Synthetic data were employed to overcome the lack of data.
The proposed approach that combines four specific-task classifiers achieved outstanding results. This promising performance can be explained by the implicit segmentation of the input string made by the domain space of each specific-task classifier. On the other hand, the approach based on a single classifier suffers due to the handwriting variability represented in a global perspective. Regarding the CRNN approach, it suffers from the lack of lexicon, as also stated in .
For future work, since context information about the first and second digits is available, a reduced number of classifiers for the specific-task approach could be examined. Also, we will investigate a dynamic approach based on VGG-16 that can implicitly handle strings of different lengths.
Acknowledgments. This project is supported by the research project “DocPRESERV: Preserving and Processing Historical Document Images with Artificial Intelligence”, STINT, the Swedish Foundation for International Cooperation in Research and Higher Education (Grant: AF2020-8892).
References
[1] (2014) Word spotting and recognition with embedded attributes. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (12), pp. 2552–2566.
[2] (2019) Unknown-length handwritten numeral string recognition using cascade of PCA-SVMNet classifiers. IEEE Access 7, pp. 52024–52034.
[3] Active graph based semi-supervised learning using image matching: application to handwritten digit recognition. Pattern Recognition Letters 73, pp. 76–82.
[4] (2016) Towards query by text example for pattern spotting in historical documents. In 2016 7th International Conference on Computer Science and Information Technology (CSIT), pp. 1–6.
[5] (2019) Handwritten digit recognition based on improved VGG16 network. In Tenth International Conference on Graphics and Image Processing (ICGIP 2018), Vol. 11069, pp. 110693B.
[6] Learning phrase representations using RNN encoder-decoder for statistical machine translation. Conference on Empirical Methods in Natural Language Processing (EMNLP 2014).
[7] A segmentation-free recognition of handwritten touching numeral pairs using modular neural network. International Journal of Pattern Recognition and Artificial Intelligence 15 (06), pp. 949–966.
[8] (2020) An end-to-end deep learning system for medieval writer identification. Pattern Recognition Letters 129, pp. 137–143.
[9] Deep filter banks for texture recognition and segmentation. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3828–3836.
[10] (2008) Avoiding segmentation in multi-digit numeral string recognition by combining single and two-digit classifiers trained without negative examples. In 2008 10th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, pp. 225–230.
[11] (2014) ICFHR 2014 competition on handwritten digit string recognition in challenging datasets (HDSRC 2014). In 2014 14th International Conference on Frontiers in Handwriting Recognition, pp. 779–784.
[12] (2018) Improving CNN-RNN hybrid networks for handwriting recognition. In 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 80–85.
[13] (2015) Deep spatial pyramid: the devil is once again in the details. arXiv preprint arXiv:1504.05277.
[14] (2020) Towards data-efficient modeling for wake word spotting. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7479–7483.
[15] (2017) DeepProposals: hunting objects and actions by cascading deep convolutional layers. International Journal of Computer Vision 124 (2), pp. 115–131.
[16] (2014) Document analysis in postal applications and check processing. Handbook of Document Image Processing and Recognition, pp. 705–747.
[17] (2020) An end-to-end approach for recognition of modern and historical handwritten numeral strings. In 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1–8.
[18] (2018) Handwritten digit segmentation: is it still necessary?. Pattern Recognition 78, pp. 1–11.
[19] (2021) A comprehensive comparison of end-to-end approaches for handwritten digit string recognition. Expert Systems with Applications 165, pp. 114196.
[20] (2019) ARDIS: a Swedish historical handwritten digit dataset. Neural Computing and Applications, pp. 1–14.
[21] (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
[22] (1992) Multi-digit recognition using a space displacement neural network. In Advances in Neural Information Processing Systems, pp. 488–495.
[23] (2020) HDSR-Flor: a robust end-to-end system to solve the handwritten digit string recognition problem in real complex scenarios. IEEE Access 8, pp. 208543–208553.
[24] OCR-D: an end-to-end open source OCR framework for historical printed documents. In Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage, DATeCH2019, New York, NY, USA, pp. 53–58.
[25] (2019) Attend, copy, parse end-to-end information extraction from documents. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 329–336.
[26] (2013) Handwritten digit segmentation: a comparative study. International Journal on Document Analysis and Recognition (IJDAR) 16 (2), pp. 127–137.
[27] (2005) A system for Indian postal automation. In Eighth International Conference on Document Analysis and Recognition (ICDAR’05), pp. 1060–1064 Vol. 2.
[28] (1997) Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45 (11), pp. 2673–2681.
[29] (2017) An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (11), pp. 2298–2304.
[30] (2016) An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (11), pp. 2298–2304.
[31] (2015) Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations.
[32] (2008) Filtering segmentation cuts for digit string recognition. Pattern Recognition 41 (10), pp. 3044–3053.
[33] (2016) Handwriting recognition with large multidimensional long short-term memory recurrent neural networks. In 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 228–233.
[34] (2017) Selective convolutional descriptor aggregation for fine-grained image retrieval. Trans. Img. Proc. 26 (6), pp. 2868–2881.
[35] (2016) Learning structured sparsity in deep neural networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, Red Hook, NY, USA, pp. 2082–2090.