HANA: A HAndwritten NAme Database for Offline Handwritten Text Recognition

01/22/2021
by   Christian M. Dahl, et al.

Abstract

Methods for linking individuals across historical data sets, typically in combination with AI-based transcription models, are developing rapidly. Probably the single most important identifier for linking is personal names. However, personal names are prone to enumeration and transcription errors, and although modern linking methods are designed to handle such challenges, these sources of error are critical and should be minimized. For this purpose, improved transcription methods and large-scale databases are crucial components. This paper describes and provides documentation for HANA, a newly constructed large-scale database which consists of more than 1.1 million images of handwritten word-groups. The database is a collection of personal names, containing more than 105 thousand unique names with a total of more than 3.3 million examples. In addition, we present benchmark results for deep learning models that can automatically transcribe the personal names from the scanned documents. Focusing mainly on personal names, due to their vital role in linking, we hope to foster more sophisticated, accurate, and robust models for handwritten text recognition by making more challenging large-scale databases publicly available. This paper describes the data source, the collection process, and the image-processing procedures and methods that are involved in extracting the handwritten personal names and handwritten text in general from the forms.

I Introduction

As part of the global digitization of historical archives, the present and future challenges are to transcribe these efficiently and cost-effectively. We hope that the scale, label accuracy, and structure of the HANA database can offer opportunities for researchers to test the robustness of their handwritten text recognition (HTR) methods and models on more challenging, large-scale, and highly unbalanced databases.

The availability of large-scale databases for training and testing HTR models is a core prerequisite for constructing high-performance models. While several databases based on historical documents are available, only a few have been made available for personal names. For linking, matching, or genealogy, the personal names of individuals are among the most important pieces of information, and being able to read personal names across historical documents is of great importance for linking individuals across e.g. censuses, see, for example, Abramitzkyetal2012, Abramitzkyetal2013, Abramitzkyetal2014, Abramitzkyetal2016, Abramitzkyetal2020a, Abramitzkyetal2020b, Feigenbaum2018, Massey2017, Baileyetal2020, and Priceetal2019. Importantly, Abramitzkyetal2020a and Baileyetal2020 both discuss the rather low matching rates when linking transcriptions of the same census together. This is partly due to low transcription accuracy of names. Furthermore, both papers are concerned about the lack of representativeness of the linked samples. These two observations strongly motivate the HANA database, i.e., collecting and sharing more data of higher quality, and the work on improving transcription methods in order to reduce the potential biases in record linking.

In total, the HANA database consists of more than 1.1 million personal names written as word-groups on single-line images. While most of the existing databases contain single isolated characters or isolated words, such as names available on the repository from Appen, this database contributes to the literature of more unconstrained HTR. All original images are made electronically available by Copenhagen Archives and the processed database described here is made freely available with this paper (see Footnote 1).

Footnote 1: The HANA database will soon be made available for download from an online repository but can already now be obtained by request from the authors. The Appen database on handwritten names (names with clearly separated characters) is available here: https://appen.com/datasets/handwritten-name-transcription-from-an-image/.

A complete description of the image-processing methods for transcription is included in the appendix, allowing users to experiment with the image-processing to further improve the database. One of the important features of this database is its resemblance to other challenging historical documents, with general image noise, different writing styles, and varying traits across the images.

As pointed out by Combesetal2020, a new, flourishing literature uses historical data to study the development of regions, cities, and neighbourhoods. Many of these new studies use the very rich data source based on historical maps. The machine learning models and image-processing methods discussed in this paper for the visual recognition of handwritten text could also be used to turn historical maps into data efficiently. The methods can easily be adapted and trained for general pattern recognition and for recognizing the shapes of objects and their surroundings, and they are powerful enough to identify buildings, roads, agricultural parcels, and more.

The remainder of the paper is organized as follows: In Section II, we describe the database and the data acquisition procedure in detail. Section III presents the benchmark results on the database using a ResNet-50 deep neural network in three different model settings. In Section IV, we discuss the database and the benchmark methods and results, in addition to some considerations for future research. Section V concludes.

II Constructing the HANA Database

This section describes the HANA database in detail and the image-processing procedures involved in extracting the handwritten text from the forms. In 1890, Copenhagen introduced a precursor to the Danish National Register. This register was organized and structured by the police in Copenhagen and has been digitized and labelled by hundreds of volunteers at Copenhagen Archives. In Figure I, we present an example of one of these register sheets.

The Register Sheets In total, we obtain 1,419,491 scanned police register sheets from Copenhagen Archives. All adults above the age of 10 residing in Copenhagen in the period 1890 to 1923 are registered in these forms. Children between 10 and 13 were registered on their father’s register sheet. Once they turned 14, they obtained their own sheet. Married women were recorded on their husband’s register sheet, while single women were recorded on their own sheet. The documents cover the municipality of Copenhagen, which from around 1901 through 1902 also included Brønshøj, Emdrup, Sundbyerne, Utterslev, Valby and Vanløse. The only exception is Nyboderne, as individuals residing in these buildings are registered only in the censuses. Aside from the municipality of Copenhagen, there are some addresses from the municipality of Frederiksberg due to the frequent migration between these two municipalities.

Prior to 1890, the main registers used by the police were the census lists, which go back to 1865 and were kept until 1923. However, because the census lists were only recorded twice a year, in May and November, some migration across addresses was not recorded, and individuals residing only briefly in the city would not have been recorded (stadsarkiv_mandtal).

Population of Denmark

Year          1890        1901        1911        1921
Denmark       2,172,380   2,449,540   2,757,076   3,267,831
Copenhagen    312,859     378,235     462,161     561,344

  • The table shows the population of Denmark and its capital, Copenhagen. This illustrates that the register sheets approximately cover 16% of the adult population of Denmark in the period from 1890 to 1923. This table is from folketal.
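The 16% figure can be checked roughly against the table. The sketch below assumes that the figure corresponds to the average of Copenhagen's share of the total population over the four listed years; the source does not state how it was derived, and the table reports total rather than adult population.

```python
# Population figures copied from the table above.
population = {
    1890: {"Denmark": 2_172_380, "Copenhagen": 312_859},
    1901: {"Denmark": 2_449_540, "Copenhagen": 378_235},
    1911: {"Denmark": 2_757_076, "Copenhagen": 462_161},
    1921: {"Denmark": 3_267_831, "Copenhagen": 561_344},
}


def copenhagen_share(year):
    """Fraction of the Danish population living in Copenhagen in a given year."""
    row = population[year]
    return row["Copenhagen"] / row["Denmark"]


shares = {year: copenhagen_share(year) for year in population}

# Assumption: the 16% quoted in the text is an average over the four years.
mean_share = sum(shares.values()) / len(shares)

for year, share in shares.items():
    print(f"{year}: {share:.1%}")
print(f"mean: {mean_share:.1%}")
```

Under this assumption the share rises from roughly 14% in 1890 to roughly 17% in 1921, which is consistent with an overall figure of about 16%.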

A wealth of information is recorded in the police register sheets, including birth date, occupation, address, name of spouse, and more, all of which is systematically structured across the forms. While this paper focuses on extracting and creating benchmark results for the personal names, the remaining information can be constructed using similar procedures to those presented in this paper and may serve as additional databases for HTR models.

Rare information is also included in the register sheets, such as whether the individuals were wanted, had committed prior criminal offences, or owed child benefits. This kind of information is written as notes and is therefore typically written under special remarks in the documents. As opposed to the censuses, which were sorted by streets and dates, the police register sheets were sorted by personal names. This made it easier for the police to control the migration of citizens of Copenhagen and track individuals over time. Once an individual died, they were transferred to the death register (deathregister).

In 1923, the Danish National Register replaced the registration performed by the police of all citizens in Copenhagen (stadsarkiv). Ever since 1924, the Danish National Register has registered all individuals in all municipalities in Denmark (folkeregister).

Data Extraction and Segmentation To segment the data we use point set registration. Point set registration refers to the problem of aligning a set of points found in an input image with the corresponding set of points in a template image (registration). To find point sets that roughly correspond to each other across semi-structured documents, we extract horizontal and vertical lines from the document. We use the intersections of these lines as the point set, which we align with the template points. We briefly outline the method below; see dahl2020 for more details.

The figure shows an example of the raw documents received from Copenhagen Archives. The first line specifies the date the document was filled. The second line contains the full name while the occupation is written in the smaller region just below the name. The fourth line contains the birth place and the birth date. The sections below contain information on the spouse and children.
Figure I:
Example of a Police Register Sheet

To start the process of extracting the personal names from the forms, we binarize the images. We extract horizontal and vertical lines from the documents by performing several morphological transformations similar to those described by other authors, such as extracting_lines. The intersections are subsequently found using Harris corner detection (harris). Once we have the point space defined, we use Coherent Point Drift (CPD), which coherently aligns the point space from the input image to the point space on the template image. This yields a transformation function that maps the points found in the input images to the points in the template image. To improve the segmentation performance of the database, we add several restrictions to the transformations such that all extreme transformations are automatically discarded. This reduces the size of the database to just over 1.1 million images with attached labels. Even though this removes more than 20% of the data, we believe the gain from more reliable data outweighs the cost associated with a smaller database.
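To illustrate what discarding extreme transformations could look like, the following minimal sketch checks whether the linear part of an estimated transformation stays close to the identity. The thresholds, the function name `is_plausible`, and the choice to test only rotation and scale are assumptions made for illustration; the restrictions actually used are not reported here.

```python
import math

# Hypothetical limits; the actual restrictions are not stated in the text.
MAX_ROTATION_DEG = 5.0
MIN_SCALE, MAX_SCALE = 0.9, 1.1


def is_plausible(matrix):
    """Return True if the 2x2 linear map [[a, b], [c, d]] is close to the identity."""
    (a, b), (c, d) = matrix
    det = a * d - b * c
    if det <= 0:
        # Reflections and degenerate maps are never accepted.
        return False
    scale = math.sqrt(det)                     # isotropic scale estimate
    rotation = math.degrees(math.atan2(c, a))  # rotation angle of the first column
    return MIN_SCALE <= scale <= MAX_SCALE and abs(rotation) <= MAX_ROTATION_DEG


def rotation_matrix(degrees, scale=1.0):
    """Helper that builds a scaled rotation matrix for the examples below."""
    r = math.radians(degrees)
    return [[scale * math.cos(r), -scale * math.sin(r)],
            [scale * math.sin(r), scale * math.cos(r)]]


print(is_plausible(rotation_matrix(2.0, scale=1.02)))  # mild adjustment
print(is_plausible(rotation_matrix(20.0)))             # strongly rotated sheet
```

A sheet whose estimated transformation fails such a check would be dropped rather than segmented, trading database size for more reliable crops.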

Once we have prepared the images, we clean the labels to fit into a Danish context, which implies that all non-Danish variations of letters are replaced with the Danish equivalent of these. A few of these might be incorrect e.g. if the individuals are foreigners, but we expect the level of mis-classification arising from this to be smaller than the number of characters labelled incorrectly by the volunteers at Copenhagen Archives. In addition, we restrict the sample to names that only contain alphabetic characters and with a length of at least two characters, yielding a final database of 1,106,020 full names.
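The label cleaning described above can be sketched as follows. The replacement table `NON_DANISH_TO_DANISH` is hypothetical, since the text does not list the substitutions that were applied, and lower-casing the labels as well as dropping an entire label when any single name fails the filter are likewise assumptions.

```python
# The 29 letters of the Danish alphabet.
DANISH_ALPHABET = set("abcdefghijklmnopqrstuvwxyzæøå")

# Hypothetical mapping; the actual replacements are not listed in the text.
NON_DANISH_TO_DANISH = {"ä": "æ", "ö": "ø", "ü": "y", "é": "e"}


def clean_name(raw):
    """Lower-case a name and replace non-Danish letter variants."""
    lowered = raw.strip().lower()
    return "".join(NON_DANISH_TO_DANISH.get(ch, ch) for ch in lowered)


def is_valid_name(name):
    """Keep names of at least two characters that use only Danish letters."""
    return len(name) >= 2 and all(ch in DANISH_ALPHABET for ch in name)


def clean_label(raw_names):
    """Return the cleaned names, or None if any name fails the filter."""
    cleaned = [clean_name(name) for name in raw_names]
    return cleaned if all(is_valid_name(name) for name in cleaned) else None


print(clean_label(["Møller", "Jörgen"]))  # cleaned and kept
print(clean_label(["Hansen", "J."]))      # dropped: contains a non-letter
```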

The figure shows examples of the HANA database with the corresponding labels written above. The last name is typically written as the first word followed by the first and middle names, which is the case for all images above. For this reason, we write the last name as the first word of the label.
Figure II:
Samples from the HANA Database

It is possible to extend the number of names for each sheet by considering the spouse and children of an individual. However, this would entail lowering the quality of the database, as the last name is generally not present and the quality of the segmentation is lower. We leave this for future work.

The personal name labels are categorized as either first or last names by Copenhagen Archives. Most commonly, the last name is written as the first word and the subsequent words are the first and middle names (in that order). However, some exceptions occur, and there are other rules that may interfere with the structure of the ordering, such as underlining and numbering. The structure of the database can therefore be challenging for HTR models, as this structuring complication has to be overcome by the models.

The figure shows the length of names per image. The name is in this context defined as each word in a full name i.e. either first, middle, or last name. The longest name consists of 18 characters. This is a simple count figure of the length of the names (first, middle, and last name) aggregated across all individuals present in either the train or test data.
Figure III:
Distribution of name lengths

Train and Test Splits To ensure a standardized database and test measure, we split the database into a train and a test database. The test database consists of 5% of the total database and is randomly selected. The training data consists of 1,050,191 documents while the test data consists of 55,829 documents. The test sample contains a total of 10,231 unique last names, relative to the overall total of almost 70,000 unique last names, and 2,129 of these surnames are represented only in the test sample.
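A random 5% hold-out of this kind can be sketched as below. The function name, the fixed seed, and the rounding rule are assumptions for illustration; the reported split sizes sum to the full database, and the reported test share is about 5.05% rather than exactly 5%.

```python
import random


def split_train_test(items, test_fraction=0.05, seed=0):
    """Shuffle the items and hold out a random fraction as the test set."""
    rng = random.Random(seed)  # hypothetical seed, chosen only for reproducibility
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


train, test = split_train_test(range(1000))
print(len(train), len(test))

# Consistency check on the split sizes reported in the text.
reported_train, reported_test = 1_050_191, 55_829
print(reported_train + reported_test)
print(f"{reported_test / (reported_train + reported_test):.2%}")
```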

As mentioned previously, the database is highly unbalanced due to vast differences in the commonness of names. Only the 604 most common surnames in the database occur at least 100 times, and only the 3,464 most common surnames occur more than 20 times. This covers slightly more than 85% of the data, meaning that almost 15% of the images contain names that occur fewer than 20 times. This naturally leads to challenges for any HTR model, as it needs to learn to recognize names with very few or even zero examples in the training data. However, this is also an important and indeed crucial goal to work towards.
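The coverage statistic used above, i.e. the share of images whose surname reaches a given frequency, can be computed as in the following sketch. The toy surname list is invented for illustration and is not drawn from the database.

```python
from collections import Counter


def coverage_at_least(surnames, min_count):
    """Share of samples whose surname occurs at least min_count times."""
    counts = Counter(surnames)
    covered = sum(count for count in counts.values() if count >= min_count)
    return covered / len(surnames)


# Toy data: two common surnames and one that occurs a single time.
toy_surnames = ["hansen"] * 6 + ["jensen"] * 3 + ["bohr"]

print(coverage_at_least(toy_surnames, 3))  # share covered by surnames seen 3+ times
print(coverage_at_least(toy_surnames, 4))  # share covered by surnames seen 4+ times
```

Applied to the training labels with a threshold of 20, a calculation of this form would correspond to the roughly 85% coverage quoted above.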

Labelling While transcribers at Copenhagen Archives were instructed to make accurate transcriptions of the register sheets, there exist human-introduced inconsistencies in the labels. The same points made by deng2009imagenet can be made here, as there are two issues in particular to consider. First, humans make mistakes and not all users follow the instructions carefully. Second, users are not always in agreement with each other, especially for harder-to-read cases where the characters of an image are open to interpretation.

The figure shows the distribution of names per image file. As seen, the vast majority of images contain two to four names. The longest full name consists of 10 separate names. The figure aggregates across all individuals present in either the train or test data.
Figure IV:
Distribution of words per image
The figure shows the distribution of characters in the names. As seen, the most frequent character used in the names is e which appears approximately 3.4 million times, while both q and å occur fewer than 5,000 times. The figure aggregates across all individuals present in either the train or test data.
Figure V:
Distribution of characters

With respect to the first point, we perceive this as part of the challenge for construction of any handwriting database, as these are all based on human transcriptions. For this database, all labels have been double-checked by a super-user at Copenhagen Archives, and therefore at least two individuals have read all documents. In addition, it is possible to send requests for corrections at the website of Copenhagen Archives and thereby change incorrect labels.

With respect to the second point, a number of considerations should be taken into account. Common labelling errors found in the database stem from subtly confusable characters, similar names, or phonetic spelling errors. Names that are often misread are e.g. Pedersen versus Petersen, Christensen versus Christiansen, and Olesen versus Olsen. Solutions for these complications are difficult, as it is in many cases a judgement call by the transcriber. To reduce the number of incorrect labels in the training database, one could consider combining similar names, but we refrain from this in this paper.

Further Characteristics of the Database Despite there being 69,913 unique surnames and 48,396 unique first and middle names, the total number of unique names amounts to 105,615, as there exists an overlap between the two sub-groups. The distributions of the length of the names and the characters are shown in Figures III and V. There are fewer than fifty thousand examples of the characters q, w, x, z, å, and æ. For q and å, there are fewer than five thousand examples. The vast majority of names contain four to nine characters, with only 6.35% of the names being shorter or longer. Quite frequently reported for Danish last names is the fraction of names ending with sen. For this database, we have 710,179 surnames that end with sen, which is equal to 64.21% of all last names in the database.
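The counts in this paragraph can be cross-checked with a few lines of arithmetic. The sketch assumes that each of the 1,106,020 images carries exactly one surname, so that the number of images is the denominator of the "sen" share.

```python
unique_surnames = 69_913
unique_first_middle = 48_396
unique_total = 105_615

# Inclusion-exclusion: names that occur both as a surname and as a first/middle name.
overlap = unique_surnames + unique_first_middle - unique_total
print(overlap)

images = 1_106_020       # assumption: one surname per image
sen_surnames = 710_179
sen_share = sen_surnames / images
print(f"{sen_share:.2%}")
```

The implied overlap is 12,694 names, and the computed share agrees with the 64.21% reported above.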

III Benchmark Results

This section describes the benchmark results published together with the HANA database. We use a variant of a ResNet-50 network for estimating the benchmark results. We split the surnames into characters and perform classification character-by-character. The predictions are subsequently matched to the closest existing name. One could also consider the surnames as an entity and classify each word in a holistic sense. We imagine that this could be problematic due to the unbalanced nature of this database. We train three neural networks, one to predict the last name, one to predict the first and last name, and one to predict the entire name i.e. first, middle, and last names.

We start by describing the architecture, optimization, and other details of the neural networks used in the paper. The neural networks are largely similar, and optimization is done in the same way for each. The code to run the neural networks is written in Python (10.5555/1593511) using PyTorch (paszke2019pytorch).

Network Architecture Each neural network uses a ResNet-50 with bottleneck building blocks (he2016deep) as its feature extractor; the weights of the PyTorch version of ResNet-50 pretrained on ImageNet (deng2009imagenet) are used as the initial weights. The neural networks differ only insofar as their classification heads differ. Here, a method similar to the one described in goodfellow2013multi is used, with the exception that the sequence length is never estimated. The weights (and biases) of the heads are randomly initialized.

For the last name network, 18 output layers are used (names are at most 18 letters long), each with 30 output nodes (letters a-å as well as a “no character” option). For the first and last name network, 36 output layers are used (2 names of at most 18 letters), each with 30 output nodes. For the full name network, 180 output layers are used (up to 10 names of at most 18 letters), each with 30 output nodes.
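To make the output structure concrete, the sketch below encodes names as one class index per output layer. The ordering of the class indices, the use of index 29 for the “no character” option, and the padding of unused name slots with empty names are assumptions for illustration; only the counts (18 positions per name, 30 classes per position) come from the text.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"  # the 29 letters a-å
NO_CHAR = len(ALPHABET)                     # assumed index of the "no character" class
MAX_NAME_LEN = 18


def encode_name(name):
    """Map one name to 18 class indices, padding with the "no character" class."""
    if len(name) > MAX_NAME_LEN:
        raise ValueError("names are at most 18 letters long")
    indices = [ALPHABET.index(ch) for ch in name]
    return indices + [NO_CHAR] * (MAX_NAME_LEN - len(name))


def encode_names(names, max_names):
    """Concatenate the encodings of up to max_names names into one target vector."""
    if len(names) > max_names:
        raise ValueError("too many names for this network")
    padded = list(names) + [""] * (max_names - len(names))
    return [index for name in padded for index in encode_name(name)]


print(encode_name("hans")[:5])                           # first five targets
print(len(encode_names(["hansen"], 1)))                  # last name network
print(len(encode_names(["hans", "hansen"], 2)))          # first and last name network
print(len(encode_names(["hans", "peter", "hansen"], 10)))  # full name network
```

Each of the 18, 36, or 180 target indices is then predicted by its own 30-way output layer.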

The figure shows the performance on the test set from the HANA database when trained on last names. We find that the predictions matched to the closest name and the unmatched predictions perform very similarly up to the 80th quantile. From this point on, the two lines diverge and the matched predictions clearly outperform the unmatched predictions.

Figure VI:
Performance on the HANA database: Last Name

Optimization

All neural networks are optimized using stochastic gradient descent with momentum of 0.9, weight decay of 0.0005, and Nesterov acceleration based on the formula in sutskever2013importance. The batch size used is 256 and the learning rate is 0.05. The networks are trained for 100 epochs and the learning rate is divided by ten every 30 epochs. The loss function used is the mean cross entropy of each output layer (i.e. normal cross entropy in the case of one output layer).
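The step schedule and the loss described above can be written out in plain Python as a minimal sketch. Counting epochs from zero and the function names are assumptions; in practice these computations would be handled by the training framework rather than by hand.

```python
import math


def learning_rate(epoch, base_lr=0.05, step=30, factor=10.0):
    """Learning rate at a zero-indexed epoch: divided by ten every 30 epochs."""
    return base_lr / factor ** (epoch // step)


def mean_cross_entropy(probabilities, targets):
    """Average the cross entropy over all output layers of one sample.

    probabilities: one list of class probabilities per output layer.
    targets: the correct class index for each output layer.
    """
    losses = [-math.log(layer[target]) for layer, target in zip(probabilities, targets)]
    return sum(losses) / len(losses)


for epoch in (0, 29, 30, 60, 99):
    print(epoch, learning_rate(epoch))

# Two output layers with two classes each, purely for illustration.
print(mean_cross_entropy([[0.5, 0.5], [0.25, 0.75]], [0, 1]))
```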

Image Preprocessing

Images are resized to half width and height for computational reasons (resulting in images of width 522 and height 80). The images are normalized using the ImageNet means and standard deviations (to normalize similarly to the pretrained ResNet-50 feature extractor). During training, image augmentation in the form of RandAugment with … and … is used (cubuk2020randaugment); the implementation is based on RandAugmentGitHub.

The figure shows the performance on the test set from the HANA database when trained on first and last names. We find that the network obtains a higher accuracy on the first names relative to the last names and that the last name accuracy is lower than the performance of the model that is trained only on last names.
Figure VII:
Performance on the HANA database: First and Last Name

Prediction of Networks Some postprocessing of predictions is performed. Each layer is mapped to its corresponding character (the 29 letters and the “no character” option). Then, for each name (i.e. sequence of 18 output layers), the “no character” predictions are removed and the remaining letters form the prediction. Letting ∅_i denote the “no character” option for character i, this means that both [h, a, n, ∅_4, s, ∅_6, …, ∅_18], which is an invalid name, and [h, a, n, s, ∅_5, …, ∅_18], which is a valid name, will be transformed to hans.
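The postprocessing step can be sketched as follows, assuming the same hypothetical index convention as elsewhere in these examples (letters a-å as indices 0 to 28 and the “no character” option as index 29).

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"  # the 29 letters
NO_CHAR = len(ALPHABET)                     # assumed index of the "no character" class


def decode_name(indices):
    """Drop every "no character" prediction and join the remaining letters."""
    return "".join(ALPHABET[i] for i in indices if i != NO_CHAR)


h, a, n, s = (ALPHABET.index(ch) for ch in "hans")

# A "no character" prediction in the middle of the name (invalid sequence) ...
gapped = [h, a, n, NO_CHAR, s] + [NO_CHAR] * 13
# ... and the same name with padding only at the end (valid sequence).
padded = [h, a, n, s] + [NO_CHAR] * 14

print(decode_name(gapped))
print(decode_name(padded))
```

Both sequences decode to the same string, which is the behaviour described above.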

Matching As an additional step, we refine the predictions of the networks by using matching. In some cases, a list of possible names (i.e. a dictionary of valid outcomes) may be present, in which case this can be used to match predictions that are not valid to the nearest valid name. Specifically, we use the procedure in the difflib Python module to perform this matching.

For the last name network, the predictions that do not fall within the list of valid last names are matched to the nearest last name. For the first and last name network, a similar procedure is used separately for the first name and the last name. For the full name network, a similar procedure is used separately for the first name, the up to eight middle names, and the last name.
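A minimal sketch of the matching step is given below. It assumes that `difflib.get_close_matches` is the routine used and that a cutoff of 0.0 is chosen so that a nearest name is always returned; the text only states that the procedure in the difflib module is used, so both choices are assumptions.

```python
from difflib import get_close_matches


def match_to_valid(prediction, valid_names):
    """Return the prediction if it is valid, else the most similar valid name."""
    if prediction in valid_names:
        return prediction
    # cutoff=0.0 is an assumption: it forces a match whenever valid names exist.
    candidates = get_close_matches(prediction, valid_names, n=1, cutoff=0.0)
    return candidates[0] if candidates else prediction


valid_last_names = ["pedersen", "petersen", "olsen", "olesen"]

print(match_to_valid("pedersn", valid_last_names))  # invalid prediction, gets matched
print(match_to_valid("olsen", valid_last_names))    # already valid, left unchanged
```

For the first and last name network and the full name network, the same function would be applied to each name separately, with the appropriate list of valid first, middle, or last names.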

Results An overview of the performance is shown in Table III. A common trade-off is the coverage relative to the accuracy, which is the motivation for also showing the results using a threshold at the 90th quantile.

Three different models for character-by-character recognition were tested. The first model predicts only the characters in the last name, the second model predicts the first name and the last name for the linking arguments stated previously, and the third model predicts the full name sequence. All of them are trained on the full database. For the full name model, the number of names present in a person’s predicted name is equal to the number of names in the corresponding label in 96.85% of the cases. Using the Levenshtein distance to calculate the character error rate of the predictions (without matching) we find error rates of 1.48% for the last name network, 1.66% for the first and last name network, and 11.82% for the full name network.
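The character error rate can be computed from the Levenshtein distance as in the sketch below. Normalizing the total number of edits by the total number of label characters is an assumption, as the exact normalization is not stated in the text.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + (char_a != char_b)  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(predictions, labels):
    """Total edit distance divided by the total number of label characters."""
    total_edits = sum(levenshtein(p, l) for p, l in zip(predictions, labels))
    total_chars = sum(len(l) for l in labels)
    return total_edits / total_chars


print(levenshtein("petersen", "pedersen"))
print(character_error_rate(["petersen", "hansen"], ["pedersen", "hansen"]))
```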

Table III: WACC

Names                  Recall   WACC     WACC with Matching
Last Name              100%     94.33%   95.68%
First and Last Name    100%     93.52%   94.79%
Full Name              100%     67.44%   68.81%
Last Name              90%      98.36%   98.41%
First and Last Name    90%      97.29%   97.45%
Full Name              90%      72.78%   74.10%

  • The table shows the test performance of the HTR models as measured by word accuracy. The recall is defined as the fraction of the test database the model is tested upon (keeping the most likely predictions). For the models with 90% recall we remove the 10% of the test sample with the lowest sequence-probability. All models are trained on the full train database allowing the networks to learn primitives and characters from uncommon names.
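The word accuracy at a given recall can be sketched as follows. The sequence probability is taken as a given input here; how it is derived from the per-character outputs is not specified above, and the toy predictions are invented for illustration.

```python
def word_accuracy(predictions, labels, probabilities, recall=1.0):
    """Word accuracy on the most confident fraction (recall) of the test sample."""
    ranked = sorted(
        zip(probabilities, predictions, labels),
        key=lambda row: row[0],
        reverse=True,  # most likely predictions first
    )
    n_keep = max(1, round(len(ranked) * recall))
    kept = ranked[:n_keep]
    return sum(pred == label for _, pred, label in kept) / len(kept)


# Toy sample: ten predictions, two of them wrong, one of which has low probability.
toy_predictions = ["hansen"] * 10
toy_labels = ["hansen"] * 8 + ["jensen"] * 2
toy_probabilities = [0.9] * 8 + [0.8, 0.1]

print(word_accuracy(toy_predictions, toy_labels, toy_probabilities, recall=1.0))
print(word_accuracy(toy_predictions, toy_labels, toy_probabilities, recall=0.9))
```

Dropping the least confident 10% removes one of the two errors in this toy sample, which mirrors the accuracy gain between the 100% and 90% recall rows of the table.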

IV Discussion

Table III and Figures VI and VII summarize our results on the HANA database. Due to computational constraints, we only tested the performance of relatively few models. As these models are the first results on this database, there are currently no available comparable results, and we hope that other researchers can use these results as a benchmark for this database.

We believe that large-scale databases are a necessary prerequisite for achieving high accuracy when transcribing handwritten text. This database proves to be sufficiently large for models to read handwritten names with high accuracy, especially for more frequently represented names. The high performance is achieved despite several stated complications. The most common complications with the labels and the corresponding images are the structure of the personal names on an image relative to the label, confusion of certain characters, and general typos. This underlines the fact that humans make mistakes, especially in harder-to-read cases where certain characters can be read in different ways and are open to interpretation.

As a robustness check one could also test the models using phonetically spelled versions of the names e.g. Christian versus Kristian. We choose not to do this in our benchmark models as there exist labels that are very similar but have different meanings e.g. Pedersen versus Petersen. Therefore, by allowing for small discrepancies in the names one could easily create mislabeled training data across very similar names. We realize that it could to some extent mitigate the complication from the harder to read cases where the transcribers possibly made mistakes, but we leave this as an open question for other researchers to approach.

V Conclusion

This paper introduces the HANA database, which is the largest publicly available handwritten personal name database. The large-scale HANA database is based upon Danish police register sheets, which have been made freely available by Copenhagen Archives. The final processed database contains a total of 3,355,388 names distributed across 1,106,020 images. Benchmark results for transcription based deep learning models are provided for the database on the last name, first and last name, and full name.

Our goal is to create and promote a more challenging database that in many ways is more comparable to other historical documents. First, historical documents are often tabulated and can therefore be cropped into single-line fragments, which should make it easier to train HTR models and to transcribe more efficiently. Second, the naturalism of the police register sheets is, in our opinion, quite comparable to that of many widely used historical documents such as census lists, parish registers, and funeral records. This makes any performance based on these documents more representative of the performance that would be obtained in custom applications.

We highlight two important features of the database. First, despite the challenges associated with labelling errors and unstructured images, the size of the database appears to compensate, making high-performance models for automatically transcribing handwritten names possible. Second, the models still generalize well even though the commonness of names is far from evenly distributed: the sample is highly unbalanced, with 65,026 of the 105,615 different names represented only once. We view this as very encouraging, suggesting that high-performance automatic transcription is possible even in difficult and realistic scenarios.

We have performed image-processing procedures to make the database useful for training single-line learning systems. The scripts are all available upon request. We strongly encourage other researchers to use HANA and to make improvements to our procedures in order to continuously increase the size and quality of the database. Ultimately, we believe this can help make automatic transcriptions of personal names and other handwritten entities much more precise and cost-efficient, in addition to making the transcriptions fully end-to-end reproducible. By improving existing linking methods through lower transcription error rates, this could further incentivize the usage and construction of reliable long historical databases across multiple generations.

References