Genalog is an open source, cross-platform python package allowing generation of synthetic document images with custom degradations and text alignment capabilities.
Document digitization is essential for the digital transformation of our societies, yet a crucial step in the process, Optical Character Recognition (OCR), is still not perfect. Even commercial OCR systems can produce questionable output depending on the fidelity of the scanned documents. In this paper, we demonstrate an effective framework for mitigating OCR errors for any downstream NLP task, using Named Entity Recognition (NER) as an example. We first address the data scarcity problem for model training by constructing a document synthesis pipeline, generating realistic but degraded data with NER labels. We measure the NER accuracy drop at various degradation levels and show that a text restoration model, trained on the degraded data, significantly closes the NER accuracy gaps caused by OCR errors, including on an out-of-domain dataset. For the benefit of the community, we have made the document synthesis pipeline available as an open-source project.
Despite years of advancement of information technologies, a vast amount of information is still locked inside analog documents. For example, the JFK Assassination Records Collection (https://www.archives.gov/research/jfk) consists of over 5 million pages of records. Document digitization technologies like Optical Character Recognition (OCR) have made it possible to index the collection at a word level, allowing search via keywords and phrases. More elaborate NLP technologies such as Named Entity Recognition (NER) can then be applied to extract information such as person names, allowing information retrieval at entity level. However, even modern OCR technologies can produce output of a poor quality depending on the quality of the scanned documents.
In this paper, we demonstrate an effective framework for mitigating the impact of OCR errors on any downstream NLP task, using the task of NER as an example. Although there is a growing body of work dedicated to OCR and quality improvements, the specific topic of the impact of OCR errors on NER has not been widely explored. With the exception of (Hamdi et al., 2019; Hwang et al., 2019), and to an extent (Jean-Caurant et al., 2017), the majority of the previous work has focused on general OCR error detection and correction (Hakala et al., 2019; D’hondt et al., 2017). Here, we focus on a framework that allows us to improve the accuracy of any downstream NLP task such as NER.
Our major contributions are as follows. (1) To address the scarcity of entity-labeled OCR documents for model training, we design a pipeline Genalog for generating analog synthetic documents. The pipeline takes plain texts optionally annotated with named entities, synthesizes degraded document images, runs OCR on the images, and propagates the labels onto the OCR output texts. These imperfect texts can then be aligned with their clean counterparts for model training. (2) We propose an action prediction model that can restore clean text from OCR output and mitigate the downstream NER accuracy degradation. (3) We systematically investigate NER accuracy drop on OCR output at various synthetic degradation levels, and show that our text restoration model can indeed significantly close the NER accuracy gap.
The core capabilities of Genalog – synthesizing visual document images, degrading images, extracting text from images, and performing text alignment with label propagation – can have a wide array of applications. To facilitate further research, we are making Genalog available as an open-source project on GitHub (https://github.com/microsoft/genalog).
Since we focus on closing the accuracy gap of NER induced by OCR errors, we adopt a more conventional NER model and keep it constant in our experimentation, instead of using the most recent transformer-based (Vaswani et al., 2017) NER model such as (Wang et al., 2019). LSTM-CNN models (Chiu and Nichols, 2016) typically employ a token-level Bi-LSTM on top of a character-level CNN layer. Hence, we train a multi-task Bi-LSTM model that operates at both levels (Wang et al., 2019). This provides more granularity at the character-level than the traditional Bi-LSTM CRF model (Huang et al., 2015).
To systematically study the challenges of performing NER on OCR output, (Hamdi et al., 2019) categorized four types of OCR degradations and examined the decrease in NER accuracy for each as evaluated on synthetic documents. Their analysis referenced DocCreator (Journet et al., 2017) as a tool for generating synthetic documents. Our implementation, Genalog, was initially inspired by this tool. Another work, (Etter et al., 2019), presented a flexible framework for synthetic document generation but did not produce annotations for a downstream task.
The work of (van Strien. et al., 2020) examined the impact of noise introduced by OCR on analog documents on several downstream NLP tasks, including NER. They observed a consistent relationship between decreased OCR quality and worse NER accuracy. (Miller et al., 2000) explored the relationship between the word error rates of noisy speech and OCR on downstream NER, and (Packer et al., 2010) noted the difficulty of extracting names from noisy OCR text.
The goal of our text restoration approach is to reconstruct a “clean” version of the text from OCR output, which may contain various misspellings and errors. Sequence-to-sequence (seq2seq) models are a natural first approach: they convert an input sequence into an output sequence and can process text at the character level (Hakala et al., 2019). The use of LSTMs (Hochreiter and Schmidhuber, 1997) is a common paradigm for seq2seq models. Another approach is the use of an encoder-decoder model with a single or multiple LSTM layers (D’hondt et al., 2017; Suissa et al., 2020). Here, the encoder-decoder model is not based on a seq2seq architecture; instead it is a straightforward decoding of a character at every time step corresponding to the input sequence. These approaches are powerful but have a few drawbacks that we will explore in detail in Section 4.
While there are many publicly available annotated datasets for NER, there are few targeting OCR output, i.e., document images containing texts with entities marked manually. To address this data scarcity, we developed Genalog, a Python package to synthesize document images from text with customizable degradation. It can also run an OCR engine on the images and align the output with ground truth to propagate the NER labels. The final result consists of degraded document images, corresponding OCR text, and NER labels for the OCR text suitable for model training and testing (Figure 1). We now describe each of these steps in the remainder of this section.
A document contains various layout information such as font family/size, page layout, etc. Our goal is to generate a document image given a specified layout and input text. Genalog provides several standard document templates implemented in HTML/CSS (shown in Appendix A), and a browser engine is used to render document images. The layout can be reconfigured via CSS properties, and a template can be extended to include other content such as images and tables. In our experiments, we use a simple text-block template.
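A minimal sketch of this templating idea, using Python's standard-library `string.Template` in place of Genalog's actual HTML/CSS templates (the template contents and placeholder names below are illustrative assumptions, not Genalog's API):

```python
from string import Template

# A toy text-block template in the spirit of the HTML/CSS templates described
# above. Font family/size and column width are plain CSS properties, so a
# layout change is just a change to the substituted values.
PAGE = Template("""\
<html>
<head><style>
  body { font-family: $font_family; font-size: $font_size; }
  .text-block { width: $column_width; text-align: justify; }
</style></head>
<body><div class="text-block">$content</div></body>
</html>""")

html = PAGE.substitute(
    font_family="Times New Roman",
    font_size="12px",
    column_width="600px",
    content="EU rejects German call to boycott British lamb .",
)
```

A headless browser engine can then render such a page to an image, which is the step Genalog performs internally.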
Genalog supports elementary degradation operations such as Gaussian blur, bleed-through, salt and pepper, and morphological operations including open, close, dilate, and erode. Each effect can be applied at various strengths, and multiple effects can be stacked together to simulate realistic degradation. Figure 6 provides more details on each degradation.
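Two of these elementary effects can be sketched in a few lines of NumPy (a toy illustration, not Genalog's implementation; the kernel size, noise amount, and image contents are arbitrary assumptions):

```python
import numpy as np

def salt_and_pepper(img, amount, rng):
    """Flip a fraction `amount` of pixels to white (salt) or black (pepper)."""
    noisy = img.copy()
    mask = rng.random(img.shape) < amount
    noisy[mask] = rng.choice([0, 255], size=int(mask.sum()))
    return noisy

def erode(img, k=3):
    """Grayscale erosion with a k x k square kernel: each pixel becomes the
    minimum of its neighbourhood, which expands dark regions and thus
    thickens dark text strokes on a white page."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.empty_like(img)
    h, w = img.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + k, j:j + k].min()
    return out

rng = np.random.default_rng(42)
page = np.full((32, 32), 255, dtype=np.uint8)  # a blank white "page"
page[10:12, 4:28] = 0                          # a dark stroke of "text"
# Effects can be stacked, as described above: noise first, then morphology.
degraded = erode(salt_and_pepper(page, 0.05, rng))
```

Stacking the effects in different orders and strengths is what produces the varied, realistic-looking degradations.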
We use a commercial OCR API to extract text from document images. Genalog calls the service on batches of documents and obtains extracted lines of text and their bounding boxes on each page. Genalog also computes metrics measuring the OCR accuracy, including Character Error Rate (CER), Word Error Rate (WER), and two additional classes of metrics: edit distance (Miller et al., 2009) and alignment gap metrics. This provides information on OCR errors and can capture errors such as “room” misrecognized as “roorn” by OCR.
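Both CER and WER are normalised edit distances; a self-contained sketch (not Genalog's implementation) that captures the “room” → “roorn” example:

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming; works on strings
    (character level) or lists of tokens (word level)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(truth, ocr):
    """Character Error Rate: char edits per ground-truth character."""
    return edit_distance(truth, ocr) / len(truth)

def wer(truth, ocr):
    """Word Error Rate: word edits per ground-truth word."""
    t, o = truth.split(), ocr.split()
    return edit_distance(t, o) / len(t)
```

For example, `cer("room", "roorn")` is 0.5: restoring “room” from “roorn” takes one substitution and one deletion over four ground-truth characters.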
After obtaining the OCR text from synthetic documents, we propagate NER labels from the source text to the OCR text by aligning these texts and propagating the labels accordingly. Genalog uses the Needleman-Wunsch algorithm (Needleman and Wunsch, 1970) for text alignment at the character level. Because the algorithm is quadratic in both time and space complexity, it can be inefficient on long documents. To improve efficiency, we use the Recursive Text Alignment Scheme (Yalniz and Manmatha, 2011) and search for unique words common to both ground truth and OCR text as anchor points to break documents into smaller fragments for faster alignment, thereby obtaining a 15-20x speed-up on documents with an average length of 4000 characters.
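A character-level Needleman-Wunsch aligner small enough to illustrate the idea (the scores of +1 for a match and -1 for a mismatch or gap, and the "@" gap character, are illustrative choices, not Genalog's actual parameters):

```python
def align(a, b, gap="@"):
    """Character-level Needleman-Wunsch alignment (match=+1, mismatch/gap=-1).
    Returns the two strings padded with the gap character to equal length;
    quadratic in both time and space, hence the anchoring scheme for long docs."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = -i
    for j in range(1, m + 1):
        score[0][j] = -j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i-1][j-1] + (1 if a[i-1] == b[j-1] else -1)
            score[i][j] = max(diag, score[i-1][j] - 1, score[i][j-1] - 1)
    # Trace back from the bottom-right corner to recover the alignment.
    out_a, out_b, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i-1][j-1] + (1 if a[i-1] == b[j-1] else -1):
            out_a.append(a[i-1]); out_b.append(b[j-1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i-1][j] - 1:
            out_a.append(a[i-1]); out_b.append(gap); i -= 1
        else:
            out_a.append(gap); out_b.append(b[j-1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))
```

Once the characters are aligned, each NER label on a ground-truth character can be copied to the OCR character it is aligned with, which is the label-propagation step.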
To understand what degradation on synthetic documents is realistic, we compute CER and WER on OCR texts obtained from a large corpus of real document scans (first row of Table 1). We then synthesize documents from the public CoNLL 2003 (Sang and De Meulder, 2003) and CoNLL 2012 (Pradhan et al., 2012) datasets with different degrees of degradation (none, light and heavy) and compute CER and WER on the OCR output. The result shows that in terms of WER, the light degradation is closer to the real degradation than the heavy degradation.
| Dataset | Degradation | CER (%) | WER (%) |
|---|---|---|---|
| CoNLL ’03 | All (light) | 0.5 | 5.4 |
| CoNLL ’12 | All (light) | 0.9 | 7.5 |
| CoNLL ’03 | All (heavy) | 9.7 | 36.6 |
| CoNLL ’12 | All (heavy) | 8.5 | 33.4 |
To make our framework for mitigating OCR errors flexible, we propose a model for text restoration. This model can be trained independently from the downstream task, and training data can be readily obtained from a data synthesis pipeline such as Genalog.
One approach to generating a corrected sequence of text is to correct one word at a time via a seq2seq model, as in (Hakala et al., 2019). They report that when an entire sentence is allowed as input instead of a single word, a model trained on a smaller dataset yields a much higher WER.
An alternative approach is to adopt an encoder-decoder model that decodes one character at a time in sync with the input. Although this approach is faster than seq2seq, as insertions or deletions accumulate in a long sequence, the model needs to account for a growing character shift (see (a)). This issue often leads to a repetition of the same character for several timesteps during the prediction. (D’hondt et al., 2017) mitigated this problem by limiting the length of the sequence to 20 characters and utilizing a sliding window. This significantly slows down inference and limits the context of the model.
Our solution builds on the previous approaches, but instead of predicting characters at each time step, we predict actions that are required for restoration ((b)). For example, to restore the source text “Cute cat” from the OCR output “Cute at”, we need to insert a “c” before the “a”. This is akin to sequence labelling: each character in the input sentence is assigned an action label. There are four main actions (INSERT, REPLACE, INSERT_SPACE, and DELETE) and two auxiliary actions (NONE and PAD). The first three actions need a character, e.g., we need to specify which character to INSERT. Predicting actions and characters separately can reduce vocabulary size and label sparsity, which can be severe for languages such as Chinese or Japanese. The distribution of the actions on CoNLL-2012 is presented in (a).
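A hedged sketch of how such action labels could be derived from a gap-aligned sentence pair (the "@" gap character, the tuple representation, and the tie-breaking when a gap precedes a correct character are simplifications we introduce for illustration):

```python
def derive_actions(ocr_aligned, truth_aligned, gap="@"):
    """One (action, argument) label per real OCR character, derived from a
    character alignment in which missing characters appear as `gap`."""
    actions, pending = [], []
    for o, t in zip(ocr_aligned, truth_aligned):
        if o == gap:            # truth char absent from the OCR output:
            pending.append(t)   # attach it to the next real OCR character
            continue
        if pending:
            # One action per character: a single missing char is INSERT; a
            # missing space followed by a missing char is the fused
            # INSERT_SPACE (the common two-action case discussed below).
            # For simplicity this sketch assumes the anchor char is correct.
            act = ("INSERT", pending[0]) if len(pending) == 1 \
                  else ("INSERT_SPACE", pending[1])
            actions.append(act)
        elif t == gap:
            actions.append(("DELETE", None))
        elif o == t:
            actions.append(("NONE", None))
        else:
            actions.append(("REPLACE", t))
        pending = []
    return actions
```

On the running example, aligning OCR “Cute at” to “Cute cat” yields NONE for every character except the “a”, which receives an INSERT of “c”.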
Since it is a sequence labelling problem, each input character is limited to have only one action. If a sentence is missing several characters in a row, our model is limited to recovering only one character. (Omelianchuk et al., 2020) and (Awasthi et al., 2019), who suggested similar approaches for grammatical error correction, mitigate this by applying the model up to 3 times. In OCR we have observed that errors mostly occur in non-adjacent locations (see (b)), and only a few characters suffer from errors that need more than one action to fix. This is also supported by (Jatowt et al., 2019). After a manual examination of the characters that need exactly two actions, we found that most of them need the following two INSERT actions: insert a space and a character. We therefore introduce an action INSERT_SPACE that inserts both in one action.
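Conversely, executing a predicted action sequence restores the text. A minimal interpreter for the actions described above, written under the one-action-per-character assumption (the tuple format is the same illustrative convention as before, not the model's actual output encoding):

```python
def apply_actions(ocr_text, actions):
    """Execute one predicted (action, argument) per OCR character to
    reconstruct the clean text."""
    out = []
    for ch, (act, arg) in zip(ocr_text, actions):
        if act == "INSERT":            # insert a character before this one
            out.append(arg); out.append(ch)
        elif act == "INSERT_SPACE":    # fused: insert a space AND a character
            out.append(" "); out.append(arg); out.append(ch)
        elif act == "REPLACE":         # substitute this character
            out.append(arg)
        elif act == "DELETE":          # drop a spurious character
            pass
        else:                          # NONE (and PAD on padding positions)
            out.append(ch)
    return "".join(out)
```

For example, applying an INSERT of “c” at the “a” of “Cute at” recovers “Cute cat”, and a single INSERT_SPACE recovers both a missing space and the character after it.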
Since OCR errors usually involve single erroneous characters, fixing them does not require long contexts. Our model thus consists of a character embedding layer followed by one-dimensional convolution layers, and for each input character, it predicts an action and a character using two separate fully-connected layers. The architecture of the model is presented in Figure 4. The model is trained with a weighted combination of cross-entropy losses for actions and characters: L = w_action · L_action + w_char · L_char.
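As a rough sketch of this forward pass in plain NumPy (the embedding size, channel count, kernel width, vocabulary size, and action count are illustrative assumptions, and the randomly initialised weights stand in for trained parameters):

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB = 100        # character vocabulary size (assumed)
ACTIONS = 6        # INSERT, REPLACE, INSERT_SPACE, DELETE, NONE, PAD
EMB, HID, K = 32, 64, 5   # embedding dim, conv channels, kernel width (assumed)

emb = rng.normal(size=(VOCAB, EMB))
conv_w = rng.normal(size=(HID, EMB, K)) * 0.01
act_w = rng.normal(size=(HID, ACTIONS)) * 0.01
chr_w = rng.normal(size=(HID, VOCAB)) * 0.01

def conv1d_same(x, w):
    """1-D convolution with 'same' padding + ReLU.
    x: (T, C_in), w: (C_out, C_in, K) -> (T, C_out)."""
    k = w.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.empty((x.shape[0], w.shape[0]))
    for t in range(x.shape[0]):
        window = xp[t:t + k]                      # (K, C_in)
        out[t] = np.einsum("kc,ock->o", window, w)
    return np.maximum(out, 0.0)

def forward(char_ids):
    """Per-character action logits and character logits from two heads
    sharing one convolutional trunk."""
    h = conv1d_same(emb[char_ids], conv_w)        # (T, HID)
    return h @ act_w, h @ chr_w                   # (T, ACTIONS), (T, VOCAB)

ids = rng.integers(0, VOCAB, size=9)              # a toy 9-character input
action_logits, char_logits = forward(ids)
```

The narrow receptive field of the stacked convolutions reflects the observation above that single-character OCR errors only need local context.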
| OCR Sentence | Reconstructed Sentence | Ground Truth Sentence |
|---|---|---|
| No one will be able to rec- ognize her body ! | No one will be able to recognize her body . | No one will be able to recognize her body ! |
| Although ! still have n’t made the final decision to go | Although I still have n’t made the final decision to go | Although I still have n’t made the final decision to go |
| Why do you think the North Koreans chose july Fourth /? | Why do you think the North Koreans chose July Fourth /? | Why do you think the North Koreans chose July Fourth /? |
| israel says & may abandon the peace negotiations altogether . | Israel says it may abandon the peace negotiations altogether . | Israel says it may abandon the peace negotiations altogether . |
| ! will protect this cky and save it . | I will protect this city and save it . | I will protect this city and save it . |
To investigate the effect of OCR errors on NER accuracy and the effectiveness of our model in mitigating them, we first use Genalog to create a synthetic dataset from clean texts (“clean data”) and propagate the NER annotations to the OCR text (“noisy data”). We measure the accuracy of the NER model on the clean data and the noisy data. Using the alignment between the ground truth and the OCR output, we produce labeled data to train the text restoration model. We then evaluate the same NER model on the restored text to determine if we can close the accuracy gap caused by OCR.
We use the well-known corpora from CoNLL-2003 (Sang and De Meulder, 2003) and CoNLL-2012 (Pradhan et al., 2012) and refer to them as “clean data”. Next, Genalog synthesizes degraded document images and generates the equivalent OCR texts, called “noisy data”. For each degradation effect in Genalog, we generate three versions of noisy data: none, light, or heavy degradation. This allows us to observe, at a finer-grained level, the influence of degradations on NER accuracy. We also produce a version of the datasets with All Degradations. Appendix B shows example images of each degradation and Appendix C lists the degradation parameters.
We use a popular character-level and token-level Bi-LSTM model (Dernoncourt et al., 2017) for NER. The model consists of a char-level Bi-LSTM layer followed by a token-level Bi-LSTM. Since we are using CoNLL-2012 and CoNLL-2003 to train the model, we utilize a separate classification layer for each dataset (Wang et al., 2019). We settled on this popular approach since our focus is mitigating the NER accuracy drop caused by OCR, not achieving state-of-the-art NER on clean text.
In this section, we report the evaluation results of our action prediction model, analyze the effect of different degradations on NER accuracy, and show how our model behaves on out-of-domain datasets.
Table 3 compares character-level and word-level accuracy between the noisy OCR text and the restored text. We observe that our model improves accuracy according to both metrics, and small improvements at character level lead to bigger improvements at word level.
| Metric | CoNLL-2012 (light) | CoNLL-2012 (heavy) | CoNLL-2003 (light) | CoNLL-2003 (heavy) |
|---|---|---|---|---|
| Char Accuracy Noisy | 0.986 | 0.900 | 0.989 | 0.900 |
| Char Accuracy Restored | 0.991 | 0.907 | 0.994 | 0.903 |
| Word Accuracy Noisy | 0.927 | 0.661 | 0.942 | 0.646 |
| Word Accuracy Restored | 0.962 | 0.732 | 0.969 | 0.700 |
There are several common OCR errors that our model is able to recover, including errors introduced by line breaks and hyphenation (e.g., “rec-ognize” to “recognize”). Remarkably, it also restores words suffering from multi-character errors, despite the one action per character limitation. Table 2 shows more examples.
To improve NER accuracy on noisy text, we first pass it through the action prediction model trained on both CoNLL-2003 and CoNLL-2012. The restored text is then fed to the NER model to generate entities. To measure the benefit in NER accuracy due to the use of restoration, we first examine the drop in NER accuracy introduced by degradation and OCR. We then evaluate NER on the restored text. By comparing the NER accuracy on the noisy text to that on the restored text, we can then compute the improvement introduced by the restoration model. Table 4 shows the result of this evaluation using All Degradations (Light) on CoNLL-2012 and CoNLL-2003. The NER accuracy on noisy text goes down significantly on all datasets. Our restoration approach is able to improve the F1 scores considerably, and the accuracy gaps are reduced by up to 73% on the CoNLL datasets.
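The relative gap reduction we report is simply the recovered fraction of the OCR-induced F1 drop; as a one-line sketch (the numbers in the example are illustrative, not results from our tables):

```python
def relative_gap_reduction(f1_clean, f1_noisy, f1_restored):
    """Fraction of the OCR-induced accuracy gap recovered by restoration:
    (restored - noisy) / (clean - noisy)."""
    return (f1_restored - f1_noisy) / (f1_clean - f1_noisy)

# Illustrative values: a model losing 10 F1 points to OCR noise and
# recovering 7.3 of them has closed 73% of the gap.
example = relative_gap_reduction(f1_clean=0.90, f1_noisy=0.80, f1_restored=0.873)
```

Note that the metric is undefined when OCR causes no gap at all (clean and noisy F1 equal), which does not occur in our experiments.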
| Relative Gap Reduction | 73% | 52% | 76% |
To further investigate the effect of heavy degradations, we applied these degradations to the CoNLL-2012 dataset and measured the resulting errors in NER. Table 5 shows that all degradations hurt the NER accuracy and, in all cases, the proposed restoration model helps to improve the NER scores. The NER accuracy on degraded text declines the most when all degradations are combined, but 19% of this gap is recovered by our model. In general, we notice that the action prediction model works best at moderate levels of degradation, where up to 77% of the accuracy drop is restored. When the text is obscured by many degradations in the heavy mode, the gap reduction becomes smaller.
To evaluate how our model generalizes to unseen documents from another domain, we evaluated our framework on a new “CNN” dataset (although CoNLL includes news articles, they are only a part of it): 1,000 randomly sampled articles from the CNN portion of the DeepMind Q&A Dataset (Hermann et al., 2015). Since the CNN dataset does not have NER labels, we run our NER model on it to produce the “ground truth”. We then use Genalog to generate synthetic documents with All Degradations (Light).
The accuracy of the NER system on the degraded text drops from the perfect score (the ground truth labels are produced by the same system; the score is slightly below 1.0 only due to sentence splitting differences between label generation and evaluation) to an F1 score of 0.59. Notice that the NER accuracy gap on the CNN dataset is much bigger than that on the CoNLL datasets (40 vs. 4.5 points), because here we are evaluating the accuracy using the model output on clean text as ground truth. Any change in NER prediction due to OCR errors therefore contributes to the accuracy gap, while in the case of the CoNLL datasets, changing an incorrect NER prediction (based on human annotations) to another incorrect prediction does not contribute to the accuracy gap.
After applying the restoration model to the degraded text, the accuracy of the NER system rises to 0.895 F1. Despite not being trained on the CNN dataset, our action prediction model is still able to close 76% of the NER accuracy gap. This result shows the action prediction model is able to generalize to data from another domain.
In this paper, we have demonstrated an effective framework to mitigate errors produced by OCR, a key step in the process of document digitization. We first constructed a data synthesis pipeline, Genalog, capable of generating synthetic document images given plain text input, running the images through OCR, and propagating annotated NER labels from the input text to the OCR output. We then proposed a novel approach, the action prediction model, to restore text corrupted by OCR errors, and showed that our model does not suffer from the problems faced by conventional models when alignment mismatches between input and output accumulate. Lastly, we demonstrated that the accuracy of an important downstream task, NER, does drop at various degradation levels, and our text restoration model can significantly close the accuracy gaps, including on an out-of-domain dataset. Since the design of our restoration model does not depend on the downstream task, it is generalizable to many other NLP tasks as well.
Genalog provides three standard document templates for document generation, depicted in Figure 5.
We show example degradations produced by Genalog in Figure 6.
The examples cover the following degradations: No Degradation, Bleed Heavy, Blur Heavy, Close Heavy, Open Heavy, Pepper Heavy, Salt Heavy, and All Heavy.
Table 7 lists the degradation parameters used in our experiments.
| Order | Degradation | Parameter Name | Parameter Value | Description |
|---|---|---|---|---|
| 1 | open | kernel shape | (9,9) | Thickens characters |
| | | kernel type | plus | Kernel filled with "1"s in a "+" shape |
| 2 | close | kernel shape | (9,1) | Remove horizontal structures |
| | | kernel type | ones | Kernel filled all with "1"s |
| 3 | salt | amount | 0.5 / 0.7 | Substantially reduce effect with more salt |
| 4 | overlay | src | ORIGINAL_STATE | Overlay current image on the original |
| 5 | bleed through | src | CURRENT_STATE | Apply bleed-through effect |
| | | alpha | 0.8 / 0.8 | Transparency factor |
| | | offset x | -6 / -5 | Offset (pixels) of the background in x-axis |
| | | offset y | -12 / -5 | Offset (pixels) of the background in y-axis |
| 6 | pepper | amount | 0.015 / 0.005 | Imitate page degradation |
| 7 | blur | radius | 11 / 3 | Apply Gaussian blur |
| 8 | salt | amount | 0.15 | Add digital noise |