Fast Search with Poor OCR

09/17/2019 ∙ by Taivanbat Badamdorj, et al.

The indexing and searching of historical documents have garnered attention in recent years due to massive digitization efforts of important collections worldwide. Pure textual search in these corpora is a problem since optical character recognition (OCR) is infamous for performing poorly on such historical material, which often suffers from poor preservation. We propose a novel text-based method for searching through noisy text. Our system represents words as vectors, projects queries and candidates obtained from the OCR into a common space, and ranks the candidates using a metric suited to nearest-neighbor search. We demonstrate the practicality of our method on typewritten German documents from the WWII era.




1 Introduction

The Wiener Library is one of the most extensive archives on the Holocaust and Nazi era. Established in 1933, the Library’s unique collection of over one million items includes press cuttings, eyewitness testimonies, photographs, as well as published and unpublished works from that era.

Now that the collection is being digitized, it is important to be able to search through it.

Although various image-based search methods have been proposed in recent years for such datasets [3, 22, 24, 33, 38], we choose to work instead with the noisy text obtained from a state-of-the-art optical character recognition (OCR) engine, encouraged by the increasing accuracy of such systems.

Image-based search systems [3, 24, 33, 38] often encode the query string and candidate images into a common subspace, and find matches using nearest-neighbors. Analogously, we encode the noisy candidates output by the OCR and the query into a fixed vector representation, and learn a common space between them.

The size of the collection (over 70,000 documents) also makes efficiency a leading concern. Pure text-based distance metrics such as edit-distance are impractical for the task of searching through large and noisy corpora; our corpus has close to 2 million unique reads from the OCR.

Nearest-neighbor search is therefore a simple and common solution. This step is fast, since it can use efficient matrix multiplication implementations on modern GPUs [13]. Although the most common metric for nearest neighbors is simply the cosine distance, we use a newer metric introduced in [5]. This metric is still fast, and also improves retrieval results.

2 The Corpus

2.1 History

The Wiener Collection was established in 1933 by Dr. Alfred Wiener, a German Jewish scholar and former member and activist of the Centralverein deutscher Staatsbürger jüdischen Glaubens. Wiener left Germany when the Nazis rose to power and established the Jewish Central Information Office (JCIO) in Amsterdam. The idea was to collect information about the Nazi Party, as part of the struggle to prevent its strengthening and to draw world attention towards the dangers of Nazi antisemitism. In 1939, Wiener transferred the collection to London. Throughout the war years, Wiener and his assistants continued to collect information and documents on Germany’s occupation policy, responses to it, and particularly on the fate of European Jewry. When the war ended, Holocaust survivors’ testimonies, as well as information regarding the fate of Jewish refugees, were added. The collection played an important role in the charges leveled against war criminals and has continued to serve the media and scholars.

2.2 Holdings

The collection comprises publications on the Third Reich, Europe during and between the two world wars, Jewish communities in Europe, the Holocaust, antisemitism, and fascism throughout the world. These include approximately 150,000 books, reference works, pamphlets and journals; over one million indexed newspaper clippings, unpublished memoirs, and interviews; around 40,000 documents on the Nuremberg trials; various editions and extensive literature on The Protocols of the Elders of Zion; dossiers on war criminals; documents on the “Jewish Question” taken from records of the Gestapo, the Reichskanzlei, and the Foreign Office of the Third Reich; more than 500 microfilm and microfiche titles; and over 300 subscriptions to journals, both Holocaust and extreme right wing/Holocaust denial.

2.3 Online Archive

In the late 1970s, the Wiener Library (London) transferred most of its collection to Tel Aviv University in Israel, where the Wiener Library for the Study of the Nazi Era and the Holocaust was established. In addition, microfilms of parts of the collection are held in London and Tel Aviv.

In recent years, the documents at both the London and Tel Aviv locations are being digitized and made available online. Beginning in 2015, Tel Aviv digitized its “500 Document Collection,” which includes the main component of the original Wiener materials. It comprises more than 75,000 images that were scanned from microfilm and microfiche reproductions of the originals. The work described in this paper was performed with this online collection.

3 Related Work

Many recent projects have used word-spotting techniques to search large textual corpora. Word spotting was first introduced in [24] to enable search through handwritten documents, and was further improved upon in [3, 8, 9, 10, 11, 12, 38]. The work of [3] is of particular interest here because it introduced the vector encoding for words that we use in Section 4.3.1.

More recently, [22] and [33] introduced deep learning-based approaches to the task. The former jointly learned the text and image embeddings using a neural network. The latter implicitly trained an OCR by training a convolutional neural network (CNN) to output its ground truth vector representation.

Other works that incorporate searching through a noisy OCR include [4, 7], both of which use string-based methods to find possible matches. We found using purely string-based methods for matching to be too slow for our needs.

4 Method

Given a text query, we would like to find the correct matches among the noisy candidates output by the OCR.

Our method consists of (A) preprocessing the images, (B) obtaining the words from a state-of-the-art OCR engine, (C) encoding the query and candidates into vectors, (D) learning a common subspace between them, and finally (E) ranking the candidates according to their distance from the query.

4.1 Preprocessing

The information lost at the beginning of our pipeline is irrecoverable, so we take care to preprocess the images properly such that OCR may perform optimally.

The documents were photographed with a black border around each page. We first remove these borders by binarizing the image and finding the largest connected component, which will be the black border. We find its bounds and keep only that part which is enclosed inside, which is the actual document of interest.
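The border-removal step can be illustrated with a minimal sketch. It assumes a grayscale page stored as a NumPy array, a fixed binarization threshold, and 4-connectivity; the function names and the threshold value are hypothetical and not taken from the original pipeline. The pure-Python flood fill is written for readability only; on full-size scans a library routine for connected components would be the practical choice.

```python
from collections import deque

import numpy as np


def largest_dark_component(gray, threshold=128):
    """Return a boolean mask of the largest 4-connected region of dark pixels."""
    dark = gray < threshold          # binarize: True where the pixel is dark
    seen = np.zeros(dark.shape, dtype=bool)
    height, width = dark.shape
    best = []
    for r0 in range(height):
        for c0 in range(width):
            if not dark[r0, c0] or seen[r0, c0]:
                continue
            # Breadth-first flood fill of one connected component.
            seen[r0, c0] = True
            queue, component = deque([(r0, c0)]), []
            while queue:
                r, c = queue.popleft()
                component.append((r, c))
                for rr, cc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                    if (0 <= rr < height and 0 <= cc < width
                            and dark[rr, cc] and not seen[rr, cc]):
                        seen[rr, cc] = True
                        queue.append((rr, cc))
            if len(component) > len(best):
                best = component
    mask = np.zeros(dark.shape, dtype=bool)
    if best:
        rows, cols = zip(*best)
        mask[list(rows), list(cols)] = True
    return mask


def crop_inside_border(gray, threshold=128):
    """Crop to the bounding box of everything that is not part of the border."""
    border = largest_dark_component(gray, threshold)
    rows, cols = np.nonzero(~border)
    if rows.size == 0:               # the whole image is one dark blob
        return gray
    return gray[rows.min():rows.max() + 1, cols.min():cols.max() + 1]
```

Ink inside the page is dark as well, but it is not connected to the border, so it stays inside the crop. When the whole page is too dark to separate from the border, the sketch returns the image unchanged.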

Lighting differs starkly between documents. Some are very dark, as in Fig. 5; others are not uniformly lit and have light or dark patches. We therefore adjust the contrast using the CLAHE (contrast-limited adaptive histogram equalization) algorithm [37], which adjusts the contrast in local regions and thus alleviates the problem of nonuniform lighting.

Other possible steps, such as binarization [26], degraded the performance of the OCR across all tasks.

4.2 OCR

4.2.1 Tesseract

We were encouraged by recent improvements in optical character recognition (OCR) software. We used Tesseract [29, 31], an open-source OCR engine for which instructions and pretrained models are available online. Although we had many other options, we chose Tesseract for its convenience.

Tesseract also works well with a variety of fonts and languages [30], which was needed for this project. The latest version, which is the one we used, is based on Long Short-Term Memory (LSTM) [16] with Connectionist Temporal Classification (CTC) [14] used as the scoring function, a common combination in unaligned sequence generation problems. We also used pytesseract, a publicly available Python wrapper that allowed us to interface with the Tesseract engine more easily.

For each word, we use Tesseract to get its bounding box and its transcription.
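A minimal sketch of this step with pytesseract is given below. `pytesseract.image_to_data` with `Output.DICT` returns parallel lists of texts, box coordinates, and confidences. The helper names, the `min_conf` parameter, and the choice of the `deu` language code are assumptions made for illustration; the exact model and settings used for the collection are not specified here.

```python
def words_from_data(data, min_conf=0.0):
    """Convert Tesseract's per-element table into a list of word records.

    `data` is a dict of parallel lists with the keys "text", "left", "top",
    "width", "height" and "conf", as produced by pytesseract's DICT output.
    """
    words = []
    for text, left, top, width, height, conf in zip(
        data["text"], data["left"], data["top"],
        data["width"], data["height"], data["conf"],
    ):
        text = text.strip()
        conf = float(conf)           # non-word rows carry a confidence of -1
        if text and conf >= min_conf:
            words.append({
                "text": text,
                "box": (left, top, width, height),
                "conf": conf,
            })
    return words


def ocr_words(image_path, lang="deu", min_conf=0.0):
    """Run Tesseract on one page image and return its word records."""
    # Imported lazily so that words_from_data can be used without Tesseract.
    import pytesseract
    from PIL import Image

    data = pytesseract.image_to_data(
        Image.open(image_path), lang=lang,
        output_type=pytesseract.Output.DICT,
    )
    return words_from_data(data, min_conf)
```

Each returned record pairs a transcription with its bounding box, which is what the later encoding and ranking stages consume.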

Figure 1: Example of OCR output from best performing German model from Tesseract.

4.3 Encoding

After obtaining the readings, we would like to have a fixed size vector representation of each word to do quick nearest neighbor search. We want the correct candidate vectors to be close to the query vector.

To achieve this, we propose the use of the following vector representation.

4.3.1 PHOC

The Pyramidal Histogram of Characters (PHOC) encoding was introduced in [3], in the context of word-spotting. In the simplest case, words can be represented simply by the characters that are in the word, that is, by a vector where 1 indicates that a character is in the word, and 0 otherwise. In this case, however, the words “beard” and “bread” would have the exact same representation. Since the letter “r” is in the second half of the word “beard” and in the first half of the word “bread”, we could use that to differentiate between the two words. Thus we define two more binary vectors, one for all the characters in the first half of the word, and another for the characters in the second half of the word. Proceeding along the same lines, we could divide the word into thirds, quarters, and so on. We refer to the number of divisions as “levels”. The final representation of the word is the concatenation of all its vector representations at each level.

Figure 2: Illustration of PHOC. The final representation is the concatenation of all the binary vectors obtained at each level.

Formally, we decide where to assign each character in the following manner:

We first define the normalized occupancy of the $k$-th character of a word with length $n$ as $\mathrm{Occ}(k, n) = \left[\frac{k}{n}, \frac{k+1}{n}\right]$, where the position $k$ starts from $0$. Using the same formula for the region $r$ at level $l$, $\mathrm{Occ}(r, l) = \left[\frac{r}{l}, \frac{r+1}{l}\right]$, the character belongs to the region if the overlap between their occupancies is greater than or equal to $50\%$ of the occupancy area of the character, that is, if

$$\frac{|\mathrm{Occ}(k, n) \cap \mathrm{Occ}(r, l)|}{|\mathrm{Occ}(k, n)|} \geq 0.5,$$

where $|[a, b]| = b - a$.

In simpler terms, if we assume all characters are the same width, in other words, they each occupy some unit area, and we divide the word into $l$ bins such that each bin occupies an area $n/l$, then we assign the character into each bin where at least half of the character overlaps with the bin.
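The assignment rule can be sketched in a few lines of Python. The function name, the lowercase example alphabet, and the decision to skip characters outside the alphabet are assumptions made for illustration. Occupancies are compared in integer units of 1/(n·l) so that the "at least half" test is exact.

```python
def phoc(word, alphabet, levels=(1, 2, 4, 8)):
    """Binary PHOC vector: one alphabet-sized block per region of each level."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    n = len(word)
    vector = []
    for level in levels:
        for region in range(level):
            block = [0] * len(alphabet)
            # In units of 1/(n * level), the region spans [region*n, (region+1)*n]
            # and the k-th character spans [k*level, (k+1)*level].
            r_start, r_end = region * n, (region + 1) * n
            for k, ch in enumerate(word):
                if ch not in index:
                    continue         # characters outside the alphabet are skipped
                c_start, c_end = k * level, (k + 1) * level
                overlap = min(c_end, r_end) - max(c_start, r_start)
                if 2 * overlap >= level:   # overlap covers at least half the character
                    block[index[ch]] = 1
            vector.extend(block)
    return vector


alphabet = "abcdefghijklmnopqrstuvwxyz"
same_at_level_1 = phoc("beard", alphabet, levels=(1,)) == phoc("bread", alphabet, levels=(1,))
differ_at_level_2 = phoc("beard", alphabet, levels=(1, 2)) != phoc("bread", alphabet, levels=(1, 2))
```

With levels 1, 2, 4, and 8 there are 15 regions in total, so an alphabet of 96 characters yields a vector of length 96 × 15 = 1440.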

4.3.2 Character Set and Special Considerations

Our dataset has additional special characters from multiple languages. Although most of our text is in German, there are also Polish, English, and even some Hebrew texts. We focus our efforts on the Latin-based languages, and get a set of 96 unique characters.

The German language is also famous for its compound words, the practice of combining multiple words into one word, resulting in many long words in our corpora. Therefore, we use levels 1, 2, 4, and 8 for computing the PHOC histograms. The end result is that each word is represented by a binary vector of size 96 × (1 + 2 + 4 + 8) = 1440.

The vector representations of each word are precomputed and stored in memory. For our dataset, this amounts to a large 80GB matrix. Nevertheless, computing the distances between a single query and over 1 million candidates takes only 7 seconds using a GPU.

Figure 3: Qualitative search results for the entire dataset with bounding boxes found by the OCR engine in pink. Our model works well even for cases where there is substantial fading of the ink, and cases where the transcription of the candidate itself is not an exact match with our query.

4.4 Learning a Common Space using Canonical Correlation Analysis

Canonical Correlation Analysis (CCA) [15] is a statistical method for computing a linear projection for two views into a common space by maximizing their correlation. It has been widely used in the field of computer vision to tackle tasks such as linking text to images, multiview analysis [28], and action recognition [21].

Multiple variants have been proposed over the years: it was regularized [35], kernels were introduced [17, 18], and more recently, deep learning methods were created [19, 20].

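CCA takes two vectors, often referred to as views in the literature, $x \in \mathbb{R}^{d_x}$ and $y \in \mathbb{R}^{d_y}$, as inputs. They are stacked as columns of two matrices $X$ and $Y$. The columns of the matrices $X$ and $Y$ are assumed to be centered, that is, $\sum_i x_i = 0$ and $\sum_i y_i = 0$. CCA learns a common space by learning two matrices $W_x$ and $W_y$ such that the correlation is maximized under the condition that the projection is uncorrelated, i.e.

$$\max_{W_x, W_y} \operatorname{tr}\left(W_x^\top X Y^\top W_y\right) \quad \text{subject to} \quad W_x^\top X X^\top W_x = W_y^\top Y Y^\top W_y = I.$$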


We use the regularized version of CCA [35], which adds a regularization term to the covariance matrices. The regularized version generalizes better and is more stable.

We consider the PHOC vector of the correct spelling of a word (the query) to be one view, and the PHOC vector of its corrupted versions (the candidates) as another view. We set apart a training set of 3032 pairs of noisy candidates with their ground truth transcriptions, compute their PHOC representations, and learn the projection matrices $W_x$ and $W_y$. Because the data is assumed to be centered, we also compute the mean of our candidates and ground truth transcriptions during training, and subtract these means at query time.
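One common way to solve regularized CCA is to whiten both views and take a singular value decomposition of the whitened cross-covariance. The sketch below follows that route with NumPy. It is an illustration under stated assumptions rather than the exact solver used here: samples are stored as rows instead of columns, and the function names, the regularization strength `reg`, and the output dimension `dim` are hypothetical.

```python
import numpy as np


def inv_sqrt(mat):
    """Inverse square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(mat)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T


def fit_regularized_cca(X, Y, dim, reg=1e-3):
    """Fit CCA on paired rows of X (n, dx) and Y (n, dy).

    Returns the two training means, the projection matrices Wx and Wy,
    and the leading `dim` singular values of the whitened cross-covariance.
    """
    mean_x, mean_y = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mean_x, Y - mean_y              # center both views
    n = X.shape[0]
    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])   # regularized covariances
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n
    Kx, Ky = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Kx @ Cxy @ Ky)
    Wx = Kx @ U[:, :dim]
    Wy = Ky @ Vt.T[:, :dim]
    return mean_x, mean_y, Wx, Wy, s[:dim]


def project(vectors, mean, W):
    """Subtract the training mean, then map into the common space."""
    return (vectors - mean) @ W
```

At query time, a query PHOC vector would be passed through `project` with the mean and matrix of its own view, and the candidates through `project` with theirs, before ranking.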

4.5 Ranking Candidates

Having projected queries and candidates into a common subspace, we find the correct candidates by using nearest-neighbor search.

4.5.1 Nearest Neighbors

Finding the correct candidates for a given query using nearest neighbors [6, 23] is common in related tasks such as topic detection [34].

However, nearest neighbor search is by its very nature asymmetric: A being a nearest neighbor of B does not imply that B is a nearest neighbor of A. In high-dimensional spaces, this leads to an effect that causes faulty matching when using a nearest neighbors rule [27]: some vectors, dubbed “hubs”, have a high probability of being a nearest neighbor to many other vectors, while other vectors, dubbed “anti-hubs”, are not the nearest neighbors of any other points. Recently, methods such as Inverted Softmax (ISF) [32] and Cross-Domain Similarity Local Scaling (CSLS) [5] have been proposed to mitigate the issue. ISF requires cross-validation of a parameter; thus we chose CSLS for its simpler nature.

4.5.2 Cross-Domain Similarity Local Scaling

In our experiments, we use the Cross-Domain Similarity Local Scaling (CSLS) metric as the ranking metric and show that it is better than simply using a cosine distance. The CSLS score between two vectors $x$ and $y$ (a higher score means a closer match) is defined as:

$$\mathrm{CSLS}(x, y) = 2\cos(x, y) - r(x) - r(y),$$

where $r(x)$ is the average cosine similarity between the vector $x$ and its $k$ nearest neighbors. To save runtime, $r$ is precomputed for each candidate vector.
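A minimal NumPy sketch of CSLS ranking follows. It assumes the cross-domain form of the neighborhood term, in which the term for a query is averaged over its nearest candidates and the term for a candidate over its nearest queries; the function names are hypothetical, and all vectors are assumed to be nonzero.

```python
import numpy as np


def normalize(M):
    """Scale every row to unit length so that dot products are cosines."""
    return M / np.linalg.norm(M, axis=1, keepdims=True)


def mean_knn_similarity(A, B, k):
    """For each row of A, the mean cosine similarity to its k nearest rows of B."""
    sims = normalize(A) @ normalize(B).T
    top_k = np.sort(sims, axis=1)[:, -k:]        # k largest similarities per row
    return top_k.mean(axis=1)


def csls_scores(queries, candidates, k=20):
    """CSLS score matrix of shape (num_queries, num_candidates); higher is closer."""
    sims = normalize(queries) @ normalize(candidates).T
    r_query = mean_knn_similarity(queries, candidates, k)
    # This term does not depend on the individual pair, so it can be
    # computed once per candidate and cached.
    r_candidate = mean_knn_similarity(candidates, queries, k)
    return 2 * sims - r_query[:, None] - r_candidate[None, :]


def rank_candidates(queries, candidates, k=20):
    """Candidate indices for each query, best match first."""
    return np.argsort(-csls_scores(queries, candidates, k), axis=1)
```

The dominant cost is the single matrix product that yields all cosine similarities, which is why the ranking step maps well onto GPU matrix multiplication.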

4.5.3 Edit-Distance

We also consider using the edit-distance as the ranking metric.

The Levenshtein distance between two words $a$ and $b$ is defined as the minimum number of single-character edits (insertions, deletions, or substitutions) needed to transform $a$ into $b$.

Although the edit-distance is an attractive metric due to its simplicity, it does not scale well for a large number of candidates, as its complexity is $O(n^2)$, where $n$ is the length of strings $a$ and $b$. This makes edit-distance impractical for our dataset. Although various methods have been proposed for fast approximation of the edit-distance [2, 25], none have been widely adopted, nor openly implemented. In our experiments, we use the well-known dynamic programming algorithm [36].
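For reference, the dynamic programming recurrence can be sketched as follows. This is the standard two-row formulation with uniform costs; the function names are chosen for illustration.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))       # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]                       # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete ca
                curr[j - 1] + 1,         # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def rank_by_edit_distance(query, candidates):
    """Sort the candidates by their edit-distance to the query, closest first."""
    return sorted(candidates, key=lambda cand: levenshtein(query, cand))
```

Every query has to run this quadratic recurrence against every candidate string, which is the cost that the vector-based ranking avoids.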

5 Experiment

We tested the accuracy and efficiency of our system in an information retrieval setting. We manually prepared a dataset of 18 pages, amounting to 4284 words.

We put aside a training set of 3032 pairs of transcriptions and their noisy candidates to obtain the projection matrices $W_x$ and $W_y$ using CCA. Then we tested accuracy and efficiency on the remaining test set of 1252 queries.

We only report on the best metrics obtained when using regularized CCA.

We use CSLS for 20 nearest neighbors ($k = 20$).

All timing statistics are for a standard personal computer with no GPU.

6 Results

Table 1 summarizes our findings. The best method is CCA with CSLS. Our final system achieves a higher mAP (83.6% vs. 83.2%) while running 9 times faster than simply using the edit-distance between strings as the ranking metric. CSLS by itself gives more accurate retrieval results than using the cosine distance after projection (83.2% vs. 83.1%).

Fig. 3 shows qualitative search results that were obtained by searching through the entire dataset. Our system is capable of finding very long words (“Nationalsozialistische”), as well as approximate matches to our query that have slightly different readings (“Rossenstrasse” vs. “Rossenstraße”). The OCR performs well on a variety of fonts.

Method            mAP (%)
CCA and CSLS      83.6
Edit-Distance     83.2
CSLS              83.2
CCA and Cosine    83.1
Cosine            82.2

Table 1: Search Results

Figure 4: Time required for different ranking methods. We achieve higher mAP in much less time after projecting using CCA and ranking using the CSLS metric.

7 Discussion

General Methods.

The methods discussed in this paper are general and straightforward. The effects of the noisy OCR were manageable in our case. The effects of noise on text classification were discussed in [1], which concluded that up to 40% noise was not detrimental to text classification. We determined from our manually tagged dataset that around 60% of OCR outputs from the pretrained Tesseract model are within edit-distance 2 of the correct transcription. Despite this level of noise, we were still able to get compelling results.

Another positive aspect for us was that OCR works well out of the box for a variety of different languages, as well as fonts. This is important for our dataset because it includes German, Polish, English, and even some Hebrew texts. Within each language, there are also a variety of fonts that must be dealt with.

That being said, the drawback of this method is that it fails wherever the OCR fails. An example where the OCR fails is shown in Fig. 5. The OCR does not predict any words in the image due to its poor quality. Also, we were not able to remove the black border around the image using our current preprocessing tool, as the document is too dark.

A careful evaluation of the quality of the OCR is necessary before using any of the proposed methods.

Figure 5: Example of failure case.

Self-Supervised OCR.

Since the OCR engine assigns a confidence score to each word that it outputs, we tried fine-tuning the OCR by using the transcriptions of its most confident outputs as additional ground truth texts. We took images of words to which it assigned a confidence score higher than 90, but were unable to improve the performance of the OCR itself.

Weighted Edit-Distance.

Another variant of the edit-distance takes into account the likelihood of each single character edit. We tried computing a confusion matrix, and using the corresponding weighted edit-distance instead of the regular edit-distance, but were not able to outperform “vanilla” (uniform cost) edit distance in our experiments.
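The weighted variant changes only the substitution term of the dynamic program. In the sketch below, the cost table, its example values, and the function name are hypothetical; how per-pair costs would be derived from a confusion matrix is left open, and pairs missing from the table fall back to the uniform cost of 1.

```python
def weighted_edit_distance(a, b, sub_cost=None, indel_cost=1.0):
    """Edit-distance in which each substitution pair may have its own cost.

    `sub_cost` maps (char_in_a, char_in_b) to a cost; missing pairs cost 1.0.
    """
    sub_cost = sub_cost or {}
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, cb in enumerate(b, start=1):
            cost = 0.0 if ca == cb else sub_cost.get((ca, cb), 1.0)
            curr.append(min(
                prev[j] + indel_cost,      # delete ca
                curr[j - 1] + indel_cost,  # insert cb
                prev[j - 1] + cost,        # substitute with a pair-specific cost
            ))
        prev = curr
    return prev[-1]


# Hypothetical example: the OCR is assumed to confuse "u" and "n" often,
# so that substitution is made cheap.
cheap_confusions = {("u", "n"): 0.25, ("n", "u"): 0.25}
uniform = weighted_edit_distance("Haus", "Hans")
weighted = weighted_edit_distance("Haus", "Hans", cheap_confusions)
```

With an empty cost table the function reduces to the uniform-cost edit-distance.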

8 Conclusion

In this work, we present a fast and accurate text-based search that, given a query and a set of candidates, encodes them each into a fixed vector representation, projects them into a common subspace, and ranks the candidates with a metric better suited for nearest-neighbor search. Our search is 9 times faster than edit-distance, and is also more accurate.

We have applied the system described here to all the German-language documents in the Wiener collection. When embedded in the library’s search tool, this will provide WWII scholars with a valuable tool to search effectively through these important historic documents.


We wish to thank …