Fast Search with Poor OCR

09/17/2019
by Taivanbat Badamdorj, et al.

The indexing and searching of historical documents have garnered attention in recent years due to massive efforts to digitize important collections worldwide. Pure textual search in these corpora is difficult because optical character recognition (OCR) is known to perform poorly on such historical material, which often suffers from poor preservation. We propose a novel text-based method for searching through noisy text. Our system represents words as vectors, projects queries and candidates obtained from the OCR into a common space, and ranks the candidates using a metric suited to nearest-neighbor search. We demonstrate the practicality of our method on typewritten German documents from the WWII era.
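
The pipeline outlined in the abstract (vectorize words, place queries and OCR candidates in one space, rank by a nearest-neighbor metric) can be illustrated with a minimal sketch. The abstract does not say how the vectors or the projection are built, so everything here is an assumption: character-bigram counts stand in for the word vectors, the "common space" is simply the shared bigram space with no learned projection, and cosine similarity stands in for the ranking metric. The function names and the sample OCR words are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(word, n=2):
    """Bag of character n-grams for a word, padded with '#' at both ends.

    Assumption: n-gram counts are a stand-in for the paper's word vectors.
    """
    padded = f"#{word.lower()}#"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(query, candidates, n=2):
    """Rank OCR candidate words by similarity to the query, best first.

    Query and candidates go through the same vectorizer, so they share
    one space; no learned projection is applied in this sketch.
    """
    q = char_ngrams(query, n)
    scored = [(cosine(q, char_ngrams(c, n)), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical noisy OCR output; "Ber1in" has 'l' misread as '1'.
ocr_words = ["Ber1in", "Bericht", "Befehl", "Munchen"]
ranking = rank_candidates("Berlin", ocr_words)
best_score, best_word = ranking[0]
```

In this toy setting the misrecognized "Ber1in" still shares most of its bigrams with the query "Berlin", so it ranks first even though an exact string match would miss it. A real system would replace the linear scan in `rank_candidates` with an index built for nearest-neighbor search.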

research
05/30/2023

DuoSearch: A Novel Search Engine for Bulgarian Historical Documents

Search in collections of digitised historical documents is hindered by a...
research
02/13/2017

Content-Based Video Retrieval in Historical Collections of the German Broadcasting Archive

The German Broadcasting Archive (DRA) maintains the cultural heritage of...
research
02/20/2019

Empowering Elasticsearch with Exact and Fast r-Neighbor Search in Hamming Space

A growing interest has been witnessed recently in building nearest neigh...
research
09/23/2021

Named Entity Recognition and Classification on Historical Documents: A Survey

After decades of massive digitisation, an unprecedented amount of histor...
research
12/11/2019

Lifelong learning for text retrieval and recognition in historical handwritten document collections

This chapter provides an overview of the problems that need to be dealt ...
research
04/23/2020

A Tool for Facilitating OCR Postediting in Historical Documents

Optical character recognition (OCR) for historical documents is a comple...
research
06/16/2021

Sentiment Progression based Searching and Indexing of Literary Textual Artefacts

Literary artefacts are generally indexed and searched based on titles, m...
