Detection Masking for Improved OCR on Noisy Documents

05/17/2022
by Daniel Rotman, et al.

Optical Character Recognition (OCR), the task of extracting textual information from scanned documents, is a vital and broadly used technology for digitizing and indexing physical documents. Existing technologies perform well on clean documents, but when a document is visually degraded or contains non-textual elements, OCR quality can suffer greatly, specifically due to erroneous detections. In this paper, we present an improved detection network with a masking system to improve the quality of OCR performed on documents. By filtering non-textual elements from the image, we can utilize document-level OCR to incorporate contextual information and improve OCR results. We perform a unified evaluation on a publicly available dataset, demonstrating the usefulness and broad applicability of our method. Additionally, we present and make publicly available our synthetic dataset with a unique hard-negative component specifically tuned to improve detection results, and we evaluate the benefits that can be gained from its usage.
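The masking idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the mask is built or applied, so everything here is an assumption: the function name `mask_non_text`, the `(x0, y0, x1, y1)` box format, the white fill value, and the padding are all hypothetical, and the text boxes are taken as given rather than produced by the paper's detection network. The sketch only shows the general principle of blanking every pixel outside detected text regions before the page is passed to a document-level OCR engine.

```python
import numpy as np


def mask_non_text(image, text_boxes, fill=255, pad=2):
    """Return a copy of `image` with everything outside `text_boxes` set to `fill`.

    Hypothetical helper, not the paper's implementation.
    image:      H x W (grayscale) or H x W x C array.
    text_boxes: iterable of (x0, y0, x1, y1) pixel boxes, end-exclusive,
                assumed to come from some text detector.
    pad:        extra pixels kept around each box so glyph edges are not clipped.
    """
    h, w = image.shape[:2]
    keep = np.zeros((h, w), dtype=bool)
    for x0, y0, x1, y1 in text_boxes:
        # Clamp the padded box to the image bounds.
        xa, ya = max(0, x0 - pad), max(0, y0 - pad)
        xb, yb = min(w, x1 + pad), min(h, y1 + pad)
        keep[ya:yb, xa:xb] = True

    # Start from a blank page and copy back only the detected text regions.
    masked = np.full_like(image, fill)
    masked[keep] = image[keep]
    return masked


# Usage: a dark 10x10 "noisy" page with one detected text box.
page = np.zeros((10, 10), dtype=np.uint8)
cleaned = mask_non_text(page, [(2, 2, 5, 5)], pad=0)
```

The masked page, rather than individual crops, would then go to the OCR engine, which is what allows the engine to use document-level context as the abstract describes.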

research
07/04/2022

BusiNet – a Light and Fast Text Detection Network for Business Documents

For digitizing or indexing physical documents, Optical Character Recogni...
research
05/27/2019

FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents

In this paper, we present a new dataset for Form Understanding in Noisy ...
research
09/12/2020

Abstractive Information Extraction from Scanned Invoices (AIESI) using End-to-end Sequential Approach

Recent proliferation in the field of Machine Learning and Deep Learning ...
research
02/07/2022

Combining Deep Learning and Reasoning for Address Detection in Unstructured Text Documents

Extracting information from unstructured text documents is a demanding t...
research
06/15/2023

Document Entity Retrieval with Massive and Noisy Pre-training

Visually-Rich Document Entity Retrieval (VDER) is a type of machine lear...
research
09/10/2020

OCR Graph Features for Manipulation Detection in Documents

Detecting manipulations in digital documents is becoming increasingly im...
research
05/02/2023

An experimental framework for designing document structure for users' decision making – An empirical study of recipes

Textual documents need to be of good quality to ensure effective asynchr...
