Combining Deep Learning and Reasoning for Address Detection in Unstructured Text Documents

02/07/2022
by Matthias Engelbach, et al.

Extracting information from unstructured text documents is a demanding task, since these documents can have a wide variety of layouts and a non-trivial reading order, as is the case for multi-column documents or nested tables. Additionally, many business documents are received in paper form, meaning that their textual contents need to be digitized before further analysis. Nonetheless, automatically detecting and capturing crucial document information such as the sender address would boost many companies' processing efficiency. In this work we propose a hybrid approach that combines deep learning with reasoning for finding and extracting addresses from unstructured text documents. We use a visual deep learning model to detect the boundaries of possible address regions on the scanned document images and validate these results by analyzing the text they contain, using domain knowledge represented as a rule-based system.
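To make the validation step more concrete, the following is a minimal sketch of how candidate regions from a visual detector could be filtered by rules over their text. It is not the authors' rule system: the `Candidate` type, the `min_score` threshold, and both regular expressions (a 5-digit postal code followed by a city name, and a street line ending in a house number) are assumptions chosen purely for illustration.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical rules (assumptions, not taken from the paper):
# a 5-digit postal code followed by a city name ...
POSTAL_CITY = re.compile(r"\b\d{5}\s+[A-Za-zÄÖÜäöüß][\w .-]+")
# ... and a street line that ends in a house number, e.g. "Musterstraße 12".
STREET_NUMBER = re.compile(r"[A-Za-zÄÖÜäöüß][\w .-]*\s+\d+\s*[A-Za-z]?\s*$")


@dataclass
class Candidate:
    """A region proposed by a (hypothetical) visual detector."""
    box: Tuple[int, int, int, int]  # x0, y0, x1, y1 in image pixels
    text: str                       # OCR text found inside the box
    score: float                    # detector confidence in [0, 1]


def is_plausible_address(text: str) -> bool:
    """Return True if the text satisfies both illustrative rules."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    has_postal = any(POSTAL_CITY.search(ln) for ln in lines)
    has_street = any(STREET_NUMBER.search(ln) for ln in lines)
    return has_postal and has_street


def validate(candidates: List[Candidate], min_score: float = 0.5) -> List[Candidate]:
    """Keep candidates that pass the score threshold and the text rules."""
    return [
        c for c in candidates
        if c.score >= min_score and is_plausible_address(c.text)
    ]


candidates = [
    Candidate((40, 60, 400, 160), "Example GmbH\nMusterstraße 12\n70569 Stuttgart", 0.91),
    Candidate((40, 500, 400, 560), "Invoice total 2022\nThank you for your order", 0.88),
]
accepted = validate(candidates)
```

In this sketch the second region is rejected even though the detector scored it highly, because its text contains no postal code line. This is the kind of correction a rule-based check can contribute on top of a purely visual model.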


