DocReader: Bounding-Box Free Training of a Document Information Extraction Model

05/10/2021
by Shachar Klaiman, et al.

Information extraction from documents is a ubiquitous first step in many business applications. During this step, the entries of various fields must be read from the images of scanned documents before they are processed further and inserted into the corresponding databases. Many methods have been developed in recent years to automate this extraction step, but they all require bounding-box or text-segment annotations of their training documents. In this work we present DocReader, an end-to-end neural-network-based information extraction solution that can be trained using only the images and the target values that need to be read. DocReader can thus leverage existing historical extraction data, completely eliminating the need for any annotations beyond what is naturally available in existing human-operated service centres. We demonstrate that DocReader can match and surpass other methods that require bounding boxes for training, and can provide a clear path for continual learning during deployment in production.
