DocParser: End-to-end OCR-free Information Extraction from Visually Rich Documents

04/24/2023
by   Mohamed Dhouib, et al.

Information extraction from visually rich documents is a challenging task that has gained considerable attention in recent years because of its importance in several document-control-based applications and its widespread commercial value. Most research on this topic to date follows a two-step pipeline: first, the text is read using an off-the-shelf Optical Character Recognition (OCR) engine; then, the fields of interest are extracted from the obtained text. The main drawback of these approaches is their dependence on an external OCR system, which can negatively impact both performance and computational speed. Recent OCR-free methods have been proposed to address these issues. Inspired by their promising results, we propose in this paper an OCR-free end-to-end information extraction model named DocParser. It differs from prior end-to-end approaches in its ability to better extract discriminative character features. DocParser achieves state-of-the-art results on various datasets while still being faster than previous works.

