Donut: Document Understanding Transformer without OCR

11/30/2021
by Geewook Kim, et al.

Understanding document images (e.g., invoices) has been an important research topic and has many applications in document processing automation. Through the latest advances in deep learning-based Optical Character Recognition (OCR), current Visual Document Understanding (VDU) systems have come to be designed based on OCR. Although such OCR-based approaches promise reasonable performance, they suffer from critical problems induced by the OCR, e.g., (1) expensive computational costs and (2) performance degradation due to OCR error propagation. In this paper, we propose a novel VDU model that is end-to-end trainable without an underlying OCR framework. To this end, we propose a new task and a synthetic document image generator to pre-train the model and reduce its dependency on large-scale real document images. Our approach achieves state-of-the-art performance on various document understanding tasks in public benchmark datasets and private industrial service datasets. Through extensive experiments and analysis, we demonstrate the effectiveness of the proposed model, particularly for real-world applications.


research
09/03/2023

Attention Where It Matters: Rethinking Visual Document Understanding with Selective Region Concentration

We propose a novel end-to-end document understanding model called SeRum ...
research
01/27/2022

DocSegTr: An Instance-Level End-to-End Document Image Segmentation Transformer

Understanding documents with rich layouts is an essential step towards i...
research
07/24/2023

MataDoc: Margin and Text Aware Document Dewarping for Arbitrary Boundary

Document dewarping from a distorted camera-captured image is of great va...
research
11/16/2021

Document AI: Benchmarks, Models and Applications

Document AI, or Document Intelligence, is a relatively new research topi...
research
08/18/2023

TrOMR:Transformer-Based Polyphonic Optical Music Recognition

Optical Music Recognition (OMR) is an important technology in music and ...
research
02/01/2021

RectiNet-v2: A stacked network architecture for document image dewarping

With the advent of mobile and hand-held cameras, document images have fo...
research
04/07/2021

Document Layout Analysis via Dynamic Residual Feature Fusion

The document layout analysis (DLA) aims to split the document image into...
