OCR-IDL: OCR Annotations for Industry Document Library Dataset

02/25/2022
by Ali Furkan Biten, et al.

Pretraining has proven successful in Document Intelligence tasks, where a deluge of documents is used to pretrain models that are only later finetuned on downstream tasks. One problem with these pretraining approaches is the inconsistent use of pretraining data across different OCR engines, which leads to incomparable results between models. In other words, it is not obvious whether a performance gain comes from the varying amounts of data and distinct OCR engines used, or from the proposed models themselves. To remedy this problem, we make public the OCR annotations for IDL documents, produced with a commercial OCR engine given its superior performance over open-source OCR models. The contributed dataset (OCR-IDL) has an estimated monetary value of over 20K US$. It is our hope that OCR-IDL can be a starting point for future work on Document Intelligence. All of our data, its collection process, and the annotations can be found at https://github.com/furkanbiten/idl_data.
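Since the annotations come from a commercial OCR engine, a consumer would typically parse word-level text and bounding boxes out of the engine's JSON response. The sketch below assumes a Textract-style response layout (`Blocks`, `BlockType`, `Text`, `Geometry.BoundingBox`), which is common for commercial OCR output; the actual OCR-IDL schema may differ, and the sample record is hypothetical.

```python
def extract_words(annotation: dict):
    """Collect (text, bounding_box) pairs from a Textract-style OCR record.

    Bounding boxes are (left, top, width, height) in page-relative
    coordinates, as returned by the OCR engine.
    """
    words = []
    for block in annotation.get("Blocks", []):
        if block.get("BlockType") == "WORD":
            box = block["Geometry"]["BoundingBox"]
            words.append(
                (block["Text"],
                 (box["Left"], box["Top"], box["Width"], box["Height"]))
            )
    return words

# Hypothetical single-page OCR record for illustration only.
sample = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "WORD", "Text": "Invoice",
         "Geometry": {"BoundingBox": {"Left": 0.10, "Top": 0.05,
                                      "Width": 0.20, "Height": 0.03}}},
        {"BlockType": "WORD", "Text": "Total",
         "Geometry": {"BoundingBox": {"Left": 0.10, "Top": 0.90,
                                      "Width": 0.12, "Height": 0.03}}},
    ]
}

print(extract_words(sample))
```

Keeping boxes in page-relative coordinates makes the annotations resolution-independent, which is convenient when the same documents are rendered at different DPIs for pretraining.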


