DoSA: A System to Accelerate Annotations on Business Documents with Human-in-the-Loop

11/09/2022
by Neelesh K Shukla, et al.

Business documents come in a variety of structures and formats and serve different information needs, which makes information extraction a challenging task. Given this variation, a single document-generic model that works well across all document types and all use cases seems far-fetched. Document-specific models, in turn, require customized document-specific labels. We introduce DoSA (Document Specific Automated Annotations), which helps annotators generate initial annotations automatically using our novel bootstrap approach, which leverages document-generic datasets and models. A human can then review these initial annotations for correctness. An initial document-specific model can be trained, and its inference can be used as feedback to generate more automated annotations. These automated annotations are again reviewed for correctness by a human in the loop, and a new, improved model is trained using the current model as the pre-trained model before the next iteration. In this paper, our scope is limited to form-like documents because few generic annotated datasets are available, but the idea can be extended to other document types as more datasets are built. An open-source, ready-to-use implementation is available on GitHub: https://github.com/neeleshkshukla/DoSA.
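The iterative loop described in the abstract can be illustrated with a minimal sketch. This is not the DoSA implementation: the `Model` class, the `train`, `annotation_loop`, and `review` functions, the label names, and the toy token data are all hypothetical stand-ins, chosen only to show how a generic model bootstraps annotations, a human corrects them, and each new model starts from the previous one.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Doc = List[str]              # a document as a list of tokens (toy representation)
Annotation = Dict[str, str]  # token -> label


@dataclass
class Model:
    """Hypothetical stand-in for a trained model: a memorised token -> label map."""
    labels: Dict[str, str] = field(default_factory=dict)

    def predict(self, doc: Doc) -> Annotation:
        # Tokens the model has never seen fall back to the "other" label.
        return {tok: self.labels.get(tok, "other") for tok in doc}


def train(reviewed: List[Annotation], pretrained: Model) -> Model:
    """Start from the previous model's knowledge and add the reviewed labels."""
    labels = dict(pretrained.labels)
    for annotation in reviewed:
        labels.update(annotation)
    return Model(labels)


def annotation_loop(
    batches: List[List[Doc]],
    generic_model: Model,
    review: Callable[[Doc, Annotation], Annotation],
) -> Tuple[Model, List[int]]:
    """Run one bootstrap round per batch; return the final model and the
    number of labels the reviewer had to correct in each round."""
    model = generic_model
    corrections_per_round = []
    for batch in batches:
        # 1. The current model proposes annotations automatically.
        proposed = [model.predict(doc) for doc in batch]
        # 2. A human reviews (and, where needed, corrects) every proposal.
        reviewed = [review(doc, ann) for doc, ann in zip(batch, proposed)]
        corrections_per_round.append(
            sum(p[tok] != r[tok] for p, r in zip(proposed, reviewed) for tok in p)
        )
        # 3. Train the next model, using the current one as the pre-trained model.
        model = train(reviewed, pretrained=model)
    return model, corrections_per_round


# Toy usage: the "human" is simulated by a gold-label lookup (an assumption).
GOLD = {"Name:": "question", "Alice": "answer", "Total:": "question", "42": "answer"}


def review(doc: Doc, proposed: Annotation) -> Annotation:
    return {tok: GOLD[tok] for tok in doc}


generic = Model({"Name:": "question"})  # pretend document-generic model
batches = [[["Name:", "Alice", "Total:", "42"]], [["Total:", "42", "Name:"]]]
final_model, corrections = annotation_loop(batches, generic, review)
```

In this toy run the reviewer corrects fewer labels in the second round than in the first, which is the effect the human-in-the-loop iteration aims for; a real system would replace the lookup table with an actual trainable model and an annotation interface.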


