An Augmentation Strategy for Visually Rich Documents

12/20/2022
by Jing Xie, et al.

Many business workflows require extracting important fields from form-like documents (e.g. bank statements, bills of lading, purchase orders). Recent techniques for automating this task work well only when trained on large datasets. In this work we propose a novel data augmentation technique to improve performance when training data is scarce (e.g. 10-250 documents). Our technique, which we call FieldSwap, replaces the key phrases of a source field with the key phrases of a target field, generating new synthetic examples of the target field for use in training. We demonstrate that this approach can yield improvements of 1-7 F1 points in extraction performance.
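The key-phrase swap described in the abstract can be illustrated with a minimal, text-only sketch. It is not the authors' implementation: the `Example` record, the `field_swap` function, and the sample key phrases are hypothetical names chosen for illustration, and the sketch ignores layout and token-level details that a real system for visually rich documents would need to handle.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Example:
    """A hypothetical annotated snippet: text, field label, and field value."""
    text: str
    field: str
    value: str


def field_swap(
    example: Example,
    source_phrases: List[str],
    target_field: str,
    target_phrases: List[str],
) -> List[Example]:
    """Create synthetic examples of `target_field` from a source-field example.

    Each source key phrase found in the text is replaced by each target key
    phrase; the annotated value is kept and relabeled as the target field.
    """
    synthetic = []
    for src in source_phrases:
        if src not in example.text:
            continue  # this key phrase does not occur, so there is nothing to swap
        for tgt in target_phrases:
            synthetic.append(
                Example(
                    text=example.text.replace(src, tgt),
                    field=target_field,
                    value=example.value,
                )
            )
    return synthetic


if __name__ == "__main__":
    source = Example("Invoice Date: 01/02/2022", "invoice_date", "01/02/2022")
    for ex in field_swap(source, ["Invoice Date"], "due_date", ["Due Date", "Payment Due"]):
        print(ex)
```

In this sketch a single labeled invoice date produces two synthetic due-date examples, one per target key phrase. The swap only makes sense when the source and target fields hold values of the same kind (here, dates), which is an assumption the caller has to enforce.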

research
02/05/2020

Rapid Adaptation of BERT for Information Extraction on Domain-Specific Business Documents

Techniques for automatically extracting important content elements from ...
research
01/07/2022

Data-Efficient Information Extraction from Form-Like Documents

Automating information extraction from form-like documents at scale is a...
research
06/02/2021

A Span Extraction Approach for Information Extraction on Visually-Rich Documents

Information extraction (IE) from visually-rich documents (VRDs) has achi...
research
05/22/2020

Robust Layout-aware IE for Visually Rich Documents with Pre-trained Language Models

Many business documents processed in modern NLP and IR pipelines are vis...
research
10/27/2022

Make More of Your Data: Minimal Effort Data Augmentation for Automatic Speech Recognition and Translation

Data augmentation is a technique to generate new training data based on ...
research
05/26/2023

GDA: Generative Data Augmentation Techniques for Relation Extraction Tasks

Relation extraction (RE) tasks show promising performance in extracting ...
