Doc2Dict: Information Extraction as Text Generation

05/16/2021
by Benjamin Townsend, et al.

Typically, information extraction (IE) requires a pipeline approach: first, a sequence labeling model is trained on manually annotated documents to extract relevant spans; then, when a new document arrives, the model predicts spans, which are post-processed and standardized to convert the information into a database entry. We replace this labor-intensive workflow with a transformer language model trained on existing database records to directly generate structured JSON. Our solution removes the workload associated with producing token-level annotations and takes advantage of a data source that is generally plentiful: existing database records. Because long documents are common in information extraction tasks, we use gradient checkpointing and chunked encoding to apply our method to sequences of up to 32,000 tokens on a single GPU. Our Doc2Dict approach is competitive with more complex, hand-engineered pipelines and offers a simple but effective baseline for document-level information extraction. We release our Doc2Dict model and code to reproduce our experiments and facilitate future work.
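To make the data flow concrete, the following is a minimal sketch of the preparation and post-processing steps implied by the abstract, not the authors' released code. It shows how a database record might be serialized into a JSON target string, how a long token sequence could be split into chunks before encoding, and how generated text could be parsed back into a record. The function names, the example record, and the chunk size of 512 are all assumptions made for illustration; the model, training loop, and gradient checkpointing are omitted.

```python
import json
from typing import List, Optional


def record_to_target(record: dict) -> str:
    """Serialize a database record into the JSON string a text-generation
    model would be trained to produce (hypothetical helper)."""
    # sort_keys gives a deterministic key order, so equal records map to equal targets.
    return json.dumps(record, sort_keys=True, ensure_ascii=False)


def chunk_tokens(tokens: List[int], chunk_size: int) -> List[List[int]]:
    """Split a long token sequence into consecutive chunks of at most
    chunk_size tokens; the last chunk may be shorter (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def parse_prediction(text: str) -> Optional[dict]:
    """Parse generated text back into a record. Returns None when the text
    is not a valid JSON object, since generation is not guaranteed to be well-formed."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


# Illustrative record; the field names are invented for this example.
record = {"name": "Acme Ltd", "year": 2020}
target = record_to_target(record)

# A stand-in for a 32,000-token document, split with an assumed chunk size of 512.
chunks = chunk_tokens(list(range(32000)), 512)
```

Parsing the output defensively matters in this setting because a generative model can emit truncated or malformed JSON, whereas a span-labeling pipeline cannot.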

research
05/26/2022

Jointly Learning Span Extraction and Sequence Labeling for Information Extraction from Business Documents

This paper introduces a new information extraction model for business do...
research
05/28/2021

ByT5: Towards a token-free future with pre-trained byte-to-byte models

Most widely-used pre-trained language models operate on sequences of tok...
research
10/29/2019

Big Bidirectional Insertion Representations for Documents

The Insertion Transformer is well suited for long form text generation d...
research
01/14/2022

Sequence-to-Sequence Models for Extracting Information from Registration and Legal Documents

A typical information extraction pipeline consists of token- or span-lev...
research
10/04/2018

Zooming Network

Structural information is important in natural language understanding. A...
research
04/25/2018

Hierarchical RNN for Information Extraction from Lawsuit Documents

Every lawsuit document contains the information about the party's claim,...
research
04/08/2022

Enhance Incomplete Utterance Restoration by Joint Learning Token Extraction and Text Generation

This paper introduces a model for incomplete utterance restoration (IUR)...
