DeepAI
Log In Sign Up

Doc2Dict: Information Extraction as Text Generation

05/16/2021
by   Benjamin Townsend, et al.
0

Typically, information extraction (IE) requires a pipeline approach: first, a sequence labeling model is trained on manually annotated documents to extract relevant spans; then, when a new document arrives, a model predicts spans which are then post-processed and standardized to convert the information into a database entry. We replace this labor-intensive workflow with a transformer language model trained on existing database records to directly generate structured JSON. Our solution removes the workload associated with producing token-level annotations and takes advantage of a data source which is generally quite plentiful (e.g. database records). As long documents are common in information extraction tasks, we use gradient checkpointing and chunked encoding to apply our method to sequences of up to 32,000 tokens on a single GPU. Our Doc2Dict approach is competitive with more complex, hand-engineered pipelines and offers a simple but effective baseline for document-level information extraction. We release our Doc2Dict model and code to reproduce our experiments and facilitate future work.

READ FULL TEXT
05/26/2022

Jointly Learning Span Extraction and Sequence Labeling for Information Extraction from Business Documents

This paper introduces a new information extraction model for business do...
05/28/2021

ByT5: Towards a token-free future with pre-trained byte-to-byte models

Most widely-used pre-trained language models operate on sequences of tok...
10/29/2019

Big Bidirectional Insertion Representations for Documents

The Insertion Transformer is well suited for long form text generation d...
01/14/2022

Sequence-to-Sequence Models for Extracting Information from Registration and Legal Documents

A typical information extraction pipeline consists of token- or span-lev...
10/04/2018

Zooming Network

Structural information is important in natural language understanding. A...
04/25/2018

Hierarchical RNN for Information Extraction from Lawsuit Documents

Every lawsuit document contains the information about the party's claim,...
04/16/2021

Cost-effective End-to-end Information Extraction for Semi-structured Document Images

A real-world information extraction (IE) system for semi-structured docu...