Sequence-to-Sequence Models for Extracting Information from Registration and Legal Documents

01/14/2022
by Ramon Pires, et al.

A typical information extraction pipeline consists of token- or span-level classification models coupled with a series of pre- and post-processing scripts. In a production pipeline, requirements often change: classes are added and removed, which forces nontrivial modifications to the source code and can introduce bugs. In this work, we evaluate sequence-to-sequence models as an alternative to token-level classification methods for information extraction from legal and registration documents. We finetune models that jointly extract the information and generate the output directly in a structured format. Post-processing steps are learned during training, which eliminates the need for rule-based methods and simplifies the pipeline. Furthermore, we propose a novel method to align the output with the input text, which facilitates system inspection and auditing. Our experiments on four real-world datasets show that the proposed method is an alternative to classical pipelines.
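To make the idea of structured generation plus output-to-input alignment concrete, here is a minimal sketch in Python. It is not the authors' method, which the abstract does not describe: the `[field]: value` output format, the function names `parse_structured_output` and `align_value`, and the sample text are all assumptions made for illustration. The alignment shown is a plain substring search with a longest-common-block fallback from the standard library's `difflib`.

```python
import difflib
import re
from typing import Dict, Optional, Tuple


def parse_structured_output(output: str) -> Dict[str, str]:
    """Parse lines of the hypothetical form '[field]: value' into a dict."""
    fields: Dict[str, str] = {}
    for line in output.splitlines():
        match = re.match(r"\[(?P<name>[^\]]+)\]:\s*(?P<value>.+)", line.strip())
        if match:
            fields[match.group("name")] = match.group("value").strip()
    return fields


def align_value(value: str, source: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) character offsets in `source` that best match `value`.

    Exact substring match first; otherwise fall back to the longest block
    shared by both strings. Returns None when nothing overlaps.
    """
    start = source.find(value)
    if start != -1:
        return start, start + len(value)
    matcher = difflib.SequenceMatcher(None, source, value, autojunk=False)
    block = matcher.find_longest_match(0, len(source), 0, len(value))
    if block.size == 0:
        return None
    return block.a, block.a + block.size


# Hypothetical input document and hypothetical model output.
source = "Company ACME Ltda, registered on 14 January 2022 in Sao Paulo."
model_output = "[company]: ACME Ltda\n[date]: 14 January 2022"

fields = parse_structured_output(model_output)
spans = {name: align_value(value, source) for name, value in fields.items()}

for name, span in spans.items():
    if span is not None:
        begin, end = span
        print(f"{name}: {source[begin:end]!r} at characters {begin}-{end}")
```

Mapping each generated value back to character offsets in the input is what lets a reviewer check an extracted field against the passage it came from. A production system would likely need a more robust matcher, since a generative model can normalize or rephrase values rather than copy them verbatim.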


research
07/29/2022

Thutmose Tagger: Single-pass neural model for Inverse Text Normalization

Inverse text normalization (ITN) is an essential post-processing step in...
research
10/28/2020

CopyNext: Explicit Span Copying and Alignment in Sequence to Sequence Models

Copy mechanisms are employed in sequence to sequence models (seq2seq) to...
research
05/16/2021

Doc2Dict: Information Extraction as Text Generation

Typically, information extraction (IE) requires a pipeline approach: fir...
research
09/25/2020

Persian Keyphrase Generation Using Sequence-to-Sequence Models

Keyphrases are a very short summary of an input text and provide the mai...
research
06/04/2021

Bi-Granularity Contrastive Learning for Post-Training in Few-Shot Scene

The major paradigm of applying a pre-trained language model to downstrea...
research
06/29/2023

The mapKurator System: A Complete Pipeline for Extracting and Linking Text from Historical Maps

Scanned historical maps in libraries and archives are valuable repositor...
research
11/07/2021

Information Extraction from Visually Rich Documents with Font Style Embeddings

Information extraction (IE) from documents is an intensive area of resea...
