CUTIE: Learning to Understand Documents with Convolutional Universal Text Information Extractor

03/29/2019
by Xiaohui Zhao, et al.

Extracting key information from documents such as receipts or invoices, and preserving the texts of interest as structured data, is crucial in the document-intensive workflows of office automation in areas that include, but are not limited to, accounting, finance, and taxation. To avoid designing expert rules for each specific type of document, some published works tackle the problem by learning a model that explores the semantic context in text sequences, based on the Named Entity Recognition (NER) method from the NLP field. In this paper, we propose to harness information from both the semantic meaning and the spatial distribution of texts in documents. Specifically, our proposed model, Convolutional Universal Text Information Extractor (CUTIE), applies convolutional neural networks to gridded texts, where texts are embedded as features with semantic connotations. We further explore the effect of employing different convolutional neural network structures and propose a fast and portable structure. We demonstrate the effectiveness of the proposed method on a dataset with up to 6,980 labelled receipts, without any pre-training or post-processing, achieving state-of-the-art performance that is much higher than BERT while using only 1/10 of the parameters and without requiring the 3,300M-word dataset for pre-training. Experimental results also demonstrate that CUTIE is able to achieve state-of-the-art performance with a much smaller amount of training data.
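The "gridded texts" idea can be illustrated with a minimal sketch. The abstract does not spell out how texts are mapped onto the grid, so everything below is an assumption made for illustration: the function name `build_text_grid`, the `(text, x_center, y_center)` token format, and the collision rule (shift right to the next free cell in the same row) are hypothetical, not the authors' method. The sketch only shows how the relative spatial layout of tokens can be preserved in a 2D array; embedding each cell and feeding the grid to a convolutional network would be the next step.

```python
from typing import List, Optional, Tuple

# Hypothetical token format: (text, x_center, y_center) in page coordinates.
Token = Tuple[str, float, float]


def build_text_grid(
    tokens: List[Token],
    page_w: float,
    page_h: float,
    rows: int,
    cols: int,
) -> List[List[Optional[str]]]:
    """Place each token in a rows x cols grid according to its position on the page.

    Assumed collision rule (not taken from the paper): when the target cell is
    already occupied, the token moves right to the next free cell in that row.
    """
    grid: List[List[Optional[str]]] = [[None] * cols for _ in range(rows)]

    # Process tokens top to bottom, then left to right, so that collisions
    # are resolved in reading order.
    for text, x, y in sorted(tokens, key=lambda t: (t[2], t[1])):
        r = min(int(y / page_h * rows), rows - 1)
        c = min(int(x / page_w * cols), cols - 1)
        while c < cols and grid[r][c] is not None:
            c += 1
        if c == cols:
            raise ValueError("row is full; use a larger grid")
        grid[r][c] = text
    return grid


# A toy receipt: a shop name near the top, a total line near the bottom.
receipt = [("TOTAL", 10.0, 90.0), ("12.50", 80.0, 90.0), ("ACME", 50.0, 5.0)]
grid = build_text_grid(receipt, page_w=100.0, page_h=100.0, rows=4, cols=4)
for row in grid:
    print(row)
```

In this toy layout, "TOTAL" and "12.50" land in the same grid row, which is the kind of spatial relationship a sequence-only NER model would not see directly.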


research
05/06/2023

NER-to-MRC: Named-Entity Recognition Completely Solving as Machine Reading Comprehension

Named-entity recognition (NER) detects texts with predefined semantic la...
research
11/13/2018

Few-shot Learning for Named Entity Recognition in Medical Text

Deep neural network models have recently achieved state-of-the-art perfo...
research
12/08/2021

Transformer-Based Approach for Joint Handwriting and Named Entity Recognition in Historical documents

The extraction of relevant information carried out by named entities in ...
research
05/18/2023

Learning In-context Learning for Named Entity Recognition

Named entity recognition in real-world applications suffers from the div...
research
06/02/2021

Exploiting Global Contextual Information for Document-level Named Entity Recognition

Most existing named entity recognition (NER) approaches are based on seq...
research
06/30/2023

Information Extraction in Domain and Generic Documents: Findings from Heuristic-based and Data-driven Approaches

Information extraction (IE) plays very important role in natural languag...
research
11/27/2022

Topic Segmentation in the Wild: Towards Segmentation of Semi-structured Unstructured Chats

Breaking down a document or a conversation into multiple contiguous segm...
