
CUTIE: Learning to Understand Documents with Convolutional Universal Text Information Extractor

03/29/2019
by Xiaohui Zhao, et al.
Accenture
Microsoft

Extracting key information from documents such as receipts or invoices, and storing the texts of interest as structured data, is crucial in the document-intensive workflows of office automation, in areas that include but are not limited to accounting, finance, and taxation. To avoid designing expert rules for each specific type of document, some published works tackle the problem by learning a model that explores the semantic context in text sequences, based on the Named Entity Recognition (NER) method from the NLP field. In this paper, we propose to harness the effective information from both the semantic meaning and the spatial distribution of texts in documents. Specifically, our proposed model, Convolutional Universal Text Information Extractor (CUTIE), applies convolutional neural networks to gridded texts, where texts are embedded as features with semantic connotations. We further explore the effect of employing different convolutional neural network structures and propose a fast and portable structure. We demonstrate the effectiveness of the proposed method on a dataset with up to 6,980 labelled receipts, without any pre-training or post-processing, achieving state-of-the-art performance that is much higher than that of BERT, with only 1/10 of the parameters and without requiring the 3,300M-word dataset for pre-training. Experimental results also demonstrate that CUTIE is able to achieve state-of-the-art performance with a much smaller amount of training data.
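
The abstract does not spell out how the "gridded texts" are built, so the following is only a minimal sketch of one plausible reading: each OCR token is placed in a grid cell according to the centre of its bounding box, so that the relative layout of the texts is preserved before embedding and convolution. The names `Token` and `build_text_grid`, the bounding-box format, and the rule for resolving collisions (shift right to the next free cell) are all assumptions made for illustration, not details taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """A hypothetical OCR token: text plus an axis-aligned bounding box."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def build_text_grid(tokens, page_width, page_height, rows, cols, empty=""):
    """Place each token in the grid cell that contains its box centre.

    Assumption: when the target cell is taken, the token moves right to the
    next free cell in the same row; if the row has no free cell to the right,
    the token is dropped.
    """
    grid = [[empty] * cols for _ in range(rows)]
    for tok in tokens:
        cx = (tok.x0 + tok.x1) / 2
        cy = (tok.y0 + tok.y1) / 2
        # Scale page coordinates to grid indices, clamping to the last cell.
        r = min(int(cy / page_height * rows), rows - 1)
        c = min(int(cx / page_width * cols), cols - 1)
        while c < cols and grid[r][c] != empty:
            c += 1
        if c < cols:
            grid[r][c] = tok.text
    return grid


# Toy receipt on a 100 x 100 page, mapped to a 4 x 4 grid.
tokens = [
    Token("ACME", 30, 5, 70, 15),
    Token("TOTAL", 10, 80, 40, 90),
    Token("12.50", 70, 80, 95, 90),
]
grid = build_text_grid(tokens, page_width=100, page_height=100, rows=4, cols=4)
for row in grid:
    print(row)
```

In a full pipeline, each non-empty cell would then be mapped to a learned embedding vector, giving a rows x cols x embedding-size tensor that a convolutional network can label cell by cell. That step is omitted here.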



Code Repositories

OCR-Extract-total-amount-TTC-of-receipts


InvoiceExtraction

Code to extract SROIE invoice field data and custom dataset fields.

OCR_Receipts_End_To_End_CUTIE_B
