TRIE: End-to-End Text Reading and Information Extraction for Document Understanding

by   Peng Zhang, et al.

Since real-world ubiquitous documents (e.g., invoices, tickets, resumes and leaflets) contain rich information, automatic document image understanding has become a hot topic. Most existing works decouple the problem into two separate tasks: (1) text reading, for detecting and recognizing texts in the images, and (2) information extraction, for analyzing and extracting key elements from the previously extracted plain text. However, they mainly focus on improving the information extraction task, while neglecting the fact that text reading and information extraction are mutually correlated. In this paper, we propose a unified end-to-end text reading and information extraction network, where the two tasks can reinforce each other. Specifically, the multimodal visual and textual features of text reading are fused for information extraction and, in turn, the semantics in information extraction contribute to the optimization of text reading. On three real-world datasets with diverse document images (from fixed layout to variable layout, from structured text to semi-structured text), our proposed method significantly outperforms the state-of-the-art methods in both efficiency and accuracy.




1. Introduction

Document understanding is a relatively traditional research topic that refers to techniques for automatically processing text content. Among different types of documents, Visually Rich Documents (VRDs) have attracted more and more attention. They are named for their combination of textual and visual modalities, which offers abundant information, including but not limited to layout, tabular structure, font size and even font color in addition to plain text. We divide VRDs into four categories along the dimensions of layout and text type. Layout here is defined as the relative positions of texts, and text type (i.e., structured and semi-structured) follows the conventions of (Judd et al., 2004; Soderland, 1999). In detail, structured data is organized in a predetermined schema, e.g., invoices and receipts, while semi-structured data has one or more identifiers, but each portion of data is not necessarily organized in predefined fields, e.g., resumes. Table 1 summarizes common examples of VRDs and some examples are shown in Fig. 1. Automatically recognizing texts and extracting valuable contents from VRDs can facilitate information entry, retrieval and compliance checks, and is of great benefit to office automation in areas such as accounting and finance, as well as in much broader real-world applications.

The complex problem of understanding VRDs is typically decoupled into two stages, text reading and information extraction. Specifically, text reading includes text detection and recognition in images, which belongs to the optical character recognition (OCR) research field and has already been widely used in many Computer Vision (CV) applications (Wang et al., 2020; Qiao et al., 2020; Feng et al., 2019). In the information extraction (IE) stage, specific contents (e.g., entity, relation) are mined and processed from previously recognized plain text for diverse Natural Language Processing (NLP) tasks, such as Named Entity Recognition (NER) (Shaalan, 2014; Lample et al., 2016; Ma and Hovy, 2016) and Question-Answer (QA) (Yang et al., 2016; Anderson et al., 2018; Fukui et al., 2016). Note that in existing routines, the two stages are executed separately. That is, the former stage recognizes text from images without the semantic supervision (e.g., entity name annotation) of the IE stage, and the latter extracts information only from serialized plain text, without using the rich visual information.

Under the traditional routines, most existing methods (Katti et al., 2018; Denk and Reisswig, 2019; Zhao et al., 2019; Palm et al., 2017; Liu et al., 2019; Xu et al., 2019; Sage et al., 2019; Palm et al., 2019) design frameworks in multiple stages of text reading (usually including detection and recognition) and information extraction independently in order. In earlier explorations, the task is intrinsically downgraded into a traditional OCR procedure and the downstream IE from serialized plain text (Palm et al., 2017; Sage et al., 2019), completely discarding the visual features and layout information from images, as shown in Fig. 2(a). Noticing the rich visual information contained in document images, recent works incorporate them into network design. (Katti et al., 2018; Denk and Reisswig, 2019; Zhao et al., 2019; Palm et al., 2019; Liu et al., 2019) work on recognized texts and their positions (see Fig. 2(b)), while (Xu et al., 2019) further includes image embeddings (see Fig. 2(c)). They all focus only on the design of information extraction task in VRD understanding.

In essence, all the above works inevitably have the following three limitations. (1) VRD understanding requires both visual and textual features, but the visual features they exploited are limited. (2) Text reading and information extraction are highly correlated, but the relations between them have rarely been explored. (3) The stagewise training strategy of text reading and information extraction brings redundant computation and time cost.

Figure 2. Comparison of VRD understanding architectures: (a) IE with plain text, (b) IE with text and position, (c) IE with position and visual features in addition to plain text, (d) Our proposed TRIE network. The black and red arrows mean the forward and backward processing, respectively. Best viewed in color.

To address those limitations, in this paper, we propose a novel end-to-end Text Reading and Information Extraction (TRIE) network. The workflow is as shown in Fig. 2(d). Instead of just focusing on information extraction, we bridge text reading and information extraction tasks with shared features, including position features, visual features and textual features. Thus, these two tasks can reinforce each other amidst a unified framework. Specifically, in the forward pass, multimodal visual and textual features produced by text reading are fully fused for information extraction, while in the backward pass, the semantics in information extraction also contribute to the optimization of text reading. Since all the modules in the network are differentiable, the whole network can be trained end-to-end. To the best of our knowledge, this is the first end-to-end trainable document understanding framework, which can handle various types of VRDs, from fixed to variable layouts and from structured to semi-structured text type.

Major contributions are summarized as follows:

(1) We propose an end-to-end trainable framework for simultaneous text reading and information extraction in VRD understanding. The whole framework can be trained end-to-end from scratch, with no need of stagewise training strategies.

(2) We design a multimodal context block to bridge the OCR and IE modules. To the best of our knowledge, it is the first work to mine the mutual influence of text reading and information extraction.

(3) We perform extensive evaluations on our framework and show superior performance compared with the state-of-the-art counterparts both in efficiency and accuracy on three real-world benchmarks. Note that those three benchmarks cover diverse types of document images, from fixed to variable layouts, from structured to semi-structured text types.

2. Related Works

Here, we briefly review the recent advances in text reading and information extraction.

2.1. Text Reading

Text reading is formally divided into two sub-tasks: text detection and text recognition. In text detection, methods are usually divided into two categories: anchor-based methods and segmentation-based methods. Anchor-based methods (He et al., 2017b; Liao et al., 2017; Liu and Jin, 2017; Shi et al., ) predict the existence of texts and regress their location offsets at pre-defined grid points of the input image, while segmentation-based methods (Zhou et al., 2017; Long et al., 2018; Wang et al., 2019) learn pixel-level classification tasks to separate text regions from the background. In text recognition, the mainstream CRNN framework was introduced by (Shi et al., 2017), using recurrent neural networks (RNNs) (Hochreiter and Schmidhuber, 1997; Chung et al., 2014) combined with CNN-based methods for better sequential recognition of text lines. Then, the attention mechanism replaced the existing CTC decoder (Shi et al., 2017) and was applied to a stacked RNN on top of the recursive CNN (Lee and Osindero, 2016), whose performance surpassed the state of the art among diverse variations.

To sufficiently exploit the complementarity between detection and recognition, (Liu et al., 2018; Li et al., 2017; He et al., 2018; Busta et al., 2017; Wang et al., 2020; Qiao et al., 2020; Feng et al., 2019) were proposed to jointly detect and recognize text instances in an end-to-end manner. They all achieved impressive results compared to traditional pipeline approaches by capturing the relations between the detection and recognition sub-tasks through joint training. This inspires us to look at the broader field and, for the first time, pay attention to the relations between text reading and information extraction.

2.2. Information Extraction

Information extraction has been studied for decades. Before the advent of learning-based models, rule-based methods (Riloff, 1993; Huffman, 1995; Muslea and others, 1999; Dengel and Klein, 2002; Schuster et al., 2013; Esser et al., ) were proposed. Pattern matching was widely used in (Riloff, 1993; Huffman, 1995) to extract one or multiple target values. Afterwards, Intellix (Schuster et al., 2013) required predefined templates with relevant fields annotated, and SmartFix (Dengel and Klein, 2002) employed specifically designed configuration rules for each template. Though rule-based methods work in some cases, they rely heavily on predefined rules, whose design and maintenance usually require deep expertise and considerable time. Besides, they cannot generalize across document templates.

Learning-based networks (Lample et al., 2016; Ma and Hovy, 2016; Yadav and Bethard, 2018; Devlin et al., 2019; Dai et al., 2019; Yang et al., 2019) were first proposed to work on plain text documents. (Palm et al., 2017; Sage et al., 2019) adopted the idea from natural language processing and used recurrent neural networks to extract entities of interest from VRDs. However, they discard the layout information during text serialization, which is crucial for document understanding. Recently, observing the rich visual information contained in document images, works have tended to incorporate more details from VRDs. Some works (Katti et al., 2018; Denk and Reisswig, 2019; Zhao et al., 2019; Palm et al., 2019) took the layout into consideration and worked on the reconstructed character or word segmentation of the document. Position information has also been utilized. (Liu et al., 2019) combined texts and their positions through a Graph Convolutional Network (GCN), and (Xu et al., 2019) further integrated position and image embeddings for pre-training inspired by BERT (Devlin et al., 2019). However, these works all focus on the design of the information extraction task. They miss many informative details since the multi-modality of the inputs is not fully explored.

Two related concurrent works were presented in (Guo et al., 2019; Carbonell et al., 2019). (Guo et al., 2019) proposed an entity-aware attention text extraction network to extract entities from VRDs. However, it can only process documents of relatively fixed layout and structured text, like train tickets, passports and business cards. (Carbonell et al., 2019) localizes, recognizes and classifies each word in the document. Since it works at the word granularity, it not only requires much more labeling effort (positions, content and category of each word) but also has difficulty extracting entities that are embedded in word texts (e.g. extracting ‘’ from ‘’). Besides, in its entity recognition branch, it still works on the serialized word features, which are sorted and packed in left-to-right and top-to-bottom order. Thus, it can only handle documents of simple layout and non-structured text, like paragraph pages. Both of these existing works are strictly limited to documents of relatively fixed layout and one type of text (structured or semi-structured). In contrast, our proposed framework can handle documents of both fixed and variable layouts, with structured and semi-structured text types.

3. Methodology

We first introduce the overall architecture of the proposed TRIE in Sec 3.1. In the following three sections, the text reading module, multimodal context block and information extraction module are described in detail. Optimization functions are given in Sec 3.5.

3.1. Overall Architecture

Figure 3. Overall architecture. The network predicts text regions and text content, and extracts entities of interest, in a single forward pass.

An overview of the architecture is shown in Fig. 3. It mainly consists of three parts: text reading module, multimodal context block and information extraction module. Text reading module is responsible for localizing and recognizing all the texts in the document image and information extraction module is to extract entities of interest from them. The multimodal context block is novelly designed to bridge the text reading and information extraction modules.

Specifically, in text reading, the network takes the original image as input and outputs text region coordinate positions. Once the positions are obtained, we apply RoIAlign on the shared convolutional features to get text region features. Then an attention-based sequence recognizer is adopted to get textual features for each text. Since the context information of a text is necessary to tell it apart from other entities, we design a special multimodal context block to provide both visual and textual context. In the block, the visual context features summarize the local patterns of a text, while the textual context features model implicit relations across the whole document image. These two context features are complementary to each other and are fused as the final context features. Finally, entities are extracted through the combined context and textual features. The whole framework learns to read text and extract entities jointly and can be trained end-to-end from scratch, which allows bi-directional information flow and brings reinforcing effects between the text reading and information extraction modules.

3.2. Text Reading Module

The text reading module commonly includes a shared backbone, a text detection branch and a sequential recognition branch. We adopt ResNet (He et al., 2016) and Feature Pyramid Network (FPN) (Lin et al., 2017) as our backbone to extract shared convolutional features. For an input image $x$, we denote $I \in \mathbb{R}^{H \times W \times D}$ as the shared feature maps, where $H$, $W$ and $D$ are the height, width and channel number of $I$. The text detection branch takes $I$ as input and predicts the locations of all possible text regions,

$$B = \mathrm{Detector}(I), \qquad (1)$$

where the Detector can be any anchor-based (He et al., 2017b; Liao et al., 2017; Liu and Jin, 2017; Shi et al., ) or segmentation-based (Zhou et al., 2017; Long et al., 2018; Wang et al., 2019) text detection method. Here, $B = (b_1, b_2, \dots, b_m)$ is a set of $m$ text bounding boxes and $b_i = (x_{i0}, y_{i0}, x_{i1}, y_{i1})$ denotes the top-left and bottom-right positions of the $i$-th text. Given text positions $B$, RoIAlign (He et al., 2017a) is applied on the shared convolutional features $I$ to get their text region features, denoted as $C = (c_1, c_2, \dots, c_m)$ with $c_i \in \mathbb{R}^{H' \times W' \times D}$, where $H'$ and $W'$ are the spatial dimensions, and $D$ is the vector dimension, the same as in $I$.


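Afterwards, the text recognition branch predicts a character sequence from each text region feature $c_i$. Firstly, $c_i$ is fed into an encoder to extract a higher-level feature sequence $H = (h_1, h_2, \dots, h_l) \in \mathbb{R}^{l \times D}$, where $l$ is the length of the sequence and $D$ is the vector dimension. Then, an attention-based decoder is adopted to recurrently generate the sequence of characters $y = (y_1, y_2, \dots, y_T)$, where $T$ is the length of the label sequence. Specifically, at step $t$, the attention weights $\alpha_t$ and glimpse vector $g_t$ are computed as follows,

$$e_{t,j} = w^\top \tanh(W_s s_{t-1} + W_h h_j + b),$$

$$\alpha_{t,j} = \frac{\exp(e_{t,j})}{\sum_{k=1}^{l} \exp(e_{t,k})},$$

$$g_t = \sum_{j=1}^{l} \alpha_{t,j} h_j,$$

and $w$, $W_s$, $W_h$ and $b$ are trainable weights.

The state $s_{t-1}$ is updated via,

$$s_t = \mathrm{RNN}(s_{t-1}, g_t, y_{t-1}),$$

where $y_{t-1}$ is the $(t-1)$-th ground-truth label in training, while in testing, it is the label predicted in the previous step. In our experiment, LSTM (Hochreiter and Schmidhuber, 1997) is adopted as the RNN unit. Finally, the probability distribution over the vocabulary label space is estimated by,

$$p(y_t) = \mathrm{softmax}(W_o s_t + b_o),$$

where $W_o$ and $b_o$ are learnable weights. We denote $Z_i = (s_1, s_2, \dots, s_T)$ as the textual features of the $i$-th text, as they contain semantic features for each character in it.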

In a nutshell, the text reading module outputs visual features $C$ and textual features $Z$ of the texts in the image, in addition to their positions $B$.
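
To make the attention step of the recognition branch more concrete, the following is a minimal sketch of one decoding step in plain Python. It assumes the common additive (tanh) attention form; the function name `attention_step` and the argument names are hypothetical and chosen only for illustration, and a real implementation would use batched tensor operations.

```python
import math


def attention_step(prev_state, encoder_seq, W_s, W_h, w, b):
    """Compute attention weights and the glimpse vector for one decoding step.

    prev_state:  previous decoder state (list of floats)
    encoder_seq: encoder feature sequence, one vector per position
    W_s, W_h:    projection matrices (lists of rows); w, b: vectors
    """
    def matvec(matrix, vector):
        return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

    projected_state = matvec(W_s, prev_state)
    scores = []
    for h in encoder_seq:
        projected_h = matvec(W_h, h)
        hidden = [math.tanh(p + q + r)
                  for p, q, r in zip(projected_state, projected_h, b)]
        scores.append(sum(wi * hi for wi, hi in zip(w, hidden)))

    # Numerically stable softmax over the sequence positions.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]

    # Glimpse vector: attention-weighted sum of the encoder features.
    dim = len(encoder_seq[0])
    glimpse = [sum(a * h[k] for a, h in zip(alphas, encoder_seq))
               for k in range(dim)]
    return alphas, glimpse
```

The glimpse vector would then be fed, together with the previous label, into the recurrent unit to produce the next state.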

3.3. Multimodal Context Block

The context of a text provides necessary information to tell it apart from other entities, which is crucial for information extraction. Unlike most existing works, which rely only on textual features and/or position features, we design a multimodal context block that considers position features, visual features and textual features all together. This block provides both the visual context and the textual context of a text, which are complementary to each other and are further fused in the information extraction module.

3.3.1. Visual Context.

As mentioned, visual details such as color, font, layout and other informative features are as important as textual details (text content) for document understanding. A natural way of capturing the local visual context of a text is to resort to a convolutional neural network. Different from (Xu et al., 2019), which extracts these features from scratch, we directly reuse $C$ produced by the text reading module. Thanks to the deep backbone and lateral connections introduced by FPN, each $c_i$ summarizes the rich local visual patterns of the $i$-th text.

3.3.2. Textual Context.

Unlike visual context which focuses on local visual patterns, textual context models the fine-grained long distance dependencies and relationships between texts, providing complementary context information. Inspired by (Devlin et al., 2019), we apply the self-attention mechanism to extract textual context features, supporting variable number of texts.

Self-attention Recap. The input of the popular scaled dot-product attention consists of queries $Q$ and keys $K$ of dimension $d_k$, and values $V$ of dimension $d_v$. The output is obtained by a weighted summation over all values, and the attention weights are learned from $Q$ and $K$. Please refer to (Devlin et al., 2019) for details.
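
For readers less familiar with the mechanism, here is a small stdlib-only sketch of scaled dot-product attention. The function name is hypothetical and the nested-list representation is chosen for readability, not efficiency.

```python
import math


def scaled_dot_product_attention(queries, keys, values):
    """Return one output vector per query.

    Each output is a weighted sum of `values`, where the weights are the
    softmax of the query-key dot products scaled by sqrt(key dimension).
    """
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        max_score = max(scores)
        exps = [math.exp(s - max_score) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs
```

In self-attention, the same packed sequence is passed as `queries`, `keys` and `values`, so every element can attend to every other element regardless of how many elements there are.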

Textual Context Modeling. To retain the document layout and content as much as possible, we make use of positions and textual features at the same time. We first introduce position embeddings to preserve the layout information. Then, a convolutional network with multiple kernels, similar to (Zhang et al., 2015), is used to aggregate the semantic character features in $Z_i$ and outputs $X_i$,

$$X_i = \mathrm{CNN}(Z_i).$$

Finally, the $i$-th text’s embedding $\mathrm{emb}_i$ is fused from $X_i$ and the position embedding, followed by normalization, defined as

$$\mathrm{emb}_i = \mathrm{Norm}(\mathrm{FC}(X_i) + \mathrm{PosEmb}(b_i)),$$

where $\mathrm{FC}$ is a fully-connected layer, projecting $X_i$ to the same dimension as $\mathrm{PosEmb}(b_i)$. $b_i$ is the bounding box coordinates of the $i$-th text (computed by Equa. 1) and $\mathrm{PosEmb}$ is the position embedding layer. Afterwards, we pack all the texts’ embedding vectors together, which serve as the $K$, $Q$ and $V$ in the scaled dot-product attention.
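
As a rough illustration of the fusion just described, the sketch below adds a projected text feature to a position embedding and normalizes the sum. Several details are assumptions made only for this example: layer normalization is used as the normalization, a plain linear map stands in for the learned position embedding layer, and all names (`text_embedding`, `W_text`, `W_pos`) are hypothetical.

```python
import math


def matvec(matrix, vector):
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def layer_norm(vector, eps=1e-5):
    # Normalize to zero mean and (approximately) unit variance.
    mean = sum(vector) / len(vector)
    var = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(var + eps) for x in vector]


def text_embedding(text_feature, box, W_text, W_pos):
    """Fuse an aggregated text feature with the embedding of its box.

    text_feature: aggregated character features of one text
    box:          (x0, y0, x1, y1) bounding-box coordinates
    W_text:       projection to the embedding dimension
    W_pos:        linear stand-in for the position embedding layer (assumption)
    """
    projected = matvec(W_text, text_feature)
    position = matvec(W_pos, box)
    return layer_norm([p + q for p, q in zip(projected, position)])
```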

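The textual context features $\widetilde{C}$ are obtained by,

$$\widetilde{C} = \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right)V,$$

where $d$ is the dimension of text embeddings, and $\sqrt{d}$ is the scaling factor. To further improve the representation capacity of the attended feature, multi-head attention is introduced. Each head corresponds to an independent scaled dot-product attention function and the textual context features $\widetilde{C}$ are given by:

$$\widetilde{C} = \mathrm{Concat}(head_1, \dots, head_n)\,W^O, \qquad head_j = \mathrm{Attention}(QW_j^Q, KW_j^K, VW_j^V),$$

where $W_j^Q$, $W_j^K$ and $W_j^V$ are the learned projection matrices for the $j$-th head, $n$ is the number of heads, and $W^O$ is the output projection. To prevent the multi-head attention model from becoming too large, we usually have

$$d_k = d_v = d / n.$$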

3.4. Information Extraction Module

Both the context and textual features matter in entity extraction. The context features (including both visual context features $C$ and textual context features $\widetilde{C}$) provide necessary information to tell entities apart, while the textual features $Z$ enable entity extraction at the character granularity, as they contain semantic features for each character in the text. So we first perform multimodal fusion of the visual context features $C$ and textual context features $\widetilde{C}$, which are further combined with the textual features $Z$ to extract entities.

3.4.1. Adaptive Multi-modality Context Fusion

Given the visual context features $C$ and textual context features $\widetilde{C}$, we fuse them adaptively to get the final context of each text. Specifically, for the $i$-th text, $c_i$ is spatially averaged to output the visual context vector $\bar{c}_i$. The final context vector $\hat{c}_i$ is a weighted sum of the visual context vector and the textual context vector $\widetilde{c}_i$,

$$\hat{c}_i = \alpha\,\mathrm{FC}(\bar{c}_i) + \beta\,\widetilde{c}_i,$$

where $\alpha$ and $\beta$ are automatically learnable weights. $\mathrm{FC}$ is a fully-connected layer, projecting the visual context vector and textual context vector to the same dimension.
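
The fusion step can be sketched as follows. This is only an illustrative reading of the description above: the visual feature map of a text is averaged over its spatial positions, projected by a linear layer, and combined with the textual context vector through two scalar weights. The function names and the choice of scalar weights are assumptions for the example, and in training the weights would be learned rather than passed in.

```python
def spatial_average(feature_map):
    """Average an H' x W' x D feature map (nested lists) into a D-vector."""
    cells = [cell for row in feature_map for cell in row]
    dim = len(cells[0])
    return [sum(cell[k] for cell in cells) / len(cells) for k in range(dim)]


def linear(weight, bias, vector):
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weight, bias)]


def fuse_context(visual_map, textual_context, weight, bias, alpha, beta):
    """Weighted sum of the projected visual vector and the textual context."""
    visual_vector = linear(weight, bias, spatial_average(visual_map))
    return [alpha * v + beta * t
            for v, t in zip(visual_vector, textual_context)]
```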

3.4.2. Entity Extraction

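We further combine the context vector $\hat{c}_i$ and textual features $Z_i$ by concatenating $\hat{c}_i$ to each state vector $s_t$ in $Z_i$,

$$\hat{s}_t = [s_t ; \hat{c}_i], \qquad t = 1, \dots, T.$$

Then, a Bidirectional-LSTM is applied to further model the dependencies within the characters,

$$(h'_1, \dots, h'_T) = \mathrm{BiLSTM}(\hat{s}_1, \dots, \hat{s}_T),$$

which is followed by a fully connected network, projecting the output to the dimension of the IOB (Sang and Veenstra, 1999) label space,

$$p^{info}_t = \mathrm{softmax}(\mathrm{FC}(h'_t)).$$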
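
Because labels are predicted per character in IOB format, entities can be read off even when they are embedded inside a longer text. The helper below is a hypothetical post-processing sketch, not part of the described network: it groups characters tagged `B-<type>` and `I-<type>` into entity strings and, as a simplifying assumption, ignores an `I-` tag that does not continue an open entity of the same type.

```python
def decode_iob(chars, tags):
    """Group per-character IOB tags into (entity_type, text) pairs."""
    entities = []
    current_type, current_chars = None, []

    def flush():
        if current_type is not None:
            entities.append((current_type, "".join(current_chars)))

    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            flush()
            current_type, current_chars = tag[2:], [ch]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_chars.append(ch)
        else:
            # "O" or an inconsistent "I-" tag closes any open entity.
            flush()
            current_type, current_chars = None, []
    flush()
    return entities
```

For example, a text reading `Date:2017-12-21` with the first five characters tagged `O` and the rest tagged as a `Date` entity yields only the date string.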


3.5. Optimization

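The proposed network can be trained in an end-to-end manner and the losses are generated from three parts,

$$L = L_{det} + \lambda_{recog} L_{recog} + \lambda_{info} L_{info}, \qquad (16)$$

where hyper-parameters $\lambda_{recog}$ and $\lambda_{info}$ control the trade-off between losses.

$L_{det}$ is the loss of the text detection branch, which consists of a classification loss and a regression loss, as defined in (Ren et al., 2015). The recognition loss $L_{recog}$ and information extraction loss $L_{info}$ for each image are formulated as,

$$L_{recog} = -\frac{1}{T}\sum_{i=1}^{m}\sum_{t=1}^{T}\log p(\hat{y}_{i,t}), \qquad L_{info} = -\frac{1}{T}\sum_{i=1}^{m}\sum_{t=1}^{T}\log p(\hat{y}'_{i,t}),$$

where $\hat{y}_{i,t}$ is the ground-truth label of the $t$-th character in the $i$-th text from the recognition branch and $\hat{y}'_{i,t}$ is its corresponding label for information extraction.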
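
The way the three losses combine can be sketched in a few lines. The example below assumes a per-character negative log-likelihood averaged over each sequence and summed over the texts of an image; that normalization and the function names are assumptions for illustration, and the detection loss is simply passed in as a number.

```python
import math


def sequence_nll(prob_rows, labels):
    """Average negative log-likelihood of the true labels of one text.

    prob_rows: one probability distribution (list) per character
    labels:    index of the ground-truth class for each character
    """
    return -sum(math.log(p[y]) for p, y in zip(prob_rows, labels)) / len(labels)


def image_nll(texts):
    """Sum the per-text losses over all (prob_rows, labels) pairs of an image."""
    return sum(sequence_nll(probs, labels) for probs, labels in texts)


def total_loss(det_loss, recog_loss, info_loss,
               lambda_recog=1.0, lambda_info=1.0):
    """Weighted sum of detection, recognition and extraction losses."""
    return det_loss + lambda_recog * recog_loss + lambda_info * info_loss
```

Since all three terms are added into one scalar, gradients from the extraction loss also reach the shared features used by detection and recognition.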

Note that since text reading and information extraction modules are bridged with shared visual and textual features, they can reinforce each other. Specifically, the visual and textual features of text reading are fully fused and essential for information extraction, while the semantic feedback of information extraction module also contributes to the optimization of the shared convolutions and text reading module.

4. Experiments

In this section, we perform experiments to verify the effectiveness of the proposed method.

4.1. Datasets

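| Dataset | Training | Testing | Entities | Layout | Text type |
|---|---|---|---|---|---|
| Taxi Invoices | 4000 | 1000 | 9 | Fixed | Structured |
| SROIE | 626 | 347 | 4 | Variable | Structured |
| Resumes | 1978 | 497 | 6 | Variable | Semi-structured |

Table 2. Statistics of benchmark datasets used in this paper.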

We validate our model on three real-world datasets. One is the public SROIE (Huang et al., 2019) benchmark, and the other two are self-built datasets, Taxi Invoices and Resumes, respectively. Note that, the three benchmarks differ largely in layout and text type, from fixed to variable layout and from structured to semi-structured text. Statistics of the datasets are listed in Table 2 and some examples are shown in Fig. 1.

  • Taxi Invoices consists of 5000 images and has 9 entities to extract (Invoice Code, Invoice Number, Date, Pick-up time, Drop-off time, Price, Distance, Waiting, Amount). The invoices are in Chinese and can be grouped into roughly 13 templates, so this is a dataset of fixed layout and structured text.

  • SROIE (Huang et al., 2019) is a public dataset for receipt information extraction from the ICDAR 2019 Challenge. It contains 626 receipts for training and 347 receipts for testing. Each receipt is labeled with four types of entities: Company, Date, Address and Total. The layout of the SROIE dataset is variable and its text is structured.

  • Resumes is a dataset of 2475 Chinese scanned resumes, which has 6 entities to extract (Name, Phone Number, Email Address, Education period, Universities and Majors). As owners can design their own resume templates, this dataset has variable layouts and semi-structured text.

4.2. Implementation Details

4.2.1. Network Details:

The backbone of our model is ResNet-50 (He et al., 2016), followed by the FPN (Lin et al., 2017) to further enhance features. The text detection branch in the text reading module adopts the Faster R-CNN (Ren et al., 2015) network and outputs the predicted bounding boxes of possible texts for later sequential recognition. For each text region, its features are extracted from the shared convolutional features by RoIAlign (He et al., 2017a) and decoded by LSTM-based attention, where the number of hidden units is set to 256. In the information extraction module, we use convolutions with three kernel sizes followed by max pooling to extract representations of text, and the dimension of the text’s embedding vector is 256. The textual context modeling module consists of multiple layers of multi-head scaled dot-product attention modules. The number of hidden units of the BiLSTM used in entity extraction is set to 128. The loss-weighting hyper-parameters in Equa. 16 are all set to 1 in our experiments.

4.2.2. Training:

Our model and all the reimplemented counterparts are implemented in the PyTorch framework (Paszke et al., 2019). For our model, with the ADADELTA (Zeiler, 2012) optimization method, we set the learning rate to 1.0 at the beginning and decrease it to a tenth every 40 epochs. The batch size is set to 2 images per GPU and we train our model for 150 epochs. For the compared counterparts, we train the separate text reading and information extraction tasks independently until they fully converge. All the experiments are carried out on a workstation with two NVIDIA Tesla V100 GPUs.
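
The step-wise learning-rate schedule described above amounts to a one-line rule. The sketch below assumes zero-indexed epochs, and the function name and defaults are illustrative only.

```python
def learning_rate(epoch, base_lr=1.0, decay=0.1, step=40):
    """Learning rate that starts at base_lr and is divided by 10 every `step` epochs."""
    return base_lr * decay ** (epoch // step)
```

Under this rule, a 150-epoch run uses four learning-rate levels: epochs 0 to 39, 40 to 79, 80 to 119 and 120 to 149.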

4.3. State-of-the-Art Comparisons

We compare our proposed method with pipelines of SOTA text reading and information extraction methods.

4.3.1. Network Settings of SOTA counterparts and Evaluation Metric.

Most existing works focus only on the information extraction task for document understanding. For fair comparisons, we individually train a text reading network that is identical to the one in our model and take its results as the input of the information extraction counterparts. We reimplement three top information extraction methods (Katti et al., 2018; Ma and Hovy, 2016; Liu et al., 2019) and their network settings are as follows:

  1. Chargrid(TR): Pipeline of text reading and Chargrid. For Chargrid(Katti et al., 2018), the input character embedding is set to and the rest of network is identical to the paper.

  2. NER(TR): Pipeline of text reading and LSTM-CRF. The input embedding of LSTM-CRF(Ma and Hovy, 2016) is set to , followed by a BiLSTM with hidden units and CRF.

  3. GCN(TR): Pipeline of text Reading and GCN. In GCN, we build the network as the paper (Liu et al., 2019) specified and set the input embedding to , which performs slightly better than .

Consistent with (Liu et al., 2019) and the SROIE Challenge (Huang et al., 2019), we use F1-score to evaluate the performance of all experiments.
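
As a reminder of how the metric behaves, here is a minimal F1 computation over extracted entities. It assumes that a prediction counts as correct only when both the entity type and the extracted string match the ground truth exactly; that matching criterion and the function name are assumptions for illustration.

```python
def f1_score(predicted, ground_truth):
    """F1 over sets of (entity_type, value) pairs with exact matching."""
    predicted, ground_truth = set(predicted), set(ground_truth)
    correct = len(predicted & ground_truth)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(ground_truth)
    return 2 * precision * recall / (precision + recall)
```

With exact matching, a single misread character in an entity makes it count against both precision and recall, which is why text reading quality directly affects the extraction score.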

4.3.2. Results.

Taxi Invoices Dataset: As shown in Table 3, our model outperforms the counterparts by significant margins except on the Pick-up time entity (discussed at the end of this paragraph). For this dataset, noise from low image quality and stains may lead to failures in detecting and recognizing entities. Besides, the contents may be misplaced, e.g. the content of Pick-up time may appear after the ‘Date’. NER(TR) discards the layout information and serializes all the texts into a one-dimensional text sequence, and it reports inferior performance to the other methods. Benefiting from the layout information introduced by the constructed chargrid and positions, Chargrid(TR) and GCN(TR) work much better. However, Chargrid(TR) performs a pixel segmentation task and is prone to omitting characters or including extra characters, and GCN(TR) only exploits the positions of text segments. Our TRIE improves performance by using more of the useful visual features in VRDs than the counterparts. We attribute the slightly lower score on the Pick-up time entity compared with GCN(TR) to the annotations. Specifically, during annotation, when an entity such as the Pick-up time ‘18:47’ is too blurred to read, it is tagged as NULL. However, our model can still correctly read and extract this entity, which lowers the measured score, as shown in Fig. 4.

| Entities | Chargrid(TR) | NER(TR) | GCN(TR) | Our Model |
|---|---|---|---|---|
| Code | 89.4 | 94.5 | 97.0 | 98.2 |
| Number | 85.3 | 92.4 | 93.7 | 95.4 |
| Date | 89.8 | 82.5 | 93.0 | 94.9 |
| Pick-up time | 82.9 | 60.0 | 86.3 | 84.6 |
| Drop-off time | 87.4 | 81.1 | 91.0 | 93.6 |
| Price | 93.0 | 94.5 | 93.6 | 94.9 |
| Distance | 92.7 | 93.6 | 91.4 | 94.4 |
| Waiting | 89.2 | 85.4 | 91.0 | 92.4 |
| Amount | 80.2 | 86.3 | 88.7 | 90.9 |
| Avg | 87.77 | 85.59 | 91.74 | 93.26 |

Table 3. Performance comparisons (F1-score) on Taxi Invoices dataset.

Figure 4. The pick-up time, which is too blurred to read, is annotated as NULL, but our model correctly reads and extracts it, resulting in lower measured accuracy.

SROIE Dataset: On the SROIE dataset, we perform two sets of experiments and the results are shown in Table 4.

  1. Setting 1: We train the text reading module ourselves and report comparisons. Note that we do not employ data synthesis or model ensemble tricks in the training of text reading. Since the ‘Company’ and ‘Total’ entities often have distinguishing visual features such as bold type and large font, as shown in Fig. 1(b), our model benefits from the fusion of visual and textual features and outperforms the three counterparts by a large margin.

  2. Setting 2: For fair comparisons on the leaderboard, we use the officially provided ground truth of text bounding boxes and transcripts. Character-Word LSTM is similar to named entity recognition (Lample et al., 2016) and applies LSTMs at the character and word levels sequentially. A character-wise BiLSTM is used to create word embeddings, while a word-wise BiLSTM classifies each word into the corresponding category. LayoutLM (Xu et al., 2019) makes use of large pre-training data and fine-tunes on SROIE. Compared with these methods, our model shows competitive performance.

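| Input | Model | F1-Score |
|---|---|---|
| Prediction of bboxes and transcript of texts | Chargrid(TR) | 78.24 |
| Prediction of bboxes and transcript of texts | NER(TR) | 69.09 |
| Prediction of bboxes and transcript of texts | GCN(TR) | 76.51 |
| Prediction of bboxes and transcript of texts | Our model | 82.06 |
| Groundtruth of bboxes and transcript of texts | Character-Word LSTM (Lample et al., 2016) | 90.85 |
| Groundtruth of bboxes and transcript of texts | LayoutLM (Xu et al., 2019) | 95.24 |
| Groundtruth of bboxes and transcript of texts | Our model | 96.18 |

Table 4. Performance comparisons (F1-score) on SROIE.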

Resumes Dataset: The Resumes dataset consists of documents with variable layout and semi-structured text, on which our method still shows impressive results. We find that GCN(TR), which uses only the positions of the text, has difficulty adapting to such flexible layouts. Chargrid(TR) performs well on isolated entities, such as Name, Phone, Email and Education period, while the University and Major entities, which are often blended with other texts, are hard for it to extract. As expected, NER(TR) performs well on this kind of document, thanks to its inherently serializable nature. However, on isolated entities it is not as competitive as Chargrid(TR), since it discards the layout information completely. Our model inherits the advantages of both Chargrid(TR) and NER(TR): the context features provide the information needed to tell entities apart, and entity extraction is performed at the character granularity. Thus, our model achieves gains across all entities, as shown in Table 5.

| Entities | Chargrid(TR) | NER(TR) | GCN(TR) | Our Model |
|---|---|---|---|---|
| Name | 43.4 | 42.7 | 42.9 | 45.7 |
| Phone | 87.0 | 86.6 | 83.3 | 88.0 |
| E-mail | 70.9 | 69.6 | 68.0 | 74.9 |
| Edu-period | 77.1 | 68.7 | 62.2 | 81.4 |
| University | 74.7 | 86.0 | 82.3 | 87.4 |
| Major | 72.1 | 80.4 | 78.7 | 80.8 |
| Avg | 70.87 | 72.33 | 69.57 | 76.3 |
| Speed (fps) | 1.13 | 1.69 | 1.62 | 1.76 |

Table 5. Performance comparisons (F1-score) on Resumes dataset.
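
As a quick sanity check on the Avg row of Table 5, the per-entity scores can be averaged directly. The sketch below assumes the Avg row is an unweighted mean over the six entity types; the paper does not state the averaging scheme, and the names `ENTITY_F1` and `macro_average` are introduced here for illustration only.

```python
# Per-entity F1-scores copied from Table 5 (Resumes dataset), in the order
# Name, Phone, E-mail, Edu-period, University, Major.
ENTITY_F1 = {
    "Chargrid(TR)": [43.4, 87.0, 70.9, 77.1, 74.7, 72.1],
    "NER(TR)": [42.7, 86.6, 69.6, 68.7, 86.0, 80.4],
    "GCN(TR)": [42.9, 83.3, 68.0, 62.2, 82.3, 78.7],
    "Our Model": [45.7, 88.0, 74.9, 81.4, 87.4, 80.8],
}


def macro_average(scores):
    """Unweighted mean over entity types (assumed averaging scheme)."""
    return sum(scores) / len(scores)


# Round to two decimals, the precision used for most of the Avg row.
averages = {name: round(macro_average(s), 2) for name, s in ENTITY_F1.items()}

for name, avg in averages.items():
    print(f"{name}: {avg:.2f}")
```

Under this assumption the three baselines reproduce the reported averages (70.87, 72.33, 69.57), and our model comes out at 76.37, which agrees with the reported 76.3 to the single decimal shown.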

4.4. Discussion

In this section, we offer a more detailed discussion and a deeper understanding of the different parts of the proposed TRIE framework.

4.4.1. Effect of multi-modality features on information extraction:

To further examine the contributions of visual and textual context features to information extraction, we perform the following ablation study on the Taxi Invoices dataset; the results are presented in Table 6. ‘Text feat only’ means performing entity extraction using only the text features from the text reading module. Since the layout information is completely lost, this setting gives the worst performance. Introducing either the visual or the textual context features brings significant performance gains. Note that introducing visual context features slightly outperforms introducing textual context features, revealing the effect of visual context. Fusing both kinds of context features gives the best performance, verifying the effect of the multi-modality features of text reading on information extraction.

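| Features used | Accuracy |
|---|---|
| Text feat only | 74.33 |
| Text feat + Textual Context feat | 92.30 |
| Text feat + Visual Context feat | 92.70 |
| Text feat + Textual Context feat + Visual Context feat | 93.26 |

Table 6. Effect of multi-modality features on information extraction.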
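
The size of each gain in Table 6 can be made explicit with a little arithmetic. This is a minimal sketch, assuming the four accuracies correspond, in order, to text features only, text plus textual context, text plus visual context, and all three together (the ordering is inferred from the discussion above); the configuration labels are illustrative names, not the paper's.

```python
# Accuracies from Table 6 (Taxi Invoices); the keys are illustrative labels.
ACCURACY = {
    "text only": 74.33,
    "text + textual context": 92.30,
    "text + visual context": 92.70,
    "text + textual + visual context": 93.26,
}

baseline = ACCURACY["text only"]

# Absolute gain (in accuracy points) of each configuration over the baseline.
gains = {
    name: round(acc - baseline, 2)
    for name, acc in ACCURACY.items()
    if name != "text only"
}

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f}")
```

Most of the improvement (about 18 points) comes from adding either kind of context feature; fusing both adds roughly another half point on top.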

4.4.2. Effect of the end-to-end framework on text reading:

To verify the effect of the end-to-end framework on text reading, we evaluate the same trained GCN (Liu et al., 2019) model on two sets of text reading results; the comparison is shown in Table 7. ‘TR only’ means training the text reading module individually, and ‘End-to-End (TRIE)’ is our proposed model. GCN on the results of End-to-End (TRIE) shows performance gains over GCN on TR only, revealing the benefits of end-to-end training. Note that we compare the text reading results by the F1-score of entities to exclude the distraction of unrelated content. Some visualization examples are shown in Fig. 5. The Distance ‘23.3km’ and Date ‘2017-12-21’ are incorrectly recognized as ‘23.38m’ and ‘2010-02-21’ under pipeline training, leading to wrong information extraction results. Section 4.4.1 reveals the effect of the multi-modality features of text reading on information extraction in the forward pass, while the above quantitative and qualitative results uncover the effect of the end-to-end framework on text reading in the backward pass. This verifies that joint modeling of text reading and information extraction benefits both tasks.

| Text Reading Results | IE Model | Accuracy |
|---|---|---|
| TR only | GCN (Liu et al., 2019) | 91.70 |
| End-to-End (TRIE) | GCN (Liu et al., 2019) | 92.60 |
| End-to-End (TRIE) | TRIE | 93.26 |

Table 7. Effect of end-to-end framework on text reading.

Figure 5. Visualization of text reading results of pipeline training and our end-to-end training. Best viewed in color.
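
To illustrate how a single recognition error propagates into the entity-level score, the following sketch computes an F1-score over (entity type, text) pairs with exact string matching. The matching rule is an assumption, since the exact evaluation protocol is not spelled out here; `entity_f1` is a hypothetical helper, and the ‘Fare’ field and its value are made-up illustration data. Only the Distance and Date strings come from the example above.

```python
def entity_f1(predicted, groundtruth):
    """F1 over (entity_type, text) pairs; a pair counts only on exact match."""
    pred = set(predicted.items())
    gold = set(groundtruth.items())
    true_pos = len(pred & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


# Groundtruth entities; 'Fare' is a hypothetical extra field.
gold = {"Distance": "23.3km", "Date": "2017-12-21", "Fare": "58.00"}

# Pipeline training misreads two fields (strings taken from the text above).
pipeline = {"Distance": "23.38m", "Date": "2010-02-21", "Fare": "58.00"}

# Illustrative case in which every field is read correctly.
end_to_end = dict(gold)

print(entity_f1(pipeline, gold))
print(entity_f1(end_to_end, gold))
```

Because matching is exact, a one-character misrecognition such as ‘23.38m’ costs the whole entity, which is why improvements in text reading show up directly in the extraction score.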

4.4.3. Speed:

We evaluate the running time of our proposed model and its SOTA counterparts, which use pipeline stages, in fps (frames per second). The results are shown in Table 5. Since the text reading module needs to detect and recognize all the text in the image, it takes up most of the running time. Compared with the text reading module, the separate information extraction modules (Chargrid, NER and GCN) are much more lightweight. The full pipelines Chargrid(TR), NER(TR) and GCN(TR) run at 1.13, 1.69 and 1.62 fps, respectively. Benefiting from feature sharing between text reading and information extraction, TRIE does not need to recompute the textual and visual features from scratch and shows the highest speed of 1.76 fps.

All these methods are tested on the Resumes test set. We evaluate all test images and calculate the average speed. The text reading module uses size image as input and image patches for recognition. The number of time steps in recognition is set to in order to handle long Chinese texts. The resolution is set to to reduce computations. All results are obtained with PyTorch on an NVIDIA Tesla V100 GPU.
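
The average-speed measurement described above can be sketched as follows. This is not the authors' benchmarking code: `average_fps`, `fps_from_latencies` and the `process_image` callback are hypothetical names, and the timing uses only Python's standard `time.perf_counter`.

```python
import time


def fps_from_latencies(latencies_s):
    """Average speed in frames per second from per-image latencies (seconds)."""
    return len(latencies_s) / sum(latencies_s)


def average_fps(process_image, images):
    """Run process_image on every image and return the average fps."""
    latencies = []
    for image in images:
        start = time.perf_counter()
        process_image(image)  # hypothetical stand-in for the full model
        latencies.append(time.perf_counter() - start)
    return fps_from_latencies(latencies)


# Three images taking 0.5 s, 0.5 s and 1.0 s give 3 / 2.0 = 1.5 fps.
print(fps_from_latencies([0.5, 0.5, 1.0]))

# Timing a trivial stand-in workload.
print(average_fps(lambda image: sum(range(1000)), range(10)))
```

Note that the average is taken as total images over total time rather than as a mean of per-image fps values, which would overweight fast images. When timing a GPU model, the device would also need to be synchronized before each clock read, since kernels are launched asynchronously.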

4.4.4. Ablation on layers and heads in Multimodal Context Block:

To analyze the impact of the number of layers and heads in the Multimodal Context Block, we perform the following ablation studies; the results are shown in Table 8. The Taxi Invoices dataset is relatively simple and has a fixed layout, so a model with 1 or 2 layers and a small number of heads gives the best result. As the number of layers and heads grows, the model becomes prone to overfitting. The layout of the Resumes dataset is much more variable than that of the Taxi Invoices dataset, so it requires more layers and heads. In practice, one can adjust these settings according to the complexity of a specific task.

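| Datasets | Layers | 2 heads | 4 heads | 8 heads | 16 heads |
|---|---|---|---|---|---|
| Taxi Invoices | 1 | 92.97 | 92.98 | 92.86 | 92.72 |
| Taxi Invoices | 2 | 93.00 | 92.98 | 93.26 | 92.71 |
| Taxi Invoices | 3 | 92.55 | 92.81 | 93.06 | 92.83 |
| Resumes | 1 | 75.20 | 75.21 | 75.47 | 74.53 |
| Resumes | 2 | 75.62 | 76.25 | 76.28 | 75.86 |
| Resumes | 3 | 75.55 | 75.74 | 76.35 | 76.35 |

Table 8. Impacts of Layers and Heads in the Multimodal Context Block.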
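
Choosing the settings for a given task amounts to a small grid search over layers and heads. The sketch below picks the best cell for each dataset from the numbers in Table 8, assuming rows index layers (1 to 3), columns index heads (2, 4, 8, 16), and the first block of rows corresponds to Taxi Invoices. `best_settings` is a hypothetical helper introduced for illustration.

```python
HEADS = [2, 4, 8, 16]

# Scores from Table 8, keyed by dataset and then by number of layers.
TABLE_8 = {
    "Taxi Invoices": {
        1: [92.97, 92.98, 92.86, 92.72],
        2: [93.00, 92.98, 93.26, 92.71],
        3: [92.55, 92.81, 93.06, 92.83],
    },
    "Resumes": {
        1: [75.20, 75.21, 75.47, 74.53],
        2: [75.62, 76.25, 76.28, 75.86],
        3: [75.55, 75.74, 76.35, 76.35],
    },
}


def best_settings(grid):
    """Return the top score and every (layers, heads) pair that reaches it."""
    best = max(score for row in grid.values() for score in row)
    winners = [
        (layers, heads)
        for layers, row in grid.items()
        for heads, score in zip(HEADS, row)
        if score == best
    ]
    return best, winners


for dataset, grid in TABLE_8.items():
    print(dataset, best_settings(grid))
```

On these numbers the best Taxi Invoices setting is 2 layers with 8 heads (93.26), while Resumes peaks at 3 layers with either 8 or 16 heads (76.35), in line with the observation that the more variable layout favors a deeper block.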

5. Conclusion

In this paper, we presented an end-to-end network that bridges text reading and information extraction for document understanding. The two tasks can mutually reinforce each other through joint training: the visual and textual features of text reading boost the performance of information extraction, while the loss of information extraction also supervises the optimization of text reading. On a variety of benchmarks, from structured to semi-structured text and from fixed to variable layout, our proposed method significantly outperforms three state-of-the-art methods in both efficiency and accuracy.


References

  • P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, pp. 6077–6086. Cited by: §1.
  • M. Busta, L. Neumann, and J. Matas (2017) Deep textspotter: an end-to-end trainable scene text localization and recognition framework. In ICCV, pp. 2223–2231. Cited by: §2.1.
  • M. Carbonell, A. Fornés, M. Villegas, and J. Lladós (2019) TreyNet: A neural model for text localization, transcription and named entity recognition in full pages. arXiv preprint arXiv:1912.10016. Cited by: §2.2.
  • J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555. Cited by: §2.1.
  • Z. Dai, Z. Yang, Y. Yang, J. G. Carbonell, Q. V. Le, and R. Salakhutdinov (2019) Transformer-xl: attentive language models beyond a fixed-length context. In ACL, pp. 2978–2988. Cited by: §2.2.
  • A. Dengel and B. Klein (2002) SmartFIX: A requirements-driven system for document analysis and understanding. In DAS, Lecture Notes in Computer Science, Vol. 2423, pp. 433–444. Cited by: §2.2.
  • T. I. Denk and C. Reisswig (2019) BERTgrid: contextualized embedding for 2d document representation and understanding. arXiv preprint arXiv:1909.04948. Cited by: §1, §2.2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, pp. 4171–4186. Cited by: §2.2, §3.3.2, §3.3.2.
  • D. Esser, D. Schuster, K. Muthmann, M. Berger, and A. Schill. Automatic indexing of scanned documents: a layout-based approach. In Document Recognition and Retrieval XIX, part of the IS&T-SPIE Electronic Imaging Symposium, SPIE Proceedings, Vol. 8297, pp. 82970H. Cited by: §2.2.
  • W. Feng, W. He, F. Yin, X. Zhang, and C. Liu (2019) TextDragon: an end-to-end framework for arbitrary shaped text spotting. In ICCV, pp. 9075–9084. Cited by: §1, §2.1.
  • A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach (2016) Multimodal compact bilinear pooling for visual question answering and visual grounding. In EMNLP, pp. 457–468. Cited by: §1.
  • H. Guo, X. Qin, J. Liu, J. Han, J. Liu, and E. Ding (2019) EATEN: entity-aware attention for single shot visual text extraction. In ICDAR, pp. 254–259. Cited by: §2.2.
  • K. He, G. Gkioxari, P. Dollár, and R. B. Girshick (2017a) Mask R-CNN. In ICCV, pp. 2980–2988. Cited by: §3.2, §4.2.1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §3.2, §4.2.1.
  • P. He, W. Huang, T. He, Q. Zhu, Y. Qiao, and X. Li (2017b) Single shot text detector with regional attention. In ICCV, pp. 3066–3074. Cited by: §2.1, §3.2.
  • T. He, Z. Tian, W. Huang, C. Shen, Y. Qiao, and C. Sun (2018) An end-to-end textspotter with explicit alignment and attention. In CVPR, pp. 5020–5029. Cited by: §2.1.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780. Cited by: §2.1, §3.2.
  • Z. Huang, K. Chen, J. He, X. Bai, D. Karatzas, S. Lu, and C. V. Jawahar (2019) ICDAR2019 competition on scanned receipt OCR and information extraction. In ICDAR, pp. 1516–1520. Cited by: 2nd item, §4.1, §4.3.1.
  • S. B. Huffman (1995) Learning information extraction patterns from examples. In Connectionist, Statistical, and Symbolic Approaches to Learning for Natural Language Processing, Lecture Notes in Computer Science, Vol. 1040, pp. 246–260. Cited by: §2.2.
  • D. R. Judd, B. Karsh, R. Subbaroyan, T. Toman, R. Lahiri, and P. Lok (2004) Apparatus and method for searching and retrieving structured, semi-structured and unstructured content. Note: US Patent App. 10/439,338 Cited by: §1.
  • A. R. Katti, C. Reisswig, C. Guder, S. Brarda, S. Bickel, J. Höhne, and J. B. Faddoul (2018) Chargrid: towards understanding 2d documents. In EMNLP, pp. 4459–4469. Cited by: §1, §2.2, item 1, §4.3.1.
  • G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer (2016) Neural architectures for named entity recognition. In NAACL-HLT, pp. 260–270. Cited by: §1, §2.2, item 2, Table 4.
  • C. Lee and S. Osindero (2016) Recursive recurrent nets with attention modeling for OCR in the wild. In CVPR, pp. 2231–2239. Cited by: §2.1.
  • H. Li, P. Wang, and C. Shen (2017) Towards end-to-end text spotting with convolutional recurrent neural networks. In ICCV, pp. 5248–5256. Cited by: §2.1.
  • M. Liao, B. Shi, X. Bai, X. Wang, and W. Liu (2017) TextBoxes: A fast text detector with a single deep neural network. In AAAI, pp. 4161–4167. Cited by: §2.1, §3.2.
  • T. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie (2017) Feature pyramid networks for object detection. In CVPR, pp. 936–944. Cited by: §3.2, §4.2.1.
  • X. Liu, F. Gao, Q. Zhang, and H. Zhao (2019) Graph convolution for multimodal information extraction from visually rich documents. In NAACL-HLT, pp. 32–39. Cited by: §1, §2.2, item 3, §4.3.1, §4.4.2, Table 7.
  • X. Liu, D. Liang, S. Yan, D. Chen, Y. Qiao, and J. Yan (2018) FOTS: fast oriented text spotting with a unified network. In CVPR, pp. 5676–5685. Cited by: §2.1.
  • Y. Liu and L. Jin (2017) Deep matching prior network: toward tighter multi-oriented text detection. In CVPR, pp. 3454–3461. Cited by: §2.1, §3.2.
  • S. Long, J. Ruan, W. Zhang, X. He, W. Wu, and C. Yao (2018) Textsnake: A Flexible Representation for Detecting Text of Arbitrary Shapes. In ECCV, pp. 19–35. Cited by: §2.1, §3.2.
  • X. Ma and E. H. Hovy (2016) End-to-end sequence labeling via bi-directional lstm-cnns-crf. In ACL, Cited by: §1, §2.2, item 2, §4.3.1.
  • I. Muslea et al. (1999) Extraction patterns for information extraction tasks: a survey. In AAAI, Vol. 2. Cited by: §2.2.
  • R. B. Palm, F. Laws, and O. Winther (2019) Attend, copy, parse end-to-end information extraction from documents. In ICDAR, pp. 329–336. Cited by: §1, §2.2.
  • R. B. Palm, O. Winther, and F. Laws (2017) CloudScan - A configuration-free invoice analysis system using recurrent neural networks. In ICDAR, pp. 406–413. Cited by: §1, §2.2.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In NeurIPS, pp. 8024–8035. Cited by: §4.2.2.
  • L. Qiao, S. Tang, Z. Cheng, Y. Xu, Y. Niu, S. Pu, and F. Wu (2020) Text perceptron: towards end-to-end arbitrary-shaped text spotting. In AAAI, Cited by: §1, §2.1.
  • S. Ren, K. He, R. B. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In NeurIPS, pp. 91–99. Cited by: §3.5, §4.2.1.
  • E. Riloff (1993) Automatically constructing a dictionary for information extraction tasks. In Proceedings of the 11th National Conference on Artificial Intelligence, pp. 811–816. Cited by: §2.2.
  • C. Sage, A. Aussem, H. Elghazel, V. Eglin, and J. Espinas (2019) Recurrent neural network approach for table field extraction in business documents. In ICDAR, pp. 1308–1313. Cited by: §1, §2.2.
  • E. F. T. K. Sang and J. Veenstra (1999) Representing text chunks. In EACL, pp. 173–179. Cited by: §3.4.2.
  • D. Schuster, K. Muthmann, D. Esser, A. Schill, M. Berger, C. Weidling, K. Aliyev, and A. Hofmeier (2013) Intellix - end-user trained information extraction for document archiving. In ICDAR, pp. 101–105. Cited by: §2.2.
  • K. Shaalan (2014) A survey of arabic named entity recognition and classification. Comput. Linguistics 40 (2), pp. 469–510. Cited by: §1.
  • B. Shi, X. Bai, and S. J. Belongie. Detecting oriented text in natural images by linking segments. In CVPR, Cited by: §2.1, §3.2.
  • B. Shi, X. Bai, and C. Yao (2017) An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE TPAMI. 39 (11), pp. 2298–2304. Cited by: §2.1.
  • S. Soderland (1999) Learning information extraction rules for semi-structured and free text. Mach. Learn. 34 (1-3), pp. 233–272. Cited by: §1.
  • H. Wang, P. Lu, H. Zhang, M. Yang, X. Bai, Y. Xu, M. He, Y. Wang, and W. Liu (2020) All you need is boundary: toward arbitrary-shaped text spotting. In AAAI, Cited by: §1, §2.1.
  • W. Wang, E. Xie, X. Li, W. Hou, T. Lu, G. Yu, and S. Shao (2019) Shape Robust Text Detection With Progressive Scale Expansion Network. In CVPR, Cited by: §2.1, §3.2.
  • Y. Xu, M. Li, L. Cui, S. Huang, F. Wei, and M. Zhou (2019) LayoutLM: pre-training of text and layout for document image understanding. arXiv preprint arXiv:1912.13318. Cited by: §1, §2.2, §3.3.1, item 2, Table 4.
  • V. Yadav and S. Bethard (2018) A survey on recent advances in named entity recognition from deep learning models. In COLING, pp. 2145–2158. Cited by: §2.2.
  • Z. Yang, Z. Dai, Y. Yang, J. G. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. In NeurIPS, pp. 5754–5764. Cited by: §2.2.
  • Z. Yang, X. He, J. Gao, L. Deng, and A. J. Smola (2016) Stacked attention networks for image question answering. In CVPR, pp. 21–29. Cited by: §1.
  • M. D. Zeiler (2012) ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701. Cited by: §4.2.2.
  • X. Zhang, J. J. Zhao, and Y. LeCun (2015) Character-level convolutional networks for text classification. In NeurIPS, pp. 649–657. Cited by: §3.3.2.
  • X. Zhao, Z. Wu, and X. Wang (2019) CUTIE: learning to understand documents with convolutional universal text information extractor. arXiv preprint arXiv:1903.12363. Cited by: §1, §2.2.
  • X. Zhou, C. Yao, H. Wen, Y. Wang, S. Zhou, W. He, and J. Liang (2017) EAST: An Efficient and Accurate Scene Text Detector. In CVPR, pp. 2642–2651. Cited by: §2.1, §3.2.