
Spatial Dual-Modality Graph Reasoning for Key Information Extraction

by   Hongbin Sun, et al.

Key information extraction from document images is of paramount importance in office automation. Conventional template-matching-based approaches fail to generalize to document images of unseen templates, and are not robust against text recognition errors. In this paper, we propose an end-to-end Spatial Dual-Modality Graph Reasoning method (SDMG-R) to extract key information from unstructured document images. We model document images as dual-modality graphs, whose nodes encode both the visual and textual features of detected text regions, and whose edges represent the spatial relations between neighboring text regions. Key information extraction is solved by iteratively propagating messages along graph edges and reasoning about the categories of graph nodes. To comprehensively evaluate the proposed method and to facilitate future research, we release a new dataset named WildReceipt, which is collected and annotated specifically for evaluating key information extraction from document images of unseen templates in the wild. It contains 25 key information categories and about 69000 text boxes in total, and is about 2 times larger than the existing public datasets. Extensive experiments validate that visual features, textual features and spatial relations all benefit key information extraction. SDMG-R effectively extracts key information from document images of unseen templates, and obtains new state-of-the-art results on the recent popular benchmark SROIE and on our WildReceipt. Our code and dataset will be publicly released.





I Introduction

Extracting key information from unstructured document images, such as historical documents, receipts, orders and credit notes, plays an important role in office automation, including efficient archiving, compliance checking and so on. Conventional approaches [31, 30, 3, 25] maintain a set of templates, each of which consists of key words and their layouts. Although they can usually extract key information accurately, they are not robust against partial text recognition errors, which occur frequently. To make matters worse, they cannot generalize to documents from unseen templates, which prevents them from being used in many real application scenarios.

Fig. 1: (a) NER; (b) SDMG-R. Illustration of Named Entity Recognition (NER) and our proposed Spatial Dual-Modality Graph Reasoning (SDMG-R). NER models the relations between text regions on the same horizontal line, while our SDMG-R models the relations between all text regions in the spatial neighborhood. Moreover, NER uses textual features only, while SDMG-R uses both visual features extracted from image regions and textual ones extracted from texts.

In this paper, we target key information extraction in a more challenging setting, where the training set and the test set have different templates. CloudScan [27] modeled key information extraction as a Named Entity Recognition problem by concatenating texts into strings, which are classified into predefined categories such as order ID, invoice number and so on (see Figure 1 (a)). Although it can generalize to samples of unseen templates, it degrades greatly when lines are not aligned properly because the images are not captured from the front. Moreover, it uses only the preceding and following context within the concatenated strings, but not neighboring text regions which are not on the same line. We believe that a robust key information extraction approach should be robust to the image view, and should utilize all context in the spatial neighborhood rather than the same horizontal line only.

To this end, we propose an end-to-end Spatial Dual-Modality Graph Reasoning (SDMG-R) approach for key information extraction. We model the unstructured document images as spatial dual-modality graphs with graph nodes as detected text boxes and graph edges as the spatial relations between these nodes (see Figure 1 (b)). Each node is associated with textual and visual features, which are learned automatically via a recurrent neural network (RNN) and a convolutional neural network (CNN) respectively. Features of graph nodes are propagated iteratively along graph edges before the nodes are classified into pre-defined key information categories. In this way, SDMG-R makes full use of the spatial relations between detected text regions and of their dual-modality features. It is independent of document templates, and thus naturally has the potential to extract key information from document images of unseen templates.

Most previous key information extraction approaches are evaluated on private data only, due to the lack of public datasets. Recently, a few datasets that target key information extraction, such as IEHHR [10] and SROIE [14], have emerged. However, their training and test sets share many templates, and thus they are unsuited to evaluating the generalization ability of key information extraction methods. To this end, we build a new key information extraction benchmark dubbed WildReceipt. It consists of 25 key information categories and about 69000 text boxes in total, and is about 2 times larger than SROIE. The key information categories in WildReceipt are fine-grained, e.g., they contain “Subtotal value”, “Total value” and “Tax value” categories, all of which are money amounts and are difficult to distinguish from each other without context. Unlike the scanned images in SROIE, receipt images in WildReceipt are captured in the wild, and most of them are taken from non-front views and have folds. Therefore, WildReceipt is more challenging and realistic than previous datasets.

We extensively evaluate our proposed SDMG-R on SROIE and WildReceipt. It has been shown that the proposed approach outperforms previous methods with impressive margins. We investigate the factors of the effectiveness of the proposed approach, and find that both the spatial relations and the dual modality features benefit the key information extraction.

The contributions of this paper are as follows:

  • We propose an effective spatial dual-modality graph reasoning network (dubbed SDMG-R) for key information extraction. To the best of our knowledge, our SDMG-R is the first key information extraction approach which jointly reasons key information categories on textual and visual features of text boxes and their 2-dimensional spatial relationships.

  • We annotate a new benchmark, named WildReceipt, to facilitate future research on key information extraction. It is fine-grained and about 2 times bigger than its competitors. It targets the evaluation of key information extraction from document images of unseen templates captured in the wild, which is not explored in previous datasets.

  • We validate the effectiveness of the proposed SDMG-R on two benchmarks, i.e., SROIE and WildReceipt. Our proposed approach outperforms state-of-the-art approaches with impressive margins.

II Related Work


Key information extraction has been attracting a large number of researchers from the computer vision and multimedia fields. However, most of them conducted experiments on private datasets. Intellix [31] was trained on 8000 and tested on 4000 scanned documents with 10 annotated semantic categories. CloudScan [27] conducted experiments on scanned UBL invoices. Later, CUTIE annotated Spanish receipt documents captured in the wild, including taxi receipts, Meals Entertainment (ME) receipts, and hotel receipts, with 9 different key information classes. However, none of the aforementioned datasets is publicly available. Recently, SROIE [14], consisting of 600 scanned receipts from the ICDAR 2019 Robust Reading Challenge, was released with 4 categories, namely store name, store address, date and total price. Its training set and the corresponding ground truth are publicly available. However, the test set ground truth is not released. We annotate its test set in order to evaluate key information extraction approaches on it in our experiments. Our WildReceipt aims to facilitate key information extraction research and will be publicly released. It is about 2 times and 5 times bigger than SROIE in terms of the total image number and the key information category number respectively. Unlike previous datasets, which contain value categories only, WildReceipt contains both key and value categories, such as the “Total key” category (e.g., “Total:” and “Total” ) and the “Total value” category (e.g., “$10.5” and “$20”). We empirically find that key category classification during training can boost the classification performance of value categories. Our WildReceipt contains fine-grained categories such as “Total value”, “Subtotal value”, and “Tax value”, all of which represent money amounts. Moreover, it is captured in the wild, which makes it more challenging and more widely applicable. A detailed comparison between the key information extraction datasets is given in Table I.

Dataset    #img     #class  Source   Wild  FG  Key  Public
Intellix   12,000   10      unknown
CloudScan  326,471  8       invoice
CUTIE      4,484    9       receipt
SROIE      600      4       receipt
Ours       1,768    25      receipt
TABLE I: Comparison between different datasets. “FG” indicates fine-grained.
Fig. 2: The overall architecture of the proposed SDMG-R model for key information extraction. Given one image, visual features are extracted via U-Net and RoI pooling, while textual features are extracted via one Bi-LSTM. The features of the two modalities are fused by the Kronecker product, approximated by the block-diagonal tensor decomposition, in the Dual Modality Fusion Module before being fed into the Graph Reasoning Module, where the node features are propagated and aggregated, and the edge weights are dynamically learned. The final node features are classified into one of the key information categories in the classification module.

Key information extraction. Conventional approaches [31, 30, 3, 6] utilized template matching or rule-based strategies, and thus performed poorly on documents of unseen templates. Later, CloudScan [27] modeled the key information extraction problem as NER [28, 20, 5, 24], and concatenated the entire invoice texts into one-dimensional sequences without utilizing the two-dimensional spatial layout information. Chargrid [9] encoded each document page as a two-dimensional grid of characters to conduct semantic segmentation. It utilized two-dimensional spatial layout information within a small neighborhood only, and could not make full use of nonlocal spatial relations between distant text regions due to the limited effective receptive field of convolutional neural networks [23]. Recently, VRD [22] learned graph embeddings to summarize the two-dimensional context of text segments in the document, which were further combined with text embeddings for entity extraction from visually rich documents. Our proposed SDMG-R models documents as fully-connected graphs with text regions as nodes and two-dimensional spatial relations as edges. Unlike VRD, SDMG-R learns both visual features and textual features of text regions, which leads to robustness against text recognition errors. A detailed comparison between recent approaches and our SDMG-R is given in Table II.

Methods  1D context  2D context  Nonlocal context  Visual features
TABLE II: Comparison between different key information extraction approaches in terms of whether they utilize 1D context, 2D context, nonlocal context and visual features of text regions.

Graph neural networks. Recently, integrating graphs with deep neural networks has become an emerging topic in deep learning research. A considerable number of models have arisen for reasoning on graph-structured data in various tasks, such as classification of graphs [8, 7, 26], classification of nodes of graphs [17, 18], and modeling multi-agent interacting physical systems [17, 13, 33]. Graph neural networks have been widely used in many applications such as human action recognition [38, 36], social relationship understanding [37], object parsing [21], multi-label image recognition [4], visual question answering [34], and fashion retrieval [19]. These works create graphs by modeling the relations between image objects or regions. In contrast, we explore the use of graphs to express spatial relations between detected text boxes, which are encoded by visual-textual features, and apply them to the field of key information extraction. For each detected text box, the graph can automatically mine the useful layout structure in its neighborhood.

Multi-modality fusion. Our work is also related to multi-modality fusion methods [11, 15, 2, 1]. Most of them target visual question answering, visual grounding and visual relationship detection. We are the first to investigate visual and textual modality fusion for key information extraction.

III Approach

Given one document image $I$ of size $H \times W$, together with detected text regions $\{r_i\}$, where $r_i = \langle x_i, y_i, h_i, w_i, s_i \rangle$ with $(x_i, y_i)$, $h_i$, $w_i$, and $s_i$ being the top-left corner coordinate, the height, the width, and the recognized text string of $r_i$ respectively, key information extraction aims at classifying each detected text region $r_i$ into one of a predefined category set $\mathcal{Y}$. We model key information extraction as a graph node classification problem that jointly makes full use of dual-modality features, namely visual features and textual ones. Our proposed spatial dual-modality graph reasoning model consists of the dual-modality fusion module, the graph reasoning module and the classification module. Figure 2 shows its overall architecture.
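The problem formulation above can be made concrete with a small data structure. The following is a minimal sketch, not the authors' implementation: the class name `TextRegion`, the helper `is_valid_labeling` and the three-category list are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TextRegion:
    """One detected text region: top-left corner, size, and recognized string."""
    x: float
    y: float
    h: float
    w: float
    text: str


# Hypothetical three-category subset, for illustration only.
CATEGORIES = ["Total key", "Total value", "Others"]


def is_valid_labeling(regions: List[TextRegion], labels: List[int]) -> bool:
    """Node classification assigns exactly one category index to each region."""
    return len(labels) == len(regions) and all(
        0 <= label < len(CATEGORIES) for label in labels
    )


regions = [
    TextRegion(x=40, y=300, h=18, w=60, text="Total:"),
    TextRegion(x=220, y=300, h=18, w=55, text="$10.5"),
]
labels = [0, 1]  # "Total key", "Total value"
```

Each region becomes one graph node, and the model's task is to predict the `labels` list from the image and the regions.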

III-A Dual-Modality Fusion Module

Given one image $I$ with the text regions $\{r_i\}$, we learn a feature vector $\mathbf{n}_i$ to represent each text region $r_i$ via the dual-modality fusion module. The dual-modality fusion module is designed to effectively learn and compose the visual features and the textual ones. We extract the visual feature $\mathbf{v}_i$ for $r_i$ by RoI Pooling [12] with its rectangle on the output feature maps of the last layer of one CNN feature extractor. In our experiments, we use U-Net [29] to instantiate the CNN feature extractor. Besides, we extract the textual feature $\mathbf{t}_i$ for $r_i$ with a char-level Bi-LSTM [24]. Specifically, we first represent each char $s_i^j$ in $s_i$ as a one-hot vector $\mathbf{e}_i^j \in \mathbb{R}^{D_c}$, with the dimension $D_c$ being the cardinality of the char dictionary. $\mathbf{e}_i^j$ is then projected into a lower dimensional space and finally sequentially fed into the Bi-LSTM module to obtain the textual representation $\mathbf{t}_i \in \mathbb{R}^{D_t}$ for the text region $r_i$. Formally, we have

$$\mathbf{t}_i = \text{Bi-LSTM}(\mathbf{W}_s \mathbf{e}_i^1, \ldots, \mathbf{W}_s \mathbf{e}_i^{T_i}) \tag{1}$$

where $T_i$ is the number of chars in $s_i$ and $\mathbf{W}_s \in \mathbb{R}^{D_s \times D_c}$ is the projection matrix of textual one-hot vectors. We fuse the visual features and textual features by modeling the interactions between all possible pairs of visual and textual feature dimensions, which are easily obtained with the Kronecker product as follows:


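$$\mathbf{n}_i = \mathbf{W}(\mathbf{v}_i \otimes \mathbf{t}_i) \tag{2}$$

where $\mathbf{v}_i \in \mathbb{R}^{D_v}$ and $\mathbf{t}_i \in \mathbb{R}^{D_t}$ are the visual and textual features of the $i$-th text region, $\otimes$ is the Kronecker product operation, $\mathbf{W}$ is one learnable linear transformation and $\mathbf{n}_i \in \mathbb{R}^{D_n}$ is the fused feature. For simplicity, we ignore the bias term in our paper. The number of learnable parameters in Equation (2) grows linearly with each of the dimension of the visual features, that of the textual ones, and that of the fused representations, i.e., it is proportional to their product, which results in heavy memory and computation overheads. To reduce the memory and computation complexity, we first reformulate Equation (2) into tensor form: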


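$$\mathbf{n}_i = (\mathcal{T} \times_1 \mathbf{v}_i^{\top}) \times_2 \mathbf{t}_i^{\top} \tag{3}$$

where $\mathcal{T} \in \mathbb{R}^{D_v \times D_t \times D_n}$ is one tensor obtained by reshaping $\mathbf{W}$ in Equation (2), and $\times_k$ is the mode-$k$ product. $\mathbf{v}^{\top}$ indicates the transpose of $\mathbf{v}$. We then introduce the block tensor decomposition [39, 1] to decompose $\mathcal{T}$ as follows:

$$\mathcal{T} = \mathcal{D} \times_1 \mathbf{A} \times_2 \mathbf{B} \times_3 \mathbf{C} \tag{4}$$

where $\mathcal{D} \in \mathbb{R}^{R D_A \times R D_B \times R D_C}$ is the block-diagonal core tensor with $R$ being the block number and $D_A \times D_B \times D_C$ being the block size, $\mathbf{A} \in \mathbb{R}^{D_v \times R D_A}$, $\mathbf{B} \in \mathbb{R}^{D_t \times R D_B}$, and $\mathbf{C} \in \mathbb{R}^{D_n \times R D_C}$. Usually, we set $D_A$, $D_B$, and $D_C$ to one small constant. Thus, the parameter number of the decomposed tensor in Equation (4) is much smaller than that of the original tensor in Equation (3), i.e., as shown in our experiments.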
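To illustrate why the decomposition saves parameters, the sketch below fuses two toy vectors with a Kronecker product followed by a linear map, and then compares parameter counts for the full and the block-decomposed formulation. It is a plain-Python illustration under stated assumptions: the feature dimensions (256), the block number (16) and the cubic block size (16) are hypothetical values, and the count assumes one factor matrix per mode plus a block-diagonal core.

```python
def kron(v, t):
    """Kronecker product of two vectors: every pairwise product v[a] * t[b]."""
    return [a * b for a in v for b in t]


def fuse(W, v, t):
    """Fused feature W (v kron t); W has len(v) * len(t) columns, bias omitted."""
    pairs = kron(v, t)
    return [sum(w * p for w, p in zip(row, pairs)) for row in W]


def full_param_count(d_v, d_t, d_n):
    """Parameters of the unconstrained linear map on the Kronecker product."""
    return d_v * d_t * d_n


def block_param_count(d_v, d_t, d_n, blocks, block_size):
    """Block-diagonal core (blocks cubes) plus one factor matrix per mode."""
    core = blocks * block_size ** 3
    factors = (d_v + d_t + d_n) * blocks * block_size
    return core + factors


v = [1.0, 2.0]                       # toy visual feature
t = [3.0, 4.0, 5.0]                  # toy textual feature
W = [[1.0] * 6, [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]]
fused = fuse(W, v, t)

# Hypothetical dimensions, chosen only to show the order of magnitude.
full = full_param_count(256, 256, 256)
small = block_param_count(256, 256, 256, blocks=16, block_size=16)
```

With these assumed sizes the full map needs 16,777,216 parameters while the decomposed form needs 262,144, a 64-fold reduction.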

We also implement alternative fusion schemes in our experiments for comparison.

LinearSum. The visual features and textual features are linearly projected into one common space via one three-layer MLP, and then added element-wise to form the fused representation of each text region.

ConcatMLP. The visual features and textual features are concatenated, followed by one three-layer MLP.

III-B Graph Reasoning Module

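We model the document images as graphs $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V} = \{\mathbf{n}_i\}$ with $\mathbf{n}_i$ being the feature vector of the text node $r_i$, and $\mathcal{E} = \{e_{ij}\}$ with $e_{ij}$ being the edge weight between the node $r_i$ and the node $r_j$.

We encode the spatial relation between $r_i$ and $r_j$ via one dynamic attention mechanism. We first define the spatial relation between node $r_i$ and $r_j$ as follows:

$$\mathbf{r}^p_{ij} = [x_{ij}/d,\; y_{ij}/d] \tag{7}$$

$$\mathbf{r}^s_{ij} = [w_i/h_i,\; h_j/h_i,\; w_j/h_i] \tag{8}$$

$$\mathbf{r}_{ij} = \mathbf{r}^p_{ij} \oplus \mathbf{r}^s_{ij} \tag{9}$$

where $x_{ij}$ and $y_{ij}$ are the horizontal distance and the vertical one between the two text boxes $r_i$ and $r_j$ respectively, $d$ is one normalization constant, and $\oplus$ is the concatenation operation. The spatial position relation between two text boxes plays a critical role in key information extraction. $\mathbf{r}^p_{ij}$ encodes the relative spatial position distance between node $r_i$ and $r_j$. The first term and the latter two in Equation (8) encode the aspect ratio of $r_i$ and the relative shape information respectively.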
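The spatial relation can be sketched as a small function that returns a five-dimensional vector for an ordered pair of boxes. This is an illustrative reading of the description above, not the authors' code: the ordering of the five terms, the sign convention of the distances (box j minus box i) and the function name `spatial_relation` are assumptions.

```python
def spatial_relation(box_i, box_j, d):
    """Relation of box_j relative to box_i.

    Each box is (x, y, h, w) with (x, y) the top-left corner; d is the
    normalization constant for distances.
    """
    xi, yi, hi, wi = box_i
    xj, yj, hj, wj = box_j
    # Normalized horizontal and vertical distances (sign convention assumed).
    position = [(xj - xi) / d, (yj - yi) / d]
    # Aspect ratio of box i, then height and width of box j relative to box i.
    shape = [wi / hi, hj / hi, wj / hi]
    return position + shape


key_box = (0.0, 0.0, 10.0, 40.0)
value_box = (100.0, 50.0, 20.0, 40.0)
relation = spatial_relation(key_box, value_box, d=10.0)
```

The relation is not symmetric: swapping the two boxes changes both the distance signs and the shape ratios, so each ordered node pair gets its own edge feature.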

Fig. 3: Annotations and samples of WildReceipt. The left shows the annotated text bounding boxes (red) with their corresponding key information categories (blue); the right shows one receipt sample with folds, and one non-front sample in WildReceipt (best viewed in color).

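Inspired by [35], we embed the spatial information between text boxes into the edge weight as follows:

$$\mathbf{r}'_{ij} = \mathrm{Norm}(\mathbf{W}_e \mathbf{r}_{ij}) \tag{10}$$

$$\mathbf{h}_{ij} = \mathbf{n}_i \oplus \mathbf{r}'_{ij} \oplus \mathbf{n}_j \tag{11}$$

$$e_{ij} = \mathcal{M}(\mathbf{h}_{ij}) \tag{12}$$

where $\mathbf{W}_e$ is one linear transformation which embeds the spatial relation information $\mathbf{r}_{ij}$ into a $D_e$-dimensional representation. $\mathrm{Norm}$ is the normalization operation, which is introduced to stabilize the training procedure. $\mathbf{h}_{ij}$ is the concatenated representation of $\mathbf{n}_i$, $\mathbf{n}_j$ and the normalized spatial relation embedding $\mathbf{r}'_{ij}$. $\mathcal{M}$ is one MLP which transforms $\mathbf{h}_{ij}$ into the scalar $e_{ij}$.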

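Graph reasoning. We iteratively refine the features of the proposed spatial dual-modality graph $K$ times as follows:

$$\mathbf{n}_i^{l+1} = \mathbf{n}_i^{l} + \sigma\Big(\mathbf{W}^l \sum_{j \neq i} \alpha_{ij}^l \mathbf{h}_{ij}^l\Big) \tag{13}$$

where $\mathbf{n}_i^l$ indicates the feature of the graph node $r_i$ at time step $l$, $\mathbf{W}^l$ is a linear transformation at time step $l$, $\mathbf{h}_{ij}^l$ is the concatenated representation of $\mathbf{n}_i^l$, $\mathbf{n}_j^l$ and the normalized spatial relation embedding at time step $l$ as described in Equation (11), and $\sigma$ is the ReLU nonlinear activation. $\alpha_{ij}^l$ is the learnable normalized edge weight between node $r_i$ and $r_j$ at time step $l$. It is given by

$$\alpha_{ij}^l = \frac{\exp(e_{ij}^l)}{\sum_{k \neq i} \exp(e_{ik}^l)} \tag{14}$$

where $e_{ij}^l$ is the edge weight of Equation (12) computed from the node features at time step $l$. From Equation (14), the edge weights of the proposed graph change dynamically during reasoning from one iteration to another, since they are recomputed from the refined node features.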
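One reasoning iteration can be sketched in plain Python as attention-weighted aggregation over all other nodes. This is a toy illustration under explicit assumptions: the edge scores are normalized with a softmax, the node update is residual, and `transform` stands in for the learned linear transformation; none of these details is confirmed by the text beyond "normalized edge weight", "linear transformation" and "ReLU".

```python
import math


def attention_weights(scores, i):
    """Normalize the edge scores of node i over all j != i (softmax assumed)."""
    others = [j for j in range(len(scores)) if j != i]
    peak = max(scores[j] for j in others)
    exps = {j: math.exp(scores[j] - peak) for j in others}
    total = sum(exps.values())
    return {j: value / total for j, value in exps.items()}


def reasoning_step(nodes, pair_feats, edge_scores, transform):
    """One iteration: aggregate pair features, transform, ReLU, add to node.

    nodes[i] is a feature list, pair_feats[i][j] the pair representation,
    edge_scores[i][j] the scalar edge score, transform a callable standing
    in for the linear map (its output must match the node dimension).
    """
    updated = []
    for i, node in enumerate(nodes):
        alpha = attention_weights(edge_scores[i], i)
        dim = len(pair_feats[i][next(iter(alpha))])
        aggregated = [
            sum(alpha[j] * pair_feats[i][j][k] for j in alpha) for k in range(dim)
        ]
        message = [max(0.0, m) for m in transform(aggregated)]  # ReLU
        updated.append([a + b for a, b in zip(node, message)])  # residual (assumed)
    return updated


nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Toy pair features: simply the neighbour's feature; equal edge scores.
pair_feats = [[nodes[j] for j in range(3)] for _ in range(3)]
edge_scores = [[0.0] * 3 for _ in range(3)]
refined = reasoning_step(nodes, pair_feats, edge_scores, transform=lambda x: x)
```

In a full model the edge scores would be recomputed from the refined node features before the next iteration, which is what makes the edge weights dynamic.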

III-C Loss

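The final output of the iterative reasoning module is fed to the classification module to classify each text region into one of the key information categories. Formally, our loss is defined as

$$\mathcal{L} = \sum_i \mathrm{CrossEntropy}(\mathbf{p}_i, y_i) \tag{15}$$

where $\mathbf{p}_i$ is the category prediction of the classification module for the final feature of node $r_i$, and $y_i$ is the key information category ground truth.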
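A node classification loss of this kind can be sketched as a cross-entropy over per-node logits. The sketch below is a generic illustration, assuming a softmax cross-entropy summed over nodes; whether the original implementation sums or averages, and how it weights categories, is not stated in the text.

```python
import math


def cross_entropy(logits, target):
    """Softmax cross-entropy for one node, computed in a numerically stable way."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]


def node_classification_loss(all_logits, targets):
    """Sum of per-node losses (summation rather than averaging is an assumption)."""
    return sum(cross_entropy(l, y) for l, y in zip(all_logits, targets))


# Two nodes, three categories; the second node is confidently correct.
logits = [[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]]
targets = [1, 0]
loss = node_classification_loss(logits, targets)
```

The first node contributes log 3 because its prediction is uniform, while the second contributes almost nothing.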


Category          WildReceipt  SROIE
Str nm key        3            0
Str nm value      1682         663
Str addr key      2            0
Str addr value    2347         1530
Tel key           452          0
Tel value         1067         0
Date key          342          0
Date value        1673         638
Time key          193          0
Time value        1525         0
Prod item key     368          0
Prod item value   8634         0
Prod qty key      334          0
Prod qty value    5272         0
Prod price key    373          0
Prod price value  8401         0
Subtotal key      1269         0
Subtotal value    1352         0
Tax key           1413         0
Tax value         1415         0
Tips key          151          0
Tips value        165          0
Total key         1783         0
Total value       2136         589
Others            26623        26108
TABLE III: Statistics of text bounding boxes of our WildReceipt and SROIE. To represent the category names more concisely, we abbreviate the words in the names, e.g., ‘Str nm’ and ‘Prod item’ denote the store name and the product item, respectively.

IV WildReceipt

IV-A Data Collection

We selected receipts to benchmark key information extraction, as SROIE [14] does, for the following reasons: (1) receipts are anonymous, and suitable for public release without leaking private information; (2) receipts are of varied templates since different companies usually have different templates, and are thus suitable for evaluating key information extraction from document images of unseen templates; (3) receipts are widely available and easy to collect; (4) extracting key information from receipts has many applications such as bookkeeping and reimbursement.

We collected and annotated WildReceipt in the following procedure.

  • Data collection. We searched receipt images on search engines with related key-words, such as receipt, invoice and so on. We downloaded about 4300 document images.

  • Data cleaning. We manually removed images that contain multiple receipts, are not receipts, or are unreadable, incomplete, or non-English.

  • Data annotation. We first labelled the text bounding boxes and their corresponding texts, and then labelled each bounding box to one of 25 key information categories (see Figure 3). These annotations were done by 6 experts.

The receipt images in WildReceipt we selected are captured in the wild. They are of non-front views and possibly with folds as shown in Figure 3. Therefore, WildReceipt is much more challenging than previous key information extraction benchmarks which focus on scanned documents only.

IV-B Statistics

The WildReceipt dataset consists of 1740 receipt images and 68975 text bounding boxes. Each image has about 39 text bounding boxes on average. Table III lists the annotation numbers of all 25 key information categories. Of the 25 key information categories, 12 are keys, 12 are their corresponding values, and 1 indicates others. There are many variants for one type of key, e.g., “Address”, “address”, and “Add.” all indicate the key category “Str addr key”. We believe that accurately identifying key categories can greatly benefit key information extraction, which is validated in our experiments. WildReceipt is 2 times and 3 times larger than SROIE [14] in terms of the image number and the category number. Besides, it contains fine-grained key information categories, e.g., “Product price value”, “Tax value”, “Tips value” and “Total value” are all related to amounts of money, and are difficult to distinguish from each other by their own textual or visual features without context information.

IV-C Evaluation Protocol

We randomly sampled 1268 and 472 images without replacement for training and test respectively. During sampling, we made sure these two sets had different templates according to store names and near-duplicate image retrieval [32]. In this way, the templates in the test set are unseen in the training set. Therefore, WildReceipt is suitable for evaluating key information extraction from document images of unseen templates. Table IV lists the statistics for the training set and the test set in WildReceipt.
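A template-disjoint split can be sketched by grouping images by a template identifier and assigning whole groups to one side. This is a simplified illustration, not the authors' procedure: the function name `split_by_template`, the use of a string template id (for example a store name) and the greedy fill rule are assumptions, and the near-duplicate image retrieval step is not modeled.

```python
import random
from collections import defaultdict


def split_by_template(images, test_fraction, seed=0):
    """Split (image_id, template_id) pairs so that no template is shared.

    Whole template groups are shuffled and assigned to the test set until it
    holds roughly test_fraction of the images; the rest go to training.
    """
    groups = defaultdict(list)
    for image_id, template_id in images:
        groups[template_id].append(image_id)
    templates = sorted(groups)
    random.Random(seed).shuffle(templates)
    target = test_fraction * len(images)
    train, test = [], []
    for template in templates:
        (test if len(test) < target else train).extend(groups[template])
    return train, test


# Hypothetical image ids and template ids.
images = [(0, "A"), (1, "A"), (2, "B"), (3, "C"), (4, "C"), (5, "D")]
train_ids, test_ids = split_by_template(images, test_fraction=0.3)
```

Because groups are never divided, every template in the test split is unseen during training, at the cost of only approximately matching the requested fraction.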

Performance on WildReceipt is evaluated by the F1 score, as in [14]. The F1 score averaged over the value categories is finally reported. WildReceipt will be publicly released to facilitate future research and fair comparison on key information extraction.
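The reported metric can be sketched as a per-category score averaged over the value categories. The sketch assumes that the score is the F1 score computed per text box from predicted and ground-truth category labels; the exact matching rules of the protocol in [14] are not reproduced here, and the function names are illustrative.

```python
def f1_score(pred, gold, category):
    """F1 of one category from parallel lists of predicted and gold labels."""
    tp = sum(1 for p, g in zip(pred, gold) if p == category and g == category)
    fp = sum(1 for p, g in zip(pred, gold) if p == category and g != category)
    fn = sum(1 for p, g in zip(pred, gold) if p != category and g == category)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def averaged_f1(pred, gold, categories):
    """Unweighted mean of per-category F1 over the given categories."""
    return sum(f1_score(pred, gold, c) for c in categories) / len(categories)


gold = ["Total value", "Tax value", "Total value", "Others"]
pred = ["Total value", "Total value", "Total value", "Others"]
score = averaged_f1(pred, gold, ["Total value", "Tax value"])
```

Here “Total value” reaches an F1 of 0.8 (precision 2/3, recall 1) and “Tax value” 0, so the average is 0.4; key categories and “Others” are excluded from the average.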

Dataset      Training  Test
Images       1268      472
Templates    847       345
Total boxes  49377     19598
TABLE IV: Statistics of training and test sets in WildReceipt.

V Experiments

In this section, the proposed approach SDMG-R is extensively evaluated on SROIE and WildReceipt. We first introduce the implementation details. Then, SDMG-R is compared with state-of-the-art key information extraction approaches quantitatively. Finally, we investigate the effectiveness of each component of our proposed method by ablation study.

V-A Implementation Details

Our implementation is based on PyTorch. Our models are trained on one NVIDIA Titan X GPU with 12 GB memory. During training, we randomly crop images with probability 0.5 while ensuring that no text box is cut off. During test, we do not crop images. In both training and test, all images are resized to

, and their text boxes are resized proportionally before being fed into the network. The whole network is trained from scratch with default initializer of PyTorch using Adam optimizer [16]

. We use a batch size of 4 during training. Maximum epoch number is set to

. The learning rate is set to initially. It is decreased via after 40 and 50 epochs.
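Two of the training details above can be sketched in plain Python: a random crop that never cuts a text box, and a step-wise learning rate schedule with milestones at epochs 40 and 50. Both functions are illustrative assumptions rather than the authors' code; in particular the crop sampling rule, the base learning rate and the decay factor used in the example are hypothetical values.

```python
import random


def random_crop_keeping_boxes(width, height, boxes, rng):
    """Sample a crop (left, top, right, bottom) that contains every box.

    Boxes are (x, y, h, w) with (x, y) the top-left corner.
    """
    left_limit = min(x for x, y, h, w in boxes)
    top_limit = min(y for x, y, h, w in boxes)
    right_limit = max(x + w for x, y, h, w in boxes)
    bottom_limit = max(y + h for x, y, h, w in boxes)
    left = rng.uniform(0, left_limit)
    top = rng.uniform(0, top_limit)
    right = rng.uniform(right_limit, width)
    bottom = rng.uniform(bottom_limit, height)
    return left, top, right, bottom


def maybe_crop(width, height, boxes, rng, probability=0.5):
    """Crop with the given probability, otherwise keep the full image."""
    if rng.random() < probability:
        return random_crop_keeping_boxes(width, height, boxes, rng)
    return 0, 0, width, height


def learning_rate(epoch, base_lr, decay, milestones=(40, 50)):
    """Multiply the base rate by `decay` once per milestone already reached."""
    reached = sum(1 for m in milestones if epoch >= m)
    return base_lr * decay ** reached


rng = random.Random(0)
boxes = [(40, 300, 18, 60), (220, 300, 18, 55)]
crop = maybe_crop(400, 600, boxes, rng)
```

After cropping, box coordinates would be shifted by the crop origin before the image and boxes are resized together.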

The cardinality of our dictionary is 91 (i.e., $D_c = 91$). It contains the digits 0-9, the letters a-z and A-Z, and special characters which are closely related to key information categories. They are “/”, “”, “.”, “$”, “€”, “₤”, “¥”, “:”, “-”, “*”, “#”, “(”, “)”, “%”, “@”, “!”, “”’, “&”, “=”, “¿”, “+”, “””, “”, “?”, “¡”, “[”, “]”, and “_”. All other characters in texts are mapped to one token “unknown”. The one-hot char encoding vectors are projected to a 32-dimensional space (i.e., $D_s = 32$). The dimension of the hidden vector of Bi-LSTM is set to . As for visual modality, we adopt the U-Net [29] as our visual feature extractor, and extract visual features on its last convolutional output feature maps, followed by a dimensional reduction to . Thus, we have . In the block tensor decomposition module, we set and . We set the graph node feature representation dimension to (i.e., ). The normalization constant is set to (i.e.,) in Equation (7). The 5-dimensional edge features are embedded into one 256-dimensional space (i.e., $D_e = 256$). The MLP in Equation (12) has two layers with one ReLU between them. Its hidden dimension is . The graph reasoning iteration number is set to 2 (i.e., $K = 2$) unless otherwise noted.
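The char dictionary and the mapping of out-of-dictionary characters to a single token can be sketched as follows. This is an illustration under stated assumptions: only an ASCII subset of the special characters is included (so the table has 82 entries rather than the 91 reported), and the token name `<unknown>` and the function names are hypothetical.

```python
import string

# ASCII subset of the special characters listed above (assumption: the
# non-ASCII and garbled entries of the list are left out of this sketch).
SPECIALS = "/.$:-*#()%@!&=+?[]_"
UNKNOWN = "<unknown>"


def build_dictionary(specials=SPECIALS):
    """Map digits, letters and special characters to consecutive indices."""
    chars = string.digits + string.ascii_lowercase + string.ascii_uppercase + specials
    table = {c: index for index, c in enumerate(chars)}
    table[UNKNOWN] = len(table)  # one shared index for all other characters
    return table


def encode(text, table):
    """Index sequence of a text string, one index per char."""
    return [table.get(c, table[UNKNOWN]) for c in text]


def one_hot(index, size):
    """One-hot vector whose dimension is the dictionary cardinality."""
    return [1 if k == index else 0 for k in range(size)]


table = build_dictionary()
indices = encode("$10.5", table)
vectors = [one_hot(i, len(table)) for i in indices]
```

With the full list of 28 special characters, the count would be 10 digits + 52 letters + 28 specials + 1 unknown token = 91, matching the reported cardinality.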

V-B Comparison with State-of-the-art Methods

We compare our proposed SDMG-R with two state-of-the-art approaches and their variants. Specifically, we evaluate the following methods:

  • Chargrid [9]. It models documents as two-dimensional grids of characters, which are fed into a fully convolutional neural network to predict segmentation masks.

  • Chargrid-UNet. For fair comparison, we also use U-Net as Chargrid’s backbone while keeping everything else unchanged. We name Chargrid with this setting Chargrid-UNet.

  • VRD [22]. It models documents with text bounding boxes as graphs, which are then fed into one CRF.

Method         Str nm  Str addr  Tel   Date  Time  Prod item  Prod qty  Prod price  Subtotal  Tax   Tips  Total  Avg
Chargrid       78.4    79.0      86.2  90.0  87.0  92.0       93.6      92.0        68.0      68.1  20.5  70.1   76.9
Chargrid-UNet  80.6    82.0      85.2  90.6  86.3  89.6       94.2      92.0        66.9      68.6  39.0  72.6   79.0
VRD            78.0    82.8      92.2  95.1  94.9  91.8       95.8      95.6        82.4      84.2  56.0  79.5   85.7
SDMG-R         79.8    85.7      94.0  95.7  94.7  93.9       95.6      97.1        87.9      89.5  67.9  82.4   88.7
TABLE V: Comparison with state-of-the-art methods on WildReceipt in terms of F1 score (%).

Method         Str nm  Str addr  Tel   Date  Time  Prod item  Prod qty  Prod price  Subtotal  Tax   Tips  Total  Avg
Chargrid       69.1    72.3      80.7  85.6  81.7  84.2       88.6      90.9        65.9      63.6  5.8   68.2   71.4
Chargrid-UNet  69.5    74.5      79.1  85.6  83.4  85.3       86.9      89.2        62.8      64.5  12.5  65.7   71.6
VRD            66.1    72.1      86.9  92.7  92.7  88.4       88.3      93.9        77.8      80.2  44.8  74.2   79.8
SDMG-R         75.4    78.7      90.0  92.0  92.2  88.7       93.0      94.1        82.1      82.0  45.2  75.7   82.4
TABLE VI: Comparison with state-of-the-art methods on WildReceipt with ground truth text boxes and recognized texts in terms of F1 score (%).

Method         Str nm  Str addr  Tel   Date  Time  Prod item  Prod qty  Prod price  Subtotal  Tax   Tips  Total  Avg
Chargrid       65.8    67.9      65.2  56.2  59.6  76.8       42.9      87.9        61.8      54.7  18.7  62.1   60.0
Chargrid-UNet  69.7    69.4      69.5  56.4  60.2  77.3       36.9      86.0        56.9      53.4  6.4   61.8   58.7
VRD            67.9    77.9      80.7  89.1  90.5  82.3       65.4      91.7        77.6      82.3  44.8  75.7   77.2
SDMG-R         72.6    79.8      80.3  89.0  88.7  86.8       80.3      92.3        78.0      79.8  50.0  73.4   79.3
TABLE VII: Comparison with state-of-the-art methods on WildReceipt with detected text boxes and recognized texts in terms of F1 score (%).

Method         SROIE
Chargrid       80.9
Chargrid-UNet  80.8
VRD            84.9
SDMG-R         87.1
TABLE VIII: Comparison with state-of-the-art methods on SROIE in terms of F1 score (%).

We compare our proposed method with its counterparts in Table V. Our SDMG-R outperforms all its competitors by impressive margins. Specifically, SDMG-R achieves 11.8%, 9.7%, and 3.0% absolute improvements in terms of the F1 score averaged over the 12 value categories on WildReceipt compared with Chargrid, Chargrid-UNet, and VRD respectively. Moreover, SDMG-R achieves the best score for 9 out of 12 categories. Our SDMG-R is greatly superior to Chargrid-UNet. We believe this is because of the long-range dependencies between texts learned via graphs. Compared with VRD, the performance gain of SDMG-R is attributed to our proposed U-Net based visual modality and Kronecker product based modality fusion. For the categories “Time” and “Prod qty”, our proposed SDMG-R and VRD are comparable.

In real applications, text boxes and texts are usually obtained by OCR engines, which might introduce text detection and recognition errors. To evaluate how those errors affect the performance of key information extraction, we employ the Google OCR API to detect and recognize texts of WildReceipt. For each detected text box, we label its key information category as that of the ground truth text region that has the maximum IOU with it. We compare our SDMG-R with state-of-the-art methods when texts are recognized using the OCR engine given ground truth text boxes in Table VI. Again, our proposed SDMG-R achieves the best averaged F1 score. Moreover, it outperforms its competitors in 10 out of 12 categories. Comparing Table V and Table VI, we observe a performance drop of about 6.3% (88.7% vs. 82.4%) in terms of the averaged F1 score if texts are recognized by the OCR engine. This is reasonable, as some texts, especially characters which are closely related to specific key information categories such as “$”, “€”, “₤”, “¥”, are misrecognized by the OCR engine, which results in noisy signals and poorly discriminative representations. Going further, we compare our method with other methods in the case that both text boxes and texts are predicted by the OCR engine in Table VII. Our proposed SDMG-R outperforms Chargrid, Chargrid-UNet and VRD by impressive margins. Note that there exist mismatches between detected text boxes and ground truth boxes, e.g., one detected text box might overlap with multiple ground truth ones, or one ground truth text box might overlap with multiple detected ones. Directly matching detected text boxes to the ground truth ones with maximum IOU might introduce noisy signals, which results in a further performance drop. However, our method is still superior to its counterparts, which validates its robustness against noise.
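The label assignment for detected boxes can be sketched as a maximum-IOU lookup against the ground truth. This is an illustrative sketch rather than the authors' code: boxes are given as corner coordinates (x1, y1, x2, y2), and the fallback label for a detected box that overlaps no ground truth box is an assumption, since the text does not describe that case.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def assign_labels(detected, gt_boxes, gt_labels, fallback="Others"):
    """Give each detected box the label of its maximum-IOU ground truth box.

    Boxes without any overlap receive `fallback` (an assumed convention).
    """
    labels = []
    for box in detected:
        best = max(range(len(gt_boxes)), key=lambda k: iou(box, gt_boxes[k]))
        labels.append(gt_labels[best] if iou(box, gt_boxes[best]) > 0 else fallback)
    return labels


gt_boxes = [(0, 0, 10, 10), (20, 0, 30, 10)]
gt_labels = ["Total key", "Total value"]
detected = [(1, 0, 11, 10), (19, 0, 29, 10), (100, 100, 110, 110)]
assigned = assign_labels(detected, gt_boxes, gt_labels)
```

When one detected box spans several ground truth boxes, this rule still returns a single label, which is one source of the label noise discussed above.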

We also compare our method with other state-of-the-art approaches on the SROIE dataset in Table VIII. Similar to the results on WildReceipt, our SDMG-R performs clearly better than the others. Specifically, it improves the F1 scores of Chargrid, Chargrid-UNet, and VRD by 6.2%, 6.3%, and 2.2% absolute, respectively. This demonstrates the superiority of our SDMG-R on scanned document images.

w/o textual features             80.1
w/o visual features              86.4
w/o spatial relation             81.8
w/o graph reasoning              77.2
w/o key category classification  84.3
SDMG-R                           88.7
TABLE IX: Effectiveness of components of SDMG-R on WildReceipt in terms of the averaged F1 score (%).
Method Settings
ConcatMLP 0.860
ConcatMLP 0.867
ConcatMLP 0.867
ConcatMLP 0.854
LinearSum 0.861
LinearSum 0.845
Our 0.859
Our 0.887
Our 0.857
TABLE X: Performance comparison of SDMG-R with various modality fusion methods on WildReceipt in terms of the averaged F1 score.

V-C Ablation Studies

We perform detailed ablation studies on WildReceipt to investigate the effectiveness of each component of our proposed SDMG-R.

Effects of visual and textual features. As shown in Table IX, the averaged score of SDMG-R on WildReceipt decreases absolutely by 8.6% (from 88.7 to 80.1) without textual features. Similarly, it decreases absolutely by 2.3% (to 86.4) without visual features. Both textual and visual features, and textual features in particular, thus contribute greatly to key information extraction.

Effects of spatial relation. To cancel out the spatial relation, we fix the edge weights in Equation (10) for all graph node pairs. The averaged score of SDMG-R then decreases to 81.8 on WildReceipt (Table IX), an absolute drop of 6.9%. Spatial relations between text boxes therefore play an important role in key information extraction and clearly boost its performance.

Effects of graph reasoning. Without graph reasoning, we directly conduct classification over the fused visual and textual features, which results in the largest performance degradation in Table IX, namely an absolute score drop of 11.5% (from 88.7 to 77.2) on WildReceipt. This suggests that message propagation between text regions refines their representations so that they can be correctly classified into their corresponding key information categories.

Effects of key category classification. In our WildReceipt dataset, we also annotate key categories such as “Str nm key”, although only the information of value categories needs to be extracted in real application scenarios. However, we experimentally find that key category classification can help the value category classification. As shown in Table IX, without it, our SDMG-R decreases absolutely by 4.4%.
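The absolute drops discussed above can be recomputed directly from the scores in Table IX. The snippet below is only a worked arithmetic check; the variable names are illustrative.

```python
# Averaged scores (%) copied from Table IX.
FULL_MODEL = 88.7
ablations = {
    "w/o textual features": 80.1,
    "w/o visual features": 86.4,
    "w/o spatial relation": 81.8,
    "w/o graph reasoning": 77.2,
    "w/o key category classification": 84.3,
}

# Absolute drop of each ablated variant relative to the full SDMG-R.
drops = {name: round(FULL_MODEL - score, 1) for name, score in ablations.items()}

# List the variants from the largest drop to the smallest.
for name, drop in sorted(drops.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: -{drop}")
```

Sorted this way, removing graph reasoning costs the most, followed by textual features, spatial relation, key category classification, and visual features.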

Effects of graph reasoning iteration number. We evaluate SDMG-R with three settings of the graph reasoning iteration number. The averaged score peaks at one setting and drops once the model starts to overfit; we use the best-performing setting in our experiments.

Effects of dual modality fusion module. The dual modality fusion module is the core component that fuses visual and textual features. We compare our module with its counterparts LinearSum and ConcatMLP as described in Section III-A. For a fair comparison, we enumerate the hidden dimension of the MLP in LinearSum and ConcatMLP, as well as the block size and the block number in our proposed dual modality fusion module, and report the corresponding results in Table X. ConcatMLP and LinearSum achieve best scores of 0.867 and 0.861, respectively, while our module reaches 0.887 under its best setting. Our proposed dual modality fusion module is thus effective and clearly outperforms its alternatives, namely LinearSum and ConcatMLP.

Fig. 4: Visualization of the learned dynamic weights between pairs of text regions. Each row shows one text region (blue rectangle) and its related regions (red rectangles) in one receipt image. The first column shows the weights learned in the first Graph Convolution Layer (GCL) of our graph reasoning module, while the second column shows those learned in the second GCL. For clarity, we only visualize the edges (red directed curves) incoming to one node whose weights (red numbers) are larger than 0.1 (best viewed in color).

V-D Visualization

To better understand how our SDMG-R learns the spatial relations between text regions, we visualize the learned edge weights in Figure 4. Interestingly, it highlights the edges between semantically related text regions even when they are far apart, e.g., the edges between “Total value”, “Subtotal value” and “Prod item value” (top left), those between “Total value”, “Total key”, and “Tax key” (top right), those between “Subtotal value” and “Subtotal key” (bottom left), and those between “Subtotal key”, “Tax key”, “Total key” and “Subtotal value” (bottom right). Compared with the first GCL (left column), the second GCL learns spatial relations that are more helpful for identifying the key information categories of text regions. For example, “TOTAL” on the left of “$30473.00” strongly indicates that “$30473.00” is an instance of “Total value” in the top right subfigure, and “TOTAL:” and “TAXES:” under “SUBTOTAL:” indicate that “SUBTOTAL:” is an instance of “Subtotal key” in the bottom right subfigure.
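The edge selection used for Figure 4 (incoming edges of one node whose learned weight exceeds 0.1) can be sketched as below. This is an illustrative helper, not our plotting code: the function name is hypothetical, and it assumes that `weights[i][j]` stores the learned weight of the edge from node `j` into node `i` and that self-loops are skipped.

```python
from typing import List, Sequence, Tuple


def incoming_edges_above(
    weights: Sequence[Sequence[float]],
    target: int,
    threshold: float = 0.1,
) -> List[Tuple[int, float]]:
    """Return (source, weight) pairs for edges into `target` with weight above `threshold`.

    Assumes weights[i][j] is the weight of the edge from node j into node i.
    Self-loops are skipped (an assumption made for this sketch).
    """
    edges = [
        (source, weight)
        for source, weight in enumerate(weights[target])
        if source != target and weight > threshold
    ]
    # Strongest incoming edges first.
    return sorted(edges, key=lambda edge: edge[1], reverse=True)


# Toy weights for four text regions; only row 0 (the target node) is relevant here.
toy_weights = [
    [0.00, 0.62, 0.05, 0.33],
    [0.50, 0.00, 0.25, 0.25],
    [0.10, 0.10, 0.00, 0.80],
    [0.30, 0.30, 0.40, 0.00],
]
print(incoming_edges_above(toy_weights, target=0))
```

Thresholding keeps the figure readable: in the toy example the weak edge from node 2 (weight 0.05) is dropped, and only the two dominant neighbors of node 0 remain.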

VI Conclusions

In this paper, we have proposed a novel spatial dual-modality graph reasoning model (termed SDMG-R) for key information extraction from unstructured documents. We have introduced the Kronecker product, approximated via block diagonal tensor decomposition, to fuse the visual and textual features. SDMG-R naturally learns spatial relations between text regions via dynamic attention in its graph reasoning module. We have validated the effectiveness of each component of the proposed SDMG-R through extensive experiments. Moreover, a new large key information extraction dataset, named WildReceipt, has been annotated to evaluate key information extraction on documents of unseen templates. It is fine-grained and captured in the wild, and thus more challenging and realistic than previous public datasets. It will be publicly released to facilitate future research. Experimental results on both SROIE and our WildReceipt have shown that our proposed SDMG-R consistently outperforms state-of-the-art key information extraction methods by clear margins.


  • [1] H. Ben-younes, R. Cadene, N. Thome, and M. Cord (2019) BLOCK: Bilinear Superdiagonal Fusion for Visual Question Answering and Visual Relationship Detection. In AAAI, Vol. 33, pp. 8102–8109. Cited by: §II, §III-A.
  • [2] H. Ben-younes, M. Cord, and N. Thome (2017) MUTAN: Multimodal Tucker Fusion for VQA. In ICCV, pp. 2612–2620. Cited by: §II.
  • [3] F. Cesarini, E. Francesconi, M. Gori, and G. Soda (2003) Analysis and understanding of multi-class invoices. Document Analysis and Recognition 6 (2), pp. 102–114. Cited by: §I, §II.
  • [4] T. Chen, Z. Wang, G. Li, and L. Lin (2018) Recurrent Attentional Reinforcement Learning for Multi-label Image Recognition. In AAAI, Cited by: §II.
  • [5] J. P. Chiu and E. Nichols (2016) Named entity recognition with bidirectional lstm-cnns. Transactions of the Association for Computational Linguistics 4, pp. 357–370. Cited by: §II.
  • [6] V. P. D’Andecy, E. Hartmann, and M. Rusinol (2018) Field extraction by hybrid incremental and a-priori structural templates. In 2018 13th IAPR International Workshop on Document Analysis Systems (DAS), pp. 251–256. Cited by: §II.
  • [7] H. Dai, B. Dai, and L. Song (2016) Discriminative embeddings of latent variable models for structured data. In ICML, pp. 2702–2711. Cited by: §II.
  • [8] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams (2015) Convolutional networks on graphs for learning molecular fingerprints. In NeurIPS, pp. 2224–2232. Cited by: §II.
  • [9] A. R. Katti, C. Reisswig, C. Guder, S. Brarda, S. Bickel, J. Höhne, and J. B. Faddoul (2018) Chargrid: Towards Understanding 2D Documents. In EMNLP, External Links: arXiv:1809.08799v1 Cited by: §II, 1st item.
  • [10] A. Fornés, V. Romero, A. Baró, J. I. Toledo, J. A. Sánchez, E. Vidal, and J. Lladós (2017) ICDAR2017 competition on information extraction in historical handwritten records. In ICDAR, Vol. 1, pp. 1389–1394. Cited by: §I.
  • [11] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach (2016) Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding. In EMNLP, pp. 457–468. Cited by: §II.
  • [12] R. Girshick (2015) Fast r-cnn. In ICCV, Cited by: §III-A.
  • [13] Y. Hoshen (2017) Vain: attentional multi-agent predictive modeling. In NeurIPS, pp. 2701–2711. Cited by: §II.
  • [14] Z. Huang, K. Chen, J. He, X. Bai, D. Karatzas, S. Lu, and C. V. Jawahar (2019) ICDAR 2019 Robust Reading Challenge on Scanned Receipts OCR and Information Extraction. Cited by: §I, §II, §IV-A, §IV-B, §IV-C.
  • [15] J. H. Kim, K. W. On, W. Lim, J. Kim, J. W. Ha, and B. T. Zhang (2017) Hadamard Product for Low-rank Bilinear Pooling. In ICLR, pp. 1–14. Cited by: §II.
  • [16] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §V-A.
  • [17] T. Kipf, E. Fetaya, K. Wang, M. Welling, and R. Zemel (2018) Neural relational inference for interacting systems. arXiv preprint arXiv:1802.04687. Cited by: §II.
  • [18] T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §II.
  • [19] Z. Kuang, Y. Gao, G. Li, P. Luo, Y. Chen, L. Lin, and W. Zhang (2019) Fashion Retrieval via Graph Reasoning Networks on a Similarity Pyramid. In ICCV, Cited by: §II.
  • [20] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer (2016) Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360. Cited by: §II.
  • [21] X. Liang, X. Shen, J. Feng, L. Lin, and S. Yan (2016) Semantic Object Parsing with Graph LSTM. In ECCV, External Links: arXiv:1603.07063v1 Cited by: §II.
  • [22] X. Liu, F. Gao, Q. Zhang, and H. Zhao (2019) Graph Convolution for Multimodal Information Extraction from Visually Rich Documents. In NAACL, pp. 32–39. Cited by: §II, 3rd item.
  • [23] W. Luo, Y. Li, R. Urtasun, and R. Zemel (2016) Understanding the effective receptive field in deep convolutional neural networks. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 4898–4906. Cited by: §II.
  • [24] X. Ma and E. Hovy (2016) End-to-end sequence labeling via bi-directional lstm-cnns-crf. arXiv preprint arXiv:1603.01354. Cited by: §II, §III-A.
  • [25] E. Medvet, A. Bartoli, and G. Davanzo (2011) A probabilistic approach to printed document understanding. International Journal on Document Analysis and Recognition (IJDAR) 14 (4), pp. 335–347. Cited by: §I.
  • [26] B. Ni, X. Yang, and S. Gao (2016) Progressively parsing interactional objects for fine grained action detection. In CVPR, pp. 1020–1028. Cited by: §II.
  • [27] R. B. Palm, O. Winther, and F. Laws (2017) CloudScan-a configuration-free invoice analysis system using recurrent neural networks. In ICDAR, Vol. 1, pp. 406–413. Cited by: §I, §II, §II.
  • [28] N. Peng, H. Poon, C. Quirk, K. Toutanova, and W. Yih (2017) Cross-sentence n-ary relation extraction with graph lstms. Transactions of the Association for Computational Linguistics 5, pp. 101–115. Cited by: §II.
  • [29] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), Vol. 9351, pp. 234–241. Cited by: §III-A, §V-A.
  • [30] M. Rusiñol, T. Benkhelfallah, and V. P. D’Andecy (2013) Field extraction from administrative documents by incremental structural templates. In 2013 12th International Conference on Document Analysis and Recognition, pp. 1100–1104. Cited by: §I, §II.
  • [31] D. Schuster, K. Muthmann, D. Esser, A. Schill, M. Berger, C. Weidling, K. Aliyev, and A. Hofmeier (2013) Intellix - End-User Trained Information Extraction for Document Archiving. In ICDAR, Cited by: §I, §II, §II.
  • [32] J. Sivic and A. Zisserman (2003) Video google: a text retrieval approach to object matching in videos. In ICCV, pp. 1470–1477. Cited by: §IV-C.
  • [33] S. Sukhbaatar, A. Szlam, and R. Fergus (2016) Learning multiagent communication with backpropagation. In NeurIPS, pp. 2244–2252. Cited by: §II.
  • [34] D. Teney, L. Liu, and A. v. D. Hengel (2017) Graph-Structured Representations for Visual Question Answering. In CVPR, pp. 1–9. Cited by: §II.
  • [35] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (2018) Graph attention networks. In International Conference on Learning Representations, Cited by: §III-B.
  • [36] X. Wang and A. Gupta (2018) Videos as space-time region graphs. In ECCV, pp. 399–417. Cited by: §II.
  • [37] Z. Wang, T. Chen, J. Ren, W. Yu, H. Cheng, and L. Lin (2018) Deep Reasoning with Knowledge Graph for Social Relationship Understanding. In IJCAI, External Links: arXiv:1807.00504 Cited by: §II.
  • [38] S. Yan, Y. Xiong, and D. Lin (2018) Spatial temporal graph convolutional networks for skeleton-based action recognition. In AAAI, Cited by: §II.
  • [39] J. Ye, L. Wang, G. Li, D. Chen, S. Zhe, X. Chu, and Z. Xu (2018) Learning compact recurrent neural networks with block-term tensor decomposition. In CVPR, pp. 9378–9387. Cited by: §III-A.