
One-shot Key Information Extraction from Document with Deep Partial Graph Matching

Automating Key Information Extraction (KIE) from documents improves efficiency, productivity, and security in many industrial scenarios such as rapid indexing and archiving. Many existing supervised learning methods for the KIE task require a large number of labeled samples and learn separate models for different types of documents. However, collecting and labeling a large dataset is time-consuming and is not a user-friendly requirement for many cloud platforms. To overcome these challenges, we propose a deep end-to-end trainable network for one-shot KIE using partial graph matching. Unlike previous methods, in which similarity learning and combinatorial solving are optimized separately, our method learns the two processes in an end-to-end framework. Existing one-shot KIE methods are either template-based or simple attention-based learning approaches that struggle to handle texts shifted away from their expected positions by printers, as illustrated in Fig. 1. To solve this problem, we add a one-to-(at most)-one constraint so that the globally optimized solution is found even if some texts have drifted. Further, we design a multimodal context ensemble block to boost performance by fusing spatial, textual, and aspect features. To promote research on KIE, we collected and annotated a one-shot document KIE dataset named DKIE with diverse types of images. The DKIE dataset consists of 2.5K document images captured by mobile phones in natural scenes, and it is the largest available one-shot KIE dataset to date. Experiments on DKIE show that our method achieves state-of-the-art performance compared with recent one-shot and supervised learning approaches. The dataset and proposed one-shot KIE model will be released soon.





I Introduction

Companies often need to search through and extract the information they are interested in from an unorganized mix of physical paper and digital documents. This process can be time-consuming and tedious. To automate it, researchers have studied the Key Information Extraction (KIE) task [1, 2, 3]. KIE is therefore crucial to a company's efficiency and productivity, and it has been used successfully in many industrial scenarios, such as fast indexing and efficient archiving.

A typical KIE pipeline consists of three key steps: text detection, text recognition, and text field labeling, as shown in Fig. 2. While text detection and recognition approaches [4, 5, 6] have been studied widely in the area of Optical Character Recognition (OCR), one-shot-learning-based text field labeling is less studied. The text field labeling task aims to assign a predefined label to each text field.

Fig. 1: Samples containing drifted fields in the d0 dataset. (a) Support document. (b) Query document. The red boxes represent landmarks (static zones), and the blue ones indicate fields (dynamic zones). Users annotate the support fields in the support document with predefined labels. The labels of query fields in a query document can then be inferred in three steps. First, align each query field with its corresponding landmark, as the yellow lines do. Second, find the support fields that align to the same landmarks, so that a mapping between a support field and a query field exists if they align to the same landmark. Finally, use the labels of the support fields as the labels of their mapped query fields. Drifted fields, which are caused by the printer, are likely to be misaligned with their corresponding landmarks.

Fig. 2: The pipeline for extracting user-predefined key information. (a) Receipt, (b) Text detection, (c) Text recognition, (d) Text field labeling.

The layout of a document plays a key role in distinguishing different fields. For example, as illustrated in Fig. 2, the Total_Amount 7.60 is much more likely to appear above Tendered 10.00. Fig. 20 shows some document images with different layouts and categories. Many learning-based methods [7, 8, 9, 10, 11] have been proposed to utilize both textual and visual patterns for the KIE task. They have shown good performance, but they require sufficient training data. To reduce labeling cost and alleviate the dependence on a large amount of training data and a separate model for each type of document, one-shot learning methods have been studied. Early one-shot methods [12, 13, 2, 7] are usually based on templates for entity extraction. However, these rule-based methods are limited to specific layouts and are not general enough to scale to all types of documents. Cheng et al. [1] proposed an attention-based learning approach that transfers the spatial relationships between landmarks and fields from a support document to a query document.

However, their method cannot deal with drifted fields and outliers. In practice, printed documents often contain drifted fields and outliers, as shown in Fig. 1. Drifted fields are fields printed in unexpected positions. The spatial relationships between landmarks and drifted fields therefore differ from those between landmarks and non-drifted fields, and a direct transfer from the support document to the query document fails because of this difference. Outliers are fields that do not match any field in the support document, such as unexpected handwritten words. Their method cannot pick out outliers either.

To address the challenges of drifted fields and outliers, we cast the text field labeling task as a partial graph matching problem. Our method uses multiple features, such as position, shape, and text embedding, to measure the similarity score between a support field and a query field. It then maps support fields to query fields so as to maximize the sum of the similarity scores of all mapped pairs. In particular, our method obeys the one-to-(at most)-one mapping constraint when it searches for the mapping between fields. This constraint helps map drifted query fields to the correct support fields even if other support fields are more similar. Our method also leaves outliers unmatched.

The major contributions of this paper can be summarized as follows:

  • We propose a deep end-to-end trainable network for one-shot Key Information Extraction (KIE) using partial graph matching with the one-to-(at most)-one mapping constraint. Unlike many previous methods, which treat similarity learning and combinatorial optimization as two explicitly separated phases, our method learns both in an end-to-end framework. To the best of our knowledge, this is the first KIE approach that generates globally optimized solutions.

  • We design a simple context ensemble block to fuse spatial, textual, and aspect features. The proposed framework is general enough to plug in other constraints, such as a zero-assignment constraint, to adapt to different KIE tasks.

  • To promote research in KIE, we construct a dataset, which will be released soon together with the proposed one-shot KIE model. The dataset covers diverse types of document images, many of which are highly difficult because of spatial drift.

  • Our method achieves state-of-the-art performance on the collected datasets.

Fig. 3: Overview of the proposed model. In step (a), we build the graphs and extract and concatenate vertex and edge features. In step (b), we feed the different features into separate Multi-layer Perceptrons (MLPs), whose outputs are vertex and edge affinity matrices. In step (c), we average the vertex and edge affinity matrices over the different MLPs. In step (d), the averaged vertex and edge affinity matrices are fed into a combinatorial solver such as the ZAC-GM solver, whose output is the predicted permutation matrix, with elements that are 1 or 0. In step (e), we calculate the Hamming loss between the predicted permutation matrix and the ground truth. Lastly, we compute the gradients of the Hamming loss with respect to the parameters of the MLPs. Each marked entry in the gradient matrix means that the corresponding element is non-zero.

II Related Work

In this section, we first review previous work on keyword spotting, KIE, and one-shot learning for KIE, and then discuss the graph matching approaches that inspired our method.

II-A Keyword Spotting

Methods for the KeyWord Spotting (KWS) [14] task cannot solve the KIE task. KWS checks whether a given text exists in an image and finds its location. KIE, in contrast, aims to assign a label to each field based on text detection and recognition results, e.g., identifying “Tom” as “name” on an ID card. KWS cannot find “Tom” from the support document because the name varies across ID cards. Thus, neither the methods nor the datasets for KWS are suitable for KIE.

II-B Key Information Extraction

Language-model-based methods [15, 16, 17] work on plain text representations. However, document layout information is also crucial for information extraction. Therefore, many existing learning-based methods [18, 19, 20] use both textual and visual embeddings to enhance KIE performance.

Liu et al. [10] introduced a method that combines visual and textual information in an image with a graph convolution model. Yu et al. [9] presented a layout extraction framework that combines graph learning with graph convolutions, which results in rich semantic representations of textual, visual, and layout information. Zhang et al. [21] fused the embeddings of visual and textual representations such that the two tasks can reinforce each other during learning. Inspired by BERT [15], Xu et al. [8] proposed a pre-training method that jointly models text and layout information within a single framework. However, this method requires explicit segmentation of individual words, so some modern OCR approaches are not applicable.

While the approaches discussed above achieve promising results, they require training a separate model for each type of document, which wastes resources. Additionally, a large number of labeled images must be collected and manually annotated for each category of document, which is labor-intensive and time-consuming.

II-C One-shot Learning of KIE

Medvet et al. [22] proposed a probabilistic model to search for key information in a document. However, their method requires the two sequences to have the same length. Rusinol et al. [2] presented an iterative framework to extract information from administrative documents. They introduced a star graph to model the spatial relationships among different fields. The weights of each node were adapted by term frequency-inverse document frequency (TF-IDF). However, in scenarios such as invoices, where most words are digits, TF-IDF is not robust enough.

Cheng et al. [1] presented a one-shot field labeling method using attention and belief propagation to retrieve structured information. Although their method dramatically simplifies the labeling process and achieves good performance compared with previous one-shot approaches, the final matching results are not globally optimal. For example, as illustrated in Fig. 1, two different fields were labeled as the same class due to the vertical drift caused by the printer.

Existing one-shot approaches are mostly rule-based and struggle to identify text fields that are close to each other. In particular, key information extraction performance drops sharply when a large spatial drift is observed between a landmark and its corresponding fields. These performance drops suggest that existing models are sensitive to variations in spatial relationships. This paper proposes a deep end-to-end trainable structured information extraction framework that is topology-invariant and globally optimized, which alleviates cases in which two different fields are mapped to the same category.

II-D Graph Matching

Graph matching approaches have been widely used in computer vision tasks, such as key-points matching. In this subsection, we focus on deep learning methods for graph matching.

Hammami et al. [23] proposed a subgraph-isomorphism-based method to extract informative areas in administrative and commercial forms using color information. The information extraction task is converted into searching the graph of a query document for the sub-graph that best matches the graph representation of the support document. However, many documents are scanned in black and white, which limits the applicability of this method.

Andrei Zanfir and Cristian Sminchisescu [24] proposed to model deep feature extraction and combinatorial optimization as an end-to-end learning framework. Wang et al. [25] presented an end-to-end differentiable deep combinatorial learning method for graph matching. In contrast to the pixel offset loss [24], a permutation loss based on a Sinkhorn net was employed to handle an arbitrary number of nodes in combinatorial graph matching. Further, Wang et al. [26] embedded the learning of affinities and the solving of the matching problem into a uniform framework instead of handling them separately [24].

(a) Samples contain multi-region fields.
(b) Samples contain drifted fields.
(c) Samples contain outliers.
Fig. 7: The one-to-(at most)-one mapping constraint helps resolve the problems of drifted fields and outliers. The red bounding boxes are landmarks, and the remaining boxes are fields. We omit some boxes of both landmarks and fields for clearer illustration. The handwritten words in c2 and c4 are the same.

III Our Model

In this section, we introduce our framework, shown in Fig. 3, in detail. In the first subsection, we define the notation for graphs. In the second subsection, we discuss how to formulate the partial graph matching problem and how to annotate training data to avoid many-to-many mappings between fields, which violate the definition of the graph matching problem [1]. In the third subsection, we report important details of constructing graphs that consist of fields. In the fourth subsection, we propose to use different MLP modules to calculate similarity scores between fields or edges based on different features. In the last subsection, we apply two solvers to the partial graph matching problem based on the similarity scores.

III-A Notations on Graphs

We follow [23] in calling the set of dynamic text regions Fields. Each field in a document is identified by a superscript index, and each field has its vertex features and a label. There are several ways to generate such node features. First, we can use the set of static text regions, called Landmarks, to generate the spatial feature of each field. Second, we can use text embedding to generate the semantic feature of each field. Third, the aspect of each field, namely the width and height of its OCR bounding box, can also be a useful feature. For a document, we define the set of its fields, the set of node features, and the set of labels. Notation for edges is flexible: given two fields, the directed edge from the first to the second can be written in either of two equivalent forms. We also define the set of all edges. A document can then be represented as a quadruple of its fields, node features, labels, and edges.

We use subscripts to differentiate the support and query documents, and we index the fields of the query document in the same way as those of the support document. The one-shot KIE problem is to predict the label of each query field so that it agrees with the ground-truth label of that field. We propose to solve the one-shot KIE problem using partial graph matching: if a query field is matched with a support field, then the model predicts the label of the query field to be the label of that support field.

III-B Solving one-shot KIE with Partial Graph Matching

Based on the above notations, the formulation of partial graph matching requires two additional concepts. We follow Burkard et al.[27] to use the concave quadratic formulation of the graph matching problem. Partial graph matching shares the same concepts but has different constraints.

The first one is the permutation matrix, whose element is 1 if a query field is matched with a support field and 0 otherwise. This matrix describes the matching between the query document and the support document, and it has one element for each pair of a query field and a support field.

The second one is the affinity matrix, which is a square matrix that operates on the vectorized version of the permutation matrix. Its number of rows and columns therefore equals the number of elements of the permutation matrix. The elements of the affinity matrix have different meanings in different positions. The off-diagonal elements describe how similar two edges are, where one edge comes from the support graph and the other comes from the query graph. The diagonal elements describe how similar two fields are, i.e., each diagonal element is the similarity score between a query field and a support field.
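To make the layout of the affinity matrix concrete, the following minimal Python sketch builds a small square affinity matrix for 2 query and 2 support fields and evaluates the quadratic score of a candidate matching. The row-major flattening, the helper names `vec_index`, `build_affinity`, and `score`, and all numbers are assumptions made for illustration; they are not taken from the paper.

```python
def vec_index(q, s, n_support):
    """Position of the (query q, support s) pair in the flattened matrix (row-major, assumed)."""
    return q * n_support + s


def build_affinity(n_query, n_support, vertex_aff, edge_aff):
    """Square affinity matrix: diagonal = field similarity, off-diagonal = edge similarity.

    vertex_aff[(q, s)]             -> similarity of query field q and support field s
    edge_aff[((q1, q2), (s1, s2))] -> similarity of query edge q1->q2 and support edge s1->s2
    """
    size = n_query * n_support
    K = [[0.0] * size for _ in range(size)]
    for (q, s), value in vertex_aff.items():
        i = vec_index(q, s, n_support)
        K[i][i] = value
    for ((q1, q2), (s1, s2)), value in edge_aff.items():
        i = vec_index(q1, s1, n_support)
        j = vec_index(q2, s2, n_support)
        K[i][j] = value
    return K


def score(X, K):
    """Quadratic score x^T K x, where x is the flattened 0/1 matching matrix X."""
    x = [entry for row in X for entry in row]
    return sum(x[i] * K[i][j] * x[j] for i in range(len(x)) for j in range(len(x)))


# Illustrative numbers only.
vertex_aff = {(0, 0): 0.9, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.8}
edge_aff = {((0, 1), (0, 1)): 0.7}  # query edge 0->1 resembles support edge 0->1
K = build_affinity(2, 2, vertex_aff, edge_aff)

identity_match = [[1, 0], [0, 1]]  # query 0 -> support 0, query 1 -> support 1
swapped_match = [[0, 1], [1, 0]]   # query 0 -> support 1, query 1 -> support 0
print(score(identity_match, K), score(swapped_match, K))
```

The edge term only contributes when both endpoints of the query edge are matched to the endpoints of the support edge, which is why the identity matching scores higher here.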

Finally, the partial graph matching problem is formulated as a constrained optimization problem: the objective (1) is maximized subject to the constraints (2), which consist of two inequalities written with a column vector whose elements are all one. The notation in equations (1) and (2) is explained in subsections A and B. The first inequality in equation (2) forbids a feasible permutation matrix from matching multiple support fields with one query field. The second inequality forbids it from matching multiple query fields with one support field. Both inequalities allow some support and query fields to match with no field. The first term in equation (1) sums over all possible matchings between support and query fields to calculate the vertex similarity score. The second term sums over all possible matchings between support and query edges to calculate the edge similarity score. A query edge is matched with a support edge if and only if both of its endpoints are matched with the corresponding endpoints of the support edge.
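The objective and constraints described above can be written compactly as follows. This is a sketch that agrees with the verbal description only; the symbols $X$ (permutation matrix with rows indexed by query fields and columns by support fields), $K^{v}$ and $K^{e}$ (vertex and edge affinities), $n_q$ and $n_s$ (numbers of query and support fields), and $\mathbf{1}$ (all-ones column vector) are notation assumed here and may differ from the paper's own.

```latex
\begin{align}
\max_{X \in \{0,1\}^{n_q \times n_s}} \quad
  & \sum_{i,a} K^{v}_{ia}\, X_{ia}
  + \sum_{(i,j)} \sum_{(a,b)} K^{e}_{ij,ab}\, X_{ia}\, X_{jb} \tag{1} \\
\text{s.t.} \quad
  & X \mathbf{1}_{n_s} \le \mathbf{1}_{n_q}, \qquad
    X^{\top} \mathbf{1}_{n_q} \le \mathbf{1}_{n_s} \tag{2}
\end{align}
```

Here $(i,j)$ ranges over query edges and $(a,b)$ over support edges, so the product $X_{ia} X_{jb}$ equals 1 exactly when both endpoints of the two edges are matched. The first inequality bounds each row sum (one query field, at most one support field) and the second bounds each column sum.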

Fig. 7 shows how the one-to-(at most)-one constraint can be ensured and how it helps resolve the problems of drifted fields and outliers. In practice, a document may contain many multi-line fields, such as the fields in subfigure (a). They are supposed to have the same label but are bounded by separate boxes. Cheng et al. [1] suggest using the average box of the support fields that share the same label to match with query fields. As shown in a1 and a2, their method leads to a one-to-many mapping that violates the one-to-(at most)-one mapping constraint. Instead, we propose to add a numeric suffix to the label of each multi-line support field, so that a one-to-(at most)-one mapping between multi-region fields is possible, as shown in a3 and a4. We remove the numeric suffix after prediction to restore the original label of each field.
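The suffix trick for multi-region fields can be sketched in a few lines of Python. The `#` separator, the function names `add_suffixes` and `strip_suffix`, and the example labels are hypothetical; the text only states that a numeric suffix is added and later removed.

```python
import re
from collections import Counter


def add_suffixes(labels):
    """Give each repeated label a numeric suffix so every support field has a unique label."""
    totals = Counter(labels)
    seen = Counter()
    result = []
    for label in labels:
        if totals[label] > 1:
            seen[label] += 1
            result.append(f"{label}#{seen[label]}")
        else:
            result.append(label)
    return result


def strip_suffix(label):
    """Remove the numeric suffix after prediction to restore the original label."""
    return re.sub(r"#\d+$", "", label)


support_labels = ["address", "address", "date", "address"]
unique_labels = add_suffixes(support_labels)
print(unique_labels)                             # ['address#1', 'address#2', 'date', 'address#3']
print([strip_suffix(l) for l in unique_labels])  # ['address', 'address', 'date', 'address']
```

Because every support field now carries a distinct label, three query boxes that belong to the same multi-line field can each be matched to their own support box without violating the constraint.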

In subfigure (b), the yellow line segments indicate that the fields in b2 and b4 have drifted downward compared with b1 and b3. A direct transfer of the spatial relationships from b1 to b2 fails: in particular, one support field is mapped to two query fields at the same time. The one-to-(at most)-one constraint forbids our model from doing so. In b2 and b4, to satisfy this constraint, our model chooses to map each support field to the correct query field even though the two are not the most similar fields to each other in terms of spatial relationship.

In subfigure (c), c2 and c4 both contain the same outlier, a handwritten word. Without the constraint, the method of Cheng et al. [1] maps a wrong support field to this outlier, as shown in c1 and c2. In contrast, our model can refuse to match any support field with the outlier because of the constraint.
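The effect of the constraint on drifted fields and outliers can be illustrated with a toy example. The sketch below uses exhaustive search over a hand-made similarity matrix; it is not the DD-ILP or ZAC-GM solver, and it assumes that similarity scores may be negative so that leaving an outlier unmatched can be the best choice. All names and numbers are hypothetical.

```python
def best_partial_matching(sim):
    """Exhaustively search one-to-(at most)-one matchings that maximize total similarity.

    sim[q][s] is the similarity of query field q and support field s.
    Returns a list with the matched support index (or None) for each query field.
    """
    n_query, n_support = len(sim), len(sim[0])
    best = {"score": float("-inf"), "assignment": None}

    def search(q, used, assignment, total):
        if q == n_query:
            if total > best["score"]:
                best["score"], best["assignment"] = total, list(assignment)
            return
        # Option 1: leave query field q unmatched (it is treated as an outlier).
        search(q + 1, used, assignment + [None], total)
        # Option 2: match q with any support field that is still free.
        for s in range(n_support):
            if s not in used:
                search(q + 1, used | {s}, assignment + [s], total + sim[q][s])

    search(0, frozenset(), [], 0.0)
    return best["assignment"]


def independent_argmax(sim):
    """Baseline without the constraint: every query field picks its most similar support field."""
    return [max(range(len(row)), key=row.__getitem__) for row in sim]


# Rows: query fields; columns: support fields. Query 1 has drifted toward support 0,
# and query 2 is an outlier whose scores are all negative.
sim = [
    [0.9, 0.1],
    [0.6, 0.5],
    [-0.4, -0.2],
]
print(independent_argmax(sim))     # [0, 0, 1]
print(best_partial_matching(sim))  # [0, 1, None]
```

Without the constraint, two query fields claim support field 0 and the outlier is forced onto support field 1. With the constraint, the drifted field falls back to its correct partner and the outlier stays unmatched.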

III-C Document Graph Construction

Graph Vertices. We only regard fields as graph vertices for both support and query documents.

We use landmarks to generate spatial features for the fields. Specifically, for a target field, the line segments connecting its center point with all landmarks are arranged as a 2D matrix whose size depends on the number of landmarks. The spatial features of the different fields are then stacked, so that the overall shape of the spatial features of a document depends on both the number of fields and the number of landmarks. We also use the OCR bounding box of each field to generate its aspect feature, i.e., the height and width of the bounding box are concatenated into a 2-dimensional feature, so the aspect features of a document contain one 2-dimensional vector per field. We use average word embedding to generate the textual feature of each field. We use a pre-trained 300-dimensional word embedding [28] and freeze it during training, so the textual features of a document contain one 300-dimensional vector per field.
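A minimal sketch of the spatial and aspect features follows. It assumes that boxes are given as (x_min, y_min, x_max, y_max), that a line segment is encoded as the displacement (dx, dy) from the field center to the landmark center, and that landmarks are also represented by their centers; none of these encoding details is specified above, and the function names are hypothetical.

```python
def center(box):
    """Center point of an axis-aligned box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def spatial_feature(field_box, landmark_boxes):
    """One (dx, dy) row per landmark: the segment from the field center to the landmark center."""
    fx, fy = center(field_box)
    rows = []
    for landmark in landmark_boxes:
        lx, ly = center(landmark)
        rows.append((lx - fx, ly - fy))
    return rows


def aspect_feature(field_box):
    """Height and width of the OCR bounding box as a 2-dimensional feature."""
    x_min, y_min, x_max, y_max = field_box
    return (y_max - y_min, x_max - x_min)


landmarks = [(0, 0, 20, 10), (0, 40, 20, 50)]
fields = [(30, 0, 70, 10), (30, 40, 90, 50)]

spatial = [spatial_feature(f, landmarks) for f in fields]  # fields x landmarks x 2
aspect = [aspect_feature(f) for f in fields]               # fields x 2
print(spatial[0])  # [(-40.0, 0.0), (-40.0, 40.0)]
print(aspect)      # [(10, 40), (10, 60)]
```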

For all documents, the landmarks and fields are detected by OCR systems automatically and then labeled manually. For each type of document, we will select one document as the support document, and the rest will serve as query documents. The support document should be as complete as possible.

We will remove the extra landmarks for the query document and repair the missing ones compared to the support document. If a field is split into several parts because of an imperfect OCR system, then we will merge these fields. Note that this operation is possible only for the training data. The model will assign the “outliers” label to the extra fields during the evaluation process.

Graph Edges. For each document, we build a visibility graph among fields and then apply Prim's algorithm [29] to obtain the minimum spanning tree of this graph. This tree is used as the final graph. Specifically, each field emits 36 rays to search for its visible neighbors. The resulting visibility graph may contain many loops. We find that, in practice, it is important to remove all loops from the graph using Prim's algorithm. The shorter edges connecting neighboring fields should be preserved to obtain better performance. Each edge has two types of features: 1) The direction feature is the line segment connecting the two fields. 2) The aspect feature is a 4-dimensional feature obtained by concatenating the height and width of the start field with those of the end field.
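The loop-removal step can be sketched with a textbook implementation of Prim's algorithm. The sketch below takes the visibility edges as given (the ray-casting step is omitted), weights each edge by the Euclidean distance between field centers, and assumes the graph is connected; the function name and the sample coordinates are hypothetical.

```python
import heapq
import math


def prim_mst(centers, candidate_edges):
    """Minimum spanning tree of an undirected graph using Prim's algorithm.

    centers: list of (x, y) field centers.
    candidate_edges: iterable of (i, j) index pairs, e.g. from a visibility graph.
    Returns the tree edges as sorted (i, j) pairs; assumes the graph is connected.
    """
    neighbors = {i: [] for i in range(len(centers))}
    for i, j in candidate_edges:
        weight = math.dist(centers[i], centers[j])  # shorter edges are preferred
        neighbors[i].append((weight, j))
        neighbors[j].append((weight, i))

    visited = {0}
    frontier = [(w, 0, j) for w, j in neighbors[0]]
    heapq.heapify(frontier)
    tree = []
    while frontier and len(visited) < len(centers):
        weight, i, j = heapq.heappop(frontier)
        if j in visited:
            continue  # adding this edge would close a loop
        visited.add(j)
        tree.append((min(i, j), max(i, j)))
        for w, k in neighbors[j]:
            if k not in visited:
                heapq.heappush(frontier, (w, j, k))
    return sorted(tree)


centers = [(0, 0), (10, 0), (10, 10), (0, 10)]
visibility_edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]  # contains loops
tree_edges = prim_mst(centers, visibility_edges)
print(tree_edges)
```

A tree over four fields keeps exactly three edges, and the long diagonal (0, 2) is dropped in favor of the shorter edges between neighboring fields.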

(a) Training data, whose supervision signal is a permutation matrix.
(b) The permutation matrix.
(c) The vertex affinity matrix.
Fig. 11: In subfigure (a), the red boxes are landmarks, the blue boxes are fields, and the black lines indicate the mapping from support to query fields. In subfigure (b), each row of the matrix corresponds to a query field, and each row has at most one entry equal to one, with the other entries equal to zero. The third entry of the fifth row is “1”, which means that the fifth query field, whose text is “00086508”, corresponds to the third support field “00347247”. The fourth row has no entry equal to “1”, which means that the fourth query field has no matching support field. We shuffle the indices of the query fields so that the permutation matrix is not an identity matrix. In subfigure (c), similar to the permutation matrix, each row of the vertex affinity matrix corresponds to a query field. The entries are the similarity scores, calculated by formula (5), between support and query fields. We only show the entries of correct pairs of fields; the other entries are not zero. The similarity score of a correct pair of fields should be larger than those of the wrong pairs that lie in the same row or column.
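The structure of such a permutation matrix, and how labels are read off from it, can be sketched as follows. The 0-based mapping mirrors the caption (fifth query field matched to the third support field, fourth query field unmatched); the remaining matches, the support labels, and the function names are hypothetical.

```python
def permutation_matrix(mapping, n_query, n_support):
    """0/1 matrix with one row per query field and one column per support field."""
    X = [[0] * n_support for _ in range(n_query)]
    for q, s in mapping.items():
        X[q][s] = 1
    return X


def is_one_to_at_most_one(X):
    """Every row and every column may contain at most a single 1."""
    rows_ok = all(sum(row) <= 1 for row in X)
    cols_ok = all(sum(col) <= 1 for col in zip(*X))
    return rows_ok and cols_ok


def transfer_labels(X, support_labels, outlier_label="outlier"):
    """Each query field inherits the label of its matched support field, if any."""
    predicted = []
    for row in X:
        predicted.append(support_labels[row.index(1)] if 1 in row else outlier_label)
    return predicted


# 0-based indices: the fifth query field (index 4) maps to the third support field (index 2),
# and the fourth query field (index 3) has no match.
mapping = {0: 1, 1: 0, 2: 3, 4: 2}
X = permutation_matrix(mapping, n_query=5, n_support=4)
support_labels = ["date", "time", "invoice_no", "amount"]  # hypothetical labels
print(is_one_to_at_most_one(X))            # True
print(transfer_labels(X, support_labels))  # ['time', 'date', 'amount', 'outlier', 'invoice_no']
```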

III-D Vertex and Edge Affinities

For a pair of fields, we can use multiple features to compute the vertex affinity between them. Specifically, we can compute their spatial, aspect, and textual affinities. Then, the average of these affinities is the final vertex affinity between them. We concatenate the features from query and support fields and then apply a Multi-layer Perceptron (MLP) to generate the affinity score between them. We will describe the process of computing spatial affinity matrix precisely. Aspect, textual, and edge affinity matrices are processed similarly.

For a query field and a support field, the spatial features are two matrices with the same shape. We have aligned the landmarks such that every landmark in the query document has a corresponding landmark in the support document. Thus, it is reasonable to fix a specific landmark and concatenate the two line segments that connect it to the two fields: one connects the center point of the query field with the landmark, and the other connects the center point of the support field with the corresponding landmark. The affinity score between the two fields with respect to this landmark is obtained by feeding the concatenated feature into the spatial MLP module.

By iterating over all landmarks, we can concatenate the spatial features of the two fields into a matrix. Finally, the spatial affinity between the two fields equals the average of the affinity scores over all landmarks.
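Note that the spatial affinity matrix is calculated in a vectorized way: the spatial features of all query fields and all support fields are concatenated into a single tensor, which is then fed into an MLP module to compute an affinity score tensor with one score per query field, support field, and landmark. We then average over the last dimension, which indexes the landmarks, to obtain the spatial affinity matrix with one entry per pair of query and support fields.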
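The per-landmark scoring and averaging can be sketched in plain Python. The two-layer `tiny_mlp` with hand-picked, untrained weights is only a stand-in for the learned spatial MLP, and the (dx, dy) segment encoding is an assumption; the sketch shows the data flow and the output shape, not meaningful scores.

```python
import math


def tiny_mlp(features, hidden_weights, output_weights):
    """Two-layer perceptron with tanh hidden units and a scalar output (toy stand-in)."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, features))) for row in hidden_weights]
    return sum(w * h for w, h in zip(output_weights, hidden))


def spatial_affinity(query_feats, support_feats, hidden_weights, output_weights):
    """Affinity matrix with one row per query field and one column per support field.

    query_feats[q][l] and support_feats[s][l] are (dx, dy) segments to landmark l.
    For each landmark the two segments are concatenated and scored by the MLP;
    the scores are then averaged over landmarks.
    """
    matrix = []
    for q_feat in query_feats:
        row = []
        for s_feat in support_feats:
            scores = [
                tiny_mlp(list(q_seg) + list(s_seg), hidden_weights, output_weights)
                for q_seg, s_seg in zip(q_feat, s_feat)
            ]
            row.append(sum(scores) / len(scores))
        matrix.append(row)
    return matrix


# 2 query fields, 3 support fields, 2 landmarks; all values are illustrative.
query_feats = [[(-40.0, 0.0), (-40.0, 40.0)], [(-40.0, -40.0), (-40.0, 0.0)]]
support_feats = [
    [(-40.0, 0.0), (-40.0, 40.0)],
    [(-40.0, -38.0), (-40.0, 2.0)],
    [(-90.0, 0.0), (-90.0, 40.0)],
]
hidden_weights = [[0.01, 0.0, -0.01, 0.0], [0.0, 0.01, 0.0, -0.01]]  # arbitrary, untrained
output_weights = [1.0, -1.0]                                          # arbitrary, untrained

affinity = spatial_affinity(query_feats, support_feats, hidden_weights, output_weights)
print(len(affinity), len(affinity[0]))  # 2 3
```

A tensor library would compute the same quantity in one batched call by broadcasting the query and support features against each other and averaging over the landmark axis.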

We compute the aspect and textual affinity matrices in a similar way but with separate MLP modules. The final vertex affinity matrix is the average of the spatial, aspect, and textual affinity matrices, as given by formula (5).

The off-diagonal elements of the affinity matrix are calculated similarly. Each edge has a direction feature and an aspect feature. The similarity score between a query edge and a support edge is computed from these two features in the same way: the corresponding features of the two edges are concatenated and fed into separate MLP modules, and the resulting affinities are averaged.
Note that we use separate MLP modules for vertex and edge affinities.

III-E Combinatorial Solver

Fig.3 shows the pipeline of our model. We need to solve the partial graph matching problem and back-propagate through the solvers after calculating the affinity matrix.

Solving Partial Graph Matching Problem. Inspired by recent work on fusing deep learning and combinatorics [30], we adopt two solvers for the partial graph matching problem. The first is the DD-ILP solver [31], a third-party library that solves a specific type of discrete optimization problem called Integer-Relaxed Pairwise-Separable Linear Programs (IRPS-LP). The partial graph matching problem with formulation (1) and (2) is an example of such problems.

We also reimplement the ZAC-GM solver in [32]. Although the formulation of the ZAC-GM solver is not the same as equations (1) and (2), the input, output, and constraints of this solver are the same as DD-ILP. In [32], the authors clarify a sufficient condition about when a vertex affinity matrix can lead to an optimal permutation matrix that represents the correct mapping between vertexes. Inspired by this sufficient condition, we design an additional ranking loss to regularize the MLP modules. We use this ranking loss to enlarge the similarity score difference between correct vertex pairs and the wrong vertex pairs during training.

Fig. 11 illustrates how to calculate this ranking loss for a pair of support and query documents. Take the support field “00347247” and the query field “00086508” for example. Their similarity score should be higher than the score between “00347247” and any other query fields. Their score should be higher than the score between “00086508” and any other support fields too. Experiments show the effectiveness of this ranking loss.
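One way to express this requirement is a hinge-style ranking loss over the vertex affinity matrix. The exact form used in the paper is not given here, so the margin, the summation over rows and columns, and the function name below are assumptions for illustration.

```python
def ranking_loss(affinity, matches, margin=0.1):
    """Hinge-style ranking loss over a vertex affinity matrix.

    affinity[q][s]: similarity of query field q and support field s.
    matches: list of correct (q, s) pairs.
    Each correct score should exceed every other score in its row and column by `margin`.
    """
    n_query, n_support = len(affinity), len(affinity[0])
    loss = 0.0
    for q, s in matches:
        correct = affinity[q][s]
        for other_s in range(n_support):  # same query field, other support fields
            if other_s != s:
                loss += max(0.0, margin - (correct - affinity[q][other_s]))
        for other_q in range(n_query):    # same support field, other query fields
            if other_q != q:
                loss += max(0.0, margin - (correct - affinity[other_q][s]))
    return loss


well_separated = [[0.9, 0.1], [0.2, 0.8]]
poorly_separated = [[0.5, 0.6], [0.2, 0.8]]
matches = [(0, 0), (1, 1)]
print(ranking_loss(well_separated, matches))    # 0.0
print(ranking_loss(poorly_separated, matches))  # positive: pair (0, 0) is outranked in its row
```

The loss is zero once every correct pair beats its row and column competitors by the margin, so it only pushes on pairs that are still ambiguous.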

Back Propagate Through Solver. We adopt the Hamming loss between the predicted permutation matrix and the ground-truth label. A fundamental problem of fusing deep learning and combinatorics is that the gradient with respect to the neural network parameters is zero most of the time. In our model, a subtle change of the affinity matrix does not change the predicted permutation matrix, i.e., the prediction is a piecewise constant function of the parameters of the MLP modules. To overcome this problem, we adopt the techniques described in [33]. An additional benefit of the Hamming loss is that wrong predictions also generate gradients, as shown in Fig. 3, which leads to faster convergence than the cross-entropy (CE) loss in the LF-BP model. For example, the CE loss only considers the negative diagonal elements shown in Fig. 3, while the Hamming loss also propagates through the non-zero off-diagonal elements.
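The Hamming loss itself is simple to state in code. The sketch below counts disagreeing entries between two 0/1 matrices and gives the gradient of the equivalent expression p(1 - t) + (1 - p)t with respect to each predicted entry; how that gradient is then passed through the solver follows [33] and is not reproduced here. Function names and matrices are illustrative.

```python
def hamming_loss(predicted, target):
    """Number of entries where two 0/1 matrices disagree."""
    return sum(
        p != t
        for p_row, t_row in zip(predicted, target)
        for p, t in zip(p_row, t_row)
    )


def hamming_gradient(target):
    """Gradient of sum(p * (1 - t) + (1 - p) * t) with respect to each predicted entry p.

    It equals 1 - 2t: -1 where the target is 1 and +1 where the target is 0.
    """
    return [[1 - 2 * t for t in row] for row in target]


target = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
predicted = [[1, 0, 0], [0, 0, 1], [0, 1, 0]]  # last two query fields are swapped
print(hamming_loss(predicted, target))  # 4
print(hamming_gradient(target))         # [[-1, 1, 1], [1, -1, 1], [1, 1, -1]]
```

Every entry of this gradient is non-zero, including those at positions where the prediction wrongly places a 1, which is the sense in which wrong predictions also contribute.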

Dataset | Description | Styles (train:test) | Docs | Fields
d0 | Taxi receipts | 12 (7:5) | 136 | 16
d1 | CHSR tickets | 1 (All test) | 169 | 11
d2 | Boarding pass | 2 (1:1) | 54 | 10
d3 | PT invoice (Special) | 2 (1:1) | 151 | 15
d4 | VAT invoice (Normal) | 2 (1:1) | 118 | 12
d5 | Ferry tickets | 2 (1:1) | 98 | 14
d6 | Airline itinerary | 3 (2:1) | 107 | 25
d7 | VAT invoice (Special) | 2 (1:1) | 118 | 44
d8 | Medical invoice | 3 (2:1) | 163 | 36
d9 | Quota invoice | 4 (All train) | 162 | 9
d10 | Bank card | 1 (All train) | 197 | 8
d11 | Express bill | 1 (All train) | 157 | 5
d12 | Toll fee | 1 (All train) | 151 | 10
d13 | Customs declaration | 3 (All train) | 158 | 14
d14 | Duty-paid proof | 3 (All train) | 106 | 6
d15 | Car-hailing receipts | 2 (All train) | 162 | 21
d16 | VAT invoice (Volume) | 2 (All train) | 151 | 33
TABLE I: Statistics of DKIE datasets.
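Category | Method | Features | Training data size | Testing data type | d0 | d1 | d2 | d3 | d4 | d5 | d6 | d7 | d8
One-shot | LF-BP [1] | spatial | 1861, all styles | clean | 94.7 | 99 | 100 | 100 | 92.1 | 100 | 92.6 | 100 | 100
One-shot | Ours (DD-ILP) | spatial | 1861, all styles | clean | 99.1 | 100 | 100 | 100 | 100 | 100 | 98 | 100 | 100
One-shot | Ours (ZAC-GM) | spatial | 1861, all styles | clean | 99.1 | 100 | 100 | 100 | 100 | 100 | 97.8 | 100 | 100
One-shot | LF-BP [1] | spatial | 1861, all styles | drifted | 60 | - | 68.6 | 42.4 | - | 80.9 | 75.6 | - | 65.2
One-shot | Ours (DD-ILP) | spatial | 1861, all styles | drifted | 97 | - | 71.1 | 90 | - | 100 | 96 | - | 96.2
One-shot | Ours (ZAC-GM) | spatial | 1861, all styles | drifted | 96.3 | - | 71.1 | 93.9 | - | 100 | 96 | - | 96.2
One-shot | LF-BP [1] | spatial | 1861, all styles | outliers | 64.4 | 97.9 | 90 | 81.3 | 93.1 | 97 | 70 | 91.2 | 90
One-shot | Ours (DD-ILP) | spatial | 1861, all styles | outliers | 89 | 97.2 | 85.2 | 79.1 | 86.3 | 96.2 | 88 | 91.3 | 95.8
One-shot | Ours (ZAC-GM) | spatial | 1861, all styles | outliers | 91.2 | 98.7 | 90 | 100 | 98 | 100 | 95.8 | 96 | 97
Supervised learning | LayoutLM [8] | spatial+visual+text | 3K per style | all | 97.3 | 97.0 | - | 94.6 | 95.0 | 96.1 | 91.8 | 94.3 | 92.5
Supervised learning | PICK [9] | spatial+visual+text | 3K per style | all | 97.9 | 97.8 | - | 95.8 | 95.7 | 96.6 | 92.3 | 94.9 | 92.2
One-shot | LF-BP [1] | spatial | 1861, all styles | all | 80.39 | 98.2 | 84.1 | 94.4 | 92.5 | 97.3 | 87.2 | 95.8 | 93.8
One-shot | Ours (DD-ILP) | spatial | 1861, all styles | all | 93.6 | 98.5 | 84.7 | 96.0 | 97.1 | 98.4 | 94.4 | 96.1 | 97.2
One-shot | Ours (ZAC-GM) | spatial | 1861, all styles | all | 95.7 | 99 | 85.2 | 98.5 | 99.5 | 100 | 96.4 | 97.2 | 98
TABLE II: Comparison with state-of-the-art supervised and one-shot learning methods. Columns d0 to d8 report accuracy (%).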

IV Experiments

IV-A Datasets

The DocLL/SROIE-Oneshot datasets collected by [1] had not been released at the time of our submission. Therefore, we created our own dataset to promote research on the one-shot KIE task, especially on the problems of drifted fields and outliers. For a fair comparison on our dataset, we used the same features and training settings as [1].

DKIE One-shot Dataset. We created a new dataset consisting of 2,500 documents. We report the statistics of the DKIE dataset in Table I. The dataset can be grouped into 17 broad categories such as Value-Added Tax (VAT) invoices, medical invoices, and taxi receipts. Each category may contain different styles of documents, and different styles within one category have similar layouts. However, for one-shot learning methods, each style needs its own support document because different styles have different landmarks and labels. Fig. 20 shows sample images from the testing set. The dataset is challenging because the document images are taken by smartphones. Thus, 3D distortions, varying image sizes, drifted fields, and unmatched fields are quite common.

Groundtruth Generation. We asked two annotators to label the data separately, then cross-checked and corrected the incorrect labels. During inference, we fill in the missing landmarks of a query document with dummy bounding boxes to guarantee that the support and query documents have the same number of landmarks. For multi-region fields, we add a numeric suffix to the original labels, as suggested in subsection B of the section describing our model.

IV-B Implementation Details

State-of-the-Art Models. We compared our method with LayoutLM [8], PICK [9], and LF-BP [1]. LayoutLM and PICK are supervised-learning models. LF-BP and our model are one-shot learning models.

Training Details. We used PyTorch to implement our models. All models were trained on a single NVIDIA Tesla V100 GPU with 16 GB of memory. We optimized the models with Adam using a batch size of 8. The initial learning rate is 0.05 and decays by 0.85 after each epoch.
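The learning-rate schedule stated above can be written out explicitly. The following is a minimal sketch, assuming that "decays by 0.85" means the rate is multiplied by 0.85 after every epoch; the function name `learning_rate` is hypothetical and not taken from the authors' code.

```python
def learning_rate(epoch, initial_lr=0.05, decay=0.85):
    """Learning rate used during `epoch` (0-indexed), assuming a
    multiplicative decay applied once after every epoch."""
    return initial_lr * decay ** epoch


# Print the rate for the first few epochs.
for epoch in range(4):
    print(epoch, round(learning_rate(epoch), 6))
```

In PyTorch, the same multiplicative schedule is usually obtained with `torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.85)`, stepping the scheduler once per epoch.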

To maintain good performance, supervised-learning-based models generally keep different parameters for different styles of documents. Therefore, we split each style of documents into training and testing data and train separate parameters for each style. We generate 3,000 documents for each style; these include all the real samples, and the rest are synthesized. The remaining training details are the same as in the original methods [8, 9].

In contrast, one-shot-learning-based models can handle different styles of documents using the same parameters, so we train one model on different styles of documents. For one-shot learning methods, we split the documents of each category into training and testing sets according to their styles. For example, there are 12 styles in the taxi receipts category. We choose 7 styles as the training data and the remaining 5 styles as the testing data. The third column of Table I shows the number of styles in each category, with the ratio between training and testing styles in parentheses. In total, there are 1861 training documents and 639 testing documents.
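The style-level split described above can be sketched as follows. This is an illustration only: the style identifiers are hypothetical, and the source does not say how the 7 training styles were selected, so the seeded random choice here is an assumption.

```python
import random


def split_styles(styles, n_train, seed=0):
    """Split style identifiers into disjoint training and testing sets,
    so that every testing style is unseen during training."""
    shuffled = list(styles)
    random.Random(seed).shuffle(shuffled)
    return sorted(shuffled[:n_train]), sorted(shuffled[n_train:])


# Hypothetical identifiers for the 12 taxi-receipt styles.
taxi_styles = [f"taxi_style_{i:02d}" for i in range(12)]
train_styles, test_styles = split_styles(taxi_styles, n_train=7)
print(len(train_styles), len(test_styles))
```

Splitting by style rather than by document is what makes the evaluation one-shot: at test time the model sees a layout it was never trained on, plus a single support document for it.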

Testing Details. For one-shot-learning-based models, we predefine the support document of each style, and different images of the same style then serve as the query documents. For each style, a document containing as many landmarks and fields as possible makes a good support document. Our model predicts the label of each field in the query documents using the labels defined in the support documents.

To study the performance of our model on samples containing drifted fields and outliers separately, we further split the testing data of each style into 3 parts as shown in the fifth column of Table II. There are “clean” documents in which all fields are nicely aligned with the landmarks and have no outliers. There are “drifted” documents in which some fields are drifted so badly that even humans need to check each field very carefully to judge the labels. There are also documents containing “outliers”. A small number of documents contain both drifted fields and outliers. We include them in “drifted” and “outliers” datasets at the same time. In the “all” dataset, we report the average accuracy of fields in all query documents within the same category.

IV-C Experimental Results

In this section, we compare our method with existing SOTA methods. The results are presented in Table II. For a fair comparison, we only use spatial features in our model because the LF-BP model [1] only consumes spatial features. The effect of other features is discussed in Section IV-D.

IV-C1 Performance on “Clean” Documents

All models perform well on the “clean” data in Table II. Our model also converges faster than the LF-BP model during training. At each iteration, we calculate the average accuracy of fields over all training documents; part of this curve is shown in Figure 12. Unlike the LF-BP model, whose average accuracy increases relatively smoothly, the accuracy of our model rises sharply in the initial training stage. There are two reasons for this. First, our model uses the Hamming loss to calculate gradients, while the LF-BP model uses the cross-entropy loss. When our model matches a query field with the wrong support field, the Hamming loss generates gradients based not only on the labeled pairs of fields but also on the wrongly matched pairs. In contrast, the cross-entropy loss generates gradients based only on the labeled pairs of fields. Second, the combinatorial solvers in our model are not sensitive to subtle changes in the affinity matrices, which are the outputs of the MLP modules. The MLP modules therefore only need to output roughly correct affinity matrices for the combinatorial solvers to find the correct mapping between support and query fields, which also speeds up the increase in accuracy.
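The first reason can be made concrete with a small example. The sketch below assumes the Hamming loss counts the entries in which the predicted binary matching matrix differs from the ground-truth one, which is the form commonly used with blackbox-differentiated solvers [33]; the exact scaling used in our model is defined in the method section and may differ. Rows are taken to be query fields and columns support fields, and the matrices are hypothetical.

```python
def hamming_loss(pred, truth):
    """Number of entries where two binary matching matrices disagree."""
    return sum(
        p * (1 - t) + (1 - p) * t
        for pred_row, truth_row in zip(pred, truth)
        for p, t in zip(pred_row, truth_row)
    )


# Hypothetical 3x3 example: rows are query fields, columns are support fields.
truth = [[1, 0, 0],
         [0, 1, 0],
         [0, 0, 1]]
pred = [[1, 0, 0],
        [0, 0, 1],   # query 1 wrongly matched to support 2
        [0, 1, 0]]   # query 2 wrongly matched to support 1

loss = hamming_loss(pred, truth)
size = range(3)
# Entries selected by the solver that are not labeled pairs.
wrong_pairs = [(q, s) for q in size for s in size if pred[q][s] and not truth[q][s]]
# Labeled pairs that the solver failed to select.
missed_pairs = [(q, s) for q in size for s in size if truth[q][s] and not pred[q][s]]
print(loss, wrong_pairs, missed_pairs)
```

Both the wrongly selected entries and the missed labeled entries contribute to the loss, whereas a cross-entropy term evaluated only at the labeled pairs would involve the entries in `missed_pairs` alone.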

IV-C2 Performance on “Drifted” Documents

The LF-BP model failed on the “drifted” data in Table II, while the performance of our model dropped only moderately. Our model significantly outperforms the LF-BP model across all datasets, especially on the d0 dataset. We find that the fields in this dataset are arranged vertically. If one of the fields in the head part of a document drifts downwards, then all the fields below it also drift downwards. Typical samples of the d0 dataset can be found in Figure 1, Figure 7 (b), and Figure 13. Both the online demo released by [1] and our reimplemented LF-BP model achieve low accuracy on the drifted fields. We show typical mistakes made by the LF-BP model in Figure 13. The left pair of documents in Figure 13 shows the prediction of the online demo released by [1]. Red lines indicate wrong mappings between fields. The demo does not map “16.9” and “$0.00” to any support fields, although the authors of [1] do not report this behavior in their paper. We analyze why the LF-BP model failed on the drifted fields in the case study in Section IV-E, using samples from the d0 dataset.

Notice that d1, d4, and d7 do not have documents that contain drifted fields. The fields drift horizontally in the d3, d5, d6, and d8 datasets. Around half of the documents in the d2 dataset contain flipped fields, as shown in Figure 15. The models cannot predict the labels of these flipped fields correctly based solely on the spatial features. However, our model with the ZAC-GM solver achieves good performance when it uses additional features such as the width and height of the bounding boxes of fields. We discuss this problem in Section IV-D. Figure 20 shows examples of drifted fields in all datasets.

Fig. 12: Average accuracy of all fields from different documents.
Fig. 13: Predictions of LF-BP (left) and our model (right) on documents containing drifted fields in the d0 dataset. The LF-BP model failed, while our model can handle the drifted fields.

IV-C3 Performance on “Outliers” Documents

Table II shows that our model, when using the ZAC-GM solver, is the only one that succeeds across all datasets. When our model uses the DD-ILP solver, it cannot handle documents containing outliers. We investigated the original solver code and found that it aims to solve the graph matching problem and requires the support and query documents to have the same number of fields. This does not hold for documents containing outliers. However, when we reimplement the ZAC-GM solver [32] and employ it to pick out the outliers, our model can handle drifted fields and outliers at the same time to some extent. Taking a close look at the predictions of our reimplemented LF-BP model, we find that the outliers are predicted to have the label of one of their neighbors, as shown in Figure 7 (c1) and (c2). Although [1] did not report how they handle outliers, the demo they released refuses to output a label for some outliers. We argue that our model provides an alternative approach that can handle the outliers.
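To illustrate what a one-to-(at most)-one matching with outliers looks like, the following toy sketch enumerates all partial assignments by brute force. It is not the ZAC-GM algorithm [32]: the `threshold` parameter, the affinity values, and the function name are hypothetical, and the enumeration is only practical for a handful of fields.

```python
from itertools import product


def partial_match(affinity, threshold):
    """Brute-force one-to-(at most)-one matching.

    Each query row is assigned one support column or None (outlier).
    A matched pair contributes `affinity - threshold`, so a pair scoring
    below the threshold is better left unmatched.
    """
    n_query, n_support = len(affinity), len(affinity[0])
    options = list(range(n_support)) + [None]
    best_score, best_assignment = float("-inf"), None
    for assignment in product(options, repeat=n_query):
        used = [s for s in assignment if s is not None]
        if len(used) != len(set(used)):
            continue  # a support field may be matched at most once
        score = sum(affinity[q][s] - threshold
                    for q, s in enumerate(assignment) if s is not None)
        if score > best_score:
            best_score, best_assignment = score, assignment
    return list(best_assignment)


# Hypothetical affinities: 4 query fields (rows) and 3 support fields (columns).
# Query 2 is an outlier that lies closest to support field 0.
affinity = [[0.9, 0.2, 0.1],
            [0.3, 0.8, 0.2],
            [0.4, 0.3, 0.2],
            [0.1, 0.3, 0.7]]

greedy = [max(range(3), key=row.__getitem__) for row in affinity]
matched = partial_match(affinity, threshold=0.5)
print(greedy)
print(matched)
```

The row-wise greedy choice gives the outlier the label of its nearest support field, which is the behavior observed for the reimplemented LF-BP model, while the constrained matching leaves it unassigned.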

Some documents in the d0, d3, and d6 datasets contain drifted fields and outliers at the same time. We find that such documents are the most difficult ones. Not only did the LF-BP model fail on these documents, but the performance of our model also dropped by a relatively large margin. Figure 14 shows such samples in the d3 dataset. We find that when the outliers are close to the drifted fields, they are hard to distinguish from each other based solely on their spatial features. For example, the LF-BP model maps the “type” field in (a1) to the “outliers2” field in (a2). Our model also maps the “fee” field in (b1) to the “outliers1” field in (b2). The positions of these outliers are so close to other drifted query fields that the models may confuse them with multi-region fields. This indicates that we should measure the similarity between fields using more diverse features, such as the width and height of the bounding boxes of fields or the text embeddings of fields. We discuss the impact of different features in the ablation study in Section IV-D.

Fig. 14: Documents containing both drifted fields and outliers in the d3 dataset. (a1) and (a2) show the prediction of LF-BP model. (b1) and (b2) show the prediction of our model with ZAC-GM solver. Yellow boxes are outliers. Red lines indicate wrong mapping between fields. Both models failed on this pair of documents.

IV-C4 Performance on “All” Documents

Our model outperformed the LF-BP method on all datasets. This is because our datasets are more challenging: many documents contain spatially drifted fields and outliers. In contrast to the LF-BP method, the solving process of our model generates globally optimized results through the one-to-(at most)-one constraint, which overcomes the difficulties brought by the spatially drifted fields and outliers. Our model achieved competitive performance against the supervised-learning-based models on the d0 dataset, and better results on the other testing datasets. That the supervised-learning-based models remain competitive is not surprising, because they have three unfair advantages. First, they consume far more labeled documents than one-shot-learning-based methods during the training stage, namely 3,000 samples per style versus 1861 samples for all styles. Second, they train separate models for different types of documents, which helps them fit the data bias in each type of document. In contrast, one-shot-learning-based models use one model for all types of documents. Therefore, for each new style of document, our model benefits from the other styles of documents and requires only one labeled sample to serve as the support document. Lastly, they consume more features than the one-shot-learning-based models in this experiment, namely spatial, visual, and text features versus spatial features only.

Fig. 15: Documents containing flipped fields in the d2 dataset. (a1) and (a2) show the prediction of our model without using aspect features. (b1) and (b2) show the prediction of our model using aspect features. Red lines indicate wrong mappings between fields.
Different Features Clean Drifted(flipped) Outliers All
Spatial 100 71.7 90 85.2
Spatial+Aspect 100 100 100 100
TABLE III: Aspect features help deal with the flipped fields in the d2 dataset. Examples are shown in the Figure 15

IV-D Ablation Study

We conduct ablation studies on the d0 and d2 datasets to examine the importance of spatial features, aspect features, textual features, edge features, the number of landmarks, and the ranking loss. The d0 dataset contains 5 testing styles of documents. The d2 dataset contains 1 testing style. Each style has a separate support document and a number of query documents. We use the d2 dataset to test whether different features can help handle the challenge of drifted fields and outliers, and present the results in Table III. We use the d0 dataset to test the impact of different features across different styles; the results of training with different features are presented in Table IV. We report the influence of landmarks on the performance of our model in Fig. 16. We also study the influence of the ranking loss in Table V.

Different Features SC BJ AH JS CQ AVG
Spatial 98.8 91.2 81.8 92.8 99.1 93.6
Aspect 0 0 0 0 0 0
Text 10.2 10.4 14.7 7.2 12.3 10.6
Edge 66.2 87.9 53.1 98.8 88.0 78.6
Spatial+Aspect 98.8 85.3 94.4 94.0 97.2 93.2
Spatial+Edge 97.2 87.9 95.8 91.6 99.1 93.2
Aspect+Edge 91.6 82.2 74.1 98.2 90.0 87.1
Spatial+Aspect+Edge 98.2 88.5 97.2 94.0 99.1 94.2
Spa+Aspe+Text+Edge 96.9 93.3 95.8 94.6 99.1 95.1
TABLE IV: Impact of different features on different styles of documents in the d0 dataset. Document name aliases: “SC”: SiChuan province, “BJ”: BeiJing, “AH”: AnHui province, “JS”: JiangSu province, “CQ”: ChongQing. “AVG” indicates average accuracy.

Fig. 16: Labeling accuracy versus the number of landmarks. “S”, “A”, and “E” denote spatial, aspect, and edge features.

IV-D1 Different Features

As shown in Figure 15 and Table III, some fields cannot be told apart based solely on the spatial features. For example, in (a1) and (b1) of Figure 15, the “namechinese” field lies above the “nameenglish” field. In contrast, their positions are exchanged in (a2) and (b2) of Figure 15. According to the results in (a1) and (a2) of Figure 15, if our model only uses spatial features, it maps the “namechinese” field to the “nameenglish” field because they occupy the same position. However, the two fields have very different widths and heights. Therefore, when our model incorporates the aspect features of the two fields into their similarity scores, it finds the correct mapping between fields in (b1) and (b2) of Figure 15. Table III shows that our model achieves 100 percent accuracy on the d2 dataset if it also uses the aspect features.
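The effect of the aspect features on flipped fields can be reproduced with a toy example. In the sketch below, the coordinates and box sizes are hypothetical, and a plain Euclidean distance stands in for the learned MLP similarity of our model, so it illustrates the idea rather than the actual scoring.

```python
from itertools import permutations
from math import dist

# Hypothetical normalized features: (x, y, width, height).
support = {"namechinese": (0.5, 0.2, 0.10, 0.05),
           "nameenglish": (0.5, 0.3, 0.30, 0.05)}
# In the query, the two fields swap positions: the wide (English) field is on top.
query = {"q_top": (0.5, 0.2, 0.30, 0.05),
         "q_bottom": (0.5, 0.3, 0.10, 0.05)}


def match(query, support, dims):
    """One-to-one matching that minimizes the total distance over `dims`."""
    q_names, s_names = list(query), list(support)

    def cost(perm):
        return sum(
            dist([query[q][d] for d in dims], [support[s][d] for d in dims])
            for q, s in zip(q_names, perm)
        )

    return dict(zip(q_names, min(permutations(s_names), key=cost)))


spatial_only = match(query, support, dims=(0, 1))
spatial_aspect = match(query, support, dims=(0, 1, 2, 3))
print(spatial_only)
print(spatial_aspect)
```

With positions alone, the top query field is matched to the label that occupies the top position in the support document, which is wrong for a flipped pair; adding width and height makes the correct pairing the cheaper one.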

Fig. 17: Accuracy of our models (ZAC-GM) that are trained with different ranking loss weight. We also include training accuracy of LF-BP model in this figure. Better viewed in color.

This inspired us to design additional MLP modules to incorporate more diverse features, and to test the benefit of this practice across different datasets in Table IV. First, we train our model with only the spatial features and test its performance. The first row of Table IV indicates that our model performs well using spatial features alone on most types of documents, except for the “AH” type. The second row shows that our model fails if it only consumes the aspect features. This is not surprising, because in the d0 dataset many fields have very similar shapes and only a few have very large bounding boxes, so this feature alone cannot distinguish the fields well. Similarly, our model fails if it uses the textual features alone, as the third row of Table IV shows. By checking the text of many fields, we found that many fields, such as those containing only digits, are hard to distinguish by content. We also find that our model fails when the textual features are combined with other features, except when all the features are used. We therefore omit these failed results from Table IV to save space. The fourth row of Table IV shows that our model cannot perform well if we only use the edge features. By investigating the edge set of each document, we find that many edges in query documents cannot be found in their corresponding support documents. This is because the topology of the graph constructed for a query document can be very different from that of the support document.

Second, we evaluate the model when it is trained with combinations of features. If we use “Spatial+Aspect” features (spatial combined with aspect features), the accuracy of our model on the “AH” and “JS” types increases, while its accuracy on “BJ” and “CQ” decreases. If we use “Spatial+Edge” features (spatial combined with edge features), our model performs better on the “AH” type and worse on the “SC”, “BJ”, and “JS” types. Not surprisingly, if we use “Aspect+Edge” features (aspect combined with edge features), the performance of our model only increases on the “JS” type and decreases on the remaining types. When we use “Spatial+Aspect+Edge” features, our model achieves its best accuracy on the “AH” documents, 97.2, compared with 81.8 for spatial features alone and 14.7 for textual features alone in Table IV. Lastly, if we use all available features (“Spatial+Aspect+Text+Edge”), the accuracy of our model exceeds 90% on all types. This is surprising because the textual features tend to decrease the performance of our model. Therefore, we believe the proposed four features are complementary to each other, and it is necessary to use all of them if possible.

Solvers SC BJ AH JS CQ
Ranking loss (0) DD-ILP 95.7 96.6 92.3 94.9 92.2
ZAC-GM 98.2 98.3 94.4 92.5 97.3
Ranking loss (1) DD-ILP 97.1 98.4 94.4 96.1 97.2
ZAC-GM 98.8 99.2 98.8 99.8 99.1
TABLE V: Impact of Ranking Loss on different solvers.
(a1) in Figure 19 | (b1) in Figure 19 | (c1) in Figure 19 | (d1) in Figure 19
Total: 11.471 / None | Total: 10.604 / 10.481 | Total: 9.782 / 10.423 | Total: 11.257 /
Coordinates Score | Coordinates Score | Coordinates Score | Coordinates Score
(S1, Q7) 1.000 | (S1, Q7) 1.000 | (S1, Q7) 1.000 | (S1, Q7) 1.000
(S1, Q10) 0.980 | (S2, Q2) 0.959 | (S2, Q2) 0.959 | (S2, Q10) 0.937
(S1, Q2) 0.964 | (S3, Q11) 0.937 | (S3, Q11) 0.937 | (S3, Q2) 0.910
(S2, Q11) 0.938 | (S4, Q3) 0.879 | (S4, Q3) 0.879 | (S4, Q11) 0.902
(S3, Q3) 0.880 | (S5, Q8) 0.863 | (S5, Q8) 0.863 | (S5, Q3) 0.861
(S5, Q8) 0.863 | (S6, Q1) 0.842 | (S6, Q1) 0.842 | (S6, Q8) 0.845
(S6, Q1) 0.842 | (S7, Q6) 0.832 | (S7, Q6) 0.832 | (S7, Q1) 0.823
(S7, Q6) 0.832 | (S8, Q10) 0.122 | (S8, Q4) 0.817 | (S8, Q6) 0.810
(S9, Q4) 0.832 | (S9, Q4) 0.832 | (S9, Q9) 0.821 | (S9, Q4) 0.832
(S9, Q9) 0.821 | (S10, Q9) 0.818 | (S10, Q5) 0.814 | (S10, Q9) 0.818
(S11, Q5) 0.818 | (S11, Q5) 0.818 | (S11, Q12) 0.819 | (S11, Q5) 0.818
(S12, Q12) 0.843 | (S12, Q12) 0.843 | (S12, Q13) 0.839 | (S12, Q12) 0.843
(S13, Q13) 0.859 | (S13, Q13) 0.859 | (S13, Q10) 0.641 | (S13, Q13) 0.859
TABLE VI: Different permutation matrices correspond to different total affinity scores.
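The entries of Table VI can be checked mechanically. The sketch below copies the (a1) and (d1) columns and, reading S as a support field and Q as a query field (an assumption, since the labels are not spelled out in the table), sums the scores and lists any support field that is selected more than once. The helper name `summarize` is hypothetical.

```python
from collections import Counter

# (support, query, score) triples copied from columns (a1) and (d1) of Table VI.
a1 = [("S1", "Q7", 1.000), ("S1", "Q10", 0.980), ("S1", "Q2", 0.964),
      ("S2", "Q11", 0.938), ("S3", "Q3", 0.880), ("S5", "Q8", 0.863),
      ("S6", "Q1", 0.842), ("S7", "Q6", 0.832), ("S9", "Q4", 0.832),
      ("S9", "Q9", 0.821), ("S11", "Q5", 0.818), ("S12", "Q12", 0.843),
      ("S13", "Q13", 0.859)]
d1 = [("S1", "Q7", 1.000), ("S2", "Q10", 0.937), ("S3", "Q2", 0.910),
      ("S4", "Q11", 0.902), ("S5", "Q3", 0.861), ("S6", "Q8", 0.845),
      ("S7", "Q1", 0.823), ("S8", "Q6", 0.810), ("S9", "Q4", 0.832),
      ("S10", "Q9", 0.818), ("S11", "Q5", 0.818), ("S12", "Q12", 0.843),
      ("S13", "Q13", 0.859)]


def summarize(pairs):
    """Return the total score and the support fields used more than once."""
    total = sum(score for _, _, score in pairs)
    counts = Counter(support for support, _, _ in pairs)
    reused = {s: n for s, n in counts.items() if n > 1}
    return total, reused


a1_total, a1_reused = summarize(a1)
d1_total, d1_reused = summarize(d1)
print(round(a1_total, 3), a1_reused)
print(round(d1_total, 3), d1_reused)
```

Column (a1) has the largest sum, in line with the listed total of 11.471 up to rounding, but it selects S1 three times and S9 twice, so it is not a valid one-to-(at most)-one matching. Column (d1) uses every support field exactly once.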
Fig. 18: The prediction of LF-BP model on the drifted fields in d0 dataset.

IV-D2 Landmarks

We further evaluate the impact of the number of landmarks on the accuracy of our model in Figure 16. In practice, the text embedding (300 dimensions) may be omitted to save computation resources, so we only test the combination of “Spatial+Aspect+Edge” (4 dimensions). The overall accuracy remains good if we drop only a few landmarks (see -1, -2, -3 on the x-axis). When we use “Spatial+Aspect+Edge” features, the labeling accuracy grows as the number of landmarks increases. This is not the case when we only use the spatial features, which further suggests that these features are complementary to each other.

IV-D3 Ranking Loss

We also evaluate the effectiveness of the ranking loss by changing its weight. Figure 17 shows that our ranking loss helps accelerate the training process. Even without the ranking loss, our model still outperforms the LF-BP model. As the weight of the ranking loss increases, our model converges much faster and its accuracy also increases. When the weight of the ranking loss is too large, however, the performance of our model drops. As shown in Table V, when we apply the ranking loss to the two solvers, the accuracy of both improves across the different testing styles in the d0 dataset.

IV-E Case Study

In this section, we present a case study on the d0 dataset. The performance on this dataset can be found in Table II. The accuracy of our model on this dataset is much higher than that of the LF-BP method [1], even though both use the landmarks to generate spatial features. We select a pair of documents from the “SC” type as an example. As shown in Fig. 1, fields in the query document drift upwards compared with the fields in the support document. The LF-BP method successfully predicted the fields with the “receipt-code”, “receipt-no”, “phone-a”, and “phone-b” labels. It failed on the remaining fields because they drift and become aligned with the wrong landmarks, so LF-BP mismatches their labels according to the landmarks. Although the Belief Propagation (BP) step in the LF-BP model may alleviate the spatial drift problem, both our re-implementation and their online demonstration failed on this example.

To examine the inference phase of our model in detail, we visualize its vertex affinity matrix in Fig. 19. The fields in the query document are ordered in the same vertical sequence as shown in Fig. 1. We deliberately shuffle the order of the fields in the support document to make the task harder for the model. Each row of the affinity matrix in Fig. 19 gives the similarities between the current query field and all the fields in the support document. Since the direct output of the vertex MLP is not normalized, we apply min-max normalization to each row of this matrix, i.e., we subtract the row minimum from each element and then divide by the maximum of the shifted row. If we applied the greedy strategy used in LF-BP to select the label for each query field, we would choose all the “1” elements. However, this solution conflicts with the one-to-(at most)-one constraint in our model. The combinatorial solvers in our model instead find a globally optimized labeling such that the sum of affinities over all chosen elements is maximized without violating the constraint. This makes our model less sensitive to vertex shifts, so it handles spatially drifted cases well. Our model picked the elements lying on the red line path shown in Fig. 19.
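The contrast between the greedy strategy and a globally optimal assignment can be sketched on a small matrix. The raw affinities below are hypothetical, the helper names are not from our implementation, and the exhaustive search over permutations merely stands in for the combinatorial solvers; it is only feasible for a few fields.

```python
from itertools import permutations


def min_max_normalize_rows(matrix):
    """Rescale each row to [0, 1]: subtract the row minimum, then divide
    by the maximum of the shifted row."""
    normalized = []
    for row in matrix:
        low = min(row)
        shifted = [value - low for value in row]
        high = max(shifted)
        normalized.append([v / high if high > 0 else 0.0 for v in shifted])
    return normalized


def greedy_labels(affinity):
    """Pick the best support column for each query row independently."""
    return [max(range(len(row)), key=row.__getitem__) for row in affinity]


def optimal_assignment(affinity):
    """Exhaustively search for the one-to-one assignment with the largest sum."""
    n = len(affinity)
    best = max(permutations(range(n)),
               key=lambda perm: sum(affinity[q][s] for q, s in enumerate(perm)))
    return list(best)


# Hypothetical raw vertex-MLP outputs: rows are query fields, columns support fields.
# Query 1 has drifted towards support field 0.
raw = [[2.0, 0.5, -1.0],
       [1.8, 1.5, -0.5],
       [0.0, 0.4, 1.2]]

affinity = min_max_normalize_rows(raw)
greedy = greedy_labels(affinity)
optimal = optimal_assignment(affinity)
print(greedy)
print(optimal)
```

Choosing every "1" element assigns support field 0 to two query fields, which violates the constraint. The constrained search gives up a little affinity on the drifted field and recovers a consistent labeling.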

Fig. 19: Different permutation matrices correspond to different total affinity scores.

V Conclusion

In this work, we proposed to solve the text field labeling problem using graph matching. We designed a one-shot framework that combines the power of deep learning and combinatorial solvers. To the best of our knowledge, our framework is the first to generate globally optimized solutions. Our model can handle spatially drifted documents and shows state-of-the-art performance on most testing datasets.

Other potential visual cues, such as text color, fonts, and background, will be explored in future work in addition to the current textual, aspect, and spatial relationship features. Our method can be extended to support few-shot learning by adding additional constraints such as cycle consistency, which we also leave for future work.

Fig. 20: Samples in test dataset of DKIE. The sensitive information has been masked.


  • [1] M. Cheng, M. Qiu, X. Shi, J. Huang, and W. Lin, “One-shot text field labeling using attention and belief propagation for structure information extraction,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 340–348.
  • [2] M. Rusinol, T. Benkhelfallah, and V. Poulain dAndecy, “Field extraction from administrative documents by incremental structural templates,” in 2013 12th International Conference on Document Analysis and Recognition.   IEEE, 2013, pp. 1100–1104.
  • [3] V. Sunder, A. Srinivasan, L. Vig, G. Shroff, and R. Rahul, “One-shot information extraction from document images using neuro-deductive program synthesis,” 2019.
  • [4] Q. Yang, M. Cheng, W. Zhou, Y. Chen, M. Qiu, W. Lin, and W. Chu, “Inceptext: A new inception-text module with deformable psroi pooling for multi-oriented scene text detection,” arXiv preprint arXiv:1805.01167, 2018.
  • [5] Z. Wan, M. He, H. Chen, X. Bai, and C. Yao, “Textscanner: Reading characters in order for robust scene text recognition,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 12 120–12 127.
  • [6] X. Yue, Z. Kuang, C. Lin, H. Sun, and W. Zhang, “Robustscanner: Dynamically enhancing positional clues for robust text recognition,” in European Conference on Computer Vision.   Springer, 2020, pp. 135–151.
  • [7] V. P. d’Andecy, E. Hartmann, and M. Rusinol, “Field extraction by hybrid incremental and a-priori structural templates,” in 2018 13th IAPR International Workshop on Document Analysis Systems (DAS).   IEEE, 2018, pp. 251–256.
  • [8] Y. Xu, M. Li, L. Cui, S. Huang, F. Wei, and M. Zhou, “Layoutlm: Pre-training of text and layout for document image understanding,” in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 1192–1200.
  • [9] W. Yu, N. Lu, X. Qi, P. Gong, and R. Xiao, “PICK: Processing key information extraction from documents using improved graph learning-convolutional networks,” in 2020 25th International Conference on Pattern Recognition (ICPR), 2020.
  • [10] X. Liu, F. Gao, Q. Zhang, and H. Zhao, “Graph convolution for multimodal information extraction from visually rich documents,” in Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2019, pp. 32–39.
  • [11] Y. Qian, E. Santus, Z. Jin, J. Guo, and R. Barzilay, “GraphIE: A graph-based framework for information extraction,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).   Minneapolis, Minnesota: Association for Computational Linguistics, Jun. 2019, pp. 751–761. [Online]. Available:
  • [12] L. Chiticariu, Y. Li, and F. Reiss, “Rule-based information extraction is dead! long live rule-based information extraction systems!” in Proceedings of the 2013 conference on empirical methods in natural language processing, 2013, pp. 827–832.
  • [13] D. Schuster, K. Muthmann, D. Esser, A. Schill, M. Berger, C. Weidling, K. Aliyev, and A. Hofmeier, “Intellix–end-user trained information extraction for document archiving,” in 2013 12th International Conference on Document Analysis and Recognition.   IEEE, 2013, pp. 101–105.
  • [14] E. Vidal, A. H. Toselli, and J. Puigcerver, “A probabilistic framework for lexicon-based keyword spotting in handwritten text images,” arXiv preprint arXiv:2104.04556, 2021.
  • [15] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
  • [16] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. Le, and R. Salakhutdinov, “Transformer-XL: Attentive language models beyond a fixed-length context,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.   Florence, Italy: Association for Computational Linguistics, Jul. 2019, pp. 2978–2988. [Online]. Available:
  • [17] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le, “Xlnet: Generalized autoregressive pretraining for language understanding,” 2019.
  • [18] T. I. Denk and C. Reisswig, “Bertgrid: Contextualized embedding for 2d document representation and understanding,” arXiv preprint arXiv:1909.04948, 2019.
  • [19] A. R. Katti, C. Reisswig, C. Guder, S. Brarda, S. Bickel, J. Höhne, and J. B. Faddoul, “Chargrid: Towards understanding 2D documents,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.   Brussels, Belgium: Association for Computational Linguistics, Oct.-Nov. 2018, pp. 4459–4469.
  • [20] R. B. Palm, F. Laws, and O. Winther, “Attend, copy, parse end-to-end information extraction from documents,” in 2019 International Conference on Document Analysis and Recognition (ICDAR).   IEEE, 2019, pp. 329–336.
  • [21] P. Zhang, Y. Xu, Z. Cheng, S. Pu, J. Lu, L. Qiao, Y. Niu, and F. Wu, “Trie: End-to-end text reading and information extraction for document understanding,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 1413–1422.
  • [22] E. Medvet, A. Bartoli, and G. Davanzo, “A probabilistic approach to printed document understanding,” International Journal on Document Analysis and Recognition (IJDAR), vol. 14, no. 4, pp. 335–347, 2011.
  • [23] M. Hammami, P. Héroux, S. Adam, and V. P. d’Andecy, “One-shot field spotting on colored forms using subgraph isomorphism,” in 2015 13th International Conference on Document Analysis and Recognition (ICDAR).   IEEE, 2015, pp. 586–590.
  • [24] A. Zanfir and C. Sminchisescu, “Deep learning of graph matching,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 2684–2693.
  • [25] R. Wang, J. Yan, and X. Yang, “Learning combinatorial embedding networks for deep graph matching,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 3056–3065.
  • [26] T. Wang, H. Liu, Y. Li, Y. Jin, X. Hou, and H. Ling, “Learning combinatorial solver for graph matching,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 7568–7577.
  • [27] R. E. Burkard, E. Cela, P. M. Pardalos, and L. S. Pitsoulis, “The quadratic assignment problem,” in Handbook of combinatorial optimization.   Springer, 1998, pp. 1713–1809.
  • [28] S. Li, Z. Zhao, R. Hu, W. Li, T. Liu, and X. Du, “Analogical reasoning on chinese morphological and semantic relations,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers).   Association for Computational Linguistics, 2018, pp. 138–143. [Online]. Available:
  • [29] D. Cheriton and R. E. Tarjan, “Finding minimum spanning trees,” SIAM journal on computing, vol. 5, no. 4, pp. 724–742, 1976.
  • [30] M. Vlastelica, A. Paulus, V. Musil, G. Martius, and M. Rolínek, “Differentiation of blackbox combinatorial solvers,” arXiv preprint arXiv:1912.02175, 2019.
  • [31] P. Swoboda, J. Kuske, and B. Savchynskyy, “A dual ascent framework for lagrangean decomposition of combinatorial problems,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1596–1606.
  • [32] F. Wang, N. Xue, J.-G. Yu, and G.-S. Xia, “Zero-assignment constraint for graph matching with outliers,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 3033–3042.
  • [33] M. Rolínek, P. Swoboda, D. Zietlow, A. Paulus, V. Musil, and G. Martius, “Deep graph matching via blackbox differentiation of combinatorial solvers,” in European Conference on Computer Vision.   Springer, 2020, pp. 407–424.