
Spatial Dependency Parsing for 2D Document Understanding

by Wonseok Hwang, et al.

Information Extraction (IE) for document images is often approached as a BIO tagging problem, where the model sequentially goes through and classifies each recognized input token into one of the information categories. However, such a problem setup has two inherent limitations: (1) it can only extract a flat list of information, and (2) it assumes that the input data is serialized, often by a simple rule-based script. Nevertheless, real-world documents often contain hierarchical information in the form of two-dimensional language data for which serialization can be highly non-trivial. To tackle these issues, we propose SPADE (SPatial DEpendency parser), an end-to-end spatial dependency parser that is serializer-free and capable of modeling an arbitrary number of information layers, making it suitable for parsing structure-rich documents such as receipts and multimodal documents such as name cards. We show that SPADE outperforms the previous BIO tagging-based approach on the name card parsing task and achieves comparable performance on the receipt parsing task. In particular, when the receipt images lie on a non-flat manifold representing the physical distortion of receipt paper in the real world, SPADE outperforms the tagging-based method by a large margin of +25.8%, demonstrating the strong performance of SPADE on spatially complex documents.





1 Introduction

Automatic information extraction from unstructured documents is important for various downstream tasks, providing structured output that facilitates further information processing for both humans and computers. Previous approaches often formulate the task as a sequential tagging problem in which the model classifies each text segment from the document into one of the information categories, assuming that the input text is well-formatted as a one-dimensional sequence. However, such a shallow-parsing problem setup is not suitable for parsing real-world documents for two reasons: (1) the serialization of texts often requires non-trivial feature engineering due to the rich spatial structure of documents, and (2) information organized in multiple conceptual layers is difficult to handle even after serialization. Receipts and name cards are representative examples with complicated layouts.

Here, we present SPADE (SPAtial DEpendency parser), an end-to-end serialization-free document parser that formulates the document information extraction task as inferring spatial dependency relations between text segments distributed in a two-dimensional space. Under this end-to-end setting, three sub-tasks, (1) serialization, (2) classification of information categories (fields), and (3) structuralization of flat output, are all unified into the task of inferring a directed graph, similar to the syntactic dependency parsing task (Fig. 1), whereas they are all separate in the tagging-based approach. We first evaluate SPADE on the receipt parsing task using CORD (Park et al., 2019), a recently open-sourced dataset that consists of 1,000 human-labeled receipts with a two-level information hierarchy. We show that SPADE achieves performance comparable to POT (Hwang et al., 2019), a recently developed tagging-based model equipped with a complicated serializer, without relying on any feature engineering. Furthermore, when text segments are placed on a non-flat surface imitating the physical distortion of receipts in the real world (Fig. 3), SPADE outperforms POT by a large margin of +25.8%. SPADE also outperforms POT by +0.8% on the name card parsing task, which represents spatially more complex documents. We also show that SPADE can parse documents with multiple columns without any feature engineering, while this is challenging for the tagging-based model.

Our contributions are summarized as follows:

  • We introduce a novel view of tackling real-world document information extraction tasks as spatial dependency parsing between text segments, which unifies the previously separate tasks of “serialization”, “tagging (classification)”, and “structuralization” of text segments into a single end-to-end framework: inferring a dependency graph.

  • SPADE achieves performance comparable to the previous tagging-based approach on the CORD receipt parsing task and state-of-the-art results on parsing spatially more complicated documents such as name cards (+0.8%) and receipts on a non-flat manifold (+25.8%), without introducing any feature engineering.

The source code and part of the datasets used in this study will be open-sourced in the near future.

2 Related works

In this section, we introduce recently published works on the information extraction task from document images. Although there have been studies developing end-to-end encoders that do not require serialization of texts, all previous models, to the best of our knowledge, set the problem as sequence tagging, which assumes that the input text is already well-aligned and that the downstream task does not require further structuralization of the flat output.

Katti, Reisswig et al. (Katti et al., 2018) develop Chargrid, which encodes text segments while preserving two-dimensional spatial information using a convolutional neural network. Target keys and values on invoices are extracted through semantic segmentation. Our end-to-end encoder is similar in that the input does not require serialization, but we use the rectified information obtained after the OCR task, texts and their coordinates, focusing on the spatial distribution of text segments. Denk and Reisswig extend Chargrid by contextualizing words using BERT (Denk and Reisswig, 2019). However, the contextualization process requires serialization of texts, hindering easy application to spatially complex documents. Hwang et al. develop POT, which consists of a text serializer and a Transformer-based encoder performing BIO tagging. Although POT has achieved high performance on receipt and name card parsing tasks, its performance relies on accurate serialization of texts, which requires heavy feature engineering. Also, the extension to documents with column-wise text organization is non-trivial under their sequential tagging approach. Liu et al. develop an encoder using graph convolutional networks that can contextualize text segments from spatially complex documents (Liu et al., 2019). However, the encoder is validated on a task that does not require structuralization of the output. Xu and Li et al. develop LayoutLM (Xu et al., 2019), which contextualizes text segments together with their absolute coordinates (layout) without serialization of text segments. Our encoder is also based on the Transformer architecture but uses relative coordinates during self-attention.

3 Model

We build SPADE to extract information from documents on which text segments are distributed on a two-dimensional surface with a complex spatial arrangement reflecting information categories and their hierarchical organization. SPADE consists of three major modules: (1) the serializer-free encoder, which receives text segments and their coordinates as inputs, (2) the neural graph generator, which infers the spatial dependency graph, and (3) the parse generator, which produces the final parse. Below, we introduce our approach to parsing a complex document and explain the individual modules of SPADE.

3.1 Information extraction via spatial dependency parsing

Figure 1: The scheme of spatial dependency parsing. The receipt parsing task is explained in detail by three pictures: (a) first, text segments and their coordinates are extracted from the receipt image through OCR, (b) next, the relations between text segments are classified using two types of relations (rel-s for serialization and information type (field) classification, rel-g for conceptual grouping of serialized text; the numbers inside the circles represent the box numbers denoted in red in (a)), and (c) the final parse is generated by interpreting the graph. (d) An example of name card spatial dependency parsing. (e) Other virtual examples showing the versatility of the spatial dependency parsing approach.

Although natural language is formatted as a serialized one-dimensional string, real-world documents often include additional information encoded by the spatial arrangement of texts. For example, the schematic image of a receipt shown in Fig. 1a consists of multiple information layers implicitly separated by spatial arrangement; the text segments “volcano” (box 5), “iced” (box 6), and “coffee” (box 10) together form a single field “menu.nm”, which is grouped with “x4” (box 7, menu.cnt), “@1,000” (box 8, menu.unit-price), and “4,000” (box 9, menu.price) to form a group at a higher level. We use “field” to indicate the information category. Similarly, “citron” (box 11) and “tea” (box 12) together form a group at the lowest level (menu.nm field, 1st-level group) and form a group at the second level with “x1” (box 13), “@2,000” (box 14), and “2,000” (box 15). Another example is a name card (Fig. 1d), on which information such as “company name” (boxes 1, 2), “name” (boxes 3, 4), and “position” (boxes 5, 6) is aligned with no strict spatial pattern. Other conceptual examples are shown in Fig. 1e: documents that have three information layers (left), multiple columns (middle), or a table (right). In all cases, the spatial arrangement of text segments represents information with no strict rules.
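Concretely, the two-level hierarchy of this receipt example can be written as nested key-value groups. The sketch below is purely illustrative: the field names follow the example above, but the exact output schema of the dataset is an assumption.

```python
# Hypothetical parse of the receipt in Fig. 1a: a list of second-level
# groups, each holding serialized first-level fields.
receipt_parse = [
    {
        "menu.nm": "volcano iced coffee",  # boxes 5, 6, 10 serialized
        "menu.cnt": "x4",                  # box 7
        "menu.unit-price": "@1,000",       # box 8
        "menu.price": "4,000",             # box 9
    },
    {
        "menu.nm": "citron tea",           # boxes 11, 12 serialized
        "menu.cnt": "x1",                  # box 13
        "menu.unit-price": "@2,000",       # box 14
        "menu.price": "2,000",             # box 15
    },
]
```

Each dictionary corresponds to one group at the higher level; a deeper hierarchy would simply nest further.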

Traditional NLP mostly deals with serialized text, and many previous shallow-parsing approaches require serialization accordingly. However, the examples above show not only that serialization of text segments is highly non-trivial but also that there is a risk of losing the information implicitly encoded in the spatial arrangement. Here we approach the problem of parsing documents as inferring a dependency graph between text segments, which we call “spatial dependency parsing”. For example, in the receipt shown in Fig. 1a, there are two types of dependency relations: (1) rel-s for serialization of text segments belonging to the same field (Fig. 1b, arrows in light blue), and (2) rel-g, which indicates inter-grouping between fields (black arrows). For example, “volcano iced coffee”, a special field representing a menu item, has “x4”, “@1,000”, and “4,000” as its members. The relations are indicated by directed arrows between the first segments of each field. Similarly, name cards can be parsed by inferring rel-s (Fig. 1d). The various other conceptual examples shown in Fig. 1e can also be parsed using these two relations. To perform this task in an end-to-end fashion, we develop SPADE, consisting of three major modules: (1) the serializer-free spatial text encoder, (2) the graph generator, and (3) the parse generator.
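As a toy illustration, the two relation types can be stored as directed edge maps over the box numbers of Fig. 1a, and a rel-s chain recovered by simple traversal. The edge maps and helper function below are illustrative, not the paper's implementation.

```python
# Illustrative rel-s edges (serialization within a field) and rel-g edges
# (grouping, emitted from the representer field's first segment),
# following the box numbering of Fig. 1a.
rel_s = {5: 6, 6: 10, 11: 12}
rel_g = {5: [7, 8, 9], 11: [13, 14, 15]}

def follow_rel_s(start, edges):
    """Follow rel-s edges from a field's first text segment,
    collecting the serialized chain of box numbers."""
    chain = [start]
    while chain[-1] in edges:
        chain.append(edges[chain[-1]])
    return chain
```

For example, `follow_rel_s(5, rel_s)` recovers the box sequence `[5, 6, 10]`, i.e., "volcano iced coffee", and `rel_g[5]` lists the first segments of the fields grouped with it.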

3.2 Two dimensional text encoders

Figure 2:

The scheme of our model. (a) The encoder contextualizes text segments using their relative coordinates (yellow). Each field (information category) is represented by a trainable vector (blue) representing a virtual node used for field classification. (b) The dependency graph is inferred by mapping the vector representations of two text segments to a scalar.

To encode text segments with information about their spatial arrangement, we employ the Transformer (Vaswani et al., 2017) architecture as our backbone. Since we do not use serialized text, only text embeddings are fed into the encoder without sequential ordering information (Fig. 2a). To incorporate the spatial information, we add relative coordinate information during self-attention as follows. Inspired by Transformer-XL (Dai et al., 2019), we replace the inner product between query and key vectors during self-attention with

    a_{ij} = q_i^T k_j + q_i^T r_{ij} + b^T r_{ij}

where q_i is the query vector of the i-th input token, k_j is the key vector of the j-th input token, r_{ij} is the relative coordinate vector containing spatial information about the j-th token relative to the coordinate of the i-th token, and b is a bias vector. Only the first term in the equation above is used in the original Transformer.

r_{ij} is prepared as follows. First, the relative coordinates for each pair of text segments are calculated. For example, if “text1” is placed at (x_1, y_1) and “text2” is placed at (x_2, y_2), then during the contextualization of “text1”, “text2” is considered to be placed at (x_2 - x_1, y_2 - y_1). Next, we quantize the x and y coordinates into integers and embed them using sin and cos functions (Vaswani et al., 2017). The distance and angle between text segments are also embedded in the same way. Finally, r_{ij} is constructed by concatenating the four positional embedding vectors: x, y, distance, and angle. Each embedding vector passes through a single linear layer before concatenation, allowing layer-wise differences. We use either 5 (SPADE-Small) or 12 (SPADE-Base) Transformer layers during encoding. Part of the weight parameters are initialized with multilingual BERT (Devlin et al., 2018).
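The construction of the relative coordinate vector described above can be sketched numerically. The embedding dimension, quantization granularity, and the omitted per-layer linear projections are assumptions in this sketch.

```python
import numpy as np

def sinusoidal_embed(x, dim=64):
    """Embed a quantized integer value with sin/cos functions of
    varying frequencies (Vaswani et al., 2017)."""
    freqs = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))
    ang = x * freqs
    return np.concatenate([np.sin(ang), np.cos(ang)])

def relative_coord_vector(p_i, p_j, dim=64):
    """Sketch of r_ij: concatenate embeddings of the relative x, relative y,
    distance, and angle of segment j as seen from segment i."""
    dx, dy = p_j[0] - p_i[0], p_j[1] - p_i[1]
    dist = int(round(np.hypot(dx, dy)))                 # Euclidean distance
    angle = int(round(np.degrees(np.arctan2(dy, dx))))  # angle in degrees
    parts = [sinusoidal_embed(v, dim) for v in (int(dx), int(dy), dist, angle)]
    return np.concatenate(parts)  # shape: (4 * dim,)
```

In the full model each of the four parts would additionally pass through its own linear layer per Transformer layer before concatenation.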

Table 1: A formal description of the parse generation (graph interpretation) process, listing for each stage (SEEDING, SERIALIZATION, GROUPING) the graph at time t, the input node, the action taken, the resulting graph at time t+1, and the dependency used. Each action is repeated until the graph stops expanding before moving to the next action.

3.3 Graph generator

To infer the spatial dependency graph, the contextualized vector representation of the first token of each text segment (v_i) is collected from the output of the encoder. In addition, f_k, a trainable embedding vector representing the k-th field, is prepared for each field. This vector represents a virtual field start node and is used for the classification of the information categories of text segments in the SEEDING stage during parse generation (Section 3.4). Although SPADE can handle an arbitrary number of relations, many document types can be described using just two relations (Fig. 1): (1) a relation for serializing text segments (rel-s), and (2) a relation for grouping serialized texts (rel-g).

Mathematically, both relations can be represented as binary matrices M^(r) (Fig. 2b). The subscripts i, j and superscript r are used to indicate the directed edge from the i-th node to the j-th text segment with relation type r. M^(r) consists of (n_f + n_s) rows and n_s columns, where n_f and n_s represent the number of fields (the number of information categories) and the number of text segments in the document, respectively. The score for each relation is estimated via an inner product after projecting the contextualized vectors with linear layers, similar to the approach used in (Dozat and Manning, 2018). Formally, the probability p^(r)_{ij} that there exists a directed edge from node i to segment j representing relation r is calculated by

    h_i = W_h^(r) u_i,    d_j = W_d^(r) v_j,    p^(r)_{ij} = sigmoid(h_i^T d_j)

Here, W stands for an affine transformation, h_i is the vector representation of the head node, and d_j is that of the dependent text segment. u_i = f_k when the i-th node is a virtual node representing the k-th field type, and u_i = v_i otherwise.

To construct M^(r), p^(r)_{ij} is binarized by thresholding:

    M^(r)_{ij} = 1 if p^(r)_{ij} >= 0.5, and 0 otherwise.


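A toy sketch of the graph generator's scoring and binarization follows. A sigmoid over a bilinear score is an assumed stand-in for the paper's projected inner product, and all dimensions are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
n_f, n_s, d = 3, 5, 16  # fields, text segments, hidden size (toy values)

# Node representations: n_f trainable field vectors stacked on top of
# n_s contextualized first-token segment vectors from the encoder.
V = rng.normal(size=(n_f + n_s, d))

# One pair of projection matrices per relation type (rel-s shown here).
W_head = rng.normal(size=(d, d))
W_dep = rng.normal(size=(d, d))

H = V @ W_head          # head projections (field nodes and segments)
D = V[n_f:] @ W_dep     # dependent projections (segments only)
scores = H @ D.T        # shape: (n_f + n_s, n_s)

P = 1.0 / (1.0 + np.exp(-scores))  # edge probabilities
M = (P >= 0.5).astype(int)         # binarized adjacency matrix M^(r)
```

Row k < n_f of M holds the edges from the k-th virtual field node, and row n_f + i holds the edges from text segment i.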
3.4 Parse generator

The inferred graph is converted to a parse in the following three stages: (1) SEEDING, (2) SERIALIZATION, and (3) GROUPING.

  1. In the SEEDING stage, the virtual field start nodes (f-nodes, filled circles in Fig. 1) are linked to multiple text segments (seeds) using rel-s.

  2. Next, in the SERIALIZATION stage, each seed node found in the previous stage generates a directed edge to the next text segment (serialization) recursively until there is no further node to be linked.

  3. Finally, in the GROUPING stage, serialized texts are grouped recursively, constructing the information layers from the top layer to the bottom layer. To group texts using directed edges, we define a special representer field for each information layer. The first text segment of the representer field generates directed edges to the first segments of the other serialized texts using rel-g (for example, menu.nm (“volcano iced coffee”) in Fig. 1a), and the other fields belonging to the same group are linked by directed edges (menu.cnt (“x4”), menu.unit-price (“@1,000”)). The process is repeated until the bottom layer.

The whole process is formally expressed in Table 1.
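The SEEDING and SERIALIZATION stages can be sketched as a traversal of a binarized rel-s matrix. The sketch below is an assumption about the decoding loop; GROUPING over rel-g works analogously and is omitted for brevity.

```python
def generate_parse(M, n_f):
    """Decode field chains from a binarized rel-s matrix M with
    (n_f + n_s) rows and n_s columns: row k < n_f is the k-th field
    start node, row n_f + i is text segment i. Assumes acyclic chains."""
    n_s = len(M[0])
    parses = []
    for field in range(n_f):
        # SEEDING: the field node links to the first segment of each
        # instance of that field.
        for seed in (j for j in range(n_s) if M[field][j]):
            chain, cur = [seed], seed
            while True:
                # SERIALIZATION: follow the outgoing segment-to-segment edge.
                nxt = [j for j in range(n_s) if M[n_f + cur][j]]
                if not nxt:
                    break
                cur = nxt[0]
                chain.append(cur)
            parses.append((field, chain))
    return parses
```

For a single field linking segment 0, with edges 0 -> 1 -> 2, the function returns `[(0, [0, 1, 2])]`.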

Although undirected edges could be employed to describe rel-g, the use of directed edges has the following merits: an arbitrary number of information hierarchies can be described without increasing the number of relation types (Fig. 1e) under a unified framework, and the parse can be generated recursively simply by selecting target nodes with high scores.

4 Experiments

Figure 3: Examples of receipt images on a flat (left, from CORD) or non-flat (right, from CORD-R) manifold with their parsing results. Two representative examples are shown: (a) a receipt placed on a relatively flat surface, and (b) a receipt embedded on a highly curved surface.

4.1 Text extraction from document

We validate our approach on the post-OCR parsing task, although our method is generally applicable to any information retrieval task on documents where text segments and their coordinates are available. To extract visually embedded texts from an image, we used our in-house OCR system consisting of the CRAFT text detector (Baek et al., 2019b) and a text recognizer (Baek et al., 2019a). The OCR models are fine-tuned on each parsing dataset. The resulting text segments and their spatial information on the image were delivered to SPADE.

4.2 Learning

The model is trained with a cross-entropy loss. The ADAM optimizer is used with the following learning rates: 1e-5 for the encoder, 1e-4 for the graph generation module, and 2e-5 for our own implementation of POT. The decay rates are set to . The batch size is set to 412. During training, the coordinates of text segments are augmented (1) by rotating all coordinates by a random angle between -10 and +10 degrees with uniform probability, and (2) by distorting all coordinates randomly using a trigonometric function. In the receipt parsing task, tokens are also augmented by randomly deleting or inserting a single token, each with 3.3% probability. Also, we attach one or two random tokens at the end of a text segment, each with 1.7% probability. Newly added tokens are randomly selected from the collection of all tokens in the training set.
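The coordinate augmentation described above can be sketched as a global rotation followed by a sinusoidal warp. The warp amplitude and frequency below are assumptions; the paper does not specify them.

```python
import math
import random

def augment_coords(boxes, max_deg=10.0, amp=0.02, freq=2.0):
    """Sketch of the coordinate augmentation: rotate all box centers by a
    uniform random angle in [-max_deg, +max_deg] degrees, then displace
    them vertically with a sinusoid of the (rotated) x coordinate."""
    theta = math.radians(random.uniform(-max_deg, max_deg))
    c, s = math.cos(theta), math.sin(theta)
    out = []
    for x, y in boxes:
        xr, yr = c * x - s * y, s * x + c * y              # global rotation
        yr = yr + amp * math.sin(2 * math.pi * freq * xr)  # sinusoidal warp
        out.append((xr, yr))
    return out
```

Applying the same random transform to every box preserves the relative layout while exposing the model to rotated and warped documents.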

4.3 Performance metric

To evaluate our model, we compare predicted parses to the oracle parses generated from the ground-truth text segments. The parse format is hierarchical groups of key-value pairs, where a key indicates a field and a value is the text sequence belonging to it. The F1 score of these key-value pairs and the sample accuracy (acc) were measured for the performance evaluation. A group of key-value pairs from the ground-truth parse is matched with the group from the predicted parse having the highest matching score based on the number of common keys and common value characters. Then the F1 score is computed, treating keys as classes, for each matched group pair. If a key-value pair in a group from the predicted parse is identical to a key-value pair in the matched group from the oracle parse, it is a true positive (TP) sample. Otherwise, the key-value pair in the group of the predicted parse is counted as a false positive (FP), and the unmatched key-value pair in the group of the ground-truth parse is counted as a false negative (FN). Finally, the F1 score is computed from the total number of TP, FP, and FN over all samples. The sample accuracy acc indicates the rate of samples for which all key-value pairs are correctly predicted.
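The pair-level F1 computation can be sketched as follows. Matching groups by the count of identical key-value pairs is a simplification of the paper's matching score, which also weighs common value characters.

```python
def parse_f1(pred_groups, gold_groups):
    """Greedily match each gold group to the unused predicted group sharing
    the most identical key-value pairs, then count TP/FP/FN over pairs and
    return the F1 score. The greedy matching heuristic is an assumption."""
    tp = fp = fn = 0
    used = set()
    for gold in gold_groups:
        best, best_n = None, -1
        for i, pred in enumerate(pred_groups):
            if i in used:
                continue
            n = len(set(gold.items()) & set(pred.items()))
            if n > best_n:
                best, best_n = i, n
        pred = pred_groups[best] if best is not None else {}
        if best is not None:
            used.add(best)
        tp += len(set(gold.items()) & set(pred.items()))
        fp += len(set(pred.items()) - set(gold.items()))
        fn += len(set(gold.items()) - set(pred.items()))
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0
```

A perfect prediction yields F1 = 1.0; one wrong value out of two pairs yields one FP and one FN, hence F1 = 0.5.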

4.4 Image distortion

In the real world, input text segments can be placed on a non-flat surface, as document images taken with cameras are usually distorted due to physical distortion of the documents or camera angles. In order to imitate these real-world noises, we augmented the image data by applying image augmentation techniques such as translation, rotation, illumination, and warping (Jain, 1989). For the image warping, we applied a sinusoidal wavy distortion to the image horizontally and vertically, which exposes the model to realistically modified images. Examples of distorted images are shown in Figure 3.

5 Results & Evaluation

5.1 Performance evaluation

Parsing receipts

Model         Task              dev-oracle     dev            test-oracle    test
                                F1     acc     F1     acc     F1     acc     F1     acc
POT           CORD              96.0   70.0    93.9   64.0    92.5   65.0    90.1   51.0
SPADE-Small   CORD              90.2   60.0    86.2   48.0    80.7   49.0    77.7   37.0
SPADE-Base    CORD              92.4   64.0    89.3   56.0    88.5   59.0    83.4   35.0
POT           CORD-R            -      -       56.2   7.0     -      -       52.0   12.0
SPADE-Small   CORD-R            -      -       80.7   37.0    -      -       72.5   33.0
SPADE-Base    CORD-R            -      -       85.2   42.0    -      -       77.8   35.0
SPADE-Small   CORD-Two-Column   80.7   32.0    77.2   31.0    71.4   26.0    67.3   22.0
POT           Namecard-22k      95.5   68.4    89.4   35.5    94.6   61.7    89.3   39.8
SPADE-Base    Namecard-22k      94.2   58.2    90.0   37.1    93.4   56.6    90.1   39.8
Table 2: The parsing accuracy (F1 and acc) on the CORD, CORD-R, CORD-Two-Column, and Namecard-22k datasets.

To measure the performance of SPADE on documents with multiple information hierarchies, we evaluate the model on CORD (Park et al., 2019), a human-labeled open-sourced dataset for the information extraction task from receipts, consisting of 800 training, 100 dev, and 100 test samples. We compare our model to POT (Hwang et al., 2019), which extracts information through BIO tagging after serializing all text segments through feature engineering. Since the performance of POT on CORD has not been reported and the model is not open-sourced, we prepared our own implementation of POT and performed the experiments on CORD ourselves. The F1 score and sample accuracy (acc) of both models are measured by comparing the final parses with the ground truth after string refinement using regular expressions, as in (Hwang et al., 2019) (Table 2). The oracle scores measure the OCR-independent performance over human-annotated text segments, whereas the other scores are computed by parsing text segments obtained from the OCR task. The results show that SPADE achieves competitive performance compared to POT (1st, 2nd, and 3rd rows) despite the fact that our model is serializer-free.

Parsing receipts under real world setting

In real-world documents, texts are not always placed on a flat surface due to rotation and physical distortion of documents. Also, documents may have more complicated spatial structures where serialization is highly non-trivial. To mimic such situations, we prepare CORD-R by distorting the original images (Fig. 3). The results show that SPADE outperforms POT by a large margin of +25.8% (4th row vs. 5th and 6th rows).

Figure 4: An example receipt image from CORD-Two-Column and its parsing result. The image is constructed by concatenating two receipts from CORD into a single image.

Parsing multiple receipts

Real documents can include multiple types of information on a single page with column-wise separation. To mimic such situations, we prepare another dataset, CORD-Two-Column, by rotating and concatenating receipts from the CORD dataset (Fig. 4). The results show that, thanks to the end-to-end encoder in SPADE, this case can be handled without additional feature engineering (Fig. 4; 7th row of Table 2). Although there is room for improvement, it should be emphasized that this kind of data cannot be handled easily by a shallow-parsing approach that relies on tagging each text segment and carefully serializing the text segments.

Model                                            Task   dev-oracle     dev            test-oracle    test
                                                        F1     acc     F1     acc     F1     acc     F1     acc
SPADE-Small                                      CORD   90.2   60.0    86.2   48.0    80.7   49.0    77.7   37.0
 (-) relative coordinate                         CORD   29.2   0       28.1   0       27.0   0       25.3   0
 (-) relative coordinate (+) absolute coordinate CORD   86.3   56.0    81.0   42.0    80.9   54.0    70.3   33.0
 (-) data augmentation                           CORD   85.5   51.0    83.0   43.0    78.7   41.0    76.9   40.0
Table 3: The results of the ablation study (F1 and acc).

Parsing name cards

We also measure the performance of the models on the name card parsing task using our internal dataset consisting of 22k name cards, on which text segments are distributed with more spatial complexity compared to receipts. SPADE shows slightly lower performance on the oracle test set while outperforming POT on the score obtained after the OCR task (+0.8% F1; Table 2, 8th and 9th rows). In the oracle task, the serialization of text segments is performed using human-labeled line-grouping information, indicating that the score obtained after the OCR task better represents model performance on documents with spatial complexity. Together with the results obtained on CORD-R, this indicates that SPADE shows better parsing performance than POT on spatially complex documents.

We summarize the properties of all the datasets used in this study in Table 4.

        D-lv1          D-lv2     D-lv3
S-lv1   -              CORD      CORD-Two-Column
S-lv2   Namecard-22k   CORD-R    -
Table 4: A qualitative description of the datasets used in this study. S stands for spatial complexity, relating to the text distribution on the two-dimensional document image, and D stands for information depth, relating to how information is grouped hierarchically; the trailing numbers indicate the qualitative level of each.

5.2 Ablation study

Next, to probe the role of each component of SPADE, we perform an ablation study. Removing the relative coordinate information drops the accuracy dramatically, highlighting the importance of positional information in parsing spatially complex documents (Table 3, 2nd row). Replacing it with absolute coordinates in the input, as in POT, further shows the importance of using relative coordinates (Table 3, 3rd row). Finally, removing the data augmentation (random token dropping and insertion, and the warping and rotation of coordinates) shows its critical role (Table 3, 4th row).

6 Conclusion

Here, we presented SPADE, a spatial dependency parser that can extract information from multimodal documents with complicated spatial arrangements of text. By formulating the problem as constructing a dependency graph, we provide a powerful unified framework that can extract structurally rich information without feature engineering. We have validated SPADE on real-world documents, receipts and name cards, achieving state-of-the-art results.