Automatic information extraction from unstructured documents is important for various downstream tasks, as the structured output facilitates further information processing for both humans and computers. Previous approaches often formulate the task as a sequential tagging problem in which the model classifies each text segment of the document into one of the information categories, assuming that the input text is well formatted as a one-dimensional sequence. However, such a shallow-parsing setup is not suitable for parsing real-world documents for two reasons: (1) the serialization of texts often requires non-trivial feature engineering due to the rich spatial structure of documents, and (2) information organized in multiple conceptual layers is difficult to handle even after serialization. Receipts and name cards are representative examples with such complicated layouts.
Here, we present SPADE (SPAtial DEpendency parser), an end-to-end, serialization-free document parser that formulates document information extraction as inferring spatial dependency relations between text segments distributed in a two-dimensional space. Under this end-to-end setting, three sub-tasks, (1) serialization, (2) classification of information categories (fields), and (3) structuralization of the flat output, are unified into the single task of inferring a directed graph, similar to syntactic dependency parsing (Fig. 1), whereas they are all separate in tagging-based approaches. We first evaluate SPADE on a receipt parsing task using CORD (Park et al., 2019), a recently open-sourced dataset consisting of 1,000 human-labeled receipts with a two-level information hierarchy. We show that SPADE achieves performance comparable to POT (Hwang et al., 2019), a recently developed tagging-based model equipped with a complicated serializer, without relying on any feature engineering. Furthermore, when text segments are placed on a non-flat surface imitating the physical distortion of real-world receipts (Fig. 3), SPADE outperforms POT by a large margin of +25.8%. SPADE also outperforms POT by +0.8% on a name card parsing task, which represents spatially more complex documents. Finally, we show that SPADE can parse documents with multiple columns without any feature engineering, while this is challenging for the tagging-based model.
Our contributions are summarized as follows:
We introduce a novel view of real-world document information extraction as spatial dependency parsing between text segments, which unifies the previously separate tasks of "serialization", "tagging (classification)", and "structuralization" of text segments into a single end-to-end framework: inferring a dependency graph.
SPADE achieves performance comparable to the previous tagging-based approach on the CORD receipt parsing task, and state-of-the-art results on spatially more complicated documents such as name cards (+0.8%) and receipts on a non-flat manifold (+25.8%), without introducing any feature engineering.
The source code and part of the datasets used in this study will be open-sourced in the near future.
2 Related works
In this section, we introduce recently published work on information extraction from document images. Although there have been studies developing end-to-end encoders that do not require serialization of texts, all previous models, to the best of our knowledge, frame the problem as sequence tagging, which assumes that the input text is already well aligned and that the downstream task does not require further structuralization of the flat output.
Katti et al. (2018) develop Chargrid, which encodes text segments while preserving two-dimensional spatial information using a convolutional neural network. Target keys and values on invoices are extracted through semantic segmentation. Our end-to-end encoder is similar in that the input does not require serialization, but we use the rectified information obtained after OCR, texts and their coordinates, focusing on the spatial distribution of text segments. Denk and Reisswig (2019) extend Chargrid by contextualizing words using BERT. However, the contextualization process requires serialization of texts, hindering easy application to spatially complex documents. Hwang et al. (2019) develop POT, which consists of a text serializer and a Transformer-based encoder performing BIO tagging. Although POT has achieved high performance on receipt and name card parsing tasks, its performance relies on accurate serialization of texts, which requires heavy feature engineering. Also, the extension to documents with column-wise text organization is non-trivial under their sequential tagging approach. Liu et al. (2019) develop an encoder using graph convolutional networks that can contextualize text segments from spatially complex documents. However, the encoder is validated on a task that does not require structuralization of the output. Xu et al. (2019) develop LayoutLM, which contextualizes text segments together with their absolute coordinates (layout) without serialization of text segments. Our encoder is also based on the Transformer architecture but uses relative coordinates during self-attention.
We build SPADE to extract information from documents whose text segments are distributed over a two-dimensional surface with a complex spatial arrangement reflecting the information categories and their hierarchical organization. SPADE consists of three major modules: (1) a serializer-free encoder that receives text segments and their coordinates as inputs, (2) a neural graph generator inferring the spatial dependency graph, and (3) a parse generator that produces the final parse. Below, we introduce our approach to parsing a complex document and explain the individual modules of SPADE.
3.1 Information extraction via spatial dependency parsing
Although natural language is formatted as a serialized one-dimensional string, real-world documents often include additional information encoded in the spatial arrangement of texts. For example, the schematic receipt shown in Fig. 1a consists of multiple information layers implicitly separated by spatial arrangement: the text segments "volcano" (box 5), "iced" (box 6), and "coffee" (box 10) together form a single field, menu.nm, which is grouped with "x4" (box 7, menu.cnt), "@1000" (box 8, menu.unit-price), and "4,000" (box 9, menu.price) to form a group at a higher level. We use "field" to indicate an information category. Similarly, "citron" (box 11) and "tea" (box 12) together form a group at the lowest level (menu.nm field, 1st-level group) and form a second-level group with "x1" (box 13), "@2000" (box 14), and "2,000" (box 15). Another example is the name card (Fig. 1d), on which information such as "company name" (boxes 1, 2), "name" (boxes 3, 4), and "position" (boxes 5, 6) is aligned with no strict spatial pattern. Further conceptual examples are shown in Fig. 1e: documents that have three information layers (left), multiple columns (middle), or a table (right). In all cases, the spatial arrangement of text segments represents information without strict rules.
Traditional NLP mostly deals with serialized text, and many previous shallow-parsing approaches require serialization accordingly. However, the examples above show not only that serialization of text segments is highly non-trivial, but also that there is a risk of losing the information implicitly encoded in the spatial arrangement. Here, we approach document parsing as inferring a dependency graph between text segments, which we call "spatial dependency parsing". For example, the receipt in Fig. 1a involves two types of dependency relations: (1) rel-s, for the serialization of text segments belonging to the same field (Fig. 1b, arrows in light blue), and (2) rel-g, which indicates inter-field grouping (black arrows). For instance, "volcano iced coffee", a special field representing the menu name, has "x4", "@1,000", and "4,000" as its members; the relations are indicated by directed arrows between the first segments of each field. Similarly, name cards can be parsed by inferring rel-s (Fig. 1d). The various conceptual examples in Fig. 1e can also be parsed using these two relations. To perform this task in an end-to-end fashion, we develop SPADE, consisting of three major modules: (1) the serializer-free spatial text encoder, (2) the graph generator, and (3) the parse generator.
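The two relations for the receipt in Fig. 1a can be written out concretely as edge lists. The sketch below (not the authors' code; box numbers follow the figure and field names follow the CORD schema) shows the graph and the hierarchical parse it encodes:

```python
# rel-s serializes segments that belong to the same field.
rel_s = [
    (5, 6), (6, 10),   # "volcano" -> "iced" -> "coffee"  (menu.nm)
    (11, 12),          # "citron" -> "tea"                (menu.nm)
]

# rel-g links the first segment of a group's representer field (menu.nm)
# to the first segment of every other field in the same group.
rel_g = [
    (5, 7), (5, 8), (5, 9),       # x4, @1000, 4,000
    (11, 13), (11, 14), (11, 15)  # x1, @2000, 2,000
]

# Decoding these edges yields the two-level parse of the receipt:
expected_parse = [
    {"menu.nm": "volcano iced coffee", "menu.cnt": "x4",
     "menu.unit-price": "@1000", "menu.price": "4,000"},
    {"menu.nm": "citron tea", "menu.cnt": "x1",
     "menu.unit-price": "@2000", "menu.price": "2,000"},
]
```

Note that adding a third information layer would only require more rel-g edges, not a new relation type.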
3.2 Two dimensional text encoders
To encode text segments together with information about their spatial arrangement, we employ the Transformer (Vaswani et al., 2017) architecture as our backbone. Since we do not use serialized text, only text embeddings are fed into the encoder, without sequential ordering information (Fig. 2a). To incorporate spatial information, we add relative coordinate information during self-attention as follows. Inspired by Transformer-XL (Dai et al., 2019), we replace the inner product between query and key vectors during self-attention with
$\mathbf{q}_i^\top \mathbf{k}_j + \mathbf{q}_i^\top \mathbf{r}_{ij} + \mathbf{b}^\top \mathbf{r}_{ij}$

where $\mathbf{q}_i$ is the query vector of the $i$-th input token, $\mathbf{k}_j$ is the key vector of the $j$-th input token, $\mathbf{r}_{ij}$ is the relative coordinate vector containing spatial information about the $j$-th token based on the coordinate of the $i$-th token, and $\mathbf{b}$ is a bias vector. Only the first term of the expression above is used in the original Transformer. $\mathbf{r}_{ij}$ is prepared as follows. First, relative coordinates between text segments are calculated: for example, if "text1" is placed at $(x_1, y_1)$ and "text2" at $(x_2, y_2)$, then during the contextualization of "text1", "text2" is considered to be placed at $(x_2 - x_1, y_2 - y_1)$. Next, we quantize the $x$ and $y$ coordinates into integers and embed them using sine and cosine functions (Vaswani et al., 2017). The distances and angles between text segments are embedded in the same way. Finally, $\mathbf{r}_{ij}$ is constructed by concatenating the four positional embedding vectors: $x$, $y$, distance, and angle. Each embedding vector passes through a single linear layer before concatenation, allowing layer-wise differences. We use either 5 (SPADE-Small) or 12 (SPADE-Base) Transformer layers during encoding. Part of the weight parameters are initialized with multilingual BERT (Devlin et al., 2018) (https://github.com/google-research/bert, https://github.com/huggingface/transformers).
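The construction of the relative positional features can be sketched as follows. This is a minimal illustration, not the released code: the quantization grid size (`n_bins`) and embedding dimension (`dim`) are assumed hyperparameters, and the per-layer linear projections are omitted.

```python
import numpy as np

def sincos_embed(value, dim=8):
    """Sinusoidal embedding of a quantized integer (Vaswani et al., 2017)."""
    i = np.arange(dim // 2)
    freq = 1.0 / (10000 ** (2 * i / dim))
    return np.concatenate([np.sin(value * freq), np.cos(value * freq)])

def relative_features(xy_i, xy_j, n_bins=100, dim=8):
    """Embed the position of segment j relative to segment i."""
    dx, dy = xy_j[0] - xy_i[0], xy_j[1] - xy_i[1]   # relative coordinates
    dist = np.hypot(dx, dy)                          # distance between segments
    angle = np.arctan2(dy, dx)                       # angle between segments
    quantize = lambda v: int(np.round(v * n_bins))   # map to integer bins
    # Concatenate the four positional embeddings: x, y, distance, angle.
    # (In the full model each part would also pass through a linear layer
    # before concatenation.)
    return np.concatenate([sincos_embed(quantize(v), dim)
                           for v in (dx, dy, dist, angle)])

r_ij = relative_features((0.5, 0.2), (0.3, 0.1))  # 4 * dim = 32 features
```

Because the features depend only on coordinate differences, the representation is invariant to translating the whole document, which is what makes the encoder robust to layout shifts.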
Table 1: The parse generation procedure. For each stage, the table specifies the graph at time t, the input nodes, the action applied (SEEDING for field start nodes; GROUPING for nodes linked to representer fields), the graph at time t+1, and the resulting dependency.
3.3 Graph generator
To infer the spatial dependency graph, the contextualized vector representation of the first token of each text segment ($\mathbf{v}_i$) is collected from the output of the encoder. In addition, $\mathbf{f}_k$, a trainable embedding vector representing the $k$-th field, is prepared for each field. This vector represents a virtual field start node and is used for the classification of the information categories of text segments in the SEEDING stage during parse generation (Section 3.4). Although SPADE can handle an arbitrary number of relations, many document types can be described using just two relations (Fig. 1): (1) a relation for serializing text segments (rel-s), and (2) a relation for grouping serialized texts (rel-g).
Mathematically, both relations can be represented as binary matrices $M^{(r)}$ (Fig. 2b), where $M^{(r)}_{ij} = 1$ indicates a directed edge from the $i$-th node to the $j$-th node with relation type $r$. $M^{(r)}$ consists of $(n_{\text{field}} + n_{\text{seg}})$ rows and $n_{\text{seg}}$ columns, where $n_{\text{field}}$ and $n_{\text{seg}}$ represent the number of fields (information categories) and the number of text segments on the document, respectively. The score for each relation is estimated via an inner product after projecting the contextualized vectors using linear layers, similar to the approach used in (Dozat and Manning, 2018). Formally, the probability that there exists a directed edge from node $i$ to node $j$ with relation $r$ is calculated by

$p^{(r)}_{ij} = \sigma\big(\mathbf{h}_i^{(r)\top} \mathbf{d}_j^{(r)}\big), \quad \mathbf{h}_i^{(r)} = \mathrm{Aff}^{(r)}_h(\mathbf{u}_i), \quad \mathbf{d}_j^{(r)} = \mathrm{Aff}^{(r)}_d(\mathbf{v}_j)$

Here, $\mathrm{Aff}$ stands for an affine transformation, $\mathbf{h}_i^{(r)}$ is a vector representation of the head text segment, and $\mathbf{d}_j^{(r)}$ is that of the dependent text segment. $\mathbf{u}_i = \mathbf{f}_i$ for $i \le n_{\text{field}}$ indicates that the $i$-th node is a virtual node representing the $i$-th field type.
To construct $M^{(r)}$, $p^{(r)}_{ij}$ is binarized by the following procedure.
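The scoring step above can be sketched numerically as follows. This is an illustrative stand-in, not the trained model: the weights are random, the node counts are made up, and the binarization is a simple 0.5 threshold rather than the paper's full procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
n_field, n_seg, d = 3, 5, 16                  # illustrative sizes

# Head candidates: n_field virtual field nodes plus n_seg segment nodes;
# dependent candidates: the n_seg segment nodes.
heads = rng.normal(size=(n_field + n_seg, d))
deps = rng.normal(size=(n_seg, d))

# Per-relation affine projections (random stand-ins for trained weights).
W_h, b_h = rng.normal(size=(d, d)), rng.normal(size=d)
W_d, b_d = rng.normal(size=(d, d)), rng.normal(size=d)

h = heads @ W_h + b_h                         # projected head vectors
dep = deps @ W_d + b_d                        # projected dependent vectors

scores = h @ dep.T                            # inner product per candidate edge
prob = 1.0 / (1.0 + np.exp(-scores))          # edge probability
M = (prob >= 0.5).astype(int)                 # binarized adjacency matrix
```

The resulting matrix has $(n_{\text{field}} + n_{\text{seg}}) \times n_{\text{seg}}$ entries, one per candidate directed edge, matching the shape described above.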
3.4 Parse generator
The inferred graph is converted to parse in following three stages: (1) SEEDING, (2) SERIALIZATION, and (3) GROUPING.
In SEEDING stage, the virtual field start nodes (f-node, filled circles in Fig. 1) are linked to multiple text segments (seeds) using rel-s.
Next, in the SERIALIZATION stage, each seed node found in the previous stage recursively generates a directed edge to the next text segment (serialization) until there is no further node to be linked.
Finally, in the GROUPING stage, serialized texts are grouped recursively, constructing the information hierarchy from the top layer to the bottom layer. To group texts using directed edges, we define a special representer field for each information layer. The first text segment of the representer field then generates directed edges, via rel-g, to the first segments of the other serialized texts (for example, menu.nm ("volcano iced coffee") in Fig. 1a), so that the other fields belonging to the same group (menu.cnt ("x4"), menu.unit-price ("@1,000")) are linked by directed edges. The process is repeated until the bottom layer is reached.
The whole process is formally expressed in Table 1.
Although an undirected edge could be employed to describe rel-g, the use of directed edges has the following merits: an arbitrary number of information layers can be described without increasing the number of relation types (Fig. 1e) under a unified framework, and the parse can be generated recursively simply by selecting target nodes with high scores.
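The three decoding stages can be sketched with plain dictionaries. The function below is illustrative, not the paper's implementation: it assumes two information layers, represents each relation as an adjacency dict from head node id to tail node ids, and treats the first `n_field` node ids as virtual field nodes.

```python
def decode(rel_s, rel_g, n_field, texts):
    """Decode SEEDING, SERIALIZATION, and GROUPING over inferred edges.

    rel_s/rel_g: dicts mapping a head node id to its tail node ids.
    Node ids < n_field are virtual field nodes; the rest index `texts`.
    """
    # (1) SEEDING: each field node points at the seed segment(s) of that field.
    seeds = [(f, t) for f in range(n_field) for t in rel_s.get(f, [])]
    # (2) SERIALIZATION: follow rel-s edges from each seed until no successor.
    fields = {}
    for f, seed in seeds:
        chain, node = [seed], seed
        while rel_s.get(node):
            node = rel_s[node][0]
            chain.append(node)
        fields[seed] = (f, " ".join(texts[i] for i in chain))
    # (3) GROUPING: representer seeds (those with outgoing rel-g edges)
    # collect the serialized fields they point to into one group.
    groups = []
    for seed in fields:
        if seed in rel_g:
            members = dict([fields[seed]])
            members.update(fields[t] for t in rel_g[seed] if t in fields)
            groups.append(members)
    return groups

# Toy example: field 0 = menu.nm, field 1 = menu.cnt.
texts = {2: "volcano", 3: "iced", 4: "coffee", 5: "x4"}
rel_s = {0: [2], 1: [5], 2: [3], 3: [4]}   # seeding + serialization edges
rel_g = {2: [5]}                            # menu.nm seed groups the count
groups = decode(rel_s, rel_g, n_field=2, texts=texts)
```

Extending this sketch to deeper hierarchies would only require repeating stage (3) layer by layer, mirroring the recursive procedure described above.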
4.1 Text extraction from document
We validate our approach on a post-OCR parsing task, although our method is generally applicable to any information extraction task on documents where text segments and their coordinates are available. To extract visually embedded texts from an image, we use our in-house OCR system, consisting of the CRAFT text detector (Baek et al., 2019b) and the Comb.best text recognizer (Baek et al., 2019a). The OCR models are fine-tuned on each parsing dataset. The resulting text segments and their spatial information on the image are delivered to SPADE.
The model is trained with a cross-entropy loss. The Adam optimizer is used with the following learning rates: 1e-5 for the encoder, 1e-4 for the graph generation module, and 2e-5 for our own implementation of POT. The decay rates are set to . The batch size is set to 412. During training, the coordinates of text segments are augmented (1) by rotating all coordinates by a random angle between −10 and +10 degrees drawn with uniform probability, and (2) by distorting all coordinates randomly using a trigonometric function. In the receipt parsing task, tokens are also augmented by randomly deleting or inserting a single token, each with 3.3% probability. Also, we attach one or two random tokens at the end of a text segment, each with 1.7% probability. Newly added tokens are randomly selected from the collection of all tokens in the training set.
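The coordinate augmentation can be sketched as below. This is a minimal illustration under assumed parameters: the warp amplitude and frequency (`warp_amp`, `warp_freq`) are stand-ins, since the paper does not specify the exact form of its trigonometric distortion.

```python
import numpy as np

def augment_coords(xy, rng, max_deg=10.0, warp_amp=0.02, warp_freq=2.0):
    """xy: (n, 2) array of text-segment coordinates, roughly in [0, 1]."""
    # (1) Rotate all coordinates by a uniform random angle in [-10, +10] degrees.
    theta = np.deg2rad(rng.uniform(-max_deg, max_deg))
    c, s = np.cos(theta), np.sin(theta)
    rotated = xy @ np.array([[c, -s], [s, c]])
    # (2) Distort the rotated coordinates with a sinusoidal (trigonometric) warp.
    warped = rotated.copy()
    warped[:, 1] += warp_amp * np.sin(2 * np.pi * warp_freq * rotated[:, 0])
    return warped

rng = np.random.default_rng(0)
out = augment_coords(np.array([[0.1, 0.2], [0.8, 0.9]]), rng)
```

Applying the same random rotation and warp to all segments of a document preserves their relative arrangement while varying the absolute layout, which is exactly what the relative-coordinate encoder is meant to tolerate.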
4.3 Performance metric
To evaluate our model, we compare predicted parses to the oracle parses generated from the ground-truth text segments. The parse format is hierarchical groups of key-value pairs, where a key indicates a field and a value is the text sequence belonging to it. The F1 score over these key-value pairs and the sample accuracy (acc) are measured for performance evaluation. A group of key-value pairs from the ground-truth parse is matched with the group from the predicted parse that has the highest matching score, based on the number of common keys and common value characters. The F1 score is then computed, treating keys as classes, for each matched group pair. If a key-value pair in a group from the predicted parse is identical to the key-value pair in the matched group from the oracle parse, it is counted as a true positive (TP). Otherwise, the key-value pair in the group of the predicted parse is counted as a false positive (FP), and the key-value pair in the group of the ground-truth parse is counted as a false negative (FN). Finally, the F1 score is computed from the total numbers of TP, FP, and FN over all samples. The sample accuracy (acc) indicates the rate of samples for which all key-value pairs are correctly predicted.
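The counting step of this metric can be sketched as follows; the group-matching step is omitted, so the function assumes the groups have already been aligned in order.

```python
def kv_f1(pred_groups, gold_groups):
    """F1 over key-value pairs of pre-matched group pairs."""
    tp = fp = fn = 0
    for pred, gold in zip(pred_groups, gold_groups):
        pred_pairs, gold_pairs = set(pred.items()), set(gold.items())
        tp += len(pred_pairs & gold_pairs)   # identical key-value pairs
        fp += len(pred_pairs - gold_pairs)   # predicted pairs not in gold
        fn += len(gold_pairs - pred_pairs)   # gold pairs that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

pred = [{"menu.nm": "citron tea", "menu.cnt": "x1"}]
gold = [{"menu.nm": "citron tea", "menu.cnt": "x2"}]
score = kv_f1(pred, gold)  # one TP, one FP, one FN -> F1 = 0.5
```

Note that a single wrong character in a value turns a would-be TP into both an FP and an FN, which is why the metric is strict.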
4.4 Image distortion
In the real world, input text segments may lie on a non-flat surface, as document images taken with cameras are usually distorted by physical deformation of the document or by the camera angle. To imitate these real-world noises, we augment the image data with techniques such as translation, rotation, illumination change, and warping (Jain, 1989). For image warping, we apply sinusoidal wavy distortion to the image horizontally and vertically, exposing the model to realistically deformed images. Examples of distorted images are shown in Figure 3.
5 Results & Evaluation
5.1 Performance evaluation
To measure the performance of SPADE on documents with a multi-level information hierarchy, we evaluate the model on CORD (Park et al., 2019), a human-labeled, open-sourced dataset for information extraction from receipts, consisting of an 800-sample train set, a 100-sample dev set, and a 100-sample test set. We compare our model to POT (Hwang et al., 2019), which extracts information through BIO tagging after serializing all text segments through feature engineering. Since the performance of POT has not been measured on CORD and the model is not open-sourced, we prepared our own implementation of POT and performed the experiments on CORD ourselves. The F1 score and sample accuracy (acc) of both models are measured by comparing final parses with the ground truth after string refinement using regular expressions, as in (Hwang et al., 2019) (Table 2). The oracle score measures the OCR-independent score over human-annotated text segments, whereas the other scores are computed by parsing text segments obtained from the OCR task. The results show that SPADE achieves performance competitive with POT (1st, 2nd, and 3rd rows) despite being serializer-free.
Parsing receipts under real world setting
In real-world documents, texts are not always placed on a flat surface, due to rotation and physical distortion of documents. Also, documents may have more complicated spatial structures where serialization is highly non-trivial. To mimic such situations, we prepare CORD-R by distorting the original images (Fig. 3). The results show that SPADE outperforms POT by a large margin of 25.8% (4th row vs. 5th and 6th rows).
Parsing multiple receipts
Real documents can include multiple types of information on a single page with column-wise separation. To mimic such a situation, we prepare another dataset, CORD-Two-Column, by rotating and concatenating receipts from the CORD dataset (Fig. 4). The results show that, thanks to the end-to-end encoder in SPADE, this case can be handled without additional feature engineering (Fig. 4, 7th column of Table 2). Although there is room for improvement, it should be emphasized that this kind of data cannot be handled easily by a shallow-parsing approach that relies on tagging each text segment after careful serialization.
Table 3: Ablation study on CORD. Each setting reports four pairs of F1 / acc columns:

|Setting|Dataset|F1|acc|F1|acc|F1|acc|F1|acc|
|(-) relative coordinate|CORD|29.2|0|28.1|0|27.0|0|25.3|0|
|(-) relative coordinate (+) absolute coordinate|CORD|86.3|56.0|81.0|42.0|80.9|54.0|70.3|33.0|
|(-) data augmentation|CORD|85.5|51.0|83.0|43.0|78.7|41.0|76.9|40.0|
Parsing name cards
We also measure performance on a name card parsing task using our internal dataset of 22k name cards, on which text segments are distributed with more spatial complexity than on receipts. SPADE shows slightly lower performance on the oracle test set while outperforming POT on the score obtained after the OCR task (+0.8% F1, Table 2, 8th and 9th rows). In the oracle task, the serialization of text segments is performed using human-labeled line-grouping information, indicating that the score obtained after the OCR task better represents model performance on documents with spatial complexity. Together with the result obtained on CORD-R, this indicates that SPADE shows better parsing performance than POT on spatially complex documents.
We summarize the properties of all the datasets used in this study in Table 4.
5.2 Ablation study
Next, to probe the role of each component of SPADE, we perform an ablation study. Removing relative coordinate information drops the accuracy dramatically, highlighting the importance of positional information in parsing spatially complex documents (Table 3, 2nd row). Replacing relative coordinates with absolute coordinates in the input, as in POT, also degrades performance, showing the importance of relative coordinates (Table 3, 3rd row). Finally, removing data augmentation (random token deletion and insertion, and the warping and rotation of coordinates) shows its critical role (Table 3, 4th row).
6 Conclusion
Here, we present SPADE, a spatial dependency parser that can extract information from multi-modal documents with complicated spatial arrangements of text. By formulating the problem as constructing a dependency graph, we provide a powerful unified framework that can extract structurally rich information without feature engineering. We have validated SPADE on real-world documents, receipts and name cards, achieving state-of-the-art results.
- Baek et al. (2019a) Jeonghun Baek, Geewook Kim, Junyeop Lee, Sungrae Park, Dongyoon Han, Sangdoo Yun, Seong Joon Oh, and Hwalsuk Lee. 2019a. What is wrong with scene text recognition model comparisons? Dataset and model analysis. In Proceedings of the IEEE International Conference on Computer Vision.
- Baek et al. (2019b) Youngmin Baek, Bado Lee, Dongyoon Han, Sangdoo Yun, and Hwalsuk Lee. 2019b. Character region awareness for text detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9365–9374.
- Dai et al. (2019) Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988, Florence, Italy. Association for Computational Linguistics.
- Denk and Reisswig (2019) Timo I. Denk and Christian Reisswig. 2019. BERTgrid: Contextualized Embedding for 2D Document Representation and Understanding. arXiv e-prints, page arXiv:1909.04948.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: pre-training of deep bidirectional transformers for language understanding. NAACL, abs/1810.04805.
- Dozat and Manning (2018) Timothy Dozat and Christopher D. Manning. 2018. Simpler but more accurate semantic dependency parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 484–490, Melbourne, Australia. Association for Computational Linguistics.
- Hwang et al. (2019) Wonseok Hwang, Seonghyeon Kim, Jinyeong Yim, Minjoon Seo, Seunghyun Park, Sungrae Park, Junyeop Lee, Bado Lee, and Hwalsuk Lee. 2019. Post-ocr parsing: building simple and robust parser via bio tagging.
- Jain (1989) Anil K. Jain. 1989. Fundamentals of digital image processing. Prentice Hall, Englewood Cliffs, NJ.
- Katti et al. (2018) Anoop R Katti, Christian Reisswig, Cordula Guder, Sebastian Brarda, Steffen Bickel, Johannes Höhne, and Jean Baptiste Faddoul. 2018. Chargrid: Towards understanding 2D documents. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4459–4469, Brussels, Belgium. Association for Computational Linguistics.
- Liu et al. (2019) Xiaojing Liu, Feiyu Gao, Qiong Zhang, and Huasha Zhao. 2019. Graph convolution for multimodal information extraction from visually rich documents. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Industry Papers), pages 32–39, Minneapolis, Minnesota. Association for Computational Linguistics.
- Park et al. (2019) Seunghyun Park, Seung Shin, Bado Lee, Junyeop Lee, Jaeheung Surh, Minjoon Seo, and Hwalsuk Lee. 2019. Cord: A consolidated receipt dataset for post-ocr parsing.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
- Xu et al. (2019) Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou. 2019. LayoutLM: Pre-training of Text and Layout for Document Image Understanding. arXiv e-prints, page arXiv:1912.13318.