Form-like documents are widely used in business workflows. However, a tremendous number of forms are still processed manually every day. When humans need to extract relevant information from a form-like document, they effectively conduct a value retrieval task with text queries. For example, in Figure 1, when humans process a form, they usually have a description of the information (query) that they want to extract (e.g., the total page number). Then, they examine the form (usually an image or a PDF) carefully to locate the key (e.g., NUMBER OF PAGES INCLUDING COVER SHEET in Figure 1) that is most semantically similar to the query and finally infer the target value from the localized key. This manual process costs a large amount of human effort as the number of forms and queries increases. Automating information extraction from forms is therefore important to alleviate this problem.
Existing methods formulate the problem as sequence labeling Xu et al. (2020b) or field extraction Gao et al. (2021), where a fixed set of items of interest (referred to as fields) is defined and models are trained to extract only the values of the pre-defined fields. This formulation has at least two limitations. First, forms are very diverse, and it is impossible to cover all items of interest with a fixed set of fields. Second, the resulting models are domain-specific and hard to reuse across form types. For example, an invoice field extractor may not be able to process resumes, since the two form types expect different fields.
To handle diverse queries with a unified model, we formulate the problem as value retrieval with arbitrary queries for form-like documents. Under this task formulation, users can extract values from a form by presenting variants of the corresponding keys as queries. We also set up a benchmark for the task by introducing a simple yet effective method. The method takes an arbitrary query phrase and all the detected optical character recognition (OCR) words with their locations in the form as inputs. Then, we model the interactions between the query and the detected words from the document using a transformer-based architecture. The training objective encourages the matching of positive query-value pairs and discourages that of negative ones. To further boost performance, we present SimpleDLM, a simple document pre-training strategy that learns local geometric relations between words/tokens more flexibly than existing pre-trained models. Experimental results show that our method outperforms the baselines by a large margin under different settings. When initialized with SimpleDLM, our method further improves by about 17% in F1 score compared to initializing with a state-of-the-art (SOTA) pre-trained model, i.e., LayoutLM Xu et al. (2020b).
2 Related Work
Information Extraction from Documents is crucial for improving the efficiency of form processing and reducing human labor. Information extraction is often formulated as a field extraction task. Earlier methods Chiticariu et al. (2013); Schuster et al. (2013) extract information from documents with the help of templates registered in the system. Later, Palm et al. (2019) propose an invoice field extractor using an Attend, Copy, Parse architecture. Majumder et al. (2020) present a field-value pairing framework that learns the representations of fields and value candidates in the same feature space using metric learning. Nguyen et al. (2021) propose a span extraction approach that extracts the start and end of a value for each queried field. Gao et al. (2021) introduce a field extraction system that can be trained with large-scale unlabeled documents. Xue et al. (2021) propose 14 form transformations to mimic the variations of forms for the robustness evaluation of transformer-based field extractors. Unlike previous methods that aim to extract values for a pre-defined set of fields, our method targets retrieving values for arbitrary queries.
Document Pre-training is an effective strategy for improving document-related downstream tasks Xu et al. (2020b, a); Appalaraju et al. (2021). Xu et al. (2020b) propose LayoutLM, which models the interaction between text and layout in scanned documents using masked language modeling and image-text matching. Later, LayoutLMv2 Xu et al. (2020a) emphasizes the importance of image features for document understanding and presents a spatial-aware self-attention mechanism to better learn relative positional relationships among text blocks. Most recently, Appalaraju et al. (2021) propose DocFormer, which encourages interaction between the image and text modalities by adding an image reconstruction task. The existing pre-training methods perform well on downstream tasks such as document classification and token sequence labeling. However, all of the above methods include the absolute 1-D positional embedding (the so-called reading order Xu et al. (2020b)) of the tokens in the inputs. Although this 1-D embedding is helpful prior knowledge for a holistic understanding of a document, it hinders the model from learning rich geometric relationships among tokens and is thus not beneficial to our value retrieval task, where the local geometric relationship between words is essential for prediction. Our method instead uses a permutation-invariant positional encoding to improve model performance.
3 Our Approach
Problem Formulation. The inputs to the system are an expected key phrase as the query, $q$, and a document $D$. $D$ is represented using a set of OCR words $\{w_1, \dots, w_n\}$ and their bounding-box locations $\{b_1, \dots, b_n\}$ in the document. The input query phrase is tokenized to $q = \{q_1, \dots, q_m\}$. A value retrieval system reads the document and understands its layout and semantics. The goal is to pick a phrase from the OCR words as the value $v$ for the input query $q$. Since the modeling is within one document, we omit the document subscript for simplicity.
3.1 Value Retrieval with Arbitrary Queries
Our method is illustrated in Figure 2. The direct inputs are a query $q = \{q_1, \dots, q_m\}$ and OCR words $\{w_1, \dots, w_n\}$ associated with their locations $\{b_1, \dots, b_n\}$. We use a fixed dummy location $b_q$ for each query. Each query/OCR word is embedded as $E_{word} \in \mathbb{R}^{d}$ and its location is encoded as $E_{loc} \in \mathbb{R}^{d}$, where $d$ indicates the length of each vector (see Section 3.2 for details). The final embedding of each word (e.g., $e_{q_j}$ for a query word or $e_{w_i}$ for an OCR word) is the summation of its word embedding and location embedding.
A transformer is used to model the interactions among $\{q_j\}$ and $\{w_i\}$ via self-attention layers. In the $l$-th layer, the hidden representation of the $i$-th token is updated following

$h_i^{(l)} = \mathrm{SelfAttn}(H^{(l-1)})_i, \quad (1)$

where $H^{(l-1)} \in \mathbb{R}^{(m+n) \times d_h}$ indicates the representations of all tokens from the $(l-1)$-th layer and $d_h$ denotes the length of the hidden feature $h_i^{(l)}$. $H^{(0)}$ is the matrix of input embeddings. From the final layer, we obtain $h_i = h_i^{(L)}$. The self-attention mechanism allows the model to fully learn the interactions among query words and OCR words.
We obtain the query phrase representation, $r_q$, using average pooling over $\{h_{q_1}, \dots, h_{q_m}\}$. The final representation of $w_i$ is obtained by $r_{w_i} = \mathrm{FC}(h_{w_i})$, where $\mathrm{FC}$ indicates a fully connected layer. The likelihood score of $w_i$ being a part of the target value of the query is obtained in Equation 2:

$s_i = \mathrm{sigmoid}(r_q \cdot r_{w_i}). \quad (2)$
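The scoring pipeline can be sketched in a few lines of numpy. This is a toy illustration, not the paper's implementation: random matrices stand in for learned parameters, the transformer's self-attention layers are elided (the query and OCR embeddings are simply concatenated into one sequence where those layers would run), and all token ids and boxes are made-up values.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy embedding/hidden size; the paper uses a LayoutLM-base transformer

# Random stand-ins for learned parameters (hypothetical toy values).
word_table = rng.normal(size=(100, d))     # word-embedding lookup table
loc_proj = rng.normal(size=(4, d))         # stand-in for the location-embedding lookup
fc = rng.normal(size=(d, d)) / np.sqrt(d)  # final fully connected projection

def embed(tokens, boxes):
    """Final embedding of each token = word embedding + location embedding."""
    word_emb = word_table[tokens]                      # (n, d)
    loc_emb = (np.asarray(boxes) / 1000.0) @ loc_proj  # coordinates rescaled to [0, 1]
    return word_emb + loc_emb

query_tokens = np.array([3, 7])                    # tokenized query phrase
query_boxes = np.tile([0, 0, 1000, 1000], (2, 1))  # fixed dummy location for the query
ocr_tokens = np.array([5, 9, 11])
ocr_boxes = [[10, 10, 80, 30], [90, 10, 160, 30], [10, 40, 80, 60]]

# Query and OCR embeddings form one joint sequence; a real model runs L
# self-attention layers here so the query interacts with every OCR word.
h = np.concatenate([embed(query_tokens, query_boxes), embed(ocr_tokens, ocr_boxes)])

r_q = h[:len(query_tokens)].mean(axis=0)     # average-pool the query tokens
r_w = h[len(query_tokens):] @ fc             # project each OCR word
scores = 1.0 / (1.0 + np.exp(-(r_w @ r_q)))  # Equation 2: sigmoid(r_q · r_w)
```

Each entry of `scores` is the likelihood that the corresponding OCR word is part of the target value.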
Our model is expected to learn (1) the layout and semantics of the document, (2) the mapping between the input query phrase and the actual key texts in the document, and (3) the geometric and semantic relationship between the key and value.
Model optimization. During training, each $w_i$ is associated with a ground-truth label $y_i \in \{0, 1\}$, where $y_i = 1$ means the word is a part of the target value and $y_i = 0$ means it is not. The model is optimized using the binary cross entropy loss

$\mathcal{L} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log s_i + (1 - y_i) \log(1 - s_i) \right].$
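Concretely, the per-word binary cross entropy averaged over all OCR words can be computed as follows (a plain-Python sketch; a framework such as PyTorch would provide this loss directly):

```python
import math

def bce_loss(scores, labels):
    """Binary cross entropy averaged over OCR words.

    scores[i] is the sigmoid output for word i; labels[i] = 1 if word i
    is part of the target value, else 0. eps guards against log(0).
    """
    eps = 1e-9
    return -sum(y * math.log(s + eps) + (1 - y) * math.log(1 - s + eps)
                for s, y in zip(scores, labels)) / len(scores)

# One value word predicted confidently, two non-value words suppressed.
loss = bce_loss([0.9, 0.2, 0.1], [1, 0, 0])
```

The loss shrinks toward zero as positive words receive scores near 1 and negative words near 0.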
Inference. Since the target value may contain multiple words, we group nearby OCR words horizontally into value candidates, $\{c_1, \dots, c_K\}$, based on their locations using the DBSCAN algorithm Ester et al. (1996), where $K$ is the number of grouped candidates. The value score of each candidate, $S_k$, is the maximum over all of its covered words,

$S_k = \max_{w_i \in c_k} s_i, \quad (3)$

where $w_i \in c_k$ indicates that the OCR word $w_i$ is part of the grouped candidate $c_k$. The value candidate with the highest score is used as the value prediction.
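The inference step can be sketched as follows. For readability, a simple same-row/small-gap rule stands in for the DBSCAN clustering the paper uses, and the words, boxes, scores, and gap thresholds are all illustrative values:

```python
def group_words(words, boxes, gap=15):
    """Group nearby words on the same line into value candidates.

    A simple stand-in for the DBSCAN grouping: words whose boxes share a
    row and sit within `gap` pixels of each other are merged.
    """
    order = sorted(range(len(words)), key=lambda i: (boxes[i][1], boxes[i][0]))
    groups, current = [], [order[0]]
    for prev, i in zip(order, order[1:]):
        same_row = abs(boxes[i][1] - boxes[prev][1]) < 5
        close = boxes[i][0] - boxes[prev][2] < gap
        if same_row and close:
            current.append(i)
        else:
            groups.append(current)
            current = [i]
    groups.append(current)
    return groups

def predict_value(words, boxes, scores):
    """Candidate score = max over member-word scores; return the best candidate."""
    groups = group_words(words, boxes)
    best = max(groups, key=lambda g: max(scores[i] for i in g))
    return " ".join(words[i] for i in best)

# Toy form: the key "TOTAL PAGES:" sits left of the value "4".
words = ["TOTAL", "PAGES:", "4", "FAX"]
boxes = [(10, 10, 60, 25), (65, 10, 120, 25), (160, 10, 170, 25), (10, 50, 45, 65)]
scores = [0.05, 0.10, 0.95, 0.02]  # per-word scores from Equation 2
```

Here the two key words merge into one candidate, "4" stands alone as another, and the candidate containing the highest-scoring word is returned as the prediction.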
3.2 SimpleDLM for Document Pre-training
We introduce a Simple Document Language Modeling (SimpleDLM) method to encourage the understanding of the geometric relationship among words during pre-training.
The inputs to our pre-trained model are the OCR words $\{w_i\}$ associated with their locations $\{b_i\}$. The word/location embedding protocol and the transformer structure of our pre-trained model are the same as those of our value retrieval model, so the pre-trained model can be used directly to initialize the parameters of the value retrieval model. Specifically, the word embedding, $E_{word}$, is constructed using a simple lookup table. Previous works Xu et al. (2020b, a); Appalaraju et al. (2021) require the input text to be sorted in reading order so that they can process the texts of a document in a similar way to processing language. They leverage this prior knowledge by adding the rank of each word as a 1-D positional embedding in the final location embedding, $E_{loc}$. This 1-D embedding provides a holistic view of the geometric relationships of words. However, it introduces an extra dependency on the OCR engines (most SOTA OCR engines cannot sort detected OCR words in reading order), and it also restricts the model from flexibly learning local geometric relations between words. To encourage the model to better learn local geometric relations, we exclude the 1-D positional embedding and only encode the 2-D bounding-box location (top-left, bottom-right, width and height) of each word using a lookup table. For simplicity, we use only masked language modeling as the pre-training objective. We show in Section 4 that this simple pre-training strategy yields a large improvement over the SOTA.
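The practical consequence of dropping the 1-D positional embedding can be illustrated with a toy location encoder. The shared coordinate lookup table below is a simplifying assumption (random values standing in for learned embeddings, and separate per-coordinate tables are equally possible); the point is that each word's location embedding depends only on its 2-D box, so the encoding is invariant to the order in which the OCR engine emits the words:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
# One lookup table shared by all quantized 2-D coordinates (toy assumption).
coord_table = rng.normal(size=(1001, d))  # coordinates rescaled into [0, 1000]

def location_embedding(box):
    """2-D location embedding from top-left, bottom-right, width, and height.

    Note there is no 1-D reading-order term: the embedding depends only
    on the box itself, not on the word's position in the input sequence.
    """
    x0, y0, x1, y1 = box
    return (coord_table[x0] + coord_table[y0] + coord_table[x1]
            + coord_table[y1] + coord_table[x1 - x0] + coord_table[y1 - y0])

boxes = [(10, 10, 60, 25), (65, 10, 120, 25)]
emb_forward = [location_embedding(b) for b in boxes]
emb_shuffled = [location_embedding(b) for b in reversed(boxes)]
# Same boxes fed in a different order -> the same set of embeddings,
# unlike a model that adds a rank-based 1-D positional embedding.
```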
4 Experiments

4.1 Datasets

IIT-CDIP Lewis et al. (2006) is a large-scale unlabeled document dataset that contains more than 11 million scanned images. Following prior works Xu et al. (2020b, a); Appalaraju et al. (2021), our model is pre-trained on this dataset.
FUNSD Jaume et al. (2019) is a commonly used dataset for spatial layout analysis. It contains 199 scanned forms with 9,707 semantic entities annotated, where 149 samples are for training and 50 for testing. The semantic linking annotations for all the key-value pairs are provided in the dataset, which makes it suitable for our task.
INV-CDIP Gao et al. (2021) is a document dataset that contains 350 real invoices for testing. This dataset has key-value pair annotations for 7 commonly used invoice fields, including invoice_number, purchase_order, invoice_date, due_date, amount_due, total_amount and total_tax. We evaluate our model on this test set.
4.2 Experimental Settings
By default, the annotated key texts of each dataset are used as the queries. The locations of the keys are not used. Models are pre-trained on IIT-CDIP and fine-tuned on the training set of FUNSD. Implementation details are given in Section A.
Our baseline. We implement our baseline following Majumder et al. (2020); Nguyen et al. (2021). Unlike our method, which utilizes a unified transformer to deeply model interactions among the query words and the OCR words, our baseline models these interactions in a shallower way (see Section A for details). We also compare our method and the baseline when initialized with different pre-trained models, including BERT Devlin et al. (2018), LayoutLM Xu et al. (2020b) and our SimpleDLM.
Evaluation Metric. Following prior work Xu et al. (2020b, a); Gao et al. (2021), we use entity-level F1 score to evaluate models. Exact string matching between predicted values and ground-truth values is used to count true positives, false positives and false negatives. If a query has multiple value answers, a prediction is counted as correct if it equals any of them.
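A minimal sketch of this metric, under the simplifying assumption that the system emits exactly one predicted value string per query (so every query contributes one prediction and one ground-truth entity):

```python
def entity_f1(predictions, ground_truths):
    """Entity-level F1 with exact string matching.

    predictions[i] is the predicted value string for query i.
    ground_truths[i] is the list of acceptable value strings for query i;
    a prediction counts as a true positive if it equals any of them.
    """
    tp = sum(pred in gts for pred, gts in zip(predictions, ground_truths))
    fp = len(predictions) - tp    # predictions that matched no acceptable value
    fn = len(ground_truths) - tp  # queries whose value was not recovered
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# One of two queries answered correctly -> precision = recall = F1 = 0.5.
f1 = entity_f1(["4", "INV-001"], [["4"], ["INV-002"]])
```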
4.3 Results

Table 1 compares our method and the baseline when they are pre-trained with different approaches. The performance of both methods improves largely with our proposed SimpleDLM. For the baseline, the F1 score improves by 14.4% when replacing LayoutLM with our SimpleDLM as the pre-trained model. Similarly, using SimpleDLM increases the F1 score of our method by 16.9% compared to using LayoutLM. Our method outperforms the baseline by 3-4% across the different pre-trained models.
Transfer ability to another dataset is important in real-world applications. We measure this ability by directly evaluating the models trained on FUNSD on the test set of INV-CDIP in Table 2. When transferring to a new dataset, the performance of both our method and the baseline drops compared to the numbers in Table 1. When using the exact key as the query, our method surpasses the baseline by a large margin of 16.5% in F1 score.
In practice, we cannot assume the input queries match the actual keys shown in a form exactly. Here, we experiment with using the field names (see Section 4.1) directly as the queries. Using field names is more convenient, since users do not need to design different queries that match the keys of different forms. However, field names are also more abstract, which makes the scenario more challenging. When using the abstract field name as a query, our method achieves 20.5% F1, which is 14.8% better than our baseline.
5 Conclusion

We propose an approach for value retrieval with arbitrary queries for form-like documents. We introduce a transformer-based method that takes a query and the detected OCR words of a document as inputs, models their interactions, and predicts the best value corresponding to the input query. Different from previous methods that extract values for pre-defined fields of specific form types, our method targets extracting values for arbitrary queries. We also present simple document language modeling (SimpleDLM) as a pre-training strategy. Experimental results show that our method significantly outperforms our baseline in different settings, and the proposed SimpleDLM shows a large advantage over LayoutLM for our task.
6 Broader Impacts
This work is introduced to automate information extraction from forms and improve document processing efficiency. It has positive impacts such as reducing human labor. However, reducing human labor may also cause negative consequences such as job loss or displacement, particularly amongst low-skilled laborers who may be most in need of gainful employment. This negative impact is not specific to this work and should be addressed broadly in the field of AI research.
References

- Appalaraju et al. (2021). DocFormer: End-to-end transformer for document understanding. arXiv preprint arXiv:2106.11539.
- Chiticariu et al. (2013). Rule-based information extraction is dead! Long live rule-based information extraction systems! In EMNLP.
- Devlin et al. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Ester et al. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD.
- Gao et al. (2021). Field extraction from forms with unlabeled data. arXiv preprint arXiv:2110.04282.
- Jaume et al. (2019). FUNSD: A dataset for form understanding in noisy scanned documents. In 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW), Vol. 2, pp. 1-6.
- Lewis et al. (2006). Building a test collection for complex document information processing. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 665-666.
- Majumder et al. (2020). Representation learning for information extraction from form-like documents. In ACL.
- Nguyen et al. (2021). A span extraction approach for information extraction on visually-rich documents. arXiv preprint arXiv:2106.00978.
- Palm et al. (2019). Attend, copy, parse: End-to-end information extraction from documents. In ICDAR.
- Schuster et al. (2013). Intellix - end-user trained information extraction for document archiving. In ICDAR.
- Xu et al. (2020a). LayoutLMv2: Multi-modal pre-training for visually-rich document understanding. arXiv preprint arXiv:2012.14740.
- Xu et al. (2020b). LayoutLM: Pre-training of text and layout for document image understanding. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1192-1200.
- Xue et al. (2021). Robustness evaluation of transformer-based form field extractors via form attacks. arXiv preprint arXiv:2110.04413.
Appendix A

A.1 Implementation Details

A.1.1 Our Method
Our code is implemented using PyTorch. We use Tesseract (https://github.com/tesseract-ocr/tesseract) to extract OCR words from documents for IIT-CDIP and INV-CDIP. Since FUNSD provides official OCR annotations, we use them directly. The total numbers of query words and OCR words differ across queries and documents; we keep a fixed maximum sequence length and pad with 0s when needed. We follow LayoutLM-base to set up the structure of our transformer. The fully connected layer used for feature projection has 768 units. Each document is rescaled to [1000, 1000], and the dummy location, $b_q$, is set to [0, 0, 1000, 1000]. Adam is used as the optimizer. During pre-training, the learning rate is 5e-5. During fine-tuning, the learning rate is set to 3e-5 with weight decay equal to 0.9. SimpleDLM is initialized from LayoutLM and pre-trained on IIT-CDIP using 8 Nvidia A100 GPUs with a batch size of 36 for a total of 40,000 iterations. We use a single A100 GPU for fine-tuning, where the training batch size is 8 and the total number of epochs is 45. In our experiments, all the pre-trained models, including BERT, LayoutLM and SimpleDLM, are base models.
A.1.2 Our Baseline
As shown in Figure A1, the baseline has the same transformer architecture and feature projection layer as our method. The transformer takes the OCR words with their locations as inputs and produces a representation $r_{w_i}$ for each word $w_i$. Then, the query-value pairing score is obtained by measuring the similarity between the query representation, $r_q$, and $r_{w_i}$ as in Equation 2. The query representation is obtained by average pooling over word embeddings extracted from a pre-trained BERT model. Thus, the only difference between the baseline and our method is how the query interacts with the OCR words. For a fair comparison, we keep the baseline settings the same as our method except for this interaction strategy.