Robustness Evaluation of Transformer-based Form Field Extractors via Form Attacks

by   Le Xue, et al.

We propose a novel framework to evaluate the robustness of transformer-based form field extraction methods via form attacks. We introduce 14 novel form transformations to evaluate the vulnerability of state-of-the-art field extractors against form attacks at both the OCR level and the form level, including OCR location/order rearrangement, form background manipulation and form field-value augmentation. We conduct robustness evaluation using real invoices and receipts, and perform comprehensive research analysis. Experimental results suggest that the evaluated models are very susceptible to form perturbations such as the variation of field-values (roughly 15% drop in F1 score), the disarrangement of input text order (roughly 15%), and the disruption of the neighboring words of field-values (roughly 10%). Based on the analysis, we make recommendations to improve the design of field extractors and the process of data collection.



1 Introduction

Forms such as invoices and receipts are essential in business workflows. Extracting target values for fields of interest from forms (see an example in Fig. 1) is among the most important tasks in document understanding. Large numbers of forms are processed every day, but most current systems still rely on human labor to manually pick field-values out of a mass of irrelevant information. Developing a method that automatically extracts field-values based on an understanding of the forms is crucial to reduce human labor and thus improve business efficiency.

Existing works Chiticariu et al. (2013); Schuster et al. (2013); Palm et al. (2019); Majumder et al. (2020); Xu et al. (2020) focus on improving the modeling of field extractors and have made great progress. However, their evaluation paradigms are limited. First, most of the methods are evaluated using internal datasets. Internal datasets usually have very limited variations and are often biased towards certain data distributions due to the constraints of the data collection process. For example, the forms might be collected from just a few vendors in a relatively short time, which leads to similar semantics and layouts across the forms. Second, public datasets lack diversity in terms of both textual expression and form layouts. Take the most frequently used dataset, SROIE Huang et al. (2019), as an example. The fields company and address are always at the very top of all receipts. Although the existing models achieve decent performance on these datasets, it is difficult to know whether they can generalize well. This issue could be solved by collecting large-scale diverse forms for evaluation, but doing so is very challenging since real forms usually contain customers' private information and thus are not publicly accessible.

Figure 1: A form field extraction system may fail due to a slight modification to the form. Keys (concrete text-expressions of fields) are marked in blue boxes. Values are marked in red boxes.

To tackle this dilemma, we propose a novel framework to evaluate the robustness of form field extractors by attacking the models using form transformations. We consider form perturbations from both OCR level and form level, including OCR text location/order rearrangement, form background manipulation, and form field-value augmentation. Fourteen form transformations are proposed to impose these attacks. Using the proposed framework, we conduct robustness analysis on two commonly used form types, i.e., invoices and receipts. Experimental results demonstrate that the state-of-the-art (SOTA) methods are particularly vulnerable to form perturbations, including the variation of field-values, the disarrangement of input text order, and the disruption of the neighboring words of field-values. Recommendations for model design and data collection/augmentation are made accordingly.

Our contributions are summarized as follows. First, we introduce a framework to measure the robustness of form field extractors by attacking the models using the proposed form transformations. To the best of our knowledge, this is the first work studying form attacks to field extraction methods. Second, we identify the susceptibilities of the SOTA methods by comprehensive robustness analysis on two form types using the proposed framework and make insightful recommendations.

Figure 2: An illustration of our evaluation pipeline. An OCR engine is first used to extract texts and locations. 14 transformations are applied to the texts and their locations to generate diverse form variants. Then, each transformed set is input to a transformer-based field extractor. Robustness evaluation results are finally generated.

2 Related Work

Information extraction from forms is a widely researched area. Katti et al. (2018) and Denk and Reisswig (2019) encode each page of a form as a two-dimensional grid and extract header and line items from it using fully convolutional networks. DocStruct Wang et al. (2020) conducts document structure inference by encoding the form structure as a graph-like hierarchy of text fragments. Research works specifically focusing on form field extraction are more related to our work. Earlier methods Chiticariu et al. (2013); Schuster et al. (2013) relied on pre-registered templates in the system for information extraction. Palm et al. (2019) extract field-values of invoices via an Attend, Copy, Parse architecture. Recent methods formulate the field extraction problem as field-value pairing Majumder et al. (2020) and field tagging Xu et al. (2020) tasks, where transformer Vaswani et al. (2017) based structures are used to extract informative form representation via modeling interactions among text tokens. We focus on evaluating transformer-based field extraction methods given their great predictive capability for the task.

Robustness evaluation of models has received considerable attention. Errudite Wu et al. (2019) introduces model and task agnostic principles for informative error analysis of NLP models. Ma (2019) proposes NLPAug, which contains simple textual augmentations to improve model robustness. Some works study robustness to text attacks Morris et al. (2020); Zeng et al. (2020); Kiela et al. (2021). A recent work, Robustness Gym Goel et al. (2021), presents a simple and extensible evaluation toolkit that unifies standard evaluation paradigms. There are also recent methods studying the robustness of visual models Santurkar et al. (2020); Salman et al. (2020); Taori et al. (2020). To the best of our knowledge, this work is the first one focusing on robustness evaluation of form field extraction systems.

3 Preliminary: Transformer-based Form Field Extractor

We are focusing on the robustness evaluation of transformer-based form field extractors due to their undisputedly outstanding performance. Before discussing the robustness evaluation, we first illustrate the field extraction pipeline.

In a standard field extraction system, an OCR engine is used to extract a set of words, $\{w_i\}_{i=1}^{N}$, and their bounding box locations, $\{b_i\}_{i=1}^{N}$, where $N$ indicates the total number of words. Then, a transformer-based feature backbone is used to model the interactions between text tokens and generate informative token representations, $\{h_i\}_{i=1}^{N}$. Since both semantics and layouts are essential for field-value inference, we use LayoutLM Xu et al. (2020) as the feature backbone. In the appendix, we also experiment with two more transformers, i.e., BERT Devlin et al. (2019) and RoBERTa Liu et al. (2019), which only take text as input. Finally, a fully connected (FC) layer is used to project the token features to field space and generate $\{s_i\}_{i=1}^{N}$, where $s_{i,k}$ indicates the predicted score of the $i$-th token for field $k$ and $K$ denotes the total number of positive fields. During training, the cross entropy loss between $s_i$ and the field label $y_i$ is utilized for model optimization. During inference, a post-processing method is applied to the predicted field scores to get the value for each field. We follow simple criteria to generate field-values: (1) we find the predicted field label for each word by $\hat{y}_i = \arg\max_k s_{i,k}$, where $k \in \{1, \dots, K\}$ corresponds to fields; (2) by default, for each field, we only keep the word as the value if its prediction score is the highest among all the words and larger than a threshold $\theta$. For fields that often include multi-word values, e.g., address and company, we keep all the words exceeding the threshold and group nearby ones with the same predicted field.
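The post-processing criteria above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `extract_field_values`, the score layout (one list of class scores per word, with index 0 reserved for background), and the default threshold of 0.1 are all hypothetical. The grouping of nearby words for multi-word fields is omitted.

```python
from typing import Dict, List, Sequence, Tuple


def extract_field_values(
    words: Sequence[str],
    scores: Sequence[Sequence[float]],
    fields: Sequence[str],
    threshold: float = 0.1,  # assumed value, for illustration only
    multi_word_fields: Sequence[str] = (),
) -> Dict[str, List[str]]:
    """scores[i][k] is the score of word i for class k.

    Assumption: class 0 is background and class k >= 1 maps to fields[k - 1].
    """
    best: Dict[str, Tuple[float, str]] = {}
    kept: Dict[str, List[str]] = {f: [] for f in fields}
    for word, s in zip(words, scores):
        # Criterion (1): predicted label is the arg max over class scores.
        k = max(range(len(s)), key=lambda j: s[j])
        if k == 0 or s[k] <= threshold:
            continue
        field = fields[k - 1]
        if field in multi_word_fields:
            # Multi-word fields keep every word above the threshold.
            kept[field].append(word)
        elif field not in best or s[k] > best[field][0]:
            # Criterion (2): keep only the highest-scoring word per field.
            best[field] = (s[k], word)
    for field, (_, word) in best.items():
        kept[field] = [word]
    return kept
```

For single-word fields the sketch returns at most one word per field; a field with no word above the threshold maps to an empty list.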

Evaluation Metric. End-to-end F1 score averaging over fields is used to evaluate models. We use exact string matching between the predicted values and the ground-truth to count true positives, false positives, and false negatives. Precision, recall and F1 scores are obtained accordingly for each field.
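A small sketch of how such an exact-match metric could be computed is given below. The counting convention is an assumption on our part: a wrong prediction for a field that has a ground-truth value is counted as both a false positive and a false negative. The function name `field_f1` and the dictionary-per-form data layout are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def field_f1(
    preds: List[Dict[str, Optional[str]]],
    golds: List[Dict[str, Optional[str]]],
    fields: List[str],
) -> Tuple[Dict[str, float], float]:
    """Return per-field F1 and the F1 averaged over fields."""
    per_field: Dict[str, float] = {}
    for f in fields:
        tp = fp = fn = 0
        for pred, gold in zip(preds, golds):
            pv, gv = pred.get(f), gold.get(f)
            if pv is not None and pv == gv:
                tp += 1  # exact string match
                continue
            if pv is not None:
                fp += 1  # predicted something that is not the gold value
            if gv is not None:
                fn += 1  # gold value was not recovered
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        per_field[f] = 2 * precision * recall / denom if denom else 0.0
    macro = sum(per_field.values()) / len(fields)
    return per_field, macro
```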

4 Robustness Evaluation via Form Attacks

We propose OCR level and form level transformations to attack field extractors. As shown in Fig. 2, our transformations are performed after OCR extraction, and the transformed data is input to a transformer-based field extractor. An analysis is conducted via performance comparison between the original set and transformed sets. Each transformation and the principles behind it are introduced as follows.

4.1 OCR Location and Order Rearrangement

We transform the original data to meaningful variants by slightly altering the OCR locations and the text order arrangement. These transformations simulate scenarios where we may obtain different OCR results before inputting to field extractors due to various reasons, e.g., the quality of OCR engines.

Center Shift and Box Stretch. To evaluate model robustness to OCR text location jittering, we propose two word-level location transformations. In Center Shift, we keep the box size and randomly shift the center of a box. The shift is proportional to the width (horizontally) and height (vertically) of the box, and the ratio is a random number drawn from a normal distribution. Box Stretch randomly changes the four coordinates of a box in a similar way.
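The two location transformations can be sketched as below. This is an illustration only: the zero mean of the normal distribution, the standard deviation parameter `sigma`, and the `(x0, y0, x1, y1)` box layout are assumptions, and the function names are hypothetical.

```python
import random
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed layout


def center_shift(box: Box, sigma: float, rng: random.Random) -> Box:
    """Shift the box center; the box size is unchanged."""
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    dx = rng.gauss(0.0, sigma) * w  # horizontal shift, proportional to width
    dy = rng.gauss(0.0, sigma) * h  # vertical shift, proportional to height
    return (x0 + dx, y0 + dy, x1 + dx, y1 + dy)


def box_stretch(box: Box, sigma: float, rng: random.Random) -> Box:
    """Jitter each of the four coordinates independently."""
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    nx0 = x0 + rng.gauss(0.0, sigma) * w
    nx1 = x1 + rng.gauss(0.0, sigma) * w
    ny0 = y0 + rng.gauss(0.0, sigma) * h
    ny1 = y1 + rng.gauss(0.0, sigma) * h
    # Re-order so the result is still a valid box after large jitters.
    return (min(nx0, nx1), min(ny0, ny1), max(nx0, nx1), max(ny0, ny1))
```

Passing an explicit `random.Random` instance keeps a transformed test set reproducible across runs.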

Margin Padding. Scanning forms may introduce white margins, which globally change text locations. We use Margin Padding to manipulate the locations of all the words in a form. We pad white margins on the left, right, top, and bottom sides of a form, where each margin length is a random number between 1 and a maximum fraction of the page size (set to 0.3 in Sec. 5.2).
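A possible sketch of Margin Padding on OCR boxes follows. It assumes pixel coordinates, a uniform distribution for the margin lengths, and a hypothetical function name `margin_pad`; none of these details are confirmed by the text.

```python
import random
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels (assumed)


def margin_pad(
    boxes: List[Box],
    page_w: float,
    page_h: float,
    max_ratio: float,
    rng: random.Random,
) -> Tuple[List[Box], Tuple[float, float]]:
    """Pad random white margins on all four sides of the page."""
    left = rng.uniform(1.0, max(1.0, max_ratio * page_w))
    right = rng.uniform(1.0, max(1.0, max_ratio * page_w))
    top = rng.uniform(1.0, max(1.0, max_ratio * page_h))
    bottom = rng.uniform(1.0, max(1.0, max_ratio * page_h))
    # Only the left and top margins move the boxes; all four enlarge the page.
    shifted = [(x0 + left, y0 + top, x1 + left, y1 + top) for x0, y0, x1, y1 in boxes]
    return shifted, (page_w + left + right, page_h + top + bottom)
```

If box coordinates are later normalized by the page size before being fed to the model, the enlarged page also rescales every normalized location, which is why this transformation affects all words at once.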

Global Shuffle. We observe that organizing a transformer’s inputs in reading order is particularly beneficial to understanding the form structure. However, the reading order is not always guaranteed by OCR engines. Hence, it is interesting to investigate model robustness to poor reading order quality. We use Global Shuffle to shuffle the order of words before inputting to transformers. Note that the words and their locations are not changed at all and the only difference is the order of the word sequence input to the transformer.

Neighbor Shuffle and Non-neighbor Shuffle. Intuitively, the local neighbors of a value contribute more to its prediction. So, we propose Neighbor Shuffle, which shuffles the order of each value's neighbors and keeps the order of the rest. Conversely, we also have Non-neighbor Shuffle. A word is defined as a value's neighbor if the IoU between the word's box and the neighbor zone of the value is larger than a threshold. The neighbor zone is a box that shares the same center as the value box, with expanded width and height. We also include nearby words from the original reading order as neighbors.
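The neighbor definition and the order shuffling can be sketched as follows. The expand rates, the IoU threshold, and all function names are hypothetical placeholders; the extra rule that adds nearby words from the original reading order is left out of this sketch.

```python
import random
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def neighbor_zone(value_box: Box, expand_w: float, expand_h: float) -> Box:
    """Box with the same center as value_box and expanded width and height."""
    x0, y0, x1, y1 = value_box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    half_w = (x1 - x0) * expand_w / 2
    half_h = (y1 - y0) * expand_h / 2
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


def neighbor_indices(
    boxes: Sequence[Box], value_idx: int,
    expand_w: float, expand_h: float, iou_thresh: float,
) -> List[int]:
    """Indices of words whose IoU with the value's neighbor zone exceeds the threshold."""
    zone = neighbor_zone(boxes[value_idx], expand_w, expand_h)
    return [i for i, b in enumerate(boxes)
            if i != value_idx and iou(b, zone) > iou_thresh]


def shuffle_positions(items: Sequence, positions: Sequence[int], rng: random.Random) -> list:
    """Permute the items at `positions`; every other item keeps its place."""
    out = list(items)
    picked = [items[i] for i in positions]
    rng.shuffle(picked)
    for i, item in zip(positions, picked):
        out[i] = item
    return out
```

Neighbor Shuffle would call `shuffle_positions` with the neighbor indices, and Non-neighbor Shuffle with the complement of that set.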

4.2 Form Background Manipulation

Form background generally affects the model performance in two ways: (1) some background words are strong indicators that improve the recall of field-values, and (2) accurate prediction of the background reduces false positives and thus increases model precision. We propose the following transformations to evaluate model robustness to background perturbations.

BG Drop. Background (BG) Drop mimics the scenario in which some words are completely missed by OCR detection. This transformation removes background words together with the corresponding boxes at a pre-defined probability.

Neighbor BG Drop is similar to Neighbor Shuffle: it drops all background words that are neighbors of a field-value.
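As a minimal sketch, BG Drop amounts to filtering the OCR output with a per-word coin flip. The boolean `is_background` mask and the function name are assumptions made for the example; Neighbor BG Drop would instead drop every background word flagged as a neighbor of a field-value.

```python
import random
from typing import List, Sequence, Tuple


def bg_drop(
    words: Sequence[str],
    boxes: Sequence[tuple],
    is_background: Sequence[bool],
    p: float,
    rng: random.Random,
) -> Tuple[List[str], List[tuple]]:
    """Remove each background word (and its box) with probability p."""
    kept_words, kept_boxes = [], []
    for word, box, bg in zip(words, boxes, is_background):
        if bg and rng.random() < p:
            continue  # simulate a word missed by OCR detection
        kept_words.append(word)
        kept_boxes.append(box)
    return kept_words, kept_boxes
```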

Key Drop. Keys are concrete text-representations of fields in a form. For example, the field invoice_number may be represented as "INV #", "Invoice No." etc. in a form. A key is a very important feature for value localization since the value is often located near the key. We propose Key Drop to see the model performance change if keys are accidentally missed by OCR detection.

BG Typo. OCR recognition usually makes errors. BG Typo simulates word-level string typos. We select each background word at a pre-defined probability. For each selected word, we apply one of the error types, including swapping, deleting, adding, and replacing a random or a specific character. [Footnote 1: We utilize the implementation of the string typos provided in …]
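The four error types can be illustrated with a small stand-in written with the standard library only. It is not the implementation referenced in the footnote; the function names and the choice of lowercase ASCII letters for inserted characters are assumptions.

```python
import random
import string
from typing import List, Sequence


def add_typo(word: str, rng: random.Random) -> str:
    """Apply one random character-level error to a word."""
    if len(word) < 2:
        return word
    kind = rng.choice(["swap", "delete", "add", "replace"])
    i = rng.randrange(len(word) - 1)
    if kind == "swap":  # swap two adjacent characters
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if kind == "delete":  # delete one character
        return word[:i] + word[i + 1:]
    c = rng.choice(string.ascii_lowercase)
    if kind == "add":  # insert a random character
        return word[:i] + c + word[i:]
    return word[:i] + c + word[i + 1:]  # replace one character


def bg_typo(
    words: Sequence[str], is_background: Sequence[bool], p: float, rng: random.Random
) -> List[str]:
    """Add a typo to each background word with probability p."""
    return [
        add_typo(w, rng) if bg and rng.random() < p else w
        for w, bg in zip(words, is_background)
    ]
```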

BG Synonyms. Similar semantics may be represented using different word synonyms. BG Synonyms randomly replaces each background word with one of its synonyms at a pre-defined probability. [Footnote 2: We generate word synonyms using WordNet Interface (…)]

BG Adversarial. Some forms contain only one word in the same data type as field-values. For example, there might be only one word with the data type of date. It is less challenging for a model to recognize it as the invoice_date. However, this type of easy case is not always guaranteed in real-world applications. BG Adversarial is used to increase the difficulty level by adding distraction. Concretely, we select background words at a pre-defined probability and use adversarial words for replacement. For each replacement, we randomly choose a data type and then generate a random value of the corresponding data type. We focus on three data types, i.e., date, number and money. We generate random dates using Faker. For numbers, we first generate the number length randomly and then a random number of that length. For money, we obtain a random amount within lower and upper bounds. Then, we put the amount in money format: we add a decimal point before the last two digits, place a comma at every third digit to the left of the decimal point, and randomly insert a currency symbol at the beginning. To protect strong indicators for values, neighbor words are not replaced.
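The money formatting step can be sketched in a few lines. The interpretation that the last two digits of the integer amount become the decimals, the `$` symbol, and the function name `format_money` are assumptions made for illustration.

```python
def format_money(amount: int, with_symbol: bool, symbol: str = "$") -> str:
    """Format an integer amount as money.

    Assumption: the last two digits are the decimals, e.g. 1234567 -> 12,345.67.
    """
    whole, cents = divmod(amount, 100)
    text = f"{whole:,}.{cents:02d}"  # comma at every third digit of the whole part
    return symbol + text if with_symbol else text
```

For example, `format_money(1234567, False)` gives `12,345.67`. In the transformation itself, `with_symbol` would be chosen at random for each generated value.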

4.3 Form Field-Value Augmentations

Modifying field-values is a more direct way to increase the diversity of the evaluation set. We augment field-values in both text and locations.

Value Text Augment. Field-values of forms may be biased due to the limitations of the data collection process. For example, the invoice_date may be restricted to the year the form is collected, and the invoice_number may be biased towards the vendor's numbering system. The Value Text Augment transformation aims to augment the field-values based on their data types. For each field-value, we randomly generate a substitute with the same data type, following the same value generation procedure as in BG Adversarial.

Value Location Augment. Form layouts can be very diverse in real-world scenarios. Intuitively, we should be able to infer a field-value as long as a key is represented properly, no matter where we place the key-value pair in the document. We introduce Value Location Augment to increase layout diversity. To preserve the form format as much as possible, we keep the background as it is and shuffle the key-value pairs' locations in the form. For example, for the fields invoice_number (key: Invoice No., value: 1234) and invoice_date (key: Invoice date, value: 01/01/2021), we swap the box locations of "Invoice No." and "Invoice date", and also the locations of "1234" and "01/01/2021".
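The swap in this example can be written as a short sketch. It assumes that each key and each value is a single OCR box referenced by its index, and the function name `swap_pair_locations` is hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]


def swap_pair_locations(
    boxes: Sequence[Box],
    pair_a: Tuple[int, int],
    pair_b: Tuple[int, int],
) -> List[Box]:
    """Swap the box locations of two (key_index, value_index) pairs.

    The words and the background boxes are untouched; only the locations
    of the two keys and of the two values are exchanged.
    """
    out = list(boxes)
    (key_a, val_a), (key_b, val_b) = pair_a, pair_b
    out[key_a], out[key_b] = boxes[key_b], boxes[key_a]
    out[val_a], out[val_b] = boxes[val_b], boxes[val_a]
    return out
```

With words `["Invoice No.", "1234", "Invoice date", "01/01/2021"]`, calling `swap_pair_locations(boxes, (0, 1), (2, 3))` places "Invoice No." and "1234" where the date pair used to be, and vice versa.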

5 Experiments

We evaluate the robustness of transformer-based field extractors using our framework on two commonly used form types, i.e., invoices and receipts.

5.1 Datasets

Our evaluation models are trained using a labeled train set, and the best performing model is picked based on a validation set. We prepare a separate test set to perform the robustness evaluation. To perform the proposed transformations, we annotate both the key and value of each field of interest with their bounding box locations in a form.

Invoice. The train, valid, and test sets contain 158, 348 and 338 real invoices. They are collected from 111, 222, and 222 vendors, respectively. We sample at most 5 forms from the same vendor and the vendors of train, valid and test sets do not overlap. We consider 7 frequently used fields including invoice_number, purchase_order, invoice_date, due_date, amount_due, total_amount and total_tax.

Receipt. We use the publicly available receipt dataset, SROIE. The annotations of their original test set are not publicly available, so we split the original train set to train, valid, and test sets based on their company names and sample at most 5 forms per company following Majumder et al. (2020). Finally, we get 237 receipts for training, 76 for validation, and 74 for testing. The fields of interest are company, address, date and total. We add value boxes and annotate keys according to the text-level annotations provided by the original dataset.

5.2 Implementation Details

Our evaluation framework is implemented using PyTorch and the experiments are conducted on a single Tesla V100 GPU. The strength of a transformation is controlled by parameters. We set parameters to make moderate perturbations. In BG Typo, BG Drop, BG Synonyms and BG Adversarial, transformations are only applied to some selected words. We fix the pre-defined selection probabilities to 0.1. […] is set to 0.5 and […] is 0.1. The upper bound of the margin ratio in Margin Padding is set to 0.3. When determining the value's neighbor zone, we set the expand rate as […] and […].

We generate random values based on data types in BG Adversarial and Value Text Augment. For dates, we randomly pick a date from the year 2001 to 2021 in one of the formats, including mm/dd/yy, yy-mm-dd, dd/month/yy, and dd/mon/yy. For numbers, the number length is randomly generated from 3 to 12. The amount of money is randomly selected from 1 to 10,000,000.
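The value generators described above can be approximated with the standard library, as in the sketch below. It is a stand-in rather than the Faker-based procedure used in the paper. The mapping of the listed date formats to `strftime` codes (full month name for "month", abbreviated name for "mon") is our assumption, and the function names are hypothetical.

```python
import datetime
import random

# Assumed strftime equivalents of mm/dd/yy, yy-mm-dd, dd/month/yy, dd/mon/yy.
DATE_FORMATS = ["%m/%d/%y", "%y-%m-%d", "%d/%B/%y", "%d/%b/%y"]


def random_date(rng: random.Random) -> str:
    """Random date between 2001 and 2021 in one of the listed formats."""
    start = datetime.date(2001, 1, 1)
    end = datetime.date(2021, 12, 31)
    day = start + datetime.timedelta(days=rng.randrange((end - start).days + 1))
    return day.strftime(rng.choice(DATE_FORMATS))


def random_number(rng: random.Random) -> str:
    """Random digit string whose length is drawn from 3 to 12."""
    length = rng.randint(3, 12)
    return "".join(rng.choice("0123456789") for _ in range(length))


def random_amount(rng: random.Random) -> int:
    """Random money amount between 1 and 10,000,000."""
    return rng.randint(1, 10_000_000)
```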

We use a commercial OCR engine for OCR extraction and utilize Tesseract to rank the words in reading order. Our default transformer is LayoutLM Xu et al. (2020) with text and boxes as inputs. We also evaluate using BERT Devlin et al. (2019) and RoBERTa Liu et al. (2019), which take only text tokens as inputs, in Sec. A. All models are finetuned from the corresponding base models. During training, we set the batch size to 8 and use the Adam optimizer with a learning rate of 5.

5.3 Robustness Evaluation of Invoices

| Transforms | Precision | Recall | F1 |
|---|---|---|---|
| Original | 70.9 | 69.3 | 70.0 |
| Center Shift | 70.8 | 69.9 | 70.3 |
| Box Stretch | 70.0 | 68.5 | 69.1 |
| Margin Pad | 69.6 | 68.3 | 68.9 |

Table 1: Evaluation of robustness to OCR location modifications on invoices. LayoutLM is used as the transformer model.

OCR location modification. Robustness evaluation of the LayoutLM model to OCR text location jittering is shown in Table 1. We obtain results comparable to the original performance when applying Center Shift and Box Stretch, which indicates that slight box jittering is tolerable in our case. Margin Pad shifts all text locations by adding random margins around a form. This transformation also only slightly decreases the model performance.

| Transforms | Precision | Recall | F1 |
|---|---|---|---|
| Original | 70.9 | 69.3 | 70.0 |
| Global Shfl. | 58.5 | 53.9 | 55.9 |
| Neighbor Shfl. | 66.9 | 61.9 | 64.1 |
| Non-neighbor Shfl. | 67.0 | 65.3 | 65.9 |

Table 2: Evaluation of robustness to OCR text order on invoices. LayoutLM is used as the transformer model.

OCR text order is essential to transformer-based field extractors since the order can serve as an important feature to improve model performance. Table 2 shows the model performance when order shuffling is applied to different sets of words. The results show that if we shuffle the order of all words, the performance drops dramatically, by 14.1% in F1 score. When we only shuffle value neighbors, we obtain a 6% lower F1 score. We get a 4% lower F1 score if we shuffle non-neighbor words, even though we keep the text order of all values and their neighbors the same. The results demonstrate the importance of the text order. If the model is trained using texts with a good reading order, we may also want to ensure a good reading order during inference.

A natural question is what happens if we break the reading order during training: will this save us the effort of ensuring a good reading order during inference? We re-train field extractors using texts with random orders. We obtain 58.7% in F1 score, which is 11.3% lower than our baseline. The comparison suggests that the text reading order is a very important feature. How to use it without overfitting to it is an interesting research topic.

| Transforms | Precision | Recall | F1 |
|---|---|---|---|
| Original | 70.9 | 69.3 | 70.0 |
| BG Drop | 69.8 | 66.2 | 67.7 |
| Neighbor BG Drop | 67.6 | 57.1 | 61.3 |
| Key Drop | 62.8 | 56.4 | 59.1 |
| BG Typo | 69.6 | 66.9 | 68.1 |
| BG Synonyms | 70.4 | 68.8 | 69.5 |
| BG Adversarial | 66.2 | 67.7 | 66.9 |

Table 3: Evaluation of robustness to BG manipulations on invoices. LayoutLM is used as the transformer model.

Background drop related transformations simulate the scenarios where an OCR detector accidentally misses some background words. BG Drop removes words randomly selected in the background. As shown in Table 3, global BG Drop leads to slight performance decrease. When we apply Neighbor BG Drop, the performance largely drops by 9% in F1 score. Comparing BG Drop and Neighbor BG Drop, we find that neighbor words are indeed more important for value extraction. For a fair comparison, we have adjusted the dropping rate of BG Drop to 0.13, such that the total number of dropped words is roughly equal to the number of neighbor words. Further, Key Drop results in a similar performance decrease as Neighbor BG Drop, although the total number of Keys is less than half of that for neighbor words.

| Transforms | Precision | Recall | F1 |
|---|---|---|---|
| Original | 70.9 | 69.3 | 70.0 |
| Value Text Aug. | 56.3 | 53.5 | 54.5 |
| Value Location Aug. | 61.4 | 56.8 | 58.8 |

Table 4: Evaluation of robustness to background and field-value augmentations on invoices. LayoutLM is used as the transformer model.

Other background manipulations. Some background words, e.g., keys and other useful indicators, are important features to localize values. Adding typos to the background words may damage these features, and BG Typo accordingly leads to a 2.4% drop in recall.

Forms from different vendors may use different words even when representing similar semantics. We attack the models using BG Synonyms. As shown in Table 3, our field extractors are quite robust to this transformation, with only a negligible drop in F1 score.

BG Adversarial is used to add background words (serving as distractions) with data types similar to those of field-values. As shown in Table 3, BG Adversarial leads to a 4.7% drop in model precision.

Value augmentations. Field-values of forms may be limited in diversity. The Value Text Augment transformation augments field-values by replacing them with randomly generated values of the same data type. We augment the values of all the fields, except for total_amount and amount_due, since these two fields may involve complicated mathematical computations. For total_tax, we randomly select a number between 0 and 15% of the total_amount. The comparison results in Table 4 show that the model performance drops significantly, by 15.5% in F1 score.

Value Location Augment changes the spatial arrangements of key-value pairs. In practice, we only shuffle the key-value pairs if they have the same number of key words and the same number of value words, resulting in more than 75% key-value pairs relocated. The results in Table 4 demonstrate that Value Location Augment significantly reduces F1 scores by 11.2%.

Multiple transformations. The proposed transformations can be combined to generate more diverse sets. We conduct exhaustive combinations of every two and three transformations, which results in 91 2-transformation combinations and 364 3-transformation combinations. (We observe that different orders of transformations in a combination result in negligible differences.) The top-10 most impactful combinations are shown in Fig. 3. The comparison results suggest the following conclusions.
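The combination counts follow directly from choosing 2 or 3 out of the 14 transformations, as the short check below illustrates. The list simply names the transformations from Sec. 4; since the order within a combination is reported to matter little, unordered combinations are enumerated.

```python
from itertools import combinations

TRANSFORMS = [
    "Center Shift", "Box Stretch", "Margin Padding",
    "Global Shuffle", "Neighbor Shuffle", "Non-neighbor Shuffle",
    "BG Drop", "Neighbor BG Drop", "Key Drop",
    "BG Typo", "BG Synonyms", "BG Adversarial",
    "Value Text Augment", "Value Location Augment",
]

# Unordered combinations of two and three distinct transformations.
pairs = list(combinations(TRANSFORMS, 2))
triples = list(combinations(TRANSFORMS, 3))

print(len(TRANSFORMS), len(pairs), len(triples))  # 14 91 364
```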

Figure 3: Top-10 most impactful 2-transformation and 3-transformation combinations. VTA: Value Text Augment, GS: Global Shuffle, VLA: Value Location Augment, KD: Key Drop, NBD: Neighbor Background Drop, BA: Background Adversarial, MP: Margin Pad, BT: Background Typo, NS: Neighbor Shuffle, NNS: Non-Neighbor Shuffle.

First, if an individual transformation causes a larger performance drop, it generally also contributes a larger drop when combined with other transformations. The most impactful combination is (Value Text Augment, Global Shuffle, Value Location Augment), with an F1 score of 25.7. These are the top-3 impactful transformations suggested in Fig. 4.

Second, some individual transformations are less impactful on their own, but have a larger effect when combined with specific other transformations. For example, Margin Pad alone ranks low in Fig. 4. However, it leads to a larger performance drop when combined with Value Text Augment and Global Shuffle. Their performance drop ranks 5 out of 364 combinations (F1 is 30.8). This may be because Margin Pad (changes all words' locations), Value Text Augment (changes values' texts) and Global Shuffle (changes text input order) are three complementary transformations. When we apply Margin Pad alone, the model can still rely on the information of value texts and text orders. However, when we apply these three transformations together, the model becomes inevitably confused.

Third, if the transformations have overlapping effects, their combination has a lower impact. For example, Key Drop, Neighbor BG Drop and Neighbor Shuffle all manipulate neighbor words. The performance drop of their combination ranks 246 out of 364 combinations, although each individual transformation is impactful (see Fig. 4). The F1 score is 58.1, which is very close to that of an individual Key Drop transformation (59.1).

Figure 4: An overview of the LayoutLM-based model's performance drop on different transformed datasets. The results are sorted by the performance gap.
Figure 5: An illustration of the Value Location Augment* transformation on a receipt in the SROIE dataset.

5.4 Robustness Evaluation of Receipts

There are two interesting features of the SROIE dataset. First, a significant number of field-values have no keys, for example, all values of company and address, and some values of date and total. Consequently, changing the context of values has a minor effect on model performance. Second, the layouts of different receipts are very fixed. For example, company and address are always at the very top of every receipt. So, models could easily overfit to field-value locations and the text order. As shown in Table 5, Global Shuffle leads to a significant performance drop of 38.7% in F1 score. Specifically, the F1 scores of address and company drop to 0% when the text order is completely shuffled before inputting to the transformer. The results demonstrate that the model is overfitting to the input text order, especially for address and company.

| Transforms | Precision | Recall | F1 |
|---|---|---|---|
| Original | 81.8 | 80.1 | 80.9 |
| Global Shfl. | 42.5 | 41.9 | 42.2 |
| Value Location Aug.* | 54.3 | 48.0 | 49.7 |
| Value Text Aug. | 75.2 | 73.6 | 74.4 |

Table 5: Robustness evaluation on SROIE dataset. LayoutLM is used as the transformer model.

Most of the fields in SROIE have no keys. To augment receipt layouts, we design a dedicated method that locally moves field-value locations. Specifically, for company and address in SROIE receipts, we move the values to the bottom of the form and shift the rest above to fill the gap as shown in Fig. 5. We refer to this transformation as Value Location Augment*. This transformation changes the location of the values without breaking the text order within each value. We obtain 21.6% and 1.7% F1 score for company and address, respectively, which are around 60% and 68% lower than the original numbers.

In addition, we evaluate models on the test set transformed by Value Text Augment. We replace values of company, address and date using substitutes randomly generated by Faker. As we do for total_amount and amount_due for invoices, we keep the values of total as they are. To maintain the layout structure, we only replace company and address if we are able to get a randomly generated sample with the same number of words as the original sample. This results in about 69% of company values and 31% of address values being changed. As shown in Table 5, Value Text Augment decreases the model performance by 6.5% in F1 score.

5.5 Observations and Suggestions

An overview comparison of all the transformations on invoices is given in Fig. 4. As we can see, the three most impactful transformations are Value Text Augment, Global Shuffle and Value Location Augment. Experiments on receipts also show the effectiveness of these three transformations.

We make the following recommendations based on the analysis. For data collection/augmentation, forms with more diverse values are preferable. For example, we may want dates covering a wide range of time periods with more types of formats and numbers being more extensive. Varying forms’ layouts is also beneficial. Especially, we may want to focus on varying the arrangement of field-values instead of altering individual word locations locally.

For the design of field extractors, we suggest making better use of the text order. As shown in our experiments, the text order is a very useful feature. How to utilize the text reading order without overfitting to it is an interesting topic. In addition, Key Drop and Neighbor BG Drop result in significant performance decreases, as shown in Fig. 4. This suggests that a value's neighbors, especially the keys, are essential for value extraction. Current state-of-the-art models use transformers to model interactions between all words. We believe paying attention to keys and neighbors in the model design has the potential to improve existing field extraction systems.

6 Conclusion and Future Work

We proposed a novel framework to evaluate the robustness of transformer-based form field extractors via form attacks. We introduced 14 transformations that transform forms in different aspects, including OCR-level location and order, background contexts, and field-value text and layouts. We conducted studies on real invoices and receipts with three types of transformer-based models using our proposed framework. Research recommendations were made based on the robustness analysis.

Improving field extraction from forms using the analysis generated by the robustness evaluation is a meaningful research direction. The proposed transformations are potentially useful for increasing the diversity of training samples, thus improving model robustness. We will consider this in future work.

7 Broader Impact

This work targets the robustness evaluation of form information extraction systems, so it has positive impacts such as identifying the bias of existing information extractors and improving the fairness of model comparison. On the other hand, our method may have unintended negative consequences: we have proposed transformations that evaluate various aspects of model robustness, but the metrics we have selected may not be comprehensive. As a result, there is likely some degree of model bias present that has been missed by the proposed framework. However, this negative impact is not specific to our work and should be considered in general in the field of robust AI.

The invoice dataset is for internal use only and does not contain any personally identifiable data. The SROIE dataset is a public dataset released under the MIT license. All forms were annotated by the authors. Consequently, we are confident that the datasets do not raise ethical issues.

References


  • L. Chiticariu, Y. Li, and F. R. Reiss (2013) Rule-based information extraction is dead! long live rule-based information extraction systems! In EMNLP.
  • T. I. Denk and C. Reisswig (2019) BERTgrid: contextualized embedding for 2D document representation and understanding. NeurIPS Workshop.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. NAACL.
  • K. Goel, N. Rajani, J. Vig, S. Tan, J. Wu, S. Zheng, C. Xiong, M. Bansal, and C. Ré (2021) Robustness Gym: unifying the NLP evaluation landscape. arXiv preprint arXiv:2101.04840.
  • Z. Huang, K. Chen, J. He, X. Bai, D. Karatzas, S. Lu, and C. Jawahar (2019) ICDAR2019 competition on scanned receipt OCR and information extraction. In ICDAR.
  • A. R. Katti, C. Reisswig, C. Guder, S. Brarda, S. Bickel, J. Höhne, and J. B. Faddoul (2018) Chargrid: towards understanding 2D documents. EMNLP.
  • D. Kiela, M. Bartolo, Y. Nie, D. Kaushik, A. Geiger, Z. Wu, B. Vidgen, G. Prasad, A. Singh, P. Ringshia, et al. (2021) Dynabench: rethinking benchmarking in NLP. NAACL.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
  • E. Ma (2019) NLP augmentation.
  • B. P. Majumder, N. Potti, S. Tata, J. B. Wendt, Q. Zhao, and M. Najork (2020) Representation learning for information extraction from form-like documents. In ACL.
  • J. X. Morris, E. Lifland, J. Y. Yoo, and Y. Qi (2020) TextAttack: a framework for adversarial attacks in natural language processing. arXiv preprint arXiv:2005.05909.
  • R. B. Palm, F. Laws, and O. Winther (2019) Attend, copy, parse end-to-end information extraction from documents. In ICDAR.
  • H. Salman, A. Ilyas, L. Engstrom, A. Kapoor, and A. Madry (2020) Do adversarially robust ImageNet models transfer better? NeurIPS.
  • S. Santurkar, D. Tsipras, and A. Madry (2020) BREEDS: benchmarks for subpopulation shift. arXiv preprint arXiv:2008.04859.
  • D. Schuster, K. Muthmann, D. Esser, A. Schill, M. Berger, C. Weidling, K. Aliyev, and A. Hofmeier (2013) Intellix – end-user trained information extraction for document archiving. In ICDAR.
  • R. Taori, A. Dave, V. Shankar, N. Carlini, B. Recht, and L. Schmidt (2020) Measuring robustness to natural distribution shifts in image classification. NeurIPS.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. NeurIPS.
  • Z. Wang, M. Zhan, X. Liu, and D. Liang (2020) DocStruct: a multimodal method to extract hierarchy structure in document for general form understanding. EMNLP.
  • T. Wu, M. T. Ribeiro, J. Heer, and D. S. Weld (2019) Errudite: scalable, reproducible, and testable error analysis. In ACL.
  • Y. Xu, M. Li, L. Cui, S. Huang, F. Wei, and M. Zhou (2020) LayoutLM: pre-training of text and layout for document image understanding. In KDD.
  • G. Zeng, F. Qi, Q. Zhou, T. Zhang, B. Hou, Y. Zang, Z. Liu, and M. Sun (2020) OpenAttack: an open-source textual adversarial attack toolkit. arXiv preprint arXiv:2009.09191.

Appendix A

A.1 More Robustness Evaluation of Invoices

The results of BERT and RoBERTa on invoices are summarized in Table A1 and Table A2.

Transforms            Precision   Recall   F1
Original              58.4        58.1     57.8
Global Shfl.          40.5        38.2     38.6
Neighbor Shfl.        51.5        50.7     50.6
Non-neighbor Shfl.    57.9        56.4     56.6
BG Drop               57.7        57.2     56.9
Neighbor BG Drop      49.9        44.7     46.7
Key Drop              50.2        46.5     47.6
BG Typo               56.0        54.1     54.7
BG Synonyms           58.8        58.3     58.0
BG Adversarial        48.7        51.6     49.7
Value Text Aug.       36.2        33.8     34.3
Value Location Aug.   55.5        54.4     54.4
Table A1: Robustness evaluation on invoice dataset. BERT is used as the transformer model.

Transforms            Precision   Recall   F1
Original              63.4        59.1     61.1
Global Shfl.          40.5        33.3     36.4
Neighbor Shfl.        58.9        53.2     55.9
Non-neighbor Shfl.    60.6        57.0     58.7
BG Drop               63.6        58.8     61.0
Neighbor BG Drop      58.8        49.5     53.5
Key Drop              57.7        49.8     53.2
BG Typo               62.1        57.7     59.8
BG Synonyms           62.7        58.5     60.4
BG Adversarial        56.2        55.2     55.5
Value Text Aug.       44.9        39.2     41.5
Value Location Aug.   57.9        53.9     55.7
Table A2: Robustness evaluation on invoice dataset. RoBERTa is used as the transformer model.

OCR text order. BERT and RoBERTa do not take text locations as input, so they rely more on text order than LayoutLM does. Accordingly, Global Shuffle causes a larger performance drop for these models: the F1 score falls by 19.2 points for BERT (57.8 to 38.6) and by 24.7 points for RoBERTa (61.1 to 36.4).

Background drop. The results for BERT and RoBERTa are similar to those for LayoutLM: Key Drop and Neighbor BG Drop have more impact than global BG Drop.

Other background manipulations. BG Typo reduces the F1 score by 3.1 points for BERT and 1.3 points for RoBERTa. Like LayoutLM, BERT and RoBERTa are robust to BG Synonyms. BG Adversarial reduces precision by 9.7 points for the BERT-based method and by 7.2 points for the RoBERTa-based method.

Value augmentations. Value Text Augment reduces the F1 score by 23.5 points for BERT and 19.6 points for RoBERTa. Since BERT and RoBERTa do not rely on location input, their drop under Value Location Augment (3.4 and 5.4 points in F1, respectively) is much smaller than that of LayoutLM (11.2%).
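The drops quoted in this subsection are absolute differences from the Original row of each table. As a quick check, the sketch below recomputes them from F1 values copied from Tables A1 and A2; the dictionary layout and the `f1_drop` helper are illustrative names introduced here, not part of the evaluation framework.

```python
# F1 scores (in %) copied from Table A1 (BERT) and Table A2 (RoBERTa).
F1 = {
    "BERT": {
        "Original": 57.8,
        "Global Shfl.": 38.6,
        "BG Typo": 54.7,
        "Value Text Aug.": 34.3,
        "Value Location Aug.": 54.4,
    },
    "RoBERTa": {
        "Original": 61.1,
        "Global Shfl.": 36.4,
        "BG Typo": 59.8,
        "Value Text Aug.": 41.5,
        "Value Location Aug.": 55.7,
    },
}


def f1_drop(model: str, transform: str) -> float:
    """Absolute F1 drop (percentage points) relative to the untransformed forms."""
    scores = F1[model]
    return round(scores["Original"] - scores[transform], 1)


for model, scores in F1.items():
    for transform in scores:
        if transform != "Original":
            print(f"{model:8s} {transform:20s} -{f1_drop(model, transform):.1f}")
```

For example, `f1_drop("BERT", "Global Shfl.")` gives 19.2 and `f1_drop("RoBERTa", "Global Shfl.")` gives 24.7, matching the figures in the text.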

A.2 More Robustness Evaluation of Receipts

The results of BERT and RoBERTa on receipts are summarized in Table A3 and Table A4. The experimental results suggest that the three transformations evaluated (Global Shuffle, Value Location Augment, and Value Text Augment) have a significant impact on the performance of BERT and RoBERTa.

Transforms             Precision   Recall   F1
Original               75.4        73.3     74.3
Global Shfl.           37.9        36.1     37.0
Value Location Aug.*   51.9        44.9     46.0
Value Text Aug.        71.2        69.2     70.2
Table A3: Robustness evaluation on SROIE dataset. BERT is used as the transformer model.

Transforms             Precision   Recall   F1
Original               77.3        74.7     75.9
Global Shfl.           40.6        37.1     38.8
Value Location Aug.*   48.6        41.9     43.4
Value Text Aug.        72.5        70.6     71.5
Table A4: Robustness evaluation on SROIE dataset. RoBERTa is used as the transformer model.