BROS: A Layout-Aware Pre-trained Language Model for Understanding Documents

08/10/2021 ∙ by Teakgyu Hong, et al. ∙ KAIST Department of Mathematical Sciences ∙ NAVER Corp.

Understanding documents from their visual snapshots is an emerging problem that requires both advanced computer vision and NLP methods. Recent advances in OCR enable the accurate recognition of text blocks, yet it is still challenging to extract key information from documents due to the diversity of their layouts. Although recent studies on pre-trained language models show the importance of incorporating layout information on this task, the conjugation of texts and their layouts still follows the style of BERT, which is optimized for understanding 1D text. This implies there is room for further improvement considering the 2D nature of text layouts. This paper introduces a pre-trained language model, BERT Relying On Spatiality (BROS), which effectively utilizes the information included in individual text blocks and their layouts. Specifically, BROS encodes spatial information by utilizing relative positions and learns spatial dependencies between OCR blocks with a novel area-masking strategy. These two novel approaches lead to an efficient encoding of spatial layout information, highlighted by the robust performance of BROS under low-resource environments. We also introduce a general-purpose parser that can be combined with BROS to extract key information even when there is no order information between text blocks. BROS shows its superiority on four public benchmarks – FUNSD, SROIE*, CORD, and SciTSR – and its robustness in practical cases where order information of text blocks is not available. Further experiments with a varying number of training examples demonstrate the high training efficiency of our approach. Our code will be publicly available.







1. Introduction

Document intelligence (DI), which understands industrial documents from their visual appearance, is a critical application of AI in business. One important challenge of DI is the key information extraction (KIE) task (Huang et al., 2019; Jaume et al., 2019; Park et al., 2019)

that extracts structured information from documents such as financial reports, invoices, business emails, insurance quotes, and many others. The task requires a multi-disciplinary perspective spanning from computer vision for extracting text from document images to natural language processing for parsing key information from the identified texts.

Model Modality # Params F1 score
LayoutLM (Xu et al., 2020a) Text + Layout 113M 78.66
LayoutLM (Xu et al., 2020a) Text + Layout + Image 160M 79.27
LayoutLMv2 (Xu et al., 2020b) Text + Layout + Image 200M 82.76
BROS Text + Layout 110M 83.05
Table 1. FUNSD KIE performances of language models pre-trained on OCR results. BROS provides better performance with fewer parameters and without using Image features.

Optical character recognition (OCR) is an important first step to extract text blocks in document images. Then key information can be reconstructed from the extracted text blocks (Schuster et al., 2013; Qian et al., 2019; Hwang et al., 2019; Hwang et al., 2020). Although OCR alleviates the burden of processing raw images, the task still requires understanding semantic relations between text blocks.

To obtain a semantic representation of the text blocks, existing works utilize a pre-trained language model. Hwang et al. (2019) fine-tunes BERT (Devlin et al., 2019) by regarding KIE tasks as sequence tagging problems. Denk and Reisswig (2019) uses BERT to incorporate textual information into image pixels during their image segmentation tasks. However, as BERT is designed for understanding 1D text, they artificially convert text blocks distributed in 2D into a single pseudo 1D text losing spatial layout information.

To encode spatial information from document layouts, a new type of language model, LayoutLM, was recently proposed (Xu et al., 2020a). The model is pre-trained over text blocks extracted from a large corpus of industrial documents with various layouts. By learning the spatial semantics included in layouts, it shows much higher performance on KIE tasks than previous pre-trained language models trained on 1D text. However, in LayoutLM, the spatial semantics are learned in the style of BERT. Since BERT is specialized for 1D text, this implies there is still considerable room for improvement in encoding 2D text.

Here we propose BROS, a pre-trained language model for understanding 2D text. Compared to LayoutLM, BROS has the following improvements. First, BROS encodes a relative position between text blocks for layout understanding with a few additional parameters. By considering a relative position instead of an absolute position, the model can have flexibility on some variations on the image, such as translation. On the other hand, LayoutLM uses two absolute position embeddings for x- and y-axes following BERT. Second, BROS is pre-trained with a novel area-masking strategy. By explicitly masking 2D span of text blocks, the strategy guides a model to learn 2D semantic representation. On the other hand, LayoutLM utilizes a token-masking strategy following BERT in which the learning scope is mainly limited to local 1D semantics.

With these two improvements, BROS achieves better performance with fewer parameters than LayoutLM (Table 1, 1st and 2nd rows vs. final row). The fact that BROS outperforms even the recently published LayoutLMv2, which uses image features in addition to text blocks from OCR (Xu et al., 2020b), highlights the efficiency and effectiveness of BROS.

Aside from BROS, we also introduce a general-purpose parser, dubbed the Token Relationship decoder (TR decoder), that can be easily combined with pre-trained language models for KIE tasks. By explicitly decoding the relations between text blocks, the TR decoder can extract key information without relying on prior word order and can also solve entity linking problems. In contrast, a conventional BIO tagger always requires prior word ordering and cannot easily handle relations between text blocks.

We extensively validate BROS on four KIE benchmarks—FUNSD (form-like documents), SROIE* (receipts), CORD (receipts), and SciTSR (table structures)—under two settings: with or without word order information. BROS shows better performance over all datasets under both settings. In addition, BROS also performs consistently better when training with various amounts of pre-training and fine-tuning examples.

2. Related Work

2.1. Pre-trained Language Models

BERT (Devlin et al., 2019) is a pre-trained language model using Transformer (Vaswani et al., 2017)

that shows superior performance on various NLP tasks. The main strategy to train BERT is a masked language model (MLM) that masks and estimates randomly selected tokens to learn the semantics of language from large-scale corpora. Many variants of BERT have been introduced to learn transferable knowledge by modifying the pre-training strategy. XLNet 

(Yang et al., 2019) permutes tokens during the pre-training phase to reduce a discrepancy from the fine-tuning phase. XLNet also utilizes relative position encoding to handle long texts. StructBERT (Wang et al., 2020a) shuffles tokens in text spans and adds sentence prediction tasks for recovering the order of words or sentences. SpanBERT (Joshi et al., 2020) masks the span of tokens to extract better representation for span selection tasks such as question answering and co-reference resolution. ELECTRA (Clark et al., 2020) is trained to distinguish real and fake input tokens generated by another network for sample-efficient pre-training.

Inspired by these previous works, BROS utilizes a new pre-training strategy, named area-masked language model, that can capture complex spatial dependencies between text blocks distributed on 2D space. Note that LayoutLM (Xu et al., 2020a) is the first pre-trained language model on spatial text blocks but it still employs the original MLM of BERT.

2.2. Key Information Extraction from Documents

Early works (Liu et al., 2019; Yu et al., 2020; Qian et al., 2019; Katti et al., 2018) focused on capturing better representations of pair-wise relationships between text blocks identified by OCR, without using pre-trained language models. Following the promising performance of BERT on diverse NLP tasks, it began to be utilized for document KIE (Hwang et al., 2019; Denk and Reisswig, 2019; Hwang et al., 2020). However, BERT is not designed for text in 2D space, which limits performance on KIE tasks, especially those with small training datasets. To address this issue, LayoutLM (Xu et al., 2020a), a language model pre-trained on OCR results, was introduced. Following this line of research, this paper proposes an advanced model that enhances the conjugation of texts and layouts.

Recently, LayoutLMv2 (Xu et al., 2020b) is introduced to the public. It shows remarkable performance improvements on advanced KIE tasks by learning the multi-modality of image features and text blocks. Our model only encodes text and layout features since we focus on building a practical model for real-world scenarios.

In terms of generating outputs for KIE, most previous works utilize BIO tagging approaches (Hwang et al., 2019; Denk and Reisswig, 2019; Xu et al., 2020a; Xu et al., 2020b). Although a BIO tagger requires additional information about the order of text blocks, this has not been an issue because KIE benchmark datasets provide optimal reading orders of text blocks. In this work, we evaluate practical cases without prior order information by using our proposed TR decoder. Inspired by SPADE (Hwang et al., 2020), the TR decoder adopts a graph-based approach and can simply be attached to pre-trained language models.

Figure 1. An overview of BROS. Tokens in the document image are masked through the token- and area-masking strategies. The position differences between text blocks are encoded directly in the attention mechanism of the Transformer. The output token representations are used in both pre-training and fine-tuning.

3. BERT Relying on Spatiality (BROS)

The main structure of BROS follows LayoutLM, but there are two critical advances: (1) a spatial encoding method that captures the spatial relations between text blocks, and (2) a 2D pre-training objective designed for text blocks on 2D space. Figure 1 gives a visual description of BROS for document KIE tasks.

3.1. Encoding Spatial Information into BERT

How spatial relations between text blocks are represented is important for identifying the semantics constructed by layouts. We calculate relative positions for the four vertices of text blocks, apply sinusoidal functions to encode the distances, and combine them through a linear transformation to represent the spatial relation between two text blocks.

For a formal description, we use $p = (x, y)$ to denote a point on 2D space and $f^{sinu}: \mathbb{R} \rightarrow \mathbb{R}^{D_s}$ to represent a sinusoidal function, where $D_s$ is the dimension of the sinusoid embedding. BROS first normalizes all 2D points indicating the locations of the text blocks using the size of the image. It then encodes the spatial relation between two 2D points, $p^i$ and $p^j$, by applying the sinusoidal function to the gaps along the x- and y-axes and concatenating them as $f(p^i, p^j) = [f^{sinu}(x^i - x^j); f^{sinu}(y^i - y^j)] \in \mathbb{R}^{2D_s}$. The semicolon (;) indicates concatenation. The bounding box of a text block consists of four vertices, $p_{tl}$, $p_{tr}$, $p_{br}$, and $p_{bl}$, indicating the top-left, top-right, bottom-right, and bottom-left points, respectively. The spatial distances over the four point pairs are converted into vectors $\bar{\pi}^{ij}_{tl}$, $\bar{\pi}^{ij}_{tr}$, $\bar{\pi}^{ij}_{br}$, and $\bar{\pi}^{ij}_{bl}$ with $\bar{\pi}^{ij}_{*} = f(p^i_{*}, p^j_{*})$. Finally, to represent an embedding for the spatial relation between two text blocks, $\pi^{ij}$, BROS combines the four identified vectors through a linear transformation,

$$\pi^{ij} = W_{tl}\,\bar{\pi}^{ij}_{tl} + W_{tr}\,\bar{\pi}^{ij}_{tr} + W_{br}\,\bar{\pi}^{ij}_{br} + W_{bl}\,\bar{\pi}^{ij}_{bl},$$

where $W_{tl}, W_{tr}, W_{br}, W_{bl} \in \mathbb{R}^{(H/A) \times 2D_s}$ are linear transition matrices, $H$ is the hidden size of BERT, and $A$ is the number of self-attention heads. The periodic property of the sinusoidal function can encode continuous distances more naturally than the point-specific embeddings used in BERT and LayoutLM. By learning the linear transition parameters, BROS provides an effective representation of the spatial relation between text blocks. Note that the dimension of the final embedding is $H/A$, i.e., divided by $A$, so that a common embedding is shared over the multiple heads of the attention module; this reduces the memory burden of computing embeddings for all pairs of text blocks.
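The spatial-relation embedding described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the released implementation; the function names and the choice of 10000 as the sinusoid base are assumptions.

```python
import numpy as np

def sinusoidal(d, dim=24):
    """Sinusoidal encoding of a scalar distance d (dim must be even)."""
    i = np.arange(dim // 2)
    freqs = 1.0 / (10000 ** (2 * i / dim))  # assumed base, as in Transformer
    return np.concatenate([np.sin(d * freqs), np.cos(d * freqs)])

def point_relation(p_i, p_j, dim=24):
    """f(p^i, p^j): concatenated sinusoids of the x- and y-gaps."""
    return np.concatenate([sinusoidal(p_i[0] - p_j[0], dim),
                           sinusoidal(p_i[1] - p_j[1], dim)])

def spatial_embedding(box_i, box_j, W, dim=24):
    """pi^{ij}: sum of linear transforms over the four vertex pairs.
    box_*: four (x, y) vertices (tl, tr, br, bl), normalized by image size;
    W: list of four (H/A, 2*dim) transition matrices."""
    return sum(W[k] @ point_relation(box_i[k], box_j[k], dim)
               for k in range(4))
```

With $D_s = 24$ (the setting reported in Section 6.1) and $H/A = 768/12 = 64$, each pair of blocks yields a 64-dimensional shared spatial embedding.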

BROS directly encodes the spatial relations into the contextualization of text blocks. In detail, BROS calculates an attention logit combining both semantic and spatial features as follows:

$$a^{ij}_h = (W^h_q t^i)^\top (W^h_k t^j) + (W^h_q t^i)^\top \pi^{ij},$$

where $t^i$ and $t^j$ are the context representations of the $i$-th and $j$-th tokens and $W^h_q$ and $W^h_k$ are linear transition matrices for the $h$-th head. The former term is the same as the original attention mechanism in Transformer (Vaswani et al., 2017). The latter, motivated by Dai et al. (2019), considers the relative spatial information of the target text block given the source context and location. As mentioned above, the relative spatial embedding is shared across all attention heads for efficient memory usage. Compared to the spatial-aware attention in Xu et al. (2020b), ours has two major differences. First, our method couples the relative embeddings with the semantic information of tokens for better conjugation of texts and their spatial relations. Second, when calculating the relative spatial information between two text blocks, we consider all four vertices of the block. By doing so, our encoding incorporates not only relative distance but also relative shape and size, which play important roles in distinguishing keys and values in a document.
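For a single head, the combined attention logit can be sketched as below. This is a hypothetical NumPy sketch of the formula, not the actual model code; variable names are assumptions.

```python
import numpy as np

def spatial_attention_logit(t_i, t_j, pi_ij, W_q, W_k):
    """a^{ij} = (W_q t_i)^T (W_k t_j) + (W_q t_i)^T pi_ij  (one head).
    t_i, t_j: (H,) token representations; W_q, W_k: (H/A, H) head
    projections; pi_ij: (H/A,) spatial embedding shared across heads."""
    q = W_q @ t_i                 # query for token i
    k = W_k @ t_j                 # key for token j
    return float(q @ k + q @ pi_ij)  # semantic + spatial terms
```

The same query vector `q` multiplies both the key and the spatial embedding, which is what couples the relative position with the token's semantic content.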

3.2. Pre-training Objective: Area-masked Language Model

(a) Random token selection and token masking
(b) Random area selection and block masking
Figure 2. Illustrations of (a) conventional token-masking and (b) proposed area-masking. Token-masking selects tokens randomly (red) and masks them directly (gray), allowing the model to learn how to represent a token within its text block (blue). Area-masking identifies areas (red) by expanding randomly chosen text blocks and masks the tokens (gray) of all text blocks (blue) whose centers fall within the identified area. In both figures, 15% of tokens are masked.

Pre-training on diverse layouts from unlabeled documents is a key factor for document understanding tasks. BROS utilizes two pre-training objectives: a token-masked LM (TMLM) used in BERT and a novel area-masked LM (AMLM) introduced in this paper. The area-masked LM, inspired by SpanBERT (Joshi et al., 2020), captures consecutive text blocks within a 2D area of a document.

TMLM randomly masks tokens while keeping their spatial information, and the model then predicts the masked tokens using the spatial information and the remaining un-masked tokens as clues. The process is identical to the MLM of BERT and the Masked Visual-Language Model (MVLM) of LayoutLM. Figure 2(a) shows how TMLM masks tokens in a document. Since tokens in a text block can be partially masked, they can be estimated by referring to other tokens in the same block or to text blocks near the masked token.

AMLM masks all text blocks allocated in a randomly chosen area. It can be interpreted as span masking for text blocks in 2D space. Specifically, AMLM consists of the following four steps: (1) randomly select a text block, (2) identify an area by expanding the region of the text block, (3) determine the text blocks allocated in the area, and (4) mask all tokens of those text blocks and predict them. In the second step, the degree of expansion is determined by sampling a value from an exponential distribution with a rate hyper-parameter $\lambda$. The rationale behind using the exponential distribution is that it is the continuous counterpart of the geometric distribution that SpanBERT uses for the discrete domain; we set $\lambda$ according to the geometric distribution parameter $p = 0.2$ used in SpanBERT. We also truncate the exponential distribution at 1 to prevent large values from covering the whole document. It should be noted that the masking area is expanded from a randomly selected text block, since the area should be related to text sizes and locations to represent text spans in 2D space. Figure 2 compares token- and area-masking on text blocks. Because AMLM hides spatially close tokens together, their estimation requires more clues from text blocks far from the estimation targets.
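The four steps of area selection can be sketched as follows. This is a hypothetical NumPy sketch: the exact expansion rule and the default `lam` are assumptions, since the paper derives the rate from SpanBERT's geometric parameter.

```python
import numpy as np

def area_mask_blocks(boxes, lam=1.0, rng=None):
    """Pick a random block, expand it by a truncated-exponential factor,
    and return indices of blocks whose centers fall inside the area.
    boxes: (N, 4) array of normalized (x0, y0, x1, y1) boxes."""
    rng = rng or np.random.default_rng()
    i = rng.integers(len(boxes))                      # (1) random block
    e = min(rng.exponential(1.0 / lam), 1.0)          # (2) truncated at 1
    x0, y0, x1, y1 = boxes[i]
    w, h = x1 - x0, y1 - y0
    area = (x0 - e * w, y0 - e * h, x1 + e * w, y1 + e * h)
    cx = (boxes[:, 0] + boxes[:, 2]) / 2              # (3) block centers
    cy = (boxes[:, 1] + boxes[:, 3]) / 2
    inside = ((area[0] <= cx) & (cx <= area[2]) &
              (area[1] <= cy) & (cy <= area[3]))
    return np.flatnonzero(inside)                     # (4) blocks to mask
```

The seed block's own center is always inside the expanded area, so at least one block is masked per draw.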

Finally, BROS combines the two masked LMs, TMLM and AMLM, to stimulate the model to learn both individual and consolidated token representations. BROS first masks 15% of tokens for AMLM and then masks 15% of the tokens in the remaining text blocks for TMLM. As in BERT (Devlin et al., 2019), a masked token is replaced by the [MASK] token 80% of the time, a random token 10% of the time, and the original token the remaining 10%.
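The BERT-style 80/10/10 corruption applied to the selected positions can be sketched as below. This is an illustrative sketch, not the actual pre-processing code; the function name and vocabulary handling are assumptions.

```python
import random

def corrupt(tokens, masked_idx, vocab, mask_token="[MASK]", rng=None):
    """BERT-style corruption of the selected positions: 80% [MASK],
    10% random vocabulary token, 10% kept unchanged. The model is
    trained to predict the original token at every selected position."""
    rng = rng or random.Random()
    out = list(tokens)
    for i in masked_idx:
        r = rng.random()
        if r < 0.8:
            out[i] = mask_token          # 80%: replace with [MASK]
        elif r < 0.9:
            out[i] = rng.choice(vocab)   # 10%: replace with random token
        # else: 10% keep the original token
    return out
```

In BROS, `masked_idx` would be the union of the AMLM area-selected tokens and the TMLM randomly selected tokens.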

4. Parsers for KIE Tasks

KIE tasks can be categorized into two downstream tasks: (1) the entity extraction (EE) task and (2) the entity linking (EL) task. The EE task identifies a sequence of text blocks for key information (e.g., extracting the address text in a receipt), and the EL task determines relations between entities when the text blocks of the entities are known (e.g., identifying key and value text pairs).

To address the EE and EL tasks, we introduce two parsers: one is a sequence classifier (e.g. BIO tagger) that can operate based on prior information about an order of text blocks and the other is a graph-based parser that does not require any order information of text blocks. The following sections briefly introduce the two types of parsers.

4.1. Parser with Order Information

The BIO tagger, a representative parser that depends on the optimal reading order of text blocks, extracts key information by identifying the beginning and inside positions of the ordered text blocks. The optimal reading order of text blocks is an order in which all key information can be represented by its sub-sequences. The sequence classifier requires this optimality because it can never find key information from a wrong sequence. For example, if the two text blocks "Michael" and "Jackson" are ordered as "Jackson", "Michael", the sequence classifier cannot find "Michael Jackson".

(a) Recognized text blocks.
(b) Serialized text blocks.
(c) BIO-tagged text sequence.
Figure 3. Visual descriptions of how BIO tagger extracts entities in a document. All recognized tokens are serialized and classified. By combining sub-sequences identified by the BIO taggings, key information can be parsed from the recognized tokens.

Figure 3 shows how the BIO tagger performs the EE task for a given document. First, text blocks are recognized by an OCR engine (Figure 3, a). The recognized text blocks are then serialized by a serializer (Figure 3, b). Finally, a BIO class is predicted for each token, and key information is extracted by combining the predicted labels (Figure 3, c).
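As an illustration, entities can be decoded from BIO tags with a few lines of Python. This is a simple reference sketch of standard BIO decoding, not the paper's code.

```python
def decode_bio(tokens, tags):
    """Collect (class, text) entities from BIO tags, e.g. B-NAME/I-NAME
    runs. An I- tag only extends an entity of the same class."""
    entities, cur, cur_type = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if cur:
                entities.append((cur_type, " ".join(cur)))
            cur, cur_type = [tok], tag[2:]
        elif tag.startswith("I-") and cur and tag[2:] == cur_type:
            cur.append(tok)
        else:  # "O" tag or an I- tag that does not continue an entity
            if cur:
                entities.append((cur_type, " ".join(cur)))
            cur, cur_type = [], None
    if cur:
        entities.append((cur_type, " ".join(cur)))
    return entities
```

Note how the decoder depends entirely on the serialized order: if "Michael" and "Jackson" are swapped by the serializer, no tagging can recover "Michael Jackson".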

The BIO tagger cannot solve the EL task since links between text blocks cannot be represented as a sequence. In addition, a single text block can hold the same relationship with multiple other text blocks, and the sequence-based approach cannot represent such one-to-many relations either.

(a) Initial token classification
(b) Subsequent token classification
(c) Entity linking (EL) task
Figure 4. Visual descriptions of the TR decoder's downstream tasks. For EE tasks, the TR decoder combines the two sub-tasks (a) and (b): it identifies initial tokens and then connects subsequent tokens without any order information of text blocks. For the EL task, the TR decoder links the first tokens of the entities.

4.2. Parser without Order Information

In many practical cases, the optimal reading order is not available. Most OCR APIs provide a plausible reading order of text blocks based on rule-based approaches, but they cannot guarantee an optimal order (Li et al., 2020).

Here, we introduce a token relationship decoder, referred to as the TR decoder, that extracts key information without any order information. The key idea of the TR decoder is to extract a directed sub-graph from a fully-connected graph whose nodes are text blocks. Since there is no restriction on the connections between text blocks, the TR decoder does not require order information.

For EE tasks, the TR decoder divides the problem into two sub-tasks: initial token classification (Figure 4, a) and subsequent token classification (Figure 4, b). Let $t^i$ denote the representation of the $i$-th token from the last Transformer layer of BROS. The initial token classification conducts token-level tagging to determine whether a token is an initial token of target information as follows,

$$\hat{y}^{i}_{init} = \mathrm{softmax}(W_{init}\,t^i) \in \mathbb{R}^{C+1},$$

where $W_{init} \in \mathbb{R}^{(C+1) \times H}$ is a linear transition matrix and $C$ indicates the number of target classes. The extra +1 dimension is used to indicate non-initial tokens.

The subsequent token classification is conducted by utilizing pair-wise token representations as follows,

$$\hat{y}^{i}_{next} = \mathrm{softmax}\big((W_{src}\,t^i)^\top \big[W_{tgt}\,t^1, \ldots, W_{tgt}\,t^N, t^{\varnothing}\big]\big) \in \mathbb{R}^{N+1}.$$

Here, $W_{src}, W_{tgt} \in \mathbb{R}^{H' \times H}$ are linear transition matrices, $H'$ is a hidden feature dimension for the subsequent token classification decoder, and $N$ is the maximum number of tokens. The brackets indicate concatenation over the candidate tokens. $t^{\varnothing} \in \mathbb{R}^{H'}$ is a model parameter used to classify tokens that have no subsequent token or are not related to any class; it plays a role similar to the end-of-sequence token, [EOS], in NLP. By solving these two sub-tasks, the TR decoder identifies a sequence of text blocks by finding initial tokens and then connecting subsequent tokens.
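Given the two predictions, entity decoding reduces to following chains of subsequent-token links from each initial token. A hypothetical sketch (the argmax of the two classifiers is assumed to have been taken already):

```python
def decode_entities(init_labels, next_links, num_classes):
    """Follow chains of subsequent-token links from each initial token.
    init_labels[i]: predicted class in {0..C-1}, or C for non-initial;
    next_links[i]: index of token i's subsequent token, or None if the
    'no subsequent token' slot was predicted."""
    entities = []
    for i, c in enumerate(init_labels):
        if c == num_classes:          # non-initial token: not a start
            continue
        chain, j, seen = [i], next_links[i], {i}
        while j is not None and j not in seen:  # guard against cycles
            chain.append(j)
            seen.add(j)
            j = next_links[j]
        entities.append((c, chain))
    return entities
```

Because the chains are recovered from predicted links rather than from a serialized sequence, no reading order of the text blocks is needed.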

For EL tasks, the TR decoder conducts a binary classification for all possible pairs of tokens (Figure 4, c) as follows,

$$\hat{y}^{ij}_{rel} = \sigma\big((W_{src}\,t^i)^\top (W_{tgt}\,t^j)\big),$$

where $W_{src}, W_{tgt} \in \mathbb{R}^{H' \times H}$ are linear transition matrices and $H'$ is a hidden feature dimension. In contrast to the subsequent token classification, a single token can hold multiple relations with other tokens, which represents the hierarchical structures of document layouts.
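The pair-wise link probabilities can be computed in batch. A minimal NumPy sketch under the bilinear form assumed above; names are illustrative:

```python
import numpy as np

def link_scores(T, W_src, W_tgt):
    """Sigmoid over bilinear pair scores: scores[i, j] estimates whether
    token i links to token j. T: (N, H) token representations from the
    last layer; W_src, W_tgt: (H', H) projection matrices."""
    S = (T @ W_src.T) @ (T @ W_tgt.T).T   # (N, N) pair logits
    return 1.0 / (1.0 + np.exp(-S))       # independent sigmoids per pair
```

Because each pair is scored independently with a sigmoid rather than a softmax over candidates, one token can link to several others, which is what allows one-to-many relations such as a header linked to multiple questions.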

Model | Modality | Batch size, # Epochs | FUNSD EE (P / R / F1) | CORD EE (P / R / F1) | # Params
BERT-base (Xu et al., 2020a) | Text | - | 54.69 / 67.10 / 60.26 | 88.33 / 91.07 / 89.68 | 110M
BERT-large (Xu et al., 2020a) | Text | - | 61.13 / 70.85 / 65.63 | 88.86 / 91.68 / 90.25 | 340M
LayoutLM-base (Xu et al., 2020a) | Text + Layout | 80, 2 | 75.97 / 81.55 / 78.66 | 94.37 / 95.08 / 94.72 | 113M
LayoutLM-large (Xu et al., 2020b) | Text + Layout | 80, 2 | 75.96 / 82.19 / 78.95 | 94.32 / 95.54 / 94.93 | 343M
LayoutLM-base (Xu et al., 2020a) | Text + Layout + Image* | 80, 2 | 76.77 / 81.95 / 79.27 | - | 160M
BROS (Ours) | Text + Layout | 80, 2 | 79.85 / 84.38 / 82.05 | 96.87 / 96.58 / 96.72 | 110M
LayoutLMv2-base (Xu et al., 2020b) | Text + Layout + Image* | 64, 5 | 80.29 / 85.39 / 82.76 | 94.53 / 95.39 / 94.95 | 200M
BROS (Ours) | Text + Layout | 64, 5 | 81.16 / 85.02 / 83.05 | 96.89 / 96.44 / 96.66 | 110M
LayoutLMv2-large (Xu et al., 2020b) | Text + Layout + Image* | 2K, 20 | 83.24 / 85.19 / 84.20 | 95.65 / 96.37 / 96.01 | 426M
Table 2. Performance comparison on the FUNSD and CORD EE tasks.

5. Key Information Extraction Tasks

Here, we describe three EE tasks and three EL tasks from four KIE benchmark datasets.


  • Form Understanding in Noisy Scanned Documents (FUNSD) (Jaume et al., 2019) is a set of documents with various forms. The dataset consists of 149 training and 50 testing examples. FUNSD has both EE and EL tasks. In the EE task, there are three semantic entities: Header, Question, and Answer. In the EL task, the semantic hierarchies are represented as relations between text blocks like header-question and question-answer pairs.

  • SROIE* is a variant of Task 3 of "Scanned Receipts OCR and Information Extraction" (SROIE), which consists of a set of store receipts. In the original SROIE task, the semantic contents (Company, Date, Address, and Total price) are annotated without explicit connection to the text blocks. To convert SROIE into an EE task, we developed SROIE* by matching ground-truth contents with text blocks. We also split the original training set into 526 training and 100 testing examples because ground truths are not provided for the original test set. SROIE* will be publicly available.

  • Consolidated Receipt Dataset (CORD) (Park et al., 2019) is a set of store receipts with 800 training, 100 validation, and 100 testing examples. CORD consists of both EE and EL tasks. In the EE task, there are 30 semantic entities including menu name, menu price, and so on. In the EL task, the semantic entities are linked according to their layout structure. For example, menu name entities are linked to menu id, menu count, and menu price.

  • Complicated Table Structure Recognition (SciTSR) (Chi et al., 2019) is an EL task that connects cells in a table to recognize the table structure. There are two types of relations: vertical and horizontal connection between cells. The dataset consists of 12,000 training images and 3,000 test images.

Although these four datasets provide testbeds for the EE and EL tasks, they represent only a subset of real-world problems because the optimal order information of text blocks is given. FUNSD provides the optimal orders of text blocks related to target classes in both training and testing examples. In SROIE*, CORD, and SciTSR, the text blocks are serialized in reading order.

6. Experiments

6.1. Experiment Settings

For pre-training, the IIT-CDIP Test Collection 1.0 (Lewis et al., 2006), which consists of approximately 11M document images, is used, but the 400K images of the RVL-CDIP dataset (Harley et al., 2015) are excluded, following LayoutLM. To obtain text blocks from the document images, the CLOVA OCR API was applied.

The main Transformer structure of BROS is the same as that of BERT. Following BERT-base, the hidden size, the number of self-attention heads, the feed-forward/filter size, and the number of Transformer layers are set to 768, 12, 3072, and 12, respectively. The dimension of the sinusoid embedding, $D_s$, is set to 24. The same pre-training setting as LayoutLM is used for a fair comparison.

BROS is trained using the AdamW optimizer (Loshchilov and Hutter, 2019) with a learning rate of 5e-5 and linear decay. The first 10% of the total epochs are used for learning-rate warm-up. We initialized the weights of BROS with those of BERT-base and trained it on the IIT-CDIP dataset.

During fine-tuning, the learning rate is set to 5e-5 and the batch size to 16 for all tasks. The number of training epochs or steps is as follows: 100 epochs for FUNSD, 1K steps for SROIE* and CORD, and 7.5 epochs for SciTSR. When applying the TR decoder, the hidden feature dimension, $H'$, is set to 128 for FUNSD, 64 for SROIE*, and 256 for CORD and SciTSR. For all experiments, we repeated the experiment with 5 different random seeds and report the mean and standard deviation of the scores.

Although the authors of LayoutLM published their code on GitHub, the data and script files used for pre-training are not included. For a fair comparison, we made our own implementation, which we refer to as LayoutLM†, using the same pre-training data and scripts as BROS. We verified LayoutLM† by comparing its performance on FUNSD against the scores reported in Xu et al. (2020a). See Appendix A for more information.

6.2. Comparison with Other Pre-trained Models

Table 2 shows the performance comparison on the FUNSD and CORD EE tasks. For a fair comparison, we trained BROS for 2 epochs with a batch size of 80 and for 5 epochs with a batch size of 64, matching the training settings of LayoutLM (Xu et al., 2020a) and LayoutLMv2 (Xu et al., 2020b), respectively. For the downstream tasks, BROS utilizes a BIO tagger, the same as the baselines. When compared with the LayoutLM variants trained for 2 epochs with a batch size of 80, BROS shows the best performance even though it has the smallest number of parameters and uses no image features. In detail, the performance improvements on the FUNSD EE task from a larger model and from the use of image features are 0.29 percentage points (pp) (78.66 → 78.95) and 0.61pp (78.66 → 79.27), relative to the base LayoutLM. In contrast, BROS improves by 3.39pp (78.66 → 82.05) by only modifying the positional encoding and pre-training strategy. Moreover, when compared with LayoutLMv2 trained for 5 epochs with a batch size of 64, BROS performs better even though it has only 55% of LayoutLMv2's parameters (200M → 110M). LayoutLMv2 achieves the best FUNSD performance by using 20 epochs with a batch size of 2K, but its performance on CORD remains below that of BROS.

6.3. Experimental Results with Optimal Order Information

Dataset | Model | BIO tagger (Precision / Recall / F1) | TR decoder (Precision / Recall / F1)
FUNSD EE | BERT | 56.11±1.39 / 66.63±0.95 / 60.92±1.14 | 20.47±12.32 / 7.02±4.35 / 10.44±6.39
FUNSD EE | LayoutLM† | 76.12±0.46 / 81.88±0.33 / 78.89±0.30 | 76.46±0.64 / 78.17±0.70 / 77.30±0.64
FUNSD EE | BROS | 79.85±0.61 / 84.38±0.47 / 82.05±0.46 | 81.05±1.02 / 80.50±0.91 / 80.77±0.94
SROIE* EE | BERT | 92.90±1.08 / 94.47±0.86 / 93.67±0.73 | 50.46±5.24 / 40.00±3.83 / 44.53±4.00
SROIE* EE | LayoutLM† | 94.31±0.59 / 95.78±0.37 / 95.04±0.45 | 94.99±0.79 / 95.13±0.34 / 95.06±0.52
SROIE* EE | BROS | 95.60±0.51 / 96.13±0.61 / 95.87±0.54 | 95.61±0.72 / 96.18±0.49 / 95.89±0.59
CORD EE | BERT | 93.08±0.39 / 93.18±0.29 / 93.13±0.34 | 21.41±8.69 / 20.32±8.98 / 20.82±8.84
CORD EE | LayoutLM† | 95.03±0.21 / 94.58±0.16 / 94.80±0.16 | 95.52±0.30 / 94.79±0.25 / 95.15±0.27
CORD EE | BROS | 96.87±0.41 / 96.58±0.40 / 96.72±0.40 | 96.71±0.19 / 96.52±0.17 / 96.61±0.16
Table 3. Performance comparisons on three EE tasks with the optimal order information of text blocks.
Dataset | Model | Precision | Recall | F1
FUNSD EL | BERT | 5.22±4.97 | 0.56±0.52 | 0.98±0.88
FUNSD EL | LayoutLM† | 41.29±0.41 | 44.45±0.80 | 42.81±0.57
FUNSD EL | BROS | 67.63±0.84 | 72.35±0.67 | 69.91±0.65
CORD EL | BERT | 58.40±2.33 | 42.00±11.07 | 48.08±8.85
CORD EL | LayoutLM† | 91.37±0.47 | 89.52±0.30 | 90.43±0.32
CORD EL | BROS | 95.07±0.66 | 94.65±0.81 | 94.86±0.73
SciTSR EL | BERT | 87.61±0.52 | 85.92±0.57 | 86.76±0.39
SciTSR EL | LayoutLM† | 98.76±0.22 | 99.44±0.02 | 99.09±0.11
SciTSR EL | BROS | 98.92±0.11 | 99.59±0.01 | 99.32±0.06
Table 4. Performance comparisons on three EL tasks with the optimal order information of text blocks.

Here, we provide apples-to-apples comparisons of pre-trained language models by combining them with two decoders, the BIO tagger and the TR decoder. The benchmark datasets include the optimal order information of text blocks in a document, so experiments on them simulate the practical case in which order information is accessible. In this subsection, we compare BERT, LayoutLM†, and BROS. All models except BERT are pre-trained for 2 epochs with a batch size of 80. We utilize our implementation, LayoutLM†, for a deeper analysis across benchmarks (LayoutLM† shows slightly better performance than the original LayoutLM; see Table 11).

Table 3 shows the results on the three EE tasks, FUNSD, SROIE*, and CORD, using both parsers. The experiments show that BROS performs better than LayoutLM† regardless of the parser. Although the TR decoder does not utilize the additional ground-truth order information when identifying results, it shows comparable performance. However, the BIO tagger provides better results on most benchmark tasks. These results suggest using a BIO tagger when the optimal order information is available.

Table 4 provides the results on the three EL tasks, FUNSD, CORD, and SciTSR. Since EL tasks require links between text blocks, all models utilize the TR decoder. As with the results on the EE tasks, BROS outperforms LayoutLM† in all benchmarks. Interestingly, the performance of BERT is quite limited since EL tasks rely heavily on spatial information; performance increases in the order of LayoutLM† and then BROS as more spatial information is encoded. Note that this is the first evaluation of pre-trained models on EL tasks, as there was previously no parser for them.

6.4. Experimental Results without Optimal Order Information

To simulate the practical case in which an order of text blocks is not available, we remove the order information of the KIE benchmarks by randomly permuting the order of text blocks. We denote the permuted datasets as p-FUNSD, p-SROIE*, p-CORD, and p-SciTSR and compare BERT, LayoutLM†, and BROS. All models except BERT are pre-trained for 2 epochs with a batch size of 80. Note that we utilize the TR decoder for all models because BIO tagging on these permuted datasets cannot extract a sequence of text blocks in the correct order.

Dataset | Model | Precision | Recall | F1
p-FUNSD (EE) | BERT | 7.91 ± 0.54 | 4.86 ± 1.67 | 5.87 ± 1.60
p-FUNSD (EE) | LayoutLM | 34.98 ± 0.98 | 32.55 ± 0.68 | 33.72 ± 0.80
p-FUNSD (EE) | BROS | 74.64 ± 0.55 | 74.03 ± 0.57 | 74.33 ± 0.55
p-SROIE* (EE) | BERT | 7.29 ± 3.75 | 2.56 ± 1.31 | 3.79 ± 1.94
p-SROIE* (EE) | LayoutLM | 67.42 ± 1.54 | 66.48 ± 1.05 | 66.94 ± 1.14
p-SROIE* (EE) | BROS | 82.77 ± 1.05 | 83.02 ± 0.49 | 82.89 ± 0.76
p-CORD (EE) | BERT | 20.72 ± 8.79 | 19.56 ± 8.13 | 20.12 ± 8.45
p-CORD (EE) | LayoutLM | 77.49 ± 0.82 | 77.27 ± 0.86 | 77.38 ± 0.83
p-CORD (EE) | BROS | 95.20 ± 0.25 | 95.12 ± 0.30 | 95.16 ± 0.27
Table 5. Performance comparisons on three EE tasks without the optimal order information of text blocks. All models utilize the TR decoder because BIO tagging cannot be used without the optimal order.
Dataset | Model | Precision | Recall | F1
p-FUNSD (EL) | BERT | 3.77 ± 3.46 | 0.53 ± 0.51 | 0.89 ± 0.84
p-FUNSD (EL) | LayoutLM | 36.69 ± 2.16 | 31.25 ± 1.29 | 33.75 ± 1.66
p-FUNSD (EL) | BROS | 66.24 ± 1.30 | 68.33 ± 1.14 | 67.27 ± 1.15
p-CORD (EL) | BERT | 37.69 ± 12.53 | 4.99 ± 2.26 | 8.32 ± 3.19
p-CORD (EL) | LayoutLM | 58.26 ± 0.85 | 56.38 ± 0.75 | 57.30 ± 0.76
p-CORD (EL) | BROS | 86.64 ± 1.38 | 87.88 ± 0.79 | 87.25 ± 1.06
p-SciTSR (EL) | BERT | 59.42 ± 1.88 | 0.89 ± 0.17 | 1.75 ± 0.34
p-SciTSR (EL) | LayoutLM | 95.59 ± 0.30 | 99.04 ± 0.04 | 97.28 ± 0.14
p-SciTSR (EL) | BROS | 98.11 ± 0.33 | 99.38 ± 0.02 | 98.74 ± 0.16
Table 6. Performance comparisons on three EL tasks without the optimal order information of text blocks.
Model | p-FUNSD | xy-FUNSD | yx-FUNSD | FUNSD
LayoutLM | 33.72 ± 0.80 | 36.61 ± 0.60 | 60.50 ± 0.31 | 77.30 ± 0.64
BROS | 74.33 ± 0.55 | 75.05 ± 0.63 | 75.30 ± 0.67 | 80.77 ± 0.94
Table 7. Comparison of FUNSD EE performances according to sorting methods.

Table 5 and Table 6 show the results. Due to the loss of correct order, BERT performs poorly on both the EE and EL tasks. By utilizing spatial information of text blocks, LayoutLM performs better, but it still suffers a large performance drop compared to its scores with the optimal order information. In contrast, BROS achieves results comparable to the cases with the optimal order information and outperforms both BERT and LayoutLM.

To systematically investigate how order information affects model performance, we construct variants of FUNSD by re-ordering text blocks with two sorting methods based on their top-left points. The text blocks of xy-FUNSD are sorted by x-coordinate and then by y-coordinate, and those of yx-FUNSD are sorted by y-coordinate and then by x-coordinate. Table 7 shows the performance on p-FUNSD, xy-FUNSD, yx-FUNSD, and the original FUNSD. Interestingly, the performance of LayoutLM degrades in the order of FUNSD, yx-FUNSD, xy-FUNSD, and p-FUNSD, matching how reasonable each serialization of text on the 2D space is. In contrast, the performance of BROS is relatively consistent. These results show that BROS with a TR decoder is applicable to KIE problems without relying on an additional serialization method.
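The two sorting methods amount to swapping the primary and secondary sort keys on each block's top-left point. A small sketch, assuming blocks carry `x`/`y` fields for their top-left corner (the field names are illustrative):

```python
def sort_xy(blocks):
    """xy-FUNSD-style order: primary key x, tie-break by y (column-wise)."""
    return sorted(blocks, key=lambda b: (b["x"], b["y"]))


def sort_yx(blocks):
    """yx-FUNSD-style order: primary key y, tie-break by x (rough reading order)."""
    return sorted(blocks, key=lambda b: (b["y"], b["x"]))


blocks = [{"x": 5, "y": 1}, {"x": 0, "y": 9}, {"x": 0, "y": 2}]
print(sort_yx(blocks))  # [{'x': 5, 'y': 1}, {'x': 0, 'y': 2}, {'x': 0, 'y': 9}]
```

The yx order approximates natural top-to-bottom reading, which is consistent with yx-FUNSD hurting LayoutLM less than xy-FUNSD in Table 7.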

6.5. Ablation Studies

6.5.1. Gradually Adding Proposed Components to the Original LayoutLM

Model | EE: F | EE: S | EE: C | EL: F | EL: C | EL: Sci
LayoutLM | 76.89 | 94.99 | 94.37 | 41.98 | 90.29 | 99.06
+ pos enc. only | 78.84 | 95.45 | 96.36 | 60.26 | 95.09 | 99.22
+ objectives only | 78.44 | 94.81 | 95.95 | 43.39 | 95.28 | 99.20
+ both (= BROS) | 80.58 | 95.72 | 96.64 | 64.85 | 95.39 | 99.28
Table 8. Performance improvements on EE and EL tasks through adding components of BROS. In the last row, all components are changed from LayoutLM and the model becomes BROS. F, S, C, and Sci refer to FUNSD, SROIE*, CORD, and SciTSR, respectively.

To evaluate the performance improvements from the proposed components of BROS, we report experimental results as each component is added. Table 8 shows the changes in F1 score for the EE and EL tasks; all models in this table are trained for 1 epoch. Adding our proposed positional encoding consistently increases performance, by a large margin of 4.61pp on average over all tasks. Our pre-training objective combining TMLM and AMLM alone yields a 1.58pp average improvement. Using both, BROS achieves the best performance, with an average margin of 5.81pp. This ablation study shows that each component of BROS contributes to performance on its own and that their combination provides further gains.

6.5.2. Comparison between Pre-training Objectives

Table 9 shows the F1 scores of the EE and EL tasks according to the pre-training objective. All models are trained for 1 epoch, and all other settings follow BROS. In most cases, AMLM performs better than TMLM. In particular, on the FUNSD EL task, performance improves significantly with AMLM (60.69 → 63.02 and 63.52 → 65.09), which indicates that AMLM helps learn dependencies between text blocks. For a fair comparison with the proposed method, we also conduct experiments that increase the ratio of masked tokens to 30% for each objective. AMLM 30% and TMLM 15% + AMLM 15% both outperform TMLM 30%. The two perform similarly, but since each objective helps to learn different characteristics, we use both pre-training objectives.
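The 15% + 15% configuration simply splits the masking budget between the two objectives. A heavily simplified sketch of allocating disjoint index sets under that budget; note this is purely illustrative of the ratios: the real AMLM selects tokens by randomly expanded spatial areas over the document, not by index.

```python
import random


def combined_mask(n_tokens, ratio_tmlm=0.15, ratio_amlm=0.15, seed=0):
    """Split a masking budget into two disjoint index sets.

    The first set stands in for token-level masking (TMLM); the second
    stands in for area masking (AMLM). In BROS the AMLM set would be
    chosen by spatial areas rather than uniformly at random.
    """
    rng = random.Random(seed)
    idx = list(range(n_tokens))
    rng.shuffle(idx)
    n_t = int(n_tokens * ratio_tmlm)
    n_a = int(n_tokens * ratio_amlm)
    return set(idx[:n_t]), set(idx[n_t:n_t + n_a])


tmlm_idx, amlm_idx = combined_mask(512)  # 76 + 76 masked positions
```

Keeping the two sets disjoint ensures the total masked fraction matches the 30% used in the single-objective baselines, so the comparison in Table 9 isolates the objective rather than the masking budget.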

Figure 5. Performance comparisons according to the amount of pre-training data. Each point represents the result of fine-tuning after pre-training on 0, 500K, 1M, 2M, 11M, and 11M × 2 documents. The results of BERT are presented together as a baseline.
Figure 6. Performance comparisons according to the amount of fine-tuning data. Each point represents the result of fine-tuning using from 10% to 100% of training data.
Pre-training objective | EE: F | EE: S | EE: C | EL: F | EL: C | EL: Sci
TMLM 15% | 78.84 | 95.47 | 96.53 | 60.69 | 95.68 | 99.22
AMLM 15% | 79.80 | 95.31 | 96.71 | 63.02 | 94.87 | 99.27
TMLM 30% | 79.69 | 95.54 | 96.26 | 63.52 | 95.44 | 99.22
AMLM 30% | 80.62 | 95.55 | 96.48 | 65.09 | 95.30 | 99.32
TMLM 15% + AMLM 15% | 80.58 | 95.72 | 96.64 | 64.85 | 95.39 | 99.28
Table 9. Performance comparison on pre-training objectives. ‘%’ represents the ratio of tokens masked by the given method. F, S, C, and Sci refer to FUNSD, SROIE*, CORD, and SciTSR, respectively.

6.5.3. 1D Positional Embeddings in BERT

LayoutLM and BROS are initialized with the weights of BERT to leverage the knowledge BERT learns from large-scale corpora. However, BERT includes 1D positional embeddings (1D-PE) that might be harmful by imposing a sequence on text blocks even when no order information exists. To ablate 1D-PE, we pre-train two models, with and without the 1D-PE of BERT, for 1 epoch with a batch size of 80, and fine-tune them with TR decoders on both FUNSD and p-FUNSD. Using 1D-PE yields better performance by a large margin (80.21 with 1D-PE vs. 69.05 without on the FUNSD EE task). More interestingly, 1D-PE is also beneficial without the optimal order information (73.18 with 1D-PE vs. 69.05 without on p-FUNSD). We interpret these results as an effect of initializing from BERT weights: without the BERT initialization, we observed much lower performance on all datasets. These experiments show that there are dependencies between the base model used for initialization and language models pre-trained on OCR results.

6.6. Experiments on Training Efficiency

6.6.1. Training Efficiency in Pre-training

It is well known that more data leads to better models. Here, we examine how performance changes with the number of pre-training documents, comparing BROS and LayoutLM, both initialized with BERT.

Figure 5 shows the results on the three EE tasks according to the amount of pre-training data. Each point represents the result (F1 score) of fine-tuning after pre-training on 0, 500K, 1M, 2M, 11M, and 11M × 2 documents. The results of BERT are plotted as horizontal baselines because BERT does not use any OCR results in its pre-training phase. With no pre-training documents, BERT performs better than or comparably to LayoutLM, while BROS outperforms both, indicating that the BROS architecture for encoding 2D text information is more suitable for KIE tasks. As the number of pre-training documents increases, the performance of both BROS and LayoutLM gradually improves, but BROS remains consistently and significantly better. These results demonstrate the efficiency and effectiveness of BROS compared to LayoutLM.

Dataset | # data | BERT | LayoutLM | BROS
FUNSD (EE) | 5 | 31.51 | 50.88 | 64.88
FUNSD (EE) | 10 | 40.46 | 62.26 | 70.00
SROIE* (EE) | 5 | 31.01 | 38.04 | 41.63
SROIE* (EE) | 10 | 45.67 | 60.50 | 61.72
CORD (EE) | 5 | 48.38 | 56.53 | 58.97
CORD (EE) | 10 | 62.13 | 66.14 | 68.61
Table 10. Results of 5-shot and 10-shot learning.

6.6.2. Training Efficiency in Fine-tuning

One advantage of pre-trained models is that they perform well even with a small amount of fine-tuning data (Devlin et al., 2019). Since collecting and labeling fine-tuning data requires considerable time and money, achieving high performance from few fine-tuning examples is important for pre-trained models.

Figure 6 shows the results on the three EE tasks as the amount of training data used for fine-tuning varies from 10% to 100%. As in the ablation study, LayoutLM and BROS are pre-trained for 1 epoch. For all models, F1 scores tend to increase with the ratio of training data, and in most cases BROS achieves better performance than LayoutLM.

To test more extreme cases, we conduct experiments in few-shot learning settings (Wang et al., 2020b). Table 10 shows the results of 5-shot and 10-shot learning on the three EE tasks. For few-shot learning, we fine-tune models for 100 epochs with a batch size of 4. In all cases, our model performs best, demonstrating its generalization ability even with very few training examples.
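Constructing a k-shot split is a simple subsampling step; a minimal sketch (the helper name and dataset representation are illustrative, not from the paper's pipeline):

```python
import random


def few_shot_subset(examples, k, seed=0):
    """Sample k labeled documents for k-shot fine-tuning.

    `examples` is any sequence of labeled documents; a fixed seed keeps
    the split reproducible so all models see the same k examples.
    """
    rng = random.Random(seed)
    return rng.sample(list(examples), k)


train_docs = [f"doc_{i}" for i in range(149)]  # e.g. FUNSD's training split
shots_5 = few_shot_subset(train_docs, 5)
shots_10 = few_shot_subset(train_docs, 10)
```

Using the same seed across models matters here: with only 5 or 10 documents, the variance between random splits can easily exceed the gap between models.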

7. Conclusion

We propose BROS, a novel pre-trained language model for understanding 2D text in documents. By encoding 2D text with relative positions and pre-training with the area-masking strategy, BROS robustly contextualizes 2D text. This leads to an efficient encoding of 2D text and robust performance in low-resource environments. We also introduce a TR decoder that explicitly identifies relationships between text blocks; this new decoder can be used for both EE and EL tasks without relying on a prior ordering of the input text blocks. Our extensive experiments on four public benchmarks show that BROS consistently outperforms previous methods and is robust to ordering noise in text blocks. Further experiments varying the numbers of pre-training and fine-tuning examples demonstrate that BROS learns more general representations of text and layout than previous methods.


  • Chi et al. (2019) Zewen Chi, Heyan Huang, Heng-Da Xu, Houjin Yu, Wanxuan Yin, and Xian-Ling Mao. 2019. Complicated table structure recognition. arXiv preprint arXiv:1908.04729 (2019).
  • Clark et al. (2020) Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. 2020. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. In Proceedings of the 8th International Conference on Learning Representations (ICLR).
  • Dai et al. (2019) Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G Carbonell, Quoc Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive Language Models beyond a Fixed-Length Context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL).
  • Denk and Reisswig (2019) Timo I Denk and Christian Reisswig. 2019. BERTgrid: Contextualized Embedding for 2D Document Representation and Understanding. In Workshop on Document Intelligence at NeurIPS 2019.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Volume 1 (Long and Short Papers). 4171–4186.
  • Harley et al. (2015) Adam W. Harley, Alex Ufkes, and Konstantinos G. Derpanis. 2015. Evaluation of deep convolutional nets for document image classification and retrieval. In Proceedings of the 13th International Conference on Document Analysis and Recognition (ICDAR). 991–995.
  • Huang et al. (2019) Zheng Huang, Kai Chen, Jianhua He, Xiang Bai, Dimosthenis Karatzas, Shijian Lu, and CV Jawahar. 2019. ICDAR2019 competition on scanned receipt ocr and information extraction. In Proceedings of the 15th International Conference on Document Analysis and Recognition (ICDAR). IEEE, 1516–1520.
  • Hwang et al. (2019) Wonseok Hwang, Seonghyeon Kim, Minjoon Seo, Jinyeong Yim, Seunghyun Park, Sungrae Park, Junyeop Lee, Bado Lee, and Hwalsuk Lee. 2019. Post-OCR parsing: building simple and robust parser via BIO tagging. In Workshop on Document Intelligence at NeurIPS 2019.
  • Hwang et al. (2020) Wonseok Hwang, Jinyeong Yim, Seunghyun Park, Sohee Yang, and Minjoon Seo. 2020. Spatial Dependency Parsing for 2D Document Understanding. arXiv preprint arXiv:2005.00642 (2020).
  • Jaume et al. (2019) Guillaume Jaume, Hazim Kemal Ekenel, and Jean-Philippe Thiran. 2019. FUNSD: A dataset for form understanding in noisy scanned documents. In 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW), Vol. 2. IEEE, 1–6.
  • Joshi et al. (2020) Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S Weld, Luke Zettlemoyer, and Omer Levy. 2020. SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics (TACL) 8 (2020), 64–77.
  • Katti et al. (2018) Anoop R Katti, Christian Reisswig, Cordula Guder, Sebastian Brarda, Steffen Bickel, Johannes Höhne, and Jean Baptiste Faddoul. 2018. Chargrid: Towards Understanding 2D Documents. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP). 4459–4469.
  • Lewis et al. (2006) David Lewis, Gady Agam, Shlomo Argamon, Ophir Frieder, D Grossman, and Jefferson Heard. 2006. Building a test collection for complex document information processing. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR). 665–666.
  • Li et al. (2020) Liangcheng Li, Feiyu Gao, Jiajun Bu, Yongpan Wang, Zhi Yu, and Qi Zheng. 2020. An End-to-End OCR Text Re-organization Sequence Learning for Rich-text Detail Image Comprehension. In Proceedings of the 16th European Conference on Computer Vision (ECCV).
  • Liu et al. (2019) Xiaojing Liu, Feiyu Gao, Qiong Zhang, and Huasha Zhao. 2019. Graph Convolution for Multimodal Information Extraction from Visually Rich Documents. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Volume 2 (Industry Papers). 32–39.
  • Loshchilov and Hutter (2019) Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. In Proceedings of the 7th International Conference on Learning Representations (ICLR).
  • Park et al. (2019) Seunghyun Park, Seung Shin, Bado Lee, Junyeop Lee, Jaeheung Surh, Minjoon Seo, and Hwalsuk Lee. 2019. CORD: A Consolidated Receipt Dataset for Post-OCR Parsing. In Workshop on Document Intelligence at NeurIPS 2019.
  • Qian et al. (2019) Yujie Qian, Enrico Santus, Zhijing Jin, Jiang Guo, and Regina Barzilay. 2019. GraphIE: A Graph-Based Framework for Information Extraction. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Volume 1 (Long and Short Papers). 751–761.
  • Schuster et al. (2013) Daniel Schuster, Klemens Muthmann, Daniel Esser, Alexander Schill, Michael Berger, Christoph Weidling, Kamil Aliyev, and Andreas Hofmeier. 2013. Intellix–End-User Trained Information Extraction for Document Archiving. In Proceedings of the 12th International Conference on Document Analysis and Recognition (ICDAR). IEEE, 101–105.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30 (NeurIPS). 5998–6008.
  • Wang et al. (2020a) Wei Wang, Bin Bi, Ming Yan, Chen Wu, Jiangnan Xia, Zuyi Bao, Liwei Peng, and Luo Si. 2020a. StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding. In Proceedings of the 8th International Conference on Learning Representations (ICLR).
  • Wang et al. (2020b) Yaqing Wang, Quanming Yao, James T Kwok, and Lionel M Ni. 2020b. Generalizing from a few examples: A survey on few-shot learning. ACM Computing Surveys (CSUR) 53, 3 (2020), 1–34.
  • Xu et al. (2020a) Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou. 2020a. LayoutLM: Pre-training of text and layout for document image understanding. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD). 1192–1200.
  • Xu et al. (2020b) Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, et al. 2020b. LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding. arXiv preprint arXiv:2012.14740 (2020).
  • Yang et al. (2019) Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems 32 (NeurIPS). 5753–5763.
  • Yu et al. (2020) Wenwen Yu, Ning Lu, Xianbiao Qi, Ping Gong, and Rong Xiao. 2020. PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks. In Proceedings of the 25th International Conference on Pattern Recognition (ICPR).


Appendix A Reproducing the LayoutLM

As mentioned in the paper, to compare BROS with LayoutLM in diverse experimental settings, we implemented LayoutLM in our own experimental pipeline. Table 11 compares our implementation with the scores reported in Xu et al. (2020a) across different amounts of pre-training data. Our implementation shows comparable performance in all settings.

# Pre-training data | # Epochs | Model | Precision | Recall | F1
500K | 1 | LayoutLM (Xu et al., 2020a) | 0.5779 | 0.6955 | 0.6313
500K | 1 | LayoutLM (ours) | 0.5823 | 0.6935 | 0.6330
1M | 1 | LayoutLM (Xu et al., 2020a) | 0.6156 | 0.7005 | 0.6552
1M | 1 | LayoutLM (ours) | 0.6142 | 0.7151 | 0.6608
2M | 1 | LayoutLM (Xu et al., 2020a) | 0.6599 | 0.7355 | 0.6957
2M | 1 | LayoutLM (ours) | 0.6562 | 0.7456 | 0.6980
11M | 1 | LayoutLM (Xu et al., 2020a) | 0.7464 | 0.7815 | 0.7636
11M | 1 | LayoutLM (ours) | 0.7384 | 0.8022 | 0.7689
11M | 2 | LayoutLM (Xu et al., 2020a) | 0.7597 | 0.8155 | 0.7866
11M | 2 | LayoutLM (ours) | 0.7612 | 0.8188 | 0.7889
Table 11. Sanity check of our LayoutLM implementation by comparing its performance on FUNSD with the scores reported in Xu et al. (2020a).