MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding

by Junlong Li et al.
Shanghai Jiao Tong University

Multimodal pre-training with text, layout, and image has made significant progress for Visually-rich Document Understanding (VrDU), especially for fixed-layout documents such as scanned document images. However, there are still a large number of digital documents where the layout information is not fixed and needs to be interactively and dynamically rendered for visualization, which makes existing layout-based pre-training approaches hard to apply. In this paper, we propose MarkupLM for document understanding tasks where markup languages such as HTML/XML serve as the backbone, and text and markup information is jointly pre-trained. Experimental results show that the pre-trained MarkupLM significantly outperforms existing strong baseline models on several document understanding tasks. The pre-trained model and code will be publicly available at








1 Introduction

Multimodal pre-training with text, layout and visual information has recently become the de facto approach (Xu et al., 2020, 2021a, 2021b; Pramanik et al., 2020; Garncarek et al., 2021; Hong et al., 2021; Powalski et al., 2021; Wu et al., 2021; Li et al., 2021a, b; Appalaraju et al., 2021) in Visually-rich Document Understanding (VrDU) tasks. These multimodal models are usually pre-trained with the Transformer architecture (Vaswani et al., 2017) using large-scale unlabeled scanned document images (Lewis et al., 2006) or digital-born PDF files, followed by task-specific fine-tuning with relatively small-scale labeled training samples to achieve the state-of-the-art performance on a variety of document understanding tasks, including form understanding (Jaume et al., 2019; Xu et al., 2021b), receipt understanding (Huang et al., 2019; Park et al., 2019), complex document understanding (Graliński et al., 2020), document type classification (Harley et al., 2015) and document visual question answering (Mathew et al., 2020) etc. Significant progress has been witnessed not only in research tasks within academia, but also in different real-world business applications such as finance, insurance, and many others.

Figure 1: HTML-based webpages rendered by different platforms: (a) mobile, (b) tablet, and (c) desktop.

Visually-rich documents can generally be divided into two categories. The first is fixed-layout documents such as scanned document images and digital-born PDF files, where the layout and style information is pre-rendered and independent of software, hardware, or operating system. This property makes existing layout-based pre-training approaches easily applicable to document understanding tasks. In contrast, the second category is markup-language-based documents such as HTML/XML, where the layout and style information needs to be interactively and dynamically rendered for visualization depending on the software, hardware, or operating system, as shown in Figure 1. For markup-language-based documents, the 2D layout information does not exist in an explicit format but usually needs to be dynamically rendered for different devices, e.g., mobile/tablet/desktop, which makes current layout-based pre-trained models difficult to apply. Therefore, it is indispensable to leverage the markup structure in document-level pre-training for downstream VrDU tasks.

To this end, we propose MarkupLM to jointly pre-train text and markup language in a single framework for markup-based VrDU tasks. Distinct from fixed-layout documents, markup-based documents provide another viewpoint for the document representation learning through markup structures because the 2D position information and document image information cannot be used straightforwardly during the pre-training. Instead, MarkupLM takes advantage of the tree-based markup structures to model the relationship among different units within the document. Similar to other multimodal pre-trained LayoutLM models, MarkupLM has four input embedding layers: (1) a text embedding that represents the token sequence information; (2) an XPath embedding that represents the markup tag sequence information from the root node to the current node; (3) a 1D position embedding that represents the sequence order information; (4) a segment embedding for downstream tasks. The overall architecture of MarkupLM is shown in Figure 2. The XPath embedding layer can be considered as the replacement of 2D position embeddings compared with the LayoutLM model family. To effectively pre-train the MarkupLM, we use three pre-training strategies. The first is the Masked Markup Language Modeling (MMLM), which is used to jointly learn the contextual information of text and markups. The second is the Node Relationship Prediction (NRP), where the relationships are defined according to the hierarchy from the markup trees. The third is the Title-Page Matching (TPM), where the content within “<title> … </title>” is randomly replaced by a title from another page to make the model learn whether they are correlated. In this way, MarkupLM can better understand the contextual information through both the language and markup hierarchy perspectives. 
We evaluate MarkupLM on the Web-based Structural Reading Comprehension (WebSRC) dataset (Chen et al., 2021) and the Structured Web Data Extraction (SWDE) dataset (Hao et al., 2011). Experimental results show that the pre-trained MarkupLM significantly outperforms several strong baseline models on these tasks.

The contributions of this paper are summarized as follows:

  • We propose MarkupLM to address the document representation learning where the layout information is not fixed and needs to be dynamically rendered. For the first time, the text and markup information is pre-trained in a single framework for the VrDU tasks.

  • MarkupLM integrates new input embedding layers and pre-training strategies, which have been confirmed effective on HTML-based downstream tasks.

  • The pre-trained MarkupLM models and code will be publicly available at

Figure 2: The architecture of MarkupLM, where the pre-training tasks are also included.

2 MarkupLM

MarkupLM utilizes the DOM tree in markup language and the XPath query language to obtain the markup streams along with natural texts in markup-language-based documents (Section 2.1). We propose this Transformer-based model with a new XPath embedding layer to accept the markup sequence inputs (Section 2.2) and pre-train it with three different-level objectives including Masked Markup Language Modeling (MMLM), Node Relation Prediction (NRP), and Title-Page Matching (TPM) (Section 2.3).

2.1 DOM Tree and XPath

A DOM tree is the tree structure object of a markup-language-based document (e.g., HTML or XML) in the view of the DOM (Document Object Model), wherein each node is an object representing a part of the document.

XPath (XML Path Language) is a query language for selecting nodes from a markup-language-based document. It is based on the DOM tree and can be used to easily locate a node in the document. In a typical XPath expression, such as /html/body/div/li[1]/div/span[2], the texts stand for the tag names of the nodes, while the subscripts are the ordinals of a node when multiple nodes have the same tag name under a common parent node.

We show an example of DOM tree and XPath along with the corresponding source code in Figure 3, from which we can clearly identify the genealogy of all nodes within the document, as well as their XPath expressions.
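For illustration, splitting an XPath expression into per-level (tag name, subscript) pairs, as used later for the XPath embedding, can be sketched in Python (a minimal sketch; the function name `parse_xpath` is ours):

```python
import re

def parse_xpath(xpath: str):
    """Split an XPath expression into (tag, subscript) pairs per level.

    Units without an explicit ordinal get subscript 0, mirroring the
    convention used for the XPath embedding in Section 2.2.
    """
    units = []
    for step in xpath.strip("/").split("/"):
        m = re.fullmatch(r"([^\[\]]+)(?:\[(\d+)\])?", step)
        tag, sub = m.group(1), m.group(2)
        units.append((tag, int(sub) if sub else 0))
    return units

print(parse_xpath("/html/body/div/li[1]/div/span[2]"))
# [('html', 0), ('body', 0), ('div', 0), ('li', 1), ('div', 0), ('span', 2)]
```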

Figure 3: An example of DOM tree and XPath with the source HTML code.

2.2 Model Architecture

To take advantage of existing pre-trained models and adapt to markup-language-based tasks (e.g., webpage tasks), we use the BERT (Devlin et al., 2019) architecture as the encoder backbone and add a new input embedding, named the XPath embedding, to the original embedding layer. The overall structure of MarkupLM and the newly proposed XPath embedding are shown in Figures 2 and 4.

Figure 4: Overview of the XPath embedding from an XPath expression.

XPath Embedding

For the $i$-th input token $x_i$, we take its corresponding XPath expression and split it by "/" to get the node information at each level of the XPath as a list $xp_i = [(t_{i0}, s_{i0}), (t_{i1}, s_{i1}), \dots, (t_{id}, s_{id})]$, where $d$ is the depth of this XPath and $(t_{ij}, s_{ij})$ denotes the tag name and the subscript of the XPath unit on level $j$, for $j = 0, 1, \dots, d$. Note that for units with no subscripts, we assign 0 to $s_{ij}$. To facilitate further processing, we truncate and pad the $xp_i$ to unify their lengths as $L$.

The process of converting an XPath expression into the XPath embedding is shown in Figure 4. For each pair $(t_{ij}, s_{ij})$, we input it into the $j$-th tag unit embedding layer and the $j$-th subscript unit embedding layer respectively, and the two outputs are added up to get the $j$-th unit embedding $ue_{ij}$. We set the dimension of these two embeddings to $d_u$:

$$ue_{ij} = \mathrm{TagUnitEmb}_j(t_{ij}) + \mathrm{SubsUnitEmb}_j(s_{ij})$$

We concatenate all the unit embeddings to get the intermediate representation $r_i$ of the complete XPath for $x_i$:

$$r_i = [ue_{i0}; ue_{i1}; \dots; ue_{iL}]$$

Finally, to match the dimension of the other embeddings, we apply a linear transformation to $r_i$ to get the final XPath embedding $xe_i$:

$$xe_i = W r_i + b, \quad W \in \mathbb{R}^{h \times L d_u}, \ b \in \mathbb{R}^{h},$$

where $h$ is the hidden size of MarkupLM.

2.3 Pre-training Objectives

To efficiently capture the complex structures of markup-language-based documents, we propose pre-training objectives on three different levels, including token-level (MMLM), node-level (NRP), and page-level (TPM).

Masked Markup Language Modeling

Inspired by the previous works (Devlin et al., 2019; Xu et al., 2020, 2021a), we propose a token-level pre-training objective Masked Markup Language Modeling (MMLM), which is designed to enhance the language modeling ability with the markup clues. Basically, with the text and markup input sequences, we randomly select and replace some tokens with [MASK], and this task requires the model to recover the masked tokens with all markup clues.
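The masking step can be sketched as follows (a minimal sketch on word tokens; the real implementation operates on subword ids and uses the 15% masking probability given in Section 3.2):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Randomly replace tokens with [MASK]; return the masked sequence and targets.

    The model must recover each masked token from the surrounding text
    together with the (unmasked) XPath/markup features.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok          # the original token is the prediction target
            masked.append(mask_token)
        else:
            masked.append(tok)
    return masked, targets

masked, targets = mask_tokens(["the", "quick", "brown", "fox", "jumps"] * 4)
print(sum(t == "[MASK]" for t in masked), "tokens masked out of", len(masked))
```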

Node Relation Prediction

Although the MMLM task can help the model improve the markup language modeling ability, the model is still not aware of the semantics of the XPath information provided by the XPath embedding. With the naturally structured DOM tree, we propose a node-level pre-training objective, Node Relation Prediction (NRP), to explicitly model the relationship between a pair of nodes. We first define a set of directed node relationships {self, parent, child, sibling, ancestor, descendant, others}. Then we pair up the nodes to obtain node pairs. For each pair of nodes, we assign the corresponding label according to the node relationship set, and the model is required to predict the assigned relationship labels using the features from the first token of each node.
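As an illustration, the relation label for a directed node pair can be derived from a parent map; this is our own sketch of the labeling rule, not the authors' code:

```python
def node_relation(a, b, parent):
    """Classify the directed relation from node a to node b.

    `parent` maps each node id to its parent id (roots map to None).
    Mirrors the seven-way label set {self, parent, child, sibling,
    ancestor, descendant, others} used by the NRP objective.
    """
    def ancestors(n):
        out = []
        while parent.get(n) is not None:
            n = parent[n]
            out.append(n)
        return out

    if a == b:
        return "self"
    if parent.get(b) == a:
        return "parent"      # a is the parent of b
    if parent.get(a) == b:
        return "child"       # a is a child of b
    if parent.get(a) is not None and parent.get(a) == parent.get(b):
        return "sibling"
    if a in ancestors(b):
        return "ancestor"
    if b in ancestors(a):
        return "descendant"
    return "others"

# DOM fragment: html -> body -> {div1, div2}; div1 -> span
parent = {"html": None, "body": "html", "div1": "body", "div2": "body", "span": "div1"}
print(node_relation("body", "span", parent))  # ancestor
```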

Title-Page Matching

Besides the fine-grained information provided by markups, sentence-level or topic-level information can also be leveraged in markup-language-based documents. For HTML-based documents, the <title> element can be an excellent summary of the <body>, which provides supervision for high-level semantics. To efficiently utilize this self-supervised information, we propose a page-level pre-training objective, Title-Page Matching (TPM). Given the <body> element of a markup-based document, we randomly replace the text of the <title> element and ask the model to predict whether the title has been replaced, using the representation of the [CLS] token for binary classification.
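Constructing TPM training examples amounts to randomly swapping titles between pages; a minimal sketch (the paper replaces titles with 15% probability during pre-training; the 50% here is only for illustration, and the function name is ours):

```python
import random

def make_tpm_examples(pages, swap_prob=0.5, seed=0):
    """Build (title, body, label) examples for Title-Page Matching.

    With probability swap_prob, a page's title is replaced by the title
    of another randomly chosen page; label 1 means "title was replaced".
    """
    rng = random.Random(seed)
    examples = []
    for i, (title, body) in enumerate(pages):
        if len(pages) > 1 and rng.random() < swap_prob:
            j = rng.choice([k for k in range(len(pages)) if k != i])
            examples.append((pages[j][0], body, 1))
        else:
            examples.append((title, body, 0))
    return examples

pages = [("MarkupLM paper", "body A"), ("Cooking tips", "body B"), ("News", "body C")]
for ex in make_tpm_examples(pages):
    print(ex)
```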

2.4 Fine-tuning

We follow the scheme of common pre-trained language models (Devlin et al., 2019; Liu et al., 2019) and introduce the fine-tuning recipes for two downstream tasks: reading comprehension and information extraction.

For the reading comprehension task, we model it as an extractive QA task. We input the last hidden state of each token to a binary linear classification layer to get two scores for start and end positions.
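Decoding the answer span from these two sets of scores is standard extractive-QA practice; a minimal sketch (the max_len constraint is a common decoding heuristic, not something specified in the paper):

```python
def best_span(start_scores, end_scores, max_len=10):
    """Pick the (start, end) pair maximizing start_scores[s] + end_scores[e],
    subject to s <= e < s + max_len, as in standard extractive QA decoding."""
    best, best_score = (0, 0), float("-inf")
    for s in range(len(start_scores)):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = start_scores[s] + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

start = [0.1, 2.0, 0.3, 0.0]
end = [0.0, 0.2, 1.5, 0.1]
print(best_span(start, end))  # (1, 2)
```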

For the information extraction task, we model it as a token classification task. We input the last hidden state of each token to a linear classification layer with n + 1 categories, where n is the number of attributes we need to extract and the extra category is for tokens that belong to none of the attributes.

3 Experiments

In this work, we apply our MarkupLM framework to HTML-based webpages, one of the most common markup-language scenarios. Equipped with the existing large-scale webpage dataset Common Crawl (CC), we pre-train MarkupLM with large-scale unlabeled HTML data and evaluate the pre-trained models on web-based structural reading comprehension and information extraction tasks.

3.1 Data

Common Crawl

The Common Crawl (CC) dataset contains petabytes of webpages in the form of raw web page data, metadata extracts and text extracts. We use the pre-trained language detection model from fasttext (Joulin et al., 2017) to filter out non-English pages in it, and only keep the common tags (e.g. <div>, <span>, <li>, <a>, etc.) in these pages to save storage space. After pre-processing, a subset of CC with 24M English webpages is extracted as our pre-training data for MarkupLM.
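The tag-filtering step can be sketched with Python's standard html.parser; the whitelist below is an illustrative subset, not the exact tag set used for pre-training:

```python
from html.parser import HTMLParser

# Common tags kept during pre-processing (an illustrative subset).
ALLOWED = {"html", "head", "title", "body", "div", "span", "ul", "li", "a"}

class TagFilter(HTMLParser):
    """Re-emit only whitelisted tags and visible text, dropping the rest."""
    def __init__(self):
        super().__init__()
        self.out = []
        self.skip_data = 0   # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_data += 1
        elif tag in ALLOWED:
            self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip_data = max(0, self.skip_data - 1)
        elif tag in ALLOWED:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_data and data.strip():
            self.out.append(data.strip())

f = TagFilter()
f.feed("<div><script>var x = 1;</script><span>hello</span></div>")
print("".join(f.out))  # <div><span>hello</span></div>
```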


WebSRC

The Web-based Structural Reading Comprehension (WebSRC) dataset (Chen et al., 2021) consists of 440K question-answer pairs collected from 6.5K web pages with corresponding HTML source code, screenshots, and metadata. Each question in WebSRC requires a certain structural understanding of a webpage to answer, and the answer is either a text span on the web page or yes/no. By adding additional yes/no tokens to the text input, it can be modeled as a typical extractive reading comprehension task. Following the original paper (Chen et al., 2021), we choose Exact match (EM), F1 score (F1), and Path overlap score (POS) as the evaluation metrics for this dataset. We use the official split to get the training and development sets. Note that the authors of WebSRC did not release their test set, so all our results are obtained on the development set.


SWDE

The Structured Web Data Extraction (SWDE) dataset (Hao et al., 2011) is a real-world webpage collection for the automatic extraction of structured data from the Web. It involves 8 verticals, 80 websites (10 per vertical), and 124,291 webpages (200–2,000 per website) in total. The task is to extract the values corresponding to a set of given attributes (depending on the vertical the webpage belongs to) from a webpage, e.g., the value of author for book pages. Following previous works (Hao et al., 2011; Lin et al., 2020; Zhou et al., 2021), we choose page-level F1 scores as our evaluation metric for this dataset.

Since there is no official train-test split, we follow previous works (Hao et al., 2011; Lin et al., 2020; Zhou et al., 2021) and do training and evaluation on each vertical (i.e., category of websites) independently. In each vertical, we randomly select k seed websites as the training data and use the remaining websites as the testing set. Note that in this few-shot extraction task, none of the pages of the testing websites have been visited in the training phase. This setting is abstracted from the real application scenario where only a small set of labeled data is provided for specific websites and we aim to infer the attributes on a much larger unseen website set. The final results are obtained by taking the average over all 8 verticals and all 10 permutations of seed websites per vertical, leading to 80 individual experiments for each k. For data pre/post-processing, we follow Zhou et al. (2021) to make a fair comparison.
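The page-level F1 metric can be sketched as follows; this is a simplified reading of the SWDE metric (the exact matching rules follow the pre/post-processing of Zhou et al. (2021)), and the function name `page_f1` is ours:

```python
def page_f1(predictions, gold):
    """Page-level F1 for one attribute: each list item is one page's extracted
    value (None if nothing was extracted). A page counts as correct when the
    predicted value exactly matches the gold value.
    """
    extracted = sum(p is not None for p in predictions)
    relevant = sum(g is not None for g in gold)
    correct = sum(p is not None and p == g for p, g in zip(predictions, gold))
    precision = correct / extracted if extracted else 0.0
    recall = correct / relevant if relevant else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

pred = ["J. Smith", None, "A. Lee", "B. Chan"]
gold = ["J. Smith", "K. Wong", "A. Lee", "C. Wu"]
print(round(page_f1(pred, gold), 3))
```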

Model             Modality              Exact Match   F1      POS
T-PLM             Text                  52.12         61.57   79.74
H-PLM             Text + HTML           61.51         67.04   82.97
V-PLM             Text + HTML + Image   62.07         66.66   83.64
MarkupLM (ours)   Text + HTML           67.38         74.80   87.24

Table 1: Evaluation results on the WebSRC (Chen et al., 2021) development set.
Model                             k=1     k=2     k=3     k=4     k=5
SSM (Carlson and Schafer, 2008)   63.00   64.50   69.20   71.90   74.10
Render-Full (Hao et al., 2011)    84.30   86.00   86.80   88.40   88.60
FreeDOM-NL (Lin et al., 2020)     72.52   81.33   86.44   88.55   90.28
FreeDOM-Full (Lin et al., 2020)   82.32   86.36   90.49   91.29   92.56
SimpDOM (Zhou et al., 2021)       83.06   88.96   91.63   92.84   93.75
MarkupLM (ours)                   84.31   89.75   93.02   94.24   95.48

Table 2: Extraction performance (F1 score) of five baseline models and our method MarkupLM with different numbers k of seed sites on the SWDE dataset. Each value is the average over 8 verticals and 10 permutations of seed websites per vertical (80 experiments in total).

3.2 Settings


Pre-training

The sizes of the selected tag and subscript vocabularies in the XPath embedding are 216 and 1,001 respectively, the max depth of an XPath expression is 50, and the dimension of the tag-unit and subscript-unit embeddings is 32. The token-masking probability in MMLM and the title-replacement probability in TPM are both 15%, and we do not mask the tokens in the input sequence corresponding to the webpage titles. The max number of selected node pairs in NRP is 1,000 for each sample, and we limit the ratio of pairs with non-others (i.e., self, parent, ...) labels to 80% to keep the labels balanced. We initialize MarkupLM from RoBERTa and train it for 300K steps on 8 NVIDIA A100 GPUs, with a batch size of 32, a learning rate of 5e-5, and a warmup ratio of 0.1.


Fine-tuning

For WebSRC, we fine-tune MarkupLM for 5 epochs with a batch size of 16, a learning rate of 1e-5, and a warmup ratio of 0.1. For SWDE, we fine-tune MarkupLM for 10 epochs with a batch size of 16, a learning rate of 1e-4, and a warmup ratio of 0.1. All experiments in the fine-tuning stage are conducted on 4 NVIDIA V100 GPUs.

3.3 Results

The results for WebSRC are shown in Table 1. We observe that MarkupLM significantly surpasses H-PLM (Chen et al., 2021), which uses the same modality of information. This strongly indicates that MarkupLM makes better use of the XPath features with the specially designed embedding layer and pre-training objectives, compared with merely adding more tag tokens to the input sequence as in H-PLM. Besides, MarkupLM also achieves a higher score than the previous state-of-the-art V-PLM model, which uses additional vision features from Faster R-CNN, showing that our render-free MarkupLM can learn the structural information better without any visual information.

The results for SWDE are shown in Tables 2 and 3. It is observed that MarkupLM also substantially outperforms the strong baselines. Different from the previous state-of-the-art model SimpDOM, which explicitly feeds the relationships between DOM tree nodes into the model and adds huge amounts of extra discrete features (e.g., whether a node contains digits), MarkupLM is much simpler and is free from time-consuming additional webpage annotations. We also report detailed statistics for the different verticals in Table 3. With more seed websites, MarkupLM gets more webpages as training data, so there is a clear ascending trend reflected by the scores. We also see variance among the different verticals, since the number and type of pages are not the same.

Vertical     k=1     k=2     k=3     k=4     k=5
auto         75.67   79.43   89.14   92.05   94.77
book         86.29   88.74   90.16   91.33   94.08
camera       87.49   92.09   93.13   93.59   95.20
job          78.43   85.40   86.77   88.78   89.24
movie        90.75   94.66   97.16   98.66   98.98
nbaplayer    86.95   89.64   94.95   95.79   95.82
restaurant   82.99   92.31   95.82   95.97   96.67
university   85.92   95.76   97.05   97.78   99.09
Average      84.31   89.75   93.02   94.24   95.48

Table 3: Evaluation results of MarkupLM on the SWDE dataset with different numbers k of seed sites for training.
Initialization   Data   MMLM   NRP   TPM   Exact Match   F1      POS
BERT             1M     ✓      –     –     54.11         63.44   81.87
BERT             1M     ✓      ✓     –     56.72         65.07   83.02
BERT             1M     ✓      ✓     ✓     59.56         68.12   84.80
RoBERTa          1M     ✓      ✓     ✓     60.83         69.13   85.61
RoBERTa          24M    ✓      ✓     ✓     67.38         74.80   87.24

Table 4: Ablation study with different pre-training objectives and initializations for MarkupLM on the WebSRC dataset.

3.4 Ablation Study

To investigate how each pre-training objective contributes to MarkupLM, we conduct an ablation study on WebSRC with a smaller training set containing 1M webpages. In this subsection, the models are initialized from BERT-base-uncased, with all the other settings unchanged. We fine-tune the models for 2 epochs for fast verification. The results are in Table 4. Comparing lines 1 and 2, we find that the NRP objective, which models the relationships between nodes in the DOM tree, greatly helps MarkupLM learn the structural information. Furthermore, after adding the page-level TPM objective (lines 2 and 3), the performance is enhanced to a higher level. We believe this objective helps MarkupLM learn the semantic information of the questions in this dataset. We also investigate the impact of different initializations by replacing BERT with RoBERTa while keeping all three objectives (lines 3 and 4), and confirm the benefit of initializing from a better model.

4 Related Work

Multimodal pre-training with text, layout, and image information has significantly advanced the research of document AI, and it has become the de facto approach for a variety of VrDU tasks. Although great progress has been achieved on fixed-layout document understanding tasks, existing multimodal pre-training approaches cannot be easily applied to markup-based document understanding in a straightforward way, because the layout information of markup-based documents needs to be rendered dynamically and may differ depending on software and hardware. Therefore, the markup information is vital for document understanding. Ashby and Weir (2020) compared the Text+Tags approach with its Text-Only equivalents over five web-based NER datasets, which indicates the necessity of markup enrichment for deep language models. Lin et al. (2020) presented a novel two-stage neural approach named FreeDOM. The first stage learns a representation for each DOM node in the page by combining both the text and markup information. The second stage captures longer-range distance and semantic relatedness using a relational neural network. Experiments show that FreeDOM beats the previous SOTA results without requiring features over rendered pages or expensive hand-crafted features.

Zhou et al. (2021) proposed a novel transferable method SimpDOM to tackle the problem by efficiently retrieving useful context for each node by leveraging the tree structure. However, these methods did not fully leverage the large-scale unlabeled data and the self-supervised pre-training techniques to enrich the document representation learning. To the best of our knowledge, MarkupLM is the first large-scale pre-trained model that jointly learns the text and markup language in a single framework for VrDU tasks.

5 Conclusion and Future Work

In this paper, we present MarkupLM, a simple yet effective pre-training approach for text and markup language. With the Transformer architecture, MarkupLM integrates different input embeddings including text embeddings, position embeddings, and XPath embeddings. Furthermore, we also propose new pre-training objectives that are specially designed for understanding the markup language. We evaluate the pre-trained MarkupLM model on the WebSRC and SWDE datasets. Experiments show that MarkupLM significantly outperforms several SOTA baselines in these tasks.

For future research, we will investigate MarkupLM pre-training with more data and more computation resources, as well as language expansion. Furthermore, we will also pre-train MarkupLM models for digital-born PDFs and Office documents that use XML DOM as the backbone. In addition, we will explore the relationship between LayoutLM and MarkupLM to deeply understand whether the two models can be pre-trained under multi-view and multi-task settings and whether the knowledge from the two models can be transferred to each other.


  • S. Appalaraju, B. Jasani, B. U. Kota, Y. Xie, and R. Manmatha (2021) DocFormer: end-to-end transformer for document understanding. External Links: 2106.11539 Cited by: §1.
  • C. Ashby and D. Weir (2020) Leveraging HTML in free text web named entity recognition. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain (Online), pp. 407–413. External Links: Document, Link Cited by: §4.
  • A. Carlson and C. Schafer (2008) Bootstrapping information extraction from semi-structured web pages. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 195–210. Cited by: Table 2.
  • L. Chen, X. Chen, Z. Zhao, D. Zhang, J. Ji, A. Luo, Y. Xiong, and K. Yu (2021) WebSRC: a dataset for web-based structural reading comprehension. External Links: 2101.09465 Cited by: §1, §3.1, §3.3, Table 1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. External Links: Document, Link Cited by: §2.2, §2.3, §2.4.
  • Ł. Garncarek, R. Powalski, T. Stanisławek, B. Topolski, P. Halama, M. Turski, and F. Graliński (2021) LAMBERT: layout-aware (language) modeling for information extraction. External Links: 2002.08087 Cited by: §1.
  • F. Graliński, T. Stanisławek, A. Wróblewska, D. Lipiński, A. Kaliska, P. Rosalska, B. Topolski, and P. Biecek (2020) Kleister: a novel task for information extraction involving long documents with complex layout. External Links: 2003.02356 Cited by: §1.
  • Q. Hao, R. Cai, Y. Pang, and L. Zhang (2011) From one tree to a forest: a unified solution for structured web data extraction. In Proceeding of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2011, Beijing, China, July 25-29, 2011, W. Ma, J. Nie, R. Baeza-Yates, T. Chua, and W. B. Croft (Eds.), pp. 775–784. External Links: Document, Link Cited by: §1, §3.1, §3.1, Table 2.
  • A. W. Harley, A. Ufkes, and K. G. Derpanis (2015) Evaluation of deep convolutional nets for document image classification and retrieval. In International Conference on Document Analysis and Recognition (ICDAR), Cited by: §1.
  • T. Hong, D. Kim, M. Ji, W. Hwang, D. Nam, and S. Park (2021) {bros}: a pre-trained language model for understanding texts in document. External Links: Link Cited by: §1.
  • Z. Huang, K. Chen, J. He, X. Bai, D. Karatzas, S. Lu, and C. V. Jawahar (2019) ICDAR2019 competition on scanned receipt ocr and information extraction. In 2019 International Conference on Document Analysis and Recognition (ICDAR), Vol. , pp. 1516–1520. External Links: Document Cited by: §1.
  • G. Jaume, H. K. Ekenel, and J. Thiran (2019) FUNSD: a dataset for form understanding in noisy scanned documents. 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW) 2, pp. 1–6. Cited by: §1.
  • A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov (2017) Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, Valencia, Spain, pp. 427–431. External Links: Link Cited by: §3.1.
  • D. Lewis, G. Agam, S. Argamon, O. Frieder, D. Grossman, and J. Heard (2006) Building a test collection for complex document information processing. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’06, New York, NY, USA, pp. 665–666. External Links: Document, ISBN 1595933697, Link Cited by: §1.
  • C. Li, B. Bi, M. Yan, W. Wang, S. Huang, F. Huang, and L. Si (2021a) StructuralLM: structural pre-training for form understanding. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, pp. 6309–6318. External Links: Document, Link Cited by: §1.
  • P. Li, J. Gu, J. Kuen, V. I. Morariu, H. Zhao, R. Jain, V. Manjunatha, and H. Liu (2021b) SelfDoc: self-supervised document representation learning. External Links: 2106.03331 Cited by: §1.
  • B. Y. Lin, Y. Sheng, N. Vo, and S. Tata (2020) FreeDOM: A transferable neural architecture for structured information extraction on web documents. In KDD ’20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23-27, 2020, R. Gupta, Y. Liu, J. Tang, and B. A. Prakash (Eds.), pp. 1092–1102. External Links: Link Cited by: §3.1, §3.1, Table 2, §4.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. S. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized bert pretraining approach. ArXiv preprint abs/1907.11692. External Links: Link Cited by: §2.4.
  • M. Mathew, D. Karatzas, R. Manmatha, and C. V. Jawahar (2020) DocVQA: a dataset for vqa on document images. External Links: 2007.00398 Cited by: §1.
  • S. Park, S. Shin, B. Lee, J. Lee, J. Surh, M. Seo, and H. Lee (2019) CORD: a consolidated receipt dataset for post-OCR parsing. In Workshop on Document Intelligence at NeurIPS 2019, External Links: Link Cited by: §1.
  • R. Powalski, Ł. Borchmann, D. Jurkiewicz, T. Dwojak, M. Pietruszka, and G. Pałka (2021) Going full-tilt boogie on document understanding with text-image-layout transformer. External Links: 2102.09550 Cited by: §1.
  • S. Pramanik, S. Mujumdar, and H. Patel (2020) Towards a multi-modal, multi-task learning based pre-training framework for document representation learning. External Links: 2009.14457 Cited by: §1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett (Eds.), pp. 5998–6008. External Links: Link Cited by: §1.
  • T. Wu, C. Li, M. Zhang, T. Chen, S. A. Hombaiah, and M. Bendersky (2021) LAMPRET: layout-aware multimodal pretraining for document understanding. External Links: 2104.08405 Cited by: §1.
  • Y. Xu, Y. Xu, T. Lv, L. Cui, F. Wei, G. Wang, Y. Lu, D. Florencio, C. Zhang, W. Che, M. Zhang, and L. Zhou (2021a) LayoutLMv2: multi-modal pre-training for visually-rich document understanding. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, pp. 2579–2591. External Links: Document, Link Cited by: §1, §2.3.
  • Y. Xu, M. Li, L. Cui, S. Huang, F. Wei, and M. Zhou (2020) LayoutLM: pre-training of text and layout for document image understanding. In KDD ’20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23-27, 2020, R. Gupta, Y. Liu, J. Tang, and B. A. Prakash (Eds.), pp. 1192–1200. External Links: Link Cited by: §1, §2.3.
  • Y. Xu, T. Lv, L. Cui, G. Wang, Y. Lu, D. Florencio, C. Zhang, and F. Wei (2021b) LayoutXLM: multimodal pre-training for multilingual visually-rich document understanding. External Links: 2104.08836 Cited by: §1.
  • Y. Zhou, Y. Sheng, N. Vo, N. Edmonds, and S. Tata (2021) Simplified dom trees for transferable attribute extraction from the web. External Links: 2101.02415 Cited by: §3.1, §3.1, Table 2, §4.