Structure-aware Pre-training for Table Understanding with Tree-based Transformers

10/21/2020 ∙ by Zhiruo Wang, et al. ∙ Microsoft, Carnegie Mellon University, Peking University

Tables are widely used with various structures to organize and present data. Recent attempts at table understanding mainly focus on relational tables, yet overlook other common table structures. In this paper, we propose TUTA, a unified pre-training architecture for understanding generally structured tables. Since understanding a table requires spatial, hierarchical, and semantic information, we adapt the self-attention strategy with several key structure-aware mechanisms. First, we propose a novel tree-based structure called a bi-dimensional coordinate tree to describe both the spatial and hierarchical information in tables. Upon this, we extend the pre-training architecture with two core mechanisms, namely tree-based attention and tree-based position embedding. Moreover, to capture table information in a progressive manner, we devise three pre-training objectives to enable representations at the token, cell, and table levels. TUTA pre-trains on a wide range of unlabeled tables and fine-tunes on a critical task in the field of table structure understanding, i.e., cell type classification. Experimental results show that TUTA is highly effective, achieving state-of-the-art results on four well-annotated cell type classification datasets.




1. Introduction

Figure 1. Examples of various structured tables. Left: (a,b) vertical relational, Middle: (c,d) horizontal entity, and Right: (e,f) matrix tables. Different colors denote different cell types.
Figure 2. Examples of real-world tables. (a) shows a relational web table with a flat top header. (b) shows a matrix PDF table with a flat top header and a hierarchical left header. (c) shows a spreadsheet table that has hierarchical top and left headers.

Tables, as a key structure to organize and present data, are widely used for documentation in webpages, spreadsheets, PDFs, etc. Unlike sequential NL text, tables usually lay out cells in a two-dimensional matrix and contain additional information such as formats and formulas besides pure text. As Figure 1 shows, tables are flexible with various structures, including vertical and horizontal relational tables, vertical and horizontal entity tables, matrix tables, etc. (Nishida et al., 2017; Crestan and Pantel, 2011). To present data more effectively in the two-dimensional space, real-world tables often have hierarchical structures and expandable data groups (Chen and Cafarella, 2014, 2013; Dou et al., 2018). For example, Figure 2 (b) shows a matrix PDF table with a hierarchical left header marked by different indentation levels, and Figure 2 (c) shows a matrix spreadsheet table that has both hierarchical top and left headers (“B1:E2” and “A3:A22”), shaped by merged cells and indents respectively, to organize data in a compact and hierarchical way for easy look-up or side-by-side comparison.

Tables contain a huge amount of high-quality data and have recently gained increasing attention. There is a flurry of research on table understanding tasks, including entity linking in tables (Bhagavatula et al., 2015; Ritze and Bizer, 2017; Zhao and He, 2019; Deng et al., 2020), column type identification in tables (Chen et al., 2019a; Guo et al., 2020), answering natural language questions over tables (Pasupat and Liang, 2015; Sun et al., 2016; Krishnamurthy et al., 2017), and generating data analysis for tables (Zhou et al., 2020). However, the vast majority of these works focus only on relational tables, which account for only 0.9% of web tables and 22.0% of spreadsheet tables (Chen and Cafarella, 2014). Also, other widely used table types such as matrix tables and entity tables are largely overlooked. Both problems lead to a huge gap between cutting-edge table understanding techniques and real-world tables in varied structures. Therefore, it is important to enable semantic understanding of variously structured tables and take a critical step toward closing this gap. There are several attempts (Dong et al., 2019a; Gol et al., 2019; Koci et al., 2019; Chen and Cafarella, 2014, 2013) at identifying table hierarchies and cell types to extract relational data from variously structured tables, but labeling such structural information is very time-consuming and labor-intensive, which greatly challenges machine learning methods that are rather data-hungry.

Motivated by the success of large-scale pre-trained language models (LMs) (Devlin et al., 2018; Radford et al., 2018) in a variety of NL tasks, one potential way to mitigate the label-shortage challenge is to leverage weakly supervised pre-training on large volumes of unlabeled tables from the web. Herzig et al. (2020) and Yin et al. (2020) target question answering over relational tables via joint pre-training of tables and their textual descriptions. Deng et al. (2020) attempt to pre-train embeddings on relational tables to enhance table knowledge matching and table augmentation. However, these pre-training methods still target only relational web tables, because their structure is simple and clear. In relational tables, each column is homogeneous and described by its column name, so Yin et al. (2020) augment each data cell with its corresponding column name, and Deng et al. (2020) restrict each cell to aggregate information only from its row and column. But these methods are not suitable for other commonly used table types. For example, Figure 2 (c) shows a hierarchical matrix table, where cell “D8” (148,270) is described not only by its top header but also by its left header. Furthermore, as indicated by merged cells in the top header and indentation levels in the left header, cell “D8” is jointly described by multiple cells including “D1” (“Mortality”), “D2” (“Males”), “A6” (“Urinary tract”) and “A8” (“Bladder”), and forms a relational mapping (“Mortality”, “Males”, “Urinary tract”, “Bladder”, 148,270) (Chen and Cafarella, 2014). Such structural information greatly helps human readers understand this table, but simply treating it as a relational table loses this valuable structural information.

Hence in this paper, we aim to propose a structure-aware method for general table pre-training. Fortunately, previous studies show strong structural commonalities in real-world tables: (1) tables are arranged in a two-dimensional format with vertical orientation, horizontal orientation, or both orientations (Nishida et al., 2017); (2) tables usually contain headers on the top or left side to describe other cells (Zanibbi et al., 2004; Wang, 2016); (3) headers are often structured using a hierarchical tree (Chen and Cafarella, 2014, 2013; Lim and Ng, 1999), especially in professional finance and government tables (for example, SAUS, the Statistical Abstract of the US from the Census Bureau, is a widely studied public spreadsheet dataset in which most tables contain hierarchical headers). Motivated by previous studies, we propose a bi-dimensional coordinate tree, a unified structure to describe both cell locations and hierarchies in variously structured tables. Based on this novel tree-based structure, we propose TUTA, a structure-aware architecture for Table Understanding with Tree-based Attention. TUTA consists of two key enhancements to capture structural information in tables. (1) To encode spatial and hierarchical information in TUTA, we devise tree-based positional encodings. Different from explicit tree-based positional encodings using a uni-dimensional tree structure (Shiv and Quirk, 2019), TUTA compares both explicit and implicit positional encodings with the bi-dimensional tree structure. Also, to encode spatial and hierarchical information jointly, TUTA combines tree-based coordinates with traditional rectangular Cartesian coordinates. (2) Self-attention (Vaswani et al., 2017) is considered to be more parallelizable and efficient than dedicated structure-aware models like tree-LSTMs and RNNs (Shiv and Quirk, 2019; Nguyen et al., 2020); however, it is challenged by numerous “distractions” in large bi-dimensional tables. Structural information is crucial for local cells both to aggregate structurally relevant contexts and to ignore noisy contexts. To make both processes more efficient, we devise a structure-aware attention mechanism. Different from existing practices that use bottom-up tree-based attention (Nguyen et al., 2020) and constituent tree attention (Wang et al., 2019) in the NL domain, TUTA adapts the general idea of graph attention networks (Veličković et al., 2017) to tree structures. Through the bi-dimensional tree structure, data readily flows in top-down, bottom-up, and peer-to-peer manners. The key contributions of this paper are summarized as follows:

  • For generally structured tables, we propose in this paper a novel bi-dimensional coordinate tree to describe both the spatial and hierarchical information. Based on this bi-tree, we propose TUTA, a structure-aware pre-training method following a self-attention strategy. To better incorporate spatial and structural information in tables, we devise two crucial techniques, called tree positional encodings and tree-based attention, that prove to be highly effective throughout experiments.

  • We leverage three novel pre-training tasks for TUTA, including token-level masked language modeling, cell-level cloze, and table-level context retrieval. During pre-training, TUTA progressively learns token/cell/table representations on a large volume of tables in an unsupervised manner.

  • To demonstrate the effectiveness of TUTA, we fine-tune our pre-trained model on a critical task for table semantic structure understanding, cell type classification. TUTA is the first transformer-based method applied to this task and achieves state-of-the-art performance on four well-annotated datasets.

Figure 3. An example of bi-dimensional coordinate tree. In this example, both the top tree and the left tree contain three levels. Cell “A6” (“Urinary tract”) is the “parent” node of cell “A7” and cell “A8”, and has “brothers” including “A3”, “A9” and “A13”.

2. Preliminaries

2.1. Dataset construction

In this paper, we extend the amount and diversity of tables in two ways. (1) In addition to web tables, spreadsheet tables are also widely used (by estimation, there are around 800 million spreadsheet users), especially for professional usage in government, finance, and medical domains. Thus in this paper, we additionally build a large-scale table corpus of web-crawled spreadsheets. (2) In addition to relational tables, other diversely structured tables, such as matrix tables and entity tables (Nishida et al., 2017), are also commonly used to store high-quality tabular data. Hence in this paper, we include variously structured tables in our pre-training corpus.

Method   Document types   Table types   Total amount
TUTA     WikiTable        General       2.62 million
         WDC                            50.8 million
         Spreadsheet                    4.49 million
TAPAS    WikiTable        Relational    Not published
TaBERT   WikiTable        Relational    1.3 million
         WDC                            25.3 million
TURL     WikiTable        Relational    0.57 million

Table 1. Dataset comparison between pre-training methods.

We collect web tables from Wikipedia (WikiTable) and the WDC WebTable Corpus (Lehmberg et al., 2016), and additionally build a large-scale web-crawled spreadsheet corpus including 115 million tables. The data collection and pre-processing details can be found in Appendix A. We further unify the feature schema for web tables and spreadsheet tables as shown in Table 7 in Appendix A. After pre-processing, the total number of tables in the combined corpus is 57.9 million. To capture diverse table structures and data characteristics from our table corpus, we iterate over the three datasets in parallel during the pre-training process.

2.2. Bi-dimensional coordinate tree

As introduced in Section 1, tables usually employ top and left headers to describe other cells (Zanibbi et al., 2004; Wang, 2016). For example, vertical/horizontal tables often contain top/left headers, while matrix tables usually contain both (Nishida et al., 2017). A table is recognized as hierarchical if its header on the top or left exhibits a hierarchical tree structure of at least two layers (Lim and Ng, 1999; Chen and Cafarella, 2014). In contrast, flat tables refer to those without any hierarchical structure. Motivated by existing definitions, we propose a bi-dimensional coordinate tree to jointly describe the cell locations and hierarchies in tables.

Tree-based position  The bi-dimensional coordinate tree is defined to be a directed tree with ordered lists of children, where each node has a unique parent and an ordered finite list of children. It contains two perpendicular subtrees, namely a top tree and a left tree. In the left tree, each node’s position is defined by its path from the left root node, and the top tree works similarly. Given the positions of the corresponding tree nodes on top and left, each cell in the table is uniquely represented in bi-dimensional tree coordinates. For example, as shown in Figure 3 (a), the top and left coordinates of cell “D8” (148,270) are (2,0) and (2,1) respectively; the left coordinate of cell “A6” (“Urinary tract”) is (2) since it corresponds to the parent node of the third subtree in the left header.

Tree-based distance  For each tree, the relation between two nodes is a path: a series of steps along tree branches, with each step either going up to the parent or down to a child. Motivated by Shiv and Quirk (2019), we define the top/left tree distance between two nodes in a top/left tree to be the number of steps on the shortest path between them, while the tree distance between two cells is the sum of their left tree distance and top tree distance. Figure 3 (b) shows the distances from cell “A6” (“Urinary tract”) in the left tree. Cell “A6” (“Urinary tract”) is the parent node of cell “A8” (“Bladder”), so they have a very short left-tree distance of 1. But cell “A6” (“Urinary tract”) and cell “A10” (“Larynx”) have a relatively long distance of 3, because cell “A6” (“Urinary tract”) is the “uncle node” of cell “A10” (“Larynx”). Furthermore, cell “A6” (“Urinary tract”) and cell “C2” (“Females”) have an even longer distance of 6 (both the top and left tree distances are 3), indicating that they are far from related. In particular, distances from the table descriptions (e.g., table title, table caption, page title, text segments before/after a table) to any table cell are set to 0, since table descriptions contain global information about a table. The tree-based definition of distance enables effective spatial and hierarchical data flow via our proposed tree attention in Section 3.3.

Note that the bi-dimensional coordinate tree is quite general for both hierarchical tables and flat tables. In flat tables, the top and left tree coordinates degenerate to rectangular Cartesian coordinates, thus the tree distance between two cells in the same row or column is 2, while the tree distance between two cells in different rows and columns is 4.
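The distance definition above can be illustrated with a minimal Python sketch. It assumes that a node's coordinate is the tuple of child indices on its path from the root, so the shortest path between two nodes goes up to their lowest common ancestor and back down. The function names are hypothetical, and the coordinate (3, 0) for cell “A10” is an assumption inferred from the description of Figure 3, not a value given in the text.

```python
from typing import Sequence


def tree_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of steps on the shortest path between two nodes of one tree."""
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    # Walk up from a to the lowest common ancestor, then down to b.
    return (len(a) - common) + (len(b) - common)


def cell_distance(top_a, left_a, top_b, left_b) -> int:
    """Tree distance between two cells: top-tree plus left-tree distance."""
    return tree_distance(top_a, top_b) + tree_distance(left_a, left_b)


# Left-tree examples from the text; (3, 0) for "A10" is an assumed coordinate.
print(tree_distance((2,), (2, 1)))  # "A6" to "A8"
print(tree_distance((2,), (3, 0)))  # "A6" to "A10"

# Flat table: coordinates degenerate to (column,) and (row,).
print(cell_distance((0,), (5,), (1,), (5,)))  # same row, different column
print(cell_distance((0,), (5,), (1,), (6,)))  # different row and column
```

Under these assumptions the four calls print 1, 3, 2, and 4, matching the distances quoted for the hierarchical example and for flat tables.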

Tree extraction  Existing header detection methods (Fang et al., 2012; Dong et al., 2019a) have already achieved desirable accuracy for variously structured tables, so we adopt the method introduced by Dong et al. (2019a) for header detection. For the detected header regions, we implement a rule-based method to extract header hierarchies by consolidating effective expert-engineered heuristics. We extract the hierarchical tree based on the merged cells in the top header, indentation levels in the left header, and formulas in the data region. To be specific, in the top header we build a hierarchical relationship between a merged cell and the cells directly under it. Similarly, in the left header we build a relationship between a cell and its subsequent cells with more indents. Formulas containing “SUM” or “AVERAGE” also strongly indicate hierarchies and are incorporated in our algorithm as well. This approach has high performance and desirable explainability when processing a large unlabeled corpus. Since TUTA is a general table pre-training framework based on bi-dimensional trees, one can also employ other techniques such as (Lim and Ng, 1999; Chen and Cafarella, 2014; Paramonov et al., 2020) to extract hierarchies for TUTA. More details of tree extraction can be found in Appendix B.
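As an illustration of the indentation heuristic for the left header, the following Python sketch assigns a left-tree coordinate to each header row from its indentation level. It is a simplified stand-in, not the rule-based algorithm of Appendix B: it ignores merged cells and formulas, and the function name and input format (one integer indentation level per row) are assumptions.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, ...]


def left_coords_from_indents(indents: List[int]) -> List[Coord]:
    """Assign a left-tree coordinate to each header row from its indent level."""
    coords: List[Coord] = []
    stack: List[Tuple[int, Coord]] = []  # (indent, coordinate) of open ancestors
    child_count: Dict[Coord, int] = {(): 0}  # children attached so far per node
    for indent in indents:
        # Close every open node that is not strictly less indented than this row.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        parent = stack[-1][1] if stack else ()
        coord = parent + (child_count[parent],)
        child_count[parent] += 1
        child_count[coord] = 0
        stack.append((indent, coord))
        coords.append(coord)
    return coords


print(left_coords_from_indents([0, 1, 1, 0, 1, 2]))
# [(0,), (0, 0), (0, 1), (1,), (1, 0), (1, 0, 0)]
```

A row becomes the child of the nearest preceding row with a smaller indent, which mirrors the "subsequent cells with more indents" rule described above.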

3. TUTA Model

Our model’s architecture is based on BERT’s encoder with four key enhancements: (1) building the first dedicated vocabulary for the table understanding domain to better encode commonly used tokens in real-world tables; (2) leveraging tree-based positional embeddings to better encode cell locations and hierarchies; (3) proposing a structure-aware attention mechanism to help table cells attend to their structurally neighboring contexts efficiently; and (4) devising three pre-training objectives to enable representation learning at the token, cell, and table levels to capture table information in a progressive manner. The architecture overview is shown in Figure 4.

Figure 4. Overview of TUTA architecture. It contains (1) embedding layers to convert an input table into input embeddings, (2) N stacked transformer encoders to capture table semantics and structures, and (3) final projection layers for pre-training objectives.

3.1. Vocabulary construction

Different from long texts in NL documents, cell strings in tables often have short lengths and concise semantics, which makes the word distribution in tables very different from the word distribution in NL. For example, tables often include many measures such as “quantity” and “yards”, or statistical terms such as “average” and “difference”. Notably, tables also use many abbreviations such as “num” and “qtr” to stay compact. Hence directly using an NL vocabulary to parse cell strings is not appropriate.

In this paper, we build the first vocabulary for general tables, which includes widely used tokens in real-world tables. Based on the table corpus introduced in Section 2.1, we build our vocabulary using the same WordPiece model as BERT (Devlin et al., 2018) and get 9,754 new tokens compared with the BERT vocabulary. We list the top 50 newly added tokens in Table 2 and manually classify them into different semantic groups, which yields meaningful and intuitive results. Interestingly, there are indeed many abbreviations in the dedicated vocabulary, e.g., “pos”, “fg” and “pts” in the sports domain. Since NL-form table descriptions (e.g., titles) also need to be modeled, we merge our table vocabulary (top 996 tokens) with the NL vocabulary to achieve better generality. Based on the merged vocabulary, we perform sub-tokenization for table contexts and cell strings using the WordPiece tokenizer (Devlin et al., 2018).

Measures & units    num (number), qty (quantity), yds, dist, att, eur…
Statistical terms   avg (average), pct (percentage), tot, chg, div, rnd…
Date & time         fy (fiscal year), thu (Thursday), the, fri, qtr, yr…
Sports              pos (position), fg (field goal), pim, slg, obp…
Unsorted            pld, nd, ast, xxl, px, blk, xxs, ret, lmsc, stl, ef, wkts, fga, wm, adj, tweet, comp, mdns, ppg, bcs, sog, chr, xs, fta, mpg, xxxl, sym, url, msrp, lng
Table 2. Examples of the top 50 newly added tokens in TUTA. We rank them based on their frequencies and try to sort them into several groups based on our understanding.

3.2. Embedding layer

Table cells encode rich information. Whether such information is well extracted and leveraged can lead to remarkable differences in table understanding tasks (Dong et al., 2019b; Gol et al., 2019). Since transformer-based methods need to first linearize the input into a sequence of tokens, we perform sub-tokenization for each cell using the WordPiece tokenizer and then flatten a table into a sequence. We add a special leading token for each cell and a special leading token for the table’s textual descriptions. As shown in Figure 4, we extract additional key information from each cell, including numbers, positions and formats, and construct embeddings for each token as follows:

Tree-based positional embedding  As introduced in Section 2.2, each table cell has a pair of tree coordinates to capture hierarchical information. To unify the representation for tables of different tree depths, we extend top/left tree coordinates to the same length (in this paper, the maximum length is 4) by padding with -1. To further incorporate absolute cell positions, column and row indexes are concatenated with the top and left tree coordinates, respectively. Take cell “D8” in Figure 3 as an example: its original top coordinate (2,0) and left coordinate (2,1) are extended to (2,0,-1,-1,3) and (2,1,-1,-1,7), respectively. As shown in Figure 5, we assign each level of the top and left coordinates a sub-embedding, then concatenate them into a joint position embedding. We also compare two methods to embed this compositional position. In the first, we experiment with explicit tree embeddings as introduced by Shiv and Quirk (2019), which have good explainability and can be readily adapted to our bi-dimensional tree. In the second, we use randomly initialized implicit embeddings and train them jointly with the encoding layers, as in Devlin et al. (2018).
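The coordinate extension step can be sketched in a few lines of Python. The padding value, maximum depth, and the “D8” example come from the text; the function name is hypothetical, and the zero-based row/column indexing is an assumption read off the example (column D maps to 3, row 8 maps to 7).

```python
MAX_DEPTH = 4  # maximum coordinate length stated in the text
PAD = -1       # padding value stated in the text


def extend_coordinate(coord, index, max_depth=MAX_DEPTH):
    """Pad a tree coordinate to max_depth with -1, then append the row/column index."""
    if len(coord) > max_depth:
        raise ValueError("coordinate is deeper than the supported tree depth")
    return tuple(coord) + (PAD,) * (max_depth - len(coord)) + (index,)


# Cell "D8": top coordinate with column index 3, left coordinate with row index 7.
print(extend_coordinate((2, 0), 3))  # (2, 0, -1, -1, 3)
print(extend_coordinate((2, 1), 7))  # (2, 1, -1, -1, 7)
```

Each of the five positions in an extended coordinate would then index its own sub-embedding table before concatenation.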

Internal positional embedding  Internal position is the index of a token in the sub-tokenized cell (in this paper, the maximum internal index is 32). Each internal position is assigned with a learnable embedding.

Token and number embedding  Since the token vocabulary is a finite set, each token in the vocabulary can be directly assigned with a learnable embedding. But different from tokens, numbers constitute an infinite set, so we assign each number with a special token and extract four discrete features including number’s magnitude, precision, first digit and last digit. We assign each discrete feature with a learnable sub-embedding (1/4 of the total embedding length), then concatenate them to an aggregated number embedding.
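To make the four discrete number features concrete, here is a small Python sketch. The text names the features but does not define them, so the definitions below are assumptions made for illustration only: magnitude is the count of integer digits, precision is the count of decimal digits, and the first and last digits are taken from the number as written. The function name is hypothetical.

```python
def number_features(text: str):
    """Return (magnitude, precision, first_digit, last_digit) for a numeric string."""
    cleaned = text.replace(",", "").lstrip("+-")
    integer_part, _, fraction_part = cleaned.partition(".")
    digits = integer_part + fraction_part
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"not a plain decimal number: {text!r}")
    magnitude = len(integer_part.lstrip("0"))  # assumed: number of integer digits
    precision = len(fraction_part)             # assumed: number of decimal digits
    return magnitude, precision, int(digits[0]), int(digits[-1])


print(number_features("148,270"))  # (6, 0, 1, 0)
print(number_features("-3.50"))    # (1, 2, 3, 0)
```

Each of the four values would index its own learnable sub-embedding, and the four sub-embeddings would be concatenated into the number embedding.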

Format embedding  Formats help humans understand tables intuitively, so we add format embeddings that mark whether a cell has vertical/horizontal merging, a top/left/bottom/right border, a formula, bold font, a non-white background color, and a non-black font color. A simple MLP is used to map these cell-level formatting features to a joint formatting embedding.

Figure 5. Tree-based positional embedding (total length is 768 in our setting).

3.3. Tree attention

Based on the definitions of the bi-dimensional tree coordinate and distance (Section 2.2), we now introduce tree attention. Self-attention is considered to be more parallelizable and efficient than dedicated structure-aware models like tree-LSTMs and RNNs (Shiv and Quirk, 2019; Nguyen et al., 2020), but it is challenged by many “distractions” in large two-dimensional tables: in its most general formulation, self-attention allows every token to attend to every other token, dropping all structural information (e.g., whether two tokens appear in the same cell, appear in the same row/column, or have a hierarchical relationship). Since spatial and hierarchical information is highly important for local cells to aggregate their structurally related contexts and ignore unrelated or noisy contexts, we devise a structure-aware attention mechanism to help local cells attend to their structurally neighboring cells efficiently. Motivated by the general idea of graph attention networks (Veličković et al., 2017), we inject the tree structure into the attention mechanism by performing masked attention: we only set $M_{i,j} = 1$ for cells $j$ belonging to $SN_i$, where $SN_i$ is the structural neighborhood of cell $i$ in the bi-tree, and $M$ is a symmetric binary matrix indicating visibility between cell $i$ and cell $j$.

Based on the tree distance proposed in Section 2.2, structurally neighboring cells can be selected by a predefined distance threshold $D$. A smaller $D$ makes local cells more “focused”, but when $D$ is large enough (16 is the maximum distance in our tree definition), tree attention behaves the same as global attention. Different from existing practices such as bottom-up tree-based attention (Nguyen et al., 2020) and constituent tree attention (Wang et al., 2019) in the NL domain, our encoder with tree attention enables top-down, bottom-up, and peer-to-peer data flow in the tree structure. Since TUTA has $N$ stacked encoders with tree attention, each cell can represent information from its neighborhood with nearly arbitrary depth ($N \times D$). In experiments, we conduct ablation studies to compare different values of $D$.
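The masking idea can be sketched in plain Python: a binary visibility matrix is derived from pairwise tree distances and a distance threshold, and attention weights are then normalized over visible positions only. This is a minimal single-head sketch, not the model's implementation; the function names and the example distances are hypothetical.

```python
import math


def visibility_mask(distances, threshold):
    """Binary matrix with 1 where the tree distance is within the threshold."""
    return [[1 if d <= threshold else 0 for d in row] for row in distances]


def masked_softmax(scores, mask_row):
    """Softmax over one row of attention scores, ignoring invisible positions."""
    visible = [s for s, m in zip(scores, mask_row) if m]
    peak = max(visible)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical pairwise tree distances among three cells (symmetric, zero diagonal).
distances = [
    [0, 2, 6],
    [2, 0, 4],
    [6, 4, 0],
]
mask = visibility_mask(distances, threshold=2)
print(mask)  # [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
print(masked_softmax([1.0, 2.0, 3.0], mask[0]))  # third weight is 0.0
```

With a threshold of 16, every entry of the mask becomes 1 and the computation reduces to ordinary global attention, as noted above. Because a cell is always at distance 0 from itself, every row keeps at least one visible position.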

3.4. Pre-training objectives

Tables naturally have progressive levels: token level, cell level, and table level. To capture table information in such a progressive manner, we devise novel cell-level and table-level objectives in addition to the commonly used token-level objective.

Masked language modeling (MLM)  MLM (Devlin et al., 2018; Lample and Conneau, 2019) is widely used for NL pre-training. Motivated by Herzig et al. (2020), we train token representations by predicting masked tokens in both table cells and text segments. We further adapt the masking strategy by: (1) randomly selecting 15% of cells; (2) randomly masking one token in 70% of the selected cells, and masking all tokens in the other 30% of the selected cells. Strategy (2) is mainly used to learn contextual information from neighboring cells. MLM is modeled as a multi-class classification problem for each masked token with the cross-entropy loss $\mathcal{L}_{mlm}$.
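The two-step masking strategy can be sketched as follows. The rates (15%, 70%, 30%) come from the text; everything else is an assumption for illustration, including the function name, the representation of a table as a list of token lists, and the choice to always select at least one cell.

```python
import random


def choose_mlm_masks(cells, rng, cell_rate=0.15, single_token_rate=0.70):
    """Return {cell_index: [token positions to mask]} for one table."""
    non_empty = [i for i, tokens in enumerate(cells) if tokens]
    if not non_empty:
        return {}
    # Assumption: select at least one cell even in very small tables.
    n_selected = max(1, round(cell_rate * len(non_empty)))
    masks = {}
    for i in rng.sample(non_empty, n_selected):
        if rng.random() < single_token_rate:
            masks[i] = [rng.randrange(len(cells[i]))]  # mask a single token
        else:
            masks[i] = list(range(len(cells[i])))      # mask the whole cell
    return masks


# Hypothetical table of 20 cells with three tokens each.
cells = [["tok_a", "tok_b", "tok_c"] for _ in range(20)]
masks = choose_mlm_masks(cells, random.Random(0))
print(len(masks))  # 3 cells selected (15% of 20)
```

Masking a whole cell forces the model to rely on neighboring cells rather than on the remaining tokens of the same cell, which is the motivation given for strategy (2).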

Cell-level Cloze (CLC)  Cells are the basic units in tables to record text, position and format. Cell-level representations are therefore important for various tasks such as cell type classification, entity linking and table question answering. Existing methods (Herzig et al., 2020; Yin et al., 2020; Deng et al., 2020) directly take the averaged token representation to represent cells, but lack a systematic pre-training objective for cells as a whole. In this paper, we devise a novel cell-level task by randomly masking some selected cells in table headers (with higher probability in our setting) and data regions (with lower probability) and encouraging the model to retrieve the correct cell strings based on their locations. This task can be regarded as a cell-level one-to-one mapping problem from cell locations to cell strings, as shown in Figure 6. Based on the embeddings of the leading tokens of the selected cell locations and cell strings, we use a dot-product attention module (Vaswani et al., 2017) to compute the mapping probabilities. Then we model CLC as a multi-class classification task for each cell location with the cross-entropy loss $\mathcal{L}_{clc}$. Note that in our attention mechanism, the locations of selected cells can still “see” their structural contexts in tables, while the strings of selected cells can only “see” their internal tokens via attention masks.

Around 20% of the cells are randomly selected for each table. To leverage more training on structural information, we apply two strategies during cell selection. In one way, we randomly choose cells on the same sub-tree based on our bi-tree structure, while in another, cells are randomly taken from different top header rows and left header columns.

Table context retrieval (TCR)  Table-level representations are important in two respects. On the one hand, table titles and descriptions help one better understand table cells. On the other hand, all cells in a table together constitute the overall meaning of the table. To both ends, we propose a novel objective to capture table-level representations using the leading token [CLS]. We split table titles and descriptions into text segments, then couple each table with positive segments from its own context and negative ones from other tables’ contexts. Each table has at most three positive and three negative segments, from which the [CLS] needs to retrieve the ones that belong to it. We assign one randomly selected positive segment to follow the leading token [CLS], and assign the other positive and negative segments leading [SEP]s. Then, TCR can be viewed as a table-level one-to-many mapping problem from [CLS] to the [SEP]s of text segments. It is modeled as a binary classification problem for each pair of [CLS] and [SEP] with the cross-entropy loss $\mathcal{L}_{tcr}$. Note that with our attention mechanism, [CLS] learns table-level representations and can “see” all table cells, while the [SEP] of each text segment learns text segment representations and only “sees” its internal tokens.

The final objective is the equally weighted sum of the MLM, CLC, and TCR losses, $\mathcal{L}_{mlm}$, $\mathcal{L}_{clc}$, and $\mathcal{L}_{tcr}$.

Figure 6. An intuitive example for pre-training objectives.

3.5. Pre-training details

Data processing configuration  In our setting, the top and left trees have a maximum depth of 4. As nodes sit deeper in the tree, we allow an increasing maximum degree of 32, 32, 64, and 256 for the 1st, 2nd, 3rd, and 4th tree levels, respectively. The largest supported degree, 256, matches the maximum number of rows and columns, so that large relational tables can be accommodated. When encountering large tables in the downstream task, we can split the table into several smaller ones (sharing the same top or left header) based on the detected table headers. The table corpus for pre-training is a mixture of spreadsheets and web tables. Since different datasets have different structure distributions and data characteristics, during pre-training we feed table samples from these datasets (WikiTable, WDC and Spreadsheet) to our model in parallel, so that it learns from diverse tables simultaneously. Since the spreadsheet and WikiTable corpora are not as large as WDC, they are cycled several times during pre-training.

Tables are serialized into token sequences row by row: the cell strings of each row are tokenized and concatenated with joining tokens. Because cells in the data region often carry similar numerical information but limited semantics, we randomly sample them out during pre-training: we adopt heuristics to categorize data-region cells into text-dominated and value-dominated types, sampling out 50% of the former and 90% of the latter. We limit each cell to 8 tokens and allow 64 tokens for context pieces.

Model configuration  TUTA is an $N$-layer Transformer encoder with hidden size $H$, in which each layer performs self-attention with $A$ heads. We align hyperparameters with BERT ($N$ = 12, $H$ = 768, $A$ = 12) and initialize with the token embeddings and encoder of BERT before adapting to the tabular domain. TUTA first pre-trains on table sequences of at most 256 tokens for 1M steps with a batch size of 12; it then extends the supported sequence length to 512 and continues training for another 1M steps with a batch size of 4 (going through 16M tables in total). We implement the above procedures with PyTorch distributed training. We train TUTA and its variants on 64 Tesla V100 GPUs, and each TUTA variant takes nine days on four Tesla V100 GPUs.

Figure 7. A real example of cell type classification from SAUS. The figure on the left shows the well-annotated cell types, with different colors representing different cell types. The figure on the right shows the relational data contained in this table. Different cell types play different roles in the relational data extraction process.

4. Experiments

Understanding the semantic structures of tables is the initial and critical step for plenty of tasks on tables. One critical task towards understanding table semantic structures is to identify the structural type of table cells. Cell type classification (CTC) is a widely studied task in table structure understanding domain (Dong et al., 2019a; Gol et al., 2019; Koci et al., 2019; Gonsior et al., 2020) with several well-annotated datasets. Therefore, we use cell type classification to validate the effectiveness of TUTA on understanding various structured tables.

Datasets  Existing CTC datasets include WebSheet (Dong et al., 2019a), deexcelerator (DeEx) (Koci et al., 2019), SAUS (Gol et al., 2019), and CIUS (Gol et al., 2019). These datasets are collected from different domains (financial, business, crime, agricultural, and health-care) and include tables with various structures. Table 3 shows the statistics on table size and structure for each annotated dataset. Note that these datasets define cell types differently. DeEx, SAUS, and CIUS categorize cells into general types: metadata (MD), notes (N), data (D), top attribute (TA), left attribute (LA), and derived (B). To perform automatic table transformation, WebSheet further defines three fine-grained semantic cell types in table headers, namely index, index name, and value name (Dong et al., 2019a). Figure 7 gives an intuitive example of these definitions: cells of different types play different roles in the relational data extraction process.

Number of labeled tables 3,503 221 284 248
Number of labeled cells 1075k 192k 711k 216k
Avg. number of rows 33.2 52.5 220.2 68.4
Avg. number of columns 9.9 17.7 12.7 12.7
Avg. number of cells 307 871 2,506 869
Prop. of hierarchical tables 53.7% 93.7% 43.7% 72.1%
Prop. of hierarchical top 35.8% 68.8% 28.9% 46.8%
Prop. of hierarchical left 29.3% 76.0% 29.2% 30.2%
Table 3. Statistics of CTC datasets.

Baselines  To verify the effectiveness of TUTA, we compare it with four representative baselines. CNN (Dong et al., 2019a) and RNN (Gol et al., 2019) are two state-of-the-art methods for CTC. CNN is a CNN-based method for both cell classification and table range detection that uses pre-trained BERT embeddings. RNN is a bidirectional LSTM-based method for cell classification that uses pre-trained cell and format embeddings. TAPAS (Herzig et al., 2020) and TaBERT (Yin et al., 2020) are two recently proposed transformer-based methods for table-text joint pre-training. To ensure an unbiased comparison, we download the pre-trained TAPAS and TaBERT models and fine-tune them with the same CTC head and loss function as TUTA. TURL (Deng et al., 2020) also pre-trains on relational tables but has no publicly available datasets or models, so we do not compare with it in this paper.

4.1. Implementation details

Fine-tune TUTA  For the CTC task, tables are tokenized, embedded, and encoded in the same way as introduced in Section 3. Treating this task as a cell-level multi-class classification problem, we design a fine-tuning head as follows. On top of the encoder output, we introduce two linear transformation layers, each with its own weights and bias, whose dimensions are set by the encoder hidden size and the number of cell types. To predict the type of each cell, both the leading token and the aggregation of the other tokens are treated as potential cell-level representations. Given one of these vector representations, we calculate the prediction distribution over cell types and then the cross-entropy loss against the label. Since the two are fine-tuned simultaneously in downstream tasks, they have comparable performance. Unless otherwise noted, we always report the accuracy of predictors.
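The shape of such a head can be illustrated in plain Python. This is a sketch under stated assumptions rather than the paper's exact formulation: the ReLU between the two linear layers and all function names are assumptions; only the use of two linear layers, a prediction distribution over cell types, and a cross-entropy loss comes from the text.

```python
import math

def linear(x, weights, bias):
    """y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def softmax(logits):
    """Normalize logits into a probability distribution."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ctc_head(x, w1, b1, w2, b2):
    """Two linear layers followed by softmax.

    The ReLU between the layers is an assumption, not taken from the paper.
    """
    hidden = [max(0.0, h) for h in linear(x, w1, b1)]
    return softmax(linear(hidden, w2, b2))

def cross_entropy(probs, label):
    """Negative log-likelihood of the gold cell type."""
    return -math.log(probs[label])

# Toy example: a 2-dimensional cell representation and 3 cell types.
x = [1.0, 2.0]
w1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w2, b2 = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]], [0.0, 0.0, 0.0]
probs = ctc_head(x, w1, b1, w2, b2)
print(probs, cross_entropy(probs, 1))
```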

Experiment details  Following (Dong et al., 2019a) on WebSheet and (Gol et al., 2019) on DeEx, SAUS, and CIUS, we use the same train/validation/test sets for TUTA and all baseline methods. We split the datasets into train, validation, and test sets table-wise rather than cell-wise. Since the cells of a table are always kept together, no cell in a test table has been used for training. For WebSheet, we follow (Dong et al., 2019a) and tune our model for 4 epochs. For DeEx, SAUS, and CIUS, which contain fewer tables, we separately tune TUTA on five randomly split folds for 100 epochs each and report the averaged macro-F1. All downstream experiments use a batch size of 4 and a learning rate of 8e-6.
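A table-wise split can be sketched as below: folds are formed over table identifiers, and cells simply follow their table, so no cell of a test table can leak into training. The function names, the seed, and the round-robin fold assignment are illustrative assumptions.

```python
import random

def table_wise_folds(table_ids, k=5, seed=0):
    """Shuffle table ids and deal them into k folds; cells follow their table."""
    ids = sorted(set(table_ids))
    random.Random(seed).shuffle(ids)
    return [ids[i::k] for i in range(k)]

def split_cells(cells, test_tables):
    """cells: iterable of (table_id, cell) pairs. Returns (train, test) lists."""
    test_tables = set(test_tables)
    train = [c for c in cells if c[0] not in test_tables]
    test = [c for c in cells if c[0] in test_tables]
    return train, test

# Hypothetical data: 10 tables with 3 cells each.
cells = [(t, f"cell-{t}-{i}") for t in range(10) for i in range(3)]
folds = table_wise_folds([t for t, _ in cells], k=5)
train, test = split_cells(cells, folds[0])
print(len(train), len(test))
```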

4.2. Experiment results

Macro-F1 is a commonly used metric for evaluating overall accuracy across cell types. As shown in Table 4, TUTA achieves an averaged macro-F1 of 88.1% on the four datasets, outperforming all baselines by a large margin (3.6%+). We observe that RNN also outperforms TaBERT and TAPAS on average, probably because RNN has a greater capability to capture spatial information. TAPAS captures spatial information by encoding only the row and column indexes, which is insufficient for hierarchical tables with complicated headers. TaBERT, without joint coordinates from the two dimensions, employs row-wise attention followed by column-wise attention. Due to this indirect position encoding, it performs less well on CTC.
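For reference, macro-F1 is the unweighted mean of per-class F1 scores, so rare cell types count as much as frequent ones. A minimal sketch follows; averaging over the classes that appear in either the gold or predicted labels is an assumption of this example, and the labels are hypothetical.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over classes present in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

gold = ["D", "D", "D", "TA"]
pred = ["D", "D", "TA", "TA"]
print(round(macro_f1(gold, pred), 4))  # 0.7333
```

In this toy case the plain accuracy is 0.75, while macro-F1 is lower because the single mistake hurts the small TA class more.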

We also list detailed F1 scores for different cell types in Table 5. TUTA achieves the highest score for every cell type in WebSheet. Note that WebSheet, to support relational data extraction, further defines fine-grained cell types inside table headers. This is a challenging task, since table headers often contain complicated hierarchies. The result therefore demonstrates the strength of TUTA in recognizing fine-grained cell types in table headers.

4.3. Ablation studies

To validate the effectiveness of tree-based attention (TA), position embeddings (PE), and the three pre-training objectives, we evaluate 8 variants of TUTA.

We start with TUTA-base, which does not explicitly use positions. To test the effect of introducing TA and varying its distance, we test one variant without TA and three variants with visible distances of 2, 4, and 8.

WebSheet DeEx SAUS CIUS Average
CNN 78.4% 60.8% 89.1% 95.1% 80.9%
RNN 79.6% 70.5% 89.8% 97.2% 84.3%
TaBERT-large 79.3% 50.0% 78.9% 92.9% 75.3%
TAPAS-large 82.3% 68.6% 83.9% 94.1% 82.2%
TUTA 86.6% 76.6% 90.2% 99.0% 88.1%

Table 4. Comparison results of macro-F1 scores.
Index name Index Value name
CNN 69.9% 86.9% 78.4%
RNN 75.0% 86.6% 77.1%
TaBERT-large 76.8% 85.2% 75.8%
TAPAS-large 74.3% 88.1% 84.6%
TUTA 83.4% 91.6% 84.8%

Table 5. Results of F1 scores on WebSheet by cell types.
  • TUTA-base, w/o TA: cells are globally visible.

  • TUTA-base, TA-8: cells are visible in a distance of 8.

  • TUTA-base, TA-4: cells are visible in a distance of 4.

  • TUTA-base, TA-2: cells are visible in a distance of 2.
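The four variants above differ only in which cell pairs may attend to each other. A minimal sketch of the corresponding visibility mask is given below. It assumes the pairwise tree distances have already been computed as defined earlier in the paper, and treating "a distance of 2" as an inclusive threshold is an assumption of this example.

```python
def visibility_mask(distances, max_distance=None):
    """distances[i][j]: precomputed tree distance between cells i and j.

    Returns a boolean matrix; True means cell i may attend to cell j.
    max_distance=None corresponds to the 'w/o TA' variant (globally visible).
    """
    n = len(distances)
    if max_distance is None:
        return [[True] * n for _ in range(n)]
    return [[distances[i][j] <= max_distance for j in range(n)] for i in range(n)]

# Hypothetical distances between three cells.
d = [[0, 1, 5],
     [1, 0, 3],
     [5, 3, 0]]
print(visibility_mask(d, max_distance=2))
```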

Based on this comparison, we fix the TA distance at 2 and augment TUTA-base with positional information, yielding two TUTA variants that embed positions implicitly or explicitly.

  • TUTA-implicit: implicitly embed positions using trainable weights.

  • TUTA-explicit: compute positions explicitly as in (Vaswani et al., 2017).
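For the explicit variant, the fixed sinusoidal encoding of Vaswani et al. (2017) can be sketched for a single scalar position as below. How TUTA-explicit maps tree coordinates onto such positions is not specified in this section, so this shows only the standard formula, with an illustrative function name.

```python
import math

def sinusoidal_encoding(position, dim):
    """Fixed encoding from Vaswani et al. (2017): sin on even indices, cos on odd."""
    enc = []
    for i in range(dim):
        # Indices 2k and 2k+1 share the wavelength 10000^(2k/dim).
        angle = position / (10000 ** (2 * (i // 2) / dim))
        enc.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return enc

print(sinusoidal_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
```

The implicit variant would instead look positions up in a trainable embedding table, with no fixed formula.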

Given the best variant, TUTA-implicit, we examine the contribution of each of the three objectives by removing them one at a time.

  • TUTA, w/o MLM

  • TUTA, w/o CLC

  • TUTA, w/o TCR

TUTA Variants WebSheet DeEx SAUS CIUS Average
TUTA-base, w/o TA 76.3% 70.5% 80.0% 92.8% 79.9%
TUTA-base, TA-8 80.0% 71.1% 80.5% 93.2% 81.2%
TUTA-base, TA-4 81.5% 73.8% 80.9% 95.6% 83.0%
TUTA-base, TA-2 84.5% 75.5% 84.6% 96.7% 85.3%
TUTA-explicit 86.5% 76.0% 89.7% 98.8% 87.8%
TUTA-implicit 86.6% 76.6% 90.2% 99.0% 88.1%
TUTA, w/o MLM 85.4% 76.6% 89.2% 99.0% 87.6%
TUTA, w/o CLC 83.0% 76.4% 88.7% 98.9% 86.8%
TUTA, w/o TCR 85.8% 76.6% 88.2% 99.0% 87.4%
Table 6. Experiment results of ablation studies.

Experiment results of ablation studies  Table 6 shows the results of the ablation studies. Smaller attention distances clearly help our method achieve better results. TUTA-base without TA achieves only 79.9% averaged macro-F1, lower than the 82.2% of TAPAS (Table 4). But when the attention distance decreases to 2, TUTA-base, TA-2 improves significantly (by about 5 points) and also outperforms TAPAS. Tree position embeddings further improve accuracy: augmented with implicit tree embeddings, TUTA-implicit achieves an overall macro-F1 of 88.1%. Although TUTA-implicit shows a higher F1 than TUTA-explicit, the difference is relatively small. We thus conclude that tree positional embeddings can significantly improve the accuracy of TUTA by encoding spatial and hierarchical information, but the form of the positional embeddings is not critical. Furthermore, we examine the effectiveness of the three progressive objectives by removing each of them in turn and keeping the other two for pre-training. Removing the CLC objective has the biggest impact, with an accuracy drop of 1.3%, while the overall effect of removing MLM or TCR is not significant. Note that for SAUS, removing TCR also causes an obvious accuracy drop of 2%. On inspection, we find that almost all tables in SAUS have informative titles and descriptions, which can serve as strong hints for metadata classification and global table semantics.
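The differences quoted in this paragraph can be reproduced directly from the Average column of Table 6. The sketch below only copies those published averages and subtracts them; the dictionary and function names are illustrative.

```python
# Average macro-F1 per variant, copied from the Average column of Table 6.
TABLE6_AVG = {
    "TUTA-base, w/o TA": 79.9,
    "TUTA-base, TA-8": 81.2,
    "TUTA-base, TA-4": 83.0,
    "TUTA-base, TA-2": 85.3,
    "TUTA-explicit": 87.8,
    "TUTA-implicit": 88.1,
    "TUTA, w/o MLM": 87.6,
    "TUTA, w/o CLC": 86.8,
    "TUTA, w/o TCR": 87.4,
}

def drop(full, ablated):
    """Points of average macro-F1 lost when going from `full` to `ablated`."""
    return round(TABLE6_AVG[full] - TABLE6_AVG[ablated], 1)

for name in ("TUTA, w/o MLM", "TUTA, w/o CLC", "TUTA, w/o TCR"):
    print(name, drop("TUTA-implicit", name))

# Gain from restricting attention to distance 2.
print(drop("TUTA-base, TA-2", "TUTA-base, w/o TA"))
```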

4.4. Case studies

To facilitate an intuitive understanding of the experiment results, we show two typical cases from the test set.

Tree Attention with different distances  Take the ‘Fault Current’ cells in Figure 8 (columns C:D and E:F) as an example. Without tree attention, TUTA-base-w/o-TA tends to learn surrounding semantics with little discrimination and misclassifies ‘Fault Current’ as ‘Value Name’, as shown in Figure 8 (c). When we narrow the attention distance of each encoder from global to 2, as shown in Figure 8 (b), the model successfully identifies ‘Fault Current’. We conjecture that it captures greater similarity between ‘Fault Current’ and its hierarchical parents (‘Instantaneous Symmetrical’ and ‘Instantaneous Asymmetrical’) than between it and its hierarchical children, ‘P.U.’ and ‘Amps’, below.

Tree-based positional embeddings  Smaller attention distances help local cells learn from spatially and structurally related neighbours. To further help cells precisely identify spatial and hierarchical relationships and avoid collective faults, we augment our attention mechanism with tree-based positional embeddings. Comparing TUTA and its no-position version in Figure 8 (a) and (b), respectively, we find that B10 and B11 are misclassified as ‘Index’ (the ground truth is ‘Index Name’) by the no-position variant in (b), even though their neighbours on the right (C10, C12, D10, and D12) largely fall into the ‘Index’ type and are semantically described by B10 and B11. Beyond tree attention, tree-based positional embeddings help cells find precise spatial and hierarchical relationships.

Interactive functions between headers and values  In the case shown in Figure 9, TUTA categorizes D28 and E28 as the ‘Derived’ type, that is, an aggregation of other cells of the ‘Data’ type. This result indicates an interaction between header and value regions: a header often targets, and poses assumptions about, certain data regions given the structural and semantic information. For example, the top header D26:D27 headlines the data cell D28. Its merged formatting, as well as its hierarchical level (a direct child of the top-tree root), readily assigns a high aggregation priority to D28, which is also located in the first data row. However, only by calculating possible aggregation results over the data can we know that D28 is actually not ‘Derived’ from any other cells. Though this reaches beyond the scope of table semantic understanding, we think it reasonable to augment table models with numerical computation in the future.

Although D28 is misclassified by TUTA, E28 is classified correctly: the ground-truth label of E28 is an error made by human labelers. The ground truth for E28 should be ‘Derived’ rather than ‘Data’, since E28 is indeed the aggregation of F28:G28 even though no formula indicates it. We conjecture that TUTA classifies E28 correctly based on the keyword ‘Total’ in the top cell and the large magnitude of E28.
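The kind of numerical check alluded to here, deciding whether a cell equals an aggregation of other cells when no formula is present, can be sketched for the simple sum case. The cell values below are hypothetical (the actual values in Figure 9 are not reproduced in the text), and the function name and tolerance are assumptions.

```python
def is_sum_of_range(value, others, rel_tol=1e-6):
    """True if `value` equals the sum of `others` up to a small relative tolerance."""
    total = sum(others)
    return abs(value - total) <= rel_tol * max(1.0, abs(value), abs(total))

# Hypothetical row: a 'Total' cell followed by two component cells.
print(is_sum_of_range(30.0, [10.0, 20.0]))
print(is_sum_of_range(30.0, [10.0, 25.0]))
```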

Figure 8. Case no. 1 shows that a smaller attention distance and position embeddings improve accuracy. Green icons mark correct predictions, while red ones mark mistakes.
Figure 9. Case no. 2 shows an interactive pattern between header and data cells.

5. Related Work

Table pre-training methods Although TUTA is the first effort to pre-train table representations on tables with various structures, there has been a range of prior work on relational table pre-training. Table-BERT linearized a table as a sentence so that it can be directly processed by the pre-trained BERT model (Chen et al., 2019c). Gol et al. (2019) adopted continuous bag-of-words and skip-gram to learn embeddings of table cells over 8 neighboring cells in local contexts and used the resulting embeddings as features in cell type classification, but did not leverage structural information. TAPAS and TaBERT target question answering over relational tables via joint pre-training of tables and their text (Herzig et al., 2020; Yin et al., 2020). Another work, TURL, attempted to pre-train embeddings on relational tables to enhance table knowledge matching and table augmentation (Deng et al., 2020). But in TURL, each cell can only aggregate information from its own row and column due to the masked self-attention.

Neural networks dedicated for tables Since two-dimensional spatial information is crucial for table understanding tasks, many neural architectures have been proposed to capture it. CNNs (Dong et al., 2019a; Chen et al., 2019a; Dong et al., 2019b; Paliwal et al., 2019; Yang et al., 2017) are widely adopted to capture spatial information for spreadsheet tables, web tables, PDF tables, and scanned tables, and bidirectional RNNs and LSTMs are frequently adopted for web tables to capture the order of rows and columns (Nishida et al., 2017; Khan et al., 2019; Gol et al., 2019; Fetahu et al., 2019). Later work proposed a hybrid neural network combining bidirectional RNNs and CNNs for column type prediction (Chen et al., 2019b). However, our proposed method is the first transformer-based method for table semantic structure understanding.

Structure-aware neural networks in NL domain In the NL domain, a sentence can be represented via a dependency tree or a constituency tree. For this reason, a variant of LSTMs named Tree-LSTM (Tai et al., 2015; Chen et al., 2016) has been proposed to work on tree topology. Since dedicated models like the Tree-LSTM and RNNs are not as efficient and parallelizable as attention-based methods, Nguyen et al. (2020) devised tree-structured attention with bottom-up information accumulation and outperformed Tree-LSTMs on three text classification tasks. Wang et al. (2019) leveraged a learning-based constituent tree prior to guide the self-attention process. Tree-based positional encodings have also been proposed to help transformers better exploit tree-structured information (Shiv and Quirk, 2019). In addition to tree-based methods, graph-based methods have been widely adopted recently. For example, Huang and Carley (2019) used GATs to model the dependency graph of an NL sentence for sentiment classification, and Zhou et al. (2019) proposed a graph-based evidence aggregating and reasoning model to capture relational and logical information among evidence from plain text.

6. Conclusion and Discussion

In this paper, we propose a novel structure-aware pre-training framework, TUTA, for understanding tables with various semantic structures. TUTA is the first transformer-based method for table semantic structure understanding, and it is enhanced with two core mechanisms to capture spatial and hierarchical information in tables: tree attention and tree positional embeddings. Moreover, we devise three pre-training objectives to enable representation learning at the token, cell, and table levels. TUTA is pre-trained on a large volume of unlabeled tables in an unsupervised manner and then fine-tuned on four well-annotated datasets for table semantic structure understanding. TUTA achieves state-of-the-art results on all public datasets in the task of cell type classification and improves over all baselines by a large margin. Although we only validate TUTA through the task of table structure understanding, we believe TUTA is a general pre-training framework that can be applied to other table understanding tasks with minor modifications. In the future, we plan to demonstrate the effectiveness of TUTA on more tasks.

References


  • C. S. Bhagavatula, T. Noraset, and D. Downey (2015) TabEL: entity linking in web tables. In International Semantic Web Conference, pp. 425–441. Cited by: §1.
  • J. Chen, E. Jiménez-Ruiz, I. Horrocks, and C. Sutton (2019a) Colnet: embedding the semantics of web tables for column type prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 29–36. Cited by: §1, §5.
  • J. Chen, E. Jiménez-Ruiz, I. Horrocks, and C. Sutton (2019b) Learning semantic annotations for tabular data. arXiv preprint arXiv:1906.00781. Cited by: §5.
  • Q. Chen, X. Zhu, Z. Ling, S. Wei, H. Jiang, and D. Inkpen (2016) Enhanced lstm for natural language inference. arXiv preprint arXiv:1609.06038. Cited by: §5.
  • W. Chen, H. Wang, J. Chen, Y. Zhang, H. Wang, S. Li, X. Zhou, and W. Y. Wang (2019c) TabFact: a large-scale dataset for table-based fact verification. arXiv preprint arXiv:1909.02164. Cited by: §5.
  • Z. Chen and M. Cafarella (2013) Automatic web spreadsheet data extraction. In Proceedings of the 3rd International Workshop on Semantic Search over the Web, pp. 1–8. Cited by: §1, §1, §1.
  • Z. Chen and M. Cafarella (2014) Integrating spreadsheet data via accurate and low-effort extraction. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1126–1135. Cited by: §1, §1, §1, §1, §2.2, §2.2.
  • E. Crestan and P. Pantel (2011) Web-scale table census and classification. In Proceedings of the fourth ACM international conference on Web search and data mining, pp. 545–554. Cited by: §1.
  • X. Deng, H. Sun, A. Lees, Y. Wu, and C. Yu (2020) TURL: table understanding through representation learning. arXiv preprint arXiv:2006.14806. Cited by: §1, §1, §3.4, §4, §5.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1, §3.1, §3.2, §3.4.
  • H. Dong, S. Liu, Z. Fu, S. Han, and D. Zhang (2019a) Semantic structure extraction for spreadsheet tables with a multi-task learning architecture. In Workshop on Document Intelligence at NeurIPS 2019, Cited by: Appendix B, §1, §2.2, §4.1, §4, §4, §4, §5.
  • H. Dong, S. Liu, S. Han, Z. Fu, and D. Zhang (2019b) Tablesense: spreadsheet table detection with convolutional neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 69–76. Cited by: Appendix A, §3.2, §5.
  • W. Dou, S. Han, L. Xu, D. Zhang, and J. Wei (2018) Expandable group identification in spreadsheets. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, pp. 498–508. Cited by: §1.
  • J. Fang, P. Mitra, Z. Tang, and C. L. Giles (2012) Table header detection and classification. In Twenty-Sixth AAAI Conference on Artificial Intelligence, Cited by: §2.2.
  • B. Fetahu, A. Anand, and M. Koutraki (2019) TableNet: an approach for determining fine-grained relations for wikipedia tables. In The World Wide Web Conference, pp. 2736–2742. Cited by: §5.
  • M. G. Gol, J. Pujara, and P. Szekely (2019) Tabular cell classification using pre-trained cell embeddings. In 2019 IEEE International Conference on Data Mining (ICDM), pp. 230–239. Cited by: §1, §3.2, §4.1, §4, §4, §4, §5, §5.
  • J. Gonsior, J. Rehak, M. Thiele, E. Koci, M. Günther, and W. Lehner (2020) Active learning for spreadsheet cell classification. In EDBT/ICDT Workshops, Cited by: §4.
  • T. Guo, D. Shen, T. Nie, and Y. Kou (2020) Web table column type detection using deep learning and probability graph model. In International Conference on Web Information Systems and Applications, pp. 401–414. Cited by: §1.
  • J. Herzig, P. K. Nowak, T. Müller, F. Piccinno, and J. M. Eisenschlos (2020) TAPAS: weakly supervised table parsing via pre-training. arXiv preprint arXiv:2004.02349. Cited by: §1, §3.4, §3.4, §4, §5.
  • B. Huang and K. M. Carley (2019) Syntax-aware aspect level sentiment classification with graph attention networks. arXiv preprint arXiv:1909.02606. Cited by: §5.
  • S. A. Khan, S. M. D. Khalid, M. A. Shahzad, and F. Shafait (2019) Table structure extraction with bi-directional gated recurrent unit networks. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 1366–1371. Cited by: §5.
  • E. Koci, M. Thiele, J. Rehak, O. Romero, and W. Lehner (2019) DECO: a dataset of annotated spreadsheets for layout and table recognition. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 1280–1285. Cited by: §1, §4, §4.
  • J. Krishnamurthy, P. Dasigi, and M. Gardner (2017) Neural semantic parsing with type constraints for semi-structured tables. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 1516–1526. Cited by: §1.
  • G. Lample and A. Conneau (2019) Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291. Cited by: §3.4.
  • O. Lehmberg, D. Ritze, R. Meusel, and C. Bizer (2016) A large public corpus of web tables containing time and context metadata. In Proceedings of the 25th International Conference Companion on World Wide Web, pp. 75–76. Cited by: Appendix A, §2.1.
  • S. Lim and Y. Ng (1999) An automated approach for retrieving hierarchical data from html tables. In Proceedings of the eighth international conference on Information and knowledge management, pp. 466–474. Cited by: §1, §2.2, §2.2.
  • X. Nguyen, S. Joty, S. C. Hoi, and R. Socher (2020) Tree-structured attention with hierarchical accumulation. arXiv preprint arXiv:2002.08046. Cited by: §1, §3.3, §3.3, §5.
  • K. Nishida, K. Sadamitsu, R. Higashinaka, and Y. Matsuo (2017) Understanding the semantic structures of tables with a hybrid deep neural network architecture. In Thirty-First AAAI Conference on Artificial Intelligence, Cited by: §1, §1, §2.1, §2.2, §5.
  • S. S. Paliwal, D. Vishwanath, R. Rahul, M. Sharma, and L. Vig (2019) TableNet: deep learning model for end-to-end table detection and tabular data extraction from scanned document images. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 128–133. Cited by: §5.
  • V. Paramonov, A. Shigarov, and V. Vetrova (2020) Table header correction algorithm based on heuristics for improving spreadsheet data extraction. In International Conference on Information and Software Technologies, pp. 147–158. Cited by: §2.2.
  • P. Pasupat and P. Liang (2015) Compositional semantic parsing on semi-structured tables. arXiv preprint arXiv:1508.00305. Cited by: §1.
  • A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. Cited by: §1.
  • D. Ritze and C. Bizer (2017) Matching web tables to dbpedia-a feature utility study. context 42 (41), pp. 19–31. Cited by: §1.
  • V. Shiv and C. Quirk (2019) Novel positional encodings to enable tree-based transformers. In Advances in Neural Information Processing Systems, pp. 12081–12091. Cited by: §1, §2.2, §3.2, §3.3, §5.
  • H. Sun, H. Ma, X. He, W. Yih, Y. Su, and X. Yan (2016) Table cell search for question answering. In Proceedings of the 25th International Conference on World Wide Web, pp. 771–782. Cited by: §1.
  • K. S. Tai, R. Socher, and C. D. Manning (2015) Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075. Cited by: §5.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §1, §3.4, 2nd item.
  • P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2017) Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: §1, §3.3.
  • X. Wang (2016) Tabular abstraction, editing, and formatting. Cited by: §1, §2.2.
  • Y. Wang, H. Lee, and Y. Chen (2019) Tree transformer: integrating tree structures into self-attention. arXiv preprint arXiv:1909.06639. Cited by: §1, §3.3, §5.
  • X. Yang, E. Yumer, P. Asente, M. Kraley, D. Kifer, and C. Lee Giles (2017) Learning to extract semantic structure from documents using multimodal fully convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5315–5324. Cited by: §5.
  • P. Yin, G. Neubig, W. Yih, and S. Riedel (2020) TaBERT: pretraining for joint understanding of textual and tabular data. arXiv preprint arXiv:2005.08314. Cited by: §1, §3.4, §4, §5.
  • R. Zanibbi, D. Blostein, and J. R. Cordy (2004) A survey of table recognition. Document Analysis and Recognition 7 (1), pp. 1–16. Cited by: §1, §2.2.
  • C. Zhao and Y. He (2019) Auto-em: end-to-end fuzzy entity-matching using pre-trained deep models and transfer learning. In The World Wide Web Conference, pp. 2413–2424. Cited by: §1.
  • J. Zhou, X. Han, C. Yang, Z. Liu, L. Wang, C. Li, and M. Sun (2019) GEAR: graph-based evidence aggregating and reasoning for fact verification. arXiv preprint arXiv:1908.01843. Cited by: §5.
  • M. Zhou, W. Tao, J. Pengxin, H. Shi, and Z. Dongmei (2020) Table2Analysis: modeling and recommendation of common analysis patterns for multi-dimensional data. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 320–328. Cited by: §1.

Appendix A Dataset construction

Two kinds of table corpora are used for pre-training:

Web tables We collected 2.62 million web tables from WikiTable and 50.82 million web tables from the WDC WebTable Corpus [Lehmberg et al., 2016]. We also kept their titles, captions, and surrounding NL contexts.

Spreadsheet tables We crawled about 13.5 million public spreadsheet files (.xls and .xlsx) from more than 1.75 million web sites. We then used TableSense [Dong et al., 2019b] to detect tables in the sheets of each file. In total, we obtained about 115 million tables.

Pre-processing As the spreadsheet files are crawled from various web sites, the tables detected from them are very noisy. We cleaned the data and detected its language to build a clean table corpus for pre-training. First, we filtered out tables of extreme size (number of rows or columns < 4, number of rows > 512, or number of columns > 128), tables with very deep hierarchical headers (number of top header rows or left header columns > 5), and tables without any headers. After applying these rules, 52.34% of the tables in the original dataset remained. Second, we de-duplicated the filtered data based on table content; after removing duplicates, 23.49% of the tables were left. Third, we used Microsoft Azure Text Analytics to detect the language of the spreadsheet tables. Among all filtered and de-duplicated tables, about 31.76% are English, and these were used for pre-training. In the end, we obtained 4.49 million spreadsheet tables for TUTA.
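The filtering rules and the resulting corpus size can be sketched as follows. The function and argument names are illustrative. The funnel calculation reads the three percentages as successive retention rates, which is an interpretation: under it, the product matches the reported 4.49 million tables.

```python
def passes_size_filters(n_rows, n_cols, top_header_rows, left_header_cols):
    """Size and header rules from the text; argument names are illustrative."""
    if n_rows < 4 or n_cols < 4:
        return False                      # too small
    if n_rows > 512 or n_cols > 128:
        return False                      # too large
    if top_header_rows > 5 or left_header_cols > 5:
        return False                      # header hierarchy too deep
    if top_header_rows == 0 and left_header_cols == 0:
        return False                      # no header at all
    return True

detected_millions = 115                   # "about 115 million" detected tables
# Successive retention: size/header filters, de-duplication, English only.
kept = detected_millions * 0.5234 * 0.2349 * 0.3176
print(f"{kept:.2f}M tables")              # 4.49M tables
```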

Feature extraction We used ClosedXML to parse spreadsheet files and extract features. For the two web table corpora, tables are serialized as JSON files, so we simply load and parse the JSON files for feature extraction. We unified the featurization schema for web tables and spreadsheet tables as shown in Table 7.


Appendix B Tree extraction

We adopt the method introduced by [Dong et al., 2019a] for header detection. For the detected header regions, we develop a rule-based method to extract header hierarchies based on the merged cells, indentation levels, and formulas of a table. With this method, we obtain two hierarchical trees for each table, which are used to build the bi-dimensional coordinate tree for TUTA.

Merged cells Merged cells provide spatial alignment information between cells, with which we can build hierarchical relationships between the corresponding header cells. For example, the cells under a merged area are child nodes of the merged area. In this way, we build a hierarchical tree for top header cells based on merged cells. Left header hierarchies are built from merged cells in the same way, but in the other direction.
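A simplified version of this rule for top headers is sketched below: each header cell is attached to the cell in the nearest row above whose column span covers it. The data layout, function name, and example header are hypothetical, and the sketch does not distinguish merged from unmerged covering cells, which a real implementation would likely need to do.

```python
def top_header_parents(header_rows):
    """header_rows: list of rows, each a list of (text, first_col, last_col).

    Returns {(row_index, first_col): (text, parent_text or None)}, where the
    parent is the covering cell in the nearest header row above.
    """
    parents = {}
    for r, row in enumerate(header_rows):
        for text, first, last in row:
            parent = None
            for pr in range(r - 1, -1, -1):       # nearest row above first
                covering = [t for t, f, l in header_rows[pr]
                            if f <= first and last <= l]
                if covering:
                    parent = covering[0]
                    break
            parents[(r, first)] = (text, parent)
    return parents

# Hypothetical three-level top header spanning columns 1..4.
header = [
    [("Sales", 1, 4)],
    [("2019", 1, 2), ("2020", 3, 4)],
    [("Q1", 1, 1), ("Q2", 2, 2), ("Q1", 3, 3), ("Q2", 4, 4)],
]
tree = top_header_parents(header)
print(tree[(2, 3)])  # ('Q1', '2020')
```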

Indentation levels Indentation is commonly used to indicate hierarchical relationships of left headers in both web tables and spreadsheet tables. Here, indentation refers to the visual indentation effect, which can be produced in several ways. Users can insert different numbers of spaces and tabs to create the effect, in both spreadsheet tables and web tables. They can also write cell strings of different levels into different columns to create a visual effect of indentation. In spreadsheet tables, Excel additionally provides an operation to set the indentation level in the cell format menu. We apply this expert knowledge as heuristics to extract indentation levels for left header cells. Based on these three operations and the extracted indentation levels, we build hierarchical relationships in the left header between each cell and the subsequent cells that are indented one level deeper.
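Once indentation levels are extracted, the left-header tree can be built with a stack. The sketch below attaches each row to the nearest preceding row with a smaller indentation level, which is a slight generalization of the "one level deeper" rule in the text; the function name and example rows are hypothetical.

```python
def left_header_parents(rows):
    """rows: list of (text, indent_level) in top-to-bottom order.

    Each row's parent is the nearest preceding row with a smaller indent level.
    Returns a list of parent indices (None for roots).
    """
    parents, stack = [], []               # stack holds (index, level) of open ancestors
    for i, (_, level) in enumerate(rows):
        while stack and stack[-1][1] >= level:
            stack.pop()                   # close siblings and deeper rows
        parents.append(stack[-1][0] if stack else None)
        stack.append((i, level))
    return parents

rows = [("Assets", 0), ("Current", 1), ("Cash", 2),
        ("Inventory", 2), ("Fixed", 1), ("Liabilities", 0)]
print(left_header_parents(rows))  # [None, 0, 1, 1, 0, None]
```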

Formulas Formulas are an important feature of spreadsheet tables and contain information about the calculation relationships between cells. Some formulas, such as SUM and AVERAGE, indicate an aggregation relationship between cells. Moreover, if the formulas of all cells in a row share the same aggregation pattern, the left header node corresponding to this row should be treated as the parent node of the referenced rows. The same applies to columns. We can then derive hierarchical relationships from formulas, which have higher priority than indentation levels in our method.
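The row-level formula rule can be sketched with a small pattern matcher. This example handles only single-range SUM and AVERAGE formulas and treats "the same aggregation pattern" as the same function over the same row range; both simplifications, as well as the names used, are assumptions of this sketch.

```python
import re

# Matches e.g. "=SUM(B3:B5)"; only single-range aggregations are handled here.
AGG = re.compile(r"^=(SUM|AVERAGE)\(([A-Z]+)(\d+):([A-Z]+)(\d+)\)$")

def aggregated_rows(row_formulas):
    """Return the referenced row range (first, last) if every formula in the row
    applies the same aggregation function to the same rows, else None."""
    patterns = set()
    for formula in row_formulas:
        m = AGG.match(formula.replace(" ", "").upper())
        if m is None:
            return None                   # not a supported aggregation formula
        func, _, r1, _, r2 = m.groups()
        patterns.add((func, int(r1), int(r2)))
    if len(patterns) != 1:
        return None                   # mixed patterns, or an empty row
    _, first, last = patterns.pop()
    return first, last

print(aggregated_rows(["=SUM(B3:B5)", "=SUM(C3:C5)"]))  # (3, 5)
```

When a range is returned, the left header of the formula row would become the parent of the left headers of the referenced rows.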

Description Feature value Default value
Merged region
The number of merged rows positive integer 1
The number of merged columns positive integer 1
Cell border
If cell has a top border 0 or 1 0
If cell has a bottom border 0 or 1 0
If cell has a left border 0 or 1 0
If cell has a right border 0 or 1 0
Data type
If cell string matches a date template 0 or 1 0
If formula exists in the cell 0 or 1 0
Cell format
If the bold font is applied 0 or 1 0
If the background color is white 0 or 1 1
If the font color is white 0 or 1 1

Table 7. Feature set for each cell.
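The feature set in Table 7 maps naturally onto a small record type with the listed defaults. The field names below are hypothetical; the value ranges and default values are copied from the table.

```python
from dataclasses import dataclass, astuple

@dataclass
class CellFeatures:
    # Merged region (positive integers)
    merged_rows: int = 1
    merged_cols: int = 1
    # Cell border (0 or 1)
    border_top: int = 0
    border_bottom: int = 0
    border_left: int = 0
    border_right: int = 0
    # Data type (0 or 1)
    is_date: int = 0
    has_formula: int = 0
    # Cell format (0 or 1); defaults follow Table 7
    is_bold: int = 0
    background_is_white: int = 1
    font_is_white: int = 1

# A bold header cell merged across three columns, all else at defaults.
cell = CellFeatures(merged_cols=3, is_bold=1)
print(astuple(cell))
```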