ToTTo: A Controlled Table-To-Text Generation Dataset

04/29/2020 ∙ by Ankur P. Parikh, et al. ∙ Google

We present ToTTo, an open-domain English table-to-text dataset with over 120,000 training examples that proposes a controlled generation task: given a Wikipedia table and a set of highlighted table cells, produce a one-sentence description. To obtain generated targets that are natural but also faithful to the source table, we introduce a dataset construction process where annotators directly revise existing candidate sentences from Wikipedia. We present systematic analyses of our dataset and annotation process as well as results achieved by several state-of-the-art baselines. While usually fluent, existing methods often hallucinate phrases that are not supported by the table, suggesting that this dataset can serve as a useful research benchmark for high-precision conditional text generation.




1 Introduction

Table Title: Cristhian Stuani
Section Title: International goals
Table Description: As of 25 March 2019 (Uruguay score listed first, score column indicates score after each Stuani goal)
No. | Date | Venue | Opponent | Score | Result | Competition
1. | 10 September 2013 | Estadio Centenario, Montevideo, Uruguay | Colombia | 2-0 | 2-0 | 2014 FIFA World Cup
2. | 13 November 2013 | Amman International Stadium, Amman, Jordan | Jordan | 2-0 | 5-0 | 2014 FIFA World Cup
3. | 31 May 2014 | Estadio Centenario, Montevideo, Uruguay | | 1-0 | 1-0 | Friendly
4. | 5 June 2014 | | Slovenia | 2-0 | 2-0 |
Original Text: On 13 November 2013, he netted the Charruas’ second in their 5 – 0 win in Jordan for the playoffs first leg, finishing Nicolas Lodeiro’s cross at close range.
Text after Deletion: On 13 November 2013, he netted the second in their 5 – 0 win in Jordan.
Text after Decontextualization: On 13 November 2013, Cristhian Stuani netted the second in 5 – 0 win in Jordan.
Final Text: On 13 November 2013 Cristhian Stuani netted the second in a 5 – 0 win in Jordan.
Table 1: Example in the ToTTo dataset. Given the table and a set of highlighted cells, the goal of the task is to produce the final text. Our data annotation process revolves around annotators iteratively revising the original text to produce the final text.

Data-to-text generation (kukich1983design; mckeown1992text) is the task of generating a target textual description conditioned on source content in the form of structured data such as a table. Examples include generating sentences given biographical data (lebret2016neural), textual descriptions of restaurants given meaning representations (novikova2017e2e), and basketball game summaries given boxscore statistics (wiseman2017challenges).

Existing data-to-text tasks have provided an important test-bed for neural generation models (sutskever2014sequence; bahdanau2014neural). Neural models are known to be prone to hallucination, i.e., generating text that is fluent but not faithful to the source (vinyals2015neural; koehn2017six; lee2018hallucinations; tian2019sticking) and it is often easier to assess faithfulness of the generated text when the source content is structured (wiseman2017challenges; dhingra2019handling). Moreover, structured data can also test a model’s ability for reasoning and numerical inference (wiseman2017challenges) and for building representations of structured objects (liu2017table), providing an interesting complement to tasks that test these aspects in the NLU setting (pasupat2015compositional; chen2019tabfact; dua2019drop).

However, constructing a data-to-text dataset can be challenging on two axes: task design and annotation process. First, tasks with open-ended output like summarization (mani1999advances; lebret2016neural; wiseman2017challenges) lack explicit signals for models on what to generate, which can lead to subjective content and evaluation challenges (kryscinski2019neural). On the other hand, data-to-text tasks that are limited to verbalizing a fully specified meaning representation (gardent2017webnlg) do not test a model’s ability to perform inference and thus remove a considerable amount of challenge from the task.

Secondly, designing an annotation process to obtain natural but also clean targets is a significant challenge. One strategy employed by many datasets is to have annotators write targets from scratch (banik2013kbgen; wen2015semantically; gardent2017creating) which can often lack variety in terms of structure and style (gururangan2018annotation; poliak2018hypothesis). An alternative is to pair naturally occurring text with tables (lebret2016neural; wiseman2017challenges). While more diverse, naturally occurring targets are often noisy and contain information that cannot be inferred from the source. This can make it problematic to disentangle modeling weaknesses from data noise.

In this work, we propose ToTTo, an open-domain table-to-text generation dataset that introduces a novel task design and annotation process to address the above challenges. First, ToTTo proposes a controlled generation task: given a Wikipedia table and a set of highlighted cells as the source $x$, the goal is to produce a single-sentence description $y$. The highlighted cells identify portions of potentially large tables that the target sentence should describe, without specifying an explicit meaning representation to verbalize.

For dataset construction, to ensure that targets are natural but also faithful to the source table, we request annotators to revise existing Wikipedia candidate sentences into target sentences, instead of asking them to write new target sentences (wen2015semantically; gardent2017creating). Table 1 presents a simple example from ToTTo to illustrate our annotation process. The table and Original Text were obtained from Wikipedia using heuristics that collect pairs of tables and sentences that likely have significant semantic overlap. This method ensures that the target sentences are natural, although they may only be partially related to the table. Next, we create a clean and controlled generation task by requesting annotators to highlight a subset of the table that supports the original sentence and revise the latter iteratively to produce a final sentence (see §5). For instance, in Table 1, the annotator has chosen to highlight a set of table cells (in yellow) that are mentioned in the original text. They then deleted phrases from the original text that are not supported by the table, e.g., “for the playoffs first leg”, and replaced the pronoun “he” with the entity “Cristhian Stuani”. The resulting final sentence (Final Text) serves as a more suitable generation target than the original sentence. This annotation process makes our dataset well suited for high-precision conditional text generation.

Due to the varied nature of Wikipedia tables, ToTTo covers a significant variety of domains while containing targets that are completely faithful to the source (see Figures 2-6 for more complex examples). Our experiments demonstrate that state-of-the-art neural models struggle to generate faithful results, despite the high quality of the training data. These results suggest that our dataset and the underlying task could serve as a strong benchmark for controllable data-to-text generation models.

2 Related Work

ToTTo differs from existing datasets in both task design and annotation process, as we describe below. A summary is given in Table 2.

Dataset | Train Size | Domain | Target Quality | Target Source | Content Selection
Wikibio (lebret2016neural) | 583K | Biographies | Noisy | Wikipedia | Not specified
Rotowire (wiseman2017challenges) | 4.9K | Basketball | Noisy | Rotowire | Not specified
WebNLG (gardent2017webnlg) | 25.3K | 15 DBPedia categories | Clean | Annotator Generated | Fully specified
E2E (novikova2017e2e) | 50.6K | Restaurants | Clean | Annotator Generated | Partially specified
LogicNLG (chen2020logical) | 28.5K | Wikipedia (open-domain) | Clean | Annotator Generated | Columns via entity linking
ToTTo | 120K | Wikipedia (open-domain) | Clean | Wikipedia (Annotator Revised) | Annotator highlighted
Table 2: Comparison of popular data-to-text datasets. ToTTo combines the advantages of annotator-generated and fully natural text through a revision process.

Task Design

Most existing table-to-text datasets are restricted in topic and schema, such as WeatherGov (liang2009learning), RoboCup (chen2008learning), Rotowire (wiseman2017challenges, basketball), E2E (novikova2016crowd; novikova2017e2e, restaurants), KBGen (banik2013kbgen, biology), and Wikibio (lebret2016neural, biographies). In contrast, ToTTo contains tables with various schemas spanning topical categories from all over Wikipedia. Moreover, ToTTo takes a different view of content selection compared to existing datasets. Prior to the advent of neural approaches, generation systems typically separated content selection (what to say) from surface realization (how to say it) (reiter1997building). Thus many generation datasets only focused on the latter stage (wen2015semantically; gardent2017webnlg). However, this decreases the task complexity, since neural systems are already quite powerful at producing fluent text. Some recent datasets (wiseman2017challenges; lebret2016neural) have proposed incorporating content selection into the task by framing it as a summarization problem. However, summarization is much more subjective, which can make the task underconstrained and difficult to evaluate (kryscinski2019neural). We place ToTTo as a middle ground where the highlighted cells provide some guidance on the topic of the target but still leave a considerable amount of content planning to be done by the model.

Annotation Process

There are various existing strategies to create the reference target. One strategy employed by many datasets is to have annotators write targets from scratch given a representation of the source (banik2013kbgen; wen2015semantically; gardent2017creating). While this will result in a target that is faithful to the source data, it often lacks variety in terms of structure and style (gururangan2018annotation; poliak2018hypothesis). Domain-specific strategies such as presenting an annotator with an image instead of the raw data (novikova2016crowd) are not practical for some of the complex tables that we consider. Other datasets have taken the opposite approach: finding real sentences on the web that are heuristically selected so that they discuss the source content (lebret2016neural; wiseman2017challenges). This strategy typically leads to targets that are natural and diverse, but they may be noisy and contain information that cannot be inferred from the source (dhingra2019handling).

To construct ToTTo, we ask annotators to revise existing candidate sentences from Wikipedia so that they only contain information that is supported by the table. This enables ToTTo to maintain the varied language and structure found in natural sentences while producing cleaner targets. The technique of editing exemplar sentences has been used in semiparametric generation models (guu2018generating; pandey2018exemplar; peng2019text), and crowd-sourcing small, iterative changes to text has been shown to lead to higher-quality data and a more robust annotation process (little2010turkit). However, to our knowledge, we are the first to use this technique to construct generation datasets.

Concurrent to this work, chen2020logical proposed LogicNLG which also uses Wikipedia tables, although omitting some of the more complex structured ones included in our dataset. Their target sentences are annotator-generated and their task is significantly more uncontrolled due to the lack of annotator highlighted cells.

3 Preliminaries

Our tables come from English Wikipedia articles and thus may not be regular grids. (In Wikipedia, some cells may span multiple rows and columns. See Table 1 for an example.) For simplicity, we define a table $t$ as a set of cells $t = \{c_j\}_{j=1}^{\tau}$, where $\tau$ is the number of cells in the table. Each cell contains: (1) a string value, (2) whether or not it is a row or column header, (3) the row and column position of this cell in the table, and (4) the number of rows and columns this cell spans.

Let $m = (m_{\text{page-title}}, m_{\text{section-title}}, m_{\text{section-text}})$ indicate table metadata, i.e., the page title, section title, and up to the first 2 sentences of the section text (if present), respectively. These fields can help provide context to the table’s contents. Let $s = (s_1, \ldots, s_{\eta})$ be a sentence of length $\eta$. We define an annotation example $d = (t, m, s)$ as a tuple of table, table metadata, and sentence. (An annotation example is different from a task example since the annotator could perform a different task than the model.) Here, $D = \{d_n\}_{n=1}^{N}$ refers to a dataset of annotation examples of size $N$.
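The definitions above map naturally onto a few record types. The following is a minimal sketch only: the class and field names (`Cell`, `Metadata`, `AnnotationExample`) are illustrative assumptions and do not describe the format of the released dataset.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Cell:
    """One table cell, carrying the four properties listed above."""
    value: str          # (1) string value
    is_header: bool     # (2) whether it is a row or column header
    row: int            # (3) position in the table
    col: int
    row_span: int = 1   # (4) number of rows / columns the cell spans
    col_span: int = 1


@dataclass(frozen=True)
class Metadata:
    page_title: str
    section_title: str
    section_text: str = ""  # up to the first two sentences, if present


@dataclass
class AnnotationExample:
    """A (table, metadata, sentence) tuple."""
    table: List[Cell]
    metadata: Metadata
    sentence: str


# Hypothetical example built from a fragment of Table 1.
example = AnnotationExample(
    table=[
        Cell("Date", True, 0, 0),
        Cell("Result", True, 0, 1),
        Cell("13 November 2013", False, 1, 0),
        Cell("5-0", False, 1, 1),
    ],
    metadata=Metadata("Cristhian Stuani", "International goals"),
    sentence="On 13 November 2013, he netted the second in their 5 - 0 win in Jordan.",
)
```

Storing spans on each cell, rather than assuming a rectangular grid, keeps the representation usable for the irregular tables mentioned above.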

4 Dataset Collection

We first describe how to obtain annotation examples for subsequent annotation. To prevent any overlap with the Wikibio dataset (lebret2016neural), we do not use infobox tables. We employed three heuristics to collect tables and sentences:

Number matching

We search for tables and sentences on the same Wikipedia page that overlap with a non-date number of at least 3 non-zero digits. The numbers are extracted by regular expressions that capture most common number patterns, including numbers with commas and decimal points. This approach captures most of the table-sentence pairs that describe statistics (e.g., sports, election, census, science, weather).
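As a rough illustration of the number matching heuristic, the sketch below extracts numbers with a regular expression and keeps those with at least 3 non-zero digits. The pattern and the function names (`salient_numbers`, `number_match`) are assumptions made for this example, and the date filtering mentioned above is deliberately left out.

```python
import re

# Assumed pattern: digits with optional thousands separators and a decimal part.
NUMBER_RE = re.compile(r"\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?")


def salient_numbers(text):
    """Return normalized numbers in `text` that have at least 3 non-zero digits."""
    found = set()
    for match in NUMBER_RE.findall(text):
        normalized = match.replace(",", "")
        non_zero_digits = sum(ch.isdigit() and ch != "0" for ch in normalized)
        if non_zero_digits >= 3:
            found.add(normalized)
    return found


def number_match(sentence, cell_values):
    """True if the sentence shares a salient number with any table cell.

    Note: this sketch does not exclude dates, unlike the heuristic in the text.
    """
    table_numbers = set()
    for value in cell_values:
        table_numbers |= salient_numbers(value)
    return bool(salient_numbers(sentence) & table_numbers)


print(number_match("Rawdat Al Khail had a population of 17,219 inhabitants.",
                   ["Population", "17,219"]))
```

Normalizing away the thousands separators lets "17,219" in a sentence match "17219" in a cell.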

Cell matching

We extract a sentence if it has tokens matching at least 3 distinct cell contents from the same row in the table. The intuition is that most tables are structured, and a row is usually used to describe a complete event (e.g., a sports game, an election, census data from a certain year), which is likely to have a corresponding sentence description from the same page.
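The cell matching heuristic can be sketched as follows. This is an approximation under stated assumptions: matching is done by case-insensitive substring search rather than proper tokenization, and the function name `cell_match` and the `min_cells` parameter are hypothetical.

```python
def cell_match(sentence, rows, min_cells=3):
    """True if the sentence mentions at least `min_cells` distinct cell
    contents from a single table row.

    `rows` is a list of rows, each a list of cell strings.
    """
    lowered = sentence.lower()
    for row in rows:
        # Distinct, non-empty cell contents of this row.
        distinct = {cell.strip().lower() for cell in row if cell.strip()}
        hits = sum(1 for cell in distinct if cell in lowered)
        if hits >= min_cells:
            return True
    return False


rows = [
    ["1.", "10 September 2013", "Colombia", "2-0"],
    ["2.", "13 November 2013", "Jordan", "5-0"],
]
print(cell_match("On 13 November 2013, he scored in a 5-0 win in Jordan.", rows))
```

Requiring the matches to come from the same row reflects the intuition that a row describes one complete event.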


Hyperlinks

The above heuristics only consider sentences and tables on the same page. We also find examples where a sentence contains a hyperlink to a page with a title that starts with “List” (these pages typically only consist of a large table). If the table on that page also has a hyperlink to the page containing the sentence, then we consider this to be an annotation example. Such examples are typically more diverse than those from the other two heuristics, but also add more noise, since the sentence may only be distantly related to the table.
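The mutual-link condition described here can be written as a small filter. The sketch assumes that the outgoing links of each sentence and of each list-page table have already been extracted; the function name and the dictionary layout are hypothetical.

```python
def hyperlink_candidates(sentence_page, sentence_links, table_links_by_page):
    """Return titles of "List..." pages that the sentence links to and whose
    table links back to the page containing the sentence.

    `table_links_by_page` maps a page title to the set of page titles that
    the table on that page links to (an assumed, precomputed structure).
    """
    return [
        target
        for target in sentence_links
        if target.startswith("List")
        and sentence_page in table_links_by_page.get(target, set())
    ]


table_links = {
    "List of Uruguay international footballers": {"Cristhian Stuani", "Diego Forlan"},
    "List of FIFA World Cups": {"FIFA"},
}
print(hyperlink_candidates(
    "Cristhian Stuani",
    ["List of Uruguay international footballers", "List of FIFA World Cups", "Uruguay"],
    table_links,
))
```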

Using the above heuristics we obtain a large set of annotation examples. We then sample a random subset of the examples for annotation: 191,693 examples for training, 11,406 examples for development, and 11,406 examples for test. Among these examples, 35.8% were derived from number matching, 29.4% from cell matching, and 34.7% from hyperlinks.

5 Data Annotation Process

Example 1
Original: He was the first president of the Federal Supreme Court (1848–1850) and president of the National Council in 1850–1851.
After Deletion: He was the first president of the Federal Supreme Court (1848–1850) and president of the National Council in 1850–1851.
After Decontextualization: Johann Konrad Kern was the first president of the Federal Supreme Court from 1848 to 1850.
Final: Johann Konrad Kern was the first president of the Federal Supreme Court from 1848 to 1850.

Example 2
Original: He later raced a Nissan Pulsar and then a Mazda 626 in this series, with a highlight of finishing runner up to Phil Morriss in the 1994 Australian Production Car Championship.
After Deletion: He later raced a Nissan Pulsar and then a Mazda 626 in this series, with a highlight of finishing runner up to Phil Morriss in the 1994 Australian Production Car Championship.
After Decontextualization: Murray Carter raced a Nissan Pulsar and finished as a runner up in the 1994 Australian Production Car Championship.
Final: Murray Carter raced a Nissan Pulsar and finished as runner up in the 1994 Australian Production Car Championship.

Example 3
Original: On July 6, 2008, Webb failed to qualify for the Beijing Olympics in the 1500 m after finishing 5th in the US Olympic Trials in Eugene, Oregon with a time of 3:41.62.
After Deletion: On July 6, 2008, Webb failed to qualify for the Beijing Olympics in the 1500 m after finishing 5th in the US Olympic Trials in Eugene, Oregon with a time of 3:41.62.
After Decontextualization: On July 6, 2008, Webb finishing 5th in the Olympic Trials in Eugene, Oregon with a time of 3:41.62.
Final: On July 6, 2008, Webb finished 5th in the Olympic Trials in Eugene, Oregon, with a time of 3:41.62.

Example 4
Original: Out of the 17,219 inhabitants, 77 percent were 20 years of age or older and 23 percent were under the age of 20.
After Deletion: Out of the 17,219 inhabitants, 77 percent were 20 years of age or older and 23 percent were under the age of 20.
After Decontextualization: Rawdat Al Khail had a population of 17,219 inhabitants.
Final: Rawdat Al Khail had a population of 17,219 inhabitants.

Table 3: Examples of annotation process. Deletions are indicated in red strikeouts, while added named entities are indicated in underlined blue. Significant grammar fixes are denoted in orange.

The collected annotation examples are noisy since a sentence may be partially or completely unsupported by the table. We thus define a data annotation process that guides annotators through small, incremental changes to the original sentence. This allows us to measure annotator agreement at every step of the process, which is atypical in existing generation datasets.

5.1 Primary Annotation Task

The primary annotation task consists of the following steps: (1) Table Readability, (2) Cell Highlighting, (3) Phrase Deletion, (4) Decontextualization. Each of these is described below, and more examples are provided in Table 3.

Table Readability

If a table is not readable, then the following steps will not need to be completed. This step is only intended to remove fringe cases where the table is poorly formatted or otherwise not understandable (e.g., in a different language). 99.5% of tables are determined to be readable.

Cell Highlighting

An annotator is instructed to highlight cells that support the sentence. A phrase is supported by the table if it is either directly stated in the cell contents or metadata, or can be logically inferred from them. Row and column headers do not need to be highlighted. If the table does not support any part of the sentence, then no cell is marked and no other step needs to be completed. 69.7% of examples are supported by the table.

For instance, in Table 1, the annotator highlighted cells that support the phrases “second”, “13 November 2013”, “in Jordan”, and “5-0 win”. We denote the set of highlighted cells as a subset of the table: $t_{\text{highlight}} \subseteq t$.

Phrase Deletion

This step removes phrases in the sentence unsupported by the selected table cells. Annotators are restricted such that they are only able to delete phrases, transforming the original sentence into the sentence after deletion: $s \rightarrow s_{\text{deletion}}$. In Table 1, the annotator transforms the sentence by removing an individual word, “Charruas’”, and an entire phrase: “for the playoffs first leg, finishing Nicolas Lodeiro’s cross at close range”.

On average, the sentence after deletion $s_{\text{deletion}}$ is different from the original sentence $s$ for 85.3% of examples, and while $s$ has an average length of 26.6 tokens, this is reduced to 15.9 for $s_{\text{deletion}}$. We found that the phrases annotators often disagreed on corresponded to verbs purportedly supported by the table. For instance, in Table 1, some annotators decided “netted” is supported by the table since it is about soccer, while others opted to delete it.


Decontextualization

A given sentence may contain pronominal references or other phrases that depend on context. We thus instruct annotators to identify the main topic of the sentence; if it is a pronoun or other ambiguous phrase, we ask them to replace it with a named entity from the table or metadata. To discourage excessive modification, they are instructed to make at most one replacement. (Based on manual examination of a subset of 100 examples, all of them could be decontextualized with only one replacement. Allowing annotators to make multiple replacements led to excessive clarification.) This transforms the sentence after deletion yet again: $s_{\text{deletion}} \rightarrow s_{\text{decontext}}$. In Table 1, the annotator replaced “he” with “Cristhian Stuani”.

Since the previous steps can lead to ungrammatical sentences, annotators are also instructed to fix the grammar to improve the fluency of the sentence. We find that the decontextualized sentence $s_{\text{decontext}}$ is different from the sentence after deletion $s_{\text{deletion}}$ 68.3% of the time, and the average sentence length increases to 17.2 tokens for $s_{\text{decontext}}$ compared to 15.9 for $s_{\text{deletion}}$.

5.2 Secondary Annotation Task

Due to the complexity of the primary annotation task, the resulting sentence may still have grammatical errors, even if annotators were instructed to fix grammar. Thus, a second set of annotators were asked to further correct the sentence and were shown the table with highlighted cells as additional context. However, the annotators were not required to use the table. They were asked to determine the grammaticality and fluency of the provided sentence. If the sentence is not fluent or grammatical, they fix the errors to make it such. Annotators are also given an option to indicate that the sentence is not fixable.

This results in the final sentence $s_{\text{final}}$. On average, annotators edited the sentence 27.0% of the time, and the sentence length slightly increased to 17.4 tokens from 17.2. We found that in most cases the table is not necessary to fix the sentence, since the grammatical errors are due to surface syntax, such as missing punctuation or a missing determiner. In a few cases, a verb may be missing, and in such instances, the table is needed to indicate the correct verb to use.

6 Dataset Analysis

Basic statistics of ToTTo are described in Table 4. The number of unique tables and vocabulary size attests to the open domain nature of our dataset. Furthermore, while the median table is actually quite large (87 cells), the median number of highlighted cells is significantly smaller (3). This indicates the importance of the cell highlighting feature of our dataset toward a well-defined text generation task.

Property | Value
Training set size | 120,761
Number of target tokens | 1,268,268
Avg Target Length (tokens) | 17.4
Target vocabulary size | 136,777
Unique Tables | 83,141
Rows per table (Median/Avg) | 16 / 32.7
Cells per table (Median/Avg) | 87 / 206.6
No. of Highlighted Cells (Median/Avg) | 3 / 3.55
Development set size | 7,700
Test set size | 7,700
Table 4: ToTTo dataset statistics.

Annotation Stage | Measure | Result
Table Readability | Agreement (%) / κ | 99.38 / 0.646
Cell Highlighting | Agreement (%) / κ | 73.74 / 0.856
After Deletion | BLEU-4 | 82.19
After Decontextualization | BLEU-4 | 72.56
Final | BLEU-4 | 68.98
Table 5: Annotator agreement over the development set. If possible, we measure the total agreement (in %) and the Fleiss’ Kappa (κ). Otherwise, we report the BLEU-4 between annotators.

6.1 Annotator Agreement

Table 5 shows annotator agreement over the development set for each step of the annotation process. We compute annotator agreement and Fleiss’ kappa (fleiss_kappa) for table readability and highlighted cells, and the BLEU-4 score between annotated sentences at different stages, including (1) sentence after deletion; (2) sentence after decontextualization; and (3) final sentence after the secondary grammar correction task.

As one can see, the table readability task has an agreement of 99.38%. The cell highlighting task is more challenging: 73.74% of the time, all three annotators completely agree on the set of cells, meaning that they chose the exact same set of cells. The Fleiss’ kappa is 0.856, which is regarded as “almost perfect agreement” (0.81 - 1.00) according to kappa_table.
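For reference, Fleiss’ kappa can be computed from a matrix of per-item category counts using the standard textbook formula. The sketch below is a generic implementation; how cell highlighting decisions were turned into items and categories for Table 5 is not specified here, so the toy input is purely an assumption.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items.

    ratings[i][j] is the number of raters who assigned item i to category j;
    every item must be rated by the same number of raters.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_categories = len(ratings[0])

    # Observed agreement: mean over items of the fraction of agreeing rater pairs.
    per_item = [
        (sum(count * count for count in row) - n_raters)
        / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    proportions = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)

    return (p_observed - p_expected) / (1 - p_expected)


# Toy input: 4 tables, 3 annotators, categories (readable, not readable).
print(fleiss_kappa([[3, 0], [3, 0], [2, 1], [0, 3]]))
```

Because kappa corrects for chance agreement, a task where almost every item falls in one category (such as table readability) can show very high raw agreement yet a more modest kappa.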

With respect to the sentence revision tasks, we see that the agreement slightly degrades as more steps are performed. We compute single-reference BLEU among all pairs of annotators for examples in our development set (which only contains examples where both annotators chose a non-empty set of highlighted cells). As the sequence of revisions is performed, the annotator agreement gradually decreases in terms of BLEU-4: 82.19 → 72.56 → 68.98. This is considerably higher than the BLEU-4 between the original sentence and the final sentence (45.87).

6.2 Topics and Linguistic Phenomena

Figure 1: Topic distribution of our dataset.

We use the Wikimedia Foundation’s topic categorization model (asthana2018few) to sort the categories of Wikipedia articles where the tables come from into a 44-category ontology. Figure 1 presents an aggregated topic analysis of our dataset. We found that the Sports and Countries topics together comprise 56.4% of our dataset, but the other 44% is composed of a much broader set of topics such as Performing arts, Transportation, and Entertainment.

Table 6 summarizes the fraction of examples that require reference to the metadata, as well as some of the challenging linguistic phenomena in the dataset that potentially pose new challenges to current systems. Please refer to Figures 2-6 in the Appendix for more complex examples.

Types | Percentage
Require reference to page title | 82%
Require reference to section title | 19%
Require reference to table description | 3%
Reasoning (logical, numerical, temporal etc.) | 21%
Comparison across rows / columns / cells | 13%
Require background information | 12%
Table 6: Distribution of different linguistic phenomena among 100 randomly chosen sentences.

6.3 Training, Development, and Test Splits

Each annotation consists of the set of highlighted cells $t_{\text{highlight}}$ and the modified sentences (after deletion, after decontextualization, and final). After the annotation process, we only consider examples where the sentence is related to the table, i.e., $t_{\text{highlight}} \neq \emptyset$. This initially results in a training set of size 131,849 that we further filter as described below.

For more robust evaluation, each example in the development and test sets was annotated by three annotators. Since the machine learning task uses the set of highlighted cells as an input, it is challenging to use three different sets of highlighted cells in evaluation. Thus, we only use a single randomly chosen set of highlighted cells while using the three final sentences as references for evaluation. (We don’t use union or intersection because this may result in a set of highlighted cells that doesn’t directly correspond to any of the references.) We only use examples where at least 2 of the 3 annotators chose a non-empty set of highlighted cells. This results in a development set of size 7,700 and a test set of size 7,700.
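The construction of an evaluation example from three annotations can be sketched as below. Two details are assumptions of this sketch rather than statements from the text: the random highlight set is drawn only from annotators who highlighted at least one cell, and only those annotators contribute references. The function name `build_eval_example` is hypothetical.

```python
import random


def build_eval_example(annotations, rng):
    """Combine several annotations of one example into an evaluation example.

    `annotations` is a list of (highlighted_cells, final_sentence) pairs, one
    per annotator. Returns None when fewer than 2 annotators highlighted any
    cell, mirroring the "at least 2 of the 3" filter.
    """
    usable = [(cells, sentence) for cells, sentence in annotations if cells]
    if len(usable) < 2:
        return None
    chosen_cells, _ = rng.choice(usable)  # single randomly chosen highlight set
    return {
        "highlighted_cells": chosen_cells,
        "references": [sentence for _, sentence in usable],
    }


rng = random.Random(0)
annotations = [
    ([(2, 1), (2, 5)], "On 13 November 2013 Cristhian Stuani netted the second in a 5 - 0 win in Jordan."),
    ([], ""),
    ([(2, 1)], "Cristhian Stuani scored on 13 November 2013."),
]
print(build_eval_example(annotations, rng))
```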

Overlap and Non-Overlap Sets

Without any modification, the training, development, and test sets may contain many similar tables. Thus, to increase the generalization challenge, we filter the training set to remove some examples based on overlap with the development and test sets.

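For a given example $d$, let $h(d)$ denote its set of header values, and similarly let $h(D)$ be the set of header values for a given dataset $D$. We remove examples $d$ from the training set where $h(d)$ is both rare in the data and occurs in either the development or test sets. Specifically, writing $D_{\text{orig-train}}$ for the training set before filtering, $D_{\text{train}}$ is defined as:

$$D_{\text{train}} := \{ d : h(d) \notin (h(D_{\text{dev}}) \cup h(D_{\text{test}})) \ \text{or} \ \text{count}(h(d), D_{\text{orig-train}}) > \alpha \}$$

The $\text{count}(h(d), D_{\text{orig-train}})$ function returns the number of examples in $D_{\text{orig-train}}$ with header $h(d)$. To choose the hyperparameter $\alpha$, we first split the test set as follows:

$$D_{\text{test-overlap}} := \{ d : h(d) \in h(D_{\text{train}}) \}$$

$$D_{\text{test-nonoverlap}} := \{ d : h(d) \notin h(D_{\text{train}}) \}$$

The development set is analogously divided into $D_{\text{dev-overlap}}$ and $D_{\text{dev-nonoverlap}}$.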

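We then choose the hyperparameter so that the overlap and non-overlap test subsets have similar size. After filtering, the size of the training set is 120,761, and the four overlap and non-overlap evaluation subsets have sizes , , , and respectively.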
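The filtering and splitting procedure can be sketched in a few lines. The sketch assumes each example is a dictionary with a hashable `headers` entry (for instance a `frozenset` of header strings) and treats "rare" as "occurring at most `alpha` times in the unfiltered training set"; the strictness of that threshold and the function names are assumptions.

```python
from collections import Counter


def filter_training_set(train, dev, test, alpha):
    """Drop training examples whose header set also occurs in dev or test
    and is rare (at most `alpha` occurrences) in the training data."""
    held_out_headers = {ex["headers"] for ex in dev} | {ex["headers"] for ex in test}
    counts = Counter(ex["headers"] for ex in train)
    return [
        ex for ex in train
        if ex["headers"] not in held_out_headers or counts[ex["headers"]] > alpha
    ]


def split_by_overlap(eval_set, train):
    """Split an evaluation set by whether each header set occurs in `train`."""
    train_headers = {ex["headers"] for ex in train}
    overlap = [ex for ex in eval_set if ex["headers"] in train_headers]
    non_overlap = [ex for ex in eval_set if ex["headers"] not in train_headers]
    return overlap, non_overlap


a = frozenset({"Year", "Team"})
b = frozenset({"Date", "Venue"})
train = [{"id": i, "headers": a} for i in range(3)] + [{"id": 3, "headers": b}]
dev = [{"id": 4, "headers": b}]
test = [{"id": 5, "headers": a}]
kept = filter_training_set(train, dev, test, alpha=2)
print(len(kept), split_by_overlap(dev, kept))
```

Raising `alpha` removes more training examples and therefore moves more evaluation examples into the non-overlap subset, which is how the threshold can be tuned to balance the two subsets.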

7 Machine Learning Task Construction

In this work, we focus on the following task:

Given a table $t$, related metadata $m$ (page title, section title, table section text), and a set of highlighted cells $t_{\text{highlight}}$, produce the final sentence $s_{\text{final}}$. Mathematically, this can be described as learning a function $f: x \rightarrow y$ where $x = (t, m, t_{\text{highlight}})$ and $y = s_{\text{final}}$.

Note that this task is different from what the annotators perform, since they are provided with a starting sentence requiring revision. Therefore, this task is more challenging, as the model must generate a new sentence instead of revising an existing sentence. Since we use several stages in our annotation mechanism, one could design several other tasks for machine learning models given the data such as sentence revision or cell highlighting, but we leave this out of the scope of this work.

8 Experiments

We present baseline results on ToTTo by examining three existing state-of-the-art approaches:

  • BERT-to-BERT (rothe2019leveraging): A Transformer encoder-decoder model (vaswani2017attention) where the encoder and decoder are both initialized with BERT (devlin2018bert). The original BERT model is pre-trained with both Wikipedia and the Books corpus (zhu2015aligning), the former of which contains our (unrevised) test targets. Thus, we also pre-train a version of BERT on the Books corpus only, which we consider a more correct baseline. However, empirically we find that both models perform similarly in practice (Table 7).

  • Pointer-Generator (see2017get): A Seq2Seq model with attention and copy mechanism (our implementation).

  • puduppully2019data: A Seq2Seq model with an explicit content selection and planning mechanism designed for data-to-text.

Moreover, we explore different strategies of representing the source content that resemble standard linearization approaches in the literature (lebret2016neural; wiseman2017challenges).

  • Full Table: The simplest approach is to use the entire table as the source, adding special tokens to mark which cells have been highlighted. However, many tables can be very large and this strategy performs poorly.

  • Subtable: Another option is to only use the highlighted cells with the heuristically extracted row and column header for each highlighted cell. This makes it easier for the model to only focus on relevant content but limits the ability to perform reasoning in the context of the table structure (see Table 10). Overall though, we find this representation leads to higher performance.

In all cases, the selected cells are linearized with row and column separator tokens. We also experiment with prepending the table metadata to the source table. (The table section text is ignored, since it is usually missing or irrelevant.)
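A subtable linearization with separator tokens and prepended metadata might look like the following sketch. The separator strings (`[ROW]`, `[COL]`, `[META]`), the "header : value" rendering, and the dictionary layout of a cell are all assumptions chosen for illustration; they are not the tokens used in the experiments.

```python
from itertools import groupby

# Assumed separator tokens, for illustration only.
ROW_SEP = "[ROW]"
COL_SEP = "[COL]"
META_SEP = "[META]"


def linearize_subtable(cells, page_title="", section_title=""):
    """Flatten highlighted cells into one string, row by row.

    Each cell is a dict with 'row', 'col', 'header' (its column header)
    and 'value'. Metadata is prepended when provided.
    """
    ordered = sorted(cells, key=lambda c: (c["row"], c["col"]))
    rows = []
    for _, group in groupby(ordered, key=lambda c: c["row"]):
        rows.append(f" {COL_SEP} ".join(f"{c['header']} : {c['value']}" for c in group))
    body = f" {ROW_SEP} ".join(rows)
    metadata = [field for field in (page_title, section_title) if field]
    if metadata:
        return f" {META_SEP} ".join(metadata) + f" {META_SEP} " + body
    return body


cells = [
    {"row": 2, "col": 3, "header": "Opponent", "value": "Jordan"},
    {"row": 2, "col": 1, "header": "Date", "value": "13 November 2013"},
]
print(linearize_subtable(cells, "Cristhian Stuani", "International goals"))
```

Attaching the column header to each value keeps some of the table context that is otherwise lost when only highlighted cells are kept.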

8.1 Evaluation metrics

The model output is evaluated using two automatic metrics. Human evaluation is described in § 8.3.

BLEU (papineni2002bleu): A widely used metric that uses n-gram overlap between the reference $y$ and the prediction $\hat{y}$ at the corpus level. BLEU does not take the source content $x$ into account.

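PARENT (dhingra2019handling): A metric recently proposed specifically for data-to-text evaluation that takes the table into account. PARENT is defined at an instance level. For a given example $(x_n, y_n, \hat{y}_n)$, PARENT is defined as:

$$\text{PARENT}(x_n, y_n, \hat{y}_n) = \frac{2 \times E_p(x_n, y_n, \hat{y}_n) \times E_r(x_n, y_n, \hat{y}_n)}{E_p(x_n, y_n, \hat{y}_n) + E_r(x_n, y_n, \hat{y}_n)}$$

$E_p(x_n, y_n, \hat{y}_n)$ is the PARENT precision computed using the prediction, reference, and table (the last of which is not used in BLEU). $E_r(x_n, y_n, \hat{y}_n)$ is the PARENT recall and is computed as:

$$E_r(x_n, y_n, \hat{y}_n) = R(x_n, y_n, \hat{y}_n)^{(1-\lambda)} \, R(x_n, \hat{y}_n)^{\lambda}$$

where $R(x_n, y_n, \hat{y}_n)$ is a recall term that compares the prediction with both the reference and table. $R(x_n, \hat{y}_n)$ is an extra recall term that gives an additional reward if the prediction contains phrases in the table that are not necessarily in the reference ($\lambda$ is a hyperparameter).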

In the original PARENT work (dhingra2019handling), the same table $t$ is used for computing the precision and both recall terms. While this makes sense for most existing datasets, it does not take into account the highlighted cells $t_{\text{highlight}}$ in our task. To incorporate $t_{\text{highlight}}$, we modify the PARENT metric so that the additional recall term uses the highlighted cells $t_{\text{highlight}}$ instead of the full table $t$ to only give an additional reward for relevant table information. The other recall term and the precision term still use $t$.
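To make the combination step concrete, the sketch below assumes, following dhingra2019handling, that the recall is a geometric interpolation of the two recall terms weighted by a hyperparameter and that the final score is the harmonic mean of precision and recall. It does not compute the underlying precision and recall terms themselves; the input values and the function names are hypothetical.

```python
def parent_recall(recall_reference, recall_table, lam):
    """Geometric interpolation of the reference-based recall term and the
    additional table-based recall term (computed on the highlighted cells
    in the modified metric)."""
    return recall_reference ** (1 - lam) * recall_table ** lam


def parent_f_score(precision, recall):
    """Harmonic mean of precision and recall for one instance."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical per-instance values.
recall = parent_recall(recall_reference=0.64, recall_table=1.0, lam=0.5)
print(parent_f_score(precision=0.8, recall=recall))
```

With `lam = 0`, the additional table-based term has no effect and the recall reduces to the reference-based term alone.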

8.2 Results

Model | Overall BLEU | Overall PARENT | Overlap Subset BLEU | Overlap Subset PARENT | Nonoverlap Subset BLEU | Nonoverlap Subset PARENT
BERT-to-BERT (Books+Wiki) | 44.0 | 52.6 | 52.7 | 58.4 | 35.1 | 46.8
BERT-to-BERT (Books) | 43.9 | 52.6 | 52.7 | 58.4 | 34.8 | 46.7
Pointer-Generator | 41.6 | 51.6 | 50.6 | 58.0 | 32.2 | 45.2
puduppully2019data | 19.2 | 29.2 | 24.5 | 32.5 | 13.9 | 25.8
Table 7: Performance compared to multiple references on the test set for the subtable input format with metadata.

Input representation | BLEU | PARENT
subtable w/ metadata | 43.9 | 52.6
subtable w/o metadata | 36.9 | 42.6
full table w/ metadata | 26.8 | 30.7
full table w/o metadata | 20.9 | 22.2
Table 8: Multi-reference performance of different input representations for BERT-to-BERT Books model.

Table 7 shows our results against multiple references with the subtable input format. The two BERT-to-BERT models perform best, followed by the Pointer-Generator model. (Note that the BLEU scores are relatively high because our task is more controlled than other text generation tasks and because we have multiple references.) For all models, performance on the non-overlap set is significantly lower than on the overlap set, indicating that this slice of our data poses significant challenges for machine learning models. We also observe that the baseline that separates content selection and planning (puduppully2019data) performs quite poorly. We attribute this to the fact that it is engineered for the Rotowire data format, with fixed-size tables and predefined column names.

Table 8 explores the effects of the various input representations (subtable vs. full table) on the BERT-to-BERT model. The full table format performs poorly even though it is the most knowledge-preserving representation. Table metadata helps substantially for both the subtable and the full table inputs.

8.3 Human evaluation

For each of the two top-performing models in Table 7, we take 500 random outputs and perform human evaluation along the following axes:

  • Fluency - A candidate sentence is fluent if it is grammatical and natural. The three choices are Fluent, Mostly Fluent, Not Fluent.

  • Faithfulness (Precision) - A candidate sentence is considered faithful if all pieces of information are supported by either the table or one of the references. Any piece of unsupported information makes the candidate unfaithful.

  • Covered Cells (Recall) - The percentage of highlighted cells the candidate sentence covers.

  • Coverage with Respect to Reference (Recall) - We ask whether the candidate is strictly more or less informative than each reference (or neither, which is referred to as neutral).

In addition to evaluating the model outputs, we compute an oracle upper-bound by treating one of the references as a candidate and evaluating it compared to the table and other references. The results, shown in Table 9, attest to the high quality of our human annotations since the oracle consistently achieves high performance. All the axes demonstrate that there is a considerable gap between the model and oracle performance.

This difference is most easily revealed in the last column, where annotators are asked to directly compare the candidate and reference. As expected, the oracle has similar coverage to the reference (61.7% neutral), but both baselines demonstrate considerably less coverage. According to an independent-sample t-test, this difference is statistically significant for both baselines. Similarly, the baselines cover a significantly lower percentage of highlighted cells than the reference. Comparing the baselines to each other, we do not observe a significant difference in either coverage metric.

Furthermore, the baselines are considerably less faithful than the reference: the faithfulness of both models is significantly lower. The models do not differ significantly from each other, except for the non-overlap case, where we see a moderate effect favoring the Books model. While it is well known that neural methods struggle with faithfulness in the presence of noisy references (wiseman2017challenges; tian2019sticking), our results indicate that it is a problem even when the references are clean.

| Subset | Model | Fluency (%) | Faithfulness (%) | Covered Cells (%) | Less / Neutral / More Coverage w.r.t. Ref |
|---|---|---|---|---|---|
| Overall | Oracle | 99.3 | 93.6 | 94.8 | 18.3 / 61.7 / 20.0 |
| Overall | BERT-to-BERT (Books) | 88.1 | 76.2 | 89.0 | 49.2 / 36.2 / 14.5 |
| Overall | BERT-to-BERT (Books+Wiki) | 87.3 | 73.6 | 87.3 | 53.9 / 32.9 / 13.2 |
| Overlap | Oracle | 99.6 | 96.5 | 95.5 | 19.8 / 62.8 / 17.4 |
| Overlap | BERT-to-BERT (Books) | 89.6 | 78.7 | 92.1 | 42.0 / 43.7 / 14.3 |
| Overlap | BERT-to-BERT (Books+Wiki) | 89.8 | 81.1 | 91.0 | 47.8 / 39.2 / 13.1 |
| Non-overlap | Oracle | 99.1 | 91.4 | 94.3 | 17.0 / 60.9 / 22.1 |
| Non-overlap | BERT-to-BERT (Books) | 86.9 | 74.2 | 86.4 | 55.5 / 29.8 / 14.7 |
| Non-overlap | BERT-to-BERT (Books+Wiki) | 84.8 | 66.6 | 83.8 | 60.1 / 26.6 / 13.3 |

Table 9: Human evaluation over references (to compute Oracle) and model outputs. For Fluency, we report the percentage of outputs that were completely fluent. In the last column, X / Y / Z means that X% and Z% of the candidates were deemed to be less and more informative than the reference, respectively, and Y% were neutral.
| ID | Reference | Decoder output: full table (w/ metadata) | Decoder output: subtable (w/ metadata) | Decoder output: subtable (w/o metadata) |
|---|---|---|---|---|
| 1 | in the 2012 film pizza bagel, michael pillarella portrays tommy. | in 2012, groff played the role of tommy in the film pizza bagel. | in 2012, pillarella appeared as tommy in the film pizza bagel. | harris played the role of tommy in the 2012 film pizza bagel. |
| 2 | the album shari addison placed at no. 176 on the billboard 200 along with no. 5 on the gospel albums. | shari addison’s ” 5”, reached number 176 on the billboard 200. | shari addison charted at number 176 on the us chart and at number 5 on the us billboard 200. | the shari addison peaked at number 176 on the billboard 200 chart. |
| 3 | in the 1939 currie cup, western province lost to transvaal by 17–6 in cape town. | northern transvaal and western province were drawn by 16 in 1989 and in 1995, western province were defeated by the sharks in durban. | the first currie cup was played in 1939 in transvaal1 at newlands, with western province winning 17–6. | the first scottish cup was played in 1939 in transvaal1, where transvaal1 defeated western province 17–6 in the final at newlands in cape town. |
| 4 | the bnp secured their best general election result in oldham west and royton where nick griffin secured 16.4% of the votes. | bnp results ranged from 278 to 6,552 votes. | in the british national party election, nick griffin placed third with 16.4% of the vote. | in oldham west and royton, nick griffin won 16.4% of the vote. |
| 5 | a second generation of microdrive was announced by ibm in 2000 with increased capacities at 512 mb and 1 gb. | the microdrive models formed 512 megabyte and 1 gigabyte in 2000. | there were 512 microdrive models in 2000: 1 gigabyte. | cortete’s production was 512 megabyte. |
| 6 | the 1956 grand prix motorcycle racing season consisted of six grand prix races in five classes: 500cc, 350cc, 250cc, 125cc and sidecars 500cc. | the 1966 grand prix motorcycle racing season consisted of seven grand prix races in five classes: 500cc, 350cc, 250cc, 125cc and sidecars 500cc. | the 1956 grand prix motorcycle racing season consisted of eight grand prix races in five classes: 500cc, 350cc, 250cc, 125cc and sidecars 500cc. | the 1955 grand prix motorcycle racing season consisted of eight grand prix races in five classes: 500cc, 350cc, 250cc, 125cc and sidecars 500cc. |
| 7 | in travis kelce’s last collegiate season, he set personal career highs in receptions (45), receiving yards (722), yards per receptions (16.0) and receiving touchdowns (8). | during the 2011 season, travis kelceum caught 76 receptions for 1,612 yards and 14 touchdowns. | travis kelce finished the 2012 season with 45 receptions for 722 yards (16.0 avg.) and eight touchdowns. | kelce finished the 2012 season with 45 catches for 722 yards (16.0 avg.) and eight touchdowns. |

Table 10: Decoder output examples from BERT-to-BERT Books models on the development set. The “subtable with metadata” model achieves the highest BLEU. Red indicates model errors and blue denotes interesting reference language not in the model output.

9 Model Errors and Challenges

In this section, we visualize some example decoder outputs from the BERT-to-BERT Books model (Table 10) and discuss specific challenges that existing approaches face with this dataset. In general, the model performed reasonably well in producing grammatically correct and fluent sentences given the information from the table, as indicated by Table 10. Given the “full table”, the model is not able to correctly select the information needed to produce the reference, and instead produces an arbitrary sentence with irrelevant information. Note the model corrects itself with highlighted cell information (“subtable”), and learns to use the metadata to improve the sentence. However, we also observe certain challenges that existing approaches are struggling with, which can serve as directions for future research. In particular:


Hallucination

As shown in Table 10 (examples 1-4), the model sometimes outputs phrases such as “first scottish” and “third” that seem reasonable but are not faithful to the table. This hallucination phenomenon has been widely observed in other existing data-to-text datasets (lebret2016neural; wiseman2017challenges). However, the noisy references in these datasets make it difficult to disentangle model limitations from data noise. Our dataset serves as strong evidence that even when the reference targets are faithful to the source, neural models still struggle with faithfulness.

Rare topics

Another challenge revealed by the open-domain nature of our task is that models often struggle with rare or complex topics. For instance, example 5 of Table 10 concerns microdrive capacities, a topic the model handles poorly. As our topic distribution indicates (Figure 1), certain topics have relatively few training examples. This calls for models that can learn from limited examples and generalize better.

Diverse table structure

In example 6, inferring “six” and “five” correctly requires counting table rows and columns. Similarly, in the last example of Table 10, the phrases “last” and “career highs” can be deduced from the table structure and from comparisons over the columns. However, the models are unable to easily make these inferences from the simplistic source representation that we used. Our dataset presents a unique challenge for learning better table representations because of its varied table schemas. Please see Figures 2-6 for more example tables.

Numerical reasoning

As discussed above, reasoning over the table structure often requires counting rows or columns, or comparing numbers over a set of cells. In addition to examples 6 and 7, example 4 requires comparing numbers to conclude that “third” is an incorrect relation. The model errors indicate that numerical reasoning remains challenging for generation systems. Recent attention to this problem in question answering (dua2019drop; andor2019giving) may be relevant for our task.

Evaluation metrics

Many of the above issues are difficult to capture with metrics like BLEU, since the reference and prediction may differ by only a word yet differ greatly in meaning. Furthermore, it is unclear how to correctly reward models that produce output with more inferences. This calls for better metrics, possibly built on learned models (wiseman2017challenges; ma2019results; sellam2020bleurt), for appropriate evaluation.
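The one-word problem can be made concrete with example 6 of Table 10. The sketch below computes clipped n-gram precision, the overlap statistic at the core of BLEU, between the reference and the “subtable with metadata” output. It is not a full BLEU implementation (no brevity penalty, no geometric mean over n-gram orders, single reference), and the whitespace tokenization and the helper names `ngram_counts` and `clipped_precision` are assumptions made for illustration.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(prediction: str, reference: str, n: int) -> float:
    """Fraction of prediction n-grams found in the reference, with each
    n-gram's count clipped to its count in the reference."""
    pred = ngram_counts(prediction.split(), n)
    ref = ngram_counts(reference.split(), n)
    total = sum(pred.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in pred.items())
    return matched / total


# Example 6 from Table 10: the prediction says "eight" where the table supports "six".
reference = ("the 1956 grand prix motorcycle racing season consisted of six "
             "grand prix races in five classes: 500cc, 350cc, 250cc, 125cc "
             "and sidecars 500cc.")
prediction = ("the 1956 grand prix motorcycle racing season consisted of eight "
              "grand prix races in five classes: 500cc, 350cc, 250cc, 125cc "
              "and sidecars 500cc.")

for n in range(1, 5):
    print(n, round(clipped_precision(prediction, reference, n), 3))
```

Under this tokenization, 22 of the 23 unigrams and 16 of the 20 four-grams in the prediction match the reference, so the overlap stays high even though the stated number of races is wrong.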

10 Conclusion

In this work, we presented ToTTo, a large, English table-to-text dataset that presents both a controlled generation task and a data annotation process based on iterative sentence revision. We also provided several state-of-the-art baselines and demonstrated that ToTTo could be a useful dataset for modeling research as well as for developing evaluation metrics that can better detect model improvements.

ToTTo  is available at


Acknowledgements

The authors wish to thank Ming-Wei Chang, Jonathan H. Clark, Kenton Lee, and Jennimaria Palomaki for their insightful discussions and support. Many thanks also to Ashwin Kakarla and his team for help with the annotations.