DART: Open-Domain Structured Data Record to Text Generation

07/06/2020 ∙ by Dragomir Radev, et al. ∙ Yale University, Salesforce

We introduce DART, a large dataset for open-domain structured data record to text generation. We consider the structured data record input as a set of RDF entity-relation triples, a format widely used for knowledge representation and semantics description. DART consists of 82,191 examples across different domains with each input being a semantic RDF triple set derived from data records in tables and the tree ontology of the schema, annotated with sentence descriptions that cover all facts in the triple set. This hierarchical, structured format with its open-domain nature differentiates DART from other existing table-to-text corpora. We conduct an analysis of DART on several state-of-the-art text generation models, showing that it introduces new and interesting challenges compared to existing datasets. Furthermore, we demonstrate that finetuning pretrained language models on DART facilitates out-of-domain generalization on the WebNLG 2017 dataset. DART is available at https://github.com/Yale-LILY/dart.




1 Introduction

Automatically generating textual descriptions from structured data inputs is crucial to improving the accessibility of knowledge bases to lay users. Such applications include explaining data records to non-experts in the healthcare domain Cawsey et al. (1997), writing sports news Chen and Mooney (2008), summarizing information in multiple documents Fan et al. (2019), and generating dialogue responses Wen et al. (2015).

While significant research progress has been made in this field, there are still several issues with existing data-to-text datasets. First, they mostly deal with knowledge sources with a flat ontology structure such as slot-value pairs (e.g., WikiBio Lebret et al. (2016), the E2E dataset Novikova et al. (2017b), dialog response generation Wen et al. (2016)) and tables (e.g., RotoWire Wiseman et al. (2017), WikiTableText Bao et al. (2018), LogicNLG Chen et al. (2020a), ToTTo Parikh et al. (2020)). The flat structure representation is not powerful enough to encode rich semantic relationships in the ontology, for example, transitive functional dependencies such as person - city - population. ToTTo Parikh et al. (2020), which appeared on arXiv after we had started working on DART, is a recent table-to-text dataset pairing a one-sentence description with a Wikipedia table and a set of highlighted table cells. Its authors found that using only highlighted cells with flat row and column headers led to higher performance than using the whole table. Second, some of them only focus on a small number of domains (e.g., WebNLG Gardent et al. (2017b) on 15 categories from DBPedia, WikiBio on biographies, E2E on restaurants, RotoWire on basketball, MLB on baseball Puduppully et al. (2019), WeatherGov Liang et al. (2009) on weather forecasts, RoboCup Chen and Mooney (2008) on soccer reports). Furthermore, some of them only have loose alignments between data input and sentence due to the automatic generation procedure (e.g., RotoWire, Neural Wikipedian Vougiouklis et al. (2018), and T-Rex Elsahar et al. (2018)).

Figure 1: Overview of our human annotation procedure. Top panel: We collect the parent-child relations between columns from internal annotators (yellow is parent, green is child). Then, we collect a surface realization of the selected cells (highlighted in orange). This “NFL Europe Stadiums” example appears in the WikiSQL component of DART. Middle panel: We use the provided parent-child relations to construct an ontology tree on the columns, then select the nodes corresponding to the highlighted cells. We gather a connected subtree by collecting all nodes leading up to the highlighted cells’ lowest common ancestor. Bottom panel: We extract a set of triples from the subtree as shown. This tripleset is paired with the provided surface realization to form an entry in DART.

To address some of these issues and to encourage further research in natural language generation from semantic data, we introduce DART, a large and open-domain structured DAta Record to Text generation corpus. DART provides high-quality sentence annotations with each input being a set of entity-relation triples in a tree structure. As shown in Figure 2, to construct DART, we combine reliable human annotations and an automatic conversion procedure from two existing open-domain Question Answering datasets, while also incorporating two other existing corpora. More specifically, we use open-domain tables from Wikipedia and ask human annotators to construct a tree-structured ontology of the column headers. Then we automatically choose a subset of the columns by sampling a connected component from the ontology tree, ensuring that the sampled subset is semantically valid and continuous. We present the table row with the chosen cells highlighted and ask the annotators to describe the highlighted parts in natural language. We illustrate this annotation procedure in Figure 1. Furthermore, we introduce an automatic construction procedure by converting the text-to-SQL dataset WikiSQL into tripleset-sentence pairs. To this end, we execute the SQL command to get the answer to the question, convert the question-answer pair into a declarative sentence using a set of expert rules with part-of-speech and dependency parsing information Demszky et al. (2018), and then use a string matching heuristic to find corresponding table cells to build tripleset-sentence pairs.

Evaluating several state-of-the-art table-to-text models on DART, we found that while these models can achieve high results on domain-specific datasets, they don’t perform as well on DART due to its open-domain nature and structured ontology representations.

Our contributions are as follows. (1) We present a large and open-domain corpus for structured data record to text generation with each input as a set of entity-relation triples based on the tree ontology of the table. This hierarchical structured format differentiates our corpus from other data sets including both domain-specific and open-domain table-to-text corpora. (2) We benchmark on several state-of-the-art table-to-text models showing that DART introduces new open-domain generalization challenges with the hierarchical structure of the semantic triple inputs. (3) We also demonstrate that using the training set of DART for data augmentation improves the pretrained language models on WebNLG 2017 because DART helps out-of-domain generalization due to its open-domain nature.

Dataset | Input Unit | Examples | Vocab Size | Words per SR | Sents per SR | Tables
WikiTableText | Row | 13,318 | - | 13.9 | 1.0 | 4,962
LogicNLG | Table | 37,015 | 122K | 13.8 | 1.0 | 7,392
ToTTo | Highlighted Cells | 136,161 | 136K | 17.4 | 1.0 | 83,141
DART | Triple Set | 82,191 | 33.2K | 21.6 | 1.5 | 5,623
Table 1: DART compared with other open-domain table-to-text datasets. DART takes triple sets as input by incorporating the tree ontology, and its surface realizations tend to be longer with more than single sentence verbalization. These statistics are computed from DART v1.1.1. SR: Surface Realization.

2 Related Work

Data-to-text generation aims to produce natural language output from structured input. Applications include generating sports commentaries (Tanaka-Ishii et al., 1998; Chen and Mooney, 2008; Wiseman et al., 2017; Wang, 2019; Taniguchi et al., 2019), weather forecasts (Goldberg et al., 1994; Reiter et al., 2005; Liang et al., 2009; Konstas and Lapata, 2012; Mei et al., 2016), biographical texts (Lebret et al., 2016; Sha et al., 2018; Liu et al., 2018; Vougiouklis et al., 2018; Perez-Beltrachini and Lapata, 2018), knowledge-base descriptions (O’Donnell et al., 2000; Banik et al., 2013; Gardent et al., 2017a, b; Novikova et al., 2017b; Wang et al., 2018; Yu et al., 2019), code comment (Iyer et al., 2016), dialogue response generation (Wen et al., 2015, 2016), and commonsense reasoning (Lin et al., 2019; Rajani et al., 2020). Yet, most existing datasets are restricted to specific domains and applications. In contrast, a major source of DART is from Wikipedia tables covering various domains and topics.

Data-to-text datasets take different formats, including slot-value pairs, Abstract Meaning Representation (AMR) Flanigan et al. (2016); Ferreira et al. (2017); Song et al. (2017); Zhu et al. (2019a); Ribeiro et al. (2019); Damonte and Cohen (2019); Wang et al. (2020a); Zhao et al. (2020b); Shen et al. (2020), Minimal Recursion Semantics (MRS) Hajdik et al. (2019), Resource Description Framework (RDF) triples Gardent et al. (2017b); Vougiouklis et al. (2018); Distiawan et al. (2018); Elsahar et al. (2018); Zhu et al. (2019b), logic forms Chen et al. (2020b), and tables. Recently, some open-domain table-to-text datasets have been proposed, including WikiTableText Bao et al. (2018), LogicNLG Chen et al. (2020a), and ToTTo Parikh et al. (2020). While DART is also constructed by annotating tables, we introduce a tree-structured ontology on the table headers and an automatic procedure to extract connected components from the ontology as highlighted table cells. This encodes hierarchical relationships among table headers while ensuring the highlighted part is logically consistent and can be described in text without loss of information.

Traditional data-to-text models break the generation process into different stages such as signal analysis, data interpretation, document planning, microplanning, and realization Reiter and Dale (2000); Reiter (2007). Recently, neural encoder-decoder models based on attention and copy mechanisms have shown promising results (Wiseman et al., 2017; Liu et al., 2018; Gehrmann et al., 2018; Puduppully et al., 2018, 2019; Iso et al., 2019; Castro Ferreira et al., 2019; Shen et al., 2019; Roberti et al., 2019; Zhu et al., 2019b; Koncel-Kedziorski et al., 2019; Ye et al., 2020; Harkous et al., 2020; Wang et al., 2020b; Song et al., 2020; Shahidi et al., 2020). Furthermore, pretrained language models such as GPT-2 Radford et al. (2018), UniLM Dong et al. (2019), CTRL Keskar et al. (2019), BERT-to-BERT Rothe et al. (2020), DialoGPT Zhang et al. (2020b), T5 Raffel et al. (2019), and BART Lewis et al. (2020) have shown very effective results for text generation tasks such as machine translation, summarization, conversation response generation, and abstractive QA. Chen et al. (2020c); Peng (2020); Kale (2020) also finetune pretrained language models on data-to-text tasks.

To evaluate the performance of different NLG methods, automatic evaluation measures such as BLEU Papineni et al. (2002), ROUGE Lin (2004), and METEOR Banerjee and Lavie (2005) are widely used to replace costly human judgments. However, these measures are limited in how well they capture the semantic meanings of words or phrases; Novikova et al. (2017a) show that they fail to correlate well with human judgment. Recently, embedding-based evaluation measures have been proposed to tackle this problem. Typical examples include BERTScore Zhang et al. (2020a), YiSi-1 Lo (2019), WMD Kusner et al. (2015), WMDO Chow et al. (2019), MoverScore Zhao et al. (2019), and BLEURT Sellam et al. (2020). They first compute the semantic similarity between word representations produced by word embedding models such as word2vec Mikolov et al. (2013) or contextual embedding models such as BERT Devlin et al. (2019). The final score is then given by different intrinsic metrics, such as generalized precision, recall, and F-score used in BERTScore, and Earth Mover’s distance used in WMD, WMDO, and MoverScore. Furthermore, Dhingra et al. (2019) propose PARENT, which explicitly aligns n-grams from the reference and generated text to the table information.

3 DART Data Collection

Figure 2: DART data collection pipeline. MR: Meaning Representation.
DART: 62,659 train / 6,980 dev / 12,552 test
Source | WikiTableQuestions (Internal) | WikiTableQuestions (MTurk) | WikiSQL (Internal) | WikiSQL (Declarative) | WebNLG | Cleaned E2E
Domains | Wikipedia (open-domain) | Wikipedia (open-domain) | Wikipedia (open-domain) | Wikipedia (open-domain) | 15 DBPedia Categories | Restaurants
Unique Predicates | 1,950 | 1,403 | 493 | 2,008 | 347 | 7
Unique Triples | 13,505 | 5,541 | 1,648 | 7,787 | 3,220 | 946
Tripleset-Sentence Pairs | 4,902 | 2,120 | 772 | 4,204 | 27,731 | 42,462
Triples per Tripleset (min, med, max) | 1, 3, 10 | 1, 3, 7 | 1, 2, 7 | 1, 2, 10 | 1, 3, 7 | 1, 4, 7
Vocab Size | 13.4K | 8.9K | 3.0K | 10.7K | 8.0K | 3.0K
Words per SR | 15.2 | 16.5 | 14.0 | 12.6 | 22.5 | 22.9
Sentences per SR | 1.0 | 1.1 | 1.0 | 1.0 | 1.4 | 1.6
Table 2: Statistics of DART decomposed by different collection methods. DART exhibits a great deal of topical variety in terms of the number of unique predicates, the number of unique triples, and the vocabulary size. These statistics are computed from DART v1.1.1; the number of unique predicates reported is post-unification (see Section 3.4). SR: Surface Realization.

We build DART from existing datasets that cover a variety of different domains while allowing us to build a tree ontology and form RDF triple sets as our semantic representation. The data statistics are summarized in Tables 1 and 2. Compared with WebNLG and Cleaned E2E in Table 2, DART exhibits more topical variety in terms of the number of unique predicates, the number of unique triples, and the vocabulary size. Compared to other open-domain table-to-text datasets in Table 1, DART takes triple sets as input by incorporating the tree ontology, and its surface realizations tend to be longer, often spanning more than a single sentence.

As illustrated in Figure 2, DART is constructed using multiple complementary methods: (1) human annotation on open-domain Wikipedia tables from WikiTableQuestions Pasupat and Liang (2015) and WikiSQL Zhong et al. (2017) (Section 3.1), (2) automatic conversion of questions in WikiSQL to declarative sentences (Section 3.2), and (3) incorporation of existing datasets including WebNLG 2017 Gardent et al. (2017a, b); Shimorina and Gardent (2018) and Cleaned E2E Novikova et al. (2017b); Dušek et al. (2018, 2019) (Section 3.3).

We also explored automatic alignments between the knowledge base and text including Neural Wikipedian Vougiouklis et al. (2018) and T-Rex Elsahar et al. (2018). Although these datasets are large in size with natural sentences in a variety of domains, their automatic data construction procedures cannot guarantee good quality alignment: the sentence is often noisy with omitted or hallucinated information compared with the paired triple. So we ended up not including these in the current version of DART.

3.1 Tree Ontology and Sentence Annotation on Tables

Tables are sources of structured data and have been used for question answering (WikiTableQuestions), semantic parsing (WikiSQL), and table-to-text generation (ToTTo) tasks. Here, we aim to collect triple-sentence pairs from open-domain Wikipedia tables in the WikiTableQuestions and WikiSQL datasets. However, such tables have a flat structure, making them not directly usable for building (subject, predicate, object) triples, which are critical in capturing rich relationships in the data. We propose a two-stage annotation process for constructing tripleset-sentence pairs based on a tree-structured ontology of each table. First, internal skilled annotators denote the parent column for each column header. Then, a larger number of annotators provide a sentential description of an automatically-chosen subset of table cells in a row. To form a tripleset-sentence pair, the highlighted cells can be converted to a connected tripleset automatically according to the column ontology for the given table.

Tree Ontology Annotation

For each column in a given table, our internal annotators labeled its ontological parent. In Figure 1, for example, the annotator would provide the sequence {(blank), Team, Stadium, Stadium, Team}: column Team has no parent, Stadium has parent Team, Capacity has parent Stadium, and so on. In many cases, the relationship between a parent column and its child column can loosely be conceptualized as a “has-a” relationship.

In some tables, the table title itself (rather than a column) serves as the root of the column ontology (e.g., Figure 4). In these cases, the annotator assigns [TITLE] as the parent of the relevant columns. Note that by construction, annotators are prohibited from assigning parents to [TITLE]. Ontologies are rejected and corrections are requested if the provided ontology is disconnected or contains a cycle.

Following the procedure above, we now have a tree structure in which each node is either a column in the table or [TITLE]. The tree root is either [TITLE] or a column name. However, in many cases, even if the table title is not included as a node in the ontology, it provides important context for understanding the table’s rows and the annotated sentence (e.g., Figure 5). To handle this case, we add a dummy [TABLECONTEXT] node as the parent of the existing tree’s root node. This context node doesn’t represent a data entity, but instead unifies the table’s title into the ontology. If the former root node was [TITLE], we still add the [TABLECONTEXT] node, which then has a single child, [TITLE]. The [TABLECONTEXT] node is needed in this case because the [TITLE] node requires a parent during triple formation.

In this way, we generate a fully-connected tree of column names and the two special nodes, [TABLECONTEXT] and [TITLE]. Every tree contains both special nodes. These trees exhibit a great deal of structural variety, and relevant statistics are summarized in Table 3 and Figure 3.
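The validation and special-node steps above can be sketched in a few lines. This is a minimal illustration rather than the annotation tool itself: the function name `build_ontology`, the child-to-parent dictionary layout, and the column name `Opened` are assumptions made for the example, and attaching [TITLE] directly under [TABLECONTEXT] when a column is the root is one plausible reading of "every tree contains both special nodes".

```python
from typing import Dict, List, Optional

TITLE = "[TITLE]"
TABLECONTEXT = "[TABLECONTEXT]"


def build_ontology(columns: List[str],
                   parents: List[Optional[str]]) -> Dict[str, Optional[str]]:
    """Return a child -> parent map rooted at [TABLECONTEXT].

    Raises ValueError for duplicate columns, unknown parents,
    disconnected ontologies, or cycles.
    """
    if len(columns) != len(parents):
        raise ValueError("one parent label is required per column")
    if len(set(columns)) != len(columns):
        raise ValueError("duplicate column names")

    parent_of: Dict[str, Optional[str]] = dict(zip(columns, parents))
    for child, parent in parent_of.items():
        if parent is not None and parent != TITLE and parent not in parent_of:
            raise ValueError(f"unknown parent {parent!r} for column {child!r}")

    # Exactly one root: either a single parentless column or [TITLE].
    uses_title = TITLE in parent_of.values()
    roots = [c for c, p in parent_of.items() if p is None]
    if len(roots) + int(uses_title) != 1:
        raise ValueError("ontology is disconnected: expected exactly one root")

    # Walking up from any column must terminate without revisiting a node.
    for start in columns:
        seen = set()
        node: Optional[str] = start
        while node is not None and node != TITLE:
            if node in seen:
                raise ValueError(f"cycle detected at column {node!r}")
            seen.add(node)
            node = parent_of[node]

    tree = dict(parent_of)
    if not uses_title:
        tree[roots[0]] = TABLECONTEXT  # former column root hangs off the context node
    tree[TITLE] = TABLECONTEXT         # assumption: [TITLE] is always a child of the context node
    tree[TABLECONTEXT] = None
    return tree


# The Figure 1 parent sequence; "Opened" is an assumed name for the fourth column.
example = build_ontology(
    ["Team", "Stadium", "Capacity", "Opened", "City"],
    [None, "Team", "Stadium", "Stadium", "Team"],
)
```

Storing the tree as a child-to-parent map keeps the cycle check and the later walk toward a common ancestor to simple pointer chasing.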

Source | Tables | Ontology depth (min, med, max) | Nodes in ontology (min, med, max) | Branching factor (mean)
WikiTableQuestions | 2060 | 1, 1, 4 | 2, 6, 25 | 4.0
WikiSQL | 3563 | 1, 1, 4 | 3, 7, 25 | 5.1
Table 3: Properties of the ontology in the WikiTableQuestions and WikiSQL samples in DART. Branching factor refers to the average number of children across all non-leaf nodes in a table’s ontology. These statistics were computed on DART v1.1.1.

Some tables are malformed and have duplicate or missing column names (e.g., Figure 6). In these cases, human annotators either changed or added appropriate column names in order to fit these constraints. Additionally, we require that the annotated ontology encodes a valid tree structure with each of the column names present. We verify that all headers appear as a node, nodes refer only to column headers, and that each of these tree structures contains no cycles. We introduce no new “abstract” nodes to encode relationships between column header nodes for the sake of consistency, scale, and robustness. Note that for many tables, the determination of an ontology is a subjective process with many “correct” answers. For example, swapping the positions of Team and City in the tree in Figure 1 produces an equally valid ontology for the referenced table. If there are multiple ways to construct an ontology based on annotators’ decisions of attribute relationships among column headers, we manually unify the annotation agreement for similar tables (for example, tables about athletes in different sports).

Connected Component Extraction

After we label the ontology, we automatically choose a subset of cells in a table row. Randomly selecting cells leads to poor quality annotation as the selected data could lack a subject, lack cohesion, or would require information not encoded in the ontology to formulate a coherent sentence. For example, in Figure 1, if only two nodes City and Capacity were highlighted then a coherent sentence cannot be produced as there is no direct logical relationship (functional dependency) between them. To solve these issues, instead of randomly selecting cells in a row, we extract connected components from the ontology. In addition, the component will always include the root (unless the root node is a table’s title) as it usually denotes the subject of the row.

The extracted components have two controllable properties: size and shape. To create variation in size, we randomly sampled a component size between two and five, inclusive. The shape can be characterized by two numbers: the number of sibling node pairs and the number of parent-child node pairs. Increasing the number of sibling node pairs creates a wider tree, while increasing the number of parent-child node pairs creates a deeper tree. We created a sliding scale between wide and deep trees using an expansion parameter, $p$. Component search recursively searches through the ontology’s nodes. At each stage, it descends into the current node’s children (if it has any) with probability $p$, and otherwise moves to a sibling if one exists. If $p = 1$, the search becomes a DFS, and if $p = 0$, it becomes a BFS. We found that randomly selecting $p$ from 0.5 to 0.7 created a reasonable variation in extracted component shapes. This ensures a balance between the breadth and depth of the selected cells for sentence annotation.
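One way to realize a search that interpolates between depth-first and breadth-first behavior is sketched below. It is an illustrative stand-in, not the authors' implementation: the expansion parameter is called `p` here, the frontier is a double-ended queue, and with probability `p` the most recently discovered node is taken (stack behavior, deeper components), otherwise the oldest one (queue behavior, wider components). The function name `extract_component`, the `children` mapping, and the column name `Opened` are assumptions.

```python
import random
from collections import deque
from typing import Dict, List, Optional


def extract_component(children: Dict[str, List[str]], root: str, size: int,
                      p: float, rng: Optional[random.Random] = None) -> List[str]:
    """Select up to `size` nodes forming a connected component containing `root`."""
    rng = rng or random.Random()
    frontier = deque([root])
    selected: List[str] = []
    while frontier and len(selected) < size:
        # p = 1 always pops the newest node (DFS); p = 0 always pops the oldest (BFS).
        node = frontier.pop() if rng.random() < p else frontier.popleft()
        selected.append(node)
        # Only children of already selected nodes enter the frontier,
        # so the selection stays connected by construction.
        frontier.extend(reversed(children.get(node, [])))
    return selected


# Ontology from the Figure 1 example; "Opened" is an assumed column name.
children = {"Team": ["Stadium", "City"], "Stadium": ["Capacity", "Opened"]}
rng = random.Random(0)
size = rng.randint(2, 5)      # component size between two and five, inclusive
p = rng.uniform(0.5, 0.7)     # expansion parameter in the reported range
component = extract_component(children, "Team", size, p, rng)
```

Because the root is always popped first, the sketch also respects the requirement that the component includes the root of the ontology.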

Figure 3: Distribution of column ontology depths in the WikiTableQuestions and WikiSQL samples in DART v1.1.1.

Sentence Annotation

We collect annotations from two sources: internal annotators in our group (Figure 7) and Amazon Mechanical Turk (MTurk) workers (Figure 9). We showed both annotator groups the table title and table row. When collecting internal annotations, we asked annotators to highlight a subset of cells from the row. When collecting annotations from MTurk, we pre-selected and highlighted cells according to the connected component extraction method described above. Both annotator groups were then asked to write natural sentences describing the highlighted cells.

We encouraged the annotators to use diverse vocabulary and syntactic structures. We also asked annotators whether or not they used information (either explicit or entailed) from the table title in producing each sentence. In particular, this information is used to determine whether or not the [TITLE] node should be included in the ontology subtree of interest. In our WikiSQL sentences, many of which came from automatic declarative sentence generation, 5.5% of sentences included the table title. In our WikiTableQuestions sentences, which were internally annotated, 43.0% of sentences included the title.

A small-scale review revealed that a number of the sentences provided were nonsensical, ungrammatical, or incorrect. Additionally, we found that annotators often misreported whether or not they used information from table titles in producing sentences. Therefore, a team of skilled annotators reviewed every crowdsourced sentence and marked whether it (a) was correct and (b) used information from the table title or not. The skilled annotators either rewrote or discarded the sentences that were nonsensical or incorrect. In some cases, they also changed cell highlighting patterns to match the sentence provided.

Build Tripleset-Sentence Pairs

Finally, we convert the highlighted cells to triples according to the ontology annotation as shown in Figure 1. To do this for a row $R$, we start with the source table’s column ontology $T$. To extract triples from $R$, we first place the cell values in $R$ in their corresponding slots in $T$. For the example ontology in Figure 1, we fill Team with “Amsterdam Admirals,” Stadium with “Olympisch Stadion,” etc. We then check that the nodes of $T$ corresponding to the highlighted cells in $R$ form a connected subtree. If this is not the case, we walk up the tree, highlighting each traversed node up until the lowest common ancestor of the highlighted nodes (inclusive). The selected nodes form a connected subtree. For each node $N$ in the subtree except the root node, we can extract the triple (parent($N$), title($N$), $N$), that is, the cell value of the parent node, the column name of $N$, and the cell value of $N$. For example, since Stadium is highlighted in Figure 1, we extract the triple (Amsterdam Admirals, Stadium, Olympisch Stadion). All the triples extracted from $R$, together with their sentence annotation, form a tripleset-sentence pair. A small number of triplesets contained more than 10 triples; we discarded these because their associated surface realizations were long, meandering, and of uniformly poor quality.
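The conversion from highlighted cells to triples can be sketched as follows, assuming the ontology is stored as a child-to-parent map over column names and the row as a column-to-value dictionary. The helper names, the column name `Opened`, and the bracketed placeholder cell values are hypothetical; only the Team and Stadium values come from the Figure 1 example. The special [TITLE] and [TABLECONTEXT] nodes are left out to keep the sketch short.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]


def path_to_root(node: str, parent_of: Dict[str, Optional[str]]) -> List[str]:
    """List the nodes from `node` up to the ontology root, inclusive."""
    path = [node]
    while parent_of[path[-1]] is not None:
        path.append(parent_of[path[-1]])
    return path


def extract_triples(row: Dict[str, str],
                    parent_of: Dict[str, Optional[str]],
                    highlighted: List[str]) -> List[Triple]:
    """Close the highlighted columns into a connected subtree and emit its triples."""
    if not highlighted:
        return []
    paths = [path_to_root(col, parent_of) for col in highlighted]
    common = set(paths[0]).intersection(*(set(path) for path in paths[1:]))
    # Walking upward, the first node shared by every path is the lowest common ancestor.
    lca = next(node for node in paths[0] if node in common)
    selected = set()
    for path in paths:
        selected.update(path[: path.index(lca) + 1])
    # One triple per selected node except the subtree root:
    # (parent's cell value, column name, own cell value).
    return [(row[parent_of[col]], col, row[col])
            for col in parent_of if col in selected and col != lca]


parent_of = {"Team": None, "Stadium": "Team", "Capacity": "Stadium",
             "Opened": "Stadium", "City": "Team"}
row = {"Team": "Amsterdam Admirals", "Stadium": "Olympisch Stadion",
       "Capacity": "<capacity>", "Opened": "<opened>", "City": "<city>"}  # placeholders

triples = extract_triples(row, parent_of, ["Team", "Stadium"])
```

Highlighting only Capacity and City shows the upward walk: their lowest common ancestor is Team, so Stadium and Team are pulled into the subtree and three triples result.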

3.2 Automatically Converting Questions to Declarative Sentences

WikiSQL is a dataset of questions and corresponding SQL queries for tables from Wikipedia. We convert these questions to declarative sentences and automatically align the sentences to subsets of the table cells. Since we are interested in generating natural sentences for single records, we first filter out questions relating to multiple rows, i.e., those whose corresponding SQL query contains aggregate or multi-row commands (including MAX, MIN, COUNT, SUM, AVG, JOIN, INTERSECT, UNION, GROUP BY, ORDER BY). For the remaining questions relating to a single record in the table, we execute the SQL query to retrieve the answer. We convert the question-answer pair into a declarative sentence using the rule-based model of Demszky et al. (2018) from https://github.com/kelvinguu/qanli (their neural model code is not released), for example:

Question: In which year did Greece hold its last Summer Olympics?
Answer: 2004
Declarative Sentence: Greece held its last Summer Olympics in 2004.

To get the corresponding record, we change the SQL command to SELECT * so we can get the whole row from the table. We then find the corresponding table cells from this row that are from answer columns and WHERE condition columns. The corresponding table cells are then converted into RDF triples in the same way as we described in Section 3.1. In this way, we can get 4,237 sentences with on average two triples for each sentence. Examples of produced declarative sentences can be found in Figure 10.
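The filtering and SELECT * rewriting steps might look like the sketch below. It assumes queries are available as plain SQL strings (WikiSQL itself stores them in a structured form), uses simple keyword matching that would also reject a column that happens to be named like a keyword, and demonstrates the rewrite against a hypothetical in-memory SQLite table called `olympics`. None of the names come from the original pipeline.

```python
import re
import sqlite3
from typing import Optional, Tuple

# Keywords from the filtering rule above; whole-word, case-insensitive matching.
MULTI_ROW = re.compile(
    r"\b(MAX|MIN|COUNT|SUM|AVG|JOIN|INTERSECT|UNION|GROUP\s+BY|ORDER\s+BY)\b",
    re.IGNORECASE,
)
SELECT_CLAUSE = re.compile(r"^\s*SELECT\s+.*?\s+FROM\s+", re.IGNORECASE | re.DOTALL)


def is_single_record_query(sql: str) -> bool:
    """Keep only queries without aggregate or multi-row commands."""
    return MULTI_ROW.search(sql) is None


def to_select_star(sql: str) -> str:
    """Replace the selected columns with * so the whole row is returned."""
    return SELECT_CLAUSE.sub("SELECT * FROM ", sql, count=1)


def fetch_record(conn: sqlite3.Connection, sql: str) -> Optional[Tuple]:
    """Return the full row behind a single-record query, or None if filtered out."""
    if not is_single_record_query(sql):
        return None
    return conn.execute(to_select_star(sql)).fetchone()


# Hypothetical table mirroring the Greece example above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE olympics (country TEXT, year INTEGER)")
conn.execute("INSERT INTO olympics VALUES ('Greece', 2004)")
record = fetch_record(conn, "SELECT year FROM olympics WHERE country = 'Greece'")
```

The returned row contains both the answer column (`year`) and the WHERE condition column (`country`), which are the cells that would then be turned into triples.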

We employed a similar procedure for WikiTableQuestions, but the questions in WikiTableQuestions often require comparing across different records, so we ended up not including it.

3.3 Incorporating Existing Datasets

We incorporate the following existing datasets in the same tripleset-sentence format as discussed before.

WebNLG 2017

The WebNLG dataset (Gardent et al., 2017a) is a set of triples extracted from DBpedia, paired with human-annotated target text. We include the WebNLG 2017 dataset (https://gitlab.com/shimorina/webnlg-dataset/-/tree/master/webnlg_challenge_2017), consisting of 27,731 tripleset-text pairs with up to 7 RDF triples in a triple set covering 15 domains. It contains multi-sentential surface realizations, and the average number of sentences per realization is 1.44.

Cleaned E2E

The original E2E dataset Novikova et al. (2017b) includes dialogue act meaning representations and natural language references in the restaurant domain. The dialogue act meaning representations consist of slot-value pairs on eight attributes of restaurants such as name and food. Later, Dušek et al. (2019) provided Cleaned E2E (https://github.com/tuetschek/e2e-cleaning) by automatically fixing the dialogue acts to account for omissions and hallucinations in the text. We incorporate Cleaned E2E because of its strict alignment between the meaning representation and the text. The average number of sentences per realization is 1.59.

To convert a meaning representation (MR) to a tripleset we take the name slot — present in almost all the MRs — as the subject. For example, the MR (name[Alimentum], area[city centre], familyFriendly[no]) is converted to the tripleset {(Alimentum, area, city centre), (Alimentum, familyFriendly, no)}. We drop MRs which do not contain the name slot.
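The MR-to-tripleset conversion can be illustrated with a short parser. This is a sketch assuming MRs are given as comma-separated `slot[value]` strings as in the example above; the function name and the choice to return `None` for dropped MRs are assumptions.

```python
import re
from typing import List, Optional, Tuple

# A slot name (which may contain spaces) followed by its bracketed value.
SLOT = re.compile(r"([^,\[\]]+)\[([^\]]*)\]")


def mr_to_tripleset(mr: str) -> Optional[List[Tuple[str, str, str]]]:
    """Use the name slot as subject; return None when the MR has no name slot."""
    slots = [(key.strip(), value.strip()) for key, value in SLOT.findall(mr)]
    names = [value for key, value in slots if key == "name"]
    if not names:
        return None  # such MRs are dropped
    subject = names[0]
    return [(subject, key, value) for key, value in slots if key != "name"]


tripleset = mr_to_tripleset("name[Alimentum], area[city centre], familyFriendly[no]")
```

Every remaining slot becomes one triple with the slot name as the predicate, so the size of the tripleset is the number of slots minus one.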

End-to-End Transformer 19.87 0.26 0.65 0.28 0.87 -0.20
Seq-to-Seq with Attention 29.60 0.28 0.62 0.32 0.90 -0.11
BART 37.06 0.36 0.57 0.44 0.92 0.22
Table 4: Model results on the test set of DART v1.0.0b. ↑: Higher is better. ↓: Lower is better. BART performs the best due to its generalization ability from pretraining.
GCN-EC Marcheggiani and Perez-Beltrachini (2018) 55.90 - - 0.39 - - 0.41 - -
Pipeline Transformer Castro Ferreira et al. (2019) 56.28 23.04 42.41 0.42 0.21 0.32 0.39 0.63 0.50
Pipeline GRU Castro Ferreira et al. (2019) 56.09 25.12 42.73 0.42 0.22 0.33 0.39 0.64 0.51
MELBOURNE Gardent et al. (2017b) 54.52 33.27 45.13 0.41 0.33 0.37 0.40 0.55 0.47
BestPlan Moryossef et al. (2019b) 53.30 34.41 47.24 0.44 0.34 0.39 0.47 0.56 0.51
GTR-LSTM (Entity Masking) Distiawan et al. (2018) 58.60 34.10 - 0.41 0.32 - 0.42 0.58 -
DualEnc Zhao et al. (2020a) 63.45 36.73 51.42 0.46 0.37 0.41 0.34 0.55 0.44
CGE-LW (Levi Graph) Ribeiro et al. (2020) 63.69 - - 0.44 - - - - -
PlanEnc Zhao et al. (2020a) 64.42 38.23 52.78 0.45 0.37 0.41 0.33 0.53 0.42
T5-Large Kale (2020) 63.90 52.80 57.10 0.46 0.41 0.44 - - -
End-to-End Transformer Castro Ferreira et al. (2019) 49.77 4.87 31.41 0.39 0.08 0.24 0.47 0.85 0.64
Pipeline Transformer Castro Ferreira et al. (2019) 53.70 29.63 46.95 0.41 0.20 0.32 0.42 0.67 0.53
BestPlan Moryossef et al. (2019b) 56.31 34.98 47.01 0.44 0.36 0.40 0.41 0.56 0.48
NeuralPlan Moryossef et al. (2019a) 55.89 35.91 47.05 0.44 0.37 0.40 0.40 0.54 0.46
BART 51.86 26.23 39.05 0.42 0.30 0.36 0.46 0.85 0.64
 + DART 52.86 37.85 45.89 0.42 0.37 0.40 0.44 0.59 0.51
Table 5: The WebNLG 2017 results on the test set. : We report results from Zhao et al. (2020a) who use the evaluation scripts that are strictly the same as the official challenge resulting in different numbers from the original papers. : Results of our replications; we include previously unreported seen vs. unseen results from Moryossef et al. (2019b) by running evaluation scripts on their original output.

3.4 Predicate Unification

Combining triples from a wide variety of sources results in multiple predicates that represent the same concept. For example, WebNLG contains the predicate birthDate while the DART WikiTableQuestions sample contains the predicate Date of birth. Additionally, the wide syntactic variety present in Wikipedia tables results in the extraction of different triples that represent the same concept – for example, the DART WikiTableQuestions sample contains both Team and Club in reference to soccer teams. The triple-to-text task becomes much easier if these disparate predicates are unified, which allows a model to learn a single concept from a single predicate with a greater number of training examples. We manually constructed a predicate mapping table to achieve this. As an example, applying our predicate mapping procedure sends "Hometown," "Home Town," and "Home Town/City" (all of which appear as predicates pre-unification) to the unified predicate "HOMETOWN".
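A predicate mapping table of this kind can be applied with a plain dictionary lookup, as in the sketch below. Only the three "Hometown" variants come from the text; the function name, the sample triple, and the decision to leave unmapped predicates unchanged are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Excerpt of a manually constructed mapping table (entries taken from the example above).
PREDICATE_MAP: Dict[str, str] = {
    "Hometown": "HOMETOWN",
    "Home Town": "HOMETOWN",
    "Home Town/City": "HOMETOWN",
}


def unify_predicates(triples: List[Triple],
                     mapping: Dict[str, str] = PREDICATE_MAP) -> List[Triple]:
    """Rewrite each predicate to its unified form; unmapped predicates pass through."""
    return [(subj, mapping.get(pred, pred), obj) for subj, pred, obj in triples]


# Hypothetical triple used only to exercise the mapping.
unified = unify_predicates([("<player>", "Home Town/City", "<city>")])
```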

3.5 Entity Mapping and Delexicalization

Castro Ferreira et al. (2018) pointed out the benefits of delexicalization for NLG models as it allows the model to describe categories instead of noisy entity names (e.g. "President" instead of "Barack Obama").

We perform automatic delexicalization through the named entity extraction interface provided by spaCy (https://github.com/explosion/spaCy). First, we check if spaCy’s “xx_ent_wiki_sm” named entity recognition (NER) model returns a match for the extracted entity. If no match is found, we fall back to spaCy’s “en_core_web_lg” NER model. If this fails too, we check the WebNLG delexicalization for a match. Finally, if there are no matches found, we map the entity to an unknown token.
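The fallback order described above amounts to a first-match-wins cascade. The sketch below keeps it runnable without any model downloads by using dictionary lookups as stand-ins for the three recognizers; in practice the first two would wrap spaCy pipelines (for example, loading a model with `spacy.load` and reading `ent.label_` from `doc.ents`). The stand-in tables, the labels in them, and the `UNKNOWN` token are all hypothetical.

```python
from typing import Callable, Optional, Sequence

Recognizer = Callable[[str], Optional[str]]

UNKNOWN = "UNKNOWN"  # hypothetical unknown token


def delexicalize(entity: str, recognizers: Sequence[Recognizer],
                 unknown: str = UNKNOWN) -> str:
    """Return the first label any recognizer produces, else the unknown token."""
    for recognize in recognizers:
        label = recognize(entity)
        if label is not None:
            return label
    return unknown


# Stand-ins for the three stages, in the order given above.
multilingual_ner: Recognizer = {"Barack Obama": "PER"}.get            # xx_ent_wiki_sm stage
english_ner: Recognizer = {"Olympisch Stadion": "FAC"}.get           # en_core_web_lg stage
webnlg_delex: Recognizer = {"Alimentum": "RESTAURANT"}.get           # WebNLG lookup stage

cascade = [multilingual_ner, english_ner, webnlg_delex]
label = delexicalize("Barack Obama", cascade)
```

Passing the recognizers as a list keeps the ordering explicit and makes it easy to swap a stage without touching the cascade logic.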

3.6 Dataset Split

For WebNLG 2017 and Cleaned E2E, we use their original data splits. For our annotation on WikiTableQuestions and WikiSQL, random splitting would make the train, dev, and test splits contain similar tables and similar tripleset-sentence examples. This can produce models that simply memorize the training data. Therefore, to increase the generalization challenge, we compare the table titles and table headers to find similar tables, and make sure the model is evaluated on unseen tables in the test split. We first sample some tables as a seed test set, and then compute the Jaccard similarity with other tables on table titles and table headers. If another table has a Jaccard similarity greater than 0.5 with any of the tables in the seed test set, we add it to the test set. A similar process is repeated to create the dev set, and the remaining tables form the training set. This results in 62,659/6,980/12,552 sentences in the train/dev/test sets, respectively.
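The similarity-based expansion of the seed test set can be sketched as below. How titles and headers are tokenized is not specified, so the sketch assumes each table is represented by the set of lowercased whitespace tokens from its title and headers; the function names and the sample tables are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

Table = Tuple[str, List[str]]  # (title, column headers)


def signature(title: str, headers: List[str]) -> Set[str]:
    """Assumed representation: lowercased tokens of the title and headers."""
    return set(" ".join([title, *headers]).lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """|A intersect B| / |A union B|, defined as 0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def expand_test_set(tables: Dict[str, Table], seed_ids: Iterable[str],
                    threshold: float = 0.5) -> Set[str]:
    """Add every table whose similarity to any seed table exceeds the threshold."""
    seeds = set(seed_ids)
    sigs = {tid: signature(title, headers) for tid, (title, headers) in tables.items()}
    test_ids = set(seeds)
    for tid, sig in sigs.items():
        if tid not in seeds and any(jaccard(sig, sigs[s]) > threshold for s in seeds):
            test_ids.add(tid)
    return test_ids


# Hypothetical tables: t2 is a near-duplicate of the seed t1, t3 is unrelated.
tables = {
    "t1": ("NFL Europe Stadiums", ["Team", "Stadium", "Capacity"]),
    "t2": ("NFL Europe Stadiums 2005", ["Team", "Stadium", "Capacity"]),
    "t3": ("Summer Olympics hosts", ["Year", "City"]),
}
test_ids = expand_test_set(tables, ["t1"])
```

Here t1 and t2 share six of seven tokens (similarity about 0.86), so t2 joins the test set, while t3 shares none and stays available for the dev or training splits.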

4 Experimental Results

We conducted experiments on DART v1.0.0b and the WebNLG 2017 challenge dataset.

4.1 Models

We investigate several state-of-the-art data-to-text generation models.

Seq-to-Seq with Attention

We use an encoder-decoder architecture with an attention mechanism (Sutskever et al., 2014; Bahdanau et al., 2015). The encoder is a 2-layer bidirectional LSTM; the decoder uses 300-d word embeddings, 512 hidden units, and 0.3 dropout. The input is sequential, formed by concatenating the triples. Delexicalization is applied to the input triple sequence as well as the reference output sentences, in accordance with the WebNLG dataset delexicalization. This baseline did not use pretrained word vectors.
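As an illustration of forming a sequential input by concatenating triples, the sketch below joins the fields of each triple and then joins the triples. The separator strings and the example triples are assumptions; the exact linearization format used in the experiments is not specified here.

```python
def linearize(triples, field_sep=" | ", triple_sep=" ; "):
    """Concatenate (subject, predicate, object) triples into one string."""
    return triple_sep.join(field_sep.join(triple) for triple in triples)


# Hypothetical triple set sharing one subject.
example = [
    ("Alice Smith", "HOMETOWN", "Springfield"),
    ("Alice Smith", "TEAM", "Springfield FC"),
]
print(linearize(example))
# Alice Smith | HOMETOWN | Springfield ; Alice Smith | TEAM | Springfield FC
```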


End-to-End Transformer

We use the Transformer model Vaswani et al. (2017), previously used by Castro Ferreira et al. (2019) on the WebNLG dataset. We use the end-to-end transformer model, which generates text directly from an unordered (linearized) RDF triple set without explicit intermediate representations. The pipeline transformer requires additional annotation effort to produce gold-standard representations for the traditional pipeline steps, so we did not use it.


BART

BART is a pretrained sequence-to-sequence language model with a standard Transformer-based architecture, and Lewis et al. (2020) showed it to be very effective for text generation tasks such as machine translation, summarization, conversation response generation, and abstractive QA. We fine-tuned the BART-large model with a maximum output length of 128 and a learning rate of 3e-05, using the same linearization and delexicalization on RDF triple sets as described above.

We also tried the BestPlan Moryossef et al. (2019b) and NeuralPlan Moryossef et al. (2019a) models, both of which use a step-by-step approach on the WebNLG dataset, with a text-planning stage followed by a plan-realization stage. However, training and inference were too slow on DART for us to obtain results.

4.2 Evaluation Metrics

We use a variety of automatic metrics to evaluate the quality of the generated text. We report BLEU Papineni et al. (2002), METEOR Denkowski and Lavie (2014), and TER Snover et al. (2005), which are also used in the official WebNLG challenge. We also use MoverScore Zhao et al. (2019), BERTScore Zhang et al. (2020a), and BLEURT Sellam et al. (2020), which are newer metrics that use contextual embeddings to capture semantics rather than surface forms. We did not use PARENT Dhingra et al. (2019) because not all of our inputs are in tabular form (e.g., the ones from WebNLG and E2E).

4.3 Results


Our experimental results on DART are summarized in Table 4. The BART model has the highest performance among the three models, with a BLEU score of 37.06. We attribute this to the generalization ability BART gains from pretraining. However, language models such as BART are pretrained by reconstructing text and, as a result, we found that their output on DART often contains hallucinated words Parikh et al. (2020); Harkous et al. (2020); Reiter (2020). In addition, although pretraining improves text generation quality, the model does not fully capture the hierarchical ontology of the triple sets from their linearized input. The end-to-end transformer has the lowest performance, since the transformer model needs intermediate pipeline planning steps to perform well. Similar findings are reported in Castro Ferreira et al. (2019).


We also investigate whether DART can be used to improve performance on other data-to-text generation tasks. To this end, we finetune the BART model on the WebNLG 2017 challenge, and gradually augment the training data with the train split of DART. The experimental results can be found in Table 5. Our finetuned BART model achieves a BLEU score of 39.05 on all categories, and adding DART as additional training data improves it to 45.89. The improvement mainly comes from the unseen categories, where BLEU increases from 26.23 to 37.85, demonstrating that DART is particularly helpful for out-of-domain generalization due to its open-domain nature.

5 Conclusion

In this paper, we introduce DART as a large and open-domain corpus for structured data record to text generation. DART’s hierarchical and structured format differentiates it from other open-domain table-to-text corpora such as ToTTo. We found that DART introduces new challenges to several state-of-the-art data-to-text models due to its open-domain nature and the ontology structure of its semantic triple input. For future work, we will explore more controlled and high-fidelity generation using pretrained language models conditioned on semantic representations, with encoding methods that incorporate the ontology hierarchy.


References

  • D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In ICLR, Cited by: §4.1.
  • S. Banerjee and A. Lavie (2005) METEOR: an automatic metric for mt evaluation with improved correlation with human judgments. In ACL Workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, Cited by: §2.
  • E. Banik, C. Gardent, and E. Kow (2013) The KBGen challenge. In Proceedings of the 14th European Workshop on Natural Language Generation, Cited by: §2.
  • J. Bao, D. Tang, N. Duan, Z. Yan, Y. Lv, M. Zhou, and T. Zhao (2018) Table-to-text: describing table region with natural language. In AAAI, Cited by: §1, §2.
  • T. Castro Ferreira, D. Moussallem, E. Krahmer, and S. Wubben (2018) Enriching the WebNLG corpus. In INLG, Cited by: §3.5.
  • T. Castro Ferreira, C. van der Lee, E. van Miltenburg, and E. Krahmer (2019) Neural data-to-text generation: a comparison between pipeline and end-to-end architectures. In EMNLP, Cited by: §2, Table 5, §4.1, §4.3.
  • A. J. Cawsey, B. L. Webber, and R. B. Jones (1997) Natural language generation in health care. BMJ Group BMA House, Tavistock Square, London, WC1H 9JR. Cited by: §1.
  • D. L. Chen and R. J. Mooney (2008) Learning to sportscast: a test of grounded language acquisition. In ICML, Cited by: §1, §1, §2.
  • W. Chen, J. Chen, Y. Su, Z. Chen, and W. Y. Wang (2020a) Logical natural language generation from open-domain tables. In ACL, Cited by: §1, §2.
  • Z. Chen, W. Chen, H. Zha, X. Zhou, Y. Zhang, S. Sundaresan, and W. Y. Wang (2020b) Logic2Text: high-fidelity natural language generation from logical forms. arXiv preprint arXiv:2004.14579. Cited by: §2.
  • Z. Chen, H. Eavani, W. Chen, Y. Liu, and W. Y. Wang (2020c) Few-shot nlg with pre-trained language model. In ACL, Cited by: §2.
  • J. Chow, L. Specia, and P. Madhyastha (2019) WMDO: fluency-based word mover’s distance for machine translation evaluation. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), Cited by: §2.
  • M. Damonte and S. B. Cohen (2019) Structural neural encoders for amr-to-text generation. In NAACL, Cited by: §2.
  • D. Demszky, K. Guu, and P. Liang (2018) Transforming question answering datasets into natural language inference datasets. arXiv preprint arXiv:1809.02922. Cited by: §1, footnote 3.
  • M. Denkowski and A. Lavie (2014) Meteor universal: language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation, Cited by: §4.2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL, Cited by: §2.
  • B. Dhingra, M. Faruqui, A. Parikh, M. Chang, D. Das, and W. Cohen (2019) Handling divergent reference texts when evaluating table-to-text generation. In ACL, Cited by: §2, §4.2.
  • B. Distiawan, J. Qi, R. Zhang, and W. Wang (2018) GTR-lstm: a triple encoder for sentence generation from rdf data. In ACL, Cited by: §2, Table 5.
  • L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, J. Gao, M. Zhou, and H. Hon (2019) Unified language model pre-training for natural language understanding and generation. In NeurIPS, Cited by: §2.
  • O. Dušek, D. M. Howcroft, and V. Rieser (2019) Semantic noise matters for neural natural language generation. In INLG, Cited by: §3.3, §3.
  • O. Dušek, J. Novikova, and V. Rieser (2018) Findings of the E2E NLG Challenge. In INLG, Cited by: §3.
  • H. Elsahar, P. Vougiouklis, A. Remaci, C. Gravier, J. Hare, F. Laforest, and E. Simperl (2018) T-REx: a large scale alignment of natural language with knowledge base triples. In LREC, Cited by: §1, §2, §3.
  • A. Fan, C. Gardent, C. Braud, and A. Bordes (2019) Using local knowledge graph construction to scale seq2seq models to multi-document inputs. arXiv preprint arXiv:1910.08435. Cited by: §1.
  • T. C. Ferreira, I. Calixto, S. Wubben, and E. Krahmer (2017) Linguistic realisation as machine translation: comparing different mt models for amr-to-text generation. In INLG, Cited by: §2.
  • J. Flanigan, C. Dyer, N. A. Smith, and J. G. Carbonell (2016) Generation from abstract meaning representation using tree transducers. In NAACL, Cited by: §2.
  • C. Gardent, A. Shimorina, S. Narayan, and L. Perez-Beltrachini (2017a) Creating training corpora for NLG micro-planners. In ACL, Cited by: §2, §3.3, §3.
  • C. Gardent, A. Shimorina, S. Narayan, and L. Perez-Beltrachini (2017b) The WebNLG challenge: generating text from RDF data. In INLG, Cited by: §1, §2, §2, Table 5, §3.
  • S. Gehrmann, F. Dai, H. Elder, and A. Rush (2018) End-to-end content and plan selection for data-to-text generation. In INLG, Cited by: §2.
  • E. Goldberg, N. Driedger, and R. I. Kittredge (1994) Using natural-language processing to produce weather forecasts. IEEE Expert 9 (2), pp. 45–53. Cited by: §2.
  • V. Hajdik, J. Buys, M. W. Goodman, and E. M. Bender (2019) Neural text generation from rich semantic representations. In NAACL, Cited by: §2.
  • H. Harkous, I. Groves, and A. Saffari (2020) Have your text and use it too! end-to-end neural data-to-text generation with semantic fidelity. arXiv preprint arXiv:2004.06577. Cited by: §2, §4.3.
  • H. Iso, Y. Uehara, T. Ishigaki, H. Noji, E. Aramaki, I. Kobayashi, Y. Miyao, N. Okazaki, and H. Takamura (2019) Learning to select, track, and generate for data-to-text. In ACL, Cited by: §2.
  • S. Iyer, I. Konstas, A. Cheung, and L. Zettlemoyer (2016) Summarizing source code using a neural attention model. In ACL, Cited by: §2.
  • M. Kale (2020) Text-to-text pre-training for data-to-text tasks. arXiv preprint arXiv:2005.10433. Cited by: §2, Table 5.
  • N. S. Keskar, B. McCann, L. R. Varshney, C. Xiong, and R. Socher (2019) CTRL: a conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858. Cited by: §2.
  • R. Koncel-Kedziorski, D. Bekal, Y. Luan, M. Lapata, and H. Hajishirzi (2019) Text Generation from Knowledge Graphs with Graph Transformers. In NAACL, Cited by: §2.
  • I. Konstas and M. Lapata (2012) Unsupervised concept-to-text generation with hypergraphs. In NAACL, Cited by: §2.
  • M. Kusner, Y. Sun, N. Kolkin, and K. Weinberger (2015) From word embeddings to document distances. In ICML, Cited by: §2.
  • R. Lebret, D. Grangier, and M. Auli (2016) Neural text generation from structured data with application to the biography domain. In EMNLP, Cited by: §1, §2.
  • M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer (2020) BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In ACL, Cited by: §2, §4.1.
  • P. Liang, M. I. Jordan, and D. Klein (2009) Learning semantic correspondences with less supervision. In ACL, Cited by: §1, §2.
  • B. Y. Lin, M. Shen, W. Zhou, P. Zhou, C. Bhagavatula, Y. Choi, and X. Ren (2019) CommonGen: a constrained text generation challenge for generative commonsense reasoning. CoRR abs/1911.03705. Cited by: §2.
  • C. Lin (2004) ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out, Cited by: §2.
  • T. Liu, K. Wang, L. Sha, B. Chang, and Z. Sui (2018) Table-to-text generation by structure-aware seq2seq learning. In AAAI, Cited by: §2, §2.
  • C. Lo (2019) YiSi-a unified semantic mt quality evaluation and estimation metric for languages with different levels of available resources. In Proceedings of the Fourth Conference on Machine Translation, Cited by: §2.
  • D. Marcheggiani and L. Perez-Beltrachini (2018) Deep graph convolutional encoders for structured data to text generation. In INLG, Cited by: Table 5.
  • H. Mei, M. Bansal, and M. R. Walter (2016) What to talk about and how? selective generation using LSTMs with coarse-to-fine alignment. In NAACL, Cited by: §2.
  • T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In NeurIPS, Cited by: §2.
  • A. Moryossef, Y. Goldberg, and I. Dagan (2019a) Improving quality and efficiency in plan-based neural data-to-text generation. In INLG, Cited by: Table 5, §4.1.
  • A. Moryossef, Y. Goldberg, and I. Dagan (2019b) Step-by-step: Separating planning from realization in neural data-to-text generation. In NAACL, Cited by: Table 5, §4.1.
  • J. Novikova, O. Dušek, A. C. Curry, and V. Rieser (2017a) Why we need new evaluation metrics for nlg. In EMNLP, Cited by: §2.
  • J. Novikova, O. Dusek, and V. Rieser (2017b) The E2E dataset: new challenges for end-to-end generation. In SIGDIAL, Cited by: §1, §2, §3.3, §3.
  • M. O’Donnell, A. Knott, J. Oberlander, and C. Mellish (2000) Optimising text quality in generation from relational databases. In INLG, Cited by: §2.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In ACL, Cited by: §2, §4.2.
  • A. P. Parikh, X. Wang, S. Gehrmann, M. Faruqui, B. Dhingra, D. Yang, and D. Das (2020) ToTTo: a controlled table-to-text generation dataset. arXiv preprint arXiv:2004.14373. Cited by: §1, §2, §4.3.
  • P. Pasupat and P. Liang (2015) Compositional semantic parsing on semi-structured tables. In ACL, Cited by: §3.
  • B. Peng (2020) Few-shot natural language generation for task-oriented dialog. In arXiv, Cited by: §2.
  • L. Perez-Beltrachini and M. Lapata (2018) Bootstrapping generators from noisy data. In NAACL, Cited by: §2.
  • R. Puduppully, L. Dong, and M. Lapata (2018) Data-to-text generation with content selection and planning. In AAAI, Cited by: §2.
  • R. Puduppully, L. Dong, and M. Lapata (2019) Data-to-text generation with entity modeling. In ACL, Cited by: §1, §2.
  • A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. Technical report OpenAI. Cited by: §2.
  • C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2019) Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683. Cited by: §2.
  • N. F. Rajani, R. Zhang, Y. C. Tan, S. Zheng, J. Weiss, A. Vyas, A. Gupta, C. Xiong, R. Socher, and D. Radev (2020) ESPRIT: Explaining Solutions to Physical Reasoning Tasks. In ACL, Cited by: §2.
  • E. Reiter and R. Dale (2000) Building natural language generation systems. Cambridge university press. Cited by: §2.
  • E. Reiter, S. Sripada, J. Hunter, J. Yu, and I. Davy (2005) Choosing words in computer-generated weather forecasts. Artificial Intelligence 167 (1-2), pp. 137–169. Cited by: §2.
  • E. Reiter (2007) An architecture for data-to-text systems. In Proceedings of the Eleventh European Workshop on Natural Language Generation (ENLG 07), Cited by: §2.
  • E. Reiter (2020) OpenAI gpt system: what does it do?. Technical report Arria. Cited by: §4.3.
  • L. F. R. Ribeiro, C. Gardent, and I. Gurevych (2019) Enhancing AMR-to-text generation with dual graph representations. In EMNLP, Cited by: §2.
  • L. F. Ribeiro, Y. Zhang, C. Gardent, and I. Gurevych (2020) Modeling global and local node contexts for text generation from knowledge graphs. Transactions of the Association for Computational Linguistics. Cited by: Table 5.
  • M. Roberti, G. Bonetta, R. Cancelliere, and P. Gallinari (2019) Copy mechanism and tailored training for character-based data-to-text generation. In ECML-PKDD, Cited by: §2.
  • S. Rothe, S. Narayan, and A. Severyn (2020) Leveraging pre-trained checkpoints for sequence generation tasks. Transactions of the Association for Computational Linguistics. Cited by: §2.
  • T. Sellam, D. Das, and A. P. Parikh (2020) BLEURT: learning robust metrics for text generation. In ACL, Cited by: §2, §4.2.
  • L. Sha, L. Mou, T. Liu, P. Poupart, S. Li, B. Chang, and Z. Sui (2018) Order-planning neural text generation from structured data. In AAAI, Cited by: §2.
  • H. Shahidi, M. Li, and J. Lin (2020) Two birds, one stone: a simple, unified model for text generation from structured and unstructured data. In ACL, Cited by: §2.
  • S. Shen, D. Fried, J. Andreas, and D. Klein (2019) Pragmatically informative text generation. In NAACL, Cited by: §2.
  • X. Shen, E. Chang, H. Su, J. Zhou, and D. Klakow (2020) Neural data-to-text generation via jointly learning the segmentation and correspondence. In ACL, Cited by: §2.
  • A. Shimorina and C. Gardent (2018) Handling rare items in data-to-text generation. In INLG, Cited by: §3.
  • M. Snover, B. Dorr, R. Schwartz, J. Makhoul, L. Micciulla, and R. Weischedel (2005) A study of translation error rate with targeted human annotation. In AMTA, Cited by: §4.2.
  • L. Song, X. Peng, Y. Zhang, Z. Wang, and D. Gildea (2017) AMR-to-text generation with synchronous node replacement grammar. In ACL, Cited by: §2.
  • L. Song, A. Wang, J. Su, Y. Zhang, K. Xu, Y. Ge, and D. Yu (2020) Structural information preserving for graph-to-text generation. In ACL, Cited by: §2.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In NeurIPS, Cited by: §4.1.
  • K. Tanaka-Ishii, K. Hasida, and I. Noda (1998) Reactive content selection in the generation of real-time soccer commentary. In COLING, Cited by: §2.
  • Y. Taniguchi, Y. Feng, H. Takamura, and M. Okumura (2019) Generating live soccer-match commentary from play data. In AAAI, Cited by: §2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NeurIPS, Cited by: §4.1.
  • P. Vougiouklis, H. ElSahar, L. Kaffee, C. Gravier, F. Laforest, J. S. Hare, and E. Simperl (2018) Neural wikipedian: generating textual summaries from knowledge base triples. Journal of Web Semantics 52-53, pp. 1 – 15. Cited by: §1, §2, §2, §3.
  • H. Wang (2019) Revisiting challenges in data-to-text generation with fact grounding. In INLG, Cited by: §2.
  • Q. Wang, X. Pan, L. Huang, B. Zhang, Z. Jiang, H. Ji, and K. Knight (2018) Describing a knowledge base. In INLG, Cited by: §2.
  • T. Wang, X. Wan, and H. Jin (2020a) AMR-to-text generation with graph transformer. Transactions of the Association for Computational Linguistics 8, pp. 19–33. Cited by: §2.
  • Z. Wang, X. Wang, B. An, D. Yu, and C. Chen (2020b) Towards faithful neural table-to-text generation with content-matching constraints. In ACL, Cited by: §2.
  • T. Wen, M. Gašić, N. Mrkšić, L. M. Rojas-Barahona, P. Su, S. Ultes, D. Vandyke, and S. Young (2016) Conditional generation and snapshot learning in neural dialogue systems. In EMNLP, Cited by: §1, §2.
  • T. Wen, M. Gašić, N. Mrkšić, P. Su, D. Vandyke, and S. Young (2015) Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. In EMNLP, Cited by: §1, §2.
  • S. Wiseman, S. Shieber, and A. Rush (2017) Challenges in data-to-document generation. In EMNLP, Cited by: §1, §2, §2.
  • R. Ye, W. Shi, H. Zhou, Z. Wei, and L. Li (2020) Variational template machine for data-to-text generation. In ICLR, Cited by: §2.
  • T. Yu, R. Zhang, H. Er, S. Li, E. Xue, B. Pang, X. V. Lin, Y. C. Tan, T. Shi, Z. Li, Y. Jiang, M. Yasunaga, S. Shim, T. Chen, A. Fabbri, Z. Li, L. Chen, Y. Zhang, S. Dixit, V. Zhang, C. Xiong, R. Socher, W. Lasecki, and D. Radev (2019) CoSQL: a conversational text-to-SQL challenge towards cross-domain natural language interfaces to databases. In EMNLP, Cited by: §2.
  • T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi (2020a) BERTScore: evaluating text generation with BERT. In ICLR, Cited by: §2, §4.2.
  • Y. Zhang, S. Sun, M. Galley, Y. Chen, C. Brockett, X. Gao, J. Gao, J. Liu, and B. Dolan (2020b) DialoGPT : large-scale generative pre-training for conversational response generation. In ACL, Cited by: §2.
  • C. Zhao, M. Walker, and S. Chaturvedi (2020a) Bridging the structural gap between encoding and decoding for data-to-text generation. In ACL, Cited by: Table 5.
  • W. Zhao, M. Peyrard, F. Liu, Y. Gao, C. M. Meyer, and S. Eger (2019) MoverScore: text generation evaluating with contextualized embeddings and earth mover distance. In EMNLP, Cited by: §2, §4.2.
  • Y. Zhao, L. Chen, Z. Chen, R. Cao, S. Zhu, and K. Yu (2020b) Line graph enhanced amr-to-text generation with mix-order graph attention networks. In ACL, Cited by: §2.
  • V. Zhong, C. Xiong, and R. Socher (2017) Seq2SQL: generating structured queries from natural language using reinforcement learning. CoRR abs/1709.00103. Cited by: §3.
  • J. Zhu, J. Li, M. Zhu, L. Qian, M. Zhang, and G. Zhou (2019a) Modeling graph structure in transformer for better AMR-to-text generation. In EMNLP, Cited by: §2.
  • Y. Zhu, J. Wan, Z. Zhou, L. Chen, L. Qiu, W. Zhang, X. Jiang, and Y. Yu (2019b) Triple-to-text: converting rdf triples into high-quality natural languages via optimizing an inverse kl divergence. In SIGIR, Cited by: §2, §2.