Structured data, such as tables, knowledge graphs, or dictionaries containing key-value pairs, are popular data representation mechanisms used in a wide variety of industries to capture domain-specific knowledge. For example, (1) in the finance domain, data representing the financial performance of companies, (2) in healthcare, information about the chemical composition of drugs, patient records, etc., and (3) in retail, inventory records of products and their features are a few among many manifestations of structured data. Various AI-based human-machine interaction applications such as question-answering or dialog involve retrieving information from such structured data for their end goals. A key component in such applications deals with Natural Language Generation (NLG) from the aforementioned structured data representations, known as the data-to-text problem. Another important use-case of this problem is story-telling from data, as in automatic report generation.
In the literature, several approaches have been proposed for data-to-text, which can be primarily categorized as rule-based systems [Dale, Geldof, and Prost 2003; Reiter et al. 2005], modular statistical techniques [Barzilay and Lapata 2005; Konstas and Lapata 2013] and more recently explored end-to-end neural architectures [Lebret, Grangier, and Auli 2016; Mei, Bansal, and Walter 2016; Jain et al. 2018; Nema et al. 2018]. The rule-based approaches employ handcrafted rules or templates for specific tasks, which cannot scale to accommodate newer domains. On the other hand, the statistical and neural approaches require large amounts of parallel labeled data for training. Parallel data in NLG tasks are quite expensive to obtain since they require an annotator to frame a complete text as output for an input instance. To work on unseen domains and/or tasks, these data-hungry systems need to be trained again with parallel data for the newer domain. To put this into the perspective of data-to-text NLG, Table 1 shows the lack of adaptability of such systems on unseen domain data. It can be seen that models do well only on the domain they are trained on and perform poorly on a different domain.
|WikiBio Test data|
|Reference:||Thomas Tenison (29 September 1636 –- 14 December 1715) was an English church leader, Archbishop of Canterbury from 1694 until his death.|
|WebNLG Model:||thomas , england and england are the main ingredients of thomas of archbishop , which is a member of the title of the thomas of archbishop . The birth date of the country is thomas.|
|WikiBio Model:||thomas tenison (29 september 1636 - 14 december 1715) was archbishop of canterbury from 1695 to 1715.|
|WebNLG Test data|
|Reference:||Bacon Explosion come from the United States where Barack Obama is the leader and white Americans are an ethnic group.|
|WebNLG Model:||The Bacon Explosion comes from the United States where Barack Obama is the leader and White Americans are one of the ethnic groups.|
|WikiBio Model:||bacon explosion is a united states competitive american former competitive men ’s national team. united states ( born october 16 , 1951 ) is a retired united states district judge for the united states district court for the united states district court for the united states district court for the united states district court for the united states district court for the united states district court for the united states district court for the united states district court for the united states district court for the united states district court for the united states united states ( born october 16 , 1949 ) is an american former white executive.|
Further, since existing systems are designed as task-specific solutions, they tend to jointly learn both content selection from the input (what to say?) and surface realization or language generation (how to say?). This is often undesirable, as the former, which decides what is interesting in the input, can be highly domain-specific. For example, which weather parameters (temperature, wind-chill) are influential and which body parameters (heart-rate, body temperature) are important depends heavily on the domain in question, such as weather or healthcare respectively. The latter part, language generation, may not be as domain-dependent and can thus be designed in a reusable and scalable way.
In this paper we propose a general purpose, unsupervised approach to language generation from structured data; our approach works purely at a linguistic level using word and sub-word level structures. The system is primarily designed for taking a structured table with variable schema as input and producing a coherent paragraph description pertaining to facts in the table. However, it can also work with graphs and key-value pairs (in the form of JSONs) as input. Multiple experiments show the efficacy of our approach on different datasets having varying input formats without being trained on any of these datasets. By design, the system is unsupervised and scalable, i.e. it assumes no labeled corpus and only considers monolingual (unlabeled) corpora and WordNet during development, which are inexpensive and relatively easy to obtain for most domains and languages.
Here, the generation of descriptions from structured data happens in three stages, viz. (1) canonicalization, where the input is converted to a standard canonical representation in the form of tuples, (2) simple language generation, where each canonical form extracted from the input is converted to a simple sentence, and (3) discourse synthesis and language enrichment, where simple sentences are merged together to produce complex and more natural sentences. The first stage is essential to handle variable schema and different formats. The second stage gleans morphological, lexical and syntactic constituents from the canonical tuples, and stitches them into simple sentences. The third stage applies sentence compounding and coreference replacement on the simple sentences to generate a fluent and adequate description. For development of these modules, at most a monolingual corpus, WordNet and three basic NLP modules, namely a part-of-speech tagger, a dependency parser and a named entity recognizer, are needed.
To test our system, we first curate a multi-domain benchmark dataset (henceforth referred to as WikiTablePara) that contains tables and corresponding manually written paragraph descriptions; to the best of our knowledge, no such dataset existed before. Our experimental results on the dataset show the superiority of our system over the existing data-to-text systems. We will release this dataset publicly.
As discussed earlier, our framework can also be extended to different schemas and datatypes. To demonstrate this, we perform additional experiments on three datasets representing various domains and input types, using only their test splits: (i) WikiBio [Lebret, Grangier, and Auli 2016], representing key-value pairs, (ii) WikiTableText [Bao et al. 2018], representing tabular rows and their one-line summaries, and (iii) WebNLG [Gardent et al. 2017b], representing knowledge graphs. Note that our system does not undergo training on any of these datasets and yet delivers promising performance on their test splits.
In short, the key contributions of this paper are summarized below:
We propose a general-purpose, unsupervised, domain-agnostic system for generating natural descriptions from structured tables with variable schema, or from various other input formats.
Our system has a modular approach enabling interpretability, as the output of each stage in our pipeline is in human-understandable textual form.
We release a dataset containing WikiTables and their descriptions for further research. Additionally, we release the data gathered for the modules for sentence realization from tuples (see Secs. 4, 6.2, and 6.3). This is useful for building general-purpose tuple/set to sentence systems. Code for replicating our experiments will also be released.
2 Central Challenges and Our Solution
In this section we summarize the key challenges in description generation from structured tabular data.
Variable Schema: Tables can have a variable number of rows and columns. Moreover, the central theme around which the description should revolve can vary. For example, two tables can contain the column-headers [Company Name, Location], yet the topic of the description can be either the companies or the locations of various companies. Also, two tables having column-headers [PlayersName, Rank] and [Rank, PlayersName] represent the same data but may be handled differently by existing methods that rely on sequential inputs.
Variation in Presentation of Information: The headers of the tables typically capture information that is crucial to generation. However, presentations of headers can considerably vary for similar tables. For example, two similar tables can have column-headers like [Player, Country] and [Player Name, Played for Country], where the headers in the first table are single-word nouns but the first header of the second table is a noun-phrase and second header is a verb-phrase.
It is also possible that the headers share different interrelationships. For example, nouns such as [Company, CEO] should represent the fact that the CEO is a part of the company, whereas headers [temperature, humidity] should represent that these entities are independent of each other.
Domain Influence: It is known that changing the domain of the input data proves to be adversarial for end-to-end generators, primarily due to differences in: (a) vocabulary (e.g., the word “tranquilizer” in healthcare data may not be found in tourism data), and (b) word senses (e.g., “cricket” in sports versus “cricket” in biology). Other forms of variation that can occur at semantic and pragmatic levels, such as variation in sentiment (e.g., “deadly player” versus “deadly steering”) can also impede in-domain generators.
Natural Discourse Generation: Table descriptions in the form of discourse (paragraphs) should exhibit a natural flow with a mixture of simple, compound and complex sentences. Repeated entities should also be replaced by appropriate co-referents. In short, the paragraphs should be fluent, adequate and coherent.
End-to-end neural systems mentioned in the previous sections suffer from all of these challenges. According to [gardent2017creating], these systems tend to overfit the data they are trained on, “generating domain specific, often strongly stereotyped text” (e.g., weather forecasts or game commentator reports). Rather than learning the semantic relations between data and text, these systems are heavily influenced by the style of the text, the domain vocabulary, the input format of the data and co-occurrence patterns. As per [D17-1239], “Even with recent ideas of copying and reconstruction, there is a significant gap between neural models and template-based systems, highlighting the challenges in data-to-text generation”.
Our system is designed to tackle these challenges through a three-stage pipeline, namely, (a) canonicalization, (b) simple language generation, and (c) discourse synthesis and language enrichment. In the first stage, the input is converted to a standard canonical representation in the form of tuples. In the second stage, each canonical form extracted from the input is converted to a simple sentence. In the final stage, the simple sentences are merged together to produce complex and more natural sentences. The overall architecture is presented in Figure 1.
It is important to note that our pipeline is designed to work with tables that do not have a hierarchy among their column headers and row headers. We believe that tables with multiple levels of headers, or with a hierarchy among headers, can be normalized and then processed by our system; this is beyond the scope of the current work. We discuss our central idea in the following section.
3 Structured Data Canonicalization
Our goal is to generate descriptions from structured data which can appear in various formats. For this it is essential to convert the data to a canonical form which can be handled by our generation stages. Though our main focus is to process data in tabular form, the converter is designed to handle other input formats as well, as discussed below.
3.1 Input Formats
Table - Tables are data organized in rows and columns. We consider single-level row and column headers with no hierarchy. A table row can be interpreted as an n-ary relation. Currently, we simplify the table row representation to a collection of binary relations (or triples).
Graph - Knowledge Graphs have entities represented as nodes while edges denote relations between entities. Here we consider binary relations only. It is possible to convert entity attributes to binary relations. A knowledge graph can, thus, be viewed as a collection of triples, each consisting of a binary relation connecting two entities.
JSON - This is data organized in the form of a dictionary of key-value pairs. We limit ourselves to single-level key-value pairs where the keys and values are atomic and cannot themselves be dictionaries. A pair of key-value entries (where one corresponds to the primary key) can also be represented as a triple, with the relation being the second key name and the entities being the values of the two keys.
3.2 Canonical Form and Canonicalization
For our system to handle the various formats listed above, we need to convert them to a standard format that is easily recognizable by our system. Moreover, the generation step must be trainable without labeled parallel data, so that it can be used in various domains where only monolingual corpora are available.
Keeping in mind the above desiderata and the various input formats, we arrived at a canonical form consisting of triples, each made of a binary relation between two entity types. For example, consider the triple: Albert Einstein ; birth place ; Ulm, Germany. The entity tags for the named entities ‘Albert Einstein’ and ‘Ulm, Germany’ are PERSON and GPE respectively. This leads to the following canonicalized triple form (from here on, we refer to this triple as the canonical triple):
PERSON ; birth place ; GPE
For tabular inputs, extraction of tuples requires the following assumptions to hold.
The column-headers of the table should be considered as the list of key-words that decide the structure of the sentences to be generated. In case the table is centered around row headers (i.e., row headers contain maximum generic information about the table), the table has to be transposed first. We do not handle such cases in the current work and all the tables in our proposed dataset are column-centric.
One column header is considered as the primary key, around which the theme of the generated output revolves. For simplicity, we chose the first column-header of the tables in our dataset to be the primary key.
Each table is first broken into a set of subtables, each containing 1 row and 2 columns, as shown in Figure 2. The first column of each subtable always represents the primary key of the table. For a table containing M rows and N columns (excluding headers), a total of M × (N − 1) subtables are thus produced. The subtables are then flattened to produce triples, where flattening is carried out by dropping the primary-key header and concatenating the entries of the subtables, as shown in Figure 2. This produces standard entity-relationship triples ⟨e1, r, e2⟩, where e1 and e2 are entities that are table entries and r is the relationship, which is captured by the column header.
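The flattening step can be illustrated with a minimal sketch. The function name `table_to_triples`, the `key_index` parameter and the sample rows are illustrative assumptions, not part of the described system; the sketch only assumes a flat table whose primary key is a single column.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def table_to_triples(headers: List[str], rows: List[List[str]],
                     key_index: int = 0) -> List[Triple]:
    """Flatten a table into (primary-key entry, column header, cell entry) triples."""
    triples: List[Triple] = []
    for row in rows:
        key_entry = row[key_index]
        for col, (header, cell) in enumerate(zip(headers, row)):
            if col == key_index:
                continue  # the primary-key column is paired with every other column
            # Each (key column, other column) pair is one 1-row, 2-column subtable;
            # dropping the primary-key header flattens it into a triple.
            triples.append((key_entry, header, cell))
    return triples


# Illustrative input only (not taken from the benchmark dataset).
headers = ["Name", "Birth place", "Birth date"]
rows = [
    ["Albert Einstein", "Ulm, Germany", "14 March 1879"],
    ["Marie Curie", "Warsaw, Poland", "7 November 1867"],
]
triples = table_to_triples(headers, rows)
for t in triples:
    print(t)
```

With 2 rows and 3 columns, the sketch yields 2 × (3 − 1) = 4 triples, one per subtable.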
To enable our system to adapt to different domains, the two entities of each triple are tagged using an NER tagger. The tagger essentially assigns placeholder tags such as PERSON and GPE for persons and geographical regions respectively. For tagging we use the Spacy (http://spacy.io/) NER tagger, an off-the-shelf tagger that performs reasonably well even on words and phrases. We also employ a DBPedia lookup (based on fuzzy string matching) in situations where NER is unable to recognize the named entity. This is helpful in detecting peculiar multi-word named entities like ‘The Silence of the Lambs’, which will not be recognized by Spacy due to lack of context. All DBPedia classes have been manually mapped to the 18 Spacy NER types. As a fallback, any entity not recognized through the DBPedia lookup is assigned the UNK tag.
The above process produces somewhat domain-independent canonical representations from the tables, as seen in Figure 2. The NER tags and the corresponding original entries are carried forward and remain available for use in stages 2 and 3. At stage 2, these tags are replaced with the original entries to form proper sentences. The tags and the original entries also influence language enrichment in stage 3.
Unlike tables, for input types like knowledge graphs and key-value pairs, extraction of canonical triples is straightforward. Knowledge graphs are typically in triple form with nodes representing entities and edges representing relations. Similarly, a pair of key-value entries can be flattened and a triple can be extracted. All these formats, thus, can be standardized to a collection of canonical triples with NE tags acting as placeholders.
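The tagging fallback chain (NER, then DBPedia lookup, then UNK) and the flattening of key-value pairs can be sketched as follows. The two dictionaries are hypothetical stand-ins for the NER tagger and the DBPedia lookup, and all function names are assumptions made for illustration; a real setup would plug in the actual tagger and lookup service.

```python
from typing import Callable, Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]
Tagger = Callable[[str], Optional[str]]

# Hypothetical stand-ins: a real system would call an NER tagger and a DBPedia lookup.
TOY_NER: Dict[str, str] = {"Albert Einstein": "PERSON", "Ulm, Germany": "GPE"}
TOY_DBPEDIA: Dict[str, str] = {"The Silence of the Lambs": "WORK_OF_ART"}


def tag_entity(entity: str, ner: Tagger, lookup: Tagger) -> str:
    """Try the NER tagger first, then the lookup, and fall back to UNK."""
    tag = ner(entity)
    if tag is None:
        tag = lookup(entity)
    return tag if tag is not None else "UNK"


def key_values_to_triples(record: Dict[str, str], primary_key: str) -> List[Triple]:
    """Pair the primary-key value with every other key-value entry."""
    subject = record[primary_key]
    return [(subject, key, value) for key, value in record.items() if key != primary_key]


def canonicalize(triple: Triple, ner: Tagger, lookup: Tagger) -> Triple:
    """Replace both entities by placeholder tags; the relation is kept as is."""
    e1, relation, e2 = triple
    return (tag_entity(e1, ner, lookup), relation, tag_entity(e2, ner, lookup))


record = {"name": "Albert Einstein", "birth place": "Ulm, Germany"}
raw_triples = key_values_to_triples(record, primary_key="name")
canonical = [canonicalize(t, TOY_NER.get, TOY_DBPEDIA.get) for t in raw_triples]
print(canonical)
```

The original entries are kept alongside the canonical triples (here in `raw_triples`) so that later stages can restore them.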
In the following section, we describe how a simple sentence can be extracted from each canonical triple. A collection of canonical triples obtained from a table (or other input types) will produce a collection of simple sentences, which can then be compounded to produce a more linguistically rich description of a table in natural language.
4 Simple Language Generation
The simple language generation module takes each canonical triple and generates a simple sentence in natural language. For instance, the triple PERSON birth place GPE will be translated to a simple sentential form like the following:
PERSON was born in GPE
The tags will finally be replaced with the original entities to produce a simple sentence as follows: Albert Einstein was born in Ulm, Germany. The canonical triple set in Figure 2 should produce the following (or similar) simple sentences (referred to as set 1):
|Albert Einstein was born in Ulm, Germany|
|Albert Einstein has birthday on 14 March 1879|
|Elsa Lowenthal is the wife of Albert Einstein||(1)|
This is achieved by the following steps: (1) Preprocessing, which transforms the canonical triple to a modified canonical triple, (2) TextGen, which converts the modified canonical triple to a simple sentential form like PERSON was born in GPE, (3) Postprocessing, which restores the original entities to produce a simple sentence like Albert Einstein was born in Ulm, Germany, and lastly, (4) Ranking, which selects the best sentence produced in step 3 when multiple variants of TextGen are run in parallel. The details of all these steps are given below.
It is quite possible that the canonical triples will contain words that cannot be easily converted to a sentence form without additional explicit knowledge. For example, it may not be easy to transform the vanilla triple PERSON game Badminton to a syntactically correct sentence PERSON plays Badminton.
To convert the relation term into a verb phrase we employ a preprocessing step. The step requires two resources to be available: (1) WordNet and (2) generic word embeddings, at least covering the default vocabulary of the language (English). We use the 300-dimensional GloVe embeddings for this purpose [Pennington, Socher, and Manning 2014]. Note that models for embedding learning require only a monolingual corpus, and any specific unlabeled corpora can be leveraged for fine-tuning the embeddings for a certain set of domains, if desired.
The preprocessing step covers the following two scenarios:
Relation term is a single-word term:
In this case, the word is lemmatized and the root form is looked up in a verb lexicon pre-extracted from WordNet. If the lookup succeeds, the lemma form is retained in the modified triple. Otherwise, the top-k verbs closest to the word (for a fixed k in our setup) are extracted using cosine similarity over GloVe vectors. Through this technique, the word “game”, which is not a verb, would yield verbs like “match” and “play”. The verb “play” will clearly be the most suitable one for generating a sentence later.
The most suitable verb for the word is decided by computing the degree of co-occurrence of the original word and the extracted verbs in the WordNet glosses and examples of the Synsets of the original word and the extracted verb. The degree of co-occurrence is measured using pointwise-mutual information. By this technique, “play” would be selected as the most appropriate verb form for the word “game”.
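The two-step verb selection (nearest verbs by cosine similarity, then re-ranking by pointwise mutual information) can be sketched on toy data. The 3-dimensional vectors and the four token sets below are invented stand-ins for GloVe embeddings and WordNet glosses/examples, and the document-level PMI estimate is one simple assumption about how co-occurrence could be counted.

```python
import math
from typing import Dict, List, Sequence, Set

# Toy stand-ins for 300-dimensional GloVe vectors (values are invented).
VECTORS: Dict[str, List[float]] = {
    "game": [1.0, 0.2, 0.0],
    "play": [0.9, 0.3, 0.1],
    "match": [0.8, 0.1, 0.3],
    "eat": [0.0, 0.1, 1.0],
}
# Toy stand-ins for WordNet glosses and examples, one token set per text.
GLOSSES: List[Set[str]] = [
    {"a", "contest", "with", "rules", "that", "people", "play", "game"},
    {"children", "play", "a", "game"},
    {"a", "match", "is", "formal", "contest"},
    {"the", "game", "ended", "at", "match", "point"},
]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_verbs(word: str, verbs: Sequence[str], k: int) -> List[str]:
    """Return the k verbs whose vectors are closest to the word's vector."""
    ranked = sorted(verbs, key=lambda v: cosine(VECTORS[word], VECTORS[v]), reverse=True)
    return ranked[:k]


def pmi(word: str, verb: str, glosses: Sequence[Set[str]]) -> float:
    """PMI of word and verb, with probabilities estimated over gloss texts."""
    n = len(glosses)
    p_word = sum(word in g for g in glosses) / n
    p_verb = sum(verb in g for g in glosses) / n
    p_both = sum(word in g and verb in g for g in glosses) / n
    if p_both == 0:
        return float("-inf")  # never co-occur in the toy glosses
    return math.log(p_both / (p_word * p_verb))


def select_verb(word: str, candidates: Sequence[str]) -> str:
    return max(candidates, key=lambda v: pmi(word, v, GLOSSES))


candidates = top_k_verbs("game", ["play", "match", "eat"], k=2)
best = select_verb("game", candidates)
print(candidates, best)
```

On this toy data the nearest verbs for “game” are “play” and “match”, and the PMI step prefers “play”, mirroring the example in the text.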
Relation term is a multi-word term: The relation term, in this case, would contain both content words (i.e., non-stopwords) and function words (i.e., stopwords). Examples of multi-word terms are “country played for” and “number of reviews”. When such terms are encountered, the main verb in the phrase is extracted through part-of-speech (POS) tagging. If a verb is present, the phrase is altered by moving the noun phrase preceding the verb to the end of the phrase. So, the phrase “country played for”, through this heuristic, would be transformed to “played for country”. This is based on the assumption that in tabular forms, noun phrases that convey an action are actually a transformed version of a verb phrase.
After all the above operations if the relation term still does not get modified, it is appended with a variant of “have” verb (i.e., “has”, if the entity preceding the relation term is singular, else “have”). Thus a noun phrase like “number of reviews” will be modified to “has number of reviews”. In this manner, the above preprocessing techniques enable creation of a triple which we refer to as modified canonical triple. This preprocessing is useful for the triple2text generation step as discussed next.
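A minimal sketch of the multi-word heuristic and the “have” fallback is given below. It assumes the relation term arrives already POS-tagged as (word, coarse tag) pairs, and it treats everything before the first verb as the preceding noun phrase; both are simplifying assumptions, as is the function name `rewrite_relation`.

```python
from typing import List, Tuple

TaggedToken = Tuple[str, str]  # (word, coarse POS tag), supplied by a POS tagger


def rewrite_relation(tagged: List[TaggedToken], subject_is_singular: bool = True) -> str:
    """Turn a multi-word relation term into a verb-led phrase."""
    words = [word for word, _ in tagged]
    first_verb = next((i for i, (_, pos) in enumerate(tagged) if pos == "VERB"), None)
    if first_verb is not None and first_verb > 0:
        # Move the words preceding the verb (taken to be the noun phrase) to the end.
        return " ".join(words[first_verb:] + words[:first_verb])
    # The term was not modified, so a form of "have" is added in front of it.
    auxiliary = "has" if subject_is_singular else "have"
    return " ".join([auxiliary] + words)


print(rewrite_relation([("country", "NOUN"), ("played", "VERB"), ("for", "ADP")]))
print(rewrite_relation([("number", "NOUN"), ("of", "ADP"), ("reviews", "NOUN")]))
```

The first call reorders “country played for”, and the second falls through to the “has” variant because no verb is present.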
The objective of the TextGen step is to generate simple, syntactically correct sentences from the (modified) canonical triples. This is the most crucial stage in our pipeline, where conversion happens from a structured form (canonical triple) to a simple sentential form. We propose three ways to generate sentential forms, as elaborated below. These are alternative ways of generating a simple sentential form; hence, they can be run in parallel.
triple2text: This module is the simplest and is developed using a seq2seq [Klein et al. 2017] network which is trained on the Triple2Text data consisting of triples curated from various sources (see Sec. 6 item 3 for details). Only this variant of generation requires a modified canonical triple. The other variants discussed next can work with the canonical triple without such modification.
morphkey2text (v1 and v2):
The conversion of any canonical triples to sentences demands the following linguistic operations to be carried out:
Determining the appropriate morphological form for the words/phrase in the canonical triple, especially the relation word/phrase (e.g., transforming the word “play” to “played” or “plays”).
Determining the articles and prepositions necessary to construct the sentences (e.g., transforming “play” to “plays for”).
Adding appropriate auxiliary verbs when necessary. This is needed especially for passive forms (e.g., transforming “location” to “is located at” by adding the auxiliary verb “is”).
Ideally, any module designed for canonical triple to sentence translation should dynamically select a subset of the above operations based on the contextual clues present in the input. To this end, we propose the morphkey2text module, a variant of the seq2seq network empowered with attention and copy mechanisms. Figure 3 shows a working example of the morphkey2text system. We skip explaining the well-known seq2seq mechanism for brevity. As input, the module takes a processed version of the canonical triple in which (a) NE tags are retained, (b) stopwords (if appearing in the relation terms in the canonical triples) are removed, and (c) the coarse POS tags for both the NE tags and words are appended to the input sequence. The module is expected to produce words. Additionally, the fine-grained POS tags (in PENN tagset format) are generated for the verbs appearing in their lemma form. The rationale behind such an input-output design is that dealing with the lemma forms at the target side and incorporating additional linguistic signals in terms of POS should help handle lexical and morphological data sparsity across domains. As seen in Figure 3, the canonical triple PERSON playing country GPE is first transformed into a list of content words and their corresponding coarse-grained POS tags. During generation, the input keyword and POS “playing VERB” are translated to “play VBD”, which will be post-processed to produce the word “played”. It is worth noting that, as the system has to deal with lemma forms and NE and POS tags at both the input and output sides, many input words are just copied, which makes the system robust across domains.
Preparing training data for the morphkey2text design requires only a monolingual corpus and a few general-purpose NLP tools and resources such as a POS tagger, an NE tagger and WordNet. A large number of simple sentences extracted from web-scale text dumps (such as Wikipedia) have to be collected. The sentences are then POS tagged and the named entities are replaced with NE tags. Stopwords (function words) such as articles and prepositions are dropped from the sentences by looking them up in a stopword lexicon. Since the POS tagger produces fine-grained POS tags, the tags are converted to coarse POS tags using a predefined mapping. This produces the source (input) side of the training example, similar to the one shown in Figure 3. As for the target (output), the named entities in the original sentences are replaced with NE tags, the other words are lemmatized using the WordNet lemmatizer, and the fine-grained POS tags of the words are added if the lemma form is not the same as the surface form.
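The construction of one source/target training pair can be sketched as below. The sketch assumes the sentence has already been tokenized, lemmatized, POS-tagged and NE-tagged (the `Token` annotations are supplied by hand), and it interleaves each source word with its coarse tag; the tiny tag mapping, the stopword list and that interleaved layout are illustrative assumptions rather than the exact format used.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Tiny illustrative fine-to-coarse mapping and stopword lexicon (assumptions).
COARSE = {"NNP": "PROPN", "NN": "NOUN", "NNS": "NOUN", "VB": "VERB",
          "VBD": "VERB", "VBG": "VERB", "VBZ": "VERB", "IN": "ADP", "DT": "DET"}
STOPWORDS = {"a", "an", "the", "for", "in", "of", "on", "at"}


@dataclass
class Token:
    text: str
    lemma: str
    fine_pos: str
    ne_tag: Optional[str] = None  # set when the token is a (merged) named entity


def make_pair(tokens: List[Token]) -> Tuple[str, str]:
    """Build the (source, target) strings for one monolingual sentence."""
    source: List[str] = []
    target: List[str] = []
    for tok in tokens:
        coarse = COARSE.get(tok.fine_pos, "X")
        if tok.ne_tag is not None:
            source += [tok.ne_tag, coarse]   # entities become placeholder tags
            target.append(tok.ne_tag)
            continue
        surface = tok.text.lower()
        if surface not in STOPWORDS:
            source += [surface, coarse]      # stopwords are dropped on the source side
        target.append(tok.lemma)
        if tok.lemma != surface:
            target.append(tok.fine_pos)      # keep the fine tag when the lemma differs
    return " ".join(source), " ".join(target)


sentence = [
    Token("Sachin Tendulkar", "Sachin Tendulkar", "NNP", "PERSON"),
    Token("played", "play", "VBD"),
    Token("for", "for", "IN"),
    Token("India", "India", "NNP", "GPE"),
]
source, target = make_pair(sentence)
print(source)
print(target)
```

The target keeps the preposition that the source drops, which is what lets a model trained on such pairs learn to reinsert function words.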
We implement two different variants of the morphkey2text system. The morphkey2text v1 module is trained on the MorphKey2Text dataset (version v1) that was created from monolingual corpora (explained in Sec. 6 item 2). The morphkey2text v2 module is trained on a different version (v2) of the MorphKey2Text dataset (details in Sec. 6 item 2).
The postprocessing step restores the original entities into the tagged forms generated by the TextGen step. Additionally, if possessive nouns are detected in the sentence, apostrophes are added to such nouns. Possessives are checked using the following heuristic: if the POS tag for the word following the first entity is not a verb, the word is a potential possessive candidate. Postprocessing is applied to the output of each of the competing modules listed in the previous step.
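A rough sketch of entity restoration with the possessive check follows. It assumes the sentential form is given as (token, coarse POS) pairs, that entities are restored per tag in the order they occur in the triple, and that the possessive marker is attached to the first entity; the last point is one reading of the heuristic, and the names are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

VERB_TAGS = {"VERB", "AUX"}


def postprocess(tagged_form: List[Tuple[str, str]],
                entities: List[Tuple[str, str]]) -> str:
    """Restore entities into a tagged sentential form.

    tagged_form: (token, coarse POS) pairs, where entity slots are NE tags.
    entities:    (NE tag, original entry) pairs in triple order.
    """
    queues: Dict[str, Deque[str]] = {}
    for tag, surface in entities:
        queues.setdefault(tag, deque()).append(surface)

    words: List[str] = []
    first_entity_seen = False
    for i, (token, _) in enumerate(tagged_form):
        if token in queues and queues[token]:
            surface = queues[token].popleft()
            if not first_entity_seen:
                first_entity_seen = True
                has_next = i + 1 < len(tagged_form)
                # Possessive check: the word after the first entity is not a verb.
                if has_next and tagged_form[i + 1][1] not in VERB_TAGS:
                    surface += "'s"
            words.append(surface)
        else:
            words.append(token)
    return " ".join(words)


entities = [("PERSON", "Albert Einstein"), ("GPE", "Ulm, Germany")]
verbal = [("PERSON", "PROPN"), ("was", "AUX"), ("born", "VERB"),
          ("in", "ADP"), ("GPE", "PROPN")]
nominal = [("PERSON", "PROPN"), ("birth", "NOUN"), ("place", "NOUN"),
           ("is", "AUX"), ("GPE", "PROPN")]
print(postprocess(verbal, entities))
print(postprocess(nominal, entities))
```

In the second form the word after the first entity is a noun, so the entity is rendered as a possessive.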
The above variants triple2text, morphkey2text v1 and morphkey2text v2 can be run in parallel to compete with each other to produce different translations of the canonical triple. Out of these, the best produced sentence will be selected by the ranking step mentioned below.
4.4 Scoring and Ranking
To extract the most appropriate output from the compendium of seq2seq systems discussed earlier, a ranker is employed; it sorts the candidate sentences by a composite linguistic score computed from the canonical triple and the generated sentence. The score combines two functions: fluency (grammaticality of the output sentence) and adequacy (factual overlap between input and output).
The fluency function is computed with a language model (LM): for a sentence of N words, the LM, an n-gram language model, returns the likelihood of the sentence. For this, a 5-gram general-purpose language model is built using a Wikipedia dump and KenLM [Heafield 2011].
The adequacy function is defined as:
Before passing candidates to the ranker module, we employ heuristics to filter out incomplete and unnatural sentences. Examples of filtered sentences include sentences without verbs, sentences with dropped entities, and sentences that are disproportionately longer or shorter than the input triple in terms of word count. Once the ranker produces the best simple sentence per canonical triple, the simple sentences are combined into a coherent paragraph in the following stage.
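The filter-then-rank flow can be sketched as follows. The length-ratio thresholds, the `Candidate` structure and the toy scores are assumptions for illustration; the composite score itself is passed in as a callable, so the sketch does not commit to any particular fluency or adequacy formula.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class Candidate:
    text: str
    pos_tags: List[str]  # one coarse POS tag per whitespace token (hand-supplied here)


def keep(cand: Candidate, entities: Sequence[str], triple_len: int,
         min_ratio: float = 0.5, max_ratio: float = 3.0) -> bool:
    """Apply the three filters: verb present, entities kept, sane length."""
    has_verb = any(tag in ("VERB", "AUX") for tag in cand.pos_tags)
    has_entities = all(entity in cand.text for entity in entities)
    ratio = len(cand.text.split()) / triple_len
    return has_verb and has_entities and min_ratio <= ratio <= max_ratio


def rank(cands: Sequence[Candidate], entities: Sequence[str], triple_len: int,
         score: Callable[[Candidate], float]) -> Optional[Candidate]:
    """Return the highest-scoring candidate that survives the filters."""
    kept = [c for c in cands if keep(c, entities, triple_len)]
    return max(kept, key=score) if kept else None


entities = ["Albert Einstein", "Ulm, Germany"]
triple_len = 6  # words in "Albert Einstein ; birth place ; Ulm, Germany"
candidates = [
    Candidate("Albert Einstein was born in Ulm, Germany",
              ["PROPN", "PROPN", "AUX", "VERB", "ADP", "PROPN", "PROPN"]),
    Candidate("Albert Einstein is born at Ulm, Germany",
              ["PROPN", "PROPN", "AUX", "VERB", "ADP", "PROPN", "PROPN"]),
    Candidate("Albert Einstein birth place Ulm, Germany",
              ["PROPN", "PROPN", "NOUN", "NOUN", "PROPN", "PROPN"]),  # no verb
    Candidate("Albert Einstein was born",
              ["PROPN", "PROPN", "AUX", "VERB"]),                     # dropped entity
]
# Hypothetical composite scores standing in for the real scoring functions.
TOY_SCORES = {candidates[0].text: -10.0, candidates[1].text: -14.0}
best = rank(candidates, entities, triple_len,
            score=lambda c: TOY_SCORES.get(c.text, float("-inf")))
print(best.text if best else None)
```

Only the first two candidates pass the filters, and the stand-in score then picks the first one.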
5 Discourse Synthesis and Language Enrichment
Albert Einstein was born in Ulm, Germany and has birthday on 14 March 1879. Elsa Lowenthal is the wife of Albert Einstein.
The above paragraph is produced by a sentence compounding module which is then succeeded by a coreference replacement module to produce the final coherent paragraph:
Albert Einstein was born in Ulm, Germany and has birthday on 14 March 1879. Elsa Lowenthal is the wife of him.
5.1 Sentence Compounding
This module takes a pair of simple sentences and produces a merged compound version. Every simple sentence can be converted into a form ⟨e1, r, e2⟩, where the entities e1 and e2 appear in the input and r is the relation phrase connecting them.
For a pair of sentences, if both share the same first entity e1 or both have the same second entity e2, the compounded version can be obtained by ‘AND’-ing the relation phrases r. In case the second entity of one sentence matches the first entity of the following sentence, a clausal pattern can be created by adding “who” or “which”. In all other cases, the sentences can be merged by ‘AND’-ing both sentences. Alg. 1 elaborates on this heuristic. This module can also generate different variations of paragraphs based on different combinations of sentence pairs.
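The compounding rules can be sketched on sentences represented as (first entity, relation phrase, second entity) triples. The exact surface patterns, especially for the shared-second-entity case, and the use of a set of person entities to choose between “who” and “which” are assumptions made for this sketch.

```python
from typing import FrozenSet, Tuple

Sentence = Tuple[str, str, str]  # (first entity, relation phrase, second entity)


def compound(s1: Sentence, s2: Sentence,
             persons: FrozenSet[str] = frozenset()) -> str:
    """Merge two simple sentences into one compound or complex sentence."""
    a1, r1, b1 = s1
    a2, r2, b2 = s2
    if a1 == a2:
        # Shared first entity: join the two relation phrases with "and".
        return f"{a1} {r1} {b1} and {r2} {b2}"
    if b1 == b2:
        # Shared second entity: state it once, after both relation phrases.
        return f"{a1} {r1} and {a2} {r2} {b1}"
    if b1 == a2:
        # Chain: the second sentence becomes a relative clause.
        relative = "who" if b1 in persons else "which"
        return f"{a1} {r1} {b1} {relative} {r2} {b2}"
    # No shared entity: join the full sentences.
    return f"{a1} {r1} {b1} and {a2} {r2} {b2}"


born = ("Albert Einstein", "was born in", "Ulm, Germany")
birthday = ("Albert Einstein", "has birthday on", "14 March 1879")
wife = ("Elsa Lowenthal", "is the wife of", "Albert Einstein")
persons = frozenset({"Albert Einstein", "Elsa Lowenthal"})

print(compound(born, birthday, persons))
print(compound(wife, born, persons))
```

Trying the rules on different orderings of the sentence pairs is one way to obtain the paragraph variations mentioned above.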
5.2 Coreference Replacement
To enhance the paragraph coherence, it is often desirable to replace entities that repeat within or across consecutive sentences with appropriate coreferents. For this we employ a heuristic that currently replaces repeating entities with pronominal anaphora.
If an entity is encountered twice in a sentence or appears in consecutive sentences, it is marked as a potential candidate for replacement. The number and gender of the entity are decided using POS tags and an off-the-shelf Gender Predictor module, which is a Convolutional Neural Network based classifier trained on person names gathered from various websites. The entity’s role is determined based on whether it appears to the left of the verb (i.e., Agent) or to the right (Object). Based on the gender, number, role and possessives, the pronouns (he/she/their/him/his etc.) are selected and they replace the entity. We ensure that we replace only one entity in a sentence to avoid incoherent constructions due to multiple replacements in close proximity.
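Pronoun selection and replacement across consecutive sentences can be sketched as follows. Sentences are again (first entity, relation phrase, second entity) triples, so the first entity is taken as the Agent and the second as the Object; the `describe` callable is a hypothetical stand-in for the gender predictor and the POS-based number check, and repeats within a single sentence are not handled in this sketch.

```python
from typing import Callable, Dict, List, Tuple

Sentence = Tuple[str, str, str]  # (first entity, relation phrase, second entity)

PRONOUNS: Dict[Tuple[str, str], Dict[str, str]] = {
    ("male", "singular"): {"agent": "he", "object": "him", "possessive": "his"},
    ("female", "singular"): {"agent": "she", "object": "her", "possessive": "her"},
    ("neutral", "singular"): {"agent": "it", "object": "it", "possessive": "its"},
    ("any", "plural"): {"agent": "they", "object": "them", "possessive": "their"},
}


def select_pronoun(gender: str, number: str, role: str) -> str:
    key = ("any", "plural") if number == "plural" else (gender, "singular")
    return PRONOUNS[key][role]


def replace_coreferents(sentences: List[Sentence],
                        describe: Callable[[str], Tuple[str, str]]) -> List[str]:
    """Replace at most one repeated entity per sentence with a pronoun."""
    output: List[str] = []
    previous: set = set()
    for e1, relation, e2 in sentences:
        subject, obj = e1, e2
        if e1 in previous:
            gender, number = describe(e1)
            subject = select_pronoun(gender, number, "agent").capitalize()
        elif e2 in previous:  # elif: only one replacement per sentence
            gender, number = describe(e2)
            obj = select_pronoun(gender, number, "object")
        output.append(f"{subject} {relation} {obj}.")
        previous = {e1, e2}
    return output


# Hypothetical stand-in for the gender predictor and number detection.
TOY_PROFILE = {"Albert Einstein": ("male", "singular"),
               "Elsa Lowenthal": ("female", "singular")}
describe = lambda entity: TOY_PROFILE.get(entity, ("neutral", "singular"))

paragraph = replace_coreferents(
    [("Albert Einstein", "was born in", "Ulm, Germany"),
     ("Elsa Lowenthal", "is the wife of", "Albert Einstein")],
    describe,
)
print(" ".join(paragraph))
```

The second sentence ends in “the wife of him”, the same replacement as in the example paragraph at the start of this section.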
We remind the reader of Fig. 1, which presents an overview of our system. Due to its modular nature, our system enjoys interpretability; each stage in the pipeline is conditioned on the output of the previous stage. Moreover, all the modules can, in principle, adapt to newer domains. The datasets used for training do not have any domain-specific characteristics, and thus these modules can work well across various domains, as will be seen in the experiments sections. The whole pipeline can be developed without any parallel corpora of structured table to text. Any data used for training any individual module can be curated from monolingual corpora. The subsequent section discusses such datasets in detail.
This section discusses three datasets. Dataset 1 contains tables from various domains and their summaries, and can be used for benchmarking any table description generator. Datasets 2 and 3 are developed to train our TextGen modules (Sec. 4). These datasets will be released for academic use and could prove useful in various data-to-text scenarios. We will also release the code and resources to create similar datasets at a larger scale.
6.1 Dataset 1
Descriptions from WikiTable (WikiTablePara): We prepare a benchmark dataset for multi-sentence description generation from tables. For gathering input tables, we rely on the WikiTable dataset [Pasupat and Liang 2015], which is a repository of more than 2000 tables. Most of the tables, however, suffer from the following issues: (a) they do not provide enough context information, as they were originally a part of a Wikipedia page, (b) they are concatenations of multiple tables, and (c) they contain noisy entries. After filtering out such tables, we extract 171 tables. Two reference table descriptions in the form of paragraphs were manually written by linguists who are fluent in English. The descriptions revolve around one column of the table, which acts as the primary key.
6.2 Dataset 2
Morphological variation based keywords to text dataset (MorphKey2Text): This is created from the monolingual corpora released by [Thorne et al.2018], which contain around 150,000 sentences. The sentences may not always be simple sentences of the kind we hope to generate through our morphkey2text modules. However, it is interesting to note that tagging all named entities and replacing them with their entity tags helps create training data rich enough to capture co-occurrence statistics of stopwords around entity types. We create the first version (v1) of the dataset following the technique discussed in Sec. 4.2.
The second version (v2) differs slightly in that it employs a higher-recall entity tagging mechanism that checks the POS tags and the dependency parse tree of the sentences. This is necessary as there are entities like “A Song of Ice and Fire”, which will not be recognized by the NE tagger used to create v1. Such multi-word entities can be detected by a simple heuristic which looks for a sequence of proper nouns (in this case ‘Song’, ‘Ice’ and ‘Fire’) surrounded by stop-words but not including any punctuation. Moreover, the sequence should not contain any verb marked as root by the dependency parser. Through this technique, it is also possible to handle cases where an entity like “Tony Blair” gets detected as two entity tags, PERSON and UNK, by popular NE taggers such as Spacy, instead of the single entity tag PERSON.
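One possible reading of this heuristic is sketched below. It operates on tokens that already carry POS and dependency labels (for instance, as produced by a parser), so it has no library dependency. The `Token` type, the tiny stop-word list, the requirement of at least two proper nouns, the trimming of the span so that it starts and ends on a proper noun, and the inclusion of auxiliaries in the root-verb check are all assumptions of this sketch; under them, the leading article in “A Song of Ice and Fire” is not part of the detected span.

```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    text: str
    pos: str  # coarse POS tag, e.g. "PROPN", "VERB", "PUNCT"
    dep: str  # dependency label, e.g. "ROOT"


# Tiny illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "and", "in", "on", "for", "is"}


def is_stop(tok: Token) -> bool:
    return tok.text.lower() in STOP_WORDS


def find_multiword_entities(tokens: List[Token]) -> List[Tuple[int, int]]:
    """Return (start, end) token spans (end exclusive) of candidate entities."""
    spans = []
    i = 0
    while i < len(tokens):
        if tokens[i].pos != "PROPN":
            i += 1
            continue
        # Grow a run of proper nouns and stop-words. Punctuation is neither,
        # so it always terminates the run.
        j = i
        last_propn = i
        while j < len(tokens) and (tokens[j].pos == "PROPN" or is_stop(tokens[j])):
            if tokens[j].pos == "PROPN":
                last_propn = j
            j += 1
        # Trim trailing stop-words so the span ends on a proper noun.
        start, end = i, last_propn + 1
        span = tokens[start:end]
        n_propn = sum(t.pos == "PROPN" for t in span)
        has_root_verb = any(
            t.dep == "ROOT" and t.pos in ("VERB", "AUX") for t in span
        )
        if n_propn >= 2 and not has_root_verb:
            spans.append((start, end))
        i = j  # j > i always holds because tokens[i] is a proper noun
    return spans
```

The same function also merges adjacent proper nouns such as “Tony Blair” into a single span, which can then be assigned one entity tag.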
6.3 Dataset 3
Triple2Text dataset (Triple2Text): For this, a large number of triples and corresponding sentential forms are gathered from the following resources.
- Yago Ontology: 6198617 parallel triples and sentences extracted from Yago [Suchanek, Kasneci, and Weikum2007]. Our improvised NER, discussed in Sec. 3, is used for getting tags for entities in the triples.
- OpenIE on WikiData: 53066988 parallel triples and sentences synthesized from relations from Reverb Clueweb [Banko et al.2007] and all possible combinations of NE Tags.
- VerbNet: 149760 parallel triples and sentences synthesized from verbs (in the first person singular form) from VerbNet [Schuler2005] and possible combinations of NE Tags.
For all the knowledge resources considered for this dataset, concatenating the elements of a triple yields a simple sentence; hence, no manual effort was needed to create this dataset.
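The synthesis of parallel pairs by concatenation can be sketched as follows. This is an illustrative approximation: the function names and the choice of ordered tag pairs (including a tag paired with itself) are assumptions, and the relation and tag strings in the usage note are hypothetical.

```python
from itertools import product


def triple_to_sentence(triple):
    """Concatenate subject, relation and object into a simple sentence."""
    subject, relation, obj = triple
    return " ".join([subject, relation, obj])


def synthesize_pairs(relations, ne_tags):
    """Pair every relation with every ordered combination of two NE tags.

    Returns a list of (triple, sentence) training pairs.
    """
    pairs = []
    for relation in relations:
        for subj_tag, obj_tag in product(ne_tags, repeat=2):
            triple = (subj_tag, relation, obj_tag)
            pairs.append((triple, triple_to_sentence(triple)))
    return pairs
```

For example, `synthesize_pairs(["was born in"], ["PERSON", "GPE"])` produces four pairs, one of which maps the triple `("PERSON", "was born in", "GPE")` to the sentence `PERSON was born in GPE`.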
The simple language generator modules in Sec. 4 require training seq2seq networks using the MorphKey2Text (v1 and v2) and the Triple2Text datasets. For this, we use the OpenNMT framework in PyTorch to train the models. The best model is chosen based on accuracy on the validation split of the above datasets. Once these modules are trained, they are used in inference mode in our pipeline.
Through experiments, we show the efficacy of our proposed system on WikiTablePara and other public data-to-text benchmark datasets even though it is not trained on those datasets. Additionally, we assess the generalizability of our system and other existing end-to-end systems in unseen domains. We use BLEU-4, METEOR, ROUGE-L and Skip-Thoughts based Cosine Similarity (denoted as STSim) as the evaluation metrics (computed using https://github.com/Maluuba/nlg-eval). We also perform a human evaluation study, where a held-out portion of the test data is evaluated by linguists who assign scores to the generated descriptions pertaining to fluency, adequacy and coherence. Mainly, we try to answer the following questions through our empirical study:
Can existing end-to-end systems adapt to unseen domains? For this, we consider two pretrained representative models: (a) WikiBioModel [Nema et al.2018], a neural model trained on the WikiBio dataset, and (b) WebNLGModel (http://webnlg.loria.fr/pages/baseline.html), a seq2seq baseline trained on the WebNLG dataset. These models are tested on the WikiTablePara dataset, which is not restricted to any particular domain. Thus, the existing models are tested on a wide variety of domains which may not have been present in the datasets used for training.
How well does our system adapt to new domains? For this, we also evaluate our proposed system on the table-to-description WikiTablePara benchmark dataset to contrast its performance with that of the above pretrained models. Additionally, we assess our system on related (table-to-text summarization) datasets: (1) WebNLG [Colin et al.2016, Gardent et al.2017b], (2) WikiBio [Lebret, Grangier, and Auli2016], and (3) WikiTableText [Bao et al.2018]. The WikiTableText dataset, like ours, is derived from WikiTables. However, it contains only tabular rows and their summaries, making the objective different from ours (please refer to Sec. 9 for more details). Hence, for brevity, we only report our system’s performance on this dataset without further analysis.
How interpretable is our approach? By leveraging the modularity of our system, we analyze the usefulness of the major components of the proposed system and perform error analysis.
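Of the metrics listed above, STSim is the cosine similarity between sentence embeddings of the generated and the reference text. The sketch below shows only the similarity computation; how the embeddings are obtained (Skip-Thoughts vectors in our setting) is outside its scope, and the function name and the zero-vector convention are assumptions of this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: similarity with a zero vector is 0.
        return 0.0
    return dot / (norm_u * norm_v)
```

Because the score depends only on the direction of the two vectors, it tends to be more tolerant of paraphrases than n-gram overlap metrics such as BLEU-4.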
7.1 Experimental Setup
In this section we discuss how the various systems are configured for evaluation on multiple datasets.
Proposed System : Our proposed system is already designed to work with the format of the WikiTablePara dataset. Each table in the dataset is converted to canonical triples leading to the output table description (refer Sec. 3).
To test our system on other input types such as Knowledge Graphs and Key-Value dictionaries, we use the WebNLG and WikiBio datasets respectively. From the WikiBio dataset, JSONs containing Key-Value pairs key1:value1, key2:value2, … , keyN:valueN are converted to triples. Each triple is of the form $\langle value_1, key_i, value_i \rangle$, where $2 \le i \le N$. The assumption here is that the first key is the primary key and typically contains names and other keywords for identifying the original Wikipedia infobox. For the WebNLG dataset, the triples in a group are directly used by our system to produce the output. Finally, for the WikiTableText dataset, which contains one table-row per instance, each input is converted into triples in a similar manner to the WikiTablePara dataset.
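The Key-Value to triple conversion can be sketched as below, under the stated assumption that the first pair holds the primary key, whose value serves as the subject of every triple. The function name and the example record are hypothetical; the sketch relies on Python dictionaries preserving insertion order (Python 3.7 and later).

```python
def key_values_to_triples(record):
    """Convert an ordered Key-Value record into (subject, key, value) triples.

    The first pair is treated as the primary key: its value becomes the
    subject, and each remaining pair contributes one triple.
    """
    items = list(record.items())
    if not items:
        return []
    _, subject = items[0]
    return [(subject, key, value) for key, value in items[1:]]


# Hypothetical infobox-style record used only for illustration.
example = {"name": "Fabinho", "birth place": "brazil", "position": "midfielder"}
triples = key_values_to_triples(example)
```

A record with N pairs therefore yields N-1 triples, all sharing the same subject.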
WebNLGModel: The WebNLGModel is designed to be trained and tested on the WebNLG dataset. An already trained WebNLGModel (similar to the one by Gardent et al. (2017b)) is evaluated on the WikiTablePara and WikiBio datasets. For the WikiTablePara dataset, we convert every table to triples. For each triple, the model infers a sentence, and the sentences for all the triples representing a table are concatenated to produce a paragraph description.
For the WikiBio dataset, each JSON is converted to triples for key-value pairs in the same way as described in the previous item. The model yields output sentences for each such JSON containing Key-Value pairs.
WikiBioModel: The WikiBioModel is designed to be trained and tested on the WikiBio dataset, which contains Key-Value pairs on the input side and summaries on the output side. An already trained model (similar to the one by Nema et al. (2018)) is evaluated on the WikiTablePara and WebNLG datasets. For the WikiTablePara dataset, we convert every table to JSONs in the WikiBio format. Each JSON contains a pair of Key-Value pairs, where the first Key-Value pair always represents the primary-key and its corresponding entry in the table (hence, N-1 JSONs are produced). The sentences inferred by the model for all the JSONs are concatenated to produce the required paragraph description.
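The table-to-JSON conversion used for this baseline can be sketched as follows. The function name, the `primary_index` parameter and the example column names are assumptions for illustration; the sketch emits one record per non-primary cell, so a row with N columns yields N-1 records.

```python
def table_to_pair_records(header, rows, primary_index=0):
    """Turn each table row into records holding two Key-Value pairs each.

    The first pair is always the primary-key column and its entry in the row;
    the second pair is one of the remaining columns.
    """
    records = []
    for row in rows:
        primary = (header[primary_index], row[primary_index])
        for col, cell in enumerate(row):
            if col == primary_index:
                continue
            records.append(dict([primary, (header[col], cell)]))
    return records


# Hypothetical table fragment used only for illustration.
header = ["Region", "Area", "Population"]
rows = [["Yamato highland", "506.89", "56"]]
records = table_to_pair_records(header, rows)
```

Each resulting record is passed to the model independently, and the inferred sentences are concatenated in row order.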
For the WebNLG dataset, each triple is converted to a JSON of a pair of Key-Value pairs. A triple is converted to a JSON format of . For each instance in the WebNLG dataset, sentences are inferred for all the triples belonging to the instance, and they are concatenated to produce the final output.
Please note that for the above evaluations, only the test splits of the WikiBio and WebNLG datasets are used, whereas there is no train:test split for the WikiTablePara dataset and the entire dataset is used for evaluation. The results are summarized in Tables 2 and 3.
Apart from comparing our system with the existing ones, we also try to understand how different stages of our pipeline contribute to the overall performance. For such an ablation study, we prepare the different variants of the system based on the following two scenarios, and compare their performance against that of the complete system.
Instead of using the ensemble (Ranker), each participating TextGen system, viz. triple2text, morphkey2text v1 and morphkey2text v2, is treated as a separate system. The intention is to show the advantage of using an ensemble of generators and the ranking mechanism.
Language enrichment modules such as compounding and coreference replacement modules are removed. Simple sentences are just concatenated to produce the table descriptions. The intention is to study the impact of the language enrichment modules on the overall performance of our system.
Apart from reporting quantitative numbers in terms of the standard evaluation metrics, we also leverage the modularity of the system to inspect cases where undesirable descriptions are generated by our system.
8 Results and Discussion
Table 2 illustrates how the various pretrained models fare on the WikiTablePara benchmark dataset compared to our proposed system. We observe that the end-to-end WebNLGModel does better than WikiBioModel. However, our proposed system clearly gives the best performance, demonstrating its capability to generalize to unseen domains and to structured data in a more complex form such as multi-row and multi-column tables.
Table 3 shows the performance of our proposed system on the test splits of various datasets (including the whole WikiTablePara dataset), even though it is not trained using them. The performance measures (especially the STSim metric) indicate that our system can be used as it is for other input types with varied domains. Note that despite the fact that the WikiTableText, WebNLG and WikiBio datasets are summarization datasets and are not designed for complete description generation, our system still performs reasonably well, without having been trained on any of these datasets.
We performed ablation on our proposed system at multiple levels. Table 4 shows the performance of the individual simple language generation systems and of the ranker module. The results suggest that the ranker indeed improves the performance of the system. Moreover, to measure the effectiveness of our proposed sentence compounding and coreference replacement modules, we replaced them with a simple sentence concatenation module. As observed in the same table, the performance of the system degrades compared to when the compounding and coreference replacement modules are used. This shows that the enrichment modules indeed play an important role, especially when it comes to discourse generation.
| Reference: | fabio de matos pereira or fabinho -lrb- born 26 february 1982 in brazil -rrb- , is a brazilian football midfielder , currently playing for botafogo-sp . |
| Proposed: | Fabinho has full name fabio de matos pereira and birth date 26 february 1982. It has birth place brazil and has current club botafogo-sp. It positions midfielder and has clubs fc braov fc metalurh donetsk ermis aradippou skonto riga botafogo-sp. It has article title fbio de matos pereira. |
| Reference: | Airey Neave was involved in the Battle of France in which Hugo Sperrle was a commander. |
| Proposed: | Airey Neave ’s fights Battle of France and its commander was Hugo Sperrle. |
| Reference: | Yamato flat inland plain has an area of 837.27 sq. kms and has a population of 1,282. Its population density per kilometre is 1,531. Yamato highland has an area of 506.89 sq. kms and has a population of 56. Its population density per kilometre is 110. Goj, Yoshino has an area of 2,346.84 sq. kms and has a population of 92. Its population density per kilometre is 39. |
| Proposed: | Yamato flat inland plain has area size 837.27 and its population is 1,282. It has density per 1,531. Yamato highland has area size 506.89 and has a population of 56. He has density per 110. Goj, Yoshino has area size 2,346.84 and has a population of 92. It is in the density of km 39. |
| Reference: | Vancouver Whitecaps women are the winners of Voyageurs cup in 2005 . |
| Proposed: | Voyageurs cup’s year is 2005 and its winners was Vancouver Whitecaps women. |
8.1 Human Evaluation
Since quantitative evaluation metrics such as BLEU and Skip-thought similarity are known to have limited capability in judging sentences that are correct but different from the gold-standard reference, we perform a human evaluation study. Instances from the WikiTablePara dataset were randomly selected. For each instance, the table, the reference paragraph, and the outputs from our proposed system and the WikiBio and WebNLG models were shuffled and shown to the experts. They were instructed to assign three scores, for fluency, adequacy and coherence, to the generated and gold-standard paragraphs. The minimum and maximum scores for each category are 1 and 5 respectively. Table 5 reports the evaluation results. While it was expected that the gold-standard output would get the maximum scores in all aspects, the scores for our proposed system are considerably higher than those of the existing systems and are sometimes close to those of the gold-standard paragraphs. This shows that a modular approach like ours can be promising for generating tabular descriptions.
On manual inspection of the descriptions generated by our system across datasets (as exhibited in Table 6), we find that our system gives promising performance qualitatively in addition to the quantitative evaluation metrics mentioned before.
8.2 Effectiveness of the Individual Modules
We also examine whether, for TextGen, using an ensemble of generators followed by a ranking mechanism was effective. We study whether all the participating systems were chosen by the ranker for a significant number of examples. Fig. 4 shows the proportions (in percentage) of times the output of each of the three TextGen systems was selected by the ranker. As we can see, all systems are significantly involved in producing correct output on the test data. However, the triple2text system is selected fewer times than the other two systems. This is a positive result, as the triple2text system requires data obtained from specific resources such as OpenIE and Yago, as opposed to the morphkey2text systems, which require just a monolingual corpus.
8.3 Error Analysis
Since our system is modular, we could inspect the intermediate outputs of different stages and perform error-analysis. We categorize the errors into the following:
Error in Tagging of Entities: One of the crucial steps in the canonicalization stage is tagging the table entries. Our modified NE tagger sometimes fails to tag entities, primarily because of a lack of context. For example, the original triple ⟨Chinese Taipei, gold medals won, 1⟩ in our dataset is converted to the triple ⟨UNK, gold medals won, CARDINAL⟩. Because of the wrong NE tagging of the entity Chinese Taipei, the text generation stage in the pipeline did not get enough context and failed to produce a fluent output, as shown below:
Chinese Taipei’s gold medals have been won by 0.
This error eventually affected all the subsequent stages. While it is hard to resolve this error with existing NLP techniques, maintaining and incrementally building gazetteers of domain specific entities for look-up based tagging can be a temporary solution.
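A look-up based tagger of the kind suggested above could be sketched as follows. This is only an illustration: the function name, the case-insensitive matching, and the `GPE` tag assigned to the example entry are assumptions, and the fallback stands in for whatever NE tagger the pipeline already uses.

```python
def make_gazetteer_tagger(gazetteer, fallback_tagger):
    """Build a tagger that consults a gazetteer before the fallback tagger.

    `gazetteer` maps entity strings to tags; `fallback_tagger` is any callable
    that maps an entity string to a tag.
    """
    lookup = {name.lower(): tag for name, tag in gazetteer.items()}

    def tag(entity):
        hit = lookup.get(entity.strip().lower())
        return hit if hit is not None else fallback_tagger(entity)

    return tag


# Hypothetical gazetteer entry; the fallback here always returns UNK.
tagger = make_gazetteer_tagger({"Chinese Taipei": "GPE"}, lambda entity: "UNK")
```

New domain-specific entities can be added to the gazetteer incrementally without retraining any module.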
Error in the TextGen: We observe that all the TextGen systems discussed in Sec. 4.2 are prone to syntactic errors, which mostly belong to the categories of subject-verb disagreement, noun-number disagreement, and article and preposition errors. An example of such an erroneous output is shown below:
Republican ’s active voters is 13,916. Republican was inactive in voters 5,342
We believe errors of such kind can be alleviated by adding more training examples, judiciously prepared from large scale monolingual data from different domains.
Error in Ranking: This error impacts the performance of our system the most. We consistently observe that even though one of the individual systems is able to produce fluent and adequate output, it is not selected by the ranker module. In hindsight, the simple language model and overlap based scorers (Eq. 2) are not able to capture diverse syntactic and semantic representations of the same context (such as passive forms, reordering of words etc.). Moreover, language models are known to capture N-gram collocations better than the overall context of the sentences, and tend to penalize grammatically correct sentences more than incorrect sentences that contain more likely N-gram collocations. Furthermore, more adequate (and creative) long sentences are penalized more by the language model than shorter ones. To put this into perspective, consider the following example from our dataset. For the input triple ⟨Bischofsheim, building type, Station building⟩, the outputs from the TextGen systems are as follows:
triple2text: Bischofsheim has building type Station Building.
morphkey2text-v1: Bischofsheim’s building is a type of Station Building.
morphkey2text-v2: Bischofsheim is a building type of Station Building.
The ranker, unfortunately, selects the most imperfect output, produced by the morphkey2text-v2 system. We believe that the presence of highly probable bigrams such as building type and type of bolstered the language model score and eventually the overall score. A possible solution to this problem would be to train neural knowledge language models [Ahn et al.2016] that consider not only the contextual history but also the factual correctness of the generated text. Gathering more monolingual data for training such models may help as well.
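The collocation effect described above can be reproduced with a toy bigram language model. The sketch below is not the language model or the scorer used in our ranker; it is an add-one-smoothed bigram model over a tiny, made-up corpus, meant only to show that a sentence built from frequently seen bigrams receives a higher log-probability than one built from unseen bigrams, and that every additional token contributes a negative term to the score.

```python
import math
from collections import Counter


def train_bigram_lm(corpus):
    """Count context unigrams and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    vocab = {"<s>", "</s>"}
    for sentence in corpus:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        vocab.update(tokens)
        unigrams.update(tokens[:-1])  # every token except </s> is a context
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams, len(vocab)


def log_prob(sentence, model):
    """Sum of add-one-smoothed bigram log-probabilities of the sentence."""
    unigrams, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return total


# Made-up corpus used only for illustration.
toy_corpus = [
    "this is a type of building",
    "the building type is old",
    "it is a type of station",
]
model = train_bigram_lm(toy_corpus)
seen_bigrams_score = log_prob("it is a type of building", model)
unseen_bigrams_score = log_prob("building a is of type it", model)
```

Here `seen_bigrams_score` is larger than `unseen_bigrams_score` even though both sentences contain the same words, because only the first is composed of bigrams observed in the corpus. Since each term in the sum is negative, longer candidates accumulate lower scores unless the score is normalized by length.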
Error in Coreference Determination: Errors in coreference determination happen for two reasons: (a) the entities are incorrectly tagged (e.g., a PERSON is mis-tagged as ORG, leading to a wrong pronominal anaphora), and (b) the gender of the entity is incorrectly classified (e.g., Esther Ndiema ’s nationality is Kenya and his rank is 5). While improving the tagger is extremely important for this module and for the overall system, the gender detector could be improved through more training data and better tuning of hyperparameters.
We would also like to point out that several modules in our pipeline employ basic heuristics such as Alg. 1, which should be refined to ensure better coverage and fewer errors.
9 Related Work
Data-to-text NLG has received a lot of attention in recent times, especially due to the increasing demand for such systems in industrial applications. Several such systems are based on rule-based, modular statistical and hybrid approaches and are summarized by Nema et al. (2018). Recently, end-to-end neural generation systems have been preferred over the others. Some of the most recent ones are based on the WikiBio dataset [Lebret, Grangier, and Auli2016], a dataset tailor-made for summarization of structured data in the form of key-value pairs. Such systems include the ones by Lebret, Grangier, and Auli (2016), who use a conditional language model with a copy mechanism for generation, Liu et al. (2017), who propose a dual attention Seq2Seq model, Nema et al. (2018), who employ gated orthogonalization along with dual attention, and Bao et al. (2018), who introduce a flexible copying mechanism that selectively replicates contents from the table in the output sequence. Other systems revolve around popular datasets such as the WeatherGov dataset [Liang, Jordan, and Klein2009, Jain et al.2018], the RoboCup dataset [Chen and Mooney], RotoWire and SBNation [Wiseman, Shieber, and Rush2017], and the WebNLG dataset [Gardent et al.2017a]. Recently, Bao et al. (2018) introduced a new table-to-text dataset (we refer to it as the WikiTableText dataset and report our system’s performance on it). While it may sound highly relevant in the context of our main objective of tabular description generation, it differs significantly from ours, as described below:
The objective of Bao et al. (2018) is to generate a natural language summary for a region of the table, such as a row, whereas we intend to translate the complete table into paragraph descriptions. This requires additional discourse level operations (such as sentence compounding and coreference insertion), making the task more difficult.
The existing dataset contains tabular rows at the input side and summaries at the output side. Since the objective is summarization of a tabular region, a fraction of the entries are dropped and not explained, unlike ours that aims to translate the complete table into natural language form.
For the end-to-end neural frameworks trained and tuned on the above mentioned special purpose datasets, the learning objective is to maximize accuracy figures only on these datasets. This, in turn, impedes the scalability and adaptability of such systems to newer domains, input types, and schemas.
It is worth noting that recent works on keyword to question generation [Reddy et al.2017], Set2Seq generation [Vinyals, Bengio, and Kudlur2016], and knowledge language model based generation [Ahn et al.2016] can also act as building blocks for generation from structured data. We agree that these could have been used to build better variants of our core system. Due to space constraints, we limit our discussion to only the simple models. However, irrespective of the generation paradigm we use, the bottom line remains the same: data-driven approaches like ours can produce robust and scalable solutions for data-to-text NLG.
10 Conclusion and Future Directions
We presented a modular, interpretable framework for generating discourse level natural language descriptions from structured tabular data. We highlighted the challenges involved and argued why a modular data-driven architecture like ours could tackle them better than end-to-end neural systems. Our framework employs modules for obtaining standard representations of tables, generating atomic simple sentences from them, and finally combining the sentences to form coherent and fluent paragraphs. Since no benchmark dataset for evaluating discourse level tabular description generation was available, we created one to evaluate our system. The dataset will be released for further research in this direction. Our experiments on our dataset and various other data-to-text datasets reveal that: (a) our system outperforms the existing ones in producing discourse level descriptions, and (b) the system can realize good quality sentences for various other input data types such as knowledge graphs in the form of tuples and key-value pairs. Furthermore, the modularity of the system allows us to interpret the system’s output better. In future, we would like to incorporate additional modules into the system for tabular summarization. Extending the framework to multilingual tabular description generation is another future direction.
- [Ahn et al.2016] Ahn, S.; Choi, H.; Pärnamaa, T.; and Bengio, Y. 2016. A neural knowledge language model. CoRR abs/1608.00318.
- [Banko et al.2007] Banko, M.; Cafarella, M. J.; Soderland, S.; Broadhead, M.; and Etzioni, O. 2007. Open information extraction from the web. In IJCAI, volume 7, 2670–2676.
- [Bao et al.2018] Bao, J.; Tang, D.; Duan, N.; Yan, Z.; Lv, Y.; Zhou, M.; and Zhao, T. 2018. Table-to-text: Describing table region with natural language. arXiv preprint arXiv:1805.11234.
- [Barzilay and Lapata2005] Barzilay, R., and Lapata, M. 2005. Collective content selection for concept-to-text generation. In EMNLP, HLT ’05.
- [Chen and Mooney] Chen, D. L., and Mooney, R. J. Learning to sportscast: A test of grounded language acquisition. In ICML, ICML ’08.
- [Colin et al.2016] Colin, E.; Gardent, C.; Mrabet, Y.; Narayan, S.; and Perez-Beltrachini, L. 2016. The webnlg challenge: Generating text from dbpedia data. In INLG, 163–167.
- [Dale, Geldof, and Prost2003] Dale, R.; Geldof, S.; and Prost, J.-P. 2003. Coral: Using natural language generation for navigational assistance. In Proceedings of the 26th Australasian computer science conference-Volume 16, 35–44. Australian Computer Society, Inc.
- [Gardent et al.2017a] Gardent, C.; Shimorina, A.; Narayan, S.; and Perez-Beltrachini, L. 2017a. Creating training corpora for micro-planners. In ACL.
- [Gardent et al.2017b] Gardent, C.; Shimorina, A.; Narayan, S.; and Perez-Beltrachini, L. 2017b. Creating training corpora for NLG micro-planners. In ACL, 179–188.
- [Heafield2011] Heafield, K. 2011. Kenlm: Faster and smaller language model queries. In Sixth Workshop on SMT, 187–197. Association for Computational Linguistics.
- [Jain et al.2018] Jain, P.; Laha, A.; Sankaranarayanan, K.; Nema, P.; Khapra, M. M.; and Shetty, S. 2018. A mixed hierarchical attention based encoder-decoder approach for standard table summarization. In NAACL-HLT.
- [Klein et al.2017] Klein, G.; Kim, Y.; Deng, Y.; Senellart, J.; and Rush, A. M. 2017. Opennmt: Open-source toolkit for neural machine translation. CoRR abs/1701.02810.
- [Konstas and Lapata2013] Konstas, I., and Lapata, M. 2013. Inducing document plans for concept-to-text generation. In EMNLP, 1503–1514.
- [Lebret, Grangier, and Auli2016] Lebret, R.; Grangier, D.; and Auli, M. 2016. Neural text generation from structured data with application to the biography domain. In EMNLP.
- [Liang, Jordan, and Klein2009] Liang, P.; Jordan, M. I.; and Klein, D. 2009. Learning semantic correspondences with less supervision. In ACL, 91–99. Association for Computational Linguistics.
- [Liu et al.2017] Liu, T.; Wang, K.; Sha, L.; Chang, B.; and Sui, Z. 2017. Table-to-text generation by structure-aware seq2seq learning. arXiv preprint arXiv:1711.09724.
- [Mei, Bansal, and Walter2016] Mei, H.; Bansal, M.; and Walter, M. R. 2016. What to talk about and how? selective generation using lstms with coarse-to-fine alignment. In NAACL-HLT.
- [Nema et al.2018] Nema, P.; Shetty, S.; Jain, P.; Laha, A.; Sankaranarayanan, K.; and Khapra, M. M. 2018. Generating descriptions from structured data using a bifocal attention mechanism and gated orthogonalization. In NAACL-HLT.
- [Pasupat and Liang2015] Pasupat, P., and Liang, P. 2015. Compositional semantic parsing on semi-structured tables. In IJCNLP.
- [Pennington, Socher, and Manning2014] Pennington, J.; Socher, R.; and Manning, C. D. 2014. Glove: Global vectors for word representation. In EMNLP, 1532–1543.
- [Reddy et al.2017] Reddy, S.; Raghu, D.; Khapra, M. M.; and Joshi, S. 2017. Generating natural language question-answer pairs from a knowledge graph using a rnn based question generation model. In EACL.
- [Reiter et al.2005] Reiter, E.; Sripada, S.; Hunter, J.; Yu, J.; and Davy, I. 2005. Choosing words in computer-generated weather forecasts. Artificial Intelligence 167(1-2):137–169.
- [Schuler2005] Schuler, K. K. 2005. Verbnet: A Broad-coverage, Comprehensive Verb Lexicon. Ph.D. Dissertation, Philadelphia, PA, USA. AAI3179808.
- [Suchanek, Kasneci, and Weikum2007] Suchanek, F. M.; Kasneci, G.; and Weikum, G. 2007. Yago: a core of semantic knowledge. In WWW, 697–706. ACM.
- [Thorne et al.2018] Thorne, J.; Vlachos, A.; Christodoulopoulos, C.; and Mittal, A. 2018. FEVER: a large-scale dataset for fact extraction and verification. In NAACL-HLT.
- [Vinyals, Bengio, and Kudlur2016] Vinyals, O.; Bengio, S.; and Kudlur, M. 2016. Order matters: Sequence to sequence for sets. In International Conference on Learning Representations (ICLR).
- [Wiseman, Shieber, and Rush2017] Wiseman, S.; Shieber, S.; and Rush, A. 2017. Challenges in data-to-document generation. In EMNLP.