Logical Natural Language Generation from Open-Domain Tables

04/22/2020 ∙ by Wenhu Chen, et al. ∙ Tencent ∙ The Regents of the University of California ∙ The Ohio State University

Neural natural language generation (NLG) models have recently shown remarkable progress in fluency and coherence. However, existing studies on neural NLG are primarily focused on surface-level realizations with limited emphasis on logical inference, an important aspect of human thinking and language. In this paper, we suggest a new NLG task where a model is tasked with generating natural language statements that can be logically entailed by the facts in an open-domain semi-structured table. To facilitate the study of the proposed logical NLG problem, we use the existing TabFact dataset <cit.>, which features a wide range of logical/symbolic inferences, as our testbed, and propose new automatic metrics to evaluate the fidelity of generation models w.r.t. logical inference. The new task poses challenges to the existing monotonic generation frameworks due to the mismatch between sequence order and logical order. In our experiments, we comprehensively survey different generation architectures (LSTM, Transformer, Pre-Trained LM) trained with different algorithms (RL, Adversarial Training, Coarse-to-Fine) on the dataset and make the following observations: 1) Pre-Trained LM can significantly boost both the fluency and logical fidelity metrics, 2) RL and Adversarial Training trade fluency for fidelity, 3) Coarse-to-Fine generation can help partially alleviate the fidelity issue while maintaining high language fluency. The code and data are available at <https://github.com/wenhuchen/LogicNLG>.




1 Introduction

Figure 1: Table-to-text generation examples with and without implicit logical inference. Logical NLG requires a generation model to generate natural language statements that can be logically entailed by the facts in the table instead of simply restating certain superficial facts in natural language.

Neural network models, especially the recent wave of massive models like BERT Devlin et al. (2019) and GPT-2 Radford et al. (2019), have shown the ability to generate natural language text at an astonishing level of fluency and coherence. For the generated text to fulfill its purpose, however, a critical property that is necessary but often overlooked is fidelity, i.e., what is generated should be faithful to the underlying data, knowledge, or meaning representation. A line of recent work has started to address the surface-level fidelity issue of natural language generation (NLG) by encouraging the model to learn to reuse certain inputs verbatim through copy mechanisms See et al. (2017); Gu et al. (2016); Wiseman et al. (2017); Liu et al. (2018), structured attention Liu et al. (2018), or planning and selection/entity modeling Puduppully et al. (2019, 2019). While shown to be effective, most such methods so far are primarily focused on surface-level realization and simply restate the facts in the underlying data (Figure 1).

However, humans have the ability to generalize beyond superficial facts (e.g., “Canada has got 3 gold medals.”) by inferring and communicating new statements that can be entailed from these facts (e.g., “Canada obtained the most gold medals.”). We believe it is important for NLG models to be able to generalize beyond the superficial facts given to them as well. Therefore, we propose a new task, logical NLG, where a model is tasked with generating natural language statements that can be logically entailed by the given data (i.e., the premises). The new task requires a model to jointly reason and generate sentences that are consistent both linguistically and logically. Since there are a variety of reasoning/inference tasks such as natural language inference Bowman et al. (2015) and commonsense reasoning Talmor et al. (2019), to avoid confusion, this paper is specifically focused on inferences involving symbolic operations over the given table Pasupat and Liang (2015).

To empower research in this direction, we collect a new corpus, LogicNLG, based on the existing TabFact Chen et al. (2019), which brings two major changes to the existing NLG paradigm: 1) the text involves diversified types of logical inferences including math operations like max/min/sum/add, comparison operations like same/different, and counting operations like total/only. A more detailed description of logical inference is listed in the Appendix. 2) while existing datasets are often restricted to a specific domain such as weather Liang et al. (2009), restaurant Dušek et al. (2019), NBA Wiseman et al. (2017), etc., LogicNLG uses open-domain tables without prior knowledge about their schema. As such, existing methods based on surface-level copying See et al. (2017); Gu et al. (2016); Puduppully et al. (2019) become insufficient, and so does the existing fidelity evaluation based on surface-level information extraction Wiseman et al. (2017); Rohrbach et al. (2018); Dhingra et al. (2019), which extracts surface triples in a certain pre-defined form (i.e., subj-pred-obj, n-gram) and compares them with the surface content given in the knowledge.

Figure 2: When making the decision at the third step, the model needs to foresee the future tokens to ensure logical consistency. There is no back-tracking once the model makes a wrong decision like “5”.

Most neural generation models follow a monotonic generation scheme from left to right, with the current prediction depending only on the preceding words. Logical NLG poses unique challenges to this traditional generation scheme due to the mismatch between sequence order and logical order. As illustrated in Figure 2, the word “2” is derived from the logical inference ‘diff(Silver medal of Colombia, Silver medal of Canada) = 2’. In other words, the logical order of the word “2” should be after “more”, “silver”, and “Canada”, while the sequence order of “2” is before those words. Since the monotonic generation scheme is purely based on sequence order and agnostic to logical order, existing NLG models struggle to maintain fidelity as they cannot model the logical dependency on future tokens. To alleviate such an order mismatch, an NLG model must have the capability to plan ahead for the next few steps before generation. In this context, we believe LogicNLG to be an important testbed to study such a planning/inference ability in generation models Ford et al. (2018); Welleck et al. (2019). In this paper, we further propose a non-monotonic coarse-to-fine generation model and show that it is able to alleviate the order mismatch problem and achieve better performance. The contribution of this work is three-fold:
1) We propose a new research problem of logical natural language generation, and provide novel metrics to approximately evaluate the logical fidelity of generation models.
2) We justify the mismatch problem between sequence order and logical order of the traditional monotonic generation scheme in logical NLG.
3) We conduct comprehensive experiments with state-of-the-art neural generation models under both automatic and human evaluation, which demonstrate the challenges and opportunities for future research on logical NLG.

Dataset      Vocab   Examples  Vocab/Sent  Tables  Domain     Source     Inference  Schema
WEATHERGOV   394     22.1K     0.01        22.1K   Weather    Crawled    No         Known
WikiBIO      400K    728K      0.54        728K    Biography  Crawled    No         Limited
ROTOWIRE     11.3K   4.9K      0.72        4.9K    NBA        Annotated  Few        Known
LogicNLG     122K    37.0K     3.31        7.3K    Open       Annotated  Rich       Unlimited
Table 1: Comparison of LogicNLG against existing NLG datasets in different aspects.

2 Dataset and Problem Definition

Existing NLG datasets Chen and Mooney (2008); Dušek et al. (2019); Lebret et al. (2016); Liang et al. (2009) are mainly composed of surface-level descriptions of the given records. Though ROTOWIRE Wiseman et al. (2017) involves sporadic inference in its long documents, the inference is restricted to domain-specific knowledge (e.g. double-double, smash, triple-double and other NBA-related terms). Hence, we need a better testbed for studying the proposed problem.

Figure 3: Evaluation of surface-level generation vs. logical natural language generation. It suffices to use IE-based evaluation Wiseman et al. (2017); Rohrbach et al. (2018) to verify surface-level generation, but it causes either “empty triple” or “false negative” problems to verify logical NLG.


We construct a dataset based on TabFact Chen et al. (2019), which is a table-based fact-checking dataset with rich logical inferences in the annotated statements. Specifically, we took their positive statements (the sentences which are entailed by the knowledge in the table) collected from “complex channel” (required to annotate sentences with logical inference) as our target text. To prevent confusion with the original dataset, we name this table-to-text dataset as LogicNLG, which contains 28,450 training, 4,260 validation and 4,305 test examples based on 7,392 open-domain tables crawled from Wikipedia. Each table has 5 different examples covering diverse types of logical inference. More detailed statistics and comparisons are listed in Table 1. LogicNLG is distinguished from the existing datasets due to:
1) It involves very rich logical inference: every annotated sentence involves certain types of inference with minimum domain-specific knowledge. The open-domain characteristic simulates a realistic setting, where we cannot enumerate the possible inferences based on the schema, which poses great challenges to the model’s generalization capability.
2) It is mainly composed of short sentences with an average length of 11 and a simple syntactic structure, which isolates the problem of logical inference from other linguistic complexity.

The dataset contains tables with open schema crawled from diversified domains (Figure 4). The major categories are sports, politics, and entertainment. The sports category is mainly divided into two types: 1) the record summary of a player/team, 2) the leaderboard of a given competition/season. The politics tables mainly cover election-related statistics and the entertainment tables mainly discuss music, films, etc. The schema diversity of the tables makes rule-based systems infeasible to apply. Besides, most of the tables have very rich numeral records, which provide an ideal environment for numerical inference.

Problem Definition

Here, we formally define our proposed table-to-text generation task. The input is a table $T$ with its title denoted as a natural language sequence $W$. The table $T = \{T_{i,j} \mid i \le R_T, j \le C_T\}$ has $R_T$ rows and $C_T$ columns, with $T_{i,j}$ being the content of the $(i,j)$-th cell. $T_{i,j}$ could be a word, a number, a phrase, or even a natural language sentence. The annotated statement is a sentence $Y = y_1, y_2, \dots, y_n$. We aim to train a neural generation model $p(Y \mid T)$ to generate statements $\hat{Y}$ that are both fluent and logically (numerically) supported by the given table $T$.
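To make the input format concrete, the following is a minimal Python sketch of an open-schema table as described above. The class name `Table`, its fields, and the medal counts are illustrative assumptions and are not taken from the LogicNLG release.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Table:
    """An open-schema table: a title plus a grid of string cells."""
    title: str             # natural language title of the table
    header: List[str]      # column names; the schema is not fixed in advance
    rows: List[List[str]]  # rows[i][j] holds the content of cell (i, j)

    @property
    def num_rows(self) -> int:
        return len(self.rows)

    @property
    def num_cols(self) -> int:
        return len(self.header)

    def cell(self, i: int, j: int) -> str:
        return self.rows[i][j]


# Hypothetical example table (values are made up for illustration).
medals = Table(
    title="Medal table",
    header=["nation", "gold", "silver"],
    rows=[["Canada", "3", "1"], ["Mexico", "2", "3"], ["Colombia", "1", "3"]],
)
```

Cells are kept as strings because a cell may hold a word, a number, a phrase, or a full sentence; any numeric interpretation is left to the model or the evaluator.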

Figure 4: The domain distribution of LogicNLG.

3 Automatic Evaluation

Figure 5: The parsing-based and adversarial evaluation to measure model’s correctness in logical reasoning.

In this section, we discuss the evaluation of our proposed NLG task. The fluency evaluation is simply based on standard metrics like Perplexity Bengio et al. (2003) and BLEU-1,2,3 Papineni et al. (2002) computed with NLTK Bird (2006). The most challenging problem is to evaluate the logical fidelity of the generated sentences, which is also the core problem of our paper. The existing IE-based extractive evaluation Wiseman et al. (2017) leads to two issues as shown in Figure 3: 1) Empty Extraction: the sentence cannot be formulated as a (subject, predicate, object) structure, thus the IE system fails to extract triples for verification. 2) False Negative: the sentence is a logical composition (instead of surface form) of the facts from the table, so the IE system cannot match it against the table. For these reasons, we test two approximate automatic metrics:

Parsing-based Evaluation

We first propose a model-based evaluation method, which aims to directly extract the meaning representation from the generated sentence and execute it against the table to verify its correctness. Our evaluation is based on weakly-supervised semantic parsing Liang et al. (2009, 2013). The basic idea is to first link entities and predicates in the sentence, then use the linked entities to perform a breadth-first search to synthesize potential logical forms, and finally use a scorer to re-rank these logical forms and filter out spurious ones. The logical form returns a binary value (True/False) indicating whether its logic is supported by the knowledge. The basic idea is shown in the upper part of Figure 5; the implementation details are in the Appendix. We pre-train the semantic parser $f_\gamma$ on the training set $(T, Y) \in D_{train}$ with a weakly supervised algorithm. At test time, we use it to parse a sentence $Y$ into a set of logical forms, which is re-ranked to obtain the highest-scoring logical form $P_{best}$. We compute the ratio of $P_{best}$ returning “true” on $D_{test}$ to approximate the model’s fidelity:

$$\text{SP-Acc} = \mathbb{E}_{(T, \hat{Y}) \in D_{test}} \, \mathbb{I}\big(P_{best} \rightarrow \text{True} \mid P_{best} = f_\gamma(\hat{Y})\big)$$

where $\mathbb{I}$ is the indicator function.
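The final counting step of the parsing-based metric can be sketched as follows. This is a simplified illustration, not the authors' parser: logical forms are represented as plain Python callables that are executed against a dictionary-based table, and the function names and medal counts are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# A toy table: nation -> {column name -> numeric value}.
Table = Dict[str, Dict[str, int]]
# A logical form is modelled as a function that executes against a table
# and returns True if the statement it encodes is supported.
LogicalForm = Callable[[Table], bool]


def sp_acc(examples: List[Tuple[Table, LogicalForm]]) -> float:
    """Fraction of top-ranked logical forms that execute to True."""
    if not examples:
        return 0.0
    hits = sum(1 for table, form in examples if form(table))
    return hits / len(examples)


# Hypothetical medal counts, used only for illustration.
medals: Table = {
    "Canada": {"gold": 3, "silver": 1},
    "Mexico": {"gold": 2, "silver": 3},
}


def canada_most_gold(t: Table) -> bool:
    # Encodes "Canada obtained the most gold medals."
    return t["Canada"]["gold"] == max(row["gold"] for row in t.values())


def mexico_most_gold(t: Table) -> bool:
    # Encodes "Mexico obtained the most gold medals."
    return t["Mexico"]["gold"] == max(row["gold"] for row in t.values())


score = sp_acc([(medals, canada_most_gold), (medals, mexico_most_gold)])
```

In the real pipeline the logical form comes from an imperfect parser, so a sentence can be miscounted when the parse is spurious; the sketch only shows how the ratio is formed once a logical form has been chosen.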

NLI-based Evaluation

We then propose another model-based evaluation method to complement the parsing-based evaluation (which is sensitive to semantic variation). The basic idea follows Kryściński et al. (2019) in evaluating the entailment score between the table and the generated sentence. The NLI model is based on TableBERT Chen et al. (2019), which linearizes the table into textual form and uses it as the evidence for natural language inference. The model is trained on the TabFact Chen et al. (2019) dataset, which contains both positive and negative samples. During the evaluation, we use this NLI model to predict the entailment relationship based on the likelihood $p_{NLI}(Y \mid T)$. Finally, we compute the ratio of “entailed” predictions to approximate the model’s fidelity:

$$\text{NLI-Acc} = \mathbb{E}_{(T, \hat{Y}) \in D_{test}} \, \mathbb{I}\big(p_{NLI}(\hat{Y} \mid T) > 0.5\big)$$

where $\mathbb{I}$ is the indicator function.

Adversarial Evaluation

Adversarial evaluation Goodfellow et al. (2014); Kannan and Vinyals (2017) is used to study the generation model’s robustness in logical reasoning. Specifically, we hire human workers from Amazon Mechanical Turk (https://www.mturk.com/) to annotate adversarial examples for the test/validation set by changing a minimum number of words to reverse the logic of the sentence. Such adversarial examples preserve linguistic components like length and style and differ only in the logic-related words, in order to specifically isolate the generation model’s reasoning skill. As drawn in the lower part of Figure 5, the word “more” in the original sentence is changed to “less” to form an adversarial example. There are two principles the workers need to follow for their work to be accepted: 1) the modified words/phrases should be roughly equally frequent to balance the language prior; for example, the number “1” is better swapped with “2” or “3” than with “9999”, which rarely appears in the corpus. 2) the perturbation should be diverse enough to cover different aspects of logical reasoning skills. We use the generation model to score the original sentence $Y$ and the adversarial sentence $Y_{adv}$. If the confidence of the original example is higher than that of its adversarial counterpart, we count it as a successful defense, otherwise as a failed defense. We use the success rate to approximate the model’s logical reasoning capability:

$$\text{Adv-Acc} = \mathbb{E}_{(T, Y, Y_{adv}) \in D_{test}} \, \mathbb{I}\big(p(Y \mid T) > p(Y_{adv} \mid T)\big)$$

where $\mathbb{I}$ is the indicator function.
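Once model scores are available, NLI-Acc and Adv-Acc both reduce to simple counting. The sketch below assumes the scores have already been computed by the NLI model and the generation model respectively; the function names, the 0.5 decision threshold, and the example numbers are assumptions made for illustration.

```python
from typing import List, Tuple


def nli_acc(entail_probs: List[float], threshold: float = 0.5) -> float:
    """Fraction of generated sentences the NLI model labels as entailed.

    `threshold` is an assumed decision boundary on the entailment probability.
    """
    if not entail_probs:
        return 0.0
    return sum(p > threshold for p in entail_probs) / len(entail_probs)


def adv_acc(score_pairs: List[Tuple[float, float]]) -> float:
    """Fraction of pairs where the original sentence outscores its
    adversarial counterpart. Each pair is (score_original, score_adversarial),
    e.g. log-likelihoods under the generation model.
    """
    if not score_pairs:
        return 0.0
    wins = sum(orig > adv for orig, adv in score_pairs)
    return wins / len(score_pairs)


# Made-up scores for illustration only.
nli_score = nli_acc([0.9, 0.4, 0.7, 0.2])
adv_score = adv_acc([(-3.1, -4.0), (-2.5, -2.0), (-1.0, -6.0)])
```

Note the difference in what is scored: NLI-Acc judges sentences sampled from the generator, while Adv-Acc only asks the generator to rank two human-written sentences.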


Both types of metrics have pros and cons. SP-Acc and NLI-Acc are unbiased in that they measure the peak samples in the model’s likelihood; however, both metrics are based on imperfect models and thus their evaluation scores are inaccurate. SP-Acc is more sensitive to number/calculation errors, and NLI-Acc is more sensitive to semantic errors; therefore, we report both of them to help increase the metrics’ robustness. In contrast, the adversarial evaluation score is accurate in terms of reflecting the model’s reasoning capability on the given samples. However, as the provided samples might not lie in the high-confidence area of the model’s distribution, it is biased in reflecting the model’s general reasoning capability. Though these fidelity metric models are prone to errors, in Section 6 we show their consistency with human judgment, which reveals their potential to assist human evaluation.

4 Baselines

In this section, we design comprehensive baseline models to perform logical NLG. Specifically, we consider the following two cases: non-pretrained models (LSTM/Transformer) with copy mechanism and pre-trained models (GPT-2 and BERT) with sub-word unit. We train these models with three different algorithms: Maximum Likelihood, Adversarial Training, and Reinforcement Learning.

Figure 6: The Non-pretrained and Pre-trained generation models, the detailed table is shown in Figure 1.

4.1 Non-pretrained Models

Here we mainly consider two table encoding methods, namely field-infusing and field-gating. These two methods differ in their strategies to coalesce the field information into cells. After the table is represented as a sequence of vectors, a decoder based on LSTM Hochreiter and Schmidhuber (1997) or Transformer Vaswani et al. (2017) is applied to generate text token by token. The two methods are depicted in the upper part of Figure 6:


Field-Infusing

This strategy is inspired by Lebret et al. (2016). We first use an LSTM Hochreiter and Schmidhuber (1997) to encode the table field text word by word and then use the last output as the field representation. This representation is concatenated with the embedding of the row index and the word embedding at each cell to obtain a position-aware cell embedding for each word inside the cell. We stack transformer layers on top of the cell embedding to obtain the table representation $h_i \in \mathbb{R}^D$, with $D$ as the dimension.


Field-Gating

This strategy is inspired by Liu et al. (2018). Like the previous strategy, we first use an LSTM Hochreiter and Schmidhuber (1997) to obtain the field representation. The field representation is concatenated with ending distance information as the input to an additional field gate built inside the LSTM, as suggested in Liu et al. (2018); this field gate is used to control whether the current cell is already encoded. Such a mechanism can help the LSTM identify the boundary between different cells to grasp local information.

4.2 Pre-trained Models

To further enhance the fluency and resolve the out-of-vocabulary problem, we use pre-trained language models and finetune them on LogicNLG. Specifically, we consider two models based on GPT-2 Radford et al. (2019) and BERT Devlin et al. (2019), respectively, and name them GPT-TabGen and BERT-TabGen.

Table Linearization

We follow previous work on linearizing knowledge bases as natural language Liu et al. (2019); Zhang et al. (2019) to propose “table linearization”, which uses a template to flatten the table $T$ into a document $P_T = w_1, \dots, w_{|T|}$ that is fed into pre-trained language models to generate the statement $Y$, where we use $w_i$ to denote the $i$-th word in the generated paragraph $P_T$ and $|T|$ to denote the length of the paragraph (each word $w_i$ is either a table entry or a functional word in the template). As depicted in the bottom-left part of Figure 6, the original table $T$ is transformed into a paragraph by horizontally scanning each cell in the table.
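A minimal sketch of template-based table linearization is given below. The exact template wording used by the authors is not specified in this section, so the phrases "Given the table titled" and "In row i, the <column> is <value>" are assumptions chosen only to illustrate the horizontal, row-by-row scan.

```python
from typing import List


def linearize_table(title: str, header: List[str], rows: List[List[str]]) -> str:
    """Flatten a table into a paragraph by scanning cells row by row.

    Every output word is either a table entry or a functional word
    belonging to the (assumed) template.
    """
    sentences = [f"Given the table titled {title}."]
    for i, row in enumerate(rows, start=1):
        cells = ", ".join(f"the {col} is {val}" for col, val in zip(header, row))
        sentences.append(f"In row {i}, {cells}.")
    return " ".join(sentences)


# Hypothetical two-row table used only for illustration.
paragraph = linearize_table(
    "Medal Table",
    ["nation", "gold"],
    [["Canada", "3"], ["Mexico", "2"]],
)
print(paragraph)
```

Because the column name is repeated next to every value, the linearized text stays interpretable without a fixed schema, at the cost of a longer input sequence.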


GPT-TabGen

We directly feed the paragraph $P_T$ as the input to the pre-trained GPT-2 model and generate the output sentence $Y$. We finetune the model on LogicNLG by maximizing the likelihood $p(Y \mid P_T; \beta)$, with $\beta$ denoting the parameters of the GPT-2 model Radford et al. (2019).


BERT-TabGen

1) We encode the linearized paragraph $P_T$ using the pre-trained BERT model into the source representation $h_1, \dots, h_{|T|}$. 2) At the $i$-th time step, we replace all the words in the groundtruth statement $Y$ after the $i$-th time step by the MASK token and use BERT to encode the partially masked $Y^i$ as $g^i_1, \dots, g^i_n$. 3) We use an attention layer $f_\theta$ to obtain the output hidden states $\hat{g}^i_1, \dots, \hat{g}^i_n$, where $\hat{g}^i_i$ is used to predict the word $\hat{y}_i$. We jointly optimize $\beta$ of BERT and $\theta$ to maximize the likelihood of generating text conditioned on the table and the masked partial sentence. As BERT is a bidirectional model, we need to re-encode the target sentence at each step to get $g^i_{1:n}$. Therefore, the generation is finished with $n$ passes.
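The need for one encoder pass per generated word follows from the progressive masking of the target sentence. The sketch below builds the partially masked views only; it assumes that when predicting the word at position i, positions i and later are masked so that the target word itself is hidden. The function name and the exact mask boundary are assumptions, and no BERT model is involved.

```python
from typing import List


def masked_views(tokens: List[str], mask: str = "[MASK]") -> List[List[str]]:
    """Return one partially masked copy of `tokens` per prediction step.

    View i keeps the first i tokens and masks the rest, so the word at
    position i must be predicted from the left context only (assumed
    boundary; the source text does not pin down whether position i
    itself is masked).
    """
    n = len(tokens)
    return [tokens[:i] + [mask] * (n - i) for i in range(n)]


views = masked_views(["Canada", "won", "more"])
for view in views:
    print(" ".join(view))
```

Each view has to be re-encoded by the bidirectional model, which is why a sentence of n words needs n passes, in contrast to a left-to-right model that can reuse earlier computation.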

Figure 7: Coarse-to-fine generation scheme: first generate a template, and then realize the surface form. It exposes more context to the surface realization model for better capturing logical dependency.

4.3 Training

In addition to the standard maximum likelihood training, we also use the following training algorithms:

Adversarial Regularization

To encourage the model to ground on the table rather than relying on artificial language priors Ramakrishnan et al. (2018), we use an adversarial regularization to enhance the maximum likelihood training. Specifically, we first perform entity resolution to locate all the numbers, counts, and entities in the sentence and then randomly replace them with entities or numbers appearing in the table $T$. These perturbed samples $Y_{adv}$ are used as adversarial examples to regularize the model’s behavior. Formally, we optimize the model parameters $\beta$ to maximize the objective:

$$\arg\max_\beta \; \log p(Y \mid T; \beta) - \lambda \log p(Y_{adv} \mid T; \beta)$$

where $\lambda$ is the controlling hyper-parameter.
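The perturbation step and the regularized objective can be sketched as follows. This is an illustration under stated assumptions: entity positions are taken as given (the entity resolution step is not implemented), the objective is assumed to have the form log p(Y|T) minus a weighted log p(Y_adv|T), and all function names and example values are hypothetical.

```python
import random
from typing import List


def perturb(tokens: List[str], entity_positions: List[int],
            table_values: List[str], rng: random.Random) -> List[str]:
    """Replace each linked entity/number with a different value from the table."""
    out = list(tokens)
    for pos in entity_positions:
        candidates = [v for v in table_values if v != tokens[pos]]
        if candidates:  # leave the token unchanged if no alternative exists
            out[pos] = rng.choice(candidates)
    return out


def adv_reg_objective(logp_original: float, logp_adversarial: float,
                      lam: float) -> float:
    """Assumed objective: reward the true sentence, penalize the perturbed one."""
    return logp_original - lam * logp_adversarial


# Hypothetical sentence and table values, for illustration only.
sentence = ["Canada", "won", "3", "gold", "medals"]
adversarial = perturb(sentence, [0, 2], ["Canada", "Mexico", "3", "2"],
                      random.Random(0))
objective = adv_reg_objective(-2.0, -5.0, lam=0.5)
```

Because replacements are drawn from the same table, the perturbed sentence stays fluent and on-topic, so the model cannot separate the two sentences by language prior alone.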

Reinforcement Learning

The maximum likelihood training is a fluency-driven objective, which is inconsistent with the goal of logical consistency. To bridge the gap, we view the generation problem from the reinforcement learning perspective to optimize the long-term fidelity. We use the trained semantic parser to assign a reward to the policy $p(y_i \mid y_{1:i-1}; \beta)$. At the $i$-th step, the generator samples different actions $y_i$ and rolls out from the $(i+1)$-th step to produce a full sequence starting from $y_i$ using greedy search. The full sentence receives a binary score $r(Y, T)$ from the semantic parser as reward. Formally, we optimize the objective:

$$\arg\max_\beta \; \mathbb{E}_{y_i \sim p(y_i \mid y_{1:i-1})} \Big[ \mathbb{E}_{y_{i+1:n}} \big[ r(y_{1:n}, T) \big] \Big] \log p(y_i \mid y_{1:i-1}; \beta)$$

where we only use one trajectory to approximate the inner roll-out expectation for efficiency.
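The roll-out procedure can be sketched in a few lines. The code below is a toy illustration, not the authors' training loop: the greedy completion and the reward function are stand-ins for the generator's greedy search and the semantic parser, and every name and value is hypothetical.

```python
from typing import Callable, List


def rollout_reward(prefix: List[str], action: str,
                   greedy_complete: Callable[[List[str]], List[str]],
                   reward_fn: Callable[[List[str]], float]) -> float:
    """Score a sampled action by completing the sentence once (a single
    trajectory) and asking the reward function for a binary score."""
    sentence = greedy_complete(prefix + [action])
    return reward_fn(sentence)


def policy_gradient_term(log_prob_action: float, reward: float) -> float:
    """Per-step term to maximize: reward-weighted log-probability of the action."""
    return reward * log_prob_action


# Toy stand-ins (assumptions): a fixed continuation instead of greedy search,
# and a hand-written check instead of the semantic parser.
def toy_complete(tokens: List[str]) -> List[str]:
    return tokens + ["gold", "medals", "than", "Mexico"]


def toy_reward(tokens: List[str]) -> float:
    return 1.0 if tokens[2] == "more" else 0.0


prefix = ["Canada", "won"]
reward_more = rollout_reward(prefix, "more", toy_complete, toy_reward)
reward_fewer = rollout_reward(prefix, "fewer", toy_complete, toy_reward)
```

With a binary reward, actions whose roll-out fails the check contribute nothing to the objective, so only sampled actions that lead to a verified sentence have their probability increased.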

5 Coarse-to-Fine Generation

As discussed before, the baseline models follow the monotonic generation scheme and suffer from the mismatch between sequence order and logical order (Figure 2). In this section, we propose an imperfect remedy for such a situation based on the coarse-to-fine generation paradigm.

Before plunging into technical details, it is helpful to first realize the resemblance between logical NLG and semantic parsing Dong and Lapata (2018). Compared to traditional NLG tasks like machine translation and summarization, logical NLG is closer to semantic parsing in the sense that a model may make catastrophic errors that are impossible to correct at later steps (Figure 2). Therefore, we take inspiration from semantic parsing models Dong and Lapata (2018) that have proven effective in mitigating such errors and propose a coarse-to-fine generation scheme. We break down generation into two phases. In the first phase, the model only generates a template which determines the global logical structure, while in the second phase the model generates the final, grounded sentence conditioned on the template generated in the first phase. As depicted in Figure 7, we use the entity linker from the semantic parser (Section 3) to identify the entities and numbers in the original sentence and replace them with the placeholder “[ENT]”; we call the result the template $Y_T$. During the generation of GPT-TabGen, instead of directly predicting the final sentence $Y$, we first predict the template $Y_T$ and then $Y$. The process is simply realized by concatenating the template with the sentence as $\tilde{Y} = [Y_T; \text{[SEP]}; Y]$ and maximizing the overall likelihood $p(\tilde{Y} \mid T; \beta)$.
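Constructing the coarse-to-fine training target is a small preprocessing step, sketched below. Entity positions are assumed to come from an entity linker that is not implemented here; the function names, the "[SEP]" separator token, and the example sentence are illustrative assumptions.

```python
from typing import List, Set


def make_template(tokens: List[str], entity_positions: Set[int],
                  placeholder: str = "[ENT]") -> List[str]:
    """Replace linked entities and numbers with a placeholder token."""
    return [placeholder if i in entity_positions else tok
            for i, tok in enumerate(tokens)]


def coarse_to_fine_target(tokens: List[str], entity_positions: Set[int],
                          sep: str = "[SEP]") -> List[str]:
    """Concatenate template and full sentence into one target sequence.

    `sep` is an assumed separator token between the two phases.
    """
    return make_template(tokens, entity_positions) + [sep] + tokens


# Hypothetical sentence; positions 0, 2 and 7 are assumed to be linked.
sentence = ["Canada", "obtained", "1", "more", "gold", "medal", "than", "Mexico"]
template = make_template(sentence, {0, 2, 7})
target = coarse_to_fine_target(sentence, {0, 2, 7})
print(" ".join(target))
```

Since the template precedes the sentence in the target, a left-to-right model has already committed to "more ... than" by the time it fills in "Canada", which is the extra context the scheme is meant to expose.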

Unlike template-based or delexicalized generation Reiter and Dale (1997); Wen et al. (2015), which uses rigid slot filling prone to grammatical errors, our fine-grained generation has the flexibility to modify the surface form of non-slot words, which alleviates the linguistic coherence problem Sharma et al. (2017).

By decoupling sentence structure generation and entity grounding, our proposed coarse-to-fine scheme could partially alleviate the mismatch problem. For example, the generation of “Canada” is now aware of “more than” in the latter part of the sentence, which exposes the model to more context than standard monotonic models to help make logically consistent decisions though the dependency on the “1” and “Mexico” is still not captured. The proposed two-step generation could be viewed as the first step towards a fully non-monotonic generation model to solve such mismatch problem.

Model                       Training  PPL   BLEU-1  BLEU-2  BLEU-3  SP-Acc  NLI-Acc  Adv-Acc
Field-Gating + LSTM         MLE       27.7  42.3    19.5    6.9     38.0    56.8     56.2
Field-Gating + Trans        MLE       26.8  44.1    20.9    8.3     38.5    57.3     58.1
Field-Infusing + LSTM       MLE       27.9  43.1    19.7    7.1     38.6    57.1     56.9
Field-Infusing + Trans      MLE       26.9  43.7    20.9    8.4     38.9    57.3     58.2
BERT-TabGen (sm)            MLE       7.5   47.8    26.3    11.9    42.2    68.1     62.4
GPT-TabGen (sm)             MLE       8.8   48.8    27.1    12.6    42.1    68.7     62.3
GPT-TabGen (sm)             Adv-Reg   12.1  45.8    23.1    9.6     40.9    68.5     64.7
GPT-TabGen (sm)             RL        11.3  45.1    23.6    9.1     43.1    67.7     61.9
GPT-Coarse-to-Fine (sm)     MLE       -     46.6    26.8    13.3    42.7    72.2     64.9
BERT-TabGen (lg)            MLE       6.3   49.1    27.7    13.5    44.4    73.9     64.0
GPT-TabGen (med)            MLE       6.8   49.6    28.2    14.2    44.7    74.6     64.3
GPT-TabGen (med)            Adv-Reg   10.1  47.2    24.0    10.8    44.1    73.0     65.4
GPT-TabGen (med)            RL        10.0  46.4    24.1    10.0    45.5    73.3     63.7
GPT-Coarse-to-Fine (med)    MLE       -     49.0    28.3    14.6    45.3    76.4     66.0
Table 2: The experimental results of different models on the test split of LogicNLG, where we split the table into non-pretrained LSTM/Transformer, small pre-trained LM (sm) and medium/large pre-trained LM (med/lg).

6 Experiments

In this section, we explain the experimental details and then comprehensively report the automatic evaluation of different generation models and training algorithms. Finally, we will conduct detailed human evaluation and error analysis.

6.1 Experiment Setup

For the non-pretrained models, we fix the hidden size of both LSTM and Transformer to 256; the Transformer is 3-layered with 4 heads, and the LSTM is also 3-layered. We use the Adam optimizer Kingma and Ba (2015) with a learning rate of 2e-4 to jointly optimize the parameters and keep the model with the best perplexity on the validation set. At test time, we use greedy search to generate text and calculate the BLEU-1,2,3 scores with the 5 references from the table. For the pre-trained models, we base our implementation on Huggingface’s Transformers Wolf et al. (2019) for both BERT Devlin et al. (2019) and GPT-2 Radford et al. (2019), with a subword unit vocabulary of 30K. During linearization, we truncate the input to fit the length constraint (512 for BERT and 1024 for GPT-2). Both are finetuned using the Adam optimizer Kingma and Ba (2015) with a learning rate of 1e-6. In both the adversarial training and reinforcement learning algorithms, we add the maximum likelihood objective to stabilize the training, and we select the appropriate balancing factor based on the validation Adv-Acc score. For coarse-to-fine training, we first warm up the model to generate the template sequence and then finetune it on the concatenated full sequence. Model selection is based on the BLEU-3 score on the validation split.

Figure 8: The human evaluation results of different models on the sampled sentences.

6.2 Experimental Results

We first perform an automatic evaluation to approximately measure the performance of different models and then conduct an in-depth human evaluation to have a better understanding.

Automatic Evaluation:

The experimental results are summarized in Table 2, where we comprehensively survey different architectures and training algorithms. For the non-pretrained models, we observe that Transformer is slightly better than LSTM and that the two table encoding strategies achieve similar results. In contrast, pre-trained models are much better at lowering the perplexity, and their generated sentences significantly outperform those of the non-pretrained models in terms of both fluency and fidelity scores, with GPT-TabGen and BERT-TabGen achieving similar performance. As BERT-TabGen runs much slower due to multiple passes of decoding, we favor GPT-TabGen in the following experiments. With adversarial regularization and reinforcement training, the model can only improve the optimized fidelity metric, with the fluency scores dropping significantly. Such phenomena confirm our assumption about the caveats of the monotonic generation paradigm. In the proposed coarse-to-fine generation scheme, the “[ENT]” tokens are replaced by entity names, which are normally phrases like “Feb 2nd”. Such n-gram phrase substitution preserves the completeness of entity names and thus leads to higher 2/3/4-gram matches, which translates to higher BLEU-3 and lower BLEU-1 in Table 2. The proposed coarse-to-fine generation yields significant improvements on the NLI-Acc and Adv-Acc metrics in both the small and medium model settings, which demonstrates the advantages of non-monotonic generation in capturing logical dependency.

Human Evaluation

To further investigate the quality of the generated text, we perform human evaluation. Specifically, we sample 200 sentences from different models and distribute them independently to human experts (graduate students from the computer science department) to verify their quality. The quality measure is divided into four categories: 1) non-sense: the sentence does not make much sense, which is mainly due to disfluency or repetition problems. 2) wrong: a fluent sentence with wrong logic. 3) partial-correct: the sentence contains more than one fact, and at least one of them is correct. 4) correct: the sentence is of high quality in both fluency and logical correctness. We show the results in Figure 8, from which we observe that pre-training significantly decreases the non-sense proportion. However, RL and Adv-Reg both harm the fluency and lead to more non-sense sentences. In contrast, the coarse-to-fine model can maintain the non-sense proportion while significantly increasing correct/partial-correct sentences. According to the human evaluation, even the best-performing model gets only 20% of its predictions logically correct, which reflects the difficulty of maintaining logical consistency in NLG models.

Evaluation Metrics

Here we analyze the effectiveness of the defined automatic evaluation metrics for fidelity evaluation. For the parsing-based evaluation and NLI-based evaluation, we use the adversarial set (containing positive/negative sample pairs) to evaluate their consistency with human judges. The parsing-based model only achieves an accuracy of 60%, while the NLI-based model achieves a higher accuracy of 65%. This indicates that fidelity measurement is itself a very challenging problem and that the existing models are still at an early stage. Therefore, the exact number of SP-Acc or NLI-Acc cannot reliably reflect the exact proportion of sentences logically entailed by the table. However, we still believe they are informative for model development for the following reasons: 1) the automatic fidelity scores are quite stable and not sensitive to random initialization or different configurations, 2) when comparing different models (Transformer vs. GPT-2 vs. RL/Adv-Reg vs. Coarse-to-Fine), the trends of the different automatic scores are consistent with human evaluation, which indicates their potential in assisting the development of new models.

Fine-grained Analysis

To better understand the generation model’s reasoning capability with regard to different logical operations, we pick the 9 most frequent operations (definitions in the Appendix) and analyze the best model’s capability in expressing these different types of logic. We show our human evaluation in Figure 8 and make the following observations: 1) the model performs best in justifying the order of different entities (before/after) and relating two entities (both/neither/comparison). 2) the model performs reasonably well at superlative and count operations. 3) the generation model performs much worse on operations like “only, unique”. 4) the model is not able to perform mathematical aggregation like average, sum, etc. Overall, the string-based operations are easier than the numeric-based operations; how to infuse numeric knowledge is an open research question going forward.

7 Related Work

Natural Language Generation

Natural language generation is a long-standing problem Kukich (1983); Holmes-Higgin (1994); Reiter and Dale (1997), which involves generating text from records or data. Recently, many neural generation models have been proposed Puduppully et al. (2019, 2019); Lebret et al. (2016); Wiseman et al. (2018) and achieve impressive performance on the existing datasets Chen and Mooney (2008); Liang et al. (2009); Lebret et al. (2016); Dušek et al. (2019); Wiseman et al. (2017), since the annotated texts are mostly surface-level descriptions without logical inference. Unlike them, LogicNLG involves rich inference, which poses great challenges to existing models and evaluations.

Non-monotonic Generation

There have been recent attempts to study the problem of non-monotonic text generation, which aims to teach the generation model to learn the generation order without external supervision Ford et al. (2018); Welleck et al. (2019); Gu et al. (2019); Mansimov et al. (2019). These models have been shown to learn a reasonable generation order and to approach the performance of the left-to-right case. These approaches are useful for capturing more sophisticated dependencies within the sentence, which provides a plausible direction to pursue in LogicNLG.

Factualness Evaluation

Fidelity is an important research topic in generation. In ROTOWIRE Wiseman et al. (2017) and MSCOCO Lin et al. (2014), IE-based extractive evaluations Rohrbach et al. (2018); Dhingra et al. (2019) are adopted for surface-level matching to replace costly human evaluation. In abstractive summarization, Goodrich et al. (2019) propose an NER + relation classification method to investigate fidelity in generated summaries, while Kryściński et al. (2019) propose to use NLI models to assess the entailment between the generated text and the given document. These evaluations go beyond the surface level to study more sophisticated linguistic phenomena like paraphrasing, compression, entailment, inclusion, etc., which are common in summarization tasks.

8 Conclusion

In this paper, we propose logical NLG, where the model is required to generate statements involving logical operations. To study this problem, we conduct comprehensive experiments showing that existing NLG models are restricted by their monotonic nature. We believe this is a proper next-step problem: building models that go beyond surface-level copying to jointly perform logical reasoning and generation. Besides, further improving the quality of automatic metrics is also a critical step towards solving the problem.

9 Acknowledgement

The authors would like to thank the anonymous reviewers for their thoughtful comments, which greatly helped to polish the paper.


  • Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin (2003) A neural probabilistic language model. Journal of machine learning research 3 (Feb), pp. 1137–1155. Cited by: §3.
  • S. Bird (2006) NLTK: the natural language toolkit. In Proceedings of the COLING/ACL on Interactive presentation sessions, pp. 69–72. Cited by: §3.
  • S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning (2015) A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 632–642. Cited by: §1.
  • D. L. Chen and R. J. Mooney (2008) Learning to sportscast: a test of grounded language acquisition. In Proceedings of the 25th international conference on Machine learning, pp. 128–135. Cited by: §2, §7.
  • W. Chen, H. Wang, J. Chen, Y. Zhang, H. Wang, S. Li, X. Zhou, and W. Y. Wang (2019) TabFact: a large-scale dataset for table-based fact verification. arXiv preprint arXiv:1909.02164. Cited by: Logical Natural Language Generation from Open-Domain Tables, §1, §2, §3.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Cited by: §1, §4.2, §6.1.
  • B. Dhingra, M. Faruqui, A. Parikh, M. Chang, D. Das, and W. Cohen (2019) Handling divergent reference texts when evaluating table-to-text generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4884–4895. Cited by: §1, §7.
  • L. Dong and M. Lapata (2018) Coarse-to-fine decoding for neural semantic parsing. In ACL, Cited by: §5.
  • O. Dušek, J. Novikova, and V. Rieser (2019) Evaluating the state-of-the-art of end-to-end natural language generation: The E2E NLG Challenge. arXiv preprint arXiv:1901.11528. Cited by: §1, §2, §7.
  • N. Ford, D. Duckworth, M. Norouzi, and G. Dahl (2018) The importance of generation order in language modeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2942–2946. Cited by: §1, §7.
  • I. J. Goodfellow, J. Shlens, and C. Szegedy (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: §3.
  • B. Goodrich, V. Rao, P. J. Liu, and M. Saleh (2019) Assessing the factual accuracy of generated text. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 166–175. Cited by: §7.
  • J. Gu, Q. Liu, and K. Cho (2019) Insertion-based decoding with automatically inferred generation order. TACL. Cited by: §7.
  • J. Gu, Z. Lu, H. Li, and V. O. Li (2016) Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1631–1640. Cited by: §1, §1.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §4.1, §4.1, §4.1.
  • P. Holmes-Higgin (1994) Text generation—using discourse strategies and focus constraints to generate natural language text, by Kathleen R. McKeown, Cambridge University Press, 1992, pp 246, £13.95, ISBN 0-521-43802-0. The Knowledge Engineering Review 9 (4), pp. 421–422. Cited by: §7.
  • A. Kannan and O. Vinyals (2017) Adversarial evaluation of dialogue models. arXiv preprint arXiv:1701.08198. Cited by: §3.
  • D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. ICLR. Cited by: §6.1.
  • W. Kryściński, B. McCann, C. Xiong, and R. Socher (2019) Evaluating the factual consistency of abstractive text summarization. arXiv preprint arXiv:1910.12840. Cited by: §3, §7.
  • K. Kukich (1983) Design of a knowledge-based report generator. In Proceedings of the 21st annual meeting on Association for Computational Linguistics, pp. 145–150. Cited by: §7.
  • R. Lebret, D. Grangier, and M. Auli (2016) Neural text generation from structured data with application to the biography domain. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1203–1213. Cited by: §2, §4.1, §7.
  • P. Liang, M. I. Jordan, and D. Klein (2009) Learning semantic correspondences with less supervision. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1-Volume 1, pp. 91–99. Cited by: §1, §2, §3, §7.
  • P. Liang, M. I. Jordan, and D. Klein (2013) Learning dependency-based compositional semantics. Computational Linguistics 39 (2), pp. 389–446. Cited by: §3.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §7.
  • A. Liu, J. Du, and V. Stoyanov (2019) Knowledge-augmented language model and its application to unsupervised named-entity recognition. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 1142–1150. Cited by: §4.2.
  • T. Liu, K. Wang, L. Sha, B. Chang, and Z. Sui (2018) Table-to-text generation by structure-aware seq2seq learning. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §1, §4.1.
  • E. Mansimov, A. Wang, and K. Cho (2019) A generalized framework of sequence generation with application to undirected sequence models. arXiv preprint arXiv:1905.12790. Cited by: §7.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318. Cited by: §3.
  • P. Pasupat and P. Liang (2015) Compositional semantic parsing on semi-structured tables. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1470–1480. Cited by: §1.
  • R. Puduppully, L. Dong, and M. Lapata (2019) Data-to-text generation with entity modeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 2023–2035. Cited by: §1, §7.
  • R. Puduppully, L. Dong, and M. Lapata (2019) Data-to-text generation with content selection and planning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 6908–6915. Cited by: §1, §1, §7.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI Blog 1 (8). Cited by: §1, §4.2, §4.2, §6.1.
  • S. Ramakrishnan, A. Agrawal, and S. Lee (2018) Overcoming language priors in visual question answering with adversarial regularization. In Advances in Neural Information Processing Systems, pp. 1541–1551. Cited by: §4.3.
  • E. Reiter and R. Dale (1997) Building applied natural language generation systems. Natural Language Engineering 3 (1), pp. 57–87. Cited by: §5, §7.
  • A. Rohrbach, L. A. Hendricks, K. Burns, T. Darrell, and K. Saenko (2018) Object hallucination in image captioning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4035–4045. Cited by: §1, Figure 3, §7.
  • A. See, P. J. Liu, and C. D. Manning (2017) Get to the point: summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1073–1083. Cited by: §1, §1.
  • S. Sharma, J. He, K. Suleman, H. Schulz, and P. Bachman (2017) Natural language generation in dialogue using lexicalized and delexicalized data. ICLR Workshop. Cited by: §5.
  • A. Talmor, J. Herzig, N. Lourie, and J. Berant (2019) CommonsenseQA: a question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4149–4158. Cited by: §1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §4.1.
  • S. Welleck, K. Brantley, H. Daumé III, and K. Cho (2019) Non-monotonic sequential text generation. In International Conference on Machine Learning, pp. 6716–6726. Cited by: §1, §7.
  • T. Wen, M. Gasic, N. Mrkšić, P. Su, D. Vandyke, and S. Young (2015) Semantically conditioned lstm-based natural language generation for spoken dialogue systems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1711–1721. Cited by: §5.
  • S. Wiseman, S. Shieber, and A. Rush (2017) Challenges in data-to-document generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2253–2263. Cited by: §1, §1, Figure 3, §2, §3, §7, §7.
  • S. Wiseman, S. Shieber, and A. Rush (2018) Learning neural templates for text generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3174–3187. Cited by: §7.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, and J. Brew (2019) HuggingFace’s transformers: state-of-the-art natural language processing. ArXiv abs/1910.03771. Cited by: §6.1.
  • N. Zhang, S. Deng, Z. Sun, J. Chen, W. Zhang, and H. Chen (2019) Relation adversarial network for low resource knowledge graph completion. arXiv preprint arXiv:1911.03091. Cited by: §4.2.

Appendix A Dataset Examples

To give readers a better sense of the statements in LogicNLG, we show some typical examples in Figure 9 and Figure 10. Each table in the dataset is associated with five different examples covering diverse inference skills. For example, Figure 9 requires the ‘all’ operation to identify multiple rows having the same value on certain properties. Figure 10 requires the model to perform a superlative or count operation to identify the numerically highest value.

Figure 9: Example from LogicNLG.
Figure 10: Example from LogicNLG.

Appendix B Logical Operation Distribution

The dataset covers the most common types of logical inference in daily communication. To help readers understand the semantics of these inferences, we list their definitions and some examples below:

  • superlative: operations involving max, min or other comparison operations to get the lowest or highest value. Sentence: xxx is the tallest player in xxx team.

  • only: operations to identify the single entity that has a unique property the other entries do not have. Sentence: xxx is the only person to win all the games.

  • before/after: operations to compare temporal or spatial order. Sentence: xxx was born before xxx.

  • count: operations to count the number of entries meeting a certain criterion. Sentence: there are two people from the central united states.

  • comparison: operations to compare two or a given number of entities. Sentence: xxx has better income than xxx.

  • both/neither: operations to summarize the common properties of two entries. Sentence: xxx and xxx are both from the same country.

  • sum/diff: operations to compute the sum of or difference between numbers. Sentence: xxx gives 1 more dollar than xxx in the donation.

  • average: operations to compute the average of a set of numbers. Sentence: the average number of people attending the game is 500.

  • unique: the uniq operation (as in SQL) to collect and summarize the distinct entities. Sentence: from the table, players are from 4 unique countries.
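
The operations listed above can be made concrete on a small table. The following is a minimal sketch in plain Python; the table, its column names and the variable names are invented for illustration and do not come from LogicNLG.

```python
from statistics import mean

# Hypothetical toy table; all rows and column names are made up.
table = [
    {"player": "a", "country": "usa", "born": 1990, "height": 201, "points": 12},
    {"player": "b", "country": "usa", "born": 1985, "height": 198, "points": 30},
    {"player": "c", "country": "spain", "born": 1992, "height": 210, "points": 7},
    {"player": "d", "country": "france", "born": 1988, "height": 190, "points": 7},
]
by_name = {row["player"]: row for row in table}

# superlative: "c is the tallest player"
tallest = max(table, key=lambda row: row["height"])["player"]

# only: "b is the only player with more than 20 points"
over_20 = [row["player"] for row in table if row["points"] > 20]
only_over_20 = over_20[0] if len(over_20) == 1 else None

# before/after: "b was born before a"
b_born_before_a = by_name["b"]["born"] < by_name["a"]["born"]

# count: "there are two players from the usa"
num_usa = sum(1 for row in table if row["country"] == "usa")

# comparison: "b has more points than a"
b_outscores_a = by_name["b"]["points"] > by_name["a"]["points"]

# both/neither: "a and b are both from the usa"
both_usa = by_name["a"]["country"] == by_name["b"]["country"] == "usa"

# sum/diff: "b scored 18 more points than a"
point_diff = by_name["b"]["points"] - by_name["a"]["points"]

# average: "the average number of points is 14"
avg_points = mean(row["points"] for row in table)

# unique: "players are from 3 unique countries"
num_countries = len({row["country"] for row in table})
```

Each variable corresponds to one statement a model would have to verbalize; the numeric operations (sum/diff, average) require computing a value that appears in no cell of the table.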

Figure 11: The BFS-based parser used in our evaluation.

Appendix C Semantic Parser

Specifically, the scorer is realized by a matching model, which takes a logical form and the statement and outputs a consistency score in the range [0,1], with a higher value indicating better consistency. As no ground-truth logical forms are provided, we use weakly supervised training. Among the generated logical forms, those returning the binary value True are viewed as pseudo-positive examples and those returning False are treated as pseudo-negative examples. We optimize the matching model to discriminate between the two sets.
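
A common way to train a scorer to separate two such sets is a binary cross-entropy loss over the pseudo labels. The sketch below uses that loss purely as an assumed stand-in; the function name `discriminative_loss` and the example scores are hypothetical, and it should not be read as the objective used in the paper.

```python
import math
from typing import Sequence


def discriminative_loss(
    pos_scores: Sequence[float],
    neg_scores: Sequence[float],
    eps: float = 1e-12,
) -> float:
    """Mean binary cross-entropy over scores in [0, 1].

    Scores of pseudo-positive logical forms are pushed toward 1 and
    scores of pseudo-negative logical forms toward 0.
    """
    pos_terms = [-math.log(max(s, eps)) for s in pos_scores]
    neg_terms = [-math.log(max(1.0 - s, eps)) for s in neg_scores]
    terms = pos_terms + neg_terms
    if not terms:
        raise ValueError("need at least one score")
    return sum(terms) / len(terms)


# Hypothetical matching-model outputs for one statement.
well_separated = discriminative_loss([0.9, 0.8], [0.1, 0.2])
badly_separated = discriminative_loss([0.2, 0.1], [0.9, 0.8])
```

The loss is lower when forms that execute to True receive higher consistency scores than forms that execute to False, which is the behaviour weak supervision is meant to encourage.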

As demonstrated in Figure 11, our semantic parser comprises three parts: a resolution model, a breadth-first search (BFS) model and a ranker model. The resolution model tries to identify the entities appearing in the table and the numbers that need to be inferred. These results are pushed to a buffer as the starting point, and the BFS then tries to compose plausible logical forms from the values in the buffer. However, most of the synthesized logical forms are not relevant to the semantics the sentence is meant to express. We therefore train a ranker, which learns to identify the most consistent logical form, and use that form to represent the formal semantics of the given sentence.
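
The resolution and search steps can be sketched on a toy input as follows. This is a simplified illustration under stated assumptions: the four operators in `OPS`, the string-matching `resolve` function and the one-column table are hypothetical, and the trained ranker is not included.

```python
import re
from itertools import permutations
from typing import Dict, List, Tuple

Item = Tuple[str, int]  # (expression string, numeric value)

# Hypothetical operator inventory, far smaller than a real parser would need.
OPS = {
    "add": lambda a, b: a + b,
    "diff": lambda a, b: a - b,
    "greater": lambda a, b: a > b,
    "eq": lambda a, b: a == b,
}


def resolve(statement: str, table: Dict[str, int]) -> List[Item]:
    """Resolution: link mentioned entities to cells and collect stated numbers."""
    buffer = [(name, value) for name, value in table.items() if name in statement]
    buffer += [(num, int(num)) for num in re.findall(r"\d+", statement)]
    return buffer


def bfs_logical_forms(buffer: List[Item], max_depth: int = 2) -> List[Tuple[str, bool]]:
    """Search: compose forms level by level and keep the boolean-valued ones."""
    pool = list(buffer)
    seen = {expr for expr, _ in pool}
    finished = []
    for _ in range(max_depth):
        new_items = []
        for (ea, va), (eb, vb) in permutations(pool, 2):
            for name, op in OPS.items():
                expr = f"{name}({ea}, {eb})"
                if expr in seen:
                    continue
                seen.add(expr)
                value = op(va, vb)
                if isinstance(value, bool):
                    finished.append((expr, value))  # complete, truth-valued form
                else:
                    new_items.append((expr, value))  # numeric, can be composed further
        pool.extend(new_items)
    return finished


table = {"alice": 30, "bob": 12}  # toy table: points per player
statement = "alice scored 18 more points than bob"
forms = bfs_logical_forms(resolve(statement, table))
true_forms = [expr for expr, value in forms if value]
```

Even on this toy input the search produces several hundred candidate forms, of which `eq(diff(alice, bob), 18)` is the one matching the statement; picking it out from the rest is the job left to the ranker.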

Appendix D Qualitative Example

Figure 12: The statements generated by GPT-TabGen model with random sampling.

Next, we demonstrate some generated samples in Figure 12, which are generated from a table crawled from a Wikipedia page (https://en.wikipedia.org/wiki/2007%E2%80%9308_Golden_State_Warriors_season). Though most of the text generated by the model is coherent and reasonable, we do observe some disfluencies such as repetition, contradiction and erroneous sentences, as in sentence 5. Of the other sentences, three are logically correct: the first sentence contains quite complex logic with three different symbolic operations (“argmax, argmin, after”), and the fourth and sixth sentences involve operations like “filter, count”. In contrast, the second and third examples are factually incorrect, as the team only competes with “Seattle” once and the 3 games are not in a row. The errors are quite diverse, and it is difficult to trace their source by simply looking into the deep generation model. In the future, more interpretable generation models need to be built to make the inference process more transparent.