Logic2Text: High-Fidelity Natural Language Generation from Logical Forms

04/30/2020 ∙ by Zhiyu Chen, et al. ∙ The Regents of the University of California; Intel

Previous works on Natural Language Generation (NLG) from structured data have primarily focused on surface-level descriptions of record sequences. However, for complex structured data, e.g., multi-row tables, it is often desirable for an NLG system to describe interesting facts from logical inferences across records. If only provided with the table, it is hard for existing models to produce controllable and high-fidelity logical generations. In this work, we formulate logical-level NLG as generation from logical forms in order to obtain controllable, high-fidelity, and faithful generations. We present a new large-scale dataset, Logic2Text, with 10,753 descriptions involving common logic types paired with the underlying logical forms. The logical forms show diversified graph structures with free schema, which poses great challenges to a model's ability to understand the semantics. We experiment in (1) a fully-supervised setting using the full dataset, and (2) a few-shot setting, where only hundreds of paired examples are provided; we compare several popular generation models and analyze their performance. We hope our dataset can encourage research towards building an advanced NLG system capable of natural, faithful, and human-like generation. The dataset and code are available at <https://github.com/czyssrs/Logic2Text>.

1 Introduction

Natural language generation (NLG) from structured data has been an important research problem in many applications. Recent data-driven methods have achieved good performance on various NLG tasks DBLP:conf/aaai/LiuWSCS18; DBLP:conf/emnlp/FreitagR18; DBLP:journals/corr/abs-1904-09521. However, most studies focus on surface descriptions of simple record sequences, for example, attribute-value pairs with a fixed or very limited schema, as in E2E DBLP:conf/sigdial/NovikovaDR17 and WikiBio DBLP:conf/emnlp/LebretGA16. In real-world cases involving multi-row tables, it is often more desirable and plausible to provide descriptions involving higher-level logical inference across data records. For example, in Figure 1, instead of plain restatements, human readers would favor diversified descriptions that summarize or draw conclusions over the table records.

chen2020logic propose the task of generating arbitrary sentences with logical inference from a table. Their task serves mainly a probing purpose, i.e., to test the ability of neural models to produce any logically correct description solely based on the table. However, such a task formulation is not yet appropriate for building an applied NLG system, for the following reasons:

1) Low fidelity. Given only the table, it is challenging for existing models to produce logically correct generations involving reasoning and symbolic calculations, e.g., max, min, counting, etc. The most performant model in chen2020logic obtains a factual correctness rate of only just over 20% based on human evaluation, which is clearly far from an acceptable level for real-world NLG systems.

2) Uncontrollable content selection. Given a table, the space of logically entailed descriptions is exponentially large, due to the vast number of combinations of different operations and arguments from the table, e.g., count, comparison, superlative, etc. It is hard for neural models to make a valid, favorable choice of logical selections based solely on the table, and that choice cannot be controlled.

To combat the above problems, we argue that it is necessary to leverage intermediate meaning representations to achieve faithful and controllable logical generations. To this end, in parallel with chen2020logic, we formulate the task of logical-level NLG as a logical-form-to-text problem. Specifically, besides the table information, the generation module is provided with a logical form representing the semantics of the target text (see Figure 1 for an example). By separating logical reasoning from language realization, the correctness of the intermediate logical form is guaranteed, and the challenge for the realization module shifts fully to semantic understanding.

To facilitate research in this direction, we propose a new dataset named Logic2Text, consisting of 5.6k open-domain tables and 10.8k manually annotated (logical form, description) pairs. We design a data annotation workflow of (1) description composition and verification, (2) logical form annotation and derivation, and (3) logical form execution and verification. Under this construction workflow our dataset is of high quality in terms of (1) natural, interesting, and diversified descriptions, and (2) accurate logical forms with 100% execution correctness. The coarse logic types in our dataset are 7 commonly used ones for describing multi-row tables: count, superlative, comparative, aggregation, majority, unique, and ordinal. We employ Python-like programs as our logical forms, which can be easily converted to other types of logical forms. Figure 1 shows two examples from our dataset. Compared with previous surface-level NLG datasets, one major distinction of our dataset is the free schema of the logical forms, which can be represented as diversified graph structures. The new dataset poses great challenges to a model's ability to understand the structural semantics of the graph representation.

Figure 1: Examples of surface-level NLG compared with NLG from logical forms in our dataset. Shown are two examples with logic types count and superlative. The function nodes are in blue, and the text nodes in grey.

We employ an array of popular generation models as baseline approaches. The experiments are conducted in (1) a fully-supervised setting, where we train the models on the full dataset to analyze their performance, and (2) a few-shot setting, where we simulate the low-resource scenario of real-world use cases. Experimental results show that the logical forms are critical to acquiring high-fidelity generations. The pre-trained language model outperforms the other baselines (seq2seq, pointer-generator, graph2seq, and transformer), but still makes factual and logical errors.

In summary, our contributions are the following:

  • We formulate logical level NLG as the task of logical form to text generation, which is an important step towards high-fidelity, faithful, and controllable generations in NLG systems.

  • We propose a new large-scale dataset, Logic2Text, with logical descriptions of common logic types accompanied by the underlying logical forms. The logical forms present diversified graph structures with free schema, which raises more challenges for semantic understanding.

  • We survey several popular generation models as baselines under fully-supervised and few-shot settings, and analyze their pros and cons.

Our dataset can also be used in the reverse way (text to logical form) to facilitate tasks such as semantic parsing. In this work, we focus on NLG.

2 Related Work

NLG from structured data or knowledge has been studied for many years. There are various applications of automatic text generation, such as weather reports DBLP:conf/acl/LiangJK09, sports reports DBLP:conf/emnlp/WisemanSR17; DBLP:conf/aaai/Puduppully0L19, clinical and health reports dimarco2007development; lee2018natural, response generation in task-oriented dialogue systems DBLP:conf/emnlp/WenGMSVY15; DBLP:conf/emnlp/BudzianowskiWTC18; dusek2019e2e, etc.

Traditional methods typically employ a pipeline-based approach comprising content selection, planning, and surface realization DBLP:journals/nle/ReiterD97; DBLP:conf/naacl/WalkerRR01; DBLP:conf/emnlp/LuNL09; DBLP:conf/acl/LiangJK09. Recent data-driven methods tend to conflate the pipeline modules into one end-to-end neural network DBLP:conf/aaai/LiuWSCS18; DBLP:conf/emnlp/WisemanSR17; DBLP:conf/emnlp/WisemanSR18; DBLP:conf/acl/LiuLYWCS19; DBLP:conf/emnlp/GongFQL19; DBLP:conf/emnlp/ChenWFJQL19. Most recently, large-scale pre-trained models radford2019language; DBLP:conf/icml/SongTQLL19; DBLP:journals/corr/abs-1910-10683 have achieved new state-of-the-art results on various generation tasks. Chen et al. DBLP:journals/corr/abs-1904-09521 demonstrate that a simple method incorporating a pre-trained language model can achieve very reasonable performance on the WikiBio dataset DBLP:conf/emnlp/LebretGA16 with only tens or hundreds of paired training examples. Freitag et al. DBLP:conf/emnlp/FreitagR18 obtain a BLEU-4 score of over 60 on the E2E dataset DBLP:conf/sigdial/NovikovaDR17 using unsupervised methods.

There are a few popular NLG datasets, mostly on surface-level generation, such as WeatherGov DBLP:conf/acl/LiangJK09, E2E DBLP:conf/sigdial/NovikovaDR17, and WikiBio DBLP:conf/emnlp/LebretGA16. RotoWire DBLP:conf/emnlp/WisemanSR17 is a more challenging dataset for generating basketball game reports from multi-row tables, but the descriptions in the reports are still limited to superficial restatements of table records, with very few involving logical inference across table rows. chen2020logic propose the task of generating arbitrary logically entailed descriptions from tables. However, as discussed in the introduction, such a setting is unrealistic for building real-world NLG systems.

3 Dataset Construction

The table source of Logic2Text is WikiTables (http://websail-fe.cs.northwestern.edu/wikiTables/about/) DBLP:conf/kdd/BhagavatulaND13, a collection of open-domain tables crawled from Wikipedia. We follow DBLP:journals/corr/abs-1909-02164 to filter out over-complicated tables and take a subset of tables with fewer than 20 rows and 10 columns.

In this dataset, we start from the 7 most commonly used logic types for describing multi-row tables DBLP:journals/corr/abs-1909-02164: count, superlative, comparative, aggregation, majority, unique, and ordinal. For example, the definition of logic type count is: counting some rows in the table based on the values in one column, with the scope of all table rows or a subset. Refer to Appendix A for the definitions of all logic types. Each description involves exactly one type of logic. This matches the observation that humans generally do not describe the information of interest in a table with over-complicated logic. For the logical forms, we use Python-like programs, with a function set that extends DBLP:journals/corr/abs-1909-02164. Refer to Appendix B for the definitions of all functions.

Our dataset is constructed in 3 stages: §3.1 description composition and verification, §3.2 logical form annotation and derivation, and §3.3 logical form execution and verification. We adopt the workflow of composing descriptions first and then deriving the logical forms, because under this order the annotators can compose natural descriptions based on the interesting facts in the table, which would be hard to achieve by automatically enumerating logical forms and then rewriting templates. For all crowd-sourcing tasks we hire workers on Amazon Mechanical Turk (AMT, https://www.mturk.com/) under three requirements: (1) from English-native countries ("US", "CA", "GB", "AU"); (2) approval rate higher than 95% across all HITs; (3) more than 500 approved HITs. We pay the workers in accordance with human subject research protocols (see https://en.wikipedia.org/wiki/Minimum_wage_in_the_United_States). We maintain strict criteria for approval and review at least 10 random samples for each worker to decide whether to approve or reject all of their HITs.

Figure 2: Description composition: the workers are asked to select three logic types and compose a statement for each, describing interesting facts in the table.
Figure 3: Logical form annotation & derivation. Note that in this example the questions are all in concise form. In the AMT interface shown to the workers, we write the instructions in a more casual and detailed manner, accompanied by several examples.

3.1 Description Composition & Verification

In this first stage, the human workers are asked to compose statements of a certain logic type that describe interesting facts in the table. It is possible that some logic types cannot be applied to certain tables. Therefore we design the following working procedure: for each table, the 7 logic types are randomly put into three groups (of sizes 2, 2, and 3; see the sketch below). The worker is asked to choose one logic type from each group and compose a description based on the chosen logic type. They must follow these requirements: (1) try to choose diversified logic types; (2) avoid template-like language and try to compose natural and interesting descriptions; (3) include the information in the table caption, so as to compose comprehensive and self-contained descriptions without unspecified pronouns. An example of the description composition workflow is shown in Figure 2. We provide the workers with detailed explanations of each logic type via its definition, accompanied by examples. After collecting the descriptions, we add a verification stage to filter out descriptions of low quality. We redistribute the collected descriptions grouped by logic type, then ask three questions about each description: is this description (1) of the correct logic type? (2) factually correct? (3) grammatically correct and fluent? We filter out a description if any of the three questions receives a negative response.
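A minimal sketch of this random grouping, in illustrative Python (not the annotation-interface code):

    import random

    LOGIC_TYPES = ["count", "superlative", "comparative", "aggregation",
                   "majority", "unique", "ordinal"]

    def make_groups():
        # Randomly split the 7 logic types into groups of sizes 2, 2, and 3;
        # the worker then picks one type from each group.
        types = LOGIC_TYPES[:]
        random.shuffle(types)
        return types[:2], types[2:4], types[4:]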

3.2 Logical Form Annotation & Derivation

As the core step of our dataset construction pipeline, we design a workflow to obtain the semantic information via conversations with human workers, then use the information to derive the logical forms. The questions in the conversation are specifically designed for each logic type. For example, for logic type count, the logical form structure prototype is:

count { filter_(eq/greater/...) { scope ; column_name ; value } } = result

where scope can be all table rows or a subset of table rows, whose structure prototype is:

scope = filter_(eq/greater/...) { scope ; column_name ; value }

Then we ask follow-up questions to derive the complete logical form based on the prototype: (1) whether the counting is performed on the scope of all table rows or on a subset of rows (scope); (2) select the table column that the counting is performed on (column_name); (3) select the criterion based on which we filter the table values to be counted (filter_eq, filter_greater, etc.); (4) based on the selected criterion, write the value to be filtered for counting (value); (5) write down the result of the counting (result). If the scope is selected to be a subset, we perform another round of conversation to derive the logical form of this subset. Figure 3 provides a more detailed example of the logical form derivation for logic type superlative, which presents a more complex structure. Note that the prototype we provide covers most descriptions of a given logic type, but not all of them, due to the diverse nature of logical-level descriptions. Thus we also give the workers the option to skip an example if it cannot be formulated with the given question set.
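For illustration, the answers to the five questions above fully instantiate the count prototype. A hypothetical derived logical form (the table, column, and values are invented) might be:

    count { filter_eq { all_rows ; region ; africa } } = 4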

3.3 Logical Form Execution & Verification

After collecting the logical forms, we use the Stanford CoreNLP toolkit (https://stanfordnlp.github.io/CoreNLP/index.html) to tokenize all text content (all table information, the descriptions, and the text in the logical forms). To remove incorrect logical forms, we first execute them against their corresponding tables and then perform another round of semantic verification.

Logical Form Execution The functions in our logical forms are based on those used in DBLP:journals/corr/abs-1909-02164. We extend the function set to deal with semi-structured table cells (mixed numbers and strings, dates, etc.). Refer to Appendix B for the definitions of all functions. We execute all logical forms against the corresponding tables and only keep the ones that evaluate to True. This guarantees that the logical forms in our dataset achieve 100% execution correctness, and thus 100% factual correctness.
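As a minimal sketch of what such execution looks like for logic type count (an illustrative re-implementation with invented data, not the released code):

    # A table is modeled as a list of row dicts; a "view" is a subset of rows.
    def filter_eq(view, column, value):
        # Keep the rows whose cell under `column` equals `value`.
        return [row for row in view if row[column] == value]

    def count(view):
        # Number of rows in the view.
        return len(view)

    table = [
        {"nation": "canada", "gold": "2"},
        {"nation": "canada", "gold": "1"},
        {"nation": "mexico", "gold": "0"},
    ]
    # count { filter_eq { all_rows ; nation ; canada } } = 2
    # An example is kept only if its logical form evaluates to True.
    assert count(filter_eq(table, "nation", "canada")) == 2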

Semantic Verification Note that execution correctness does not always guarantee semantic correctness. Therefore we perform another round of semantic verification. Since AMT workers do not have the expert knowledge needed to read the logical forms, we first convert each logical form into a natural language interpretation, based on the fact that each function has a corresponding natural language interpretation of how it operates on the table. Taking the example logical form for type count, the natural language form is: select the rows whose column_name column (matches/is greater than/...) value. count the number of these rows. the result is count_result. We then ask the workers to verify whether the interpretation correctly matches the meaning of the description, with neither insufficient nor redundant information, and remove the examples receiving negative responses.
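A minimal sketch of this interpretation step for the count prototype (the wording follows the example above; the helper and its criterion table are our own illustration):

    def interpret_count(column, criterion, value, result):
        # Map the filter criterion to its natural language phrasing.
        criterion_text = {"eq": "matches", "greater": "is greater than",
                          "less": "is less than"}[criterion]
        return (f"select the rows whose {column} column {criterion_text} "
                f"{value}. count the number of these rows. "
                f"the result is {result}.")

    print(interpret_count("nation", "eq", "canada", 2))
    # -> select the rows whose nation column matches canada. count the
    #    number of these rows. the result is 2.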

4 Dataset Statistics and Analysis

The constructed Logic2Text dataset contains 5,554 tables and 10,753 descriptions paired with corresponding logical forms. Each table has 1-3 descriptions with different logic types. We show the statistics of the dataset in Table 1 and the distribution of the 7 logic types in Figure 4.

Tables 5,554
Examples 10,753
Vocabulary 14.0k
Average description length 16.77
Average nodes in logical form 9.00
Average function nodes in logical form 3.27
Average length of the linearized logical form 24.35
Table 1: General statistics of Logic2Text.
Figure 4: Distribution of logic types.
Figure 5: The distribution of our dataset regarding the number of all nodes (Left) and function nodes (Mid) in the logical form. Right: average number of all nodes and function nodes in the logical forms for each logic type.
Figure 6: Overview of logical form structures for logic types count, superlative, and comparative. (a) count: the structure in the green shadow is optional, representing the scope of counting. It can be all table rows (a single text node) or a subset of rows from a filter operation. (b) superlative: the structure in the orange shadow is optional, depending on the presence of the max/minimum value in the description. The structure in the yellow shadow appears 0 or more times. (c) comparative: the structures in the yellow shadow are similar to those in (b).

Since the logical forms are graph-structured by nature, we analyze their complexity based on the number of nodes in the graph: the number of function nodes (count, max, etc.) and the number of all nodes (both function nodes and text nodes). As shown in Figure 5, the logical forms in Logic2Text have a minimum of 5 nodes and a maximum of over 14 nodes. Among the logic types, comparative has the largest number of nodes, because it involves the selection of, and operations on, two table rows. superlative, ordinal, and unique primarily focus on one table row, sometimes with the scope being a subset of all table rows, which makes the logical forms more complex. count, majority, and aggregation are summarization-based logic types over multiple table rows; they are the three relatively simpler ones in terms of logical form structure. Figure 6 gives the logical form structures for 3 example logic types.
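A minimal sketch of this node counting, with a logical form encoded as nested (function, children) tuples (our own encoding for illustration, not the dataset's serialization):

    def count_nodes(node):
        # Return (total nodes, function nodes) for a logical form graph.
        if isinstance(node, str):       # text node: column name, value, ...
            return 1, 0
        name, children = node           # function node
        total, funcs = 1, 1
        for child in children:
            t, f = count_nodes(child)
            total += t
            funcs += f
        return total, funcs

    form = ("count", [("filter_eq", ["all_rows", "region", "africa"])])
    print(count_nodes(form))            # -> (5, 2)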

5 Experiments

In this section we first describe the generation models for our dataset in §5.1; we then conduct comprehensive experiments and analysis in both the fully-supervised setting (§5.2) and the few-shot setting (§5.3) to investigate their performance using both automatic and human evaluation.

5.1 Baseline Models

Apart from the logical form serving as the primary input to the generation model, the table information is also crucial for helping the model understand the semantics. Following the order in which humans comprehend a table and produce a description of it, the input is formulated as the sequence of the table caption (TC), the table headers (TH), and the logical form (L); we did not observe significant improvements from adding the table content. The goal is to generate a description Y that maximizes P(Y | C), where C = [TC; TH; L]:

P(Y | C) = ∏_{t=1}^{|Y|} P(y_t | y_{<t}, C)    (1)
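Concretely, the three components might be linearized into a single input sequence as follows (a sketch; the separator strings are our assumption, not the released preprocessing):

    def linearize_input(caption, headers, logical_form):
        # C = [TC; TH; L] as one token sequence for a sequence model.
        return (f"caption : {caption} ; "
                f"headers : {' | '.join(headers)} ; "
                f"logic : {logical_form}")

    src = linearize_input(
        "opec member countries in 2012",
        ["country", "region", "joined"],
        "count { filter_eq { all_rows ; region ; africa } } = 4",
    )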

We employ the following models as our baselines for Logic2Text.

Template We manually craft templates for each logic type based on the logical form, then fill in the arguments to produce the generated description.

Seq2seq+att We employ the seq2seq with attention model from DBLP:journals/corr/BahdanauCB14. We formulate the input sequence as follows: the table caption is concatenated with the sequence of table headers, and the logical form, in linearized form, is appended last.

Pointer generator DBLP:conf/acl/SeeLM17 adds the copy mechanism upon the seq2seq with attention model, allowing the decoder to directly copy tokens from the input. Such a mechanism is known to be critical for fidelity-preserving generation with abundant entities, numbers, etc.
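Schematically, the copy mechanism of DBLP:conf/acl/SeeLM17 computes the final output distribution as a mixture of generating from the vocabulary and copying from the input:

P(w) = p_gen · P_vocab(w) + (1 − p_gen) · Σ_{i : x_i = w} a_i

where p_gen is a learned soft switch and a_i are the attention weights over input tokens x_i.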

Graph2seq+copy There is a line of research on graph neural network-based encoders, such as DBLP:conf/inlg/MarcheggianiP18; DBLP:conf/emnlp/XuWWFS18. We employ one representative model, Graph2seq DBLP:conf/emnlp/XuWWFS18, to encode the logical forms. The table caption and headers are first fed into a seq2seq, followed by the graph encoder for the logical form. We also add the copy mechanism to allow copying from the input.

Transformer+copy The popular Transformer model DBLP:conf/nips/VaswaniSPUJGKP17 has shown remarkable progress on many tasks, including NLG. It can be seen as a graph neural network that uses self-attention to aggregate neighboring information, treating the input as a fully-connected graph. In addition to the original Transformer structure, we add the copy mechanism, where the last hidden layer is used to calculate the attention scores and the copy switch. We also add segment embeddings for the different components of the input (table caption, header, and logical form), similar to DBLP:conf/naacl/DevlinCLT19.
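A minimal sketch of these segment embeddings (the dimensions and IDs are illustrative; positional encodings are omitted):

    import torch
    import torch.nn as nn

    d_model = 256
    tok_emb = nn.Embedding(50257, d_model)  # BPE subword vocabulary
    seg_emb = nn.Embedding(3, d_model)      # 0: caption, 1: header, 2: logical form

    token_ids   = torch.tensor([[11, 42, 7, 99]])
    segment_ids = torch.tensor([[0, 0, 1, 2]])
    x = tok_emb(token_ids) + seg_emb(segment_ids)  # (+ positional encoding)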

GPT-2 Built on Transformer-based structures, recent large-scale pre-trained models have achieved new state-of-the-art results across a wide range of NLP tasks. A typical workflow is to use the pre-trained model as initialization and then fine-tune it on task-specific data. In this work, we employ the generative pre-training model GPT-2 radford2019language as one of our baselines.

For all neural models we use Byte-Pair Encoding (BPE)  DBLP:conf/acl/SennrichHB16a and the subword vocabulary used in radford2019language. We refer the readers to Appendix C for more implementation details.

5.2 Fully-Supervised Setting

We first conduct experiments under the fully-supervised setting and provide ablation studies on the input components. We follow a rough ratio of 8:1:1 to split our dataset into 8,566 examples for training, 1,095 for development, and 1,092 for testing. The train, dev, and test sets share no tables. For automatic evaluation, we employ BLEU-4 (standard script NIST mteval-v13a.pl), noted as B-4, and ROUGE-1, 2, 4, and L (F measure; rouge-1.5.5), noted as R-1, R-2, R-4, and R-L. The results for all models are presented in Table 2.
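For a quick sanity check of BLEU during development, one could use, e.g., NLTK's implementation (the reported scores use the standard scripts noted above, so numbers may differ slightly; the example strings are invented):

    from nltk.translate.bleu_score import corpus_bleu

    references = [["in opec 2012 , there were 4 countries from africa .".split()]]
    hypotheses = ["there were 4 african countries in opec 2012 .".split()]
    # Default weights give uniform 1- to 4-gram precisions (BLEU-4).
    print(corpus_bleu(references, hypotheses))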

For models without pre-training, the copy mechanism brings a significant improvement, as seen by comparing the pointer-generator with seq2seq. This is because the descriptions in our dataset involve a lot of factual information from the table and the logical form, e.g., entity names and numbers. Without the copy mechanism, it is hard for neural models to outperform the template approach, which guarantees the correctness of factual terms. However, the pre-trained language model can produce these factual terms mostly accurately even without a copy mechanism, demonstrating the powerful prior knowledge obtained from large-scale pre-training.

Compared to the pointer generator, which takes the linearized logical form as input, Graph2seq+copy directly models the graph structure and obtains a slight improvement. The Transformer+copy model performs better than Graph2seq+copy, as the Transformer architecture is itself a graph neural network, with self-attention as the aggregation function over neighbors, that regards the input as a fully-connected graph. Recent works DBLP:journals/corr/abs-1906-01698; DBLP:journals/corr/abs-2002-12327 also show that Transformer-based structures can capture hierarchical syntactic structures. Moreover, in the Transformer architecture the table information and the logical form are jointly modeled, with each layer aggregating over both; this helps the model better understand the semantics when encoding the logical forms. The GPT-2 model obtains the best performance of all, with a significantly larger improvement. As a pre-trained language model with the Transformer structure, it combines the strengths of structural modeling and a language modeling prior.

Since the core challenges of our dataset are semantic understanding and logical correctness, automatic scores based on n-gram overlap are not sufficient for precise evaluation. We will present human evaluation results in future versions.

Models B-4 R-1 R-2 R-4 R-L
Template 17.57 50.56 24.20 6.61 37.81
Seq2seq+att 12.46 36.22 15.91 4.49 31.03
Pointer generator 24.03 56.23 30.51 10.78 46.85
Graph2seq+copy 25.38 58.15 32.79 12.25 49.47
Transformer+copy 26.42 58.77 33.05 12.83 49.01
GPT-2 31.49 64.47 39.59 17.34 53.04
Table 2: Automatic evaluation results for all baseline models under the fully-supervised setting.

Importance of Logical Form

We conduct an experiment without the logical form, i.e., we only feed the table caption, table headers, and table content as input, which is the task setting of generating arbitrary logically correct descriptions in chen2020logic. We then evaluate the generations against all descriptions of the same table as multi-references, as in their setting. The most performant GPT-2 model obtains a BLEU score of 17.28. Note that the automatic scores are not directly comparable, since in our task setting using the logical form, each generation maps to a unique logical form and is evaluated against a single reference. We will conduct human evaluations of fidelity preservation in future versions.

Component-Wise Ablation

We perform ablation studies on the other input components: the table caption and the table headers. We use the most performant model, GPT-2, for these experiments. As shown in Table 3, both the table caption and the headers provide strong context information for the description.

Models B-4 R-1 R-2 R-4 R-L
GPT-2 31.49 64.47 39.59 17.34 53.04
-w/o caption 21.06 53.53 28.74 9.93 45.73
-w/o header 28.78 62.64 38.44 16.20 52.95
Table 3: Ablation study: the importance of table caption and table header.

5.3 Few-Shot Setting

Considering that annotating logical forms in real-world cases is expensive, we also include a few-shot task setting for our dataset, where the model is provided with only hundreds of examples. Previous work has shown that pre-trained language models obtain strong NLG performance even with a handful of fine-tuning instances DBLP:journals/corr/abs-1904-09521. Therefore we again use the best performing GPT-2 model for this study. In our dataset, the number of unseen logical form structures increases as the training instances are reduced. The results in Table 4 show that around 200 training examples are the minimum needed to outperform the template-based approach. While there is still a gap to the fully-supervised result, the result with 1,000 training instances using GPT-2 is comparable to some of the other baselines trained on the full data, as shown in Table 2. This demonstrates the potential of incorporating generative pre-training for the few-shot learning task.

# of examples B-4 R-1 R-2 R-4 R-L
Full 31.49 64.47 39.59 17.34 53.04
100 16.80 48.16 23.60 7.14 38.54
200 20.19 52.10 26.92 9.15 41.87
500 21.80 55.56 29.82 10.48 45.60
1000 24.10 56.99 31.57 11.34 46.51
Table 4: Results for few-shot learning setting with 100, 200, 500, and 1000 training examples.
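The paper fine-tunes the released GPT-2 small (117M) code (see Appendix C). As an illustration only, an equivalent fine-tuning step with the Hugging Face transformers library (our substitution, not the authors' setup) might look like:

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)  # lr from Appendix C

    def train_step(source, target):
        # Condition on [caption; headers; logical form], learn the description.
        # (A real setup would likely mask the loss on the source tokens.)
        ids = tokenizer.encode(source + " " + target, return_tensors="pt")
        loss = model(ids, labels=ids).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        return loss.item()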

6 Conclusion

In this work, we study the task of logical-level NLG from tabular data. We formulate the problem as generation from logical forms in order to obtain controllable and high-fidelity generations for real-world NLG systems. To this end, we propose a new dataset, Logic2Text, with (logical form, description) pairs covering diversified common logic types. Besides the tasks in our experiments, there are several potential future directions based on our dataset:

1) Among our baselines, the pre-trained language model obtains the best results but still makes factual and logical errors. It remains challenging for neural models to understand and generalize over such semantic forms.

2) Human evaluations are precise but expensive. We encourage future work to propose new automatic metrics. Our dataset can also be used in the reverse direction to train a semantic parser to assist parsing-based evaluation.

3) In this work, we primarily focus on the step of generating descriptions based on the logical form. In a real-world NLG system, the logical forms should be produced based on the end application and user interests. Another potential future direction is the empirical study of how to select and organize such a plan for content selection.


Appendix

A. Logic Type Definitions

We define all 7 logic types in our dataset and provide example descriptions based on the table shown in Figure 7.

Figure 7: Example table

Count: counting some rows in the table based on the values in one column, with the scope of all table rows or a subset of rows.

Example descriptions: “in opec 2012, there were 4 countries from africa.”, “in opec 2012, among the countries from africa, 2 of them joined after 1970.”, etc.

Superlative: Describing the maximum or minimum value in a column, with the scope of all table rows or a subset of rows. You may also talk about other columns on this row with the superlative value.

Example descriptions: “in opec in 2012, angola, from africa, was the latest country to join.”, “among the member countries in opec in 2012 from the middle east, qatar was the smallest in area.”, etc.

Ordinal: Describing the n-th maximum or minimum value in a column, with the scope of all table rows or a subset of rows. You may also talk about other columns on this row with the n-th maximum or minimum value.

Example descriptions: “in opec in 2012, qatar was the 5th country to join.”, “Among the africa member countries, algeria was the 2nd earliest to join.”, etc.

Comparative: Comparing two rows in the table, regarding their values in one column. You may also talk about other columns on these two rows.

Example descriptions: “in opec in 2012, libya joined 2 years later than kuwait.”, “in opec in 2012, algeria, from africa, had a larger population than iraq from the middle east.”

Aggregation: Describing the sum or average value over a column, with the scope of all table rows or a subset of rows.

Example descriptions: “in opec 2012, the countries from africa had an average population of around 57,800,000.”, etc.

Unique: Describing one unique row, regarding one column, with the scope of all table rows or a subset of rows. You may also talk about other columns on this unique row.

Example descriptions: “in opec 2012, angola was the only country to join after 2000.”, “in 2012, among the member countries from africa, the only one to join opec after 2000 is angola.”, etc.

Majority: Describing the majority values (most or all) over one column, with the scope of all table rows or a subset of rows.

Example descriptions: “in opec 2012, most countries joined before 2000.”, “in opec 2012, all of the africa member countries had an area larger than 900,000.”, etc.

B. Function Definitions

Here we list the function definitions and descriptions for our logical forms in Table 5. Note that since the tables in WikiTables are not standard database tables but semi-structured tables, the cell values are often not well-formatted, with mixed strings and numbers, dates in different formats, etc. Therefore, for some functions involving arithmetic operations on table cell values, we only specify a coarse “object” type for the arguments, and then parse the numerical or date-type values inside the function implementations. Refer to our released code for the detailed implementations.
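As a minimal sketch of such coarse parsing (illustrative only; see the released code for the actual implementation):

    import re

    def parse_number(cell):
        # Pull the first numeric value out of a semi-structured cell,
        # e.g. "1,234 km" -> 1234.0; return None if no number is found.
        match = re.search(r"-?[\d,]*\.?\d+", cell)
        if match is None:
            return None
        return float(match.group().replace(",", ""))

    assert parse_number("1,234 km") == 1234.0
    assert parse_number("n/a") is None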

Name | Arguments | Output | Description
count | view | number | returns the number of rows in the view
only | view | bool | returns whether there is exactly one row in the view
hop | row, header string | object | returns the value under the header column of the row
and | bool, bool | bool | returns the boolean AND of the two arguments
max/min/avg/sum | view, header string | number | returns the max/min/average/sum of the values under the header column
nth_max/nth_min | view, header string | number | returns the n-th max/min of the values under the header column
argmax/argmin | view, header string | row | returns the row with the max/min value in the header column
nth_argmax/nth_argmin | view, header string | row | returns the row with the n-th max/min value in the header column
eq/not_eq | object, object | bool | returns whether the two arguments are equal
round_eq | object, object | bool | returns whether the two arguments are roughly equal under a certain tolerance
greater/less | object, object | bool | returns whether argument 1 is greater/less than argument 2
diff | object, object | object | returns the difference between the two arguments
filter_eq/not_eq | view, header string, object | view | returns the subview whose values under the header column are equal/not equal to argument 3
filter_greater/less | view, header string, object | view | returns the subview whose values under the header column are greater/less than argument 3
filter_greater_eq/less_eq | view, header string, object | view | returns the subview whose values under the header column are greater/less than or equal to argument 3
filter_all | view, header string | view | returns the view itself, for the case of describing the whole table
all_eq/not_eq | view, header string, object | bool | returns whether all values under the header column are equal/not equal to argument 3
all_greater/less | view, header string, object | bool | returns whether all values under the header column are greater/less than argument 3
all_greater_eq/less_eq | view, header string, object | bool | returns whether all values under the header column are greater/less than or equal to argument 3
most_eq/not_eq | view, header string, object | bool | returns whether most values under the header column are equal/not equal to argument 3
most_greater/less | view, header string, object | bool | returns whether most values under the header column are greater/less than argument 3
most_greater_eq/less_eq | view, header string, object | bool | returns whether most values under the header column are greater/less than or equal to argument 3
Table 5: Function definitions

C. Model Implementation Details

Here we provide some implementation details of the baseline models.

Template Some example templates are listed below. Text in parentheses is optional, depending on the logical form.

count:

in [table_caption], (among the ones whose [scope_column] are [equal to/greater than/…] [scope_value]), there are [result] ones whose [column_name] are [equal to/greater than/…] [value] .

superlative:

in [table_caption], (among the ones whose [scope_column] are [equal to/greater than/…] [scope_value]), the [max/minimum] [column_name] is [value].

in [table_caption], (among the ones whose [scope_column] are [equal to/greater than/…] [scope_value]), [subject], with ([other_col1] [other_val];…), has the [max/minimum] [column_name], ([value]).

ordinal:

similar to superlative, but replacing max/minimum with n-th max/minimum.

comparative:

in [table_caption], [subject1] has [greater/less/…] [column_name] than [subject2].

in [table_caption], [subject1] has [diff_value] [column_name] [greater/less/…] than [subject2].

in [table_caption], [subject1], with ([other_col1] [other_val];…), has [greater/less/…] [column_name] than [subject2], with ([other_col1] [other_val];…).

unique:

in [table_caption], (among the ones whose [scope_column] are [equal to/greater than/…] [scope_value]), there is only one of them whose [column_name] is [greater/less/…] than [value].

in [table_caption], (among the ones whose [scope_column] are [equal to/greater than/…] [scope_value]), the only one whose [column_name] is [greater/less/…] than [value] is for [subject], with ([other_col1] [other_val];…).

aggregation:

in [table_caption], (among the ones whose [scope_column] are [equal to/greater than/…] [scope_value]), the [average/sum] of [column_name] is [result].

majority:

in [table_caption], (among the ones whose [scope_column] are [equal to/greater than/…] [scope_value]), [most/all] of them have [column_name] [equal to/greater than/…] [majority_value].

For all neural models we use Byte-Pair Encoding (BPE) DBLP:conf/acl/SennrichHB16a and the subword vocabulary used in radford2019language. We use the pre-trained word embeddings from radford2019language and project them to a smaller dimension (300) as the word embeddings. The batch size for all models is set to 32, and the beam size to 3. As the table content only serves as context information for generation, to save GPU memory we cap the table content at a maximum length of 200.

Seq2seq+att & pointer-generator The learning rate is set to 0.001.

Graph2seq+copy We reuse the code skeleton of the released implementation of DBLP:conf/emnlp/XuWWFS18. The table caption and headers are first fed into a seq2seq, and the final hidden state is used to initialize the nodes of the graph encoder. When applying attention and copying, for graph nodes we concatenate the token embedding and the embedding of its node as the token's embedding. The learning rate is set to 0.0005.

Transformer+copy We mostly follow the structural settings of the original Transformer model DBLP:conf/nips/VaswaniSPUJGKP17. We use 4 attention heads and 6 layers. The final hidden layer is used to calculate the attention scores and the copy switch. We also add segment embeddings for the different input components, similar to devlin2018bert. The learning rate is set to 0.0005.

GPT-2 We use the GPT-2 small (117M) model, with the released code and pre-trained weights from radford2019language. The word embeddings are fixed during training. The learning rate is set to 0.0003.