Neural Multi-Step Reasoning for Question Answering on Semi-Structured Tables

02/21/2017 ∙ by Till Haug, et al. ∙ ETH Zurich ∙ Veezoo

Advances in natural language processing tasks have gained momentum in recent years due to the increasingly popular neural network methods. In this paper, we explore deep learning techniques for answering multi-step reasoning questions that operate on semi-structured tables. Challenges here arise from the level of logical compositionality expressed by questions, as well as the domain openness. Our approach is weakly supervised, trained on question-answer-table triples without requiring intermediate strong supervision. It performs two phases: first, machine understandable logical forms (programs) are generated from natural language questions following the work of [Pasupat and Liang, 2015]. Second, paraphrases of logical forms and questions are embedded in a jointly learned vector space using word and character convolutional neural networks. A neural scoring function is further used to rank and retrieve the most probable logical form (interpretation) of a question. Our best single model achieves 34.8% precision@1. An ensemble of our models pushes the state-of-the-art score on this task to 38.7%, thus slightly surpassing both the engineered feature scoring baseline and the Neural Programmer model of [Neelakantan et al., 2016].




1 Introduction

Teaching computers to answer complex natural language questions requires sophisticated reasoning and human language understanding. We investigate generic natural language interfaces for simple arithmetic questions on semi-structured tables. Typical questions for this task are topic independent and may require performing multiple discrete operations such as aggregation, comparison, superlatives or arithmetic.

We propose a weakly supervised neural model that eliminates the need for expensive feature engineering in the candidate ranking stage. Each natural language question is translated using the method of [8] into a set of machine understandable candidate representations, called logical forms or programs. Then, the most likely such program is retrieved in two steps: i) using a simple algorithm, logical forms are transformed back into paraphrases (textual representations) understandable by non-expert users; ii) these strings are then embedded together with their respective questions in a jointly learned vector space using convolutional neural networks over character and word embeddings. Multi-layer neural networks and bilinear mappings are further employed as effective similarity measures and combined to score the candidate interpretations. Finally, the highest ranked logical form is executed against the input data to retrieve the answer. Our method uses only weak supervision from question-answer-table input triples, without requiring expensive annotations of gold logical forms.

We empirically test our approach in a series of experiments on WikiTableQuestions, to our knowledge the only dataset designed for this task. An ensemble of our best models reached a state-of-the-art accuracy of 38.7% at the moment of publication.

2 Related Work

We briefly mention here two main types of QA systems related to our task¹: semantic parsing-based and embedding-based. Semantic parsing-based methods perform a functional parse of the question that is further converted to a machine understandable program and executed on a knowledge base or database. For QA on semi-structured tables with multi-compositional queries, [8] generate and rank candidate logical forms with a log-linear model, resorting to hand-crafted features for scoring. In contrast, we learn neural features for each question and the paraphrase of each candidate logical form. Paraphrases and hand-crafted features have successfully facilitated semantic parsers targeting simple factoid [1] and compositional questions [10]. Compositional questions are also the focus of [7], who construct logical forms from the question embedding through operations parametrized by RNNs, thus losing interpretability. A similar fully neural, end-to-end differentiable network was proposed by [11].

¹ An extensive list of open-domain QA publications can be found here: […]

Embedding-based methods determine compatibility between a question-answer pair using embeddings in a shared vector space [2]. Embedding learning using deep learning architectures has been widely explored in other domains, e.g. in the context of sentiment classification [3].

3 Model

We describe our QA system. For every question $q$: i) a set of candidate logical forms $\{z_i\}$ is generated using the method of [8]; ii) each such candidate program $z_i$ is transformed into an interpretable textual representation $t_i$; iii) all $t_i$'s are jointly embedded with $q$ in the same vector space and scored using a neural similarity function; iv) the logical form $z_i^*$ corresponding to the highest ranked $t_i^*$ is selected as the machine-understandable translation of question $q$ and executed on the input table to retrieve the final answer. Our contributions are the novel models that perform steps ii) and iii), while for step i) we rely on the work of [8] (henceforth: PL2015).
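The four steps can be sketched as a small ranking loop. This is only an illustration under stated assumptions: the function name `answer_question` and its callable parameters (`paraphrase`, `score`, `execute`) are hypothetical stand-ins for the candidate paraphraser, the neural similarity function and the table executor, and the toy word-overlap scorer in the usage lines is not the learned model.

```python
from typing import Callable, Sequence, TypeVar

Z = TypeVar("Z")  # type of a candidate logical form (hypothetical placeholder)


def answer_question(
    question: str,
    candidates: Sequence[Z],
    paraphrase: Callable[[Z], str],
    score: Callable[[str, str], float],
    execute: Callable[[Z], object],
) -> object:
    """Rank candidate logical forms by paraphrase similarity and execute the best."""
    if not candidates:
        raise ValueError("no candidate logical forms were generated")
    # Step ii: one textual paraphrase per candidate logical form.
    paraphrases = [paraphrase(z) for z in candidates]
    # Step iii: score every (question, paraphrase) pair.
    scores = [score(question, t) for t in paraphrases]
    # Step iv: select the top-ranked candidate and execute it on the table.
    best = max(range(len(candidates)), key=scores.__getitem__)
    return execute(candidates[best])


# Toy usage: word overlap stands in for the neural similarity function.
def toy_score(q: str, t: str) -> float:
    return float(len(set(q.lower().split()) & set(t.lower().split())))


result = answer_question(
    "how many rows",
    ["count", "first"],
    paraphrase=lambda z: {"count": "count all rows", "first": "first row"}[z],
    score=toy_score,
    execute=lambda z: {"count": 3, "first": "a"}[z],
)
```

Passing the three stages as callables keeps the ranking logic independent of how candidates are generated, which mirrors the split between step i) and steps ii) to iv).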

3.1 Candidate Logical Form Generation

We generate a set of candidate logical forms from a question using the method of [8], which we review only briefly. Specifically, a question is parsed into a set of candidate logical forms using a semantic parser that recursively applies deduction rules. Logical forms are represented in Lambda DCS form [6] and can be executed on a table to yield an answer. An example of a question and its correct logical form is given below:
How many people attended the last Rolling Stones concert?

3.2 Converting Logical Forms to Text

In Algorithm 1 we describe how logical forms are transformed into interpretable textual representations called "paraphrases". We choose to embed paraphrases in low dimensional vectors and compare these against the question embedding. Working directly with paraphrases instead of logical forms is a design choice, justified by their interpretability, comprehensibility (understandability by non-technical users) and empirical accuracy gains. Our method recursively traverses the tree representation of the logical form starting at the root. For example, the correct candidate logical form for the question mentioned in Section 3.1, namely "How many people attended the last Rolling Stones concert?", is mapped to the paraphrase "Attendance as number of last table row where act is Rolling Stones".

3.3 Joint Embedding Model

We embed the question together with the paraphrases of candidate logical forms in a jointly learned vector space. We use two convolutional neural networks (CNNs) for question and paraphrase embeddings, on top of which a max-pooling operation is applied. The CNNs receive as input token embeddings obtained as described below.

procedure Paraphrase(z)    ▷ z is the root of a Lambda DCS logical form
    switch z do
        case Aggregation    ▷ e.g. count, max, min, …
            …
        case Join    ▷ join on relations, e.g. λx.Country(x, Australia)
            … + …
        case Reverse    ▷ reverses a binary relation
            …
        case LambdaFormula    ▷ lambda expression
            …
        case Arithmetic or Merge    ▷ e.g. plus, minus, union, …
            …
        case Superlative    ▷ e.g. argmax(x, value)
            …
        case Value    ▷ i.e. constants
            …
    return t    ▷ t is the textual paraphrase of the Lambda DCS logical form
end procedure
Algorithm 1 Recursive paraphrasing of a Lambda DCS logical form. The + operation means string concatenation with spaces. The Lambda DCS language is detailed in [6].
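To make the recursive traversal concrete, here is a minimal sketch of paraphrasing a logical-form tree by string concatenation with spaces. It is an assumption-laden illustration, not the paper's implementation: the `Node` class, its three `kind` values and the text templates are hypothetical, and only three of the seven cases are covered.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical tree node for a logical form; not the Lambda DCS data model."""
    kind: str    # "aggregation", "join", or "value" (illustrative subset)
    label: str   # operator name, relation name, or constant text
    children: List["Node"] = field(default_factory=list)


def paraphrase(node: Node) -> str:
    """Recursively turn a logical-form tree into a space-joined string."""
    # Paraphrase the sub-trees first, then combine them at this node.
    parts = [paraphrase(child) for child in node.children]
    if node.kind == "value":
        return node.label
    if node.kind == "aggregation":
        # Illustrative template: operator name followed by its argument.
        return " ".join([node.label] + parts)
    if node.kind == "join":
        # Illustrative template: "<relation> is <argument>".
        return " ".join([node.label, "is"] + parts)
    raise ValueError(f"unsupported node kind: {node.kind}")


tree = Node("aggregation", "count", [
    Node("join", "country", [Node("value", "Australia")]),
])
text = paraphrase(tree)
```

Each case only needs to know how to phrase its own operator, because the strings for its children are produced by the recursive calls.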

3.3.1 Token Embedding

The embedding of an input word sequence (e.g. question, paraphrase) is depicted in Figure 1 and is similar to [4]. Every token is parametrized by learnable word and character embeddings. The latter help in dealing with unknown tokens (e.g. rare words, misspellings, numbers or dates). Token vectors are then obtained using a CNN (with multiple filter widths) over the token's constituent characters, followed by a max-over-time pooling layer and concatenation with the word vector.
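A minimal pure-Python sketch of this token embedding follows. The helper names (`char_cnn`, `embed_token`), the tiny hand-set vectors and the fallback vector `unk` are assumptions for illustration; a real model would learn all weights and would typically add a bias and non-linearity, which are omitted here.

```python
from typing import Dict, List

Vector = List[float]
Filter = List[Vector]  # one weight vector per character position in the window


def char_cnn(chars: List[Vector], filters: List[Filter]) -> Vector:
    """Slide each filter over the characters, then max-over-time pool."""
    pooled = []
    for filt in filters:
        width = len(filt)
        responses = [
            sum(w * x
                for weights, char in zip(filt, chars[start:start + width])
                for w, x in zip(weights, char))
            for start in range(len(chars) - width + 1)
        ]
        # Assumption: tokens shorter than the filter width contribute 0.0.
        pooled.append(max(responses) if responses else 0.0)
    return pooled


def embed_token(token: str, word_vecs: Dict[str, Vector],
                char_vecs: Dict[str, Vector], filters: List[Filter],
                unk: Vector) -> Vector:
    """Concatenate the word vector with the pooled character-CNN features."""
    word = word_vecs.get(token, unk)  # unknown words fall back to `unk`
    chars = [char_vecs[c] for c in token if c in char_vecs]
    return word + char_cnn(chars, filters)


char_vecs = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
filters = [
    [[1.0, 0.0]],              # width 1: responds to "a"
    [[1.0, 0.0], [0.0, 1.0]],  # width 2: responds to "ab"
]
word_vecs = {"ab": [0.5]}
known = embed_token("ab", word_vecs, char_vecs, filters, unk=[0.0])
unknown = embed_token("ba", word_vecs, char_vecs, filters, unk=[0.0])
```

The token "ba" has no word vector in this toy vocabulary, yet its character features still carry information, which is the motivation given above for character embeddings.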

3.3.2 Sentence Embedding

We map both the question and the paraphrase into a joint vector space using sentence embeddings obtained from two jointly trained CNNs. The CNNs' filters span different numbers of tokens, taken from a width set $L$. For each filter width $l \in L$, we learn $n$ different filters, each of dimension $\mathbb{R}^{l \times d}$, where $d$ is the word embedding size. After the convolution layer, we apply max-over-time pooling on the resulting feature matrices, which yields, per filter width, a vector of dimension $n$. Next, we concatenate the resulting max-over-time pooling vectors of the different filter widths in $L$ to form our sentence embedding. The final sentence embedding size is $n|L|$.
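The size of the resulting sentence embedding (number of filters per width times number of widths) can be checked with a short sketch. Everything specific here is an assumption for illustration: the helper names `make_filters` and `sentence_embedding`, the randomly initialized weights, and the example widths 1, 2 and 3 with 4 filters each; the paper's actual widths and filter counts are not reproduced.

```python
import random
from typing import Dict, List

Vector = List[float]
Filter = List[Vector]  # one weight vector per token position in the window


def make_filters(widths: List[int], n_filters: int, dim: int,
                 seed: int = 0) -> Dict[int, List[Filter]]:
    """Create `n_filters` random filters of shape (width, dim) for each width."""
    rng = random.Random(seed)
    return {
        width: [
            [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(width)]
            for _ in range(n_filters)
        ]
        for width in widths
    }


def sentence_embedding(tokens: List[Vector],
                       filters: Dict[int, List[Filter]]) -> Vector:
    """Convolve over token windows, max-over-time pool, and concatenate."""
    embedding = []
    for width in sorted(filters):
        for filt in filters[width]:
            responses = [
                sum(w * x
                    for weights, tok in zip(filt, tokens[i:i + width])
                    for w, x in zip(weights, tok))
                for i in range(len(tokens) - width + 1)
            ]
            # Assumption: sentences shorter than the width contribute 0.0.
            embedding.append(max(responses) if responses else 0.0)
    return embedding


filters = make_filters(widths=[1, 2, 3], n_filters=4, dim=5)
sentence = [[0.1 * (i + j) for j in range(5)] for i in range(6)]  # 6 tokens
vector = sentence_embedding(sentence, filters)
```

Because pooling collapses the time axis, the output length depends only on the filter configuration and not on the sentence length, so questions and paraphrases of different lengths land in the same space.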

3.3.3 Neural Similarity Measures

Let $u$ and $v$ be the sentence embeddings of question $q$ and of paraphrase $t$, respectively. We experiment with the following similarity scores: i) DOTPRODUCT: $u^\top v$; ii) BILIN: $u^\top S v$, with $S$ being a trainable matrix; iii) FC: $u$ and $v$ concatenated, followed by two sequential fully connected layers with ELU non-linearities; iv) FC-BILIN: weighted average of BILIN and FC. These models define parametrized similarity scoring functions $s : Q \times T \to \mathbb{R}$, where $Q$ is the set of natural language questions and $T$ is the set of paraphrases of logical forms.
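The four similarity scores can be written out in a few lines of plain Python. This is a sketch under assumptions: the final linear projection `w_out` that turns the second hidden layer into a scalar, and the fixed mixing weight `alpha` for FC-BILIN, are illustrative choices that the text does not specify, and all weights shown are hand-set rather than trained.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def matvec(m: Matrix, x: Vector) -> Vector:
    return [dot(row, x) for row in m]


def elu(x: float) -> float:
    """Exponential linear unit with alpha = 1."""
    return x if x > 0 else math.expm1(x)


def dotproduct_score(u: Vector, v: Vector) -> float:
    return dot(u, v)


def bilin_score(u: Vector, v: Vector, s: Matrix) -> float:
    """Bilinear form u^T S v with a trainable matrix S."""
    return dot(u, matvec(s, v))


def fc_score(u: Vector, v: Vector, w1: Matrix, b1: Vector,
             w2: Matrix, b2: Vector, w_out: Vector) -> float:
    """Concatenate u and v, apply two ELU layers, project to a scalar."""
    h1 = [elu(z + b) for z, b in zip(matvec(w1, u + v), b1)]
    h2 = [elu(z + b) for z, b in zip(matvec(w2, h1), b2)]
    return dot(w_out, h2)  # assumption: linear read-out to one score


def fc_bilin_score(bilin: float, fc: float, alpha: float) -> float:
    """Weighted average of the BILIN and FC scores (alpha is an assumption)."""
    return alpha * bilin + (1.0 - alpha) * fc


u, v = [1.0, 2.0], [3.0, 4.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
bilin = bilin_score(u, v, identity)
fc = fc_score(u, v, w1=[[1.0, 0.0, 0.0, 0.0]], b1=[0.0],
              w2=[[2.0]], b2=[0.0], w_out=[1.0])
combined = fc_bilin_score(bilin, fc, alpha=0.5)
```

With the identity matrix the bilinear score reduces to the dot product, which makes DOTPRODUCT a special case of BILIN.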

Figure 1: Conversion of a sentence into a token embedding matrix using word embeddings and char CNNs. The resulting matrix is then fed to a CNN/RNN (not pictured) to produce a sentence embedding. Figure inspired by [4].

3.4 Training Algorithm

For training, we build two sets $\mathcal{P}$ (positive) and $\mathcal{N}$ (negative) consisting of all pairs $(q, t)$ of questions and paraphrases of candidate logical forms generated as described in Section 3.1. A pair is positive or negative if its logical form gives the correct or, respectively, incorrect gold answer when executed on the corresponding table. During training, we use the ranking hinge loss function with a margin.
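For reference, the standard pairwise form of a ranking hinge loss is sketched below. It is shown as the common textbook formulation, not necessarily the exact objective used here: summing over every positive-negative combination is an assumption, the function name `ranking_hinge_loss` is hypothetical, and the default margin of 0.2 is taken from the training details reported in Section 4.

```python
from typing import Sequence


def ranking_hinge_loss(pos_scores: Sequence[float],
                       neg_scores: Sequence[float],
                       margin: float = 0.2) -> float:
    """Sum of max(0, margin - s(positive) + s(negative)) over all pairs.

    A pair contributes zero loss once the positive score exceeds the
    negative score by at least `margin`.
    """
    return sum(
        max(0.0, margin - p + n)
        for p in pos_scores
        for n in neg_scores
    )


# One positive pair scored 1.0; negatives scored 0.5 and 0.9.
loss = ranking_hinge_loss([1.0], [0.5, 0.9])
separated = ranking_hinge_loss([2.0], [0.0, 0.1])
```

Only the negative at 0.9 violates the margin in this example, so it alone contributes to the loss; the loss pushes that negative further below the positive rather than fitting absolute score values.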


4 Experiments

Dataset: For training and testing we use the train-validation-test split of WikiTableQuestions [8], a dataset containing 22,033 pairs of questions and answers based on 2,108 Wikipedia tables. This dataset is also used by our baselines, [8, 7]. Tables are not shared across these splits, which requires models to generalize to unseen data. We obtain about 3.8 million training triples $(q, t, l)$, where $l$ is a binary indicator of whether the logical form gives the correct gold answer when executed on the corresponding table. 76.7% of the questions have at least one correct candidate logical form when generated with the model of [8].
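The two quantities discussed here, the share of questions with at least one correct candidate and precision@1, can be computed from labelled triples as in the sketch below. The function names and the toy data are hypothetical, and questions for which no candidate was generated at all are not represented in this simplified version.

```python
from typing import Dict, Iterable, Tuple


def candidate_coverage(triples: Iterable[Tuple[str, str, int]]) -> float:
    """Fraction of questions with at least one correctly labelled candidate."""
    has_correct: Dict[str, bool] = {}
    for question, _paraphrase, label in triples:
        has_correct[question] = has_correct.get(question, False) or bool(label)
    if not has_correct:
        raise ValueError("no triples given")
    return sum(has_correct.values()) / len(has_correct)


def precision_at_1(scored: Iterable[Tuple[str, float, int]]) -> float:
    """Fraction of questions whose highest-scored candidate is correct."""
    best: Dict[str, Tuple[float, int]] = {}
    for question, score, label in scored:
        if question not in best or score > best[question][0]:
            best[question] = (score, label)
    if not best:
        raise ValueError("no scored candidates given")
    return sum(label for _score, label in best.values()) / len(best)


triples = [("q1", "a", 1), ("q1", "b", 0), ("q2", "c", 0), ("q2", "d", 0)]
scored = [("q1", 0.9, 1), ("q1", 0.1, 0), ("q2", 0.3, 0), ("q2", 0.2, 0)]
coverage = candidate_coverage(triples)
p_at_1 = precision_at_1(scored)
```

Since a question can only be answered correctly through a correct candidate, the coverage figure acts as an upper bound on the precision@1 any ranking model can reach on the same candidates.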

Baseline Systems                              P@1
Neural Programmer [7] (single model)
Neural Programmer [7] (15 ensemble models)
PL2015 [8]                                    37.1%

Our Models                                    P@1
CNN-FC                                        30.4%
CNN-FC-BILIN (best single model)              34.8%
CNN-FC-BILIN (15 ensemble models)             38.7%
Table 1: Precision@1 of various baselines and our models on the WikiTableQuestions dataset.

Question                                                      Paraphrase (gold vs predicted)
Which association entered last?                               association of last row
                                                              association of row with highest number of joining year
What is the total of all the medals?                          count all rows
                                                              number of total of nation is total
How many episodes were originally aired before December 1965? count original air date as date <= 12 1965
                                                              count original air date as date < 12 1965
Table 2: Example of common errors of our model.

Training Details: Our models are implemented using TensorFlow and trained on a single Tesla P100 GPU. Training takes approximately 6 hours. We initialize word vectors with 200-dimensional pre-trained GloVe vectors [9]. For the character CNN we use widths spanning 1, 2 and 3 characters. The sentence embedding CNNs use widths of […]. The fully connected layers in the FC models have 500 hidden neurons, which we regularize using 0.8-dropout. The loss margin is set to 0.2. Optimization is done using Adam [5] with a learning rate of 7e-4. Hyperparameters are tuned on the development data split of the WikiTableQuestions dataset. We choose the best performing model on the validation set using early stopping.

Results: Experimental results are shown in Table 1. Our best performing single model is FC-BILIN with CNNs. Intuitively, BILIN and FC are able to extract different interaction features between the two input vectors, while their linear combination retains the best of both models. An ensemble of 15 single CNN-FC-BILIN models set (at the moment of publication) a new state-of-the-art precision@1 for this dataset: 38.7%. This suggests that the same model initialized differently can learn different features. We also experimented with recurrent neural networks (RNNs) for the sentence embedding, since these are known to capture word order better than CNNs. However, RNN-FC-BILIN performs worse than its CNN variant.

There are a few reasons that contributed to the low accuracy obtained on this task by various methods (including ours) compared to other NLP problems: weak supervision, small training size and a high percentage of unanswerable questions.

Error Analysis: The questions our models do not answer correctly can be split into two categories: either a correct logical form is not generated, or our scoring models do not rank the correct one at the top. We perform a qualitative analysis, presented in Table 2, to reveal common question types our models often rank incorrectly. The first two examples show questions whose correct logical form depends on the structure of the table. In these cases a bias towards the more general logical form is often exhibited. The third example shows that our model has difficulty distinguishing operands that differ only slightly (e.g. "smaller than" versus "smaller than or equal to"), which may be due to weak supervision.

System                         P@1
  w/o Dropout                  33.3%
  w/o Char Embeddings          33.8%
  w/o GloVe (random init)      32.4%
  w/o Paraphrasing             33.1%

System                         Amount
Lookup                         10.8%
Aggregation & Superlatives     30.1%
Arithmetic & Comparisons       19.3%
Table 3: Ablation studies. Top: Component contributions to our model. Bottom: Types of questions answered correctly by our system.

Ablation Studies: For a better understanding of our model, we investigate the usefulness of various components with an ablation study shown in Table 3. In particular, we emphasize that replacing the paraphrasing stage with the raw strings of the Lambda DCS expressions resulted in lower precision@1, which confirms the utility of this stage.

Analysis of Correct Answers: We analyze how well our best single model performs on various question types. For this, we manually annotate 80 randomly chosen questions that are correctly answered by our model and report statistics in Table 3.

5 Conclusion

In this paper we propose a neural network QA system for semi-structured tables that eliminates the need for manually designed features. Experiments show that an ensemble of our models reaches competitive accuracy on the WikiTableQuestions dataset, thus indicating its capability to answer complex, multi-compositional questions. Our code is available at .


This research was supported by the Swiss National Science Foundation (SNSF) grant number 407540_167176 under the project "Conversational Agent for Interactive Access to Information".