Answering Conversational Questions on Structured Data without Logical Forms

08/30/2019 · Thomas Müller, et al. · Google

We present a novel approach to answering sequential questions based on structured objects such as knowledge bases or tables without using a logical form as an intermediate representation. We encode tables as graphs using a graph neural network model based on the Transformer architecture. The answers are then selected from the encoded graph using a pointer network. This model is appropriate for processing conversations around structured data, where the attention mechanism that selects the answers to a question can also be used to resolve conversational references. We demonstrate the validity of this approach with competitive results on the Sequential Question Answering (SQA) task (Iyyer et al., 2017).


1 Introduction

In recent years, there has been significant progress on conversational question answering (QA), where questions can be meaningfully answered only within the context of a conversation Iyyer et al. (2017); Choi et al. (2018); Saha et al. (2018). As in the single-turn QA setup, this line of work falls into two main categories: (i) the answers are extracted from text in a reading comprehension setting, or (ii) the answers are extracted from structured objects such as knowledge bases or tables. The latter is commonly posed as a semantic parsing task, where the goal is to map questions to a logical form which is then executed over the knowledge base to extract the answers.

In semantic parsing, there is extensive work on using deep neural networks to train models over manually created logical forms in a supervised learning setup Jia and Liang (2016); Ling et al. (2016); Dong and Lapata (2018). However, creating labeled data for this task can be expensive and time-consuming. This problem has motivated research on semantic parsing with weak supervision, where the training data consists of questions and answers along with the structured resources, and the logical form representations that yield the correct answers must be recovered during training Liang et al. (2016); Iyyer et al. (2017).

In this paper, we follow this line of research and investigate answering sequential questions over structured objects. In contrast to previous approaches, instead of learning intermediate logical forms, we propose a novel approach that encodes the structured resources, i.e. tables, along with the questions and answers from the context of the conversation. This approach allows us to handle conversational context without defining the detailed operations or the formalism-dependent vocabulary that weakly supervised semantic parsing approaches require.

We present the empirical performance of the proposed approach on the Sequential Question Answering (SQA) task Iyyer et al. (2017), improving the state-of-the-art performance on all questions, particularly on the follow-up questions that require effective encoding of the context.

2 Approach

We build a QA model for a sequence of questions that are asked about a table and can be answered by selecting one or more table cells.

Figure 1: Example of a table encoded as a graph in relation to the second question, which is a follow-up to the first question. Columns (orange boxes), rows (green boxes) and cells (gray boxes) are represented by interconnected nodes. Question words are linked to the table by edit distance (blue lines). Numerical values in the question and the table are connected by comparison relations (yellow lines). Additionally, numerical cells are annotated with their numerical rank and inverse rank with respect to the column (small blue boxes). The answer column, rows and cells of the first question are also marked (nodes with borders).

2.1 Graph Formulation

We encode tables as graphs by representing columns, rows and cells as nodes. Figure 1 shows an example graph representing how we encode the table in relation to the second question, which is a follow-up to the first question. Within a column, cells with identical texts are collapsed into a single node. In the example graph, we only create a single node for “Toronto” and a single node for “Montreal”. We then add directed edges that connect columns and rows to cells, one in either direction (orange and green edges in the figure).

The question is represented by a node covering the entire question text and a node for each token. The main question node is connected to each token, column and cell node. (We do not show some of these connections in the figure to avoid clutter.)

Nodes have associated nominal feature sets. All nodes have a feature indicating their type: column, row, cell, question or question token. The texts of column (i.e., the column name), cell, question and token nodes are added to the corresponding node feature sets using a bag-of-words representation. Column, row and cell nodes have additional features that indicate their column (for cell and column nodes) and row (for cell and row nodes) indexes.
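To make this graph formulation concrete, here is a minimal sketch of how such nodes, features and edges could be constructed; the function and feature names are our own illustration, not the paper's implementation.

```python
from collections import defaultdict

def build_table_graph(column_names, rows):
    """Builds column, row and cell nodes with nominal features, plus the four
    directed edge types connecting columns and rows to cells."""
    nodes, edges = [], []

    def add_node(**features):
        nodes.append(features)
        return len(nodes) - 1

    col_ids = [add_node(type="column", words=name.lower().split(), col=j)
               for j, name in enumerate(column_names)]
    row_ids = [add_node(type="row", row=i) for i in range(len(rows))]

    # Within a column, cells with identical text are collapsed into one node
    # (e.g. a single node for "Toronto"); here we keep the first row index only.
    cell_ids = defaultdict(dict)  # column index -> cell text -> node id
    for i, row in enumerate(rows):
        for j, text in enumerate(row):
            if text not in cell_ids[j]:
                cell_ids[j][text] = add_node(type="cell", words=text.lower().split(),
                                             col=j, row=i)
            c = cell_ids[j][text]
            edges += [(col_ids[j], c, "column_to_cell"), (c, col_ids[j], "cell_to_column"),
                      (row_ids[i], c, "row_to_cell"), (c, row_ids[i], "cell_to_row")]
    return nodes, edges
```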

We align question tokens with column names and cell text using the Levenshtein edit distance between n-grams in the question and the table text, similar to previous work Shaw et al. (2019). In particular, we score every question n-gram with the normalized edit distance and connect the cell to the token span if the score exceeds a threshold. Through the alignment, the cell is connected to all the tokens of the matching span and the binned score is added as an additional feature to the cell. In Figure 1, the “building” and “floors” tokens in the questions are connected to the matching “Building” and “Floors” column nodes from the table.
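The exact scoring and threshold are not reproduced here, so the following is only a rough sketch of how an n-gram alignment via normalized edit distance might look; the similarity normalization and the 0.5 threshold are assumptions.

```python
def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def align_cell_to_question(cell_text, question_tokens, max_ngram=3, threshold=0.5):
    """Links a cell to the best-matching question token span, if the score is high enough."""
    best = None
    for n in range(1, max_ngram + 1):
        for start in range(len(question_tokens) - n + 1):
            span = " ".join(question_tokens[start:start + n])
            # Normalized edit distance turned into a similarity score in [0, 1].
            score = 1.0 - edit_distance(span, cell_text) / max(len(span), len(cell_text), 1)
            if best is None or score > best[0]:
                best = (score, start, start + n)
    if best and best[0] >= threshold:
        score, start, end = best
        return {"span": (start, end), "score_bin": round(score, 1)}  # binned score feature
    return None
```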

Numeric Operations

In order to allow operations over numbers and date expressions, we extend our graph with a set of relations involving numerical values in the question and table cells. We infer the type of the numerical values in a column, such as the ones in the “Floors” column, by picking their most frequent type (number or date). Then, we add special features to each cell in the column: the rank and inverse rank of the value in the cell, considering the other cell values from the same column. These features allow the model to answer questions such as “what is the building with most floors?”. In addition, we add a new node to the graph for each numeric expression from the question (such as the number 60 from the second question in Figure 1), and we connect it to the tokens it spans. The numerical nodes originating from the question are connected to the table cells containing numerical values. The connection type encodes the result of the comparison between the question and cell values, lesser, greater or equal, as shown in the figure (yellow edges). These relations allow the model to answer questions such as “which buildings have more than 50 floors?”.
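A small illustration of the rank features and comparison relations follows; the column values and the direction of the comparison (cell relative to question value) are our own assumptions for the example.

```python
def add_numeric_features(column_values):
    """Rank (1 = smallest) and inverse rank (1 = largest) of each numeric cell value in its column."""
    order = sorted(set(column_values))
    n = len(order)
    rank = {v: i + 1 for i, v in enumerate(order)}
    return [{"rank": rank[v], "inverse_rank": n - rank[v] + 1} for v in column_values]

def compare_relation(question_value, cell_value):
    """Edge label connecting a numeric question node to a numeric cell node."""
    if cell_value < question_value:
        return "lesser"
    if cell_value > question_value:
        return "greater"
    return "equal"

# Hypothetical "Floors" values and the number 60 from a question.
floors = [72, 57, 68, 56]
print(add_numeric_features(floors))               # e.g. 72 -> rank 4, inverse rank 1
print([compare_relation(60, f) for f in floors])  # ['greater', 'lesser', 'greater', 'lesser']
```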

Context

We extend the model to capture conversational context by using the feature-based encoding in the graph formulation. In order to handle follow-up questions, we add the previous answers to the graph by marking all the answer rows, columns and cells with nominal features. The nodes with borders in Figure 1 contain the answers to the first question: “what are the buildings in toronto?”. In the example, the first two rows receive a feature ANSWER_ROW, the “building” column a feature ANSWER_COLUMN, and “First Canadian Place” and “Commerce Court West” a feature ANSWER_CELL. Notice that the content of the previous question is not encoded in the graph, only its answers.
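A minimal sketch of this marking step, assuming the node dictionaries from the earlier graph-construction sketch and answer coordinates given as (row, column) pairs (both are our own assumptions):

```python
def mark_previous_answers(nodes, answer_coordinates):
    """Adds ANSWER_ROW / ANSWER_COLUMN / ANSWER_CELL features for the previous answers."""
    answer_rows = {r for r, _ in answer_coordinates}
    answer_cols = {c for _, c in answer_coordinates}
    for node in nodes:
        if node["type"] == "row" and node["row"] in answer_rows:
            node["context"] = "ANSWER_ROW"
        elif node["type"] == "column" and node["col"] in answer_cols:
            node["context"] = "ANSWER_COLUMN"
        elif node["type"] == "cell" and (node["row"], node["col"]) in answer_coordinates:
            node["context"] = "ANSWER_CELL"
    return nodes
```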

2.2 Node Representations

Before the initial encoder layer, all nodes are mapped to vector representations using learned embeddings. For nodes with multiple features, such as column and cell nodes, we reduce the set of feature embeddings to a single vector using the mean. We also concatenate an embedding encoding whether or not the node represents a question token.
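A minimal sketch of this node representation step, assuming simple embedding lookup tables (the dimensions are illustrative only):

```python
import numpy as np

def node_representation(feature_ids, is_token, feature_embeddings, token_flag_embeddings):
    """Mean of the node's feature embeddings, concatenated with a question-token indicator embedding."""
    mean_features = feature_embeddings[feature_ids].mean(axis=0)
    token_flag = token_flag_embeddings[int(is_token)]
    return np.concatenate([mean_features, token_flag])

# Toy usage with random embedding tables.
rng = np.random.default_rng(0)
feature_embeddings = rng.normal(size=(1000, 64))   # vocabulary of nominal features
token_flag_embeddings = rng.normal(size=(2, 8))    # is-question-token: no / yes
vec = node_representation([3, 17, 256], is_token=False,
                          feature_embeddings=feature_embeddings,
                          token_flag_embeddings=token_flag_embeddings)
print(vec.shape)  # (72,)
```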

2.3 Encoder

We use a Graph Neural Network (GNN) encoder based on the Transformer Vaswani et al. (2017). The only modification is that the self-attention mechanism is extended to consider the edge label between each pair of nodes.

We follow the formulation of Shaw et al. (2019), which uses additive edge vector representations. The self-attention weights are therefore calculated as:

\[
e_{ij} \;=\; \frac{(x_i W^Q)\,(x_j W^K + r_{ij})^\top}{\sqrt{d_z}} \tag{1}
\]

where $e_{ij}$ is the unnormalized attention weight for the node vector representations $x_i$ and $x_j$, and $W^Q$ and $W^K$ are parameter matrices. This extension introduces $r_{ij}$ to the calculation, which is a vector representation of the edge label between the two nodes. Edge vectors are similarly added when summing over node representations to produce the new output representations.

We use edge labels corresponding to relative positions between tokens, alignments between tokens and table elements, and relations between table elements, as described in Section 2.1. These amount to 9 fixed edge labels in the graph (4 between rows/cells/columns, 2 between question and cells/columns, and 3 numeric relations), plus a tunable number of relative token positions.
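As an illustration, a single-head version of this relation-aware self-attention could be written as follows; this is only a sketch of the Shaw et al. (2019) formulation with toy random parameters, not the actual implementation.

```python
import numpy as np

def relation_aware_attention(x, edge_keys, edge_values, w_q, w_k, w_v):
    """Single-head self-attention with additive edge-label representations."""
    d_z = w_q.shape[1]
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Edge vectors are added to the keys when computing attention weights ...
    logits = np.einsum("id,ijd->ij", q, k[None, :, :] + edge_keys) / np.sqrt(d_z)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # ... and to the values when summing over node representations.
    return np.einsum("ij,ijd->id", weights, v[None, :, :] + edge_values)

# Toy shapes: n nodes, model dim d, per-head dim d_z; edge_* are looked up from edge labels.
n, d, d_z = 5, 16, 8
rng = np.random.default_rng(1)
x = rng.normal(size=(n, d))
w_q, w_k, w_v = (rng.normal(size=(d, d_z)) for _ in range(3))
edge_keys = rng.normal(size=(n, n, d_z))    # key-side edge vectors r_{ij}
edge_values = rng.normal(size=(n, n, d_z))  # value-side edge vectors
out = relation_aware_attention(x, edge_keys, edge_values, w_q, w_k, w_v)
print(out.shape)  # (5, 8)
```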

2.4 Answer Selection

Figure 2: The model input is a graph, with nodes corresponding to query words and table elements, and edge labels capturing their relations. A graph neural network encoder generates contextualized node representations. Answer values are selected from nodes corresponding to table elements.

We extend the Transformer decoder to include a copy mechanism based on a pointer network Vinyals et al. (2015). The copy mechanism allows the model to predict sequences of answer columns and rows from the input, rather than select symbols from an output vocabulary. Figure 2 visualizes the entire model architecture.
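A rough sketch of how such a pointer-style copy step could score encoded nodes; the bilinear scoring function is an assumption, since the exact parameterization of the copy mechanism is not spelled out here.

```python
import numpy as np

def pointer_scores(decoder_state, node_encodings, candidate_ids, w_pointer):
    """Scores candidate input nodes (e.g., row and column nodes) for copying,
    instead of selecting symbols from an output vocabulary."""
    # Bilinear compatibility between the decoder state and each candidate node encoding.
    logits = node_encodings[candidate_ids] @ (w_pointer @ decoder_state)
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

# Toy usage: pick among 3 candidate nodes given a decoder state (shapes are illustrative).
rng = np.random.default_rng(2)
node_encodings = rng.normal(size=(10, 32))  # contextualized node representations from the encoder
decoder_state = rng.normal(size=(32,))
w_pointer = rng.normal(size=(32, 32))
print(pointer_scores(decoder_state, node_encodings, candidate_ids=[2, 5, 7], w_pointer=w_pointer))
```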

3 Related Work

Semantic parsing models can be trained to produce gold logical forms using an encoder-decoder approach Suhr et al. (2018) or by filling templates Xu et al. (2017); Peng et al. (2017); Yu et al. (2018). When gold logical forms are not available, they are typically treated as latent variables or hidden states, and the answers or denotations are used to search for correct logical forms Yih et al. (2015); Long et al. (2016); Iyyer et al. (2017). In some cases, feedback from query execution is used as a reward signal for updating the model through reinforcement learning Zhong et al. (2017); Liang et al. (2016, 2018); Agarwal et al. (2019) or for refining parts of the query Wang et al. (2018). In our work, we do not use logical forms or RL, which can be hard to train, but simplify the training process by directly matching questions to table cells.

Most of the QA and semantic parsing research focuses on single turn questions. We are interested in handling multiple turns and therefore in modeling context. In semantic parsing tasks, logical forms Iyyer et al. (2017); Sun et al. (2018b); Guo et al. (2018) or SQL statements Suhr et al. (2018) from previous questions are refined to handle follow up questions. In our model, we encode answers to previous questions by marking answer rows, columns and cells in the table, in a non-autoregressive fashion.

In regards to how structured data is represented, methods range from encoding table information, metadata and/or content, Gur et al. (2018); Sun et al. (2018b); Petrovski et al. (2018) to encoding relations between the question and table items Krishnamurthy et al. (2017) or KB entities Sun et al. (2018a). We also encode the table structure and the question in an annotation graph, but use a different modelling approach.

4 Experimental Setup

We evaluate our method on the SequentialQA (SQA) dataset Iyyer et al. (2017), which consists of sequences of questions that can be answered from a given table. The dataset is designed so that every question can be answered by one or more table cells. It consists of 6,066 question sequences (2.9 questions per sequence on average). Table 2 shows an example.

We lowercase and remove punctuation from questions, cell texts and column names. We then split each input utterance on spaces to generate a sequence of tokens. (Whitespace tokenization simplifies the preprocessing, but we can expect an off-the-shelf tokenizer to work as well or better.) We only keep the most frequent word types and map everything else to one of a fixed number of OOV buckets. Numbers and dates are parsed in a similar way as in DynSp Iyyer et al. (2017) and the Neural Programmer Neelakantan et al. (2016a).
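A minimal sketch of this preprocessing; the vocabulary and the number of OOV buckets are placeholders, not the values used in the paper.

```python
import re

def preprocess(text, vocab, num_oov_buckets=10):
    """Lowercases, strips punctuation, whitespace-tokenizes and maps rare words to OOV buckets."""
    tokens = re.sub(r"[^\w\s]", " ", text.lower()).split()
    ids = []
    for tok in tokens:
        if tok in vocab:
            ids.append(vocab[tok])
        else:
            # Hash unknown words into one of a fixed number of OOV buckets.
            ids.append(len(vocab) + hash(tok) % num_oov_buckets)
    return tokens, ids

vocab = {"what": 0, "are": 1, "the": 2, "buildings": 3, "in": 4}
print(preprocess("What are the buildings in Toronto?", vocab))
# (['what', 'are', 'the', 'buildings', 'in', 'toronto'], [0, 1, 2, 3, 4, 5 + hash('toronto') % 10])
```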

We use the Adam optimizer Kingma and Ba (2014) for optimization and tune hyperparameters with Google Vizier Golovin et al. (2017). More details and the hyperparameter ranges can be found in the appendix (A).

All numbers given for our model are averaged over 5 independent runs with different random initializations.

5 Results

Model           ALL   SEQ   POS1  POS2  POS3
FP *            34.1   7.2  52.6  25.6  25.9
NP *            39.4  10.8  58.9  35.9  24.6
DynSp           42.0  10.2  70.9  35.8  20.1
FP † *          33.2   7.7  51.4  22.2  22.3
NP † *          40.2  11.8  60.0  35.9  25.5
DynSp †         44.7  12.8  70.4  41.1  23.6
Camp † *        45.6  13.2  70.3  42.6  24.8
Ours *          45.1  13.3  67.2  42.4  26.4
Ours † *        55.1  28.1  67.2  52.7  46.8
Ours † * (RA)   61.7  28.1  67.2  60.1  57.7
Table 1: SQA test results. † marks contextual models using the previous question or the answer to the previous question. * marks the models that use the table content. RA denotes an oracle model that has access to the previous reference answer at test time. ALL is the average question accuracy, SEQ the sequence accuracy, and POS X the accuracy of the X’th question in a sequence.

We compare our model to the Floating Parser (FP) Pasupat and Liang (2015), Neural Programmer (NP) Neelakantan et al. (2016b), DynSp Iyyer et al. (2017) and Camp Sun et al. (2018b) in Table 1.

We observe that our model improves the SOTA from 45.6 by Camp to 55.1 in question accuracy (ALL), reducing the relative error rate by about 17.5%. For the initial question (POS1), however, it is behind DynSp by 3.7 points. More interestingly, our model handles follow-up questions especially well, outperforming the previously best model FP by 20.9 points on POS3, a relative error reduction of about 28%.
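The relative figures quoted above follow from Table 1, assuming the usual definition of relative error reduction:

\[
\text{relative error reduction} \;=\; \frac{\text{acc}_{\text{new}} - \text{acc}_{\text{old}}}{100 - \text{acc}_{\text{old}}},
\qquad \text{e.g.}\quad \frac{55.1 - 45.6}{100 - 45.6} \approx 17.5\%.
\]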

As in previous work, we also report performance in a non-contextual setup where follow-up questions are answered in isolation. We observe that our model effectively leverages the context information, improving the average question accuracy from 45.1 to 55.1, whereas the use of context in DynSp yields a 2.7 point improvement (42.0 to 44.7). If we provide the previous reference answers, the average question accuracy jumps to 61.7, showing that roughly 15% of the remaining errors are due to error propagation.

Numeric operations

To understand how effective our model is in handling numeric operations, we trained models without the specific handling explained in Section 2. We find that the overall accuracy decreases noticeably, demonstrating the competence of our approach in modeling such operations. This effectiveness is further emphasized when focusing on questions that contain a superlative (e.g., “tallest”, “most expensive”), where the gap between the models with and without numeric relations is even larger. It is worthwhile to call out that the model without special number handling still outperforms the previous SOTA Camp (45.6) by more than 5 points.

What are all the nations?   Australia, Italy, Germany, Soviet Union, Switzerland, United States, Great Britain, France
Which won gold medals?      Australia, Italy, Germany, Soviet Union
Which won more than one?    Australia

Rank  Nation         Gold  Silver  Bronze  Total
1     Australia      2     1       0       3
2     Italy          1     1       1       3
3     Germany        1     0       1       2
4     Soviet Union   1     0       0       1
5     Switzerland    0     2       1       3
6     United States  0     1       0       1
7     Great Britain  0     0       1       1
7     France         0     0       1       1
Table 2: A sequence of questions (top) and the corresponding table (bottom) selected from the SQA dataset, answered correctly by our approach. This example requires handling conversational context and numerical comparisons.

Table size.

We observe that our model is not sensitive to table size, with an average accuracy for the 10% largest tables that is close to the overall accuracy. (Figure 3 of the appendix shows a scatter plot of table size vs. accuracy.)

Error analysis.

Table 2 shows an example that is consistently handled correctly by the model. It requires a simple string match (“nations” to “Nation”), and implicit and explicit comparisons.

We performed error analysis on test data over initial (POS 1) and follow-up (POS > 1) questions to identify the limitations of our approach.

For the initial questions, we find that 26% are match errors, e.g., the model does not match “episode” to “Eps #”, or cases where the model has to exclude rows with empty values from the results. 29% of the errors require a more sophisticated table understanding, e.g., rows that represent the total of all other rows should often not be included in the answers. For 15% of the errors, we think that the reference answer is incorrect, and for another 15% the model prediction is correct but contains duplicates because multiple rows contain the same value. 12% of the errors are around complex matches such as selecting certain ranks (“the first two”), exclusion or negation.

For the follow-up questions, 38% are caused by complex matches; 17% are match errors; 13% of the errors are due to incorrect reference answers and 11% would require advanced table understanding. Only 8% of the errors are due to incorrect management of the conversational context. Section B of the appendix contains a more detailed analysis and error examples.

6 Discussion

We present a model for table-centered conversational QA that predicts the answers directly from the table. We show that this model improves the SOTA on the SQA dataset and, in particular, handles conversational context effectively.

As future work, we plan to expand our model with pre-trained language representations (e.g., BERT Devlin et al. (2018)) in order to improve performance on initial queries and matching of queries to table entries. To handle larger tables, we will investigate sharding the table row-wise, running the model on all the shards first, and then on the final table which combines all the answer rows.

References

Appendix A Hyperparameters

We tune hyperparameters with Google Vizier Golovin et al. (2017). For the encoder and decoder, we select the number of layers and the embedding and hidden dimensions from fixed candidate ranges, setting the feed-forward layer hidden dimensions higher. We employ dropout at training time with a tuned dropout rate, and we also tune the number of attention heads and the clipping distance for relative position representations.

We use the Adam optimizer Kingma and Ba (2014) and tune the learning rate, using the same warm-up and decay strategy for the learning rate as Vaswani et al. (2017) and selecting the number of warm-up steps up to a fixed maximum. We run the training until convergence for a fixed number of steps and use the final checkpoint for evaluation. The batch size is also tuned.

Appendix B Results

Table Size

Given that our model makes use of the whole table, it is conceivable that the performance of our approach can be more sensitive to the table size than methods that predict intermediate representations. Plotting the model performance with respect to number of cells in the table (Figure 3), we observe that the performance does not vary significantly by the table size.

Figure 3: Scatter-plot of accuracy and table size. Each point represents the accuracy on all the questions asked about a test set table of the corresponding size.
(a) Initial questions
Error type            Count
TABLE_UNDERSTANDING   29
COMPLEX_MATCH         12
MATCH                 26
GOLD                  15
ANSWER_SET            15
OTHER                 3

(b) Follow-up questions
Error type            Count
TABLE_UNDERSTANDING   11
COMPLEX_MATCH         38
MATCH                 17
GOLD                  13
ANSWER_SET            4
OTHER                 9
CONTEXT               8

Table 5: Errors on 100 random initial (a) and follow-up (b) questions.

Error Analysis

For a detailed analysis, we annotated 100 initial and 100 follow-up questions with the following error types:

Match

A lexical or semantic match error such as not matching “episode” with “EPS #”.

Table_understanding

A question that would require a higher level table understanding to be answered correctly. For example, we often have to decide that a row is just the total of all other rows and should not be selected.

Complex_match

A question that would require a numerical value match, some sort of sorting or a negation to be answered, e.g., “what is the area of all of the non-remainders?”

Gold

A question with a wrong reference answer. “what gene functions are listed?” – Gold points to the “Category” column rather than “Gene Functions”.

Answer_set

The returned answer should be duplicate free. “what are all of the rider’s names?”, but the table contains “Carl Fogarty” multiple times.

Context

Only used in follow-up questions. This error indicates that a more sophisticated context management is needed.

Other

Any other kind of error.

Table 5 contains the error counts for initial questions and follow-ups, respectively. Table 6 contains interesting examples.

Question Notes
COMPLEX_MATCH
who scored more than earnie stewart? 2-hop reasoning. Requires comparison on top of the result of the inner question.
and which has been active for the longest? Reasoning with text and date (1986-present)
who else is in that field? Exclusion.
of these, which did not publish on february 9? Negation. The model is doing the right thing but missing one of the values.
what is the highest passengers for a route? The model selects the 2nd highest and not the 1st.
TABLE_UNDERSTANDING
now, during which year did they have the worst placement? Requires understanding that “Withdrawal ..” is worse than any position.
which of these seasons have four judges? Requires counting the number of named entities within a single cell.
GOLD
what gene functions are listed? Gold points to “Category” column rather than “Gene Functions”
which aircraft have 1 year of service Gold points to the 4th column (“in service”) instead of the 1st column (“aircraft”)
CONTEXT
when was thaddeus bell born?, when was klaus jurgen schneider born?, which is older? The correct birthday is selected, but not the person.
Table 6: Some interesting error cases with comments.