In recent years, there has been significant progress on conversational question answering (QA), where questions can be meaningfully answered only within the context of a conversation Iyyer et al. (2017); Choi et al. (2018); Saha et al. (2018). This line of work, as in the single-turn QA setup, falls into two main categories: (i) the answers are extracted from some text in a reading comprehension setting, or (ii) the answers are extracted from structured objects, such as knowledge bases or tables. The latter is commonly posed as a semantic parsing task, where the goal is to map questions to some logical form which is then executed over the knowledge base to extract the answers.
In semantic parsing, there is extensive work on using deep neural networks for training models over manually created logical forms in a supervised learning setup Jia and Liang (2016); Ling et al. (2016); Dong and Lapata (2018). However, creating labeled data for this task can be expensive and time-consuming. This problem has motivated research on semantic parsing with weak supervision, where the training data consists of questions and answers along with the structured resources needed to recover the logical form representations that would yield the correct answer Liang et al. (2016); Iyyer et al. (2017).
In this paper, we follow this line of research and investigate answering sequential questions with respect to structured objects. In contrast to previous approaches, instead of learning the intermediate logical forms, we propose a novel approach that encodes the structured resources, i.e. tables, along with the questions and answers from the context of the conversation. This approach allows us to handle conversational context without the definition of detailed operations or a vocabulary dependent on the logical form formalism that are required in the weakly supervised semantic parsing approaches.
We present empirical performance of the proposed approach on the Sequential Question Answering task (SQA) Iyyer et al. (2017) which improves the state-of-the-art performance on all questions, particularly on the follow-up questions that require effective encoding of the context.
We build a QA model for a sequence of questions that are asked about a table and can be answered by selecting one or more table cells.
2.1 Graph Formulation
We encode tables as graphs by representing columns, rows and cells as nodes. Figure 1 shows an example graph representing how we encode the table in relation to the second question, which is the follow-up to the first question. Within a column, cells with identical texts are collapsed into a single node. In the example graph, we only create a single node for “Toronto” and a single node for “Montreal”. We then add directed edges that connect columns and rows to cells, one in either direction (orange and green edges in the figure).
The question is represented by a node covering the entire question text and a node for each token. The main question node is connected to each token, column and cell node. (We do not show some of these connections in the figure to avoid clutter.)
Nodes have associated nominal feature sets. All nodes have a feature indicating their type: column, row, cell, question and question token. The text in column (i.e., the column name), cell, question and token nodes are added to the corresponding node feature set adopting a bag-of-words representation. Column, row and cell nodes have additional features that indicate their column (for cell and column nodes) and row (for cell and row nodes) indexes.
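The table part of this graph can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the node identifiers, the feature strings and the four edge label names are assumptions, and the third table row is an invented example row. Question and token nodes are omitted.

```python
def build_table_graph(columns, rows):
    """Build nodes (id -> nominal feature set) and directed, labeled edges."""
    nodes = {}
    edges = set()

    def bag_of_words(text):
        return {f"word={w}" for w in text.lower().split()}

    for c, name in enumerate(columns):
        nodes[("column", c)] = {"type=column", f"col={c}"} | bag_of_words(name)
    for r in range(len(rows)):
        nodes[("row", r)] = {"type=row", f"row={r}"}

    for r, row in enumerate(rows):
        for c, text in enumerate(row):
            # Cells with identical text in the same column share one node.
            cell = ("cell", c, text)
            if cell not in nodes:
                nodes[cell] = {"type=cell", f"col={c}"} | bag_of_words(text)
            nodes[cell].add(f"row={r}")
            # One edge in either direction (hypothetical label names).
            edges.add((("column", c), cell, "column_to_cell"))
            edges.add((cell, ("column", c), "cell_to_column"))
            edges.add((("row", r), cell, "row_to_cell"))
            edges.add((cell, ("row", r), "cell_to_row"))
    return nodes, edges


columns = ["Building", "City", "Floors"]
rows = [
    ["First Canadian Place", "Toronto", "72"],
    ["Commerce Court West", "Toronto", "57"],
    ["Example Tower", "Montreal", "47"],  # invented row for illustration
]
nodes, edges = build_table_graph(columns, rows)
```

In this sketch the two “Toronto” cells collapse into one node that carries both row indexes as features.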
We align question tokens with column names and cell text using the Levenshtein edit distance between n-grams in the question and the table text, similar to previous work Shaw et al. (2019). In particular, we score every question n-gram with the normalized edit distance and connect the cell to the token span if the score is . Through the alignment, the cell is connected to all the tokens of the matching span and the binned score is added as an additional feature to the cell. In Figure 1, the “building” and “floors” tokens in the questions are connected to the matching “Building” and “Floors” column nodes from the table.
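The n-gram alignment can be sketched as below. The normalization (one minus the edit distance divided by the longer string's length), the maximum n-gram length and the threshold of 0.5 are all assumptions made for illustration; the function names are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Assumed normalization: 1.0 means identical, 0.0 means nothing shared."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def best_alignment(tokens, table_text, max_n=3, threshold=0.5):
    """Return (score, span) for the best-matching question n-gram.

    span is a (start, end) token range, or None if the best score does not
    exceed the (assumed) threshold.
    """
    best_score, best_span = 0.0, None
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngram = " ".join(tokens[start:start + n])
            score = similarity(ngram, table_text)
            if score > best_score:
                best_score, best_span = score, (start, start + n)
    if best_score > threshold:
        return best_score, best_span
    return best_score, None


tokens = "which building has the most floors".split()
floors_match = best_alignment(tokens, "floors")
building_match = best_alignment(tokens, "building")
```

A binned version of the returned score could then be attached to the cell or column node as a nominal feature.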
In order to allow operations over numbers and date expressions, we extend our graph with a set of relations involving numerical values in the question and table cells. We infer the type of the numerical values in a column, such as the ones in the “Floors” column, by picking their most frequent type (number or date). Then, we add special features to each cell in the column: the rank and inverse rank of the value in the cell, considering the other cell values from the same column. These features allow the model to answer questions such as “what is the building with most floors?”. In addition, we add a new node to the graph for each numeric expression from the question (such as the number 60 from the second question in Figure 1), and we connect it to the tokens it spans. The numerical nodes originating from the question are connected to the table cells containing numerical values. The connection type encodes the result of the comparison between the question and cell values (lesser, greater or equal), as shown in the figure (yellow edges). These relations allow the model to answer questions such as “which buildings have more than 50 floors?”.
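A small sketch of the rank features and comparison edges follows. Several details are assumptions: rank 1 is given to the smallest value and inverse rank 1 to the largest, ties share a rank, and the edge label describes the cell value relative to the question value. The column values are illustrative.

```python
def rank_features(values):
    """Return (rank, inverse_rank) per value within one column."""
    ascending = sorted(set(values))
    n = len(ascending)
    features = []
    for v in values:
        position = ascending.index(v)
        features.append((position + 1, n - position))
    return features


def comparison_label(question_value, cell_value):
    """Label for the edge between a question number and a numeric cell."""
    if cell_value < question_value:
        return "lesser"
    if cell_value > question_value:
        return "greater"
    return "equal"


floors = [72, 57, 47]            # illustrative "Floors" column
ranks = rank_features(floors)    # the cell with inverse rank 1 has most floors
labels = [comparison_label(60, v) for v in floors]
```

With these features, a superlative question only needs to attend to the cell whose inverse rank is 1, and a comparative question to the cells reached via “greater” edges.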
We extend the model to capture conversational context by using the feature-based encoding in the graph formulation. In order to handle follow-up questions, we add the previous answers to the graph by marking all the answer rows, columns and cells with nominal features. The nodes with borders in Figure 1 contain the answers to the first question: “what are the buildings in toronto?”. In the example, the first two rows receive a feature ANSWER_ROW, the “building” column a feature ANSWER_COLUMN and “First Canadian Place” and “Commerce Court West” a feature ANSWER_CELL. Notice that the content of the first question is not encoded in the graph, only its answers.
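Marking the previous answers can be sketched as below. The feature names ANSWER_ROW, ANSWER_COLUMN and ANSWER_CELL come from the text; the node identifiers, the function name and the per-position cell ids (rather than collapsed cell nodes) are simplifying assumptions.

```python
from collections import defaultdict


def mark_previous_answers(features, answer_cells):
    """Add answer features for every (row, column) selected by the last answer."""
    for row, col in answer_cells:
        features[("row", row)].add("ANSWER_ROW")
        features[("column", col)].add("ANSWER_COLUMN")
        features[("cell", row, col)].add("ANSWER_CELL")
    return features


# Node id -> nominal feature set; column 0 is assumed to be "building".
features = defaultdict(set)
# Answers to "what are the buildings in toronto?": rows 0 and 1, column 0.
mark_previous_answers(features, [(0, 0), (1, 0)])
```

Only the answer coordinates are needed here, which mirrors the point that the text of the previous question itself is not encoded.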
2.2 Node Representations
Before the initial encoder layer, all nodes are mapped to vector representations using learned embeddings. For nodes with multiple features, such as column and cell nodes, we reduce the set of feature embeddings to a single vector using the mean. We also concatenate an embedding encoding whether the node represents a question token, or not.
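The node representation step can be sketched as follows. Randomly initialized vectors stand in for learned embeddings, and the class name, dimensions and feature strings are assumptions.

```python
import random


class NodeEmbedder:
    """Maps a node's feature set to one vector: mean of feature embeddings,
    concatenated with a question-token indicator embedding."""

    def __init__(self, dim=4, flag_dim=2, seed=0):
        self.rng = random.Random(seed)
        self.dim = dim
        self.table = {}  # feature -> embedding (stand-in for learned weights)
        self.flag = {
            flag: [self.rng.uniform(-0.1, 0.1) for _ in range(flag_dim)]
            for flag in (True, False)
        }

    def lookup(self, feature):
        if feature not in self.table:
            self.table[feature] = [self.rng.uniform(-0.1, 0.1)
                                   for _ in range(self.dim)]
        return self.table[feature]

    def embed(self, features, is_question_token):
        vectors = [self.lookup(f) for f in sorted(features)]
        mean = [sum(column) / len(vectors) for column in zip(*vectors)]
        return mean + self.flag[is_question_token]


embedder = NodeEmbedder()
cell_vector = embedder.embed({"type=cell", "col=1", "word=toronto"}, False)
```

Every node carries at least its type feature, so the mean is always taken over a non-empty set.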
We use a Graph Neural Network (GNN) encoder based on the Transformer Vaswani et al. (2017). The only modification is that the self-attention mechanism is extended to consider the edge label between each pair of nodes.
We follow the formulation of Shaw et al. (2019) that uses additive edge vector representations. The self-attention weights are therefore calculated as:
$$s_{ij} = (W^{q} x_i)^{\top} (W^{k} x_j + r_{ij})$$
where $s_{ij}$ is the unnormalized attention weight for the node vector representations $x_i$ and $x_j$, and $W^{q}$ and $W^{k}$ are parameter matrices. This extension introduces $r_{ij}$ to the calculation, which is a vector representation of the edge label between the two nodes. Edge vectors are similarly added when summing over node representations to produce the new output representations.
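A single-head version of this edge-aware attention can be sketched in plain Python. It assumes the score takes the additive form $s_{ij} = (W^{q} x_i)^{\top}(W^{k} x_j + r_{ij})$, normalizes with a softmax over $j$, and adds a second edge vector to the values. The value projection `Wv`, the separate key and value edge vectors, and the omission of any score scaling are assumptions of this sketch.

```python
import math


def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def edge_aware_attention(X, Wq, Wk, Wv, edge_k, edge_v):
    """X: list of node vectors. edge_k[i][j] and edge_v[i][j] are the vectors
    for the edge label between nodes i and j (zero vectors if unrelated)."""
    queries = [matvec(Wq, x) for x in X]
    keys = [matvec(Wk, x) for x in X]
    values = [matvec(Wv, x) for x in X]
    outputs = []
    for i in range(len(X)):
        # Unnormalized weights: the edge vector is added to the key.
        scores = [dot(queries[i], [k + r for k, r in zip(keys[j], edge_k[i][j])])
                  for j in range(len(X))]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Edge vectors are also added when summing over the values.
        out = [0.0] * len(values[0])
        for j, w in enumerate(weights):
            for d in range(len(out)):
                out[d] += w * (values[j][d] + edge_v[i][j][d])
        outputs.append(out)
    return outputs


identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[[0.0, 0.0] for _ in range(2)] for _ in range(2)]
X = [[1.0, 0.0], [0.0, 1.0]]
plain = edge_aware_attention(X, identity, identity, identity, zeros, zeros)
```

With zero edge vectors this reduces to ordinary dot-product self-attention, which makes the effect of a non-zero edge vector easy to inspect.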
We use edge labels corresponding to relative positions between tokens, alignments between tokens and table elements, and relations between table elements, as described in Section 2.1. These amount to 9 fixed edge labels in the graph (4 between rows/cells/columns, 2 between question and cells/columns, and 3 numeric relations) and a tunable number of relative token positions.
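The relative-position labels between question tokens can be sketched as a clipped offset, so that the number of labels is controlled by a single clipping distance. The label format and the function name are assumptions.

```python
def relative_position_label(i, j, clip_distance):
    """Edge label for token j as seen from token i, clipped to a fixed range."""
    offset = max(-clip_distance, min(clip_distance, j - i))
    return f"rel_pos={offset}"


# With clip distance k there are at most 2 * k + 1 distinct position labels.
labels = {relative_position_label(i, j, 3) for i in range(10) for j in range(10)}
```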
2.4 Answer Selection
We extend the Transformer decoder to include a copy mechanism based on a pointer network Vinyals et al. (2015). The copy mechanism allows the model to predict sequences of answer columns and rows from the input, rather than select symbols from an output vocabulary. Figure 2 visualizes the entire model architecture.
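One decoding step of such a copy mechanism can be sketched as a softmax over the encoded input nodes. Dot-product scoring, greedy selection and the node id strings are assumptions of this sketch; the original pointer network uses an additive attention score.

```python
import math


def pointer_distribution(decoder_state, encoder_outputs):
    """Probability of copying each input node, given the decoder state."""
    scores = [sum(d * e for d, e in zip(decoder_state, enc))
              for enc in encoder_outputs]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def copy_step(decoder_state, encoder_outputs, node_ids):
    """Greedily pick the input node to copy at this step."""
    probs = pointer_distribution(decoder_state, encoder_outputs)
    best = max(range(len(probs)), key=probs.__getitem__)
    return node_ids[best], probs


encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
node_ids = ["column:0", "row:0", "row:1"]
picked, probs = copy_step([0.5, 2.0], encoder_outputs, node_ids)
```

Because the output space is the set of input nodes, no separate output vocabulary is required.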
3 Related Work
Semantic parsing models can be trained to produce gold logical forms using an encoder-decoder approach Suhr et al. (2018) or by filling templates Xu et al. (2017); Peng et al. (2017); Yu et al. (2018). When gold logical forms are not available, they are typically treated as latent variables or hidden states and the answers or denotations are used to search for correct logical forms Yih et al. (2015); Long et al. (2016); Iyyer et al. (2017). In some cases, feedback from query execution is used as a reward signal for updating the model through reinforcement learning Zhong et al. (2017); Liang et al. (2016, 2018); Agarwal et al. (2019) or for refining parts of the query Wang et al. (2018). In our work, we do not use logical forms or RL, which can be hard to train, but simplify the training process by directly matching questions to table cells.
Most of the QA and semantic parsing research focuses on single-turn questions. We are interested in handling multiple turns and therefore in modeling context. In semantic parsing tasks, logical forms Iyyer et al. (2017); Sun et al. (2018b); Guo et al. (2018) or SQL statements Suhr et al. (2018) from previous questions are refined to handle follow-up questions. In our model, we encode answers to previous questions by marking answer rows, columns and cells in the table, in a non-autoregressive fashion.
With regard to how structured data is represented, methods range from encoding table information, metadata and/or content Gur et al. (2018); Sun et al. (2018b); Petrovski et al. (2018) to encoding relations between the question and table items Krishnamurthy et al. (2017) or KB entities Sun et al. (2018a). We also encode the table structure and the question in an annotation graph, but use a different modelling approach.
4 Experimental Setup
We evaluate our method on the SequentialQA (SQA) dataset Iyyer et al. (2017), which consists of sequences of questions that can be answered from a given table. The dataset is designed so that every question can be answered by one or more table cells. It consists of answer sequences containing questions ( question per sequence on average). Table 2 shows an example.
We lowercase and remove punctuation from questions, cell texts and column names. We then split each input utterance on spaces to generate a sequence of tokens. (Whitespace tokenization simplifies the preprocessing, but we can expect an off-the-shelf tokenizer to work as well or even better.) We only keep the most frequent word types and map everything else to one of OOV buckets. Numbers and dates are parsed in a similar way as in DynSp and the Neural Programmer Neelakantan et al. (2016a).
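The preprocessing can be sketched as follows. The vocabulary size and the number of OOV buckets are left as parameters, and hashing with `zlib.crc32` to pick a bucket is an assumption chosen so that the mapping is stable across runs.

```python
import string
import zlib
from collections import Counter

# Translation table that deletes ASCII punctuation (non-ASCII quotes are kept).
_PUNCTUATION = str.maketrans("", "", string.punctuation)


def tokenize(text):
    """Lowercase, strip punctuation, split on whitespace."""
    return text.lower().translate(_PUNCTUATION).split()


def build_vocab(texts, max_types):
    """Keep only the most frequent word types."""
    counts = Counter(token for text in texts for token in tokenize(text))
    return {token for token, _ in counts.most_common(max_types)}


def map_token(token, vocab, num_oov_buckets):
    """Return the token itself, or a hashed OOV bucket symbol."""
    if token in vocab:
        return token
    bucket = zlib.crc32(token.encode("utf-8")) % num_oov_buckets
    return f"<OOV_{bucket}>"


question_tokens = tokenize("What are the buildings in Toronto?")
vocab = build_vocab(["what are the buildings", "what are the floors"], max_types=3)
mapped = [map_token(t, vocab, num_oov_buckets=4) for t in question_tokens]
```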
We use the Adam optimizer Kingma and Ba (2014) for optimization and tune hyperparameters with Google Vizier Golovin et al. (2017). More details and the hyperparameter ranges can be found in Appendix A.
All numbers given for our model are averaged over 5 independent runs with different random initializations.
|Model||ALL||SEQ||POS1||POS2||POS3|
|FP † *||33.2||7.7||51.4||22.2||22.3|
|NP † *||40.2||11.8||60.0||35.9||25.5|
|Camp † *||45.6||13.2||70.3||42.6||24.8|
|Ours † *||55.1||28.1||67.2||52.7||46.8|
|Ours † * (RA)||61.7||28.1||67.2||60.1||57.7|
We observe that our model improves the SOTA from by Camp to in question accuracy (ALL), reducing the relative error rate by . For the initial question (POS1), however, it is behind DynSp by . More interestingly, our model handles follow up questions especially well outperforming the previously best model FP by on POS3, a relative error reduction.
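The relative error reduction can be recomputed from the results table. The sketch below assumes that the first numeric column is the question accuracy (ALL), that is, 45.6 for Camp and 55.1 for our model; the function name is hypothetical.

```python
def relative_error_reduction(baseline_accuracy, new_accuracy):
    """Fraction of the baseline's errors removed (accuracies in percent)."""
    baseline_error = 100.0 - baseline_accuracy
    new_error = 100.0 - new_accuracy
    return (baseline_error - new_error) / baseline_error


reduction_all = relative_error_reduction(45.6, 55.1)
print(round(100 * reduction_all, 1))  # prints 17.5
```

The same function applies to any other column by substituting the corresponding pair of accuracies.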
As in previous work, we also report performance for a non-contextual setup where follow up questions are answered in isolation. We observe that our model effectively leverages the context information by improving the average question accuracy from to in comparison to the use of context in DynSp yielding improvement. If we provide the previous reference answers, the average question accuracy jumps to , showing that of the errors are due to error propagation.
To understand how effective our model is in handling numeric operations, we trained models without the specific handling explained in Section 2. We find that the overall accuracy decreases from to , demonstrating the competence of our approach to model such operations. This effectiveness is further emphasized when focusing on questions that contain a superlative (e.g., “tallest”, “most expensive”) with a performance difference of with numeric relations and without. It is worthwhile to call out that the model without special number handling still outperforms the previous SOTA Camp by more than 5 points ( vs ).
We observe that our model is not sensitive to table size changes, with an average accuracy of for the 10% largest tables (vs. overall). (Fig. 3 of the appendix shows a scatter plot of table size vs. accuracy.)
Table 2 shows an example that is consistently handled correctly by the model. It requires a simple string match (“nations” → “nation”), and implicit and explicit comparisons.
We performed error analysis on test data over initial (POS) and follow up questions (POS) to identify the limitations of our approach.
For the initial questions, we find that are match errors, e.g., the model does not match “episode” to “Eps #”, or cases where the model has to exclude rows with empty values from the results. of the errors require a more sophisticated table understanding, e.g., rows that represent the total of all other rows should often not be included in the answers. For of the errors, we think that the reference answer is incorrect and for another the model prediction is correct but contains duplicates because multiple rows contain the same value. of the errors are around complex matches such as selecting certain ranks (“the first two”), exclusion or negation.
For the follow up questions, are caused by complex matches; are match errors; of the errors are due to incorrect reference answers and would require advanced table understanding. Only of the errors are due to incorrect management of the conversational context. Section B of the appendix contains a more detailed analysis and error examples.
We present a model for table-centered conversational QA that predicts the answers directly from the table. We show that this model improves the SOTA on the SQA dataset and handles conversational context particularly effectively.
As future work, we plan to expand our model with pre-trained language representations (e.g., BERT Devlin et al. (2018)) in order to improve performance on initial queries and matching of queries to table entries. To handle larger tables, we will investigate sharding the table row-wise, running the model on all the shards first, and then on the final table which combines all the answer rows.
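The row-wise sharding idea can be sketched as a two-pass procedure. Here `select_rows` is a hypothetical stand-in for the QA model (it takes indexed rows and returns the ids of the answer rows), and the shard size is an assumed parameter.

```python
def answer_with_sharding(rows, select_rows, shard_size):
    """First run the selector on each shard, then on the combined answer rows."""
    indexed = list(enumerate(rows))
    candidates = []
    for start in range(0, len(indexed), shard_size):
        shard = indexed[start:start + shard_size]
        keep = set(select_rows(shard))
        candidates.extend(item for item in shard if item[0] in keep)
    # Second pass over the much smaller table of per-shard answer rows.
    return select_rows(candidates)


def tallest(indexed_rows):
    """Toy selector: id of the row with the most floors."""
    if not indexed_rows:
        return []
    best = max(indexed_rows, key=lambda item: item[1]["floors"])
    return [best[0]]


table = [{"floors": f} for f in (47, 72, 57, 33, 51)]  # illustrative values
answer = answer_with_sharding(table, tallest, shard_size=2)
```

The second pass matters for questions such as superlatives, where a per-shard winner is not necessarily the overall answer.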
- Agarwal et al. (2019) Rishabh Agarwal, Chen Liang, Dale Schuurmans, and Mohammad Norouzi. 2019. Learning to generalize from sparse and underspecified rewards. arXiv preprint arXiv:1902.07198.
- Choi et al. (2018) Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. 2018. QuAC: Question answering in context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2174–2184, Brussels, Belgium. Association for Computational Linguistics.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Dong and Lapata (2018) Li Dong and Mirella Lapata. 2018. Coarse-to-fine decoding for neural semantic parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 731–742, Melbourne, Australia. Association for Computational Linguistics.
- Golovin et al. (2017) Daniel Golovin, Benjamin Solnik, Subhodeep Moitra, Greg Kochanski, John Elliot Karro, and D. Sculley, editors. 2017. Google Vizier: A Service for Black-Box Optimization.
- Guo et al. (2018) Daya Guo, Duyu Tang, Nan Duan, Ming Zhou, and Jian Yin. 2018. Dialog-to-action: Conversational question answering over a large-scale knowledge base. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, pages 2946–2955, USA. Curran Associates Inc.
- Gur et al. (2018) Izzeddin Gur, Semih Yavuz, Yu Su, and Xifeng Yan. 2018. DialSQL: Dialogue based structured query generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1339–1349, Melbourne, Australia. Association for Computational Linguistics.
- Iyyer et al. (2017) Mohit Iyyer, Wen-tau Yih, and Ming-Wei Chang. 2017. Search-based neural structured learning for sequential question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1821–1831, Vancouver, Canada. Association for Computational Linguistics.
- Jia and Liang (2016) Robin Jia and Percy Liang. 2016. Data recombination for neural semantic parsing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12–22, Berlin, Germany. Association for Computational Linguistics.
- Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.
- Krishnamurthy et al. (2017) Jayant Krishnamurthy, Pradeep Dasigi, and Matt Gardner. 2017. Neural semantic parsing with type constraints for semi-structured tables. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1516–1526, Copenhagen, Denmark. Association for Computational Linguistics.
- Liang et al. (2016) Chen Liang, Jonathan Berant, Quoc V. Le, Kenneth D. Forbus, and Ni Lao. 2016. Neural symbolic machines: Learning semantic parsers on freebase with weak supervision. CoRR, abs/1611.00020.
- Liang et al. (2018) Chen Liang, Mohammad Norouzi, Jonathan Berant, Quoc V Le, and Ni Lao. 2018. Memory augmented policy optimization for program synthesis and semantic parsing. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 10015–10027. Curran Associates, Inc.
- Ling et al. (2016) Wang Ling, Phil Blunsom, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, Fumin Wang, and Andrew Senior. 2016. Latent predictor networks for code generation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 599–609, Berlin, Germany. Association for Computational Linguistics.
- Long et al. (2016) Reginald Long, Panupong Pasupat, and Percy Liang. 2016. Simpler context-dependent logical forms via model projections. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1456–1465, Berlin, Germany. Association for Computational Linguistics.
- Neelakantan et al. (2016a) Arvind Neelakantan, Quoc V. Le, Martín Abadi, Andrew McCallum, and Dario Amodei. 2016a. Learning a natural language interface with neural programmer. CoRR, abs/1611.08945.
- Pasupat and Liang (2015) Panupong Pasupat and Percy Liang. 2015. Compositional semantic parsing on semi-structured tables. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1470–1480, Beijing, China. Association for Computational Linguistics.
- Peng et al. (2017) Haoruo Peng, Ming-Wei Chang, and Wen-tau Yih. 2017. Maximum margin reward networks for learning from explicit and implicit supervision. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2368–2378, Copenhagen, Denmark. Association for Computational Linguistics.
- Petrovski et al. (2018) Bojan Petrovski, Ignacio Aguado, Andreea Hossmann, Michael Baeriswyl, and Claudiu Musat. 2018. Embedding individual table columns for resilient SQL chatbots. In Proceedings of the 2018 EMNLP Workshop SCAI: The 2nd International Workshop on Search-Oriented Conversational AI, pages 67–73, Brussels, Belgium. Association for Computational Linguistics.
- Saha et al. (2018) Amrita Saha, Vardaan Pahuja, Mitesh M. Khapra, Karthik Sankaranarayanan, and Sarath Chandar. 2018. Complex sequential question answering: Towards learning to converse over linked question answer pairs with a knowledge graph. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 705–713.
- Shaw et al. (2019) Peter Shaw, Philip Massey, Angelica Chen, Francesco Piccinno, and Yasemin Altun. 2019. Generating logical forms from graph representations of text and entities. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 95–106, Florence, Italy. Association for Computational Linguistics.
- Suhr et al. (2018) Alane Suhr, Srinivasan Iyer, and Yoav Artzi. 2018. Learning to map context-dependent sentences to executable formal queries. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2238–2249, New Orleans, Louisiana. Association for Computational Linguistics.
- Sun et al. (2018a) Haitian Sun, Bhuwan Dhingra, Manzil Zaheer, Kathryn Mazaitis, Ruslan Salakhutdinov, and William Cohen. 2018a. Open domain question answering using early fusion of knowledge bases and text. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4231–4242, Brussels, Belgium. Association for Computational Linguistics.
- Sun et al. (2018b) Yibo Sun, Duyu Tang, Nan Duan, Jingjing Xu, Xiaocheng Feng, and Bing Qin. 2018b. Knowledge-aware conversational semantic parsing over web tables. CoRR, abs/1809.04271.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
- Vinyals et al. (2015) Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2692–2700. Curran Associates, Inc.
- Wang et al. (2018) Chenglong Wang, Po-Sen Huang, Alex Polozov, Marc Brockschmidt, and Rishabh Singh. 2018. Execution-guided neural program decoding. In ICML Neural Abstract Machines and Program Induction workshop, 2018.
- Xu et al. (2017) Xiaojun Xu, Chang Liu, and Dawn Song. 2017. SQLNet: Generating structured queries from natural language without reinforcement learning. arXiv preprint arXiv:1711.04436.
- Yih et al. (2015) Wen-tau Yih, Ming-Wei Chang, Xiaodong He, and Jianfeng Gao. 2015. Semantic parsing via staged query graph generation: Question answering with knowledge base. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1321–1331, Beijing, China. Association for Computational Linguistics.
- Yu et al. (2018) Tao Yu, Zifan Li, Zilin Zhang, Rui Zhang, and Dragomir Radev. 2018. TypeSQL: Knowledge-based type-aware neural text-to-SQL generation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 588–594, New Orleans, Louisiana. Association for Computational Linguistics.
- Zhong et al. (2017) Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2SQL: Generating structured queries from natural language using reinforcement learning. CoRR, abs/1709.00103.
Appendix A Hyperparameters
We tune hyperparameters with Google Vizier Golovin et al. (2017). For the encoder and decoder, we select the number of layers from and embedding and hidden dimensions from , setting the feed forward layer hidden dimensions higher. We employ dropout at training time with selected from . We select the attention heads from and use a clipping distance of for relative position representations.
We use the Adam optimizer Kingma and Ba (2014) with , , and . We tune the learning rate and use the same warm-up and decay strategy for learning rate as Vaswani et al. (2017), selecting a number of warm-up steps up to a maximum of . We run the training until convergence for a fixed number of steps () and use the final checkpoint for evaluation. We choose batch sizes from .
Appendix B Results
Given that our model makes use of the whole table, it is conceivable that the performance of our approach can be more sensitive to the table size than methods that predict intermediate representations. Plotting the model performance with respect to number of cells in the table (Figure 3), we observe that the performance does not vary significantly by the table size.
For a detailed analysis, we annotated initial and follow-up questions with the following match types:
A lexical or semantic match error such as not matching “episode” with “EPS #”.
A question that would require a higher level table understanding to be answered correctly. For example, we often have to decide that a row is just the total of all other rows and should not be selected.
A question that would require a numerical value match, some sort of sorting or a negation to be answered. For example: “what is the area of all of the non-remainders?”
A question with a wrong reference answer. For example: “what gene functions are listed?”, where the gold answer points to the Category column rather than Gene Functions.
The returned answer should be duplicate-free. For example, the question is “what are all of the rider’s names?”, but the table contains “Carl Fogarty” multiple times.
Only used in follow-up questions. This error indicates that a more sophisticated context management is needed.
Any other kind of error.
|who scored more than earnie stewart?||2-hop reasoning. Requires comparison on top of the result of the inner question.|
|and which has been active for the longest?||Reasoning with text and date (1986-present)|
|who else is in that field?||Exclusion.|
|of these, which did not publish on february 9?||Negation. The model is doing the right thing but missing one of the values.|
|what is the highest passengers for a route?||The model selects the 2nd highest and not the 1st.|
|now, during which year did they have the worst placement?||Requires understanding that “Withdrawal ..” is worse than any position.|
|which of these seasons have four judges?||Requires counting the number of named entities within a single cell.|
|what gene functions are listed?||Gold points to “Category” column rather than “Gene Functions”|
|which aircraft have 1 year of service||Gold points to the 4th column (“in service”) instead of the 1st column (“aircraft”)|
|when was thaddeus bell born?, when was klaus jurgen schneider born?, which is older?||The correct birthday is selected, but not the person.|