Multi-Task Learning for Conversational Question Answering over a Large-Scale Knowledge Base

We consider the problem of conversational question answering over a large-scale knowledge base. To handle the huge entity vocabulary of a large-scale knowledge base, recent neural semantic parsing based approaches usually decompose the task into several subtasks and then solve them sequentially, which leads to the following issues: 1) errors in earlier subtasks are propagated and negatively affect downstream ones; and 2) each subtask cannot naturally share supervision signals with the others. To tackle these issues, we propose an innovative multi-task learning framework in which a pointer-equipped semantic parsing model is designed to resolve coreference in conversations and naturally empowers joint learning with a novel type-aware entity detection model. The proposed framework thus enables shared supervision and alleviates the effect of error propagation. Experiments on a large-scale conversational question answering dataset containing 1.6M question answering pairs over 12.8M entities show that the proposed framework improves the overall F1 score from 67% to 79%.




1 Introduction

Recent decades have seen the development of AI-driven personal assistants (e.g., Siri, Alexa, Cortana, and Google Now) that often need to answer factoid questions. Meanwhile, large-scale knowledge bases (KBs) such as DBpedia Auer et al. (2007) or Freebase Bollacker et al. (2008) have been built to store the world’s facts in a structured database, which is used to support open-domain question answering (QA) in those assistants.

The neural semantic parsing based approach Jia and Liang (2016); Reddy et al. (2014); Dong and Lapata (2016); Liang et al. (2016); Dong and Lapata (2018); Guo et al. (2018) has been gaining attention for knowledge base question answering (KB-QA) in recent years since it does not rely on hand-crafted features and is easy to adapt across domains. Traditional approaches usually retrieve answers from a small KB (e.g., a small table) Jia and Liang (2016); Xiao et al. (2016) and have difficulty handling large-scale KBs. Many recent neural semantic parsing based approaches for KB-QA take a stepwise framework to handle this issue. For example, Liang et al. (2016), Dong and Lapata (2016), and Guo et al. (2018) first use an entity linking system to find entities in a question, and then learn a model to map the question to a logical form based on that. Dong and Lapata (2018) decompose the semantic parsing process into two stages. They first generate a rough sketch of the logical form based on low-level features, and then fill in missing details by considering both the question and the sketch.

However, these stepwise approaches have two issues. First, errors in upstream subtasks (e.g., entity detection and linking, relation classification) are propagated to downstream ones (e.g., semantic parsing), resulting in accumulated errors. For example, case studies in previous works Yih et al. (2015); Dong and Lapata (2016); Xu et al. (2016); Guo et al. (2018) show that entity linking error is one of the major errors leading to wrong predictions in KB-QA. Second, since models for the subtasks are learned independently, the supervision signals cannot be shared among the models for mutual benefits.

Alias Operator Comments
A4 set of entities with a predicate edge to the given entity
A5 number of distinct elements in the input set
A6 whether the entity is in the set or not
A9 -
A10 subset of the set linking to more than a given number of entities with the predicate
A11 subset of the set linking to fewer than a given number of entities with the predicate
A12 subset of the set linking to exactly the given number of entities with the predicate
A13 subset of the set linking to the most entities with the predicate
A14 subset of the set linking to the fewest entities with the predicate
A15 subset of the entities in the set that belong to the given entity type
A16 transform a number in the utterance to an intermediate number
A18/19/20/21 constant *instantiation for entity, predicate, type, number-in-question
Table 1: Brief grammar definitions for logical form generation. *Instantiation of an entity, predicate, type, or number-in-question by the corresponding constant parsed from the question.

To tackle the issues mentioned above, we propose a novel multi-task semantic parsing framework for KB-QA. Specifically, an innovative pointer-equipped semantic parsing model is first designed for two purposes: 1) a built-in pointer network toward the positions of entity mentions in the question naturally empowers multi-task learning in conjunction with the upstream sequence labeling subtask, i.e., entity detection; and 2) it explicitly takes into account the context of entity mentions by using the supervision of the pointer network. Besides, a type-aware entity detection method is proposed to produce accurate entity linking results, in which a joint prediction space combining entity detection and entity type is employed, and the predicted type is then used to filter entity linking results during the inference phase.

The proposed framework has certain merits.

First, since the two subtasks, i.e., pointer-equipped semantic parsing and entity detection, are closely related, learning them simultaneously within a single model makes the best use of the supervision signals and improves the performance of the KB-QA task.

Second, considering entity type prediction is crucial for entity linking, our joint learning framework combining entity mention detection with type prediction leverages contextual information, and thus further reduces errors in entity linking.

Third, our approach is naturally beneficial to coreference resolution for conversational QA due to rich contextual features captured for entity mention, compared to previous works directly employing low-level features (e.g., mean-pooling over word embeddings) as the representation of an entity. This is verified via our experiments in §4.2.

We evaluate the proposed framework on the CSQA Saha et al. (2018) dataset, which is the largest public dataset for complex conversational question answering over a large-scale knowledge base. Experimental results show that the overall F1 score is improved by 12.56% compared with strong baselines, and the improvements are consistent for all question types in the dataset.

2 Task Definition

In this work, we target the problem of conversational question answering over a large-scale knowledge base. Formally, in the training data, a question denotes a user utterance from a dialog, concatenated with the dialog history to handle ellipsis or coreference in conversations, and each question is labeled with its answer. In addition, “IOB” (Inside-Outside-Beginning) tags and the linked KB entities are labeled for the entity mentions in the question to train an entity detection model.

We employ a neural semantic parsing based approach to tackle the problem. That is, given a question, a semantic parsing model is used to produce a logical form which is then executed on the KB to retrieve an answer. We decompose the approach into two subtasks, i.e., entity detection for entity linking and semantic parsing for logical form generation. The former employs IOB tagging and the corresponding entities as supervision, while the latter uses a gold logical form as supervision, which may be obtained by conducting an intensive BFS (breadth-first search with a limited buffer, Guo et al. (2018)) over the KB if only final answers (i.e., weak supervision) are provided.

Figure 1: Proposed Multi-task Semantic Parsing (MaSP) model. Note that P* and T* are predicate and entity type ids in Wikidata where entity type id originally starts with Q but is replaced with T for clear demonstration.

3 Approach

This section begins with a description of the grammars and logical forms used in this work. Then, the proposed model is presented, and finally, the model’s training and inference are introduced.

3.1 Grammar and Logical Form


We use grammars and logical forms similar to those defined in Guo et al. (2018), with minor modifications for better adaptation to the CSQA dataset. The grammars are briefly summarized in Table 1, where each operator consists of three components: a semantic category, a function name, and a list of arguments with specified semantic categories. Semantic categories can be classified into two groups w.r.t. the way they are instantiated: one is referred to as the entry semantic category (i.e., the categories for entities, predicates, types and numbers), whose instantiations are constants parsed from a question, and the other is referred to as the intermediate semantic category, whose instantiation is the output of an operator execution.

Logical Form

A KB-executable logical form is intrinsically formatted as an ordered tree where the root is the start semantic category, each child node is constrained by the nonterminal (i.e., the un-instantiated semantic category in parentheses) of its parent operator, and leaf nodes are instantiated entry semantic categories, i.e., constants.

To make the best use of well-performing sequence to sequence (seq2seq) models Vaswani et al. (2017); Bahdanau et al. (2015) as a base for semantic parsing, we represent a tree-structured logical form as a sequence of operators and constants via depth-first traversal over the tree. Note that, given the guidance of the grammars, we can recover the corresponding tree structure from a sequence-formatted logical form.
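The round trip between a tree and its depth-first sequence can be illustrated with a small sketch. The operator names and arities below are hypothetical stand-ins chosen for the example, not the grammar of Table 1; the only point is that a fixed number of arguments per operator is enough to rebuild the tree from the flat sequence.

```python
from dataclasses import dataclass, field

# Hypothetical arities: number of children each operator expects.
# Any symbol not listed here is treated as a constant (a leaf).
ARITY = {"find": 2, "count": 1, "union": 2}


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def linearize(node):
    """Depth-first (pre-order) traversal: tree -> flat token sequence."""
    tokens = [node.label]
    for child in node.children:
        tokens.extend(linearize(child))
    return tokens


def recover(tokens):
    """Rebuild the tree from the flat sequence, guided by the arity table."""
    it = iter(tokens)

    def build():
        label = next(it)
        return Node(label, [build() for _ in range(ARITY.get(label, 0))])

    root = build()
    if next(it, None) is not None:
        raise ValueError("sequence has trailing tokens")
    return root


# Placeholder constants E1, E2 (entities) and P1 (predicate).
tree = Node("count", [Node("union", [
    Node("find", [Node("E1"), Node("P1")]),
    Node("find", [Node("E2"), Node("P1")]),
])])
sequence = linearize(tree)
```

Here `recover(sequence)` returns a tree equal to `tree`, which is the property the seq2seq formulation relies on.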

3.2 Proposed Model

The structure of our proposed Multi-task Semantic Parsing (MaSP) model is illustrated in Figure 1. The model consists of four components: word embedding, contextual encoder, entity detection and pointer-equipped logical form decoder.

3.2.1 Embedding and Contextual Encoder

To handle ellipsis or coreference in conversations, our model takes the current user question combined with the dialog history as the input question. In particular, all those sentences are concatenated with [SEP] as the separator, and then a special token [CTX] is appended. We apply the wordpiece tokenizing method Wu et al. (2016), and then use a word embedding method Mikolov et al. (2013) to transform the tokenized question into a sequence of low-dimensional distributed embeddings, i.e., $x = [x_1, \dots, x_n] \in \mathbb{R}^{d_e \times n}$, where $d_e$ denotes the embedding size and $n$ denotes the question length.
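As a rough illustration of the input construction described above, the sketch below joins the dialog history and the current question with [SEP] and appends [CTX]. It assumes simple whitespace tokenization instead of wordpiece, and the function name `build_input` is hypothetical.

```python
def build_input(history, current, sep="[SEP]", ctx="[CTX]"):
    """Concatenate history and current question with [SEP], then append [CTX]."""
    tokens = []
    for i, sentence in enumerate(list(history) + [current]):
        if i > 0:
            tokens.append(sep)
        # Assumption: whitespace tokenization stands in for wordpiece.
        tokens.extend(sentence.split())
    tokens.append(ctx)
    return tokens


tokens = build_input(
    ["What is the parent organization of that one ?",
     "Did you mean Polydor Records ?"],
    "No , I meant Deram Records .",
)
```

The final position of `tokens` always holds [CTX], whose contextual embedding later serves as the representation of the whole question.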

Given the word embeddings $x$, we use a stacked two-layer multi-head attention mechanism in the Transformer Vaswani et al. (2017) with learnable positional encodings as an encoder to model contextual dependencies between tokens, which results in context-aware representations $h \in \mathbb{R}^{d_e \times n}$. The contextual embedding of the token [CTX] is used as the semantic representation of the entire question, i.e., $h^{(ctx)}$.

3.2.2 Pointer-Equipped Decoder

Given contextual embeddings of a question, we employ stacked two-layer masked attention mechanism in Vaswani et al. (2017) as the decoder to produce sequence-formatted logical forms.

In each decoding step, the model first predicts a token from a small decoding vocabulary $V^{(dec)} = \{\text{start}, \text{end}, e, p, tp, num, A1, \dots, A21\}$, where start and end indicate the start and end of decoding, $A1, \dots, A21$ are defined in Table 1, and $e$, $p$, $tp$ and $num$ denote entity, predicate, type and number entries respectively. A neural classifier is established to predict the current decoding token, which is formally denoted as

$$p_j^{(tk)} = \mathrm{softmax}\big(\mathrm{FFN}(s_j; \theta^{(tk)})\big), \tag{1}$$

where $s_j$ is the decoding hidden state of the current (i.e., $j$-th) step, $\mathrm{FFN}(\cdot; \theta)$ denotes a $\theta$-parameterized two-layer feed forward network with an activation function inside, and $p_j^{(tk)} \in \mathbb{R}^{|V^{(dec)}|}$ is a predicted distribution over $V^{(dec)}$ to score candidates. (A superscript in brackets denotes the type instead of an index.)

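Then, an FFN or a pointer network Vinyals et al. (2015) is utilized to predict the instantiation for an entry semantic category (i.e., $e$, $p$, $tp$ or $num$ in $V^{(dec)}$) if it is necessary.

  • For predicate $p$ and type $tp$, two parameter-untied FFNs are used as

    $$p_j^{(p)} = \mathrm{softmax}\big(\mathrm{FFN}([s_j; h^{(ctx)}]; \theta^{(p)})\big), \tag{2}$$

    $$p_j^{(tp)} = \mathrm{softmax}\big(\mathrm{FFN}([s_j; h^{(ctx)}]; \theta^{(tp)})\big), \tag{3}$$

    where $h^{(ctx)}$ is the semantic embedding of the entire question, $s_j$ is the current hidden state, $p_j^{(p)} \in \mathbb{R}^{N^{(p)}}$ and $p_j^{(tp)} \in \mathbb{R}^{N^{(tp)}}$ are the predicted distributions over the predicate and type instantiation candidates respectively, and $N^{(p)}$ and $N^{(tp)}$ are the numbers of distinct predicates and types in the knowledge base.

  • For entity $e$ and number $num$, two parameter-untied pointer networks Vinyals et al. (2015) with a learnable bilinear layer are employed to point toward the targeted entity (toward its first token if the entity consists of multiple words) and number, which are defined as follows.

    $$p_j^{(e)} = \mathrm{softmax}\big(s_j^{\top} W^{(e)} h^{(\setminus ctx)}\big), \tag{4}$$

    $$p_j^{(num)} = \mathrm{softmax}\big(s_j^{\top} W^{(num)} h^{(\setminus ctx)}\big), \tag{5}$$

    where $h^{(\setminus ctx)}$ is the contextual embedding of the tokens in the question except [CTX], $W^{(e)}$ and $W^{(num)}$ are the weights of the pointer networks for entity and number, $p_j^{(e)}, p_j^{(num)} \in \mathbb{R}^{n}$ are the resulting distributions over positions of the input question, and $n$ is the length of the question.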

The pointer network is also used for semantic parsing in Jia and Liang (2016), where the pointer aims at copying out-of-vocabulary words from a question over small-scale KB. Different from that, the pointer used here aims at locating the targeted entity and number in a question, which has two advantages. First, it handles the coreference problem by considering the context of entity mentions in the question. Second, it solves the problem caused by huge entity vocabulary, which reduces the size of decoding vocabulary from several million (i.e., the number of entities in KB) to several dozen (i.e., the length of the question).
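To make the pointer idea concrete, here is a minimal sketch of a bilinear pointer in plain Python: the decoder state is scored against every question position through a weight matrix, and a softmax turns the scores into a distribution over positions. The names `pointer_distribution`, `s_j`, `W` and `H`, and the toy numbers, are illustrative assumptions rather than the trained model.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pointer_distribution(s_j, W, H):
    """Bilinear pointer: score_i = s_j^T W h_i, then softmax over positions.

    s_j: decoder state (length d); W: d x d matrix as nested lists;
    H: one contextual vector (length d) per question token, [CTX] excluded.
    """
    d = len(s_j)
    # query[k] = sum_r s_j[r] * W[r][k], i.e. the row vector s_j^T W
    query = [sum(s_j[r] * W[r][k] for r in range(d)) for k in range(len(W[0]))]
    scores = [sum(q * h for q, h in zip(query, h_i)) for h_i in H]
    return softmax(scores)


# Toy example: 2-dimensional states, identity weights, three positions.
s_j = [1.0, 0.0]
W = [[1.0, 0.0], [0.0, 1.0]]
H = [[0.0, 1.0], [2.0, 0.0], [0.5, 0.5]]
dist = pointer_distribution(s_j, W, H)
pointed_position = max(range(len(dist)), key=dist.__getitem__)
```

The output size equals the number of question tokens, which is why the decoder never has to score the full entity vocabulary of the KB.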

3.2.3 Entity Detection and Linking

Figure 2: Transformation from entity-pointed logical form to KB-executable logical form for KB querying.

To map the pointed positions to entities in the KB, our model also detects entity mentions in the input question, as shown in the “Entity Detection” part of Figure 1.

We observe that multiple entities in a large-scale KB usually have the same entity text but different types, leading to named entity ambiguity. Therefore, we design a novel type-aware entity detection module in which the prediction is fulfilled in a joint space of IOB tagging and the corresponding entity type for disambiguation. Particularly, the prediction space is defined as $E = \{O\} \cup \big(\{I, B\} \times \{ET_k\}_{k=1}^{N^{(tp)}}\big)$, where $ET_k$ stands for the $k$-th entity type label, $N^{(tp)}$ denotes the number of distinct entity types in the KB, and $|E| = 2 N^{(tp)} + 1$.

The prediction for both entity IOB tagging and entity type is formulated as

$$p_i^{(ed)} = \mathrm{softmax}\big(\mathrm{FFN}(h_i; \theta^{(ed)})\big), \tag{6}$$

where $h_i$ is the contextual embedding of the $i$-th token in the question, and $p_i^{(ed)} \in \mathbb{R}^{|E|}$ is the predicted distribution over the joint prediction space $E$.
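The joint label space can be sketched as follows: one "O" label plus a "B" and an "I" label per entity type. The label strings (such as `B-T2`) and the helper names are hypothetical; the decoding routine simply groups consecutive labels of the same type into typed mentions.

```python
def build_label_space(entity_types):
    """Joint space: 'O' plus {B, I} x entity types, i.e. 2 * N + 1 labels."""
    labels = ["O"]
    for tag in ("B", "I"):
        labels.extend(f"{tag}-{etype}" for etype in entity_types)
    return labels


def decode_mentions(tokens, labels):
    """Group joint labels into (mention text, entity type, start index)."""
    mentions, current = [], None
    for i, (token, label) in enumerate(zip(tokens, labels)):
        tag, _, etype = label.partition("-")
        starts_new = tag == "B" or (
            tag == "I" and (current is None or current["type"] != etype)
        )
        if starts_new:
            if current:
                mentions.append(current)
            current = {"tokens": [token], "type": etype, "start": i}
        elif tag == "I":
            current["tokens"].append(token)
        else:  # "O" closes any open mention
            if current:
                mentions.append(current)
            current = None
    if current:
        mentions.append(current)
    return [(" ".join(m["tokens"]), m["type"], m["start"]) for m in mentions]


LABELS = build_label_space(["T1", "T2", "T3"])  # placeholder type ids
mentions = decode_mentions(
    ["who", "owns", "Deram", "Records", "?"],
    ["O", "O", "B-T2", "I-T2", "O"],
)
```

Each decoded mention carries its predicted type and its start position, so the same prediction serves both the type filter and the pointer target.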

Given the predicted IOB labels and entity types, we take the following steps for entity linking. First, the predicted IOB labels are used to locate all entities in the question and return the corresponding entity mentions. Second, an inverted index built on the KB is leveraged to find entity candidates in the KB based on each entity mention. Third, the jointly predicted entity types are used to filter out the candidates with unwanted types, and the remaining entity with the highest inverted index score is selected to substitute the pointer. This process is shown in the bottom part of Figure 2.
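The three linking steps can be sketched with toy data. The inverted index, the entity ids (`E1`, `E2`), the type ids and the scores below are all made up for illustration, and returning `None` when no candidate survives the filter is an assumption of this sketch rather than something the text specifies.

```python
def link_entity(mention, predicted_type, inverted_index, entity_types):
    """Pick the best-scoring candidate whose KB type matches the prediction."""
    candidates = inverted_index.get(mention.lower(), [])
    typed = [
        (entity_id, score)
        for entity_id, score in candidates
        if predicted_type in entity_types.get(entity_id, ())
    ]
    if not typed:
        return None  # assumption: no fallback to untyped candidates
    return max(typed, key=lambda pair: pair[1])[0]


# Hypothetical inverted index: mention text -> [(entity id, retrieval score)].
INVERTED_INDEX = {"bill woods": [("E1", 0.9), ("E2", 0.8)]}
# Hypothetical type table: entity id -> set of type ids.
ENTITY_TYPES = {"E1": {"T_band"}, "E2": {"T_person"}}

linked = link_entity("Bill Woods", "T_person", INVERTED_INDEX, ENTITY_TYPES)
```

Without the type filter the higher-scoring `E1` would be chosen; with the predicted type `T_person` the result is `E2`.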

During the inference phase, the final logical form is derived by replacing the entity pointers in the entity-pointed logical form from §3.2.2 with the entity linking results, and is then executed on the KB to retrieve an answer for the question, as shown in the top part of Figure 2.
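The substitution step can be pictured as a simple lookup. In this sketch an entity pointer is represented as a hypothetical tuple `("e", position)`, and `linked_by_start` maps the start position of each detected mention to its linked entity id; both representations are assumptions made for the example.

```python
def ground_logical_form(pointed_form, linked_by_start):
    """Replace ('e', position) pointers with the entity linked at that position."""
    grounded = []
    for token in pointed_form:
        if isinstance(token, tuple) and token[0] == "e":
            position = token[1]
            if position not in linked_by_start:
                raise ValueError(f"no linked entity at position {position}")
            grounded.append(linked_by_start[position])
        else:
            grounded.append(token)
    return grounded


# The pointer targets position 2; the mention starting there links to "E2".
pointed_form = ["find", ("e", 2), "P1"]
executable_form = ground_logical_form(pointed_form, {2: "E2"})
```

Raising an error for an unmatched pointer is one possible design choice; a real system might instead discard that beam candidate.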

3.3 Learning and Inference

Model Learning

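During the training phase, we first search gold logical forms for questions in the training data over the KB if only weak supervision is provided. Then we conduct multi-task learning for semantic parsing and entity detection. The final loss is defined as

$$L = \alpha L^{(sp)} + L^{(ed)}, \tag{7}$$

where $\alpha$ is a hyperparameter for a tradeoff between semantic parsing and entity detection, and $L^{(sp)}$ and $L^{(ed)}$ are negative log-likelihood losses of semantic parsing and entity detection defined as follows.

$$L^{(sp)} = -\frac{1}{m} \sum_{j=1}^{m} \Big[ \log p_j^{(tk)}[y_j^{(tk)}] + \sum_{c \in \{p, tp, e, num\}} \mathbb{1}\big(y_j^{(tk)} = c\big) \log p_j^{(c)}[y_j^{(c)}] \Big]$$

$$L^{(ed)} = -\frac{1}{n} \sum_{i=1}^{n} \log p_i^{(ed)}[y_i^{(ed)}]$$

In the two equations above, $y_j^{(tk)}$ is the gold label for the decoding token in $V^{(dec)}$; $y_j^{(p)}$, $y_j^{(tp)}$, $y_j^{(e)}$ and $y_j^{(num)}$ are the gold labels for predicate, type, entity position and number position for instantiation; $y_i^{(ed)}$ is the gold joint IOB and type label of the $i$-th token; $p_j^{(tk)}$, $p_j^{(p)}$, $p_j^{(tp)}$, $p_j^{(e)}$, $p_j^{(num)}$ and $p_i^{(ed)}$ are defined in Eq.(1-6) respectively; and $m$ denotes the decoding length.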

Here, we use a single model to handle two subtasks simultaneously, i.e., semantic parsing and entity detection. This multi-task learning framework enables each subtask to leverage supervision signals from the others, and thus improves the final performance for KB-QA.
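A minimal numeric sketch of such a combined objective is given below. It assumes that the tradeoff weight multiplies the parsing term and that each loss is averaged over its sequence length; both are assumptions of this example, as are the function names and the toy distributions.

```python
import math


def nll(distribution, gold_index):
    """Negative log-likelihood of the gold class under a distribution."""
    return -math.log(distribution[gold_index])


def parsing_loss(decoding_steps):
    """Each step is a list of (distribution, gold) pairs: the token classifier
    first, optionally followed by one instantiation prediction."""
    total = sum(nll(d, g) for step in decoding_steps for d, g in step)
    return total / len(decoding_steps)


def detection_loss(token_predictions):
    """token_predictions: one (distribution, gold) pair per question token."""
    total = sum(nll(d, g) for d, g in token_predictions)
    return total / len(token_predictions)


def multitask_loss(decoding_steps, token_predictions, alpha):
    # Assumption: alpha weights the semantic parsing term.
    return alpha * parsing_loss(decoding_steps) + detection_loss(token_predictions)


steps = [
    [([0.5, 0.5], 0)],                      # operator step: token loss only
    [([0.25, 0.75], 1), ([0.5, 0.5], 1)],   # entry step: token + pointer loss
]
tags = [([0.5, 0.5], 0)]
loss = multitask_loss(steps, tags, alpha=2.0)
```

Because both terms are computed from the same encoder outputs, a gradient step on `loss` updates the shared encoder with signals from both subtasks.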

Grammar-Guided Inference

The grammars defined in Table 1 are utilized to filter illegal operators out in each decoding step. An operator is legitimate if its left-hand semantic category in the definition is identical to the leftmost nonterminal (i.e., un-instantiated semantic category) in the incomplete logical form parsed so far. In particular, the decoding of a logical form begins with the start semantic category. During decoding, the proposed semantic parsing model recursively rewrites the leftmost nonterminal in the logical form by 1) applying a legitimate operator for an intermediate semantic category, or 2) instantiating one of entity, predicate, type or number for an entry semantic category. The decoding process terminates when no nonterminals remain.
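The leftmost-nonterminal rule can be sketched with a toy grammar. The operator names and their category signatures below are hypothetical and much smaller than the grammar of Table 1; `pending` holds the un-instantiated categories from left to right.

```python
# Hypothetical toy grammar: operator -> (left-hand category, argument categories).
GRAMMAR = {
    "start_set": ("start", ["set"]),
    "start_num": ("start", ["num"]),
    "find": ("set", ["set", "p"]),
    "set_of": ("set", ["e"]),
    "count": ("num", ["set"]),
}
ENTRY_CATEGORIES = {"e", "p", "tp"}  # instantiated by constants, not operators


def legal_actions(pending):
    """Actions allowed for the leftmost nonterminal of the partial form."""
    if not pending:
        return []  # nothing left to expand: decoding is finished
    leftmost = pending[0]
    if leftmost in ENTRY_CATEGORIES:
        return [leftmost]  # must instantiate a constant of this category
    return sorted(op for op, (lhs, _) in GRAMMAR.items() if lhs == leftmost)


def apply_action(pending, action):
    """Rewrite the leftmost nonterminal with a legal action."""
    if action not in legal_actions(pending):
        raise ValueError(f"illegal action {action!r} for {pending}")
    if action in ENTRY_CATEGORIES:
        return pending[1:]
    return GRAMMAR[action][1] + pending[1:]


def derive(actions):
    pending = ["start"]
    for action in actions:
        pending = apply_action(pending, action)
    return pending


remaining = derive(["start_num", "count", "find", "set_of", "e", "p"])
```

In a decoder, `legal_actions` would be turned into a mask over the token distribution so that illegal operators receive zero probability.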

Furthermore, beam search is incorporated to boost the performance of the proposed model during decoding. In addition, early-stage execution is performed to filter out illegal logical forms that lead to an empty intermediate result.

4 Experiments

Methods HRED+KVmem D2A (Baseline) MaSP (Ours)
Question Type #Example F1 Score F1 Score F1 Score
Overall 203k 9.39% 66.70% 79.26% +12.56%
Clarification 9k 16.35% 35.53% 80.79% +45.26%
Comparative Reasoning (All) 15k 2.96% 48.85% 68.90% +20.05%
Logical Reasoning (All) 22k 8.33% 67.31% 69.04% +1.73%
Quantitative Reasoning (All) 9k 0.96% 56.41% 73.75% +17.34%
Simple Question (Coreferenced) 55k 7.26% 57.69% 76.47% +18.78%
Simple Question (Direct) 82k 13.64% 78.42% 85.18% +6.76%
Simple Question (Ellipsis) 10k 9.95% 81.14% 83.73% +2.59%
Question Type #Example Accuracy Accuracy Accuracy
Verification (Boolean) 27k 21.04% 45.05% 60.63% +15.58%
Quantitative Reasoning (Count) 24k 12.13% 40.94% 43.39% +2.45%
Comparative Reasoning (Count) 15k 8.67% 17.78% 22.26% +4.48%
Table 2: Comparisons with baselines on CSQA. The last column consists of differences between MaSP and D2A.

4.1 Experimental Settings


We evaluated the proposed approach on the Complex Sequential Question Answering (CSQA) dataset Saha et al. (2018), which is the largest dataset for conversational question answering over a large-scale KB. It consists of about 1.6M question-answer pairs in 200K dialogs, where 152K/16K/28K dialogs are used for train/dev/test. Questions are classified into different types, e.g., simple, comparative reasoning, and logical reasoning questions. Its KB is built on Wikidata in the form of (subject, predicate, object) triplets, and consists of 21.2M triplets over 12.8M entities, 3,054 distinct entity types, and 567 distinct predicates.

Training Setups

We leveraged a BFS method to search valid logical forms for questions in the training data. The buffer size in BFS is set to 1000. Both embedding and hidden sizes in the model are set to , no pretrained embeddings are loaded for initialization, and the positional encodings are randomly initialized and learnable. The head number of multi-head attention is and the activation function inside the feed forward network is GELU Hendrycks and Gimpel (2016). We used Adam Kingma and Ba (2015) to optimize the loss function defined in Eq.(7), where is set to , and the learning rate is set to . The training batch size is for epochs. We also employed learning rate warmup within the first steps and linear decay within the rest. The source codes are available at . More details of our implementation are described in Appendix A.
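A warmup followed by linear decay can be written as a small schedule function. The numbers used in the example call (1000 total steps, 100 warmup steps, a peak rate of 1e-3) are arbitrary placeholders for illustration and are not the settings of the experiments.

```python
def learning_rate(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup to peak_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


# Placeholder values, chosen only to show the shape of the schedule.
schedule = [learning_rate(s, 1000, 100, 1e-3) for s in (0, 50, 100, 550, 1000)]
```

The rate rises linearly during warmup, reaches its peak at the last warmup step boundary, and then falls linearly to zero.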

Evaluation Metrics

We used the same evaluation metrics as Saha et al. (2018) and Guo et al. (2018). F1 score (i.e., precision and recall) is used to evaluate questions whose answers are comprised of entities, and accuracy is used to measure questions whose answer type is boolean or number.
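For a single question with a set-valued answer, precision, recall and F1 can be computed as in the sketch below. How per-question scores are aggregated over the dataset is not specified here, so this only illustrates the per-question computation; the entity ids are placeholders.

```python
def precision_recall_f1(predicted, gold):
    """Set-based precision, recall and F1 for one question."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Two of three predicted entities are correct; two of four gold entities found.
p, r, f1 = precision_recall_f1({"E1", "E2", "E3"}, {"E2", "E3", "E4", "E5"})
```

In this example precision is 2/3, recall is 1/2, and F1 is 4/7.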


There are few works targeting conversational question answering over a large-scale knowledge base. HRED+KVmem Saha et al. (2018) and D2A Guo et al. (2018) are two typical approaches, and we compared them with our proposed approach. Particularly, HRED+KVmem is a memory network Sukhbaatar et al. (2015); Li et al. (2017) based seq2seq model, which combines the HRED model Serban et al. (2016) with a key-value memory network Miller et al. (2016). D2A is a memory augmented neural symbolic model for semantic parsing in KB-QA, which introduces a dialog memory manager to handle ellipsis and coreference problems in conversations. (The overall score of D2A reported in this paper is superior to that in the original paper since our re-implemented grammars for CSQA achieve a better balance between the simple and non-simple question types. For rational and fair comparisons, we report re-run results for D2A in this paper.)

4.2 Model Comparisons

We compared our approach (denoted as MaSP) with HRED+KVmem and D2A in Table 2 (more details of the comparisons are listed in Appendix B). As shown in the table, the semantic parsing based D2A significantly outperforms the memory network based text generation approach (HRED+KVmem), which thus poses a strong baseline. Further, our proposed approach (MaSP) achieves a new state-of-the-art performance, where the overall F1 score is improved by 12%. Besides, the improvement is consistent for all question types, ranging from 2% to 45%.

There are two possible reasons for this significant improvement. First, our approach predicts entities more accurately, where the accuracy of entities in final logical forms increases from 55% to 72% compared with D2A. Second, the proposed pointer-equipped logical form decoder in the multi-task learning framework handles coreference better. For instance, given a user question with history, “What is the parent organization of that one? // Did you mean Polydor Records ? // No, I meant Deram Records. Could you tell me the answer for that?” with coreference, D2A produces “(find {Polydor Records}, owned by)” and in contrast our approach produces “(find {Deram Records}, owned by)”. This also explains the substantial improvement for Simple Question (Coreferenced) and Clarification. (In CSQA, the performance of Clarification closely depends on the F1 score for the next question, 88% of which belong to Simple Question (Coreferenced).)

We also observed that the improvement of MaSP over D2A for some question types is relatively small, e.g., 1.73% for logical reasoning questions. A possible reason is that more than one entity is usually needed to compose the correct logical form for logical reasoning questions, and our current model is too shallow to parse the multiple entities. Hence, we adopted a deeper model and employed BERT Devlin et al. (2018) as the encoder (later in §4.4), and found that the performance on logical reasoning questions is improved by 10% compared to D2A.

4.3 Ablation Study

Methods Ours w/o ET w/o Multi w/o Both
Question Type F1 F1 F1 F1
Overall 79.26% 70.42% 76.73% 68.22%
Clarification 80.79% 68.01% 66.30% 54.64%
Comparative 68.90% 66.35% 61.12% 58.04%
Logical 69.04% 62.63% 67.81% 62.51%
Quantitative 73.75% 73.75% 64.56% 64.55%
Simple (Co-ref) 76.47% 64.94% 74.35% 63.15%
Simple (Direct) 85.18% 75.24% 84.93% 75.19%
Simple (Ellipsis) 83.73% 78.45% 82.66% 77.44%
Question Type Accu Accu Accu Accu
Verification 60.63% 45.40% 60.43% 45.02%
Quantitative 43.39% 39.70% 37.84% 43.39%
Comparative 22.26% 19.08% 18.24% 22.26%
Table 3: Ablation study. “w/o ET” stands for removing entity type prediction in Entity Detection of §3.2.3; “w/o Multi” stands for learning two subtasks separately in our framework; and “w/o Both” stands for a combination of “w/o ET” and “w/o Multi”.

There are two aspects leading to performance improvement, i.e., predicting entity type in entity detection to filter candidates, and multi-task learning framework. We conducted an ablation study in Table 3 for in-depth understanding of their effects.

Effect of Entity Type Prediction (w/o ET)
Figure 3: Performance of entity linking. “w/o ET” denotes removing entity type filtering.

First, the entity type prediction was removed from the entity detection task, which results in 9% drop of overall F1 score. We argue that the performance of the KB-QA task is in line with that of entity linking. Hence, we separately evaluated the entity linking task on the test set. As illustrated in Figure 3, both precision and recall of entity linking drop significantly without filtering the entity linking results w.r.t. the predicted entity type, which verifies our hypothesis above.

Effect of Multi-Task Learning (w/o Multi)
Accuracy Ours w/o Multi
Entity pointer 79.8% 79.3%
Predicate 96.9% 96.3%
Type 86.8% 84.1%
Number 89.1% 88.3%
Operators 79.4% 78.7%
Table 4: Prediction accuracy on each component composing the pointer-equipped logical form.

Second, to measure the effect of multi-task learning, we evaluated the KB-QA task when the two subtasks, i.e., pointer-equipped semantic parsing and entity detection, are learned separately. As shown in Table 3, the F1 score for every question type consistently drops in the range of 3% to 14% compared with that with multi-task learning. We further evaluated the effect of multi-task learning on each subtask. As shown in Table 4, the accuracy for each component of the pointer-equipped logical form drops with separate learning. Meanwhile, we found 0.1% F1 score reduction (99.4% vs. 99.5%) for entity detection subtask compared to the model without multi-task learning, which only poses a negligible effect on the downstream task. To sum up, the multi-task learning framework increases the accuracy of the pointer-based logical form generation while keeping a satisfactory performance of entity detection, and consequently improves the final question answering performance.

Note that, considering a combination of removing the entity type filter and learning two subtasks separately (i.e., w/o Both in Table 3), the proposed framework will degenerate to a model that is similar to Coarse-to-Fine semantic parsing model, another state-of-the-art KB-QA model over small-scale KB Dong and Lapata (2018). Therefore, an improvement of 11% of F1 score also verifies the advantage of our proposed framework.

4.4 Model Setting Analysis

As introduced in §4.1 and evaluated in §4.2, the proposed framework is built on a relatively shallow neural network, i.e., stacked two-layer multi-head attention, which might limit its representative ability. Hence, in this section, we further explored the performance of the proposed framework by applying more sophisticated strategies.

Methods Vanilla w/ BERT w/ Large Beam
Question Type F1 F1 F1
Overall 79.26% 80.60% 81.55%
Clarification 80.79% 79.46% 83.37%
Comparative 68.90% 65.99% 69.34%
Logical 69.04% 77.53% 69.41%
Quantitative 73.75% 70.43% 73.75%
Simple (Co-ref) 76.47% 77.95% 79.03%
Simple (Direct) 85.18% 86.40% 88.28%
Simple (Ellipsis) 83.73% 84.82% 86.96%
Question Type Accuracy Accuracy Accuracy
Verification 60.63% 63.85% 61.96%
Quantitative 43.39% 47.14% 44.22%
Comparative 22.26% 25.28% 22.70%
Table 5: Comparisons with different experimental settings. “Vanilla” stands for the standard settings of our framework, i.e., MaSP. “w/ BERT” stands for incorporating BERT, and “w/ Large Beam” stands for increasing the beam search size from 4 to 8.

As shown in Table 5, we first replaced the encoder with pre-trained BERT base model Devlin et al. (2018) and fine-tuned parameters during the training phase, which results in 1.3% F1 score improvement over the vanilla one. Second, we increased beam search size from 4 to 8 during the decoding in the inference phase for the standard settings, which leads to 2.3% F1 score increase.

4.5 Error Analysis

We randomly sampled 100 examples with wrong logical forms or incorrect answers to conduct an error analysis, and found that the errors mainly fall into the following categories.

Entity Ambiguity

Leveraging entity type as a filter in entity linking significantly reduces errors caused by entity ambiguity, but it is still possible that different entities with same text belong to the same type, due to coarse granularity of the entity type, which results in filtering invalidity. For example, it is difficult to distinguish between two persons whose names are both Bill Woods.

Wrong Predicted Logical Form

The predicted components (e.g., operators, predicates and types) composing the logical form may be inaccurate, leading to a wrong answer to the question or an un-executable logical form.

Spurious Logical Form

We took a BFS method to search gold logical forms for questions in the training set, which inevitably generates spurious logical forms (incorrect but coincidentally leading to correct answers) as training signals. Take the question “Which sexes do King Harold, Queen Lillian and Arthur Pendragon possess” as an example: a spurious logical form only retrieves the genders of “King Harold” and “Queen Lillian”, yet it still gets the correct answers for the question. Spurious logical forms introduce noise into the training data and thus negatively affect the performance of KB-QA.

5 Related Work

Our work is aligned with the semantic parsing based approach for KB-QA. Traditional semantic parsing systems typically learn a lexicon-based parser and a scoring model to construct a logical form given a natural language question Zettlemoyer and Collins (2007); Wong and Mooney (2007); Zettlemoyer and Collins (2009); Kwiatkowski et al. (2011); Andreas et al. (2013); Artzi and Zettlemoyer (2013); Zhao and Huang (2014); Long et al. (2016). For example, Zettlemoyer and Collins (2009) and Artzi and Zettlemoyer (2013) learn a CCG parser, and Long et al. (2016) develop a shift-reduce parser to construct logical forms.

Neural semantic parsing approaches have been gaining rising attention in recent years, eschewing the need for extensive feature engineering Jia and Liang (2016); Ling et al. (2016); Xiao et al. (2016). Some efforts have been made to utilize the syntax of logical forms Rabinovich et al. (2017); Krishnamurthy et al. (2017); Cheng et al. (2017); Yin and Neubig (2017). For example, Dong and Lapata (2016) and Alvarez-Melis and Jaakkola (2017) leverage an attention-based encoder-decoder framework to translate a natural language question to tree-structured logical form.

Recently, to handle the huge entity vocabulary of a large-scale knowledge base, many works take a stepwise approach. For example, Liang et al. (2016), Dong and Lapata (2016), and Guo et al. (2018) first process questions using a named entity linking system to find entity candidates, and then learn a model to map a question to a logical form based on the candidates. Dong and Lapata (2018) decompose the task into two stages: first, a sketch of the logical form is predicted, and then a full logical form is generated by considering both the question and the predicted sketch.

Our proposed framework also decomposes the task into multiple subtasks but is different from existing works in several aspects. First, inspired by pointer network Vinyals et al. (2015), we replace entities in a logical form with the starting positions of their mentions in the question, which can be naturally used to handle coreference problem in conversations. Second, the proposed pointer-based semantic parsing model can be intrinsically extended to jointly learn with entity detection for fully leveraging all supervision signals. Third, we alleviate entity ambiguity problem in entity detection & linking subtask, by incorporating entity type prediction into entity mention IOB labeling to filter out the entities with unwanted types.

6 Conclusion

We studied the problem of conversational question answering over a large-scale knowledge base and proposed a multi-task learning framework that simultaneously learns type-aware entity detection and pointer-equipped logical form generation. The framework takes full advantage of the supervision from all subtasks and consequently improves performance on the final KB-QA task. Experimental results on a large-scale dataset verify the effectiveness of the proposed framework. In the future, we will test our framework on more datasets and investigate potential approaches to handling spurious logical forms in weakly supervised KB-QA.


We acknowledge the support of NVIDIA Corporation and MakeMagic Australia with the donation of GPUs for our research group at the University of Technology Sydney. We also thank the anonymous reviewers for their insightful and constructive suggestions.


Appendix A Model Details

A.1 Word Embedding

Given a user question sentence $q$, a tokenizer (e.g., a punctuation or wordpiece tokenizer Wu et al. (2016)) is applied to the sentence to obtain a list of tokens, i.e., $w = [w_1, \dots, w_n]$, where each $w_i \in \mathbb{R}^{|V|}$ is a one-hot vector whose dimension equals the number of distinct tokens in the vocabulary $V$, and $n$ is the length of $w$. Note that a special token is appended to the tokenized sentence, corresponding to the token [CTX]. Then, randomly initialized or pre-trained Mikolov et al. (2013); Pennington et al. (2014) embeddings are applied to $w$, transforming the discrete tokens into a sequence of low-dimensional distributed embeddings, i.e., $x = [x_1, \dots, x_n] \in \mathbb{R}^{d_e \times n}$, where $d_e$ is the embedding size. This process is formulated as $x = W^{(emb)} w$, where $W^{(emb)} \in \mathbb{R}^{d_e \times |V|}$ is the trainable word embedding weight matrix.
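The embedding step can be sketched in a few lines of plain Python. The toy vocabulary, the example question, the embedding size of 3, and the names `one_hot`, `embed`, and `W_emb` are assumptions made for illustration; the weight matrix is stored row-wise here (one row per vocabulary entry), so multiplying a one-hot vector by it simply selects a row.

```python
import random
from typing import List


def one_hot(index: int, size: int) -> List[float]:
    """Return a one-hot vector of length `size` with a 1 at `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def embed(one_hots: List[List[float]], weight: List[List[float]]) -> List[List[float]]:
    """Multiply each one-hot vector by the (vocab_size x d_e) weight matrix."""
    dim = len(weight[0])
    return [
        [sum(v * weight[k][j] for k, v in enumerate(vec)) for j in range(dim)]
        for vec in one_hots
    ]


# Toy vocabulary (assumed); [CTX] is the special token appended to the question.
vocab = {"[CTX]": 0, "who": 1, "directed": 2, "it": 3, "?": 4}
tokens = ["who", "directed", "it", "?", "[CTX]"]

rng = random.Random(0)
d_e = 3  # embedding size (assumed)
W_emb = [[rng.uniform(-0.1, 0.1) for _ in range(d_e)] for _ in vocab]

X = [one_hot(vocab[t], len(vocab)) for t in tokens]
E = embed(X, W_emb)  # one d_e-dimensional embedding per token
```

In practice the product with a one-hot vector is implemented as an index lookup, which gives the same result without materializing the one-hot vectors.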

A.2 Pointer-equipped Semantic Parsing

A.2.1 Encoder of Seq2seq Model

To model contextual dependencies between tokens and generate context-aware representations, we leverage a stacked two-layer multi-head attention mechanism with additive positional encoding (PE) Vaswani et al. (2017). The stacking scheme is identical to that in Vaswani et al. (2017): a two-layer feed-forward network (FFN) with an activation function follows each multi-head attention, and a residual connection He et al. (2016) with layer normalization Lei Ba et al. (2016) is applied. This process is briefly denoted as

$$h = \mathrm{FFN}\big(\mathrm{MultiHead}(x + W^{(pe)},\; x + W^{(pe)},\; x + W^{(pe)})\big),$$

where $h = [h_1, \dots, h_n]$ is a sequence of contextual embeddings, $x$ is the sequence of token embeddings from §A.1, $W^{(pe)}$ denotes the learnable weights of PE, and the three arguments of $\mathrm{MultiHead}$ are the value, key, and query of an attention mechanism.
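A heavily simplified sketch of such an encoder stack is given below. It is an illustration under stated assumptions, not the model used in the paper: it uses a single attention head, omits the learned projections and the FFN, and takes the positional term `pos` as a given matrix. The names `self_attention`, `layer_norm`, and `encode` are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(vec: List[float], eps: float = 1e-6) -> List[float]:
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def self_attention(x: Matrix) -> Matrix:
    """Scaled dot-product attention with value = key = query = x."""
    dim = len(x[0])
    out = []
    for query in x:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in x]
        weights = softmax(scores)
        out.append([sum(w * value[j] for w, value in zip(weights, x)) for j in range(dim)])
    return out


def encode(x: Matrix, pos: Matrix, num_layers: int = 2) -> Matrix:
    """Add the positional term once, then apply attention + residual + layer norm."""
    h = [[a + b for a, b in zip(row, p)] for row, p in zip(x, pos)]
    for _ in range(num_layers):
        attended = self_attention(h)
        h = [layer_norm([a + b for a, b in zip(row, att)]) for row, att in zip(h, attended)]
    return h


x = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.1, 0.0, 0.2], [0.3, 0.3, 0.9, 0.1]]
pos = [[0.01 * (i + j) for j in range(4)] for i in range(3)]
h = encode(x, pos)  # one contextual embedding per input token
```

Every output row depends on all input rows, which is what makes the representations context-aware.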

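A.2.2 Decoder of Seq2seq Model

Similar to the token embedding in the encoder (§A.1), we embed the $t$-th decoder input token as $u_t$ via a randomly initialized embedding weight matrix $W^{(dec)}$. We use $a = [a_1, \dots, a_m]$ to represent all tokens in a gold logical form sketch, where $m$ denotes the length of the gold sketch.

The basic structure of the proposed logical form decoder is the same as that of the original Transformer Vaswani et al. (2017), except that only two stacked layers are used here. From bottom to top, each decoder layer comprises self-attention with a forward mask, cross-attention between the decoder and the encoder, and an FFN, which we briefly formulate as

$$s = \mathrm{FFN}\big(\mathrm{MultiHead}(h,\; h,\; \mathrm{MaskedMultiHead}(u, u, u))\big),$$

where $u = [u_1, \dots, u_m]$ is the sequence of decoder input embeddings, $h$ is the encoder output, the arguments of each attention are given in the order value, key, query, and $s = [s_1, \dots, s_m]$ is a sequence of decoding hidden states.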
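The effect of the forward mask can be illustrated with a small sketch. It covers only the masked self-attention step, with a single head and no learned projections; the cross-attention and the FFN are left out, and the function name `masked_self_attention` and the toy inputs are assumptions.

```python
import math
from typing import List


def masked_self_attention(u: List[List[float]]) -> List[List[float]]:
    """Self-attention in which position t only attends to positions <= t."""
    dim = len(u[0])
    out = []
    for t, query in enumerate(u):
        visible = u[: t + 1]  # forward mask: future positions are hidden
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in visible]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        out.append(
            [sum(e / total * value[j] for e, value in zip(exps, visible)) for j in range(dim)]
        )
    return out


u = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s = masked_self_attention(u)

# Changing the last input token leaves the earlier outputs untouched.
s_changed = masked_self_attention(u[:2] + [[5.0, -3.0]])
```

This property is what allows all decoding steps of a gold sketch to be trained in parallel while still matching step-by-step generation at inference time.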

A.3 Multi-task Learning

We employ a multi-task learning strategy to learn an entity detection (ED) model jointly with the pointer-equipped semantic parsing model, because the supervision from ED, i.e., IOB tagging, covers all entity spans in the input question, which results in better performance than learning the two models separately.

We use multi-task learning to jointly learn the semantic parsing model and ED, rather than directly equipping the semantic parsing model with span prediction Seo et al. (2017), for three reasons: 1) with span prediction, the supervision from entities that appear in the question but not in the gold logical form is lost; 2) a deeper network is required to predict the end index of the target, as shown in Seo et al. (2017); and 3) a well-trained entity detection model can correct a pointer that deviates slightly during the inference phase, whereas a span-based model usually leads to error accumulation.
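The third reason can be illustrated with a small sketch: IOB tags are decoded into spans, and a slightly misplaced pointer is moved to the start of the nearest detected span. The exact correction rule is not spelled out in this appendix, so the nearest-span rule, the untyped tags `B`/`I`/`O`, and the names `iob_to_spans` and `snap_pointer` are assumptions made for illustration.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end_exclusive)


def iob_to_spans(tags: List[str]) -> List[Span]:
    """Decode a sequence of IOB tags into entity spans."""
    spans: List[Span] = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "B" or (tag == "I" and start is None):
            if start is not None:
                spans.append((start, i))  # close the previous span
            start = i
        elif tag == "O" and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans


def snap_pointer(pointer: int, spans: List[Span]) -> int:
    """Move a predicted pointer to the start of the nearest detected span."""
    if not spans:
        return pointer

    def distance(span: Span) -> int:
        start, end = span
        if start <= pointer < end:
            return 0
        return min(abs(pointer - start), abs(pointer - (end - 1)))

    return min(spans, key=distance)[0]


tokens = "does lead poisoning cause hearing loss ?".split()
tags = ["O", "B", "I", "O", "B", "I", "O"]
spans = iob_to_spans(tags)

corrected_inside = snap_pointer(2, spans)  # pointer landed on "poisoning"
corrected_after = snap_pointer(6, spans)   # pointer landed on "?"
```

A span-prediction model has no such safety net: an error in either the start or the end index directly changes the extracted mention.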

A.4 Inverted Index

For each entity text in Wikidata, we enumerated its substrings whose length is not less than the length of the full text minus a threshold, and then calculated the Levenshtein distance between the full text and each substring as the score for the map from the substring to the corresponding full text. Since multiple entities can generate an identical substring, we kept the maps with the largest scores and used them to build a dictionary for future queries.
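A minimal sketch of this index construction follows. Two points are assumptions rather than details given in the text: the score is computed as a similarity, `1 - distance / len(text)`, so that a larger score means a closer match, and a single best entity is kept per substring. The entity ids, the threshold of 2, and the function names are hypothetical.

```python
from typing import Dict, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def build_inverted_index(
    entities: Dict[str, str], threshold: int = 2
) -> Dict[str, Tuple[str, float]]:
    """Map each sufficiently long substring of an entity text to its best entity."""
    index: Dict[str, Tuple[str, float]] = {}
    for entity_id, text in entities.items():
        min_len = max(1, len(text) - threshold)
        for length in range(min_len, len(text) + 1):
            for start in range(len(text) - length + 1):
                sub = text[start:start + length]
                # Assumed scoring: similarity derived from the edit distance.
                score = 1.0 - levenshtein(text, sub) / len(text)
                if sub not in index or score > index[sub][1]:
                    index[sub] = (entity_id, score)
    return index


entities = {"Q1": "lead", "Q2": "leads"}  # hypothetical ids and texts
index = build_inverted_index(entities)
```

Here the substring `lead` is generated by both entities, and the entry for the exact match `Q1` wins because its score is higher.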

Appendix B Supplemental Experiment Results

B.1 Precision and Recall for Main Paper

Since the main paper reports only the F1 score for brevity, this section reports the corresponding precision and recall in detail: 1) Table 12 compares the proposed model with the baselines; 2) Table 13 presents the ablation study; and 3) Table 14 compares the performance after more sophisticated strategies are applied.

B.2 Comparison to D2A

| Question Type | D2A | Ours |
| --- | --- | --- |
| Simple Question (Direct) | 2.6 | 1.5 |
| Clarification | 2.7 | 1.4 |
| Simple Question (Coreferenced) | 2.7 | 1.4 |
| Quantitative Reasoning (Count) (All) | 2.9 | 1.5 |
| Logical Reasoning (All) | 2.7 | 1.6 |
| Simple Question (Ellipsis) | 2.6 | 1.6 |
| Verification (Boolean) (All) | 2.8 | 1.4 |
| Quantitative Reasoning (All) | 2.7 | 1.4 |
| Comparative Reasoning (Count) (All) | 2.8 | 1.4 |
| Comparative Reasoning (All) | 3.0 | 1.4 |
| Overall | 2.9 | 1.5 |

Table 6: The average number of entity candidates from entity linking.

To further demonstrate that the proposed model is superior to the previous D2A model in terms of entity linking and logical form generation, we conduct the following comparisons.

First, as shown in Table 6, the average number of entity candidates produced by entity linking on the test set is smaller for the proposed model than for D2A, which means that the proposed approach provides the downstream subtask with more accurate entity linking results.

| Question Type | D2A | Ours |
| --- | --- | --- |
| Simple Question (Direct) | 0.8960 | 0.9520 |
| Clarification | 0.8281 | 0.9323 |
| Simple Question (Coreferenced) | 0.8177 | 0.8952 |
| Quantitative Reasoning (Count) (All) | 0.8385 | 0.9581 |
| Logical Reasoning (All) | 0.8726 | 0.9791 |
| Simple Question (Ellipsis) | 0.9364 | 0.9474 |
| Verification (Boolean) (All) | 0.7448 | 0.9637 |
| Quantitative Reasoning (All) | 0.9304 | 0.9832 |
| Comparative Reasoning (Count) (All) | 0.8165 | 0.9863 |
| Comparative Reasoning (All) | 0.8312 | 0.9727 |
| Overall | 0.8499 | 0.9475 |

Table 7: Ratio of non-empty logical forms.

Second, we compare the proposed model with D2A in terms of logical form generation, where the logical form can be empty because of a timeout or illegal logical forms during beam search. As shown in Table 7, the proposed model yields a lower ratio of empty logical forms than D2A.

| Question Type | D2A | Ours | +BERT |
| --- | --- | --- | --- |
| Simple Question (Direct) | 0.7967 | 0.8519 | 0.8664 |
| Clarification | 0.2385 | 0.6408 | 0.6414 |
| Simple Question (Coreferenced) | 0.5341 | 0.7234 | 0.7469 |
| Quantitative Reasoning (Count) (All) | 0.5000 | 0.6947 | 0.7004 |
| Logical Reasoning (All) | 0.3692 | 0.0791 | 0.3196 |
| Simple Question (Ellipsis) | 0.7533 | 0.8843 | 0.8878 |
| Verification (Boolean) (All) | 0.1757 | 0.5278 | 0.5854 |
| Quantitative Reasoning (All) | 0.8913 | 0.9792 | 0.9911 |
| Comparative Reasoning (Count) (All) | 0.3235 | 0.8924 | 0.9121 |
| Comparative Reasoning (All) | 0.2483 | 0.9053 | 0.9242 |
| Overall | 0.5522 | 0.7167 | 0.7546 |

Table 8: Accuracy of entities in the predicted logical form.

Third, Table 8 lists the accuracy of the entities appearing in the predicted logical form for D2A, our standard approach, and the BERT-based model. The results verify that the proposed approach can significantly improve the performance of entity linking during entity detection and of entity prediction during logical form generation. Note that the performance reduction on Logical Reasoning (All) is analyzed in the main paper.

B.3 Multi-task Learning

The multi-task learning framework increases the accuracy of logical form generation while keeping a satisfactory performance of entity detection, and consequently improves the final question answering task via logical form execution. In this section, we list in detail all metrics that measure the performance of the two subtasks for our approach with and without multi-task learning. To evaluate logical form generation, we also apply the BFS method to the test set to obtain gold logical forms (which inevitably include spurious ones).

| Question Type | Ours | w/o Multi |
| --- | --- | --- |
| Comparative Reasoning (All) | 0.1885 | 0.1885 |
| Logical Reasoning (All) | 0.6256 | 0.6188 |
| Quantitative Reasoning (All) | 0.6403 | 0.6188 |
| Simple Question (Coreferenced) | 0.8721 | 0.8663 |
| Simple Question (Direct) | 0.8772 | 0.8715 |
| Simple Question (Ellipsis) | 0.9073 | 0.9034 |
| Comparative Reasoning (Count) (All) | 0.1601 | 0.1495 |
| Quantitative Reasoning (Count) (All) | 0.5711 | 0.5564 |
| Verification (Boolean) (All) | 0.7638 | 0.7565 |
| Overall | 0.7940 | 0.7872 |

Table 9: Sketch accuracy for logical form generation.
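
| Subtask | Metric | Ours | w/o Multi |
| --- | --- | --- | --- |
| IOB Tagging | Accuracy | 0.9967 | 0.9975 |
| IOB Tagging | F1 Score | 0.9941 | 0.9955 |
| IOB Tagging | Precision | 0.9960 | 0.9972 |
| IOB Tagging | Recall | 0.9923 | 0.9938 |
| Entity Type | Accuracy | 0.9822 | 0.9844 |
| Entity Type | F1 Score | 0.9674 | 0.9717 |
| Entity Type | Precision | 0.9958 | 0.9971 |
| Entity Type | Recall | 0.9407 | 0.9475 |

Table 10: Performance of IOB tagging and entity type prediction.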

As shown in Tables 9 and 10, the model with multi-task learning outperforms the one without it in terms of logical form generation by the semantic parsing model. Although a performance reduction of about 0.002 is observed for the entity detection subtask, the performance of entity detection and linking remains good enough for the downstream task, so the reduction has only a very minor effect on the performance of KB-QA.
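As a quick consistency check on Table 10, the F1 scores can be recomputed from the reported precision and recall with the standard harmonic-mean formula. The dictionary layout and the name `f1` are illustrative; the numbers are copied from Table 10.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) taken from Table 10
reported = {
    "IOB tagging (Ours)": (0.9960, 0.9923, 0.9941),
    "IOB tagging (w/o Multi)": (0.9972, 0.9938, 0.9955),
    "Entity type (Ours)": (0.9958, 0.9407, 0.9674),
    "Entity type (w/o Multi)": (0.9971, 0.9475, 0.9717),
}

# Absolute gap between the recomputed and the reported F1 for each row.
gaps = {name: abs(f1(p, r) - f) for name, (p, r, f) in reported.items()}
```

The recomputed values agree with the reported ones up to rounding of the four-digit inputs, which supports reading the F1 rows of Table 10 as derived from the precision and recall rows.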

B.4 BFS Success Ratio

| Question Type | #Example | Ratio |
| --- | --- | --- |
| Simple Question (Direct) | 274527 | 0.96 |
| Simple Question (Ellipsis) | 34549 | 0.97 |
| Quantitative Reasoning (All) | 58976 | 0.46 |
| Quantitative Reasoning (Count) (All) | 114074 | 0.67 |
| Logical Reasoning (All) | 66161 | 0.61 |
| Simple Question (Coreferenced) | 173765 | 0.86 |
| Verification (Boolean) (All) | 77167 | 0.75 |
| Comparative Reasoning (Count) (All) | 59557 | 0.37 |
| Comparative Reasoning (All) | 57343 | 0.32 |

Table 11: The BFS success ratio for different question types.

Given the final answer to a question as well as the gold entities, predicates, and types, we run a BFS to search for the gold logical form; the search may fail because of the limited time and buffer. Table 11 lists the BFS success ratio on the CSQA training data.
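The idea of searching for a logical form under a budget can be sketched on a toy knowledge base. Everything specific here is an assumption for illustration: the tiny KB and its facts, the three operators `find`, `union`, and `inter`, the pruning of candidates by their denotation, and the `max_nodes` budget standing in for the time and buffer limits. It is not the grammar or search procedure used in the paper.

```python
from collections import deque
from itertools import product
from typing import Dict, FrozenSet, List, Optional, Set, Tuple

KB = Dict[Tuple[str, str], Set[str]]


def bfs_logical_form(
    kb: KB,
    entities: List[str],
    predicates: List[str],
    answer: Set[str],
    max_nodes: int = 1000,
) -> Optional[str]:
    """Breadth-first search for a logical form whose result equals `answer`."""
    target = frozenset(answer)
    seen: Set[FrozenSet[str]] = set()
    queue = deque()
    # Level 0: every (entity, predicate) lookup.
    for e, p in product(entities, predicates):
        den = frozenset(kb.get((e, p), set()))
        if den not in seen:
            seen.add(den)
            queue.append((f"(find {{{e}}}, {p})", den))

    expanded: List[Tuple[str, FrozenSet[str]]] = []
    visited = 0
    while queue:
        form, den = queue.popleft()
        visited += 1
        if visited > max_nodes:
            return None  # budget exhausted: counted as a search failure
        if den == target:
            return form
        expanded.append((form, den))
        # Combine the current form with every form expanded so far.
        for other_form, other_den in expanded:
            for op, combined in (("union", den | other_den), ("inter", den & other_den)):
                if combined not in seen:
                    seen.add(combined)
                    queue.append((f"({op} {other_form}, {form})", combined))
    return None


# Toy KB with made-up facts, used only to exercise the search.
kb = {
    ("lead poisoning", "symptoms"): {"anemia", "headache"},
    ("pentachlorophenol exposure", "symptoms"): {"fever", "headache"},
}
ents = ["lead poisoning", "pentachlorophenol exposure"]
found = bfs_logical_form(kb, ents, ["symptoms"], {"anemia", "headache", "fever"})
failed = bfs_logical_form(kb, ents, ["symptoms"], {"anemia", "headache", "fever"}, max_nodes=2)
```

The search returns the first form whose execution result matches the answer, so it may return a spurious form when several forms yield the same result, and it reports failure once the budget is used up.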

Appendix C Supplemental Analysis

We also observe that the improvement of MaSP over D2A is relatively small for some question types, especially for logical reasoning questions. For logical reasoning, we find that the accuracy of entities in the final logical forms is only 8%, although two distinct entities are usually needed to produce a correct logical form. This suggests that the presented shallow network, i.e., two-layer multi-head attention, cannot handle such complex cases. We study a case here for better understanding. Given “Which diseases are a sign of lead poisoning or pentachlorophenol exposure?”, D2A produces “(union (find {lead poisoning}, symptoms), (pe…ol exposure))”, where the entities are correct but the operator is wrong; our approach produces “(union (find {pe…ol exposure}, symptoms), (union (find {pe…ol exposure}, symptoms))”, where the entities are wrong; and our approach with BERT Devlin et al. (2018) as the encoder produces the correct logical form, “(union (find {pe…ol exposure}, symptoms), (union (find {lead poisoning}, symptoms))”.

| Question Type | #Example | HRED+KVmem Recall | HRED+KVmem Precision | D2A (Baseline) Recall | D2A (Baseline) Precision | Our Approach Recall | Our Approach Precision |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Overall | - | 18.40% | 6.30% | 66.83% | 66.57% | 78.07% | 80.48% |
| Clarification | 12k | 25.09% | 12.13% | 37.24% | 33.97% | 84.18% | 77.66% |
| Comparative Reasoning (All) | 15k | 2.11% | 4.97% | 44.14% | 54.68% | 59.83% | 81.20% |
| Logical Reasoning (All) | 22k | 15.11% | 5.75% | 65.82% | 68.86% | 61.92% | 78.00% |
| Quantitative Reasoning (All) | 9k | 0.91% | 1.01% | 52.74% | 60.63% | 69.14% | 79.02% |
| Simple Question (Coreferenced) | 55k | 12.67% | 5.09% | 58.47% | 56.94% | 76.94% | 76.01% |
| Simple Question (Direct) | 82k | 33.30% | 8.58% | 79.50% | 77.37% | 86.09% | 84.29% |
| Simple Question (Ellipsis) | 10k | 17.30% | 6.98% | 84.67% | 77.90% | 85.50% | 82.03% |

| Question Type | #Example | HRED+KVmem Accuracy | D2A (Baseline) Accuracy | Our Approach Accuracy |
| --- | --- | --- | --- | --- |
| Verification (Boolean) | 27k | 21.04% | 45.05% | 60.63% |
| Quantitative Reasoning (Count) | 24k | 12.13% | 40.94% | 43.39% |
| Comparative Reasoning (Count) | 15k | 8.67% | 17.78% | 22.26% |

Table 12: Comparison with baselines for KB-QA.

| Question Type | Our Approach Recall | Our Approach Precision | w/o ET Recall | w/o ET Precision | w/o Multi Recall | w/o Multi Precision | w/o Both Recall | w/o Both Precision |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Overall | 78.07% | 80.48% | 68.78% | 72.15% | 75.75% | 77.73% | 66.75% | 69.75% |
| Clarification | 84.18% | 77.66% | 69.79% | 66.32% | 70.12% | 62.88% | 56.96% | 52.51% |
| Comparative Reasoning (All) | 59.83% | 81.20% | 57.48% | 78.45% | 53.62% | 71.06% | 50.86% | 67.59% |
| Logical Reasoning (All) | 61.92% | 78.00% | 54.43% | 73.73% | 61.04% | 76.27% | 54.16% | 73.91% |
| Quantitative Reasoning (All) | 69.14% | 79.02% | 69.14% | 79.02% | 60.86% | 68.73% | 60.86% | 68.72% |
| Simple Question (Coreferenced) | 76.94% | 76.01% | 64.92% | 64.96% | 74.65% | 74.06% | 63.06% | 63.24% |
| Simple Question (Direct) | 86.09% | 84.29% | 75.87% | 74.62% | 85.88% | 84.01% | 75.84% | 74.56% |
| Simple Question (Ellipsis) | 85.50% | 82.03% | 80.12% | 76.85% | 84.28% | 81.11% | 78.96% | 75.97% |

| Question Type | Our Approach Accuracy | w/o ET Accuracy | w/o Multi Accuracy | w/o Both Accuracy |
| --- | --- | --- | --- | --- |
| Verification (Boolean) | 60.63% | 45.40% | 60.43% | 45.02% |
| Quantitative Reasoning (Count) | 43.39% | 39.70% | 37.84% | 43.39% |
| Comparative Reasoning (Count) | 22.26% | 19.08% | 18.24% | 22.26% |

Table 13: Ablation study. “w/o ET” stands for removing entity type prediction in Entity Detection; “w/o Multi” stands for learning two subtasks separately in our framework; and “w/o Both” stands for a combination of “w/o ET” and “w/o Multi”.

| Question Type | Vanilla Recall | Vanilla Precision | w/ BERT Recall | w/ BERT Precision | Larger Beam Size Recall | Larger Beam Size Precision |
| --- | --- | --- | --- | --- | --- | --- |
| Overall | 78.07% | 80.48% | 79.67% | 81.56% | 80.39% | 82.75% |
| Clarification | 84.18% | 77.66% | 83.24% | 76.01% | 86.90% | 80.11% |
| Comparative Reasoning (All) | 59.83% | 81.20% | 58.79% | 75.21% | 60.25% | 81.67% |
| Logical Reasoning (All) | 61.92% | 78.00% | 72.56% | 83.24% | 62.16% | 78.58% |
| Quantitative Reasoning (All) | 69.14% | 79.02% | 66.91% | 74.35% | 69.14% | 79.02% |
| Simple Question (Coreferenced) | 76.94% | 76.01% | 78.05% | 77.85% | 79.54% | 78.52% |
| Simple Question (Direct) | 86.09% | 84.29% | 86.84% | 85.96% | 89.26% | 87.33% |
| Simple Question (Ellipsis) | 85.50% | 82.03% | 86.38% | 83.32% | 88.78% | 85.22% |

| Question Type | Vanilla Accuracy | w/ BERT Accuracy | Larger Beam Size Accuracy |
| --- | --- | --- | --- |
| Verification (Boolean) | 60.63% | 63.85% | 61.96% |
| Quantitative Reasoning (Count) | 43.39% | 47.14% | 44.22% |
| Comparative Reasoning (Count) | 22.26% | 25.28% | 22.70% |

Table 14: Comparisons with different experimental settings. “Vanilla” stands for the standard settings of our framework. “w/ BERT” stands for incorporating BERT. “Larger Beam Size” stands for increasing the beam search size from 4 to 8.