1 Introduction
Recent decades have seen the development of AI-driven personal assistants (e.g., Siri, Alexa, Cortana, and Google Now) that often need to answer factoid questions. Meanwhile, large-scale knowledge bases (KBs) such as DBpedia Auer et al. (2007) and Freebase Bollacker et al. (2008) have been built to store the world's facts in structured databases, which are used to support open-domain question answering (QA) in those assistants.
Neural semantic parsing based approaches Jia and Liang (2016); Reddy et al. (2014); Dong and Lapata (2016); Liang et al. (2016); Dong and Lapata (2018); Guo et al. (2018) have been gaining increasing attention for knowledge-based question answering (KB-QA) in recent years, since they do not rely on hand-crafted features and are easy to adapt across domains. Traditional approaches usually retrieve answers from a small KB (e.g., a small table) Jia and Liang (2016); Xiao et al. (2016) and have difficulty handling large-scale KBs. Many recent neural semantic parsing based approaches for KB-QA take a stepwise framework to handle this issue. For example, Liang et al. (2016), Dong and Lapata (2016), and Guo et al. (2018) first use an entity linking system to find entities in a question, and then learn a model to map the question to a logical form based on those entities. Dong and Lapata (2018) decompose the semantic parsing process into two stages: they first generate a rough sketch of the logical form based on low-level features, and then fill in missing details by considering both the question and the sketch.
However, these stepwise approaches have two issues. First, errors in upstream subtasks (e.g., entity detection and linking, relation classification) are propagated to downstream ones (e.g., semantic parsing), resulting in accumulated errors. For example, case studies in previous works Yih et al. (2015); Dong and Lapata (2016); Xu et al. (2016); Guo et al. (2018) show that entity linking error is one of the major errors leading to wrong predictions in KB-QA. Second, since models for the subtasks are learned independently, the supervision signals cannot be shared among the models for mutual benefits.
Alias | Operator | Comments |
---|---|---|
A1/2/3 | | |
A4 | | set of entities with a given predicate edge to a given entity |
A5 | | number of distinct elements in the input set |
A6 | | whether the given entity is in the set or not |
A7 | | |
A8 | | |
A9 | | |
A10 | | subset of the set linking to more than a given number of entities with a given predicate |
A11 | | subset of the set linking to less than a given number of entities with a given predicate |
A12 | | subset of the set linking to a given number of entities with a given predicate |
A13 | | subset of the set linking to the most entities with a given predicate |
A14 | | subset of the set linking to the least entities with a given predicate |
A15 | | subset whose entities are in the set and belong to a given entity type |
A16 | | transform a number in the utterance to an intermediate number |
A17 | | |
A18/19/20/21 | constant | instantiation for entity, predicate, type, and number |

Table 1: Grammar operators used in this work.
To tackle the issues mentioned above, we propose a novel multi-task semantic parsing framework for KB-QA. Specifically, an innovative pointer-equipped semantic parsing model is first designed for two purposes: 1) the built-in pointer network toward positions of entity mentions in the question naturally enables multi-task learning in conjunction with the upstream sequence labeling subtask, i.e., entity detection; and 2) it explicitly takes into account the context of entity mentions by using the supervision of the pointer network. In addition, a type-aware entity detection method is proposed to produce accurate entity linking results, in which a joint prediction space combining entity detection and entity typing is employed, and the predicted type is then used to filter entity linking results during the inference phase.
The proposed framework has certain merits.
First, since the two subtasks, i.e., pointer-equipped semantic parsing and entity detection, are closely related, learning them simultaneously within a single model makes the best of the supervision signals and improves the performance of the KB-QA task.
Second, since entity type prediction is crucial for entity linking, our joint learning framework combining entity mention detection with type prediction leverages contextual information and thus further reduces entity linking errors.
Third, our approach naturally benefits coreference resolution for conversational QA due to the rich contextual features captured for entity mentions, compared to previous works that directly employ low-level features (e.g., mean-pooling over word embeddings) as the representation of an entity. This is verified via our experiments in §4.2.
We evaluate the proposed framework on the CSQA Saha et al. (2018) dataset, which is the largest public dataset for complex conversational question answering over a large-scale knowledge base. Experimental results show that the overall F1 score is improved by 12.56% compared with strong baselines, and the improvements are consistent for all question types in the dataset.
2 Task Definition
In this work, we target the problem of conversational question answering over a large-scale knowledge base. Formally, each training instance consists of a question, i.e., a user utterance from a dialog concatenated with its dialog history to handle ellipsis or coreference in conversations, labeled with its answer. Besides, IOB (Inside-Outside-Beginning) tags and the KB entities they link to are also labeled for the entity mentions in the question to train an entity detection model.
We employ a neural semantic parsing based approach to tackle the problem. That is, given a question, a semantic parsing model is used to produce a logical form, which is then executed on the KB to retrieve an answer. We decompose the approach into two subtasks, i.e., entity detection for entity linking and semantic parsing for logical form generation. The former uses the IOB tags and corresponding entities as supervision, while the latter uses a gold logical form as supervision, which may be obtained by conducting an intensive BFS (breadth-first search with a limited buffer Guo et al. (2018)) over the KB if only final answers (i.e., weak supervision) are provided.

3 Approach
This section begins with a description of the grammars and logical forms used in this work. Then, the proposed model is presented, and finally, the model's training and inference are introduced.
3.1 Grammar and Logical Form
Grammar
We use grammars and logical forms similar to those defined in Guo et al. (2018), with minor modifications for better adaptation to the CSQA dataset. The grammars are briefly summarized in Table 1, where each operator consists of three components: a semantic category, a function name, and a list of arguments with specified semantic categories. Semantic categories fall into two groups w.r.t. the way they are instantiated: entry semantic categories (i.e., those for entities, predicates, types, and numbers), whose instantiations are constants parsed from the question, and intermediate semantic categories, whose instantiations are the outputs of operator executions.
Logical Form
A KB-executable logical form is intrinsically formatted as an ordered tree, where the root is the initial semantic category, each child node is constrained by the nonterminal (i.e., the un-instantiated semantic category in parentheses) of its parent operator, and the leaf nodes are instantiated entry semantic categories, i.e., constants.
To make the best of well-performing sequence-to-sequence (seq2seq) models Vaswani et al. (2017); Bahdanau et al. (2015) as a base for semantic parsing, we represent a tree-structured logical form as a sequence of operators and constants via depth-first traversal over the tree. Note that, given the guidance of the grammars, we can recover the corresponding tree structure from a sequence-formatted logical form.
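To make the serialization concrete, the sketch below flattens a logical-form tree by depth-first traversal and recovers it using each operator's arity, mirroring the grammar-guided recovery described above. The toy operator names, arities, and Wikidata-style IDs are illustrative assumptions, not the exact grammar of Table 1.

```python
# A minimal sketch, assuming a toy grammar; operator names and arities
# are illustrative placeholders rather than the paper's full Table 1.
ARITY = {"find": 2, "count": 1, "union": 2}

def to_sequence(node):
    """Flatten a logical-form tree (nested lists) by depth-first traversal."""
    op, children = node[0], node[1:]
    seq = [op]
    for child in children:
        seq.extend(to_sequence(child) if isinstance(child, list) else [child])
    return seq

def to_tree(seq):
    """Recover the tree from the flat sequence using each operator's arity."""
    pos = 0
    def parse():
        nonlocal pos
        op = seq[pos]
        pos += 1
        arity = ARITY.get(op, 0)          # constants (entities, predicates) have arity 0
        if arity == 0:
            return op
        return [op] + [parse() for _ in range(arity)]
    return parse()

tree = ["count", ["find", "Q42", "P31"]]  # count(find(entity, predicate)), IDs are illustrative
assert to_tree(to_sequence(tree)) == tree
```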
3.2 Proposed Model
The structure of our proposed Multi-task Semantic Parsing (MaSP) model is illustrated in Figure 1. The model consists of four components: word embedding, contextual encoder, entity detection, and pointer-equipped logical form decoder.
3.2.1 Embedding and Contextual Encoder
To handle ellipsis and coreference in conversations, our model takes the current user question combined with the dialog history as the input question. In particular, all those sentences are concatenated with [SEP] separators, and then a special token [CTX] is appended. We apply the WordPiece tokenization method Wu et al. (2016), and then use a word embedding method Mikolov et al. (2013) to transform the tokenized question into a sequence of low-dimensional distributed embeddings, i.e., $X \in \mathbb{R}^{n \times d}$, where $d$ denotes the embedding size and $n$ denotes the question length.
Given the word embeddings $X$, we use a stacked two-layer multi-head attention mechanism from the Transformer Vaswani et al. (2017) with learnable positional encodings as an encoder to model contextual dependencies between tokens, which results in context-aware representations $U \in \mathbb{R}^{n \times d}$. The contextual embedding of the token [CTX] is used as the semantic representation of the entire question, denoted as $q$.
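As a rough illustration of this input construction and encoding step (not the released implementation), the following sketch concatenates the dialog history and current question with [SEP] separators, appends [CTX], and encodes the sequence with a small two-layer Transformer encoder; the dimensions, PyTorch modules, and the assumption that [CTX] is the last token are our own choices for the example.

```python
import torch
import torch.nn as nn

class ContextualEncoder(nn.Module):
    """Illustrative two-layer multi-head attention encoder with learnable positions."""
    def __init__(self, vocab_size, d_model=256, n_heads=8, n_layers=2, max_len=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)        # learnable positional encodings
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, token_ids):                             # token_ids: (batch, n)
        pos = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.tok_emb(token_ids) + self.pos_emb(pos)       # word + positional embeddings
        u = self.encoder(x)                                    # context-aware representations U
        q = u[:, -1]                                           # embedding of the trailing [CTX] token
        return u, q

# Input construction: history and current question joined by [SEP], plus a trailing [CTX].
history = ["which river flows through berlin ?", "Spree"]
current = "and which city does it end in ?"
question = " [SEP] ".join(history + [current]) + " [CTX]"
```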
3.2.2 Pointer-Equipped Decoder
Given contextual embeddings of a question, we employ stacked two-layer masked attention mechanism in Vaswani et al. (2017) as the decoder to produce sequence-formatted logical forms.
In each decoding step, the model first predicts a token from a small decoding vocabulary $V^{(dec)} = \{\mathrm{START}, \mathrm{END}, e, p, t, num, A1, \dots, A21\}$, where $\mathrm{START}$ and $\mathrm{END}$ indicate the start and end of decoding, $A1, \dots, A21$ are defined in Table 1, and $e$, $p$, $t$ and $num$ denote entity, predicate, type and number entries respectively. A neural classifier is established to predict the current decoding token, which is formally denoted as

$p^{(dec)}_t = \mathrm{softmax}(\mathrm{FFN}(h_t; \theta^{(dec)}))$ (1)

where $h_t$ is the decoding hidden state of the current (i.e., $t$-th) step, $\mathrm{FFN}(\cdot;\theta)$ denotes a $\theta$-parameterized two-layer feed-forward network with an activation function inside, and $p^{(dec)}_t$ is a predicted distribution over $V^{(dec)}$ to score candidates (superscripts in parentheses denote types rather than indices). Then, an FFN or a pointer network Vinyals et al. (2015) is utilized to predict the instantiation for an entry semantic category (i.e., $e$, $p$, $t$ or $num$ in $V^{(dec)}$) if necessary.
- For predicate $p$ and type $t$, two parameter-untied FFNs are used as

$p^{(p)}_t = \mathrm{softmax}(\mathrm{FFN}([h_t; q]; \theta^{(p)}))$ (2)
$p^{(t)}_t = \mathrm{softmax}(\mathrm{FFN}([h_t; q]; \theta^{(t)}))$ (3)

where $q$ is the semantic embedding of the entire question, $h_t$ is the current hidden state, and $p^{(p)}_t$ and $p^{(t)}_t$ are the predicted distributions over the predicate and type instantiation candidates respectively, whose dimensions equal the numbers of distinct predicates and types in the knowledge base.
- For entity $e$ and number $num$, two parameter-untied pointer networks Vinyals et al. (2015) with learnable bilinear layers are employed to point toward the targeted entity (toward its first token if the entity mention consists of multiple words) and number, which are defined as follows:

$p^{(e)}_t = \mathrm{softmax}(U W^{(e)} h_t)$ (4)
$p^{(num)}_t = \mathrm{softmax}(U W^{(num)} h_t)$ (5)

where $U$ contains the contextual embeddings of the tokens in the question except [CTX], $W^{(e)}$ and $W^{(num)}$ are the bilinear weights of the pointer networks for entity and number, $p^{(e)}_t$ and $p^{(num)}_t$ are the resulting distributions over the positions of the input question, and $n$ is the length of the question.
A pointer network is also used for semantic parsing in Jia and Liang (2016), where the pointer aims at copying out-of-vocabulary words from a question over a small-scale KB. Different from that, the pointer used here aims at locating the targeted entity and number in the question, which has two advantages. First, it handles the coreference problem by considering the context of entity mentions in the question. Second, it avoids the problem caused by a huge entity vocabulary, reducing the size of the decoding vocabulary from several million (i.e., the number of entities in the KB) to several dozen (i.e., the length of the question).
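The following sketch makes one decoding step concrete under the formulation above: the decoder state is scored against the small vocabulary, and separate heads are kept for predicate/type classification and for bilinear pointers over question positions. Module names, shapes, and the use of PyTorch are assumptions for illustration rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class DecodingStep(nn.Module):
    """One decoding step of a pointer-equipped parser (illustrative sketch)."""
    def __init__(self, d, n_vocab, n_pred, n_type):
        super().__init__()
        self.token_clf = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, n_vocab))
        self.pred_clf = nn.Sequential(nn.Linear(2 * d, d), nn.GELU(), nn.Linear(d, n_pred))
        self.type_clf = nn.Sequential(nn.Linear(2 * d, d), nn.GELU(), nn.Linear(d, n_type))
        self.ptr_entity = nn.Bilinear(d, d, 1)   # bilinear pointer over entity positions
        self.ptr_number = nn.Bilinear(d, d, 1)   # bilinear pointer over number positions

    def forward(self, h_t, q, u):
        # h_t: (d,) decoder state, q: (d,) [CTX] embedding, u: (n, d) token embeddings
        token_logits = self.token_clf(h_t)                      # small decoding vocab (cf. Eq. 1)
        hq = torch.cat([h_t, q], dim=-1)
        pred_logits = self.pred_clf(hq)                         # predicate instantiation (cf. Eq. 2)
        type_logits = self.type_clf(hq)                         # type instantiation (cf. Eq. 3)
        h_rep = h_t.expand(u.size(0), -1)
        ent_ptr_logits = self.ptr_entity(u, h_rep).squeeze(-1)  # position scores (cf. Eq. 4)
        num_ptr_logits = self.ptr_number(u, h_rep).squeeze(-1)  # position scores (cf. Eq. 5)
        return token_logits, pred_logits, type_logits, ent_ptr_logits, num_ptr_logits
```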
3.2.3 Entity Detection and Linking

To map the pointed positions to entities in the KB, our model also detects entity mentions in the input question, as shown in the "Entity Detection" part of Figure 1.
We observe that multiple entities in a large-scale KB often share the same entity text but have different types, leading to named entity ambiguity. Therefore, we design a novel type-aware entity detection module in which the prediction is made in a joint space of IOB tags and corresponding entity types for disambiguation. Particularly, the prediction space is defined as $V^{(d)} = \{B, I\} \times \{T_1, \dots, T_{n_t}\} \cup \{O\}$, where $T_j$ stands for the $j$-th entity type label, $n_t$ denotes the number of distinct entity types in the KB, and $|V^{(d)}| = 2 n_t + 1$.
The prediction of both the entity IOB tag and the entity type is formulated as

$p^{(d)}_i = \mathrm{softmax}(\mathrm{FFN}(u_i; \theta^{(d)}))$ (6)

where $u_i$ is the contextual embedding of the $i$-th token in the question, and $p^{(d)}_i$ is the predicted distribution over $V^{(d)}$.
Given the predicted IOB tags and entity types, we take the following steps for entity linking. First, the predicted IOB tags are used to locate all entity mentions in the question. Second, an inverted index built on the KB is leveraged to find entity candidates in the KB for each entity mention. Third, the jointly predicted entity types are used to filter out candidates with unwanted types, and the remaining entity with the highest inverted index score is selected to substitute the pointer. This process is shown in the bottom part of Figure 2.
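A minimal sketch of these three linking steps is shown below, assuming a simplified inverted index that maps a mention string to (entity id, KB type, score) triples; the fallback to unfiltered candidates when no candidate matches the predicted type is our own assumption.

```python
def link_entities(tokens, joint_labels, inverted_index):
    """joint_labels[i] is either 'O' or a ('B'|'I', predicted_type) pair."""
    mentions, cur, cur_type = [], [], None
    for tok, lab in zip(tokens, joint_labels):
        if lab != "O" and lab[0] == "B":           # begin a new mention of the predicted type
            if cur:
                mentions.append((" ".join(cur), cur_type))
            cur, cur_type = [tok], lab[1]
        elif lab != "O" and lab[0] == "I" and cur:  # continue the current mention
            cur.append(tok)
        else:                                       # outside: close any open mention
            if cur:
                mentions.append((" ".join(cur), cur_type))
            cur, cur_type = [], None
    if cur:
        mentions.append((" ".join(cur), cur_type))

    linked = []
    for mention, etype in mentions:
        candidates = inverted_index.get(mention, [])       # step 2: (entity_id, kb_type, score)
        typed = [c for c in candidates if c[1] == etype]    # step 3: keep the predicted type only
        pool = typed or candidates                           # assumed fallback when the filter is empty
        if pool:
            best = max(pool, key=lambda c: c[2])             # highest inverted-index score wins
            linked.append((mention, best[0]))
    return linked
```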
3.3 Learning and Inference
Model Learning
During the training phase, we first search for gold logical forms for the questions in the training data over the KB if only weak supervision is provided. Then we conduct multi-task learning for semantic parsing and entity detection. The final loss is defined as

$\mathcal{L} = \mathcal{L}^{(sp)} + \lambda \, \mathcal{L}^{(ed)}$ (7)

where $\lambda$ is a hyperparameter for the trade-off between semantic parsing and entity detection, and $\mathcal{L}^{(sp)}$ and $\mathcal{L}^{(ed)}$ are the negative log-likelihood losses of semantic parsing and entity detection, defined as

$\mathcal{L}^{(sp)} = -\sum_{t=1}^{m} \big( \log p^{(dec)}_t[y^{(dec)}_t] + \log p^{(*)}_t[y^{(*)}_t] \big)$ (8)
$\mathcal{L}^{(ed)} = -\sum_{i=1}^{n} \log p^{(d)}_i[y^{(d)}_i]$ (9)

In the two equations above, $y^{(dec)}_t$ is the gold label for the decoding token in $V^{(dec)}$; $y^{(*)}_t$ is the gold label for the predicate, type, entity position or number position when step $t$ requires such an instantiation (the corresponding term is omitted otherwise); $p^{(dec)}$, $p^{(p)}$, $p^{(t)}$, $p^{(e)}$, $p^{(num)}$ and $p^{(d)}$ are defined in Eq. (1–6) respectively; $y^{(d)}_i$ is the gold joint IOB-type label of the $i$-th token; and $m$ denotes the decoding length.
Here, we use a single model to handle the two subtasks simultaneously, i.e., semantic parsing and entity detection. This multi-task learning framework enables each subtask to leverage supervision signals from the other, and thus improves the final performance on the KB-QA task.
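A minimal sketch of the joint objective, assuming standard cross-entropy losses and the weighting of Eq. (7); the value of the trade-off hyperparameter is a placeholder.

```python
import torch.nn.functional as F

def multitask_loss(sp_logits, sp_gold, ed_logits, ed_gold, lam=0.5):
    """Joint objective: semantic-parsing loss plus weighted entity-detection loss.

    sp_logits/sp_gold: decoder predictions and gold labels flattened over decoding steps;
    ed_logits/ed_gold: per-token joint IOB-type predictions and gold labels.
    lam is the trade-off hyperparameter (its value here is an assumption).
    """
    loss_sp = F.cross_entropy(sp_logits, sp_gold)   # negative log-likelihood, cf. Eq. (8)
    loss_ed = F.cross_entropy(ed_logits, ed_gold)   # cf. Eq. (9)
    return loss_sp + lam * loss_ed                  # cf. Eq. (7)
```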
Grammar-Guided Inference
The grammars defined in Table 1 are utilized to filter out illegal operators in each decoding step. An operator is legitimate if its left-hand semantic category in the definition is identical to the leftmost nonterminal (i.e., un-instantiated semantic category) in the incomplete logical form parsed so far. In particular, the decoding of a logical form begins with the initial semantic category. During decoding, the proposed semantic parsing model recursively rewrites the leftmost nonterminal in the logical form by 1) applying a legitimate operator for an intermediate semantic category, or 2) instantiating an entity, predicate, type or number for an entry semantic category. The decoding process terminates when no nonterminals remain.
Furthermore, beam search is incorporated to boost the performance of the proposed model during decoding. In addition, early-stage execution is performed to filter out illegal logical forms that lead to empty intermediate results.
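The sketch below illustrates grammar-guided decoding with a stack of nonterminals and legality filtering; it is greedy for brevity (beam search would instead keep the top-k partial forms), and the toy grammar and scoring function are placeholders rather than the full Table 1.

```python
# Sketch of grammar-guided decoding; GRAMMAR and score_fn are placeholders.
GRAMMAR = {
    "find":  ("set", ["e", "p"]),       # lhs semantic category, argument categories
    "count": ("num", ["set"]),
    "union": ("set", ["set", "set"]),
}
ENTRY = {"e", "p", "t", "num"}          # entry categories are instantiated directly

def decode(score_fn, root="set", max_steps=50):
    stack, logical_form = [root], []
    for _ in range(max_steps):
        if not stack:
            break                                    # no nonterminals left: parsing done
        nonterminal = stack.pop()                    # leftmost un-instantiated category
        if nonterminal in ENTRY:
            logical_form.append(score_fn(nonterminal, legal=None))  # pointer/FFN instantiation
            continue
        legal = [op for op, (lhs, _) in GRAMMAR.items() if lhs == nonterminal]
        op = score_fn(nonterminal, legal=legal)      # pick the best *legal* operator
        logical_form.append(op)
        stack.extend(reversed(GRAMMAR[op][1]))       # expand its arguments, leftmost first
    return logical_form
```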
4 Experiments
Question Type | #Example | HRED+KVmem F1 | D2A (Baseline) F1 | MaSP (Ours) F1 | Δ |
---|---|---|---|---|---|
Overall | 203k | 9.39% | 66.70% | 79.26% | +12.56% |
Clarification | 9k | 16.35% | 35.53% | 80.79% | +45.26% |
Comparative Reasoning (All) | 15k | 2.96% | 48.85% | 68.90% | +20.05% |
Logical Reasoning (All) | 22k | 8.33% | 67.31% | 69.04% | +1.73% |
Quantitative Reasoning (All) | 9k | 0.96% | 56.41% | 73.75% | +17.34% |
Simple Question (Coreferenced) | 55k | 7.26% | 57.69% | 76.47% | +18.78% |
Simple Question (Direct) | 82k | 13.64% | 78.42% | 85.18% | +6.76% |
Simple Question (Ellipsis) | 10k | 9.95% | 81.14% | 83.73% | +2.59% |
Question Type | #Example | HRED+KVmem Accuracy | D2A (Baseline) Accuracy | MaSP (Ours) Accuracy | Δ |
Verification (Boolean) | 27k | 21.04% | 45.05% | 60.63% | +15.58% |
Quantitative Reasoning (Count) | 24k | 12.13% | 40.94% | 43.39% | +2.45% |
Comparative Reasoning (Count) | 15k | 8.67% | 17.78% | 22.26% | +4.48% |
4.1 Experimental Settings
Dataset
We evaluated the proposed approach on the Complex Sequential Question Answering (CSQA) dataset (https://amritasaha1812.github.io/CSQA) Saha et al. (2018), which is the largest dataset for conversational question answering over a large-scale KB. It consists of about 1.6M question-answer pairs in 200K dialogs, where 152K/16K/28K dialogs are used for train/dev/test. Questions are classified into different types, e.g., simple, comparative reasoning, and logical reasoning questions. Its KB is built on Wikidata (https://www.wikidata.org) in the form of (subject, predicate, object) triples, and consists of 21.2M triplets over 12.8M entities, 3,054 distinct entity types, and 567 distinct predicates.
Training Setups
We leveraged a BFS method to search for valid logical forms for the questions in the training data, with the BFS buffer size set to 1000. The embedding and hidden sizes in the model are set to the same value, no pretrained embeddings are loaded for initialization, and the positional encodings are randomly initialized and learnable. The activation function inside the multi-head attention blocks is GELU Hendrycks and Gimpel (2016). We used Adam Kingma and Ba (2015) to optimize the loss function defined in Eq. (7), and employed learning rate warmup during the initial training steps followed by linear decay for the rest. The source codes are available at https://github.com/taoshen58/MaSP. More details of our implementation are described in Appendix A.
Evaluation Metrics
We used the same evaluation metrics as Saha et al. (2018) and Guo et al. (2018): F1 score (i.e., precision and recall) is used to evaluate questions whose answers consist of entities, and accuracy is used for questions whose answer type is boolean or number.
Baselines
There are few works targeting conversational question answering over a large-scale knowledge base. HRED+KVmem Saha et al. (2018) and D2A Guo et al. (2018) are two representative approaches, and we compared our proposed approach against them. Particularly, HRED+KVmem is a memory network Sukhbaatar et al. (2015); Li et al. (2017) based seq2seq model, which combines the HRED model Serban et al. (2016) with a key-value memory network Miller et al. (2016). D2A is a memory-augmented neural symbolic model for semantic parsing in KB-QA, which introduces a dialog memory manager to handle ellipsis and coreference problems in conversations. (The overall score of D2A reported in this paper is superior to that in the original paper since our re-implemented grammars for CSQA achieve a better balance between the simple and non-simple question types; for a rational and fair comparison, we report re-run results for D2A.)
4.2 Model Comparisons
We compared our approach (denoted as MaSP) with HRED+KVmem and D2A in Table 2; more details of the comparisons are listed in Appendix B. As shown in the table, the semantic parsing based D2A significantly outperforms the memory network based text generation approach (HRED+KVmem) and thus poses a strong baseline. Further, our proposed approach (MaSP) achieves a new state-of-the-art performance, where the overall F1 score is improved by 12%. Besides, the improvement is consistent across all question types, ranging from 2% to 45%.
There are two possible reasons for this significant improvement. First, our approach predicts entities more accurately: the accuracy of entities in the final logical forms increases from 55% to 72% compared with D2A. Second, the proposed pointer-equipped logical form decoder in the multi-task learning framework handles coreference better. For instance, given a user question with history, "What is the parent organization of that one? // Did you mean Polydor Records? // No, I meant Deram Records. Could you tell me the answer for that?", which involves coreference, D2A produces "(find {Polydor Records}, owned by)" whereas our approach produces "(find {Deram Records}, owned by)". This also explains the substantial improvements on Simple Question (Coreferenced) and Clarification (in CSQA, the performance on Clarification closely depends on the F1 score of the next question, 88% of which belong to Simple Question (Coreferenced)).
We also observed that the improvement of MaSP over D2A for some question types is relatively small, e.g., 1.73% for logical reasoning questions. A possible reason is that more than one entity is usually needed to compose the correct logical form for logical reasoning questions, and our current model is too shallow to parse the multiple entities. Hence, we adopted a deeper model that employs BERT Devlin et al. (2018) as the encoder (see §4.4), and found that the performance on logical reasoning questions is improved by 10% compared to D2A.
4.3 Ablation Study
Methods | Ours | w/o ET | w/o Multi | w/o Both |
---|---|---|---|---|
Question Type | F1 | F1 | F1 | F1 |
Overall | 79.26% | 70.42% | 76.73% | 68.22% |
Clarification | 80.79% | 68.01% | 66.30% | 54.64% |
Comparative | 68.90% | 66.35% | 61.12% | 58.04% |
Logical | 69.04% | 62.63% | 67.81% | 62.51% |
Quantitative | 73.75% | 73.75% | 64.56% | 64.55% |
Simple (Co-ref) | 76.47% | 64.94% | 74.35% | 63.15% |
Simple (Direct) | 85.18% | 75.24% | 84.93% | 75.19% |
Simple (Ellipsis) | 83.73% | 78.45% | 82.66% | 77.44% |
Question Type | Accu | Accu | Accu | Accu |
Verification | 60.63% | 45.40% | 60.43% | 45.02% |
Quantitative | 43.39% | 39.70% | 37.84% | 43.39% |
Comparative | 22.26% | 19.08% | 18.24% | 22.26% |
Two aspects lead to the performance improvement: predicting the entity type in entity detection to filter candidates, and the multi-task learning framework. We conducted an ablation study (Table 3) for an in-depth understanding of their effects.
Effect of Entity Type Prediction (w/o ET)

First, the entity type prediction was removed from the entity detection task, which results in a 9% drop in overall F1 score. We argue that the performance of the KB-QA task is in line with that of entity linking. Hence, we separately evaluated the entity linking task on the test set. As illustrated in Figure 3, both precision and recall of entity linking drop significantly without filtering the entity linking results w.r.t. the predicted entity type, which verifies our hypothesis.
Effect of Multi-Task Learning (w/o Multi)
Accuracy | Ours | w/o Multi |
---|---|---|
Entity pointer | 79.8% | 79.3% |
Predicate | 96.9% | 96.3% |
Type | 86.8% | 84.1% |
Number | 89.1% | 88.3% |
Operators | 79.4% | 78.7% |
Second, to measure the effect of multi-task learning, we evaluated the KB-QA task when the two subtasks, i.e., pointer-equipped semantic parsing and entity detection, are learned separately. As shown in Table 3, the F1 score for every question type consistently drops, in the range of 3% to 14%, compared with multi-task learning. We further evaluated the effect of multi-task learning on each subtask. As shown in Table 4, the accuracy of each component of the pointer-equipped logical form drops with separate learning. Meanwhile, we found a 0.1% F1 score reduction (99.4% vs. 99.5%) on the entity detection subtask compared to the model without multi-task learning, which poses only a negligible effect on the downstream task. To sum up, the multi-task learning framework increases the accuracy of pointer-based logical form generation while keeping a satisfactory performance on entity detection, and consequently improves the final question answering performance.
Note that, when the entity type filter is removed and the two subtasks are learned separately (i.e., w/o Both in Table 3), the proposed framework degenerates to a model similar to the Coarse-to-Fine semantic parsing model, another state-of-the-art KB-QA model over small-scale KBs Dong and Lapata (2018). Therefore, the 11% improvement in F1 score also verifies the advantage of our proposed framework.
4.4 Model Setting Analysis
As introduced in §4.1 and evaluated in §4.2, the proposed framework is built on a relatively shallow neural network, i.e., stacked two-layer multi-head attention, which might limit its representational ability. Hence, in this section, we further explore the performance of the proposed framework by applying more sophisticated strategies.
Methods | Vanilla | w/ BERT | w/ Large Beam |
---|---|---|---|
Question Type | F1 | F1 | F1 |
Overall | 79.26% | 80.60% | 81.55% |
Clarification | 80.79% | 79.46% | 83.37% |
Comparative | 68.90% | 65.99% | 69.34% |
Logical | 69.04% | 77.53% | 69.41% |
Quantitative | 73.75% | 70.43% | 73.75% |
Simple (Co-ref) | 76.47% | 77.95% | 79.03% |
Simple (Direct) | 85.18% | 86.40% | 88.28% |
Simple (Ellipsis) | 83.73% | 84.82% | 86.96% |
Question Type | Accuracy | Accuracy | Accuracy |
Verification | 60.63% | 63.85% | 61.96% |
Quantitative | 43.39% | 47.14% | 44.22% |
Comparative | 22.26% | 25.28% | 22.70% |
As shown in Table 5, we first replaced the encoder with the pre-trained BERT base model Devlin et al. (2018) and fine-tuned its parameters during the training phase, which results in a 1.3% F1 score improvement over the vanilla model. Second, we increased the beam size from 4 to 8 during decoding in the inference phase under the standard settings, which leads to a 2.3% F1 score increase.
4.5 Error Analysis
We randomly sampled 100 examples with wrong logical forms or incorrect answers to conduct an error analysis, and found that the errors mainly fall into the following categories.
Entity Ambiguity
Leveraging the entity type as a filter in entity linking significantly reduces errors caused by entity ambiguity, but different entities with the same text may still belong to the same type due to the coarse granularity of entity types, which renders the filter ineffective. For example, it is difficult to distinguish between two persons who are both named Bill Woods.
Wrongly Predicted Logical Form
The predicted components (e.g., operators, predicates and types) composing the logical form may be inaccurate, leading to a wrong answer to the question or an unexecutable logical form.
Spurious Logical Form
We used a BFS method to search for gold logical forms for the questions in the training set, which inevitably generates spurious logical forms (incorrect ones that coincidentally lead to correct answers) as training signals. Take the question "Which sexes do King Harold, Queen Lillian and Arthur Pendragon possess" as an example: a spurious logical form retrieves only the genders of "King Harold" and "Queen Lillian", yet still yields the correct answer to the question. Spurious logical forms introduce noise into the training data and thus negatively affect the performance of KB-QA.
5 Related Work
Our work is aligned with the semantic parsing based approach to KB-QA. Traditional semantic parsing systems typically learn a lexicon-based parser and a scoring model to construct a logical form given a natural language question Zettlemoyer and Collins (2007); Wong and Mooney (2007); Zettlemoyer and Collins (2009); Kwiatkowski et al. (2011); Andreas et al. (2013); Artzi and Zettlemoyer (2013); Zhao and Huang (2014); Long et al. (2016). For example, Zettlemoyer and Collins (2009) and Artzi and Zettlemoyer (2013) learn a CCG parser, and Long et al. (2016) develop a shift-reduce parser to construct logical forms.
Neural semantic parsing approaches have been gaining increasing attention in recent years, eschewing the need for extensive feature engineering Jia and Liang (2016); Ling et al. (2016); Xiao et al. (2016). Some efforts have been made to utilize the syntax of logical forms Rabinovich et al. (2017); Krishnamurthy et al. (2017); Cheng et al. (2017); Yin and Neubig (2017). For example, Dong and Lapata (2016) and Alvarez-Melis and Jaakkola (2017) leverage an attention-based encoder-decoder framework to translate a natural language question into a tree-structured logical form.
Recently, to handle the huge entity vocabulary of a large-scale knowledge base, many works take a stepwise approach. For example, Liang et al. (2016), Dong and Lapata (2016), and Guo et al. (2018) first process questions with a named entity linking system to find entity candidates, and then learn a model to map a question to a logical form based on the candidates. Dong and Lapata (2018) decompose the task into two stages: first, a sketch of the logical form is predicted, and then the full logical form is generated by considering both the question and the predicted sketch.
Our proposed framework also decomposes the task into multiple subtasks but differs from existing works in several aspects. First, inspired by the pointer network Vinyals et al. (2015), we replace entities in a logical form with the starting positions of their mentions in the question, which can naturally handle the coreference problem in conversations. Second, the proposed pointer-based semantic parsing model can be intrinsically extended to be jointly learned with entity detection so as to fully leverage all supervision signals. Third, we alleviate the entity ambiguity problem in the entity detection & linking subtask by incorporating entity type prediction into entity mention IOB labeling to filter out entities with unwanted types.
6 Conclusion
We studied the problem of conversational question answering over a large-scale knowledge base, and proposed a multi-task learning framework that learns type-aware entity detection and pointer-equipped logical form generation simultaneously. The multi-task learning framework takes full advantage of the supervision from all subtasks, and consequently increases the performance on the final KB-QA task. Experimental results on a large-scale dataset verify the effectiveness of the proposed framework. In the future, we will test our proposed framework on more datasets and investigate potential approaches to handling spurious logical forms in weakly supervised KB-QA.
Acknowledgments
We acknowledge the support of NVIDIA Corporation and MakeMagic Australia with the donation of GPUs for our research group at University of Technology Sydney. And we also thank anonymous reviewers for their insightful and constructive suggestions.
References
- Alvarez-Melis and Jaakkola (2017) David Alvarez-Melis and Tommi S Jaakkola. 2017. Tree-structured decoding with doubly-recurrent neural networks. In ICLR.
- Andreas et al. (2013) Jacob Andreas, Andreas Vlachos, and Stephen Clark. 2013. Semantic parsing as machine translation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 47–52.
- Artzi and Zettlemoyer (2013) Yoav Artzi and Luke Zettlemoyer. 2013. Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association for Computational Linguistics, 1:49–62.
- Auer et al. (2007) Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. DBpedia: A nucleus for a web of open data. In The Semantic Web, pages 722–735. Springer.
- Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.
- Bollacker et al. (2008) Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 1247–1250. AcM.
- Cheng et al. (2017) Jianpeng Cheng, Siva Reddy, Vijay Saraswat, and Mirella Lapata. 2017. Learning structured natural language representations for semantic parsing. arXiv preprint arXiv:1704.08387.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL.
- Dong and Lapata (2016) Li Dong and Mirella Lapata. 2016. Language to logical form with neural attention. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 33–43, Berlin, Germany. Association for Computational Linguistics.
- Dong and Lapata (2018) Li Dong and Mirella Lapata. 2018. Coarse-to-fine decoding for neural semantic parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 731–742. Association for Computational Linguistics.
- Guo et al. (2018) Daya Guo, Duyu Tang, Nan Duan, Ming Zhou, and Jian Yin. 2018. Dialog-to-action: Conversational question answering over a large-scale knowledge base. In Advances in Neural Information Processing Systems, pages 2946–2955.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR.
- Hendrycks and Gimpel (2016) Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415.
- Jia and Liang (2016) R. Jia and P. Liang. 2016. Data recombination for neural semantic parsing. In Association for Computational Linguistics (ACL).
- Kingma and Ba (2015) Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR.
- Krishnamurthy et al. (2017) Jayant Krishnamurthy, Pradeep Dasigi, and Matt Gardner. 2017. Neural semantic parsing with type constraints for semi-structured tables. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1516–1526.
- Kwiatkowski et al. (2011) Tom Kwiatkowski, Luke Zettlemoyer, Sharon Goldwater, and Mark Steedman. 2011. Lexical generalization in CCG grammar induction for semantic parsing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1512–1523. Association for Computational Linguistics.
- Lei Ba et al. (2016) Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.
- Li et al. (2017) Zheng Li, Yun Zhang, Ying Wei, Yuxiang Wu, and Qiang Yang. 2017. End-to-end adversarial memory network for cross-domain sentiment classification. In IJCAI, pages 2237–2243.
- Liang et al. (2016) Chen Liang, Jonathan Berant, Quoc Le, Kenneth D Forbus, and Ni Lao. 2016. Neural symbolic machines: Learning semantic parsers on freebase with weak supervision. arXiv preprint arXiv:1611.00020.
- Ling et al. (2016) Wang Ling, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiskỳ, Andrew Senior, Fumin Wang, and Phil Blunsom. 2016. Latent predictor networks for code generation. arXiv preprint arXiv:1603.06744.
- Long et al. (2016) Reginald Long, Panupong Pasupat, and Percy Liang. 2016. Simpler context-dependent logical forms via model projections. arXiv preprint arXiv:1606.05378.
- Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS.
- Miller et al. (2016) Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. 2016. Key-value memory networks for directly reading documents. arXiv preprint arXiv:1606.03126.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP.
- Rabinovich et al. (2017) Maxim Rabinovich, Mitchell Stern, and Dan Klein. 2017. Abstract syntax networks for code generation and semantic parsing. arXiv preprint arXiv:1704.07535.
- Reddy et al. (2014) Siva Reddy, Mirella Lapata, and Mark Steedman. 2014. Large-scale semantic parsing without question-answer pairs. Transactions of the Association for Computational Linguistics, 2:377–392.
- Saha et al. (2018) Amrita Saha, Vardaan Pahuja, Mitesh M Khapra, Karthik Sankaranarayanan, and Sarath Chandar. 2018. Complex sequential question answering: Towards learning to converse over linked question answer pairs with a knowledge graph. In Thirty-Second AAAI Conference on Artificial Intelligence.
- Seo et al. (2017) Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. In ICLR.
- Serban et al. (2016) Iulian V Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In Thirtieth AAAI Conference on Artificial Intelligence.
- Sukhbaatar et al. (2015) Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. 2015. End-to-end memory networks. In NIPS.
- Vaswani et al. (2017) Ashish Vaswani, Shazeer, Noam, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS.
- Vinyals et al. (2015) Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In Advances in Neural Information Processing Systems, pages 2692–2700.
- Wong and Mooney (2007) Yuk Wah Wong and Raymond Mooney. 2007. Learning synchronous grammars for semantic parsing with lambda calculus. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 960–967.
- Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
- Xiao et al. (2016) Chunyang Xiao, Marc Dymetman, and Claire Gardent. 2016. Sequence-based structured prediction for semantic parsing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1341–1350.
- Xu et al. (2016) Kun Xu, Siva Reddy, Yansong Feng, Songfang Huang, and Dongyan Zhao. 2016. Question answering on freebase via relation extraction and textual evidence. arXiv preprint arXiv:1603.00957.
- Yih et al. (2015) Scott Wen-tau Yih, Ming-Wei Chang, Xiaodong He, and Jianfeng Gao. 2015. Semantic parsing via staged query graph generation: Question answering with knowledge base. In ACL-IJCNLP.
- Yin and Neubig (2017) Pengcheng Yin and Graham Neubig. 2017. A syntactic neural model for general-purpose code generation. arXiv preprint arXiv:1704.01696.
- Zettlemoyer and Collins (2007) Luke Zettlemoyer and Michael Collins. 2007. Online learning of relaxed ccg grammars for parsing to logical form. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).
- Zettlemoyer and Collins (2009) Luke S Zettlemoyer and Michael Collins. 2009. Learning context-dependent mappings from sentences to logical form. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2-Volume 2, pages 976–984. Association for Computational Linguistics.
- Zhao and Huang (2014) Kai Zhao and Liang Huang. 2014. Type-driven incremental semantic parsing with polymorphism. arXiv preprint arXiv:1411.5379.
Appendix A Model Details
A.1 Word Embedding
Given a user question sentence, a tokenization method (e.g., a punctuation-based or WordPiece tokenizer Wu et al. (2016)) is applied to the sentence to obtain a list of tokens, each represented as a one-hot vector whose dimension equals the number of distinct tokens in the vocabulary. Note that a special token is appended to the tokenized sentence, corresponding to the token [CTX]. Then, randomly initialized or pre-trained Mikolov et al. (2013); Pennington et al. (2014) embeddings are used to transform the discrete tokens into a sequence of low-dimensional distributed embeddings $X \in \mathbb{R}^{n \times d}$, where $d$ is the embedding size and $n$ is the question length. This process is formulated as $X = O W^{(e)}$, where $O$ stacks the one-hot token vectors and $W^{(e)}$ is the trainable word embedding weight matrix.
A.2 Pointer-equipped Semantic Parsing
A.2.1 Encoder of Seq2seq Model
To model contextual dependencies between tokens and generate context-aware representations, we leverage a stacked two-layer multi-head attention mechanism with additive positional encoding Vaswani et al. (2017). The stacking scheme is identical to that in Vaswani et al. (2017): a two-layer feed-forward network with an activation function (FFN) follows each multi-head attention, and residual connections He et al. (2016) with layer normalization Lei Ba et al. (2016) are applied. This process is briefly denoted as

$H = X + \mathrm{PE}$ (10)
$\tilde{U} = \mathrm{LN}(H + \mathrm{MultiHead}(H, H, H))$ (11)
$U = \mathrm{LN}(\tilde{U} + \mathrm{FFN}(\tilde{U}))$ (12)

where $U$ is the sequence of contextual embeddings, $\mathrm{PE}$ denotes the learnable positional encoding weights, the three arguments of $\mathrm{MultiHead}(\cdot,\cdot,\cdot)$ are the value, key and query of an attention mechanism, and the two layers are stacked by repeating Eq. (11–12).
A.2.2 Decoder of Seq2seq Model
Similar to the token embedding in the encoder (§A.1), we embed the $t$-th decoder input token via a randomly initialized embedding weight matrix, and use the embedded sequence of all tokens in the gold logical form sketch as the decoder input $D$, where $m$ denotes the length of the gold sketch.
The basic structure of the proposed logical form decoder is the same as that in the original Transformer Vaswani et al. (2017), except that only two stacked layers are used here. Each layer of the decoder is composed, bottom-up, of self-attention with a forward mask, cross attention between the decoder and the encoder, and an FFN, which we briefly formulate as

$\tilde{H} = \mathrm{LN}(D + \mathrm{MaskedMultiHead}(D, D, D))$ (13)
$H = \mathrm{LN}(\tilde{H} + \mathrm{FFN}(\mathrm{MultiHead}(U, U, \tilde{H})))$ (14)

where $H$ is the resulting sequence of decoding hidden states.
A.3 Multi-task Learning
We propose to employ a multi-task learning strategy to learn an entity detection (ED) model jointly with the pointer-equipped semantic parsing model, because the supervision information from ED, i.e., IOB tagging, provides the spans of all entities in the input question, which results in better performance than separate learning.
The reasons why we use multi-task learning to jointly learn the semantic parsing model and ED, rather than directly equipping the semantic parsing model with span prediction Seo et al. (2017), are that 1) the supervision from entities that do not appear in the gold logical form but do appear in the question would be lost; 2) a deeper network is required to predict the end index of a target span, as shown in Seo et al. (2017); and 3) a well-trained entity detection model can correct the pointer even when it deviates slightly during the inference phase, whereas a span-based model usually leads to error accumulation.
A.4 Inverted Index
For each entity text in Wikidata, we enumerated its substrings whose lengths are not less than the length of the full text minus a threshold, and then separately calculated the Levenshtein distance between the full text and each substring as a score for the mapping from the substring to the corresponding full text. Since multiple entities can generate identical substrings, we kept the mappings with the largest scores and used them to build a dictionary for future queries.
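The sketch below illustrates the described substring-to-entity map; the character-level substring enumeration, the use of difflib's similarity ratio as a stand-in for a Levenshtein-based score, and keeping only the single best entity per substring are simplifying assumptions.

```python
from difflib import SequenceMatcher  # stand-in for an edit-distance based score

def build_inverted_index(entity_texts, max_trim=3):
    """Map substrings (full text shortened by at most `max_trim` chars) to scored entity texts."""
    index = {}
    for text in entity_texts:
        min_len = max(1, len(text) - max_trim)
        for length in range(min_len, len(text) + 1):
            for start in range(0, len(text) - length + 1):
                sub = text[start:start + length]
                score = SequenceMatcher(None, sub, text).ratio()  # higher = closer to the full text
                # keep only the best-scoring full text per substring (a simplification)
                if score > index.get(sub, ("", 0.0))[1]:
                    index[sub] = (text, score)
    return index

# Usage: look up an entity mention detected in the question.
index = build_inverted_index(["Deram Records", "Polydor Records"])
print(index.get("Deram Record"))   # -> ("Deram Records", <score>)
```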
Appendix B Supplemental Experiment Results
B.1 Precision and Recall for Main Paper
Since we report only the F1 score in the main paper for brevity, in this section we report the corresponding recall and precision in detail: 1) Table 12 presents the results of the proposed model compared with the baselines; 2) Table 13 presents the ablation study; and 3) Table 14 provides the performance comparison after the more sophisticated strategies are applied.
B.2 Comparison to D2A
Question Type | D2A | Ours |
---|---|---|
Simple Question (Direct) | 2.6 | 1.5 |
Clarification | 2.7 | 1.4 |
Simple Question (Coreferenced) | 2.7 | 1.4 |
Quantitative Reasoning (Count) (All) | 2.9 | 1.5 |
Logical Reasoning (All) | 2.7 | 1.6 |
Simple Question (Ellipsis) | 2.6 | 1.6 |
Verification (Boolean) (All) | 2.8 | 1.4 |
Quantitative Reasoning (All) | 2.7 | 1.4 |
Comparative Reasoning (Count) (All) | 2.8 | 1.4 |
Comparative Reasoning (All) | 3.0 | 1.4 |
Overall | 2.9 | 1.5 |
To further demonstrate that the proposed model is superior to the previous D2A model in terms of entity linking and logical form generation, we conduct the following comparisons.
First, as shown in Table 6, the average number of entity candidates in the test set produced by the entity linking of the proposed model is smaller than that of D2A, which means the proposed approach provides the downstream subtask with more precise entity linking results.
Question Type | D2A | Ours |
---|---|---|
Simple Question (Direct) | 0.8960 | 0.9520 |
Clarification | 0.8281 | 0.9323 |
Simple Question (Coreferenced) | 0.8177 | 0.8952 |
Quantitative Reasoning (Count) (All) | 0.8385 | 0.9581 |
Logical Reasoning (All) | 0.8726 | 0.9791 |
Simple Question (Ellipsis) | 0.9364 | 0.9474 |
Verification (Boolean) (All) | 0.7448 | 0.9637 |
Quantitative Reasoning (All) | 0.9304 | 0.9832 |
Comparative Reasoning (Count) (All) | 0.8165 | 0.9863 |
Comparative Reasoning (All) | 0.8312 | 0.9727 |
Overall | 0.8499 | 0.9475 |
Second, we compare the proposed model with D2A in terms of logical form generation, where a logical form can be empty due to timeouts or illegal logical forms during beam search. As demonstrated in Table 7, the proposed model yields a lower ratio of empty logical forms than D2A.
Question Type | D2A | Ours | +BERT |
---|---|---|---|
Simple Question (Direct) | 0.7967 | 0.8519 | 0.8664 |
Clarification | 0.2385 | 0.6408 | 0.6414 |
Simple Question (Coreferenced) | 0.5341 | 0.7234 | 0.7469 |
Quantitative Reasoning (Count) (All) | 0.5000 | 0.6947 | 0.7004 |
Logical Reasoning (All) | 0.3692 | 0.0791 | 0.3196 |
Simple Question (Ellipsis) | 0.7533 | 0.8843 | 0.8878 |
Verification (Boolean) (All) | 0.1757 | 0.5278 | 0.5854 |
Quantitative Reasoning (All) | 0.8913 | 0.9792 | 0.9911 |
Comparative Reasoning (Count) (All) | 0.3235 | 0.8924 | 0.9121 |
Comparative Reasoning (All) | 0.2483 | 0.9053 | 0.9242 |
Overall | 0.5522 | 0.7167 | 0.7546 |
Third, we list in Table 8 the accuracies of the entities appearing in the predicted logical forms for D2A, our standard approach, and the BERT-based model, which verifies that the proposed approach significantly improves the performance of entity linking during entity detection and of entity prediction during logical form generation. Note that the analysis of the performance reduction on Logical Reasoning (All) is elaborated in the main paper.
B.3 Multi-task Learning
The multi-task learning framework increases the accuracy of logical form generation while maintaining satisfactory performance on entity detection, and consequently improves the final question answering task via logical form execution. In this section, we list in detail all metrics measuring the performance on both subtasks for our approach with and without multi-task learning. To evaluate logical form generation, we also apply the BFS method to the test set to obtain gold logical forms (which inevitably include spurious ones).
Question Type | Ours | w/o Multi |
---|---|---|
Comparative Reasoning (All) | 0.1885 | 0.1885 |
Logical Reasoning (All) | 0.6256 | 0.6188 |
Quantitative Reasoning (All) | 0.6403 | 0.6188 |
Simple Question (Coreferenced) | 0.8721 | 0.8663 |
Simple Question (Direct) | 0.8772 | 0.8715 |
Simple Question (Ellipsis) | 0.9073 | 0.9034 |
Comparative Reasoning (Count) (All) | 0.1601 | 0.1495 |
Quantitative Reasoning (Count) (All) | 0.5711 | 0.5564 |
Verification (Boolean) (All) | 0.7638 | 0.7565 |
Overall | 0.7940 | 0.7872 |
Task | Metric | Ours | w/o Multi |
---|---|---|---|
IOB Tagging | Accuracy | 0.9967 | 0.9975 |
 | F1 Score | 0.9941 | 0.9955 |
 | Precision | 0.9960 | 0.9972 |
 | Recall | 0.9923 | 0.9938 |
Entity Type | Accuracy | 0.9822 | 0.9844 |
 | F1 Score | 0.9674 | 0.9717 |
 | Precision | 0.9958 | 0.9971 |
 | Recall | 0.9407 | 0.9475 |
As shown in Tables 9 and 10, the model with multi-task learning outperforms the model without it in terms of logical form generation by the semantic parsing model. Although a 0.002 performance reduction is observed on the entity detection subtask, entity detection and linking remain accurate enough for the downstream task, and thus this has only a very minor effect on the overall KB-QA performance.
B.4 BFS Success Ratio
Question Type | #Example | Ratio |
---|---|---|
Simple Question (Direct) | 274527 | 0.96 |
Simple Question (Ellipsis) | 34549 | 0.97 |
Quantitative Reasoning (All) | 58976 | 0.46 |
Quantitative Reasoning (Count) (All) | 114074 | 0.67 |
Logical Reasoning (All) | 66161 | 0.61 |
Simple Question (Coreferenced) | 173765 | 0.86 |
Verification (Boolean) (All) | 77167 | 0.75 |
Comparative Reasoning (Count) (All) | 59557 | 0.37 |
Comparative Reasoning (All) | 57343 | 0.32 |
Given the final answer to a question as well as the gold entities, predicates and types, we conduct a BFS to search for the gold logical form, which may fail due to the limited time and buffer. We list the success ratio of BFS on the CSQA training data in Table 11.
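A sketch of this search procedure is given below; the expansion, completeness, and execution functions are placeholders, and the way the buffer limit of 1000 is enforced is a simplifying assumption.

```python
from collections import deque

def bfs_search_logical_form(gold_answer, expand_fn, complete_fn, execute_fn,
                            buffer_size=1000):
    """Breadth-first search for a logical form whose execution matches the gold answer.

    expand_fn(form) yields grammar-legal one-step extensions of a partial form;
    complete_fn(form) tells whether no nonterminals remain; execute_fn(form) runs a
    complete form against the KB. All three are placeholders in this sketch.
    """
    queue = deque([()])                          # start from the empty (root) form
    while queue:
        form = queue.popleft()
        for candidate in expand_fn(form):
            if complete_fn(candidate):
                if execute_fn(candidate) == gold_answer:
                    return candidate             # first hit is kept (it may be spurious)
            elif len(queue) < buffer_size:       # the limited buffer bounds the search
                queue.append(candidate)
    return None                                  # search failure (counted in Table 11)
```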
Appendix C Supplemental Analysis
We also observe that the improvement of MaSP over D2A for some question types is relatively small, especially for logical reasoning questions. For logical reasoning, we find that the accuracy of entities in the final logical forms is only 8%, and usually two distinct entities are needed to produce a correct logical form. This suggests that the presented shallow network, i.e., two-layer multi-head attention, cannot handle such complex cases. We study a case here for better understanding. Given "Which diseases are a sign of lead poisoning or pentachlorophenol exposure?", D2A produces "(union (find {lead poisoning}, symptoms), (pe…ol exposure))", where the entities are correct but an operator is wrong; our approach produces "(union (find {pe…ol exposure}, symptoms), (union (find {pe…ol exposure}, symptoms))", where the entities are wrong; while our approach with BERT Devlin et al. (2018) as the encoder produces the correct logical form "(union (find {pe…ol exposure}, symptoms), (union (find {lead poisoning}, symptoms))".
Methods | HRED+KVmem | D2A (Baseline) | Our Approach | ||||
---|---|---|---|---|---|---|---|
Question Type | #Example | Recall | Precision | Recall | Precision | Recall | Precision |
Overall | - | 18.40% | 6.30% | 66.83% | 66.57% | 78.07% | 80.48% |
Clarification | 12k | 25.09% | 12.13% | 37.24% | 33.97% | 84.18% | 77.66% |
Comparative Reasoning (All) | 15k | 2.11% | 4.97% | 44.14% | 54.68% | 59.83% | 81.20% |
Logical Reasoning (All) | 22k | 15.11% | 5.75% | 65.82% | 68.86% | 61.92% | 78.00% |
Quantitative Reasoning (All) | 9k | 0.91% | 1.01% | 52.74% | 60.63% | 69.14% | 79.02% |
Simple Question (Coreferenced) | 55k | 12.67% | 5.09% | 58.47% | 56.94% | 76.94% | 76.01% |
Simple Question (Direct) | 82k | 33.30% | 8.58% | 79.50% | 77.37% | 86.09% | 84.29% |
Simple Question (Ellipsis) | 10k | 17.30% | 6.98% | 84.67% | 77.90% | 85.50% | 82.03% |
Question Type | #Example | Accuracy | Accuracy | Accuracy | |||
Verification (Boolean) | 27k | 21.04% | 45.05% | 60.63% | |||
Quantitative Reasoning (Count) | 24k | 12.13% | 40.94% | 43.39% | |||
Comparative Reasoning (Count) | 15k | 8.67% | 17.78% | 22.26% |
Methods | Our Approach | w/o ET | w/o Multi | w/o Both | ||||
---|---|---|---|---|---|---|---|---|
Question Type | Recall | Precision | Recall | Precision | Recall | Precision | Recall | Precision |
Overall | 78.07% | 80.48% | 68.78% | 72.15% | 75.75% | 77.73% | 66.75% | 69.75% |
Clarification | 84.18% | 77.66% | 69.79% | 66.32% | 70.12% | 62.88% | 56.96% | 52.51% |
Comparative Reasoning (All) | 59.83% | 81.20% | 57.48% | 78.45% | 53.62% | 71.06% | 50.86% | 67.59% |
Logical Reasoning (All) | 61.92% | 78.00% | 54.43% | 73.73% | 61.04% | 76.27% | 54.16% | 73.91% |
Quantitative Reasoning (All) | 69.14% | 79.02% | 69.14% | 79.02% | 60.86% | 68.73% | 60.86% | 68.72% |
Simple Question (Coreferenced) | 76.94% | 76.01 | 64.92% | 64.96% | 74.65% | 74.06% | 63.06% | 63.24% |
Simple Question (Direct) | 86.09% | 84.29% | 75.87% | 74.62% | 85.88% | 84.01% | 75.84% | 74.56% |
Simple Question (Ellipsis) | 85.50% | 82.03% | 80.12% | 76.85% | 84.28% | 81.11% | 78.96% | 75.97% |
Question Type | Accuracy | Accuracy | Accuracy | Accuracy | ||||
Verification (Boolean) | 60.63% | 45.40% | 60.43% | 45.02% | ||||
Quantitative Reasoning (Count) | 43.39% | 39.70% | 37.84% | 43.39% | ||||
Comparative Reasoning (Count) | 22.26% | 19.08% | 18.24% | 22.26% |
Methods | Vanilla | w/ BERT | Larger Beam Size | |||
---|---|---|---|---|---|---|
Question Type | Recall | Precision | Recall | Precision | Recall | Precision |
Overall | 78.07% | 80.48% | 79.67% | 81.56% | 80.39% | 82.75% |
Clarification | 84.18% | 77.66% | 83.24% | 76.01% | 86.90% | 80.11% |
Comparative Reasoning (All) | 59.83% | 81.20% | 58.79% | 75.21% | 60.25% | 81.67% |
Logical Reasoning (All) | 61.92% | 78.00% | 72.56% | 83.24% | 62.16% | 78.58% |
Quantitative Reasoning (All) | 69.14% | 79.02% | 66.91% | 74.35% | 69.14% | 79.02% |
Simple Question (Coreferenced) | 76.94% | 76.01% | 78.05% | 77.85% | 79.54% | 78.52% |
Simple Question (Direct) | 86.09% | 84.29% | 86.84% | 85.96% | 89.26% | 87.33% |
Simple Question (Ellipsis) | 85.50% | 82.03% | 86.38% | 83.32% | 88.78% | 85.22% |
Question Type | Accuracy | Accuracy | Accuracy | |||
Verification (Boolean) | 60.63% | 63.85% | 61.96% | |||
Quantitative Reasoning (Count) | 43.39% | 47.14% | 44.22% | |||
Comparative Reasoning (Count) | 22.26% | 25.28% | 22.70% |