Effective Search of Logical Forms for Weakly Supervised Knowledge-Based Question Answering

09/06/2019 · Tao Shen, et al. · University of Technology Sydney · Microsoft

Many algorithms for Knowledge-Based Question Answering (KBQA) depend on semantic parsing, which translates a question into its logical form. When only weak supervision is provided, it is usually necessary to search for valid logical forms for model training. However, a complex question typically involves a huge search space, which creates two main problems: 1) the solutions limited by computation time and memory usually reduce the success rate of the search, and 2) spurious logical forms in the search results degrade the quality of training data. These two problems lead to a poorly-trained semantic parsing model. In this work, we propose an effective search method for weakly supervised KBQA based on operator prediction for questions. With the search space constrained by predicted operators, sufficient search paths can be explored, more valid logical forms can be derived, and operators possibly causing spurious logical forms can be avoided. As a result, a larger proportion of questions in a weakly supervised training set are equipped with logical forms, and fewer spurious logical forms are generated. Such high-quality training data directly contributes to a better semantic parsing model. Experimental results on one of the largest KBQA datasets (i.e., CSQA) verify the effectiveness of our approach: improving the precision from 67% to 72%.


1 Introduction

Knowledge-based question answering (KBQA) interacts with a knowledge base (KB) to derive a correct answer for a factoid question. Many top-performing approaches to KBQA are based on a semantic parsing framework, that is, translating a natural language question into its corresponding logical form according to pre-defined grammars [artzi2013weakly, vlachos2014new, suhr2018learning]. For example, “how many people have birthplace at Provence” has the corresponding logical form Count(Find(Provence, place-of-birth)). The logical form is then executed by the KB system to retrieve an answer.
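To make this concrete, the following is a minimal, illustrative sketch (not the paper's executor): a logical form such as Count(Find(Provence, place-of-birth)) can be viewed as a small program run against a triple store. The triples, the Find direction, and the helper names are invented for this example.

# Illustrative toy KB and two operators, mirroring Count(Find(Provence, place-of-birth)).
# Triples and function names are assumptions of this sketch, not the paper's grammar.
TRIPLES = [
    ("Marcel Pagnol", "place-of-birth", "Provence"),
    ("Paul Cezanne",  "place-of-birth", "Provence"),
    ("Emile Zola",    "place-of-birth", "Paris"),
]

def find(obj, predicate):
    """Find(o, p): all subjects s such that (s, p, o) is in the KB."""
    return {s for s, p, o in TRIPLES if p == predicate and o == obj}

def count(entity_set):
    """Count(S): size of an entity set."""
    return len(entity_set)

# "How many people have birthplace at Provence?"
answer = count(find("Provence", "place-of-birth"))
print(answer)  # 2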

To train a semantic parser, an ideal training example is a ⟨question, logical form⟩ pair. However, it usually requires some expertise to compose logical forms, especially for complex questions [webquestions, wikitablequestions, csqa]. Therefore, it is not realistic to employ crowdsourcing to scale up the size of such training data. To circumvent this challenge, a weakly supervised training setting was proposed. The idea is to create training examples in the format ⟨question, answer⟩ instead of ⟨question, logical form⟩, since it is easier to obtain the answer to a factoid question than to write the corresponding logical form. However, answers cannot be directly used to train a semantic parser. Therefore, given a factoid question, a crucial step in weak supervision is to automatically search over the knowledge base for valid logical forms, i.e., ones that lead to the given ground-truth answer after execution. Logical forms derived from this searching process are then used as fully-supervised training targets for a semantic parser.
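As a minimal sketch of this "search for supervision" idea, the function below keeps every candidate logical form whose execution yields the annotated answer. The enumerator and KB executor are passed in as placeholders; their names are assumptions of this sketch rather than the paper's implementation.

def search_valid_logical_forms(question, gold_answer, enumerate_candidates, execute):
    """Weak-supervision search (sketch): keep every candidate logical form
    whose KB execution yields the annotated answer.
    `enumerate_candidates(question)` stands in for a grammar-guided enumerator,
    `execute(lf)` for a KB executor; both are assumed interfaces."""
    valid = []
    for lf in enumerate_candidates(question):
        try:
            if execute(lf) == gold_answer:
                valid.append(lf)
        except Exception:
            continue  # skip candidates that cannot be executed
    return valid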

In this case, the quality of a semantic parser depends on the effectiveness of the upstream searching process for logical forms. However, the search space of eligible logical forms can be very large [iyyer2017search]. For example, a complex question frequently involves 7 to 8 steps, and in each step an operator is chosen from up to 20 candidates. The size of the search space is then on the order of $20^{7}$ to $20^{8}$. Although we may leverage the constraint of grammars to prune the search space, it can still be prohibitively large. The large search space results in two challenges as follows.
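To make the magnitude concrete, the following is only an illustration using the figures quoted above (7–8 steps, up to 20 operator candidates per step); an unconstrained enumeration would have to consider roughly

$20^{7} = 1.28 \times 10^{9}, \qquad 20^{8} = 2.56 \times 10^{10}$

candidate operator sequences, before even instantiating entities, predicates, or numbers.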

First, it may not be practical to exhaustively search the whole hypothesis space, since verifying each candidate by executing its logical form against a large-scale KB incurs a huge cost in computation time and memory. Therefore, a usual practice is to search a randomly-selected, middle-sized subspace. However, such incomplete search may miss valid logical forms. In our empirical study, we define the search success ratio as the number of questions for which the subspace search can find valid logical forms, divided by the total number of questions. We applied some traditional search algorithms, such as naive BFS [d2a], to a public dataset, CSQA [csqa], and found the success ratio to be very low. For example, the search success ratios for comparative and quantitative questions are barely 25% and 43%, respectively. In other words, for a large percentage of these questions, no corresponding logical forms are generated as training data. As shown in our experiments (Section 3.2), such insufficient training data can negatively impact the performance of a semantic parser.
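The metric itself is straightforward; a minimal sketch (the data layout is an assumption) is:

def search_success_ratio(search_results):
    """search_results: dict mapping each question id to the list of valid
    logical forms found within the explored subspace (possibly empty)."""
    n_success = sum(1 for forms in search_results.values() if forms)
    return n_success / max(len(search_results), 1)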

Second, even if we could overcome the practical resource constraints and search the entire space, we are likely to find spurious logical forms alongside correct ones. A spurious logical form does not match the semantic meaning of the original question, but coincidentally yields the ground-truth answer when executed over the KB. For example, for the question “Which occupations do Camil Samson and Daniel von Moser do for a living”, because the two persons have the same occupation, all three of the following logical forms, i.e., 1) “Find(Camil Samson, Occupation)”, 2) “Find(Daniel von Moser, Occupation)”, and 3) “Union(Find(Camil Samson, Occupation), Find(Daniel von Moser, Occupation))”, lead to the ground-truth answer, but only the last one is correct. To measure the severity of spurious logical forms, we conducted a quantitative analysis over randomly held-out examples on CSQA by human evaluation, and found that up to 54.5% of the search results were spurious. A large percentage of spurious forms in the training set introduces heavy noise and diminishes the performance of a semantic parser.

Several prior works have been proposed to reduce the search space or decrease spurious logical forms, and they can be categorized in two ways. First, some methods use techniques such as macro grammars [macrogrammar] and logical form sketches [dong2018coarse] to reduce the search space, but they still suffer from spurious logical forms. Second, under reinforcement learning, some works try to gradually reduce spurious logical forms while the model is iteratively trained on weakly supervised data, such as iterative search [dasigi2019iterative]; however, these methods still suffer from a high failure ratio due to the exponentially-growing search space, and are probably not suitable for supervised settings.

In this work, we propose a novel approach to effectively search for logical forms over a large-scale knowledge base by introducing an operator predictor. (An operator is an action unit we take when querying the KB, e.g., finding objects given a subject-predicate pair, counting the number of an entity set, comparing two numbers, etc.) Intuitively, we can estimate operator candidates for a given question based on its semantics. For example, the phrase “the most” may suggest a superlative operator such as Argmax, and “less than” may suggest a comparison operator. With the constraint of the predicted operator set, searching for valid logical forms will result in a lower percentage of spurious logical forms and a higher search success ratio. In turn, high-quality training data will improve the accuracy of the downstream question-to-logical-form translation model. Additionally, the small set of predicted operators can also be easily integrated into the translation model's decoder to improve performance by providing constraints.

Experiments on the CSQA dataset [csqa], one of the largest weakly supervised KBQA datasets over a large-scale KB with complex questions, verify the effectiveness of this approach. In particular, by searching logical forms with our approach, the percentage of spurious logical forms is reduced from 55% to 27% according to human evaluation, and the search success ratio increases from 71% to 80%; for the KBQA task, the overall score is significantly improved compared to the baseline, i.e., about 5% growth in both recall and precision.

2 Our Approach

This section starts with an introduction to the grammars and logical forms. Then, an outline of the proposed approach and the implementation of the models are elaborated.

Table 1: Grammars to compose logical forms, defined as operator aliases A1–A18 (each operator consists of a semantic category, a function symbol, and a list of arguments); entities, predicates, and numbers are instantiated from the input question.

2.1 Grammar and Logical Form

We leverage similar formats of grammar and logical form as in [d2a]. Here we give a brief introduction and refer readers to [d2a] for more details.

Grammar

The grammar definitions are shown in Table 1, where each operator is composed of three parts, i.e., a semantic category, a function symbol and a list of arguments. An argument can be a semantic category or a constant instantiated from a question.

Logical Form

A KB-executable logical form is usually formatted as a tree structure, where the root is a start operator and each child node is a legitimate operator constrained by a semantic category in its parent's argument list. To take advantage of sophisticated sequence-to-sequence models [bahdanau2015neural, vaswani2017attention] for question-to-logical-form translation, we re-format the tree structure into a sequence by applying a depth-first traversal over the tree. Conversely, once a sequence-formatted logical form is generated during the decoding phase, it can easily be recovered into a tree structure under the grammars' guidance.
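A small illustrative sketch of this serialization (a simplification, not the paper's exact grammar machinery): a tree is flattened by pre-order depth-first traversal, and the sequence is rebuilt by consuming operators left to right according to their arity, standing in for the semantic-category constraints.

def linearize(node):
    """Pre-order DFS: flatten a logical-form tree into an operator sequence.
    A node is (operator, [children]) in this sketch."""
    op, children = node
    seq = [op]
    for child in children:
        seq.extend(linearize(child))
    return seq

def recover(seq, arity):
    """Rebuild the tree from the sequence, guided by each operator's arity
    (a simplification of the grammar's semantic-category constraints)."""
    def build(i):
        op = seq[i]
        children, j = [], i + 1
        for _ in range(arity[op]):
            child, j = build(j)
            children.append(child)
        return (op, children), j
    tree, _ = build(0)
    return tree

# Example: Count(Find(Provence, place-of-birth)) with constants as leaves
tree = ("Count", [("Find", [("Provence", []), ("place-of-birth", [])])])
seq = linearize(tree)   # ['Count', 'Find', 'Provence', 'place-of-birth']
assert recover(seq, {"Count": 1, "Find": 2, "Provence": 0, "place-of-birth": 0}) == tree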

2.2 Overview of Our Approach

As shown in Figure 1, our approach mainly consists of 6 steps: searching, cleaning, training operator predictor, operator prediction, re-searching, and training semantic parser.

Searching and Cleaning

Steps 1 and 2 create training examples for the proposed operator predictor based on a small sampled subset (e.g., 10% of the total in this work) of the entire training data, which results in a new training set for the operator predictor.

In Step 1, for each question in the sampled small dataset, we naively search for and collect valid logical forms. As stated in Section 1, this step can generate spurious logical forms, which would lead to a poor operator predictor if we directly used these data for model training. Hence, we further clean the searched results in Step 2.

In Step 2, we clean the searched logical forms according to question types, inspired by the observation that questions belonging to the same type require similar operators. For example, the quantitative questions “How many cities are sister towns of …”, “How many rivers flow through …”, and “How many countries have …” all require the Count operator rather than Argmax. We follow the same question types as [csqa], which is a general and widely-used taxonomy for KBQA. Specifically, we first create a legitimate operator set for each question type. The criterion is that, for a question type, an operator is legitimate if removing it from the candidates leads to a notable (e.g., 1% in our setup) drop in search success ratio for the questions of that type. Then, for each question, we remove illegal logical forms that contain any operator not belonging to the legitimate set of the corresponding question type. The set of unique operators appearing in a question's cleaned logical forms serves as its training target for the operator predictor.
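A minimal sketch of this cleaning step, under the 1% threshold stated above; the helper names and the representation of a logical form as its list of operator names are assumptions made for the sketch.

def legitimate_operators(questions_by_type, search_with_ops, all_ops, max_drop=0.01):
    """For each question type, keep an operator only if removing it from the
    whitelist lowers the search success ratio by more than `max_drop`
    (1% in this setup). `search_with_ops(questions, ops)` is a placeholder
    returning the success ratio under an operator whitelist; `all_ops` is the
    set of all grammar operators."""
    legit = {}
    for qtype, questions in questions_by_type.items():
        base = search_with_ops(questions, all_ops)
        legit[qtype] = {
            op for op in all_ops
            if base - search_with_ops(questions, all_ops - {op}) > max_drop
        }
    return legit

def clean_logical_forms(op_sequences, qtype, legit):
    """`op_sequences`: each searched logical form represented as the list of
    operator names it contains (a simplification). Drop any form that uses an
    operator outside the legitimate set of the question's type."""
    return [seq for seq in op_sequences if set(seq) <= legit[qtype]]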

Figure 1: An overview of the proposed approach.

Model Training for Operator Predictor

In Step 3, an operator predictor model, which maps a question to the operators most likely to compose a correct logical form, is trained on the cleaned training data. More details of the operator predictor are given in Section 2.3.

Training Data Generation for KBQA

In Steps 4 and 5, we apply the trained operator predictor to each question in the full training data to predict its most likely operators, and re-search for valid logical forms under the constraint of the predicted operator set.
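A sketch of Steps 4–5 in the same spirit as the search sketch in Section 1; the predictor and search interfaces, and the 0.5 threshold, are assumptions of this sketch.

def constrained_re_search(questions, operator_predictor, search_fn, threshold=0.5):
    """Steps 4-5 (sketch): restrict the logical-form search for each question
    to the operators whose predicted probability exceeds `threshold`.
    `operator_predictor(q)` -> {operator: probability};
    `search_fn(q, allowed_ops)` -> list of valid logical forms found under
    the operator whitelist."""
    training_pairs = []
    for q in questions:
        probs = operator_predictor(q)
        allowed = {op for op, p in probs.items() if p >= threshold}
        for lf in search_fn(q, allowed):
            training_pairs.append((q, lf))
    return training_pairs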

Model Training for KBQA

Lastly, in Step 6, we train a semantic parser based on the searched results in Step 5. More details of the model are introduced in Section 2.3.

Notably, this approach involves two rounds of searching, but it is still faster than previous works (e.g., the naive BFS of [d2a]). The reason is that in Step 1 only a small subset (e.g., 10%) needs to be fully searched as previous works do, and, with the reduced search space, the re-search in Step 5 is much faster. For example, on the CSQA benchmark, our algorithm is faster than its baseline (search speed of 0.94s vs. 2.75s per example).

There are two main benefits to incorporating an operator predictor into a standard “searching and training” scheme. First, it helps to provide high-quality data for downstream model training by improving search success ratio and reducing the number of spurious logical forms. Second, it makes training and inference more effective by providing the constraint from legal operators.

2.3 Model Details

We describe the implementation of our models in detail in this section, including the operator predictor introduced in Step 3 and the semantic parsing model in Step 6, which consists of three sub-tasks: entity detection & linking, predicate prediction, and a sequence-to-sequence translation model.

In formal terms, a question is first tokenized into a list of words, i.e., $q = [w_1, w_2, \dots, w_n]$, and then a word embedding approach [mikolov2013distributed] is invoked to transform the discrete words into low-dimensional vector representations, i.e., $X = [x_1, x_2, \dots, x_n] \in \mathbb{R}^{n \times d_e}$, where $d_e$ denotes the embedding size and $n$ stands for the question sequence length. The embedded words are separately passed into each of the following neural components with untied parameters.

Operator Predictor

We define this sub-task as a multi-label classification problem, whose input is a natural language question and whose output is the set of operators that may compose a correct logical form for the question. In particular, a bi-directional LSTM (Bi-LSTM) [hochreiter1997long] is applied over the input word embeddings as an encoder to capture contextual information, which is denoted as

$\overrightarrow{h}_i = \overrightarrow{\mathrm{LSTM}}(x_i, \overrightarrow{h}_{i-1})$,   (1)
$\overleftarrow{h}_i = \overleftarrow{\mathrm{LSTM}}(x_i, \overleftarrow{h}_{i+1})$,   (2)
$s = [\overrightarrow{h}_n; \overleftarrow{h}_1]$,   (3)

where the LSTM weights are learnable parameters of the Bi-LSTM, $[\cdot;\cdot]$ denotes a concatenation operation, and $s$ is the resulting vector representation of the whole question. Then, the probability of generating each operator is defined as

$p = \mathrm{sigmoid}\big(\mathrm{MLP}(s)\big) \in \mathbb{R}^{|\mathcal{O}|}$,   (4)

where $\mathrm{MLP}(\cdot)$ is a multi-layer perceptron, $\mathcal{O}$ stands for all operators defined in Table 1, and $|\mathcal{O}|$ denotes the size of $\mathcal{O}$.

Entity Detection & Linking

Entity detection aims to locate named entity mentions in the input question, which is usually formulated as a sequence labeling problem: each word is assigned one of the labels B, I and O, which stand for the beginning of, inside of, and outside of a named entity, respectively. To solve this problem, we use a Bi-LSTM Conditional Random Field (Bi-LSTM-CRF) [huang2015bidirectional] model to predict the entity mention tag for each input word. Formally, another Bi-LSTM model is leveraged as a context embedding layer, parameter-untied with the one defined in Eq. (1)-(2), i.e.,

$H = [h_1, \dots, h_n] = \text{Bi-LSTM}(X) \in \mathbb{R}^{n \times 2d_h}$,   (5)

where $d_h$ denotes the hidden size of each directional LSTM. Then, a position-wise feed-forward network ($\mathrm{FFN}$) with 3-way output is applied to each $h_i$ to produce the $i$-th word's scores over B, I and O respectively, which is written as

$S = \mathrm{FFN}(H; \theta) \in \mathbb{R}^{n \times 3}$,   (6)

where $S$ denotes the resulting scores for all words, and $\theta$ stands for the learnable parameters (i.e., weights and biases) of the $\mathrm{FFN}$. Then, given a path of tags $y = [y_1, \dots, y_n]$, a scoring function with a learnable transition matrix $T$ is defined as

$\mathrm{score}(q, y) = \sum_{i=1}^{n} \big( T_{y_{i-1}, y_i} + S_{i, y_i} \big)$.   (7)

Further details on the training and inference of the Bi-LSTM-CRF model are available in [huang2015bidirectional].

Given the mentions detected in a question, we then follow the traditional approach to link them back to entities in the KB. Specifically, we first build an inverted dictionary whose keys are entity mentions and whose values are candidate entities with matching scores attached. Then, given a mention, we select the entity with the highest score from the dictionary.
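A minimal sketch of this linking heuristic; the dictionary layout (mention string to list of (entity id, score) pairs) is an assumption made for the sketch.

def link_entities(mentions, inverted_index):
    """Entity linking (sketch). `inverted_index` maps a surface mention to a
    list of (entity_id, score) pairs built offline from the KB; for every
    detected mention we keep the highest-scoring entity."""
    linked = []
    for mention in mentions:
        candidates = inverted_index.get(mention.lower(), [])
        if candidates:
            entity_id, _ = max(candidates, key=lambda c: c[1])
            linked.append((mention, entity_id))
    return linked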

Predicate Prediction

Identifying the predicates in a question is also essential to compose an executable logical form. For this purpose, we simply formulate this sub-task as a multi-class classification problem. In brief, another Bi-LSTM is used to embed the question as a vector representation $s' = [\overrightarrow{h}'_n; \overleftarrow{h}'_1]$, where $\overrightarrow{h}'_n$ and $\overleftarrow{h}'_1$ are derived from another Bi-LSTM as in Eq. (1)-(3). Then an $\mathrm{MLP}$ with $|\mathcal{P}|$-way output is used to fulfill the classification, where $\mathcal{P}$ is the set of all possible predicates. In the training phase, a negative log-likelihood loss is applied to learn this model's parameters.

Question-to-Logical-Form Translation

Given the predicted entity and predicate candidates from the upstream modules, the semantic parsing model aims to translate an input natural language question into a KB-executable logical form. Since the logical forms have been formatted as sequences, we employ a sequence-to-sequence encoder-decoder structure with an attention mechanism [bahdanau2015neural]. In particular, we use a Bi-LSTM model as the encoder of the natural language question to produce a context-aware representation for each word, which is formulated as

$U = [u_1, \dots, u_n] = \text{Bi-LSTM}(X) \in \mathbb{R}^{n \times 2d_h}$.   (8)

For the decoder, we employ a forward LSTM as an autoregressive model to predict the logical form. To be specific, at the $t$-th decoding step, given the previous hidden state $s_{t-1}$, we use a compatibility function to calculate the alignment score between the previous decoding hidden state and each encoded word representation, resulting in a contextual embedding $c_t$. The attention procedure and the decoder's hidden state update are expressed as

$\alpha^{(t)} = \mathrm{softmax}\big(s_{t-1} W U^{\mathsf{T}}\big)$,   (9)
$c_t = \sum_{i=1}^{n} \alpha^{(t)}_i u_i$,   (10)
$s_t = \mathrm{LSTM}\big([e_{t-1}; c_t], s_{t-1}\big)$,   (11)

where $W$ is a learnable parameter matrix, $\alpha^{(t)}$ is the attention distribution over all encoder states, and $e_{t-1}$ is the decoding input embedding for the $t$-th step. Next, given the decoding hidden state $s_t$, a neural classifier composed of a linear layer is used to predict an operator for the $t$-th step, i.e.,

$p_t = \mathrm{softmax}\big(W^{(o)} s_t + b^{(o)}\big)$,   (12)

where $p_t$ is the predicted distribution over all possible operators, i.e., $\mathcal{O}$. Unlike a typical sequence-to-sequence task, e.g., neural machine translation, besides predicting one operator from Table 1 at each step, the model also needs to instantiate a semantic category. Specifically, if the predicted operator instantiates an entity, predicate, or number, the model is also required to choose one term from the corresponding candidates for the instantiation. Note that number candidates are derived from named entity recognition by SpaCy.

To complete an instantiation, entities, predicates and numbers are respectively embedded by mean-pooling over an entity's composing words, the predicate's token embedding, and a character-level 1D-CNN [kim2014convolutional]. Then, a dot-product is computed between the decoding hidden state and the candidates' embeddings of the targeted instantiation type (w.r.t. the predicted operator). A softmax is finally applied to all candidate scores to produce a prediction distribution. After iterative decoding under the grammar's guidance, a complete logical form is composed by this autoregressive model.
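The following is a compact sketch of one decoding step in the spirit of Eqs. (9)–(12) and the instantiation scoring just described. The bilinear compatibility function, the tensor shapes, and the assumption that candidate embeddings share the decoder's hidden size are choices made for this sketch, not the authors' exact implementation.

import torch
import torch.nn.functional as F

def decode_step(prev_state, prev_emb, enc_states, dec_cell, attn_W, op_proj, cand_embs=None):
    """One decoding step (sketch).
    prev_state: (h, c), each [batch, hid];  prev_emb: [batch, emb]
    enc_states: [batch, n, enc_dim];        attn_W: [hid, enc_dim]
    dec_cell: an nn.LSTMCell;               op_proj: an nn.Linear over operators
    cand_embs: [batch, m, hid] candidate embeddings when instantiation is needed."""
    h_prev, c_prev = prev_state
    scores = torch.einsum('bh,hk,bnk->bn', h_prev, attn_W, enc_states)  # Eq. (9) logits
    alpha = F.softmax(scores, dim=-1)                                   # attention distribution
    context = torch.einsum('bn,bnk->bk', alpha, enc_states)             # Eq. (10)
    h, c = dec_cell(torch.cat([prev_emb, context], dim=-1), (h_prev, c_prev))  # Eq. (11)
    op_dist = F.softmax(op_proj(h), dim=-1)                             # Eq. (12)
    inst_dist = None
    if cand_embs is not None:                                           # entity/predicate/number
        inst_dist = F.softmax(torch.einsum('bh,bmh->bm', h, cand_embs), dim=-1)
    return (h, c), op_dist, inst_dist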

Ideally, all four models introduced above would be trained in a multi-task learning framework to make the most of a shared encoder and improve performance. However, our focus in this paper is to propose an effective and efficient approach to searching full-supervision data for weakly supervised KBQA, and thus to highlight that more high-quality training data plays a vitally important role in this task. Hence, for a fair comparison, we directly adopt the previous state-of-the-art pipeline model [d2a] in our framework rather than attempting to improve it. In addition, the “copy” operators proposed by [d2a] are also included in this model for competitive results; their descriptions are omitted for brevity.

3 Experiment

Question Type | #Example | HRED+KVmem (Recall / Precision) | D2A, Baseline (Recall / Precision) | D2A+Ours* (Recall / Precision)
Overall | - | 18.40% / 6.30% | 66.83% / 66.57% | 71.63% / 72.42%
Simple Question (Direct) | 82k | 33.30% / 8.58% | 79.50% / 77.37% | 82.80% / 83.20%
Simple Question (Coreferenced) | 55k | 12.67% / 5.09% | 58.47% / 56.94% | 64.67% / 64.58%
Simple Question (Ellipsis) | 55k | 17.30% / 6.98% | 84.67% / 77.90% | 84.88% / 83.02%
Logical Reasoning (All) | 22k | 15.11% / 5.75% | 65.82% / 68.86% | 73.88% / 72.00%
Quantitative Reasoning (All) | 9k | 0.91% / 1.01% | 52.74% / 60.63% | 60.30% / 68.06%
Comparative Reasoning (All) | 15k | 2.11% / 4.97% | 44.14% / 54.68% | 50.42% / 60.62%
Clarification | 12k | 25.09% / 12.13% | 37.24% / 33.97% | 38.74% / 34.80%

Question Type | #Example | HRED+KVmem (Accuracy) | D2A, Baseline (Accuracy) | D2A+Ours* (Accuracy)
Verification (Boolean) | 27k | 21.04% | 37.07% | 45.80%
Quantitative Reasoning (Count) | 24k | 12.13% | 38.42% | 41.35%
Comparative Reasoning (Count) | 15k | 8.67% | 16.62% | 20.93%

Table 2: KBQA result comparison with HRED+KVmem [serban2016building, miller2016key] and D2A [d2a]. *D2A+Ours means the D2A model integrated with the proposed pipeline shown in Figure 1.

This section begins with experimental setups to evaluate our proposed framework. Then, the evaluation includes assessments of the quality of responses to a KBQA task, effectiveness of searching, and the performance of each sub-task. Lastly, case study and error analysis are presented for qualitative and in-depth understanding of this work.

3.1 Experimental Settings

Dataset

We employed one of the largest weakly-supervised KBQA datasets over a large-scale knowledge base, Complex Sequential Question Answering (CSQA) [csqa], in our experiments. It contains 1.6M turns in 200K dialogues without logical form labels. Its KB is built on Wikidata in the form of (subject, predicate, object) triples, including 21.2M triplets over 12.8M entities. Moreover, it also defines a question taxonomy with 10 types (e.g., logical reasoning and comparative reasoning) and labels each question in the dataset with its type. Although the taxonomy is defined for this dataset, it is rather general and can be applied to other KBQA datasets.

Evaluation Metrics

In line with [csqa] and [d2a], we used Precision and Recall as metrics for questions whose answer is a set of entities, and Accuracy for questions whose answer is boolean or numeric.

Model Setup

For each neural model, the word embedding weight matrix was independent of the others and the embedding size was 300D; the hidden state size was also set to 300D, with a nonlinear activation for the middle layer of each MLP. For the optimization, we used the Adam optimizer [kingma2014adam]; the batch size was set to 64 for 6 epochs, and an early-stop strategy was applied when there was no longer a significant improvement on the development set during training. Moreover, for the operator predictor, we first used the naive BFS method to search only 10% of the CSQA training data and applied the data pre-processing steps outlined in Section 2.2.

Baselines

Only a few approaches have been proposed for solving the large-scale, weakly supervised KBQA problem. HRED+KVmem [csqa] and D2A [d2a] are two typical approaches, representative of information retrieval and neural symbolic methods, respectively. In particular, HRED+KVmem combines a sequence-to-sequence based HRED model [serban2016building] with a key-value memory network [miller2016key] to retrieve answers from the KB. In contrast, D2A defines a set of semantic parsing grammars and translates natural language questions into corresponding logical forms to query the KB via a memory-augmented neural symbolic model. (The re-implemented D2A in this work outperforms the one originally reported by [d2a]; one possible reason is that our re-implemented grammars reach a better performance balance between simple and non-simple questions. For a fair comparison, we report the reproduced results for D2A in this paper.)

3.2 Question Answering Performance

As listed in Table 2, our proposed effective search approach coupled with the D2A model improves previous baselines by a significant margin, setting a new state-of-the-art performance on the CSQA dataset. Specifically, compared to the strong baseline D2A, our framework improves the overall recall from 66.83% to 71.63% and the overall precision from 66.57% to 72.42%. As shown in the bottom panel of Table 2, our framework also significantly outperforms D2A on boolean (i.e., Verification) and numeric (i.e., Quantitative and Comparative Reasoning) questions. In addition, the improvements are more notable for more complex question types. For example, the 7.56%/7.43% improvement in recall/precision on Quantitative Reasoning is much greater than the 3.30%/5.83% improvement on Simple Questions (Direct). We attribute this to the larger number of operators required to answer more complex questions, which exacerbates the problems associated with a large search space.

3.3 Searching Effectiveness

In this section, we quantitatively analyze the effectiveness of our proposed algorithm in terms of alleviating the two problems caused by large search space, i.e., low search success ratio and spurious logical forms.

Figure 2: Search success ratio comparison w.r.t. question type.
Method #Correct #Spurious %Spurious
Naive BFS 114 136 54.5%
Ours 115 40 26.7%
Table 3: Statistics of spurious logical forms. Note that a question may be assigned with multiple valid logical forms.
Increasing Search Success Ratio

The search success ratio is defined as the number of questions, each with at least one valid logical form found by a search method, divided by the total number of questions. We compared our approach with the traditional BFS one [d2a] and report the results in Figure 2. From the figure, we find that our approach increases the search success ratio significantly, with a particularly large increase for logical reasoning questions. We also find that the improvement in KBQA performance is roughly proportional to the increase in search success ratio across question types.

Reducing Spurious Logical Forms

To determine whether our approach reduces the number of spurious logical forms, we randomly sampled 40 questions, each with at least one valid logical form found by both the naive BFS method and our approach. Human evaluators manually inspected the results and judged whether each logical form was correct or spurious. The results listed in Table 3 demonstrate that our approach considerably reduces the incidence of spurious logical forms, from 54.5% to 26.7%, compared to the baseline. This substantial reduction provides clear evidence that our approach can improve the quality of training data and thus benefit any downstream semantic parsing model.

3.4 Sub-Task Evaluation

We evaluated the performance of the models for solving the sub-tasks, which provide prerequisites for the question-to-logical-form translation model. These empirical results also measure the severity of error propagation in the pipeline model.

Operator Predictor

Question Type | Question Coverage (%) | Remaining (%)
Quantitative Reasoning | 97.38 | 44.15
Comparative Reasoning | 94.41 | 50.93
Verification | 99.06 | 17.65
Logical Reasoning | 98.07 | 23.90
Clarification | 86.21 | 21.50
ALL | 98.67 | 30.83
Table 4: Operator predictor evaluation.

A proper operator prediction is crucial to training the downstream semantic parsing model. A poorly-trained operator predictor will lead to searching in a wrong logical-form space and consequently damage the performance of the model. To assess the quality of the operator predictor, we took the (question, valid operators) pairs found by naive BFS as the evaluation set and evaluated the performance using a Question Coverage metric. Question coverage is defined as the number of questions whose predicted operators can compose at least one valid logical form, as a ratio of all questions. As shown in Table 4, our operator predictor achieves 98.67% question coverage, which means that when re-searching logical forms in Step 5, our approach locates a correct sub-space for at least 98.67% of the questions. As an auxiliary measure, Remaining is the average proportion of the size of the predicted operator candidate set over the number of all operators defined in Table 1.
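A minimal sketch of these two metrics; the data layout (each question paired with the operator sets of its BFS-found logical forms) is an assumption made for the sketch.

def coverage_and_remaining(eval_pairs, predict_ops, num_all_ops):
    """Operator predictor evaluation (sketch).
    `eval_pairs`: list of (question, list_of_valid_operator_sets) from naive BFS;
    a question is covered if the predicted operator set contains at least one
    valid operator set. `Remaining` is the average predicted-set size relative
    to all operators in the grammar."""
    covered, remaining = 0, 0.0
    for question, valid_sets in eval_pairs:
        predicted = predict_ops(question)
        if any(s <= predicted for s in valid_sets):
            covered += 1
        remaining += len(predicted) / num_all_ops
    n = max(len(eval_pairs), 1)
    return covered / n, remaining / n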

Question | Logical Form from Naive Approach | Logical Form from Ours | Ops Prediction
Where is Zinc finger protein 775 found? | Diff(Find(Set(Zinc…775, found-in-taxon)), Set(Zinc…775)) | Find(Set(Zinc…775, found-in-taxon)) | [start set, Find, Set]
Is Sumy Oblast adjacent to Poltava Oblast? | In(Poltava Oblast, Union(Find(Set(Sumy Oblast), shares-border), Set(Italy))) | In(Sumy Oblast, Find(Set(Poltava Oblast), shares-border)) | [start bool, In, Find, Set]
Which administrative territories holds diplomatic relationship with max number of administrative territories? | Diff(Argmax(Count(Find(Find(Set(administrative territorial), is-a)), diplomatic-relation)), Set(Quebec)) | Argmax(Count(Find(Find(Set(administrative territorial), is-a)), diplomatic-relation)) | [start set, Find, Count, Union, Diff, Argmax, Set]
Which administrative territories are Yale University present in and are the origins of Anna Karenina? | Inter(Find(Set(Yale University), country), Find(Set(Anna Karenina), country-of-origin)) | Find(Set(Yale University, country)) | [start set, Find, Count, Union, Inter, Diff, Set]
Table 5: Case study of valid logical forms; a logical form marked with a check (✓) is correct, otherwise spurious.

Entity Detection & Linking

The employed entity detection model is quite accurate when predicting “IOB” tags for the entity mentions: the F1 score of this sequence labeling task reaches 99%. Given the detected entity mentions, we need to link them back to the knowledge base, so we also evaluated the performance of entity linking and obtained a precision of 24% and a recall of 90%. (Co-references may appear in CSQA examples and thus the oracle linking labels are often inaccurate, so our entity detection & linking model is underestimated.) The results show that, although the precision is relatively low, the recall stays high, indicating that most correct entities are retrieved and passed downstream for translation. The low precision mainly comes from the ambiguity of entities, i.e., different entities sharing the same mention while expressing totally different meanings.

Predicate Prediction

As described in Section 2.3, predicate classification is formulated as a multi-class problem in the training phase. During inference, in case more than one predicate exists in an input question, we keep the top-$k$ most probable predicates. Although this certainly hurts precision, it guarantees a high recall and reduces error propagation. The final results are a precision of 46% and a recall of 98% with $k$ set to two.
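A one-line sketch of this top-$k$ retention step (the probability dictionary is an assumed interface):

def top_k_predicates(probs, k=2):
    """Keep the k most probable predicates (k=2 in this setup), trading some
    precision for high recall and less downstream error propagation."""
    return sorted(probs, key=probs.get, reverse=True)[:k]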

3.5 Case Study

In this section, we present several cases to demonstrate the effectiveness of our proposed algorithm in searching for logical forms for weakly supervised KBQA. As shown in Table 5, for each question we list the logical forms searched by the naive BFS approach and by our approach respectively, as well as the operators produced by the operator predictor.

According to the first three cases, thanks to the constraints posed by the operator predictor (last column), our approach avoids some spurious results. Meanwhile, as shown in the fourth case, although the predicted operator candidates substantially reduce the search space, spurious logical forms may still appear in the search results.

3.6 Error Analysis

To conduct an error analysis and provide insight into the causes of prediction errors, we randomly sampled 50 wrongly-predicted KBQA examples and found that the errors could be coarsely categorized as follows.

Entity Ambiguity

This is the most serious problem leading to wrong predictions during question-to-logical-form translation, since many entities with identical surface text express totally different meanings. For example, the entity The Avengers could be a movie, a soundtrack album or a punk rock band; even for a movie titled The Avengers, it could be the 2012 superhero film produced by Marvel or the 1998 film by Jeremiah S. Chechik.

Error Propagation

Because a pipeline approach is employed to solve the KBQA problem, prediction errors occurring at an early stage inevitably propagate into downstream models. An obvious case is that wrongly-predicted predicate candidates directly lead to an un-executable logical form.

Translation Error

Due to the translation model's limited representational power, a wrong operator or entity can be chosen to compose the logical form during decoding, which results in an incorrect answer.

4 Related work

This work is in line with semantic parsing based approach for KBQA task. Given a natural language question, based on a set of well-defined grammars for specific task, typical semantic parsing approaches learn a model to transform the question to a KB-executable logical form for answer retrieval [wong2007learning, zettlemoyer2009learning, kwiatkowski2011lexical, andreas2013semantic, artzi2013weakly, zhao2014type, long2016simpler, jia2016recombination, ling2016latent, xiao2016sequence].

Usually, because of limited crowdsourcing, only final answers instead of full executable logical forms are provided to learn a semantic parsing model, i.e., in a weakly supervised learning scheme [webquestions, iyyer2017search, csqa]. Hence, “searching and training” is a conventional stepwise approach to handling such a weakly supervised setting, by searching logical forms for semantic parser learning [kbqaasmt, stagg, macrogrammar, ltop, mapo, d2a, dasigi2019iterative].

However, searching over structured knowledge bases inevitably leads to the spurious logical form problem, which introduces wrongly labeled data and thus has a negative effect on KBQA performance [pasupat2016inferring, d2a]. To alleviate the effect of spurious logical forms, for example, [mapo] separately estimated expectations over the trajectories inside and outside a high-reward memory buffer, rather than using maximum likelihood training. [ltop] reduced the impact of spurious logical forms by using randomized beam search and more balanced optimization. And [dasigi2019iterative] alternated between searching for consistent logical forms and maximizing the marginal likelihood of the retrieved ones during iterative training, which increases the complexity of the logical forms found in subsequent iterations and thus addresses the problem of spuriousness. In addition, some works have been proposed to reduce the search space: [macrogrammar] used macro grammars, and [dong2018coarse] proposed a coarse-to-fine semantic parsing model that first predicts a logical form sketch.

In contrast, our approach aims to prevent these problems at their root. In other words, we directly reduce the search space by restricting operator candidates, which decreases spurious logical forms in the search results while also increasing the search success ratio.

5 Conclusion

We proposed a novel approach for the effective search of logical forms via operator prediction for the weakly supervised KBQA task. It provides sufficient, high-quality data for training the downstream question-to-logical-form translation model, and makes training and inference more effective under the constraints of possible operators. The proposed approach is simple and effective, which makes it of great practical use. Experimental results verify its effectiveness in terms of reducing spurious logical forms, increasing the search success ratio, improving search efficiency, and boosting the final question answering accuracy.

References