KBQA: Learning Question Answering over QA Corpora and Knowledge Bases

03/06/2019 ∙ by Wanyun Cui, et al. ∙ Yonsei University ∙ The Hong Kong University of Science and Technology ∙ Fudan University

Question answering (QA) has become a popular way for humans to access billion-scale knowledge bases. Unlike web search, QA over a knowledge base gives accurate and concise results, provided that natural language questions can be understood and mapped precisely to structured queries over the knowledge base. The challenge, however, is that a human can ask one question in many different ways. Previous approaches have natural limits due to their representations: rule based approaches only understand a small set of "canned" questions, while keyword based or synonym based approaches cannot fully understand the questions. In this paper, we design a new kind of question representation, templates, learned over a billion-scale knowledge base and a million-scale QA corpus. For example, for questions about a city's population, we learn templates such as What's the population of $city? and How many people are there in $city?. We learned 27 million templates for 2782 intents. Based on these templates, our QA system KBQA effectively supports binary factoid questions, as well as complex questions which are composed of a series of binary factoid questions. Furthermore, we expand predicates in the RDF knowledge base, which boosts the coverage of the knowledge base by 57 times. Our QA system beats all other state-of-the-art works on both effectiveness and efficiency over QALD benchmarks.




1 Introduction

Question Answering (QA) has attracted considerable research interest. A QA system is designed to answer a particular type of question [5]. One of the most important types of questions is the factoid question (FQ), which asks about objective facts of an entity. A particular type of FQ, known as the binary factoid question (BFQ) [1], asks about a property of an entity, for example, how many people are there in Honolulu? If we can answer BFQs, then we will be able to answer other types of questions, such as 1) ranking questions: which city has the 3rd largest population?; 2) comparison questions: which city has more people, Honolulu or New Jersey?; 3) listing questions: list cities ordered by population, etc. In addition to BFQs and their variants, we can answer a complex factoid question such as when was Barack Obama’s wife born? This can be answered by combining the answers of two BFQs: who’s Barack Obama’s wife? (Michelle Obama) and when was Michelle Obama born? (1964). We define a complex factoid question as a question that can be decomposed into a series of BFQs. In this paper, we focus on BFQs and complex factoid questions.

QA over a knowledge base has a long history. In recent years, large scale knowledge bases have become available, including Google’s Knowledge Graph, Freebase [3], YAGO2 [16], etc., which greatly increases the importance and the commercial value of a QA system. Most such knowledge bases adopt RDF as their data format, and they contain millions or billions of SPO triples (S, P, and O denote subject, predicate, and object respectively).

Figure 1: A toy RDF knowledge base (here, “dob” and “pob” stand for “date of birth” and “place of birth” respectively). Note that the “spouse of” intent is represented by multiple edges: name - marriage - person - name.

1.1 Challenges

Given a question against a knowledge base, we face two challenges: in which representation we understand the question (representation design), and how to map the representation to a structured query against the knowledge base (semantic matching).

  • Representation Design: Questions describe thousands of intents, and one intent has thousands of question templates. For example, both a⃝ and b⃝ in Table 1 ask about the population of Honolulu, although they are expressed in quite different ways. The QA system needs different representations for different questions. Such representations must be able to (1) identify questions with the same semantics and (2) distinguish different question intents. In the QA corpora we use, we find 27M question templates over 2782 question intents, so designing representations that handle this variety is a big challenge.

  • Semantic Matching: After figuring out the representation of a question, we need to map the representation to a structured query. For a BFQ, the structured query mainly depends on the predicate in the knowledge base. Due to the gap between predicates and question representations, it is non-trivial to find such a mapping. For example, in Table 1, we need to know that a⃝ has the same semantics as the predicate population. Moreover, in an RDF graph, many binary relations do not correspond to a single edge but to a complex structure: in Figure 1, “spouse of” is expressed by the path marriage → person → name. For the knowledge base we use, over 98% of the intents we found correspond to complex structures.

Table 1: Questions in Natural Language and Related Predicates in a Knowledge Base
Id | Question in natural language | Predicate in KB
a⃝ | How many people are there in Honolulu? | population
b⃝ | What is the population of Honolulu? | population
c⃝ | What is the total number of people in Honolulu? | population
d⃝ | When was Barack Obama born? | dob
e⃝ | Who is the wife of Barack Obama? | marriage → person → name
f⃝ | When was Barack Obama’s wife born? | marriage → person → name

Thus, the key problem is to build a mapping between natural language questions and knowledge base predicates through proper question representations.

1.2 Previous Works

According to how previous knowledge based QA systems represent questions, we roughly classify them into three categories: rule based, keyword based, and synonym based.

  1. Rule based [23]. Rule based approaches map questions to predicates by using manually constructed rules. This leads to high precision but low recall (low coverage of the variety of questions), since manually creating rules for a large number of questions is infeasible.

  2. Keyword based [29]. Keyword based methods use keywords in the question and map them to predicates by keyword matching. They may answer simple questions such as b⃝ in Table 1 by identifying population in the question and mapping it to the predicate population in the knowledge base. In general, however, keywords can hardly find such mappings, since one predicate representation in the knowledge base cannot match the diverse representations in natural language. For example, we cannot find population from a⃝ or c⃝.

  3. Synonym based [28, 33, 38, 37]. Synonym based methods extend keyword based methods by taking synonyms of the predicates into consideration. They first generate synonyms for each predicate, and then find mappings between questions and these synonyms. DEANNA [33] is a typical synonym based QA system. Its main idea is to reduce QA to the evaluation of semantic similarity between a predicate and candidate synonyms (words/phrases in the question). It uses Wikipedia to compute the semantic similarity. For example, question c⃝ in Table 1 can be answered by knowing that number of people in the question is a synonym of the predicate population; their semantic similarity can be evaluated with Wikipedia. gAnswer [38, 37] further improved the precision by learning synonyms for more complex sub-structures. However, none of these approaches can answer a⃝ in Table 1, as none of how many, people, or are there has an obvious relation with population. How many people is ambiguous across contexts. In how many people live in Honolulu?, it refers to population. In how many people visit New York each year?, it refers to number of passengers.

In general, these works cannot solve the above challenges. Rule based approaches require an unaffordable human labeling effort. For keyword based or synonym based approaches, one word or one phrase cannot represent the question’s semantic intent completely; we need to understand the question as a whole. The task is even more difficult for previous approaches if the question is a complex question or maps to a complex structure in a knowledge base (e.g. e⃝ or f⃝).

1.3 Overview of Our Approach

Figure 2: Our Approach

To answer a question, we must first represent the question. By representing a question, we mean transforming the question from natural language to an internal representation that captures the semantics and intent of the question. Then, for each internal representation, we learn how to map it to an RDF query against a knowledge base. Thus, the core of our work is the internal representation which we denote as templates.

Representing questions by templates The failure of the synonym based approach on a⃝ inspires us to understand a question by templates. As an example, How many people are there in $city? is the template for a⃝. No matter whether $city refers to Honolulu or to another city, the template always asks about the population of the city.

Then, the task of representing a question is to map the question to an existing template. To do this, we replace the entity in the question by its concepts. For instance, Honolulu will be replaced by $city as shown in Figure 2. This process is not trivial, and it is achieved through a mechanism known as conceptualization [25, 17], which automatically performs disambiguation on the input (so that the term apple in what is the headquarter of apple will be conceptualized to $company instead of $fruit). The conceptualization mechanism itself is based on a large semantic network (Probase [32]) that consists of millions of concepts, so that we have enough granularity to represent all kinds of questions.

The template idea also works for complex questions. Using templates, we simply decompose the complex question into a series of questions, each of which corresponds to one predicate. Consider question f⃝ in Table 1. We decompose f⃝ into Barack Obama’s wife and when was Michelle Obama born?, which correspond to marriage → person → name and dob respectively. Since the first question is nested within the second one, we know that dob modifies marriage → person → name, and marriage → person → name modifies Barack Obama.

Mapping templates to predicates We learn templates and their mappings to knowledge base predicates from Yahoo! Answers. This problem is quite similar to semantic parsing [6, 7]. Most semantic parsing approaches are synonym based. To model the correlation between phrases and predicates, SEMPRE [2] uses a bipartite graph, and SPF [18] uses a probabilistic combinatory categorial grammar (CCG) [8]. They still have the drawbacks of synonym based approaches. The mapping from templates to predicates is many-to-one, that is, each predicate in the knowledge base corresponds to multiple templates. For our work, we learned a total of 27 million different templates for 2782 predicates. This large number guarantees the wide coverage of template-based QA.

The procedure of learning the predicate of a template is as follows. First, for each QA pair in Yahoo! Answers, we extract the entity in the question and the corresponding value in the answer. Then, we find the predicate from the knowledge base by looking up the direct predicate connecting the entity and the value. Our basic idea is that, if most instances of a template share the same predicate, we map the template to this predicate. For example, suppose questions derived from the template how many people are there in $city? always map to the predicate population, no matter which specific $city it is. We can then conclude that, with a certain probability, the template maps to population. Learning templates that map to a complex knowledge base structure employs a similar process. The only difference is that we find “expanded predicates” that correspond to a path consisting of multiple edges which lead from an entity to a certain value (e.g., marriage → person → name).

1.4 Paper Organization

The rest of the paper is organized as follows. In Sec 2, we give an overview of KBQA. The major contribution of this paper is learning templates from QA corpora, and all technical parts are closely related to it. Sec 3 shows online question answering with templates. Sec 4 elaborates predicate inference for templates, which is the key step in using templates. Sec 5 extends our solution to answer complex questions. Sec 6 extends the ability of templates to infer complex predicates. We present experimental studies in Sec 7, discuss more related works in Sec 8, and conclude in Sec 9.

2 System Overview

In this section, we introduce some background knowledge and give an overview of KBQA. In Table 2, we list the notations used in this paper.

Table 2: Notations
Notation | Description | Notation | Description
q | question | s | subject
a | answer | p | predicate
QA | QA corpus | o | object
e | entity | K | knowledge base
v | value | c | category
t | template | p+ | expanded predicate
s1 ⊂ s2 | s1 is a substring of s2 | |
t(q, e, c) | template of q by conceptualizing e to c | θ^(s) | estimation of θ at iteration s

Binary factoid QA We focus on binary factoid questions (BFQs), that is, questions asking about a specific property of an entity. For example, all questions except f⃝ in Table 1 are BFQs.

RDF knowledge base Given a question, we find its answer in an RDF knowledge base. An RDF knowledge base K is a set of triples in the form of (s, p, o), where s, p, and o denote subject, predicate, and object respectively. Figure 1 shows a toy RDF knowledge base via an edge-labeled directed graph. Each (s, p, o) is represented by a directed edge from s to o labeled with predicate p. For example, the edge from the Barack Obama node to 1961 with label dob represents the RDF triple (Barack Obama, dob, 1961), which represents the knowledge of Barack Obama’s birthday.
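As a rough illustration of this data model, the toy knowledge base of Figure 1 can be held as a set of (s, p, o) triples with an index on (s, p). The node identifiers (`obama`, `m1`, `michelle`, `honolulu`) and the helper names `build_index`, `objects`, and `follow_path` are assumptions made for this sketch; only the predicates and values mentioned in the text are taken from the paper.

```python
from collections import defaultdict

# Toy triples loosely mirroring Figure 1; the node ids are hypothetical.
TRIPLES = {
    ("obama", "name", "Barack Obama"),
    ("obama", "dob", "1961"),
    ("obama", "pob", "honolulu"),
    ("obama", "marriage", "m1"),
    ("m1", "person", "michelle"),
    ("michelle", "name", "Michelle Obama"),
    ("michelle", "dob", "1964"),
    ("honolulu", "name", "Honolulu"),
    ("honolulu", "population", "390K"),
}


def build_index(triples):
    """Index objects by (subject, predicate) for constant-time lookup."""
    index = defaultdict(set)
    for s, p, o in triples:
        index[(s, p)].add(o)
    return index


def objects(index, s, p):
    """Return every o such that (s, p, o) is in the knowledge base."""
    return index.get((s, p), set())


def follow_path(index, s, path):
    """Follow a multi-edge path such as marriage -> person -> name."""
    frontier = {s}
    for p in path:
        frontier = {o for node in frontier for o in objects(index, node, p)}
    return frontier


INDEX = build_index(TRIPLES)
```

Here `objects(INDEX, "obama", "dob")` answers a single-edge lookup, while `follow_path(INDEX, "obama", ["marriage", "person", "name"])` walks the multi-edge “spouse of” structure described in the Figure 1 caption.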

Table 3: Sample QA Pairs from a QA Corpus
Id | Question | Answer
(q1, a1) | When was Barack Obama born? | The politician was born in 1961.
(q2, a2) | When was Barack Obama born? | He was born in 1961.
(q3, a3) | How many people are there in Honolulu? | It’s 390K.

QA corpora We learn question templates from Yahoo! Answers, which consists of 41 million QA pairs. The QA corpus is denoted by QA = {(q_1, a_1), (q_2, a_2), ..., (q_n, a_n)}, where q_i is a question and a_i is the reply to q_i. Each reply a_i consists of several sentences, and the exact factoid answer is contained in the reply. Table 3 shows a sample from a QA corpus.


Templates We derive a template t from a question q by replacing each entity e with one of e’s categories c. We denote this template as t = t(q, e, c). A question may contain multiple entities, and an entity may belong to multiple categories. We obtain the concept distribution of e through context-aware conceptualization [32]. For example, question q_1 in Table 3 contains the entity Barack Obama of Figure 1. Since Barack Obama belongs to two categories, $Person and $Politician, we can derive two templates from the question: When was $Person born? and When was $Politician born?.
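The derivation t(q, e, c) can be sketched in a few lines. This is only an illustration: the function name `derive_templates` is hypothetical, the category probabilities are made up, and a real system would obtain them from a conceptualization service rather than from a hand-written dictionary.

```python
def derive_templates(question, mention, category_dist):
    """Return {template: P(c | q, e)} by replacing `mention` with each category.

    `category_dist` maps a category name (without the leading '$') to its
    probability in the context of the question.
    """
    if mention not in question:
        return {}
    templates = {}
    for category, prob in category_dist.items():
        # Replace only the first occurrence of the entity mention.
        template = question.replace(mention, "$" + category, 1)
        templates[template] = templates.get(template, 0.0) + prob
    return templates


# Made-up category distribution for the entity "Barack Obama".
EXAMPLE_TEMPLATES = derive_templates(
    "When was Barack Obama born?",
    "Barack Obama",
    {"Person": 0.5, "Politician": 0.5},
)
```

With this input, `EXAMPLE_TEMPLATES` contains the two templates named in the text, each weighted by the (assumed) probability of its category.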

Figure 3: System Overview
System Architecture

Figure 3 shows the pipeline of our QA system, which consists of two major procedures:

  • Online procedure: When a question comes in, we first parse and decompose it into a series of binary factoid questions. The decomposition process is described in Sec 5. For each binary factoid question, we use a probabilistic inference approach to find its value, shown in Sec 3. The inference is based on the predicate distribution of given templates, i.e. P(p|t). This distribution is learned offline.

  • Offline procedure: The goal of the offline procedure is to learn the mapping from templates to predicates. This is represented by P(p|t), which is estimated in Sec 4. We also expand predicates in the knowledge base in Sec 6, so that we can learn more complex predicate forms (e.g., marriage → person → name in Figure 1).

3 Our Approach: KBQA

In this section, we first formalize our problem in a probabilistic framework in Sec 3.1. We present the details for most probability estimations in Sec 3.2, leaving only the estimation of P(p|t) to Sec 4. We elaborate the online procedure in Sec 3.3.

3.1 Problem Model

KBQA learns question answering by using a QA corpus and a knowledge base. Due to issues such as uncertainty (e.g. some questions’ intents are vague), incompleteness (e.g. the knowledge base is almost always incomplete), and noise (e.g. answers in the QA corpus may be wrong), we create a probabilistic model for QA over a knowledge base below. We highlight the uncertainty from the question’s intent to the knowledge base’s predicates [18]. For example, the question “where was Barack Obama from” is related to at least two predicates in Freebase: “place of birth” and “place lived location”. In DBpedia, who founded $organization? likewise relates to two distinct predicates.

Problem Definition 1

Given a question q, our goal is to find an answer v with maximal probability (v is a simple value):

\[ \arg\max_{v} P(V = v \mid Q = q) \tag{1} \]

To illustrate how a value is found for a given question, we propose a generative model. Starting from the user question q, we first generate/identify its entity e according to the distribution P(e|q). After knowing the question and the entity, we generate the template t according to the distribution P(t|q,e). The predicate p only depends on t, which enables us to infer the predicate p by P(p|t). Finally, given the entity e and the predicate p, we generate the answer value v by P(v|e,p). v can be directly returned or embedded in a natural language sentence as the answer a. We illustrate the generation procedure in Example 1, and show the dependency of these random variables in Figure 4. Based on the generative model, we compute P(q,e,t,p,v) in Eq (2). Now Problem 1 is reduced to Eq (3).

\[ P(q, e, t, p, v) = P(q)\,P(e \mid q)\,P(t \mid e, q)\,P(p \mid t)\,P(v \mid e, p) \tag{2} \]

\[ \arg\max_{v} \sum_{e, t, p} P(q, e, t, p, v) \tag{3} \]

Figure 4: Probabilistic Graph

Example 1

Consider the generative process of (q_3, a_3) in Table 3. Since the only entity in q_3 is “Honolulu”, we generate the entity node for Honolulu (in Figure 1) by P(e|q_3). By conceptualizing “Honolulu” to a city, we generate the template How many people are there in $city?. Note that the corresponding predicate of the template is always “population”, no matter which specific city it is. So we generate the predicate “population” by the distribution P(p|t). After generating the entity “Honolulu” and the predicate “population”, the value “390K” can be easily found from the knowledge base in Figure 1. Finally we use a natural language sentence a_3 as the answer.

Outline of the following subsections Given the above objective function, our problem is reduced to the estimation of each probability term in Eq (2). The term P(p|t) is estimated in the offline procedure in Sec 4. All other probability terms can be directly computed by off-the-shelf solutions (such as NER and conceptualization). We elaborate the calculation of these probabilities in Sec 3.2, and the online procedure in Sec 3.3.

3.2 Probability Computation

In this subsection, we compute each probability term in Eq (2) except P(p|t).

Entity distribution P(e|q) The distribution represents the entity identification from the question. We identify entities that meet both of the following conditions: (a) it is an entity in the question; (b) it is in the knowledge base. We use the Stanford Named Entity Recognizer [13] for (a). We then check whether it is an entity’s name in the knowledge base for (b). If there are multiple candidate entities, we simply give them uniform probability.

We optimize the computation of P(e|q) in the offline procedure by using q’s answer. As illustrated in Sec 4.1, we have already extracted a set of entity-value pairs EV_i for question q_i and answer a_i, where the values are from the answer. We assume the entities in EV_i have equal probability to be generated. So we obtain:

\[ P(e \mid q_i) = \frac{[\exists v,\ (e, v) \in EV_i]}{|\{e' \mid \exists v,\ (e', v) \in EV_i\}|} \tag{4} \]

where [·] is the Iverson bracket. As shown in Sec 7.5, this approach is more accurate than directly using the NER approach.
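A minimal sketch of this offline entity distribution, assuming the entity-value pairs are available as a Python set of `(entity, value)` tuples (the function name `entity_distribution` is hypothetical):

```python
def entity_distribution(ev_pairs):
    """P(e | q_i): uniform over the distinct entities that occur in EV_i."""
    entities = {e for e, _ in ev_pairs}
    if not entities:
        return {}
    return {e: 1.0 / len(entities) for e in entities}
```

An entity that appears in several pairs is still counted once, which matches the denominator of distinct entities rather than the number of pairs.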

Template distribution A template is in the form of When was $person born?. In other words, it is a question with the mention of an entity (e.g., “Barack Obama”) replaced by the category of the entity (e.g., $person).

Let t = t(q, e, c) indicate that template t is obtained by replacing entity e in q by e’s category c. Thus, we have

\[ P(t \mid q, e) = P(c \mid q, e) \tag{5} \]

where P(c|q,e) is the category distribution of e in context q. In our work, we directly apply the conceptualization method in [25] to compute P(c|q,e).

Value (answer) distribution P(v|e,p) For an entity e and a predicate p of e, it is easy to find the predicate value v by looking up the knowledge base. For example, in Figure 1, let entity e = Barack Obama and predicate p = dob. We easily get Obama’s birthday, 1961, from the knowledge base. In this case, we have P(1961 | Barack Obama, dob) = 1, since Barack Obama only has one birthday. Some predicates may have multiple values (e.g., the children of Barack Obama). In this case, we assume uniform probability for all possible values. More formally, we compute P(v|e,p) by

\[ P(v \mid e, p) = \frac{[(e, p, v) \in K]}{|\{(e, p, v') \mid (e, p, v') \in K\}|} \tag{6} \]

3.3 Online Procedure

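In the online procedure, we are given a user question q. We can compute P(v|q) by Eq (7), and we return \arg\max_v P(v|q) as the answer.

\[ P(v \mid q) = \sum_{e, t, p} P(v \mid e, p)\,P(p \mid t)\,P(t \mid e, q)\,P(e \mid q) \tag{7} \]

where P(p|t) is derived from offline learning in Sec 4, and the other probability terms are computed in Sec 3.2.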
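The enumeration behind Eq (7) can be sketched as four nested loops over entities, templates, predicates, and values. Everything below is a toy stand-in: the dictionary layout of `kb` and `theta`, the callables `entity_dist` and `template_dist`, the extra predicate `country`, and all probabilities are assumptions made for illustration, not the system's actual components.

```python
from collections import defaultdict


def value_distribution(kb, entity, predicate):
    """P(v | e, p): uniform over the values stored for (e, p)."""
    values = kb.get((entity, predicate), [])
    return {v: 1.0 / len(values) for v in values}


def answer(question, entity_dist, template_dist, theta, kb):
    """Score each value v by summing P(e|q) P(t|e,q) P(p|t) P(v|e,p)."""
    scores = defaultdict(float)
    for e, p_e in entity_dist(question).items():
        for t, p_t in template_dist(question, e).items():
            for p, p_p in theta.get(t, {}).items():
                for v, p_v in value_distribution(kb, e, p).items():
                    scores[v] += p_e * p_t * p_p * p_v
    if not scores:
        return None, {}
    best = max(scores, key=scores.get)
    return best, dict(scores)


# Toy inputs (all hypothetical).
KB = {("Honolulu", "population"): ["390K"], ("Honolulu", "country"): ["USA"]}
THETA = {"How many people are there in $city?": {"population": 0.9, "country": 0.1}}


def toy_entities(q):
    return {"Honolulu": 1.0} if "Honolulu" in q else {}


def toy_templates(q, e):
    return {q.replace(e, "$city"): 1.0}


BEST, SCORES = answer(
    "How many people are there in Honolulu?", toy_entities, toy_templates, THETA, KB
)
```

The loop over `theta.get(t, {})` only visits predicates with non-zero P(p|t) for the template at hand, which is the term that dominates the cost of the online procedure.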

Complexity of Online Procedure: In the online procedure, we enumerate q’s entities, templates, predicates, and values in order. We treat the number of entities per question, the number of concepts per entity, and the number of values per entity-predicate pair as constants. So the complexity of the online procedure is O(|P|), which is caused by the enumeration over predicates. Here |P| is the number of distinct predicates in the knowledge base.

4 Predicate Inference

In this section, we present how we infer predicates from templates, i.e., the estimation of P(p|t). We treat the distribution P(P|T) as parameters and then use the maximum likelihood (ML) estimator to estimate P(P|T). To do this, we first formulate the likelihood of the observed data (i.e., QA pairs in the corpora) in Sec 4.1. Then we present the parameter estimation and its algorithmic implementation in Sec 4.2 and Sec 4.3, respectively.

4.1 Likelihood Formulation

Rather than directly formulating the likelihood of observing the QA corpora (L_QA), we first formulate a simpler case, the likelihood of a set of question-entity-value triples extracted from the QA pairs. Then we build the relationship between the two likelihoods. The indirect formulation is well motivated. An answer in QA is usually a complicated natural language sentence containing the exact value and many other tokens. Most of these tokens are meaningless in indicating the predicate and bring noise to the observation. Moreover, directly modeling the complete answer in a generative model is difficult, while modeling the value in a generative model is much easier.

Next, we first extract entity-value pairs from a given QA pair in Sec 4.1.1, which allows us to formalize the likelihood of question-entity-value triples (X). We then establish the relationship between the likelihood of the QA corpora and the likelihood of X in Eq (13), Sec 4.1.2.

4.1.1 Entity-Value Extraction

Our principle for extracting candidate values from the answer is that a valid entity-value pair usually has some corresponding relationship in the knowledge base. Following this principle, we identify the candidate entity-value pairs from (q, a):

\[ EV(q, a) = \{(e, v) \mid e \subset q,\ v \subset a,\ \exists p,\ (e, p, v) \in K\} \tag{8} \]

where ⊂ means “is a substring of”. We illustrate this in Example 2.

Example 2

Consider (q_1, a_1) in Table 3. Many tokens (e.g. The, was, in) in the answer are useless. We extract the valid value 1961 by noticing that the entity Barack Obama in q_1 and 1961 are connected by the predicate “dob” in Figure 1. Note that we also extract the noise value politician in this step. We will show how to filter it in the refinement step below.
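The extraction step of this example can be sketched with plain substring tests against a small triple set. The node id `obama`, the `names` mapping, and the `category` predicate used to produce the noisy pair are assumptions of this sketch; matching is case-insensitive and deliberately naive.

```python
def extract_entity_value_pairs(question, answer, triples, names):
    """EV(q, a): pairs (e, v) with e mentioned in q, v in a, and (e, p, v) in K."""
    q, a = question.lower(), answer.lower()
    pairs = set()
    for s, _p, o in triples:
        mention = names.get(s, s)  # surface form of the subject node
        if mention.lower() in q and o.lower() in a:
            pairs.add((s, o))
    return pairs


# Toy knowledge base fragment (hypothetical node ids and category edge).
TOY_TRIPLES = {
    ("obama", "dob", "1961"),
    ("obama", "pob", "Honolulu"),
    ("obama", "category", "politician"),
    ("honolulu", "population", "390K"),
}
TOY_NAMES = {"obama": "Barack Obama", "honolulu": "Honolulu"}

EV_1 = extract_entity_value_pairs(
    "When was Barack Obama born?",
    "The politician was born in 1961.",
    TOY_TRIPLES,
    TOY_NAMES,
)
```

On this input the sketch returns both the valid pair for 1961 and the noisy pair for politician, which is exactly the situation the refinement step is meant to clean up.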

Refinement of EV(q, a) We need to filter the noisy pairs in EV(q, a), e.g. (Barack Obama, politician) in Example 2. The intuition is that the correct value and the question should have the same category. Here the category of a question means the expected answer type of the question. This has been studied as the question classification problem [22]. We use the UIUC taxonomy [20]. For question categorization, we use the approach proposed in [22]. For the categorization of a value, we refer to the category of its predicate. The predicates’ categories are manually labeled. This is feasible since there are only a few thousand predicates.

4.1.2 Likelihood Function

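After the entity-value extraction, each QA pair (q_i, a_i) is transformed into a question and a set of entity-value pairs, i.e., EV_i = EV(q_i, a_i). By assuming the independence of these entity-value pairs, the probability of observing such a QA pair is shown in Eq (9). Thus, we compute the likelihood of the entire QA corpora in Eq (10).

\[ P(q_i, a_i) = P(q_i, EV_i) = P(q_i) \prod_{(e, v) \in EV_i} P(e, v \mid q_i) \tag{9} \]

\[ L_{QA} = \prod_{i=1}^{n} \Bigl[ P(q_i) \prod_{(e, v) \in EV_i} P(e, v \mid q_i) \Bigr] \tag{10} \]
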
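By assuming each question has an equal probability to be generated, i.e. P(q_i) = α, we have:

\[ L_{QA} = \prod_{i=1}^{n} \Bigl[ P(q_i)^{1 - |EV_i|} \prod_{(e, v) \in EV_i} P(e, v, q_i) \Bigr] = \beta \prod_{i=1}^{n} \prod_{(e, v) \in EV_i} P(e, v, q_i) \tag{11} \]

where β = α^{n − Σ_i |EV_i|} can be considered as a constant. Eq (11) implies that L_QA is proportional to the likelihood of these question-entity-value triples. Let X be the set of such triples that are extracted from the QA corpora:

\[ X = \{(q_i, e, v) \mid (q_i, a_i) \in QA,\ (e, v) \in EV_i\} \tag{12} \]

We denote the i-th term in X as x_i = (q_i, e_i, v_i), so X = {x_1, ..., x_m}. Thus we establish the linear relationship between the likelihood of QA and the likelihood of X.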
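
\[ L_{QA} = \beta L_X, \quad \text{where } L_X = \prod_{i=1}^{m} P(x_i) \tag{13} \]

Now, maximizing the likelihood of QA is equivalent to maximizing the likelihood of X. Using the generative model in Eq (2), we calculate P(x_i) by marginalizing the joint probability P(q, e, t, p, v) over all templates t and all predicates p. The likelihood is shown in Eq (14). We illustrate the entire process in Figure 4.

\[ L_X = \prod_{i=1}^{m} \sum_{p \in P,\, t \in T} P(q_i)\,P(e_i \mid q_i)\,P(t \mid e_i, q_i)\,P(p \mid t)\,P(v_i \mid e_i, p) \tag{14} \]
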
4.2 Parameter Estimation

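Goal: In this subsection, we estimate P(p|t) by maximizing Eq (14). We denote the distribution P(P|T) as parameter θ and its corresponding log-likelihood as L(θ), and we denote the probability P(p|t) as θ_pt. So we estimate θ by:

\[ \hat{\theta} = \arg\max_{\theta} L(\theta) \tag{15} \]

where

\[ L(\theta) = \sum_{i=1}^{m} \log \sum_{p \in P,\, t \in T} P(q_i)\,P(e_i \mid q_i)\,P(t \mid e_i, q_i)\,\theta_{pt}\,P(v_i \mid e_i, p) \tag{16} \]
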
Intuition of EM Estimation:

We notice that some random variables (e.g. predicate and template) are latent in the proposed probabilistic model, which motivates us to use the Expectation-Maximization (EM) algorithm to estimate the parameters. The EM algorithm is a classical approach to finding the maximum likelihood estimates of parameters in a statistical model with unobserved variables. The ultimate objective is to maximize the log-likelihood L(θ). However, it involves a logarithm of a sum and is computationally hard. Hence, we instead resort to maximizing one of its lower bounds [7], i.e., the Q-function Q(θ; θ^(s)). To define the Q-function, we leverage the likelihood of the complete data L(Z, θ). The EM algorithm maximizes L(θ) by maximizing the lower bound Q(θ; θ^(s)) iteratively. In the s-th iteration, the E-step computes Q(θ; θ^(s)) for the given parameters θ^(s), and the M-step estimates the parameters θ^(s+1) (parameters in the next iteration) that maximize the lower bound.

Likelihood of Complete Data: Directly maximizing L(θ) is computationally hard, since the function involves a logarithm of a sum. Intuitively, if we know the complete data of each observed triple, i.e., which template and predicate it is generated with, the estimation becomes much easier. We thus introduce a hidden variable z_i for each observed triple x_i. The value of z_i is a pair of predicate and template, i.e. z_i = (p, t), indicating that x_i is generated with predicate p and template t. Note that we consider predicate and template together since they are not independent in the generation. Hence, P(z_i = (p, t)) is the probability that x_i is generated with predicate p and template t.

We denote Z = {z_1, ..., z_m}. X and Z together form the complete data. The log-likelihood of observing the complete data is:

\[ L(Z, \theta) = \log P(X, Z \mid \theta) = \sum_{i=1}^{m} \log P(x_i, z_i \mid \theta) \tag{17} \]

where

\[ P(x_i, z_i = (p, t) \mid \theta) = f(x_i, z_i)\,\theta_{pt} \tag{18} \]

\[ f(x = (q, e, v),\ z = (p, t)) = P(q)\,P(e \mid q)\,P(t \mid e, q)\,P(v \mid e, p) \tag{19} \]

As discussed in Sec 3.2, f(·) can be computed independently before the estimation of P(p|t). So we treat it as a known factor.

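Q-function: Instead of optimizing L(θ) directly, we define the “Q-function” in Eq (20), which is the expectation of the complete observation’s likelihood. Here θ^(s) is the estimation of θ at iteration s. According to Theorem 1, when treating θ^(s) as a constant, Q(θ; θ^(s)) provides a lower bound for L(θ). Thus we try to improve Q(θ; θ^(s)) rather than directly improve L(θ).

\[ Q(\theta; \theta^{(s)}) = \sum_{i=1}^{m} \sum_{p \in P,\, t \in T} P(z_i = (p, t) \mid X, \theta^{(s)}) \log P(x_i, z_i = (p, t) \mid \theta) \tag{20} \]

Theorem 1 (Lower bound [10])

L(θ) ≥ Q(θ; θ^(s)) + c, where c only relies on θ^(s) and can be treated as a constant for L(θ).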

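In the E-step, we compute Q(θ; θ^(s)). For each P(z_i = (p, t) | X, θ^(s)) in Eq (20), we have:

\[ P(z_i = (p, t) \mid X, \theta^{(s)}) \propto f(x_i, z_i)\,\theta^{(s)}_{pt} \tag{21} \]

In the M-step, we maximize the Q-function. By using a Lagrange multiplier, we obtain θ^(s+1) in Eq (22).

\[ \theta^{(s+1)}_{pt} \propto \sum_{i=1}^{m} P(z_i = (p, t) \mid X, \theta^{(s)}) \tag{22} \]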
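The Lagrange-multiplier step can be spelled out in a few lines. The derivation below is a sketch under two assumptions stated here rather than in the original: w_i(p, t) is shorthand for the E-step posterior P(z_i = (p, t) | X, θ^(s)), and the parameters satisfy Σ_p θ_pt = 1 for every template t.

```latex
\begin{align*}
Q(\theta;\theta^{(s)}) &= \sum_{i=1}^{m}\sum_{p,t} w_i(p,t)\,\bigl[\log f(x_i,(p,t)) + \log\theta_{pt}\bigr],\\
\Lambda(\theta,\lambda) &= Q(\theta;\theta^{(s)}) + \sum_{t}\lambda_t\Bigl(1-\sum_{p}\theta_{pt}\Bigr),\\
\frac{\partial\Lambda}{\partial\theta_{pt}} &= \frac{\sum_{i} w_i(p,t)}{\theta_{pt}} - \lambda_t = 0
\;\Longrightarrow\; \theta_{pt} = \frac{\sum_{i} w_i(p,t)}{\lambda_t},\\
\sum_{p}\theta_{pt} = 1 &\;\Longrightarrow\; \lambda_t = \sum_{p'}\sum_{i} w_i(p',t),
\qquad
\theta^{(s+1)}_{pt} = \frac{\sum_{i} w_i(p,t)}{\sum_{p'}\sum_{i} w_i(p',t)}.
\end{align*}
```

The term log f(x_i, (p, t)) does not depend on θ, so it drops out of the derivative; the update is simply the expected count of (p, t) normalized over the predicates of the same template.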


4.3 Implementation

Now we elaborate the implementation of the EM algorithm (in Algorithm 1), which consists of three steps: initialization, E-step, and M-step.

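Initialization: To avoid P(z_i = (p, t) | X, θ^(s)) in Eq (21) being all zero, we require that θ^(0) is uniformly distributed over all pairs (p, t) for which some x_i satisfies f(x_i, z_i = (p, t)) > 0. So we have:

\[ \theta^{(0)}_{pt} = \frac{[\exists i,\ f(x_i, z_i = (p, t)) > 0]}{|\{p' \mid \exists i,\ f(x_i, z_i = (p', t)) > 0\}|} \tag{23} \]
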
E-step: We enumerate all z_i and compute P(z_i | X, θ^(s)) by Eq (21). Its complexity is O(m).

M-step: We compute Σ_i P(z_i = (p, t) | X, θ^(s)) for each θ_pt. The direct computation costs O(m|P||T|) time since we need to enumerate all possible templates and predicates. Next, we reduce it to O(m) by only enumerating a constant number of templates and predicates for each i.

We notice that only z_i = (p, t) with P(z_i = (p, t) | X, θ^(s)) > 0 needs to be considered. Due to Eq (18) and Eq (19), this implies:

\[ P(t \mid e_i, q_i) > 0, \qquad P(v_i \mid e_i, p) > 0 \tag{24} \]

With P(t | e_i, q_i) > 0, we prune the enumeration of templates. P(t | e_i, q_i) > 0 implies that we only enumerate the templates which are derived by conceptualizing e_i in q_i. The number of concepts for e_i is obviously upper bounded and can be considered as a constant. Hence, the total number of templates enumerated in Line 7 is O(m). With P(v_i | e_i, p) > 0, we prune the enumeration of predicates. P(v_i | e_i, p) > 0 implies that only predicates connecting e_i and v_i in the knowledge base need to be enumerated. The number of such predicates can also be considered as a constant. So the complexity of the M-step is O(m).

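Algorithm 1 EM Algorithm for Predicate Inference
Data: X;
Result: P(p|t);
1  Initialize the iteration counter s ← 0;
2  Initialize the parameter θ^(0) by Eq (23);
3  while θ not converged do
       // E-step
4      for i = 1 ... m do
5          Estimate P(z_i | X, θ^(s)) by Eq (21);
       // M-step
7      for i = 1 ... m do
8          for all t ∈ T for q_i, e_i with P(t | e_i, q_i) > 0 do
9              for all p ∈ P with P(v_i | e_i, p) > 0 do
10                 θ^(s+1)_pt ← θ^(s+1)_pt + P(z_i = (p, t) | X, θ^(s));
14     Normalize θ^(s+1)_pt as in Eq (22);
15     s ← s + 1;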

Overall Complexity of EM algorithm: Suppose we repeat the EM iterations k times; the overall complexity is then O(km).
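The following is a compact sketch of the EM loop, not a transcription of Algorithm 1. It assumes that each observation x_i has already been reduced to a dictionary mapping its candidate pairs (p, t) to the precomputed factor f(x_i, (p, t)) > 0, so the pruning to a constant number of candidates per observation is built into the input. The function name, the fixed iteration count, and the toy data (including the competing predicate `area`) are all hypothetical.

```python
from collections import defaultdict


def em_predicate_inference(observations, iterations=20):
    """Estimate theta[(p, t)] = P(p | t) from per-observation candidate factors.

    `observations` is a list of dicts {(p, t): f}, one dict per triple x_i.
    """
    # Initialization: uniform over the predicates seen with each template.
    preds_of = defaultdict(set)
    for candidates in observations:
        for p, t in candidates:
            preds_of[t].add(p)
    theta = {(p, t): 1.0 / len(ps) for t, ps in preds_of.items() for p in ps}

    for _ in range(iterations):
        counts = defaultdict(float)
        for candidates in observations:
            # E-step: posterior of z_i = (p, t), proportional to f * theta.
            weights = {z: f * theta.get(z, 0.0) for z, f in candidates.items()}
            total = sum(weights.values())
            if total == 0.0:
                continue
            for z, w in weights.items():
                counts[z] += w / total
        # M-step: normalize the expected counts per template.
        norm = defaultdict(float)
        for (p, t), c in counts.items():
            norm[t] += c
        theta = {(p, t): c / norm[t] for (p, t), c in counts.items() if norm[t] > 0.0}
    return theta


# Toy data: three questions of one template; one of them is also explained
# by a competing predicate whose value happens to match.
T = "how many people are there in $city?"
OBS = [
    {("population", T): 1.0},
    {("population", T): 1.0},
    {("population", T): 1.0, ("area", T): 1.0},
]
THETA_HAT = em_predicate_inference(OBS)
```

Each iteration touches every observation once and only its own candidates, so the cost per iteration is linear in the number of observations, in line with the complexity argument above. On the toy data, the mass assigned to the competing predicate shrinks at every iteration because two of the three observations can only be explained by `population`.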

5 Answering Complex Questions

In this section, we elaborate how we answer complex questions. We first formalize the problem as an optimization problem in Sec 5.1. Then, we elaborate the optimization metric and our algorithm in Sec 5.2 and Sec 5.3, respectively.

5.1 Problem Statement

We focus on the complex questions which are composed of a sequence of BFQs. For example, question f⃝ in Table 1 can be decomposed into two BFQs: (1) Barack Obama’s wife (Michelle Obama); (2) When was Michelle Obama born? (1964). Clearly, the answering of the second question relies on the answer of the first question.

A divide-and-conquer framework can be naturally leveraged to answer complex questions: (1) we first decompose the question into a sequence of BFQs, (2) then we answer each BFQ sequentially. Since we have shown how to answer BFQ in Sec 3, the key issue in this section is the decomposition.

We highlight that in the decomposed question sequence, each question except the first one is a question string with an entity variable. The question sequence can only be materialized after the variable is assigned a specific entity, which is the answer of the immediately preceding question. Continuing the example above, the second question When was Michelle Obama born? is When was $e born? in the question sequence. $e here is the variable representing the answer of the first question Barack Obama’s wife. Hence, given a complex question q, we need to decompose it into a sequence of questions A = (q̌_0, q̌_1, ..., q̌_k) such that:

  • Each q̌_i (i > 0) is a BFQ with entity variable $e, whose value is the answer of q̌_{i−1}.

  • q̌_0 is a BFQ whose entity is equal to the entity of q.

Example 3 (Question sequence)

Consider the question f⃝ in Table 1. One natural question sequence is q̌_0 = Barack Obama’s wife and q̌_1 = When was $e born?. We can also substitute an arbitrary substring to construct the question sequence, such as q̌'_0 = was Barack Obama’s wife born and q̌'_1 = When $e?. However, the latter question sequence is invalid since q̌'_0 is neither an answerable question nor a BFQ.

Given a complex question, we construct a question sequence in a recursive way. We first replace a substring with an entity variable. If the substring is a BFQ that can be directly answered, it is q̌_0. Otherwise, we continue the above procedure on the substring until we meet a BFQ or the substring is a single word. However, as shown in Example 3, many question decompositions are not valid (answerable). Hence, we need to measure how likely a decomposition sequence is to be answerable. More formally, let A(q) be the set of all possible decompositions of q. For a decomposition A ∈ A(q), let P(A) be the probability that A is a valid (answerable) question sequence. Our problem is thus reduced to

\[ \arg\max_{A \in \mathcal{A}(q)} P(A) \tag{25} \]

Next, we elaborate the estimation of P(A) and how we solve the optimization problem efficiently in Sec 5.2 and 5.3, respectively.

5.2 Metric

The basic intuition is that is a valid question sequence if each individual question is valid. Hence, we first estimate (the probability that is a valid question), and then aggregate each to compute .

We use QA corpora to estimate \(P(\check{q}_i)\). \(\check{q}\) is a BFQ with entity variable $e. A question \(q\) matches \(\check{q}\) if we can get \(\check{q}\) by replacing a substring of \(q\) with $e. We say the match is valid if the replaced substring is a mention of the entity in \(q\). For example, When was Michelle Obama born? matches When was $e born? and When was $e?. However, only the former one is valid since only Michelle Obama is an entity. We denote the number of all questions in the QA corpora that match \(\check{q}\) as \(f_o(\check{q})\), and the number of questions that validly match \(\check{q}\) as \(f_v(\check{q})\). Both \(f_o(\check{q})\) and \(f_v(\check{q})\) are counted over the QA corpora. We estimate \(P(\check{q})\) by:

\[ P(\check{q}) = \frac{f_v(\check{q})}{f_o(\check{q})} \tag{26} \]

The rationale is clear: the more valid matches \(\check{q}\) has, the more likely it is answerable. \(f_o(\check{q})\) is used to penalize over-generalized question patterns. We show an example of \(P(\check{q})\) below.

Example 4

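Suppose \(\check{q}_1\) = When was $e born? and \(\check{q}_2\) = When $e?, and the QA corpora are as shown in Table 2. Clearly, \(q_1\) satisfies the patterns of both \(\check{q}_1\) and \(\check{q}_2\). However, only \(\check{q}_1\) is a valid pattern for \(q_1\), since when matching \(q_1\) to \(\check{q}_1\), the replaced substring corresponds to a valid entity “Barack Obama”. Thus we have \(f_v(\check{q}_1) = f_o(\check{q}_1)\). However, \(f_v(\check{q}_2) = 0\). Due to Eq (26), \(P(\check{q}_1) = 1\), \(P(\check{q}_2) = 0\).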
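The counting in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the corpus below is a made-up toy (it is not Table 2), each question is paired with a hand-written set of entity mentions, and returning 0 when a pattern matches nothing is a convention chosen here, not something the paper specifies.

```python
from typing import List, Set, Tuple

Corpus = List[Tuple[str, Set[str]]]  # (question, entity mentions in it)


def match_counts(pattern: str, corpus: Corpus) -> Tuple[int, int]:
    """Return (all matches, valid matches) for a pattern with one "$e"."""
    prefix, suffix = pattern.split("$e")
    all_matches = valid_matches = 0
    for question, mentions in corpus:
        # The replaced substring must be non-empty.
        if len(question) <= len(prefix) + len(suffix):
            continue
        if not (question.startswith(prefix) and question.endswith(suffix)):
            continue
        replaced = question[len(prefix):len(question) - len(suffix)]
        all_matches += 1
        if replaced in mentions:  # the match is valid only for an entity mention
            valid_matches += 1
    return all_matches, valid_matches


def validity(pattern: str, corpus: Corpus) -> float:
    """Ratio of valid matches to all matches (0.0 if nothing matches)."""
    all_matches, valid_matches = match_counts(pattern, corpus)
    return valid_matches / all_matches if all_matches else 0.0


# Hypothetical toy corpus.
toy_corpus: Corpus = [
    ("When was Barack Obama born?", {"Barack Obama"}),
    ("When was Michelle Obama born?", {"Michelle Obama"}),
    ("How many people are there in Honolulu?", {"Honolulu"}),
]

print(validity("When was $e born?", toy_corpus))
print(validity("When $e?", toy_corpus))
```

On this toy corpus the over-generalized pattern When $e? matches the same questions as When was $e born?, but none of its matches replace an entity mention, so its estimated validity is zero.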

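Given each \(P(\check{q}_i)\), we define \(P(A)\). We assume that the events of each \(\check{q}_i\) in \(A\) being valid are independent. A question sequence \(A\) is valid if and only if all \(\check{q}_i\) in it are valid. So we compute \(P(A)\) by:

\[ P(A) = \prod_{\check{q}_i \in A} P(\check{q}_i) \tag{27} \]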

5.3 Algorithm

Given \(P(A)\), our goal is to find the question sequence maximizing \(P(A)\). This is not trivial due to the huge search space. Consider a complex question \(q\) of length \(|q|\), i.e., the number of words in \(q\). There are overall \(O(|q|^2)\) substrings of \(q\). If \(q\) is finally decomposed into \(k\) sub-questions, the entire search space will be \(O(|q|^{2k})\), which is unacceptable. In this paper, we propose a dynamic programming based solution to our optimization problem, with complexity \(O(|q|^4)\). Our solution is built upon the local optimality property of the optimization problem. We establish this property in Theorem 2.

Theorem 2 (Local Optimality)

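Given a complex question \(q\), let \(A^*(q) = (\check{q}^*_0, \ldots, \check{q}^*_k)\) be the optimal decomposition of \(q\). Then for every \(q_i \in A^*(q)\), there exists \(A^*(q_i) \subset A^*(q)\) such that \(A^*(q_i)\) is the optimal decomposition of \(q_i\).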

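Theorem 2 suggests a dynamic programming (DP) algorithm. Consider a substring \(q_i\) of \(q\). \(q_i\) is either (1) a primitive BFQ (non-decomposable) or (2) a string that can be further decomposed. For case (1), \(A^*(q_i)\) contains a single element, i.e., \(q_i\) itself. For case (2), \(A^*(q_i) = A^*(q_j) \oplus r(q_i, q_j)\), where \(q_j \subset q_i\) is the one with the maximal \(P(r(q_i, q_j)) P(A^*(q_j))\), \(\oplus\) is the operation that appends a question at the end of a question sequence, and \(r(q_i, q_j)\) is the question generated by replacing \(q_j\) in \(q_i\) with a placeholder “$e”. Thus, we derive the dynamic programming equation:

\[ P(A^*(q_i)) = \max\Big\{ \delta(q_i),\ \max_{q_j \subset q_i} P(r(q_i, q_j))\, P(A^*(q_j)) \Big\} \tag{28} \]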

where \(\delta(q_i)\) is the indicator function that determines whether \(q_i\) is a primitive BFQ. That is, \(\delta(q_i) = 1\) when \(q_i\) is a primitive BFQ, and \(\delta(q_i) = 0\) otherwise.

Algorithm 2 outlines our dynamic programming algorithm. We enumerate all substrings \(q_i\) of \(q\) in the outer loop (Line 1). Within each loop, we first initialize \(A^*(q_i)\) and \(P(A^*(q_i))\) (Line 2-4). In the inner loop, we enumerate all substrings \(q_j\) of \(q_i\) (Line 5), and update \(A^*(q_i)\) and \(P(A^*(q_i))\) (Line 7-9). Note that we enumerate all \(q_i\)s in ascending order of their lengths, which ensures that \(P(A^*(q_j))\) and \(A^*(q_j)\) are known for each enumerated \(q_j\).

The complexity of Algorithm 2 is \(O(|q|^4)\), since both loops enumerate \(O(|q|^2)\) substrings. In our QA corpora, over 99% of the questions contain fewer than 23 words (\(|q| < 23\)). So this complexity is acceptable.

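Data: \(q\);
Result: \(A^*(q)\);
1 for each substring \(q_i\) of \(q\), with length from 1 to \(|q|\) do
2        \(P(A^*(q_i)) \leftarrow \delta(q_i)\);
3        if \(\delta(q_i) = 1\) then
4              \(A^*(q_i) \leftarrow \{q_i\}\);
5        for each substring \(q_j\) of \(q_i\) do
6               \(r(q_i, q_j) \leftarrow\) Replace \(q_j\) in \(q_i\) with “$e”;
7               if \(P(A^*(q_i)) < P(r(q_i, q_j))\, P(A^*(q_j))\) then
8                      \(A^*(q_i) \leftarrow A^*(q_j) \oplus r(q_i, q_j)\);
9                      \(P(A^*(q_i)) \leftarrow P(r(q_i, q_j))\, P(A^*(q_j))\);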
Algorithm 2 Complex Question Decomposition
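As a rough illustration of Algorithm 2, the sketch below runs the same dynamic program over word spans in Python. It is a simplified reading rather than the authors' implementation: `validity` (the estimated probability that a pattern with $e is a valid question) and `is_primitive` (the indicator for primitive BFQs) are hypothetical callables supplied by the caller, substrings are taken at word boundaries, and only proper sub-spans are tried in the inner loop.

```python
from typing import Callable, Dict, List, Tuple

Span = Tuple[int, int]


def decompose(question: str,
              validity: Callable[[str], float],
              is_primitive: Callable[[str], bool]) -> Tuple[float, List[str]]:
    """Return (probability, question sequence) of the best decomposition."""
    words = question.split()
    n = len(words)
    best: Dict[Span, Tuple[float, List[str]]] = {}
    # Outer loop: spans in ascending order of length, so every shorter
    # span has already been solved when it is needed.
    for length in range(1, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            q_i = " ".join(words[i:j])
            # Initialization: a primitive BFQ is its own decomposition.
            best[(i, j)] = (1.0, [q_i]) if is_primitive(q_i) else (0.0, [])
            # Inner loop: try replacing each proper sub-span with $e.
            for a in range(i, j):
                for b in range(a + 1, j + 1):
                    if (a, b) == (i, j):
                        continue
                    p_sub, seq_sub = best[(a, b)]
                    if p_sub == 0.0:
                        continue
                    r = " ".join(words[i:a] + ["$e"] + words[b:j])
                    p = validity(r) * p_sub
                    if p > best[(i, j)][0]:
                        best[(i, j)] = (p, seq_sub + [r])
    return best[(0, n)]


# Hypothetical toy inputs.
toy_validity = {"When was $e born?": 1.0}
toy_primitives = {"Barack Obama's wife"}

prob, sequence = decompose(
    "When was Barack Obama's wife born?",
    lambda pattern: toy_validity.get(pattern, 0.0),
    lambda text: text in toy_primitives,
)
print(prob, sequence)
```

The table `best` plays the role of the stored optimal decompositions and their probabilities. Both loops range over quadratically many spans, which mirrors the complexity argument above; the string joins add extra cost that a tuned implementation would avoid.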

6 Predicate Expansion

In a knowledge base, many facts are not expressed by a direct predicate, but by a path consisting of multiple predicates. As shown in Figure 1, the “spouse of” relationship is represented by a path of three predicates. We denote these multi-predicate paths as expanded predicates. Answering questions over expanded predicates greatly improves the coverage of KBQA.

Definition 1 (Expanded Predicate)

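An expanded predicate \(p^+\) is a predicate sequence \(p^+ = (p_1, \ldots, p_k)\). We refer to \(k\) as the length of \(p^+\). We say \(p^+\) connects subject \(s\) and object \(o\) if there exists a sequence of subjects \((s_1, s_2, \ldots, s_k)\) such that \(s_1 = s\), \((s_i, p_i, s_{i+1}) \in \mathcal{K}\) for all \(1 \le i < k\), and \((s_k, p_k, o) \in \mathcal{K}\). Similar to \((s, p, o) \in \mathcal{K}\) representing that \(p\) connects \(s\) and \(o\), we denote \(p^+\) connecting \(s\) and \(o\) as \((s, p^+, o) \in \mathcal{K}\).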

The KBQA model proposed in Section 3 in general is flexible for expanded predicates. We only need some slight changes for the adaptation. In Sec 6.1, we show such adaptation. Then we show how to scale expanded predicates for billion scale knowledge bases in Sec 6.2. Finally, we show how to select a reasonable predicate length to pursue highest effectiveness in Sec 6.3.

6.1 KBQA for Expanded Predicates

Recall that the framework of KBQA for a single predicate consists of two major parts. In the offline part, we compute \(P(p|t)\), the predicate distribution for given templates; in the online part, we extract the question’s template \(t\), and compute the predicate through \(P(p|t)\). When changing \(p\) to \(p^+\), we make the following adjustments:

In the offline part, we learn question templates for expanded predicates, i.e., computing \(P(p^+|t)\). The computation of \(P(p^+|t)\) only relies on knowing whether \((e, p^+, v)\) is in \(\mathcal{K}\). We can compute this cardinality if we generate all \((e, p^+, v) \in \mathcal{K}\). We show this generation process in Sec 6.2.

In the online part, we use expanded predicates to answer questions. We compute \(P(v|e, p^+)\) by exploring the RDF knowledge base starting from \(e\) and going through \(p^+\). For example, for the three-predicate “spouse of” path in Figure 1, we start the traversal from the subject node and follow the predicates of the path one at a time through the intermediate nodes until we reach the value.
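The online traversal can be sketched as a simple path-following routine over an indexed triple store. This is a minimal Python sketch under stated assumptions: the helper names `build_index` and `follow_path` are hypothetical, and the node identifiers and predicate labels in the toy knowledge base are placeholders chosen for illustration rather than a copy of Figure 1.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)
Index = Dict[Tuple[str, str], Set[str]]


def build_index(triples: Iterable[Triple]) -> Index:
    """Index objects by (subject, predicate) for constant-time lookups."""
    index: Index = defaultdict(set)
    for s, p, o in triples:
        index[(s, p)].add(o)
    return index


def follow_path(index: Index, entity: str, path: List[str]) -> Set[str]:
    """Return all values reachable from `entity` through the predicate path."""
    frontier = {entity}
    for predicate in path:
        frontier = {o for s in frontier for o in index.get((s, predicate), ())}
    return frontier


# Hypothetical toy knowledge base with a three-predicate path.
toy_kb = [
    ("a", "marriage", "b"),
    ("b", "person", "c"),
    ("c", "name", "Michelle Obama"),
    ("a", "name", "Barack Obama"),
]
toy_index = build_index(toy_kb)

print(follow_path(toy_index, "a", ["marriage", "person", "name"]))
print(follow_path(toy_index, "a", ["name"]))
```

An expanded predicate connects a subject and an object exactly when the object appears in the set returned for that subject and path, so the same routine can also serve as a membership check.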

6.2 Generation of Expanded Predicates

A naive approach to generating all expanded predicates is breadth-first search (BFS) starting from each node in the knowledge base. However, the number of expanded predicates grows exponentially with the predicates’ length, so the cost is unacceptable for a billion-scale knowledge base. To cope with this, we first set a limit \(k\) on the predicate length to improve the scalability. That is, we only search expanded predicates with length no larger than \(k\). In the next subsection, we will show how to set a reasonable \(k\). In this subsection, we improve the scalability from two other aspects: (1) reduction on \(s\); (2) memory-efficient BFS.

Reduction on \(s\): During the offline inference process, we are only interested in subjects \(s\) that occur in at least one question in the QA corpus. Hence, we only use subjects occurring in the questions from the QA corpus as starting nodes for the BFS exploration. This strategy significantly reduces the number of \((s, p^+, o)\) triples to be generated, because the number of such entities is far smaller than the number of entities in a billion-scale knowledge base. For the knowledge base (1.5 billion entities) and QA corpus (0.79 million distinct entities) we use, this filtering theoretically reduces the number of \((s, p^+, o)\) triples by a factor of about \(1.5 \times 10^9 / (0.79 \times 10^6) \approx 1900\).

Memory-Efficient BFS: To enable the BFS on a knowledge base of 1.1TB, we use a disk-based multi-source BFS algorithm. At the very beginning, we load all entities occurring in the QA corpus (denoted by \(S_0\)) into memory and build a hash index on \(S_0\). In the first round, by scanning all RDF triples resident on disk once and joining the subjects of the triples with \(S_0\), we get all \((s, p^+, o)\) with length 1. The hash index built upon \(S_0\) allows a linear-time join. In the second round, we load all the triples found so far into memory and build a hash index on all objects \(o\) (denoted by \(S_1\)). Then we scan the RDF triples again and join the subjects of the RDF triples with \(S_1\). Now we get all \((s, p^+, o)\) with length 2, and load them into memory. We repeat the above index+scan+join operation \(k\) times to get all \((s, p^+, o)\) with length at most \(k\).

The algorithm is efficient since our time cost is mainly spent on scanning the knowledge base \(k\) times. The index building and join are executed in memory, and their time cost is negligible compared to the disk I/O. Note that the number of expanded predicates starting from \(S_0\) is always significantly smaller than the size of the knowledge base, and thus they can be held in memory. For the knowledge base (KBA, please refer to the experiment section for more details) and QA corpus we use, we only need to store 21M \((s, p^+, o)\) triples, so it is easy to load them into memory. Suppose the size of \(\mathcal{K}\) is \(|\mathcal{K}|\), and the number of \((s, p^+, o)\) triples found is \(\#spo\). The algorithm consumes \(O(\#spo)\) memory, and the time complexity is \(O(|\mathcal{K}| + \#spo)\).
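The index+scan+join rounds can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `scan` is a hypothetical callable that yields every triple of the knowledge base once per call (standing in for one sequential pass over the disk), the frontier dictionary stands in for the in-memory hash index, and the triples themselves are made up.

```python
from typing import Callable, Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]
Expanded = Tuple[str, Tuple[str, ...], str]  # (subject, predicate path, object)
Prefix = Tuple[str, Tuple[str, ...]]         # (seed subject, path so far)


def expand_predicates(scan: Callable[[], Iterable[Triple]],
                      seeds: Set[str],
                      k: int) -> Set[Expanded]:
    """Collect all expanded-predicate triples of length <= k from the seeds."""
    found: Set[Expanded] = set()
    # Round 0 index: each seed is reached by the empty path from itself.
    frontier: Dict[str, Set[Prefix]] = {s: {(s, ())} for s in seeds}
    for _ in range(k):
        next_frontier: Dict[str, Set[Prefix]] = {}
        for s, p, o in scan():                      # one pass over the KB
            for seed, path in frontier.get(s, ()):  # hash join on the subject
                new_path = path + (p,)
                found.add((seed, new_path, o))
                next_frontier.setdefault(o, set()).add((seed, new_path))
        frontier = next_frontier                    # index for the next round
    return found


# Hypothetical toy knowledge base; only "a" occurs in the QA corpus.
toy_triples = [
    ("a", "marriage", "b"),
    ("b", "person", "c"),
    ("c", "name", "Michelle Obama"),
    ("c", "dob", "1964"),
    ("z", "name", "Unrelated"),
]

expanded = expand_predicates(lambda: iter(toy_triples), {"a"}, k=3)
for triple in sorted(expanded):
    print(triple)
```

Each round performs exactly one scan, so the number of passes over the knowledge base equals the length limit, and nothing reachable only from the non-seed subject "z" is ever materialized.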

6.3 Selection of \(k\)

The length limit \(k\) of expanded predicates affects the effectiveness of predicate expansion. A larger \(k\) leads to more \((s, p^+, o)\) triples, and consequently higher coverage of questions. However, it also introduces more meaningless triples. For example, one expanded predicate in Figure 1 connects “Barack Obama” and “1964”, but the two have no obvious relation, so the triple is useless for KBQA.

The predicate expansion should select a \(k\) that allows most meaningful relations and avoids most meaningless relations. We estimate the best \(k\) using Infobox in Wikipedia. Infobox stores facts about entities, and most entries in Infobox are subject-predicate-object triples. Facts in Infobox can be used as meaningful relations. Hence, our idea is to sample \((s, p^+, o)\) triples with length \(k\) and see how many of them have a correspondence in Infobox. We expect to see a significant drop for an excessive \(k\).

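Specifically, we select the top 17,000 entities from the RDF knowledge base ordered by their frequencies. The frequency of an entity \(e\) is defined as the number of \((s, p, o)\) triples in \(\mathcal{K}\) such that \(e = s\). We choose these entities because they have richer facts, and therefore are more trustworthy. For these entities, we generate their \((s, p^+, o)\) triples at length \(k\) using the BFS procedure proposed in Sec 6.2. Then, for each \(k\), we count the number of these \((s, p^+, o)\) triples that have a corresponding SPO triple in Wikipedia Infobox. More formally, let \(E\) be the sampled entity set, and \(SPO_k\) be the set of \((s, p^+, o) \in \mathcal{K}\) with length \(k\). We define \(valid(k)\) to measure the influence of \(k\) in finding meaningful relations as follows:

\[ valid(k) = \sum_{s \in E} \big| \{ (s, p^+, o) \mid (s, p^+, o) \in SPO_k,\ \exists p,\ (s, p, o) \in \text{Infobox} \} \big| \tag{29} \]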

The results of \(valid(k)\) over KBA and DBpedia are shown in Table 4. Note that for expanded predicates of certain lengths, we only consider those that end with one particular predicate. This is because we found that entities and values connected by other expanded predicates always have only very weak relations and should be discarded. The number of valid expanded predicates drops significantly when \(k = 3\). This suggests that most meaningful facts are represented within this length. So we choose \(k = 3\) in this paper.

k 1 2 3
KBA 14005 16028 2438
DBpedia 352811 496964 2364
Table 4: valid(k)
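The counting behind valid(k) can be illustrated with a short Python sketch. It rests on simplifying assumptions: a triple is treated as having an Infobox correspondence when Infobox contains some triple with exactly the same subject and object, and all data below is made up for illustration (it does not reproduce the numbers in Table 4).

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]
Expanded = Tuple[str, Tuple[str, ...], str]


def valid(k: int,
          expanded: Set[Expanded],
          infobox: Iterable[Triple],
          sampled: Set[str]) -> int:
    """Count length-k expanded triples of sampled subjects found in Infobox."""
    # A (subject, object) pair is covered if any Infobox predicate links them.
    infobox_pairs = {(s, o) for s, _p, o in infobox}
    return sum(
        1
        for s, path, o in expanded
        if s in sampled and len(path) == k and (s, o) in infobox_pairs
    )


# Hypothetical toy data.
toy_expanded: Set[Expanded] = {
    ("a", ("name",), "Barack Obama"),
    ("a", ("marriage", "person", "name"), "Michelle Obama"),
    ("a", ("marriage", "person", "dob"), "1964"),
}
toy_infobox = [
    ("a", "name", "Barack Obama"),
    ("a", "spouse", "Michelle Obama"),
]

for length in (1, 2, 3):
    print(length, valid(length, toy_expanded, toy_infobox, {"a"}))
```

In this toy setting the path ending in the date of birth is not counted, since Infobox records no direct relation between the subject and that value.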