
Program Transfer and Ontology Awareness for Semantic Parsing in KBQA

Semantic parsing in KBQA aims to parse natural language questions into logical forms, whose execution against a knowledge base produces answers. Learning semantic parsers from question-answer pairs requires searching over a huge space of logical forms for ones consistent with the answers. Current methods utilize various prior knowledge or entity-level KB constraints to reduce the search space. In this paper, we investigate for the first time prior knowledge from external logical form annotations and ontology-level constraints. We design a hierarchical architecture for program transfer, and propose an ontology-guided pruning algorithm to reduce the search space. The experiments on ComplexWebQuestions show that our method improves the state-of-the-art F1 score from 44.0% to 58.7%, demonstrating the effectiveness of program transfer and ontology awareness.





1 Introduction

Semantic parsing in Knowledge Base Question Answering (KBQA) aims to parse natural language questions (utterances) into logical forms (e.g., λ-calculus Zettlemoyer and Collins (2005), query graphs Yih et al. (2015), programs Liang et al. (2017)), whose execution against a knowledge base (KB) produces the answers (denotations). It has emerged as a promising technique to provide interpretable access to KBs Berant et al. (2013).

Early works focus on semantic parsing from annotations Zettlemoyer and Collins (2005); Dong and Lapata (2016), which learn semantic parsers from pairs of questions and logical forms. The primary limitation is that collecting logical forms requires expensive expert annotation. This has led to a body of work on semantic parsing from denotations Yih et al. (2015); Liang et al. (2017), which learns semantic parsers only from question-answer pairs. This line of work poses two unique challenges:

Figure 1: An example question, the corresponding program, and the answer. The left side of the arrow is the sketch, and the right side is the complete program, with dotted boxes denoting arguments for functions.

Prior Knowledge Utilization. Without explicit supervision, prior knowledge is required to guide the learning process, e.g., neural program induction methods inject hand-crafted rules Ansari et al. (2019) or knowledge linking Liang et al. (2017); Saha et al. (2019) when translating the question to a multi-step executable program. However, such prior knowledge is usually lacking or requires manual analysis. Recently, there emerge several valuable annotated resources, and how to utilize the prior knowledge from these external annotations remains an open question.

KB Constraints Exploitation. Diverse language expressions and large-scale KBs require the semantic parser to search a huge space. KB constraints are often employed to prune the search space, e.g., query graph generation Yih et al. (2015); Lan and Jiang (2020) grows a query graph that resembles subgraphs of the KB, naturally incorporating the KB constraints. However, current works only consider entity-level knowledge, failing to fully exploit the ontology-level knowledge in the KB, i.e., constraints over abstract concepts and relations, which provide a high-level summary of the entity-level triples Hao et al. (2019).

In this paper, we aim to exploit prior knowledge from external annotations and constraints from the KB ontology for semantic parsing from denotations. Recently, a program annotation resource, KQA Pro Shi et al. (2020), was released. As illustrated in Figure 1, KQA Pro defines a program as the composition of symbolic functions that are designed to perform basic operations on KBs. The composition of functions captures language compositionality well Baroni (2019), and the functions are atomic and general, making it possible to perform transfer. Therefore, we take KQA Pro as the external annotation resource to conduct program transfer across KBs. Nevertheless, the heterogeneity of KBs needs to be well addressed in the transfer. For the ontology constraints, which types of constraints should be considered and how to incorporate them into program induction also need investigation.

As shown in Figure 1, a program can be abstracted to a sketch Solar-Lezama (2009) by decoupling the detailed arguments. Translation from questions to sketches is only relevant to the language compositional structure, and is thus general across KBs. Guided by the question and sketch, arguments can be retrieved from the KB, where pretrained language models can be employed to benefit generalization Gu et al. (2021). By selecting candidates for the argument parser under ontology guidance, ontology constraints can be incorporated into the parsing process. To this end, we propose a hierarchical program induction model, which contains 1) a high-level sketch parser that decomposes a language question into a program sketch; 2) a low-level argument parser that determines the detailed arguments for the functions in the sketch.

We learn the parsers with a pretrain-finetune paradigm. Specifically, we pretrain the parsers with existing question-program pairs in KQA Pro. Further, we finetune the parsers with question-answer pairs in the target KB by searching for possible programs and employing the consistent ones to optimize the parsers. The parsing process is divided into two stages. At the first stage, we utilize the sketch parser to translate the question into a program sketch using a Seq2Seq model with an attention mechanism. At the second stage, we utilize the argument parser to retrieve the arguments by matching the semantics of the question-sketch pair and candidate arguments. Type constraints for entities, and domain and range constraints for relations, are employed to prune the argument candidate space.

KQA Pro contains 117,970 question-program pairs and is based on a small subset of Wikidata Vrandecic and Krötzsch (2014). We take the Freebase Bollacker et al. (2008)-based ComplexWebQuestions Talmor and Berant (2018) and WebQuestionSP Yih et al. (2016) as the target domain datasets. Compared with the state-of-the-art methods that learn from question-answer pairs, we improve the F1 score from 44.0% to 58.7% and from 74.0% to 76.5%, with absolute gains of 14.7% and 2.5% for ComplexWebQuestions and WebQuestionSP respectively. Experimental results demonstrate that exploiting prior knowledge from external annotations and constraints from the KB ontology is practical and promising. Ablation studies demonstrate the effectiveness of our program transfer architecture and ontology-guided pruning algorithm.

2 Preliminaries

Knowledge Base. A knowledge base describes concepts, entities, and the relations between them. It can be formalized as KB = {C, E, R, T}, where C, E, R, and T denote the sets of concepts, entities, relations, and triples respectively.

The relation set can be formalized as R = {r_i, r_s} ∪ R_g, where r_i is the instanceOf relation, r_s is the subClassOf relation, and R_g is the general relation set. T can be divided into three disjoint subsets: 1) the instanceOf triple set T_i ⊆ E × {r_i} × C; 2) the subClassOf triple set T_s ⊆ C × {r_s} × C; 3) the relational triple set T_g ⊆ E × R_g × E.

Program. A program is composed of symbolic functions with arguments and produces an answer through execution on a KB. Each function defines a basic operation on the KB and takes a specific type of argument. For example, the function Relate defines a basic operation to find entities that have a specific relation (i.e., its argument type) with the given entity. Formally, a program is defined as a sequence of functions with arguments, denoted as y = ⟨(f_1, arg_1), ..., (f_n, arg_n)⟩. Here, each f_t comes from a pre-defined function set F of size 27, which covers basic reasoning operations over KBs Shi et al. (2020). According to the argument type, F can be divided into four disjoint subsets: F_E, F_C, F_R, and F_∅, representing the functions whose argument type is entity, concept, relation, and empty respectively. Table 1 gives some examples of program functions.

Function       Argument  Example          Description
Find           entity    FC Barcelona     Find the specific KB entity
Relate         relation  arena stadium    Find the entities that hold a specific relation with the given entity
FilterConcept  concept   sports facility  Find the entities that belong to a specific concept
And            -         -                Return the intersection of two entity sets
Table 1: Examples of program functions.
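As a toy illustration of this formalism (not the authors' implementation), a program can be represented as a list of (function, argument) pairs and evaluated left to right against a small KB. The KB triples and helper names below are illustrative assumptions:

```python
# Toy illustration: a program is a sequence of (function, argument) pairs
# executed step by step against a small KB. Triples are made up for the example.

KB = {
    "relational": [  # (head, relation, tail)
        ("FC Barcelona", "arena stadium", "Camp Nou"),
        ("FC Barcelona", "arena stadium", "Palau Blaugrana"),
    ],
    "instanceOf": [  # (entity, concept)
        ("Camp Nou", "sports facility"),
        ("Palau Blaugrana", "sports facility"),
    ],
}

def Find(_, arg):                      # argument type: entity
    return {arg}

def Relate(entities, arg):             # argument type: relation
    return {t for h, r, t in KB["relational"] if h in entities and r == arg}

def FilterConcept(entities, arg):      # argument type: concept
    return {e for e in entities if (e, arg) in set(KB["instanceOf"])}

def execute(program):
    """Execute the function sequence, feeding each result to the next step."""
    result = None
    for func, arg in program:
        result = func(result, arg)
    return result

program = [(Find, "FC Barcelona"),
           (Relate, "arena stadium"),
           (FilterConcept, "sports facility")]
print(sorted(execute(program)))  # -> ['Camp Nou', 'Palau Blaugrana']
```

Functions such as And, which take two entity sets, would need a stack-based executor rather than this strictly linear one; the linear chain suffices to show the idea.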

Program Transfer. First, we define the semantic parsing task: given a KB and a natural language question x, produce a program y that generates the right answer when executed against the KB.

Then, we define the program transfer task: we have access to the source domain data D_s = {(x_i, y_i)}, which contains pairs of question x_i and program y_i, and the target domain data D_t = {(x_j, a_j)}, which contains pairs of question x_j and answer a_j. The task of program transfer is to learn a semantic parser for the target domain, i.e., to translate a question for the target KB into a program which can execute on the target KB.

Figure 2: We design a high-level sketch parser to generate the sketch, and a low-level argument parser to predict arguments for the sketch. The arguments are retrieved from candidate pools which are illustrated by the color blocks. The arguments for functions are mutually constrained by the ontology structure. For example, when the second function Relate finds the argument teams owned, the candidate pool for the third function Fil.Con. (short for FilterConcept) is reduced to the range of relation teams owned. For training, we use a pretrain-finetune paradigm to transfer the prior knowledge from source domain to target domain.

3 Model Architecture

We propose to explicitly decompose semantic parsing into a high-level sketch parser and a low-level argument parser. Both of them can generalize across KBs. By grounding the sketch to concepts and relations in the ontology step by step, the search space is reduced progressively. The left part of Figure 2 depicts the flow of our parser and illustrates the ontology-guided pruning process. For model training, we employ the pretrain-finetune paradigm, first training our parsers with question-program pairs in the source domain, and then finetuning in the target domain, where the prior knowledge from the source domain plays an important role in guiding the program search.

The rationale of this model design is that, by decoupling arguments from programs: 1) at the sketch parsing stage, we can focus on the compositionality of language and program functions without considering the detailed information; because language compositional structure repeats across KBs, our sketch parser can generalize across KBs; 2) at the argument parsing stage, we match the semantics of the question-sketch pair and candidate arguments (e.g., sports team owner) by encoding them into a unified vector space and calculating their semantic similarity. The pretrained contextual representations from Bert Devlin et al. (2019), which have shown effectiveness in compositional and zero-shot generalization Gu et al. (2021), benefit the generalization of our argument parser.

Specifically, first, we learn a high-level sketch parser SP to parse x into the program sketch y_s, which can be formulated as

    y_s = SP(x).    (1)

Second, we learn a low-level argument parser AP to retrieve the argument arg_t from a candidate pool P_t for each function f_t, which can be formulated as

    arg_t = AP(x, y_s, P_t).    (2)
The candidate pool P_t consists of the relevant KB items, including concepts, entities, and relations. In a real KB, the candidate space is usually huge, which makes learning from answers very hard. Therefore, we propose an ontology-guided pruning algorithm to dynamically update the candidate pool and progressively reduce the search space, which will be introduced in Section 3.3.

The parsers are pretrained with the source domain data D_s and then finetuned with the target domain data D_t, which will be introduced in Section 3.4.

3.1 Sketch Parser

The sketch parser is based on an encoder-decoder model Sutskever et al. (2014) with an attention mechanism Dong and Lapata (2016). We aim to estimate p(y_s|x), the conditional probability of sketch y_s given input x. We decompose it as:

    p(y_s|x) = ∏_{t=1}^{n} p(y_t | y_{<t}, x),    (3)

where y_{<t} = y_1, ..., y_{t-1}.

Question Encoder. We utilize Bert Devlin et al. (2019) as our encoder. Formally,

    (q; h_1, ..., h_m) = Bert(x),    (4)

where q ∈ R^d is the question embedding, h_i ∈ R^d is the hidden vector of word x_i, and d is the hidden dimension.

Sketch Decoder. We use a Gated Recurrent Unit (GRU) Cho et al. (2014), a well-known variant of RNNs, as the decoder of the program sketch. The decoding is conducted step by step. After we have predicted y_{t-1}, the hidden state at step t is computed as:

    s_t = GRU(s_{t-1}, e(y_{t-1})),    (5)

where s_{t-1} is the hidden state from the last time step, and e(y_{t-1}) denotes the embedding corresponding to y_{t-1} in the embedding matrix E. We use s_t as the attention key to compute a score for each word in the question based on the hidden vectors h_i, and compute the attention vector c_t as:

    α_i = softmax_i(s_t · h_i),  c_t = Σ_{i=1}^{m} α_i h_i.    (6)

The information of s_t and c_t is fused to predict the final probability of the next sketch token:

    p(y_t | y_{<t}, x) = softmax(MLP([s_t; c_t])),    (7)

where MLP (short for multi-layer perceptron) projects the 2d-dimensional feature [s_t; c_t] to a score vector over the sketch vocabulary, and consists of two linear layers with ReLU activation.
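The attention step above can be sketched in NumPy. This is a minimal illustration with random stand-in vectors; the dimensions, the single linear readout (in place of the two-layer MLP), and the vocabulary size are assumptions, not the paper's exact parameterization:

```python
import numpy as np

# One decoder step of dot-product attention over question word vectors,
# followed by a softmax over a toy sketch vocabulary. All tensors are
# random stand-ins; shapes only illustrate the data flow.
rng = np.random.default_rng(0)
d, m, vocab = 8, 5, 10             # hidden dim, question length, sketch vocab size
H = rng.normal(size=(m, d))        # h_1..h_m: question word hidden vectors
s_t = rng.normal(size=(d,))        # decoder hidden state at step t

scores = H @ s_t                   # one attention score per question word
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()               # normalized attention weights
c_t = alpha @ H                    # attention (context) vector

W = rng.normal(size=(2 * d, vocab))        # stand-in for the MLP readout
logits = np.concatenate([s_t, c_t]) @ W
p = np.exp(logits - logits.max())
p /= p.sum()                       # distribution over the next sketch token
```

A trained model would learn W (and the GRU and embeddings) end-to-end; the point here is only the score-normalize-mix-predict flow of Equations 6 and 7.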

3.2 Argument Parser

The argument parser aims to retrieve the argument arg_t from the candidate pool P_t for each function f_t. The construction of P_t will be introduced in Section 3.3. In this section, we focus on the retrieval process.

Specifically, we take s_t as the context representation of f_t, learn a vector representation g_j for each candidate c_j ∈ P_t, and calculate the probability for c_j based on s_t and g_j. The candidates are encoded with the Bert encoder in Equation 4, which can be formulated as:

    G = Bert(P_t),    (8)

where g_j, the j-th row of G, is the representation of candidate c_j. The probability of candidate c_j is calculated as:

    p(c_j | x, y_s) = softmax_j(s_t · g_j).    (9)
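A toy version of this retrieval step: score each candidate's encoding against the decoder's context representation and normalize with a softmax. The vectors below are random stand-ins for the Bert encodings, and the pool size is an arbitrary assumption:

```python
import numpy as np

# Argument retrieval as similarity matching: dot products between the
# context vector s_t and candidate encodings g_j, then a softmax.
rng = np.random.default_rng(1)
d = 8
s_t = rng.normal(size=(d,))        # context representation of function f_t
G = rng.normal(size=(4, d))        # g_j: encodings of 4 candidates in the pool

logits = G @ s_t                   # one similarity score per candidate
p = np.exp(logits - logits.max())
p /= p.sum()                       # probability over candidates
chosen = int(p.argmax())           # argument = highest-probability candidate
```

Because both the question-sketch context and the candidates live in the same vector space, unseen KB items can still be scored, which is what lets the argument parser generalize across KBs.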
3.3 Ontology-guided Pruning

In a KB, every relation r ∈ R_g comes with a domain and a range. The domain and range constraints of relations, and the type constraints of entities (i.e., the instanceOf triples), form the KB ontology.

Our ontology-guided pruning algorithm aims to reduce the space of candidate pool in Section 3.2. The rationale is that the arguments for program functions are mutually constrained according to the KB ontology, and when the argument for is determined, the possible candidates for will be adjusted. For example, in Figure 2, when Relate takes “teams owned” as the argument, the candidate pool for the next FilterConcept is constrained to the range of relation “teams owned”, thus other concepts (e.g., “time zone”) will be excluded from the candidate pool; when FilterConcept takes “sports team” as the argument, the candidate pool for the next function Relate is constrained to the relations whose domain contains “sports team”.

Specifically, we utilize three ontology-oriented constraint functions, δ_type, δ_range, and δ_dom, which find the type constraint for an entity e, the range constraint for a relation r, and the relations whose domain constraint contains a concept c, respectively. The details of these constraints are shown in Table 2. In addition, we maintain three global candidate pools P_E, P_R, and P_C for entities, relations, and concepts respectively, and take one of them as P_t according to the argument type of f_t. When the argument arg_t of f_t is determined, we update P_E, P_R, and P_C using the constraint functions. The detailed algorithm is shown in the Appendix.

Notation    Description
δ_range(r)  the range constraint of relation r.
δ_dom(c)    the relations whose domain constraint contains concept c.
δ_type(e)   the type constraint of entity e, i.e., {c | (e, r_i, c) ∈ T}, where r_i denotes instanceOf.
Table 2: Details of the ontology-oriented constraints.

3.4 Training

We train our program parser using the popular pretrain-finetune paradigm. Specifically, we pretrain the parser on the source domain data in a supervised way. After that, we conduct finetuning on the target domain data in a weakly supervised way.

Pretraining in Source Domain. Since the source domain data provides complete annotations, we can directly maximize the log likelihood of the golden sketch and golden arguments:

    L = Σ_{(x,y)∈D_s} ( log p(y_s|x) + Σ_t log p(arg_t | x, y_s) ).    (10)
Finetuning in Target Domain. At this training phase, questions are labeled with answers while programs remain unknown. The basic idea is to search for potentially correct programs and optimize their corresponding probabilities. Specifically, we propose two training strategies:


  • Iterative maximum likelihood learning (IML). At each training step, IML generates a set of possible programs with beam search based on the current model parameters, and then executes them to find the one whose answers have the highest F1 score compared with the gold answers. Letting y* denote the best program, we directly maximize its likelihood as in Equation 10.

  • Reinforcement learning (RL). It formulates program generation as a decision-making procedure and computes rewards for sampled programs based on their execution results. We take the F1 score between the executed answers and the golden answers as the reward value, and use the REINFORCE Williams (1992) algorithm to optimize the parsing model.
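The select-best-by-F1 core of IML can be sketched as below. Beam search and the gradient update are stubbed out (the fake executor and program names are assumptions); only the scoring and selection logic is shown:

```python
# Minimal sketch of one IML step: score each candidate program by the F1
# between its executed answers and the gold answers, and return the best
# program (the one whose likelihood would then be maximized).

def f1(pred, gold):
    """Set-level F1 between predicted and gold answer sets."""
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r)

def iml_step(candidate_programs, execute, gold_answers):
    scored = [(f1(execute(prog), gold_answers), prog)
              for prog in candidate_programs]
    best_f1, best_prog = max(scored, key=lambda x: x[0])
    # In training, maximize log p(best_prog | x); skip if nothing is consistent.
    return best_prog if best_f1 > 0 else None

# Toy usage with a fake executor mapping program ids to answer sets:
answers = {"p1": {"a", "b"}, "p2": {"a"}, "p3": set()}
best = iml_step(["p1", "p2", "p3"], answers.get, {"a", "b"})
print(best)  # -> p1 (perfect F1 against the gold answers)
```

The same f1 function doubles as the RL reward; the difference is that RL optimizes the expected reward over sampled programs instead of the likelihood of the single best one.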

4 Experimental Settings

4.1 Datasets

Source Domain. KQA Pro Shi et al. (2020) provides 117,970 question-program pairs based on a subset of Wikidata Vrandecic and Krötzsch (2014).

Target Domain. We use two KBQA benchmark datasets: WebQuestionSP (WebQSP) Yih et al. (2016) and ComplexWebQuestions (CWQ) Talmor and Berant (2018), both based on Freebase Bollacker et al. (2008). WebQSP contains 4,737 questions and is divided into 2,998 train, 100 dev and 1,639 test cases. CWQ is an extended version of WebQSP which incorporates more complex questions and thus is more challenging. It has four types of questions: composition (44.7%), conjunction (43.6%), comparative (6.2%), and superlative (5.4%). CWQ is divided into 27,639 train, 3,519 dev and 3,531 test cases.

We use the Freebase dump of 2015-08-09, from which we extract type, domain, and range constraints to construct the ontology. The average domain, range, and type constraint sizes are 1.43 per relation, 1.17 per relation, and 8.89 per entity respectively.

Table 3 shows the statistics of the source and target domain KB. The target domain KB contains much more entities, relations, and concepts, and most of them are uncovered by the source domain.

Domain # Entities # Relations # Concepts
Source 16,960 363 794
Target 30,943,204 15,015 2,519
Table 3: The statistics for source and target domain KB.

4.2 Baselines

We choose the state-of-the-art methods which learn from question-answer pairs as our baselines.

Existing program induction methods use hand-crafted rules or knowledge linking. NSM Liang et al. (2017) uses the prior entity, relation and type linking knowledge to solve simple questions. NPI Ansari et al. (2019) designs rules such as disallow repeating or useless actions.

Query graph generation methods incorporate KB guidance by considering entity-level triples. TEXTRAY Bhutani et al. (2019) prunes the search space by a decompose-execute-join approach. QGG Lan and Jiang (2020) incorporates constraints into query graphs in the early stage. TeacherNet He et al. (2021) learns relation paths from the topic entity to answer entities utilizing bidirectional multi-hop reasoning.

Information retrieval based methods directly retrieve answers from the KB without generating interpretable logical forms. GraftNet Sun et al. (2018) uses heuristics to create a question-specific subgraph and a variant of graph convolutional networks to retrieve the answer. PullNet Sun et al. (2019) improves GraftNet by iteratively constructing the subgraph instead of using heuristics.

Besides, we compare our full model with Ours w/o FT, Ours w/o PT, Ours w/o PT-AP, and Ours w/o Ont, which denote our model without finetuning, without pretraining, without pretraining of the argument parser, and without ontology constraints, respectively.

4.3 Evaluation Metrics

Following prior works Berant et al. (2013); Sun et al. (2018); He et al. (2021), we use F1 score and Hit@1 as the evaluation metrics. Since questions in the datasets may have multiple answers, the F1 score better reflects the coverage of the predicted answers.

4.4 Implementations

We used the bert-base-cased model of HuggingFace as our Bert encoder, with hidden dimension 768. The hidden dimension of the sketch decoder was 1024. We used AdamW Loshchilov and Hutter (2019) as our optimizer. We searched the learning rate for Bert parameters in {1e-4, 3e-5, 1e-5}, the learning rate for other parameters in {1e-3, 1e-4, 1e-5}, and the weight decay in {1e-4, 1e-5, 1e-6}. According to the performance on the validation set, we finally used learning rate 3e-5 for Bert parameters, 1e-3 for other parameters, and weight decay 1e-5.

5 Experimental Results

Models          WebQSP          CWQ
                F1     Hit@1    F1     Hit@1
GraftNet        62.3   68.7     -      32.8*
PullNet         -      68.1     -      47.2*
TeacherNet      67.4   74.3     44.0   48.8
TEXTRAY         60.3   72.2     33.9   40.8
QGG             74.0   -        40.4   44.1
NSM             -      69.0     -      -
NPI             -      72.6     -      -
Ours w/o FT     53.8   53.0     45.9   45.2
Ours w/o PT     3.2    3.1      2.3    2.1
Ours w/o PT-AP  70.8   68.9     54.5   54.3
Ours w/o Ont    72.0   71.3     55.8   54.7
Ours            76.5   74.6     58.7   58.1
Table 4: Performance comparison of different methods (F1 score and Hit@1 in percent). We highlight the best results in bold and the second best with underline. *: reported by PullNet on the validation set.

5.1 Overall Results

As shown in Table 4, our model achieves the best performance on both WebQSP and CWQ. Especially on CWQ, we obtain an absolute gain of 14.7% in F1 and 9.3% in Hit@1, beating previous methods by a large margin. Note that CWQ is much more challenging than WebQSP because it includes more composition and conjunction questions. Previous works mainly suffer from the huge search space and sparse training signals. We address these challenges by transferring the prior knowledge and incorporating the ontology constraints, which reduce the search space substantially. On WebQSP, we achieve absolute gains of 2.5% and 0.3% in F1 and Hit@1, respectively, demonstrating that our model can also handle simple questions well and can adapt to questions of different complexity.

Note that our F1 scores are higher than corresponding Hit@1. This is because we just randomly sampled one answer from the returned answer set as the top 1 without ranking them.

Models WebQSP CWQ
Best F1 Best F1
Top-1 76.5 58.7
Top-2 81.1 61.2
Top-5 85.4 63.3
Top-10 86.9 65.0
Table 5: The highest F1 score in the top-k programs.

We utilize beam search to generate multiple possible programs and evaluate their performance. Table 5 shows the highest F1 score among the top-k generated programs, where top-1 is the same as in Table 4. We can see that the best F1 among the top-10 programs is much higher than the F1 of the top-1 (e.g., an absolute gain of 10.4% for WebQSP and 6.3% for CWQ). This indicates that a good re-ranking method could further improve the overall performance of our model. We leave this as future work.

5.2 Ablation study

Pretraining: As shown in Table 4, when comparing Ours w/o PT-AP with the full model, the F1 and Hit@1 on CWQ drop by 4.2% and 3.8% respectively, which indicates that pretraining the argument parser is necessary. Ours w/o PT denotes the model without pretraining for either the sketch parser or the argument parser. We can see that its results are very poor, just about 3% and 2% on WebQSP and CWQ respectively, indicating that pretraining is essential, especially for the sketch parser.

Finetuning: Without finetuning on the target data, i.e., Ours w/o FT, performance drops a lot compared with the complete model. For example, F1 and Hit@1 on CWQ drop by 12.8% and 12.9% respectively. This indicates that finetuning is necessary for the model's performance. As shown in Table 3, most of the relations and concepts in the target domain are uncovered by the source domain. Due to the semantic gap between the source and target data, the prior knowledge must be properly transferred to the target domain to come into full play.

Ontology: We implemented Ours w/o Ont by removing the ontology constraints from the KB and removing FilterConcept from the program. Compared with the full model, the F1 and Hit@1 on CWQ drop by 2.9% and 3.4% respectively, which demonstrates the importance of the ontology constraints. We calculated the search space size for each composition and conjunction question in the CWQ validation set, and report the average size in Table 6. The statistics show that, by incorporating the ontology constraints, our model substantially reduces the search space.

Model         Composition  Conjunction
Ours w/o Ont  4,248,824.5  33,152.1
Ours          11,200.7     1,066.5
Table 6: The average search space size for composition and conjunction questions in the CWQ validation set, for Ours and Ours w/o Ont.

IML vs. RL: As shown in Table 7, training with IML achieves better performance on both WebQSP and CWQ. For RL, we simply employed the REINFORCE algorithm and did not implement any auxiliary reward strategy, since this is not the focus of our work. The sparse, delayed reward causes high variance, instability, and local-minima issues, making training hard Saha et al. (2019). We leave exploring more sophisticated training strategies as future work.

Models WebQSP CWQ
F1 Hit@1 F1 Hit@1
IML 76.5 74.6 58.7 58.1
RL 71.4 72.0 46.1 45.4
Table 7: Results of different training strategies.

5.3 Case Study

Figure 3: An example from CWQ validation set. Our model translates the question into multiple programs with the corresponding probability and F1 score. We show the best, 2-nd best and 10-th best programs. Both the best and 2-nd best programs are correct.

Figure 3 gives a case where our model parses a question into multiple programs along with their probability scores and the F1 scores of the executed answers. Given the question "The person education institution is Robert G. Cole Junior-Senior High School played for what basketball teams?", we show the programs with the largest, 2-nd largest, and 10-th largest probability scores. Both of the top-2 programs return the correct answer set and are semantically equivalent to the question, while the 10-th best program is wrong.

Error Analysis. We randomly sampled 100 error cases whose F1 score is lower than 0.1 for manual inspection. The errors can be summarized into the following categories: 1) Wrong relation (53%): a wrongly predicted relation makes the program wrong, e.g., for the question "What language do people in the Central Western Time Zone speak?", our model predicts the relation main country, meaning the main country that uses one language, while the ground truth is countries spoken in, meaning all the countries that use one language; 2) Wrong concept (38%): a wrongly predicted concept makes the program wrong, e.g., for the question "What continent does the leader Ovadia Yosel live in?", our model predicted the concept location, whereas the ground truth is continent; 3) Model limitation (9%): handling attribute constraints is not considered in our model, e.g., for the question "Who held his governmental position from before April 4, 1861 and influenced Whitman's poetry?", the start time constraint April 4, 1861 cannot be handled.

6 Related Work

Semantic Parsing from Denotations. Semantic parsing from denotations requires searching over an exponentially large space of logical forms and may be misled by spurious ones Berant et al. (2013); Pasupat and Liang (2015); Guu et al. (2017), which makes training extremely challenging. Attempts to address this problem either use prior knowledge to guide the learning process (e.g., hand-crafted rules Ansari et al. (2019); Saha et al. (2019), gold entity, relation, and type linking Liang et al. (2017); Saha et al. (2019)), or enforce KB constraints at the entity level to ensure the semantic correctness of the logical form Yih et al. (2015); Lan and Jiang (2020); Bhutani et al. (2019).

Our work has drawn inspiration from the works that abstract entities and relations from logical forms Zhang et al. (2017); Dong and Lapata (2018); Herzig and Berant (2018). However, their methods do not decompose semantic parsing into generic, reusable, atomic functions or investigate program transfer. To the best of our knowledge, this paper is the first to investigate the problem of program transfer.

Cross-domain Semantic Parsing. Cross-domain semantic parsing trains a semantic parser on some source domains and adapts it to the target domain. Some works Herzig and Berant (2017); Su and Yan (2017); Fan et al. (2017) pooled together examples from multiple datasets in different domains and trained a single sequence-to-sequence model over all examples, sharing parameters across domains. These methods relied on annotated data in the target domain. To facilitate low-resource target domains, Chen et al. (2020) adapted to target domains with a very limited amount of training data. Other works considered a zero-shot semantic parsing task Givoli and Reichart (2019), decoupling structures from lexicons for transfer. However, they only learned from the source domain without further learning from the target domain using the prior knowledge. In addition, existing works mainly focus on the domains in OVERNIGHT Wang et al. (2015), which are much smaller than large-scale KBs such as Wikidata and Freebase. To deal with the complex schemas of large-scale KBs, transfer in our setting is more challenging.

7 Conclusion

In this paper we investigate program transfer and ontology awareness for semantic parsing in KBQA for the first time. We propose a hierarchical program induction model which comprises a high-level sketch parser and a low-level argument parser, both of which generalize across KBs. The ontology-guided pruning algorithm substantially reduces the search space by using three ontology-oriented constraints. The experimental results demonstrate that program transfer and ontology constraints greatly facilitate semantic parsing from denotations.


  • G. A. Ansari, A. Saha, V. Kumar, M. Bhambhani, K. Sankaranarayanan, and S. Chakrabarti (2019) Neural program induction for kbqa without gold programs or query annotations. In IJCAI’19, Cited by: §1, §4.2, §6.
  • M. Baroni (2019) Linguistic generalization and compositionality in modern artificial neural networks. Philosophical Transactions of the Royal Society B 375. Cited by: §1.
  • J. Berant, A. K. Chou, R. Frostig, and P. Liang (2013) Semantic parsing on freebase from question-answer pairs. In EMNLP’13, Cited by: §1, §4.3, §6.
  • N. Bhutani, X. Zheng, and H. Jagadish (2019) Learning to answer complex questions over knowledge bases with query composition. In CIKM’19, Cited by: §4.2, §6.
  • K. Bollacker, C. Evans, P. K. Paritosh, T. Sturge, and J. Taylor (2008) Freebase: a collaboratively created graph database for structuring human knowledge. In SIGMOD’08, Cited by: §1, §4.1.
  • X. Chen, A. Ghoshal, Y. Mehdad, L. Zettlemoyer, and S. Gupta (2020) Low-resource domain adaptation for compositional task-oriented semantic parsing. In EMNLP’20, Cited by: §6.
  • K. Cho, B. V. Merrienboer, D. Bahdanau, and Y. Bengio (2014) On the properties of neural machine translation: encoder-decoder approaches. CoRR abs/1409.1259. Cited by: §3.1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT’19, Cited by: §3.1, §3.
  • L. Dong and M. Lapata (2016) Language to logical form with neural attention. CoRR abs/1601.01280. Cited by: §1, §3.1.
  • L. Dong and M. Lapata (2018) Coarse-to-fine decoding for neural semantic parsing. In ACL’18, Cited by: §6.
  • X. Fan, E. Monti, L. Mathias, and M. Dreyer (2017) Transfer learning for neural semantic parsing. CoRR abs/1706.04326. Cited by: §6.
  • O. Givoli and R. Reichart (2019) Zero-shot semantic parsing for instructions. CoRR abs/1911.08827. Cited by: §6.
  • Y. Gu, S. Kase, M. Vanni, B. Sadler, P. Liang, X. Yan, and Y. Su (2021) Beyond iid: three levels of generalization for question answering on knowledge bases. In WWW’21, Cited by: §1, §3.
  • K. Guu, P. Pasupat, E. Liu, and P. Liang (2017) From language to programs: bridging reinforcement learning and maximum marginal likelihood. In ACL’17, Cited by: §6.
  • J. Hao, M. Chen, W. Yu, Y. Sun, and W. Wang (2019) Universal representation learning of knowledge bases by jointly embedding instances and ontological concepts. SIGKDD’19. Cited by: §1.
  • G. He, Y. Lan, J. Jiang, W. X. Zhao, and J. Wen (2021) Improving multi-hop knowledge base question answering by learning intermediate supervision signals. Cited by: §4.2, §4.3.
  • J. Herzig and J. Berant (2017) Neural semantic parsing over multiple knowledge-bases. CoRR abs/1702.01569. Cited by: §6.
  • J. Herzig and J. Berant (2018) Decoupling structure and lexicon for zero-shot semantic parsing. In EMNLP’18, Cited by: §6.
  • Y. Lan and J. Jiang (2020) Query graph generation for answering multi-hop complex questions from knowledge bases. In ACL’20, Cited by: §1, §4.2, §6.
  • C. Liang, J. Berant, Q. V. Le, K. D. Forbus, and N. Lao (2017)

    Neural symbolic machines: learning semantic parsers on freebase with weak supervision

    In ACL’17, Cited by: §1, §1, §1, §4.2, §6.
  • I. Loshchilov and F. Hutter (2019) Decoupled weight decay regularization. In ICLR’19, Cited by: §4.4.
  • P. Pasupat and P. Liang (2015) Compositional semantic parsing on semi-structured tables. In ACL’15, pp. 1470–1480. Cited by: §6.
  • A. Saha, G. A. Ansari, A. Laddha, K. Sankaranarayanan, and S. Chakrabarti (2019) Complex program induction for querying knowledge bases in the absence of gold programs. Transactions of the Association for Computational Linguistics 7, pp. 185–200. Cited by: §1, §5.2, §6.
  • J. Shi, S. Cao, L. Pan, Y. Xiang, L. Hou, J. Li, H. Zhang, and B. He (2020) KQA Pro: a large diagnostic dataset for complex question answering over knowledge base. CoRR abs/2007.03875. Cited by: §1, §2, §4.1.
  • A. Solar-Lezama (2009) The sketching approach to program synthesis. In APLAS’09, Cited by: §1.
  • Y. Su and X. Yan (2017) Cross-domain semantic parsing via paraphrasing. In EMNLP’17, Cited by: §6.
  • H. Sun, T. Bedrax-Weiss, and W. Cohen (2019) PullNet: open domain question answering with iterative retrieval on knowledge bases and text. In EMNLP’19, Cited by: §4.2.
  • H. Sun, B. Dhingra, M. Zaheer, K. Mazaitis, R. Salakhutdinov, and W. Cohen (2018) Open domain question answering using early fusion of knowledge bases and text. In EMNLP’18, Cited by: §4.2, §4.3.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In NIPS’14, Cited by: §3.1.
  • A. Talmor and J. Berant (2018) The web as a knowledge-base for answering complex questions. In NAACL-HLT’18, Cited by: §1, §4.1.
  • D. Vrandecic and M. Krötzsch (2014) Wikidata: a free collaborative knowledgebase. Communications of the ACM. Cited by: §1, §4.1.
  • Y. Wang, J. Berant, and P. Liang (2015) Building a semantic parser overnight. In ACL’15, Cited by: §6.
  • R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning. Cited by: 2nd item.
  • W. Yih, M. Chang, X. He, and J. Gao (2015) Semantic parsing via staged query graph generation: question answering with knowledge base. In ACL, Cited by: §1, §1, §1, §6.
  • W. Yih, M. Richardson, C. Meek, M. Chang, and J. Suh (2016) The value of semantic parse labeling for knowledge base question answering. In ACL’16, Cited by: §1, §4.1.
  • L. Zettlemoyer and M. Collins (2005) Learning to map sentences to logical form: structured classification with probabilistic categorial grammars. In UAI’05, Cited by: §1, §1.
  • Y. Zhang, P. Pasupat, and P. Liang (2017) Macro grammars and holistic triggering for efficient semantic parsing. In EMNLP’17, Cited by: §6.

Appendix A Program

We list the functions of KQA Pro in Table 8. The arguments in our paper are the textual inputs. To reduce the burden of the argument parser, for functions that take multiple textual inputs, we concatenate them into a single input.
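The concatenation step can be sketched as follows; this is an illustrative snippet, not the authors' code, and the separator token is an assumption:

```python
# Hypothetical sketch: a function like FilterNum takes several textual
# inputs (Key, Value, Op); we merge them into one string so the argument
# parser only has to predict a single sequence per function.
SEP = " <sep> "  # separator token; an assumption, not specified in the paper

def concat_textual_inputs(inputs):
    """Join the textual inputs of one function into a single string."""
    return SEP.join(inputs)

merged = concat_textual_inputs(["height", "200 centimetres", ">"])
# merged == "height <sep> 200 centimetres <sep> >"
```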

| Function | Functional Inputs | Textual Inputs | Outputs | Description | Example (only show textual inputs) |
|---|---|---|---|---|---|
| FindAll | () | () | (Entities) | Return all entities in KB | - |
| Find | () | (Name) | (Entities) | Return all entities with the given name | Find(Kobe Bryant) |
| FilterConcept | (Entities) | (Name) | (Entities) | Find those belonging to the given concept | FilterConcept(athlete) |
| FilterStr | (Entities) | (Key, Value) | (Entities, Facts) | Filter entities with an attribute condition of string type, return entities and corresponding facts | FilterStr(gender, male) |
| FilterNum | (Entities) | (Key, Value, Op) | (Entities, Facts) | Similar to FilterStr, except that the attribute type is number | FilterNum(height, 200 centimetres, ) |
| FilterYear | (Entities) | (Key, Value, Op) | (Entities, Facts) | Similar to FilterStr, except that the attribute type is year | FilterYear(birthday, 1980, ) |
| FilterDate | (Entities) | (Key, Value, Op) | (Entities, Facts) | Similar to FilterStr, except that the attribute type is date | FilterDate(birthday, 1980-06-01, ) |
| QFilterStr | (Entities, Facts) | (QKey, QValue) | (Entities, Facts) | Filter entities and corresponding facts with a qualifier condition of string type | QFilterStr(language, English) |
| QFilterNum | (Entities, Facts) | (QKey, QValue, Op) | (Entities, Facts) | Similar to QFilterStr, except that the qualifier type is number | QFilterNum(bonus, 20000 dollars, ) |
| QFilterYear | (Entities, Facts) | (QKey, QValue, Op) | (Entities, Facts) | Similar to QFilterStr, except that the qualifier type is year | QFilterYear(start time, 1980, ) |
| QFilterDate | (Entities, Facts) | (QKey, QValue, Op) | (Entities, Facts) | Similar to QFilterStr, except that the qualifier type is date | QFilterDate(start time, 1980-06-01, ) |
| Relate | (Entity) | (Pred, Dir) | (Entities, Facts) | Find entities that have a specific relation with the given entity | Relate(capital, forward) |
| And | (Entities, Entities) | () | (Entities) | Return the intersection of two entity sets | - |
| Or | (Entities, Entities) | () | (Entities) | Return the union of two entity sets | - |
| QueryName | (Entity) | () | (string) | Return the entity name | - |
| Count | (Entities) | () | (number) | Return the number of entities | - |
| QueryAttr | (Entity) | (Key) | (Value) | Return the attribute value of the entity | QueryAttr(height) |
| QueryAttrUnderCondition | (Entity) | (Key, QKey, QValue) | (Value) | Return the attribute value, whose corresponding fact should satisfy the qualifier condition | QueryAttrUnderCondition(population, point in time, 2016) |
| QueryRelation | (Entity, Entity) | () | (Pred) | Return the predicate between two entities | QueryRelation(Kobe Bryant, America) |
| SelectBetween | (Entity, Entity) | (Key, Op) | (string) | From the two entities, find the one whose attribute value is greater or less and return its name | SelectBetween(height, greater) |
| SelectAmong | (Entities) | (Key, Op) | (string) | From the entity set, find the one whose attribute value is the largest or smallest | SelectAmong(height, largest) |
| VerifyStr | (Value) | (Value) | (boolean) | Return whether the output of QueryAttr or QueryAttrUnderCondition and the given value are equal as string | VerifyStr(male) |
| VerifyNum | (Value) | (Value, Op) | (boolean) | Return whether the two numbers satisfy the condition | VerifyNum(20000 dollars, ) |
| VerifyYear | (Value) | (Value, Op) | (boolean) | Return whether the two years satisfy the condition | VerifyYear(1980, ) |
| VerifyDate | (Value) | (Value, Op) | (boolean) | Return whether the two dates satisfy the condition | VerifyDate(1980-06-01, ) |
| QueryAttrQualifier | (Entity) | (Key, Value, QKey) | (QValue) | Return the qualifier value of the fact (Entity, Key, Value) | QueryAttrQualifier(population, 23,390,000, point in time) |
| QueryRelationQualifier | (Entity, Entity) | (Pred, QKey) | (QValue) | Return the qualifier value of the fact (Entity, Pred, Entity) | QueryRelationQualifier(spouse, start time) |

Table 8: Details of the 27 functions in KQA Pro. Each function has two kinds of inputs: the functional inputs come from the outputs of previous functions, while the textual inputs come from the question.
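To make the functional/textual input distinction concrete, here is a toy executor for three of the functions above, chained so that each function's output becomes the next one's functional input. This is an illustrative sketch, not the authors' executor; the KB contents and entity IDs are invented for the example:

```python
# Toy KB: entity id -> name, concept, and numeric attributes (all invented).
KB = {
    "Q1": {"name": "Kobe Bryant", "concept": "athlete", "attrs": {"height": 198}},
    "Q2": {"name": "LeBron James", "concept": "athlete", "attrs": {"height": 206}},
}

def FindAll():
    """Return all entity ids in the KB (functional input for later steps)."""
    return list(KB)

def FilterNum(entities, key, value, op):
    """Keep entities whose numeric attribute `key` satisfies `op value`."""
    cmp = {">": lambda a, b: a > b, "<": lambda a, b: a < b,
           "=": lambda a, b: a == b}[op]
    return [e for e in entities if cmp(KB[e]["attrs"].get(key), value)]

def FilterConcept(entities, concept):
    """Keep entities belonging to the given concept."""
    return [e for e in entities if KB[e]["concept"] == concept]

def QueryName(entities):
    """Return the name of the (single) remaining entity."""
    return KB[entities[0]]["name"]

# Program for "Which athlete is taller than 200 centimetres?":
# FindAll -> FilterNum(height, 200, >) -> FilterConcept(athlete) -> QueryName
ans = QueryName(FilterConcept(FilterNum(FindAll(), "height", 200, ">"), "athlete"))
# ans == "LeBron James"
```

Note how only the strings ("height", "athlete", …) are textual inputs that the argument parser must predict; the entity sets flowing between calls are functional inputs.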

Appendix B Ontology-guided Pruning

Input: natural language question q, program sketch s, knowledge base K with ontology O

for all functions f in s do
     if f takes a relation argument then prune candidate relations with the domain and range constraints of O
     else if f takes a concept argument then prune candidate concepts with the type constraint of O
     else if f takes an attribute argument then prune candidate attributes with the constraints of O
     end if
end for
Algorithm 1 Ontology-guided Pruning
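A minimal sketch of how such pruning could work, assuming a simple dict-based ontology with the domain/range/type constraints described in Appendix C (the data layout and function signature here are illustrative assumptions, not the authors' implementation):

```python
def prune_candidates(sketch, candidates, ontology):
    """For each function in the sketch, keep only arguments consistent
    with the ontology; candidates maps a sketch position to its list of
    candidate relations/concepts."""
    pruned = {}
    # Types reachable so far, seeded by the topic entities' types.
    current_types = set(ontology["question_entity_types"])
    for i, fn in enumerate(sketch):
        if fn == "Relate":
            # Domain constraint: keep a relation only if its domain
            # intersects the types of the current entity set.
            kept = [r for r in candidates[i]
                    if ontology["domain"][r] & current_types]
            # Range constraint: the kept relations determine the next types.
            current_types = set().union(
                *(ontology["range"][r] for r in kept)) if kept else set()
        elif fn == "FilterConcept":
            # Type constraint: a concept must match a currently reachable type.
            kept = [c for c in candidates[i] if c in current_types]
        else:
            kept = list(candidates[i])
        pruned[i] = kept
    return pruned

ontology = {
    "question_entity_types": {"person"},
    "domain": {"place_of_birth": {"person"}, "capital": {"country"}},
    "range": {"place_of_birth": {"city"}, "capital": {"city"}},
}
out = prune_candidates(
    ["Relate", "FilterConcept"],
    {0: ["place_of_birth", "capital"], 1: ["city", "country"]},
    ontology)
# "capital" is pruned (its domain excludes person), and only "city"
# survives as a concept candidate.
```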

Appendix C Freebase Details

We extracted a subset of Freebase containing all facts within 4 hops of the entities mentioned in the questions of CWQ and WebQSP. We extracted the domain constraint for relations according to “/type/property/schema”, the range constraint for relations according to “/type/property/expected_type”, and the type constraint for entities according to “/type/type/instance”. CVT nodes in Freebase were handled by concatenating the neighboring relations.
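Collecting these three constraints from the schema predicates named above can be sketched as follows; the triple layout and example identifiers are assumptions for illustration, not the authors' extraction code:

```python
from collections import defaultdict

# Freebase schema predicates from which each constraint is read.
DOMAIN_PRED = "/type/property/schema"         # relation -> domain type
RANGE_PRED = "/type/property/expected_type"   # relation -> range type
TYPE_PRED = "/type/type/instance"             # type -> entity instance

def build_constraints(triples):
    """Scan (subject, predicate, object) triples and collect the
    domain/range constraints for relations and type constraints for entities."""
    domain, range_, entity_types = {}, {}, defaultdict(set)
    for s, p, o in triples:
        if p == DOMAIN_PRED:
            domain[s] = o
        elif p == RANGE_PRED:
            range_[s] = o
        elif p == TYPE_PRED:
            entity_types[o].add(s)  # s is the type, o the entity instance
    return domain, range_, entity_types

# Hypothetical example triples:
triples = [
    ("/people/person/place_of_birth", DOMAIN_PRED, "/people/person"),
    ("/people/person/place_of_birth", RANGE_PRED, "/location/location"),
    ("/people/person", TYPE_PRED, "/m/0abc"),
]
domain, range_, entity_types = build_constraints(triples)
# domain["/people/person/place_of_birth"] == "/people/person"
```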