
AutoQGS: Auto-Prompt for Low-Resource Knowledge-based Question Generation from SPARQL

by Guanming Xiong, et al.
Peking University

This study investigates the task of knowledge-based question generation (KBQG). Conventional KBQG works generate questions from fact triples in the knowledge graph, which cannot express complex operations such as aggregation and comparison in SPARQL. Moreover, due to the costly annotation of large-scale SPARQL-question pairs, KBQG from SPARQL under low-resource scenarios urgently needs to be explored. Recently, generative pre-trained language models (PLMs) such as T5 and BART, typically trained in a natural language (NL)-to-NL paradigm, have proven effective for low-resource generation; however, how to effectively utilize them to generate NL questions from non-NL SPARQL remains challenging. To address these challenges, we propose AutoQGS, an auto-prompt approach for low-resource KBQG from SPARQL. Firstly, we propose generating questions directly from SPARQL for the KBQG task to handle complex operations. Secondly, we propose an auto-prompter trained on large-scale unsupervised data to rephrase SPARQL into NL descriptions, smoothing the low-resource transformation from non-NL SPARQL to NL question with PLMs. Experimental results on WebQuestionsSP, ComplexWebQuestions 1.1, and PathQuestions show that our model achieves state-of-the-art performance, especially in low-resource settings. Furthermore, a corpus of 330k factoid complex question-SPARQL pairs is generated for further KBQG research.





1. Introduction

Figure 1. Model overview.

Knowledge-Based Question Generation (KBQG) is a task that aims to generate natural language questions given a knowledge base (KB). KBQG has a wide range of applications and has attracted wide attention in academia and industry. In this paper, we use a specific knowledge graph (KG), Freebase (Bollacker et al., 2008), as the KB. Recent works mainly adopt sequence-to-sequence neural networks to generate questions given a Resource Description Framework (RDF) graph (Miller, 1998, 2001), i.e., a directed graph composed of triple statements in a knowledge graph. For single-relation graphs, a series of works (Elsahar et al., 2018; Liu et al., 2019; Bi et al., 2020) enriched the input with auxiliary information, equipped the decoder with copy mechanisms, and devised delicate models to improve the fluency or semantic accuracy of generated questions. For star-like graphs, (Kumar et al., 2019) used a Transformer encoder while (Chen et al., 2022) applied a bidirectional Graph2Seq model to encode the graph structure. Very recently, JointGT (Ke et al., 2021) adopted the modern pre-trained language model BART (Lewis et al., 2020) to generate questions, achieving state-of-the-art performance.

However, KBQG still faces three major challenges:

1. Absence of complex operations

The existing KBQG methods mainly generate questions from an RDF graph. However, RDF is a standard model for data interchange on the Web; it is oriented toward describing resources, not toward expressing constraints on top of them, such as those needed for aggregation, superlative, and comparative questions. Compared with an RDF graph, a SPARQL expression has a complete semantic representation that covers all the question types mentioned above. How to generate questions based on SPARQL is thus a challenge.

2. Low resources

To express the predicates in a KG in NL form, traditional supervised methods tend to annotate large-scale SPARQL-question pairs. However, the labor cost is enormous. Besides, a KG may contain complex schemas, such as Compound Value Types (CVTs) in Freebase: a Compound Value Type is a type within Freebase used to represent data where each entry consists of multiple fields. Each CVT combination has a different meaning, so it is difficult for annotations to cover all combinations. How to generate questions without sufficient resources remains under-explored to date.

3. Efficient generation

Advanced generative pre-trained language models (PLMs) have proven effective in natural language generation (NLG) tasks (Lewis et al., 2020; Chen et al., 2020; Ke et al., 2021). Nonetheless, PLMs were trained in an NL-to-NL paradigm, whereas SPARQL expressions differ from the NL form. Therefore, how to leverage the strengths of PLMs to generate high-quality questions matters a lot.

To address the challenges mentioned above, we propose AutoQGS, an auto-prompt approach for low-resource KBQG from SPARQL. Figure 1 shows the overall process. Firstly, we incorporate SPARQL expression directly as input, which retains the original semantics. Secondly, we propose a model, auto-prompter, to rephrase the SPARQL to the corresponding NL description, named prompt text. Auto-prompter combines the strengths of distant supervision (Mintz et al., 2009) and the strong generative ability of PLMs. Specifically, the training data of auto-prompter are subgraphs that could be massively sampled from KB, and the target prompt text is collected from large-scale corpus as well. Lastly, we explore an efficient question generation method in low-resource scenarios. Our model significantly outperforms existing state-of-the-art baselines by a large margin, especially in low-resource settings.

The main contributions of this paper are summarized as follows.

  • We propose to generate questions directly from SPARQL for the KBQG task to handle complex operations.

  • We propose AutoQGS, an approach that rephrases an RDF graph into prompt text and generates questions from SPARQL and prompt text.

  • We conduct extensive experiments on two datasets, and the results show that AutoQGS improves performance observably.

2. Related Work

KBQG has come a long way in the past decades. Early works generated questions with template-based approaches. (Song and Zhao, 2017) proposed an unsupervised system that collects questions via a search engine from a small number of template-based questions. (Seyler et al., 2015, 2017) collected structured triple-pattern queries from seed questions and used a template-based method to verbalize the structured queries. However, template-based approaches lack flexibility, and the annotation cost is high.

Recent works on KBQG are mainly based on sequence-to-sequence neural networks over a set of subgraphs from a knowledge graph (KG). (Serban et al., 2016) first used a neural network to encode KG fact triples into natural language questions and generated the 30M Factoid Question-Answer dataset. However, it was trained on a large number of fact-question pairs, which are challenging to collect. (Reddy et al., 2017) proposed an RNN-based model to generate simple questions and corresponding answers by converting all the KG entities to a set of keywords. (Bao et al., 2018) and (Bao et al., 2019) developed a flexible copying mechanism to alleviate the rare-words problem. For the unseen predicates and entity types problem, (Elsahar et al., 2018) collected textual contexts from Wikipedia as auxiliary information and adopted a part-of-speech copy action mechanism to generate questions. However, (Liu et al., 2019) argued that such textual contexts are noisy or even wrong; they presented a model that integrates diversified off-the-shelf contexts and devised an answer-aware loss to ensure the generated questions are associated with a definitive answer. Based on the Transformer (Vaswani et al., 2017), (Kumar et al., 2019) proposed an end-to-end neural method for generating complex multi-hop and difficulty-controllable questions over a subgraph in a KG. (Bi et al., 2020) focused on the semantic drift problem: they incorporated auxiliary information and word types in generated questions, conditioned the decoder output on these types, and designed a DPT-based evaluator to encourage question structural conformity in a reinforcement learning framework. Instead of using a set of KG triples, (Chen et al., 2022) applied a bidirectional Graph2Seq model to encode the KG subgraph and target answers, and then generated questions with a node-level copying mechanism. (Ke et al., 2021) proposed a pre-trained model called JointGT for KG-to-text generation tasks: they added a structure-aware semantic aggregation module to BART to model the structure of input graphs, pre-trained the model on a large-scale corpus, and fine-tuned it on the question generation task.

In comparison, our approach utilizes advanced generative pre-trained language models to generate questions from SPARQL, rather than either making templates to rephrase questions or generating questions from a KG subgraph.

3. Approach

3.1. Problem Formulation

This study investigates the task of knowledge-based question generation (KBQG). Conventional KBQG works generated questions from fact triples in the knowledge graph, which could not express complex operations. A SPARQL expression has a complete syntax, capable of fully formalizing questions with complex operations (defined as functions beyond the KG predicates), e.g., aggregation (COUNT), comparison (<, >, <=, >=), and superlatives (ORDER BY ?x LIMIT 1). Moreover, due to the costly annotation of large-scale SPARQL-question pairs, KBQG from SPARQL under low-resource scenarios urgently needs to be explored. In this paper, we propose to generate questions directly from SPARQL for the low-resource KBQG task. Given an executable SPARQL expression s and a knowledge graph G, our goal is to generate a natural language (NL) question q that is consistent with the SPARQL. Given the above definitions, the task can be formalized as learning the distribution P(q | s, G).

3.2. Model Overview

Recently, generative pre-trained language models (PLMs) such as T5 (Raffel et al., 2020) and BART (Lewis et al., 2020), typically trained in an NL-to-NL paradigm, have proven effective for low-resource generation. However, generating questions directly from non-NL SPARQL does not suit these generative PLMs. In this paper, we propose AutoQGS, an auto-prompt approach that automatically rephrases SPARQL into NL text, smoothing the transformation from non-NL SPARQL to NL question. The overall process of AutoQGS is shown in Figure 1. Specifically, AutoQGS consists of two procedures: (1) auto-prompt from SPARQL to NL text, and (2) question generation (QG) based on the SPARQL and the NL prompt text.

Formally, auto-prompt aims to generate prompt text p from SPARQL s. The process of auto-prompt can be formalized as

p = argmax_p′ P(p′ | s; θ)

where θ denotes the parameters of the auto-prompter. Afterwards, s and p are concatenated as the input of a question generator to produce a question q (we replace the machine identifiers of the topic entities in the SPARQL with their surface names, but for brevity, we still use the symbol s). The procedure of question generation can be formalized as

q = argmax_q′ P(q′ | s, p; φ)

where φ denotes the parameters of the question generator.
We will introduce the auto-prompt and the question generation procedures in Section 3.3 and Section 3.4, respectively.

Figure 2. Overview of auto-prompt training and inference procedures. (For brevity, we omit the prefix: PREFIX ns: <…>.)

3.3. Auto-prompt

Auto-prompt generates prompt text p from SPARQL s. Figure 2 shows an overview of auto-prompt. Given a SPARQL s, we first execute it against G and obtain an instantiated subgraph g via the subgraph sampling defined in Section 3.3.1. g can be decomposed into a set of atomic subgraphs {g_i}. Each g_i is serialized as input to an auto-prompter and converted into text; the outputs are then concatenated to construct the prompt text p. Specifically, this procedure is formalized as follows:

p = CONCAT({BS(g_i; θ)}),  {g_i} = SAMPLE(s, G)

where the function CONCAT concatenates all the elements, BS is the generation process with beam search, which emits the predicted tokens d_1, …, d_T one at a time (we keep the hypothesis with the highest score), θ are the learnable parameters of the auto-prompter, d_t is the t-th predicted token, and the function SAMPLE indicates the graph sampling process. In this work, the auto-prompter supports both star and chain topologies, and the prompt text will vary as the input entities and their information change. By contrast, previous works only save hard-matched results for single-relation predicates (e.g., "is birthplace of" for "person/place_of_birth").

3.3.1. Subgraph Sampling

Subgraph sampling is a process that aims to find a KG subgraph that can instantiate a given SPARQL s based on G. Firstly, we define an atomic subgraph g in Freebase, where τ(g) ∈ {CVT, single-relation} identifies the type of the atomic subgraph: a Compound Value Type (CVT) subgraph or a single-relation one. A CVT subgraph consists of a central CVT node and its corresponding one-hop edges. A single-relation subgraph can be formalized as a (subject, predicate, object) triple. Specifically, we use Virtuoso to store Freebase. Following Google's instructions, we classify the predicates in Freebase into single-relation and CVT ones. The difference is that the tail node of a CVT predicate is a CVT node and the entire CVT graph expresses an event, as with the predicate "film.film_character.portrayed_in_films" shown in Figure 2a; a single-relation predicate links two named entities and expresses a fact.

In the training stage, given a predicate, we sample a set of corresponding atomic subgraphs from the database. For a single-relation predicate, we simply construct "SELECT ?s ?o WHERE {?s [predicate] ?o.}" to query the KG and save all (subject name, predicate, object name) triples returned. For each CVT predicate, we sample a set of nodes that are tails of the predicate, then instantiate a graph centered at each node in the set. As shown in Figure 2a, the instantiated CVT graph consists of both inside and outside edges in one-hop connection. For example, to sample graphs for the predicate "film.film_character.portrayed_in_films", we first construct a query, "SELECT ?x WHERE {?y film.film_character.portrayed_in_films ?x.}", to get the corresponding CVT nodes, and then for each node (e.g., m.0gxrhxd), we construct another query, "SELECT ?e1 ?p_in ?e2 ?p_out WHERE {?e1 ?p_in m.0gxrhxd. m.0gxrhxd ?p_out ?e2.}", to instantiate the CVT graph. Due to limited resources, instead of covering all predicates in Freebase, we selected the subset of predicates involved in the two datasets mentioned in Section 4.1.
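The sampling queries above can be sketched as simple string builders; this is a minimal sketch assuming a SPARQL endpoint (e.g., Virtuoso) executes the returned strings elsewhere, and the helper names are ours, not the paper's:

```python
# Builders for the three query shapes described in the text.
# Only the query templates follow the paper; the function names are illustrative.

def single_relation_query(predicate):
    """Sample (subject, object) pairs for a single-relation predicate."""
    return f"SELECT ?s ?o WHERE {{?s {predicate} ?o.}}"

def cvt_node_query(predicate):
    """Find CVT nodes that are tails of a CVT predicate."""
    return f"SELECT ?x WHERE {{?y {predicate} ?x.}}"

def cvt_graph_query(cvt_node):
    """Instantiate the one-hop graph (inside and outside edges) around a CVT node."""
    return (f"SELECT ?e1 ?p_in ?e2 ?p_out WHERE "
            f"{{?e1 ?p_in {cvt_node}. {cvt_node} ?p_out ?e2.}}")
```

Each builder returns a plain query string, so the same templates can be reused for every predicate in the selected subset.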

In the inference stage, given a SPARQL expression s, we search all variables (e.g., replacing "?x" with "?y ?x ?num" in the SPARQL) in the KG and sample one result to instantiate a complete subgraph g. As shown in Figure 2b, a SPARQL expression may consist of both single-relation and CVT subgraphs.

3.3.2. Subgraph Serialization

To explore how to mine the general statements more effectively for different entities with the same predicate, we propose two strategies of serialization.

Entity name

In this setup, we serialize the subgraph with the original entity names. Since Wikipedia is used to train PLMs, keeping the original names may be a simple but effective way to utilize the knowledge the model has learned. For a single-relation subgraph, the serialization pattern is "<H> [head entity name] <D> [head entity description] <P> [predicate] <T> [tail entity name] <D> [tail entity description]", where the special tokens <H>, <D>, <P>, and <T> mark the head entity, description, predicate, and tail entity, respectively. For a CVT subgraph, we add a special token "R@" in front of inside predicates (those pointing to a CVT node) in order to traverse the subgraph in a uniform format. The serialization pattern of one edge in the subgraph is therefore "<P> [R@][predicate] <T> [entity name] <D> [entity description]", and we then concatenate all edges together. Figure 2 gives an example.
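The two serialization patterns can be sketched as follows; the function names and the edge-tuple layout are our assumptions, and only the token patterns follow the text:

```python
# Entity-name serialization, following the <H>/<D>/<P>/<T> patterns above.

def serialize_single_relation(head, head_desc, predicate, tail, tail_desc):
    """Serialize a (subject, predicate, object) triple with descriptions."""
    return (f"<H> {head} <D> {head_desc} <P> {predicate} "
            f"<T> {tail} <D> {tail_desc}")

def serialize_cvt(edges):
    """Serialize a CVT subgraph given edges as (predicate, is_inside, name, desc)."""
    parts = []
    for pred, inside, name, desc in edges:
        # "R@" marks inside predicates, i.e. edges pointing to the CVT node.
        pred = ("R@" + pred) if inside else pred
        parts.append(f"<P> {pred} <T> {name} <D> {desc}")
    return " ".join(parts)
```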

Entity type placeholder

In this setup, we replace the entity name with its entity type in both the input and the target text. Intuitively, this delexicalization makes the model focus more on predicate rephrasing. Previous work (Elsahar et al., 2018) picked the entity type mentioned most often in the first sentence of the entity's Wikipedia article. However, it does not make sense to fix the entity type across all contexts. Instead, we directly utilize the head and tail entity types contained in a Freebase predicate. A predicate in Freebase is organized as "domain.type.property", where we treat the "property" part as the type of the object entity. For example, the part of the serialization shown in Figure 2a becomes: "<P> R@film.film_character.portrayed_in_films <T> [film_character] <D> [film_character] is the fictional character of the 1985 film Return to Oz.". In general, the serialization patterns used here are similar to those in the entity name setup.

3.3.3. Collection of Subgraph Description

Given an atomic subgraph g, the auto-prompter is trained to generate the corresponding subgraph description, denoted as d. Training the aforementioned auto-prompter needs large-scale labeled SPARQL and NL description pairs, since the SPARQL involves many KB predicates as well as SPARQL operators, e.g., "ORDER BY ?num DESC LIMIT 1" (meaning "argmax"). For example, given a non-NL SPARQL (a simplified version of Figure 2b) "SELECT DISTINCT ?x WHERE {m.01d1st ?y. ?y ?x .}" and the name of the topic entity m.01d1st, "Nick Cannon", an annotator needs to understand the SPARQL and write the corresponding NL text "Nick Cannon star in film [?x]". Labeling such large-scale data is impracticable. Therefore, the motivation here is to exploit a large-scale unstructured corpus to fill the gap between predicates and natural language expressions. Different from (Elsahar et al., 2018), who use a heuristic string-matching rule to find phrases in the corpus by co-occurrence of entity names, we propose a novel soft-generation approach that combines distant supervision (Mintz et al., 2009) with the generative ability of modern PLMs.

Specifically, to find NL descriptions for atomic subgraphs, we rely on the Wikipedia 2018-12-20 dump as the source of text documents. Each page in Wikipedia has a title and a content field consisting of a list of paragraphs. All these fields are re-tokenized by spaCy and indexed by Elasticsearch. As mentioned in (Riedel et al., 2010; Elsahar et al., 2018), we believe that the distant supervision assumption is effective on Wikipedia. For a single-relation subgraph, we match a sentence in Wikipedia if the subject name and the object name of the triple co-occur in that sentence. For a CVT subgraph, we match paragraphs in Wikipedia if all entities of the subgraph co-occur in the same paragraph. In addition, we remove sentences that do not contain any entity from matched paragraphs and drop subgraphs that match nothing. Furthermore, during the training phase, the entity names in the descriptions are replaced by their types in the entity type placeholder setup, while in the entity name setup the descriptions remain untouched. Accordingly, in the inference phase of the placeholder setup, we replace the entity types with the corresponding names to obtain the prompt text.
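The distant-supervision matching rule can be sketched in a few lines; the paper matches via Elasticsearch over the Wikipedia dump, while this toy version scans in-memory text with plain substring tests:

```python
# Distant-supervision matching, sketched: keep a sentence if both entity
# names of a single-relation triple co-occur in it, and keep a paragraph
# if all entities of a CVT subgraph co-occur in it.

def match_single_relation(sentences, subj, obj):
    """Sentences where the subject and object names co-occur."""
    return [s for s in sentences if subj in s and obj in s]

def match_cvt(paragraphs, entities):
    """Paragraphs where all entities of the CVT subgraph co-occur."""
    return [p for p in paragraphs if all(e in p for e in entities)]
```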

3.3.4. Training

The auto-prompter is based on an advanced generative PLM, BART (Lewis et al., 2020). Specifically, in order to make the optimization process more stable, we merge and shuffle the two kinds of subgraphs as training data, making them evenly distributed in each mini-batch. We then fine-tune two auto-prompters based on the two serialization strategies mentioned above. The details of the hyper-parameter settings are recorded in Appendix A.1. The auto-prompter is trained with a maximum likelihood objective. Given the training samples {(g^(i), d^(i))}, i = 1, …, N, the objective is defined as:

L(θ) = − Σ_{i=1}^{N} Σ_{t=1}^{|d^(i)|} log P(d_t^(i) | d_{<t}^(i), g^(i); θ)

where d_t^(i) is the t-th token in d^(i) and N is the number of training instances.
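As a toy illustration of this objective, the loss for one training instance is the summed negative log-probability assigned to the gold tokens; the probabilities below are invented for the example:

```python
import math

def sequence_nll(token_probs):
    """Negative log-likelihood of one gold sequence.

    token_probs[t] stands in for P(d_t | d_<t, g; theta), the probability
    the model assigns to the gold token at step t.
    """
    return -sum(math.log(p) for p in token_probs)

# A two-token sequence scored with made-up probabilities 0.5 and 0.25:
loss = sequence_nll([0.5, 0.25])  # -(ln 0.5 + ln 0.25) = 3 ln 2
```

Minimizing this quantity over all training instances is exactly the maximum-likelihood objective above.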

3.3.5. Prompt Text

The motivation for developing the prompt text is to smooth the transformation from non-NL SPARQL to NL question. Therefore we rephrase the formal expression in the KB into NL form with the help of PLMs. As shown in Figure 2b, the auto-prompter successfully and correctly generates the relation, "A Very School Gyrls Holla-Day starring Nick Cannon", from a relatively complex CVT subgraph. In summary, given a SPARQL expression s, Section 3.3.1 instantiates and samples one RDF subgraph g, Section 3.3.2 adds entity descriptions and serializes each atomic subgraph, and lastly the well-trained auto-prompter generates the prompt text p. It is hard to evaluate the quality of the prompt text with automatic metrics because there is no target text; therefore we conduct human evaluations and report the scores in Section 4.7.
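Putting the steps of Section 3.3 together, the auto-prompt procedure composes sampling, serialization, and generation; this is a minimal sketch with the trained auto-prompter stubbed out, and all function names are illustrative:

```python
def auto_prompt(sparql, sample_fn, serialize_fn, generate_fn):
    """Compose the auto-prompt pipeline: sample -> serialize -> generate -> concat."""
    subgraphs = sample_fn(sparql)                 # Section 3.3.1: instantiate atomic subgraphs
    descriptions = [generate_fn(serialize_fn(g))  # Section 3.3.2: serialize; then the
                    for g in subgraphs]           # auto-prompter describes each subgraph
    return " ".join(descriptions)                 # concatenate into the prompt text
```

With a real endpoint, serializer, and BART model plugged in for the three callables, the return value is the prompt text p fed to the question generator.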

3.4. Question Generation

Figure 3. Overview of question generation.

Question generation takes both the SPARQL s and the prompt text p as input, and generates the question q with the maximum-likelihood hypothesis in the beam search space. The question generator is implemented with another BART. The training and inference processes here are similar to the classic fine-tuning methods of BART. The overall process is shown in Figure 3.
3.4.1. Data Construction

We construct a training sample by concatenating s and p together as input, with the annotated question as the target. In order to better prompt the question generator, for each variable in the SPARQL we add "(the [var])" behind the corresponding instantiated entity in the prompt text (for example, we add "(the ?x)" behind "A Very School Gyrls Holla-Day").

It is worth noting that, following previous works, we convert both input and output to lowercase.
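This input construction can be sketched as follows, assuming a simple mapping from SPARQL variables to their instantiated entity names (the helper names are ours):

```python
def tag_variables(prompt_text, var_to_entity):
    """Insert "(the [var])" after each variable's instantiated entity."""
    for var, entity in var_to_entity.items():
        prompt_text = prompt_text.replace(entity, f"{entity} (the {var})")
    return prompt_text

def build_input(sparql, prompt_text, var_to_entity):
    """Concatenate SPARQL and tagged prompt text; lowercase, following prior work."""
    tagged = tag_variables(prompt_text, var_to_entity)
    return (sparql + " " + tagged).lower()
```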

3.4.2. Training and Inference

We fine-tune the question generator as a standard sequence-to-sequence model from the input to the output text, as presented in BART (Lewis et al., 2020). Formally, given the training samples {(s^(i), p^(i), q^(i))}, i = 1, …, N, the objective is defined as

L(φ) = − Σ_{i=1}^{N} Σ_{t=1}^{|q^(i)|} log P(q_t^(i) | q_{<t}^(i), s^(i), p^(i); φ)

where q_t^(i) is the t-th token in the target question q^(i), N is the number of training samples, and φ denotes the parameters of the question generator.

We use the standard beam search method for decoding. The specific parameter settings are reported in Appendix A.1.
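Beam-search decoding itself can be illustrated with a toy next-token table (invented for the example; real decoding runs over BART's vocabulary with the beam size reported in Appendix A.1):

```python
import math

# A made-up next-token distribution keyed by the sequence generated so far.
DIST = {
    (): {"who": 0.6, "what": 0.4},
    ("who",): {"directed": 0.7, "<eos>": 0.3},
    ("what",): {"film": 0.9, "<eos>": 0.1},
    ("who", "directed"): {"<eos>": 1.0},
    ("what", "film"): {"<eos>": 1.0},
}

def beam_search(beam_size=2, max_len=3):
    """Keep the beam_size best partial sequences; return the top hypothesis."""
    beams = [((), 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == "<eos>":
                candidates.append((seq, score))  # finished hypotheses carry over
                continue
            for tok, p in DIST.get(seq, {}).items():
                candidates.append((seq + (tok,), score + math.log(p)))
        beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_size]
    return beams[0][0]
```

Greedy decoding is the special case beam_size=1; a larger beam lets a token that looks locally worse win if its continuation scores higher overall.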

4. Experiments

In this section, we conduct extensive experiments to evaluate the effectiveness of our proposed approach.

4.1. Dataset & Preprocessing

Dataset        | #Rel | Train / Valid / Test    | Total            | Avg. Length
WQCWQ1.1       | 931  | 29,909 / 2,529 / 2,529  | 34,967           | 14.12
 w/ OPs        | 494  | 4,097 / 300 / 343       | 4,740 (13.56%)   | 14.69
 w/o OPs       | 870  | 25,812 / 2,229 / 2,186  | 30,277 (86.44%)  | 14.03
PathQuestions  | 378  | 9,793 / 1,000 / 1,000   | 11,793           | 12.4
Table 1. Statistics of the datasets.

WebQuestionsSP (Yih et al., 2016) and ComplexWebQuestions 1.1 (Talmor and Berant, 2018) are widely used question answering datasets that contain natural language questions and corresponding SPARQL queries. Previous works (Kumar et al., 2019; Ke et al., 2021; Chen et al., 2022) combined WebQuestionsSP and ComplexWebQuestions (the older version) and converted the SPARQL from SELECT queries to CONSTRUCT queries to obtain RDF graphs. However, ComplexWebQuestions is unavailable now, and previous works only released the processed results without the original SPARQL expressions, wherein the exact semantics are lost. Therefore, we merge and reprocess these two datasets ourselves and release a new dataset, WQCWQ1.1. We leave the original training set untouched and randomly divide the validation/test set equally.


PathQuestions (PQ) is constructed from a question answering dataset (Zhou et al., 2018). Its main characteristic is that PQ consists only of chain questions, wherein the path between the topic entity and the answer is 2-hop or 3-hop. Compared with WQCWQ1.1, PQ does not contain any complex operations in its questions. We use the same data as existing works (Kumar et al., 2019; Chen et al., 2022).

Brief statistics of the datasets are listed in Table 1, including the total number of relations, the data split, the subset divided by whether containing complex operations (OPs), and the average length of questions.

4.2. Implementation

Our auto-prompter and question generator are both based on the pre-trained model BART (Lewis et al., 2020). We initialize our model weights with the BART-base checkpoint released by HuggingFace's Transformers (Wolf et al., 2020). We follow BART in using a Byte-Pair Encoding (BPE) vocabulary (Radford et al., 2019) of size 50,265 and adopt Adam (Kingma and Ba, 2015) as the optimizer. Since computational resources are limited, we pick one set of hyper-parameters and train the auto-prompter on the unsupervised data for 10 epochs, which took 60 hours on 4 NVIDIA A100 (40GB) GPUs. For training the question generator, we apply different hyper-parameters for each data proportion setting. More details, including the hyper-parameters and search space settings, are reported in Appendix A.1.

4.3. Baseline Methods

We choose the following two categories of models as our baselines:

Pre-trained Models

We chose JointGT as the pre-trained baseline. JointGT (Ke et al., 2021) is a BART-based model for KG-to-text generation. It adopts a structure-aware semantic aggregation module to model the structure of the input graph at each Transformer layer. Afterward, it is pre-trained on large-scale KG-to-text corpora and then fine-tuned on downstream tasks, including question generation.

Task-Specific Models without Pre-training

We also adopted recent task-specific models without pre-training as baselines, including Graph2Seq (Chen et al., 2022) and KTG (Bi et al., 2020). Graph2Seq introduces a bidirectional graph encoder to model subgraphs in the KG and generates questions with a graph-to-sequence generator equipped with a copying mechanism. KTG proposes a knowledge-enriched, type-constrained, and grammar-guided model that enriches the input with auxiliary information.

We report baseline results directly when they use the same dataset as ours; otherwise, we run the baselines using the code and parameters released by the original papers.

4.4. Automatic Evaluation Metrics

Following previous QG works (Kumar et al., 2019; Bi et al., 2020; Ke et al., 2021; Chen et al., 2022), we use BLEU-4 (B-4) (Papineni et al., 2001), METEOR (ME) (Denkowski and Lavie, 2014), and ROUGE-L (R-L) (Lin, 2004) as our evaluation metrics. BLEU-4 and METEOR were originally designed to evaluate machine translation systems, and ROUGE-L was designed to assess text summarization systems.

4.5. Experimental Results

Results on WQCWQ1.1 (each cell: B-4 ME R-L):
Model      | 0.1%              | 0.5%              | 1%                | 5%                | 10%               | 100%
Graph2Seq  | 0.00 0.87 5.98    | 7.01 12.65 26.86  | 8.53 13.51 31.70  | 17.17 21.56 45.11 | 19.69 23.25 46.64 | 29.56 31.14 58.34
JointGT    | 6.92 18.77 32.14  | 11.31 23.10 38.62 | 13.83 24.90 42.52 | 20.79 28.41 49.67 | 25.25 30.50 54.05 | 32.81 34.48 60.92
AutoQGS    | 15.26 22.07 42.97 | 21.56 27.06 48.70 | 24.09 28.77 51.04 | 29.58 32.43 56.90 | 31.81 33.82 59.07 | 36.93 36.63 63.82
AutoQGS-T  | 14.43 21.45 42.59 | 20.40 26.09 47.53 | 23.18 28.11 50.40 | 28.86 31.83 56.13 | 32.00 33.79 59.52 | 36.49 36.38 63.53

Results on PathQuestions (each cell: B-4 ME R-L):
Model      | 0.1%              | 0.5%              | 1%                | 5%                | 10%               | 100%
KTG        | -                 | -                 | -                 | -                 | -                 | 45.58 52.31 73.21
Graph2Seq  | 1.01 4.99 12.07   | 2.63 10.64 41.45  | 17.59 18.35 51.44 | 43.43 31.34 67.51 | 42.72 32.20 67.62 | 61.48 44.57 77.72
JointGT    | 43.15 35.91 69.57 | 51.05 41.23 73.23 | 51.89 42.19 73.62 | 55.90 43.25 74.49 | 57.39 43.51 75.26 | 65.89 48.25 78.87
AutoQGS    | 43.46 33.55 68.23 | 56.30 41.95 74.68 | 58.69 42.48 75.50 | 61.55 44.81 76.68 | 60.73 44.92 76.95 | 65.13 47.50 76.80
Table 2. Results on WQCWQ1.1 and PathQuestions in six data proportion settings. Some baseline results are re-printed from (Bi et al., 2020), (Chen et al., 2022), and (Ke et al., 2021).
Dataset    | WQCWQ1.1           | PathQuestions
Gain over  | B-4   ME    R-L    | B-4   ME    R-L
Graph2Seq  | 12.88 12.97 17.98  | 29.82 18.83 22.26
JointGT    | 8.06  3.44  7.43   | 3.43  0.15  0.63
Table 3. Average gains of AutoQGS over the baselines on WQCWQ1.1 and PathQuestions across the six data proportion settings.

4.5.1. Automatic Evaluation

Table 2 shows the detailed evaluation results comparing our proposed models against state-of-the-art baselines in six data proportion settings, from 0.1% to 100%. AutoQGS uses the entity name serialization setup by default; we also report results for the entity type placeholder setup on WQCWQ1.1, denoted AutoQGS-T. Since AutoQGS performs better than AutoQGS-T under most settings on WQCWQ1.1, we only report AutoQGS performance hereafter. Both AutoQGS and AutoQGS-T outperform every baseline by a large margin on WQCWQ1.1, particularly in few-shot settings. Table 3 shows the average gains across the six settings: AutoQGS exceeds Graph2Seq/JointGT by 12.88/8.06 BLEU-4 points on WQCWQ1.1 and by 29.82/3.43 on PQ, respectively. For Graph2Seq specifically, the vocabulary depends heavily on the training data; as the training instances decrease, the vocabulary shrinks and more words become out-of-vocabulary (OOV), which severely degrades the model's performance. Furthermore, our model outperforms JointGT in all six settings except the 0.1% setting of PQ. We speculate that, with only ten training instances in that setting (9,793 × 0.001, rounded up), the task is too difficult for both JointGT and AutoQGS. These results verify that AutoQGS can effectively and accurately generate questions from SPARQL in low-resource scenarios.

4.5.2. Impact of Complex Operations

WQCWQ1.1   | w/ OPs             | w/o OPs
Model      | B-4   ME    R-L    | B-4   ME    R-L
Graph2Seq  | 22.17 26.63 51.72  | 30.11 31.32 58.79
JointGT    | 24.33 30.83 54.35  | 35.08 35.35 62.48
AutoQGS    | 35.75 36.73 62.11  | 36.03 36.09 63.37
Table 4. Results on WQCWQ1.1 subsets divided by whether they contain complex operations.

Next, we evaluate how complex operations affect the performance of AutoQGS. We keep the same divisions of WQCWQ1.1 and split the data into two subsets by whether they contain complex operations; the statistics are reported in Table 1. About 86.44% of the data do not contain any complex operation, which means this part is similar to the data used by previous works. We report the comparison results on both subsets in Table 4. AutoQGS performs better in both settings, confirming that our model generates questions from SPARQL better than the alternatives.

4.6. Ablation Test

Data Proportion   | 0.1%               | 1%                 | 10%                | 100%
Model             | B-4   ME    R-L    | B-4   ME    R-L    | B-4   ME    R-L    | B-4   ME    R-L
AutoQGS           | 15.26 22.07 42.97  | 24.09 28.77 51.04  | 31.81 33.82 59.07  | 36.93 36.63 63.82
 w/o prompt text  | 13.72 20.87 42.46  | 18.69 24.39 46.75  | 31.07 33.08 58.79  | 36.18 36.19 63.57
 re/desc          | 11.65 19.16 40.39  | 21.92 26.66 48.09  | 31.49 33.31 58.82  | 36.00 36.17 63.19
Table 5. Ablation tests on WQCWQ1.1.

We conduct ablation tests to investigate the effectiveness of AutoQGS in four representative data proportion settings (0.1%, 1%, 10%, 100%), removing the prompt text and replacing the prompt text with topic entity descriptions (re/desc for short), one at a time. As shown in Table 5, removing the prompt text leads to a significant performance reduction in all settings, particularly at 0.1% and 1%. This decline is consistent with our purpose, as the prompt text is designed to help the model perform better in low-resource scenarios. Similarly, we observe a performance decline when replacing the prompt text with topic entity descriptions. This again validates the effectiveness of the prompt text, which is developed to mine relations between entities rather than merely describe them.

4.7. Human Evaluation

Auto-prompter
Graph Type        Pred          Natural
single-relation   81.85% (5.4)  4.25 (0.18)
CVT               71.11% (7.2)  3.66 (0.25)

Question generator
Data Proportion   0.1%                     1%
Model             Pred       Natural       Pred       Natural
JointGT           44% (4.2)  2.76 (0.16)   72% (4.9)  3.72 (0.22)
AutoQGS           56% (6.5)  4.36 (0.19)   80% (5.0)  4.60 (0.24)

Data Proportion   10%                      100%
JointGT           80% (3.5)  4.20 (0.30)   84% (3.3)  4.28 (0.19)
AutoQGS           84% (4.1)  4.68 (0.32)   88% (3.9)  4.84 (0.16)

Golden            95% (2.1)  4.80 (0.18)
Table 6. Human evaluation results (standard deviations in parentheses). Pred and Natural denote the predicate identification rate and the naturalness score (0-5), respectively.

To further evaluate AutoQGS, we conduct two human evaluations, one on the auto-prompter and one on the question generator. Since the goals of the two components are essentially similar, we apply the same two criteria, following (Elsahar et al., 2018).

Predicates identification

Annotators were asked to judge whether the generated text expresses all predicates in the given SPARQL (or subgraph).

Naturalness

Annotators were asked to rate each generated text for fluency and readability on a scale from 1 to 5, where (5) means perfectly clear and natural, (3) grammatically correct but artificial-sounding, and (1) entirely incomprehensible.

To evaluate the auto-prompter, we randomly sample 100 atomic subgraphs from the dataset. To evaluate the question generator, we randomly sample 100 instances from the test set and also collect the outputs of the most competitive baseline, JointGT, for comparison. All evaluations are done by 3 annotators. The results in Table 6 yield several observations. First, the auto-prompter paraphrases predicates consistently and fluently. Second, the question generator achieves strong results in all four settings; notably, its naturalness score in the 100% setting is even higher than that of the annotated questions (denoted as Golden). Overall, AutoQGS beats the corresponding baselines in both predicate identification and naturalness.

4.8. Case Study

Figure 4. Case study

To provide a complete and visual presentation of AutoQGS, we show a case in Figure 4. Steps 1 to 4 illustrate how the auto-prompter produces the prompt text. First, we rewrite the SPARQL query (adding “?c ?num” in this case) to retrieve all variables from the KG and sample one result to instantiate a complete subgraph. Second, we serialize each atomic subgraph of the complete subgraph in the corresponding pattern. We then use the trained auto-prompter to generate the prompt text for every atomic subgraph. Finally, to explicitly describe the relationship between entities and variables, we append “(the [var])” after the corresponding entities in the prompt text, as shown in Step 4. The table in Figure 4 demonstrates the quality of the generated questions (converted to lowercase). Compared to JointGT, AutoQGS generates questions more faithfully and completely. More importantly, AutoQGS accurately generates questions with complex operations according to the SPARQL. For example, JointGT fails to express “earliest” for “ORDER BY ?num LIMIT 1” in all settings, whereas our model captures it with only 10% of the training data (marked in bold). In addition, our model correctly generates “Australia” with the help of the prompt text (underlined), showing that the prompt text successfully supplements information for question generation. Moreover, in the 0.1% setting, the question generated by our model is much more natural and fluent.
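Step 4 above can be sketched as a simple string substitution. The binding format below is our assumption for illustration, not the paper's exact data structure:

```python
# Illustrative sketch of Step 4: after the auto-prompter verbalizes each atomic
# subgraph, append "(the ?var)" after every entity label that instantiates a
# SPARQL variable, so the question generator can link surface forms back to
# variables in the query.

def annotate_prompt(prompt: str, bindings: dict) -> str:
    """bindings maps variable names (e.g. '?x') to the entity label
    that instantiated them (e.g. 'The Theatre of Pompey')."""
    for var, label in bindings.items():
        prompt = prompt.replace(label, f"{label} (the {var})")
    return prompt

annotated = annotate_prompt(
    "The Theatre of Pompey was built during the reign of Julius Caesar.",
    {"?x": "The Theatre of Pompey"},
)
# -> "The Theatre of Pompey (the ?x) was built during the reign of Julius Caesar."
```

This string-level substitution is only safe when the entity label appears verbatim and unambiguously in the prompt text; a production version would need to handle repeated or overlapping mentions.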

4.9. Error Analysis

Case 1
SPARQL       ”Julius Caesar” _of_death ”The Theatre of Pompey” (the ?x) .
Prompt text  The Theatre of Pompey (the ?x) was built during the reign of Julius Caesar.
Golden       where was caesar when he was stabbed?
AutoQGS      where did julius caesar die?

Case 2
SPARQL       ”Presidential system” (the ?x) .
             ”Presidential system” (the ?x) government.form_of_government.countries ”Brazil” .
Prompt text  Chile has a Presidential system (the ?x) with a bicameral legislature. Brazil is a country with a Presidential system (the ?x).
Golden       what are the government types of chile and brazil?
AutoQGS      what type of government is used in both brazil and chile?
Table 7. Error analysis.

Table 7 shows some failure cases on the WQCWQ1.1 test set in the 100% data proportion setting. For convenience, variables in the SPARQL are instantiated and shown together with their variable names. The most common mistake of the auto-prompter is that the distant supervision approach sometimes produces descriptions inconsistent with the facts, as in the first case (the Theatre of Pompey was built during the latter part of the Roman Republican era by Pompey the Great, not Julius Caesar). For the question generator, one frequent error pattern is incorrect phrasing: in the second example, the statement “what type of government is used” is grammatically correct, but the usage is inappropriate. Another error pattern is a lack of commonsense knowledge: in the first example, neither the SPARQL nor the prompt text indicates that Caesar was stabbed, yet questions commonly rely on such common sense.

4.10. Data Augmentation

Automatically constructing question-answer pairs from knowledge bases is one of the objectives of question generation. Based on the WQCWQ1.1 dataset, we augment the data by replacing topic entities in SPARQL. Specifically, given a SPARQL query, we replace its topic entities with variables and execute the resulting query to retrieve alternative sets of topic entities. For each retrieved entity set, we construct a new SPARQL query, identical to the original except for the named entities, to form a new sample. We then use the best AutoQGS model trained on 100% of the data to generate the corresponding questions. To multiply the data tenfold, we randomly choose ten instances from the query results for each item in WQCWQ1.1. Finally, we obtain a corpus of 340K question-SPARQL pairs for future research on question generation and question answering.
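The augmentation rewrite can be sketched as follows. The entity IDs and the `ns:` prefix are illustrative placeholders, not values taken from the paper:

```python
# Hypothetical sketch of the data augmentation rewrite: turn each topic entity
# into a fresh variable, query the KG for alternative bindings, then substitute
# a sampled binding back into the template to form a new sample.

def entities_to_variables(sparql: str, topic_entities: list) -> str:
    """Rewrite the query so its topic entities become variables ?e0, ?e1, ..."""
    for i, ent in enumerate(topic_entities):
        sparql = sparql.replace(ent, f"?e{i}")
    return sparql

def instantiate(template: str, binding: list) -> str:
    """Substitute one retrieved entity set back into the template."""
    for i, ent in enumerate(binding):
        template = template.replace(f"?e{i}", ent)
    return template

query = "SELECT ?x WHERE { ns:m.0d05w3 ns:location.country.capital ?x }"
template = entities_to_variables(query, ["ns:m.0d05w3"])
# -> "SELECT ?x WHERE { ?e0 ns:location.country.capital ?x }"
new_query = instantiate(template, ["ns:m.0f8l9c"])
```

In practice the template query would be executed against the KG endpoint to enumerate candidate bindings, and ten of them would be sampled per original item.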

5. Conclusion

In this paper, we propose AutoQGS, an auto-prompt approach for low-resource KBQG from SPARQL. First, we generate questions directly from SPARQL to handle complex operations. Second, we propose an auto-prompter to rephrase SPARQL into NL prompt text, smoothing the transformation from non-NL SPARQL to NL questions with PLMs. Third, we devise a question generator that produces questions given the SPARQL and the corresponding prompt text. Experimental results show that our approach achieves state-of-the-art performance, especially in low-resource scenarios. Furthermore, we generate a dataset of 330k factoid complex question-SPARQL pairs for further KBQG research.

Acknowledgments

This work was supported by the National Key Research and Development Program of China under Grant No. 2020AAA0108600 and No. 2020YFC0833301.

Appendix A Appendix

a.1. Hyper-Parameter Setting

Model                  auto-prompter   question generator
Hyper-parameter        Setting         Search Space
Learning Rate          3e-5            [3e-5, 5e-5, 1e-4]
Warmup Proportion      0.1             [0.1, 0.2, 0.3]
Batch Size             64              [16, 24, 32, 64]
Beam Size              10              [5, 10]
Length Penalty         -               [1.0, 1.2]
Input Length           512             512
Output Length          512             128
Warmup Epoch           10              50
Early Stop Patience    -               10
Maximum Gradient Norm  1.0             1.0
Optimizer              Adam            Adam
Epsilon (for Adam)     1e-8            1e-8
Table 8. Hyper-parameter search space. Values in brackets [] are traversed one by one during the search.
Dataset: WQCWQ1.1
Data Proportion    0.1%  0.5%  1%    5%    10%   100%
Warmup Proportion  0.2   0.1   0.1   0.2   0.2   0.2
Batch Size         16    16    32    32    32    64
Beam Size          10    10    10    10    10    10
Length Penalty     1.2   1.0   1.2   1.0   1.0   1.0

Dataset: PathQuestions
Data Proportion    0.1%  0.5%  1%    5%    10%   100%
Warmup Proportion  0.1   0.1   0.1   0.1   0.1   0.1
Batch Size         16    16    16    24    24    24
Beam Size          10    10    10    5     5     5
Length Penalty     1.2   1.2   1.2   1.0   1.0   1.0
Table 9. Best assignments of hyper-parameters for the question generator.

We provide the detailed hyper-parameter settings for training the auto-prompter and the question generator in Table 8, which includes the fixed settings for the auto-prompter and the search space for the question generator. We implement our models based on Huggingface's Transformers (Wolf et al., 2020). For the auto-prompter, we train the model on unsupervised data for 10 epochs. For the question generator, we set Warmup Epoch, the expected number of training epochs, to control the number of warmup steps. The search space and the best assignments are listed in Table 8 and Table 9, respectively. We traverse each combination of hyper-parameters in every few-shot setting to find the optimal one, using BLEU-4 on the validation set as the selection criterion.
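The conversion from Warmup Epoch to warmup steps can be sketched as follows; this is our reading of the setup, and the exact formula is an assumption:

```python
import math

# Sketch of how a "Warmup Epoch" value can be converted into the number of
# warmup steps for a linear learning-rate warmup schedule: warmup steps equal
# the number of optimizer steps per epoch times the warmup epoch count.

def warmup_steps(num_examples: int, batch_size: int, warmup_epochs: int) -> int:
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * warmup_epochs

# Hypothetical example: a 1,000-example few-shot split with batch size 64
# and Warmup Epoch 10 gives ceil(1000/64) * 10 = 160 warmup steps.
```

Expressing warmup in epochs rather than raw steps keeps the schedule comparable across data proportion settings, where the number of steps per epoch varies widely.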


  • J. Bao, D. Tang, N. Duan, Z. Yan, Y. Lv, M. Zhou, and T. Zhao (2018) Table-to-text: describing table region with natural language. Proceedings of the AAAI Conference on Artificial Intelligence 32 (1). External Links: Link, Document Cited by: §2.
  • J. Bao, D. Tang, N. Duan, Z. Yan, M. Zhou, and T. Zhao (2019) Text generation from tables. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (2), pp. 311–320. External Links: Document Cited by: §2.
  • S. Bi, X. Cheng, Y. Li, Y. Wang, and G. Qi (2020) Knowledge-enriched, Type-constrained and Grammar-guided Question Generation over Knowledge Bases. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain (Online), pp. 2776–2786 (en). External Links: Link, Document Cited by: §1, §2, §4.3, §4.4, Table 2.
  • K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor (2008) Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD ’08, New York, NY, USA, pp. 1247–1250. External Links: ISBN 9781605581026, Link, Document Cited by: §1.
  • Y. Chen, L. Wu, and M. J. Zaki (2022) Toward Subgraph Guided Knowledge Graph Question Generation with Graph Neural Networks. arXiv:2004.06015 [cs] (en). Note: arXiv: 2004.06015Comment: 12 pages External Links: Link Cited by: §1, §2, §4.1, §4.1, §4.3, §4.4, Table 2.
  • Z. Chen, H. Eavani, W. Chen, Y. Liu, and W. Y. Wang (2020) Few-Shot NLG with Pre-Trained Language Model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 183–190. External Links: Link, Document Cited by: §1.
  • M. Denkowski and A. Lavie (2014) Meteor Universal: Language Specific Translation Evaluation for Any Target Language. In Proceedings of the Ninth Workshop on Statistical Machine Translation, Baltimore, Maryland, USA, pp. 376–380 (en). External Links: Link, Document Cited by: §4.4.
  • H. Elsahar, C. Gravier, and F. Laforest (2018) Zero-Shot Question Generation from Knowledge Graphs for Unseen Predicates and Entity Types. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 218–228 (en). External Links: Link, Document Cited by: §1, §2, §3.3.2, §3.3.3, §3.3.3, §4.7.
  • P. Ke, H. Ji, Y. Ran, X. Cui, L. Wang, L. Song, X. Zhu, and M. Huang (2021) JointGT: Graph-Text Joint Representation Learning for Text Generation from Knowledge Graphs. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online, pp. 2526–2538 (en). External Links: Link, Document Cited by: §1, §1, §2, §4.1, §4.3, §4.4, Table 2.
  • D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), External Links: Link Cited by: §4.2.
  • V. Kumar, Y. Hua, G. Ramakrishnan, G. Qi, L. Gao, and Y. Li (2019) Difficulty-Controllable Multi-hop Question Generation from Knowledge Graphs. In The Semantic Web – ISWC 2019, Vol. 11778, pp. 382–398 (en). Note: Series Title: Lecture Notes in Computer Science External Links: ISBN 978-3-030-30792-9 978-3-030-30793-6, Link, Document Cited by: §1, §2, §4.1, §4.1, §4.4.
  • M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer (2020) BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 7871–7880 (en). External Links: Link, Document Cited by: §1, §1, §3.2, §3.3.4, §3.4.2, §4.2.
  • C. Lin (2004) ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out, Barcelona, Spain, pp. 74–81. External Links: Link Cited by: §4.4.
  • C. Liu, K. Liu, S. He, Z. Nie, and J. Zhao (2019) Generating Questions for Knowledge Bases via Incorporating Diversified Contexts and Answer-Aware Loss. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 2431–2441 (en). External Links: Link, Document Cited by: §1, §2.
  • E. J. Miller (2001) An introduction to the resource description framework. Journal of Library Administration 34 (3-4), pp. 245–255. External Links: Document, Link Cited by: §1.
  • E. Miller (1998) An introduction to the resource description framework.. D-lib Magazine. Cited by: §1.
  • M. Mintz, S. Bills, R. Snow, and D. Jurafsky (2009) Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, Suntec, Singapore, pp. 1003–1011. External Links: Link Cited by: §1.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2001) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics - ACL ’02, Philadelphia, Pennsylvania, pp. 311 (en). External Links: Link, Document Cited by: §4.4.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. Cited by: §4.2.
  • C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21 (140), pp. 1–67. External Links: Link Cited by: §3.2.
  • S. Reddy, D. Raghu, M. M. Khapra, and S. Joshi (2017) Generating Natural Language Question-Answer Pairs from a Knowledge Graph Using a RNN Based Question Generation Model. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, Valencia, Spain, pp. 376–385 (en). External Links: Link, Document Cited by: §2.
  • S. Riedel, L. Yao, and A. McCallum (2010) Modeling Relations and Their Mentions without Labeled Text. In Machine Learning and Knowledge Discovery in Databases, J. L. Balcázar, F. Bonchi, A. Gionis, and M. Sebag (Eds.), Berlin, Heidelberg, pp. 148–163. External Links: ISBN 978-3-642-15939-8 Cited by: §3.3.3.
  • I. V. Serban, A. García-Durán, C. Gulcehre, S. Ahn, S. Chandar, A. Courville, and Y. Bengio (2016) Generating Factoid Questions With Recurrent Neural Networks: The 30M Factoid Question-Answer Corpus. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 588–598 (en). External Links: Link, Document Cited by: §2.
  • D. Seyler, M. Yahya, and K. Berberich (2015) Generating quiz questions from knowledge graphs. In Proceedings of the 24th International Conference on World Wide Web, WWW ’15 Companion, New York, NY, USA, pp. 113–114. External Links: ISBN 9781450334730, Link, Document Cited by: §2.
  • D. Seyler, M. Yahya, and K. Berberich (2017) Knowledge questions from knowledge graphs. In Proceedings of the ACM SIGIR International Conference on Theory of Information Retrieval, ICTIR ’17, New York, NY, USA, pp. 11–18. External Links: ISBN 9781450344906, Link, Document Cited by: §2.
  • L. Song and L. Zhao (2017) Question Generation from a Knowledge Base with Web Exploration. arXiv:1610.03807 [cs] (en). Note: arXiv: 1610.03807 External Links: Link Cited by: §2.
  • A. Talmor and J. Berant (2018) The Web as a Knowledge-Base for Answering Complex Questions. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 641–651 (en). External Links: Link, Document Cited by: §4.1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30, pp. . External Links: Link Cited by: §2.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, and A. Rush (2020) Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, pp. 38–45. External Links: Link, Document Cited by: §A.1, §4.2.
  • W. Yih, M. Richardson, C. Meek, M. Chang, and J. Suh (2016) The Value of Semantic Parse Labeling for Knowledge Base Question Answering. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Berlin, Germany, pp. 201–206 (en). External Links: Link, Document Cited by: §4.1.
  • M. Zhou, M. Huang, and X. Zhu (2018) An Interpretable Reasoning Network for Multi-Relation Question Answering. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, USA, pp. 2010–2022. External Links: Link Cited by: §4.1.