Question Generation from a Knowledge Base with Web Exploration

10/12/2016 ∙ by Linfeng Song, et al. ∙ 0

Question generation from a knowledge base (KB) is the task of generating questions related to the domain of the input KB. We propose a system for generating fluent and natural questions from a KB, which significantly reduces human effort by leveraging massive web resources. In more detail, a seed question set is first generated by applying a small number of hand-crafted templates to the input KB. More questions are then retrieved by iteratively submitting already-obtained questions as search queries to a standard search engine. Finally, questions are selected by estimating their fluency and domain relevance. Evaluated by human graders on 500 randomly selected triples from Freebase, questions generated by our system are judged to be more fluent than those of Serban et al. (2016).




1 Introduction

Question generation is important because questions are useful for student assessment and coaching in educational or professional contexts, and a large-scale corpus of question and answer pairs is also critical to many NLP tasks, including question answering, dialogue interaction and intelligent tutoring systems. A large body of work [Chen et al.2009, Ali et al.2010, Heilman and Smith2010, Curto et al.2012, Lindberg et al.2013, Mazidi and Nielsen2014, Labutov et al.2015] studies question generation from text. Recently, interest has grown in question generation from KBs, since large-scale KBs such as Freebase [Bollacker et al.2008] and DBPedia [Auer et al.2007] are freely available, and entities and their relations are explicitly present in KBs but not in text.

Question generation from KB is challenging as function words and morphological forms for entities are abstracted away when a KB is created. To tackle this challenge, previous work [Seyler et al.2015, Serban et al.2016] relies on massive human-labeled data. Treating question generation as a machine translation problem, Serban et al. (2016) train a neural machine translation (NMT) system with 10,000 (triple, question) pairs (a triple is a (subject, predicate, object) entry in the KB, such as (jigsaw, performsActivity, CurveCut)). At test time, input triples are “translated” into questions with the NMT system. However, the question part of the 10,000 pairs is human generated, which requires a large amount of human effort. In addition, the grammaticality and naturalness of the generated questions cannot be guaranteed (as seen in Table 1).

We propose a system for generating questions from KB that significantly reduces the human effort by leveraging massive web resources. Given a KB, a small set of question templates is first hand-crafted based on the predicates in the KB. These templates consist of a transcription of the predicate in the KB (e.g. performsActivity → how to) and placeholders for the subject (#X#) and the object (#Y#). A seed question set is then generated by applying the templates to the KB. The seed question set is further expanded through a search engine (e.g., Google, Bing), by iteratively submitting each generated question as a search query to retrieve more related question candidates. Finally, a selection step is applied by estimating the fluency and domain relevance of each question candidate.

The only human labor in this work is the question template construction. Our system does not require a large number of templates because: (1) the iterative question expansion can produce a large number of questions even with a relatively small number of seed questions, as we see in the experiments, and (2) multiple entities in the KB share the same predicates. Another advantage is that our system can easily generate up-to-date questions, as the web is continuously updated. In our experiment, we compare with Serban et al. (2016) on 500 randomly selected triples from Freebase [Bollacker et al.2008]. Evaluated by 3 human graders, questions generated by our system are significantly better than those of Serban et al. (2016) on grammaticality and naturalness.

2 Knowledge Base

A knowledge base (KB) can be viewed as a directed graph, in which nodes are entities (such as “jigsaw” and “CurveCut”) and edges are relations between entities (such as “performsActivity”). A KB can also be viewed as a list of triples in the format (subject, predicate, object), where subjects and objects are entities, and predicates are relations.
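The two views of a KB described above can be illustrated with a minimal sketch. The `Triple` type and the `to_graph` helper are hypothetical names introduced here for illustration, and the second triple is made-up toy data; only the jigsaw triple comes from the text.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    """One KB fact: subject and object are entities, predicate is a relation."""
    subject: str
    predicate: str
    obj: str


# Triple-list view of a toy KB. The second triple is hypothetical.
triples = [
    Triple("jigsaw", "performsActivity", "CurveCut"),
    Triple("drill", "performsActivity", "HoleDrilling"),
]


def to_graph(triples):
    """Graph view: map each subject node to its outgoing (predicate, object) edges."""
    graph = defaultdict(list)
    for t in triples:
        graph[t.subject].append((t.predicate, t.obj))
    return dict(graph)


graph = to_graph(triples)
```

Both views carry the same information; the triple list is the more convenient one for template application, since each template is keyed on a predicate.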

3 System

Figure 1: Overview of our framework.

As shown in Figure 1, our system contains the sub-modules of question template construction, seed question generation, and question expansion and selection. Given an input KB, a small set of question templates is first constructed such that each template is associated with a predicate. A seed question set is then generated by applying the template set to the input KB. Finally, more questions are generated from related questions that are iteratively retrieved from a search engine, with already-obtained questions as search queries (section 3.1). Taking our in-house KB of the power tool domain as an example, the template “how to use #X#” is first constructed for the predicate “performsActivity”. The seed question “how to use jigsaw” is then generated by applying the template to the triple “jigsaw, performsActivity, CurveCut”, and finally related questions (Figure 2) are retrieved from Google with the seed question.
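A minimal sketch of seed question generation under the description above, assuming templates are stored per predicate and use the #X# and #Y# placeholders for subject and object. The function name `generate_seed_questions` and the dictionary layout are assumptions made for illustration, not the authors' implementation.

```python
# Hand-crafted templates, keyed by predicate. "#X#" stands for the subject
# and "#Y#" for the object of a triple.
TEMPLATES = {
    "performsActivity": ["how to use #X#"],
}


def generate_seed_questions(triples, templates):
    """Apply every template of a triple's predicate to that triple."""
    seeds = []
    for subject, predicate, obj in triples:
        for template in templates.get(predicate, []):
            question = template.replace("#X#", subject).replace("#Y#", obj)
            if question not in seeds:  # keep the seed set free of duplicates
                seeds.append(question)
    return seeds


seeds = generate_seed_questions(
    [("jigsaw", "performsActivity", "CurveCut")], TEMPLATES
)
```

Because templates are keyed by predicate, every entity that shares a predicate reuses the same templates, which is why a small template set can cover many triples.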

Figure 2: Related search results for the question “how to use jigsaw”.

3.1 Question expansion and selection

Data: seed question set Q_seed
Result: candidate questions Q_exp
1 Q_exp ← Q_seed;
2 Q_new ← Q_seed;
3 i ← 0;
4 while len(Q_new) > 0 and i < I_max do
5       i ← i + 1;
6       q_cur ← Q_new.Pop();
7       for q in WebExp(q_cur) do
8             if not Q_exp.contains(q) then
9                   Q_exp.Append(q);
10                  Q_new.Push(q);
11            end if
12      end for
13 end while
Algorithm 1 Question expansion method

As shown in Algorithm 1, the expanded question set Q_exp is initialized as the seed question set Q_seed (Line 1). In each iteration, an already-obtained question is expanded from the web and the retrieved questions are added to Q_exp if Q_exp does not contain them (Lines 6-10). As a large number of questions may be generated in the loop, we limit the maximum number of iterations with I_max (Line 4).
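The expansion loop can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: `web_expand` stands in for the search-engine lookup and is stubbed with a hypothetical in-memory table, the related questions in that table are made up, and treating the pending questions as a first-in-first-out queue is an assumption, since the text does not say in which order questions are popped.

```python
from collections import deque


def expand_questions(seed_questions, web_expand, max_iterations):
    """Iteratively expand a seed set with related questions from web_expand."""
    expanded = list(dict.fromkeys(seed_questions))  # ordered, de-duplicated
    seen = set(expanded)
    pending = deque(expanded)  # questions not yet used as queries
    iterations = 0
    while pending and iterations < max_iterations:
        iterations += 1
        current = pending.popleft()
        for question in web_expand(current):
            if question not in seen:
                seen.add(question)
                expanded.append(question)
                pending.append(question)
    return expanded


# Hypothetical stand-in for a search engine's related-question results.
FAKE_RELATED = {
    "how to use jigsaw": [
        "how to use a jigsaw to cut curves",
        "how to change jigsaw blade",
    ],
    "how to change jigsaw blade": ["how to use jigsaw"],
}


def fake_web_expand(question):
    return FAKE_RELATED.get(question, [])


result = expand_questions(["how to use jigsaw"], fake_web_expand, max_iterations=10)
```

The `seen` set only makes the membership test cheap; the iteration cap bounds the number of search queries regardless of how many new questions each query returns.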

Ours | Serban et al. (2016)
what is the cultural heritage of churchill national park | where in australia is churchill national park
what percentage of argentina’s population live in urban areas | what ’s one of the mountain where can you found in argentina in netflix
which country is the largest financial center of latin america | what is an organization that was born in latin america
which country has the largest freshwater lake in central america | what are the major town three gringos in venezuela and central america book
how does leukemia affect the body in children | who was someone who was involved in the leukemia
how does the nervous system maintain homeostasis | what is the drug category of central nervous system stimulation
why were colonial minutemen so prepared for the arrival of the british in concord | what county is concord in
which is the only country to have a bible on their national flag | whats the title of a book of the subject of the bible
why is new york called the city that never sleeps | who was born in new york
what three important documents were written in pennsylvania | what is located in pennsylvania
Table 1: Comparing generated questions

System | Grammaticality | Naturalness
Serban et al. (2016) | 3.36 | 3.14
Ours | 3.53 | 3.31
Table 2: Human ratings of generated questions

The questions collected from the web search engine may not be fluent or domain relevant; in particular, domain relevance drops significantly as the iterations go on. We therefore adopt a skip-gram model [Mikolov and Dean2013] and a language model to evaluate the domain relevance and fluency of the expanded questions, respectively. For domain relevance, we take the seed question set as the in-domain data $D_{in}$; the domain relevance of an expanded question $q$ is defined as:

$$\mathrm{REL}(q) = \cos\big(v(q), v(D_{in})\big)$$

where $v(\cdot)$ is the document embedding, defined as the averaged word embedding within the document. For fluency, we define the averaged language model score as:

$$\mathrm{AvgLM}(q) = \frac{\mathrm{LM}(q)}{\mathrm{Len}(q)}$$

where $\mathrm{LM}(\cdot)$ is the general-domain language model score (log probability), and $\mathrm{Len}(\cdot)$ is the word count. We apply thresholds $t_{rel}$ and $t_{flu}$ for domain relevance and fluency respectively, and filter out questions whose scores are below these thresholds.
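The selection step can be sketched as below, assuming cosine similarity between averaged word embeddings for domain relevance and a length-normalized log probability for fluency. Everything named here is hypothetical: the `word_vectors` table holds toy 2-dimensional vectors rather than trained skip-gram embeddings, and `log_prob` is a placeholder for a real language model.

```python
import math


def average_embedding(text, word_vectors):
    """Average the vectors of the known words in text; None if no word is known."""
    vectors = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def relevance(question, seed_questions, word_vectors):
    """Cosine between the question and the seed set treated as one document."""
    q_vec = average_embedding(question, word_vectors)
    d_vec = average_embedding(" ".join(seed_questions), word_vectors)
    if q_vec is None or d_vec is None:
        return 0.0
    return cosine(q_vec, d_vec)


def avg_lm_score(question, log_prob):
    """Language model log probability divided by the word count."""
    words = question.split()
    return log_prob(question) / len(words) if words else float("-inf")


def select(candidates, seed_questions, word_vectors, log_prob, t_rel, t_flu):
    """Keep candidates whose relevance and fluency both reach the thresholds."""
    return [
        q for q in candidates
        if relevance(q, seed_questions, word_vectors) >= t_rel
        and avg_lm_score(q, log_prob) >= t_flu
    ]


# Toy data: hypothetical 2-d vectors and a dummy LM scoring -1.0 per word.
word_vectors = {"saw": [1.0, 0.0], "drill": [0.9, 0.1], "paint": [0.0, 1.0]}
kept = select(["drill", "paint"], ["saw"], word_vectors,
              lambda q: -1.0 * len(q.split()), t_rel=0.5, t_flu=-2.0)
```

Dividing the log probability by the word count keeps longer questions from being penalized merely for their length.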

4 Experiments

We perform three experiments to evaluate our system qualitatively and quantitatively. In the first experiment, we compare our end-to-end system with the previous state-of-the-art method [Serban et al.2016] on Freebase [Bollacker et al.2008], a domain-general KB. In the second experiment, we validate our domain relevance evaluation method on a standard dataset about short document classification. In the final experiment, we run our end-to-end system on a highly specialized in-house KB and present sample results, showing that our system is capable of generating questions from domain specific KBs.

4.1 Evaluation on Freebase

We first compare our system with Serban et al. (2016) on 500 randomly selected triples from Freebase [Bollacker et al.2008] (footnote 2: we obtain their results from …). For the 500 triples, we hand-crafted 106 templates, as these triples share only 53 distinct predicates (we made 2 templates for each predicate on average). 991 seed questions are generated by applying the templates to the triples, and 1529 more questions are retrieved from Google. To evaluate the fluency of the candidate questions, we train a 4-gram language model (LM) on Gigaword (LDC2011T07) with Kneser-Ney smoothing. Using the averaged language model score as the ranking criterion, the top 500 questions are selected to compare with the results from Serban et al. (2016). We ask three native English speakers to evaluate the fluency and the naturalness (whether people would ask the question in reality) of both sets of results on a 4-point scale, where 4 is the best.

We show the average human ratings in Table 2, where we can see that our questions are more grammatical and natural than those of Serban et al. (2016). The naturalness score is lower than the grammaticality score for both methods. This is because naturalness is a stricter metric: a natural question should also be grammatical.

In Table 1, we compare our questions with those of Serban et al. (2016), where questions in the same row describe the same entity. We can see that our questions are grammatical and natural, as these questions are what people usually ask on the web. On the other hand, questions from Serban et al. (2016) are either ungrammatical (such as “who was someone who was involved in the leukemia ?” and “whats the title of a book of the subject of the bible ?”), unnatural (“what ’s one of the mountain where can you found in argentina in netflix ?”) or confusing (“who was someone who was involved in the leukemia ?”).

Method | Precision
Phan et al. (2008) | 82.18
Chen et al. (2011) | 85.31
Ma et al. (2015) | 85.48
Ours | 85.65
Table 3: Precision on the web snippet dataset

4.2 Domain Relevance

We test our domain-relevance evaluation method on the web snippet dataset, which is commonly used for domain classification of short documents. It contains 10,060 training and 2,280 test snippets (short documents) in 8 classes (domains), and each snippet has 18 words on average. There have been plenty of prior results [Phan et al.2008, Chen et al.2011, Ma et al.2015] on the dataset.

In Table 3, we compare our domain-relevance evaluation method (section 3.1) with previous state-of-the-art methods. Phan et al. (2008) first derive latent topics with LDA [Blei et al.2003] from Wikipedia, then use the topics as appended features to expand the short text. Chen et al. (2011) further extend Phan et al. (2008) by using multi-granularity topics. Ma et al. (2015) adopt a Bayesian model in which the probability that a document belongs to a topic equals the prior of the topic times the probability that each word in the document comes from that topic. Our method first concatenates the training documents of the same domain into one “domain document”, then calculates each document embedding by averaging the word embeddings within it, and finally assigns to each test document the label of the nearest “domain document” (by cosine similarity).
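The nearest “domain document” classifier described above can be sketched as follows. This is a toy illustration under assumptions: the word vectors are hypothetical 2-dimensional values rather than trained embeddings, the labels “business” and “sports” are made up, and the helper names are introduced here only for the example.

```python
import math
from collections import defaultdict


def embed(text, word_vectors):
    """Average the vectors of known words; a zero vector if none is known."""
    dim = len(next(iter(word_vectors.values())))
    vectors = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_domain_vectors(train_docs, word_vectors):
    """Concatenate the training documents of each label, then embed the result."""
    by_label = defaultdict(list)
    for text, label in train_docs:
        by_label[label].append(text)
    return {
        label: embed(" ".join(texts), word_vectors)
        for label, texts in by_label.items()
    }


def classify(text, domain_vectors, word_vectors):
    """Return the label whose domain document is nearest by cosine similarity."""
    vec = embed(text, word_vectors)
    return max(domain_vectors, key=lambda label: cosine(vec, domain_vectors[label]))


# Hypothetical toy vectors and training snippets.
word_vectors = {
    "stock": [1.0, 0.0], "bank": [0.9, 0.2],
    "goal": [0.0, 1.0], "match": [0.1, 0.9],
}
train_docs = [("stock bank", "business"), ("goal match", "sports")]
domain_vectors = build_domain_vectors(train_docs, word_vectors)
```

The classifier has no trained parameters beyond the word embeddings themselves, so adding a new domain only requires embedding one more concatenated document.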

Simple as it is, our method outperforms all previous methods, which demonstrates its effectiveness. A likely reason is that word embeddings capture the similarity between distinct words (such as “finance” and “economy”), which is hard for traditional methods. On the other hand, LDA only learns probabilities of words belonging to topics.

4.3 Evaluation on the Domain-specific KB

The last experiment is on our in-house KB in the power tool domain. It contains 67 distinct predicates, 293 distinct subjects and 279 distinct objects. For the 67 predicates, we hand-craft 163 templates. Here we use the same language model as in our first experiment, and learn a skip-gram model [Mikolov and Dean2013] on Wikipedia for evaluating domain relevance.

We generate 12,228 seed questions from which 20,000 more questions are expanded with Google. Shown in Table 4 are some expanded questions from which we can see that most of them are grammatical and relevant to the power tool domain. In addition, most questions are informative and correspond to a specific answer, except the one “do I need a hammer drill” that lacks context information. Finally, in addition to the simple factoid questions, our system generates many complex questions such as “how to cut a groove in wood without a router”.

how to change circular saw blade
how to measure lawn mower cutting height
how to sharpen drill bits on bench grinder
how does an oscillating multi tool work
how to cut a groove in wood without a router
what type of sander to use on deck
do i need a hammer drill
can i use acrylic paint on wood
how to use a sharpening stone with oil
Table 4: Examples of expanded questions

5 Conclusion

We presented a system to generate natural language questions from a knowledge base. By leveraging rich web information, our system is able to generate domain-relevant questions with wide coverage, while human effort is significantly reduced. Evaluated by human graders on 500 randomly selected triples from Freebase, questions generated by our system are significantly better than those from Serban et al. (2016). We also showed questions generated from our in-house KB of the power tool domain, which are in general fluent and domain-relevant. Our current system only generates questions without answers, leaving automatic answer mining as future work.

References

  • [Ali et al.2010] Husam Ali, Yllias Chali, and Sadid A Hasan. 2010. Automation of question generation from sentences. In Proceedings of QG2010: The Third Workshop on Question Generation, pages 58–67.
  • [Auer et al.2007] Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. Dbpedia: A nucleus for a web of open data. In The semantic web, pages 722–735. Springer.
  • [Blei et al.2003] David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation. Journal of machine Learning research, 3(Jan):993–1022.
  • [Bollacker et al.2008] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD, pages 1247–1250. ACM.
  • [Chen et al.2009] W Chen, G Aist, and J Mostow. 2009. Generating questions automatically from informational text. In Proceedings of the 2nd Workshop on Question Generation, pages 17–24.
  • [Chen et al.2011] Mengen Chen, Xiaoming Jin, and Dou Shen. 2011. Short text classification improved by learning multi-granularity topics. In IJCAI, pages 1776–1781. Citeseer.
  • [Curto et al.2012] Sérgio Curto, A Mendes, and Luisa Coheur. 2012. Question generation based on lexico-syntactic patterns learned from the web. Dialogue & Discourse, 3(2):147–175.
  • [Heilman and Smith2010] Michael Heilman and Noah A Smith. 2010. Good question! statistical ranking for question generation. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 609–617. Association for Computational Linguistics.
  • [Labutov et al.2015] Igor Labutov, Sumit Basu, and Lucy Vanderwende. 2015. Deep questions without deep understanding. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL-15), pages 889–898, Beijing, China.
  • [Lindberg et al.2013] David Lindberg, Fred Popowich, John Nesbit, and Phil Winne. 2013. Generating natural language questions to support learning on-line. pages 105–114.
  • [Ma et al.2015] Chenglong Ma, Weiqun Xu, Peijia Li, and Yonghong Yan. 2015. Distributional representations of words for short text classification. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pages 33–38, Denver, Colorado.
  • [Mazidi and Nielsen2014] Karen Mazidi and Rodney D Nielsen. 2014. Linguistic considerations in automatic question generation. In Proceedings of ACL, pages 321–326.
  • [Mikolov and Dean2013] T Mikolov and J Dean. 2013. Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems.
  • [Phan et al.2008] Xuan-Hieu Phan, Le-Minh Nguyen, and Susumu Horiguchi. 2008. Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In Proceedings of the 17th international conference on World Wide Web, pages 91–100.
  • [Serban et al.2016] Iulian Vlad Serban, Alberto García-Durán, Caglar Gulcehre, Sungjin Ahn, Sarath Chandar, Aaron Courville, and Yoshua Bengio. 2016. Generating factoid questions with recurrent neural networks: The 30m factoid question-answer corpus. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL-16), pages 588–598, Berlin, Germany.
  • [Seyler et al.2015] Dominic Seyler, Mohamed Yahya, and Klaus Berberich. 2015. Generating quiz questions from knowledge graphs. In Proceedings of the 24th International Conference on World Wide Web, pages 113–114. ACM.