
Knowledge Questions from Knowledge Graphs

We address the novel problem of automatically generating quiz-style knowledge questions from a knowledge graph such as DBpedia. Questions of this kind have ample applications, for instance, to educate users about or to evaluate their knowledge in a specific domain. To solve the problem, we propose an end-to-end approach. The approach first selects a named entity from the knowledge graph as an answer. It then generates a structured triple-pattern query, which yields the answer as its sole result. If a multiple-choice question is desired, the approach selects alternative answer options. Finally, our approach uses a template-based method to verbalize the structured query and yield a natural language question. A key challenge is estimating how difficult the generated question is to human users. To do this, we make use of historical data from the Jeopardy! quiz show and a semantically annotated Web-scale document collection, engineer suitable features, and train a logistic regression classifier to predict question difficulty. Experiments demonstrate the viability of our overall approach.


1 Introduction

“This president from Illinois won a Grammy.”

Figure 1: A fragment of a KG, a topic, and a hard question generated from it. Two distractors for turning it into a multiple-choice question are shown, one easy to rule out and one hard (Reagan had a Hollywood career before becoming president).

Knowledge graphs (KGs) such as YAGO [42] and DBpedia [4] contain facts about real-world named entities. They provide taxonomic knowledge, for instance, that BarackObama is a person as well as a formerSenator. They also contain factual knowledge, for instance, that BarackObama is married to MichelleObama and was born on August 4, 1961. Textual knowledge captures how named entities and their relationships are referred to in natural language, for example, BarackObama as ‘Barack H. Obama’.

Easily extensible data formats such as RDF are commonly used to store KGs, which makes it easy to complement them with additional facts without having to worry about a predefined schema. RDF stores facts as (subject, predicate, object) triples, which can then be queried using SPARQL as a simple-yet-powerful structured query language.

In this work, we address the problem of generating quiz-style knowledge questions from KGs. As shown in Figure 1, starting from a KG and a topic such as US Presidents, we generate a quiz question whose unique answer is an entity from that topic. The question starts its life as an automatically generated triple-pattern query, which our system verbalizes. Each generated question is adorned with a difficulty level, providing an estimate for how hard it is to answer, and optionally a set of distractors, which can be listed alongside the correct answer to obtain a multiple-choice question. Our system is able to judge the impact of the distractors on the difficulty of the resulting multiple-choice question.

Applications of automatically generated knowledge questions include education and evaluation. One way to educate users about a specific domain (e.g., Sports or Politics) is to prompt them with questions, so that they pick up facts as they try to answer – reminiscent of flash cards used by pupils. When qualification for a task needs to be ensured, such as knowledge about a specific domain, automatically generated knowledge questions can serve as a qualification test. Crowdsourcing is one concrete use case as outlined in [39]. Likewise, knowledge questions can serve as a form of CAPTCHA to exclude likely bots.

Challenges. To discriminate how much people know about a domain, it is typical to ask progressively more difficult questions. In our setting, this means that we need to automatically quantify the difficulty of a question. This is not a trivial task, as it requires taking into consideration multiple signals and their interaction. One might, for example, consider all questions whose answer is BarackObama to be easy, as he is a prominent entity. However, very few people would know that he won a GrammyAward. It is therefore important to identify signals that predict question difficulty and to combine them in a meaningful manner.

Answers provided by the user should be easy to verify automatically. In our setting, we want to ensure that disputes about the correctness of an answer are minimal, since we envision a setting with minimal human involvement (on the asking side). One important way to achieve this is by ensuring that each question has exactly one correct answer. Complementary to having questions with unique correct answers is dealing with possible variation in user input (e.g., ‘Barack Obama’ vs ‘Barack H. Obama’). One way of overcoming this is by turning fill-in-the-blank questions into multiple-choice questions. Here, one needs to carefully consider the impact distractors have on question difficulty.

A final challenge is the production of well-formed natural language questions. We are interested not only in correct language, but also in generating questions that do not look artificial. Such questions are desirable not only for their aesthetic appeal, but also because they minimize the chance of humans discovering that the questions were generated automatically. An important consideration here is how to ensure that coherent questions have sufficient variety. For example, while a KG may classify BarackObama as both an entity and a formerSenator, we would like to use the latter in asking about him, as the first is unnatural. Similarly, while the relation connecting TobeyMaguire to Spider-Man might be called actedIn, we would like to have some variety in how this is expressed (e.g., ‘acted in’ or ‘starred in’).

Contributions. We propose an end-to-end approach to the novel problem of generating quiz-style knowledge questions from knowledge graphs. Our approach has three major components: query generation, difficulty estimation, and query verbalization to generate a question. In a setting where multiple-choice questions are desired, a fourth component takes care of both generating the distractors and quantifying their impact on question difficulty. Figure 2 depicts our pipeline for generating questions and multiple choice questions.

The query generation component generates a structured query that will serve as the basis of the final question shown to a human. By starting from a structured query, we are able to generate questions that are certain to have exactly one unique, correct answer in our knowledge graph. In query generation, several challenges need to be addressed so that the resulting cues are meaningful.

Difficulty estimation is one of the challenges that need to be addressed. To estimate the difficulty of a structured query, we leverage different signals about contained named entities, which we derive from a Web-scale document collection annotated with named entities from the KG. To learn how to weight those signals, we make use of more than thirty years’ worth of data from the Jeopardy! quiz show.

Since our questions start their life as structured queries over the KG, we also verbalize them by generating a corresponding natural language question. Following earlier work on query verbalization and natural language generation, we adopt a template-based approach. However, we extend this approach with automatically mined paraphrases for relations and classes in the KG, ensuring diversity in the resulting natural language questions.

Outline. The rest of this paper unfolds as follows. Section 2 introduces preliminaries and provides a formal statement of the problem addressed in this work. Following that, we provide details on each stage shown in Figure 2. Section 3 describes how a SPARQL query can be generated that has a unique answer in the KG. Our approach for estimating the difficulty of the generated query is the subject of Section 4. Section 5 describes how the query can be verbalized into natural language. Extensions for multiple-choice questions are described in Section 6. Section 7 lays out the setup and results of our experiments. We put our work in context with existing prior research in Section 8, before concluding in Section 9.

Figure 2: Question generation pipeline.

2 Preliminaries and Problem Statement

We now lay out preliminaries and formally state the problem addressed in this work.

Knowledge Graphs (KGs) such as Freebase [10], Yago [42], and DBpedia [4] describe entities (e.g., BarackObama) by connecting them to other entities, types (also called classes, e.g., president, leader), and literals (e.g., ‘1985-02-05’) using predicates (e.g., bornIn, birthdate, type). A KG is thus a set of facts (or triples). A triple can also be seen as an instance of a binary predicate, with the first argument called the subject and the second called the object, hence the name subject-predicate-object (SPO). Figure 1 shows a KG fragment.

Pattern-matching is used to query a KG. Given a set of variables that are always prefixed with a question mark (e.g., ?x), a triple-pattern query is a set of triple patterns. An answer to a query is a total mapping of variables to items in the KG such that applying the mapping to each triple pattern results in a fact in the KG. In our setting, inspired by Jeopardy!, we restrict ourselves to queries having a single variable for which a unique answer exists in the KG. Put differently, there exists only one binding of the single variable to a named entity, so that all triple patterns have corresponding facts in the KG.
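The definition above can be made concrete with a small sketch. The toy KG, the function names `answers` and `has_unique_answer`, and the brute-force matching strategy are illustrative assumptions, not the system's actual query engine (which operates on RDF data queried with SPARQL).

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]
VAR = "?x"  # the single variable allowed in our queries


def answers(kg: Set[Triple], query: Iterable[Triple]) -> Set[str]:
    """Return every KG item that turns all triple patterns into facts
    when it is substituted for the variable ?x."""
    patterns = list(query)
    candidates = {item for s, _, o in kg for item in (s, o)}
    result = set()
    for candidate in candidates:
        bound = [
            tuple(candidate if term == VAR else term for term in pattern)
            for pattern in patterns
        ]
        if all(fact in kg for fact in bound):
            result.add(candidate)
    return result


def has_unique_answer(kg: Set[Triple], query: Iterable[Triple]) -> bool:
    """A query is admissible only if exactly one binding of ?x exists."""
    return len(answers(kg, query)) == 1


# Toy KG fragment (hypothetical, for illustration only).
kg = {
    ("BarackObama", "type", "president"),
    ("BarackObama", "won", "GrammyAward"),
    ("RonaldReagan", "type", "president"),
}
query = [("?x", "type", "president"), ("?x", "won", "GrammyAward")]
print(answers(kg, query))
```

Dropping the second pattern leaves two presidents as bindings, so the shorter query would not be admissible in this setting.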

More specifically, we use Yago2s [43] as our reference knowledge graph in this work. Yago2s is automatically constructed by combining information extraction over Wikipedia infoboxes and categories with the lexical database WordNet [15]. In total, Yago2s contains 2.6m entities, 300k types organized into a type hierarchy, and more than a hundred predicates which are used to form more than 48m facts. Yago entities are associated with Wikipedia entries, whereas a Yago type corresponds to a WordNet synset or Wikipedia category. To compute signals necessary for estimating question difficulty, we make use of the ClueWeb09/12 document collections and the FACC annotations provided by Google [20]. The latter provide semantic annotations of disambiguated named entities from Freebase, which we can easily map to Yago2s via their corresponding Wikipedia article. An annotated sentence in this corpus looks as follows:

“[Obama|BarackObama] endorsed [Clinton|HillaryClinton] earlier today.”

Jeopardy! is a popular U.S. TV quiz show that features comprehensive natural language questions that are referred to as clues. Clues are usually posed as a statement and the required answer is in turn posed as a question. For instance, in Jeopardy! the question: This fictional private investigator was created by Arthur Conan Doyle. has the answer: Who is Sherlock Holmes? Clues come with monetary values, corresponding to the amount added to a contestant’s balance when answering correctly. We reckon that monetary values correlate with human performance and thus question difficulty – a hypothesis which we investigate in Section 4.

Problem Statement. Put formally, our objective in this work is to automatically generate a question whose unique answer is an entity from a given topic, which can be supported by facts in the KG. A topic is a thematic set of entities, which allows us to control the domain from which knowledge questions are generated (e.g., American Politics). Moreover, we assume a predefined set of difficulty levels with a strict total order defined over its elements, and we want to estimate the difficulty of providing the answer to the question. An extension of the above problem which we also deal with in this work is the generation of multiple-choice questions (MCQs), where the task is to extend a question into an MCQ by generating a set of incorrect answers, called distractors, and quantifying their difficulty.

In our concrete instantiation of the above problem, we use Wikipedia categories as topics and Yago2s as our KG. As a first attempt to address the above problem, we consider a setting with two difficulty levels, easy and hard. For our purposes, a question is any natural language sentence that requires an answer. It can look like what we think of as a question, or as a declarative sentence in the same style as Jeopardy! clues.

3 Query Generation

The first stage in our pipeline is the generation of a query that has a unique answer in the KG. This query serves as the basis for generating a question that will be shown to human contestants. The unique answer will be the one a contestant needs to provide in order to correctly answer the question. As is common practice in quiz-games, ensuring that a question has a single answer simplifies answer verification.

The input to the query generation step is a topic. The unique answer to the generated query will be an entity from this topic, randomly drawn from the KG. Query generation is guided by the following desiderata: i) the query should contain at least one type triple pattern, which is crucial when verbalizing the query to generate a question (e.g., “Which president …”), and ii) entities mentioned in the query should not give any obvious clues about the answer entity. In what follows we present the challenges in achieving each of these desiderata, and our solutions to these challenges.

3.1 Answer Type Selection

Questions asking for entities always require a type that is either specified implicitly (e.g., ‘who’ for person and ‘where’ for location) or explicitly (e.g., “Which president …”). Here we address the problem of selecting a type to refer to the answer entity in the question. KGs tend to contain a large number of types and typically associate an entity with multiple types. Some of these types are easy for an average human to understand and typically appear in text talking about an entity (e.g., president, lawyer). Other types, however, are artifacts of attempts to have an ontologically complete and formally sound type system. Such types are meaningful only in the context of a type system, but not on their own (e.g., the type entity or thing).

We use our entity-annotated corpus to capture the salience of a semantic type for an entity. We start by collecting occurrences of an entity along with textual types to which it belongs in our entity-annotated corpus. We use the following patterns to collect (entity, textual type) pairs:

Pattern #1:
ENTITY (‘is a’ | ‘is an’ | ‘, a’ | ‘and other’ | ‘or other’) TYPE
Example: BarackObama and other presidents attended the ceremony.

Pattern #2:
TYPE (‘like’ | ‘such as’ | ‘including’ | ‘especially’) ENTITY
Example: …several attorneys including BarackObama

These patterns are inspired by Hearst [23].
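As a rough illustration, the two patterns can be written as regular expressions over sentences in the `[mention|Entity]` annotation format shown in Section 2. The regular expressions, the function name `extract_pairs`, and the restriction of textual types to a single lowercase word are simplifying assumptions made for this sketch. Plural forms such as "presidents" are returned as found and would still need to be normalized.

```python
import re

# An annotated mention looks like [Obama|BarackObama]; we keep the entity id.
ENTITY = r"\[[^\]|]+\|(?P<entity>[^\]]+)\]"
# Simplification: a textual type is a single lowercase word.
TYPE = r"\b(?P<type>[a-z]+)"

PATTERN_1 = re.compile(
    ENTITY + r"\s*(?:is a|is an|, a|and other|or other)\s+" + TYPE
)
PATTERN_2 = re.compile(
    TYPE + r"\s+(?:like|such as|including|especially)\s+" + ENTITY
)


def extract_pairs(sentence):
    """Return (entity, textual type) pairs found by either pattern."""
    pairs = []
    for pattern in (PATTERN_1, PATTERN_2):
        for match in pattern.finditer(sentence):
            pairs.append((match.group("entity"), match.group("type")))
    return pairs


print(extract_pairs("[Obama|BarackObama] and other presidents attended the ceremony."))
print(extract_pairs("several attorneys including [Obama|BarackObama]"))
```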

The next step before computing semantic type salience is to disambiguate (entity, textual type) pairs to (entity, semantic type) pairs. Note that entities are already disambiguated in the corpus, so we only need to disambiguate the textual type to a semantic type in the KG. Relying on the fact that our semantic types are WordNet synsets [15], we use the lexicon that comes with WordNet (e.g., {lawyer, attorney} → lawyer) for generating a set of semantic type candidates for a given textual type. We then use a simple yet effective heuristic where a textual type paired with an entity is disambiguated to a semantic type if i) the semantic type is in the set of candidates for the textual type and ii) the entity is an instance of the semantic type in the KG.

We compute salience as the relative frequency with which the disambiguated (entity, semantic type) pair was observed in our corpus. To select a type for the answer entity, we draw one of the types to which it belongs randomly based on this salience.
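A minimal sketch of this step follows. It assumes that relative frequency is normalized per entity (the text does not state the normalization), and the counts and function names are hypothetical.

```python
import random
from collections import Counter


def type_salience(pair_counts, entity):
    """Relative frequency of each semantic type among the pairs observed
    for this entity (assumption: normalized per entity)."""
    counts = {t: c for (e, t), c in pair_counts.items() if e == entity}
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()} if total else {}


def draw_type(pair_counts, entity, rng):
    """Draw one type at random, weighted by its salience for the entity."""
    salience = type_salience(pair_counts, entity)
    if not salience:
        raise ValueError(f"no type observations for {entity}")
    types = sorted(salience)
    return rng.choices(types, weights=[salience[t] for t in types], k=1)[0]


# Hypothetical counts of disambiguated (entity, semantic type) pairs.
pair_counts = Counter({
    ("BarackObama", "president"): 6,
    ("BarackObama", "lawyer"): 2,
    ("HillaryClinton", "senator"): 3,
})
print(type_salience(pair_counts, "BarackObama"))
print(draw_type(pair_counts, "BarackObama", random.Random(0)))
```

Sampling instead of always taking the most salient type keeps some variety in how the same entity is asked about.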

3.2 Triple Pattern Generation

We now have an answer entity and one of its semantic types, which will be used to refer to the answer entity in the question. We now need to create a query (which includes the corresponding type constraint) whose unique answer over the KG is this entity. We focus here on questions with unknown entities, as these are the ones for which we can use Jeopardy! data to train our difficulty classifier [17]. In principle, we could allow for unknown relations or types as well if we had the right training data. Creating a query means selecting facts where the answer entity is either the subject or object and turning these into triple patterns by replacing it with a variable (?x). Not all facts can be used here, as some reveal too much about the answer and render the question trivial. Other facts will be redundant given the facts already used.

Elimination of Textual Overlap with the Answer. The first restriction we impose on a fact is that the surface forms of entities that appear in it cannot have any textual overlap with surface forms of the answer entity. The question “This president is married to Michelle Obama.” reveals too much about the answer entity. For overlap, we look at the set of words in the surface forms, excluding common stop words. We discuss our approach to collecting surface forms for entities in Section 5.2 below.
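The overlap test can be sketched as a set intersection over content words. The stop-word list, the tokenization, and the function names below are illustrative assumptions.

```python
# Tiny illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"the", "of", "a", "an", "and", "in"}


def content_words(surface_forms):
    """Lowercased words from all surface forms, without stop words."""
    words = set()
    for form in surface_forms:
        words.update(form.lower().replace(".", " ").split())
    return words - STOP_WORDS


def overlaps(answer_forms, other_forms):
    """True if any content word is shared between the two entities."""
    return bool(content_words(answer_forms) & content_words(other_forms))


answer_forms = {"Barack Obama", "Barack H. Obama"}
print(overlaps(answer_forms, {"Michelle Obama"}))  # shares 'obama'
print(overlaps(answer_forms, {"Grammy Award"}))
```

A fact whose other entity overlaps with the answer, such as the marriage to MichelleObama, would be skipped during query generation.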

Elimination of Redundant Facts. Given a set of facts that has been chosen, a new fact does not always add new information. Keeping this new fact in a query will result in an awkwardly phrased question that can be clearly identified by a human as having been automatically generated. In our example from Figure 1 we decided to use the type president to ask about BarackObama. Using the fact (BarackObama type politician) or the fact (BarackObama type person) to extend the question is clearly redundant and adds no extra information. To eliminate this issue, we check each new type fact against all existing ones. If the new type is a supertype (e.g., person) of an existing one (e.g., president), we discard it.
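The supertype check might look as follows. The sketch assumes, for simplicity, a type hierarchy in which every type has at most one parent. The `parent_of` mapping and function names are hypothetical.

```python
def supertypes(t, parent_of):
    """All strict ancestors of type t, following single-parent links."""
    seen = set()
    while t in parent_of:
        t = parent_of[t]
        if t in seen:  # guard against accidental cycles
            break
        seen.add(t)
    return seen


def is_redundant(new_type, chosen_types, parent_of):
    """A new type fact adds nothing if the type is already used or is a
    supertype of a type that is already used."""
    return any(
        new_type == chosen or new_type in supertypes(chosen, parent_of)
        for chosen in chosen_types
    )


# Hypothetical fragment of a type hierarchy.
parent_of = {"president": "politician", "politician": "person", "person": "entity"}
print(is_redundant("person", {"president"}, parent_of))
print(is_redundant("lawyer", {"president"}, parent_of))
```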

4 Difficulty Estimation

We now describe our approach to estimating the difficulty of answering the knowledge query generated in Section 3. There are several, seemingly contradictory, signals that affect the difficulty of a question. As discussed earlier, one might expect any question asking for a popular entity such as BarackObama to be an easy one. However, if we were to ask “This president from Illinois won a Grammy Award.”, few people are likely to think of BarackObama. We use a classification model trained on a corpus of questions paired with their difficulties to predict question difficulty.

Note that the difficulty is computed based on the query and not its verbalization, which we generate in the next section. Our goal here is to create questions that measure factual knowledge rather than linguistic ability. We elaborate on this point further in Section 5.

Since we rely on supervised training for difficulty estimation, we make the natural assumption that the difficulty labels in the training and ‘testing’ questions are drawn from the same underlying distribution for some target audience. We also assume that for this population, it is possible to capture the difficulty of a question. As evidence for this, in the Jeopardy! dataset [1] we find a positive correlation between the difficulty level of the attempted questions and the share of questions at that level that could not be answered. For the five difficulty levels ($200, $400, $600, $800, $1000), 4.46%, 8.35%, 12.69%, 17.82% and 25.69% of the questions could not be answered, respectively.

4.1 Data Preparation

We use the Jeopardy! quiz-game show data described in Section 2 for training and testing our difficulty estimation classifier. The larger goal is to estimate the difficulty of answering queries generated from a knowledge graph, so we restrict ourselves to a subset of the Jeopardy! questions answerable from Yago [42], which we collected as described below. However, all methods and tools are general enough to apply to a setting other than ours of Jeopardy!/Yago.

We say a question is answerable from Yago if i) all entities mentioned in the question and its answer are in Yago, and ii) all relations connecting these entities are captured by Yago. To find these questions, we automatically annotate the questions with Yago entities using the Stanford CoreNLP named entity recognizer (NER) [18] in conjunction with the AIDA tool for named entity disambiguation [25]. We concatenate the output of the NER system with the answer entity, which we annotate as an entity mention as well, and pass it to AIDA for an improved disambiguation context. An example of an input to AIDA looks as follows:

[Shah Jahan] built this complex in [Agra, India] to immortalize [Mumtaz], his favorite wife. [Taj Mahal]

and the corresponding disambiguated output is:

ShahJahan built this complex in Agra to immortalize MumtazMahal, his favorite wife. TajMahal

We retain an entity-annotated question if i) its answer can be mapped to a Yago entity, ii) its body has at least one entity (the one that will be given in the question, not the answer), and iii) considering all entities in the question and the answer entity, each entity can be paired with another entity to which it has a direct relation in Yago. The last condition ensures that we have questions that can be captured by the relationships in Yago. However, it does not identify this relation, and such a match may be spurious. Since this is hard to establish automatically, we invoke humans at this point.

We run a crowdsourcing task on the questions that survive the above automated annotation and filtering procedure. The task is to assign one of two labels to an entity-annotated question/answer pair. A question/answer pair is to be labeled Good if i) all entities in the question have been captured and disambiguated correctly, ii) the question can be captured by relations in Yago, and iii) the answer is a unique one. The crowdsourcing task ran until we obtained a total of 500 questions that we use in our experiments.

4.2 Difficulty Classifier

After obtaining the data needed for training and testing a difficulty classifier, we turn our attention to building this classifier and the features used to do so. Formally, our goal is to learn a function that predicts the difficulty of providing the answer to a query.

We use logistic regression as our model of choice. We chose this specific model due to the ease with which it can be trained and because it allows easy inspection of feature weights, which proved helpful during development. As we are dealing with a binary classification case (easy/hard classification), we train our model to learn the probability of the question being an easy one, and set a decision boundary at 0.5. We judge a question to be easy if this probability exceeds 0.5 and hard otherwise.
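The decision rule can be sketched as follows, assuming the two labels are easy and hard. The feature values and weights are made up for illustration, and treating a probability of exactly 0.5 as hard is an assumption of this sketch.

```python
import math


def p_easy(features, weights, bias):
    """Logistic regression: probability that the question is easy."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def classify(features, weights, bias, threshold=0.5):
    """Label a question using the 0.5 decision boundary."""
    return "easy" if p_easy(features, weights, bias) > threshold else "hard"


# Hypothetical features: [answer salience, max pairwise coherence].
weights = [4.0, 6.0]  # made-up weights, not learned values
bias = -2.0
print(classify([0.9, 0.6], weights, bias))   # salient answer, coherent cue
print(classify([0.9, 0.01], weights, bias))  # salient answer, incoherent cue
```

In the fitted model, the sign and magnitude of each weight show how strongly a feature pushes a question toward easy or hard, which is what makes the model easy to inspect.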

The model, however, only works if provided with the right features. Table 1 provides a summary of our features and a brief description of each. The key ingredients in our feature repertoire are entity salience, per coarse semantic type salience, and coherence of entity pairs.

Entity Salience is a normalized score that is used as a proxy for an entity’s popularity. As our entities come from Wikipedia, we use the Wikipedia link structure to compute entity salience as the relative frequency with which the Wikipedia entry for an entity is linked to from all other entries. We also consider salience on a per-coarse-semantic-type basis. The second group of Table 1 defines a set of feature templates, one per coarse semantic type. We consider the coarse semantic types person, location, and organization and define a fourth coarse semantic type other that collects entities not in any of the three aforementioned coarse types (e.g., movies, inventions). Having specialized features for individual coarse-grained types allows us to take into consideration some particularities of these coarse types. For example, locations tend to have disproportionately high salience. By having a feature that accounts for this specific semantic type, we can mitigate this. Without this feature, having a location in a question would result in our classifier always labeling the question as easy.

Coherence of entity pairs captures the relative tendency of two entities to appear in the same context. This feature essentially informs us about how much the presence of one entity indicates the presence of the other entity. For example, we would expect that:

coherence(BarackObama, WhiteHouse) > coherence(BarackObama, GrammyAward).

The reason is that the first pair is more likely to co-occur together than the second one. All else being equal, we would expect a question asking for BarackObama using the WhiteHouse in the question to be easier than one asking for him using GrammyAward. Intuitively, coherence counteracts the effect of salience. Since BarackObama is a salient entity, we would expect questions asking for him to be relatively easy. However, asking for him using GrammyAward is likely to make the question difficult, as people are unlikely to make a connection between the two entities.

We capture coherence using Wikipedia’s link structure. Given two entities, we define their coherence as the Jaccard coefficient of the sets of Wikipedia entries that link to their respective entries in Wikipedia. The intuition here is that any overlap corresponds to a mention of the relation between these two entities. For the above measures, we take their maximum, minimum, average, and sum over the question as features as detailed in Table 1.
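The coherence measure and some of the aggregates built from it can be sketched as follows. The in-link sets are toy data and the function names are hypothetical. Only the pairwise aggregates over all entities are shown.

```python
from itertools import combinations


def coherence(e1, e2, inlinks):
    """Jaccard coefficient of the sets of entries linking to e1 and e2."""
    a, b = inlinks.get(e1, set()), inlinks.get(e2, set())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def coherence_features(entities, inlinks):
    """Aggregate pairwise coherence over all entity pairs of a question."""
    scores = [coherence(a, b, inlinks) for a, b in combinations(entities, 2)]
    if not scores:
        return {"max": 0.0, "sum": 0.0, "avg": 0.0}
    return {"max": max(scores), "sum": sum(scores), "avg": sum(scores) / len(scores)}


# Toy in-link sets: which (hypothetical) entries link to each entity.
inlinks = {
    "BarackObama": {"A", "B", "C", "D"},
    "WhiteHouse": {"B", "C", "D", "E"},
    "GrammyAward": {"D", "F", "G"},
}
print(coherence("BarackObama", "WhiteHouse", inlinks))   # 3 shared of 5
print(coherence("BarackObama", "GrammyAward", inlinks))  # 1 shared of 6
```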

Feature group and description

Entity Salience
  • answer entity salience
  • min. salience of question entities
  • max. salience of question entities
  • sum over salience of entities
  • mean salience of question and answer entities
  • mean salience of entities in question

Per-coarse-semantic-type Salience (one set of features per coarse type t)
  • min. salience of entities of type t
  • max. salience of entities of type t
  • sum over salience of entities of type t
  • mean salience of entities of type t

Coherence of Entity Pairs
  • maximum pairwise coherence of all entity pairs
  • sum over coherence of all entity pairs
  • average coherence of all entity pairs
  • average coherence of entity pairs that involve the answer

Answer Type
  • binary indicator: answer entity is of type t

Table 1: Difficulty estimator features and their description. Here t is one of person, organization, location, or other.

5 Query Verbalization

We now turn to the problem of query verbalization, whereby we transform a query constructed in Section 3 into a natural language question. A human can digest this question without the technical expertise required to understand a triple pattern query. Our final goal is to construct well-formed questions that are easy to understand.

The goal of our questions is to test factual knowledge as opposed to linguistic ability. The way that a question is formulated is not a factor in predicting its difficulty. This guides our approach to query verbalization, which ensures uniformity in how questions are phrased.

We rely on a hand-crafted verbalization template and automatically generated lexicons for transforming a query into a question. The verbalization template specifies where the different components of the query appear in the question. The lexicon serves as a bridge between knowledge graph entries and natural language. We start by describing our template and then move to our lexicon generation process.

5.1 Verbalization Template

Our approach to verbalizing queries is based on templates. Such approaches are standard in the natural language generation literature [26, 34]. We adopt a template inspired by the Jeopardy! quiz game show, given in Figure 3. Most of the work is done in the function that verbalizes an individual triple pattern.

Figure 3: Verbalization Template (the template takes a query as input).

The function takes a triple pattern and produces its verbalization. How this verbalization is performed depends on the nature of the triple pattern. More concretely, there are three distinct patterns possible in our setting (see Section 3):

  • Type: if the predicate is type, then this results in verbalizing the object, which is a semantic type.

  • PO: where the triple pattern is of the form ?var p o and p is not type.

  • SP: where the triple pattern is of the form s p ?var and p is not type.

By considering these cases individually we ensure that linguistically well-formed verbalizations are created. Figure 4 shows an example of each of the three cases above. Verbalizing a triple pattern requires that we are able to verbalize its constituent semantic items (entities, types, and predicates) in a manner that is considerate of the specific pattern. We present our solution to this next.

Triple Pattern         Pattern   Verbalization
?x type movie          Type      ‘film’, ‘movie’
?x actedIn Heat        PO        ‘acted in the movie Heat’, ‘starred in the film Heat’
AlPacino actedIn ?x    SP        ‘Al Pacino appeared in’

Figure 4: Examples of triple pattern verbalizations.
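The three cases can be sketched as a simple dispatch on the shape of the triple pattern. The lexicons below are tiny hand-written stand-ins, and the function name `verbalize_pattern` is hypothetical. Unlike the examples in Figure 4, this sketch does not insert the object's type (as in ‘the movie Heat’).

```python
import random


def verbalize_pattern(pattern, entity_lex, type_lex, sp_lex, po_lex, rng):
    """Verbalize one triple pattern according to its shape."""
    s, p, o = pattern
    if p == "type":        # Type: verbalize the semantic type
        return rng.choice(type_lex[o])
    if s.startswith("?"):  # PO: ?var p o
        return f"{rng.choice(po_lex[p])} {rng.choice(entity_lex[o])}"
    if o.startswith("?"):  # SP: s p ?var
        return f"{rng.choice(entity_lex[s])} {rng.choice(sp_lex[p])}"
    raise ValueError("pattern contains no variable")


# Hypothetical lexicon entries for illustration.
entity_lex = {"Heat": ["Heat"], "AlPacino": ["Al Pacino"]}
type_lex = {"movie": ["film", "movie"]}
po_lex = {"actedIn": ["acted in"]}
sp_lex = {"actedIn": ["appeared in"]}

rng = random.Random(0)
for pattern in [("?x", "type", "movie"), ("?x", "actedIn", "Heat"), ("AlPacino", "actedIn", "?x")]:
    print(verbalize_pattern(pattern, entity_lex, type_lex, sp_lex, po_lex, rng))
```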

5.2 Verbalization Lexicons

Semantic items in the knowledge graph are simply identifiers that are not meant for direct human consumption. It is therefore important that we map each semantic item to phrases that can be used to represent it in a natural language string such as a question.

Entities. To verbalize entities we follow the approach of Hoffart et al. [25] and rely on the fact that our entities come from Wikipedia. We resort to Wikipedia for extracting surface forms of our entities. For each entity, we collect the surface forms of all links to its Wikipedia entry. We consider each such surface form to be a possible verbalization of the entity.

The above process extracts many spurious verbalizations of an entity. To overcome this issue, we associate with each candidate verbalization the number of times it was used to link to the entity’s Wikipedia entry and restrict ourselves to the five most frequent ones, which we add to the lexicon entry for the entity.
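The frequency cut-off amounts to counting anchor texts and keeping the most common ones. The anchor texts and the function name below are made up for illustration.

```python
from collections import Counter


def top_surface_forms(anchor_texts, k=5):
    """Keep the k anchor texts most often used to link to an entity."""
    counts = Counter(anchor_texts)
    return [form for form, _ in counts.most_common(k)]


# Hypothetical anchor texts of links pointing to one entity's entry.
anchors = (
    ["Barack Obama"] * 6
    + ["Obama"] * 5
    + ["President Obama"] * 4
    + ["Barack H. Obama"] * 3
    + ["Senator Obama"] * 2
    + ["him"]  # a spurious anchor that the cut-off removes
)
print(top_surface_forms(anchors))
```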

Predicates. As Figure 4 shows, predicate verbalization depends on the pattern in which it is observed (SP or PO). We rely on our large entity-annotated corpus described in Section 2 for mining predicate verbalizations sensitive to the SP and PO patterns. For each triple (s p o), we collect all sentences in our corpus in which a phrase connects s to o (e.g., “BarackObama was born in Hawaii”) or o to s (e.g., “Hawaii is the birthplace of BarackObama”). Following the distant supervision assumption [31], we hypothesize that the connecting phrase is expressing p. The above hypothesis does not always hold. To filter out possible noise we resort to a combination of heuristic filtering and scoring. We remove from the above verbalization candidate set any phrases that are longer than 50 characters or contain a third entity. We subsequently score how good a fit a phrase is for a predicate using normalized pointwise mutual information (npmi). For each predicate, we retain the 5 highest scoring verbalizations for each of the two patterns, which are used for verbalizing SP and PO triple patterns, respectively.
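For reference, normalized pointwise mutual information is commonly defined as pmi(x, y) / (-log p(x, y)). The sketch below applies this standard definition to co-occurrence counts of a phrase and a predicate. How the counts are gathered from the corpus, and the function names, are assumptions.

```python
import math

MAX_PHRASE_LENGTH = 50  # characters, as stated in the text


def npmi(joint, count_phrase, count_predicate, total):
    """Normalized PMI in [-1, 1] computed from raw co-occurrence counts."""
    p_xy = joint / total
    if p_xy == 0.0:
        return -1.0  # the phrase and the predicate never co-occur
    if p_xy == 1.0:
        return 1.0   # avoid dividing by -log(1) = 0
    p_x, p_y = count_phrase / total, count_predicate / total
    return math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)


def top_verbalizations(scored_phrases, k=5):
    """Drop overly long phrases, then keep the k best-scoring ones."""
    kept = [(p, s) for p, s in scored_phrases if len(p) <= MAX_PHRASE_LENGTH]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [phrase for phrase, _ in kept[:k]]


print(npmi(joint=10, count_phrase=10, count_predicate=10, total=100))  # always together
print(npmi(joint=25, count_phrase=50, count_predicate=50, total=100))  # independent
```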

Types. As explained in Section 2, our types are WordNet synsets. We therefore rely on the lexicon distributed as part of WordNet for type paraphrasing.

Each of the three lexicons provides several ways to verbalize a semantic item. When verbalizing a specific semantic item, we choose a verbalization uniformly at random to ensure variety.

6 Multiple-Choice Questions

The final component in our question generation framework turns a question into a multiple-choice question. This has several advantages: in general, it is easier to administer a multiple-choice question, as the problem of answer verification can be completely mechanized. This is particularly true when questions are not administered through a computer, since on a computer features such as completion suggestions can ensure canonical answers. In general, where knowledge questions are involved (as opposed to free response questions that might involve opinion), the use of multiple-choice questions is widespread, as observed in such tests as the GRE.

Turning a question into multiple-choice requires distractors: entities that are presented to the user as answer candidates, but are in fact incorrect answers. Of course, not all entities constitute reasonable distractors. A negative example would be entities that are completely unrelated to the question. In addition to being related to the question, distractors should ideally be related to the correct answer entity. It should generally be possible to confuse a distractor with the correct answer to make a multiple-choice question interesting. We call this the confusability of a distractor. The more confusable a distractor is with the correct answer, the more likely a test taker is to choose it as an answer, making the multiple-choice question more challenging.

In what follows we take a look at the problems of generating distractors in our framework and quantifying the confusability of these distractors.

6.1 Distractor Generation

Our starting point for generating distractors is the query generated in Section 3, which formed the basis of the question verbalized in Section 5. Starting with a query gives us a fairly simple but powerful scheme for generating distractors. By removing one or more triple patterns from the original query, we obtain a relaxed query that has more than one answer entity. All but one of these entities are incorrect answers to the original query.
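A minimal sketch of this relaxation scheme is given below. It assumes a toy in-memory knowledge graph and queries whose patterns all share a single subject variable; the triples, the predicate names, and the helper functions are hypothetical and only illustrate the idea of pooling the answers of relaxed queries.

```python
from itertools import combinations

# Toy knowledge graph of (subject, predicate, object) triples; illustrative only.
KG = {
    ("Hawaii", "type", "state"),
    ("Hawaii", "locatedIn", "USA"),
    ("Hawaii", "birthplaceOf", "BarackObama"),
    ("Texas", "type", "state"),
    ("Texas", "locatedIn", "USA"),
    ("Bavaria", "type", "state"),
    ("Bavaria", "locatedIn", "Germany"),
    ("Honolulu", "type", "city"),
    ("Honolulu", "locatedIn", "USA"),
}

# A query is a list of (predicate, object) constraints on one answer variable.
QUERY = [("type", "state"), ("locatedIn", "USA"), ("birthplaceOf", "BarackObama")]


def answers(kg, query):
    """Entities that satisfy every (predicate, object) constraint of the query."""
    subjects = {s for s, _, _ in kg}
    return {s for s in subjects if all((s, p, o) in kg for p, o in query)}


def relaxations(query):
    """Queries obtained by removing one or more constraints, keeping at least one."""
    for size in range(len(query) - 1, 0, -1):
        for kept in combinations(query, size):
            yield list(kept)


def distractor_candidates(kg, query):
    """Pool the answers of all relaxed queries and drop the correct answer(s)."""
    correct = answers(kg, query)
    pooled = set()
    for relaxed in relaxations(query):
        pooled |= answers(kg, relaxed)
    return pooled - correct


print(sorted(distractor_candidates(KG, QUERY)))
```

In this unrestricted form the pool contains a city (Honolulu) for a question about a state, which shows why an unconstrained relaxation can drift away from the original query.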

The relaxation scheme described above can generate a large number of candidate distractors. However, not all relaxations stay close to the original query. If a relaxation deviates too much from the original query, the obtained distractors become meaningless. We address this by imposing two restrictions on relaxed queries used to generate distractors: (i) a semantic type restriction, and (ii) a relaxation distance restriction.

Semantic type restriction ensures that the answer and distractor are type-compatible. For example, a multiple-choice question asking for a location should not have a person as one of its distractors. The semantic type restriction requires that a semantic type triple pattern is relaxed to the corresponding coarse type.

The relaxation distance restriction refers to relaxations involving instance triple patterns. We define a distance between the original query and a relaxed query in terms of the sizes of their answer sets; the answer set of the original query always has size 1. We restrict relaxed queries to have a distance of no more than a threshold, which we set to 10. By pooling the results of all relaxed queries, we form a set of candidate distractors. The choice of distractor is based on how much difficulty we want the distractors to introduce, using our notion of distractor confusability.

6.2 Distractor Confusability

All things equal, a multiple-choice question can be made more or less difficult by the choice of distractors. If one of the distractors is highly confusable with the answer entity, the multiple-choice question is difficult. If none of the distractors is easy to confuse with the answer entity, the multiple-choice question is easy.

Based on this observation we regard a distractor as confusable if it is likely to be the answer to the original question based on our difficulty model. This implies that if an entity is very likely to be the answer to a question asking about a different entity, this entity pair must be similar. We can therefore define confusability between the question’s answer and a distractor entity as follows:

Since a multiple-choice question can have more than one distractor, we extend the above intuition to capture how multiple distractors affect the overall difficulty of the question. We observe that a multiple-choice question is as confusing as its most confusing distractor and therefore define the confusability of a distractor set as the maximum of the confusabilities of its individual distractors.

Looking at the big picture, we relate the notion of confusability in a multiple-choice question to our earlier notion of difficulty by combining the two as shown in Table 2. We see that an easy question can be turned into a hard one when a very confusable distractor is added, since the user has to distinguish between two very similar entities. However, adding an easy distractor to a hard question will not change its difficulty: even when the two entities are not similar to each other, the user still has to know which entity is the correct answer.

                    low confusability    high confusability
easy question       easy                 hard
hard question       hard                 hard
Table 2: Combining question difficulty and multiple-choice question confusability into an overall difficulty in a multiple-choice setting.
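The combination in Table 2 can be sketched as follows. The per-distractor confusability scores are treated as given inputs, and the example scores and the 0.5 cut-off that separates low from high confusability are hypothetical values chosen for illustration.

```python
def set_confusability(scores_by_distractor):
    """A distractor set is as confusable as its most confusable member."""
    return max(scores_by_distractor.values())


def overall_difficulty(question_difficulty, confusability, threshold=0.5):
    """Table 2: hard if the question is hard or the distractor set is highly confusable."""
    highly_confusable = confusability >= threshold  # threshold is an assumed cut-off
    if question_difficulty == "hard" or highly_confusable:
        return "hard"
    return "easy"


# Hypothetical confusability scores of two distractors with respect to the answer.
scores = {"Texas": 0.7, "Bavaria": 0.1}
print(overall_difficulty("easy", set_confusability(scores)))
```

Taking the maximum means that adding a weak distractor never lowers the difficulty, which matches the observation that an easy distractor does not make a hard question easier.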

7 Experimental Evaluation

In the following section we evaluate our approach to knowledge question generation from knowledge graphs. We perform two user studies which focus on evaluating the difficulty model and our distractor generation framework.

7.1 Human Assessment of Difficulty

An important motivation for automating the difficulty assessment of questions is that it is difficult for the average human to judge what constitutes an easy or hard question. Beinborn et al. [7] have already shown this for language proficiency tests, where language teachers were found to be bad at predicting the difficulty of questions when compared against the actual performance of students. We would like to observe whether the same applies to our setting. To create fair and informative tests, it is crucial that we are able to correctly assess the difficulty of a question.

We start with the assumption that the creators of Jeopardy! are good at automatically assessing question difficulty. Evidence for this was discussed in Section 4, where we showed that there exists a correlation between the monetary value of a question and the likelihood of it being incorrectly answered by Jeopardy! contestants.

In our experiment we want to show how well the average human can predict the difficulty of a question. To do so, we randomly sampled 100 easy ($200) and 100 hard ($1000) questions from the 500 questions generated in Section 4 to maximize the discrepancy in question difficulty. We then asked three human evaluators to annotate each of the 200 questions as easy or hard. Finally, we compared their answers with each other and with the ground truth according to Jeopardy!.

Table 3 shows the agreement between each pair of human evaluators and the majority vote difficulty assessment using Fleiss’ Kappa [19]. When looking at pairwise agreement between evaluators, it ranges from fair to moderate [28]. This leads us to conclude that it is hard for non-experts to properly judge the difficulty of questions.

We also compared the majority vote of the evaluators on the difficulty of the questions with the ground truth provided by Jeopardy!. The result was agreement on 62.5% of questions. This suggests that there is a need to automate the task.

0.192 0.325 0.500
0.443 0.661
Table 3: Agreement between human evaluators (all measurements are Fleiss’ Kappa)
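For reference, Fleiss’ Kappa can be computed from raw annotations as in the sketch below. The four annotated items are hypothetical and are not taken from the study; the function assumes that every item is labeled by the same number of raters and that the labels do not all fall into a single category (otherwise the denominator is zero).

```python
from collections import Counter


def fleiss_kappa(ratings, categories=("easy", "hard")):
    """ratings: one sequence of labels per item, with the same number of raters each."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    counts = [Counter(item) for item in ratings]
    # Per-item agreement: share of rater pairs that assign the same label.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters) / (n_raters * (n_raters - 1))
        for c in counts
    ]
    observed = sum(per_item) / n_items
    # Chance agreement from the overall share of each category.
    shares = [sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories]
    expected = sum(share ** 2 for share in shares)
    return (observed - expected) / (1 - expected)


# Hypothetical annotations by two evaluators for four questions.
annotations = [("easy", "easy"), ("hard", "hard"), ("easy", "hard"), ("hard", "hard")]
print(round(fleiss_kappa(annotations), 3))
```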

7.2 Question Difficulty Classification

We start by looking at the quality of our scheme for assigning difficulty levels to questions. The scheme is described in Section 4, where the possible difficulty levels are easy and hard. We train our logistic regression classifier on 500 Jeopardy! questions annotated as described in Section 4. Using ten-fold cross-validation, our classifier was able to correctly identify the difficulty levels of questions with an accuracy of 66.4%.

To gain insight into how informative our features are, we performed a feature ablation study where we look at the results for all combinations of our features. For this part, we grouped our features into three classes:

  • SAL: “Salience” features as in Table 1, with additional log-transformation of salience values to deal with long-tail entities.

  • COH: “Coherence” features in Table 1.

  • TYPE: “Per-coarse-semantic-type Salience” and “Answer Type” features in Table 1.

Table 4 shows the results of this experiment. Each row corresponds to a certain combination of features enabled or disabled. Rows are shown in descending order of ten-fold cross validation accuracy. It can be seen that best performance is achieved when all of our features are integrated. From this observation it can be reasoned that all features are necessary and give complementary signals. The bottom row corresponds to a random classifier.

SAL    COH    TYPE    Accuracy
yes    yes    yes     66.4%
yes    no     yes     65.8%
yes    yes    no      62.6%
yes    no     no      62.2%
no     no     yes     60.0%
no     yes    yes     57.8%
no     yes    no      52.4%
no     no     no      50.0%
Table 4: Ablation study results for features introduced in Section 4. Accuracy is based on ten-fold cross-validation of the difficulty classifier’s predictions.

7.3 User Study on Difficulty Estimation

In the following we perform an experiment on how well our classifier agrees with relative difficulty assessments of humans for questions generated by our system. It is important to note that we ask humans for relative difficulty assessments as opposed to absolute difficulties, since we have shown in Section 7.1 that humans are not very proficient in judging absolute difficulties.

For the user study we sampled a set of entities that each have a minimum number of non-type facts in Yago. For each entity, we generated a set of three questions and presented them with the answer entity to human annotators. The annotators were asked to order these questions by their relative difficulty and were allowed to skip a set of questions about an entity if they were not familiar with the entity.

We then computed the correlation between the ranking given by each of the human annotators and the output of our logistic regression classifier. For this we used Kendall’s τ, which ranges from -1, in the case of perfect disagreement, to 1, in the case of perfect agreement.

A total of 13 evaluators took part in the study and evaluated 92.5 questions on average. Rankings produced by the difficulty classifier moderately agree with those of the human annotators. When the τ-values for users are weighted by study participation, the average rises. Here, each user’s contribution to the final average depends on how many questions she evaluated, to avoid over-representing users who evaluated only a few questions.
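The two measurements can be sketched as follows, assuming tie-free rankings of the three questions generated for an entity. The rankings and the per-user values in the example are hypothetical and do not reproduce the numbers of the study.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau for two tie-free rankings, given as item -> position mappings."""
    items = list(rank_a)
    concordant = discordant = 0
    for x, y in combinations(items, 2):
        sign = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(items) * (len(items) - 1) // 2
    return (concordant - discordant) / n_pairs


def participation_weighted_mean(per_user):
    """per_user: (tau, number of questions evaluated) for each annotator."""
    total = sum(n for _, n in per_user)
    return sum(tau * n for tau, n in per_user) / total


# Hypothetical difficulty rankings of three questions (1 = easiest).
human = {"q1": 1, "q2": 2, "q3": 3}
model = {"q1": 1, "q2": 3, "q3": 2}
print(round(kendall_tau(human, model), 3))

# Hypothetical per-annotator results: (tau, questions evaluated).
print(participation_weighted_mean([(1.0, 30), (0.0, 10)]))
```

Weighting by the number of evaluated questions lets annotators who ranked many question sets contribute proportionally more to the average than annotators who ranked only a handful.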

7.4 Distractor Confusability

We now turn to the evaluation of distractor generation for multiple-choice questions. Our goal is to accurately predict the confusability of a distractor given a question’s correct answer. In Section 6.2 we presented our scheme for quantifying distractor confusability and how it fits into a multiple-choice question setting. We evaluate our approach here.

For this experiment we automatically generated 10,000 multiple-choice questions. Each question has three answer choices, which are the correct answer and two distractors. We then restricted ourselves to the 400 multiple-choice questions whose distractor pair has the largest difference in confusability. This was done to maximize the probability that study participants can actually discriminate the more confusable from the less confusable distractor.

We ran each multiple-choice question through a crowdsourcing platform and asked workers to judge which distractor is more confusing. Each multiple-choice question was judged by 5 workers so that we could take the majority vote in case the judgments were not unanimous. We then compared this majority vote with the result of our confusability estimator. Our estimator agreed with the human annotations on 76% of the 400 multiple-choice questions. This translates to a Cohen’s κ of 0.521, indicating moderate agreement [12].
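The evaluation protocol can be sketched as follows: worker judgments are reduced to a majority vote per question, and the votes are compared with the estimator’s choices using Cohen’s κ. The judgments and estimator outputs below are hypothetical; `d1` and `d2` stand for the two distractors of a question.

```python
from collections import Counter


def majority_vote(labels):
    """Most frequent label; unambiguous for an odd number of binary judgments."""
    return Counter(labels).most_common(1)[0][0]


def cohen_kappa(a, b):
    """Cohen's kappa between two equally long label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[label] * freq_b[label] for label in set(a) | set(b)) / n ** 2
    return (observed - expected) / (1 - expected)


# Hypothetical judgments of 5 workers on 4 questions: which distractor is more confusing?
worker_judgments = [
    ["d1", "d1", "d2", "d1", "d1"],
    ["d2", "d2", "d2", "d1", "d2"],
    ["d1", "d2", "d1", "d1", "d2"],
    ["d2", "d2", "d2", "d2", "d2"],
]
human = [majority_vote(votes) for votes in worker_judgments]
estimator = ["d1", "d2", "d2", "d2"]  # hypothetical estimator output
print(cohen_kappa(human, estimator))
```

As a sanity check on the reported figures, an observed agreement of 0.76 together with a chance agreement close to 0.5 gives (0.76 - 0.5) / (1 - 0.5) = 0.52, which is in line with the reported value of 0.521.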

8 Related Work

There has been work on knowledge question generation for testing linguistic knowledge and reading comprehension. The generation of language proficiency tests has been tackled in several works [21, 32, 35]. Here, the focus is on generating cloze (fill-in-the-blank) tests. Beinborn et al. [7] present an approach for predicting the difficulty of answering such questions with multiple blanks using SVMs trained on four classes of features that look at individual blanks, their candidate answers, their dependence on other blanks, and the overall question difficulty.

Question generation for reading comprehension is aimed at evaluating knowledge from text corpora. This includes general Wikipedia knowledge [8, 24] and specialized domains such as medical texts [2, 46]. While the above works focus on generating a question from a single document, Questimator [22] generates multiple-choice questions from the textual Wikipedia corpus by considering multiple documents related to a single topic to produce a question. Work in this area has mostly taken the approach of overgeneration and ranking [24, 46]. Multiple questions are generated for a given passage using rules. A learned model then ranks the questions in terms of “acceptability”. In this setting, acceptable questions should be sensical and grammatical, and their answers should not be obvious.

Recent work has started to look at the problem of generating questions, including multiple choice ones, from KGs and ontologies [3, 38, 41, 37]. Strong motivations for studying this problem, compared to question generation from text, are scenarios where structured data is what is available at hand, and the ability to generate deeper, structurally more complex questions. Our system is an end-to-end solution for this problem over a large KG.

In Section 5 we presented a simple approach for query verbalization that suits our needs. The query verbalization problem has been tackled by Ngonga Ngomo et al. for SPARQL [33, 14], and Koutrika et al. for SQL [27], with a focus on usability. Similar to our approach, these earlier works take a template-based approach to verbalization, which is widely used for natural language generation from logical forms such as SPARQL queries [26, 34].

Much recent work has focused on keyword search [9] and question answering, rather than generation, from knowledge graphs [6, 13, 30, 40, 44, 47, 50], possibly in combination with textual data [5, 36, 48]. The value of knowledge graphs is that they return crisp answers and allow for complex constraints to answer structurally complex questions. Of course, question answering has a long history, with one of the major highlights being IBM’s Watson [16], which won the Jeopardy! game show by combining both structured and unstructured sources for answering.

One important contribution of our work is an approach to compute the difficulty of questions generated. This topic has received attention lately in community question answering [29, 45], by using a competition-based approach that tries to capture how much skill a question requires for answering. There has also been work on estimating query difficulty in the context of information retrieval [11, 49] to learn an estimator that predicts the expected precision of the query by analyzing the overlap between the results of the full query and the results of its sub-queries.

9 Conclusion

We proposed an end-to-end approach to the novel problem of generating quiz-style knowledge questions from knowledge graphs. Our approach addresses the challenges inherent to this problem, most importantly estimating the difficulty of generated questions. To this end, we engineer suitable features and train a model of question difficulty on historical data from the Jeopardy! quiz show, which is shown to outperform humans on this difficult task. A working prototype implementing our approach is accessible at:


  • [1] J! Archive.
  • [2] M. Agarwal and P. Mannem. Automatic gap-fill question generation from text books. In BEA, 2011.
  • [3] T. Alsubait et al. Generating multiple choice questions from ontologies: Lessons learnt. In OWLED, 2014.
  • [4] S. Auer et al. DBpedia: A Nucleus for a Web of Open Data. In ISWC/ASWC, 2007.
  • [5] H. Bast et al. Semantic Search on Text and Knowledge Bases. Foundations and Trends in IR, 10(2-3), 2016.
  • [6] H. Bast and E. Haussmann. More Accurate Question Answering on Freebase. In CIKM, 2015.
  • [7] L. Beinborn et al. Predicting the Difficulty of Language Proficiency Tests. TACL, 2, 2014.
  • [8] A. S. Bhatia et al. Automatic generation of multiple choice questions using wikipedia. In PReMI, 2013.
  • [9] R. Blanco et al. Effective and efficient entity search in RDF data. In ISWC, 2011.
  • [10] K. D. Bollacker et al. Freebase: a Collaboratively Created Graph Database for Structuring Human Knowledge. In SIGMOD, 2008.
  • [11] D. Carmel and E. Yom-Tov. Estimating the Query Difficulty for Information Retrieval. Morgan & Claypool Publishers, 2010.
  • [12] J. Cohen. A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement, 20(1):37, 1960.
  • [13] W. Cui et al. KBQA: an Online Template Based Question Answering System over Freebase. In IJCAI, 2016.
  • [14] B. Ell et al. Spartiqulation – Verbalizing SPARQL Queries. In ILD Workshop, ESWC, 2012.
  • [15] C. Fellbaum, editor. WordNet: an Electronic Lexical Database. MIT Press, 1998.
  • [16] D. A. Ferrucci. Introduction to "this is watson". IBM Journal of Research and Development, 2012.
  • [17] D. A. Ferrucci et al. Building Watson: An Overview of the DeepQA Project. AI Magazine, 31(3), 2010.
  • [18] J. R. Finkel et al. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. In ACL, 2005.
  • [19] J. L. Fleiss. Measuring Nominal Scale Agreement among Many Raters. Psychological Bulletin, 1971.
  • [20] E. Gabrilovich et al. FACC1: Freebase annotation of ClueWeb corpora, Version 1, 2013.
  • [21] D. M. Gates. How to Generate Cloze Questions from Definitions: A Syntactic Approach. In AAAI, 2011.
  • [22] Q. Guo et al. Questimator: Generating Knowledge Assessments for Arbitrary Topics. In IJCAI, 2016.
  • [23] M. A. Hearst. Automatic Acquisition of Hyponyms from Large Text Corpora. In COLING, 1992.
  • [24] M. Heilman and N. A. Smith. Question Generation via Overgenerating Transformations and Ranking. Technical report, 2009.
  • [25] J. Hoffart et al. Robust Disambiguation of Named Entities in Text. In EMNLP, 2011.
  • [26] N. Indurkhya and F. J. Damerau, editors. Handbook of Natural Language Processing. Chapman and Hall/CRC, 2010.
  • [27] G. Koutrika et al. Explaining Structured Queries in Natural Language. In ICDE, 2010.
  • [28] J. R. Landis and G. G. Koch. The Measurement of Observer Agreement for Categorical Data. Biometrics, Vol. 33, 1977.
  • [29] J. Liu et al. Question difficulty estimation in community question answering services. In EMNLP, 2013.
  • [30] V. López et al. Scaling up question-answering to linked data. In EKAW, 2010.
  • [31] M. Mintz et al. Distant supervision for relation extraction without labeled data. In ACL, 2009.
  • [32] A. Narendra et al. Automatic Cloze-Questions Generation. In RANLP, 2013.
  • [33] A.-C. Ngonga Ngomo et al. Sorry, I Don’t Speak SPARQL: Translating SPARQL Queries into Natural Language. In WWW, 2013.
  • [34] E. Reiter and R. Dale. Building Natural Language Generation Systems. Cambridge University Press, 2000.
  • [35] K. Sakaguchi et al. Discriminative Approach to Fill-in-the-Blank Quiz Generation for Language Learners. In ACL, 2013.
  • [36] D. Savenkov and E. Agichtein. When a knowledge base is not enough: Question answering over knowledge bases with external text data. In SIGIR, 2016.
  • [37] I. V. Serban et al. Generating factoid questions with recurrent neural networks: The 30m factoid question-answer corpus. In ACL, 2016.
  • [38] D. Seyler et al. Generating quiz questions from knowledge graphs. In WWW, 2015.
  • [39] D. Seyler et al. Automated question generation for quality control in human computation tasks. In WebSci, 2016.
  • [40] S. Shekarpour et al. Question answering on interlinked data. In WWW, 2013.
  • [41] L. Song and L. Zhao. Domain-specific question generation from a knowledge base. arXiv, 2016.
  • [42] F. M. Suchanek et al. Yago: A Core of Semantic Knowledge. In WWW, 2007.
  • [43] F. M. Suchanek et al. Yago2s: Modular high-quality information extraction with an application to flight planning. In BTW, volume 214, 2013.
  • [44] C. Unger et al. Template-based question answering over RDF data. In WWW, 2012.
  • [45] Q. Wang et al. A regularized competition model for question difficulty estimation in community question answering services. In EMNLP, 2014.
  • [46] W. Wang et al. Automatic question generation for learning evaluation in medicine. In ICWL, 2007.
  • [47] K. Xu et al. What Is the Longest River in the USA? Semantic Parsing for Aggregation Questions. In AAAI, 2015.
  • [48] P. Yin et al. Answering Questions with Complex Semantic Constraints on Open Knowledge Bases. In CIKM, 2015.
  • [49] E. Yom-Tov et al. Learning to estimate query difficulty: including applications to missing content detection and distributed information retrieval. In SIGIR, 2005.
  • [50] L. Zou et al. Natural language question answering over RDF: a graph data driven approach. In SIGMOD, 2014.