Generating Questions for Knowledge Bases via Incorporating Diversified Contexts and Answer-Aware Loss

by   Cao Liu, et al.

We tackle the task of question generation over knowledge bases. Conventional methods for this task neglect two crucial research issues: 1) the given predicate needs to be expressed; 2) the answer to the generated question needs to be definitive. In this paper, we address these two issues by incorporating diversified contexts and an answer-aware loss. Specifically, we propose a neural encoder-decoder model with multi-level copy mechanisms to generate such questions. Furthermore, the answer-aware loss is introduced to make generated questions correspond to more definitive answers. Experiments demonstrate that our model achieves state-of-the-art performance. Moreover, the generated questions express the given predicate and correspond to definitive answers.


1 Introduction

Question Generation over Knowledge Bases (KBQG) aims at generating natural language questions for the corresponding facts in KBs, and it benefits several real applications. Firstly, KBQG can automatically annotate question answering (QA) datasets. Secondly, the generated questions and answers can augment the training data for QA systems. More importantly, KBQG can improve the ability of machines to actively ask questions in human-machine conversations Duan et al. (2017); Sun et al. (2018). Therefore, this task has attracted increasing attention in recent years Serban et al. (2016); Elsahar et al. (2018).

Specifically, KBQG is the task of generating natural language questions according to input facts from a knowledge base in triplet form, i.e. <subject, predicate, object>. For example, as illustrated in Figure 1, KBQG aims at generating the question “Which city is Statue of Liberty located in?” (Q3) for the input factual triplet “<Statue of Liberty, location/containedby, New York City>” (we omit the domain of the predicate for the sake of brevity). Here, the generated question is associated with the subject “Statue of Liberty” and the predicate (fb:location/containedby) of the input fact, and the answer corresponds to the object “New York City”.

Figure 1: Examples of KBQG. We aim at generating questions like Q3, which expresses (matches) the given predicate and refers to a definitive answer.

As described by P16-1056, KBQG is required to transduce the triplet fact into a question about the subject and predicate, where the object is the correct answer. Therefore, a key issue for KBQG is to correctly understand the knowledge symbols (subject, predicate and object in the triplet fact) and then generate corresponding text descriptions. Recent studies have addressed this task, and the underlying intuition is to construct implicit associations between facts and texts. Specifically, P16-1056 designed an encoder-decoder architecture to generate questions from structured triplet facts. In order to improve the generalization of KBQG, N18-1020 utilized extra contexts as input via distant supervision Mintz et al. (2009), and equipped the decoder with attention and a part-of-speech (POS) copy mechanism to generate questions, which brought significant improvements. Nevertheless, we observe that there are still two important research issues (RIs) that are not handled well or are even neglected.

RI-1: The generated question is required to express the given predicate in the fact. For example in Figure 1, Q1 does not express (match) the predicate (fb:location/containedby), while it is expressed in Q2 and Q3. Previous work Elsahar et al. (2018) usually obtained predicate textual contexts through distant supervision. However, distant supervision is noisy or even wrong (e.g. “X is the husband of Y” is the relational pattern for the predicate fb:marriage/spouse, so it is wrong when “X” is a woman). Furthermore, many predicates in the KB have no predicate contexts. We compute statistics on the resources released by N18-1020, and find that only 44% of predicates have predicate textual contexts (we map the “prop_text_evidence.csv” file to the “property.vocab” file in N18-1020). Therefore, such models are prone to generating erroneous questions for predicates without contexts.

RI-2: The generated question is required to have a definitive answer. A definitive answer means that one question is associated with only one determinate answer rather than several alternative answers. As an example in Figure 1, Q2 may have ambiguous answers since it does not express the refined answer type. As a result, different answers including “United States”, “New York City”, etc. may be correct. In contrast, Q3 refers to a definitive answer (the object “New York City” in the given fact) by restricting the answer type to a city. We believe that Q3, which expresses the given predicate and refers to a definitive answer, is a better question than Q1 and Q2. In previous work, N18-1020 only regarded the most frequently mentioned entity type as the textual context for the subject or object in the triplet. In fact, most answer entities have multiple types, where the most frequently mentioned type tends to be universal (e.g. a broad type “administrative region” rather than a refined type “US state” for the entity “New York”). Therefore, questions generated by N18-1020 may fail to refer to definitive answers.

To address the aforementioned two issues, we exploit more diversified contexts for the given facts as textual contexts in an encoder-decoder model. Specifically, besides using predicate contexts from the distant supervision utilized by N18-1020, we further leverage the domain, range and even topic of the given predicate as contexts, which are off-the-shelf in KBs (e.g. the range and the topic for the predicate fb:location/containedby are “location” and “containedby”, respectively). Therefore, 100% of predicates (rather than the 44% covered in Elsahar et al.) have contexts. Furthermore, in addition to the most frequently mentioned entity type used as context by N18-1020, we leverage the type that best describes the entity as context (e.g. a refined entity type “US state” combined with a broad type “administrative region” for the entity “New York”; we obtain such representative entity types through the predicate fb:topic/notable_types in Freebase), which helps to refine the entity information. Finally, in order to make full use of these contexts, we propose a context-augmented fact encoder and a multi-level copy mechanism (KB copy and context copy) to integrate diversified contexts, where the multi-level copy mechanism can copy from the KB and textual contexts simultaneously. To further make generated questions correspond to definitive answers, we propose the answer-aware loss, which optimizes the cross-entropy between the generated question and answer type words and is beneficial for generating precise questions.

We conduct experiments on an open public dataset. Experimental results demonstrate that the proposed model using diversified textual contexts outperforms strong baselines (+4.5 BLEU4 score). Besides, incorporating the answer-aware loss further increases the BLEU score (+5.16 BLEU4 score over the strongest baseline) and produces questions associated with more definitive answers. Human evaluations further show that our model expresses the given predicate more precisely.

In brief, our main contributions are as follows:

(1) We leverage diversified contexts and a multi-level copy mechanism to alleviate the issue of incorrect predicate expression in traditional methods.

(2) We propose an answer-aware loss to tackle the issue that conventional methods cannot generate questions with definitive answers.

(3) Experiments demonstrate that our model achieves state-of-the-art performance. Moreover, the generated questions express the given predicate and refer to definitive answers.

Figure 2: Overall structure of the proposed model for KBQG. A context encoder is firstly employed to encode each textual context (Sec. 3.1), where “Diversified Types” represents the subject (object) context, and “DS pattern” denotes the relational pattern from distant supervisions. At the same time, a fact encoder transforms the fact into low-dimensional representations (Sec. 3.2). The above two encoders are aggregated by the context-augmented fact encoder (Sec. 3.3). Finally, the aggregated representations are fed to the decoder (Sec. 3.4), where the decoder leverages multi-level copy mechanism (KB copy and context copy) to generate target question words.

2 Task Description

We leverage textual contexts concerned with the triplet fact to generate questions over KBs. The task of KBQG can be formalized as follows:

$$P(Y \mid F, C) = \prod_{t=1}^{|Y|} P(y_t \mid y_{<t}, F, C) \tag{1}$$

where $F = (s, p, o)$ represents the subject ($s$), predicate ($p$) and object ($o$) of the input triplet, $C$ denotes a set of additional textual contexts, $Y = (y_1, \dots, y_{|Y|})$ is the generated question, and $y_{<t}$ represents all previously generated question words before time step $t$.

3 Methodology

Our model extends the encoder-decoder architecture Cho et al. (2014) with three encoding modules and two copy mechanisms in the decoder. The model overview is shown in Figure 2 along with its caption. It should be emphasized that we additionally design an answer-aware loss to make the generated question associated with a definitive answer (Sec. 3.5.2).

3.1 Context Encoder

Inspired by the great success of the transformer Vaswani et al. (2017) in sequence modeling Shen et al. (2018), we adopt a transformer encoder to encode each textual context separately. Take the subject context $x^s$ as an example: $x^s = (x^s_1, \dots, x^s_{|s|})$ is concatenated from diversified types for the subject, $x^s_i$ is the $i$-th token in the subject context, and $|s|$ stands for the length of the subject context. Firstly, $x^s$ is mapped into a query matrix Q, where Q is constructed by summing the corresponding token embeddings and segment embeddings. Similar to BERT Devlin et al. (2019), segment embeddings are the same for tokens of $x^s$ but different for those of $x^p$ (predicate context) or $x^o$ (object context). Based on the query matrix, the transformer encoder works as follows:

$$Q_j = Q W_j^Q, \quad K_j = K W_j^K, \quad V_j = V W_j^V \tag{2}$$
$$\mathrm{head}_j = \mathrm{softmax}\left(\frac{Q_j K_j^{\top}}{\sqrt{d}}\right) V_j \tag{3}$$
$$\mathrm{head}^s = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h) \tag{4}$$
$$H^s = \mathrm{LayerNorm}(Q + \mathrm{head}^s) \tag{5}$$
$$\mathbf{C}^s = \mathrm{FFN}(H^s) \tag{6}$$

where K and V are the key matrix and value matrix, respectively. It is called self-attention because K and V are equal to the query matrix Q in the encoding stage, $d$ represents the number of hidden units, and $h$ denotes the number of heads in the multi-head attention mechanism of the transformer encoder. The encoder first projects the input matrices (Q, K, V) into subspaces $h$ times using different linear projections $W_j^Q$, $W_j^K$, $W_j^V$ ($j = 1, \dots, h$) in Equation 2. The projections then perform scaled dot-product attention to obtain the representation of each head in parallel (Equation 3). Representations for all parallel heads are concatenated together in Equation 4. After residual connection, layer normalization (Equation 5) and the feed-forward operation (Equation 6), we obtain the subject context matrix $\mathbf{C}^s$.

Similarly, $\mathbf{C}^p$ and $\mathbf{C}^o$ are obtained from the same transformer encoder for the predicate and object, respectively.
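The scaled dot-product step of the context encoder can be illustrated with a minimal sketch. It is a simplification under stated assumptions, not the authors' implementation: it uses a single head, omits the learned projections, residual connection, layer normalization and feed-forward layer, and the helper names (`softmax`, `matmul`, `self_attention`) and toy token vectors are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def self_attention(q, d):
    """Single-head scaled dot-product self-attention with K = V = Q.

    Assumption: no learned projections; d is the number of hidden units
    used for scaling.
    """
    k_transposed = [list(col) for col in zip(*q)]
    scores = matmul(q, k_transposed)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, q), weights

# Toy subject context with three tokens and two hidden units (hypothetical values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context_matrix, attention_weights = self_attention(tokens, d=2)
```

Each output row is a weighted average of all token vectors, which is why every token representation can draw on the whole context.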

3.2 Fact Encoder

In contrast to the general Sequence-to-Sequence (Seq2Seq) model Sutskever et al. (2014), the input fact is not a word sequence but a structured triplet $F = (s, p, o)$. We employ a fact encoder to transform each atom in the fact into a fixed embedding, and the embedding is obtained from a KB embedding matrix. For example, the subject embedding $e_s \in \mathbb{R}^{d}$ is looked up from the KB embedding matrix $E_f \in \mathbb{R}^{k \times d}$, where $k$ represents the size of the KB vocabulary, and the size of the KB embedding is equal to the number of hidden units ($d$) in Equation 3. Similarly, the predicate embedding $e_p$ and the object embedding $e_o$ are mapped from the KB embedding matrix $E_f$. In previous work Elsahar et al. (2018), $E_f$ is pre-trained using TransE Bordes et al. (2013) to capture more fact information. In our model, $E_f$ can be pre-trained or randomly initialized (details in Sec. 4.7.1).

3.3 Context-Augmented Fact Encoder

In order to combine both the context encoder information and the fact encoder information, we propose a context-augmented fact encoder which applies the gated fusion unit Gong and Bowman (2018) to integrate the context matrix and the fact embedding. For example, the subject context matrix $\mathbf{C}^s$ and the subject embedding vector $e_s$ are integrated by the following gated fusion:

$$f = \tanh(W_f [e_s; v^s]) \tag{7}$$
$$g = \mathrm{sigmoid}(W_g [e_s; v^s]) \tag{8}$$
$$h_s = g \odot f + (1 - g) \odot e_s \tag{9}$$

where $v^s$ is an attentive vector from $e_s$ to $\mathbf{C}^s$, which is similar to D18-1424. The attentive vector is combined with the original subject embedding as a new enhanced representation f (Equation 7). Then a learnable gate vector, g (Equation 8), controls the information from $f$ and $e_s$ to the final augmented subject vector $h_s$ (Equation 9), where $\odot$ denotes the element-wise multiplication. Similarly, the augmented predicate vector $h_p$ and the augmented object vector $h_o$ are calculated in the same way. Finally, the context-augmented fact representation $H_f$ is the concatenation of the augmented vectors:

$$H_f = [h_s; h_p; h_o] \tag{10}$$
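A gated fusion of this kind can be sketched as follows. This is an illustration under assumptions rather than the exact unit used in the paper: it assumes the common form in which a tanh candidate and a sigmoid gate are both computed from the concatenation of the embedding and the attentive vector, and the names `gated_fusion`, `w_f` and `w_g` are hypothetical.

```python
import math

def linear(weights, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def gated_fusion(embedding, attentive, w_f, w_g):
    """Fuse a fact embedding with an attentive context vector.

    Assumed form: f = tanh(W_f [e; v]), g = sigmoid(W_g [e; v]),
    output = g * f + (1 - g) * e (all element-wise).
    """
    x = embedding + attentive  # list concatenation, i.e. [e; v]
    f = [math.tanh(z) for z in linear(w_f, x)]
    g = [1.0 / (1.0 + math.exp(-z)) for z in linear(w_g, x)]
    return [gi * fi + (1.0 - gi) * ei for gi, fi, ei in zip(g, f, embedding)]

# Toy example with two hidden units; zero weights give a gate of exactly 0.5.
zero_weights = [[0.0] * 4 for _ in range(2)]
augmented = gated_fusion([2.0, -4.0], [1.0, 1.0], zero_weights, zero_weights)
```

The gate lets the model fall back on the original KB embedding when the textual context is uninformative, and lean on the context-enhanced candidate otherwise.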


3.4 Decoder

The decoder aims at generating a question word sequence. As shown in Figure 2, we also exploit the transformer as the basic block in our decoder. Then we use a multi-level copy mechanism (KB copy and context copy), which allows copying from KBs and textual contexts.

Specifically, we first map the input of the decoder into an embedding representation by looking up the word embedding matrix, then we use position embeddings Vaswani et al. (2017) to enhance sequential information. Compared with the transformer encoder in Sec. 3.1, the transformer decoder has an extra sub-layer: a fact multi-head attention layer, which is similar to Equations 2-6, where the query matrix is initialized with the output of the previous decoder sub-layer while both the key matrix and the value matrix are the augmented fact representation $H_f$. After the feed-forward sub-layer and multiple transformer layers, we obtain the decoder state $s_t$ at time step $t$, and $s_t$ is then leveraged to generate the target question sequence word by word.

As depicted in Figure 2, we propose a multi-level copy mechanism to generate question words. At each time step $t$, given the decoder state $s_t$ together with the input fact $F$, textual contexts $C$ and vocabulary $\mathcal{V}$, the probabilistic function for generating any target question word $y_t$ is calculated as:

$$P(y_t \mid s_t, y_{t-1}, F, C) = p_{genv} P_{genv}(y_t \mid s_t, \mathcal{V}) + p_{cpkb} P_{cpkb}(y_t \mid s_t, F) + p_{cpctx} P_{cpctx}(y_t \mid s_t, C) \tag{11}$$
$$[p_{genv}, p_{cpkb}, p_{cpctx}] = \mathrm{softmax}(W [s_t; e_{y_{t-1}}]) \tag{12}$$

where genv, cpkb and cpctx denote the vocab generation mode, the KB copy mode and the context copy mode, respectively. In order to control the balance among different modes, we employ a 3-dimensional switch probability in Equation 12, where $e_{y_{t-1}}$ is the embedding of the previously generated word, and $P_{\cdot}(\cdot)$ indicates the probabilistic score function for the generated target word of each mode. Among the three probability score functions, $P_{genv}$ is typically performed by a classifier over a fixed vocabulary based on the word embedding similarity, and the details of $P_{cpkb}$ and $P_{cpctx}$ are given in the following.
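How a 3-dimensional switch combines the three modes can be sketched in a few lines. This is an illustrative sketch only: the function name `mix_modes`, the dictionary representation of each mode's distribution, and all numeric values are assumptions made for the example.

```python
def mix_modes(switch, p_vocab, p_kb, p_ctx):
    """Combine per-mode word distributions with a 3-way switch probability.

    switch: (vocab generation, KB copy, context copy) weights summing to 1.
    Each p_* maps candidate words to that mode's probability; words a mode
    cannot produce are treated as probability 0.
    """
    words = set(p_vocab) | set(p_kb) | set(p_ctx)
    return {
        w: switch[0] * p_vocab.get(w, 0.0)
        + switch[1] * p_kb.get(w, 0.0)
        + switch[2] * p_ctx.get(w, 0.0)
        for w in words
    }

# Hypothetical distributions at one decoding step.
final = mix_modes(
    (0.5, 0.3, 0.2),
    {"which": 0.6, "city": 0.4},            # vocab generation mode
    {"statue of liberty": 1.0},             # KB copy mode (subject name)
    {"city": 0.7, "location": 0.3},         # context copy mode
)
```

Because the switch weights and each mode's distribution sum to one, the mixture is again a valid distribution, and a word such as "city" can receive mass from more than one mode.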

3.4.1 KB Copy

Previous studies found that most questions in SimpleQuestion contain the subject name or its aliases Petrochuk and Zettlemoyer (2018). However, the predicate name and object name hardly appear in the question. Therefore, we only copy the subject name in the KB copy, where $P_{cpkb}$, the probability of copying the subject name, is calculated by a neural network function with a multi-layer perceptron (MLP) projected from the decoder state $s_t$.

3.4.2 Context Copy

N18-1020 demonstrated the effectiveness of POS copy for the context. However, such a copy mechanism heavily relies on POS tagging. Inspired by CopyNet Gu et al. (2016), we directly copy words in the textual contexts $C$, which does not rely on any POS tagging. Specifically, the input sequence $X$ for the context copy is the concatenation of all words in the textual contexts $C$. Unfortunately, $X$ is prone to contain repeated words because it consists of rich contexts for the subject, predicate and object. Repeated words in the input sequence tend to cause repetition problems in output sequences Tu et al. (2016). We adopt the maxout pointer Zhao et al. (2018) to address the repetition problem. Instead of summing all the probabilistic scores for repeated input words, we limit the probabilistic score of repeated words to their maximum score as in Equation 13:

$$P_{cpctx}(y_t \mid s_t, C) = \begin{cases} \max_{k:\, x_k = y_t} sc_{t,k}, & y_t \in X \\ 0, & \text{otherwise} \end{cases} \tag{13}$$

where $x_k$ represents the $k$-th token in the input context sequence $X$, $sc_{t,k}$ is the probabilistic score of generating the token $x_k$ at time step $t$, and $sc_{t,k}$ is calculated by a softmax function over $X$.
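The difference between sum pooling and the maxout pointer for repeated context words can be sketched as follows. The function names and the toy scores are hypothetical; the scores are assumed to be the softmax outputs over the context sequence at one time step.

```python
def maxout_copy_scores(tokens, scores):
    """Keep, for each distinct token, the maximum score over its occurrences."""
    best = {}
    for token, score in zip(tokens, scores):
        best[token] = max(score, best.get(token, float("-inf")))
    return best

def sum_copy_scores(tokens, scores):
    """Baseline for comparison: sum the scores of repeated tokens."""
    total = {}
    for token, score in zip(tokens, scores):
        total[token] = total.get(token, 0.0) + score
    return total

# "city" appears twice in the concatenated contexts (hypothetical scores).
context_tokens = ["city", "location", "city"]
step_scores = [0.2, 0.5, 0.3]
maxout = maxout_copy_scores(context_tokens, step_scores)
summed = sum_copy_scores(context_tokens, step_scores)
```

With summation, the repeated token "city" ties with "location" purely because it occurs twice; with maxout, repetition alone no longer inflates its score.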

3.5 Learning

3.5.1 Question-Aware Loss

Our model is fully differentiable with respect to the generated question words, so it can be optimized in an end-to-end manner by back-propagation. Given the input fact $F$, additional textual context $C$ and target question word sequence $Y$, the objective function is to minimize the following negative log-likelihood:

$$\mathcal{L}_{ques\_loss} = -\sum_{t=1}^{|Y|} \log P(y_t \mid y_{<t}, F, C) \tag{14}$$

The question-aware loss $\mathcal{L}_{ques\_loss}$ does not require any additional labels to optimize because the three modes share the same classifier to keep a balance (Equation 12), and they can learn to coordinate with each other by minimizing $\mathcal{L}_{ques\_loss}$.

3.5.2 Answer-Aware Loss

The model is able to generate questions similar to the labeled questions by optimizing the question-aware loss $\mathcal{L}_{ques\_loss}$. However, there is an ambiguity problem in the annotated questions, where the questions have alternative answers rather than determinate answers Petrochuk and Zettlemoyer (2018). In order to make generated questions correspond to definitive answers, we propose a novel answer-aware loss. With the answer-aware loss, we aim at generating an answer type word in the question, which contributes to generating a question word matching the answer type. Formally, the answer-aware loss is as follows:

$$\mathcal{L}_{ans\_loss} = \min_{a \in A} \min_{y_t \in Y} H_{a, y_t} \tag{15}$$

where $A$ is a set of answer type words. We treat object type words as the answer type words because the object is the answer. $H_{a, y_t}$ denotes the cross entropy between the answer type word $a$ and the generated question word $y_t$. Finally, the minimum cross entropy is regarded as the answer-aware loss $\mathcal{L}_{ans\_loss}$. Optimizing $\mathcal{L}_{ans\_loss}$ means that the model aims at generating an answer type word in the generated question sequence. For example, the model tends to generate Q3 rather than Q2 in Figure 1, because Q3 contains an answer type word, “city”. Similarly, $\mathcal{L}_{ans\_loss}$ can be optimized in an end-to-end manner, and it is integrated with $\mathcal{L}_{ques\_loss}$ by a weight coefficient $\lambda$ into the total loss as follows:

$$\mathcal{L} = \mathcal{L}_{ques\_loss} + \lambda \mathcal{L}_{ans\_loss} \tag{16}$$
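The minimum-cross-entropy idea can be sketched as below. This is a sketch under assumptions: the cross entropy between an answer type word and the prediction at one step is taken to be the negative log-probability the decoder assigns to that word (the one-hot case), the total loss is assumed to be the question-aware loss plus the weighted answer-aware loss, and the function names and toy distributions are hypothetical.

```python
import math

def answer_aware_loss(step_distributions, answer_type_words, eps=1e-12):
    """Minimum cross entropy over all (time step, answer type word) pairs.

    step_distributions: one dict per decoding step, mapping words to the
    predicted probability at that step.
    """
    return min(
        -math.log(max(dist.get(word, 0.0), eps))
        for dist in step_distributions
        for word in answer_type_words
    )

def total_loss(question_loss, answer_loss, weight=0.2):
    """Assumed combination: question-aware loss + weight * answer-aware loss."""
    return question_loss + weight * answer_loss

# Hypothetical two-step decoder output and answer type words for the object.
steps = [{"which": 0.9, "city": 0.1}, {"city": 0.8, "place": 0.2}]
loss = answer_aware_loss(steps, {"city", "location"})
```

Taking the minimum means the loss is satisfied as soon as any one position confidently produces any one answer type word, so the model is not pushed to emit type words at every step.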


4 Experiment

4.1 Experimental Settings

4.1.1 Experimental Data Details

We conduct experiments on the SimpleQuestion dataset Bordes et al. (2015), which contains 75910/10845/21687 question answering pairs (QA-pairs) for training/validation/test. In order to obtain diversified contexts, we additionally employ the domain, range and topic of the predicate to improve the coverage of predicate contexts. In this way, 100% of predicates (rather than the 44% covered in Elsahar et al.) have contexts. For the subject and object contexts, we combine the most frequently mentioned entity type Elsahar et al. (2018) with the type that best describes the entity. The KB copy needs subject names as the copy source, and we map entities to their names similar to N18-2047. The data details are in Appendix A and the submitted Supplementary Data.

4.1.2 Evaluation Metrics

Following Serban et al. (2016); Elsahar et al. (2018), we adopt several word-overlap based metrics (WBMs) for natural language generation, including BLEU-4 Papineni et al. (2002), ROUGE Lin (2004) and METEOR Denkowski and Lavie (2014). However, such metrics still suffer from some limitations Novikova et al. (2017). Crucially, they can hardly measure whether generated questions express the given predicate and refer to definitive answers. To better evaluate generated questions, we run two further evaluations as follows.

(1) Predicate identification: Following N18-2047, we employ annotators to judge whether the generated question expresses the given predicate in the fact or not. The score for predicate identification is the percentage of generated questions that express the given predicate.

(2) Answer coverage: We define a novel metric called answer coverage to identify whether the generated question refers to a definitive answer. Specifically, answer coverage is obtained by automatically calculating the percentage of questions that contain answer type words, and answer type words are object contexts (entity types for the object are regarded as answer type words).
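The answer coverage computation can be sketched as follows. The whitespace tokenization, lowercasing and function name are assumptions made for illustration; the paper does not specify these details.

```python
def answer_coverage(questions, answer_type_words):
    """Percentage of questions containing at least one answer type word.

    questions: list of generated question strings.
    answer_type_words: one set of answer type words per question
    (the object contexts of the corresponding fact).
    """
    hits = 0
    for question, type_words in zip(questions, answer_type_words):
        tokens = set(question.lower().split())
        if tokens & {word.lower() for word in type_words}:
            hits += 1
    return 100.0 * hits / len(questions)

# Hypothetical generated questions for the same fact.
generated = [
    "which city is statue of liberty located in ?",
    "where is statue of liberty located ?",
]
types = [{"city", "location"}, {"city", "location"}]
coverage = answer_coverage(generated, types)
```

Only the first question contains an answer type word ("city"), so half of the questions in this toy set are counted as covered.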

Furthermore, it is hard to automatically evaluate the naturalness of generated questions. Following N18-2047, we adopt human evaluation to measure the naturalness by a score of 0-5.

4.1.3 Comparison with State-of-the-arts

We compare our model with following methods.

(1) Template: A baseline in P16-1056. It randomly chooses a candidate fact in the training data to generate the question, where the candidate fact shares the same predicate with the input fact.

(2) P16-1056: We compare our methods with the single placeholder model, which performs best in P16-1056.

(3) N18-1020: We compare our methods with the model utilizing copy actions, the best performing model in N18-1020. Although this model is designed for a zero-shot setting (for unseen predicates and entity types), its additional context input and SPO copy mechanism allow it to generate good questions on both known and unknown predicates and entity types.

4.1.4 Implementation Details

To make our model comparable to the comparison methods, we keep most parameter values the same as N18-1020. We utilize the RMSProp algorithm with a decreasing learning rate (0.001) and a batch size of 200 to optimize the model. The size of KB embeddings is 200, and KB embeddings are pre-trained by TransE Bordes et al. (2013). The word embeddings are initialized with the pre-trained GloVe word vectors with 200 dimensions. In the transformer, we set the hidden units to 200, and we employ 4 parallel attention heads and a stack of 5 identical layers. We set the weight ($\lambda$) of the answer-aware loss to 0.2.

4.2 Overall Comparisons




Model                   BLEU4   ROUGE   METEOR
Template                31.36   *       33.12
P16-1056                33.32   *       35.38
N18-1020                36.56   58.09   34.41
Our Model               41.09   68.68   47.75
Our Model + ans_loss    41.72   69.31   48.13


Table 1: Overall comparisons on the test data, where “ans_loss” represents answer-aware loss.

In Table 1, we compare our model with the typical baselines on word-overlap based metrics. Our model is clearly better than the baselines on all metrics: the BLEU4 score increases by 4.53 (from 36.56 to 41.09) compared with the strongest baseline Elsahar et al. (2018). In particular, incorporating the answer-aware loss (the last line in Table 1) further improves the performance (+5.16 BLEU4 over the same baseline).

4.3 Performances on Predicate Identification


Model        Pred. Identification
P16-1056     53.5
N18-1020     71.5
Our Model    75.5


Table 2: Performances on predicate identification.

To evaluate the ability of our model on predicate identification, we sample 100 generated questions from each model, and two annotators judge whether each generated question expresses the given predicate. The Kappa coefficient for inter-annotator agreement is 0.611, and the p-value for all scores is less than 0.005. As shown in Table 2, our model achieves a significant improvement in predicate identification (75.5 versus 71.5 for N18-1020).
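Agreement between two annotators is commonly measured with Cohen's kappa. The sketch below assumes that this is the Kappa variant reported here, which the text does not state explicitly; the function name and the toy labels are hypothetical.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        counts_a[label] * counts_b[label]
        for label in set(labels_a) | set(labels_b)
    ) / (n * n)
    return (observed - expected) / (1.0 - expected)

# 1 = "question expresses the predicate", 0 = "it does not" (toy judgments).
annotator_1 = [1, 1, 0, 0]
annotator_2 = [1, 0, 0, 0]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Kappa corrects raw agreement for the agreement expected by chance, so it is lower than the plain percentage of matching judgments.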

4.4 Performances on Answer Coverage — The Effectiveness of Answer-Aware Loss


Model        λ       BLEU4   Ans
N18-1020     0       36.56   59.49
Our Model    0       41.09   61.65
Our Model    0.05    41.55   62.27
Our Model    0.2     41.72   64.23
Our Model    0.5     41.57   65.50
Our Model    1.0     41.34   65.25


Table 3: Performances on answer coverage, where “Ans” denotes the metric of answer coverage. “λ” is the weight of the answer-aware loss in Equation 16.

Table 3 reports performances on BLEU4 and answer coverage (Ans). We observe the following:

(1) When the answer-aware loss is not leveraged (λ = 0), our model still shows clear advantages (41.09 BLEU4 and 61.65 Ans versus 36.56 and 59.49 for N18-1020). Note that the answer coverage is 55.23 on the human-labeled questions. Although our model does not explicitly capture answer information in this setting, it still obtains a high answer coverage, which may be because our diversified contexts contain rich answer type words.

(2) To demonstrate the effectiveness of the answer-aware loss, we set the weight of the answer-aware loss (λ) to 0.05/0.2/0.5/1.0 (the last four lines in Table 3). Our model with the answer-aware loss achieves a significant improvement on answer coverage, while there is no performance degradation on BLEU4 compared with λ = 0, which indicates that the answer-aware loss contributes to generating better questions. In particular, the generated questions are more precise because they refer to more definitive answers, as reflected by the higher Ans.

(3) For some predicates such as fb:location/containedby, the object in the triplet fact tends to be one of several alternative answers, while other predicates (e.g. fb:person/gender) naturally refer to a definitive answer. To verify that our model with the answer-aware loss does not generate an answer type word in a mandatory way, we count the predicates whose generated questions contain no answer type words. When our model obtains the highest Ans (λ = 0.5), 20.5% of predicates fall into this group, which is very close to the 21.7% observed in human-annotated questions. This demonstrates that the answer-aware loss does not force all predicates to generate questions with answer type words.

4.5 Ablation Study




Model                       BLEU4   ROUGE   METEOR
Our Model                   41.72   69.31   48.13
w/o context copy            41.27   68.36   47.54
w/o KB copy                 41.04   68.66   47.72
w/o answer-aware loss       41.09   68.68   47.75
w/o diversified contexts    40.53   68.52   47.66


Table 4: Ablation study by removing the main components, where “w/o” means without, and “w/o diversified contexts” represents that diversified contexts are replaced by contexts used in N18-1020.

In order to validate the effectiveness of the model components, we remove some important components from our model, including context copy, KB copy, answer-aware loss and diversified contexts. The results are shown in Table 4. Removing any component degrades performance on all metrics, which demonstrates that all these components are useful. In particular, replacing diversified contexts with the contexts used in N18-1020 (the last line in Table 4) causes the largest BLEU4 drop (from 41.72 to 40.53).

4.6 Performances on Naturalness


Model        Naturalness
P16-1056     2.96
N18-1020     2.23
Our Model    3.56


Table 5: Performances on naturalness.

Human evaluation is important for generated questions. Following N18-1020, we sample 100 questions from each system, and two annotators measure the naturalness by a score of 0-5. The Kappa coefficient for inter-annotator agreement is 0.629, and the p-value for all scores is less than 0.005. As shown in Table 5, N18-1020 performs poorly on naturalness, while our model obtains the highest score, which demonstrates that our model can deliver more natural questions than the baselines.

4.7 Discussion

4.7.1 Without Pre-trained KB Embeddings




Model        Pre-trained   BLEU4   ROUGE   METEOR
N18-1020     True          36.56   58.09   34.41
N18-1020     False         33.67   55.57   33.20
Our Model    True          41.72   69.31   48.13
Our Model    False         41.55   68.59   47.52


Table 6: Performances with and without the pre-trained KB embeddings from TransE.

Pre-trained KB embeddings may provide rich structured relational information among entities. However, pre-training heavily relies on large-scale triplets, which is time- and resource-intensive. To investigate the effectiveness of pre-trained KB embeddings for KBQG, we report the performance of KBQG with and without KB embeddings pre-trained by simply applying TransE. Table 6 shows that the performance of KBQG degrades without TransE embeddings. N18-1020 shows an obvious degradation on all metrics, while there is only a slight decline in our model. We attribute this robustness to the context-augmented fact encoder, since our model drops to 40.87 on the BLEU4 score when both the context-augmented fact encoder and TransE embeddings are removed.

4.7.2 The Effectiveness of Generated Questions for Enhancing Question Answering over Knowledge Bases


Data Type                              Accuracy
human-labeled data                     68.97
+ gen_data (Serban et al., 2016)       68.53
+ gen_data (Elsahar et al., 2018)      69.13
+ gen_data (Our Model)                 69.57


Table 7: Performances of generated questions for QA.

Previous experiments demonstrate that our model can deliver more precise questions. To further examine the effectiveness of our model, we evaluate how useful the generated questions are for training a question answering system over knowledge bases. Specifically, we combine human-labeled data with the same amount of model-generated data to train a typical QA system (Mohammed et al., 2018). The accuracy of QA is shown in Table 7. We observe that adding generated questions may weaken the performance of QA (a drop from 68.97 to 68.53 in Table 7). Our generated questions achieve the best performance on the QA system (69.57), which indicates that our model generates more precise questions and improves QA accuracy over both the human-labeled data alone and the other generated data.

4.7.3 Speed

Figure 3: Performance on valid data through epochs, where “base” is the method in Elsahar et al. (2018).

In order to further explore the convergence speed, we plot the performances on valid data through epochs in Figure 3. Our model has much more information to learn, which may have a negative impact on the convergence speed. Nevertheless, our model can copy KB elements and textual contexts simultaneously, which may accelerate convergence. As demonstrated in Figure 3, our model achieves the best performance on almost all epochs. After about 6 epochs, the performance of our model becomes stable and converges.

4.7.4 Case Study

Figure 4: Examples of questions by different models.

Figure 4 lists the reference questions and the questions generated by different models. It can be seen that our generated questions better express the target predicate, such as in ID 1 (marked with underline). In ID 2, although all questions express the target predicate correctly, only our question refers to a definitive answer since it contains an answer type word “city” (marked in bold). It should be emphasized that the questions generated by our method with the answer-aware loss do not always contain answer type words (ID 1 and 3).

5 Related Work

Our work is inspired by a large number of successful applications using neural encoder-decoder frameworks on NLP tasks such as machine translation Cho et al. (2014) and dialog generation Vinyals and Le (2015). Our work is also inspired by the recent work for KBQG based on encoder-decoder frameworks. P16-1056 first proposed a neural network for mapping KB facts into natural language questions. To improve the generalization, N18-1020 introduced extra contexts for the input fact, which achieved significant performances. However, these contexts may make it difficult to generate questions that express the given predicate and associate with a definitive answer. Therefore, we focus on the two research issues: expressing the given predicate and referring to a definitive answer for generated questions.

Moreover, our work also borrows ideas from copy mechanisms. The pointer network Vinyals et al. (2015) predicted the output sequence directly from the input and cannot generate new words, while CopyNet Gu et al. (2016) combined copying and generating. DBLP:conf/aaai/BaoTDYLZZ18 proposed to copy elements in the table (KB). N18-1020 exploited POS copy actions to better capture textual contexts. To incorporate the advantages of the above copy mechanisms, we introduce KB copy and context copy, which can copy KB elements and textual contexts and do not rely on POS tagging.

6 Conclusion and Future Work

In this paper, we focus on two crucial research issues for the task of question generation over knowledge bases: generating questions that express the given predicate and refer to definitive answers rather than alternative answers. For this purpose, we present a neural encoder-decoder model which integrates diversified off-the-shelf contexts and multi-level copy mechanisms. Moreover, we design an answer-aware loss to generate questions that refer to definitive answers. Experiments show that our model achieves state-of-the-art performance on automatic and manual evaluations.

For future work, we investigate error cases by analyzing the error distribution of 100 examples. We find that most generated questions (51%) are judged by human evaluators to correctly express the input facts, yet they obtain low scores on the widely used metrics. This implies that evaluating generated questions remains difficult. Although we additionally evaluate predicate identification and answer coverage, these metrics may be coarse and deserve further study.


This work is supported by the National Natural Science Foundation of China (No. 61533018), the National Key R&D Program of China (No. 2018YFC0830101), the National Natural Science Foundation of China (No. 61702512) and the independent research project of National Laboratory of Pattern Recognition. This work was also supported by CCF-Tencent Open Research Fund.


  • A. Bordes, N. Usunier, S. Chopra, and J. Weston (2015) Large-scale simple question answering with memory networks. CoRR abs/1506.02075.
  • A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko (2013) Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.), pp. 2787–2795.
  • K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio (2014) On the properties of neural machine translation: encoder-decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pp. 103–111. ISBN 978-1-937284-96-1.
  • K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1724–1734.
  • M. Denkowski and A. Lavie (2014) Meteor universal: language specific translation evaluation for any target language. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pp. 376–380.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
  • N. Duan, D. Tang, P. Chen, and M. Zhou (2017) Question generation for question answering. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 866–874.
  • H. Elsahar, C. Gravier, and F. Laforest (2018) Zero-shot question generation from knowledge graphs for unseen predicates and entity types. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 218–228.
  • Y. Gong and S. Bowman (2018) Ruminating reader: reasoning with gated multi-hop attention. In Proceedings of the Workshop on Machine Reading for Question Answering, pp. 1–11.
  • J. Gu, Z. Lu, H. Li, and V. O.K. Li (2016) Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1631–1640.
  • C. Lin (2004) ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, Barcelona, Spain, pp. 74–81.
  • M. Mintz, S. Bills, R. Snow, and D. Jurafsky (2009) Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, ACL ’09, Stroudsburg, PA, USA, pp. 1003–1011. ISBN 978-1-932432-46-6.
  • J. Novikova, O. Dušek, A. Cercas Curry, and V. Rieser (2017) Why we need new evaluation metrics for NLG. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 2241–2252.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, Stroudsburg, PA, USA, pp. 311–318.
  • M. Petrochuk and L. Zettlemoyer (2018) SimpleQuestions nearly solved: a new upperbound and baseline approach. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 554–558.
  • I. V. Serban, A. García-Durán, C. Gulcehre, S. Ahn, S. Chandar, A. Courville, and Y. Bengio (2016) Generating factoid questions with recurrent neural networks: the 30M factoid question-answer corpus. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 588–598.
  • T. Shen, T. Zhou, G. Long, J. Jiang, S. Pan, and C. Zhang (2018) DiSAN: directional self-attention network for RNN/CNN-free language understanding. In AAAI Conference on Artificial Intelligence.
  • X. Sun, J. Liu, Y. Lyu, W. He, Y. Ma, and S. Wang (2018) Answer-focused and position-aware neural question generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3930–3939.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Proceedings of Advances in Neural Information Processing Systems, pp. 3104–3112.
  • Z. Tu, Z. Lu, Y. Liu, X. Liu, and H. Li (2016) Modeling coverage for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 76–85.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 5998–6008.
  • O. Vinyals, M. Fortunato, and N. Jaitly (2015) Pointer networks. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), pp. 2692–2700.
  • O. Vinyals and Q. Le (2015) A neural conversational model. arXiv:1506.05869. ICML Deep Learning Workshop 2015.
  • Y. Zhao, X. Ni, Y. Ding, and Q. Ke (2018) Paragraph-level neural question generation with maxout pointer and gated self-attention networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3901–3910.