Improving Neural Question Generation using World Knowledge

09/09/2019 ∙ by Deepak Gupta, et al.

In this paper, we propose a method for incorporating world knowledge (linked entities and fine-grained entity types) into a neural question generation model. This world knowledge helps to encode additional information related to the entities present in the passage, which is required to generate human-like questions. We evaluate our models on both SQuAD and MS MARCO to demonstrate the usefulness of the world knowledge features. The proposed world knowledge enriched question generation model outperforms the vanilla neural question generation model by 1.37 and 1.59 absolute BLEU-4 points on the SQuAD and MS MARCO test sets, respectively.




1 Introduction

The task of question generation (QG) aims to generate syntactically and semantically sound questions, from a given text, for which a given answer would be a correct response. Recently, there has been increased research interest in the QG task due to (1) the wide success of neural network based sequence-to-sequence techniques Sutskever et al. (2014) for various NLP tasks Bahdanau et al. (2014); Srivastava et al. (2015); Xu et al. (2015); Rush et al. (2015); Kumar et al. (2016), and (2) the abundance of large question answering datasets: SQuAD Rajpurkar et al. (2016), NewsQA Trischler et al. (2016), MS MARCO Nguyen et al. (2016).

Text: Kevin Hart and Dwayne Johnson are taking on MTV. The duo, who co-star in the upcoming film Central Intelligence, are set to co-host the 2016 MTV Movie Awards, The Hollywood Reporter has confirmed.
Ans: Central Intelligence
Q (machine): Dwayne Johnson co-starred with Kevin Hart in what organization?
Q (human): In which film did Dwayne Johnson collaborate with comedian Kevin Hart?
Table 1: Sample text, along with the machine-generated and human-generated questions for the given answer (Ans). In the machine-generated question, the entity that could not be resolved is ‘organization’ (shown in red in the original); the corresponding resolved entity in the human-generated question is ‘film’ (shown in blue in the original).

In this paper, we advocate for improving question generation systems using world knowledge, which has not yet been investigated. We explore world knowledge in the form of entities present in the text and exploit the associated entity knowledge to generate human-like questions. In our experiments, we use two types of world knowledge: entities linked to the Wikipedia knowledge base and fine-grained entity types (FGET). Table 1 illustrates how this form of world knowledge can be used to improve question generation. Here, ‘Central Intelligence’ is the name of a movie and ‘Dwayne Johnson’ and ‘Kevin Hart’ are actors. Knowing that ‘Central Intelligence’ is a movie helps the model to generate the correct word ‘film’ instead of the incorrect word ‘organization’.

We adopt the sequence-to-sequence model Bahdanau et al. (2014) equipped with the copy mechanism Gulcehre et al. (2016); See et al. (2017) as our base model for question generation. The entity linking and fine-grained entity typing information is fed to the network along with the answer of interest. We believe this is the first work that explores world knowledge in the form of linked entities and fine-grained entity types as features to improve neural question generation models.

2 Related Work

Recently, work on question generation has shifted towards neural approaches. These approaches typically involve end-to-end supervised learning to generate questions.

Du et al. (2017) proposed sequence-to-sequence learning for question generation from text passages. Zhou et al. (2017) utilized the answer position and linguistic features such as named entity recognition (NER) and part-of-speech (POS) information to further improve QG performance, since the model is then aware of the answer for which a question needs to be generated. In the work of Wang et al. (2016), a multi-perspective context matching algorithm is employed. Harrison and Walker (2018) use a set of rich linguistic features along with an NQG model. Song et al. (2018) used the matching algorithm proposed by Wang et al. (2016) to compute the similarity between the target answer and the passage, collecting relevant contextual information from different perspectives so that it can be better considered by the encoder. More recently, Kim et al. (2018) claimed to improve the performance of the QG model by replacing the target answer in the original passage with a special token. Other NQG models include Zhao et al. (2018); Sun et al. (2018); Gao et al. (2018), which generate questions mainly from the SQuAD and MS MARCO datasets.

3 Proposed Approach

Problem Statement

Given a passage $P$ containing a sequence of words and an answer $A$ that is a sub-sequence of the passage, the task is to find the optimal question sequence $\bar{Q}$. Mathematically,

$$\bar{Q} = \arg\max_{Q} \operatorname{Prob}(Q \mid P, A)$$
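As a hedged illustration, sequence-to-sequence decoders usually do not score a whole question at once; the conditional probability is expanded token by token with the chain rule. Writing the passage as $P$, the answer as $A$ and a candidate question as $Q = (q_1, \dots, q_{|Q|})$ (symbols chosen here for illustration, not necessarily the authors' notation), the expansion reads:

```latex
\operatorname{Prob}(Q \mid P, A)
  = \prod_{t=1}^{|Q|} \operatorname{Prob}\left(q_t \mid q_1, \dots, q_{t-1}, P, A\right)
```

Each factor corresponds to one decoding step, which is why the search over questions can be carried out left to right.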


3.1 World Knowledge Enriched Encoder

Our proposed model is based on the sequence-to-sequence paradigm Bahdanau et al. (2014). For the encoder, we utilize a Long Short Term Memory (LSTM) network Hochreiter and Schmidhuber (1997). In order to capture more contextual information, we use a two-layer bidirectional LSTM (Bi-LSTM). Inspired by the success of using linguistic features in Zhou et al. (2017); Harrison and Walker (2018), we exploit world knowledge in the form of entity linking and fine-grained entity typing in the encoder of the network. A Bi-LSTM encoder reads the passage words and their associated world knowledge features (cf. Sections 3.1.1 and 3.1.2) to produce a sequence of word-and-feature vectors. The word vectors, the embedded world knowledge feature vectors and the answer position indicator embedding vectors are concatenated and passed as input to the Bi-LSTM encoder.
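The answer position indicator can be pictured as one tag per passage token. The sketch below is illustrative only: it assumes a B/I/O tagging scheme and exact token matching, neither of which is specified in the text, and the function names are hypothetical.

```python
from typing import List, Tuple


def find_answer_span(passage: List[str], answer: List[str]) -> Tuple[int, int]:
    """Return (start, end) of the first exact match of `answer` in `passage`.

    `end` is exclusive. Returns (-1, -1) when there is no match.
    """
    m = len(answer)
    if m == 0:
        return -1, -1
    for start in range(len(passage) - m + 1):
        if passage[start:start + m] == answer:
            return start, start + m
    return -1, -1


def answer_position_tags(passage: List[str], answer: List[str]) -> List[str]:
    """Tag each passage token as B (answer start), I (inside answer) or O.

    The B/I/O scheme is an assumption made for this sketch.
    """
    start, end = find_answer_span(passage, answer)
    tags = ["O"] * len(passage)
    if start >= 0:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags


passage = "the duo co-star in the film central intelligence".split()
answer = ["central", "intelligence"]
tags = answer_position_tags(passage, answer)
```

Each tag would then be looked up in a small embedding table, in the same way as a word.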

3.1.1 Entity Linking

In previous works Zhou et al. (2017); Harrison and Walker (2018), named entity type features have been used. These features, however, only encode coarse-level information, such as whether an entity belongs to one of a set of predefined categories such as ‘PERSON’, ‘LOCATION’ and ‘ORGANIZATION’. To alleviate this, we use knowledge in the form of linked entities. In our experiments, we use Wikipedia as the knowledge base to which entities are linked. This task (also known as Wikification Cheng and Roth (2013)) consists of identifying concepts and entities in text and disambiguating them into the most specific corresponding Wikipedia pages. We followed the approach of Cheng and Roth (2013) for the Wikification. The Wikification process is performed on the input passage: we map each word of the passage to its corresponding Wikipedia title to generate a sequence of Wikipedia titles, one per word. For multi-word mentions, we assign the same Wikipedia title to each word of the mention. In order to project words and entities into the same vector space, we use pre-trained word-entity vector embeddings learned jointly with the method proposed by Yamada et al. (2016).
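A minimal sketch of the word-to-title mapping described above, assuming the entity linker returns mentions as (start, end, title) token spans. The mention list, the placeholder for unlinked words and the function name are hypothetical and hand-written for illustration; they are not the output of the Wikifier used in the paper.

```python
from typing import List, Tuple

# Hypothetical placeholder for tokens that are not part of any linked mention.
NO_ENTITY = "<no_entity>"


def titles_per_token(tokens: List[str],
                     mentions: List[Tuple[int, int, str]]) -> List[str]:
    """Assign a Wikipedia title to every token.

    `mentions` holds (start, end, title) spans with `end` exclusive.
    Every token of a multi-word mention receives the same title.
    """
    titles = [NO_ENTITY] * len(tokens)
    for start, end, title in mentions:
        for i in range(start, end):
            titles[i] = title
    return titles


tokens = ["Kevin", "Hart", "and", "Dwayne", "Johnson",
          "co-star", "in", "Central", "Intelligence"]
# Hand-written linker output, for illustration only.
mentions = [(0, 2, "Kevin_Hart"),
            (3, 5, "Dwayne_Johnson"),
            (7, 9, "Central_Intelligence")]
titles = titles_per_token(tokens, mentions)
```

The resulting sequence has the same length as the passage, so it can be embedded and aligned with the word sequence position by position.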

3.1.2 Fine-grained Entity Types (FGET)

Fine-grained entity typing consists of assigning types from a hierarchy to entity mentions in text. Similar to the approach in Xu and Barbosa (2018), we build a classification model to classify each entity mention predicted by the entity linker (Section 3.1.1) into one of the predefined fine-grained entity types Ling and Weld (2012). The inputs to the network are the target entity mention and a sub-sequence of the passage that serves as the context sentence for that mention. Using the FGET classification approach discussed in Xu and Barbosa (2018), we obtain the representation $R$ of the passage sentence. Thereafter, a softmax layer is employed to obtain the probability distribution over the set of fine-grained entity types. Concretely,

$$\operatorname{prob}(y \mid R) = \operatorname{softmax}(W R + b)$$

where the weight matrix $W$ is treated as the learned type embedding and $b$ is the bias.
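The classification step can be sketched as a softmax over an affine map of the sentence representation. The code below is a toy illustration: the three type labels, the weight values and the two-dimensional representation are made up, and the real model of Xu and Barbosa (2018) computes the representation with a neural encoder that is not reproduced here.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def type_distribution(W: List[List[float]], R: List[float],
                      b: List[float]) -> List[float]:
    """Compute softmax(W R + b): one probability per entity type."""
    scores = [sum(w * r for w, r in zip(row, R)) + bias
              for row, bias in zip(W, b)]
    return softmax(scores)


# Toy label set and parameters, chosen by hand for illustration.
TYPES = ["/person/actor", "/art/film", "/organization"]
W = [[1.0, 0.0],
     [0.0, 2.0],
     [0.5, 0.5]]
b = [0.0, 0.0, 0.0]
R = [0.2, 1.5]  # stand-in for the learned sentence representation

probs = type_distribution(W, R, b)
predicted_type = TYPES[probs.index(max(probs))]
```

Each row of `W` plays the role of a type embedding: the predicted type is the one whose row scores highest against the representation.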

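Similar to the process we use for the linked entities, we map the passage words to their corresponding fine-grained entity types to get a sequence of FGET labels, one per word. The final embedding $x_t$ of the word at position $t$ of the passage is computed as:

$$x_t = [a_t \oplus w_t \oplus e_t \oplus f_t]$$

where $a_t$, $w_t$, $e_t$ and $f_t$ are the embeddings of the answer position, word, linked entity, and fine-grained entity type of the $t$-th token of the passage, and $\oplus$ denotes concatenation. The final embedding sequence $(x_1, \dots, x_n)$ of the $n$-word passage is passed to a Bi-LSTM encoder to produce two sequences of hidden vectors, the forward sequence $(\overrightarrow{h}_1, \dots, \overrightarrow{h}_n)$ and the backward sequence $(\overleftarrow{h}_1, \dots, \overleftarrow{h}_n)$. Lastly, the output sequence of the encoder is the concatenation of the two sequences, $h_t = [\overrightarrow{h}_t \oplus \overleftarrow{h}_t]$.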
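The concatenation of the four per-token embeddings can be sketched with plain lists. The lookup tables and their tiny dimensions below are invented for illustration; in the paper the word and entity vectors come from pre-trained embeddings and have far larger dimensions.

```python
from typing import Dict, List

Table = Dict[str, List[float]]


def embed_token(answer_tag: str, word: str, entity: str, fget: str,
                tables: Dict[str, Table]) -> List[float]:
    """Concatenate answer-position, word, entity and FGET embeddings."""
    return (tables["answer"][answer_tag]
            + tables["word"][word]
            + tables["entity"][entity]
            + tables["fget"][fget])


# Toy embedding tables (hypothetical values and sizes).
tables = {
    "answer": {"O": [0.0], "B": [1.0], "I": [0.5]},
    "word": {"central": [0.1, 0.2], "intelligence": [0.3, 0.4]},
    "entity": {"Central_Intelligence": [0.7, 0.8]},
    "fget": {"/art/film": [0.9]},
}

x_t = embed_token("B", "central", "Central_Intelligence", "/art/film", tables)
```

The length of the result is the sum of the four embedding sizes, which is the input size the encoder has to accept.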

3.2 Decoding with Attention

We use a two-layer LSTM for the decoder. Words are generated sequentially, conditioned on the encoder output and the previous decoder step. Formally, at decoding time step $t$, the LSTM decoder reads the previous word embedding and a context vector to compute the new hidden state. The context vector at time step $t$ is computed using the attention mechanism in Luong et al. (2015), which matches the current decoder state with each encoder hidden state to get a relevance score. A one-layer feed-forward network takes the decoder state and the context vector and predicts the probability distribution over the decoder vocabulary. Similar to Sun et al. (2018); Song et al. (2018), we also use the copy mechanism from Gulcehre et al. (2016) to deal with rare and unknown words.
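To make the attention step concrete, here is a minimal sketch that scores each encoder state against the decoder state and averages the encoder states with the resulting weights. It assumes the dot-product scoring variant described by Luong et al. (2015); the text does not say which scoring function the authors used, and the vectors are toy values.

```python
import math
from typing import List, Tuple


def attention_context(decoder_state: List[float],
                      encoder_states: List[List[float]]
                      ) -> Tuple[List[float], List[float]]:
    """Return (attention weights, context vector) using dot-product scores."""
    # Relevance score of every encoder state with respect to the decoder state.
    scores = [sum(d * h for d, h in zip(decoder_state, h_i))
              for h_i in encoder_states]
    # Softmax normalisation of the scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum of the encoder states.
    dim = len(encoder_states[0])
    context = [sum(w * h_i[k] for w, h_i in zip(weights, encoder_states))
               for k in range(dim)]
    return weights, context


encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, context = attention_context(decoder_state, encoder_states)
```

Encoder positions that align with the decoder state receive more weight, so the context vector is dominated by the passage tokens most relevant to the word being generated.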

4 Experimental Results and Analysis

We evaluated the performance of our approach on SQuAD Rajpurkar et al. (2016) and MS MARCO v2.1 Nguyen et al. (2016). SQuAD is composed of more than 100K questions posed by crowd workers on Wikipedia articles. We used the same split as Zhou et al. (2017). The MS MARCO dataset contains about one million queries with corresponding answers and passages. All questions are sampled from real anonymized user queries, and context passages are extracted from real web documents. We picked a subset of the MS MARCO data in which the answers are sub-spans within the passages, used the dev set as the test set, and split the train set with a 90%-10% ratio into train and dev sets.
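The MS MARCO preparation described above (keep examples whose answer is a sub-span of the passage, then split 90%-10%) might look like the following sketch. The lowercased whitespace tokenization, the dictionary field names, the fixed random seed and the function names are all assumptions made for illustration; length limits on answers and passages are not modelled.

```python
import random
from typing import Dict, List, Tuple


def is_sub_span(passage: List[str], answer: List[str]) -> bool:
    """True if `answer` occurs as a contiguous token span in `passage`."""
    m = len(answer)
    if m == 0:
        return False
    return any(passage[i:i + m] == answer
               for i in range(len(passage) - m + 1))


def filter_and_split(examples: List[Dict[str, str]],
                     dev_ratio: float = 0.1,
                     seed: int = 13
                     ) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Keep span-answer examples and split them into (train, dev)."""
    kept = [ex for ex in examples
            if is_sub_span(ex["passage"].lower().split(),
                           ex["answer"].lower().split())]
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(kept)
    n_dev = round(len(kept) * dev_ratio)
    return kept[n_dev:], kept[:n_dev]


examples = (
    [{"passage": f"item {i} is a film", "answer": "film"} for i in range(20)]
    + [{"passage": f"item {i} is a film", "answer": "organization"}
       for i in range(5)]
)
train, dev = filter_and_split(examples)
```

Here the five examples whose answer never appears in the passage are dropped before the split.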

4.1 Experimental Settings

We optimized the network hyper-parameters for both datasets on their respective development sets. The LSTM cell hidden size was 512 for both datasets. We used the vectors jointly trained for words and Wikipedia entities from Yamada et al. (2016) as the pre-trained word and entity embeddings. The answer tagging and entity type tagging embeddings were set to the same dimension. The model was optimized via gradient descent using the Adam Kingma and Ba (2014) optimiser with mini-batch size 64. We selected the models with the highest BLEU-4 Papineni et al. (2002) scores as our final models. At inference time, we used beam search (with the beam size also optimized on the dev set) for the SQuAD dataset, while greedy search was adopted for the MS MARCO dataset, as it performed close to the best beam search result. For both datasets, we restrict the target vocabulary to the most frequent words. We evaluate question generation performance in terms of BLEU Papineni et al. (2002), METEOR Banerjee and Lavie (2005) and ROUGE-L Lin (2004), using the evaluation package released by Sharma et al. (2017).
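The difference between greedy search and beam search can be illustrated with a toy next-token model. Everything below is hypothetical: `toy_model` is a hand-written probability table rather than a trained decoder, and the beam search is a simplified sketch without length normalisation.

```python
import math
from typing import Callable, Dict, List, Tuple

Hypothesis = Tuple[List[str], float]  # (tokens so far, log-probability)


def beam_search(step_fn: Callable[[List[str]], Dict[str, float]],
                beam_size: int, max_len: int, eos: str = "</s>") -> List[str]:
    """Return the highest-scoring sequence found with the given beam size.

    `step_fn(prefix)` returns next-token log-probabilities.
    A beam size of 1 is greedy search.
    """
    beams: List[Hypothesis] = [([], 0.0)]
    finished: List[Hypothesis] = []
    for _ in range(max_len):
        candidates: List[Hypothesis] = []
        for prefix, score in beams:
            for token, logp in step_fn(prefix).items():
                candidates.append((prefix + [token], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            (finished if seq[-1] == eos else beams).append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # keep unfinished hypotheses as a fallback
    return max(finished, key=lambda c: c[1])[0]


def toy_model(prefix: List[str]) -> Dict[str, float]:
    """Hand-written next-token distribution (illustration only)."""
    if not prefix:
        probs = {"what": 0.6, "which": 0.4}
    elif prefix == ["what"]:
        probs = {"organization": 0.55, "film": 0.45}
    elif prefix == ["which"]:
        probs = {"film": 0.9, "organization": 0.1}
    else:
        probs = {"</s>": 1.0}
    return {tok: math.log(p) for tok, p in probs.items()}


greedy = beam_search(toy_model, beam_size=1, max_len=5)
beam2 = beam_search(toy_model, beam_size=2, max_len=5)
```

In this toy table greedy search commits to the locally best first token and ends with a sequence of probability 0.33, while a beam of two keeps the alternative prefix alive and finds a sequence of probability 0.36.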

4.2 Experimental Results

| Model | SQuAD BLEU-4 | SQuAD METEOR | SQuAD ROUGE-L | MS MARCO BLEU-4 | MS MARCO METEOR | MS MARCO ROUGE-L |
|---|---|---|---|---|---|---|
| s2s + Att | 7.53 | 13.38 | 33.98 | 8.86 | 13.98 | 34.57 |
| NQG | 12.54 | 17.67 | 41.74 | 11.73 | 18.06 | 37.64 |
| NQG + EL | 11.78 | 17.41 | 40.02 | 11.52 | 18.34 | 37.56 |
| NQG + EL (pre) | 13.28 | 18.03 | 41.89 | 12.95 | 20.07 | 39.76 |
| NQG + FGET | 13.91 | 18.51 | 42.53 | 12.01 | 18.82 | 37.88 |
| NQG + FGET (pre) | 13.91 | 18.48 | 42.46 | 12.95 | 20.07 | 39.76 |
| NQG + EL (best) + FGET (best) | 13.69 | 18.50 | 42.13 | 13.32 | 20.47 | 40.05 |
| NQG + NER | 13.22 | 18.54 | 41.36 | 12.18 | 18.46 | 38.04 |
| NQG + NER + FGET | 12.73 | 18.51 | 40.39 | 12.11 | 18.52 | 37.69 |
| NQG + NER + FGET (pre) | 13.44 | 19.14 | 41.27 | 12.00 | 18.56 | 37.70 |
| Zhou et al. (2017) | | - | - | - | - | - |
| Song et al. (2018) | 13.91 | - | - | - | - | - |

Table 2: Performance comparison of the proposed model on the test sets of both datasets. The term ‘best’ refers to the best performance on the development set.

| Model (SQuAD dev) | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L |
|---|---|---|---|---|---|---|
| s2s + Att | 30.16 | 16.80 | 11.10 | 07.77 | 13.48 | 33.96 |
| NQG | 37.72 | 23.89 | 16.93 | 12.46 | 17.78 | 42.02 |
| NQG + EL | 38.57 | 23.85 | 16.65 | 12.12 | 17.47 | 40.24 |
| NQG + EL (pre) | 38.81 | 24.89 | 17.88 | 13.40 | 18.08 | 41.90 |
| NQG + FGET | 39.64 | 25.72 | 18.64 | 14.05 | 18.61 | 42.88 |
| NQG + FGET (pre) | 40.21 | 26.17 | 19.03 | 14.41 | 18.72 | 42.90 |
| NQG + EL (pre) + FGET (pre) | 39.82 | 25.63 | 18.41 | 13.79 | 18.57 | 42.40 |
| NQG + NER | 42.05 | 26.16 | 18.05 | 12.96 | 18.51 | 41.37 |
| NQG + NER + FGET | 42.49 | 25.85 | 17.67 | 12.59 | 18.50 | 40.34 |
| NQG + NER + FGET (pre) | 42.78 | 26.68 | 18.48 | 13.30 | 19.15 | 41.22 |

Table 3: Performance comparison of the proposed model with the other baselines and state-of-the-art models on the development set of the SQuAD dataset.

| Model (MS MARCO dev) | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L |
|---|---|---|---|---|---|---|
| s2s + Att | 31.67 | 16.25 | 11.63 | 8.41 | 14.01 | 33.54 |
| NQG | 39.41 | 26.38 | 17.81 | 12.06 | 18.82 | 38.19 |
| NQG + EL | 40.84 | 27.92 | 19.27 | 13.51 | 20.34 | 39.79 |
| NQG + EL (pre) | 42.01 | 28.66 | 19.81 | 13.93 | 20.96 | 40.60 |
| NQG + FGET | 39.97 | 27.05 | 18.47 | 12.64 | 19.59 | 38.54 |
| NQG + FGET (pre) | 39.83 | 26.84 | 18.22 | 12.33 | 19.15 | 38.40 |
| NQG + EL (pre) + FGET (pre) | 41.92 | 28.57 | 19.72 | 13.82 | 20.84 | 40.62 |
| NQG + NER | 39.99 | 27.00 | 18.38 | 12.56 | 19.22 | 38.73 |
| NQG + NER + FGET | 39.97 | 27.05 | 18.47 | 12.60 | 19.37 | 38.65 |
| NQG + NER + FGET (pre) | 40.05 | 27.09 | 18.46 | 12.60 | 19.57 | 38.68 |

Table 4: Performance comparison of the proposed model with the other baselines and state-of-the-art models on the development set of the MS MARCO dataset.

We conducted a series of experiments as follows:
(1) s2s+Att: Baseline encoder-decoder based seq2seq network with attention mechanism.
(2) NQG: Extension of s2s+Att with the answer position feature.
(3) NQG + EL: Extension of NQG with the entity linking feature (500 dimensions) discussed in Section 3.1.1.
(4) NQG + EL (pre): NQG + Entity Linking with the pre-trained entity linking feature obtained from the joint training of words and Wikipedia entities using Yamada et al. (2016).
(5) NQG + FGET: Extension of NQG with the fine-grained entity type (FGET) feature (100 dimensions) discussed in Section 3.1.2.
(6) NQG + FGET (pre): NQG + FGET with the pre-trained FGET features as discussed in Section 3.1.2.
(7) NQG + EL (pre) + FGET (pre): Combination of NQG, Entity Linking and FGET with pre-trained entity linking and FGET features.
In order to compare our models with the existing coarse-grained entity features (NER) used in the literature Zhou et al. (2017); Harrison and Walker (2018), we also report the following experiments.
(a) NQG + NER: NQG with the coarse-grained named entity recognition feature. We use the Stanford NER Finkel et al. (2005) to tag the entity feature.
(b) NQG + NER + FGET: NQG with NER (100 dimensions) and FGET features.
(c) NQG + NER + FGET (pre): NQG with NER (100 dimensions) and pre-trained FGET features.
We report the results on the test sets of the datasets in Table 2. The results on the dev sets are shown in Tables 3 and 4.
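The absolute BLEU-4 gains quoted in the abstract can be re-derived from the test-set scores in Table 2. The short check below copies the relevant numbers by hand: the NQG baseline, the best SQuAD score (13.91, reached by NQG + FGET and NQG + FGET (pre)) and the best MS MARCO score (13.32, reached by NQG + EL (best) + FGET (best)). The dictionary layout is just a convenience for this sketch.

```python
# BLEU-4 test-set scores copied from Table 2.
bleu4_test = {
    "SQuAD": {"NQG": 12.54, "best": 13.91},
    "MS MARCO": {"NQG": 11.73, "best": 13.32},
}

# Absolute gain of the best world-knowledge model over the NQG baseline.
gains = {
    dataset: round(scores["best"] - scores["NQG"], 2)
    for dataset, scores in bleu4_test.items()
}
```

Rounded to two decimals, the differences match the 1.37 and 1.59 points reported for SQuAD and MS MARCO.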

4.3 Discussion and Analysis

Table 2 shows that the proposed fine-grained world-knowledge features improve the performance of the models over the baseline, and that the coarse-grained entity (NER) features appear less useful than the entity linking features on both datasets. We analyzed the effect of each world-knowledge feature on both datasets. Our findings are as follows:

Passage: though there is no official definition for the northern boundary of southern california , such a division has existed from the time when mexico ruled california , and political disputes raged between the californios of monterey in the upper part and los angeles in the lower part of alta california .
Target Answer: mexico
Reference: which country used to rule california ?
NQG: who ruled california?
NQG+EL (best)+ FGET (best): which country california ?
Passage: for example , a significant number of students all over wales are educated either wholly or largely through the medium of welsh : in 2008/09 , 22 per cent of classes in maintained primary schools used welsh as the sole or main medium of instruction .
Target Answer: welsh
Reference: what language is used to educate in wales ?
NQG: a significant number of students over wales are educated through what medium ?
NQG+EL (best)+ FGET (best): what language has a significant number of students in wales ?
Table 5: Sample questions generated by the baseline and the proposed model.
Entity Linking:

On both datasets, the pre-trained entity linking features were more effective than randomly initialized features fine-tuned during training. We believe this is because the word and its corresponding entity are jointly trained and projected into the same vector space. We observe that the entity linking features are less effective on SQuAD than on MS MARCO.


Fine-grained Entity Types:

Similar to the linker-based features, the pre-trained FGET features trained on the FIGER dataset Ling and Weld (2012) are more effective than the randomly initialized vectors. The FGET feature is more effective at improving the QG model on SQuAD. We believe this is likely because both the SQuAD and FIGER datasets were derived from Wikipedia. In contrast, MS MARCO was derived from Bing user queries and web passages, which are entirely different in nature. It should also be noted that the FGET features were derived using entities detected by the entity linker. To evaluate the effect of using the linker as an entity detector, we also performed an experiment in which we used entities detected by the NER. We found that the models that use the entities detected with the linker perform better on every evaluation metric on both datasets. Samples of generated questions are given in Table 5.

5 Conclusion and Future Work

We proposed that features based on general world knowledge can improve the performance of question generation. Our results on SQuAD and MS MARCO show that entity-based world knowledge is effective at improving question generation according to automated metrics. In order to fully explore the performance gains of these features, human evaluation is required, and we leave this for future work. We would also like to explore other sources of world knowledge beyond entity-based information. In particular, we believe that information based on the relationships between the entities present in the passage would also be useful.


References

  • D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural Machine Translation by Jointly Learning to Align and Translate. arXiv preprint arXiv:1409.0473.
  • S. Banerjee and A. Lavie (2005) METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65–72.
  • X. Cheng and D. Roth (2013) Relational inference for wikification. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1787–1796.
  • X. Du, J. Shao, and C. Cardie (2017) Learning to ask: neural question generation for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1342–1352.
  • J. R. Finkel, T. Grenager, and C. Manning (2005) Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pp. 363–370.
  • Y. Gao, J. Wang, L. Bing, I. King, and M. R. Lyu (2018) Difficulty controllable question generation for reading comprehension. arXiv preprint arXiv:1807.03586.
  • C. Gulcehre, S. Ahn, R. Nallapati, B. Zhou, and Y. Bengio (2016) Pointing the unknown words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 140–149.
  • V. Harrison and M. Walker (2018) Neural generation of diverse questions using answer focus, contextual and linguistic features. arXiv preprint arXiv:1809.02637.
  • S. Hochreiter and J. Schmidhuber (1997) Long Short-Term Memory. Neural Computation 9 (8), pp. 1735–1780.
  • Y. Kim, H. Lee, J. Shin, and K. Jung (2018) Improving neural question generation using answer separation. arXiv preprint arXiv:1809.02393.
  • D. P. Kingma and J. Ba (2014) Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980.
  • A. Kumar, O. Irsoy, P. Ondruska, M. Iyyer, J. Bradbury, I. Gulrajani, V. Zhong, R. Paulus, and R. Socher (2016) Ask Me Anything: Dynamic Memory Networks for Natural Language Processing. In International Conference on Machine Learning, pp. 1378–1387.
  • C. Lin (2004) ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, Vol. 8.
  • X. Ling and D. S. Weld (2012) Fine-grained entity recognition. In AAAI, Vol. 12, pp. 94–100.
  • T. Luong, H. Pham, and C. D. Manning (2015) Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1412–1421.
  • T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, and L. Deng (2016) MS MARCO: a human generated machine reading comprehension dataset.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 311–318.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2383–2392.
  • A. M. Rush, S. Chopra, and J. Weston (2015) A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685.
  • A. See, P. J. Liu, and C. D. Manning (2017) Get to the point: summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1073–1083.
  • S. Sharma, L. El Asri, H. Schulz, and J. Zumer (2017) Relevance of unsupervised metrics in task-oriented dialogue for evaluating natural language generation. CoRR abs/1706.09799.
  • L. Song, Z. Wang, W. Hamza, Y. Zhang, and D. Gildea (2018) Leveraging context information for natural question generation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 569–574.
  • N. Srivastava, E. Mansimov, and R. Salakhutdinov (2015) Unsupervised learning of video representations using LSTMs. In International Conference on Machine Learning, pp. 843–852.
  • X. Sun, J. Liu, Y. Lyu, W. He, Y. Ma, and S. Wang (2018) Answer-focused and position-aware neural question generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3930–3939.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems, pp. 3104–3112.
  • A. Trischler, T. Wang, X. Yuan, J. Harris, A. Sordoni, P. Bachman, and K. Suleman (2016) NewsQA: A Machine Comprehension Dataset. arXiv preprint arXiv:1611.09830.
  • Z. Wang, H. Mi, W. Hamza, and R. Florian (2016) Multi-perspective context matching for machine comprehension. arXiv preprint arXiv:1612.04211.
  • K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In International Conference on Machine Learning, pp. 2048–2057.
  • P. Xu and D. Barbosa (2018) Neural fine-grained entity type classification with hierarchy-aware loss. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 16–25.
  • I. Yamada, H. Shindo, H. Takeda, and Y. Takefuji (2016) Joint learning of the embedding of words and entities for named entity disambiguation. arXiv preprint arXiv:1601.01343.
  • Y. Zhao, X. Ni, Y. Ding, and Q. Ke (2018) Paragraph-level neural question generation with maxout pointer and gated self-attention networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3901–3910.
  • Q. Zhou, N. Yang, F. Wei, C. Tan, H. Bao, and M. Zhou (2017) Neural question generation from text: a preliminary study. In National CCF Conference on Natural Language Processing and Chinese Computing, pp. 662–671.