Testing with open-ended quiz questions can help both learning and retention, e.g., it could be used for self-study or as a way to detect knowledge gaps in a classroom setting, thus allowing instructors to adapt their teaching (Roediger III et al., 2011).
As creating such quiz questions is a tedious job, automatic methods have been proposed. The task is often formulated as answer-aware question generation (Heilman and Smith, 2010; Zhang et al., 2014; Du et al., 2017; Du and Cardie, 2018; Sun et al., 2018; Dong et al., 2019; Bao et al., 2020; CH and Saha, 2020): given an input text and a target answer, generate a corresponding question.
Many researchers have used the Stanford Question Answering Dataset (SQuAD1.1) (Rajpurkar et al., 2016) as a source of training and testing data for answer-aware question generation. It contains human-generated questions and answers about Wikipedia articles, as shown in Figure 1.
However, this formulation requires that answers be picked beforehand, which may not be practical in real-world situations. Here we aim to address this limitation by proposing a method for generating answers, which can in turn serve as an input to answer-aware question generation models. Our model combines orthographic, lexical, syntactic, and semantic information, and shows promising results. It further allows the user to specify the number of answers to propose. Our contributions can be summarized as follows:
We propose a new task: generate answer candidates that can serve as an input to answer-aware question generation models.
We create a dataset for this new task.
We propose a suitable model for the task, which combines orthographic, lexical, syntactic, and semantic information, and can generate a pre-specified number of answers.
We demonstrate improvements over simple approaches based on named entities, and competitiveness over complex neural models.
2 Related Work
The success of large-scale pre-trained Transformers such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), ALBERT (Lan et al., 2020), and generative ones such as T5 (Raffel et al., 2020) or BART (Lewis et al., 2020), has led to the rise in popularity of the Question Generation task. Models such as BERT (Devlin et al., 2019), T5 (Raffel et al., 2020) and PEGASUS (Zhang et al., 2020) have been used to generate questions for the SQuAD1.1 dataset and have been commonly evaluated (Du et al., 2017) using BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), and METEOR (Lavie and Agarwal, 2007). Strong models for this task include NQG++ (Zhou et al., 2017), ProphetNet (Qi et al., 2020), MPQG (Song et al., 2018), UniLM (Dong et al., 2019), UniLMv2 (Bao et al., 2020), and ERNIE-GEN (Xiao et al., 2020).
All these models were trained for answer-aware question generation, which takes the answer and the textual context as an input and outputs a question for that answer. In contrast, our task formulation takes a textual context as an input and generates possible answers; in turn, these answers can be used as an input to the above answer-aware question generation models.
The Quiz-Style Question Generation for News Stories task (Lelkes et al., 2021) uses a formulation that asks to generate a single question as well as the corresponding answer, which is to be extracted from the given context.
Follow-up research has tried to avoid the limitation of generating a single question or a single question–answer pair by generating a question for each sentence in the input context or by using all named entities in the context as answer keys (Montgomerie, 2020).
Finally, there has been a proliferation of educational datasets in recent years (Zeng et al., 2020; Dzendzik et al., 2021; Rogers et al., 2021), including crowdsourced science questions (Welbl et al., 2017), ARC (Clark et al., 2018), OpenBookQA (Mihaylov et al., 2018), multiple-choice exams in Bulgarian (Hardalov et al., 2019) and Vietnamese (Nguyen et al., 2020), and EXAMS, which covers 16 different languages (Hardalov et al., 2020). Yet, these datasets are not directly applicable to our task, as their questions do not expect the answers to be exact matches from the textual context. While there are also span-based extraction datasets such as NewsQA (Trischler et al., 2017), SearchQA (Dunn et al., 2017), and Natural Questions (Kwiatkowski et al., 2019), they contain a mix of long and short spans rather than factoid answers. Thus, we opted to use SQuAD1.1 in our experiments, but focusing on generating answers rather than questions.
Given an input textual context, we first extract phrases from it, then we calculate a representation for each phrase, and finally, based on these representations, we predict which phrases are suitable as answers to quiz questions.
To train our classifier, we need a labeled dataset of key phrases. In particular, we use SQuAD1.1, which consists of more than 100,000 questions created by humans from Wikipedia articles, and has been extensively used for question answering. An example is shown in Figure 1. We use version 1.1 of the dataset instead of 2.0 (Rajpurkar et al., 2018) because it contains the exact position of the answers in the text, which allows us to easily match them against the candidate phrases. Version 2.0 only adds examples whose answer is not present in the context.
We created a dataset for our task using 87,600 questions from the SQuAD1.1 training set and their associated textual contexts. As only 33% of the answers consist of a single word, it is important to also extract multi-word phrases. Thus, we added all named entities, which have variable word length. We further included all noun chunks, which we extended by combining two or more noun chunks when the only words between them were connectors such as and, of, and or. Here is an example of a complex chunk with three pieces: a Marian place of prayer and reflection. We considered as positive examples the phrases for which a question was asked in the SQuAD1.1 dataset, and as negative examples the additional phrases we created.
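The chunk-combination step above can be sketched as follows. The token-span representation and the `merge_chunks` helper are our illustration (the implementation is not specified here); in practice, the noun chunks themselves would come from a syntactic parser.

```python
# Merge adjacent noun chunks when only connector words lie between them.
# The (start, end) half-open token spans are illustrative; a parser such
# as spaCy would typically supply the chunks.

CONNECTORS = {"and", "of", "or"}

def merge_chunks(tokens, chunks):
    """tokens: list of words; chunks: sorted list of (start, end) spans.
    Fuses neighbouring chunks separated only by connector words."""
    merged = []
    for start, end in chunks:
        if merged:
            prev_start, prev_end = merged[-1]
            between = tokens[prev_end:start]
            if between and all(w.lower() in CONNECTORS for w in between):
                merged[-1] = (prev_start, end)  # fuse with previous chunk
                continue
        merged.append((start, end))
    return merged

tokens = "a Marian place of prayer and reflection".split()
# noun chunks: "a Marian place", "prayer", "reflection"
spans = merge_chunks(tokens, [(0, 3), (4, 5), (6, 7)])
print([" ".join(tokens[s:e]) for s, e in spans])
# -> ['a Marian place of prayer and reflection']
```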
We extracted the following features, adapted for the use of phrases containing multiple words:
TFIDFArticle, TFIDFParagraph: The average TF.IDF score for all words in the key phrase, where the Inverse Document Frequency (IDF) is computed from the words in all paragraphs of the article (TFIDFArticle) or only from the paragraph of the given key phrase (TFIDFParagraph).
POS, TAG, DEP: The coarse-grained part-of-speech tag (POS), the fine-grained part-of-speech tag (TAG), and the syntactic dependency relation (DEP). If the phrase contains multiple words, we only consider the word with the highest TF.IDF.
EntityType: The named entity type of the phrase if any.
IsAlpha: True if all characters in the phrase are alphabetic.
IsAscii: True if the phrase consists only of characters contained in the standard ASCII table.
IsDigit: True if the phrase only contains digits.
IsLower: True if all words in the phrase are in lowercase.
IsCapital: True if the first word in the phrase is in uppercase.
IsCurrency: True if some word in the phrase contains a currency symbol, e.g., $23.
LikeNum: True if some word in the phrase represents a number, e.g., 13.4, 42, twenty, etc.
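As an illustration of the TF.IDF features, here is a minimal sketch of TFIDFArticle, treating each paragraph of the article as a document; the standard tf × log(N/df) weighting is an assumption, as the exact scheme is not specified above.

```python
import math
from collections import Counter

# Sketch of TFIDFArticle: IDF over all paragraphs of the article,
# TF within the key phrase's own paragraph, averaged over the
# words of the phrase. The tf * log(N/df) weighting is assumed.

def avg_tfidf(phrase, paragraph, article_paragraphs):
    tokenized = [p.lower().split() for p in article_paragraphs]
    n_docs = len(tokenized)
    df = Counter()                      # document frequency per word
    for words in tokenized:
        df.update(set(words))
    tf = Counter(paragraph.lower().split())
    scores = []
    for w in phrase.lower().split():
        idf = math.log(n_docs / df[w]) if df[w] else 0.0
        scores.append(tf[w] * idf)
    return sum(scores) / len(scores)

print(round(avg_tfidf("cat", "the cat sat",
                      ["the cat sat", "the dog ran"]), 3))
# -> 0.693
```

TFIDFParagraph would be obtained analogously by computing the IDF only over the phrase's own paragraph.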
We convert all the above features to binary, and then we use a Bernoulli Naïve Bayes classifier, which can account both for the presence and for the absence of a feature. To achieve this, we encode categorical features (e.g., POS, TAG) using one-hot encoding, and we put continuous features (e.g., TFIDFArticle, TitleSimilarity) into five bins.
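A minimal sketch of this binarization step; the feature names, the equal-width five-bin scheme, and the explicit bin edges are illustrative assumptions.

```python
# Convert a mixed feature dict into the all-binary representation a
# Bernoulli Naive Bayes expects: boolean features pass through,
# categorical ones are one-hot encoded as name=value, and continuous
# ones are mapped to one of five bins via four inner bin edges.

def binarize(example, bin_edges):
    """example: feature name -> value; bin_edges: continuous feature
    name -> list of four inner edges defining five bins."""
    features = set()
    for name, value in example.items():
        if isinstance(value, bool):
            if value:                    # absent booleans are simply omitted
                features.add(name)
        elif isinstance(value, (int, float)) and name in bin_edges:
            bin_idx = sum(value > edge for edge in bin_edges[name])
            features.add(f"{name}=bin{bin_idx}")
        else:                            # categorical: one-hot via name=value
            features.add(f"{name}={value}")
    return features

print(sorted(binarize(
    {"POS": "PROPN", "IsAlpha": True, "IsDigit": False, "TFIDFArticle": 0.7},
    {"TFIDFArticle": [0.2, 0.4, 0.6, 0.8]},
)))
# -> ['IsAlpha', 'POS=PROPN', 'TFIDFArticle=bin3']
```

The resulting binary indicator sets can be fed directly to a Bernoulli Naïve Bayes implementation such as scikit-learn's `BernoulliNB` after vectorization.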
3.4 Evaluation Measures
As there is no established measure for evaluating key phrases for answer generation, we use and adapt the original evaluation script (http://github.com/allenai/bi-att-flow/blob/master/squad/evaluate-v1.1.py) created for the question answering task on the SQuAD1.1 dataset (Rajpurkar et al., 2016), which calculates exact match (EM) and F1 scores.
In the SQuAD1.1 dataset, there can be multiple correct versions of the answer for a question (e.g., third, third-most). Thus, the evaluation script calculates EM and F1 for each such version and then returns the highest value. As there can also be multiple question–answer pairs in a given passage, we further adapted the script to include all human-created answers: we calculated the scores against all answers in the passage and took the highest values.
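The adapted evaluation logic can be sketched as follows; `normalize`, `f1`, and `best_scores` reproduce the behavior of the official SQuAD script in simplified form.

```python
import re
import string
from collections import Counter

# Simplified SQuAD-style scoring: answers are normalized (lowercased,
# punctuation and articles removed), EM checks string equality, F1 is
# computed over shared tokens, and the maximum over all gold answers
# in the passage is kept.

def normalize(text):
    text = "".join(c for c in text.lower() if c not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def f1(prediction, gold):
    pred, gold_toks = normalize(prediction).split(), normalize(gold).split()
    common = Counter(pred) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

def best_scores(prediction, gold_answers):
    em = max(float(normalize(prediction) == normalize(g)) for g in gold_answers)
    return em, max(f1(prediction, g) for g in gold_answers)

print(best_scores("the third", ["third", "third-most"]))
# -> (1.0, 1.0)
```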
Finally, in order to allow for a more practical use of question generation algorithms, it is desirable to be able to generate multiple question–answer pairs for a given passage. To compute EM and F1 over multiple answer candidates, we adopted the following two approaches:
EM-Any and F1-Any show how likely it is that at least one of the N returned candidate answers matches a ground-truth answer, i.e., one chosen by a human annotator of SQuAD1.1. To calculate them, we compute the scores for each candidate answer and keep only the best EM and F1 scores.
Using EM-Avg and F1-Avg, we can measure what percentage of all returned candidate answers have also been marked as an answer by a human. To calculate them, we took the average of all EM and F1 scores computed for the proposed candidate answers.
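The two aggregation schemes can be summarized in a few lines; for brevity, the per-candidate score below is plain exact match without the normalization described in Section 3.4.

```python
# "-Any" keeps the best score among the N proposed candidates (did any
# candidate hit a gold answer?), while "-Avg" averages over all of them
# (what fraction of the candidates hit a gold answer?).

def em_any_avg(candidates, gold_answers):
    per_candidate = [max(float(c == g) for g in gold_answers)
                     for c in candidates]
    return max(per_candidate), sum(per_candidate) / len(per_candidate)

# Three proposed candidates, two of which were chosen by annotators:
print(em_any_avg(["third", "France", "1863"], ["third", "1863"]))
# -> (1.0, 0.6666666666666666)
```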
4 Experiments and Evaluation
We used our model to generate ten candidate answers per passage (taking the ones with the highest classifier confidence), and we compared the results to other commonly used methods.
Below, we list the baselines that we compared against:
NER: Extracting all named entities from the passage and using them as candidate answers. On average, there are 13.64 named entities per SQuAD1.1 passage.
Noun Chunks: Extracting all noun chunks from the passage and using them as candidate answers. On average, there are 33.15 noun chunks per SQuAD1.1 passage.
NE + NCh: Combining all extracted named entities and noun chunks from the passage after using the SQuAD1.1 normalization script (http://github.com/allenai/bi-att-flow/blob/master/squad/evaluate-v1.1.py#L11) to remove duplicates (e.g., the third matches third).
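A minimal sketch of this de-duplication, using a simplified version of the script's normalization (lowercasing, dropping punctuation and articles); `merge_candidates` is our illustrative helper.

```python
import string

# Merge candidates from the named-entity and noun-chunk extractors,
# treating two candidates as duplicates when they are equal after
# SQuAD-style normalization.

def normalize(text):
    text = "".join(c for c in text.lower() if c not in string.punctuation)
    return " ".join(w for w in text.split() if w not in {"a", "an", "the"})

def merge_candidates(named_entities, noun_chunks):
    seen, merged = set(), []
    for cand in list(named_entities) + list(noun_chunks):
        key = normalize(cand)
        if key and key not in seen:
            seen.add(key)
            merged.append(cand)
    return merged

print(merge_candidates(["Third Crusade"], ["the third crusade", "Richard"]))
# -> ['Third Crusade', 'Richard']
```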
We fine-tuned the small version of T5, which has 220M parameters. We trained the model to accept the passage as an input and to output the answer. We used a learning rate of 0.0001, a source token length of 300, and a target token length of 24. The best validation loss was achieved in the fourth of ten epochs.
In this section, we describe our experimental results and we compare them to the baselines described in Section 4.1 above.
[Table 1 (excerpt): NER + NCh — 35.4 candidates on average, EM-Any 95.02, F1-Any 98.48]
4.2.1 Best Over Multiple Candidates
Table 1 shows the results for EM-Any and F1-Any, i.e., how often, among the top-N candidates proposed by the model, at least one was picked by a human.
We can see that our model achieves a better EM-Any score with just eight answer candidates than using all named entities in the passage (13.6 on average). It also achieves a higher F1-Any score with just six answer candidates.
We further see that using the combination of all named entities and noun chunks yields the best score, but it produces 35 candidates on average, which is the majority of the words in the passage.
[Table 2 (excerpt): NER + NCh — 35.4 candidates on average, EM-Avg 8.97, F1-Avg 18.84]
4.2.2 Average Over Multiple Candidates
Table 2 shows the results for EM-Avg and F1-Avg, i.e., measuring what percentage of the proposed answers were also selected as an answer by a human.
Because the classifier can return a smaller number of candidate answers, it outperforms taking all named entities or all noun chunks by a sizable margin.
We further see that the average scores consistently drop with the increase of the number of answer candidates. This also explains the lower scores of the named entity and noun chunks approaches as they produce much longer lists of candidate answers.
4.2.3 Single Answer Candidate
Finally, we see in both tables that the T5 model achieves the highest average result. However, in our setup it cannot produce multiple candidates. We plan to extend it accordingly in future work.
Figure 2 shows a passage from the development split of the SQuAD1.1 dataset and the top-10 answers that our model proposed for it. We can see that these answers represent a diverse set, including named entities, noun chunks, and individual words. Indeed, this is a typical example, as our analysis across the entire development dataset shows that on average, among the top-10 candidates, our model proposes 4.82 named entities and 6.40 noun chunks.
Note also that our evaluation setup could be unfair to the model in some cases, e.g., if the model proposes a good candidate answer but one that was not chosen by the human annotators, it would receive no credit for it.
Finally, note that our model can produce top-N results for user-defined values of N, which is an advantage over simple baselines based on entities or chunks, as well as over our setup for T5.
6 Conclusion and Future Work
We proposed a new task: generate answer candidates that can serve as an input to answer-aware question generation models. We further created a dataset for this new task. Moreover, we proposed a suitable model for the task, which combines orthographic, lexical, syntactic, and semantic information, and can generate a pre-specified number of answers. Finally, we demonstrated improvements over simple approaches based on named entities, and competitiveness over complex, computationally expensive neural network models such as T5.
In future work, we plan to analyze and to improve the features. We also want to extend T5 to generate multiple candidates. We further plan to reduce the impact of false negatives, e.g., by means of manual evaluation by domain experts, and eventually by producing datasets with (potentially ranked) annotations of all suitable candidate answers.
This research is partially funded via Project UNITe by the OP “Science and Education for Smart Growth” and co-funded by the EU through the ESI Funds under GA No. BG05M2OP001-1.001-0004.
References

- UniLMv2: pseudo-masked language models for unified language model pre-training. In Proceedings of the 37th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 119, pp. 642–652.
- Automatic multiple choice question generation from text: a survey. IEEE Transactions on Learning Technologies 13 (1), pp. 14–25.
- Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv:1803.05457.
- BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT ’19, Minneapolis, Minnesota, USA, pp. 4171–4186.
- Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems 32, NeurIPS ’19, Vancouver, BC, Canada, pp. 13042–13054.
- Harvesting paragraph-level question-answer pairs from Wikipedia. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL ’18, Melbourne, Australia, pp. 1907–1917.
- Learning to ask: neural question generation for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL ’17, Vancouver, Canada, pp. 1342–1352.
- SearchQA: a new Q&A dataset augmented with context from a search engine. arXiv:1704.05179.
- English machine reading comprehension datasets: a survey. arXiv:2101.10421.
- Beyond English-only reading comprehension: experiments in zero-shot multilingual transfer for Bulgarian. In Proceedings of the International Conference on Recent Advances in Natural Language Processing, RANLP ’19, Varna, Bulgaria, pp. 447–459.
- EXAMS: a multi-subject high school examinations dataset for cross-lingual and multilingual question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP ’20, Online, pp. 5427–5444.
- Good question! Statistical ranking for question generation. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, NAACL-HLT ’10, Los Angeles, California, USA, pp. 609–617.
- Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics 7, pp. 452–466.
- ALBERT: a lite BERT for self-supervised learning of language representations. In Proceedings of the 8th International Conference on Learning Representations, ICLR ’20, Addis Ababa, Ethiopia.
- METEOR: an automatic metric for MT evaluation with high levels of correlation with human judgments. In Proceedings of the Second Workshop on Statistical Machine Translation, WMT ’07, Prague, Czech Republic, pp. 228–231.
- Quiz-style question generation for news stories. In Proceedings of the Web Conference 2021, WWW ’21, pp. 2501–2511.
- BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL ’20, pp. 7871–7880.
- ROUGE: a package for automatic evaluation of summaries. In Proceedings of the Workshop on Text Summarization Branches Out, Barcelona, Spain, pp. 74–81.
- RoBERTa: a robustly optimized BERT pretraining approach. arXiv:1907.11692.
- Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP ’18, Brussels, Belgium, pp. 2381–2391.
- Question generator. https://github.com/AMontgomerie/question_generator
- Enhancing lexical-based approach with external knowledge for Vietnamese multiple-choice machine reading comprehension. IEEE Access 8, pp. 201404–201417.
- BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, ACL ’02, Philadelphia, Pennsylvania, USA, pp. 311–318.
- ProphetNet: predicting future n-gram for sequence-to-sequence pre-training. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 2401–2410.
- Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv:1910.10683.
- Know what you don’t know: unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL ’18, Melbourne, Australia, pp. 784–789.
- SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP ’16, Austin, Texas, USA, pp. 2383–2392.
- Ten benefits of testing and their applications to educational practice. In Psychology of Learning and Motivation, Vol. 55, pp. 1–36.
- QA dataset explosion: a taxonomy of NLP resources for question answering and reading comprehension. arXiv:2107.12708.
- Leveraging context information for natural question generation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT ’18, New Orleans, Louisiana, USA, pp. 569–574.
- Answer-focused and position-aware neural question generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP ’18, Brussels, Belgium, pp. 3930–3939.
- NewsQA: a machine comprehension dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP, RepL4NLP ’17, Vancouver, Canada, pp. 191–200.
- Crowdsourcing multiple choice science questions. In Proceedings of the 3rd Workshop on Noisy User-generated Text, W-NUT ’17, Copenhagen, Denmark, pp. 94–106.
- ERNIE-GEN: an enhanced multi-flow pre-training and fine-tuning framework for natural language generation. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI ’20, pp. 3997–4003.
- A survey on machine reading comprehension: tasks, evaluation metrics and benchmark datasets. Applied Sciences 10 (21).
- PEGASUS: pre-training with extracted gap-sentences for abstractive summarization. In Proceedings of the 37th International Conference on Machine Learning, ICML ’20, pp. 11328–11339.
- Question retrieval with high quality answers in community question answering. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, CIKM ’14, Shanghai, China, pp. 371–380.
- Neural question generation from text: a preliminary study. arXiv:1704.01792.