
CONSISTENT: Open-Ended Question Generation From News Articles

by   Tuhin Chakrabarty, et al.
Columbia University

Recent work on question generation has largely focused on factoid questions such as who, what, where, when about basic facts. Generating open-ended why, how, what, etc. questions that require long-form answers has proven more difficult. To facilitate the generation of open-ended questions, we propose CONSISTENT, a new end-to-end system for generating open-ended questions that are answerable from and faithful to the input text. Using news articles as a trustworthy foundation for experimentation, we demonstrate our model's strength over several baselines using both automatic and human-based evaluations. We contribute an evaluation dataset of expert-generated open-ended questions. We discuss potential downstream applications for news media organizations.


1 Introduction

Factoid questions are relatively straightforward questions that can be answered with single words or short phrases (e.g. who, what, where, when). However, to obtain the central idea of a long piece of text, one can ask an open-ended question (e.g. why, how, what) cao-wang-2021-controllable; gao-etal-2022-makes, which can essentially be viewed as an extreme summary of the text narayan-etal-2018-dont in the form of a question. Generating such questions is particularly difficult because the generated questions must be answerable from and faithful to the given input text (see Table 1).

Answer-agnostic du-etal-2017-learning; subramanian-etal-2018-neural; scialom-staiano-2020-ask or answer-aware lewis2021paq; song-etal-2018-leveraging; zhao-etal-2018-paragraph; li-etal-2019-improving-question question generation has gained focus in NLP, but these approaches are usually trained by re-purposing question answering datasets that are factual in nature, or with trivia-like factoid QA pair datasets where answers are entities or short phrases.

At the current rate of COVID-19 vaccination, experts say, it will take months to change the virus’s trajectory. In the short term, they worry that the vaccine could present new risks if newly immunized people start socializing without taking precautions. It is not yet clear if the vaccine protects against asymptomatic infection, so vaccinated people may still be able to spread the virus to others.

Why are people so worried about the COVID-19 virus?

Why is the current rate of vaccination for COVID-19 so worrisome?

Table 1: Example of open-ended questions requiring long-form answers, generated by fine-tuning a Seq2Seq model BART lewis-etal-2020-bart and by adding explicit control with salient n-grams

Prior work on long-form question answering (LFQA) kwiatkowski2019natural; fan-etal-2019-eli5 focuses on generating answers to open-ended questions that require explanations. We argue that these benchmarks can also be useful for generating diverse, human-like open-ended questions requiring long-form answers.

Figure 1: Human-written question-answer pairs as seen on a FAQ news tool about Covid-19 vaccination

While question generation often helps in data augmentation for training models lewis2021paq; pan2020unsupervised, it can also help in possible downstream consumer applications (Section 7). Leading news organizations often rely on human-written QA-pairs for frequently asked questions (FAQ) news tools (Figure 1) or as representative headlines for news articles used in article recommendation panels. As seen in Figure 1, a news article about the likelihood of breakthrough infections after Covid-19 vaccination can be summarized in the form of representative question-answer pairs.

We propose a novel end-to-end system, CONSISTENT, for generating open-ended questions that are answerable from and faithful to the input document. We fine-tune a state-of-the-art pre-trained seq2seq model lewis-etal-2020-bart to generate open-ended questions conditioned on an input paragraph. We further propose methods to ensure better controllability and faithfulness for our generated questions by steering them towards salient keywords in the paragraph, which act as “control codes” keskar2019ctrl. Well-formed generated questions can still be unanswerable. The filtering methods used in prior work lewis2021paq to ensure consistency are not applicable to our task, owing to the increased answer length. Thus, we first rely on confidence scores obtained from pre-trained question answering models to filter out simple inconsistent questions. We further evaluate answerability by designing human-readable prompts to elicit answerability judgements from the T0pp model sanh2021multitask, which has shown good zero-shot performance on several NLP benchmarks.

We release an evaluation dataset of 529 paragraphs across diverse domains along with human-written open-ended questions. Empirical evaluation using automatic metrics demonstrates that our model is better than 5 baselines. Finally, expert evaluation of the top two performing systems shows that our model is capable of generating high-quality, answerable open-ended questions spanning diverse news topics (3.5 times better than a competitive baseline: a (lewis-etal-2020-bart, BART) model fine-tuned on an existing inquisitive question-answer dataset, ELI5 (fan-etal-2019-eli5, Explain Like I’m Five)). Our novel evaluation dataset, code and models are made publicly available.

2 Related Work

Question generation can primarily be answer-aware or answer-agnostic. Prior work on Answer-agnostic Question Generation du-etal-2017-learning; subramanian-etal-2018-neural; nakanishi-etal-2019-towards; Wang_Wei_Fan_Liu_Huang_2019; scialom-etal-2019-self focuses on training models that can extract phrases or sentences that are question-worthy and use this information to generate better questions. scialom-staiano-2020-ask paired questions with other sentences in the article that do not contain the answers to generate curiosity-driven questions. However, these approaches are trained by repurposing QA datasets that are factual rajpurkar-etal-2016-squad or conversational 10.1162/tacl_a_00266; choi-etal-2018-quac. cao-wang-2021-controllable focus on generating open-ended questions from input consisting of multiple sentences based on a question type ontology. Most recently, ko-etal-2020-inquisitive built question generation models by fine-tuning generative language models on 19K crowd-sourced inquisitive questions from news articles. These questions are elicited from readers as they naturally read through a document sentence by sentence, and are not required to be answerable from the given context or document.

Answer-Aware question generation models lewis2021paq; song-etal-2018-leveraging; zhao-etal-2018-paragraph; li-etal-2019-improving-question typically encode a passage P and an answer A, letting the decoder generate a question Q auto-regressively. These methods work well in practice and have been shown to improve downstream QA performance. However, despite their efficacy, these methods emphasize simple factoid questions whose answers are based on short and straightforward spans. Previous work on generating clarification questions rao-daume-iii-2019-answer; rao-daume-iii-2018-learning; majumder-etal-2021-ask uses questions crawled from forums and product reviews. The answers to the questions were used in the models to improve the utility of the generated questions.

Our work is different from prior work in that we focus on generating open-ended questions, which require long-form answers, from news articles. Unlike answer-aware question generation, where models ask a factoid question conditioned on an answer span, our task is challenging as it requires comprehension of the larger context as well as the ability to compress and represent the salient idea of the passage in the form of a question.

3 Data

Training Data

It’s springtime of the pandemic. After the trauma of the last year, the quarantined are emerging into sunlight, and beginning to navigate travel, classrooms and restaurants. And they are discovering that when it comes to returning to the old ways, many feel out of sorts. Do they shake hands? Hug? With or without a mask?

How are people adapting to life after the pandemic?

Table 2: Examples of our evaluation data containing paragraphs from news articles with human-written questions. More in Table 9 in Appendix A

Most prior work has successfully trained models for question generation using SQUAD rajpurkar-etal-2016-squad, TriviaQA joshi-etal-2017-triviaqa, or NQ 47761 datasets, the answers to which are typically short.

To account for the open-ended nature of our desired questions, we rely on the ELI5 (fan-etal-2019-eli5, Explain Like I’m Five) dataset. The dataset comprises 270K English-language threads from the Reddit forum of the same name, written in simple language, i.e. easily comprehensible to someone with minimal background knowledge.

Compared to existing datasets, ELI5 comprises diverse questions requiring long-form answers. It contains a significant number of open-ended how/why questions. Interestingly, even what questions tend to require paragraph-length explanations (What is the difference…). As seen in Table 8 in Appendix A, each question is open-ended, inquisitive and requires an answer that is descriptive in nature. Finally, one of the advantages of the ELI5 dataset is that it covers diverse domains such as science, health, and politics. This quality makes ELI5 an ideal candidate to transfer to the news domain, which similarly covers a diverse range of topics.

Evaluation Data

Since our goal is to generate open-ended questions from news articles, we specifically design our evaluation data to reflect the same. To achieve this goal, we obtain English-language articles from The New York Times website from January 2020 to June 2020. We obtained written consent from the copyright holder to use this content for research purposes. One of the additional advantages of crawling data from The New York Times website is that we can divide news articles by domain, as each news article appears in a specific section of the website. For example, the section named in an article's URL can tell us that the article belongs to the Science domain. Additionally, as most pre-trained language models were trained prior to the Covid-19 pandemic, we also test how well they generalize to COVID-19 related news topics.

Each news article from a particular domain is segmented into several paragraphs. We randomly sample 529 paragraphs spanning six domains. This includes 55 paragraphs from Science, 66 from Climate, 98 from Technology, 110 from Health, 100 from NYRegion, and 100 from Business. While we understand that selecting standalone paragraphs might sometimes ignore the greater context, or suffer from co-reference issues, we carefully replace any such paragraphs with others from our larger pool.

As we do not have gold questions associated with each paragraph, we crowd-source human-written questions for each paragraph on Amazon Mechanical Turk. Each paragraph is shown to a distinct crowdworker, who is then instructed to read the paragraph carefully and write an open-ended question that is answered by the entire passage. We recruit 96 distinct crowd workers for this task. After the questions are collected from the first round of crowd-sourcing, two expert news media employees approve or reject them based on quality. The paragraphs with rejected questions are put up again, and through this iterative process and careful quality control we obtain one high-quality open-ended question associated with each paragraph. Tables 2 and 9 show selected paragraphs from our evaluation set and the associated human-generated open-ended questions.

4 CONSISTENT Model


The backbone of our approach is a BART-large lewis-etal-2020-bart model fine-tuned on the ELI5 dataset of question-answer pairs. However, there are two major factors to consider in our end-to-end question generation pipeline. The generated questions i) must be relevant to and factually consistent with the input paragraph, and ii) must have their answer self-contained in the input paragraph. Our CONSISTENT model (Figures 2 and 3) addresses these issues as described below.

Factual Consistency

To ensure faithfulness to the input paragraph, we need to design our model so that the generated question is about a topic or concept mentioned in the paragraph. In traditional fine-tuning of a seq2seq model, where x denotes an input paragraph in the training set and y = (y_1, ..., y_n) denotes the corresponding question, our goal is to learn $p(y \mid x)$, where

$$p(y \mid x) = \prod_{i=1}^{n} p(y_i \mid y_{<i}, x)$$

Recently keskar2019ctrl proposed CTRL, a conditional language model that is conditioned on a control code c and learns the distribution $p(y \mid c)$ over a text sequence y to provide explicit control over text generation. The distribution can still be decomposed using the chain rule of probability and trained with a loss that takes the control code into account:

$$p(y \mid c) = \prod_{i=1}^{n} p(y_i \mid y_{<i}, c), \qquad \mathcal{L}(D) = -\sum_{k=1}^{|D|} \sum_{i=1}^{n_k} \log p_{\theta}\left(y_i^{k} \mid y_{<i}^{k}, c^{k}\right)$$

Owing to this modification, language models can generate text conditioned on control codes that specify domain, style, topics, dates, entities, relationships between entities, plot points, and task-related behavior. We rely on the same underlying principle for training question generation models.
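
As a small worked illustration of the chain-rule decomposition mentioned above, the sketch below computes the negative log-likelihood of one target sequence from per-token probabilities. The function name `sequence_nll` and the probability values are hypothetical and chosen only for illustration; in a real model each value would be the probability assigned to the gold token given the preceding tokens and the conditioning input (paragraph and control code).

```python
import math


def sequence_nll(token_probs):
    """Negative log-likelihood of one sequence.

    token_probs[i] stands for the probability the model assigned to the
    i-th gold token, given the previous gold tokens and the conditioning
    input (hypothetical numbers here, not real model outputs).
    """
    return -sum(math.log(p) for p in token_probs)


# Hypothetical per-token probabilities for a four-token question.
probs = [0.5, 0.25, 0.8, 0.5]
nll = sequence_nll(probs)

# By the chain rule, the sequence probability is the product of the
# per-token probabilities, so exp(-nll) recovers that product.
sequence_prob = math.exp(-nll)
```

Summing this quantity over all training pairs gives the kind of loss that fine-tuning minimizes; adding a control code changes only what each per-token probability is conditioned on.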

Figure 2: Architecture to train the CONSISTENT model
Figure 3: Inference pipeline for CONSISTENT model ensuring Factual Consistency (Control Codes) and Answerability (Model Confidence and Instruction prompting)

During training, we extract keywords from questions and feed the input paragraph along with the extracted keyword to the encoder of BART. The extracted keyword here acts as the control code. Since we do not have any supervision for these keywords, we use YAKE Campos2020YAKEKE, an unsupervised keyword extraction tool. For example, as shown in Figure 2, given the question: Where do presidential campaign donations actually get spent?, we extract the top-most salient trigram “presidential campaign donations” using YAKE. We then feed the (control code, answer) to the encoder, the original question to the decoder, and fine-tune the model as shown in Figure 2.
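
The construction of a training example described above can be sketched as follows. This is a minimal illustration, not the authors' code: `toy_keyphrase` is a hypothetical stand-in for YAKE (it simply takes the first three words of the longest run of non-stopwords), the stopword list is a tiny illustrative one, the `</s>` separator between control code and answer is an assumption, and the answer text is invented.

```python
import re

# Tiny illustrative stopword list (an assumption); YAKE relies on its own
# statistical features rather than a list like this.
STOPWORDS = {"a", "an", "the", "do", "does", "is", "are", "get", "of", "to",
             "in", "what", "where", "why", "how"}


def toy_keyphrase(question, n=3):
    """Hypothetical stand-in for YAKE: first n words of the longest
    run of consecutive non-stopword tokens in the question."""
    tokens = re.findall(r"[a-z0-9']+", question.lower())
    runs, current = [], []
    for tok in tokens:
        if tok in STOPWORDS:
            if current:
                runs.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        runs.append(current)
    if not runs:
        return ""
    longest = max(runs, key=len)
    return " ".join(longest[:n])


def build_training_example(question, answer, sep=" </s> "):
    """Pair (control code, answer) as encoder input with the question as
    decoder target. The separator string is an assumption."""
    control_code = toy_keyphrase(question)
    return {
        "encoder_input": f"{control_code}{sep}{answer}",
        "decoder_target": question,
    }


example = build_training_example(
    "Where do presidential campaign donations actually get spent?",
    # Invented answer text, for illustration only.
    "Campaigns spend most donations on advertising, staff and travel.",
)
```

Swapping `toy_keyphrase` for a real keyword extractor would leave the rest of the sketch unchanged, since only the control-code string is passed on.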

lewis2021paq propose a BERT devlin2018bert based answer extraction model on Natural Questions (NQ) that predicts $p(a \mid c)$, where “a” is an answer and “c” is a passage containing “a”. This model first feeds the passage “c” through BERT, then concatenates the start and end token representations of all possible spans of up to length 30 and feeds them into an MLP to compute $p(a \mid c)$. At generation time, the answer extraction component extracts a constant number of spans from each passage, ranked by their extraction probabilities. These extracted spans, while originally designed for a different purpose, can act here as control codes for our question generation model. To encourage the question to refer to a concept mentioned in the passage, we extract salient key phrases as control codes from the input paragraph using a combination of YAKE and the answer extraction model (Figure 3).

It should be noted that during training the keywords are taken from the question, while during inference the keywords are extracted from the article. At training time we want to minimize the generation loss with respect to the training question, so encouraging the model to obey the training keyword is beneficial. At inference time we do not have access to any question, so using keywords from the article is the only option.

Answerability


Prior work lewis2021paq; fang2020accelerating; alberti-etal-2019-synthetic has relied on filtering methods to ensure answerability of generated questions. A filtering QA model generates an answer for a given question. If the answer generated by the filtering model does not match the answer the question was generated from, the question is discarded. Such filtering methods are not applicable to our task because i) our question generation model treats the entire input paragraph as an answer instead of the short answer spans common in ODQA tasks, and ii) the answers are typically long, ranging across several sentences, which is beyond the capabilities of most generative models krishna2021aurko and would additionally be hard to verify by string matching. We propose two filtering methods to ensure answerability: model confidence and instruction prompting.

Primary Filtering: Model Confidence. QA models trained on SQUAD 2.0 rajpurkar-etal-2018-know are capable of asserting when a question is unanswerable by signaling lower confidence scores. Taking advantage of this fact, we first rely on an ALBERT-based QA model fine-tuned on SQUAD 2.0. The intuition is that such a model would typically have lower confidence scores for most poorly formed or unanswerable questions and can therefore be used as a primary filtering step.

While it may appear that using a model trained on SQUAD to determine the answerability of questions conflicts with the open-ended nature of questions requiring long-form answers, this is often not the case. As seen in Natural Questions kwiatkowski2019natural, at least 35% of questions requiring a long-form answer also have a short answer associated with them. This means the specific span returned by the SQUAD-based QA model, when prompted with the generated question and the input paragraph, can often be an approximate short answer. For instance, for the generated question Q3 in Figure 3, the SQUAD model gives the answer “via art, sports, entertainment and media”, which isn’t inaccurate but requires further elucidation. This motivates us to use a model trained on SQUAD v2.0 as our initial primary step. However, different questions can have different model confidence. To decide on an appropriate threshold for model confidence, we observe the distribution of confidence scores and its median. We then experiment with 3 different thresholds for selecting generated questions. For each threshold, the quality of a held-out set of 50 generated questions is evaluated by human judges, and we finally decide on a model confidence threshold τ such that any generated question with a confidence score below τ is discarded. It should also be noted that we tried higher values of τ in the range (0.6, 0.9), but such a strict confidence requirement sometimes leaves us with no generated question for an input paragraph. As can be seen in Figure 3, the generated question “What are Bitcoins and how have the made a lot of people very rich?”, while open-ended, grammatically correct, and relevant to the input, does not meet the answerability threshold and hence is discarded.
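
The trade-off described above, where a stricter threshold can leave a paragraph with no surviving question, can be illustrated with a small threshold sweep. This is a hedged sketch: the confidence values, question labels and the thresholds 0.4, 0.6 and 0.9 are all hypothetical and are not the values used in the paper.

```python
def filter_by_confidence(scored_questions, threshold):
    """Keep (question, confidence) pairs whose confidence is >= threshold."""
    return [(q, s) for q, s in scored_questions if s >= threshold]


def coverage(paragraphs, threshold):
    """Fraction of paragraphs that keep at least one question."""
    kept = sum(1 for cands in paragraphs
               if filter_by_confidence(cands, threshold))
    return kept / len(paragraphs)


# Hypothetical QA-model confidences for candidate questions generated
# from three paragraphs (one inner list per paragraph).
paragraphs = [
    [("q1a", 0.82), ("q1b", 0.30)],
    [("q2a", 0.55), ("q2b", 0.41)],
    [("q3a", 0.48)],
]

# Sweep a few illustrative thresholds and record paragraph coverage.
sweep = {t: coverage(paragraphs, t) for t in (0.4, 0.6, 0.9)}
```

In this toy data every paragraph keeps a question at the lowest threshold, while the strictest one removes all candidates, which mirrors the reason given above for not choosing a very high threshold.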

Secondary Filtering: Instruction Prompting. While the above filtering step acts as an excellent proxy for detecting unanswerable questions, the underlying QA model is still trained for short answer spans. To ensure our filtering method is devoid of such biases, we use a secondary filtering approach. Recently, sanh2021multitask showed how large language models exhibit zero-shot generalization to unseen tasks when presented with natural language prompts. As we do not have annotated data for answerability judgements for open-ended questions with longer answer spans, we rely on zero-shot prompt-based instructions for further filtering. We prompt T0pp, the best-performing model from sanh2021multitask, with the following instruction:

Given paragraph {{paragraph}},
is the question {{question}}
answerable? Please answer
in Yes or No

The variant from South Africa, known as B.1.351, could make things even worse for the vaccine push. Given the speed at which the variant swept through that country, it is conceivable that by April it could make up a large fraction of infections in the United States.

BART: What’s going on with the Ebola virus?
Lead: What is the name of the variant from South Africa?
SQUAD: What is B.1.351?
RandomOut: Why is the domestic product of the flu so bad right now?
What is going on in the US right now after a B.1 variant swept through the country?
What does the variant from South Africa mean for the vaccine push?
Table 3: Generated Questions from Baseline Models and CONSISTENT.

We feed the questions that pass the acceptability test based on our model confidence threshold as natural language instructions to the model as shown in Figure 3. Only questions which receive an answer of “Yes” are considered in our final set. This process makes our filtering approach robust, owing to the fact that only questions which pass both filtering tests are considered as consistent.
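
The instruction-prompting step can be sketched as below. The template follows the instruction quoted above; everything else is an assumption made for illustration: the helper names, the rule that only a reply starting with "yes" passes, and `fake_generate`, a stand-in for the real T0pp model so that the sketch runs without loading any model.

```python
PROMPT_TEMPLATE = (
    "Given paragraph {paragraph}, is the question {question} answerable? "
    "Please answer in Yes or No"
)


def build_prompt(paragraph, question):
    """Fill the instruction template with one paragraph/question pair."""
    return PROMPT_TEMPLATE.format(paragraph=paragraph, question=question)


def is_answerable(model_output):
    """Interpret the model's free-text reply; only a leading 'yes' passes
    (this parsing rule is an assumption)."""
    return model_output.strip().lower().startswith("yes")


def secondary_filter(paragraph, questions, generate):
    """`generate` is any callable mapping a prompt string to a reply string."""
    return [q for q in questions
            if is_answerable(generate(build_prompt(paragraph, q)))]


def fake_generate(prompt):
    """Hypothetical stand-in for T0pp: rejects prompts mentioning Ebola."""
    return "No" if "Ebola" in prompt else "Yes"


paragraph = ("The variant from South Africa, known as B.1.351, could make "
             "things even worse for the vaccine push.")
questions = [
    "What's going on with the Ebola virus?",
    "What does the variant from South Africa mean for the vaccine push?",
]
kept = secondary_filter(paragraph, questions, fake_generate)
```

Passing the model in as a callable keeps the prompt construction and reply parsing testable independently of whichever model actually produces the reply.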

It can be argued that T0pp is a stochastic system that was not trained for un-answerability detection. To justify our use of T0pp, we conducted an experiment in which we sample a subset of 200 questions (100 answerable and 100 unanswerable). These questions are manually selected by humans from our pool of all possible generated questions. We then feed T0pp the same prompt as above, containing the respective questions and their associated paragraphs. On the binary task of un-answerability prediction we get an accuracy of 84%.

As our pipeline can generate multiple questions for each input due to different control codes, we further need to rank the generated questions. To this end, we rank all consistent generated questions for a given input paragraph by their model confidence scores.
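
Putting the two filters and the ranking step together, a minimal sketch might look like the following. The `Candidate` fields, the threshold of 0.5 and all scores are hypothetical; the only behavior taken from the text is that a question must pass both filters and that surviving questions are ordered by QA-model confidence.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    question: str
    control_code: str
    confidence: float  # QA-model confidence (primary filter)
    said_yes: bool     # instruction-prompted reply was "Yes" (secondary filter)


def rank_consistent(candidates, threshold):
    """Keep candidates that pass both filters, highest confidence first."""
    consistent = [c for c in candidates
                  if c.confidence >= threshold and c.said_yes]
    return sorted(consistent, key=lambda c: c.confidence, reverse=True)


# Hypothetical candidates generated from one paragraph with four
# different control codes.
candidates = [
    Candidate("Q1", "kp1", 0.71, True),
    Candidate("Q2", "kp2", 0.93, False),  # confident but judged unanswerable
    Candidate("Q3", "kp3", 0.88, True),
    Candidate("Q4", "kp4", 0.20, True),   # below the confidence threshold
]
ranked = rank_consistent(candidates, threshold=0.5)
best = ranked[0].question if ranked else None
```

Note that the most confident candidate is not necessarily returned: in this toy data Q2 has the highest confidence but fails the second filter, so Q3 is ranked first.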

5 Evaluation Setup

5.1 Baselines

We compare our CONSISTENT model against several baseline approaches.

Lead Sentence to Question (Lead):

In order to ensure that our data is free from any potential artifacts we take the lead sentence of every passage and convert it to a question. In particular, we prompt the T0pp sanh2021multitask model which acts as a statement-to-question converter transforming the first sentence of every paragraph to a question.

QG based on fine-tuned BART (BART):

Our initial backbone model: BART-large fine-tuned on answer-question pairs from the ELI5 dataset.

QG based on random keyword inside Paragraph (RandomIn):

We use the same fine-tuned BART-large model from Section 4, with <keyphrase, paragraph> as input to the encoder and the question as the output from the decoder. During inference we feed a random keyphrase from the input paragraph to generate the question. It should be noted that this approach does not undergo any of the filtering mechanisms used in CONSISTENT.

QG based on random keyword outside Paragraph (RandomOut):

The training method is the same as that of RandomIn, except that during inference we feed a random keyphrase from outside the input paragraph to generate the question. It again does not undergo any of the filtering mechanisms used in CONSISTENT.

QG based on SQUAD data (SQUAD):

We fine-tune a BART-large model on SQUAD 2.0, conditioning on keyphrases in the prompt during inference. In particular, we use the same keyphrase used in the prompt for the highest-scoring question from our CONSISTENT model. For instance, we prompt the model fine-tuned on SQUAD with the keyphrase once-niche world and the input paragraph, as shown in Figure 3.

5.2 Evaluation Metrics

The space of possible correct outputs is too large in our case to rely on n-gram based metrics like BLEU or ROUGE. For this reason, we chose the two best available automatic evaluation metrics based on contextual representations. We report BERTScore Zhang-etal:2020:bertscore to measure the similarity between a generated question and its gold-reference human-written question (we used BERTScore based on deberta-mnli, which is shown to have high correlation with human judgements). We also report BLEURT sellam-etal-2020-bleurt scores, which combine expressivity and robustness by pre-training a fully learned metric on large amounts of synthetic data, before fine-tuning it on human ratings.

Model        BLEURT   BERTScore
BART         44.0     64.5
Lead         43.0     64.4
SQUAD        39.0     62.1
RandomIn     40.0     62.1
RandomOut    34.2     58.1
CONSISTENT   47.0*    66.4*
Table 4: Evaluation based on automatic metrics. *Results are significant via t-test.

However, automatic metrics are not enough. To evaluate the controllability and answerability of the generated open-ended questions, we chose outputs from the 2 best-performing systems based on the automatic evaluation. We further propose a new metric, well-formedness, and a human-based evaluation. A well-formed question is grammatically correct, faithful to the provided paragraph, and has a detailed, long-form answer that spans the entire paragraph. A well-formed question only mentions people, places, things, or ideas that are also in the original text.

Regarding human judges, karpinska-etal-2021-perils discuss how, even with strict qualification filters, AMT workers are not suitable for evaluating open-domain NLG outputs. To avoid such issues, we recruit multiple employees of a news media organization with experience in building products and tools for newsrooms to evaluate the output of our baseline and CONSISTENT model for each input paragraph. We believe these evaluators can ground their judgments in the real-world utility of the generated questions for our target use case. Each input was evaluated by three people. Annotator guidelines are in Appendix A. We use the Amazon SageMaker Ground Truth platform, where we upload our input paragraphs along with the generated questions from the two systems (randomly shuffled), as shown in Figure 6. The news article headline is given for additional context to the human evaluator. The evaluators are provided with the above definition of what constitutes a well-formed question. The evaluators are then asked to determine which questions are well-formed, choosing among four possible options: Question1, Question2, Both Questions, and Neither Question.

Domain              CONSISTENT   BART   Both   Neither
Health              56.3         13.6   17.2   12.7
Technology          37.7         22.4   22.4   17.3
Science & Climate   42.1         18.1   23.1   16.5
New York            51.0         9.0    33.0   7.0
Business            55.0         11.0   30.0   4.0
Overall             48.4         14.8   25.1   11.5
Table 5: Human-based evaluation results (percentage win). Significance via t-test.

6 Results and Analysis

Table 4 shows that our CONSISTENT model is better than all the existing baselines. Table 5 shows that our experts agree on the quality of the generated questions spanning different domains. To get a single verdict on the correct label for each input, we consider majority voting for each question. Agreement rates were measured using Krippendorff’s α, and a moderate agreement of 0.62 was achieved. As observed, our CONSISTENT model outperforms Baseline BART overall by a margin of 33.6 points. Table 3 shows examples of generations by the five models on a given paragraph. In an effort to better understand why or how the CONSISTENT model is better than the baselines, we carefully analyze outputs from all systems.
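
The majority-voting step over three annotators can be sketched as follows. The label names follow the four options given to evaluators; the votes are invented, and the handling of inputs where all three annotators disagree (they are skipped here) is an assumption, since the text does not say how such cases were resolved.

```python
from collections import Counter

LABELS = ("Question1", "Question2", "Both", "Neither")


def majority_label(votes):
    """Return the label chosen by more than half of the annotators,
    or None when no label reaches a majority."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


def win_percentages(all_votes):
    """Percentage of decided inputs won by each label. Inputs without a
    majority are skipped (an assumption, not stated in the text)."""
    verdicts = [majority_label(v) for v in all_votes]
    decided = [v for v in verdicts if v is not None]
    if not decided:
        return {label: 0.0 for label in LABELS}
    counts = Counter(decided)
    return {label: 100.0 * counts[label] / len(decided) for label in LABELS}


# Invented votes from three annotators for five inputs.
all_votes = [
    ("Question1", "Question1", "Both"),
    ("Question1", "Question2", "Question1"),
    ("Both", "Both", "Neither"),
    ("Question2", "Question2", "Question2"),
    ("Question1", "Both", "Neither"),  # no majority, skipped
]
percentages = win_percentages(all_votes)
```

Aggregating the per-input verdicts this way yields one percentage per option, which is the shape of the numbers reported per domain in the human evaluation.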

Without explicit supervision on what to ask, BART often asks generic questions or deviates from the input source and hallucinates content, as can be seen in Table 3 and Table 7. The LEAD model works decently well when the central idea of the paragraph is expressed in the first sentence; however, without broader context it often generates factoid or uninteresting questions. The SQUAD model is the second-worst performing model, as expected, due to the mismatch between training and evaluation domains. Due to the lack of answerability filters, the RandomIn model, even though it generates a question based on a random keyword from the article, often produces unanswerable questions, as reflected in the automatic evaluation scores in Table 4. The worst performance of the RandomOut model bolsters our claim that using keywords from the article in the prompt helps the model achieve faithfulness, and that doing otherwise might hurt performance.

To test the effect of the primary filtering, we choose the candidate question with the highest confidence given by the QA model. To test the effect of secondary filtering, we choose the candidate question with the highest confidence of generating yes. The BERTScore values obtained for Primary, Secondary and CONSISTENT (both) are 64.0, 64.5 and 66.4 respectively, showing that the best filtering mechanism is the combined one.

7 Downstream Applications

Figure 4: A prototype admin tool for humans to approve, reject, or edit questions generated by CONSISTENT for individual news articles.

We believe open-ended question generation might enhance the news experience through new Q&A tools, enhanced search, improved recommendations, and more. Media organizations have used FAQ pages to help readers better understand complex news topics, from Covid-19 vaccines to personal finance. The ability to automatically generate open-ended questions about a given topic could make it easier for news organizations to launch an FAQ page for a new topic. We envision an admin tool (Figure 4) that presents the users with a list of generated questions and allows them to approve/reject, edit, and publish the results quickly. This human-in-the-loop approach is essential for maintaining reader trust when the generated questions may be presented directly to readers laban-etal-2022-quiz.

Figure 5: A prototype news article where human-approved questions generated by CONSISTENT are displayed as related articles.

Another potential application of this human-in-the-loop version of the system is improved news article recommendations (Figure 5). While reading a news article, a recommended article interface may be presented showing a series of questions related to the topics in the article. This level of functionality could in some ways anticipate questions a reader may ask and point them in the direction of other news articles that may provide them with answers.

For other use cases that might allow automatically generated open-ended questions to be used more broadly in production systems, we envision a human-validated database of such question-answer pairs where the reliability of the results could be controlled. This can help improve the search experience on news media websites. For instance, if a user were to search What are some of the issues with NFTs?, the search experience could fuzzily match questions generated by CONSISTENT to prioritize the article containing the answer relevant to the user’s query, helping them discover content related to what they are curious about. Additionally, question clustering algorithms could be deployed to better match searched questions with the generated questions. Finally, automatic question generation can also be used to improve interactive news podcasts laban2022newspod.
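
As one possible reading of the fuzzy-matching idea above, the sketch below matches a user query against a small index of generated questions using Python's standard `difflib` module. The index contents, article identifiers and the 0.5 similarity cutoff are hypothetical; a production system would more likely use a dedicated search or embedding-based retrieval component.

```python
import difflib


def match_query(query, question_index, cutoff=0.5):
    """Return (question, article_id) pairs whose question text is similar
    to the query, best match first.

    question_index maps a generated question to an article identifier.
    """
    normalized = {q.lower(): q for q in question_index}
    hits = difflib.get_close_matches(
        query.lower(), list(normalized), n=3, cutoff=cutoff
    )
    return [(normalized[h], question_index[normalized[h]]) for h in hits]


# Hypothetical human-validated index of generated questions.
question_index = {
    "What are some of the issues with NFTs?": "article-17",
    "How do vaccines train the immune system?": "article-42",
}

hits = match_query("what are the issues with NFTs", question_index)
no_hits = match_query("zzzz", question_index)
```

Lowercasing both sides before matching is a design choice that makes the comparison case-insensitive while still returning the question in its original form.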

8 Conclusion

We propose CONSISTENT, an end-to-end system for generating open-ended questions requiring long-form answers, which accounts for factual consistency and answerability. Using news articles as a trustworthy foundation for experimentation, we demonstrate CONSISTENT’s strength over a competitive baseline model, as evaluated using both automatic metrics and human evaluation. We also contribute an evaluation set of input paragraphs and human-generated open-ended questions. Through potential downstream applications of CONSISTENT, we illustrate how it can enhance the experience of news media websites.

Ethical Considerations

As noted in Section 3, we use a corpus of news articles from The New York Times as the foundational set of documents used for question generation. We have used this data with the approval and consent of the copyright holders for research purposes. We intentionally decided to use news articles as a trustworthy foundation for question generation. Further, we selected The New York Times as they have published a clear set of ethics and standards to guide the creation of their journalism. Our models (400M parameters) were trained on four A100 GPUs for 10 hours.

As with other text generative models, our model can suffer from hallucinations hallucination, biases sheng-etal-2019-woman; sheng2021societal from the Reddit ELI5 dataset and text found on the internet more broadly, and concerns about potential misuse. Much of this paper details the steps we have taken to reduce hallucinations and exert greater control over the final outputs in order to counter these risks (see Section 4). We use control codes selected from the original news article in an attempt to better control the generated question. We filter for answerability to further ensure that generated questions are faithful to the original text. While considerable work has been done to reduce the impact of these issues, any language generation system will be imperfect.

Our human evaluators were selected due to their familiarity with standards of journalism. Each evaluator was a paid, full-time employee of a news media organization.

To encourage critical thinking about the risks of deployment in a production environment, we included Section 7 to discuss possible downstream applications. We detailed our perspective on when a human-in-the-loop would be essential to an ethical use of this system.

We hope that our work in this paper can further the important work of safe and trustworthy language generation.

9 Limitations

We note that our training dataset is automatically collected from the r/ELI5 subreddit, and as such we do not account for any sensitive text it may contain. We focus on open-ended question generation from news articles, where our inputs are at the paragraph level and therefore longer than the sentence-level inputs used in factoid QG. However, our model is not capable of handling longer sequences such as an entire news article or opinion piece. We believe models like LongT5 guo-etal-2022-longt5 might be useful for such inputs; however, we leave this for future work.

Even though we control for hallucination by incorporating control codes from the input text, this does not ensure 100% hallucination-free output. With regard to answerability judgements, our methods are useful and help bridge the gap in distinguishing unanswerable questions; however, this is in itself a difficult task, and our approaches based on SQUAD V2.0 and T0pp can still make errors. This means our models are still capable of generating unacceptable questions and should be deployed only after due deliberation.

Finally, temporal misalignment is an issue: because our training data is from a few years back, the model sometimes fails on newly coined scientific terms or expressions related to the COVID-19 pandemic. Continually fine-tuning our models on newer data with experience replay can mitigate these issues. We leave this for future work.
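To make the experience replay idea concrete, the following minimal sketch mixes newly collected examples with a random sample of older training examples before a fine-tuning round. We did not implement this; the `build_replay_mixture` helper, the `replay_ratio` of 0.5, and the toy example pairs are illustrative assumptions only.

```python
import random


def build_replay_mixture(new_examples, old_examples, replay_ratio=0.5, seed=0):
    """Combine all new examples with a random sample of old ones.

    replay_ratio is the number of replayed old examples per new example;
    0.5 is an illustrative default, not a tuned value.
    """
    rng = random.Random(seed)
    n_replay = min(len(old_examples), int(len(new_examples) * replay_ratio))
    replayed = rng.sample(list(old_examples), n_replay)
    mixture = list(new_examples) + replayed
    rng.shuffle(mixture)  # interleave old and new so batches contain both
    return mixture


# Toy (answer, question) pairs standing in for older and newer training data.
OLD_DATA = [(f"old answer {i}", f"old question {i}") for i in range(5)]
NEW_DATA = [(f"new answer {i}", f"new question {i}") for i in range(4)]

mixture = build_replay_mixture(NEW_DATA, OLD_DATA, replay_ratio=0.5)
print(len(mixture), "examples in the fine-tuning mixture")
```

Replaying a share of older examples is intended to limit forgetting of earlier data while the model adapts to newer vocabulary.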


Acknowledgments

We would like to thank the anonymous reviewers for their helpful comments. Tuhin was funded by NYC Media Lab x New York Times R&D Researcher Fellowship. The authors also want to thank the members of NYTimes R&D team and NYC Media Lab teammates Robert Clauser, Erica Matsumoto and Matt Macvey for their support.


Appendix A Appendices

Annotator Guidelines

As our well-formedness metric spans multiple dimensions, it is important for us to have clear annotation guidelines. To this end, we specifically instruct workers on what to look for:

  • Question needs to be grammatical

  • Question should not refer to concepts or entities that are not referenced in the original paragraph. For instance, a question about an Ebola vaccine when the original text is about the COVID-19 vaccine is NOT considered well-formed. Similarly, a question that references Vice President Biden when the text is about President Biden would not be considered well-formed.

  • Question needs to be faithful and relevant to the input paragraph and on the same topic

  • The question should encapsulate or summarize the key idea of the entire passage and should not be simply factoid (i.e., something that can be answered using a few words)

Input paragraph: The variant from South Africa, known as B.1.351, could make things even worse for the vaccine push. Given the speed at which the variant swept through that country, it is conceivable that by April it could make up a large fraction of infections in the United

Question: What’s going on with the Ebola virus?
Type: Hallucination

Question: What is the name of the variant from South Africa?
Type: Simply factoid

Question: What is B.1.351?
Type: Simply factoid

Question: What does the variant from South Africa mean for the vaccine push?
Type: Well-formed

Table 6: Generated examples used to instruct human workers on what makes a question well-formed and why some questions are not well-formed

They were also provided with examples of generated questions, along with reasoning as to why a given question is not well-formed.


We fine-tune a BART Large model on ELI5 for both baseline and CONSISTENT for 10 epochs with batch size 64 and learning rate

and save the best checkpoint based on validation loss. To generate questions we use top-k sampling fan2018hierarchical with and a temperature of 0.8 coupled with
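For readers unfamiliar with the decoding strategy, the sketch below shows top-k sampling with temperature in plain Python. It is an illustration of the general technique, not our generation code: the value of k is not recoverable from the text above, so `K_EXAMPLE` and the toy logits are hypothetical, and only the temperature of 0.8 comes from the text. In practice this step is usually delegated to a library's sampling options (for example, the `do_sample`, `top_k`, and `temperature` arguments of `generate` in Hugging Face Transformers).

```python
import math
import random


def top_k_sample(logits, k, temperature=0.8, rng=None):
    """Sample one token index from the k highest-scoring logits."""
    if k < 1:
        raise ValueError("k must be at least 1")
    rng = rng or random.Random()
    # Keep the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature below 1 sharpens the distribution over the kept tokens.
    scaled = [logits[i] / temperature for i in top]
    # Softmax numerator, subtracting the max for numerical stability.
    peak = max(scaled)
    weights = [math.exp(value - peak) for value in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


K_EXAMPLE = 2  # hypothetical; the k used in our experiments is not stated here
toy_logits = [2.0, 0.5, 1.5, -1.0]  # made-up scores over a 4-token vocabulary
draws = [top_k_sample(toy_logits, K_EXAMPLE, rng=random.Random(seed)) for seed in range(50)]
print(sorted(set(draws)))  # only indices among the top-k tokens can appear
```

Restricting sampling to the k most likely tokens removes the low-probability tail, while the temperature controls how strongly the most likely of those tokens is favored.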

Input paragraph: At the two New Horizons homes in Gainesville, the medical director, Dr. Swati Gaur has held six staff town halls, in person or online, including one at 2 a.m. for the night shift, and offered rewards like free meals. About half of the workers have been vaccinated, Dr. Gaur said.

Question: How do nursing homes deal with the influx of new residents?

Question: What has the medical director of New Horizons done to ensure that workers are vaccinated?

Table 7: Hallucination by the Baseline BART model
Question: Why are my muscles sore after jumping in cold water?
Answer: From what I understand, our bodies defenses against hypothermia is to shiver. This involves involuntary muscle contractions to generate heat. These muscles contractions still can cause muscle soreness just like working out.

Question: How come bluetooth is so much slower than Wi-Fi?
Answer: Bluetooth is designed to be short-range very low-power for small portable equipment. Part of the power-savings of Bluetooth come from diminished bandwidth (just as much as the weaker signal). One could speed up Bluetooth to Wi-Fi speeds, but then it would defeat the purpose of BT’s major design feature. If you’re looking for something that works like plunging a cable between devices but has Wi-Fi speeds, you might like wireless USB.

Table 8: Examples from the r/ELI5 subreddit of open-ended questions requiring long-form answers, paired with human-written answers
Figure 6: Screenshot of the evaluation tool where employees of a news media organization are asked to select the best option given two generated questions, an input paragraph, and headline from a news article
Paragraph: More companies are also using augmented reality to help people with online shopping, Ms. Ask said. Jins Eyewear, which sells prescription glasses, lets you take a photo of your face to virtually try on glasses before deciding whether to buy them. Snap, the parent company for Snapchat, has teamed up with luxury brands like Gucci and Dior to offer virtual try-ons.
Question: How are companies using AR for online shopping?

Paragraph: For instance, a number of U.S. colleges and universities, including the University of Arizona and the University of North Carolina at Charlotte, have used wastewater surveillance of dorms to find asymptomatic, infected students who had otherwise evaded detection. In the Netherlands, health officials have used wastewater data to determine where to send their mobile testing buses, Dr. Medema said.
Question: How has wastewater data been used to detect symptoms?

Paragraph: No matter what their goals are — moving a stock, overturning a presidential election, getting the graphics on a Sonic the Hedgehog movie changed — these internet-based insurgencies tend to follow a similar pattern. One day, a group decides to take action against a system it feels is immoral or corrupt. Members identify structural weak points (a vulnerable political party, a risk-averse studio head, an overexposed short position) and figure out creative ways to exploit them, using social media for leverage and visibility. With enough highly motivated people pushing in the same direction they eventually prevail, or get enough attention that it feels like they did.
Question: How do internet-based insurgencies gain traction?

Paragraph: A growing body of research shows that FEMA often helps white disaster victims more than people of color, even when the amount of damage is comparable. The problem seems to stem from complex systemic factors, like the difficulty of navigating the federal bureaucracy and a real estate market that often places higher values on properties in communities with white residents.
Question: Why does FEMA serve more white victims?

Paragraph: The demands come as the safety of firefighters has become an urgent concern amid worsening effects of climate change, which bring rising temperatures that prime the nation for increasingly devastating fires. In October, 2 dozen firefighters in California, where a record 4.2 million acres burned across the state last year, filed suit against 3M, Chemours, E.I. du Pont de Nemours and other manufacturers, claiming that the companies for decades knowingly made and sold firefighting equipment loaded with toxic chemicals without warning of the chemicals’ risks.
Question: Why are firefighters suing companies in California

Table 9: Human written questions from our crowdsourced evaluation set