
How Well Do You Know Your Audience? Reader-aware Question Generation

by Ian Stewart, et al.

When writing, a person may need to anticipate questions from their readers, but different types of readers may ask very different types of questions. If someone is writing for advice about a problem, what question will a domain expert ask, and is this different from how a novice might react? In this paper, we address the task of reader-aware question generation. We collect a new data set of questions and posts from social media, augmented with background information about the post readers. Based on predictive analysis and descriptive differences, we find that different readers, such as experts and novices, consistently ask different types of questions. We next develop several text generation models that incorporate different types of reader background, including discrete and continuous reader representations based on the readers' prior behavior. We demonstrate that reader-aware models can perform on par or slightly better than the text-only model in some cases, particularly in cases where a post attracts very different questions from readers of different groups. Our work has the potential to help writers anticipate the information needs of different readers.



1 Introduction

Writers are often trained to be aware of their audience Park (1986) and to minimize the effort required for others to understand their message, especially if the communication is one-directional. However, NLP tools for writing assistance are often not made aware of the writer’s likely readers Zhang et al. (2020) and the information gaps that the readers may have when reading the writer’s message. For instance, someone in the US who is writing about a problem may omit important details related to US norms and customs, which a non-US reader may then misunderstand due to the missing context. A system that could preempt the readers’ information needs would enable the writer to include additional information in their original message. This would be especially useful in situations where writers are seeking advice from readers Govindarajan et al. (2020), since preempting clarification questions from readers would help writers modify their original post and arrive at a solution more quickly.

While it is not possible to predict the information needs of every possible future reader, it is likely that there are consistent tendencies among reader groups based on their prior backgrounds  Garimella et al. (2019). In this paper, we propose a reader-aware question generation model with the goal of providing possible feedback questions to writers, in the context of writing requests for advice. This kind of model could prove useful for people learning how to write in new contexts as well as people who have difficulty anticipating the information needs of different audiences. We provide examples of questions generated by one of our reader-aware models in Table 1.

Source | Text
--- | ---
Post | “I’m stuck in debt. My partner and I are only 30, and we bought a condo last year that we almost couldn’t afford.”
Target question (Expert author) | “What’s your income and what are your expenses?”
Text-only model | “How much do you have in savings?”
Reader-attention model (Expert) | “What are your expenses?”
Reader-attention model (Novice) | “Do you have a budget?”

Table 1: Example questions generated by the reader-attention model and the text-only model for a sample post (text modified for privacy).

Our work contributes the following:

  • We collect a dataset of 200000 Reddit posts seeking advice about a variety of everyday topics such as technology, legal issues, and finance, containing 700000 questions. [Footnote 1: We will release the dataset and data processing code.] We define several dimensions of reader groups that are relevant to writers’ requests for information, such as expertise (§ 4.1).

  • We demonstrate that the questions that readers ask are consistently different between members of the groups, using prediction and descriptive analysis (§ 5.1).

  • We extend a transformer-based language model to incorporate reader information with discrete and continuous representations. The reader-aware model outperforms the text-only model for highly “divisive” posts that attract very different questions from different reader groups, which suggests that the model adapts well to posts with a variety of information expectations.

2 Background

2.1 Question generation

Researchers have proposed a variety of systems to generate questions based on a given document and a known answer to the question, with the goal of improving QA systems with augmented data Dong et al. (2019) and improving automated writing systems with question prompts for writers Becker et al. (2012); Liu et al. (2012). Some approaches have extended existing sequential generation language models, such as the LSTM and the transformer, to accommodate both the textual context (e.g. a document) and the likely answer to the question, in order to maximize the probability of generating an informative question Du et al. (2017); Indurthi et al. (2017). Reinforcement learning can also be leveraged to reward a system for generating questions that are more likely to have interesting answers Qi et al. (2020a) and more likely to be relevant to the context Rao and Daumé (2019). Furthermore, work such as Gao et al. (2019) has proposed controllable generation techniques (e.g. increased difficulty) to encourage less generic questions. The foundation of this line of work has been to collect human-generated questions from a variety of domains, including Wikipedia Du and Cardie (2018), Stack Overflow Kumar and Black (2020), and Twitter Xiong et al. (2019). This study builds on the lines of controlled generation and online data collection by testing several variants of reader-aware question generation on data from online forums, which provide questions and documents across many domains.

2.2 Language model personalization

Personalized language modeling often seeks to improve the performance of common language tasks, such as generation, using prior knowledge about the human writing the text Paik et al. (2001). Personalization can both improve performance and make language processing more human-aware Hovy (2018), which can ensure that a more diverse sample of society is included in language models Hovy and Spruit (2016). While initially focused on recommendation and generation, human-aware language processing has proven useful to a variety of prediction tasks such as syntactic parsing Garimella et al. (2019) and geolocation Bamman et al. (2014). How to best represent the “human” in such systems remains an open question, and personalized systems often use a writer’s demographics Welch et al. (2020) or a writer’s social network information Del Tredici et al. (2019) in combination with language to optimize task performance. One common approach is to convert each writer to a latent representation such as an embedding Pan and Ding (2019) and then combine this information with the language context (represented as e.g. word embeddings) in a neural network model, where the writer’s representation is learned jointly with the text representations during training Miura et al. (2017). We draw inspiration from the contextualized view of personalization from Flek (2020), and we build models to represent the question-askers based on their prior behavior with respect to the specific context of a given question.

3 Data

In this study, we consider the task of generating clarification questions on informational posts. Prior work in social computing has illustrated the rich diversity of advice-seeking behavior on social media such as Reddit Fu et al. (2019); Govindarajan et al. (2020); Lahnala et al. (2021), which leads us to focus on several popular advice-seeking forums on Reddit. We choose the following subreddits, which have a high proportion of text-only submissions, topical diversity, and frequent questions in response to the submissions: Advice, AmItheAsshole, LegalAdvice, PCMasterRace, and PersonalFinance. [Footnote 2: AmItheAsshole hosts discussions about social norms in everyday situations; Advice hosts discussions about general lifestyle decisions including work, health, and family.] We collect all submissions (approximately 8 million) to the above subreddits from January 2018 through December 2019, using a public archive (accessed February 2021) Baumgartner et al. (2020). We filter the post data to only include submissions written in English with at least 25 words. To identify potential clarification questions, we collect all child comments of the parent submission (approximately 6 million) that are not written by bots.

Initial analysis revealed that some questions were either irrelevant to the post (e.g., “what about X” where X is unrelated to the post topic) or did not actually seek more information from the original post (e.g., rhetorical questions). To address this, we sampled 100 questions from each subreddit in the data along with the parent post, and we collected binary annotations for relevance (“question is relevant”) and information-seeking (“question asks for more information”) from three annotators, who are undergraduate students and native English speakers. We provided instructions and a sample of 20 questions labeled by one of the authors as training data for the annotators. On the full data, the annotators achieved fair agreement on question relevance () and on whether questions are information-seeking ().

After annotation, we removed all instances of disagreement among annotators to yield questions with perfect agreement for relevance (76% perfect-agreement) and information-seeking (80%). In the perfect-agreement data, the majority of questions (94%) were marked as relevant by both annotators, which makes sense considering that the advice forums generally attract good-faith responses from commenters. We therefore chose to not filter questions based on potential relevance. To filter information-seeking questions, we trained a simple bag-of-words classifier on the annotated data (binary 1/0; based on questions with perfect annotator agreement).


We restricted the vocabulary to the 50 most frequent words, minus stop-words, to avoid overfitting. Initial tests with SVM, logistic regression, and random forest models revealed that the random forest model performed the best, which we used for the rest of the classification.

The annotated data were split into 10 folds for training and testing, and the model achieved 87.5% mean F1 score, which is reasonable for “noisy” user-generated text. We applied the classifier to the full dataset and removed questions for which the classifier’s probability was below 50%.
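The feature-extraction and filtering steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the stop-word list, the tokenizer, and the function names are hypothetical, and the random forest classifier itself is omitted (only the 50% probability cut-off described in the text is shown).

```python
import re
from collections import Counter

# Hypothetical stop-word list; the paper does not say which list was used.
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "it"}


def tokenize(text):
    """Lowercase a question and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(questions, max_size=50):
    """Keep the max_size most frequent tokens that are not stop-words."""
    counts = Counter(
        tok for q in questions for tok in tokenize(q) if tok not in STOP_WORDS
    )
    return [word for word, _ in counts.most_common(max_size)]


def bag_of_words(question, vocabulary):
    """Count how often each vocabulary word occurs in the question."""
    counts = Counter(tokenize(question))
    return [counts[word] for word in vocabulary]


def keep_question(probability, threshold=0.5):
    """Filtering rule from the text: drop questions scored below 50%."""
    return probability >= threshold
```

The vectors returned by `bag_of_words` would be the inputs to whichever classifier is trained on the annotated questions.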

Statistic | Value
--- | ---
Total posts | 207694
Total questions | 730620
Post length | 304 ± 221
Question length | 13.9 ± 8.08
Questions with reader data | 77.7%
Questions with discrete reader data | 75.2%
Questions with reader embeddings | 43.5%

Table 2: Summary statistics about posts, questions, and author data (see § 4.1, § 4.5); ± indicates mean and standard deviation.

Subreddit | Posts | Questions
--- | --- | ---
Advice | 48858 | 87592
AmItheAsshole | 61857 | 331345
LegalAdvice | 53577 | 92737
PCMasterRace | 31657 | 47613
PersonalFinance | 74745 | 171333

Table 3: Summary statistics about subreddits.

We summarize the overall data in Table 2, and we show the distribution of the posts and questions among subreddits in Table 3. Example posts and associated clarification questions are shown in Table 4 under “Example questions.”

4 Models

The typical goal of question generation is to accurately predict the text of question $q$ considering the context of a post $p$, i.e. maximizing $P(q \mid p)$. The primary goal of this study is to assess the relevance of the “reader” in the task of question generation: how can we best capture the prior knowledge and the likely interests of reader $r$ with respect to post $p$, to better predict question $q$? The goal is to find model parameters $\theta$ that maximize the conditional likelihood of observing $q$ given $p$ and $r$:

$$\hat{\theta} = \arg\max_{\theta} P(q \mid p, r; \theta)$$

This task is different from other settings that condition on the question answer Du et al. (2017). We introduce several models to incorporate discrete and continuous representations of $r$ (defined in § 4.1), including a token-based model (§ 4.3), an attention-specific model (§ 4.4), and a reader-embedding model (§ 4.5).

Reader group | Description | Value | Example question | Example post title
--- | --- | --- | --- | ---
EXPERT | Prior rate of commenting in the target subreddit OR a topically-related subreddit. | Expert (above 75th percentile) | If you were to switch tomorrow how much would you need to make on day 1 to meet your current financial obligations? | (PersonalFinance) Career change at 42
EXPERT | | Novice (below 75th percentile) | Where do you live? |
TIME | Mean amount of time elapsed between original post and reader’s comment, among prior comments from reader. | Fast (below 50th percentile) | So your wife still has a relationship with him? | (LegalAdvice) My wife and I are having a baby and her step dad is a child sex offender
TIME | | Slow (above 50th percentile) | If he is a registered sex offender, doesn’t he have a restriction not to be around young children based on what he did? |
LOCATION | Likely location of question asker, based on prior comments. | US | Have you also looked at the RX 480 or 580? | (PCMasterRace) Should I buy GTX 1050Ti?
LOCATION | | non-US | The 1050ti is 180 $ in India? |

Table 4: Reader groups assigned to question authors, based on prior comments.

4.1 Defining reader background

In this study, we assume that the prior background of a post reader plays a role in their information-seeking goals, and we therefore collect a limited history for a sample of readers to quantify relevant aspects of their background that may explain their information goals. [Footnote 4: We collect up to comments for 50% of all question-askers, which omits comments that were deleted between their original creation date and the time of collection Gaffney and Matias (2018). Data collected in March 2021.]

While there are many possible social attributes that can affect question asking, we choose to focus on attributes that are readily extracted from text and cover a wide variety of readers in our data. We consider the following dimensions of “background experience” to characterize the readers:

  1. EXPERT: A reader with less experience may ask about surface-level aspects of the post, while a reader with more expertise might ask about a more fundamental aspect of the post. We quantify this dimension using the proportion of prior comments that the reader made in the subreddit in which the original post was made (or a topically related subreddit). [Footnote 5: We find related subreddits for each advice subreddit by (1) computing the top-10 nearest neighbors for the subreddit in subreddit embedding space (see § 4.5) and (2) manually filtering unrelated subreddits.]

  2. TIME: A reader who replies quickly may ask about missing information that is easily corrected (e.g., clarifying terminology), while a reader who replies more slowly may ask about more complicated aspects of the writer’s request (e.g., a connection between disparate details). We quantify this with the speed of responses of the author’s prior comments, relative to the parent post.

  3. LOCATION: A reader who is based in the US may ask questions that reflect US-centric assumptions (e.g., owning a car), while a reader who is not based in the US may ask about aspects of the post that are unfamiliar to them. Using “non-US” as a single category does combine people from many different countries, but we use this category to group clear patterns among people from other countries (e.g. non-US vocabulary) and to avoid data sparsity. We quantify location with the reader’s self-identification from prior comments Welch et al. (2020) and from their posting history in location-specific subreddits. [Footnote 6: A statement such as “I live in the US” has “US” tagged by a high-precision NER system Qi et al. (2020b) and geolocated to a known geographic entity in OpenStreetMap with high confidence.] We identify all location-specific subreddits in an author’s previous posts based on whether the subreddit name can be geolocated to a known geographic entity in OpenStreetMap with high confidence (e.g., r/nyc maps to New York City). A reader $r$’s location is identified with the location-specific subreddit where $r$ writes at least 5 comments and where they write the most comments out of all location-specific subreddits in which they have written.

We summarize these definitions of “reader” in Table 4. The example questions demonstrate that readers who occupy different ends of the dimensions proposed above tend to write questions about different aspects of the post. Note that some readers may belong to multiple groups: an Expert reader could also be non-US. For simplicity, in all our models we only handle one reader group at a time and therefore split readers with multiple groups into different data points: e.g. we provide the same post and question from reader $r$ to the model once for each reader group if $r$ belongs to groups 1 and 2.
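The group definitions above can be sketched in a few lines. This is an illustrative reading of the rules in the text, not the authors' implementation: the input formats and function names are hypothetical, and the geolocation of subreddit names and the percentile thresholds are left out.

```python
from collections import Counter


def assign_location_subreddit(location_subreddits, min_comments=5):
    """Pick the location-specific subreddit with the most prior comments.

    location_subreddits: one subreddit name per prior comment made in a
    location-specific subreddit (hypothetical input format).
    Returns None if no subreddit reaches min_comments.
    """
    counts = Counter(location_subreddits)
    if not counts:
        return None
    subreddit, n_comments = counts.most_common(1)[0]
    return subreddit if n_comments >= min_comments else None


def expertise_rate(prior_comment_subreddits, target_and_related):
    """Proportion of prior comments made in the target or related subreddits."""
    if not prior_comment_subreddits:
        return None
    hits = sum(1 for s in prior_comment_subreddits if s in target_and_related)
    return hits / len(prior_comment_subreddits)


def expand_reader_groups(post, question, groups):
    """Emit one (post, question, group) data point per reader group."""
    return [(post, question, group) for group in groups]
```

For example, a reader labeled both Expert and non-US would yield two data points from `expand_reader_groups`, one per group, with the same post and question.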

4.2 Baseline model

We build all reader-aware models on top of the same baseline BART model Lewis et al. (2020), an encoder-decoder model shown to be more resistant to data noise than the typical transformer model. For fair comparison, we use the same pre-trained model (bart-base) and the same training settings for all models. [Footnote 7: Learning rate 0.0001, weight decay 0.01, Adam optimizer, 10 training epochs, batch size 2, max source length 1024 tokens, max target length 64 tokens.]

We use cross-entropy loss during training to maximize the probability of generating a question $q$ given the provided post $p$, i.e. $P(q \mid p)$.

4.3 Reader tokens

For the “reader-token” model, we add a token to the text input of the baseline model to indicate whether the reader belongs to one of the groups proposed above. This aligns with prior work on text generation Keskar et al. (2019) and translation Wang et al. (2021). The embeddings for these reader tokens are learned during training in the same way as the other text tokens. All readers who could not be assigned to a group are represented with UNK tokens.
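A minimal sketch of the input construction for the reader-token model follows. The token strings and the choice to prepend the token are assumptions made for illustration; the paper only states that a group token is added to the text input and that unassigned readers receive an UNK token.

```python
# Hypothetical token strings; the paper does not list the exact tokens.
GROUP_TOKENS = {
    "expert": "<EXPERT>",
    "novice": "<NOVICE>",
    "fast": "<FAST>",
    "slow": "<SLOW>",
    "us": "<US>",
    "non-us": "<NONUS>",
}
UNKNOWN_TOKEN = "<UNK>"


def add_reader_token(post_text, reader_group=None):
    """Prepend the reader-group token (assumed position) to the post text."""
    token = GROUP_TOKENS.get(reader_group, UNKNOWN_TOKEN)
    return f"{token} {post_text}"
```

In a real setup the new tokens would also need to be registered with the model's tokenizer so that their embeddings can be learned during training.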

4.4 Reader attention

Some work in personalization has suggested training a separate model for different author groups Welch et al. (2020), which can be expensive in terms of runtime and memory. An alternative approach is to customize a single part of the model for different reader groups. The transformer model relies on multiple layers of attention modules Vaswani et al. (2017) to determine the most important part of the input text for the generation task. For the reader-attention model, we therefore add reader-specific attention layers to represent the different focus patterns that some readers may exhibit toward the original post. In our model, we replace layer $\ell$ of attention in the encoder with a different attention module for each reader group (a layer for the Expert readers, etc.), and we dynamically switch to the module associated with group $g$ when predicting a question generated by reader group $g$. For regularization, we train a separate generic attention module at the same time as the reader-group attention, concatenate the reader attention with the generic attention, and pass the concatenated attention through a fully-connected neural network to produce the final attention distribution. [Footnote 8: We choose the hyperparameter $\ell$ through training and testing on a subset of the training data to maximize performance on BLEU-1 and ROUGE. We also experimented with other modifications, such as computing a weighted average of the attention distributions, and found that concatenation had the best performance.]
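The concatenation step can be illustrated with a toy example. This is a loose sketch, assuming that each attention module yields a score vector over the same input positions and that the fully-connected network is a single linear layer followed by a softmax; the paper does not specify these details, and all names below are hypothetical.

```python
import math


def softmax(scores):
    """Normalize scores into a probability distribution."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combine_attention(reader_attn, generic_attn, weights, bias):
    """Concatenate both attention vectors, apply one linear layer, renormalize.

    weights has one row per output position and 2n columns (assumed shape).
    """
    concat = list(reader_attn) + list(generic_attn)
    logits = [
        sum(w * x for w, x in zip(row, concat)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


def select_module(group_modules, group):
    """Switch to the attention module that belongs to the reader's group."""
    return group_modules[group]
```

With identity-like weights, the combined distribution simply favors positions that both the reader-specific and the generic module attend to.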

4.5 Reader embeddings

The previous models represent readers as discrete entities, e.g., as an “expert” or “novice” in a particular domain. For a continuous approach, we represent readers using latent embeddings based on their prior posting activity, which are generated using social and linguistic information respectively. To represent a reader’s social history, we compute an embedding based on the subreddits in which the reader commented before writing their question. [Footnote 9: We collect up to 100 comments written before the reader’s original question.] To generate subreddit embeddings, we collect the cross-posting matrix for all subreddits and all readers in our data. [Footnote 10: Each cell is set equal to the PMI of author $a$ commenting in subreddit $s$.] We then decompose the matrix into a latent representation using SVD and compute the average subreddit embedding for reader $r$ based on the reader’s prior post history in different subreddits.
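The PMI cells of the cross-posting matrix can be computed as sketched below. This assumes the standard definition PMI(a, s) = log(P(a, s) / (P(a) P(s))) with probabilities estimated from comment counts; the paper does not state how unobserved author-subreddit pairs are handled, so they are simply left out here. The SVD step is omitted.

```python
import math
from collections import Counter


def pmi_matrix(comments):
    """Map each observed (author, subreddit) pair to its PMI.

    comments: iterable of (author, subreddit) pairs, one per comment
    (hypothetical input format).
    """
    pair_counts = Counter(comments)
    author_counts = Counter()
    subreddit_counts = Counter()
    for (author, subreddit), n in pair_counts.items():
        author_counts[author] += n
        subreddit_counts[subreddit] += n
    total = sum(pair_counts.values())
    return {
        (a, s): math.log(n * total / (author_counts[a] * subreddit_counts[s]))
        for (a, s), n in pair_counts.items()
    }
```

A pair scores above zero when the author comments in that subreddit more often than their overall activity and the subreddit's overall popularity would predict.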

To represent a reader’s linguistic history, we compute an embedding based on the reader’s text from previous comments. To generate text embeddings, we first train a Doc2Vec model Le and Mikolov (2014) on all prior comments extracted from the readers (default skip-gram parameters), to represent each comment as a single document embedding. Next, we compute each reader’s average language embedding based on the text in their prior posts.

We combine the reader embedding with the input text by appending a special “author embedding” token and the reader embedding to the end of the input word embeddings, then we train the model as usual. All readers whose prior posts we were unable to collect are assigned a “dummy” embedding of zeroes.
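Both reader embeddings are described as averages over prior activity, with a zero vector for readers without history. A minimal sketch of that averaging, assuming each prior comment (or its subreddit) has already been mapped to a fixed-size vector; the function name is hypothetical:

```python
def mean_embedding(vectors, dim):
    """Average a reader's prior embeddings; fall back to zeros if none exist."""
    if not vectors:
        return [0.0] * dim
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(dim)]
```

The same helper covers the subreddit-based and the text-based embedding, since only the source of the input vectors differs.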

5 Results

5.1 Reader group differences

As a first step, we test for consistent differences in the types of questions asked by different reader groups. For each reader group in category (e.g., groups US and non-US from LOCATION), we sample questions from each subreddit, where .

Reader group | Top-3 LIWC categories (absolute frequency difference)
--- | ---
US > non-US | MONEY (0.512%), WORK (0.361%), RELATIV (0.337%)
non-US > US | FOCUSPRESENT (0.356%), FUNCTION (0.327%), AUXVERB (0.305%)
Expert > Novice | MONEY (0.207%), YOU (0.135%), FOCUSPRESENT (0.106%)
Novice > Expert | DRIVES (0.097%), AFFILIATION (0.056%), REWARD (0.055%)
Fast > Slow | YOU (0.312%), PPRON (0.225%), PRONOUN (0.160%)
Slow > Fast | DRIVES (0.105%), AFFECT (0.082%), IPRON (0.066%)

Table 5: Reader group LIWC category word usage differences. All differences are significant via Mann-Whitney U test.

We first identify stylistic and topical differences between the groups by comparing their relative rate of LIWC word usage in their questions, a method commonly used to identify linguistic differences between social groups Pennebaker et al. (2001). The results in Table 5 show consistent differences in word usage between readers of different groups. Expert readers ask more questions about money than Novices, which could indicate an assumption from prior experience that post authors’ core problems stem from their financial decisions. Similarly, US readers ask more questions about money and work than non-US readers, who often frame questions to address present-tense issues and write with more auxiliary verbs. Fast-response readers ask more personal questions about the post author (YOU, PRONOUN), which may indicate a stronger interest or empathy toward the post author, as opposed to slow-response readers who address the poster’s underlying intentions (DRIVES) and emotional behavior (AFFECT).
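The frequency differences reported in Table 5 can be sketched as a difference in category usage rates. The word set below is a made-up stand-in for a LIWC category (LIWC itself is a licensed lexicon and also supports prefix matching, which this exact-match sketch ignores), and the significance test is omitted.

```python
def category_rate(questions, category_words):
    """Percentage of all tokens in the questions that belong to the category."""
    tokens = [t.strip("?.,!") for q in questions for t in q.lower().split()]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in category_words)
    return 100.0 * hits / len(tokens)


def rate_difference(group_a, group_b, category_words):
    """Absolute frequency difference (in percentage points) between two groups."""
    return category_rate(group_a, category_words) - category_rate(
        group_b, category_words
    )
```

A positive value from `rate_difference` means the first group uses the category more often, which corresponds to one row direction (e.g. Expert > Novice) in the table.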

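Features | Reader group | Macro F1
--- | --- | ---
Question text | EXPERT | 66.9 (± 0.4)
Question text | TIME | 88.9 (± 0.3)
Question text | LOCATION | 65.8 (± 1.4)
Post + question text | EXPERT | 80.4 (± 0.4)
Post + question text | TIME | 91.4 (± 0.4)
Post + question text | LOCATION | 65.9 (± 1.4)

Table 6: Reader group prediction accuracy.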

To verify the differences in question content, we train a single-layer neural network to classify reader groups, using a latent semantic representation of the reader’s question and the related post generated by a pre-trained transformer model Sanh et al. (2019). The embeddings for the question and the post are each reduced to a lower dimension via PCA for regularization, and then concatenated. We train the classification model to convergence using Adam optimization.

Model type | BLEU-1 ↑ | ROUGE-L ↑ | Perplexity ↓ | WMD ↓ | BERT Dist. ↓ | Diversity ↑ | Redundancy ↓
--- | --- | --- | --- | --- | --- | --- | ---
Text-only (Reddit) | 0.159 | 0.128 | 264 | 0.728 | 0.233 | 0.613 | 0.187
Reader tokens | 0.159 | 0.128 | 271 | 0.731 | 0.233 | 0.675 | 0.191
Reader attention | 0.157 | 0.127 | 450 | 0.752 | 0.242 | 0.511 | 0.468
Author embedding (subreddit) | 0.153 | 0.120 | 657 | 0.746 | 0.238 | 0.744 | 0.277
Author embedding (text) | 0.154 | 0.121 | 609 | 0.745 | 0.238 | 0.732 | 0.292

Table 7: Question generation results by model (↑ indicates that higher score is better, ↓ indicates lower is better).

The prediction results are shown in Table 6. We find that the models consistently outperform the random baseline across all reader groups tested, which suggests a clear difference between the groups. Furthermore, a model trained on the combined post and question text (“post + question text”) helps prediction improve over the question text alone, which supports the hypothesis that a reader’s background is reflected in both the question they ask and the context in which the question is asked. Therefore, generating reader-specific questions requires understanding how the question relates to the original post content, in addition to the question writing style. We find an unusually high performance for TIME, which may be due to a more consistent writing style among Fast readers (e.g., high “YOU” use, see Table 5).

5.2 Question generation

Having demonstrated consistent differences between readers’ questions, we next test the ability of the question generation model to incorporate reader group information. We leverage the models proposed earlier (see § 4), as well as a text-only baseline, and train them on the same task of question generation. We use the following metrics to assess generation quality for target question $q$ and generated question $\hat{q}$: BLEU-1 (single word overlap between $q$ and $\hat{q}$); ROUGE-L (overlap in longest common sub-sequence for $q$ and $\hat{q}$); perplexity; Word Mover Distance (mean distance between word embeddings for tokens in $q$ and $\hat{q}$); BERT Distance (distance between pretrained BERT-encoder sentence embeddings for $q$ and $\hat{q}$); Diversity (% unique questions among all generated questions); and Redundancy (% generated questions that also appear in training data).
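The overlap-based metrics can be sketched without external libraries. The version below is a simplified illustration, assuming whitespace tokenization, BLEU-1 as clipped unigram precision without a brevity penalty, and ROUGE-L as the F-measure with equal weight on precision and recall; the paper's exact evaluation scripts may differ. Diversity and redundancy are returned as fractions.

```python
from collections import Counter


def bleu1(target, generated):
    """Clipped unigram precision of the generated question."""
    gen = generated.lower().split()
    if not gen:
        return 0.0
    ref_counts = Counter(target.lower().split())
    overlap = sum(min(n, ref_counts[tok]) for tok, n in Counter(gen).items())
    return overlap / len(gen)


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(target, generated):
    """F-measure over the longest common subsequence."""
    ref, gen = target.lower().split(), generated.lower().split()
    lcs = lcs_length(ref, gen)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(gen), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def diversity(generated_questions):
    """Fraction of generated questions that are unique."""
    return len(set(generated_questions)) / len(generated_questions)


def redundancy(generated_questions, training_questions):
    """Fraction of generated questions that also occur in the training data."""
    seen = set(training_questions)
    return sum(q in seen for q in generated_questions) / len(generated_questions)
```

Perplexity and the embedding-based distances depend on a trained model and are therefore not covered by this sketch.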

The aggregate results are shown in Table 7. Overall, we see that the simpler reader-aware models (tokens and attention) perform roughly the same as the text-only model via traditional BLEU and ROUGE metrics. In addition, the reader-token model outperforms the text-only model in terms of diversity, which shows that being aware of the reader may lead to more “creative” questions. The reader attention model tends to generate questions that have higher perplexity. The model may overfit to the reader groups at the cost of generating less “on-average” plausible text, which may not be bad depending on the goals of personalized language modeling Madotto et al. (2019). Lastly, the embedding models under-perform as compared to the other models, except for diversity. The concatenation of reader embedding and text embeddings may encourage the model to generate highly unusual questions that are based more on the reader’s experience than on the information provided in the original post.

5.2.1 Performance for divisive posts.

We turn next to divisive posts: e.g., can the reader-aware models perform well in cases where Expert and Novice readers ask very different questions? Considering post $p$, question $q_1$ written by an author of group 1 (e.g., Expert), and question $q_2$ written by an author of group 2 (Novice), we define the similarity $\mathrm{sim}(q_1, q_2)$ based on the cosine similarity of the latent representations of the questions, where we use a pretrained transformer model to generate the representations Sanh et al. (2019). We identify posts with pairs of questions from readers of different groups, then filter for pairs of questions that have a similarity score in the lowest percentile, for a total of 571 posts and 1142 questions. [Footnote 11: The low numbers are due to incomplete coverage of reader data for all questions and the fact that many posts don’t attract more than one question associated with a particular reader group, after data filtering (see § 3).] We label these data pairs as divisive posts, since they attract divergent responses from different types of readers.
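The selection of divisive posts can be sketched as a similarity filter over question pairs. The embeddings are assumed to be precomputed, and the `fraction` cut-off is a hypothetical parameter: the source does not preserve which percentile was used.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def divisive_posts(pairs, fraction=0.1):
    """Return post ids whose cross-group question pairs are least similar.

    pairs: list of (post_id, embedding_q1, embedding_q2) tuples.
    fraction: share of lowest-similarity pairs to keep (assumed value).
    """
    ranked = sorted(pairs, key=lambda p: cosine_similarity(p[1], p[2]))
    keep = max(1, int(len(ranked) * fraction))
    return [post_id for post_id, _, _ in ranked[:keep]]
```

Sorting by similarity and keeping the head of the list is equivalent to thresholding at the corresponding percentile of the similarity distribution.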

Model type | BLEU-1 | ROUGE-L | Perplexity
--- | --- | --- | ---
Text-only | 0.163 | 0.118 | 305
Reader token | 0.169 | 0.119 | 310
Reader attention | 0.157 | 0.116 | 385
Reader embedding (subreddit) | 0.152 | 0.117 | 746
Reader embedding (text) | 0.158 | 0.116 | 572

Table 8: Question generation results for divisive posts.

Figure 1: Model performance by question type.

We present the results of the question prediction task on these divisive posts in Table 8. We find that the reader-token model slightly outperforms the text-only baseline. This suggests that the reader-token model picks up critical information about the reader groups that is required to anticipate how the readers approach potentially divisive posts. For example, one divisive post on PersonalFinance asks for advice about how to pay for a car, and a Novice reader asks “Are you above water on the car?” while an Expert reader asks “Have you been applying for jobs all day?” Being able to accurately predict such questions is critical to helping the post author understand what different readers need to know about, e.g., a Novice reader asking about details on payment for the car as compared to an Expert reader asking about underlying financial issues.

5.3 Error analysis

We assess the relative performance of the reader-aware models across different conditions, to assess their potential value for downstream applications.

Performance by question type

First, we assess the relative performance of different models according to the type of question asked. We separate questions based on the root question word, e.g. “who,” “what,” “when.” [Footnote 12: We use the dependency parser from spaCy Honnibal and Johnson (2015) to identify root question words based on their dependency to the root verb of the question (e.g. advmod for “where” in “where do you live?”).] We compare the BLEU-1 scores of all question generation models on the specified questions, restricting to questions asked by readers with available reader data (i.e. non-UNK readers).
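As a rough stand-in for the dependency-based extraction, questions can be bucketed by the first question word they contain. This heuristic is a simplification introduced here for illustration, not the parser-based method described in the footnote, and the word list is an assumption drawn from the question types discussed in this section.

```python
# Assumed list of question words, taken from the types named in the text.
QUESTION_WORDS = (
    "who", "what", "when", "where", "why", "how",
    "do", "can", "could", "would", "should",
)


def question_word(question):
    """Return the first listed question word in the question, or None."""
    for token in question.lower().split():
        token = token.strip("?.,!\"'")
        if token in QUESTION_WORDS:
            return token
    return None
```

Unlike a dependency parse, this heuristic cannot tell whether the matched word actually attaches to the root verb, so it will mislabel some embedded clauses.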

The results are shown in Figure 1. In contrast to the aggregate results, the reader-attention model outperforms the text-only baseline for “do,” “where,” and “who” questions. All reader-aware models outperform the text-only model for “when” questions. These questions may reflect more of a focus on concrete details such as locations, times, and people mentioned by the original poster, and therefore the reader-aware models may generally identify differences among readers in terms of the details requested. The text-only model outperforms the reader-aware models for questions that are potentially more subjective, including “can,” “could,” “would,” and “should” questions. These more subjective questions may require the models to focus more precisely on the original post (e.g. a “would” question to pose a hypothetical concern about the post author’s situation), and therefore such questions may be less dependent on reader identity.

Post similarity

A helpful question should be related to the original post, but should not be so similar that it requests information that the post has already provided. We therefore assess the tendency of the models to generate semantically related questions for the given posts. We compute the similarity between each generated question $\hat{q}$ and the associated post $p$ using the maximum cosine similarity between the sentence embedding for $\hat{q}$ and each sentence in $p$. The sentence embeddings are generated using a BERT model trained to generate paraphrased data Sanh et al. (2019). [Footnote 13: While we acknowledge that metrics like QBLEU are designed to evaluate question generation specifically using token overlap between the question and related document Nema and Khapra (2018), we use latent semantic representations instead to identify semantic similarities that may not appear obvious from the text tokens alone. E.g. if a question in PCMasterRace mentions “motherboard” in response to a post, this may cause the question to score poorly with a typical overlap metric despite the semantic relevance of the question to the post.]

Figure 2: Maximum semantic similarity between each question and the sentences of the original post.

The results in Figure 2 show that the best overall models, text-only and reader-token, generate questions that are more similar to the original post than expected (cf. “target text,” i.e. ground truth). The other models show a significantly lower similarity, implying that their generated questions address new information that is not mentioned in the post itself. For example, in response to an r/Advice post about self-improvement (“I just need some tips on maybe motivating myself”), the reader text embedding model asks “What do you want to do with your life?” The generated question is less semantically similar to the original post than the target question (“Have you talked to a doctor about this?”) but addresses an underlying personal issue for the post author that only a particularly thoughtful reader would uncover.

6 Conclusion

This study proposes several modifications to existing language models to incorporate information about the reader of a given post, for the purpose of question generation. We first compare the types of questions asked on Reddit by readers of different groups, and we find consistent differences in content between reader group members. We next train and evaluate the reader-aware and reader-agnostic transformer models on the task of question generation. Using automated metrics, we find that the reader-token model tends to do best in situations when a post attracts especially diverse questions from different reader groups, and for questions that solicit concrete details from the post author. The more complicated reader-aware models also seem to be more creative than the simpler models, which could be useful in cases where more diverse perspectives can help the original post author.

6.1 Future work

One limitation of our work is the definition of reader groups. We use prior commenting behavior as a proxy for expert or novice status, but a better metric would consider the content of their comments rather than frequency. If someone asks many simple questions in their comments in PersonalFinance, our approach would consider them an expert even if the questions did not reflect expert-level knowledge. In addition, we use relative time of a reader’s comment as a proxy for their response time, but there are many factors that can influence response time besides simply thinking about a question: a reader may respond late to a post simply because they didn’t see the post until long after it was written. Future work should consider refining and “stress-testing” these definitions of reader groups to better fit the task of modeling the reader’s curiosity and information needs for question generation.

Further research should also consider refining the generation process with more fine-grained controls Keskar et al. (2019), such as the assumed difficulty of the question Gao et al. (2019). Providing more control features to the model, such as the desired similarity of the question to the original post or the question type, could produce questions that better adapt to the diversity of the data. In addition, it should be noted that not all questions benefit from reader information (see Figure 1), and further research should investigate how to choose between reader-aware and reader-agnostic models depending on the data domain and question types.

7 Ethics statement

We acknowledge that text generation is an ethically fraught application of NLP that can be used to manipulate public opinion Zellers et al. (2020) and reinforce negative stereotypes Bender et al. (2021). Our models could be modified to generate abusive questions or factually misleading questions, which we do not endorse. Furthermore, our models may be tricked into giving up private information from the training data, due to accidental memorization. We intend for our work to benefit people who share information about themselves for the purpose of gaining feedback from like-minded peers. All data used in this project was publicly available, and in our final release we will not release any data with personally identifiable information (e.g., reader location data), in order to protect the original authors.


  • D. Bamman, C. Dyer, and N. A. Smith (2014) Distributed representations of geographically situated language. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 828–834. Cited by: §2.2.
  • J. Baumgartner, S. Zannettou, B. Keegan, M. Squire, and J. Blackburn (2020) The pushshift reddit dataset. In ICWSM, Vol. 14, pp. 830–839. Cited by: §3.
  • L. Becker, S. Basu, and L. Vanderwende (2012) Mind the gap: Learning to choose gaps for question generation. In NAACL HLT 2012 - 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, pp. 742–751. External Links: ISBN 1937284204 Cited by: §2.1.
  • E. M. Bender, T. Gebru, A. McMillan-Major, and S. Shmitchell (2021) On the dangers of stochastic parrots: can language models be too big?. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 610–623. Cited by: §7.
  • M. Del Tredici, D. Marcheggiani, S. S. im Walde, and R. Fernández (2019) You shall know a user by the company it keeps: dynamic representations for social media users in nlp. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 4701–4711. Cited by: §2.2.
  • L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, J. Gao, M. Zhou, and H. W. Hon (2019) Unified Language Model Pre-training for Natural Language Understanding and Generation. In NeurIPS, External Links: 1905.03197, ISSN 23318422 Cited by: §2.1.
  • X. Du and C. Cardie (2018) Harvesting paragraph-level question-answer pairs from wikipedia. In ACL 2018 - 56th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers), pp. 1907–1917. Cited by: §2.1.
  • X. Du, J. Shao, and C. Cardie (2017) Learning to ask: Neural question generation for reading comprehension. ACL 2017 - 55th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers) 1, pp. 1342–1352. External Links: Document, 1705.00106, ISBN 9781945626753 Cited by: §2.1, §4.
  • L. Flek (2020) Returning the n to nlp: towards contextually personalized classification models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7828–7838. Cited by: §2.2.
  • L. Fu, J. P. Chang, and C. Danescu-Niculescu-Mizil (2019) Asking the right question: inferring advice-seeking intentions from personal narratives. In ACL, Cited by: §3.
  • D. Gaffney and J. N. Matias (2018) Caveat emptor, computational social science: Large-scale missing data in a widely-published Reddit corpus. PloS one 13 (7), pp. e0200162. Cited by: footnote 4.
  • Y. Gao, L. Bing, W. Chen, M. R. Lyu, and I. King (2019) Difficulty controllable generation of reading comprehension questions. In IJCAI, pp. 4968–4974. Cited by: §2.1, §6.1.
  • A. Garimella, C. Banea, D. Hovy, and R. Mihalcea (2019) Women’s syntactic resilience and men’s grammatical luck: gender-bias in part-of-speech tagging and dependency parsing. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3493–3498. Cited by: §1, §2.2.
  • V. S. Govindarajan, B. Chen, R. Warholic, K. Erk, and J. J. Li (2020) Help! need advice on identifying advice. In EMNLP, pp. 5295–5306. Cited by: §1, §3.
  • M. Honnibal and M. Johnson (2015) An improved non-monotonic transition system for dependency parsing. In Proceedings of the 2015 conference on empirical methods in natural language processing, pp. 1373–1378. Cited by: footnote 12.
  • D. Hovy and S. L. Spruit (2016) The social impact of natural language processing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 591–598. Cited by: §2.2.
  • D. Hovy (2018) The social and the neural network: how to make natural language processing about people again. In Proceedings of the Second Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in Social Media, pp. 42–49. Cited by: §2.2.
  • S. Indurthi, D. Raghu, M. M. Khapra, and S. Joshi (2017) Generating natural language question-answer pairs from a knowledge graph using a RNN based question generation model. 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017 - Proceedings of Conference 1, pp. 376–385. External Links: Document, ISBN 9781510838604 Cited by: §2.1.
  • N. S. Keskar, B. McCann, L. R. Varshney, C. Xiong, and R. Socher (2019) CTRL: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858. Cited by: §4.3, §6.1.
  • V. Kumar and A. W. Black (2020) ClarQ: A large-scale and diverse dataset for Clarification Question Generation. In ACL, pp. 7296–7301. External Links: Document, 2006.05986, ISSN 23318422 Cited by: §2.1.
  • A. Lahnala, Y. Zhao, C. Welch, J. K. Kummerfeld, L. An, K. Resnicow, R. Mihalcea, and V. Pérez-Rosas (2021) Exploring self-identified counseling expertise in online support forums. In ACL, Cited by: §3.
  • Q. Le and T. Mikolov (2014) Distributed representations of sentences and documents. In International conference on machine learning, pp. 1188–1196. Cited by: §4.5.
  • M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer (2020) BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7871–7880. Cited by: §4.2.
  • M. Liu, R. A. Calvo, and V. Rus (2012) G-Asks: An Intelligent Automatic Question Generation System for Academic Writing Support. Dialogue & Discourse 3 (2), pp. 101–124. External Links: Document, ISSN 2152-9620 Cited by: §2.1.
  • A. Madotto, Z. Lin, C. Wu, and P. Fung (2019) Personalizing Dialogue Agents via Meta-Learning. In ACL, pp. 5454–5459. Cited by: §5.2.
  • Y. Miura, M. Taniguchi, T. Taniguchi, and T. Ohkuma (2017) Unifying text, metadata, and user network representations with a neural network for geolocation prediction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1260–1272. Cited by: §2.2.
  • P. Nema and M. M. Khapra (2018) Towards a better metric for evaluating question generation systems. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3950–3959. Cited by: footnote 13.
  • W. Paik, S. Yilmazel, E. Brown, M. Poulin, S. Dubon, and C. Amice (2001) Applying natural language processing (nlp) based metadata extraction to automatically acquire user preferences. In Proceedings of the 1st international conference on Knowledge capture, pp. 116–122. Cited by: §2.2.
  • S. Pan and T. Ding (2019) Social media-based user embedding: a literature review. In IJCAI, Cited by: §2.2.
  • D. B. Park (1986) Analyzing audiences. College Composition and Communication 37 (4), pp. 478–488. Cited by: §1.
  • J. W. Pennebaker, M. E. Francis, and R. J. Booth (2001) Linguistic inquiry and word count: liwc 2001. Mahway: Lawrence Erlbaum Associates 71 (2001), pp. 2001. Cited by: §5.1.
  • P. Qi, Y. Zhang, and C. D. Manning (2020a) Stay hungry, stay focused: generating informative and specific questions in information-seeking conversations. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pp. 25–40. Cited by: §2.1.
  • P. Qi, Y. Zhang, Y. Zhang, J. Bolton, and C. D. Manning (2020b) Stanza: A Python Natural Language Processing Toolkit for Many Human Languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 101–108. Cited by: footnote 6.
  • S. Rao and H. Daumé (2019) Answer-based adversarial training for generating clarification questions. NAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference 1, pp. 143–155. External Links: ISBN 9781950737130 Cited by: §2.1.
  • V. Sanh, L. Debut, J. Chaumond, and T. Wolf (2019) DistilBERT, a distilled version of bert: smaller, faster, cheaper and lighter. In EMC² Workshop, Cited by: §5.1, §5.2.1, §5.3.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 6000–6010. Cited by: §4.4.
  • Y. Wang, C. Hoang, and M. Federico (2021) Towards modeling the style of translators in neural machine translation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1193–1199. Cited by: §4.3.
  • C. Welch, J. K. Kummerfeld, V. Pérez-Rosas, and R. Mihalcea (2020) Compositional demographic word embeddings. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 4076–4089. Cited by: §2.2, item 3, §4.4.
  • W. Xiong, J. Wu, H. Wang, V. Kulkarni, M. Yu, S. Chang, X. Guo, and W. Y. Wang (2019) TWEETQA: A Social Media Focused Question Answering Dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5020–5031. Cited by: §2.1.
  • R. Zellers, A. Holtzman, H. Rashkin, Y. Bisk, A. Farhadi, F. Roesner, and Y. Choi (2020) Defending against neural fake news. In NeurIPS, Cited by: §7.
  • J. Zhang, J. Pennebaker, S. Dumais, and E. Horvitz (2020) Configuring audiences: a case study of email communication. Proceedings of the ACM on Human-Computer Interaction 4 (CSCW1), pp. 1–26. Cited by: §1.