The emergence of community question answering (CQA) forums has transformed the way people search for information on the web. Unlike web search engines, which require an information seeker to formulate their information need as a typically short keyword query, CQA systems allow users to ask questions in natural language, with an arbitrary level of detail and complexity. Once a question has been asked, community members set out to provide an answer based on their knowledge and understanding of the question. Stack Overflow, Yahoo! Answers and Quora are popular examples of such CQA websites.
Despite their growing popularity and well-established expert communities, an increasing number of questions remain ignored and unanswered because they are too short, unclear, too specific, hard to follow, or fail to attract an expert member. To prevent such questions from being asked, the prediction of question quality has been extensively studied in the past [15, 13, 1]. There is a strong incentive to increase question quality, as it directly affects answer quality, which ultimately drives the popularity and traffic of a CQA website. However, such attempts ignore the fact that even a high-quality question may lack an important detail that requires clarification. On that note, previous work attempts to identify which aspects of a question require editing. While the need for editing can be reliably detected, predicting whether or not a question lacks important details has been shown to be difficult. In order to support an information seeker in the formulation of her question and to increase question quality, we envision the following two-step system: (1) determine whether a question requires clarification (i.e., it is unclear), and (2) automatically generate and ask clarifying questions that elicit the missing information. This paper addresses the first step. When successful, the automated identification of unclear questions is expected to have a strong impact on CQA websites, their efforts to increase question quality and the overall user experience; see Fig. 1 for an illustration.
We phrase unclear question detection as a supervised, binary classification problem and introduce the Similar Questions Model (SQM), which takes characteristics of similar questions into account. This model is compared to state-of-the-art text classification baselines, including a bag-of-words model and a convolutional neural network. Our experimental results show that this is a difficult task that can be solved to a limited extent using traditional text classification models. SQM provides a sound and extendable framework that has both comparable performance and promising options for future extensions. Specifically, the model can be used to find keyphrases for question clarification that may be utilized in a question formulation interface as shown in Fig. 1. Experiments are conducted on a novel dataset including more than 6 million labeled Stack Exchange questions, which we release for future research on this task. The dataset and sources can be found at https://github.com/jantrienes/ecir2019-qac.
2 Related Work
Previous work on question modeling of CQA forums can be roughly grouped into three categories: question quality prediction, answerability prediction and question review prediction. With respect to the prediction of question quality, user reputation has been found to be a good indicator [18]. Also, several machine learning techniques have been applied, including topic and language models [15, 1]. However, there is no single objective definition of quality, as such a definition depends on the community standards of the given platform. In this paper, we do not consider question quality itself, since a question may lack an important detail regardless of whether its perceived quality is high or low. Question answerability has been studied by inspecting unanswered questions on Stack Overflow [2]. Lack of clarity and missing information are among the top five reasons for a question to remain unanswered. Here, we do not consider other problems such as question duplication and too specific or off-topic questions. Finally, question review prediction specifically attempts to identify questions that require future editing. Most notably, Yang et al. [19] determine if a question lacks a code example, context information or failed solution attempts based on its contents. However, they disregard the task of predicting whether detail (e.g., a software version identifier) is missing and limit their experiments to the programming domain.
Clarification questions have been studied in the context of synchronous Q&A dialog systems. Kato et al. [6] analyzed how clarification requests influence overall dialog outcomes. In contrast to them, we consider asynchronous CQA systems. With respect to asynchronous systems, Braslavski et al. [3] categorized clarification questions from two Stack Exchange domains. They point out that the detection of unclear questions is a vital step towards a system that automatically generates clarification questions. To the best of our knowledge, this is the first study to address exactly this novel unclear question detection task. Finally, our study builds on recent work by Rao and Daumé III [14]. We extend their dataset creation heuristic to obtain both clear and unclear questions.
3 Unclear Question Detection
The unclear question detection task can be seen as a binary classification problem. Given a dataset of questions, $Q = \{q_1, \dots, q_n\}$, where each question belongs to either the clear or unclear class, predict the class label for a new (unseen) question $q$. In this section, we propose a model that utilizes the characteristics of similar questions as classification features. This model is compared to state-of-the-art text classification models described in Section 4.2.
We define a question to be unclear if it received a clarification question, and as clear if an answer has been provided without such clarification requests. This information is only utilized to obtain the ground truth labels. Furthermore, we emphasize that it is most useful to detect unclear questions at creation time in order to provide user feedback and prevent unclear questions from being asked (see the envisioned use case in Fig. 1). Consequently, the classification method should not utilize input signals available only after question creation, such as upvotes, downvotes or conversations in the form of comments. Finally, we do not make any assumptions about the specific representation of a question, as it depends on the CQA platform at hand. The representation we employ for our experiments on Stack Exchange is given in Section 4.1.
|Field||New Question||Similar (Unclear) Question|
|Title||Simplest XML editor||XML Editing/Viewing Software|
|Body||I need the simplest editor with utf8 support for editing xml files; It’s for a non programmer (so no atom or the like), to edit existing files. Any suggestion?||What software is recommended for working with and editing large XML schemas? I’m looking for both Windows and Linux software (doesn’t have to be cross platform, just want suggestions for both)|
|Tags||xml, utf8, editors||windows, xml, linux|
|Comments||–||What operating system?|
3.1 Similar Questions Model
The Similar Questions Model is centered around the idea that similar existing questions may provide useful indicators about the presence or absence of information. For example, consider the two questions in Table 1. It can be observed that the existing question specifies additional information after a clarification question has been raised. A classification system may extract keyphrases (e.g., operating system) from the clarification question and check whether this information is present in the given question (see Fig. 1 and Table 7 for examples). In other words, the system checks if a new question lacks information that was also missing from similar previous questions. It has been shown that this general approach can be successfully employed to find and rank suitable clarification questions [14].
|(i) Features based on $q$||Ex.|
|Question length in number of tokens||41|
|Indicator if question contains preformatted elements||0|
|Indicator if question contains a quote||0|
|Indicator if question contains question mark “?”||1|
|Coleman-Liau Index (CLI) [4]||16.7|
|(ii) Features based on $Q'$|
|Sum of similarity scores||8|
|Number of similar questions retrieved||3|
|Number of similar questions that are unclear||2|
|Number of similar questions that are clear||1|
|Majority vote of labels in $Q'$||1|
|Ratio between clear/unclear questions||0.5|
|Proportion of clear questions among similar||0.3|
|(iii) Features based on $CQ$|
|Cosine similarity between all keyphrases in $CQ$ and $q$||0.6|
|Sum of cosine similarities between the keyphrases of each $cq \in CQ$ and $q$||1|
|Like above, but weighted by the similarity between $q$ and the similar question, see Eq. 1||1|
|These features are computed for the top-$k$ similar questions in $Q'$.|
|$CQ$ is obtained from the top similar questions in $Q'$.|
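The question-only features of group (i) in Table 2 can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: it assumes the question body is HTML (as in the Stack Exchange dump), so that preformatted elements and quotes appear as `<pre>` and `<blockquote>` tags, and the function and feature names are hypothetical. The Coleman-Liau Index follows its standard definition, 0.0588 L - 0.296 S - 15.8, where L and S are the average numbers of letters and sentences per 100 words.

```python
import re


def coleman_liau_index(text: str) -> float:
    """Coleman-Liau Index: 0.0588 * L - 0.296 * S - 15.8."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    if not words:
        return 0.0
    letters = sum(ch.isalpha() for ch in text)
    # Approximate the sentence count by runs of terminators (at least one).
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    letters_per_100 = letters / len(words) * 100
    sentences_per_100 = sentences / len(words) * 100
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8


def stylistic_features(text: str) -> dict:
    """Features computed from the question alone (names are illustrative)."""
    tokens = text.split()
    return {
        "length": len(tokens),
        # Assumption: the body is HTML, as in the Stack Exchange dump.
        "contains_pre": int("<pre" in text),
        "contains_quote": int("<blockquote" in text),
        "contains_question_mark": int("?" in text),
        "readability": coleman_liau_index(text),
    }
```

All of these values are available at question creation time, which matches the requirement that the classifier must not rely on votes or comments.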
The Similar Questions Model can be formalized as follows. Given a new question $q$, we first seek a set of similar questions $Q'$ with their clear and unclear labels. As per the definition of unclear that we employ, the subset of unclear questions in $Q'$ has a set of corresponding clarification questions $CQ$. Within this framework, we design a number of indicative features that are then used to train a classifier to discriminate between the two classes. An illustration of the model can be found in Fig. 2.
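The features derived from the retrieved similar questions (group (ii) in Table 2) reduce to simple statistics over retrieval scores and labels. The sketch below is illustrative only: the input format (a list of score and label pairs), the feature names, the tie-breaking of the majority vote towards the unclear class and the fallback when no unclear question is retrieved are all assumptions.

```python
from collections import Counter


def similar_question_features(similar):
    """similar: list of (retrieval_score, label) pairs, label 1 = unclear, 0 = clear."""
    counts = Counter(label for _, label in similar)
    n_clear, n_unclear = counts[0], counts[1]
    return {
        "sim_sum": sum(score for score, _ in similar),
        "n_similar": len(similar),
        "n_unclear": n_unclear,
        "n_clear": n_clear,
        # Assumption: ties are resolved in favour of the unclear class.
        "majority": int(n_unclear >= n_clear),
        # Assumption: fall back to the clear count when no unclear question exists.
        "ratio_clear_unclear": n_clear / n_unclear if n_unclear else float(n_clear),
        "proportion_clear": n_clear / len(similar) if similar else 0.0,
    }
```

With three retrieved questions, two of them unclear, and scores summing to 8, this reproduces the example column of Table 2 (ratio 0.5, proportion of clear questions about 0.3).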
The features employed by the Similar Questions Model can be grouped into three classes: (i) features based on the question $q$ only, (ii) features based on the set of similar questions $Q'$ and (iii) features based on the set of clarification questions $CQ$. See Table 2 for a summary. We highlight the computation of the scoring features obtained from the set of clarification questions (group (iii) in Table 2). For each clarification question in $CQ$, one or more keyphrases are extracted. These keyphrases are the central objects of a clarification question and refer to an aspect of the original question that is unclear (see Table 7 for examples). Afterwards, we define the vectors $\vec{p}_q$ and $\vec{p}_{cq}$ to represent a question or clarification question, where each element indicates the number of times a keyphrase $p$ occurs in $q$ and $cq$, respectively. Then, a question clarity score is obtained by computing the cosine similarity between these vectors. The scoring features differ in the way the keyphrase vectors are created. The global model constructs a single vector consisting of all keyphrases present in $CQ$, whereas the individual model computes the sum of the scores considering each $cq \in CQ$ separately. For the individual weighted feature, the final score is given by:

$$\mathrm{score}(q) = \sum_{cq \in CQ} s_{\mathrm{clar}}(\vec{p}_q, \vec{p}_{cq}) \cdot s_{\mathrm{sim}}(q, q'_{cq}) \qquad (1)$$

where $s_{\mathrm{clar}}(\vec{p}_q, \vec{p}_{cq})$ is the cosine similarity between the keyphrase vectors, $q'_{cq} \in Q'$ is the similar question that received $cq$, and $s_{\mathrm{sim}}(q, q'_{cq})$ is the similarity between $q$ and $q'_{cq}$. This gives higher importance to keyphrases belonging to more similar questions.
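To make the three scoring features concrete, the following sketch computes the global, individual and weighted variants from tokenized keyphrases. It is one possible reading of the description above, not the reference implementation: keyphrases are treated as bags of tokens, the question vector is restricted to tokens that occur in some keyphrase, and all function and feature names are hypothetical.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def clarification_scores(question_tokens, clarifications):
    """clarifications: list of (keyphrase_tokens, similarity) pairs, one per
    clarification question; similarity is the retrieval score of the similar
    question that received the clarification question."""
    vocabulary = {t for keyphrases, _ in clarifications for t in keyphrases}
    # Keyphrase counts in the new question.
    p_q = Counter(t for t in question_tokens if t in vocabulary)
    # Global model: one vector with the keyphrases of all clarification questions.
    p_all = Counter(t for keyphrases, _ in clarifications for t in keyphrases)
    per_cq = [(cosine(p_q, Counter(kp)), sim) for kp, sim in clarifications]
    return {
        "cq_global": cosine(p_q, p_all),
        "cq_individual": sum(score for score, _ in per_cq),
        "cq_weighted": sum(score * sim for score, sim in per_cq),
    }
```

A question that already mentions "operating system" obtains a high score against a clarification question with that keyphrase, which indicates that the requested information is present.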
3.3 Learning Method
We operationalize the Similar Questions Model in a variety of ways:
- SimQ Majority
We obtain a simple baseline that classifies according to the most common label of the similar questions in $Q'$.
- SimQ Threshold
We test the scoring features in group (iii) using a threshold classifier, where a threshold $\gamma$ is learned on a held-out dataset. The label is then obtained as follows:

$$\hat{y} = \begin{cases} 0 & \text{if } x \geq \gamma \\ 1 & \text{otherwise} \end{cases}$$

where $x$ is the value of the corresponding feature, 0 refers to the clear class and 1 refers to the unclear class.
- SimQ ML
All features of the Similar Questions Model are combined and provided as input data to a machine learning classifier.
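The two simpler variants can be sketched in a few lines. The code below is a hedged illustration: it assumes that a high keyphrase score means the requested information is present (hence the clear class), that the threshold is selected by accuracy on the held-out data, and that ties in the majority vote go to the unclear class. None of these choices is fixed by the description above, and the function names are hypothetical.

```python
def predict_threshold(value, gamma):
    """Return 0 (clear) if the feature value reaches the threshold, else 1 (unclear)."""
    return 0 if value >= gamma else 1


def learn_threshold(values, labels):
    """Pick the observed feature value that maximizes held-out accuracy.

    Assumption: accuracy is the selection criterion; another metric could be
    substituted without changing the structure of the search.
    """
    best_gamma, best_accuracy = None, -1.0
    for gamma in sorted(set(values)):
        correct = sum(predict_threshold(v, gamma) == y for v, y in zip(values, labels))
        accuracy = correct / len(labels)
        if accuracy > best_accuracy:
            best_gamma, best_accuracy = gamma, accuracy
    return best_gamma


def predict_majority(similar_labels):
    """SimQ Majority: most common label among similar questions (ties: unclear)."""
    return int(2 * sum(similar_labels) >= len(similar_labels))
```

SimQ ML differs only in that all features are passed to a trained classifier instead of a single thresholded score.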
4 Experimental Setup
This section describes our experimental setup including the dataset and methods.
4.1 Dataset Creation
The Stack Exchange CQA platform is a suitable data source for our experiments. It is a network of specialized communities with topics varying from programming to Unix administration, mathematics and cooking. A data dump consisting of all questions, answers and comments submitted to the site is published regularly. For any post, a time-stamped revision history is included. We use this dump (available at https://archive.org/details/stackexchange) to create a labeled dataset consisting of clear and unclear questions.
To obtain unclear questions, we apply a heuristic that has been used in previous research to find clarification questions [3, 14]. A question is considered to be unclear when there is a comment by a different user than the original asker and that comment contains a sentence ending with a question mark. This heuristic is not perfect, as it will inevitably miss clarification requests not formulated as a question (e.g., “Please post your code.”), while it retains rhetorical questions (e.g., “Is this a real question?”). We only keep those questions where the original asker has provided a clarification in the form of a comment or question edit.
In order to gather clear questions, we extend the described heuristic as follows. A question is considered to be clear if it has neither edits, nor comments, but it has an accepted answer. An answer can be manually accepted by the question asker if they consider it to adequately answer their question. Again, this heuristic may introduce noise: an answer can make certain assumptions that would have ideally been asked as a clarification question instead of included in the answer itself (e.g., “Provided you are on system X, the solution is Y”).
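The two labeling heuristics can be summarized in a short sketch. The record layout (`asker_id`, `comments`, `edited`, `has_accepted_answer`) is hypothetical and does not correspond to the schema of the data dump; the sketch also simplifies the heuristic by not checking that the asker's response follows the clarification request in time.

```python
import re


def contains_question(comment_text):
    """True if some sentence in the comment ends with a question mark."""
    return bool(re.search(r"\?(\s|$)", comment_text))


def label_question(question):
    """Return 'unclear', 'clear', or None if the question is excluded."""
    asker = question["asker_id"]
    comments = question.get("comments", [])
    edited = question.get("edited", False)
    requests = [
        c for c in comments
        if c["user_id"] != asker and contains_question(c["text"])
    ]
    if requests:
        # Keep only questions where the asker responded by comment or edit.
        responded = edited or any(c["user_id"] == asker for c in comments)
        return "unclear" if responded else None
    if not comments and not edited and question.get("has_accepted_answer", False):
        return "clear"
    return None
```

Questions that match neither heuristic, for example those with non-question comments only, are left out of the dataset.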
We apply this heuristic to five Stack Exchange communities, each of a different size and with a different domain. The communities considered are Stack Overflow, Ask Ubuntu, Cross Validated, Unix & Linux and Super User, thus covering a broad range of topics. Table 3 summarizes the statistics of each dataset. The text has been preprocessed by combining the question title, body and tags into a single field, replacing URLs with a special token, converting every character to lower-case and removing special characters except for punctuation. Token boundaries are denoted by the remaining punctuation and whitespace. Furthermore, a minimum term-document frequency of 3 is imposed to prevent overfitting.
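The preprocessing steps listed above can be sketched as follows. The URL placeholder string, the set of punctuation characters that is retained and the restriction to ASCII letters and digits are assumptions made for illustration; the vocabulary pruning by document frequency is not shown.

```python
import re

URL_TOKEN = "urltoken"  # hypothetical placeholder for the special URL token


def preprocess(title, body, tags):
    """Combine the fields, mask URLs, lower-case and drop special characters."""
    text = " ".join([title, body, " ".join(tags)])
    text = re.sub(r"https?://\S+", URL_TOKEN, text)
    text = text.lower()
    # Assumption: keep ASCII letters, digits, whitespace and basic punctuation.
    text = re.sub(r"[^a-z0-9\s.,!?;:()'-]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Split on whitespace and punctuation, keeping punctuation marks as tokens."""
    return re.findall(r"[a-z0-9]+|[.,!?;:()'-]", text)
```

Keeping punctuation marks as separate tokens lets a bag-of-words model pick up stylistic signals such as a question mark.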
|Unix & Linux||44,936||133||162,805||31,852||27%||73%|
4.2 System Components
4.2.1 Obtaining Similar Questions
A general purpose search engine, Elasticsearch, is used with the BM25 retrieval model in order to obtain similar questions. The retrieval score is used as the similarity between a question and each of its similar questions during feature computation. We only index the training set of each community but retrieve similar questions for the entire dataset. Queries are constructed by combining the title and tags of a question. These queries are generally short (averaging 13 tokens). We also experimented with longer queries that include 100 question body tokens; while computationally more expensive, model performance remained largely unaffected. To ensure efficient querying, we remove stopwords from all queries. Finally, BM25 parameters are set to common defaults [10].
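For readers unfamiliar with BM25, the scoring function behind the retrieval step can be written in a few lines. This is a textbook formulation for illustration and does not call Elasticsearch; the parameter values `k1=1.2` and `b=0.75` are Elasticsearch's documented defaults and are assumed here, and the function name and in-memory document list are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, documents, k1=1.2, b=0.75):
    """Score each tokenized (non-empty) document against the query with BM25."""
    n_docs = len(documents)
    avg_len = sum(len(doc) for doc in documents) / n_docs
    # Number of documents that contain each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        tf = Counter(doc)
        score = 0.0
        for term in set(query_tokens):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturation with document length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Ranking the indexed training questions by this score and keeping the top results yields the set of similar questions together with their similarity values.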
4.2.2 Extracting Keyphrases
Keyphrases are extracted from clarification questions using the RAKE algorithm [16], which is an efficient way to find noun phrases. This algorithm has been used in a similar setting where CQA comments should be matched to related questions [12]. We tokenize the keyphrases and consider each token individually.
4.2.3 Similar Questions Classifier
Besides applying a threshold-based classifier on a selected set of features presented in Table 2, all features are combined to train a logistic regression classifier with L2 regularization (referred to as SimQ ML). The regularization strength is set to a single value that has been found to work well for all communities. All features are standardized by removing the mean and scaling to unit variance.
4.3 Baseline Models
The Similar Questions Model is compared with a number of baselines and state-of-the-art text classification approaches:
- Random: produce predictions uniformly at random.
- Majority: always predict the majority class (here: unclear).
- Bag-of-words logistic regression (BoW LR).
- Convolutional neural network (CNN) [7].
Within the BoW LR model, a question is represented as a vector of TF-IDF weighted n-gram frequencies. Intuitively, this approach captures question clarity on a phrase and topic level. We report model performances for unigrams ($n = 1$) and for unigrams combined with longer phrases (n-grams up to a fixed maximum length). Using 5-fold cross-validation on the training data, we select a single L2 regularization strength that works best for all communities. With respect to the CNN model, we use the static architecture variant presented in [7], consisting of a single convolutional layer, followed by a fully connected layer with dropout. Model hyperparameters (number of filters, their size, learning rate and dropout) are optimized per community using a development set. Optimal CNN parameter settings can be found in the online appendix of this paper. The network is trained with the Adam optimizer [8], a mini-batch size of 64 and early stopping. We train 300-dimensional word embeddings for each community using word2vec [11] and limit a question to its first 400 tokens (with optional padding). Out-of-vocabulary words are replaced by a special token. There are several other possible neural architectures, but an exploration of those is outside the scope of this paper.
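The bag-of-words representation used by the BoW LR baseline can be illustrated with a small TF-IDF sketch over word n-grams. The IDF variant (log(N/df) + 1), the absence of vector normalization and the function names are assumptions; established libraries differ in these details.

```python
import math
from collections import Counter


def ngrams(tokens, max_n):
    """All word n-grams of length 1..max_n, joined by spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def tfidf_vectors(token_lists, max_n=1, min_df=1):
    """Sparse TF-IDF vectors (dicts) for a list of tokenized questions."""
    docs = [Counter(ngrams(tokens, max_n)) for tokens in token_lists]
    doc_freq = Counter(term for doc in docs for term in doc)
    n_docs = len(docs)
    vectors = []
    for doc in docs:
        vectors.append({
            term: tf * (math.log(n_docs / doc_freq[term]) + 1.0)
            for term, tf in doc.items()
            if doc_freq[term] >= min_df  # prune rare terms
        })
    return vectors
```

With `max_n=2`, a phrase such as "difference between" becomes a feature of its own, which is how phrase-level patterns can enter a linear classifier.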
As the data is imbalanced, we evaluate according to the F1 score of the unclear (positive) class and the ROC AUC score. We argue that these are the most important metrics to optimize given the envisioned use case. When the classification outcome is used as a quality guard in a user interface, it is less severe to consider a supposedly clear question as unclear than to miss an unclear question entirely. We randomly divide the data for each community into 80% training and 20% testing splits. Of the training set, we use 20% of the instances for hyperparameter tuning and optimize for ROC AUC. We experimented with several class balancing methods, but the classification models were not impacted negatively by the (slight) imbalance. Statistical significance is tested using an approximate randomization test. We mark significant improvements and deteriorations at two significance levels, and indicate separately when a difference is not significant.
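Both evaluation metrics have compact definitions, sketched below for reference. The ROC AUC is computed through its pairwise interpretation (the probability that a randomly chosen unclear question is scored above a randomly chosen clear one), which is quadratic in the number of instances and therefore only suited for illustration; the function names are hypothetical.

```python
def f1_positive(y_true, y_pred):
    """F1 score of the positive (unclear = 1) class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))
```

F1 depends on the decision threshold, whereas ROC AUC only depends on the ranking of the scores, which is why both are reported.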
5 Results and Analysis
This section presents and discusses our experimental results.
|BoW LR (unigrams)||0.687||0.786||0.706||0.886||0.736||0.833||0.752||0.933|
|BoW LR (unigrams + phrases)||0.699||0.791||0.720||0.877||0.741||0.837||0.752||0.944|
Results for unclear question detection. The metrics are summarized over the five datasets using both micro-averaging and macro-averaging. F1, precision and recall are reported for the unclear class. Best scores for each metric are in boldface.
The traditional BoW LR model provides a strong baseline across all communities that outperforms both the random and majority baselines (see Table 5). The generic CNN architecture proposed in [7] does not provide any significant improvements over the BoW LR model. This suggests that a more task-specific architecture may be needed to capture the underlying problem.
We make several observations with respect to the Similar Questions Model. First, a majority vote among the labels of the top similar questions (SimQ Majority) consistently provides a significant improvement over the random baseline for all datasets (see Table 5). This simplistic model shows that the underlying concept of the Similar Questions Model is promising. Second, the scoring features that take clarification questions into consideration do not work well in isolation (see models prefixed with CQ in Table 4 and Table 5). The assumption that one can test for the presence of keyphrases without considering spelling variations or synonyms seems too strong. For example, the phrase “operating system” does not match sentences such as “my OS is X” and thus results in false positives. Finally, the SimQ ML model outperforms both the random and majority baselines, and has comparable performance with the BoW LR model. We emphasize that the SimQ ML model, in addition to classifying a question as clear or unclear, generates several valuable hints about the aspects of a question that may be unclear or missing (see demonstration in Table 7). This information is essential when realizing the envisioned user interface presented in Fig. 1, and cannot be deduced from the BoW LR or CNN models.
|Cross Validated||Super User||Stack Overflow|
|BoW LR (unigrams)||0.819||0.647||0.900||0.702||0.720||0.798||0.685||0.693||0.784|
|BoW LR (unigrams + phrases)||0.818||0.659||0.900||0.709||0.731||0.807||0.697||0.718||0.788|
5.2 Feature Analysis
To gain further insights about the performance of the Similar Questions Model, we analyze the features and their predictive power. Features considering the stylistic properties of a question itself, such as the length, readability and whether or not the question contains a question mark, are among the top scoring features (see Table 6). Other important features include the distribution of labels among the similar questions and their retrieval scores. With respect to the bag-of-words classifier, we observe that certain question topics have attracted more unclear questions. For example, a question about windows 10 is more likely to be unclear than a question about emacs. Interestingly, stylistic features are also captured (e.g., a “?” token and the special URL token). Finally, this model reveals characteristics of well-written, clear questions. For example, if a user articulates their problem in the form of “difference between X and Y,” such a question is more likely to belong to the clear class. This suggests that it may be beneficial to include phrase-level features in the Similar Questions Model to improve performance.
|SimQ ML||BoW LR|
5.3 Error Analysis and Limitations
The feature analysis above reveals a problem which is common to both the Similar Questions Model and the traditional BoW LR model. Both models suffer from a topic bias. For example, a question about emacs is more likely to be classified as clear because the majority of emacs questions are clear. Furthermore, stylistic features can be misleading. Consider a post on Stack Overflow that contains an error message. This post does not require an explicit use of a question mark as the implied question most likely is “How can this error message be fixed?”. It is conceivable to design features that take such issues into account.
A potential limitation of the Similar Questions Model is its reliance on the existence of similar questions within a CQA website. It is unclear how the model would perform in the absence of such questions. It would make an interesting experiment to process a CQA dataset in chronological order, and measure how the model’s effectiveness changes as more similar questions become available over time. However, we leave the exploration of this idea to future work.
|Title||Laptop randomly going in hibernate|
|Body||I have an Asus ROG G751JT laptop, and a few days ago my battery has died. The problem that I am encountering is that my laptop randomly goes to sleep after a few minutes of use even when plugged in […].|
|ClarQ||(20.01) Does this happen if you boot instead from an ubuntu liveusb?|
|(17.92) Did you enable allow wake timers in power options sleep?|
|(16.88) Can you pop the battery out of the mouse?|
|(16.64) Which OS are you using?|
|(16.02) Have you scanned your system for malwares?|
|Title||Does ZFS make sense as local storage?|
|Body||I was reading about ZFS and for a moment thought of using it in my computer, but than reading about its memory requirements I thought twice. Does it make sense to use ZFS as local or only for servers used as storage?|
|ClarQ||(36.11) What’s wrong with more redundancy?|
|(31.41) What kind of data are you trying to protect?|
|(30.77) How are you planning on doing backups and or disaster recovery?|
|(29.70) Is SSD large enough?|
This paper presents the first study on the challenging task of detecting unclear questions on CQA websites. We have constructed a novel dataset and proposed a classification method that takes the characteristics of similar questions into account. This approach encodes the intuition that question aspects which have been missing or found to be unclear in previous questions may also be unclear in a given new question. We have performed a comparison against traditional text classification methods. Our main finding is that the Similar Questions Model provides a viable alternative to these models, with the added benefit of generating cues as to why a question may be unclear; information that is hard to extract from traditional methods but that is crucial for supportive question formulation interfaces.
Future work on this task may combine traditional text classification approaches with the Similar Questions Model to unify the benefits of both. Furthermore, one may start integrating the outputs of the Similar Questions Model into a clarification question generation system, which at a later stage is embedded in the user interface of a CQA site. As an intermediate step, it would be important to evaluate the usefulness of the generated cues as to why a question is unclear. Finally, the work by Rao and Daumé III [14] provides a natural extension, by ranking the generated clarification questions in terms of their expected utility.
We would like to thank Dolf Trieschnigg and Djoerd Hiemstra for their insightful comments on this paper. This work was partially funded by the University of Twente Tech4People Datagrant project.
- [1] P. Arora, D. Ganguly, and G. J. F. Jones. The good, the bad and their kins: Identifying questions with negative scores in stackoverflow. In Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2015, ASONAM ’15, pages 1232–1239, New York, NY, USA, 2015. ACM.
- [2] M. Asaduzzaman, A. S. Mashiyat, C. K. Roy, and K. A. Schneider. Answering questions about unanswered questions of stack overflow. In Proceedings of the 10th Working Conference on Mining Software Repositories, MSR ’13, pages 97–100, Piscataway, NJ, USA, 2013. IEEE Press.
- [3] P. Braslavski, D. Savenkov, E. Agichtein, and A. Dubatovka. What do you mean exactly?: Analyzing clarification questions in cqa. In Proceedings of the 2017 Conference on Human Information Interaction and Retrieval, CHIIR ’17, pages 345–348, New York, NY, USA, 2017. ACM.
- [4] M. Coleman and T. L. Liau. A computer readability formula designed for machine scoring. Journal of Applied Psychology, 60(2):283, 1975.
- [5] D. Correa and A. Sureka. Chaff from the wheat: Characterization and modeling of deleted questions on stack overflow. In Proceedings of the 23rd International Conference on World Wide Web, WWW ’14, pages 631–642, New York, NY, USA, 2014. ACM.
- [6] M. P. Kato, R. W. White, J. Teevan, and S. T. Dumais. Clarifications and question specificity in synchronous social q&a. In CHI ’13 Extended Abstracts on Human Factors in Computing Systems, CHI EA ’13, pages 913–918, New York, NY, USA, 2013. ACM.
- [7] Y. Kim. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1746–1751, 2014.
- [8] D. P. Kingma and J. L. Ba. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2015.
- [9] B. Li, T. Jin, M. R. Lyu, I. King, and B. Mak. Analyzing and predicting question quality in community question answering services. In Proceedings of the 21st International Conference on World Wide Web, WWW ’12 Companion, pages 775–782, New York, NY, USA, 2012. ACM.
- [10] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008.
- [11] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS’13, pages 3111–3119, USA, 2013. Curran Associates Inc.
- [12] T. Nandi, C. Biemann, S. M. Yimam, D. Gupta, S. Kohail, A. Ekbal, and P. Bhattacharyya. IIT-UHH at SemEval-2017 Task 3: Exploring Multiple Features for Community Question Answering and Implicit Dialogue Identification. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 90–97. Association for Computational Linguistics, 2017.
- [13] L. Ponzanelli, A. Mocci, A. Bacchelli, and M. Lanza. Understanding and classifying the quality of technical forum questions. In Proceedings of the 2014 14th International Conference on Quality Software, QSIC ’14, pages 343–352, Washington, DC, USA, 2014. IEEE Computer Society.
- [14] S. Rao and H. Daumé III. Learning to ask good questions: Ranking clarification questions using neural expected value of perfect information. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2737–2746. Association for Computational Linguistics, 2018.
- [15] S. Ravi, B. Pang, V. Rastogi, and R. Kumar. Great question! question quality in community q&a. International AAAI Conference on Weblogs and Social Media, (1):426–435, 2014.
- [16] S. Rose, D. Engel, N. Cramer, and W. Cowley. Automatic keyword extraction from individual documents. Text Mining: Applications and Theory, pages 1–20, 2010.
- [17] I. Srba and M. Bielikova. A comprehensive survey and classification of approaches for community question answering. ACM Trans. Web, 10(3):18:1–18:63, Aug. 2016. ISSN 1559-1131.
- [18] Y. R. Tausczik and J. W. Pennebaker. Predicting the perceived quality of online mathematics contributions from users’ reputations. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’11, pages 1885–1888, New York, NY, USA, 2011. ACM.
- [19] J. Yang, C. Hauff, A. Bozzon, and G.-J. Houben. Asking the right question in collaborative q&a systems. In Proceedings of the 25th ACM Conference on Hypertext and Social Media, HT ’14, pages 179–189, New York, NY, USA, 2014. ACM.