Analysing the Effect of Clarifying Questions on Document Ranking in Conversational Search

08/09/2020 ∙ by Antonios Minas Krasakis, et al. ∙ University of Amsterdam

Recent research on conversational search highlights the importance of mixed-initiative in conversations. To enable mixed-initiative, the system should be able to ask clarifying questions to the user. However, the ability of the underlying ranking models (which support conversational search) to account for these clarifying questions and answers when ranking documents has not been analysed at large. To this end, we analyse the performance of a lexical ranking model on a conversational search dataset with clarifying questions. We investigate, both quantitatively and qualitatively, how different aspects of clarifying questions and user answers affect the quality of ranking. We argue that the entire conversational round of clarification needs fine-grained treatment, based on the explicit feedback that is present in such mixed-initiative settings. Informed by our findings, we introduce a simple heuristic-based lexical baseline that significantly outperforms the existing naive baselines. Our work aims to enhance our understanding of the challenges present in this particular task and inform the design of more appropriate conversational ranking models.



1. Introduction

The rise of voice-based digital assistants such as Amazon Alexa and Google Assistant has intensified the need for agents that can hold meaningful conversations with users. To this end, researchers have developed conversational systems that support question-answering and task-oriented dialogue, among others (budzianowski2018multiwoz; DBLP:conf/emnlp/ChoiHIYYCLZ18). However, in such information-seeking conversations users often fail to express their information need adequately. This makes it imperative for a conversational search system to support mixed-initiative interactions (radlinski2017theoretical; DBLP:conf/sigir/KieselBSAH18). Such a system can help users refine their information need, e.g., by disclosing new information to them (radlinski2017theoretical) or by posing clarifying questions (aliannejadi2019asking).

Clarifying questions trigger users’ explicit feedback in the form of an answer, and have been shown to improve user experience (aliannejadi2019asking; DBLP:conf/sigir/KieselBSAH18; zamani2020analyzing; DBLP:conf/chiir/BraslavskiSAD17). In Figure 1, we demonstrate examples of clarification-based conversations appearing in the conversational search dataset Qulac (aliannejadi2019asking). Specifically, we plot the most frequent user responses ($a$) to clarifying questions. Responses can be read starting from the circle centre and moving outwards (e.g., “no I am looking…”). We observe that user responses often start with a “yes” or a “no”, but frequently provide additional information (e.g., “No, I want…”). We hypothesise that this explicit feedback can be used to improve ranking, even when the question asked ranges from being partially relevant to completely irrelevant w.r.t. the information need.

However, using this feedback effectively is challenging due to: (i) the noisy nature of the natural language used in mixed-initiative conversations, (ii) the presence of complex and mixed (both positive and negative) signals that answers often convey (e.g., user: “Yes, but I would like …”), and (iii) the presence of partially relevant information in clarifying questions that have received negative user feedback (e.g., information need: “Drafting of Declaration of Independence”, $cq$: “Would you like to learn more about Thomas Jefferson?”, $a$: “No.”).

In this paper, we study the effect of the user’s feedback in mixed-initiative conversations. We categorise answers w.r.t. their polarity and length. Answer polarity indicates whether the question points to a relevant direction or not, while answer length enables us to (noisily) identify the presence of additional information in the response. We conduct our analysis on the Qulac dataset (aliannejadi2019asking), using a query likelihood model chosen because of its simplicity and transparency (ponte1998language).

In particular, we aim to answer the following research questions:

RQ1 How do the polarity and informativeness of the user’s response to a system’s clarifying question affect the performance of a term-matching ranking model in conversational search?

RQ2 How does the length of the clarifying question or the user’s response affect the performance of a term-matching ranking model in conversational search?

RQ3 How well can a simple rule-based ranking model perform, compared to existing baselines?

Figure 1. Most frequent 4-grams in answers to clarifying questions. Each word covers an arc proportional to its frequency.

Our contributions are as follows: (i) we study the answers that users provide in a conversation, and categorise them based on polarity and length using simple yet effective heuristics; (ii) we conduct an in-depth analysis of the performance of the QL model, for those answer types; (iii) we study the effect of the clarification question and answer on ranking performance, when used in isolation, combined or ignored, and (iv) we design a simple heuristic ranker, which outperforms the baseline lexical model significantly, solely by incorporating information about the answer type.

2. Experimental Setup

Data. We run our experiments on the task of document ranking. Unlike typical ad-hoc search, here we assume that after the initial query ($q$) posed by a user, a clarifying question ($cq$) is asked by the system, followed by a user’s answer ($a$). We use the document corpus and single-round conversations provided by Qulac, the only large-scale conversational search dataset with clarifying questions we are aware of (aliannejadi2019asking). Qulac is built on top of the TREC Web Track 2009-2012 data and consists of 10K query-question-answer tuples for 198 TREC topics with 762 facets, with each Q&A corresponding to a facet. We randomly hold out 40 topics for testing the performance of the heuristic ranker (Section 3.3) and perform all our analysis on the rest of the topics.

Retrieval Model. We use a KL divergence query-likelihood model with Dirichlet prior smoothing (ponte1998language). To adapt this model to a conversational setting, we initially experimented with expanding $q$ with the rest of the clarification round ($cq$ and $a$). However, preliminary results indicated poor ranking performance, since the importance of the topic (expressed by $q$) is largely underestimated. To mitigate this issue, we follow (aliannejadi2019asking) and linearly interpolate the scores of the original query ($q$) and of the clarification round ($cq$ and $a$, concatenated), giving the two parts equal interpolation weight. We use NDCG@20 for evaluation.
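The score interpolation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses a plain Dirichlet-smoothed query log-likelihood rather than a full KL-divergence retrieval engine, whitespace tokenisation, a smoothing parameter `mu=2000`, and an equal weight modelled as `lam=0.5`. All function names, parameter values and the toy documents are assumptions made for the example.

```python
import math
from collections import Counter


def ql_score(query_terms, doc_terms, coll_tf, coll_len, mu=2000.0):
    """Dirichlet-smoothed query log-likelihood of one document."""
    doc_tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        p_coll = coll_tf.get(term, 0) / coll_len
        if p_coll == 0.0:
            continue  # assumption: skip terms that never occur in the collection
        score += math.log((doc_tf[term] + mu * p_coll) / (doc_len + mu))
    return score


def interpolated_score(q, clarification, doc, coll_tf, coll_len, lam=0.5, mu=2000.0):
    """Linearly mix the score of q with the score of the clarification round."""
    return (lam * ql_score(q, doc, coll_tf, coll_len, mu)
            + (1.0 - lam) * ql_score(clarification, doc, coll_tf, coll_len, mu))


# Toy collection (hypothetical documents, for illustration only).
docs = {
    "d1": "atari arcade games online play".split(),
    "d2": "atari history company founded".split(),
}
coll_tf = Counter(t for d in docs.values() for t in d)
coll_len = sum(coll_tf.values())

q = ["atari"]
clarification = "would you like to play atari arcade games online no".split()

ranking = sorted(
    docs,
    key=lambda d: interpolated_score(q, clarification, docs[d], coll_tf, coll_len),
    reverse=True,
)
print(ranking)
```

Scoring the two parts separately keeps the short original query from being swamped by the much longer clarification text, which is the issue the paragraph above reports for naive query expansion.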

3. Results

In this section, we present our experimental results w.r.t. the RQs.

3.1. Answer polarity and informativeness

In this section we aim to answer RQ1 by measuring the performance of the retrieval model described in the previous section per answer type, using different parts of the conversation round. To do so, we identify two important characteristics of the user’s response to a system’s clarifying question, namely: (i) the polarity of the answer, and (ii) the informativeness of the answer. Answer polarity gives us a strong signal about the question being asked, and whether it was relevant to the user’s information need. We define four polarity classes: Positive, Negative, “I don’t know”, and Other (the Other category refers to the answers that did not fit the other three categories). Motivated by Figure 1, we use a simple yet effective heuristic to annotate positive and negative polarity in an answer. More specifically, we tag an answer as Positive if it contains the term “yes”, and as Negative if it contains the term “no”.
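A minimal sketch of this annotation heuristic is shown below. The “yes”/“no” term match follows the text. The tokenisation, the phrase match for “I don’t know”, the order in which the rules are checked, and the label strings are assumptions made for the example.

```python
import re


def tokenize(answer):
    """Lowercase, normalise curly apostrophes, and split into word tokens."""
    return re.findall(r"[a-z']+", answer.lower().replace("\u2019", "'"))


def classify_answer(answer):
    """Return a (polarity, length) pair for a user answer."""
    tokens = tokenize(answer)
    text = " ".join(tokens)
    if "i don't know" in text or "i do not know" in text:
        polarity = "idk"       # assumption: detected by phrase match
    elif "yes" in tokens:      # whole-token match, so "know" does not count as "no"
        polarity = "positive"
    elif "no" in tokens:
        polarity = "negative"
    else:
        polarity = "other"
    length = "single" if len(tokens) <= 1 else "multi"
    return polarity, length


for example in ["Yes", "No, I want to find a recipe", "I don\u2019t know", "I am looking for reviews"]:
    print(example, "->", classify_answer(example))
```

Matching on whole tokens rather than substrings matters here, since a substring test would tag every “I don’t know” answer as Negative through the word “know”.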

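| Answer Polarity | Answer Length | # samples | $q$ | $q+cq$ | $\Delta$ (%) | $q+a$ | $\Delta$ (%) | $q+cq+a$ | $\Delta$ (%) |
|---|---|---|---|---|---|---|---|---|---|
| Positive | single | 364 | 0.191 | 0.206 | +7.9 | 0.191 | +0.0 | 0.206 | +7.9 |
| Positive | multi | 1275 | 0.162 | 0.162 | +0.0 | 0.163 | +0.6 | 0.188 | +16.0 |
| Negative | single | 580 | 0.130 | 0.106 | -18.5 | 0.129 | -0.8 | 0.106 | -18.5 |
| Negative | multi | 3791 | 0.132 | 0.117 | -11.4 | 0.162 | +22.7 | 0.159 | +20.5 |
| Other | single | 47 | 0.177 | 0.147 | -16.9 | 0.085 | -52.0 | 0.170 | -4.0 |
| Other | multi | 1729 | 0.153 | 0.132 | -13.7 | 0.169 | +10.5 | 0.171 | +11.8 |
| “I don’t know” | multi | 346 | 0.162 | 0.141 | -13.0 | 0.109 | -32.7 | 0.143 | -11.7 |

Table 1. QL performance per answer type and length (NDCG@20). $q$ is the original query, $cq$ is the clarifying question and $a$ is the answer to the clarifying question. Single refers to answers of length 1 and multi refers to answers of length greater than 1. We measure $\Delta$ (%) w.r.t. $q$ and indicate significant differences (2-sided test) with †.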
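For readers who want to reproduce the arithmetic behind Table 1, the sketch below computes NDCG@k for a ranked list and the relative change of a score w.r.t. a baseline. The exponential gain `2^rel - 1` and the `log2` discount are one common NDCG formulation and are an assumption here, as are all function names; the source does not state which variant was used.

```python
import math


def dcg_at_k(rels, k):
    """Discounted cumulative gain over the first k relevance grades."""
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))


def ndcg_at_k(ranked_rels, judged_rels, k=20):
    """NDCG@k: DCG of the ranking divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(judged_rels, reverse=True), k)
    return dcg_at_k(ranked_rels, k) / ideal if ideal > 0 else 0.0


def relative_change(baseline, value):
    """Percentage change of value w.r.t. baseline, rounded to one decimal."""
    return round(100.0 * (value - baseline) / baseline, 1)


# Relative changes recomputed from scores reported in Table 1.
print(relative_change(0.191, 0.206))  # positive, single-word, full round
print(relative_change(0.130, 0.106))  # negative, single-word, full round
```

Recomputing the changes from the reported scores is also how the sign of each percentage in the table can be checked.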

3.1.1. Ranking with full conversation.

Here we discuss the ranker’s performance when using the whole clarification round ($q+cq+a$). We discuss the results shown in Table 1 per answer type below, focusing on the relative improvement or decrease w.r.t. using only $q$.

Positive answers. We observe that performance is improved when the answer is positive for both single-word and multi-word answers. This suggests that both the question and answer contain terms that complement the description of the information need and help the QL model rank relevant documents higher.

Negative answers. We observe that the performance drops substantially when the user answer is single-word (“no”). In contrast, when the user answer is multi-word, the clarification round significantly improves the ranking performance. This indicates that even in the presence of conflicting or misleading clarifying questions, a simple term-matching model can benefit from the user’s answer.

Other answers. We observe an improvement when the user answer is multi-word, similarly to what we observed for negative answers.

“I don’t know” answers. We observe that it is preferable to ignore the clarification round altogether. This is likely related to the data collection strategy followed in (aliannejadi2019asking), where the crowd-workers were instructed to respond “I don’t know” to questions that cannot be answered in the context of the annotation task. Such questions include personal questions (e.g., “Do you have diabetes?”) or questions irrelevant to the information need.

3.1.2. Ranking with part of the conversation.

In this experiment, we aim to investigate how the performance of the QL ranker is affected when we take into account the clarification round only partially, i.e., when we ignore the answer ($q+cq$) or the question ($q+a$). To this end, we first study how the performance is affected when we have positive multi-word answers, followed by cases where the answer polarity changes.

Positive multi-word answers. In Table 1, we see that ignoring part of the clarification round (i.e., using $q+cq$ or $q+a$) seems to diminish any improvements that would occur in the presence of the entire round ($q+cq+a$). This indicates that, in those cases, the value of the clarification does not lie in the question or the answer in isolation, but in their combination, implying that the question and the answer could contain complementary information. To investigate this further, in Figure 2 we plot the performance differences of the QL ranker when using the full clarification round ($q+cq+a$) w.r.t. $q$. For the same conversation, we also plot the difference when using only the question on the x-axis ($q+cq$) and only the answer on the y-axis ($q+a$). This enables us to better understand how the performance of the overall clarification is affected by the question and the answer. We see that when $cq$ and $a$ follow the same trend, i.e., both improve (see quarter 2) or both harm the performance (see quarter 4), their combination follows, with a few exceptions. However, when questions harm but answers improve (see quarter 1), combining them in a conversational round can either improve or harm the performance. In the reverse case (see quarter 3), we observe a more robust positive impact of the combination of question and answer ($q+cq+a$). After examining a few examples, we observe that the main cause of this is that questions often contain crucial information, without which the answer is incomplete in isolation (e.g., $q$: “pork tenderloin”, $cq$: “Would you like to make a rub for pork tenderloin?”, $a$: “Yes, but I need to find a recipe”).
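The quarter numbering used in this discussion (1: question harms, answer improves; 2: both improve; 3: question improves, answer harms; 4: both harm) can be made explicit with a small helper. This is only an illustrative sketch: the function name, the handling of points that lie on an axis, and the sample differences are hypothetical and not taken from the dataset.

```python
from collections import Counter


def quarter(delta_question, delta_answer):
    """Map (x, y) = (change with question only, change with answer only) to a quarter."""
    if delta_question < 0 and delta_answer > 0:
        return 1  # question harms, answer improves
    if delta_question > 0 and delta_answer > 0:
        return 2  # both improve
    if delta_question > 0 and delta_answer < 0:
        return 3  # question improves, answer harms
    if delta_question < 0 and delta_answer < 0:
        return 4  # both harm
    return None   # assumption: points on an axis are left unassigned


# Hypothetical (delta_question, delta_answer) pairs, one per conversation.
points = [(-0.02, 0.05), (0.03, 0.04), (0.06, -0.01), (-0.04, -0.03), (0.0, 0.02)]
counts = Counter(quarter(x, y) for x, y in points)
print(counts)
```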

$q$ in positive vs. negative answers. Another interesting observation from Table 1 is that the performance when only using the original query ($q$) is quite low for negative answers (second group) compared to positive answers (first group). It is important to highlight that this performance is completely unaffected by the questions asked and answers received, since it only takes into account the original query. Our hypothesis is that this is a side-effect of the process through which the clarifying questions were collected in (aliannejadi2019asking). Specifically, the crowd-workers were instructed to read the first two result pages of a commercial search engine using the original query (topic) and compose a clarifying question based on those. This created a bias in the question generation towards more popular information needs. Note that this bias is determined by the search engine used by each crowd-worker and could even be a desirable characteristic of the dataset.

To validate this hypothesis, in Table 2 we analyse the percentage of positive and negative answers received per facet against the performance of the facet using only the original query ($q$), i.e., ignoring the clarification round ($cq$ and $a$). We see that there is a significant correlation between facet performance and the percentage of clarifying questions that point in the correct direction. This suggests that identifying easier facets (those with higher performance when using $q$) is an easier task for the crowd-workers. This is mainly because the facets with higher performance naturally have more relevant documents at the top of the ranked list, and hence it is more likely for a crowd-worker to observe a facet in a relevant document and ask a question about it.

Figure 2. Scatter plot of the performance difference (w.r.t. $q$) on long, positive answers.

rest 0.208
Table 2. Correlation between performance when using $q$ and the percentage of positive and negative answers per facet.

3.2. Clarifying Question and Answer Length

| Inf. need | $q$ | $cq$ | $a$ | Increase |
|---|---|---|---|---|
| Drafting of Declaration of Independence | all men are created equal | Would you like to learn more about Thomas Jefferson? | No | +18.53 |
| Information about Atari arcade games | | Would you like to play atari arcade games online? | No | +15.97 |
| How is a credit score determined? | credit report | Would you like to know about the process of disputing credit report? | No | +13.57 |

Table 3. Qualitative analysis of clarification rounds with single-word negative answers.

| X | Y | Pearson’s r | p-value |
|---|---|---|---|
| # tokens in | | 0.071 | |
| # tokens in | | 0.130 | |
| # tokens in | | 0.049 | |

Table 4. Pearson correlation between the length of the question and answer and the performance difference w.r.t. the original query ($q$).

Here, we aim to study the effect of the length of questions and answers on the performance (RQ2), as it is unclear whether longer conversations would result in more informative queries and better results. Traditional IR models have been thoroughly tested with keyword-based queries. However, in the mixed-initiative setting many irrelevant terms would appear, even after stop-word removal.

Table 4 shows that significant correlations exist between the length of questions or answers and the improvement in performance when those are added to the query. This correlation is strongest for answers, which is expected, since users have concrete knowledge of the information need in our setup. For questions, we observe the same trend. This is important, as most answers are negative (see Table 1), and it highlights that even questions that receive negative answers contain helpful information for ranking, since they reduce the term mismatch. To further clarify why this happens, we provide examples where questions received negative, single-word responses (implying that no meaningful information was provided in the clarification round), but resulted in a big increase in performance (Table 3). We observe that clarifying questions introduce a number of contextually relevant words. For instance, Thomas Jefferson is a very relevant entity w.r.t. the drafting of the Declaration of Independence and his name regularly appears in the relevant results, although the query is not strictly connected to him. Likewise, “arcade” and “games” are relevant words that were not previously mentioned and help improve the ranking. Furthermore, words such as “disputing” are much less relevant to the information need, but they still appear in the relevant documents, because those documents also cover such topics (e.g., disputing credit reports). This indicates that taking into account such conversations has, to a certain extent, similar effects to query expansion and can prove helpful, even when the added terms are not strictly relevant to the information need.
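The length analysis above amounts to computing a Pearson correlation between token counts and per-conversation performance differences. A minimal sketch in plain Python follows. The sample values are hypothetical and only illustrate the computation, and the sketch does not compute a p-value (a statistics library such as SciPy's `scipy.stats.pearsonr` returns both the coefficient and a p-value).

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Hypothetical data: number of tokens in each answer and the change in
# ranking performance when that answer is added to the original query.
answer_lengths = [1, 3, 5, 8, 12]
deltas = [-0.02, 0.00, 0.01, 0.03, 0.02]

r = pearson_r(answer_lengths, deltas)
print(round(r, 3))
```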

3.3. Heuristic Ranker

Based on the insights from the results presented so far, we design a ranker that heuristically classifies the user responses using polarity and length, and chooses which parts of the clarification round ($cq$, $a$) to use (RQ3). Following the observations of Table 1, the ranker interpolates the original query ($q$) with: (i) only $a$ when answers are multi-word negative, (ii) $cq+a$ for all answers that are positive, or “Other” and multi-word, and (iii) nothing elsewhere (i.e., it ignores both question and answer and only uses $q$), where clarifications do not seem to help. We evaluate our model on a held-out test set and report the results in Table 5. We observe improvements compared to all of the baselines, despite the simplicity of the proposed model (the improvement w.r.t. the best compared model is significant). This suggests that using more advanced techniques to understand and effectively incorporate conversational feedback is likely to improve ranking in this task.
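The selection rules of the heuristic ranker can be sketched as a small lookup. Note that the symbols naming the parts used in rules (i) and (ii) are missing from the extracted text. The choice below (answer only for multi-word negative answers, question plus answer for positive and multi-word “Other” answers) is inferred from the best-performing columns of Table 1 and should be read as an assumption. The function names and label strings are likewise hypothetical.

```python
def select_parts(polarity, length):
    """Return the parts of the clarification round to interpolate with q."""
    if polarity == "negative" and length == "multi":
        return ("a",)                      # rule (i): answer only
    if polarity == "positive" or (polarity == "other" and length == "multi"):
        return ("cq", "a")                 # rule (ii): question and answer
    return ()                              # rule (iii): use q alone


def expansion_text(cq, a, polarity, length):
    """Concatenate the selected parts into the text scored alongside q."""
    parts = {"cq": cq, "a": a}
    return " ".join(parts[name] for name in select_parts(polarity, length))


print(expansion_text(
    "Would you like to learn more about Thomas Jefferson?",
    "No, I want the full text",
    polarity="negative",
    length="multi",
))
```

An empty result means the ranker falls back to scoring the original query on its own, which covers single-word negative, single-word “Other”, and “I don’t know” answers.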

| Ranker | NDCG@20 |
|---|---|
| Heuristic ranker | 0.171 |

Table 5. Comparison of the heuristic ranker with variations of the QL model.

4. Conclusions and Future Work

In this work, we provided insights on the task of document ranking with clarification-based conversations. We highlighted the importance of effectively understanding and incorporating explicit conversational feedback, and demonstrated the challenges involved by quantitative and qualitative means. We argued that a more fine-grained treatment of the conversations is crucial to the success of conversational search and proposed a heuristic ranking model, which addresses part of the problem despite its simplicity. As future work, we plan to expand our study to account for recently developed neural ranking models for this task (hashemi2020guided). We also aim to explore more sophisticated methods for classifying answers and to develop ranking models that can better incorporate explicit conversational feedback.

This research was supported by the NWO Innovational Research Incentives Scheme Vidi (016.Vidi.189.039), the NWO Smart Culture - Big Data / Digital Humanities (314-99-301), the H2020-EU.3.4. - SOCIETAL CHALLENGES - Smart, Green And Integrated Transport (814961), and the Google Faculty Research Awards program. All content represents the opinion of the authors, which is not necessarily shared or endorsed by their respective employers and/or sponsors.