Question rewriting? Assessing its importance for conversational question answering

by Gonçalo Raposo, et al.

In conversational question answering, systems must correctly interpret the interconnected interactions and generate knowledgeable answers, which may require the retrieval of relevant information from a background repository. Recent approaches to this problem leverage neural language models, although different alternatives can be considered in terms of modules for (a) representing user questions in context, (b) retrieving the relevant background information, and (c) generating the answer. This work presents a conversational question answering system designed specifically for the Search-Oriented Conversational AI (SCAI) shared task, and reports on a detailed analysis of its question rewriting module. In particular, we considered different variations of the question rewriting module to evaluate the influence on the subsequent components, and performed a careful analysis of the results obtained with the best system configuration. Our system achieved the best performance in the shared task and our analysis emphasizes the importance of the conversation context representation for the overall system performance.


1 Introduction

Conversational question answering extends traditional Question Answering (QA) by involving a sequence of interconnected questions and answers [Choi2018]. Systems addressing this problem need to understand an entire conversation flow, often using explicit knowledge from an external datastore to generate a natural and correct answer for the given question. One way of approaching this problem is to divide it into 3 steps (see Fig. 1): initial question rewriting, retrieval of relevant information regarding the question, and final answer generation.

In a conversational scenario, questions may contain acronyms, coreferences, ellipses, and other natural language elements that make it difficult for a system to understand the question. Question rewriting aims to solve this problem by reformulating the question and making it independent of the conversation context [Elgohary2019], which has been shown to improve system performance [Vakulenko2021].

After an initial understanding of the question and its conversational context, the next challenge is the retrieval of relevant information to use explicitly in the answer generation [Dalton2020]. For this step, the rewritten question is used as a query to an external datastore, and thus the performance of the initial rewriting module can affect the conversational passage retrieval [Vakulenko2021a].

The last module has the task of generating an answer that incorporates the retrieved information conditioned on the rewritten question. The Question Rewriting in Conversational Context (QReCC) dataset [Anantha2021] brings these tasks together, supporting the training and evaluation of neural models for conversational QA.

This work presents a conversational QA system implemented according to the dataset and task definition of the Search-Oriented Conversational AI (SCAI) QReCC 2021 shared task, specifically focusing on the question rewriting module. Besides evaluating the system performance as a whole, using many variations of the question rewriting module, our work highlights the importance of this module and how much it impacts the performance of subsequent ones.

2 Conversational Question Answering

Figure 1: Proposed conversational question answering system. Question rewriting is performed using T5, passage retrieval using BM25, and answer generation using Pegasus. Dashed lines represent different inputs explored for question rewriting.

To perform conversational question rewriting, the proposed system uses the model named castorini/t5-base-canard from HuggingFace [Wolf2020]. This is a T5 model [Raffel2020] that was fine-tuned for question rewriting using the CANARD dataset [Elgohary2019]. No further fine-tuning was performed with QReCC data.

In order to incorporate relevant knowledge when answering the questions, our system uses a passage retrieval module built with Pyserini [Lin2021b], i.e., an easy-to-use Python toolkit that allows searching over a document collection using sparse and dense representations. In our implementation, the relevant passage retrieval is performed using the BM25 ranking function [Robertson2009], with its parameters set to and . This function is used to retrieve the top-10 most relevant passages.
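The ranking step described above can be illustrated with a minimal, standard-library sketch of BM25 scoring. It is not the Pyserini implementation used by the system: the tokenizer is a simplification, the function names are hypothetical, and the defaults for `k1` and `b` are placeholders, since the values used in this work are not recoverable from the text.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (a simplification)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query, passages, k1=0.9, b=0.4, top_k=10):
    """Return (passage_index, score) pairs for the top_k passages under BM25.

    k1 and b are placeholder values, not the ones used in the paper.
    """
    docs = [tokenize(p) for p in passages]
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of passages containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))

    scores = []
    for idx, doc in enumerate(docs):
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            df = doc_freq[term]
            # Non-negative idf variant, as used by Lucene-style BM25.
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Length normalization: longer-than-average passages are penalized.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append((idx, score))

    # Highest score first; keep only the top_k passages.
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]
```

Because the rewritten question is the query, any entity the rewrite omits (a first name, for instance) contributes nothing to the score, so retrieval quality depends directly on the rewriting step.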

Since our system needs to extract the most important information from the retrieved passages, which are often large, we used a Transformer model pretrained for summarization. We chose the Pegasus model [Zhang2020], more specifically, the version google/pegasus-large, which can handle inputs up to 1024 tokens.

We further fine-tuned the Pegasus model for 10 epochs in the task of answer generation, which can be seen as a summarization of the relevant text passages conditioned on the rewritten question. In detail, the training instances used the ground truth rewritten question concatenated with the ground truth passages (and additional ones retrieved with BM25), together with the ground truth answers as the target.
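To make the input construction concrete, the following sketch concatenates a rewritten question with ranked passages under a fixed token budget. It is only an illustration: whitespace tokens stand in for the model's subword tokenizer, the concatenation format is an assumption (the text does not specify it), and `build_generation_input` is a hypothetical helper name.

```python
def build_generation_input(question, passages, max_tokens=1024):
    """Concatenate the question with passages, truncating to max_tokens.

    Whitespace tokens approximate the real subword tokenizer (assumption).
    Returns the input text and the number of passages that contributed
    at least one token.
    """
    tokens = question.split()
    used_passages = 0
    for passage in passages:
        remaining = max_tokens - len(tokens)
        if remaining <= 0:
            # Budget exhausted: lower-ranked passages are dropped entirely.
            break
        tokens.extend(passage.split()[:remaining])
        used_passages += 1
    return " ".join(tokens), used_passages
```

Under a scheme like this, passages ranked lower by the retriever are the first to be cut, so only the top few retrieved passages can influence the generated answer.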

3 Evaluation

3.1 Experimental Setup

The dataset used for both training and evaluation was the one used in the SCAI QReCC 2021 shared task, which is a slight adaptation of the QReCC dataset. The training data contains 11k conversations with 64k question-answer pairs, while the test data contains 3k conversations with 17k question-answer pairs. For each question-answer pair, we also have the corresponding ground truth rewrites and passages, which are not considered during testing (unless specified otherwise).

To evaluate each module, we used the same automatic metrics as the shared task: ROUGE1-R [Lin2004] for question rewriting, Mean Reciprocal Rank (MRR) for passage retrieval, and F1 plus Exact Match (EM) [Rajpurkar2016] for answer generation. We additionally used ROUGE-L for assessing answer generation.
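For reference, the metrics named above can be sketched in a few lines of standard-library Python. These are simplified versions written for illustration: the tokenizer (lowercasing, splitting on non-alphanumeric characters) is an assumption, and the official shared-task scripts may normalize text differently, so absolute values need not match the reported ones.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (assumed normalization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1_recall(candidate, reference):
    """Fraction of reference unigrams that also appear in the candidate."""
    cand, ref = Counter(tokenize(candidate)), Counter(tokenize(reference))
    if not ref:
        return 0.0
    return sum((cand & ref).values()) / sum(ref.values())


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant passage, or 0 if none was retrieved."""
    for rank, passage_id in enumerate(ranked_ids, start=1):
        if passage_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def token_f1(prediction, truth):
    """Harmonic mean of token-level precision and recall."""
    pred, gold = tokenize(prediction), tokenize(truth)
    common = sum((Counter(pred) & Counter(gold)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(gold)
    return 2 * precision * recall / (precision + recall)


def exact_match(prediction, truth):
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(tokenize(prediction) == tokenize(truth))
```

With this tokenization, the rewrite "What were the circumstances of Dunn's death?" scored against "What were the circumstances of Ryan Dunn's death?" covers 8 of the 9 reference unigrams, giving a ROUGE1-R of about 0.889.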

3.2 Results

3.2.1 Question Rewriting Input

We first studied different inputs to the question rewriting module in terms of the conversation history. Instead of using the original questions, one could replace them with the corresponding previous model rewrites. Moreover, one could use only the questions or also include the answers generated by the model. Regarding the length of the conversation history considered for question rewriting, we use all the most recent interactions that fit in the input size supported by the model.
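The input variations described above can be sketched as follows. This is an illustration under stated assumptions rather than the system's actual code: the `" ||| "` separator is assumed to be the format expected by the CANARD-tuned rewriter, the 512-token budget and whitespace token counting are placeholders, and `build_rewriting_input` and the dictionary keys are hypothetical names.

```python
def build_rewriting_input(turns, question, use_rewrites=False,
                          use_answers=False, max_tokens=512, sep=" ||| "):
    """Build the rewriter input from the conversation history.

    turns: list of dicts with keys 'question', 'rewrite' (model rewrite),
    and 'answer' (model answer), oldest first. The separator and the token
    budget are assumptions, not values confirmed by the paper.
    """
    budget = max_tokens - len(question.split())
    kept_turns = []
    # Walk backwards so the most recent interactions are kept first.
    for turn in reversed(turns):
        utterances = [turn["rewrite"] if use_rewrites else turn["question"]]
        if use_answers:
            utterances.append(turn["answer"])
        cost = sum(len(u.split()) for u in utterances)
        if cost > budget:
            # Older interactions no longer fit in the model input.
            break
        kept_turns.append(utterances)
        budget -= cost
    # Restore chronological order and append the current question last.
    history = [u for utterances in reversed(kept_turns) for u in utterances]
    return sep.join(history + [question])
```

The four variations correspond to the two flags: `use_rewrites` switches between original questions (Q) and model rewrites (MR), and `use_answers` adds the model answers (MA).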

Description | Rewriting Input | Rewriting (ROUGE1-R) | Retrieval (MRR) | Generation (F1) | Generation (EM) | Generation (ROUGE-L)
No rewriting () | - | 0.571 | 0.061 | 0.136 | 0.005 | 0.143
No rewriting () | - | 0.571 | 0.145 | 0.155 | 0.003 | 0.160
Questions | (Q) + Q | 0.673 | 0.158 | 0.179 | 0.011 | 0.181
Questions + answers | (Q + MA) + Q | 0.681 | 0.150 | 0.179 | 0.010 | 0.181
Rewritten questions | (MR) + Q | 0.676 | 0.157 | 0.187 | 0.010 | 0.188
Rewritten + answers | (MR + MA) + Q | 0.685 | 0.149 | 0.189 | 0.010 | 0.191
Ground truth rewritten | - | 1 | 0.385 | 0.302 | 0.028 | 0.293
Table 1: Evaluation of multiple variations of the input used in the question rewriting module: Question (Q), Model Answer (MA), Model Rewritten (MR).

The results of our analysis are shown in Table 1, where we observe that the system that did not perform question rewriting had the worst performance, especially when only the last question is considered ().

When introducing question rewriting, we explored 4 variations of the question rewriting input, all exhibiting higher scores than without question rewriting. In particular, the highest scores occur in only 2 of them: when using only the questions and when using both the model rewritten questions and model answers. The variation where the system does not use model outputs in the question rewriting should be more resilient to diverging from the conversation topic.

When we used the ground truth rewritten questions instead of performing question rewriting, the performance of the passage retrieval and answer generation components increased markedly (in Table 1, MRR rises from at most 0.158 to 0.385 and F1 from at most 0.189 to 0.302), highlighting the importance of a good question rewriting.

3.2.2 Impact of Question Rewriting

After this initial evaluation, we used the system with the highest F1 score (rewriting using model rewritten questions and model answers) to further evaluate the impact of question rewriting. We computed the evaluation metrics for each sample and used the scores to classify the results into different splits reflecting result quality, allowing us to analyze a module’s performance when the previous ones succeeded (✓) or failed (✗).

To classify the performance of the question rewriting module using ROUGE scores, we used the 3rd quartile of the score distribution as a threshold (shown in Fig. 2(a)), since no single value yields an unambiguous classification. As for classifying the passage retrieval using the MRR score, an immediate option would be to classify values greater than 0 as successful. However, although our system retrieves the top-10 most relevant passages, the answer generation model is limited by its maximum input size, which resulted in less important passages being truncated. A preliminary analysis showed us that, in most samples, the model only considered passages, and therefore we defined the threshold of a successful retrieval as .
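A minimal sketch of this split procedure is given below. It assumes per-sample scores are already available, treats a score equal to the threshold as a success, and takes the MRR threshold as a parameter, because its value is not recoverable from the text; the function and key names are hypothetical.

```python
from statistics import quantiles


def third_quartile(scores):
    """3rd quartile of a score distribution (needs at least two values)."""
    return quantiles(scores, n=4, method="inclusive")[2]


def classify_samples(samples, mrr_threshold):
    """Group samples by whether rewriting and retrieval succeeded.

    samples: list of dicts with per-sample 'rouge1_r' and 'mrr' scores.
    mrr_threshold: hypothetical parameter; the paper's value is not shown here.
    Returns the ROUGE threshold and a dict keyed by
    (rewriting_succeeded, retrieval_succeeded).
    """
    rouge_threshold = third_quartile([s["rouge1_r"] for s in samples])
    splits = {(rw, rt): [] for rw in (True, False) for rt in (True, False)}
    for sample in samples:
        # Assumption: a score equal to the threshold counts as success.
        key = (sample["rouge1_r"] >= rouge_threshold,
               sample["mrr"] >= mrr_threshold)
        splits[key].append(sample)
    return rouge_threshold, splits
```

Relative frequencies within each split can then be computed by dividing counts by the size of that split, which is how the distributions for succeeded and failed rewriting become comparable despite their different sizes.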

When the question rewriting succeeds (), the passage retrieval also exhibits better performance, as seen by MRR scores greater than 0 being more than twice as frequent (see Fig. 2(b)). Although both splits have many examples where the retrieval fails completely (), they are about twice as frequent when the question rewriting fails.

(a) Distribution of ROUGE1-R scores for question rewriting.
(b) Distributions of MRR scores (retrieval) when question rewriting succeeds or fails.
Figure 2: Analysis of the influence of question rewriting on passage retrieval performance. Relative frequencies refer to the number of samples of each split.
(a) Distribution of F1 scores for the answer generation component.
(b) Distributions of F1 scores (answer generation) when question rewriting and passage retrieval succeed and fail.
Figure 3: Analysis of the influence of question rewriting and passage retrieval on answer generation performance. Relative frequencies refer to each split.

Fig. 3(a) presents the distribution of F1 scores for answer generation, showing that of the results have an F1 score lower than 0.25. In turn, Fig. 3(b) shows 4 splits for when the question rewriting and retrieval modules each succeed or fail. Comparing the stacked bars, one can analyze the influence of question rewriting on the obtained F1 score. Independently of the retrieval performance, F1 scores higher than 0.2 are much more frequent when the rewriting succeeds than when it fails. In particular, F1 scores between 0.3 and 0.8 are about more frequent when the rewriting succeeds. Moreover, poor rewriting performance results in about more results with an F1 score close to 0. Analyzing in terms of MRR, higher F1 scores are much more frequent when the retrieval succeeded. Interestingly, if the rewriting fails but the retrieval succeeds (less probable, as seen in Fig. 2(b)), the system is still able to generate answers with a high F1 score.

3.2.3 Error Example

In Table 2, we present a representative error where the system achieves a high ROUGE1-R score in the rewriting module but fails to retrieve the correct passage and to generate a correct answer. The only difference between the model and truth rewritten questions is in the omitted first name Ryan, which led the system to retrieve a passage referring to a different person (Michael Dunn). Although the first name was not mentioned in the context, maybe by enhancing the question with information from the previous turn (e.g., the age or day of death) the system could have performed better in the subsequent modules.

Despite the importance of question rewriting, this example shows how a high ROUGE score in this module might not exactly reflect the ability to fully enhance the question with the necessary information from the conversation context.

Context | Q | When was Dunn’s death?
Context | A | Dunn died on August 12, 1955, at the age of 59.
Question | | What were the circumstances?
Rewriting | Truth | What were the circumstances of Ryan Dunn’s death?
Rewriting | Model | What were the circumstances of Dunn’s death?
Rewriting | Score | ROUGE1-R: 0.889
Retrieval | Truth |
Retrieval | Score | MRR: 0
Generation | Truth | Ryan Dunn’s Porsche 911 GT3 veered off the road, struck a tree, and burst into flames in West Goshen Township, Chester County, Pennsylvania.
Generation | Model | The Florida Department of Law Enforcement concluded that Dunn’s death was a homicide caused by a single gunshot wound to the chest.
Generation | Score | F1: 0.051, EM: 0, ROUGEL-F1: 0.128
Table 2: Example conversation where the retrieval and generation failed.

4 Conclusions and Future Work

This work presented a conversational QA system composed of 3 modules: question rewriting, passage retrieval, and answer generation. The results obtained from its evaluation on the QReCC dataset show the influence of each individual module in the overall system performance, and emphasize the importance of question rewriting. When the question rewriting succeeded, both the retrieval and answer generation improved – low scores were up to less frequent while higher scores were also about more frequent. Future work should explore how to better control the question rewriting and its interaction with passage retrieval. Although our system with automatic question rewriting outperforms all the participants of the SCAI QReCC shared task, significant improvements can still be achieved with a better rewriting module.


Work supported by national funds through Fundação para a Ciência e a Tecnologia (FCT), under project UIDB/50021/2020; by FEDER, Programa Operacional Regional de Lisboa, Agência Nacional de Inovação (ANI), and CMU Portugal, under project Ref. 045909 (MAIA) and research grant BI|2020/090; and by European Union funds (Multi3Generation COST Action CA18231).