Question Rewriting for Conversational Question Answering

04/30/2020 ∙ by Svitlana Vakulenko, et al. ∙ Apple Inc. ∙ University of Amsterdam

Conversational question answering (QA) requires answers conditioned on the previous turns of the conversation. We address the conversational QA task by decomposing it into question rewriting and question answering subtasks, and conduct a systematic evaluation of this approach on two publicly available datasets. Question rewriting is designed to reformulate ambiguous questions, dependent on the conversation context, into unambiguous questions that are fully interpretable outside of the conversation context. Thereby, standard QA components can consume such explicit questions directly. The main benefit of this approach is that the same questions can be used for querying different information sources, e.g., multiple 3rd-party QA services simultaneously, and that they provide a human-readable interpretation of the question in context. To the best of our knowledge, we are the first to evaluate question rewriting on the conversational question answering task and show its improvement over the end-to-end baselines. Moreover, our conversational QA architecture based on question rewriting sets the new state of the art on the TREC CAsT 2019 dataset with a 28% improvement in MAP and 21% in NDCG@3. The evaluation results provide insights into the sensitivity of QA models to question reformulation, and demonstrate the strengths and weaknesses of the retrieval and extractive QA architectures, which should be reflected in their integration.




1. Introduction

Extending question answering systems to a conversational setting is a natural development that allows smooth interactions with a digital assistant (Gao et al., 2019). This transition requires taking care of complex linguistic phenomena characteristic of a human dialogue, such as anaphoric expressions and ellipsis (Thomas, 1979). Such contextual dependencies refer back to the previous conversational turns, which are necessary to correctly interpret the question.

In this paper, we propose to address the task of conversational QA by decomposing it into two sub-tasks: (1) a question rewriting (QR) model that given previous conversational turns produces an explicit (contextually independent) question, which can then be used as an input to (2) a standard question answering (QA) model that can process explicit questions outside of the conversation context. Thereby, an explicit question serves as an intermediate output that connects QR with QA components. This setup offers a wide range of advantages in comparison with training a single conversational QA model end-to-end:

  1. Interpretability: Since contextual dependencies in a conversation are resolved explicitly by rewriting questions, it becomes possible to separate errors that originate from incorrect context interpretation from errors that stem from the question answering phase. Our error analysis makes full use of this feature by investigating different sources of error and their correlation.

  2. Reuse: Since QR produces an explicit question that can be answered by non-conversational QA models, we can leverage existing QA models and datasets without the need to re-design architectures to handle conversational context. Additionally, a QR model can be pre-trained once, and reused with a variety of standard QA models.

  3. Modularity: The QR setup makes it possible to query multiple remote 3rd-party APIs suitable for QA over different collections. This is a realistic scenario, in which information is distributed across a network of heterogeneous nodes that do not share internal representations. Natural language provides a suitable communication protocol between these components (Rastogi et al., 2019).
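The decomposition described above can be sketched in a few lines. This is only an illustration of the interface between the two components, not the authors' implementation: the names `answer_in_context`, `toy_rewrite` and `toy_qa` are hypothetical, and the toy rewriter is a naive pronoun substitution standing in for a trained QR model.

```python
from typing import Callable, Dict, List

# Hypothetical interfaces: a rewriter maps (history, question) to an explicit
# question; a QA backend maps an explicit question to an answer string.
Rewriter = Callable[[List[str], str], str]
QABackend = Callable[[str], str]


def answer_in_context(history: List[str], question: str,
                      rewrite: Rewriter, backends: List[QABackend]) -> Dict:
    """Rewrite once, then send the same explicit question to every backend."""
    explicit = rewrite(history, question)
    return {"rewrite": explicit, "answers": [qa(explicit) for qa in backends]}


def toy_rewrite(history: List[str], question: str) -> str:
    """Placeholder rewriter: replaces the word 'it' with the first turn's text."""
    if not history:
        return question
    return " ".join(history[0] if w == "it" else w for w in question.split())


def toy_qa(question: str) -> str:
    """Placeholder non-conversational QA backend."""
    return f"answer to: {question}"


result = answer_in_context(["the Eiffel Tower"], "Where is it located?",
                           toy_rewrite, [toy_qa, toy_qa])
print(result["rewrite"])
```

Because the explicit question is plain text, the intermediate output can be logged for error analysis and passed unchanged to any number of backends.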

Previous work on QR for conversational QA (Elgohary et al., 2019) evaluates QR performance using intrinsic quality of question rewrites. Text similarity metrics used for evaluation were already shown to be flawed on other generative tasks (Zhang et al., 2019; Liu et al., 2016). Consequently, there is no evidence that an improvement in a QR model, as evaluated by these metrics, will translate to better end-to-end performance in conversational QA. To the best of our knowledge, our work is the first to bridge this gap.

Two dimensions in which QA tasks vary are: the type of data source used to retrieve the answer (e.g., a paragraph, a document collection, or a knowledge graph); and the expected answer type (a text span, a ranked list of passages, or an entity). In this paper, we experiment with two variants of the QA task:

retrieval QA, the task of finding an answer to a given natural-language question as a ranked list of relevant passages given a document collection; and extractive QA, the task of finding an answer to a given natural-language question as a text span within a given passage.

Though the two QA tasks are complementary to each other, in this paper we focus on the QR task and its ability to enable different types of QA models within a conversational setting. We use both retrieval and extractive QA tasks to examine the effect of the QR component on the end-to-end conversational QA performance.

Our main contribution is a systematic evaluation of the QR approach on the conversational QA task. We examine the effect of QR on two different QA architectures in parallel to demonstrate how the differences in the setup affect the QA performance and the types of errors. This analysis is performed to better understand the interaction between the components and their limitations which have implications on further integration.

The approach described in this paper outperforms all other published methods on both conversational QA datasets used for evaluation. Moreover, we show that QR performance metrics correctly predict the model which consistently performs better across both QA datasets. We also demonstrate how correlation in the performance metrics and question rewrites can be used as a tool for diagnosing QA models.

2. Related Work

Conversational QA is an extension of the standard QA task that introduces contextual dependencies between the input question and the previous dialogue turns. Several datasets were recently proposed extending different QA tasks to a conversational setting, including extractive (Reddy et al., 2019; Choi et al., 2018), retrieval (Dalton et al., 2020) and knowledge graph QA (Guo et al., 2018; Christmann et al., 2019). One common approach to conversational QA is to extend the input of a QA model by appending previous conversation turns (Qu et al., 2019; Ju et al., 2019; Christmann et al., 2019). Such an approach, however, falls short in the case of retrieval QA, which requires a concise query as input to the candidate selection step, such as BM25 (Nogueira et al., 2019). Results of the recent TREC CAsT track demonstrated that co-reference models are also not sufficient to resolve the missing context in the follow-up questions (Dalton et al., 2020). A considerable gap between the performance of automated rewriting approaches and manual human annotations calls for new architectures that are capable of retrieving relevant answers from large text collections using conversational context.

In this paper we explore an alternative to training an end-to-end conversational QA model by separating it into QR and QA subtasks. We present the results of our experiments with the state-of-the-art text generation approach applied to resolve missing conversation context in the follow-up questions. QR is designed to handle conversation context and produce an equivalent question that no longer depends on the conversation context, i.e., with anaphoras and other contextual ambiguities resolved. The output of the QR model can then be used by a standard QA model pretrained on non-conversational datasets, such as SQuAD (Rajpurkar et al., 2016), WikiQA (Yang et al., 2015) or TrecQA (Wang et al., 2007).

QR has been already shown effective for multi-turn chitchat and task-oriented dialogues (Su et al., 2019; Rastogi et al., 2019). These settings are dissimilar to ours since they use different evaluation criteria. Instead of measuring slot-matching and intent detection performance as in task-oriented dialogue systems or user engagement as in chit-chat, we evaluate open-domain QA. As we show in our experimental evaluation, this directly correlates with the QR performance. In this paper, we show that QR is well suited for open-domain conversational QA and demonstrate that QR helps to achieve superior performance on this task. The setup we propose also provides a convenient framework for measuring performance of the individual components.

Question rewriting for conversational QA was initially proposed by Elgohary et al., who released the CANARD dataset that contains human rewrites of questions from QuAC. However, the evaluation of the question rewriting approaches trained on this task was limited to the intrinsic metrics reported in the original paper (Lin et al., 2020). No evaluation of the end-to-end performance using the generated question rewrites or the impact of the errors propagating from this stage to the question answering stage was reported to date. We close this research gap by training several question rewriting models and integrating them with the state-of-the-art question answering models based on different Transformer architectures. Finally, we show that the same QR model trained on the question rewrites for extractive QA can as well extend a standard retrieval QA model trained on the passage ranking task (Nguyen et al., 2016) to perform conversational QA on the TREC CAsT dataset.

3. Approach

QR makes it possible to apply existing retrieval and extractive QA models in a conversational QA setting by introducing a question rewriting component that generates explicit questions interpretable outside of the conversation context (see Figure 1 for an illustrative example). To evaluate question rewriting for conversational QA on the end-to-end task, we set up two different QA models independently from each other: one is designed for passage retrieval and the other for answer extraction from a passage. This setup allows us to better examine the performance of the individual components and to analyze similarities and differences between retrieval and extractive models, which can provide insights into the potential for a better integration of all three components.

Figure 1. Our approach for end-to-end conversational QA relies on the question rewriting component to handle conversation context and produce an explicit question that can be fed to standard, non-conversational QA components.

3.1. Question Rewriting

Given a conversation context $C$ and a potentially implicit question $Q$, i.e., a question which may require the conversation context $C$ to be fully interpretable, the task of a question rewriting (QR) model is to generate an explicit question $Q'$ which is equivalent to $Q$ under conversation context $C$ and has the same correct answer $A$. We use a model for question rewriting, which employs a unidirectional Transformer decoder (Radford et al., 2019) for both encoding the input sequence and decoding the output sequence. The input to the model is the question with previous conversation turns (we use 5 previous turns in our experiments) turned into token sequences separated with a special token. The training objective is to predict the output tokens provided in the ground truth question rewrites produced by human annotators. The model is trained via teacher forcing, which is a standard technique for training language generation models, to predict every next token in the output sequence given all the preceding tokens. The loss is calculated as negative log-likelihood (cross-entropy) between the output distribution $D' \in \mathbb{R}^{|V|}$ over the vocabulary $V$, and the one-hot vector $y \in \mathbb{R}^{|V|}$ for the correct token from the ground truth:

$$\mathcal{L} = -\, y \cdot \log D'$$

At training time the output sequence is shifted by one token and is used as input to predict all next tokens of the output sequence at once. At inference time, the model selects the most likely next token from the final distribution $D'$ (greedy decoding), as shown in Figure 2.

We further increase capacity of our generative model by learning to combine several separate distributions (the intermediate distributions $D_i$ in Figure 2). The final distribution is then produced as a weighted sum of the intermediate distributions: $D' = \sum_{i} \alpha_i D_i$ (with a fixed number of distributions in our experiments). To produce $D_i$ we pass the last hidden state $h$ of the Transformer Decoder through a separate linear layer for each intermediary distribution: $D_i = W_i h + b_i$, where $W_i$ is the weight matrix and $b_i$ is the bias. For the weighting coefficients $\alpha_i$ we use the matrix of input embeddings $X \in \mathbb{R}^{n \times d}$, where $n$ is the maximum sequence length and $d$ is the embedding dimension, and the weights of the first attention head of the Transformer Decoder put through a layer normalization function, combined through learned weight matrices and a bias.
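The weighted combination, the loss and greedy decoding described above can be illustrated on toy numbers. This is a minimal sketch under stated assumptions rather than the model itself: the intermediate distributions are assumed to be already normalized, the weights are assumed to be non-negative and to sum to one, and the function names are hypothetical. In the model, both the distributions and the weights come from learned layers.

```python
import math
from typing import List, Sequence


def mix_distributions(dists: Sequence[Sequence[float]],
                      alphas: Sequence[float]) -> List[float]:
    """Weighted sum of intermediate distributions over the same vocabulary."""
    vocab_size = len(dists[0])
    return [sum(a * d[t] for a, d in zip(alphas, dists))
            for t in range(vocab_size)]


def nll_loss(final_dist: Sequence[float], target_id: int) -> float:
    """Cross-entropy against a one-hot target reduces to -log p(target)."""
    return -math.log(final_dist[target_id])


def greedy_next_token(final_dist: Sequence[float]) -> int:
    """Greedy decoding: pick the index of the most probable token."""
    return max(range(len(final_dist)), key=lambda t: final_dist[t])


# Toy vocabulary of three tokens and two intermediate distributions.
final = mix_distributions([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]], [0.5, 0.5])
print(final, greedy_next_token(final), nll_loss(final, 2))
```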

Figure 2. The question rewriting component uses the Transformer Decoder architecture to recursively generate the tokens of an "explicit" question. At inference time, the generated output is appended to the input sequence for the next timestep in the sequence.

3.2. Retrieval QA

In the retrieval QA setting, the task is to produce a ranked list of text passages from a collection, ordered by their relevance to a given natural language question (Nguyen et al., 2016; Dietz et al., 2018). We employ a state-of-the-art approach to retrieval QA, which consists of two phases: candidate selection and passage re-ranking. In the first phase, a traditional retrieval algorithm (BM25) is used to quickly sift through the indexed collection, retrieving the top-$k$ passages ranked by relevance to the input question. In the second phase, a more computationally expensive model is used to re-rank all question-answer candidate pairs formed using the previously retrieved set of passages. For re-ranking, we use a binary classification model that predicts whether the passage answers a question, i.e., the output of the model is the relevance score in the interval $[0, 1]$. The input to the re-ranking model is the concatenated question and passage with a separation token in between (see Figure 3 for the model overview). The model is initialized with weights learned from unsupervised pre-training on the language modeling (masked token prediction) task (BERT) (Devlin et al., 2019). During fine-tuning, the training objective is to reduce cross-entropy loss, using relevant passages and non-relevant passages from the top-$k$ candidate passages.
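The two-phase design can be sketched as follows. This is an illustrative toy, not the Anserini + BERT system: BM25 is written out in its common Lucene-style form over whitespace tokens, the values of `k1` and `b` are placeholders, and `relevance` is a hypothetical stand-in for the fine-tuned re-ranking classifier.

```python
import math
from collections import Counter
from typing import Callable, List, Tuple


def bm25_scores(query: str, passages: List[str],
                k1: float = 0.9, b: float = 0.4) -> List[float]:
    """Score every passage against the query with BM25 (illustrative k1, b)."""
    docs = [p.lower().split() for p in passages]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query.lower().split():
            if t not in tf:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores


def retrieve_and_rerank(query: str, passages: List[str],
                        relevance: Callable[[str, str], float],
                        top_k: int = 1000) -> List[Tuple[str, float]]:
    """Phase 1: BM25 candidate selection. Phase 2: re-rank the candidates."""
    first = bm25_scores(query, passages)
    candidates = sorted(range(len(passages)),
                        key=lambda i: first[i], reverse=True)[:top_k]
    rescored = [(passages[i], relevance(query, passages[i])) for i in candidates]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)
```

Only the `top_k` candidates ever reach the expensive scorer, which is the reason for splitting retrieval into two phases.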

Figure 3. Retrieval QA component includes two sequential phases: candidate selection (BM25) followed by passage re-ranking (Transformer Encoder).

3.3. Extractive QA

The task of extractive QA is, given a natural language question and a single passage, to find an answer as a contiguous text span within the given passage (Rajpurkar et al., 2016). Our model for extractive QA consists of a Transformer-based bidirectional encoder (BERT) (Devlin et al., 2019) and an output layer predicting the answer span. The input to the model is the sequence of tokens formed by concatenating a question and a passage separated with a special token. The encoder layers are initialized with the weights of a Transformer model pre-trained on an unsupervised task (masked token prediction). The output of the Transformer encoder is a hidden vector for each token of the input sequence. For fine-tuning the model on the extractive QA task, we add weight matrices $W_s$, $W_e$ and biases $b_s$, $b_e$ that produce two probability distributions over all the tokens of the given passage separately for the start ($S$) and end position ($E$) of the answer span. For each token $i$ the output of the Transformer encoder $T_i$ is passed through a linear layer, followed by a softmax normalizing the output logits over all the tokens into probabilities:

$$S_i = \frac{e^{W_s T_i + b_s}}{\sum_{j} e^{W_s T_j + b_s}}, \qquad E_i = \frac{e^{W_e T_i + b_e}}{\sum_{j} e^{W_e T_j + b_e}}$$

The model is then trained to minimize cross-entropy between the predicted start/end positions ($S_i$ and $E_i$) and the correct ones from the ground truth ($y_S$ and $y_E$ are one-hot vectors indicating the correct start and end tokens of the answer span):

$$\mathcal{L} = -\sum_{i} y_{S,i} \log S_i - \sum_{i} y_{E,i} \log E_i$$

At inference time all possible answer spans from position $i$ to position $j$, where $j \geq i$, are scored by the sum of end and start positions' probabilities: $S_i + E_j$. The output of the model is the maximum scoring span (see Figure 4 for the model overview).

21% of the CANARD (QuAC) examples are Not Answerable (NA) by the provided passage. To enable our model to make No Answer predictions we prepend a special token to the beginning of the input sequence. For all No Answer samples we set both the ground-truth start and end positions to this token's position (0). Likewise, at inference time, predicting this special token is equivalent to a No Answer prediction for the given example.
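The span selection and the No Answer convention can be sketched on raw logits. This is a minimal illustration, assuming the per-token start and end logits have already been produced by the encoder and output layer; the function names are hypothetical, and position 0 plays the role of the prepended special token.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Normalize logits over all token positions into probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def best_span(start_logits: Sequence[float],
              end_logits: Sequence[float]) -> Tuple[int, int, float]:
    """Score every span (i, j) with j >= i by start_p[i] + end_p[j]."""
    start_p, end_p = softmax(start_logits), softmax(end_logits)
    best = (0, 0, start_p[0] + end_p[0])
    for i, s in enumerate(start_p):
        for j in range(i, len(end_p)):
            score = s + end_p[j]
            if score > best[2]:
                best = (i, j, score)
    return best


def span_loss(start_logits: Sequence[float], end_logits: Sequence[float],
              gold_start: int, gold_end: int) -> float:
    """Sum of the start and end cross-entropy terms for one-hot targets."""
    return (-math.log(softmax(start_logits)[gold_start])
            - math.log(softmax(end_logits)[gold_end]))


def is_no_answer(span: Tuple[int, int, float]) -> bool:
    """Position 0 holds the special token, so span (0, 0) means No Answer."""
    return span[0] == 0 and span[1] == 0
```

The exhaustive double loop is quadratic in the sequence length, which is acceptable for a single passage; practical implementations usually also cap the span length.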

Figure 4. Extractive QA component predicts a span of text in the paragraph P’, given an input sequence with the question Q’ and passage P’.

4. Experimental Setup

In the following subsections we describe the datasets used for training and evaluation, the set of metrics for each of the components, our baselines and details of the implementation.

4.1. Datasets

We chose two conversational QA datasets for the evaluation of our approach: (1) CANARD, derived from Question Answering in Context (QuAC) for extractive conversational QA (Choi et al., 2018), and (2) TREC CAsT for retrieval conversational QA (Dalton et al., 2020).

Following the setup of the TREC CAsT 2019, we use the MS MARCO Passage Ranking (Nguyen et al., 2016) and the TREC CAR (Dietz et al., 2018) paragraph collections. After de-duplication, the MS MARCO collection contains 8.6M documents and the TREC CAR – 29.8M documents. We evaluated on the test set with relevance judgements for 173 questions across 20 dialogues (topics).

CANARD (Elgohary et al., 2019) is built upon the QuAC dataset (Choi et al., 2018) by employing human annotators to rewrite original questions from QuAC dialogues into explicit questions. CANARD contains 40.5k pairs of question rewrites that can be matched to the original answers in QuAC. We use CANARD splits for training and evaluation. Each answer in QuAC is annotated with a Wikipedia passage from which it was extracted alongside the correct answer spans within this passage. We use the question rewrites provided in CANARD and passages with answer spans from QuAC. In our experiments, we refer to this joint dataset as CANARD for brevity.

See Table 1 for an overview of the datasets. Since TREC CAsT is relatively small, we use only CANARD for training QR. The same QR model trained on CANARD is evaluated on both CANARD and TREC CAsT. The model for retrieval QA is tuned on a sample from the MS MARCO passage ranking dataset, which includes relevance judgements for 12.8M query-passage pairs with 399k unique queries (Nogueira and Cho, 2019). The model for extractive QA is pre-trained on the MultiQA dataset, which contains QA pairs from six standard QA benchmarks (Fisch et al., 2019).

| Tasks | Question Rewriting | Retrieval QA | Extractive QA |
|-------|--------------------|--------------|---------------|
| Train | CANARD (35k) | MS MARCO (399k) | MultiQA (75k), CANARD (35k) |
| Test | CANARD (5.5k) | TREC CAsT (173) | CANARD (5.5k) |

Table 1. Datasets used for training and evaluation (with the number of questions in parentheses).

4.2. Metrics

Mean average precision (MAP), mean reciprocal rank (MRR), normalized discounted cumulative gain (NDCG@3) and precision on the top-passage (P@1) evaluate quality of passage ranking. 1000 documents are evaluated per query with a relevance judgement value cut-off level of 2. We use F1 and Exact Match (EM) for extractive QA, which measure word token overlap between the predicted answer span and the ground truth. We also report accuracy for questions without answers in the given passage (NA Acc).
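For reference, the per-query quantities behind these metrics can be written down compactly. This is a sketch under stated assumptions, not the official evaluation tooling: relevance labels are graded integers, a passage counts as relevant for MRR and average precision when its label reaches the cut-off, NDCG uses the label itself as gain with a log2 discount, and answers are tokenized on whitespace. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import List


def reciprocal_rank(rels: List[int], cutoff: int = 2) -> float:
    """1 / rank of the first passage whose label reaches the cut-off."""
    for rank, r in enumerate(rels, start=1):
        if r >= cutoff:
            return 1.0 / rank
    return 0.0


def average_precision(rels: List[int], num_relevant: int, cutoff: int = 2) -> float:
    """Mean of precision values at the ranks of relevant passages."""
    hits, total = 0, 0.0
    for rank, r in enumerate(rels, start=1):
        if r >= cutoff:
            hits += 1
            total += hits / rank
    return total / num_relevant if num_relevant else 0.0


def ndcg_at_k(rels: List[int], all_judged: List[int], k: int = 3) -> float:
    """DCG of the top-k ranking divided by the DCG of the ideal ranking."""
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = sorted(all_judged, reverse=True)[:k]
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg else 0.0


def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token precision and recall between two answers."""
    p, g = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def exact_match(pred: str, gold: str) -> float:
    """1.0 when the token sequences are identical, else 0.0."""
    return float(pred.lower().split() == gold.lower().split())
```

MAP and MRR are the means of `average_precision` and `reciprocal_rank` over all queries, and P@1 is simply whether the first passage is relevant.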

Our analysis showed that ROUGE recall calculated for unigrams (ROUGE-1 recall) correlates with the human judgement of the question rewriting performance (Pearson 0.69), so we adopt it for our experiments. ROUGE (Lin, 2004) is a standard metric estimating lexical overlap, which is often used in text summarization and other text generation tasks. We also calculate question similarity scores using the Universal Sentence Encoder (USE) model (Cer et al., 2018) (Pearson 0.71).
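ROUGE-1 recall is the fraction of reference unigrams that also appear in the candidate. A minimal sketch follows, assuming lowercased whitespace tokenization, which is a simplification of the standard ROUGE preprocessing; the function name is hypothetical.

```python
from collections import Counter


def rouge1_recall(candidate: str, reference: str) -> float:
    """Share of reference unigrams (with multiplicity) found in the candidate."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    total = sum(ref.values())
    return sum((cand & ref).values()) / total if total else 0.0


# An unrewritten question scored against a human rewrite of it.
print(rouge1_recall("What are good sources in food?",
                    "What are good sources of melatonin in food?"))
```

Recall, rather than precision, rewards rewrites that recover the words a human annotator added to make the question explicit.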

4.3. QR Baselines

The baselines were designed to challenge the need for a separate QR component by incorporating previous turns as direct input to custom QA components. Manual rewrites by human annotators provide the upper-bound performance for a QR approach and allow for an ablation study of the downstream QA components.


Original.

Original questions from the conversational QA datasets without any question rewriting.

Original + $k$-DT.

Our baseline approach for extractive QA prepends the previous $k$ questions to the original question to compensate for the missing context. The questions are separated with a special token and used as input to the Transformer model. We report the results for $k \in \{1, 2, 3\}$.

Original + $k$-DT*.

Since in the first candidate selection phase we use the BM25 retrieval function, which operates on a bag-of-words representation, we modify the baseline approach for retrieval QA as follows. We select keywords from prior conversation turns (not including the current turn) based on their inverse document frequency (IDF) scores and append them to the original question of the current turn. We use the keyword-augmented query as the search query for Anserini (Yang et al., 2017), a Lucene toolkit for replicable information retrieval research. When BERT re-ranking is used, we concatenate the keyword-augmented query with each passage retrieved for it. We use the keywords with IDF scores above the threshold of 0.0001, which was selected based on a 1 million document sample of the MS MARCO corpus.

Human.

To provide an upper bound (skyline), we evaluate all our models on the question rewrites manually produced by human annotators.
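The two context-augmentation baselines can be sketched as plain string operations. Several details here are assumptions made for illustration only: the separator string `[SEP]`, the IDF definition `log(N / df)`, whitespace tokenization, and the handling of tokens missing from the sample (they are skipped). The threshold used below is therefore not comparable to the value reported above.

```python
import math
from collections import Counter
from typing import Dict, List

SEP = "[SEP]"  # assumed separator token, for illustration only


def prepend_turns(question: str, previous_questions: List[str], k: int) -> str:
    """Prepend the last k previous questions to the current question."""
    turns = previous_questions[-k:] if k > 0 else []
    return f" {SEP} ".join(turns + [question])


def idf_table(corpus: List[str]) -> Dict[str, float]:
    """IDF from a document sample, assuming idf(t) = log(N / df(t))."""
    n = len(corpus)
    df = Counter(t for doc in corpus for t in set(doc.lower().split()))
    return {t: math.log(n / c) for t, c in df.items()}


def augment_query(question: str, previous_turns: List[str],
                  idf: Dict[str, float], threshold: float) -> str:
    """Append high-IDF keywords from earlier turns to the current question."""
    seen = set(question.lower().split())
    keywords = []
    for turn in previous_turns:
        for tok in turn.lower().split():
            # Tokens absent from the IDF sample get 0.0 and are skipped.
            if tok not in seen and idf.get(tok, 0.0) > threshold:
                keywords.append(tok)
                seen.add(tok)
    return " ".join([question] + keywords)
```

The first function yields a sequence suited to a Transformer input, while the second keeps the query short, which matters for a bag-of-words retriever such as BM25.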

4.4. QR Models

In addition to the baselines described above, we chose several alternative models for question rewriting: (1) co-reference resolution, as in the TREC CAsT challenge; (2) PointerGenerator, proposed in the related work for CANARD but not evaluated on the end-to-end conversational QA task (Elgohary et al., 2019); (3) CopyTransformer, an extension of the PointerGenerator model that replaces the bi-LSTM encoder-decoder architecture with the same Transformer Decoder model as in ActionGenerator. All models, except co-reference, were trained on the train split of the CANARD dataset. Question rewrites are generated turn by turn for each dialogue, recursively using already generated rewrites as previous turns. This is the same setup as in the TREC CAsT evaluation.
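The recursive, turn-by-turn generation can be sketched as a simple loop. This is an illustration only: `rewrite_dialogue` and its `rewrite` argument are hypothetical names, the rewriter is any function from (previous turns, question) to an explicit question, and the window of five turns follows the input size stated in Section 3.1.

```python
from typing import Callable, List


def rewrite_dialogue(questions: List[str],
                     rewrite: Callable[[List[str], str], str],
                     max_turns: int = 5) -> List[str]:
    """Rewrite each question using earlier *generated* rewrites as context."""
    rewrites: List[str] = []
    for question in questions:
        context = rewrites[-max_turns:]  # at most max_turns previous rewrites
        rewrites.append(rewrite(context, question))
    return rewrites
```

Because the context consists of model outputs rather than gold rewrites, an error made early in a dialogue can propagate to later turns.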


Co-reference.

Anaphoric expressions in original questions are replaced with their antecedents from the previous dialogue turns. Co-reference dependencies are detected using a publicly available neural co-reference resolution model that was trained on the OntoNotes corpus (Lee et al., 2018).

PointerGenerator.

A sequence-to-sequence model for text generation with a bi-LSTM encoder and a pointer-generator decoder (See et al., 2017).

CopyTransformer.

The Transformer decoder, which, similar to the pointer-generator model, uses one of the attention heads as a pointer (Gehrmann et al., 2018). The model is initialized with the weights of a pre-trained GPT2 model (Radford et al., 2019; Wolf et al., 2019) (Medium-sized GPT-2 English model: 24-layer, 1024-hidden, 16-heads, 345M parameters) and then fine-tuned on the question rewriting task.

ActionGenerator.

The Transformer-based model described in Section 3.1. ActionGenerator is initialized with the weights of the pre-trained GPT2 model, same as in CopyTransformer.

4.5. QA Models

Our retrieval QA approach is implemented as proposed in (Nogueira and Cho, 2019), using Anserini for the candidate selection phase with BM25 (top-1000 passages) and BERT for the passage re-ranking phase (Anserini + BERT). Both components were fine-tuned only on the MS MARCO dataset.

We train several models for extractive QA on different variants of the training set based on the CANARD training set (Elgohary et al., 2019). All models are first initialized with the weights of the BERT model pre-trained using whole word masking (Devlin et al., 2019).


CANARD-O.

The baseline models were trained using original (implicit) questions of the CANARD training set with a dialogue context of varying length (Original and Original + $k$-DT). The models are trained separately for each $k \in \{0, 1, 2, 3\}$, where $k = 0$ corresponds to the model trained only on the original questions without any previous dialogue turns.

CANARD-H.

To accommodate the input of the question rewriting models, we train a QA model that takes human-rewritten questions from the CANARD dataset as input without any additional conversation context, i.e., as in the standard QA task.

MultiQA Canard-H.

Since the setup with rewritten questions does not differ from the standard QA task, we experiment with pretraining the extractive QA model on the MultiQA dataset with explicit questions (Fisch et al., 2019), using parameter choices introduced by Longpre et al. (2019). We further fine-tune this model on the target CANARD dataset to adapt the model to a different type of QA samples in CANARD (see Figure 6).

5. Results

Our proposed approach, using question rewriting for conversational QA, consistently outperforms the baselines that use previous dialogue turns, in both retrieval and extractive QA tasks. The PointerGenerator network that was previously proposed for the QR task in (Elgohary et al., 2019) is the weakest rewriting model according to the end-to-end QA results (MAP 0.100, F1 57.37). This result was not apparent from the QR evaluation metrics reported in Table 2.


| Test Set | Question | ROUGE | USE | EM |
|----------|----------|-------|-----|----|
| CANARD | Original | 0.51 | 0.73 | 0.12 |
| | Co-reference | 0.68 | 0.83 | 0.48 |
| | PointerGenerator | 0.75 | 0.83 | 0.22 |
| | CopyTransformer | 0.78 | 0.87 | 0.56 |
| | ActionGenerator | 0.81 | 0.89 | 0.63 |
| | Human* | 0.84 | 0.90 | 0.33 |
| TREC CAsT | Original | 0.67 | 0.80 | 0.28 |
| | Co-reference | 0.71 | 0.80 | 0.13 |
| | PointerGenerator | 0.71 | 0.82 | 0.17 |
| | CopyTransformer | 0.82 | 0.90 | 0.49 |
| | ActionGenerator | 0.90 | 0.94 | 0.58 |
| | Human* | 1.00 | 1.00 | 1.00 |

Table 2. Evaluation results of the QR models. *Human performance is measured as the difference between two independent annotators' rewritten questions, averaged over 100 examples. This provides an estimate of the upper bound.

Passage re-ranking with BERT always improves ranking results (almost a two-fold increase in MAP, see Table 3). The best question rewriting model, ActionGenerator, when used together with BERT re-ranking and Anserini retrieval, sets the new state of the art on the TREC CAsT dataset with a 28% improvement in MAP and 21% in NDCG@3 over the top automatic run (Dalton et al., 2019). Keyword-based baselines (Original + $k$-DT*) prove to be very strong, outperforming both the Co-reference and PointerGenerator models on all three performance metrics. With Anserini alone, both MRR and NDCG@3 increase with the number of turns used for sampling keywords, while MAP slightly decreases, which indicates that adding keywords brings more relevant results to the very top of the ranking but that non-relevant results also receive higher scores. In contrast, the baseline results for the Anserini + BERT model indicate that the re-ranking performance for all metrics decreases if keywords from more than 2 previous turns are added to the original question. Similarly, we observe a performance peak of the F1 measure at $k = 2$ in the extractive QA setting (see Table 4).

| QA Input | QA Model | MAP | MRR | NDCG@3 |
|----------|----------|-----|-----|--------|
| Original | Anserini | 0.089 | 0.245 | 0.131 |
| Original + 1-DT* | | 0.133 | 0.343 | 0.199 |
| Original + 2-DT* | | 0.130 | 0.374 | 0.213 |
| Original + 3-DT* | | 0.127 | 0.396 | 0.223 |
| Co-reference | | 0.109 | 0.298 | 0.172 |
| PointerGenerator | | 0.100 | 0.273 | 0.159 |
| CopyTransformer | | 0.148 | 0.375 | 0.213 |
| ActionGenerator | | 0.190 | 0.441 | 0.265 |
| Human | | 0.218 | 0.500 | 0.315 |
| Original | Anserini + BERT | 0.172 | 0.403 | 0.265 |
| Original + 1-DT* | | 0.230 | 0.535 | 0.378 |
| Original + 2-DT* | | 0.245 | 0.576 | 0.404 |
| Original + 3-DT* | | 0.238 | 0.575 | 0.401 |
| Co-reference | | 0.201 | 0.473 | 0.316 |
| PointerGenerator | | 0.183 | 0.451 | 0.298 |
| CopyTransformer | | 0.284 | 0.628 | 0.440 |
| ActionGenerator | | 0.341 | 0.716 | 0.529 |
| Human | | 0.405 | 0.879 | 0.589 |

Table 3. Retrieval QA results on the TREC CAsT test set.

Figure 5. Precision-recall curve illustrating model performance on the TREC CAsT test set for Anserini + BERT.

Training an extractive QA model on rewritten questions improves the performance even when the original questions are used as input at inference time (CANARD-H in Table 4). The tendency of the extractive model to hit correct answers even with incomplete information from ambiguous questions is also observable in Figure 7. In comparison with the retrieval QA performance across different question formulations on the left plot, the extractive model correctly answers twice as many questions without rewriting, i.e., the impact of rewriting original questions is much more pronounced in retrieval QA (middle layers in the left plot). The anomalous behaviour of the extractive model, which answers the original implicit question but fails when it is explicitly reformulated by a human annotator (green and purple in the right plot), is almost absent in the retrieval model analysis (see also the sparse rows in Table 3).

Pre-training on MultiQA improves performance of the extractive QA model. The style of questions in the CANARD dataset is rather different from other QA tasks, and Figure 6 shows that even a little training data from CANARD helps to immediately boost the performance of a pre-trained model.

| QA Input | Training Set | EM | F1 | NA Acc |
|----------|--------------|----|----|--------|
| Original | CANARD-O | 38.68 | 53.65 | 66.55 |
| Original + 1-DT | | 42.04 | 56.40 | 66.72 |
| Original + 2-DT | | 41.29 | 56.68 | 68.11 |
| Original + 3-DT | | 42.16 | 56.20 | 68.72 |
| Original | CANARD-H | 39.44 | 54.02 | 65.42 |
| Original | MultiQA | 41.32 | 54.97 | 65.84 |
| Co-reference | CANARD-H | 42.70 | 57.59 | 66.20 |
| PointerGenerator | | 41.93 | 57.37 | 63.16 |
| CopyTransformer | | 42.67 | 57.62 | 68.02 |
| ActionGenerator | | 43.39 | 58.16 | 68.29 |
| Human | | 45.40 | 60.48 | 70.55 |

Table 4. Extractive QA results on the CANARD test set.

Figure 6. Effect of fine-tuning the MultiQA model on a portion of the target CANARD-H dataset, visualizing domain shift between the datasets. Median, quartiles, and min/max are given by the red lines, boxes and whiskers, respectively.

Figure 7. Break-down analysis (best in color) with a sliding threshold for both retrieval (left) and extractive QA (right) results. The plot shows the difference between error distributions in retrieval and extractive settings. Most of the correct spans can be extracted given a relevant passage even with an original ambiguous question (the pink region in the bottom).

The precision-recall trade-off curve in Figure 5 shows that question rewriting performance is close to the performance achieved by manually rewriting implicit questions. Precision decreases rapidly and is below 0.1 when optimised for full recall. At the same time, the performance results of the extractive QA suggest that the model cannot reliably identify passages that do not contain an answer to the question (71% accuracy on the human rewrites). Since the proportion of irrelevant passages produced by the retrieval component is twice the proportion of non-answerable questions in CANARD, the error rate is expected to increase when these components are combined.

6. Error Analysis

For every answer in the datasets we have three question formulations: (1) an original, possibly implicit, question (Original), and question rewrites that were produced either (2) by one of our QR models (QR; we take the rewrites generated by the best QR model, ActionGenerator) or (3) by a human annotator (Human). Table 5 allows us to better analyze the types of errors by grouping QA samples based on the performance of the retrieval QA component when given each of the specified question formulations. For example, the first row of the table indicates the number of samples where neither the generated rewrite nor the human rewrite was able to elicit the correct answer when measured by P@1, i.e., at the first position in the ranking. Therefore, all of these 49 error cases can be attributed to the retrieval QA component.

There are two anecdotal cases where our QR component generated rewrites that produced a better ranking than the human-written questions. The first example shows that the re-ranking model does not handle paraphrases well. Original question: “What are good sources in food?”, human rewrite: “What are good sources of melatonin in food?”, model rewrite: “What are good sources in food for melatonin”. In the second example, the human annotator and our model chose different context to disambiguate the original question. Original question: “What about environmental factors?”, human rewrite: “What about environmental factors during the Bronze Age collapse?”, model rewrite: “What about environmental factors that lead to led to a breakdown of trade”. Even though neither model rewrite is grammatically correct, both solicited correct top answers while the human rewrites failed, which indicates flaws in the QA model.

Original | QR | Human | P@1 = 1 | NDCG@3 > 0 | NDCG@3 ≥ 0.5 | NDCG@3 = 1
× | × | × | 49 (14) | 10 (1) | 55 (20) | 154 (49)
✓ | × | × | 0 | 0 | 0 | 0
× | ✓ | × | 2 | 0 | 1 | 0
✓ | ✓ | × | 0 | 1 | 1 | 0
× | × | ✓ | 19 | 10 | 25 | 4
✓ | × | ✓ | 0 | 1 | 0 | 0
× | ✓ | ✓ | 48 | 63 | 47 | 11
✓ | ✓ | ✓ | 55 (37) | 88 (52) | 44 (33) | 4 (4)
Total: 173 (53)
Table 5. Break-down analysis of all retrieval QA results for the TREC CAsT dataset. Each row represents a group of QA samples that exhibit similar behaviour. A check mark (✓) indicates that the answer produced by the QA model was correct and a cross (×) that it was incorrect, according to the thresholds provided in the right columns. We consider three types of input for every QA sample: the question from the test set (Original), the rewrite generated by the best QR model (ActionGenerator), or the question rewritten manually (Human). The numbers correspond to the count of QA samples for each of the groups. The numbers in parentheses indicate how many questions do not require rewriting, i.e., should be copied from the original.

More generally, assuming that humans always make correct question rewrites, we can attribute all cases in which the human rewrites did not result in a correct answer to errors of the QA component (rows 1-4 in Tables 5-6). The next two rows (5-6) show the cases where human rewrites succeeded but the model rewrites failed, which we consider a likely error of the QR component. The last two rows are true positives for our model, where the last row combines cases in which the original question was just copied without rewriting (numbers in parentheses) and other cases in which rewriting was not required.

The majority of errors stem from the QA model: 29% of the test samples for retrieval and 55% for extractive QA, estimated for P@1 and F1 respectively, compared to 11% and 5% for QR. This is a rough estimate, since we cannot tell whether the cases that failed QA would have failed QR as well. Another interesting observation is that 10% of the questions in TREC CAsT were rewritten by human annotators even though they did not need rewriting to retrieve the correct answer. For CANARD, the majority of questions (62%) can be correctly answered without question rewriting even when the questions are ambiguous.
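The error shares quoted above can be reproduced from the break-down tables with a few lines of arithmetic. The sketch below assumes the P@1 = 1 column of Table 5 and the F1 = 1 column of Table 6, read from top to bottom, and applies the attribution rule stated in the text (rows 1-4 count as QA errors, rows 5-6 as QR errors); the function and variable names are illustrative.

```python
# Row counts for rows 1-8, copied from Table 5 (retrieval QA, P@1 = 1)
# and Table 6 (extractive QA, F1 = 1).
retrieval_p1 = [49, 0, 2, 0, 19, 0, 48, 55]
extractive_f1 = [2701, 181, 40, 120, 232, 40, 269, 1988]


def error_shares(rows):
    """Return the (QA error share, QR error share) for one table column."""
    total = sum(rows)
    qa_errors = sum(rows[0:4])  # rows 1-4: even the human rewrite fails
    qr_errors = sum(rows[4:6])  # rows 5-6: human rewrite succeeds, model rewrite fails
    return qa_errors / total, qr_errors / total


for name, rows in [("retrieval", retrieval_p1), ("extractive", extractive_f1)]:
    qa_share, qr_share = error_shares(rows)
    print(f"{name}: QA {qa_share:.0%}, QR {qr_share:.0%}")
# prints "retrieval: QA 29%, QR 11%" and "extractive: QA 55%, QR 5%"
```

Both columns sum to the table totals (173 and 5571 samples), so the shares are taken over the full test sets.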

Correlation between QR and QA metrics.

To discover the interaction between QR and QA metrics, we discarded all questions that do not lead to correct answers for human rewrites (top 4 rows) and then measured the correlation between ROUGE scores and P@1. There is a strong correlation for ROUGE = 1, i.e., when the generated rewrite is very close to the human one, but when ROUGE < 1 the answer is less predictable: even for rewrites that have a relatively small lexical overlap with the ground truth (ROUGE 0.4) it is possible to retrieve a correct answer, and vice versa.

We explore this effect further by comparing differences in answers produced for human and model rewrites of the same question irrespective of the correct answer. Figure 8 demonstrates a strong correlation between question similarity, as measured by ROUGE, and answer set similarity, as measured by recall of the answers produced for the human rewrites. Recall correlates with ROUGE more than precision does. Points in the bottom right of this plot show the sensitivity of the QA component, where similar questions lead to different answer rankings. The data points close to the top center area indicate the weakness of the QR metric as a proxy for the QA results: often questions do not have to be the same as the ground-truth questions to solicit the same answers. The blank area to the top-left of the diagonal shows that some lexical overlap is required to produce the same answer set, which is likely due to the candidate filtering phase based on bag-of-words representation matching. ROUGE and Jaccard in extractive QA show only a weak correlation (Pearson 0.31). The extractive model is very sensitive to slight input perturbations, yet it also provides the same answer to very distinct questions. The correlation of QA performance with the USE metric is weaker than with ROUGE for both models, and the outliers do not correspond to paraphrases.

Figure 8. Strong correlation (Pearson 0.77) between question similarity (ROUGE) and top-10 relevant passages produced by the retrieval QA model (Recall).
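The comparison behind Figure 8 can be illustrated with a small sketch. It assumes a simple clipped unigram-recall variant of ROUGE and measures answer set similarity as the share of passages retrieved for the human rewrite that are also retrieved for the model rewrite. The helper names, passage identifiers and score lists are hypothetical, and the exact ROUGE configuration used in the experiments may differ.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase word tokens; punctuation is dropped."""
    return re.findall(r"\w+", text.lower())


def rouge1_recall(candidate, reference):
    """Clipped unigram recall: share of reference tokens covered by the candidate."""
    cand, ref = Counter(tokens(candidate)), Counter(tokens(reference))
    overlap = sum(min(count, cand[tok]) for tok, count in ref.items())
    return overlap / sum(ref.values())  # assumes a non-empty reference


def answer_recall(model_passages, human_passages):
    """Share of passages retrieved for the human rewrite that the model rewrite also retrieves."""
    human = set(human_passages)
    return len(human & set(model_passages)) / len(human)  # assumes a non-empty set


def pearson(xs, ys):
    """Pearson correlation of two equally long, non-constant sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Question pair quoted in the error analysis.
human = "What are good sources of melatonin in food?"
model = "What are good sources in food for melatonin"
print(rouge1_recall(model, human))  # 7 of the 8 reference tokens are covered

# Hypothetical ROUGE scores and retrieved passage IDs for five questions.
rouge_scores = [1.0, 0.875, 0.6, 0.4, 0.2]
reference_passages = ["p1", "p2", "p3"]
model_runs = [
    ["p1", "p2", "p3"],
    ["p1", "p2", "p9"],
    ["p1", "p8", "p9"],
    ["p1", "p2", "p3"],  # low lexical overlap, same answers
    ["p7", "p8", "p9"],
]
recalls = [answer_recall(run, reference_passages) for run in model_runs]
print(round(pearson(rouge_scores, recalls), 2))
```

The fourth toy question mimics the points near the top of the plot, where a rewrite with little lexical overlap still retrieves the same passages as the human rewrite.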

QA sensitivity.

We showed that QA results can provide an estimate of the question similarity. However, this property is directly dependent on the ability of the QA component to match equivalent questions to the same answer. Alternative question rewrites allow us to evaluate the robustness and consistency of the QA models. Our analysis indicates that small perturbations of the input question, such as anaphora resolution, have a considerable impact on the answer ranking, e.g., the pair of the original question: “Who are the Hamilton Electors and what were they trying to do?”, and the human rewrite: “Who are the Hamilton Electors and what were the Hamilton Electors trying to do?” produces ROUGE = 1 but R@1000 = 0.33. We identified many cases in which the inability of the QR component to generate apostrophes jeopardized relevance matching.

These results demonstrate the complexity of error estimation for this task. When humans are asked to judge the quality of the generated rewrites independently of the QA results, small deviations may not seem important, but from a pragmatic point of view they have an impact on the overall performance of the end-to-end system. However, such inconsistencies (typos and paraphrases) should be handled by the QA component, since they do not originate from context interpretation errors but are inherent in stand-alone questions as well.

QR completeness.

The level of detail required to answer a particular question is often not apparent and depends on the collection. There are cases in which original questions without rewriting were already sufficient to retrieve the correct answers from the passage collection (see the last row of Table 5). For example, original question: “What is the functionalist theory?”, human rewrite: “What is the functionalist theory in sociology?” However, in another question from the same dialogue, omitting the same word from the rewrite leads to retrieval of an irrelevant passage, since there are multiple alternative answers. This class of errors also corresponds to the variance evident in Figure 8, since a one-word difference between two questions may have very little effect on the answer ranking or may dramatically change the question interpretation. This effect, however, interacts with the size and diversity of the collection content. Some of the questions were correctly answered even when underspecified, e.g., original question: “What are some ways to avoid injury?”, human rewrite: “What are some ways to avoid sports injuries?”, because of the collection bias. The idea behind the question rewriting approach is to learn patterns that correct for such semantic differences independent of the collection content, similar to how humans resolve such cases, based on their knowledge of language and the world.

Original | QR | Human | F1 > 0 | F1 ≥ 0.5 | F1 = 1
× | × | × | 847 (136) | 1855 (235) | 2701 (332)
✓ | × | × | 174 | 193 | 181
× | ✓ | × | 19 | 35 (2) | 40 (1)
✓ | ✓ | × | 135 | 153 | 120
× | × | ✓ | 141 | 288 | 232
✓ | × | ✓ | 65 (1) | 57 (1) | 40
× | ✓ | ✓ | 226 | 324 | 269
✓ | ✓ | ✓ | 3964 (529) | 2666 (428) | 1988 (333)
Total: 5571 (666)
Table 6. Break-down analysis of all extractive QA results for the CANARD dataset, similar to Table 5.

7. Conclusion

Question rewriting (QR) is a challenging task that attempts to learn linguistic patterns that signal and resolve ambiguity in question formulation. The core idea behind developing QR as a separate component is that context understanding and question formulation are independent of the knowledge collection process, which mirrors the way humans approach the task by relying on their linguistic and world knowledge. Our experimental results show that human intuition about correct rewrites is sub-optimal, but it allows us to establish a mechanism that performs well across different datasets.

We showed in an end-to-end evaluation that question rewriting is an effective method to extend existing question answering (QA) approaches to conversational settings, which establishes a new state of the art on both retrieval and extractive conversational QA tasks on the TREC CAsT and CANARD datasets. The advantage of our approach is the ability to reuse existing non-conversational QA models for passage retrieval and extractive QA subtasks out of the box. An end-to-end conversational QA system requires integration of all three components. Here, we focused exclusively on the integration of the first component, responsible for the question formulation, and analyzed its effect on the other two components. By producing explicit representations of the question interpretation, question rewriting makes conversational QA results more explicable and the contribution of the individual components more transparent, which enables us to conduct a thorough performance analysis. Our analysis demonstrates the sensitivity of both QA models to differences in question formulation, which calls for more adequate evaluation setups that are able to reflect model robustness.

We compared two QA models side by side and discovered major differences in their interaction with the QR component. The role of question rewriting is especially prominent in the case of retrieval QA where the candidate answer space is so large that any ambiguity in question formulation results in a very different answer. In contrast, the extractive QA model setup is optimized for recall and tends to produce answers for questions that are ambiguous or unanswerable given the passage. Since the accuracy of the retrieval component is twice as low as the proportion of answerable questions in the CANARD (QuAC) dataset, integration of the two QA components is likely to result in considerable error propagation from passage retrieval to the answer extraction phase. An important direction for future work is to design architectures that will be able to mitigate this negative effect.

It is important to note that the QR-QA architecture, which we employed to account for previous conversation turns, is generic enough to incorporate other types of context, such as a user model or an environment context obtained from multi-modal data (deictic reference). Experimental evaluation of QR-QA performance augmented with such auxiliary inputs is a promising direction for future work.


We would like to thank our colleagues Srinivas Chappidi, Bjorn Hoffmeister, Stephan Peitz, Russ Webb, Drew Frank and Chris DuBois for their insightful comments.


  • D. Cer, Y. Yang, S. Kong, N. Hua, N. Limtiaco, R. St. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, B. Strope, and R. Kurzweil (2018) Universal sentence encoder for English. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018, pp. 169–174. Cited by: §4.2.
  • E. Choi, H. He, M. Iyyer, M. Yatskar, W. Yih, Y. Choi, P. Liang, and L. Zettlemoyer (2018) QuAC: question answering in context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2174–2184. Cited by: §2, §4.1, §4.1.
  • P. Christmann, R. Saha Roy, A. Abujabal, J. Singh, and G. Weikum (2019) Look before you hop: conversational question answering over knowledge graphs using judicious context expansion. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 729–738. Cited by: §2.
  • J. Dalton, C. Xiong, and J. Callan (2019) CAsT 2019: the conversational assistance track overview. In Proceedings of the Twenty-Eighth Text REtrieval Conference, TREC, pp. 13–15. Cited by: §5.
  • J. Dalton, C. Xiong, and J. Callan (2020) TREC CAsT 2019: the conversational assistance track overview. CoRR abs/2003.13624. External Links: 2003.13624 Cited by: §2, §4.1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, pp. 4171–4186. Cited by: §3.2, §3.3, §4.5.
  • L. Dietz, B. Gamari, J. Dalton, and N. Craswell (2018) TREC complex answer retrieval overview. Cited by: §3.2, §4.1.
  • A. Elgohary, D. Peskov, and J. Boyd-Graber (2019) Can you unpack that? learning to rewrite questions-in-context. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5920–5926. Cited by: §1, §2, §4.1, §4.4, §4.5, §5.
  • A. Fisch, A. Talmor, R. Jia, M. Seo, E. Choi, and D. Chen (2019) MRQA 2019 shared task: evaluating generalization in reading comprehension. arXiv preprint arXiv:1910.09753. Cited by: §4.1, §4.5.
  • J. Gao, M. Galley, and L. Li (2019) Neural approaches to conversational AI. Foundations and Trends in Information Retrieval 13 (2-3), pp. 127–298. Cited by: §1.
  • S. Gehrmann, Y. Deng, and A. M. Rush (2018) Bottom-up abstractive summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4098–4109. Cited by: §4.4.
  • D. Guo, D. Tang, N. Duan, M. Zhou, and J. Yin (2018) Dialog-to-action: conversational question answering over a large-scale knowledge base. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, pp. 2946–2955. Cited by: §2.
  • Y. Ju, F. Zhao, S. Chen, B. Zheng, X. Yang, and Y. Liu (2019) Technical report on conversational question answering. CoRR abs/1909.10772. Cited by: §2.
  • K. Lee, L. He, and L. Zettlemoyer (2018) Higher-order coreference resolution with coarse-to-fine inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, pp. 687–692. Cited by: §4.4.
  • C. Lin (2004) Rouge: a package for automatic evaluation of summaries. In Text summarization branches out, pp. 74–81. Cited by: §4.2.
  • S. Lin, J. Yang, R. Nogueira, M. Tsai, C. Wang, and J. Lin (2020) Conversational question reformulation via sequence-to-sequence architectures and pretrained language models. arXiv preprint arXiv:2004.01909. Cited by: §2.
  • C. Liu, R. Lowe, I. Serban, M. Noseworthy, L. Charlin, and J. Pineau (2016) How NOT to evaluate your dialogue system: an empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, pp. 2122–2132. Cited by: §1.
  • S. Longpre, Y. Lu, Z. Tu, and C. DuBois (2019) An exploration of data augmentation and sampling techniques for domain-agnostic question answering. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pp. 220–227. Cited by: §4.5.
  • T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, and L. Deng (2016) MS MARCO: A human generated machine reading comprehension dataset. In Proceedings of the Workshop on Cognitive Computation: Integrating neural and symbolic approaches, Cited by: §2, §3.2, §4.1.
  • R. Nogueira and K. Cho (2019) Passage re-ranking with BERT. arXiv preprint arXiv:1901.04085. Cited by: §4.1, §4.5.
  • R. Nogueira, W. Yang, J. Lin, and K. Cho (2019) Document expansion by query prediction. CoRR abs/1904.08375. Cited by: §2.
  • C. Qu, L. Yang, M. Qiu, Y. Zhang, C. Chen, W. B. Croft, and M. Iyyer (2019) Attentive history selection for conversational question answering. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM 2019, pp. 1391–1400. Cited by: §2.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI Blog 1 (8). Cited by: §3.1, §4.4.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2383–2392. Cited by: §2, §3.3.
  • P. Rastogi, A. Gupta, T. Chen, and L. Mathias (2019) Scaling multi-domain dialogue state tracking via query reformulation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, pp. 97–105. Cited by: item 3, §2.
  • S. Reddy, D. Chen, and C. D. Manning (2019) CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics 7, pp. 249–266. Cited by: §2.
  • A. See, P. J. Liu, and C. D. Manning (2017) Get to the point: summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1073–1083. Cited by: §4.4.
  • H. Su, X. Shen, R. Zhang, F. Sun, P. Hu, C. Niu, and J. Zhou (2019) Improving multi-turn dialogue modelling with utterance rewriter. arXiv preprint arXiv:1906.07004. Cited by: §2.
  • A. L. Thomas (1979) Ellipsis: the interplay of sentence structure and context. Lingua Amsterdam 47 (1), pp. 43–68. Cited by: §1.
  • M. Wang, N. A. Smith, and T. Mitamura (2007) What is the jeopardy model? a quasi-synchronous grammar for qa. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 22–32. Cited by: §2.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al. (2019) Transformers: state-of-the-art natural language processing. arXiv preprint arXiv:1910.03771. Cited by: §4.4.
  • P. Yang, H. Fang, and J. Lin (2017) Anserini: enabling the use of lucene for information retrieval research. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1253–1256. Cited by: §4.3.
  • Y. Yang, W. Yih, and C. Meek (2015) WikiQA: a challenge dataset for open-domain question answering. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 2013–2018. Cited by: §2.
  • W. Zhang, Y. Feng, F. Meng, D. You, and Q. Liu (2019) Bridging the gap between training and inference for neural machine translation. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, pp. 4334–4343. Cited by: §1.