Conversational search is an embodiment of an iterative and interactive information retrieval (IR) system that has been studied for decades (Belkin et al., 1995; Croft and Thompson, 1987; Oddy, 1977). Due to the recent rise of intelligent assistant systems, such as Siri, Alexa, AliMe, Cortana, and Google Assistant, a growing part of the population is moving their information-seeking activities to voice or text based conversational interfaces. This trend is closely related to the revival of research interest in question answering (QA) and conversational QA (ConvQA). ConvQA can be considered as a simplified setting of conversational search (Qu et al., 2019c). A significant limitation of this setting is that an answer is either extracted from a given passage (Qu et al., 2019d) or selected from a given candidate set (Yang et al., 2018). This simplification neglects the fundamental role of retrieval in conversational search. To address this issue, we introduce an open-retrieval ConvQA (ORConvQA) setting, where we learn to retrieve evidence from a large collection before extracting answers.
We illustrate the importance of ORConvQA by characterizing the task and discussing the considerations of an ORConvQA dataset as follows. A comparison between ORConvQA and related tasks is presented in Table 1.
|Task & Example Data||OR||Conv||IS||GIN|
|Open-Retrieval QA (Lee et al., 2019; Das et al., 2019)||✓||✗||-||-|
|Response Ranking w/ UDC (Lowe et al., 2015; Yang et al., 2018; Wu et al., 2016)||✗||✓||✓||✓|
|Conversational MC w/ CoQA (Zhu et al., 2018; Chen et al., 2019)||✗||✓||✗||✗|
|Conversational MC w/ QuAC (Qu et al., 2019d; Huang et al., 2018)||✗||✓||✓||✓|
|ORConvQA w/ OR-QuAC (this work)||✓||✓||✓||✓|
1. Open-retrieval. In previous work, the ConvQA task is formulated as a conversational machine comprehension (MC) problem with the goal being to extract or generate an answer from a given gold passage. This setting can be impractical in real-world applications since the gold passage is not always available, or there could be no ground truth answer in the given passage. Instead of being given the passage, a ConvQA system should be able to retrieve candidate passages from a collection. In particular, it is desirable if this retriever is learnable and can be fine-tuned on the downstream ConvQA task, instead of adopting fixed heuristic retrieval functions like TF-IDF or BM25. Moreover, the retrieval process should be open in terms of retrieving from a large collection instead of reranking a small number of passages in a closed set.
2. Conversational. Being conversational reflects the interactive nature of a search activity. The important problem of user interaction modeling in IR can be formulated as conversation history modeling in this scenario.
3. Information-seeking. An information-seeking conversation typically requires multiple turns of information exchange to allow the seeker to clarify an information need, provide feedback, and ask follow up questions. In this process, answers are revealed to the seeker through a sequence of interactions between the seeker and the provider. These answers are generally longer than the entity-based answers in factoid QA.
4. Genuine information needs. An information-seeking conversation is closer to real-world scenarios if the seeker is genuinely seeking an answer. In SQuAD (Rajpurkar et al., 2016) and CoQA (Reddy et al., 2018), the seekers’ information needs are not genuine because they have access to the passage and thus have the target answer in mind when asking the question. These questions are referred to as “back-written questions” (Ahmad et al., 2019) and have been reported to have more lexical overlap with their answers in SQuAD (Ahmad et al., 2019). This undesirable property makes the models learned from such datasets less practical.
To the best of our knowledge, there has not been a publicly available dataset that satisfies all the properties we discussed as shown in Table 1. We address this issue by aggregating existing data to create the OR-QuAC dataset. The QuAC (Choi et al., 2018) dataset offers information-seeking conversations that are collected with no seekers’ prior knowledge of the passages. We extend QuAC to an open-retrieval setting by creating a collection of over 11 million passages using the whole Wikipedia corpus. Another important resource used in our aggregation process is the CANARD dataset (Elgohary et al., 2019) that offers context-independent rewrites of QuAC questions. Some initial questions in QuAC conversations are underspecified. This makes conversation difficult to interpret in an open-retrieval setting. We make these dialogs self-contained by replacing the initial question in a conversation with its rewrite from CANARD. Our data has 5,644 dialogs with 40,527 questions. We release OR-QuAC to the community to facilitate research on ORConvQA.
In addition to proposing ORConvQA and creating the OR-QuAC dataset, we develop a system for ORConvQA following previous work on open-retrieval QA (Lee et al., 2019). Our end-to-end system features a retriever, a reranker, and a reader that are all based on Transformers (Vaswani et al., 2017). We enable history modeling in all components by concatenating history questions to the current question. The passage retriever first retrieves the top relevant passages from the collection given a question and its history. The reranker and reader then rerank and read the top passages to produce an answer. The training process contains two phases, a pretraining phase for the retriever and a concurrent learning phase for all system components.
Specifically, our retriever adopts a dual-encoder architecture (Ahmad et al., 2019; Lee et al., 2019; Das et al., 2019; Karpukhin et al., 2020) that uses separate ALBERT (Lan et al., 2019) encoders for questions and passages. The question encoder also encodes conversation history. After being pretrained, the passage encoder is frozen and encodes all passages in the collection offline. The reranker and the reader share the same BERT (Devlin et al., 2019) encoder. It encodes the input sequence, a concatenation of the question, history, and each relevant passage, into contextualized representations for reranking and answer extraction. We incorporate shared-normalization (Clark and Gardner, 2017) in our system to enable comparison among the candidate passages. In the concurrent learning phase, we encode the question and the history to dense vectors with the question encoder for efficient retrieval with maximum inner product search (MIPS) (Shrivastava and Li, 2014; Johnson et al., 2017). The top retrieved passages are fed to the reranker and reader for concurrent learning of all model components.
We conduct extensive experiments on our OR-QuAC dataset. First, we show that our system without any history information has comparable performance with a conversational version of BERTserini (Yang et al., 2019b) that considers history. This result demonstrates the importance of a learnable retriever in ORConvQA. We further show that our system can make a substantial improvement when we enable history modeling in all system components. Moreover, we conduct in-depth analyses on model ablation and configuration to provide insights for the ORConvQA task. We show that our reranker component contributes to the model performance by providing a regularization effect. We also demonstrate that the initial question of each dialog is crucial for our system to understand the user’s information need. Our code and data are available for research purposes (https://github.com/prdwb/orconvqa-release).
2. Related Work
Our work is closely related to several research topics, including QA, open domain QA, ConvQA, and conversational search. We mainly discuss retrieval based methods since they tend to offer more informative responses (Yang et al., 2018) and are thus a better fit for information-seeking tasks than generation based methods.
Question Answering. One of the first modern reformulations of the QA task dates back to the TREC-8 Question Answering Track (Voorhees and Tice, 1999). Its goal is to answer 200 fact-based, short-answer questions by leveraging a large collection of documents. A retrieval module is crucial in this task to retrieve relevant passages for answer extraction. As an increasing number of researchers in the natural language processing (NLP) community have moved their focus to answer extraction and generation methods, the role of retrieval has been gradually overlooked. As a result, many popular QA tasks and datasets follow either an answer selection setting (Yang et al., 2015; Wang et al., 2007; Garg et al., 2020) or a machine comprehension setting (Rajpurkar et al., 2016, 2018; Kwiatkowski et al., 2019; Trischler et al., 2016). In real-world scenarios, it is less practical to assume we are given a small set of candidate answers or a gold passage. Therefore, in this work, we make the retrieval component one of our focuses in the task formulation and model architecture.
Open Domain Question Answering. In contrast to the tasks that offer a pre-selected passage for answer extraction, open domain QA tasks provide the model with access to a large corpus (Dhingra et al., 2017) or at least a set of candidate documents for each question (Nguyen et al., 2016; Joshi et al., 2017; Dunn et al., 2017; Dhingra et al., 2017; Cohen et al., 2018). DrQA (Chen et al., 2017) and BERTserini (Yang et al., 2019b) present an end-to-end open domain QA system by using a TF-IDF/BM25 retriever and a neural reader. Some previous work (Lee et al., 2018; Htut et al., 2018; Kratzwald and Feuerriegel, 2018; Wang et al., 2018) learns to rerank or select from a closed set of passages for open domain QA. These methods may not scale well to an open-retrieval setting. Recently, Lee et al. (2019), Das et al. (2019), and Karpukhin et al. (2020) adopt a dual-encoder architecture to construct a learnable retriever and demonstrate their methods are scalable to large collections. ReQA (Ahmad et al., 2019) also uses a similar architecture to retrieve sentence-level answers directly. Although these works are limited to single turn QAs, they are valuable resources for us to study how to extend ConvQA to an open-retrieval setting.
Conversational Question Answering. Similar to the answer selection and MC tasks in single-turn QAs, existing ConvQA research is mostly limited to response ranking (Yang et al., 2018; Lowe et al., 2015; Yan et al., 2016b, a; Tao et al., 2019; Wu et al., 2016; Yang et al., 2019a, 2020) and conversational MC (Choi et al., 2018; Reddy et al., 2018; Qu et al., 2019c, d; Huang et al., 2018; Zhu et al., 2018; Yeh and Chen, 2019; Chen et al., 2019), where the role of retrieval is also neglected. Open-retrieval is particularly important to ConvQA since the answers of the questions from the same dialog may not necessarily come from the same passage. The model needs to learn to retrieve passages for each dialog question. Another challenge is to investigate how to enable history modeling not only in the reader but also in the retriever. Moreover, there are no existing datasets that are suitable to study ORConvQA. Therefore, we tackle these research questions in this work.
Conversational Search. While the concept of conversational search can be traced back to research (Belkin et al., 1995; Croft and Thompson, 1987; Oddy, 1977) from decades ago, recent years have witnessed its revival. In addition to the ConvQA work mentioned above, researchers are also actively working on other conversation tasks, including conversational recommendation and product search (Bi et al., 2019; Zhang et al., 2018), user intent prediction (Qu et al., 2019b), and question retrieval (Yang et al., 2017). Another rich body of work targets the user-oriented aspect (Qu et al., 2018; Chuklin et al., 2018; Trippas et al., 2018; Thomas et al., 2017; Qu et al., 2019a; Trippas et al., 2017, 2019) for conversational information seeking. Our work extends ConvQA to an open-retrieval setting as another fundamental step towards conversational search.
3. The OR-QuAC Dataset
The OR-QuAC dataset enhances QuAC by adapting it to an open-retrieval setting. It is an aggregation of three existing datasets: (1) the QuAC dataset (Choi et al., 2018) that offers information-seeking conversations, (2) the CANARD dataset (Elgohary et al., 2019) that consists of context-independent rewrites of QuAC questions, and (3) the Wikipedia corpus that serves as the knowledge source of answering questions. An example of OR-QuAC is presented in Figure 1. We will describe the data construction process in the following sections.
3.1. Self-Contained Information-seeking Dialogs
The QuAC (Question Answering in Context) dataset (Choi et al., 2018) is designed for modeling information-seeking conversations. It consists of real human-human dialogs between an information seeker and an information provider. The seeker tries to learn about a hidden Wikipedia text by asking a sequence of freeform questions. She/he only has access to the title and a summary of the article. This simulates a genuine information need. The provider answers each question by indicating a short span of the given passage. This dataset poses unique challenges because its questions are more open-ended, sometimes unanswerable, or only meaningful within the dialog context (Choi et al., 2018).
A drawback of QuAC is that many dialogs are not self-contained. This is typically caused by incomplete initial questions. A QuAC dialog is motivated by a general and underlying information need. During the data collection process, this information need is provided to both the seeker and provider before initiating the dialog. Therefore, the seeker might not necessarily reiterate this information need when asking the first question. For example, a seeker in QuAC is instructed to learn about Zhang Heng, a Chinese polymathic scientist. The very first question the seeker asked was “what did he have to do with science and technology?”. Such underspecified and ambiguous initial questions become an issue in the open-retrieval setting because they make the conversation difficult to interpret.
We tackle this issue by replacing initial questions in QuAC with their context-independent rewrites provided by the CANARD dataset. For example, the rewrite for the previously mentioned question is “What did Zhang Heng have to do with science and technology?”. We do the replacement for the first questions only. This makes a dialog self-contained while keeping the history dependencies within the dialog untouched.
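The replacement step can be illustrated with a minimal sketch. The data layout is hypothetical: a dialog is assumed to be a list of question strings, and `rewrites` is an assumed mapping from question position to its CANARD rewrite. Neither reflects the actual file formats of QuAC or CANARD.

```python
from typing import Dict, List


def make_self_contained(dialog: List[str], rewrites: Dict[int, str]) -> List[str]:
    """Replace only the first question of a dialog with its rewrite.

    `rewrites` maps a 0-based question position to a context-independent
    rewrite (hypothetical layout). Later questions are left untouched so
    that the history dependencies within the dialog are preserved.
    """
    if not dialog:
        return []
    first = rewrites.get(0, dialog[0])
    return [first] + dialog[1:]


dialog = ["what did he have to do with science and technology?",
          "what did he invent?"]
rewrites = {0: "What did Zhang Heng have to do with science and technology?",
            1: "What did Zhang Heng invent?"}
print(make_self_contained(dialog, rewrites))
```

Only the rewrite at position 0 is used, even though rewrites for later turns are available in this toy input.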
CANARD covers about half of the released QuAC questions. Since the QuAC test set is not publicly available, they use QuAC’s development set as their test set and 10% of QuAC’s training set as their development set (Elgohary et al., 2019). We follow the data split of CANARD. QuAC questions that are not in CANARD are discarded. The data statistics of our derived dataset, OR-QuAC, are presented in Table 2.
|# Avg. Question Tokens||6.7||6.6||6.7|
|# Avg. Answer Tokens||12.5||12.6||12.2|
|# Avg. Questions / Dialog||7.2||7.0||7.2|
We use the whole Wikipedia corpus to construct a collection since passages in QuAC are from Wikipedia. We use the English Wikipedia dump from 10/20/2019 (https://dumps.wikimedia.org/enwiki/20191020/). The Wikipedia passages in QuAC were downloaded via PetScan (http://petscan.wmflabs.org/) (Choi et al., 2018), and thus, the exact date for the data dump is unavailable. Therefore, we use the latest data dump instead of trying to match the date of QuAC. We then use the WikiExtractor (https://github.com/attardi/wikiextractor) to extract and clean text from the data dump, resulting in over 5.9 million Wikipedia articles. After this, we split the articles into chunks with at most 384 wordpieces using the tokenizer of BERT, following Lee et al. (2019). The split is done greedily while preserving sentence boundaries. These chunks are referred to as passages. Fewer than 0.5% of known answers are split into different passages. Their corresponding questions are considered as unanswerable during training. We do the split to make the passages fit for Transformer based retrievers and readers. Moreover, Yang et al. (2019b) reported that the paragraph level is the best granularity for an end-to-end retrieve-and-read framework compared to the article and sentence levels. They believe the reason is that an article may contain non-relevant content that distracts the reader while a sentence may lack context information. For an open-retrieval setting, we prefer passage-level retrieval over article-level since a full article would be harder to represent with a fixed-length dense vector.
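The greedy, sentence-preserving split can be sketched as follows. This is an illustration under stated assumptions, not the released preprocessing code: sentence segmentation is taken as given, `whitespace_count` is a stand-in for BERT's WordPiece tokenizer, and the handling of a single sentence longer than the budget is our own choice since the text does not specify it.

```python
from typing import Callable, List


def greedy_chunk(sentences: List[str],
                 count_tokens: Callable[[str], int],
                 max_tokens: int = 384) -> List[str]:
    """Greedily pack whole sentences into passages of at most max_tokens."""
    passages: List[str] = []
    current: List[str] = []
    current_len = 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Close the current passage if this sentence would overflow it.
        if current and current_len + n > max_tokens:
            passages.append(" ".join(current))
            current, current_len = [], 0
        # Assumption: an over-long sentence becomes a passage of its own.
        current.append(sentence)
        current_len += n
    if current:
        passages.append(" ".join(current))
    return passages


def whitespace_count(text: str) -> int:
    # Stand-in for a WordPiece tokenizer; real wordpiece counts are larger.
    return len(text.split())


sentences = ["a b c.", "d e.", "f g h i.", "j."]
print(greedy_chunk(sentences, whitespace_count, max_tokens=5))
```

With a budget of 5 tokens, the first two sentences fill one passage exactly and the remaining two form the second.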
Since the paragraphs in QuAC may not be exactly the same as those in the Wikipedia dump given the difference in the dates of the dumps, we conduct the same split process for QuAC paragraphs and replace the Wikipedia passages with QuAC passages that have the same article titles. The positions of the ground truth answer spans are mapped to the new passages. The resulting collection has over 11 million passages for retrieval.
Due to the synthetic nature of this dataset, the answers of the questions in the same dialog are distributed in the same section of text. In the real world, questions and answers in a dialog may be distributed at different locations of the corpus. This is a limitation of our dataset.
4. An End-to-end ORConvQA System
In this section, we first formally define the task of open-retrieval conversational QA. We then describe our end-to-end system that deals with this task and explain the intuitions behind it.
4.1. Task Definition
The ORConvQA task is defined as follows. Given the $k$-th question $q_k$ in a conversation, and all history questions $\{q_i\}_{i=1}^{k-1}$ preceding $q_k$, the task is to predict an answer $a_k$ for $q_k$ using a passage collection $C$. In an extractive setting, $a_k$ is a text span of a passage in $C$. We do not assume we have access to ground truth history answers since it is impractical in real-world scenarios.
Extractive models are trained on the supervision signals of the position of a span in the gold passage. Previous works (Chen et al., 2017; Lee et al., 2019; Das et al., 2019) present a distantly-supervised setting, where they only have access to QA pairs without gold passages. This setting heavily relies on a heuristic that a positive passage should contain an exact match of the known answer. Short and entity-based answers can often be discovered in multiple passages, meaning that positive passages are highly substitutable. In information-seeking conversations that are motivated by genuine information needs, however, the answers are typically much longer. For example, QuAC answers have 12 tokens on average while SQuAD and CoQA answers have 3 (Reddy et al., 2018). It is common that the retrieved passages do not contain exact matches of the known answers, making many training examples useless. To tackle this, we adopt a fully-supervised setting: we assume we have access to gold passages so that we can include them if they are not present in the retrieval results and use the ground-truth answer spans. This is done at training time only. Although this is a limitation, it does not conflict with the learnable retriever we promote. We will work on a weak supervision method that is suitable for information-seeking conversations in our future work.
4.2. Model Overview
We now present an end-to-end system that deals with the ORConvQA task described in Section 4.1. Our system consists of three major components: a passage retriever, a passage reranker, and a passage reader. The reranker and reader are based on the same encoder. All components are learnable. As described in Figure 2, the passage retriever first retrieves the top-$K$ relevant passages from the collection given a question and its history. The passage reranker and reader then rerank and read the top passages to produce an answer. History modeling is enabled in all components. We will describe each component in detail in the following sections.
4.3. Passage Retriever
We follow previous work (Ahmad et al., 2019; Lee et al., 2019; Das et al., 2019; Karpukhin et al., 2020) by using a dual-encoder architecture to construct a learnable retriever. This architecture features separate encoders for questions and passages. The retriever score is then defined as the dot product of the hidden representations of a question and a passage. We use two ALBERT (Lan et al., 2019) models, one for each encoder. ALBERT is a lite BERT (Devlin et al., 2019) model for learning bidirectional language representations from Transformers (Vaswani et al., 2017). It reduces the parameters of BERT by cross-layer parameter sharing and factorized embedding parameterization (Lan et al., 2019).
Given all available history questions $\{q_i\}_{i=1}^{k-1}$ of the current question $q_k$, we first identify those that are in a history window of size $w$. These questions are denoted as $\{q_i\}_{i=k-w}^{k-1}$. We then construct a concatenation of $\{q_i\}_{i=k-w}^{k-1}$ and $q_k$. We prepend the initial question of the conversation, $q_1$, to the concatenation if $q_1$ is not already included. The initial question typically contains an information need that is pertinent to the entire conversation as explained in Section 3.1. The reformatted question for the retriever is denoted as $q_k^{rt}$. For an ALBERT based question encoder, the input sequence would be “[CLS] $q_1$ [SEP] $q_{k-w}$ [SEP] $\cdots$ [SEP] $q_{k-1}$ [SEP] $q_k$ [SEP]”. All questions are in the same segment. [CLS] and [SEP] are special tokens introduced in BERT (Devlin et al., 2019). We then take the [CLS] representation and project it to a 128-dimensional vector as the question representation following Lee et al. (2019). Formally,
$$\mathbf{v}_q = W_q \, F_q(q_k^{rt})[\mathrm{CLS}] \tag{1}$$
where $F_q$ is the question encoder, $W_q$ is the projection matrix for the question [CLS] representation, and $\mathbf{v}_q$ is the final question representation enhanced with history information. We then follow the same scheme to obtain the passage representation for a passage $p_j$:
$$\mathbf{v}_p = W_p \, F_p(p_j)[\mathrm{CLS}] \tag{2}$$
where $p_j$ is a passage in the collection, $F_p$ is the passage encoder, $W_p$ is the projection matrix for the passage [CLS] representation, and $\mathbf{v}_p$ is the final passage representation. Finally, the retrieval score is computed as
$$S_{rt}(q_k^{rt}, p_j) = \mathbf{v}_q^\top \mathbf{v}_p \tag{3}$$
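The construction of the retriever's input sequence can be made concrete with a short sketch. It assumes questions are stored in a list with 0-based positions (so the initial question is `questions[0]`), and it builds the sequence as a plain string for illustration. The function name is hypothetical, and a real tokenizer would insert the special tokens itself.

```python
from typing import List


def build_retriever_question(questions: List[str], k: int, w: int) -> str:
    """Build the retriever input for the question at 0-based position k.

    questions[0] is the initial question and w is the history window size.
    """
    start = max(0, k - w)
    window = questions[start:k]        # at most w history questions
    parts = window + [questions[k]]
    if start > 0:
        # The initial question is not inside the window, so prepend it.
        parts = [questions[0]] + parts
    return "[CLS] " + " [SEP] ".join(parts) + " [SEP]"


qs = ["q1", "q2", "q3", "q4"]
print(build_retriever_question(qs, k=3, w=1))
```

Setting `w=0` still yields the initial question followed by the current one, which is the input a model without history modeling would receive.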
4.4. Passage Reader/Reranker
Given the current question $q_k$, history questions $\{q_i\}_{i=1}^{k-1}$, the history window size $w$, and one of the retrieved passages $p_j$, the passage reader predicts an answer span within the passage. In contrast to Lee et al. (2019) and Yang et al. (2019b), we introduce reranking into this process with little additional cost. Our reader mostly follows the standard architecture of a BERT based MC model (Devlin et al., 2019). We enhance this model by applying the shared-normalization mechanism proposed by Clark and Gardner (2017) to enable comparison across all retrieved passages for a question. Similar mechanisms are also adopted by Yang et al. (2019b) and Lee et al. (2019).
The reader and reranker share the same BERT encoder. Similar to the retriever, we first construct a reformatted question by concatenating history questions within a history window and the current question. We do not additionally prepend the initial question because the conversation is considered to be grounded to $p_j$. The reformatted question for the reader is denoted as $q_k^{rd}$. We then concatenate a retrieved passage $p_j$ to form the input sequence for the BERT model. Specifically, the input sequence is “[CLS] $q_{k-w}$ [SEP] $\cdots$ [SEP] $q_{k-1}$ [SEP] $q_k$ [SEP] $p_j$ [SEP]”, with $q_k^{rd}$ and $p_j$ in different segments. The BERT model, denoted as $F_{rd}$, then generates contextualized representations for all tokens in the input sequence:
$$\mathbf{T}_{[m]} = F_{rd}(q_k^{rd}, p_j)[m] \tag{4}$$
where $\mathbf{T}_{[m]}$ is the representation for the $m$-th token in the input sequence. We also need the sequence representation obtained by
$$\mathbf{v}_{[\mathrm{CLS}]} = W_{\mathrm{CLS}} \, \mathbf{T}_{[\mathrm{CLS}]} \tag{5}$$
where $W_{\mathrm{CLS}}$ is a projection for the [CLS] representation to obtain the sequence representation following Devlin et al. (2019).
As shown in Figure 2, the reranker component conducts a listwise reranking of the top retrieved passages. The reranking task provides more supervision signals to fine-tune the BERT encoder. The representation learning of the encoder also benefits from the regularization effect of optimizing for multiple tasks. Moreover, the reranking task adds little additional cost to the training process because representations for all tokens, including the [CLS] token, are generated with vectorization in a Transformer architecture. Specifically, we learn a reranking vector $\mathbf{w}_{rr}$ to project the sequence representation to a reranking score $S_{rr}$:
$$S_{rr}(q_k^{rd}, p_j) = \mathbf{w}_{rr}^\top \mathbf{v}_{[\mathrm{CLS}]} \tag{6}$$
The reader predicts an answer span by computing scores of each token being the start token and the end token. We learn two sets of parameters, a start vector $\mathbf{w}_s$ and an end vector $\mathbf{w}_e$, to project token representations to start and end scores:
$$S_s(q_k^{rd}, p_j, [m]) = \mathbf{w}_s^\top \mathbf{T}_{[m]}, \qquad S_e(q_k^{rd}, p_j, [m]) = \mathbf{w}_e^\top \mathbf{T}_{[m]} \tag{7}$$
where $S_s$ and $S_e$ are the scores for the $m$-th token being the start and end tokens of the answer span. The reader score and overall score will be computed at inference time in Section 4.6.
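To show how one encoder pass serves both the reranker and the reader, the sketch below computes a reranking score and per-token start and end scores from precomputed representations. The vectors are toy values and plain Python lists stand in for tensors. All names are illustrative assumptions rather than identifiers from the released code.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def score_passage(token_reprs: List[Vector], seq_repr: Vector,
                  w_rr: Vector, w_s: Vector, w_e: Vector
                  ) -> Tuple[float, List[float], List[float]]:
    """Return (reranking score, start scores, end scores) for one passage.

    token_reprs holds one contextualized vector per input token and
    seq_repr is the projected [CLS] representation of the same input.
    """
    rerank = dot(w_rr, seq_repr)                  # one score per passage
    start = [dot(w_s, t) for t in token_reprs]    # one score per token
    end = [dot(w_e, t) for t in token_reprs]
    return rerank, start, end


tokens = [[1.0, 0.0], [0.0, 2.0]]
print(score_passage(tokens, [1.0, 1.0], [0.5, 0.5], [1.0, 0.0], [0.0, 1.0]))
```

The reranking head adds only one extra dot product per passage on top of the reader, which is consistent with the claim that reranking comes at little additional cost.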
4.5. Training
Our training procedure contains two phases. The first is the retriever pretraining phase, followed by the concurrent learning phase of the retriever (question encoder), reranker, and reader.
4.5.1. Retriever Pretraining
We follow previous work (Lee et al., 2019) to pretrain the retriever so that it gives a reasonable performance in the concurrent learning phase.
In Section 4.3, we mentioned that history modeling is enabled in the retriever by prepending history questions. The history window size is a hyper-parameter and is tunable. In the pretraining phase, however, we would like to train a uniform retriever for every single history window size. Therefore, we use the rewrite in CANARD as the reformatted question for a question in the pretraining phase. We will mitigate the question mismatch issue by fine-tuning the question encoder in the concurrent learning phase.
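The pretraining process of the retriever is described in Figure 3. Given a batch of $B$ question representations and their gold passage representations, stacked row-wise into matrices $V_q$ and $V_p$, we obtain the retrieval scores for the batch by
$$S = V_q V_p^\top$$
where $S \in \mathbb{R}^{B \times B}$. The element in the $i$-th row and $j$-th column of $S$, denoted as $S_{ij}$, represents the retrieval score of the $i$-th question and the $j$-th passage in the batch. The objective is to maximize the probability of the gold passage for each question:
$$P(p_i \mid q_i, \{p_j\}_{j=1}^{B}) = \frac{\exp(S_{ii})}{\sum_{j=1}^{B} \exp(S_{ij})}$$
Here $q_i$ and $p_i$ denote the $i$-th question in the batch and its gold passage. In other words, the passage set $\{p_j\}_{j \neq i}$ is considered as randomly sampled negative passages for $q_i$. The pretraining loss for this batch is then defined as follows.
$$\mathcal{L}_{pt} = -\frac{1}{B} \sum_{i=1}^{B} \log P(p_i \mid q_i, \{p_j\}_{j=1}^{B})$$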
Lee et al. (2019) suggest that it is crucial to set the batch size to a large number because it makes the pretraining task more difficult and closer to what the retriever observes at test time. Therefore, we use two ALBERT models as the question encoder and the passage encoder. This doubles the batch size compared to that of using BERT models. The ALBERT models are fine-tuned.
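The in-batch negative objective used for retriever pretraining can be sketched in a few lines. Plain Python lists stand in for encoder outputs, and averaging the per-question losses over the batch is an assumption made for this illustration. The function name is hypothetical.

```python
import math
from typing import List, Sequence


def in_batch_pretraining_loss(q_vecs: List[Sequence[float]],
                              p_vecs: List[Sequence[float]]) -> float:
    """Mean negative log-likelihood of each question's gold passage.

    p_vecs[i] is the gold passage of q_vecs[i]; every other passage in
    the batch acts as a randomly sampled negative for question i.
    """
    assert len(q_vecs) == len(p_vecs) and q_vecs
    total = 0.0
    for i, q in enumerate(q_vecs):
        # Row i of the batch score matrix: dot products with all passages.
        scores = [sum(a * b for a, b in zip(q, p)) for p in p_vecs]
        m = max(scores)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[i]
    return total / len(q_vecs)


loss = in_batch_pretraining_loss([[1.0, 0.0], [0.0, 1.0]],
                                 [[1.0, 0.0], [0.0, 1.0]])
print(round(loss, 4))
```

A larger batch adds more terms to each softmax denominator, which is why a larger batch makes the task harder.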
We then encode all passages in the collection offline with the passage encoder and obtain a set of passage vectors. Finally, we use Faiss (https://github.com/facebookresearch/faiss), a library for efficient similarity search of dense vectors, to create an index for maximum inner product search. Retrieval is performed on a GPU during concurrent learning for faster training.
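For intuition about what the index computes, the following sketch performs exact maximum inner product search by brute force over a small list of passage vectors. This is a swapped-in illustration and not the Faiss-based setup described above; a dedicated library is needed for a collection of millions of passages.

```python
import heapq
from typing import List, Sequence, Tuple


def mips_top_k(query: Sequence[float],
               passage_vecs: List[Sequence[float]],
               k: int) -> List[Tuple[int, float]]:
    """Return (passage index, inner product) for the k best passages."""
    scored = ((sum(a * b for a, b in zip(query, p)), idx)
              for idx, p in enumerate(passage_vecs))
    top = heapq.nlargest(k, scored)   # sorted by score, highest first
    return [(idx, score) for score, idx in top]


passages = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
print(mips_top_k([1.0, 0.5], passages, k=2))
```

Because the passage vectors are fixed after pretraining, only the query vector changes between training steps, so the collection never needs to be re-encoded.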
4.5.2. Concurrent Learning of the Retriever, Reranker, and Reader
As indicated in Figure 2, given the current question $q_k$, the history questions $\{q_i\}_{i=1}^{k-1}$, and the history window size $w$, we obtain the reformatted question for the retriever $q_k^{rt}$ and for the reader $q_k^{rd}$. We first obtain the question representation $\mathbf{v}_q$ of $q_k^{rt}$ using the question encoder in Equation 1. We then retrieve the top $K_{rd}$ passages for the reader from the passage collection using the index we created offline. This set of top passages is denoted as $P_{rd}$. The number of negative samples for the retriever is limited by the CUDA memory in the retriever pretraining phase. In the concurrent learning phase, we can use a relatively large number of negative samples to fine-tune the retriever at a low cost since all passages have been encoded offline. Therefore, we also retrieve the top $K_{rt}$ passages, where $K_{rt} > K_{rd}$, for an aggressive update of the retriever following Lee et al. (2019). This set of passages is denoted as $P_{rt}$. If the gold passage of $q_k$ is not present in $P_{rd}$ or $P_{rt}$, we manually include it in the retrieval results. Formally, the retriever loss to fine-tune the question encoder in the retriever is defined as follows.
$$\mathcal{L}_{rt} = -\log \frac{\exp\big(S_{rt}(q_k^{rt}, p_{j^*})\big)}{\sum_{p_j \in P_{rt}} \exp\big(S_{rt}(q_k^{rt}, p_j)\big)}$$
where $S_{rt}$ is the retrieval score in Equation 3 and $j^*$ is the position of the gold passage in $P_{rt}$.
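Passages in the reader's retrieved set $P_{rd}$ are then fed into the reader/reranker module. This module conducts reading and reranking simultaneously. For every passage $p_j \in P_{rd}$, we obtain a reranking score $S_{rr}(q_k^{rd}, p_j)$ following Equation 6. We then compute the reranking probability and the reranking loss as follows.
$$P_{rr}(p_j \mid q_k^{rd}, P_{rd}) = \frac{\exp\big(S_{rr}(q_k^{rd}, p_j)\big)}{\sum_{p_{j'} \in P_{rd}} \exp\big(S_{rr}(q_k^{rd}, p_{j'})\big)}, \qquad \mathcal{L}_{rr} = -\log P_{rr}(p_{j^*} \mid q_k^{rd}, P_{rd})$$
where $j^*$ is the position of the gold passage in $P_{rd}$.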
For the reader component, a standard BERT based machine comprehension model uses the cross entropy loss to maximize the probability of the true start and end tokens among all tokens in the given passage. Different from that, we apply the shared-normalization mechanism (Clark and Gardner, 2017) to this step to maximize the probabilities of the true start and end tokens among all tokens from all passages in the reader's retrieved set $P_{rd}$. This makes the model produce start and end scores that are comparable across passages. The passages are encoded independently, and the shared-normalization is applied to all passages at the last step. For a passage $p_j \in P_{rd}$, we obtain a start score $S_s(q_k^{rd}, p_j, [m])$ for every token [m] in the input sequence. The training loss for the start token is then defined as follows.
$$\mathcal{L}_{s} = -\log \frac{\exp\big(S_s(q_k^{rd}, p_{j^*}, [S])\big)}{\sum_{p_j \in P_{rd}} \sum_{[m]} \exp\big(S_s(q_k^{rd}, p_j, [m])\big)}$$
where [S] is the true start token in the gold passage $p_{j^*}$. For unanswerable questions, we set the start and end tokens to [CLS]. The BERT encoder is fine-tuned. The loss function of the end token, $\mathcal{L}_{e}$, is defined in the same way. The reader loss is computed as follows.
$$\mathcal{L}_{rd} = \frac{1}{2}\,(\mathcal{L}_{s} + \mathcal{L}_{e})$$
Finally, the concurrent learning loss is computed as:
$$\mathcal{L} = \mathcal{L}_{rt} + \mathcal{L}_{rr} + \mathcal{L}_{rd}$$
where $\mathcal{L}_{rt}$ and $\mathcal{L}_{rr}$ are the retriever loss and the reranking loss.
Although the gradients of the reader/reranker do not back propagate to the retriever, we train these modules concurrently so that the reader/reranker can benefit from seeing more negative passages due to a dynamically changing set of retrieved passages $P_{rd}$.
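The effect of shared normalization on the start-token loss can be seen in a minimal sketch: the softmax runs over the tokens of all retrieved passages at once rather than within each passage. The scores are toy numbers, and the function name and the `(passage, token)` encoding of the gold position are assumptions for illustration.

```python
import math
from typing import List, Tuple


def shared_norm_loss(scores_per_passage: List[List[float]],
                     gold: Tuple[int, int]) -> float:
    """Negative log-probability of the gold token under a single softmax
    taken over every token of every retrieved passage.

    gold = (passage index, token index) of the true start (or end) token.
    """
    flat = [s for passage in scores_per_passage for s in passage]
    m = max(flat)  # subtract the max for a stable log-sum-exp
    log_z = m + math.log(sum(math.exp(s - m) for s in flat))
    j, t = gold
    return log_z - scores_per_passage[j][t]


# Two passages with two tokens each and identical scores everywhere.
print(shared_norm_loss([[0.0, 0.0], [0.0, 0.0]], gold=(1, 0)))
```

With four equally scored tokens the loss is log 4, whereas normalizing within the gold passage alone would give log 2. Tokens in non-gold passages therefore compete directly with the gold token, which is what makes scores comparable across passages.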
4.6. Inference
Given the current question $q_k$, the history questions $\{q_i\}_{i=1}^{k-1}$, and the history window size $w$, we follow the same process in the concurrent learning phase to retrieve a set of relevant passages $P_{rd}$. Note we do not manually include the gold passage in $P_{rd}$ at inference time. For a passage $p_j \in P_{rd}$, we obtain the retriever score $S_{rt}(q_k^{rt}, p_j)$ and the reranker score $S_{rr}(q_k^{rd}, p_j)$ following Equations 3 and 6. We then follow Devlin et al. (2019) to obtain the reader score using the start score and the end score in Equation 7 as follows.
$$S_{rd}(q_k^{rd}, p_j, s) = S_s(q_k^{rd}, p_j, [m_s]) + S_e(q_k^{rd}, p_j, [m_e])$$
where $s$ is the answer span with the start token $[m_s]$ and end token $[m_e]$. To ensure tractability, we only consider the top 20 spans following convention (Devlin et al., 2019). Invalid predictions, including the cases where the start token comes after the end token, or the predicted span overlaps with the question part of the input sequence, are discarded. Finally, the overall score is defined as a function of the current question $q_k$, its history questions $\{q_i\}_{i=1}^{k-1}$, a history window size $w$, a retrieved passage $p_j$, and an answer span $s$ as in Figure 2:
$$S(q_k, \{q_i\}_{i=1}^{k-1}, w, p_j, s) = S_{rt}(q_k^{rt}, p_j) + S_{rr}(q_k^{rd}, p_j) + S_{rd}(q_k^{rd}, p_j, s) \tag{19}$$
The system outputs the answer span that has the largest overall score for each question in a conversation.
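The inference procedure can be sketched as below, under several labeled assumptions: the overall score is taken to be the sum of the retriever, reranker, and reader scores; the "top 20" restriction is implemented as the 20 best start and 20 best end positions; and a span covering only position 0 ([CLS]) is kept as the "no answer" prediction. The `Candidate` type and function names are hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple


class Candidate(NamedTuple):
    retriever_score: float      # retrieval score of this passage
    reranker_score: float       # reranking score of this passage
    start_scores: List[float]   # one start score per input token
    end_scores: List[float]     # one end score per input token
    passage_start: int          # index of the first passage token


def best_span(c: Candidate, n_best: int = 20) -> Optional[Tuple[float, int, int]]:
    """Return (reader score, start, end) of the best valid span, or None."""
    n = len(c.start_scores)
    top_s = sorted(range(n), key=lambda i: c.start_scores[i], reverse=True)[:n_best]
    top_e = sorted(range(n), key=lambda i: c.end_scores[i], reverse=True)[:n_best]
    best = None
    for s in top_s:
        for e in top_e:
            no_answer = s == 0 and e == 0           # [CLS] span (assumption)
            in_passage = c.passage_start <= s <= e  # ordered, outside question
            if not (no_answer or in_passage):
                continue                            # invalid span, discard
            score = c.start_scores[s] + c.end_scores[e]
            if best is None or score > best[0]:
                best = (score, s, e)
    return best


def select_answer(cands: List[Candidate]) -> Optional[Tuple[float, int, int, int]]:
    """Return (overall score, passage index, start, end) with the top score."""
    best = None
    for j, c in enumerate(cands):
        span = best_span(c)
        if span is None:
            continue
        reader_score, s, e = span
        overall = c.retriever_score + c.reranker_score + reader_score
        if best is None or overall > best[0]:
            best = (overall, j, s, e)
    return best


cands = [
    Candidate(1.0, 0.0, [0.0, 5.0, 1.0, 0.0], [0.0, 0.0, 1.0, 2.0], 2),
    Candidate(0.5, 0.2, [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], 2),
]
print(select_answer(cands))
```

In the first candidate, the highest start score sits at position 1, inside the question part, so that position is skipped and the span from token 2 to token 3 is chosen instead.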
5. Experimental Setups
We now describe our experimental setups, including competing methods, evaluation metrics, and implementation details.
5.1. Competing Methods
To the best of our knowledge, there is no published work tackling the ORConvQA problem that we describe in Section 4.1. There is, however, a rich body of work on single-turn open-domain QA, led by DrQA (Chen et al., 2017). We can adapt such methods to a conversational setting by using the same history modeling method in our system. Given the effort to adapt such models to ORConvQA, we only compare to the original DrQA and the best model that we are aware of, BERTserini (Yang et al., 2019b). To be specific, the competing methods are:
DrQA (Chen et al., 2017). This model uses a TF-IDF retriever and an RNN based reader. We train this model on OR-QuAC dialogs with gold passages. At test time, the passages are retrieved with the retriever. This setting is consistent with DrQA’s original setting. We do not use its distantly-supervised setting since we would like to adopt full supervision for all competing methods in this work. We start from their open-sourced implementation on GitHub (https://github.com/facebookresearch/DrQA).
BERTserini (Yang et al., 2019b). This model uses a BM25 retriever from Anserini (http://anserini.io/) and a BERT reader. Their BERT reader is similar to ours, except that it does not support reranking and thus cannot benefit from multi-task learning. They study the granularity of retrieval, including article, paragraph, and sentence. They conclude that retrieval on a paragraph level gives the best overall performance. We only compare to the paragraph retrieval setting since it is the best and is consistent with our passage retrieval setting. We use the top 5 passages for the reader to be consistent with our setup. This baseline is our implementation since BERTserini’s source code was not available at the time of our submission.
ORConvQA without history (Ours w/o hist.). This is our model described in Section 4 with the history window size set to 0. Note that the first question of a dialog is still included in the reformatted question for the retriever, as described in Section 4.3. This model is our adaptation of the open-retrieval QA framework (Lee et al., 2019) to a conversational setting. We use a more direct and resource-efficient retriever pretraining method that is suitable for ConvQA. We also enable reranking in the reader component.
ORConvQA (Ours). This is our full model described in Section 4.
We adapt DrQA and BERTserini to a conversational setting using the same history modeling method as in our model. It involves prepending history questions to the current question to form the reformatted questions for the retriever and the reader. For these models and our ORConvQA model, the history window size is tuned on the development set. We report their performance under the best history setting.
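The history modeling step above can be illustrated with a minimal sketch. The function name `reformat_question`, its parameters, and the whitespace separator are hypothetical choices for illustration; the exact input format is defined in Section 4.3 and may differ.

```python
from typing import List


def reformat_question(questions: List[str], turn: int, window: int,
                      include_first: bool = True, sep: str = " ") -> str:
    """Prepend up to `window` previous questions to the current question.

    questions: all questions of one dialog, in order.
    turn: 0-based index of the current question.
    include_first: also prepend the dialog's first question when it falls
        outside the window (the retriever setting; an assumption here is
        that the reader does not do this).
    sep: separator between turns (hypothetical; not specified here).
    """
    start = max(0, turn - window)
    history = questions[start:turn]
    if include_first and turn > 0 and start > 0:
        history = [questions[0]] + history
    return sep.join(history + [questions[turn]])


dialog = ["q1", "q2", "q3", "q4"]
print(reformat_question(dialog, turn=3, window=1))                        # retriever input
print(reformat_question(dialog, turn=3, window=1, include_first=False))  # reader input
```

With a window size of 0, the retriever input in this sketch still contains the first question of the dialog, which matches the "Ours w/o hist." setting described above.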
5.2. Evaluation Metrics
The word-level F1 and the human equivalence score (HEQ) are two metrics provided by the QuAC challenge to evaluate ConvQA systems. F1 measures the overlap of the predicted answer span and the ground truth answer span. This is our most important metric since it evaluates the overall performance of the system. HEQ computes the percentage of examples for which system F1 exceeds or matches human F1. It measures whether a system can give answers as good as an average human. This metric is computed on a question level (HEQ-Q) and a dialog level (HEQ-D).
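As a rough illustration of these two metrics, the sketch below computes a word-level F1 and the HEQ scores. It is a simplification under stated assumptions: it uses lowercased whitespace tokenization rather than the official QuAC answer normalization, it ignores multiple reference answers, and it assumes that a dialog counts toward HEQ-D only when every question in it matches or exceeds human F1. All function names are hypothetical.

```python
from collections import Counter
from typing import List


def word_f1(prediction: str, reference: str) -> float:
    """Word-level F1 between two answer strings (whitespace tokens)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def heq_q(system_f1: List[float], human_f1: List[float]) -> float:
    """Percentage of questions where system F1 >= human F1."""
    hits = sum(s >= h for s, h in zip(system_f1, human_f1))
    return 100.0 * hits / len(system_f1)


def heq_d(system_by_dialog: List[List[float]],
          human_by_dialog: List[List[float]]) -> float:
    """Percentage of dialogs where system F1 >= human F1 on every question
    (assumed dialog-level definition)."""
    hits = sum(
        all(s >= h for s, h in zip(sys_d, hum_d))
        for sys_d, hum_d in zip(system_by_dialog, human_by_dialog)
    )
    return 100.0 * hits / len(system_by_dialog)
```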
In addition to F1 and HEQ, we also use the Mean Reciprocal Rank (MRR) and Recall to evaluate the retrieval performance for the retriever and reranker. The reciprocal rank of a query is the inverse of the rank of the first positive passage in the retrieved passages. MRR is the mean of the reciprocal ranks of all queries. This metric is computed for both the retriever and reranker. MRR is a reflection of how well these two components contribute to the overall score in Equation 19. Recall is the fraction of the total amount of relevant passages that are retrieved. There is only one positive passage for each question in the training and development sets. In comparison, there could be more than one positive passage for a testing question since there are multiple reference answers per question provided by QuAC. Recall is computed for the retriever only since reranking does not impact this measure. This metric reflects whether the retriever can provide reasonable retrieval performance for the rest of the system. All retrieval metrics are computed for the top 5 passages that are retrieved for the reader/reranker.
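The retrieval metrics defined above can be sketched as follows. The helper names and the passage-ID representation are hypothetical; the cutoff of 5 follows the top-5 setting stated in the text.

```python
from typing import List, Set


def reciprocal_rank(ranked_ids: List[str], positives: Set[str], k: int = 5) -> float:
    """Inverse rank of the first positive passage in the top k, or 0 if none."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in positives:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids: List[str], positives: Set[str], k: int = 5) -> float:
    """Fraction of the relevant passages that appear in the top k."""
    retrieved = set(ranked_ids[:k])
    return len(retrieved & positives) / len(positives)


def mean_reciprocal_rank(runs: List[List[str]], positives: List[Set[str]],
                         k: int = 5) -> float:
    """Mean of the reciprocal ranks over all queries."""
    scores = [reciprocal_rank(r, p, k) for r, p in zip(runs, positives)]
    return sum(scores) / len(scores)
```

Because reranking only reorders the same top-5 passages, `recall_at_k` returns the same value before and after reranking, whereas `reciprocal_rank` can change.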
5.3. Implementation Details
Our models are implemented with PyTorch (https://pytorch.org/) and the open-source implementation of ALBERT and BERT by Hugging Face (https://github.com/huggingface/transformers).
5.3.1. Retriever and Pretraining
We use two ALBERT Base (V1) models for the question and passage encoders. We set the max sequence length of the question encoder to 128, that of the passage encoder to 384, the training batch size to 16 per GPU, the number of training epochs to 12, and the learning rate to 5e-5. Models are trained with 4 NVIDIA TITAN X GPUs. We create a smaller collection to evaluate the retrieval performance by collecting the top 50 documents retrieved by TF-IDF for development questions. This allows us to do model selection in a scenario that is closer to how the retriever operates during concurrent learning. We save checkpoints every 5,000 steps and evaluate on the development questions to select the best model for concurrent learning. The pretraining time for the retriever is 2.5 hours.
5.3.2. Reranker, Reader, and Concurrent Learning
We use the BERT Base (Uncased) model. We set the max sequence length to 512, the max question length to 125 (so that the passage length is at least 384 after accounting for a [CLS] and two [SEP] tokens), the training batch size to 2, the number of training epochs to 3, and the learning rate to 5e-5. We retrieve the top 5 passages for the reader. We tune the number of passages used to update the retriever and the history window size in Section 6.3. Models are trained with an NVIDIA TITAN X GPU. We take advantage of another TITAN X card for faster MIPS. All passage representations in our collection occupy 7.2 GB of CUDA memory. We save checkpoints every 5,000 steps and evaluate on the development set to select the best model for the test set. The time for concurrent learning is 20.0 hours.
For all model components, we use half precision for training as suggested in the Hugging Face repository to alleviate CUDA memory consumption. The warm up portion of the learning rate is 10% of the total steps.
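To make the warm-up setting concrete, the sketch below computes a learning rate per step with a 10% linear warm-up to the peak rate of 5e-5. The linear decay after warm-up is an assumption (the text only states the warm-up portion), and `lr_at_step` is a hypothetical helper rather than the scheduler actually used.

```python
def lr_at_step(step: int, total_steps: int, peak_lr: float = 5e-5,
               warmup_portion: float = 0.1) -> float:
    """Learning rate at a given training step.

    Warm-up: linear increase from 0 to peak_lr over the first
    warmup_portion of the steps.
    Decay: linear decrease to 0 afterwards (assumed, not stated in the text).
    """
    warmup_steps = int(total_steps * warmup_portion)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 50, 100, 550, 1000):
    print(step, lr_at_step(step, total_steps=1000))
```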
6. Evaluation Results
In this section, we present our evaluation results, ablation studies on system components, and more analyses on history window size and the number of passages to fine-tune the retriever.
6.1. Main Evaluation Results
|Settings||DrQA||BERTserini||Ours w/o hist.||Ours|
We report the main evaluation results in Table 3. We tune the history window size for all models that consider history and report their performances under the best history setting. The best history settings for DrQA, BERTserini, and Ours are 5, 2, and 6 respectively. We summarize our observations as follows:
We observe that DrQA has poor performance. The main reason for this lies in the reader component. The RNN based reader in DrQA cannot produce representations that are as good as the readers based on a pretrained BERT in the rest of the competing models. More importantly, the DrQA reader cannot handle unanswerable questions natively.
BERTserini has a significant improvement over DrQA and serves as a much stronger baseline. It addresses the issues in DrQA by using a BERT reader that can handle unanswerable questions. BM25 in Anserini also gives better retrieval performance.
Our model without any history manages to perform on par with BERTserini that considers history on the test set. In particular, our learned retriever achieves higher performance on retrieval metrics. Since our reader is similar to that of BERTserini, the overall performance gain mostly comes from our learned retriever. This verifies the observation in Lee et al. (2019) in a conversational setting that a learned retriever is crucial if the information-seeker is genuinely seeking an answer. The margins are substantially larger on the development set, presumably because the best pretrained retriever model is selected based on the development performance.
Our model with history obtains a statistically significant improvement over the strongest baseline, as tested by the Student’s paired t-test. This demonstrates the effectiveness of our model. This also indicates that incorporating conversation history is essential for ORConvQA, as expected. More analyses on the history window size are presented in Section 6.3.1. In addition, we observe that the reranker consistently outperforms the retriever. This suggests that although reranking is more expensive as it jointly models the question and the passage, it enjoys better performance than the retriever that models the question and the passage separately.
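For readers unfamiliar with the significance test mentioned above, the following is a minimal sketch of the paired t statistic computed over per-question scores of two systems. The function name and the sample scores are hypothetical and are not results from this work; a p-value would additionally require the t distribution (for example via `scipy.stats.ttest_rel`).

```python
import math
from statistics import mean, stdev
from typing import List, Tuple


def paired_t_statistic(scores_a: List[float],
                       scores_b: List[float]) -> Tuple[float, int]:
    """Return (t statistic, degrees of freedom) for paired samples."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance")
    n = len(diffs)
    return mean(diffs) / (sd / math.sqrt(n)), n - 1


# Made-up per-question F1 scores, for illustration only.
t, df = paired_t_statistic([0.5, 0.7, 0.9], [0.4, 0.5, 0.6])
print(t, df)
```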
6.2. Ablation Studies
Section 6.1 has shown the effectiveness of our model. This model performance is closely related to several design choices we made. In this section, we conduct ablation studies on our best model in Table 3 to investigate the contributions of each design choice. Specifically, we have three ablation settings as follows.
ORConvQA w/o reranker. We introduce reranking to the system as one of the differences from previous works (Lee et al., 2019; Das et al., 2019). In this ablation setting, we remove the reranking loss in Equation 17 so that the encoder in the reader is not fine-tuned by the reranking objective. Naturally, we also do not use the reranking score in the overall score in Equation 19.
ORConvQA w/o learned retriever. We replace our learned retriever with DrQA’s TF-IDF retriever.
ORConvQA w/o first question (q) for retriever. We do not manually include the first question of a dialog in the reformatted question for the retriever.
The ablation results are presented in Table 4. The following are our observations.
By removing the reranker from the full system, we observe a degradation in the overall performance. Although the reranking loss does not influence the retriever, the retriever performance also decreases. This is because the ablated system reaches its best development performance earlier than the full system during training. The reason behind this is that the reader overfits before the retriever has had enough fine-tuning to produce reasonable retrieval performance. This verifies our assumption that the encoder in the reader/reranker benefits from a regularization effect by optimizing for the additional reranking task.
Replacing the learned retriever with TF-IDF causes a dramatic performance drop. This further verifies our observation in Section 6.1 that a learned retriever is crucial for ORConvQA.
When we do not additionally include the first question of the dialog in the reformatted question for the retriever, we observe a statistically significant performance decrease on most of the metrics. This validates our observation during data construction that the initial question of a dialog often contains a general information need that is pertinent to the entire dialog. By including the initial questions, the retriever can retrieve passages that are more relevant to the information need. The performance drop is less substantial than we anticipated. This is probably because the history window size of 6 has already covered the initial question for more than half of the questions, given that the number of history turns per question has a median of 3.
6.3. Additional Analyses
6.3.1. Impact of history window size
Leveraging conversation history is an integral part of a ConvQA system and has not been well studied in an open-retrieval setting. In this section, we study the impact of the history window size on the system performance. The results are presented in Figure 3.
In Figure 3(a), we observe that incorporating any number of history turns outperforms using no history at all. Although fluctuating, the overall performance first increases and then decreases, with the peak value at a history window size of 6. In Figure 3(b), we observe that all retrieval metrics generally grow as we incorporate more conversation history. This suggests that the additional history turns we prepend are useful for matching and retrieval in most cases. Since we have reserved 125 tokens for the reformatted question in the BERT input sequence as reported in Section 5.3, we show less degradation in the performance than previous work (Qu et al., 2019c) when we prepend more history.
It is intriguing that the retriever recall, the most important retrieval metric, shows a trimodal distribution. This could be due to the “topic return” phenomenon mentioned in Yatskar (2018). Given the current question in a dialog, an adjacent turn is typically more useful than a distant turn to reveal the information need of the current turn. In other words, the utility of a history turn decreases as the distance between itself and the current turn increases. This utility trend shifts when the current turn is returning to the topic that has been discussed in a distant history turn. The trimodal distribution could imply that a topic return phenomenon typically happens five turns or nine turns away from the current turn. Moreover, the valley values of the trimodal distribution of retriever recall are consistent with those of the F1 curve in Figure 3(a), suggesting that the fluctuation in the overall performance can be explained by the variation in retriever performance.
6.3.2. Impact of the number of passages to update retriever
Lee et al. (2019) suggest that it is crucial to set the batch size in the retriever pretraining phase as large as possible because it makes the pretraining task more difficult and closer to what the retriever observes at test time. During pretraining, we set the batch size to 16 as reported in Section 5.3, meaning that we have 16 passages per question to train the retriever. At the concurrent learning phase, we can increase this number to fine-tune the question encoder in the retriever at a low cost since all passages have been encoded offline. Therefore, we investigate how helpful it is to increase the number of passages used to fine-tune the retriever during concurrent learning. The choices for this number are [16, 50, 100, 500, 1000]. We sample the choices unevenly and with large gaps so that the trends are clear. The results are presented in Figure 5.
We observe that an intermediate value gives the best overall performance and retriever recall; both smaller and larger values give sub-optimal performance. Although a smaller value is closer to what we use for pretraining, the retriever cannot learn from enough negative passages. Conversely, if we use a value that is much larger than that of pretraining, the mismatch of supervision signals also leads to inferior performance.
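To illustrate why the number of passages per question changes the difficulty of the retriever's training task, the sketch below assumes a softmax cross-entropy loss over inner-product scores between one question vector and its candidate passage vectors. This objective, the function name `retriever_loss`, and the toy vectors are assumptions for illustration; the actual retriever loss is defined in Section 4.

```python
import math
from typing import List


def dot(u: List[float], v: List[float]) -> float:
    """Inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def retriever_loss(question_vec: List[float],
                   passage_vecs: List[List[float]],
                   positive_index: int) -> float:
    """Negative log-likelihood of the positive passage under a softmax
    over inner-product scores (assumed objective)."""
    scores = [dot(question_vec, p) for p in passage_vecs]
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[positive_index]


q = [1.0, 0.0]
positive = [1.0, 0.0]
negative = [0.0, 1.0]
# Adding negatives enlarges the softmax denominator, so the same
# positive passage incurs a higher loss.
print(retriever_loss(q, [positive, negative], 0))
print(retriever_loss(q, [positive, negative, negative], 0))
```

Since the passage vectors are precomputed in this sketch, only the question vector would need gradients, which mirrors the observation that increasing the number of passages during concurrent learning is cheap.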
7. Conclusions and Future Work
In this work, we introduce an open-retrieval conversational QA setting as a further step towards conversational search. We create a dataset, OR-QuAC, by aggregating existing data to facilitate research on ORConvQA. We build an end-to-end system for ORConvQA, featuring a retriever, a reranker, and a reader that are all based on Transformers. Our extensive experiments on OR-QuAC demonstrate that a learnable retriever is crucial in the ORConvQA setting. We further show that our system can make a substantial improvement when we enable history modeling in all system components. Moreover, we show that the additional reranker component contributes to the model performance by providing a regularization effect. Finally, we demonstrate that the initial question of each dialog is essential for our system to understand the user’s information need. For future work, we would like to address the limitations of this work by studying weak supervision methods for information-seeking conversations and a retriever that is not only learnable but also tunable by the downstream task. In addition, we will investigate more effective history modeling methods.
Acknowledgements. This work was supported in part by the Center for Intelligent Information Retrieval, in part by NSF IIS-1715095, and in part by China Postdoctoral Science Foundation (No. 2019M652038). Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.
- Ahmad et al.  A. Ahmad, N. Constant, Y. Yang, and D. M. Cer. ReQA: An Evaluation for End-to-End Answer Retrieval Models. ArXiv, 2019.
- Belkin et al.  N. J. Belkin, C. Cool, A. Stein, and U. Thiel. Cases, Scripts, and Information-seeking Strategies: On the Design of Interactive Information Retrieval Systems. 1995.
- Bi et al.  K. Bi, Q. Ai, Y. Zhang, and W. B. Croft. Conversational Product Search Based on Negative Feedback. In CIKM, 2019.
- Chen et al.  D. Chen, A. Fisch, J. Weston, and A. Bordes. Reading Wikipedia to Answer Open-Domain Questions. In ACL, 2017.
- Chen et al.  Y. Chen, L. Wu, and M. J. Zaki. GraphFlow: Exploiting Conversation Flow with Graph Neural Networks for Conversational Machine Comprehension. ArXiv, 2019.
- Choi et al.  E. Choi, H. He, M. Iyyer, M. Yatskar, W.-T. Yih, Y. Choi, P. Liang, and L. Zettlemoyer. QuAC: Question Answering in Context. In EMNLP, 2018.
- Chuklin et al.  A. Chuklin, A. Severyn, J. R. Trippas, E. Alfonseca, H. Silén, and D. Spina. Prosody Modifications for Question-Answering in Voice-Only Settings. ArXiv, 2018.
- Clark and Gardner  C. Clark and M. Gardner. Simple and Effective Multi-Paragraph Reading Comprehension. In ACL, 2017.
- Cohen et al.  D. Cohen, L. Yang, and W. B. Croft. WikiPassageQA: A Benchmark Collection for Research on Non-factoid Answer Passage Retrieval. In SIGIR, 2018.
- Croft and Thompson  W. B. Croft and R. H. Thompson. I3R: A New Approach to the Design of Document Retrieval Systems. JASIS, 38:389–404, 1987.
- Das et al.  R. Das, S. Dhuliawala, M. Zaheer, and A. McCallum. Multi-step Retriever-Reader Interaction for Scalable Open-domain Question Answering. In ICLR, 2019.
- Devlin et al.  J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT, 2019.
- Dhingra et al.  B. Dhingra, K. Mazaitis, and W. W. Cohen. Quasar: Datasets for Question Answering by Search and Reading. ArXiv, 2017.
- Dunn et al.  M. Dunn, L. Sagun, M. Higgins, V. U. Güney, V. Cirik, and K. Cho. SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine. ArXiv, 2017.
- Elgohary et al.  A. Elgohary, D. Peskov, and J. L. Boyd-Graber. Can You Unpack That? Learning to Rewrite Questions-in-Context. In EMNLP/IJCNLP, 2019.
- Garg et al.  S. Garg, T. Vu, and A. Moschitti. TANDA: Transfer and Adapt Pre-Trained Transformer Models for Answer Sentence Selection. In AAAI, 2020.
- Htut et al.  P. M. Htut, S. R. Bowman, and K. Cho. Training a Ranking Function for Open-Domain Question Answering. In NAACL-HLT, 2018.
- Huang et al.  H.-Y. Huang, E. Choi, and W.-T. Yih. FlowQA: Grasping Flow in History for Conversational Machine Comprehension. ArXiv, 2018.
- Johnson et al.  J. Johnson, M. Douze, and H. Jégou. Billion-scale similarity search with GPUs. ArXiv, 2017.
- Joshi et al.  M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In ACL, 2017.
- Karpukhin et al.  V. Karpukhin, B. Oğuz, S. Min, L. Y. Wu, S. Edunov, D. Chen, and W.-T. Yih. Dense Passage Retrieval for Open-Domain Question Answering. ArXiv, abs/2004.04906, 2020.
- Kratzwald and Feuerriegel  B. Kratzwald and S. Feuerriegel. Adaptive Document Retrieval for Deep Question Answering. In EMNLP, 2018.
- Kwiatkowski et al.  T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. P. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, K. Toutanova, L. Jones, M. Kelcey, M.-W. Chang, A. M. Dai, J. Uszkoreit, Q. Le, and S. Petrov. Natural Questions: A Benchmark for Question Answering Research. TACL, 7:453–466, 2019.
- Lan et al.  Z.-Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. ArXiv, 2019.
- Lee et al.  J. Lee, S. Yun, H. Kim, M. Ko, and J. Kang. Ranking Paragraphs for Improving Answer Recall in Open-Domain Question Answering. In EMNLP, 2018.
- Lee et al.  K. Lee, M.-W. Chang, and K. Toutanova. Latent Retrieval for Weakly Supervised Open Domain Question Answering. In ACL, 2019.
- Lowe et al.  R. Lowe, N. Pow, I. Serban, and J. Pineau. The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems. In SIGDIAL, 2015.
- Nguyen et al.  T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, and L. Deng. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. ArXiv, 2016.
- Oddy  R. N. Oddy. Information Retrieval through Man-Machine Dialogue. 1977.
- Qu et al.  C. Qu, L. Yang, W. B. Croft, J. R. Trippas, Y. Zhang, and M. Qiu. Analyzing and Characterizing User Intent in Information-seeking Conversations. In SIGIR, 2018.
- Qu et al. [2019a] C. Qu, L. Yang, W. B. Croft, F. Scholer, and Y. Zhang. Answer Interaction in Non-factoid Question Answering Systems. In CHIIR, 2019a.
- Qu et al. [2019b] C. Qu, L. Yang, W. B. Croft, Y. Zhang, J. R. Trippas, and M. Qiu. User Intent Prediction in Information-seeking Conversations. In CHIIR, 2019b.
- Qu et al. [2019c] C. Qu, L. Yang, M. Qiu, W. B. Croft, Y. Zhang, and M. Iyyer. BERT with History Answer Embedding for Conversational Question Answering. In SIGIR, 2019c.
- Qu et al. [2019d] C. Qu, L. Yang, M. Qiu, Y. Zhang, C. Chen, W. B. Croft, and M. Iyyer. Attentive History Selection for Conversational Question Answering. In CIKM, 2019d.
- Rajpurkar et al.  P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In EMNLP, 2016.
- Rajpurkar et al.  P. Rajpurkar, R. Jia, and P. Liang. Know What You Don’t Know: Unanswerable Questions for SQuAD. In ACL, 2018.
- Reddy et al.  S. Reddy, D. Chen, and C. D. Manning. CoQA: A Conversational Question Answering Challenge. TACL, 7:249–266, 2018.
- Shrivastava and Li  A. Shrivastava and P. Li. Asymmetric LSH (ALSH) for Sublinear Time Maximum Inner Product Search (MIPS). In NIPS, 2014.
- Tao et al.  C. Tao, W. Wu, C. Xu, W. Hu, D. Zhao, and R. Yan. Multi-Representation Fusion Network for Multi-Turn Response Selection in Retrieval-Based Chatbots. In WSDM, 2019.
- Thomas et al.  P. Thomas, D. J. McDuff, M. Czerwinski, and N. Craswell. MISC: A data set of information-seeking conversations. In SIGIR (CAIR’17), 2017.
- Trippas et al.  J. R. Trippas, D. Spina, L. Cavedon, and M. Sanderson. How Do People Interact in Conversational Speech-Only Search Tasks: A Preliminary Analysis. In CHIIR, 2017.
- Trippas et al.  J. R. Trippas, D. Spina, L. Cavedon, H. Joho, and M. Sanderson. Informing the Design of Spoken Conversational Search: Perspective Paper. In CHIIR, 2018.
- Trippas et al.  J. R. Trippas, D. Spina, P. Thomas, M. Sanderson, H. Joho, and L. Cavedon. Towards a Model for Spoken Conversational Search. ArXiv, 2019.
- Trischler et al.  A. Trischler, T. Wang, X. Yuan, J. Harris, A. Sordoni, P. Bachman, and K. Suleman. NewsQA: A Machine Comprehension Dataset. In Rep4NLP@ACL, 2016.
- Vaswani et al.  A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention Is All You Need. In NIPS, 2017.
- Voorhees and Tice  E. M. Voorhees and D. M. Tice. The TREC-8 Question Answering Track Evaluation. In TREC, 1999.
- Wang et al.  M. Wang, N. A. Smith, and T. Mitamura. What is the Jeopardy Model? A Quasi-Synchronous Grammar for QA. In EMNLP-CoNLL, 2007.
- Wang et al.  S. Wang, M. Yu, X. Guo, Z. Wang, T. Klinger, W. Zhang, S. Chang, G. Tesauro, B. Zhou, and J. Jiang. R3: Reinforced Ranker-Reader for Open-Domain Question Answering. In AAAI, 2018.
- Wu et al.  Y. Wu, W. Y. Wu, M. Zhou, and Z. Li. Sequential Match Network: A New Architecture for Multi-turn Response Selection in Retrieval-based Chatbots. In ACL, 2016.
- Yan et al. [2016a] R. Yan, Y. Song, and H. Wu. Learning to Respond with Deep Neural Networks for Retrieval-Based Human-Computer Conversation System. In SIGIR, 2016a.
- Yan et al. [2016b] R. Yan, Y. Song, X. Zhou, and H. Wu. “Shall I Be Your Chat Companion?”: Towards an Online Human-Computer Conversation System. In CIKM, 2016b.
- Yang et al.  L. Yang, H. Zamani, Y. Zhang, J. Guo, and W. B. Croft. Neural Matching Models for Question Retrieval and Next Question Prediction in Conversation. ArXiv, 2017.
- Yang et al.  L. Yang, M. Qiu, C. Qu, J. Guo, Y. Zhang, W. B. Croft, J. Huang, and H. Chen. Response Ranking with Deep Matching Networks and External Knowledge in Information-seeking Conversation Systems. In SIGIR, 2018.
- Yang et al. [2019a] L. Yang, J. Hu, M. Qiu, C. Qu, J. Gao, W. B. Croft, X. Liu, Y. Shen, and J. Liu. A Hybrid Retrieval-Generation Neural Conversation Model. In CIKM, 2019a.
- Yang et al.  L. Yang, M. Qiu, C. Qu, C. Chen, J. Guo, Y. Zhang, W. B. Croft, and H. Chen. IART: Intent-aware Response Ranking with Transformers in Information-seeking Conversation Systems. In WWW, 2020.
- Yang et al. [2019b] W. Yang, Y. Xie, A. Lin, X. Li, L. Tan, K. Xiong, M. Li, and J. Lin. End-to-End Open-Domain Question Answering with BERTserini. In NAACL-HLT, 2019b.
- Yang et al.  Y. Yang, W.-T. Yih, and C. Meek. WikiQA: A Challenge Dataset for Open-Domain Question Answering. In EMNLP, 2015.
- Yatskar  M. Yatskar. A Qualitative Comparison of CoQA, SQuAD 2.0 and QuAC. In NAACL-HLT, 2018.
- Yeh and Chen  Y.-T. Yeh and Y.-N. Chen. FlowDelta: Modeling Flow Information Gain in Reasoning for Conversational Machine Comprehension. ArXiv, 2019.
- Zhang et al.  Y. Zhang, X. Chen, Q. Ai, L. Yang, and W. B. Croft. Towards Conversational Search and Recommendation: System Ask, User Respond. In CIKM, 2018.
- Zhu et al.  C. Zhu, M. Zeng, and X. Huang. SDNet: Contextualized Attention-based Deep Network for Conversational Question Answering. ArXiv, 2018.