Retriever-reader models in open-domain QA require a long time for inference izacard2021leveraging; NEURIPS2020_6b493230; sachan2021end; mao-etal-2021-generation; karpukhin2020dense. This has been identified as a bottleneck in building real-time QA systems, and question retrieval and phrase-indexed QA have been proposed to resolve this problem seo2018phrase; seo2019real; lee2020contextualized; lee-etal-2021-learning-dense; lee2021phrase; lewis2021question; lewis2021paq. These approaches search the corpus directly for the answer to the input question, without an additional machine reading step, which is computationally expensive. In phrase-indexed QA, retrievers pre-index all phrases in the corpus and find the phrase most similar to the input question. In question retrieval, synthetic question-answer pairs are pre-indexed and referenced by retrievers du2017learning; duan2017question; fabbri2020template; lewis2020bart.
Although recent question retrieval models significantly increase the inference speed, this improvement comes at the cost of lower QA performance. Several approaches have been applied to question retrieval models to recover the lost performance, such as adopting a cross-encoder mao-etal-2021-reader; xiong2020answering for re-ranking and increasing the model size lewis2021paq. However, these approaches cause a significant loss of computational efficiency. Figure 1 shows the trade-off between the open-domain QA performance and the inference speed of question retrieval models.
We propose SQuID (Sequential Question-Indexed Dense retrieval), which significantly improves QA performance without losing computational efficiency. Our work follows previous work on neural re-ranking methods, which use a cross-encoder to re-rank the top-k passages retrieved by the first-step retriever lewis2021paq; xiong2020answering. Re-ranking methods have improved retrieval performance but incur a high computational cost due to the cross-encoder architecture. In SQuID, we use an additional bi-encoder retriever instead of the cross-encoder to avoid losing computational efficiency. We also provide distant supervision methods for training the additional retriever in the absence of training data for question retrievers.
We evaluate SQuID on NaturalQuestions (NQ) kwiatkowski2019natural and TriviaQA joshi2017triviaqa. We conduct three types of experiments: open-domain QA, computational efficiency evaluation, and analysis on distant supervision methods for training the second-step retriever. Experimental results show that SQuID significantly outperforms the state-of-the-art question retrieval model by 4.0%p on NQ and 6.1%p on TriviaQA without losing computational efficiency. Our main contribution is in proposing a sequential question retriever model that successfully improves both QA performance and inference speed, thereby making a meaningful step toward developing real-time open-domain QA systems.
2 Related Work
The research problem of reducing the computational cost of open-domain QA has received much attention recently. The main bottleneck of a retriever-reader model is the machine reading step, and seo2018phrase, seo2019real, lee-etal-2021-learning-dense propose phrase-indexed QA, which directly retrieves the answer from the corpus without the machine reading step. These models pre-compute the context of phrases in a corpus and conduct lexical and semantic similarity searches between the given question and the context of phrases zhao-etal-2021-sparta; yamada2021bpr. Most related to our work are the question retrieval models, which use question-generation models to build question-answer pairs and conduct a similarity search between the input question and the pre-indexed questions lewis2021question; lewis2021paq. These models significantly reduce the computational cost but result in lower performance. Our work provides an efficient question retrieval pipeline with distant supervision methods for training, whereas previous question retrieval models focus on the indexing methods and pay less attention to the retrieval pipeline.
Our method builds on the question retrieval pipeline proposed by lewis2021paq, where question retrievers find the question most similar to the input question and return the answer of the selected question. In this study, we note that previous question retrievers are optimized not only for retrieval performance but also for maintaining the inference speed needed to cover millions of texts lewis2021paq. In this process, the performance of retrievers decreases as they are optimized more heavily for computational efficiency. We propose to use an additional retriever that takes the top-k predictions from the first retriever and selects the most similar question among them. The second-step retriever is less constrained in inference speed than the first retriever, since its search space contains only a few samples. This enables us to focus only on the retrieval performance when designing the training method. The overall training and inference procedure of SQuID is illustrated in Figure 2. We describe the details of SQuID below.
Since annotated question-question pairs are unavailable, we distantly supervise SQuID with heuristically selected positive and negative samples. We first select the top-k similar questions with the first-step retriever. Among the top-k questions, we choose the positive and negative samples as follows. For positive samples, we choose questions whose answer is most similar to the ground truth answer in terms of F1-score, the evaluation metric used in extractive QA rajpurkar2016squad. For negative samples, we sample questions with answers that differ from the ground truth answer karpukhin2020dense; xiong2020approximate.
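The selection heuristic above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names `token_f1` and `select_samples` are hypothetical, the F1 uses plain whitespace tokens, and negatives are taken in retrieval order rather than sampled, all of which are simplifying assumptions.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between two answer strings (lowercased whitespace tokens)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def select_samples(candidates, gold_answer, num_negatives):
    """Pick one positive and up to `num_negatives` negatives.

    `candidates` is a non-empty list of (question, answer) pairs, assumed to be
    the top-k output of the first-step retriever.
    """
    scored = [(token_f1(a, gold_answer), q, a) for q, a in candidates]
    best_f1, best_q, best_a = max(scored, key=lambda item: item[0])
    # Assumption: if no candidate overlaps with the gold answer, there is no positive.
    positive = (best_q, best_a) if best_f1 > 0.0 else None
    # Negatives: answers that differ from the gold answer (and are not the positive).
    negatives = [
        (q, a)
        for _, q, a in scored
        if a.lower() != gold_answer.lower() and (q, a) != positive
    ]
    return positive, negatives[:num_negatives]
```

For example, with candidates answering "William Shakespeare", "Shakespeare", and "Paris" and the gold answer "William Shakespeare", the first candidate becomes the positive and the remaining ones are eligible negatives.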
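When the input question $q$ is provided with a positive sample ($q^{+}$) and negative samples ($q^{-}_{1}, \dots, q^{-}_{n}$), our second-step retriever is trained to distinguish the positive and negative samples. The loss function is as follows:

$$L(q, q^{+}, q^{-}_{1}, \dots, q^{-}_{n}) = -\log \frac{e^{\mathrm{sim}(q, q^{+})}}{e^{\mathrm{sim}(q, q^{+})} + \sum_{j=1}^{n} e^{\mathrm{sim}(q, q^{-}_{j})}}$$

The similarity function is defined as the dot product of two vectors: $\mathrm{sim}(q, q') = E_Q(q)^{\top} E_Q(q')$, where $E_Q$ is the question encoder of the second-step retriever.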
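As a rough sketch of this training objective, the Python snippet below computes a softmax negative log-likelihood of the positive sample over dot-product scores. The exact form is an assumption here (the contrastive loss commonly used for dense retrievers), and the names `dot` and `contrastive_nll` are hypothetical; the vectors stand in for the outputs of the second-step question encoder.

```python
import math


def dot(u, v):
    """Dot-product similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def contrastive_nll(query_vec, positive_vec, negative_vecs):
    """Negative log-likelihood of the positive among positive + negatives.

    Assumed form: -log( exp(s+) / (exp(s+) + sum_j exp(s-_j)) ).
    """
    scores = [dot(query_vec, positive_vec)]
    scores += [dot(query_vec, neg) for neg in negative_vecs]
    # Log-sum-exp with max subtraction for numerical stability.
    max_score = max(scores)
    log_norm = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    return log_norm - scores[0]
```

The loss approaches zero when the positive score dominates all negative scores, and equals `log(n + 1)` when the positive and all `n` negatives score the same.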
Given a question $q$, the two retrievers of SQuID work in two steps. The first-step retriever selects the top-k similar questions. The retrieved questions are then mapped to the question vectors pre-computed by the second-step retriever. Using these vectors, the second-step retriever selects the question most similar to $q$ among the top-k results. We use Maximum Inner Product Search (MIPS) for the second-step retrieval. Finally, SQuID returns the answer paired with the selected question as the answer for $q$.
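The two-step inference procedure can be sketched as below. This is an illustrative outline under stated assumptions: `first_step_retrieve`, `encode`, `index_vectors`, and `answers` are hypothetical stand-ins for the first-step retriever, the second-step question encoder, the pre-computed question vectors, and the indexed answers, and the inner product search is done by brute force over the k candidates.

```python
def two_step_answer(question, first_step_retrieve, encode, index_vectors, answers, k=50):
    """Return the answer of the indexed question chosen by the second-step retriever.

    first_step_retrieve(question, k) -> list of candidate question ids
    encode(question)                 -> query vector from the second-step encoder
    index_vectors[id]                -> pre-computed vector of an indexed question
    answers[id]                      -> answer paired with an indexed question
    """
    candidate_ids = first_step_retrieve(question, k)
    query_vec = encode(question)

    def inner_product(question_id):
        return sum(a * b for a, b in zip(query_vec, index_vectors[question_id]))

    # Maximum inner product search restricted to the top-k candidates.
    best_id = max(candidate_ids, key=inner_product)
    return answers[best_id]
```

Because the second step only scores the k candidates, its cost does not grow with the size of the full question index.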
4 Experimental Setup and Results
|Model Type|Model|NQ|TriviaQA|Inference speed (Q/sec)|
|---|---|---|---|---|
|Question retrieval|RePAQ-base256 lewis2021paq|40.0|38.8|1376|
* indicates the inference speed is from the original paper. indicates that the inference speed is computed in the parallel computing setting.
|Positive sampling method|SQuID-BM25|SQuID-RePAQ-base256|
|---|---|---|
|w/o 2nd retriever|34.4|40.0|
|+ Similar / Self|43.6|44.1|
|+ Same Answer|43.4|44.4|
We evaluate the performance and computational efficiency of SQuID on two open-domain QA datasets: NaturalQuestions (NQ) and TriviaQA. We also compare various distant supervision methods for training SQuID. We use exact match (EM) rajpurkar2016squad for performance evaluation and the number of questions per second (Q/sec) for evaluation of inference speed. The details of our experimental setup are described in Appendix A.2.
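For reference, exact match can be computed roughly as in the sketch below. The normalization steps (lowercasing, removing punctuation and English articles, collapsing whitespace) follow the common SQuAD-style convention and are an assumption here, as are the function names.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers) -> bool:
    """True if the prediction matches any gold answer after normalization."""
    normalized = normalize_answer(prediction)
    return any(normalized == normalize_answer(gold) for gold in gold_answers)


def em_score(predictions, gold_answer_lists) -> float:
    """Percentage of questions whose prediction exactly matches a gold answer."""
    hits = sum(exact_match(p, golds) for p, golds in zip(predictions, gold_answer_lists))
    return 100.0 * hits / len(predictions)
```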
Question Retrievers on Open-Domain QA:
We evaluate SQuID with two different first-step retrievers: BM25 and RePAQ-base256 lewis2021paq (we use the RePAQ-base256 model provided by the official implementation; it has slightly lower performance than RePAQ-base). Table 1 shows that SQuID-BM25/DPR and SQuID-RePAQ/DPR achieve the best performance among question retrieval models on TriviaQA and NQ, respectively. Note that SQuID-RePAQ/DPR outperforms RePAQ-base256 significantly with a negligible loss of inference speed: a 4.0%p EM gain on NQ and a 6.1%p gain on TriviaQA at 92.0% of the speed (1266 Q/sec vs. 1376 Q/sec).
Trade-off between QA Performance and Computational Efficiency:
Table 1 shows the trade-off between the open-domain QA performance and the inference speed of the three types of open-domain QA models. Comparing RePAQ-large and RAG-Sequence, we see a large performance gap of 3.3%p on NQ and 18.0%p on TriviaQA, as well as a large speed gap (624 Q/sec for RePAQ-large vs. 0.8 Q/sec for RAG-Sequence). SQuID bridges this gap, achieving performance comparable to RAG-Sequence on NQ while maintaining a high inference speed. The performance gain on TriviaQA is not as high, and we conjecture that this is because RePAQ uses only questions from NQ in its filtering step. We leave a deeper study of this discrepancy for future research.
Figure 1 illustrates the QA performance and inference speed of various configurations of RePAQ and SQuID. We vary the encoder of the second-step retriever with different pre-trained models: DPR karpukhin2020dense, BERT-base/large devlin2019bert, and ALBERT-base/large lan2019albert. The first- and second-step question encoders can be executed concurrently, so we also run them in parallel with the batch size halved to measure the inference speed (SQuID-DPR-parallel). We use the maximum batch size possible on a single V100-16GB GPU. The figure shows that the SQuID results all lie to the top right of the curve fitted to the RePAQ results, meaning that SQuID succeeds in improving both QA performance and inference speed. The detailed results are in Appendix A.1.
Analysis on Positive Sampling Methods:
We distantly supervise the second-step retriever because annotated question-question pairs are unavailable. We conduct experiments on various positive sampling methods for distant supervision: “Self”, “Similar”, “Similar/Self”, and “Same Answer”. Each method uses the following as the positive sample:
1) the input question itself (“Self”), 2) a similar question with a similar answer (“Similar”), 3) a similar question if it has the ground truth answer, or the input question itself (“Similar/Self”), and 4) a random question with the ground truth answer (“Same Answer”).
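The four positive sampling methods can be sketched as a single selection function. This is a hedged illustration rather than the authors' code: the name `choose_positive`, the `answer_similarity` callable (for example, a token-level F1), the use of exact string equality for "has the ground truth answer", and the fallback to the input question when no matching question exists are all assumptions.

```python
import random


def choose_positive(method, question, gold_answer, top_k, qa_pool, answer_similarity, rng):
    """Select the positive sample for one training question.

    top_k   : (question, answer) pairs from the first-step retriever, best first
    qa_pool : all indexed (question, answer) pairs
    rng     : a random.Random instance used for the "Same Answer" method
    """
    if method == "Self":
        return question
    if method == "Similar":
        # Similar question whose answer is closest to the gold answer.
        return max(top_k, key=lambda pair: answer_similarity(pair[1], gold_answer))[0]
    if method == "Similar/Self":
        for candidate, answer in top_k:
            if answer == gold_answer:
                return candidate
        return question  # fall back to the input question itself
    if method == "Same Answer":
        matches = [q for q, a in qa_pool if a == gold_answer]
        # Assumption: fall back to the input question when nothing matches.
        return rng.choice(matches) if matches else question
    raise ValueError(f"unknown positive sampling method: {method}")
```

Only "Self" ignores the answer entirely; the other three methods all condition the choice of positive on the ground truth answer.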
Table 2 shows the performance of SQuID-BM25 and SQuID-RePAQ-base256 on the NQ test set with the four distant supervision methods. The first row (w/o 2nd retriever) indicates the performance based only on the first-step retriever (BM25 or RePAQ-base256). The second-step retriever trained with the “Self” method improves the performance slightly, whereas the other methods improve it more substantially. The large gap between “Self” and the other methods shows that using the answer information is essential for distant supervision.
Error Propagation Analysis:
The error rate of each stage in a multi-stage model provides a better understanding of the model’s performance boundary. In SQuID, the second-step retriever can only predict the correct answer when the top-50 question-answer pairs retrieved by the first-step retriever contain the answer. This means that the upper-bound performance of SQuID is determined by the performance of the first-step retriever. We measure the R@50 accuracy of the first-step retrievers on NQ and TriviaQA. The R@50 accuracies of BM25 and RePAQ are 64.07% and 64.34% on NQ and 61.73% and 59.10% on TriviaQA, respectively.
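The upper bound discussed above can be measured with a simple recall-at-k computation, sketched below. The function name and data layout are hypothetical, and answers are compared by exact string membership, which is a simplifying assumption.

```python
def recall_at_k(retrieved_answers, gold_answers, k=50):
    """Percentage of questions whose top-k retrieved answers contain a gold answer.

    retrieved_answers : per-question list of answers from the first-step retriever,
                        ordered from most to least similar
    gold_answers      : per-question collection of acceptable gold answers
    """
    hits = 0
    for answers, golds in zip(retrieved_answers, gold_answers):
        if any(answer in golds for answer in answers[:k]):
            hits += 1
    return 100.0 * hits / len(gold_answers)
```

Since the second-step retriever only re-orders these k candidates, its exact match score cannot exceed this value.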
The trade-off between performance and inference speed is an important problem in open-domain QA. Recently proposed question retrieval models have shown significantly improved inference speed. However, this improvement came at the cost of significantly lower QA performance compared to the state-of-the-art open-domain QA models. In this paper, we proposed a two-step question retrieval model, SQuID. We evaluated the open-domain QA performance and the inference speed of SQuID on two datasets: NaturalQuestions and TriviaQA. From the results, we showed that the sequential two-retriever approach in SQuID achieves a significant QA performance improvement over the existing question retrieval models, while retaining the advantage of faster inference speed. This improvement in both QA performance and inference speed is a meaningful step toward the development of real-time open-domain QA systems.
This work was partly supported by NAVER Corp. and the Engineering Research Center Program through the National Research Foundation of Korea (NRF) funded by the Korean Government MSIT (NRF-2018R1A5A1059921).
Appendix A
A.1 Detailed results of Figure 1
|Model|EM|Inference speed (Q/sec)|
|---|---|---|
|RePAQ-base + Reranker-base|45.7|41|
|RePAQ-large + Reranker-xlarge|46.2|7|
A.2 Experimental Setup
We set the batch size to 2 per GPU and the number of negative samples to 16. We used validation EM score for early stopping. SQuID was trained on a machine with four V100-16GB GPUs. We report the result of a single trial.
Computational Environment for Measuring the Inference Speed:
The inference speed of the baseline models and SQuID is measured with a V100-16GB GPU and 32 CPUs (Intel Xeon E5-2686v4). We report the mean of three separate trials.
A.3 License or Terms of Artifacts
We use BERT, which is released under the Apache License 2.0 and is free to modify and distribute. We also use RePAQ, which is released under the CC BY-NC 4.0 license and is free to modify and distribute. All models we used are publicly available.