With the promise of making the huge amount of information buried in unstructured text accessible with simple and user-friendly natural language queries, the area of open-domain QA has attracted lots of attention in recent years. Existing open-domain QA systems are typically made of two essential components chen-etal-2017-reading. A retrieval module first retrieves a compact set of paragraphs from the whole corpus (such as Wikipedia) that includes tens of millions of paragraphs. Then a reading module is deployed to extract an answer span from the retrieved paragraphs.
Over the past few years, much of the progress in open-domain QA has focused on improving the reading module of the system, which only needs to process a small number of retrieved paragraphs. Specifically, improvements include stronger reading comprehension models DBLP:conf/iclr/WangY0ZGCWKTC18; DBLP:journals/corr/abs-1902-01718; DBLP:journals/corr/abs-1912-09637; min-etal-2019-discrete and paragraph reranking models DBLP:journals/corr/abs-1709-00023; lin-etal-2018-denoising that assign more accurate relevance scores to the retrieved paragraphs. However, the performance is still bounded by the retrieval components, which rely on traditional IR methods such as TF-IDF or BM25 DBLP:journals/ftir/RobertsonZ09 because these methods remain efficient when handling millions of documents. These methods retrieve paragraphs solely based on n-gram lexical overlap and usually fail on cases that require deep semantic matching or where the question and the target paragraph share no common terms.
While neural models have proven effective at learning deep semantic matching between text pairs bowman-etal-2015-large; DBLP:journals/corr/ParikhT0U16; DBLP:journals/corr/ChenZLWJ16; devlin-etal-2019-bert, such matching usually requires fine-grained token-level attention, suggesting that we need to either store the token representations of all documents or calculate question-aware paragraph encodings to achieve good performance. However, these approaches are prohibitive in practice given space constraints and retrieval efficiency. Inspired by the breakthroughs of large-scale language model pretraining, more recent studies lee-etal-2019-latent; chang2020pre; guu2020realm show that this dilemma can be resolved with matching-oriented pretraining that imitates the matching between the question and paragraph in open-domain QA. As these approaches use separate encoders for questions and paragraphs and simply model the matching using inner products of the output vectors, they only need to pre-encode the whole corpus into dense vectors in a question-agnostic fashion, and the retrieval can be efficiently implemented using existing maximum inner product search (MIPS) methods. These dense retrieval methods have achieved significant improvements over the BM25 baseline across a set of information-seeking QA datasets. However, the existing pretraining strategies can be highly sample-inefficient and typically require a very large batch size (up to thousands) so that diverse and effective negative question-paragraph pairs are included in each batch. Our experiments show that a model trained with random and small batches almost ceases improving after a certain number of updates.
Given that a 12 GB GPU can only store around 10 samples with the BERT-base architecture, the wider adoption of these methods for different corpora is largely hindered by the resource limitations of many organizations. (Footnote 1: As there is typically a domain gap between different corpora, the index needs to be retrained constantly.)
In this work, we propose a simple and energy-efficient method for pretraining the dense corpus index. We achieve on-par or stronger open-domain QA performance compared to an existing method lee-etal-2019-latent that uses around 7 times more computational resources. Besides, our method uses a much smaller batch size and can be implemented with only a small number of GPUs, i.e., we use at most 4 TITAN RTX GPUs for all our experiments. In a nutshell, the proposed method first utilizes a pretrained sequence-to-sequence model to generate high-quality pretraining data instead of relying on simple heuristics to create pseudo question-paragraph pairs; for the training algorithm, we use clustering techniques to get effective negative samples for each pair and progressively update the clusters using our updated corpus index. The efficacy of our method is further validated through ablation studies where we replicate existing methods that use the same amount of resources. For the downstream QA experiments, we carefully investigate different finetuning objectives and show that different combinations of the reranking and span prediction losses have non-trivial effects on the final performance. We hope these studies can save future research that focuses on improving the retrieval component of open-domain QA systems the effort of trying out various finetuning strategies.
We begin by introducing the network architectures used in our retrieval and reading comprehension model. Next, we present how to generate high-quality question-paragraph pairs for pretraining and how we make sure there are effective negative samples within each small batch using a progressive training algorithm. Finally, we show how to finetune the whole system for QA.
2.1 Model Architectures
We introduce the following notation, which is used throughout the paper. The goal of open-domain QA is to find the answer derivation $(p, s)$ from a large text corpus $\mathcal{C}$ given a question $q$, where $p$ is an evidence paragraph and $s$ is a text span within the evidence paragraph $p$. The start and end tokens of $s$ are denoted as $\mathrm{START}(s)$ and $\mathrm{END}(s)$, respectively. We refer to the retrieval module as $P_\theta(p \mid q)$, with learnable parameters $\theta$. Similarly, we refer to the reading comprehension module as $P_\phi(s \mid p, q)$, which can be decomposed as $P_\phi(\mathrm{START}(s) \mid p, q) \times P_\phi(\mathrm{END}(s) \mid p, q)$. We use $D_k$ to represent the top-k paragraphs from the retrieval module; a subset $D^* \subseteq D_k$ represents the paragraphs in $D_k$ that actually cover the correct answer; for each paragraph $p \in D^*$, we define $S^*_p$ as all the spans in $p$ that match the ground-truth answer string.
The Retrieval Module
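We use two separate encoders to encode the questions and paragraphs, and the inner product of the output vectors is used as the matching score. Both the question encoder and the paragraph encoder are based on the BERT-base architecture. We add linear layers $W_q$ and $W_p$ above the final representations of the [CLS] token to derive the question and paragraph representations. Formally, we have
$$h_q = W_q\,\mathrm{BERT}_Q(q)_{[\mathrm{CLS}]}, \qquad h_p = W_p\,\mathrm{BERT}_P(p)_{[\mathrm{CLS}]}.$$
The matching score between question $q$ and paragraph $p$ is modeled as the inner product $h_q^\top h_p$. Thus, the probability of selecting $p$ given $q$ is calculated as:
$$P_\theta(p \mid q) = \frac{\exp(h_q^\top h_p)}{\sum_{p'} \exp(h_q^\top h_{p'})}.$$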
In practice, we only consider the top-k retrieved paragraphs for normalization. Both the encoders in this module will be pretrained using our progressive method, then we use the paragraph encoder to build the dense index.
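The scoring and top-k normalization described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in our experiments: the 2-dimensional vectors stand in for the projected [CLS] outputs of the two encoders, and the function names `inner_product` and `retrieval_probabilities` are hypothetical.

```python
import math


def inner_product(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieval_probabilities(question_vec, paragraph_vecs, k):
    """Score every paragraph by inner product, keep the top-k,
    and normalize the scores of those k paragraphs with a softmax."""
    scores = [inner_product(question_vec, p) for p in paragraph_vecs]
    top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    max_score = max(scores[i] for i in top_k)  # subtracted for numerical stability
    exps = {i: math.exp(scores[i] - max_score) for i in top_k}
    total = sum(exps.values())
    return {i: exps[i] / total for i in top_k}


# Toy vectors (assumed values) in place of real encoder outputs.
question = [1.0, 0.0]
paragraphs = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0], [-1.0, 0.0]]
probs = retrieval_probabilities(question, paragraphs, k=2)
```

Because the paragraph vectors do not depend on the question, they can be computed once for the whole corpus; only the question vector has to be computed at query time.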
The Reading Module
The architecture of our reading comprehension model is identical to the one in the original BERT paper devlin-etal-2019-bert. We use two independent linear layers to predict the start and end position of the answer span. At training time, when calculating the span probabilities, we apply the shared-normalization technique proposed by DBLP:conf/acl/GardnerC18, which normalizes the probability across all the top-k retrieved paragraphs. This encourages the model to produce globally comparable answer scores. We denote this probability as $P^{sn}_\phi(s \mid p, q)$, in contrast to the original formulation $P_\phi(s \mid p, q)$ that normalizes the probability within each paragraph.
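To make the difference between the two normalization schemes concrete, the following sketch contrasts a softmax applied within each paragraph with a single softmax shared across all retrieved paragraphs. The logits are made-up values and the function names are hypothetical; only start-position logits are shown, and end positions would be handled the same way.

```python
import math


def softmax(values):
    """Numerically stable softmax over a flat list of logits."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def per_paragraph_normalization(logits_per_paragraph):
    """Original formulation: one softmax inside each paragraph."""
    return [softmax(logits) for logits in logits_per_paragraph]


def shared_normalization(logits_per_paragraph):
    """Shared normalization: one softmax over the positions of all paragraphs."""
    flat = [x for logits in logits_per_paragraph for x in logits]
    probs = softmax(flat)
    out, offset = [], 0
    for logits in logits_per_paragraph:
        out.append(probs[offset:offset + len(logits)])
        offset += len(logits)
    return out


# Paragraph 0 contains a confident start position; paragraph 1 does not.
start_logits = [[5.0, 1.0], [0.5, 0.0]]
local = per_paragraph_normalization(start_logits)
shared = shared_normalization(start_logits)
```

With per-paragraph normalization, the weak position in the second paragraph still receives a probability above 0.5, whereas under shared normalization it is suppressed by the much stronger candidate in the first paragraph, which is what makes the scores comparable across paragraphs.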
2.2 The Pretraining Method
With the predefined model architectures, we now describe how to pretrain the retrieval module using a better data generation strategy and a progressive training paradigm. Figure 1 depicts the whole pretraining process.
Pretraining Data Generation
As mentioned in previous sections, previous dense retrieval approaches rely on simple heuristics to generate the pretraining data. However, these synthetic matching pairs do not necessarily reflect the underlying matching pattern between questions and paragraphs. To minimize the gap between pretraining and the end task, we adopt a state-of-the-art pretrained sequence-to-sequence model, i.e., BART lewis2019bart, to generate high-quality questions. This model is pretrained on a large corpus with denoising auto-encoder objectives. We finetune this model on the NaturalQuestions dataset DBLP:journals/tacl/KwiatkowskiPRCP19 such that it learns to generate questions given the groundtruth answer string and the groundtruth paragraph (labeled as long answer in NaturalQuestions). As the multi-head attention in this model already captures fine-grained interactions between tokens, we directly concatenate the paragraph and the answer string with a separating token as the input to the BART model. After being finetuned, the model learns to generate high-quality questions, achieving a perplexity of 3.86 and a ROUGE-L score of 55.6 on the development set. Afterward, we use the spaCy package to recognize at most three named entities or dates in each paragraph of the corpus. These entity and date spans are considered as potential answer strings. Then we use the finetuned BART model to generate questions conditioned on the paragraph and each of the potential answers. The resulting question-paragraph pairs are collected to pretrain the retrieval module.
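The input construction for the question generator can be sketched as follows. This is only an illustration under assumptions: the separator string `"</s>"`, the function name `build_generator_inputs`, and the hand-written candidate list (standing in for the output of a named-entity recognizer) are all hypothetical and not taken from our implementation.

```python
def build_generator_inputs(paragraph, candidate_answers, max_answers=3, sep_token="</s>"):
    """Pair a paragraph with up to max_answers candidate answers,
    producing one generator input string per (paragraph, answer) pair."""
    selected = candidate_answers[:max_answers]
    return [f"{paragraph} {sep_token} {answer}" for answer in selected]


paragraph = "Marie Curie won the Nobel Prize in Physics in 1903."
# Hand-written stand-ins for recognized entity and date spans.
candidates = ["Marie Curie", "the Nobel Prize in Physics", "1903", "Physics"]
inputs = build_generator_inputs(paragraph, candidates)
```

Each resulting string would be fed to the finetuned generator, and the generated question is then paired with the same paragraph as one pretraining example.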
It is worth noting that the groundtruth answer paragraph supervision at this step could eventually be dropped: we could use only weakly supervised paragraphs to train the question generator, making our system fully weakly supervised. As the pretraining process takes lots of resources and we did not adopt the weakly supervised setting at the early stage, we conduct additional question generation experiments to verify this claim: when using weakly supervised paragraphs, the question generator still generates high-quality questions, achieving an average ROUGE-L score of 49.6 on the same development set. (Footnote 2: For reference, a state-of-the-art QG model ma2019improving trained with strong supervision achieves 49.9 ROUGE-L on a similar QA dataset that is also collected from real-user queries.) Additionally, our final QA performance on two other datasets suggests that the question generator does not lead to QA systems that are biased towards NaturalQuestions, i.e., we achieve larger improvements on other QA datasets.
In-batch Negative Sampling
To save computation and improve sample efficiency, we choose to use in-batch negative sampling in-batch-negative-sampling instead of gathering negative samples for each pair to pretrain the retrieval module. For each pair $(q, p)$ within a batch $B$, the paragraphs paired with other questions are regarded as negative paragraphs for $q$. Thus, our pretraining objective for each generated question is to minimize the negative log likelihood of selecting the correct $p$ among all paragraphs in the batch:
$$\mathcal{L}_{pre} = -\log \frac{\exp(h_q^\top h_p)}{\sum_{p' \in B} \exp(h_q^\top h_{p'})},$$
where $h_q$ and $h_p$ are the question and paragraph representations produced by the two encoders.
A graphic illustration of this strategy can be found in Figure 1. As the batch size is usually very small compared to the number of all the paragraphs in the corpus, the pretraining task is actually easier than the final retrieval task. In the whole corpus, there are usually lots of similar paragraphs, and these paragraphs could act as strong distractors for each other in the QA system. A good retriever should be able to learn fine-grained matching instead of just learning to distinguish obviously different paragraphs. Since existing dense retrieval methods typically use random examples in each batch, there could be many easy negative samples in the batch. Such easy negatives provide only weak supervision signals. Thus, a large batch size is usually adopted to include enough diverse negative samples. However, this is not feasible without hundreds of GPUs.
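The in-batch objective, and the reason easy negatives contribute little, can be illustrated with a small sketch. The vectors below are made-up values and `in_batch_nll` is a hypothetical helper: question i is paired with paragraph i, and every other paragraph in the batch serves as a negative.

```python
import math


def in_batch_nll(question_vecs, paragraph_vecs):
    """Mean negative log-likelihood where paragraph i is the positive for
    question i and all other paragraphs in the batch are negatives."""
    losses = []
    for i, q in enumerate(question_vecs):
        scores = [sum(a * b for a, b in zip(q, p)) for p in paragraph_vecs]
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        losses.append(log_z - scores[i])
    return sum(losses) / len(losses)


questions = [[1.0, 0.0], [0.0, 1.0]]
# Easy batch: the two paragraphs are clearly different from each other.
easy_paragraphs = [[4.0, 0.0], [0.0, 4.0]]
# Hard batch: the two paragraphs are similar and distract each other.
hard_paragraphs = [[4.0, 3.5], [3.5, 4.0]]

easy_loss = in_batch_nll(questions, easy_paragraphs)
hard_loss = in_batch_nll(questions, hard_paragraphs)
```

In this toy setting the loss on the easy batch is already close to zero, so it yields almost no gradient, while the batch of similar paragraphs still produces a substantial loss.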
The Progressive Training Paradigm
To provide effective negative samples when the batch size is small, we adopt a progressive training algorithm, shown in the lower part of Figure 1. Since the goal of the retrieval pretraining is to produce effective vector representations of paragraphs, we can leverage the model itself to find hard negative samples as the pretraining progresses. At certain time steps, we use the paragraph encoder to encode the whole corpus and cluster all pairs into many groups based on the paragraph encodings. These groups are supposed to include similar paragraphs and potentially similar questions. Then, we continue our pretraining by sampling each batch from one of the clusters. By doing this, we provide more challenging and effective negative samples for each pair even with a small batch size. Every time we recluster the whole corpus, the model is encouraged to learn finer-grained matching between questions and paragraphs. Algorithm 1 provides a formal description of the whole process.
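One reclustering round of this paradigm can be sketched as below. This is a simplified illustration rather than Algorithm 1 itself: it uses a plain k-means written from scratch on toy 2-dimensional vectors, and the helper names `kmeans` and `cluster_batches` are hypothetical. In the full procedure, the vectors would be re-encoded with the current paragraph encoder and the clustering repeated periodically.

```python
import random


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(vectors, num_clusters, iterations=10, seed=0):
    """Plain k-means; returns a cluster id for every vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, num_clusters)]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        assignment = [
            min(range(num_clusters), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        for c in range(num_clusters):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def cluster_batches(assignment, batch_size, seed=0):
    """Build batches of example indices so that each batch comes from one cluster."""
    rng = random.Random(seed)
    clusters = {}
    for idx, c in enumerate(assignment):
        clusters.setdefault(c, []).append(idx)
    batches = []
    for members in clusters.values():
        rng.shuffle(members)
        for start in range(0, len(members), batch_size):
            batches.append(members[start:start + batch_size])
    rng.shuffle(batches)
    return batches


# Toy paragraph encodings (assumed values) forming two obvious groups.
paragraph_vecs = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [5.0, 5.1], [5.1, 5.0], [4.9, 5.2]]
assignment = kmeans(paragraph_vecs, num_clusters=2)
batches = cluster_batches(assignment, batch_size=2)
```

Since every batch is drawn from a single cluster, the in-batch negatives for a question are paragraphs that the current encoder already considers similar to its positive paragraph.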
2.3 QA Finetuning
Once pretrained, we use the paragraph encoder to encode the corpus into an index of dense vectors. Following previous practice, we only finetune the question encoder and the reading comprehension model so that we can use the same corpus index for different datasets. For every training question, we obtain the question representation from the question encoder and retrieve the top-k paragraphs on the fly using existing maximum inner product search tools. For the reading module, we apply the shared-normalization trick and optimize the marginal probability of all matched answer spans in the top-k paragraphs:
$$\mathcal{L}_{reader} = -\log \sum_{p \in D^*} \sum_{s \in S^*_p} P^{sn}_\phi(s \mid p, q),$$
where $D^*$ denotes the top-k paragraphs that cover the answer, $S^*_p$ the spans in $p$ that match the answer string, and $P^{sn}_\phi$ the shared-normalized span probability.
In addition to the reader loss, we also incorporate the “early” loss used by lee-etal-2019-latent, which updates the question encoder using the top-5000 dense paragraph vectors. If we define $D^*_{5000}$ as those paragraphs in the top-5000 that contain the correct answer, then the “early” loss is defined as:
$$\mathcal{L}_{early} = -\log \sum_{p \in D^*_{5000}} P_\theta(p \mid q).$$
Thus our total finetuning loss is $\mathcal{L}_{early} + \mathcal{L}_{reader}$. Note this is different from the joint formulation used by lee-etal-2019-latent and guu2020realm, which consider the paragraphs as latent variables when calculating $P(s \mid q)$. We find the joint objective does not bring additional improvements, especially after we use shared normalization. More variants of the finetuning objectives will be discussed in §3.5. At inference time, we use a linear combination of the retrieval score and the answer span score to rank the answer candidates from the top-5 retrieved paragraphs.
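As a worked example of the finetuning objective, the sketch below computes the reader loss, the “early” loss, and their sum from made-up probabilities. The helper names are hypothetical, and the interpolation weight in `rank_score` is an assumption, since the weight of the linear combination used at inference time is not specified here.

```python
import math


def marginal_nll(probabilities):
    """Negative log of the summed probability of all correct items."""
    return -math.log(sum(probabilities))


# Shared-normalized probabilities of the spans that match the answer,
# collected across the top-k paragraphs (assumed values).
matching_span_probs = [0.30, 0.10, 0.05]
# Retrieval probabilities of the top-5000 paragraphs that contain the answer
# (assumed values).
answer_paragraph_probs = [0.20, 0.15]

reader_loss = marginal_nll(matching_span_probs)
early_loss = marginal_nll(answer_paragraph_probs)
total_loss = early_loss + reader_loss


def rank_score(retrieval_score, span_score, weight=0.5):
    """Linear combination for ranking answer candidates at inference time.
    The weight value is hypothetical."""
    return weight * retrieval_score + (1.0 - weight) * span_score
```

The two terms are simply added, so the retrieval probabilities and the span probabilities are never multiplied together as they would be in a joint formulation.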
We center our experiments on QA datasets that simulate real-world information-seeking scenarios. Namely, we consider 1) NaturalQuestions-Open DBLP:journals/tacl/KwiatkowskiPRCP19; lee-etal-2019-latent, which includes real-user queries from Google Search; 2) WebQuestions DBLP:conf/emnlp/BerantCFL13, which was originally designed for knowledge base QA and includes questions generated by the Google Suggest API; 3) CuratedTREC DBLP:conf/clef/BaudisS15, which includes real-user queries from MSNSearch and AskJeeves logs. Compared to other datasets such as SQuAD DBLP:conf/emnlp/RajpurkarZLL16 and TriviaQA DBLP:conf/acl/JoshiCWZ17, these questions are created without the presence of ground-truth answers or the target paragraphs, and thus are less likely to have lexical overlap with the answer paragraph; such overlap usually oversimplifies the realistic open-domain QA problem. Additionally, these datasets exclude context-dependent questions DBLP:conf/acl/GardnerC18 that are meaningless in open-domain settings.
3.2 Implementation Details
For pretraining, we use a batch size of 80 and aggregate the gradients every 8 batches. We use the Adam optimizer with learning rate 1e-5 for optimization and conduct 90K parameter updates. Following previous work lee-etal-2019-latent; min-etal-2019-discrete, we use the 12-20-2018 snapshot of English Wikipedia as our open-domain QA corpus. When splitting the documents into chunks, we try to reuse the original paragraph boundaries and create a new chunk every time the length of the current one exceeds 256. Overall, we created 12,494,770 text chunks, which is close to the number (13M) reported in previous work. These chunks are viewed as paragraphs in our framework. For clustering, we recluster all the chunks around every 20k updates. The number of clusters is set as 1024 at the beginning and 10000 at later steps. For better efficiency, we use a paragraph subset for finding the centroids.
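The chunking rule and the gradient accumulation arithmetic can be sketched as follows. This is an illustration under assumptions: length is measured in whitespace-separated tokens, which may differ from the tokenization actually used, and `chunk_document` is a hypothetical helper.

```python
def chunk_document(paragraphs, max_len=256):
    """Greedily merge consecutive paragraphs of one document and close the
    current chunk as soon as its length exceeds max_len."""
    chunks, current = [], []
    for paragraph in paragraphs:
        current.extend(paragraph.split())
        if len(current) > max_len:
            chunks.append(" ".join(current))
            current = []
    if current:  # keep the trailing, shorter chunk
        chunks.append(" ".join(current))
    return chunks


# Toy document with paragraphs of 100, 100, 100, and 50 tokens.
doc = ["a " * 100, "b " * 100, "c " * 100, "d " * 50]
chunks = chunk_document(doc)
chunk_lengths = [len(c.split()) for c in chunks]

# Gradients from 8 batches of 80 pairs are aggregated per parameter update.
effective_batch_size = 80 * 8
```

Here the first chunk closes after the third paragraph, giving chunks of 300 and 50 tokens, so paragraph boundaries are preserved at the cost of variable chunk lengths. Note that although each update aggregates 640 pairs, the in-batch negatives for a question still come only from its own batch of 80.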
While finetuning the modules for QA, we fix the paragraph encoder in the retrieval module, such that we only need to encode the corpus once and reuse the index for different datasets. For each question, we use the top-5 retrieved paragraphs for training and omit the question if the top-5 paragraphs fail to cover the answer. The MIPS-based retrieval is implemented with FAISS DBLP:journals/corr/JohnsonDJ17. (Footnote 3: We use the IndexIVFFlat index for efficient search. We assign all the vectors to 100 Voronoi cells and only search from the closest 20 cells.)
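The idea behind searching only the closest Voronoi cells can be illustrated with a toy inverted-file index. This sketch is a pure-Python stand-in and not the FAISS implementation: the class `TinyIVFIndex`, the use of squared L2 distance for cell assignment, and the choice of the first ten vectors as centroids are all assumptions made for illustration.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


class TinyIVFIndex:
    """Toy inverted-file index: vectors are bucketed by their nearest centroid,
    and a query only scans the nprobe closest buckets."""

    def __init__(self, centroids, nprobe):
        self.centroids = centroids
        self.nprobe = nprobe
        self.cells = [[] for _ in centroids]
        self.vectors = []

    def _closest_cells(self, vec, n):
        order = sorted(range(len(self.centroids)),
                       key=lambda c: sq_dist(vec, self.centroids[c]))
        return order[:n]

    def add(self, vec):
        vec_id = len(self.vectors)
        self.vectors.append(vec)
        self.cells[self._closest_cells(vec, 1)[0]].append(vec_id)

    def search(self, query, k):
        """Return ids of the k highest inner-product vectors among probed cells."""
        candidates = [i for c in self._closest_cells(query, self.nprobe)
                      for i in self.cells[c]]
        candidates.sort(key=lambda i: dot(query, self.vectors[i]), reverse=True)
        return candidates[:k]


rng = random.Random(0)
vectors = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]
index = TinyIVFIndex(centroids=vectors[:10], nprobe=10)  # probe every cell
for v in vectors:
    index.add(v)

query = [rng.gauss(0, 1) for _ in range(8)]
exact_top5 = sorted(range(len(vectors)),
                    key=lambda i: dot(query, vectors[i]), reverse=True)[:5]
full_probe_top5 = index.search(query, k=5)

index.nprobe = 2  # scan only the 2 closest cells: faster but approximate
fast_top5 = index.search(query, k=5)
```

Probing all cells reproduces exhaustive search, while a smaller number of probed cells trades some recall for speed; the setting of 100 cells with 20 probed cells sits between these two extremes.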
On NaturalQuestions-Open, we finetune for at most 3 epochs. For WebQuestions and CuratedTREC, we finetune for 10 epochs.
3.3 QA Performance
In Table 1, we first show that our progressive method (denoted as ProQA) is superior to most of the open-domain QA systems (the upper part of the table) that use conventional IR methods, even though we only use the top-5 paragraphs to predict the answer while these methods use dozens of retrieved paragraphs. For the dense retrieval method, we compare with ORQA lee-etal-2019-latent, which is most relevant to our study but simply uses pseudo question-paragraph pairs for pretraining and also requires a larger batch size (4,096). We achieve stronger performance than ORQA with far fewer updates and a limited number of GPUs. To the best of our knowledge, this is the first work showing that an effective dense corpus index can be obtained without using highly expensive computational resources, which are generally not accessible to most academic labs. The reduced computational requirement also makes our method easier to replicate for corpora in different domains.
It is worth noting that stronger open-domain QA performance has been achieved with a much larger pretrained model DBLP:journals/corr/abs-1910-10683, i.e., T5 roberts2020, or a better-designed pretraining paradigm combined with more updates, i.e., REALM guu2020realm. In Table 2, we compare our method with these state-of-the-art approaches in terms of both QA performance and computational resources. As T5 simply converts the QA problem into a sequence-to-sequence problem (decoding answers after encoding questions), it does not pretrain the corpus index. The disadvantage of this method is its inefficiency at inference time, as the model is orders of magnitude larger than the others. In contrast, REALM uses the same number of parameters as our approach and achieves significant improvements. However, it relies on ORQA initialization and further pretraining updates, and thus is still computationally expensive at pretraining. As our method directly improves the ORQA pretraining, we believe our method is complementary to the REALM approach.
| Method | EM | model size | batch size | # updates |
3.4 Ablation Studies
| ProQA (no clustering; 90k) | 36.9 | 47.7 | 57.0 |
| ProQA (no clustering; 70k) | 39.0 | 47.0 | 56.2 |
| ProQA (no clustering; 50k) | 35.7 | 44.1 | 52.9 |
To validate the sample efficiency of our method, we replicate the inverse-cloze pretraining approach from ORQA using the same amount of resources as we used while training our model, i.e., the same batch size and number of updates (90k). We also study the effect of the progressive training paradigm by pretraining the model with the same generated data but without the clustering-based sampling. We test the retrieval performance on WebQuestions before any finetuning. We use Recall@k as the evaluation metric, which measures how often the answer paragraphs appear in the top-k retrieval. The results are shown in Table 3. The weaker results of the non-clustering version of our method confirm the strong effect of the clustering-based progressive training algorithm, which brings 7-10% improvements on different metrics. We also report the results of early checkpoints of the non-clustering version. We can see that with the limited batch size, the improvements diminish as training goes on. This suggests the importance of introducing more challenging negative examples in the batch. Comparing the no-clustering version of our method and ORQA, we see that using the generated data results in much better retrieval performance (more than 15% improvements on all metrics).
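For reference, Recall@k can be computed as in the sketch below. It assumes that a question counts as a hit when at least one paragraph containing the answer appears among its top-k retrieved paragraphs; the paragraph ids and the function name `recall_at_k` are hypothetical.

```python
def recall_at_k(ranked_paragraphs_per_question, answer_paragraphs_per_question, k):
    """Fraction of questions with at least one answer paragraph in the top-k."""
    hits = 0
    for ranked, answers in zip(ranked_paragraphs_per_question,
                               answer_paragraphs_per_question):
        if any(pid in answers for pid in ranked[:k]):
            hits += 1
    return hits / len(ranked_paragraphs_per_question)


# Toy retrieval output: ranked paragraph ids for three questions.
rankings = [[3, 7, 1], [4, 2, 9], [5, 6, 8]]
# Ids of the paragraphs that contain the answer to each question.
gold = [{7}, {9}, {0}]

recall_1 = recall_at_k(rankings, gold, k=1)
recall_2 = recall_at_k(rankings, gold, k=2)
recall_3 = recall_at_k(rankings, gold, k=3)
```

In this toy example, no question is answered at rank 1, one of three is covered by the top 2, and two of three by the top 3.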
3.5 Analysis on Finetuning Objectives
With the pretrained retrieval model $P_\theta(p \mid q)$, how to finetune it along with the reading module can have nontrivial effects on the final QA performance. Here we use the development set of NaturalQuestions-Open to investigate the effects of different finetuning objectives.
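First, we investigate the effect of the joint objective as used by guu2020realm:
$$\mathcal{L}_{joint} = -\log \sum_{p \in D^*} P_\theta(p \mid q) \sum_{s \in S^*_p} P_\phi(s \mid p, q),$$
where $D^*$ is the set of top-k paragraphs that cover the answer and $S^*_p$ is the set of spans in $p$ that match the answer string.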
We test the joint formulation with and without the shared normalization. These two objectives correspond to the entries (2) and (4) in Table 4. For methods that use conventional IR methods, it is often beneficial to introduce a paragraph reranker that uses question-aware paragraph encoders, usually based on the [CLS] representation from BERT that takes the question and paragraph concatenation as input. This kind of reranker can usually provide more accurate paragraph scores than the TF-IDF or BM25 based retriever while ranking the answer candidates. Here we investigate whether the additional reranking component is necessary in the presence of the pretrained dense index. Specifically, we add another reranking scoring layer to our span prediction module $P_\phi(s \mid p, q)$, which encodes the paragraphs in a question-aware fashion. We then use the paragraph scores predicted by this reranking component instead of those of the pretrained retrieval model while selecting the best answer from the top-5 paragraphs.
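The joint objective discussed above treats the retrieved paragraph as a latent variable and marginalizes over it. A minimal numeric sketch, assuming this latent-variable form and using made-up probabilities and a hypothetical helper name, is:

```python
import math


def joint_nll(retrieval_probs, span_probs_per_paragraph):
    """Negative log of sum_p P(p|q) * sum_s P(s|p,q), taken over the
    answer-bearing paragraphs and their matching spans."""
    total = sum(p_ret * sum(spans)
                for p_ret, spans in zip(retrieval_probs, span_probs_per_paragraph))
    return -math.log(total)


# Two answer-bearing paragraphs (assumed values): retrieval probabilities
# and the probabilities of the matching spans inside each paragraph.
retrieval_probs = [0.6, 0.3]
span_probs = [[0.5, 0.2], [0.4]]

joint_loss = joint_nll(retrieval_probs, span_probs)
```

Unlike a sum of separate retrieval and reader losses, the span probabilities here are weighted by the retrieval probability of their paragraph before the logarithm is taken.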
Table 4 shows the results of different objective settings. Comparing the results of (1) vs (2), we find that the joint objective does not yield improvements when we use shared normalization. Even without shared normalization, as shown by (4) and (5), the improvements are negligible. By comparing (1) and (3), we see that with the pretrained retrieval model, adding an extra reranking module that uses question-aware paragraph encodings is not helpful. Finally, from entries (1) and (5), we see that shared normalization is essential in open-domain QA, which aligns with the findings in DBLP:conf/emnlp/WangNMNX19.
4 Related Work
The problem of answering questions without setting the limits on specific domains has been intensively studied since the earlier TREC QA competitions DBLP:conf/trec/Voorhees99. Studies in the early stage DBLP:conf/www/KwokEW01; DBLP:conf/emnlp/BrillDB02; ferrucci2010building; baudivs2015yodaqa mostly rely on highly sophisticated pipelines and heterogeneous resources. Based on the recent advances in machine reading comprehension, chen-etal-2017-reading shows that the open-domain QA problem can be simply formulated as a reading comprehension problem with the help of a standard IR component that provides the candidate paragraphs for answer extraction. This two-stage formulation is clean and effective enough to achieve competitive performance while only using Wikipedia as the knowledge resource.
Following this formulation, several recent studies have proposed to improve the system using stronger reading comprehension models DBLP:journals/corr/abs-1902-01718; DBLP:conf/iclr/WangY0ZGCWKTC18, more effective learning objectives DBLP:conf/acl/GardnerC18; min-etal-2019-discrete; DBLP:conf/emnlp/WangNMNX19 or paragraph reranking models DBLP:journals/corr/abs-1709-00023; lin-etal-2018-denoising; DBLP:conf/emnlp/LeeYKKK18. However, the retrieval components in these systems are still based on traditional inverted index methods, which are efficient but might fail when the target passage does not have enough lexical overlap with the question.
In contrast to the sparse term-based features used in TF-IDF or BM25, dense paragraph vectors learned by deep neural networks DBLP:conf/nips/ZhangSWGHC17; DBLP:journals/corr/ConneauKSBB17 can capture much richer semantics beyond the n-gram term features. In order to build effective paragraph encoders tailored for paragraph retrieval in open-domain QA, more recent studies lee-etal-2019-latent; chang2020pre; guu2020realm propose to pretrain Transformer encoders DBLP:conf/nips/VaswaniSPUJGKP17 using objectives that simulate the semantic matching between questions and paragraphs. For instance, the Inverse Cloze Task used by ORQA trains a two-encoder model to match a sentence and its original paragraph. These approaches demonstrate promising open-domain QA performance, but they require a lot of resources for pretraining, and often a huge batch size to introduce effective negative matching pairs. The focus of this paper is to reduce the computational requirements of building an effective corpus index such that the dense retrieval approach can be easily and cheaply adapted to other corpora in different domains.
We propose a resource-efficient method for pretraining a dense corpus index which can replace the traditional IR methods that use sparse features in open-domain QA systems. The proposed approach is powered by a better data generation strategy and a simple yet effective data sampling protocol for pretraining. With careful finetuning, we achieve stronger QA performance than a method that uses seven times more computational resources. We hope our method could encourage more energy-efficient pretraining methods in this direction such that the dense retrieval methods could be widely used for corpora from different domains.
This research was supported in part by DARPA Grant D18AP00044 funded under the DARPA YFA program, and a UCSB Institute of Energy Efficiency (IEE) seed grant. The authors are solely responsible for the contents of the paper, and the opinions expressed in this publication do not reflect those of the funding agencies.