Passage retrieval is fundamentally hampered by the brevity of passages. While document retrieval systems can rely on signals such as term frequency to estimate the importance of a given term in a document, passages usually do not offer this benefit. Consequently, traditional retrieval approaches often perform poorly at passage retrieval. Supervised deep learning approaches, in particular those that make use of pretrained contextualized language models, have successfully overcome this limitation by exploiting general language characteristics (Craswell2019OverviewTRECDL; Hashemi2019ANTIQUEAN). However, these approaches carry a substantial computational cost, which can make them impractical to use (Hofsttter2019LetsMR; nogueiradoc2query).
We propose a new approach for passage retrieval that models term importance (i.e., salience) and term expansion over a contextualized language model to build query and document representations. We call this approach EPIC (Expansion via Prediction of Importance with Contextualization). At query time, EPIC can be employed as an inexpensive re-ranking method because document representations can be pre-computed at index time. EPIC improves upon the prior state of the art on the MS-MARCO passage ranking dataset by substantially narrowing the effectiveness gap between practical approaches with subsecond retrieval times and those that are considerably more expensive, e.g., those using BERT as a re-ranker. Furthermore, the proposed representations are interpretable because the dimensions of the representation directly correspond to the terms in the lexicon. An overview is shown in Figure 1.
Neural re-ranking approaches can generally be characterized as either representation-based or interaction-based (Guo2016ADR). Representation-based models, like ours, build representations of a query and a passage independently and then compare these representations to calculate a relevance score. They are appealing because document representations can be computed at index time, reducing the query-time cost. Interaction-based models instead combine signals from the query and the document at query time to compute the relevance score (nogueira2019passage). The Duet model (mitra2019updated) aims to achieve low query-time latency by combining signals from both a representation-based and an interaction-based model. However, this approach substantially underperforms the latest pure interaction-based approaches, such as that of (nogueira2019passage). TK (hofstatter2020interpretable) attempts to bridge this performance gap by using a smaller transformer network, but it still relies on an interaction-based approach, which adds considerable computational overhead at query time. Finally, other proposals have investigated alternative approaches for offloading computational cost to index time. Doc2query (nogueira2019document) and docTTTTTquery (nogueiradoc2query) add important context to otherwise short documents by using a sequence-to-sequence model to predict additional terms to append to the document. DeepCT-Index (dai2019context) models an importance score for each term in the document and replaces the term-frequency values in the inverted index with these scores. Unlike these approaches, EPIC models both query and document term importance and performs document expansion. We find that it can be employed as an inexpensive yet effective re-ranking model; the impact on query-time latency can be as low as an additional 5ms per query (for a total of 68ms).
In summary, the novel contributions presented are:
We propose a new representation-based ranking model that is grounded in the lexicon.
We show that this model can improve ranking effectiveness for passage ranking, with a minimal impact on query-time latency.
We show that the model yields interpretable representations of both the query and the document.
We show that latency and storage requirements of our approach can be reduced by pruning the document representations.
For reproducibility, our code is integrated into OpenNIR (macavaney:wsdm2020-onir), with instructions and trained models available at:
Overview and notation. Our model follows the representation-focused neural ranking paradigm. That is, we train a model to generate query and document representations in a given fixed-length vector space (for ease of notation, we refer to passages as documents), and produce a ranking score by computing a similarity score between the two representations.
Assume that queries and documents are composed of sequences of terms drawn from a vocabulary $V$. Any sequence of terms, either a query or a document, is first represented as a sequence of vectors using a contextualized language model such as BERT (devlin-19). More formally, let $\epsilon(\cdot)$ denote the function associating an input sequence of terms $t_1, \dots, t_n$ with their contextualized embeddings $e_1, \dots, e_n$, where $e_i \in \mathbb{R}^b$ and $b$ is the size of the embedding. So, a $|q|$-term query $q$ is represented with the embeddings $e_1, \dots, e_{|q|}$, and a $|d|$-term document $d$ is represented with the embeddings $e_1, \dots, e_{|d|}$. Given the embeddings for queries and documents, we now illustrate the process for constructing query representations, document representations, and the final query-document similarity score.
Query representation. A query $q$ is represented as a sparse vector $\phi_q \in \mathbb{R}^{|V|}$. The elements of $\phi_q$ corresponding to terms not appearing in the query are set to $0$. For each term $t_i$ appearing in the $|q|$ terms of the query $q$, the corresponding element is equal to the importance of the term w.r.t. the query:
$$\phi_q(t_i) = \ln\big(1 + \mathrm{softplus}(w_q^\top e_i)\big), \qquad (1)$$
where $w_q \in \mathbb{R}^b$ is a vector of learned parameters. The softplus function is defined as $\mathrm{softplus}(x) = \ln(1 + e^x)$. The use of softplus ensures that no terms have a negative importance score, while imposing no upper bound. The outer logarithm prevents individual terms from dominating. When a term appears more than once in a query, the corresponding element of $\phi_q$ sums up the contributions of all occurrences. The elements of the query representation encode the importance of the terms w.r.t. the query. This approach allows the query representation model to learn to assign higher weights to the query terms that are most important to match given the textual context. Note that the number of non-zero elements in the representation is equal to the number of distinct query terms; thus the query processing time is proportional to the number of query terms (fntir2018).
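A minimal sketch of this term-importance function in plain Python (the parameter vector and embedding below are toy stand-ins, not trained weights):

```python
import math

def softplus(x):
    """Numerically stable softplus: ln(1 + e^x)."""
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

def term_importance(w, e):
    """Importance of a term given its contextualized embedding e and a
    learned parameter vector w: ln(1 + softplus(w . e)). Always >= 0."""
    dot = sum(wi * ei for wi, ei in zip(w, e))
    return math.log1p(softplus(dot))

# Toy values: a stand-in learned parameter vector and term embedding.
w_q = [0.5, -0.2, 0.1]
e_term = [1.0, 0.3, -0.4]
print(term_importance(w_q, e_term))
```

Note how the two nonlinearities interact: softplus keeps every score positive, while the outer logarithm dampens very large dot products so no single term dominates.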
Document representation. A document $d$ is represented as a dense vector $\phi_d \in \mathbb{R}^{|V|}$. Firstly, to perform document expansion, each $b$-dimensional term embedding $e_i$ is projected into a $|V|$-dimensional vector space, i.e., $p_i = W e_i$, where $W \in \mathbb{R}^{|V| \times b}$ is a matrix of learned parameters. Note that $p_i \in \mathbb{R}^{|V|}$, and let $p_i(t)$ denote the entry of this vector corresponding to term $t$. Secondly, the importance $c_i$ of each term w.r.t. the document is computed as in Eq (1):
$$c_i = \ln\big(1 + \mathrm{softplus}(w_d^\top e_i)\big), \qquad (2)$$
where $w_d \in \mathbb{R}^b$ is a vector of learned parameters. Thirdly, we compute a factor representing the overall quality of the document:
$$\alpha = \sigma(w_c^\top e_{\mathrm{cls}}), \qquad (3)$$
where $w_c \in \mathbb{R}^b$ is a vector of learned parameters, and $e_{\mathrm{cls}}$ is the embedding produced by the contextualized language model's classification mechanism. We find that this factor helps give poor-quality documents lower values overall. The sigmoid function is defined as $\sigma(x) = 1/(1 + e^{-x})$. Finally, for each term $t$ appearing in the vocabulary, the corresponding element of the document representation is defined as:
$$\phi_d(t) = \alpha \, \max_{i \in \{1, \dots, |d|\}} c_i \ln\big(1 + \mathrm{softplus}(p_i(t))\big). \qquad (4)$$
This step takes the maximum score for each term in the vocabulary generated by any term in the document. Such document representations can be computed at index time, as they do not rely on the query.
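To make the construction concrete, here is a small self-contained sketch of the projection, importance, quality, and max steps described above, using a toy 5-term vocabulary and 3-dimensional embeddings (all weight values are illustrative stand-ins, not values from a trained model):

```python
import math

def softplus(x):
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def imp(x):
    # The non-negative importance transform: ln(1 + softplus(x)).
    return math.log1p(softplus(x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def doc_vector(doc_embs, cls_emb, W, w_d, w_c):
    """Dense |V|-dimensional document representation.
    doc_embs: per-term contextualized embeddings (|d| x b)
    cls_emb:  embedding from the model's classification mechanism
    W:        |V| x b expansion/projection matrix
    w_d, w_c: learned parameter vectors (length b)"""
    alpha = sigmoid(dot(w_c, cls_emb))             # overall document quality
    c = [imp(dot(w_d, e)) for e in doc_embs]       # per-term importance
    phi = []
    for t in range(len(W)):                        # one entry per vocab term
        p_t = [dot(W[t], e) for e in doc_embs]     # projection scores for t
        phi.append(alpha * max(c[i] * imp(p_t[i])  # max over document terms
                               for i in range(len(doc_embs))))
    return phi

# Toy instance: |V| = 5, b = 3, |d| = 2.
W   = [[0.2, -0.1, 0.3], [0.0, 0.4, -0.2], [0.1, 0.1, 0.1],
       [-0.3, 0.2, 0.0], [0.5, -0.5, 0.2]]
w_d = [0.3, 0.1, -0.2]
w_c = [0.2, 0.2, 0.2]
doc_embs = [[1.0, 0.0, 0.5], [-0.2, 0.8, 0.1]]
cls_emb  = [0.4, 0.1, -0.3]
phi_d = doc_vector(doc_embs, cls_emb, W, w_d, w_c)
print([round(v, 3) for v in phi_d])
```

In practice this whole computation happens once per document at index time; only the resulting vector needs to be stored.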
Similarity measure. We use the dot product to compute the similarity between the query and document vectors, i.e., $s(q, d) = \phi_q^\top \phi_d$.
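Because the query vector is non-zero only at positions corresponding to query terms, the dot product can be computed as a sum over the query terms alone; a small sketch with toy vectors:

```python
def score(query_weights, doc_vec):
    """Relevance score as a dot product, exploiting query sparsity.
    query_weights: {vocab_index: importance} for terms in the query
    doc_vec:       dense |V|-dimensional document representation"""
    return sum(w * doc_vec[t] for t, w in query_weights.items())

phi_d = [0.0, 0.8, 0.1, 0.4, 0.0]   # toy dense document vector
phi_q = {1: 1.2, 3: 0.5}            # toy sparse query vector
print(score(phi_q, phi_d))          # sums over the two query terms only
```

This is why query processing time scales with the number of query terms rather than with the vocabulary size.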
3. Experimental Evaluation
We conduct experiments using the MS-MARCO passage ranking dataset (full ranking setting; https://microsoft.github.io/msmarco/). This dataset consists of approximately 1 million natural-language questions gathered from a query log (average length: 7.5 terms, stddev: 3.1) and approximately 8.8 million candidate answer passages (average length: 73.1 terms, stddev: 28.4). The dataset is shallowly annotated. Annotators were asked to write a natural-language answer to a given question using a set of candidate passages from a commercial search engine, and to indicate which (if any) of the passages contributed to their answers; these passages are then treated as relevant to the question. This results in 0.7 judgments per query on average (1.1 judgments per query for the 62% of queries that have an answer). Thus, the dataset has considerable variation in queries, making it suitable for training neural ranking methods. Although the dataset is limited by its method of construction, performance on these shallow judgments correlates well with evaluations conducted on a deeply-judged subset (Craswell2019OverviewTRECDL).
Training. We train our model using the official MS-MARCO sequence of training triples (query, relevant passage, presumed non-relevant passage) with cross-entropy loss. We use BERT-base (devlin-19) as the contextualized language model, as it has been shown to be an effective foundation for various ranking techniques (dai2019context; macavaney:sigir2019-cedr; nogueira2019passage). We set the dimensionality of our representations to the size of the BERT-base word-piece vocabulary ($|V|$ = 30,522). The embedding size is instead $b$ = 768. $W$ is initialized to the pre-trained masked language model prediction matrix; all other added parameters are randomly initialized. Errors are back-propagated through the entire BERT model using the Adam optimizer (Kingma2015AdamAM). We train in batches of 16 triples using gradient accumulation, and we evaluate the model on a validation set of 200 random queries from the development set every 512 triples. The optimal training iteration and re-ranking cutoff threshold are selected using this validation set. We roll back to the top-performing model (training iteration 42) after 20 consecutive validation iterations without improvement in Mean Reciprocal Rank at 10 (MRR@10).
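The text does not spell out the exact loss formulation; one common reading of cross-entropy over a (relevant, non-relevant) training pair is a two-way softmax over the two relevance scores. A sketch under that assumption (the scores below are toy values, not model outputs):

```python
import math

def pairwise_ce_loss(score_pos, score_neg):
    """Cross-entropy over a (relevant, non-relevant) score pair:
    -ln( e^{s+} / (e^{s+} + e^{s-}) ), computed stably via the
    log-sum-exp trick."""
    m = max(score_pos, score_neg)
    log_denom = m + math.log(math.exp(score_pos - m) +
                             math.exp(score_neg - m))
    return log_denom - score_pos

print(pairwise_ce_loss(2.0, 0.5))  # relevant scored higher: small loss
print(pairwise_ce_loss(0.5, 2.0))  # ranking inverted: larger loss
```

Minimizing this loss pushes the relevant passage's score above the non-relevant one's.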
Table 1. MRR@10 and mean query latency (ms) on the MS-MARCO Dev set.

| Model | MRR@10 | Latency (ms) |
| BM25 (from Anserini (Yang2018AnseriniRR)) | 0.198 | 21 |
| EPIC + BM25 (ours) | 0.273 | 106 |
| EPIC + docTTTTTquery (ours) | 0.304 | 78 |
| Duet (v2, ensemble) (mitra2019updated) | 0.252 | 440 |
| BM25 + TK (1 layer) (hofstatter2020interpretable) | 0.303 | 445 |
| BM25 + TK (3 layers) (hofstatter2020interpretable) | 0.314 | 640 |
| BM25 + BERT (large) (nogueira2019passage) | 0.365 | 3,500* |
Baselines and evaluation. We test our approach by re-ranking the results of several first-stage rankers. We report performance using MRR@10, the official evaluation metric, on the MS-MARCO passage ranking Dev set. We measure statistical significance using a paired t-test. We compare the performance of our approach with the following baselines:
BM25 retrieval from a Porter-stemmed Anserini (Yang2018AnseriniRR) index using default settings. (We observe that the default settings outperform the BM25 results reported elsewhere and on the official leaderboard, e.g., (nogueira2019document).)
DeepCT-Index (dai2019context), a model which predicts document term importance scores, and replaces the term frequency values with these importance scores for first-stage retrieval.
doc2query (nogueira2019document), a document expansion approach which predicts additional terms to add to the document via a sequence-to-sequence transformer model. These terms are then indexed and used for retrieval using BM25.
docTTTTTquery (nogueiradoc2query), an improved doc2query approach that replaces the transformer model with the pre-trained Text To Text Transfer Transformer (T5) model (Raffel2019ExploringTL).
Duet (mitra2019updated), a hybrid representation- and interaction-focused model. We include the top Duet variant on the MS-MARCO leaderboard (version 2, ensemble) to compare with another model that utilizes query and document representations.
TK (hofstatter2020interpretable), a contextualized interaction-based model, focused on minimizing query time. We report results from (hofstatter2020interpretable) with the optimal re-ranking threshold and measure end-to-end latency on our hardware.
BERT Large (nogueira2019passage), an expensive contextualized language model-based re-ranker. This approach differs from ours in that it models the query and passage jointly at query time, and uses the model’s classification mechanism for ranking.
We also measure query latency over the entire retrieval and re-ranking process. The experiments were conducted on commodity hardware equipped with an AMD Ryzen 3.9GHz processor, 64GiB of DDR4 memory, a GeForce GTX 1080ti GPU, and an SSD. We report the latency of each method as the average execution time (in milliseconds) of 1,000 queries from the Dev set, after an initial 1,000 queries are used to warm up the cache. First-stage retrieval is conducted with Anserini (Yang2018AnseriniRR).
Ranking effectiveness. We report the effectiveness of our approach in terms of MRR@10 in Table 1. When re-ranking BM25 results, our approach substantially outperforms doc2query and DeepCT-Index. Moreover, it performs comparably to docTTTTTquery (0.273 compared to 0.277; no statistically significant difference). More importantly, we observe that the improvements of our approach and docTTTTTquery are additive: we achieve an MRR@10 of 0.304 when they are used in combination. This is a statistically significant improvement, and it substantially narrows the gap between approaches with low query-time latency and those that trade off latency for effectiveness (e.g., BERT Large).
To test whether EPIC is effective on other passage ranking tasks, we also evaluate on the TREC CAR passage ranking benchmark (Dietz2017TRECCA). When trained and evaluated on the 2017 dataset (automatic relevance judgments), re-ranking BM25 results increases MRR from 0.235 to 0.353. This also outperforms the DeepCT performance of 0.332 reported in (dai2019context).
Effect of document representation pruning. For document vectors, we observe that the vast majority of values are very low (approximately 74% have a value of 0.1 or below; see Figure 2). This suggests that many of the values can be pruned with little impact on overall performance. This is desirable because pruning can substantially reduce the storage required for the document representations. To test this, we apply our method keeping only the top k values for each document. We show the effectiveness and efficiency of k = 2,000 (which reduces the vocabulary by 93.4%) and k = 1,000 (96.7%) in Table 1. We observe that the vectors can be pruned to k = 1,000 with virtually no difference in ranking effectiveness (differences not statistically significant). We also tested lower values of k, but found that effectiveness drops off considerably (0.241 and 0.285 for BM25 and docTTTTTquery, respectively).
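The pruning step above can be sketched as keeping only the largest entries of each dense document vector and storing them sparsely (the vector and cutoff below are toy values):

```python
def prune(doc_vec, k):
    """Keep only the k largest entries of a dense document vector,
    returning a sparse {vocab_index: value} mapping."""
    top = sorted(range(len(doc_vec)),
                 key=lambda t: doc_vec[t], reverse=True)[:k]
    return {t: doc_vec[t] for t in top}

phi_d = [0.05, 0.8, 0.02, 0.4, 0.01, 0.3]
print(prune(phi_d, 3))   # keeps the entries at indices 1, 3, and 5
```

Because most entries are near zero, dropping them changes the dot-product score only slightly while shrinking storage dramatically.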
Ranking efficiency. We find that EPIC can be implemented with minimal impact on query-time latency. On average, the computation of the query representation takes 18ms on GPU and 51ms on CPU. Since first-stage retrieval does not use our query representation, the representation is computed in parallel with the initial retrieval, which reduces its impact on latency. The similarity measure consistently takes approximately 1ms per query (both on CPU and GPU), with the remainder of the time spent retrieving document representations from disk. Interestingly, we observe that the latency of EPIC + BM25 is higher than that of EPIC + docTTTTTquery. This is because re-ranking docTTTTTquery results requires a lower re-ranking cutoff threshold than re-ranking BM25 results. This further underscores the importance of an effective first-stage ranker. With pruning, the computational overhead can be substantially reduced: EPIC adds only a 5ms overhead per query to docTTTTTquery, while yielding a significant improvement in effectiveness, and EPIC + BM25 performs comparably to docTTTTTquery while being faster.
Cost of pre-computing. We find that document vectors can be pre-computed for the entire MS-MARCO collection in a matter of hours on a single commodity GPU (GeForce GTX 1080ti). This is considerably less expensive than docTTTTTquery, which requires hours of processing on a Google TPU (v3). When stored as half-precision (16-bit) floating point values, the vector for each document uses approximately 60KiB, regardless of the document length. This results in a total storage burden of approximately 500GiB for the entire collection. Pruning the vectors to the top k = 1,000 values (which has minimal impact on ranking effectiveness) reduces the storage burden of each document to 3.9KiB (using 16-bit integer indices) and the total storage to 34GiB.
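These storage figures follow directly from the representation sizes; a quick back-of-the-envelope check, assuming the BERT word-piece vocabulary size of 30,522 and roughly 8.8 million passages (the approximate size of the MS-MARCO passage collection):

```python
V = 30_522            # vocabulary size (BERT word-piece)
n_docs = 8_841_823    # approximate MS-MARCO passage count (assumption)
bytes_fp16 = 2        # half-precision float

dense_per_doc = V * bytes_fp16            # dense vector: ~60 KiB per doc
total_dense = dense_per_doc * n_docs      # whole collection: ~500 GiB
print(round(dense_per_doc / 1024, 1), "KiB per document")
print(round(total_dense / 1024**3), "GiB total")

k = 1000                                  # pruned vector size
pruned_per_doc = k * (bytes_fp16 + 2)     # 16-bit value + 16-bit index
print(round(pruned_per_doc / 1024, 1), "KiB per pruned document")
```

The dense-vector cost is independent of document length, which is why pruning pays off uniformly across the collection.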
Interpretability of representations. A benefit of our approach is that the dimensions of the representation correspond to terms in the lexicon, allowing the representations to be easily inspected. In Figure 3, we present the relative scores for sample queries from MS-MARCO. We observe that the model generally picks up on terms that match intuitions of term importance. For instance, (a) gives the highest scores to california, aaa (American Automobile Association), and tow; these three terms are good candidates for a keyword-based query with the same intent. The approach does not simply remove stop words: in (b), "what" is assigned a relatively high score.
We provide an example of document vector importance scores in Figure 4. Because the document vector is dense, the figure shows only the terms that appear directly in the document and other top-scoring terms. Notice that terms related to price, endless, pool(s), and cost are assigned the highest scores. In this case, the expansion of the term cost was critical for properly scoring this document: although the terms that docTTTTTquery generates for this document are similar, the continuous values generated by EPIC yielded a higher reciprocal rank for the query "cost of endless pools/swim spa" (a relevant question for this passage).
We demonstrated an effective and inexpensive technique for re-ranking passages based on lexicon-grounded representations generated from contextualized language models. This work advances the state of the art by further approaching full BERT-based re-ranking performance while providing low query-time latency and easily interpretable representations. We also find that pruning can effectively reduce query latency and the size of the pre-computed passage representations without sacrificing effectiveness. Future work can investigate how well this approach generalizes to document retrieval.
Work partially supported by the ARCS Foundation. Work partially supported by the Italian Ministry of Education and Research (MIUR) in the framework of the CrossLab project (Departments of Excellence). Work partially supported by the BIGDATAGRAPES project funded by the EU Horizon 2020 research and innovation programme under grant agreement No. 780751, and by the OK-INSAID project funded by the Italian Ministry of Education and Research (MIUR) under grant agreement No. ARS01_00917.