Establishing Strong Baselines for TripClick Health Retrieval; ECIR 2022
We present strong Transformer-based re-ranking and dense retrieval baselines for the recently released TripClick health ad-hoc retrieval collection. We improve the originally too-noisy training data with a simple negative sampling policy. We achieve large gains over BM25 in the re-ranking task of TripClick, which were not achieved with the original baselines. Furthermore, we study the impact of different domain-specific pre-trained models on TripClick. Finally, we show that dense retrieval outperforms BM25 by considerable margins, even with simple training procedures.
The latest neural network advances in Information Retrieval (IR) – specifically the ad-hoc passage retrieval task – are driven by available training data, especially the large web-search-based MSMARCO collection[msmarco16]. Here, neural approaches lead to enormous effectiveness gains over traditional techniques [hofstaetter2020_crossarchitecture_kd, khattab2020colbert, macavaney2019, nogueira2019passage]. A valid concern is the generalizability and applicability of the developed techniques to other domains and settings [lin2021proposed, lietal-multitask21, thakur2021beir, wang-2021-TSDAE].
The newly released TripClick collection [rekabsaz2021tripclick] with large-scale click log data from the Trip Database, a health search engine, provides us with the opportunity to re-test previously developed techniques on this new ad-hoc retrieval task: keyword search in the health domain with large training and evaluation sets. TripClick provides three different test sets (Head, Torso, Tail), grouped by their query frequency, so we can analyze model performance for different slices of the overall query distribution.
This study conducts a range of controlled ad-hoc retrieval experiments using pre-trained Transformer [vaswani2017attention] models with various state-of-the-art retrieval architectures on the TripClick collection. We aim to reproduce effectiveness gains achieved on MSMARCO in the click-based health ad-hoc retrieval setting. Typically, neural ranking models are trained with a triple of a query, a relevant passage, and a non-relevant passage. As part of our evaluation study, we discovered a flaw in the provided neural training data of TripClick: the original negative sampling strategy included non-clicked results, which led to inadequate training. Therefore, we re-created the training data with an improved negative sampling strategy, based solely on BM25 negatives, with better results than published baselines.
As the TripClick collection was released only recently, we are the first to study a wide-ranging number of BERT-style ranking architectures and answer the fundamental question:
How do established ranking models perform on re-ranking TripClick?
In the re-ranking setting, where the neural models score a set of candidates produced by BM25, we observe large effectiveness gains for BERT_CAT, ColBERT, and TK on every one of the three frequency-based query splits. BERT_CAT improves substantially over BM25 on Head and Torso queries, and still clearly on Tail queries.
We compare the general BERT-Base & DistilBERT with the domain-specific SciBERT & PubMedBERT models to answer:
Which BERT-style pre-trained checkpoint performs best on TripClick?
Although the general domain models show good effectiveness results, they are outperformed by the domain-specific pre-training approaches. Here, PubMedBERT slightly outperforms SciBERT on re-ranking. An ensemble of all domain-specific models again outperforms all previous approaches and sets new state-of-the-art results for TripClick.
Finally, we study the concept of retrieving passages directly from a nearest neighbor vector index, also referred to as dense retrieval, and answer:
How well does dense retrieval work on TripClick?
Dense retrieval outperforms BM25 considerably for initial candidate retrieval, both in top-10 precision results and for all recall cutoffs, except top-1000. In contrast to re-ranking, SciBERT outperforms PubMedBERT on dense retrieval results.
We publish our source code as well as the improved training triples at:
We describe the collection, the BERT-style pre-training instances, ranking architectures, and training procedures we use below.
TripClick contains a large collection of passages, a large set of click-based training queries, and held-out test queries. The TripClick collection includes three test sets, grouped by query frequency and called Head, Torso, and Tail queries. For the Head queries, a DCTR [chuklin2015click] click model was employed to create relevance signals; the other two sets use raw clicks.
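As a rough illustration of how a DCTR-style click model turns raw clicks into graded relevance signals, the following is a simplified sketch (function and variable names are hypothetical): DCTR estimates the attractiveness of each query-document pair as its click-through rate over all impressions.

```python
from collections import defaultdict

def dctr_relevance(impression_log):
    # impression_log: iterable of (query, doc, clicked) tuples.
    # DCTR estimates relevance as the click-through rate per query-doc pair.
    shown = defaultdict(int)
    clicks = defaultdict(int)
    for query, doc, clicked in impression_log:
        shown[(query, doc)] += 1
        clicks[(query, doc)] += int(clicked)
    return {pair: clicks[pair] / shown[pair] for pair in shown}
```

The resulting per-pair rates can then be thresholded or binned into graded relevance labels for evaluation.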
In comparison to the widely analyzed MSMARCO collection [hofstaetter2021mitigating], TripClick is yet to be fully understood. This includes the quality of the click labels and the effect of various filtering mechanisms of the professional search production UI, which are not part of the released data.[1]

[1] The Trip Database allows users to select different ranking schemes, such as popularity, source quality, and pure relevance, as well as to filter results by facets. Unfortunately, this information is not available in the public dataset.
We study multiple architectures at different points on the efficiency vs. effectiveness tradeoff. Here, we give a brief overview; for more detailed comparisons see Hofstätter et al. [hofstaetter2020_crossarchitecture_kd].
BERT_CAT – Concatenated Scoring The base re-ranking model BERT_CAT [nogueira2019passage, macavaney2019, yilmaz2019cross] concatenates query and passage sequences with special tokens and computes a score by reducing the pooled CLS representation with a single linear layer. It represents one of the current state-of-the-art models in terms of effectiveness; however, it exhibits many drawbacks in terms of efficiency [Hofstaetter2019_osirrc, xiong2020approximate].
ColBERT The ColBERT model [khattab2020colbert] delays the interactions between every query and document representation until after BERT. The interactions in the ColBERT model are aggregated with a max-pooling per query term and a sum of query-term scores. The aggregation only requires simple dot product computations; however, the storage cost of pre-computing passage representations is very high, as it depends on the total number of terms in the collection.
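The max-pool-then-sum aggregation can be illustrated with a toy sketch in plain Python (the real ColBERT operates on contextualized BERT term embeddings; the vectors below are hypothetical stand-ins):

```python
def colbert_score(query_vecs, passage_vecs):
    # Late interaction: for each query term vector, take the maximum
    # dot product over all passage term vectors, then sum these maxima.
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, p) for p in passage_vecs) for q in query_vecs)
```

Because only dot products and max/sum reductions remain after BERT, the passage term vectors can be pre-computed offline; this is exactly where the high storage cost comes from.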
TK (Transformer-Kernel) The Transformer-Kernel model [Hofstaetter2020_ecai] is not based on BERT pre-training, but rather uses shallow and independently computed Transformers followed by a set of RBF kernels to count match signals in a term-by-term match matrix, for very efficient re-ranking.
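The kernel-counting idea can be sketched as follows, in a simplified, non-learned form (the actual TK model applies a log-saturation per query term and learns the weighting of the kernel features; the kernel centers `mus` and width `sigma` below are illustrative):

```python
import math

def tk_kernel_features(match_matrix, mus=(1.0, 0.5, 0.0), sigma=0.1):
    # match_matrix: one row per query term, one cosine similarity per
    # passage term. Each RBF kernel (centered at mu) soft-counts how many
    # query-passage term pairs have a similarity close to mu.
    features = []
    for mu in mus:
        total = 0.0
        for row in match_matrix:
            total += sum(math.exp(-((s - mu) ** 2) / (2 * sigma ** 2))
                         for s in row)
        features.append(total)
    return features
```

A scoring layer would then combine these per-kernel counts into a single relevance score.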
BERT_DOT – Dense Retrieval The BERT_DOT model matches a single CLS vector of the query with a single CLS vector of a passage [xiong2020approximate, luan2020sparse, lu2020twinbert], both computed independently. This decomposition of interactions to a single dot product allows us to pre-compute every contextualized passage representation and employ a nearest neighbor index for dense retrieval, without a traditional first stage.
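A minimal sketch of the dense retrieval step, assuming pre-computed passage CLS vectors (a production system would use an approximate nearest neighbor index such as Faiss rather than this brute-force scan; names here are hypothetical):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dense_retrieve(query_vec, passage_index, k=10):
    # passage_index: passage id -> pre-computed CLS vector.
    # Score every passage with a single dot product, return the top-k ids.
    scored = sorted(((dot(query_vec, vec), pid)
                     for pid, vec in passage_index.items()),
                    reverse=True)
    return [pid for _, pid in scored[:k]]
```

The key property is that the query encoder runs once per query, and all passage-side computation is amortized into the offline indexing step.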
The 12-layer BERT-Base model [devlin2018bert] (and the 6-layer distilled version, DistilBERT [sanh2019distilbert]) and its vocabulary are based on the Books Corpus and English Wikipedia articles. The SciBERT model [beltagy-etal-2019-scibert] uses an architecture identical to BERT-Base, but its vocabulary and weights are pre-trained on Semantic Scholar articles (largely from the broad biomedical domain). Similarly, the PubMedBERT model [pubmedbert2020] and its vocabulary are trained on PubMed articles using the same architecture as the BERT model.
At the time of writing, this is the first paper evaluating on the novel TripClick collection. However, many other tasks have been set up before in the biomedical retrieval domain, such as BioASQ [bioasq], the TREC Precision Medicine tracks [trecprecmedicine, trechealthmisinformation], or the timely created TREC-COVID [treccovid, moller-etal-2020-covid, covidjimmylin] (which is based on CORD-19 [wang2020cord19], a collection of scientific articles concerned with the coronavirus pandemic).
For TREC-COVID, MacAvaney et al. [macavaney2020sledge] train a neural re-ranking model on a subset of the MS MARCO dataset containing only medical terms (Med-MARCO) and demonstrate its domain-focused effectiveness on a transfer to TREC-COVID. Xiong et al. [xiong2020cmttreccovid] and Lima et al. [lima2020denmarkstreccovid] explore medical domain-specific BERT representations for retrieval from the TREC-COVID corpus and show that using SciBERT for dense retrieval outperforms the BM25 baseline by a large margin. Wang et al. [wang2020participation] explore continuous active learning for the retrieval task on the COVID-19 corpus; this method has also been studied for retrieval in the precision medicine track [trecprecmedicine, Cormack_Grossman_2018a]. Reddy et al. [reddy2020endtoend] demonstrate synthetic training for question answering of COVID-19 related questions.
Many of these related works are concerned with overcoming the lack of large training data on previous medical collections. Now with TripClick we have a large-scale medical retrieval dataset. In this paper we jumpstart work on this collection, by showcasing the effectiveness of neural ranking approaches on TripClick.
In our experiment setup, we largely follow Hofstätter et al. [hofstaetter2020_crossarchitecture_kd], except where noted otherwise. Mainly, we rely on the PyTorch [pytorch2017] and HuggingFace Transformers [wolf2019huggingface] libraries as the foundation for our neural training and evaluation methods. For TK, we follow Rekabsaz et al. [rekabsaz2021tripclick] and utilize a PubMed-trained word embedding as the starting point [mcdonald2018deep]. For validation and testing, we utilize the data splits outlined in TripClick by Rekabsaz et al. [rekabsaz2021tripclick].
The TripClick dataset conveniently comes with a set of pre-generated training triples for neural training. Nevertheless, we found this training set to produce less than optimal results, and the trained BERT models show no robustness against increased re-ranking depth. This phenomenon of having to tune the best re-ranking depth for effectiveness, rather than efficiency, has been studied as part of early non-BERT re-rankers [Hofstaetter2019_sigir]. With the advent of Transformer-based re-rankers, this technique became obsolete [Hofstaetter2020_ecai].
In the TripClick dataset, the clicked results are considered as positive samples for training. However, we discovered a flaw in the published negative sampling procedure: non-clicked results (ranked above the clicked ones) are included as negative sampled passages. We hypothesize that this leads to many false negatives in the training set, confusing the models during training. We confirm this hypothesis by observing our training telemetry data, which shows low pairwise training accuracy as well as a lack of clear distinction in the scoring margins of the models. For all results presented in this study, we generate new training data with the following simple procedure:
1. We generate BM25 candidates for every training query.
2. For every query–relevant (clicked) passage pair in the training set, we randomly sample, without replacement, a fixed number of negative candidates from the candidates created in step 1.
3. We remove candidates present in the relevant pool, regardless of relevance grade.
4. We discard positional information (we expect position bias to be in the training data; a potential direction for future work).
5. After shuffling the training triples, we save them for training.
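The steps above can be sketched as follows (function and variable names are hypothetical; the actual candidate depth and number of negatives follow our released training data):

```python
import random

def build_triples(bm25_candidates, relevant, n_negatives=4, seed=42):
    # bm25_candidates: query id -> ranked list of BM25 candidate passage ids
    # relevant: query id -> set of clicked (positive) passage ids
    rng = random.Random(seed)
    triples = []
    for qid, candidates in bm25_candidates.items():
        # Steps 1 & 3: BM25 candidates minus anything in the relevant pool.
        pool = [pid for pid in candidates if pid not in relevant[qid]]
        for pos in sorted(relevant[qid]):
            # Step 2: sample negatives without replacement.
            for neg in rng.sample(pool, min(n_negatives, len(pool))):
                # Step 4: positional information is discarded here.
                triples.append((qid, pos, neg))
    rng.shuffle(triples)  # step 5
    return triples
```

Because negatives come only from BM25 candidates outside the relevant pool, no clicked passage can appear as a false negative.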
Our new training set gave us clear improvements in MRR@10 and nDCG@10 for the Head validation queries using the same PubMedBERT model and setup. The models are now also robust against increasing the re-ranking depth.
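For reference, the two reported metrics can be computed as follows (a minimal sketch; `gains` is a hypothetical mapping from passage ids to graded relevance, e.g. derived from the DCTR click model):

```python
import math

def mrr_at_10(ranked, relevant):
    # Reciprocal rank of the first relevant passage within the top 10.
    for rank, pid in enumerate(ranked[:10], start=1):
        if pid in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_10(ranked, gains):
    # gains: passage id -> graded relevance label.
    dcg = sum(gains.get(pid, 0) / math.log2(rank + 1)
              for rank, pid in enumerate(ranked[:10], start=1))
    ideal = sorted(gains.values(), reverse=True)[:10]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0
```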
Table 1 (fragment): Re-ranking effectiveness on TripClick; each query set reports two values, as in the original table.

| # | Model | BERT | Head (DCTR) | Torso (RAW) | Tail (RAW) |
|---|-------|------|-------------|-------------|------------|
|   | Our Re-Ranking (BM25 Top-200) | | | | |
| 12 | Ensemble (Lines: 9,10,11) | | .303 / .601 | .370 / .472 | .420 / .392 |
In this section, we present the results for our research questions, first for re-ranking and then for dense retrieval.
We present the original baselines, as well as our re-ranking results, for all three frequency-based TripClick query sets in Table 1. All neural models re-rank the top-200 results of BM25. While the original baselines do improve on the frequent Head queries in nDCG@10 (TK-L3 vs. BM25-L1), they hardly improve the Tail queries in nDCG@10 (CK-L2 & TK-L3 vs. BM25-L1). This is a pressing issue, as those queries make up 83% of all Trip searches [rekabsaz2021tripclick].
Turning to our results in Table 1, to answer RQ1 (How do established ranking models perform on re-ranking TripClick?), we can see that our training approach for TK (Line 4) strongly outperforms the original TK (L3), especially on the Tail queries. This is followed by ColBERT (L5 & 6) and BERT_CAT (L7 to L12), which both improve strongly over the previous model. This trend directly follows previous observations of effectiveness improvements per model architecture on MSMARCO [hofstaetter2020_crossarchitecture_kd, khattab2020colbert].
To understand whether there is a clear benefit to the BERT model choice, we study RQ2 (Which BERT-style pre-trained checkpoint performs best on TripClick?). We find that although the general domain models show good effectiveness results (L7 & 8), they are outperformed by the domain-specific pre-training approaches (L9 to L11). Here, PubMedBERT (L10 & 11) slightly outperforms SciBERT (L9) on re-ranking. An ensemble of all domain-specific models (L12) again outperforms all previous approaches and sets new state-of-the-art results for TripClick.
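One simple way to combine the scores of multiple rankers for a query is min-max normalization per model followed by averaging; the sketch below illustrates this idea and is a hypothetical example, not necessarily our exact ensembling procedure:

```python
def minmax_normalize(scores):
    # scores: passage id -> raw model score for one query.
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {pid: 0.0 for pid in scores}
    return {pid: (s - lo) / (hi - lo) for pid, s in scores.items()}

def ensemble_scores(per_model_scores):
    # per_model_scores: list of {passage id -> score}, one dict per model,
    # all over the same candidate set for one query.
    normalized = [minmax_normalize(scores) for scores in per_model_scores]
    return {pid: sum(n[pid] for n in normalized) / len(normalized)
            for pid in normalized[0]}
```

Normalizing per model first prevents a model with a larger score range from dominating the average.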
To answer RQ3 (How well does dense retrieval work on TripClick?), we present our results in Table 2. Dense retrieval with BERT_DOT (L13 to L15) outperforms BM25 (L1) considerably for initial candidate retrieval, both in terms of top-10 precision results and for all recall cutoffs, except top-1000. We also provide the judgement coverage for the top-10 results, and surprisingly, the coverage for dense retrieval increases compared to BM25. Future annotation campaigns should explore the robustness of these click-based evaluation results.
Table 2 (fragment): Retrieval (Full Collection Nearest Neighbor).
Test collection diversity is a fundamental requirement of IR research. Ideally, we as a community develop methods that work on the largest possible set of problem settings. However, neural models require large training sets, which restricted most of the foundational research to the public MSMARCO and other web search collections. Now, with TripClick, we have another large-scale collection available. In this paper we show that, in contrast to the original baselines, neural models perform very well on TripClick – both in the re-ranking task and in full collection retrieval with nearest neighbor search. We make our techniques openly available to the community to foster diverse neural information retrieval research.