How Train-Test Leakage Affects Zero-shot Retrieval

06/29/2022
by Maik Fröbe, et al.

Neural retrieval models are often trained on (subsets of) the millions of queries of the MS MARCO / ORCAS datasets and then tested on the 250 Robust04 queries or on other TREC benchmarks that often have only 50 queries. In such setups, many of the few test queries can be very similar to queries from the huge training data; in fact, 69% of the Robust04 queries have near-duplicates in MS MARCO / ORCAS. We investigate the impact of this unintended train-test leakage by training neural retrieval models on combinations of a fixed number of MS MARCO / ORCAS queries that are highly similar to the actual test queries and an increasing number of other queries. We find that leakage can improve effectiveness and even change the ranking of systems. However, these effects diminish as the share of leaked queries among all training instances decreases and thus becomes more realistic.
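The experimental design described above (a fixed set of leaked queries combined with a growing number of other queries) can be illustrated with a minimal sketch. It is not the authors' code: the function name `build_training_sets`, the placeholder query strings, and the chosen mix sizes are all hypothetical. It only shows how the share of leaked queries shrinks as more non-leaked queries are added.

```python
import random


def build_training_sets(leaked, others, other_counts, seed=0):
    """Combine a fixed set of leaked queries with increasing numbers of other queries.

    Returns a list of (n_other, training_queries, leakage_fraction) tuples.
    All names here are illustrative assumptions, not taken from the paper.
    """
    rng = random.Random(seed)
    mixes = []
    for n_other in other_counts:
        # Sample the non-leaked part without replacement.
        sampled = rng.sample(others, n_other)
        training = list(leaked) + sampled
        rng.shuffle(training)
        # Share of training queries that are near-duplicates of test queries.
        fraction = len(leaked) / len(training)
        mixes.append((n_other, training, fraction))
    return mixes


# Placeholder data: 100 "leaked" queries and a pool of 10,000 other queries.
leaked = [f"leaked-{i}" for i in range(100)]
others = [f"other-{i}" for i in range(10_000)]

mixes = build_training_sets(leaked, others, other_counts=[0, 100, 900, 9_900])
for n_other, training, fraction in mixes:
    print(f"{n_other:>5} other queries, {len(training):>6} total, "
          f"leakage fraction {fraction:.2%}")
```

In this sketch the number of leaked queries stays constant while the training set grows, so the leakage fraction falls with each mix. That corresponds to the increasingly realistic settings in which the abstract reports the effects of leakage diminishing.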

research
04/27/2022

On Survivorship Bias in MS MARCO

Survivorship bias is the tendency to concentrate on the positive outcome...
research
01/20/2022

Transfer Learning Approaches for Building Cross-Language Dense Retrieval Models

The advent of transformer-based models such as BERT has led to the rise ...
research
02/23/2023

Data leakage in cross-modal retrieval training: A case study

The recent progress in text-based audio retrieval was largely propelled ...
research
12/20/2022

HYRR: Hybrid Infused Reranking for Passage Retrieval

We present Hybrid Infused Reranking for Passages Retrieval (HYRR), a fra...
research
09/23/2022

Promptagator: Few-shot Dense Retrieval From 8 Examples

Much recent research on information retrieval has focused on how to tran...
research
09/10/2020

Sanitizing Synthetic Training Data Generation for Question Answering over Knowledge Graphs

Synthetic data generation is important to training and evaluating neural...
research
03/24/2022

Revisiting the Effects of Leakage on Dependency Parsing

Recent work by Søgaard (2020) showed that, treebank size aside, overlap ...
