The tale of two MS MARCO – and their unfair comparisons

04/25/2023
by Carlos Lassance, et al.

The MS MARCO-passage dataset has been the main large-scale dataset open to the IR community, and it has successfully fostered the development of novel neural retrieval models over the years. It turns out, however, that two different corpora of MS MARCO are used in the literature: the official one, and a second one where passages were augmented with titles, mostly due to the introduction of the Tevatron code base. The addition of titles actually leaks relevance information, while breaking the original guidelines of the MS MARCO-passage dataset. In this work, we investigate the differences between the two corpora and demonstrate empirically that they make a significant difference when evaluating a new method. In other words, we show that if a paper does not properly report which version is used, fairly reproducing its results is basically impossible. Furthermore, given the current status of reviewing, where monitoring state-of-the-art results is of great importance, having two different versions of a dataset is a large problem. This paper therefore aims to report the importance of this issue so that researchers are aware of it and report their results appropriately.


