Shallow pooling for sparse labels

08/31/2021
by Negar Arabzadeh, et al.

Recent years have seen enormous gains in core IR tasks, including document and passage ranking. Datasets and leaderboards, and in particular the MS MARCO datasets, illustrate the dramatic improvements achieved by modern neural rankers. When compared with traditional test collections, the MS MARCO datasets employ substantially more queries with substantially fewer known relevant items per query. Given the sparsity of these relevance labels, the MS MARCO leaderboards track improvements with mean reciprocal rank (MRR). In essence, a relevant item is treated as the "right answer", with rankers scored on their ability to place this item high in the ranking. In working with these sparse labels, we have observed that the top items returned by a ranker often appear superior to judged relevant items. To test this observation, we employed crowdsourced workers to make preference judgments between the top item returned by a modern neural ranking stack and a judged relevant item. The results support our observation. If we imagine a perfect ranker under MRR, with a score of 1 on all queries, our preference judgments indicate that a searcher would prefer the top result from a modern neural ranking stack more frequently than the top result from the imaginary perfect ranker, making our neural ranker "better than perfect". To understand the implications for the leaderboard, we pooled the top document from available runs near the top of the passage ranking leaderboard for over 500 queries. We employed crowdsourced workers to make preference judgments over these pools and re-evaluated the runs. Our results support our concerns that current MS MARCO datasets may no longer be able to recognize genuine improvements in rankers. In future, if rankers are measured against a single "right answer", this answer should be the best answer or most preferred answer, and maintained with ongoing judgments.
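For readers unfamiliar with the metric under discussion, the sketch below shows how MRR is computed against sparse labels. The function name and data layout are illustrative assumptions, not code from the paper. With a single judged item per query, a run that ranks that item first on every query scores exactly 1, which is the "perfect ranker" baseline the abstract refers to.

```python
# Minimal sketch (assumed names, not the paper's code) of mean
# reciprocal rank at depth k over sparse relevance labels.

def mrr_at_k(run, qrels, k=10):
    """run:   dict query_id -> ranked list of doc_ids
    qrels: dict query_id -> set of judged-relevant doc_ids"""
    total = 0.0
    for qid, ranking in run.items():
        relevant = qrels.get(qid, set())
        for rank, doc_id in enumerate(ranking[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(run)

run = {"q1": ["d3", "d7", "d1"], "q2": ["d9", "d2"]}
qrels = {"q1": {"d3"}, "q2": {"d2"}}   # one judged item per query
print(mrr_at_k(run, qrels))            # (1/1 + 1/2) / 2 = 0.75
```

The pooling step described in the abstract is similarly simple in outline: take the top-ranked item from each leaderboard run and judge the resulting small per-query pools. A minimal sketch, again with assumed names rather than the authors' code:

```python
# Depth-1 ("shallow") pooling: union of each run's top-ranked
# documents, per query, ready for preference judging.

def shallow_pool(runs, depth=1):
    """runs: dict run_name -> {query_id: ranked list of doc_ids}"""
    pools = {}
    for ranking_by_query in runs.values():
        for qid, ranking in ranking_by_query.items():
            pools.setdefault(qid, set()).update(ranking[:depth])
    return pools

runs = {
    "run_a": {"q1": ["d3", "d7"], "q2": ["d9", "d2"]},
    "run_b": {"q1": ["d5", "d3"], "q2": ["d9", "d4"]},
}
print(shallow_pool(runs))  # {'q1': {'d3', 'd5'}, 'q2': {'d9'}}
```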

Related research

12/15/2020 · Traditional IR rivals neural models on the MS MARCO Document Ranking Leaderboard
This short document describes a traditional IR system that achieved MRR@...

04/21/2022 · Human Preferences as Dueling Bandits
The dramatic improvements in core information retrieval tasks engendered...

10/29/2021 · Incentives for Item Duplication under Fair Ranking Policies
Ranking is a fundamental operation in information access systems, to fil...

02/25/2021 · Significant Improvements over the State of the Art? A Case Study of the MS MARCO Document Ranking Leaderboard
Leaderboards are a ubiquitous part of modern research in applied machine...

07/22/2020 · Assessing top-k preferences
Assessors make preference judgments faster and more consistently than gr...

04/05/2022 · How Different are Pre-trained Transformers for Text Ranking?
In recent years, large pre-trained transformers have led to substantial ...

02/20/2018 · Stability of items: an experimental test
The words of a language are randomly replaced in time by new ones, but l...
