Match Your Words! A Study of Lexical Matching in Neural Information Retrieval

12/10/2021
by   Thibault Formal, et al.

Neural Information Retrieval models hold the promise of replacing lexical matching models, e.g. BM25, in modern search engines. While their capabilities have fully shone on in-domain datasets like MS MARCO, they have recently been challenged in out-of-domain zero-shot settings (BEIR benchmark), calling into question their actual generalization capabilities compared to bag-of-words approaches. In particular, we wonder whether these shortcomings could (partly) be the consequence of the inability of neural IR models to perform lexical matching off-the-shelf. In this work, we propose a measure of discrepancy between the lexical matching performed by any (neural) model and an 'ideal' one. Based on this, we study the behavior of different state-of-the-art neural IR models, focusing on whether they are able to perform lexical matching when it is actually useful, i.e. for important terms. Overall, we show that neural IR models fail to properly generalize term importance on out-of-domain collections or on terms almost unseen during training.

research
04/17/2021

BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models

Neural IR models have often been studied in homogeneous and narrow setti...
research
04/29/2020

Complementing Lexical Retrieval with Semantic Residual Embedding

Information retrieval traditionally has relied on lexical matching signa...
research
04/15/2021

COIL: Revisit Exact Lexical Match in Information Retrieval with Contextualized Inverted List

Classical information retrieval systems such as BM25 rely on exact lexic...
research
04/25/2023

The tale of two MS MARCO – and their unfair comparisons

The MS MARCO-passage dataset has been the main large-scale dataset open ...
research
12/17/2020

A White Box Analysis of ColBERT

Transformer-based models are nowadays state-of-the-art in ad-hoc Informa...
research
10/30/2019

Lexical Learning as an Online Optimal Experiment: Building Efficient Search Engines through Human-Machine Collaboration

Information retrieval (IR) systems need to constantly update their knowl...
research
05/10/2022

From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective

Neural retrievers based on dense representations combined with Approxima...
