
Evaluating the Robustness of Retrieval Pipelines with Query Variation Generators

by Gustavo Penha, et al.
Delft University of Technology

Heavily pre-trained transformers for language modelling, such as BERT, have been shown to be remarkably effective for Information Retrieval (IR) tasks, where they are typically applied to re-rank the results of a first-stage retrieval model. IR benchmarks evaluate the effectiveness of retrieval pipelines on the premise that a single query is used to instantiate the underlying information need. However, previous research has shown that (I) queries generated by users for a fixed information need are extremely variable and, in particular, (II) neural models are brittle and often make mistakes when tested with modified inputs. Motivated by these observations, we aim to answer the following question: how robust are retrieval pipelines with respect to different variations in queries that do not change the queries' semantics? To obtain queries that are representative of users' querying variability, we first created a taxonomy based on the manual annotation of transformations occurring in a dataset (UQV100) of user-created query variations. For each syntax-changing category of our taxonomy, we employed different automatic methods that, when applied to a query, generate a query variation. Our experimental results across two datasets for two IR tasks reveal that retrieval pipelines are not robust to these query variations, with effectiveness drops of ≈20% on average. The code and datasets are available at
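
As a rough illustration of what a syntax-changing query variation generator could look like, the following Python sketch applies two simple transformations to a query and computes the relative effectiveness drop between an original query and its variation. It is a minimal sketch under stated assumptions, not the authors' released implementation: the two transformations (an adjacent-character swap and stop-word removal), the small `STOPWORDS` set, and the function names are all hypothetical choices made for this example.

```python
import random

# Hypothetical, deliberately tiny stop-word list used only for this example.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "for", "to", "is", "are", "what"}


def swap_adjacent_chars(query, rng):
    """Misspelling-style variation: swap two neighbouring characters in one word.

    Only words with at least four characters are considered, and the first
    and last characters are left in place. If the two swapped characters are
    identical, the returned query equals the input.
    """
    words = query.split()
    candidates = [i for i, w in enumerate(words) if len(w) >= 4]
    if not candidates:
        return query
    i = rng.choice(candidates)
    w = words[i]
    pos = rng.randrange(1, len(w) - 2)  # index of the first swapped character
    words[i] = w[:pos] + w[pos + 1] + w[pos] + w[pos + 2:]
    return " ".join(words)


def remove_stopwords(query):
    """Variation that drops stop words; falls back to the input if nothing remains."""
    kept = [w for w in query.split() if w.lower() not in STOPWORDS]
    return " ".join(kept) if kept else query


def relative_drop(original_score, variation_score):
    """Relative effectiveness drop of the variation with respect to the original."""
    if original_score == 0:
        return 0.0
    return (original_score - variation_score) / original_score


if __name__ == "__main__":
    rng = random.Random(0)
    query = "what is the capital of france"
    print(swap_adjacent_chars(query, rng))
    print(remove_stopwords(query))
    # The scores here are made-up placeholders, not results from the paper.
    print(relative_drop(0.5, 0.4))
```

In an actual evaluation, the scores passed to `relative_drop` would come from running the same retrieval pipeline on the original query and on each variation and comparing a ranking metric between the two runs.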




Neural-IR-Explorer: A Content-Focused Tool to Explore Neural Re-Ranking Results

In this paper we look beyond metrics-based evaluation of Information Ret...

Grep-BiasIR: A Dataset for Investigating Gender Representation-Bias in Information Retrieval Results

The provided contents by information retrieval (IR) systems can reflect ...

Query Performance Prediction for Neural IR: Are We There Yet?

Evaluation in Information Retrieval relies on post-hoc empirical procedu...

To Phrase or Not to Phrase - Impact of User versus System Term Dependence Upon Retrieval

When submitting queries to information retrieval (IR) systems, users oft...

Are Neural Ranking Models Robust?

Recently, we have witnessed the bloom of neural ranking models in the in...

CharacterBERT and Self-Teaching for Improving the Robustness of Dense Retrievers on Queries with Typos

Current dense retrievers are not robust to out-of-domain and outlier que...

Incorporating Total Variation Regularization in the design of an intelligent Query by Humming system

A Query-By-Humming (QBH) system constitutes a particular case of music i...