Simplified Data Wrangling with ir_datasets

03/03/2021
by Sean MacAvaney, et al.

Managing the data for Information Retrieval (IR) experiments can be challenging. Dataset documentation is scattered across the Internet, and once a copy of the data is obtained, there are numerous data formats to work with. Even basic formats can have subtle dataset-specific nuances that must be considered for proper use. To help mitigate these challenges, we introduce a new robust and lightweight tool (ir_datasets) for acquiring, managing, and performing typical operations over datasets used in IR. We primarily focus on textual datasets used for ad-hoc search. The tool provides both a Python and a command-line interface to numerous IR datasets and benchmarks. To our knowledge, this is the most extensive tool of its kind. Integrations with popular IR indexing and experimentation toolkits demonstrate the tool's utility. We also provide documentation of these datasets through the ir_datasets catalog: https://ir-datasets.com/. The catalog acts as a hub for information on datasets used in IR, providing core information about what data each benchmark provides as well as links to more detailed information. We welcome community contributions and intend to continue to maintain and grow this tool.
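As a rough illustration of the Python interface mentioned in the abstract, the sketch below iterates over the documents, queries, and relevance judgments of a dataset through the library's `docs_iter()`, `queries_iter()`, and `qrels_iter()` methods. The helper names `summarize` and `main` are hypothetical, the dataset ID `"vaswani"` is only an example choice, and the record fields used (`doc_id`, `query_id`, `text`, `relevance`) are assumed to be present; field names vary between datasets, so the catalog entry for a given dataset is the place to check.

```python
from collections import Counter
from itertools import islice


def summarize(dataset, max_docs=3):
    """Collect a small summary from an ir_datasets-style dataset object.

    Hypothetical helper: it relies only on docs_iter(), queries_iter() and
    qrels_iter(), each of which yields namedtuple-like records.
    """
    # Peek at the first few documents without reading the whole corpus.
    sample_doc_ids = [doc.doc_id for doc in islice(dataset.docs_iter(), max_docs)]
    # Map each query ID to its text (assumes the query record has a `text` field).
    queries = {query.query_id: query.text for query in dataset.queries_iter()}
    # Count, per query, the judged documents with a positive relevance label.
    relevant = Counter(
        qrel.query_id for qrel in dataset.qrels_iter() if qrel.relevance > 0
    )
    return {
        "sample_doc_ids": sample_doc_ids,
        "queries": queries,
        "relevant_per_query": dict(relevant),
    }


def main(dataset_id="vaswani"):
    """Hypothetical entry point: load a dataset by ID and print its summary."""
    # Imported lazily so summarize() can be tried without the package installed.
    import ir_datasets  # pip install ir_datasets

    dataset = ir_datasets.load(dataset_id)  # data is fetched on first use
    print(summarize(dataset))
```

Calling `main()` requires the `ir_datasets` package and, on first use, network access to fetch the data. Keeping `summarize` independent of the loading step means the same function can be applied to any object that exposes the three iterator methods.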


