WIKIR: A Python toolkit for building a large-scale Wikipedia-based English Information Retrieval Dataset

Over the past years, deep learning methods have achieved new state-of-the-art results in ad-hoc information retrieval. However, such methods usually require large amounts of annotated data to be effective. Since most standard ad-hoc information retrieval datasets publicly available for academic research (e.g., Robust04, ClueWeb09) have at most 250 annotated queries, recent deep learning models for information retrieval perform poorly on these datasets. These models (e.g., DUET, Conv-KNRM) are instead trained and evaluated on data collected from commercial search engines, which is not publicly available for academic research; this is a problem for reproducibility and the advancement of research. In this paper, we propose WIKIR: an open-source toolkit to automatically build large-scale English information retrieval datasets based on Wikipedia. WIKIR is publicly available on GitHub. We also provide wikIR59k: a large-scale publicly available dataset that contains 59,252 queries and 2,617,003 (query, relevant documents)


