BEIR-PL: Zero Shot Information Retrieval Benchmark for the Polish Language

05/31/2023
by Konrad Wojtasik, et al.

The BEIR dataset is a large, heterogeneous benchmark for Information Retrieval (IR) in zero-shot settings, and it has garnered considerable attention within the research community. However, BEIR and analogous datasets are predominantly restricted to the English language. Our objective is to establish extensive large-scale resources for IR in the Polish language, thereby advancing research in this area of NLP. In this work, inspired by the mMARCO and Mr. TyDi datasets, we translated all accessible open IR datasets into Polish and introduced BEIR-PL, a new benchmark comprising 13 datasets that facilitates further development, training and evaluation of modern Polish language models for IR tasks. We executed an evaluation and comparison of numerous IR models on the newly introduced BEIR-PL benchmark. Furthermore, we publish pre-trained open IR models for the Polish language, marking a pioneering development in this field. Additionally, the evaluation revealed that BM25 achieved significantly lower scores for Polish than for English, which can be attributed to the high inflection and intricate morphological structure of the Polish language. Finally, we trained various re-ranking models to enhance the BM25 retrieval, and we compared their performance to identify their unique characteristics. To ensure accurate model comparisons, it is necessary to scrutinise individual results rather than average across the entire benchmark; thus, we thoroughly analysed the outcomes of IR models on each individual data subset encompassed by the benchmark. The benchmark data is available at https://huggingface.co/clarin-knext.
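
To make the evaluation setup concrete, below is a minimal sketch of how a single BEIR-PL subset could be scored with BM25 retrieval followed by cross-encoder re-ranking using the open-source beir toolkit. The local data path, Elasticsearch index name, and re-ranker model identifier are illustrative assumptions rather than details fixed by the paper; the actual dataset files and released models should be taken from https://huggingface.co/clarin-knext.

```python
# Sketch: BM25 retrieval + cross-encoder re-ranking on one BEIR-PL subset,
# assuming the data has been downloaded and unpacked into the standard BEIR
# layout (corpus.jsonl, queries.jsonl, qrels/test.tsv).
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.lexical import BM25Search as BM25
from beir.reranking.models import CrossEncoder
from beir.reranking import Rerank

data_path = "datasets/scifact-pl"  # hypothetical local path to an unpacked BEIR-PL subset
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

# Stage 1: lexical retrieval with BM25 (beir's BM25Search backend requires a
# running Elasticsearch instance).
retriever = EvaluateRetrieval(BM25(index_name="scifact-pl", hostname="localhost", initialize=True))
results = retriever.retrieve(corpus, queries)

# Stage 2: re-rank the top BM25 candidates with a cross-encoder; the model
# identifier below is illustrative, not a name confirmed by the paper.
reranker = Rerank(CrossEncoder("clarin-knext/herbert-base-reranker-msmarco"), batch_size=32)
reranked = reranker.rerank(corpus, queries, results, top_k=100)

# Standard IR metrics reported in BEIR-style evaluations (nDCG@k, MAP@k, Recall@k, P@k).
ndcg, _map, recall, precision = EvaluateRetrieval.evaluate(qrels, reranked, k_values=[1, 10, 100])
print(ndcg, recall)
```

The two-stage design mirrors the pipeline described in the abstract: a lexical first stage provides candidates, and a trained re-ranker compensates for BM25's weaker term matching on a highly inflected language such as Polish.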

Related research

04/17/2021
BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models
Neural IR models have often been studied in homogeneous and narrow setti...

02/10/2022
InPars: Data Augmentation for Information Retrieval using Large Language Models
The information retrieval community has recently witnessed a revolution ...

07/02/2023
BioCPT: Contrastive Pre-trained Transformers with Large-scale PubMed Search Logs for Zero-shot Biomedical Information Retrieval
Information retrieval (IR) is essential in biomedical knowledge acquisit...

03/03/2021
Simplified Data Wrangling with ir_datasets
Managing the data for Information Retrieval (IR) experiments can be chal...

04/24/2022
Entity-Conditioned Question Generation for Robust Attention Distribution in Neural Information Retrieval
We show that supervised neural information retrieval (IR) models are pro...

05/05/2022
Toward A Fine-Grained Analysis of Distribution Shifts in MSMARCO
Recent IR approaches based on Pretrained Language Models (PLM) have now ...

11/27/2021
Pre-training Methods in Information Retrieval
The core of information retrieval (IR) is to identify relevant informati...
