InPars: Data Augmentation for Information Retrieval using Large Language Models

02/10/2022
by Luiz Bonifacio et al.

The information retrieval community has recently witnessed a revolution due to large pretrained transformer models. Another key ingredient for this revolution was the MS MARCO dataset, whose scale and diversity have enabled zero-shot transfer learning to various tasks. However, not all IR tasks and domains can benefit from one single dataset equally. Extensive research in various NLP tasks has shown that using domain-specific training data, as opposed to general-purpose data, improves the performance of neural models. In this work, we harness the few-shot capabilities of large pretrained language models as synthetic data generators for IR tasks. We show that models finetuned solely on our unsupervised dataset outperform strong baselines such as BM25 as well as recently proposed self-supervised dense retrieval methods. Furthermore, retrievers finetuned on both supervised and our synthetic data achieve better zero-shot transfer than models finetuned only on supervised data. Code, models, and data are available at https://github.com/zetaalphavector/inpars.
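The generation recipe the abstract alludes to is simple enough to sketch: a handful of (document, query) pairs serve as in-context examples, and the language model is prompted to write a relevant query for each unlabeled document in the target corpus. The Python sketch below is a minimal illustration of that loop, assuming a generic completion function llm_generate as a stand-in for any large language model API; the prompt wording is illustrative, not the paper's exact template.

from typing import Callable, List, Tuple

def build_prompt(examples: List[Tuple[str, str]], document: str) -> str:
    """Format few-shot (document, query) pairs, then append the target document."""
    parts = []
    for i, (doc, query) in enumerate(examples, start=1):
        parts.append(f"Example {i}:\nDocument: {doc}\nRelevant Query: {query}\n")
    parts.append(f"Example {len(examples) + 1}:\nDocument: {document}\nRelevant Query:")
    return "\n".join(parts)

def generate_training_pairs(
    llm_generate: Callable[[str], str],  # hypothetical stand-in for an LLM API
    examples: List[Tuple[str, str]],     # labeled pairs, e.g. drawn from MS MARCO
    corpus: List[str],                   # unlabeled target-domain documents
) -> List[Tuple[str, str]]:
    """Produce synthetic (query, document) pairs for finetuning a retriever."""
    pairs = []
    for document in corpus:
        completion = llm_generate(build_prompt(examples, document)).strip()
        if not completion:
            continue  # skip documents for which the model produced nothing
        query = completion.splitlines()[0]  # keep only the first generated line
        pairs.append((query, document))
    return pairs

A retriever or reranker can then be finetuned on the resulting synthetic pairs, either alone or combined with supervised data, which is the comparison the abstract reports.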


Related research

07/03/2020
On-The-Fly Information Retrieval Augmentation for Language Models
Here we experiment with the use of information retrieval as an augmentat...

07/10/2023
InPars Toolkit: A Unified and Reproducible Synthetic Data Generation Pipeline for Neural Information Retrieval
Recent work has explored Large Language Models (LLMs) to overcome the la...

01/04/2023
InPars-v2: Large Language Models as Efficient Dataset Generators for Information Retrieval
Recently, InPars introduced a method to efficiently use large language m...

07/02/2023
BioCPT: Contrastive Pre-trained Transformers with Large-scale PubMed Search Logs for Zero-shot Biomedical Information Retrieval
Information retrieval (IR) is essential in biomedical knowledge acquisit...

05/31/2023
BEIR-PL: Zero Shot Information Retrieval Benchmark for the Polish Language
The BEIR dataset is a large, heterogeneous benchmark for Information Ret...

04/24/2022
Entity-Conditioned Question Generation for Robust Attention Distribution in Neural Information Retrieval
We show that supervised neural information retrieval (IR) models are pro...

09/19/2021
Towards Zero-Label Language Learning
This paper explores zero-label learning in Natural Language Processing (...
