InPars-v2: Large Language Models as Efficient Dataset Generators for Information Retrieval

01/04/2023
by   Vitor Jeronymo, et al.

Recently, InPars introduced a method to efficiently use large language models (LLMs) in information retrieval tasks: via few-shot examples, an LLM is induced to generate relevant queries for documents. These synthetic query-document pairs can then be used to train a retriever. However, InPars and, more recently, Promptagator, rely on proprietary LLMs such as GPT-3 and FLAN to generate such datasets. In this work we introduce InPars-v2, a dataset generator that uses open-source LLMs and existing powerful rerankers to select synthetic query-document pairs for training. A simple BM25 retrieval pipeline followed by a monoT5 reranker finetuned on InPars-v2 data achieves new state-of-the-art results on the BEIR benchmark. To allow researchers to further improve our method, we open source the code, synthetic data, and finetuned models: https://github.com/zetaalphavector/inPars/tree/master/tpu
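The selection step described above, in which a reranker decides which synthetic query-document pairs are kept for training, can be illustrated with a minimal sketch. This is not the authors' implementation: `overlap_score` is a hypothetical toy scorer standing in for a neural reranker such as monoT5, and `select_top_pairs`, the example pairs, and the cutoff `k` are assumptions made purely for illustration.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (synthetic query, document)


def overlap_score(query: str, document: str) -> float:
    """Hypothetical stand-in for a reranker: fraction of query terms found in the document."""
    query_terms = set(query.lower().split())
    doc_terms = set(document.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & doc_terms) / len(query_terms)


def select_top_pairs(
    pairs: List[Pair],
    score_fn: Callable[[str, str], float],
    k: int,
) -> List[Pair]:
    """Keep the k pairs that the scorer rates as most relevant."""
    ranked = sorted(pairs, key=lambda p: score_fn(p[0], p[1]), reverse=True)
    return ranked[:k]


# Illustrative synthetic pairs; in practice an LLM would generate the queries.
synthetic_pairs = [
    ("what is bm25 ranking", "bm25 is a ranking function used by search engines"),
    ("how tall is mount everest", "bm25 is a ranking function used by search engines"),
    ("what does a reranker do", "a reranker reorders candidates that a first stage retrieved"),
]

# The off-topic second query scores lowest and is filtered out.
training_pairs = select_top_pairs(synthetic_pairs, overlap_score, k=2)
```

In a real pipeline the scorer would presumably be a trained reranker and `k` would be chosen relative to the number of generated queries; the surviving pairs would then serve as training data for the final reranker.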


