InPars Toolkit: A Unified and Reproducible Synthetic Data Generation Pipeline for Neural Information Retrieval

07/10/2023
by Hugo Abonizio, et al.

Recent work has explored Large Language Models (LLMs) to overcome the lack of training data for Information Retrieval (IR) tasks. The generalization abilities of these models enable the creation of synthetic in-domain data by providing instructions and a few examples in a prompt. InPars and Promptagator pioneered this approach, and both methods demonstrated the potential of LLMs as synthetic data generators for IR tasks, making them an attractive solution for tasks that suffer from a lack of annotated data. However, the reproducibility of these methods has been limited: InPars' training scripts require TPUs, which are not widely accessible, and the code for Promptagator was never released, while its proprietary LLM is not publicly available. To fully realize the potential of these methods and broaden their impact in the research community, the resources need to be accessible and easy for researchers and practitioners to reproduce. Our main contribution is a unified toolkit for end-to-end, reproducible synthetic data generation research, covering generation, filtering, training, and evaluation. Additionally, we provide an interface to IR libraries widely used by the community and support for GPUs. Our toolkit not only reproduces the InPars method and partially reproduces Promptagator, but also offers plug-and-play functionality for using different LLMs, exploring filtering methods, and finetuning various reranker models on the generated data. We also release all the synthetic data generated in this work for the 18 datasets in the BEIR benchmark, which took more than 2,000 GPU hours to generate, as well as the reranker models finetuned on the synthetic data. Code and data are available at https://github.com/zetaalphavector/InPars
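As a rough illustration of the first two pipeline stages described above (few-shot synthetic query generation followed by score-based filtering), the sketch below uses Hugging Face transformers directly. It is not the toolkit's actual API; the model name, prompt template, and filtering threshold are illustrative assumptions only.

```python
# Minimal sketch of InPars-style synthetic query generation plus score-based
# filtering. Model, prompt, and threshold are illustrative assumptions, not
# the toolkit's defaults.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "EleutherAI/gpt-neo-125m"  # assumption: any causal LM can stand in here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

# Few-shot prompt: document/query pairs followed by the target document.
few_shot = (
    "Document: The Manhattan Project produced the first nuclear weapons.\n"
    "Relevant query: who developed the first nuclear weapons\n\n"
)

def generate_query(document: str) -> tuple[str, float]:
    """Generate one synthetic query for a document and return the mean
    log-probability of its tokens, usable for score-based filtering."""
    prompt = few_shot + f"Document: {document}\nRelevant query:"
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    out = model.generate(
        **inputs,
        max_new_tokens=32,
        do_sample=False,
        return_dict_in_generate=True,
        output_scores=True,
        pad_token_id=tokenizer.eos_token_id,
    )
    gen_tokens = out.sequences[0, inputs["input_ids"].shape[1]:]
    query = tokenizer.decode(gen_tokens, skip_special_tokens=True).strip()
    # Mean log-probability of the generated query tokens.
    logprobs = torch.stack(
        [step.log_softmax(-1)[0, tok] for step, tok in zip(out.scores, gen_tokens)]
    )
    return query, logprobs.mean().item()

docs = ["BEIR is a heterogeneous benchmark for zero-shot retrieval evaluation."]
pairs = [(d, *generate_query(d)) for d in docs]
# Keep only the highest-scoring synthetic queries (threshold is illustrative).
filtered = [(d, q) for d, q, s in pairs if s > -2.0]
print(filtered)
```

Filtering by the mean log-probability of the generated query is the InPars-v1 style criterion; InPars-v2 instead scores query-document pairs with a reranker. The released toolkit additionally covers the subsequent reranker training and evaluation stages.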


Related research

02/10/2022
InPars: Data Augmentation for Information Retrieval using Large Language Models
The information retrieval community has recently witnessed a revolution ...

06/08/2023
RETA-LLM: A Retrieval-Augmented Large Language Model Toolkit
Although Large Language Models (LLMs) have demonstrated extraordinary ca...

09/13/2023
CONVERSER: Few-Shot Conversational Dense Retrieval with Synthetic Data Generation
Conversational search provides a natural interface for information retri...

11/25/2022
CAD2Render: A Modular Toolkit for GPU-accelerated Photorealistic Synthetic Data Generation for the Manufacturing Industry
The use of computer vision for product and assembly quality control is b...

01/04/2023
InPars-v2: Large Language Models as Efficient Dataset Generators for Information Retrieval
Recently, InPars introduced a method to efficiently use large language m...

06/08/2021
Automatic Generation of Machine Learning Synthetic Data Using ROS
Data labeling is a time intensive process. As such, many data scientists...

05/24/2023
Ranger: A Toolkit for Effect-Size Based Multi-Task Evaluation
In this paper, we introduce Ranger - a toolkit to facilitate the easy us...
