Tevatron: An Efficient and Flexible Toolkit for Dense Retrieval

03/11/2022
by   Luyu Gao, et al.

Recent rapid advancements in deep pre-trained language models and the introduction of large datasets have powered research in embedding-based dense retrieval. While several good research papers have emerged, many of them come with their own software stacks. These stacks are typically optimized for particular research goals rather than for efficiency or code structure. In this paper, we present Tevatron, a dense retrieval toolkit optimized for efficiency, flexibility, and code simplicity. Tevatron provides a standardized pipeline for dense retrieval, including text processing, model training, corpus/query encoding, and search. This paper presents an overview of Tevatron and demonstrates its effectiveness and efficiency across several IR and QA datasets. We also show how Tevatron's flexible design enables easy generalization across datasets, model architectures, and accelerator platforms (GPU/TPU). We believe Tevatron can serve as an effective software foundation for dense retrieval system research, including design, modeling, and optimization.
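To make the encoding and search stages of such a pipeline concrete, here is a minimal sketch in plain Python. It is not Tevatron's API: `toy_encode` and `search` are hypothetical names, the hashed bag-of-words encoder is only a stand-in for a trained neural encoder, and the three-document corpus is made up for illustration. It shows only the general shape of dense retrieval: encode the corpus, encode the query, then rank by inner product.

```python
import hashlib
import math

DIM = 64  # embedding size for this toy example (arbitrary, assumed)


def toy_encode(text: str, dim: int = DIM) -> list:
    """Stand-in for a neural encoder: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * dim
    for token in text.lower().split():
        # md5 gives a stable hash, so results are reproducible across runs.
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


def search(query_vec, corpus_vecs, k=2):
    """Exhaustive inner-product search; returns the top-k (doc_index, score) pairs."""
    scores = [
        (i, sum(q * d for q, d in zip(query_vec, doc_vec)))
        for i, doc_vec in enumerate(corpus_vecs)
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Hypothetical corpus, used only to exercise the sketch.
corpus = [
    "dense retrieval encodes text into vectors",
    "lucene builds inverted indexes for sparse retrieval",
    "tpu and gpu accelerators speed up model training",
]

corpus_vecs = [toy_encode(doc) for doc in corpus]   # corpus encoding stage
query_vec = toy_encode("dense retrieval vectors")   # query encoding stage
top_hits = search(query_vec, corpus_vecs, k=2)      # search stage

for doc_index, score in top_hits:
    print(f"{score:.3f}  {corpus[doc_index]}")
```

In a real system the encoder would be a fine-tuned language model and the exhaustive loop would typically be replaced by a vector index, but the separation into encoding and search stages stays the same.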


research
02/19/2021

Pyserini: An Easy-to-Use Python Toolkit to Support Replicable IR Research with Sparse and Dense Representations

Pyserini is an easy-to-use Python toolkit that supports replicable IR re...
research
04/24/2023

Anserini Gets Dense Retrieval: Integration of Lucene's HNSW Indexes

Anserini is a Lucene-based toolkit for reproducible information retrieva...
research
08/12/2021

Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval

Recent research demonstrates the effectiveness of using fine-tuned langu...
research
12/13/2021

Improving Query Representations for Dense Retrieval with Pseudo Relevance Feedback: A Reproducibility Study

Pseudo-Relevance Feedback (PRF) utilises the relevance signals from the ...
research
10/28/2020

Flexible retrieval with NMSLIB and FlexNeuART

Our objective is to introduce to the NLP community an existing k-NN sear...
research
04/30/2020

Progressively Pretrained Dense Corpus Index for Open-Domain Question Answering

To extract answers from a large corpus, open-domain question answering (...
research
05/19/2022

PLAID: An Efficient Engine for Late Interaction Retrieval

Pre-trained language models are increasingly important components across...
