Tevatron: An Efficient and Flexible Toolkit for Dense Retrieval

by Luyu Gao, et al.

Recent rapid advancements in deep pre-trained language models and the introduction of large datasets have powered research in embedding-based dense retrieval. While several good research papers have emerged, many of them come with their own software stacks. These stacks are typically optimized for particular research goals rather than for efficiency or code structure. In this paper, we present Tevatron, a dense retrieval toolkit optimized for efficiency, flexibility, and code simplicity. Tevatron provides a standardized pipeline for dense retrieval including text processing, model training, corpus/query encoding, and search. This paper presents an overview of Tevatron and demonstrates its effectiveness and efficiency across several IR and QA datasets. We also show how Tevatron's flexible design enables easy generalization across datasets, model architectures, and accelerator platforms (GPU/TPU). We believe Tevatron can serve as an effective software foundation for dense retrieval system research including design, modeling, and optimization.
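
To make the pipeline stages named in the abstract concrete, here is a minimal sketch of text processing, corpus/query encoding, and search. It is not Tevatron's API: the function names (`tokenize`, `encode`, `build_index`, `search`), the embedding size, and the toy corpus are all assumptions made for illustration. A hashed bag-of-words vector stands in for a trained neural encoder, so the model training stage is omitted entirely.

```python
import zlib

import numpy as np

DIM = 1024  # embedding size; an arbitrary choice for this toy example


def tokenize(text):
    """Text processing: lowercase and split on whitespace."""
    return text.lower().split()


def encode(text, dim=DIM):
    """Stand-in encoder (assumption): hashed bag-of-words, L2-normalized.

    A real dense retriever would run a trained language model here.
    """
    vec = np.zeros(dim, dtype=np.float32)
    for tok in tokenize(text):
        # crc32 gives a hash that is stable across Python processes.
        vec[zlib.crc32(tok.encode("utf-8")) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def build_index(passages):
    """Corpus encoding: one embedding row per passage."""
    return np.stack([encode(p) for p in passages])


def search(index, query, k=2):
    """Search: exhaustive inner-product scoring, then take the top k."""
    scores = index @ encode(query)
    top = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in top]


# Hypothetical toy corpus, for illustration only.
corpus = [
    "dense retrieval encodes queries and passages as vectors",
    "sparse retrieval relies on inverted indexes",
    "tpus and gpus accelerate model training",
]
index = build_index(corpus)
hits = search(index, "how are passages encoded as vectors", k=2)
for passage_id, score in hits:
    print(passage_id, round(score, 3), corpus[passage_id])
```

The separation into independent encode and search steps mirrors the staged pipeline the abstract describes: the corpus is embedded once and stored, while queries are embedded at search time and compared by inner product. At realistic corpus sizes the exhaustive matrix product would typically be replaced by a dedicated nearest-neighbor index.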



Pyserini: An Easy-to-Use Python Toolkit to Support Replicable IR Research with Sparse and Dense Representations

Pyserini is an easy-to-use Python toolkit that supports replicable IR re...

Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval

Recent research demonstrates the effectiveness of using fine-tuned langu...

Query-as-context Pre-training for Dense Passage Retrieval

This paper presents a pre-training technique called query-as-context tha...

Flexible retrieval with NMSLIB and FlexNeuART

Our objective is to introduce to the NLP community an existing k-NN sear...

ConTextual Mask Auto-Encoder for Dense Passage Retrieval

Dense passage retrieval aims to retrieve the relevant passages of a quer...

OpenICL: An Open-Source Framework for In-context Learning

In recent years, In-context Learning (ICL) has gained increasing attenti...

NaturalCC: A Toolkit to Naturalize the Source Code Corpus

We present NaturalCC, an efficient and extensible toolkit to bridge the ...