Log In Sign Up

Pyserini: An Easy-to-Use Python Toolkit to Support Replicable IR Research with Sparse and Dense Representations

by   Jimmy Lin, et al.

Pyserini is an easy-to-use Python toolkit that supports replicable IR research by providing effective first-stage retrieval in a multi-stage ranking architecture. Our toolkit is self-contained as a standard Python package and comes with queries, relevance judgments, pre-built indexes, and evaluation scripts for many commonly used IR test collections. We aim to support, out of the box, the entire research lifecycle of efforts aimed at improving ranking with modern neural approaches. In particular, Pyserini supports sparse retrieval (e.g., BM25 scoring using bag-of-words representations), dense retrieval (e.g., nearest-neighbor search on transformer-encoded representations), as well as hybrid retrieval that integrates both approaches. This paper provides an overview of toolkit features and presents empirical results that illustrate its effectiveness on two popular ranking tasks. We also describe how our group has built a culture of replicability through shared norms and tools that enable rigorous automated testing.


page 1

page 2

page 3

page 4


Flexible retrieval with NMSLIB and FlexNeuART

Our objective is to introduce to the NLP community an existing k-NN sear...

Tevatron: An Efficient and Flexible Toolkit for Dense Retrieval

Recent rapid advancements in deep pre-trained language models and the in...

A Replication Study of Dense Passage Retriever

Text retrieval using learned dense representations has recently emerged ...

SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking

In neural Information Retrieval, ongoing research is directed towards im...

A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for Information Retrieval Techniques

Recent developments in representational learning for information retriev...

On Single and Multiple Representations in Dense Passage Retrieval

The advent of contextualised language models has brought gains in search...