BioCPT: Contrastive Pre-trained Transformers with Large-scale PubMed Search Logs for Zero-shot Biomedical Information Retrieval

07/02/2023
by Qiao Jin, et al.

Information retrieval (IR) is essential in biomedical knowledge acquisition and clinical decision support. While recent progress has shown that language model encoders perform better semantic retrieval, training such models requires abundant query-article annotations that are difficult to obtain in biomedicine. As a result, most biomedical IR systems only conduct lexical matching. In response, we introduce BioCPT, a first-of-its-kind Contrastively Pre-trained Transformer model for zero-shot biomedical IR. To train BioCPT, we collected 255 million user click logs from PubMed, an unprecedented scale. With these data, we use contrastive learning to train a closely integrated retriever and re-ranker. Experimental results show that BioCPT sets new state-of-the-art performance on five biomedical IR tasks, outperforming various baselines, including much larger models such as the GPT-3-sized cpt-text-XL. BioCPT also generates better biomedical article and sentence representations for semantic evaluations. As such, BioCPT can be readily applied to various real-world biomedical IR tasks. The BioCPT API and code are publicly available at https://github.com/ncbi/BioCPT.
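
To make the idea of contrastive learning on click logs concrete, here is a minimal sketch of one common formulation: an in-batch softmax (InfoNCE-style) loss, where each query's clicked article is the positive and the other articles in the batch act as negatives. The abstract does not state BioCPT's exact objective, so the loss form, the `temperature` parameter, and the function names below are illustrative assumptions, not the authors' implementation.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_contrastive_loss(query_vecs, article_vecs, temperature=1.0):
    """Mean in-batch softmax (InfoNCE-style) loss.

    Assumption: article_vecs[i] is the clicked (positive) article for
    query_vecs[i]; every other article in the batch is a negative.
    """
    if not query_vecs or len(query_vecs) != len(article_vecs):
        raise ValueError("need a non-empty batch of aligned query/article pairs")
    total = 0.0
    for i, q in enumerate(query_vecs):
        # Similarity of this query to every article in the batch.
        scores = [dot(q, d) / temperature for d in article_vecs]
        # Numerically stable log-sum-exp over all candidates.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        # Negative log-probability of the clicked article.
        total += log_z - scores[i]
    return total / len(query_vecs)


if __name__ == "__main__":
    # Toy 2-d embeddings (hypothetical values, for illustration only).
    queries = [[1.0, 0.0], [0.0, 1.0]]
    clicked = [[1.0, 0.0], [0.0, 1.0]]
    print(in_batch_contrastive_loss(queries, clicked))
```

The loss falls as each query embedding moves closer to its clicked article than to the other articles in the batch, which is the behavior a dense retriever trained this way relies on.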


