Domain-Specific Pretraining for Vertical Search: Case Study on Biomedical Literature

by Yu Wang, et al.

Information overload is a prevalent challenge in many high-value domains. A prominent case in point is the explosion of the biomedical literature on COVID-19, which swelled to hundreds of thousands of papers in a matter of months. In general, biomedical literature expands by two papers every minute, totalling over a million new papers every year. Search in the biomedical realm, and in many other vertical domains, is challenging due to the scarcity of direct supervision from click logs. Self-supervised learning has emerged as a promising direction to overcome the annotation bottleneck. We propose a general approach for vertical search based on domain-specific pretraining and present a case study for the biomedical domain. Despite being substantially simpler and not using any relevance labels for training or development, our method performs comparably to or better than the best systems in the official TREC-COVID evaluation, a COVID-related biomedical search competition. Using distributed computing in modern cloud infrastructure, our system can scale to tens of millions of articles on PubMed and has been deployed as Microsoft Biomedical Search, a new search experience for biomedical literature.


1. Introduction

Figure 1. General approach for vertical search: A neural ranker is initialized by domain-specific pretraining and fine-tuned on self-supervised relevance labels generated using a domain-specific lexicon from the domain ontology to filter query-passage pairs from MS MARCO.

Keeping up with scientific developments on COVID-19 highlights the perennial problem of information overload in a high-stakes domain. At the time of writing, hundreds of thousands of research papers have been published concerning COVID-19 and the SARS-CoV-2 virus. For biomedicine more generally, the PubMed service adds 4,000 papers every day and over a million papers every year. While progress in general search has been made using sophisticated machine learning methods, such as neural retrieval models, vertical search is often limited to comparatively simple keyword search augmented by domain-specific ontologies (e.g., entity acronyms). The PubMed search engine exemplifies this experience. Direct supervision, while available for general search in the form of relevance labels from click logs, is typically scarce in specialized domains, especially for emerging areas such as COVID-related biomedical search.

Self-supervised learning has emerged as a promising direction to overcome the annotation bottleneck, based on automatically creating noisy labeled data from unlabeled text. In particular, neural language model pretraining, such as BERT (Devlin et al., 2018), has demonstrated superb performance gains for general-domain information retrieval (Yang et al., 2019; Nogueira and Cho, 2019; Xiong et al., 2021; Lin et al., 2020) and natural language processing (NLP) (Wang et al., 2019b, a). Additionally, for specialized domains, domain-specific pretraining has proven to be effective for in-domain applications (Lee et al., 2020; Beltagy et al., 2019; Gu et al., 2020; Gururangan et al., 2020; Alsentzer et al., 2019; Si et al., 2019; Huang et al., 2019).

We propose a general methodology for developing vertical search systems for specialized domains. As a case study, we focus on biomedical search. We find evidence that the methods have significant impact in the target domain and likely generalize to other vertical search domains. We demonstrate how advances described in earlier and related work (Gu et al., 2020; Xiong et al., 2020; Zhang et al., 2020) can be brought together to provide new capabilities. We also provide data supporting the feasibility of a large-scale deployment through detailed system analysis, stress-testing of the system, and acquisition of expert relevance evaluations. (The system has been released, though large-scale deployment measures other than stress-testing are not yet available, and we focus on the evidence from expert evaluation.)

In section 2, we explore the key idea of initializing a neural ranking model with domain-specific pretraining and fine-tuning the model on a self-supervised domain-specific dataset generated from general query-document pairs (e.g., from MS MARCO (Nguyen et al., 2016)). Then, we introduce the biomedical domain as a case study. In section 3, we evaluate the method on the TREC-COVID dataset (Roberts et al., 2020; Voorhees et al., 2020). We find that the method performs comparably to or better than the best systems in the official TREC-COVID evaluation, despite its generality and simplicity, and despite using zero COVID-related relevance labels for direct supervision. In section 4, we discuss how our system design leverages distributed computing and modern cloud infrastructure for scalability and ease of use. This approach can be reused for other domains. In the biomedical domain, our system can scale to tens of millions of PubMed articles and attain a high query-per-second (QPS) throughput. We have deployed the resulting system for preview as Microsoft Biomedical Search, which provides a new search experience over biomedical literature.

2. Domain-Specific Pretraining for Vertical Search

In this section, we present a general approach for vertical search based on domain-specific pretraining and self-supervised learning (Figure 1). We first review neural language models and show how domain-specific pretraining can serve as the foundation for a domain-specific document neural ranker. We then present a general method of fine-tuning the ranker by using self-supervised, domain-specific relevance labels from a broad-coverage query-document dataset using the domain ontology. Finally, we show how this approach can be applied in biomedical literature search.

2.1. Domain-Specific Pretraining

Language model pretraining can be considered a form of task-agnostic self-supervision that generates training examples by hiding words from unlabeled text and tasks the model with predicting the hidden words. In our work on vertical search, we adopt the popular Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018), which has become a standard building block for NLP applications. Instead of predicting the next token based on the preceding tokens, as in traditional generative models, BERT employs a Masked Language Model (MLM), which randomly replaces a subset of tokens by a special token [MASK], and tries to predict them from the rest of the words. The training objective is the cross-entropy loss between the original tokens and the predicted ones. BERT builds on the transformer model (Vaswani et al., 2017) with its multi-head self-attention mechanism, which has demonstrated high performance in parallel computation and modeling long-range dependencies, as compared to recurrent neural networks such as LSTM (Hochreiter and Schmidhuber, 1997). The input consists of text spans, such as sentences, separated by a special token [SEP]. To address out-of-vocabulary words, tokens are divided into subword units using Byte-Pair Encoding (BPE) (Sennrich et al., 2016) or its variants (Kudo and Richardson, 2018), which generates a fixed-size subword vocabulary to compactly represent the training text corpora. The input is first passed to a lexical encoder, which combines the token embedding, position embedding, and segment embedding by element-wise summation. The embedding layer is then passed to multiple layers of transformer modules to generate a contextual representation (Vaswani et al., 2017).
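The masking step of MLM can be illustrated with a minimal sketch. It assumes whitespace tokens rather than subword units, and the function name `mask_tokens` and the 15% default masking rate (the rate reported for the original BERT) are illustrative choices, not details taken from this work. The model would be trained to recover the entries of `labels` from the masked sequence.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens and keep the originals as prediction targets.

    Returns (masked_tokens, labels). labels[i] holds the original token where
    position i was masked, and None elsewhere (no loss is computed there).
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)   # the model must predict the hidden token
            labels.append(tok)
        else:
            masked.append(tok)    # visible context
            labels.append(None)
    return masked, labels

tokens = "the patient was treated with remdesivir for covid".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

The full BERT recipe also sometimes keeps or randomly replaces a selected token instead of masking it; that refinement is omitted here for brevity.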

Prior pretraining efforts have focused frequently on the newswire and web domains. For example, the BERT model was trained on Wikipedia and BookCorpus (Zhu et al., 2015), and subsequent efforts have focused on crawling additional web text to conduct increasingly large-scale pretraining (Liu et al., 2019; Raffel et al., 2020; Brown et al., 2020). For domain-specific applications, pretraining on in-domain text has been shown to provide additional gains, but the prevalent assumption is that out-domain text is still helpful and pretraining typically adopts a mixed-domain approach (Lee et al., 2020; Gururangan et al., 2020). Gu et al. (2020) challenge this assumption and show that, for domains with ample text, a pure domain-specific pretraining approach is advantageous and leads to substantial gains in downstream in-domain applications. We adopt this approach by generating domain-specific vocabulary and performing language model pretraining from scratch on in-domain text (Gu et al., 2020).

2.2. Self-Supervised Fine-Tuning

As a first-order approximation, the search problem can be abstracted as learning a relevance function for query $q$ and text span $t$: $f(q, t) \rightarrow \mathbb{R}$. Here, $t$ may refer to a document or an arbitrary text span such as a passage.

Traditional search methods adopt a sparse retrieval approach by essentially treating the query as a bag of words and matching each word against the candidate text, which can be done efficiently using an inverted index. Individual words are weighted (e.g., by TF-IDF) to downweight the effect of stop words or function words, as exemplified by BM25 and its variants (Robertson and Jones, 1976).
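To make the term weighting concrete, the following is a minimal sketch of BM25 scoring over a tiny in-memory corpus. It uses a common smoothed IDF variant with the conventional defaults `k1=1.2` and `b=0.75`; these values and the function name `bm25_scores` are assumptions for illustration, and a production system would score through an inverted index instead of scanning every document.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Rare terms get a high IDF; terms found in most documents get a low one.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "covid vaccine trial results".split(),
    "vaccine safety in children".split(),
    "protein folding dynamics".split(),
]
query = "covid vaccine".split()
scores = bm25_scores(query, docs)
print(scores)
```

The first document matches both query terms and scores highest, while the third shares no term with the query and scores zero, which is exactly the behavior that makes sparse retrieval brittle under paraphrase.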

Variations abound in natural language expressions, which can cause significant challenges in sparse retrieval. To address this problem, dense retrieval maps query and text each to a vector in a continuous representation space and estimates relevance by computing the similarity between the two vectors (e.g., via dot product) (Karpukhin et al., 2020; Huang et al., 2013; Xiong et al., 2021). Dense retrieval can be made highly scalable by pre-computing text vectors, and can potentially replace or combine with sparse retrieval.
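A minimal sketch of dense retrieval scoring follows. The three-dimensional vectors are made-up stand-ins for the output of a learned encoder, and the names `dot` and `dense_rank` are hypothetical. The point is only that passage vectors can be computed once offline, so that answering a query reduces to one encoding step plus dot products.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def dense_rank(query_vec, passage_vecs):
    """Rank passage ids by dot-product similarity to the query vector."""
    scored = [(dot(query_vec, vec), pid) for pid, vec in passage_vecs.items()]
    return [pid for _, pid in sorted(scored, reverse=True)]

# Pre-computed passage vectors (toy values standing in for encoder output).
passage_vecs = {
    "p1": [0.9, 0.1, 0.0],
    "p2": [0.1, 0.8, 0.1],
    "p3": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.0]  # would come from encoding the query at search time
ranking = dense_rank(query_vec, passage_vecs)
print(ranking)
```

At scale, the exhaustive scan in `dense_rank` would typically be replaced by an approximate nearest-neighbor index.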

Neither sparse retrieval nor dense retrieval attempts to model complex interdependencies between the query and text. In contrast, sophisticated neural approaches concatenate query and text as input for a BERT model to leverage cross-attention among query and text tokens (Yilmaz et al., 2019). Specifically, query and text are combined into a sequence “[CLS] q [SEP] t [SEP]” as input, where [CLS] is a special token to be used for final prediction (Devlin et al., 2018). This could produce significant performance gains but requires a large amount of labeled data for fine-tuning the BERT model. Such a cross-attention neural model will not be scalable enough for the retrieval step, as we must compute the relevance score from scratch for each candidate text with a new query. The standard practice thus adopts a two-stage approach, by using a fast L1 retrieval method to select top text candidates, and applying the neural ranker on these candidates as L2 reranking.
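The two-stage pattern can be sketched as follows. The scorers `overlap` and `toy_l2` are deliberately trivial stand-ins (assumptions for illustration) for BM25 and a cross-attention ranker; only the control flow is the point. The expensive L2 scorer is invoked on at most `k` candidates per query instead of the whole corpus.

```python
def build_cross_encoder_input(query_tokens, text_tokens):
    """Concatenate query and text into the single sequence fed to a cross-attention ranker."""
    return ["[CLS]"] + query_tokens + ["[SEP]"] + text_tokens + ["[SEP]"]

def two_stage_search(query, corpus, l1_score, l2_score, k=60):
    """Select the top-k texts with a cheap L1 scorer, then rerank only those with L2."""
    candidates = sorted(corpus, key=lambda t: l1_score(query, t), reverse=True)[:k]
    return sorted(candidates, key=lambda t: l2_score(query, t), reverse=True)

def overlap(query, text):
    """Stand-in L1 scorer: number of shared terms."""
    return len(set(query.split()) & set(text.split()))

def toy_l2(query, text):
    """Stand-in L2 scorer: shared terms normalized by text length."""
    return overlap(query, text) / len(text.split())

corpus = [
    "covid vaccine efficacy",
    "covid vaccine efficacy in older adults with comorbidities",
    "influenza season forecast",
]
results = two_stage_search("covid vaccine", corpus, overlap, toy_l2, k=2)
print(results)
print(build_cross_encoder_input(["covid", "vaccine"], ["vaccine", "efficacy"]))
```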

In our proposed approach, we use BM25 for L1 retrieval, and initialize our L2 neural ranker with a domain-specific BERT model. To fine-tune the neural ranker, we use the Microsoft Machine Reading Comprehension dataset, MS MARCO (Nguyen et al., 2016), and a domain-specific lexicon to generate noisy relevance labels at scale using self-supervision (Figure 1). MS MARCO was created by identifying pairs of anonymized queries and relevant passages from Bing’s search query logs, and crowd-sourcing potential answers from passages. The dataset contains about one million questions spanning a wide range of topics, each with corresponding relevant answer passages from Bing question answering systems. For self-supervised fine-tuning labels, we use the MS MARCO subset (MacAvaney et al., 2020) whose queries contain at least one domain-specific term from the domain ontology.

2.3. Application to Biomedical Literature Search

Biomedicine is a representative case study that illustrates the challenges of vertical search. It is a high-value domain with a vast and rapidly growing research literature, as evident in PubMed (30+ million articles; adding over a million a year). However, existing biomedical search tools are typically limited to sparse retrieval methods, as exemplified by PubMed. This search is primarily limited to keyword matching, though it is augmented with limited query expansion using domain ontologies (e.g., MeSH terms (Lipscomb, 2000)). This method is suboptimal for long queries expressing complex intent.

We use biomedicine as a running example to illustrate our approach for vertical search. We leverage PubMed articles for domain-specific pretraining and use the publicly available PubMedBERT (Gu et al., 2020) to initialize our L2 neural ranker. For self-supervised fine-tuning, we use the Unified Medical Language System (UMLS) (Bodenreider, 2004) as our domain ontology and filter MS MARCO queries using the disease or syndrome terms in UMLS, similar to MacAvaney et al. (2020, 2020) but focusing on the broad biomedical literature rather than COVID-19. This medical subset of MS MARCO contains about 78 thousand annotated queries. We used these queries and their relevant passages in MS MARCO as positive relevance labels. To generate negative labels, we ran BM25 for each query over all non-relevant passages in MS MARCO, and selected the top 100 results. This forces the neural ranker to work harder in separating truly relevant passages from ones with mere overlap in keywords. For balanced training, we down-sampled negative instances to equal the number of positive instances (i.e., 1:1 ratio). This resulted in about 640 thousand (query, passage, label) examples.
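The label-generation procedure above can be sketched as follows, under simplifying assumptions: the lexicon holds single lowercase words (real UMLS terms are often multi-word), the hard negatives are supplied precomputed (e.g., the top BM25 hits that are not labeled relevant), and down-sampling is done per query. The function and argument names are hypothetical.

```python
import random

def build_training_examples(queries, positives, negatives_by_query, lexicon, seed=0):
    """Build balanced (query, passage, label) triples from in-domain queries.

    queries:            {query_id: query text}
    positives:          {query_id: [relevant passages]}
    negatives_by_query: {query_id: [hard negative passages]}
    lexicon:            set of lowercase domain terms
    """
    rng = random.Random(seed)
    examples = []
    for qid, text in queries.items():
        # Keep only queries that mention at least one domain term.
        if not set(text.lower().split()) & lexicon:
            continue
        pos = positives.get(qid, [])
        neg = negatives_by_query.get(qid, [])
        # Down-sample negatives so that positives and negatives are 1:1.
        neg = rng.sample(neg, min(len(pos), len(neg)))
        examples += [(text, p, 1) for p in pos]
        examples += [(text, n, 0) for n in neg]
    return examples

queries = {"q1": "what causes pneumonia", "q2": "best pizza in seattle"}
positives = {
    "q1": ["Pneumonia is usually caused by infection."],
    "q2": ["A list of pizza places."],
}
negatives_by_query = {
    "q1": ["Pneumonia vaccine schedule.", "Pneumonia in cattle."],
    "q2": ["Seattle weather."],
}
examples = build_training_examples(queries, positives, negatives_by_query, {"pneumonia"})
print(examples)
```

Only the pneumonia query survives the lexicon filter, and it contributes one positive and one sampled negative.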

Based on preliminary experiments, we chose a learning rate of 2e-5 and ran fine-tuning for one epoch in all subsequent experiments. We found that the results are not sensitive to hyperparameters, as long as the learning rate is of the same order of magnitude and at least one epoch is run over all the examples. At retrieval time, the L1 ranker used BM25 to select the top 60 text candidates by default.

3. Case Study Evaluation on COVID-19 Search

The COVID-19 literature provides a realistic test ground for biomedical search. In a little over a year, the COVID-related biomedical literature has grown to include over 440 thousand papers that mention COVID-19 or the SARS-CoV-2 virus. This explosive growth sparked the creation of the COVID-19 Open Research Dataset (CORD-19) (Wang et al., 2020b) and subsequently TREC-COVID (Roberts et al., 2020; Voorhees et al., 2020), an evaluation resource for pandemic information retrieval.

In this section, we describe our evaluation of the biomedical search system on TREC-COVID, focusing on two key questions. First, how does our system perform compared to the best systems participating in TREC-COVID? We note that many of these systems are expected to have complex designs and/or require COVID-related relevance labels for training and development. Second, what is the impact of domain-specific pretraining compared to general-domain or mixed-domain pretraining?

3.1. The TREC-COVID Dataset

To create TREC-COVID, organizers from the National Institute of Standards and Technology (NIST) used versions of CORD-19 from April 10 (Round 1), May 1 (Round 2), May 19 (Round 3), June 19 (Round 4), and July 16 (Round 5). These datasets spanned an initial set of 30 topics with five new topics planned for each additional round; the final set thus consists of 50 topics and cumulative judgements from previous rounds generated by domain experts (Roberts et al., 2020). Relevance labels were created by annotators using a customized platform and released in rounds. Round 1 contains 8,691 relevance labels for 30 topics, and was provided to participating teams for training and development. Subsequent rounds were hosted to introduce additional topics and relevance labels as a rolling evaluation for increased participation. We use Round 2, the round we participated in, to evaluate our system development. It contains 12,037 relevance labels for 35 topics.

3.2. Top Systems in TREC-COVID Leaderboard

The results of TREC-COVID Round 2 are organized into three groups: Manual, which used manual interventions (e.g., manual query rewriting) in any part of the system; Feedback, which used labels from Round 1; and Automatic, which used neither manual effort nor Round 1 labels. (The categorization of Feedback and Automatic is not always explicit, so their grouping might be mixed.) Overall, 136 systems participated in the official evaluation. NDCG@10 was used as the main evaluation metric, with Precision@5 (P@5) reported as an additional metric. The best performing systems typically adopted a sophisticated neural ranking pipeline and performed extensive training and development on TREC-COVID labeled data from Round 1. Some systems also used very large pretrained language models. For example, covidex.t5 used T5 Large (Raffel et al., 2020), a general-domain transformer-based model with 770 million parameters pretrained on the Colossal Clean Crawled Corpus (C4) of web text (26 TB).
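For reference, the two metrics can be computed per topic as in the minimal sketch below. It assumes graded relevance judgments with linear gain and a log2 position discount, and treats unjudged documents as non-relevant; official scoring tools may differ in such details, so this is an illustration and not the evaluation script used here. Per-topic scores are then averaged over topics.

```python
import math

def precision_at_k(ranked_rels, k=5):
    """Fraction of the top-k results with a positive relevance grade."""
    return sum(1 for r in ranked_rels[:k] if r > 0) / k

def dcg_at_k(rels, k):
    """Discounted cumulative gain with linear gain and log2 position discount."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

def ndcg_at_k(ranked_rels, judged_rels, k=10):
    """DCG of the system ranking divided by the DCG of an ideal ranking.

    ranked_rels: relevance grades of the returned documents, in rank order
    judged_rels: grades of all judged documents for the topic (any order)
    """
    ideal = dcg_at_k(sorted(judged_rels, reverse=True), k)
    return dcg_at_k(ranked_rels, k) / ideal if ideal > 0 else 0.0

ranked = [2, 0, 1, 0, 0]   # grades of the top five results for one topic
judged = [2, 1, 0, 0, 0]   # all judgments available for that topic
print(precision_at_k(ranked, k=5))
print(ndcg_at_k(ranked, judged, k=10))
```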

The best performing non-manual system for Round 2 is CMT (CMU-Microsoft-Tsinghua) (Xiong et al., 2020), which adopted a two-stage ranking approach. For L1, CMT used standard BM25 sparse retrieval as well as dense retrieval by fusing top ranking results from the two methods. The dense retrieval method computed the dot product of query and passage embeddings based on a BERT model (Karpukhin et al., 2020). For L2, CMT used a neural ranker with cross-attention over query and candidate passage.

For training, CMT started with the same biomedical MS MARCO data (by selecting MS MARCO queries with biomedical terms) (MacAvaney et al., 2020), but then applied additional processing to generate synthetic labeled data. Briefly, it first trained a query generation (QG) system (Nogueira and Cho, 2019), initialized by GPT-2 (Radford et al., 2019), on the query-passage pairs from biomedical MS MARCO. Given this trained QG system, for each COVID-related document it generated a pseudo query, and then applied BM25 to retrieve a pair of documents with high and low ranking. Finally, it called on ContrastQG (Xiong et al., 2020) to generate a query that would best differentiate the two documents. For the neural ranker, CMT started with SciBERT (Beltagy et al., 2019) with continual pretraining on CORD-19, and fine-tuned the model using both the biomedical MS MARCO (Med MARCO) labels and synthetic labels from ContrastQG.

To leverage the TREC-COVID data from Round 1, CMT incorporated data reweighting (ReinfoSelect) based on the REINFORCE algorithm (Zhang et al., 2020). It used performance on Round 1 data as a reward signal, and learned to denoise training labels by re-weighting them using policy gradient.

3.3. Our Approach on TREC-COVID

Model                        NDCG@10      P@5
Our approach:
  PubMedBERT                 61.5 (1.1)   69.5 (1.8)
  PubMedBERT-COVID           65.6 (1.0)   73.2 (1.1)
+ dev set:
  PubMedBERT                 64.8         71.4
  PubMedBERT-COVID           67.9         73.7
Top systems in TREC-COVID:
  covidex.t5 (T5)            62.5         73.1
  mpiid5 (ELECTRA)           66.8         77.7
  CMT (SparseDenseSciBERT)   67.7         76.0

Table 1. Comparison with the top-ranked systems in official TREC-COVID evaluation (test results; Round 2). Our results were averaged from ten runs with different random seeds (standard deviation shown in parentheses). The best systems in TREC-COVID evaluation (bottom panel) all used Round 1 data for training, as well as more sophisticated learning methods and/or larger models such as T5. In contrast, our systems (top panel) are much simpler and used zero TREC-COVID relevance labels, but they already perform competitively against the best systems by using domain-specific pretraining (PubMedBERT). Our systems were trained using one epoch with a fixed learning rate. By exploring longer training and multiple learning rates and using Round 1 data for development, our systems can perform even better (middle panel).

Model              NDCG@10      P@5
BERT               55.0 (1.2)   63.4 (2.3)
RoBERTa            53.5 (1.6)   61.1 (2.3)
UniLM              55.0 (1.2)   62.0 (1.8)
SciBERT            58.9 (1.5)   67.7 (2.2)
PubMedBERT         61.5 (1.1)   69.5 (1.8)
PubMedBERT-COVID   65.6 (1.0)   73.2 (1.1)

Table 2. Comparison of domain-specific (PubMedBERT and PubMedBERT-COVID) pretraining with out-domain (BERT, RoBERTa, UniLM) or mixed-domain pretraining (SciBERT) in TREC-COVID test results (Round 2). All results were averaged from ten runs (standard deviation in parentheses). Domain-specific pretraining is essential for attaining good performance in our general approach for vertical search.

Biomedical Search Engine               Retrieval        Reranking
PubMed                                 Keyword + MeSH   -
COVID-19 Search (Azure)                BM25             -
CORD-19 Explorer (AI2)                 BM25             LightGBM
COVID-19 Research Explorer (Google)    BM25             Neural (BERT)
Covidex (U of Waterloo, NYU)           BM25             Neural (T5)
COVID-19 Search (Salesforce)           BM25             Neural (BERT)
Microsoft Biomedical Search            BM25             Neural (PubMedBERT)

Table 3. Overview of representative biomedical search systems. Coverage is considered over CORD-19 (440 thousand abstracts and full-text articles), PubMed (30 million abstracts), and PubMed Central (PMC; 3 million full-text articles). Most systems cover CORD-19 (or the earlier version with about 60 thousand articles). Only Microsoft Biomedical Search (our system) uses domain-specific pretraining (PubMedBERT), which outperforms general-domain language models, for neural reranking.

TREC-COVID offers an excellent benchmark for assessing the general applicability of our proposed approach for vertical search. We evaluated our systems on the test set (Round 2) and compared them with the best systems in the official TREC-COVID evaluation. We essentially took the biomedical search system from subsection 2.3 as is (PubMedBERT). Although COVID-related text may differ somewhat from general biomedical text, we expect that a biomedical model should offer strong performance for this subset of biomedical literature. To further assess the impact from domain-specific pretraining, we also conducted continual pretraining using CORD-19 for 100K BERT steps and evaluated it in our biomedical search system (PubMedBERT-COVID).

Table 1 shows the results. Surprisingly, without using any relevance labels, our systems (top panel) perform competitively against the best systems in the TREC-COVID evaluation. For example, PubMedBERT-COVID outperforms covidex.t5 by over three absolute points in NDCG@10, even though the latter used a much larger language model pretrained on three orders of magnitude more data (26TB vs 21GB). Our systems were trained using one epoch with a fixed learning rate (2e-5). By exploring longer training (up to five epochs) and multiple learning rates (1e-5, 2e-5, 5e-5) and using Round 1 as dev set, our best system (middle panel) performs on par in NDCG@10 with CMT, the top system in TREC-COVID, while requiring no additional sophisticated learning components such as dense retrieval, QG, ContrastQG, and ReinfoSelect.

The success of our systems can be attributed primarily to our in-domain language models (PubMedBERT, PubMedBERT-COVID). To further assess the impact of domain-specific pretraining, we also evaluated our system using out-domain and mixed-domain models. See Table 2 for the results. Out-domain language models all perform relatively poorly in this evaluation of biomedical search, and exhibit little difference in search relevance despite significant differences in the size of vocabulary, pretraining corpus, and model (e.g., RoBERTa (Liu et al., 2019) used a larger vocabulary, and both RoBERTa and UniLM (Dong et al., 2019) were pretrained on much larger text corpora). Pretraining on PubMed text helps SciBERT, but its mixed-domain approach (including computer science literature) inhibits its performance compared to domain-specific pretraining. Continual pretraining on COVID-specific literature helps substantially, with PubMedBERT-COVID outperforming PubMedBERT by over four absolute points in NDCG@10. Overall, domain-specific pretraining is essential for the performance gain, with PubMedBERT-COVID outperforming general-domain BERT models by over ten absolute points in NDCG@10.

In sum, the TREC-COVID results provide strong evidence that, by leveraging domain-specific pretraining, our approach for vertical search is general and can attain high accuracy in a new domain without significant manual effort.

Figure 2. Left: Overview of the Microsoft Biomedical Search system. Right: A reference cloud architecture for servicing the L2 neural ranker and machine reading comprehension (MRC) with automatic scaling. Queries are processed by a standard two-stage architecture, where an L1 ranker based on BM25 generates the top 60 passages for each query, followed by an L2 neural ranker to produce final reranking results, which are then passed to the MRC module to generate answers from a candidate passage if applicable.

4. PubMed-Scale Biomedical Search

The canonical tool for biomedical search is the PubMed search itself. Recently, COVID-19 has spawned a plethora of new prototype biomedical search tools. See Table 3 for a list of representative systems. PubMed covers essentially the entire biomedical literature, but its aforementioned search engine is based on relatively simplistic sparse retrieval methods, which generally perform less well, especially in the presence of long queries with complex intent. By contrast, while some new search tools feature advanced neural ranking methods, their search scope was typically limited to CORD-19, which considers only a tiny fraction of biomedical literature. In this section, we describe our effort in developing and deploying Microsoft Biomedical Search, a new biomedical search engine that combines PubMed-scale coverage and state-of-the-art neural ranking, based on our general approach for vertical search, as described in subsection 2.3 and validated in section 3. Creating the system required addressing significant challenges with system design and engineering. Employing a modern cloud infrastructure helped with the fielding of the system. The fielded system can serve as a reference architecture for vertical search in general; many components are directly reusable for other high-value domains.

4.1. System Challenges

The key challenge in the system design is to scale to tens of millions of biomedical articles, while enabling affordable and fast computation in sophisticated neural ranking methods, based on large language models with hundreds of millions of parameters.

Specifically, the CORD-19 dataset initially covered about 29,000 documents (abstracts or full-text articles) when it was first launched in March 2020. It quickly grew to about 60,000 documents when it was adopted by TREC-COVID (Round 2, May 2020), which is the version used by many COVID-search tools. Even in its latest version (as of early Feb. 2021), CORD-19 only contains about 440,000 documents (with about 150,000 full-text articles). By contrast, PubMed covers over 30 million biomedical publications, with about 20 million abstracts and over 3 million full-text articles, which is two orders of magnitude larger than CORD-19.

Given early feedback from a range of biomedical practitioners, in addition to document-level retrieval, we decided to enable passage-level retrieval to enhance granularity and precision. This further exacerbates our scalability challenge, as the retrieval candidates now include over 216 million paragraphs (passages).

Neural ranking methods can greatly improve search relevance compared to standard keyword-based and sparse retrieval methods. However, they present additional challenges, as these methods often build upon large pretrained language models, which are computationally intensive and generally require expensive graphics processing units (GPUs).

4.2. Our Solution

As described in subsection 2.3, we adopt a two-stage ranking model, with an L1 ranker based on BM25 and an L2 reranker based on PubMedBERT. As shown in Figure 2 (left), the system comprises a web front end, web back end API, cache, L1 ranking, and L2 ranking. Query requests are passed on from web front end to back end API, which coordinates L1 and L2 ranking. The system first consults the cache and returns results directly if the query is cached. Otherwise, it calls on L1 to retrieve top candidates and then calls on L2 to conduct neural reranking. Finally, it combines the results and returns them to the front end for display.
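The request flow described above can be sketched as a small orchestration function. The names `make_search_service`, `fake_l1`, and `fake_l2` are hypothetical, the rankers are stubs, and an in-process dictionary stands in for the cache service; the sketch only illustrates the cache-first ordering of the back end API.

```python
def make_search_service(l1_retrieve, l2_rerank, top_k=60):
    """Return a search function that consults a cache, then L1, then L2."""
    cache = {}

    def search(query):
        if query in cache:                        # cached query: skip both rankers
            return cache[query]
        candidates = l1_retrieve(query, top_k)    # fast sparse retrieval
        results = l2_rerank(query, candidates)    # neural reranking of the candidates
        cache[query] = results
        return results

    return search

calls = {"l1": 0, "l2": 0}

def fake_l1(query, top_k):
    """Stub L1 ranker that counts its invocations."""
    calls["l1"] += 1
    return [f"passage-{i}" for i in range(top_k)]

def fake_l2(query, candidates):
    """Stub L2 ranker that counts its invocations and reverses the order."""
    calls["l2"] += 1
    return list(reversed(candidates))

search = make_search_service(fake_l1, fake_l2)
first = search("covid vaccine efficacy")
second = search("covid vaccine efficacy")   # served from the cache
print(len(first), calls)
```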

To address the scalability challenges, we develop our system on top of modern cloud infrastructures to leverage their native capabilities of distributed computing, cache, and load balancing, which drastically simplifies our system design and engineering. We choose to use Microsoft Azure as the cloud infrastructure, but our design is general and can be easily adapted to other cloud infrastructures.

In early experiments, we found that the Web front end, back end and cache components are sufficiently fast. So, in what follows, we will focus on discussing how to address scalability challenges in L1 and L2 ranking.

For L1, we use BM25, which can be supported by standard inverted index methods. We adopt Elastic Search, an open-source distributed search engine built on Apache Lucene (Gormley and Tong, 2015). Given our PubMed-scale coverage, the index size of Elastic Search is over 160GB and is growing as new papers arrive. The index size further multiplies with the number of replications added to ensure system availability (we use two replications). As such, we need to use machines with enough memory and processing power.

For L2, although we only run on a limited number of candidate passages from L1 (we used the top 60 in our system), the neural ranking model is based on large pretrained language models, which are computationally intensive. Currently, we use the base model of PubMedBERT with 12 layers of transformer modules, containing about 110 million parameters. We thus use a distributed GPU cluster and make careful hardware and software choices to maximize overall cost-effectiveness while minimizing L2 latency.

We use query-per-second (QPS) as our key workload metric for system design. To identify major bottlenecks and fine-tune design choices, we conducted focused experiments on L1 and L2 rankers separately to assess their impact on run-time latency.

We use Locust (Holmberg and Heyman), a Python-based framework for load testing. To ensure head-to-head comparison among design choices, we adopted a fixed system setup as follows:

  • The back end API is developed with Flask (Ronacher), using Gevent (Bilenko) with 8 workers to ensure the highest performance.

  • To minimize variance due to network cost, the back end API and L1 or L2 rankers are deployed in the same data center, as well as machines used to send queries.

  • All the servers are deployed in the same virtual network.

  • We prepare a query set which contains 71 thousand anonymized queries sampled from Microsoft Academic Search.

  • We turned off the cache layer during all experiments.

With this configuration, the latency of the back end API per query is around 20 ms. We used the Locust client to simulate asynchronous requests from multiple users. Each simulated user would randomly wait for 15-60 seconds after each search request. Each experiment ran for 10 minutes.
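The relationship between the number of simulated users and the resulting request load can be estimated with simple arithmetic. The sketch below assumes that wait times are drawn uniformly from the 15-60 second range and that each user issues one request per cycle of waiting plus response time; the helper `users_needed` is hypothetical, and the resulting user counts are estimates, not figures reported in this work.

```python
import math

def users_needed(target_qps, mean_latency_s, wait_min_s=15.0, wait_max_s=60.0):
    """Estimate how many simulated users produce a target average QPS."""
    mean_wait_s = (wait_min_s + wait_max_s) / 2   # mean of a uniform wait time
    cycle_s = mean_wait_s + mean_latency_s        # one request per user per cycle
    return math.ceil(target_qps * cycle_s)

# Request loads and mean latencies as reported for the L1 tests in Table 4.
print(users_needed(13.2, 0.59))
print(users_needed(26.8, 0.88))
```

Because the wait time dominates the cycle, doubling the request load requires roughly doubling the number of simulated users.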

From preliminary experiments, we found that Elastic Search requires warm-up to reach maximum performance, so we ran the system at low load (0.5 QPS) for 10 minutes before conducting our experiments. Elastic Search might cache results to speed up repeat queries. To eliminate confounders from caching, we ensured that no query is repeated within an experiment.

For L1, based on the performance experiments, we chose the following configuration for Elastic Search:

  • Each query is processed by a main node, which distributes its query to data nodes and then merges the results.

  • There are three main nodes and ten data nodes, each using a premium machine (D8s v3) with a 1TB SSD disk (P30).

  • The index is divided into 30 shards.
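As a rough illustration of what this configuration implies per data node, the sketch below expresses the shard settings as an Elasticsearch-style index settings body and derives the shard and storage load. It assumes that "two replications" means two replica copies in addition to the primary (`number_of_replicas: 2`), which the text does not state explicitly; the helper names are hypothetical.

```python
# Index settings body in the shape accepted by Elasticsearch's create-index API.
# Assumption: "two replications" is read as two replicas per primary shard.
index_settings = {
    "settings": {
        "number_of_shards": 30,
        "number_of_replicas": 2,
    }
}


def total_shard_copies(settings):
    """Primary shards plus all replica copies."""
    s = settings["settings"]
    return s["number_of_shards"] * (1 + s["number_of_replicas"])


def shard_copies_per_data_node(settings, data_nodes):
    """Average number of shard copies each data node hosts."""
    return total_shard_copies(settings) / data_nodes


def storage_per_data_node_gb(primary_index_gb, settings, data_nodes):
    """Average on-disk index size per data node, including replicas."""
    copies = 1 + settings["settings"]["number_of_replicas"]
    return primary_index_gb * copies / data_nodes


if __name__ == "__main__":
    print(shard_copies_per_data_node(index_settings, data_nodes=10))
    print(storage_per_data_node_gb(160, index_settings, data_nodes=10))
```

With a 160GB primary index and ten data nodes, this reading gives nine shard copies and roughly 48GB of index data per node, well within the 1TB disks described above.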

Figure 3. Sample screenshot of Microsoft Biomedical Search. The system applies our general approach for vertical search based on domain-specific pretraining and self-supervision, and covers all abstracts and full-text articles in CORD-19, PubMed, and PubMed Central (PMC).

QPS  | Median (s) | 90% (s) | Mean (s) | Min (s) | Max (s)
13.2 | 0.51       | 0.75    | 0.59     | 0.23    | 7.07
26.8 | 0.60       | 1.50    | 0.88     | 0.22    | 31.0
Table 4. Latency results in two simulated load tests on L1 ranking (plus back end API). Queries per second (QPS) is the average request load in the test. The back end API takes about 20 ms for each query. Most queries can be processed within a second, even with relatively high request load.

QPS  | Median (s) | 90% (s) | Mean (s) | Min (s) | Max (s)
14.8 | 1.80       | 2.80    | 2.01     | 0.37    | 31.21
15.0 | 1.70       | 2.60    | 1.85     | 0.34    | 30.66
Table 5. Latency results in two simulated load tests on L2 ranking (plus back end API). Queries per second (QPS) is the average request load in the test. The back end API takes about 20 ms for each query. Most queries can be processed within 1-2 seconds, even with relatively high request load.

For L2, we used Kubernetes to manage a GPU cluster. See Figure 2 (right) for a reference architecture. We used V100 GPUs in initial experiments. Since they are relatively expensive, we explored using low-cost GPUs in subsequent experiments to maximize cost-effectiveness. Reranking the top 60 candidate paragraphs from L1 takes about 0.9 seconds per query on a V100 GPU. The K80 GPU costs only a fraction of a V100 but requires 3 seconds per query. We therefore used machines with four K80 GPUs each (4-K80 machines), which reduce the latency to 0.75 seconds at less than a third of the cost of a V100.
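The latency figures above are consistent with splitting the 60 candidates evenly across the four GPUs of a machine. The sketch below works through that arithmetic; it assumes per-passage cost is constant and that the four K80s score disjoint shards of the candidate list in parallel, which is an interpretation rather than something the text specifies. Function names are hypothetical.

```python
import math


def per_passage_seconds(query_latency_s, num_candidates=60):
    """Scoring time per candidate, assuming cost grows linearly with candidates."""
    return query_latency_s / num_candidates


def parallel_latency_seconds(query_latency_s, num_gpus, num_candidates=60):
    """Latency when candidates are split across GPUs and scored in parallel.

    The slowest GPU is the one holding the largest shard of candidates.
    """
    largest_shard = math.ceil(num_candidates / num_gpus)
    return largest_shard * per_passage_seconds(query_latency_s, num_candidates)


if __name__ == "__main__":
    # Single-GPU latencies from the text: V100 about 0.9 s, K80 about 3 s.
    print(round(parallel_latency_seconds(0.9, num_gpus=1), 3))
    print(round(parallel_latency_seconds(3.0, num_gpus=1), 3))
    print(round(parallel_latency_seconds(3.0, num_gpus=4), 3))
```

Under this model each K80 scores 15 passages at about 0.05 seconds each, which reproduces the reported 0.75 seconds and explains why a 4-K80 machine can undercut a single V100 on latency.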

Table 4 and Table 5 show simulated test results for L1 and L2 ranking, respectively. There were no failures in any of the tests. For L1 ranking, our configuration can already support 10-20 QPS while keeping latency for most queries to less than a second. To support higher QPS, we can simply add more main and data nodes, which scale roughly linearly. For L2 ranking, our test used 32 4-K80 machines with a total of 128 K80s. It can support about 10 QPS while keeping latency for most queries to around or under a second. To support higher QPS, we can simply add more K80 machines.
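For readers reproducing such load tests, the columns of Tables 4 and 5 can be computed from raw per-request latencies as sketched below. The nearest-rank percentile definition and the sample values are assumptions for illustration; Locust's own reporting may use a different percentile method.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_s):
    """Summary statistics matching the columns of the latency tables."""
    return {
        "median": statistics.median(latencies_s),
        "p90": percentile(latencies_s, 90),
        "mean": statistics.fmean(latencies_s),
        "min": min(latencies_s),
        "max": max(latencies_s),
    }


if __name__ == "__main__":
    # Made-up latencies (seconds) with one slow outlier, for illustration only.
    samples = [0.23, 0.40, 0.45, 0.50, 0.51, 0.52, 0.60, 0.70, 0.75, 7.07]
    for name, value in summarize(samples).items():
        print(f"{name}: {value:.3f}")
```

A single slow request pulls the mean well above the median, which is why the tables report the median and 90th percentile alongside the maximum.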

4.3. Microsoft Biomedical Search

Our biomedical search system has been deployed as Microsoft Biomedical Search, which is publicly available. See Figure 3 for a sample screenshot.

Before deployment, we conducted several user studies among the co-authors and our extended teams with a diverse set of self-constructed and sampled queries. Overall, we verified that our system performed well for long queries with complex intent, generally returning more relevant results compared to PubMed and other search tools. However, for overly general short queries (e.g., “breast cancer”), our system can be under-selective among articles that all mention the terms. To improve user experience, we augmented L1 ranking by including results from Microsoft Academic, which uses a saliency score that takes into account the temporal evolution and heterogeneity of the Microsoft Academic Graph to predict and up-weight influential papers (Wang et al., 2019c, 2020a; Sinha et al., 2015). Given a query, we retrieve the top 30 results from Microsoft Academic as ranked by its saliency score and combine them with the top 30 results from BM25. L2 reranking is then conducted over the combined set of results. The saliency score helps elevate important papers when the query is underspecified, which generally leads to a better user experience.
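The hybrid L1 step described above can be sketched as a simple union of two candidate lists followed by reranking. This is a minimal illustration under stated assumptions: the text does not say how duplicates are handled, so deduplication by document ID is assumed, and `fuse_candidates`, `rerank`, and `score_fn` are hypothetical names standing in for the real retrieval and L2 scoring services.

```python
def fuse_candidates(bm25_ranked, saliency_ranked, k=30):
    """Union of the top-k BM25 results and the top-k saliency-ranked results.

    Assumption: a document returned by both sources is kept only once.
    """
    seen = set()
    fused = []
    for doc_id in list(bm25_ranked[:k]) + list(saliency_ranked[:k]):
        if doc_id not in seen:
            seen.add(doc_id)
            fused.append(doc_id)
    return fused


def rerank(query, candidates, score_fn):
    """Order candidates by a relevance score, highest first (stand-in for L2)."""
    return sorted(candidates, key=lambda doc_id: score_fn(query, doc_id), reverse=True)


if __name__ == "__main__":
    bm25 = ["p1", "p2", "p3"]
    saliency = ["p3", "p9"]
    candidates = fuse_candidates(bm25, saliency, k=30)
    # Toy scorer for illustration only: prefers higher numeric suffixes.
    print(rerank("breast cancer", candidates, lambda q, d: int(d[1:])))
```

Because the neural reranker scores the combined set, the fusion step only needs to ensure that influential papers are present among the candidates; the final order is still decided by L2.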

In addition to standard search capabilities, our system incorporates a state-of-the-art machine reading comprehension (MRC) method (Cheng et al.) trained on Natural Questions (Kwiatkowski et al., 2019) as an optional component. Given a query and a top reranked candidate passage, the MRC component treats the pair as a question-answering problem and returns a text span in the passage as the answer if the answer confidence is above the score of abstaining from answering. The MRC component uses the same cloud architecture as the L2 neural ranker (Figure 2, right), with similar latency performance.
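The abstention rule for the MRC component can be illustrated as follows. This sketch only captures the decision described in the text (return a span when its confidence exceeds the abstain score); the `SpanPrediction` type and its field names are hypothetical and do not reflect the actual model interface.

```python
from typing import NamedTuple, Optional


class SpanPrediction(NamedTuple):
    """Hypothetical container for one MRC prediction on a (query, passage) pair."""
    text: str             # best answer span found in the passage
    answer_score: float   # model confidence in that span
    abstain_score: float  # model score for "no answer in this passage"


def select_answer(pred: SpanPrediction) -> Optional[str]:
    """Return the span only if it beats the abstain score; otherwise abstain."""
    if pred.answer_score > pred.abstain_score:
        return pred.text
    return None


if __name__ == "__main__":
    confident = SpanPrediction("an example span", answer_score=0.8, abstain_score=0.3)
    unsure = SpanPrediction("an example span", answer_score=0.2, abstain_score=0.3)
    print(select_answer(confident))
    print(select_answer(unsure))
```

Comparing against a per-example abstain score, rather than a fixed global threshold, lets the component stay silent on passages that do not actually answer the query.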

QPS | L1 (Cost)       | L2 (Cost)      | MRC (Cost)      | Total Cost
4   | 13 D8v3 ($5K)   | 32 K80 ($10K)  | 48 K80 ($14K)   | $29K
7   | 13 D8v3 ($5K)   | 64 K80 ($20K)  | 96 K80 ($28K)   | $53K
14  | 13 D8v3 ($5K)   | 128 K80 ($40K) | 192 K80 ($55K)  | $100K
28  | 26 D8v3 ($10K)  | 256 K80 ($80K) | 384 K80 ($110K) | $200K
Table 6. Reference configuration and monthly cost estimate to support expected QPS while keeping median latency under two seconds (based on pricing from June 2021).

Our system can be deployed for public release at a rather affordable cost. Table 6 shows the reference configuration and cost estimate to support various expected loads (QPS).
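The figures in Table 6 can be checked and compared per unit of load with a few lines of arithmetic. The sketch below only re-encodes the table; the tuple layout and function names are illustrative choices, and costs are in thousands of US dollars per month.

```python
# Rows of Table 6: (qps, l1_cost_k, l2_cost_k, mrc_cost_k, reported_total_k).
TABLE_6 = [
    (4, 5, 10, 14, 29),
    (7, 5, 20, 28, 53),
    (14, 5, 40, 55, 100),
    (28, 10, 80, 110, 200),
]


def total_cost_k(l1_cost_k, l2_cost_k, mrc_cost_k):
    """Monthly cost in $K as the sum of the three components."""
    return l1_cost_k + l2_cost_k + mrc_cost_k


def cost_per_qps_k(qps, total_k):
    """Monthly cost in $K per unit of supported QPS."""
    return total_k / qps


if __name__ == "__main__":
    for qps, l1, l2, mrc, reported in TABLE_6:
        total = total_cost_k(l1, l2, mrc)
        print(qps, total, total == reported, round(cost_per_qps_k(qps, total), 2))
```

The component costs sum to the reported totals in every row, and the cost per QPS stays within a narrow band of roughly $7K to $8K per month, reflecting that the GPU-backed L2 and MRC tiers dominate and scale about linearly with load.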

5. Discussion

Prior work on vertical search tends to focus on domain-specific crawling (focused crawling) and user interface (Baeza-Yates et al., 2011). We instead explore the orthogonal aspect of the underlying search algorithm. Such algorithms tend to be simplistic in past systems, due to the scarcity of domain-specific relevance labels, as exemplified by the PubMed search engine. While easier to implement and scale, such systems often render subpar search experiences, which is particularly concerning for high-value verticals such as biomedicine. For example, Soni and Roberts (2021) studied the evaluation of commercial COVID-19 search systems and found that “commercial search engines sizably underperformed those evaluated under TREC-COVID. This has implications for trust in popular health search engines and developing biomedical search engines for future health crises.”

By leveraging domain-specific pretraining and self-supervision from a broad-coverage query-passage dataset, we show that it is possible to train a sophisticated neural ranking system to attain high search relevance, without requiring any manual annotation effort. Although we focus on biomedical search as a running example in this paper, our reference system comprises general and reusable components that can be directly applied to other domains. Our approach may potentially help bridge the performance gap in conventional vertical search systems while keeping the design and engineering effort simple and affordable.

There are many exciting directions to explore. For example, we can combine our approach with other search engines that take advantage of complementary signals not used in ours. Our hybrid L1 ranker combining BM25 with Microsoft Academic Search saliency scores is an example of such fusion opportunities. A particularly exciting prospect is applying our approach to help improve the PubMed search engine, which is an essential resource for millions of biomedical practitioners across the globe.

In the long run, we can also envision applying our approach to other high-value domains such as finance, law, retail, etc. Our approach can also be applied to enterprise search scenarios, to facilitate search across proprietary document collections, which standard search engines are not optimized for. In principle, all it takes is gathering unlabeled text in the given domain to support domain-specific pretraining. If a comprehensive index is not available (unlike biomedicine, where PubMed provides one), one could leverage focused crawling in traditional vertical search to identify such in-domain documents from the web. In practice, additional challenges may arise, e.g., in self-supervised fine-tuning. Currently, we generate the training dataset by selecting MARCO queries using a domain lexicon. If such a lexicon is not readily available (unlike biomedicine, which has UMLS), additional work is required to identify the words most pertinent to the given domain (e.g., by contrasting general and domain-specific language models). We also rely on MARCO to have sufficient coverage for a given domain. We expect that high-value domains are generally well represented in MARCO already. For an obscure domain with little representation in open-domain query logs, we can fall back to using a general query-document relevance model as a start and invest additional effort for refinement.
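The lexicon-based selection of training queries mentioned above can be illustrated with a simple filter. This is a simplified sketch, not the paper's procedure: it matches single lowercase tokens against a small hypothetical lexicon, whereas a real lexicon such as UMLS also contains multi-word terms and would need more careful matching.

```python
import re


def tokenize(text):
    """Lowercase alphanumeric tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def select_domain_queries(queries, lexicon):
    """Keep queries that mention at least one single-token lexicon term."""
    terms = {term.lower() for term in lexicon}
    return [q for q in queries if any(tok in terms for tok in tokenize(q))]


if __name__ == "__main__":
    # Hypothetical lexicon and queries, for illustration only.
    lexicon = {"asthma", "diabetes", "insulin"}
    queries = [
        "what causes asthma",
        "best pizza in rome",
        "Diabetes symptoms?",
    ]
    print(select_domain_queries(queries, lexicon))
```

For a domain without a curated lexicon, the `lexicon` argument is the piece that would need to be constructed, for instance from terms that a domain-specific language model favors over a general one.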

6. Conclusion

We describe a methodology for developing vertical search capabilities and demonstrate its effectiveness in the TREC-COVID evaluation for COVID-related biomedical search. The generality and efficacy of the approach rely on domain-specific pretraining and self-supervised fine-tuning, which require no annotation effort when applied to a new domain. Using biomedicine as a running example, we present a general reference system design that can scale to tens of millions of domain-specific documents by leveraging capabilities supplied in modern cloud infrastructure. Our system has been deployed as Microsoft Biomedical Search. Future directions include further improvement of self-supervised reranking, combining the core retrieval and ranking services with complementary search methods and resources, and validation of the generality of the methodology by testing the approach in building search systems for other vertical domains.

The authors thank Grace Huynh, Miah Wander, Michael Lucas, Rajesh Rao, Mu Wei, and Sam Preston for their support in assessing ranking relevance, as well as Mihaela Vorvoreanu, Dean Carignan, Xiaodong Liu, Adam Fourney, and Susan Dumais for contributing their expertise and shaping Microsoft Biomedical Search. We thank colleagues at the Cleveland Clinic Foundation for composing and sharing a sample of COVID-19–centric queries spanning a broad range of biomedical topics.


References

  • E. Alsentzer, J. Murphy, W. Boag, W. Weng, D. Jindi, T. Naumann, and M. McDermott (2019) Publicly available clinical BERT embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, Minneapolis, Minnesota, USA, pp. 72–78.
  • R. Baeza-Yates, B. Ribeiro-Neto, et al. (2011) Modern information retrieval. Addison Wesley.
  • I. Beltagy, K. Lo, and A. Cohan (2019) SciBERT: a pretrained language model for scientific text. In Proc. 2019 EMNLP-IJCNLP, Hong Kong, China, pp. 3615–3620.
  • Bilenko. Gevent.
  • O. Bodenreider (2004) The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Research 32 (suppl_1), pp. D267–D270.
  • T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020) Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
  • H. Cheng, Y. Shen, X. Liu, P. He, W. Chen, and J. Gao. UnitedQA: a hybrid approach for open domain question answering. arXiv preprint arXiv:2101.00178.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, J. Gao, M. Zhou, and H. Hon (2019) Unified language model pre-training for natural language understanding and generation. arXiv preprint arXiv:1905.03197.
  • C. Gormley and Z. Tong (2015) Elasticsearch: the definitive guide: a distributed real-time search and analytics engine. O’Reilly Media, Inc.
  • Y. Gu, R. Tinn, H. Cheng, M. Lucas, N. Usuyama, X. Liu, T. Naumann, J. Gao, and H. Poon (2020) Domain-specific language model pretraining for biomedical natural language processing. arXiv preprint arXiv:2007.15779.
  • S. Gururangan, A. Marasović, S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, and N. A. Smith (2020) Don’t stop pretraining: adapt language models to domains and tasks. arXiv preprint arXiv:2004.10964.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
  • Holmberg and Heyman. Locust.
  • K. Huang, J. Altosaar, and R. Ranganath (2019) ClinicalBERT: modeling clinical notes and predicting hospital readmission. arXiv preprint arXiv:1904.05342.
  • P. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck (2013) Learning deep structured semantic models for web search using clickthrough data. In Proc. 22nd ACM Int. Conf. on Information & Knowledge Management (CIKM), pp. 2333–2338.
  • V. Karpukhin, B. Oğuz, S. Min, L. Wu, S. Edunov, D. Chen, and W. Yih (2020) Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906.
  • T. Kudo and J. Richardson (2018) SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. In Proc. 2018 EMNLP: System Demonstrations, Brussels, Belgium, pp. 66–71.
  • T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, K. Toutanova, L. Jones, M. Kelcey, M. Chang, A. M. Dai, J. Uszkoreit, Q. Le, and S. Petrov (2019) Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics 7, pp. 453–466.
  • J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, and J. Kang (2020) BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36 (4), pp. 1234–1240.
  • J. Lin, R. Nogueira, and A. Yates (2020) Pretrained transformers for text ranking: BERT and beyond. arXiv preprint arXiv:2010.06467.
  • C. E. Lipscomb (2000) Medical subject headings (MeSH). Bulletin of the Medical Library Association 88 (3), pp. 265.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
  • S. MacAvaney, A. Cohan, and N. Goharian (2020) SLEDGE-Z: a zero-shot baseline for COVID-19 literature search. In Proc. of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 4171–4179.
  • S. MacAvaney, A. Cohan, and N. Goharian (2020) SLEDGE: a simple yet effective baseline for coronavirus scientific knowledge search. arXiv preprint arXiv:2005.02365.
  • T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, and L. Deng (2016) MS MARCO: a human generated machine reading comprehension dataset. In CoCo@NIPS.
  • R. Nogueira and K. Cho (2019) Passage re-ranking with BERT.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI blog 1 (8), pp. 9.
  • C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21 (140), pp. 1–67.
  • K. Roberts, T. Alam, S. Bedrick, D. Demner-Fushman, K. Lo, I. Soboroff, E. Voorhees, L. L. Wang, and W. R. Hersh (2020) TREC-COVID: rationale and structure of an information retrieval shared task for COVID-19. J. Am. Med. Inform. Assoc. 27 (9), pp. 1431–1436.
  • S. E. Robertson and K. S. Jones (1976) Relevance weighting of search terms. J. Assoc. Inf. Sci. Technol. 27 (3), pp. 129–146.
  • Ronacher. Flask.
  • R. Sennrich, B. Haddow, and A. Birch (2016) Neural machine translation of rare words with subword units. In Proc. 54th Annual Meeting of the ACL (Volume 1: Long Papers), Berlin, Germany, pp. 1715–1725.
  • Y. Si, J. Wang, H. Xu, and K. Roberts (2019) Enhancing clinical concept extraction with contextual embeddings. J. Am. Med. Inform. Assoc.
  • A. Sinha, Z. Shen, Y. Song, H. Ma, D. Eide, B. Hsu, and K. Wang (2015) An overview of Microsoft Academic Service (MAS) and applications. In Proc. 24th International Conference on World Wide Web, pp. 243–246.
  • S. Soni and K. Roberts (2021) An evaluation of two commercial deep learning-based information retrieval systems for COVID-19 literature. J. Am. Med. Inform. Assoc. 28 (1), pp. 132–137.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
  • E. Voorhees, T. Alam, S. Bedrick, D. Demner-Fushman, W. R. Hersh, K. Lo, K. Roberts, I. Soboroff, and L. L. Wang (2020) TREC-COVID: constructing a pandemic information retrieval test collection. arXiv preprint arXiv:2005.04474.
  • A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman (2019a) SuperGLUE: a stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems, pp. 3266–3280.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2019b) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In ICLR.
  • K. Wang, Z. Shen, C. Huang, C. Wu, Y. Dong, and A. Kanakia (2020a) Microsoft Academic Graph: when experts are not enough. Quantitative Science Studies 1 (1), pp. 396–413.
  • K. Wang, Z. Shen, C. Huang, C. Wu, D. Eide, Y. Dong, J. Qian, A. Kanakia, A. Chen, and R. Rogahn (2019c) A review of Microsoft Academic Services for science of science studies. Frontiers in Big Data 2, pp. 45.
  • L. L. Wang, K. Lo, Y. Chandrasekhar, R. Reas, J. Yang, D. Eide, K. Funk, R. Kinney, Z. Liu, W. Merrill, et al. (2020b) CORD-19: the COVID-19 open research dataset. arXiv.
  • C. Xiong, Z. Liu, S. Sun, Z. Dai, K. Zhang, S. Yu, Z. Liu, H. Poon, J. Gao, and P. Bennett (2020) CMT in TREC-COVID round 2: mitigating the generalization gaps from web to special domain search. arXiv preprint arXiv:2011.01580.
  • L. Xiong, C. Xiong, Y. Li, K. Tang, J. Liu, P. N. Bennett, J. Ahmed, and A. Overwijk (2021) Approximate nearest neighbor negative contrastive learning for dense text retrieval. In International Conference on Learning Representations.
  • W. Yang, H. Zhang, and J. Lin (2019) Simple applications of BERT for ad hoc document retrieval. arXiv preprint arXiv:1903.10972.
  • Z. A. Yilmaz, S. Wang, W. Yang, H. Zhang, and J. Lin (2019) Applying BERT to document retrieval with Birch. In Proc. 2019 EMNLP-IJCNLP: System Demonstrations, pp. 19–24.
  • K. Zhang, C. Xiong, Z. Liu, and Z. Liu (2020) Selective weak supervision for neural information retrieval. In Proceedings of The Web Conference 2020, pp. 474–485.
  • Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler (2015) Aligning books and movies: towards story-like visual explanations by watching movies and reading books. In ICCV.