CMT in TREC-COVID Round 2: Mitigating the Generalization Gaps from Web to Special Domain Search

11/03/2020 ∙ by Chenyan Xiong, et al. ∙ Microsoft, Tsinghua University, Carnegie Mellon University

Neural rankers based on deep pretrained language models (LMs) have been shown to improve many information retrieval benchmarks. However, these methods are affected by the correlation between the pretraining domain and the target domain, and they rely on massive fine-tuning relevance labels. Directly applying them to a specific domain, such as the COVID domain, may result in suboptimal search quality because of domain adaptation problems. This paper presents a search system that alleviates this special-domain adaptation problem. The system utilizes domain-adaptive pretraining and few-shot learning technologies to help neural rankers mitigate the domain discrepancy and label scarcity problems. Besides, we also integrate dense retrieval to alleviate traditional sparse retrieval's vocabulary mismatch obstacle. Our system performs the best among the non-manual runs in Round 2 of the TREC-COVID task, which aims to retrieve useful information from scientific literature related to COVID-19. Our code is publicly available.







1. Introduction

Recent years have witnessed continuous successes of neural ranking models in information retrieval (Pang et al., 2017; Dai et al., 2018; MacAvaney et al., 2019; Xiong et al., 2020). Most notably, deep pretrained language models (LMs) achieve state-of-the-art performance on several web search benchmarks (Yang et al., 2019; Nogueira and Cho, 2019; Craswell et al., 2020). Their success relies on the semantic information learned from general-domain corpora during language model pretraining (Craswell et al., 2020; Zhang et al., 2019).

However, ranking models in specific domains usually face a domain adaptation problem, which stems from two generalization gaps between the general and the specific domain. The first gap derives from the discrepancy of vocabulary distributions across domains. Taking the COVID domain as an example (Wang et al., 2020; Voorhees et al., 2020), the earliest related publications appeared at the end of 2019. Even pretrained LMs targeting the biomedical domain (Beltagy et al., 2019; Lee et al., 2020) are unfamiliar with new medical terms like COVID-19 because their pretraining corpora did not contain such new terminologies. The other gap is label scarcity: in specific search scenarios, such as the biomedical and scientific domains, large-scale relevance labels are a luxury.

In addition, most information retrieval (IR) systems use sparse ranking methods, such as BM25, in the first-stage retrieval; these rely on term-matching signals to calculate the relevance between query and document. Nevertheless, such systems may fail when queries and documents use different terms to describe the same meaning, which is known as the vocabulary mismatch problem (Furnas et al., 1987; Croft et al., 2010). The vocabulary mismatch problem of sparse retrieval has become an obstacle for existing IR systems, especially in specific domains that contain many in-domain terminologies.

This paper presents a solution that alleviates the specific-domain adaptation problem with three core techniques. The first conducts domain-adaptive pretraining (DAPT) (Gururangan et al., 2020) to help pretrained language models learn the semantics of special-domain terminologies and keep their language knowledge up to date. The second uses Contrast Query Generation (ContrastQG) and ReInfoSelect (Zhang et al., 2020) to mitigate the label scarcity problem in the specific domain: ContrastQG and ReInfoSelect generate and filter pseudo relevance labels, respectively, to further improve ranking performance. Finally, our system integrates dense retrieval to alleviate sparse retrieval's vocabulary mismatch bottleneck. Dense retrieval encodes queries and documents into dense vectors and measures the relevance between query and document in the latent semantic space (Karpukhin et al., 2020; Gao et al., 2020; Luan et al., 2020; Chang et al., 2020; Xiong et al., 2020).

Using the above technologies, our system achieves the best performance among non-manual groups in Round 2 of TREC-COVID (Voorhees et al., 2020), a COVID-domain TREC task that evaluates information retrieval systems for searching COVID-19-related literature.

The next section analyzes the generalization gaps and the vocabulary mismatch faced by COVID-domain search. Sec. 3 and Sec. 4 describe in detail how our system alleviates these problems. Sec. 5 shows the evaluation results and a hyperparameter study. In Sec. 6 and Sec. 7, we discuss our failed attempts and our concerns about the residual collection evaluation (Salton and Buckley, 1990) used in TREC-COVID.

2. Data Study

This section studies the generalization gaps from web to COVID domain, and the vocabulary mismatch problem of sparse retrieval.

Domain Discrepancy. Most existing pretrained language models split uncommon words into subwords to alleviate the out-of-vocabulary problem (Sennrich et al., 2015). As shown in Figure 1, the subword ratio of TREC-COVID queries is dramatically higher than that of the web-domain dataset MS MARCO (Bajaj et al., 2016). This shows that existing pretrained language models treat most COVID-domain terminologies as unfamiliar words, indicating a considerable discrepancy between existing pretraining corpora and the COVID domain.
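The subword-ratio measurement can be illustrated with a toy greedy longest-match tokenizer over a small, hypothetical WordPiece-style vocabulary (a real measurement would use an actual pretrained LM's vocabulary; the vocabulary below is made up for illustration):

```python
# Toy WordPiece-style greedy longest-match decomposition. The vocabulary
# is a hypothetical stand-in, not SciBERT's actual vocabulary.
VOCAB = {"the", "corona", "##virus", "co", "##vid", "##19",
         "vaccine", "immune", "response"}

def wordpiece(word, vocab=VOCAB):
    """Decompose a word greedily, longest match first, like BERT's WordPiece."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub          # continuation pieces carry "##"
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:                 # no vocabulary piece matches
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

def subword_ratio(query):
    """Fraction of query words that decompose into more than one subword."""
    words = query.lower().split()
    return sum(len(wordpiece(w)) > 1 for w in words) / len(words)
```

A query whose terms all appear intact in the vocabulary gets a ratio of 0, while a query containing new terminologies like "covid19" gets a high ratio.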

Label Scarcity. The label scarcity in COVID-domain search is very prominent. Only 30 queries were judged in the second round of TREC-COVID. In contrast, medical MS MARCO, the medical subset of MS MARCO filtered by previous work (MacAvaney et al., 2020), contains more than 78,800 annotated queries.

Vocabulary Mismatch. We observed that BM25 covers only 35% of the relevant documents within its top 100 retrieved documents. This result reveals that retrieving documents based solely on term-matching signals hinders the search system's effectiveness.
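The 35% figure is a coverage (recall-style) statistic. A generic sketch of how such a number can be computed (not the official trec_eval tooling):

```python
def relevant_coverage(retrieved, relevant, k=100):
    """Fraction of judged-relevant documents appearing in the top-k
    retrieved list. The 35% observation above is this quantity at k=100,
    averaged over queries."""
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(relevant)
```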

Figure 1. The proportion of query words that are decomposed into subwords by the pretrained language model’s vocabulary.

3. System Description

Our system employs a two-stage retrieval architecture, which utilizes BM25 for base retrieval and SciBERT (Beltagy et al., 2019) for reranking. Domain-adaptive pretraining and two few-shot learning techniques are used to mitigate the generalization gaps faced by SciBERT in the COVID domain. Dense retrieval is also incorporated into our system to alleviate BM25’s vocabulary mismatch problem.
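The two-stage architecture can be sketched as follows; the scoring callables stand in for BM25 and the SciBERT reranker, which are far more involved in the real system:

```python
def two_stage_search(query, docs, first_stage_score, rerank_score, depth=100):
    """Retrieve-then-rerank: score everything cheaply, keep the top
    `depth` candidates, then reorder only those with the expensive model."""
    # Stage 1: cheap sparse scoring over the whole collection.
    candidates = sorted(docs, key=lambda d: first_stage_score(query, d),
                        reverse=True)[:depth]
    # Stage 2: expensive neural reranking over the candidate pool only.
    return sorted(candidates, key=lambda d: rerank_score(query, d),
                  reverse=True)
```

Only `depth` documents ever reach the neural model, which is why the reranking depth becomes an important hyperparameter (studied in Sec. 5.2).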

3.1. Domain-Adaptive Pretraining

SciBERT is used in our system since it is pretrained on scientific texts and biomedical publications. However, COVID is a new concept that did not appear in previous pretraining corpora. Therefore, we conduct domain-adaptive pretraining (DAPT) (Gururangan et al., 2020) for SciBERT. Our approach is straightforward: we continue training SciBERT on the CORD-19 corpus (Wang et al., 2020), a growing collection of scientific papers about COVID-19 and coronaviruses.
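DAPT continues SciBERT's masked-language-model objective on CORD-19. A minimal sketch of the MLM data preparation (masking tokens and keeping the originals as labels) might look like this; the 15% mask rate is the common BERT setting, and the token list below is illustrative:

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Replace ~mask_prob of tokens with [MASK]; the model is trained to
    recover the originals. A simplification of BERT's masking (which also
    sometimes substitutes random tokens or keeps the original)."""
    rng = rng or random.Random(0)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)        # position is scored against the original
        else:
            inputs.append(tok)
            labels.append(None)       # position not scored
    return inputs, labels
```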

3.2. Few-Shot Learning

We introduce two few/zero-shot learning methods, ContrastQG and ReInfoSelect (Zhang et al., 2020), to alleviate the label scarcity challenge when fine-tuning the neural ranking model. Specifically, we first use ContrastQG to generate weakly supervised data in a zero-shot manner and then use the weak-supervision data selection method ReInfoSelect to recognize high-quality training data.

ContrastQG is a zero-shot data synthesis method that generates queries to synthesize weakly supervised relevance signals. Unlike prior work (Ma et al., 2020), ContrastQG synthesizes a query from a relevant text pair rather than from a single related text; this captures the specificity between two documents and produces more meaningful queries instead of keyword-style ones.

The entire synthesis process uses two query generators, both implemented with standard GPT-2 (Radford et al., 2019): a standard generator, which produces a pseudo query from a single document, and a contrastive generator, which produces a pseudo query from a pair of documents. The standard generator is trained on medical MS MARCO's positive passage-query pairs following the previous method (Ma et al., 2020). The contrastive generator is trained on medical MS MARCO's triples by encoding the concatenated text of a positive and a negative passage to generate the corresponding query.

At inference time, we first use the standard generator to produce a seed query from a single COVID-domain document. We then use BM25 to retrieve, for that seed query, two related documents with different degrees of correlation to it (one more relevant, one less). Finally, the contrastive generator produces another query from the two contrastive documents. The resulting synthetic triple (query, positive document, negative document) is used as weakly supervised data to train the neural ranker.
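The three-step inference pipeline above can be sketched as a small function; the generator and retriever callables stand in for the GPT-2 models and BM25 used in the actual system:

```python
def contrast_qg(doc, generate, contrast_generate, retrieve):
    """Sketch of ContrastQG inference.
    generate:          single-document query generator (stand-in for GPT-2)
    contrast_generate: pair-of-documents query generator (stand-in for GPT-2)
    retrieve:          returns a (more relevant, less relevant) pair (BM25)"""
    seed_query = generate(doc)               # step 1: seed query from one doc
    d_pos, d_neg = retrieve(seed_query)      # step 2: contrastive doc pair
    query = contrast_generate(d_pos, d_neg)  # step 3: query from the pair
    return (query, d_pos, d_neg)             # weakly supervised triple
```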

ReInfoSelect (Zhang et al., 2020) uses reinforcement learning to select weak supervision data. It evaluates the neural ranker's performance on the target data and regards the NDCG difference as the reward. The reward signal from the target data is then propagated via the policy gradient to guide the data selector.
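A toy REINFORCE-style selector in the spirit of ReInfoSelect is sketched below: a one-parameter Bernoulli policy decides whether to keep each weak label, the reward is a (stand-in) NDCG change on target data, and the parameter is updated with the policy gradient. All components here are simplified placeholders, not the paper's actual architecture:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def select_and_update(theta, batch, ndcg_gain, lr=0.5, rng=None):
    """One REINFORCE step for a Bernoulli keep/drop policy over a batch of
    weak labels. ndcg_gain is a callable standing in for the NDCG change
    measured on target data after training on the selected items."""
    rng = rng or random.Random(0)
    p = sigmoid(theta)
    actions = [rng.random() < p for _ in batch]      # keep/drop each item
    reward = ndcg_gain([b for b, a in zip(batch, actions) if a])
    # REINFORCE: d/dtheta log pi(a) = a - p for a Bernoulli policy.
    grad = sum((1.0 if a else 0.0) - p for a in actions) * reward
    return theta + lr * grad, actions
```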

In our system, we use ContrastQG and medical MS MARCO to construct the weakly supervised data, and the annotated data of TREC-COVID Round 1 as the target data. The trial-and-error learning mechanism of ReInfoSelect selects proper weakly supervised data according to the neural ranker's performance in the target domain, which helps to further mitigate the domain discrepancy.

Run ID               Method                                       R1 (dev)           R2 (test)
                                                                  NDCG@10   P@5      NDCG@10   P@5
r2.fusion2           BM25 Fusion                                  0.6056    0.7200   0.5553    0.6800
covidex.t5           T5 Fusion                                    0.5124    0.6333   0.6250    0.7314
GUIR S2 run1         SciBERT Fusion                               0.6032    0.6867   0.6251    0.7486
SparseDenseSciBert*  SciBERT + DAPT + DenseRetrieval              0.7424    0.8933   0.6772    0.7600
ReInfoSelect*        SciBERT + DAPT + ContrastQG + ReInfoSelect   0.7134    0.8333   0.6259    0.6971
n.a.                 SciBERT + DAPT + ReInfoSelect                0.7061    0.8000   0.6210    0.6914
ContrastNLGSciBert*  SciBERT + DAPT + ContrastQG                  0.6830    0.8467   0.6138    0.7314
n.a.                 SciBERT + DAPT                               0.6775    0.7400   0.5880    0.6800
n.a.                 SciBERT                                      0.6598    0.7733   0.5828    0.6629
Table 1. Overall accuracy in Round 2 of TREC-COVID. The testing results of the baselines and of our three submitted runs (marked with an asterisk) are from official evaluations. The compared baselines are BM25 Fusion (base retrieval), T5 Fusion, and SciBERT Fusion.

3.3. Dense Retrieval

Dense retrieval maps queries and documents into the same distributed representation space and retrieves related documents based on the similarities between document vectors and query vectors (Karpukhin et al., 2020; Xiong et al., 2020).

Let each training instance contain a query q, a relevant (positive) document d+, and n irrelevant (negative) documents d-_1, ..., d-_n. Dense retrieval first encodes the query and all documents into dense vectors v_q and v_d. The similarity of q and d is then calculated as sim(q, d) = v_q · v_d. The training objective is to learn a distributed representation space in which the positive document has a higher similarity to the query than all negative documents, via the negative log likelihood:

    L(q, d+, d-_1, ..., d-_n) = -log [ exp(sim(q, d+)) / ( exp(sim(q, d+)) + Σ_{i=1..n} exp(sim(q, d-_i)) ) ],

where the similarity is the dot product between vectors.

4. Implementation Details

In this section, we describe the system’s implementation details.

Dataset. The testing data of TREC-COVID Round 2 contains the May 1, 2020 version of the CORD-19 document set (Wang et al., 2020) (59,851 COVID-related papers) and 35 queries written by biomedical professionals. Among these queries, the first 30 were judged in Round 1. In the experiments, we use TREC-COVID Round 1's annotated data as the development set (30 queries) and medical MS MARCO (MacAvaney et al., 2020) as the training data (78,895 queries).

System Setup. For data preprocessing, we concatenated the title and abstract to represent each document and removed stop words from all queries. Our system used the BM25 implementation from Anserini (Yang et al., 2017) for base retrieval and adopted the dense retrieval implementation provided by Gao et al. (2020). The SciBERT-based neural ranker (Beltagy et al., 2019) was used in both the dense retrieval and reranking stages (MacAvaney et al., 2020), with a learning rate of 2e-5 and a batch size of 32. We set the warm-up proportion to 0.1 and limited the maximum sequence length to 256. The NDCG@10 score on the development set, calculated every three training steps, is used to measure convergence. Our system is based on PyTorch, and its training can be run on a single GeForce RTX 2080 Ti.
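The warm-up proportion of 0.1 implies a schedule in which the learning rate ramps up linearly over the first 10% of steps; the linear decay afterwards is the common BERT fine-tuning choice and is an assumption here, not something specified above:

```python
def warmup_linear_lr(step, total_steps, peak_lr=2e-5, warmup_prop=0.1):
    """Linear warm-up to peak_lr over the first warmup_prop of steps,
    then linear decay to zero (the usual BERT fine-tuning schedule)."""
    warmup_steps = int(total_steps * warmup_prop)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```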

Figure 2. Round 2 testing results on each query for the baselines and our system’s best version (SciBERT + DAPT + Dense Retrieval). The X-axis denotes the query ID in Round 2 of TREC-COVID, and the Y-axis represents the NDCG@10 score. Note that queries 1-30 were annotated in Round 1, and queries 31-35 are newly added in Round 2.
Figure 3. The NDCG@10 gain of the top 10 feedback systems relative to BM25 Fusion system. ‘All’ represents the average gain of all queries in TREC-COVID Round 2, ‘Old’ and ‘New’ mean the annotated queries in Round 1 and the newly added queries in Round 2, respectively.

5. Evaluation Results

This section presents evaluation results and hyperparameter studies.

Rerank Depth NDCG@10 P@5 Top 10 Hole Rate
20 0.6545 0.7429 0.03
50 0.6853 0.7714 0.07
100 0.6838 0.7543 0.12
500 0.6044 0.6971 0.23
1000 0.5826 0.6686 0.26
Table 2. Dev results of SciBERT with different reranking depths in Round 2 of TREC-COVID. The top 10 hole rate denotes the unlabeled proportion of the top 10 reranked results.

5.1. Overall Results

Table 1 shows the overall performance of different models in the TREC-COVID task. Three top systems during Round 2 evaluation and several variants of our systems are compared.

Our system achieved the best performance in Round 2 of TREC-COVID. The detailed experimental results show that our method significantly improves the ranking performance of SciBERT in the COVID domain. Domain-adaptive pretraining (DAPT) helps improve SciBERT, which illustrates that learning the semantics of new terminologies is crucial for language models. ContrastQG and ReInfoSelect further improve the system's performance by about 6.5% NDCG@10: ContrastQG generates many pseudo relevance labels, providing more training guidance for neural rankers in the specific domain, while ReInfoSelect further boosts the model with more fine-grained selected supervision. The most significant improvement comes from the fusion of dense retrieval, which increases the P@5 score by 11.8%. This result shows that dense retrieval can significantly improve retrieval effectiveness by alleviating sparse retrieval's vocabulary mismatch problem.

5.2. Hyperparameter Study

Among all hyperparameters, we found that the reranking depth significantly impacts the neural ranking model's effectiveness. As shown in Table 2, SciBERT's performance is significantly limited at a shallow reranking depth (20), mainly because of the low ranking accuracy of BM25. As the reranking depth increases to 50 and 100, the neural ranker shows stable performance and achieves its best results. Nevertheless, reranking accuracy begins to drop as the depth increases further. A possible reason is that the neural ranker is not strong enough to distinguish truly relevant documents when more noisy documents are included.

5.3. Query Analysis

Figure 2 shows the testing results for each query. The first 30 queries were judged in Round 1, and the others (queries 31-35) are newly added in Round 2. Our system outperforms the baselines on most queries with previous annotations. Besides, our system is also comparable to the T5 Fusion system on the new queries and avoids the sharp drops of the SciBERT Fusion system (e.g., on the 34th query), which shows our system's robustness.

6. Failed Attempts

This section discusses some of our failed attempts and experience.

Manual Labeling. A straightforward approach to mitigating label scarcity is to manually annotate more in-domain data. We recruited three medical students who compiled 50 COVID-related queries and assigned relevance labels to the top 20 documents retrieved by BM25 for each query. However, our annotations did not reach good agreement with TREC-COVID's annotations.

Corpus Filtering. MacAvaney et al. (2020) proposed to narrow the retrieval scale by filtering out documents published before 2020. Nevertheless, our analysis found that this method excluded more than 80% of the documents in the Round 2 corpus, dropping a large amount of useful COVID-related literature, such as studies of SARS and MERS. Thus, we did not adopt this method in our system.

Neural Reranker. We also attempted two other neural ranking models besides SciBERT for document reranking, including BERT (Devlin et al., 2019) and Conv-KNRM (Dai et al., 2018). Our experimental results show that BERT-Large has no obvious advantage over SciBERT-Base and Conv-KNRM performs the worst. The main reason for the poor performance of Conv-KNRM is that we did not use its subword version (Hofstätter et al., 2019), which led to a severe out-of-vocabulary problem.

Fusion Attempts. We tried two fusion methods to integrate dense retrieval into our system. One combines dense retrieval with BM25 in the base retrieval stage; the other fuses dense retrieval directly into SciBERT's reranking process. The second method worked better in our limited attempts.
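One generic way to fuse two ranked lists (e.g., a dense-retrieval list and a reranked list) is reciprocal rank fusion. This is a common technique sketched for illustration, not necessarily the fusion rule used in our submitted runs:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: each document scores 1/(k + rank) in
    every list it appears in; documents are returned by total score.
    k=60 is the conventional smoothing constant."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

Rank-based fusion like this needs no score calibration between the two systems, which is convenient when sparse and dense scores live on different scales.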

7. Concerns on Residual Evaluation

This section discusses our observations about the residual collection evaluation used in the TREC-COVID task. In residual collection evaluation, test queries can be divided into old queries and new queries. The old queries were annotated in previous rounds, but their annotated documents are removed from the collection before scoring. TREC-COVID allows IR systems to use the old queries' relevance judgments and classifies such systems as feedback runs.

Figure 3 shows the evaluation results of the top 10 feedback systems in Round 2 of TREC-COVID. Although these systems performed closely in overall scores, they show significant differences between the old and new queries. For example, the 2nd system performs considerably better on the new queries than on the old ones. In contrast, some systems' ranking accuracy on the new queries is considerably lower than on the old queries, and even worse than the base-retrieval BM25 Fusion system, e.g., the 3rd-5th and 9th systems.

A powerful search system should achieve balanced performance on known and unknown queries. However, this result shows that residual collection evaluation may bias systems toward previously seen queries, which are much easier than the unseen queries that dominate real production scenarios.


We thank Luyu Gao for sharing the implementation of dense retrieval, the track organizers for hosting this track, Sean MacAvaney for releasing the medical MS MARCO filter, and Jimmy Lin and the Anserini project for open-sourcing the well-rounded BM25 first-stage retrieval.


  • P. Bajaj, D. Campos, N. Craswell, L. Deng, J. Gao, X. Liu, R. Majumder, A. McNamara, B. Mitra, T. Nguyen, et al. (2016) MS MARCO: a human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268. Cited by: §2.
  • I. Beltagy, K. Lo, and A. Cohan (2019) SciBERT: a pretrained language model for scientific text. In Proceedings of EMNLP-IJCNLP, pp. 3606–3611. Cited by: §1, §3, §4.
  • W. Chang, F. X. Yu, Y. Chang, Y. Yang, and S. Kumar (2020) Pre-training tasks for embedding-based large-scale retrieval. arXiv preprint arXiv:2002.03932. Cited by: §1.
  • N. Craswell, B. Mitra, E. Yilmaz, D. Campos, and E. M. Voorhees (2020) Overview of the TREC 2019 deep learning track. arXiv preprint arXiv:2003.07820. Cited by: §1.
  • W. B. Croft, D. Metzler, and T. Strohman (2010) Search engines: information retrieval in practice. Vol. 520, Addison-Wesley Reading. Cited by: §1.
  • Z. Dai, C. Xiong, J. Callan, and Z. Liu (2018) Convolutional neural networks for soft-matching n-grams in ad-hoc search. In Proceedings of WSDM, pp. 126–134. Cited by: §1, §6.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL, pp. 4171–4186. Cited by: §6.
  • G. W. Furnas, T. K. Landauer, L. M. Gomez, and S. T. Dumais (1987) The vocabulary problem in human-system communication. Communications of the ACM 30 (11), pp. 964–971. Cited by: §1.
  • L. Gao, Z. Dai, Z. Fan, and J. Callan (2020) Complementing lexical retrieval with semantic residual embedding. arXiv preprint arXiv:2004.13969. Cited by: §1, §4.
  • S. Gururangan, A. Marasović, S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, and N. A. Smith (2020) Don’t stop pretraining: adapt language models to domains and tasks. arXiv preprint arXiv:2004.10964. Cited by: §1, §3.1.
  • S. Hofstätter, N. Rekabsaz, C. Eickhoff, and A. Hanbury (2019) On the effect of low-frequency terms on neural-ir models. In Proceedings of SIGIR, pp. 1137–1140. Cited by: §6.
  • V. Karpukhin, B. Oğuz, S. Min, L. Wu, S. Edunov, D. Chen, and W. Yih (2020) Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906. Cited by: §1, §3.3.
  • J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, and J. Kang (2020) BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36 (4), pp. 1234–1240. Cited by: §1.
  • Y. Luan, J. Eisenstein, K. Toutanova, and M. Collins (2020) Sparse, dense, and attentional representations for text retrieval. arXiv preprint arXiv:2005.00181. Cited by: §1.
  • J. Ma, I. Korotkov, Y. Yang, K. Hall, and R. McDonald (2020) Zero-shot neural retrieval via domain-targeted synthetic query generation. arXiv preprint arXiv:2004.14503. Cited by: §3.2, §3.2.
  • S. MacAvaney, A. Cohan, and N. Goharian (2020) SLEDGE: a simple yet effective baseline for coronavirus scientific knowledge search. arXiv preprint arXiv:2005.02365. Cited by: §2, §4, §4, §6.
  • S. MacAvaney, A. Yates, A. Cohan, and N. Goharian (2019) CEDR: contextualized embeddings for document ranking. In Proceedings of SIGIR, pp. 1101–1104. Cited by: §1.
  • R. Nogueira and K. Cho (2019) Passage re-ranking with bert. arXiv preprint arXiv:1901.04085. Cited by: §1.
  • L. Pang, Y. Lan, J. Guo, J. Xu, J. Xu, and X. Cheng (2017) Deeprank: a new deep architecture for relevance ranking in information retrieval. In Proceedings of CIKM, pp. 257–266. Cited by: §1.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. Cited by: §3.2.
  • G. Salton and C. Buckley (1990) Improving retrieval performance by relevance feedback. Journal of the American society for information science 41 (4), pp. 288–297. Cited by: §1.
  • R. Sennrich, B. Haddow, and A. Birch (2015) Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909. Cited by: §2.
  • E. Voorhees, T. Alam, S. Bedrick, D. Demner-Fushman, W. R. Hersh, K. Lo, K. Roberts, I. Soboroff, and L. L. Wang (2020) TREC-covid: constructing a pandemic information retrieval test collection. arXiv preprint arXiv:2005.04474. Cited by: §1, §1.
  • L. L. Wang, K. Lo, Y. Chandrasekhar, R. Reas, J. Yang, D. Eide, K. Funk, R. Kinney, Z. Liu, W. Merrill, et al. (2020) CORD-19: the covid-19 open research dataset. arXiv preprint arXiv:2004.10706. Cited by: §1, §3.1, §4.
  • L. Xiong, C. Xiong, Y. Li, K. Tang, J. Liu, P. Bennett, J. Ahmed, and A. Overwijk (2020) Approximate nearest neighbor negative contrastive learning for dense text retrieval. arXiv preprint arXiv:2007.00808. Cited by: §1, §1, §3.3.
  • P. Yang, H. Fang, and J. Lin (2017) Anserini: enabling the use of lucene for information retrieval research. In Proceedings of SIGIR, pp. 1253–1256. Cited by: §4.
  • W. Yang, H. Zhang, and J. Lin (2019) Simple applications of bert for ad hoc document retrieval. arXiv preprint arXiv:1903.10972. Cited by: §1.
  • H. Zhang, X. Song, C. Xiong, C. Rosset, P. N. Bennett, N. Craswell, and S. Tiwary (2019) Generic intent representation in web search. In Proceedings of SIGIR, pp. 65–74. Cited by: §1.
  • K. Zhang, C. Xiong, Z. Liu, and Z. Liu (2020) Selective weak supervision for neural information retrieval. In Proceedings of WWW, pp. 474–485. Cited by: §1, §3.2, §3.2.