
IR-BERT: Leveraging BERT for Semantic Search in Background Linking for News Articles

This work describes our two approaches for the background linking task of the TREC 2020 News track. The main objective of this task is to recommend a list of relevant articles that the reader should refer to in order to understand the context and gain background information about the query article. Our first approach focuses on building an effective search query by combining weighted keywords extracted from the query document and uses BM25 for retrieval. The second approach leverages the capability of SBERT (Reimers and Gurevych, 2019) to learn contextual representations of the query in order to perform semantic search over the corpus. We empirically show that employing a language model benefits our approach in understanding the context as well as the background of the query article. The proposed approaches are evaluated on the TREC 2018 Washington Post dataset, and our best model outperforms the TREC median as well as the highest scoring model of 2018 in terms of the nDCG@5 metric. We further propose a diversity measure to evaluate the effectiveness of the various approaches in retrieving a diverse set of documents, which could motivate researchers to introduce diversity into their recommended lists. We have open-sourced our implementation on GitHub and plan to submit our runs for the background linking task in TREC 2020.



1. Introduction

In the last few years, online news services have become key sources of information and have affected the way we receive and share news. While drafting a news article, it is often assumed that the reader has background knowledge about the article's story. This creates the need to link other articles that set the background context for the article in focus. Such articles are typically by the same author, provide extra relevant information, or introduce the key ideas needed to understand the topic better. However, defining what constitutes "background context" and retrieving such documents is not straightforward.

Motivated by this need and with the goal of fulfilling the search requirement of news readers, the Background Linking task was introduced in the NEWS track of TREC 2018. The main objective of this task is to recommend/retrieve a list of other articles that the reader should refer to in order to understand the context and gain background information of the query article.

In this paper, we discuss our proposed methods to tackle the problem of background linking. The paper is structured as follows: in section 2, we provide an overview of earlier submissions and some other noteworthy methods that motivate our idea. In section 3, we describe our retrieval strategies in detail, followed by sections 4 and 5, where we report the retrieval performances and show their effectiveness compared to earlier submissions. Finally, we summarize and conclude our work in section 6.

2. Related Work

BM25 (Robertson et al., 2009) is one of the most popular ranking functions used by search engines to estimate the relevance of documents to a given search query. It is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document, regardless of their proximity within the document. A number of previous approaches to background linking are built on BM25. Anserini (Yang et al., 2017, 2018) is an open-source information retrieval toolkit built on Lucene. (Yang and Lin, 2018) uses Anserini to solve the background linking problem by exploring different approaches to constructing a keyword query from a query article and using the BM25 ranking function to retrieve the documents relevant to the constructed query. ICTNET (Ding et al., 2019) and htwsaar1 (Bimantara et al., 2018) follow a tf-idf based approach similar to Anserini's to form their queries, while bigIR (Essam and Elsayed) bases the construction of the search query mainly on a graph-based analysis of the query article's text. Other approaches using BM25 include UDInfolab_ent (Lua and Fanga) and DMINR (Missaoui et al.), which focus on leveraging Named Entities (NEs) in the query article to form a query before using BM25 to identify the background articles. The best performing model for TREC 2018 is umass_cbrdm, an RM model (Lavrenko and Croft, 2017) with BM25 scoring functions.
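As a concrete illustration, a minimal BM25 scorer over tokenized documents might look as follows (the k1 and b values are common defaults, not parameters reported in this work):

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a query with BM25."""
    N = len(corpus)                                   # number of documents
    avgdl = sum(len(d) for d in corpus) / N           # average document length
    score = 0.0
    for term in set(query_terms):
        df = sum(1 for d in corpus if term in d)      # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        tf = doc_terms.count(term)                    # term frequency in doc
        denom = tf + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * tf * (k1 + 1) / denom
    return score
```

Documents containing more of the query terms (and rarer terms) score higher; terms absent from a document contribute nothing to its score.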

Figure 1. IR-BERT pipeline

There also exist methods that exploit language models for the task of ad-hoc retrieval (Yang and Lin, 2018; Nogueira et al., 2019). BERT (Devlin et al., 2018) is a state-of-the-art language model that has achieved groundbreaking results in many NLG and NLI tasks. BERT has been further fine-tuned for downstream tasks and used in many applications. One relevant application is finding the semantic similarity between text documents, since both the query and the results to be retrieved are textual in the background linking problem. Sentence-BERT (Reimers and Gurevych, 2019) is one such model, which enables us to derive semantically meaningful sentence embeddings by modifying the original BERT using Siamese networks. With Sentence-BERT we can take advantage of BERT embeddings for tasks like semantic similarity comparison and information retrieval via semantic search.

3. Methodology

The background linking task can be formulated as follows: given a query news story and a collection of news articles, retrieve other news articles from the collection that provide important context or background information about the query story.

It is reasonable to consider the background linking task as a specific case of the news recommendation task, aimed at retrieving relevant articles from a corpus for queries generated from the query article. For this task, in order to retrieve articles that can provide contextual information, we filter out forward links from our results, i.e., articles published after the query article are not considered.

There are two main components to our solution for the background linking task: a) constructing a search query from the query document, and b) performing a search against the corpus to retrieve articles providing background information on the query article. In sections 3.1 and 3.2, we describe two approaches that focus on these two steps separately and attempt to show improvements on the background linking task.

3.1. Approach 1 (A1): Weighted Search Query + BM25

Our first approach is based on BM25, similar to (Yang and Lin, 2018), where we focus on building an effective search query that best captures the relevant topics of the query article. The problem is formulated as extracting the essential keywords from the query article, assigning them weights according to their relevance, and concatenating them to form a query. This query is then used to search the corpus using BM25 through which a final ranked list is generated.

Query Construction

To find the keywords from the query document, we sort all of its words in decreasing order of their tf-idf scores. To assign different relevance levels to the keywords, we define a weight for each keyword as follows:


where n denotes the number of keywords, and tf and idf are the two statistics, term frequency and inverse document frequency, respectively. To apply the weight for each keyword, we round its value to the nearest integer and repeat the i-th keyword that number of times in the query. This weighted query is fed to BM25 to retrieve the top-ranked articles. We experiment with different lower and upper bounds on the number of repetitions, as discussed in section 4.4, and finally set the lower bound to 1 and the upper bound to 5.
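A sketch of this query construction step. The paper's exact weighting formula may differ; here, as an illustrative assumption, raw tf-idf scores are rounded to the nearest integer and clamped to the [1, 5] repetition bounds:

```python
import math
from collections import Counter

def build_weighted_query(doc_tokens, corpus, num_keywords=10, lo=1, hi=5):
    """Pick the top tf-idf keywords of a document and repeat each one
    according to its rounded, clamped weight to form a flat query.

    The weight here is the raw tf-idf score rounded and clamped to
    [lo, hi]; this is an assumption standing in for the paper's
    weighting equation.
    """
    N = len(corpus)
    tf = Counter(doc_tokens)
    scores = {}
    for term, f in tf.items():
        df = sum(1 for d in corpus if term in d)   # document frequency
        idf = math.log(N / (1 + df)) + 1           # smoothed idf
        scores[term] = f * idf
    top = sorted(scores, key=scores.get, reverse=True)[:num_keywords]
    query = []
    for term in top:
        weight = min(hi, max(lo, round(scores[term])))
        query.extend([term] * weight)              # repeat keyword w times
    return query
```

The resulting token list can then be joined into a string and handed to BM25, so that more important keywords contribute proportionally more to the retrieval score.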

We also assign different weights to the contributions of the title and body of the article in the query. Our experiments with different values for these weights are discussed in section 4.4. We achieve the best scores by setting the weight of the title to 0.7 and that of the body to 0.3.

3.2. Approach 2 (A2): IR-BERT

Approach 1 is entirely based on the term frequencies of words appearing in the query article, where BM25 then simply performs search and retrieval over the indexed corpus. In order to understand the context of the query article, it is important to take the semantics of words into consideration, because the background articles may not necessarily contain the keywords of the search query constructed from the query article. For example, a query article titled In Russia, political engagement is blossoming online is likely to have Russia and online in the constructed query. In order to find the background articles, the retrieval model must first understand that Russia is a country and that online refers to internet-based social media platforms like Facebook and Twitter.

To this end, we propose IR-BERT, which combines the retrieval power of BM25 with the contextual understanding gained through a BERT based model. Similar to Approach 1, using the weighted query we first retrieve a set of candidate documents with BM25. The goal is then to carry out a semantic search of the query article over this candidate set to arrive at the final, smaller set of background documents. The overall pipeline of IR-BERT is shown in Figure 1.


Before carrying out the semantic search over the set of retrieved documents, it is important to feed only those words to Sentence-BERT whose semantic meaning could benefit us. Thus, every retrieved document is passed through the Rapid Automatic Keyword Extraction (RAKE) algorithm (Rose et al., 2010). RAKE takes a list of stopwords and a text as inputs and extracts keywords from the text in a single pass. The reason we chose RAKE is that it is completely domain independent. It is based on the idea that co-occurrences of words are meaningful in determining whether they are keywords. The relations between words are hence measured in a manner that automatically adapts to the style and content of the text, allowing RAKE to score candidate keywords with an adaptive measure of word co-occurrence.
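A simplified, stopword-driven sketch of RAKE-style extraction (illustrative only; the original algorithm also splits candidates at punctuation and handles phrase-length limits, which this sketch glosses over):

```python
import re

def rake_keywords(text, stopwords, top_n=5):
    """Single-pass RAKE-style keyword extraction (simplified sketch).

    Candidate phrases are maximal runs of non-stopwords; each word is
    scored by degree/frequency computed from co-occurrence within
    phrases, and a phrase scores the sum of its word scores.
    """
    words = re.findall(r"[a-z']+", text.lower())
    phrases, current = [], []
    for w in words:
        if w in stopwords:                    # stopwords delimit phrases
            if current:
                phrases.append(current)
                current = []
        else:
            current.append(w)
    if current:
        phrases.append(current)
    freq, degree = {}, {}
    for phrase in phrases:
        for w in phrase:
            freq[w] = freq.get(w, 0) + 1
            degree[w] = degree.get(w, 0) + len(phrase)  # co-occurrence degree
    word_score = {w: degree[w] / freq[w] for w in freq}
    scored = [(" ".join(p), sum(word_score[w] for w in p)) for p in phrases]
    scored.sort(key=lambda x: -x[1])
    return [phrase for phrase, _ in scored[:top_n]]
```

Because longer phrases accumulate the degree scores of all their words, multi-word candidates tend to outrank isolated terms, which matches RAKE's preference for content-bearing phrases.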

Sentence-BERT (SBERT):

Sentence-BERT (Reimers and Gurevych, 2019) adds a pooling operation on top of the last layer of BERT and is fine-tuned to derive fixed-size sentence embeddings. (Reimers and Gurevych, 2019) further uses Siamese and triplet networks (Schroff et al., 2015) to update the weights of this model. The authors incorporate the Siamese networks in order to obtain sentence embeddings that are semantically meaningful and can be compared with cosine similarity. Figure 2 describes the Sentence-BERT architecture.

Figure 2. Sentence-BERT (Reimers and Gurevych, 2019) architecture at Inference (to calculate the similarity scores)

The retrieved documents output by RAKE contain only keywords, and the set of keywords for each document can be treated as a sentence. Thus SBERT is used to obtain embeddings for both the query and each candidate document. The documents are then sorted according to their cosine similarity with the query (equation 3), and we further normalize this similarity measure using equation 4. Generating the final list of documents via SBERT embeddings consists of the steps listed in Algorithm 1.

1:  p ← number of documents retrieved by BM25
2:  k ← required number of final documents
3:  Eq ← SBERT(query article)
4:  for i = 1, …, p do
5:      Ki ← RAKE(document i)
6:      Ei ← SBERT(Ki)
7:      si ← Score(Eq, Ei)
8:  end for
9:  S ← list of scores si sorted in decreasing order
10: return the top k documents with corresponding indices in S
Algorithm 1 ComputeRelevance(query article, retrieved documents)
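A minimal sketch of the ranking step of Algorithm 1, with plain cosine similarity standing in for the normalized Score function and precomputed vectors standing in for the SBERT embeddings (the real pipeline obtains these from Sentence-BERT):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def compute_relevance(query_emb, doc_embs, k):
    """Rank candidate documents by cosine similarity to the query
    embedding and return the indices of the top-k documents.

    query_emb and doc_embs stand in for SBERT embeddings; any encoder
    that maps text to fixed-size vectors fits here.
    """
    scores = [(i, cosine(query_emb, emb)) for i, emb in enumerate(doc_embs)]
    scores.sort(key=lambda pair: -pair[1])          # decreasing similarity
    return [i for i, _ in scores[:k]]
```

The returned indices identify the final background-article list; in IR-BERT they point back into the BM25 candidate set.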

4. Experimental Setup

Figure 3. Data Preprocessing Steps

4.1. Dataset Preprocessing

We used the Washington Post Corpus released by TREC for the 2018 News track for our experiments and preprocessed it using the steps shown in Figure 3. The collection contains 608,180 news articles and blog posts from January 2012 through August 2017. The articles are in JSON format and include fields for title, date of publication, kicker, article text, and links to embedded images and multimedia. All of our methods rely on Elasticsearch as the indexing platform. During indexing, we extracted the information from the various fields and indexed them as separate Elasticsearch fields. We also created a new field to store the body of the article. For this, we first extracted the HTML text content from the fields marked with type 'sanitized_html' and subtype 'paragraph', and then concatenated them after using regular expressions to extract the raw text from the HTML. Next, we performed lower-casing, stop-word removal, and stemming on the raw text. The final preprocessed text was then indexed as a separate text field in Elasticsearch, representing the article body.
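The body-extraction steps above can be sketched as follows. The sample record and stopword list are illustrative stand-ins for the real corpus format, and stemming is omitted for brevity:

```python
import re

# Toy stand-in for one Washington Post article record; the real corpus
# stores body text in "contents" entries with type "sanitized_html"
# and subtype "paragraph".
ARTICLE = {
    "title": "In Russia, political engagement is blossoming online",
    "contents": [
        {"type": "kicker", "content": "Politics"},
        {"type": "sanitized_html", "subtype": "paragraph",
         "content": "<p>Political engagement is <b>blossoming</b> online.</p>"},
        {"type": "sanitized_html", "subtype": "paragraph",
         "content": "<p>Social platforms play a growing role.</p>"},
    ],
}

STOPWORDS = {"is", "a", "the", "in", "online"}  # illustrative list

def extract_body(article, stopwords=STOPWORDS):
    """Concatenate paragraph HTML, strip tags with a regex, then
    lower-case and drop stopwords (stemming omitted for brevity)."""
    paragraphs = [c["content"] for c in article["contents"]
                  if c.get("type") == "sanitized_html"
                  and c.get("subtype") == "paragraph"]
    raw = re.sub(r"<[^>]+>", " ", " ".join(paragraphs))  # strip HTML tags
    tokens = [t for t in re.findall(r"[a-z]+", raw.lower())
              if t not in stopwords]
    return " ".join(tokens)
```

The cleaned string would then be indexed as the dedicated "body" field described above.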

While indexing, we filtered out articles from the "Opinion", "Letters to the Editor", and "The Post's View" sections, as labeled in the "kicker" field, since they are stated as not relevant in the TREC guidelines. We used the default scoring method in Elasticsearch, BM25, as the retrieval model. Elasticsearch boosting queries were used to assign weights to the title and body in the search queries.
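The title/body weighting can be expressed with an Elasticsearch boosting query. The sketch below builds such a request body under the assumption that the index exposes `title` and `body` fields as described above; the field names, boost split (the 0.7/0.3 values from section 3.1), and surrounding client call are illustrative:

```python
def build_es_query(query_text, title_boost=0.7, body_boost=0.3, size=100):
    """Build an Elasticsearch request body that weights title matches
    more heavily than body matches.

    A sketch of the request body only; in practice it would be passed
    to a client call such as es.search(index="wapo", body=query).
    """
    return {
        "size": size,
        "query": {
            "bool": {
                "should": [
                    {"match": {"title": {"query": query_text,
                                         "boost": title_boost}}},
                    {"match": {"body": {"query": query_text,
                                        "boost": body_boost}}},
                ]
            }
        },
    }
```

Because both clauses sit in a `should` list, documents matching either field are retrieved, with title matches contributing more to the BM25 score.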

4.2. Evaluation Metrics

The primary metric used by TREC for the background linking task is nDCG@5, where the gain value is a function of the relevance level r, which ranges from 0 (the linked document provides little or no useful background information) to 4 (the document MUST appear in the sidebar, otherwise critical context is missing). The zero relevance level contributes no gain. Apart from this, we also report nDCG@10, Precision@5 and Precision@10 for some of our experiments.
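As an illustration, nDCG@k can be computed as below. The exponential gain 2^r − 1 used here is an assumption for the sketch; the exact gain function for the task is defined in the TREC track guidelines:

```python
import math

def ndcg_at_k(relevances, k, ideal_relevances=None):
    """nDCG@k over a ranked list of graded relevances (0-4).

    Uses an exponential gain 2^r - 1 as an illustrative assumption.
    ideal_relevances holds all judged relevances for the topic; if
    omitted, the retrieved list itself is used to build the ideal
    ranking.
    """
    def dcg(rels):
        return sum((2 ** r - 1) / math.log2(rank + 2)
                   for rank, r in enumerate(rels[:k]))
    ideal = sorted(ideal_relevances or relevances, reverse=True)
    idcg = dcg(ideal)
    return dcg(relevances) / idcg if idcg else 0.0
```

A perfectly ordered list scores 1.0; pushing highly relevant documents down the ranking lowers the score because of the logarithmic position discount.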

4.3. Proposed Diversity Measure

As per the TREC guidelines, one of the criteria for ranking in the background linking task is to retrieve a list of articles that is diverse. The idea of diversity may seem subjective, but it is possible to formulate a diversity measure. We define diversity by equation 5.


where the outer sum runs over the set of all queries/topics in TREC 2018. For every retrieved document list, we calculate the sum of the distances between all possible pairs of documents in the list; this is then summed over all queries/topics.
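A sketch of one plausible reading of this measure, under the assumption that documents are represented as term vectors (the paper's measure is based on tf-idf vectors), distance is cosine distance, and the pairwise sums are averaged so that values fall in [0, 1], matching the scale reported in Table 3:

```python
import math
from collections import Counter
from itertools import combinations

def cosine_distance(u, v):
    """1 - cosine similarity between two sparse term-count vectors."""
    dot = sum(u[t] * v.get(t, 0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return 1.0 - (dot / (nu * nv) if nu and nv else 0.0)

def diversity(retrieved_lists):
    """Average pairwise distance between retrieved documents, taken
    over all topics. Documents are bags of terms here; the paper uses
    tf-idf vectors instead of raw counts.
    """
    total, pairs = 0.0, 0
    for docs in retrieved_lists:                  # one list per topic
        vectors = [Counter(d) for d in docs]
        for a, b in combinations(vectors, 2):     # all document pairs
            total += cosine_distance(a, b)
            pairs += 1
    return total / pairs if pairs else 0.0
```

Identical retrieved documents yield a diversity of 0, while mutually disjoint documents yield 1, so higher values indicate a more varied result list.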

4.4. Parameter Tuning

For both of our proposed approaches, we tuned a number of parameters using the TREC 2018 topics and relevance judgements. In this section, we list the different settings under which we tested the two approaches. The P@5 and nDCG@5 results for our tuning experiments are shown in the Appendix (Section 7).

Parameters for Approach 1 (A1):

  • Number of words in the query constructed from given news article.

  • Relative weights of title and body in the query.

  • Maximum and minimum number of repetitions of extracted keywords.

Parameters for Approach 2 (A2):

  • Number of filtered results from BM25.

  • Minimum number of keywords generated from RAKE.

5. Results

In our first set of experiments, we compare the best performing models from our two approaches, A1 and A2, with several previous methods. The first entry in Table 1 represents the TREC 2018 hypothetical run that achieved median effectiveness over all queries. The second group of entries represents some of the official runs in the TREC 2018 News track. The second row corresponds to the run where the entire query document (without assigned weights) is directly fed to BM25 for retrieval. anserini_1000w (Yang et al., 2017) is the best performing run submitted by the researchers at the University of Waterloo, which also uses BM25 as the ranking function. umass_cbrdm (Lavrenko and Croft, 2017) represents the best performing model for the News track in TREC 2018. While A1.1 uses only the weighted body and title while constructing a query for BM25, A1.2 also uses weights for all the words present in the query document. A2.1 and A2.2 are models based on Approach 2 and use RoBERTa (Liu et al., 2019) and BERT sentence-level embeddings, respectively.

Methods nDCG@5
TREC 2018 Median 0.3448
BM25 0.3251
anserini_1000w 0.3529
umass_cbrdm 0.4173
(A1.1) wBT+BM25 0.4088
(A1.2) wQ+BM25 0.3942
(A2.1) IR-RoBERTa 0.394
(A2.2) IR-BERT 0.4199
Table 1. Comparison of nDCG@5 between proposed methods and previous submissions

A1.1, A1.2 and A2.1 significantly outperform anserini_1000w and BM25, showing that our approach helps construct effective queries which represent the important topics in the query document. Furthermore, our best performing model A2.2 (IR-BERT) outperforms umass_cbrdm, setting the new benchmark on 2018 background linking qrels. Also, all our runs outperform the TREC 2018 median for this task.

Methods nDCG@5 nDCG@10 P@5 P@10
BM25 0.3251 0.3359 0.532 0.446
(A1.1) wBT+BM25 0.4088 0.4155 0.644 0.53
(A1.2) wQ+BM25 0.3942 0.4315 0.632 0.578
(A2.1) IR-RoBERTa 0.394 0.3918 0.628 0.514
(A2.2) IR-BERT 0.4199 0.4104 0.628 0.504
Table 2. Comparison of nDCG and precision values between proposed methods and BM25

In our next set of experiments, we list the nDCG@5, nDCG@10, Precision@5 and Precision@10 scores achieved by our approaches on the 2018 Washington Post dataset in Table 2. A1.1 (wBT+BM25) achieves the best P@5 score, while A1.2 (wQ+BM25) gives the best performance on nDCG@10 and P@10. It is interesting to note that using RoBERTa for finding the semantic similarity between the query and documents hurts performance relative to A1.1 and A1.2.

Methods BM25 wBT+BM25 wQ+BM25 IR-RoBERTa IR-BERT
Diversity 0.907 0.9067 0.912 0.921 0.9084
Table 3. Comparison of diversity of retrieved documents from proposed methods and BM25

In our last set of experiments, we evaluate the diversity of all our models along with that of BM25 (Table 3). It can be observed that A2.1 (IR-RoBERTa), which performs relatively poorly on the traditional measures, retrieves the most diverse list of background articles for a given query.

6. Conclusion

In this paper, we described two methods to solve the background linking task in the context of participating in the TREC 2020 News track. While our first approach attempts to extract representative keywords from the query article and use them to retrieve the article's background links, the second approach leverages the contextual understanding ability of BERT to perform semantic search for background articles over the corpus. Our best model, IR-BERT, achieved an nDCG@5 score of 0.4199, beating the TREC median as well as the best performing model of 2018 on the TREC Washington Post 2018 dataset. We further measured the diversity of the retrieved background articles for our approaches using a diversity measure based on tf-idf.


  • A. Bimantara, M. Blau, K. Engelhardt, J. Gerwert, T. Gottschalk, P. Lukosz, S. Piri, N. S. Shaft, and K. Berberich (2018) Htw saar@ trec 2018 news track.. In TREC, Cited by: §2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §2.
  • Y. Ding, X. Lian, H. Zhou, Z. Liu, H. Ding, and Z. Hou (2019) ICTNET at TREC 2019 news track. In Proceedings of the Twenty-Eighth Text REtrieval Conference, TREC 2019, Gaithersburg, Maryland, USA, November 13-15, 2019, E. M. Voorhees and A. Ellis (Eds.), NIST Special Publication, Vol. 1250. External Links: Link Cited by: §2.
  • M. Essam and T. Elsayed BigIR at TREC 2019: graph-based analysis for news background linking. Cited by: §2.
  • V. Lavrenko and W. B. Croft (2017) Relevance-based language models. In ACM SIGIR Forum, Vol. 51, pp. 260–267. Cited by: §2, §5.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §5.
  • K. Lua and H. Fanga Leveraging entities in background document retrieval for news articles. Cited by: §2.
  • S. Missaoui, A. MacFarlane, S. Makri, and M. Gutierrez-Lopez DMINR at TREC news track. Cited by: §2.
  • R. Nogueira, W. Yang, K. Cho, and J. Lin (2019) Multi-stage document ranking with bert. arXiv preprint arXiv:1910.14424. Cited by: §2.
  • N. Reimers and I. Gurevych (2019) Sentence-bert: sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084. Cited by: IR-BERT: Leveraging BERT for Semantic Search in Background Linking for News Articles, §2, Figure 2, §3.2.
  • S. Robertson, H. Zaragoza, et al. (2009) The probabilistic relevance framework: bm25 and beyond. Foundations and Trends® in Information Retrieval 3 (4), pp. 333–389. Cited by: IR-BERT: Leveraging BERT for Semantic Search in Background Linking for News Articles, §2.
  • S. Rose, D. Engel, N. Cramer, and W. Cowley (2010) Automatic keyword extraction from individual documents. Text mining: applications and theory 1, pp. 1–20. Cited by: §3.2.
  • F. Schroff, D. Kalenichenko, and J. Philbin (2015) FaceNet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823. Cited by: §3.2.
  • P. Yang, H. Fang, and J. Lin (2017) Anserini: enabling the use of lucene for information retrieval research. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1253–1256. Cited by: §2, §5.
  • P. Yang, H. Fang, and J. Lin (2018) Anserini: reproducible ranking baselines using lucene. Journal of Data and Information Quality (JDIQ) 10 (4), pp. 1–20. Cited by: §2.
  • P. Yang and J. Lin (2018) Anserini at trec 2018: centre, common core, and news tracks. In Proceedings of the 27th Text REtrieval Conference (TREC 2018), Cited by: §2, §2, §3.1.

7. Appendix

Figure 4. Fine tuning number of words in a query for BM25
Figure 5. Fine tuning the maximum weight (repetitions) for a term in the query in Approach 1
Figure 6. Results with different BERT models in Approach 2